arXivDaily: daily arXiv academic digest, updated Monday through Friday
2606.05162 2026-06-04 cs.CV Version update

Controllable Dynamic 3D Shape Generation via 3D Trajectories and Text

Jaeyeong Kim, Ines Kim, Jahyeok Koo, Seungryong Kim

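Affiliations: KAIST AI Project

AI summary: Proposes T2Mo, a feed-forward framework that generates controllable dynamic 3D shapes conditioned on 3D trajectories and text. A shape-grounded trajectory embedding handles trajectory inputs of arbitrary configuration, so the generated motion follows the trajectories spatially while staying globally consistent with the text semantics.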

Comments Project page: https://cvlab-kaist.github.io/T2Mo/

English abstract

We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Due to the inherent ambiguity of language, generating precisely intended motions using text alone remains challenging. To address this, we adopt 3D trajectories as controllable spatial guidance, specifying the exact paths along which selected points should move. By combining both, T2Mo generates object motions that spatially adhere to the given trajectories while globally reflecting the text semantics. To robustly handle trajectory inputs with arbitrary configurations, ranging from dense to sparse and unevenly distributed, we further propose a shape-grounded trajectory embedding that maps an input trajectory set into a shape-aware token set covering the entire object. We conduct extensive comparisons against text-based baselines and cascaded video-based baselines that combine trajectory-guided video generation with video-to-dynamic mesh generation. Quantitative and qualitative evaluations, along with user studies, demonstrate that our approach produces motions that more faithfully follow the given prompts with higher expressiveness while preserving motion quality.

2606.05149 2026-06-04 cs.CV cs.LG eess.IV Version update

An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers

Gandhimathi Padmanaban, Fred Feng

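Affiliations: Department of Electrical and Computer Engineering, University of California, Los Angeles, CA, USA

AI summary: Proposes a two-stage pipeline that combines an RT-DETR detector with a fine-tuned ViT-Base/16 for six-category vehicle body-type classification, plus a confidence-based abstention mechanism. It reaches 0.94 accuracy on the in-distribution dataset and 0.89 on the out-of-distribution dataset.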

Comments 24 pages, 10 figures, venue TBD

English abstract

Vehicle body type is a significant determinant of cyclist injury severity in overtaking crashes, yet automated tools for classifying vehicles into injury-risk-relevant categories from naturalistic roadway video do not exist in the open literature. Standard object detection benchmarks provide only coarse vehicle labels (car, truck, bus, motorcycle), while existing fine-grained recognition systems are trained on controlled imagery and lack evaluation for deployment robustness across recording sites. This paper presents an open-source two-stage computer vision pipeline combining a pre-trained RT-DETR detector for coarse vehicle localization with a fine-tuned Vision Transformer (ViT-Base/16) for six-category body-type classification: passenger car, SUV, pickup truck, minivan, large van, and commercial truck. A confidence-based abstention mechanism withholds Stage 2 predictions when softmax output falls below 0.60, producing unknown labels rather than silent misclassifications. Evaluated on 3,805 annotated overtaking events from a bicycle-lane corridor in Ann Arbor, Michigan (in-distribution), the pipeline achieved 0.94 accuracy with per-class F1 scores from 0.91 (minivan) to 0.97 (SUV). On an independent out-of-distribution evaluation of 311 events from an open cycling dataset without retraining, accuracy was 0.89. Three of four well-represented categories maintained F1 at or above 0.90 under domain shift. The largest degradation was observed for minivan (F1 = 0.72), driven by abstention rate rising from 2.4% to 25.0% rather than active misclassification, consistent with the mechanism propagating genuine model uncertainty. The full pipeline, including inference scripts, training code, evaluation utilities, and model weights, is released as open-source software to support reproducibility and reuse across roadside video archives and cycling safety research.
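The abstention rule described in this abstract can be illustrated with a minimal Python sketch. It assumes that the "softmax output" compared against 0.60 is the maximum class probability from Stage 2. The function names, the label order, and the example logits are hypothetical and are not taken from the released pipeline.

```python
import math

# Label order is an assumption made for this sketch.
LABELS = ["passenger car", "SUV", "pickup truck",
          "minivan", "large van", "commercial truck"]
ABSTAIN_THRESHOLD = 0.60  # threshold quoted in the abstract


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_abstention(logits, labels=LABELS, threshold=ABSTAIN_THRESHOLD):
    """Return (label, confidence); the label is 'unknown' when confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        # Withhold the prediction instead of emitting a low-confidence class.
        return "unknown", probs[best]
    return labels[best], probs[best]


if __name__ == "__main__":
    print(classify_with_abstention([5.0, 0.0, 0.0, 0.0, 0.0, 0.0]))  # confident case
    print(classify_with_abstention([1.0, 0.9, 0.0, 0.0, 0.0, 0.0]))  # ambiguous case
```

Under this reading, a rise in abstention rate lowers recall for a class without adding wrong labels, which matches the abstract's explanation of the minivan result.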

2606.05142 2026-06-04 cs.CV cs.AI Version update

GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

Josef Bengtson, Yaroslava Lochman, Fredrik Kahl

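Affiliations: Chalmers University of Technology

AI summary: Proposes GeM-NR, a fast, flexible, training-free method for general multi-view-consistent nonrigid image editing. It works through depth-map alignment, projection onto the query viewpoint, and conditioned refinement, and supports edits that substantially change geometry and appearance.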

Comments Project page: https://gem-nr.github.io/

English abstract

Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these methods to edits that preserve the underlying scene structure. Other approaches are trained for specific image editing tasks, such as object removal and addition. Despite this progress, general nonrigid edits, i.e., edits that substantially change the scene geometry, remain challenging for existing methods. We propose GeM-NR, a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor (such as FLUX, Qwen, BrushNet) and a query unedited image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: (i) depth map estimation, where we propose a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, (ii) projection onto a query viewpoint, and (iii) refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object. We demonstrate the ability of our method to handle edits with significant changes in geometry and appearance, something that existing methods struggle with. We perform an extensive evaluation showing that our method improves consistency for a wide variety of edit tasks, including generating 3D representations of the edited scene. Both quantitative and qualitative results indicate the state-of-the-art performance of our method in terms of edit quality as well as geometric and photometric consistency across multiple views.

2606.05124 2026-06-04 cs.GR cs.CV cs.LG Version update

Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting

Hongyu Zhou, Zorah Lähner

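Affiliations: University of Bonn; Lamarr Institute

AI summary: Addresses the conflict between geometry representation and appearance rendering in 3D Gaussian Splatting by adding a geometry opacity parameter to each splat, together with an optional transparency-curated optimization pipeline. Decoupling geometry from appearance improves both rendering and geometry, especially in complex scenes with transparent objects.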

English abstract

After the success of 3D Gaussian Splatting (3DGS) for novel view synthesis, many works have explored how to also use it for geometric surface representation. However, extracting accurate geometric information directly from 3DGS remains challenging and can often reduce the appearance rendering quality. In this work, we show that 3DGS in its default form is inherently unsuited to represent texture and geometry at the same time, by training with complete ground-truth texture and geometry information. We also propose a simple solution by applying a single additional geometry opacity parameter to each splat, together with an optional transparency-curated optimization pipeline. Our experiments, both with ground-truth and vision foundation model geometric input, show that this change leads to improved rendering and geometry performance on a wide variety of datasets, and especially complex scenes with transparent objects benefit significantly from our method.

2606.05115 2026-06-04 cs.CV cs.AI cs.CL Version update

Continual Visual and Verbal Learning Through a Child's Egocentric Input

Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake, Mengye Ren

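Affiliations: Agentic Learning AI Lab, New York University; Department of Psychology, Princeton University

AI summary: Proposes BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. It outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark and narrows the gap to the offline-training upper bound.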

Comments 15 pages, 4 figures

English abstract

Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.

2606.05107 2026-06-04 cs.CV cs.AI Version update

Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

Elouan Gardès, Seung Eun Yi, Kartik Ahuja, Théo Moutakanni, Huy V. Vo, Piotr Bojanowski, Wolfgang M. Pernice, Loïc Landrieu, Camille Couprie

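Affiliations: Meta FAIR, Paris; LIGM, CNRS, Gustave Eiffel, ENPC, IP Paris; Columbia University, New York

AI summary: Proposes FINO, a label-free method that uses metadata and self-supervised learning to adapt generic vision foundation models to specialized scientific domains. Backbone adaptation uses no task labels and supervision is limited to lightweight probes. FINO outperforms standard unsupervised and fully supervised adaptation across several domains.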

English abstract

We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and hurt robustness. We instead leverage metadata to adapt representations to new domains in a self-supervised manner. Our method, FINO, combines a standard self-supervised objective with flexible metadata guidance that handles both highly granular discrete metadata and continuous metadata. It encourages the representation to preserve informative factors while suppressing spurious ones. Across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO consistently outperforms standard unsupervised domain adaptation and fully supervised adaptation. It also exceeds highly-specialized domain-specific state of the art, while using no task labels for backbone adaptation and only lightweight probes for supervision.

2606.05103 2026-06-04 cs.LG astro-ph.IM cs.CV stat.ML Version update

Identifying Gems from Roman RAPIDly

Karan Gandhi, Ashish A. Mahabal, Jacob E. Jencson, Russ R. Laher, Ben Rusholme, Lin Yan, Ryan M. Lau, Schuyler D. Van Dyk, Mansi M. Kasliwal

Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology, Gandhinagar, India; Division of Physics, Mathematics, and Astronomy, California Institute of Technology, Pasadena, CA 91125, USA; Center for Data Driven Discovery, California Institute of Technology, Pasadena, CA 91125, USA; IPAC, California Institute of Technology, 1200 E. California Blvd, Pasadena, CA 91125, USA; Caltech Optical Observatories, California Institute of Technology, Pasadena, CA 91125, USA

AI summary: Because no real data from the Roman Space Telescope exist yet, proposes the machine learning model RuBR and a general methodology for separating genuine transient and variable detections from bogus ones within the RAPID pipeline. Experiments indicate the approach is promising for robust real-bogus classification in the Roman era.

Comments 15 pages, 10 figures, Submitted to the Publications of the Astronomical Society of the Pacific

English abstract

The Nancy Grace Roman Space Telescope (Roman), set for launch as early as September 2026, will conduct wide-field infrared imaging surveys with unprecedented spatial resolution and cadence, enabling the discovery of millions of astronomical transients. Hence, it is necessary to have automated pipelines for generating alerts in place so that the telescope can begin discovering reliable transients and variable objects soon after it is launched. However, no real Roman data currently exist, making the development of such pipelines difficult. In this work, we present a machine learning model $RuBR$ and a general methodology for distinguishing genuine transient and variable detections from spurious (bogus) detections within the RAPID pipeline. In particular, we present three models using this methodology: $RuBR_{comb}$ trained and tested on combined locally injected and OpenUniverse2024 transients, $RuBR_{loc}$ trained on locally injected transients and tested on OpenUniverse2024 transients, and $RuBR_{DA}$ that combines locally injected transients with a fraction of OpenUniverse2024 transients in domain-adaptation mode for training. This paves the way for strategies to adapt the $RuBR_{comb}$ model to real observations in the absence of any ground-truth labels during the early phases of the Roman mission. While the image differencing pipeline continues to be improved, our experimental results demonstrate the effectiveness of the proposed approach and its promise for robust real-bogus classification in the Roman era.

2606.05071 2026-06-04 cs.CV Version update

InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space

Jiarui Wu, Yujin Wang, Ruikang Li, Fan Zhang, Mingde Yao, Tianfan Xue

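Affiliations: Shanghai AI Laboratory; CUHK MMLab; CPII under InnoHK

AI summary: Proposes an image retouching method based on bilateral space manipulation. The model predicts a low-resolution bilateral grid that is sliced with a learned guidance map, and it is trained by distilling a diffusion model with an added prompt alignment loss, giving efficient, high-fidelity, instruction-following retouching.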

Comments Computer Vision and Pattern Recognition (CVPR), 2026

English abstract

Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recently, diffusion-based retouching has shown superior visual quality, but it often struggles with fidelity because of its generative nature and with efficiency because of its iterative sampling process. In this work, we propose an efficient and fidelity-preserving retouching method using bilateral space manipulation, which is both compact and content-decoupled. Specifically, instead of directly editing pixels or image latents, our model predicts a low-resolution bilateral grid of affine transforms, which is sliced using a learned guidance map and then applied to the full-resolution image. This approach yields both high fidelity and improved efficiency. To retain strong priors of a pretrained generative model, we distill a multi-step diffusion model into our bilateral grid framework using Variational Score Distillation, complemented by a prompt alignment loss to guide instruction-following behavior. Additionally, we introduce a new benchmark and evaluate our method across multiple dimensions: fidelity, instruction following, and efficiency. Compared with recent retouching methods such as Gemini-2.5-Flash (Nano-Banana), our method can avoid content drift, significantly improve latency, and generate visually pleasing edits, while maintaining a high level of fidelity. Project page: https://openimaginglab.github.io/InstantRetouch/.
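The grid-then-slice idea in this abstract can be sketched in plain Python. This is a simplified illustration, not the authors' model. It assumes a grayscale image with values in [0, 1] and a scalar (gain, bias) pair per grid cell in place of full affine color transforms. The guidance map is passed in by the caller, whereas the paper learns it. All function names are hypothetical.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t


def axis_weights(t, n):
    """Map t in [0, 1] to two neighbouring grid indices and a blend weight."""
    if n == 1:
        return 0, 0, 0.0
    pos = min(max(t, 0.0), 1.0) * (n - 1)
    lo = min(int(pos), n - 2)
    return lo, lo + 1, pos - lo


def slice_grid(grid, u, v, g):
    """Trilinearly interpolate (gain, bias) from grid[z][y][x] at (u, v, g)."""
    z0, z1, fz = axis_weights(g, len(grid))
    y0, y1, fy = axis_weights(v, len(grid[0]))
    x0, x1, fx = axis_weights(u, len(grid[0][0]))
    coeffs = []
    for c in range(2):  # 0 = gain, 1 = bias
        planes = []
        for z in (z0, z1):
            top = lerp(grid[z][y0][x0][c], grid[z][y0][x1][c], fx)
            bottom = lerp(grid[z][y1][x0][c], grid[z][y1][x1][c], fx)
            planes.append(lerp(top, bottom, fy))
        coeffs.append(lerp(planes[0], planes[1], fz))
    return coeffs[0], coeffs[1]


def apply_grid(image, guidance, grid):
    """Slice the low-resolution grid per pixel and apply gain * pixel + bias."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            u = c / (width - 1) if width > 1 else 0.0
            v = r / (height - 1) if height > 1 else 0.0
            gain, bias = slice_grid(grid, u, v, guidance[r][c])
            row.append(gain * image[r][c] + bias)
        out.append(row)
    return out


if __name__ == "__main__":
    image = [[0.0, 0.5, 1.0]]
    # Two guidance bins: keep shadows unchanged, halve the gain in highlights.
    tone_grid = [[[(1.0, 0.0)]], [[(0.5, 0.0)]]]
    print(apply_grid(image, image, tone_grid))
```

Because each output pixel is an affine function of the corresponding input pixel, a transform of this form can change tone but cannot invent new structure, which is one way to read the fidelity claim in the abstract.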

2606.05068 2026-06-04 cs.CV Version update

MaCo-GAN: Manifold-Contrastive Adversarial Learning for Single Image Super-Resolution

Daeyoung Han, Seongmin Hwang, Moongu Jeon

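Affiliations: Department of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea; Department of AI Convergence, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea

AI summary: Proposes MaCo-GAN, which replaces the conventional adversarial loss with manifold-contrastive adversarial learning. A dynamic fake sample synthesizer generates fake images that preserve low-resolution correspondence, yielding consistent improvements in the perception-distortion trade-off.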

English abstract

Conventional Generative Adversarial Networks (GANs) for Single Image Super-Resolution (SISR) often struggle with hallucinated artifacts, largely because standard discriminators evaluate overall image naturalness rather than strict conditional realism. To address this, we propose MaCo-GAN, a novel manifold-contrastive GAN framework that replaces the conventional adversarial loss with a supervised contrastive objective. A core component of our method is a dynamic fake sample synthesizer that transforms ground truth (GT) data into a spectrum of challenging, perceptually plausible fake images that strictly maintain low-resolution (LR) correspondence. Utilizing these synthesized samples, we establish a robust contrastive minimax game: the generator is trained to attract its predictions toward on-manifold fakes (low distortion) and repel them from off-manifold fakes (high distortion), while the discriminator optimizes the exact opposite. By simply replacing the adversarial loss of a baseline SR model with our proposed objective, we demonstrate consistent improvements in the perception-distortion trade-off across various benchmarks. Extensive ablation studies validate the effectiveness of our framework and provide deep insights into the dynamics of this conditional contrastive game.

2606.05058 2026-06-04 cs.CV cs.AI Version update

UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

Jingyuan Chen, Sheng Jin, Haopeng Sun, Wentao Liu, Chen Qian

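Affiliations: SenseTime Research and Tetras.AI

AI summary: To address the lack of a unified multi-modal benchmark for CAD, proposes the UniCAD benchmark and UniCAD-MLLM, a universal multi-modal large language model that handles point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering end to end in one framework, with state-of-the-art results on multiple benchmarks.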

English abstract

Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that ingests text, images, sketches, and point clouds and performs these heterogeneous tasks in an end-to-end fashion within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks demonstrate that UniCAD-MLLM achieves state-of-the-art performance across all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.

2606.05035 2026-06-04 cs.CV Version update

Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping

Peilin Tao, Chong Cheng, Yuansen Du, Caiwei Song, Zhengqing Chen, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Hainan Cui, Shuhan Shen

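Affiliations: CASIA; UCAS; Horizon Robotics; HKUST(GZ)

AI summary: Proposes Anchor3R, a framework that treats feed-forward reconstruction as local measurement prediction in the current-frame coordinate system rather than global regression. It combines window-relative pose prediction, loop-closure reinsertion, and motion averaging for online 3D reconstruction and pose estimation on long sequences.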

English abstract

Long-horizon online visual mapping is a core capability for robot perception, requiring continuous camera-motion and scene-geometry estimation from visual streams under bounded memory and computation. Recent feed-forward 3D reconstruction models provide strong geometric priors, but their streaming variants often predict poses in a fixed coordinate system tied to the first frame or a persistent scene memory. This fixed-gauge design leads to train-test mismatch, attention bias toward early anchors, and accumulated drift on sequences much longer than those seen during training. We propose Anchor3R, a streaming 3D reconstruction framework that treats feed-forward reconstruction as current-centric local measurement prediction rather than persistent global-gauge regression. At each time step, Anchor3R predicts window-relative poses and a local pointmap in the current-frame coordinate system, turning streaming reconstruction into relative-pose measurement generation. These measurements support online pose updates, while loop-closure reinsertion and motion averaging align the trajectory and transform local pointmaps into a coherent global reconstruction. Experiments on indoor, outdoor, driving, and RGB-D benchmarks show that Anchor3R improves long-horizon pose accuracy and dense reconstruction quality over existing streaming baselines, while supporting bounded-memory online inference.
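To make the notion of relative-pose measurements concrete, the sketch below chains frame-to-frame poses into a global trajectory. It is a deliberately reduced illustration: poses are planar SE(2) transforms rather than 3D camera poses, and the loop-closure reinsertion and motion averaging mentioned in the abstract are omitted. The function names are hypothetical and do not come from the Anchor3R code.

```python
import math


def se2(x, y, theta):
    """Homogeneous 3x3 matrix for a planar pose (translation x, y and heading theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]]


def compose(a, b):
    """Matrix product a @ b: apply relative pose b in the frame defined by pose a."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def chain_relative_poses(relative_poses):
    """Accumulate relative poses into global poses, starting from the identity."""
    pose = se2(0.0, 0.0, 0.0)
    trajectory = [pose]
    for rel in relative_poses:
        pose = compose(pose, rel)
        trajectory.append(pose)
    return trajectory


if __name__ == "__main__":
    # Move one unit forward and turn left 90 degrees, four times: a closed square.
    steps = [se2(1.0, 0.0, math.pi / 2)] * 4
    for pose in chain_relative_poses(steps):
        print(round(pose[0][2], 6), round(pose[1][2], 6))
```

Any error in a single relative pose propagates to every later global pose under plain chaining, which is why a global alignment step such as the loop closure and motion averaging named in the abstract is needed on long sequences.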

2606.05031 2026-06-04 cs.CV Version update

MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation

Dewei Zhou, Xinyu Huang, Xun Wang, Ji Xie, Yabo Zhang, Liang Li, Kunchang Li, Zongxin Yang, Yi Yang

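Affiliations: Zhejiang University; ByteDance Seed; Harvard University

AI summary: Proposes MetaPoint, which represents a continuous 2D coordinate as a single special token and relies on the model's inherent positional encoding to achieve pixel-level spatial control, with no architectural changes.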

English abstract

Generative visual models fundamentally struggle with precise spatial control. This arises from a core disconnect: models can process textual descriptions of space but cannot directly map numerical coordinates onto the 2D image canvas. We introduce MetaPoint, a method that bridges this gap by representing a continuous 2D coordinate as a single, special token. Crucially, MetaPoint requires no new architectural components; it directly leverages the model's inherent positional encoding schemes to interpret these coordinates, treating our token as a virtual point on the canvas. This lightweight approach enables pixel-level control of an object's position with one token or its bounding box with two, all without requiring architectural changes or bespoke attention masking. The MetaPoint tokens are designed to be compositional, serving as spatial primitives. This allows a planner agent to decompose a high-level user request into a structured sequence of primitives for the generator. By providing a simple, precise, and scalable building block for spatial control, MetaPoint unlocks more powerful compositional generative agents and enables intuitive, interactive editing systems.

2606.05018 2026-06-04 cs.CV Version update

Handwriting Extraction and Analysis of Signature Lists in Swiss Popular Initiatives

Marco Peer, Thomas Gorges, Mathias Seuret, Vincent Christlein, Andreas Fischer

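Affiliations: AIBEX Group, University of Fribourg; Pattern Recognition Lab, FAU Erlangen-Nürnberg

AI summary: To ease the labor-intensive manual validation of signature lists in Swiss popular initiatives, proposes an automated pipeline combining template-based line segmentation, OCR, and AI-based handwriting analysis, in particular writer retrieval. OCR performs poorly on short out-of-vocabulary text (CER of 29.6% for first names), whereas writer retrieval reaches an mAP of 50.6% and can support the detection of duplicate submissions.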

Comments Accepted for presentation at ICCST 2026

English abstract

Popular initiatives and referendums are central to Swiss democracy, yet the validation of handwritten signature lists remains a labor-intensive manual process. This paper investigates the potential of automated document analysis methods, including OCR and AI-based handwriting analysis, to support this task. We propose a pipeline combining template-based line segmentation with text recognition and writer retrieval techniques, evaluated on a dataset of 443 handwritten entries from 418 writers. Results show that OCR struggles with out-of-vocabulary handwriting, with a CER of 29.6% for first names. In contrast, writer retrieval performs more robustly, reaching an mAP of 50.6%. Furthermore, our experiments indicate that off-the-shelf OCR systems are not sufficiently reliable for transcription of handwritten signature data, particularly for short, out-of-vocabulary entries such as names or addresses. However, writer retrieval methods can effectively identify visually similar entries across signature lists, making them a suitable tool for supporting the detection of potential duplicate submissions based on handwriting similarity.
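The two metrics quoted in this abstract follow standard definitions, sketched below in Python. CER is the character-level edit distance divided by the reference length, and mAP is the mean over queries of the average precision of each ranked list. The paper's exact protocol, such as text normalization or how the query entry is excluded from the ranking, is not stated in the abstract, so the function names and example inputs here are illustrative assumptions.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def average_precision(ranked_relevance):
    """AP for one query, given booleans marking relevant items in rank order."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """mAP: mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


if __name__ == "__main__":
    print(character_error_rate("Andreas", "Andrea"))
    print(mean_average_precision([[True, False, True], [False, True]]))
```

Short entries such as first names have few reference characters, so a single misread character moves the CER by a large amount, which is consistent with the abstract's remark about short, out-of-vocabulary entries.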

2606.05011 2026-06-04 cs.CV cs.RO Version update

CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

Yurim Jeon, Dongseong Seo, Seung-Woo Seo

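Affiliations: Seoul National University

AI summary: Proposes CIPER, which uses a shared transformer encoder with task-specific tokens to jointly perform city-scale cross-view retrieval and precise 3-DoF pose estimation through mutually beneficial feature learning.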

Comments 16 pages, 5 figures

English abstract

Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.

2606.05008 2026-06-04 cs.CV cs.AI cs.CL Version update

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yixin Zhu, Yiwu Zhong

Affiliations: School of Intelligence Science and Technology, Peking University; State Key Laboratory of General Artificial Intelligence, Peking University; Yuanpei College, Peking University; Institute for Artificial Intelligence, Peking University; School of Psychological and Cognitive Sciences, Peking University; University of Wisconsin-Madison

AI summary: Proposes M$^3$Eval, the first memory evaluation framework for multi-modal models. Video tasks grounded in cognitive psychology systematically assess memory retention, faithfulness, and robustness, revealing clear weaknesses in handling parallel video streams, interference patterns, spatial versus temporal memory grounding, and symbolic memory.

Comments We present an evaluation designed for multi-modal memory in multi-modal models

详情
AI中文摘要

随着多模态模型向长视频理解发展,记忆成为关键能力。尽管在视频数据集和基准测试方面做出了大量努力,现有工作主要关注感知和推理,而没有系统评估记忆:模型保留了什么、信息如何忠实保存、以及记忆在干扰下的鲁棒性。为填补这一空白,我们引入了M$^3$Eval,这是第一个用于探测多模态模型中不同记忆维度的综合评估框架和基准。基于认知心理学,我们的设计通过精心构建的任务来隔离记忆的关键方面。利用M$^3$Eval,我们在代表性多模态模型上进行了大量实验,揭示了一致的弱点和独特行为。我们发现,模型在处理并行视频流时难以保持解耦表示,表现出与人类记忆显著不同的干扰模式,在空间域比时间域更可靠地定位记忆源,并且符号记忆有限。总的来说,我们的基准为未来研究提供了宝贵资源,而我们的发现强调了记忆作为基本但未充分探索的能力,并为设计更有效的多模态模型记忆机制提供了见解。我们的代码和数据集可在https://pku-value-lab.github.io/m3eval-homepage获取。

英文摘要

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

2606.04992 2026-06-04 cs.CV cs.HC 版本更新

Multi-Camera AR Guidance System for Surgical Instrument Handling and Assembly: Investigating Workload and Efficiency

用于手术器械操作与组装的多摄像头AR引导系统:工作量和效率研究

Shiyu Li, Julian Kreimeier, Hannah Schieber, Dirk Müller, Bernhard Kainz, Rüdiger von Eisenhart-Rothe, Daniel Roth

发表机构 * NVIDIA Deep Learning Data Synthesizer(NVIDIA深度学习数据合成器) NVIDIA Scene Imaging Interface(NVIDIA场景成像接口)

AI总结 提出一种无标记的多摄像头增强现实引导系统,结合6D位姿估计和头戴显示,显著降低手术器械操作的工作量并提高效率。

Comments 11 pages

详情
AI中文摘要

手术中器械的操作和组装对洗手护士提出了很高的认知要求,尤其是在器械不熟悉的情况下。我们提出了一种支持性的手术器械引导系统,该系统结合了多摄像头6D位姿估计和头戴显示器上的增强现实原位可视化,无需额外标记。位姿估计和连续的相机校准通过已知物体实现。6D位姿估计网络仅使用合成数据进行训练,旨在获得更好的泛化能力和实际应用性。AR引导显示工具提示定位线索和逐步组装动画。通过基于注视的选择和脚踏板,用户可以在术中操作中切换组装步骤。在技术评估中,我们的方法优于最先进的6D位姿估计。在膝关节置换术的手术模拟中,对29名洗手护士进行了用户研究,将系统与纸质手册进行了比较。AR引导显著降低了感知工作量。客观上,AR引导将任务完成时间减少了21.3%(4.76分钟)。特别是,对器械组不太熟悉的洗手护士在使用该系统时受益。两种条件下的错误频率相当。定性反馈强调了过程清晰度提高、信息过载减少和感知独立性。总之,我们的无标记多摄像头AR引导方法可以在主观和客观上改善术中器械操作性能,特别是对于未经培训的洗手护士。

英文摘要

The handling and assembly of instruments during surgery imposes high cognitive demands on scrub nurses, particularly when instruments are unfamiliar. We present a supporting guidance system for surgical instrumentation that combines multi-camera 6D pose estimation with augmented reality in-situ visualization on a head-mounted display without the requirement for additional markers. Pose estimation and consecutive camera calibration are achieved through known objects. The 6D pose estimation network is trained purely on synthetic data, aiming for better generalizability and real-world applicability. The AR guidance displays tooltip localization cues and step-wise assembly animations. Via gaze-based selection and a foot pedal, users can switch between assembly steps in intraoperative use. In a technical evaluation, our approach outperforms state-of-the-art 6D pose estimation. A user study with 29 scrub nurses was conducted in a surgical simulation of knee arthroplasty, comparing the system against a paper manual. AR guidance significantly reduced perceived workload compared with the paper manual. Objectively, AR guidance reduced task completion time by 21.3% (4.76 minutes). Specifically, scrub nurses less experienced with the instrument set benefited when using the system. Error frequencies were comparable between conditions. Qualitative feedback highlighted improved process clarity, reduced information overload, and perceived independence. To summarize, our marker-free multi-camera AR guidance approach for surgical instruments can, subjectively and objectively, improve intraoperative instrumentation performance, particularly for untrained scrub nurses.

2606.04986 2026-06-04 cs.CV 版本更新

Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning

Food-R1: 一种基于强化学习的统一多任务食品视觉语言模型

Yu Zhu, Yongkang Li, Wenjie Zhu, Haoyi Jiang, Wenyu Liu, Wei Yang, Bin Li, Xinggang Wang

发表机构 * Huazhong University of Science and Technology(华中科技大学)

AI总结 针对现有食品视觉语言模型依赖监督微调导致推理和泛化能力受限以及营养标注稀缺的问题,提出包含链式思维标注的大规模基准CalorieBench-80K和基于强化微调(GRPO)的统一多任务食品视觉语言模型Food-R1,在食品相关任务上持续超越强基线。

详情
AI中文摘要

最近的研究探索了将视觉语言模型(VLM)用于食品分析。然而,现有方法主要依赖于监督微调(SFT),这通常限制了推理和泛化能力。此外,高质量的大规模营养标注仍然稀缺。为了解决这些问题,我们引入了CalorieBench-80K,一个包含精心整理的卡路里标签和饮食建议注释的大规模基准。据我们所知,这是第一个包含链式思维(CoT)注释用于卡路里推理的食品图像基准。我们还提出了Food-R1,一个在多任务学习范式中训练的统一食品VLM,以赋予模型广泛的能力。Food-R1经过基于CoT的冷启动指令微调,然后使用组相对策略优化(GRPO)进行强化微调(RFT),以提高推理和性能。在CalorieBench-80K和代表性基准上的实验表明,Food-R1在食品相关任务上持续优于强基线。代码、模型权重和基准注释可在项目仓库中获得。

英文摘要

Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
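
To make the GRPO step above concrete, here is a minimal sketch of the group-relative advantage that gives the method its name: several responses are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. The reward values are hypothetical, and the use of the population standard deviation and the `eps` constant are assumptions; the abstract does not describe Food-R1's reward design or implementation details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one calorie question.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive positive advantages and are reinforced; no separate value network is needed to provide the baseline.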

2606.04970 2026-06-04 cs.CV cs.AI 版本更新

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

计划、观察、恢复:主动式程序辅助的基准与架构

Kaustav Kundu, Ritvik Shrivastava, Maxim Arap, Nanshu Wang, Xianhui Zhu, Quintin Fettes, Gautam Tiwari, Parth Suresh, Théo Moutakanni, Alejandro Castillejo Munoz, Allen Bolourchi, Pascale Fung, Pinar Donmez, Babak Damavandi, Anuj Kumar, Seungwhan Moon

发表机构 * Meta Reality Labs(Meta现实实验室) Meta Superintelligence Labs(Meta超智能实验室)

AI总结 提出EgoProactive数据集和Pro²Bench基准,并设计解耦规划器-交互架构,用于主动式程序辅助中的实时引导和异常恢复。

Comments 53 pages, 14 figures

详情
AI中文摘要

我们设想一个主动的多模态辅助系统,该系统在程序性任务中为用户提供实时的逐步指导,自主决定何时中断以及如何指导。然而,由于缺乏反映现实条件的大规模跨领域基准,特别是用户偏离预期步骤序列的常见情况,进展受到限制。我们通过四项贡献来解决这一差距:(1) 我们发布了EgoProactive,一个大规模的可穿戴自我中心数据集,用于主动程序辅助,带有明确的计划外(OOP)标注和恢复步骤;(2) 我们将五个已建立的基准(Ego4D、EPIC-KITCHENS、EgoExo4D、HoloAssist、HowTo100M)扩充为统一的主动指导模式下的Pro²Bench;(3) 我们提出了一种专门针对程序状态、视觉线索和恢复注入的解耦规划器-交互架构;(4) 我们引入了一种跨模型家族迁移的训练后方案,通过在Llama 4和Qwen-3.6-VL上的跨骨干复制进行验证。在大量实验中,我们训练的Llama-4系统在所有六个数据集上,相对于强大的专有基线(Claude Opus 4.6、Gemini 3.1 Pro、GPT 5.2)和开放权重基线(Qwen3 VL 235B),显著提高了客观干预质量。Oracle计划实验进一步表明,当计划质量得到控制时,训练的双工模型产生高质量的指导,并在计划外恢复方面取得巨大收益。

英文摘要

We envision a proactive multi-modal assistant system that gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight baselines (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

2606.04925 2026-06-04 cs.CV 版本更新

Scene-Centric Unsupervised Video Panoptic Segmentation

以场景为中心的无监督视频全景分割

Christoph Reich, Oliver Hahn, Nikita Araslanov, Laura Leal-Taixé, Christian Rupprecht, Daniel Cremers, Stefan Roth

发表机构 * TU Munich(慕尼黑工业大学) TU Darmstadt(达姆施塔特工业大学) NVIDIA(英伟达) University of Oxford(牛津大学) MCML ELIZA

AI总结 提出无监督视频全景分割任务及首个方法VideoCUPS,利用深度、运动和视觉线索生成伪标签,并通过Video DropLoss训练,在无监督条件下实现准确分割。

Comments CVPR 2026. Oliver Hahn and Christoph Reich - both authors contributed equally. Code: https://github.com/visinf/cups/tree/main/videocups Project page: https://visinf.github.io/videocups/

详情
AI中文摘要

视频全景分割(VPS)旨在联合检测、分割和跟踪所有对象,同时将视频划分为语义一致的区域。我们引入了无监督VPS的任务设置,省略任何人工监督。现有的无监督场景理解工作主要关注图像分割任务;视频领域仍未充分探索。我们提出了VideoCUPS,这是第一个无监督VPS方法。VideoCUPS通过利用无监督的深度、运动和视觉线索,从以场景为中心的视频中生成时间一致的全景视频伪标签。使用新颖的Video DropLoss在这些伪标签上训练,可以得到一个准确的无监督VPS模型。为了对进展进行基准测试,我们引入了一个全面的评估协议和四个竞争基线,将最先进的无监督全景图像和实例视频分割模型扩展到VPS。VideoCUPS优于所有基线,并展示了强大的标签高效学习能力。通过VideoCUPS、我们的评估协议和基线,我们为未来无监督VPS的研究提供了坚实的基础。

英文摘要

Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing unsupervised scene understanding works mainly focused on image segmentation tasks; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo-labels from scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label-efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.

2606.04922 2026-06-04 cs.CV cs.AI cs.LG 版本更新

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

几何感知蒸馏用于提示调优生物医学视觉-语言模型

Tran Dinh Tien, Zhiqiang Shen

发表机构 * Department of Machine Learning(机器学习系) Mohamed bin Zayed University of Artificial Intelligence(Mohamed bin Zayed人工智能大学)

AI总结 提出Omni-Geometry知识蒸馏(OGKD)框架,通过注入类别关系结构到教师模型,生成保留真实标签同时尊重类间几何的方向性目标,并设计全局几何感知蒸馏(GAD)和标签引导几何蒸馏(LGD)损失,在11个医学数据集上平均提升准确率1.7%-2.8%。

Comments Preprint. Code is available at https://github.com/tientrandinh/OGKD

详情
AI中文摘要

当前基于提示和适配器的视觉-语言模型(VLM)调优方法在医学影像中具有吸引力,因为临床数据敏感性倾向于冻结骨干网络且标注有限。然而,这些方法通常仅优化真实类别,将所有其他类别视为同等错误,忽略了临床上有意义的类别关系,并在有限监督设置下产生不稳定的决策边界。我们提出了Omni-Geometry知识蒸馏(OGKD),一种新框架,将类别关系结构注入教师模型,以生成保留真实标签同时尊重类间几何的方向性目标。利用这些目标,我们开发了两种蒸馏损失:全局几何感知蒸馏(GAD)作用于全局图像标记,标签引导几何蒸馏(LGD)将相同的几何应用于注意力补丁标记以改善细粒度对齐。在11个广泛使用的医学数据集上进行的基础到新类和少样本评估的综合实验和分析中,我们的OGKD实现了显著更好的性能,在所有先前最先进的VLM适应方法上平均绝对增益为1.7%-2.8%。它还能稳健地泛化到未见类别,并产生比其他方法更可靠的预测。我们的代码可在https://github.com/tientrandinh/OGKD获取。

英文摘要

Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.

2606.04911 2026-06-04 cs.CV cs.CL 版本更新

BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine

BreastGPT: 面向乳腺癌临床全流程的多模态大语言模型

Yang Liu, Jiajin Zhang, Danyang Tu, Yaojun Hu, Jiao Qu, Jiuyu Zhang, Yu Shi, Wei Fang, Shi Gu, Ling Zhang, Yingda Xia

发表机构 * DAMO Academy, Alibaba Group(阿里巴巴集团达摩院) Zhejiang University(浙江大学) Hupan Lab(湖畔实验室) West China Hospital(华西医院) China Medical University(中国医科大学)

AI总结 提出BreastGPT多模态大语言模型,通过构建工作流对齐的指令语料库BreastStage和双分支视觉编码器,实现乳腺癌筛查、诊断和治疗规划全流程的多模态推理,在BreastStage-Bench上取得75.66%封闭式准确率和89.92%开放式得分。

详情
AI中文摘要

乳腺癌仍然是女性癌症相关死亡的主要原因。其临床管理需要跨临床工作流(包括筛查、诊断和治疗规划)的多模态推理,其中每个阶段涉及不同的成像模态、任务目标和推理模式。然而,受限于数据稀缺和模型通用性,现有的医学多模态大语言模型通常仅在孤立的模态或狭窄的任务族上进行评估,限制了它们支持工作流级临床推理的能力。在这项工作中,我们首先引入了BreastStage,一个工作流对齐的乳腺影像指令语料库,包含来自5种成像模态的17个子数据集和136个任务模板的186万条指令遵循对。其保留子集BreastStage-Bench为评估乳腺癌护理连续体中的多模态推理提供了全面的基准。基于该语料库,我们提出了BreastGPT,一个统一的多模态大语言模型,配备双分支视觉编码器和概念保持的令牌压缩,以弥合标准放射学与千兆像素病理学之间的尺度差距。在BreastStage-Bench上,BreastGPT实现了75.66%的封闭式准确率和89.92%的开放式得分,在临床阶段和任务格式上均优于通用和医学专用多模态大语言模型。这些结果表明,工作流对齐的数据和跨尺度视觉建模对于临床基础的医学多模态大语言模型至关重要。所有数据、代码和模型检查点已在https://yangyy-liu.github.io/BreastGPT.io发布。

英文摘要

Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis, and treatment planning, where each stage involves distinct imaging modalities, task objectives, and reasoning patterns. However, constrained by data scarcity and model versatility, existing medical MLLMs are typically evaluated on isolated modalities or narrow task families, limiting their ability to support workflow-level clinical reasoning. In this work, we first introduce BreastStage, a workflow-aligned breast imaging instruction corpus comprising 1.86M instruction-following pairs curated from 17 sub-datasets across 5 imaging modalities and 136 task templates. Its held-out split, BreastStage-Bench, provides a comprehensive benchmark for evaluating multimodal reasoning across the breast cancer care continuum. Building on this corpus, we propose BreastGPT, a unified MLLM equipped with a dual-branch visual encoder and concept-preserving token compression to bridge the scale gap between standard radiology and gigapixel pathology. On BreastStage-Bench, BreastGPT achieves 75.66% closed-ended accuracy and 89.92% open-ended score, outperforming both general-purpose and medical-specific MLLMs across clinical stages and task formats. These results suggest that workflow-aligned data and cross-scale visual modeling are critical for clinically grounded medical MLLMs. All data, code, and model checkpoints are released at https://yangyy-liu.github.io/BreastGPT.io.

2606.04898 2026-06-04 cs.CV 版本更新

CDPM-Align: Multi-Scale Guidance-Aligned Diffusion Pretraining for Robust Few-Shot Anatomical Landmark Detection

CDPM-Align:用于鲁棒少样本解剖标志检测的多尺度引导对齐扩散预训练

Roberto Di Via, Irina Voiculescu, Francesca Odone, Vito Paolo Pastore

发表机构 * MaLGa DIBRIS, University of Genoa(DIBRIS,热那亚大学) University of Genoa(热那亚大学) Department of Computer Science, University of Oxford(牛津大学计算机科学系)

AI总结 提出多尺度引导对齐的条件扩散预训练方法CDPM-Align,通过生成式预训练学习鲁棒表示,在少样本和低标注场景下提升解剖标志检测的准确性和不确定性。

Comments Accepted MICCAI 2026

详情
AI中文摘要

解剖标志检测是医学图像分析中的一项基础任务,支持广泛的诊断和介入工作流程。尽管最近的方法已经实现了亚毫米级的定位,但仅凭准确性不足以用于临床部署,还需要预测的可靠性和鲁棒性。尽管具有临床相关性,但表示学习在此背景下的影响仍未得到充分探索。在这项工作中,我们引入了CDPM-align,一种用于解剖标志检测的多尺度引导对齐条件扩散预训练方法。我们的实验设置侧重于少量图像和少量标注场景。具体来说,我们采用三个流行的异构小规模基准数据集,通过条件生成预训练进行表示学习。此外,我们考虑了标志检测下游任务的低标注场景,分别使用10张和25张标注图像,反映了临床工作与标注资源约束之间的现实权衡。我们的结果证实,生成式预训练使模型能够学习鲁棒的表示。这提高了下游任务的准确性和不确定性,朝着安全高效的临床部署迈进。

英文摘要

Anatomical landmark detection is a fundamental task in medical image analysis, supporting a wide range of diagnostic and interventional workflows. Although recent methods have achieved sub-millimetric localisation, accuracy alone is not sufficient for clinical deployment, which also requires reliable and robust predictions. Despite its clinical relevance, the impact of representation learning in this context is still underexplored. In this work, we introduce CDPM-Align, a multi-scale guidance-aligned conditional diffusion pre-training method for anatomical landmark detection. Our experimental setup focuses on few-image and few-annotation regimes. Specifically, we employ three popular heterogeneous small-scale benchmark datasets for representation learning via conditional generative pre-training. Furthermore, we consider low-annotation scenarios for the downstream task of landmark detection, with 10 and 25 annotated images, reflecting realistic trade-offs between clinical effort and resource constraints for annotations. Our results confirm that generative pre-training enables the model to learn a robust representation. This improves both accuracy and uncertainty on the downstream tasks, advancing towards safe and efficient clinical deployment.

2606.04891 2026-06-04 cs.CV cs.CG 版本更新

Hierarchical Space Partition for Surface Reconstruction

表面重建的层次空间划分

Minjie Tang, Xiangfei Li

发表机构 * Independent Researcher(独立研究员) Huazhong University of Science and Technology(华中科技大学)

AI总结 针对点云重建中因LiDAR扫描局限导致细节缺失的问题,提出基于平面分类与优先级生长的层次空间划分方法,并通过最小割优化生成水密多边形网格。

Comments Published in 2026 International Conference on 3D Vision (3DV)

详情
Journal ref
in 2026 International Conference on 3D Vision (3DV), Vancouver, BC, Canada, 2026, pp. 207-216
AI中文摘要

从点云生成紧凑的多边形模型是3D视觉和计算机图形学中的一个关键问题。然而,由于LiDAR扫描的固有限制(例如距离约束和遮挡),关键场景信息常常缺失,导致重建精度下降。为了解决这个问题,我们提出了一种平面组装策略,该策略在保持模型紧凑性的同时有效恢复缺失的细节。我们将从场景中提取的所有平面分为三类:高可见、几乎不可见和不可见。通过场景结构分析恢复的不可见平面指示了缺失的细节。这三种类型的平面对应于三种生长优先级。每个平面根据优先级水平生长,空间被逐步划分,即层次划分。随后,我们通过基于最小割的优化从划分中生成水密多边形网格。最后,在公共数据集上的比较显示了我们的方法相对于主流方法的有效性和优越性。项目页面可在https://hsr-3dv.github.io/获取。

英文摘要

Generating compact polygonal models from point clouds is a key problem in 3D vision and computer graphics. However, due to inherent limitations of LiDAR scanning (e.g. range constraints and occlusions), critical scene information is often missing, leading to degraded reconstruction accuracy. To address this, we propose a plane assembling strategy that effectively recovers missing details while maintaining model compactness. We classify all the planes extracted from the scene into three categories: highly visible, barely visible, and invisible. The invisible planes, which are recovered by scene structure analysis, indicate the missing details. The three types of planes correspond to the three growth priorities. Each plane grows according to the priority level, and the space is partitioned progressively, namely, the hierarchical partition. Subsequently, we generate a watertight polygonal mesh from the partition via a min-cut-based optimization. Finally, comparisons on public datasets show the effectiveness and superiority of our method against mainstream approaches. The project page is available at https://hsr-3dv.github.io/.

2606.04888 2026-06-04 cs.CV 版本更新

HD-DinoMoE: A Class-Aware Hierarchical Dual Mixture-of-Experts Network for Scleral Anomaly Segmentation in Complex Acquisition Scenarios

HD-DinoMoE: 一种用于复杂采集场景下巩膜异常分割的类别感知层次化双混合专家网络

Yinxiang Yu, Maoxiang Chu, Qi Niu, Guanghu Liu, Wei Xu, Haotian Wang, Zhi Chen, Yutian Zhu, Yuelong Fan, Guanghao Liao

发表机构 * School of Electronic and Information Engineering, University of Science and Technology Liaoning(辽宁科技大学电子与信息工程学院)

AI总结 针对多源分布差异、异常形态多样和巩膜镜面反射问题,提出类别感知层次化双混合专家网络HD-DinoMoE,结合双流DINOv3特征融合与类别特定多专家解码,实现血管、黄斑和黑斑、血斑的像素级分割,在ML-SASD数据集上达到72.11%的平均Dice和58.44%的平均IoU。

Comments Submitted to Medical Image Analysis; 47 pages, 31 figures, 14 tables

详情
AI中文摘要

中医目诊通过观察巩膜表面异常提供经验性线索,但其临床应用仍具有主观性且难以量化。为支持智能化和可量化的目诊,本研究提出了中医启发的人工智能眼部辅助诊断系统(TAO),并聚焦于像素级巩膜表面异常分割。针对受多源分布差异、异常形态多样和巩膜镜面反射(SSR)影响的临床和用户采集图像,我们提出了HD-DinoMoE,一种类别感知层次化双混合专家网络。HD-DinoMoE结合类别感知双流DINOv3特征融合与类别特定多专家解码,以分割血管、黄斑和黑斑以及血斑。一种三阶段骨干冻结路由策略稳定了双骨干适应;渐进置信惩罚(PCP)损失减少了SSR区域的高置信度假阳性和分割泄漏;类别感知自适应样本加权(CA-ASW)平衡了样本和类别级别的训练贡献。我们进一步构建了多标签巩膜异常分割数据集(ML-SASD),这是一个包含临床、野生和混合设置以及三种异常类别像素级标注的新基准。在ML-SASD-Mix上,HD-DinoMoE实现了72.11%的平均Dice和58.44%的平均交并比,同时保持了良好的边界定位和镜面区域假阳性控制。它在公共SBVPI数据集的血管子集上也显示出有竞争力的泛化能力。这些结果表明,HD-DinoMoE为复杂采集场景下的TAO提供了一种可行的分割解决方案。代码和数据访问信息可在https://github.com/FX-CMX/HD-DinoMoE获取。

英文摘要

Traditional Chinese Medicine (TCM) ocular inspection provides empirical cues for assessing scleral surface anomalies, but its clinical use remains subjective and difficult to quantify. To support intelligent and quantifiable ocular inspection, this study presents the TCM-inspired Artificial Intelligence Ocular Auxiliary Diagnosis System (TAO) and focuses on pixel-level scleral surface anomaly segmentation. For clinical and user-acquired images affected by multi-source distributional discrepancies, diverse anomaly morphologies, and scleral specular reflection (SSR), we propose HD-DinoMoE, a class-aware hierarchical dual mixture-of-experts network. HD-DinoMoE combines class-aware dual-stream DINOv3 feature fusion with class-specific multi-expert decoding to segment Vessels, Yellow and Black Spots, and Blood Spots. A three-stage backbone-frozen routing strategy stabilizes dual-backbone adaptation; Progressive Confidence Penalty (PCP) Loss reduces high-confidence false positives and segmentation leakage in SSR regions; and Class-Aware Adaptive Sample Weighting (CA-ASW) balances sample- and class-level training contributions. We further construct the Multi-label Scleral Anomaly Segmentation Dataset (ML-SASD), a new benchmark with Clinical, Wild, and Mix settings and pixel-wise annotations for three anomaly categories. On ML-SASD-Mix, HD-DinoMoE achieves a mean Dice of 72.11% and a mean Intersection-over-Union of 58.44%, while maintaining favorable boundary localization and specular-region false-positive control. It also shows competitive generalization on the Vessels subset of the public SBVPI dataset. These results indicate that HD-DinoMoE provides a feasible segmentation solution for TAO under complex acquisition scenarios. The code and data access information are available at https://github.com/FX-CMX/HD-DinoMoE.
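
The two scores reported above, Dice and Intersection-over-Union, can be illustrated with a minimal sketch for a single binary mask pair. The masks below are toy values; the empty-mask convention is an assumption, and the abstract does not state how HD-DinoMoE averages scores across classes and images.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks empty: treated here as a perfect match (one common convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy six-pixel masks: two pixels overlap, each mask has three foreground pixels.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
dice, iou = dice_and_iou(pred, target)
```

For one mask pair the two are linked by Dice = 2·IoU / (1 + IoU). That identity need not hold between averaged scores, so a mean Dice and a mean IoU reported together are not expected to satisfy it exactly.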

2606.04881 2026-06-04 cs.CV cs.AI 版本更新

DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance

DiverAge: 基于跨年龄身份关系引导的可靠多元人脸老化

Yueying Zou, Peipei Li, Qianrui Teng, Dianyan Xu, Zekun Li

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications(人工智能学院,北京邮电大学) School of Computer Science, University of California, Santa Barbara(计算机科学学院,加州大学圣芭芭拉分校)

AI总结 提出基于扩散自编码的分层多元人脸老化框架DiverAge,通过随机扩散解码和年龄条件语义调制保持外观多样性,并引入跨年龄身份关系调节器(CARR)在推理时引导去噪,以提升序列级有序可靠性。

Comments 11 pages, 10 figures, 5 tables

详情
AI中文摘要

人脸老化在长期生物特征分析、跨年龄身份验证和法医身份分析中扮演重要角色。由于同一主体因遗传、环境和生活方式等因素在目标年龄可能呈现多种合理外观,人脸老化本质上是一个一对多的生成问题。然而,仅有多元性不足以实现可靠的人脸老化:模型应在每个年龄组内提供外观级别的候选多样性,同时跨有序年龄组保持序列级别的有序可靠性。现有的确定性老化方法可以合成视觉上合理的年龄增长人脸,但通常缺乏随机多样性。相比之下,多元老化方法引入局部外观变化,但往往未能明确调控完整老化序列的身份演化。本文提出基于扩散自编码的分层多元人脸老化框架DiverAge。DiverAge通过随机扩散解码和年龄条件语义调制保持外观级多样性。为提升序列级可靠性,我们引入跨年龄身份关系调节器(CARR),一种推理时引导策略,联合去噪多个目标年龄组。CARR由从真实同身份跨年龄对估计的跨年龄身份相似性(CIS)先验引导,通过单边采样时引导抑制过度的跨年龄身份漂移,无需修改训练目标或引入额外可训练参数。实验表明,DiverAge在保持身份保留、年龄准确性、图像质量和外观级多样性的同时,提升了序列级有序可靠性。

英文摘要

Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces, but usually lack stochastic diversity. In contrast, pluralistic aging methods introduce local appearance variations, but often fail to explicitly regulate the identity evolution of the full aging sequence. In this paper, we propose DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce a Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and suppresses excessive cross-age identity drift through one-sided sampling-time guidance without modifying the training objective or introducing extra trainable parameters. Experiments demonstrate that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.

2606.04880 2026-06-04 cs.CV 版本更新

MAOAM: Unified Object and Material Selection with Vision-Language Models

MAOAM: 基于视觉语言模型的统一对象与材质选择

Jaden Park, Valentin Deschaintre, Jason Kuen, Kangning Liu, Iliyan Georgiev, Krishna Kumar Singh, Yong Jae Lee, Michael Fischer

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Adobe Research(Adobe研究)

AI总结 提出MAOAM框架,利用视觉语言模型和分割头,通过文本或点击交互实现对象和材质的精确选择,并设计数据生成流水线解决材质选择数据缺乏问题。

Comments Accepted to SIGGRAPH 2026 Conference. Project page: https://jadenpark0.github.io/project_pages/maoam/

详情
AI中文摘要

选择是交互式图像编辑中的核心操作。为了实用,用户应能通过文本或点击交互来指定和区分所需的选择区域,系统应支持不仅选择对象,还包括其他标准,如材质。基于材质的选择对于重新纹理化表面或编辑特定材质的实例等任务很有价值。然而,现有的基于视觉语言模型(VLM)的选择方法以对象为中心,通常支持单一交互模态,限制了其适用性。因此,在这项工作中,我们提出了Mask Any Object And Material(MAOAM),一个统一的选择框架,能够在文本和点击交互中实现精确的对象和材质级选择。MAOAM利用带有分割头的VLM从用户提示中生成像素级掩码:VLM解释用户的选择意图(对象或材质级)并编码视觉实体、属性和空间关系,而分割头将输出标记解码为掩码。一个关键挑战是缺乏带有文本标注的材质选择数据集。我们提出了一种可扩展的数据生成流水线:收集带有材质掩码的真实和合成图像,并利用VLM生成具有丰富视觉语义的材质描述。我们通过多任务目标训练MAOAM,涵盖点击和文本选择,以及从材质描述派生的辅助VQA任务,以促进更深入的材质理解。尽管使用单模态提示训练,我们的模型在推理时结合文本和点击时表现出选择能力的涌现提升,实现了灵活的图像编辑工作流程。实验表明,在多样化的对象、材质和交互场景中,选择准确且连贯,突显了实际鲁棒性。

英文摘要

Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interactions, and the system should support selecting not only objects but also other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language-model (VLM) based selection methods are object-centric and typically support a single interaction modality, limiting their applicability. In this work, we thus present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual-semantics. We train MAOAM with a multi-task objective over click and text-based selection, along with an auxiliary VQA task derived from the material descriptions to facilitate deeper material understanding. Despite being trained with uni-modal prompts, our model exhibits an emergent improvement in selection when combining text and clicks at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.

2606.04871 2026-06-04 cs.CV 版本更新

Recent Advances and Trends in Learning-based 3D Representations

基于学习的3D表示的最新进展与趋势

Adrien Schockaert, Hamid Laga, Hazem Wannous, Vincent Magnier, Guillaume Dufaye, Jean-françois Witz

发表机构 * CERI SN, IMT Nord Europe(CERI SN,IMT Nord Europe) Univ. Lille, CNRS, Centrale Lille, UMR 9013 - LaMcube - Laboratoire de Mécanique, Multiphysique, Multiéchelle(里尔大学,CNRS,Centrale Lille,UMR 9013 - LaMcube - 机械、多物理场、多尺度实验室) Downs, 59670 Sainte-Marie-Cappel(Downs,59670 Sainte-Marie-Cappel) School of Information Technology, Murdoch University(默多克大学信息技术学院)

AI总结 本文综述了从离散显式格式到连续隐式场(基于神经渲染或基元溅射)的3D表示家族,分析了其优缺点及关键应用,并强调了向隐式表示的范式转变。

详情
AI中文摘要

选择合适的3D表示是一个基本的设计决策,它决定了现代计算机视觉和图形管线在3D重建、新视角合成与渲染、形状与运动分析、识别和生成等任务中的效率、质量和能力。虽然传统表示(如网格、点云和体素网格)仍然是3D传感器(如LiDAR和3D扫描仪)的标准输出,并广泛应用于下游应用(如编辑和仿真),但最近的神经和基元表示(如3D高斯溅射)提供了紧凑且可微的替代方案,在游戏、AR/VR、自动驾驶、机器人导航和医学成像等应用中开辟了广泛的机会。本文的目标是综述主要的3D表示家族,从离散显式格式到基于神经渲染或基元溅射的连续隐式场。对于每种表示类型,我们介绍其一般公式和变体,讨论其优点和局限性,并突出关键应用。最后,我们概述了开放挑战和未来研究的潜在方向。与近期广泛涵盖3D物体和场景重建的综述不同,本文专注于分析3D表示本身的演变。我们特别强调了向隐式表示的范式转变,提供了关于这些新兴格式如何从根本上改变3D/4D工作流程的新视角。

英文摘要

The selection of an appropriate 3D representation is a fundamental design decision that dictates the efficiency, quality, and capabilities of modern computer vision and graphics pipelines for tasks such as 3D reconstruction, novel-view synthesis and rendering, shape and motion analysis, recognition, and generation. While traditional representations (e.g., meshes, point clouds, and volumetric grids) remain standard outputs of 3D sensors (e.g., LiDAR and 3D scanners) and are widely used in downstream applications (e.g., editing and simulation), recent neural and primitive-based representations (e.g., 3D Gaussian Splatting) offer compact and differentiable alternatives, opening a wide range of opportunities in applications such as games, AR/VR, autonomous driving, robot navigation, and medical imaging, to name a few. The goal of this paper is to survey the main families of 3D representations from discrete explicit formats to continuous implicit fields based either on neural rendering or primitive splatting. For each type of representation, we present the general formulation and its variants, discuss its benefits and limitations, and highlight key applications. We conclude the paper by outlining the open challenges and potential directions for future research. Distinct from recent surveys that broadly cover 3D object and scene reconstruction, this paper provides a focused analysis on the evolution of 3D representations themselves. We specifically emphasize the paradigm shift toward implicit representations, offering a novel perspective on how these emerging formats fundamentally alter 3D/4D workflows.
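
The explicit-versus-implicit distinction drawn above can be seen on a toy shape. In the sketch below the same sphere is stored once as a finite point cloud (explicit) and once as a signed distance function that can be queried at any 3D location (implicit). The analytic function stands in for a learned neural field purely for illustration; the names and sampling scheme are assumptions, not drawn from the survey.

```python
import math


def sphere_sdf(point, radius=1.0):
    """Implicit form: signed distance from a point to a sphere at the origin."""
    x, y, z = point
    return math.sqrt(x * x + y * y + z * z) - radius


def sphere_point_cloud(n_lat=8, n_lon=16, radius=1.0):
    """Explicit form: a finite list of samples on the sphere's surface."""
    points = []
    for i in range(1, n_lat):
        theta = math.pi * i / n_lat        # polar angle, poles excluded
        for j in range(n_lon):
            phi = 2 * math.pi * j / n_lon  # azimuth
            points.append((
                radius * math.sin(theta) * math.cos(phi),
                radius * math.sin(theta) * math.sin(phi),
                radius * math.cos(theta),
            ))
    return points


cloud = sphere_point_cloud()
# Every explicit sample lies on the zero level set of the implicit field.
max_residual = max(abs(sphere_sdf(p)) for p in cloud)
```

The point cloud has a fixed resolution chosen at sampling time, while the implicit form answers inside/outside queries at arbitrary points; this trade-off is one reason the choice of representation shapes the rest of the pipeline.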

2606.04863 2026-06-04 cs.CV 版本更新

IRIS-GAN: Staged Specialist Detection of Deepfake Faces

IRIS-GAN: 深度伪造人脸的分阶段专家检测

Jaume M. Trenchs, Veronica Sanz

Affiliations: Departamento de Física Teórica, Universitat de València, Burjassot, Spain; Instituto de Física Corpuscular (IFIC), CSIC–Universitat de València, Valencia, Spain

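AI summary: Proposes IRIS-GAN, a specialist fake-face detector trained through staged exposure to different GAN families. It achieves high detection rates under cross-generator shift, and Grad-CAM analysis reveals generator-dependent spatial response patterns.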

Comments 20 pages, 10 figures

详情
AI中文摘要

我们引入IRIS-GAN,一种针对跨生成器迁移下合成人脸图像的专业取证检测器。我们并非解决通用合成图像检测问题,而是专注于由生成对抗网络(GAN)生成的人脸,这些网络在深度伪造内容中处于领先地位,并通过分阶段暴露于日益苛刻的GAN族同时保留早期生成器来训练检测器。最终模型在考虑的GAN族中实现了超过99%的伪造检测率,并以98.9%的准确率分类了一个外部真实人脸数据集。Grad-CAM分析进一步揭示了可测量的生成器依赖的空间响应模式,这些模式对于仅使用热图的二级分类器仍然具有信息量。对扩散生成人脸的族外测试证实了IRIS-GAN是一个专家检测器,具有一定能力检测非GAN深度伪造。这些结果确立了分阶段训练作为鲁棒GAN人脸取证的有效策略。

英文摘要

We introduce IRIS-GAN, a specialist forensic detector for synthetic face images under cross-generator shift. Rather than addressing universal synthetic-image detection, we focus on faces generated by generative adversarial networks (GANs), which are state-of-the-art in deepfake content, and train the detector through staged exposure to increasingly demanding GAN families while retaining earlier generators. The final model reaches fake-detection rates above 99% across the GAN families considered and classifies an external real-face dataset with 98.9% accuracy. Grad-CAM analysis further reveals measurable generator-dependent spatial response patterns, which remain informative for a secondary heatmap-only classifier. Out-of-family tests on diffusion-generated faces confirm that IRIS-GAN is a specialist detector, with some capability to reach non-GAN deepfakes. These results establish staged training as an effective strategy for robust GAN-face forensics.

2606.04847 2026-06-04 cs.CV cs.CL cs.LG 版本更新

MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

MusaCoder: 在摩尔线程GPU上通过全栈训练实现原生GPU内核生成

Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang

Affiliations: Moore Threads AI

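AI summary: Proposes MusaCoder, a full-stack training framework that combines progressive data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback reinforcement learning to generate efficient native GPU kernels on CUDA and MUSA backends. The 9B model matches frontier closed-source models and the 27B model sets a new state of the art.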

详情
AI中文摘要

原生GPU内核生成将高级张量程序转换为可执行、高效的低级代码。现有大型语言模型(LLMs)在此任务上表现不佳,而基于执行的强化学习面临稀疏奖励、奖励黑客和训练不稳定性问题。我们提出MusaCoder,一个用于在CUDA和MUSA后端上生成原生GPU内核的全栈训练框架。MusaCoder结合了渐进式内核导向数据合成、保持多样性的拒绝微调以及通过MooreEval(一个分布式验证器和奖励环境)进行的执行反馈强化学习(RL)。为了稳定RL,MusaCoder引入了PrimeEcho用于首轮锚定的多轮奖励、Buffered Dynamic Retry用于从全失败的困难样本中恢复信号,以及MirrorPop用于离策略序列过滤。在KernelBench和MUSA移植变体上的实验表明,MusaCoder在正确性和经验加速方面均优于强开源和专有基线,其中9B模型匹配或超越前沿闭源模型,27B模型建立了新的最优结果。这些结果不仅证明了全栈执行反馈训练对原生内核生成的有效性,也展示了摩尔线程GPU支持完整LLM后训练栈的能力,为新兴加速器上的大模型训练和优化提供了实用基础。

英文摘要

Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.

2606.04844 2026-06-04 cs.SD cs.CV 版本更新

Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification

漂移增强评分:文本驱动的零样本音频-语言分类噪声鲁棒性

Tu Vo, Sheir Zaheer, Chan Y. Park

Affiliations: Anonymous Authors

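AI summary: Proposes Drift Augmented Scoring (DAS), which uses text-derived, noise-conditioned prompts to predict the drift direction of audio embeddings and adds a per-class bonus to the score. It improves zero-shot audio classification accuracy and mAP under noise without gradients or test-time batching.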

详情
AI中文摘要

对比音频-语言模型(如CLAP)能够实现零样本音频分类:通过将音频嵌入与文本提示嵌入匹配来标记声音,无需标注音频。但在声学噪声下,这种匹配会失效,标准基准测试中,0 dB SNR时准确率和mAP下降12-30个百分点。我们提出漂移增强评分(DAS),这是一种添加到余弦评分中的每类小奖励。当噪声音频嵌入向该类噪声条件文本提示预测的方向漂移时,奖励该类。该奖励仅从文本推导,计算一次并缓存,推理时每类只需一个内积,无需梯度或测试时批处理。在LAION CLAP骨干网络上,我们将DAS与Acevedo等人同期方法的四种变体在UrbanSound8K和完整FSD50K评估集上进行比较,将每个片段与城市声学场景噪声混合,覆盖一系列SNR。DAS在所有测试条件下均提升了指标:UrbanSound8K上准确率提高+2.60至+5.75个百分点,FSD50K上mAP提高+1.50至+1.74个百分点。

英文摘要

Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding to text prompt embeddings, with no labelled audio. This matching breaks down under acoustic noise, where accuracy and mAP fall by 12-30 percentage points at 0 dB SNR on standard benchmarks. We propose Drift Augmented Scoring (DAS), a small per-class bonus added to the cosine score. The bonus rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict. It is derived from text alone, computed once and cached, and adds a single inner product per class at inference, with no gradients and no test-time batch. On a LAION CLAP backbone, we compare DAS against the four variants of Acevedo et al.'s concurrent method on UrbanSound8K and the full FSD50K eval set, mixing each clip with urban acoustic scene noise across a range of SNRs. DAS improves the metric on every test condition: by +2.60 to +5.75 accuracy points on UrbanSound8K and +1.50 to +1.74 mAP points on FSD50K.
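The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact formula, so the drift direction (unit vector from the clean prompt embedding toward the noise-conditioned prompt embedding), the bonus weight `lam`, and all function names are assumptions made for this example.

```python
import math

def normalize(v):
    # Guard against a zero vector so the sketch never divides by zero.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def drift_directions(clean_text, noisy_text):
    """Per-class unit vector pointing from the clean prompt embedding toward
    the noise-conditioned prompt embedding. Uses text embeddings only, so it
    can be computed once and cached (assumed form of the drift direction)."""
    dirs = []
    for c, n in zip(clean_text, noisy_text):
        c, n = normalize(c), normalize(n)
        dirs.append(normalize([nj - cj for cj, nj in zip(c, n)]))
    return dirs

def das_scores(audio, clean_text, drift, lam=0.5):
    """Cosine score plus a per-class bonus: one extra inner product per class.
    `lam` is a hypothetical weight, not a value from the paper."""
    a = normalize(audio)
    return [dot(a, normalize(t)) + lam * dot(a, d)
            for t, d in zip(clean_text, drift)]

# Toy 3-D embeddings: class 0 drifts toward +z under noise, class 1 toward -z.
clean = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
noisy = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
drift = drift_directions(clean, noisy)
scores = das_scores([1.0, 1.0, 1.0], clean, drift)
```

In this toy case the plain cosine scores of the two classes tie, and the bonus breaks the tie in favour of the class whose predicted drift matches the audio embedding.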

2606.04836 2026-06-04 cs.CV 版本更新

3D Temporal Analysis for Autism Spectrum Disorder Screening During Attention Tasks

注意力任务期间自闭症谱系障碍筛查的3D时间分析

Inam Qadir, Elizabeth B Varghese, Dena Al-Thani, Marwa Qaraqe

Affiliations: College of Science and Engineering, Hamad Bin Khalifa University, Qatar Foundation, Doha, Qatar

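AI summary: Proposes a DECA-based 3D temporal analysis framework that extracts head pose and facial expression features and uses LSTM/GRU classifiers for ASD screening during VR-CPT tasks. Multimodal fusion reaches 84.6% accuracy.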

详情
AI中文摘要

对学龄儿童进行准确的自闭症谱系障碍(ASD)筛查对于识别早期可能遗漏的病例以及及时干预以支持社交、认知和学业发展至关重要。当前的ASD筛查依赖于主观评估和2D分析方法,无法捕捉ASD行为特征的空间位移模式。本研究提出了一种新颖的3D时间分析框架,该框架基于DECA(详细表情捕捉与动画)这一3D建模框架,用于提取全面的头部姿态参数(包括平移分量$T_x, T_y, T_z$)以及独立于姿态变化的面部表情。基于LSTM和GRU的时间分类器在从39名7-12岁参与者(19名ASD,20名TD)在虚拟现实-持续性能测试任务中收集的视频数据提取的3D特征上进行训练。GRU模型表现出优越性能,其中3D头部姿态特征达到83.9%的准确率,3D面部特征达到81.4%的准确率,分别比2D基线方法高出10.7%和7.5%。此外,通过PCA降维的3D头部姿态和面部特征的多模态融合达到了84.6%的最高准确率,优于单模态方法。这项工作为针对学龄人群ASD识别中当前诊断局限性的客观、自动化筛查工具奠定了基础。

英文摘要

Accurate Autism Spectrum Disorder (ASD) screening for school-age children is crucial to identify cases that may have been missed earlier and to enable timely interventions supporting social, cognitive, and academic development. Current ASD screening relies on subjective assessments and 2D analysis methods that fail to capture spatial displacement patterns characteristic of ASD behaviors. In this study, a novel 3D temporal analysis framework is presented, built on top of DECA (Detailed Expression Capture and Animation), a 3D modeling framework, to extract comprehensive head pose parameters (including translational components $T_x, T_y, T_z$) and facial expressions independent of pose variations. LSTM and GRU-based temporal classifiers were trained on the extracted 3D features from video data collected from 39 participants (19 ASD, 20 TD) aged 7-12 years during Virtual Reality-Continuous Performance Test tasks. The GRU-based models demonstrated superior performance, with 3D head pose features achieving 83.9% accuracy and 3D facial features reaching 81.4% accuracy, outperforming 2D baseline approaches by 10.7% and 7.5%, respectively. Furthermore, multimodal fusion of 3D head pose and facial features with PCA-based dimensionality reduction achieved the highest accuracy of 84.6%, outperforming unimodal approaches. This work establishes a foundation for objective, automated screening tools addressing current diagnostic limitations in ASD identification for school-age populations.

2606.04820 2026-06-04 cs.CV cs.AI cs.LG 版本更新

OA-CutMix: Correcting the Label Bias of CutMix

OA-CutMix:纠正CutMix的标签偏差

Tobias Christian Nauen, Stanislav Frolov, Federico Raue, Brian B. Moser, Andreas Dengel

Affiliations: RPTU University Kaiserslautern-Landau; German Research Center for Artificial Intelligence (DFKI)

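AI summary: To address the semantic bias caused by CutMix assigning labels by patch area, proposes OA-CutMix, which uses segmentation masks to assign labels according to visible object area and improves classification accuracy without changing the image mixing procedure.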

详情
AI中文摘要

CutMix已成为事实上的标准混合增强方法,但其标签分配基于一个有缺陷的假设:粘贴补丁的面积忠实地反映了其对混合图像的语义贡献。然而,在实践中,补丁经常落在背景区域,将标签信用分配给其目标不可见的类别。CutMix标签与语义目标面积的平均差异为21.5%。在17%的样本中,一张图像贡献了零个可见目标像素,却获得了非零的标签权重。我们提出目标感知CutMix(OA-CutMix),通过用从预计算分割掩码中导出的权重替换基于面积的CutMix权重来纠正这种偏差,根据每个图像贡献给混合图像的可见目标面积比例分配标签。图像混合过程完全保持不变。我们在4种架构和6个数据集上评估了OA-CutMix与10多种静态和动态混合方法的性能。OA-CutMix在所有任务中始终达到最高准确率,甚至优于动态混合方法,但训练时间成本仅为其一小部分。对于小目标,改进最大,因为CutMix的标签偏差最大。因此,纠正标签足以匹配或超过修改图像混合算法的方法的性能。

英文摘要

CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: that the area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label credit to classes whose objects are not visible. The mean discrepancy between the CutMix label and the semantic object area is 21.5%. In 17% of samples, an image contributes zero visible object pixels yet receives nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure is left entirely unchanged. We evaluate OA-CutMix against 10+ static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy over all tasks, outperforming even dynamic mixing methods at a fraction of their training-time cost. Improvements are largest for small objects, where the label bias from CutMix is greatest. Thus, correcting the label is sufficient to match or exceed the performance of methods that modify the image mixing algorithm.
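The difference between area-based and object-aware label weights can be illustrated with a small sketch. It assumes binary object masks given as nested lists, a half-open paste box `(top, left, bottom, right)`, and a patch pasted at the same location it was cut from. The function names and the fallback for the no-object case are choices made for this example, not details taken from the paper.

```python
def cutmix_weights(h, w, box):
    """Area-based CutMix label weights: (background image, pasted image)."""
    top, left, bottom, right = box  # half-open: rows top..bottom-1
    patch_area = (bottom - top) * (right - left)
    lam_paste = patch_area / (h * w)
    return 1.0 - lam_paste, lam_paste

def object_aware_weights(mask_bg, mask_paste, box):
    """Label weights proportional to the object pixels each image actually
    contributes to the mixed image. Masks are 0/1 grids of equal size."""
    top, left, bottom, right = box
    vis_bg = vis_paste = 0
    for r, (row_bg, row_paste) in enumerate(zip(mask_bg, mask_paste)):
        for c, (m_bg, m_paste) in enumerate(zip(row_bg, row_paste)):
            inside = top <= r < bottom and left <= c < right
            if inside:
                vis_paste += m_paste   # pasted image is visible inside the box
            else:
                vis_bg += m_bg         # background image is visible outside it
    total = vis_bg + vis_paste
    if total == 0:
        # No object visible anywhere: fall back to area weights (assumption).
        return cutmix_weights(len(mask_bg), len(mask_bg[0]), box)
    return vis_bg / total, vis_paste / total

# Both images have their object in the top-left 2x2 corner of a 4x4 grid,
# while the pasted patch covers the bottom-right 2x2 corner.
obj = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
box = (2, 2, 4, 4)
area_w = cutmix_weights(4, 4, box)            # (0.75, 0.25)
object_w = object_aware_weights(obj, obj, box)  # (1.0, 0.0)
```

Here the pasted patch contains only background, so the area-based rule still gives the pasted class a 0.25 label weight while the object-aware rule gives it zero. This is the zero-visible-pixel case the abstract describes.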

2606.04806 2026-06-04 cs.CV cs.AI 版本更新

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

NoRA: 评估视觉第一人称规范性动作推理中的基于事实的合理性

Sichao Li, Sai Ma, Daniel Kilov, Secil Yanik Guyot, Zhuang Li, Seth Lazar

Affiliations: The University of Sydney; Australian National University; RMIT University; Johns Hopkins University

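AI summary: Proposes the NoRA benchmark, which uses fact-reason-action support graphs to evaluate whether multimodal models can generate reasonable actions and ground their reasoning in visible facts. Finds that current VLMs fall short in constructing the full action space and in binding actions to the correct support.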

详情
AI中文摘要

LLM和智能系统越来越多地部署在社交环境中,使得规范能力对安全和适当行为至关重要。然而,现有方法要么仅在文本中评估规范性判断,要么将其简化为从固定候选动作集中选择。我们认为两者都不够。在实践中,智能体永远不会获得一个选项菜单;它们必须从头识别一个合理的动作,基于可见事实并由可检查的理由支持。我们引入了NoRA,一个视觉第一人称视频基准,要求模型生成候选的下一个动作,并通过显式的事实-理由-动作支持图来证明每个动作。该基准包含1,420个带注释的视频片段,包括HumanGold-190和LLMSilver-1230分割。每个实例通过动作对齐、事实基础和支持绑定进行评估,汇总为单一的基于事实的合理性分数。我们在直接、深思熟虑和结构化提示模式下对12个多模态系统进行了基准测试,发现当前的VLM经常能恢复合理的动作和相关的场景事实,但始终难以构建完整的合理动作空间并将所选动作绑定到正确的局部支持上。NoRA使这一差距可测量,将评估问题从模型是否能选择一个动作转变为是否能基于正确的可见理由证明一个适当的动作。

英文摘要

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

2606.04801 2026-06-04 cs.CV 版本更新

Fast Cubical Persistent Homology on 2D and 3D Images via Union-Find, Pruning, and Lookup Tables

基于并查集、剪枝和查找表的2D和3D图像快速立方体持久同调

Titouan Le Breton, Karol Szustakowski, Marie Piraud

Affiliations: Helmholtz AI; Helmholtz Munich; École des Ponts ParisTech; ENS Paris-Saclay; Institute of AI for Health, Helmholtz Munich

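AI summary: Proposes Flash Cubical, which uses union-find, edge pruning, and lookup tables to efficiently compute cubical persistence of 2D and 3D images under the V-filtration. The authors report it as the most efficient known implementation in both time and memory.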

详情
AI中文摘要

我们提出Flash Cubical,一种在$\mathbb{F}_2$上对2D和3D图像的V-过滤进行立方体持久性高效计算的方法。该实现基于三个核心思想。首先,立方体复形满足某些性质,允许通过并查集和对偶性计算最高维度的持久性。其次,对某些边进行剪枝可以实现快速高效的并查集。第三,使用查找表,利用立方体复形的规律性预计算局部信息,避免运行时计算局部信息。据我们所知,这是在V-过滤下最有效的立方体持久性实现,无论在时间还是内存成本上。尽管本文关注V-过滤立方体复形的持久性,但基本思想自然推广到立方体复形的T-过滤,并为其他复形提供了有希望的方向。

英文摘要

We present Flash Cubical, a highly efficient computation of cubical persistence on a V-filtration for 2D and 3D images over $\mathbb{F}_2$. The implementation is built around three core ideas. First, cubical complexes satisfy properties that allow the persistence of the highest dimension to be computed via union-find and duality. Second, pruning certain edges allows for a fast and efficient implementation of union-find. Third, a lookup table exploits the regularity of cubical complexes to pre-compute local information, avoiding the need to compute it at run time. To the best of our knowledge, this is the most efficient implementation of cubical persistence with a V-filtration, in terms of both time and memory costs. Although the paper focuses on persistence for V-filtration cubical complexes, the underlying ideas generalise naturally to T-filtrations on cubical complexes and suggest promising directions for other complexes.
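The role of union-find can be illustrated with the textbook dimension-0 case. The sketch below is not Flash Cubical itself: it omits the duality step, the edge pruning, and the lookup tables. It computes 0-dimensional sublevel-set persistence of a 2D image under the V-construction, where pixels are vertices, 4-neighbours are joined by edges, and components merge by the elder rule. The function name is hypothetical.

```python
def persistence_dim0(img):
    """0-dimensional sublevel-set persistence pairs of a 2D image.
    Pixels are vertices and 4-neighbours share an edge (V-construction)."""
    h, w = len(img), len(img[0])
    # Process pixels in order of increasing value.
    order = sorted((img[r][c], r, c) for r in range(h) for c in range(w))
    parent = {}   # union-find forest over pixels added so far
    birth = {}    # birth value stored at each component root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    pairs = []
    for val, r, c in order:
        p = (r, c)
        parent[p] = p
        birth[p] = val
        for q in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if q not in parent:
                continue  # outside the image or not yet in the filtration
            a, b = find(p), find(q)
            if a == b:
                continue
            # Elder rule: the older component (smaller birth) survives.
            if birth[a] > birth[b]:
                a, b = b, a
            if birth[b] < val:          # skip zero-persistence pairs
                pairs.append((birth[b], val))
            parent[b] = a
    # The component born at the global minimum never dies.
    pairs.append((order[0][0], float("inf")))
    return sorted(pairs)

diagram = persistence_dim0([[0, 5, 1]])
```

For the one-row image `[[0, 5, 1]]`, two components are born at values 0 and 1 and merge when the pixel with value 5 enters, so the younger one is paired as (1, 5) and the older one is essential.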

2606.04797 2026-06-04 cs.CV cs.LG 版本更新

Crafting Your Evolving Dreams: Concept-Incremental Versatile Customization

打造你不断演变的梦想:概念增量式多功能定制

Jiahua Dong, Wenqi Liang, Hongliu Li, Yang Cong, Duzhen Zhang, Hanbin Zhao, Henghui Ding, Yulun Zhang, Salman Khan, Fahad Shahbaz Khan

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; University of Trento; Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University; South China University of Technology; College of Computer Science and Technology, Zhejiang University; Institute of Big Data, Fudan University; Shanghai Jiao Tong University

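AI summary: Proposes the Continually Customizable Diffusion Model (CCDM), which mitigates catastrophic forgetting with an attribute-decoupled LoRA module and a relevance-guided aggregation strategy, and handles concept neglect with a controllable regional context synthesis strategy, enabling concept-incremental versatile customization.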

Comments Accepted to Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

详情
AI中文摘要

定制扩散模型(CDMs)因其生成个性化概念的卓越能力而引起了广泛关注。然而,大多数CDMs不切实际地假设用户的个性化概念集合是静态的,无法随时间增长。此外,在增量学习一系列新概念时,它们对先前学习的概念表现出显著的灾难性遗忘和概念忽视。为了解决上述挑战,我们开发了一种新颖的持续可定制扩散模型(CCDM),使用户能够进行概念增量式多功能定制。具体来说,我们设计了一个属性解耦LoRA(AD-LoRA)模块和一个相关性引导的AD-LoRA聚合策略,以缓解灾难性遗忘。它们可以保留每个任务的概念特定属性,并利用有益的任务间相关性来增强新定制任务的持续学习。此外,为了解决概念忽视的挑战,我们提出了一种可控区域上下文合成策略,该策略根据用户提供的条件进行多概念合成。该策略通过保证用户定义区域之间的语义独立性及其平滑边界过渡,增强了多概念合成的整体一致性。实验表明,我们的CCDM在基线方法上表现出显著改进。

英文摘要

Custom diffusion models (CDMs) have garnered significant interest owing to their remarkable capacity for generating personalized concepts. However, the majority of CDMs unrealistically presume that the user's collection of personalized concepts is static and incapable of incremental growth over time. Furthermore, they exhibit significant catastrophic forgetting and concept neglect of previously learned concepts when incrementally learning a sequence of new ones. To resolve the above challenges, we develop a novel Continually Customizable Diffusion Model (CCDM), enabling users to perform concept-incremental versatile customization. Specifically, we design an attribute-decoupled LoRA (AD-LoRA) module and a relevance-guided AD-LoRA aggregation strategy to mitigate catastrophic forgetting. They can preserve concept-specific attributes of each task and leverage beneficial inter-task correlations to enhance the continual learning of new customization tasks. Additionally, to address the challenge of concept neglect, we propose a controllable regional context synthesis strategy that performs multi-concept composition in alignment with user-provided conditions. This strategy enhances the overall consistency in multi-concept synthesis by guaranteeing semantic independence between user-defined regions and their smooth boundary transitions. Experiments show our CCDM exhibits significant improvements over baseline methods.

2606.04792 2026-06-04 cs.CV 版本更新

A Pathology Foundation Model for Gastric Cancer with Real-World Validation

用于胃癌的病理基础模型及真实世界验证

Ling Liang, Jiabo Ma, Zhengyu Zhang, Fengtao Zhou, Yingxue Xu, Yihui Wang, Cheng Jin, Zhengrui Guo, On Ki Tang, Zhijian Cen, Zhen Wang, Qi Xie, Chengyu Lu, Chenglong Zhao, Feifei Wang, Yu Cai, Hongyi Wang, Jing Zhang, Yaping Ye, Shijun Sun, Shenglei Li, Yu Wang, Zhenhui Li, Ronald Cheong Kin Chan, Xiuming Zhang, Zhe Wang, Hao Chen, Li Liang

Affiliations: Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China; Department of Pathology, Nanfang Hospital, Southern Medical University, Guangzhou, China; Department of Pathology, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China; Guangdong Province Key Laboratory of Molecular Tumor Pathology, Guangzhou, China; Department of Anatomical and Cellular Pathology, The Chinese University of Hong Kong, Hong Kong SAR, China; Pathology Artificial Intelligence Development and Assessment Laboratory, State Key Laboratory of Translational Oncology, The Chinese University of Hong Kong, Hong Kong SAR, China

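AI summary: Proposes GRACE, a gastric cancer-specific foundation model built on multicenter H&E-stained whole-slide images. It outperforms general-purpose PFMs on 28 clinical tasks, and prospective validation and a reader study support its value as a diagnostic aid.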

详情
AI中文摘要

胃癌仍然是癌症死亡的主要原因,但其组织学和分子异质性使诊断和风险分层复杂化。通用病理基础模型(PFM)在胃癌诊疗的关键细粒度终点上往往表现停滞,且很少有模型经过严格的前瞻性验证或临床读者研究。我们提出了GRACE,一个用于真实世界评估和临床决策支持的胃癌专用基础模型。GRACE基于来自37,493名患者的多中心胃癌病理数据集(共48,364张HE染色全切片图像)开发。在28项临床相关任务评估中,GRACE持续优于代表性泛癌PFM,达到宏观AUC 0.9188,在癌前病变诊断(宏观AUC 0.9322)、肿瘤组织病理评估(宏观AUC 0.9119)、分子分型(宏观AUC 0.8682)和预后预测方面表现强劲。除基准测试外,GRACE的转化价值通过严格的证据链得到证实。在安全门控标准(排除要求100%阴性预测值,纳入要求100%阳性预测值)下,GRACE简化了高达69.6%的恶性诊断病例的审查,并分流了46.8%的MMR-IHC随访请求。这种转化可行性通过病理学家-AI协作的随机交叉读者研究得到进一步加强。在GRACE辅助下,诊断准确率从82.0%提高到89.9%,正确诊断的校正优势比提高近两倍(OR 1.987),同时敏感性和特异性也得到提升。AI辅助还使诊断时间减少14.9%,诊断信心提高9.0%,并显著改善评估者间一致性。当校准至不劣于高级病理医生时,AI辅助工作流可分流60.7%的萎缩病例和82.7%的肠化生病例。

英文摘要

Gastric cancer remains a major cause of cancer mortality, yet its histological and molecular heterogeneity complicates diagnosis and risk stratification. General-purpose pathology foundation models (PFMs) often plateau on fine-grained endpoints central to gastric cancer care, and few have undergone rigorous prospective validation or clinical reader studies. We present GRACE, a Gastric-specific foundation model for Real-world Assessment and Clinical dEcision support. GRACE was developed from multicenter gastric pathology datasets totaling 48,364 primarily H&E-stained whole-slide images from 37,493 patients. When evaluated on 28 clinically relevant tasks, GRACE consistently outperformed representative pan-cancer PFMs, achieving a macro-AUC of 0.9188, with strong performance for precancerous lesion diagnosis (macro-AUC 0.9322), tumor histopathological assessment (macro-AUC 0.9119), molecular profiling (macro-AUC 0.8682), and prognostic prediction. Beyond benchmarking, GRACE's translational value was substantiated through a rigorous evidence chain. Under safety-gated criteria requiring 100% NPV for rule-out and 100% PPV for rule-in, GRACE streamlined review for up to 69.6% of malignancy-diagnosis cases and triaged 46.8% of MMR-IHC follow-up requests. This translational feasibility was further strengthened by a randomized crossover reader study of pathologist-AI collaboration. With GRACE assistance, diagnostic accuracy improved from 82.0% to 89.9%, yielding nearly twofold higher adjusted odds of a correct diagnosis (OR 1.987) alongside concurrent gains in sensitivity and specificity. AI assistance also reduced diagnostic time by 14.9%, raised diagnostic confidence by 9.0%, and markedly improved inter-rater agreement. When calibrated to maintain non-inferior performance to senior pathologists, the AI-assisted workflow could triage 60.7% of atrophy and 82.7% of intestinal metaplasia cases.
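As a quick sanity check on the reader-study figures, the crude (unadjusted) odds ratio implied by the two reported accuracies can be computed directly. The reported OR of 1.987 is an adjusted estimate, so it is not expected to match this crude value exactly. The helper names below are made up for this example.

```python
def odds(p):
    """Odds corresponding to a probability p in (0, 1)."""
    return p / (1.0 - p)

def odds_ratio(p_after, p_before):
    """Crude odds ratio between two proportions (no covariate adjustment)."""
    return odds(p_after) / odds(p_before)

# Reported diagnostic accuracy with and without assistance.
crude_or = odds_ratio(0.899, 0.820)
print(round(crude_or, 3))  # prints 1.954
```

The crude ratio of about 1.95 is close to the reported adjusted value, which is consistent with the "nearly twofold" wording in the abstract.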

2606.04788 2026-06-04 cs.CV cs.RO 版本更新

Z-FLoc: Zero-Shot Floorplan Localization via Geometric Primitives

Z-FLoc: 基于几何基元的零样本楼层平面定位

Ayumi Umemura, Toshinori Kuwahara, Marc Pollefeys, Daniel Barath

Affiliations: ETH Zurich; Tohoku University

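AI summary: Proposes a zero-shot floorplan localization method that extracts geometric primitives such as lines and circles from bird's-eye-view projections of monocular 3D reconstructions and robustly matches them to the floorplan, generalizing to new environments without retraining.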

详情
AI中文摘要

视觉定位——在预先存在的地图中估计相机姿态——是计算机视觉中的一个基本问题。楼层平面是一种有吸引力的地图表示:它们对于大多数建筑来说易于获取、紧凑,并且固有地不受视觉外观变化的影响。然而,弥合相机观测与楼层平面几何之间的严重领域差距仍然具有挑战性。现有方法通过数据驱动学习来解决这一差距,但它们需要大规模训练数据和特定环境的重新训练,限制了实际部署。我们提出了一种零样本楼层平面定位方法,无需任何重新训练即可泛化到新环境。我们的关键见解是,主导几何基元——直线和圆——在人造环境中无处不在,并提供外观不变的结构约束。我们从单目3D重建的鸟瞰图投影中提取这些基元,并通过鲁棒估计框架内的专用最小求解器将它们与楼层平面进行匹配。在模拟和真实数据集上的实验表明,我们的方法在未见过的环境上优于最先进的基于学习的方法,同时在所有实验中使用单一固定的超参数集。源代码将公开提供。

英文摘要

Visual localization -- estimating a camera pose within a pre-existing map -- is a fundamental problem in computer vision. Floorplans are an attractive map representation: they are readily available for most buildings, compact, and inherently invariant to visual appearance changes. However, bridging the severe domain gap between camera observations and floorplan geometry remains challenging. Existing methods address this gap through data-driven learning, yet they require large-scale training data and environment-specific retraining, limiting their practical deployment. We propose a zero-shot floorplan localization method that generalizes to novel environments without any retraining. Our key insight is that dominant geometric primitives -- lines and circles -- are ubiquitous in human-made environments and provide appearance-invariant structural constraints. We extract these primitives from a bird's-eye-view (BEV) projection of monocular 3D reconstructions and match them to the floorplan via dedicated minimal solvers within a robust estimation framework. Experiments on both simulated and real-world datasets show that our approach outperforms state-of-the-art learning-based methods on unseen environments, while using a single fixed set of hyperparameters across all experiments. The source code will be made publicly available.

2606.04775 2026-06-04 cs.LG cs.AI cs.CV cs.SY eess.SY math.OC 版本更新

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

通过降阶线性最优控制引导视频生成模型的激活

Jihoon Hong, Alice Chan, Qiyue Dai, Julian Skifstad, Glen Chou

Affiliations: Georgia Institute of Technology

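AI summary: Proposes the LA-LQR framework, which models text-to-video inference as a dynamical system and uses reduced-order optimal control for minimally invasive activation steering, reducing unsafe generations while preserving visual quality.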

详情
AI中文摘要

在大规模网络数据上训练的文本到视频(T2V)模型可能生成不良内容,这促使我们进行干预以减少有害输出而不牺牲视觉质量。激活引导提供了一种有吸引力的机制替代微调和提示过滤,但现有的T2V引导方法仍然有限,通常采用粗糙的、非预测性的干预,可能导致过度引导和内容退化。为了弥补这一差距,我们提出了潜在激活线性二次型调节器(LA-LQR),一种用于最小侵入性T2V引导的降阶最优控制框架。LA-LQR将T2V推理表述为一个动态系统,并计算闭环反馈干预,将激活引导向期望的特征设定点,同时惩罚不必要的扰动。为了使最优控制对高维视频激活可行,我们将激活投影到由对比提示对导出的低维、任务相关子空间,估计该潜在空间中的局部线性动力学,并求解潜在LQR问题以获得时间步和层特定的引导信号。我们提供了将潜在设定点跟踪与原始激活空间特征控制联系起来的理论界限,并实证验证了降阶潜在动力学的保真度。在概念引导和视频安全基准测试中,LA-LQR相对于基线减少了不安全生成,同时保持了提示保真度和视觉质量。

英文摘要

Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.
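The closed-loop idea behind a latent LQR can be shown on the smallest possible case. The sketch below assumes a one-dimensional latent deviation `z` from the desired setpoint with linear dynamics `z[t+1] = a*z[t] + b*u[t]`, and solves the finite-horizon problem by the standard backward Riccati recursion. The scalar reduction, the parameter values, and the function names are assumptions for illustration; the paper's projection step and dynamics estimation are not reproduced.

```python
def lqr_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for z[t+1] = a*z[t] + b*u[t] with cost
    sum(q*z^2 + r*u^2). Returns time-varying feedback gains K[0..horizon-1]."""
    p = q                      # terminal cost weight
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (r + b * b * p)       # optimal gain for this step
        p = q + a * a * p - a * b * p * k       # Riccati update
        gains.append(k)
    gains.reverse()            # recursion ran from the last step to the first
    return gains

def rollout(z0, a, b, gains):
    """Apply the feedback law u[t] = -K[t] * z[t] and record the trajectory."""
    z, traj = z0, [z0]
    for k in gains:
        u = -k * z
        z = a * z + b * u
        traj.append(z)
    return traj

# q penalizes deviation from the setpoint, r penalizes the intervention size.
gains = lqr_gains(a=1.0, b=1.0, q=1.0, r=1.0, horizon=20)
traj = rollout(1.0, 1.0, 1.0, gains)
```

Raising `r` relative to `q` makes the controller intervene less aggressively, which is the trade-off the abstract describes as steering toward a setpoint while penalizing unnecessary perturbations.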

2606.04773 2026-06-04 cs.CV cs.CL 版本更新

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

NextMotionQA: 使用视觉语言模型基准测试和评判人体运动理解

Yong Cao, Chuqiao Li, Xianghui Xie, Gerard Pons-Moll, Andreas Geiger

Affiliations: University of Tübingen; Tübingen AI Center; Max Planck Institute for Informatics; Saarland Informatics Campus

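AI summary: Proposes the NextMotionQA benchmark, which uses three complementary tasks and multi-level difficulty stratification to systematically evaluate vision-language models on human motion understanding, and reveals their limitations in fine-grained judging.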

Comments 23 pages, 8 figures, 9 tables

详情
AI中文摘要

人体运动理解的可靠评估对于推进具身人工智能、机器人和动画至关重要。然而,现有基准存在语义粒度粗糙、难度无区分、标注质量有限以及答案模糊等问题,无法诊断当前模型的失败之处。为弥补这一差距,我们引入NextMotionQA,这是一个全面的基准,利用视觉语言模型(VLM)进行半自动化、专家验证的数据集构建。NextMotionQA包含三项互补任务:多项选择题问答、视频字幕生成和细粒度错误纠正。每项任务沿三个核心语义轴系统组织,并分为三个任务复杂度级别。我们对十二个代表性VLM的广泛评估揭示了在传统单任务评估中不可见的关键能力差距和弱点。在互补方向上,近期工作开始使用VLM作为文本到运动评估的评判者;我们探究它们在更困难任务下是否表现出同样的退化。我们发现,VLM在粗粒度标准上与专家评分高度一致(Cohen's κ=0.70),但在细粒度、部件级评判上表现不佳(κ=0.10),验证了该范式在其强项领域的有效性,同时明确了其局限性。

英文摘要

Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models (VLMs) for semi-automated, expert-verified dataset construction. NextMotionQA features three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction. Each task is systematically structured across three core semantic axes and stratified into three task complexity levels. Our extensive evaluation of twelve representative VLMs uncovers critical capability gaps and weaknesses that remain invisible under conventional, single-task evaluations. In a complementary direction, recent work has begun using VLMs as judges for text-to-motion evaluation; we ask whether they show the same degradation under harder tasks. We find that VLMs align strongly with expert ratings on coarse criteria (Cohen's κ=0.70) but break down on fine-grained, part-level judgment (κ=0.10), validating the paradigm in its strong regime while clarifying its limits.
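The agreement figures above are Cohen's κ values, which correct raw agreement for the agreement expected by chance. A minimal sketch of the standard two-rater formula follows. The toy labels are invented for illustration and are not data from the benchmark.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical judgments from a VLM judge and a human expert.
vlm    = ["good", "good", "bad", "bad",  "good", "bad"]
expert = ["good", "good", "bad", "good", "good", "bad"]
kappa = cohens_kappa(vlm, expert)
```

In this toy example the raters agree on five of six items (observed agreement 0.833) against a chance level of 0.5, giving κ of about 0.67.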

2606.04772 2026-06-04 cs.CV cs.AI 版本更新

Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction

用于脑重建的基于顺序Mamba的粗到细层次架构

Hoang-Son Vo, Van-Hung Bui, Minh-Huy Mai-Duc, Tien-Dung Mai, Soo-Hyung Kim

Affiliations: Chonnam National University, Gwangju, Republic of Korea; Vietnam National University - Ho Chi Minh City, University of Science, Vietnam; Institute for Cybersecurity and Digital Technologies, Russia

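AI summary: Proposes CHASMBrain, a two-stage image-to-fMRI encoding framework based on a dual-stream Mamba design and a coarse-to-fine strategy. It outperforms baselines on the NSD dataset and reveals causal organizational properties of the visual cortex.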

详情
AI中文摘要

理解深度视觉表征与人类视觉系统之间的关系是计算神经科学中的一个基本挑战。尽管现代视觉模型在图像识别中取得了强劲性能,但它们与人类视觉皮层层次组织的对应关系仍是一个开放问题。在本研究中,我们提出了CHASMBrain,一种新颖的分层两阶段图像到fMRI编码框架。我们的架构利用双流Mamba设计,明确分离并处理全局语义标记和局部空间补丁,这一设计受视觉皮层功能组织的启发。采用粗到细策略:第一阶段预测去噪的ROI级激活,第二阶段使用Mamba-VAE将这些粗响应细化为全体素级预测。在自然场景数据集(NSD)上的实验表明,我们的方法达到了0.429的皮尔逊相关系数和0.261的均方误差,优于所有评估的基线,包括岭回归和DINOv2线性探针。除了预测性能,因果分支消融实验揭示了一种非对称特化:补丁流特定锁定于早期视觉皮层(视网膜拓扑区域),而CLS流为高阶区域提供更广泛的语义上下文——这种对应关系是因果性的,而不仅仅是相关性的。跨被试迁移实验进一步表明,学习到的骨干网络在个体间泛化良好,只需极少的个体适应,表明模型捕捉到了共享的、与主体无关的视觉表征。

英文摘要

Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. While modern vision models achieve strong performance in image recognition, their correspondence with the hierarchical organization of the human visual cortex remains an open question. In this study, we propose CHASMBrain, a novel hierarchical two-stage framework for image-to-fMRI encoding. Our architecture leverages a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches, motivated by the functional organization of the visual cortex. A coarse-to-fine strategy is employed: Stage 1 predicts denoised ROI-level activations, while Stage 2 refines these coarse responses into full voxel-level predictions using a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) demonstrate that our method achieves a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically locked to early visual cortex (retinotopic regions), while the CLS stream contributes broader semantic context to higher-order areas -- a correspondence that holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting the model captures a shared, subject-agnostic visual representation.

2606.04767 2026-06-04 cs.LG cs.CV 版本更新

Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms

通过Fisher信息度量模型鲁棒性:谱界、理论保证与实用算法

Chong Zhang, Xiang Li, Jia Wang, Qiufeng Wang, Xiaobo Jin

Affiliations: University of Science and Technology of China

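AI summary: Proposes an attack-agnostic robustness metric based on the spectral norm of the Fisher Information Matrix, derives closed-form spectral bounds for common architectures, and develops efficient estimation algorithms. Experiments show a strong correlation with adversarial vulnerability.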

Comments 35 pages, 1 figure

详情
AI中文摘要

深度神经网络的鲁棒性对于安全关键部署至关重要,但现有评估方法通常依赖于攻击且缺乏可解释性。我们提出了一种基于Fisher信息矩阵(FIM)谱范数的原则性、攻击无关的鲁棒性度量,该度量量化了模型输出分布对输入扰动的worst-case敏感性。理论上,我们证明了FIM等于输入Jacobian的方差,并推导了常见架构(包括VGG、ResNet、DenseNet和Transformer)的闭式谱界,提供了首个理论鲁棒性排名。为了实现可扩展的评估,我们开发了高效算法,包括幂迭代和基于Hutchinson的估计,支持白盒和黑盒设置。在多个数据集(包括CIFAR、ImageNet和医学图像)和多种架构上的大量实验表明,我们的度量与对抗脆弱性之间存在强相关性。我们的框架作为一种可解释的诊断工具,补充了基于攻击的评估,提供了对架构敏感性的洞察,并指导更鲁棒模型的设计。代码可在https://github.com/franz-chang/SRP/获取。

英文摘要

The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric based on the spectral norm of the Fisher Information Matrix (FIM), which quantifies the worst-case sensitivity of the model's output distribution to input perturbations. Theoretically, we establish that the FIM equals the variance of the input Jacobian and derive closed-form spectral bounds for common architectures, including VGG, ResNet, DenseNet, and Transformer, providing the first theoretical robustness ranking. To enable scalable evaluation, we develop efficient algorithms, including power iteration and Hutchinson-based estimation, that support both white-box and black-box settings. Extensive experiments across multiple datasets, including CIFAR, ImageNet, and medical images, and across multiple architectures show a strong correlation between our metric and adversarial vulnerability. Our framework serves as an interpretable diagnostic tool that complements attack-based evaluations, offering insights into architectural sensitivity and guiding the design of more robust models. Code is available at: https://github.com/franz-chang/SRP/.
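The power-iteration estimate mentioned above can be sketched without ever forming the matrix. The example assumes the Fisher Information Matrix is estimated as the average outer product of sampled score vectors `g` (gradients of the log-likelihood with respect to the input), so that `F v` is an average of `g * (g . v)`. The function names and toy vectors are assumptions for illustration, not the authors' code.

```python
import math

def fim_matvec(scores, v):
    """Compute F v for F = mean(g g^T) without materializing F."""
    out = [0.0] * len(v)
    for g in scores:
        coeff = sum(gi * vi for gi, vi in zip(g, v))  # g . v
        for i, gi in enumerate(g):
            out[i] += coeff * gi
    return [x / len(scores) for x in out]

def spectral_norm(scores, iters=100):
    """Largest eigenvalue of the positive semidefinite matrix F,
    estimated by power iteration from a uniform starting vector."""
    dim = len(scores[0])
    v = [1.0 / math.sqrt(dim)] * dim
    lam = 0.0
    for _ in range(iters):
        w = fim_matvec(scores, v)
        lam = math.sqrt(sum(x * x for x in w))  # ||F v|| with ||v|| = 1
        if lam == 0.0:
            return 0.0
        v = [x / lam for x in w]
    return lam

# Toy score vectors whose average outer product is diag(0.5, 0.125).
toy_scores = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.5], [0.0, -0.5]]
estimate = spectral_norm(toy_scores)
```

Each iteration costs one pass over the sampled score vectors, which is what makes this kind of estimate practical when the input dimension is too large to store the full matrix.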

2606.04764 2026-06-04 cs.CV 版本更新

Do Foundation Models See Biology? Evaluating Attention Coherence with Spatial Transcriptomics in Glioblastoma

基础模型是否理解生物学?利用空间转录组学评估胶质母细胞瘤中的注意力一致性

Dilakshan Srikanthan, Amoon Jamzad, Paul Wilson, Nooshin Maghsoodi, Robert Policelli, Gabor Fichtinger, John F. Rudan, Parvin Mousavi

发表机构 * Translational Medicine, School of Medicine, Queen’s University, Kingston, ON, Canada(转化医学、医学院、皇后大学、金斯顿,ON,加拿大) School of Computing, Queen’s University, Kingston, ON, Canada(计算学院、皇后大学、金斯顿,ON,加拿大) Department of Surgery, Queen’s University, Kingston, ON, Canada(外科部门、皇后大学、金斯顿,ON,加拿大)

AI总结 提出基于空间转录组学的框架,客观评估病理基础模型注意力图与生物学的一致性,发现注意力捕捉多基因转录程序而非单个分子事件。

详情
AI中文摘要

病理基础模型的注意力图是否捕捉真实的生物学仍未知,但这一问题对临床信任和监管批准至关重要。我们提出一个基于空间转录组学的框架,用于无假设的注意力正交评估,并将其应用于五个病理基础模型(CONCH v1.5、UNI v2、Virchow2、GigaPath、H-Optimus-1)和一个ResNet50基线。使用基于注意力的多实例学习,我们训练单任务和多任务模型预测胶质母细胞瘤中的五种分子改变(CPTAC队列),在独立TCGA队列上验证,并使用来自18个样本的共配准Visium空间转录组数据评估注意力图与87个转录特征之间的生物学一致性。内部结果显示,没有单一编码器在所有任务中占优,外部验证则颠倒了内部性能排名。注意力图显示从通路(Cohen's d=0.329)到单个基因(d=0.055)的五倍富集梯度,表明注意力捕捉的是涌现的多基因转录程序而非单个分子事件。空间平滑的注意力图并不意味生物学一致性,不同编码器关注不同的生物学区室。我们的框架提供了对基础模型从组织病理学中学到内容的客观定量评估,推动该领域超越定性显著性图审查。

英文摘要

Whether attention maps from pathology foundation models capture genuine biology remains unknown, yet this question is critical for clinical trust and regulatory approval. We propose a spatial transcriptomics-based framework for orthogonal, hypothesis-free evaluation of attention and apply it to five pathology foundation models (CONCH v1.5, UNI v2, Virchow2, GigaPath, H-Optimus-1) and a ResNet50 baseline. Using attention-based multiple instance learning, we train single-task and multi-task models to predict five molecular alterations in glioblastoma on the CPTAC cohort, validate on an independent TCGA cohort, and evaluate biological coherence of attention maps against 87 transcriptional signatures using co-registered Visium spatial transcriptomics data from 18 samples. Internally, no single encoder dominates across all tasks, and external validation inverts internal performance rankings. Attention maps show a five-fold enrichment gradient from pathways (Cohen's d=0.329) to individual genes (d=0.055), indicating that attention captures emergent multi-gene transcriptional programs rather than individual molecular events. Spatially smooth attention maps do not imply biological coherence, and different encoders attend to distinct biological compartments. Our framework provides objective, quantitative assessment of what foundation models learn from histopathology, moving the field beyond qualitative saliency map review.
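The effect sizes quoted above are Cohen's d values. A minimal sketch of the standard pooled-standard-deviation form follows, assuming two groups of per-spot signature scores (for example, high-attention versus low-attention spots); the grouping and the function name `cohens_d` are illustrative assumptions, and the paper may use a different variant.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    va = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    vb = statistics.variance(group_b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical signature scores in high- and low-attention regions.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```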

2606.04737 2026-06-04 cs.CV 版本更新

Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment

基于物理信息的视频生成:通过混合专家潜在对齐

Cong Wang, Hanxin Zhu, Jiayi Luo, Yonglin Tian, Xiaoqian Cheng, Peiyan Tu, Xin Jin, Long Chen, Zhibo Chen

发表机构 * CASIA(中国科学院自动化研究所) UCAS(中国科学院大学) ZGCA USTC(中国科学技术大学) BUAA(北京航空航天大学) ZJU(浙江大学) EIT(欧洲工业技术学院)

AI总结 提出PILA框架,通过混合专家潜在对齐将物理结构化潜在引导注入预训练视频模型的冻结流匹配动力学,以提升生成视频的物理合理性。

详情
AI中文摘要

大规模视频生成模型在语义一致性和视觉质量方面取得了显著进展,生成的视频越来越连贯且视觉上令人信服。然而,由像素级拟合引发的动态过程自然无法适应支配真实世界运动和交互的规律性,导致在物理合理性方面持续存在不足。为解决这一局限,我们提出了PILA(物理信息潜在对齐),一个将物理结构化的潜在引导注入预训练视频模型冻结流匹配动力学的框架。具体而言,PILA首先采用锚定场估计,将冻结生成器的潜在变量映射到一个由场代理槽组织的可操作物理属性库中,利用可观测运动作为运动学锚点来构建较难直接观测的代理。为处理真实世界动态的异质性,PILA采用基于物理类别的混合专家设计。标签先验掩码专家路由选择特定类别的算子专家,其精炼结果通过从物理关系中抽象出的操作残差进行正则化。最后,精炼后的代理被融合回物理属性库,并解码为流匹配向量场的修正,从而在保持预训练骨干网络视觉先验的同时注入物理感知引导。通过在Wan 2.1-1.3B上进行分阶段适配器训练,并将学到的适配器直接迁移到Wan 2.2-14B,PILA在VBench-2.0、VideoPhy-2和PhyGenBench上,在视觉质量和基准测量的物理合理性方面均达到了最先进的结果。

英文摘要

Large-scale video generation models have made remarkable progress in semantic consistency and visual quality, producing videos that are increasingly coherent and visually convincing. Nevertheless, the dynamics induced by pixel-level fitting do not naturally accommodate the regularities that govern real-world motion and interaction, resulting in persistent shortcomings in physical plausibility. To address this limitation, we propose PILA (Physics-Informed Latent Alignment), a framework that injects physics-structured latent guidance into the frozen flow-matching dynamics of pretrained video models. Specifically, PILA first employs anchored field estimation to map frozen-generator latents into an operational physical attribute bank organized by field-proxy slots, using observable motion as a kinematic anchor for constructing less directly observed proxies. To handle the heterogeneity of real-world dynamics, PILA adopts a mixture-of-experts design over physical categories. Label-prior masked expert routing selects category-specific operator experts, whose refinements are regularized by operational residuals abstracted from physical relations. Finally, the refined proxies are fused into the physical attribute bank and decoded into a correction to the flow-matching vector field, injecting physics-aware guidance while preserving the visual prior of the pretrained backbone. With staged adapter training on Wan 2.1-1.3B and direct transfer of the learned adapter to Wan 2.2-14B, PILA achieves state-of-the-art results on VBench-2.0, VideoPhy-2, and PhyGenBench in both visual quality and benchmark-measured physical plausibility.

2606.04722 2026-06-04 cs.CV 版本更新

StrokeTimer: Robust Representation Learning for Ischemic Stroke Onset-Time Estimation from Non-contrast CT

StrokeTimer: 基于非增强CT的缺血性卒中发病时间估计的鲁棒表示学习

Weiru Wang, Susanne G. H. Olthuis, Elizaveta Lavrova, Robert J. van Oostenbrugge, Charles B. L. M. Majoie, Wim H. van Zwam, Ruisheng Su

发表机构 * Department of Biomedical Engineering, Eindhoven University of Technology(埃因霍温理工大学生物医学工程系) Graduate School of Life Sciences, Utrecht University(乌得勒支大学生命科学研究生院) Department of Neurology, Maastricht University Medical Centre+(马斯特里赫特大学医学中心神经科) Precision Medicine Department, GROW Research Institute for Oncology and Reproduction, Maastricht University(马斯特里赫特大学精准医学部,GROW肿瘤与生殖医学研究所) Department of Radiology and Nuclear Medicine, Amsterdam University Medical Centre(阿姆斯特丹大学医学中心放射学与核医学系) Department of Radiology and Nuclear Medicine, Maastricht University Medical Centre+(马斯特里赫特大学医学中心放射学与核医学系)

AI总结 提出StrokeTimer框架,通过自监督解耦学习和能量引导对比学习,从非增强CT中估计缺血性卒中发病时间,在大型多中心数据集上实现宏AUC 0.69和宏F1 0.57,较基线提升近50%。

Comments Early accepted at MICCAI 2026

详情
AI中文摘要

缺血性卒中是一种主要的全球性疾病。治疗决策高度时间敏感,因为再灌注治疗的资格取决于卒中发病与干预之间的时间间隔。然而,在临床实践中,真实的发病时间往往不确定,因此需要基于影像的组织年龄评估作为替代标志物。常规非增强CT(NCCT)上的早期缺血性改变通常很细微,而真实世界的临床数据集表现出显著的发病时间类别不平衡和中心-扫描仪相关的异质性。在这项工作中,我们提出了StrokeTimer,一个用于急性缺血性卒中发病时间估计的全自动框架。StrokeTimer整合了自监督解耦学习和能量引导对比学习,以捕捉细微的缺血模式,同时解决采集变异下的长尾数据分布。发病时间被分为三个临床相关窗口:<4.5小时、4.5-6小时和>6小时。在两个国家队列(MR CLEAN Registry和MR CLEAN LATE)的大型多中心NCCT数据集上的实验结果表明,StrokeTimer实现了宏AUC 0.69和宏F1分数0.57,比最强基线提高了近50%(p < 0.005)。在这个现实且具有挑战性的设置中,代表性基线方法表现出接近随机的宏性能。模型解释进一步突出了与已建立的放射学生物标志物一致的细微灰白质模糊和低密度区域。这些发现证明了StrokeTimer在支持急性缺血性卒中治疗决策方面的潜力。代码可在https://github.com/BrainVas/StrokeTimer获取。

英文摘要

Ischemic stroke is a major global disease. Treatment decisions are highly time-sensitive, as eligibility for reperfusion therapies relies on the interval between stroke onset and intervention. However, the true onset time is often uncertain in clinical practice, necessitating imaging-based assessment of tissue age as a surrogate marker. Early ischemic changes on routinely acquired non-contrast CT (NCCT) are often subtle, and real-world clinical datasets exhibit pronounced onset-time class imbalance and center-scanner-related heterogeneity. In this work, we propose StrokeTimer, a fully automated framework for onset-time estimation in acute ischemic stroke. StrokeTimer integrates self-supervised disentanglement learning with energy-guided contrastive learning to capture subtle ischemic patterns while addressing long-tailed data distributions under acquisition variability. Onset time is categorized into three clinically relevant windows: <4.5 h, 4.5-6 h, and >6 h. Experimental results on a large multi-center NCCT dataset from two national cohorts, MR CLEAN Registry and MR CLEAN LATE, show that StrokeTimer achieves a macro AUC of 0.69 and a macro F1-score of 0.57, improving the strongest baseline by nearly 50% (p < 0.005). In this realistic, challenging setting, representative baseline approaches exhibit near-chance macro performance. Model explanations further highlight subtle gray-white matter blurring and hypodense regions consistent with established radiological biomarkers. These findings demonstrate the potential of StrokeTimer to support treatment decision-making in acute ischemic stroke. Code is available at https://github.com/BrainVas/StrokeTimer.
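To make the evaluation setup concrete, the sketch below bins onset times into the three windows named in the abstract and computes a macro F1-score by hand. How the exact boundaries at 4.5 h and 6 h are assigned is an assumption here, as are the function names `onset_window` and `macro_f1`; none of this is taken from the StrokeTimer code.

```python
def onset_window(hours):
    """Map onset-to-scan time in hours to a class: 0 (<4.5 h), 1 (4.5-6 h), 2 (>6 h)."""
    if hours < 4.5:
        return 0
    if hours <= 6.0:  # assumption: both boundary values fall into the middle window
        return 1
    return 2

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for six scans.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred, labels=[0, 1, 2])
```

Macro averaging is the natural choice under the class imbalance the abstract describes, because a model that ignores a minority window is penalized by that window's F1 of zero.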

2606.04710 2026-06-04 cs.CV 版本更新

Data Efficient Complex Feature Fusion Network For Hyperspectral Image Classification

数据高效复杂特征融合网络用于高光谱图像分类

Maitreya Shelare, Atharva Satam, Poonam Sonar, Sneha Burnase

发表机构 * Department of Electronics and Telecommunication, Rajiv Gandhi Institute of Technology, University of Mumbai(电子与电信系,拉吉夫甘地技术学院,孟买大学)

AI总结 提出一种数据高效的注意力双支路复杂特征融合网络(DE-CFFN),通过因子分析降维和3D卷积层滤波器数量减半来减少模型复杂度,同时保持与CFFN相当的分类性能。

Comments 10 pages, 3 figures

详情
Journal ref
In Proceedings of International Conference on Wireless Communication (ICWiCOM 2025), Lecture Notes in Electrical Engineering, vol. 1499, Springer, 2025
AI中文摘要

本工作提出了一种数据高效的基于注意力的双支路复杂特征融合网络(CFFN)变体,用于高光谱图像分类。所提出的模型称为DE-CFFN,保留了原始的双流结构:实值神经网络(RVNN)处理标准高光谱图像块,而复值神经网络(CVNN)处理其傅里叶变换后的对应物。本工作的主要贡献在于特征提取过程和架构增强。使用因子分析进行降维,相比主成分分析提供了更好的潜在特征表示。此外,RVNN和CVNN流均通过将3D卷积层中的滤波器数量逐次减半来减少复杂度。两个分支的输出被拼接并通过一个挤压激励(SE)块以增强联合特征表示。在Pavia University和Salinas数据集上的评估表明,DE-CFFN实现了与CFFN相当的分类性能,同时显著减小了模型大小、内存消耗和推理延迟,使其适用于实时高光谱成像应用。

英文摘要

This work presents a data-efficient variant of the Attention-Based Dual-Branch Complex Feature Fusion Network (CFFN) for hyperspectral image classification. The proposed model, termed DE-CFFN, retains the original two-stream structure: the Real-Valued Neural Network (RVNN) processes standard hyperspectral patches, while the Complex-Valued Neural Network (CVNN) handles their Fourier-transformed counterparts. The main contribution of this work lies in the feature extraction process and architectural enhancement. Factor Analysis is used for dimensionality reduction, offering improved latent feature representation over Principal Component Analysis. Additionally, both the RVNN and CVNN streams are structurally modified by successively halving the number of filters in the 3D convolutional layers to reduce complexity. The outputs of both branches are concatenated and passed through a Squeeze and Excitation (SE) block to enhance joint feature representation. Evaluated on the Pavia University and Salinas datasets, DE-CFFN achieves classification performance comparable to CFFN, while significantly reducing model size, memory consumption, and inference latency, making it suitable for real-time hyperspectral imaging applications.

2606.04706 2026-06-04 cs.CV 版本更新

ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection

ReConFuse: 重建误差引导的语义融合用于AI生成视频检测

Xiaojing Chen, Xinyu Lu, Changtao Miao, Yunfeng Diao

发表机构 * Anhui University(安徽大学) Ant Group(蚂蚁集团) Hefei University of Technology(合肥工业大学)

AI总结 提出ReConFuse框架,利用预训练WF-VAE的重建误差作为鉴别线索,结合多帧语义特征和Mamba时序建模,实现AI生成视频的鲁棒检测。

详情
AI中文摘要

AI生成的视频变得越来越逼真,引发了关于错误信息、内容真实性和媒体信任的严重担忧。因此,可靠的AI生成视频检测对于多媒体取证至关重要,但由于需要捕捉空间伪影、时间动态并泛化到不断演变的生成模型,这仍然具有挑战性。在本文中,我们探索重建误差作为AI生成视频检测的判别性取证线索。通过使用预训练的WF-VAE重建输入视频,我们观察到真实视频和生成视频表现出可区分的逐帧重建误差模式,表明重建误差可以揭示它们的分布差异。然而,将基于重建的图像检测扩展到视频并非易事,因为视频重建误差在帧间具有时间组织性,并且需要语义上下文才能有效解释。为了应对这些挑战,我们提出了ReConFuse,一个用于视频级AI生成视频检测的重建引导语义融合框架。ReConFuse从WF-VAE重建的视频中提取重建误差线索,将其与多帧语义特征对齐,并使用基于Mamba的模块对时间演化进行建模以进行视频级分类。在多个生成器和评估设置上的实验证明了ReConFuse的有效性和强大的泛化能力。

英文摘要

AI-generated videos are becoming increasingly realistic, raising serious concerns about misinformation, content authenticity, and media trust. Reliable AI-generated video detection is therefore essential for multimedia forensics, yet remains challenging due to the need to capture spatial artifacts, temporal dynamics, and generalize to evolving generative models. In this paper, we explore reconstruction error as a discriminative forensic cue for AI-generated video detection. By reconstructing input videos with a pretrained WF-VAE, we observe that real and generated videos exhibit distinguishable frame-wise reconstruction error patterns, suggesting that reconstruction errors can reveal their distributional discrepancies. However, extending reconstruction-based image detection to videos is non-trivial, since video reconstruction errors are temporally organized across frames and require semantic context for effective interpretation. To address these challenges, we propose ReConFuse, a reconstruction-guided semantic fusion framework for video-level AI-generated video detection. ReConFuse extracts reconstruction error cues from WF-VAE reconstructed videos, aligns them with multi-frame semantic features, and uses a Mamba-based module to model temporal evolution for video-level classification. Experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse.

2606.04705 2026-06-04 cs.CV cs.AI 版本更新

Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

通过轻量级框预测器增强 MedSAM 用于医学图像分割

Amirhossein Movahedisefat, Amirreza Fateh, Mohammad Reza Mohammadi

发表机构 * School of Computer Engineering, Iran University of Science and Technology (IUST)(伊朗科学技术大学计算机工程学院)

AI总结 提出一种集成轻量级框预测器的 MedSAM 增强框架,通过单次点击估计边界框以提升点提示的空间引导能力,在仅增加 1.6M 参数下显著提高多模态医学图像分割的准确性和鲁棒性。

详情
AI中文摘要

医学图像中的语义分割是一项关键但具有挑战性的任务,原因是数据稀缺和跨模态的高变异性。虽然像 Segment Anything Model (SAM) 这样的基础模型显示出潜力,但它们在没有特定适应的情况下往往难以处理医学图像。此外,点提示尽管是最自然的用户交互形式,但为可靠分割提供的空间上下文不足,特别是当目标结构不规则或对比度差时。在本文中,我们提出了一种增强的分割框架,将轻量级框预测器模块集成到 MedSAM 架构中。框预测器通过使用局部图像嵌入特征从单次用户点击估计近似边界框,提供空间引导以减少点提示的模糊性,同时仅引入 1.6M 额外参数和可忽略的推理开销。我们引入了一个两阶段训练流程,其中框预测器在集成到 MedSAM 之前独立训练。为了验证我们方法的泛化能力,我们在四个不同的数据集(FLARE22、BRISC、BUSI、LungSegDB)上进行了广泛评估,这些数据集涵盖不同的成像模态,包括 CT、MRI 和超声。我们的方法在不同解剖结构和成像领域中提高了分割准确性和鲁棒性,在 BUSI、FLARE22、BRISC 和 LungSegDB 上分别达到了 0.89、0.93、0.88 和 0.98 的 Dice 分数。代码可在 https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor 获取。

英文摘要

Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor
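The Dice scores reported above measure overlap between a predicted mask and the ground truth. A minimal sketch on flattened binary masks follows; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions for illustration.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * inter / total

# Hypothetical 4-pixel masks sharing one foreground pixel.
overlap = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```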

2606.04701 2026-06-04 cs.CV cs.CL 版本更新

Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

在短视频平台上对原生动态屏幕GUI代理的基准测试

Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo

发表机构 * Beijing Institute of Technology(北京理工大学) Tsinghua University(清华大学) Beihang University(北航)

AI总结 针对短视频平台等动态屏幕环境,提出LivingScreen基准测试,通过三级任务套件和联合评估准确性与信息效率的指标,发现现有GUI代理存在观察过度或不足的问题。

Comments preprint

详情
AI中文摘要

当前的GUI代理假设屏幕是静态的,即两次动作之间世界是冻结的。然而,诸如短视频应用之类的真实界面违反了这一假设,因为其内容持续播放,一个称职的用户必须决定观看什么以及观看多长时间。我们将此任务形式化为原生动态屏幕GUI代理,并引入LivingScreen——首个在短视频平台上实例化该任务的基准测试,它包含一个基于浏览器的忠实环境、三级任务套件以及联合评估准确性和信息效率的指标。评估广泛的前沿模型后,我们发现没有一个模型能达到人类的成本-准确率性能,并且它们的主要失败模式是过度观察和观察不足,这表明观察控制是未来GUI代理缺失的能力轴。所有数据和代码将在https://github.com/BITHLP/LivingScreen上提供。

英文摘要

GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short-video applications violate this assumption: their content keeps playing, and a competent user must decide what to watch and for how long. We formalize this task as Living-Screen-Native GUI agents and introduce LivingScreen, the first benchmark instantiating it on short-video platforms, with a faithful browser-based environment, a three-tier task suite, and metrics that jointly score accuracy and information efficiency. Evaluating a wide range of frontier models, we find that none reaches human cost-accuracy performance, and that their dominant failure modes are over- and under-observation, pointing to observation control as a missing capability axis for future GUI agents. All data and code will be available at https://github.com/BITHLP/LivingScreen.

2606.04700 2026-06-04 cs.CV 版本更新

A New Angle on Bones: Robust Pose Estimation in X-Ray and Ultrasound

骨骼的新视角:X射线和超声中的鲁棒姿态估计

Ron Keuth, Christoph Großbröhmer, Franziska Halm, Miriam Johann, Anne-Nele Schröder, Ludger Tüshaus, Mattias P. Heinrich, Lasse Hansen

发表机构 * Medical Informatics, University of Lübeck(吕贝克大学医学信息学系) Institut of Radiology and Nuclear Medicine, University Hospital Schleswig-Holstein(石勒苏益格-荷尔斯泰因大学医院放射学与核医学研究所) Paediatric Surgery, University Hospital Schleswig-Holstein(石勒苏益格-荷尔斯泰因大学医院小儿外科) EchoScout GmbH

AI总结 提出基于学习的关键点候选和鲁棒线模型(RANSAC、霍夫变换)的自动骨骼姿态估计方法,在儿科骨折和髋关节发育不良评估中达到临床可接受的误差并优于地标方法。

Comments Code and annotations for fracture angle assessment in radiographs: https://github.com/multimodallearning/RobustBonePoseEstimation

详情
AI中文摘要

测量骨骼结构之间的角度是医学图像分析中的常规任务,为诊断和治疗规划提供关键的定量参数。自动化方法可以减少时间和成本,同时提高可重复性。在这项工作中,我们通过基于学习的关键点候选提议,随后使用线模型提取轴参数,来解决自动骨骼姿态估计问题。由于传统线模型如最小二乘法对异常值敏感,我们结合了假阳性减少策略和鲁棒拟合技术,如RANSAC和霍夫变换,以提高鲁棒性。我们在三个临床相关的儿科角度估计任务上评估了我们的方法:X射线和超声中的骨折碎片评估,以及使用Graf方法的超声中髋关节发育不良评估。我们的方法分别实现了$4.1^\circ$、$5.4^\circ$和$5.51^\circ$的平均误差,不仅保持在预期的临床观察者变异范围内,而且显著优于基于地标的方法。我们的代码和用于X射线骨折角度评估的注释已在GitHub上公开。

英文摘要

Measuring the angle between bone structures is a routine task in medical image analysis and provides a key quantitative parameter for diagnosis and treatment planning. Automated methods can reduce time and cost while improving reproducibility. In this work, we address automatic bone pose estimation using a learning-based point candidate proposal followed by a line model to extract axis parameters. Since conventional line models such as least squares are sensitive to outliers, we incorporate false-positive reduction strategies and robust fitting techniques, such as RANSAC and Hough transforms, to improve robustness. We evaluate our method on three clinically relevant paediatric angle estimation tasks: fracture fragment assessment in radiographs and ultrasound and developmental dysplasia of the hip evaluation in ultrasound using the Graf method. Our approach achieves mean errors of $4.1^\circ$, $5.4^\circ$, and $5.51^\circ$, respectively, not only remaining within the expected clinical observer variability, but also significantly outperforming landmark-based methods. Our code and annotations for fracture angle assessment in radiographs are publicly available on GitHub.
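The robust-fitting step can be sketched as a small RANSAC loop: sample two candidate points, count how many points lie close to the line through them, and keep the line with the most inliers. The point sets, tolerance, iteration count, and function names below are illustrative assumptions; the authors' repository should be consulted for the actual pipeline, including the learned point proposals and the Hough-transform variant.

```python
import math
import random

def ransac_axis_angle(points, iters=200, tol=0.1, seed=0):
    """Angle in degrees, in [0, 180), of the line supported by the most inliers."""
    rng = random.Random(seed)
    best_count, best_angle = -1, None
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        length = math.hypot(x2 - x1, y2 - y1)
        if length == 0.0:
            continue  # duplicate points define no line
        ux, uy = (x2 - x1) / length, (y2 - y1) / length
        # Perpendicular distance of each point to the candidate line.
        count = sum(1 for (x, y) in points if abs((x - x1) * uy - (y - y1) * ux) <= tol)
        if count > best_count:
            best_count = count
            best_angle = math.degrees(math.atan2(uy, ux)) % 180.0
    return best_angle

def angle_between(a_deg, b_deg):
    """Smallest angle between two undirected axes."""
    diff = abs(a_deg - b_deg) % 180.0
    return min(diff, 180.0 - diff)

# Hypothetical keypoint candidates for two bone fragments, each with outliers.
fragment_a = [(float(i), float(i)) for i in range(10)] + [(0.0, 5.0), (3.0, -4.0), (7.0, 1.0)]
fragment_b = [(float(i), 2.0) for i in range(10)] + [(1.0, 9.0), (4.0, -3.0)]
angle_a = ransac_axis_angle(fragment_a)
angle_b = ransac_axis_angle(fragment_b)
fracture_angle = angle_between(angle_a, angle_b)
```

A least-squares fit on the same points would be pulled toward the outliers, which is the sensitivity the abstract cites as the reason for robust fitting.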

2606.04699 2026-06-04 cs.LG cs.AI cs.CV 版本更新

Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification

基于图引导的广义特征值近端支持向量机中的Universum学习用于阿尔茨海默病分类

Yogesh Kumar, Vrushank Ahire, Mudasir Ganaie

发表机构 * Dept. of Computer Science and Engineering, IIT Ropar, Punjab 140001, India(计算机科学与工程系,IIT罗帕尔,旁遮普140001,印度)

AI总结 针对阿尔茨海默病分类,提出两种图引导的Universum学习模型UG-GEPSVM和IUG-GEPSVM,利用轻度认知障碍样本构建图拉普拉斯正则化,替代传统独立惩罚项,在ADNI MRI数据集上取得更优性能。

详情
AI中文摘要

早期准确检测阿尔茨海默病(AD)对于及时干预和疾病管理至关重要。广义特征值近端支持向量机(GEPSVM)及其基于Universum的变体在AD分类中显示出有希望的结果。然而,现有方法将Universum样本视为独立点,未考虑它们之间的几何关系。本文提出了两种图引导的Universum学习模型,即UG-GEPSVM和IUG-GEPSVM,用于使用结构MRI数据进行AD与认知正常(CN)分类。在所提出的框架中,轻度认知障碍(MCI)受试者被用作Universum数据,以提供AD和CN类别之间的中间信息。使用高斯相似性、最小生成树连通性和多跳传播在Universum样本上构建图。从该图中导出拉普拉斯矩阵,捕获MCI样本的几何结构。这种基于拉普拉斯的正则化被纳入学习过程,以替代传统的独立Universum惩罚项。UG-GEPSVM将此正则化集成到广义特征值公式中,而IUG-GEPSVM使用标准特征值公式扩展了数值稳定的改进GEPSVM框架。在ADNI MRI数据集变体上使用ICA和PCA特征在五个不同噪声水平下的实验表明,两种提出的模型始终优于现有的GEPSVM和基于Universum的方法。UG-GEPSVM实现了88.07%的最高平均AUC,并在增加的噪声水平下保持稳定的性能。统计检验进一步证实了观察到的改进的显著性。

英文摘要

Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and do not consider the geometric relationships among them. This paper proposes two graph-guided Universum learning models, namely UG-GEPSVM and IUG-GEPSVM, for AD versus cognitively normal (CN) classification using structural MRI data. In the proposed framework, mild cognitive impairment (MCI) subjects are used as Universum data to provide intermediate information between AD and CN classes. A graph is constructed over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization is incorporated into the learning process in place of the conventional independent Universum penalty term. UG-GEPSVM integrates this regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework using a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants using ICA- and PCA-based features at five different noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC of 88.07% and maintains stable performance under increasing noise levels. Statistical tests further confirm the significance of the observed improvements.
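The Laplacian regularizer described above can be illustrated with the simplest ingredient, a fully connected Gaussian-similarity graph. The sketch below omits the Minimum Spanning Tree connectivity and multi-hop propagation steps, and the bandwidth `sigma` and function names are assumptions rather than the paper's settings.

```python
import math

def gaussian_laplacian(samples, sigma=1.0):
    """Unnormalized graph Laplacian L = D - W with W_ij = exp(-||xi - xj||^2 / (2 sigma^2))."""
    n = len(samples)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            W[i][j] = W[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    # The diagonal holds the node degree; off-diagonals hold negative similarities.
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]

def smoothness(L, f):
    """f^T L f, which equals 0.5 * sum_ij W_ij * (f_i - f_j)^2."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

# Hypothetical 2-D feature vectors standing in for MCI (Universum) subjects.
universum = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0)]
L = gaussian_laplacian(universum)
penalty = smoothness(L, [1.0, 0.9, -1.0])
```

The penalty is small when nearby Universum samples receive similar decision values, which is how a Laplacian term encodes the geometric relationships that an independent per-sample penalty ignores.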

2606.04688 2026-06-04 cs.CV 版本更新

MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation

MeshWeaver: 稀疏体素引导的表面编织用于自回归网格生成

Jiale Xu, Wang Zhao, Ying Shan

发表机构 * ARC Lab, Tencent PCG(腾讯PCG实验室)

AI总结 提出MeshWeaver框架,通过多级稀疏体素编码器注入几何上下文,以自回归方式直接预测顶点实现表面编织,在压缩比、高多边形网格生成和几何保真度上达到最优。

Comments CVPR 2026

详情
AI中文摘要

自回归网格生成通过将网格标记化为序列并以语言建模方式训练模型而受到关注。然而,现有方法存在两个基本限制:(i) 标记化效率低,导致长标记序列并阻碍扩展到高多边形网格;(ii) 缺乏几何感知引导,因为生成仅基于全局形状嵌入而非局部表面线索。我们提出MeshWeaver,一个自回归框架,将网格生成视为表面编织过程,直接预测下一个顶点而非独立坐标。其核心是多级稀疏体素编码器,通过三种互补方式将几何上下文注入生成过程:提供体素特征作为顶点表示,通过交叉注意力引导标记预测,以及作为结构支架约束生成围绕输入表面。我们的层次化设计使得在单次解码步骤中实现从粗到细的顶点预测,同时紧密耦合生成模型与3D几何。大量实验表明,MeshWeaver实现了18%的最先进压缩比,能够生成多达16K面的网格,并且在几何保真度上显著优于先前方法。

英文摘要

Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in a language-modeling fashion. However, existing approaches suffer from two fundamental limitations: (i) low tokenization efficiency, which yields long token sequences and prevents scaling to high-poly meshes, and (ii) absence of geometry-aware guidance, as generation is conditioned only on global shape embeddings rather than local surface cues. We introduce MeshWeaver, an autoregressive framework that treats mesh generation as a surface weaving process by directly predicting the next vertex instead of independent coordinates. At its core is a multi-level sparse-voxel encoder that injects geometric context into the generative process in three complementary ways: providing voxel features as vertex representations, guiding token prediction via cross-attention to voxel features, and serving as a structural scaffold that constrains generation around the input surface. Our hierarchical design enables coarse-to-fine vertex prediction in a single decoding step, while tightly coupling the generative model with 3D geometry. Extensive experiments demonstrate that MeshWeaver achieves a state-of-the-art compression ratio of 18%, can generate meshes with up to 16K faces, and significantly improves geometric fidelity over prior approaches.

2606.04684 2026-06-04 cs.CV cs.AI 版本更新

Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation

基于YOLOv8、SORT跟踪与时间数据插值的实时自动车牌识别

Mirza Muhammad Mobeen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一个五阶段端到端算法流程,结合YOLOv8目标检测、SORT多目标跟踪和时间数据插值,解决动态交通监控中因光照变化、遮挡等导致的识别率低和跟踪路径断裂问题。

Comments 7 pages. Code: https://github.com/mobeen-pmo/Automatic-License-Plate-Recognition

详情
AI中文摘要

视频处理的实时困难严重限制了自动车牌识别(ALPR)在动态交通监控环境中的应用。对非受控变量(如光照剧烈变化、摄像机扫描角度、车辆高速行驶和物理遮挡)的高保真识别是一个问题,常导致跟踪路径断裂和光学字符识别(OCR)率低下。为缓解这些弱点,本研究提出一个五阶段端到端算法流程,涵盖基于深度学习的目标检测、运动学多目标跟踪和几何时间数据插值之间的平滑过渡。所提出的架构利用强大的YOLOv8 nano模型在第一阶段定位车辆,然后使用简单在线实时跟踪(SORT)算法建立帧间时空联系。另一种更具体的YOLOv8目标检测器检测车牌区域,将切片数组传递给EasyOCR链,并受位置语法验证约束。更重要的是,启动离线时间边界框插值机制以重新连接断裂的路径。

英文摘要

The real-time demands of video processing severely limit the use of Automatic License Plate Recognition (ALPR) in dynamic traffic monitoring. Reliable recognition under unconstrained conditions, such as drastic illumination changes, steep camera angles, high vehicle speeds, and heavy physical occlusion, is difficult and often leads to fragmented tracking paths and poor Optical Character Recognition (OCR) rates. To mitigate these weaknesses, this study proposes a five-stage, end-to-end pipeline that combines deep-learning-based object detection, kinematic multi-object tracking, and geometric temporal data interpolation. The architecture first uses a YOLOv8 nano model to localize vehicles, then applies the Simple Online and Realtime Tracking (SORT) algorithm to build spatio-temporal links between frames. A second, more specialized YOLOv8 detector locates the license plate region and passes the cropped array to an EasyOCR stage constrained by positional syntax verification. More importantly, an offline temporal bounding-box interpolation mechanism is applied to reconnect fragmented paths.
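The offline interpolation step can be pictured as filling the frames where a tracked box is missing by linear interpolation between its neighbors. This is a minimal sketch under that assumption; the box format `[x1, y1, x2, y2]`, the dictionary layout, and the function name `interpolate_boxes` are hypothetical and not taken from the linked repository.

```python
def interpolate_boxes(track):
    """Fill missing frames of one track by linear interpolation.

    track maps frame index -> [x1, y1, x2, y2]; frames before the first or
    after the last detection are left untouched.
    """
    frames = sorted(track)
    filled = dict(track)
    for f0, f1 in zip(frames, frames[1:]):
        b0, b1 = track[f0], track[f1]
        for f in range(f0 + 1, f1):
            t = (f - f0) / (f1 - f0)  # fraction of the way through the gap
            filled[f] = [a + t * (b - a) for a, b in zip(b0, b1)]
    return filled

# Hypothetical plate track detected only in frames 0 and 4.
plate_track = {0: [0.0, 0.0, 10.0, 10.0], 4: [8.0, 8.0, 18.0, 18.0]}
dense_track = interpolate_boxes(plate_track)
```

Because it needs detections on both sides of a gap, this kind of interpolation only works offline, which matches the abstract's description of the step.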

2606.04656 2026-06-04 cs.CV cs.AI 版本更新

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

目标检测中的实例级事后不确定性量化

Chongzhe Zhang, Zifan Zeng, Qunli Zhang, Feng Liu, Zheng Hu

发表机构 * Tsinghua University(清华大学)

AI总结 提出蒙特卡洛广义线性化模型(MC-GLM),用于目标检测中实例级、近似事后不确定性量化,无需重新训练,在nuScenes数据集上验证了有效性。

Comments 7 pages, 2 figures

详情
AI中文摘要

目标检测是自动驾驶的安全关键组成部分。为了安全保证,量化边界框预测中的不确定性至关重要。无需重新训练的事后不确定性量化符合实际部署需求;因此,我们采用拉普拉斯近似。由于需要实例级不确定性,需要多次反向传播的线性化推理方法时间效率不高,而基于采样的方法并非完全事后。我们提出了蒙特卡洛广义线性化模型(MC-GLM),它提供实例级且近似事后的不确定性量化。蒙特卡洛步骤中所需的样本数量是恒定的,与输出实例数量无关,因此可以并行化。在nuScenes数据集上使用CenterPoint检测器的实验验证了我们方法的有效性,所得不确定性表现出良好质量。

英文摘要

Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose the Monte Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.

2606.04613 2026-06-04 cs.CV cs.LG 版本更新

Beyond Symmetric Alignment: Spectral Diagnostics of Modality Imbalance in Vision-Language Models in the Medical Domain

超越对称对齐:医学领域视觉-语言模型中模态不平衡的光谱诊断

Alessandro Gambetti, Qiwei Han, Cláudia Soares, Hong Shen

发表机构 * NOVA School of Science and Technology(诺瓦科学与技术学校) Nova School of Business and Economics(诺瓦商业与经济学校) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出非对称光谱对齐分数(SAS),通过特征值加权的特征模态相关性量化模态信息不平衡,并在医学图像-文本数据集上评估15个VLM,发现医学图像比临床报告保留更丰富的结构信息,且SAS与检索性能的相关性最强。

Comments 10 pages, 3 figures, 9 tables

详情
AI中文摘要

视觉-语言模型(VLM)在应用于医学图像-文本数据时表现不佳,但可用于诊断这种失败的工具仍然有限。现有的表示对齐度量是对称的,将两种模态合并为一个分数,隐藏了哪种模态驱动了跨模态退化。我们引入了光谱对齐分数(SAS),这是一种非对称度量,将两种模态投影到锚定模态的主特征基上,并计算特征值加权的每个特征模态的相关性,从而得到方向性分数,其差值量化了模态信息不平衡。我们将SAS嵌入到一个基准框架中,评估了15个VLM在自然和医学图像-文本数据集上的表现,同时使用了6种对齐度量和双向检索。我们的实验表明,医学图像比其配对的临床报告保留了更丰富的结构信息,这种方向性不对称是所有竞争度量无法察觉的,并且SAS在医学领域实现了与检索性能的最强零标签相关性,使其成为临床部署的实用诊断工具。代码可在以下网址获取:https://github.com/iamalegambetti/medical-vlms-assessment。

英文摘要

Vision-Language Models (VLMs) struggle when applied to medical image-text data, yet the tools available to diagnose this failure remain limited. Existing representation alignment metrics are symmetric, collapsing both modalities into a single score and hiding which modality drives cross-modal degradation. We introduce the Spectral Alignment Score (SAS), an asymmetric metric that projects both modalities onto the principal eigenbasis of an anchor modality and computes eigenvalue-weighted per-eigenmode correlations, resulting in directional scores whose difference quantifies modality information imbalance. We embed SAS within a benchmarking framework evaluating 15 VLMs across natural and medical image-text datasets alongside 6 alignment metrics and bidirectional retrieval. Our experiments show that medical images retain richer structural information than their paired clinical reports, a directional asymmetry invisible to all competing metrics, and that SAS achieves the strongest zero-label correlation with retrieval performance in the medical domain, positioning it as a practical diagnostic tool for clinical deployment. Code is available at this URL: https://github.com/iamalegambetti/medical-vlms-assessment.

2606.04604 2026-06-04 cs.CV 版本更新

COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations

COMBINER: 基于属性邻居关系的组合图像检索

Zixu Li, Yupeng Hu, Zhiwei Chen, Haokun Wen, Xuemeng Song, Liqiang Nie

发表机构 * School of Software, Shandong University(山东大学软件学院) School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院) Department of Data Science, City University of Hong Kong(香港城市大学数据科学系) Department of Computer Science and Engineering, Southern University of Science and Technology(南方科技大学计算机科学与工程系)

AI总结 针对组合图像检索中视觉相似但属性不同的样本问题,提出基于属性原型的跨模态统一表示方法COMBINER,通过自适应语义解耦、统一原型组合和双重关系建模提升检索准确性。

Comments Accepted by IEEE TIP 2026

详情
AI中文摘要

组合图像检索(CIR)是一项具有挑战性的检索任务,旨在通过多模态输入定位特定图像。尽管CIR技术近期取得了进展,但先前的方法常常忽略视觉上相似但属性不同的情况,这可能削弱多模态特征融合和相似性建模。为缓解这一限制,我们基于属性原型设计了跨模态特征的统一表示。然而,由于三个核心问题,该任务远非直接:(1)属性级语义的纠缠,(2)模态间的不一致性,以及(3)监督信号缺失。为解决上述障碍,我们引入了基于属性邻居关系的组合图像检索网络(COMBINER)。具体而言,我们首先设计了一个自适应语义解耦模块,能够基于多模态原始特征解耦属性特征。其次,我们提出了一个统一原型组合模块,可以构建跨模态统一原型(CUP)并促进多模态特征组合。最后,我们引入了一个双重关系建模模块,能够基于属性相似性挖掘成对和邻居关系。与传统的邻居关系建模CIR方法相比,COMBINER是首个解决视觉相似但属性无关样本现象的研究。它通过采用基于属性原型的相似性度量,实现了对样本间语义关系的更准确理解。在三个基准数据集上进行的全面实验证实了我们提出的COMBINER的有效性。我们的方法实现将在https://github.com/Lee-zixu/COMBINER上提供。

英文摘要

Composed Image Retrieval (CIR) is a challenging retrieval task that aims to locate specific images through multimodal inputs. Despite recent progress in CIR techniques, prior approaches often overlook cases where images appear visually alike yet differ in attributes, potentially undermining both multimodal feature fusion and similarity modeling. To mitigate this limitation, we design a unified representation of cross-modal features based on attribute prototypes. Nevertheless, the task is far from straightforward, owing to three core issues: (1) entanglement in attribute-level semantics, (2) inconsistency across modalities, and (3) missing supervision signals. To tackle the above obstacles, we introduce a COMposed image retrieval network guided By attrIbute-based NEighbor Relations (COMBINER). Specifically, we first design an Adaptive Semantic Disentanglement module, which is capable of disentangling attribute features based on multimodal primitive features. Secondly, we propose a Unified Prototype-based Composition module, which can construct cross-modal unified prototypes (CUP) and facilitate multimodal feature composition. Finally, we introduce a Dual Relations Modeling module, which can mine pairwise and neighbor relations based on attribute similarity. Compared to traditional CIR methods that model neighbor relations, COMBINER represents the first study addressing the phenomenon of visually similar but attribute-unrelated samples. It achieves a more accurate understanding of the semantic relations among samples by employing an attribute prototype-based similarity metric. Comprehensive experiments conducted on three benchmark datasets confirm the effectiveness of our proposed COMBINER. The implementation of our method will be available at https://github.com/Lee-zixu/COMBINER

2606.04593 2026-06-04 cs.CV 版本更新

4D Reconstruction from Sparse Dynamic Cameras

来自稀疏动态相机的4D重建

Kazuki Ozeki, Shun Kenney, Yuto Shibata, Eisuke Takeuchi, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Yuki Mitsufuji, Yoshimitsu Aoki

Affiliations * Keio University; Sony AI; Sony Group Corporation

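AI summary For 4D reconstruction in a sparse dynamic camera setup, proposes a 3D track initialization method that ensures spatiotemporal consistency by integrating inter-camera feature matching with intra-camera point tracking, adds a noise-robust depth-ordering regularization loss and a spatiotemporally diverse batch sampling strategy, and shows improved reconstruction quality in dynamic regions on the self-collected LetCamsGo dataset.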

Comments Accepted by 4DV Workshop at CVPR 2026

详情
AI中文摘要

尽管从单目动态相机进行动态3D(即4D)重建最近取得了进展,但其仍然受到深度模糊的根本限制。本文关注一种替代实用方案,即稀疏动态相机设置,其中少量独立移动的相机捕捉相同的对象。在保持低成本的同时,这种设置引入了多视图约束,并且对于现实世界的视频制作(如体育、音乐会和电视节目)仍然实用。尽管有潜力,但我们的实验表明,现有单目或密集固定相机方法的简单扩展是不够的,因为它们无法解决跨视图和时间的复杂时空不一致性。为填补这一空白,我们提出了一种简单而有效的3D轨迹初始化方法,通过集成跨相机特征匹配与帧内点跟踪来确保时空一致性。此外,我们引入了噪声鲁棒的深度排序正则化损失和时空多样批次采样策略,以增强优化稳定性和跨视图泛化。进一步地,为解决此任务缺乏标准化基准的问题,我们引入了LetCamsGo,这是一个新的真实世界视频数据集,包含4个不同环境中的5个序列,由三个独立移动的相机和一个固定相机记录。在LetCamsGo上的全面基准测试表明,与基线相比,我们提出的框架提高了动态区域的4D重建质量,为野外低成本4D重建范式铺平了道路。

英文摘要

Although dynamic 3D (i.e., 4D) reconstruction from a monocular dynamic camera has recently advanced, it remains fundamentally limited by depth ambiguity. In this paper, we focus on a practical alternative, namely a sparse dynamic camera setup, in which a handful of independently moving cameras capture the same subjects. While keeping capture costs low, this setup introduces multi-view constraints and remains practical for real-world video production such as sports, concerts, and TV shows. Despite its potential, our experiments show that naive extensions of existing monocular or dense fixed-camera methods are insufficient, since they fail to resolve the complex spatiotemporal inconsistencies across views and time. To fill this gap, we propose a simple yet effective 3D track initialization method designed to ensure spatiotemporal consistency by integrating inter-camera feature matching with intra-camera point tracking. Additionally, we incorporate a noise-robust depth-ordering regularization loss and a spatiotemporally diverse batch sampling strategy to enhance optimization stability and cross-view generalization. Furthermore, to address the lack of standardized benchmarks for this task, we introduce LetCamsGo, a new real-world video dataset with 5 sequences across 4 diverse environments, recorded by three independently moving cameras and one fixed camera. Comprehensive benchmarking on LetCamsGo demonstrates that our proposed framework improves 4D reconstruction quality in dynamic regions compared with baselines, paving the way for a low-cost 4D reconstruction paradigm in the wild.

2606.04591 2026-06-04 cs.CL cs.CV 版本更新

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

多模态长对话中的细粒度片段检索

Hanbo Bi, Zhiqiang Yuan, Chongyang Li, Qiwei Yan, Zexi Jia, Jiapei Zhang, Xiaoyue Duan, Yingchao Feng, Jinchao Zhang, Jie Zhou

Affiliations * Pattern Recognition Center, WeChat AI, Tencent Inc; Aerospace Information Research Institute, Chinese Academy of Sciences

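AI summary Proposes the fine-grained fragment retrieval task and addresses it with F2RVLM, a generative retrieval model trained with reinforcement learning, and FFRS, a two-stage system, to precisely locate multi-utterance, multi-image fragments in multi-modal long-form dialogues.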

详情
AI中文摘要

随着多模态交流平台的广泛采用,文本和图像交织的长对话变得越来越普遍。用户通常需要检索与特定主题相关的连贯对话片段,而不是孤立的语句。我们提出了细粒度片段检索(FFR),用于在多模态长对话中定位语义相关的多语句、多图像片段。我们探索了两种设置:(1)单对话内的FFR,从给定对话中检索片段;(2)对话语料库内的FFR,从大规模语料库中为开放域场景检索片段。对于(1),我们引入了F2RVLM,一种基于生成的检索模型,使用强化学习训练,通过多目标奖励和难度感知课程采样来增强片段连贯性。对于(2),我们开发了FFRS,一个两阶段系统,结合了离线片段级索引和在线检索。具体来说,每个对话被分解为最小语义片段,由片段嵌入模型(FEM)编码到向量数据库中;在推理时,FEM快速召回Top-K候选,F2RVLM进行细粒度推理以识别最相关的子内容。为支持FFR,我们构建了MLDR,迄今为止最长的多模态对话检索数据集,以及一个基于微信的真实世界测试集。在两个基准上的实验表明,F2RVLM和FFRS在单对话和语料库级别的FFR上始终取得优越性能。

英文摘要

With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.
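The two-stage flow described for FFRS (fast embedding recall of Top-K candidates, then fine-grained selection) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy vectors stand in for FEM embeddings, `rerank_score` is a hypothetical stand-in for F2RVLM's fine-grained reasoning, and cosine similarity is assumed as the recall metric.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_top_k(query_vec, index, k):
    """Stage 1: rank indexed fragment ids by similarity and keep the best k."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [fragment_id for fragment_id, _ in ranked[:k]]


def two_stage_retrieve(query_vec, index, k, rerank_score):
    """Stage 2: let a (hypothetical) reranker pick among the recalled candidates."""
    candidates = recall_top_k(query_vec, index, k)
    return max(candidates, key=rerank_score)


# Toy offline index: fragment id -> embedding (hypothetical values).
toy_index = {
    "frag_a": [1.0, 0.0],
    "frag_b": [0.8, 0.6],
    "frag_c": [0.0, 1.0],
}
```

Splitting the work this way keeps the expensive scorer off the full corpus: only the K recalled fragments are ever passed to it.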

2606.04545 2026-06-04 cs.CV 版本更新

Impostor: An Agent-Curated Benchmark for Realistic AIGC Manipulation Localization

Impostor:一个用于真实AIGC篡改定位的智能体策划基准

Zhenliang Li, Yutao Hu, Qixiong Wang, Wenpeng Du, Hongxiang Jiang, Jiasong Wu, Xiaolong Jiang, Jungong Han

Affiliations * Southeast University; Xiaohongshu Inc.; Tsinghua University

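AI summary To address the limitations of existing image manipulation detection and localization benchmarks in visual realism, manipulation diversity, and generator coverage, proposes the Impostor dataset and the CraftAgent framework, and designs PhaseAware-Net, which achieves superior performance on multiple benchmarks.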

Comments 10 pages, 3 figures, 5 tables

详情
AI中文摘要

近期生成式图像编辑的进展提高了局部图像篡改的真实感和可控性,给图像篡改检测与定位(IMDL)带来了新挑战。然而,现有IMDL基准在视觉真实感、篡改多样性和生成器覆盖方面仍有局限,难以反映图像篡改的最新趋势。为解决这些局限,我们引入了Impostor,一个包含10万张篡改图像的高质量AI编辑图像篡改定位数据集。Impostor由CraftAgent构建,这是一个闭环智能体框架,集成了场景感知、编辑规划、篡改执行、质量验证和迭代反思,以自动生成多样且视觉真实的篡改图像。此外,Impostor包含由七个近期AIGC模型生成的图像,涵盖三种篡改类型,并包含多个篡改区域,为基于AIGC的IMDL提供了更全面的基准。进一步,我们提出了PhaseAware-Net(PANet),一个语义-取证框架,引入局部相位建模和语义-取证一致性学习,以更好地定位语义合理但取证异常的篡改区域。大量实验表明,Impostor对现有大型视觉语言模型(LVLMs)和专用IMDL方法构成了显著挑战,而PANet在Impostor和多个公开基准上取得了优越性能。

英文摘要

Recent advances in generative image editing have improved the realism and controllability of localized image manipulation, raising new challenges for image manipulation detection and localization (IMDL). However, existing IMDL benchmarks still have limitations in visual realism, manipulation diversity, and generator coverage, making it difficult to reflect recent trends in image manipulation. To address these limitations, we introduce Impostor, a high-quality AI-edited image manipulation localization dataset containing 100K manipulated images. Impostor is constructed by CraftAgent, a closed-loop agent framework that integrates scene perception, editing planning, manipulation execution, quality validation, and iterative reflection to automatically generate diverse and visually realistic manipulated images. Moreover, Impostor contains images generated by seven recent AIGC models across three manipulation types and includes multiple manipulated regions, providing a more comprehensive benchmark for AIGC-based IMDL. Furthermore, we propose PhaseAware-Net (PANet), a semantic-forensic framework that introduces local phase modeling and semantic-forensic consistency learning to better localize semantically plausible yet forensically disrupted manipulated regions. Extensive experiments show that Impostor poses significant challenges to existing large vision-language models (LVLMs) and specialized IMDL methods, while PANet achieves superior performance on Impostor and multiple public benchmarks.

2606.04528 2026-06-04 cs.CV cs.AI 版本更新

Optical-Guided Neural Collapse for SAR Few-Shot Class Incremental Learning

光学引导的SAR少样本类增量学习中的神经坍缩

Fan Zhang, Sijin Zheng, Fei Ma, Qiang Yin, Yongsheng Zhou, Fei Gao, Xian Sun

Affiliations * Beihang University, Beijing 100191, China; Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China

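AI summary To address data scarcity and azimuth sensitivity in few-shot class-incremental learning on SAR images, proposes using orthogonal subspaces derived from an optical ATR dataset as geometric priors, with a projection loss and a classifier loss that jointly induce neural collapse, yielding compact features and inter-class separability.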

Comments 16 pages, 6 figures

详情
AI中文摘要

合成孔径雷达图像中的少样本类增量学习由于严重的数据稀缺和SAR特有的变异性而面临独特挑战。特别是,SAR中强烈的方位角敏感性导致大的类内变异和类间混淆,而FSCIL的顺序更新进一步导致先前学习类别的灾难性遗忘。受神经坍缩启发,我们提出了一种光学引导的SAR FSCIL框架,该框架从数据丰富的光学ATR数据集中推导出正交特征子空间,并将其作为几何先验来指导SAR特征学习。通过主角约束将SAR特征投影到这些正交子空间上,有效地将判别结构从光学域转移到SAR域。具体地,我们的投影损失和用冻结的单纯形ETF几何优化的分类器损失通过将特征集中在类均值周围同时保持大的类间角度,联合诱导神经坍缩。我们在一个包含光学ATR数据集和具有24个目标类别的SAR ATR数据集的基准上评估该方法,该基准组织为一个基础训练会话和七个增量会话。与最近的FSCIL方法(包括NCFSCIL等)相比,我们的方法实现了最高的最终准确率以及最终性能与性能下降之间的有利权衡。此外,神经坍缩指标显示类内紧凑性和类间可分离性得到改善,表明学习到的特征更接近理想的单纯形ETF几何。

英文摘要

Few-shot class-incremental learning (FSCIL) in synthetic aperture radar (SAR) imagery presents unique challenges due to severe data scarcity and SAR-specific variability. In particular, strong azimuth sensitivity in SAR induces large intra-class variation and inter-class confusion, and the sequential updates of FSCIL further lead to catastrophic forgetting of previously learned classes. Inspired by neural collapse, we propose an optical-guided SAR FSCIL framework, which derives orthogonal feature subspaces from a data-rich optical ATR dataset and uses them as geometric priors to guide SAR feature learning. SAR features are projected onto these orthogonal subspaces via principal angle constraints, effectively transferring discriminative structure from the optical to the SAR domain. Specifically, our projection loss and the classifier loss optimized with a frozen simplex-ETF geometry jointly induce neural collapse by concentrating features around class means while maintaining large inter-class angles. We evaluate the approach on a benchmark comprising an optical ATR dataset and a SAR ATR dataset with 24 target classes, organized into a base training session and seven incremental sessions. Compared with recent FSCIL methods such as NCFSCIL, our method achieves the highest final accuracy and a favorable trade-off between final performance and performance degradation. Moreover, neural collapse metrics show improved intra-class compactness and inter-class separability, indicating that the learned features more closely approximate the ideal simplex-ETF geometry.
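The simplex-ETF geometry mentioned in the abstract has a standard closed form in the neural collapse literature: K unit-norm class prototypes whose pairwise cosine is exactly -1/(K-1). The sketch below builds such prototypes in plain Python. It assumes the identity as the rotation (the general definition allows any orthonormal rotation) and uses K = 24 only because the abstract reports 24 target classes; how the paper configures its frozen classifier is not stated here.

```python
import math


def simplex_etf(num_classes):
    """Return num_classes prototypes m_i = sqrt(K/(K-1)) * (e_i - 1/K * 1).

    Assumption: identity rotation; each prototype lives in R^K.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


prototypes = simplex_etf(24)
same_class = dot(prototypes[0], prototypes[0])    # squared norm of one prototype
cross_class = dot(prototypes[0], prototypes[1])   # cosine between two prototypes
```

Because every prototype has unit norm, `cross_class` is the inter-class cosine, and -1/(K-1) is the most negative value that K vectors can share pairwise, which is why this geometry maximizes inter-class angles.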

2606.04527 2026-06-04 cs.MM cs.CV cs.GR 版本更新

Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

Echo-Infinity: 学习演化记忆用于实时无限视频生成

Yuxuan Bian, Zeyue Xue, Songchun Zhang, Shiyi Zhang, Weiyang Jin, Yaowei Li, Junhao Zhuang, Haoran Li, Jie Huang, Haoyang Huang, Nan Duan, Qiang Xu

Affiliations * The Chinese University of Hong Kong; Joy Future Academy, JD; The Hong Kong University of Science and Technology; Tsinghua University; The University of Hong Kong; University of Science and Technology of China

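AI summary Proposes the Echo-Infinity framework, which uses a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost and, combined with a unified relative RoPE scheme, demonstrates 24-hour real-time infinite video generation for the first time.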

Comments Website: https://echo-team-joy-future-academy-jd.github.io/Echo-Infinity/

详情
AI中文摘要

我们提出了Echo Infinity,一个面向实时无限视频生成的自回归(AR)框架,它采用可学习的演化记忆,以恒定成本动态过滤、抽象和压缩任意长度的历史。现有方法主要使用预定义的KV缓存调度、固定比例启发式压缩或推理时的RoPE适配来管理记忆。这些设计由于有限的缓存窗口和忽略自回归生成噪声,不可避免地丢失历史信息并放大复合误差。受人类记忆巩固的启发,Echo-Infinity用可学习的记忆查询替代手工设计的记忆管理,这些查询通过注意力和门控机制在过去的帧从局部窗口中被驱逐时更新。查询与视频扩散变换器(DiTs)进行端到端优化,形成一种演化记忆,支持任意压缩比,计算量恒定且与视频长度无关。它们还充当可泛化的生成先验,即使仅使用优化后的初始状态也能提高质量。我们进一步引入了统一相对RoPE方案,它将锚定帧固定从id 0开始,并让最新帧的id在训练和推理过程中最多增长到DiTs预训练的最大时间RoPE id,从而将模型从有限的RoPE约束中解放出来,并缩小训练-测试的RoPE外推差距。在长视频和短视频生成中,Echo-Infinity达到了最先进的性能,并且据我们所知,首次展示了有前景的24小时(>130万帧)实时滚动生成,为无限视频生成提供了一条实用路径。

英文摘要

We present Echo-Infinity, an autoregressive (AR) framework towards real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors because their cache window is limited and they ignore autoregressive generation noise. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism when past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce a Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow at most to the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In long and short video generation, Echo-Infinity achieves state-of-the-art performance and, to our knowledge, demonstrates promising 24-hour (>1.3M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.

2606.04493 2026-06-04 cs.CV cs.AI 版本更新

SFMambaNet: Spectral-Frequency Enhanced Selective State Space Model for Correspondence Pruning

SFMambaNet: 用于对应点筛选的频谱-频率增强选择性状态空间模型

Zhihua Wang, Yanping Li, Yizhang Liu

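AI summary Proposes SFMambaNet, which uses a Local Spectral-Geometric Attention block and a Spectral-Integrated Global Mamba block to bring frequency-domain perception to correspondence pruning for the first time, improving the separation of inliers from outliers.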

详情
AI中文摘要

对应点筛选旨在从初始对应点集中识别内点。现有大多数基于图神经网络的方法依赖于从粗欧几里得坐标映射的几何特征,难以捕捉内点呈现的细微几何一致性。而基于Mamba的方法虽具有全局感受野和长序列建模能力,但往往在隐藏状态空间中积累大量不一致特征,难以区分内点与外点。本文首次将频域感知融入该任务,提出SFMambaNet,一种新颖的频谱-频率增强Mamba双视图对应点筛选网络。我们的方法由两个组件协同构成:首先,设计局部频谱-几何注意力(LSGA)块。LSGA将频谱位置编码融入局部图交互,并引入多尺度Mamba处理,以增强对细微几何一致性的捕捉并提升局部特征判别性。在此基础上,设计频谱集成全局Mamba(SIGM)块。SIGM在状态空间中嵌入频率门控机制,利用LSGA提供的频率信息显式抑制隐藏状态内高频噪声的累积,并减轻不一致特征的传播。这增强了内点-外点可分性,并以近乎线性的复杂度实现了鲁棒的全局上下文建模能力。大量实验表明,SFMambaNet在多个具有挑战性的任务上优于当前最先进方法。代码可在https://github.com/Kirito14IT/SFMambaNet获取。

英文摘要

Correspondence pruning aims to identify inliers from an initial set of correspondences. Most existing Graph Neural Network (GNN)-based methods rely on geometric features mapped from coarse Euclidean coordinates, which struggle to capture the subtle geometric consistencies exhibited by inliers. While Mamba-based methods possess global receptive fields and long-sequence modeling capabilities, they tend to accumulate substantial inconsistent features within the hidden state space, making it difficult to distinguish inliers from outliers. In this paper, we integrate frequency-domain perception into this task for the first time and propose SFMambaNet, a novel Spectral-Frequency enhanced Mamba-based two-view correspondence pruning network. Our method comprises two cooperating components. First, we design a Local Spectral-Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi-scale Mamba processing to enhance the capture of subtle geometric consistencies and improve local feature discriminability. Building upon this, we design a Spectral-Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism within the state space, utilizing the frequency information provided by LSGA to explicitly suppress high-frequency noise accumulation within hidden states and mitigate the propagation of inconsistent features. This enhances inlier-outlier separability and achieves robust global context modeling with nearly linear complexity. Extensive experiments demonstrate that SFMambaNet outperforms current state-of-the-art methods on several challenging tasks. The code is available at https://github.com/Kirito14IT/SFMambaNet.

2606.04480 2026-06-04 cs.CV cs.HC 版本更新

IMPose: Interactive Multi-person Pose Estimation with Dynamic Correction Propagation

IMPose: 基于动态校正传播的交互式多人姿态估计

Haoyang Ge, Jian Ma, Ziwen Wang, Qihe Wang, Jianqi Fan, Hongzhi Yu, Xingyu Chen, Kun Li

Affiliations * Tianjin University; Zhongguancun Academy; Tiandi

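AI summary Proposes IMPose, an interactive tool whose dual-level tracking mechanism (keypoint level and instance level) propagates sparse multi-person pose corrections across an entire video, substantially reducing manual annotation effort.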

详情
AI中文摘要

高质量动态人体姿态标注为人工智能提供精确的运动学信息,使其能够掌握人类行为,但仍然劳动密集且耗时。当前的标注工具要么缺乏时间校正传播,要么在多人场景中失败,需要过多的人工干预。在本文中,我们介绍了IMPose,一种用于多人动态姿态标注的交互式工具。它具有双级跟踪机制,可将标注者的一帧多人姿态校正传播到整个视频。关键点级通过顺序建模确保校正的时间传播,而实例级采用关键点感知嵌入和相对位置编码来维持多人跨帧一致性。为了进一步提高鲁棒性,IMPose在轨迹库中维护历史姿态和实例线索,增强了长程时间关联,并在遮挡和运动模糊等挑战性情况下稳定标注。通过将稀疏的人工校正转换为密集且连贯的姿态轨迹,我们的框架显著减少了跨帧的重复手动细化。大量实验表明,IMPose在不同交互预算下始终实现强精度-效率权衡,在低点击标注设置中表现出特别优势。IMPose实现了高精度和高效率的标注,在3DPW上每1050帧视频仅需27次点击,在PoseTrack21上每个轨迹段每84帧仅需3次点击。我们进一步扩展了PoseTrack21,以10名标注员10小时的最小成本添加了188K个姿态实例(355万个关键点)。标注工具、代码和扩展数据集将开源。

英文摘要

High-quality dynamic human pose annotation equips AI with precise motion kinematics, enabling it to master human behavior, yet remains labor-intensive and time-consuming. Current annotation tools either lack temporal correction propagation or fail in multi-person scenarios, necessitating excessive manual intervention. In this paper, we introduce IMPose, an interactive tool for multi-person dynamic pose annotation. It features a dual-level tracking mechanism that propagates annotators' one-frame multi-person pose corrections across entire videos. The keypoint level ensures temporal propagation of corrections via sequential modeling, while the instance level employs keypoint-aware embedding with relative positional encoding to maintain multi-person cross-frame consistency. To further improve robustness, IMPose maintains historical pose and instance cues in a trajectory bank, which enhances long-range temporal association and stabilizes annotation in challenging cases such as occlusion and motion blur. By converting sparse human corrections into dense and coherent pose trajectories, our framework significantly reduces repeated manual refinement across frames. Extensive experiments show that IMPose consistently achieves a strong accuracy-efficiency trade-off under different interaction budgets, demonstrating particular advantages in low-click annotation settings. IMPose achieves high-precision annotation with high efficiency, requiring only 27 clicks per 1,050-frame video on 3DPW and 3 clicks per tracklet per 84 frames on PoseTrack21. We further expand PoseTrack21 with 188K pose instances (3.55M keypoints) at a minimal cost of 10 annotators in 10 hours. The annotation tool, code, and extended dataset will be open-sourced.

2606.04479 2026-06-04 cs.CV cs.AI cs.CL 版本更新

Evaluating Reasoning Fidelity in Visual Text Generation

评估视觉文本生成中的推理保真度

Jiajun Hong, Jiawei Zhou

Affiliations * Stony Brook University

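AI summary Evaluates whether current text-to-image models faithfully preserve reasoning ability in visual text generation, using long text rendering, factual knowledge probing, context understanding, and multi-step reasoning tasks, and finds that they often produce semantic errors and logical inconsistencies, leaving a substantial gap to text-only models.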

Comments Peer reviewed and accepted at CVPR 2026 at the GRAIL-V (Grounded Retrieval and Agentic Intelligence for Vision-Language) workshop (non-archival track)

详情
AI中文摘要

最近的文本到图像(T2I)模型能够在图像中渲染高度清晰且结构良好的文本,从而支持文档生成和幻灯片生成等应用。然而,当复杂解决方案必须直接通过渲染文本表达时,这些系统是否忠实地保留了推理能力,还是仅仅模仿表面模式,目前尚不清楚。我们通过评估视觉文本生成中的推理保真度来研究这一问题,其中模型必须将完整的推理过程表达为图像。我们的评估包括长文本渲染、事实知识探测、上下文理解和多步推理。在这些设置中,我们发现当前的T2I模型经常产生语义错误、逻辑不一致和错误的中间步骤,即使渲染的文本在视觉上清晰。这些失败与纯文本模型在相同任务上的强推理表现形成对比。我们的发现揭示了视觉文本生成与程序性推理之间的显著差距,促使更可靠的视觉文本推理。

英文摘要

Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

2606.04469 2026-06-04 cs.CV cs.AI 版本更新

Adaptive Calibration for Fair and Performant Facial Recognition

自适应校准:实现公平且高性能的面部识别

Ryan Brown, Chris Russell

Affiliations * University of Oxford

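AI summary Proposes Adaptive Calibration (AC), which maps cosine similarity between normalized embeddings to calibrated probabilities and incorporates local context to correct regional differences, improving both overall performance and fairness in facial recognition without demographic metadata.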

详情
AI中文摘要

我们引入自适应校准(AC),一种新颖的面部识别校准策略,将归一化嵌入之间的余弦相似度映射为良好校准的概率。通过将局部上下文纳入校准,自适应校准修正了余弦相似度中的一个基本不匹配问题,即相同的距离在不同嵌入区域可能对应不同的匹配概率。我们的方法在无需人口统计元数据的情况下,既提高了整体性能,又实现了更公平的校准。在各种预训练模型和标准基准上,我们的方法在准确性和公平性指标上始终优于现有方法。AC为公平的面部识别提供了实用的解决方案,无需人口统计组注释,同时提高了整体性能。与现有方法不同,我们的方法提供了连续的、区域特定的校准,避免了“降级”现象,即公平性以牺牲某些群体的性能为代价。

英文摘要

We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects for a fundamental mismatch in cosine similarity, whereby the same distance can correspond to different match probabilities in different embedding regions. Our approach both improves overall performance and yields a fairer calibration without requiring demographic metadata. It consistently outperforms existing methods on both accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC provides a practical solution for equitable facial recognition without requiring demographic group annotations, while also improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down", where fairness comes at the cost of degraded performance for some groups.
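To make the idea of mapping a cosine similarity to a match probability concrete, the sketch below fits a single global logistic curve (Platt scaling) to similarity scores. This is deliberately the non-adaptive baseline, not Adaptive Calibration itself: the abstract does not specify how local context enters the mapping, so nothing region-specific is modeled here. The function names, learning rate, step count, and toy data are all assumptions for illustration.

```python
import math


def platt_probability(similarity, slope, intercept):
    """Global logistic map from a similarity score to a match probability."""
    return 1.0 / (1.0 + math.exp(-(slope * similarity + intercept)))


def fit_platt(similarities, labels, steps=2000, learning_rate=0.5):
    """Fit slope and intercept by gradient descent on mean binary cross-entropy."""
    slope, intercept = 1.0, 0.0
    count = len(similarities)
    for _ in range(steps):
        grad_slope = 0.0
        grad_intercept = 0.0
        for sim, label in zip(similarities, labels):
            error = platt_probability(sim, slope, intercept) - label
            grad_slope += error * sim
            grad_intercept += error
        slope -= learning_rate * grad_slope / count
        intercept -= learning_rate * grad_intercept / count
    return slope, intercept


# Toy cosine similarities with match (1) / non-match (0) labels; not real data.
toy_sims = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
toy_labels = [0, 0, 1, 0, 1, 1]
fitted_slope, fitted_intercept = fit_platt(toy_sims, toy_labels)
```

A single global curve assigns the same probability to a given similarity everywhere in the embedding space, which is exactly the mismatch the abstract says a region-aware calibration is meant to correct.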

2606.04461 2026-06-04 cs.CV 版本更新

ChannelTok: Efficient Flexible-Length Vision Tokenization

ChannelTok: 高效灵活长度视觉分词

Sukriti Paul, Arpit Bansal, Tom Goldstein

发表机构 * University of Maryland, College Park(马里兰大学College Park分校)

AI总结 提出一种基于通道的轻量级灵活长度分词器,通过随机尾部丢弃训练实现语义重要性排序,在保持高质量的同时大幅提升解码速度和模型效率。

详情
AI中文摘要

领先的灵活视觉分词器以极端成本实现SOTA质量,依赖参数繁重的骨干网络和缓慢的多步生成解码器。我们摆脱这种复杂的空间分词范式,引入一种简单、轻量且快速的通道级灵活长度分词器。我们的方法将每个潜在通道视为一个视觉标记,采用参数高效的CNN-Transformer混合骨干网络。此外,在训练过程中采用随机尾部丢弃范式,自然地迫使通道按语义重要性排序。这使得在推理时只需保留前$k$个通道即可实现灵活压缩,并自然支持可变长度自回归图像生成。我们通过在ImageNet上的大量实验验证了该方法,展示了在不同标记预算下的一致质量。结果建立了新的质量-效率前沿:我们的模型实现了最先进的感知质量(rFID 2.92),同时解码速度比次优方案快$8.6\times$,参数量小$2.1\times$(1.59亿参数)。我们的工作将通道级分词确立为高效视觉表示的一种强大且实用的范式。项目页面:https://channeltok.github.io

英文摘要

Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone. Furthermore, employing a stochastic tail-dropping paradigm during training naturally forces channels to organize by semantic importance. This allows for flexible compression at inference by simply retaining the first $k$ channels, and naturally enables variable-length autoregressive image generation. We validate our approach through extensive experiments on ImageNet, demonstrating consistent quality across diverse token budgets. The results establish a new quality-efficiency frontier: our model achieves state-of-the-art perceptual quality (rFID 2.92) while being $8.6\times$ faster in decoding and $2.1\times$ smaller (159M params) than the next-best alternative. Our work establishes channel-wise tokenization as a powerful and practical paradigm for efficient visual representation. Project page: https://channeltok.github.io
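The tail-dropping idea in the abstract (randomly discard trailing channels during training, then keep only the first k channels at inference) can be illustrated with a small sketch. The details are assumptions, since the abstract does not give them: dropped channels are zeroed rather than removed, the keep count is sampled uniformly, and latents are represented as plain nested lists.

```python
import random


def tail_drop(channels, keep):
    """Keep the first `keep` channels and zero out every later channel."""
    return [
        channel if index < keep else [0.0] * len(channel)
        for index, channel in enumerate(channels)
    ]


def sample_keep(num_channels, rng):
    """Sample how many leading channels survive (assumed uniform over 1..num_channels)."""
    return rng.randint(1, num_channels)


# Toy latent with 3 channels of 2 values each (hypothetical numbers).
toy_latent = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
training_view = tail_drop(toy_latent, sample_keep(len(toy_latent), random.Random(0)))
inference_view = tail_drop(toy_latent, keep=2)  # fixed token budget of k = 2
```

Because early channels survive every random draw while late ones are often dropped, the decoder is pushed to rely on the early channels first, which is the ordering effect the abstract attributes to this training scheme.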

2606.04457 2026-06-04 cs.CV 版本更新

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

先构思再绘制:面向图像生成的视觉提示工程

Liyu Jia, Fengda Zhang, Jiachun Pan, Kesen Zhao, Saining Zhang, Wang Lin, Weijia Wu, Yue Liao, Aojun Zhou, Hanwang Zhang

Affiliations * Nanyang Technological University; National University of Singapore; Zhejiang University; The Chinese University of Hong Kong

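AI summary Proposes Visual Prompt Engineering (VPE), in which a single model first generates visual semantic tokens as an intermediate plan and then generates the full image, avoiding the information bottleneck and improving generation quality and editing preservation.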

详情
AI中文摘要

在图像生成之前,将视觉语义表示作为中间步骤引入,可以降低文本与图像之间的建模难度,从而提高生成质量。近期工作如X-Omni和BLIP3o-Next探索了这一方向,但它们通常采用两阶段外部流水线:一个独立的自回归模型首先生成语义令牌,然后将其作为条件输入给独立的扩散解码器。由于解码器无法同时访问原始输入和语义计划,这种设计引入了信息瓶颈,限制了编辑等下游任务中的细节保留。而Transfusion、BAGEL和Show-o2等内部架构通过单一模型内的跨模态交互避免了这一瓶颈,但它们在没有中间语义引导的情况下,仍然面临困难的文本到像素建模差距。我们提出了视觉提示工程(VPE),它可以无缝集成到此类内部框架中。具体来说,模型首先自回归地生成视觉语义令牌(例如SigLIP 2)作为“视觉提示”,以捕捉语义布局,然后基于该计划生成完整图像令牌。我们在类别条件生成、文本到图像生成和图像编辑上验证了VPE,涵盖了多种令牌类型和模型架构。结果表明,VPE可以加速收敛、提高质量上限,并且通过内部集成,在相同参数规模下,相比外部替代方案实现了显著更好的编辑保真度(PSNR:26.76 vs. 19.92),同时保持了有竞争力的编辑响应速度。

英文摘要

Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.
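The editing-preservation figures above are reported as PSNR, which is defined as 10 * log10(MAX^2 / MSE). The sketch below computes it for flat pixel lists and converts a PSNR gap into the implied MSE ratio. It assumes 8-bit pixel values (MAX = 255) and a single image pair; the evaluation protocol behind the reported numbers (resolution, averaging, masking) is not stated in the abstract, so the ratio is only a rough reading of the gap.

```python
import math


def psnr(reference, edited, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(edited):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, edited)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_psnr_gap(gap_db):
    """How many times smaller the MSE is for a PSNR that is gap_db higher."""
    return 10.0 ** (gap_db / 10.0)


# Gap between the two PSNR values quoted in the abstract.
reported_gap_ratio = mse_ratio_for_psnr_gap(26.76 - 19.92)
```

Read this way, a 6.84 dB gap corresponds to a mean squared error a little under five times smaller for one image pair at the same peak value.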

2606.04453 2026-06-04 cs.CV cs.LG 版本更新

Radiomic Feature Selection Using Gradient Loss of Deep Neural Network for Lung Cancer Stage Detection

基于深度神经网络梯度损失的放射组学特征选择用于肺癌分期检测

Hina Shakir, Mohammad Mohatram, Javeed Hussain, Syed Rizwan Ali, Muhammad Irfan Memon

Affiliations * Department of Software Engineering, Bahria University; Global College of Engineering and Technology; Software Engineering & Business Incubation Center, Bahria University

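AI summary Proposes the GL-RFE framework, which uses gradient sensitivity analysis from a deep neural network to recursively eliminate low-contribution features, selecting the top 15 of 106 radiomic features to classify early-stage versus advanced-stage lung cancer with 90.22% accuracy.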

详情
Journal ref
J. Vis. Exp. (230), e70181, (2026)
AI中文摘要

放射组学能够从医学图像中提取定量成像生物标志物,已成为计算机辅助癌症诊断的重要工具。然而,放射组学数据集通常具有高维小样本的特点,使得特征选择成为构建可靠预测模型的关键步骤。本研究提出了一种梯度损失递归特征消除(GL-RFE)框架,该框架集成深度神经网络的梯度敏感性分析,以识别对肺癌分期检测最具影响力的放射组学特征。使用3D Slicer平台的PyRadiomics扩展从胸部计算机断层扫描(CT)中提取了总共106个放射组学特征。所提出的方法通过计算网络损失相对于输入特征的梯度来评估特征重要性,并递归消除贡献最小的特征。最终选出的前15个放射组学特征用于训练深度神经网络分类器,以区分早期和晚期肺癌。该框架在测试数据集上取得了强劲的分类性能,准确率为90.22%,精确率为90.10%,召回率为90.24%,F1分数为90.16%。可视化分析(包括相关性热图和分布图)进一步证实了特征冗余减少和类别可分性提高。与传统特征选择技术相比,GL-RFE有效捕捉了非线性特征交互并增强了模型泛化能力。所提出的协议为基于放射组学的癌症分期检测提供了一种可重复且可解释的方法,特别适用于高维小样本生物医学数据集,并在基因组学和多模态临床分析等其他领域具有潜在应用价值。

英文摘要

Radiomics enables extraction of quantitative imaging biomarkers from medical images and has become an important tool for computer-aided cancer diagnosis. However, radiomics datasets are typically high-dimensional with limited samples, making feature selection a critical step for building reliable predictive models. This study proposes a Gradient-Loss Recursive Feature Elimination (GL-RFE) framework that integrates gradient sensitivity analysis from a deep neural network to identify the most influential radiomic features for lung cancer stage detection. A total of 106 radiomic features were extracted from chest Computed Tomography (CT) scans using the PyRadiomics extension of the 3D Slicer platform. The proposed method evaluates feature importance by computing gradients of the network loss with respect to input features and recursively eliminates features with minimal contribution. The resulting top-15 radiomic features are used to train a deep neural network classifier for distinguishing early-stage and advanced-stage lung cancer. The proposed framework achieves strong classification performance, with accuracy of 90.22%, precision of 90.10%, recall of 90.24%, and F1-score of 90.16% on the test dataset. Visualization analyses, including correlation heat maps and distribution plots, further confirm reduced feature redundancy and improved class separability. Compared to conventional feature selection techniques, GL-RFE effectively captures nonlinear feature interactions and enhances model generalization. The presented protocol provides a reproducible and interpretable methodology for radiomics-based cancer stage detection and is particularly suitable for high-dimensional, small-sample biomedical datasets, with potential applications in other domains such as genomics and multimodal clinical analysis.
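The elimination loop described above (score each feature by the gradient of the loss with respect to that input, drop the weakest, repeat) can be sketched as follows. To stay short and dependency-free, the sketch swaps the deep neural network for a logistic regression, for which the input gradient of the cross-entropy loss is (p - y) * w_j. It also assumes that importance is the mean absolute gradient and that the model is retrained after each removal, neither of which the abstract specifies. The toy data is invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(rows, labels, active, epochs=200, learning_rate=0.5):
    """Fit a logistic regression on the active feature indices only."""
    weights = {j: 0.0 for j in active}
    bias = 0.0
    count = len(rows)
    for _ in range(epochs):
        grad_w = {j: 0.0 for j in active}
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(weights[j] * x[j] for j in active)) - y
            for j in active:
                grad_w[j] += error * x[j]
            grad_b += error
        for j in active:
            weights[j] -= learning_rate * grad_w[j] / count
        bias -= learning_rate * grad_b / count
    return weights, bias


def input_gradient_scores(rows, labels, weights, bias, active):
    """Mean |d loss / d x_j| per feature; for this model that is |(p - y) * w_j|."""
    scores = {j: 0.0 for j in active}
    for x, y in zip(rows, labels):
        error = sigmoid(bias + sum(weights[j] * x[j] for j in active)) - y
        for j in active:
            scores[j] += abs(error * weights[j])
    return {j: total / len(rows) for j, total in scores.items()}


def gl_rfe(rows, labels, keep):
    """Recursively drop the feature with the smallest gradient-based score."""
    active = list(range(len(rows[0])))
    while len(active) > keep:
        weights, bias = train_logreg(rows, labels, active)
        scores = input_gradient_scores(rows, labels, weights, bias, active)
        active.remove(min(active, key=scores.get))
    return active


# Toy data: feature 0 separates the classes, feature 1 is noise, feature 2 is constant.
toy_rows = [[-1.0, 0.3, 0.0], [-0.8, -0.2, 0.0], [0.9, 0.1, 0.0], [1.1, -0.3, 0.0]]
toy_labels = [0, 0, 1, 1]
selected = gl_rfe(toy_rows, toy_labels, keep=1)
```

With a deep network the same loop would use backpropagated input gradients in place of the closed-form expression, which is where the nonlinear feature interactions mentioned in the abstract would enter.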

2606.04437 2026-06-04 cs.CV 版本更新

INTACT: Ego-Guided Typed Sparse Evidence Retrieval for Heterogeneous Collaborative Perception

INTACT: 面向异构协同感知的自我引导类型化稀疏证据检索

Chen Li, Shengrong Yuan, Jialong Zuo, Xinzhong Zhu, Nong Sang, Changxin Gao

发表机构 * National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology(多谱信息智能处理国家级重点实验室,人工智能与自动化学院,华中科技大学) Zhejiang Normal University(浙江师范大学)

AI总结 提出INTACT框架,通过自我车辆发出类型化证据查询、协作方仅返回局部证据的稀疏检索机制,实现异构协同感知中零训练的新节点接入,在OPV2V-H和DAIR-V2X上取得高效性能。

详情
AI中文摘要

协同感知通过跨智能体共享信息扩展自动驾驶车辆的感知范围,但异构传感器和感知模型使得中间特征融合难以大规模部署。现有的异构协同方法通常遵循先翻译后融合的范式:协作方特征必须在对齐、适应或投影到自我兼容空间后才能融合。这种特征兼容性契约提升了固定系统的性能,但将部署与协作方特定的适配耦合,使得新加入的异构智能体集成成本高昂。为解决这一问题,我们提出INTACT,一种面向异构协同感知的自我引导类型化稀疏证据检索框架。INTACT不翻译整个协作方特征图,而是让自我车辆发出类型化证据查询,表达可疑目标和证据不足的区域。协作方仅在查询位置返回局部证据,自我车辆通过稀疏的每查询路由选择有用响应,并通过门控残差回写注入。这将兼容性要求从全局特征图可解释性转变为在自我车辆查询下的局部、类型化响应可比性,实现了零训练的异构插入协议:自我接口训练一次,新协作方通过检查点合并加入。在模拟和真实世界的异构协同感知基准上的大量实验验证了INTACT的有效性和可部署性。在OPV2V-H上,INTACT仅用0.52M额外参数和18.0 $\log_2$通信量达到80.1 AP70,相当于密集特征传输的约16倍压缩。在DAIR-V2X上,INTACT在具有挑战性的真实条件下达到43.8 AP50。

英文摘要

Collaborative perception extends the perceptual range of autonomous vehicles by sharing information across agents, but heterogeneous sensors and perception models make intermediate feature fusion difficult to deploy at scale. Existing heterogeneous collaboration methods typically follow a translation-first paradigm: collaborator features must be aligned, adapted, or projected into an ego-compatible space before fusion. Such feature-compatibility contracts improve fixed-system performance, but they couple deployment to collaborator-specific adaptation and make newly joined heterogeneous agents costly to integrate. To address this gap, we propose INTACT, an ego-guided typed sparse evidence retrieval framework for heterogeneous collaborative perception. Instead of translating an entire collaborator feature map, INTACT lets the ego vehicle issue typed evidence queries that express suspected objects and evidence-deficient regions. Collaborators respond only with local evidence at queried locations, and the ego selects useful responses through sparse per-query routing and injects them through gated residual write-back. This changes the compatibility requirement from global feature-map interpretability to local, typed response comparability under ego-issued queries, enabling a zero-training heterogeneous insertion protocol in which the ego interface is trained once and new collaborators join through checkpoint merging. Extensive experiments on simulated and real-world heterogeneous collaborative perception benchmarks validate the effectiveness and deployability of INTACT. On OPV2V-H, INTACT achieves 80.1 AP70 with only 0.52M additional parameters and 18.0 $\log_2$ communication volume, corresponding to about 16$\times$ compression over dense feature transmission. On DAIR-V2X, INTACT achieves 43.8 AP50 under challenging real-world conditions.
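The communication figures are given on a log2 scale, so a compression ratio becomes a subtraction: dividing the transmitted volume by r lowers its log2 by log2(r). The short sketch below checks what the reported numbers imply, assuming the 16x ratio compares raw transmitted volumes in the same unit as the 18.0 figure. The abstract says "about 16x", so the result is approximate.

```python
import math


def log2_volume_after_compression(dense_log2, ratio):
    """log2 communication volume after compressing by `ratio`."""
    return dense_log2 - math.log2(ratio)


def implied_dense_log2(sparse_log2, ratio):
    """log2 volume of the uncompressed transmission implied by a ratio."""
    return sparse_log2 + math.log2(ratio)


# Reported: 18.0 log2 volume at roughly 16x compression.
dense_baseline_log2 = implied_dense_log2(18.0, 16)
```

Under that assumption, dense feature transmission would sit near 22.0 on the same log2 scale, four units above INTACT because 16 = 2^4.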

2606.04436 2026-06-04 cs.CV cs.RO Version update

3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training

Jiaxin Shi, Xidong Zhang, Fucai Zhu, Zhe Li, Siyu Zhu, Weihao Yuan

Affiliations * Shanghai Jiao Tong University; Harbin Institute of Technology; Nanyang Technological University; Fudan University; Nanjing University; Daimon Robotics; Great Bay University

AI summary Proposes a 3D-thinking-guided co-training framework that disentangles 3D geometry perception from spatial reasoning and injects them at different feature hierarchies, so that VLA models reason about 3D space implicitly during action prediction without 3D sensors or external models, reaching state-of-the-art performance on several benchmarks.

Abstract

We propose a 3D-thinking-guided co-training framework that enables vision-language-action (VLA) models to perform 3D spatial reasoning implicitly during action prediction. Our core insight is that 3D geometry perception and 3D spatial reasoning are distinct capabilities that can be disentangled and injected at different feature hierarchies. During training, three tightly coupled components work in concert primarily within the latent space: (1) To gain geometric priors, a latent 3D geometry perception module aligns intermediate visual features with a 3D foundation model, acquiring low-level geometric cues without architectural modifications to the VLM backbone. (2) Complementing this, an online 3D reasoning distillation module mitigates the prompt-induced reasoning gap via a shared reasoning anchor token. During 3D VLM co-training, this anchor is emitted as the first output token to robustly encode spatial priors. During VLA training, it serves as an input token inserted between the task and action instructions, transferring high-level spatial thinking from explicit teacher reasoning prompts to student action prompts without chain-of-thought text generation. (3) These disentangled geometric and reasoning features are then united by a spatially augmented action integration, which jointly injects them into the action-query tokens as hierarchical spatial conditions to prevent action shortcuts. At deployment, our method retains only its lightweight adapters to perform implicit 3D reasoning, discarding the 3D foundation model and the teacher branch used for supervision. Consequently, it operates purely on 2D images without 3D sensors, external models, or explicit text generation while preventing catastrophic forgetting of the pretrained VLM, achieving state-of-the-art performance on LIBERO, LIBERO-PLUS, SimplerEnv, and real-world manipulation tasks.

2606.04434 2026-06-04 cs.CV cs.LG Version update

Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning

Niloufar Alipour Talemi, Hossein Kashiani, Fatemeh Afghah

Affiliations * University of California, Berkeley

AI summary Proposes Hyper-ICL, a lightweight training-based framework that calibrates attention distributions with a low-rank logit-level adapter and a hyperbolic anchor distillation loss, reconstructing demonstration effects without in-context demonstrations at inference time and improving the accuracy and stability of multimodal in-context learning.

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Multimodal In-Context Learning (ICL) has emerged as a practical inference paradigm for Multimodal Large Language Models, where a small set of interleaved image-text In-Context Demonstrations (ICDs) conditions the model to solve new tasks. Despite its flexibility, multimodal ICL incurs high inference latency and suffers from instability due to sensitivity to demonstration formatting, ordering, and content. To address these limitations, we propose Hyper-ICL, a lightweight, training-based framework for demonstration-free multimodal ICL that reconstructs demonstration effects directly without requiring ICDs at inference time. Hyper-ICL learns a parameter-efficient low-rank logit-level adapter that calibrates attention distributions to better match demonstration-induced attention redistribution. To capture how demonstration influence varies across queries, we introduce a query-adaptive modulation mechanism that adaptively controls intervention strength at token level across layers and heads based on the current query. Finally, we propose a layer-wise hyperbolic anchor distillation loss that aligns intermediate student features to a demonstration-conditioned teacher via Lorentz geodesic distance. This loss encourages the student to reconstruct the demonstration-query relationships induced by ICDs. Extensive experiments across six different multimodal benchmarks (including VQAv2, OK-VQA, and COCO Caption) demonstrate that Hyper-ICL consistently improves accuracy and stability over vanilla ICL and existing state-of-the-art methods.
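The Lorentz geodesic distance named in the abstract has a standard closed form on the hyperboloid model. The sketch below assumes curvature $-1$ and lifts plain Euclidean vectors onto the hyperboloid by solving for the time-like coordinate; the feature values, the lifting step, and all function names are illustrative assumptions, not the paper's implementation.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the unit hyperboloid (curvature -1)."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L) between two hyperboloid points."""
    # Clamp to 1.0 to guard against rounding slightly below the arccosh domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

student = lift_to_hyperboloid([0.3, -0.1])   # toy "student" feature
teacher = lift_to_hyperboloid([0.5, 0.2])    # toy "teacher" feature
print(lorentz_distance(student, teacher))
```

A distillation loss in this spirit would average such distances over layers and tokens; how the paper weights or anchors them is not specified in the abstract.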

2606.04433 2026-06-04 cs.CV cs.CL cs.LG Version update

Stateful Visual Encoders for Vision-Language Models

Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell

Affiliations * University of California, Berkeley

AI summary Proposes a stateful visual encoder that conditions each visual representation on prior visual features, improving how vision-language models perceive visual changes in multi-image, multi-turn interaction and yielding consistent gains on cross-image spatial aggregation, multi-object visual differencing, and trajectory behavior cloning.

Comments Project page: https://statefulvisualencoders.github.io/

Abstract

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/

2606.04432 2026-06-04 cs.CV Version update

DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation

Thanh-Tung Le, Yunhan Zhao, Menglei Chai, Zhengyang Shen, Zhe Cao, Danhang Tang, Xiaohui Xie, Deying Kong

Affiliations * University of California, Irvine; Google; Google DeepMind

AI summary Proposes DSA, a confidence-guided adaptive computation framework in which a lightweight confidence head dynamically adjusts the number of denoising steps per frame, enabling real-time autoregressive video generation while preserving video quality.

Comments CVPR 2026, Findings Track

Abstract

Video diffusion transformers have achieved state-of-the-art visual quality, but their high inference cost remains a major bottleneck for real-time applications. Recent distillation frameworks produce autoregressive video diffusion models with reduced latency, yet these models still use a fixed number of denoising steps per frame, wasting computation on predictable frames and under-refining challenging ones. We present DSA, a confidence-guided adaptive computation framework for AR video diffusion. DSA introduces a lightweight confidence head, trained jointly with the generator under a distribution-matching distillation objective, to estimate per-frame denoising reliability. At inference, this confidence signal dynamically adjusts the number of diffusion steps: simple frames terminate early for speed, while complex frames receive additional refinement. Our method requires no extra video data, no heuristics, and little architectural modification. Experiments show that DSA achieves real-time autoregressive video generation, reaching 22.63 FPS with sub-second latency on H100 GPUs, while maintaining competitive or superior VBench quality compared to recent autoregressive and bidirectional video diffusion models. Our results demonstrate that confidence-guided adaptive sampling provides an effective and practical path toward interactive video generation.
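The inference-time behaviour described here, stopping early on easy frames and refining hard ones longer, can be sketched as a simple loop. Everything below is a toy stand-in under stated assumptions: the threshold, the step limits, the scalar "frame", and the two helper functions are hypothetical and do not come from the paper.

```python
def adaptive_denoise(frame, denoise_step, confidence,
                     threshold=0.9, min_steps=1, max_steps=8):
    """Refine `frame` until the confidence estimate passes `threshold`."""
    steps = 0
    while steps < max_steps:
        frame = denoise_step(frame, steps)
        steps += 1
        if steps >= min_steps and confidence(frame) >= threshold:
            break  # easy frame: stop early
    return frame, steps

# Toy stand-ins: a "frame" is a scalar error that halves with each step.
def halve_error(err, step_index):
    return err * 0.5

def toy_confidence(err):
    return 1.0 - min(1.0, err)

for name, initial_error in [("easy", 0.15), ("hard", 3.0)]:
    _, used = adaptive_denoise(initial_error, halve_error, toy_confidence)
    print(name, used)   # easy 1, hard 5
```

In this toy setting the easy frame exits after one step while the hard frame uses five, which is the compute reallocation the abstract describes; the real confidence head is learned jointly with the generator.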

2606.04427 2026-06-04 cs.CV Version update

Implicit Fuzzification via Bounded Noise Injection for Robust Medical Image Segmentation

Bisheng Tang, Zhangfeng Ma, Chuchu Zhai, Feng Dong, Yaoqun Wu, Ammar Oad, Yifei Peng

Affiliations * Xinshao County People’s Hospital, Shaoyang

AI summary Proposes NoiseUNet, which injects bounded perturbations into skip connections to regularize cross-scale feature fusion, achieving implicit fuzzification and improving the accuracy and boundary fidelity of medical image segmentation.

Comments Under review

Abstract

Image segmentation remains fundamentally limited by boundary ambiguity arising from sampling-induced information loss and inherent uncertainty in pixel-wise labeling. Although encoder-decoder architectures such as U-Net achieve strong performance, they often produce overconfident predictions that fail to capture transition-region ambiguity. To address this issue, we propose NoiseUNet, a simple yet effective framework that injects bounded perturbations into skip connections to regularize cross-scale feature fusion. This mechanism enforces robustness to local feature variations and promotes boundary-aware representations. Theoretically, the perturbation induces an implicit fuzzification effect, yielding soft, data-driven memberships without requiring explicit fuzzy modeling. We further introduce ThyR, a real-world thyroid ultrasound dataset with inherently ambiguous boundaries. Experiments demonstrate that NoiseUNet consistently improves both segmentation accuracy and boundary fidelity.
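A minimal sketch of bounded perturbation on a skip connection follows. It assumes uniform noise in $[-\epsilon, \epsilon]$ and an element-wise sum as the fusion step; the abstract specifies neither the noise distribution nor the fusion operator, so both choices, along with the function names, are illustrative.

```python
import random

def inject_bounded_noise(skip_features, epsilon, rng):
    """Add independent uniform noise in [-epsilon, epsilon] to each feature."""
    return [f + rng.uniform(-epsilon, epsilon) for f in skip_features]

def fuse(decoder_features, skip_features):
    """Toy fusion: element-wise sum (real U-Nets usually concatenate, then convolve)."""
    return [d + s for d, s in zip(decoder_features, skip_features)]

rng = random.Random(0)           # seeded for reproducibility
skip = [0.2, -1.0, 0.7]          # toy encoder features on the skip path
decoder = [0.1, 0.4, -0.3]       # toy decoder features at the same scale
noisy_skip = inject_bounded_noise(skip, epsilon=0.05, rng=rng)
fused = fuse(decoder, noisy_skip)
print(fused)
```

Because the noise is bounded, each fused value can move by at most `epsilon` relative to the noise-free result, which is what makes the perturbation a controlled regularizer rather than arbitrary corruption.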

2606.04419 2026-06-04 eess.IV cs.AI cs.CV physics.med-ph Version update

L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI

Arda Atalık, Sumit Chopra, Daniel K. Sodickson

Affiliations * NYU Center for Data Science; Center for Advanced Imaging Innovation and Research (CAI²R); Courant Institute of Mathematical Sciences; Function Health

AI summary Proposes L-TGVN, a variational network that uses longitudinal priors as side information to reconstruct the current scan from heavily undersampled measurements, requires no explicit registration, accommodates protocol differences, and outperforms baselines on quantitative metrics and structure preservation.

Comments Accepted to MICCAI 2026

Abstract

MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also raising exam costs and limiting scanner throughput. A common approach to reduce scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem; recovering diagnostic-quality images therefore requires incorporating prior knowledge beyond the measured data. In follow-up exams, the most recent prior scan of a patient can provide a highly informative subject-specific context, but practical use is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. In this work, we introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that leverages prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of prior scans to be consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it does not require explicit pre-registration between prior and current scans. It further accommodates differences in acquisition protocols across visits (e.g., changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that do not use longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at github.com/sodicksonlab/L-TGVN.

2606.04414 2026-06-04 cs.CV cs.MM Version update

Motion-Guided Causal Disentanglement for Robust Multi-View Cine Cardiac MRI Diagnosis

Chuankai Xu, Cristiane De Carvalho Singulane, Mohammad Abuannadi, Stephen Chandler, Jeremy Slivnick, Karolina Zareba, Jane Cao, Vidya Nadig, Fabio Fernandes, Seth Uretsky, Diego Perez de Arenaza, Amit Patel, Jianxin Xie

Affiliations * University of Virginia; University of Chicago; Ohio State University; St. Francis Hospital & Heart Center; Hartford HealthCare; Instituto do Coração (InCor); Atlantic Health System; Sociedad Italiana de Beneficencia (Hospital Italiano)

AI summary Proposes MoViD, a motion-guided view-disease disentanglement framework that separates view-specific from disease-discriminative features with dual-branch supervised contrastive learning and a gradient-reversal adversarial constraint, uses an annotation-free temporal motion feature to localize the heart region, mitigates class imbalance, and outperforms standard transformer baselines on a venous thrombosis dataset and two public benchmarks.

Abstract

Multi-view cardiac magnetic resonance (CMR) imaging provides complementary anatomical information and is widely used for noninvasive disease assessment. Recent transformer-based models have demonstrated strong representation learning capabilities for CMR analysis; however, they typically learn unified latent embeddings that entangle view-specific anatomical variations with disease-related features. Such entanglement biases classifiers toward structural attributes rather than view-invariant pathological patterns. This issue is exacerbated in low-data regimes, particularly for underrepresented cardiac conditions, where limited samples increase the susceptibility to shortcut learning and view-dependent decision boundaries. To address this, we propose a Motion-Guided View-Disease Disentanglement framework, MoViD, built upon a ViT-MAE backbone. The model explicitly factorizes latent representations into view-specific and disease-discriminative components using dual-branch supervised contrastive objectives and a gradient-reversal adversarial constraint that minimizes disease leakage into the view embedding. Additionally, an annotation-free temporal motion feature, derived from inter-frame difference maps, is introduced to localize the beating heart region and suppress background artifacts. A focal reweighting mechanism is incorporated into the contrastive loss to mitigate class imbalance. We evaluate the framework on a private clinical venous thrombosis dataset and two public benchmarks (M&Ms, M&Ms2). Across disease classification and cardiac segmentation tasks, our approach consistently outperforms standard transformer baselines and demonstrates competitive performance against large-scale pretrained foundation models, validating the efficacy of structural disentanglement in medical image analysis.
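The annotation-free motion feature is described as being derived from inter-frame difference maps. One simple reading, sketched below, averages the absolute per-pixel difference between consecutive cine frames; the averaging, the nested-list image format, and the function name are assumptions made for illustration only.

```python
def motion_map(frames):
    """Average absolute difference between consecutive frames, per pixel."""
    height, width = len(frames[0]), len(frames[0][0])
    acc = [[0.0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        for r in range(height):
            for c in range(width):
                acc[r][c] += abs(curr[r][c] - prev[r][c])
    pairs = max(1, len(frames) - 1)   # avoid division by zero for a single frame
    return [[v / pairs for v in row] for row in acc]

# Three 2x3 frames: only the centre column changes over time.
frames = [
    [[0, 1, 0], [0, 1, 0]],
    [[0, 3, 0], [0, 2, 0]],
    [[0, 1, 0], [0, 1, 0]],
]
print(motion_map(frames))   # [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]
```

Static background pixels score zero while the moving column stands out, which is why such a map can highlight the beating heart without any labels.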

2606.04410 2026-06-04 cs.CV Version update

Ultra-Fast Neural Video Compression

Jiahao Li, Wenxuan Xie, Zhaoyang Jia, Bin Li, Zongyu Guo, Xiaoyi Zhang, Yan Lu

Affiliations * Microsoft Research Asia; University of Science and Technology of China

AI summary Proposes DCVC-UF, a chunk-based coding framework that achieves ultra-fast encoding and decoding through joint spatial-temporal modeling and parallel reconstruction, significantly improving the rate-distortion-complexity trade-off.

Comments CVPR 2026

Abstract

While neural video codecs (NVCs) have demonstrated superior compression ratio, their prohibitive computational complexity remains a critical barrier to real-world deployment. This paper introduces a chunk-based coding framework designed to significantly improve the rate-distortion-complexity trade-off. Instead of processing frames sequentially, our approach encodes a chunk of multiple frames into a single compact latent representation and decodes them simultaneously. This is enabled by cross-frame interaction modules for joint spatial-temporal modeling and frame-specific decoders for parallel reconstruction. This paradigm not only dramatically enhances coding throughput but also facilitates more effective modeling of long-term temporal correlations. To further boost speed, we propose a streamlined entropy coding mechanism that consolidates bit-stream interactions into a single step, substantially reducing decoding overhead. Building on these innovations, we present DCVC-UF (Ultra-Fast), a new NVC that sets a new SOTA in performance. Our experiments show that DCVC-UF can achieve ultra-fast encoding and decoding speeds, significantly outperforming previous leading codecs. DCVC-UF serves as a notable landmark in the journey of NVC evolution. The code is at https://github.com/microsoft/DCVC.

2606.04385 2026-06-04 cs.CV Version update

Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models

Shuwen Yu, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li

Affiliations * Yunnan Normal University, Kunming, China; Kunming University of Science and Technology, Kunming, China

AI summary Proposes GPUA, a framework that aligns vision foundation model features to the semantic space of vision-language models through an orthogonal mapping, without labels or parameter updates, improving cross-model compatibility and delivering strong gains in zero-shot recognition and segmentation.

Comments Accepted at ICML 2026

Abstract

Foundation models have driven rapid progress in computer vision, yet the two dominant paradigms, vision-language foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We propose GPUA (Geometry-Preserving Unsupervised Alignment), a framework that integrates the complementary strengths of VFMs and VLMs. Inspired by cross-lingual alignment, GPUA treats VFM features as a visual language and learns an orthogonal mapping that translates the VFM space into the VLM semantic space, preserving geometry and narrowing the modality gap without labels or model parameter updates. GPUA is task-agnostic and requires only feature-level access to pretrained models. Experiments across diverse benchmarks demonstrate improved cross-model compatibility and strong gains in downstream zero-shot recognition and segmentation with negligible overhead. Code is available at https://github.com/Yuteam14/GPUA
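The reason an orthogonal mapping preserves geometry is that it leaves norms and inner products unchanged. The sketch below illustrates this with a 2D rotation standing in for the learned map; the angle and vectors are arbitrary, and nothing here reproduces how GPUA actually learns its mapping.

```python
import math

def rotation(theta):
    """A 2x2 orthogonal matrix (rotation by theta radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def apply(matrix, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

def cosine(u, v):
    """Cosine similarity between two 2D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

W = rotation(0.7)                  # hypothetical stand-in for a learned orthogonal map
a, b = [1.0, 2.0], [3.0, -1.0]     # toy feature vectors
print(cosine(a, b), cosine(apply(W, a), apply(W, b)))   # equal up to rounding
```

Since pairwise similarities survive the mapping, neighbourhood structure in the source feature space carries over to the target space, which is the property the "geometry-preserving" name refers to.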

2606.04369 2026-06-04 cs.CV Version update

VT-3DAD: Cross-Category 3D Anomaly Detection via Visual-Text Normal Space Alignment

Zi Wang, Katsuya Hotta, Yawen Zou, Koichiro Kamide, Yijin Wei, Chao Zhang, Jun Yu

Affiliations * Niigata University; University of Toyama; Iwate University

AI summary Proposes VT-3DAD, a training-free framework that uses frozen CLIP encoders to extract multi-view visual features and textual normal anchors, then fuses visual and semantic deviations for cross-category 3D anomaly detection, reaching state-of-the-art performance on ShapeNetPart.

Abstract

Few-shot cross-category 3D anomaly detection aims to determine whether an unknown point cloud belongs to a target normal category using only a few normal references. Existing training-based methods usually require category-wise optimization, while recent training-free methods based on multi-view CLIP visual features mainly rely on visual similarity and may be confused by geometrically similar categories. In this paper, we propose VT-3DAD, a training-free framework for cross-category 3D anomaly detection via Visual-Text Normal Space Alignment. Given few-shot normal references and a test point cloud, VT-3DAD first generates realistic multi-view depth maps and extracts view-wise features using a frozen CLIP visual encoder. The visual branch measures reference-test deviation in the multi-view feature space. In parallel, depth-aware and 3D-aware prompts are encoded by the frozen CLIP text encoder to construct textual normal anchors, which provide semantic normality constraints for the target category. The final anomaly score is obtained by fusing visual deviation from normal references and semantic deviation from the textual normal space. Experiments on the ShapeNetPart dataset demonstrate that VT-3DAD achieves state-of-the-art performance. In particular, VT-3DAD improves the one-shot average AUC-ROC from 92.49% to 94.80% compared with the visual-only baseline, while also reducing the average standard deviation from 5.64 to 3.41.

2606.04365 2026-06-04 cs.CV cs.AI Version update

Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes

Renjie Liang, Zhengkang Fan, Jinqian Pan, Chenkun Sun, Jiang Bian, Russell Terry, Jie Xu

Affiliations * Department of Health Outcomes and Biomedical Informatics, University of Florida; Department of Urology, University of Florida; Department of Biostatistics and Health Data Science, Indiana University School of Medicine; Center of Biomedical Informatics

AI summary Proposes LesionDETR, a DETR-style architecture that uses size-distance Hungarian matching and a hierarchical loss to predict four clinical attributes per lesion from CT volumes, reaching an AUC of 0.799 for bilateral side-level abnormality detection.

Abstract

Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient or organ level. We reformulate kidney CT characterization as a per-lesion set-prediction task: one model emits a variable number of lesions per kidney, each with four clinical attributes. We curated 2,619 CT volumes from 788 patients at one academic medical center, with multi-granularity side- and per-lesion labels, and used KiTS23 (489 cases) for zero-shot external validation. We propose LesionDETR, a DETR-style architecture with size-distance Hungarian matching and a hierarchical loss that aggregates per-slot outputs to side-level objectives. Across four input representations and six encoder initializations, two design choices dominate: a segmentation mask as an input channel, and same-domain abdominal pretraining (SuPreM); generic large-corpus pretraining is no better than random initialization. LesionDETR reaches bilateral side-level abnormality AUC $0.799 \pm 0.009$ on UF-Health and $0.817 \pm 0.072$ on KiTS23. A count-conditioned variant reaches per-lesion mAP $0.190 \pm 0.083$ on cystic lesions; rare solid-lesion AP stays at the noise floor, pointing to targeted data collection, not architecture, as the next bottleneck. The framework yields verified per-lesion predictions for downstream structured report generation.
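Set prediction in DETR-style models relies on a one-to-one matching between predicted slots and ground-truth items. The sketch below shows such a matching with a hypothetical size-distance cost (centre distance plus a weighted size gap); the exact cost used by LesionDETR is not given in the abstract. For a dependency-free example it enumerates permutations instead of running the Hungarian algorithm, which finds the same optimum but only scales to tiny sets.

```python
import itertools
import math

def match_cost(pred, target, size_weight=1.0):
    """Hypothetical cost: centre distance plus weighted absolute size gap."""
    return (math.dist(pred["center"], target["center"])
            + size_weight * abs(pred["size"] - target["size"]))

def best_assignment(preds, targets):
    """Return (total_cost, pairs) for the minimum-cost one-to-one matching.

    Assumes len(preds) >= len(targets); unmatched slots mean "no lesion".
    """
    best = (math.inf, ())
    for slots in itertools.permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[s], t) for s, t in zip(slots, targets))
        if cost < best[0]:
            best = (cost, tuple(zip(slots, range(len(targets)))))
    return best

preds = [
    {"center": (10.0, 10.0, 5.0), "size": 12.0},
    {"center": (40.0, 22.0, 9.0), "size": 30.0},
    {"center": (11.0, 9.0, 5.0), "size": 4.0},
]
targets = [
    {"center": (41.0, 22.0, 9.0), "size": 28.0},
    {"center": (10.0, 10.0, 5.0), "size": 5.0},
]
cost, pairs = best_assignment(preds, targets)
print(pairs)   # ((1, 0), (2, 1)) as (prediction slot, target index)
```

Note that the second target is matched to slot 2 rather than slot 0, even though slot 0 sits exactly on its centre: the size term outweighs the small positional gap, which is the kind of disambiguation a size-aware cost is meant to provide.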

2606.04345 2026-06-04 cs.CV cs.AI cs.LG Version update

HYolo: An Intelligent IoT-Based Object Detection System Using Hypergraph Learning

Isha Abid, Fawad Khan, Muhammad Khuram Shahzad

Affiliations * National University of Sciences and Technology

AI summary Proposes HYolo, a framework that integrates hypergraph learning into the YOLO architecture to model high-order feature relationships, improving mAP@50 on the COCO dataset by approximately 12%.

Comments 8 pages, multiple figures

Abstract

This paper presents HYolo, an intelligent IoT-based object detection framework that integrates hypergraph learning into the YOLO architecture. Traditional YOLO-based object detection models primarily capture pairwise feature interactions and may fail to model complex high-order relationships among objects and contextual features. To address this limitation, HYolo incorporates hypergraph learning to capture richer contextual dependencies and improve object representation. Experimental evaluation on the COCO dataset demonstrates significant performance improvements over baseline YOLO models. The proposed approach achieves approximately 12% improvement in mAP@50 while enhancing overall detection accuracy and robustness. By modeling high-order feature relationships, HYolo provides improved contextual understanding and more reliable object detection performance in IoT-based environments. The results indicate that integrating hypergraph learning into object detection pipelines offers a promising direction for intelligent and context-aware IoT vision systems.

2606.04343 2026-06-04 cs.CV Version update

Robust Multi-view Clustering against Imperfect Information

Zhichao Huang, Haochen Zhou, Hao Wang, Mouxing Yang, Xi Peng

Affiliations * College of Computer Science, Sichuan University, China; School of Artificial Intelligence, Sichuan University, China; Tianfu Jincheng Laboratory, Chengdu, China

AI summary Targets the imperfect information problem of missing views and noisy correspondences in multi-view data with Posterior-guided Latent Counterpart Inference (PLCI), which treats the cross-view counterpart as a latent variable and combines instance-level reliability with prototype-level semantic transport to handle both challenges in a unified manner.

Comments 19 pages, 11 figures

Abstract

Real-world multi-view data always suffer from the imperfect information problem, where view-specific observations are absent (i.e., Incomplete Views, IV) and cross-view correspondences are mismatched (i.e., Noisy Correspondences, NC) for certain instances. As a remedy, numerous IV- and NC-oriented multi-view clustering (MvC) methods have been proposed; however, they require either reliable correspondences or sufficiently complete instances, and thus stop short of addressing the imperfect information problem. In contrast, we observe that both the IV and NC challenges originate from the same issue of imperfect cross-view counterpart information, where the counterpart of an anchor instance in another view might be either unavailable or unreliable. Based on this observation, we propose a novel robust MvC framework, termed Posterior-guided Latent Counterpart Inference (PLCI), which can handle both IV and NC in a unified manner. Specifically, PLCI formulates the desired cross-view counterpart of each anchor instance as a latent variable, and integrates both instance-level reliability and prototype-level semantic transport to infer the posterior distribution of the latent counterpart. Extensive experiments on six widely used multi-view datasets against 10 state-of-the-art MvC methods demonstrate the effectiveness of PLCI in tackling the imperfect information problem. The code will be released upon acceptance.

2605.03927 2026-06-04 cs.CV Version update

StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

Xiaowen Sun, Matthias Kerzel, Mengdi Li, Xufeng Zhao, Paul Striker, Stefan Wermter

Affiliations * Department of Informatics, University of Hamburg; King Abdullah University of Science and Technology

AI summary Proposes StateVLM, which uses a training strategy based on an auxiliary regression loss to strengthen the numerical reasoning of vision-language models in object detection and object-state localization, and introduces the OSAR benchmark to validate it.

Abstract

Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large language models (LLMs): they struggle with numerical reasoning, particularly in object detection and object-state localization. To explore numerical reasoning as a regression task in VLMs, we propose a novel training strategy to adapt VLMs for object detection and object-state localization. This approach leverages box decoder outputs to compute an Auxiliary Regression Loss (ARL) during fine-tuning, while preserving standard sequence prediction at inference. We leverage this training strategy to develop StateVLM (State-aware Vision-Language Model), a novel model designed to perceive and learn fine-grained object representations, including precise localization of objects and their states, as well as graspable regions. Due to the lack of a benchmark for object-state affordance reasoning, we introduce an open-source benchmark, Object State Affordance Reasoning (OSAR), which contains 1172 scenes with 7746 individual objects and corresponding bounding boxes. Comparative experiments on adapted benchmarks (RefCOCO, RefCOCO+, and RefCOCOg) demonstrate that ARL improves model performance by an average of 1.6% compared to models without ARL. Experiments on the OSAR benchmark further support this finding, showing that StateVLM with ARL achieves an average of 5.2% higher performance than models without ARL. In particular, ARL is also important for the complex task of affordance reasoning in OSAR, where it enhances the consistency of model outputs.

2606.04323 2026-06-04 cs.CV 版本更新

Answer Self-Consistency with Margin-Triggered Question Re-Arbitration for the CVPR 2026 VidLLMs Challenge

面向CVPR 2026 VidLLMs挑战赛的基于边际触发问题重新仲裁的答案自一致性方法

Tomoya Miyazawa, Hiroyasu Okuno

发表机构 * Data Analytics Labo Co.(数据分析实验室公司)

AI总结 提出一种无需训练的测试时推理框架ASC-MQRA,通过答案自一致性聚合多轮视频问答结果,并利用边际触发机制对低置信度样本进行条件性重新仲裁,在CVPR 2026 VidLLMs挑战赛Track 2上取得领先性能。

详情
AI中文摘要

在本报告中,我们提出了针对CVPR 2026 VidLLMs挑战赛Track 2的解决方案。该赛道评估视频中的视觉关系推理能力,模型需要推断并非总是明确可见的关系。我们提出了答案自一致性结合边际触发问题重新仲裁(ASC-MQRA),一种基于多模态推理模型的无需训练的测试时推理框架。核心ASC组件执行多次随机视频问答运行,并通过答案级别的自一致性聚合其答案选择。这显著优于单次推理,并构成了我们的最终测试提交。我们进一步研究了MQRA,一种针对低边际样本的条件性重新仲裁模块,其中第一阶段的投票分布指示了不确定性。我们的投票边际分析表明,低边际样本通常在前几名候选答案中包含真实答案,这促使MQRA缩小候选集并仅针对保留的候选答案重新观看视频。在验证集上,MQRA相比ASC进一步提升,表明低边际投票分布可以提供有用的不确定性信号。然而,在测试集上,MQRA相对于ASC略微降低了性能,表明重新仲裁对触发子集的大小和类别分布敏感。因此,我们的最终测试提交使用了不带重新仲裁的ASC,在验证集上达到72.73的平均准确率和78.34的类别宏平均准确率,在测试集上达到81.16的平均准确率和80.91的类别宏平均准确率。本报告详细介绍了我们的提示策略、实现设置、消融研究和诊断分析。代码可在https://github.com/data-analytics-labo/ASC-MQRA获取。

英文摘要

In this report, we present our solution for Track 2 of the CVPR 2026 VidLLMs Challenge. This track evaluates visual relational reasoning in videos, where models must infer relations that are not always explicitly visible. We propose Answer Self-Consistency with Margin-Triggered Question Re-Arbitration (ASC-MQRA), a training-free test-time reasoning framework built on a multimodal reasoning model. The core ASC component performs multiple stochastic video question-answering runs and aggregates their answer choices through answer-level self-consistency. This substantially improves over single-pass inference and forms our final test submission. We further study MQRA, a conditional re-arbitration module for low-margin examples where the first-stage vote distribution indicates uncertainty. Our vote-margin analysis shows that low-margin examples often retain the ground-truth answer among the top candidates, motivating MQRA to narrow the candidate set and re-watch the video only over the retained candidates. On validation, MQRA further improves over ASC, indicating that low-margin vote distributions can provide a useful uncertainty signal. On test, however, MQRA slightly degrades performance relative to ASC, suggesting that re-arbitration is sensitive to the size and category distribution of the triggered subset. Our final test submission therefore uses ASC without re-arbitration, achieving 72.73 average accuracy and 78.34 category-wise macro average accuracy on validation, and 81.16 average accuracy and 80.91 category-wise macro average accuracy on test. This report details our prompting strategy, implementation setup, ablation studies, and diagnostic analyses. The code is available at https://github.com/data-analytics-labo/ASC-MQRA
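The vote aggregation and margin trigger described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `aggregate_votes`, the `margin_threshold` parameter, and the choice to keep the top two answers as retained candidates are all assumptions made for the example.

```python
from collections import Counter


def aggregate_votes(answers, margin_threshold=2):
    """Majority-vote over sampled answers and flag low-margin cases.

    `margin_threshold` is a hypothetical parameter: when the gap between the
    two highest vote counts is below it, the example is flagged for a second
    look (re-arbitration).
    """
    counts = Counter(answers)
    ranked = counts.most_common()  # [(answer, votes), ...], highest votes first
    top_answer, top_votes = ranked[0]
    runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
    margin = top_votes - runner_up_votes
    # Assumption: a second stage would only reconsider the top two answers.
    candidates = [answer for answer, _ in ranked[:2]]
    needs_rearbitration = margin < margin_threshold
    return top_answer, margin, needs_rearbitration, candidates


# Five stochastic runs per question (made-up answer letters).
confident = aggregate_votes(["B", "B", "B", "A", "B"])
uncertain = aggregate_votes(["A", "C", "A", "C", "C"])
```

In the second call the vote is split 3 to 2, so the margin is 1 and the example is flagged; in the first call the margin is 3 and the majority answer is accepted directly.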

2606.04319 2026-06-04 cs.GR cs.CV 版本更新

PureLight: Learning Complex Luminaires with Light Tracing

PureLight: 使用光追踪(light tracing)学习复杂光源

Pedro Figueiredo, Zixuan Li, Beibei Wang, Miloš Hašan, Nima Khademi Kalantari

发表机构 * Texas A&M University(德克萨斯农工大学) Nankai University(南开大学) Nanjing University of Science and Technology(南京理工大学) NVIDIA(NVIDIA公司)

AI总结 提出一种基于神经网络的公式,通过光追踪(light tracing)和归一化流网络学习复杂光源的辐射分布,并蒸馏为轻量级MLP以实现高效渲染。

Comments 9 pages, 10 figures

详情
AI中文摘要

我们提出了一种神经公式来估计复杂光源的外观。我们专注于具有复杂光传输(例如,被多个镜面层包围的小型发射器)的具有挑战性的光源,这些光源对于(双向)路径追踪来说很难处理。为此,我们使用光追踪(light tracing)从发射器出发构建到达出射表面的路径,并将外观估计公式化为一个分布学习问题。具体来说,我们使用一个大型归一化流网络对出射表面上的出射辐射概率密度函数(pdf)进行建模,并将出射辐射恢复为估计的pdf与通量的乘积。为了实现高效推理,我们将学习到的外观蒸馏到一个轻量级MLP中,该MLP直接估计出射表面上的辐射。我们还训练了一个采样网络用于从光源进行有效的直接照明计算,以及一个混合网络将光源合成到场景中。我们的公式使得在任意场景中使用低样本数渲染具有挑战性的光源成为可能。

英文摘要

We propose a neural formulation for estimating the appearance of complex luminaires. We focus on challenging luminaires with complex light transport (e.g., small emitters enclosed by multiple specular layers) that are difficult for (bidirectional) path tracing. To this end, we use light tracing to construct paths from emitters to the exit surfaces and formulate appearance estimation as a distribution learning problem. Specifically, we model the probability density function (pdf) of outgoing radiance on the exit surfaces using a large normalizing flow network, and recover the outgoing radiance as the product of the estimated pdf and flux. To enable efficient inference, we distill the learned appearance into a lightweight MLP that directly estimates radiance on the exit surfaces. We additionally train a sampling network for effective direct illumination computation from the luminaire, and a blending network to composite the luminaire into the scene. Our formulation makes it feasible to render challenging luminaires using low sample counts in arbitrary scenes.

2606.04301 2026-06-04 cs.CV 版本更新

XSSR: Cross-Domain Self-Supervised Representative Selection for Efficient Annotation in Medical Image Segmentation

XSSR: 跨域自监督代表性选择用于医学图像分割中的高效标注

Byunghyun Ko, Aleksei Anisimov, Kobe Ke, Suhas Bharthepude, Jeongkyu Lee

发表机构 * Northeastern University, San Jose, CA 95113, USA(东北大学,圣何塞,CA 95113,美国) Northeastern University, New York, NY 10021, USA(东北大学,纽约,NY 10021,美国)

AI总结 提出XSSR框架,通过自监督学习在目标域中自动选择代表性样本进行标注,在仅使用5%标注预算时达到接近全数据性能。

Comments Accepted to the Third International Conference on AI in Healthcare (AIiH 2026). This is the preprint version of the paper

详情
AI中文摘要

获取标注医学图像数据是资源密集型的,而在跨域场景中,源域和目标域在成像设备、人群或临床站点上存在差异,这一挑战进一步加剧。本研究引入了XSSR(跨域自监督代表性选择),一个旨在最小化目标域标注工作同时保持稳健分割性能的框架。XSSR包括三个阶段:首先,在无标签源数据上训练掩码自编码器(MAE),以建立共享嵌入空间,无需目标标签;其次,贪婪选择算法基于复合密度、新颖性和多样性标准对无标签目标样本进行评分;第三,仅在所选子集上训练U-Net分割模型。新颖性-多样性权衡参数alpha通过最小化嵌入空间覆盖自动校准,消除了手动调整。我们在三个公开基准上评估XSSR:胸部X光、RIGA+视网膜眼底成像和多站点前列腺MRI,每个基准在固定的5%标注预算下。XSSR在胸部X光上仅使用22个标注样本就达到了全数据性能的99.3%,在前列腺MRI上比随机选择高出最多2.5个Dice点,并在所有数据集上始终比CoreSet基线高出0.4到1.2个Dice点。消融研究表明多样性是最有影响力的评分组成部分,按站点分析表明性能与源域的扫描仪相似性相关。

英文摘要

Acquiring labeled medical image data is resource-intensive, a challenge further exacerbated in cross-domain scenarios where source and target datasets differ in imaging equipment, population, or clinical site. This study introduces XSSR (Cross-Domain Self-Supervised Representative Selection), a framework designed to minimize annotation effort in the target domain while maintaining robust segmentation performance. XSSR comprises three stages: first, a Masked Autoencoder (MAE) is trained on unlabeled source data to establish a shared embedding space without requiring target labels; second, a greedy selection algorithm scores unlabeled target samples based on a composite density, novelty, and diversity criterion; and third, a U-Net segmentation model is trained exclusively on the selected subset. The novelty-diversity trade-off parameter, alpha, is automatically calibrated by minimizing embedding-space coverage, eliminating manual tuning. We evaluate XSSR on three public benchmarks: Chest X-ray, RIGA+ retinal fundus imaging, and multi-site Prostate MRI, each under a fixed 5% annotation budget. XSSR achieves 99.3% of full-data performance on Chest X-ray using only 22 labeled samples, surpasses random selection by up to 2.5 Dice points on Prostate MRI, and consistently outperforms the CoreSet baseline by 0.4 to 1.2 Dice points across all datasets. Ablation studies indicate that diversity is the most influential scoring component, and per-site analysis shows that performance correlates with scanner similarity to the source domain.

2606.04299 2026-06-04 cs.CV cs.LG 版本更新

Efficient and Training-Free Single-Image Diffusion Models

高效且无需训练的单图像扩散模型

Haojun Qiu, Kiriakos N. Kutulakos, David B. Lindell

发表机构 * Department of Computer Science, University of Toronto(多伦多大学计算机科学系) Vector Institute(向量研究所)

AI总结 提出一种基于多尺度补丁数据集的无训练单图像扩散模型,通过闭式最优去噪器实现高效生成,达到与训练模型相当的质量和多样性。

Comments CVPR 2026; Project Page: https://haojunqiu.github.io/efficient-SID/

详情
AI中文摘要

我们考虑生成图像的问题,其内部结构——由多尺度补丁分布定义——与单个参考图像匹配。最近的方法通过训练单图像扩散模型来解决这个问题。但即使在这种设置下,训练计算成本高昂且需要数小时的优化。相反,我们使用不同尺度下的图像补丁数据集对图像进行建模。由于该数据集是有限的,且其补丁的维度较小,可以使用最优的闭式去噪器可计算地获得噪声补丁的得分函数,从而消除了神经网络训练的需要。我们将这种基于补丁的去噪器集成到一个高效、无需训练的图像扩散模型中,并描述了我们的方法如何与经典的基于补丁的图像恢复技术相联系。与训练过的单图像扩散模型相比,我们的方法实现了最先进的生成质量和多样性,并展示了应用,包括无条件图像生成、文本引导风格化、图像对称化和重定向。此外,我们展示了我们的方法与潜在空间扩散兼容,并展示了多种额外的加速技术,以实现一秒内的百万像素单图像生成和几分钟内的十亿像素生成。

英文摘要

We consider the problem of generating images whose internal structure -- defined by the distribution of patches across multiple scales -- matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.
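To make the closed-form denoiser concrete: when clean patches are drawn uniformly from a finite set and corrupted with Gaussian noise of standard deviation sigma, the posterior mean is a softmax-weighted average of the stored patches, and the score follows from Tweedie's formula. The sketch below shows that standard result in pure Python on toy 2-D "patches". It is not the paper's code; the function names and the toy data are assumptions, and the paper's multi-scale handling and acceleration techniques are not reproduced.

```python
import math


def optimal_denoise(noisy_patch, patches, sigma):
    """Posterior mean E[x | y] for x uniform over `patches`, y = x + sigma * N(0, I).

    The result is a softmax-weighted average of the dataset patches, with
    log-weights -||y - x_i||^2 / (2 * sigma^2).
    """
    logits = [
        -sum((y - x) ** 2 for y, x in zip(noisy_patch, patch)) / (2.0 * sigma ** 2)
        for patch in patches
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [
        sum(w * patch[d] for w, patch in zip(weights, patches)) / total
        for d in range(len(noisy_patch))
    ]


def score(noisy_patch, patches, sigma):
    """Score of the noisy patch distribution: (E[x | y] - y) / sigma^2."""
    denoised = optimal_denoise(noisy_patch, patches, sigma)
    return [(d - y) / sigma ** 2 for d, y in zip(denoised, noisy_patch)]


# Toy dataset of two 2-D patches (illustrative values only).
patches = [[0.0, 0.0], [1.0, 1.0]]
near_first = optimal_denoise([0.1, 0.0], patches, sigma=0.1)  # snaps to patch 0
midpoint = optimal_denoise([0.5, 0.5], patches, sigma=0.5)    # equal weights
```

At low noise the denoiser collapses onto the nearest stored patch, while a point equidistant from both patches is mapped to their average, which is why no network has to be trained to obtain the score.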

2606.04291 2026-06-04 cs.CV 版本更新

A Cookbook of 3D Vision: Data, Learning Paradigms, and Application

3D视觉食谱:数据、学习范式与应用

Hongyang Du, Zongxia Li, Dawei Liu, Runhao Li, Haoyuan Song, Qingyu Zhang, Yubo Wang, Jingcheng Ni, Shihang Gui, Congchao Dong, Tao Hu

发表机构 * Brown University(布朗大学) University of Maryland, College Park(马里兰大学帕克分校) University of Pennsylvania(宾夕法尼亚大学) University of Southern California(南加州大学) New York University(纽约大学) The University of Sydney(悉尼大学) Stability AI

AI总结 本文提出一种以数据为中心的3D视觉分类法,通过分析点云、网格、体素和3D高斯等几何表示及其获取流程,以及数据集设计、基准构建和监督机制,统一了表示、学习范式与下游任务(重建、生成、视频建模)之间的关系。

Comments Accepted to the CVPR 2026 OpenSUN3D Workshop. Official version available at CVF Open Access. https://openaccess.thecvf.com/content/CVPR2026W/OpenSUN3D/html/Du_A_Cookbook_of_3D_Vision_Data_Learning_Paradigms_and_Application_CVPRW_2026_paper.html

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026
AI中文摘要

3D视觉在日益多样化的数据表示、学习范式和建模策略的推动下迅速发展。然而,该领域在表示和基准测试方面仍然分散,难以形成关于效率、保真度和可扩展性的统一视角。本文提供了一种以数据为中心的3D视觉分类法,将几何表示、数据集、学习框架和应用连接在一个单一的概念图中。我们首先分析3D数据的主要结构表示——点云、网格、体素和3D高斯——及其获取流程。然后,我们研究数据集设计、基准构建和监督机制如何塑造最近的进展,涵盖2D监督的3D学习、隐式神经表示和4D世界建模。通过这种整合视角,我们阐明了表示、学习范式与下游任务(重建、生成和视频建模)之间的关系,提供了关于平衡效率与保真度以及多模态几何基础的新兴趋势的统一观点。

英文摘要

3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and modeling strategies. Yet the field remains fragmented across representations and benchmarks, making it difficult to develop unified perspectives on efficiency, fidelity, and scalability. This work provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within a single conceptual map. We begin by analysing the principal structural representations of 3D data--point clouds, meshes, voxels, and 3D Gaussians--along with their acquisition pipelines. We then examine how dataset design, benchmark construction, and supervision regimes shape recent advances, spanning 2D-supervised 3D learning, implicit neural representations, and 4D world modeling. Through this integrative lens, we clarify the relationships among representations, learning paradigms, and downstream tasks in reconstruction, generation, and video modeling, offering a consolidated view of emerging trends toward balancing efficiency and fidelity and toward multimodal geometric grounding.

2606.04282 2026-06-04 cs.CV 版本更新

FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

FindIt:面向通用多模态大语言模型的格式感知视觉检测基准

Eshika Khandelwal, Jingjing Pan, Mingfang Zhang, Quan Kong, Lorenzo Garattoni, Hilde Kuehne

发表机构 * Tuebingen AI Center, University of Tuebingen(图宾根人工智能中心,图宾根大学) Woven by Toyota, Inc.(Woven by Toyota公司) Toyota Motor Europe(丰田汽车欧洲公司)

AI总结 提出首个全面评估通用多模态大语言模型在可提示定位能力上的基准,涵盖四种核心任务,并标准化输入输出格式,揭示模型对格式约束的敏感性。

详情
AI中文摘要

多模态大语言模型(MLLMs)主要在自由形式的视觉语言任务(如视觉问答、图像描述和摘要)上进行评估。然而,它们的实际应用正在迅速扩展到更结构化的计算机视觉场景,用户提示模型执行以定位为中心的任务(如目标检测),通常是在更大的智能体或决策系统中。尽管发生了这种转变,但目前还没有标准化的基准来系统地大规模评估这些能力。在这项工作中,我们引入了第一个专门设计用于评估通用MLLMs可提示定位能力的全面基准。我们的基准涵盖四个核心任务类别:目标检测、指代表达检测、实例级检测和基于视频的检测。为了实现一致和公平的评估,我们开发了一个统一框架,标准化输入,强制可解析的边界框输出,并定义了跨任务的透明评估协议。使用该套件,我们评估了多种开源和专有MLLMs,深入分析了它们的性能和局限性。除了准确性,我们还检查了模型遵守输出格式规范的能力,表明当前系统对格式约束高度敏感,并且即使面对微小变化也常常无法泛化。我们的结果突出了最先进的MLLMs在定位设置中的优势和缺点,并指出了改进多模态模型设计和评估的重要方向。

英文摘要

Multimodal large language models (MLLMs) are predominantly evaluated on free-form vision-language tasks such as visual question answering, captioning, and summarization. However, their practical use is rapidly expanding to more structured computer vision settings, where users prompt models to perform localization-centric tasks such as object detection, often within larger agentic or decision-making systems. Despite this shift, there is currently no standardized benchmark that systematically evaluates these capabilities at scale. In this work, we introduce the first comprehensive benchmark specifically designed to assess the promptable localization abilities of generalist MLLMs. Our benchmark spans four core task categories: object detection, referring expression detection, instance-level detection, and video-based detection. To enable consistent and fair evaluation, we develop a unified framework that standardizes inputs, enforces parsable bounding box outputs, and defines transparent evaluation protocols across tasks. Using this suite, we evaluate a diverse set of open-source and proprietary MLLMs, providing an in-depth analysis of their performance and limitations. Beyond accuracy, we examine models' ability to adhere to output format specifications, showing that current systems are highly sensitive to formatting constraints and often fail to generalize even to minor variations. Our results highlight both the strengths and shortcomings of state-of-the-art MLLMs in localization settings, and point toward important directions for improving multimodal model design and evaluation.
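The sensitivity to output format is easy to see with a strict parser. The sketch below assumes a hypothetical required format of `[x1, y1, x2, y2]` in pixel coordinates; the benchmark's actual format specification and matching rules are not given in the abstract. It pairs the parser with the standard intersection-over-union measure commonly used to compare a predicted box with a reference box.

```python
import re

# Hypothetical required format: [x1, y1, x2, y2] with numeric entries.
NUMBER = r"(-?\d+(?:\.\d+)?)"
BOX_PATTERN = re.compile(
    r"\[\s*" + NUMBER + r"\s*,\s*" + NUMBER + r"\s*,\s*"
    + NUMBER + r"\s*,\s*" + NUMBER + r"\s*\]"
)


def parse_boxes(text):
    """Extract every well-formed [x1, y1, x2, y2] box from model output."""
    return [
        tuple(float(value) for value in match.groups())
        for match in BOX_PATTERN.finditer(text)
    ]


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


strict = parse_boxes("The cat is at [10, 20, 110, 220].")
loose = parse_boxes("The cat is at (10, 20, 110, 220).")  # wrong brackets
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

The second response contains the same coordinates but uses parentheses, so a strict parser recovers nothing from it. Under such a parser, a response like this would be scored as a miss even though the coordinates are correct.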

2606.04271 2026-06-04 cs.CV cs.AI 版本更新

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

StandardE2E:端到端自动驾驶数据集的统一框架

Stepan Konev

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出StandardE2E框架,通过统一数据模式、多数据集联合加载和简化新数据集添加流程,解决端到端自动驾驶数据集格式不兼容问题。

详情
AI中文摘要

自动驾驶已从模块化的感知-预测-规划堆栈转向端到端(E2E)模型,这些模型直接将传感器输入映射到车辆控制,通常通过辅助任务(如3D检测、运动预测和高清地图感知)进行正则化。进展由快速增长的传感器丰富驾驶数据集生态系统驱动,但每个数据集都有自己的文件格式、API、坐标约定和模态覆盖范围,导致跨数据集实验甚至基本的每个数据集预处理都需要为每个项目重新实现。我们提出StandardE2E,一个为E2E驾驶数据集提供统一接口的框架。StandardE2E (i) 在共享数据模式下标准化每个数据集的预处理;(ii) 在单个PyTorch DataLoader中组合多个数据集,用于跨数据集预训练、辅助任务监督和场景级过滤;(iii) 将添加新数据集简化为从原始帧到规范模式的单个数据集映射,而整个下游流程保持不变。该框架开箱即支持六个数据集:Waymo End-to-End、Waymo Perception、Argoverse 2 Sensor、Argoverse 2 LiDAR、NAVSIM (OpenScene-v1.1) 和 WayveScenes101,并作为开源的 standard-e2e Python包发布,可在 https://github.com/stepankonev/StandardE2E 获取。

英文摘要

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.

2606.04269 2026-06-04 cs.RO cs.AI cs.CV 版本更新

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Instant-Fold: 可变形物体操作的情境模仿学习

Yilong Wang, Cheng Qian, Edward Johns

发表机构 * The Robot Learning Lab(机器人学习实验室) Imperial College London(伦敦帝国学院)

AI总结 提出Instant-Fold框架,通过单次人类演示的情境模仿学习,无需梯度更新即可推断并执行多种可变形物体操作模式,在仿真训练后零样本迁移到真实世界。

详情
AI中文摘要

可变形物体操作(DOM)具有挑战性,因为其状态是高维、部分可观测的,并且通过长时间跨度、拓扑变化的交互演变,涉及多种有效的操作模式。我们引入了Instant-Fold,一个用于DOM的情境模仿学习框架。给定单次人类演示,我们的策略直接从演示中推断并执行多种操作模式,包括空间执行和顺序的变化,无需梯度更新。我们的方法首先通过时间对比预训练学习变形感知的视觉表示,然后基于演示的条件流匹配变换器策略预测执行预期操作模式的动作。完全在仿真中训练的Instant-Fold能够泛化到多种折叠模式,并零样本迁移到真实世界环境,无需额外的数据收集或微调。视频可在https://instant-fold.github.io获取。

英文摘要

Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.

2606.04264 2026-06-04 cs.CV 版本更新

UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation

UniCanvas: 一种基于扩散的图文联合生成统一模型

Zeyuan Yang, Hao-Wei Chen, Xueyang Yu, Yuncong Yang, Haoyu Zhen, Ziqiao Ma, Maohao Shen, Chuang Gan

发表机构 * UMass Amherst(马萨诸塞大学阿默斯特分校) University of Michigan(密歇根大学) MIT(麻省理工学院)

AI总结 提出UniCanvas,通过扩散模型在像素画布上以文本嵌入图像的方式实现图文联合生成,解决现有模型在视觉与文本生成上的不足。

详情
AI中文摘要

近年来,在单一架构内同时处理多模态理解与生成的统一视觉语言模型取得了显著进展。虽然自回归VLM能够跨模态推理,但无法生成高质量图像。相比之下,扩散模型能生成逼真的视觉效果,却难以生成连贯的文本,这使得开发一个能无缝处理视觉和文本生成的统一模型变得具有挑战性。最近的进展表明,语言可以有效地嵌入到视觉表示中,使模型能够直接从图像中推理文本语义。为此,我们提出了UniCanvas,这是首次尝试通过文本图像生成来统一扩散模型以生成交错多模态内容。扩散模型自然地捕捉共享像素画布上的变换,这可以视为视觉变化的世界模型。该模型不是生成离散的文本标记,而是学习将语言表示为图像内部的视觉模式,利用其固有的多模态嵌入空间。这种设计使得模型在图像合成过程中能够在单个像素画布上自然地“绘制”文本,实现无缝的多模态生成。实验表明,UniCanvas在性能上优于先前的统一模型,将基于扩散模型的文本图像生成定位为一种有前景的统一多模态生成范式。

英文摘要

Recent years have seen remarkable progress in unified vision-language models handling both multimodal understanding and generation within a single architecture. While autoregressive VLMs can reason across modalities, they fail to generate high-quality images. In contrast, diffusion models produce photorealistic visuals yet struggle to generate coherent text, making it challenging to develop a single unified model that can seamlessly handle both visual and text generation. Recent advances suggest that language can be effectively embedded within visual representations, allowing models to reason about textual semantics directly from images. To this end, we propose UniCanvas, a first attempt that unifies diffusion models to generate interleaved multimodal contents through text-in-image generation. Diffusion models naturally capture transformations on a shared pixel canvas, which can be viewed as world models of visual change. Instead of producing discrete text tokens, the model learns to represent language as visual patterns inside images, leveraging its inherent multimodal embedding space. This design allows the model to "draw" text naturally within a single pixel canvas during image synthesis, achieving seamless multimodal generation. Experiments demonstrate that UniCanvas improves performance over previous unified models, positioning text-in-image generation with diffusion models as a promising unified multimodal generation paradigm.

2606.04261 2026-06-04 cs.AI cs.CL cs.CV cs.ET cs.LG 版本更新

Can Generalist Agents Automate Data Curation?

通用智能体能否自动化数据筛选?

Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia

发表机构 * Virginia Tech(弗吉尼亚理工大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出Curation-Bench基准,通过通用编码智能体自动化数据筛选循环,实验表明现成智能体可达到强基线,但存在执行-研究差距,而结构化方法引导的智能体能在十分之一数据预算下自主组合出优于强基线的数据选择策略。

Comments Preprint

详情
AI中文摘要

训练数据的筛选是现代AI开发中最重要但劳动密集的部分之一:实践者根据嘈杂的基准反馈迭代地提出、实施、评估和修订数据策略。我们探究通用编码智能体能否自动化这一数据筛选循环。我们引入了*Curation-Bench*,一个以智能体为中心的基准,它固定模型、训练配方和评估套件,同时赋予智能体命令行权限以检查数据、实施策略、提交到固定的训练/评估流水线并进行修订。在视觉-语言指令微调实例中,现成智能体在十次迭代内达到了已发表的强数据选择基线。然而,轨迹分析揭示了持续的*执行-研究差距*:即使提供了策略指南和论文参考,智能体主要调整局部策略变体,而非探索新的策略家族。要求每次迭代引用、实例化和改编先前方法的框架将智能体转向方法引导的探索。这种框架化的智能体自主组合——无需人工设计输入——一种数据选择策略,在十分之一的数据预算下优于已发表的强基线。总体而言,当前智能体可以运行筛选循环,但可靠的数据研究需要框架化的方法适应,而非仅靠开放式提示。代码和基准已开源。

英文摘要

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

2606.04251 2026-06-04 cs.CV 版本更新

SBP-Net: Learning Thin Structure Reconstruction with Sliding-Box Projections

SBP-Net: 基于滑动盒投影的薄结构重建学习

Ofir Gilad, Andrei Sharf

发表机构 * Faculty of Computer and Information Science, Ben Gurion University of the Negev(内盖夫本-古里安大学计算机与信息科学学院)

AI总结 针对薄3D结构稀疏、尺度变化和复杂几何带来的重建挑战,提出一种基于局部深度投影的SBP-Net方法,通过滑动盒生成局部正交深度投影并用神经网络重建缺失薄结构,再融合回3D模型,在肺动脉和工业管道重建中优于现有方法。

Comments Accepted to IEEE ICIP 2026, 6 pages, 4 figures

详情
AI中文摘要

重建薄3D结构因其稀疏性、尺度变化和复杂几何而具有挑战性。这类结构出现在广泛领域,包括血管系统的医学成像和工业管道系统。虽然最近的神经方法在密集表面上表现良好,但常常无法恢复精细的薄几何形状。我们提出了一种基于局部深度投影的重建方法,该方法为薄结构提供了高效且信息丰富的2D表示。具体来说,我们使用滑动盒遍历3D模型以生成局部正交深度投影,然后由神经网络处理以在2D中重建缺失的薄结构。随后,局部重建结果被融合回3D模型,以产生连贯且详细的形状。在CT体积的肺动脉重建以及合成和真实扫描的工业管道恢复上的实验表明,与现有方法相比,该方法更好地保留了精细结构细节。

英文摘要

Reconstructing thin 3D structures is challenging due to their sparsity, scale variation, and complex geometry. Such structures arise in a wide range of domains, including medical imaging of vascular systems and industrial pipe systems. While recent neural methods perform well on dense surfaces, they often fail to recover fine thin geometries. We propose a reconstruction approach based on local depth projections, which provide an efficient and informative 2D representation of thin structures. Specifically, we traverse the 3D model with a sliding box to generate local orthographic depth projections, which are processed by a neural network to reconstruct missing thin structures in 2D. The local reconstructions are subsequently fused back into the 3D model to produce a coherent and detailed shape. Experiments on pulmonary artery reconstruction from CT volumes and industrial pipeline recovery from synthetic and real scans demonstrate improved preservation of fine structural details over existing methods.
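The sliding-box projection step can be illustrated on a point set. The sketch below is an assumption-laden toy, not the paper's pipeline: it uses axis-aligned cubic boxes, projects along the +z axis only, keeps the nearest depth per pixel, and leaves empty pixels as `None`. The paper's box size, stride, projection directions, neural network, and fusion step are not specified in the abstract and are not modelled here.

```python
def depth_projection(points, box_min, box_size, resolution):
    """Orthographic depth map (viewed along +z) of the points inside one cubic box.

    Returns a resolution x resolution grid; each cell holds the smallest
    normalised depth in [0, 1) of any point falling into it, or None if empty.
    """
    depth = [[None] * resolution for _ in range(resolution)]
    x0, y0, z0 = box_min
    for x, y, z in points:
        u = (x - x0) / box_size
        v = (y - y0) / box_size
        w = (z - z0) / box_size
        if not (0.0 <= u < 1.0 and 0.0 <= v < 1.0 and 0.0 <= w < 1.0):
            continue  # point lies outside this box
        row, col = int(v * resolution), int(u * resolution)
        if depth[row][col] is None or w < depth[row][col]:
            depth[row][col] = w  # keep the surface nearest to the viewer
    return depth


def sliding_box_origins(extent, box_size, stride):
    """Integer origins of cubic boxes that tile the volume [0, extent)^3."""
    steps = range(0, extent - box_size + 1, stride)
    return [(x, y, z) for x in steps for y in steps for z in steps]


# Toy point cloud: two points stacked in depth, one elsewhere, one outside the box.
points = [(0.5, 0.5, 1.0), (0.5, 0.5, 3.0), (2.5, 0.5, 2.0), (5.0, 5.0, 5.0)]
depth = depth_projection(points, box_min=(0, 0, 0), box_size=4.0, resolution=4)
origins = sliding_box_origins(extent=8, box_size=4, stride=4)
```

Each box yields a small 2D image in which a thin tube appears as a connected stroke, so a 2D network can fill gaps in the image before the result is mapped back into 3D.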

2606.04249 2026-06-04 cs.CV eess.IV 版本更新

Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement

基于潜空间运动跟踪的单次测量前瞻性动态3D MRI重建

Lixuan Chen, Zhongnan Liu, Jesse Hamilton, James M. Balter, Jeong Joon Park, Liyue Shen

发表机构 * University of Michigan(密歇根大学)

AI总结 提出PDMR框架,通过离线学习运动场的低维潜流形并采用三平面表示实现高效编码,从单次测量中实现高保真、时间一致的前瞻性动态3D MRI重建。

详情
AI中文摘要

前瞻性重建在许多临床应用中至关重要,例如MRI引导的放射治疗,这需要从当前获取的测量中实现精确的图像重建和快速运动估计。然而,由于超稀疏采样和严格的延迟要求,前瞻性重建仍然具有挑战性。在这项工作中,我们提出了PDMR,一种具有潜空间运动跟踪的前瞻性动态3D MRI重建框架。我们的核心思想是离线学习一个高效且可泛化的运动场潜流形,从而实现快速在线自适应以进行前瞻性重建。具体来说,我们将变形矢量场(DVF)参数化在低维流形上,有效减少了快速在线自适应的搜索空间,并采用三平面表示实现几何感知和内存高效的3D运动编码。在XCAT数字体模和内部腹部MRI数据集上的实验表明,PDMR在多个前瞻性场景(立即和2分钟后)中实现了高保真和时间一致的重建,优于最先进的回顾性和在线方法。我们的结果为临床实践中实现超快速、运动感知的前瞻性MRI重建提供了一条有前景的途径。

英文摘要

Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, a Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments on both XCAT digital phantoms and in-house abdominal MRI datasets demonstrate that PDMR achieves high-fidelity and temporally consistent reconstruction across multiple prospective scenarios (Immediate and After-2min), outperforming state-of-the-art retrospective and online methods. Our results suggest a promising pathway toward ultra-fast, motion-aware prospective MRI reconstruction in clinical practice.

2606.04244 2026-06-04 cs.AI cs.CL cs.CV cs.LG 版本更新

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

VAMPS: 视觉辅助数学问题求解基准

Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出VAMPS基准,通过1,168道双语多选题评估多模态大模型在借助绘图工具进行数学推理时的表现,发现直接解析求解优于工具辅助视觉求解。

详情
AI中文摘要

多模态大语言模型在复杂推理方面能力日益增强,但当它们必须通过工具外部化问题然后基于工具输出进行推理时,尤其是在依赖视觉辅助的情况下,其性能往往会下降。这一差距尤为重要,因为真实的工程和科学工作流程通常依赖可视化工具进行分析、验证和决策。为了研究这一差异,我们引入了VAMPS(视觉辅助数学问题求解),一个用于图辅助数学的基准。VAMPS包含1,168个多模态、双语选择题问答对,这些题目来自伊朗大学入学考试的代数和微积分问题,并通过人工审核的LLM生成的合成变体进行了扩展,所有题目都经过精心挑选,使得绘图能够通过揭示交点、极值、渐近线等提供自然的求解策略。VAMPS旨在用于基准测试和诊断,它超越了以往主要评估在固定视觉输入上进行推理的多模态基准,通过测试模型是否能够从构建有用的图形中受益并将其答案基于结果可视化。总体而言,我们发现,在一组多样化的模型中,直接解析求解出人意料地优于工具辅助的视觉求解,即使在绘图是自然策略的问题上也是如此。

英文摘要

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

2606.04240 2026-06-04 cs.CV cs.AI cs.CL 版本更新

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

EReL@MIR 2025 多模态文档检索挑战赛(赛道1)概述

Jingbiao Mei

发表机构 * University of Cambridge, Cambridge, United Kingdom(英国剑桥,剑桥大学)

AI总结 本文介绍了EReL@MIR 2025多模态文档检索挑战赛(赛道1)的设计、数据集、评估协议、最终排名及前三名获胜系统的分析,所有系统均基于Qwen2-VL系列解码器多模态大语言模型嵌入器。

Comments MDR Challenge Report at WWW2025

详情
AI中文摘要

对于视觉丰富的文档(即文本与图形、表格和图表交织的页面)的检索,对于多模态检索增强生成至关重要,然而大多数检索器仍然丢弃视觉通道。多模态文档检索挑战赛是首届EReL@MIR研讨会(与2025年万维网会议同期举办)中MIR挑战赛的赛道1,要求参与者构建一个单一检索系统,处理两种互补的场景:基于文本查询在长文档内进行封闭集文档页面检索(MMDocIR),以及基于图像或图像加文本查询进行开放域维基百科风格段落检索(M2KR)。系统根据两个任务上平均Recall@{1,3,5}的宏平均值进行排名。该挑战赛吸引了来自22个团队的455名参赛者和586份提交。本报告描述了挑战赛的设计、数据集和评估协议;报告了最终排名;并分析了三个获胜团队的系统。所有三个系统都基于Qwen2-VL系列的解码器多模态大语言模型嵌入器,而非CLIP风格的编码器,主要区别在于它们是通过微调集成、无训练的多路融合与强视觉语言重排序器,还是零样本后期交互达到顶尖水平。无训练系统与微调获胜者的得分差距在0.1分以内。

英文摘要

Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
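One plausible reading of the ranking metric is sketched below: for each task, Recall@k is averaged over queries for k in {1, 3, 5}, the three values are averaged, and the two task scores are then averaged with equal weight. This interpretation, the assumption of exactly one relevant item per query, and all identifiers and sample data are illustrative; the official evaluation script may differ.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the single relevant item appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def task_score(queries, ks=(1, 3, 5)):
    """Mean over k of the query-averaged Recall@k.

    `queries` is a list of (ranked_ids, relevant_id) pairs.
    """
    per_k = []
    for k in ks:
        hits = [recall_at_k(ranked, relevant, k) for ranked, relevant in queries]
        per_k.append(sum(hits) / len(hits))
    return sum(per_k) / len(per_k)


def challenge_score(task_a_queries, task_b_queries):
    """Macro-average: each task counts equally, regardless of its query count."""
    return (task_score(task_a_queries) + task_score(task_b_queries)) / 2.0


# Made-up retrieval results for the two tasks.
mmdocir = [(["p3", "p1", "p7"], "p1"), (["p2", "p9", "p4"], "p2")]
m2kr = [(["d5", "d6", "d7", "d8", "d1"], "d1")]
final = challenge_score(mmdocir, m2kr)
```

Because the average is taken per task first, a task with few queries weighs as much as one with many, which is the point of macro-averaging.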

2606.04205 2026-06-04 cs.MM cs.AI cs.CL cs.CV cs.LG cs.SD 版本更新

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

DetectZoo:一个用于跨文本、音频和图像模态的AI生成内容检测的统一工具包

Sajad Ebrahimi, Nima Jamali, Bardia Shirsalimian, Kelly McConvey, Wentao Zhang, Jalehsadat Mahdavimoghaddam, Maksym Taranukhin, Maura Grossman, Vered Shwartz, Yuntian Deng, Ebrahim Bagheri

发表机构 * University of Toronto(多伦多大学) University of Waterloo(滑铁卢大学) Toronto Metropolitan University(多伦多都会大学) University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所)

AI总结 提出DetectZoo,一个首个统一的多模态AI生成内容检测工具包,通过标准化数据预处理、评估流程和集成61个检测器与22个基准数据集,实现公平可重复的基准测试。

详情
AI中文摘要

生成模型的日益普及和能力提升模糊了人类与机器生成内容之间的界限,推动了跨文本、图像和音频检测领域的大量研究。大多数现有的检测器要么是商业软件,要么是开源但带有不兼容的代码库、定制化的预处理、评估协议和评估指标,这使得它们的采用、公平比较和复现变得相当困难。为了解决这一关键差距,我们引入了DetectZoo,这是首个可扩展的工具包,旨在为跨文本、音频和图像模态的AI生成内容检测提供统一接口。DetectZoo标准化了从数据摄取和预处理到模型评估的完整实证流程,为研究人员提供了一个统一的框架来系统地基准测试最先进的检测器。通过将多样的公共数据集和基线检测算法集成到单一的统一API下,我们的工具包促进了严格且可重复的评估。DetectZoo提供了61个检测器的参考实现、22个基准数据集的原生加载器,以及一个标准化的评估流程,通过通用接口报告多个指标。每个检测器都是自包含的,但可通过同一接口访问,自动缓存预训练权重,并复现原始发表的结果。DetectZoo降低了多模态AI取证的入门门槛,使研究人员能够识别跨领域的性能差距,并加速开发鲁棒、可泛化的检测技术。开源仓库和全面文档可在https://github.com/sadjadeb/DetectZoo 获取,且可通过pip install detectzoo安装该包。

英文摘要

The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

2606.04198 2026-06-04 cs.CV 版本更新

Spatial Artifact Coherence Determines Codec Robustness in Patch-Based rPPG

空间伪影相干性决定基于补丁的rPPG中的编解码鲁棒性

Achraf Ben Ahmed

发表机构 * PlesmoSense SARL(PlesmoSense公司)

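AI summary Proposes the Spatial Artifact Coherence (SAC) metric to explain when patch-based rPPG methods outperform global-projection methods under codec compression, and designs the PatchPCA algorithm family; experiments show that SAC explains 93.8% of the between-variant variance in PCA advantage.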

详情
AI中文摘要

远程光电容积描记法(rPPG)在未压缩基准上实现了低心率误差,但在远程医疗、新生儿ICU和驾驶员疲劳应用中通过压缩视频通道部署。先前没有工作确定在编解码压缩下空间分解优于全局投影方法的物理量。我们提出空间伪影相干性(SAC),定义为4x4块间绿色通道协方差矩阵(带通0.75-2.5 Hz)的非对角能量与对角能量之比,以及PatchPCA算法族(四种编解码感知的rPPG算法)。我们在三个公共数据集上评估了280名受试者、11种编解码退化变体(MPEG-4、H.265、H.264、JPEG、色度子采样)和13种算法,通过Wilcoxon检验(BH-FDR,q < 0.05,904次检验)。SAC解释了PCA优势中93.8%的变体间方差(r = +0.969),编解码族之间零重叠:非MPEG-4变体聚集在SAC 0.10-0.18,PCA胜率84-90%;而MPEG-4变体聚集在SAC 0.48-0.59,胜率61%,平均改进降低5.8倍。在受试者内部,78%确认了预期模式(p < 10^-22,dz = 0.73)。变体内部受试者水平SAC相关性为r = +0.099,确认SAC分类编解码族而非预测个体结果。MPEG-4的影响是结构性的(宏块DCT几何,而非噪声幅度),由源编解码状态而非分辨率决定。P-Hybrid被确定为最部署鲁棒的算法。建立了PatchPCA优势的两个必要操作条件:SAC < 0.30和低到中等运动,直接排除了原始到MPEG-4转码流水线。SAC为临床远程监测系统中编解码感知的rPPG算法选择提供了物理基础度量。

英文摘要

Remote photoplethysmography (rPPG) achieves low heart-rate error on uncompressed benchmarks yet is deployed over compressed video channels in telehealth, neonatal ICU, and driver fatigue applications. No prior work identifies the physical quantity determining when spatial decomposition outperforms global-projection methods under codec compression. We propose Spatial Artifact Coherence (SAC), defined as the ratio of off-diagonal to diagonal energy in the 4x4 inter-patch Green-channel covariance matrix (bandpass 0.75-2.5 Hz), and the PatchPCA algorithm family (four codec-aware rPPG algorithms). We evaluate 280 subjects across three public datasets, 11 codec degradation variants (MPEG-4, H.265, H.264, JPEG, chroma subsampling), and 13 algorithms via Wilcoxon tests (BH-FDR, q < 0.05, 904 tests). SAC explains 93.8% of between-variant variance in PCA advantage (r = +0.969), with zero overlap between codec families: non-MPEG-4 variants cluster at SAC 0.10-0.18 with 84-90% PCA win rates, while MPEG-4 variants cluster at SAC 0.48-0.59 with 61% win rate and a 5.8x reduction in mean improvement. Within subjects, 78% confirm the expected pattern (p < 10^-22, dz = 0.73). Within-variant subject-level SAC correlation is r = +0.099, confirming SAC classifies codec families rather than predicting individual outcomes. MPEG-4's effect is structural (macroblock DCT geometry, not noise amplitude), governed by source codec state, not resolution. P-Hybrid is identified as the most deployment-robust algorithm. Two necessary operating conditions for PatchPCA advantage are established: SAC < 0.30 and low-to-moderate motion, directly ruling out raw-to-MPEG-4 transcoding pipelines. SAC provides a physically grounded metric for codec-aware rPPG algorithm selection in clinical remote monitoring systems.
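The SAC definition quoted in this abstract (off-diagonal over diagonal energy of the inter-patch Green-channel covariance matrix) can be illustrated with a small sketch. Several details are assumptions made here, not statements from the paper: "energy" is taken as the sum of squared entries, no further normalization is applied, each patch contributes one already band-passed (0.75-2.5 Hz) Green-channel time series, and all function names are hypothetical.

```python
def covariance_matrix(signals):
    """Sample covariance between patch signals (one list of samples per patch)."""
    n = len(signals[0])
    means = [sum(s) / n for s in signals]
    centered = [[x - m for x in s] for s, m in zip(signals, means)]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
        for ci in centered
    ]

def spatial_artifact_coherence(signals):
    """Off-diagonal energy divided by diagonal energy of the covariance matrix.

    Energy is assumed to mean the sum of squared matrix entries.
    """
    cov = covariance_matrix(signals)
    p = len(cov)
    diagonal = sum(cov[i][i] ** 2 for i in range(p))
    off_diagonal = sum(
        cov[i][j] ** 2 for i in range(p) for j in range(p) if i != j
    )
    return off_diagonal / diagonal

# Two uncorrelated patch signals versus two patches sharing one artifact.
a = [1.0, -1.0, 1.0, -1.0]
b = [1.0, 1.0, -1.0, -1.0]
sac_independent = spatial_artifact_coherence([a, b])  # no shared structure
sac_coherent = spatial_artifact_coherence([a, a])     # fully shared structure
```

Under these assumptions the ratio grows as patches share more of the same artifact, which is the intuition behind the reported separation between codec families; the absolute scale depends on the normalization the authors actually use.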

2606.04166 2026-06-04 cs.CV 版本更新

End-to-End Text Line Detection and Ordering

端到端文本行检测与排序

Benjamin Kiessling

发表机构 * ALMAnaCH, Inria, France(ALMAnaCH、法国国家信息与自动化研究所)

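AI summary Proposes Orli, a model that unifies text-line detection and reading-order determination as a single image-to-sequence problem, autoregressively generating baselines end to end and reaching state-of-the-art performance on a range of historical documents.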

详情
AI中文摘要

实际的历史文档文本识别流程通常将布局分析分解为行检测和单独的阅读顺序步骤,后者通常由手工编码的几何启发式方法处理,但难以应对旁注、多列、表格和特定来源的编辑惯例。本文介绍了Orli(行的有序回归),一个端到端模型,将两个子任务视为单一的图像到序列问题:从页面图像中,Orli以自回归方式直接按阅读顺序生成文本行基线。基线采用弦框架参数化表示,该参数化锚定行的位置、方向和范围,同时通过垂直偏移编码局部几何;迭代细化头和局部视觉细化器生成最终曲线。在涵盖十种书写系统的196,691页异构语料库上训练,Orli在没有数据集特定训练的情况下,略微超过了之前报道的cBAD行检测的最先进水平,在多个阅读顺序基准测试中零样本达到近乎完美的覆盖率和排序,并通过有限的微调适应更专业的域外布局。该方法的源代码和模型权重在开放许可下可从https://github.com/mittagessen/orli获取。

英文摘要

Practical text-recognition pipelines for historical documents typically decompose layout analysis into line detection followed by a separate reading-order step, with the latter most often handled by a hand-coded geometric heuristic that struggles with marginalia, multiple columns, tables, and source-specific editorial conventions. This article introduces Orli (Ordered Regression of Lines), an end-to-end model that casts both sub-tasks as a single image-to-sequence problem: from a page image, Orli autoregressively generates text-line baselines directly in reading order. Baselines are represented in a chord-frame parameterization that anchors a line's position, orientation, and extent while encoding local geometry through perpendicular offsets; an iterative refinement head and a local visual refiner produce the final curve. Trained on a heterogeneous corpus of 196,691 pages spanning ten writing systems, Orli marginally exceeds the previously reported state of the art for cBAD line detection without dataset-specific training, reaches near perfect coverage and ordering on multiple reading-order benchmarks zero-shot, and adapts to more specialized out-of-domain layouts with limited fine-tuning. The method's source code and model weights are available under an open license at https://github.com/mittagessen/orli.
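To make the chord-frame idea in this abstract concrete, here is a rough decoding sketch: a chord fixes position, orientation, and extent, and perpendicular offsets bend it into a curve. The parameter layout (start point, angle, length, evenly spaced offsets) and the function name are assumptions for illustration; the abstract does not specify Orli's exact parameterization.

```python
import math

def decode_baseline(start, angle, length, offsets):
    """Turn assumed chord-frame parameters into polyline points.

    start   -- (x, y) of the chord's first endpoint (position)
    angle   -- chord direction in radians (orientation)
    length  -- chord length in pixels (extent)
    offsets -- signed perpendicular displacements at evenly spaced
               positions along the chord (local geometry); needs >= 2 values
    """
    if len(offsets) < 2:
        raise ValueError("need at least two offsets to span the chord")
    ux, uy = math.cos(angle), math.sin(angle)  # unit vector along the chord
    nx, ny = -uy, ux                           # unit normal to the chord
    last = len(offsets) - 1
    points = []
    for i, off in enumerate(offsets):
        s = length * i / last                  # distance travelled along the chord
        points.append((start[0] + s * ux + off * nx,
                       start[1] + s * uy + off * ny))
    return points

# A horizontal 100-pixel line that bulges by 5 pixels at its midpoint.
curve = decode_baseline(start=(10.0, 20.0), angle=0.0, length=100.0,
                        offsets=[0.0, 5.0, 0.0])
```

Separating the chord from the offsets means a model can get a line's coarse placement right even when the fine curvature is still uncertain, which fits the refinement stages the abstract mentions.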

2606.04133 2026-06-04 cs.CV 版本更新

Pinpoint: Grounded Worldwide Image Geolocation via Cross-Source Retrieval and Reranking

Pinpoint: 基于跨源检索与重排序的全球图像地理定位

Nika Chuzhoy, Brian Hu, Amit A. Arora, Jae Ro, Sarthak S. Sahu

发表机构 * Virtualitics

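AI summary Proposes Pinpoint, a retrieve-and-rerank architecture that uses contrastive learning to combine Flickr photos and street-view imagery, with an attention-based reranker that exploits cross-source evidence for worldwide image geolocation; it achieves state-of-the-art results on several benchmarks.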

详情
AI中文摘要

图像地理定位旨在根据视觉内容估计照片拍摄地点。在全球范围内,由于视觉证据往往模糊、多样且分布不均,这仍然具有挑战性。先前的工作通常将普通互联网照片和街景图像的地理定位视为独立任务,尽管它们具有互补优势:互联网照片更匹配用户拍摄查询的外观分布,而街景图像提供更密集、地理覆盖更广的参考。我们提出Pinpoint,一种检索-重排序架构,以由粗到细的流程结合两种数据源。对比图像-GPS嵌入器在用户上传的Flickr照片和街景图像上训练,学习共享的图像-GPS嵌入空间,用于检索候选位置。然后,基于注意力的重排序器通过结合候选级别的视觉和GPS特征以及来自附近位置的跨源证据,对检索到的候选进行重新评分,以确定预测。与最近的先前工作不同,Pinpoint不依赖多模态大语言模型,使得推理更快且更具可重复性。Pinpoint在互联网照片(IM2GPS3k和YFCC4k)和街景图像(OSV-5M)的标准基准上,在所有指标上均达到最先进的结果。

英文摘要

Image geolocation aims to estimate where a photograph was taken from its visual content. At worldwide scale, this remains challenging because visual evidence is often ambiguous, diverse, and unevenly distributed. Prior work has typically treated geolocation of ordinary internet photos and street-view imagery as separate tasks, despite their complementary strengths: internet photos better match the appearance distribution of user-captured queries, while street-view imagery provides denser, geographically grounded coverage. We present Pinpoint, a retrieve-and-rerank architecture that combines both sources in a coarse-to-fine pipeline. A contrastive image-GPS embedder is trained on both user-uploaded Flickr photos and street-view imagery, learning a shared image-GPS embedding space that is used to retrieve candidate locations. An attention-based reranker then rescores retrieved candidates by combining candidate-level visual and GPS features with cross-source evidence from nearby locations to ground the prediction. Unlike recent prior work, Pinpoint does not rely on multimodal large-language models, making inference faster and more reproducible. Pinpoint achieves state-of-the-art results across all metrics on standard benchmarks for internet photos (IM2GPS3k and YFCC4k) and street-view imagery (OSV-5M).

2606.04108 2026-06-04 cs.GR cs.AI cs.CV cs.LG 版本更新

SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation

SymTRELLIS: 对称性增强的体素潜变量用于3D生成

Guangda Ji, Qimin Chen, Qinchan Li, Mingrui Zhao, Kai Wang, Hao Zhang

发表机构 * Simon Fraser University(西蒙弗雷泽大学)

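AI summary Proposes SymTRELLIS, which enforces arbitrary finite point-group symmetries by averaging predicted velocities over the symmetry group during flow-based generation, without retraining the VAE or flow model, and substantially reduces symmetry error.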

详情
AI中文摘要

单视图3D生成模型已取得令人印象深刻的视觉质量,但它们并非为满足结构或功能需求而设计,在实践中常常存在不足。对称性就是这样一个需求:违反对称性,即使是微小的违反,也可能使模型在物理上不可用。我们提出SymTRELLIS,一种在TRELLIS.2的基于流的3D生成过程中强制任意有限点群对称性(旋转、反射和多面体对称)的方法,无需重新训练底层的VAE或流模型。我们的关键思想是将空间变换在潜空间中的作用近似为体素潜变量上的学习线性算子,通过一个轻量级的空间变换潜映射器实现,该映射器在通用的非对称3D数据上训练。在生成时,我们通过在每一步ODE中对所有对称等价变换的预测流速度进行平均来强制对称性,这一过程称为速度对称化。对称性规格可以从初始TRELLIS.2生成中自动估计,或由用户提供,从而实现超越输入图像暗示的刻意折叠操作。在一个包含266个严格对称物体的基准测试上(涵盖2到20倍旋转和多面体对称群),与TRELLIS.2、Hunyuan3D-2.1和TripoSG相比,SymTRELLIS显著降低了所有对称性误差指标,同时保持了与基础模型相当的重建精度。

英文摘要

Single-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional requirements, and in practice, often fall short. Symmetry is one such requirement: violations, even subtle ones, on symmetry can render a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation of TRELLIS.2, without retraining the underlying VAE or flow model. Our key idea is to approximate the latent-space action of spatial transformations as a learned linear operator on voxel latents, implemented as a lightweight spatial-transform latent mapper trained on generic, non-symmetric 3D data. At generation time, we enforce symmetry by averaging predicted flow velocities across all symmetry-equivalent transformations at each ODE step, a process we call velocity symmetrization. The symmetry specification can be estimated automatically from an initial TRELLIS.2 generation or supplied by the user, enabling deliberate fold manipulation beyond what the input image suggests. On a curated benchmark of 266 strictly symmetric objects spanning 2- to 20-fold rotations and polyhedral symmetry groups, SymTRELLIS substantially reduces all symmetry error metrics compared to TRELLIS.2, Hunyuan3D-2.1, and TripoSG, while maintaining reconstruction accuracy comparable to the base model.
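The velocity symmetrization step described in this abstract is a group average. The sketch below shows the idea on a 2D vector field with the n-fold rotation group, using explicit rotations in place of the paper's learned latent-space operators; `raw_velocity` is a hypothetical stand-in for a model prediction, and nothing here reflects the actual SymTRELLIS code.

```python
import math

def rotate(p, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def symmetrize_velocity(velocity, point, n_fold):
    """Average g^{-1} v(g x) over the n-fold rotation group C_n."""
    vx = vy = 0.0
    for k in range(n_fold):
        angle = 2.0 * math.pi * k / n_fold
        moved = rotate(point, angle)           # g x
        back = rotate(velocity(moved), -angle)  # g^{-1} v(g x)
        vx += back[0]
        vy += back[1]
    return (vx / n_fold, vy / n_fold)

def raw_velocity(p):
    """Hypothetical, deliberately non-symmetric velocity prediction."""
    return (p[0] * p[1] + 1.0, p[0] - 0.5 * p[1] ** 2)

# The averaged field is equivariant: rotating the input rotates the output.
x = (0.3, -0.7)
quarter_turn = 2.0 * math.pi / 4
lhs = symmetrize_velocity(raw_velocity, rotate(x, quarter_turn), 4)
rhs = rotate(symmetrize_velocity(raw_velocity, x, 4), quarter_turn)
```

Integrating an ODE with an equivariant velocity from a symmetric starting state keeps the state symmetric at every step, which is why averaging at each step can enforce the symmetry without retraining.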

2606.04107 2026-06-04 cs.CV 版本更新

Reflection Separation from a Single Image via Joint Latent Diffusion

基于联合潜在扩散的单图像反射分离

Zheng-Hui Huang, Zhixiang Wang, Yu-Lun Liu, Yung-Yu Chuang

发表机构 * Shanda AI Research Tokyo(Shanda AI Research东京) National Taiwan University(台湾大学) National Yang Ming Chiao Tung University(阳明交通大学)

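AI summary Proposes a diffusion-based method for single-image reflection separation under extreme conditions such as glare or weak reflections, combining joint generation of the transmission and reflection layers, cross-layer self-attention, a disjoint sampling strategy, and latent optimization.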

Comments CVPR 2026. Project page: https://brian90709.github.io/diff-reflection-separation/

详情
AI中文摘要

单图像反射分离在强光或弱反射等极端条件下极具挑战性。现有方法由于信息不足,在强光或弱反射场景中往往难以恢复两个图层。本文提出了一种针对此任务显式微调的扩散模型,利用生成扩散先验实现鲁棒分离。我们的方法通过一个统一的扩散模型同时生成透射层和反射层,并引入一种新颖的跨层自注意力机制以更好地解耦特征。我们进一步引入一种分离采样策略,在扩散过程中迭代减少层间干扰,以及一个带有学习到的合成函数的潜在优化步骤,以在复杂真实场景中获得改进的结果。大量实验表明,我们的方法在多个真实世界基准上超越了最先进的方法。项目页面:https://brian90709.github.io/diff-reflection-separation/

英文摘要

Single-image reflection separation is highly challenging under extreme conditions like glare or weak reflections. Existing methods often struggle to recover both layers in glare or weak-reflection scenarios because of insufficient information. This paper presents a diffusion model explicitly fine-tuned for this task, leveraging generative diffusion priors for robust separation. Our method simultaneously generates transmission and reflection layers through a unified diffusion model, incorporating a novel cross-layer self-attention mechanism for better feature disentanglement. We further introduce a disjoint sampling strategy to iteratively reduce interference between the layers during diffusion and a latent optimization step with a learned composition function for improved results in complex real-world scenarios. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods on multiple real-world benchmarks. Project page: https://brian90709.github.io/diff-reflection-separation/

2606.04098 2026-06-04 cs.CV 版本更新

When Seeing Is Not Believing -- A Benchmark for Search-Grounded Video Misinformation Detection

当眼见不再为实——面向搜索辅助的视频虚假信息检测基准

Tao Yu, Yujia Yang, Shenghua Chai, Zhang Jinshuai, Haopeng Jin, Hao Wang, Minghui Zhang, Zhongtian Luo, Yuchen Long, Xinlong Chen, Jiabing Yang, Zhaolu Kang, Yuxuan Zhou, Zhengyu Man, Xinming Wang, Hongzhu Yi, Zheqi He, Xi Yang, Yan Huang, Liang Wang

发表机构 * CASIA(中国科学院自动化研究所) UCAS(中国科学院大学) BAAI(北京智源人工智能研究院) Tsinghua University(清华大学) Peking University(北京大学)

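AI summary Introduces EVID-Bench, a benchmark for detecting video misinformation through open-web search and cross-video comparison, covering 9 manipulation types; evaluation shows that frontier multimodal models reach low accuracy and face several recurring challenges.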

Comments 52 pages

详情
AI中文摘要

视频虚假信息越来越多地在语义和证据层面运作:真实镜头可能被选择性编辑、时间重排、跨源拼接或通过AI生成内容增强以构建虚假叙事。这种依赖证据的操纵无法仅从输入视频中可靠验证,因为缺失、重排、替换或重新语境化的证据位于视频本身之外。我们引入了EVID-Bench,一个面向搜索辅助的视频虚假信息检测基准,系统必须搜索开放网络以查找相关视频,并通过跨视频比较识别哪些信息是虚假的。EVID-Bench包含222个视频,涵盖3类9种操纵类型:AI生成、单源编辑和多源编辑。所有样本均经过验证,前沿模型仅通过视觉检查无法检测。我们使用检索增强验证基线评估了九种前沿多模态模型。最佳系统仅达到61.43%的点级准确率和43.24%的视频级准确率,而AI生成的操纵仍然特别具有挑战性。错误分析揭示了反复出现的挑战:模型固着于无关锚点,错误地将合成内容归因于编辑拼接,并在完全解释操纵之前过早终止搜索。

英文摘要

Video misinformation increasingly operates at the semantic and evidential level: authentic footage may be selectively edited, temporally reordered, spliced across sources, or augmented with AI-generated content to construct false narratives. Such evidence-dependent manipulations cannot be reliably verified from the input video alone, because the missing, reordered, replaced, or recontextualized evidence lies outside the video itself. We introduce EVID-Bench, a benchmark for search-grounded video misinformation detection, where a system must search the open web for related videos and identify what information is false through cross-video comparison. EVID-Bench comprises 222 videos spanning 9 manipulation types across 3 categories: AI generation, single-source editing, and multi-source editing. All samples are verified to be undetectable by frontier models through visual inspection alone. We evaluate nine frontier multimodal models using a retrieval-augmented verification baseline. The best system achieves only 61.43% point-level accuracy and 43.24% video-level accuracy, while AI-generated manipulations remain especially challenging. Error analysis reveals recurring challenges: models fixate on irrelevant anchors, misattribute synthetic content to editorial splicing, and terminate search prematurely before fully explaining the manipulation.

2606.04092 2026-06-04 cs.CV cs.LG 版本更新

Optimal Transport Flow Matching by Design

通过设计实现最优传输流匹配

Shimon Malnick, Matan Rusanovsky, Ohad Fried, Shai Avidan

发表机构 * Tel Aviv University(特拉维夫大学) Reichman University(里奇曼大学)

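AI summary Treats the prior distribution as a design choice rather than a fixed input and uses the identity coupling between data and its low-frequency projection as the optimal-transport coupling, reducing trajectory curvature in flow matching models and enabling fast, high-quality generation.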

Comments Project page: https://www.malnick.net/designing_ot_flows

详情
AI中文摘要

流匹配模型学习将样本从简单先验分布传输到复杂数据分布。当先验-数据对通过最优传输(OT)耦合时,学习到的轨迹是直线且无交叉的,从而实现快速甚至单步生成。然而,在高维空间中计算OT耦合是困难的,现有方法试图解决OT问题,但代价是持续的偏差或显著的开销。我们不求解OT耦合,而是重新表述问题。一旦将先验视为设计选择而非固定输入,先验与数据之间的OT耦合就不再唯一。许多先验允许与数据之间存在OT最优的恒等耦合,因此我们可以自由选择一个易于采样的先验。我们将自然图像的低频投影确定为这样的选择。数据与其低频表示之间的恒等耦合在经验上是OT最优的,先验的结构足够丰富,可以在推理时由轻量级模型采样,而剩余的流匹配任务简化为合成高频细节。用高斯噪声插值先验进一步提高了生成质量,同时保留了OT耦合。该方法无需对流模型本身进行修改,并且自然地与潜在空间模型、无分类器引导和单步生成框架集成。在所有基准测试中,与现有流匹配方法相比,我们的方法将轨迹曲率降低了2倍以上,从而在少步数情况下实现了更好的生成质量。

英文摘要

Flow matching models learn to transport samples from a simple prior distribution to a complex data distribution. When prior-data pairs are coupled via optimal transport (OT), the learned trajectories are straight and non-crossing, enabling fast, even single-step, generation. However, computing the OT coupling in high dimensions is intractable, and existing methods attempt to solve the OT problem, at the cost of persistent bias or significant overhead. Rather than solving for the OT coupling, we reformulate the problem. Once the prior is treated as a design choice rather than a fixed input, the OT coupling between prior and data is no longer unique. Many priors admit an OT-optimal identity coupling to the data, leaving us free to choose one that is also tractable to sample. We identify low-frequency projection of natural images as such a choice. The identity coupling between data and its low-frequency representation is empirically OT-optimal, the prior is structured enough to be sampled by a lightweight model at inference, and the remaining flow-matching task reduces to synthesizing high-frequency detail. Interpolating the prior with Gaussian noise further improves generation quality while preserving the OT coupling. The approach requires no modifications to the flow model itself, and integrates naturally with latent-space models, classifier-free guidance, and one-step generation frameworks. Across all benchmarks, our method reduces trajectory curvature by more than 2x compared to existing flow matching methods, yielding better generation quality in the few-step regime.
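How an identity coupling between data and its low-frequency version yields a flow-matching training pair can be sketched on a 1D signal. The moving-average filter below is only a stand-in for whatever projection the authors use, the linear interpolant is the standard flow-matching path, and all names are hypothetical; the Gaussian-noise interpolation of the prior mentioned in the abstract is omitted.

```python
def low_pass(signal, radius=1):
    """Moving-average filter with edge clamping (an assumed low-frequency map)."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

def training_pair(x1, t):
    """Build (x0, x_t, target velocity) for one data sample x1 at time t.

    Identity coupling: the prior sample x0 is derived from x1 itself, so no
    assignment problem between prior and data samples has to be solved.
    """
    x0 = low_pass(x1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # straight-line path
    target = [b - a for a, b in zip(x0, x1)]               # constant velocity
    return x0, x_t, target

data = [0.0, 4.0, 0.0, 4.0, 0.0, 4.0]
x0, x_half, velocity = training_pair(data, t=0.5)
```

In this sketch the regression target is exactly the high-frequency residual `x1 - x0`, which matches the abstract's remark that the remaining task reduces to synthesizing high-frequency detail.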

2606.04061 2026-06-04 cs.CV 版本更新

Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning

模态内邻居从不说谎:基于图模态内推理纠正模态间噪声对应

Yang Liu, Wentao Feng, Shu-Dong Huang, Yalan Ye, Jiancheng Lv

发表机构 * University of Science and Technology of China(中国科学技术大学)

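AI summary Proposes the IN2R framework, which exploits the geometric stability of intra-modal data: a Graph Refiner performs relational reasoning over neighbors from a dynamic cross-modal memory and synthesizes continuous soft prototypes to rectify inter-modal noisy correspondence, significantly improving cross-modal retrieval.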

详情
Journal ref
International Conference on Machine Learning (ICML) 2026
AI中文摘要

大规模网络采集数据集推动了跨模态检索的进展,但不可避免地遭受噪声对应问题,严重损害模型泛化能力。现有方法主要通过过滤噪声或寻找替代标签来解决,但它们主要局限于“离散选择”范式。我们认为,依赖单一离散代理会导致单点脆弱性和离散化误差。为克服这些限制,我们提出了一种新颖框架——模态内邻居感知噪声纠正(IN2R),它将范式从搜索替代标签转变为合成可靠的监督目标。利用模态内数据固有的几何稳定性,IN2R采用图精炼器对从动态跨模态记忆中检索到的邻居进行关系推理。我们的方法不是传播离散标签,而是合成一个连续的软原型,反映局部语义邻域的共识,有效纠正模态间错位。在Flickr30K、MS-COCO和CC152K上的大量实验表明,IN2R显著优于最先进的方法。我们的代码和预训练模型可在https://github.com/liuyyy111/IN2R公开获取。

英文摘要

Large-scale web-harvested datasets have fueled the progress of cross-modal retrieval but inevitably suffer from noisy correspondence, which severely degrades model generalization. Existing methods primarily address this by filtering out noise or seeking a substitute label, yet they predominantly remain bound by a "Discrete Selection" paradigm. We argue that relying on a single discrete proxy induces Single-Point Fragility and Discretization Error. To overcome these limitations, we propose a novel framework, Intra-modal Neighbor-aware Noise Rectification (IN2R), which shifts the paradigm from searching for a substitute to synthesizing a reliable supervision target. Leveraging the intrinsic geometric stability of intra-modal data, IN2R employs a Graph Refiner to perform relational reasoning over neighbors retrieved from a dynamic Cross-Model Memory. Instead of propagating discrete labels, our method synthesizes a continuous, soft prototype that reflects the consensus of the local semantic neighborhood, effectively rectifying inter-modal misalignment. Extensive experiments on Flickr30K, MS-COCO, and CC152K demonstrate that IN2R significantly outperforms state-of-the-art methods. Our code and pre-trained models are publicly available at https://github.com/liuyyy111/IN2R.

2606.04060 2026-06-04 cs.CV 版本更新

Weakly Supervised Incremental Segmentation via Semantic Anchors and Spatial Arbitration

基于语义锚点和空间仲裁的弱监督增量分割

Zhonggai Wang, Kai Fang, Guangyu Gao

发表机构 * National Natural Science Foundation of China(中华人民共和国国家自然科学基金委员会) Tsinghua University(清华大学)

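AI summary Targets the feature drift and semantic overwriting caused by noisy supervision in weakly supervised incremental semantic segmentation; the proposed SASA method stabilizes representation learning with semantic anchors and filters unreliable signals with spatial label arbitration, effectively mitigating feature drift.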

Comments Accepted by ICME2026

详情
AI中文摘要

弱监督增量语义分割(WILSS)面临持续引入噪声监督的问题,这会逐步破坏类别级表示,导致严重的特征漂移和语义污染,从而使新学习的类别覆盖旧类别。为了解决这些问题,我们提出了一种抗漂移的WILSS方法,名为SASA,旨在通过语义锚点和空间仲裁稳定语义学习。具体地,在表示层面,我们引入可学习令牌的语义锚点作为刚性类别级参考,以保持长期语义一致性。作为补充,弹性残差适应实现了受控的、实例特定的细化,确保稳定而灵活的学习轨迹。在监督层面,我们开发了一种空间标签仲裁机制,该机制执行几何感知决策,直接过滤不可靠信号,并强制执行严格的“一个对象,一个类别”约束。通过协同稳定表示和提高监督可靠性,SASA有效缓解了弱监督下的特征漂移。在标准基准上的大量实验表明,我们的方法始终优于现有最先进方法,特别是在具有挑战性的多步增量设置中。代码可在https://github.com/ZhonggaiWang/SASA获取。

英文摘要

Weakly Incremental Learning for Semantic Segmentation (WILSS) suffers from the continuous introduction of noisy supervision, which progressively corrupts class-level representations, leading to severe feature drift and semantic corruption, thereby causing newly learned classes to overwrite old ones. To address these issues, we propose a drift-resilient WILSS approach, named SASA, designed to stabilize semantic learning via Semantic Anchors and Spatial Arbitration. Specifically, at the representation level, we introduce semantic anchors of learnable tokens as rigid class-level references to preserve long-term semantic identity. Complementary to this, an elastic residual adaptation facilitates controlled, instance-specific refinement, ensuring a stable yet flexible learning trajectory. At the supervision level, we develop a Spatial Label Arbitration mechanism that performs geometry-aware decisions to directly filter unreliable signals and enforce a strict "one object, one class" constraint. By synergistically stabilizing representations and improving supervision reliability, SASA effectively mitigates feature drift under weak supervision. Extensive experiments on standard benchmarks demonstrate that our approach consistently outperforms existing state-of-the-art methods, particularly in challenging multi-step incremental settings. The code is available at https://github.com/ZhonggaiWang/SASA.

2606.04046 2026-06-04 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

深入场景:通过焦点计划生成打破视觉-语言决策中的感知瓶颈

Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

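AI summary Proposes SceneDiver, a coarse-to-fine focus plan generation method that builds a scene graph and progressively decomposes the task, reducing visual hallucinations and improving vision-language and vision-language-action models on embodied decision-making tasks.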

Comments Accepted at ICML 2026

详情
AI中文摘要

在具身视觉-语言决策任务(如机器人操作和导航)中,视觉-语言模型和视觉-语言-动作模型(VLMs & VLAs)是具有不同优势的强大工具:VLMs更擅长长期规划,而VLAs更擅长反应控制。然而,它们的性能受到相同感知瓶颈的限制:由于模型无法区分任务相关对象与干扰物,导致视觉幻觉。原则上,准确识别并聚焦关键对象同时过滤无关对象是突破这一限制的关键。一个直接的解决方案是一步聚焦:直接关注重要对象。然而,这种方法被证明无效,因为有效的聚焦本质上需要深度场景理解。为此,我们提出SceneDiver,一种利用VLMs长期规划能力的从粗到细的焦点计划生成方法,首先构建整体场景图以建立初步理解,然后通过识别、理解和分析的迭代循环逐步将任务分解为更简单的子问题。为了实现反应控制,我们还设计了一个轻量级适配器,将深思熟虑的聚焦能力蒸馏到VLAs中。在标准具身AI基准上的评估证实,我们的方法显著减少了VLMs和VLAs的视觉幻觉,同时在需要快速执行的任务中保持了计算效率。我们的代码和数据发布在:https://future-item.github.io/SceneDiver。

英文摘要

In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future-item.github.io/SceneDiver.

2606.03972 2026-06-04 cs.CV 版本更新

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

AAD-1:用于一步自回归视频生成的非对称对抗蒸馏

Haobo Li, Yanhong Zeng, Yunhong Lu, Jiapeng Zhu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yujun Shen, Zhipeng Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

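AI summary Proposes AAD-1, an asymmetric adversarial distillation framework that breaks the symmetry between generator and discriminator and uses a phased training strategy to address motion collapse and training instability in one-step autoregressive video generation, achieving state-of-the-art performance.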

Comments ICML 2026. Project page: https://aad-1.github.io/

详情
AI中文摘要

我们提出了AAD-1,一种用于一步自回归图像到视频生成的非对称对抗蒸馏框架。最先进的方法采用对抗蒸馏,但存在运动崩溃和训练不稳定的问题,导致生成静态视频。AAD-1通过架构和训练策略上的两个关键设计解决了这些挑战。我们的核心架构见解是打破生成器和判别器之间的对称性。生成器保持因果性以保留自回归采样能力,而判别器则双向关注整个时空上下文,并为整个视频序列生成单一的全局真实性评分。这种非对称设计使判别器能够有效检测导致自回归生成中运动崩溃的全局时间故障和长程漂移。为了稳定训练,我们引入了一种分阶段策略,首先使用分布匹配来引导一个稳定的一步生成器,提供一个预热阶段,使学生分布更接近教师分布,然后再开始对抗蒸馏。在VBench上的大量实验表明,AAD-1在一步自回归视频生成中达到了最先进的性能。

英文摘要

We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from motion collapse and training instability, resulting in static videos. AAD-1 addresses these challenges through two key designs in architecture and training strategy. Our key architectural insight is to break the symmetry between generator and discriminator. While the generator remains causal to preserve autoregressive sampling capability, the discriminator attends bidirectionally over the full spatiotemporal context and produces a single holistic realism score for the entire video sequence. This asymmetric design enables the discriminator to effectively detect global temporal failures and long-range drift that cause motion collapse in autoregressive generation. To stabilize training, we introduce a phased strategy that first uses distribution matching to bootstrap a stable one-step generator, providing a warm-up phase that brings the student distribution closer to the teacher before adversarial distillation begins. Extensive experiments on VBench demonstrate that AAD-1 achieves state-of-the-art performance in one-step autoregressive video generation.

2606.03943 2026-06-04 cs.RO cs.CV cs.LG 版本更新

PointAction: 3D Points as Universal Action Representations for Robot Control

PointAction: 3D点作为机器人控制的通用动作表示

Mutian Tong, Han Jiang, Qiao Feng, Lingjie Liu, Jiatao Gu

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

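AI summary Proposes PointAction, which fine-tunes a video generation model to jointly predict future RGB frames and dynamic 3D pointmaps; the point dynamics serve as an embodiment-agnostic action interface that a diffusion-based action decoder maps to executable actions, reducing the ambiguity of RGB-only action grounding and supporting transfer across tasks and embodiments.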

Comments Project page: https://oriontmt.github.io/pointaction/

详情
AI中文摘要

视频-动作模型(VAM)利用预训练视频扩散模型捕获的广泛视觉动态,为通用机器人操作提供了有前景的路径。然而,仅RGB视频展开无法直接操作:它们未明确指定度量3D运动、接触几何和细粒度空间约束,导致动作基础不明确。同时,跨不同任务和本体的动作监督扩展仍然成本高昂。我们提出PointAction,一个通过显式基于点的4D建模将视频预测桥接到机器人动作的框架。PointAction微调基础视频生成模型,联合预测未来RGB帧和动态3D点图,产生任务相关场景几何的时间一致3D运动。这些点动力学作为结构化的、与本体无关的动作接口,由基于扩散的动作解码器映射为可执行的机器人动作。通过使用度量3D点动力学作为视频预测和控制之间的接口,PointAction减少了仅RGB动作基础的不确定性,并支持在有限动作监督下跨任务和本体的迁移。实验表明,PointAction在机器人场景上实现了最先进的4D生成质量,在模拟中优于现有基线,并泛化到预训练中未见过的两个真实机器人手臂。

英文摘要

Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.

2606.03746 2026-06-04 cs.CV cs.AI cs.GR cs.LG 版本更新

Qwen-Image-Flash: Beyond Objective Design

Qwen-Image-Flash:超越目标设计

Tianhe Wu, Kun Yan, Zikai Zhou, Lihan Jiang, Jiahao Li, Jie Zhang, Kaiyuan Gao, Ningyuan Tang, Shengming Yin, Xiaoyue Chen, Xiao Xu, Yilei Chen, Yuxiang Chen, Yan Shu, Yixian Xu, Yanran Zhang, Zihao Liu, Zhendong Wang, Zekai Zhang, Deqing Li, Liang Peng, Yi Wang, Jingren Zhou, Chenfei Wu

发表机构 * Alibaba Group(阿里巴巴集团)

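AI summary Systematically studies three factors (data composition, teacher guidance, and task mixture) and presents Qwen-Image-Flash, showing that effective few-step distillation requires not only carefully designed objectives but also principled organization of the broader training pipeline.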

详情
AI中文摘要

少步蒸馏已成为加速先进视觉生成模型的有效策略,但先前的工作主要集中在蒸馏目标上。在这项工作中,我们从互补的角度重新审视少步蒸馏,重点关注关键影响学生表现的训练方案。以Qwen-Image-2.0为代表案例,我们系统地研究了统一文本到图像生成和指令引导图像编辑蒸馏中的三个因素:数据组成、教师指导和任务混合。我们的实证分析揭示了若干非直观行为,这些行为推动了Qwen-Image-Flash的开发。总体而言,我们的结果表明,有效的少步蒸馏不仅需要精心设计的目标,还需要对更广泛的训练流程进行原则性组织。

英文摘要

Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focused on distillation objectives. In this work, we revisit few-step distillation from a complementary perspective, focusing on the training recipe that critically shapes student performance. Using Qwen-Image-2.0 as a representative case, we systematically investigate three factors in unified text-to-image generation and instruction-guided image editing distillation: data composition, teacher guidance, and task mixture. Our empirical analysis reveals several non-obvious behaviors, which motivate the development of Qwen-Image-Flash. Overall, our results suggest that effective few-step distillation requires not only carefully designed objectives, but also principled organization of the broader training pipeline.

2606.03598 2026-06-04 cs.RO cs.AI cs.CV 版本更新

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

PHASER: 面向视觉-语言-动作模型的相位感知与语义经验回放

Ziyang Chen, Shaoguang Wang, Weiyu Guo, Qianyi Cai, He Zhang, Pengteng Li, Yiren Zhao, Yandong Guo

发表机构 * Thrust of AI, HKUST(Guangzhou)(香港科技大学(广州)人工智能学域) AI² Robotics, Shenzhen, China(AI² Robotics,中国深圳)

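AI summary Proposes PHASER, a framework that combines phase-aware capacity allocation and a multi-modal interference routing strategy with Auto-PC, an automatic phase-extraction pipeline, to address catastrophic forgetting in continual learning of VLA models; it raises average success rate on the LIBERO benchmark by up to 31%.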

Comments 20 pages, 8 figures, 12 tables

详情
AI中文摘要

视觉-语言-动作(VLA)模型在语言条件机器人操作中取得了显著成功。然而,在开放环境中部署这些模型需要持续获取新技能,这一过程不可避免地会严重遗忘先前学习的行为。虽然经验回放(ER)是一种标准的缓解策略,但简单的均匀采样从根本上与操作轨迹的时间特征不一致。它系统性地欠采样短暂但因果关键的子技能,导致相位饥饿,并完全忽略了历史任务中不同程度的遗忘。为克服这些限制,我们提出PHASER,一种架构无关的持续学习框架。PHASER采用以相位为中心的容量分配,确保所有子技能获得平等的记忆支持,并结合多模态干扰路由策略,动态优先处理遗忘风险高的历史相位。此外,为实现完全自主的终身适应,我们集成了Auto-PC,一种轻量级管线,结合无监督动作信号变化点检测和基于VLM的语义验证,无需大量人工监督即可提取时间边界。在LIBERO持续学习套件上对三个VLA骨干网络的评估表明,PHASER取得了显著的实证改进,与匹配预算的ER相比,平均成功率(ASR)提升高达31%,并在LIBERO-Goal CL设置中达到87.8%的最终ASR。

英文摘要

Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process that inevitably triggers severe catastrophic forgetting of previously learned behaviors. While experience replay (ER) serves as a standard mitigating strategy, naive uniform sampling fundamentally misaligns with the temporal characteristics of manipulation trajectories. It systematically under-samples brief but causally critical sub-skills, leading to phase starvation, and completely overlooks the varying degrees of forgetting across historical tasks. To overcome these limitations, we introduce PHASER, an architecture-agnostic continual learning framework. PHASER employs a phase-centric capacity allocation to guarantee equal memory support for all sub-skills, coupled with a multi-modal interference routing strategy that dynamically prioritizes historical phases at high risk of forgetting. Furthermore, to enable fully autonomous lifelong adaptation, we integrate Auto-PC, a lightweight pipeline combining unsupervised action-signal change-point detection with VLM-based semantic verification to extract temporal boundaries without intensive manual supervision. Evaluated across three VLA backbones on LIBERO continual learning suites, PHASER yields substantial empirical improvements, increasing Average Success Rate (ASR) by up to 31% over matched-budget ER and achieving an 87.8% final ASR on the LIBERO-Goal CL setting.
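The contrast this abstract draws between uniform replay sampling and phase-centric capacity allocation can be shown with a toy buffer selection. The functions, phase names, and even-spacing rule below are hypothetical simplifications; PHASER's interference routing and Auto-PC boundary extraction are not modeled.

```python
def uniform_buffer(num_frames, budget):
    """Evenly strided frame indices: long phases dominate the buffer."""
    return [(i * num_frames) // budget for i in range(budget)]

def phase_centric_buffer(phase_labels, budget):
    """Pick frame indices with an equal quota for every phase."""
    phases = {}
    for idx, label in enumerate(phase_labels):
        phases.setdefault(label, []).append(idx)
    quota = budget // len(phases)
    if quota < 1:
        raise ValueError("budget must allow at least one frame per phase")
    chosen = []
    for frames in phases.values():
        take = min(quota, len(frames))
        # Spread the picks evenly inside the phase.
        chosen.extend(frames[(i * len(frames)) // take] for i in range(take))
    return sorted(chosen)

# Toy trajectory: a brief but critical "grasp" between two long phases.
labels = ["reach"] * 40 + ["grasp"] * 4 + ["lift"] * 36
uniform = uniform_buffer(len(labels), budget=12)
balanced = phase_centric_buffer(labels, budget=12)

grasp_uniform = sum(1 for i in uniform if labels[i] == "grasp")
grasp_balanced = sum(1 for i in balanced if labels[i] == "grasp")
print(grasp_uniform, grasp_balanced)
```

With the same 12-frame budget, the strided selection keeps a single grasp frame while the per-phase quota keeps all four, which illustrates the "phase starvation" that uniform sampling causes for short sub-skills.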

2606.03564 2026-06-04 cs.CV cs.AI 版本更新

CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation

CR-Seg:注意力引导与CoT增强的由粗到精推理分割

Yifan Cao, Xiaocui Yang, Faxian Wan, Shi Feng, Daling Wang, Yifei Zhang

发表机构 * School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院)

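AI summary: Proposes CR-Seg, a two-stage framework that performs coarse-to-refined reasoning segmentation through attention-map extraction and a global-to-local chain of thought, addressing difficult cross-modal alignment and reasoning-answer inconsistency.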

详情
AI中文摘要

推理分割旨在通过联合视觉-文本推理来分割复杂语言描述的目标对象。现有方法通常依赖学习到的语义标记来桥接多模态大语言模型(MLLMs)和分割模型,但面临困难的跨模态对齐问题;或者依赖显式空间提示(如边界框),但可能丢失整体响应语义。为解决这些限制,我们提出注意力引导与CoT增强的由粗到精推理分割(CR-Seg),一个两阶段框架。具体地,我们设计了提取注意力图和点(EAP)模块,用于提取粗目标定位的注意力图并选择信息点,两者都输入SAM进行掩码细化。为缓解推理-答案不一致,我们进一步引入全局到局部思维链(GLCoT),引导模型从全局场景上下文逐步推理到局部目标细节。在推理分割基准上的大量实验证明了CR-Seg的有效性。

英文摘要

Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing methods typically rely either on learned semantic tokens to bridge Multimodal Large Language Models (MLLMs) and segmentation models, which suffer from difficult cross-modal alignment, or on explicit spatial prompts such as bounding boxes, which may lose holistic response semantics. To address these limitations, we propose CR-Seg (Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation), a two-stage framework. Specifically, we design an Extract Attention Maps and Points (EAP) module to extract attention maps for coarse target localization and select informative points, both of which are fed into SAM for mask refinement. To alleviate reasoning-answer inconsistency, we further introduce Global-to-Local Chain-of-Thought (GLCoT), which guides the model to reason progressively from global scene context to local target details. Extensive experiments on reasoning segmentation benchmarks demonstrate the effectiveness of CR-Seg.

2606.03402 2026-06-04 cs.CV 版本更新

Mamba-Enhanced Implicit Motion Learning for Audio-Driven Portrait Animation

Mamba增强的隐式运动学习用于音频驱动肖像动画

Xuan Wei, Jiahui Chen, Kaiheng Li, Mingyu Shao, Qingqi Hong

发表机构 * Fujian Provincial Natural Science Foundation of China(福建省自然科学基金委员会) Giant Interactive Group Inc.(巨人网络集团) National Natural Science Foundation of China(国家自然科学基金委员会)

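AI summary: Proposes a two-stage implicit-motion framework that combines a region-aware attention mechanism with a Mamba-enhanced diffusion model to generate realistic, temporally coherent human motion videos from a single static image and audio, reporting state-of-the-art performance on multiple benchmarks.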

Comments accepted by 2026 IEEE International Conference on Multimedia and Expo (ICME)

详情
AI中文摘要

音频驱动的人体运动视频生成旨在从单张静态图像合成逼真且时间一致的人体动画,应用于说话头生成、共语手势生成和动态演示。超越传统基于关键点的方法(这些方法往往难以捕捉细微的运动动态),我们提出了一种新颖的隐式运动框架,用于从单张静态图像和音频生成逼真且时间一致的人体运动视频。我们的方法采用两阶段流水线,将运动预测与渲染解耦。第一阶段将外观先验和层次深度线索整合到区域感知注意力机制中,以建模潜在运动特征。第二阶段采用Mamba增强的扩散模型直接从音频和源图像预测这些特征,实现细粒度运动模式的无监督学习。这种解耦架构增强了灵活性和效率。在一个新的380小时高质量数据集上训练,我们的方法在准确性、自然性和时间一致性方面优于多个公共基准和我们收集的数据上的先前工作,达到了新的最先进水平。

英文摘要

Audio-driven human motion video generation aims to synthesize realistic and temporally coherent human animations from a single static image, with applications in talking-head synthesis, co-speech gesture generation, and dynamic presentations. Moving beyond conventional keypoint-based methods, which often struggle to capture subtle motion dynamics, we propose a novel implicit-motion framework for generating realistic and temporally coherent human motion videos from a single static image and audio. Our approach uses a two-stage pipeline that decouples motion prediction from rendering. The first stage integrates appearance priors and hierarchical depth cues into a region-aware attention mechanism to model latent motion features. The second stage employs a Mamba-enhanced diffusion model to directly predict these features from audio and the source image, enabling unsupervised learning of fine-grained motion patterns. This decoupled architecture enhances flexibility and efficiency. Trained on a new 380-hour high-quality dataset, our method outperforms prior work across multiple public benchmarks and our collected data in accuracy, naturalness, and temporal coherence, setting a new state of the art.

2606.03376 2026-06-04 cs.CV cs.AI cs.CL cs.LG 版本更新

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

P²-DPO:通过校准直接偏好优化在感知处理中锚定幻觉

Ruipeng Zhang, Zhihao Li, Haozhang Yuan, C. L. Philip Chen, Tong Zhang

发表机构 * Guangdong Provincial Key Laboratory of Computational AI Models and Cognitive Intelligence, School of Computer Science & Engineering, South China University of Technology(广东省计算人工智能模型与认知智能重点实验室,计算机科学与工程学院,华南理工大学) Pazhou Lab, Guangzhou, China(琶洲实验室,广州,中国) Engineering Research Center of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human, Guangzhou, China(教育部健康智能感知与并行数字人工程研究中心,广州,中国)

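AI summary: To address hallucination in large vision-language models, proposes the P²-DPO training paradigm, which uses self-generated preference pairs and a calibration loss to directly target the perceptual bottleneck and visual robustness without costly human feedback.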

详情
AI中文摘要

幻觉最近在大型视觉语言模型(LVLMs)中引起了广泛的研究关注。直接偏好优化(DPO)旨在直接从人类提供的纠正偏好中学习,从而解决幻觉问题。尽管取得了成功,但这种范式尚未专门针对关注区域中的感知瓶颈或解决图像退化下的视觉鲁棒性不足问题。此外,现有的偏好对通常是视觉无关的,其固有的离策略性质限制了它们在指导模型学习方面的有效性。为了解决这些挑战,我们提出了感知处理直接偏好优化(P²-DPO),一种新颖的训练范式,其中模型生成并学习自己的偏好对,从而直接解决已识别的视觉瓶颈,同时固有地避免视觉无关和离策略数据的问题。它引入了:(1)一种针对焦点增强感知和视觉鲁棒性的在策略偏好对构建方法,以及(2)一种精心设计的校准损失,以精确地将视觉信号与文本的因果生成对齐。实验结果表明,在相当数量的训练数据和成本下,P²-DPO在基准测试中优于依赖昂贵人工反馈的强基线。此外,对注意力区域保真度(ARF)和图像退化场景的评估验证了P²-DPO在解决关注区域感知瓶颈和提高对退化输入的视觉鲁棒性方面的有效性。

英文摘要

Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are often vision-agnostic, and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P$^2$-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference-pair construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that, with a comparable amount of training data and cost, P$^2$-DPO outperforms strong baselines that rely on costly human feedback across benchmarks. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P$^2$-DPO in addressing the perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.
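For readers unfamiliar with the objective that P$^2$-DPO builds on, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the paper's Calibration Loss, whose form the abstract does not give; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and the
    frozen reference model. The loss is -log(sigmoid(margin)).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the policy has shifted toward the chosen answer.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen response relative to the rejected one.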

2606.03201 2026-06-04 cs.CV cs.AI 版本更新

Reinforcement Learning from Cross-domain Videos with Video Prediction Model

基于视频预测模型的跨领域视频强化学习

Zhao Yang, Xinrui Zu, Jacob E. Kooi, Thomas Delliaux, He Liu, Shujian Yu, Kevin Sebastian Luck, Vincent François-Lavet

发表机构 * VU Amsterdam(阿姆斯特丹自由大学) ISAE-SUPAERO

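AI summary: Proposes XIPER, a reward model that maps agent observations into the expert domain through cross-domain video prediction and uses the prediction likelihood as the reward signal, addressing the missing rewards and the domain gap that arise when learning from visually different domains.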

详情
AI中文摘要

由于缺乏奖励信号以及存在领域差距,从视觉上截然不同的领域的专家视频中进行强化学习具有挑战性。我们引入了XIPER(跨领域视频预测奖励),这是一种奖励模型,用于从视觉不同领域收集的专家视频中进行学习,其中智能体的外观因颜色、形态或仿真到现实差距等因素而不同。更具体地说,XIPER训练了一个跨领域视频预测模型,将智能体观测映射到专家领域,并使用预测似然作为奖励信号。在DMC Color Suite(8个任务)和DMC Body Suite(3个任务)上的实验表明,尽管存在智能体颜色和形态等领域的差距,XIPER始终优于基线方法。我们进一步在仿真到现实迁移数据集上分析了XIPER,证明它仅凭模拟专家视频就能为真实机器人观测产生有意义的奖励信号。代码、预训练模型、数据集和视频演示可在我们的项目网页上找到:https://sites.google.com/view/xiper

英文摘要

Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the presence of domain gaps. We introduce XIPER (Cross-domain Video Prediction Reward), a reward model for learning from expert videos collected in a visually different domain, where the agent's appearance differs due to factors such as color, morphology, or the sim-to-real gap. More specifically, XIPER trains a cross-domain video prediction model that maps agent observations into the expert domain and uses the prediction likelihood as a reward signal. Experiments on the DMC Color Suite (8 tasks) and DMC Body Suite (3 tasks) show that XIPER consistently outperforms baselines despite domain gaps such as differences in agent color and morphology. We further analyze XIPER on a sim-to-real transfer dataset, demonstrating that it produces meaningful reward signals for real-robot observations given only simulated expert videos. Code, pretrained models, datasets and video demonstrations can be found on our project webpage: https://sites.google.com/view/xiper

2606.03175 2026-06-04 cs.CV cs.RO 版本更新

Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

在值得时询问:面向实例目标导航的成本感知开放式交互

Xunyi Zhao, Sihao Lin, Gengze Zhou, Zerui Li, Shijie Li, Wei Tao, Jiajun Liu, Qi Wu

发表机构 * Adelaide University(阿德莱德大学) Responsible AI Research Centre, Australian Institute for Machine Learning(负责任人工智能研究中心,澳大利亚机器学习研究所) Institute for Infocomm Research (I2R), A*STAR(信息与通信研究院(I2R),A*STAR) iMotion CSIRO Data61

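AI summary: To handle language ambiguity in Instance Goal Navigation, proposes a cost-sensitive uncertainty-reduction approach that uses information-gain analysis to identify effective question types, builds a benchmark and a Weighted Success Rate metric, and introduces a zero-shot MLLM navigator that queries only when the expected benefit exceeds the cost.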

详情
AI中文摘要

实例目标导航(IGN)要求具身智能体根据不明确的自然语言描述,在干扰物中找到特定对象实例。这种歧义通常无法仅通过感知和语言解决,因此与oracle的交互成为消歧的自然机制。先前的交互方法允许oracle查询,但将轻量级澄清和路径级指导同等对待,使得智能体通过重复的高信息量问题提高成功率,而非高效解决潜在歧义。我们将交互式IGN重新定义为成本敏感的不确定性减少问题,其中智能体应提出其答案相对于惩罚能最大程度减少导航不确定性的问题。为此,我们对现有导航语料库进行信息增益分析,以识别哪些线索能减少导航不确定性,从而得到一组紧凑的问题类型和数据驱动的成本。然而,现有的交互式导航基准并未建模不同问题类型的成本,也未评估智能体使用交互的效率,因此不适合研究成本敏感的交互。基于此分类,我们构建了一个用于诊断交互行为和效率的基准,以及一个加权成功率指标,该指标根据推导出的成本对每次查询进行惩罚。我们进一步提出了一种零样本MLLM导航器,仅在预期不确定性减少证明交互成本合理时,才在每个决策步骤有选择地进行查询。

英文摘要

Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an under-specified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate through repeated high-information questions rather than by resolving the underlying ambiguity efficiently. We recast interactive IGN as a cost-sensitive uncertainty-reduction problem, where the agent should ask the question whose answer provides the largest reduction in navigation uncertainty relative to its penalty. To this end, we apply an information-gain analysis on existing navigation corpora to identify which cues reduce navigation uncertainty, yielding a compact set of question types and data-derived weights. However, existing interactive navigation benchmarks do not model the cost of different question types or evaluate how efficiently agents use interaction, making them unsuitable for studying cost-sensitive interaction. Based on this taxonomy, we construct a benchmark for diagnosing interaction behavior and efficiency, together with a Weighted Success Rate metric that penalizes each query by its derived cost. We further propose a zero-shot MLLM navigator that selectively queries at each decision step only when the expected uncertainty reduction justifies the interaction cost.
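The cost-sensitive query rule described above can be sketched as an expected-information-gain calculation. This is a toy illustration under stated assumptions, not the paper's navigator: the question names, their costs, the use of Shannon entropy in bits, and the "gain minus cost" decision rule are all hypothetical choices.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_information_gain(prior, partition):
    """Expected entropy reduction over candidate goals.

    partition: groups of candidate indices that would receive the same answer.
    """
    gain = entropy(prior)
    for group in partition:
        p_answer = sum(prior[i] for i in group)
        if p_answer > 0:
            posterior = [prior[i] / p_answer for i in group]
            gain -= p_answer * entropy(posterior)
    return gain

def choose_question(prior, questions):
    """Return the question with the best positive net benefit, else None."""
    best_name, best_net = None, 0.0
    for name, (cost, partition) in questions.items():
        net = expected_information_gain(prior, partition) - cost
        if net > best_net:
            best_name, best_net = name, net
    return best_name

# Hypothetical question types: (cost in bits, answer partition over 4 candidates).
QUESTIONS = {
    "color": (0.3, [[0, 1], [2, 3]]),        # cheap clarification
    "route": (1.8, [[0], [1], [2], [3]]),    # expensive route-level guidance
}
print(choose_question([0.25, 0.25, 0.25, 0.25], QUESTIONS))
print(choose_question([0.97, 0.01, 0.01, 0.01], QUESTIONS))
```

With a uniform prior the cheap question offers the better net benefit even though the route-level question removes more uncertainty; with a confident prior neither question is worth its cost, so the agent does not ask.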

2606.02894 2026-06-04 cs.CV 版本更新

Tiny Collaborative Inference for Occlusion-Robust Object Detection

用于遮挡鲁棒目标检测的微型协同推理

Chieh-Tung Cheng, Mustafa Aslanov, Eiman Kanjo

发表机构 * Imperial College London(伦敦帝国理工学院) Nottingham Trent University(诺丁汉特伦特大学)

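AI summary: Targeting ultra-low-end edge devices, combines an MCUNet backbone, a YOLOv2 detection head, and TensorFlow Lite quantisation; finds that decision-level fusion (WBF) improves mAP by up to +0.2736 over feature-level fusion under occlusion, and verifies the feasibility of multi-view fusion and Wi-Fi peer-to-peer deployment.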

详情
AI中文摘要

小型边缘设备,如物联网监控节点和搜索救援(SAR)平台,越来越期望本地运行计算机视觉。然而,在超低端硬件上,目标检测受到可用内存和计算、多个设备协作时的通信成本以及遮挡导致的精度损失的限制。本文通过结合MCUNet骨干网络、YOLOv2检测头和TensorFlow Lite量化,评估了在小于1 MB SRAM的设备上的遮挡鲁棒目标检测。我们评估了两种协作推理策略:特征级融合(拼接中间特征图)和通过加权框融合(WBF)的决策级融合。在测试的遮挡设置下,WBF优于特征级融合,在非对称遮挡场景中最高可提升+0.2736 mAP。将融合扩展到三个视角进一步提高了精度(最高+0.3827 mAP),同时增加了通信开销(每次交换约1.3 KB)。硬件实验从主机辅助的USB中继基线开始,然后转移到两个Coral Dev Board Micro单元上的Wi-Fi对等部署,其中WBF在设备上运行,通信能量相对于推理仍然很小。在一个代表性的301.9秒自主会话中,包含108帧,融合输出在61帧上观察到,而仅Board 2为47帧,帧级覆盖增益为+29.8%。我们还包含了一个小型探索性的去中心化联邦学习(DFL)可行性说明,但由于在非独立同分布本地数据下性能仍然有限,我们不将其作为主要结果。结果支持决策级融合作为提高小规模边缘目标检测中遮挡鲁棒性的可行选项,包括在超低端硬件上无需主机的多板操作。

英文摘要

Edge AI nodes for search and rescue are increasingly expected to run computer vision locally, yet ultra-low-end hardware imposes hard constraints on memory, compute, and inter-device communication. This work addresses occlusion-robust object detection on devices with less than 1 MB SRAM by combining an MCUNet backbone, a YOLOv2 detection head, and TensorFlow Lite quantisation. Two collaborative inference strategies are evaluated: feature-level fusion, which concatenates intermediate feature maps, and decision-level fusion via Weighted Boxes Fusion (WBF). WBF outperforms feature-level fusion under all tested occlusion conditions, yielding gains of up to +0.2736 mAP in asymmetric scenarios. Extending fusion to three views improves accuracy further (up to +0.3827 mAP) at modest communication overhead (~1.3 KB per exchange). Hardware experiments progress from a host-assisted USB-relay baseline to a Wi-Fi peer-to-peer deployment on two Coral Dev Board Micro units, where WBF executes on-device with negligible communication energy relative to inference. In a 301.9 s autonomous session of 108 frames, fused output is produced on 61 frames versus 47 for a single board, a coverage gain of +29.8%. A decentralised federated learning feasibility note is included but not treated as a primary result, as performance remains limited under non-iid data. The results support decision-level fusion as a viable option for improving occlusion robustness in small-scale edge object detection, including host-free multi-board operation on ultra-low-end hardware.
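Decision-level fusion can be sketched as follows. This is a simplified, single-class version of Weighted Boxes Fusion written for illustration, not the authors' on-device code: the helper names, the 0.55 IoU threshold, and the example detections are assumptions, and the confidence rescaling step of full WBF is omitted. The last lines reproduce the coverage-gain arithmetic from the frame counts reported in the abstract.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def fused_box(cluster):
    """Confidence-weighted average of the box coordinates in a cluster."""
    total = sum(score for _, score in cluster)
    return tuple(sum(box[i] * score for box, score in cluster) / total for i in range(4))

def fuse_boxes(detections, iou_thr=0.55):
    """Fuse (box, score) detections pooled from all views (single class)."""
    clusters = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        for cluster in clusters:
            if iou(fused_box(cluster), box) >= iou_thr:
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    # Fused score is the mean confidence of the cluster members.
    return [(fused_box(c), sum(s for _, s in c) / len(c)) for c in clusters]

# Hypothetical detections: view A sees the object clearly, view B is occluded.
view_a = [((0, 0, 10, 10), 0.9)]
view_b = [((1, 1, 11, 11), 0.3), ((50, 50, 60, 60), 0.5)]
print(fuse_boxes(view_a + view_b))

# Coverage gain from the abstract: 61 fused frames versus 47 for a single board.
coverage_gain = (61 - 47) / 47
print(f"{coverage_gain:+.1%}")
```

Because only boxes and scores are exchanged, the payload per frame stays small, which is consistent with the low communication overhead the abstract reports for decision-level fusion.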

2606.02576 2026-06-04 cs.CV cs.LG 版本更新

ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

ProtoAda: 原型引导的自适应适配器扩展与几何整合用于多模态持续指令微调

Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang, Da-Wei Zhou

发表机构 * School of Artificial Intelligence, Nanjing University, China(南京大学人工智能学院) State Key Laboratory of Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室)

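AI summary: Proposes the ProtoAda framework, which uses format-aware task prototypes and geometry-aware parameter consolidation to address task misrouting and gradient interference in multimodal continual instruction tuning.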

详情
AI中文摘要

多模态大语言模型通过指令微调取得了强大性能,但实际部署需要它们持续获取新的视觉语言能力,这使得多模态持续指令微调至关重要。为了减少任务间干扰并促进协作,近期方法常采用稀疏架构,如基于图像-文本相似度路由的LoRA专家混合。然而,具有不同响应结构的任务可能共享高度相似的视觉语言语义,从而被错误地路由到同一专家;仅凭图像-文本相似度不足以进行可靠的任务分配。例如,一个需要坐标预测的定位任务专家,在学习语义相似的VQA任务后,可能偏向于生成短文本答案。这种格式盲目的任务分配将异构响应类型整合到共享参数中,引发梯度干扰和无效的专家协作。为解决此问题,我们提出ProtoAda,一种原型引导的自适应微调框架。ProtoAda引入格式感知任务原型,使任务分配和路由与任务语义及输出结构对齐,并以几何感知方式整合格式兼容的更新,有效重用并逐步优化现有参数。在多个基准上的大量实验表明,ProtoAda取得了优越性能,尤其是在答案结构易被顺序微调破坏的任务上。

英文摘要

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. To reduce inter-task interference and promote collaboration, recent methods often employ sparse architectures like Mixture of LoRA Experts with image-text similarity routing. However, tasks with distinct response structures could share highly similar visual-linguistic semantics and thus be wrongly routed to the same expert; image-text similarity alone is insufficient for reliable task assignment. For example, an expert in a grounding task requiring coordinate prediction may be biased toward producing short textual answers after learning semantically similar VQA tasks. This format-blind task assignment integrates heterogeneous response types into shared parameters, inducing gradient interference and ineffective expert collaboration. To address this problem, we propose ProtoAda, a prototype-guided adaptive tuning framework. ProtoAda introduces format-aware task prototypes to align task assignment and routing with both task semantics and output structure, and further consolidates format-compatible updates in a geometry-aware manner to effectively reuse and progressively refine existing parameters. Extensive experiments on multiple benchmarks demonstrate that ProtoAda achieves superior performance, especially on tasks whose answer structures are easily corrupted by sequential tuning.

2606.02521 2026-06-04 cs.LG cs.CV 版本更新

Drifting Preference Optimization for One-Step Generative Models

一步生成模型的漂移偏好优化

Zhou Jiang, Yandong Wen, Zhen Liu

发表机构 * Westlake University(西湖大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

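AI summary: Proposes DrPO, which samples and ranks candidates online to build a non-parametric dipole preference field plus a reference drift, enabling preference finetuning of one-step generative models without reward-model gradients.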

Comments 24 pages, 9 figures

详情
AI中文摘要

一步式文本到图像生成器因只需一次前向传播即可生成图像而具有吸引力,但其偏好微调仍然困难:标准对齐方法通常依赖于策略似然、去噪轨迹、可微奖励梯度或测试时优化。我们提出漂移偏好优化(DrPO),一种针对确定性一步生成器的在线偏好微调方法。对于每个提示,DrPO 从当前生成器中采样候选图像,用目标奖励对其进行排序,并利用高分和低分样本合成特征空间更新方向。该更新是一个非参数偶极偏好场加上从冻结的基础生成器估计的参考漂移,并通过解耦的特征空间回归目标进行优化。目标奖励仅用于排序,因此 DrPO 可以使用大型、黑盒或不可微的奖励进行训练,而推理仍只需一次生成器调用。我们在 SD-Turbo 和 SDXL-Turbo 上使用多个目标奖励和基准(包括 HPSv3 和 GenEval)评估了 DrPO。DrPO 在匹配有效批量设置下,通过移除奖励模型反向传播,比无奖励梯度的一步偏好基线提高了对齐度,并将 HPSv3 训练计算量减少了 3.51 倍。初步离线实验表明,基于样本的梯度合成也可用于在线奖励排序之外。

英文摘要

One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denoising trajectories, differentiable reward gradients, or test-time optimization. We propose Drifting Preference Optimization (DrPO), an online preference-finetuning method for deterministic one-step generators. For each prompt, DrPO samples candidates from the current generator, ranks them with a target reward, and uses high- and low-scoring samples to synthesize a feature-space update direction. The update is a non-parametric dipole preference field plus a reference drift estimated from the frozen base generator, and is optimized through a detached feature-space regression target. The target reward is used only for ranking, so DrPO can train with large, black-box, or non-differentiable rewards while inference remains a single generator call. We evaluate DrPO on SD-Turbo and SDXL-Turbo with multiple target rewards and benchmarks, including HPSv3 and GenEval. DrPO improves alignment over reward-gradient-free one-step preference baselines and reduces HPSv3 training computation by $3.51\times$ under the matched effective-batch setting by removing reward-model backpropagation. Initial offline experiments suggest that sample-based gradient synthesis can also be used beyond online reward ranking.

2606.01649 2026-06-04 cs.CV 版本更新

PhyScene3D: Physically Consistent Interactive 3D Tabletop Scene Generation

PhyScene3D:物理一致的交互式3D桌面场景生成

Weixing Chen, Zhuoqian Feng, Yang Liu, Yexin Zhang, Yifan Wen, Yinghong Liao, Weichao Qiu, Guanbin Li, Liang Lin

发表机构 * Sun Yat-sen University, China(中山大学) Peng Cheng Laboratory(鹏城实验室) Guangdong Key Laboratory of Big Data Analysis and Processing(广东省大数据分析与处理重点实验室) Huawei(华为)

AI总结 提出PhyScene3D框架,通过认知拓扑推理链和物理感知去噪对齐,解决3D桌面场景生成中的物理一致性问题,显著降低碰撞率。

Comments 23 pages, 5 figures, accepted by ICML 2026

详情
AI中文摘要

生成物理一致的3D桌面场景是交互式和通用机器人学习中的一个基本但尚未充分探索的问题。挑战源于密集的对象层次结构和不规则的可供性。这里,交互场景指的是一个物理有效、无碰撞的环境,可直接加载到物理模拟器中。现有方法,从解耦的符号求解器到端到端回归模型,通常遭受误差传播或过拟合到包含广泛物理违规的噪声监督。为了解决这些限制,我们引入了PhyScene3D,一个将生成重新表述为类人构造过程的框架。提出的认知拓扑推理链(CTRC)将场景合成分解为顺序的、锚点条件的过程。它采用基于3D AABB的放置方案,施加了强大的结构归纳偏置。为了解决不完美的监督和物理不可行性,我们引入了物理感知去噪对齐(PADA)。它将可微分的符号距离场(SDF)与测试时优化(TTO)相结合,将生成的场景投影到物理可行的流形上,同时保留语义意图。实验表明,PhyScene3D在语义准确性和物理有效性方面均优于最先进的方法,相对于人工标注的训练数据,场景级碰撞率降低了40%。

英文摘要

Generating physically consistent 3D tabletop scenes is a fundamental yet underexplored problem for interactive and generalist robotic learning. The challenge stems from dense object hierarchies and irregular affordances. Here, an interactive scene denotes a physically valid, collision-free environment directly loadable into physics simulators. Existing methods, ranging from decoupled symbolic solvers to end-to-end regression models, often suffer from error propagation or overfitting to noisy supervision containing widespread physical violations. To address these limitations, we introduce PhyScene3D, a framework that reformulates generation as a Human-Mimetic Constructive Process. The proposed Cognitive Topological Reasoning Chain (CTRC) factorizes scene synthesis into a sequential, anchor-conditioned process. It employs a 3D AABB-based placement scheme that imposes a strong structural inductive bias. To address imperfect supervision and physical infeasibility, we introduce Physics-Aware Denoising Alignment (PADA). It integrates a differentiable Signed Distance Field (SDF) with Test-Time Optimization (TTO) to project generated scenes onto a physics-feasible manifold while preserving semantic intent. Experiments demonstrate that PhyScene3D outperforms state-of-the-art approaches in both semantic accuracy and physical validity, achieving a 40% reduction in scene-wise collision rate relative to the human-annotated training data.
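A scene-wise collision check over 3D axis-aligned bounding boxes (AABBs) can be sketched as below. The abstract does not define how the collision rate is computed, so this assumes a scene counts as colliding when any pair of object AABBs interpenetrates; the function names and the tolerance `eps` are hypothetical.

```python
from itertools import combinations

def aabb_overlap(a, b, eps=1e-6):
    """True if two AABBs, each (min_xyz, max_xyz), interpenetrate on all axes.

    Boxes that merely touch (for example a plate resting on a table) do not count.
    """
    return all(
        a[0][i] < b[1][i] - eps and b[0][i] < a[1][i] - eps for i in range(3)
    )

def scene_has_collision(boxes):
    """True if any pair of objects in the scene overlaps."""
    return any(aabb_overlap(a, b) for a, b in combinations(boxes, 2))

def scene_collision_rate(scenes):
    """Fraction of scenes containing at least one colliding pair."""
    return sum(scene_has_collision(s) for s in scenes) / len(scenes)

# Hypothetical scenes: a table top with a plate resting on it, then a mug
# that sinks into the table.
table = ((0.0, 0.0, 0.0), (1.0, 1.0, 0.05))
plate = ((0.4, 0.4, 0.05), (0.6, 0.6, 0.07))
sunken_mug = ((0.1, 0.1, 0.02), (0.2, 0.2, 0.12))
print(scene_collision_rate([[table, plate], [table, sunken_mug]]))
```

An AABB test is conservative for non-box shapes, which is one reason a finer signal such as the signed distance field mentioned in the abstract would be needed for the actual refinement step.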

2606.01573 2026-06-04 cs.CV 版本更新

$\text{VG}^2$GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer

VG²GT: 体素-高斯泼溅视觉几何基础变换器

Yibin Zhao, Yihan Pan, Jun Nan, Wenli Yang, Liwei Chen, Jianjun Yi

发表机构 * East China University of Science and Technology(华东理工大学) Shanghai Open University(上海开放大学) Shanghai Xiaoyuan Innovation Center(上海小元创新中心)

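AI summary: Proposes VG²GT, which uses a frozen visual foundation model and a multi-scale differentiable voxel module, regresses Gaussian primitive parameters directly from voxel features, and supervises depth maps through stochastic solid volume rendering to achieve geometrically accurate Gaussian scene reconstruction, outperforming prior methods on DTU, Replica, TAT, and ScanNet.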

详情
AI中文摘要

高斯泼溅在3D重建和新视角合成方面显示出强大的潜力。然而,大多数现有方法需要精确的相机参数和逐场景优化,而使用像素对齐高斯原语的前馈方法常常遭受伪影和非均匀原语的困扰。在本文中,我们提出了VG²GT,一种体素-高斯泼溅视觉几何基础变换器。VG²GT利用冻结的预训练视觉基础模型(VFM),结合多尺度可微体素模块以增强几何理解,并直接从体素特征分裂和回归高斯原语参数。在训练过程中,通过随机实体体渲染监督深度图,使得在保持视觉基础模型完全冻结的同时,实现几何准确的高斯场景重建。这种设计使VG²GT能够无缝插入任何基于补丁特征的VFM,同时大幅降低所需的训练成本。VG²GT在广泛使用的DTU、Replica、TAT和ScanNet数据集上优于当前最先进的方法。

英文摘要

Gaussian splatting has shown strong potential for 3D reconstruction and novel view synthesis. However, most existing methods require accurate camera parameters and per-scene optimization, while feed-forward methods with pixel-aligned Gaussian primitives often suffer from artifacts and non-uniform primitives. In this paper, we propose $\text{VG}^2$GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer. $\text{VG}^2$GT leverages a frozen pretrained visual foundation model (VFM), incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features. During training, depth maps are supervised through stochastic solid volume rendering, enabling geometrically accurate Gaussian scene reconstruction while keeping the visual foundation model fully frozen. This design enables $\text{VG}^2$GT to be seamlessly plugged into any patch-feature-based VFM, while substantially reducing the required training cost. $\text{VG}^2$GT outperforms current state-of-the-art methods on the widely used DTU, Replica, TAT, and ScanNet datasets.

2606.01537 2026-06-04 cs.CV cs.LG 版本更新

PaCX-MAE: Physiology-Augmented Chest X-Ray Masked Autoencoder

PaCX-MAE: 生理增强的胸部X光掩码自编码器

Yancheng Liu, Kenichi Maeda, Manan Pancholy

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Tokyo(东京大学) University of Michigan(密歇根大学)

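AI summary: Proposes PaCX-MAE, a cross-modal distillation framework that injects physiological priors into chest X-ray encoders through a dual contrastive-predictive objective, improving physiology-dependent tasks while keeping inference unimodal.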

Comments Accepted at the ICML 2026 3rd Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences (FM4LS)

详情
AI中文摘要

临床诊断通常需要结合影像与生理测量,但部署的模型通常处理单模态数据。我们提出PaCX-MAE,一种跨模态蒸馏框架,将生理先验注入胸部X光(CXR)编码器,同时在推理时严格保持单模态。PaCX-MAE通过双对比预测目标增强域内掩码自编码,使CXR表示与配对的ECG和实验室嵌入对齐。在九个基准上的广泛评估表明,该方法在领域特定MAE上取得一致改进,特别是在依赖生理的任务上(例如,MedMod上AUROC提升2.7;VinDr上F1提升6.5)。该方法在1%标注数据下表现出高度标签效率,并保持解剖保真度,在分割任务上与MAE持平。零样本和注意力分析证实,PaCX-MAE成功学习关注生理指标,如心脏轮廓,这在标准视觉预训练中缺失。

英文摘要

Clinical diagnosis often requires combining imaging with physiological measurements, yet deployed models typically operate on unimodal data. We present PaCX-MAE, a cross-modal distillation framework that injects physiological priors into chest X-ray (CXR) encoders while remaining strictly unimodal at inference. PaCX-MAE augments in-domain masked autoencoding with a dual contrastive-predictive objective, aligning CXR representations with paired ECG and laboratory embeddings. Extensive evaluation across nine benchmarks demonstrates consistent improvements over domain-specific MAE, particularly on physiology-dependent tasks (e.g., +2.7 AUROC on MedMod; +6.5 F1 on VinDr). The method proves highly label-efficient in the 1% regime and preserves anatomical fidelity, achieving parity with MAE on segmentation tasks. Zero-shot and attention analyses confirm that PaCX-MAE successfully learns to attend to physiological indicators, such as the cardiac silhouette, absent in standard visual pretraining.

2606.01023 2026-06-04 cs.CV cs.AI 版本更新

Data Collection for Training Quality-Control AI in Carpet Manufacturing

地毯制造中用于训练质量控制AI的数据收集

Akbar Erkinov

发表机构 * Independent Researcher(独立研究者)

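AI summary: To address slow, subjective, and inconsistent visual inspection in carpet production, proposes an in-line machine-vision system design that uses synchronized line-scan cameras and combined illumination to detect defects in real time, systematically collects labelled data to keep training quality-control models, and quantifies the resulting quality improvement through the DMAIC methodology.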

Comments 10 pages, 3 figures

详情
AI中文摘要

视觉检测仍然是机织和簇绒地毯生产中主要的质量控制实践,但在现代织机的线速度和宽度下,它缓慢、主观且不一致。我们提出了一种在线机器视觉系统的设计方案,其主要目的有两个:实时检测地毯幅面,以及同样重要的是,系统地收集和标注缺陷图案的图像,以便在设备使用寿命内训练日益强大的质量控制模型。该方案基于一个具体的工业环境:在一个机织地毯生产设施中进行的六西格玛(DMAIC)项目,该项目预计在增加织机后会出现生产瓶颈,且基线缺陷率较高,质量故障带来的财务风险显著。我们描述了一个基于同步线扫描相机并组合明场和掠射照明的成像子系统,推导了在多米宽幅面上分辨细微结构缺陷所需的分辨率和吞吐量要求,并定义了地毯特定的缺陷分类。然后,我们提出了一种分阶段建模策略,从基于无缺陷材料的无监督异常检测开始,遵循MVTec异常检测基准中地毯类别的范例,并通过人在环的标注飞轮成熟为有监督的检测和分割模型。最后,我们将检测性能与DMAIC目标联系起来,展示逃逸缺陷的减少如何转化为过程质量和过程西格玛水平的提升。贡献在于提供了一个端到端、可部署的蓝图,将数据收集视为首要工程目标而非事后考虑。

英文摘要

Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inconsistent at the line speeds and widths of modern looms. We present a design proposal for an in-line machine-vision system whose primary purpose is twofold: to inspect the carpet web in real time and, equally importantly, to systematically collect and label images of defect patterns so that increasingly capable quality-control models can be trained over the life of the installation. The proposal is grounded in a concrete industrial setting: a Six Sigma (DMAIC) project at a woven-carpet production facility that anticipated a production bottleneck following the installation of additional weaving machines, with a substantial baseline defect rate and significant financial exposure associated with quality failures. We describe an imaging subsystem based on synchronized line-scan cameras with combined bright-field and grazing illumination, derive the resolution and throughput requirements needed to resolve fine structural defects across a multi-metre web, and define a carpet-specific defect taxonomy. We then lay out a staged modelling strategy that begins with unsupervised anomaly detection trained on defect-free material, following the paradigm exemplified by the carpet category of the MVTec Anomaly Detection benchmark, and matures through a human-in-the-loop annotation flywheel into supervised detection and segmentation models. Finally, we connect detection performance to the DMAIC objectives, showing how reductions in escaped defects translate into improved process quality and process sigma levels. The contribution is an end-to-end, deployable blueprint that treats data collection as a first-class engineering objective rather than an afterthought.
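The link between escaped defects and process sigma level can be made concrete with the conventional Six Sigma conversion from defects per million opportunities (DPMO). The sketch below uses the customary 1.5-sigma long-term shift; the function name and the defect counts in the example are hypothetical and are not taken from the paper.

```python
from statistics import NormalDist

def process_sigma(defects, units, opportunities_per_unit=1, shift=1.5):
    """Conventional short-term sigma level from observed defects.

    Requires 0 < defects < units * opportunities_per_unit, because the
    inverse normal CDF is undefined at 0 and 1.
    """
    dpmo = defects / (units * opportunities_per_unit) * 1_000_000
    yield_fraction = 1 - dpmo / 1_000_000
    return NormalDist().inv_cdf(yield_fraction) + shift

# Hypothetical before/after escaped-defect counts per million inspected units.
before = process_sigma(50_000, 1_000_000)
after = process_sigma(20_000, 1_000_000)
print(f"before: {before:.2f} sigma, after: {after:.2f} sigma")
```

The same conversion gives the familiar benchmark of roughly 6 sigma at 3.4 DPMO, so any reduction in escaped defects achieved by the inspection system maps directly onto a higher sigma level.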

2606.00747 2026-06-04 cs.CV cs.AI 版本更新

SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy

SkyShield:占用作为低空无人机自主飞行的安全接口

Jie Gao, Jie Ma, Kaihui Lin, Kai Ye, Miaohui Zhang, Pingyang Dai, Liujuan Cao

发表机构 * Xiamen University(厦门大学) Jiangxi Academy of Sciences(江西省科学院)

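AI summary: For 3D spatial understanding in low-altitude UAV autonomy, introduces SkyShield, described as the first front-view monocular semantic occupancy benchmark, together with the dynamics-aware metric KAR-mIoU and the geometry-first baseline SkyOcc, framing occupancy as a safety interface.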

详情
AI中文摘要

对于低空无人机自主飞行,三维空间理解不仅仅是感知目标,更是人类指令与物理飞行之间的安全接口。在20米以下的人尺度城市空域中,薄几何结构、遮挡、植被和城市杂乱决定了飞行器能否安全进入前方空间。然而,现有的无人机数据集主要提供2D标注或3D框,而面向驾驶的占用基准假设稳定的地面级传感器装置。两者都缺少低空飞行的定义性场景:一个前视单目相机从移动的飞行器上观察占据和自由空间,具有逐帧变化的6自由度姿态和相机外参。为填补这一空白,我们提出了SkyShield,据我们所知,这是首个面向20米以下城市无人机飞行的前视单目语义占用基准。基于CARLA构建,SkyShield包含36K个前视无人机样本,涵盖多种城市场景和天气条件,每张图像配以逐帧6自由度无人机姿态、逐帧动态相机几何、无人机状态和前视截锥体语义占用标签。我们进一步提出了KAR-mIoU,一种以无人机为中心且动态感知的度量,通过运动可达性和碰撞时间重新加权体素级评估,揭示传统mIoU隐藏的安全关键风险。为应对这一具有挑战性的新场景,我们提供了SkyOcc,一种几何优先的单目基线,将逐帧无人机姿态集成到投影中,融合时序占用特征,并应用安全先验优化以保留稀疏的碰撞关键结构。SkyShield、KAR-mIoU和SkyOcc共同将占用确立为低空空中自主飞行的安全接口。代码和数据集将公开发布。

英文摘要

For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety interface between human instructions and physical flight. In human-scale urban airspace below 20 meters, thin geometry, occlusions, vegetation, and urban clutter define whether an aerial agent can safely enter the space ahead. However, existing UAV datasets mainly provide 2D annotations or 3D boxes, while driving-oriented occupancy benchmarks assume stable ground-level sensor rigs. Both miss the defining regime of low-altitude flight: a front-facing monocular camera observing occupied and free space from a moving aerial body with frame-wise changing 6-DoF pose and camera extrinsics. To bridge this gap, we introduce SkyShield, to the best of our knowledge the first front-view monocular semantic occupancy benchmark for urban UAV flight below 20 meters. Built on CARLA, SkyShield contains 36K front-view UAV samples across diverse urban scenes and weather conditions, pairing each image with frame-wise 6-DoF UAV pose, frame-wise dynamic camera geometry, UAV states, and front-frustum semantic occupancy labels. We further propose KAR-mIoU, a UAV-centric and dynamics-aware metric that re-weights voxel-level evaluation by kinematic reachability and time-to-collision, revealing safety-critical risks hidden by conventional mIoU. To tackle this challenging new setting, we provide SkyOcc, a geometry-first monocular baseline that integrates frame-wise UAV attitude into projection, fuses temporal occupancy features, and applies safety-prior optimization to preserve sparse collision-critical structures. Together, SkyShield, KAR-mIoU, and SkyOcc establish occupancy as a safety interface for low-altitude aerial autonomy. Code and dataset will be released publicly.

2606.00260 2026-06-04 cs.CV cs.LG 版本更新

LastAct: Trajectory-Guided Latest-Activity Localization for Real-Time Smart-Home Activity Recognition

LastAct: 轨迹引导的最新活动定位用于实时智能家居活动识别

Zishuai Liu, Ruili Fang, Jin Lu, Fei Dou

发表机构 * School of Computing, University of Georgia(佐治亚大学计算学院)

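AI summary: Proposes the LastAct framework, which uses trajectory image sequences and a boundary localizer to handle boundary contamination in sliding windows, enabling real-time smart-home activity recognition.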

详情
AI中文摘要

基于环境传感器的人类活动识别(HAR)支持健康监测和辅助生活等智能家居应用。然而,在实际部署中,传感器事件以连续流的形式到达,活动边界未知。因此,滑动窗口推理会产生许多跨越转换并包含混合活动的窗口,造成边界污染,违反了大多数基准和模型使用的预分割实例假设。此外,许多管道通过将传感器ID视为独立标记来未充分利用空间上下文。我们提出了LastAct,一个面向轨迹的流式智能家居HAR框架,旨在处理混合窗口下的最新活动,同时显式建模空间结构。LastAct将传感器事件投影到家庭平面图上,形成保持空间连续性的布局对齐轨迹图像序列。一个轻量级门控识别受污染的窗口,边界定位器估计最后一个转换,从而实现边界引导的掩码,强调边界后的证据并抑制过时的上下文。为了提高效率,我们重用预计算的布局对齐模板缓存以避免重复渲染。实验表明,在四个公开的智能家居数据集上,采用接近真实的混合活动协议,LastAct在纯窗口上达到竞争性或更优的性能,并在交叉/混合窗口上获得显著的Macro-F1增益,展示了在接近真实的滑动窗口机制下更强的鲁棒性。

英文摘要

Human Activity Recognition (HAR) from ambient sensors enables smart-home applications such as health monitoring and assisted living. In realistic deployments, however, sensor events arrive as a continuous stream and activity boundaries are unknown. Sliding-window inference therefore produces many windows that straddle transitions and contain mixed activities, creating boundary contamination that violates the pre-segmented instance assumption used by most benchmarks and models. Moreover, many pipelines under-use spatial context by treating sensor IDs as independent tokens. We present LastAct, a trajectory-centric framework for streaming smart-home HAR that targets the most recent activity under mixed windows while explicitly modeling spatial structure. LastAct projects sensor events onto the home floorplan to form a layout-aligned trajectory image sequence that preserves spatial continuity. A lightweight gate identifies contaminated windows, and a boundary localizer estimates the last transition to enable boundary-guided masking that emphasizes post-boundary evidence and suppresses stale context. For efficiency, we reuse a precomputed layout-aligned template cache to avoid repeated rendering. Empirically, across four public smart-home datasets under near-realistic mixed-activity protocols, LastAct achieves competitive or superior performance on pure windows and yields substantial Macro-F1 gains on cross/mixed windows, demonstrating improved robustness under near-realistic sliding-window regimes.

2605.31039 2026-06-04 cs.CV 版本更新

GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

GGT-100K:面向泛化真实世界图像恢复的生成式真实标签

Xiangtao Kong, Jixin Zhao, Lingchen Sun, Rongyuan Wu, Lei Zhang

发表机构 * VISUAL COMPUTING LAB POLYU(PolyU视觉计算实验室) The Hong Kong Polytechnic University(香港理工大学) OPPO Research Institute(OPPO研究院)

AI总结 提出利用生成式多模态基础模型从真实低质量图像合成高质量目标作为真实标签,构建包含10万对数据的GGT-100K数据集,显著提升多种图像恢复模型的真实世界泛化能力。

AI中文摘要

真实世界图像恢复(IR)受限于高质量配对训练数据的稀缺。合成数据集丰富但常无法模拟真实退化,而真实配对数据集昂贵且难以获取。因此,在这些数据集上训练的IR模型在真实场景中泛化能力有限。本文提出利用生成式多模态基础模型(MFMs)从真实低质量(LQ)图像生成高质量(HQ)目标,即生成式真实标签(GGT)。我们首先对包括Nano-Banana-2和GPT-Image-2在内的九种最先进MFMs,在多种场景和退化类型的图像上进行了系统评估。结果表明,采用基于VLM自适应提示的Nano-Banana-2在合成感知真实且内容忠实的高质量目标方面能力最强,可作为LQ输入的GGT。随后,我们使用Nano-Banana-2构建GGT合成流水线,包括多阶段质量控制以确保数据可靠性,并构建了GGT-100K,一个包含103,707个训练对的LQ-HQ配对数据集,覆盖多样场景和复杂真实退化。还建立了500个图像对的测试集。大量实验表明,GGT-100K持续提升多种IR模型的真实世界泛化能力,尤其对微调生成模型进行IR任务有显著益处。我们的结果表明,MFMs可作为面向恢复的数据生成的实用工具,GGT-100K是扩展真实世界IR模型泛化边界的有用资源。

英文摘要

Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail to model real-world degradations, while real-world paired datasets are expensive and difficult to capture. As a result, IR models trained on these datasets show limited generalization in real-world scenarios. In this work, we propose Generative Ground Truth (GGT) by using generative multimodal foundation models (MFMs) to produce high-quality (HQ) targets from real-world low-quality (LQ) images. We first conduct a systematic evaluation of nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images of various scenes and degradation types. The results demonstrate that Nano-Banana-2 with VLM-based adaptive prompting shows the highest capability to synthesize perceptually realistic and content-faithful HQ targets, which can serve as the GGT for the LQ input. We then employ Nano-Banana-2 to build a GGT synthesis pipeline, which involves multi-stage quality control to ensure data reliability, and construct GGT-100K, an LQ-HQ paired dataset comprising 103,707 training pairs and covering diverse scenes and complex real-world degradations. A test set of 500 image pairs is also established. Extensive experiments show that GGT-100K consistently improves the real-world generalization of a wide range of IR models, with particularly strong benefits for finetuning generative models for IR tasks. Our results suggest that MFMs can serve as practical tools for restoration-oriented data generation, and GGT-100K is a useful resource to expand the generalization boundaries of real-world IR models.

2605.30705 2026-06-04 cs.CV cs.LG 版本更新

Equivariant Latent Alignment via Flow Matching under Group Symmetries

群对称下通过流匹配的等变潜在对齐

Sunghyun Kim, Jaehoon Hahm, Jeongwoo Shin, Joonseok Lee

发表机构 * University of Illinois Urbana-Champaign, Illinois, USA(伊利诺伊大学厄巴纳-香槟分校) Seoul National University, Seoul, Korea(首尔国立大学)

AI总结 针对现有方法在潜在空间中存在的等变错位问题,提出基于流的残差潜在流框架,通过纠正错位潜在表示来增强旋转群SO(n)下的等变一致性,提升新视角合成质量。

AI中文摘要

几何感知生成模型和新视角合成方法在视觉保真度和一致性方面展现出强大潜力。同时,等变表示学习已成为构建潜在空间的有力框架,其中分析已知的群变换可以直接作用,捕捉数据中的几何结构,并增强新视角合成的可解释性和泛化性。然而,我们发现现有方法常遭受潜在错位问题,即潜在空间中预期的群作用与实际所需的变换之间存在差异。因此,学习到的潜在表示往往无法一致地保持底层群对称性所施加的等变关系。为解决此问题,我们提出残差潜在流,一种基于流的框架,用于纠正错位的潜在表示,从而提高对底层等变关系的遵从性。我们的综合实验表明,在旋转群SO(n)下,我们的方法显著减少了潜在错位,并提高了新视角合成的质量。

英文摘要

Geometry-aware generative models and novel view synthesis approaches have shown strong potential in visual fidelity and consistency. In parallel, equivariant representation learning has emerged as a powerful framework for constructing latent spaces where analytically known group transformations could act directly, capturing geometric structure in data and enhancing both interpretability and generalization in novel view synthesis. However, we identify that existing approaches often suffer from latent misalignment, a discrepancy between the intended group action and the actually required transformations in the latent space. Consequently, the learned latents often fail to consistently preserve the equivariant relations imposed by the underlying group symmetry. To address this, we propose Residual Latent Flow, a flow-based framework that corrects the misaligned latents, thereby improving compliance with the underlying equivariance relation. Our comprehensive experiments show that our method significantly reduces latent misalignment and improves novel view synthesis quality, under rotation groups SO(n).
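
One way to make "latent misalignment" concrete is to measure how far an encoder is from commuting with a known group action. The sketch below uses planar rotations, SO(2), as the simplest case and assumes the latent group action is the same rotation as in input space. The encoders and the `misalignment` function are hypothetical illustrations, not the paper's Residual Latent Flow.

```python
import math


def rotate(v, theta):
    """Apply a 2D rotation by angle theta to the vector v."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def misalignment(encode, x, theta):
    """Norm of encode(g . x) - rho(g) encode(x) for a planar rotation g."""
    a = encode(rotate(x, theta))
    b = rotate(encode(x), theta)
    return math.hypot(a[0] - b[0], a[1] - b[1])


def aligned_encoder(v):
    """Uniform scaling commutes with rotation, so it is equivariant."""
    return (2.0 * v[0], 2.0 * v[1])


def misaligned_encoder(v):
    """A constant offset breaks the equivariance relation."""
    return (2.0 * v[0] + 0.5, 2.0 * v[1])


x, theta = (1.0, 0.0), math.pi / 2
err_aligned = misalignment(aligned_encoder, x, theta)
err_misaligned = misalignment(misaligned_encoder, x, theta)
```

The first error is zero up to floating-point rounding, while the second is strictly positive; a correction step of the kind the abstract describes would aim to shrink the second quantity.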

2605.25402 2026-06-04 cs.CV cs.AI 版本更新

Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation

解剖锚定的自监督:蒸馏视觉基础模型用于不变超声表示

Chunzheng Zhu, Yijun Wang, Jianxin Lin, Feng Wang, Hongwei Wang, Lei Zhao, Shengli Li, Kenli Li

发表机构 * Hunan University(湖南大学) Shenzhen Maternity and Child Healthcare Hospital(深圳妇幼保健医院)

AI总结 提出解剖锚定的超声自监督框架ANAUS,通过可学习潜在提示引擎和领域自适应实现无标注解剖分割,并设计双策略自监督学习(语义感知解剖分离对齐和上下文核心区域预测)来增强表示学习,在六个公开数据集上超越现有方法。

Comments MICCAI 2026 Accepted Paper; Anatomy-Anchored Ultrasound Self-Supervision

AI中文摘要

自监督预训练范式在医学图像中学习可迁移表示方面日益重要,但现有超声图像方法在图像或帧级别操作,忽略了临床对齐表示学习的解剖上下文。在这项工作中,我们提出了一种解剖锚定的超声自监督框架ANAUS,将表示学习从通用视觉区域转移到临床有意义的解剖结构。利用可学习的潜在提示引擎以及对现有公开图像-掩码对的一次性领域自适应,我们使LP-SAM模块能够大规模实现无标注解剖描绘。基于此解剖基础,我们提出了一种双策略自监督学习范式,包括视图间语义感知的解剖分离对齐和上下文核心区域预测,以增强表示学习。具体而言,前者在相同解剖区域内强制特征不变性,同时促进不同结构间的可区分性;后者迫使模型重建被破坏的区域,从而捕获细粒度的结构细节。在六个公开数据集上的广泛评估表明,我们的方法持续优于当前最先进的方法,同时保持了临床部署所需的计算效率。代码可在https://github.com/zhcz328/ANAUS获取。

英文摘要

The self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet existing methods for ultrasound (US) images operate at the image or frame level, overlooking the anatomical context needed for clinically aligned representation learning. In this work, we propose ANAUS, an anatomy-anchored ultrasound self-supervision framework that shifts representation learning from generic visual regions to clinically meaningful anatomical structures. Utilizing a learnable latent prompt engine alongside a one-time domain adaptation on existing public image-mask pairs, we empower the LP-SAM module to achieve annotation-free anatomy delineation at scale. Building upon this anatomical grounding, we propose a dual-policy self-supervised learning paradigm consisting of inter-view semantics-aware anatomy-separating alignment and contextual core-region prediction to enhance representation learning. Specifically, the former enforces feature invariance within identical anatomical regions while promoting discriminability across distinct structures; the latter compels the model to reconstruct corrupted regions, thereby capturing fine-grained structural details. Extensive evaluations on six public datasets demonstrate that ANAUS consistently outperforms current state-of-the-art methods while maintaining the computational efficiency essential for clinical deployment. Code is available at https://github.com/zhcz328/ANAUS.

2605.24602 2026-06-04 cs.CV cs.AI 版本更新

Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

纠正注意力分散引起的视觉模糊以减少幻觉:算法与理论

Quanjiang Li, Zhiming Liu, Wei Luo, Tingjin Luo, Chenping Hou

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 本文揭示多模态大语言模型中的物体幻觉与类人注意力分散现象相关,并提出一种无需额外训练的注意力聚焦方法(AFIP)通过跨头注意力增强和动态历史注意力强化来纠正视觉模糊,从而减少幻觉。

详情
Journal ref
ICML2026
AI中文摘要

多模态大语言模型(MLLMs)经常遭受物体幻觉的困扰,但导致这一失败的视觉感知机制仍知之甚少。在这项工作中,我们揭示幻觉与一种类人注意力分散现象密切相关,其中人类在注意力分散下会经历视觉清晰度下降并产生不准确的描述,而在模型中,同样的机制表现为解码过程中多头注意力的空间不一致性以及对图像令牌注意力的时间衰减。我们进一步提供了理论见解,表明注意力分散会增加模型复杂度并降低分类泛化能力。受这些发现的启发,我们提出了一种用于改进图像感知的注意力聚焦方法(AFIP),该方法通过跨头注意力丰富来纠正注意力分散,并通过动态历史注意力增强来强化视觉基础。在多个基准和模型上的大量实验验证了AFIP的有效性,且无需额外训练。

英文摘要

Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.

2605.18102 2026-06-04 cs.CV 版本更新

DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos

DanceHMR: 从单目视频中恢复手部感知的全身人体网格

Wenhao Shen, Ming Zhou, Hengyuan Zhang, Siyuan Bian, Youjiang Xu, Yuan Zhang

发表机构 * ByteDance Intelligent Creation(字节跳动智能创作)

AI总结 提出一种基于残差体手融合的时序一致全身HMR框架,通过身体上下文与手部观测的融合以及特写增强,实现稳定身体运动与精细手部恢复。

Comments Project page: https://shenwenhao01.github.io/dancehmr/

AI中文摘要

单目视频人体网格恢复对于数字人、虚拟角色动画和具身模拟至关重要,需要时间稳定性和表现力丰富的全身运动。现有视频HMR方法能生成连贯的身体运动,但常忽略精细的手部关节;而基于图像的全身体方法逐帧独立恢复SMPL-X网格,常导致手部运动抖动且不准确。我们提出一种针对具有挑战性的野外单目视频的时序一致全身体HMR框架。我们的模型通过残差体手融合统一身体上下文和特定部分的手部观测,在单个时序架构中实现稳定的身体运动和精细的手部恢复。我们进一步引入特写感知增强,以提高上半身构图下的鲁棒性。在全身体和仅身体基准上的实验表明,手部重建得到改善,身体精度具有竞争力。我们的方法在具有挑战性的真实世界视频中也产生了时间稳定且2D一致的SMPL-X运动。

英文摘要

Monocular video human mesh recovery is essential for digital humans, avatar animation, and embodied simulation, where both temporal stability and expressive whole-body motion are required. Existing video HMR methods produce coherent body motion but often overlook detailed hand articulation, while image-based whole-body methods recover SMPL-X meshes independently per frame, often leading to jittery and inaccurate hand motion. We present a temporally coherent whole-body HMR framework for challenging in-the-wild monocular videos. Our model unifies body context and part-specific hand observations through residual body-hand fusion, enabling stable body motion and detailed hand recovery within a single temporal architecture. We further introduce close-up-aware augmentation to improve robustness under upper-body framing. Experiments on whole-body and body-only benchmarks demonstrate improved hand reconstruction and competitive body accuracy. Our method also produces temporally stable and 2D-consistent SMPL-X motion in challenging real-world videos.

2605.21268 2026-06-04 cs.CV 版本更新

Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification

视觉Transformer与卷积神经网络在土地利用场景分类中的应用

Arun D. Kulkarni

发表机构 * Computer Science Department, University of Texas at Tyler(德克萨斯大学泰勒分校计算机科学系)

AI总结 本文比较了视觉Transformer和CNN在遥感土地利用场景分类中的性能,发现CNN在有限训练样本和局部纹理特征强的场景中表现稳健,而ViT在数据充足时能更好地捕捉全局空间关系,但计算成本更高。

Comments 11 pages

AI中文摘要

来自遥感影像的土地利用场景分类在环境监测、城市规划和可持续资源管理中起着关键作用。近年来,深度学习方法显著推动了该领域的发展,其中卷积神经网络因其强大的局部空间特征捕获能力而占据主导地位。然而,视觉Transformer的出现引入了一种新范式,通过自注意力机制建模长距离依赖关系,可能实现更好的全局上下文理解。本文对视觉Transformer和基于CNN的架构在遥感土地利用场景分类中进行了比较评估。使用基准遥感数据集(包括UC Merced土地利用和EuroSAT土地利用数据集)评估了代表性CNN模型(如AlexNet)和视觉Transformer。研究考察了分类准确率、精确率、召回率、F1分数和计算复杂度,以提供全面的性能比较。实验结果表明,在训练样本有限且局部纹理特征强的数据集上,CNN表现稳健;而在训练数据充足时,视觉Transformer在捕获复杂场景中的全局空间关系方面表现出更优性能。然而,ViT通常需要更多的计算资源和更大的训练数据集才能达到最优性能。本研究的结果为两种架构的优势和局限性提供了见解,并为遥感土地利用场景分类应用中选择合适模型提供了指导。

英文摘要

Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new paradigm that models long-range dependencies through self-attention mechanisms, potentially enabling improved global context understanding. This paper presents a comparative assessment of Vision Transformers and CNN-based architectures for remote sensing land use scene classification. Representative CNN models, such as AlexNet, are evaluated alongside the Vision Transformer (ViT) using benchmark remote sensing datasets, including the UC Merced Land Use and EuroSAT Land Use datasets. The study examines classification accuracy, precision, recall, F1-score, and computational complexity to provide a comprehensive performance comparison. Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships in complex scenes when sufficient training data are available. However, ViTs typically require greater computational resources and larger training datasets to achieve optimal performance. The findings of this study provide insights into the strengths and limitations of both architectures and offer guidance for selecting appropriate models for remote sensing land use scene classification applications.
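
For readers unfamiliar with the metrics listed above, the following is a minimal sketch of how accuracy, per-class precision, recall, F1-score, and their macro average are computed from predicted labels. The class names and predictions are invented for illustration and are not results from the paper.

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f1)} computed from two label lists."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Made-up scene labels, purely illustrative.
y_true = ["forest", "forest", "river", "river", "urban", "urban"]
y_pred = ["forest", "river", "river", "river", "urban", "forest"]

scores = per_class_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
overall = macro_f1(scores)
```

Macro averaging weights every class equally, so a rare scene class that is classified poorly lowers the score as much as a common one.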

2605.19398 2026-06-04 cs.CV cs.AI 版本更新

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

重新平衡参考帧主导性以改善图像到视频模型中的运动

Wooseok Jeon, Seungho Park, Seunghyun Shin, Sangeyl Lee, Hyeonho Jeong, Hae-Gon Jeon

发表机构 * Yonsei University(延世大学) GIST(光州科学技术院) Adobe Research(Adobe研究)

AI总结 针对图像到视频模型生成视频过于静态的问题,提出无需训练且模型无关的DyMoS方法,通过重新平衡去噪初期生成帧对参考帧的注意力来增强运动,同时保持视觉质量和保真度。

Comments Preprint. Project page: https://sh0xed98b8.github.io/DyMoS/

AI中文摘要

与文本到视频模型相比,图像到视频模型通常生成的视频过于静态。先前的方法通过削弱或修改图像条件信号来缓解这一问题,但往往需要额外训练或牺牲对参考图像的保真度。在这项工作中,我们识别出参考帧主导性是运动抑制的关键机制。我们观察到,I2V模型中的非参考帧将过多的自注意力分配给参考帧的关键词元,导致参考信息随时间过度传播,从而抑制了帧间动态。基于这一发现,我们提出了DyMoS(动态运动滑块),一种无需训练且模型无关的方法,在初始去噪步骤中重新平衡从生成帧到参考帧的注意力路径。DyMoS保持输入图像和模型权重不变,并引入单个标量参数以连续控制运动强度。在多个最先进的I2V骨干网络上的实验表明,DyMoS在保持视觉质量和参考图像保真度的同时,一致地改善了运动动态。

英文摘要

Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.
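
The abstract does not spell out how the attention pathway is rebalanced, so the sketch below shows only one plausible reading: subtracting a scalar from the attention logits of reference-frame key tokens before the softmax. The function names and the `strength` parameter are assumptions for illustration, not the DyMoS formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rebalanced_attention(logits, is_reference, strength):
    """Lower the logits of reference-frame keys by `strength`, then renormalise."""
    biased = [z - strength if ref else z for z, ref in zip(logits, is_reference)]
    return softmax(biased)


def reference_mass(weights, is_reference):
    """Total attention weight that lands on reference-frame keys."""
    return sum(w for w, ref in zip(weights, is_reference) if ref)


# One query from a generated frame attending to five keys; the first two
# keys belong to the reference frame (toy numbers).
logits = [3.0, 2.5, 0.5, 0.2, 0.1]
is_reference = [True, True, False, False, False]

before = rebalanced_attention(logits, is_reference, strength=0.0)
after = rebalanced_attention(logits, is_reference, strength=2.0)
```

With `strength=0.0` the result is the ordinary softmax, and increasing the scalar moves attention mass away from the reference keys, which mirrors the single-parameter control the abstract describes.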

2605.15741 2026-06-04 cs.CV 版本更新

HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

HyperDiT: 用于高保真像素空间扩散的超连接Transformer

Yu He, Lichen Ma, Zipeng Guo, Xinyuan Shan, Jingling Fu, Dong Chen, Junshi Huang, Yan Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对像素空间扩散模型中全局语义与细粒度细节难以兼顾的粒度困境,提出HyperDiT框架,通过超连接跨尺度交互和尺度感知旋转位置编码,结合预训练视觉基础模型的密集语义,在像素空间实现高保真生成,在ImageNet 256×256上取得1.56的SoTA FID。

AI中文摘要

像素空间扩散模型绕过了变分自编码器(VAE)的重建瓶颈,但面临一个基本的“粒度困境”:捕捉全局语义需要大的块尺度,而生成高保真细节则要求细粒度的输入。为了解决这个问题,我们提出了HyperDiT,一个统一的框架,建立超连接跨尺度交互以桥接语义和像素流形。与通过AdaLN注入语义不同,HyperDiT利用交叉注意力机制,使细粒度标记能够全局查询多级语义锚点。为了解决多尺度交互过程中的空间不匹配问题,我们引入了尺度感知旋转位置编码(SA-RoPE),以确保不同块大小的标记之间精确的几何对齐。此外,我们加入了寄存器,从预训练的视觉基础模型(VFM)中学习密集语义,有效减少生成幻觉和伪影。大量实验表明,HyperDiT在像素空间内直接在ImageNet 256×256上实现了最先进的FID为1.56。通过将细粒度流与语义指导相结合,HyperDiT为高保真像素生成提供了一种优越的范式。

英文摘要

Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental "granularity dilemma": capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address this issue, we propose HyperDiT, a unified framework that establishes Hyper-Connected Cross-Scale Interactions to bridge the semantic and pixel manifolds. Rather than injecting semantics through AdaLN, HyperDiT uses Cross-Attention mechanisms, enabling fine-grained tokens to query multi-level semantic anchors globally. To resolve the spatial mismatch during multi-scale interactions, we introduce Scale-Aware Rotary Position Embedding (SA-RoPE) to ensure precise geometric alignment among tokens of varying patch sizes. Furthermore, we incorporate Registers to learn dense semantics from a pretrained Visual Foundation Model (VFM), effectively reducing generation hallucinations and artifacts. Extensive experiments demonstrate that HyperDiT achieves a state-of-the-art (SoTA) FID of $\mathbf{1.56}$ on ImageNet $256\times256$ directly within the pixel space. By combining the fine-grained stream with semantic guidance, HyperDiT offers a superior paradigm for high-fidelity pixel generation.

2605.14091 2026-06-04 cs.CV 版本更新

Venus-DeFakerOne: Unified Fake Image Detection & Localization

Venus-DeFakerOne: 统一假图检测与定位

GuangJian Team

发表机构 * Ant Group(蚂蚁集团)

AI总结 针对假图生成机制统一化而检测定位研究碎片化的问题,提出基于InternVL2和SAM2的数据驱动统一基础模型DeFakerOne,实现跨场景的图像级检测与像素级定位,在39个检测和9个定位基准上达到最优性能。

AI中文摘要

近年来,生成式AI的快速发展从根本上重塑了图像伪造的范式,打破了文档编辑、自然图像篡改、DeepFake生成和全图像AIGC合成之间的传统界限。尽管伪造生成正趋于统一,但现有的假图检测与定位(FIDL)研究仍然碎片化。这造成了日益统一的伪造生成机制与领域特定检测范式之间的不匹配。弥合这一不匹配给FIDL带来了两个关键挑战:理解跨域伪影的迁移与干扰,以及构建一个高容量的统一基础模型以实现联合检测与定位。为应对这些挑战,我们提出了DeFakerOne,一个以数据为中心的统一FIDL基础模型,集成了InternVL2和SAM2。DeFakerOne能够在多种场景下同时进行图像级检测和像素级伪造定位。大量实验表明,DeFakerOne达到了最先进的性能,在39个伪造检测基准和9个定位基准上均优于基线。此外,该模型对真实世界扰动和最先进的生成器(如GPT-Image-2)表现出卓越的鲁棒性。最后,我们系统分析了数据缩放规律、跨域伪影迁移-干扰模式、细粒度监督的必要性以及原始分辨率伪影保留,突显了可扩展、鲁棒且统一的FIDL的设计原则。

英文摘要

In recent years, the rapid evolution of generative AI has fundamentally reshaped the paradigm of image forgery, breaking the traditional boundaries between document editing, natural image manipulation, DeepFake generation, and full-image AIGC synthesis. Despite this shift toward unified forgery generation, existing research in Fake Image Detection and Localization (FIDL) remains fragmented. This creates a mismatch between increasingly unified forgery generation mechanisms and the domain-specific detection paradigm. Bridging this mismatch poses two key challenges for FIDL: understanding cross-domain artifact transfer and interference, and building a high-capacity unified foundation model for joint detection and localization. To address these challenges, we propose DeFakerOne, a data-centric, unified FIDL foundation model integrating InternVL2 and SAM2. DeFakerOne enables simultaneous image-level detection and pixel-level forgery localization across diverse scenarios. Extensive experiments demonstrate that DeFakerOne achieves state-of-the-art performance, outperforming baselines on 39 forgery detection benchmarks and 9 localization benchmarks. Furthermore, the model exhibits superior robustness against real-world perturbations and state-of-the-art generators such as GPT-Image-2. Finally, we provide a systematic analysis of data scaling laws, cross-domain artifact transfer-interference patterns, the necessity of fine-grained supervision, and the preservation of original-resolution artifacts, highlighting the design principles for scalable, robust, and unified FIDL.

2605.14054 2026-06-04 cs.AI cs.CV 版本更新

Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning

Haozhe Wang, Qixin Xu, Changpeng Wang, Taofeng Xue, Chong Peng, Wenhu Chen, Fangzhen Lin

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种基于强化学习的模态感知信用分配框架(MoCA),通过感知验证和结构化口头验证解决视觉语言模型中感知与推理的权衡问题,实现多任务性能提升。

Comments Accepted by ICML 2026 as Oral

AI中文摘要

实现稳健的感知-推理协同是高级视觉语言模型(VLM)的核心目标。最近的进展通过架构设计或智能体工作流追求这一目标。然而,这些方法通常受限于静态文本推理,或因外部智能体复杂性的巨大计算和工程负担而变得复杂。更糟糕的是,这种大量投入并未带来成比例的性能提升,常常在感知和推理上观察到“跷跷板效应”。这促使我们从根本上重新思考真正的瓶颈。在本文中,我们认为这种权衡的根本原因是模态信用分配中的模糊性:当VLM失败时,是由于感知缺陷(“坏视力”)还是逻辑缺陷(“坏思维”)?为解决这一问题,我们引入了一个强化学习框架,通过可靠地奖励感知保真度来改善感知-推理协同。我们明确地将生成过程分解为交错的感知和推理步骤。这种解耦使得能够对感知进行有针对性的监督。关键的是,我们引入了感知验证(PV),利用“盲推理”代理独立于推理结果奖励感知保真度。此外,为了在自由形式的VL任务中扩展训练,我们提出了结构化口头验证(Structured Verbal Verification),用结构化的算法执行替代高方差的LLM评判。这些技术被整合到模态感知信用分配(MoCA)机制中,该机制将奖励路由到特定的错误源——无论是坏视力还是坏思维——使单个VLM能够在广泛的任务谱系上同时获得性能提升。

英文摘要

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

2605.13672 2026-06-04 cs.CV cs.AI cs.LG 版本更新

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

SpurAudio: 用于研究少样本音频分类中捷径学习的基准

Giries Abu Ayoub, Morad Tukan, Loay Mualem

发表机构 * Department of Computer Science, University of Haifa(海法大学计算机科学系) Independent Researcher(独立研究者) University of Stuttgart, Germany(斯图加特大学,德国) IMPRS-IS, Germany(智能系统国际Max Planck研究学校,德国)

AI总结 提出SpurAudio基准,通过控制音频中前景与背景的关联,评估少样本分类模型对虚假相关性的敏感性,发现现有方法在背景变化时性能显著下降。

AI中文摘要

少样本分类(FSC)广泛用于从有限标注数据中学习,但大多数评估隐含假设目标概念与上下文线索无关。然而,在现实场景中,样本通常出现在丰富的上下文中,允许模型利用前景内容与背景信号之间的虚假相关性。虽然这种效应已在少样本图像分类中得到研究,但其在少样本音频分类中的作用在很大程度上仍未被探索,且现有音频基准对上下文结构的控制有限。我们引入了SpurAudio,一个利用音频中前景事件和背景环境的自然可分离性,以支持对支持集和查询集之间的上下文偏移进行可控、多级评估的基准。使用该基准,我们表明许多最先进的少样本方法在背景相关性被破坏时遭受严重的性能下降,尽管在标准评估协议下达到相似的准确率。关键的是,即使在大型预训练音频基础模型中,这种脆弱性仍然存在,排除了骨干网络容量不足的解释。此外,在传统基准下看似相当的方法可能对虚假相关性表现出显著不同的敏感性,揭示了与特征表示在推理时如何与分类器头交互相关的系统性算法优势和脆弱性。这些发现为音频中少样本方法的行为提供了新的见解,并强调了在评估FSC模型时需要明确探测上下文依赖性的基准。

英文摘要

Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.

2304.10891 2026-06-04 cs.LG cs.AI cs.CV cs.RO cs.SY eess.SY 版本更新

Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey

基于Transformer的自动驾驶模型与面向部署的压缩:综述

Juan Zhong, Yuhang Shi, Zukang Xu, Xi Chen

发表机构 * Renmin University of China(中国人民大学) Artificial Intelligence Innovation and Incubation Institute, Fudan University(复旦大学人工智能创新与孵化院) Shanghai Academy of AI for Science(上海人工智能科学研究院) Department of houmo.ai(houmo.ai部门)

AI总结 本文综述了基于Transformer的自动驾驶模型,并从部署角度分析了压缩与加速策略(如量化、剪枝、知识蒸馏等)如何影响模型设计、部署性、鲁棒性和安全性。

AI中文摘要

基于Transformer的模型正成为自动驾驶的核心范式,因为它们能够捕捉感知、预测和规划中的长程空间依赖、多智能体交互和多模态上下文。然而,它们在真实车辆中的部署仍然困难,因为高容量注意力架构带来了显著的延迟、内存和能量开销。本综述回顾了具有代表性的基于Transformer的自动驾驶模型,并按任务角色、感知配置和架构设计进行组织。更重要的是,我们从面向部署的角度审视这些模型,分析效率约束如何在实际中重塑模型设计选择。我们进一步回顾了与基于Transformer的驾驶系统相关的压缩和加速策略,包括量化、剪枝、知识蒸馏、低秩近似和高效注意力,并讨论了它们的优势、局限性和任务依赖性。我们不将压缩视为孤立的后期处理步骤,而是强调其作为直接影响部署性、鲁棒性和安全性的系统级设计考虑。最后,我们指出了面向标准化、安全感知和硬件感知的高效自动驾驶系统评估的开放挑战和未来研究方向。

英文摘要

Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.
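
Of the compression strategies listed above, quantization is the easiest to show in a few lines. The following is a minimal sketch of symmetric per-tensor int8 quantization, assuming a plain list of float weights; the function names are hypothetical and the survey does not prescribe this particular scheme.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


# Toy weight vector, purely illustrative.
weights = [0.12, -0.5, 0.33, 0.0, 0.49]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The round-trip error of each weight is bounded by half the scale, which is why a single large outlier weight (it sets the scale) can degrade precision for the whole tensor, one of the task-dependent trade-offs such surveys discuss.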

2602.22779 2026-06-04 cs.CV 版本更新

TrajTok: Learning Trajectory Tokens enables better Video Understanding

TrajTok: 学习轨迹令牌以实现更好的视频理解

Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna

发表机构 * University of Washington(华盛顿大学) Allen Institute for Artificial Intelligence(艾伦人工智能研究所) Apple(苹果公司) Woven by Toyota, Inc(Woven by Toyota公司)

AI总结 提出TrajTok,一种端到端视频令牌化模块,通过隐式时空聚类生成对象轨迹令牌,提升视频理解效率与性能。

Comments CVPR 2026

AI中文摘要

视频模型中的令牌化通常通过分块(patchification)进行,产生过多且冗余的令牌,严重限制了视频的效率和可扩展性。虽然最近的基于轨迹的令牌化器通过将视频时长与令牌数量解耦提供了有前景的解决方案,但它们依赖于复杂的外部分割和跟踪流水线,速度慢且任务无关。我们提出TrajTok,一个端到端的视频令牌化器模块,完全集成并与视频模型共同训练以服务于下游目标,根据语义复杂度动态调整令牌粒度,独立于视频时长。TrajTok包含一个统一的分割器,在空间和时间上对像素进行隐式聚类,直接在一次前向传播中生成对象轨迹。通过优先考虑下游适应性而非像素完美的分割保真度,TrajTok轻量且高效,同时经验上提升了视频理解性能。利用TrajTok,我们实现了一个从头训练的视频CLIP模型(TrajViT2)。它在分类和检索基准上均实现了大规模的最佳精度,同时保持了与最佳令牌合并方法相当的高效率。TrajTok也证明了其作为令牌化器之外的多功能组件。我们表明,它可以无缝集成作为预训练视觉特征的探测头(TrajAdapter)或视觉-语言模型中的对齐连接器(TrajVLM),尤其在长视频推理中表现出色。

英文摘要

Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.

2605.06637 2026-06-04 cs.CV 版本更新

DPM++: Dynamic Masked Metric Learning for Occluded Person Re-identification

DPM++:用于遮挡行人重识别的动态掩码度量学习

Lei Tan, Yingshi Luan, Pincong Zou, Pingyang Dai, Liujuan Cao

发表机构 * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University(中国教育部多媒体可信感知与高效计算重点实验室,厦门大学)

AI总结 提出DPM++动态掩码度量学习框架,通过自适应掩码选择可靠身份子空间,结合CLIP两阶段监督和显著性引导的补丁转移策略,在遮挡和整体场景下均达到最优性能。

AI中文摘要

尽管行人重识别取得了显著进展,但障碍物造成的遮挡在实际应用中仍是一个未解决的问题。困难在于不完整的遮挡样本与整体身份表示之间的不匹配。严重遮挡会移除判别性身体线索并引入背景杂波和遮挡物的干扰,使得全局度量学习不可靠。现有方法主要依赖额外的预训练模型来估计可见部分以进行对齐,或通过数据增强构建遮挡样本,但仍缺乏一个统一的框架来学习在真实遮挡模式下鲁棒的可见性一致匹配。本文提出了DPM++,一种用于遮挡行人重识别的动态掩码度量学习框架。DPM++学习一种输入自适应的掩码度量,动态地为每个遮挡实例选择可靠的身份子空间,使匹配能够强调可见性一致的证据,同时抑制不可靠的组件。基于分类器-原型空间,DPM++引入了基于CLIP的两阶段监督方案,其中ID级语义先验从文本分支学习并转移到分类器-原型空间中进行动态掩码匹配。为了增强掩码度量,我们引入了一种显著性引导的补丁转移策略,在训练过程中合成可控且逼真的遮挡样本。利用真实场景先验,该策略使模型暴露于真实的部分观察中,并提供比随机擦除更丰富的监督。此外,遮挡感知的样本配对和掩码引导优化提高了框架的稳定性和有效性。在遮挡和整体行人重识别基准上的实验表明,DPM++在整体和遮挡场景中均持续优于先前的最先进方法。

英文摘要

Although person re-identification has made impressive progress, occlusion caused by obstacles remains an unsettled issue in real applications. The difficulty lies in the mismatch between incomplete occluded samples and holistic identity representations. Severe occlusion removes discriminative body cues and introduces interference from background clutter and occluders, making global metric learning unreliable. Existing methods mainly rely on extra pre-trained models to estimate visible parts for alignment or construct occluded samples via data augmentation, but still lack a unified framework that learns robust visibility-consistent matching under realistic occlusion patterns. In this paper, we propose DPM++, a Dynamic Masked Metric Learning framework for occluded person re-identification. DPM++ learns an input-adaptive masked metric that dynamically selects reliable identity subspaces for each occluded instance, enabling matching to emphasize visibility-consistent evidence while suppressing unreliable components. Built upon the classifier-prototype space, DPM++ introduces a CLIP-based two-stage supervision scheme, where ID-level semantic priors are learned from the text branch and transferred into the classifier-prototype space for dynamic masked matching. To strengthen the masked metric, we introduce a saliency-guided patch transfer strategy to synthesize controllable and photo-realistic occluded samples during training. Exploiting real scene priors, this strategy exposes the model to realistic partial observations and provides richer supervision than random erasing. In addition, occlusion-aware sample pairing and mask-guided optimization improve the stability and effectiveness of the framework. Experiments on occluded and holistic person re-identification benchmarks show that DPM++ consistently outperforms previous state-of-the-art methods in both holistic and occlusion scenarios.

2605.00242 2026-06-04 cs.CV cs.AI 版本更新

MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video

MAEPose: 基于毫米波视频的人体姿态估计的自监督时空学习

Xijia Wei, Yuan Fang, Kevin Chetty, Youngjun Cho, Nadia Bianchi-Berthouze

发表机构 * University College London(伦敦大学学院)

AI总结 提出MAEPose,一种直接处理毫米波频谱视频的掩码自编码方法,通过自监督时空学习实现鲁棒的人体姿态估计,在三个数据集上优于现有方法。

详情
AI中文摘要

毫米波雷达为基于RGB的人体姿态估计提供了一种更具隐私保护性的替代方案。然而,现有方法通常依赖预提取的中间表示,如稀疏点云或频谱图图像,这些方法丢弃了雷达视频流中自然存在的丰富时空信息用于模型学习,同时此类信号处理增加了系统复杂性。此外,现有解决方案主要采用端到端监督方式,未利用未标记的原始视频流来学习通用表示。在本研究中,我们提出MAEPose,一种基于掩码自编码的人体姿态估计方法,直接处理毫米波频谱视频。MAEPose从未标记的雷达视频中学习时空运动感知的通用表示,并利用其热图解码器进行多帧姿态估计预测。我们基于留一法交叉验证和严格的统计检验,在三个数据集上对其进行评估。MAEPose在MPJPE指标上始终优于最先进的基线方法,最高提升22.1%(p<0.05),并且在零样本旁观者干扰下保持鲁棒精度,误差仅增加6.5%。消融研究证实,预训练和热图解码器均有显著贡献,而模态分析表明,使用距离-多普勒视频作为输入比距离-方位角或其融合能实现更好的姿态估计性能,且计算成本更低。

英文摘要

Millimetre-wave (mmWave) radar offers a more privacy-preserving alternative to RGB-based human pose estimation. However, existing methods typically rely on pre-extracted intermediate representations such as sparse point clouds or spectrogram images, where the rich spatiotemporal information naturally present in radar video streams is discarded for model learning, while such signal processing adds system complexity. In addition, existing solutions are mainly conducted in an end-to-end supervised manner without leveraging unlabelled raw video streams to learn generalized representations. In this study, we present MAEPose, a masked autoencoding-based human pose estimation approach that operates directly on mmWave spectrogram videos. MAEPose learns spatiotemporal motion-aware generalized representations from unlabelled radar video, and leverages its heatmap decoder for multi-frame pose estimation predictions. We evaluate it across three datasets based on leave-one-person-out cross-validation with rigorous statistical testing. MAEPose consistently outperforms state-of-the-art baselines by up to 22.1% in MPJPE (p<0.05), and maintains robust accuracy under zero-shot bystander interference with only a 6.5% error increase. Ablation studies confirm that both the pre-training and the heatmap decoder contribute substantially, while modality analysis indicates that leveraging Range-Doppler video as input achieves better pose estimation performance than Range-Azimuth or their fusion, with lower computational cost.
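For readers unfamiliar with the metric quoted above, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions. The following is a minimal sketch, assuming a pose is a list of (x, y, z) tuples; the function name and the toy values are illustrative and do not come from the paper.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error between two poses.

    pred, target: equal-length lists of (x, y, z) joint coordinates.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Euclidean distance per joint, averaged over all joints.
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

# Toy example (made-up values): two joints, off by 3 and 4 units.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(3.0, 0.0, 0.0), (1.0, 4.0, 0.0)]
error = mpjpe(pred, target)  # (3 + 4) / 2 = 3.5
```

A lower MPJPE is better, so the reported 22.1% improvement corresponds to a reduction of this quantity relative to the baselines.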

2604.28173 2026-06-04 cs.CV 版本更新

Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements

动作基元:人体运动的自监督层次化表示

Genki Kinoshita, Shu Nakamura, Ryo Kawahara, Shohei Nobuhara, Yasutomo Kawanishi, Ko Nishino

发表机构 * Kyoto University(京都大学) Kyoto Institute of Technology(京都理工学院) RIKEN(理化学研究所)

AI总结 提出一种层次化表示方法,通过自监督学习从人体姿态数据中提取动作原子和动作基元,用于动作识别、运动预测和运动插值等任务。

Comments Accepted as Highlight at CVPR 2026. Project page: https://vision.ist.i.kyoto-u.ac.jp/research/action-motifs/

详情
AI中文摘要

有效的人类行为建模需要一种能够利用其组合性的人体运动表示。我们提出了一种层次化表示,包括捕获原子关节运动的动作原子和由它们的时间组合形成的动作基元,这些基元编码了在不同整体人类动作中发现的相似身体运动。我们推导出A4Mer,一种嵌套的潜在Transformer,以完全自监督的方式从人体姿态数据中学习这种层次化表示。A4Mer将3D姿态序列分割成可变长度的片段,并将每个片段表示为单个潜在令牌(动作原子)。通过自底向上的表示学习,由这些动作原子组成的时间模式自然出现(动作基元),这些模式捕获了可重复的、语义化的身体运动片段的有意义时间跨度。A4Mer通过在其各自的潜在空间中进行掩码令牌预测的统一预训练任务来实现这一点。我们还引入了动作基元数据集(AMD),这是一个大规模的多视角人类行为视频数据集,具有完整的SMPL注释。我们引入了一种新颖的相机使用方式,将其安装在脚上,以在频繁且严重的身体遮挡情况下实现逐帧注释。实验结果证明了A4Mer在提取有意义的动作基元方面的有效性,这些基元显著有益于人类行为建模任务,包括动作识别、运动预测和运动插值。

英文摘要

Effective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchical representation consisting of Action Atoms that capture the atomic joint movements and Action Motifs that are formed by their temporal compositions and encode similar body movements found across different overall human actions. We derive A4Mer, a nested latent Transformer to learn this hierarchical representation from human pose data in a fully self-supervised manner. A4Mer splits a 3D pose sequence into variable-length segments and represents each segment as a single latent token (Action Atoms). Through bottom-up representation learning, temporal patterns composed of these Action Atoms, which capture meaningful temporal spans of reusable, semantic segments of body movements, naturally emerge (Action Motifs). A4Mer achieves this with a unified pretext task of masked token prediction in their respective latent spaces. We also introduce Action Motif Dataset (AMD), a large-scale dataset of multi-view human behavior videos with full SMPL annotations. We introduce a novel use of cameras by mounting them on the feet to achieve their frame-wise annotations despite frequent and heavy body occlusions. Experimental results demonstrate the effectiveness of A4Mer for extracting meaningful Action Motifs, which significantly benefit human behavior modeling tasks including action recognition, motion prediction, and motion interpolation.

2604.09686 2026-06-04 cs.AI cs.CV 版本更新

Belief-Aware VLM Model for Human-like Reasoning

信念感知的VLM模型用于类人推理

Anshul Nayak, Shahil Shaik, Yue Wang

发表机构 * Mechanical Engineering Department, Clemson University(克莱姆森大学机械工程系)

AI总结 提出一种信念感知的视觉语言模型框架,通过检索式记忆和强化学习近似信念,提升长时程意图推理能力,在HD-EPIC等数据集上优于零样本基线。

Comments Accepted for publication at the IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026). 6 pages, 3 figures, 1 table

详情
AI中文摘要

传统的意图推理神经网络模型严重依赖可观测状态,难以泛化到多样化的任务和动态环境。视觉语言模型(VLM)和视觉语言动作(VLA)模型的最新进展通过大规模多模态预训练引入了常识推理,实现了跨任务的零样本性能。然而,这些模型仍然缺乏显式的信念表示和更新机制,限制了其像人类一样推理或捕捉长时程中不断演变的人类意图的能力。为了解决这个问题,我们提出了一个信念感知的VLM框架,集成了基于检索的记忆和强化学习。我们不学习显式的信念模型,而是使用基于向量的记忆来近似信念,该记忆检索相关的多模态上下文,并将其纳入VLM进行推理。我们进一步通过在VLM潜在空间上使用强化学习策略来优化决策。我们在公开可用的VQA数据集(如HD-EPIC)上评估了我们的方法,并展示了相对于零样本基线的持续改进,突出了信念感知推理的重要性。

英文摘要

Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or capture evolving human intent over long horizons. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.

2602.00104 2026-06-04 cs.CV cs.AI 版本更新

R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation

R3G: 一种面向以视觉为中心的答案生成的推理-检索-重排序框架

Zhuohong Chen, Zhengxian Wu, Zirui Liao, Shenao Jiang, Hangrui Xu, Yang Chen, Chaokui Su, Xiaoyu Liu, Haoqian Wang

发表机构 * The Shenzhen International Graduate School, Tsinghua University, China(清华大学深圳国际研究生院) State Key Laboratory of Nuclear Power Safety Technology and Equipment, China(核能安全技术与装备国家重点实验室) School of Computer Science and Information Engineering, Hefei University of Technology, China(合肥工业大学计算机科学与信息工程学院)

AI总结 提出R3G框架,通过先制定推理计划指定所需视觉线索,再采用粗检索加细粒度重排序的两阶段策略选择证据图像,在MRAG-Bench上提升六种多模态大语言模型在九个子场景中的准确率,实现整体最优性能。

详情
AI中文摘要

以视觉为中心的VQA检索需要检索图像以补充缺失的视觉线索,并将其整合到推理过程中。然而,选择正确的图像并将其有效整合到模型的推理中仍然具有挑战性。为了解决这一挑战,我们提出了R3G,一个模块化的推理-检索-重排序框架。它首先生成一个简要的推理计划,指定所需的视觉线索,然后采用两阶段策略,先进行粗检索,再进行细粒度重排序,以选择证据图像。在MRAG-Bench上,R3G在六个多模态大语言模型骨干和九个子场景中提高了准确率,实现了整体最优性能。消融实验表明,充分性感知的重排序和推理步骤是互补的,有助于模型既选择正确的图像又充分利用它们。我们在https://github.com/czh24/R3G发布代码和数据。

英文摘要

Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remains challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.
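The two-stage selection described in this abstract (coarse retrieval, then fine-grained reranking) follows a common pattern that can be sketched as below. This is not the R3G implementation: the cosine-similarity coarse stage, the `rerank_score` callable, and all toy data are assumptions made for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_then_rerank(query_vec, candidates, rerank_score, coarse_k=3, final_k=1):
    """candidates: list of (image_id, embedding) pairs.
    rerank_score: hypothetical callable image_id -> float, standing in for
    a finer (and more expensive) relevance scorer.
    """
    # Stage 1: cheap coarse retrieval by embedding similarity.
    coarse = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)[:coarse_k]
    # Stage 2: fine-grained reranking, applied to the shortlist only.
    reranked = sorted(coarse, key=lambda c: rerank_score(c[0]), reverse=True)
    return [image_id for image_id, _ in reranked[:final_k]]

# Toy data (made up): "c" has the best rerank score but is dropped by stage 1.
candidates = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0]), ("d", [0.7, 0.7])]
toy_scores = {"a": 0.2, "b": 0.9, "c": 1.0, "d": 0.5}
top = retrieve_then_rerank([1.0, 0.0], candidates, toy_scores.get, coarse_k=2)
```

The toy run illustrates the trade-off of such pipelines: the expensive scorer only sees the shortlist, so a candidate missed by the coarse stage cannot be recovered later.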

2603.28762 2026-06-04 cs.CV cs.AI cs.GR cs.LG 版本更新

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

上下文空间中的即时排斥以实现扩散变换器的丰富多样性

Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or

发表机构 * Tel Aviv University(特拉维夫大学) Snap Research Israel(Snap以色列研究)

AI总结 针对文本到图像扩散模型多样性不足的问题,提出在扩散变换器的上下文空间中通过多模态注意力通道施加即时排斥,在不牺牲视觉保真度和语义一致性的前提下显著提升生成多样性,且计算开销小,适用于现代Turbo和蒸馏模型。

Comments SIGGRAPH 2026. Project page: https://contextual-repulsion.github.io/

详情
AI中文摘要

现代文本到图像(T2I)扩散模型在语义对齐方面取得了显著进展,但通常缺乏多样性,倾向于为任何给定提示收敛到狭窄的视觉解决方案集。这种典型性偏差对需要广泛生成结果的创意应用构成了挑战。我们识别出当前多样性方法中的一个基本权衡:修改模型输入需要昂贵的优化来整合生成路径的反馈。相反,对空间上已承诺的中间潜变量进行操作往往会破坏正在形成的视觉结构,导致伪影。在这项工作中,我们提出在上下文空间中应用排斥作为一种新颖的框架,以实现扩散变换器的丰富多样性。通过干预多模态注意力通道,我们在变换器的前向传播过程中施加即时排斥,在文本条件被新兴图像结构丰富后的块之间注入干预。这允许在结构信息形成后但构图固定之前重定向引导轨迹。我们的结果表明,上下文空间中的排斥在不牺牲视觉保真度或语义一致性的情况下产生了显著更丰富的多样性。此外,我们的方法非常高效,计算开销小,即使在现代“Turbo”和蒸馏模型中也有效,而传统的基于轨迹的干预在这些模型中通常会失败。

英文摘要

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.

2512.14177 2026-06-04 cs.CV 版本更新

Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes

利用语义高斯过程改进LVLM中的语义不确定性量化

Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov, Angela Yao, Gianni Franchi

发表机构 * AMIAD, Pôle Recherche, Palaiseau(AMIAD研究部,Palaiseau) valeo.ai Safran Tech University of Liège(列日大学) New York University(纽约大学) National University of Singapore(新加坡国立大学) ENSTA Paris(巴黎ENSTA)

AI总结 提出语义高斯过程不确定性(SGPU)框架,通过分析答案嵌入的几何结构来量化语义不确定性,避免脆弱的聚类方法,在多个模型和数据集上实现了最先进的校准和判别性能。

详情
AI中文摘要

大型视觉语言模型(LVLM)经常产生看似合理但不可靠的输出,因此鲁棒的不确定性估计至关重要。最近的语义不确定性估计工作依赖于外部模型对多个采样响应进行聚类并测量其语义一致性。然而,这些聚类方法通常脆弱,对微小的措辞变化高度敏感,并且可能错误地分组或分离语义相似的答案,导致不可靠的不确定性估计。我们提出了语义高斯过程不确定性(SGPU),这是一个贝叶斯框架,通过分析答案嵌入的几何结构来量化语义不确定性,避免了脆弱的聚类。SGPU将生成的答案映射到密集的语义空间,计算其嵌入的Gram矩阵,并通过特征谱总结其语义配置。然后将这种谱表示输入到高斯过程分类器中,该分类器学习将语义一致性模式映射到预测不确定性,并且可以在黑盒和白盒设置中应用。在跨越VQA、图像分类和文本QA的八个数据集上的六个LLM和LVLM中,SGPU始终实现了最先进的校准(ECE)和判别(AUROC、AUARC)性能。我们进一步表明,SGPU可以跨模型和模态迁移,表明其谱表示捕捉了语义不确定性的通用模式。

英文摘要

Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.
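The spectral step mentioned in this abstract (Gram matrix of answer embeddings, summarized by its eigenspectrum) can be illustrated with a short NumPy sketch. The L2 normalisation, the function name, and the toy embeddings are assumptions; the Gaussian Process Classifier that consumes the spectrum in SGPU is not reproduced here.

```python
import numpy as np

def gram_eigenspectrum(embeddings):
    """Eigenvalues (descending) of the Gram matrix of L2-normalised embeddings."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-norm rows (assumption)
    G = E @ E.T                                       # pairwise cosine similarities
    eig = np.linalg.eigvalsh(G)                       # symmetric input, ascending order
    return eig[::-1]

# Toy embeddings (made up): identical answers vs. mutually orthogonal answers.
consistent = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
spec_consistent = gram_eigenspectrum(consistent)  # mass concentrated in one eigenvalue
spec_spread = gram_eigenspectrum(spread)          # mass spread evenly
```

In this toy setting, semantically consistent answers give a spectrum dominated by one eigenvalue, while disagreeing answers give a flat spectrum, which is the kind of pattern a downstream classifier could map to uncertainty.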

2603.22121 2026-06-04 cs.CV cs.AI 版本更新

GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval

GenSpan: 用于多动词视频语料库时刻检索的生成校准运动跨度先验

Yunzhuo Sun, Xinyue Liu, Yanyang Li, Nanding Wu, Linlin Zong, Xianchao Zhang, Wenxin Liang

发表机构 * Dalian University of Technology(大连理工大学)

AI总结 提出GenSpan框架,利用LLM生成辅助视频作为时间先验,结合令牌选择器和双向状态空间模型,提升多动词查询下的视频语料库时刻检索与定位性能。

Comments Major revision with title change, updated method, and additional experiments

详情
AI中文摘要

视频语料库时刻检索(VCMR)旨在检索与自然语言查询对应的正确视频及其时间片段,对于时间动作顺序至关重要的多动词查询尤其具有挑战性。现有方法通常仅依赖文本或静态图像,难以捕捉隐式运动动态,导致检索错误和时间错位。我们提出GenSpan,一个生成校准的VCMR框架,从LLM选择的字幕线索和分解的子事件中构建短辅助视频,将这些作为时间先验而非直接检索目标。令牌选择器过滤与生成运动对齐的候选视频特征,双向状态空间模型高效预测视频-时刻元组。在TVR和ActivityNet-Captions上的实验表明,GenSpan提高了语料库级检索和时刻定位,特别是对于复杂的多动作查询,同时与最先进的多模态基线相比降低了计算成本。

英文摘要

Video Corpus Moment Retrieval (VCMR) aims to retrieve both the correct video and its temporal segment corresponding to a natural-language query, a task that is especially challenging for multi-verb queries where temporal action ordering is critical. Existing approaches often rely solely on text or static images and struggle to capture implicit motion dynamics, leading to retrieval errors and temporal misalignment. We propose GenSpan, a generation-calibrated VCMR framework that constructs short auxiliary videos from LLM-selected subtitle cues and decomposed sub-events, using these as temporal priors rather than direct retrieval targets. A token selector filters candidate-video features aligned with generated motion, and a bidirectional state-space model efficiently predicts video-moment tuples. Experiments on TVR and ActivityNet-Captions demonstrate that GenSpan improves corpus-level retrieval and moment localization, particularly for complex multi-action queries, while reducing computational cost compared to state-of-the-art multimodal baselines.

2603.13432 2026-06-04 cs.CV cs.AI 版本更新

Spatial Transcriptomics as Images for Large-Scale Pretraining

空间转录组学作为图像进行大规模预训练

Yishun Zhu, Jiaxin Qi, Jian Wang, Yuhua Zheng, Jianqiang Huang

发表机构 * Computer Network Information Center, Chinese Academy of Sciences(中国科学院计算机网络信息中心) Hangzhou Institute for Advanced Study, University of the Chinese Academy of Sciences(中国科学院大学杭州高等研究院)

AI总结 提出将空间转录组学数据视为可裁剪的多通道图像,通过空间分块和基因子集选择来增加训练样本并保留空间上下文,实现大规模预训练,显著提升下游任务性能。

详情
AI中文摘要

空间转录组学(ST)在组织切片上具有精确坐标的离散点处分析数千个基因表达值,保留了临床和病理研究所需的空间背景。随着测序通量的提高和平台的进步,不断增长的数据量促使大规模ST预训练成为可能。然而,预训练的基本单元(即单个训练样本的构成)仍然不明确。现有选择分为两类:(1)将每个点视为独立样本,这丢弃了空间依赖性,将ST简化为单细胞转录组学;(2)将整个切片视为单个样本,这导致输入过大且训练样本急剧减少,削弱了有效预训练。为解决这一问题,我们提出将空间转录组学视为可裁剪的图像。具体而言,我们通过从原始切片中裁剪补丁,定义了一个具有固定空间大小的多通道图像表示,从而在保留空间上下文的同时大幅增加训练样本数量。在通道维度上,我们定义了基因子集选择规则以控制输入维度并提高预训练稳定性。大量实验表明,所提出的基于图像的数据集构建方法用于ST预训练能够持续提升下游性能,优于传统预训练方案。消融研究验证了空间分块和通道设计都是必要的,从而建立了一种统一、实用的ST数据组织范式,支持大规模预训练。

英文摘要

Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, preserving spatial context essential for clinical and pathological studies. With rising sequencing throughput and advancing platforms, the expanding data volumes motivate large-scale ST pretraining. However, the fundamental unit for pretraining, i.e., what constitutes a single training sample, remains ill-posed. Existing choices fall into two camps: (1) treating each spot as an independent sample, which discards spatial dependencies and collapses ST into single-cell transcriptomics; and (2) treating an entire slide as a single sample, which produces prohibitively large inputs and drastically fewer training examples, undermining effective pretraining. To address this gap, we propose treating spatial transcriptomics as croppable images. Specifically, we define a multi-channel image representation with fixed spatial size by cropping patches from raw slides, thereby preserving spatial context while substantially increasing the number of training samples. Along the channel dimension, we define gene subset selection rules to control input dimensionality and improve pretraining stability. Extensive experiments show that the proposed image-like dataset construction for ST pretraining consistently improves downstream performance, outperforming conventional pretraining schemes. Ablation studies verify that both spatial patching and channel design are necessary, establishing a unified, practical paradigm for organizing ST data and enabling large-scale pretraining.
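The "croppable image" idea in this abstract can be sketched as follows. The sketch assumes the spots already lie on a regular H x W grid with G genes per spot, which real ST platforms do not always provide; the function name, the non-overlapping cropping scheme, and the gene indices are illustrative and not taken from the paper.

```python
import numpy as np

def crop_patches(slide, gene_idx, patch=2):
    """slide: array of shape (H, W, G) with expression values on a spot grid.
    gene_idx: indices of the gene subset kept as image channels.
    Returns an array of shape (N, patch, patch, len(gene_idx)).
    """
    sub = slide[:, :, gene_idx]  # channel (gene) selection
    H, W, _ = sub.shape
    patches = [
        sub[r:r + patch, c:c + patch]
        for r in range(0, H - patch + 1, patch)  # non-overlapping rows
        for c in range(0, W - patch + 1, patch)  # non-overlapping columns
    ]
    return np.stack(patches)

# Toy slide (made up): 4 x 6 spots, 5 genes, two genes kept as channels.
slide = np.arange(4 * 6 * 5, dtype=float).reshape(4, 6, 5)
patches = crop_patches(slide, gene_idx=[0, 3], patch=2)  # shape (6, 2, 2, 2)
```

One slide yields several fixed-size training samples here, each of which still contains neighbouring spots, which is the balance between per-spot and per-slide samples that the abstract argues for.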

2603.20304 2026-06-04 cs.CV 版本更新

Transferable Multi-Bit Watermarking Across Frozen Diffusion Models via Latent Consistency Bridges

跨冻结扩散模型的可迁移多位水印:基于潜在一致性桥

Hong-Hanh Nguyen-Le, Van-Tuan Tran, Thuc D. Nguyen, Nhien-An Le-Khac

发表机构 * National Institute of Advanced Intelligence Science and Technology, Japan(日本国家先进人工智能科学与技术研究院)

AI总结 提出DiffMark,一种即插即用的多位水印框架,通过潜在一致性模型实现高效训练,在单个前向传播中提取64位水印,速度提升45倍,并支持跨架构迁移。

Comments Accepted in Second Workshop on Technical AI Governance Research (TAIGR) @ ICML 2026

详情
AI中文摘要

随着生成式人工智能的发展,全球治理框架越来越要求可验证的内容溯源。然而,现有水印技术面临关键的政策-技术脱节:基于采样的方法需要计算上不可行的逆过程,而微调方法则受限于特定模型检查点,阻碍了标准化、跨模型的监管。为弥合这一差距,我们引入了DiffMark,一种即插即用的多位水印框架。DiffMark将一种持久的、学习到的扰动嵌入到冻结扩散模型的每个去噪步骤中,在最终潜在空间中累积可恢复的信号。为了通过冻结网络实现高效训练,我们利用潜在一致性模型(LCM)作为可微的训练桥梁。DiffMark在单次16.4毫秒的前向传播中实现64位提取,比逆过程基线快45倍。通过实现每图像密钥灵活性和无需重新训练的跨架构可迁移性,DiffMark提供了实用、可扩展的技术工具,以实现用户问责并执行新兴的AI治理要求。

英文摘要

As generative AI advances, global governance frameworks increasingly mandate verifiable content provenance. However, existing watermarking techniques face a critical policy-to-technology disconnect: sampling-based methods require computationally prohibitive inversion, while fine-tuning approaches are tethered to specific model checkpoints, hindering standardized, cross-model oversight. To bridge this gap, we introduce DiffMark, a plug-and-play multi-bit watermarking framework. DiffMark embeds a persistent, learned perturbation into every denoising step of a frozen diffusion model, accumulating a recoverable signal in the final latent space. To enable efficient training through the frozen network, we utilize Latent Consistency Models (LCMs) as a differentiable training bridge. DiffMark achieves 64-bit extraction in a single 16.4 ms forward pass, which is a 45× speed-up over inversion baselines. By enabling per-image key flexibility and cross-architecture transferability without retraining, DiffMark provides the practical, scalable technical tooling necessary to operationalize user accountability and enforce emerging AI governance mandates.

2510.20042 2026-06-04 cs.CV 版本更新

Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models

揭示盲点:生成图像模型中的文化偏见评估

Huichan Seo, Sieun Choi, Minki Hong, Yi Zhou, Junseo Kim, Lukman Ismaila, Naome Etori, Mehul Agarwal, Zhixuan Liu, Jihie Kim, Jean Oh

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Dongguk University(东国大学) Delft University of Technology(代尔夫特理工大学) Johns Hopkins University, School of Medicine(约翰霍普金斯大学医学院) University of Minnesota–Twin Cities(明尼苏达大学双城分校) Lavoro AI

AI总结 提出一个统一的评估框架,结合自动指标、文化感知VQA和专家人工判断,对文本到图像和图像到图像生成模型进行跨国家、跨时代和跨类别的文化偏见评估,发现模型存在默认全球北方现代描绘、迭代编辑侵蚀文化保真度以及仅应用表面线索等问题。

Comments 28 pages, 8 figures. Accepted at IASEAI 2026. Huichan Seo, Sieun Choi, and Minki Hong contributed equally

详情
AI中文摘要

生成图像模型产生引人注目的视觉内容,但常常歪曲文化。先前的工作主要研究了文本到图像(T2I)系统中的文化偏见,而图像到图像(I2I)编辑器尚未得到充分探索。我们通过一个统一的评估来弥补这一差距,该评估涵盖六个国家、一个8类别/36子类别的模式以及时代感知提示,在标准化协议下审计T2I生成和I2I编辑,产生可比较的诊断结果。使用固定设置的开放模型,我们进行了跨国家、跨时代和跨类别的评估。我们的框架结合了标准自动指标、文化感知的检索增强VQA以及来自母语审阅者的专家人工判断。为了实现可重复性,我们发布了完整的图像语料库、提示和配置。我们的研究揭示了三个发现:(1)在无视国家的提示下,模型默认采用全球北方、现代倾向的描绘,抹平了国家间的差异;(2)迭代的I2I编辑侵蚀了文化保真度,即使传统指标保持平稳或改善;(3)I2I模型应用表面线索(色调变化、通用道具)而非时代一致、上下文感知的变化,通常对全球南方目标保留源身份。这些结果突显了当前系统中文化敏感的编辑仍然不可靠。通过发布标准化数据、提示和人工评估协议,我们提供了一个可重复的、以文化为中心的基准,用于诊断和跟踪生成图像模型中的文化偏见。项目页面:https://seochan99.github.io/ECB

英文摘要

Generative image models produce striking visuals yet often misrepresent culture. Prior work has examined cultural bias mainly in text-to-image (T2I) systems, leaving image-to-image (I2I) editors underexplored. We bridge this gap with a unified evaluation across six countries, an 8-category/36-subcategory schema, and era-aware prompts, auditing both T2I generation and I2I editing under a standardized protocol that yields comparable diagnostics. Using open models with fixed settings, we derive cross-country, cross-era, and cross-category evaluations. Our framework combines standard automatic metrics, a culture-aware retrieval-augmented VQA, and expert human judgments collected from native reviewers. To enable reproducibility, we release the complete image corpus, prompts, and configurations. Our study reveals three findings: (1) under country-agnostic prompts, models default to Global-North, modern-leaning depictions that flatten cross-country distinctions; (2) iterative I2I editing erodes cultural fidelity even when conventional metrics remain flat or improve; and (3) I2I models apply superficial cues (palette shifts, generic props) rather than era-consistent, context-aware changes, often retaining source identity for Global-South targets. These results highlight that culture-sensitive edits remain unreliable in current systems. By releasing standardized data, prompts, and human evaluation protocols, we provide a reproducible, culture-centered benchmark for diagnosing and tracking cultural bias in generative image models. Project page: https://seochan99.github.io/ECB

2603.12433 2026-06-04 cs.CV cs.AI cs.LG 版本更新

Revisiting Model Stitching In the Foundation Model Era

重新审视基础模型时代的模型拼接

Zheda Mai, Ke Zhang, Fu-En Wang, Zixiao Ken Wang, Albert Y. C. Chen, Lu Xia, Min Sun, Wei-Lun Chao, Cheng-Hao Kuo

发表机构 * The Ohio State University(俄亥俄州立大学) Boston University(波士顿大学) Amazon(亚马逊)

AI总结 本文通过系统协议研究视觉基础模型(如CLIP、DINOv2、SigLIP 2)的可拼接性,提出基于目标模型倒数第二层特征匹配损失的拼接方法,并构建VFM拼接树(VST)实现多模态大模型中多个VFM的准确率-延迟权衡。

Comments Accepted by CVPR 2026

详情
AI中文摘要

模型拼接通过一个轻量拼接层将一个模型(源)的早期层连接到另一个模型(目标)的后期层,作为表征兼容性的探针。先前工作发现,尽管初始化或目标不同,但基于同一数据集训练的模型仍然是可拼接的(准确率下降可忽略)。我们重新审视在目标、数据和模态组合(例如CLIP、DINOv2、SigLIP 2)上各异的视觉基础模型(VFM)的拼接,并提出问题:异构VFM是否可拼接?我们引入了一个系统协议,涵盖拼接点、拼接层家族、训练损失和下游任务。三个发现浮现:(1)拼接层训练至关重要:传统方法在拼接点匹配中间特征或端到端优化任务损失时难以保持准确率,尤其是在浅层拼接点。(2)通过在目标模型的倒数第二层使用简单的特征匹配损失,异构VFM在视觉任务上变得可靠可拼接。(3)对于深层拼接点,拼接模型可以超越任一组成模型,仅增加少量推理开销(用于拼接层)。基于这些发现,我们进一步提出VFM拼接树(VST),它在多个VFM之间共享早期层同时保留其后期层,为通常利用多个VFM的多模态大语言模型提供了可控的准确率-延迟权衡。综合来看,我们的研究将拼接从诊断探针提升为整合互补VFM优势并定位其表征对齐或分歧点的实用方法。

英文摘要

Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.

2512.14099 2026-06-04 cs.CV 版本更新

ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Discrete Diffusion Models

ViewMask-1-to-3: 通过多模态离散扩散模型实现多视图一致图像生成

Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 提出ViewMask-1-to-3,将多视图生成建模为离散序列预测问题,利用MAGVIT-v2视觉令牌和掩码令牌预测的离散扩散,通过迭代去掩码实现渐进式多视图生成,无需专门架构或3D几何先验,在GSO和3D-FUTURE基准上优于基线方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

受离散扩散在语言-视觉建模中成功的启发,我们探索了其在多视图生成中的潜力,而这一任务目前主要由连续方法主导。我们引入了ViewMask-1-to-3,将多视图生成建模为一个离散序列建模问题,其中每个视点被表示为来自MAGVIT-v2的视觉令牌。通过掩码令牌预测的离散扩散,我们的方法能够通过迭代令牌去掩码实现渐进式多视图生成,将语言和视觉统一在共享的令牌空间中。重要的是,简单的随机掩码结合自注意力自然地促进了跨视图一致性,无需专门的架构或3D几何先验。我们的方法在GSO和3D-FUTURE基准上优于基线,在标准图像指标上平均排名第一,并且在3D-FUTURE上相比连续扩散模型实现了10.6%更高的IoU。此外,所提出的框架可以自然地扩展到支持文本到图像生成和多模态理解,突显了其向更统一的多模态理解和生成范式发展的潜力。

英文摘要

Motivated by discrete diffusion's success in language-vision modeling, we explore its potential for multi-view generation, a task dominated by continuous approaches. We introduce ViewMask-1-to-3, formulating multi-view generation as a discrete sequence modeling problem where each viewpoint is represented as visual tokens from MAGVIT-v2. Through discrete diffusion via masked token prediction, our approach enables progressive multi-view generation via iterative token unmasking, unifying language and vision in a shared token space. Importantly, simple random masking combined with self-attention naturally encourages cross-view consistency without specialized architectures or 3D geometric priors. Our method outperforms the baseline on the GSO and 3D-FUTURE benchmarks, ranking first on average across standard image metrics, and achieving a 10.6% higher IoU than continuous diffusion models on 3D-FUTURE. Furthermore, the proposed framework can be naturally extended to support text-to-image generation and multimodal understanding, highlighting its potential toward a more unified paradigm for multimodal understanding and generation.

2603.09493 2026-06-04 cs.CV cs.AI 版本更新

EvoPrompt: Guided Prompt Evolution for Vision-Language Models Adaptation

EvoPrompt: 引导提示演化以适应视觉-语言模型

Enming Zhang, Jiayang Li, Yanlong Wang, Yanru Wu, Zhenyu Liu, Yang Li

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) Sun Yat-sen University(中山大学) Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出EvoPrompt框架,通过引导提示演化路径并解耦低秩更新为方向和幅度分量,实现视觉-语言模型在少样本学习中的遗忘-free微调,同时保持零样本能力。

详情
AI中文摘要

大规模视觉-语言模型(VLM)在有限标注数据下适应下游任务仍然是一个重大挑战。虽然参数高效的提示学习方法提供了一条有希望的路径,但它们常常遭受预训练知识的灾难性遗忘。为了解决这一限制,我们的工作基于一个洞察:控制提示的演化路径对于遗忘-free适应至关重要。为此,我们提出了EvoPrompt,一个旨在明确引导提示轨迹以进行知识保留微调的新型框架。具体来说,我们的方法采用模态共享提示投影器(MPP)从统一嵌入空间生成层次化提示。关键的是,一种演化训练策略将低秩更新解耦为方向和幅度分量,保留早期学习的语义方向而仅调整其幅度,从而使提示能够在不丢弃基础知识的情况下演化。这一过程通过特征几何正则化(FGR)进一步稳定,该正则化强制特征去相关以防止表示崩溃。大量实验表明,EvoPrompt在少样本学习中实现了最先进的性能,同时稳健地保留了预训练VLM的原始零样本能力。

英文摘要

The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they often suffer from catastrophic forgetting of pre-trained knowledge. Toward addressing this limitation, our work is grounded in the insight that governing the evolutionary path of prompts is essential for forgetting-free adaptation. To this end, we propose EvoPrompt, a novel framework designed to explicitly steer the prompt trajectory for knowledge-preserving fine-tuning. Specifically, our approach employs a Modality-Shared Prompt Projector (MPP) to generate hierarchical prompts from a unified embedding space. Critically, an evolutionary training strategy decouples low-rank updates into directional and magnitude components, preserving early-learned semantic directions while only adapting their magnitude, thus enabling prompts to evolve without discarding foundational knowledge. This process is further stabilized by Feature Geometric Regularization (FGR), which enforces feature decorrelation to prevent representation collapse. Extensive experiments demonstrate that EvoPrompt achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.
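The direction/magnitude decoupling mentioned in this abstract can be illustrated on a plain vector. This is only a toy sketch of the general idea (freeze the unit direction, adapt the scale); the function names are hypothetical, and the paper applies the idea to low-rank prompt updates rather than to a single vector.

```python
import math

def decompose(update):
    """Split a vector into (magnitude, unit direction)."""
    mag = math.sqrt(sum(x * x for x in update))
    if mag == 0.0:
        raise ValueError("zero vector has no direction")
    return mag, [x / mag for x in update]

def rescale(direction, new_magnitude):
    """Rebuild the update with a frozen direction and a new magnitude."""
    return [new_magnitude * d for d in direction]

mag, direction = decompose([3.0, 4.0])   # magnitude 5, direction (0.6, 0.8)
adapted = rescale(direction, 2.0 * mag)  # same direction, doubled magnitude
```

Under this parameterisation, training only the magnitude cannot rotate the update away from the direction learned earlier, which is the intuition behind the "forgetting-free" claim.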

2603.09242 2026-06-04 cs.CV 版本更新

When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection

当检测器遗忘取证:阻断语义捷径以实现可泛化的AI生成图像检测

Chao Shuai, Shaojing Fan, Chenlin Zou, Bin Gong, Weichen Lian, Xiuli Bi, Zhenguang Liu, Zhongjie Ba, Kui Ren

发表机构 * State Key Laboratory of Blockchain and Data Security, Zhejiang University(区块链与数据安全国家重点实验室,浙江大学) National University of Singapore(新加坡国立大学) Chongqing University of Posts and Telecommunications(重庆邮电大学)

AI总结 本文提出几何语义解耦(GSD)框架,通过抑制语义主导方向来促进不变取证表征,从而解决预训练视觉基础模型在AI生成图像检测中因语义回退导致的泛化不足问题。

详情
AI中文摘要

生成模型日益逼真,模糊了真实与合成内容之间的界限,给可靠的AI生成图像检测带来了重大挑战。尽管大规模预训练视觉基础模型提升了检测能力,但它们对来自未见生成管道的图像的泛化仍然不足。在本文中,我们首次识别出一个关键失败机制,称为语义回退,即取证微调未能完全重塑表征空间。因此,所得表征仍沿高层语义结构而非操作特定的取证线索组织。基于这一见解,我们提出了一个几何语义解耦(GSD)框架,该框架显式抑制语义主导方向,从而促进不变的取证表征。具体而言,GSD利用冻结的CLIP编码器通过奇异值分解(SVD)估计主导语义子空间。然后,通过几何约束公式抑制语义成分,并在样本和层间自适应调节抑制强度。我们进一步引入了一种小批量SVD近似策略,该策略分摊子空间估计,在保持有效性的同时实现了超过15倍的计算开销减少。最后,考虑到涵盖大规模和在线评估的实际场景,我们开发了三种推理协议:批量推理、逐样本推理和基于参考的推理,并证明它们能产生一致的语义解耦,从而形成稳定的面向伪造的特征流形。

英文摘要

The growing realism of generative models has blurred the boundary between real and synthetic content, posing significant challenges to reliable AI-generated image detection. Although large-scale pre-trained Vision Foundation Models have advanced detection capability, their generalization to images from unseen generation pipelines remains inadequate. In this paper, we identify, for the first time, a key failure mechanism, termed semantic fallback, wherein forensic fine-tuning fails to fully reshape the representation space. Consequently, the resulting representations remain organized along high-level semantic structures rather than manipulation-specific forensic cues. Building on this insight, we propose a Geometric Semantic Decoupling (GSD) framework, which explicitly suppresses semantically dominant directions, thereby promoting invariant forensic representations. Specifically, GSD leverages a frozen CLIP encoder to estimate the dominant semantic subspace via Singular Value Decomposition (SVD). It then suppresses the semantic components through a geometry-constrained formulation with the suppression strength adaptively modulated across samples and layers. We further introduce a mini-batch SVD approximation strategy that amortizes subspace estimation, achieving over a 15× reduction in computational overhead while preserving effectiveness. Finally, considering practical scenarios spanning both large-scale and online evaluation, we develop three inference protocols, batch, per-sample, and reference-based inference, and demonstrate that they induce consistent semantic decoupling, yielding a stable forgery-oriented feature manifold.
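The subspace suppression outlined in this abstract can be sketched with NumPy as a projection that removes the top-k singular directions estimated from reference features. This is a simplified illustration, not the GSD formulation: the mean-centring, the fixed `strength` scalar (the paper modulates it adaptively across samples and layers), the function name, and the toy data are all assumptions.

```python
import numpy as np

def suppress_dominant_subspace(features, reference, k=1, strength=1.0):
    """Remove a fraction of the components of `features` that lie in the
    top-k right-singular subspace of mean-centred `reference` features."""
    R = np.asarray(reference, dtype=float)
    R = R - R.mean(axis=0, keepdims=True)          # centre before SVD (assumption)
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    V = Vt[:k].T                                   # (D, k) orthonormal basis
    F = np.asarray(features, dtype=float)
    return F - strength * (F @ V) @ V.T            # project out the dominant directions

# Toy data (made up): the reference varies mostly along the first axis.
reference = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
out = suppress_dominant_subspace([[3.0, 1.0]], reference, k=1)
```

With `strength=1.0` the dominant direction is removed entirely; smaller values would only attenuate it, which is one simple way to picture a tunable suppression strength.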

2603.03482 2026-06-04 cs.CV cs.AI cs.LG 版本更新

Beyond Pixel Histories: World Models with Persistent 3D State

超越像素历史:具有持久3D状态的世界模型

Samuel Garcin, Thomas Walker, Steven McDonagh, Tim Pearce, Hakan Bilen, Tianyu He, Kaixin Wang, Jiang Bian

发表机构 * University of Edinburgh(爱丁堡大学) Microsoft Research(微软研究院)

AI总结 提出PERSIST范式,通过模拟潜在3D场景(环境、相机、渲染器)的演化,实现具有持久空间记忆和一致几何的世界模型,显著提升3D一致性、空间记忆和长期稳定性。

Comments Accepted to the International Conference on Machine Learning (ICML) 2026. To appear in the Proceedings of Machine Learning Research (PMLR). 9 pages

详情
AI中文摘要

交互式世界模型通过响应用户的动作持续生成视频,实现开放式的生成能力。然而,现有模型通常缺乏环境的3D表示,意味着3D一致性必须从数据中隐式学习,且空间记忆受限于有限的时域上下文窗口。这导致不真实的用户体验,并对训练智能体等下游任务构成重大障碍。为解决这一问题,我们提出PERSIST,一种新的世界模型范式,它模拟潜在3D场景(环境、相机和渲染器)的演化。这使得我们能够合成具有持久空间记忆和一致几何的新帧。定量指标和定性用户研究均表明,与现有方法相比,在空间记忆、3D一致性和长期稳定性方面有显著提升,从而实现连贯、演化的3D世界。我们进一步展示了新颖的能力,包括从单张图像合成多样化的3D环境,以及通过直接在3D空间中支持环境编辑和指定,实现对生成体验的细粒度、几何感知控制。项目页面:https://francelico.github.io/persist.github.io

英文摘要

Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesise new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, as well as enabling fine-grained, geometry-aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: https://francelico.github.io/persist.github.io

2603.02697 2026-06-04 cs.CV cs.AI 版本更新

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

ShareVerse:面向共享世界建模的多智能体一致视频生成

Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan

发表机构 * Shanghai Jiao Tong University, China(上海交通大学,中国) Fudan University, China(复旦大学,中国) StepFun, China(StepFun,中国)

AI总结 提出ShareVerse框架,通过构建多智能体交互数据集、空间拼接策略和跨智能体注意力机制,实现多智能体共享世界的一致视频生成。

详情
AI中文摘要

本文提出ShareVerse,一个视频生成框架,支持多智能体共享世界建模,解决了现有工作缺乏统一共享世界构建和多智能体交互支持的问题。ShareVerse利用大型视频模型的生成能力,并整合了三个关键创新:1)在CARLA仿真平台上构建了大规模多智能体交互世界建模数据集,包含多样场景、天气条件和交互轨迹,以及配对的每智能体四视角视频(前/后/左/右视图)和相机数据。2)我们提出了一种针对独立智能体四视角视频的空间拼接策略,以建模更广泛的环境并确保内部多视角几何一致性。3)我们将跨智能体注意力模块集成到预训练视频模型中,实现跨智能体时空信息的交互传递,保证重叠区域的共享世界一致性和非重叠区域的合理生成。支持49帧大规模视频生成的ShareVerse能够准确感知动态智能体的位置,实现一致的共享世界建模。

英文摘要

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the lack of support in existing work for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) We build a dataset for large-scale multi-agent interactive world modeling on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

2602.23214 2026-06-04 cs.CV cs.LG eess.IV 版本更新

Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

即插即用扩散遇见ADMM:双变量耦合用于鲁棒医学图像重建

Chenhe Du, Xuanyu Tian, Qing Wu, Muyu Liu, Jingyi Yu, Hongjiang Wei, Yuyao Zhang

发表机构 * ShanghaiTech University(上海科技大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出双耦合即插即用扩散(DC-PnPDP)框架,通过引入经典对偶变量提供积分反馈并采用频谱均匀化(SH)处理结构伪影,解决了现有PnP求解器的稳态偏差和幻觉问题,在CT和MRI重建中实现了最先进的保真度和加速收敛。

Comments Accepted by ICML 2026

详情
AI中文摘要

即插即用扩散先验(PnPDP)框架通过将预训练生成模型视为模块化先验,已成为解决成像逆问题的强大范式。然而,我们发现当前PnP求解器(例如基于HQS或近端梯度)存在一个关键缺陷:它们作为无记忆算子,仅基于瞬时梯度更新估计。这种缺乏历史跟踪的做法不可避免地导致非消失稳态偏差,使得重建在严重损坏下无法严格满足物理测量。为了解决这个问题,我们提出了双耦合PnP扩散(DC-PnPDP),它恢复了经典对偶变量以提供积分反馈,逐步强制数据一致性和先验之间的一致性。然而,这种严格的几何耦合引入了第二个挑战:累积的对偶残差表现出频谱有色、结构化的伪影,违反了扩散先验的加性白高斯噪声(AWGN)假设,导致严重的幻觉。为了弥合这一差距,我们引入了频谱均匀化(SH),一种频域适应机制,将这些结构化残差调制为统计上合规的伪AWGN输入。这有效地将求解器的严格优化轨迹与去噪器的有效统计流形对齐。在CT和MRI重建上的大量实验表明,我们的方法解决了偏差-幻觉权衡,实现了最先进的保真度并显著加速收敛。代码可在https://github.com/duchenhe/DC-PnPDP获取。

英文摘要

Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion (DC-PnPDP), which restores the classical dual variable to provide integral feedback, progressively enforcing agreement between data consistency and the prior. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver's rigorous optimization trajectory with the denoiser's valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence. The code is available at https://github.com/duchenhe/DC-PnPDP
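
The role of the dual variable as integral feedback can be seen in classical scaled-form ADMM on a toy problem. The sketch below is not DC-PnPDP: it assumes a scalar measurement model `a * x = y`, uses soft-thresholding as a stand-in for the diffusion denoiser, and omits Spectral Homogenization entirely. The function names and parameter values are illustrative.

```python
def soft_threshold(v: float, kappa: float) -> float:
    """Proximal operator of kappa * |x|.
    Assumption: this simple prox stands in for a learned prior step."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0

def admm_scalar(a: float, y: float, lam: float,
                rho: float = 1.0, iters: int = 100):
    """Scaled-form ADMM for min 0.5*(a*x - y)^2 + lam*|z| subject to x = z."""
    x = z = u = 0.0
    for _ in range(iters):
        # Data-consistency step: quadratic problem with a closed-form solution.
        x = (a * y + rho * (z - u)) / (a * a + rho)
        # Prior step applied to the dual-corrected estimate.
        z = soft_threshold(x + u, lam / rho)
        # Dual update: accumulates the running disagreement x - z,
        # which is what makes the feedback "integral" rather than memoryless.
        u += x - z
    return x, z, u

x, z, u = admm_scalar(a=2.0, y=3.0, lam=1.0)
```

For this toy problem the closed-form minimizer is `(a*y - lam) / a**2 = 1.25`, and the two primal variables agree at convergence because the dual variable has absorbed the residual.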

2506.06006 2026-06-04 cs.CV cs.AI cs.CL 版本更新

Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

视觉语言模型能预测未来状态吗?从逆动力学引导世界模型

Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

发表机构 * Institute for Language, Cognition and Computation, University of Edinburgh(语言、认知与计算研究所,爱丁堡大学) Language Technology Lab, University of Cambridge(语言技术实验室,剑桥大学) NVIDIA(NVIDIA公司) University of Groningen(格罗宁根大学)

AI总结 本文发现视觉语言模型(VLM)难以直接进行前向动力学预测(FDP),但逆动力学预测(IDP)更容易学习,并利用IDP通过弱监督学习和推理时验证两种策略引导FDP,在Aurora-Bench上取得与最先进图像编辑模型竞争的性能。

详情
AI中文摘要

统一的视觉语言模型(VLM)能否执行前向动力学预测(FDP),即根据先前的观察和(语言形式的)动作预测未来状态(图像形式)?我们发现VLM难以根据指令生成帧之间物理上合理的过渡。然而,我们识别出多模态基础中的一个关键不对称性:微调VLM学习逆动力学预测(IDP)——有效地描述帧之间的动作——比学习FDP容易得多。反过来,IDP可以通过两种主要策略引导FDP:1)来自合成数据的弱监督学习,以及2)推理时验证。首先,IDP可以为未标记的视频帧观察对标注动作,以扩大FDP的训练数据规模。其次,IDP可以为FDP的多个样本分配奖励以对其进行评分,从而在推理时有效指导搜索。我们通过Aurora-Bench上的以动作为中心的图像编辑任务,使用两个VLM家族评估了这两种策略产生的FDP。尽管仍然是通用模型,我们的最佳模型实现了与最先进的图像编辑模型竞争的性能,根据GPT4o作为评判,在Aurora-Bench的所有子集上,性能提高了7%到13%,并获得了最佳平均人类评估。

英文摘要

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), which effectively amounts to captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. Firstly, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Secondly, IDP can assign rewards to multiple samples of FDP to score them, effectively guiding search at inference time. We evaluate the FDP resulting from both strategies through the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin between 7% and 13% according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
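
The inference-time verification strategy amounts to best-of-N selection with the inverse model as scorer. The sketch below assumes each candidate already carries the action caption an IDP model would produce for it (the hypothetical field `idp_caption`) and uses a toy token-overlap reward. The paper's actual reward is not specified in the abstract, so both the reward and the names are illustrative.

```python
from typing import Callable, Sequence, TypeVar

Candidate = TypeVar("Candidate")

def best_of_n(candidates: Sequence[Candidate], instruction: str,
              reward: Callable[[Candidate, str], float]) -> Candidate:
    """Return the candidate with the highest verifier reward."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: reward(c, instruction))

def token_overlap_reward(candidate: dict, instruction: str) -> float:
    """Toy reward: Jaccard overlap between the instruction and the action
    caption stored under the hypothetical key 'idp_caption'."""
    caption = set(candidate["idp_caption"].lower().split())
    target = set(instruction.lower().split())
    union = caption | target
    return len(caption & target) / len(union) if union else 0.0

# Two stubbed forward-model samples with their inverse-model captions.
samples = [
    {"id": 0, "idp_caption": "close the drawer"},
    {"id": 1, "idp_caption": "open the drawer"},
]
chosen = best_of_n(samples, "open the drawer", token_overlap_reward)
```

In a real pipeline the caption would come from running the inverse model on the source frame and each generated frame, and the reward could be a likelihood rather than string overlap.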

2602.06883 2026-06-04 cs.LG cs.CV stat.ML 版本更新

Vision Transformer Finetuning Benefits from Non-Smooth Components

视觉变换器微调受益于非平滑组件

Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko

发表机构 * Noah's Ark Lab(诺亚方舟实验室) Univ. Rennes 2, Inria(雷恩第二大学,法国国家信息与自动化研究所)

AI总结 本文通过分析视觉变换器组件的可塑性(即输出对输入变化的敏感度),发现高可塑性(低平滑性)的注意力模块和前馈层在微调中表现更好,挑战了平滑性有利的传统观点。

Comments Accepted at ICML 2026

详情
AI中文摘要

变换器架构的平滑性在泛化、训练稳定性和对抗鲁棒性方面已被广泛研究。然而,其在迁移学习中的作用仍知之甚少。本文分析了视觉变换器组件使其输出适应输入变化的能力,即它们的可塑性。定义为平均变化率,它捕捉了对输入扰动的敏感性;特别地,高可塑性意味着低平滑性。我们的理论分析和大量实验(在大规模视觉变换器上进行超过1000次微调运行)表明,这一视角为选择在适应过程中优先考虑的组件提供了原则性指导。对从业者的关键启示是,注意力模块和前馈层的高可塑性始终导致更好的微调性能。我们的发现偏离了平滑性是可取的普遍假设,为变换器的功能特性提供了新的视角。代码可在 https://github.com/ambroiseodt/vit-plasticity 获取。

英文摘要

The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies a low smoothness. Our theoretical analysis and extensive experiments (over 1,000 finetuning runs on large-scale vision transformers) show that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on transformers' functional properties. The code is available at https://github.com/ambroiseodt/vit-plasticity.
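
An "average rate of change" can be estimated empirically by perturbing inputs and comparing output and input displacement. The sketch below is one plausible reading, not the paper's definition: it assumes Gaussian perturbations, Euclidean norms, and a plain mean over samples, and the function name `plasticity` and its parameters are hypothetical.

```python
import math
import random

def plasticity(f, inputs, eps=0.1, samples=8, seed=0):
    """Mean of ||f(x + d) - f(x)|| / ||d|| over random perturbations d.
    Assumption: d is drawn i.i.d. from N(0, eps^2) per coordinate."""
    rng = random.Random(seed)
    ratios = []
    for x in inputs:
        fx = f(x)
        for _ in range(samples):
            delta = [rng.gauss(0.0, eps) for _ in x]
            x_pert = [a + d for a, d in zip(x, delta)]
            change = math.dist(f(x_pert), fx)   # output displacement
            step = math.hypot(*delta)           # input displacement
            ratios.append(change / step)
    return sum(ratios) / len(ratios)

points = [[1.0, 2.0], [0.5, -1.0]]
scaled = plasticity(lambda x: [3.0 * v for v in x], points)
constant = plasticity(lambda x: [1.0, 1.0], points)
```

A map that scales its input by 3 has an estimated rate of change of about 3, while a constant map scores 0, which matches the intuition that high plasticity corresponds to low smoothness.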

2512.03553 2026-06-04 cs.CV cs.AI 版本更新

Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

直播中的动态内容审核:结合监督分类与MLLM增强的相似度匹配

Wei Chee Yew, Hailun Xu, Sanjay Saha, Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Kanchan Sarkar, Zhenheng Yang, Danhui Guan

发表机构 * TikTok, Singapore(TikTok,新加坡) TikTok, San Jose, United States(TikTok,美国圣何塞) TikTok, Shanghai, China(TikTok,中国上海)

AI总结 提出一种混合审核框架,结合监督分类和基于参考的相似度匹配,利用多模态大语言模型提升准确性,在保持轻量推理的同时实现大规模直播内容审核。

Comments To be published at KDD 2026 (ADS track)

详情
AI中文摘要

内容审核对于大规模用户生成视频平台仍然是一个关键且具有挑战性的任务,尤其是在直播环境中,审核必须及时、多模态,并且能够应对不断演变的不良内容形式。我们提出了一个在生产规模部署的混合审核框架,该框架将已知违规的监督分类与针对新颖或微妙情况的基于参考的相似度匹配相结合。这种混合设计能够稳健地检测出明确违规以及传统分类器无法检测到的新颖边缘情况。多模态输入(文本、音频、视觉)通过两个流水线处理,多模态大语言模型(MLLM)将知识提炼到每个流水线中,以提高准确性,同时保持推理轻量。在生产中,分类流水线在80%精确率下达到67%召回率,相似度流水线在80%精确率下达到76%召回率。大规模A/B测试显示,用户对不良直播的观看次数减少了6-8%。这些结果表明了一种可扩展且适应性强的多模态内容治理方法,能够处理明确违规和新兴对抗行为。

英文摘要

Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
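
Figures such as "67% recall at 80% precision" are read off a precision-recall sweep over score thresholds. The sketch below shows one common way to compute that operating point. It is a generic illustration with made-up toy scores, not the authors' evaluation code, and the function name is hypothetical.

```python
def recall_at_precision(scores, labels, min_precision):
    """Highest recall over all score thresholds whose precision is at least
    `min_precision`. `labels` are 1 for violations and 0 otherwise."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    best = 0.0
    true_pos = 0
    for k, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # Only cut between distinct scores so ties fall on the same side.
        if k < len(ranked) and ranked[k][0] == score:
            continue
        if true_pos / k >= min_precision:
            best = max(best, true_pos / total_pos)
    return best

# Toy data (illustrative only).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
r80 = recall_at_precision(toy_scores, toy_labels, 0.80)
r75 = recall_at_precision(toy_scores, toy_labels, 0.75)
```

On the toy data, requiring 80% precision limits the system to the top two items (recall 2/3), while relaxing to 75% admits the fourth item and reaches full recall.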

2601.19683 2026-06-04 cs.CV 版本更新

SharpNet: Enhancing MLPs to Represent Functions with Controlled Non-differentiability

SharpNet: 增强MLP以表示具有受控非可微性的函数

Hanting Niu, Junkai Deng, Fei Hou, Wencheng Wang, Ying He

发表机构 * Key Laboratory of System Software (CAS), Institute of Software, Chinese Academy of Sciences, Beijing, China(中国科学院软件研究所系统软件重点实验室,中国北京) University of Chinese Academy of Sciences, Beijing, China(中国科学院大学,中国北京)

AI总结 提出SharpNet架构,通过引入基于泊松方程跳跃Neumann边界条件的辅助特征函数,使MLP能够精确控制非可微性位置,从而在保持全局平滑的同时准确恢复尖锐特征。

详情
AI中文摘要

多层感知机(MLP)是学习和函数逼近的标准工具,但它们固有地产生全局平滑的输出。因此,在没有专门后处理的情况下,它们难以表示连续但故意不可微的函数(即具有规定的$C^0$尖锐特征的函数)。我们提出SharpNet,一种改进的MLP架构,通过使用定义为具有跳跃Neumann边界条件的泊松方程解的辅助特征函数来增强网络,从而编码用户指定的尖锐特征。该特征函数通过高效的局部积分进行评估,并且相对于特征位置完全可微,使我们能够联合优化特征位置和MLP参数以恢复目标函数或几何。这种构造提供了对非可微性发生位置的精确控制,在特征位置强制执行所需的$C^0$行为,同时在其他地方保持平滑。我们在2D问题和3D CAD重建上验证了SharpNet,并与几个最先进的基线进行了比较。在两种设置中,SharpNet都能准确恢复尖锐边缘和角落,同时保持远离它们时的平滑,而现有方法往往模糊梯度不连续性。定性和定量结果证明了我们方法的有效性。我们的项目页面、代码和模型可在https://sharpnettech.github.io公开获取。

英文摘要

Multi-layer perceptrons (MLPs) are a standard tool for learning and function approximation, but they inherently produce globally smooth outputs. Consequently, they struggle to represent functions that are continuous yet intentionally non-differentiable (i.e., functions with prescribed $C^0$ sharp features) without ad hoc post-processing. We present SharpNet, a modified MLP architecture that encodes user-specified sharp features by augmenting the network with an auxiliary feature function defined as the solution to Poisson's equation with jump Neumann boundary conditions. This feature function is evaluated via an efficient local integral and is fully differentiable with respect to the feature locations, allowing us to jointly optimize both the feature locations and the MLP parameters to recover the target function or geometry. This construction provides precise control over where non-differentiability occurs, enforcing the desired $C^0$ behavior at feature locations while preserving smoothness elsewhere. We validate SharpNet on 2D problems and 3D CAD reconstruction, and compare it with several state-of-the-art baselines. In both settings, SharpNet accurately recovers sharp edges and corners while remaining smooth away from them, whereas existing methods tend to blur gradient discontinuities. Qualitative and quantitative results demonstrate the effectiveness of our approach. Our project page, code and models are publicly available at https://sharpnettech.github.io.

2512.22105 2026-06-04 cs.CV 版本更新

Learning Association via Track-Detection Matching for Multi-Object Tracking

通过轨迹-检测匹配学习关联用于多目标跟踪

Momir Adžemović

发表机构 * Algoritmi i računarstvo, University of Belgrade(贝尔格莱德大学算法与计算系)

AI总结 提出TDLP方法,通过链接预测学习轨迹与检测之间的关联,在保持模块化和计算效率的同时超越现有方法。

Comments 14 pages (+4 for references), 8 tables, 4 figures

详情
AI中文摘要

多目标跟踪旨在通过跨视频帧关联检测来维持目标身份。现有文献中存在两种主要范式:基于检测的跟踪方法,计算效率高但依赖手工设计的关联启发式;以及端到端方法,从数据中学习关联但计算复杂度较高。我们提出轨迹-检测链接预测(TDLP),一种基于检测的跟踪方法,通过轨迹和检测之间的链接预测(即预测每帧中每条轨迹的正确延续)来执行逐帧关联。TDLP在架构上主要针对几何特征(如边界框)设计,同时可选地融入额外线索,包括姿态和外观。与基于启发式的方法不同,TDLP直接从数据中学习关联,无需手工规则,同时与端到端跟踪器相比保持模块化和计算效率。在多个基准上的大量实验表明,TDLP在基于检测的跟踪和端到端方法中均持续超越最先进性能。最后,我们提供了链接预测与基于度量学习的关联的详细分析,并表明链接预测更有效,特别是在处理异构特征(如检测边界框)时。我们的代码可在 https://github.com/Robotmurlock/TDLP 获取。

英文摘要

Multi-object tracking aims to maintain object identities over time by associating detections across video frames. Two dominant paradigms exist in the literature: tracking-by-detection methods, which are computationally efficient but rely on handcrafted association heuristics, and end-to-end approaches, which learn association from data at the cost of higher computational complexity. We propose Track-Detection Link Prediction (TDLP), a tracking-by-detection method that performs per-frame association via link prediction between tracks and detections, i.e., by predicting the correct continuation of each track at every frame. TDLP is architecturally designed primarily for geometric features such as bounding boxes, while optionally incorporating additional cues, including pose and appearance. Unlike heuristic-based methods, TDLP learns association directly from data without handcrafted rules, while remaining modular and computationally efficient compared to end-to-end trackers. Extensive experiments on multiple benchmarks demonstrate that TDLP consistently surpasses state-of-the-art performance across both tracking-by-detection and end-to-end methods. Finally, we provide a detailed analysis comparing link prediction with metric learning-based association and show that link prediction is more effective, particularly when handling heterogeneous features such as detection bounding boxes. Our code is available at https://github.com/Robotmurlock/TDLP.
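
Once a model has produced a link score for every track-detection pair, per-frame association reduces to picking a one-to-one matching. The sketch below assumes the scores are already given and uses greedy matching with a fixed threshold. The abstract does not say how TDLP resolves conflicts between tracks, so the matching rule, the threshold, and the function name are all assumptions.

```python
def associate(link_scores, threshold=0.5):
    """Greedy one-to-one matching from a track-by-detection score matrix.
    link_scores[t][d] is the predicted probability that detection d
    continues track t. Returns {track_index: detection_index}."""
    pairs = sorted(
        ((score, t, d)
         for t, row in enumerate(link_scores)
         for d, score in enumerate(row)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), {}
    for score, t, d in pairs:
        if score < threshold:
            break  # all remaining pairs score lower still
        if t in used_tracks or d in used_dets:
            continue
        matches[t] = d
        used_tracks.add(t)
        used_dets.add(d)
    return matches

# Two tracks, three detections (toy scores).
toy_links = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.1],
]
assignment = associate(toy_links)
```

Here track 0 claims detection 0 first, so track 1 falls back to detection 1, and detection 2 stays unmatched and could start a new track. An optimal assignment solver could replace the greedy loop.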

2512.21094 2026-06-04 cs.CV 版本更新

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

T2AV-Compass:迈向文本到音频-视频生成的统一评估

Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jiahao Wang, Jialu Chen, Miao Deng, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu

发表机构 * NJU-LINK Team, Nanjing University(南京大学NJU-LINK团队) Kling Team, Kuaishou Technology(快手科技 Kling 团队) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 提出T2AV-Compass基准,通过分类学驱动的500个复杂提示和双层次评估框架(客观信号指标+主观MLLM评判),系统评估文本到音频-视频生成模型,发现现有模型在跨模态对齐和指令遵循方面显著不足。

Comments 41 pages, 13 figures, 12 tables. Accepted at ICML 2026

详情
AI中文摘要

文本到音频-视频(T2AV)生成旨在从自然语言合成时间连贯的视频和语义同步的音频,但其评估仍然碎片化,通常依赖单模态指标或范围狭窄的基准,无法捕捉跨模态对齐、指令遵循和复杂提示下的感知真实性。为解决这一局限,我们提出了T2AV-Compass,一个用于全面评估T2AV系统的统一基准,包含通过分类学驱动流程构建的500个多样且复杂的提示,以确保语义丰富性和物理合理性。此外,T2AV-Compass引入了一个双层次评估框架,将用于视频质量、音频质量和跨模态对齐的客观信号级指标与用于指令遵循和真实性评估的主观MLLM-as-a-Judge协议相结合。对11个代表性T2AV系统的广泛评估表明,即使是最强的模型也远未达到人类水平的真实性和跨模态一致性,在音频真实性、细粒度同步、指令遵循等方面持续失败。这些结果表明未来模型有显著的改进空间,并凸显了T2AV-Compass作为推进文本到音频-视频生成的挑战性和诊断性测试平台的价值。

英文摘要

Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in areas such as audio realism, fine-grained synchronization, and instruction following. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.

2512.16919 2026-06-04 cs.CV cs.AI cs.RO 版本更新

DVGT: Driving Visual Geometry Transformer

DVGT: 驾驶视觉几何变换器

Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, Jiwen Lu

发表机构 * Tsinghua University(清华大学) University of Macau(澳门大学) Xiaomi EV(小米电动车) Peking University(北京大学)

AI总结 提出DVGT,一种从无位姿多视角图像序列重建全局稠密3D点图的视觉几何变换器,通过交替注意力机制学习几何关系,无需相机参数和后处理对齐,在多个驾驶数据集上显著优于现有模型。

Comments Code is available at https://github.com/wzzheng/DVGT

详情
AI中文摘要

从视觉输入中感知和重建3D场景几何对于自动驾驶至关重要。然而,目前仍缺乏一种能够适应不同场景和相机配置的、面向驾驶的稠密几何感知模型。为弥补这一空白,我们提出了驾驶视觉几何变换器(DVGT),它从一系列无位姿的多视角视觉输入中重建全局稠密3D点图。我们首先使用DINO骨干网络为每张图像提取视觉特征,并采用交替的视角内局部注意力、跨视角空间注意力和跨帧时间注意力来推断图像间的几何关系。然后,我们使用多个头解码第一帧自车坐标系下的全局点图以及每帧的自车位姿。与依赖精确相机参数的传统方法不同,DVGT无需显式的3D几何先验,能够灵活处理任意相机配置。DVGT直接从图像序列预测度量尺度的几何,消除了与外部传感器后对齐的需求。在包含nuScenes、OpenScene、Waymo、KITTI和DDAD的大型驾驶数据集混合训练下,DVGT在各种场景中显著优于现有模型。代码可在https://github.com/wzzheng/DVGT获取。

英文摘要

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations is still lacking. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

2512.05277 2026-06-04 cs.CV cs.AI 版本更新

From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models

从片段到场景:自动驾驶中基于视觉语言模型的时间理解

Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Toronto(多伦多大学) ETH Zurich(苏黎世联邦理工学院) University of Washington(华盛顿大学) University of Southern California(南加州大学)

AI总结 提出自动驾驶时间理解基准TAD,通过场景思维链和轨迹认知图两种无训练方法提升视觉语言模型的时间推理能力。

详情
AI中文摘要

视觉语言模型(VLM)越来越多地被部署为野外自主代理的感知和推理骨干,其中自动驾驶(AD)是最安全关键的实例之一。可靠的时间理解对于此类代理预测事件、归因原因和在动态环境中安全行动至关重要,但即使对于最先进的(SoTA)VLM来说,这仍然是一个重大挑战。先前的视频基准强调了其他内容(体育、烹饪等),但现有基准没有专门关注短时和长时AD视频的时间理解。为填补这一空白,我们提出了自动驾驶时间理解(TAD)基准,包含近6000个问答(QA)对,涵盖7个任务,并评估了9个闭源和开源通用以及AD专用模型。当前SoTA模型在TAD上的表现远低于人类准确率。为了改进基于VLM的驾驶代理的时间推理,我们提出了两种新颖的无训练解决方案:Scene-CoT,它使用思维链(CoT)推理;以及TCogMap,它结合了由轨迹分析模块生成的自我中心时间认知图,该模块作为VLM周围的代理工具运行。与现有VLM集成后,我们的方法在TAD上的平均准确率提高了高达17.72%,在STSBench上提高了高达10.35%。通过引入TAD、对SoTA模型进行基准测试并提出有效的增强方法,本工作旨在促进野外代理AD系统时间理解的进一步进展。基准和评估代码分别可在Hugging Face(https://huggingface.co/datasets/vbdai/TAD)和GitHub(https://github.com/vbdi/tad_bench)上获取。

英文摘要

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), yet no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy on TAD by up to 17.72% and by up to 10.35% on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.

2512.08331 2026-06-04 cs.CV 版本更新

DMAConv: Dual Mask-Adaptive Convolution for Remote Sensing Pansharpening

DMAConv: 用于遥感全色锐化的双掩膜自适应卷积

Xianghong Xiao, Zeyu Xia, Zhou Fei, Jinliang Xiao, Haorui Chen, Liangjian Deng

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Tongji University(同济大学)

AI总结 提出双掩膜自适应卷积(DMAConv),通过软硬掩膜动态分配计算资源,以轻量级双分支结构高效处理遥感图像的区域异质性,实现SOTA性能且计算成本最低。

详情
AI中文摘要

全色锐化旨在融合高分辨率全色图像与低分辨率多光谱图像。现有的深度学习方法,包括最近的自适应卷积,难以应对遥感图像的区域异质性,且往往计算成本过高。为解决这些挑战,我们提出双掩膜自适应卷积(DMAConv),这是一种根据特征特性动态分配计算资源的新型算子。DMAConv首先使用轻量级模块生成软掩膜和硬掩膜。硬掩膜将特征分为一个紧凑分支(用于全局处理冗余信息)和一个聚焦分支(以更多计算投入建模复杂异质区域)。随后,软掩膜对两个分支的输入特征进行初步调制。这种双分支掩膜自适应设计显著增强了特征表示,同时最小化了计算开销。大量实验表明,我们的方法在广泛的定量基准上达到了SOTA,且参数数量显著更低,计算成本在自适应卷积模型中最低。

英文摘要

Pansharpening aims to fuse a high-resolution panchromatic image with a low-resolution multispectral image. Existing deep learning methods, including recent adaptive convolutions, struggle with regional heterogeneity in remote sensing images and often incur prohibitive computational costs. To address these challenges, we propose Dual Mask-Adaptive Convolution (DMAConv), a novel operator that dynamically allocates computational resources based on feature characteristics. DMAConv first employs a lightweight module to generate soft and hard masks. The hard mask separates features into a compact branch for processing redundant information globally and a focused branch that models complex, heterogeneous regions with greater computational investment. The soft mask then preliminarily modulates the input features for both branches. This dual-branch, mask-adaptive design significantly enhances feature representation while minimizing computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art results on a broad array of quantitative benchmarks, with substantially lower parameter counts and the lowest computational cost among adaptive convolution models.

2511.16624 2026-06-04 cs.CV cs.AI 版本更新

SAM 3D: 3Dfy Anything in Images

SAM 3D: 将图像中的任何内容3D化

SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, Jitendra Malik

发表机构 * Meta Superintelligence Labs(Meta超智能实验室)

AI总结 提出SAM 3D生成模型,从单张图像重建3D物体的几何、纹理和布局,通过人机协同标注和分阶段训练突破数据瓶颈,在真实场景中取得显著优势。

Comments Website: https://ai.meta.com/sam3d/

详情
AI中文摘要

我们提出SAM 3D,一种用于视觉引导的3D物体重建的生成模型,能够从单张图像预测几何、纹理和布局。SAM 3D在自然图像中表现出色,这些图像中遮挡和场景杂乱很常见,且来自上下文的视觉识别线索起着更重要的作用。我们通过一个人工和模型在环的流水线来标注物体形状、纹理和姿态,以前所未有的规模提供视觉引导的3D重建数据。我们在一个现代的、多阶段的训练框架中从这些数据中学习,该框架结合了合成预训练和真实世界对齐,打破了3D“数据壁垒”。与近期工作相比,我们获得了显著提升,在真实世界物体和场景的人类偏好测试中至少达到5:1的胜率。我们将发布我们的代码和模型权重、一个在线演示以及一个新的用于野外3D物体重建的具有挑战性的基准测试。

英文摘要

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

2511.06331 2026-06-04 cs.CV 版本更新

Label-Efficient 3D Forest Mapping: Self-Supervised and Transfer Learning for Instance Segmentation, Semantic Segmentation, and Species Classification

标签高效的3D森林映射:自监督与迁移学习用于实例分割、语义分割和物种分类

Aldino Rizaldy, Fabian Ewald Fassnacht, Ahmed Jamal Afifi, Hua Jiang, Richard Gloaguen, Pedram Ghamisi

发表机构 * Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Helmholtz Institute Freiberg for Resource Technology (HIF)(德累斯顿-罗森多夫亥姆霍兹中心(HZDR),弗赖贝格亥姆霍兹资源技术研究所(HIF)) Remote Sensing and Geoinformatics, Freie Universität Berlin(柏林自由大学遥感与地理信息学系) Institute of Geomatics, BOKU University(BOKU大学测绘研究所) Faculty of Electrical and Computer Engineering, University of Iceland(冰岛大学电气与计算机工程学院)

AI总结 本文利用自监督和迁移学习策略,在少量标注数据下提升3D点云中树木实例分割、语义分割和物种分类的性能,并集成统一框架以简化流程。

详情
AI中文摘要

个体树木级别的详细结构和物种信息对于支持精准林业、生物多样性保护以及为生物量和碳映射提供参考数据日益重要。来自机载和地面激光扫描的点云是目前快速大规模获取此类信息的最合适数据源。深度学习的最新进展改进了对个体树木的分割和分类以及语义树组件的识别。然而,深度学习模型通常需要大量标注训练数据,这限制了进一步的改进。为3D点云生成密集、高质量的标注,尤其是在复杂森林中,劳动密集且难以规模化。我们探索使用自监督和迁移学习来减少对大型标注数据集的依赖。我们的目标是提高三个任务的性能:实例分割、语义分割和树木分类,使用现实且可操作的训练集。与从头训练相比,我们观察到所有任务均有所改进,并通过各自的指标进行评估。对于实例分割,自监督学习结合领域适应使AP50提高了16.98%。对于语义分割,仅自监督学习使mIoU提高了1.79%。对于树木分类,层次迁移学习使平均Jaccard提高了6.07%。为简化使用并鼓励采用,我们将这些任务集成到一个统一框架中,简化了从原始点云到树木描绘、结构分析和物种分类的流程。预训练模型减少了约21%的能耗和碳排放。这一开源贡献旨在加速从激光扫描点云中操作性地提取个体树木信息,以支持林业、生物多样性和碳映射。

英文摘要

Detailed structural and species information at the individual tree level is increasingly important for supporting precision forestry and biodiversity conservation and for providing reference data for biomass and carbon mapping. Point clouds from airborne and ground-based laser scanning are currently the most suitable data source to rapidly derive such information at scale. Recent advances in deep learning have improved the segmentation and classification of individual trees and the identification of semantic tree components. However, deep learning models typically require large amounts of annotated training data, which limits further improvement. Producing dense, high-quality annotations for 3D point clouds, especially in complex forests, is labor-intensive and challenging to scale. We explore strategies to reduce dependence on large annotated datasets using self-supervised and transfer learning. Our objective is to improve performance across three tasks (instance segmentation, semantic segmentation, and tree classification) using realistic and operational training sets. Compared to training from scratch, we observe improvements across all tasks, each evaluated with its respective metric. For instance segmentation, self-supervised learning combined with domain adaptation improves AP50 by 16.98%. For semantic segmentation, self-supervised learning alone improves mIoU by 1.79%. For tree classification, hierarchical transfer learning improves mean Jaccard by 6.07%. To simplify use and encourage uptake, we integrated the tasks into a unified framework, streamlining the process from raw point clouds to tree delineation, structural analysis, and species classification. Pretrained models reduce energy consumption and carbon emissions by ~21%. This open-source contribution aims to accelerate operational extraction of individual tree information from laser scanning point clouds to support forestry, biodiversity, and carbon mapping.

2511.00801 2026-06-04 cs.CV cs.MM 版本更新

Med-Banana: Learning Quality-Controlled Medical Image Editing from Success-and-Failure Trajectories

Med-Banana:从成功与失败轨迹中学习质量可控的医学图像编辑

Zhihui Chen, Qingyuan Lei, Kai He, Yanrui Du, Mengling Feng

发表机构 * National University of Singapore(新加坡国立大学) The Chinese University of Hong Kong(香港中文大学) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 提出Med-Banana框架,通过收集成功与失败编辑轨迹数据集Med-Banana-80K,联合训练编辑器、验证器和优化器,实现质量可控的医学图像编辑。

详情
AI中文摘要

文本引导的医学图像编辑必须满足所需的病理特征,同时保留解剖结构、模态特定外观和临床合理性。然而,现有数据集主要用最终接受的编辑结果来监督编辑器,并丢弃生成过程中产生的失败尝试。我们认为这些失败为质量控制提供了必要的监督:它们指定了应该拒绝什么、为什么编辑在医学或视觉上无效,以及应该如何修改指令。我们提出了Med-Banana,一个用于质量可控的医学图像编辑的轨迹监督框架。我们引入了Med-Banana-80K,一个大规模的成功与失败编辑轨迹资源,包含候选图像、验证结果、拒绝原因和提示优化。在此基础上,Med-Banana联合训练编辑器、验证器和优化器,实现了从接受和拒绝尝试中进行编辑-验证-优化推理。在MLLM评估者、盲审专家评估、源保留和真实-合成可分离性探测上的实验表明,与开放的医学图像编辑器相比,该方法具有一致的改进。代码和数据已公开。

英文摘要

Text-guided medical image editing must satisfy the requested pathology while preserving anatomy, modality-specific appearance, and clinical plausibility. However, existing datasets largely supervise editors with final accepted edits and discard the failed attempts produced during generation. We argue that these failures provide essential supervision for quality control: they specify what should be rejected, why an edit is medically or visually invalid, and how the instruction should be revised. We present Med-Banana, a trajectory-supervised framework for quality-controlled medical image editing. We introduce Med-Banana-80K, a large-scale resource of success-and-failure editing trajectories with candidate images, verification outcomes, rejection reasons, and prompt refinements. Building on it, Med-Banana jointly trains an editor, verifier, and refiner, enabling edit–verify–refine inference from accepted and rejected attempts. Experiments across MLLM judges, blind expert assessment, source-preservation and real–synthetic separability probes demonstrate consistent improvements over open medical image editors. Code and data are publicly available.

2505.24528 2026-06-04 cs.CV cs.LG 版本更新

Geospatial Foundation Models to Enable Progress on Sustainable Development Goals

地理空间基础模型推动可持续发展目标的进展

Pedram Ghamisi, Weikang Yu, Xiaokang Zhang, Aldino Rizaldy, Jian Wang, Chufeng Zhou, Richard Gloaguen, Gustau Camps-Valls

发表机构 * Helmholtz-Zentrum Dresden-Rossendorf(亥姆霍兹德累斯顿-罗森多夫研究中心) University of Iceland(冰岛大学) Wuhan University(武汉大学) Wuhan University of Science and Technology(武汉科技大学) Universitat de València(瓦伦西亚大学)

AI总结 本文提出SustainFM基准框架,基于17个可持续发展目标评估地理空间基础模型,发现其在多样任务中优于传统方法,并强调需从模型中心转向影响驱动部署,关注能效、泛化性和伦理。

详情
AI中文摘要

基础模型(FMs)是大规模预训练的人工智能系统,已革新自然语言处理和计算机视觉,并正在推进地理空间分析和地球观测(EO)。它们承诺在任务间改进泛化、可扩展性以及用最少标注数据高效适应。然而,尽管地理空间FMs迅速激增,其现实世界效用和与全球可持续发展目标的一致性仍未充分探索。我们提出SustainFM,一个基于17个可持续发展目标的全面基准框架,涵盖从资产财富预测到环境危害检测的极其多样化的任务。本研究提供了对地理空间FMs的严格、跨学科评估,并对其在实现可持续发展目标中的作用提供了关键见解。我们的发现表明:(1)虽然并非普遍优越,但FMs在多样任务和数据集上通常优于传统方法。(2)评估FMs应超越准确性,将可迁移性、泛化性和能效作为其负责任使用的关键标准。(3)FMs支持可扩展的、基于SDG的解决方案,为应对复杂可持续发展挑战提供广泛实用性。关键的是,我们倡导从以模型为中心的发展转向以影响驱动的部署,并强调能效、对领域变化的鲁棒性以及伦理考量等指标。

英文摘要

Foundation Models (FMs) are large-scale, pre-trained artificial intelligence (AI) systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world utility and alignment with global sustainability goals remain underexplored. We introduce SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with extremely diverse tasks ranging from asset wealth prediction to environmental hazard detection. This study provides a rigorous, interdisciplinary assessment of geospatial FMs and offers critical insights into their role in attaining sustainability goals. Our findings show: (1) While not universally superior, FMs often outperform traditional approaches across diverse tasks and datasets. (2) Evaluating FMs should go beyond accuracy to include transferability, generalization, and energy efficiency as key criteria for their responsible use. (3) FMs enable scalable, SDG-grounded solutions, offering broad utility for tackling complex sustainability challenges. Critically, we advocate for a paradigm shift from model-centric development to impact-driven deployment, and emphasize metrics such as energy efficiency, robustness to domain shifts, and ethical considerations.

2508.08237 2026-06-04 cs.MM cs.AI cs.CV cs.SD eess.AS 版本更新

VGGSounder: Audio-Visual Evaluations for Foundation Models

VGGSounder:基础模型的音视频评估

Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke

发表机构 * Technical University of Munich, MCML(慕尼黑工业大学,MCML) University of Tübingen(图宾根大学) Tübingen AI Center(图宾根人工智能中心) MPI for Intelligent Systems, ELLIS Institute(马克斯·普朗克智能系统研究所,ELLIS研究所)

AI总结 针对VGGSound数据集在音视频基础模型评估中的标签不完整、类别重叠和模态错位等问题,提出重新标注的多标签测试集VGGSounder,并引入模态混淆指标分析模型性能退化。

Comments Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2025

详情
AI中文摘要

音视频基础模型的出现凸显了可靠评估其多模态理解能力的重要性。VGGSound数据集常被用作评估音视频分类的基准。然而,我们的分析发现了VGGSound的几个局限性,包括标签不完整、部分类别重叠以及模态错位。这些问题导致对听觉和视觉能力的评估出现偏差。为了解决这些局限性,我们引入了VGGSounder,这是一个全面重新标注的多标签测试集,它扩展了VGGSound,并专门设计用于评估音视频基础模型。VGGSounder具有详细的模态标注,能够精确分析特定模态的性能。此外,通过我们新的模态混淆指标,我们分析了添加另一种输入模态时的性能退化,揭示了模型的局限性。

英文摘要

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
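
The abstract does not spell out how the modality confusion metric is computed. One plausible reading, assumed here purely for illustration, is the share of samples that a model gets right from a single modality but gets wrong once the second modality is added. The function name `modality_confusion` and the boolean per-sample inputs are hypothetical, and the sketch ignores the multi-label nature of the test set.

```python
def modality_confusion(correct_single, correct_both):
    """Fraction of samples solved with one modality but missed with both.

    correct_single[i]: True if sample i is classified correctly from a
                       single modality (e.g. audio only).
    correct_both[i]:   True if sample i is classified correctly when the
                       other modality is added.
    This definition is an assumption, not taken from the paper.
    """
    if len(correct_single) != len(correct_both):
        raise ValueError("inputs must have the same length")
    if not correct_single:
        return 0.0
    confused = sum(1 for s, b in zip(correct_single, correct_both) if s and not b)
    return confused / len(correct_single)
```

Under this reading, a value of 0 means adding a modality never hurts, and larger values indicate more degradation from the extra input.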

2510.13796 2026-06-04 cs.CL cs.CV 版本更新

The Mechanistic Emergence of Symbol Grounding in Language Models

语言模型中符号接地机制的涌现

Shuyu Wu, Ziqiao Ma, Xiaoxi Luo, Yidong Huang, Josue Torres-Fonseca, Freda Shi, Joyce Chai

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 通过机制性因果分析,发现符号接地在语言模型的中层计算中通过注意力头聚合环境信息实现,并在多模态对话和多种架构中复现。

详情
AI中文摘要

符号接地(Harnad, 1990)描述了词语等符号如何通过连接真实世界的感知运动经验来获得意义。最近的研究初步表明,在大规模训练且未使用显式接地目标的(视觉-)语言模型中,接地可能涌现。然而,这种涌现的具体位置及其驱动机制在很大程度上仍未被探索。为解决这一问题,我们引入了一个受控评估框架,通过机制性和因果分析系统地追踪符号接地如何在内部计算中产生。我们的发现表明,接地集中在中层计算中,并通过聚合机制实现,其中注意力头聚合环境接地以支持语言形式的预测。这种现象在多模态对话和跨架构(Transformer 和状态空间模型)中复现,但在单向 LSTM 中未出现。我们的结果提供了行为和机制性证据,表明符号接地可以在语言模型中涌现,并对预测和潜在控制生成的可靠性具有实际意义。

英文摘要

Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.

2510.09953 2026-06-04 cs.CV 版本更新

J-RAS: Mutual Adaptation for Medical Image Segmentation via Contrastive Retrieval-Augmented Joint Optimization

J-RAS:基于对比检索增强联合优化的医学图像分割互适应方法

Salma J. Ahmed, Emad A. Mohammed, Azam Asilian Bidgoli

发表机构 * Laurier University(劳里尔大学)

AI总结 提出J-RAS框架,通过交替对比学习和监督学习联合优化分割与检索模型,实现检索与分割的互适应,提升医学图像分割的边界描绘、鲁棒性和跨数据集泛化能力。

详情
AI中文摘要

临床医生手动进行医学图像分割虽然准确,但耗时且在不同专家间存在差异,而基于AI的模型自动化了这一过程,但在数据有限和域偏移时往往表现不佳。受病理学学员通过指导性比较专家标注的切片和组织病理学图谱参考图像来获得疾病识别技能的启发,我们提出了联合检索增强分割(J-RAS)。该框架使分割网络能够在指导下学习。J-RAS通过交替对比学习和监督学习联合优化分割模型和检索模型,使检索网络能够发现上下文相关的图像-掩码对,从而细化分割模型的解剖推理。与被动提供相似样本的传统检索增强不同,J-RAS建立了一个互适应和优化循环,其中检索模型学习强调分割相关的线索,而分割模型利用检索到的示例来改进边界描绘、对罕见病例的鲁棒性以及跨数据集泛化。在涵盖不同成像模态的四个公共基准(包括ACDC和M&Ms(MRI)、乳腺癌超声以及肺部和感染CT)上,使用多种骨干网络(U-Net、TransUNet、SAM和SegFormer)进行的评估证明了J-RAS的泛化性和有效性。例如,在ACDC上,SegFormer的平均Dice从0.8708±0.042和HD从1.8130±2.49提升至0.9115±0.031和1.1489±0.30。这些结果突显了检索引导的对比优化如何在医学图像分割中桥接人类式指导与机器学习的精确性。

英文摘要

Manual medical image segmentation by clinicians, though accurate, is time-consuming and variable across experts, whereas AI-based models automate this process but often underperform with limited data and domain shifts. Inspired by how pathology trainees acquire disease recognition skills through guided comparison with expert-annotated slides and histopathology atlas reference images, we propose Joint Retrieval-Augmented Segmentation (J-RAS). This framework enables segmentation networks to learn with guidance. J-RAS jointly optimizes a segmentation model and a retrieval model through alternating contrastive and supervised learning, allowing the retrieval network to discover contextually relevant image-mask pairs that refine the segmentation model's anatomical reasoning. Unlike conventional retrieval-based augmentation that passively provides similar samples, J-RAS establishes a mutual adaptation and optimization loop where the retrieval model learns to emphasize segmentation-relevant cues, while the segmentation model leverages retrieved examples to improve boundary delineation, robustness to rare cases, and cross-dataset generalization. Evaluations on four public benchmarks spanning different imaging modalities, including ACDC and M&Ms (MRI), Breast Cancer Ultrasound, and lung and infection CT, across multiple backbones (U-Net, TransUNet, SAM, and SegFormer) demonstrate the generalizability and effectiveness of J-RAS. For instance, on ACDC, SegFormer improves from a mean Dice of 0.8708 ± 0.042 and HD of 1.8130 ± 2.49 to 0.9115 ± 0.031 and 1.1489 ± 0.30. These results highlight how retrieval-guided contrastive optimization bridges human-like guidance and machine-learned precision in medical image segmentation.
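
The ACDC result above is stated in terms of the Dice score. As a minimal sketch of that metric, the snippet below computes Dice for two flattened binary masks, Dice = 2|A ∩ B| / (|A| + |B|). The function name `dice` and the convention of returning 1.0 when both masks are empty are assumptions for illustration; this is not the J-RAS evaluation code.

```python
def dice(pred, target):
    """Dice coefficient between two flattened binary masks.

    Returns 1.0 when both masks are empty (an assumed convention).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * inter / total


# Absolute gain in mean Dice reported for SegFormer on ACDC.
reported_gain = round(0.9115 - 0.8708, 4)
```

With the figures quoted in the abstract, `reported_gain` is 0.0407, that is, roughly four Dice points.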

2510.03511 2026-06-04 cs.CV cs.AI cs.LG eess.IV 版本更新

Platonic Transformers: A Solid Choice For Equivariance

柏拉图式Transformer:等变性的坚实选择

Mohammad Mohaiminul Islam, Rishabh Anand, David R. Wessels, Friso de Kruiff, Thijs P. Kuipers, Rex Ying, Clara I. Sánchez, Sharvaree Vadgama, Georg Bökman, Erik J. Bekkers

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Platonic Transformer,通过基于柏拉图立体对称群参考帧的注意力机制实现等变性,在不增加计算成本的前提下提升性能。

详情
AI中文摘要

尽管Transformer广泛应用,但缺乏科学和计算机视觉中常见几何对称性的归纳偏置。现有的等变方法往往通过复杂、计算密集的设计牺牲了Transformer的高效性和灵活性。我们引入Platonic Transformer来解决这一权衡。通过将注意力定义为相对于柏拉图立体对称群参考帧,我们的方法引入了一种有原则的权重共享方案。这使得模型能够同时对连续平移和柏拉图对称性保持等变,同时保留标准Transformer的精确架构和计算成本。此外,我们证明这种注意力在形式上等价于动态群卷积,这表明模型学习自适应几何滤波器,并实现高度可扩展的线性时间卷积变体。在计算机视觉(CIFAR-10)、3D点云(ScanObjectNN)和分子性质预测(QM9、OMol25)等多个基准测试中,Platonic Transformer通过利用这些几何约束以零额外成本取得了有竞争力的性能。

英文摘要

While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.

2510.01532 2026-06-04 cs.CV 版本更新

MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation

MATCH: 面向半监督组织病理学分割的多面自适应拓扑一致性

Meilong Xu, Xiaoling Hu, Shahira Abousamra, Chen Li, Chao Chen

发表机构 * Stony Brook University(石溪大学) Massachusetts General Hospital and Harvard Medical School(麻省总医院和哈佛医学院) Department of Biomedical Data Science, Stanford University(斯坦福大学生物医学数据科学系)

AI总结 提出一种半监督分割框架MATCH,通过随机丢弃和时间训练快照生成多种扰动预测,并强制拓扑一致性来识别和保留相关拓扑特征,引入结合空间重叠与全局结构对齐的匹配策略以减少预测差异,有效降低拓扑错误,提升分割鲁棒性和准确性。

Comments 20 pages, 6 figures. Accepted by NeurIPS 2025

详情
AI中文摘要

在半监督分割中,从无标签数据中捕获有意义的语义结构至关重要。这在组织病理学图像分析中尤其具有挑战性,因为物体分布密集。为了解决这个问题,我们提出了一个半监督分割框架,旨在稳健地识别和保留相关的拓扑特征。我们的方法利用通过随机丢弃和时间训练快照获得的多种扰动预测,强制这些不同输出之间的拓扑一致性。这种一致性机制有助于将生物学有意义的结构与瞬态和噪声伪影区分开来。这个过程的一个关键挑战是在没有真实标签的情况下准确匹配预测中对应的拓扑特征。为了克服这一点,我们引入了一种新颖的匹配策略,将空间重叠与全局结构对齐相结合,最小化预测之间的差异。大量实验表明,我们的方法有效减少了拓扑错误,从而产生更稳健和准确的分割,这对于可靠的下游分析至关重要。代码可在 https://github.com/Melon-Xu/MATCH 获取。

英文摘要

In semi-supervised segmentation, capturing meaningful semantic structures from unlabeled data is essential. This is particularly challenging in histopathology image analysis, where objects are densely distributed. To address this issue, we propose a semi-supervised segmentation framework designed to robustly identify and preserve relevant topological features. Our method leverages multiple perturbed predictions obtained through stochastic dropouts and temporal training snapshots, enforcing topological consistency across these varied outputs. This consistency mechanism helps distinguish biologically meaningful structures from transient and noisy artifacts. A key challenge in this process is to accurately match the corresponding topological features across the predictions in the absence of ground truth. To overcome this, we introduce a novel matching strategy that integrates spatial overlap with global structural alignment, minimizing discrepancies among predictions. Extensive experiments demonstrate that our approach effectively reduces topological errors, resulting in more robust and accurate segmentations essential for reliable downstream analysis. Code is available at https://github.com/Melon-Xu/MATCH.

2307.00862 2026-06-04 cs.CV cs.CL 版本更新

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

UniFine: 一种统一且细粒度的零样本视觉-语言理解方法

Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang

发表机构 * Columbia University(哥伦比亚大学) Microsoft Research(微软研究院) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出UniFine框架,通过利用句子关键词和图像对象等细粒度信息进行图像-文本匹配,在零样本设置下统一处理VQA、SNLI-VE和VCR等视觉-语言任务,并在多个数据集上取得显著改进。

Comments 14 pages, 4 figures, ACL 2023 Findings

详情
AI中文摘要

视觉-语言任务,如VQA、SNLI-VE和VCR,具有挑战性,因为它们需要模型的推理能力来理解视觉世界和自然语言的语义。针对视觉-语言任务的监督方法已被充分研究。然而,在零样本设置下解决这些任务的研究较少。由于对比语言-图像预训练(CLIP)在图像-文本匹配上展现了显著的零样本性能,先前的工作通过将视觉-语言任务转换为图像-文本匹配问题来利用其强大的零样本能力,并且它们主要考虑全局级别的匹配(例如,整个图像或句子)。然而,我们发现视觉和文本的细粒度信息,例如句子中的关键词和图像中的对象,对于语义理解可能相当有信息量。受此启发,我们提出了一个统一框架,利用细粒度信息进行零样本视觉-语言学习,涵盖多个任务,如VQA、SNLI-VE和VCR。我们的实验表明,我们的框架在VQA上优于先前的零样本方法,并在SNLI-VE和VCR上取得了显著改进。此外,我们的消融研究证实了我们提出的方法的有效性和泛化性。

英文摘要

Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging because they require the model's reasoning ability to understand the semantics of the visual world and natural language. Supervised methods for vision-language tasks have been well studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works utilized its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly consider global-level matching (e.g., the whole image or sentence). However, we find that visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method.

2503.21469 2026-06-04 eess.IV cs.CV 版本更新

Embedding Compression Distortion in Video Coding for Machines

面向机器的视频编码中的嵌入压缩失真

Yuxiao Sun, Yao Zhao, Meiqin Liu, Chao Yao, Weisi Lin

发表机构 * Beijing Jiaotong University, China(北京交通大学) University of Science and Technology Beijing, China(北京科技大学) Nanyang Technological University, Singapore(南洋理工大学)

AI总结 提出压缩失真表示嵌入(CDRE)框架,通过提取机器感知相关的失真表示并嵌入下游模型,提升压缩视频的任务性能。

详情
AI中文摘要

目前,视频传输不仅服务于人类视觉系统(HVS)以供观看,还服务于机器感知以供分析。然而,现有的编解码器主要针对像素域和HVS感知指标进行优化,而非机器视觉任务的需求。为解决此问题,我们提出了一种压缩失真表示嵌入(CDRE)框架,该框架提取与机器感知相关的失真表示,并将其嵌入下游模型,从而解决压缩过程中丢失的信息并提升任务性能。具体而言,为了更好地分析与机器感知相关的失真,我们设计了一个压缩敏感提取器,用于在特征域中识别压缩退化。为了实现高效传输,引入了一个轻量级失真编解码器,将失真信息压缩为紧凑表示。随后,该表示被逐步嵌入下游模型,使其更好地了解压缩退化并提升性能。在各种编解码器和下游任务上的实验表明,我们的框架能够以最小的比特率、执行时间和参数数量开销,有效提升现有编解码器的率-任务性能。我们的代码和补充材料发布在 https://github.com/Ws-Syx/CDRE/。

英文摘要

Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. However, existing codecs are primarily optimized for pixel-domain and HVS-perception metrics rather than the needs of machine vision tasks. To address this issue, we propose a Compression Distortion Representation Embedding (CDRE) framework, which extracts a machine-perception-related distortion representation and embeds it into downstream models, addressing the information lost during compression and improving task performance. Specifically, to better analyze the machine-perception-related distortion, we design a compression-sensitive extractor that identifies compression degradation in the feature domain. For efficient transmission, a lightweight distortion codec is introduced to compress the distortion information into a compact representation. Subsequently, the representation is progressively embedded into the downstream model, enabling it to be better informed about compression degradation and enhancing performance. Experiments across various codecs and downstream tasks demonstrate that our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in terms of bitrate, execution time, and number of parameters. Our code and supplementary materials are released at https://github.com/Ws-Syx/CDRE/.

2503.10629 2026-06-04 cs.CV 版本更新

Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology

层次化自监督对抗训练用于组织病理学中的鲁棒视觉模型

Hashmat Shadab Malik, Shahina Kunhimon, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Khalifa University(哈利法大学) Linköping University(林雪平大学) Australian National University(澳大利亚国立大学)

AI总结 提出层次化自监督对抗训练(HSAT),利用组织病理图像的患者-切片-补丁层次结构进行多级对比学习,生成对抗样本并整合到对抗训练中,在OpenSRH数据集上白盒设置平均提升54.31%,黑盒设置性能下降降至3-4%。

Comments Accepted at 28th International Conference On Medical Image Computing And Computer Assisted Intervention (MICCAI 2025)

详情
AI中文摘要

对抗攻击对医疗等关键领域的视觉模型构成重大挑战,这些领域可靠性至关重要。尽管对抗训练在自然图像中已得到充分研究,但其在生物医学和显微镜数据中的应用仍然有限。现有的自监督对抗训练方法忽视了组织病理图像的层次结构,其中患者-切片-补丁关系提供了有价值的判别信号。为了解决这一问题,我们提出了层次化自监督对抗训练(HSAT),它利用这些属性通过多级对比学习生成对抗样本,并将其整合到对抗训练中以增强鲁棒性。我们在多类组织病理数据集OpenSRH上评估了HSAT,结果表明HSAT在生物医学和自然图像领域均优于现有方法。HSAT增强了鲁棒性,在白盒设置中平均提升54.31%,在黑盒设置中将性能下降降至3-4%,而基线为25-30%。这些结果为该领域的对抗训练树立了新的基准,为更鲁棒的模型铺平了道路。我们的训练和评估代码可在https://github.com/HashmatShadab/HSAT获取。

英文摘要

Adversarial attacks pose significant challenges for vision models in critical fields like healthcare, where reliability is essential. Although adversarial training has been well studied in natural images, its application to biomedical and microscopy data remains limited. Existing self-supervised adversarial training methods overlook the hierarchical structure of histopathology images, where patient-slide-patch relationships provide valuable discriminative signals. To address this, we propose Hierarchical Self-Supervised Adversarial Training (HSAT), which exploits these properties to craft adversarial examples using multi-level contrastive learning and integrates them into adversarial training for enhanced robustness. We evaluate HSAT on the multiclass histopathology dataset OpenSRH, and the results show that HSAT outperforms existing methods from both biomedical and natural image domains. HSAT enhances robustness, achieving an average gain of 54.31% in the white-box setting and reducing performance drops to 3-4% in the black-box setting, compared to 25-30% for the baseline. These results set a new benchmark for adversarial training in this domain, paving the way for more robust models. Our code for training and evaluation is available at https://github.com/HashmatShadab/HSAT.

2502.01576 2026-06-04 cs.CV 版本更新

Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

Robust-LLaVA:大规模鲁棒图像编码器对多模态大语言模型的有效性

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Khan, Salman Khan

发表机构 * Mohamed bin Zayed University of AI(穆罕默德·本·扎耶德人工智能大学) Khalifa University(哈利法大学) Michigan State University(密歇根州立大学) Australian National University(澳大利亚国立大学)

AI总结 本文提出利用大规模对抗预训练的图像分类模型替代CLIP编码器,以增强多模态大语言模型对视觉对抗扰动的鲁棒性,在无需额外对抗训练的情况下,在视觉问答、图像描述和越狱攻击任务中取得显著鲁棒性提升。

Comments Accepted at Trustworthy FMs Workshop Trust Before Use: Building Foundation Models that You Can Trust (ICCVW) 2025

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言任务中表现出色,但仍然容易受到视觉对抗扰动的影响,这些扰动可能引发幻觉、操纵响应或绕过安全机制。现有方法通过在ImageNet规模的数据上对CLIP视觉编码器进行受限的对抗微调来缓解这些风险,从而保持其泛化能力。然而,这种有限的对抗训练限制了鲁棒性和更广泛的泛化。在这项工作中,我们探索了一种替代方法,即利用在大规模数据上经过对抗预训练的现有视觉分类模型。我们的分析揭示了两个主要贡献:(1)对抗预训练的广泛规模和多样性使得这些模型能够对各种对抗威胁表现出优越的鲁棒性,范围从不可察觉的扰动到高级越狱尝试,而无需额外的对抗训练;(2)将这些鲁棒模型与MLLM进行端到端集成,有助于语言组件更好地适应鲁棒视觉特征,在复杂推理任务上优于现有的即插即用方法。通过在视觉问答、图像描述和越狱攻击上的系统评估,我们证明使用这些鲁棒模型训练的MLLM在保持良好干净性能的同时,实现了优越的对抗鲁棒性。我们的框架在描述和VQA任务中分别实现了2倍和1.5倍的平均鲁棒性增益,并在越狱攻击中提供了超过10%的改进。代码和预训练模型将在https://github.com/HashmatShadab/Robust-LLaVA 提供。

英文摘要

Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data, ensuring their generalization ability is preserved. However, this limited adversarial training restricts robustness and broader generalization. In this work, we explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enable these models to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM integration with these robust models facilitates enhanced adaptation of language components to robust visual features, outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jailbreak attacks, we demonstrate that MLLMs trained with these robust models achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against jailbreak attacks. Code and pretrained models will be available at https://github.com/HashmatShadab/Robust-LLaVA.

2412.20803 2026-06-04 cs.CV 版本更新

Scalable Event Cloud Network for Event-based Classification

可扩展事件云网络用于基于事件的分类

Hongwei Ren, Fei Ma, Xiaopeng Lin, Yuetong Fang, Hongxiang Huang, Yue Zhou, Yulong Huang, Haotian Fu, Ziyi Yang, Youxin Jiang, Xiangqian Wu, Bojun Cheng

发表机构 * Research Centre for Multimodal Artificial Intelligence(多模态人工智能研究中心) Applications, Faculty of Computing, Harbin Institute of Technology(应用学院,哈尔滨工业大学) MICS Thrust, Hong Kong University of Science(科学与技术大学(香港)MICS研究方向) Guangdong Laboratory of Artificial Intelligence(广东人工智能与数字经济实验室)

AI总结 提出SECNet,通过结构级极性集成和频域特征提取,解决事件云表示在空间和时间分辨率上的可扩展性问题,在十个数据集上验证了有效性。

Comments ICML2026 Oral

详情
AI中文摘要

事件相机是受生物启发的传感器,引起了工业界和学术界的广泛关注。主流方法倾向于帧和体素表示,这些方法在达到满意性能的同时,引入了耗时的转换、庞大的模型,并牺牲了细粒度的时间信息。相比之下,点云表示在解决上述弱点方面显示出潜力,但在抽象更高空间分辨率和更长时序事件的特征方面可扩展性有限。在本文中,我们提出了一种名为SECNet的可扩展网络,以利用事件云表示。SECNet通过创新的基于事件的分组和采样模块,在结构层面而非仅在输入层面集成极性。为了适应事件数量的激增,SECNet通过傅里叶变换在频域中进行特征提取。这种方法不仅显著减少了乘累加操作的爆炸,而且有效地抽象了时空特征。我们在十个基于事件的数据集上进行了大量实验,验证了SECNet的可扩展性、有效性和效率。我们的代码将在以下网址提供:https://github.com/rhwxmx/SECNet_ICML。

英文摘要

Event cameras are biologically inspired sensors garnering significant attention from both industry and academia. Mainstream methods favor frame and voxel representations, which reach satisfactory performance but introduce time-consuming transformations and bulky models and sacrifice fine-grained temporal information. Alternatively, the point cloud representation shows promise in addressing these weaknesses, but it has limited scalability in abstracting features from events with higher spatial resolution and longer temporal sequences. In this paper, we propose a scalable network named SECNet to leverage the event cloud representation. SECNet integrates polarity at the structural level by innovating the Event-based Group and Sampling module rather than only at the input level. To accommodate the surge in the number of events, SECNet performs feature extraction in the frequency domain via the Fourier transform. This approach not only substantially curbs the explosion of multiply-accumulate operations but also effectively abstracts spatio-temporal features. We conduct extensive experiments on ten event-based datasets and substantiate the scalability, effectiveness, and efficiency of SECNet. Our code will be available at: https://github.com/rhwxmx/SECNet_ICML.

2411.19758 2026-06-04 cs.CV cs.AI cs.LG 版本更新

LaVIDE: Language-Prompted Satellite Change Detection via Map-Image Alignment

LaVIDE: 通过地图-图像对齐的语言提示卫星变化检测

Shuguo Jiang, Fang Xu, Chuandong Liu, Hong Tan, Shengyang Li, Lei Yu, Wen Yang, Sen Jia, Gui-Song Xia

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) School of Artificial Intelligence, Wuhan University(武汉大学人工智能学院) Technology and Engineering Center for Space Utilization and the Key Laboratory of Space Utilization, Chinese Academy of Sciences(中国科学院空间利用技术与重点实验室) School of Aeronautics and Astronautics, University of Chinese Academy of Sciences(中国科学院大学航空宇航学院) School of Electronic Information, Wuhan University(武汉大学电子信息学院) College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院)

AI总结 提出LaVIDE框架,利用受限提示学习和对象感知嵌入增强,通过语言弥合高层地图类别与低层图像细节之间的语义鸿沟,实现跨模态对齐,在多类与单类变化检测任务上分别提升IoU 18.4%和5.2%。

详情
AI中文摘要

基于地图参考和最新图像的遥感变化检测,在缺乏早期图像进行比较时,有助于及时观测地球表面。然而,高层地图类别与低层图像细节之间的语义鸿沟阻碍了提取同质特征以进行稳健的时间关联。与比较像素级视觉相似性或传播分割误差的传统方法不同,我们提出了一种新颖框架——LaVIDE(用于检测变化的语言-视觉判别器),该框架以语言为中介,弥合了高层地图类别与低层图像细节之间的语义鸿沟。具体来说,我们引入了受限提示学习来生成上下文感知的文本提示,使地图语义与图像内容对齐,并采用对象感知嵌入增强策略将对象级属性(如形状、边界)整合到地图表示中。这些组件能够在统一的语言-视觉特征空间中实现稳健的跨模态对齐。在四个基准数据集(DynamicEarthNet、HRSCD、BANDON和SECOND)上的大量实验表明,LaVIDE以显著优势超越了最先进的方法,在多类和单类变化检测任务上分别实现了18.4%和5.2%的IoU提升。我们的框架不仅提高了地图-图像变化检测的准确性,还为以最少人工干预快速更新地图提供了实用解决方案,有望在城市规划、灾害评估和生态保护等领域产生广泛影响。代码和数据集可在 https://github.com/ShuGuoJ/LAVIDE.git 获取。

英文摘要

Remote sensing change detection based on a map reference and an up-to-date image boosts timely observation of the Earth's surface when earlier images are lacking for comparison. However, the semantic gap between high-level map categories and low-level image details hinders the extraction of homogeneous features for robust temporal association in change detection. Unlike conventional approaches that either compare pixel-level visual similarity or propagate segmentation errors, we propose a novel framework, LaVIDE (Language-VIsion Discriminator for dEtecting changes), which bridges the semantic gap between high-level map categories and low-level image details using language as an intermediary. Specifically, we introduce restricted prompt learning to generate context-aware textual prompts that align map semantics with image content, and an object-aware embedding enhancement strategy to integrate object-level attributes (e.g., shape, boundary) into map representations. These components enable robust cross-modal alignment within a unified language-vision feature space. Extensive experiments on four benchmarks, DynamicEarthNet, HRSCD, BANDON, and SECOND, demonstrate that LaVIDE outperforms state-of-the-art methods by significant margins, achieving 18.4% and 5.2% improvements in IoU on multi-class and single-class change detection tasks, respectively. Our framework not only advances the accuracy of map-image change detection but also provides a practical solution for rapid map updating with minimal human intervention, promising broad impacts in urban planning, disaster assessment, and ecological conservation. Code and datasets are available at: https://github.com/ShuGuoJ/LAVIDE.git.

2406.09407 2026-06-04 cs.CV 版本更新

Towards Evaluating the Robustness of Visual State Space Models

评估视觉状态空间模型的鲁棒性

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Shahbaz Khan, Salman Khan

发表机构 * Mohamed Bin Zayed University of AI(穆罕默德·本·扎耶德人工智能大学) Center of Secure Cyber-Physical Security Systems(安全的网络物理安全系统中心) Linköping University(林雪平大学) Australian National University(澳大利亚国立大学)

AI总结 本文全面评估了视觉状态空间模型(VSSMs)在遮挡、图像结构、常见损坏和对抗攻击等多种扰动下的鲁棒性,并与Transformer和CNN等架构进行比较,揭示了其优势和局限性。

Comments Accepted at The 5th Workshop of Adversarial Machine Learning on Computer Vision (CVPRW 2025)

详情
AI中文摘要

视觉状态空间模型(VSSMs)是一种结合了循环神经网络和潜变量模型优势的新型架构,通过有效捕捉长程依赖和建模复杂视觉动态,在视觉感知任务中表现出色。然而,它们在自然和对抗扰动下的鲁棒性仍然是一个关键问题。在这项工作中,我们全面评估了VSSMs在各种扰动场景下的鲁棒性,包括遮挡、图像结构、常见损坏和对抗攻击,并将其性能与Transformer和卷积神经网络等成熟架构进行比较。此外,我们研究了VSSMs在复杂视觉场景中针对物体-背景组合变化的鲁棒性,使用了专门设计用于测试模型性能的复杂基准。我们还使用模拟真实场景的损坏数据集评估了它们在目标检测和分割任务上的鲁棒性。为了更深入地理解VSSMs的对抗鲁棒性,我们进行了基于频率的对抗攻击分析,评估了它们对低频和高频扰动的性能。我们的发现突出了VSSMs在处理复杂视觉损坏方面的优势和局限性,为未来研究提供了宝贵的见解。我们的代码和模型将在 https://github.com/HashmatShadab/MambaRobustness 提供。

英文摘要

Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capturing long-range dependencies and modeling complex visual dynamics. However, their robustness under natural and adversarial perturbations remains a critical concern. In this work, we present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios, including occlusions, image structure, common corruptions, and adversarial attacks, and compare their performance to well-established architectures such as transformers and Convolutional Neural Networks. Furthermore, we investigate the resilience of VSSMs to object-background compositional changes on sophisticated benchmarks designed to test model performance in complex visual scenes. We also assess their robustness on object detection and segmentation tasks using corrupted datasets that mimic real-world scenarios. To gain a deeper understanding of VSSMs' adversarial robustness, we conduct a frequency-based analysis of adversarial attacks, evaluating their performance against low-frequency and high-frequency perturbations. Our findings highlight the strengths and limitations of VSSMs in handling complex visual corruptions, offering valuable insights for future research. Our code and models will be available at https://github.com/HashmatShadab/MambaRobustness.

2407.13922 2026-06-04 cs.CV cs.AI cs.LG 版本更新

CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems

CounterFace: 用于人脸识别系统细粒度反事实评估的合成人脸数据集

Guruprasad Viswanathan Ramesh, Ashish Hooda, Shimaa Ahmed, Harrison J Rosenberg, Ramya Korlakai Vinayak, Kassem Fawaz

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Visa Research(Visa研究)

AI总结 提出CounterFace数据集,通过全自动流水线生成包含20种面部属性和8种人口统计因素的11,821个反事实人脸对,用于细粒度评估人脸识别系统在特定属性-人口统计组合下的性能退化。

Comments Code available at https://github.com/Guruprasad68/counterface_facct2026. Dataset available for non-commercial research upon request

详情
AI中文摘要

人脸识别系统广泛应用于关键应用,因此其在不同人群和条件下的可靠性和鲁棒性至关重要。人脸识别系统的标准评估通常依赖LFW等数据集来估计平均识别准确率。一些基准测试也捕捉了粗粒度的身份内变化,如老化、姿态和光照。然而,人脸存在更细粒度的变化,包括发型和化妆等外观变化,这些在现有基准测试中代表性不足。反事实评估提供了一种在细粒度变化下评估人脸识别鲁棒性的方法。然而,现有使用图像生成器合成的反事实人脸数据集由于在流程中使用人工验证,属性覆盖范围有限。我们提出CounterFace,一个新的反事实评估数据集,包含20种面部属性和8种人口统计因素,超过先前合成人脸数据集14种属性和2种人口统计因素。该数据集使用基于现成图像生成器和自定义验证器的全自动流水线生成,无需人工验证。CounterFace包含11,821个反事实人脸对,事后用户研究证实了生成反事实的忠实性。我们评估了两个商业和四个开源人脸识别系统(AWS Rekognition、Face++、AdaFace、MagFace、ArcFace、FaceNet)在160种属性-人口统计组合上的性能。与标准评估基准不同,我们的数据集有助于隔离单个系统的精确故障模式。结果表明,所有六个系统的性能退化因属性和人口统计而异,遮挡属性(如口罩和胡须)普遍降低性能。

英文摘要

Face recognition (FR) systems are widely deployed in critical applications, making their reliability and robustness across diverse populations and conditions essential. Standard evaluation of FR systems typically relies on datasets such as LFW to estimate average recognition accuracy. Some benchmarks also capture coarse-grained intra-identity variations such as aging, pose, and lighting. However, human faces undergo more fine-grained changes, including appearance changes such as hairstyles and makeup, that are underrepresented in existing benchmarks. Counterfactual evaluation provides a method to assess FR robustness under such fine-grained variations. Existing counterfactual face datasets synthesized with image generators, however, are limited in attribute coverage because their pipelines rely on human verification. We propose CounterFace, a new counterfactual evaluation dataset comprising 20 facial attributes and 8 demographic factors, exceeding prior synthetic face datasets by 14 attributes and 2 demographics. The dataset is generated using a fully automated pipeline based on off-the-shelf image generators with custom verifiers, removing the need for human verification. CounterFace contains 11,821 counterfactual face pairs, and a post-hoc user study confirms the faithfulness of the generated counterfactuals. We evaluate two commercial and four open-source FR systems (AWS Rekognition, Face++, AdaFace, MagFace, ArcFace, FaceNet) across 160 attribute-demographic combinations. Unlike standard evaluation benchmarks, our dataset helps isolate precise failure modes of individual systems. Results indicate that performance degradation varies across attributes and demographics for all six systems, and that occluding attributes (e.g., facemask and facial hair) universally degrade performance.

2404.11309 2026-06-04 cs.CV 版本更新

Achieving Rotation-Invariant Convolution via Non-Learnable Orientation Alignment Operators

通过不可学习的朝向对齐算子实现旋转不变卷积

Hanlin Mo, Peihong Lei, You Hao, Guoying Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出基于不可学习算子的旋转不变卷积(RIConvs),其参数量和计算过程与标准卷积相同,在多个视觉任务中提升准确率,尤其在数据有限时效果显著。

详情
AI中文摘要

在深度神经网络中实现旋转不变性而无需数据增强是一个研究热点。内在不变性使特征能够捕捉目标的固有属性,从而提升深度学习在视觉任务中的性能。基于多种类型的不可学习算子,本文提出了一套对任意旋转自然不变的卷积操作。与大多数先前方法不同,这些旋转不变卷积(RIConvs)具有与标准卷积相同的可学习参数数量和相似的计算过程,因此可以互换。使用MNIST-Rot数据集,我们验证了它们在不同旋转角度下的不变性,并与先前的旋转不变CNN进行了比较,其中两种基于梯度的RIConvs取得了最先进的结果。然后,我们将RIConvs与经典CNN骨干网络集成,并在纹理识别、飞机类型识别和遥感图像分类任务上进行了评估。结果表明,RIConvs显著提高了准确率,特别是在训练数据有限的情况下,并且即使在使用数据增强时也能提升性能。

英文摘要

Achieving rotational invariance in deep neural networks without data augmentation is an active research topic. Intrinsic invariance enables features to capture targets' inherent properties, enhancing deep learning performance in visual tasks. Based on various types of non-learnable operators, this paper proposes a comprehensive set of convolution operations that are naturally invariant to arbitrary rotations. Unlike most prior methods, these rotation-invariant convolutions (RIConvs) have the same number of learnable parameters and a similar computational process as standard convolutions, making them interchangeable. Using the MNIST-Rot dataset, we validate their invariance across rotation angles and compare them with previous rotation-invariant CNNs, where two gradient-based RIConvs achieve state-of-the-art results. Then, we integrate RIConvs with classic CNN backbones and evaluate them on texture recognition, aircraft type recognition, and remote sensing image classification tasks. Results show that RIConvs significantly improve accuracy, particularly with limited training data, and enhance performance even with data augmentation.

1803.03724 2026-06-04 math.DG cs.CG cs.CV cs.NA math.AP math.NA 版本更新

Contour Parametrization via Anisotropic Mean Curvature Flows

通过各向异性平均曲率流进行轮廓参数化

P. Suárez-Serrato, E. I. Velázquez Richards

发表机构 * Department of Mathematics, University of California, Santa Barbara, on leave from Instituto de Matemáticas, Universidad Nacional Autónoma de México, Mexico City

AI总结 本文提出了一种新的各向异性平均曲率流实现,用于轮廓识别。通过将平面闭合光滑曲线的平均曲率流与外部场相结合,该方法利用点电荷势场约束曲线运动,从而实现轮廓的参数化。

Comments 30 pages, 20 images, source code for our numerical implementation is available in this URL https://github.com/V3du4rd0/AMCF

详情
Journal ref
Applied Mathematics and Computation, Volume 441, 2023, 127699
AI中文摘要

我们提出了一种新的各向异性平均曲率流实现,用于轮廓识别。我们的方法将平面闭合光滑曲线的平均曲率流与来自点电荷势场的外部场相结合。这种耦合在曲线与背景图像匹配时约束其运动。我们还包含了用于数值近似的稳定性准则。

英文摘要

We present a new implementation of anisotropic mean curvature flow for contour recognition. Our procedure couples the mean curvature flow of planar closed smooth curves with an external field from a potential of point-wise charges. This coupling constrains the motion when the curve matches a picture placed as background. We include a stability criterion for our numerical approximation.
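The flow of a closed planar curve described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the linked repository): it applies only a uniform discrete Laplacian to a closed polygon, which approximates curvature-driven motion when vertices are roughly equally spaced, and it omits the external point-charge field entirely. The function names and the step parameter `tau` are hypothetical; keeping `tau` at or below 0.5 makes each update a convex combination of neighbouring vertices, which is an assumption chosen here for stability and is not the paper's stability criterion.

```python
import math


def curvature_flow_step(points, tau=0.25):
    """One explicit step of a discrete curve-shortening flow on a closed polygon.

    Each vertex moves along the discrete Laplacian of its two neighbours.
    `tau` is a hypothetical step parameter; values in (0, 0.5] keep the
    update a convex combination of the vertex and its neighbours.
    """
    n = len(points)
    moved = []
    for i in range(n):
        x_prev, y_prev = points[i - 1]          # index -1 wraps to the last vertex
        x, y = points[i]
        x_next, y_next = points[(i + 1) % n]
        lap_x = x_prev - 2.0 * x + x_next       # discrete second difference in x
        lap_y = y_prev - 2.0 * y + y_next       # discrete second difference in y
        moved.append((x + tau * lap_x, y + tau * lap_y))
    return moved


def perimeter(points):
    """Total edge length of the closed polygon."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


if __name__ == "__main__":
    circle = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
              for k in range(40)]
    curve = circle
    for _ in range(10):
        curve = curvature_flow_step(curve)
    print(perimeter(circle), perimeter(curve))  # the second value is smaller
```

A regular polygon stays regular under this update and simply shrinks toward its centre, which mirrors the behaviour of a circle under the continuous flow.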

2402.02555 2026-06-04 cs.CV cs.CL 版本更新

High-Quality Entity Segmentation and Grounding

高质量实体分割与定位

Lu Qi, Yi-Wen Chen, Tao Zhang, Xiangtai Li, Xu Yang, Bo Du, Ming-Hsuan Yang

发表机构 * Wuhan University(武汉大学) Insta360 Research(Insta360研究院) Department of EECS, University of California, Merced(加州大学默塞德分校电子工程与计算机科学系) Nanyang Technological University(南洋理工大学) Institute of Automation of the Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 提出ESG流水线,通过新数据集EntitySeg和两阶段解耦设计(CropFormer高质量分割+GELLA精确名词提取与语义匹配),实现高质量实体分割与定位,在五项任务上有效。

详情
AI中文摘要

在这项工作中,我们提出了ESG,一个由新数据集EntitySeg支持的高质量实体分割与定位流水线。首先,所提出的数据集命名为EntitySeg,包含跨越各种图像域和实体的图像,以及用于训练和测试的大量高分辨率图像和高质量掩码标注。然后,ESG主要由两个模块组成:用于高质量实体分割的CropFormer,以及用于从句子中精确提取名词并在语言和视觉区域之间进行语义匹配的GELLA。与现有联合训练分割和大语言模型的定位方法不同,ESG采用两阶段解耦设计,保留了高质量掩码和定位鲁棒性,避免了联合训练通常带来的权衡。CropFormer确保高质量实体分割结果,然后可以编码到GELLA模型中进行有效定位。大量实验结果表明,我们提出的流水线在五项任务上有效,包括实体分割、全景分割、开放词汇分割、指代分割和全景定位叙述。此外,ESG流水线的GELLA模块高度灵活,能够处理来自任何分割框架的掩码输入,这得益于其轻量级的颜色图/视觉编码器、语言/掩码解码器和关联模块。实体分割数据集和定位代码将在https://github.com/qqlu/Entity发布。

英文摘要

In this work, we propose ESG, a pipeline for high-quality entity segmentation and grounding supported by a new dataset, EntitySeg. First, the proposed EntitySeg dataset contains images spanning various image domains and entities, along with plentiful high-resolution images and high-quality mask annotations for training and testing. Second, ESG mainly consists of two modules: CropFormer for high-quality entity segmentation, and GELLA for accurate noun extraction from sentences and semantic matching between language and visual regions. Unlike existing grounding methods that jointly train a segmentation model and a large language model, ESG adopts a two-stage decoupled design, preserving high-quality masks and grounding robustness without the trade-offs often introduced by joint training. CropFormer ensures high-quality entity segmentation results, which can then be encoded into the GELLA model for effective grounding. Extensive experimental results demonstrate the effectiveness of our proposed pipeline across five tasks, including entity segmentation, panoptic segmentation, open-vocabulary segmentation, referring segmentation, and panoptic localized narratives. Furthermore, the GELLA module of the ESG pipeline is highly flexible and capable of processing mask inputs from any segmentation framework, thanks to its lightweight colormap/vision encoder, language/mask decoder, and association module. The entity segmentation dataset and grounding code will be released at https://github.com/qqlu/Entity.

1410.6333 2026-06-04 cs.CV cs.NA math.NA 版本更新

A Regularization Approach to Blind Deblurring and Denoising of QR Barcodes

一种正则化方法用于QR条形码的盲去模糊和去噪

Yves van Gennip, Prashant Athavale, Jérôme Gilles, Rustum Choksi

发表机构 * Fields Institute, University of Toronto(多伦多大学菲尔兹研究所)

AI总结 本文提出并评估了纯基于正则化的方法,用于在存在噪声的情况下对QR条形码进行盲去模糊,利用了图像部分内容(必需图案)先验已知以及ZBar等开源条形码阅读器易于获得这两个事实。

Comments 14 pages, 19 figures (with a total of 57 subfigures), 1 table; v3: previously missing reference [35] added

详情
英文摘要

QR bar codes are prototypical images for which part of the image is a priori known (required patterns). Open source bar code readers, such as ZBar, are readily available. We exploit both these facts to provide and assess purely regularization-based methods for blind deblurring of QR bar codes in the presence of noise.

1803.00638 2026-06-04 math.NA cs.CV cs.NA 版本更新

Fast and accurate computation of orthogonal moments for texture analysis

快速且准确的正交矩计算用于纹理分析

C. Di Ruberto, L. Putzu, G. Rodriguez

发表机构 * Department of Mathematics and Computer Science, University of Cagliari(卡利亚里大学数学与计算机科学系) Department of Electrical and Electronic Engineering, University of Cagliari(卡利亚里大学电气与电子工程系)

AI总结 本文提出了一种快速稳定的算法,用于图像正交矩的计算,通过递推关系方法优化了Matlab实现,以提高计算效率和重建精度,并在纹理分析中展示了优于传统描述符的性能。

Comments 29 pages, 9 figures

详情
Journal ref
Pattern Recognit. 83 (2018) 498-510
AI中文摘要

在本文中,我们描述了一种快速且稳定的算法,用于计算图像的正交矩。正交矩具有高区分能力,但某些公式形式计算复杂度较高,限制了实时应用。本文详细描述了基于递推关系的方法,并提出了一种优化的Matlab实现,旨在解决上述限制,并向社区提供高效易用的软件。在实验中,我们评估了递推公式的有效性及其在重建任务中的性能,与文献中常用的闭式表示进行比较。结果表明,计算复杂度显著降低,同时重建精度更高。为了评估和比较计算矩在纹理分析中的准确性,我们在六个著名的纹理图像数据库上进行了分类实验。再次,递推公式在分类任务中优于闭式表示。更重要的是,如果使用所提出的稳定过程从图像的GLCM计算正交矩,则在某些情况下,正交矩优于一些最常用的最先进纹理分类描述符。

英文摘要

In this work we describe a fast and stable algorithm for the computation of the orthogonal moments of an image. Orthogonal moments have high discriminative power, but some of their possible formulations have a large computational complexity, which limits their real-time application. This paper describes in detail an approach based on recurrence relations and proposes an optimized Matlab implementation of the corresponding computational procedure, aiming to overcome these limitations and to provide the community with efficient and easy-to-use software. In our experiments we evaluate the effectiveness of the recurrence formulation, as well as its performance on the reconstruction task, in comparison to the closed-form representation often used in the literature. The results show a significant reduction in computational complexity, together with greater accuracy in reconstruction. To assess and compare the accuracy of the computed moments in texture analysis, we perform classification experiments on six well-known databases of texture images. Again, the recurrence formulation performs better in classification than the closed-form representation. More importantly, when computed from the GLCM of the image using the proposed stable procedure, the orthogonal moments in some situations outperform some of the most widely used state-of-the-art descriptors for texture classification.
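To make the idea of a recurrence-based evaluation concrete, here is a small sketch for one orthogonal family. The choice of Legendre polynomials is an assumption made for illustration, since the abstract does not say which polynomial families the paper covers, and the function name is hypothetical. The sketch uses Bonnet's three-term recurrence, (n+1) P_{n+1}(x) = (2n+1) x P_n(x) - n P_{n-1}(x), which avoids the large alternating sums of the closed-form expansion.

```python
def legendre_values(max_order, x):
    """Return [P_0(x), ..., P_max_order(x)] using Bonnet's recurrence.

    Each new value costs a constant number of operations, so evaluating
    all orders up to max_order is linear in max_order.
    """
    values = [1.0]                       # P_0(x) = 1
    if max_order >= 1:
        values.append(float(x))          # P_1(x) = x
    for n in range(1, max_order):
        nxt = ((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1)
        values.append(nxt)
    return values


if __name__ == "__main__":
    print(legendre_values(2, 0.5))       # [1.0, 0.5, -0.125]
```

In a moment computation, such values would be tabulated once per pixel coordinate and reused across all moment orders; that usage pattern is a general observation about recurrences and is not taken from the paper's Matlab code.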

1601.03094 2026-06-04 cs.CV cs.SY eess.SY math.OC 版本更新

A metric for sets of trajectories that is practical and mathematically consistent

一种用于轨迹集的度量标准,具有实用性和数学一致性

José Bento, Jia Jie Zhu

AI总结 本文提出了一种新的轨迹集度量标准,解决了现有数学一致度量难以计算以及实用近似度量不一致的问题,该度量标准能够快速计算、最优处理轨迹身份混淆,并且在数学上是有效的。

Comments Submitted to IEEE Transactions on Signal Processing

详情
AI中文摘要

在计算机视觉、机器学习、机器人学和通用人工智能领域,对轨迹集空间的度量至关重要。然而,现有的轨迹集接近性概念要么在数学上不一致,要么在实际应用中有限。在本文中,我们指出现有数学一致度量的局限性,这些度量基于OSPA(Schuhmacher等人,2008);以及实践中使用的启发式接近性概念,其主要思想与广泛用于计算机视觉的CLEAR MOT度量(Keni和Rainer,2008)相似。通过两步方法,我们提出了一个新的直观度量标准,以解决这些局限性。首先,我们解释了一种导致难以计算的度量解决方案。然后,我们修改此公式,以获得一个易于计算但保留先前度量有用属性的度量。我们的接近性概念是第一个展示以下三个特征的度量:1)可以快速计算,2)以最优方式整合轨迹身份的混淆,3)在数学意义上是一个度量。

英文摘要

Metrics on the space of sets of trajectories are important for scientists in the fields of computer vision, machine learning, robotics, and general artificial intelligence. However, existing notions of closeness between sets of trajectories are either mathematically inconsistent or of limited practical use. In this paper, we outline the limitations of the current mathematically consistent metrics, which are based on OSPA (Schuhmacher et al. 2008), and the inconsistencies in the heuristic notions of closeness used in practice, whose main ideas are common to the CLEAR MOT measures (Keni and Rainer 2008) widely used in computer vision. In two steps, we then propose a new intuitive metric between sets of trajectories and address these limitations. First, we explain a solution that leads to a metric that is hard to compute. Then we modify this formulation to obtain a metric that is easy to compute while keeping the useful properties of the previous metric. Our notion of closeness is the first demonstrating the following three features: the metric 1) can be quickly computed, 2) incorporates confusion of trajectories' identity in an optimal way, and 3) is a metric in the mathematical sense.
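For readers unfamiliar with OSPA, which the abstract names as the basis of the existing consistent metrics, the sketch below computes the standard OSPA distance between two finite sets of points at a single time instant. It is not the trajectory-set metric proposed in the paper. The function name and defaults are hypothetical, and the optimal assignment is found by brute force over permutations, which is only practical for very small sets; a real implementation would use an assignment solver.

```python
import itertools
import math


def ospa(xs, ys, cutoff=1.0, order=1):
    """OSPA distance between two finite point sets (lists of coordinate tuples).

    Per-point distances are capped at `cutoff`, and every unmatched point
    in the larger set is charged the full cutoff.
    """
    if len(xs) > len(ys):
        xs, ys = ys, xs                  # make xs the smaller set
    m, n = len(xs), len(ys)
    if n == 0:
        return 0.0                       # both sets are empty
    # Brute-force search over all ways to match xs to distinct points of ys.
    best = min(
        sum(min(cutoff, math.dist(x, ys[j])) ** order for x, j in zip(xs, perm))
        for perm in itertools.permutations(range(n), m)
    )
    cardinality_penalty = (cutoff ** order) * (n - m)
    return ((best + cardinality_penalty) / n) ** (1.0 / order)


if __name__ == "__main__":
    truth = [(0.0, 0.0), (5.0, 5.0)]
    estimate = [(0.0, 0.0)]
    print(ospa(estimate, truth))         # 0.5: one exact match, one missed point
```

The cutoff controls how much a missed or false point costs relative to localisation error, which is the trade-off that trajectory-level metrics also have to settle.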

1812.03434 2026-06-04 cs.CG cs.CV cs.NA math.NA physics.med-ph q-bio.QM 版本更新

Area-preserving mapping of 3D ultrasound carotid artery images using density-equalizing reference map

利用密度相等参考图进行3D超声颈动脉图像的面积保持映射

Gary P. T. Choi, Bernard Chiu, Chris H. Rycroft

发表机构 * Mathematics Group, Lawrence Berkeley National Laboratory(伯克利国家实验室数学组)

AI总结 本文提出了一种新的密度相等参考图(DERM)方法,用于将3D颈动脉表面映射到标准化的2D颈动脉模板,重点是通过最小化局部面积变形来保持局部几何结构,从而提高对血管壁加斑块厚度(VWT)的定量监测和比较能力。

详情
Journal ref
IEEE Transactions on Biomedical Engineering, 67(9), 1507-1517 (2020)
AI中文摘要

颈动脉动脉粥样斑块是一种发生在颈动脉分叉处的局部疾病。为了定量监测血管壁加斑块厚度(VWT)的局部变化,并比较不同患者或同一患者在不同超声扫描会话中的VWT分布,需要一种映射技术来调整不同颈动脉模型的几何变化。在本工作中,我们提出了一种新的方法,称为密度相等参考图(DERM),用于将3D颈动脉表面映射到标准化的2D颈动脉模板,重点是通过最小化局部面积变形来保持颈动脉表面的局部几何结构。初始映射是由之前描述的弧长缩放(ALS)映射方法生成的,该方法将3D颈动脉表面投影到2D非凸L形域。随后通过变形ALS映射,利用所提出的结合密度相等映射和参考映射技术的算法,构建出平滑且面积保持的扁平化映射。这种结合使首次实现了将3D表面映射到标准化的非凸平面域的1:1映射,且以面积保持的方式。使用20个颈动脉表面模型的评估显示,与ALS映射方法相比,所提出的方法将扁平化映射的面积变形减少了超过80%。

英文摘要

Carotid atherosclerosis is a focal disease at the bifurcations of the carotid artery. To quantitatively monitor the local changes in the vessel-wall-plus-plaque thickness (VWT) and compare the VWT distributions for different patients or for the same patients at different ultrasound scanning sessions, a mapping technique is required to adjust for the geometric variability of different carotid artery models. In this work, we propose a novel method called density-equalizing reference map (DERM) for mapping 3D carotid surfaces to a standardized 2D carotid template, with an emphasis on preserving the local geometry of the carotid surface by minimizing the local area distortion. The initial map was generated by a previously described arc-length scaling (ALS) mapping method, which projects a 3D carotid surface onto a 2D non-convex L-shaped domain. A smooth and area-preserving flattened map was subsequently constructed by deforming the ALS map using the proposed algorithm that combines the density-equalizing map and the reference map techniques. This combination allows, for the first time, one-to-one mapping from a 3D surface to a standardized non-convex planar domain in an area-preserving manner. Evaluations using 20 carotid surface models show that the proposed method reduced the area distortion of the flattening maps by over 80% as compared to the ALS mapping method.

1605.01177 2026-06-04 cs.CV cs.SY eess.SY 版本更新

A metric on the space of finite sets of trajectories for evaluation of multi-target tracking algorithms

有限轨迹集合空间上的度量用于多目标跟踪算法评估

Ángel F. García-Fernández, Abu Sajana Rahmathullah, Lennart Svensson

发表机构 * Zenuity AB(Zenuity AB公司)

AI总结 本文提出了一种用于以数学严谨方式评估多目标跟踪算法的有限轨迹集合空间上的度量。该度量用于比较不同算法对轨迹的估计与真实轨迹,并包含与定位误差、漏检和误检以及轨迹切换相关的直观成本。度量计算基于解决多维分配问题,还提出了该度量的下界,该下界也是轨迹集合的度量,并可通过线性规划在多项式时间内计算。此外,还扩展了该度量到随机有限轨迹集合。

Comments Matlab code for the metric is available at https://github.com/Agarciafernandez/MTT

详情
Journal ref
IEEE Transactions on Signal Processing, vol. 68, pp. 3917-3928, 2020
AI中文摘要

在本文中,我们提出了一种度量,用于以数学严谨的方式评估多目标跟踪算法。该度量的主要用途是将不同算法对轨迹的估计与真实轨迹进行比较。所提出的度量包括与每个时间步长正确检测目标、漏检和误检以及轨迹切换相关的直观成本。度量计算基于解决多维分配问题。我们还提出了该度量的下界,该下界也是轨迹集合的度量,并可通过线性规划在多项式时间内计算。此外,我们还扩展了该度量到随机有限轨迹集合。

英文摘要

In this paper, we propose a metric on the space of finite sets of trajectories for assessing multi-target tracking algorithms in a mathematically sound way. The main use of the metric is to compare estimates of trajectories from different algorithms with the ground truth of trajectories. The proposed metric includes intuitive costs associated with localization error for properly detected targets, missed and false targets, and track switches at each time step. The metric computation is based on solving a multi-dimensional assignment problem. We also propose a lower bound for the metric, which is itself a metric for sets of trajectories and is computable in polynomial time using linear programming. Finally, we extend the proposed metrics on sets of trajectories to random finite sets of trajectories.

1902.09135 2026-06-04 math.NA cs.CV cs.NA eess.IV 版本更新

A Dual Symmetric Gauss-Seidel Alternating Direction Method of Multipliers for Hyperspectral Sparse Unmixing

一种用于高光谱稀疏解混的对偶对称Gauss-Seidel交替方向乘子法

Longfei Ren, Chengjing Wang, Peipei Tang, Zheng Ma

发表机构 * School of Information Science and Technology, and the Provincial Key Lab of Information Coding and Transmission, Southwest Jiaotong University(信息科学与技术学院,信息编码与传输省重点实验室,西南交通大学) School of Mathematics, Southwest Jiaotong University(数学学院,西南交通大学) School of Computer and Computing Science, Zhejiang University City College(计算机与计算科学学院,浙江大学城市学院)

AI总结 本文提出了一种高效的对偶对称Gauss-Seidel交替方向乘子法(sGS-ADMM)用于带有总变分正则化的高光谱稀疏解混,解决了传统ADMM在计算效率和收敛性方面的不足,并通过实验验证了该方法在解混效率和图像质量上的优越性。

Comments 30 pages, 6 figures

详情
AI中文摘要

由于稀疏解混已成为高光谱解混的有前景的方法,最近一些空间上下文信息已被用来提高解混性能。总变分(TV)已被广泛用于促进空间均匀性和相邻像素之间的平滑性。然而,带有TV正则项的高光谱稀疏解混的计算任务很重。此外,对于带有TV正则项的高光谱稀疏解混的原始交替方向乘子法(ADMM)的收敛性尚未详细解释。在本文中,我们设计了一种高效的、收敛的对偶对称Gauss-Seidel ADMM(sGS-ADMM)用于带有TV正则项的高光谱稀疏解混。我们还对这种算法进行了全局收敛性和局部线性收敛率的分析。如数值实验所示,我们的算法在解混效率上明显优于最先进的算法。更重要的是,我们能够获得质量更高的图像。

英文摘要

Since sparse unmixing emerged as a promising approach to hyperspectral unmixing, spatial-contextual information in hyperspectral images has recently been exploited to improve unmixing performance. The total variation (TV) has been widely used to promote spatial homogeneity as well as smoothness between adjacent pixels. However, the computational burden of hyperspectral sparse unmixing with a TV regularization term is heavy. Moreover, the convergence of the primal alternating direction method of multipliers (ADMM) for hyperspectral sparse unmixing with a TV regularization term has not been explained in detail. In this paper, we design an efficient and convergent dual symmetric Gauss-Seidel ADMM (sGS-ADMM) for hyperspectral sparse unmixing with a TV regularization term. We also present the global convergence and local linear convergence rate analysis for this algorithm. As demonstrated in numerical experiments, our algorithm clearly improves the efficiency of the unmixing compared with the state-of-the-art algorithm. More importantly, we can obtain images of higher quality.

1407.0221 2026-06-04 cs.CV cs.NA math.NA 版本更新

Imaging with Kantorovich-Rubinstein discrepancy

基于Kantorovich-Rubinstein偏差的成像

Jan Lellmann, Dirk A. Lorenz, Carola Schönlieb, Tuomo Valkonen

发表机构 * Department for Applied Mathematics and Theoretical Physics, University of Cambridge(剑桥大学应用数学与理论物理系) Institute for Analysis and Algebra, TU Braunschweig(布伦瑞克工业大学分析与代数研究所) Center for Mathematical Modeling (Modemat), EPN Quito(基多国立理工学院数学建模中心(Modemat))

AI总结 本文提出将最优传输中的Kantorovich-Rubinstein范数应用于成像问题,提出了一种变分正则化模型,结合Kantorovich-Rubinstein偏差项和总变分正则化,用于图像去噪和卡通-纹理分解,并与其他方法建立联系,证明优化问题可转化为凸-凹鞍点问题并用标准工具求解。

详情
AI中文摘要

我们提出将最优传输中的Kantorovich-Rubinstein范数应用于成像问题。特别是,我们讨论了一种变分正则化模型,该模型包含Kantorovich-Rubinstein偏差项和总变分正则化,用于图像去噪和卡通-纹理分解。我们指出这种方法与其他最近提出的方法如总广义变分和捕捉振荡模式的范数之间的联系。我们还表明,相应的优化问题可以转化为带有简单约束的凸-凹鞍点问题,因此可以用标准工具求解。数值示例展示了有趣的特点和有利的去噪和卡通-纹理分解性能。

英文摘要

We propose the use of the Kantorovich-Rubinstein norm from optimal transport in imaging problems. In particular, we discuss a variational regularisation model endowed with a Kantorovich-Rubinstein discrepancy term and total variation regularization in the context of image denoising and cartoon-texture decomposition. We point out connections of this approach to several other recently proposed methods such as total generalized variation and norms capturing oscillating patterns. We also show that the respective optimization problem can be turned into a convex-concave saddle point problem with simple constraints and hence, can be solved by standard tools. Numerical examples exhibit interesting features and favourable performance for denoising and cartoon-texture decomposition.
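As background for the discrepancy term named in this abstract: for two non-negative densities of equal total mass, the Kantorovich-Rubinstein norm of their difference coincides with the Wasserstein-1 (earth mover's) distance, and in one dimension that distance has a closed form as the accumulated absolute difference of the cumulative sums. The sketch below shows only this special case on a uniform 1D grid. The function name and the `spacing` parameter are hypothetical, and the more general discrepancy and the saddle-point solver discussed in the paper are not reproduced.

```python
def wasserstein1_1d(a, b, spacing=1.0):
    """Wasserstein-1 distance between two 1D histograms on the same uniform grid.

    Both histograms must carry the same total mass. The running sum of
    (a - b) is the mass that still has to cross each cell boundary.
    """
    if len(a) != len(b):
        raise ValueError("histograms must have the same length")
    if abs(sum(a) - sum(b)) > 1e-12:
        raise ValueError("histograms must have equal total mass")
    running = 0.0
    total = 0.0
    for ai, bi in zip(a, b):
        running += ai - bi               # net mass to move past this cell
        total += abs(running)
    return total * spacing


if __name__ == "__main__":
    print(wasserstein1_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # 2.0
```

Unlike a pointwise L2 or L1 difference, this value grows with how far the mass has to travel, which is the property that motivates transport-based data terms in imaging.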

1801.03800 2026-06-04 cs.CV cs.NA math.NA 版本更新

Cortical-inspired image reconstruction via sub-Riemannian geometry and hypoelliptic diffusion

基于次黎曼几何和亚椭圆扩散的皮层启发式图像重建

Ugo Boscain, Roman Chertovskih, Jean-Paul Gauthier, Dario Prandi, Alexey Remizov

发表机构 * CNRS, LJLL, Université Pierre et Marie Curie, Paris, France SYSTEC, FEUP, University of Porto, Portugal

AI总结 本文基于初级视觉皮层的数学模型,综述了若干利用亚椭圆扩散进行图像修复的算法:其中一种算法不利用图像损坏位置信息,其余算法则利用该信息,后者达到了图像修复领域的最先进水平;这可以解释为对视觉皮层确实编码了第一种算法的验证。

详情
Journal ref
ESAIM: ProcS. 64 (2018), pp. 37-53
AI中文摘要

在本文中,我们回顾了几种基于与初级视觉皮层数学模型自然关联的亚椭圆扩散的图像修复算法。特别是,我们提出了一种不利用图像损坏位置信息的算法,以及利用该信息的其他算法。虽然第一种算法只能重建我们视觉系统仍能识别的图像,但我们展示了第二类算法完全超越了这一限制,提供了图像修复领域的最先进修复结果。这可以被解释为对视觉皮层确实编码了第一种算法的事实的验证。

英文摘要

In this paper we review several algorithms for image inpainting based on the hypoelliptic diffusion naturally associated with a mathematical model of the primary visual cortex. In particular, we present one algorithm that does not exploit the information of where the image is corrupted, and others that do. While the first algorithm is able to reconstruct only images that our visual system is still capable of recognizing, we show that those of the second type completely transcend this limitation, providing state-of-the-art reconstructions in image inpainting. This can be interpreted as a validation of the fact that our visual cortex actually encodes the first type of algorithm.

1712.05870 2026-06-04 math.NA cs.CV cs.NA 版本更新

Multi-dimensional imaging data recovery via minimizing the partial sum of tubal nuclear norm

通过最小化管核范数的部分和进行多维成像数据恢复

Tai-Xiang Jiang, Ting-Zhu Huang, Xi-Le Zhao, Liang-Jian Deng

发表机构 * School of Mathematical Sciences/Research Center for Image and Vision Computing(数学科学学院/图像与视觉计算研究中心) University of Electronic Science and Technology of China(电子科技大学) FinTech Innovation Center(金融科技创新中心) Financial Intelligence and Financial Engineering Research Key Laboratory of Sichuan province(金融智能与金融工程四川省重点实验室) School of Economic Information Engineering(经济信息工程学院) Southwestern University of Finance and Economics(西南财经大学)

AI总结 本文在张量奇异值分解(t-SVD)框架下研究张量恢复问题,提出部分和管核范数(PSTNN)作为张量管多秩的替代,并构建两个基于PSTNN的最小化模型用于张量补全和张量主成分分析,通过交替方向乘子法(ADMM)求解,并在合成数据和实际数据上验证了PSTNN的优越性。

详情
AI中文摘要

在本文中,我们在张量奇异值分解(t-SVD)框架下研究张量恢复问题。我们提出了张量的部分和管核范数(PSTNN)。PSTNN是张量管多秩的替代。我们为两种典型张量恢复问题,即张量补全和张量主成分分析,构建了两个基于PSTNN的最小化模型。我们基于交替方向乘子法(ADMM)提出了两种算法来求解所提出的基于PSTNN的张量恢复模型。在合成数据和实际数据上的实验结果揭示了所提PSTNN的优越性。

英文摘要

In this paper, we investigate tensor recovery problems within the tensor singular value decomposition (t-SVD) framework. We propose the partial sum of the tubal nuclear norm (PSTNN) of a tensor. The PSTNN is a surrogate of the tensor tubal multi-rank. We build two PSTNN-based minimization models for two typical tensor recovery problems, i.e., tensor completion and tensor principal component analysis. We give two algorithms based on the alternating direction method of multipliers (ADMM) to solve the proposed PSTNN-based tensor recovery models. Experimental results on synthetic and real-world data reveal the superiority of the proposed PSTNN.

1904.10379 2026-06-04 cs.GR cs.CV cs.NA math.NA 版本更新

Multi-modal 3D Shape Reconstruction Under Calibration Uncertainty using Parametric Level Set Methods

在校准不确定性下利用参数水平集方法进行多模态3D形状重建

Moshe Eliasof, Andrei Sharf, Eran Treister

发表机构 * Computer Science Department, Ben-Gurion University of the Negev(本古里安大学计算机科学系)

AI总结 本文提出了一种参数化水平集方法,用于在存在校准不确定性时从多模态数据中重建3D形状,该方法能够有效处理不同数据模态,如稀疏点集、体素切片、2D照片等,并在复杂对象的紧凑表示和校准噪声鲁棒性方面取得了显著贡献。

详情
AI中文摘要

我们考虑了在存在不确定校准参数的情况下,从多模态数据中重建3D形状的问题。通常,3D数据模态可以是多种多样的形式,例如稀疏点集、体素切片、2D照片等。为了联合处理这些数据模态,我们利用了一种参数化水平集方法,该方法使用椭球径向基函数。这种方法不仅允许我们以解析且紧凑的方式表示物体,还赋予我们克服源自不准确获取参数的校准相关噪声的能力。这种本质上隐式的正则化导致了高度鲁棒且可扩展的重建,超越了其他传统方法。在我们的结果中,我们首先展示了该方法对复杂物体进行紧凑表示的能力。然后我们展示了我们的重建方法在少量测量和获取参数中的噪声方面都具有鲁棒性。最后,我们展示了从不同模态,如通过液体位移获得的体素切片(类似于CT扫描和X射线)以及从形状轮廓获得的视觉测量中,我们的重建能力。

英文摘要

We consider the problem of 3D shape reconstruction from multi-modal data, given uncertain calibration parameters. Typically, 3D data modalities can take diverse forms such as sparse point sets, volumetric slices, 2D photos and so on. To jointly process these data modalities, we exploit a parametric level set method that utilizes ellipsoidal radial basis functions. This method not only allows us to represent the object analytically and compactly, but also lets us overcome calibration-related noise that originates from inaccurate acquisition parameters. This essentially implicit regularization leads to a highly robust and scalable reconstruction, surpassing other traditional methods. In our results we first demonstrate the ability of the method to compactly represent complex objects. We then show that our reconstruction method is robust both to a small number of measurements and to noise in the acquisition parameters. Finally, we demonstrate our reconstruction abilities from diverse modalities such as volume slices obtained from liquid displacement (similar to CT scans and X-rays), and visual measurements obtained from shape silhouettes.

1605.06311 2026-06-04 stat.CO cs.CV cs.SY eess.SY 版本更新

Poisson multi-Bernoulli conjugate prior for multiple extended object filtering

泊松多伯努利共轭先验用于多扩展目标滤波

Karl Granstrom, Maryam Fatemi, Lennart Svensson

发表机构 * Department of Signals and Systems, Chalmers University of Technology(查尔姆斯理工大学信号与系统系) Zenuity

AI总结 本文提出了一种用于多扩展目标滤波的泊松多伯努利混合(PMBM)共轭先验。通过泊松点过程描述尚未检测到的目标存在,而多伯努利混合描述已检测到的目标分布。预测和更新方程针对标准转移密度和测量似然性进行推导。预测和更新均保持密度的PMBM形式,因此PMBM密度是一种共轭先验。然而,未知的数据关联导致PMBM密度中出现难以处理的大量项,因此需要近似方法。本文给出了伽马高斯逆 Wishart 实现,并提供了处理数据关联问题的方法。模拟研究显示,扩展目标PMBM滤波器在与扩展目标d-GLMB和LMB滤波器的比较中表现良好。使用激光雷达数据的实验展示了同时跟踪已检测和未检测目标的优势。

详情
AI中文摘要

本文提出了一种用于多扩展目标滤波的泊松多伯努利混合(PMBM)共轭先验。通过泊松点过程描述尚未检测到的目标存在,而多伯努利混合描述已检测到的目标分布。预测和更新方程针对标准转移密度和测量似然性进行推导。预测和更新均保持密度的PMBM形式,因此PMBM密度是一种共轭先验。然而,未知的数据关联导致PMBM密度中出现难以处理的大量项,因此需要近似方法。本文给出了伽马高斯逆 Wishart 实现,并提供了处理数据关联问题的方法。模拟研究显示,扩展目标PMBM滤波器在与扩展目标d-GLMB和LMB滤波器的比较中表现良好。使用激光雷达数据的实验展示了同时跟踪已检测和未检测目标的优势。

英文摘要

This paper presents a Poisson multi-Bernoulli mixture (PMBM) conjugate prior for multiple extended object filtering. A Poisson point process is used to describe the existence of yet undetected targets, while a multi-Bernoulli mixture describes the distribution of the targets that have been detected. The prediction and update equations are presented for the standard transition density and measurement likelihood. Both the prediction and the update preserve the PMBM form of the density, and in this sense the PMBM density is a conjugate prior. However, the unknown data associations lead to an intractably large number of terms in the PMBM density, and approximations are necessary for tractability. A gamma Gaussian inverse Wishart implementation is presented, along with methods to handle the data association problem. A simulation study shows that the extended target PMBM filter performs well in comparison to the extended target d-GLMB and LMB filters. An experiment with Lidar data illustrates the benefit of tracking both detected and undetected targets.

1903.10604 2026-06-04 cs.CV cs.SY eess.SY 版本更新

An Approach for Adaptive Automatic Threat Recognition Within 3D Computed Tomography Images for Baggage Security Screening

一种基于3D计算层析成像的自适应自动威胁识别方法用于行李安全检查

Qian Wang, Khalid N. Ismail, Toby P. Breckon

发表机构 * Department of Computer Science, Durham University, United Kingdom(英国杜伦大学计算机科学系) Department of Engineering, Durham University, United Kingdom(英国杜伦大学工程系)

AI总结 本文提出了一种基于3D X射线计算层析成像的自适应自动威胁识别方法,旨在解决快速演变的威胁特征识别问题,通过多尺度3D CT图像分割算法、多类支持向量机分类器和适应性策略实现高检测概率和低误报率。

Comments Technical Report, Durham University

详情
AI中文摘要

使用X射线扫描器对行李进行安全检查已成为航空安全的常规操作,自动威胁检测方法基于3D X射线计算层析成像(CT)图像,称为自动威胁识别(ATR)。当前策略使用预定义的威胁材料签名,而非适应于新出现的威胁签名。为了解决这个问题,先前的工作提出了自适应自动威胁识别(AATR)的概念。本文提出了一种基于X射线CT行李扫描图像的解决方案。该方法旨在解决筛查要求中快速演变的威胁特征问题。理想情况下,部署在安全扫描仪中的检测算法应能快速适应不同情况,具有不同的威胁特征要求(例如,威胁材料、物体的物理属性)。我们通过一种新颖的自适应机器学习方法来解决这个问题,该解决方案包括一个多尺度3D CT图像分割算法、一个多类支持向量机(SVM)分类器用于物体材料识别以及一种使方法适应的策略。实验在开放和封闭的3D CT行李图像数据集上进行,这些数据集专门用于AATR研究。我们提出的方法在识别和适应性方面表现良好。总体而言,我们的方法可以实现约90%的检测概率和低于20%的误报率。我们的AATR展示了适应不同种类材料的能力,甚至包括训练数据中未出现的未知材料,适应不同所需的检测概率以及适应不同规模的威胁物体。

英文摘要

The screening of baggage using X-ray scanners is now routine in aviation security, with automatic threat detection approaches based on 3D X-ray computed tomography (CT) images known as Automatic Threat Recognition (ATR) within the aviation security industry. Current strategies rely on pre-defined threat material signatures rather than adapting to new and emerging threat signatures. To address this issue, the concept of adaptive automatic threat recognition (AATR) was proposed in previous work. In this paper, we present a solution to AATR based on such X-ray CT baggage scan imagery. This aims to address the issue of rapidly evolving threat signatures within the screening requirements. Ideally, the detection algorithms deployed within security scanners should be readily adaptable to different situations with varying requirements of threat characteristics (e.g., threat material, physical properties of objects). We tackle this issue using a novel adaptive machine learning methodology, with our solution consisting of a multi-scale 3D CT image segmentation algorithm, a multi-class support vector machine (SVM) classifier for object material recognition, and a strategy to enable the adaptability of our approach. Experiments are conducted on both open and sequestered 3D CT baggage image datasets specifically collected for the AATR study. Our proposed approach performs well on both recognition and adaptation. Overall, our approach achieves a probability of detection of around 90% with a probability of false alarm below 20%. Our AATR approach can adapt to varying types of materials, even unknown materials that are not available in the training data, to varying required probabilities of detection, and to varying scales of the threat object.

1904.03537 2026-06-04 math.OC cs.CV cs.LG cs.NA math.NA 版本更新

Convex-Concave Backtracking for Inertial Bregman Proximal Gradient Algorithms in Non-Convex Optimization

用于非凸优化中惯性Bregman近端梯度算法的凸凹回溯法

Mahesh Chandra Mukkamala, Peter Ochs, Thomas Pock, Shoham Sabach

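Affiliations * Faculty of Mathematics and Computer Science, Saarland University; Institute of Computer Graphics and Vision, Graz University of Technology; Faculty of Industrial Engineering, The Technion

AI summary This paper proposes a convex-concave backtracking method for inertial Bregman proximal gradient algorithms in non-convex optimization. By locally finding a convex upper bound and a concave lower bound of the objective function, it selects the step size and the extrapolation parameter adaptively, and it proves that the algorithms converge globally to a critical point.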

Comments 29 pages

详情
AI中文摘要

回溯线搜索是一种古老而强大的策略,用于在近似梯度算法中寻找更好的步长。其主要原理是局部寻找目标函数的简单凸上界,从而控制使用的步长。在惯性近似梯度算法中,情况变得更加复杂,通常导致对外推参数的非常严格的限制。在本文中,我们展示通过局部寻找目标函数的简单凹下界,可以控制外推参数。这导致了一种双凸凹回溯过程,允许自适应地选择步长和外推参数。我们将此过程应用于惯性Bregman近似梯度方法的类别,并证明由这些算法生成的任何序列都全局收敛到函数的临界点。在图像处理和机器学习中的多个具有挑战性的非凸问题上的数值实验显示,结合惯性步和双回溯策略能够实现性能的提升。

英文摘要

Backtracking line-search is an old yet powerful strategy for finding better step sizes to be used in proximal gradient algorithms. The main principle is to locally find a simple convex upper bound of the objective function, which in turn controls the step size that is used. In the case of inertial proximal gradient algorithms, the situation becomes much more difficult and usually leads to very restrictive rules on the extrapolation parameter. In this paper, we show that the extrapolation parameter can be controlled by also locally finding a simple concave lower bound of the objective function. This gives rise to a double convex-concave backtracking procedure which allows for an adaptive choice of both the step size and the extrapolation parameter. We apply this procedure to the class of inertial Bregman proximal gradient methods, and prove that any sequence generated by these algorithms converges globally to a critical point of the function at hand. Numerical experiments on a number of challenging non-convex problems in image processing and machine learning show the power of combining inertial steps and the double backtracking strategy in achieving improved performance.
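The classical half of this idea, backtracking on a convex quadratic upper bound to pick the step size, can be sketched as follows. This is only a minimal illustration under stated assumptions: it uses the Euclidean distance rather than a Bregman distance, an l1 regularizer, and no inertial step or concave lower bound, so it is not the double backtracking procedure of the paper. All names (`soft_threshold`, `backtracking_prox_grad`, the toy problem) are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal map of t * ||.||_1, applied elementwise."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def backtracking_prox_grad(f, grad_f, x, lam, L=1.0, eta=2.0, iters=100):
    """Proximal gradient for f(x) + lam * ||x||_1 with backtracking on L."""
    for _ in range(iters):
        g = grad_f(x)
        fx = f(x)
        while True:
            step = 1.0 / L
            shifted = [xi - step * gi for xi, gi in zip(x, g)]
            x_new = soft_threshold(shifted, step * lam)
            d = [a - b for a, b in zip(x_new, x)]
            # Convex quadratic upper bound of f around x, evaluated at x_new.
            upper = (fx + sum(gi * di for gi, di in zip(g, d))
                     + 0.5 * L * sum(di * di for di in d))
            if f(x_new) <= upper + 1e-12:
                break
            L *= eta  # bound violated: increase L, i.e. shrink the step
        x = x_new
    return x


# Toy problem (illustrative): f(x) = 0.5 * ||x - b||^2 with b = [3, -0.5].
b = [3.0, -0.5]
f = lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
grad_f = lambda x: [xi - bi for xi, bi in zip(x, b)]
x_star = backtracking_prox_grad(f, grad_f, [0.0, 0.0], lam=1.0, L=0.1)
```

For this toy problem the minimizer is the soft-thresholded vector [2, 0]. Starting from the deliberately small `L=0.1` forces a few backtracking rounds before the upper bound holds.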

1812.03446 2026-06-04 math.NA cs.CV cs.NA 版本更新

A New Variational Model for Joint Image Reconstruction and Motion Estimation in Spatiotemporal Imaging

一种新的变分模型用于时空成像中的联合图像重建和运动估计

Chong Chen, Barbara Gris, Ozan Öktem

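Affiliations * LSEC, ICMSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences; Department of Mathematics, KTH–Royal Institute of Technology

AI summary This paper proposes a new variational model for joint image reconstruction and motion estimation in spatiotemporal imaging. Built on a shape-theory framework, the model combines modified static image reconstruction with sequentially indirect image registration, and theoretical analysis and numerical experiments show its effectiveness on sparse and highly noisy data.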

Comments 35 pages, 5 figures, 3 tables, revised

详情
Journal ref
SIAM Journal on Imaging Sciences 2019
AI中文摘要

我们提出了一种新的变分模型,用于时空成像中的联合图像重建和运动估计,该模型基于我们提出的一般框架,结合了形状理论。该模型由两个组成部分组成,一个用于执行改进的静态图像重建,另一个依次执行间接图像配准。对于后者,我们将大变形各向同性度量映射框架推广到顺序间接配准设置中。所提出的模型在理论上与替代方法(基于光学流的模型和各向同性运动模型)进行了比较,证明了所提模型在最优解方面具有良好的性质。此外,还给出了所提模型时间离散化场景的理论推导和高效算法,表明时间离散化版本的最优解与时间连续版本一致,并且大部分计算组件都是易于实现的线性化变形。还分析了该算法的复杂度。本文最后通过2D空间+时间断层成像中非常稀疏和/或高噪声数据的数值示例进行了总结。

英文摘要

We propose a new variational model for joint image reconstruction and motion estimation in spatiotemporal imaging, which is investigated within a general framework that we present using shape theory. This model consists of two components, one for conducting modified static image reconstruction and the other for performing sequentially indirect image registration. For the latter, we generalize the large deformation diffeomorphic metric mapping framework to the sequentially indirect registration setting. The proposed model is compared theoretically against alternative approaches (optical-flow-based models and diffeomorphic motion models), and we demonstrate that the proposed model has desirable properties in terms of the optimal solution. Theoretical derivations and efficient algorithms are also presented for a time-discretized scenario of the proposed model, which show that the optimal solution of the time-discretized version is consistent with that of the time-continuous one, and that most of the computational components are easily implemented linearized deformations. The complexity of the algorithm is analyzed as well. The work concludes with numerical examples in 2D space + time tomography with very sparse and/or highly noisy data.

1706.04048 2026-06-04 math.NA cs.CV cs.NA math.DS math.FA math.OC 版本更新

Indirect Image Registration with Large Diffeomorphic Deformations

具有大变形的间接图像配准

Chong Chen, Ozan Öktem

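Affiliations * Department of Mathematics, KTH–Royal Institute of Technology; LSEC, ICMSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences

AI summary This paper proposes an indirect image registration method based on the large deformation diffeomorphic metric mapping framework. It addresses registering a template against a target observed only through indirect noisy measurements, and proves that the method is stable and converges as the data error tends to zero.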

Comments 43 pages, 4 figures, 1 table; revised

详情
Journal ref
SIAM Journal on Imaging Sciences 2018
AI中文摘要

本文将大变形可微度量映射框架适应于图像配准,用于间接设置,其中模板通过间接噪声观测的目标进行配准。配准使用将模板通过群作用转换的可微映射。这些可微映射由定义于具有一定正则性的速度场的流方程生成。理论分析包括证明间接图像配准在数据误差趋近于零时具有稳定解并收敛,因此成为一种良好的正则化方法。本文最后给出了在2D断层扫描中使用间接图像配准的示例,处理非常稀疏和/或高度噪声的数据。

英文摘要

The paper adapts the large deformation diffeomorphic metric mapping framework for image registration to the indirect setting where a template is registered against a target that is given through indirect noisy observations. The registration uses diffeomorphisms that transform the template through a (group) action. These diffeomorphisms are generated by solving a flow equation that is defined by a velocity field with certain regularity. The theoretical analysis includes a proof that indirect image registration has solutions (existence) that are stable and that converge as the data error tends to zero, so it becomes a well-defined regularization method. The paper concludes with examples of indirect image registration in 2D tomography with very sparse and/or highly noisy data.

1904.11898 2026-06-04 cs.RO cs.CV cs.LG cs.SY eess.SY 版本更新

Perceptual Attention-based Predictive Control

基于感知注意力的预测控制

Keuntaek Lee, Gabriel Nakajima An, Viacheslav Zakharov, Evangelos A. Theodorou

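Affiliations * Georgia Institute of Technology

AI summary This paper proposes a novel information processing architecture for safe deep learning-based visual navigation. Using model predictive control (MPC), convolutional neural networks (CNNs), and uncertainty quantification methods, it implements a perceptual attention-based predictive control algorithm that detects unsafe conditions more rapidly.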

详情
AI中文摘要

在本文中,我们提出了一种新的信息处理架构,用于安全的基于深度学习的视觉导航自主系统。所提出的信息处理架构用于支持一种基于感知注意力的预测控制算法,该算法利用模型预测控制(MPC)、卷积神经网络(CNNs)和不确定性量化方法。我们的方法新颖之处在于利用MPC学习如何在视觉输入的相关区域上放置注意力,从而最终使系统能够更快速地检测到不安全状况。我们通过使用MPC学习如何选择输入图像中的感兴趣区域,这些区域用于输出控制动作以及在注意力感知的视觉输入中的epistemic和aleatoric不确定性估计。我们使用这些不确定性估计来量化在当前导航条件下网络控制器的安全性。所提出的架构和算法在1:5比例的陆地车辆上进行了测试。实验结果表明,所提出的算法在早期检测不安全状况方面优于先前的方法,例如当导航环境中出现新障碍物时。所提出的架构是向在安全关键领域使用基于深度学习的感知控制策略迈出的第一步。

英文摘要

In this paper, we present a novel information processing architecture for safe deep learning-based visual navigation of autonomous systems. The proposed information processing architecture is used to support a perceptual attention-based predictive control algorithm that leverages model predictive control (MPC), convolutional neural networks (CNNs), and uncertainty quantification methods. The novelty of our approach lies in using MPC to learn how to place attention on relevant areas of the visual input, which ultimately allows the system to more rapidly detect unsafe conditions. We accomplish this by using MPC to learn to select regions of interest in the input image, which are used to output control actions as well as estimates of epistemic and aleatoric uncertainty in the attention-aware visual input. We use these uncertainty estimates to quantify the safety of our network controller under the current navigation condition. The proposed architecture and algorithm are tested on a 1:5 scale terrestrial vehicle. Experimental results show that the proposed algorithm outperforms previous approaches on early detection of unsafe conditions, such as when novel obstacles are present in the navigation environment. The proposed architecture is the first step towards using deep learning-based perceptual control policies in safety-critical domains.

1905.04835 2026-06-04 cs.LG cs.CV cs.MA cs.RO cs.SY eess.SY stat.ML 版本更新

Multi-Agent Image Classification via Reinforcement Learning

通过强化学习进行多智能体图像分类

Hossein K. Mousavi, Mohammadreza Nazari, Martin Takáč, Nader Motee

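AI summary This paper studies image classification with multiple mobile agents that collect partial, pose-dependent observations of an unknown environment. It proposes a network architecture that specifies how agents form local beliefs, take local actions, and extract relevant features from raw partial observations. Agents update their beliefs by exchanging information with neighboring agents, and reinforcement learning is used to achieve a decentralized implementation of the classification problem.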

Comments Preprint of the paper to be published in IROS'19 proceedings

详情
AI中文摘要

我们研究了使用多个能够收集未知环境部分姿态依赖观测的移动智能体进行分类问题。目标是在有限的时间范围内对图像进行分类。我们提出了一种网络架构,用于指导智能体如何形成局部信念、采取局部行动并从原始部分观测中提取相关特征。智能体被允许与邻居智能体交换信息以更新自身信念。证明了如何利用强化学习技术通过运行去中心化共识协议来实现分类问题的去中心化实现。我们在MNIST手写数字数据集上的实验结果展示了我们所提框架的有效性。

英文摘要

We investigate a classification problem using multiple mobile agents capable of collecting (partial) pose-dependent observations of an unknown environment. The objective is to classify an image over a finite time horizon. We propose a network architecture that specifies how agents should form a local belief, take local actions, and extract relevant features from their raw partial observations. Agents are allowed to exchange information with their neighboring agents to update their own beliefs. It is shown how reinforcement learning techniques can be utilized to achieve decentralized implementation of the classification problem by running a decentralized consensus protocol. Our experimental results on the MNIST handwritten digit dataset demonstrate the effectiveness of our proposed framework.

1903.11683 2026-06-04 stat.ML cs.CV cs.LG cs.RO cs.SY eess.SY stat.AP 版本更新

Outlier-Robust Spatial Perception: Hardness, General-Purpose Algorithms, and Guarantees

抗异常的空域感知:难度、通用算法和保证

Vasileios Tzoumas, Pasquale Antonante, Luca Carlone

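AI summary This paper studies the effect of outliers on spatial perception, proposes a general-purpose algorithm for removing them, and provides theoretical guarantees on the algorithm's performance.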

详情
AI中文摘要

空域感知是许多机器人应用的核心,涵盖了定位与建图、点云对齐和从相机图像中估计相对姿态等广泛的研究问题。异常数据的存在会威胁到空域感知的鲁棒性,而一般情况下,异常值是主要问题。尽管已有处理异常值的技术,但它们可能以不可预测的方式失败(例如RANSAC、鲁棒估计器),或具有指数级的运行时间(例如分支界限法)。在本文中,我们通过三个贡献推动了异常拒绝的前沿。首先,我们证明了即使是最简单的线性异常拒绝实例也是近似不可行的:在最坏情况下,无法设计出一个准多项式时间算法来高效计算近似解。我们的第二个贡献是提供第一个实例级的次优界限,以评估给定异常拒绝结果的近似质量。我们的第三个贡献是提出了一种简单的通用算法,称为自适应修剪,用于去除异常值。我们的算法利用了最近提出的一类全局求解器,能够解决无异常的问题,并通过迭代去除误差较大的测量值。我们在三个空域感知问题上展示了所提出的算法:三维配准、双视几何和SLAM。结果表明,我们的算法在各种应用中优于几种最先进的方法,同时是一种通用的方法。

英文摘要

Spatial perception is the backbone of many robotics applications, and spans a broad range of research problems, including localization and mapping, point cloud alignment, and relative pose estimation from camera images. Robust spatial perception is jeopardized by the presence of incorrect data association, and in general, outliers. Although techniques to handle outliers do exist, they can fail in unpredictable ways (e.g., RANSAC, robust estimators), or can have exponential runtime (e.g., branch-and-bound). In this paper, we advance the state of the art in outlier rejection by making three contributions. First, we show that even a simple linear instance of outlier rejection is inapproximable: in the worst case, one cannot design a quasi-polynomial time algorithm that computes an approximate solution efficiently. Our second contribution is to provide the first per-instance sub-optimality bounds to assess the approximation quality of a given outlier rejection outcome. Our third contribution is to propose a simple general-purpose algorithm, named adaptive trimming, to remove outliers. Our algorithm leverages recently proposed global solvers that are able to solve outlier-free problems, and iteratively removes measurements with large errors. We demonstrate the proposed algorithm on three spatial perception problems: 3D registration, two-view geometry, and SLAM. The results show that our algorithm outperforms several state-of-the-art methods across applications while being a general-purpose method.
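The general pattern described here, solving on the currently kept measurements and then dropping those with large errors, can be sketched as below. This is a simplified greedy variant written for illustration: it removes one measurement per round against a fixed threshold, it is not the paper's adaptive trimming algorithm, and it carries none of its guarantees. The names `trim_outliers`, `solve`, `residual` and the toy data are assumptions.

```python
def trim_outliers(measurements, solve, residual, threshold):
    """Greedy trimming: refit on the kept set, drop the worst measurement,
    and stop once every kept residual is within `threshold`."""
    kept = list(range(len(measurements)))
    while True:
        # `solve` plays the role of a solver for the outlier-free problem.
        estimate = solve([measurements[i] for i in kept])
        errors = {i: residual(estimate, measurements[i]) for i in kept}
        worst = max(kept, key=errors.get)
        if errors[worst] <= threshold or len(kept) <= 1:
            return estimate, kept
        kept.remove(worst)


# Toy example (illustrative): estimate a scalar from noisy readings,
# two of which are gross outliers.
data = [1.0, 1.1, 0.9, 1.05, 10.0, -7.0]
mean = lambda zs: sum(zs) / len(zs)
estimate, kept = trim_outliers(data, mean, lambda est, z: abs(est - z), threshold=0.2)
```

In this toy run the two gross outliers are dropped in turn and the estimate is the mean of the four remaining readings.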

1903.02531 2026-06-04 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

Combining Optimal Control and Learning for Visual Navigation in Novel Environments

将最优控制与学习相结合用于新环境中的视觉导航

Somil Bansal, Varun Tolani, Saurabh Gupta, Jitendra Malik, Claire Tomlin

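Affiliations * University of California, Berkeley; Facebook AI Research

AI summary This paper proposes combining model-based control with learning-based perception for reliable visual navigation in novel environments. The perception module generates waypoints along a collision-free path so that the robot reaches goal locations efficiently, and the approach works well at low frame rates and in simulation-to-real transfer.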

Comments Project website: https://vtolani95.github.io/WayPtNav/

详情
AI中文摘要

基于模型的控制是机器人导航的流行范式,因为它可以利用已知的动力学模型来高效地规划鲁棒的机器人轨迹。然而,在环境事先未知且只能通过机器人上的传感器部分观测的情况下,使用基于模型的方法具有挑战性。在本工作中,我们通过将基于模型的控制与基于学习的感知相结合来解决这一不足。基于学习的感知模块生成一系列 waypoints,通过无碰撞路径引导机器人到达目标。这些 waypoints 被用于基于模型的规划器生成平滑且动态可行的轨迹,该轨迹通过反馈控制在物理系统上执行。我们在模拟的真实世界复杂环境中以及在实际地面车辆上的实验表明,与纯几何映射或端到端学习方法相比,所提出的方法在新环境中能够更可靠、更高效地到达目标位置。我们的方法不依赖于详细的显式 3D 环境地图,能够与低帧率工作,并且在仿真到现实的迁移中表现良好。描述我们方法和实验的视频可在项目网站上获得。

英文摘要

Model-based control is a popular paradigm for robot navigation because it can leverage a known dynamics model to efficiently plan robust robot trajectories. However, it is challenging to use model-based methods in settings where the environment is a priori unknown and can only be observed partially through on-board sensors on the robot. In this work, we address this shortcoming by coupling model-based control with learning-based perception. The learning-based perception module produces a series of waypoints that guide the robot to the goal via a collision-free path. These waypoints are used by a model-based planner to generate a smooth and dynamically feasible trajectory that is executed on the physical system using feedback control. Our experiments in simulated real-world cluttered environments and on an actual ground vehicle demonstrate that the proposed approach can reach goal locations more reliably and efficiently in novel environments as compared to purely geometric mapping-based or end-to-end learning-based alternatives. Our approach does not rely on detailed explicit 3D maps of the environment, works well with low frame rates, and generalizes well from simulation to the real world. Videos describing our approach and experiments are available on the project website.

1811.09358 2026-06-04 cs.LG cs.CV cs.NA math.NA math.OC stat.ML 版本更新

A Sufficient Condition for Convergences of Adam and RMSProp

Adam和RMSProp收敛性的充分条件

Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu

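Affiliations * Tencent AI Lab; Stony Brook University

AI summary This paper proposes an easy-to-check sufficient condition, depending only on the parameters of the base learning rate and combinations of historical second-order moments, that guarantees the global convergence of generic Adam/RMSProp in large-scale non-convex stochastic optimization. It also shows that the convergence of several Adam variants in the non-convex setting follows directly from this condition.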

Comments Accepted by CVPR2019 as an Oral presentation

详情
AI中文摘要

Adam和RMSProp是训练深度神经网络中最具影响力的自适应随机算法,尽管在凸设置中通过几个简单的反例已被指出存在发散现象。许多尝试,如降低自适应学习率、采用大批次大小、引入时间去相关技术、寻找类比的替代方案等,已被尝试以促进Adam/RMSProp型算法收敛。与现有方法不同,我们引入了一种替代的易于检查的充分条件,该条件仅依赖于基础学习率参数和历史二阶矩量的组合,以保证通用的Adam/RMSProp算法在大规模非凸随机优化中的全局收敛性。此外,我们展示了几种Adam变体,如AdamNC、AdaEMA等,在非凸设置下的收敛性可通过所提出的充分条件直接推导。此外,我们表明Adam本质上是一种具有指数移动平均动量的特定加权AdaGrad,这为理解Adam和RMSProp提供了新的视角。这一观察结合该充分条件,为它们的发散性提供了更深入的解释。最后,我们通过将Adam和RMSProp应用于特定反例和训练深度神经网络来验证该充分条件。数值结果与我们的理论分析一致。

英文摘要

Adam and RMSProp are two of the most influential adaptive stochastic algorithms for training deep neural networks, yet they have been shown to diverge even in the convex setting via a few simple counterexamples. Many remedies, such as decreasing an adaptive learning rate, adopting a big batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, etc., have been tried to make Adam/RMSProp-type algorithms converge. In contrast with existing approaches, we introduce an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization. Moreover, we show that the convergence of several variants of Adam, such as AdamNC, AdaEMA, etc., can be directly implied via the proposed sufficient condition in the non-convex setting. In addition, we illustrate that Adam is essentially a specifically weighted AdaGrad with exponential moving average momentum, which provides a novel perspective for understanding Adam and RMSProp. This observation coupled with this sufficient condition gives much deeper interpretations of their divergence. Finally, we validate the sufficient condition by applying Adam and RMSProp to tackle a certain counterexample and train deep neural networks. Numerical results are in exact accord with our theoretical analysis.
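For readers unfamiliar with the update being analyzed, a minimal sketch of the commonly used Adam step (exponential moving averages of the gradient and of its square, with bias correction) is given below. It assumes the textbook form of Adam with constant hyperparameters, which may differ from the "generic Adam" parameterization studied in the paper. The function name and the toy objective are hypothetical.

```python
import math


def adam_minimize(grad, x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Textbook Adam on a list of parameters; `grad` returns the gradient list."""
    m = [0.0] * len(x)  # moving average of gradients (momentum)
    v = [0.0] * len(x)  # moving average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(x)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        x = [xi - lr * mh / (math.sqrt(vh) + eps)
             for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x


# Toy objective (illustrative): minimize (x - 2)^2, whose gradient is 2 * (x - 2).
x_min = adam_minimize(lambda x: [2.0 * (x[0] - 2.0)], [0.0])
```

Setting `beta1=0` in this sketch gives an RMSProp-style update with bias correction.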

1905.11299 2026-06-04 eess.SY cs.CV cs.LG cs.SY 版本更新

ImgSensingNet: UAV Vision Guided Aerial-Ground Air Quality Sensing System

ImgSensingNet: 基于无人机视觉引导的空地空气质量感知系统

Yuzhe Yang, Zhiwen Hu, Kaigui Bian, Lingyang Song

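Affiliations * Computer Science and Artificial Intelligence Laboratory; School of Electrical Engineering and Computer Science

AI summary This paper presents ImgSensingNet, a UAV vision guided aerial-ground sensing system. It fuses haze images taken by a UAV with AQI data collected by an on-ground three-dimensional wireless sensor network (WSN) for fine-grained air quality monitoring and forecasting, while greatly reducing the system's energy consumption.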

Comments Preliminary version published in INFOCOM 2019. Code available at https://github.com/YyzHarry/ImgSensingNet

详情
AI中文摘要

鉴于日益严重的空气污染问题,城市区域空气质量指数(AQI)的监测已引起广泛关注。本文提出ImgSensingNet,一种基于视觉引导的空地联合感知系统,用于利用无人机拍摄的雾霾图像与地面三维无线传感器网络(WSN)收集的AQI数据进行精细化空气质量监测与预测。具体而言,ImgSensingNet首先利用计算机视觉技术从拍摄的雾霾图像中识别不同区域的AQI尺度,其中设计了与雾霾相关的特征和深度卷积神经网络(CNN)以直接学习雾霾图像与相应AQI尺度之间的映射关系。基于学习到的AQI尺度,ImgSensingNet决定是否唤醒地面无线传感器进行小尺度AQI监测和推断,从而显著降低系统的能耗。采用基于熵的模型以在未测量位置实现准确的实时AQI推断和未来空气质量分布预测。我们在两所大学校园自2018年2月起实施并评估ImgSensingNet,已收集17,630张照片和260万条AQI数据样本。实验结果证实,与现有最先进的AQI监测方法相比,ImgSensingNet在提高推断精度的同时显著降低了能耗。

英文摘要

Given the increasingly serious air pollution problem, the monitoring of air quality index (AQI) in urban areas has drawn considerable attention. This paper presents ImgSensingNet, a vision guided aerial-ground sensing system, for fine-grained air quality monitoring and forecasting using the fusion of haze images taken by an unmanned aerial vehicle (UAV) and the AQI data collected by an on-ground three-dimensional (3D) wireless sensor network (WSN). Specifically, ImgSensingNet first leverages computer vision techniques to tell the AQI scale in different regions from the taken haze images, where haze-relevant features and a deep convolutional neural network (CNN) are designed for direct learning between haze images and the corresponding AQI scale. Based on the learnt AQI scale, ImgSensingNet determines whether to wake up on-ground wireless sensors for small-scale AQI monitoring and inference, which can greatly reduce the energy consumption of the system. An entropy-based model is employed for accurate real-time AQI inference at unmeasured locations and future air quality distribution forecasting. We have implemented and evaluated ImgSensingNet on two university campuses since Feb. 2018, and have collected 17,630 photos and 2.6 million AQI data samples. Experimental results confirm that ImgSensingNet can achieve higher inference accuracy while greatly reducing the energy consumption, compared to state-of-the-art AQI monitoring approaches.

1905.08538 2026-06-04 math.NA cs.CV cs.NA 版本更新

A Two-stage Classification Method for High-dimensional Data and Point Clouds

高维数据和点云的两阶段分类方法

Xiaohao Cai, Raymond Chan, Xiaoyu Xie, Tieyong Zeng

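Affiliations * Mullard Space Science Laboratory (MSSL), University College London, Surrey RH5 6NT, UK; Department of Mathematics, City University of Hong Kong, Kowloon Tong, Hong Kong; Department of Mathematics, The Chinese University of Hong Kong, Shatin, Hong Kong

AI summary This paper proposes a two-stage multiphase semi-supervised classification method for high-dimensional data and unstructured point clouds. A fuzzy classification method such as the standard support vector machine first generates an initial solution, and the two-stage SaT (smoothing and thresholding) method then improves the classification. The first stage purifies and smooths the initial solution with an unconstrained convex variational model, and the second stage projects the resulting smoothed partition to a binary partition. The two stages can be repeated to improve classification quality. The authors prove that the convex model of the smoothing stage has a unique solution and can be solved by a specifically designed primal-dual algorithm. The method is tested on several benchmark data sets and compared with state-of-the-art methods, and the experimental results show that it is superior in both classification accuracy and computation speed for high-dimensional data and point clouds.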

Comments 21 pages, 4 figures

详情
AI中文摘要

高维数据分类是机器学习和成像科学中的基本任务。在本文中,我们提出了一种两阶段多阶段半监督分类方法,用于对高维数据和无结构点云进行分类。首先,使用模糊分类方法如标准支持向量机生成初始解。然后应用名为SaT(平滑和阈值)的两阶段方法来改进分类。第一阶段实现一个无约束凸变分模型以净化和平滑初始解,随后第二阶段将第一阶段得到的平滑分区投影到二进制分区。这两个阶段可以重复进行,以最新结果作为新初始解,持续提高分类质量。我们证明了平滑阶段的凸模型具有唯一解,并可以通过专门设计的对偶算法求解。我们测试了我们的方法,并在多个基准数据集上与最先进的方法进行了比较。实验结果清楚地表明,我们的方法在高维数据和点云的分类准确率和计算速度上均优于现有方法。

英文摘要

High-dimensional data classification is a fundamental task in machine learning and imaging science. In this paper, we propose a two-stage multiphase semi-supervised classification method for classifying high-dimensional data and unstructured point clouds. To begin with, a fuzzy classification method such as the standard support vector machine is used to generate a warm initialization. We then apply a two-stage approach named SaT (smoothing and thresholding) to improve the classification. In the first stage, an unconstrained convex variational model is implemented to purify and smooth the initialization, followed by the second stage, which projects the smoothed partition obtained at stage one to a binary partition. These two stages can be repeated, with the latest result as a new initialization, to keep improving the classification quality. We show that the convex model of the smoothing stage has a unique solution and can be solved by a specifically designed primal-dual algorithm whose convergence is guaranteed. We test our method and compare it with state-of-the-art methods on several benchmark data sets. The experimental results demonstrate clearly that our method is superior in both classification accuracy and computation speed for high-dimensional data and point clouds.
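The thresholding stage can be pictured with a small sketch. It assumes that the projection to a binary partition assigns each point to its highest-scoring class (a per-point argmax); the abstract does not spell out the projection, so this is an illustrative reading, and the smoothing stage (the convex variational model) is not shown. The function name is hypothetical.

```python
def threshold_partition(fuzzy):
    """Project each row of class scores to a one-hot (binary) assignment.

    `fuzzy` is a list of per-point score lists, one score per class.
    """
    binary = []
    for scores in fuzzy:
        # Index of the largest score; ties go to the lowest class index.
        k = max(range(len(scores)), key=scores.__getitem__)
        binary.append([1 if j == k else 0 for j in range(len(scores))])
    return binary


# Two points, three classes (illustrative scores).
labels = threshold_partition([[0.2, 0.7, 0.1], [0.5, 0.3, 0.2]])
```

Feeding the binary result back in as a new initialization corresponds to the repetition of the two stages mentioned in the abstract.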

1905.05946 2026-06-04 cs.RO cs.CV cs.SY eess.SY 版本更新

Depth map estimation methodology for detecting free-obstacle navigation areas

用于自由障碍区域检测的深度图估计方法

Sergio Trejo, Karla Martinez, Gerardo Flores

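AI summary This paper presents a vision-based method that uses a stereo camera rig and a one-dimensional LiDAR to estimate obstacle-free areas for quadrotor navigation. The depth map is filtered with a Weighted Least Squares filter and the information is fused through a Kalman filter algorithm to find a free space large enough for the quadrotor to pass through. The whole process is implemented with ROS on a Jetson TX2 embedded computer.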

详情
Journal ref
ICUAS'19 The 2019 International Conference on Unmanned Aircraft Systems
AI中文摘要

本文提出了一种基于视觉的方法,利用立体相机和一维LiDAR估计四旋翼导航中的自由障碍区域。所提出的方法利用立体相机提供的深度图和一维LiDAR的测距信息。在对深度图进行加权最小二乘滤波器(WLS)过滤后,通过卡尔曼滤波算法融合信息。通过使用卡尔曼滤波器的输出信息,在视差图中标记一个区域,以确定是否存在足够大的自由空间供四旋翼通过。整个过程在Jetson TX2嵌入式计算机上用机器人操作系统(ROS)实现。实验展示了该方法的有效性。

英文摘要

This paper presents a vision-based methodology which makes use of a stereo camera rig and a one-dimensional LiDAR to estimate obstacle-free areas for quadrotor navigation. The presented approach fuses the information provided by a depth map from the stereo camera rig with the sensing distance of the 1D LiDAR. Once the depth map is filtered with a Weighted Least Squares (WLS) filter, the information is fused through a Kalman filter algorithm. To determine if there is a free space large enough for the quadrotor to pass through, our approach marks an area inside the disparity map by using the Kalman filter output. The whole process is implemented on a Jetson TX2 embedded computer and coded in the Robot Operating System (ROS). Experiments demonstrate the effectiveness of our approach.
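The fusion step can be illustrated with a scalar Kalman measurement update. The sketch below assumes a static scene (no prediction step), a single depth value, and known measurement variances. The paper's actual state model and noise parameters are not given in the abstract, so every name and number here is hypothetical.

```python
def kalman_update(x, p, z, r):
    """One scalar Kalman measurement update.

    x, p: prior estimate and its variance; z, r: measurement and its variance.
    """
    k = p / (p + r)          # Kalman gain: weight given to the new measurement
    x_new = x + k * (z - x)  # corrected estimate
    p_new = (1.0 - k) * p    # variance shrinks after incorporating z
    return x_new, p_new


def fuse_depth(stereo_depth, stereo_var, lidar_range, lidar_var):
    """Use the stereo depth as the prior and correct it with the LiDAR range."""
    return kalman_update(stereo_depth, stereo_var, lidar_range, lidar_var)


# Illustrative values in meters: a noisier stereo depth and a more precise LiDAR range.
depth, var = fuse_depth(2.0, 0.04, 2.2, 0.01)
```

With these values the gain is 0.8, so the fused depth lands closer to the LiDAR reading, which has the smaller variance.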

1905.02176 2026-06-04 math.NA cs.CG cs.CV cs.NA math.DG 版本更新

Computation of Circular Area and Spherical Volume Invariants via Boundary Integrals

通过边界积分计算圆面积和球体积不变量

Riley O'Neill, Pedro Angulo-Umana, Jeff Calder, Bo Hessburg, Peter J. Olver, Chehrzad Shakiban, Katrina Yezzi-Woodley

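Affiliations * Department of Mathematics, University of St. Thomas; School of Mathematics, University of Minnesota; Department of Anthropology, University of Minnesota

AI summary This paper computes the circular area invariant of planar curves and the spherical volume invariant of surfaces via boundary integrals. It uses the Divergence Theorem to convert the area and volume integrals into boundary integrals, extends the results to higher-dimensional hypersurfaces, and gives an algorithm for triangulated surfaces that does not require discretizing the ambient space, with potential applications to feature detection on broken bone fragments of interest in anthropology.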

详情
AI中文摘要

我们展示如何通过线积分和表面积分分别计算平面曲线的圆面积不变量和曲面的球体积不变量。我们利用散度定理将面积和体积积分分别表示为特定核的线积分和表面积分;我们的结果也扩展到高维超曲面。所得到的表面积分可以在三角化网格上解析计算。这为三角化曲面计算球体积不变量提供了一种简单的计算算法,无需离散环境空间。我们讨论了该方法在考古学中对感兴趣骨折碎片特征检测的潜在应用。

英文摘要

We show how to compute the circular area invariant of planar curves, and the spherical volume invariant of surfaces, in terms of line and surface integrals, respectively. We use the Divergence Theorem to express the area and volume integrals as line and surface integrals, respectively, against particular kernels; our results also extend to higher dimensional hypersurfaces. The resulting surface integrals are computable analytically on a triangulated mesh. This gives a simple computational algorithm for computing the spherical volume invariant for triangulated surfaces that does not involve discretizing the ambient space. We discuss potential applications to feature detection on broken bone fragments of interest in anthropology.
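The basic mechanism can be written out in one line. The identity below is the standard Divergence Theorem applied to the vector field x - p, whose divergence equals the dimension n. Here D denotes the intersection of the domain Ω enclosed by the curve or surface with the ball B_r(p), and ν is the outward unit normal. These symbols are assumptions for illustration, and the particular kernels used in the paper are not reproduced here.

```latex
\[
  \operatorname{Vol}(D)
  \;=\; \int_D 1 \, dx
  \;=\; \frac{1}{n} \int_D \nabla \cdot (x - p) \, dx
  \;=\; \frac{1}{n} \int_{\partial D} (x - p) \cdot \nu \, dS,
  \qquad D = \Omega \cap B_r(p) \subset \mathbb{R}^n .
\]
```

For n = 2 this expresses an area as a line integral, and for n = 3 a volume as a surface integral, which is what makes evaluation on a triangulated mesh possible without a volumetric grid.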

1809.00846 2026-06-04 cs.LG cs.CV cs.SY eess.SY stat.ML 版本更新

Towards Understanding Regularization in Batch Normalization

向批量归一化中的正则化理解迈进

Ping Luo, Xinjiang Wang, Wenqi Shao, Zhanglin Peng

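Affiliations * The Chinese University of Hong Kong; SenseTime Research; The University of Hong Kong

AI summary Through theoretical analysis, this paper examines how batch normalization affects convergence and generalization in neural network training, reveals its role as an implicit regularizer, and verifies experimentally that batch normalization in convolutional neural networks shows the same regularization traits.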

Comments International Conference on Learning Representations (ICLR)

详情
AI中文摘要

批量归一化(BN)在神经网络训练中提高了收敛性和泛化能力。本工作从理论上理解这些现象。我们通过使用由核层、BN层和非线性激活函数组成的基本网络块来分析BN。这个基本网络帮助我们从三个方面理解BN的影响。首先,将BN视为隐式正则化,可以将其分解为总体归一化(PN)和伽马衰减作为显式正则化。其次,BN和正则化的学习动态表明,使用大最大有效学习率训练可以收敛。第三,通过统计力学探讨BN的泛化能力。实验表明,卷积神经网络中的BN共享上述分析中的正则化特性。

英文摘要

Batch Normalization (BN) improves both convergence and generalization in training neural networks. This work seeks to understand these phenomena theoretically. We analyze BN by using a basic block of neural networks, consisting of a kernel layer, a BN layer, and a nonlinear activation function. This basic network helps us understand the impacts of BN in three aspects. First, by viewing BN as an implicit regularizer, BN can be decomposed into population normalization (PN) and gamma decay as an explicit regularization. Second, the learning dynamics of BN and the regularization show that training converges with large maximum and effective learning rates. Third, the generalization of BN is explored by using statistical mechanics. Experiments demonstrate that BN in convolutional neural networks shares the same traits of regularization as the above analyses.
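The basic block described above (kernel layer, BN layer, nonlinear activation) can be sketched in a few lines. The sketch assumes a single output unit, batch statistics computed over the mini-batch, and a ReLU activation. The abstract does not fix these choices, so they and all names are illustrative.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of scalar pre-activations with batch mean and variance."""
    mean = sum(batch) / len(batch)
    var = sum((h - mean) ** 2 for h in batch) / len(batch)
    return [gamma * (h - mean) / math.sqrt(var + eps) + beta for h in batch]


def basic_block(inputs, weights, gamma=1.0, beta=0.0):
    """Kernel layer (inner product), then BN, then ReLU, for one output unit."""
    pre = [sum(w * x for w, x in zip(weights, xs)) for xs in inputs]
    normed = batch_norm(pre, gamma, beta)
    return [max(0.0, h) for h in normed]


# Mini-batch of three 2D inputs (illustrative); pre-activations are 1, 2, 3.
out = basic_block([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0])
```

The scale `gamma` is the parameter that "gamma decay" in the abstract refers to.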

1806.02998 2026-06-04 cs.CV cs.NA math.GN math.NA 版本更新

Logarithmic mathematical morphology: a new framework adaptive to illumination changes

对数数学形态学:一种适应光照变化的新框架

Guillaume Noyel

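Affiliations * University of Strathclyde Institute of Global Public Health; International Prevention Research Institute; iPRI Lyon, France

AI summary This paper proposes a new mathematical morphology framework based on the Logarithmic Image Processing model. The framework adapts to illumination changes caused by variations in exposure time or light intensity, and by defining logarithmic dilation and erosion operators it processes low-contrast information more effectively.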

详情
Journal ref
23rd Iberoamerican Congress on Pattern Recognition (CIARP 2018), Nov 2018, Madrid, Spain. Springer International Publishing, Lecture Notes in Computer Science, 11401, pp.453-461, 2019, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. https://atvs.ii.uam.es/ciarp2018/
AI中文摘要

通过基于成像物理的对数图像处理(LIP)模型,定义了一组适应由曝光时间或光照强度变化引起的光照变化的数学形态学(MM)算子。该模型与人类视觉一致。基本算子,即对数膨胀和对数腐蚀,通过结构函数的LIP加法定义。这些两个互补算子的结合给出了形态学滤波器,即对数开运算和闭运算,用于模式识别。建立了“经典”膨胀和腐蚀与其对数版本之间的数学关系,从而方便了其实现。在模拟和真实图像上的结果表明,对数MM在低对比信息上比“经典”MM更有效。

英文摘要

A new set of mathematical morphology (MM) operators adaptive to illumination changes caused by variation of exposure time or light intensity is defined thanks to the Logarithmic Image Processing (LIP) model. This model, based on the physics of acquisition, is consistent with human vision. The fundamental operators, the logarithmic-dilation and the logarithmic-erosion, are defined with the LIP-addition of a structuring function. The combination of these two adjunct operators gives morphological filters, namely the logarithmic-opening and closing, useful for pattern recognition. The mathematical relation existing between "classical" dilation and erosion and their logarithmic versions is established, facilitating their implementation. Results on simulated and real images show that logarithmic-MM is more efficient on low-contrast information than "classical" MM.
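For orientation, the LIP-addition is usually written as below in the LIP literature, where f and g are gray-level functions taking values in [0, M) and M is the upper bound of the gray-level range (for example 256 for 8-bit images). The form of the logarithmic-dilation shown next to it, a supremum of LIP-sums with a structuring function b over its domain D_b, is an assumption based on the abstract's description rather than a quotation of the paper, and the symbol ⊕ is illustrative notation.

```latex
\[
  f \oplus g \;=\; f + g - \frac{f\,g}{M},
  \qquad
  \delta_b(f)(x) \;=\; \sup_{h \in D_b} \bigl\{ f(x - h) \oplus b(h) \bigr\}.
\]
```

Replacing ⊕ by ordinary addition recovers the classical dilation by a structuring function.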

1904.05814 2026-06-04 cs.CV cs.GR cs.LG cs.NA cs.RO math.NA 版本更新

Probabilistic Permutation Synchronization using the Riemannian Structure of the Birkhoff Polytope

利用Birkhoff多面体的Riemannian结构的概率排列同步

Tolga Birdal, Umut Şimşekli

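AI summary This paper proposes a new geometric and probabilistic approach to synchronizing correspondences across multiple sets of objects or images. Its core methods are Birkhoff-Riemannian L-BFGS, which optimizes the relaxed cycle consistency loss, and Birkhoff-Riemannian Langevin Monte Carlo, which generates samples on the Birkhoff Polytope and estimates the confidence of the solutions.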

Comments To appear as oral presentation at CVPR 2019. 20 pages including the supplementary material

详情
AI中文摘要

我们提出了一种全新的几何和概率方法,用于在多个对象或图像集合之间同步对应关系。具体而言,我们提出了两个算法:(1) Birkhoff-Riemannian L-BFGS用于以系统化的方式优化放松后的循环一致性损失的松弛版本;(2) Birkhoff-Riemannian Langevin Monte Carlo用于在Birkhoff多面体上生成样本并估计找到的解的置信度。为此,我们首先介绍了最近发展出的Birkhoff多面体的Riemannian几何。接着,我们引入了一种新的概率同步模型,形式为马尔可夫随机场(MRF)。最后,基于一阶retraction算子,我们将问题 formulation 为模拟随机微分方程,并设计了新的积分器。我们在合成和真实数据集上展示,我们能够以更快的收敛速度和可靠的置信度/不确定性估计获得高质量的多图匹配结果。

英文摘要

We present an entirely new geometric and probabilistic approach to synchronization of correspondences across multiple sets of objects or images. In particular, we present two algorithms: (1) Birkhoff-Riemannian L-BFGS for optimizing the relaxed version of the combinatorially intractable cycle consistency loss in a principled manner, (2) Birkhoff-Riemannian Langevin Monte Carlo for generating samples on the Birkhoff Polytope and estimating the confidence of the found solutions. To this end, we first introduce the very recently developed Riemannian geometry of the Birkhoff Polytope. Next, we introduce a new probabilistic synchronization model in the form of a Markov Random Field (MRF). Finally, based on the first order retraction operators, we formulate our problem as simulating a stochastic differential equation and devise new integrators. We show on both synthetic and real datasets that we achieve high quality multi-graph matching results with faster convergence and reliable confidence/uncertainty estimates.

1604.00970 2026-06-04 cs.CV cs.SY eess.SP eess.SY 版本更新

Extended Object Tracking: Introduction, Overview and Applications

扩展目标跟踪:介绍、概述与应用

Karl Granstrom, Marcus Baum, Stephan Reuter

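Affiliations * Department of Signals and Systems, Chalmers University of Technology

AI summary This article surveys current research in extended object tracking. It defines the problem and discusses how it differs from other types of object tracking, introduces two basic extended object tracking approaches, and summarizes applications involving camera, X-band radar, lidar, and RGB-D sensors.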

Comments 30 pages, 19 figures

详情
Journal ref
Journal of Advances in Information Fusion, Volume 12, Number 2, Pages 139-174, December 2016, ISSN 1557-6418
AI中文摘要

本文提供了一篇详尽的扩展目标跟踪当前研究的概述。我们给出了扩展目标跟踪问题的明确定义,并讨论了其与其他类型目标跟踪的界限。接下来,广泛讨论了扩展目标建模的不同方面。随后,我们介绍了两种基本且常用的方法——随机矩阵方法和基于卡尔曼滤波的星形形状方法。接下来一部分讨论了多个扩展目标的跟踪,并阐述了如何利用随机有限集(RFS)和非RFS多目标跟踪器来处理大量的可行关联假设。文章最后总结了当前的应用情况,突出了四个涉及摄像头、X波段雷达、激光雷达、红绿蓝深度(RGB-D)传感器的应用示例。

英文摘要

This article provides an elaborate overview of current research in extended object tracking. We provide a clear definition of the extended object tracking problem and discuss its delimitation from other types of object tracking. Next, different aspects of extended object modelling are extensively discussed. Subsequently, we give a tutorial introduction to two basic and widely used extended object tracking approaches: the random matrix approach and the Kalman filter-based approach for star-convex shapes. The next part treats the tracking of multiple extended objects and elaborates how the large number of feasible association hypotheses can be tackled using both Random Finite Set (RFS) and Non-RFS multi-object trackers. The article concludes with a summary of current applications, where four example applications involving camera, X-band radar, light detection and ranging (lidar), and red-green-blue-depth (RGB-D) sensors are highlighted.

1803.07187 2026-06-04 cs.CV cs.NA eess.IV math.NA 版本更新

Unveiling the invisible - mathematical methods for restoring and interpreting illuminated manuscripts

揭示无形之物 - 用于修复和解释手稿的数学方法

Luca Calatroni, Marie d'Autume, Rob Hocking, Stella Panayotova, Simone Parisotto, Paola Ricciardi, Carola-Bibiane Schönlieb

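AI summary This paper discusses mathematical methods for restoring and visualizing illuminated manuscripts, emphasizing the application and importance of digital image processing in the arts.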

详情
AI中文摘要

过去五十年来,数学方法在数字图像分析和处理方面的快速发展,主要集中在摄影、生物医学成像和各种工程领域。然而,艺术领域在此过程中大多被忽视,除了最近十年中少数例外。然而,随着艺术领域数字化的迅速兴起,艺术领域对数字图像处理方法的接受度正在增加,因此关注这一点的重要性也随之增加。本文讨论了一系列用于数字图像修复和数字可视化的方法,特别是手稿,因为它们传统上保持物理未受干扰,为数字操作提供了有趣的机会。同时,它们也展示了数学和数字修复作为通用和客观工具包在艺术领域中的可能性。

英文摘要

The last fifty years have seen an impressive development of mathematical methods for the analysis and processing of digital images, mostly in the context of photography, biomedical imaging and various forms of engineering. The arts have been mostly overlooked in this process, apart from a few exceptional works in the last ten years. With the rapid emergence of digitisation in the arts, however, the arts domain is becoming increasingly receptive to digital image processing methods and the importance of paying attention to this therefore increases. In this paper we discuss a range of mathematical methods for digital image restoration and digital visualisation for illuminated manuscripts. The latter provide an interesting opportunity for digital manipulation because they traditionally remain physically untouched. At the same time they also serve as an example for the possibilities mathematics and digital restoration offer as a generic and objective toolkit for the arts.

1903.05079 2026-06-04 math.NA cs.CV cs.NA 版本更新

A total variation based regularizer promoting piecewise-Lipschitz reconstructions

一种基于总变分的正则化器,促进分段Lipschitz重建

Martin Burger, Yury Korolev, Carola-Bibiane Schönlieb, Christiane Stollenwerk

发表机构 * Department of Applied Mathematics and Theoretical Physics, University of Cambridge(剑桥大学应用数学与理论物理系)

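AI summary: The paper proposes a new regularizer in the total variation family that promotes reconstructions with a given, possibly spatially varying, Lipschitz constant. It proves regularizing properties of the functional and studies its connections to total variation and to infimal-convolution-type regularizers TVLp, establishing topological equivalence. Numerical experiments indicate performance similar to total generalized variation, with a free parameter that has an intuitive interpretation as a local estimate of the gradient norm. The regularizer also offers a natural route to spatially adaptive regularization.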

Comments 12 pages, 4 figures, accepted for publication in SSVM conference proceedings 2019

详情
AI中文摘要

我们介绍了一种新的总变分家族正则化器,该正则化器促进具有给定Lipschitz常数(也可以空间变化)的重建。我们证明了该功能的正则化性质,并研究了其与总变分和infimal convolution类型正则化器TVLp的联系,特别是建立了拓扑等价性。我们的数值实验表明,所提出的正则化器可以达到与总广义变分相似的性能,同时具有非常直观的自由参数解释,其自由参数仅仅是梯度范数的局部估计。它还提供了一种自然的空间自适应正则化方法。

英文摘要

We introduce a new regularizer in the total variation family that promotes reconstructions with a given Lipschitz constant (which can also vary spatially). We prove regularizing properties of this functional and investigate its connections to total variation and infimal convolution type regularizers TVLp and, in particular, establish topological equivalence. Our numerical experiments show that the proposed regularizer can achieve performance similar to total generalized variation while having the advantage of a very intuitive interpretation of its free parameter, which is just a local estimate of the norm of the gradient. It also provides a natural approach to spatially adaptive regularization.

1902.10414 2026-06-04 math.NA cs.CV cs.NA 版本更新

Computing Nonlinear Eigenfunctions via Gradient Flow Extinction

通过梯度流灭绝计算非线性本函数

Leon Bungert, Martin Burger, Daniel Tenbrinck

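AI summary: The paper studies the computation of nonlinear eigenfunctions via the extinction profiles of gradient flows. It analyzes a scheme that recursively subtracts such eigenfunctions from the data and shows that, in some cases such as one-dimensional total variation, this yields a decomposition of the data into eigenfunctions. Numerical experiments apply extinction profiles and the gradient flow to spectral graph clustering.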

Comments 12 pages, 5 figures, accepted for publication in SSVM conference proceedings 2019

详情
AI中文摘要

在本文中,我们研究了通过梯度流灭绝轮廓计算非线性本函数的问题。我们分析了一种递归减去此类本函数的方案,并证明在某些情况下(例如一维总变分),该过程能将数据分解为本函数。我们讨论了使用灭绝轮廓和梯度流进行谱图聚类的数值实验结果,如在机器学习应用中所使用的那样。

英文摘要

In this work we investigate the computation of nonlinear eigenfunctions via the extinction profiles of gradient flows. We analyze a scheme that recursively subtracts such eigenfunctions from given data and show that this procedure yields a decomposition of the data into eigenfunctions in some cases, for instance for the 1-dimensional total variation. We discuss results of numerical experiments in which we use extinction profiles and the gradient flow for the task of spectral graph clustering as used, e.g., in machine learning applications.

1902.05343 2026-06-04 cs.CV cs.RO cs.SY eess.SY 版本更新

Study of dynamical system based obstacle avoidance via manipulating orthogonal coordinates

基于操纵正交坐标的动态系统障碍避障研究

Weiya Ren

发表机构 * Artificial Intelligence Research Center of National Innovation Institute of Defense Technology(国家创新技术研究院人工智能研究中心) Tianjin Artificial Intelligence Innovation Center(天津人工智能创新中心)

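AI summary: The paper studies dynamical-system-based obstacle avoidance. A modulation matrix is constructed from orthogonal coordinates, so that the direction of the new trajectory can be written as a linear combination of those coordinates. By introducing a rotation matrix, an orthogonal-coordinate manipulation approach is proposed to address the local minimum problem and to produce more reasonable motions in 3-D or higher-dimensional spaces. The method also offers a solution for patrolling around a convex shape, and experiments demonstrate its effectiveness.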

详情
AI中文摘要

在本文中,我们考虑了基于动态系统的障碍避障问题。通过引入正交坐标,开发了调制矩阵,使调制矩阵更加合理。新轨迹的方向可通过正交坐标的线性组合表示。通过引入旋转矩阵,提出了一种正交坐标操纵方法,以解决局部最小问题,并在三维或更高维度空间中提供更合理的运动。所提出的方法还为围绕凸形体巡逻提供了解决方案。在几个设计的动态系统上的实验结果展示了所提出方法的有效性。

英文摘要

In this paper, we consider the general problem of obstacle avoidance based on dynamical systems. The modulation matrix is developed by introducing orthogonal coordinates, which makes the modulation matrix more reasonable. The new trajectory's direction can be represented by a linear combination of the orthogonal coordinates. An orthogonal-coordinate manipulation approach is proposed by introducing a rotation matrix to solve the local minimum problem and provide more reasonable motions in 3-D or higher-dimensional spaces. The proposed method also provides a solution for patrolling around a convex shape. Experimental results on several designed dynamical systems demonstrate the effectiveness of the proposed approach.

1807.04638 2026-06-04 math.NA cs.CV cs.NA 版本更新

PDE-constrained LDDMM via geodesic shooting and inexact Gauss-Newton-Krylov optimization using the incremental adjoint Jacobi equations

基于偏微分方程约束的LDDMM通过测地线射击和近似高斯-牛顿-克罗内克优化使用增量伴随雅可比方程

Monica Hernandez

发表机构 * Computer Sciences Department(计算机科学系) Aragon Institute on Engineering Research(阿拉贡工程研究院) University of Zaragoza(萨拉戈萨大学)

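AI summary: The paper proposes a PDE-constrained LDDMM method based on geodesic shooting and inexact Gauss-Newton-Krylov optimization. The method is parameterized in the space of initial velocity fields and uses the adjoint and incremental adjoint Jacobi equations, which avoids the complex dependence on the initial velocity field in the derivations and provides geodesic paths.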

详情
AI中文摘要

在偏微分方程约束的大变形 diffeomorphic度量映射框架下提出的一类非刚性注册方法是一个特别有趣且具有物理意义的 diffeomorphic 注册方法集合。在该框架中,不精确的牛顿-克罗内克优化已显示出卓越的数值精度和极快的收敛速度。然而,非稳态速度场的伽辽金表示法并未提供适当的测地线路径。在本文中,我们提出了一种在初始速度场空间中参数化的偏微分方程约束LDDMM方法。梯度和Hessian-向量积的推导是在最终速度场上进行,并通过伴随和增量伴随雅可比方程反向传输。这样,我们避免了在推导和计算伴随方程及其增量版本时对初始速度场的复杂依赖。所提出的方法在偏微分方程约束LDDMM框架内提供了测地线,并展示了与基准PDE约束LDDMM和EPDiff-LDDMM方法相媲美的性能。

英文摘要

The class of non-rigid registration methods proposed in the framework of PDE-constrained Large Deformation Diffeomorphic Metric Mapping is a particularly interesting family of physically meaningful diffeomorphic registration methods. Inexact Newton-Krylov optimization has shown excellent numerical accuracy and an extraordinarily fast convergence rate in this framework. However, the Galerkin representation of the non-stationary velocity fields does not provide proper geodesic paths. In this work, we propose a method for PDE-constrained LDDMM parameterized in the space of initial velocity fields under the EPDiff equation. The gradient and the Hessian-vector products are derived on the final velocity field and transported backward using the adjoint and the incremental adjoint Jacobi equations. This way, we avoid the complex dependence on the initial velocity field in the derivations and the computation of the adjoint equation and its incremental counterpart. The proposed method provides geodesics in the framework of PDE-constrained LDDMM, and it shows performance competitive with benchmark PDE-constrained LDDMM and EPDiff-LDDMM methods.

1805.11572 2026-06-04 cs.CV cs.LG cs.NA math.NA stat.ML 版本更新

Adversarial Regularizers in Inverse Problems

对抗正则化在反问题中的应用

Sebastian Lunz, Ozan Öktem, Carola-Bibiane Schönlieb

发表机构 * DAMTP Department of Mathematics(DAMTP数学系) University of Cambridge(剑桥大学) KTH - Royal Institute of Technology(皇家理工学院)

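AI summary: The paper proposes a framework that uses a neural network as the regularization functional for inverse problems. The network learns to discriminate between the distribution of ground-truth images and the distribution of unregularized reconstructions, and is then applied by solving the corresponding variational problem.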

Comments published at NeurIPS 2018

详情
AI中文摘要

医学成像和计算机视觉中的反问题传统上使用纯模型方法来解决。其中,变分正则化模型是其中最流行的方法之一。我们提出了一种新的框架,用于将数据驱动的方法应用于反问题,使用神经网络作为正则化函数。网络学习区分真实图像分布与未正则化重建分布的分布。一旦训练完成,网络通过求解相应的变分问题应用于反问题。与其他数据驱动的反问题方法不同,该算法即使在只有无监督训练数据可用的情况下也能应用。实验展示了该框架在BSDS数据集上的去噪潜力以及在LIDC数据集上的计算机断层扫描重建潜力。

英文摘要

Inverse problems in medical imaging and computer vision are traditionally solved using purely model-based methods. Among those, variational regularization models are one of the most popular approaches. We propose a new framework for applying data-driven approaches to inverse problems, using a neural network as a regularization functional. The network learns to discriminate between the distribution of ground truth images and the distribution of unregularized reconstructions. Once trained, the network is applied to the inverse problem by solving the corresponding variational problem. Unlike other data-based approaches for inverse problems, the algorithm can be applied even if only unsupervised training data is available. Experiments demonstrate the potential of the framework for denoising on the BSDS dataset and for computed tomography reconstruction on the LIDC dataset.

1707.09715 2026-06-04 eess.SY cs.CV cs.RO cs.SY 版本更新

Automatic Crack Detection in Built Infrastructure Using Unmanned Aerial Vehicles

使用无人机自动检测建筑基础设施裂缝

Manh Duong Phung, Van Truong Hoang, Tran Hiep Dinh, Quang Ha

发表机构 * School of Electrical Mechanical and Mechatronic Systems, University of Technology Sydney, Australia(电气机械与机电系统学院,悉尼技术大学,澳大利亚)

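AI summary: The paper proposes a crack detection method for built infrastructure that combines UAV-based data collection with histogram analysis; automating the whole process shortens inspection time and reduces safety hazards.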

Comments In Proceedings of the 34th International Symposium on Automation and Robotics in Construction (ISARC), pp. 823-829, Taipei, Taiwan, 2017

详情
AI中文摘要

本文针对建筑基础设施健康监测中至关重要的裂缝检测问题,提出了一种包含两个阶段的方法:使用无人机(UAV)进行数据采集和利用直方图分析进行裂缝检测。首先,利用激光扫描仪创建结构的3D模型,然后提取几何属性以生成用于导航无人机拍摄结构图像的路径点。接着,将从重叠视野中获取的图像拼接在一起,通过直方图分析和峰值检测进行聚类,最后利用局部自适应阈值识别潜在裂缝。整个过程自动化进行,从而显著提高了检查时间并最小化了安全风险。已开发出原型系统进行评估,并包含实验结果。

英文摘要

This paper addresses the problem of crack detection, which is essential for health monitoring of built infrastructure. Our approach includes two stages: data collection using unmanned aerial vehicles (UAVs) and crack detection using histogram analysis. For the data collection, a 3D model of the structure is first created by using laser scanners. Based on the model, geometric properties are extracted to generate the waypoints necessary for navigating the UAV to take images of the structure. The images obtained from overlapping fields of view are then stitched together. The resulting image is clustered by histogram analysis and peak detection. Potential cracks are finally identified by using locally adaptive thresholds. The whole process is carried out automatically, so that the inspection time is significantly reduced while safety hazards can be minimised. A prototype system has been developed for evaluation, and experimental results are included.
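The final detection step, identifying candidate cracks with locally adaptive thresholds, can be illustrated with a minimal Python sketch. The abstract does not specify the thresholding rule, so the rule used here (flag a pixel that is darker than its neighbourhood mean by more than `offset`) and the names `local_adaptive_threshold`, `radius`, and `offset` are assumptions for illustration only, not the authors' implementation.

```python
def local_adaptive_threshold(image, radius=1, offset=10):
    """Flag pixels darker than their neighbourhood mean by more than `offset`.

    `image` is a list of rows of grey levels (0-255). Returns a same-sized
    list of rows of 0/1 flags, where 1 marks a potential crack pixel.
    The rule itself is an assumed, simplified stand-in.
    """
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Collect the window around (r, c), clipped at the image border.
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            if image[r][c] < local_mean - offset:
                mask[r][c] = 1
    return mask


# Bright surface (grey level 200) with a dark vertical crack (50) in column 2.
surface = [[200, 200, 50, 200, 200] for _ in range(5)]
crack_mask = local_adaptive_threshold(surface)
```

On this toy input only the dark column is flagged. Because the threshold follows the local mean, a uniformly darker or brighter region would not be flagged as a whole, which is the usual motivation for a local rather than a global threshold.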

1805.12521 2026-06-04 math.NA cs.CV cs.NA 版本更新

Whole Brain Susceptibility Mapping Using Harmonic Incompatibility Removal

利用谐波不兼容性去除的全脑susceptibility映射

Chenglong Bao, Jae Kyu Choi, Bin Dong

发表机构 * Yau Mathematical Sciences Center, Tsinghua University(清华大学丘成桐数学科学中心) School of Mathematical Sciences, Tongji University(同济大学数学科学学院)

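AI summary: The paper proposes a regularization-based susceptibility reconstruction model that adds a sparsity-based regularization term on the harmonic incompatibility, with the aim of improving whole-brain susceptibility mapping.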

Comments Accepted for publication in SIAM Journal on Imaging Sciences

详情
AI中文摘要

定量susceptibility映射(QSM)旨在通过利用磁共振信号中的相位数据求解场到源的逆问题,从而可视化三维的susceptibility分布。然而,由于积分核的傅里叶变换在频域中存在零点,逆问题是病态的。尽管已经提出了许多基于正则化的模型来克服这个问题,但场数据中的不兼容性并未得到足够的关注,导致恢复质量下降。在本文中,我们表明QSM的数据采集过程本质上会在测量的局部场中生成谐波不兼容性。基于这一发现,我们提出了一种新的基于正则化的susceptibility重建模型,并在谐波不兼容性上引入了基于稀疏性的正则化项。数值实验表明,所提出的方法在性能上优于现有的方法。

英文摘要

Quantitative susceptibility mapping (QSM) aims to visualize the three-dimensional susceptibility distribution by solving the field-to-source inverse problem using the phase data in the magnetic resonance signal. However, the inverse problem is ill-posed since the Fourier transform of the integral kernel has zeroes in the frequency domain. Although numerous regularization-based models have been proposed to overcome this problem, the incompatibility in the field data has not received enough attention, which leads to deterioration of the recovery. In this paper, we show that the data acquisition process of QSM inherently generates a harmonic incompatibility in the measured local field. Based on this finding, we propose a novel regularization-based susceptibility reconstruction model with an additional sparsity-based regularization term on the harmonic incompatibility. Numerical experiments show that the proposed method achieves better performance than the existing approaches.

1711.04178 2026-06-04 cs.LG cs.CV cs.NA math.NA stat.ML 版本更新

CUR Decompositions, Similarity Matrices, and Subspace Clustering

CUR分解、相似矩阵与子空间聚类

Akram Aldroubi, Keaton Hamm, Ahmet Bugra Koku, Ali Sekmen

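AI summary: The paper presents a general framework for subspace clustering based on the CUR decomposition. The similarity matrices it constructs give exact clustering in the noise-free case, the decomposition yields many distinct similarity matrices that help with noisy data, and two known subspace clustering methods are shown to follow from it.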

Comments Approximately 30 pages. Current version contains improved algorithm and numerical experiments from the previous version

详情
AI中文摘要

本文提出了一种利用CUR分解解决子空间聚类问题的通用框架。CUR分解提供了一种自然方法来构造数据来自未知子空间联盟$\mathscr{U}=\underset{i=1}{\overset{M}\bigcup}S_i$的相似矩阵。由此构造的相似矩阵在无噪声情况下能够实现精确聚类。此外,这种分解还能从给定数据集生成多种不同的相似矩阵,从而具有足够的灵活性以对含噪声数据进行准确聚类。我们还展示了两种已知的子空间聚类方法可以从CUR分解中推导出来。本文还提出了一种基于相似矩阵理论构造的算法,并在合成和真实数据上进行了实验以测试该方法。此外,本文还利用了基于CUR的相似矩阵的改进版本,提供了一种启发式算法用于子空间聚类;该算法在Hopkins155运动分割数据集上的聚类性能目前最佳。

英文摘要

A general framework for solving the subspace clustering problem using the CUR decomposition is presented. The CUR decomposition provides a natural way to construct similarity matrices for data that come from a union of unknown subspaces $\mathscr{U}=\underset{i=1}{\overset{M}\bigcup}S_i$. The similarity matrices thus constructed give the exact clustering in the noise-free case. Additionally, this decomposition gives rise to many distinct similarity matrices from a given set of data, which allow enough flexibility to perform accurate clustering of noisy data. We also show that two known methods for subspace clustering can be derived from the CUR decomposition. An algorithm based on the theoretical construction of similarity matrices is presented, and experiments on synthetic and real data are presented to test the method. Additionally, an adaptation of our CUR based similarity matrices is utilized to provide a heuristic algorithm for subspace clustering; this algorithm yields the best overall performance to date for clustering the Hopkins155 motion segmentation dataset.

1812.04303 2026-06-04 cs.CV cs.GR cs.NA math.NA 版本更新

Analytic heuristics for a fast DSC-MRI

动态磁共振成像的分析启发法

Marco Virgulin, Marco Castellaro, Enrico Grisan, Fabio Marcuzzi

发表机构 * Department of Mathematics, Padua University(数学系,帕多瓦大学) Department of Information Engineering, Padua University(信息工程系,帕多瓦大学)

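AI summary: The paper proposes a deterministic approach to reconstructing Dynamic Susceptibility Contrast MRI data and compares it with the compressed sensing solution from the literature. The mathematical analysis shows that the problem is computationally intractable because of its non-polynomial complexity, but it suggests simple heuristics that perform well. Results are given on real images and on artificial phantoms with added noise.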

详情
AI中文摘要

在本文中,我们提出了一种确定性方法用于动态磁敏感对比磁共振成像数据的重建,并将其与文献中已有的压缩感知解决方案进行比较。我们的研究基于对问题的数学分析,该问题由于非多项式复杂度而计算上不可行,但提出了简单的启发法,其表现相当不错。我们给出了在真实图像和加噪人工假体上的结果。

英文摘要

In this paper we propose a deterministic approach for the reconstruction of Dynamic Susceptibility Contrast magnetic resonance imaging data and compare it with the compressed sensing solution existing in the literature for the same problem. Our study is based on the mathematical analysis of the problem, which is computationally intractable because of its non polynomial complexity, but suggests simple heuristics that perform quite well. We give results on real images and on artificial phantoms with added noise.

1805.07857 2026-06-04 cs.LG cs.CV cs.NA math.NA stat.ML 版本更新

Parallel Transport Convolution: A New Tool for Convolutional Neural Networks on Manifolds

平行运输卷积:用于流形上卷积神经网络的新工具

Stefan C. Schonsheck, Bin Dong, Rongjie Lai

发表机构 * Rensselaer Polytechnic Institute(伦斯拉尔理工学院)

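AI summary: The paper proposes parallel transport convolution (PTC), a generalization of convolution to Riemannian manifolds and their discrete counterparts. PTC preserves compactly supported filters, directionality, and transferability across manifolds, which enables wavelet-like operations and deep convolutional neural networks on curved domains.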

Comments 10 pages

详情
AI中文摘要

卷积在科学和工程中的各种应用中扮演了重要的角色,是卷积神经网络中最关键的操作。近年来,研究者对在曲面域(如流形和图)上推广卷积的兴趣增长,但现有方法无法保持欧几里得卷积的所有理想特性,即紧凑支持滤波器、方向性和跨不同流形的可转移性。本文开发了一种新的卷积操作扩展,称为平行运输卷积(PTC),应用于黎曼流形及其离散对应物。PTC基于平行运输,能够沿流形传输信息并内在保持方向性。PTC允许构建具有紧凑支持的滤波器,并且对流形变形具有鲁棒性。这使得我们能够执行小波样操作,并在曲面域上定义深度卷积神经网络。

英文摘要

Convolution has played a prominent role in various applications in science and engineering for many years. It is the most important operation in convolutional neural networks. There has been growing research interest in generalizing convolutions to curved domains such as manifolds and graphs. However, existing approaches cannot preserve all the desirable properties of Euclidean convolutions, namely compactly supported filters, directionality, and transferability across different manifolds. In this paper we develop a new generalization of the convolution operation, referred to as parallel transport convolution (PTC), on Riemannian manifolds and their discrete counterparts. PTC is designed based on parallel transport, which is able to translate information along a manifold and to intrinsically preserve directionality. PTC allows for the construction of compactly supported filters and is also robust to manifold deformations. This enables us to perform wavelet-like operations and to define deep convolutional neural networks on curved domains.

1804.01983 2026-06-04 math.NA cs.CV cs.LG cs.NA 版本更新

High-dimension Tensor Completion via Gradient-based Optimization Under Tensor-train Format

通过张量列车格式的梯度优化实现高维张量补全

Longhao Yuan, Qibin Zhao, Lihua Gui, Jianting Cao

发表机构 * Graduate School of Engineering, Saitama Institute of Technology, Japan(日本埼玉科技大学工学研究科) Tensor Learning Unit, RIKEN Center for Advanced Intelligence Project (AIP), Japan(日本RIKEN先进人工智能项目(AIP)张量学习单元) School of Automation, Guangdong University of Technology, China(广东技术大学自动化学院) School of Computer Science and Technology, Hangzhou Dianzi University, China(杭州电子科技大学计算机科学与技术学院)

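AI summary: The paper proposes gradient-based optimization in tensor-train format for recovering missing entries of high-order tensors. It seeks a low-rank tensor-train decomposition that captures the latent features of the data, solves the completion problem with gradient descent algorithms, and introduces a visual data tensorization method that improves performance.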

详情
AI中文摘要

张量列车(TT)分解因其在高阶张量中的强大表示能力和稳定性而受到关注。本文提出了一种新的方法,用于恢复由高阶张量表示的不完整数据中的缺失条目。我们尝试找到不完整数据的低秩TT分解,以捕捉整个数据集的潜在特征,然后重建缺失条目。通过应用梯度下降算法,利用优化模型高效地解决了张量补全问题。我们提出了两种基于TT的算法:张量列车加权优化(TT-WOPT)和张量列车随机梯度下降(TT-SGD),用于优化TT分解因子。此外,提出了一种名为视觉数据张量化(VDT)的方法,将视觉数据转换为高阶张量,从而提升了我们算法的性能。在合成数据和视觉数据的实验中,我们的算法在高阶、高缺失率和大规模张量补全情况下表现出高效和优越的性能,相比最先进的补全算法。

英文摘要

Tensor train (TT) decomposition has drawn attention due to its powerful representation ability and performance stability for high-order tensors. In this paper, we propose a novel approach to recover the missing entries of incomplete data represented by higher-order tensors. We attempt to find the low-rank TT decomposition of the incomplete data which captures the latent features of the whole data and then reconstruct the missing entries. By applying gradient descent algorithms, the tensor completion problem is efficiently solved by optimization models. We propose two TT-based algorithms: Tensor Train Weighted Optimization (TT-WOPT) and Tensor Train Stochastic Gradient Descent (TT-SGD) to optimize TT decomposition factors. In addition, a method named Visual Data Tensorization (VDT) is proposed to transform visual data into higher-order tensors, resulting in the performance improvement of our algorithms. The experiments on synthetic data and visual data show high efficiency and performance of our algorithms compared to the state-of-the-art completion algorithms, especially in high-order, high missing rate, and large-scale tensor completion situations.
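To make the reconstruction step concrete: in TT format each tensor entry is a product of small matrices, one per mode, so once the factors are fitted a missing entry can be read off by evaluating that product. The Python sketch below only illustrates this evaluation. The function name `tt_entry` and the toy cores are hypothetical and are not taken from TT-WOPT or TT-SGD.

```python
def tt_entry(cores, index):
    """Evaluate one entry of a tensor stored in tensor-train (TT) format.

    cores[k][i] is the r_{k-1} x r_k matrix (a list of rows) selected by
    index i in mode k. The first and last TT ranks are 1, so the chained
    product collapses to a single number.
    """
    row = cores[0][index[0]][0][:]  # 1 x r_1 row vector
    for core, i in zip(cores[1:], index[1:]):
        matrix = core[i]
        # Row vector times matrix.
        row = [
            sum(row[a] * matrix[a][b] for a in range(len(row)))
            for b in range(len(matrix[0]))
        ]
    return row[0]


# Toy 2 x 2 x 2 tensor T[i, j, k] = i + j + k with TT ranks (1, 2, 2, 1).
cores = [
    [[[1, i]] for i in range(2)],           # mode 1: 1 x 2 matrices
    [[[1, j], [0, 1]] for j in range(2)],   # mode 2: 2 x 2 matrices
    [[[k], [1]] for k in range(2)],         # mode 3: 2 x 1 matrices
]
value = tt_entry(cores, (1, 0, 1))  # equals 1 + 0 + 1
```

The toy tensor has 8 entries but is described by 16 core numbers, so nothing is saved at this size. The storage advantage of the format appears for higher orders, where the number of core parameters grows linearly with the order while the number of tensor entries grows exponentially.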

1811.12084 2026-06-04 cs.CV cs.LG cs.NA math.AP math.NA 版本更新

Networks for Nonlinear Diffusion Problems in Imaging

图像中非线性扩散问题的网络

Simon Arridge, Andreas Hauptmann

发表机构 * Department of Computer Science(计算机科学系) University College London(伦敦大学学院)

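AI summary: The paper proposes DiffNet, a network architecture based on nonlinear diffusion processes for diffusion-related problems in imaging. Its explicit updates give better interpretability and generalisability than classical convolutional neural networks, and on the inverse problem of nonlinear diffusion it achieves results competitive with U-Net.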

详情
AI中文摘要

许多成像和视觉任务近期通过深度学习方法,特别是卷积神经网络的应用,经历了重大变革。这些方法在某些应用中取得了显著成果,即使这些应用并不明显表明卷积适合捕捉底层物理。在本文中,我们开发了一种基于非线性扩散过程的网络架构,称为DiffNet。通过设计,我们获得了一种适合图像中扩散相关问题的非线性网络架构。此外,所执行的更新是显式的,从而比传统卷积神经网络架构获得了更好的可解释性和泛化能力。在STL-10图像数据集上测试DiffNet在非线性扩散逆问题中的性能,使用Perona-Malik滤波器。我们获得的结果与已建立的U-Net架构具有竞争力,参数数量和必要的训练数据较少。

英文摘要

A multitude of imaging and vision tasks have recently seen a major transformation by deep learning methods and in particular by the application of convolutional neural networks. These methods achieve impressive results, even for applications where it is not apparent that convolutions are suited to capture the underlying physics. In this work we develop a network architecture based on nonlinear diffusion processes, named DiffNet. By design, we obtain a nonlinear network architecture that is well suited for diffusion-related problems in imaging. Furthermore, the performed updates are explicit, by which we obtain better interpretability and generalisability compared to classical convolutional neural network architectures. The performance of DiffNet is tested on the inverse problem of nonlinear diffusion with the Perona-Malik filter on the STL-10 image dataset. We obtain results competitive with the established U-Net architecture, with a fraction of the parameters and necessary training data.
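For readers unfamiliar with the Perona-Malik filter mentioned above, the following Python sketch shows the classical explicit update for nonlinear diffusion on a 1-D signal. It is not DiffNet: the diffusivity g(d) = 1 / (1 + (d / kappa)^2) is one of the standard Perona-Malik choices, and the values of `kappa`, `dt`, and `steps` are arbitrary assumptions chosen for the toy input.

```python
def perona_malik_step(u, kappa=1.0, dt=0.25):
    """One explicit Perona-Malik update on a 1-D signal with zero-flux ends."""
    def flux(d):
        # Edge-stopping diffusivity g(d) = 1 / (1 + (d / kappa)^2), times d.
        return d / (1.0 + (d / kappa) ** 2)

    n = len(u)
    out = []
    for i in range(n):
        right = flux(u[i + 1] - u[i]) if i + 1 < n else 0.0
        left = flux(u[i] - u[i - 1]) if i > 0 else 0.0
        out.append(u[i] + dt * (right - left))
    return out


def perona_malik(u, steps=20, kappa=1.0, dt=0.25):
    """Apply `steps` explicit updates in sequence."""
    for _ in range(steps):
        u = perona_malik_step(u, kappa, dt)
    return u


# Noisy step edge: small wiggles around 0 on the left and around 10 on the right.
noisy = [0.2, -0.1, 0.1, -0.2, 10.1, 9.8, 10.2, 9.9]
smoothed = perona_malik(noisy)
```

Small differences between neighbours are diffused almost linearly, while the large jump in the middle receives a very small diffusivity and is largely preserved. Each update is an explicit formula in the current signal, which is the kind of structure the abstract refers to when it calls the performed updates explicit.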

1510.02923 2026-06-04 math.AP cs.CV cs.NA math.NA 版本更新

On 1-Laplacian Elliptic Equations Modeling Magnetic Resonance Image Rician Denoising

关于1-拉普拉斯椭圆方程建模磁共振图像Rician去噪

Adrian Martin, Emanuele Schiavi, Sergio Segura de Leon

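AI summary: The paper studies Rician denoising of magnetic resonance images modeled with total variation (TV) in a Bayesian or generalized Tikhonov framework. This leads to nonlinear elliptic equations involving the 1-Laplacian operator, with a reaction term defined through modified first order Bessel functions. The paper provides an existence theory and qualitative properties of the solutions, solves the non-smooth, non-convex minimization problem directly with a convergent proximal point algorithm, evaluates the method on synthetic and real MRI data, and applies it to diffusion tensor images.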

详情
AI中文摘要

在贝叶斯或广义Tikhonov框架中利用总变分(TV)建模磁共振图像(MRI)的Rician去噪问题,自然导致非线性椭圆方程的考虑。这些方程涉及所谓的1-拉普拉斯算子,需要特别注意问题的正确建模。通过引入描述数据的Rician统计学,通过一个带有反应项的奇异方程来定义,该反应项由修正的一阶贝塞尔函数定义。本文提供了存在性理论和其他解的定性性质。值得注意的是,相关函数的每个正全局极小值都是此类解之一。此外,本文直接使用收敛的近点算法解决这一非光滑非凸最小化问题。基于合成和真实MRI的数据结果表明,所提出的方法在Rician去噪中优于之前的基于TV的模型,这些模型通过正则化或凸化问题来处理。最后,还展示了在受Rician噪声强烈影响的MRI模态——扩散张量图像上的应用,并进行了讨论。

英文摘要

Modeling Rician denoising of magnitude Magnetic Resonance Images (MRI) in a Bayesian or generalized Tikhonov framework using Total Variation (TV) leads naturally to the consideration of nonlinear elliptic equations. These involve the so-called $1$-Laplacian operator, and special care is needed to properly formulate the problem. The Rician statistics of the data are introduced through a singular equation with a reaction term defined in terms of modified first order Bessel functions. An existence theory is provided here together with other qualitative properties of the solutions. Remarkably, each positive global minimum of the associated functional is one such solution. Moreover, we directly solve this non-smooth, non-convex minimization problem using a convergent Proximal Point Algorithm. Numerical results based on synthetic and real MRI demonstrate a better performance of the proposed method when compared to previous TV-based models for Rician denoising, which regularize or convexify the problem. Finally, an application to real Diffusion Tensor Images, an MRI modality strongly affected by Rician noise, is presented and discussed.

1804.06128 2026-06-04 math.NA cs.CV cs.NA 版本更新

Fast and Accurate Tensor Completion with Total Variation Regularized Tensor Trains

快速且准确的张量补全与总变分正则化张量列车

Ching-Yun Ko, Kim Batselier, Wenjian Yu, Ngai Wong

发表机构 * Department of Electrical and Electronic Engineering, The University of Hong Kong(香港大学电子工程系) Delft Center for Systems and Control, Delft University of Technology(代尔夫特理工大学系统与控制中心)

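AI summary: The paper proposes a tensor completion method based on tensor trains that accommodates total variation and Tikhonov regularization, improving speed and scalability; it is especially advantageous when only a very small fraction of the data is known.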

Comments 13 pages. Source code and supplemental materials are available via: https://github.com/IRENEKO/TTC Updates 11/13: included more comparisons and experimental results

详情
AI中文摘要

我们提出了一种基于张量列车的新张量补全方法。待补全的张量被建模为低秩张量列车,其中利用已知的张量条目及其坐标来更新张量列车。为图像和视频补全专门提出了一种新的张量列车初始化程序,已被证明能确保补全算法的快速收敛。张量列车框架还显示出能够轻松容纳总变分和Tikhonov正则化,因为它们具有低秩张量列车表示。图像和视频修复实验验证了所提方案在速度和可扩展性方面的优越性,在相似精度下比现有张量补全方法快了高达155倍。此外,我们展示了所提方案在已知数据极少时(例如,1%)相比现有算法具有显著优势。

英文摘要

We propose a new tensor completion method based on tensor trains. The to-be-completed tensor is modeled as a low-rank tensor train, where we use the known tensor entries and their coordinates to update the tensor train. A novel tensor train initialization procedure is proposed specifically for image and video completion, which is demonstrated to ensure fast convergence of the completion algorithm. The tensor train framework is also shown to easily accommodate Total Variation and Tikhonov regularization due to their low-rank tensor train representations. Image and video inpainting experiments verify the superiority of the proposed scheme in terms of both speed and scalability, where a speedup of up to 155X is observed compared to state-of-the-art tensor completion methods at a similar accuracy. Moreover, we demonstrate the proposed scheme is especially advantageous over existing algorithms when only tiny portions (say, 1%) of the to-be-completed images/videos are known.

1811.03621 2026-06-04 cs.HC cs.CV cs.LG cs.SY eess.SY stat.ML 版本更新

Satyam: Democratizing Groundtruth for Machine Vision

Satyam: 机器视觉领域地面真实数据的民主化

Hang Qiu, Krishna Chintalapudi, Ramesh Govindan

发表机构 * University of Southern California(南加州大学) Microsoft Research(微软研究院)

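AI summary: The paper presents Satyam, a system that simplifies groundtruth collection for machine vision so that a layperson can launch collection tasks with minimal effort, in support of ML-based vision systems for autonomous driving, traffic monitoring, and video surveillance.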

详情
AI中文摘要

机器学习的民主化已经导致了用于自动驾驶、交通监控和视频监控的基于机器学习的机器视觉系统。然而,没有大大简化收集地面真实数据的过程,真正的民主化就无法实现。这种地面真实数据的收集对于确保在不同条件下具有良好的性能是必要的。在本文中,我们提出了Satyam系统的设计和评估,这是一个首次出现的系统,使非专业人士能够以最小的努力启动机器视觉的地面真实数据收集任务。Satyam利用一个众包平台,亚马逊机械 Turk,并自动化了地面真实数据收集的几个具有挑战性的方面:创建和启动定制的网页用户界面任务以获取所需的真实数据,控制结果质量以应对垃圾邮件发送者和未经训练的工人,根据任务复杂性调整价格,过滤表现差的垃圾邮件发送者和工人,以及处理工人的报酬。我们通过几种流行的基准视觉数据集验证了Satyam,并展示了通过Satyam获得的真实数据与由训练专家获得的数据相当,并且在用于训练时提供匹配的机器学习性能。

英文摘要

The democratization of machine learning (ML) has led to ML-based machine vision systems for autonomous driving, traffic monitoring, and video surveillance. However, true democratization cannot be achieved without greatly simplifying the process of collecting groundtruth for training and testing these systems. This groundtruth collection is necessary to ensure good performance under varying conditions. In this paper, we present the design and evaluation of Satyam, a first-of-its-kind system that enables a layperson to launch groundtruth collection tasks for machine vision with minimal effort. Satyam leverages a crowdtasking platform, Amazon Mechanical Turk, and automates several challenging aspects of groundtruth collection: creating and launching of custom web-UI tasks for obtaining the desired groundtruth, controlling result quality in the face of spammers and untrained workers, adapting prices to match task complexity, filtering spammers and workers with poor performance, and processing worker payments. We validate Satyam using several popular benchmark vision datasets, and demonstrate that groundtruth obtained by Satyam is comparable to that obtained from trained experts and provides matching ML performance when used for training.

1802.00285 2026-06-04 cs.CV cs.RO cs.SY eess.SY 版本更新

Virtual-to-Real: Learning to Control in Visual Semantic Segmentation

虚拟到现实:学习在视觉语义分割中的控制

Zhang-Wei Hong, Chen Yu-Ming, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, Hsuan-Kung Yang, Brian Hsi-Lin Ho, Chih-Chieh Tu, Yueh-Chuan Chang, Tsu-Ching Hsiao, Hsin-Wei Hsiao, Sih-Pin Lai, Chun-Yi Lee

发表机构 * Elsa Lab(Elsa实验室) Department of Computer Science(计算机科学系) National Tsing Hua University(国立清华大学)

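AI summary: The paper proposes a modular architecture for the virtual-to-real transfer problem that couples a perception module with a control policy module and uses semantic image segmentation as the meta representation between them. It reports better performance than the baselines on obstacle avoidance and target following tasks.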

Comments 7 pages, accepted by IJCAI-18

详情
AI中文摘要

从物理世界收集训练数据通常是耗时且甚至对脆弱机器人来说是危险的,因此最近的机器人学习进展倡导使用模拟器作为训练平台。不幸的是,合成与真实视觉数据之间的现实差距阻止了在虚拟世界中训练的模型直接迁移到现实世界。本文提出了一种模块化架构来解决虚拟到现实的问题。所提出的架构将学习模型分为感知模块和控制策略模块,并使用语义图像分割作为这些模块之间关联的元表示。感知模块将感知的RGB图像转换为语义图像分割。控制策略模块实现为一个深度强化学习代理,根据转换后的图像分割执行动作。我们的架构在避障任务和目标跟随任务中进行了评估。实验结果表明,我们的架构在虚拟和现实环境中均显著优于所有基线方法,并且比它们具有更快的学习曲线。我们还对各种变体配置进行了详细分析,并验证了我们模块化架构的可迁移性。

英文摘要

Collecting training data from the physical world is usually time-consuming and even dangerous for fragile robots, and thus, recent advances in robot learning advocate the use of simulators as the training platform. Unfortunately, the reality gap between synthetic and real visual data prohibits direct migration of the models trained in virtual worlds to the real world. This paper proposes a modular architecture for tackling the virtual-to-real problem. The proposed architecture separates the learning model into a perception module and a control policy module, and uses semantic image segmentation as the meta representation for relating these two modules. The perception module translates the perceived RGB image to semantic image segmentation. The control policy module is implemented as a deep reinforcement learning agent, which performs actions based on the translated image segmentation. Our architecture is evaluated in an obstacle avoidance task and a target following task. Experimental results show that our architecture significantly outperforms all of the baseline methods in both virtual and real environments, and learns faster than they do. We also present a detailed analysis of a variety of variant configurations, and validate the transferability of our modular architecture.

1703.09971 2026-06-04 cs.CV cs.NA math.DS math.NA 版本更新

A Geometric Framework for Stochastic Shape Analysis

随机形状分析的几何框架

Alexis Arnaudon, Darryl D. Holm, Stefan Sommer

发表机构 * Department of Mathematics, Imperial College(帝国理工学院数学系) Department of Computer Science (DIKU), University of Copenhagen(哥本哈根大学计算机科学系(DIKU))

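AI summary: The paper proposes a geometric framework based on stochastic diffeomorphisms for analysing the evolution of shapes, images, and landmarks. It studies the stochastic evolution through the Fokker-Planck equation and numerical simulations, and derives two approaches for parameter inference.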

详情
AI中文摘要

我们介绍了一种随机的流形模型,其作用于多种数据类型上可降维为形状、图像和地标的随机演化。随机性引入在运输数据的向量场中,该随机性在大变形流形度度量映射(LDDMM)框架中用于形状分析和图像配准。随机性因此建模了跟随给定变形速度时流的误差或不确定性。该方法在有限维地标流形的例子中进行了说明,其随机演化通过Fokker-Planck方程和数值模拟研究。我们推导了两种从离散时间点观测到的地标配置推断随机模型参数的方法。第一种方法将Fokker-Planck方程的矩匹配到数据样本的矩,第二种方法则使用蒙特卡罗桥采样方案的期望最大化算法来优化数据似然。我们推导并数值测试了这两种方法推断底层噪声空间相关长度的能力。

英文摘要

We introduce a stochastic model of diffeomorphisms, whose action on a variety of data types descends to stochastic evolution of shapes, images and landmarks. The stochasticity is introduced in the vector field which transports the data in the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework for shape analysis and image registration. The stochasticity thereby models errors or uncertainties of the flow in following the prescribed deformation velocity. The approach is illustrated in the example of finite dimensional landmark manifolds, whose stochastic evolution is studied both via the Fokker-Planck equation and by numerical simulations. We derive two approaches for inferring parameters of the stochastic model from landmark configurations observed at discrete time points. The first of the two approaches matches moments of the Fokker-Planck equation to sample moments of the data, while the second approach employs an Expectation-Maximisation based algorithm using a Monte Carlo bridge sampling scheme to optimise the data likelihood. We derive and numerically test the ability of the two approaches to infer the spatial correlation length of the underlying noise.

1709.00483 2026-06-04 math.NA cs.CV cs.NA math.OC stat.ML 版本更新

Iteratively Linearized Reweighted Alternating Direction Method of Multipliers for a Class of Nonconvex Problems

迭代线性化加权交替方向乘子法用于一类非凸问题

Tao Sun, Hao Jiang, Lizhi Cheng, Wei Zhu

发表机构 * Department of Mathematics, National University of Defense Technology(国防科技大学数学系) College of Computer, National University of Defense Technology(国防科技大学计算机学院) The State Key Laboratory for High Performance Computation, National University of Defense Technology(国防科技大学高性能计算国家重点实验室) Hunan Key Laboratory for Computation and Simulation in Science and Engineering, School of Mathematics and Computational Science, Xiangtan University(湖南计算与模拟科学工程重点实验室,湘潭大学数学与计算科学学院)

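AI summary: The paper proposes an iteratively linearized reweighted alternating direction method of multipliers for a class of nonconvex and nonsmooth problems common in signal processing and machine learning. All subproblems in the algorithm are convex and easy to solve, and the algorithm is proved to converge globally to a critical point of an auxiliary function.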

详情
AI中文摘要

在本文中,我们考虑解决在信号处理和机器学习研究中频繁出现的一类非凸和非光滑问题。传统的交替方向乘子法在数学和计算上解决非凸和非光滑子问题时遇到了困难。为此,我们提出了一种加权交替方向乘子法。在该算法中,所有子问题都是凸的,易于求解。我们还提供了几种保证以确保收敛性,并利用Kurdyka-Łojasiewicz性质证明了该算法全局收敛到辅助函数的临界点。展示了几个数值结果以证明所提算法的有效性。

英文摘要

In this paper, we consider solving a class of nonconvex and nonsmooth problems frequently appearing in signal processing and machine learning research. The traditional alternating direction method of multipliers encounters difficulties, both mathematical and computational, in solving the nonconvex and nonsmooth subproblems. In view of this, we propose a reweighted alternating direction method of multipliers. In this algorithm, all subproblems are convex and easy to solve. We also provide several guarantees for the convergence and prove that the algorithm globally converges to a critical point of an auxiliary function with the help of the Kurdyka-Łojasiewicz property. Several numerical results are presented to demonstrate the efficiency of the proposed algorithm.

1710.06647 2026-06-04 cs.CV cs.NA math.NA 版本更新

Image Restoration by Iterative Denoising and Backward Projections

通过迭代去噪和反向投影进行图像恢复

Tom Tirer, Raja Giryes

发表机构 * School of Electrical Engineering, Tel Aviv University(特拉维夫大学电气工程学院)

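AI summary: The paper proposes an alternative way of solving inverse problems with off-the-shelf denoisers. It transforms a typical cost function into a closely related optimization problem, introduces an efficient minimization scheme, and adds an automatic tuning mechanism, reducing the parameter tuning needed for image inpainting and deblurring.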

Comments To appear in IEEE Transactions on Image Processing

详情
AI中文摘要

逆问题出现在许多应用中,如图像去模糊和修复。解决这些问题的常见做法是为每个问题设计专门的算法。最近提出的Plug-and-Play(P&P)框架利用现有去噪算法的出色能力来解决一般逆问题。尽管这种新策略已找到许多应用,但通常需要繁重的参数调优才能获得高质量的结果。在本文中,我们提出了一种替代方法,利用现成的去噪器解决逆问题,且所需的参数调优更少。首先,我们将典型成本函数(由保真项和先验项组成)转换为一个密切相关的新优化问题。然后,我们提出了一种具有Plug-and-Play性质的高效最小化方案,即先验项仅通过去噪操作处理。最后,我们提出了一种自动调参机制来设置方法的参数。我们对方法进行了理论分析,并通过图像修复和去模糊实验,证明了其相对于针对特定任务的技术和P&P方法的竞争力。

英文摘要

Inverse problems appear in many applications, such as image deblurring and inpainting. The common approach to address them is to design a specific algorithm for each problem. The Plug-and-Play (P&P) framework, which has been recently introduced, allows solving general inverse problems by leveraging the impressive capabilities of existing denoising algorithms. While this fresh strategy has found many applications, a burdensome parameter tuning is often required in order to obtain high-quality results. In this work, we propose an alternative method for solving inverse problems using off-the-shelf denoisers, which requires less parameter tuning. First, we transform a typical cost function, composed of fidelity and prior terms, into a closely related, novel optimization problem. Then, we propose an efficient minimization scheme with a plug-and-play property, i.e., the prior term is handled solely by a denoising operation. Finally, we present an automatic tuning mechanism to set the method's parameters. We provide a theoretical analysis of the method, and empirically demonstrate its competitiveness with task-specific techniques and the P&P approach for image inpainting and deblurring.
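The alternation described above (a data-consistency step followed by a plain denoising call) can be sketched for 1-D inpainting as follows. This is a minimal illustration, not the authors' implementation: the function names, the moving-average stand-in denoiser, the fixed iteration count and the zero initialisation are all assumptions, and the automatic tuning mechanism mentioned in the abstract is not reproduced.

```python
def moving_average(signal, radius=1):
    """Stand-in denoiser (an assumption): truncated moving average."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def inpaint_sketch(observed, mask, iterations=50, denoise=moving_average):
    """Alternate backward projection and denoising.

    observed: list of samples (values are ignored where mask is 0)
    mask:     1 where the sample is known, 0 where it is missing
    """
    # Hypothetical initialisation: zeros in the missing positions.
    x = [o if m else 0.0 for o, m in zip(observed, mask)]
    for _ in range(iterations):
        # Backward projection: keep known samples, fill gaps from the estimate.
        y = [o if m else xi for o, m, xi in zip(observed, mask, x)]
        # The prior enters only through this denoising call.
        x = denoise(y)
    return x


observed = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
mask = [1, 1, 0, 1, 1, 0, 1]
restored = inpaint_sketch(observed, mask)
```

Any off-the-shelf denoiser with the same call shape could be passed as `denoise`; the moving average is used here only to keep the sketch dependency-free.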

1810.03275 2026-06-04 math.NA cs.CV cs.NA 版本更新

TV-regularized CT Reconstruction and Metal Artifact Reduction Using Inequality Constraints with Preconditioning

基于不等式约束与预条件的TV正则化CT重建及金属伪影消除

Clemens Schiffer

发表机构 * Karl-Franzens-Universität Graz(格拉茨大学)

AI总结 本文提出了一种结合不等式约束与预条件的TV正则化方法,用于CT重建中减少金属伪影,通过Chambolle-Pock算法和预条件的Douglas-Rachford分裂法及ADMM算法实现快速收敛,验证了模型在真实和合成数据中的有效性。

Comments Master's Thesis, as submitted at the University of Graz

详情
AI中文摘要

总变分(TV)正则化被应用于X射线计算机断层扫描(CT)以减少金属伪影。本新型模型通过在受金属影响的sinogram数据上引入不等式约束,增强了Tikhonov正则化(具有L²数据保真项)和总变分正则化,以建模金属引起的误差。所提出的优化问题经离散化后通过Chambolle-Pock算法求解。通过在Douglas-Rachford分裂法以及交替方向乘子法(ADMM)中使用预条件,实现了更快的收敛。该方法被应用于真实和合成数据,证明了模型在减少金属伪影方面的可行性。CT数据的技术细节及其处理在附录中给出。

英文摘要

Total variation (TV) regularization is applied to X-ray computed tomography (CT) in an effort to reduce metal artifacts. Tikhonov regularization with $L^2$ data fidelity term and total variation regularization is augmented in this novel model by inequality constraints on sinogram data affected by metal to model errors caused by metal. The formulated problem is discretized and solved using the Chambolle-Pock algorithm. Faster convergence is achieved using preconditioning in a Douglas-Rachford splitting method as well as in the Alternating Direction Method of Multipliers (ADMM). The methods are applied to real and synthetic data, demonstrating the feasibility of the model for reducing metal artifacts. Technical details of the CT data used and its processing are given in the appendix.

1809.07399 2026-06-04 math.NA cs.CV cs.NA 版本更新

Nonisometric Surface Registration via Conformal Laplace-Beltrami Basis Pursuit

基于保形拉普拉斯-贝尔特拉米基追踪的非等距曲面配准

Stefan C. Schonsheck, Michael M. Bronstein, Rongjie Lai

发表机构 * Department of Mathematics, Rensselaer Polytechnic Institute(伦斯勒理工学院数学系) Institute of Computational Science, Università della Svizzera Italiana(瑞士意大利语区大学计算科学研究所)

AI总结 本文提出了一种变分模型,通过保形变形对齐两个非等距、亏格为零的曲面的拉普拉斯-贝尔特拉米(LB)特征系统;模型基于一种新的基追踪方案,同时计算目标形状的保形变形及其变形后的LB特征系统,并采用与增广拉格朗日方法相结合的近端交替最小化算法求解,仅需少量地标点即可获得准确的对应关系。

Comments 21 pages, 7 figures

详情
AI中文摘要

曲面配准是几何处理中最基本的问题之一。针对曲面近似等距的情形,已有许多方法被提出。然而,计算内在相似性较低的曲面之间的对应关系更具挑战性。本文提出了一种变分模型,通过保形变形对齐两个非等距、亏格为零的形状的拉普拉斯-贝尔特拉米(LB)特征系统。该方法使我们能够计算非等距形状之间具有几何意义的点对点映射。我们的模型基于一种新颖的基追踪方案,其中我们同时计算“目标形状”的保形变形及其变形后的LB特征系统。我们使用与增广拉格朗日方法相结合的近端交替最小化算法来求解该模型,仅需少量地标点即可获得准确的对应关系。我们还提出了一种重新初始化方案,以克服变分问题非凸性带来的某些困难。大量数值实验表明,所提出的方法在处理具有大变形的非等距曲面时有效,并且对底层流形上的噪声和给定地标点中的误差都具有鲁棒性。

英文摘要

Surface registration is one of the most fundamental problems in geometry processing. Many approaches have been developed to tackle this problem in cases where the surfaces are nearly isometric. However, it is much more challenging to compute correspondences between surfaces that are intrinsically less similar. In this paper, we propose a variational model to align the Laplace-Beltrami (LB) eigensystems of two non-isometric genus zero shapes via conformal deformations. This method enables us to compute geometrically meaningful point-to-point maps between non-isometric shapes. Our model is based on a novel basis pursuit scheme whereby we simultaneously compute a conformal deformation of a 'target shape' and its deformed LB eigensystem. We solve the model using a proximal alternating minimization algorithm hybridized with the augmented Lagrangian method, which produces accurate correspondences given only a few landmark points. We also propose a reinitialization scheme to overcome some of the difficulties caused by the non-convexity of the variational problem. Intensive numerical experiments illustrate the effectiveness of the proposed method in handling non-isometric surfaces with large deformation, and its robustness with respect to both noise on the underlying manifolds and errors within the given landmarks.

1608.01825 2026-06-04 physics.med-ph cs.CV cs.NA math.NA 版本更新

Compartmental analysis of dynamic nuclear medicine data: regularization procedure and application to physiology

动态核医学数据的室模型分析:正则化程序及其在生理学中的应用

Delbary Fabrice, Garbarino Sara

发表机构 * Centre for Medical Image Computing, Department of Computer Science, University College London(伦敦大学学院计算机科学系医学影像中心)

AI总结 本文提出一种基于正则化多变量Gauss-Newton方法的室模型正则化程序,用于估计示踪剂系数,并应用于脑、肝和肾功能的实验研究。

详情
Journal ref
Inverse Problems in Science and Engineering 2018
AI中文摘要

基于示踪剂质量平衡的室模型在临床和预临床核医学中被广泛用于获取生物组织中示踪剂代谢的定量信息。本文是系列两篇论文中的第二篇,探讨了在反问题框架下通过室模型估计示踪剂系数的问题。虽然前一篇工作专注于讨论2、3和n维室模型系统的可识别性问题,本文则讨论如何通过通用的正则化多变量Gauss-Newton方案数值确定示踪剂系数。本文考虑了涉及不同小鼠模型的FDG-PET数据的实验测量,应用于脑、肝和肾功能的研究。

英文摘要

Compartmental models based on tracer mass balance are extensively used in clinical and pre-clinical nuclear medicine in order to obtain quantitative information on tracer metabolism in biological tissue. This paper is the second of a series of two that deal with the problem of tracer coefficient estimation via compartmental modelling in an inverse problem framework. While the previous work was devoted to the discussion of identifiability issues for 2-, 3- and n-dimensional compartmental systems, here we discuss the problem of numerically determining the tracer coefficients by means of a general regularized multivariate Gauss-Newton scheme. Applications concerning cerebral, hepatic and renal functions are considered, involving experimental FDG-PET measurements on different sets of murine models.

1601.05585 2026-06-04 eess.SY cs.CV cs.SY 版本更新

Generalized optimal sub-pattern assignment metric

有限目标集上的广义最优子模式分配度量

Abu Sajana Rahmathullah, Ángel F. García-Fernández, Lennart Svensson

发表机构 * Zenuity AB(泽尼特公司) Aalto University(阿尔托大学) Chalmers University of Technology(查尔姆斯理工大学)

AI总结 本文提出了一种广义最优子模式分配度量(GOSPA),该度量在目标集空间中未归一化,并通过优化分配而非排列来惩罚基数误差,从而更准确地评估多目标跟踪算法性能。

Comments The paper received the Jean Pierre Le Cadre best paper award at the 20th International Conference on Information Fusion, July 2017. A Matlab implementation of the proposed GOSPA metric is available in https://github.com/abusajana/GOSPA Also visit https://youtu.be/M79GTTytvCM for a 15-min presentation about the paper

详情
Journal ref
Proceedings of the 20th International Conference on Information Fusion (Fusion), 2017
AI中文摘要

本文提出了有限目标集空间中的广义最优子模式分配(GOSPA)度量。与已确立的最优子模式分配(OSPA)度量相比,GOSPA未归一化,其对基数误差的惩罚方式不同,允许通过优化分配而非排列来表达。这一特性使得GOSPA能够以传统多目标跟踪(MTT)性能指标所示的方式,对检测目标的定位误差以及漏检和误检误差进行合理惩罚。此外,本文将GOSPA度量扩展到随机有限集空间,这对于通过模拟严格评估MTT算法至关重要。

英文摘要

This paper presents the generalized optimal sub-pattern assignment (GOSPA) metric on the space of finite sets of targets. Compared to the well-established optimal sub-pattern assignment (OSPA) metric, GOSPA is unnormalized as a function of the cardinality and it penalizes cardinality errors differently, which enables us to express it as an optimisation over assignments instead of permutations. An important consequence of this is that GOSPA allows us to penalize localization errors for detected targets and the errors due to missed and false targets, as indicated by traditional multiple target tracking (MTT) performance measures, in a sound manner. In addition, we extend the GOSPA metric to the space of random finite sets, which is important to evaluate MTT algorithms via simulations in a rigorous way.
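The abstract does not spell out the metric, so the following is a minimal brute-force sketch under stated assumptions: it uses the commonly cited form with cut-off distance c, exponent p and parameter alpha (alpha = 2 being the setting for which the assignment interpretation holds), scalar targets, and an absolute-difference base distance. The function name and defaults are illustrative; this is not the Matlab implementation linked in the comments above.

```python
from itertools import permutations


def gospa(X, Y, c=2.0, p=1, alpha=2.0, dist=lambda a, b: abs(a - b)):
    """Brute-force GOSPA-style distance between two small finite sets.

    Assumed form, for |X| <= |Y|:
      ( min over injections pi of sum_i min(dist(x_i, y_pi(i)), c)^p
        + (c^p / alpha) * (|Y| - |X|) )^(1/p)
    """
    if len(X) > len(Y):
        X, Y = Y, X  # the assumed form is symmetric, so order by size
    n, m = len(X), len(Y)
    best = float("inf")
    # Enumerate every way of pairing each element of X with a distinct
    # element of Y; exponential cost, fine only for tiny sets.
    for perm in permutations(range(m), n):
        cost = sum(min(dist(X[i], Y[j]), c) ** p for i, j in enumerate(perm))
        best = min(best, cost)
    # Unnormalised cardinality penalty for the unmatched elements.
    total = best + (c ** p / alpha) * (m - n)
    return total ** (1.0 / p)


only_missed = gospa([0.0], [])         # one target, no estimate
far_apart = gospa([0.0], [10.0])       # distance saturates at the cut-off c
mixed = gospa([0.0, 5.0], [0.5])       # one close pair plus one unmatched
```

With alpha = 2 and p = 1, a pair further apart than c costs c, which equals the cost of one missed target plus one false target (c/2 each). A practical implementation would replace the enumeration with a linear assignment solver.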

1809.03314 2026-06-04 cs.CV cs.SY eess.SY 版本更新

A Robotic Auto-Focus System based on Deep Reinforcement Learning

基于深度强化学习的机器人自动对焦系统

Xiaofan Yu, Runze Yu, Jingsong Yang, Xiaohui Duan

发表机构 * Center of Wireless Communication and Signal Processing(无线通信与信号处理中心)

AI总结 本文提出一种端到端的自动对焦方法,通过深度强化学习在视觉输入中学习对焦策略,实现自动清晰成像。方法通过离散化动作空间和应用DQN,解决自动对焦问题并推广至基于视觉的控制问题。

Comments To Appear at ICARCV 2018

详情
AI中文摘要

考虑到DQN在处理高维视觉输入和学习离散域控制策略方面的优势,DQN可能成为传统自动对焦方法的替代方案。本文基于深度强化学习提出了一种端到端方法,从视觉输入中学习自动对焦策略,并自动聚焦到清晰点。我们证明了我们的方法——通过粗到细的步骤离散化动作空间并应用DQN,不仅解决了自动对焦问题,还为基于视觉的控制问题提供了一种通用方法。分别在虚拟和真实环境中进行训练阶段以获得有效的模型。虚拟实验表明,我们的方法在不同聚焦范围内能够实现100%的准确性。进一步在真实机器人上训练可消除模拟器与真实场景之间的偏差,从而在实际应用中实现可靠性能。

英文摘要

Considering its advantages in dealing with high-dimensional visual input and learning control policies in discrete domains, the Deep Q Network (DQN) could become an alternative to traditional auto-focus methods. In this paper, based on Deep Reinforcement Learning, we propose an end-to-end approach that can learn auto-focus policies from visual input and finish at a clear spot automatically. We demonstrate that our method, which discretizes the action space with coarse-to-fine steps and applies DQN, is not only a solution to auto-focus but also a general approach to vision-based control problems. Separate phases of training in virtual and real environments are applied to obtain an effective model. Virtual experiments, carried out after the virtual training phase, indicate that our method can achieve 100% accuracy on a certain view with different focus ranges. Further training on real robots can eliminate the deviation between the simulator and the real scenario, leading to reliable performance in real applications.

1802.07072 2026-06-04 math.OC cs.CV cs.NA math.NA 版本更新

Composite Optimization by Nonconvex Majorization-Minimization

通过非凸majorization-minimization进行复合优化

Jonas Geiping, Michael Moeller

发表机构 * University of Siegen(锡根大学)

AI总结 本文提出非凸majorization-minimization方法用于非凸复合函数优化,证明其能实现全局收敛,并通过实验展示其在深度超分辨率中的优越性。

Comments 38 pages, 12 figures, accepted for publication in SIIMS

详情
AI中文摘要

非凸复合函数的最小化可以建模多种成像任务。解决此类问题的流行算法是majorization-minimization技术,通过迭代用易于最小化的majorizing函数近似复合非凸函数。大多数技术,例如梯度下降,利用凸majorizers以保证majorizer易于最小化。在我们的工作中,我们考虑了一类自然的非凸majorizers,并证明这些majorizers仍然足以实现全局收敛的优化方案。数值结果表明,当非凸majorizers被求解到全局最优时,通过应用此方案,可以经常获得比以前的majorization-minimization方法更好的局部最优解。最后,我们展示了我们的算法在从原始时间飞行数据中进行深度超分辨率中的行为。

英文摘要

The minimization of a nonconvex composite function can model a variety of imaging tasks. A popular class of algorithms for solving such problems are majorization-minimization techniques which iteratively approximate the composite nonconvex function by a majorizing function that is easy to minimize. Most techniques, e.g. gradient descent, utilize convex majorizers in order to guarantee that the majorizer is easy to minimize. In our work we consider a natural class of nonconvex majorizers for these functions, and show that these majorizers are still sufficient for a globally convergent optimization scheme. Numerical results illustrate that by applying this scheme, one can often obtain superior local optima compared to previous majorization-minimization methods, when the nonconvex majorizers are solved to global optimality. Finally, we illustrate the behavior of our algorithm for depth super-resolution from raw time-of-flight data.
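For orientation, the classical majorization-minimization loop that the abstract takes as its starting point can be sketched on a toy problem. The example below uses a convex quadratic majorizer, so each step reduces to gradient descent with step size 1/L; it does not implement the nonconvex majorizers proposed in the paper. The test function, the constant L and all names are assumptions chosen for illustration.

```python
import math


def f(x):
    """Toy nonconvex objective (an assumption): f(x) = log(1 + x^2)."""
    return math.log(1.0 + x * x)


def grad_f(x):
    return 2.0 * x / (1.0 + x * x)


# |f''(x)| <= 2 for all x, so
#   q(x; x_k) = f(x_k) + f'(x_k) (x - x_k) + (L / 2) (x - x_k)^2,  L = 2,
# lies above f everywhere and touches it at x_k: a valid majorizer.
L = 2.0


def mm_minimize(x0, steps=50):
    x = x0
    history = [f(x)]
    for _ in range(steps):
        # Minimising the quadratic majorizer in closed form.
        x = x - grad_f(x) / L
        history.append(f(x))
    return x, history


x_star, values = mm_minimize(3.0)
```

Because each majorizer touches the objective at the current iterate and lies above it elsewhere, the objective values in `values` never increase, which is the property that the convergence arguments for such schemes build on.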

1806.00728 2026-06-04 stat.ML cs.CV cs.LG cs.SY eess.SP eess.SY 版本更新

Data-Free/Data-Sparse Softmax Parameter Estimation with Structured Class Geometries

无数据/稀疏数据softmax参数估计与结构类几何

Nisar Ahmed

发表机构 * H.J. Smead Aerospace Engineering Sciences, University of Colorado, Boulder, Colorado 80309(H.J. Smead航空航天工程科学系,科罗拉多大学,伯尔德,科罗拉多州80309)

AI总结 本文提出在少量或无标注数据情况下,利用类标签对数几率边界结构几何先验信息进行softmax参数估计,通过线性方程组求解,无需昂贵的数据采样和优化。

Comments Final version accepted to IEEE Signal Processing Letters (double column), submitted July 21, 2018

详情
AI中文摘要

本文考虑在少量或无标注训练数据可用时,但已知类标签对数几率边界相对几何结构信息的softmax参数估计问题。证明了'无数据'softmax模型合成对应于求解参数方程组,其中期望主导类对数几率边界通过分解输入特征空间的凸多面体编码。当方程可解时,线性方程给出仅使用类边界多面体规范的softmax参数解集。这允许softmax参数学习无需昂贵的暴力数据采样和数值优化。线性方程还可适应数据稀疏情况下的约束最大似然估计。由于某些多面体规范可能无法得到解,因此也展示了存在某些概率分类问题,其对数几率边界无法用m类softmax模型学习。

英文摘要

This note considers softmax parameter estimation when little/no labeled training data is available, but a priori information about the relative geometry of class label log-odds boundaries is available. It is shown that 'data-free' softmax model synthesis corresponds to solving a linear system of parameter equations, wherein desired dominant class log-odds boundaries are encoded via convex polytopes that decompose the input feature space. When solvable, the linear equations yield closed-form softmax parameter solution families using class boundary polytope specifications only. This allows softmax parameter learning to be implemented without expensive brute force data sampling and numerical optimization. The linear equations can also be adapted to constrained maximum likelihood estimation in data-sparse settings. Since solutions may also fail to exist for the linear parameter equations derived from certain polytope specifications, it is thus also shown that there exist probabilistic classification problems over m convexly separable classes for which the log-odds boundaries cannot be learned using an m-class softmax model.
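The idea of reading softmax parameters off boundary specifications can be illustrated in one dimension. For a softmax with logits w_k x + b_k, the log-odds between classes j and i is (w_j - w_i) x + (b_j - b_i), which is linear in the parameters, so each requested boundary gives linear equations. The sketch below is a simplified illustration, not the paper's construction: the tuple format, the choice to pin class 0 at zero, and all function names are assumptions.

```python
import math


def synthesize_softmax(num_classes, boundaries):
    """Solve for 1-D softmax weights from log-odds specifications.

    boundaries: list of (i, j, n, d), each requesting
                logit_j(x) - logit_i(x) = n * x + d.
    Class 0 is pinned to w = b = 0 (softmax is invariant to a common shift).
    """
    w, b = {0: 0.0}, {0: 0.0}
    pending = list(boundaries)
    while pending:
        progressed = False
        for item in list(pending):
            i, j, n, d = item
            if i in w and j in w:
                # Redundant equation: it must agree with what is already fixed.
                if abs(w[j] - w[i] - n) > 1e-12 or abs(b[j] - b[i] - d) > 1e-12:
                    raise ValueError("boundary specifications are inconsistent")
            elif i in w:
                w[j], b[j] = w[i] + n, b[i] + d
            elif j in w:
                w[i], b[i] = w[j] - n, b[j] - d
            else:
                continue
            pending.remove(item)
            progressed = True
        if not progressed:
            raise ValueError("specifications are not connected to class 0")
    if len(w) != num_classes:
        raise ValueError("some classes are left unconstrained")
    return [w[k] for k in range(num_classes)], [b[k] for k in range(num_classes)]


def softmax_probs(x, w, b):
    logits = [wk * x + bk for wk, bk in zip(w, b)]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Class 0 | class 1 boundary at x = -1, class 1 | class 2 boundary at x = +1,
# with an assumed steepness of 4.
weights, biases = synthesize_softmax(3, [(0, 1, 4.0, 4.0), (1, 2, 4.0, -4.0)])
at_left_boundary = softmax_probs(-1.0, weights, biases)
at_origin = softmax_probs(0.0, weights, biases)
at_right = softmax_probs(2.0, weights, biases)
```

The `ValueError` raised for contradictory specifications mirrors, in a very small way, the abstract's remark that the linear equations may have no solution for some boundary specifications.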

1708.01244 2026-06-04 math.NA cs.CV cs.NA 版本更新

Image reconstruction with imperfect forward models and applications in deblurring

基于不完美正向模型的图像重建及其在去模糊中的应用

Yury Korolev, Jan Lellmann

AI总结 本文提出基于偏序空间(Banach格)的图像重建方法,通过顺序区间描述数据和正向模型误差,分析可行集的凸性及其在去模糊中的应用。

详情
AI中文摘要

我们提出并分析了一种基于偏序空间(Banach格)的图像重建方法,用于处理不完美的正向模型。该方法通过顺序区间描述数据和正向模型的误差,其特征是可行集由线性不等式约束定义。本文的主要贡献是对该可行集的研究。通过多种设置考察了该可行集的凸性,并考虑了引入关于正向算子额外信息的修改。数值示例展示了该方法在存在模糊核误差的去模糊中的性能。

英文摘要

We present and analyse an approach to image reconstruction problems with imperfect forward models based on partially ordered spaces - Banach lattices. In this approach, errors in the data and in the forward models are described using order intervals. The method can be characterised as the lattice analogue of the residual method, where the feasible set is defined by linear inequality constraints. The study of this feasible set is the main contribution of this paper. Convexity of this feasible set is examined in several settings and modifications for introducing additional information about the forward operator are considered. Numerical examples demonstrate the performance of the method in deblurring with errors in the blurring kernel.
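To see why order-interval error descriptions lead to linear inequality constraints, consider the following short derivation. The notation (bounds $A^l, A^u$ on the forward operator, bounds $f^l, f^u$ on the data, and the positivity assumption $u \ge 0$) is introduced here for illustration and is an assumption, not taken from the abstract.

```latex
% Assumed setting: exact model A u = f with u >= 0, and known bounds
%   A^l <= A <= A^u   (order on operators),   f^l <= f <= f^u.
% Positivity of u lets the operator bounds pass through to the images:
\[
  A^l u \;\le\; A u \;=\; f \;\le\; f^u ,
  \qquad
  A^u u \;\ge\; A u \;=\; f \;\ge\; f^l .
\]
% Hence every exact solution lies in a set cut out by linear inequalities:
\[
  \mathcal{U} \;=\; \bigl\{\, u \ge 0 \;:\; A^l u \le f^u ,\;\; A^u u \ge f^l \,\bigr\}.
\]
```

Under these assumptions the set $\mathcal{U}$ is an intersection of half-spaces and therefore convex; the abstract indicates that the paper examines convexity of the feasible set in several settings, including modifications that add information about the forward operator.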

1509.02223 2026-06-04 cs.CV cs.NA math.NA 版本更新

Diffusion tensor imaging with deterministic error bounds

具有确定性误差边界的扩散张量成像

Artur Gorokh, Yury Korolev, Tuomo Valkonen

发表机构 * Faculty of Physics, Lomonosov Moscow State University(莫斯科罗蒙诺索夫国立大学物理系) School of Engineering and Materials Science, Queen Mary University of London(伦敦玛丽女王大学工程与材料科学学院)

AI总结 本文在Banach格中利用偏序理论建模逆问题的误差,应用于扩散张量成像中复杂噪声建模问题,通过确定性误差边界方法简化非线性Stejskal-Tanner方程的处理。

详情
AI中文摘要

逆问题的数据和前向算子的误差可以利用Banach格中的偏序进行建模。我们在此新框架中呈现了一些正则化理论的现有结果,其中误差通过适当的偏序表示为界限。我们将该理论应用于扩散张量成像,其中正确的噪声建模具有挑战性:它涉及Rician分布和非线性Stejskal-Tanner方程。在统计框架中线性化后者会进一步复杂化噪声模型。我们通过误差边界方法避免了这一点,该方法在单调变换下保持简单的误差结构。

英文摘要

Errors in the data and the forward operator of an inverse problem can be handily modelled using partial order in Banach lattices. We present some existing results of the theory of regularisation in this novel framework, where errors are represented as bounds by means of the appropriate partial order. We apply the theory to Diffusion Tensor Imaging, where correct noise modelling is challenging: it involves the Rician distribution and the nonlinear Stejskal-Tanner equation. Linearisation of the latter in the statistical framework would complicate the noise model even further. We avoid this using the error bounds approach, which preserves simple error structure under monotone transformations.

1608.02702 2026-06-04 cs.CV cs.NA math.NA 版本更新

Steerable Principal Components for Space-Frequency Localized Images

可旋转主成分用于空间-频率局部化图像

Boris Landa, Yoel Shkolnisky

发表机构 * Department of Applied Mathematics, School of Mathematical Sciences(应用数学系,数学科学学院)

AI总结 本文提出一种快速准确的方法,通过二维Prolate Spheroidal Wave Functions对图像进行展开,获取可旋转主成分,用于图像及其旋转的最优扩展。

详情
AI中文摘要

本文描述了一种快速且准确的方法,用于从大量图像数据集中获得可旋转主成分,假设图像在空间和频率上具有良好的局部化特性。所获得的可旋转主成分用于图像数据集及其旋转的最优扩展。该方法首先使用一系列二维Prolate Spheroidal Wave Functions对图像进行展开,其中展开系数通过特殊设计的数值积分方案进行评估。然后,利用这些展开系数构建一个旋转不变的协方差矩阵,其具有块对角结构,其块的特征分解提供了所需的可旋转主成分。所提出的方法被证明比现有方法更快,同时提供适当的误差界以保证其准确性。

英文摘要

This paper describes a fast and accurate method for obtaining steerable principal components from a large dataset of images, assuming the images are well localized in space and frequency. The obtained steerable principal components are optimal for expanding the images in the dataset and all of their rotations. The method relies upon first expanding the images using a series of two-dimensional Prolate Spheroidal Wave Functions (PSWFs), where the expansion coefficients are evaluated using a specially designed numerical integration scheme. Then, the expansion coefficients are used to construct a rotationally-invariant covariance matrix which admits a block-diagonal structure, and the eigen-decomposition of its blocks provides us with the desired steerable principal components. The proposed method is shown to be faster than existing methods, while providing appropriate error bounds which guarantee its accuracy.

1807.11534 2026-06-04 math.NA cs.CV cs.NA 版本更新

A Restricted-Domain Dual Formulation for Two-Phase Image Segmentation

两相图像分割的受限域对偶形式

Jack Spencer

发表机构 * Department of Mathematics, University of Liverpool, UK(利物浦大学数学系)

AI总结 本文探讨了两相图像分割中数据拟合项的性质,提出在受限域内求解对偶形式以提升计算效率,并通过实验验证了该方法的有效性。

详情
Journal ref
Irish Machine Vision and Image Processing Conference Proceedings, pp. 139-146, 2017
AI中文摘要

在两相图像分割中,凸松弛方法使得对多种数据拟合项都可以计算全局最小解。已有许多高效方法可以快速求解。然而,我们考虑该形式中数据拟合项的性质是否允许对解做出合理假设,以进一步提高计算性能。特别是,我们采用了该问题的一个广为人知的对偶形式,并在受限域内求解相应的方程。我们展示了实验结果,探讨了解对该限制的依赖性,并量化了计算性能的改进。这种方法可以简单地扩展到类似方法,并可能为此类问题提供一种高效的替代方案。

英文摘要

In two-phase image segmentation, convex relaxation has allowed global minimisers to be computed for a variety of data fitting terms. Many efficient approaches exist to compute a solution quickly. However, we consider whether the nature of the data fitting in this formulation allows for reasonable assumptions to be made about the solution that can improve the computational performance further. In particular, we employ a well-known dual formulation of this problem and solve the corresponding equations in a restricted domain. We present experimental results that explore the dependence of the solution on this restriction and quantify improvements in the computational performance. This approach can be extended to analogous methods simply and could provide an efficient alternative for problems of this type.

1807.10757 2026-06-04 cs.CV cs.NA math.NA 版本更新

A multi-contrast MRI approach to thalamus segmentation

一种多对比MRI方法用于丘脑分割

Veronica Corona, Jan Lellmann, Peter Nestor, Carola-Bibiane Schoenlieb, Julio Acosta-Cabronero

发表机构 * Department of Applied Mathematics and Theoretical Physics, University of Cambridge(应用数学与理论物理系,剑桥大学) Queensland Brain Institute, University of Queensland(昆士兰脑研究所,昆士兰大学) Mater Hospital, South Brisbane, Queensland, Australia(马特医院,南布里斯班,昆士兰,澳大利亚) Wellcome Centre for Human Neuroimaging, UCL Institute of Neurology, University College London, United Kingdom(Wellcome人类神经影像中心,伦敦大学学院神经学研究所,伦敦大学学院,英国) German Center for Neurodegenerative Diseases (DZNE), Magdeburg, Germany(德国神经退行性疾病研究中心(DZNE),马格德堡,德国)

AI总结 本文提出一种多模态MRI分割方法,通过多对比数据提高丘脑子核分割精度,结合迭代配准、手动分割模板、监督学习和凸优化,提升分割性能与鲁棒性。

详情
AI中文摘要

丘脑变化与许多神经疾病相关,包括阿尔茨海默病、帕金森病和多发性硬化症。常规干预常包括手术或深部脑刺激,因此准确分割灰质丘脑子区域具有临床重要性。MRI适用于结构分割,因其能提供单次扫描的不同解剖视图。尽管有多种对比度可用,开发能处理多谱的图像分割技术变得越来越重要。本文提出了一种新的多模态数据分割方法,用于自动分割主要丘脑子核组,使用T1-、T2*-加权和定量susceptibility mapping (QSM)信息。该方法包括四个步骤:高度迭代的图像配准、在平均训练数据模板上的手动分割、监督学习用于模式识别,以及最终的凸优化步骤,通过进一步的空间约束来优化解决方案。这导致了与手动分割更一致的解决方案,优于标准Morel图谱方法。此外,我们展示了多对比方法提升了分割性能。然后我们研究了是否能利用训练模板轮廓的先验知识进一步提高凸分割的精度和鲁棒性,从而在单个受试者中获得高度精确的多对比分割。该方法可扩展到大多数3D成像数据类型和任何在单次扫描或多受试者模板中可辨识的感兴趣区域。

英文摘要

Thalamic alterations are relevant to many neurological disorders including Alzheimer's disease, Parkinson's disease and multiple sclerosis. Routine interventions to improve symptom severity in movement disorders, for example, often consist of surgery or deep brain stimulation to diencephalic nuclei. Therefore, accurate delineation of grey matter thalamic subregions is of the utmost clinical importance. MRI is highly appropriate for structural segmentation as it provides different views of the anatomy from a single scanning session. With several contrasts potentially available, it is also increasingly important to develop new image segmentation techniques that can operate multi-spectrally. We hereby propose a new segmentation method for use with multi-modality data, which we evaluated for automated segmentation of major thalamic subnuclear groups using T1-, T2*-weighted and quantitative susceptibility mapping (QSM) information. The proposed method consists of four steps: highly iterative image co-registration, manual segmentation on the average training-data template, supervised learning for pattern recognition, and a final convex optimisation step imposing further spatial constraints to refine the solution. This led to solutions in greater agreement with manual segmentation than the standard Morel atlas based approach. Furthermore, we show that the multi-contrast approach boosts segmentation performance. We then investigated whether prior knowledge using the training-template contours could further improve convex segmentation accuracy and robustness, which led to highly precise multi-contrast segmentations in single subjects. This approach can be extended to most 3D imaging data types and any region of interest discernible in single scans or multi-subject templates.

1806.10472 2026-06-04 cs.CV cs.NA math.NA 版本更新

Homogeneity of a region in the logarithmic image processing framework: application to region growing algorithms

对数图像处理框架中区域的同质性:应用于区域生长算法

Michel Jourlin, Guillaume Noyel

发表机构 * Lab. H. Curien, UMR CNRS 5516(H. Curien实验室,CNRS 5516研究单位) University of Strathclyde Institute of Global Public Health(斯特拉斯克莱德大学全球公共卫生研究所) International Prevention Research Institute, iPRI(国际预防研究所)

AI总结 本文探讨了对数图像处理(LIP)算子在评估区域同质性中的作用,提出两种新的异质性标准,改进了Revol技术以增强对比度变化的鲁棒性,减少区域生长过程中的链式效应。

详情
Journal ref
International Workshop on the Physics and Mechanics of Random Structures: from Morphology to Material Properties, Jun 2018, Island of Oléron, France
AI中文摘要

本文探讨了对数图像处理(LIP)算子在评估区域同质性中的作用,提出两种新的异质性标准,一种基于LIP加法,另一种基于LIP标量乘法。这些工具能够管理区域生长算法,采用Revol技术:从初始种子开始,通过应用特定的膨胀操作来扩展生长区域,直到其异质性水平不超过一定值。我们引入的新方法显著改进了Revol现有的技术,使其对图像的对比度变化具有鲁棒性。这种性质强烈减少了区域生长过程中出现的链式效应。

英文摘要

This paper deals with the role played by Logarithmic Image Processing (LIP) operators in evaluating the homogeneity of a region. Two new heterogeneity criteria are introduced, one based on the LIP addition and the other on the LIP scalar multiplication. Such tools can drive region growing algorithms following Revol's technique: starting from an initial seed, specific dilations are applied to the growing region as long as its inhomogeneity level does not exceed a certain threshold. The new approaches we introduce significantly improve Revol's existing technique by making it robust to contrast variations in images. This property strongly reduces the chaining effect arising in region growing processes.
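The two LIP operators named above have standard closed forms in the classical LIP model, sketched below for grey levels in [0, M). The abstract does not state the heterogeneity criteria themselves, so they are not reproduced here; the function names and the default M = 256 are assumptions.

```python
def lip_add(f, g, M=256.0):
    """Classical LIP addition of two grey levels f, g in [0, M)."""
    return f + g - f * g / M


def lip_scalar_mul(lam, f, M=256.0):
    """Classical LIP multiplication of grey level f by a real scalar lam."""
    return M - M * (1.0 - f / M) ** lam


# LIP-adding a grey level to itself agrees with LIP-multiplying it by 2.
doubled_by_add = lip_add(100.0, 100.0)
doubled_by_mul = lip_scalar_mul(2.0, 100.0)

# Unlike ordinary addition, the LIP sum of two bright levels stays below M.
bright_sum = lip_add(200.0, 200.0)
```

The boundedness shown in the last line is one reason these operators are used for images: results remain valid grey levels without clipping.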

1804.03415 2026-06-04 math.NA cs.CV cs.NA 版本更新

A Fast Hierarchically Preconditioned Eigensolver Based On Multiresolution Matrix Decomposition

基于多分辨率矩阵分解的快速分层预处理本征求解器

Thomas Y. Hou, De Huang, Ka Chun Lam, Ziyun Zhang

AI总结 本文提出一种基于多分辨率算子压缩框架的迭代方法,用于高效计算稀疏对称正矩阵的大量最左端本征对,通过将多分辨率框架整合到隐式重启Lanczos方法中,提出扩展-细化迭代方案以提高计算效率。

Comments 46 pages, 11 figures, 10 tables

详情
AI中文摘要

本文提出了一种新的迭代方法,用于在多分辨率算子压缩框架下分层计算稀疏对称正矩阵的相对大量最左端本征对。我们通过将多分辨率框架整合到隐式重启Lanczos方法中,利用每个分解组件的良好条件性。我们通过提出扩展-细化迭代方案实现这种结合,其核心思想是将目标谱分解成若干段,使得每段对应的本征问题条件良好。文中还给出了理论分析和数值示例,以说明该算法的效率和有效性。

英文摘要

In this paper we propose a new iterative method to hierarchically compute a relatively large number of leftmost eigenpairs of a sparse symmetric positive matrix under the multiresolution operator compression framework. We exploit the well-conditioned property of every decomposition component by integrating the multiresolution framework into the implicitly restarted Lanczos method. We achieve this combination by proposing an extension-refinement iterative scheme, in which the intrinsic idea is to decompose the target spectrum into several segments such that the corresponding eigenproblem in each segment is well-conditioned. Theoretical analysis and numerical examples are also reported to illustrate the efficiency and effectiveness of this algorithm.

1710.04265 2026-06-04 math.NA cs.CV cs.NA 版本更新

Solutions of Quadratic First-Order ODEs applied to Computer Vision Problems

二次一阶微分方程的解及其在计算机视觉问题中的应用

David Casillas-Perez, Daniel Pizarro, Manuel Mazo, Adrien Bartoli

发表机构 * Department of Electronic, University of Alcalá(阿尔卡拉大学电子系) ISIT - CNRS/Université d’Auvergne(奥弗涅大学ISIT-CNRS)

AI总结 本文研究了特定二次一阶微分方程的存在性和唯一性,探讨了其在平面-透视曲线重建中的应用,并提出了最大深度函数和最大深度解问题。

Comments Version 2: new change of variable; maximal curve; maximal solution; convergence cones. Version 3: modifies the author list and the abstract in the metadata

详情
AI中文摘要

本文研究了特定二次一阶微分方程的存在性和唯一性,探讨了其在平面-透视曲线重建中的应用,并提出了最大深度函数和最大深度解问题。

英文摘要

This article studies the existence and uniqueness of solutions of a specific quadratic first-order ODE that frequently appears in multiple reconstruction problems. It is called the planar-perspective equation due to the duality with the geometric problem of reconstruction of planar-perspective curves from their modulus. Because of this geometric interpretation, solutions of the planar-perspective equation are related to planar curves parametrized with a perspective parametrization. The article proves the existence of only two local solutions to the initial value problem with regular initial conditions and a maximum of two analytic solutions with critical initial conditions. The article also gives theorems to extend the local definition domain where the existence of both solutions is guaranteed. It introduces the maximal depth function as a function that upper-bounds all possible solutions of the planar-perspective equation and contains all its possible critical points. Finally, the article describes the maximal-depth solution problem, which consists of finding the solution of the referred equation that has maximum depth, and proves its uniqueness. It is an important problem because it does not need initial conditions to obtain the unique solution, and it is the solution that state-of-the-art practical algorithms frequently give.

1709.05746 2026-06-04 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

Adversarial Discriminative Sim-to-real Transfer of Visuo-motor Policies

对抗性判别仿真到现实的视觉-运动策略转移

Fangyi Zhang, Jürgen Leitner, Zongyuan Ge, Michael Milford, Peter Corke

发表机构 * Australian Centre for Robotic Vision (ACRV)(澳大利亚机器人视觉中心) Queensland University of Technology (QUT)(昆士兰科技大学) Monash University(莫纳什大学)

AI总结 本文提出对抗性判别仿真到现实转移方法,减少现实数据标注成本,在桌面上物体抓取任务中,通过视觉观测控制7自由度机械臂在障碍物中抓取蓝色立方体,仅需93个标注和186个未标注图像即可实现97.8%的成功率和1.8厘米的控制精度。

Comments Under review for the International Journal of Robotics Research

详情
AI中文摘要

各种方法已被提出以学习用于现实世界机器人应用的视觉-运动策略。一种解决方案是首先在仿真中学习然后转移到现实世界。在转移过程中,大多数现有方法需要带有标签的真实图像。然而,在许多机器人应用中,标注过程往往昂贵甚至不实际。在本文中,我们提出了一种对抗性判别仿真到现实转移方法,以减少标注真实数据的成本。通过模块化网络在桌面物体抓取任务中验证了该方法的有效性,其中7自由度的机械臂以速度模式控制在障碍物中抓取蓝色立方体。对抗性转移方法将标注真实数据的需求减少了50%。策略可以仅使用93个标注和186个未标注的真实图像转移到现实环境。转移的视觉-运动策略对训练中未见过的物体和移动目标具有鲁棒性,实现了97.8%的成功率和1.8厘米的控制精度。

英文摘要

Various approaches have been proposed to learn visuo-motor policies for real-world robotic applications. One solution is first learning in simulation then transferring to the real world. In the transfer, most existing approaches need real-world images with labels. However, the labelling process is often expensive or even impractical in many robotic applications. In this paper, we propose an adversarial discriminative sim-to-real transfer approach to reduce the cost of labelling real data. The effectiveness of the approach is demonstrated with modular networks in a table-top object reaching task where a 7 DoF arm is controlled in velocity mode to reach a blue cuboid in clutter through visual observations. The adversarial transfer approach reduced the labelled real data requirement by 50%. Policies can be transferred to real environments with only 93 labelled and 186 unlabelled real images. The transferred visuo-motor policies are robust to novel (not seen in training) objects in clutter and even a moving target, achieving a 97.8% success rate and 1.8 cm control accuracy.

1710.01493 2026-06-04 cs.LG cs.CV cs.NA math.NA math.OC 版本更新

Image Labeling Based on Graphical Models Using Wasserstein Messages and Geometric Assignment

基于图形模型的图像标注:利用Wasserstein消息与几何分配

Ruben Hühnerbein, Fabrizio Savarino, Freddie Åström, Christoph Schnörr

发表机构 * Image and Pattern Analysis Group, Heidelberg University, Germany(海德堡大学图像与模式分析组) Heidelberg Collaboratory for Image Processing, Heidelberg University, Germany(海德堡图像处理协同实验室)

AI总结 本文提出基于离散图模型的最大后验推断新方法,利用局部Wasserstein距离近似目标函数并实现并行收敛。

详情
AI中文摘要

我们介绍了一种基于离散图模型的最大后验推断新方法。通过利用局部Wasserstein距离来耦合图底层边的分配措施,给定的离散目标函数被平滑近似并限制在分配流形上。相应的乘法更新方案结合了两个过程:(i)所得到的黎曼梯度流的几何积分,以及(ii)将解四舍五入为有效的标签。在整个过程中,已知的LP松弛方法中的局部边缘约束得以满足,而平滑的几何设置导致快速收敛的迭代,可以并行执行每条边。

英文摘要

We introduce a novel approach to Maximum A Posteriori inference based on discrete graphical models. By utilizing local Wasserstein distances for coupling assignment measures across edges of the underlying graph, a given discrete objective function is smoothly approximated and restricted to the assignment manifold. A corresponding multiplicative update scheme combines in a single process (i) geometric integration of the resulting Riemannian gradient flow and (ii) rounding to integral solutions that represent valid labelings. Throughout this process, local marginalization constraints known from the established LP relaxation are satisfied, whereas the smooth geometric setting results in rapidly converging iterations that can be carried out in parallel for every edge.

1804.02307 2026-06-04 math.OC cs.CV cs.NA math.NA 版本更新

Accelerated Optimization in the PDE Framework: Formulations for the Manifold of Diffeomorphisms

PDE框架中的加速优化:微分同胚流形上的表述

Ganesh Sundaramoorthi, Anthony Yezzi

发表机构 * KAUST (King Abdullah University of Science and Technology)(阿卜杜拉国王科技大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出了一类适用于微分同胚流形上优化问题的新方法,通过将Nesterov加速优化推广到无限维流形,推导出连续演化方程并将其与流体力学原理联系起来,同时与最优质量传输问题建立联系。

详情
AI中文摘要

我们考虑在无限维的微分同胚流形上优化成本泛函的问题。我们将Nesterov加速优化推广到微分同胚流形,提出了一类新的优化方法,适用于任何定义在微分同胚空间上的优化问题。虽然我们的框架对无限维流形具有一般性,但受计算机视觉中光流问题的启发,我们专门处理微分同胚的情形。这是在Wibisono、Wilson和Jordan近期针对一大类加速优化方法提出的变分方法的基础上实现的,该方法适用于有限维情形;我们将其推广到无限维流形。我们推导出形式出人意料地简单的连续演化方程(即偏微分方程)用于加速梯度下降,并将其与流体力学中的简单力学原理联系起来。我们的方法与最优质量传输问题有自然联系,因为可以将其视为无限多个具有质量(用质量密度表示)的粒子在能量景观中运动的演化过程。质量随优化变量演化,并赋予粒子动力学。这与有限维情形不同:后者只有单个粒子运动,因此动力学不依赖于质量。我们推导了理论,计算了加速优化的偏微分方程,并展示了这些新加速优化方案的行为。

英文摘要

We consider the problem of optimization of cost functionals on the infinite-dimensional manifold of diffeomorphisms. We present a new class of optimization methods, valid for any optimization problem set up on the space of diffeomorphisms, by generalizing Nesterov accelerated optimization to the manifold of diffeomorphisms. While our framework is general for infinite-dimensional manifolds, we specifically treat the case of diffeomorphisms, motivated by optical flow problems in computer vision. This is accomplished by building on a recent variational approach to a general class of accelerated optimization methods by Wibisono, Wilson and Jordan, which applies in finite dimensions. We generalize that approach to infinite-dimensional manifolds. We derive the surprisingly simple continuum evolution equations, which are partial differential equations, for accelerated gradient descent, and relate them to simple mechanical principles from fluid mechanics. Our approach has natural connections to the optimal mass transport problem. This is because one can think of our approach as an evolution of an infinite number of particles endowed with mass (represented with a mass density) that move in an energy landscape. The mass evolves with the optimization variable, and endows the particles with dynamics. This is different from the finite-dimensional case, where only a single particle moves and hence the dynamics does not depend on the mass. We derive the theory, compute the PDEs for accelerated optimization, and illustrate the behavior of these new accelerated optimization schemes.

1805.09408 2026-06-04 cs.CV cs.NA math.NA 版本更新

Non-convex non-local flows for saliency detection

非凸非局部流用于显著性检测

Iván Ramírez, Gonzalo Galiano, Emanuele Schiavi

发表机构 * Dpt. of Mathematics, Universidad Rey Juan Carlos(数学系,胡安卡洛斯国王大学) Dpt. of Mathematics, Universidad de Oviedo(数学系,奥维耶多大学)

AI总结 本文提出并求解了新的变分模型用于数字图像自动显著性检测,结合非局部框架和新的二次显著性检测项,用于胶质瘤在MRI-Flair图像中的分割。

详情
AI中文摘要

我们提出并数值求解了一个新的变分模型,用于数字图像的自动显著性检测。使用非局部框架,我们考虑了一组保持边缘的函数,结合一个新的二次显著性检测项。该术语定义了一个受p-拉普拉斯算子驱动的约束双侧障碍问题,包括所谓的超拉普拉斯情况(0 < p < 1)。然后考虑并应用了相关的非凸非局部反应流,用于MRI-Flair图像中的胶质瘤分割。通过快速卷积核基于的近似解进行计算。数值实验显示,与超拉普拉斯算子相关的非凸性在标准度量方面提供了单调改进的结果。

英文摘要

We propose and numerically solve a new variational model for automatic saliency detection in digital images. Using a non-local framework, we consider a family of edge-preserving functions combined with a new quadratic saliency detection term. This term defines a constrained bilateral obstacle problem for image classification driven by p-Laplacian operators, including the so-called hyper-Laplacian case (0 < p < 1). The related non-convex non-local reactive flows are then considered and applied to glioblastoma segmentation in magnetic resonance fluid-attenuated inversion recovery (MRI-Flair) images. A fast approximate solution based on convolutional kernels is computed. The numerical experiments show how the non-convexity related to the hyper-Laplacian operators provides monotonically better results in terms of the standard metrics.

1805.08095 2026-06-04 cs.LG cs.CV cs.NA math.NA stat.ML 版本更新

Small steps and giant leaps: Minimal Newton solvers for Deep Learning

小步与巨跃:用于深度学习的最小牛顿求解器

João F. Henriques, Sebastien Ehrhardt, Samuel Albanie, Andrea Vedaldi

发表机构 * Visual Geometry Group, University of Oxford(视觉几何组,牛津大学)

AI总结 本文提出一种快速的二阶方法,可作为现有深度学习求解器的替代方案。该方法仅需每个迭代两次额外的前向模式自动微分操作,计算成本与两次标准前向传递相当,易于实现。方法解决了现有二阶求解器的长期问题,避免了计算Hessian矩阵的近似逆矩阵的高成本和噪声敏感性。

详情
AI中文摘要

我们提出了一种快速的二阶方法,可作为现有深度学习求解器的替代方案。与随机梯度下降(SGD)相比,该方法每个迭代仅需两次额外的前向模式自动微分操作,计算成本与两次标准前向传递相当,且易于实现。我们的方法解决了现有二阶求解器的长期问题,即每次迭代精确或通过共轭梯度法计算近似Hessian矩阵的逆矩阵,这一过程成本高且对噪声敏感。相反,我们提出保持一个梯度的估计值,该估计值通过逆Hessian矩阵投影得到,并在每次迭代中更新一次。该估计值的大小相同,类似于SGD中常用的动量变量。不维护Hessian的估计值。我们首先在具有已知闭式解的小问题上验证了我们的方法,称为CurveBall,包括噪声Rosenbrock函数和退化的两层线性网络,其中现有深度学习求解器似乎难以处理。然后我们在CIFAR和ImageNet上训练了多个大型模型,包括ResNet和VGG-f网络,展示了无需超参数调优的更快收敛速度。代码已提供。

英文摘要

We propose a fast second-order method that can be used as a drop-in replacement for current deep learning solvers. Compared to stochastic gradient descent (SGD), it only requires two additional forward-mode automatic differentiation operations per iteration, which has a computational cost comparable to two standard forward passes and is easy to implement. Our method addresses long-standing issues with current second-order solvers, which invert an approximate Hessian matrix every iteration exactly or by conjugate-gradient methods, a procedure that is both costly and sensitive to noise. Instead, we propose to keep a single estimate of the gradient projected by the inverse Hessian matrix, and update it once per iteration. This estimate has the same size and is similar to the momentum variable that is commonly used in SGD. No estimate of the Hessian is maintained. We first validate our method, called CurveBall, on small problems with known closed-form solutions (noisy Rosenbrock function and degenerate 2-layer linear networks), where current deep learning solvers seem to struggle. We then train several large models on CIFAR and ImageNet, including ResNet and VGG-f networks, where we demonstrate faster convergence with no hyperparameter tuning. Code is available.
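
The single-buffer idea in the abstract above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation. It assumes an update of the form z ← ρz − β(Hz + g) followed by w ← w + z, where g is the gradient and Hz is a Hessian-vector product. The 2-D quadratic, the explicit Hessian (a real solver would obtain Hz by automatic differentiation), the function name `curveball_quadratic`, and the fixed values of `rho` and `beta` are all illustrative assumptions.

```python
def curveball_quadratic(A, b, steps=200, rho=0.9, beta=0.1):
    """Sketch of a momentum-like second-order update on
    f(w) = 0.5 * w^T A w - b^T w, for a symmetric 2x2 matrix A."""
    w = [0.0, 0.0]
    z = [0.0, 0.0]  # single buffer with the same size as w; no Hessian estimate is stored
    for _ in range(steps):
        # Gradient g = A w - b
        g = [A[i][0] * w[0] + A[i][1] * w[1] - b[i] for i in range(2)]
        # Hessian-vector product H z (here H = A, written out explicitly)
        hz = [A[i][0] * z[0] + A[i][1] * z[1] for i in range(2)]
        # Assumed update: z <- rho * z - beta * (H z + g), then w <- w + z
        z = [rho * z[i] - beta * (hz[i] + g[i]) for i in range(2)]
        w = [w[i] + z[i] for i in range(2)]
    return w

# Ill-conditioned toy problem whose minimiser is w = [1, 1]
w_final = curveball_quadratic([[1.0, 0.0], [0.0, 10.0]], [1.0, 10.0])
```

The buffer `z` plays the role of the momentum variable mentioned in the abstract: it is driven toward the Newton step without ever forming or inverting the Hessian.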

1804.10432 2026-06-04 math.NA cs.CV cs.NA math.DG 版本更新

Variational Regularization of Inverse Problems for Manifold-Valued Data

变分正则化用于流形值数据的反问题

Martin Storath, Andreas Weinmann

发表机构 * Image Analysis and Learning Group, Interdisciplinary Center for Scientific Computing, Universität Heidelberg, Germany(图像分析与学习组,跨学科科学计算中心,海德堡大学,德国) Department of Mathematics and Natural Sciences, Hochschule Darmstadt, and Institute of Computational Biology, Helmholtz Zentrum München, Germany(数学与自然科学系,达姆施塔特应用科学大学,以及计算生物学研究所,亥姆霍兹慕尼黑中心,德国)

AI总结 本文研究流形值数据的变分正则化反问题,提出TV和TGV正则化方法,并通过合成和真实数据验证其有效性。

详情
AI中文摘要

本文考虑了在反问题设置中流形值数据的变分正则化。特别是,我们考虑了带有间接测量算子的TV和TGV正则化。我们提供了关于正则化问题良定性的结果,并给出了在流形设置中实现这些模型的算法。此外,我们通过合成和真实数据的实验结果,展示了所提出方案的应用潜力。

英文摘要

In this paper, we consider the variational regularization of manifold-valued data in the inverse problems setting. In particular, we consider TV and TGV regularization for manifold-valued data with indirect measurement operators. We provide results on the well-posedness and present algorithms for a numerical realization of these models in the manifold setup. Further, we provide experimental results for synthetic and real data to show the potential of the proposed schemes for applications.

1801.02686 2026-06-04 cs.CV cs.SY eess.SY 版本更新

Towards Multi-Object Detection and Tracking in Urban Scenario under Uncertainties

面向城市场景下的多目标检测与跟踪在不确定性中的研究

Achim Kampker, Mohsen Sefati, Arya Abdul Rachman, Kai Kreisköther, Pascual Campoy

AI总结 本文提出一种实时框架,结合3D激光雷达的遮挡感知检测与高效启发式过滤,以应对城市环境中传感器限制和目标运动复杂性带来的不确定性,实现高效的多目标跟踪。

Comments Some significant editorial/editing issues are found upon review. Paper will undergo language re-proofing before resubmitted

详情
Journal ref
4th.VEHITS.Proc. 109 (2018) 156-167
AI中文摘要

面向自动驾驶车辆的城市应用需要可靠的感知技术来应对高不确定性。最近引入的紧凑型3D激光雷达传感器提供了周围空间信息,可用于增强车辆感知。我们提出了一种实时集成框架,用于使用3D激光雷达的多目标检测和跟踪,旨在城市使用。我们的方法结合了遮挡感知的检测方法,计算高效的启发式规则过滤和自适应概率跟踪,以处理3D激光雷达的传感限制和目标运动复杂性带来的不确定性。使用真实世界预录制的3D激光雷达数据进行评估并与最新作品进行比较的结果表明,我们的框架能够在城市环境中实现有希望的跟踪性能。

英文摘要

Urban-oriented autonomous vehicles require reliable perception technology to cope with a high degree of uncertainty. The recently introduced compact 3D LIDAR sensor offers surround spatial information that can be exploited to enhance vehicle perception. We present a real-time integrated framework for multi-target object detection and tracking using 3D LIDAR, geared toward urban use. Our approach combines a sensor occlusion-aware detection method with computationally efficient heuristic rule-based filtering and adaptive probabilistic tracking to handle uncertainties arising from the sensing limitations of 3D LIDAR and the complexity of target object movement. Evaluation results using real-world pre-recorded 3D LIDAR data, and comparison with state-of-the-art works, show that our framework is capable of achieving promising tracking performance in urban situations.

1804.06114 2026-06-04 cs.LG cs.CV cs.NA math.NA stat.ML 版本更新

A Support Tensor Train Machine

支持张量列车机

Cong Chen, Kim Batselier, Ching-Yun Ko, Ngai Wong

发表机构 * The Department of Electrical and Electronic Engineering, The University of Hong Kong(香港大学电子与电气工程系)

AI总结 本文提出支持张量列车机,通过将传统支持张量机中的秩一张量替换为张量列车,提升模型表达能力,实验验证其优于SVM和STM。

Comments 7 pages

详情
AI中文摘要

近年来,将传统向量机技术扩展到张量形式引起了广泛关注。例如,支持张量机(STM)利用秩一张量捕捉数据结构,从而缓解传统支持向量机(SVM)中的过拟合和维度灾难问题。然而,秩一张量的表达能力对于许多现实数据来说是有限的。为克服这一限制,我们引入支持张量列车机(STTM),通过将STM中的秩一张量替换为张量列车。实验验证并确认STTM优于SVM和STM。

英文摘要

There has been growing interest in extending traditional vector-based machine learning techniques to their tensor forms. An example is the support tensor machine (STM) that utilizes a rank-one tensor to capture the data structure, thereby alleviating the overfitting and curse of dimensionality problems in the conventional support vector machine (SVM). However, the expressive power of a rank-one tensor is restrictive for many real-world data. To overcome this limitation, we introduce a support tensor train machine (STTM) by replacing the rank-one tensor in an STM with a tensor train. Experiments validate and confirm the superiority of an STTM over the SVM and STM.

1612.00181 2026-06-04 cs.CV cs.NA math.NA 版本更新

Monge's Optimal Transport Distance for Image Classification

基于Monge最优传输距离的图像分类

Michael Snow, Jan Van lent

发表机构 * Department of Engineering Design and Mathematics, Centre for Machine Vision, University of the West of England(工程设计与数学系,机器视觉中心,西英格兰大学)

AI总结 本文提出利用Wasserstein距离进行图像比较,通过求解Monge问题的高效数值方法,并用1-NN算法展示其在图像分类中的优势。

Comments 15 pages, 14 figures

详情
AI中文摘要

本文聚焦于一种用于图像比较的相似性度量,即Wasserstein距离。Wasserstein距离源于Monge最优运输问题的偏微分方程(PDE) formulation。我们提出了一个高效的数值求解方法来解决Monge问题。为了展示该度量在图像比较中的判别能力,我们使用$1$-近邻($1$-NN)机器学习算法来展示该度量相对于其他更传统距离度量以及Tangent Space距离在MNIST数据集上的优势。到目前为止,Wasserstein度量的PDE formulation尚未用于处理图像比较,也尚未在$1$-nearest neighbour架构中使用Wasserstein距离。

英文摘要

This paper focuses on a similarity measure, known as the Wasserstein distance, with which to compare images. The Wasserstein distance results from a partial differential equation (PDE) formulation of Monge's optimal transport problem. We present an efficient numerical solution method for solving Monge's problem. To demonstrate the measure's discriminatory power when comparing images, we use a $1$-Nearest Neighbour ($1$-NN) machine learning algorithm to illustrate the measure's potential benefits over other more traditional distance metrics and also the Tangent Space distance, designed to perform excellently on the well-known MNIST dataset. To our knowledge, the PDE formulation of the Wasserstein metric has not been presented for dealing with image comparison, nor has the Wasserstein distance been used within the $1$-nearest neighbour architecture.
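
To make the idea of nearest-neighbour classification under a transport distance concrete, here is a minimal sketch. It is not the paper's method: the paper solves Monge's problem through a PDE formulation on 2-D images, whereas this example swaps in the closed-form Wasserstein-1 distance between 1-D histograms (the sum of absolute differences of the cumulative distributions on unit-spaced bins). The function names and the toy data are hypothetical.

```python
from itertools import accumulate

def wasserstein1_1d(p, q):
    """W1 distance between two histograms on the same unit-spaced bins.
    Both histograms are normalised to unit mass first."""
    sp, sq = sum(p), sum(q)
    cdf_p = accumulate(x / sp for x in p)
    cdf_q = accumulate(x / sq for x in q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))

def nearest_neighbour_label(query, train):
    """1-NN rule: return the label of the training histogram closest in W1."""
    return min(train, key=lambda item: wasserstein1_1d(query, item[0]))[1]

# Toy training set: mass concentrated on the left or on the right
train = [([4, 1, 0, 0], "left"), ([0, 0, 1, 4], "right")]
label = nearest_neighbour_label([3, 2, 0, 0], train)
```

Unlike a bin-wise Euclidean distance, the transport distance grows with how far mass has to move, which is the property that makes it attractive for comparing shifted or deformed images.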

1712.08585 2026-06-04 math.OC cs.CV cs.NA math.NA 版本更新

Denoising of image gradients and total generalized variation denoising

图像梯度去噪与总泛化变分去噪

Birgit Komander, Dirk A. Lorenz, Lena Vestweber

发表机构 * Institute of Analysis and Algebra(分析与代数研究所) TU Braunschweig(布伦瑞克工业大学)

AI总结 本文重新审视全变分去噪,提出增强模型假设已获得图像梯度估计,改进了图像重建质量,并推导出与总泛化变分去噪方法相似的模型,提出约束去噪模型和参数自由的变分去噪模型,采用Chambolle-Pock和Douglas-Rachford方法进行数值实验,验证了预处理对收敛速度的提升。

详情
AI中文摘要

我们重新审视全变分去噪,并研究一个增强模型,其中假设已获得图像梯度的估计。我们证明这会提高图像重建质量,并推导出所得到的模型类似于总泛化变分去噪方法,从而为该模型提供了新的动机。进一步,我们提出使用约束去噪模型,并开发一个基本无参数的变分去噪模型,即所有模型参数都直接从噪声图像中估计。此外,我们使用Chambolle-Pock的对偶方法以及Douglas-Rachford方法用于新模型。对于后者,必须解决大规模的偏微分方程离散化。我们提出以不精确的方式使用预条件共轭梯度法进行处理,并为此推导出预条件器。数值实验表明,所得到的方法具有良好的去噪性能,并且预处理显著提高了收敛速度。最后我们分析了不同TGV去噪问题形式的对偶间隙,并推导出一个简单的停止准则。

英文摘要

We revisit total variation denoising and study an augmented model where we assume that an estimate of the image gradient is available. We show that this increases the image reconstruction quality and derive that the resulting model resembles the total generalized variation denoising method, thus providing a new motivation for this model. Further, we propose to use a constrained denoising model and develop a variational denoising model that is basically parameter-free, i.e. all model parameters are estimated directly from the noisy image. Moreover, we use Chambolle-Pock's primal-dual method as well as the Douglas-Rachford method for the new models. For the latter, one has to solve large discretizations of partial differential equations. We propose to do this in an inexact manner using the preconditioned conjugate gradient method and derive preconditioners for this. Numerical experiments show that the resulting method has good denoising properties and that preconditioning increases convergence speed significantly. Finally, we analyze the duality gap of different formulations of the TGV denoising problem and derive a simple stopping criterion.
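
As a point of reference for the primal-dual method named in the abstract, the following minimal sketch applies Chambolle-Pock iterations to plain 1-D total variation denoising, min_u 0.5‖u − f‖² + λ·TV(u). It does not implement the augmented or TGV models of the paper. The signal, the value of `lam`, the step sizes, and the function names are assumptions chosen for illustration; the step sizes satisfy τσ‖D‖² < 1 with ‖D‖² ≤ 4 for the 1-D forward difference D.

```python
def total_variation(u):
    """Discrete 1-D total variation: sum of absolute forward differences."""
    return sum(abs(u[i + 1] - u[i]) for i in range(len(u) - 1))

def tv_denoise_1d(f, lam=0.2, iters=300, tau=0.45, sigma=0.45):
    """Chambolle-Pock iterations for min_u 0.5*||u - f||^2 + lam*TV(u)."""
    n = len(f)
    u = list(f)
    u_bar = list(f)
    p = [0.0] * (n - 1)  # dual variable, one entry per forward difference
    for _ in range(iters):
        # Dual ascent step followed by projection onto [-lam, lam]
        p = [max(-lam, min(lam, p[i] + sigma * (u_bar[i + 1] - u_bar[i])))
             for i in range(n - 1)]
        # Adjoint of the forward difference applied to p
        dt_p = [(p[i - 1] if i > 0 else 0.0) - (p[i] if i < n - 1 else 0.0)
                for i in range(n)]
        # Primal step: proximal map of the quadratic data term
        u_new = [(u[i] - tau * dt_p[i] + tau * f[i]) / (1.0 + tau)
                 for i in range(n)]
        # Over-relaxation of the primal variable
        u_bar = [2.0 * u_new[i] - u[i] for i in range(n)]
        u = u_new
    return u

# Step signal corrupted by a deterministic alternating perturbation
noisy = [(0.0 if i < 20 else 1.0) + (0.1 if i % 2 == 0 else -0.1)
         for i in range(40)]
denoised = tv_denoise_1d(noisy)
```

The dual variable is clipped to the interval [−λ, λ], which is what suppresses small oscillations while leaving the large jump in place.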

1710.10781 2026-06-04 math.NA cs.CV cs.LG cs.NA stat.ML 版本更新

Stochastic variance reduced multiplicative update for nonnegative matrix factorization

随机方差缩减乘法更新用于非负矩阵分解

Hiroyuki Kasai

发表机构 * Graduate School of Informatics and Engineering, The University of Electro-Communications(信息与工程研究生院,东京电波通信大学)

AI总结 本文提出一种随机方差缩减乘法更新算法,改进非负矩阵分解的收敛速度,通过数值实验验证其在不同数据集上的优越性。

Comments IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2018)

详情
AI中文摘要

非负矩阵分解(NMF)是一种降维和因子分析方法,其因子矩阵具有低秩非负约束。考虑到NMF中的随机学习,本文特别针对最流行的乘法更新(MU)规则,该规则收敛速度较慢。本文提出一种随机梯度的方差缩减技术,数值比较表明,所提出的算法在不同合成和实际数据集上均优于现有算法。

英文摘要

Nonnegative matrix factorization (NMF), a dimensionality reduction and factor analysis method, is a special case in which factor matrices have low-rank nonnegative constraints. Considering stochastic learning in NMF, we specifically address the multiplicative update (MU) rule, which is the most popular but converges slowly. This paper applies a variance-reduction technique for stochastic gradients to the stochastic MU rule. Numerical comparisons suggest that our proposed algorithms robustly outperform state-of-the-art algorithms across different synthetic and real-world datasets.
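
For readers unfamiliar with the MU rule that the abstract builds on, here is a minimal sketch of the classical batch multiplicative update for min ‖V − WH‖²_F with W, H ≥ 0. It is the baseline rule only, not the stochastic variance-reduced variant proposed in the paper. The helper names, the small `eps` guard against division by zero, and the toy matrices are assumptions.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def mu_step(V, W, H, eps=1e-12):
    """One batch multiplicative update of H, then of W."""
    Wt = transpose(W)
    num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
    H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
         for rh, rn, rd in zip(H, num, den)]
    Ht = transpose(H)
    num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
    W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
         for rw, rn, rd in zip(W, num, den)]
    return W, H

V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0]]
W = [[0.5, 0.8], [0.9, 0.3], [0.4, 0.7]]
H = [[0.6, 0.2, 0.9], [0.3, 0.7, 0.5]]
errors = [frob_error(V, W, H)]
for _ in range(100):
    W, H = mu_step(V, W, H)
    errors.append(frob_error(V, W, H))
```

Because each factor is multiplied by a ratio of nonnegative quantities, nonnegativity is preserved without any projection step, which is the main appeal of the rule despite its slow convergence.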

1705.10887 2026-06-04 stat.ML cs.CV cs.LG cs.NA math.NA 版本更新

Efficient, sparse representation of manifold distance matrices for classical scaling

高效表示经典标度中的流形距离矩阵

Javier S. Turek, Alexander Huth

发表机构 * Intel Labs(英特尔实验室) The University of Texas at Austin(得克萨斯大学奥斯汀分校)

AI总结 本文提出一种基于双调和插值的稀疏方法,用于高效表示流形距离矩阵,相比现有方法速度快2倍,内存占用低20倍,能处理大规模点集。

Comments Conference CVPR 2018

详情
AI中文摘要

Geodesic距离矩阵可以揭示对非刚性变形不敏感的形状特性,因此常用于分析和表示3-D形状。然而,这些矩阵随点数的平方增长,因此对于大规模点集常用低秩近似来存储和分析。本文提出了一种新颖的稀疏方法,利用双调和插值高效表示流形距离矩阵。该方法利用数据流形的知识,学习一个稀疏插值算子,通过部分点近似距离。我们证明,与现有方法相比,该方法在处理大规模点集的MDS问题时速度快2倍,内存占用低20倍,质量相似。这使得分析之前不可行的大规模点集成为可能。

英文摘要

Geodesic distance matrices can reveal shape properties that are largely invariant to non-rigid deformations, and thus are often used to analyze and represent 3-D shapes. However, these matrices grow quadratically with the number of points. Thus for large point sets it is common to use a low-rank approximation to the distance matrix, which fits in memory and can be efficiently analyzed using methods such as multidimensional scaling (MDS). In this paper we present a novel sparse method for efficiently representing geodesic distance matrices using biharmonic interpolation. This method exploits knowledge of the data manifold to learn a sparse interpolation operator that approximates distances using a subset of points. We show that our method is 2x faster and uses 20x less memory than current leading methods for solving MDS on large point sets, with similar quality. This enables analyses of large point sets that were previously infeasible.

1803.08137 2026-06-04 cs.CV cs.AI cs.NA math.NA stat.ML 版本更新

Robust Blind Deconvolution via Mirror Descent

通过镜像下降实现鲁棒盲去卷积

Sathya N. Ravi, Ronak Mehta, Vikas Singh

AI总结 本文研究盲去卷积的鲁棒性和收敛性,提出一种具有理论保证的算法,在实践中表现优异。

详情
AI中文摘要

我们重新审视盲去卷积问题,重点在于理解其鲁棒性和收敛性属性。可证明的鲁棒性对噪声和其他扰动的容忍能力最近在视觉领域受到关注,从获得对抗攻击的免疫性到评估和描述关键任务应用中算法的失败模式。此外,许多基于深度架构的盲去卷积方法内部使用或优化基本公式,因此更清楚地理解该子模块的行为、何时可以求解以及它可以容忍多少噪声注入是首要要求。我们推导了盲去卷积理论基础的新见解。出现的算法具有良好的收敛保证,并在我们论文中正式定义的意义上被证明是鲁棒的。有趣的是,这些技术结果在实践中表现非常出色,其中在标准数据集上,我们的算法结果与或优于现有最先进方法。关键词:盲去卷积,鲁棒连续优化

英文摘要

We revisit the Blind Deconvolution problem with a focus on understanding its robustness and convergence properties. Provable robustness to noise and other perturbations is receiving recent interest in vision, from obtaining immunity to adversarial attacks to assessing and describing failure modes of algorithms in mission critical applications. Further, many blind deconvolution methods based on deep architectures internally make use of or optimize the basic formulation, so a clearer understanding of how this sub-module behaves, when it can be solved, and what noise injection it can tolerate is a first order requirement. We derive new insights into the theoretical underpinnings of blind deconvolution. The algorithm that emerges has nice convergence guarantees and is provably robust in a sense we formalize in the paper. Interestingly, these technical results play out very well in practice, where on standard datasets our algorithm yields results competitive with or superior to the state of the art. Keywords: blind deconvolution, robust continuous optimization

1803.07226 2026-06-04 cs.CV cs.DS cs.NA math.NA 版本更新

Learning the Hierarchical Parts of Objects by Deep Non-Smooth Nonnegative Matrix Factorization

通过深度非光滑非负矩阵分解学习物体的层次部分

Jinshi Yu, Guoxu Zhou, Andrzej Cichocki, Shengli Xie

发表机构 * RIKEN(日本理化学研究所) SKOLTECH(莫斯科SKOLTECH)

AI总结 本文提出深度非光滑非负矩阵分解方法,通过更深层架构学习复杂数据的层次特征,结合非负约束生成部分特征并提取更高层次抽象特征,实验表明其在聚类分析中表现优异。

详情
AI中文摘要

非光滑非负矩阵分解(nsNMF)能够产生更局部化、重叠更少的特征表示,同时保持对数据的良好拟合。然而,nsNMF及其他现有NMF方法由于其浅层结构无法学习复杂数据的层次特征。为填补这一空白,本文提出了一种深度nsNMF方法,其架构比标准nsNMF更深入。深度nsNMF不仅由于非负约束生成部分特征,还通过组合低层特征生成更高层次的抽象特征。深入描述了深度架构如何帮助在dnsNMF中高效发现抽象特征。此外,本文还表明深度nsNMF与深度自编码器有密切关系,表明所提模型继承了深度学习和NMF的主要优势。大量实验表明,所提方法在聚类分析中表现出色。

英文摘要

Nonsmooth Nonnegative Matrix Factorization (nsNMF) is capable of producing more localized, less overlapped feature representations than other variants of NMF while keeping a satisfactory fit to the data. However, nsNMF, like other existing NMF methods, cannot learn hierarchical features of complex data because of its shallow structure. To fill this gap, we propose a deep nsNMF (dnsNMF) method, so named because it possesses a deeper architecture than standard nsNMF. The deep nsNMF not only gives parts-based features due to the nonnegativity constraints, but also creates higher-level, more abstract features by combining lower-level ones. We present an in-depth description of how a deep architecture can help to efficiently discover abstract features in dnsNMF. We also show that the deep nsNMF has a close relationship with the deep autoencoder, suggesting that the proposed model inherits the major advantages of both deep learning and NMF. Extensive experiments demonstrate the standout performance of the proposed method in clustering analysis.

1803.05026 2026-06-04 cs.LG cs.CV cs.IT cs.NA math.IT math.NA 版本更新

Principal Component Analysis with Tensor Train Subspace

张量列车子空间下的主成分分析

Wenqi Wang, Vaneet Aggarwal, Shuchin Aeron

发表机构 * Purdue University(普渡大学)

AI总结 本文提出TT-PCA算法,通过保持低秩张量结构来估计结构化的张量列车子空间,相比PCA和Tucker-PCA更具鲁棒性,实验验证其有效性。

详情
AI中文摘要

张量列车是一种分层张量网络结构,通过参数化大规模多维数据集来缓解维度灾难。本文提出TT-PCA算法,用于从给定数据中估计这种结构化的张量列车子空间。通过保持低秩张量结构,TT-PCA比PCA或Tucker-PCA更具鲁棒性,这在测试扩展YaleFace数据集B时得到了数值验证。

英文摘要

Tensor train is a hierarchical tensor network structure that helps alleviate the curse of dimensionality by parameterizing large-scale multidimensional data via a network of low-rank tensors. Associated with such a construction is a notion of Tensor Train subspace, and in this paper we propose a TT-PCA algorithm for estimating this structured subspace from the given data. By maintaining a low-rank tensor structure, TT-PCA is more robust to noise than PCA or Tucker-PCA. This is borne out numerically by testing the proposed approach on the Extended YaleFace Dataset B.
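
The parameter saving behind the tensor train format can be seen in a small example. The sketch below is not the TT-PCA algorithm; it only shows how one entry of a tensor is read from its TT cores and how the number of stored values compares with the full tensor. The storage layout (`cores[k][i]` is the matrix selected by index `i` of mode `k`), the function names, and the example tensor T[i, j, k] = i + j + k are assumptions made for illustration.

```python
def tt_entry(cores, index):
    """Entry of a tensor stored in tensor-train format.

    cores[k][i] is the r_{k-1} x r_k matrix selected by index i of mode k,
    with boundary ranks r_0 = r_d = 1.
    """
    row = [1.0]  # running 1 x r_k row vector
    for core, i in zip(cores, index):
        mat = core[i]
        row = [sum(row[a] * mat[a][b] for a in range(len(row)))
               for b in range(len(mat[0]))]
    return row[0]

def tt_num_params(cores):
    """Total number of stored values across all cores."""
    return sum(len(core) * len(core[0]) * len(core[0][0]) for core in cores)

# T[i, j, k] = i + j + k on a 10 x 10 x 10 grid has TT ranks (1, 2, 2, 1)
n = 10
cores = [
    [[[1.0, float(i)]] for i in range(n)],                # 1 x 2 slices
    [[[1.0, float(j)], [0.0, 1.0]] for j in range(n)],    # 2 x 2 slices
    [[[float(k)], [1.0]] for k in range(n)],              # 2 x 1 slices
]
value = tt_entry(cores, (3, 4, 5))
params = tt_num_params(cores)
```

Here the three cores hold 80 numbers in place of the 1000 entries of the full tensor, and the gap widens quickly as the number of modes grows.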

1701.01945 2026-06-04 math.OC cs.CV cs.NA math.NA 版本更新

A Framework for Wasserstein-1-Type Metrics

Wasserstein-1型度量的框架

Bernhard Schmitzer, Benedikt Wirth

AI总结 本文提出一个统一框架,将Wasserstein-1度量推广为不同质量非负测度之间的差异度量,继承其凸性与计算效率,并通过数值实验验证其应用价值。

Comments to appear in Journal of Convex Analysis

详情
AI中文摘要

我们提出一个统一框架,用于将Wasserstein-1度量推广为不同质量非负测度之间的差异度量。该推广继承了Wasserstein-1度量的凸性和计算效率,并包含文献中几种先前方法作为特殊情况。通过各种具体实例的数值实验,我们进一步展示了其在应用中的有用性。

英文摘要

We propose a unifying framework for generalising the Wasserstein-1 metric to a discrepancy measure between nonnegative measures of different mass. This generalization inherits the convexity and computational efficiency from the Wasserstein-1 metric, and it includes several previous approaches from the literature as special cases. For various specific instances of the generalized Wasserstein-1 metric we furthermore demonstrate their usefulness in applications by numerical experiments.

1803.03104 2026-06-04 eess.SY cs.CV cs.SY math.DS stat.ML 版本更新

Applicability and interpretation of the deterministic weighted cepstral distance

确定性加权倒谱距离的适用性与解释

Oliver Lauwers, Bart De Moor

发表机构 * KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing(库勒文大学电子工程系(ESAT)、动态系统信号处理与数据分析中心)

AI总结 本文结合系统理论和机器学习,研究了加权谱距在确定性线性时不变单输入单输出模型中的应用,提出了一种基于输入输出信号信息评估系统稳定性和相位类型的纯数据驱动方法。

Comments 18 pages, 5 figures, submitted for review to Automatica

详情
AI中文摘要

量化数据对象之间的相似性是现代数据科学的重要部分。决定使用哪种相似性度量非常依赖于具体应用。本文结合系统理论和机器学习的见解,研究了之前为ARMA模型信号定义的加权谱距。我们将其扩展到可逆的确定性线性时不变单输入单输出模型,并评估其适用性。我们证明了该距离总能以底层模型的极点和零点进行解释,并在稳定、最小相位或不稳定、最大相位模型的情况下,可以以子空间角度进行几何解释。然后,我们提出了一种仅使用输入/输出信号信息的方法来评估生成模型的稳定性和相位类型。通过这种方式,我们证明了扩展的加权谱距与加权谱模型范数之间的联系。通过这种方式,我们提供了一种纯数据驱动的方法来评估输入/输出信号对的不同底层动态,而无需任何系统识别步骤。这在时间序列聚类等机器学习任务中可能很有用。本文还发布了一个iPython教程,包含各种方法和算法的实现,以及一些证明等价性的数值示例。

英文摘要

Quantifying similarity between data objects is an important part of modern data science. Deciding which similarity measure to use is very application dependent. In this paper, we combine insights from systems theory and machine learning, and investigate the weighted cepstral distance, which was previously defined for signals coming from ARMA models. We provide an extension of this distance to invertible deterministic linear time-invariant single-input single-output models, and assess its applicability. We show that it can always be interpreted in terms of the poles and zeros of the underlying model, and that, in the case of stable, minimum-phase, or unstable, maximum-phase models, a geometrical interpretation in terms of subspace angles can be given. We then devise a method to assess the stability and phase type of the generating models, using only input/output signal information. In doing so, we prove a connection between the extended weighted cepstral distance and a weighted cepstral model norm. This provides a purely data-driven way to assess different underlying dynamics of input/output signal pairs, without the need for any system identification step, which can be useful in machine learning tasks such as time series clustering. An iPython tutorial is published as a complement to this paper, containing implementations of the various methods and algorithms presented here, as well as some numerical illustrations of the equivalences proven here.

1605.00031 2026-06-04 cs.LG cs.CV cs.NA math.NA stat.ML 版本更新

Deep Convolutional Neural Networks on Cartoon Functions

深度卷积神经网络在卡通函数上的应用

Philipp Grohs, Thomas Wiatowski, Helmut Bölcskei

发表机构 * 1 Dept. Math., ETH Zurich, Switzerland

AI总结 本文研究深度卷积神经网络在卡通函数上的变形稳定性,提出考虑结构特性的新结果,适用于具有尖锐和弯曲不连续性的信号。

Comments This is a slightly updated version of the paper published in the ISIT proceedings. Specifically, we corrected errors in the arguments on the volume of tubes. Note that this correction does not affect the main statements of the paper

详情
Journal ref
Proc. of IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, pp. 1163-1167, July 2016
AI中文摘要

Wiatowski和Bölcskei, 2015证明了深度卷积神经网络基于的特征提取器的变形稳定性和垂直平移不变性由网络结构本身保证,而非特定卷积核和非线性。虽然平移不变性结果适用于平方可积函数,变形稳定性界仅适用于带限函数。然而,许多实际相关信号(如自然图像)表现出尖锐和弯曲的不连续性,因此不是带限的。本文的主要贡献是针对Donoho, 2001引入的卡通函数类建立变形稳定性界。

英文摘要

Wiatowski and Bölcskei, 2015, proved that deformation stability and vertical translation invariance of deep convolutional neural network-based feature extractors are guaranteed by the network structure per se rather than the specific convolution kernels and non-linearities. While the translation invariance result applies to square-integrable functions, the deformation stability bound holds for band-limited functions only. Many signals of practical relevance (such as natural images) exhibit, however, sharp and curved discontinuities and are, hence, not band-limited. The main contribution of this paper is a deformation stability result that takes these structural properties into account. Specifically, we establish deformation stability bounds for the class of cartoon functions introduced by Donoho, 2001.

1705.07364 2026-06-04 cs.LG cs.CV cs.NA math.NA 版本更新

Stabilizing Adversarial Nets With Prediction Methods

用预测方法稳定对抗网络

Abhay Yadav, Sohil Shah, Zheng Xu, David Jacobs, Tom Goldstein

发表机构 * University of Maryland, College Park(马里兰大学 College Park 分校)

AI总结 本文提出一种改进的随机梯度下降方法,通过稳定对抗网络的训练过程,使其更可靠地收敛到鞍点,提高训练稳定性与效率。

Comments Accepted at ICLR 2018

详情
AI中文摘要

对抗神经网络在数据科学中解决了很多重要问题,但训练却极具挑战性。这些困难源于对抗网络的最优权重对应于损失函数的鞍点而非极小值。通常用于此类问题的交替随机梯度方法难以可靠收敛到鞍点,且当收敛时对学习率极为敏感。本文提出一种简单的随机梯度下降修改方法,以稳定对抗网络。理论和实践中均表明,所提方法可靠收敛到鞍点,并在更宽的训练参数范围内保持稳定。这使对抗网络更少出现'崩溃'现象,并允许使用更大的学习率进行更快的训练。

英文摘要

Adversarial neural networks solve many important problems in data science, but are notoriously difficult to train. These difficulties come from the fact that optimal weights for adversarial nets correspond to saddle points, and not minimizers, of the loss function. The alternating stochastic gradient methods typically used for such problems do not reliably converge to saddle points, and when convergence does happen it is often highly sensitive to learning rates. We propose a simple modification of stochastic gradient descent that stabilizes adversarial networks. We show, both in theory and practice, that the proposed method reliably converges to saddle points, and is stable with a wider range of training parameters than a non-prediction method. This makes adversarial networks less likely to "collapse," and enables faster training with larger learning rates.
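
The difficulty with saddle points, and the effect of a lookahead step, can be reproduced on the simplest possible example. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes the prediction step takes the form ū = u_new + (u_new − u), applies it to the toy bilinear objective L(u, v) = u·v (minimised over u, maximised over v), and uses deterministic gradients in place of stochastic ones. The function name and parameter values are hypothetical.

```python
def saddle_iterations(steps=3000, lr=0.1, predict=True):
    """Alternating gradient steps on L(u, v) = u * v.
    Returns the distance of the final iterate from the saddle point (0, 0)."""
    u, v = 1.0, 1.0
    for _ in range(steps):
        u_new = u - lr * v                       # descent step in u (dL/du = v)
        # Assumed prediction step: extrapolate u along its latest change
        u_used = 2.0 * u_new - u if predict else u_new
        v = v + lr * u_used                      # ascent step in v (dL/dv = u)
        u = u_new
    return (u * u + v * v) ** 0.5

dist_with_prediction = saddle_iterations(predict=True)
dist_without_prediction = saddle_iterations(predict=False)
```

On this toy problem the plain alternating scheme keeps orbiting the saddle point at a roughly constant distance, while the variant with the lookahead step spirals inward.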

1801.09238 2026-06-04 eess.SY cs.CV cs.SY stat.AP 版本更新

Performance Analysis of Robust Stable PID Controllers Using Dominant Pole Placement for SOPTD Process Models

基于主导极点放置的鲁棒稳定PID控制器性能分析用于SOPTD过程模型

Saptarshi Das, Kaushik Halder, Amitava Gupta

发表机构 * Department of Mathematics, College of Engineering, Mathematics and Physical Sciences, University of Exeter(数学系,工程、数学与物理科学学院,埃克塞特大学) Department of Power Engineering, Jadavpur University(动力工程系,贾瓦德普大学)

AI总结 本文提出新的主导极点放置PID控制器设计方法,用于处理具有时间延迟的二阶过程。通过三阶Pade近似约束闭环主导和非主导极点位置,分析不同非主导极点类型对稳定性区域的影响。

Comments 50 pages, 42 figures, Knowledge-Based Systems, 2018

详情
AI中文摘要

本文推导了新的基于主导极点放置的PID控制器设计公式,用于处理具有时间延迟的二阶过程(SOPTD)。之前已尝试在无延迟系统中进行极点放置。时间延迟项在Pade近似中表现为具有可变数量交错极点和零点的高阶系统,这使得精确极点放置控制变得困难。本文报告了使用三阶Pade近似来约束闭环主导和非主导极点在复数s平面上的解析表达式。然而,通过增加Pade阶数验证了不同时间延迟近似对闭环性能的不变性,代表了更接近现实的高阶延迟动态。非主导极点的性质(如全部为复数、实数或组合)会影响特征方程并影响可实现的稳定性区域。不同类型的非主导极点及其对应的稳定性区域对九个测试台过程的影响被获得,这些过程表现出不同的开环阻尼比和滞后到延迟比。接下来,通过蒙特卡洛模拟研究不同表达式在设计参数空间中产生更宽稳定性区域的效果。随后,通过成千上万的蒙特卡洛模拟研究了各种时域和频域控制性能参数及其在不确定过程参数下的偏差,围绕每个测试台过程的鲁棒稳定解。

英文摘要

This paper derives new formulations for designing dominant pole placement based proportional-integral-derivative (PID) controllers to handle second order processes with time delays (SOPTD). Previously, similar attempts have been made for pole placement in delay-free systems. Upon Pade approximation, the time delay term manifests itself as a higher order system with a variable number of interlaced poles and zeros, which makes it difficult to achieve precise pole placement control. We report here the analytical expressions to constrain the closed loop dominant and non-dominant poles at the desired locations in the complex s-plane, using a third order Pade approximation for the delay term. The invariance of the closed loop performance under different time delay approximations has also been verified using increasing orders of Pade approximation, which represent higher order delay dynamics closer to reality. The choice of the nature of the non-dominant poles, e.g. all complex, all real, or a combination of both, modifies the characteristic equation and influences the achievable stability regions. The effect of different types of non-dominant poles and the corresponding stability regions are obtained for nine test-bench processes with different levels of open-loop damping and lag to delay ratio. Next, we investigate which expression yields a wider stability region in the design parameter space, using Monte Carlo simulations that uniformly sample a chosen design parameter space. Various time and frequency domain control performance parameters are then investigated, as well as their deviations under uncertain process parameters, using thousands of Monte Carlo simulations around the robust stable solution for each of the nine test-bench processes.
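
The third order Pade approximation mentioned in the abstract replaces the delay term exp(−Ls) by a ratio of two cubic polynomials. The following minimal sketch evaluates the standard (3/3) Pade approximant and compares it with the exact delay on the imaginary axis. The function name and the chosen values of `L` and the frequency are assumptions for illustration; the controller design itself is not reproduced here.

```python
import cmath

def pade3_delay(s, L):
    """Third-order (3/3) Pade approximant of the delay term exp(-L*s)."""
    x = L * s
    num = 1 - x / 2 + x**2 / 10 - x**3 / 120
    den = 1 + x / 2 + x**2 / 10 + x**3 / 120
    return num / den

L = 1.0          # delay in seconds (illustrative value)
s = 1j * 0.5     # point on the imaginary axis, omega = 0.5 rad/s
approx = pade3_delay(s, L)
exact = cmath.exp(-L * s)
error = abs(approx - exact)
```

The approximant has three poles and three zeros, mirrored across the imaginary axis, so it is all-pass like the true delay; these extra poles and zeros are what complicate pole placement for the delayed process.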

1701.08092 2026-06-04 cs.CV cs.NA math.NA 版本更新

Double-sided probing by map of Asplund's distances using Logarithmic Image Processing in the framework of Mathematical Morphology

通过使用对数图像处理的Asplund距离映射实现双面探测

Guillaume Noyel, Michel Jourlin

发表机构 * International Prevention Research Institute(国际预防研究所) Lab. H. Curien(H. Curien实验室) UMR CNRS 5516(CNRS 5516联合研究单位)

AI总结 本文在数学形态学框架下,利用对数图像处理的标量乘法建立数学形态学与探针与灰度函数之间Asplund距离映射的联系,并通过实例展示该方法在模式匹配中的应用。

Comments The final publication is available at link.springer.com

详情
Journal ref
13th International Symposium on Mathematical Morphology, ISMM 2017, May 2017, Fontainebleau, France. Springer International Publishing, pp.408-420, 2017, Mathematical Morphology and Its Applications to Signal and Image Processing: 13th International Symposium, ISMM 2017, Fontainebleau, France, May 15--17, 2017, Proceedings. http://cmm.ensmp.fr/ismm2017/
AI中文摘要

我们通过使用对数图像处理的标量乘法,建立了数学形态学与探针和灰度函数之间Asplund距离映射之间的联系。我们证明该映射是函数通过结构函数(即探针)进行膨胀和腐蚀的比值的对数。膨胀和腐蚀是将图像的格映射到正函数的格中的映射。使用平坦的结构元素,可以通过图像的膨胀和腐蚀来简化Asplund距离映射的表达,这些映射仍保留在图像的格中。我们通过一个使用非平坦结构函数的模式匹配示例来展示我们的方法。

英文摘要

We establish the link between Mathematical Morphology and the map of Asplund's distances between a probe and a grey scale function, using the Logarithmic Image Processing scalar multiplication. We demonstrate that the map is the logarithm of the ratio between a dilation and an erosion of the function by a structuring function: the probe. The dilations and erosions are mappings from the lattice of the images into the lattice of the positive functions. Using a flat structuring element, the expression of the map of Asplund's distances can be simplified with a dilation and an erosion of the image; these mappings stay in the lattice of the images. We illustrate our approach with an example of pattern matching with a non-flat structuring function.

1610.06781 2026-06-04 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

Modular Deep Q Networks for Sim-to-real Transfer of Visuo-motor Policies

模块化深度Q网络用于视觉-运动策略的仿真到现实迁移

Fangyi Zhang, Jürgen Leitner, Michael Milford, Peter Corke

发表机构 * Australian Centre for Robotic Vision (ACRV)(澳大利亚机器人视觉中心) Queensland University of Technology (QUT)(昆士兰理工大学)

AI总结 本文提出模块化深度强化学习方法,通过在感知与控制之间引入瓶颈,实现仿真到现实的迁移,提升机器人视觉-运动协调能力。

Comments Australasian Conference on Robotics and Automation (ACRA) 2017, Student Paper Award Finalist

详情
Journal ref
The proceedings of the Australasian Conference on Robotics and Automation (ACRA) 2017
AI中文摘要

尽管深度学习在计算机视觉中因大量视觉数据而取得显著成功,但为机器人学习收集足够大的现实世界数据集成本较高。为提高这些技术在真实机器人上的实用性,我们提出了一种模块化深度强化学习方法,能够将仿真训练的模型迁移到现实世界机器人任务中。我们引入了感知与控制之间的瓶颈,使网络能够独立训练,然后在端到端方式下合并和微调,以进一步提高视觉-运动协调性。在经典的平面视觉引导机器人抓取任务中,微调后的准确度达到1.6像素,显著优于直接迁移(17.5像素),显示出在更复杂和广泛的应用中的潜力。我们的方法提供了一种更高效学习和迁移视觉-运动策略的技术,无需完全依赖大规模现实世界机器人数据集。

英文摘要

While deep learning has had significant successes in computer vision thanks to the abundance of visual data, collecting sufficiently large real-world datasets for robot learning can be costly. To increase the practicality of these techniques on real robots, we propose a modular deep reinforcement learning method capable of transferring models trained in simulation to a real-world robotic task. We introduce a bottleneck between perception and control, enabling the networks to be trained independently, but then merged and fine-tuned in an end-to-end manner to further improve hand-eye coordination. On a canonical, planar visually-guided robot reaching task a fine-tuned accuracy of 1.6 pixels is achieved, a significant improvement over naive transfer (17.5 pixels), showing the potential for more complicated and broader applications. Our method provides a technique for more efficient learning and transfer of visuo-motor policies for real robotic systems without relying entirely on large real-world robot datasets.

1503.05528 2026-06-04 cs.CV cs.MM cs.NA eess.IV math.NA 版本更新

Video Inpainting of Complex Scenes

复杂场景的视频修复

Alasdair Newson, Andrés Almansa, Matthieu Fradet, Yann Gousseau, Patrick Pérez

发表机构 * Technicolor

AI总结 本文提出一种自动视频修复算法,通过优化全局基于补丁的功能实现复杂场景修复,提升修复效率与质量,无需手动输入,适用于高清晰度视频。

详情
Journal ref
SIAM Journal on Imaging Sciences, Society for Industrial and Applied Mathematics, 2014, 7 (4), pp.1993-2019
AI中文摘要

我们提出了一种自动视频修复算法,该算法依赖于对全局基于补丁的功能进行优化。我们的算法能够处理视频修复中出现的各种挑战性情况,如动态纹理的正确重建、多个移动物体和移动背景。此外,我们实现了比现有最佳方法快一个数量级的执行时间。我们还能在高清晰度视频上获得良好的修复质量。最后,我们提供了具体的算法细节,使实现我们的算法尽可能简单。所得到的算法不需要分割或手动输入,除了定义修复掩码外,能够处理比以前工作更广泛的场景。

英文摘要

We propose an automatic video inpainting algorithm which relies on the optimisation of a global, patch-based functional. Our algorithm is able to deal with a variety of challenging situations which naturally arise in video inpainting, such as the correct reconstruction of dynamic textures, multiple moving objects and moving background. Furthermore, we achieve this in an order of magnitude less execution time with respect to the state-of-the-art. We are also able to achieve good quality results on high definition videos. Finally, we provide specific algorithmic details to make implementation of our algorithm as easy as possible. The resulting algorithm requires no segmentation or manual input other than the definition of the inpainting mask, and can deal with a wider variety of situations than is handled by previous work. 1. Introduction. Advanced image and video editing techniques are increasingly common in the image processing and computer vision world, and are also starting to be used in media entertainment. One common and difficult task closely linked to the world of video editing is image and video "inpainting". Generally speaking, this is the task of replacing the content of an image or video with some other content which is visually pleasing. This subject has been extensively studied in the case of images, to such an extent that commercial image inpainting products destined for the general public are available, such as Photoshop's "Content Aware fill" [1]. However, while some impressive results have been obtained in the case of videos, the subject has been studied far less extensively than image inpainting. This relative lack of research can largely be attributed to high time complexity due to the added temporal dimension. Indeed, it has only very recently become possible to produce good quality inpainting results on high definition videos, and this only in a semi-automatic manner.
Nevertheless, high-quality video inpainting has many important and useful applications such as film restoration, professional post-production in cinema and video editing for personal use. For this reason, we believe that an automatic, generic video inpainting algorithm would be extremely useful for both academic and professional communities.

1711.11075 2026-06-04 math.NA cs.CV cs.NA 版本更新

A fast nonconvex Compressed Sensing algorithm for highly low-sampled MR images reconstruction

一种用于严重欠采样MR图像重建的快速非凸压缩感知算法

Damiana Lazzaro, Elena Loli Piccolomini, Fabiana Zama

发表机构 * Department of Mathematics, University of Bologna(博洛尼亚大学数学系)

AI总结 本文提出一种快速高效的MRI图像重建算法,通过非凸正则化目标函数和最小二乘数据拟合约束,解决严重欠采样数据的重建问题,证明了算法的收敛性。

详情
AI中文摘要

本文提出了一种从严重欠采样数据中快速高效重建磁共振图像(MRI)的方法。基于压缩感知理论,我们将该问题建模为约束最小化问题,其目标函数是依赖于一个参数的一族非凸正则化函数,约束为最小二乘数据拟合项。我们提出了一种名为快速非凸重加权(FNCR)的算法,它基于一种迭代方案,用凸线性化近似非凸问题,并自动更新惩罚参数。凸问题通过前向-后向过程求解,其中后向步骤通过分裂Bregman策略实现。此外,我们为其中出现的线性系统提出了一种新的高效迭代求解器。我们证明了所提出的FNCR方法的收敛性。在合成体模和真实图像上的结果表明,该算法表现优异且计算高效,即使与文献中表现最佳的方法相比也是如此。

英文摘要

In this paper we present a fast and efficient method for the reconstruction of Magnetic Resonance Images (MRI) from severely under-sampled data. Drawing on Compressed Sensing theory, we model the problem as a constrained minimization problem with a family of non-convex regularizing objective functions depending on a parameter and a least squares data fit constraint. We propose a fast and efficient algorithm, named Fast NonConvex Reweighting (FNCR), based on an iterative scheme in which the non-convex problem is approximated by its convex linearization and the penalization parameter is automatically updated. The convex problem is solved by a Forward-Backward procedure, where the Backward step is performed by a Split Bregman strategy. Moreover, we propose a new efficient iterative solver for the arising linear systems. We prove the convergence of the proposed FNCR method. The results on synthetic phantoms and real images show that the algorithm performs very well and is computationally efficient, even when compared to the best performing methods proposed in the literature.

1711.09867 2026-06-04 math.NA cs.CV cs.NA 版本更新

Accelerated Optimization in the PDE Framework: Formulations for the Active Contour Case

在PDE框架中实现加速优化:活动轮廓情况的公式化

Anthony Yezzi, Ganesh Sundaramoorthi

发表机构 * School of Electrical and Computer Engineering, Georgia Institute of Technology(佐治亚理工学院电气与计算机工程学院) Electrical Engineering, King Abdullah University of Science and Technology(阿卜杜拉国王科技大学电气工程)

AI总结 本文探讨了在PDE框架中利用加速优化方法提升参数估计性能,通过变分框架和Bregman散度推导连续极限ODE,并扩展至无限维流形,引入共进化质量模型连接最优质量传输的流体力学公式化。

详情
AI中文摘要

在Nesterov开创性工作的基础上,加速优化方法已被用于显著提升一阶梯度类参数估计的性能,适用于二阶优化策略不可用或不实际的场景。加速梯度下降不仅比传统梯度下降收敛快得多,而且通过先超调、随后在趋于最终状态的过程中振荡回落,对参数空间进行更稳健的局部搜索,从而只选择吸引域足够大、能够包含初始超调的局部极小点。这种行为使加速和随机梯度搜索方法在机器学习社区中特别受欢迎。在最近的PNAS 2016论文中,Wibisono、Wilson和Jordan展示了如何将广泛的一类加速方案纳入围绕Bregman散度构建的变分框架,从而得到连续极限ODE。我们展示了其公式如何进一步扩展到无限维流形(此处从曲线和曲面的几何空间开始):将Bregman散度替换为切空间上的内积,并显式引入一个在优化过程中与目标对象共同演化的分布式质量模型。这种共同演化的质量模型纯粹是为了赋予优化过程有益的动力学而引入的,它也将由此得到的一类基于PDE的加速优化方案与最优质量传输的流体力学表述联系起来。

英文摘要

Following the seminal work of Nesterov, accelerated optimization methods have been used to powerfully boost the performance of first-order, gradient-based parameter estimation in scenarios where second-order optimization strategies are either inapplicable or impractical. Not only does accelerated gradient descent converge considerably faster than traditional gradient descent, but it also performs a more robust local search of the parameter space by initially overshooting and then oscillating back as it settles into a final configuration, thereby selecting only local minimizers with a basin of attraction large enough to contain the initial overshoot. This behavior has made accelerated and stochastic gradient search methods particularly popular within the machine learning community. In their recent PNAS 2016 paper, Wibisono, Wilson, and Jordan demonstrate how a broad class of accelerated schemes can be cast in a variational framework formulated around the Bregman divergence, leading to continuum limit ODEs. We show how their formulation may be further extended to infinite-dimensional manifolds (starting here with the geometric space of curves and surfaces) by substituting the Bregman divergence with inner products on the tangent space and explicitly introducing a distributed mass model which evolves in conjunction with the object of interest during the optimization process. The co-evolving mass model, which is introduced purely for the sake of endowing the optimization with helpful dynamics, also links the resulting class of accelerated PDE-based optimization schemes to fluid dynamical formulations of optimal mass transport.
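
The speed-up of accelerated over plain gradient descent can be seen in finite dimensions with a minimal sketch. It compares gradient descent with Nesterov's constant-momentum scheme on an ill-conditioned quadratic. The test function, starting point, step size and iteration count are illustrative assumptions, and this is the classical finite-dimensional method, not the PDE-based active contour formulation of the paper.

```python
import math

# Hypothetical test problem: f(x) = 0.5 * sum(c_i * x_i^2), condition number 100.
CURVATURES = [1.0, 100.0]


def objective(x):
    return 0.5 * sum(c * v * v for c, v in zip(CURVATURES, x))


def gradient(x):
    return [c * v for c, v in zip(CURVATURES, x)]


def gradient_descent(x0, steps):
    """Plain gradient descent with step size 1/L."""
    step = 1.0 / max(CURVATURES)
    x = list(x0)
    for _ in range(steps):
        g = gradient(x)
        x = [v - step * gv for v, gv in zip(x, g)]
    return x


def nesterov(x0, steps):
    """Nesterov's scheme for strongly convex functions (constant momentum)."""
    big_l, mu = max(CURVATURES), min(CURVATURES)
    step = 1.0 / big_l
    root_kappa = math.sqrt(big_l / mu)
    beta = (root_kappa - 1.0) / (root_kappa + 1.0)
    x = list(x0)
    y = list(x0)  # look-ahead point at which the gradient is evaluated
    for _ in range(steps):
        g = gradient(y)
        x_next = [v - step * gv for v, gv in zip(y, g)]
        # Momentum: extrapolate along the direction of the last move.
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_next, x)]
        x = x_next
    return x


start = [1.0, 1.0]
f_plain = objective(gradient_descent(start, 100))
f_accel = objective(nesterov(start, 100))
print(f"plain: {f_plain:.3e}  accelerated: {f_accel:.3e}")
```

After the same number of iterations the accelerated iterate sits much closer to the minimizer along the low-curvature direction, which is where plain gradient descent is slow.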

1310.7443 2026-06-04 cs.CV cs.NA math.NA 版本更新

On Convergent Finite Difference Schemes for Variational - PDE Based Image Processing

关于变分-PDE基图像处理的收敛有限差分方案

V. B. S. Prasath, Juan C. Moreno

AI总结 本文提出一种自适应各向异性Huber函数图像修复方案,结合L2-L1正则化函数,通过Split Bregman方法实现图像去噪与边缘保持,实验表明该算法具有最佳收敛性。

Comments 23 pages, 12 figures, 2 tables

详情
Journal ref
Computational and Applied Mathematics, 2017
AI中文摘要

我们研究了一种基于自适应各向异性Huber泛函的图像复原方案。通过结合L2-L1正则化函数,基于自适应Huber泛函的能量最小化模型能够对含噪数字图像进行去噪并保持边缘。我们研究了一种基于连续分段线性函数的收敛有限差分格式,并使用变量分裂方案,即Split Bregman,以获得离散极小点。文中给出了图像去噪的实验结果,与加性算子分裂、对偶不动点和投影梯度方案的比较表明,我们的算法获得了最佳的收敛速率。

英文摘要

We study an adaptive anisotropic Huber functional based image restoration scheme. By using a combination of L2-L1 regularization functions, an adaptive Huber functional based energy minimization model provides denoising with edge preservation in noisy digital images. We study a convergent finite difference scheme based on continuous piecewise linear functions and use a variable splitting scheme, namely the Split Bregman, to obtain the discrete minimizer. Experimental results on image denoising are given, and comparisons with additive operator splitting, dual fixed point, and projected gradient schemes illustrate that our algorithm attains the best convergence rates.
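
The L2-L1 combination mentioned above can be illustrated with the standard scalar Huber function, which is quadratic for small arguments and linear for large ones. This is a minimal sketch of the textbook definition only; the threshold name `k` is an assumption, and the adaptive anisotropic functional studied in the paper is not reproduced here.

```python
def huber(t, k):
    """Standard Huber function: 0.5*t^2 if |t| <= k, else k*|t| - 0.5*k^2."""
    a = abs(t)
    if a <= k:
        return 0.5 * t * t        # L2 regime: smooths small (noise-like) gradients
    return k * a - 0.5 * k * k    # L1 regime: penalizes large (edge-like) gradients less


small = huber(0.5, 1.0)   # quadratic branch
large = huber(3.0, 1.0)   # linear branch
print(small, large)
```

The two branches agree at |t| = k, so the function is continuous there, and large gradients at edges are penalized only linearly.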

1703.09499 2026-06-04 cs.CV cs.NA math.NA 版本更新

Locality preserving projection on SPD matrix Lie group: algorithm and analysis

局部保持投影在SPD矩阵李群上的应用:算法与分析

Yangyang Li, Ruqian Lu

AI总结 本文提出在SPD矩阵李群上进行降维的算法,通过局部保持投影思想构建Laplacian矩阵,有效处理高维SPD矩阵,提升人脸识别和动作识别性能。

Comments 15 pages, 3 tables

详情
AI中文摘要

在图像识别中用作特征描述符的对称正定(SPD)矩阵通常是高维的。传统流形学习仅适用于对高维向量形式的数据降维。对于高维SPD矩阵,无法直接使用流形学习算法对矩阵形式的数据降维。SPD矩阵必须首先转换为长向量,然后再对该向量降维。然而,这种方法破坏了SPD矩阵空间的空间结构。为克服这一限制,我们提出了一种新的在SPD矩阵空间上的降维算法,将高维SPD矩阵转换为低维SPD矩阵。我们的工作基于所有相同大小的SPD矩阵构成的集合具有李群结构这一事实,并旨在将流形学习迁移到SPD矩阵李群上。我们使用局部保持投影(LPP)这一流形学习算法的基本思想,在SPD矩阵李群上构建相应的Laplacian矩阵。因此,我们称我们的方法为Lie-LPP以强调其李群特性。我们给出了详细的算法分析,并通过实验表明Lie-LPP在人体动作识别和人脸识别上取得了有效的结果。

英文摘要

Symmetric positive definite (SPD) matrices used as feature descriptors in image recognition are usually high dimensional. Traditional manifold learning is only applicable for reducing the dimension of high-dimensional vector-form data. For high-dimensional SPD matrices, directly using manifold learning algorithms to reduce the dimension of matrix-form data is impossible. The SPD matrix must first be transformed into a long vector, and then the dimension of this vector must be reduced. However, this approach breaks the spatial structure of the SPD matrix space. To overcome this limitation, we propose a new dimension reduction algorithm on SPD matrix space to transform high-dimensional SPD matrices into low-dimensional SPD matrices. Our work is based on the fact that the set of all SPD matrices with the same size has a Lie group structure, and we aim to transform the manifold learning to the SPD matrix Lie group. We use the basic idea of the manifold learning algorithm called locality preserving projection (LPP) to construct the corresponding Laplacian matrix on the SPD matrix Lie group. Thus, we call our approach Lie-LPP to emphasize its Lie group character. We present a detailed algorithm analysis and show through experiments that Lie-LPP achieves effective results on human action recognition and human face recognition.

1510.02975 2026-06-04 math.OC cs.CV cs.DC cs.NA cs.SY eess.SY math.NA 版本更新

Optimal Piecewise Linear Function Approximation for GPU-based Applications

基于GPU的应用最优分段线性函数逼近

Daniel Berjón, Guillermo Gallego, Carlos Cuevas, Francisco Morán, Narciso García

AI总结 本文提出一种高效方法,通过最优设计的分段线性近似,提高复杂连续函数的实时计算效率,尤其在GPU上表现优异。

Comments 12 pages, 12 figures, post-print, IEEE Transactions on Cybernetics, Oct. 2015

详情
Journal ref
IEEE Transactions on Cybernetics, vol. 46, no. 11, pp. 2584-2595, Nov. 2016
AI中文摘要

近年来,许多计算机视觉和人机交互应用需要评估复杂的连续数学函数作为关键步骤。然而,严格评估此类函数通常计算成本高,无法满足实时应用需求。为此,函数常被近似为更简单的分段多项式表示。本文提出一种新的高效技术,通过近优化设计的两种分段线性近似,在大量评估子区间预算下评估复杂连续函数。我们开发了详尽的误差分析,提供渐近紧界,准确量化两种表示的近似性能。该方法改进了之前的误差估计,允许用户在近似误差和评估子区间数量之间进行权衡。为保证实时运行,该方法适用于但不仅限于现代图形处理单元(GPU)的高效实现,其中通过利用其纹理单元中的固定函数插值程序,优于以往的替代方法。本文提出的方法适用于任何需要评估连续函数的应用,我们详细测试了其质量和效率在多个函数上的表现,特别是高斯函数,因其在许多计算机视觉和控制领域中被广泛使用且计算成本高。

英文摘要

Many computer vision and human-computer interaction applications developed in recent years need to evaluate complex and continuous mathematical functions as an essential step toward proper operation. However, rigorous evaluation of such functions often implies a very high computational cost, unacceptable in real-time applications. To alleviate this problem, functions are commonly approximated by simpler piecewise-polynomial representations. Following this idea, we propose a novel, efficient, and practical technique to evaluate complex and continuous functions using a nearly optimal design of two types of piecewise linear approximations in the case of a large budget of evaluation subintervals. To this end, we develop a thorough error analysis that yields asymptotically tight bounds to accurately quantify the approximation performance of both representations. It provides an improvement upon previous error estimates and allows the user to control the trade-off between the approximation error and the number of evaluation subintervals. To guarantee real-time operation, the method is suitable for, but not limited to, an efficient implementation in modern Graphics Processing Units (GPUs), where it outperforms previous alternative approaches by exploiting the fixed-function interpolation routines present in their texture units. The proposed technique is a perfect match for any application requiring the evaluation of continuous functions. We have measured in detail its quality and efficiency on several functions, in particular the Gaussian function, because it is extensively used in many areas of computer vision and cybernetics and is expensive to evaluate.
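
The trade-off between approximation error and the number of subintervals can be sketched with the simplest variant: linear interpolation of a Gaussian on uniformly spaced nodes. The uniform partition, the interval [-4, 4] and all function names below are assumptions made for illustration; the paper's nearly optimal design is not reproduced. For linear interpolation with spacing h the classical bound is h^2/8 times the maximum of |f''|, which equals 1 for exp(-x^2/2).

```python
import math


def gaussian(x):
    return math.exp(-0.5 * x * x)


def build_table(f, a, b, n):
    """Sample f at n + 1 uniformly spaced nodes on [a, b]."""
    h = (b - a) / n
    return [f(a + i * h) for i in range(n + 1)]


def eval_pwl(table, a, b, x):
    """Evaluate the piecewise linear interpolant of the table at x in [a, b]."""
    n = len(table) - 1
    t = (x - a) / (b - a) * n
    i = min(max(int(t), 0), n - 1)   # index of the subinterval containing x
    frac = t - i
    return table[i] + frac * (table[i + 1] - table[i])


def max_error(f, table, a, b, samples=10001):
    """Largest absolute error on a dense uniform sample of [a, b]."""
    worst = 0.0
    for j in range(samples):
        x = a + (b - a) * j / (samples - 1)
        worst = max(worst, abs(f(x) - eval_pwl(table, a, b, x)))
    return worst


A, B = -4.0, 4.0
err_64 = max_error(gaussian, build_table(gaussian, A, B, 64), A, B)
err_128 = max_error(gaussian, build_table(gaussian, A, B, 128), A, B)
print(f"64 subintervals: {err_64:.2e}, 128 subintervals: {err_128:.2e}")
```

Doubling the number of subintervals roughly quarters the error, so a table size can be chosen from a target accuracy. The lookup in `eval_pwl` is the same linear blend between two neighbouring samples that a texture unit performs with linear filtering.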

1711.02857 2026-06-04 cs.LG cs.AI cs.CV cs.NA math.NA stat.ML 版本更新

Learning Sparse Visual Representations with Leaky Capped Norm Regularizers

通过泄漏受限范数正则化器学习稀疏视觉表示

Jianqiao Wangni, Dahua Lin

AI总结 本文提出泄漏受限范数正则化器,用于学习过完备视觉表示,证明了其在3D形状恢复中的收敛性,优于ℓ1和非凸正则化方法。

详情
AI中文摘要

诱导稀疏性的正则化是学习过完备视觉表示的重要组成部分。尽管ℓ1正则化广受欢迎,本文研究了非凸正则化在该问题中的应用。我们的贡献包括三个部分:首先,我们提出了泄漏受限范数正则化器(LCNR),允许模型权重低于一定阈值的部分被更强地正则化,从而实现强稀疏性,仅引入可控的估计偏差。我们提出了一种主要化-最小化算法来优化联合目标函数。其次,我们的研究显示,在单目3D形状恢复和神经网络中,LCNR优于ℓ1和其他非凸正则化方法,实现了最先进的性能和更快的收敛速度。第三,我们证明了在3D恢复问题上的理论全局收敛速度。到目前为止,这是首次对3D恢复问题的收敛性分析。

英文摘要

Sparsity-inducing regularization is an important part of learning over-complete visual representations. Despite the popularity of $\ell_1$ regularization, in this paper we investigate the usage of non-convex regularizations in this problem. Our contribution consists of three parts. First, we propose the leaky capped norm regularization (LCNR), which allows model weights below a certain threshold to be regularized more strongly than those above it, thereby imposing strong sparsity while introducing only controllable estimation bias. We propose a majorization-minimization algorithm to optimize the joint objective function. Second, in our study of monocular 3D shape recovery and neural networks, LCNR outperforms $\ell_1$ and other non-convex regularizations, achieving state-of-the-art performance and faster convergence. Third, we prove a theoretical global convergence speed on the 3D recovery problem. To the best of our knowledge, this is the first convergence analysis of the 3D recovery problem.

1710.00620 2026-06-04 cs.CV cs.NA math.NA 版本更新

Out-of-focus Blur: Image De-blurring

失焦模糊:图像去模糊

Yuzhen Lu

AI总结 本文研究通过模拟研究解决因失焦模糊导致的图像去模糊问题,采用正则化方法和共轭梯度法提升去模糊效果,提出最优参数选择策略。

Comments 11 pages

详情
AI中文摘要

图像去模糊在许多用相机对真实场景或物体成像的情形中十分重要。本项目通过模拟研究,针对因失焦模糊而失真的图像进行去模糊处理。首先探索伪逆滤波器,但因严重的噪声放大而失败。随后采用Tikhonov正则化方法,相比伪逆滤波器有显著改进。在Tikhonov正则化中,正则化参数的选择对获得高质量图像起关键作用,正则化解具有半收敛性质。当使用给定的偏差原理(discrepancy principle)确定最优值时取得最佳结果,相对恢复误差为8.49%。此外,采用共轭梯度法这一迭代方法进行图像去模糊,计算速度快且结果更优,相对恢复误差为8.22%。迭代次数在CG中充当正则化参数,迭代解也具有半收敛性质。

英文摘要

Image de-blurring is important in many cases of imaging a real scene or object by a camera. This project focuses on de-blurring an image distorted by an out-of-focus blur through a simulation study. A pseudo-inverse filter is first explored but it fails because of severe noise amplification. Then Tikhonov regularization methods are employed, which produce greatly improved results compared to the pseudo-inverse filter. In Tikhonov regularization, the choice of the regularization parameter plays a critical role in obtaining a high-quality image, and the regularized solutions possess a semi-convergence property. The best result, with a relative restoration error of 8.49%, is achieved when the prescribed discrepancy principle is used to decide an optimal value. Furthermore, an iterative method, Conjugate Gradient (CG), is employed for image de-blurring, which is fast in computation and leads to an even better result with a relative restoration error of 8.22%. The number of iterations in CG acts as a regularization parameter, and the iterates have a semi-convergence property as well.
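
The noise amplification of the pseudo-inverse filter, and the way Tikhonov regularization damps it, can be sketched in the basis that diagonalizes the blur (the Fourier basis for a circular convolution). There each coefficient obeys y_i = s_i x_i + e_i, the pseudo-inverse divides by s_i, and Tikhonov regularization of ||Ax - y||^2 + lam ||x||^2 multiplies by s_i / (s_i^2 + lam). The spectrum, noise values and `lam` below are made-up numbers for illustration, not data from the study.

```python
import math


def pseudo_inverse(singular_values, data):
    """Naive inversion: divides by every singular value, however small."""
    return [d / s for s, d in zip(singular_values, data)]


def tikhonov(singular_values, data, lam):
    """Tikhonov-filtered inversion: s / (s^2 + lam) replaces 1 / s."""
    return [s * d / (s * s + lam) for s, d in zip(singular_values, data)]


def relative_error(estimate, truth):
    num = math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)))
    den = math.sqrt(sum(t * t for t in truth))
    return num / den


# Hypothetical decaying spectrum of a blur operator and a small perturbation.
singular_values = [1.0, 0.5, 0.1, 0.01, 0.001]
truth = [1.0, 1.0, 1.0, 1.0, 1.0]
noise = [0.01, -0.01, 0.01, -0.01, 0.01]
data = [s * x + e for s, x, e in zip(singular_values, truth, noise)]

err_pinv = relative_error(pseudo_inverse(singular_values, data), truth)
err_tik = relative_error(tikhonov(singular_values, data, lam=0.01), truth)
print(f"pseudo-inverse: {err_pinv:.2f}  Tikhonov: {err_tik:.2f}")
```

The smallest singular values turn a perturbation of size 0.01 into errors larger than the signal itself, while the Tikhonov filter suppresses exactly those components at the cost of some bias. Choosing `lam` is the parameter-selection problem that the discrepancy principle addresses.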

1710.06232 2026-06-04 cs.CV cs.SY eess.SY 版本更新

Analysis of feature detector and descriptor combinations with a localization experiment for various performance metrics

基于多种性能指标的特征检测器与描述符组合分析:定位实验

Ertugrul Bayraktar, Pinar Boyraz

AI总结 本文通过移动机器人室内实验,比较不同特征检测器与描述符组合在图像匹配中的性能,分析不同组合在精度、时间、角度差等五项指标下的表现。

Comments 11 pages, 3 figures, 1 table

详情
Journal ref
Turkish Journal of Electrical Engineering & Computer Sciences, (2017) 25: 2444 - 2454
AI中文摘要

本研究旨在提供特征检测器/描述符方法的详细性能比较,特别是当其各种组合用于图像匹配时的表现。通过移动机器人在室内环境中的定位实验作为案例研究,使用3090张查询图像和127张数据集图像。研究包括五种特征检测器方法(FAST、ORB、SURF、SIFT、BRISK)和五种特征描述符方法(BRIEF、BRISK、SIFT、SURF、ORB)。这些方法在23种不同组合中使用,通过本研究定义的性能标准获得有意义且一致的比较结果。所有方法作为独立的特征检测器或描述符分别使用。性能分析展示了各种检测器和描述符组合的判别能力。分析使用五个参数:(i)准确性,(ii)时间,(iii)关键点之间的角度差,(iv)正确匹配的数量,(v)正确匹配关键点之间的距离。在60°范围内,覆盖系统五个旋转姿态点,FAST-SURF组合具有最低的距离和角度差值以及最高的匹配关键点数量。SIFT-SURF是准确度最高的组合,正确分类率为98.41%。最快的算法是ORB-BRIEF,匹配560张在运动中捕获的图像和127张数据集图像的总运行时间为21,303.30秒。

英文摘要

The purpose of this study is to provide a detailed performance comparison of feature detector/descriptor methods, particularly when their various combinations are used for image-matching. The localization experiments of a mobile robot in an indoor environment are presented as a case study. In these experiments, 3090 query images and 127 dataset images were used. This study includes five methods for feature detectors (features from accelerated segment test (FAST), oriented FAST and rotated binary robust independent elementary features (BRIEF) (ORB), speeded-up robust features (SURF), scale invariant feature transform (SIFT), and binary robust invariant scalable keypoints (BRISK)) and five other methods for feature descriptors (BRIEF, BRISK, SIFT, SURF, and ORB). These methods were used in 23 different combinations and it was possible to obtain meaningful and consistent comparison results using the performance criteria defined in this study. All of these methods were used independently and separately from each other as either feature detector or descriptor. The performance analysis shows the discriminative power of various combinations of detector and descriptor methods. The analysis is completed using five parameters: (i) accuracy, (ii) time, (iii) angle difference between keypoints, (iv) number of correct matches, and (v) distance between correctly matched keypoints. In a range of 60°, covering five rotational pose points for our system, the FAST-SURF combination had the lowest distance and angle difference values and the highest number of matched keypoints. SIFT-SURF was the most accurate combination with a 98.41% correct classification rate. The fastest algorithm was ORB-BRIEF, with a total running time of 21,303.30 s to match 560 images captured during motion with 127 dataset images.

1608.01431 2026-06-04 cs.CV cs.NA math.NA 版本更新

An efficient iterative thresholding method for image segmentation

一种高效的迭代阈值方法用于图像分割

Dong Wang, Haohan Li, Xiaoyu Wei, Xiaoping Wang

AI总结 本文提出了一种高效的迭代阈值方法用于多相图像分割,通过非局部多相能量近似轮廓长度,实现最优复杂度O(N log N)的高效分割。

Comments 14 pages, 21 figures

详情
AI中文摘要

我们提出了一种用于多相图像分割的高效迭代阈值方法。该算法基于最小化分段常数Mumford-Shah泛函,其中轮廓长度(或周长)由一个非局部多相能量近似。最小化问题通过迭代方法求解。每次迭代包括计算简单的卷积,随后进行一次阈值步骤。该算法易于实现,且每次迭代具有最优复杂度O(N log N)。我们还证明了该迭代算法具有总能量衰减性质。我们给出了一些数值结果以展示方法的效率。

英文摘要

We propose an efficient iterative thresholding method for multi-phase image segmentation. The algorithm is based on minimizing the piecewise constant Mumford-Shah functional, in which the contour length (or perimeter) is approximated by a non-local multi-phase energy. The minimization problem is solved by an iterative method. Each iteration consists of computing simple convolutions followed by a thresholding step. The algorithm is easy to implement and has the optimal complexity $O(N \log N)$ per iteration. We also show that the iterative algorithm has a total energy decaying property. We present some numerical results to show the efficiency of our method.
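
The "convolution followed by thresholding" loop can be sketched for two phases on a 1D signal. This is a simplified reading of the idea, not the authors' implementation: the Gaussian kernel width, the lumped `penalty` constant standing in for the length weight, and the replicated boundary handling are all assumptions, and the convolution is done directly rather than with an FFT, so this sketch does not reach O(N log N).

```python
import math


def gaussian_kernel(sigma, radius):
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_same(u, kernel):
    """Direct convolution with replicated borders (an FFT would be used at scale)."""
    r, n = len(kernel) // 2, len(u)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - r, 0), n - 1)
            acc += w * u[j]
        out.append(acc)
    return out


def segment_two_phase(f, labels, penalty=0.1, sigma=2.0, radius=6, max_iter=50):
    """Alternate region means, convolution of indicators, and thresholding."""
    labels = list(labels)
    kernel = gaussian_kernel(sigma, radius)
    for _ in range(max_iter):
        ones = [v for v, l in zip(f, labels) if l == 1]
        zeros = [v for v, l in zip(f, labels) if l == 0]
        if not ones or not zeros:
            break                                  # one phase vanished
        c1, c0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
        u1 = [float(l) for l in labels]
        u0 = [1.0 - v for v in u1]
        g1, g0 = convolve_same(u1, kernel), convolve_same(u0, kernel)
        new_labels = []
        for i, v in enumerate(f):
            # Fidelity to the phase mean plus a penalty for differing from neighbours.
            phi0 = (v - c0) ** 2 + penalty * g1[i]
            phi1 = (v - c1) ** 2 + penalty * g0[i]
            new_labels.append(1 if phi1 < phi0 else 0)   # thresholding step
        if new_labels == labels:
            break                                  # fixed point reached
        labels = new_labels
    return labels


# Step signal with a small alternating perturbation and a misplaced initial boundary.
signal = [(0.0 if i < 20 else 1.0) + 0.05 * (-1) ** i for i in range(40)]
initial = [0] * 10 + [1] * 30
result = segment_two_phase(signal, initial)
print(result)
```

Each pass only needs two phase means, two convolutions and a pointwise comparison, which is why the method is easy to implement.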

1511.06631 2026-06-04 math.NA cs.CV cs.NA math.OC 版本更新

Multi-Contrast MRI Reconstruction with Structure-Guided Total Variation

基于结构引导总变分的多对比度MRI重建

Matthias J. Ehrhardt, Marta M. Betcke

AI总结 本文提出基于结构引导的总变分方法,用于多对比MRI重建,通过结合结构先验知识提升重建质量,在标准指标上优于传统总变分方法。

Comments 18 pages, 16 figures

详情
AI中文摘要

磁共振成像(MRI)是一种多功能成像技术,可根据采集参数获得不同对比度。许多临床成像研究会采集多种对比度的MRI数据(如T1和T2加权图像),这使整个扫描过程非常耗时。由于所有这些图像显示相同的解剖结构,可以在重建时利用它们的相似性,从而省略不必要的测量。本文讨论了总变分的两种修改形式,分别基于(i)位置和(ii)方向,它们利用结构先验知识,并在没有结构知识的退化情形下退化为总变分。我们使用交替方向乘子法求解由此产生的凸最小化问题,该方法将正向算子与先验分离。对于两种先验,相应的近端算子都可以作为总变分对偶问题上快速梯度投影法的扩展来实现。我们在六个基于体模和真实MRI图像的数据集上测试了这些先验。在所有测试案例中,利用另一种对比度的结构信息,在峰值信噪比和结构相似性指数等标准指标上都优于使用总变分单独重建。此外,我们发现利用二维方向信息可生成边缘清晰的图像,优于仅使用边缘位置先验信息的重建结果。

英文摘要

Magnetic resonance imaging (MRI) is a versatile imaging technique that allows different contrasts depending on the acquisition parameters. Many clinical imaging studies acquire MRI data for more than one of these contrasts (for instance, T1- and T2-weighted images), which makes the overall scanning procedure very time-consuming. As all of these images show the same underlying anatomy, one can try to omit unnecessary measurements by taking the similarity into account during reconstruction. We will discuss two modifications of total variation, based on (i) location and (ii) direction, that take structural a priori knowledge into account and reduce to total variation in the degenerate case when no structural knowledge is available. We solve the resulting convex minimization problem with the alternating direction method of multipliers, which separates the forward operator from the prior. For both priors the corresponding proximal operator can be implemented as an extension of the fast gradient projection method on the dual problem for total variation. We tested the priors on six data sets that are based on phantoms and real MRI images. In all test cases exploiting the structural information from the other contrast yields better results than separate reconstruction with total variation in terms of standard metrics like peak signal-to-noise ratio and structural similarity index. Furthermore, we found that exploiting the two-dimensional directional information results in images with well-defined edges, superior to those reconstructed solely using a priori information about the edge location.

1710.00489 2026-06-04 cs.RO cs.AI cs.CV cs.NE cs.SY eess.SY 版本更新

SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Planning and Control

SE3-姿态网络:用于视觉-运动规划和控制的结构深度动力学模型

Arunkumar Byravan, Felix Leeb, Franziska Meier, Dieter Fox

AI总结 本文提出了一种基于结构深度动力学模型的深度视觉-运动控制方法,通过编码器-解码器结构学习低维姿态嵌入,实现场景分割和姿态预测,并在现实世界中实现了闭环控制。

Comments 8 pages, Initial submission to IEEE International Conference on Robotics and Automation (ICRA) 2018

详情
AI中文摘要

本文提出了一种基于结构深度动力学模型的深度视觉-运动控制方法。我们的深度动力学模型是一种SE3-Nets的变体,通过编码器-解码器结构学习低维姿态嵌入用于视觉-运动控制。与以往工作不同,我们的动力学模型是结构化的:给定一个输入场景,我们的网络明确学习分割显著部分并预测其姿态嵌入以及其运动作为姿态空间中的变化。我们通过一对相隔动作的点云训练我们的模型,并展示在仅提供帧间点对数据关联的监督下,我们的网络能够学习有意义的场景分割以及一致的姿态。我们进一步展示我们的模型可以直接在学习的低维姿态空间中用于闭环控制,其中动作通过最小化姿态空间中的误差使用基于梯度的方法计算,类似于传统模型驱动控制。我们展示了在模拟和现实世界中控制Baxter机器人从原始深度数据的结果,并与两种基线深度网络进行了比较。我们的方法在实时运行,实现了良好的场景动态预测,并在多个控制运行中优于基线方法。视频结果可在:https://rse-lab.cs.washington.edu/se3-structured-deep-ctrl/

英文摘要

In this work, we present an approach to deep visuomotor control using structured deep dynamics models. Our deep dynamics model, a variant of SE3-Nets, learns a low-dimensional pose embedding for visuomotor control via an encoder-decoder structure. Unlike prior work, our dynamics model is structured: given an input scene, our network explicitly learns to segment salient parts and predict their pose-embedding along with their motion modeled as a change in the pose space due to the applied actions. We train our model using a pair of point clouds separated by an action and show that given supervision only in the form of point-wise data associations between the frames our network is able to learn a meaningful segmentation of the scene along with consistent poses. We further show that our model can be used for closed-loop control directly in the learned low-dimensional pose space, where the actions are computed by minimizing error in the pose space using gradient-based methods, similar to traditional model-based control. We present results on controlling a Baxter robot from raw depth data in simulation and in the real world and compare against two baseline deep networks. Our method runs in real-time, achieves good prediction of scene dynamics and outperforms the baseline methods on multiple control runs. Video results can be found at: https://rse-lab.cs.washington.edu/se3-structured-deep-ctrl/

1709.02641 2026-06-04 math.NA cs.CV cs.NA 版本更新

Completion of High Order Tensor Data with Missing Entries via Tensor-train Decomposition

基于张量链(tensor-train)分解的含缺失项高阶张量数据补全

Longhao Yuan, Qibin Zhao, Jianting Cao

AI总结 本文提出TT-WOPT算法,利用张量-列车分解解决高阶张量数据缺失问题,实验表明在高缺失率下性能优于其他方法。

Comments 8 pages, ICONIP 2017

详情
AI中文摘要

在本文中,我们旨在解决含缺失项的高阶张量数据的补全问题。现有的张量分解和补全方法在张量阶数N>>3时受到维数灾难的限制。为克服这一问题,我们提出了一种称为TT-WOPT(Tensor-train Weighted OPTimization,张量链加权优化)的高效算法,用于寻找张量数据的潜在核心张量并恢复缺失项。我们的算法采用了张量链分解,它具有强大的表示能力,且参数规模随张量阶数线性增长。在合成数据和自然图像补全上的实验结果表明,我们的方法显著优于其他相关方法。特别是当数据缺失率非常高时,例如85%到99%,我们的算法比其他最先进的算法表现好得多。

英文摘要

In this paper, we address the completion problem for high-order tensor data with missing entries. The existing tensor factorization and completion methods suffer from the curse of dimensionality when the tensor order $N \gg 3$. To overcome this problem, we propose an efficient algorithm called TT-WOPT (Tensor-train Weighted OPTimization) to find the latent core tensors of tensor data and recover the missing entries. Tensor-train decomposition, which has powerful representation ability and scales linearly with the tensor order, is employed in our algorithm. The experimental results on synthetic data and natural image completion demonstrate that our method significantly outperforms the other related methods. In particular, when the missing rate of the data is very high, e.g., 85% to 99%, our algorithm achieves much better performance than other state-of-the-art algorithms.
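
The linear scaling with tensor order can be made concrete by counting parameters. In a tensor-train representation, entry (i_1, ..., i_N) is the product of N matrices, one slice per core, with boundary ranks equal to 1. The sketch below counts stored numbers and evaluates one entry; the uniform internal rank and the function names are illustrative assumptions, and the weighted optimization of TT-WOPT itself is not shown.

```python
def full_size(dims):
    """Number of entries of the full tensor."""
    total = 1
    for n in dims:
        total *= n
    return total


def tt_size(dims, rank):
    """Number of parameters of a tensor train with uniform internal rank."""
    ranks = [1] + [rank] * (len(dims) - 1) + [1]
    return sum(ranks[k] * dims[k] * ranks[k + 1] for k in range(len(dims)))


def tt_entry(cores, index):
    """Entry of the tensor: product of the selected core slices.

    cores[k][i] is the r_{k-1} x r_k matrix (nested lists) for mode index i.
    """
    row = [1.0]
    for core, i in zip(cores, index):
        mat = core[i]
        row = [sum(row[a] * mat[a][b] for a in range(len(row)))
               for b in range(len(mat[0]))]
    return row[0]


dims = [4] * 10
print(full_size(dims), tt_size(dims, rank=3))

# Rank-1 example: the tensor is the outer product of three vectors.
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cores = [[[[v]] for v in vec] for vec in vectors]
value = tt_entry(cores, (1, 0, 1))
print(value)
```

Adding one more mode of size 4 multiplies the full tensor by 4 but adds only a fixed number of parameters to the train, which is the sense in which the representation avoids the curse of dimensionality.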

1709.01237 2026-06-04 cs.CV cs.LG cs.NA math.NA 版本更新

Newton-type Methods for Inference in Higher-Order Markov Random Fields

牛顿型方法在高阶马尔可夫随机场推断中的应用

Hariprasad Kannan, Nikos Komodakis, Nikos Paragios

AI总结 本文研究了在高阶马尔可夫随机场推断中使用牛顿型方法求解拉格朗日对偶问题的益处,提出了一种收敛性可证且高效的框架,包含Hessian矩阵构建的计算复杂度与精度的平衡策略、阻尼策略、截断策略与通用预条件器的结合,以及稀疏团势能的高效求和-乘积计算。

Comments 10 pages, 3 figures, 3 tables, CVPR 2017

详情
Journal ref
Poster at IEEE International Conference on Computer Vision and Pattern Recognition 2017
AI中文摘要

线性规划松弛是离散马尔可夫随机场MAP推断中的核心方法。正确求解拉格朗日对偶问题的能力是此类方法的关键组成部分。本文研究了使用牛顿型方法求解平滑版本问题的拉格朗日对偶问题的益处。我们探讨了其在实现更优收敛行为和更好地处理公式中的病态性质方面的能力,与一阶方法相比。我们证明了确实可以高效地应用信任区域牛顿方法,以解决广泛MAP推断问题。本文提出了一种可证收敛且高效的框架,包括(i)在Hessian矩阵构建方面计算复杂度和精度之间的良好平衡,(ii)一种有助于高效优化的阻尼策略,(iii)一种与通用共轭梯度预条件器结合的截断策略,(iv)稀疏团势能的高效求和-乘积计算。高阶马尔可夫随机场的结果展示了这种方法的潜力。

英文摘要

Linear programming relaxations are central to MAP inference in discrete Markov Random Fields. The ability to properly solve the Lagrangian dual is a critical component of such methods. In this paper, we study the benefit of using Newton-type methods to solve the Lagrangian dual of a smooth version of the problem. We investigate their ability to achieve superior convergence behavior and to better handle the ill-conditioned nature of the formulation, as compared to first-order methods. We show that it is indeed possible to efficiently apply a trust region Newton method for a broad range of MAP inference problems. In this paper we propose a provably convergent and efficient framework that includes (i) an excellent compromise between computational complexity and precision concerning the Hessian matrix construction, (ii) a damping strategy that aids efficient optimization, (iii) a truncation strategy coupled with a generic pre-conditioner for Conjugate Gradients, (iv) efficient sum-product computation for sparse clique potentials. Results for higher-order Markov Random Fields demonstrate the potential of this approach.

1611.02862 2026-06-04 cs.CV cs.NA math.NA 版本更新

The Little Engine that Could: Regularization by Denoising (RED)

那辆小引擎也能:通过去噪进行正则化(RED)

Yaniv Romano, Michael Elad, Peyman Milanfar

AI总结 本文提出了一种更强大灵活的框架,通过去噪引擎定义逆问题的正则化,以提升图像去模糊和超分辨率的性能。

详情
AI中文摘要

图像去噪是图像处理中广泛研究的问题。确实,最近高级且高效的去噪算法的出现使一些人相信现有的方法在去噪性能上已接近极限。我们能否利用这一显著成就来处理图像处理中的其他任务?最近的工作对此问题给出了肯定的回答,形式为Plug-and-Play Prior($P^3$)方法,表明通过依次应用图像去噪步骤可以处理任何逆问题。这严重依赖于ADMM优化技术,以获得这种连续去噪解释。这是否是图像处理任务中唯一能利用图像去噪引擎的方法?在本文中,我们提供了一种更强大、更灵活的框架来实现相同的目标。与$P^3$方法不同,我们提出了正则化通过去噪(RED):利用去噪引擎定义逆问题的正则化。我们提出了一种显式的图像自适应拉普拉斯基正则化函数,使整体目标函数更清晰且更明确。通过完全灵活地选择迭代优化过程来最小化上述函数,RED能够结合任何图像去噪算法,非常有效地处理一般逆问题,并保证收敛到全局最优解。我们测试了这种方法,并在图像去模糊和超分辨率问题中展示了最先进的结果。

英文摘要

Removal of noise from an image is an extensively studied problem in image processing. Indeed, the recent advent of sophisticated and highly effective denoising algorithms has led some to believe that existing methods are touching the ceiling in terms of noise removal performance. Can we leverage this impressive achievement to treat other tasks in image processing? Recent work has answered this question positively, in the form of the Plug-and-Play Prior ($P^3$) method, showing that any inverse problem can be handled by sequentially applying image denoising steps. This relies heavily on the ADMM optimization technique in order to obtain this chained denoising interpretation. Is this the only way in which tasks in image processing can exploit the image denoising engine? In this paper we provide an alternative, more powerful and more flexible framework for achieving the same goal. As opposed to the $P^3$ method, we offer Regularization by Denoising (RED): using the denoising engine in defining the regularization of the inverse problem. We propose an explicit image-adaptive Laplacian-based regularization functional, making the overall objective functional clearer and better defined. With complete flexibility to choose the iterative optimization procedure for minimizing the above functional, RED is capable of incorporating any image denoising algorithm, treating general inverse problems very effectively, and is guaranteed to converge to the globally optimal result. We test this approach and demonstrate state-of-the-art results in the image deblurring and super-resolution problems.
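
The idea of letting a denoiser define the regularizer can be sketched on a 1D signal. The sketch uses the regularizer 0.5 * x^T (x - f(x)), whose gradient is x - f(x) when the denoiser f satisfies the conditions required by the RED analysis; a symmetric linear smoother does. The moving-average denoiser, the identity forward operator, and the values of `lam`, `step` and `iterations` are assumptions chosen for illustration, not the setup used in the paper's experiments.

```python
def smooth(x):
    """Stand-in denoiser: 3-tap moving average with replicated borders."""
    n = len(x)
    return [0.25 * x[max(i - 1, 0)] + 0.5 * x[i] + 0.25 * x[min(i + 1, n - 1)]
            for i in range(n)]


def red_denoise(y, lam=2.0, step=0.2, iterations=50):
    """Gradient descent on 0.5*||x - y||^2 + 0.5*lam * x^T (x - f(x))."""
    x = list(y)
    for _ in range(iterations):
        fx = smooth(x)
        # Data-term gradient (x - y) plus regularizer gradient lam * (x - f(x)).
        x = [xi - step * ((xi - yi) + lam * (xi - fi))
             for xi, yi, fi in zip(x, y, fx)]
    return x


def roughness(x):
    """Sum of squared differences between neighbouring samples."""
    return sum((b - a) ** 2 for a, b in zip(x, x[1:]))


noisy = [0.0, 0.3, -0.2, 0.1, 1.2, 0.8, 1.1, 0.9]
restored = red_denoise(noisy)
print(roughness(noisy), roughness(restored))
```

The denoiser is only ever called as a black box inside the gradient, so any other denoising routine with the same signature could be swapped in, and any other first-order solver could replace the plain gradient descent loop.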

1708.07850 2026-06-04 cs.LG cs.CV cs.NA math.NA 版本更新

Structured Low-Rank Matrix Factorization: Global Optimality, Algorithms, and Applications

结构低秩矩阵分解:全局最优性、算法与应用

Benjamin D. Haeffele, Rene Vidal

AI总结 本文提出一种适用于大规模数据集的矩阵分解技术,通过特定正则化形式捕捉额外结构,证明在因子规模足够时局部极小值即为全局极小值,并展示在神经钙成像视频分割和高光谱压缩恢复中的优势。

详情
AI中文摘要

近年来,低秩矩阵分解问题凸形式在机器学习中受到广泛关注。然而,此类形式往往需要求解与数据矩阵同样大小的矩阵,难以应用于大规模数据集。此外,在许多应用中,数据可能表现出超越单纯低秩的结构,例如图像和视频呈现复杂的时空结构,而标准低秩方法大多忽略这些结构。本文研究了一种适用于大规模数据集的矩阵分解技术,通过特定形式的正则化捕捉额外结构,该正则化包括总变分和核范数等已知正则化器作为特例。尽管所得优化问题非凸,我们证明在因子规模足够时,若满足某些条件,则因子的任何局部极小值即为全局极小值。此外,本文还提供了几种实用算法来解决矩阵分解问题,并推导了近似解到全局最优解距离的界。神经钙成像视频分割和高光谱压缩恢复的示例展示了该方法在高维数据集中的优势。

英文摘要

Recently, convex formulations of low-rank matrix factorization problems have received considerable attention in machine learning. However, such formulations often require solving for a matrix of the size of the data matrix, making it challenging to apply them to large scale datasets. Moreover, in many applications the data can display structures beyond simply being low-rank, e.g., images and videos present complex spatio-temporal structures that are largely ignored by standard low-rank methods. In this paper we study a matrix factorization technique that is suitable for large datasets and captures additional structure in the factors by using a particular form of regularization that includes well-known regularizers such as total variation and the nuclear norm as particular cases. Although the resulting optimization problem is non-convex, we show that if the size of the factors is large enough, under certain conditions, any local minimizer for the factors yields a global minimizer. A few practical algorithms are also provided to solve the matrix factorization problem, and bounds on the distance from a given approximate solution of the optimization problem to the global optimum are derived. Examples in neural calcium imaging video segmentation and hyperspectral compressed recovery show the advantages of our approach on high-dimensional datasets.

1610.03819 2026-06-04 math.NA cs.CV cs.NA math.ST stat.TH 版本更新

Recursive Diffeomorphism-Based Regression for Shape Functions

基于递归微分同胚的形状函数回归

Jieren Xu, Haizhao Yang, Ingrid Daubechies

AI总结 本文提出一种基于递归微分同胚的回归方法,用于一维广义模式分解问题,旨在从叠加信号中提取各个广义模式。首先应用一维同步压缩变换估计瞬时信息,然后提出基于微分同胚和非参数回归的新方法估计波形函数。

详情
AI中文摘要

本文提出了一种基于递归微分同胚的回归方法,用于解决一维广义模式分解问题,目标是从叠加信号$\sum_{k=1}^K \alpha_k(t)s_k(2\pi N_k\phi_k(t))$中提取广义模式$\alpha_k(t)s_k(2\pi N_k\phi_k(t))$。首先,应用一维同步压缩变换估计瞬时信息,例如$\alpha_k(t)$和$N_k\phi_k(t)$。其次,提出一种基于微分同胚和非参数回归的新方法来估计波形函数$s_k(t)$。这两种方法共同构成了在弱分离条件下求解广义模式分解问题的框架。文中提供了合成数据和真实数据的数值示例,以展示这些方法的应用成效。

英文摘要

This paper proposes a recursive diffeomorphism-based regression method for the one-dimensional generalized mode decomposition problem, which aims at extracting generalized modes $\alpha_k(t)s_k(2\pi N_k\phi_k(t))$ from their superposition $\sum_{k=1}^K \alpha_k(t)s_k(2\pi N_k\phi_k(t))$. First, a one-dimensional synchrosqueezed transform is applied to estimate instantaneous information, e.g., $\alpha_k(t)$ and $N_k\phi_k(t)$. Second, a novel approach based on diffeomorphisms and nonparametric regression is proposed to estimate wave shape functions $s_k(t)$. These two methods lead to a framework for the generalized mode decomposition problem under a weak well-separation condition. Numerical examples of synthetic and real data are provided to demonstrate the fruitful applications of these methods.

1705.05065 2026-06-04 cs.RO cs.AI cs.CV cs.SY eess.SY 版本更新

AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles

AirSim:面向自动驾驶车辆的高保真视觉与物理模拟

Shital Shah, Debadeepta Dey, Chris Lovett, Ashish Kapoor

AI总结 本文提出基于Unreal引擎的AirSim模拟器,用于高效开发和测试自动驾驶算法,支持高频率物理模拟和多种协议,通过四旋翼实验验证其有效性。

Comments Accepted for Field and Service Robotics conference 2017 (FSR 2017)

详情
AI中文摘要

为自动驾驶车辆开发和测试算法在现实世界中成本高且耗时。为利用最新机器智能和深度学习进展,需收集大量标注训练数据。本文提出基于Unreal引擎的新模拟器,提供真实的物理和视觉模拟。模拟器包含可实现实时硬件在环(HITL)模拟的物理引擎,支持MavLink等流行协议。模拟器从零开始设计,可扩展以适应新车辆类型、硬件平台和软件协议。模块化设计使各组件可独立用于其他项目。通过实现四旋翼自动驾驶车辆并实验性比较软件组件与真实飞行,验证了模拟器的有效性。

英文摘要

Developing and testing algorithms for autonomous vehicles in the real world is an expensive and time-consuming process. Also, in order to utilize recent advances in machine intelligence and deep learning, we need to collect a large amount of annotated training data in a variety of conditions and environments. We present a new simulator built on Unreal Engine that offers physically and visually realistic simulations for both of these goals. Our simulator includes a physics engine that can operate at a high frequency for real-time hardware-in-the-loop (HITL) simulations with support for popular protocols (e.g. MavLink). The simulator is designed from the ground up to be extensible to accommodate new types of vehicles, hardware platforms and software protocols. In addition, the modular design enables various components to be easily usable independently in other projects. We demonstrate the simulator by first implementing a quadrotor as an autonomous vehicle and then experimentally comparing the software components with real-world flights.

1611.05963 2026-06-04 cs.CV cs.NA math.NA 版本更新

Reweighted Low-Rank Tensor Decomposition based on t-SVD and its Applications in Video Denoising

基于t-SVD的加权低秩张量分解及其在视频去噪中的应用

M. Baburaj, Sudhish N. George

AI总结 本文提出基于t-SVD的加权低秩张量分解方法,通过改进张量多秩和稀疏成分恢复,提升视频去噪性能。

Comments Algorithm 1 is inefficient since line 2 is processed $n_3$ times and needs to be changed. There are inconsistent notations throughout the manuscript. Unitary tensors are not defined.

详情
AI中文摘要

基于t-SVD的张量鲁棒主成分分析(TRPCA)通过同时最小化张量核范数和l1范数,将低秩多线性信号分解为低多秩和稀疏成分。但当信号多秩较大或噪声较多时,TRPCA性能下降。为解决此问题,本文提出一种新的高效迭代加权张量分解方案,显著提升TRPCA的张量多秩。此外,通过加权l1范数恢复张量稀疏成分,提高分解精度。通过应用于视频去噪问题,实验结果表明所提算法优于其他方法。

英文摘要

The t-SVD based Tensor Robust Principal Component Analysis (TRPCA) decomposes a low-rank multi-linear signal corrupted by gross errors into low multi-rank and sparse components by simultaneously minimizing the tensor nuclear norm and the $\ell_1$ norm. But if the multi-rank of the signal is considerably large and/or a large amount of noise is present, the performance of TRPCA deteriorates. To overcome this problem, this paper proposes a new efficient iterative reweighted tensor decomposition scheme based on t-SVD which significantly improves tensor multi-rank in TRPCA. Further, the sparse component of the tensor is also recovered by a reweighted $\ell_1$ norm, which enhances the accuracy of the decomposition. The effectiveness of the proposed method is established by applying it to the video denoising problem, and the experimental results reveal that the proposed algorithm outperforms its counterparts.
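
The reweighted $\ell_1$ idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it only shows a generic weighted soft-thresholding step followed by a common reweighting rule, w_i = 1 / (|x_i| + eps), on a flat list of numbers. The function names and the `lam` and `eps` parameters are assumptions made for illustration.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t (the proximal map of t * |x|)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def weighted_soft_threshold(values, weights, lam):
    """Apply the entrywise threshold lam * w_i to each value."""
    return [soft_threshold(v, lam * w) for v, w in zip(values, weights)]


def reweight(values, eps=1e-3):
    """Hypothetical reweighting rule: large entries get small weights."""
    return [1.0 / (abs(v) + eps) for v in values]


# One reweighting pass: threshold first, then recompute weights from the result.
x = [3.0, -0.5, 0.0]
w = [1.0, 1.0, 1.0]
x_new = weighted_soft_threshold(x, w, lam=1.0)
w_new = reweight(x_new, eps=1.0)
```

Under this rule, entries that survive thresholding receive smaller weights and are penalized less on the next pass, while entries driven to zero keep a large weight.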

1707.01530 2026-06-04 cs.CV cs.NA math.NA 版本更新

On the Fusion of Compton Scatter and Attenuation Data for Limited-view X-ray Tomographic Applications

在有限视角X射线断层成像应用中融合康普顿散射与衰减数据

Hamideh Rezaee, Brian Tracey, Eric L. Miller

AI总结 本文提出一种融合康普顿散射数据与传统衰减数据的方法,用于恢复材料密度和光电吸收,通过变分方法和正则化技术提升成像精度。

详情
AI中文摘要

本文演示了在有限视角X射线断层成像应用中,融合能量分辨的康普顿散射光子观测与传统衰减数据,用于联合恢复材料密度和光电吸收的实用性。我们首先开发了康普顿散射过程的物理和相关数值模型。利用该模型,我们提出了一种变分方法来恢复这两种材料属性。除了典型的数据保真项外,优化功能还包含对质量和光电系数的正则化。我们还考虑了质量密度情况下的新型边缘保持方法。为了帮助恢复光电信息,我们借鉴了最近的方法,并采用非局部正则化方案,利用质量密度更稳定成像的事实。模拟结果展示了同时使用散射光子数据和能量分辨信息在映射两种材料属性方面的明显优势。具体而言,比较仅使用传统衰减数据获得的图像与仅使用康普顿散射光子或两种数据结合形成的图像,显示同时利用两种数据进行重建能提供更准确的结果。

英文摘要

In this paper we demonstrate the utility of fusing energy-resolved observations of Compton scattered photons with traditional attenuation data for the joint recovery of mass density and photoelectric absorption in the context of limited-view tomographic imaging applications. We begin with the development of a physical and associated numerical model for the Compton scatter process. Using this model, we propose a variational approach for recovering these two material properties. In addition to the typical data-fidelity terms, the optimization functional includes regularization for both the mass density and photoelectric coefficients. We consider a novel edge-preserving method in the case of mass density. To aid in the recovery of the photoelectric information, we draw on our recent method and employ a non-local regularization scheme that builds on the fact that mass density is more stably imaged. Simulation results demonstrate clear advantages associated with the use of both scattered photon data and energy-resolved information in mapping the two material properties of interest. Specifically, comparing images obtained using only conventional attenuation data with those where we employ only Compton scatter photons and with images formed from the combination of the two shows that taking advantage of both types of data for reconstruction provides far more accurate results.

1707.00281 2026-06-04 cs.CV cs.NA math.NA math.OC 版本更新

A Batch-Incremental Video Background Estimation Model using Weighted Low-Rank Approximation of Matrices

一种基于矩阵加权低秩近似的批量增量视频背景估计模型

Aritra Dutta, Xin Li, Peter Richtárik

AI总结 本文提出一种批量增量视频背景估计模型,通过加权低秩近似改进传统方法,在实测和合成视频上优于GRASTA、ReProCS等算法。

详情
AI中文摘要

主成分追寻(PCP)是背景估计问题的最新方法。由于计算成本高,PCP算法如鲁棒主成分分析(RPCA)及其变种难以处理高清视频。为避免这些算法的维度诅咒,已有方法采用增量方式解决背景估计问题。本文提出一种基于矩阵加权低秩近似的批量增量背景估计模型。通过实测和合成视频实验,证明所提方法在性能上优于GRASTA、ReProCS、incPCP和GFL等最新背景估计算法。

英文摘要

Principal component pursuit (PCP) is a state-of-the-art approach for background estimation problems. Due to their high computational cost, PCP algorithms, such as robust principal component analysis (RPCA) and its variants, are not feasible for processing high-definition videos. To avoid the curse of dimensionality in those algorithms, several methods have been proposed to solve the background estimation problem in an incremental manner. We propose a batch-incremental background estimation model using a special weighted low-rank approximation of matrices. Through experiments with real and synthetic video sequences, we demonstrate that our method is superior to state-of-the-art background estimation algorithms such as GRASTA, ReProCS, incPCP, and GFL.

1706.08575 2026-06-04 math.NA cs.CV cs.NA 版本更新

Using Frame Theoretic Convolutional Gridding for Robust Synthetic Aperture Sonar Imaging

利用框架理论卷积栅格进行鲁棒合成孔径声呐成像

John McKay, Anne Gelb, Vishal Monga, Raghu Raj

AI总结 本文提出使用框架理论卷积栅格算法改进合成孔径声呐成像,以提高鲁棒性和精度,减少因多普勒效应和声速估计误差导致的不准确性。

Comments Accepted to OCEANS 2017 - Anchorage (Conference)

详情
AI中文摘要

近年来,合成孔径声呐(SAS)技术及处理方法的进步显著提升了水下成像性能,优于传统方法在准确性和效率上。然而,当前SAS重建方法存在固有局限。特别是流行的高效傅里叶域SAS方法需要二维插值,通常病态且不准确,不可避免地降低对斑点和不准确声速估计的鲁棒性。为克服这些问题,我们提出使用框架理论卷积栅格(FTCG)算法处理非均匀傅里叶数据。FTCG在非均匀快速傅里叶变换(NUFFT)算法基础上,将NUFFT视为给定傅里叶框架数据的近似问题。FTCG已被证明在计算成本略高的情况下能提供改进的准确性。通过模拟数据,我们概述了如何使用FTCG来增强当前SAS处理。

英文摘要

Recent progress in synthetic aperture sonar (SAS) technology and processing has led to significant advances in underwater imaging, outperforming previously common approaches in both accuracy and efficiency. There are, however, inherent limitations to current SAS reconstruction methodology. In particular, popular and efficient Fourier domain SAS methods require a 2D interpolation which is often ill-conditioned and inaccurate, inevitably reducing robustness with regard to speckle and inaccurate sound-speed estimation. To overcome these issues, we propose using the frame theoretic convolution gridding (FTCG) algorithm to handle the non-uniform Fourier data. FTCG extends non-uniform fast Fourier transform (NUFFT) algorithms by casting the NUFFT as an approximation problem given Fourier frame data. The FTCG has been shown to yield improved accuracy at little additional computational cost. Using simulated data, we outline how the FTCG can be used to enhance current SAS processing.

1403.7588 2026-06-04 math.OC cs.CV cs.NA math.NA stat.ML 版本更新

Scalable Robust Matrix Recovery: Frank-Wolfe Meets Proximal Methods

可扩展的鲁棒矩阵恢复:Frank-Wolfe与近端方法的结合

Cun Mu, Yuqian Zhang, John Wright, Donald Goldfarb

AI总结 本文提出了一种可扩展且高效的鲁棒矩阵恢复方法,结合Frank-Wolfe和近端方法,以线性复杂度解决压缩主成分追寻问题,通过秩一SVD更新低秩部分并处理稀疏项,验证了方法在视觉数据中的可扩展性。

详情
Journal ref
SIAM Journal on Scientific Computing, 2016, Vol. 38, No. 5 : pp. A3291-A3317
AI中文摘要

矩阵从压缩和严重损坏的观测中恢复是稳健统计中的基本问题,广泛应用于计算机视觉和机器学习。理论上,在某些条件下,该问题可以通过自然的凸松弛,即压缩主成分追踪(CPCP)在多项式时间内解决。然而,所有现有的可证明算法对于CPCP都面临每迭代超线性的成本,这严重限制了它们在大规模问题中的应用。在本文中,我们提出了一种可证明、可扩展和高效的解决CPCP的方法,具有(本质上)线性每迭代成本。我们的方法结合了Frank-Wolfe和近端方法的经典思想。在每次迭代中,我们主要利用Frank-Wolfe来使用秩一SVD更新低秩部分,并利用近端步骤处理稀疏项。还讨论了收敛结果和实现细节。我们通过在视觉数据上的有希望的数值实验展示了所提出方法的可扩展性。

英文摘要

Recovering matrices from compressive and grossly corrupted observations is a fundamental problem in robust statistics, with rich applications in computer vision and machine learning. In theory, under certain conditions, this problem can be solved in polynomial time via a natural convex relaxation, known as Compressive Principal Component Pursuit (CPCP). However, all existing provable algorithms for CPCP suffer from superlinear per-iteration cost, which severely limits their applicability to large scale problems. In this paper, we propose provable, scalable and efficient methods to solve CPCP with (essentially) linear per-iteration cost. Our method combines classical ideas from Frank-Wolfe and proximal methods. In each iteration, we mainly exploit Frank-Wolfe to update the low-rank component with rank-one SVD and exploit the proximal step for the sparse term. Convergence results and implementation details are also discussed. We demonstrate the scalability of the proposed approach with promising numerical experiments on visual data.
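
The rank-one SVD update mentioned in this abstract can be sketched as follows. In Frank-Wolfe methods over a nuclear-norm ball of radius tau, the linear minimization step has the closed form -tau * u1 * v1^T, where (u1, v1) is the top singular pair of the gradient G. The code below is a minimal pure-Python illustration using power iteration, not the authors' implementation. The names `top_singular_pair` and `nuclear_ball_lmo`, the iteration count, and the all-ones starting vector are assumptions. It also assumes that G is nonzero and that the starting vector is not orthogonal to the top right singular vector.

```python
import math


def matvec(A, x):
    """Compute A @ x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A^T @ y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def normalize(x):
    """Scale x to unit Euclidean norm (x is assumed nonzero)."""
    nrm = math.sqrt(sum(v * v for v in x))
    return [v / nrm for v in x]


def top_singular_pair(A, iters=100):
    """Approximate the leading singular vectors of A by power iteration."""
    v = normalize([1.0] * len(A[0]))
    for _ in range(iters):
        u = normalize(matvec(A, v))
        v = normalize(rmatvec(A, u))
    u = normalize(matvec(A, v))
    return u, v


def nuclear_ball_lmo(G, tau):
    """Return argmin of <G, S> over the set ||S||_* <= tau, i.e. -tau * u1 v1^T."""
    u, v = top_singular_pair(G)
    return [[-tau * ui * vj for vj in v] for ui in u]


G = [[3.0, 0.0], [0.0, 1.0]]
S = nuclear_ball_lmo(G, tau=2.0)
```

Only matrix-vector products with G are needed here, which is why this step is cheap compared with a full SVD. The proximal step for an $\ell_1$ term is entrywise soft-thresholding.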

1705.05804 2026-06-04 cs.CV cs.NA math.NA stat.ML 版本更新

The Incremental Multiresolution Matrix Factorization Algorithm

增量多分辨率矩阵分解算法

Vamsi K. Ithapu, Risi Kondor, Sterling C. Johnson, Vikas Singh

AI总结 本文提出增量多分辨率矩阵分解算法,用于揭示对称矩阵的层次块结构,通过逐特征分析提升大规模矩阵处理能力,并在医学影像回归任务中验证其有效性。

Comments Computer Vision and Pattern Recognition (CVPR) 2017, 10 pages

详情
AI中文摘要

多分辨率分析和矩阵分解是计算机视觉的基础工具。本文研究了这两个不同领域的交汇,并获得揭示对称矩阵层次块结构的技术,这对许多视觉问题的成功至关重要。我们的新算法,增量多分辨率矩阵分解,逐特征揭示此类结构,因此能有效扩展至大规模矩阵。我们描述了这种多尺度分析比直接全局分解能识别的更多。我们通过医学影像数据评估所得到的分解在回归任务中的有效性。我们还利用该分解在由流行深度网络学习的表示上进行操作,提供证据表明这些网络即使未显式训练以执行此类推断,也能推断语义关系。我们展示了该算法可作为探索工具来改进网络架构,并在视觉的众多其他设置中使用。

英文摘要

Multiresolution analysis and matrix factorization are foundational tools in computer vision. In this work, we study the interface between these two distinct topics and obtain techniques to uncover hierarchical block structure in symmetric matrices -- an important aspect in the success of many vision problems. Our new algorithm, the incremental multiresolution matrix factorization, uncovers such structure one feature at a time, and hence scales well to large matrices. We describe how this multiscale analysis goes much farther than what a direct global factorization of the data can identify. We evaluate the efficacy of the resulting factorizations for relative leveraging within regression tasks using medical imaging data. We also use the factorization on representations learned by popular deep networks, providing evidence of their ability to infer semantic relationships even when they are not explicitly trained to do so. We show that this algorithm can be used as an exploratory tool to improve the network architecture, and within numerous other settings in vision.

1705.05116 2026-06-04 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

Tuning Modular Networks with Weighted Losses for Hand-Eye Coordination

通过加权损失调节模块网络以提升手眼协调

Fangyi Zhang, Jürgen Leitner, Michael Milford, Peter I. Corke

AI总结 本文提出端到端微调方法,通过加权损失提升模块化深度视觉-运动策略在平面到达任务中的手眼协调性能。

Comments 2 pages, to appear in the Deep Learning for Robotic Vision (DLRV) Workshop in CVPR 2017

详情
AI中文摘要

本文介绍了一种端到端微调方法,用于改进模块化深度视觉-运动策略(模块网络)中的手眼协调能力,其中每个模块独立训练。得益于加权损失,该微调方法显著提升了策略在机器人平面到达任务中的性能。

英文摘要

This paper introduces an end-to-end fine-tuning method to improve hand-eye coordination in modular deep visuo-motor policies (modular networks) where each module is trained independently. Benefiting from weighted losses, the fine-tuning method significantly improves the performance of the policies for a robotic planar reaching task.

1610.06688 2026-06-04 cs.CV cs.NA math.NA 版本更新

Multispectral image denoising with optimized vector non-local mean filter

多光谱图像去噪的优化向量非局部均值滤波

Ahmed Ben Said, Rachid Hadjidj, Kamel Eddine Melkemi, Sebti Foufou

AI总结 本文提出将非局部均值滤波扩展至向量域,用于多光谱图像去噪,通过优化参数和计算复杂度提升去噪性能。

Comments 30 pages, 17 figures, journal paper

详情
AI中文摘要

如今,许多应用依赖高质量图像以确保任务执行性能。然而,噪声是大多数应用中不可避免的问题。因此,开发技术以减轻噪声影响,同时保持图像相关信息的完整性至关重要。本文提出将非局部均值滤波(NLM)扩展至向量情况,并应用于多光谱图像去噪。目标是利用多光谱成像系统带来的额外信息。NLM滤波器利用图像中的信息冗余来去除噪声。恢复的像素是图像中所有像素的加权平均。在我们的贡献中,我们提出了一种优化框架,其中动态调整NLM滤波器参数,并通过考虑最相似像素来降低计算复杂度。滤波器参数使用Stein的无偏风险估计器(SURE)而非随意方法进行优化。实验在受加性白高斯噪声污染的多光谱图像上进行,并提供了PSNR和与其他方法的相似性比较,以展示本方法在去噪性能和计算复杂度方面的效率。

英文摘要

Nowadays, many applications rely on images of high quality to ensure good performance in conducting their tasks. However, noise works against this objective, as it is an unavoidable issue in most applications. Therefore, it is essential to develop techniques to attenuate the impact of noise while maintaining the integrity of relevant information in images. We propose in this work to extend the application of the Non-Local Means (NLM) filter to the vector case and apply it to denoising multispectral images. The objective is to benefit from the additional information brought by multispectral imaging systems. The NLM filter exploits the redundancy of information in an image to remove noise. A restored pixel is a weighted average of all pixels in the image. In our contribution, we propose an optimization framework where we dynamically fine-tune the NLM filter parameters and reduce its computational complexity by considering only the pixels that are most similar to each other when computing a restored pixel. Filter parameters are optimized using Stein's Unbiased Risk Estimator (SURE) rather than ad hoc means. Experiments have been conducted on multispectral images corrupted with additive white Gaussian noise, and PSNR and similarity comparisons with other approaches are provided to illustrate the efficiency of our approach in terms of both denoising performance and computational complexity.
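
The statement that a restored pixel is a weighted average of all pixels can be made concrete with a small sketch. The code below is a plain scalar, one-dimensional Non-Local Means filter in pure Python. It is not the vector extension or the SURE-based parameter optimization proposed in the paper. The function name `nlm_1d`, the patch half-width `half_patch`, the smoothing parameter `h`, and the replicate-padding choice at the borders are all assumptions made for illustration.

```python
import math


def nlm_1d(signal, half_patch=1, h=0.5):
    """Denoise a 1D signal: each output is a weighted mean of all samples."""
    n = len(signal)

    def patch(i):
        # Neighborhood around sample i, with indices clamped at the borders.
        return [signal[min(max(i + k, 0), n - 1)]
                for k in range(-half_patch, half_patch + 1)]

    patches = [patch(i) for i in range(n)]
    out = []
    for i in range(n):
        weights = []
        for j in range(n):
            # Squared distance between the patches around samples i and j.
            d2 = sum((a - b) ** 2 for a, b in zip(patches[i], patches[j]))
            weights.append(math.exp(-d2 / (h * h)))
        total = sum(weights)
        out.append(sum(w * s for w, s in zip(weights, signal)) / total)
    return out


noisy = [0.0, 0.1, 0.0, 1.0, 0.9, 1.0]
denoised = nlm_1d(noisy)
```

Samples whose surrounding patches look alike receive large weights, so the two plateaus in `noisy` are averaged mostly within themselves rather than across the jump. The double loop over all pairs costs O(n^2) patch comparisons, which is the cost the abstract proposes to cut by keeping only the most similar pixels.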

1703.09744 2026-06-04 cs.CV cs.SY eess.SY 版本更新

Feature Analysis and Selection for Training an End-to-End Autonomous Vehicle Controller Using the Deep Learning Approach

基于深度学习方法的自动驾驶控制器训练中的特征分析与选择

Shun Yang, Wenshuo Wang, Chang Liu, Kevin Deng, J. Karl Hedrick

AI总结 本文通过分析CNN训练中不同特征对控制器性能的影响,提出特征选择方法以降低计算成本。实验表明,道路相关特征不可或缺,路边相关特征能提升控制器泛化能力,而天空相关特征贡献有限。

Comments 6 pages, 11 figures, 3 tables, accepted by 2017 IEEE Intelligent Vehicles Symposium

详情
AI中文摘要

基于深度学习的方法因其强大的非线性函数近似能力,已被广泛用于训练自动驾驶车辆控制器。然而,训练过程通常需要大量标记数据且耗时较长。本文分析了卷积神经网络(CNN)训练中各特征对控制器性能的影响,为特征选择提供指导。通过使用开放赛车模拟器(TORCS)收集大量数据,并将图像特征分为天空相关、路边相关和道路相关三类。设计了两个实验框架来研究各单个特征对训练CNN控制器的重要性。第一个框架使用包含所有三个特征的训练数据训练控制器,然后用移除一个特征的数据测试以评估特征影响。第二个框架则使用排除一个特征的训练数据,而测试数据包含所有三个特征。通过不同驾驶场景测试和分析两个实验框架下的训练控制器。实验结果表明:(1)道路相关特征对训练控制器至关重要;(2)路边相关特征有助于提升控制器在复杂路边信息场景下的泛化能力;(3)天空相关特征对训练端到端自动驾驶车辆控制器贡献有限。

英文摘要

Deep learning-based approaches have been widely used for training controllers for autonomous vehicles due to their powerful ability to approximate nonlinear functions or policies. However, the training process usually requires large labeled data sets and takes a lot of time. In this paper, we analyze the influence of features on the performance of controllers trained using convolutional neural networks (CNNs), which gives a guideline for feature selection to reduce computation cost. We collect a large set of data using The Open Racing Car Simulator (TORCS) and classify the image features into three categories (sky-related, roadside-related, and road-related features). We then design two experimental frameworks to investigate the importance of each single feature for training a CNN controller. The first framework uses training data with all three features included to train a controller, which is then tested with data that has one feature removed to evaluate the feature's effects. The second framework is trained with data that has one feature excluded, while all three features are included in the test data. Different driving scenarios are selected to test and analyze the trained controllers using the two experimental frameworks. The experimental results show that (1) the road-related features are indispensable for training the controller, (2) the roadside-related features are useful for improving the generalizability of the controller to scenarios with complicated roadside information, and (3) the sky-related features contribute little to training an end-to-end autonomous vehicle controller.

1703.08001 2026-06-04 cs.CV cs.NA math.NA 版本更新

Nonlinear Spectral Image Fusion

非线性频谱图像融合

Martin Benning, Michael Möller, Raz Z. Nossek, Martin Burger, Daniel Cremers, Guy Gilboa, Carola-Bibiane Schönlieb

AI总结 本文展示基于总变分正则化的非线性频谱分解框架在图像融合及更广泛的图像处理任务中的有效性,通过选择特定图像的频率转移特征如面部皱纹,实现图像编辑。

Comments 13 pages, 9 figures, submitted to SSVM conference proceedings 2017

详情
AI中文摘要

本文演示了基于总变分正则化的非线性频谱分解框架在图像融合及更广泛的图像处理任务中的有效性。局部化良好且边缘保留的频谱总变分分解允许选择特定图像的频率以转移特定特征,如面部皱纹,从一个图像到另一个图像。我们通过多个数值实验展示了所提出方法的有效性,包括与泊松图像编辑、线性渗透、小波融合和拉普拉斯金字塔融合等竞争技术的比较。我们得出结论,所提出的频谱总变分图像分解框架是半自动和全自动图像编辑和融合的重要工具。

英文摘要

In this paper we demonstrate that the framework of nonlinear spectral decompositions based on total variation (TV) regularization is very well suited for image fusion as well as more general image manipulation tasks. The well-localized and edge-preserving spectral TV decomposition allows selecting frequencies of a certain image to transfer particular features, such as wrinkles in a face, from one image to another. We illustrate the effectiveness of the proposed approach in several numerical experiments, including a comparison to the competing techniques of Poisson image editing, linear osmosis, wavelet fusion and Laplacian pyramid fusion. We conclude that the proposed spectral TV image decomposition framework is a valuable tool for semi- and fully-automatic image editing and fusion.

1703.05560 2026-06-04 math.NA cs.CV cs.NA math.SP 版本更新

Combining Contrast Invariant L1 Data Fidelities with Nonlinear Spectral Image Decomposition

结合对比不变的L1数据保真度与非线性频谱图像分解

Leonie Zeune, Stephan A. van Gils, Leon W. M. M. Terstappen, Christoph Brune

AI总结 本文研究了变分方法的多尺度方法及其梯度流,结合L1保真度与非线性频谱分解以提升图像分割和形状分解的性能。

Comments 13 pages, 7 figures, conference SSVM 2017

详情
AI中文摘要

本文聚焦于变分方法的多尺度方法及其对应的梯度流。近年来,针对如总变分等凸正则化函数,已开发出通过非线性频谱分解解决非线性本征值问题的新理论和算法。这些方法为高级图像滤波开辟了新方向。然而,为了在图像分割和形状分解中有效应用,需要清晰解释频谱响应与大小和强度尺度的关系,但当前方法缺乏这一解释。在此背景下,L1数据保真度因其有趣的多尺度特性,如对比不变性,特别有用。因此,本文的创新点是将基于L1的多尺度方法与非线性频谱分解相结合。我们从频谱图像表示和分解的角度比较L1与L2尺度空间方法。我们证明了L1-TV的对比不变多尺度行为在频谱响应中促进稀疏性,从而提供更具信息量的分解。我们提供了一种数值方法,并分析了合成和生物医学图像,其中分解导致了改进的分割。

英文摘要

This paper focuses on multi-scale approaches for variational methods and corresponding gradient flows. Recently, for convex regularization functionals such as total variation, new theory and algorithms for nonlinear eigenvalue problems via nonlinear spectral decompositions have been developed. Those methods open new directions for advanced image filtering. However, for an effective use in image segmentation and shape decomposition, a clear interpretation of the spectral response regarding size and intensity scales is needed but lacking in current approaches. In this context, $L^1$ data fidelities are particularly helpful due to their interesting multi-scale properties such as contrast invariance. Hence, the novelty of this work is the combination of $L^1$-based multi-scale methods with nonlinear spectral decompositions. We compare $L^1$ with $L^2$ scale-space methods in view of spectral image representation and decomposition. We show that the contrast-invariant multi-scale behavior of $L^1$-TV promotes sparsity in the spectral response, providing more informative decompositions. We provide a numerical method and analyze synthetic and biomedical images for which the decomposition leads to improved segmentation.

1501.06209 2026-06-04 math.NA cs.CV cs.NA math.OC physics.med-ph 版本更新

Parallel Magnetic Resonance Imaging

并行磁共振成像

Martin Uecker

AI总结 本文探讨了并行磁共振成像在图像重建中的应用,通过逆问题视角分析,介绍了正则化、离散化和迭代重建等基本概念,并讨论了自校准算法、近似理论及压缩感知的结合。

Comments 22 pages, 9 Figures, 76 References. Copyright: Martin Uecker. Draft for a book chapter. To appear in: A Majumdar and RK Ward (eds.), MRI: Physics, Image Reconstruction, and Analysis, CRC Press 2015

详情
Journal ref
In: MRI: Physics, Image Reconstruction, and Analysis, CRC Press 2015, pp. 73-92, ISBN 9781482298871
AI中文摘要

磁共振成像(MRI)的主要缺点是其长扫描时间和由此引起的运动敏感性。利用多个接收线圈的互补信息,平行成像能够从欠采样的k空间数据中恢复图像并加速测量。由于平行磁共振成像可以加速任何成像序列,因此具有重要的应用价值。平行成像带来了图像重建的根本性转变:图像重建从简单的直接傅里叶变换转变为求解一个病态逆问题的解决方案。本文从逆问题的角度概述了图像重建,介绍了正则化、离散化和迭代重建等基本概念,并讨论了包括自校准算法、与近似理论的联系以及与压缩感知的结合等高级主题。

英文摘要

The main disadvantages of Magnetic Resonance Imaging (MRI) are its long scan times and, in consequence, its sensitivity to motion. Exploiting the complementary information from multiple receive coils, parallel imaging is able to recover images from under-sampled k-space data and to accelerate the measurement. Because parallel magnetic resonance imaging can be used to accelerate basically any imaging sequence, it has many important applications. Parallel imaging brought a fundamental shift in image reconstruction: image reconstruction changed from a simple direct Fourier transform to the solution of an ill-conditioned inverse problem. This work gives an overview of image reconstruction from the perspective of inverse problems. After introducing basic concepts such as regularization, discretization, and iterative reconstruction, advanced topics are discussed including algorithms for auto-calibration, the connection to approximation theory, and the combination with compressed sensing.

1703.00663 2026-06-04 math.NA cs.CV cs.LG cs.NA math.OC stat.ML 版本更新

Introduction to Nonnegative Matrix Factorization

非负矩阵因子分解简介

Nicolas Gillis

AI总结 本文介绍非负矩阵因子分解的应用、解的几何性质与唯一性、复杂度及算法,并探讨其与多面体扩展形式的联系。

Comments 18 pages, 4 figures

详情
Journal ref
SIAG/OPT Views and News 25 (1), pp. 7-16 (2017)
AI中文摘要

本文介绍了非负矩阵因子分解(NMF)的概念,并提供了简要概述。讨论了NMF在高光谱成像中的应用、解的几何性质与唯一性、复杂度、算法及其与多面体扩展形式的联系。为将NMF置于更广泛的问题框架中,首先简要介绍了受限低秩矩阵近似问题的更一般问题类别。

英文摘要

In this paper, we introduce and provide a short overview of nonnegative matrix factorization (NMF). Several aspects of NMF are discussed, namely, the application in hyperspectral imaging, geometry and uniqueness of NMF solutions, complexity, algorithms, and its link with extended formulations of polyhedra. In order to put NMF into perspective, the more general problem class of constrained low-rank matrix approximation problems is first briefly introduced.
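
As a concrete instance of the kind of algorithm such an overview covers, the sketch below implements the classical multiplicative update rule for the Frobenius-norm NMF problem, X ≈ WH with W, H ≥ 0, in pure Python. Whether the paper presents this particular rule is not stated in the abstract, so treat the choice of algorithm as an assumption. The helper names, the small `eps` added to denominators to avoid division by zero, and the toy data are likewise illustrative.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(X, W, H):
    """Squared Frobenius norm of X - W H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))


def nmf_step(X, W, H, eps=1e-12):
    """One multiplicative update of H, then of W; nonnegativity is preserved."""
    Wt = transpose(W)
    num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
    H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
         for rh, rn, rd in zip(H, num, den)]
    Ht = transpose(H)
    num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
    W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
         for rw, rn, rd in zip(W, num, den)]
    return W, H


# Toy rank-one nonnegative matrix and an all-ones initialization.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
W = [[1.0], [1.0], [1.0]]
H = [[1.0, 1.0]]
initial_error = frob_error(X, W, H)
for _ in range(20):
    W, H = nmf_step(X, W, H)
final_error = frob_error(X, W, H)
```

Because every update multiplies the current entry by a ratio of nonnegative quantities, W and H stay nonnegative without any explicit projection.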

1608.00514 2026-06-04 math.NA cs.CV cs.NA 版本更新

Dimensionality reduction based on Distance Preservation to Local Mean (DPLM) for SPD matrices and its application in BCI

基于距离保持到局部均值的距离降维(DPLM)的SPD矩阵及其在BCI中的应用

Alireza Davoudi, Saeed Shiry Ghidary, Khadijeh Sadatnejad

AI总结 本文提出了一种非线性降维算法,用于对对称正定(SPD)矩阵流形进行处理,通过保持数据的局部结构和局部均值距离(DPLM)来提供高类间区分的低维表示。实验表明,DPLM在BCI竞赛IV的多类数据集IIa上优于其他方法,因其对异常值具有鲁棒性。

详情
AI中文摘要

本文提出了一种非线性降维算法,用于对称正定(SPD)矩阵流形。该算法考虑了SPD矩阵的几何特性,并通过保持距离到局部均值(DPLM)来提供高类间区分的低维表示。DPLM在训练样本数量上是线性的,并且可以利用可用的标签信息以提高分类任务的性能。我们在BCI竞赛IV的多类数据集IIa上进行了多项实验。结果表明,我们的方法在与其他文献中方法相比时,由于其对异常值的鲁棒性而表现更优。实验还确认,DPLM与FGMDM作为分类器的结合在该数据集上实现了最先进的性能。

英文摘要

In this paper, we propose a nonlinear dimensionality reduction algorithm for the manifold of Symmetric Positive Definite (SPD) matrices that considers the geometry of SPD matrices and provides a low-dimensional representation of the manifold with high class discrimination. The proposed algorithm tries to preserve the local structure of the data by preserving distance to local mean (DPLM) and also provides an implicit projection matrix. DPLM is linear in the number of training samples and may use label information, when available, to improve performance in classification tasks. We performed several experiments on the multi-class dataset IIa from BCI competition IV. The results show that our approach, used as a dimensionality reduction technique, leads to superior results in comparison with other competitors in the related literature because of its robustness against outliers. The experiments confirm that the combination of DPLM with FGMDM as the classifier leads to state-of-the-art performance on this dataset.

1612.07850 2026-06-04 cs.RO cs.CV cs.SY eess.SY 版本更新

Automatic Interpretation of Unordered Point Cloud Data for UAV Navigation in Construction

无人机在建筑施工中无序点云数据的自动解释

M. D. Phung, C. H. Quach, D. T. Chu, N. Q. Nguyen, T. H. Dinh, Q. P. Ha

AI总结 本文提出了一种数据处理系统,用于自动为无人机生成航路点,以检查建筑和桥梁等结构表面。系统通过两个正交安装的2D激光扫描仪和惯性测量单元的数据,利用数据注册、表面检测和航路生成算法,实现结构点云重建和航路规划。

Comments In The 14th International Conference on Control, Automation, Robotics and Vision, ICARCV 2016

详情
AI中文摘要

本工作旨在开发一种数据处理系统,能够自动为无人驾驶航空器(UAV)生成航路点,以检查建筑物和桥梁等结构的表面。输入包括由两个正交安装在UAV上的2D激光扫描仪和惯性测量单元(IMU)记录的数据。为实现目标,开发了处理所收集数据的算法,分为三类:(i)数据注册和滤波以生成结构的3D模型并控制点云密度以提高数据完整性;(ii)表面和障碍物检测以协助UAV的监控任务;(iii)航路点生成以设置飞行路径。不同数据集的实验表明,所开发的系统能够重建结构的3D点云,提取其表面和物体,并为UAV生成航路点以完成检查任务。

英文摘要

The objective of this work is to develop a data processing system that can automatically generate waypoints for navigation of an unmanned aerial vehicle (UAV) to inspect surfaces of structures like buildings and bridges. The input includes data recorded by two 2D laser scanners, orthogonally mounted on the UAV, and an inertial measurement unit (IMU). To achieve the goal, algorithms are developed to process the data collected. They are separated into three major groups: (i) the data registration and filtering to generate a 3D model of the structure and control the density of point clouds for data completeness enhancement; (ii) the surface and obstacle detection to assist the UAV in monitoring tasks; and (iii) the waypoint generation to set the flight path. Experiments on different data sets show that the developed system is able to reconstruct a 3D point cloud of the structure, extract its surfaces and objects, and generate waypoints for the UAV to accomplish inspection tasks.

1702.02680 2026-06-04 cs.CV cs.NA math.NA 版本更新

Manifold Based Low-rank Regularization for Image Restoration and Semi-supervised Learning

基于流形的低秩正则化用于图像恢复和半监督学习

Rongjie Lai, Jia Li

AI总结 本文提出基于流形的低秩正则化方法,用于图像恢复和半监督学习,通过线性近似流形维度,提升处理非线性数据的灵活性和效果。

Comments 23 pages, 13 figures

详情
AI中文摘要

低秩结构在图像科学和数据科学的近期进展中扮演重要角色。作为低秩结构在非线性数据中的自然扩展,流形低维结构的概念被应用于许多数据处理问题。受此概念启发,本文考虑基于流形的低秩正则化作为流形维度的线性近似。这种正则化比全局低秩正则化更灵活,能够更好地处理非线性数据。作为应用,本文将所提正则化方法应用于图像科学和数据科学中的经典反问题,包括图像修复、图像超分辨率、X射线计算机断层扫描(CT)图像重建和半监督学习。我们在多个图像恢复问题和使用MNIST数据集的半监督学习问题上进行了大量数值实验。我们的数值测试展示了所提方法的有效性,并通过与许多现有方法的比较,证明了新正则化方法的出色性能。

英文摘要

Low-rank structures play an important role in recent advances on many problems in image science and data science. As a natural extension of low-rank structures for data with nonlinear structures, the concept of the low-dimensional manifold structure has been considered in many data processing problems. Inspired by this concept, we consider a manifold-based low-rank regularization as a linear approximation of manifold dimension. This regularization is less restrictive than the global low-rank regularization, and thus enjoys more flexibility to handle data with nonlinear structures. As applications, we apply the proposed regularization to classical inverse problems in image science and data science, including image inpainting, image super-resolution, X-ray computed tomography (CT) image reconstruction and semi-supervised learning. We conduct intensive numerical experiments on several image restoration problems and a semi-supervised learning problem of classifying handwritten digits using the MNIST data. Our numerical tests demonstrate the effectiveness of the proposed methods and illustrate, through comparison with many existing methods, that the new regularization methods produce outstanding results.

1609.06041 2026-06-04 physics.med-ph cs.CV cs.NA math.NA 版本更新

A very fast iterative algorithm for TV-regularized image reconstruction with applications to low-dose and few-view CT

一种非常快速的迭代算法用于TV正则化图像重建及其在低剂量和少视角CT中的应用

Hiroyuki Kudo, Fukashi Yamazaki, Takuya Nemoto, Keita Takaki

AI总结 本文提出了一种快速迭代算法用于低剂量和少视角CT图像重建,通过TV正则化最小化数据保真项,利用预条件技术提升收敛速度。

Comments 16 pages, 8 figures, SPIE Optics + Photonics 2016 Conference (Developments in X-Ray Tomography X) Paper No. 9967-37

详情
AI中文摘要

本文研究了通过最小化数据保真项并使用总变分(TV)惩罚进行低剂量和少视角CT的迭代重建。我们提出了一种非常快速的迭代算法来解决这个问题。算法推导如下:首先,通过拉格朗日对偶性将原始最小化问题转换为鞍点(对偶)问题,并应用一阶对偶迭代方法。其次,使用滤波反投影(FBP)重建算法的 ramp 滤波器对迭代公式进行预条件处理,以确保问题解不变。所得到的算法结构类似于所谓的迭代 FBP 算法,并且能够快速收敛到成本函数的精确最小值。

英文摘要

This paper concerns iterative reconstruction for low-dose and few-view CT by minimizing a data-fidelity term regularized with the Total Variation (TV) penalty. We propose a very fast iterative algorithm to solve this problem. The algorithm derivation is outlined as follows. First, the original minimization problem is reformulated as a saddle point (primal-dual) problem by using Lagrangian duality, to which we apply first-order primal-dual iterative methods. Second, we precondition the iteration formula using the ramp filter of the Filtered Backprojection (FBP) reconstruction algorithm in such a way that the problem solution is not altered. The resulting algorithm resembles the structure of the so-called iterative FBP algorithm, and it converges to the exact minimizer of the cost function very fast.

1609.06020 2026-06-04 physics.med-ph cs.CV cs.NA math.NA 版本更新

Proposal of fault-tolerant tomographic image reconstruction

容错断层成像图像重建方法的提出

Hiroyuki Kudo, Keita Takaki, Fukashi Yamazaki, Takuya Nemoto

AI总结 本文提出一种容错断层成像重建算法,通过使用L1范数替代L2范数,结合凸优化中的近端分裂框架,提升异常数据下的重建鲁棒性。

Comments 12 pages, 5 figures, SPIE Optics + Photonics 2016 Conference (Developments in X-Ray Tomography X) Paper No. 9967-55

详情
AI中文摘要

本文针对断层成像中部分投影数据bin被异常数据污染的情况,提出一种新的容错重建算法。传统迭代重建使用L2范数误差函数||Ax-b||_2^2,易受异常数据影响。本文改用L1范数误差函数||Ax-b||_1^1,并开发一种基于近端分裂框架的行动作迭代算法。同时提出改进的L1-TV重建方法,在成本函数中加入弱化总变分(TV)惩罚项。仿真结果表明,L2范数重建在异常bin影响下图像严重受损,而L1范数和L1-TV重建对异常bin具有鲁棒性。

英文摘要

This paper deals with tomographic image reconstruction in the situation where some of the projection data bins are contaminated with abnormal data. Such situations occur in various instances of tomography. We propose a new reconstruction algorithm, called Fault-Tolerant reconstruction, outlined as follows. The least-squares (L2-norm) error function $\|Ax-b\|_2^2$ used in ordinary iterative reconstructions is sensitive to the existence of abnormal data. The proposed algorithm utilizes the L1-norm error function $\|Ax-b\|_1^1$ instead of the L2-norm, and we develop a row-action-type iterative algorithm using the proximal splitting framework from convex optimization. We also propose an improved version of the L1-norm reconstruction called the L1-TV reconstruction, in which a weak Total Variation (TV) penalty is added to the cost function. Simulation results demonstrate that images reconstructed with the L2-norm were severely damaged by the effect of abnormal bins, whereas images from the L1-norm and L1-TV reconstructions were robust to the existence of abnormal bins.
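
The sensitivity of the L2-norm error function to abnormal data, and the robustness of the L1-norm, can be seen in the simplest possible case: A is a single column of ones, so x is one unknown measured several times. The L2 minimizer is then the arithmetic mean and the L1 minimizer is a median. The sketch below only illustrates that textbook fact. It is not the row-action algorithm of the paper, and the function names and the toy measurements are assumptions.

```python
from statistics import mean, median


def l2_estimate(b):
    """Minimizer of sum_i (x - b_i)^2 over x: the arithmetic mean."""
    return mean(b)


def l1_estimate(b):
    """A minimizer of sum_i |x - b_i| over x: the median."""
    return median(b)


# Four consistent measurements of the value 1.0 and one abnormal bin.
measurements = [1.0, 1.0, 1.0, 1.0, 100.0]
x_l2 = l2_estimate(measurements)
x_l1 = l1_estimate(measurements)
```

Here the single abnormal bin drags the L2 estimate to 20.8, while the L1 estimate stays at 1.0.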

1701.07158 2026-06-04 math.NA cs.CV cs.NA math.FA 版本更新

An Edge Driven Wavelet Frame Model for Image Restoration

基于边缘驱动的小波框架模型图像修复

Jae Kyu Choi, Bin Dong, Xiaoqun Zhang

AI总结 本文提出一种边缘驱动的小波框架模型,通过将图像近似为分段光滑函数,对光滑和奇异区域施加不同强度的正则化,实现鲁棒的图像修复。

详情
AI中文摘要

小波框架系统已知在从噪声和退化图像中捕捉奇异性的方面效果显著。本文介绍了一种新的边缘驱动小波框架模型用于图像修复,通过将图像近似为分段光滑函数。通过隐式表示图像奇异集,所提模型对光滑和奇异图像区域及边缘施加不同的正则化强度。所提出的边缘驱动模型对图像近似和奇异估计均具有鲁棒性。隐式公式还使能够对所提模型进行渐近分析,并建立离散模型与一般连续变分模型之间的严谨联系。最后,图像修复和去模糊的数值结果表明,所提模型在与几种流行图像修复模型相比时表现优异。

英文摘要

Wavelet frame systems are known to be effective in capturing singularities from noisy and degraded images. In this paper, we introduce a new edge-driven wavelet frame model for image restoration by approximating images as piecewise smooth functions. With an implicit representation of image singularity sets, the proposed model imposes different strengths of regularization on smooth and singular image regions and edges. The proposed edge-driven model is robust to both image approximation and singularity estimation. The implicit formulation also enables an asymptotic analysis of the proposed model and a rigorous connection between the discrete model and a general continuous variational model. Finally, numerical results on image inpainting and deblurring show that the proposed model compares favorably with several popular image restoration models.

1512.00389 2026-06-04 cs.CV cs.NA math.NA 版本更新

Accelerated graph-based nonlinear denoising filters

加速的图基非线性去噪滤波器

Andrew Knyazev, Alexander Malyshev

AI总结 本文提出通过共轭梯度法和Nesterov加速技术加速图基非线性去噪滤波器,实验显示在图像去噪中效率提升2-12倍。

Comments 10 pages, 6 figures, to appear in Procedia Computer Science, vol.80, 2016, International Conference on Computational Science, San Diego, CA, USA, June 6-8, 2016

详情
Journal ref
Procedia Computer Science Volume 80, 2016, Pages 607-616, International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA
AI中文摘要

去噪滤波器如双边、引导和总变分滤波器在一般图上应用于图像时,若噪声不够小可能需要重复应用。本文提出两种加速技术:共轭梯度法和Nesterov加速。数值实验表明加速的非线性滤波器在图像去噪中效率显著,加速技术将达到给定PSNR所需的迭代次数减少2-12倍。

英文摘要

Denoising filters, such as bilateral, guided, and total variation filters, applied to images on general graphs may require repeated application if the noise is not small enough. We formulate two acceleration techniques for the resulting iterations: the conjugate gradient method and Nesterov's acceleration. We numerically show the efficiency of the accelerated nonlinear filters for image denoising and demonstrate a 2-12 times speed-up, i.e., the acceleration techniques reduce the number of iterations required to reach a given peak signal-to-noise ratio (PSNR) by a factor of 2-12.

1607.06032 2026-06-04 cs.CV cs.NA math.DS math.NA math.OC 版本更新

A Topological Lowpass Filter for Quasiperiodic Signals

一种用于拟周期信号的拓扑低通滤波器

Michael Robinson

AI总结 本文提出一种两阶段拓扑算法,用于从噪声测量中恢复拟周期函数的估计。第一阶段为拓扑相位估计器,能检测函数的拟周期结构,不增加额外限制,从而避免在使用大量样本时产生失真。

详情
AI中文摘要

本文提出了一种两阶段拓扑算法,用于从一组噪声测量中恢复拟周期函数的估计。算法的第一阶段是一个拓扑相位估计器,能够检测函数的拟周期结构,而无需对函数施加额外限制。通过尊重这一相位估计,算法在使用大量样本进行函数估计时避免产生失真。

英文摘要

This article presents a two-stage topological algorithm for recovering an estimate of a quasiperiodic function from a set of noisy measurements. The first stage of the algorithm is a topological phase estimator, which detects the quasiperiodic structure of the function without placing additional restrictions on the function. By respecting this phase estimate, the algorithm avoids creating distortion even when it uses a large number of samples for the estimate of the function.

1612.06176 2026-06-04 cs.CV cs.NA math.NA stat.ML 版本更新

An extended Perona-Malik model based on probabilistic models

基于概率模型扩展的Perona-Malik模型

Lars M. Mescheder, Dirk A. Lorenz

AI总结 本文基于高斯尺度混合模型扩展了Perona-Malik模型,通过EM算法推导出滞后扩散算法,并改进其以更好地捕捉恢复中的不确定性,同时提出计算可行的放松方法,实验显示改进算法在恢复纹理区域和模糊边缘方面表现更优。

详情
AI中文摘要

Perona-Malik模型在从噪声输入中恢复图像方面非常成功。本文将该模型重新诠释为高斯尺度混合物的语言,并推导出一些扩展。具体来说,我们展示了将EM算法应用于高斯尺度混合物导致滞后扩散算法用于计算Perona-Malik扩散方程的稳态点。此外,我们展示了这些高斯尺度混合物的均场近似如何导致一种改进的滞后扩散算法,更准确地捕捉恢复中的不确定性。由于这种改进在实践中难以计算,我们提出对均场目标进行放松以使算法计算可行。我们的数值实验表明,这种改进的滞后扩散算法在恢复纹理区域和模糊边缘方面通常比未改进的算法表现更好。作为高斯尺度混合框架的第二个应用,我们展示了如何通过高效采样过程获得概率模型,使计算条件均值和其他期望在算法上可行。同样,所得到的算法与滞后扩散算法有很强的相似性。最后,我们展示了在相同框架下,通过离散边缘先验可以得到概率版本的Mumford-Shah分割模型。

英文摘要

The Perona-Malik model has been very successful at restoring images from noisy input. In this paper, we reinterpret the Perona-Malik model in the language of Gaussian scale mixtures and derive some extensions of the model. Specifically, we show that the expectation-maximization (EM) algorithm applied to Gaussian scale mixtures leads to the lagged-diffusivity algorithm for computing stationary points of the Perona-Malik diffusion equations. Moreover, we show how mean field approximations to these Gaussian scale mixtures lead to a modification of the lagged-diffusivity algorithm that better captures the uncertainties in the restoration. Since this modification can be hard to compute in practice, we propose relaxations to the mean field objective to make the algorithm computationally feasible. Our numerical experiments show that this modified lagged-diffusivity algorithm often performs better at restoring textured areas and fuzzy edges than the unmodified algorithm. As a second application of the Gaussian scale mixture framework, we show how an efficient sampling procedure can be obtained for the probabilistic model, making the computation of the conditional mean and other expectations algorithmically feasible. Again, the resulting algorithm has a strong resemblance to the lagged-diffusivity algorithm. Finally, we show that a probabilistic version of the Mumford-Shah segmentation model can be obtained in the same framework with a discrete edge-prior.
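To make the lagged-diffusivity idea concrete, here is a minimal 1D sketch. It assumes the classical Perona-Malik diffusivity g(s) = 1 / (1 + s^2 / K^2) and a quadratic data term; the function name and the parameters `lam`, `K`, and `iters` are illustrative choices, not the paper's notation. Each pass freezes the diffusivity at the current iterate and solves the resulting linear system.

```python
import numpy as np

def lagged_diffusivity_1d(f, lam=1.0, K=0.1, iters=20):
    """Seek a stationary point of u - f + lam * D^T (g(|Du|) * Du) = 0.

    Assumption: g(s) = 1 / (1 + s^2 / K^2), the classical Perona-Malik diffusivity.
    """
    f = np.asarray(f, dtype=float)
    n = f.size
    D = np.diff(np.eye(n), axis=0)            # (n-1, n) forward-difference matrix
    u = f.copy()
    for _ in range(iters):
        # Freeze ("lag") the diffusivity at the current iterate.
        w = 1.0 / (1.0 + (D @ u) ** 2 / K ** 2)
        # With w fixed, the stationarity condition is linear in u.
        A = np.eye(n) + lam * D.T @ (w[:, None] * D)
        u = np.linalg.solve(A, f)
    return u
```

Small differences get a weight near 1 and are smoothed strongly, while large jumps get a small weight and are largely preserved. In the paper's reading, the weights play the role of the E-step quantities and the linear solve the M-step.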

1612.05323 2026-06-04 cs.CV cs.NA math.NA 版本更新

A Stochastic Large Deformation Model for Computational Anatomy

计算解剖学中的一种随机大变形模型

Alexis Arnaudon, Darryl D. Holm, Akshay Pai, Stefan Sommer

AI总结 本文提出一种随机模型,用于在大变形微分同胚度量映射(LDDMM)框架中引入随机变化,通过契合LDDMM几何性质的设置,解决带噪声地标点的模板估计问题,并提出两种高效估计噪声场参数的方法。

详情
AI中文摘要

在使用计算解剖学研究人体器官形状时,发现变异来源于受试者间解剖差异、疾病特异性效应和测量噪声。本文介绍了一种随机模型,用于将随机变化纳入大变形微分同胚度量映射(LDDMM)框架中。通过在契合LDDMM几何性质的特定设置中考虑随机性,我们为带噪声的地标点模板估计问题建立了公式,并给出了两种从给定数据集中高效估计噪声场参数的方法。一种方法直接用有限组微分方程近似每个地标点的方差时间演化,另一种基于期望最大化算法。在第二种方法中,通过对随机扰动的大变形梯度流算法应用桥采样技术,在不配准地标点的情况下评估数据似然性。该方法和估计算法在合成示例和人类胼胝体形状数据上进行了实验验证。

英文摘要

In the study of shapes of human organs using computational anatomy, variations are found to arise from inter-subject anatomical differences, disease-specific effects, and measurement noise. This paper introduces a stochastic model for incorporating random variations into the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework. By accounting for randomness in a particular setup which is crafted to fit the geometrical properties of LDDMM, we formulate the template estimation problem for landmarks with noise and give two methods for efficiently estimating the parameters of the noise fields from a prescribed data set. One method directly approximates the time evolution of the variance of each landmark by a finite set of differential equations, and the other is based on an Expectation-Maximisation algorithm. In the second method, the evaluation of the data likelihood is achieved without registering the landmarks, by applying bridge sampling using a stochastically perturbed version of the large deformation gradient flow algorithm. The method and the estimation algorithms are experimentally validated on synthetic examples and shape data of human corpora callosa.

1609.05258 2026-06-04 cs.RO cs.AI cs.CV cs.SY eess.SY 版本更新

The ACRV Picking Benchmark (APB): A Robotic Shelf Picking Benchmark to Foster Reproducible Research

ACRV 摘取基准 (APB):一个促进可重复研究的机器人货架摘取基准

Jürgen Leitner, Adam W. Tow, Jake E. Dean, Niko Suenderhauf, Joseph W. Durham, Matthew Cooper, Markus Eich, Christopher Lehnert, Ruben Mangels, Christopher McCool, Peter Kujala, Lachlan Nicholson, Trung Pham, James Sergeant, Liao Wu, Fangyi Zhang, Ben Upcroft, Peter Corke

AI总结 本文提出ACRV摘取基准(APB),通过42个常见物品、广泛可用的货架和精确的物品排列指南,提供可重复的机器人摘取基准,支持完整机器人系统的比较。

Comments 8 pages, submitted to RA:Letters

详情
AI中文摘要

机器人挑战如亚马逊摘取挑战(APC)或DARPA挑战是推动科学进步的重要方式。它们使研究在明确的基准上进行比较,所有参与者享有相同的测试条件。然而,此类挑战事件仅偶尔举行,参赛人数有限,且测试条件难以在主事件后复制。我们提出一个新的物理基准挑战:ACRV摘取基准(APB)。该基准设计为可重复,包含42个常见物品、广泛可用的货架和精确的物品排列指南。明确的评估协议使完整机器人系统(包括感知和操作)的比较成为可能,而不仅仅是子系统。本文还描述并报告了基于Baxter机器人开放基线系统的实验结果。

英文摘要

Robotic challenges like the Amazon Picking Challenge (APC) or the DARPA Challenges are an established and important way to drive scientific progress. They make research comparable on a well-defined benchmark with equal test conditions for all participants. However, such challenge events occur only occasionally, are limited to a small number of contestants, and the test conditions are very difficult to replicate after the main event. We present a new physical benchmark challenge for robotic picking: the ACRV Picking Benchmark (APB). Designed to be reproducible, it consists of a set of 42 common objects, a widely available shelf, and exact guidelines for object arrangement using stencils. A well-defined evaluation protocol enables the comparison of complete robotic systems, including perception and manipulation, instead of sub-systems only. Our paper also describes and reports results achieved by an open baseline system based on a Baxter robot.

1607.08481 2026-06-04 cs.CV cs.NA math.NA 版本更新

A Nonlocal Denoising Algorithm for Manifold-Valued Images Using Second Order Statistics

基于二阶统计的非局部去噪算法用于流形值图像

Friederike Laus, Mila Nikolova, Johannes Persch, Gabriele Steidl

AI总结 本文首次将非局部块方法推广到流形值图像,通过最小均方误差估计提出新的估计器,用于恢复流形值图像。

详情
AI中文摘要

非局部块方法,特别是Lebrun等人(2013)的贝叶斯方法,被认为是去噪(彩色)图像的有效方法,这些图像受到白高斯噪声影响。本文首次尝试将该技术推广到流形值图像。此类图像,例如具有相位或方向信息或值在对称正定矩阵流形上的图像,在现实应用中很常见。将正态分布推广到流形不是标准的,已有不同尝试。本文聚焦于一个直接的内在模型,并讨论特定流形的其他方法。我们将Lebrun等人的贝叶斯方法重新解释为最小均方误差估计,这促使我们定义相应的估计器。有了这个估计器,我们提出了一种非局部块方法用于恢复流形值图像。各种概念验证示例展示了所提算法的潜力。

英文摘要

Nonlocal patch-based methods, in particular the Bayes' approach of Lebrun, Buades and Morel (2013), are considered state-of-the-art methods for denoising (color) images corrupted by white Gaussian noise of moderate variance. This paper is the first attempt to generalize this technique to manifold-valued images. Such images, for example images with phase or directional entries or with values in the manifold of symmetric positive definite matrices, are frequently encountered in real-world applications. Generalizing the normal law to manifolds is not canonical and different attempts have been considered. Here we focus on a straightforward intrinsic model and discuss the relation to other approaches for specific manifolds. We reinterpret the Bayesian approach of Lebrun et al. (2013) in terms of minimum mean squared error estimation, which motivates our definition of a corresponding estimator on the manifold. With this estimator at hand we present a nonlocal patch-based method for the restoration of manifold-valued images. Various proof-of-concept examples demonstrate the potential of the proposed algorithm.

1612.00056 2026-06-04 math.NA cs.CV cs.NA math.GR 版本更新

Generalized Fourier-Bessel operator and almost-periodic interpolation and approximation

广义傅里叶-贝塞尔算子与近周期插值与逼近

Jean-Paul Gauthier, Dario Prandi

AI总结 本文研究了在频率平面中闭合于离散旋转的有限频率集上的函数评估、插值与逼近问题,提出了一种抽象分解定理以高效解决相关数值问题,结合SE(2,N)群的特殊结构。

Comments 15 pages, 2 figures

详情
AI中文摘要

我们考虑由有限频率集F上的三角函数定义的函数f,该集在频率平面中关于角度2kπ/M(M为整数)的旋转闭合。首先研究在空间平面上类似有限集E上的函数评估问题,其次研究通过E网格上的f来插值或逼近双变量函数g。为此,我们建立了评估函数的抽象分解定理,这是解决这些问题高效数值解的关键。该结果基于SE(2,N)群的特殊结构,该群是平面运动群SE(2)的子群,对应于离散旋转,是一个极大近周期群。尽管本文动机源于生物仿生图像重建和模式识别中的相关问题,但该主题也与经典问题如极坐标下的FFT、非均匀FFT以及一般三角多项式评估等有关。

英文摘要

We consider functions $f$ of two real variables, given as trigonometric functions over a finite set $F$ of frequencies. This set is assumed to be closed under rotations in the frequency plane of angle $\frac{2k\pi}{M}$ for some integer $M$. Firstly, we address the problem of evaluating these functions over a similar finite set $E$ in the space plane and, secondly, we address the problems of interpolating or approximating a function $g$ of two variables by such an $f$ over the grid $E$. In particular, for this aim, we establish an abstract factorization theorem for the evaluation function, which is a key point for an efficient numerical solution to these problems. This result is based on the very special structure of the group $SE(2,N)$, subgroup of the group $SE(2)$ of motions of the plane corresponding to discrete rotations, which is a maximally almost periodic group. Although the motivation of this paper comes from our previous works on biomimetic image reconstruction and pattern recognition, where these questions appear naturally, this topic is related to several classical problems: the FFT in polar coordinates, the Non Uniform FFT, the evaluation of general trigonometric polynomials, and so on.

1611.05947 2026-06-04 math.AG cs.CV cs.NA math.NA 版本更新

Minimal Problems for the Calibrated Trifocal Variety

校准三焦点代数簇的最小问题

Joe Kileel

AI总结 本文借助数值代数几何和Bertini软件,确定校准三焦点代数簇的最小问题的代数次数。

Comments 23 pages, 1 table

详情
AI中文摘要

我们确定了计算机视觉中校准三焦点代数簇的最小问题的代数次数。我们依赖于数值代数几何和同伦延拓软件Bertini。

英文摘要

We determine the algebraic degree of minimal problems for the calibrated trifocal variety in computer vision. We rely on numerical algebraic geometry and the homotopy continuation software Bertini.

1601.08201 2026-06-04 math.NA cs.CV cs.NA 版本更新

Spectrally Grouped Total Variation Reconstruction for Scatter Imaging Using ADMM

基于ADMM的谱分组总变分重建用于散射成像

Ikenna Odinaka, Yan Kaganovsky, Joel A. Greenberg, Mehadi Hassan, David G. Politte, Joseph A. O'Sullivan, Lawrence Carin, David J. Brady

AI总结 本文提出基于ADMM的谱分组总变分重建算法,通过改进的正则化方法提升散射成像的谱与空间质量,利用凸分解实现并行优化。

Comments Presented at IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC) 2015. 4 pages, 2 figures

详情
AI中文摘要

我们考虑X射线相干散射成像,目标是从多重散射测量中重建每个空间位置的动量转移分布(谱分布)。每种材料具有独特的动量转移分布(MTP),可用于区分不同材料。我们提出了一种基于泊松噪声模型的迭代图像重建算法,能够处理光子限制的测量以及数据的多种二次统计。为了提高图像质量,先前方法使用边缘保持正则化器以在空间域中促进分段常数图像,而每个谱bin分别处理。相反,我们提出谱分组正则化,促进空间方向上的分段常数图像,同时确保相邻空间bin的MTP相似,如果它们包含相同材料。我们证明这种分组正则化在谱和空间图像质量上都有提升。我们追求一种优化转移方法,利用凸分解将问题提升,使得所有超体素可以并行更新并以闭式形式处理。分组惩罚引入了挑战,因为它不直接适用于这些分解。我们使用交替方向乘子法(ADMM)将原问题替换为一个等效的子问题序列,这些子问题适用于凸分解,导致高度并行的算法。我们在真实数据上展示了性能。

英文摘要

We consider X-ray coherent scatter imaging, where the goal is to reconstruct momentum transfer profiles (spectral distributions) at each spatial location from multiplexed measurements of scatter. Each material is characterized by a unique momentum transfer profile (MTP) which can be used to discriminate between different materials. We propose an iterative image reconstruction algorithm based on a Poisson noise model that can account for photon-limited measurements as well as various second order statistics of the data. To improve image quality, previous approaches use edge-preserving regularizers to promote piecewise constancy of the image in the spatial domain while treating each spectral bin separately. Instead, we propose spectrally grouped regularization that promotes piecewise constant images along the spatial directions but also ensures that the MTPs of neighboring spatial bins are similar, if they contain the same material. We demonstrate that this group regularization results in improvement of both spectral and spatial image quality. We pursue an optimization transfer approach where convex decompositions are used to lift the problem such that all hyper-voxels can be updated in parallel and in closed-form. The group penalty introduces a challenge since it is not directly amenable to these decompositions. We use the alternating direction method of multipliers (ADMM) to replace the original problem with an equivalent sequence of sub-problems that are amenable to convex decompositions, leading to a highly parallel algorithm. We demonstrate the performance on real data.

1511.05261 2026-06-04 cs.CV cs.LG cs.NA math.NA stat.ML 版本更新

Robust PCA via Nonconvex Rank Approximation

通过非凸秩近似实现鲁棒PCA

Zhao Kang, Chong Peng, Qiang Cheng

AI总结 本文提出非凸秩近似方法,以改进鲁棒PCA中核范数的局限性,通过高效算法提升准确性和效率。

Comments IEEE International Conference on Data Mining

详情
AI中文摘要

在数据挖掘和机器学习中,许多应用需要恢复低秩矩阵。鲁棒主成分分析(RPCA)是处理此类问题的通用框架。RPCA中核范数作为秩函数的凸替代物被广泛研究。在某些假设下,它可以以高概率恢复底层低秩矩阵。然而,这些假设可能在实际应用中不成立。由于核范数通过将所有奇异值相加来近似秩,即本质上是奇异值的ℓ1范数,因此产生的近似误差并不 trivial,导致最终的矩阵估计器可能有显著偏差。为寻求更接近的近似并缓解核范数的上述限制,我们提出了一种非凸秩近似。这种对矩阵秩的近似比核范数更紧密。为了解决相关的非凸最小化问题,我们开发了高效的增广拉格朗日乘子优化算法。实验结果表明,我们的方法在准确性和效率上均优于当前最先进的算法。

英文摘要

Numerous applications in data mining and machine learning require recovering a matrix of minimal rank. Robust principal component analysis (RPCA) is a general framework for handling this kind of problem. The nuclear-norm-based convex surrogate of the rank function in RPCA has been widely investigated. Under certain assumptions, it can recover the underlying true low rank matrix with high probability. However, those assumptions may not hold in real-world applications. Since the nuclear norm approximates the rank by adding all singular values together, which is essentially an $\ell_1$-norm of the singular values, the resulting approximation error is not trivial and thus the resulting matrix estimator can be significantly biased. To seek a closer approximation and to alleviate the above-mentioned limitations of the nuclear norm, we propose a nonconvex rank approximation. This approximation to the matrix rank is tighter than the nuclear norm. To solve the associated nonconvex minimization problem, we develop an efficient augmented Lagrange multiplier based optimization algorithm. Experimental results demonstrate that our method outperforms current state-of-the-art algorithms in both accuracy and efficiency.
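The gap between the nuclear norm and the rank can be seen on a small example. The sketch below compares the two with one possible nonconvex surrogate, the sum of (1 + gamma) * sigma_i / (gamma + sigma_i) over the singular values. This particular formula and the value of `gamma` are assumptions made for illustration; the abstract does not state which surrogate the paper uses.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of singular values (the convex surrogate of the rank)."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

def gamma_surrogate(M, gamma=0.01):
    """Assumed nonconvex surrogate: each nonzero singular value contributes about 1."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum((1.0 + gamma) * s / (gamma + s)))

M = np.diag([10.0, 5.0, 0.0])          # rank-2 matrix with singular values 10, 5, 0
rank = int(np.linalg.matrix_rank(M))
```

For this matrix the nuclear norm scales with the magnitude of the singular values, whereas the surrogate stays close to the rank for small `gamma`. That scale sensitivity is the bias the abstract attributes to the nuclear norm.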

1610.06049 2026-06-04 math.NA cs.CV cs.NA 版本更新

Fast and Accurate Surface Normal Integration on Non-Rectangular Domains

非矩形域上快速且准确的表面法向量积分

Martin Bähr, Michael Breuß, Yvain Quéau, Ali Sharifi Boroujerdi, Jean-Denis Durou

AI总结 本文提出一种结合经典方法与现代技术的高效算法,用于非矩形域上的表面法向量积分,通过迭代Krylov子空间求解器和预条件处理提升精度和效率,实验证明其在计算机视觉中的有效性。

详情
AI中文摘要

在三维空间中计算表面形状时,表面法向量的积分是计算机视觉中的一个经典问题。然而,至今仍难以设计出一种方法,既能处理非平凡计算域,又具备高精度、鲁棒性和计算效率。本文结合经典方法与现代计算技术,构建了一个满足这些要求的求解器。基于Poisson积分模型,我们提出使用迭代Krylov子空间求解器作为核心步骤。尽管此类方法可能非常高效,但只有在结合合适的数值预条件处理和问题特定的初始化时,才能发挥其全部潜力。我们进行了详尽的数值研究,以确定适用于我们目的的预条件器。为了解决合适的初始化问题,我们提出通过最近开发的快速推进积分器计算初始状态。详细的数值实验展示了这种新组合的优势。此外,我们还在真实世界的光度立体数据集上展示了所开发的数值框架足够灵活,能够应对现代计算机视觉应用。

英文摘要

The integration of surface normals for the purpose of computing the shape of a surface in 3D space is a classic problem in computer vision. However, even nowadays it is still a challenging task to devise a method that combines the flexibility to work on non-trivial computational domains with high accuracy, robustness and computational efficiency. By uniting a classic approach for surface normal integration with modern computational techniques we construct a solver that fulfils these requirements. Building upon the Poisson integration model we propose to use an iterative Krylov subspace solver as a core step in tackling the task. While such a method can be very efficient, it may only show its full potential when combined with a suitable numerical preconditioning and a problem-specific initialisation. We perform a thorough numerical study in order to identify an appropriate preconditioner for our purpose. To address the issue of a suitable initialisation we propose to compute this initial state via a recently developed fast marching integrator. Detailed numerical experiments illuminate the benefits of this novel combination. In addition, we show on real-world photometric stereo datasets that the developed numerical framework is flexible enough to tackle modern computer vision applications.
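A rough sketch of the core step, Poisson-type integration solved with a Krylov method, is given below. It assumes a rectangular grid, forward differences, and a hand-written unpreconditioned conjugate gradient on the normal equations. The preconditioner, the fast-marching initialisation, and the non-rectangular domains studied in the paper are all omitted, and every function name is hypothetical.

```python
import numpy as np

def forward_difference(n):
    """(n-1, n) matrix mapping samples to forward differences."""
    return np.diff(np.eye(n), axis=0)

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Plain CG for a symmetric positive semi-definite, consistent system."""
    x = np.zeros_like(b)
    r = b.copy()
    d = r.copy()
    rs = r @ r
    b_norm = np.sqrt(b @ b)
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * b_norm:
            break
        Ad = matvec(d)
        alpha = rs / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x

def integrate_gradients(p, q):
    """Least-squares depth from p (H, W-1) horizontal and q (H-1, W) vertical differences."""
    H, W = q.shape[0] + 1, p.shape[1] + 1
    Dx = np.kron(np.eye(H), forward_difference(W))   # differences along each row
    Dy = np.kron(forward_difference(H), np.eye(W))   # differences along each column
    A = np.vstack([Dx, Dy])
    b = np.concatenate([p.ravel(), q.ravel()])
    # Normal equations A^T A z = A^T b form a discrete Poisson problem.
    z = conjugate_gradient(lambda v: A.T @ (A @ v), A.T @ b)
    return z.reshape(H, W)
```

The depth is recovered only up to an additive constant. Starting CG from zero returns the zero-mean solution, so comparisons should be made after subtracting the mean.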

1610.04973 2026-06-04 cs.NE cs.CV cs.SY eess.SY 版本更新

Multiple Instance Fuzzy Inference Neural Networks

多实例模糊推理神经网络

Amine Ben Khalifa, Hichem Frigui

AI总结 本文提出多实例模糊推理系统,用于处理多实例数据,通过扩展Sugeno推理方法,开发MI-ANFIS系统,用于地雷探测中的多算法融合。

Comments Submitted to IEEE Transactions On Cybernetics for review

详情
AI中文摘要

模糊逻辑是一种强大的工具,用于建模知识不确定性、测量不精确性和模糊性。然而,当数据具有多种表达形式时,会出现另一种模糊性,模糊逻辑无法很好地处理,多实例学习问题(MIL)正是这种情况。在MIL中,一个对象由多个实例组成的集合表示,称为一个袋。如果袋中的所有实例都是负的,则标记为负;如果至少有一个实例是正的,则标记为正。正袋编码了模糊性,因为实例本身未被标记。本文介绍了模糊推理系统和神经网络,用于处理实例袋作为输入,并能够从模糊标记的数据中学习。首先,我们介绍了多实例Sugeno风格模糊推理(MI-Sugeno),它扩展了标准Sugeno推理以处理多实例推理。其次,我们使用MI-Sugeno定义并开发多实例自适应神经模糊推理系统(MI-ANFIS)。我们扩展了标准ANFIS的架构以允许袋推理,并通过反向传播推导出学习算法以确定网络的前提和结论参数。所提出的推理系统通过合成数据集和适用于MIL问题的基准数据集进行测试和验证。我们还应用所提出的MI-ANFIS来融合多个判别算法的输出,以用于基于探地雷达的地雷探测。

英文摘要

Fuzzy logic is a powerful tool to model knowledge uncertainty, measurement imprecision, and vagueness. However, there is another type of vagueness that arises when data have multiple forms of expression that fuzzy logic does not address quite well. This is the case for multiple instance learning problems (MIL). In MIL, an object is represented by a collection of instances, called a bag. A bag is labeled negative if all of its instances are negative, and positive if at least one of its instances is positive. Positive bags encode ambiguity since the instances themselves are not labeled. In this paper, we introduce fuzzy inference systems and neural networks designed to handle bags of instances as input and capable of learning from ambiguously labeled data. First, we introduce the Multiple Instance Sugeno style fuzzy inference (MI-Sugeno) that extends the standard Sugeno style inference to handle reasoning with multiple instances. Second, we use MI-Sugeno to define and develop Multiple Instance Adaptive Neuro Fuzzy Inference System (MI-ANFIS). We expand the architecture of the standard ANFIS to allow reasoning with bags and derive a learning algorithm using backpropagation to identify the premise and consequent parameters of the network. The proposed inference system is tested and validated using synthetic and benchmark datasets suitable for MIL problems. We also apply the proposed MI-ANFIS to fuse the output of multiple discrimination algorithms for the purpose of landmine detection using Ground Penetrating Radar.
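The bag-labeling rule stated above is small enough to write down directly. The sketch below encodes it, together with a max-based soft score that is a common relaxation in the MIL literature. The soft score is an assumption added for illustration, not the MI-Sugeno aggregation itself.

```python
def bag_label(instance_labels):
    """A bag is positive (1) if at least one instance is positive, else negative (0)."""
    return int(any(bool(label) for label in instance_labels))

def bag_score(instance_scores):
    """Assumed soft variant: the bag is as positive as its most positive instance."""
    return max(instance_scores)
```

A learner sees only the bag label during training, so in a positive bag it cannot tell which instance triggered the label. This is the ambiguity the abstract refers to.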

1509.01404 2026-06-04 math.NA cs.CV cs.LG cs.NA math.OC stat.ML 版本更新

Coordinate Descent Methods for Symmetric Nonnegative Matrix Factorization

对称非负矩阵分解的坐标下降方法

Arnaud Vandaele, Nicolas Gillis, Qi Lei, Kai Zhong, Inderjit Dhillon

AI总结 本文提出高效的坐标下降方法用于对称非负矩阵分解,适用于大规模稀疏矩阵,通过实验证明其在合成和实际数据集上的有效性。

Comments 25 pages, 5 figures, 7 tables. Main changes: comparison with another symNMF algorithm (namely, BetaSNMF), and correction of an error in the convergence proof

详情
Journal ref
IEEE Transactions on Signal Processing 64 (21), pp. 5571-5584, 2016
AI中文摘要

给定一个对称非负矩阵A,对称非负矩阵分解(symNMF)是寻找一个非负矩阵H,其列数通常远少于A,使得A≈HH^T。SymNMF可用于数据分析,特别是各种聚类任务。本文提出简单且高效的坐标下降方案来解决该问题,能够处理大规模稀疏输入矩阵。合成和实际数据集上的实验展示了所提方法的有效性,并表明其表现优于近期的最先进方法。

英文摘要

Given a symmetric nonnegative matrix $A$, symmetric nonnegative matrix factorization (symNMF) is the problem of finding a nonnegative matrix $H$, usually with far fewer columns than $A$, such that $A \approx HH^T$. SymNMF can be used for data analysis and in particular for various clustering tasks. In this paper, we propose simple and very efficient coordinate descent schemes to solve this problem, which can handle large and sparse input matrices. The effectiveness of our methods is illustrated on synthetic and real-world data sets, and we show that they perform favorably compared to recent state-of-the-art methods.
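One way to picture coordinate descent for symNMF: with all other entries fixed, the objective $\|A - HH^T\|_F^2$ is a quartic polynomial in a single entry of $H$, so each update reduces to minimizing a quartic over a half-line. The dense sketch below follows that idea under stated assumptions. The cyclic update order, the random initialisation, and the use of `numpy.roots` are illustrative choices, and none of the sparsity handling from the paper is included.

```python
import numpy as np

def quartic(t, c3, c2, c1):
    """Change in squared error when one entry of H moves by t."""
    return t ** 4 + c3 * t ** 3 + c2 * t ** 2 + c1 * t

def symnmf_cd(A, r, sweeps=50, seed=0):
    """Cyclic coordinate descent sketch for min ||A - H H^T||_F^2 with H >= 0 (A symmetric)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    H = rng.random((n, r))
    R = A - H @ H.T                               # residual, kept in sync with H
    for _ in range(sweeps):
        for k in range(r):
            for i in range(n):
                h = H[:, k].copy()
                hik = h[i]
                c3 = 4.0 * hik
                c2 = 2.0 * (h @ h) + 2.0 * hik ** 2 - 2.0 * R[i, i]
                c1 = -4.0 * (R[i, :] @ h)
                # Stationary points of the quartic are roots of its cubic derivative.
                roots = np.roots([4.0, 3.0 * c3, 2.0 * c2, c1])
                candidates = [0.0, -hik]          # no move, or clip the entry to zero
                candidates += [z.real for z in roots
                               if abs(z.imag) < 1e-9 and z.real >= -hik]
                t = min(candidates, key=lambda s: quartic(s, c3, c2, c1))
                H[i, k] = hik + t
                # Rank-structured residual update for the change H[i, k] += t.
                R[i, :] -= t * h
                R[:, i] -= t * h
                R[i, i] -= t * t
    return H
```

Because a step of zero is always among the candidates, the error cannot increase from one update to the next. This gives monotone descent, not a guarantee of reaching a global minimum.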

1604.06665 2026-06-04 math.NA cs.CV cs.NA math.SP 版本更新

Multiscale Segmentation via Bregman Distances and Nonlinear Spectral Analysis

基于Bregman距离和非线性谱分析的多尺度分割

Leonie Zeune, Guus van Dalum, Leon W. M. M. Terstappen, S. A. van Gils, Christoph Brune

AI总结 本文提出一种高效的多尺度分割方法,结合非线性分割与尺度空间和谱分解,通过Bregman距离改进变分分割模型,实现自适应正则化参数选择,用于生物医学图像分割。

详情
AI中文摘要

在生物医学成像中,可靠地分割对象(例如从小细胞到大器官)对自动化医疗诊断至关重要。新的多尺度分割方法可以显著提高在自然强度、大小和形状变化情况下的性能。本文旨在基于形状轮廓分割感兴趣对象,并自动发现不同尺度的多个对象。总体策略是结合非线性分割与尺度空间和最近文献中引入的谱分解。为此,我们推广了一个基于总变分的变分分割模型,使用Bregman距离构造逆尺度空间。这使得新模型可以通过基于总变分的谱分解的尺度分析方法来完成。结果我们获得了一种非常高效、(几乎)无参数的多尺度分割方法,附带自适应正则化参数选择。该方法的附加好处通过系统合成测试和新的生物医学工具箱中识别和分类循环肿瘤细胞的应用得到验证。由于非线性扩散的本质,本文的数学概念为非局部分类问题提供了有前景的扩展。

英文摘要

In biomedical imaging, reliable segmentation of objects (e.g. from small cells up to large organs) is of fundamental importance for automated medical diagnosis. New approaches for multi-scale segmentation can considerably improve performance in case of natural variations in intensity, size and shape. This paper aims at segmenting objects of interest based on shape contours and automatically finding multiple objects with different scales. The overall strategy of this work is to combine nonlinear segmentation with scale spaces and spectral decompositions recently introduced in the literature. For this we generalize a variational segmentation model based on total variation using Bregman distances to construct an inverse scale space. This allows the new model to be accompanied by a scale analysis approach based on a spectral decomposition of the total variation. As a result we obtain a very efficient, (nearly) parameter-free multiscale segmentation method that comes with an adaptive regularization parameter choice. The added benefit of our method is demonstrated by systematic synthetic tests and its usage in a new biomedical toolbox for identifying and classifying circulating tumor cells. Due to the underlying nonlinear diffusion, the mathematical concepts in this work offer promising extensions to nonlocal classification problems.

1609.08438 2026-06-04 cs.CV cs.NA math.NA 版本更新

Flows Generating Nonlinear Eigenfunctions

生成非线性本征函数的流

Raz Z. Nossek, Guy Gilboa

AI总结 本文提出一种生成非线性本征函数的流,通过非线性算子与特征值的关系,探讨了正则化函数的理论,并引入正向与反向流以研究非线性本征函数的空间。

详情
AI中文摘要

非线性变分方法已成为许多图像处理任务中非常强大的工具。最近出现了一种新研究方向,探讨由凸函数引发的非线性本征函数。这为凸正则化提供了新的见解和更好的理论理解,并引入了新的处理方法。然而,非线性本征值问题的理论仍处于初级阶段。本文提出一种新的流,可以生成形式为T(u)=λu的非线性本征函数,其中T(u)是非线性算子,λ∈R为特征值。我们发展了T(u)是正则化一阶齐次函数(如总变分TV或总广义变分TGV)的理论。我们引入了正向流和反向流;它们的稳态解是非线性本征函数。正向流单调地平滑解(相对于正则化器)并同时增加L²范数。反向流具有相反的特性。对于这两种流,稳态解依赖于初始条件,因此不同的初始条件产生不同的本征函数。这使我们能够深入研究非线性本征函数的空间,允许生成数值上多样的例子,可能尚未知。此外,我们还提出一个指标来衡量函数与本征函数的亲和力,并将其与线性情况下的伪本征函数联系起来。

英文摘要

Nonlinear variational methods have become very powerful tools for many image processing tasks. Recently a new line of research has emerged, dealing with nonlinear eigenfunctions induced by convex functionals. This has provided new insights and better theoretical understanding of convex regularization and introduced new processing methods. However, the theory of nonlinear eigenvalue problems is still in its infancy. We present a new flow that can generate nonlinear eigenfunctions of the form $T(u)=\lambda u$, where $T(u)$ is a nonlinear operator and $\lambda \in \mathbb{R}$ is the eigenvalue. We develop the theory where $T(u)$ is a subgradient element of a regularizing one-homogeneous functional, such as total-variation (TV) or total-generalized-variation (TGV). We introduce two flows: a forward flow and an inverse flow; for which the steady state solution is a nonlinear eigenfunction. The forward flow monotonically smooths the solution (with respect to the regularizer) and simultaneously increases the $L^2$ norm. The inverse flow has the opposite characteristics. For both flows, the steady state depends on the initial condition, thus different initial conditions yield different eigenfunctions. This enables a deeper investigation into the space of nonlinear eigenfunctions, allowing us to produce numerically diverse examples, which may not yet be known. In addition we suggest an indicator to measure the affinity of a function to an eigenfunction and relate it to pseudo-eigenfunctions in the linear case.

1510.07573 2026-06-04 cs.RO cs.CV cs.MA cs.SY eess.SY 版本更新

Generalized Regressive Motion: a Visual Cue to Collision

广义回归运动:碰撞的视觉线索

Krzysztof Chalupka, Michael Dickinson, Pietro Perona

AI总结 研究提出广义回归运动作为碰撞检测的视觉线索,通过几何分析证明其在同类碰撞中的可靠性,并通过基于代理的建模显示其比 looming 更有效。

详情
AI中文摘要

大脑和感觉系统进化以指导运动。关键任务是控制对静止障碍物的接近并检测移动生物。Looming 被提出为主要的单目视觉线索,用于检测其他动物的接近并避免与静止障碍物碰撞。在昆虫和脊椎动物大脑中发现了优雅的神经机制用于 looming 检测。然而,looming 未在两个移动动物碰撞的背景下进行分析。我们提出了一种替代策略,即广义回归运动(GRM),这与最近观察到的果蝇行为一致。几何分析证明 GRM 是同类碰撞的可靠线索,而基于代理的建模表明 GRM 比 looming 更有效用于检测接近、防止碰撞和维持移动性。

英文摘要

Brains and sensory systems evolved to guide motion. Central to this task is controlling the approach to stationary obstacles and detecting moving organisms. Looming has been proposed as the main monocular visual cue for detecting the approach of other animals and avoiding collisions with stationary obstacles. Elegant neural mechanisms for looming detection have been found in the brain of insects and vertebrates. However, looming has not been analyzed in the context of collisions between two moving animals. We propose an alternative strategy, Generalized Regressive Motion (GRM), which is consistent with recently observed behavior in fruit flies. Geometric analysis proves that GRM is a reliable cue to collision among conspecifics, whereas agent-based modeling suggests that GRM is a better cue than looming as a means to detect approach, prevent collisions and maintain mobility.

1609.05483 2026-06-04 eess.SY cs.CV cs.RO cs.SY 版本更新

Set-Point Regulation of Linear Continuous-Time Systems using Neuromorphic Vision Sensors

利用神经形态视觉传感器进行线性连续时间系统的设定点调节

Prince Singh, Sze Zheng Yong, Emilio Frazzoli

AI总结 本文提出基于神经形态视觉传感器的H∞控制器,用于调节线性时不变系统的设定点,并在不稳定系统上验证了方法的有效性。

Comments Submitted to IEEE Transactions on Automatic Control

详情
AI中文摘要

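近年来发展出的神经形态视觉传感器因其高时间分辨率和低延迟成为敏捷和自主机器人应用的有前景候选者。该传感器的每个像素在检测到光场变化时独立地发出异步的“视网膜事件”流。现有计算机视觉算法只能处理周期性帧,因此需要开发一类能够高效处理这些事件以完成控制任务的新算法。本文研究利用神经形态传感器的测量将连续时间线性时不变(LTI)系统调节到期望点的问题。我们提出一种H∞控制器,将LTI系统调节到期望设定点,并给出对于给定系统能够完成调节任务的基于神经形态传感器的相机集合。该方法的有效性在一个不稳定系统上得到了验证。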

英文摘要

Recently developed neuromorphic vision sensors have become promising candidates for agile and autonomous robotic applications, primarily due to their high temporal resolution and low latency. Each pixel of this sensor independently fires an asynchronous stream of "retinal events" once a change in the light field is detected. Existing computer vision algorithms can only process periodic frames and so a new class of algorithms needs to be developed that can efficiently process these events for control tasks. In this paper, we investigate the problem of regulating a continuous-time linear time invariant (LTI) system to a desired point using measurements from a neuromorphic sensor. We present an $H_\infty$ controller that regulates the LTI system to a desired set-point and provide the set of neuromorphic sensor based cameras for the given system that fulfill the regulation task. The effectiveness of our approach is illustrated on an unstable system.

1609.05434 2026-06-04 math.NA cs.CV cs.NA 版本更新

Consistent Discretization and Minimization of the L1 Norm on Manifolds

一致离散化和流形上L1范数的最小化

Alex Bronstein, Yoni Choukroun, Ron Kimmel, Matan Sela

AI总结 本文探讨了流形上L1范数的离散化问题,指出采样敏感性,并提出基于迭代加权l2范数的替代方法,应用于压缩模式问题,通过简单特征分解提升稳定性与准确性。

详情
AI中文摘要

L1范数因其稀疏性优势在过去二十年中广泛应用于信号和图像处理。最近,其在非欧几里得领域扩展被用于形状分析。例如,与狄利克雷能量最小化结合,可生成紧凑支持的准谐正交基,称为压缩流形模式。连续流形上的L1范数常被采样函数的向量l1范数替代,但本文指出该方法不一致离散化且对采样敏感。提出两种替代离散化方法,产生迭代加权l2范数。在压缩模式问题中,该策略简化为一系列简单特征分解问题,无需非凸优化,产生更稳定准确的结果。

英文摘要

The L1 norm has been tremendously popular in signal and image processing in the past two decades due to its sparsity-promoting properties. More recently, its generalization to non-Euclidean domains has been found useful in shape analysis applications. For example, in conjunction with the minimization of the Dirichlet energy, it was shown to produce a compactly supported quasi-harmonic orthonormal basis, dubbed compressed manifold modes. The continuous L1 norm on the manifold is often replaced by the vector l1 norm applied to sampled functions. We show that such an approach is incorrect in the sense that it does not consistently discretize the continuous norm and warn against its sensitivity to the specific sampling. We propose two alternative discretizations resulting in an iteratively reweighted l2 norm. We demonstrate the proposed strategy on the compressed modes problem, which reduces to a sequence of simple eigendecomposition problems not requiring non-convex optimization on Stiefel manifolds and producing more stable and accurate results.
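The sampling sensitivity can be reproduced in a toy setting. The sketch below replaces the manifold with the interval [0, 1] and compares the plain vector l1 norm of the samples with a length-weighted sum. The trapezoid-style weights are an assumption chosen for illustration; they are not the iteratively reweighted l2 discretizations proposed in the paper.

```python
import numpy as np

def plain_l1(values):
    """Vector l1 norm of the samples, ignoring where they were taken."""
    return float(np.sum(np.abs(values)))

def weighted_l1(x, values):
    """Sum of |f(x_i)| weighted by the length of the cell around each sample."""
    x = np.asarray(x, dtype=float)
    w = np.empty_like(x)
    w[0] = (x[1] - x[0]) / 2.0
    w[-1] = (x[-1] - x[-2]) / 2.0
    w[1:-1] = (x[2:] - x[:-2]) / 2.0
    return float(np.sum(w * np.abs(values)))

# Two samplings of f(x) = x on [0, 1]; the continuous L1 norm is 0.5.
coarse = np.array([0.0, 0.5, 1.0])
uneven = np.array([0.0, 0.1, 0.2, 0.3, 0.9, 1.0])
```

The plain sum changes with the number and placement of samples, while the weighted sum stays at the continuous value for both samplings. On a triangle mesh, per-vertex area elements would play the role of the cell lengths.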

1609.04167 2026-06-04 math.NA cs.CV cs.IT cs.LG cs.NA math.IT math.OC 版本更新

Proceedings of the third "international Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST'16)

第三届“国际稀疏模型与技术相互作用研讨会”(iTWIST'16)会议论文集

V. Abrol, O. Absil, P. -A. Absil, S. Anthoine, P. Antoine, T. Arildsen, N. Bertin, F. Bleichrodt, J. Bobin, A. Bol, A. Bonnefoy, F. Caltagirone, V. Cambareri, C. Chenot, V. Crnojević, M. Daňková, K. Degraux, J. Eisert, J. M. Fadili, M. Gabrié, N. Gac, D. Giacobello, A. Gonzalez, C. A. Gomez Gonzalez, A. González, P. -Y. Gousenbourger, M. Græsbøll Christensen, R. Gribonval, S. Guérit, S. Huang, P. Irofti, L. Jacques, U. S. Kamilov, S. Kiticć, M. Kliesch, F. Krzakala, J. A. Lee, W. Liao, T. Lindstrøm Jensen, A. Manoel, H. Mansour, A. Mohammad-Djafari, A. Moshtaghpour, F. Ngolè, B. Pairet, M. Panić, G. Peyré, A. Pižurica, P. Rajmic, M. Roblin, I. Roth, A. K. Sao, P. Sharma, J. -L. Starck, E. W. Tramel, T. van Waterschoot, D. Vukobratovic, L. Wang, B. Wirth, G. Wunder, H. Zhang

AI总结 本文探讨了稀疏模型与技术的相互作用,涵盖数据传感、非凸逆问题、概率推断、机器学习等领域,通过演讲和讨论促进国际科研合作。

Comments 69 pages, 22 extended abstracts, iTWIST'16 website: http://www.itwist16.es.aau.dk

详情
AI中文摘要

第三届“国际稀疏模型与技术相互作用研讨会”(iTWIST'16)于2016年8月24日至26日在丹麦第四大城市阿勒堡举行。该研讨会旨在通过具体的口头/海报展示和自由讨论促进国际科研团队的合作。本届研讨会汇集了约50位国际参与者,包含8场特邀讲座、12场口头报告和12个海报,主题涵盖稀疏范式的理论、应用和推广,包括由稀疏驱动的数据传感与处理(如光学、计算机视觉、基因组学、生物医学、数字通信、信道估计、天文学);稀疏模型在非凸/非线性逆问题中的应用(如相位恢复、盲去卷积、自校准);近似概率推断用于稀疏问题;稀疏机器学习与推断;“盲”逆问题与字典学习;稀疏建模的优化;信息论、几何与随机性;稀疏?未来是什么(离散值信号;低维空间的并集、共稀疏性、混合/组范数、基于模型的、低复杂度模型等);矩阵/流形传感与处理(图、低秩近似等);数值方法/优化中的复杂性与精度权衡;电子/光学压缩传感器(硬件)。

英文摘要

The third edition of the "international - Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST) took place in Aalborg, the 4th largest city in Denmark situated beautifully in the northern part of the country, from the 24th to 26th of August 2016. The workshop venue was at the Aalborg University campus. One implicit objective of this biennial workshop is to foster collaboration between international scientific teams by disseminating ideas through both specific oral/poster presentations and free discussions. For this third edition, iTWIST'16 gathered about 50 international participants and features 8 invited talks, 12 oral presentations, and 12 posters on the following themes, all related to the theory, application and generalization of the "sparsity paradigm": Sparsity-driven data sensing and processing (e.g., optics, computer vision, genomics, biomedical, digital communication, channel estimation, astronomy); Application of sparse models in non-convex/non-linear inverse problems (e.g., phase retrieval, blind deconvolution, self calibration); Approximate probabilistic inference for sparse problems; Sparse machine learning and inference; "Blind" inverse problems and dictionary learning; Optimization for sparse modelling; Information theory, geometry and randomness; Sparsity? What's next? (Discrete-valued signals; Union of low-dimensional spaces, Cosparsity, mixed/group norm, model-based, low-complexity models, ...); Matrix/manifold sensing/processing (graph, low-rank approximation, ...); Complexity/accuracy tradeoffs in numerical methods/optimization; Electronic/optical compressive sensors (hardware).

1608.06440 2026-06-04 cs.RO cs.CV cs.SY eess.SY 版本更新

A Delay-Tolerant Potential-Field-Based Network Implementation of an Integrated Navigation System

基于延迟容忍的势场网络的集成导航系统实现

Rachana Ashok Gupta, Ahmad A. Masoud, Mo-Yuen Chow

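AI summary: This paper examines the practical design of a network controller that uses the Internet to network a webcam, an unmanned ground vehicle (UGV), and a remote computer server in real time, aiming to deskill UGV navigation while maintaining robust performance.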

详情
Journal ref
The IEEE Transactions On Industrial Electronics, Vol. 57, No.2, February 2010, PP. 769-783
AI中文摘要

网络控制器(NCs)是能够将动态、空间扩展且功能专门化的模块转化为可执行目标导向组的设备,称为网络控制系统。本文探讨了设计和构建使用互联网作为通信介质的NC的实践方面。重点在于寻找兼容的控制器组件,这些组件可通过主机结构集成,使摄像头、无人地面车(UGV)、远程计算机服务器及必要的操作软件界面能够实时联网。目标是简化UGV的导航过程,同时保持系统性能的鲁棒性。本文描述了所提控制器的结构、其组件及其接口方式。提供了详尽的实验结果,包括性能评估和与之前实现的NC的比较。

英文摘要

Network controllers (NCs) are devices that are capable of converting dynamic, spatially extended, and functionally specialized modules into a taskable goal-oriented group called a networked control system. This paper examines the practical aspects of designing and building an NC that uses the Internet as a communication medium. It focuses on finding compatible controller components that can be integrated via a host structure in a manner that makes it possible to network, in real time, a webcam, an unmanned ground vehicle (UGV), and a remote computer server along with the necessary operator software interface. The aim is to deskill the UGV navigation process and yet maintain robust performance. The structure of the suggested controller, its components, and the manner in which they are interfaced are described. Thorough experimental results along with performance assessment and comparisons to a previously implemented NC are provided.

1608.02165 2026-06-04 cs.CV cs.AI cs.NA math.NA math.OC 版本更新

ShapeFit and ShapeKick for Robust, Scalable Structure from Motion

形状拟合与形状踢:用于鲁棒、可扩展的结构从运动

Thomas Goldstein, Paul Hand, Choongbum Lee, Vladislav Voroninski, Stefano Soatto

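AI summary: This paper introduces a method for recovering locations from pairwise directions using an efficient convex program that tolerates adversarial outliers, and demonstrates its performance and flexibility on collections of images of real scenes and on simulated data.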

详情
AI中文摘要

我们介绍了一种新的方法,用于从成对方向中恢复位置,该方法利用了一个高效的凸优化程序,具有精确恢复保证,即使在存在对抗性异常值的情况下也能有效工作。当成对方向代表视图之间的缩放相对位置(例如通过视差几何估计)时,我们的方法可用于位置恢复,即确定相对姿态,仅需一个未知的标度因子。对于此任务,我们的方法性能与最先进的方法相当,但速度提高了数量级。我们提出的方法具有灵活性,可以适应其他位置恢复方法,并可用于加速其他方法。这些特性通过在13个大型不规则图像集合以及具有真实场景和模拟数据的地面真实数据上广泛测试来验证。

英文摘要

We introduce a new method for location recovery from pairwise directions that leverages an efficient convex program that comes with exact recovery guarantees, even in the presence of adversarial outliers. When pairwise directions represent scaled relative positions between pairs of views (estimated, for instance, with epipolar geometry), our method can be used for location recovery, that is, the determination of relative pose up to a single unknown scale. For this task, our method yields performance comparable to the state-of-the-art with an order of magnitude speed-up. Our proposed numerical framework is flexible in that it accommodates other approaches to location recovery and can be used to speed up other methods. These properties are demonstrated by extensively testing against state-of-the-art methods for location recovery on 13 large, irregular collections of images of real scenes in addition to simulated data with ground truth.

1608.01372 2026-06-04 cs.CV cs.NA math.NA 版本更新

Permutation NMF

排列非负矩阵分解

Giovanni Barbarino

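AI summary: This paper adds translation invariance to classical NMF so that common features can be detected even when they are shifted across different images.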

详情
AI中文摘要

非负矩阵分解(NMF)是一种常用于机器学习中提取数据特征(如文本文档和图像)的技术,得益于其自然的聚类特性。特别是在图像处理中,它能够分解多个图像并识别相同位置上的共同部分。本文的目标是提出一种在经典NMF中引入平移不变性的方法,即所提出的算法能够检测不同原始图像中位移后的共同特征。

英文摘要

Nonnegative Matrix Factorization (NMF) is a commonly used technique in machine learning to extract features out of data such as text documents and images, thanks to its natural clustering properties. In particular, it is popular in image processing since it can decompose several pictures and recognize common parts if they are located in the same position across the photos. This paper's aim is to present a way to add translation invariance to the classical NMF, that is, the algorithms presented are able to detect common features in different original images even when they are shifted.
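For orientation, the classical NMF problem that the abstract starts from can be written as below. This is the standard textbook formulation with the well-known multiplicative updates of Lee and Seung, not the permutation-invariant variant proposed in the paper. The symbols $V$, $W$, $H$ and the rank $r$ are notation assumed here.

```latex
% Classical NMF: approximate a nonnegative data matrix V by a product W H.
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\,\lVert V - W H \rVert_F^2,
\qquad V \in \mathbb{R}_{\ge 0}^{m \times n},\quad
W \in \mathbb{R}_{\ge 0}^{m \times r},\quad
H \in \mathbb{R}_{\ge 0}^{r \times n}.

% Multiplicative updates (element-wise product \odot, element-wise division):
H \leftarrow H \odot \frac{W^{\top} V}{W^{\top} W H},
\qquad
W \leftarrow W \odot \frac{V H^{\top}}{W H H^{\top}}.
```

When each column of $V$ is a vectorized image, a column of $W$ is a part that must sit at a fixed pixel position in every image. That is the limitation the abstract describes.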

1502.00592 2026-06-04 stat.ME cs.CV cs.MM cs.NA math.NA stat.AP 版本更新

A Class of DCT Approximations Based on the Feig-Winograd Algorithm

基于Feig-Winograd算法的一类DCT近似方法

C. J. Tablada, F. M. Bayer, R. J. Cintra

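AI summary: This paper proposes a parametrized class of matrices based on the Feig-Winograd factorization of the 8-point DCT and uses multicriteria optimization to find new DCT approximations with low multiplierless computational complexity, orthogonality or near orthogonality, low-complexity invertibility, and performance close to the exact DCT.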

Comments 26 pages, 4 figures, 5 tables, fixed arithmetic complexity in Table IV

详情
Journal ref
Signal Processing, vol. 113, pp. 38-51, August 2015
AI中文摘要

本文提出了一种基于Feig-Winograd因子化8点DCT的参数化矩阵类。此类参数化诱导出一个矩阵子空间,统一了多种现有的DCT近似方法。通过求解一个综合的多目标优化问题,我们识别出几种新的DCT近似方法。获得的解旨在具备以下特性:(i) 低乘法无计算复杂度,(ii) 正交或近正交性,(iii) 低复杂度可逆性,(iv) 接近精确DCT的接近性和性能。所提出的方法在接近DCT、编码性能及图像压缩适用性方面进行了评估。考虑到帕累托效率,某些新提出的近似方法可能在文献中已有的各种现有方法上表现更优。

英文摘要

A new class of matrices based on a parametrization of the Feig-Winograd factorization of 8-point DCT is proposed. Such parametrization induces a matrix subspace, which unifies a number of existing methods for DCT approximation. By solving a comprehensive multicriteria optimization problem, we identified several new DCT approximations. Obtained solutions were sought to possess the following properties: (i) low multiplierless computational complexity, (ii) orthogonality or near orthogonality, (iii) low complexity invertibility, and (iv) close proximity and performance to the exact DCT. Proposed approximations were submitted to assessment in terms of proximity to the DCT, coding performance, and suitability for image compression. Considering Pareto efficiency, particular new proposed approximations could outperform various existing methods archived in literature.
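As a rough illustration of what "multiplierless" and "proximity to the exact DCT" refer to, the sketch below builds the exact orthonormal 8-point DCT-II matrix and a crude low-complexity stand-in obtained by scaling and rounding. The rounding step and the helper names are assumptions made for illustration only; this is not the Feig-Winograd parametrization or any approximation proposed in the paper.

```python
import math

def dct_matrix(n=8):
    """Exact orthonormal DCT-II matrix: C[k][j] = c_k * cos((2j + 1) k pi / (2n))."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows

def gram(m):
    """Return m * m^T as a nested list (identity for an orthogonal matrix)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]

def rounded_approximation(c, factor=2.0):
    """Illustrative only: scale and round so that every entry is -1, 0, or 1."""
    return [[int(round(factor * x)) for x in row] for row in c]

C = dct_matrix(8)               # exact transform, needs real multiplications
T = rounded_approximation(C)    # entries in {-1, 0, 1}: additions/subtractions only
```

A matrix whose entries are all -1, 0, or 1 can be applied with additions and subtractions alone. The cost of that saving is a deviation from the exact DCT, and criteria (ii) to (iv) in the abstract concern that trade-off.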

1607.03255 2026-06-04 math.NA cs.CV cs.NA math.OC 版本更新

A Variational Model for Joint Motion Estimation and Image Reconstruction

一种联合运动估计和图像重建的变分模型

Martin Burger, Hendrik Dirks, Carola-Bibiane Schönlieb

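AI summary: This paper proposes a variational model for jointly estimating motion and reconstructing image sequences, based on a time-continuous Eulerian motion model. The analysis focuses on the brightness constancy equation for robust motion estimation and proves the existence of a minimizer in a suitable function space setting.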

详情
AI中文摘要

本文旨在推导并分析一种用于图像序列联合运动估计和重建的变分模型,该模型基于连续时间欧拉运动模型。该模型可通过连续性方程或亮度恒定方程建立。本文重点分析后者以实现二维图像序列的鲁棒运动估计。我们严格证明了在合适函数空间设置下解的存在性。此外,我们讨论了基于对偶算法的模型数值解法,并探讨了多个示例。最后,展示了本文模型与现有技术,如顺序图像重建和运动估计相比的优势。

英文摘要

The aim of this paper is to derive and analyze a variational model for the joint estimation of motion and reconstruction of image sequences, which is based on a time-continuous Eulerian motion model. The model can be set up in terms of the continuity equation or the brightness constancy equation. The analysis in this paper focuses on the latter for robust motion estimation on sequences of two-dimensional images. We rigorously prove the existence of a minimizer in a suitable function space setting. Moreover, we discuss the numerical solution of the model based on primal-dual algorithms and investigate several examples. Finally, the benefits of our model compared to existing techniques, such as sequential image reconstruction and motion estimation, are shown.
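The two motion constraints named in the abstract have standard forms, written below for an image sequence $u(x,t)$ and a velocity field $v(x,t)$. The joint objective that follows is only a generic sketch: the operator $K$, the data $f$, the regularizers $R$ and $S$, and the weights $\alpha$, $\beta$ are placeholder notation assumed here, not necessarily the exact functional analyzed in the paper.

```latex
% Continuity equation (mass conservation):
\partial_t u + \nabla \cdot (u\, v) = 0

% Brightness constancy equation (intensity is constant along trajectories):
\partial_t u + v \cdot \nabla u = 0

% Generic joint reconstruction / motion estimation sketch (assumed notation):
\min_{u,\, v} \int_0^T \tfrac{1}{2}\,\lVert K u(\cdot,t) - f(\cdot,t) \rVert_2^2
  + \alpha\, R\big(u(\cdot,t)\big) + \beta\, S\big(v(\cdot,t)\big)\, \mathrm{d}t
\quad \text{s.t.} \quad \partial_t u + v \cdot \nabla u = 0 .
```

Since $\nabla \cdot (u v) = v \cdot \nabla u + u\, \nabla \cdot v$, the two constraints coincide whenever the velocity field is divergence-free.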

1602.07038 2026-06-04 cs.GR cs.CV cs.NA math.NA 版本更新

Computer Aided Restoration of Handwritten Character Strokes

计算机辅助修复手写字符笔画

Barak Sober, David Levin

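AI summary: This paper proposes a variational approach to restoring incomplete characters in highly noisy documents, performing gradient descent steps on a cubic spline representation while maintaining interpolation at manually sampled points. It was used to restore approximately 1000 ancient Hebrew characters.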

Comments 11 pages, 17 figures

详情
AI中文摘要

本文提出了一种新的变分方法,用于计算机辅助修复高度噪声文档中的不完整字符。我们将字符笔画建模为具有变化半径的笔的运动。随后,使用立方样条表示进行梯度下降步骤,同时在某些初始(手动采样)点保持插值。所提出算法用于修复约1000个古代希伯来文字符(约公元前8-7世纪),部分结果在此展示,表明该算法在退化文档上应用时能产生合理的结果。

英文摘要

This work suggests a new variational approach to the task of computer-aided restoration of incomplete characters residing in a highly noisy document. We model character strokes as the movement of a pen with a varying radius. Following this model, a cubic spline representation is utilized to perform gradient descent steps, while maintaining interpolation at some initial (manually sampled) points. The proposed algorithm was utilized in the process of restoring approximately 1000 ancient Hebrew characters (dating to ca. 8th-7th century BCE), some of which are presented herein and show that the algorithm yields plausible results when applied to deteriorated documents.

1606.07414 2026-06-04 cs.CV cs.MM cs.NA math.NA stat.ME 版本更新

Multiplierless 16-point DCT Approximation for Low-complexity Image and Video Coding

无乘法器16点DCT近似用于低复杂度图像和视频编码

T. L. T. Silveira, R. S. Oliveira, F. M. Bayer, R. J. Cintra, A. Madanayake

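AI summary: This paper proposes a 16-point approximate DCT that requires neither multiplications nor bit-shifting operations. Its fast algorithm, based on matrix factorization, needs only 44 additions, the lowest arithmetic cost in the literature, and in image compression it attains the best cost-benefit ratio among the competitors.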

Comments 12 pages, 5 figures, 3 tables

详情
AI中文摘要

本文介绍了一种正交的16点近似离散余弦变换(DCT)。所提出的变换不需要乘法或位移操作。引入了一种基于矩阵分解的快速算法,仅需44次加法,这是文献中最低的算术成本。为了评估所提出的变换,计算了计算复杂度、与精确DCT的相似性以及编码性能指标。经典和最先进的16点低复杂度变换在比较分析中被使用。在图像压缩中,所提出的近似通过PSNR和SSIM测量评估,获得了最佳的成本效益比。对于视频编码,所提出的近似被嵌入到HEVC参考软件中,直接与原始HEVC标准进行比较。通过FPGA硬件实现和测试,所提出的变换在与文献中最佳竞争变换相比时,面积-时间和面积-时间平方VLSI指标分别提高了35%和37%。

英文摘要

An orthogonal 16-point approximate discrete cosine transform (DCT) is introduced. The proposed transform requires neither multiplications nor bit-shifting operations. A fast algorithm based on matrix factorization is introduced, requiring only 44 additions, the lowest arithmetic cost in the literature. To assess the introduced transform, computational complexity, similarity with the exact DCT, and coding performance measures are computed. Classical and state-of-the-art 16-point low-complexity transforms were used in a comparative analysis. In the context of image compression, the proposed approximation was evaluated via PSNR and SSIM measurements, attaining the best cost-benefit ratio among the competitors. For video encoding, the proposed approximation was embedded into an HEVC reference software for direct comparison with the original HEVC standard. Physically realized and tested using FPGA hardware, the proposed transform showed 35% and 37% improvements of area-time and area-time-squared VLSI metrics when compared to the best competing transform in the literature.

1508.01308 2026-06-04 cs.CV cs.NA math.HO math.NA math.OC 版本更新

Collaborative Total Variation: A General Framework for Vectorial TV Models

协同总变分:向量总变分模型的通用框架

Joan Duran, Michael Moeller, Catalina Sbert, Daniel Cremers

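AI summary: This paper proposes collaborative total variation (CTV), which measures the smoothness of a color image's gradient tensor by taking different norms along its different dimensions. It characterizes the regularizers theoretically and compares many CTV methods experimentally on denoising, deblurring, and inpainting.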

详情
Journal ref
SIAM Journal on Imaging Sciences, vol. 9(1), pp. 116-151, 2016
AI中文摘要

尽管已有二十年,总变分(TV)仍然是图像处理中最受欢迎的正则化方法之一,并引发了大量研究,特别是从标量到向量值函数的转变。本文将彩色图像的梯度视为一个三维矩阵或张量,其维度对应空间扩展、与其他像素的差异和光谱通道。通过不同维度的不同范数测量该张量的平滑性,根据这些范数的类型可获得不同的正则化特性,从而得到新的颜色图像模型。我们称之为协同总变分(CTV)。在理论方面,我们刻画了所提出正则化器的对偶范数、次微分和近端映射。进一步地,借助广义奇异向量的概念,证明了$\ell^{\infty}$通道耦合做出最强烈的先验假设,并具有最大程度减少颜色伪影的潜力。我们的实际贡献包括一个广泛的实验部分,其中我们比较了大量协同TV方法在去噪、去模糊和修复等逆问题中的性能。

英文摘要

Even after over two decades, the total variation (TV) remains one of the most popular regularizations for image processing problems and has sparked a tremendous amount of research, particularly to move from scalar to vector-valued functions. In this paper, we consider the gradient of a color image as a three-dimensional matrix or tensor with dimensions corresponding to the spatial extent, the differences to other pixels, and the spectral channels. The smoothness of this tensor is then measured by taking different norms along the different dimensions. Depending on the type of these norms, one obtains very different properties of the regularization, leading to novel models for color images. We call this class of regularizations collaborative total variation (CTV). On the theoretical side, we characterize the dual norm, the subdifferential and the proximal mapping of the proposed regularizers. We further prove, with the help of the generalized concept of singular vectors, that an $\ell^{\infty}$ channel coupling makes the most prior assumptions and has the greatest potential to reduce color artifacts. Our practical contributions consist of an extensive experimental section where we compare the performance of a large number of collaborative TV methods for inverse problems like denoising, deblurring and inpainting.

1606.05535 2026-06-04 math.NA cs.CV cs.DS cs.NA 版本更新

Tensor Ring Decomposition

张量环分解

Qibin Zhao, Guoxu Zhou, Shengli Xie, Liqing Zhang, Andrzej Cichocki

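AI summary: This paper proposes the tensor ring (TR) decomposition, which represents a high-dimensional tensor by circular multilinear products over a sequence of low-dimensional cores and is invariant to circular dimensional permutation. It presents several algorithms for optimizing the latent cores and evaluates them experimentally.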

详情
AI中文摘要

张量网络近年来已成为解决大规模优化问题的强大工具。其中最流行的张量网络是张量列车(TT)分解,但其高度依赖张量维度的排列,导致难以找到最优TT表示。本文引入一种基本的张量分解模型,通过循环多线性乘积表示高维张量,可图形化解释为第三阶张量的环形连接,称为张量环(TR)分解。TR模型的关键优势是通过迹操作和等价处理隐含核心获得循环维度不变性。TR模型可视为TT分解的线性组合,从而获得强大的泛化表示能力。对于隐含核心的优化,我们提出了基于顺序SVD、ALS方案和分块ALS技术的四种不同算法。此外,研究了TR模型的数学性质,表明通过TR表示可以高效地执行基本多线性代数运算,经典张量分解可方便地转换为TR表示。最后,通过合成信号和真实数据集的实验评估了不同算法的性能。

英文摘要

Tensor networks have in recent years emerged as powerful tools for solving large-scale optimization problems. One of the most popular tensor networks is the tensor train (TT) decomposition, which acts as a building block for more complicated tensor networks. However, the TT decomposition highly depends on permutations of tensor dimensions, due to its strictly sequential multilinear products over latent cores, which leads to difficulties in finding the optimal TT representation. In this paper, we introduce a fundamental tensor decomposition model to represent a large dimensional tensor by circular multilinear products over a sequence of low dimensional cores, which can be graphically interpreted as a cyclic interconnection of 3rd-order tensors, and is thus termed the tensor ring (TR) decomposition. The key advantage of the TR model is the circular dimensional permutation invariance, which is gained by employing the trace operation and treating the latent cores equivalently. The TR model can be viewed as a linear combination of TT decompositions, thus obtaining powerful and generalized representation abilities. For optimization of latent cores, we present four different algorithms based on sequential SVDs, the ALS scheme, and block-wise ALS techniques. Furthermore, the mathematical properties of the TR model are investigated, which shows that basic multilinear algebra can be performed efficiently by using TR representations and that classical tensor decompositions can be conveniently transformed into the TR representation. Finally, experiments on both synthetic signals and real-world datasets were conducted to evaluate the performance of different algorithms.
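The trace-based construction and the circular permutation invariance mentioned in the abstract can be checked on a toy example. The sketch below evaluates a tensor entry as the trace of a product of core slices, and pairs a circular shift of the cores with the same circular shift of the indices. The function names, the integer-valued cores, and the ranks are all assumptions chosen for illustration; no fitting algorithm from the paper is shown.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def trace(m):
    """Sum of the diagonal entries of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))

def tr_element(cores, index):
    """T[i1, ..., id] = Trace(G1[i1] @ G2[i2] @ ... @ Gd[id])."""
    prod = cores[0][index[0]]
    for core, i in zip(cores[1:], index[1:]):
        prod = matmul(prod, core[i])
    return trace(prod)

def make_core(n, r_in, r_out, seed):
    """Hypothetical helper: a core with n integer slices of shape r_in x r_out."""
    return [[[(seed + 3 * i + 5 * a + 7 * b) % 5 - 2 for b in range(r_out)]
             for a in range(r_in)] for i in range(n)]

# Slice shapes 2x3, 3x2, 2x2: the last core maps back to the first rank,
# which closes the ring and makes the overall product square.
cores = [make_core(2, 2, 3, 0), make_core(2, 3, 2, 1), make_core(2, 2, 2, 2)]

# Circularly shifting the cores corresponds to circularly shifting the indices.
shifted = cores[1:] + cores[:1]
```

Because the trace is invariant under cyclic permutation of a matrix product, `tr_element(cores, (i, j, k))` equals `tr_element(shifted, (j, k, i))` for every index triple. A tensor train fixes the first and last ranks to 1, so it has no such symmetry.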

1403.3320 2026-06-04 math.NA cs.CV cs.NA 版本更新

Numerical Approaches for Linear Left-invariant Diffusions on SE(2), their Comparison to Exact Solutions, and their Applications in Retinal Imaging

在SE(2)上的线性左不变扩散的数值方法,其与精确解的比较及其在视网膜成像中的应用

Jiong Zhang, Remco Duits, Gonzalo Sanguinetti, Bart M. ter Haar Romeny

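AI summary: This paper compares numerical approaches for left-invariant diffusions on SE(2) against exact solutions, finds that Fourier based methods give the smallest relative errors, and shows retinal imaging applications that combine left-invariant PDE-evolutions with invertible orientation scores.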

Comments A final and corrected version of the manuscript is published in Numerical Mathematics: Theory, Methods and Applications (NM-TMA), vol. (9), p.1-50, 2016

详情
Journal ref
Numerical Mathematics: Theory, Methods and Applications (NM-TMA), vol. (9), p.1-50, 2016
AI中文摘要

在旋转平移群SE(2)上的左不变PDE演化(及其残差方程)在皮层建模和图像分析领域已被广泛研究。它们包括Citti与Sarti、Petitot提出的超椭圆扩散(用于轮廓增强)以及Mumford提出的方向过程(用于轮廓补全)。本文对多种数值方法进行了详尽研究和比较,这在文献中是缺失的。现有的数值方法可以分为三类:有限差分法、基于傅里叶的方法(等同于SE(2)-傅里叶方法)以及随机方法(蒙特卡洛模拟)。此外,之前Duits和van Almsick在2005年的工作中明确推导出了三种PDE演化的精确解(在空间傅里叶域中)。本文概述了这三种精确解,并解释了它们如何与三种数值方法相关联。我们计算了所有数值方法相对于精确解的相对误差,并发现基于傅里叶的方法表现最佳,具有最小的相对误差。我们还改进了Mathematica算法以评估Mathieu函数,这对实现精确解至关重要。此外,我们还对核中的奇点进行了渐近分析,并提出了对底层随机过程的概率扩展,以克服时间积分核在原点处的奇异行为。最后,我们展示了将左不变PDE演化与可逆方向分数结合在视网膜成像中的应用。

英文摘要

Left-invariant PDE-evolutions on the roto-translation group $SE(2)$ (and their resolvent equations) have been widely studied in the fields of cortical modeling and image analysis. They include hypo-elliptic diffusion (for contour enhancement) proposed by Citti & Sarti, and Petitot, and they include the direction process (for contour completion) proposed by Mumford. This paper presents a thorough study and comparison of the many numerical approaches, which, remarkably, is missing in the literature. Existing numerical approaches can be classified into 3 categories: Finite difference methods, Fourier based methods (equivalent to $SE(2)$-Fourier methods), and stochastic methods (Monte Carlo simulations). There are also 3 types of exact solutions to the PDE-evolutions that were derived explicitly (in the spatial Fourier domain) in previous works by Duits and van Almsick in 2005. Here we provide an overview of these 3 types of exact solutions and explain how they relate to each of the 3 numerical approaches. We compute relative errors of all numerical approaches to the exact solutions, and the Fourier based methods show us the best performance with smallest relative errors. We also provide an improvement of Mathematica algorithms for evaluating Mathieu-functions, crucial in implementations of the exact solutions. Furthermore, we include an asymptotical analysis of the singularities within the kernels and we propose a probabilistic extension of underlying stochastic processes that overcomes the singular behavior in the origin of time-integrated kernels. Finally, we show retinal imaging applications of combining left-invariant PDE-evolutions with invertible orientation scores.

1605.02196 2026-06-04 eess.SY cs.CV cs.LG cs.RO cs.SY 版本更新

All Weather Perception: Joint Data Association, Tracking, and Classification for Autonomous Ground Vehicles

全天候感知:面向自主地面车辆的数据关联、跟踪与分类的联合解决方案

Peter Radecki, Mark Campbell, Kevin Matzen

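AI summary: This paper presents a probabilistic perception algorithm for joint data association, object tracking, and object classification for an autonomous ground vehicle in all-weather conditions. It extends a Rao-Blackwellized particle filter with multiple model tracking for classification, and experiments on Cornell's upgraded AGV examine how vision algorithms can complement or replace lidar and radar sensors in adverse weather and lighting.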

Comments 35 pages, 21 figures, 14 tables

详情
AI中文摘要

本文提出了一种新颖的概率感知算法,作为实时联合解决方案,用于自主地面车辆在全天候条件下的数据关联、目标跟踪和目标分类。该算法扩展了最初使用粒子滤波进行数据关联和卡尔曼滤波进行多目标跟踪的 Rao-Blackwellized 粒子滤波器(Miller 等,2011a),现已包含多模型跟踪用于分类。此外,还实现了一种最先进的视觉检测算法,该算法包含方向信息,适用于自主地面车辆(AGV)应用。Cornell 的 AGV 从 DARPA 城市挑战中被升级并用于实验,以检验先进视觉算法能否补充或替代激光雷达和雷达传感器。在恶劣天气和光照条件下,传感器和算法性能得到测试。实验评估显示,在联合概率感知算法中,摄像头、激光雷达和雷达传感器能够实现稳健的全天候数据关联、跟踪和分类。

英文摘要

A novel probabilistic perception algorithm is presented as a real-time joint solution to data association, object tracking, and object classification for an autonomous ground vehicle in all-weather conditions. The presented algorithm extends a Rao-Blackwellized Particle Filter originally built with a particle filter for data association and a Kalman filter for multi-object tracking (Miller et al. 2011a) to now also include multiple model tracking for classification. Additionally a state-of-the-art vision detection algorithm that includes heading information for autonomous ground vehicle (AGV) applications was implemented. Cornell's AGV from the DARPA Urban Challenge was upgraded and used to experimentally examine if and how state-of-the-art vision algorithms can complement or replace lidar and radar sensors. Sensor and algorithm performance in adverse weather and lighting conditions is tested. Experimental evaluation demonstrates robust all-weather data association, tracking, and classification where camera, lidar, and radar sensors complement each other inside the joint probabilistic perception algorithm.

1512.01979 2026-06-04 math.NA cs.CV cs.NA 版本更新

Hyperspectral Chemical Plume Detection Algorithms Based On Multidimensional Iterative Filtering Decomposition

基于多维迭代滤波分解的高光谱化学烟雾检测算法

Antonio Cicone, Jingfang Liu, Haomin Zhou

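AI summary: This paper presents a post-processing tool based on Multidimensional Iterative Filtering (MIF) that improves how well classification methods identify chemical plume boundaries, together with an MIF-based pre-processing method that decorrelates and mean-centers hyperspectral data and improves the Cosine Similarity classifier.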

详情
AI中文摘要

空气中的化学物质可能对人类和环境造成极大危害。高光谱图像可用于识别化学烟雾,但该任务极具挑战性。假设已知某些具有已知频谱的化学烟雾已被高光谱传感器拍摄,可以使用匹配滤波或自适应余弦估计等标准技术,结合适当选择的阈值,来确定化学烟雾的位置。然而,由于噪声和传感器故障,即使在看似简单的状况下,准确识别化学像素也并不容易。本文提出了一种后处理工具,以完全自适应和数据驱动的方式,提高任何分类方法在识别烟雾边界方面的性能。这是通过多维迭代滤波(MIF)算法(arXiv:1411.6051, arXiv:1507.07173)实现的,这是一种类似于开创性经验模态分解(EMD)方法的非平稳信号分解方法。此外,基于MIF技术,我们还提出了一种预处理方法,使高光谱数据集去相关并均值中心化。余弦相似度度量,通常在实践中表现不佳,当配备此类预处理方法时,似乎成为一种成功且表现优异的分类器。我们展示了所提出方法在实际问题中的应用示例。

英文摘要

Chemicals released in the air can be extremely dangerous for human beings and the environment. Hyperspectral images can be used to identify chemical plumes; however, the task can be extremely challenging. Assuming we know a priori that some chemical plume, with a known frequency spectrum, has been photographed using a hyperspectral sensor, we can use standard techniques such as the so-called matched filter or adaptive cosine estimator, plus a properly chosen threshold value, to identify the position of the chemical plume. However, due to noise and sensor faults, the accurate identification of chemical pixels is not easy even in this apparently simple situation. In this paper we present a post-processing tool that, in a completely adaptive and data-driven fashion, improves the performance of any classification method in identifying the boundaries of a plume. This is done using the Multidimensional Iterative Filtering (MIF) algorithm (arXiv:1411.6051, arXiv:1507.07173), which is a non-stationary signal decomposition method like the pioneering Empirical Mode Decomposition (EMD) method. Moreover, based on the MIF technique, we also propose a pre-processing method that decorrelates and mean-centers a hyperspectral dataset. The Cosine Similarity measure, which often fails in practice, appears to become a successful and outperforming classifier when equipped with such a pre-processing method. We show some examples of the proposed methods applied to real-life problems.
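One reason cosine similarity can struggle on raw spectra is that a large shared background dominates the angle between pixels. The toy sketch below shows this with plain mean-centering over pixels, which is only a simple stand-in assumed here for the MIF-based pre-processing described in the abstract. The three-band spectra and helper names are made up for illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two spectra."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def mean_center(pixels):
    """Subtract the mean spectrum (averaged over pixels) from every pixel."""
    n = len(pixels)
    mean = [sum(p[b] for p in pixels) / n for b in range(len(pixels[0]))]
    return [[v - m for v, m in zip(p, mean)] for p in pixels]

# Toy 3-band spectra: a flat background plus or minus a small signature.
pixels = [[11.0, 10.0, 9.0], [9.0, 10.0, 11.0], [10.0, 10.0, 10.0]]
centered = mean_center(pixels)

raw = cosine_similarity(pixels[0], pixels[1])        # close to 1: background dominates
after = cosine_similarity(centered[0], centered[1])  # close to -1: opposite signatures
```

In this toy case the first two pixels carry opposite signatures, yet their raw cosine similarity is close to 1. Only after the common background is removed does the measure separate them.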

1504.07259 2026-06-04 cs.CV cs.NA math.AP math.NA 版本更新

Image Segmentation and Restoration Using Parametric Contours With Free Endpoints

基于自由端点参数轮廓的图像分割与修复

Heike Benninghoff, Harald Garcke

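AI summary: This paper introduces an active contour approach with free endpoints for image segmentation and restoration, based on a discrete version of the Mumford-Shah functional. It combines a flow of the curves in normal direction with evolution laws for the tangential flow of the endpoints, and uses parametric contours with edge-preserving denoising to obtain a fast method.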

详情
AI中文摘要

本文介绍了一种新型的具有自由端点的主动轮廓方法。提出了一种基于离散穆恩-沙赫功能的图像分割与修复方案,其中轮廓可以是闭合或开放曲线。除了曲线的法向流动外,还推导了端点切向流动的演化规律。通过参数化方法描述演化的轮廓并结合边缘保持去噪,获得图像分割与修复的快速方法。给出了分析和数值方案,并通过人工测试图像和真实医学图像进行了数值实验。

英文摘要

In this paper, we introduce a novel approach for active contours with free endpoints. A scheme is presented for image segmentation and restoration based on a discrete version of the Mumford-Shah functional where the contours can be both closed and open curves. In addition to a flow of the curves in normal direction, evolution laws for the tangential flow of the endpoints are derived. Using a parametric approach to describe the evolving contours together with an edge-preserving denoising, we obtain a fast method for image segmentation and restoration. The analytical and numerical schemes are presented, followed by numerical experiments with artificial test images and with a real medical image.

1604.02292 2026-06-04 math.NA cs.CV cs.NA 版本更新

A method for locally approximating regularized iterative tomographic reconstruction methods

一种局部近似正则化迭代断层成像重建方法

D. M. Pelt, K. J. Batenburg

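AI summary: This paper presents a method that approximates regularized iterative tomographic reconstruction inside a region of interest. Computing only inside that region keeps computational requirements low while producing reconstructions almost identical to those of the global methods.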

Comments 32 pages, 13 figures

详情
AI中文摘要

在许多断层成像应用中,获取的投影数据往往数量有限或包含大量噪声。在这些情况下,标准重建方法容易产生伪影,影响进一步分析。先进的正则化迭代方法,如总变分最小化,通常能通过利用对被扫描物体的先验知识提高重建质量。然而,这些方法在实践中往往计算时间过长或内存需求大。此外,由于它们基于最小化全局目标函数,正则化迭代方法需要重建整个被扫描物体,即使仅对重建图像的某个(小)区域感兴趣。本文提出了一种在被扫描物体的(小)感兴趣区域内部近似正则化迭代重建方法。该方法仅在感兴趣区域内部进行计算,确保低计算需求。不同幻影图像和正则化类型的结果显示,所提出局部方法的重建结果与近似全局正则化迭代方法的重建结果几乎相同,即使对于相对较小的感兴趣区域也是如此。此外,我们还表明,通过并行重建多个小区域并将它们合并为一个重建,可以高效地重建更大的区域。

英文摘要

In many applications of tomography, the acquired projections are either limited in number or contain a significant amount of noise. In these cases, standard reconstruction methods tend to produce artifacts that can make further analysis difficult. Advanced regularized iterative methods, such as total variation minimization, are often able to achieve a higher reconstruction quality by exploiting prior knowledge about the scanned object. In practice, however, these methods often have prohibitively long computation times or large memory requirements. Furthermore, since they are based on minimizing a global objective function, regularized iterative methods need to reconstruct the entire scanned object, even when one is only interested in a (small) region of the reconstructed image. In this paper, we present a method to approximate regularized iterative reconstruction methods inside a (small) region of the scanned object. The method only performs computations inside the region of interest, ensuring low computational requirements. Reconstruction results for different phantom images and types of regularization are given, showing that reconstructions of the proposed local method are almost identical to those of the global regularized iterative methods that are approximated, even for relatively small regions of interest. Furthermore, we show that larger regions can be reconstructed efficiently by reconstructing several small regions in parallel and combining them into a single reconstruction afterwards.

1402.4893 2026-06-04 cs.CV cs.NA math.NA 版本更新

Anisotropic Mesh Adaptation for Image Representation

各向异性网格自适应用于图像表示

Xianping Li

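AI summary: This paper proposes GPRAMA, a method based on anisotropic mesh adaptation and greedy-point removal. With a mesh patching technique, it achieves better image representation quality than the GPRFS-ED method at lower computational cost.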

Comments 25 pages, 15 figures

详情
AI中文摘要

三角网格在图像表示中已获得广泛关注,并广泛应用于图像处理。本文介绍了一种用于图像表示的各向异性网格自适应(AMA)方法框架,并提出了一种基于AMA和贪心点移除(GPR)方案的GPRAMA方法。与许多其他方法不同,AMA方法直接从三角网格开始,然后根据用户定义的度量张量自适应调整网格以表示图像。AMA方法具有清晰的数学框架,为图像表示和图像重建提供了灵活性。开发了一种网格拼接技术以实现GPRAMA方法,从而得到流行GPRFS-ED方法的改进版本。GPRAMA方法在质量和计算成本方面均优于GPRFS-ED方法。

英文摘要

Triangular meshes have gained much interest in image representation and have been widely used in image processing. This paper introduces a framework of anisotropic mesh adaptation (AMA) methods for image representation and proposes a GPRAMA method that is based on AMA and a greedy-point removal (GPR) scheme. Unlike many other methods that triangulate sample points to form the mesh, the AMA methods start directly with a triangular mesh and then adapt the mesh based on a user-defined metric tensor to represent the image. The AMA methods have a clear mathematical framework and provide flexibility for both image representation and image reconstruction. A mesh patching technique is developed for the implementation of the GPRAMA method, which leads to an improved version of the popular GPRFS-ED method. The GPRAMA method can achieve better quality than the GPRFS-ED method but with lower computational cost.

1603.08497 2026-06-04 cs.CV cs.NA math.NA 版本更新

On distances, paths and connections for hyperspectral image segmentation

关于超光谱图像分割的距离、路径和连接

Guillaume Noyel, Jesus Angulo, Dominique Jeulin

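AI summary: This paper introduces the η and μ connections to add regional information to λ-flat zones, and uses a top-down approach to obtain a finer segmentation.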

详情
Journal ref
Proceedings of the 8th International Symposium on Mathematical Morphology: Volume 1, pp.399-410, 2007, 978-85-17-00032-5
AI中文摘要

本文引入η连接和μ连接,以便为仅考虑局部信息的λ-平坦区添加区域信息。采用自顶向下的方法,首先构建λ-平坦区以获得欠分割,然后通过在λ-平坦区内计算η-有界区域和μ-测地球获得更精细的分割。所提出的算法基于队列,使用累积距离进行有序种子选择。η-有界区域从称为中心的点出发控制类内幅度变化,μ-测地球控制类的“大小”。这些结果应用于高光谱图像。

英文摘要

The present paper introduces the $η$ and $μ$ connections in order to add regional information on $λ$-flat zones, which only take into account local information. A top-down approach is considered. First, $λ$-flat zones are built in a way leading to a sub-segmentation. Then a finer segmentation is obtained by computing $η$-bounded regions and $μ$-geodesic balls inside the $λ$-flat zones. The proposed algorithms for the construction of new partitions are based on queues with an ordered selection of seeds using the cumulative distance. $η$-bounded regions offer control over the variations of amplitude in the class from a point, called the center, and $μ$-geodesic balls control the "size" of the class. These results are applied to hyperspectral images.

1410.7632 2026-06-04 cs.CV cs.RO cs.SY eess.SY 版本更新

On the Covariance of ICP-based Scan-matching Techniques

关于基于ICP的扫描匹配技术的协方差

Silvère Bonnabel, Martin Barczyk, François Goulette

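AI summary: This paper studies the covariance of roto-translations computed by the ICP algorithm. It shows that the usual closed-form approach yields erroneous covariances for point-to-point ICP and gives a formal mathematical proof that the approach is valid for point-to-plane ICP.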

Comments Accepted at 2016 American Control Conference

详情
AI中文摘要

本文考虑了通过迭代最近点(ICP)算法计算旋转变换协方差的问题。该问题对于配备深度感应相机(如Kinect)或激光雷达(如Velodyne)的移动机器人和车辆的定位具有相关性。先前文献中提出的闭式公式通常基于ICP解是通过最小化线性最小二乘问题得到的假设。本文表明,这种做法需要谨慎,因为算法的重新匹配步骤未被显式考虑,应用于点到点版本的ICP会导致完全错误的协方差。随后,我们提供了一个形式化的数学证明,说明该方法在点到平面版本的ICP中是有效的,这验证了从业者直觉和实验结果。

英文摘要

This paper considers the problem of estimating the covariance of roto-translations computed by the Iterative Closest Point (ICP) algorithm. The problem is relevant for localization of mobile robots and vehicles equipped with depth-sensing cameras (e.g., Kinect) or Lidar (e.g., Velodyne). The closed-form formulas for covariance proposed in previous literature generally build upon the fact that the solution to ICP is obtained by minimizing a linear least-squares problem. In this paper, we show this approach needs caution because the rematching step of the algorithm is not explicitly accounted for, and applying it to the point-to-point version of ICP leads to completely erroneous covariances. We then provide a formal mathematical proof why the approach is valid in the point-to-plane version of ICP, which validates the intuition and experimental results of practitioners.
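For readers unfamiliar with the two ICP variants contrasted in the abstract, their standard cost functions are given below. The notation is assumed here: $p_i$ are points of the scan being aligned, $q_i$ their currently matched points in the reference scan, $n_i$ the unit surface normal at $q_i$, and $(R, t)$ the roto-translation being estimated.

```latex
% Point-to-point ICP: penalize the full residual between matched points.
E_{\text{pt-pt}}(R, t) = \sum_{i} \big\lVert R\, p_i + t - q_i \big\rVert^2

% Point-to-plane ICP: penalize only the residual component along the normal.
E_{\text{pt-pl}}(R, t) = \sum_{i} \big( (R\, p_i + t - q_i) \cdot n_i \big)^2
```

In both variants the matches $q_i$ are recomputed after each update of $(R, t)$. This is the rematching step that, according to the abstract, closed-form covariance formulas do not explicitly account for.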

1506.09016 2026-06-04 cs.LG cs.CV cs.NA math.NA math.OC stat.ML 版本更新

Online Learning to Sample

在线学习采样

Guillaume Bouchard, Théo Trouillon, Julien Perez, Adrien Gaidon

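AI summary: This paper proposes AW-SGD, which learns the sampling distribution online to accelerate SGD, and applies it to image classification, matrix factorization, and reinforcement learning.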

Comments Update: removed convergence theorem and proof as there is an error. Submitted to UAI 2016

详情
AI中文摘要

随机梯度下降(SGD)是机器学习中用于在线优化最广泛使用的技术之一。在本工作中,我们通过适应性地学习如何在每个时间步选择最有用的训练示例来加速SGD。首先,我们证明SGD可以用于学习重要采样估计器的最佳可能采样分布。其次,我们证明SGD算法的采样分布可以通过逐步最小化梯度的方差来在线估计。所得到的算法——自适应加权SGD(AW-SGD)——维护一组用于优化的参数,以及一组用于采样学习示例的参数。我们证明AWSGD在三个不同的应用中实现了更快的收敛:(i)使用深度特征的图像分类,其中图像的采样取决于其标签,(ii)矩阵分解,其中行和列不是均匀采样的,以及(iii)强化学习,其中优化和探索策略同时被估计,其中我们的方法对应于一个off-policy梯度算法。

英文摘要

Stochastic Gradient Descent (SGD) is one of the most widely used techniques for online optimization in machine learning. In this work, we accelerate SGD by adaptively learning how to sample the most useful training examples at each time step. First, we show that SGD can be used to learn the best possible sampling distribution of an importance sampling estimator. Second, we show that the sampling distribution of an SGD algorithm can be estimated online by incrementally minimizing the variance of the gradient. The resulting algorithm, called Adaptive Weighted SGD (AW-SGD), maintains a set of parameters to optimize, as well as a set of parameters to sample learning examples. We show that AW-SGD yields faster convergence in three different applications: (i) image classification with deep features, where the sampling of images depends on their labels, (ii) matrix factorization, where rows and columns are not sampled uniformly, and (iii) reinforcement learning, where the optimized and exploration policies are estimated at the same time, and where our approach corresponds to an off-policy gradient algorithm.
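The principle behind learning a sampling distribution is that an importance-weighted gradient estimate stays unbiased for any valid distribution, while its variance depends strongly on that distribution. The sketch below computes both quantities exactly for a toy set of scalar "gradients". It illustrates importance sampling only and is not the AW-SGD algorithm; the function name and the numbers are assumptions.

```python
def estimator_stats(grads, q):
    """Exact mean and variance of the estimator g_i / (n * q_i), with i drawn from q."""
    n = len(grads)
    values = [g / (n * p) for g, p in zip(grads, q)]
    mean = sum(p * v for p, v in zip(q, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(q, values))
    return mean, var

grads = [1.0, 2.0, 3.0, 10.0]               # toy per-example gradient magnitudes
uniform = [1.0 / len(grads)] * len(grads)   # plain SGD sampling
total = sum(grads)
proportional = [g / total for g in grads]   # q_i proportional to |g_i|

mean_u, var_u = estimator_stats(grads, uniform)
mean_p, var_p = estimator_stats(grads, proportional)
```

Both distributions give the same expected value, the average gradient. Sampling proportionally to the gradient magnitude removes the variance entirely in this one-dimensional toy case. In practice the ideal distribution is unknown, which is why the abstract proposes estimating it online.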

1410.1699 2026-06-04 math.NA cs.CV cs.NA math.OC physics.med-ph 版本更新

Mumford-Shah and Potts Regularization for Manifold-Valued Data with Applications to DTI and Q-Ball Imaging

Mumford-Shah和Potts正则化用于流形值数据及其在DTI和Q-ball成像中的应用

Andreas Weinmann, Laurent Demaret, Martin Storath

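AI summary: This paper proposes algorithms for Mumford-Shah and Potts regularization of manifold-valued signals and images. Univariate problems are solved by dynamic programming combined with optimization techniques, and for Cartan-Hadamard manifolds the algorithms are shown to compute global minimizers. The method requires no a priori restrictions on the edge set and no discretization of the data space, and is applied to DTI and Q-ball imaging, including a segmentation of the corpus callosum.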

详情
AI中文摘要

Mumford-Shah和Potts函数是强大的变分模型,广泛用于信号和图像处理,典型应用包括边缘保持去噪和分割。由于非光滑和非凸特性,即使对标量数据计算也具有挑战性。对于流形值数据,问题更加复杂,因为向量空间的典型特征不可用。本文提出用于流形值信号和图像的Mumford-Shah和Potts正则化算法。对于单变量问题,我们推导出基于动态规划结合(凸)优化技术的求解器。对于Cartan-Hadamard流形(包括扩散张量成像的数据空间),我们证明我们的算法能为任何起始点计算全局最小值。对于多变量Mumford-Shah和Potts问题(图像正则化),我们提出将其拆分为合适的子问题,利用相应单变量问题开发的技术精确求解。我们的方法不需要任何先验边缘集限制,也不需要离散化数据空间。我们将其应用于扩散张量成像(DTI)以及Q-ball成像。使用DTI模型,我们获得了脑胼胝体的分割。

英文摘要

Mumford-Shah and Potts functionals are powerful variational models for regularization which are widely used in signal and image processing; typical applications are edge-preserving denoising and segmentation. Being both non-smooth and non-convex, they are computationally challenging even for scalar data. For manifold-valued data, the problem becomes even more involved since typical features of vector spaces are not available. In this paper, we propose algorithms for Mumford-Shah and for Potts regularization of manifold-valued signals and images. For the univariate problems, we derive solvers based on dynamic programming combined with (convex) optimization techniques for manifold-valued data. For the class of Cartan-Hadamard manifolds (which includes the data space in diffusion tensor imaging), we show that our algorithms compute global minimizers for any starting point. For the multivariate Mumford-Shah and Potts problems (for image regularization) we propose a splitting into suitable subproblems which we can solve exactly using the techniques developed for the corresponding univariate problems. Our method does not require any a priori restrictions on the edge set and we do not have to discretize the data space. We apply our method to diffusion tensor imaging (DTI) as well as Q-ball imaging. Using the DTI model, we obtain a segmentation of the corpus callosum.
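To make the dynamic programming idea concrete, the sketch below solves the univariate Potts problem for plain scalar data, minimizing gamma times the number of jumps plus the squared distance to the data. This is the classical Euclidean case only, assumed here as a simplified stand-in: the paper treats manifold-valued data, where the segment mean would be replaced by a suitable intrinsic mean. All names are illustrative.

```python
def potts_1d(f, gamma):
    """Minimize gamma * (#jumps of u) + sum_i (u_i - f_i)^2 over piecewise constant u."""
    n = len(f)
    # Prefix sums give the cost of any segment in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(f):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def seg_cost(l, r):
        """Squared deviation of f[l:r] from its mean (requires l < r)."""
        s = s1[r] - s1[l]
        return (s2[r] - s2[l]) - s * s / (r - l)

    best = [0.0] * (n + 1)   # best[r]: optimal energy for the prefix f[:r]
    best[0] = -gamma         # so that the first segment does not pay for a jump
    last = [0] * (n + 1)     # last[r]: start index of the final segment of f[:r]
    for r in range(1, n + 1):
        best[r] = float("inf")
        for l in range(r):
            cost = best[l] + gamma + seg_cost(l, r)
            if cost < best[r]:
                best[r], last[r] = cost, l

    # Backtrack the segment boundaries and fill each segment with its mean.
    u = [0.0] * n
    r = n
    while r > 0:
        l = last[r]
        mean = (s1[r] - s1[l]) / (r - l)
        for i in range(l, r):
            u[i] = mean
        r = l
    return u
```

For example, `potts_1d([0, 0, 0, 5, 5, 5], 1.0)` keeps the single jump, while a large penalty such as `gamma = 100.0` returns the constant signal 2.5. The double loop makes this version quadratic in the signal length.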

1510.00771 2026-06-04 cs.CV cs.RO cs.SY eess.SY 版本更新

Design and Analysis of a Single-Camera Omnistereo Sensor for Quadrotor Micro Aerial Vehicles (MAVs)

单相机 omnistereo 传感器的设计与分析用于四旋翼微型飞行器(MAVs)

Carlos Jaramillo

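AI summary This paper presents a single-camera omnistereo sensor design for low-payload quadrotor micro aerial vehicles, which achieves stereo vision with coaxially aligned hyperboloidal mirrors, and analyzes its geometric properties and 3D sensing performance.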

Comments 49 pages, 22 figures, journal article draft

详情
Journal ref
Sensors 16 (2016) 217
AI中文摘要

我们描述了一种应用于微型飞行器(MAVs)的 omnistereo 系统的设计和3D感知性能。所提出的 omnistereo 模型采用一个单目相机,与一对双曲面镜(折叠catadioptric配置)共轴对齐。我们证明这种配置在安装在具有低负载的四旋翼MAV上进行立体视觉是可行的。理论上的单视角(SVP)约束帮助我们推导出传感器投影几何的解析解,并生成SVP兼容的全景图像,以从立体对应关系中计算3D信息(真正同步地)。我们对各种系统特性进行了广泛分析,如大小、catadioptric空间分辨率、视场。此外,我们提出了一种概率模型,用于估计从三角化中深度的不确定性,用于斜向后投影射线。我们期望通过我们的解决方案的可重复性来激励,因为它可以被适应(最优地)到其他基于catadioptric的omnistereo视觉应用。

英文摘要

We describe the design and 3D sensing performance of an omnidirectional stereo-vision system (omnistereo) as applied to Micro Aerial Vehicles (MAVs). The proposed omnistereo model employs a monocular camera that is co-axially aligned with a pair of hyperboloidal mirrors (folded catadioptric configuration). We show that this arrangement is practical for performing stereo-vision when mounted on top of propeller-based MAVs characterized by low payloads. The theoretical single viewpoint (SVP) constraint helps us derive analytical solutions for the sensor's projective geometry and generate SVP-compliant panoramic images to compute 3D information from stereo correspondences (in a truly synchronous fashion). We perform an extensive analysis on various system characteristics such as its size, catadioptric spatial resolution, field-of-view. In addition, we pose a probabilistic model for uncertainty estimation of the depth from triangulation for skew back-projection rays. We expect to motivate the reproducibility of our solution since it can be adapted (optimally) to other catadioptric-based omnistereo vision applications.

1510.06895 2026-06-04 cs.LG cs.CV cs.NA math.NA 版本更新

Nonconvex Nonsmooth Low-Rank Minimization via Iteratively Reweighted Nuclear Norm

非凸非光滑低秩最小化通过迭代重加权核范数

Canyi Lu, Jinhui Tang, Shuicheng Yan, Zhouchen Lin

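AI summary This paper solves nonconvex nonsmooth low-rank minimization with the Iteratively Reweighted Nuclear Norm (IRNN) algorithm, using nonconvex surrogate functions to approximate the rank function and improve low-rank matrix recovery.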

详情
AI中文摘要

核范数因其在压缩感知中用于低秩矩阵恢复而被广泛使用,但求解基于核范数的松弛凸问题通常导致原始秩最小化问题的次优解。本文提出在矩阵奇异值上使用非凸替代函数近似秩函数,从而得到非凸非光滑最小化问题。然后通过迭代重加权核范数(IRNN)算法求解,该算法通过求解加权奇异值阈值(WSVT)问题,利用非凸替代函数的特殊性质获得闭式解。同时,IRNN被扩展以处理两个或多个变量块的非凸问题。理论上,证明IRNN单调减少目标函数值,任何极限点都是 stationary 点。在合成数据和真实图像上的大量实验表明,IRNN相比最先进的凸算法在低秩矩阵恢复方面表现更优。

英文摘要

The nuclear norm is widely used as a convex surrogate of the rank function in compressive sensing for low rank matrix recovery with its applications in image recovery and signal processing. However, solving the nuclear norm based relaxed convex problem usually leads to a suboptimal solution of the original rank minimization problem. In this paper, we propose to perform a family of nonconvex surrogates of $L_0$-norm on the singular values of a matrix to approximate the rank function. This leads to a nonconvex nonsmooth minimization problem. Then we propose to solve the problem by Iteratively Reweighted Nuclear Norm (IRNN) algorithm. IRNN iteratively solves a Weighted Singular Value Thresholding (WSVT) problem, which has a closed form solution due to the special properties of the nonconvex surrogate functions. We also extend IRNN to solve the nonconvex problem with two or more blocks of variables. In theory, we prove that IRNN decreases the objective function value monotonically, and any limit point is a stationary point. Extensive experiments on both synthesized data and real images demonstrate that IRNN enhances the low-rank matrix recovery compared with state-of-the-art convex algorithms.
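The weighted singular value thresholding step mentioned above can be sketched as follows. This is a minimal illustration, assuming the subproblem has the form min_X sum_i w_i σ_i(X) + 0.5 ||X - Y||_F² with weights that are nonnegative and nondecreasing in i. Under that assumption, shrinking each singular value of Y by its weight gives a closed-form minimizer. The function name `weighted_svt` is hypothetical, and the reweighting rule that IRNN uses to choose the weights is not reproduced here.

```python
import numpy as np

def weighted_svt(Y, weights):
    """Shrink the i-th singular value of Y by weights[i] and clip at zero.

    Assumes 0 <= weights[0] <= weights[1] <= ... (illustrative assumption),
    matching singular values sorted in nonincreasing order.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    w = np.asarray(weights, dtype=float)
    shrunk = np.maximum(s - w, 0.0)
    # Scale the columns of U by the shrunken singular values.
    return (U * shrunk) @ Vt

Y = np.diag([3.0, 2.0, 1.0])
X = weighted_svt(Y, [0.5, 1.0, 1.5])
```

Larger weights on the small singular values push them to zero first, which is how the step lowers the rank of the iterate.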

1512.01927 2026-06-04 math.NA cs.CV cs.LG cs.NA 版本更新

Fast Optimization Algorithm on Riemannian Manifolds and Its Application in Low-Rank Representation

流形上的快速优化算法及其在低秩表示中的应用

Haoran Chen, Yanfeng Sun, Junbin Gao, Yongli Hu

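AI summary This paper proposes FOA, a new first-order optimization algorithm with a fast convergence rate, and applies an FOA-based fast subspace pursuit method to the low-rank representation model; experiments show that it outperforms other methods in convergence speed and accuracy.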

详情
AI中文摘要

本文研究了在流形上优化一类复合函数的问题,并提出了一种新的第一阶优化算法(FOA),具有快速收敛速度。通过理论分析证明了该算法具有二次收敛性。在矩阵补全任务的实验中,FOA在流形上的其他一阶优化方法中表现更优。基于FOA提出了一种快速子空间追踪方法,用于解决基于增广拉格朗日方法的低秩矩阵流形上的低秩表示模型。实验结果表明,FOA和SP-RPRG(ALM)在合成和真实数据集上均实现了更快的收敛速度和更高的准确性。

英文摘要

The paper addresses the problem of optimizing a class of composite functions on Riemannian manifolds and a new first order optimization algorithm (FOA) with a fast convergence rate is proposed. Through the theoretical analysis for FOA, it has been proved that the algorithm has quadratic convergence. The experiments in the matrix completion task show that FOA has better performance than other first order optimization methods on Riemannian manifolds. A fast subspace pursuit method based on FOA is proposed to solve the low-rank representation model based on augmented Lagrange method on the low rank matrix variety. Experimental results on synthetic and real data sets are presented to demonstrate that both FOA and SP-RPRG(ALM) can achieve superior performance in terms of faster convergence and higher accuracy.

1512.00298 2026-06-04 math.NA cs.CV cs.NA math.OC 版本更新

On Optical Flow Models for Variational Motion Estimation

关于变分运动估计中的光学流模型

Martin Burger, Hendrik Dirks, Lena Frerking

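AI summary This paper examines total variation based regularization for motion estimation, focusing on variants of optical flow models and their optimization; it also analyzes the models from an inverse problems perspective and quantitatively evaluates motion estimation quality.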

Comments 27 pages, 3 figures, 2 tables

详情
AI中文摘要

本文旨在讨论和评估用于运动估计的基于总变分的正则化方法,特别关注光学流模型。除了标准的L²和L¹数据保真度外,我们还概述了通过与高阶模型结合而获得的不同总变分正则化变体,并基于对偶方法提出统一的计算优化方法。此外,我们通过Bregman迭代扩展模型,并从反问题的角度分析变分光学流模型。本文特别关注运动估计的定量评估,这是困难且常被低估的任务。我们讨论了几种运动估计质量度量方法,并将其应用于比较之前讨论的正则化方法。

英文摘要

The aim of this paper is to discuss and evaluate total variation based regularization methods for motion estimation, with particular focus on optical flow models. In addition to standard $L^2$ and $L^1$ data fidelities we give an overview of different variants of total variation regularization obtained from combination with higher order models and a unified computational optimization approach based on primal-dual methods. Moreover, we extend the models by Bregman iterations and provide an inverse problems perspective to the analysis of variational optical flow models. A particular focus of the paper is the quantitative evaluation of motion estimation, which is a difficult and often underestimated task. We discuss several approaches for quality measures of motion estimation and apply them to compare the previously discussed regularization approaches.

1511.04685 2026-06-04 math.NA cs.CV cs.NA math.SP 版本更新

Semi-Inner-Products for Convex Functionals and Their Use in Image Decomposition

半内积在凸泛函中的应用及其在图像分解中的应用

Guy Gilboa

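AI summary This paper extends semi-inner-products in the sense of Lumer to convex functionals, giving convex functionals in Banach spaces a Hilbert-space-like structure, and uses this structure to analyze total variation and higher-order functionals.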

详情
AI中文摘要

半内积在Lumer意义下被扩展到凸泛函,从而在Banach空间中为凸泛函构建出类似于希尔伯特空间的结构。特别地,给出了针对一个齐次泛函的半内积的一般表达式。因此,可以利用新的算子对总变分和更高阶泛函如总广义变分(TGV)进行分析。拥有半内积后,可以以简单的方式定义函数之间的角度。证明在齐次情况下,Bregman距离可以以该新定义的角度来表示。此外,推导了由泛函诱导的非线性本征函数的半内积的性质。我们利用这一构造,陈述了两个信号完美分解的充分条件,并提出了数值指标,以指示这些条件何时大致满足。

英文摘要

Semi-inner-products in the sense of Lumer are extended to convex functionals. This yields a Hilbert-space like structure to convex functionals in Banach spaces. In particular, a general expression for semi-inner-products with respect to one homogeneous functionals is given. Thus one can use the new operator for the analysis of total variation and higher order functionals like total-generalized-variation (TGV). Having a semi-inner-product, an angle between functions can be defined in a straightforward manner. It is shown that in the one homogeneous case the Bregman distance can be expressed in terms of this newly defined angle. In addition, properties of the semi-inner-product of nonlinear eigenfunctions induced by the functional are derived. We use this construction to state a sufficient condition for a perfect decomposition of two signals and suggest numerical measures which indicate when those conditions are approximately met.

1411.0814 2026-06-04 math.NA cs.CV cs.NA 版本更新

A random algorithm for low-rank decomposition of large-scale matrices with missing entries

一种用于大规模矩阵低秩分解的随机算法

Yiguang Liu

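AI summary This paper proposes the Random SubMatrix method (RSM) for low-rank decomposition of large-scale matrices with missing entries; it is faster and more memory-efficient than existing algorithms while reaching or approaching the best precision.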

详情
AI中文摘要

本文提出了一种随机子矩阵方法(RSM),用于计算具有已知条目百分比ρ的大规模矩阵的低秩分解。RSM非常快速,其所需的浮点运算(flops)与现有最先进算法相比具有竞争力。同时,RSM非常节省内存。在已知条目均匀分布于给定矩阵的情况下,通过已知条目形成的子矩阵被随机选择。根据已证明的定理,与较小奇异值相关的子空间受噪声扰动较小,因此计算每个子矩阵对应的null向量或右奇异向量。这些向量是给定大规模矩阵真实地面中的相应子矩阵的null向量。如果随机选择了足够的子矩阵,就能估计出低秩分解。在随机合成矩阵(如131072X1024)和真实数据集上的实验结果表明,RSM在速度和内存使用上显著优于现有方法,同时在精度上也达到或接近最优。

英文摘要

A Random SubMatrix method (RSM) is proposed to calculate the low-rank decomposition of large-scale matrices with known entry percentage ρ. RSM is very fast, as the floating-point operations (flops) it requires compare favorably with those of state-of-the-art algorithms. RSM is also very memory-efficient. With known entries homogeneously distributed in the given matrix, sub-matrices formed by known entries are randomly selected. According to a theorem proved in the paper, the subspace related to the smaller singular values is less perturbed by noise, so the null vectors or the right singular vectors associated with the minor singular values are calculated for each submatrix. These vectors are the null vectors of the corresponding submatrix in the ground truth of the given large-scale matrix. If enough sub-matrices are randomly chosen, the low-rank decomposition can be estimated. Experimental results on random synthetic matrices with sizes such as 131072 × 1024 and on real data sets indicate that RSM is much faster and more memory-efficient, while attaining a precision that matches or approaches the best.

1509.04237 2026-06-04 cs.CV cs.NA math.NA 版本更新

A Total Fractional-Order Variation Model for Image Restoration with Non-homogeneous Boundary Conditions and its Numerical Solution

一种总分数阶变化模型用于图像恢复及其数值解法,具有非均匀边界条件

Jianping Zhang, Ke Chen

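AI summary This paper proposes a fractional-order total α-order variation model for image restoration that overcomes weaknesses of the classical total variation model; it analyzes the model's theoretical properties and develops four algorithms, and its results are highly competitive with existing high-order models in restoration quality and efficiency.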

Comments 26 pages

详情
AI中文摘要

为克服基于总变分的图像恢复模型的不足,近年来提出了多种高阶(通常为二阶)正则化模型。本文分析并测试了一种基于分数阶导数的总α阶变化模型,其性能优于当前流行的高阶正则化模型。尽管已有使用总α阶变化进行图像恢复的研究,但尚未进行理论分析,且所有测试公式均使用零Dirichlet边界条件,这在现实中不适用(而非零边界条件违反分数阶导数的定义)。本文首先回顾了一些分数阶导数的结果,然后严格分析所提出总α阶变分模型的理论性质。接着开发了四种算法来求解变分问题,一种基于变分Split-Bregman思想,三种基于直接求解离散优化问题。数值实验表明,在恢复质量和解效率方面,所提模型在光滑图像上能与已建立的高阶模型(均 curvature 和总泛化变化)产生高度竞争的结果。

英文摘要

To overcome the weakness of a total variation based model for image restoration, various high order (typically second order) regularization models have been proposed and studied recently. In this paper we analyze and test a fractional-order derivative based total $α$-order variation model, which can outperform the currently popular high order regularization models. There exist several previous works using total $α$-order variations for image restoration; however first no analysis is done yet and second all tested formulations, differing from each other, utilize the zero Dirichlet boundary conditions which are not realistic (while non-zero boundary conditions violate definitions of fractional-order derivatives). This paper first reviews some results of fractional-order derivatives and then analyzes the theoretical properties of the proposed total $α$-order variational model rigorously. It then develops four algorithms for solving the variational problem, one based on the variational Split-Bregman idea and three based on direct solution of the discretise-optimization problem. Numerical experiments show that, in terms of restoration quality and solution efficiency, the proposed model can produce highly competitive results, for smooth images, to two established high order models: the mean curvature and the total generalized variation.

1404.5009 2026-06-04 cs.CV cs.LG cs.NA math.NA 版本更新

Efficient Semidefinite Branch-and-Cut for MAP-MRF Inference

高效半定规划分支定界法用于MAP-MRF推断

Peng Wang, Chunhua Shen, Anton van den Hengel, Philip Torr

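AI summary This paper proposes an efficient branch-and-cut method for general MAP-MRF inference. It combines scalable semidefinite programming with a cutting-plane method to obtain an efficient bounding procedure, and achieves the best reported results when the connectivity is dense or the unary costs are relatively small.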

Comments 21 pages

详情
AI中文摘要

我们提出了一种分支定界(B&C)方法用于求解通用MAP-MRF推断问题。该方法的核心是一个非常高效的边界求解过程,结合了可扩展的半定规划(SDP)和切割平面法以寻找违反的约束。为了进一步加快计算,采用了模型简化、预热启动和移除不活跃约束等策略。我们分析了所提方法在不同设置下的性能,并证明我们的方法在性能上要么优于要么与最先进的方法相当。特别是当连接是密集的或当unary成本的相对大小较低时,我们实现了最佳报告结果。实验表明,所提出的算法在各种时间预算下,在具有挑战性的非子模MAP-MRF推断问题中优于最先进的方法。

英文摘要

We propose a Branch-and-Cut (B&C) method for solving general MAP-MRF inference problems. The core of our method is a very efficient bounding procedure, which combines scalable semidefinite programming (SDP) and a cutting-plane method for seeking violated constraints. In order to further speed up the computation, several strategies have been exploited, including model reduction, warm start and removal of inactive constraints. We analyze the performance of the proposed method under different settings, and demonstrate that our method either outperforms or performs on par with state-of-the-art approaches. Especially when the connectivities are dense or when the relative magnitudes of the unary costs are low, we achieve the best reported results. Experiments show that the proposed algorithm achieves better approximation than the state-of-the-art methods within a variety of time budgets on challenging non-submodular MAP-MRF inference problems.

1509.00728 2026-06-04 math.OC cs.CV cs.MA cs.NA math.NA stat.ML 版本更新

On Transitive Consistency for Linear Invertible Transformations between Euclidean Coordinate Systems

关于欧几里得坐标系统之间线性可逆变换的传递一致性

Johan Thunberg, Florian Bernard, Jorge Goncalves

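AI summary This paper studies how to synchronize linear invertible transformations that are not transitively consistent. It proposes two synchronization methods and an iterative Gauss-Newton method for different graph topologies, and validates them in simulations.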

Comments 25 pages

详情
AI中文摘要

传递一致性是欧几里得坐标框架之间线性可逆变换集合的内在属性。在实践中,当变换由数据估计时,这一属性往往缺失。本文解决如何同步非传递一致的变换的问题。一旦变换被同步,它们将满足传递一致性条件——从框架A到框架C的变换等于先从A到B再从B到C的复合变换。坐标框架对应图中的节点,变换对应图中的边。提出了两种直接或集中同步方法,分别适用于近强连通图和连通图。作为第二种方法的扩展,提出了迭代Gauss-Newton方法,并将其适应于仿射和欧几里得变换的情况。还提出了适用于正交矩阵的两种分布式同步方法,这些方法可以看作是两种直接或集中方法的分布式版本;它们类似于用于分布式平均的标准共识协议。当变换为正交矩阵时,可以计算最优性间隙的上界。仿真显示,即使在噪声幅度较大的情况下,间隙也几乎准确。本文还从理论层面提供了传递一致变换的线性代数关系。所提出方法的一个优点是其简单性——使用基本线性代数方法,例如奇异值分解(SVD)。对于广泛参数设置范围内的方法,进行了数值验证。

英文摘要

Transitive consistency is an intrinsic property for collections of linear invertible transformations between Euclidean coordinate frames. In practice, when the transformations are estimated from data, this property is lacking. This work addresses the problem of synchronizing transformations that are not transitively consistent. Once the transformations have been synchronized, they satisfy the transitive consistency condition: a transformation from frame $A$ to frame $C$ is equal to the composite transformation of first transforming $A$ to $B$ and then transforming $B$ to $C$. The coordinate frames correspond to nodes in a graph and the transformations correspond to edges in the same graph. Two direct or centralized synchronization methods are presented for different graph topologies; the first one for quasi-strongly connected graphs, and the second one for connected graphs. As an extension of the second method, an iterative Gauss-Newton method is presented, which is later adapted to the case of affine and Euclidean transformations. Two distributed synchronization methods are also presented for orthogonal matrices, which can be seen as distributed versions of the two direct or centralized methods; they are similar in nature to standard consensus protocols used for distributed averaging. When the transformations are orthogonal matrices, a bound on the optimality gap can be computed. Simulations show that the gap is almost tight, even for noise large in magnitude. This work also contributes on a theoretical level by providing linear algebraic relationships for transitively consistent transformations. One of the benefits of the proposed methods is their simplicity: basic linear algebraic methods are used, e.g., the Singular Value Decomposition (SVD). For a wide range of parameter settings, the methods are numerically validated.
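The transitive consistency condition can be checked numerically. The sketch below is illustrative and is not one of the paper's synchronization methods. It assumes each frame i has an invertible matrix G_i and takes the transformation from frame i to frame j to be T_ij = G_j G_i⁻¹, so that T_jk T_ij = T_ik holds exactly. The helper names are hypothetical.

```python
import numpy as np

def consistent_transforms(frames):
    """Build T[(i, j)] = G_j @ inv(G_i) for every ordered pair of frames."""
    n = len(frames)
    return {(i, j): frames[j] @ np.linalg.inv(frames[i])
            for i in range(n) for j in range(n)}

def max_cycle_error(T, n):
    """Largest violation of T[(j, k)] @ T[(i, j)] == T[(i, k)] over all triples."""
    err = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                diff = T[(j, k)] @ T[(i, j)] - T[(i, k)]
                err = max(err, np.linalg.norm(diff))
    return err

rng = np.random.default_rng(0)
# Well-conditioned frames: small perturbations of the identity (assumption).
frames = [np.eye(3) + 0.1 * rng.normal(size=(3, 3)) for _ in range(4)]
T = consistent_transforms(frames)
clean_error = max_cycle_error(T, 4)

# Transformations "estimated from data": add noise to every edge independently.
noisy = {key: M + 0.05 * rng.normal(size=(3, 3)) for key, M in T.items()}
noisy_error = max_cycle_error(noisy, 4)
```

The noisy set violates the condition, which is the situation a synchronization method is meant to repair.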

1508.05514 2026-06-04 stat.ML cs.CV cs.LG cs.RO cs.SY eess.SY 版本更新

Gaussian Mixture Reduction Using Reverse Kullback-Leibler Divergence

基于反向Kullback-Leibler散度的高斯混合减少

Tohid Ardeshiri, Umut Orguner, Emre Özkan

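AI summary This paper proposes a greedy mixture reduction algorithm that prunes and merges mixture components based on the Kullback-Leibler divergence, derives analytical approximations for computational efficiency, and compares favorably with existing methods on simulated and real-world data.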

详情
AI中文摘要

我们提出了一种贪心的混合减少算法,能够基于Kullback-Leibler散度(KLD)剪枝和合并混合成分。该算法不同于已知的Runnalls基于KLD的方法,因为它不限于合并操作。剪枝能力(除合并外)使算法在减少过程中能够保留原始混合的峰值。通过分析近似方法来避免KLD的计算不可行性,从而得到一个计算高效的算法。所提出的算法在两个数值示例中与Runnalls和Williams的方法进行比较,使用模拟和实际数据。结果表明,所提出的方法在性能和计算复杂度方面使其成为现有混合减少方法的高效替代方案。

英文摘要

We propose a greedy mixture reduction algorithm which is capable of pruning mixture components as well as merging them based on the Kullback-Leibler divergence (KLD). The algorithm is distinct from the well-known Runnalls' KLD based method since it is not restricted to merging operations. The capability of pruning (in addition to merging) gives the algorithm the ability of preserving the peaks of the original mixture during the reduction. Analytical approximations are derived to circumvent the computational intractability of the KLD which results in a computationally efficient method. The proposed algorithm is compared with Runnalls' and Williams' methods in two numerical examples, using both simulated and real world data. The results indicate that the performance and computational complexity of the proposed approach make it an efficient alternative to existing mixture reduction methods.
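The two basic operations named above, merging and pruning, can be illustrated on one-dimensional Gaussian mixtures. This is a hedged sketch only. The merge shown is the standard moment-preserving merge, which keeps the total weight, mean, and variance of the pair. The paper's divergence-based rule for choosing which components to merge or prune is not reproduced, and the function names are hypothetical.

```python
def merge_gaussians(w1, mu1, var1, w2, mu2, var2):
    """Moment-preserving merge of two weighted 1D Gaussian components."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    # The merged variance includes the spread of the two means around mu.
    var = (w1 * (var1 + (mu1 - mu) ** 2) + w2 * (var2 + (mu2 - mu) ** 2)) / w
    return w, mu, var

def prune_component(weights, index):
    """Drop one component and renormalize the remaining weights to sum to 1."""
    kept = [w for i, w in enumerate(weights) if i != index]
    total = sum(kept)
    return [w / total for w in kept]

merged = merge_gaussians(0.5, -1.0, 1.0, 0.5, 1.0, 1.0)
pruned = prune_component([0.2, 0.3, 0.5], 0)
```

Merging two well-separated components yields one broad component located between them, whereas pruning leaves the surviving components, and hence their peaks, where they were.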

1412.2291 2026-06-04 stat.CO cs.CG cs.CV cs.NA math.NA 版本更新

Adjusted least squares fitting of algebraic hypersurfaces

修正的最小二乘法拟合代数超曲面

Konstantin Usevich, Ivan Markovsky

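AI summary This paper uses adjusted least squares to fit algebraic hypersurfaces to point sets in Euclidean space; the bias of ordinary least squares is corrected by constructing a bias-corrected moment matrix, and the algorithm for computing the estimator is improved.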

Comments 30 pages, 10 figures

详情
AI中文摘要

我们考虑用代数超曲面拟合欧几里得空间中点集的问题。假设真实超曲面由多项式方程描述,其上的点受均值为零的独立高斯噪声干扰,我们估计真实多项式方程的系数。修正的最小二乘估计器考虑了普通最小二乘估计器中的偏倚。该估计器基于构造一个准Hankel矩阵,这是一个偏倚修正的矩矩阵。对于未知噪声方差的情况,估计器定义为多项式特征值问题的解。本文提出了关于修正最小二乘估计器不变性性质的新结果,并改进了计算估计器的算法,适用于任意多项式方程中的单项式集。

英文摘要

We consider the problem of fitting a set of points in Euclidean space by an algebraic hypersurface. We assume that points on a true hypersurface, described by a polynomial equation, are corrupted by zero mean independent Gaussian noise, and we estimate the coefficients of the true polynomial equation. The adjusted least squares estimator accounts for the bias present in the ordinary least squares estimator. The adjusted least squares estimator is based on constructing a quasi-Hankel matrix, which is a bias-corrected matrix of moments. For the case of unknown noise variance, the estimator is defined as a solution of a polynomial eigenvalue problem. In this paper, we present new results on invariance properties of the adjusted least squares estimator and an improved algorithm for computing the estimator for an arbitrary set of monomials in the polynomial equation.
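For context, the ordinary (unadjusted) algebraic least squares fit that the adjusted estimator improves on can be sketched for the simplest hypersurface, a circle in the plane. This is an illustrative assumption-laden example: the monomial set {x² + y², x, y, 1} and the function name `fit_circle_algebraic` are chosen here for illustration, and the paper's bias correction is not implemented.

```python
import numpy as np

def fit_circle_algebraic(points):
    """Fit a*(x^2 + y^2) + b*x + c*y + d = 0 by ordinary algebraic least squares.

    The coefficient vector is the right singular vector of the monomial
    matrix with the smallest singular value. No bias correction is applied.
    """
    x, y = points[:, 0], points[:, 1]
    D = np.column_stack([x ** 2 + y ** 2, x, y, np.ones_like(x)])
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    a, b, c, d = Vt[-1]
    # Convert the algebraic coefficients to center and radius.
    cx, cy = -b / (2 * a), -c / (2 * a)
    r = np.sqrt(cx ** 2 + cy ** 2 - d / a)
    return cx, cy, r
```

On noise-free points this recovers the circle exactly; with noisy points the estimate becomes biased, which is the effect the adjusted estimator is designed to account for.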

1508.04467 2026-06-04 cs.CV cs.IT cs.LG cs.NA math.IT math.NA stat.ML 版本更新

Robust Subspace Clustering via Smoothed Rank Approximation

通过平滑秩近似实现鲁棒子空间聚类

Zhao Kang, Chong Peng, Qiang Cheng

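AI summary This paper proposes a rank approximation based on the log-determinant for subspace clustering, improving accuracy and handling errors and noise effectively.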

Comments Journal, code is available

详情
Journal ref
IEEE Signal Processing Letters, 22(2015)2088-2092
AI中文摘要

本文提出基于对数-行列式秩近似的方法,用于子空间聚类,以提高精度并有效处理误差和噪声。矩阵秩最小化受线性约束在许多应用领域中出现,从信号处理到机器学习。核范数是该问题的凸松弛,可以在某些受限且理论有趣的条件下精确恢复秩。然而,对于许多现实应用,核范数近似到秩函数只能产生远离最优解的结果。为了寻求比核范数更准确的解决方案,本文提出基于对数-行列式的秩近似方法。我们考虑将此秩近似应用于子空间聚类应用。我们的框架可以建模不同类型的误差和噪声。开发了有效的优化策略,并具有理论保证,以收敛到 stationary 点。所提出的方法在人脸识别和运动分割任务上相比最先进的子空间聚类算法表现出有希望的结果。

英文摘要

Matrix rank minimizing subject to affine constraints arises in many application areas, ranging from signal processing to machine learning. Nuclear norm is a convex relaxation for this problem which can recover the rank exactly under some restricted and theoretically interesting conditions. However, for many real-world applications, nuclear norm approximation to the rank function can only produce a result far from the optimum. To seek a solution of higher accuracy than the nuclear norm, in this paper, we propose a rank approximation based on Logarithm-Determinant. We consider using this rank approximation for subspace clustering application. Our framework can model different kinds of errors and noise. Effective optimization strategy is developed with theoretical guarantee to converge to a stationary point. The proposed method gives promising results on face clustering and motion segmentation tasks compared to the state-of-the-art subspace clustering algorithms.
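The difference between the rank, the nuclear norm, and a log-determinant surrogate can be seen on a small example. The sketch below assumes one common form of the surrogate, logdet(I + XᵀX) = Σ log(1 + σ_i²), where σ_i are the singular values of X. Whether this matches the paper's exact definition is an assumption, and the function name `rank_surrogates` is hypothetical.

```python
import numpy as np

def rank_surrogates(X):
    """Return the rank, the nuclear norm, and a log-determinant surrogate of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return {
        "rank": int(np.linalg.matrix_rank(X)),
        "nuclear": float(s.sum()),               # grows linearly with each sigma
        "logdet": float(np.log1p(s ** 2).sum()),  # grows only logarithmically
    }

X = np.diag([10.0, 1.0, 0.0])
vals = rank_surrogates(X)
```

A single large singular value dominates the nuclear norm, while the logarithm damps its contribution. This is one intuition for why such a surrogate can track the rank more closely.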

1503.01993 2026-06-04 cs.CV cs.NA math.NA 版本更新

Tomographic Image Reconstruction using Training images

利用训练图像进行断层图像重建

Sara Soltani, Martin S. Andersen, Per Christian Hansen

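AI summary This paper proposes a tomographic image reconstruction algorithm that uses training images: a nonnegative dictionary is learned by regularized nonnegative matrix factorization and the reconstruction is sparsely represented in it, which reduces computational complexity and performs well in low-dose settings.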

Comments 25 pages, 12 figures

详情
AI中文摘要

我们描述并检验了一种断层图像重建算法,其中解决方案的先验知识以训练图像的形式提供。我们首先基于训练图像中的原型元素构建非负字典,该问题被公式化为正则化非负矩阵分解。将字典作为先验信息纳入凸重建问题中,然后找到字典中的稀疏表示解。字典应用于图像的非重叠块,相较于其他算法减少了计算复杂度。计算实验澄清了模型参数和正则化参数的选择及相互作用,并表明在少数投影低剂量设置下,我们的算法与总变分正则化竞争,并倾向于包含更多纹理和正确边缘。

英文摘要

We describe and examine an algorithm for tomographic image reconstruction where prior knowledge about the solution is available in the form of training images. We first construct a nonnegative dictionary based on prototype elements from the training images; this problem is formulated as a regularized non-negative matrix factorization. Incorporating the dictionary as a prior in a convex reconstruction problem, we then find an approximate solution with a sparse representation in the dictionary. The dictionary is applied to non-overlapping patches of the image, which reduces the computational complexity compared to other algorithms. Computational experiments clarify the choice and interplay of the model parameters and the regularization parameters, and we show that in few-projection low-dose settings our algorithm is competitive with total variation regularization and tends to include more texture and more correct edges.

1507.08847 2026-06-04 cs.LG cs.CV cs.NA math.NA 版本更新

A novel multivariate performance optimization method based on sparse coding and hyper-predictor learning

一种基于稀疏编码和超预测器学习的新型多变量性能优化方法

Jiachen Yang, Zhiyong Ding, Fei Guo, Huogen Wang, Nick Hughes

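AI summary This paper proposes a new method that optimizes multivariate performance measures with sparse coding and hyper-predictor learning, solving a joint optimization problem that minimizes the reconstruction error, the sparsity of the codes, and an upper bound of the complex loss function.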

详情
AI中文摘要

本文研究了多变量性能度量的优化问题,提出了一种新算法。与传统机器学习方法不同,本文研究如何学习有效超预测器以处理数据点元组,从而最小化对应于多变量性能度量的复杂损失函数。我们提出将数据点元组通过字典转换为稀疏码元组,然后应用线性函数比较稀疏码与给定候选类别标签。为了学习字典、稀疏码和线性函数参数,我们提出一个联合优化问题。在此问题中,同时最小化稀疏码的重建误差和稀疏性,以及复杂损失函数的上界。此外,损失函数的上界通过稀疏码和线性函数参数近似。为优化此问题,我们开发了一种基于下降梯度方法的迭代算法,交替学习稀疏码和超预测器参数。在一些基准数据集上的实验结果表明,所提方法优于其他最先进的算法。

英文摘要

In this paper, we investigate the problem of optimizing multivariate performance measures, and propose a novel algorithm for it. Unlike traditional machine learning methods, which optimize simple loss functions to learn a prediction function, the problem studied in this paper is how to learn an effective hyper-predictor for a tuple of data points, so that a complex loss function corresponding to a multivariate performance measure can be minimized. We propose to represent the tuple of data points as a tuple of sparse codes via a dictionary, and then apply a linear function to compare a sparse code against a given candidate class label. To learn the dictionary, sparse codes, and parameter of the linear function, we propose a joint optimization problem. In this problem, both the reconstruction error and sparsity of the sparse codes, and the upper bound of the complex loss function, are minimized. Moreover, the upper bound of the loss function is approximated by the sparse codes and the linear function parameter. To optimize this problem, we develop an iterative algorithm based on gradient descent methods to learn the sparse codes and hyper-predictor parameter alternately. Experimental results on some benchmark data sets show the advantage of the proposed method over other state-of-the-art algorithms.

1506.08110 2026-06-04 cs.CV cs.NA math.NA 版本更新

Nonnegative Matrix Factorization applied to reordered pixels of single images based on patches to achieve structured nonnegative dictionaries

非负矩阵分解应用于基于块的单图像重新排序像素以实现结构非负字典

Richard M. Charles, Kye M. Taylor, James H. Curry

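AI summary This paper applies nonnegative matrix factorization to patch-based reordered pixels of a single image to build structured nonnegative dictionaries; comparing SVD and NMF shows that NMF preserves the sign structure and local detail of the original image.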

Comments 34 pages, 15 figures, 2 tables

详情
AI中文摘要

近年来计算能力的提升使得处理和分析各种领域的大型数据集成为可能。通常分析需要创建低秩近似以实现高效存储。本文提出并分析了一种新颖方法,通过将非负矩阵分解应用于单自然图像的重新排序像素来创建非负、结构化的字典。我们基于块重新排序像素并提出了一般方法。我们研究了当使用奇异值分解(SVD)和非负矩阵分解(NMF)作为低秩近似时的方法。峰值信噪比(PSNR)和均值结构相似性指数(MSSIM)用于评估算法。我们报告说,虽然SVD提供最佳重建,但其向量字典丢失了原始图像的符号结构和局部细节。相比之下,使用NMF生成的字典保留了原始图像矩阵的符号结构,并提供了一个非负、基于部分的字典。

英文摘要

Recent improvements in computing allow for the processing and analysis of very large datasets in a variety of fields. Often the analysis requires the creation of low-rank approximations to the datasets, leading to efficient storage. This article presents and analyzes a novel approach for creating nonnegative, structured dictionaries using NMF applied to reordered pixels of single, natural images. We reorder the pixels based on patches and present our approach in general. We investigate our approach when using the Singular Value Decomposition (SVD) and Nonnegative Matrix Factorizations (NMF) as low-rank approximations. Peak Signal-to-Noise Ratio (PSNR) and Mean Structural Similarity Index (MSSIM) are used to evaluate the algorithm. We report that while the SVD provides the best reconstructions, its dictionary of vectors loses both the sign structure of the original image and details of localized image content. In contrast, the dictionaries produced using NMF preserve the sign structure of the original image matrix and offer a nonnegative, parts-based dictionary.
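The pipeline described above, reordering pixels by patches and then factorizing, can be sketched as follows. This is a minimal illustration with several assumptions: patches are non-overlapping and square, the image dimensions are divisible by the patch size, and the factorization uses the classic multiplicative updates for the Frobenius-norm objective. The function names are hypothetical, and the paper's exact reordering and NMF variant may differ.

```python
import numpy as np

def image_to_patch_matrix(img, p):
    """Reorder pixels so that each column holds one p-by-p patch (row-major).

    Assumes img.shape is divisible by p in both dimensions.
    """
    h, w = img.shape
    blocks = img.reshape(h // p, p, w // p, p)
    return blocks.transpose(1, 3, 0, 2).reshape(p * p, -1)

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V as W @ H with W, H >= 0.

    Multiplicative updates for the Frobenius-norm objective (assumed variant).
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

img = np.arange(64, dtype=float).reshape(8, 8)
V = image_to_patch_matrix(img, 4)   # 16 pixels per patch, 4 patches
W, H = nmf(V, rank=2)
```

The columns of `W` play the role of the nonnegative dictionary atoms, and each column of `H` holds the nonnegative coefficients of one patch.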

1412.2700 2026-06-04 math.NA cs.CV cs.NA 版本更新

Subspace based low rank and joint sparse matrix recovery

基于子空间的低秩和联合稀疏矩阵恢复

Sampurna Biswas, Sunrita Poddar, Soura Dasgupta, Raghuraman Mudumbai, Mathews Jacob

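AI summary This paper studies the recovery of low-rank and jointly sparse matrices from undersampled measurements, proposing to measure each snapshot with a different measurement matrix to reduce the total number of measurements, with application to dynamic MRI recovery.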

Comments 5 pages, 5 figures, Asilomar 2014 conference submission

详情
AI中文摘要

我们考虑从欠采样的列测量中恢复低秩和联合稀疏矩阵的问题。该问题在高时空分辨率动态MRI数据恢复中具有重要相关性,其中矩阵的每一列对应图像时间序列中的一个帧,由于帧高度相关,矩阵具有高度低秩性。此外,在适当变换/框架域(如小波、梯度)中,矩阵的非零位置在不同帧中大致相同,其支持集的超集可以安全地假设为联合稀疏。与经典多测量向量(MMV)设置不同,我们考虑每个快照使用不同的测量矩阵进行测量。我们证明这种方法可以减少总测量数,特别是当矩阵的秩远小于其稀疏性时。在动态成像的实验中,该方法在实现自由呼吸心脏MRI中非常有用。

英文摘要

We consider the recovery of a low-rank and jointly sparse matrix from undersampled measurements of its columns. This problem is highly relevant in the recovery of dynamic MRI data with high spatio-temporal resolution, where each column of the matrix corresponds to a frame in the image time series; the matrix is highly low-rank since the frames are highly correlated. Similarly, the non-zero locations of the matrix in appropriate transform/frame domains (e.g. wavelet, gradient) are roughly the same in different frames. The superset of the support can be safely assumed to be jointly sparse. Unlike the classical multiple measurement vector (MMV) setup that measures all the snapshots using the same matrix, we consider each snapshot to be measured using a different measurement matrix. We show that this approach reduces the total number of measurements, especially when the rank of the matrix is much smaller than its sparsity. Our experiments in the context of dynamic imaging show that this approach is very useful in realizing free-breathing cardiac MRI.

1306.1392 2026-06-04 math.NA cs.CV cs.NA 版本更新

PyHST2: an hybrid distributed code for high speed tomographic reconstruction with iterative reconstruction and a priori knowledge capabilities

PyHST2:一种混合分布式代码,用于高速断层扫描重建,支持迭代重建和先验知识能力

Alessandro Mirone, Emmanuelle Gouillart, Emmanuel Brun, Paul Tafforeau, Jerome Kieffer

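AI summary PyHST2 is a hybrid distributed code for high-speed tomographic reconstruction that supports iterative reconstruction with a priori knowledge and sustains the high data flow of third-generation synchrotron facilities.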

详情
AI中文摘要

我们介绍了PyHST2代码,该代码在ESRF用于相位对比和吸收断层扫描。该代码采用了分布式和流水线架构,以支持第三代同步辐射设施的高数据流(每实验10太字节)。代码实现了默认的滤波反投影重建,以及结合先验知识的迭代重建技术。这些技术用于提高重建质量或减少所需数据量以达到目标质量。实现的先验知识技术基于总变分惩罚和一种新的凸函数,该函数基于重叠块。我们详细介绍了不同方法及其实现,代码以免费许可证发布。我们还提供了在没有地面真实数据的情况下估计先验技术最佳参数值的方法。

英文摘要

We present the PyHST2 code, which is in service at ESRF for phase-contrast and absorption tomography. This code has been engineered to sustain the high data flow typical of third-generation synchrotron facilities (10 terabytes per experiment) by adopting a distributed and pipelined architecture. Besides a default filtered backprojection reconstruction, the code implements iterative reconstruction techniques with a priori knowledge. The latter are used to improve the reconstruction quality or to reduce the data volume required to reach a given quality goal. The implemented a priori knowledge techniques are based on total variation penalisation and on a recently found convex functional based on overlapping patches. We give details of the different methods and their implementations; the code is distributed under a free license. We provide methods for estimating, in the absence of ground-truth data, the optimal parameter values for the a priori techniques.

1305.1256 2026-06-04 math.NA cs.CV cs.NA 版本更新

A Convex Functional for Image Denoising based on Patches with Constrained Overlaps and its vectorial application to Low Dose Differential Phase Tomography

基于具有受限重叠的块的凸函数的图像去噪方法及其在低剂量差分相位断层扫描中的向量应用

Alessandro Mirone, Emmanuel Brun, Paola Coan

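AI summary This paper proposes a new convex functional for image denoising that adds a term enforcing similarity between overlapping patches; solved with FISTA-accelerated iterations, it gives efficient and robust reconstructions in low-dose differential phase tomography.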

详情
AI中文摘要

我们通过字典学习技术解决图像去噪问题,构建了一种新的凸函数形式。该函数包含通常的稀疏诱导项和保真项,以及一个新的项,诱导重叠区域中块之间的相似性。该函数依赖于两个自由正则化参数:一个乘以块基函数系数的稀疏诱导L1范数系数,另一个乘以重叠区域中块差异的L2范数系数。通过应用迭代近端梯度下降法与FISTA加速求解。在断层扫描重建中,通过在每次迭代步骤中应用解的投影及其误差反投影来计算梯度。我们在合成数据上研究了解的质量,作为正则化参数和噪声函数的函数,其中解是先验已知的。我们对实验数据应用该方法,针对差分相位断层扫描,采用一种原始方法,即使用向量块,每个块有两个成分:每个梯度成分一个。所得到的算法在ESRF断层扫描重建代码PyHST中实现,结果显示出稳健、高效,并且适合在医学断层扫描中显著减少所需剂量和投影数量。

英文摘要

We solve the image denoising problem with a dictionary learning technique by writing a convex functional of a new form. Besides the usual sparsity-inducing term and fidelity term, this functional contains a new term which induces similarity between overlapping patches in the overlap regions. The functional depends on two free regularization parameters: a coefficient multiplying the sparsity-inducing $L_{1}$ norm of the patch basis function coefficients, and a coefficient multiplying the $L_{2}$ norm of the differences between patches in the overlapping regions. The solution is found by applying the iterative proximal gradient descent method with FISTA acceleration. In the case of tomography reconstruction, we calculate the gradient by applying projection of the solution and backprojection of its error at each iterative step. We study the quality of the solution, as a function of the regularization parameters and noise, on synthetic data for which the solution is known a priori. We apply the method to experimental data in the case of Differential Phase Tomography. For this case we use an original approach which consists in using vectorial patches, each patch having two components: one for each gradient component. The resulting algorithm, implemented in the ESRF tomography reconstruction code PyHST, proves to be robust, efficient, and well adapted to strongly reducing the required dose and the number of projections in medical tomography.

1506.00060 2026-06-04 cs.CV cs.NA math.NA 版本更新

A Three-stage Approach for Segmenting Degraded Color Images: Smoothing, Lifting and Thresholding (SLaT)

一种用于退化彩色图像分割的三阶段方法:平滑、提升和阈值化(SLaT)

Xiaohao Cai, Raymond Chan, Mila Nikolova, Tieyong Zeng

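AI summary This paper proposes the three-stage SLaT method for multiphase segmentation of images degraded by noise, information loss, and blur. It smooths the image with a convex variational model, preserves color information through dimension lifting, and segments by multichannel thresholding, improving segmentation quality and efficiency.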

Comments 19 pages

详情
AI中文摘要

本文提出了一种SLaT(平滑、提升和阈值化)方法,用于多相分割受噪声、信息损失和模糊影响的彩色图像。第一阶段应用凸变分模型的变体对每个通道进行平滑处理,证明该模型在不同退化下具有唯一解。第二阶段通过维度提升,将恢复图像与其在二次颜色空间中的变换组合成向量值图像,以保留足够的信息进行分割。第三阶段通过多通道阈值化找到分割结果。相数仅在最后阶段需要,用户可自由选择或更改而无需重新解决前阶段。实验表明,SLaT方法在分割质量和CPU时间上优于其他先进方法。

英文摘要

In this paper, we propose a SLaT (Smoothing, Lifting and Thresholding) method with three stages for multiphase segmentation of color images corrupted by different degradations: noise, information loss, and blur. At the first stage, a convex variant of the Mumford-Shah model is applied to each channel to obtain a smooth image. We show that the model has a unique solution under the different degradations. In order to properly handle the color information, the second stage is dimension lifting, where we consider a new vector-valued image composed of the restored image and its transform in a secondary color space with additional information. This ensures that even if the first color space has highly correlated channels, we still have enough information to give good segmentation results. In the last stage, we apply multichannel thresholding to the combined vector-valued image to find the segmentation. The number of phases is only required in the last stage, so users can choose or change it without solving the previous stages again. Experiments demonstrate that our SLaT method gives excellent results in terms of segmentation quality and CPU time in comparison with other state-of-the-art segmentation methods.

1504.00905 2026-06-04 math.OC cs.CV cs.LG cs.SY eess.SY 版本更新

Robust Anomaly Detection Using Semidefinite Programming

使用半定规划进行鲁棒异常检测

Jose A. Lopez, Octavia Camps, Mario Sznaier

AI总结 本文提出基于多项式优化和矩方法的新型异常检测方法,仅需正常状态特征统计矩信息,相较于Parzen窗口和1类SVM等方法表现更优,且能简洁描述正常状态,简化高维数据集的异常检测问题。

Comments 13 pages, 11 figures

详情
AI中文摘要

本文提出了一种基于多项式优化和矩方法的新方法,用于异常检测问题。所提出的方法仅需了解感兴趣特征的正常状态分布的统计矩信息,并在性能上优于现有方法(如Parzen窗口和1类SVM)。此外,它还提供了对正常状态的简洁描述,因此在处理高维数据集时,导致异常检测问题显著简化。

英文摘要

This paper presents a new approach, based on polynomial optimization and the method of moments, to the problem of anomaly detection. The proposed technique only requires information about the statistical moments of the normal-state distribution of the features of interest and compares favorably with existing approaches (such as Parzen windows and 1-class SVM). In addition, it provides a succinct description of the normal state. Thus, it leads to a substantial simplification of the anomaly detection problem when working with higher dimensional datasets.

1505.07690 2026-06-04 math.NA cs.CV cs.NA 版本更新

Invertible Orientation Scores of 3D Images

三维图像的可逆取向分数

Michiel Janssen, Remco Duits, Marcel Breeuwer

AI总结 本文提出三维取向分数,用于增强和检测噪声图像中的细长结构,采用可逆相干态变换和球面谐波变换实现高效计算,展示初步应用结果。

Comments The SSVM 2015 version published in LNCS contains a mistake (switched notation for the spherical angles) that is corrected in this arXiv version

详情
AI中文摘要

在噪声图像数据中增强和检测细长结构对于许多生物医学应用至关重要。为处理2D图像中的复杂交叉结构,引入了2D取向分数,已在多种应用中表现出色。本文将其扩展到3D取向分数。首先,通过可逆相干态类型的变换从给定数据集构建取向分数。为此,我们引入了2D蛋糕小波的3D版本,这些是能同时检测取向结构和取向边缘的复小波。为了高效实现小波创建的不同步骤,我们使用球面谐波变换。最后,我们展示了一些3D取向分数的实际应用初步结果。

英文摘要

The enhancement and detection of elongated structures in noisy image data is relevant for many biomedical applications. To handle complex crossing structures in 2D images, 2D orientation scores were introduced, which already showed their use in a variety of applications. Here we extend this work to 3D orientation scores. First, we construct the orientation score from a given dataset, which is achieved by an invertible coherent state type of transform. For this transformation we introduce 3D versions of the 2D cake-wavelets, which are complex wavelets that can simultaneously detect oriented structures and oriented edges. For efficient implementation of the different steps in the wavelet creation we use a spherical harmonic transform. Finally, we show some first results of practical applications of 3D orientation scores.

1312.6208 2026-06-04 math.NA cs.CV cs.NA 版本更新

Total variation with overlapping group sparsity for image deblurring under impulse noise

总变分与重叠组稀疏性用于在脉冲噪声下的图像去模糊

Gang Liu, Ting-Zhu Huang, Jun Liu, Xiao-Guang Lv

AI总结 本文提出一种结合总变分与重叠组稀疏性的模型,用于在脉冲噪声下恢复模糊图像,通过引入盒约束和ADMM框架下的高效算法,有效缓解阶梯效应并提升PSNR和相对误差。

Comments 22 pages, 57 figures, submitted

详情
Journal ref
PLOS ONE 2015 10(4): e0122562
AI中文摘要

总变分(TV)正则化方法是图像去模糊中保留边缘的有效方法。然而,基于TV的解决方案通常存在阶梯效应。本文为了缓解阶梯效应,提出了一种新的模型,用于恢复受脉冲噪声影响的模糊图像。该模型包含一个ℓ1保真项和一个总变分与重叠组稀疏性(OGS)正则化项。此外,我们对所提模型施加盒约束以获得更准确的解决方案。提出了一种高效的算法,在交替方向乘子法(ADMM)框架下求解该模型。我们使用一个内循环,嵌套在主要化最小化(MM)迭代中,用于所提方法的子问题。与其他方法相比,数值结果表明,所提方法在避免阶梯效应和PSNR以及相对误差(ReE)方面显著提高了恢复质量。

英文摘要

The total variation (TV) regularization method is effective at preserving edges in image deblurring. However, TV-based solutions usually suffer from staircase effects. In this paper, in order to alleviate the staircase effect, we propose a new model for restoring blurred images with impulse noise. The model consists of an $\ell_1$-fidelity term and a TV with overlapping group sparsity (OGS) regularization term. Moreover, we impose a box constraint on the proposed model to obtain more accurate solutions. An efficient and effective algorithm is proposed to solve the model under the framework of the alternating direction method of multipliers (ADMM). We use an inner loop which is nested inside the majorization minimization (MM) iteration for the subproblem of the proposed method. Compared with other methods, numerical results illustrate that the proposed method can significantly improve the restoration quality, both in avoiding staircase effects and in terms of peak signal-to-noise ratio (PSNR) and relative error (ReE).

1505.01599 2026-06-04 cs.CV cs.NA math.NA 版本更新

Filter characteristics in image decomposition with singular spectrum analysis

图像分解中的奇异谱分析滤波特性

Kenji Kume, Naoko Nose-Togawa

AI总结 本文研究了奇异谱分析在多维数据分解中的滤波特性,指出自适应生成的滤波器具有对称性,用于图像去噪。

详情
AI中文摘要

奇异谱分析是一种非参数时间序列谱分解方法,可通过滤波解释扩展至多维数据分解。本文指出,当应用于多维数据时,自适应生成的滤波器表现出对称性,源于滞后协方差矩阵的双对称性。滞后协方差矩阵的特征向量为对称或反对称,对于二维图像数据,这导致了具有偶阶或奇阶导数的微分型滤波器。主导滤波器为平滑滤波器,反映图像低频分量的主导地位。其他滤波器对应于带通或高通滤波器,用于边缘增强或噪声去除。本文简要讨论了分解对图像去噪的意义。

英文摘要

Singular spectrum analysis is developed as a nonparametric spectral decomposition of a time series. It can be easily extended to the decomposition of multidimensional lattice-like data through the filtering interpretation. In this viewpoint, the singular spectrum analysis can be understood as the adaptive and optimal generation of the filters and their two-step point-symmetric operation to the original data. In this paper, we point out that, when applied to the multidimensional data, the adaptively generated filters exhibit symmetry properties resulting from the bisymmetric nature of the lag-covariance matrices. The eigenvectors of the lag-covariance matrix are either symmetric or antisymmetric, and for the 2D image data, these lead to the differential-type filters with even- or odd-order derivatives. The dominant filter is a smoothing filter, reflecting the dominance of low-frequency components of the photo images. The others are the edge-enhancement or the noise filters corresponding to the band-pass or the high-pass filters. The implication of the decomposition to the image denoising is briefly discussed.
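
The symmetric-or-antisymmetric claim for the eigenvectors follows from a short, standard linear-algebra argument, sketched here for a generic bisymmetric matrix. The symbols $C$, $J$, $v$ and $\lambda$ are notation introduced for this note and are not taken from the paper.

```latex
Let $C \in \mathbb{R}^{n \times n}$ be bisymmetric: $C^{\top} = C$ and $JCJ = C$,
where $J$ is the exchange matrix ($J_{ij} = 1$ if $i + j = n + 1$, else $0$), so $J^{2} = I$.
If $Cv = \lambda v$, then
\[
  C(Jv) = (JCJ)(Jv) = JC(J^{2}v) = JCv = \lambda\,(Jv),
\]
so $Jv$ is also an eigenvector for $\lambda$. When $\lambda$ is simple, $Jv = cv$ with
$c^{2} = 1$ (because $J^{2} = I$), hence
\[
  Jv = v \quad (\text{symmetric}) \qquad \text{or} \qquad Jv = -v \quad (\text{antisymmetric}).
\]
```

For a repeated eigenvalue the eigenspace is invariant under $J$, so a basis of symmetric and antisymmetric vectors can still be chosen.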

1505.00193 2026-06-04 cs.CV cs.NA math.AP math.NA 版本更新

Segmentation and Restoration of Images on Surfaces by Parametric Active Contours with Topology Changes

通过带拓扑变化的参数主动轮廓实现表面图像分割与恢复

Heike Benninghoff, Harald Garcke

AI总结 本文提出一种用于二维表面图像分割与恢复的新方法,通过参数化主动轮廓模型扩展到表面图像,并利用数值方案高效计算分割和去噪结果,同时检测并处理轮廓的拓扑变化。

详情
AI中文摘要

本文提出了一种新的方法,用于二维表面图像的分割与恢复。主动轮廓模型被扩展到表面图像上,表面上的演化曲线通过参数方法进行数学描述。对于图像恢复,扩散方程在后处理步骤中在各个区域内解出,采用Neumann边界条件。数值方案允许高效计算表面图像的分割和去噪版本。此外,通过快速子程序检测并处理演化曲线的拓扑变化。最后,展示了在不同人工和真实图像上应用所开发方法的多个实验。

英文摘要

In this article, a new method for segmentation and restoration of images on two-dimensional surfaces is given. Active contour models for image segmentation are extended to images on surfaces. The evolving curves on the surfaces are mathematically described using a parametric approach. For image restoration, a diffusion equation with Neumann boundary conditions is solved in a postprocessing step in the individual regions. Numerical schemes are presented which allow segmentations and denoised versions of images on surfaces to be computed efficiently. Topology changes of the evolving curves are also detected and handled using a fast subroutine. Finally, several experiments are presented in which the developed methods are applied to different artificial and real images defined on different surfaces.

1504.07643 2026-06-04 math.NA cs.CV cs.NA 版本更新

A novel variational model for image registration using Gaussian curvature

基于高斯曲率的图像配准新变分模型

Mazlinda Ibrahim, Ke Chen, Carlos Brito-Loeza

AI总结 本文提出基于高斯曲率的图像配准变分模型,通过增广拉格朗日方法求解,优于线性曲率、均曲率及 diffeomorphic demon 模型,在鲁棒性和准确性上表现更佳。

Comments 23 pages, 5 figures. Key words: Image registration, Non-parametric image registration, Regularisation, Gaussian curvature, surface mapping

详情
AI中文摘要

图像配准是许多图像处理应用中的重要任务,旨在通过比较、组合或叠加对齐两幅或更多图像。通过构建最优变换使模板图像与给定参考图像相似。尽管已有许多模型,但设计能建模大且平滑变形场的模型仍具挑战性。本文提出一种新的变分模型,利用高斯曲率作为正则化项。该模型受几何处理中表面修复工作启发。采用增广拉格朗日方法提供有效数值求解器。数值实验表明,新模型在鲁棒性和准确性方面优于基于线性曲率、均曲率及 diffeomorphic demon 模型的三种竞争模型。

英文摘要

Image registration is an important task in many image processing applications. It aims to align two or more images so that useful information can be extracted through comparison, combination or superposition. This is achieved by constructing an optimal transformation which ensures that the template image becomes similar to a given reference image. Although many models exist, designing a model capable of modelling large and smooth deformation fields continues to pose a challenge. This paper proposes a novel variational model for image registration using the Gaussian curvature as a regulariser. The model is motivated by the surface restoration work in geometric processing [Elsey and Esedoglu, Multiscale Model. Simul., (2009), pp. 1549-1573]. An effective numerical solver is provided for the model using an augmented Lagrangian method. Numerical experiments show that the new model outperforms three competing models based on, respectively, a linear curvature [Fischer and Modersitzki, J. Math. Imaging Vis., (2003), pp. 81-85], the mean curvature [Chumchob, Chen and Brito, Multiscale Model. Simul., (2011), pp. 89-128] and the diffeomorphic demon model [Vercauteren et al., NeuroImage, (2009), pp. 61-72] in terms of robustness and accuracy.

1412.4044 2026-06-04 stat.ML cs.CV cs.NA math.NA math.OC 版本更新

Adaptive Stochastic Gradient Descent on the Grassmannian for Robust Low-Rank Subspace Recovery and Clustering

在Grassmannian上进行自适应随机梯度下降用于鲁棒低秩子空间恢复与聚类

Jun He, Yue Zhang

AI总结 本文提出GASG21算法,通过在Grassmann流形上进行自适应随机梯度下降,实现从大矩阵中鲁棒地恢复低秩子空间,并通过K子空间扩展实现对受损数据的聚类。

Comments 13 pages, 12 figures and 6 tables

详情
AI中文摘要

在本文中,我们提出了GASG21(Grassmannian Adaptive Stochastic Gradient for $L_{2,1}$ norm minimization),一种自适应随机梯度算法,用于从大规模矩阵中鲁棒地恢复低秩子空间。在存在列异常值的情况下,我们将批量模式矩阵$L_{2,1}$范数最小化问题(带有秩约束)重新公式化为受Grassmann流形约束的随机优化方法。对于每个观测数据向量,低秩子空间$\mathcal{S}$通过沿着Grassmannian的测地线进行梯度步长更新。为了加速随机梯度方法的收敛速度,我们选择通过利用连续梯度来自适应调整常数步长。此外,我们证明了在适当初始化的情况下,K子空间扩展K-GASG21可以将大量受损数据向量鲁棒地聚类到子空间的并集。在合成和真实数据上的数值实验展示了所提出算法在重柱异常值腐蚀下的效率和准确性。

英文摘要

In this paper, we present GASG21 (Grassmannian Adaptive Stochastic Gradient for $L_{2,1}$ norm minimization), an adaptive stochastic gradient algorithm to robustly recover the low-rank subspace from a large matrix. In the presence of column outliers, we reformulate the batch-mode matrix $L_{2,1}$ norm minimization problem with a rank constraint as a stochastic optimization problem constrained to the Grassmann manifold. For each observed data vector, the low-rank subspace $\mathcal{S}$ is updated by taking a gradient step along a geodesic of the Grassmannian. In order to accelerate the convergence rate of the stochastic gradient method, we choose to adaptively tune the constant step-size by leveraging the consecutive gradients. Furthermore, we demonstrate that with proper initialization, the K-subspaces extension, K-GASG21, can robustly cluster a large number of corrupted data vectors into a union of subspaces. Numerical experiments on synthetic and real data demonstrate the efficiency and accuracy of the proposed algorithms even under heavy column-outlier corruption.

1503.03004 2026-06-04 cs.CV cs.NA math.NA 版本更新

Fast and Robust Fixed-Rank Matrix Recovery

快速且鲁棒的固定秩矩阵恢复

German Ros, Julio Guerrero

AI总结 本文提出了一种高效且稳定的固定秩矩阵分解方法,通过几何和代数技术结合,避免了截断奇异值分解的瓶颈,提升了大规模问题的处理效率。

详情
AI中文摘要

我们解决了高效稀疏固定秩(S-FR)矩阵分解问题,即将受污染的矩阵M分解为未受污染的秩为r的矩阵L和稀疏异常值矩阵S。固定秩约束通常由所研究系统的物理限制决定。本文提出了一种准确且高效的S-FR分解方法,比现有方法更适用于大规模问题。我们的方法结合了几何和代数技术,避免了截断奇异值分解(TSVD)的瓶颈。相反,采用极分解来利用固定秩问题的流形结构,即两个Stiefel流形与一个SPD流形的乘积,从而获得更好的收敛性和稳定性。然后,闭合形式的投影器有助于加速方法的每次迭代。我们引入了一种新的用于SPD流形的快速投影器,并证明其有效性。进一步的加速是通过Nystrom方案实现的。在鲁棒光度立体和谱聚类的合成和真实数据实验中,我们的方法优于现有技术。

英文摘要

We address the problem of efficient sparse fixed-rank (S-FR) matrix decomposition, i.e., splitting a corrupted matrix $M$ into an uncorrupted matrix $L$ of rank $r$ and a sparse matrix of outliers $S$. Fixed-rank constraints are usually imposed by the physical restrictions of the system under study. Here we propose a method to perform accurate and very efficient S-FR decomposition that is more suitable for large-scale problems than existing approaches. Our method is a graceful combination of geometric and algebraic techniques, which avoids the bottleneck caused by the Truncated SVD (TSVD). Instead, a polar factorization is used to exploit the manifold structure of fixed-rank problems as the product of two Stiefel manifolds and an SPD manifold, leading to better convergence and stability. Then, closed-form projectors help to speed up each iteration of the method. We introduce a novel and fast projector for the $\text{SPD}$ manifold and a proof of its validity. Further acceleration is achieved using a Nystrom scheme. Extensive experiments with synthetic and real data in the context of robust photometric stereo and spectral clustering show that our proposals outperform the state of the art.

1503.06561 2026-06-04 math.NA cs.CV cs.NA 版本更新

A Comparative Analysis of Tensor Decomposition Models Using Hyper Spectral Image

基于超光谱图像的张量分解模型比较分析

Ankit Gupta, Ashish Oberoi

AI总结 本文比较了LMLRA、BTD和CPD三种张量分解模型在超光谱图像处理中的性能,发现BTD在分解结果上表现最佳。

Comments 7 pages, 3 figures, 1 table

详情
Journal ref
International Journal of Computer Science Trends and Technology (IJCST) V3(2): Page(5-11) Mar-Apr 2015. ISSN: 2347-8578
AI中文摘要

超光谱成像是一种遥感技术,广泛应用于材料识别、空间物体识别和行星勘探等领域。由于图像的多维特性,多向数组成为分析超光谱数据的一种可能方法,即张量。本文实现了三种分解模型LMLRA、BTD和CPD对样本数据的处理,结果证明块项分解(BTD)是分解超光谱图像为因子矩阵的最佳张量模型。

英文摘要

Hyper spectral imaging is a remote sensing technology with a variety of applications such as material identification, space object identification, planetary exploitation etc. It deals with capturing a continuum of images of the earth's surface from different angles. Due to the multidimensional nature of the image, multi-way arrays are one of the possible solutions for analyzing hyper spectral data. Such a multi-way array is called a tensor. Our approach applies three decomposition models, LMLRA, BTD and CPD, to the sample data in order to choose the best decomposition of the data set. The results show that Block Term Decomposition (BTD) is the best tensor model for decomposing the hyper spectral image into resultant factor matrices.

1305.3006 2026-06-04 cs.CV cs.NA math.NA 版本更新

Fast Linearized Alternating Direction Minimization Algorithm with Adaptive Parameter Selection for Multiplicative Noise Removal

快速线性化交替方向最小化算法与自适应参数选择用于乘性噪声去除

Dai-Qiang Chen, Li-Zhi Cheng

AI总结 本文提出基于线性化技术的两种快速算法,通过特殊偏差函数自适应选择正则化参数,同时恢复图像,在PSNR和计算时间上优于现有方法。

Comments 23 pages

详情
Journal ref
Journal of Computational and Applied Mathematics 257 (2014) 29-45
AI中文摘要

由于总变分(TV)具有边缘保持能力和低计算成本,具有TV正则化的变分模型在乘性噪声去除领域被广泛研究。成功应用的关键在于:正则化参数的最优选择,平衡数据保真项与TV正则化项;以及高效算法计算解。本文提出两种基于线性化技术的快速算法,能够同时估计正则化参数并恢复图像。在所提算法的迭代步骤中,正则化参数通过为乘性噪声定义的特殊偏差函数进行调整。在一定条件下证明了所提算法的收敛性,数值实验显示所提算法在PSNR值和计算时间上整体优于一些最先进的方法。

英文摘要

Owing to the edge preserving ability and low computational cost of the total variation (TV), variational models with the TV regularization have been widely investigated in the field of multiplicative noise removal. The key points of the successful application of these models lie in: the optimal selection of the regularization parameter which balances the data-fidelity term with the TV regularizer; the efficient algorithm to compute the solution. In this paper, we propose two fast algorithms based on the linearized technique, which are able to estimate the regularization parameter and recover the image simultaneously. In the iteration step of the proposed algorithms, the regularization parameter is adjusted by a special discrepancy function defined for multiplicative noise. The convergence properties of the proposed algorithms are proved under certain conditions, and numerical experiments demonstrate that the proposed algorithms overall outperform some state-of-the-art methods in the PSNR values and computational time.

1502.06220 2026-06-04 cs.CV cs.NA math.NA 版本更新

Boosting of Image Denoising Algorithms

图像去噪算法的提升

Yaniv Romano, Michael Elad

AI总结 本文提出一种通用递归算法提升图像去噪方法,通过SOS流程增强信号、操作去噪方法并减去前一步结果,研究其收敛性并展示在K-SVD等算法中的改进效果。

Comments 33 pages, 9 figures, 3 tables, submitted to SIAM Journal on Imaging Sciences

详情
AI中文摘要

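本文提出了一种通用递归算法,用于改进图像去噪方法。给定初始去噪图像,建议重复以下“SOS”流程:(i)增强(Strengthen):将上一次的去噪图像与退化的输入图像相加以增强信号;(ii)操作(Operate):对增强后的图像应用去噪方法;(iii)相减(Subtract):从恢复出的信号增强结果中减去上一次的去噪图像。本文针对K-SVD图像去噪及相关算法研究了该过程的收敛性。仍在K-SVD图像去噪的背景下,我们将SOS算法解释为一种缩小局部图像块建模与全局恢复任务之间差距的技术,从而带来性能提升。为探究SOS算法的理论来源,我们给出了基于图的解释:SOS递归更新实际上最小化了一个以图拉普拉斯为正则项、旨在对图像去噪的惩罚函数。我们在多种主流去噪方法(K-SVD、NLM、BM3D和EPLL)上演示了SOS提升算法,结果显示其有进一步改善去噪性能的趋势。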

英文摘要

In this paper we propose a generic recursive algorithm for improving image denoising methods. Given the initial denoised image, we suggest repeating the following "SOS" procedure: (i) (S)trengthen the signal by adding the previous denoised image to the degraded input image, (ii) (O)perate the denoising method on the strengthened image, and (iii) (S)ubtract the previous denoised image from the restored signal-strengthened outcome. The convergence of this process is studied for the K-SVD image denoising and related algorithms. Still in the context of K-SVD image denoising, we introduce an interesting interpretation of the SOS algorithm as a technique for closing the gap between the local patch-modeling and the global restoration task, thereby leading to improved performance. In a quest for the theoretical origin of the SOS algorithm, we provide a graph-based interpretation of our method, where the SOS recursive update effectively minimizes a penalty function that aims to denoise the image, while being regularized by the graph Laplacian. We demonstrate the SOS boosting algorithm for several leading denoising methods (K-SVD, NLM, BM3D, and EPLL), showing tendency to further improve denoising performance.
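
The three SOS steps in the abstract amount to the recursion x_{k+1} = f(y + x_k) - x_k, where y is the degraded input and f is the denoiser. The sketch below illustrates it on a 1-D signal. It is a toy under stated assumptions: the `smooth` denoiser is a hypothetical 3-tap weighted average standing in for K-SVD, NLM, BM3D or EPLL, and it was chosen because the recursion provably settles to a fixed point for this particular linear smoother.

```python
def smooth(signal):
    """Toy denoiser (assumption): 3-tap weighted average with replicated borders."""
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        out.append(0.2 * left + 0.6 * signal[i] + 0.2 * right)
    return out

def sos_boost(noisy, denoise, n_iter=50):
    """Run the SOS recursion x <- denoise(noisy + x) - x."""
    estimate = denoise(noisy)  # initial denoised signal
    for _ in range(n_iter):
        strengthened = [y + x for y, x in zip(noisy, estimate)]  # (S)trengthen
        restored = denoise(strengthened)                         # (O)perate
        estimate = [r - x for r, x in zip(restored, estimate)]   # (S)ubtract
    return estimate
```

For a general nonlinear denoiser the recursion need not converge; the abstract states that convergence is studied for K-SVD and related algorithms, and this sketch makes no claim beyond the toy smoother.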

1502.07743 2026-06-04 eess.SY cs.CV cs.SY math.OC 版本更新

Tracking an Object with Unknown Accelerations using a Shadowing Filter

利用阴影滤波器跟踪具有未知加速度的对象

Kevin Judd

AI总结 本文提出基于阴影滤波器的跟踪方法,用于处理未知随机加速度的物体跟踪问题,该方法高效且稳健,优于传统卡尔曼滤波。

Comments 20 pages, 5 figures

详情
AI中文摘要

一个常见的问题是跟踪物理对象,如机动船只、飞机、陆地车辆、航天器或携带无线设备的生物体。传感器数据通常有限,且范围或方位的观测不准确。此问题比跟踪弹道轨迹更困难,因为操作会影响未知且任意变化的加速度。尽管随机滤波或状态估计(卡尔曼滤波和粒子滤波)被广泛使用,但在这种跟踪上下文中,变分方法更为合适,因为物体通常不显示显著的随机运动。这促使我们提出基于阴影滤波器的优雅方法。所得到的滤波器高效(减少为线性方程的求解)且稳健(不受缺失数据和奇异相关性导致贝叶斯滤波灾难性失败的影响)。跟踪如此稳健,以至于在某些常见情况下,它实际上通过忽略对卡尔曼滤波至关重要的误差相关性而表现更好。

英文摘要

A commonly encountered problem is the tracking of a physical object, like a maneuvering ship, aircraft, land vehicle, spacecraft or animate creature carrying a wireless device. The sensor data is often limited and inaccurate observations of range or bearing. This problem is more difficult than tracking a ballistic trajectory, because an operative affects unknown and arbitrarily changing accelerations. Although stochastic methods of filtering or state estimation (Kalman filters and particle filters) are widely used, out-of-vogue variational methods are more appropriate in this tracking context, because the objects do not typically display any significant random motions at the length and time scales of interest. This leads us to propose a rather elegant approach based on a \emph{shadowing filter}. The resulting filter is efficient (it reduces to the solution of linear equations) and robust (unaffected by missing data and singular correlations that would cause catastrophic failure of Bayesian filters). The tracking is so robust that in some common situations it actually performs better by ignoring error correlations that are so vital to Kalman filters.

1502.00555 2026-06-04 stat.ME cs.CV cs.MM cs.NA math.NA stat.CO 版本更新

A Discrete Tchebichef Transform Approximation for Image and Video Coding

一种用于图像和视频编码的离散切比雪夫变换近似

P. A. M. Oliveira, R. J. Cintra, F. M. Bayer, S. Kulasekera, A. Madanayake

AI总结 本文提出了一种低复杂度的离散切比雪夫变换近似方法,该变换无需乘法且仅需较少的加法和移位运算,可用于图像和视频编码,并在FPGA上实现时降低功耗和面积。

Comments 13 pages, 5 figures, 2 tables

详情
Journal ref
IEEE Signal Processing Letters, vol. 22, issue 8, pp. 1137-1141, 2015
AI中文摘要

本文介绍了一种低复杂度的离散切比雪夫变换(DTT)近似方法。所提出的正向和逆向变换无需乘法,仅需较少的加法和移位运算。数值压缩模拟展示了该变换在图像和视频编码中的效率。此外,基于Xilinx Virtex-6 FPGA的硬件实现表明,与文献相比,动态功耗降低了44.9%,面积减少了64.7%。

英文摘要

In this paper, we introduce a low-complexity approximation for the discrete Tchebichef transform (DTT). The proposed forward and inverse transforms are multiplication-free and require a reduced number of additions and bit-shifting operations. Numerical compression simulations demonstrate the efficiency of the proposed transform for image and video coding. Furthermore, Xilinx Virtex-6 FPGA based hardware realization shows 44.9% reduction in dynamic power consumption and 64.7% lower area when compared to the literature.

1310.5715 2026-06-04 math.NA cs.CV cs.LG cs.NA math.OC stat.ML 版本更新

Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

随机梯度下降、加权采样与随机化Kaczmarz算法

Deanna Needell, Nathan Srebro, Rachel Ward

AI总结 本文改进了随机梯度下降在光滑强凸目标下的线性收敛保证,从二次依赖于条件数转换为线性依赖,同时探讨了加权采样对收敛性的影响,并将随机化Kaczmarz算法与SGD联系起来,证明其在加权最小二乘问题中的指数收敛性。

Comments 22 pages, 6 figures

详情
AI中文摘要

我们获得了随机梯度下降在光滑且强凸目标下线性收敛的改进有限样本保证,将其对条件数的依赖从二次的$(L/\mu)^2$(其中$L$是光滑性常数的界,$\mu$是强凸性常数的界)改进为对$L/\mu$的线性依赖。此外,我们说明了为进一步提升收敛性,必须对采样分布进行重新加权(即重要性采样),并获得对平均光滑性的线性依赖,优于先前结果。我们还更广泛地讨论了SGD中的重要性采样,并说明它在其他场景中同样可以改善收敛。我们的结果基于我们在SGD与随机化Kaczmarz算法之间建立的联系,这使我们能够在分别研究这两种方法的文献之间相互借鉴思想。特别是,我们将随机化Kaczmarz算法重新表述为SGD的一个实例,并应用我们的结果证明其指数收敛性,但收敛到的是加权最小二乘问题的解,而非原始最小二乘问题的解。然后,我们提出了一种采用部分偏置采样的改进Kaczmarz算法,该算法能够以相同的指数收敛速率收敛到原始最小二乘解。

英文摘要

We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning $(L/\mu)^2$ (where $L$ is a bound on the smoothness and $\mu$ on the strong convexity) to a linear dependence on $L/\mu$. Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence in the average smoothness, dominating previous results. We also discuss importance sampling for SGD more broadly and show how it can improve convergence also in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods. In particular, we recast the randomized Kaczmarz algorithm as an instance of SGD, and apply our results to prove its exponential convergence, but to the solution of a weighted least squares problem rather than the original least squares problem. We then present a modified Kaczmarz algorithm with partially biased sampling which does converge to the original least squares solution with the same exponential convergence rate.
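
For readers unfamiliar with the method, the sketch below shows the basic randomized Kaczmarz iteration that the abstract recasts as an SGD instance: pick row i with probability proportional to its squared norm, then project the current iterate onto the hyperplane a_i . x = b_i. This is a minimal illustration assuming a consistent system; the function name, iteration count and seed are hypothetical, and the paper's partially biased sampling variant is not implemented.

```python
import random

def randomized_kaczmarz(A, b, n_iter=2000, seed=0):
    """Solve a consistent system A x = b by randomized row projections."""
    rng = random.Random(seed)
    row_norms_sq = [sum(a * a for a in row) for row in A]
    indices = list(range(len(A)))
    x = [0.0] * len(A[0])
    for _ in range(n_iter):
        # Importance sampling: row i is drawn with probability ||a_i||^2 / ||A||_F^2.
        i = rng.choices(indices, weights=row_norms_sq)[0]
        row = A[i]
        residual = b[i] - sum(a * xj for a, xj in zip(row, x))
        scale = residual / row_norms_sq[i]
        # Orthogonal projection of x onto the hyperplane {z : a_i . z = b_i}.
        x = [xj + scale * a for xj, a in zip(x, row)]
    return x
```

A small usage example: with `A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]` and `b = [0.0, -4.5, 0.0]` the system is consistent with solution `[1, -2, 0.5]`, and the iterates approach it.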

1501.02995 2026-06-04 cs.MM cs.CV cs.NA math.NA stat.ME 版本更新

Improved 8-point Approximate DCT for Image and Video Compression Requiring Only 14 Additions

改进的8点近似DCT用于图像和视频压缩,仅需14次加法

U. S. Potluri, A. Madanayake, R. J. Cintra, F. M. Bayer, S. Kulasekera, A. Edirisuriya

AI总结 本文提出一种仅需14次加法的8点DCT近似方法,具有低计算复杂度,相比现有方法在算法复杂度和信噪比上表现更优,适用于HEVC等可重构视频标准。

Comments 30 pages, 7 figures, 5 tables

详情
Journal ref
Circuits and Systems I: Regular Papers, IEEE Transactions on, Volume 61, Issue 6, June 2014, 1727--1740
AI中文摘要

视频处理系统如HEVC要求低能耗以满足多媒体市场的需求,推动了快速算法在高效近似2-D DCT变换方面的广泛应用。由于DCT具有显著的能量压缩特性,被广泛应用于多种压缩标准。已提出无乘法器的近似DCT变换,提供极低电路复杂度下的优异压缩性能。此类近似可通过仅使用加法和减法在数字VLSI硬件中实现,显著降低芯片面积和功耗。本文提出一种新的8点DCT近似方法,仅需14次加法运算和无乘法。该变换具有低计算复杂度,并在算法复杂度和峰值信噪比方面与现有最先进的DCT近似方法进行比较。所提出的DCT近似方法是HEVC等可重构视频标准的候选方案。所提出变换及其他几种DCT近似方法被映射到脉动阵列数字架构,并通过FPGA技术和45 nm CMOS工艺物理实现为数字原型电路。

英文摘要

The low energy consumption that the multimedia market demands of video processing systems such as HEVC has led to extensive development of fast algorithms for the efficient approximation of 2-D DCT transforms. The DCT is employed in a multitude of compression standards due to its remarkable energy compaction properties. Multiplier-free approximate DCT transforms have been proposed that offer superior compression performance at very low circuit complexity. Such approximations can be realized in digital VLSI hardware using additions and subtractions only, leading to significant reductions in chip area and power consumption compared to conventional DCTs and integer transforms. In this paper, we introduce a novel 8-point DCT approximation that requires only 14 addition operations and no multiplications. The proposed transform possesses low computational complexity and is compared to state-of-the-art DCT approximations in terms of both algorithm complexity and peak signal-to-noise ratio. The proposed DCT approximation is a candidate for reconfigurable video standards such as HEVC. The proposed transform and several other DCT approximations are mapped to systolic-array digital architectures and physically realized as digital prototype circuits using FPGA technology and mapped to 45 nm CMOS technology.

1501.00680 2026-06-04 math.NA cs.CV cs.NA 版本更新

A New Method for Signal and Image Analysis: The Square Wave Method

信号和图像分析中的一种新方法:方波方法

Osvaldo Skliar, Ricardo E. Monge, Sherry Gapper

AI总结 本文介绍了方波方法在信号和图像分析中的应用,通过两个案例展示了其在频域中的表现。

详情
AI中文摘要

本文简要回顾了方波方法(SWM)在信号和图像分析中的应用,并说明了如何用方波变换(SWT)在频域中表达结果。为了说明该领域的新方法,分析了两个案例:a) 一组肌电信号样本;b) 经典的Lenna图像。

英文摘要

A brief review is provided of the use of the Square Wave Method (SWM) in the field of signal and image analysis and it is specified how results thus obtained are expressed using the Square Wave Transform (SWT), in the frequency domain. To illustrate the new approach introduced in this field, the results of two cases are analyzed: a) a sequence of samples (that is, measured values) of an electromyographic recording; and b) the classic image of Lenna.

1403.5403 2026-06-04 cs.CV cs.NA math.NA math.OC 版本更新

A Non-Local Structure Tensor Based Approach for Multicomponent Image Recovery Problems

一种基于非局部结构张量的多组件图像恢复方法

Giovanni Chierchia, Nelly Pustelnik, Beatrice Pesquet-Popescu, Jean-Christophe Pesquet

AI总结 本文提出基于非局部总变分的多组件图像恢复方法,利用梯度得到的结构张量,通过ℓ_{1,p}矩阵范数惩罚非局部变化,改进收敛速度。

详情
AI中文摘要

非局部总变分(NLTV)已发展为图像恢复变分方法中的有用工具。本文通过利用多组件图像梯度得到的结构张量,将NLTV正则化扩展到多组件图像。所提出的方法允许通过各种ℓ_{1,p}矩阵范数(p≥1)对不同组件的非局部变化进行惩罚。为方便超参数选择,我们采用约束凸优化方法,在数据保真项最小化的同时满足ST-NLTV正则化约束。所得到的凸优化问题通过新颖的epigraphical投影方法解决。这种公式能够高效实现,得益于最近的对偶近端算法的灵活性。进行了多谱和超光谱图像的实验。结果表明引入非局部结构张量正则化是有利的,并显示所提出的方法在收敛速度方面相比当前最先进的方法有显著改进。

英文摘要

Non-Local Total Variation (NLTV) has emerged as a useful tool in variational methods for image recovery problems. In this paper, we extend the NLTV-based regularization to multicomponent images by taking advantage of the Structure Tensor (ST) resulting from the gradient of a multicomponent image. The proposed approach allows us to penalize the non-local variations, jointly for the different components, through various $\ell_{1,p}$ matrix norms with $p \ge 1$. To facilitate the choice of the hyper-parameters, we adopt a constrained convex optimization approach in which we minimize the data fidelity term subject to a constraint involving the ST-NLTV regularization. The resulting convex optimization problem is solved with a novel epigraphical projection method. This formulation can be efficiently implemented thanks to the flexibility offered by recent primal-dual proximal algorithms. Experiments are carried out for multispectral and hyperspectral images. The results demonstrate the interest of introducing a non-local structure tensor regularization and show that the proposed approach leads to significant improvements in terms of convergence speed over current state-of-the-art methods.

1406.5429 2026-06-04 math.NA cs.CV cs.LG cs.NA math.OC 版本更新

Playing with Duality: An Overview of Recent Primal-Dual Approaches for Solving Large-Scale Optimization Problems

玩转对偶:求解大规模优化问题的最新原始-对偶方法综述

Nikos Komodakis, Jean-Christophe Pesquet

AI总结 本文综述了近期用于解决大规模优化问题的原始-对偶方法,探讨了其在信号处理、计算机视觉和机器学习中的应用,强调了原始-对偶算法在求解凸优化和离散问题中的优势。

详情
AI中文摘要

优化方法在信号/图像处理、计算机视觉和机器学习问题中处于核心地位。长期以来,人们认识到研究优化问题的对偶形式可能显著简化问题的求解。然而,联合利用原问题和对偶问题的高效策略是较新的思想,近年来催生了许多重要的新成果。这些新进展建立在凸分析、离散优化、并行处理以及侧重稀疏性问题的非光滑优化的最新进展之上。本文旨在阐述原始-对偶方法的原理,并概述不同背景下提出的数值方法。我们展示了原始-对偶算法在求解大规模凸优化问题和离散问题中的优势,并通过各种应用示例说明其实用性。

英文摘要

Optimization methods are at the core of many problems in signal/image processing, computer vision, and machine learning. For a long time, it has been recognized that looking at the dual of an optimization problem may drastically simplify its solution. Deriving efficient strategies which jointly bring into play the primal and the dual problems is however a more recent idea which has generated many important new contributions in recent years. These novel developments are grounded in recent advances in convex analysis, discrete optimization, parallel processing, and non-smooth optimization with emphasis on sparsity issues. In this paper, we aim at presenting the principles of primal-dual approaches, while giving an overview of numerical methods which have been proposed in different contexts. We show the benefits which can be drawn from primal-dual algorithms both for solving large-scale convex optimization problems and discrete ones, and we provide various application examples to illustrate their usefulness.

1411.2584 2026-06-04 cs.CV cs.NA math.NA 版本更新

Applications of sampling Kantorovich operators to thermographic images for seismic engineering

采样Kantorovich算子在地震工程中的热图像应用

Danilo Costarelli, Federico Cluni, Anna Maria Minotti, Gianluca Vinti

AI总结 本文利用多变量采样Kantorovich算子S_w理论,结合MATLAB和矩阵计算,开发算法重建热图像,用于建筑在地震作用下的行为模拟,并通过实际案例分析不同模型的性能差异。

Comments 16 pages, 5 figures, 2 tables

详情
AI中文摘要

在本文中,我们介绍了多变量采样Kantorovich算子$S_w$在地震工程中的某些应用。这些算子在连续函数空间和Orlicz空间中的数学理论展示了如何近似/重建多变量信号,如图像。特别是,为了获得热图像的应用,我们开发了一种数学算法,使用MATLAB和矩阵微积分。Orlicz空间的设置很重要,因为它允许通过$S_w$重建非连续信号。我们的采样Kantorovich算法能够重建建筑物的热图像,从而获得用于模拟结构在地震作用下行为的模型。我们分析了一个实际案例研究,从结构分析的角度出发,并比较了不同模型下建筑在地震作用下的行为差异。

英文摘要

In this paper, we present some applications of the multivariate sampling Kantorovich operators $S_w$ to seismic engineering. The mathematical theory of these operators, both in the space of continuous functions and in Orlicz spaces, shows how it is possible to approximate/reconstruct multivariate signals, such as images. In particular, to obtain applications for thermographic images, a mathematical algorithm is developed using MATLAB and matrix calculus. The setting of Orlicz spaces is important since it allows us to reconstruct not necessarily continuous signals by means of $S_w$. The reconstruction of thermographic images of buildings by our sampling Kantorovich algorithm allows us to obtain models for the simulation of the behavior of structures under seismic action. We analyze a real-world case study in terms of structural analysis and we compare the behavior of the building under seismic action using various models.

1410.3426 2026-06-04 math.NA cs.CV cs.NA 版本更新

Computing Topology Preservation of RBF Transformations for Landmark-Based Image Registration

计算基于地标图像配准的RBF变换拓扑保持性

R. Cavoretto, A. De Rossi, H. Qiao, B. Quatember, W. Recheis, M. Mayr

AI总结 本文研究了RBF在基于地标的图像配准中保持拓扑性质的能力,通过单点和四点模型分析Matérn函数等变换的拓扑保持特性,并与高斯函数、Wendland函数和Wu函数的数值结果进行比较。

详情
AI中文摘要

在图像配准中,适当的变换应保持拓扑性质。特别是对于基于地标的图像配准,如果一个地标的位移远大于邻近地标的位移,就会发生拓扑违反。本文旨在分析一些用于建模图像配准变形的径向基函数(RBF)的拓扑保持性。Matérn函数在统计文献中较为常见(见例如\cite{Matern86,Stein99})。在本文中,我们使用它们来解决基于地标的图像配准问题。我们分别展示了单地标和四地标模型中RBF的拓扑保持性质。三种类型的Matérn变换的数值结果与高斯函数、Wendland函数和Wu函数的结果进行了比较。

英文摘要

In image registration, a proper transformation should be topology preserving. Especially for landmark-based image registration, if the displacement of one landmark is sufficiently larger than those of neighbouring landmarks, a topology violation will occur. This paper aims to analyse the topology preservation of some Radial Basis Functions (RBFs) which are used to model deformations in image registration. Matérn functions are quite common in the statistics literature (see, e.g. \cite{Matern86,Stein99}). In this paper, we use them to solve the landmark-based image registration problem. We present the topology preservation properties of RBFs in one-landmark and four-landmark models, respectively. Numerical results of three kinds of Matérn transformations are compared with results of Gaussian, Wendland's, and Wu's functions.

1410.0719 2026-06-04 math.NA cs.CV cs.IT cs.LG cs.NA math.IT math.OC math.ST stat.TH 版本更新

Proceedings of the second "international Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST'14)

第二届“稀疏模型与技术相互作用国际巡回研讨会”(iTWIST'14)论文集

L. Jacques, C. De Vleeschouwer, Y. Boursier, P. Sudhakar, C. De Mol, A. Pizurica, S. Anthoine, P. Vandergheynst, P. Frossard, C. Bilen, S. Kitic, N. Bertin, R. Gribonval, N. Boumal, B. Mishra, P. -A. Absil, R. Sepulchre, S. Bundervoet, C. Schretter, A. Dooms, P. Schelkens, O. Chabiron, F. Malgouyres, J. -Y. Tourneret, N. Dobigeon, P. Chainais, C. Richard, B. Cornelis, I. Daubechies, D. Dunson, M. Dankova, P. Rajmic, K. Degraux, V. Cambareri, B. Geelen, G. Lafruit, G. Setti, J. -F. Determe, J. Louveaux, F. Horlin, A. Drémeau, P. Heas, C. Herzet, V. Duval, G. Peyré, A. Fawzi, M. Davies, N. Gillis, S. A. Vavasis, C. Soussen, L. Le Magoarou, J. Liang, J. Fadili, A. Liutkus, D. Martina, S. Gigan, L. Daudet, M. Maggioni, S. Minsker, N. Strawn, C. Mory, F. Ngole, J. -L. Starck, I. Loris, S. Vaiter, M. Golbabaee, D. Vukobratovic

AI总结 iTWIST'14聚焦稀疏范式理论与应用,通过演讲、海报和讨论促进国际协作,涵盖稀疏数据传感、子空间联合、非线性逆问题等主题。

Comments 69 pages, 24 extended abstracts, iTWIST'14 website: http://sites.google.com/site/itwist14

详情
AI中文摘要

iTWIST研讨会旨在通过口头报告、海报和自由讨论促进国际科学团队合作。第二届iTWIST'14于2014年8月27日至29日在比利时纳穆尔举行,吸引了70名国际参与者,包含9场特邀讲座、10场口头报告和14个海报,主题涵盖稀疏范式的理论、应用与推广,包括稀疏数据传感、低维子空间联合、非线性逆问题等。

英文摘要

The implicit objective of the biennial "international - Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST) is to foster collaboration between international scientific teams by disseminating ideas through both specific oral/poster presentations and free discussions. For its second edition, the iTWIST workshop took place in the medieval and picturesque town of Namur in Belgium, from Wednesday August 27th till Friday August 29th, 2014. The workshop was conveniently located in "The Arsenal" building within walking distance of both hotels and town center. iTWIST'14 has gathered about 70 international participants and has featured 9 invited talks, 10 oral presentations, and 14 posters on the following themes, all related to the theory, application and generalization of the "sparsity paradigm": Sparsity-driven data sensing and processing; Union of low dimensional subspaces; Beyond linear and convex inverse problem; Matrix/manifold/graph sensing/processing; Blind inverse problems and dictionary learning; Sparsity and computational neuroscience; Information theory, geometry and randomness; Complexity/accuracy tradeoffs in numerical methods; Sparsity? What's next?; Sparse machine learning and inference.

1410.0868 2026-06-04 math.NA cs.CV cs.NA 版本更新

Group Orbit Optimization: A Unified Approach to Data Normalization

群轨道优化:数据规范化的一个统一方法

Shuchang Zhou, Zhihua Zhang, Xiaobing Feng

AI总结 本文提出并研究了群轨道优化(GOO)问题,证明其可诱导矩阵分解技术,如SVD、LU分解、QR分解等,从而构建统一的矩阵分解框架,并推广至张量分解。

详情
AI中文摘要

在本文中,我们提出并研究了一个矩阵群轨道上的优化问题,称为群轨道优化(GOO)。我们证明GOO可以诱导矩阵分解技术,如奇异值分解(SVD)、LU分解、QR分解、Schur分解和Cholesky分解等。这为矩阵分解提供了一个统一的框架,使这些矩阵分解方法得以连接。此外,我们推广GOO用于张量分解。作为GOO的具体应用,我们设计了一种新的特殊线性群上的数据分解方法,用于规范化点云数据。实验结果表明,我们的规范化方法能够从剪切、旋转和挤压等畸变中有效恢复。

英文摘要

In this paper we propose and study an optimization problem over a matrix group orbit that we call \emph{Group Orbit Optimization} (GOO). We prove that GOO can be used to induce matrix decomposition techniques such as singular value decomposition (SVD), LU decomposition, QR decomposition, Schur decomposition and Cholesky decomposition. This gives rise to a unified framework for matrix decomposition and allows us to bridge these matrix decomposition methods. Moreover, we generalize GOO for tensor decomposition. As a concrete application of GOO, we devise a new data decomposition method over a special linear group to normalize point cloud data. Experimental results show that our normalization method recovers well from distortions like shearing, rotation and squeezing.

1409.3714 2026-06-04 math.NA cs.CV cs.NA 版本更新

Time-domain multiscale shape identification in electro-sensing

时域多尺度形状识别在电感应中的应用

Habib Ammari, Han Wang

AI总结 本文提出一种新颖的时域多尺度方法,用于电感应中利用脉冲信号进行形状识别,通过多尺度过滤极化张量计算不变形状描述符,具有强抗噪性,适用于回声定位和感应数据的脉冲成像。

详情
AI中文摘要

本文提出了一种新颖的时域多尺度方法,用于电感应中利用脉冲信号进行形状识别。该方法基于从多尺度过滤极化张量计算出的变换不变形状描述符。所提出的算法在远场测量中即使在非常有限的视角下也具有显著的抗噪能力。这为利用回声定位和感应数据进行脉冲成像打开了新的大门。

英文摘要

This paper presents a novel time-domain multi-scale method for shape identification in electro-sensing using pulse-type signals. The method is based on transform-invariant shape descriptors computed from filtered polarization tensors at multiple scales. The proposed algorithm is remarkably robust to noise, even with far-field measurements at a very limited angle of view. It opens a door for pulsed imaging using echolocation and induction data.

1409.2579 2026-06-04 math.NA cs.CV cs.LG cs.NA 版本更新

A theoretical contribution to the fast implementation of null linear discriminant analysis method using random matrix multiplication with scatter matrices

对利用随机矩阵与散布矩阵相乘快速实现null线性判别分析方法的理论贡献

Ting-ting Feng, Gang Wu

AI总结 本文提出一种理论方法,通过合理选择随机矩阵保证null LDA的列满秩,避免信息丢失,提升计算效率。

Comments 7 pages

详情
AI中文摘要

null线性判别分析方法是一种有竞争力的降维方法,但其实现计算成本高。最近提出了一种利用随机矩阵与散布矩阵相乘的快速实现方法,但若随机矩阵任意选取,则导向矩阵可能秩亏,导致部分有用的判别信息丢失。本文研究如何合理选择随机矩阵以在理论上满足null LDA的两个准则,给出了保证导向矩阵列满秩的充分必要条件,并描述了该条件的几何特征。

英文摘要

The null linear discriminant analysis method is a competitive approach for dimensionality reduction. The implementation of this method, however, is computationally expensive. Recently, a fast implementation of null linear discriminant analysis method using random matrix multiplication with scatter matrices was proposed. However, if the random matrix is chosen arbitrarily, the orientation matrix may be rank deficient, and some useful discriminant information will be lost. In this paper, we investigate how to choose the random matrix properly, such that the two criteria of the null LDA method are satisfied theoretically. We give a necessary and sufficient condition to guarantee full column rank of the orientation matrix. Moreover, the geometric characterization of the condition is also described.

1407.0921 2026-06-04 math.OC cs.CV cs.NA math.NA 版本更新

Solving QVIs for Image Restoration with Adaptive Constraint Sets

利用自适应约束集求解图像恢复中的QVIs

Frank Lenzen, Jan Lellmann, Florian Becker, Christoph Schnörr

AI总结 本文研究了自适应图像恢复中的准变分不等式,证明了更大类问题的解唯一性,并提出收敛的数值算法,实验结果支持理论发现。

详情
AI中文摘要

我们考虑了一类用于自适应图像恢复的准变分不等式(QVIs),其中自适应性通过解依赖的约束集描述。在先前工作中,我们研究了理论和数值问题。尽管我们能够证明较广类问题的解存在性,但遇到了解的唯一性和现有求解QVIs算法收敛性的问题。特别是,随着图像尺寸增加,所涉及微分算子的增长条件数带来了严重问题。在本文中,我们证明了更大类问题的解唯一性,特别是与图像尺寸无关。此外,我们提供了一个具有证明收敛性的数值算法。实验结果支持我们的理论发现。

英文摘要

We consider a class of quasi-variational inequalities (QVIs) for adaptive image restoration, where the adaptivity is described via solution-dependent constraint sets. In previous work we studied both theoretical and numerical issues. While we were able to show the existence of solutions for a relatively broad class of problems, we encountered problems concerning uniqueness of the solution as well as convergence of existing algorithms for solving QVIs. In particular, it seemed that with increasing image size the growing condition number of the involved differential operator poses severe problems. In the present paper we prove uniqueness for a larger class of problems and in particular independent of the image size. Moreover, we provide a numerical algorithm with proved convergence. Experimental results support our theoretical findings.

1405.2220 2026-06-04 q-fin.TR cs.CE cs.CV cs.SY eess.SY q-fin.ST 版本更新

Gaussian-Chain Filters for Heavy-Tailed Noise with Application to Detecting Big Buyers and Big Sellers in Stock Market

高斯链滤波器用于重尾噪声及其在股票市场中检测大买家和大卖家的应用

Li-Xin Wang

AI总结 本文提出高斯链分布用于处理重尾噪声,通过构造基于最大似然原理的高斯链滤波器,改进了市场情绪跟踪策略,实证显示其在股票市场中表现优于传统策略。

详情
AI中文摘要

我们提出了一种新的重尾分布——高斯链(GC)分布,其灵感来源于社会组织中的分层结构。我们确定了高斯链分布的均值、方差和峰度以展示其重尾性质,并计算尾部分布表以提供具体数值说明尾部的重性。为了过滤重尾噪声,我们构建了基于最大似然原理的2阶和3阶GC滤波器。仿真结果表明,当噪声呈重尾分布时,GC滤波器比基准最小二乘算法表现更好。利用GC滤波器,我们提出了一种名为Ride-the-Mood的交易策略,通过检测市场中大买家和大卖家的行为,跟踪市场情绪。对五只蓝筹香港股票近两年的数据应用Ride-the-Mood策略,显示其收益高于基准买入持有策略和恒生指数基金。

英文摘要

We propose a new heavy-tailed distribution, the Gaussian-Chain (GC) distribution, which is inspired by the hierarchical structures prevailing in social organizations. We determine the mean, variance and kurtosis of the Gaussian-Chain distribution to show its heavy-tailed property, and compute the tail distribution table to give specific numbers showing how heavy the tails are. To filter out the heavy-tailed noise, we construct two filters, the 2nd- and 3rd-order GC filters, based on the maximum likelihood principle. Simulation results show that the GC filters perform much better than the benchmark least-squares algorithm when the noise is heavy-tailed. Using the GC filters, we propose a trading strategy, named Ride-the-Mood, to follow the mood of the market by detecting the actions of the big buyers and the big sellers in the market based on the noisy, heavy-tailed price data. Application of the Ride-the-Mood strategy to five blue-chip Hong Kong stocks over the recent two-year period from April 2, 2012 to March 31, 2014 shows that its returns are higher than the returns of the benchmark Buy-and-Hold strategy and the Hang Seng Index Fund.

1405.2128 2026-06-04 cs.CV cs.NA math.NA 版本更新

Variational Image Segmentation Model Coupled with Image Restoration Achievements

变分图像分割模型与图像修复成果相结合

Xiaohao Cai

AI总结 本文提出结合图像修复与分割的多相分割模型,有效处理高噪声、模糊和缺失像素图像,改进传统分割模型并高效求解。

Comments 23 pages

详情
AI中文摘要

图像分割和图像修复是图像处理中的重要课题,本文提出一种新的多相分割模型,结合图像修复和分割模型。利用图像修复特性,所提模型能有效处理高噪声、模糊、缺失像素和向量值图像。特别是传统分割模型如分段常数Mumford-Shah模型可通过引入新的数据保真项扩展,用于分割受噪声、模糊或缺失像素影响的灰度和向量值图像。该模型使用交替最小化算法高效求解,并在温和条件下证明了算法收敛性。在多种合成和实际图像上的实验表明,该方法在模糊图像和缺失像素图像的分割上优于现有方法。

英文摘要

Image segmentation and image restoration are two important topics in image processing with great achievements. In this paper, we propose a new multiphase segmentation model by combining image restoration and image segmentation models. Utilizing image restoration aspects, the proposed segmentation model can effectively and robustly tackle highly noisy images, blurry images, images with missing pixels, and vector-valued images. In particular, one of the most important segmentation models, the piecewise constant Mumford-Shah model, can be extended easily in this way to segment gray and vector-valued images corrupted, for example, by noise, blur or missing pixels, after coupling a new data fidelity term which comes from image restoration. It can be solved efficiently using the alternating minimization algorithm, and we prove the convergence of this algorithm with three variables under mild conditions. Experiments on many synthetic and real-world images demonstrate that our method gives better segmentation results in comparison to other state-of-the-art segmentation models, especially for blurry images and images with missing pixel values.

1312.6813 2026-06-04 math.NA cs.CV cs.NA 版本更新

New explicit thresholding/shrinkage formulas for one class of regularization problems with overlapping group sparsity and their applications

一类具有重叠组稀疏性的正则化问题的新显式阈值化/收缩公式及其应用

Gang Liu, Ting-Zhu Huang, Xiao-Guang Lv, Jun Liu

AI总结 本文提出了一类具有重叠组稀疏性的正则化问题的新显式收缩公式,应用于TV去模糊和去噪,通过交替方向乘子法验证了其有效性。

Comments 22 pages, 30 figures

详情
AI中文摘要

本文研究了具有重叠组稀疏性的正则化问题,提出了一类新的显式收缩公式,适用于TV去模糊和去噪。通过交替方向乘子法验证了其有效性,解决了重叠组稀疏性带来的困难。

英文摘要

Least-squares regression problems and inverse problems have been widely studied in many fields such as compressive sensing, signal processing, and image processing. To solve such ill-posed problems, a regularization term (i.e., regularizer) should be introduced, under the assumption that the solutions have some specific properties, such as sparsity and group sparsity. Widely used regularizers include the $\ell_1$ norm, total variation (TV) semi-norm, and so on. Recently, a new regularization term with overlapping group sparsity has been considered. Majorization-minimization iterations or variable duplication methods are often applied to solve the resulting problems. However, there have been no direct methods for solving them because of the difficulty caused by the overlapping groups. In this paper, we propose new explicit shrinkage formulas for one class of these problems, whose regularization terms have translation-invariant overlapping groups. Moreover, we apply our results to TV deblurring and denoising with overlapping group sparsity, and use the alternating direction method of multipliers to solve the resulting problems iteratively. Numerical results verify the validity and effectiveness of our new explicit shrinkage formulas.

1404.6691 2026-06-04 math.NA cs.CV cs.NA physics.med-ph 版本更新

Sinogram constrained TV-minimization for metal artifact reduction in CT

基于sinogram约束的TV最小化方法用于CT中的金属伪影消除

Clemens Schiffer, Kristian Bredies

AI总结 本文提出了一种基于凸优化问题和总变分正则化的CT金属伪影消除方法,利用Chambolle-Pock算法求解,并通过合成数据验证了该方法的有效性。

Comments Part of the OAGM 2014 proceedings (arXiv:1404.3538)

详情
AI中文摘要

本文提出了一种新的方法,用于减少X射线计算机断层扫描(CT)图像中的金属伪影。该方法基于带有sinogram不等式约束的凸优化问题和重建图像的总变分正则化。Chambolle-Pock算法用于数值求解优化问题的离散版本。作为概念验证,我们展示了并讨论了合成数据的数值结果。

英文摘要

A new method for reducing metal artifacts in X-ray computed tomography (CT) images is presented. It is based on the solution of a convex optimization problem with inequality constraints on the sinogram, and total variation regularization for the reconstructed image. The Chambolle-Pock algorithm is used to numerically solve the discretized version of the optimization problem. As a proof of concept, we present and discuss numerical results for synthetic data.

1404.6871 2026-06-04 math.NA cs.CV cs.NA 版本更新

Proximal Iteratively Reweighted Algorithm with Multiple Splitting for Nonconvex Sparsity Optimization

近端迭代重加权算法与多重分裂用于非凸稀疏优化

Canyi Lu, Yunchao Wei, Zhouchen Lin, Shuicheng Yan

AI总结 本文提出PIRE算法解决非凸稀疏及结构稀疏问题,相比传统方法更高效,且在每轮迭代计算成本接近凸求解器。进一步提出PIRE-PS和PIRE-AU处理多变量问题,理论证明其收敛性,实验显示性能优异。

详情
Journal ref
Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014
AI中文摘要

本文提出近端迭代重加权(PIRE)算法用于求解一个一般性问题,该问题涵盖大量与非凸稀疏及结构稀疏相关的问题。与以往针对非凸稀疏问题的迭代求解器相比,PIRE更为通用和高效。PIRE每轮迭代的计算成本通常与当前最先进的凸求解器一样低。我们进一步提出具有并行分裂的PIRE算法(PIRE-PS)和具有交替更新的PIRE算法(PIRE-AU)以处理多变量问题。在理论上,我们证明所提方法收敛且任何极限解都是驻点。在合成和真实数据集上的大量实验表明,与以往的非凸求解器相比,我们的方法取得了相当的学习性能,但效率显著更高。

英文摘要

This paper proposes the Proximal Iteratively REweighted (PIRE) algorithm for solving a general problem, which covers a large body of nonconvex sparse and structured sparse problems. Compared with previous iterative solvers for nonconvex sparse problems, PIRE is much more general and efficient. The computational cost of PIRE in each iteration is usually as low as that of state-of-the-art convex solvers. We further propose the PIRE algorithm with Parallel Splitting (PIRE-PS) and the PIRE algorithm with Alternative Updating (PIRE-AU) to handle multi-variable problems. In theory, we prove that our proposed methods converge and that any limit solution is a stationary point. Extensive experiments on both synthetic and real data sets demonstrate that our methods achieve comparable learning performance but are much more efficient than previous nonconvex solvers.

1403.7543 2026-06-04 math.OC cs.CV cs.IT cs.NA math.IT math.NA 版本更新

A sparse Kaczmarz solver and a linearized Bregman method for online compressed sensing

一种稀疏Kaczmarz求解器和线性化Bregman方法用于在线压缩感知

Dirk A. Lorenz, Stephan Wenger, Frank Schöpfer, Marcus Magnor

AI总结 本文提出了一种计算稀疏或最小TV解的线性系统算法框架,包含Kaczmarz方法和线性化Bregman方法作为特例,并引入了稀疏Kaczmarz求解器等新方法,适用于测量缓慢且昂贵的在线压缩感知、TV断层成像和射电干涉测量等问题。

详情
AI中文摘要

本文提出了一种计算稀疏或最小TV解的线性系统算法框架。该框架包括Kaczmarz方法和线性化Bregman方法作为特例,以及若干新方法,如稀疏Kaczmarz求解器。该算法框架具有多种应用,尤其适用于线性测量缓慢且昂贵获取的问题。本文展示了在线压缩感知、TV断层成像和射电干涉测量等应用示例。

英文摘要

An algorithmic framework to compute sparse or minimal-TV solutions of linear systems is proposed. The framework includes both the Kaczmarz method and the linearized Bregman method as special cases and also several new methods such as a sparse Kaczmarz solver. The algorithmic framework has a variety of applications and is especially useful for problems in which the linear measurements are slow and expensive to obtain. We present examples for online compressed sensing, TV tomographic reconstruction and radio interferometry.
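As an illustration of the kind of iteration this abstract refers to, the following is a minimal pure-Python sketch of a sparse Kaczmarz step in the form commonly given in the literature: an ordinary Kaczmarz update on an auxiliary vector `z`, followed by soft shrinkage to obtain `x`. The function names, the cyclic row order, the parameter `lam`, and the tiny test system are assumptions made for this example; the abstract itself does not specify them, and this is not the authors' implementation.

```python
def soft_threshold(v, lam):
    """Componentwise soft shrinkage: sign(t) * max(|t| - lam, 0)."""
    return [max(abs(t) - lam, 0.0) * (1.0 if t > 0 else -1.0) for t in v]


def sparse_kaczmarz(A, b, lam=0.1, sweeps=2000):
    """Sketch of a sparse Kaczmarz iteration for a consistent system A x = b.

    A is a list of rows, b a list of right-hand sides. Rows are visited
    cyclically (an assumption; randomized row selection is also common).
    """
    m, n = len(A), len(A[0])
    z = [0.0] * n  # auxiliary variable updated by plain Kaczmarz steps
    x = [0.0] * n  # sparse iterate, x = soft_threshold(z, lam)
    row_norm_sq = [sum(a * a for a in row) for row in A]
    for k in range(sweeps * m):
        i = k % m
        if row_norm_sq[i] == 0.0:
            continue  # skip all-zero rows
        # Residual of the i-th equation, evaluated at the sparse iterate x.
        residual = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
        step = residual / row_norm_sq[i]
        z = [zj - step * a for zj, a in zip(z, A[i])]
        x = soft_threshold(z, lam)
    return x


# Hypothetical usage on a small underdetermined, consistent system.
A = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
b = [1.0, 0.0]
x = sparse_kaczmarz(A, b)
print(x)
```

Setting `lam=0` reduces the shrinkage to the identity, so the loop becomes the classical Kaczmarz method, which matches the abstract's statement that Kaczmarz is a special case of the framework. Each step touches only one row of `A`, which is why this style of method suits settings where measurements arrive one at a time.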

1403.3022 2026-06-04 cs.CV cs.NA math.NA 版本更新

Efficient Legendre moment computation for grey level images

对灰度图像的Legendre矩高效计算

Guanyu Yang, Huazhong Shu, Christine Toumoulin, Guo-Niu Han, Limin M. Luo

AI总结 本文提出一种高效计算Legendre矩的方法,适用于二值和灰度图像,通过递推公式和无乘法算法降低计算复杂度。

详情
Journal ref
Pattern Recognition 39, 1 (2006) 74-80
AI中文摘要

Legendre正交矩已在图像分析领域广泛应用。由于直接方法计算非常耗时,近期研究致力于降低计算复杂度。然而,现有算法主要针对二值图像。本文提出一种新的快速方法计算Legendre矩,不仅适用于二值图像,也适用于灰度图像。我们首先通过Legendre多项式的递推性质建立一维Legendre矩的递推公式。结果表明,一维Legendre矩Lp可表示为Lp-1(1)和Lp-2(0)的线性组合。基于此关系,通过L1(a)和L0(a)数组(a为小于p的整数)得到Lp(0)。为进一步降低计算复杂度,采用无需乘法的算法计算这些量。该方法随后扩展至二维Legendre矩Lpq的计算。我们证明所提方法比直接方法更高效。

英文摘要

Legendre orthogonal moments have been widely used in the field of image analysis. Because their computation by a direct method is very time-consuming, recent efforts have been devoted to the reduction of computational complexity. Nevertheless, the existing algorithms are mainly focused on binary images. We propose here a new fast method for computing the Legendre moments, which is suitable not only for binary images but also for grey-level images. We first set up the recurrence formula of one-dimensional (1D) Legendre moments by using the recursive property of Legendre polynomials. As a result, the 1D Legendre moments of order p, $L_p = L_p(0)$, can be expressed as a linear combination of $L_{p-1}(1)$ and $L_{p-2}(0)$. Based on this relationship, the 1D Legendre moments $L_p(0)$ are thus obtained from the arrays of $L_1(a)$ and $L_0(a)$, where a is an integer less than p. To further decrease the computational complexity, an algorithm in which no multiplication is required is used to compute these quantities. The method is then extended to the calculation of the two-dimensional Legendre moments $L_{pq}$. We show that the proposed method is more efficient than the direct method.
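For context, the "direct method" that the abstract uses as its baseline can be sketched as follows. The sketch evaluates Legendre polynomials with the standard three-term recurrence, $(n+1)P_{n+1}(x) = (2n+1)xP_n(x) - nP_{n-1}(x)$, and approximates 1D Legendre moments of a sampled signal with a midpoint rule on $[-1, 1]$. The function names, the midpoint discretization, and the $(2p+1)/2$ normalization are assumptions made for this example; this is not the paper's fast algorithm, whose recurrence on the moments themselves is not reproduced here.

```python
def legendre_values(max_order, x):
    """Return [P_0(x), ..., P_max_order(x)] via the three-term recurrence."""
    vals = [1.0]
    if max_order >= 1:
        vals.append(x)
    for n in range(1, max_order):
        # (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)
        vals.append(((2 * n + 1) * x * vals[n] - n * vals[n - 1]) / (n + 1))
    return vals


def legendre_moments_1d(signal, max_order):
    """Direct 1D Legendre moments of samples mapped onto [-1, 1].

    Uses a midpoint rule: sample i is placed at the centre of the i-th of
    len(signal) equal cells, and the integral is approximated by a sum.
    """
    count = len(signal)
    dx = 2.0 / count
    moments = [0.0] * (max_order + 1)
    for i, f in enumerate(signal):
        x = -1.0 + (i + 0.5) * dx
        p_vals = legendre_values(max_order, x)
        for p in range(max_order + 1):
            moments[p] += (2 * p + 1) / 2.0 * p_vals[p] * f * dx
    return moments


# Hypothetical usage: moments of a short grey-level row up to order 3.
print(legendre_moments_1d([10.0, 12.0, 40.0, 41.0, 12.0, 9.0], 3))
```

The cost of this baseline grows with the number of samples times the number of orders, with several multiplications per term, which is the expense the abstract's recurrence-based, multiplication-free scheme is meant to reduce.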

1403.3021 2026-06-04 cs.CV cs.NA math.NA 版本更新

Image reconstruction from limited range projections using orthogonal moments

利用正交矩进行有限范围投影的图像重建

Huazhong Shu, Jian Zhou, Guo-Niu Han, Limin M. Luo, Jean-Louis Coatrieux

AI总结 本文提出一组正交多项式用于图像重建,探讨了投影矩与图像矩的关系,并通过仿真验证了方法的有效性。

详情
Journal ref
Pattern Recognition 40, 2 (2007) 670-680
AI中文摘要

本文提出了一组正交多项式用于从投影数据中重建图像。详细讨论了投影矩与图像矩之间的关系,并展示了某些有趣的性质。提供了仿真结果以验证该方法,并将其性能与先前工作进行比较。

英文摘要

A set of orthonormal polynomials is proposed for image reconstruction from projection data. The relationship between the projection moments and image moments is discussed in detail, and some interesting properties are demonstrated. Simulation results are provided to validate the method and to compare its performance with previous works.

1403.0240 2026-06-04 cs.CV cs.CE cs.NA math.NA q-bio.QM 版本更新

Particle methods enable fast and simple approximation of Sobolev gradients in image segmentation

粒子方法使图像分割中Sobolev梯度的快速简单近似成为可能

Ivo F. Sbalzarini, Sophie Schneider, Janick Cardinale

AI总结 本文提出利用粒子方法高效计算Sobolev梯度,以解决图像分割中的正则化问题,通过局部粒子交互替代全局Poisson方程,提升计算效率。

Comments 21 pages, 10 figures

详情
AI中文摘要

生物图像分析因图像强度分布不均和噪声水平高而具有挑战性。贝叶斯推断提供了一种利用先验知识对问题进行正则化的原则性方法。一个基本的选择是如何度量图像中形状之间的“距离”。已有研究表明,直接的几何L2距离是退化的,会导致病态情形。使用Sobolev梯度可避免此问题,从而降低分割问题的不适定程度。然而,Sobolev梯度的高计算成本和实现开销阻碍了其实际应用。本文展示了应用于图像分割的粒子方法如何实现Sobolev梯度的简单且计算高效的实现。我们证明Sobolev梯度的计算相当于沿图像轮廓的粒子-粒子相互作用。我们扩展了现有的基于粒子的分割算法以使用Sobolev梯度。我们使用合成图像和真实图像,采用分段光滑和分段常数区域模型,对2D和3D图像的结果进行了基准测试。本文的Sobolev梯度粒子近似比先前的参考实现快2.8到10倍,同时保留了Sobolev梯度已知的有利性质。此加速是通过使用局部粒子-粒子相互作用,而不是在每次迭代中求解全局Poisson方程实现的。Sobolev梯度每次迭代的计算时间高于L2梯度。然而,由于Sobolev梯度对优化问题起到预条件作用,算法收敛所需的总迭代次数可能更少,这在某些情况下可以抵消更高的单次迭代成本。

英文摘要

Bio-image analysis is challenging due to inhomogeneous intensity distributions and high levels of noise in the images. Bayesian inference provides a principled way for regularizing the problem using prior knowledge. A fundamental choice is how one measures "distances" between shapes in an image. It has been shown that the straightforward geometric L2 distance is degenerate and leads to pathological situations. This is avoided when using Sobolev gradients, rendering the segmentation problem less ill-posed. The high computational cost and implementation overhead of Sobolev gradients, however, have hampered practical applications. We show how particle methods as applied to image segmentation allow for a simple and computationally efficient implementation of Sobolev gradients. We show that the evaluation of Sobolev gradients amounts to particle-particle interactions along the contour in an image. We extend an existing particle-based segmentation algorithm to using Sobolev gradients. Using synthetic and real-world images, we benchmark the results for both 2D and 3D images using piecewise smooth and piecewise constant region models. The present particle approximation of Sobolev gradients is 2.8 to 10 times faster than the previous reference implementation, but retains the known favorable properties of Sobolev gradients. This speedup is achieved by using local particle-particle interactions instead of solving a global Poisson equation at each iteration. The computational time per iteration is higher for Sobolev gradients than for L2 gradients. Since Sobolev gradients precondition the optimization problem, however, a smaller number of overall iterations may be necessary for the algorithm to converge, which can in some cases amortize the higher per-iteration cost.

1402.0289 2026-06-04 cs.CV cs.RO cs.SY eess.SY 版本更新

A Robust Framework for Moving-Object Detection and Vehicular Traffic Density Estimation

一种用于移动物体检测和车辆交通密度估计的稳健框架

Pranam Janney, Glenn Geers

AI总结 本文提出了一种基于纹理度量的移动物体检测方法,具有计算成本低、参数调整少和抗噪声能力强的特点,实验表明其性能优于现有方法,并提出车辆交通密度估计的框架及对比分析。

详情
AI中文摘要

智能机器需要从视频中获取基本信息,如移动物体检测,以推断更高层次的语义信息。本文提出了一种利用纹理度量检测视频中移动物体的方法。该方法计算成本低,参数调整少,且对噪声、光照变化、动态背景和低帧率具有鲁棒性。实验结果表明,所提方法的性能优于现有方法。我们还利用前景物体检测技术提出车辆交通密度估计的框架,并比较了基于前景物体检测的框架与基于经典密度状态建模的框架在车辆交通密度估计中的差异。

英文摘要

Intelligent machines require basic information such as moving-object detection from videos in order to deduce higher-level semantic information. In this paper, we propose a methodology that uses a texture measure to detect moving objects in video. The methodology is computationally inexpensive, requires minimal parameter fine-tuning and is resilient to noise, illumination changes, dynamic background and low frame rate. Experimental results show that the performance of the proposed approach is higher than that of state-of-the-art approaches. We also present a framework for vehicular traffic density estimation using the foreground object detection technique and present a comparison between the foreground object detection-based framework and the classical density state modelling-based framework for vehicular traffic density estimation.

1401.1558 2026-06-04 math.DG cs.CV cs.NA math.NA 版本更新

The Continuity of Images by Transmission Imaging Revisited

通过传输成像重新审视图像的连续性

Zhitao Fan, Feng Guan, Chunlin Wu, Ming Yan

AI总结 本文研究传输成像中图像连续性问题,采用更通用的束几何和更弱的假设证明大多数图像为连续函数,并比较了去除泊松噪声的两种图像处理方法。

Comments 23 pages, 8 figures

详情
AI中文摘要

传输成像作为一种重要的成像技术,广泛应用于天文学、医学诊断和生物科学。文献[49]表明,传输成像与日常生活中使用的反射成像有显著差异。理解图像结构(先验信息)对设计、测试和选择图像处理方法至关重要,良好的图像处理方法有助于进一步利用图像数据,例如提高传输成像应用中的物体重建精度。在反射成像中,图像通常建模为不连续函数,甚至分段常数函数。在传输成像中,文献[49]最近表明几乎所有图像都是连续函数。然而,[49]的作者只考虑了平行束几何情况,并在证明中使用了一些过强的假设,排除了一些常见情况,如圆柱形物体。在本文中,我们考虑了更一般的束几何,并通过完全不同的技术简化了假设。特别是,我们证明在平行束和发散束几何(两种最典型的束几何)中,传输成像中的几乎所有图像都是连续函数,所用假设比[49]中要弱得多,可以涵盖几乎所有实际情形。此外,基于我们的分析,我们比较了两种用于去除泊松噪声(传输成像中最显著的噪声)的图像处理方法。文中给出了数值实验以验证我们的分析。

英文摘要

Transmission imaging, an important imaging technique widely used in astronomy, medical diagnosis, and biological science, has been shown in [49] to be quite different from the reflection imaging used in our everyday life. Understanding the structures of images (the prior information) is important for designing, testing, and choosing image processing methods, and good image processing methods are helpful for further uses of the image data, e.g., increasing the accuracy of the object reconstruction methods in transmission imaging applications. In reflection imaging, the images are usually modeled as discontinuous functions and even piecewise constant functions. In transmission imaging, it was shown very recently in [49] that almost all images are continuous functions. However, the author of [49] considered only the case of parallel beam geometry and used some overly strong assumptions in the proof, which exclude some common cases such as cylindrical objects. In this paper, we consider more general beam geometries and simplify the assumptions by using totally different techniques. In particular, we prove that almost all images in transmission imaging with both parallel and divergent beam geometries (the two most typical beam geometries) are continuous functions, under much weaker assumptions than those in [49], which admit almost all practical cases. Besides, taking our analysis into account, we compare two image processing methods for the removal of Poisson noise (which is the most significant noise in transmission imaging). Numerical experiments are provided to demonstrate our analysis.

1310.3447 2026-06-04 cs.CV cs.NA math.NA 版本更新

Image Restoration using Total Variation with Overlapping Group Sparsity

利用总变分与重叠组稀疏性的图像恢复

Jun Liu, Ting-Zhu Huang, Ivan W. Selesnick, Xiao-Guang Lv, Po-Yu Chen

AI总结 本文提出一种结合总变分与重叠组稀疏性的图像恢复方法,通过避免阶梯效应并保留边缘,提升恢复效果,并提出高效算法进行比较验证。

Comments 11 pages, 37 figures

详情
AI中文摘要

图像恢复是成像科学中最基础的问题之一。总变分(TV)正则化因其能保持边缘而在图像恢复问题中被广泛应用。然而,它也以产生阶梯状伪影而闻名。通常,高阶总变分(HTV)正则化是一个好的选择,但其过度平滑的特性限制了应用。本文研究了一个最小化问题,其中目标函数包括常规的l2数据保真项和一种重叠组稀疏性总变分正则化项,该正则化项能够避免阶梯效应并允许在恢复图像中保留边缘。我们还提出了一种快速算法来求解相应的最小化问题,并将我们的方法与最先进的基于TV和HTV的方法进行了比较。数值实验表明,所提出的方法在PSNR、相对误差和计算时间方面均表现出高效和有效。

英文摘要

Image restoration is one of the most fundamental issues in imaging science. Total variation (TV) regularization is widely used in image restoration problems for its capability to preserve edges. In the literature, however, it is also well known for producing staircase-like artifacts. Usually, the high-order total variation (HTV) regularizer is a good option apart from its over-smoothing property. In this work, we study a minimization problem where the objective includes a usual $\ell_2$ data-fidelity term and an overlapping group sparsity total variation regularizer, which can avoid the staircase effect and allows edges to be preserved in the restored image. We also propose a fast algorithm for solving the corresponding minimization problem and compare our method with state-of-the-art TV-based methods and an HTV-based method. The numerical experiments illustrate the efficiency and effectiveness of the proposed method in terms of PSNR, relative error and computing time.

1310.2842 2026-06-04 math.NA cs.CV cs.NA 版本更新

Wavelet methods for shape perception in electro-sensing

用于电感应形状感知的小波方法

Habib Ammari, Stéphane Mallat, Irène Waldspurger, Han Wang

AI总结 本文提出了一种基于小波的电感应形状识别新方法,通过微电阻测量高效识别目标形状,并通过数值模拟验证了算法的稳定性与分辨率。

详情
AI中文摘要

本文旨在提出一种利用小波方法解决电感应问题的新方法。它提供了一种高效的算法,用于从微电阻抗测量中识别目标形状。所提出算法的稳定性和分辨率能力通过数值模拟进行了量化评估。

英文摘要

This paper aims at presenting a new approach to the electro-sensing problem using wavelets. It provides an efficient algorithm for recognizing the shape of a target from micro-electrical impedance measurements. Stability and resolution capabilities of the proposed algorithm are quantified in numerical simulations.

1309.5401 2026-06-04 cs.RO cs.CV cs.SY eess.SY 版本更新

Nonmyopic View Planning for Active Object Detection

用于主动物体检测的非短视视图规划

Nikolay Atanasov, Bharath Sankaran, Jerome Le Ny, George J. Pappas, Kostas Daniilidis

AI总结 本文提出通过控制移动深度相机视角进行主动物体检测,通过规划视图序列平衡移动能耗与识别正确假设的概率,实验表明优于贪心方法。

Comments 12 pages (two-column); 7 figures; 2 tables; Manuscript submitted to the IEEE Transactions on Robotics (TRO)

详情
AI中文摘要

计算机视觉中的核心问题之一是语义重要物体的检测以及其姿态的估计。大多数物体检测工作基于单张图像处理,其性能受限于遮挡和外观和几何的模糊性。本文提出了一种主动检测方法,通过控制移动深度相机的视角进行物体检测。当初始静态检测阶段识别出感兴趣的物体时,会针对其类别和方向提出多个假设。传感器随后规划一系列视角,平衡移动所消耗的能量与识别正确假设的概率。我们提出了一个包含传感器移动性的主动假设检验问题,并使用基于点的近似POMDP算法进行求解。通过仿真和实际世界实验验证了我们的方法的有效性,结果表明我们的方法优于广泛使用的贪心视角选择方法,并在静态物体检测上提供了显著改进。

英文摘要

One of the central problems in computer vision is the detection of semantically important objects and the estimation of their pose. Most of the work in object detection has been based on single image processing and its performance is limited by occlusions and ambiguity in appearance and geometry. This paper proposes an active approach to object detection by controlling the point of view of a mobile depth camera. When an initial static detection phase identifies an object of interest, several hypotheses are made about its class and orientation. The sensor then plans a sequence of views, which balances the amount of energy used to move with the chance of identifying the correct hypothesis. We formulate an active hypothesis testing problem, which includes sensor mobility, and solve it using a point-based approximate POMDP algorithm. The validity of our approach is verified through simulation and real-world experiments with the PR2 robot. The results suggest that our approach outperforms the widely-used greedy view point selection and provides a significant improvement over static object detection.

1308.2292 2026-06-04 cs.CV cs.NA math.AP math.NA 版本更新

Fast image segmentation and restoration using parametric curve evolution with junctions and topology changes

利用含交汇点和拓扑变化的参数曲线演化进行快速图像分割与修复

Heike Benninghoff, Harald Garcke

AI总结 本文提出一种基于区域轮廓模型的曲线演化方案,支持节点和拓扑变化,结合后处理去噪,实现快速高效的图像分割与修复。

Comments 26 pages, 16 figures

详情
AI中文摘要

本文介绍了基于区域轮廓模型的曲线演化方案,允许节点、向量值图像和拓扑变化,并结合后处理去噪,实现快速高效的图像分割与修复。通过利用切向自由度避免网格点分布不均,多个人工测试问题和真实图像的数值模拟展示了该方法的性能。

英文摘要

Curve evolution schemes for image segmentation based on a region based contour model allowing for junctions, vector-valued images and topology changes are introduced. Together with an a posteriori denoising in the segmented homogeneous regions this leads to a fast and efficient method for image segmentation and restoration. An uneven spread of mesh points is avoided by using the tangential degrees of freedom. Several numerical simulations on artificial test problems and on real images illustrate the performance of the method.