arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03920 2026-06-03 cs.CV

Benchmarking Visual State Tracking in Multimodal Video Understanding

Sihyun Yu, Nanye Ma, Pinzhi Huang, Hyunseok Lee, Shusheng Yang, June Suk Choi, Ellis Brown, Oscar Michel, Boyang Zheng, Jinwoo Shin, Saining Xie

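Affiliations: New York University; KAIST

AI summary: Proposes the VSTAT benchmark, which evaluates visual state tracking in multimodal large language models using questions that require continuous perception and integration of the entire video stream. Current models fall far below human performance, and their failures stem mainly from visual perception rather than textual reasoning.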

Comments Website: https://vision-x-nyu.github.io/vstat-site/

English abstract

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce the Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.

2606.03918 2026-06-03 cs.AI

Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

Eric Cho, Shawn Huang, Alice Lu, Andy Lyu

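Affiliations: Trata; Brigham Young University

AI summary: Proposes the Hedge-Bench benchmark of 102 tasks grounded in the reasoning traces of hedge fund analysts' actual work, used to evaluate AI agents on open-ended financial reasoning problems. Frontier models score below 16%.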

Comments Dataset and evaluation harness available at github.com/Trata-Inc/trata-hedge-bench

English abstract

AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert analyst work. Existing benchmarks do not capture this class of problem, and those that attempt to evaluate open-ended reasoning rely on model-judged outputs that introduce noise and circularity. We present Hedge-Bench 1.0: a benchmark of 102 actual, on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16% on the benchmark. We publish the dataset and evaluation harness at github.com/Trata-Inc/trata-hedge-bench.

2606.03915 2026-06-03 cs.CV

PatchScene: Patch-based Voxel Diffusion for Large-Scale Scene Completion

Qingdong Xu, Jiajun Zhu, Shilin Zhu, Xinjing He, Chao Lu, Huanran Wang, Jiyao Zhang

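Affiliations: MEGVII Technology; Qianli Technology; Peking University; Northeastern University, China; Northwest Polytechnical University, Xi’an

AI summary: Proposes PatchScene, a patch-based voxel diffusion framework that achieves large-scale LiDAR scene completion through fine-grained generation in local 3D regions, confidence-guided spatio-temporal fusion, and an Annular-Flow diffusion strategy. It reaches state-of-the-art performance on SemanticKITTI and generalizes strongly.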

Comments 10 pages, 5 figures, 5 tables

English abstract

We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-the-art performance across all standard metrics, surpassing previous approaches in both geometric accuracy and temporal consistency. Remarkably, the model trained on 20 m LiDAR ranges generalizes effectively to 50 m scenes without retraining, highlighting its strong scalability and generalization capability for real-world autonomous driving applications.

2606.03911 2026-06-03 cs.CV

Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching

Yoad Tewel, Yuval Atzmon, Gal Chechik, Lior Wolf

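Affiliations: Weizmann Institute of Science

AI summary: Proposes the Bootstrap Your Generator (ByG) framework, which uses the base model's knowledge and flow matching to train image and video editing models without paired data or external signals, reaching state-of-the-art performance in data-scarce scenarios.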

Comments Accepted at ICML 2026. Project page is at https://research.nvidia.com/labs/par/byg/

English abstract

Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow matching editing models. It leverages the base model's knowledge without any external signal. Our approach pairs instruction-following cues extracted from the frozen model with cycle-consistency for structure preservation. To make this tractable, we propose to route gradients from downstream losses over clean predictions to noisy training states. We demonstrate state-of-the-art results on challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method effectively generalizes to unseen domains and outperforms supervised baselines trained on millions of samples. Analysis reveals that our gradient routing bridges the train-inference gap, and extracting semantic cues from a base model provides a robust training signal that obviates the need for external reward models.

2606.03909 2026-06-03 cs.CV

SparseStreet: Sparse Gaussian Splatting for Real-Time Street Scene Simulation

Qingpo Wuwu, Xiaobao Wei, Peng Chen, Nan Huang, Zhongyu Zhao, Hao Wang, Ming Lu, Ningning Ma, Shanghang Zhang

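Affiliations: Peking University; Chinese Academy of Sciences; University of Illinois Urbana-Champaign; Autonomous Driving Development, NIO

AI summary: To address the redundancy of Gaussian primitives in street scene reconstruction, proposes a framework that combines node-based learnable pruning with background compression, achieving a compression ratio of up to 80% with minimal quality loss.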

English abstract

While 3D Gaussian Splatting has shown promising results in street scene reconstruction, existing methods require massive numbers of Gaussian primitives to capture fine details, leading to prohibitive storage costs and slow rendering speeds. We observe that dynamic objects (e.g., vehicles and pedestrians) demand high-fidelity representations to maintain temporal consistency, while static background regions often contain substantial redundancy. Motivated by this, we propose SparseStreet, a general compression framework specifically designed for street scenes. First, we introduce a node-based learnable pruning strategy that systematically removes low-contributing Gaussian primitives while preserving visually critical regions. Second, after the scene representation stabilizes, we apply background compression, further reducing redundancy in static regions. Our method effectively preserves the geometry and appearance of dynamic objects while significantly reducing the total number of Gaussian primitives. Extensive experiments on the Waymo and nuScenes datasets demonstrate that SparseStreet achieves a compression ratio of up to 80% with minimal quality degradation, enabling resource-efficient, high-fidelity dynamic scene reconstruction. Project website: https://sparsestreet.github.io/.

2606.03906 2026-06-03 cs.AI

scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation

Jiabei Cheng, Jingbo Zhou, Jun Xia, Changkai Li, Zhen Lei, Chang Yu, Stan Z. Li

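Affiliations: Westlake University; Shanghai Jiao Tong University; Zhejiang University; The Hong Kong University of Science and Technology (Guangzhou); Xidian University; Institute of Automation, Chinese Academy of Sciences

AI summary: For single-cell multi-omics modality translation, proposes scTranslation, a comprehensive benchmark with diverse datasets, state-of-the-art models, and comprehensive evaluation metrics, and systematically studies influencing factors such as feature selection, feature quality, and few-shot settings.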

English abstract

Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation metrics, and influencing factors. To address this, we present scTranslation, a comprehensive benchmark for single-cell multi-omics modality translation tasks. It includes diverse translation datasets, integrates state-of-the-art models, and provides comprehensive evaluation metrics. In addition, we assess model performance under different scenarios, such as feature selection, feature quality, and few-shot settings. These factors significantly affect model performance but have rarely been systematically studied before. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development. The benchmark is open-sourced to facilitate future research. The code is anonymously released at https://github.com/Bunnybeibei/scTranslation.

2606.03905 2026-06-03 cs.RO

Semantic-weighted ICP for LiDAR Odometry: Class-Aware Residual Reweighting for Robust Scan Registration

Vasco Carvalho, Tiago Barros, Urbano J. Nunes

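Affiliations: Institute of Robotics and Autonomous Systems, University of Lisbon

AI summary: Proposes a semantic-weighted ICP method that weights residuals by the geometric stability of each semantic class, improving the robustness of pose estimation for LiDAR odometry in dynamic and complex environments.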

English abstract

LiDAR odometry is a fundamental component of autonomous robotic systems, relying on geometric registration between consecutive point clouds to estimate ego-motion. However, traditional geometric approaches often degrade in dynamic or unstructured environments due to unreliable correspondences caused by moving objects, sparse geometric features, vegetation, and semantically ambiguous structures. Existing works have shown that some of these limitations can be addressed by introducing semantic information from the environment into the registration process. In this work, we build on this and show that not all elements in the environment are equally relevant for registration. Hence, we propose a semantic class-weighted ICP for LiDAR odometry. Instead of strictly filtering out points belonging to specific semantic classes, the proposed approach weights the residuals of points belonging to semantic categories based on their expected geometric stability. This strategy enables informative but potentially unstable structures to contribute to the registration process while mitigating the influence of dynamic objects. The experimental evaluation was conducted on the SemanticKITTI and RELLIS-3D datasets, which include urban, highway, rural, and off-road environments. The empirical results show that the proposed semantic-weighted ICP improves pose estimation, especially in challenging off-road scenarios where conventional rigid features are scarce. Furthermore, the analysis reveals that the effectiveness of this weighting strategy is highly environment-dependent, influenced by the structural and semantic composition of the scene.
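
The class-aware residual weighting described in this abstract can be illustrated with a single alignment step. The following is a minimal 2D sketch that assumes correspondences are already matched; the class names, the values in `CLASS_WEIGHTS`, the default weight for unknown classes, and the function `weighted_align_2d` are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical per-class weights: geometrically stable classes get a high
# weight, potentially unstable or dynamic classes a low but nonzero one.
CLASS_WEIGHTS = {"building": 1.0, "vegetation": 0.4, "vehicle": 0.1}


def weighted_align_2d(src, dst, labels, class_weights=CLASS_WEIGHTS):
    """One weighted least-squares alignment step for matched 2D points.

    Minimizes sum_i w_i * ||R(theta) * src_i + t - dst_i||^2, where w_i is
    looked up from the semantic label of point i.
    Returns (theta, (tx, ty)).
    """
    # Unknown classes fall back to an assumed neutral weight of 0.5.
    w = [class_weights.get(label, 0.5) for label in labels]
    total = sum(w)

    # Weighted centroids of both point sets.
    sx = sum(wi * p[0] for wi, p in zip(w, src)) / total
    sy = sum(wi * p[1] for wi, p in zip(w, src)) / total
    dx = sum(wi * q[0] for wi, q in zip(w, dst)) / total
    dy = sum(wi * q[1] for wi, q in zip(w, dst)) / total

    # Closed-form optimal rotation angle from weighted cross and dot terms.
    cross = 0.0
    dot = 0.0
    for wi, p, q in zip(w, src, dst):
        px, py = p[0] - sx, p[1] - sy
        qx, qy = q[0] - dx, q[1] - dy
        cross += wi * (px * qy - py * qx)
        dot += wi * (px * qx + py * qy)
    theta = math.atan2(cross, dot)

    # Translation that maps the rotated source centroid onto the target one.
    c, s = math.cos(theta), math.sin(theta)
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)


def transform(points, theta, t):
    """Apply a 2D rigid transform to a list of points."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]
```

A full ICP loop would alternate this step with nearest-neighbor matching. Setting a class weight to zero recovers the hard filtering that the abstract contrasts with, whereas small positive weights let unstable structures still contribute.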

2606.03904 2026-06-03 cs.LG cs.CV

MAdam: Metric-Aware Multi-Objective Adam

Fengbei Liu, Rachit Saluja, Sunwoo Kwak, Ruibo Wang, Ruining Deng, Heejong Kim, Johannes C. Paetzold, Mert R. Sabuncu

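Affiliations: Cornell Tech; Weill Cornell Medicine; Delft University of Technology

AI summary: Proposes MAdam, which preconditions the reconciled direction in multi-objective optimization with preference-conditioned curvature, resolving the weighting and geometric mismatches between Adam and the solver and consistently improving performance on tasks such as multi-task learning and Pareto-front recovery.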

English abstract

Multi-objective optimization (MOO) underlies many machine learning problems, yet MOO solvers across the loss-balancing, gradient-balancing, and Pareto-based families almost universally hand their reconciled directions to Adam (Kingma & Ba, 2015). We show this coupling introduces two systematic gaps between the solver's intent and the optimizer's execution. The first is a weighting mismatch: Adam's second-moment denominator entangles the time-varying preference vector with gradient statistics, marginalizing the preference into a history average and collapsing distinct Pareto trade-offs toward a near-uniform mixture. The second is a geometric mismatch: Adam's adaptive metric distorts the Euclidean geometry MOO solvers assume, turning aligned objectives into apparent conflicts. To resolve both jointly, we introduce MAdam (Metric-Aware Multi-Objective Adam), a drop-in wrapper that leaves both solver and optimizer unchanged. MAdam preconditions the reconciled direction by the preference-conditioned curvature of the scalarized objective; on this whitened input, Adam's second moment collapses to identity, so the realized update is governed by the preference-conditioned metric. Across multi-task learning, Pareto-front recovery, physics-informed neural networks, and medical imaging, MAdam consistently improves over Adam for every solver family.
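
The weighting mismatch can be seen in a deliberately extreme toy case. The sketch below is not MAdam and not the paper's experiment; it only runs textbook Adam and plain gradient descent on a weighted sum of two objectives that act on separate coordinates and have constant gradients. The helper names and the toy objectives are assumptions made for illustration.

```python
import math


def adam_steps(grad_fn, x, steps=5, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Textbook Adam with bias correction, applied per coordinate."""
    x = list(x)
    m = [0.0] * len(x)
    v = [0.0] * len(x)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        for i, gi in enumerate(g):
            m[i] = b1 * m[i] + (1 - b1) * gi
            v[i] = b2 * v[i] + (1 - b2) * gi * gi
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


def sgd_steps(grad_fn, x, steps=5, lr=0.1):
    """Plain gradient descent for comparison."""
    x = list(x)
    for _ in range(steps):
        g = grad_fn(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def scalarized_grad(weights):
    """Gradient of w0 * f0 + w1 * f1 for the toy objectives f0 = x0, f1 = x1."""
    def grad(_x):
        return [weights[0] * 1.0, weights[1] * 1.0]
    return grad


adam_skewed = adam_steps(scalarized_grad((0.9, 0.1)), [0.0, 0.0])
adam_uniform = adam_steps(scalarized_grad((0.5, 0.5)), [0.0, 0.0])
sgd_skewed = sgd_steps(scalarized_grad((0.9, 0.1)), [0.0, 0.0])
```

In this toy setting Adam moves both coordinates by almost the same amount whether the preference is (0.9, 0.1) or (0.5, 0.5), because the per-coordinate second-moment normalization divides the weight back out, while gradient descent keeps the 9:1 ratio. Real objectives share parameters and have changing gradients, so the effect there is partial rather than total.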

2606.03903 2026-06-03 cs.CV

An Attention-Based Denoising Model for Diffusion Weighted Imaging

Prithviraj Verma, Pawan Kumar, Chandan Deshani, Prasun Chandra Tripathi

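Affiliations: Institute of Infrastructure Technology Research and Management (IITRAM); University of Sheffield

AI summary: Proposes a noise-aware, attention-driven denoising framework that combines Swin Transformer window attention with multi-dimensional gated refinement to handle signal-dependent Rician noise in DWI, achieving a mean PSNR of 33.69 dB and SSIM of 0.8539 across noise levels from 1% to 15%.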

English abstract

Diffusion-weighted imaging (DWI) is used for whole-body cancer screening, but it typically requires a long acquisition time. When the scan time is reduced, the image quality often suffers, leading to increased noise in the scans. Magnitude reconstruction in DWI introduces signal-dependent Rician noise, which makes denoising more challenging for conventional convolution-based methods. To address this limitation, we propose a noise-aware attention-driven denoising framework that integrates hierarchical Swin Transformer window attention with transformer-based multi-dimensional gated refinement for DWI restoration. The model incorporates explicit noise-level conditioning and residual reconstruction to enable adaptive suppression of heteroscedastic noise across a wide range of corruption levels. Experimental evaluation on corrupted DWI scans demonstrates strong restoration performance. Our model achieves a mean PSNR of 33.69 dB and SSIM of 0.8539 across noise levels from 1% to 15%, while maintaining stable behavior under severe noise conditions. These results indicate that attention-guided contextual modeling combined with channel-adaptive refinement provides a robust and generalizable solution for DWI denoising.
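
For readers unfamiliar with the noise model and metric named above, the following minimal sketch simulates Rician noise on a magnitude signal and computes PSNR. It assumes that a noise level of p% means a Gaussian standard deviation of p% of the peak intensity, which the abstract does not state; the function names are illustrative and this is not the paper's evaluation code.

```python
import math
import random


def add_rician_noise(signal, sigma, rng):
    """Magnitude of a complex signal with Gaussian noise on both channels.

    Each clean value s becomes sqrt((s + n1)^2 + n2^2), with n1 and n2 drawn
    independently from N(0, sigma^2). The result is never negative and is
    biased upward where the clean signal is weak.
    """
    return [
        math.hypot(s + rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        for s in signal
    ]


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for intensities in [0, peak]."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


rng = random.Random(0)
clean = [0.5] * 1000
noisy_low = add_rician_noise(clean, 0.01, rng)   # assumed "1%" noise level
noisy_high = add_rician_noise(clean, 0.15, rng)  # assumed "15%" noise level
```

Because the noise enters before the magnitude operation, its effect depends on the local signal strength, which is the signal dependence the abstract refers to.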

2606.03893 2026-06-03 cs.CV

Electromagnetic Navigation for Femoral Osteotomy Using High-Accuracy X-ray-to-CT Registration

Roman Flepp, Arend Nieuwland, Bastian Sigrist, Philipp Fürnstahl, Lilian Calvet, Thomas Dreher

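Affiliations: Department of Pediatric Orthopedics and Traumatology, University Children’s Hospital Zürich; Research in Orthopedic Computer Science, University Hospital Balgrist, University of Zurich; Department of Orthopedic Surgery, University Hospital Balgrist, University of Zurich

AI summary: Proposes an electromagnetic-tracking-based navigation system for femoral osteotomy that achieves real-time, fluoroscopy-free navigation from a single intraoperative C-arm calibration and the registration of two X-ray images. In experiments on synthetic femora, its total angular error was significantly lower than that of free-hand execution, and its accuracy was equivalent to patient-specific instrumentation.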

Comments Will be published in the International Journal of Computer Assisted Radiology and Surgery

English abstract

Accurate execution of preoperative plans in corrective femoral osteotomies remains challenging. Current techniques are limited by variable accuracy, invasiveness, and radiation exposure, with free-hand methods and patient-specific instrumentation (PSI) often requiring >30 and >6 fluoroscopic images, respectively. We present an integrated, electromagnetic tracking (EMT)-based navigation system for femoral osteotomies that minimizes dissection and intraoperative fluoroscopy. The system couples CT-based preoperative planning with one-time intraoperative C-arm calibration and accurate X-ray-to-CT registration from two fluoroscopic images acquired at initialization. This enables real-time, fluoroscopy-free EMT navigation of the saw blade and bone fragments relative to the preoperative plan, and is compatible with uniplanar and biplanar osteotomies. In a feasibility study using 18 synthetic femora, EMT guidance significantly outperformed free-hand execution in total angular error ($(3.05 \pm 0.75)^\circ$ vs. $(6.32 \pm 2.36)^\circ$, $p=0.031$), assuming the same minimal surgical exposure for both. No EMT-guided trials exceeded the 5° clinical threshold, whereas free-hand execution produced 4 outliers in 6 trials. The system achieved statistical equivalence ($\pm 2^\circ$, $\pm 2\,\text{mm}$) to PSI for total angular ($p \le 0.02$) and total translational ($p=0.048$) errors, with no significant differences in user questionnaire scores. By transferring preoperative plans using only two fluoroscopic images while matching PSI accuracy without additional surgical exposure, the proposed system motivates subsequent cadaveric and clinical validation.

2606.03890 2026-06-03 cs.CV

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Yifei Li, Pengyiang Liu, Yuhang Zang, Zhongyue Shi, Qi Fu, Hongye Hao, Jiwen Lu

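Affiliations: Tsinghua University; Shanghai AI Laboratory; Beihang University

AI summary: Proposes OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence with 1,680 questions spanning four levels of abstraction. An evaluation of 38 MLLMs finds that Gemini-3.1-Pro trails human experts by 27 points and that streaming and spatially fine-tuned MLLMs underperform their own backbones.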

Comments 48 pages, 12 figures, 15 tables. Project page: https://internlm.github.io/OVO-S-Bench/

English abstract

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation, the model sees only the prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27 points (59.2 vs. 86.6), with allocentric mapping as the dominant bottleneck. Notably, streaming and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when ungrounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.
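
The prefix-only evaluation protocol can be sketched as a small data structure. The field names, the strict "before the query" cutoff, and the `window` notion of a current view below are assumptions for illustration; the benchmark's actual schema may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StreamingQuestion:
    text: str
    query_time: float              # seconds into the video when the question is asked
    evidence: Tuple[float, float]  # (start, end) of the interval holding the evidence


def visible_prefix(frame_times: List[float], question: StreamingQuestion) -> List[float]:
    """Timestamps of the frames the model may see: only those before the query."""
    return [t for t in frame_times if t < question.query_time]


def evidence_outside_window(question: StreamingQuestion, window: float) -> bool:
    """True if the evidence ended before the last `window` seconds of the prefix.

    Such questions cannot be answered from the most recent frames alone and
    require the model to have retained earlier parts of the stream.
    """
    return question.evidence[1] < question.query_time - window
```

A harness built this way would pass only `visible_prefix(...)` to the model, so that no frame at or after the query timestamp can leak into the answer.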

2606.03888 2026-06-03 cs.CV cs.LG

CoralBay: A Self-Supervised CT Foundation Model

Ioannis Gatopoulos, Nicolas Känzig, Sebastian Otálora, Fei Tang

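Affiliations: kaiko.ai

AI summary: Proposes the CoralBay framework, which learns multi-scale features with a hierarchical 3D Swin backbone and self-distillation, enabling self-supervised pre-training on CT volumes and improving performance on downstream radiology tasks.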

English abstract

Self-supervised learning has enabled large-scale pre-training on 2D natural images, producing general-purpose visual representations that transfer effectively across tasks. However, many medical imaging modalities, such as CT scans, are inherently three-dimensional and differ fundamentally from natural images in both structure and semantics. Volumetric modalities capture spatial continuity, organ anatomy, and intensity-based tissue properties (e.g., Hounsfield Units), which are not adequately modeled by 2D pre-training. To bridge this gap, we introduce CoralBay, a self-distillation framework that extends DINO by using a hierarchical 3D Swin backbone and applying self-distillation to concatenated multi-scale features, enabling data-efficient self-supervised learning of rich spatial representations that encode both global semantics and fine-grained local structure. As a result, CoralBay transfers effectively to a wide range of downstream radiological tasks, demonstrating strong and consistent performance across diverse anatomical targets. In addition, we contribute to the open-source eva framework by introducing a public, reproducible 3D radiology leaderboard that unifies multiple datasets and establishes a standardized benchmark for evaluating volumetric representation learning methods.

2606.03885 2026-06-03 cs.LG

Attribution via Distributional Paths for Information Revelation

Kieran A. Murphy, Shameen Shrestha

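Affiliations: New Jersey Institute of Technology

AI summary: Proposes Reveal-IG, which lifts path attribution from input space to a space of structured probe distributions, progressively revealing information and attributing changes in the model's expected output, while retaining completeness and avoiding input-space path artifacts.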

Comments Code: https://github.com/murphyka/Reveal-IG

English abstract

Feature attribution methods explain predictions by assigning importance scores to input features. Path-based methods such as Integrated Gradients are especially appealing because they satisfy completeness: attributions sum to the change in model output between a reference state and the input. Yet most path methods define this trajectory in input space, explaining a model through pointwise perturbed inputs along a chosen path. An input-space path integrates the model's raw response at each point it passes through, with no control over the resolution at which a feature is queried; the early, baseline-adjacent part of the trajectory contributes to the explanation on equal footing with the input itself. Here, we lift path attribution from input space to a space of structured probe distributions around the example of interest, and call our method Reveal-IG. Rather than traversing raw input values, Reveal-IG progressively reveals information about the input and attributes changes in the model's expected output along this distributional path. The result is a path-attribution framework that retains completeness with respect to the expected model response, and naturally accommodates multiscale image probes and feature-wise uncertainty in tabular data. Synthetic diagnostics show that Reveal-IG avoids path artifacts that affect input-space methods, and across ImageNet classification and tabular regression it produces stable, signed attributions, leading on metrics that use attribution sign while remaining competitive on the rest.
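
The completeness property mentioned above can be checked numerically for standard, input-space Integrated Gradients. The sketch below is the classic straight-line variant, not Reveal-IG; the toy model `f`, its gradient `grad_f`, and the midpoint-rule discretization are choices made for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Straight-line Integrated Gradients with a midpoint Riemann sum.

    attribution_i = (x_i - b_i) * average over the path of d f / d x_i,
    where the path runs from the baseline b to the input x.
    """
    n = len(x)
    grad_sum = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(n):
            grad_sum[i] += g[i]
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, grad_sum)]


def f(x):
    """Toy model: f(x) = x0^2 + 3 * x0 * x1."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def grad_f(x):
    """Analytic gradient of the toy model."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


x = [1.0, 2.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
completeness_gap = sum(attributions) - (f(x) - f(baseline))
```

For this quadratic toy model the gradient is linear along the path, so the midpoint rule is exact up to rounding and `completeness_gap` is numerically zero. For general models the gap shrinks as `steps` grows.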

2606.03879 2026-06-03 cs.CV cs.AI

Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs

Wei Ding, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang

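Affiliations: Tsinghua University; Tencent; University of Macau; University of Science and Technology Beijing

AI summary: By retraining all 31 non-empty encoder subsets, proposes a Capacity-Necessity decomposition and a pre-projector rank analysis, showing that encoder roles in multi-encoder vision-language models are not simply additive and giving principles for choosing the best pair.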

English abstract

As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline (~20k GPU-hours total), and report three findings. First, retraining each subset from scratch reveals encoder rankings that differ from those obtained by masking encoders on a fixed checkpoint, including which encoder ranks first overall. Second, we decompose each encoder's contribution into two axes, Capacity, the score an encoder reaches on its own, and Necessity, the drop when it is removed from the full pool. The two axes are not interchangeable. Pairing the two highest-Capacity encoders is suboptimal, while pairing a high-Capacity anchor with an adaptive complement matches the full five-encoder model. Adding further encoders beyond this pair yields only marginal gains. Third, at fixed parameter count, per-encoder pre-projector effective rank explains the residual score variation. The strongest pairs combine an anchor whose rank survives joint training with a complement whose rank expands under it, suggesting that higher-rank, less-collapsed projector inputs correspond to a more favorable optimization regime at the encoder-projector interface. Together, the Capacity-Necessity decomposition and the pre-projector rank analysis, along with comprehensive evaluation through retraining, expose a methodological gap in multi-encoder LVLM design, and offer concrete primitives for closing it.
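
The two axes defined in this abstract are straightforward to compute once every subset has a score. The sketch below enumerates the 31 non-empty subsets of five encoders and evaluates Capacity and Necessity; the encoder names, the `STRENGTH` values, and the `toy_score` function are entirely made up and stand in for the retrained-model benchmark scores the paper actually measures.

```python
from itertools import combinations

# Hypothetical encoder names and strengths, for illustration only.
ENCODERS = ["enc_a", "enc_b", "enc_c", "enc_d", "enc_e"]
STRENGTH = {"enc_a": 50, "enc_b": 48, "enc_c": 40, "enc_d": 35, "enc_e": 30}


def non_empty_subsets(items):
    """All non-empty subsets; for 5 items this yields 2**5 - 1 = 31 sets."""
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


def toy_score(subset):
    """Made-up score: best single encoder plus one point per member."""
    return max(STRENGTH[e] for e in subset) + len(subset)


def capacity(scores, encoder):
    """Capacity: the score the encoder reaches on its own."""
    return scores[frozenset([encoder])]


def necessity(scores, encoder, pool):
    """Necessity: the score drop when the encoder is removed from the full pool."""
    full = frozenset(pool)
    return scores[full] - scores[full - {encoder}]


scores = {subset: toy_score(subset) for subset in non_empty_subsets(ENCODERS)}
```

Even in this toy table the two axes diverge: `enc_b` has nearly the same Capacity as `enc_a` but a much smaller Necessity, which mirrors the abstract's point that the axes are not interchangeable.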

2606.03877 2026-06-03 cs.CV

MLP Splatting: Object-Centric Neural Fields

Shinjeong Kim, Yuzhou Cheng, Xin Kong, Paul H. J. Kelly, Andrew J. Davison

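Affiliations: Department of Computing, Imperial College London

AI summary: Proposes MLP-Splatting, which uses a small number of compact MLP primitives for scene decomposition and novel-view synthesis, supports object-level editing, and is more memory- and rendering-efficient than existing methods.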

AI中文摘要

3D表示对于场景渲染、理解和交互至关重要。最近的方法,如3D高斯泼溅和神经辐射场,实现了令人印象深刻的光照真实感新视角合成,但缺乏将场景元素轻松分解为少数原语的能力,需要额外的分割或分组才能进行对象级操作。我们提出了MLP-Splatting,一种通过少量富有表现力的光场原语实现场景分解,同时提供光照真实感新视角合成的方法。MLP-Splatting将每个原语建模为一个独立的紧凑MLP,具有局部空间支持,预测辐射度和不透明度。与低级高斯原语或单个全局辐射场相比,我们的神经原语提供了更大的表达能力,同时保持空间局部性。通过高效的光线-原语交互稀疏体积合成进行渲染。我们的原语仅使用RGB监督进行训练,这产生了代表局部场景区域(通常对应于对象或对象部分)的原语,通过选择少量原语即可实现无需分割掩码的交互式对象级编辑。我们的方法辅以可选的语义特征蒸馏,支持开放词汇场景交互和开放集实例分割。与最先进的方法相比,我们在实验中表明,与语义3DGS方法相比,我们实现了显著更低的内存使用(1/15倍)和更快的渲染(3倍)。项目页面:此https URL

英文摘要

3D representations are fundamental to scene rendering, understanding, and interaction. Recent approaches, such as 3D Gaussian Splatting and Neural Radiance Fields, achieve impressive photorealistic novel-view synthesis, but cannot easily decompose a scene into a few primitives, so object-level manipulation requires additional segmentation or grouping. We present MLP-Splatting, a method that enables scene decomposition via a few expressive light-field primitives while providing photorealistic novel-view synthesis. MLP-Splatting models each primitive as an independent compact MLP with localized spatial support that predicts radiance and opacity. In contrast to low-level Gaussian primitives or a single global radiance field, our neural primitives provide greater expressive capacity while remaining spatially localized. Rendering is performed through efficient sparse volumetric compositing over ray-primitive interactions. Our primitives are trained with RGB supervision alone, which yields primitives that represent local scene regions often corresponding to objects or object parts, enabling interactive object-level editing without segmentation masks by selecting a handful of primitives. Our method, augmented with optional semantic feature distillation, enables open-vocabulary scene interaction and open-set instance segmentation. In our experiments, compared to state-of-the-art semantic 3DGS methods, we achieve substantially lower memory usage (1/15$\times$) and faster rendering (3$\times$). Project Page: https://shinjeongkim.com/mlp-splatting

2606.03875 2026-06-03 cs.CV

Seg2Track++: Probabilistic Track Validation and Data Association for Multi-Object Tracking and Segmentation

Seg2Track++: 用于多目标跟踪与分割的概率轨迹验证与数据关联

Diogo Mendonça, Tiago Barros, Cristiano Premebida, Urbano J. Nunes

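Affiliations: University of Coimbra, Institute of Systems and Robotics, Department of Electrical and Computer Engineering

AI summary: Proposes the Seg2Track++ framework, which combines SAM2 instance segmentation with probabilistic track validation for zero-shot multi-object tracking and segmentation, improving identity preservation and suppressing false-positive propagation.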

AI中文摘要

自主系统需要鲁棒的多目标跟踪与分割(MOTS)以在动态环境中可靠运行,确保一致的目标身份和精确的掩码级描绘。SAM2等基础模型在分割方面表现出强大的零样本泛化能力,但其直接应用于MOTS受到不可靠的轨迹关联和假阳性传播的限制。本文介绍Seg2Track++,一个将实例分割与SAM2及新颖的轨迹管理模块相结合的框架,以执行具有增强时间一致性的零样本MOTS。轨迹通过掩码质心距离(MCD)和置信度感知成本调制(CCM)进行关联,而概率轨迹验证(PTV)采用伯努利滤波器验证轨迹存在并抑制鬼影轨迹。在KITTI MOTS上的实验结果表明,无需微调即可改善身份保持、减少假阳性传播并实现鲁棒的轨迹管理。

英文摘要

Autonomous systems require robust Multi-Object Tracking and Segmentation (MOTS) to operate reliably in dynamic environments, ensuring consistent object identities and precise mask-level delineation. Foundation models such as SAM2 have shown strong zero-shot generalization for segmentation, but their direct application to MOTS is limited by unreliable track association and false-positive propagation. This work introduces Seg2Track++, a framework that integrates instance segmentation with SAM2 and a novel track management module to perform zero-shot MOTS with enhanced temporal consistency. Tracks are associated using Mask Centroid Distance (MCD) and Confidence-Aware Cost Modulation (CCM), while Probabilistic Track Validation (PTV) employs a Bernoulli filter to validate track existence and suppress ghost tracks. Experimental results on KITTI MOTS demonstrate improved identity preservation, reduced false-positive propagation, and robust track management without fine-tuning.
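To make the idea of validating track existence concrete, here is a minimal sketch of a binary Bayes update on a track's existence probability. It is a simplification, not the Seg2Track++ implementation: a full Bernoulli filter also propagates a spatial density, and every parameter value below (`p_survive`, `p_detect`, `p_false`, the confirm and delete thresholds) is a hypothetical choice for illustration.

```python
def update_existence(r, detected, p_survive=0.99, p_detect=0.9, p_false=0.1):
    """One predict/update step for the existence probability r of a track."""
    r_pred = p_survive * r  # prediction: the object may have left the scene
    if detected:
        num = p_detect * r_pred
        den = num + p_false * (1.0 - r_pred)
    else:
        num = (1.0 - p_detect) * r_pred
        den = num + (1.0 - p_false) * (1.0 - r_pred)
    return num / den

def validate(track_hits, r0=0.5, confirm=0.9, delete=0.1):
    """Classify a track from its per-frame association history (True = matched)."""
    r = r0
    for hit in track_hits:
        r = update_existence(r, hit)
        if r <= delete:
            return "deleted"  # likely a ghost track: stop propagating it
    return "confirmed" if r >= confirm else "tentative"

print(validate([True, True]))
print(validate([True]))
print(validate([False, False]))
```

Under these assumed parameters, a track must be matched in consecutive frames before it is confirmed, and an unmatched new track is dropped quickly, which is one way false positives can be kept from propagating.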

2606.03874 2026-06-03 cs.CV cs.RO

DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction

DyaPlex: 用于二元交互的全双工语音-运动模型

Koki Nagano, Hongyu Liu, Seonwook Park, Tianye Li, Amrita Mazumdar, Christian Jacobsen, Shengze Wang, Michael Stengel, Rajarshi Roy, Ka Chun Cheung, Simon See, Shalini De Mello

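Affiliations: NVIDIA; HKUST

AI summary: Proposes DyaPlex, a streaming full-duplex speech-motion model that uses a dual-tower Transformer architecture and a unified dyadic token interleaving mechanism for synchronized multimodal interaction, reaching state-of-the-art performance on monadic and dyadic interaction benchmarks.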

Comments Project page: https://research.nvidia.com/labs/amri/projects/DyaPlex

AI中文摘要

我们提出了DyaPlex,一种用于二元交互的流式全双工语音-运动模型。为了捕捉人类交流的连续性和互惠性,这种全双工能力使智能体能够以流式方式同时感知和生成语音及物理运动。其核心在于,我们的方法利用了基础全双工语音模型的强先验,并集成了新颖的运动通路,从而实现完全同步的多模态交互。具体来说,我们设计了一种双塔Transformer架构,在保持冻结基础语音模型的零样本对话推理能力的同时,构建了深度耦合的流式运动通路。通过引入统一的二元令牌交织机制,并借助时间对齐的语音-运动RoPE引导交叉注意力,我们的模型有效地将自回归运动与丰富的潜在语音特征对齐。在4000小时的Seamless Interaction数据集上训练,我们的模型有效捕捉了跨说话者依赖关系,并在单体和二元人类交互基准上建立了新的最优性能。

英文摘要

We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, our model effectively captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.

2606.03871 2026-06-03 cs.CV cs.CL cs.LG

Visual Instruction Tuning Aligns Modalities through Abstraction

视觉指令调优通过抽象对齐模态

Luis Palacios, Lorenzo Basile, Diego Doimo, Alberto Cazzaniga

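Affiliations: Area Science Park, Trieste, Italy

AI summary: Using probing analyses and causal interventions, finds that visual instruction tuning embeds visual features directly into the LLM's intermediate semantic layers, bypassing the early unimodal processing layers, and aligns visual with textual representations by extending and strengthening the existing abstraction phase.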

AI中文摘要

视觉指令调优有效地使预训练的大语言模型(LLM)能够同时处理图像信息和文本。然而,视觉特征如何嵌入到LLM骨干网络的逐层抽象层次中仍不清楚。通过一系列不同的视觉-语言架构,我们表明指令调优主要充当桥梁,将视觉特征直接嵌入到LLM的中间语义层,绕过了用于单模态处理的早期层。通过探针分析和因果干预,我们表明这些中间层是视觉-语言处理的语义核心,并在广泛的 multimodal 基准测试中发挥关键作用。此外,通过比较语义等价的视觉和文本表示的几何结构,我们发现微调扩展并强化了现有的抽象阶段,使视觉特征与已有的文本特征对齐。最后,我们通过将微调限制在中间层来确认这种局部对齐的功能作用:该策略在视觉中心基准测试中保持了全微调的性能,同时减少了训练时间。我们的结果表明,多模态集成是一种局部现象,由LLM内部抽象引擎的重新利用驱动。

英文摘要

Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet, it remains unclear how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone. Across a diverse set of vision-language architectures, we show that instruction tuning primarily serves as a bridge, embedding visual features directly into the intermediate semantic layers of the LLM, bypassing the early layers devoted to unimodal processing. With probing analyses and causal interventions, we show that these intermediate layers are the semantic core of vision-language processing and play a critical role in the performance on a broad set of multimodal benchmarks. In addition, by comparing the geometry of semantically equivalent visual and textual representations, we find that fine-tuning extends and strengthens the existing abstraction phase, aligning visual features with pre-existing textual ones. Finally, we confirm the functional role of this localized alignment by restricting fine-tuning to intermediate layers alone: this strategy preserves the performance of full fine-tuning on vision-centric benchmarks while reducing training time. Our results suggest that multimodal integration is a localized phenomenon driven by the repurposing of the internal abstraction engine of the LLM.

2606.03868 2026-06-03 cs.CV

Unified Video-Action Joint Denoising for Dexterous Action and Data Generation

统一视频-动作联合去噪用于灵巧动作与数据生成

Dingrui Wang, YuAn Wang, Jinkun Liu, Yue Zhang, Mattia Piccinini, Yu Sun, Johannes Betz

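Affiliations: Technical University of Munich; ByteDance; Tsinghua University

AI summary: Proposes the Donk model, which jointly models the distribution of interaction videos and hand trajectories for dexterous-hand action generation and data generation.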

Comments 9 pages, 5 figures

AI中文摘要

最近的世界动作模型通过将广泛的视觉动态先验与可执行的机器人动作对齐来利用视频基础模型。我们从分布的角度重新审视这种对齐。现有的公式通常将对齐的先验缩小为基于观测的未来动作策略分布。相比之下,我们通过在多条件机制下对交互视频和可执行手部轨迹的联合空间进行建模,保持更广泛的分布。我们提出了Donk,一个用于灵巧手的统一视频-动作去噪模型。通过语言、初始图像和初始手部状态,Donk采样未来视频和双手MANO轨迹作为动作策略。在没有图像条件的情况下,相同的去噪架构从文本条件分布中采样配对的视频-动作展开,将对齐的视频先验转化为数据引擎。在动作、视频和仅文本生成评估中,Donk在相同的统一训练方案下提高了灵巧轨迹的准确性,保持了强大的视频保真度,并产生了平滑的文本条件动作展开。

英文摘要

Recent world action models leverage video foundation models by aligning broad visual-dynamics priors with executable robot actions. We revisit this alignment from a distributional perspective. Existing formulations typically narrow the aligned prior into an observation-conditioned policy distribution over future actions. In contrast, we keep the distribution broader by modeling the joint space of interaction videos and executable hand trajectories under multiple conditioning regimes. We propose Donk, a unified video-action denoising model for dexterous hands. With language, an initial image, and the initial hand state, Donk samples future videos and bimanual MANO trajectories as an action policy. Without the image condition, the same denoising architecture samples paired video-action rollouts from a text-conditioned distribution, turning the aligned video prior into a data engine. Across action, video, and text-only generation evaluations, Donk improves dexterous trajectory accuracy, preserves strong video fidelity, and produces smooth text-conditioned action rollouts under the same unified training recipe.

2606.03867 2026-06-03 cs.CL cs.AI

A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

一种基于LLM和知识图谱的无训练混合智能体框架用于多文档摘要

Cuong Vuong Tuan, Trang Mai Xuan, Tien-Cuong Nguyen, Vu-Duc Ngo, Thien Van Luong

Affiliations: Faculty of Artificial Intelligence and Data Science, Phenikaa University; VNPT AI, VNPT Group; MobiFone Research and Development Center, MobiFone Corporation; Business AI Lab, Faculty of Data Science and Artificial Intelligence, National Economics University, College of Technology

AI summary: Proposes a training-free mixture-of-agents framework that combines large language models and knowledge graphs, decomposing summarization into specialized agents (extraction, knowledge-aware abstraction, iterative refinement) unified by a multi-perspective consistency mechanism, with state-of-the-art or competitive performance on English and Vietnamese datasets.

Comments Accepted by Neural Computing and Applications

AI中文摘要

多文档摘要(MDS)在从文本数据集合中提取关键信息方面发挥着关键作用。现有方法通常难以捕捉复杂的文档间关系,严重依赖大量标注数据进行监督训练,或在跨领域和跨语言时泛化能力有限。为解决这些限制,我们提出一种无训练的混合智能体框架用于MDS,该框架利用大语言模型(LLM)和知识图谱的互补优势。我们的方法将摘要分解为专门的智能体任务:抽取式选择、知识感知抽象和迭代精炼,每个任务无需特定微调。我们通过由LLM引导的多视角一致性机制统一其输出。在四个英文和越南语数据集上的实验表明,该方法达到了最先进或具有竞争力的性能,验证了我们模块化设计的有效性和适应性。

英文摘要

Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent tasks: extractive selection, knowledge-aware abstraction, and iterative refinement, each operating without task-specific fine-tuning. We unify their outputs using a multi-perspective consistency mechanism guided by LLMs. Experiments across four datasets in English and Vietnamese demonstrate state-of-the-art or competitive performance, validating the effectiveness and adaptability of our modular design.

2606.03858 2026-06-03 cs.AI

PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

PyraMathBench: 评估与提升大型语言模型的数学能力

Zetian Ouyang, Linlin Wang, Gerard de Melo, Liang He

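Affiliations: East China Normal University; Hasso Plattner Institute, University of Potsdam

AI summary: Proposes PyraMathBench, a hierarchical benchmark that evaluates LLMs by integrating numerical processing with mathematical reasoning, and introduces the SOLVE module and the IRPO optimization method to improve numerical-mathematical synergy.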

AI中文摘要

尽管数值推理作为大型语言模型(LLM)在各类应用中数学能力的基石具有关键作用,但很少有基准测试通过整合数值处理与数学推理来评估LLM,这阻碍了数学任务中失败的可解释性。我们引入了PyraMathBench,一个全面的分层基准测试,包含来自7,404道数学文字题的32,505个问题,涵盖4个关键认知方面、14个子类别和2种模态。实验表明,LLM的性能因数值计算不足和对抽象数值问题的处理薄弱而严重受损。为解决这一问题,我们提出了智能优化与学习型多功能模块(SOLVE)和交互式相对策略优化(IRPO),通过高效的工具调用(模糊匹配和低质量调用拒绝)增强LLM的数值-数学协同能力。对比实验显示,Qwen-2.5在SOLVE和IRPO训练下获得了5.0分的提升。

英文摘要

Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs' performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions. To address this, we propose the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO), which enhance LLMs' numerical-mathematical synergy via efficient tool calls (fuzzy matching and low-quality call rejection). Comparative experiments show Qwen-2.5 achieves a 5.0 score improvement with SOLVE and IRPO training.

2606.03851 2026-06-03 cs.LG

Two-Action Apple Tasting with Switching Costs

带有切换成本的双动作苹果品尝问题

Tommaso Cesari, Roberto Colomboni

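Affiliations: School of Electrical Engineering and Computer Science, University of Ottawa; School of Mathematics, University of Bristol

AI summary: Studies the two-action apple-tasting problem with switching costs against an oblivious adversary and, by analyzing the trade-off between the revealing and blind actions, proves that the minimax regret is $\Theta(\sqrt{T})$.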

AI中文摘要

我们研究带有切换成本的双动作苹果品尝问题,对手是 oblivious 的。在等价的归一化形式中,每一轮学习者在揭示动作和盲动作之间选择:揭示动作获得奖励 $0$ 并揭示盲动作的隐藏值 $x_t\in[-1,1]$;盲动作获得奖励 $x_t$ 但不揭示任何信息。每当学习者切换动作时支付一个单位,遗憾相对于事后最佳固定动作来衡量。带有切换成本的通用反馈图算法对该问题给出 $\widetilde O(T^{2/3})$ 的遗憾保证。双动作苹果品尝图是切换成本分类中缺失的 $\Omega(T^{2/3})$ 障碍的自然候选:这样的下界将传递到一大类仍未分类的反馈图。我们证明这个障碍不存在:该问题的 oblivious 极小极大期望遗憾满足 \[ \frac{1}{2\sqrt3}\cdot\sqrt T \le R_T^\star \le 2\sqrt{3}\cdot \sqrt{T}. \]

英文摘要

We study the two-action apple-tasting problem with switching costs against an oblivious adversary. In an equivalent normalized formulation, at each round the learner chooses between a revealing action and a blind action: the revealing action gives reward $0$ and reveals the hidden value $x_t\in[-1,1]$ of the blind action; the blind action gives reward $x_t$ but reveals nothing. The learner pays one unit whenever they switch actions, and regret is measured against the best fixed action in hindsight. General feedback-graph algorithms with switching costs give $\widetilde O(T^{2/3})$ regret guarantees for this problem. The two-action apple-tasting graph was the natural candidate for the missing $\Omega(T^{2/3})$ obstruction in the switching-cost classification: such a lower bound would have transferred to a large family of still-unclassified feedback graphs. We prove that this obstruction is not there: the oblivious minimax expected regret for this problem satisfies \[ \frac{1}{2\sqrt{3}}\cdot\sqrt{T} \le R_T^\star \le 2\sqrt{3}\cdot \sqrt{T}. \]
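The normalized protocol and the stated bounds can be sketched as a small bookkeeping routine. This is only an illustration of the problem setup, not an algorithm from the paper; the action labels and the convention that the first round never counts as a switch are assumptions.

```python
import math

def play(xs, actions):
    """Return (learner_total, regret) for hidden values xs in [-1, 1]
    and a sequence of actions, each either "reveal" or "blind"."""
    total = 0.0
    prev = None
    for x, a in zip(xs, actions):
        if a == "blind":
            total += x          # blind action earns the hidden value x_t
        # the revealing action earns reward 0 (and would reveal x_t)
        if prev is not None and a != prev:
            total -= 1.0        # unit switching cost (assumed: none in round 1)
        prev = a
    # A fixed action never switches: always-reveal earns 0, always-blind earns sum(xs).
    best_fixed = max(0.0, sum(xs))
    return total, best_fixed - total

def regret_bounds(T):
    """Lower and upper bounds on the minimax regret as stated in the abstract."""
    root = math.sqrt(T)
    return root / (2.0 * math.sqrt(3.0)), 2.0 * math.sqrt(3.0) * root

total, regret = play([0.5, -1.0, 0.5, 0.5], ["reveal", "reveal", "blind", "blind"])
lower, upper = regret_bounds(10_000)
print(total, regret, lower, upper)
```

The two bounds differ only by a constant factor of $2\sqrt{3}\cdot 2\sqrt{3} = 12$, independent of $T$, which is what pins the rate at $\Theta(\sqrt{T})$.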

2606.03847 2026-06-03 cs.RO

Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies

去噪提示何时重新规划:基于流的机器人策略的去噪方差自适应分块

Xiangdong Feng, Yuxuan Cheng, Chen Shi, Boyao Han, Yuxuan Yan, Yitong Hong, Zhuotao Tian, Li Jiang

Affiliations: Beijing Institute of Technology; The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute; Hunan University; Xi’an Jiaotong University; Renmin University of China; Harbin Institute of Technology, Shenzhen

AI summary: To address the fixed execution horizon in flow-based robot policies, proposes DVAC, which uses the variance of clean-action estimates during denoising to adaptively set the execution horizon, lowering replanning frequency while maintaining or improving task success.

AI中文摘要

动作分块已成为基于流的机器人策略的常见推理策略,通过建模演示中的多步时间依赖关系来改善动作连贯性。然而,执行步长通常仍被设为经验固定值,忽略了可预测的自由空间运动和精度关键交互阶段往往需要不同的重新规划频率。在本文中,我们首先证明基于流的策略的去噪过程包含任务阶段的内在信号:干净动作估计在可预测运动阶段保持稳定,但在接触密集或精度敏感操作附近波动更大。受此观察启发,我们提出DVAC(去噪方差自适应分块),一种测试时方法,自适应地决定从每个预测分块中执行多少动作。DVAC测量最终去噪步骤中干净动作估计的方差,执行稳定的低方差前缀,并在提交高方差未来动作之前重新规划。为了跨任务和 rollout 迁移,DVAC进一步使用局部方差尺度的滚动估计来校准阈值。在LIBERO、RoboTwin、CALVIN和真实世界操作上的实验表明,DVAC在提高任务成功率的同时降低了重新规划频率。使用基于$\pi_{0.5}$的策略,DVAC将LIBERO成功率从94.75%提高到98.00%,重新规划减少43.0%,同时在RoboTwin和CALVIN上也取得了总体收益,并提高了真实世界执行效率。

英文摘要

Action chunking has become a common inference strategy for flow-based robot policies, improving action coherence by modeling multi-step temporal dependencies in demonstrations. However, the execution horizon is still typically set as an empirical fixed value, overlooking that predictable free-space motions and precision-critical interaction phases often require different replanning frequencies. In this work, we first show that the denoising process of flow-based policies contains an intrinsic signal of task phases: clean-action estimates remain stable during predictable motion phases, but fluctuate more strongly around contact-rich or precision-sensitive operations. Motivated by this observation, we propose DVAC (Denoising-Variance Adaptive Chunking), a test-time method that adaptively determines how many actions to execute from each predicted chunk. DVAC measures the variance of clean-action estimates over the final denoising steps, executes the stable low-variance prefix, and replans before high-variance future actions are committed. To transfer across tasks and rollouts, DVAC further calibrates the threshold with a rolling estimate of the local variance scale. Experiments on LIBERO, RoboTwin, CALVIN, and real-world manipulation show that DVAC improves task success while reducing replanning frequency. With a $\pi_{0.5}$-based policy, DVAC improves LIBERO success from 94.75% to 98.00% and reduces replanning by 43.0%, while also yielding aggregate gains on RoboTwin and CALVIN and improving real-world execution efficiency.
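The prefix-selection step described above might look roughly like the following sketch. It assumes scalar actions, a population variance across the final denoising steps, and a threshold calibrated as a multiple of the mean of recent variances; the function names, the `factor` parameter, and these specific choices are hypothetical and not taken from the paper.

```python
from statistics import mean, pvariance

def stable_prefix_length(estimates, threshold, min_len=1):
    """estimates[k][t] is the clean-action estimate for chunk step t at the
    k-th of the final denoising steps (scalar actions for simplicity).
    Returns how many leading actions of the chunk to execute."""
    horizon = len(estimates[0])
    n = 0
    for t in range(horizon):
        var_t = pvariance([est[t] for est in estimates])
        if var_t > threshold:
            break               # replan before committing to unstable actions
        n += 1
    return max(n, min_len)      # always execute at least min_len actions

def calibrated_threshold(recent_variances, factor=2.0):
    """Hypothetical calibration: scale the threshold by a rolling variance estimate."""
    return factor * mean(recent_variances)

estimates = [
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.0, 2.5, 1.0],
    [0.0, 1.0, 1.5, 5.0],
]
print(stable_prefix_length(estimates, threshold=0.2))
print(stable_prefix_length(estimates, threshold=0.1))
```

With the looser threshold the first three actions are executed before replanning; with the tighter one only the first two, since the third step's estimates already disagree across denoising steps.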

2606.03846 2026-06-03 cs.CL cs.AI cs.LG

Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models

聚类自评估:一种简单而有效的大型语言模型不确定性量化方法

Qi Cao, Takeshi Kojima, Andrew Gambardella, Helinyi Peng, Yutaka Matsuo, Yusuke Iwasawa

发表机构 * The University of Tokyo(东京大学)

AI总结 提出一种基于语义聚类和多项选择概率的简单自评估方法,用于大型语言模型的不确定性量化,在多个模型和数据集上优于基线方法。

Comments Findings of ACL 2026

AI中文摘要

大型语言模型(LLM)在各种任务中表现出色,但常常生成看似合理实则事实错误的回答。这一问题因缺乏明确的不确定性估计而加剧,使用户难以判断模型输出的可靠性。现有的不确定性量化方法通常依赖间接信号,如生成样本的熵。这些信号难以解释,且未充分利用模型评估自身不确定性的能力。我们提出一种简单而有效的自评估方法用于LLM的不确定性量化。我们的方法将生成样本分组为语义不同的聚类,将其转化为结构化多项选择题的答案选项,并使用LLM分配给每个选项的概率作为置信度估计。在多个模型和数据集上的实验表明,我们的方法始终优于基线方法。值得注意的是,仅需两个额外样本即可达到竞争性能,证明了其有效性和效率。

英文摘要

Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausible while being factually incorrect. This problem is compounded by the lack of explicit uncertainty estimates, which makes it difficult for users to judge the reliability of model outputs. Existing uncertainty quantification methods typically rely on indirect signals, such as entropy across sampled generations. These signals can be difficult to interpret and do not fully leverage the model's ability to assess its own uncertainty. We propose a simple yet effective self-assessment method for uncertainty quantification in LLMs. Our approach groups sampled generations into semantically distinct clusters, converts them into answer options in a structured multiple-choice question, and uses the probability assigned by the LLM to each option as a confidence estimate. Experiments across multiple models and datasets show that our method consistently outperforms baseline approaches. Notably, it achieves competitive performance with as few as two additional samples, demonstrating both its effectiveness and efficiency.
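The three steps of the approach (cluster, build a multiple-choice question, read off option probabilities) can be sketched as follows. This is a toy illustration under strong assumptions: exact matching after normalization stands in for semantic clustering, the prompt wording is invented, and the option logits are supplied by hand rather than read from a real model.

```python
import math

def cluster(samples):
    """Group sampled answers; exact match after normalization is a crude
    stand-in for semantic clustering. Returns one representative per cluster."""
    clusters = {}
    for s in samples:
        key = s.strip().lower().rstrip(".")
        clusters.setdefault(key, []).append(s)
    return [members[0] for members in clusters.values()]

def build_question(question, options):
    """Format the cluster representatives as a multiple-choice question."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)

def option_confidences(option_logits):
    """Softmax over the logits the model assigns to each option letter."""
    m = max(option_logits)
    exps = [math.exp(v - m) for v in option_logits]
    z = sum(exps)
    return [e / z for e in exps]

options = cluster(["Paris", "paris.", "Lyon"])
print(build_question("What is the capital of France?", options))
print(option_confidences([2.0, 0.0]))  # hand-supplied logits for options A and B
```

In a real pipeline the logits would come from the LLM's next-token distribution restricted to the option letters; how that restriction and normalization are done in the paper is not specified in the abstract.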

2606.03843 2026-06-03 cs.LG cs.AI

Re-Evaluating Continual Learning with Few-Shot Adaptation

重新评估带少样本适应的持续学习

Amogh Inamdar, Matthew So, Vici Milenia, Richard Zemel

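Affiliations: Department of Computer Science

AI summary: Proposes replacing zero-shot evaluation with few-shot evaluation to measure the stability and plasticity of continual learning systems more comprehensively and, using a new metric, per-shot plasticity, finds that meta-learning over future task sequences induces learning-to-learn behavior.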

Comments 21 pages, 16 figures

AI中文摘要

持续学习方法旨在最大化在任务序列上训练的机器学习模型的稳定性和可塑性。稳定性的标准度量(即遗忘)是模型在先前学习任务上的零样本性能,而可塑性则是在最近学习任务上的性能。然而,零样本评估并未完全衡量模型或方法保留已学信息或快速适应新信息的能力,因为它需要在多个任务上完美回忆。在本文中,我们提出少样本评估作为对持续学习系统稳定性和可塑性的更全面评估。我们对持续图像分类的任务序列进行了细粒度评估,发现这一范式为流行持续学习策略的性能提供了新颖的见解。通过使用新指标——每样本可塑性——进行少样本评估,我们展示了通过元学习未来任务的短序列向持续学习方法添加“前瞻性”会在任务序列上诱导学习到学习的行为。

英文摘要

Continual learning methods aim to maximize the stability and plasticity of machine learning models that are trained on a sequence of tasks. The standard measure of stability (i.e., forgetting) is the 0-shot performance of a model on previously learned tasks, and the standard measure of plasticity is the performance on the most recently learned task. However, 0-shot evaluation does not fully measure a model's or method's ability to retain learned information or adapt quickly to new information, as it requires perfect recall across multiple tasks. In this paper, we propose few-shot evaluation as a more comprehensive assessment of the stability and plasticity of a continual learning system. We conduct a fine-grained assessment on task sequences for continual image classification and find that this paradigm produces novel insights into the performance of popular continual learning strategies. Through few-shot evaluation with a novel metric, per-shot plasticity, we show that adding 'foresight' to continual learning methods via the meta-learning of a short sequence of future tasks induces learning-to-learn behavior over the task sequence.

2606.03841 2026-06-03 cs.AI

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

EvoDS: 具有技能学习和上下文管理的自进化自主数据科学智能体

Zherui Yang, Fan Liu, Yansong Ning, Hao Liu

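Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes EvoDS, which combines autonomous skill acquisition and adaptive context compression with reinforcement-learning training so that a data science agent can self-evolve, substantially improving performance on multi-stage, iterative tasks.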

Comments Accepted by KDD2026

AI中文摘要

近年来,大语言模型(LLM)智能体的进展为自动化数据科学带来了有希望的突破。然而,现有方法仍然受到静态动作集和缺乏原则性长程上下文管理的根本限制,阻碍了它们在多阶段、迭代数据科学流程中积累跨任务可重用经验并可靠运行的能力。为了解决这些挑战,我们引入了EvoDS,一个自进化的自主数据科学智能体,通过智能体强化学习学会扩展其技能并自适应地管理长期上下文。具体来说,EvoDS引入了两个关键策略:(1)自主技能获取(ASA)机制,使智能体能够合成、验证和重用可执行技能;(2)自适应上下文压缩(ACC)策略,将上下文管理视为一个学习控制问题而非被动截断。这些策略在一个两阶段多智能体训练方案中协调,使EvoDS能够随时间自主改进。理论上,我们证明了EvoDS的分层设计减少了工具选择错误,其优化目标与信息瓶颈原理一致,确保了高效的上下文使用。实验上,EvoDS在四个不同基准测试中平均优于最先进的开源数据科学智能体28.9%,同时消除了超出令牌限制的失败。我们的代码和数据可在该网址获取。

英文摘要

Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.

2606.03839 2026-06-03 cs.LG

Text-attributed Graph Condensation via Text Selection and Attribute Matching

通过文本选择和属性匹配的文本属性图压缩

Haowei Han, Yuxiang Wang, Guojia Wan, Hao Wang, Shanshan Feng, Hao Huang, Jiawei Jiang, Xiao Yan

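Affiliations: School of Computer Science, Wuhan University; Institute for Math & AI, Wuhan University

AI summary: Proposes TAGSAM, which condenses text-attributed graphs through subgraph text selection and attribute similarity matching, substantially reducing space and time consumption while preserving training accuracy.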

AI中文摘要

文本属性图(TAG)是一种重要的图结构数据,其中每个节点都有文本描述。TAG模型通常联合训练图神经网络(GNN)和语言模型,导致高空间和时间消耗,尤其是在大型数据集上。为了缓解这一问题,我们提出了TAGSAM,一种在保持训练精度的同时压缩TAG的压缩方法。TAGSAM有两个关键设计,即子图文本选择和属性相似性匹配,分别压缩TAG的文本描述和图拓扑。对于文本,子图文本选择通过最大化互信息从多个相关文本描述中选择并合并代表性文本块。对于图拓扑,基于匹配训练轨迹(MTT)的流行压缩方法存在高方差,阻碍了精度。我们的属性相似性匹配通过对齐稳定的相似性矩阵来缓解这一问题。我们评估了TAGSAM与六个最先进的基线方法,结果显示其优越性能。在相同压缩大小下,TAGSAM在精度上平均比最佳基线提高4.9%。此外,即使将TAG压缩到仅1%的大小,它仍能保持有竞争力的训练精度。我们的代码可在以下网址获取:this https URL

英文摘要

A Text-Attributed Graph (TAG) is an important type of graph-structured data in which each node has a text description. TAG models usually train a Graph Neural Network (GNN) and a language model jointly, which leads to high space and time consumption, especially on large datasets. To mitigate this, we propose TAGSAM, a condensation method that compresses TAGs while preserving training accuracy. TAGSAM comes with two key designs, i.e., subgraph text Selection and Attribute similarity Matching, which compress the text descriptions and graph topology of a TAG, respectively. For the texts, subgraph text selection selects and merges representative text chunks from multiple related text descriptions by maximizing mutual information. For the graph topology, popular condensation methods based on Matching Training Trajectories (MTT) suffer from high variance, which hinders accuracy. Our attribute similarity matching mitigates this issue by aligning stable similarity matrices. We evaluate TAGSAM against six state-of-the-art baselines, where it shows superior performance. For the same compressed size, TAGSAM improves upon the best-performing baseline by an average of 4.9% in accuracy. Furthermore, it maintains competitive training accuracy even when the TAG is condensed to just 1% of its original size. Our code is available at https://github.com/SundayVHan/TAGSAM

2606.03837 2026-06-03 cs.CV

Where Do We (Not) Need Temporal Context in Low-Resource Video Task Adaptation?

在低资源视频任务适应中,我们(不)需要时间上下文的哪些部分?

Luc P. J. Sträter, Hazel Doughty

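Affiliations: Leiden University

AI summary: Systematically studies adaptation strategies for video understanding, evaluating parameter-efficient fine-tuning and probing across settings to examine how temporal context should be distributed across backbone, PEFT, and probe.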

AI中文摘要

参数高效微调(PEFT)和探测使得仅使用少量可训练参数就能适应基础模型,这对于标注和计算成本高昂的视频理解具有吸引力。然而,视频PEFT主要集中于适应图像预训练模型,而标准PEFT方法也可应用于视频表示。这些设置很少被比较,并且都将时间推理限制在模型的单个组件中,从而留下了时间上下文应如何在骨干网络、PEFT和探测之间分布的问题。在这项工作中,我们提供了视频理解中模型适应策略的系统研究。我们在外观聚焦、运动聚焦和空间密集设置中评估了方法,特别关注数据有限且参数效率最有利的场景。我们的结果为跨设置的PEFT和探测提供了新的见解,并证明了时间上下文分配对于有效视频适应的重要性。

英文摘要

Parameter-efficient fine-tuning (PEFT) and probing enable adaptation of foundation models using only a small number of trainable parameters, making them attractive for video understanding, where annotation and computation are expensive. However, video PEFT has focused on adapting image-pretrained models, while standard PEFT methods can also be applied to video representations. These settings are rarely compared, and both confine temporal reasoning to a single component of the model, leaving open how temporal context should be distributed across backbone, PEFT, and probe. In this work, we provide a systematic study of model adaptation strategies for video understanding. We evaluate methods across appearance-focused, motion-focused, and spatially dense settings, with a particular focus on scenarios with limited data where parameter efficiency is most beneficial. Our results provide new insights into PEFT and probing across settings and demonstrate the importance of temporal context allocation for effective video adaptation.

2606.03834 2026-06-03 cs.RO

Let the Dynamics Flow: Stable Flow Matching Dynamical Systems

让动力学流动:稳定的流匹配动力系统

Rodrigo Pérez-Dattari, Francisco Leiva, Andrea Testa, Leonel Rozo, Javier Ruiz del Solar, Noémie Jaquier

Affiliations: Department of Robotics, Perception, and Learning, KTH Royal Institute of Technology; Advanced Mining Technology Center (AMTC) and Department of Electrical Engineering, Universidad de Chile; Bosch Center for Artificial Intelligence, Renningen, Germany; Italian Institute of Artificial Intelligence (AI4I), Turin, Italy

AI summary: Proposes the Stable Flow Matching Dynamical Systems (SFMDS) framework, which parametrizes dynamical systems via flow matching under Lyapunov stability constraints for stable, scalable, multimodal robot motion generation.

AI中文摘要

流匹配最近已成为模仿学习的一种强大方法,能够实现可扩展、表达力强且多模态的运动策略。然而,将这些生成模型纳入形式化的稳定性保证(确保机器人行为安全和可泛化的前提)仍然是一个重大挑战。虽然将机器人运动建模为动力系统允许这种基于稳定性的归纳偏置,但现有框架难以捕捉复杂机器人任务中固有的丰富动作分布。本文介绍了稳定流匹配动力系统(SFMDS),这是一个弥合高容量生成模型与形式化李雅普诺夫稳定性保证之间差距的新框架。SFMDS通过流匹配参数化动力系统,同时将模型约束到稳定解族。我们提出了两种变体:基于惩罚项的软约束,以及直接嵌入模型架构的硬结构约束。我们还将两种公式扩展到李群。在基准数据集、仿真和类人机器人上的实验表明,SFMDS在低维和高维状态空间中学习稳定、可扩展和多模态的动力系统,从而实现安全且富有表现力的机器人运动生成。

英文摘要

Flow matching has recently emerged as a powerful approach for imitation learning, enabling scalable, expressive, and multimodal motion policies. However, incorporating formal stability guarantees into these generative models, a prerequisite to ensure safe and generalizable robot behaviors, remains a significant challenge. While modeling robot motions as dynamical systems allows for such stability-based inductive biases, existing frameworks struggle to capture the rich action distributions inherent in complex robotic tasks. This paper introduces Stable Flow Matching Dynamical Systems (SFMDS), a novel framework that bridges the gap between high-capacity generative modeling and formal Lyapunov stability guarantees. SFMDS parametrizes dynamical systems via flow matching while simultaneously constraining the model to a family of stable solutions. We propose two variants: a soft constraint based on a penalty term, and a hard structural constraint embedded directly in the model architecture. We further extend both formulations to Lie groups. Experiments on benchmark datasets, in simulation, and on a humanoid robot show that SFMDS learns stable, scalable, and multimodal dynamical systems in low- and high-dimensional state spaces, enabling safe and expressive robot motion generation.
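As a rough illustration of what a hard Lyapunov constraint on a learned vector field can look like, the sketch below projects a velocity so that a quadratic Lyapunov function decreases. This is a generic projection technique for stable learned dynamics, swapped in for illustration; it is not claimed to be the SFMDS architecture, and the choice $V(x) = \tfrac{1}{2}\lVert x\rVert^2$ with decay rate `alpha` is an assumption.

```python
def stabilize(f, x, alpha=1.0):
    """Project velocity f at state x so that V(x) = 0.5 * ||x||^2
    satisfies dV/dt = grad V . f <= -alpha * V(x)."""
    grad = x                                  # gradient of V at x
    g2 = sum(g * g for g in grad)
    if g2 == 0.0:
        return [0.0] * len(x)                 # equilibrium at the origin
    v = 0.5 * g2
    violation = sum(g * fi for g, fi in zip(grad, f)) + alpha * v
    if violation <= 0.0:
        return list(f)                        # already decreasing fast enough
    # Remove just enough of the component along grad V to meet the constraint.
    return [fi - violation * g / g2 for fi, g in zip(f, grad)]

x = [1.0, 0.0]
print(stabilize([1.0, 1.0], x))    # pushed back toward the origin along x
print(stabilize([-2.0, 0.3], x))   # unchanged: constraint already satisfied
```

The projection only alters the component of the velocity along the gradient of $V$, so motion tangent to the level sets of $V$ is left untouched, which is why such constraints can coexist with expressive learned dynamics.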

2606.03831 2026-06-03 cs.LG stat.ML

Online Learning with Gradient-Variation Interval Regret

基于梯度变化的区间遗憾在线学习

Yan-Feng Xie, Shuche Wang, Peng Zhao, Zhi-Hua Zhou

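Affiliations: State Key Laboratory for Novel Software Technology and the School of Artificial Intelligence, Nanjing University; Institute of Operations Research and Analytics, National University of Singapore

AI summary: Proposes the first online learning algorithm whose interval regret bound scales with gradient variation, using a two-layer online ensemble that adapts to multiple problem-dependent quantities while attaining the minimax-optimal rate, and introduces a Lipschitz- and smoothness-agnostic variant.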

AI中文摘要

本文研究使用区间遗憾度量的非平稳在线学习,该度量要求在线算法在每个时间区间内表现良好。我们提出了第一个在线学习算法,其区间遗憾界随梯度变化缩放,梯度变化是衡量在线函数梯度累积变化的基本度量,与多种问题相关量有关,并与随机优化等问题紧密相连。我们的方法采用简单高效的两层在线集成结构,实现了强大的理论保证。具体来说,它享有同时自适应多种问题相关量的遗憾界,同时在最坏情况下保持极小化最优率。此外,认识到超参数调优的挑战,我们引入了一种Lipschitz和平滑性无关的变体,自动适应这些可能未知的常数。这主要得益于一种新颖的Lipschitz自适应元算法,该算法可能具有独立的意义。除了区间遗憾,我们的方法还产生了更广泛的影响:它为区间动态遗憾(一种更强的度量,与任何区间上的变化比较器竞争)提供了通用的界,并首次为随机扩展对抗优化提供了分段刻画。理论发现通过实验得到验证。

英文摘要

This paper investigates non-stationary online learning using the metric of interval regret, which requires an online algorithm to perform well over every time interval. We propose the first online learning algorithm that achieves an interval regret bound scaling with gradient variation, a fundamental measure of the cumulative change in online function gradients, which relates to various problem-dependent quantities and is closely connected to stochastic optimization and other problems. Our method employs a simple and efficient two-layer online ensemble structure that achieves strong theoretical guarantees. Specifically, it enjoys a regret bound that simultaneously adapts to various problem-dependent quantities while also preserving the minimax-optimal rate in the worst case. Moreover, recognizing the challenge of hyperparameter tuning, we introduce a Lipschitz- and smoothness-agnostic variant that automatically adapts to these potentially unknown constants. This is primarily enabled by a novel Lipschitz-adaptive meta algorithm, which may be of independent interest. Beyond interval regret, our method also yields broader implications: it provides versatile bounds for interval dynamic regret, a stronger measure that competes with changing comparators over any interval, and yields the first piecewise characterization for stochastic extended adversarial optimization. Theoretical findings are validated by experiments.
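The two quantities in the abstract can be made concrete on a toy problem. The sketch assumes scalar linear losses $f_t(x) = g_t x$ and a two-point comparator set, so that the gradient variation reduces to a sum of squared differences of consecutive $g_t$; these simplifications and the function names are illustrative assumptions, not the paper's setting.

```python
def gradient_variation(grads):
    """Sum of squared changes between consecutive gradients (linear losses)."""
    return sum((g - p) ** 2 for p, g in zip(grads, grads[1:]))

def worst_interval_regret(grads, plays, comparators=(-1.0, 1.0)):
    """Largest regret over any contiguous interval, each interval being
    compared against its own best fixed comparator."""
    T = len(grads)
    worst = 0.0
    for s in range(T):
        for e in range(s, T):
            learner = sum(g * x for g, x in zip(grads[s:e + 1], plays[s:e + 1]))
            best = min(sum(g * u for g in grads[s:e + 1]) for u in comparators)
            worst = max(worst, learner - best)
    return worst

grads = [1.0, 1.0, -1.0, -1.0]     # the loss direction flips halfway
plays = [-1.0, -1.0, -1.0, -1.0]   # a learner that never adapts
print(gradient_variation(grads))
print(worst_interval_regret(grads, plays))
```

In this example the learner has zero regret over the whole horizon yet a large regret on the last two rounds, which is why interval regret is a stricter requirement than standard regret in non-stationary environments.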