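arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.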

2408.06747 2026-05-11 cs.CV

ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Jingyun Wang, Guoliang Kang

Affiliation: Beihang University

AI summary: This paper proposes ReCLIP++, which explicitly models and rectifies the bias in CLIP to improve unsupervised semantic segmentation. A learnable Reference prompt encodes class-preference bias and a projection of the positional embedding encodes space-preference bias. A matrix multiplication of the two features produces a bias logit map, which is subtracted element-wise from the CLIP logits, and a Gumbel-Softmax operation is then used to generate the segmentation mask.

Comments: Extended version of our CVPR 24 paper, accepted by IJCV 2025

DOI: 10.1007/s11263-025-02566-5

Abstract

Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However, we observe that when adopting CLIP to such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works don't explicitly model the bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias respectively. To avoid interference, two kinds of biases are firstly independently encoded into different features, i.e., the Reference feature and the positional feature. Via a matrix multiplication between the Reference feature and the positional feature, a bias logit map is generated to explicitly represent two kinds of biases. Then we rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder which takes the feature of CLIP and the rectified logits as input and outputs a rectified segmentation mask with the help of Gumbel-Softmax operation. A contrastive loss based on the masked visual features and the text features of different classes is imposed, which makes the bias modeling and rectification process meaningful and effective. Extensive experiments on various benchmarks including PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff demonstrate that our method performs favorably against previous state-of-the-arts. The implementation is available at: https://github.com/dogehhh/ReCLIP.
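The rectification steps named in the abstract (bias logit map by matrix multiplication, element-wise subtraction, Gumbel-Softmax) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, and the temperature `tau` are assumptions made for illustration, and the sketch applies Gumbel-Softmax directly to the rectified logits, whereas the paper routes them through a mask decoder together with CLIP features.

```python
import numpy as np

def rectify_logits(clip_logits, reference_feat, positional_feat):
    """Subtract a bias logit map from CLIP logits.

    clip_logits:     (C, N) class-by-patch logits (assumed layout)
    reference_feat:  (C, D) feature encoding class-preference bias
    positional_feat: (N, D) feature encoding space-preference bias
    """
    bias_logits = reference_feat @ positional_feat.T  # (C, N) bias logit map
    return clip_logits - bias_logits                  # element-wise subtraction

def gumbel_softmax(logits, tau=1.0, rng=None, axis=0):
    """Soft, differentiable-style sample over classes using Gumbel noise."""
    rng = np.random.default_rng() if rng is None else rng
    u = np.clip(rng.uniform(size=logits.shape), 1e-9, 1.0 - 1e-9)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=axis, keepdims=True)           # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=axis, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, N, D = 3, 4, 5                                 # hypothetical sizes
    logits = rng.normal(size=(C, N))
    ref = rng.normal(size=(C, D))
    pos = rng.normal(size=(N, D))
    mask = gumbel_softmax(rectify_logits(logits, ref, pos), tau=0.5, rng=rng)
    print(mask.argmax(axis=0))                        # per-patch class index
```

Each column of `mask` sums to one, so it can be read as a soft assignment of one patch to the classes.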

2407.15134 2026-05-11 cs.LG cs.AI

Proximal Policy Distillation

Giacomo Spigler

Affiliations: AI for Robotics Lab (AIR-Lab); Department of Cognitive Science and Artificial Intelligence; Tilburg University

AI summary: This paper proposes Proximal Policy Distillation (PPD), which combines student-driven distillation with Proximal Policy Optimization to improve sample efficiency and to exploit the additional rewards the student policy collects during distillation. Its effectiveness is evaluated across a range of reinforcement learning environments.

Journal ref: Transactions on Machine Learning Research, ISSN 2835-8856 (2025)

Abstract

We introduce Proximal Policy Distillation (PPD), a novel policy distillation method that integrates student-driven distillation and Proximal Policy Optimization (PPO) to increase sample efficiency and to leverage the additional rewards that the student policy collects during distillation. To assess the efficacy of our method, we compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (ATARI, Mujoco, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD demonstrates greater robustness than alternative methods when distilling policies from imperfect demonstrations. The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: 'sb3-distill'.

2406.13724 2026-05-11 cs.AI

Heterogeneous Graph Neural Networks with Post-hoc Explanations for Multi-modal and Explainable Land Use Inference

Xuehao Zhai, Junqi Jiang, Adam Dejl, Antonio Rago, Fangce Guo, Francesca Toni, Aruna Sivakumar

Affiliations: Imperial College London, Department of Civil and Environmental Engineering; Imperial College London, Department of Computing

AI summary: This paper proposes a method that combines heterogeneous graph neural networks with explainable AI for multi-modal land use inference, improving both accuracy and explainability, with particularly strong results on the 'office' and 'sustenance' indicators.

Journal ref: Information Fusion, Volume 120, 103057. 2025

DOI: 10.1016/j.inffus.2025.103057

Abstract

Urban land use inference is a critically important task that aids in city planning and policy-making. Recently, the increased use of sensor and location technologies has facilitated the collection of multi-modal mobility data, offering valuable insights into daily activity patterns. Many studies have adopted advanced data-driven techniques to explore the potential of these multi-modal mobility data in land use inference. However, existing studies often process samples independently, ignoring the spatial correlations among neighbouring objects and heterogeneity among different services. Furthermore, the inherently low interpretability of complex deep learning methods poses a significant barrier in urban planning, where transparency and extrapolability are crucial for making long-term policy decisions. To overcome these challenges, we introduce an explainable framework for inferring land use that synergises heterogeneous graph neural networks (HGNs) with Explainable AI techniques, enhancing both accuracy and explainability. The empirical experiments demonstrate that the proposed HGNs significantly outperform baseline graph neural networks for all six land-use indicators, especially in terms of 'office' and 'sustenance'. As explanations, we consider feature attribution and counterfactual explanations. The analysis of feature attribution explanations shows that the symmetrical nature of the 'residence' and 'work' categories predicted by the framework aligns well with the commuter's 'work' and 'recreation' activities in London. The analysis of the counterfactual explanations reveals that variations in node features and types are primarily responsible for the differences observed between the predicted land use distribution and the ideal mixed state. These analyses demonstrate that the proposed HGNs can suitably support urban stakeholders in their urban planning and policy-making.

2403.18149 2026-05-11 cs.RO cs.SY eess.SY math.OC

Code Generation and Conic Constraints for Model-Predictive Control on Microcontrollers with Conic-TinyMPC

Ishaan Mahajan, Khai Nguyen, Sam Schoedel, Elakhya Nedumaran, Moises Mata, Brian Plancher, Zachary Manchester

Affiliations: School of Engineering and Applied Science, Columbia University; Carnegie Mellon University; Massachusetts Institute of Technology; Barnard College, Columbia University and Dartmouth College

AI summary: This paper presents a model-predictive control solver for microcontrollers that adds support for second-order cone constraints and automatic C++ code generation, substantially improving solve speed and the problem sizes that fit on embedded systems.

Comments: Accepted to ICRA 2026. 4 Figures. 2 Tables. First three authors contributed equally

Abstract

Model-predictive control (MPC) is a state-of-the-art control method for constrained robotic systems, yet deployment on resource-limited hardware remains difficult. This challenge is magnified by expressive conic constraints, which offer greater modeling power but require significantly more computation than linear alternatives. To address this challenge, we extend recent work developing fast, structure-exploiting, cached solvers for embedded applications based on the Alternating Direction Method of Multipliers (ADMM) to provide support for second-order cones, as well as C++ code generation from Python, MATLAB, and Julia. Microcontroller benchmarks show that our solver provides up to a two-order-of-magnitude speedup, ranging from 10.6x to 142.7x, over state-of-the-art embedded solvers on QP and SOCP problems, and enables us to fit order-of-magnitude larger problems in memory. We validate our solver's deployed performance through simulation and hardware experiments, including trajectory tracking with conic constraints on a 27g Crazyflie quadrotor. Our open-source code is available at https://tinympc.org.
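ADMM-style conic solvers commonly enforce a second-order cone constraint by projecting a slack variable onto the cone at every iteration. The sketch below shows that standard closed-form projection in NumPy, purely as an illustration of the operation. The abstract does not state how Conic-TinyMPC implements this step, and the function name `project_soc` and the `(v, s)` split are hypothetical.

```python
import numpy as np

def project_soc(v, s):
    """Euclidean projection of (v, s) onto the cone {(x, t): ||x||_2 <= t}."""
    v = np.asarray(v, dtype=float)
    norm_v = np.linalg.norm(v)
    if norm_v <= s:                 # already inside the cone
        return v.copy(), float(s)
    if norm_v <= -s:                # inside the polar cone: project to the apex
        return np.zeros_like(v), 0.0
    scale = 0.5 * (1.0 + s / norm_v)
    return scale * v, scale * norm_v  # lands on the cone boundary

if __name__ == "__main__":
    x, t = project_soc([3.0, 4.0], 0.0)
    print(x, t)                     # boundary point with ||x|| == t
```

The projection costs one norm and a scaling, which is part of why cone constraints remain tractable on small processors when the rest of the iteration is cheap.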

2310.07379 2026-05-11 cs.CV cs.AI cs.LG

Causal Unsupervised Semantic Segmentation

Junho Kim, Byung-Kwan Lee, Yong Man Ro

Affiliations: School of Electrical Engineering; Korea Advanced Institute of Science and Technology

AI summary: This paper proposes the CAUSE framework, which uses causal inference to address the problem of choosing the appropriate clustering level for concepts in unsupervised semantic segmentation, performs pixel-level grouping, and reports state-of-the-art performance on multiple datasets.

Comments: code available: https://github.com/ByungKwanLee/Causal-Unsupervised-Segmentation

Journal ref: Pattern Recognition, Volume 171, Part B, 112173 (2026)

DOI: 10.1016/j.patcog.2025.112173

Abstract

Unsupervised semantic segmentation aims to achieve high-quality semantic grouping without human-labeled annotations. With the advent of self-supervised pre-training, various frameworks utilize the pre-trained features to train prediction heads for unsupervised dense prediction. However, a significant challenge in this unsupervised setup is determining the appropriate level of clustering required for segmenting concepts. To address it, we propose a novel framework, CAusal Unsupervised Semantic sEgmentation (CAUSE), which leverages insights from causal inference. Specifically, we bridge intervention-oriented approach (i.e., frontdoor adjustment) to define suitable two-step tasks for unsupervised prediction. The first step involves constructing a concept clusterbook as a mediator, which represents possible concept prototypes at different levels of granularity in a discretized form. Then, the mediator establishes an explicit link to the subsequent concept-wise self-supervised learning for pixel-level grouping. Through extensive experiments and analyses on various datasets, we corroborate the effectiveness of CAUSE and achieve state-of-the-art performance in unsupervised semantic segmentation.

2305.01429 2026-05-11 cs.LG stat.ML

Unsupervised Feature Based Algorithms for Time Series Extrinsic Regression

David Guijo-Rubio, Matthew Middlehurst, Guilherme Arcencio, Diego Furtado Silva, Anthony Bagnall

Affiliations: Department of Computer Science and Numerical Analysis, University of Cordoba; School of Computing Sciences, University of East Anglia

AI summary: This paper expands the dataset archive for time series extrinsic regression and introduces two new algorithms, FreshPRINCE and DrCIF, showing that they significantly outperform the other regressors, including the standard rotation forest regressor.

Comments: 19 pages, 21 figures, 6 tables. Appendix included

Journal ref: Data Mining and Knowledge Discovery, Volume 38, pages 2141-2185, (2024)

DOI: 10.1007/s10618-024-01027-w

Abstract

Time Series Extrinsic Regression (TSER) involves using a set of training time series to form a predictive model of a continuous response variable that is not directly related to the regressor series. The TSER archive for comparing algorithms was released in 2022 with 19 problems. We increase the size of this archive to 63 problems and reproduce the previous comparison of baseline algorithms. We then extend the comparison to include a wider range of standard regressors and the latest versions of TSER models used in the previous study. We show that none of the previously evaluated regressors can outperform a regression adaptation of a standard classifier, rotation forest. We introduce two new TSER algorithms developed from related work in time series classification. FreshPRINCE is a pipeline estimator consisting of a transform into a wide range of summary features followed by a rotation forest regressor. DrCIF is a tree ensemble that creates features from summary statistics over random intervals. Our study demonstrates that both algorithms, along with InceptionTime, exhibit significantly better performance compared to the other 18 regressors tested. More importantly, these two proposals (DrCIF and FreshPRINCE) are the only models that significantly outperform the standard rotation forest regressor.
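The idea of "features from summary statistics over random intervals" can be sketched in a few lines of NumPy. This is only an illustration of the interval-feature idea, not DrCIF itself: the published algorithm is more elaborate, and the choice of three statistics (mean, standard deviation, slope), the function name, and the parameters below are assumptions.

```python
import numpy as np

def random_interval_features(X, n_intervals=4, min_len=3, rng=None):
    """Map series X of shape (n_cases, length) to interval summary features.

    Returns an array of shape (n_cases, 3 * n_intervals) holding the
    mean, standard deviation, and least-squares slope of each interval.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, length = X.shape
    feats = []
    for _ in range(n_intervals):
        start = rng.integers(0, length - min_len + 1)
        end = rng.integers(start + min_len, length + 1)   # exclusive end
        seg = X[:, start:end]
        t = np.arange(seg.shape[1]) - (seg.shape[1] - 1) / 2.0  # centred time
        slope = (seg * t).sum(axis=1) / (t ** 2).sum()
        feats += [seg.mean(axis=1), seg.std(axis=1), slope]
    return np.column_stack(feats)

if __name__ == "__main__":
    X = np.array([2.0 * np.arange(20) + i for i in range(5)])
    F = random_interval_features(X, rng=np.random.default_rng(0))
    print(F.shape)
```

The resulting feature matrix can be passed to any tabular regressor. In an ensemble, each tree would typically draw its own set of intervals.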

2304.13029 2026-05-11 cs.LG

Bake off redux: a review and experimental evaluation of recent time series classification algorithms

Matthew Middlehurst, Patrick Schäfer, Anthony Bagnall

Affiliation: Humboldt-Universität zu Berlin

AI summary: This paper revisits the 2017 time series classification bake off and evaluates recent algorithms on the expanded UCR archive, finding that Hydra+MultiROCKET and HIVE-COTEv2 perform best.

DOI: 10.1007/s10618-024-01022-1

Abstract

In 2017, a research paper compared 18 Time Series Classification (TSC) algorithms on 85 datasets from the University of California, Riverside (UCR) archive. This study, commonly referred to as a 'bake off', identified that only nine algorithms performed significantly better than the Dynamic Time Warping (DTW) and Rotation Forest benchmarks that were used. The study categorised each algorithm by the type of feature they extract from time series data, forming a taxonomy of five main algorithm types. This categorisation of algorithms alongside the provision of code and accessible results for reproducibility has helped fuel an increase in popularity of the TSC field. Over six years have passed since this bake off; the UCR archive has expanded to 112 datasets and a large number of new algorithms have been proposed. We revisit the bake off, seeing how each of the proposed categories has advanced since the original publication, and evaluate the performance of newer algorithms against the previous best-of-category using an expanded UCR archive. We extend the taxonomy to include three new categories to reflect recent developments. Alongside the originally proposed distance, interval, shapelet, dictionary and hybrid based algorithms, we compare newer convolution and feature based algorithms as well as deep learning approaches. We introduce 30 classification datasets either recently donated to the archive or reformatted to the TSC format, and use these to further evaluate the best performing algorithm from each category. Overall, we find that two recently proposed algorithms, Hydra+MultiROCKET and HIVE-COTEv2, perform significantly better than other approaches on both the current and new TSC problems.

2104.07551 2026-05-11 cs.LG

HIVE-COTE 2.0: a new meta ensemble for time series classification

Matthew Middlehurst, James Large, Michael Flynn, Jason Lines, Aaron Bostrom, Anthony Bagnall

Affiliation: School of Computing Sciences, University of East Anglia, Norwich, UK

AI summary: HIVE-COTE 2.0 improves the accuracy and usability of time series classification by introducing two new classifiers and the Arsenal ensemble, and is shown to outperform the state of the art on 112 univariate UCR datasets and 26 multivariate UEA datasets.

DOI: 10.1007/s10994-021-06057-9

Abstract

The Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) is a heterogeneous meta ensemble for time series classification. HIVE-COTE forms its ensemble from classifiers of multiple domains, including phase-independent shapelets, bag-of-words based dictionaries and phase-dependent intervals. Since it was first proposed in 2016, the algorithm has remained state of the art for accuracy on the UCR time series classification archive. Over time it has been incrementally updated, culminating in its current state, HIVE-COTE 1.0. During this time a number of algorithms have been proposed which match the accuracy of HIVE-COTE. We propose comprehensive changes to the HIVE-COTE algorithm which significantly improve its accuracy and usability, presenting this upgrade as HIVE-COTE 2.0. We introduce two novel classifiers, the Temporal Dictionary Ensemble (TDE) and Diverse Representation Canonical Interval Forest (DrCIF), which replace existing ensemble members. Additionally, we introduce the Arsenal, an ensemble of ROCKET classifiers as a new HIVE-COTE 2.0 constituent. We demonstrate that HIVE-COTE 2.0 is significantly more accurate than the current state of the art on 112 univariate UCR archive datasets and 26 multivariate UEA archive datasets.

2605.07514 2026-05-11 cs.RO cs.CV

Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo, Hong-Han Shuai

Affiliation: National Yang Ming Chiao Tung University

AI summary: This paper studies action-state consistency as a reliability indicator for World Action Models and proposes a consensus strategy that improves planning without additional training.

Comments: Technical Report

Abstract

World Action Models (WAMs) enable decision-making through imagined rollouts by predicting future observations and actions. However, the reliability of these imagined futures remains under-examined: is a generated future merely visually plausible, or is it dynamically compatible with the action sequence it claims to model? In this work, we identify action-state consistency, the alignment between predicted actions and induced state transitions, as a missing reliability axis for WAMs. Through a systematic study across representative joint-prediction and inverse-dynamics models, we find that action-state consistency systematically separates successful and failed rollouts across many tasks and follows similar success-failure trends as learned value estimates. These results suggest that consistency captures decision-relevant structure beyond visual realism. We further identify background collapse as an important boundary condition, where low-dynamics failed trajectories can become deceptively consistent because static futures are easier to predict. Building on these findings, we introduce a value-free consensus strategy for test-time selection, which ranks candidate rollouts by agreement among predicted futures. This strategy improves success rates on RoboCasa and RoboTwin 2.0 without additional training or reward modeling. Taken together, our findings establish action-state consistency as both a diagnostic tool for evaluating WAM reliability and a practical signal for value-free planning.
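A test-time selection rule that "ranks candidate rollouts by agreement among predicted futures" could look roughly like the following. The abstract does not specify the agreement measure, so the mean squared distance between flattened futures, the array layout, and the function name are all assumptions made for illustration.

```python
import numpy as np

def rank_by_consensus(predicted_futures):
    """Rank K candidate rollouts by how closely each agrees with the others.

    predicted_futures: array of shape (K, T, D), one predicted future of
    T steps and D features per candidate (hypothetical layout).
    Returns (order, mean_dist): candidate indices from most to least
    consistent with the group, and each candidate's mean distance.
    """
    futures = np.asarray(predicted_futures, dtype=float)
    k = futures.shape[0]
    flat = futures.reshape(k, -1)
    diffs = flat[:, None, :] - flat[None, :, :]
    dist = (diffs ** 2).mean(axis=2)              # pairwise mean squared distance
    mean_dist = dist.sum(axis=1) / max(k - 1, 1)  # average over the other candidates
    return np.argsort(mean_dist), mean_dist

if __name__ == "__main__":
    base = np.zeros((4, 2))
    candidates = np.stack([base, base + 0.1, base - 0.1, base + 5.0])
    order, mean_dist = rank_by_consensus(candidates)
    print(order[-1])  # index of the candidate that disagrees most with the rest
```

No reward model or value function is involved: the ranking depends only on the predictions themselves, which matches the "value-free" framing in the abstract.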

2605.07513 2026-05-11 cs.LG

Tessellations of Semi-Discrete Flow Matching

Emile Pierret, Johannes Hertrich, Samuel Hurault, Julie Delon

Affiliations: DMA, ENS-PSL; MAP5; LIGM, Univ. Gustave Eiffel, CNRS

AI summary: This paper studies Flow Matching in the semi-discrete setting, analyzes the geometry of the terminal assignment regions, and clarifies the geometric structure intrinsically induced by the exact semi-discrete Flow Matching objective.

Abstract

We study Flow Matching in a semi-discrete setting where a Gaussian source is transported toward a discrete target supported on finitely many points. This semi-discrete regime is the theoretical setting behind the use of Flow Matching for generative modeling, where the target distribution is represented by a finite dataset. In this semi-discrete regime, the exact Flow Matching velocity field is available in closed form, which makes it possible to analyze the geometry induced by the terminal flow map independently of optimization and approximation effects. We investigate the terminal assignment regions, namely the preimages of the target atoms under the terminal flow. We show that these regions are open, simply connected and, under an additional assumption, homeomorphic to the unit ball. At the same time, a planar four-point example shows that these cells can differ sharply from Laguerre cells arising in semi-discrete optimal transport: they may be non-convex, have curved boundaries, and exhibit different boundedness and adjacency patterns. These results clarify the geometry intrinsically induced by the exact semi-discrete Flow Matching objective before neural approximation enters the picture.
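To make the "closed form" velocity field concrete, here is a minimal sketch under one common convention, which is an assumption and may differ from the paper's: a standard Gaussian source, the linear interpolant x_t = (1 - t) x_0 + t x_1, and an independent coupling with the target atoms. Under that convention the conditional path given atom y_i is N(t y_i, (1 - t)^2 I), and the marginal velocity is the posterior-weighted average of (y_i - x) / (1 - t).

```python
import numpy as np

def exact_velocity(x, t, atoms, weights=None):
    """Marginal Flow Matching velocity at point x and time t in [0, 1).

    atoms:   (n, d) support points of the discrete target
    weights: (n,) target probabilities (uniform if None)
    """
    atoms = np.asarray(atoms, dtype=float)
    x = np.asarray(x, dtype=float)
    n = atoms.shape[0]
    log_w = np.zeros(n) if weights is None else np.log(np.asarray(weights, dtype=float))
    sq = ((x - t * atoms) ** 2).sum(axis=1)
    log_post = log_w - sq / (2.0 * (1.0 - t) ** 2)  # log posterior over atoms, up to a constant
    log_post -= log_post.max()                      # numerical stability
    post = np.exp(log_post)
    post /= post.sum()
    return (post @ atoms - x) / (1.0 - t)

if __name__ == "__main__":
    print(exact_velocity([0.5, 0.0], 0.5, [[1.0, 2.0]]))
```

Integrating this field from t = 0 towards t = 1 sends each starting point to one atom, and the set of starting points that reach a given atom is the terminal assignment region the abstract refers to.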

2605.07512 2026-05-11 cs.CV

Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models

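Mengxin Qin, Xiang Zhang, Kun Wei, Xu Yang, Cheng Deng

Affiliation: School of Electronic Engineering, Xidian University

AI summary: This paper proposes the HDSD framework, which uses parameter-space decomposition and structured parameter decomposition to reduce subspace interference and parameter drift, improving continual learning in vision-language models.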

2605.07510 2026-05-11 cs.CV cs.CL cs.IR

InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

Bohan Hou, Jiuning Gu, Jiayan Guo, Ronghao Dang, Sicong Leng, Xin Li, Xuemeng Song, Jianfei Yang

Affiliations: Nanyang Technological University; Shandong University; Damo Academy, Alibaba Group

AI summary: This paper introduces the InterLV-Search benchmark for evaluating interleaved language-vision agentic search. It spans three task levels (active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search) and exposes the difficulties current systems have with visual evidence seeking, search control, and multimodal evidence integration.

Abstract

Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench

2605.07507 2026-05-11 cs.CL cs.IR

TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature

Hanqing Zhao

Affiliation: Hebei University, College of Traditional Chinese Medicine

AI summary: This paper presents TCMIIES, a zero-installation, browser-based platform that uses commercial LLM APIs to extract structured information from academic literature. Users define custom extraction schemas through a graphical interface, the system supports Chinese academic databases, and structured-output compliance exceeds 94%.

Abstract

The exponential growth of academic publications has created an urgent need for automated tools capable of extracting structured knowledge from unstructured scientific texts. While large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and information extraction, existing solutions often require specialized infrastructure, programming expertise, or fine-tuned domain-specific models that create barriers for researchers in specialized fields. This paper presents TCMIIES, a browser-based, zero-installation platform that leverages commercial LLM APIs to perform structured information extraction from academic literature. The system employs a novel schema-guided prompting framework with automatic system prompt generation, enabling researchers to define custom extraction schemas through an intuitive graphical interface without any programming. TCMIIES features a pure front-end architecture that ensures data privacy by processing all information locally in the browser, supports five major LLM providers, implements concurrent batch processing with automatic retry mechanisms, and provides intelligent field mapping for Chinese academic databases including CNKI and Wanfang. We demonstrate the system's effectiveness through comprehensive evaluation across multiple extraction scenarios in Traditional Chinese Medicine research, achieving structured output compliance rates exceeding 94% and information extraction accuracy comparable to domain-expert annotation. The system represents a practical, accessible solution that bridges the gap between advanced LLM capabilities and domain-specific academic information extraction needs, particularly for researchers in specialized fields who require flexible, privacy-preserving, and cost-effective extraction tools.

2605.07505 2026-05-11 cs.AI cs.LG

LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

Yubin Wu, Zicheng Cai, Liping Ning, Hua Wang, Zhi Chen, Yaohua Tang, Hao Chen

Affiliation: Moore Threads AI

AI summary: This paper proposes a training approach that requires no supervised fine-tuning, using Guided On-policy Distillation and a Multi-solution Dual-level GRPO framework to improve the performance and exploration ability of small models on GUI tasks.

Abstract

Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements remain urgently needed. Traditional Supervised Fine-Tuning (SFT) for small-scale models often leads to overfitting, catastrophic forgetting and policy rigidity, and thus fails to fully address these challenges. In this work, we propose a novel SFT-free training paradigm that significantly enhances the performance of small-scale models. We first present the initial systematic integration of generalized knowledge distillation into the GUI agent domain via Guided On-policy Distillation. By incorporating oracle reference trajectories together with a dynamic retrieval mechanism, our method reduces hallucinations and mitigates the cognitive misalignment inherent in multi-solution GUI tasks. Building on this foundation, we further introduce a Multi-solution Dual-level GRPO framework that jointly aligns macro-level subtask planning with micro-level execution matching, thereby improving exploration in long-horizon GUI agent scenarios. In addition, we construct an automated data generation pipeline to synthesize GUI task trajectories with rich multi-solution annotations. Extensive experiments show that our method achieves state-of-the-art performance among lightweight models while remaining competitive with substantially larger-scale models across all benchmarks. Ablation studies further demonstrate that structured on-policy distillation and multi-solution dual-level exploration can fully unlock the capabilities of 2B/3B scale agents, surpassing the performance limits of conventional imitation learning.

2605.07503 2026-05-11 cs.CV

Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

Jingyuan Zhu, Biaolong Chen, Le Zhang, Aixi Zhang, Hao Jiang, Pipei Huang

Affiliation: Alibaba Group

AI summary: This paper proposes Diffusion-APO, which addresses the alignment of video diffusion models with human intent by synchronizing training noise with inference-time denoising paths. A unified RLHF framework enables flexible multi-stage preference alignment and improves visual quality and instruction following.

Abstract

Efficiently aligning large-scale video diffusion models with human intent requires a scalable and trajectory-aware pathway that bridges the inherent discrepancy between training noise distributions and practical inference trajectories. While existing paradigms such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) attempt to address this, they are often hindered by either reliance on bias-prone, complex reward models or suboptimal timestep sampling. In this paper, we propose Diffusion-APO (Aligned Preference Optimization), a trajectory-aware algorithm that resolves this misalignment by synchronizing training noise with inference-time denoising paths to maximize gradient signal efficacy. To translate this algorithmic innovation into a practical solution, we introduce a unified and modular RLHF framework that integrates online ranking, half-online anchoring, offline refinement, and distillation-aware drift correction. This framework enables flexible, multi-stage preference alignment across diverse data and computational constraints without relying on scalar-reward-based policy gradients. Through extensive experiments, we demonstrate that Diffusion-APO consistently outperforms standard baselines in visual quality and instruction following, while effectively preserving generative fidelity during model acceleration, providing a robust, end-to-end pathway for scalable video diffusion alignment.

2605.07499 2026-05-11 cs.CV

Cloud-top infrared observations reveal the four-dimensional precipitation structure

积云红外观测揭示四维降水结构

Tianchi Xu, Ziqiang Ma, Andrea Marinoni, Yuanpeng He, Xiaoqing Li, Chuanfeng Zhao, Kang He, Jintao Xu, Bohan Zhou, Wenbo Zhao, Haoshuang Chen, Tun Wang, Dongdong Wang, Yang Hong

Affiliations * Institute of Remote Sensing and Geographical Information Systems, School of Earth and Space Sciences, Peking University; Department of Computer Science and Technology, University of Cambridge; Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University; Innovation Center for Fengyun Meteorological Satellite, National Meteorological Centre, China Meteorological Administration; Department of Atmospheric and Oceanic Sciences, School of Physics, Peking University; Institute of Mountain Hazards and Environment, Chinese Academy of Sciences; School of Civil Engineering and Environmental Science, University of Oklahoma

AI summary: This paper proposes 4DPrecipNet, a physically constrained deep learning framework that reconstructs the vertical and temporal evolution of precipitation from geostationary cloud-top infrared data, revealing that sub-cloud precipitation is observable from cloud-top measurements.

AI中文摘要

准确的四维（4D）降水信息对于理解地球的能量和水循环至关重要，但目前在大尺度上仍缺乏观测数据。传统理论认为静止轨道红外观测主要感知积云顶部属性，对亚云降水灵敏度有限。本文显示积云顶部红外测量仍能编码足够的信息以恢复四维降水结构，揭示亚云过程的先前未被利用的可观察性。我们引入了物理约束的深度学习框架4DPrecipNet，其中以湿度优先的约束要求潜在表示恢复可降水量，使模型在热力学一致性上得到锚定。通过整合多通道红外辐射与这些约束和雷达推导的降水剖面，我们从静止轨道重建降水系统的垂直和时间演变。该框架捕捉到深对流结构及其演变，具有在大规模样本和独立雷达比较中稳健的表现。这些结果表明亚云降水在积云顶部红外观测中被物理编码，为持续全球降水结构监测开辟了新途径。

英文摘要

Accurate four-dimensional (4D) precipitation information is essential for understanding the Earth's energy and water cycles, yet remains observationally unresolved at global scales. Conventional theory holds that geostationary infrared observations primarily sense cloud-top properties, with limited sensitivity to sub-cloud precipitation. Here we show that cloud-top infrared measurements nevertheless encode sufficient information to recover the four-dimensional structure of precipitation, revealing a previously unexploited observability of sub-cloud processes. We introduce a physically constrained deep learning framework, 4DPrecipNet, in which a moisture-first constraint requires the latent representation to recover precipitable water vapour, anchoring the model in thermodynamic consistency. By integrating multi-channel infrared radiances with these constraints and radar-derived precipitation profiles, we reconstruct the vertical and temporal evolution of precipitation systems from geostationary orbit. The framework captures deep convective structures and their evolution, with robust performance across large samples and independent radar comparisons. These results demonstrate that sub-cloud precipitation is physically encoded in cloud-top infrared observations, establishing a new pathway for continuous global monitoring of precipitation structure.

2605.07495 2026-05-11 cs.CV

Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing

轻量级无配对智能手机ISP传输与语义伪配对

Yujin Cho, Flavien Armangeon, Yanhao Li

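Affiliations * Université Paris Saclay, ENS Paris Saclay, CNRS, Centre Borelli, France

AI summary: This paper proposes a lightweight method for unpaired smartphone ISP. It mitigates the lack of pairing by building semantic pseudo pairs with fused Gromov-Wasserstein (FGW) optimal transport, then trains a compact CNN for color rendering, improving PSNR, SSIM, and ΔE over the baseline.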

Comments 13 pages, 9 figures, CVPR Workshops 2026

AI中文摘要

无配对智能手机ISP是一个具有挑战性的问题，由于RAW和目标RGB图像之间缺乏场景和颜色对齐。许多现有方法要么需要配对数据，要么依赖对抗训练，这在无配对设置中可能不稳定。本文提出了一种简单有效的方法，用于NTIRE 2026学习智能手机ISP挑战中的无配对数据。我们的方法首先从训练补丁中重建更大的图像以恢复全局上下文。然后，我们使用DINOv2提取语义嵌入，并利用融合的Gromov-Wasserstein（FGW）最优传输在图像和补丁层面构建RAW和RGB图像之间的伪配对。这种语义匹配使我们能够部分缓解数据的无配对性，并构建这些伪输入-目标配对。基于这些伪配对，我们训练了一个仅含7000个参数的轻量级CNN进行颜色渲染。该网络设计紧凑，专注于颜色转换而非结构变化，这有助于减少伪影并提高训练稳定性。我们的挑战提交在最终隐藏测试集上实现了22.569 PSNR、0.675 SSIM和8.067 ΔE，显著优于基线，并在所有挑战提交中取得了第三好的SSIM和ΔE。我们的代码可在github.com/nuniniyujin/Unpaired-ISP获取。

英文摘要

Unpaired smartphone ISP is a challenging problem due to the lack of scene and color alignment between RAW and target RGB images. Many existing methods either require paired data or rely heavily on adversarial training, which can become unstable in the unpaired setting. In this work, we present a simple and effective approach developed for the NTIRE 2026 Learned Smartphone ISP Challenge with Unpaired Data. Our method first reconstructs larger images from training patches to recover global context. Then, we extract semantic embeddings with DINOv2, and use fused Gromov-Wasserstein (FGW) optimal transport to build pseudo pairs between RAW and RGB images at both image and patch levels. This semantic matching allows us to partially alleviate the unpairedness of the data and build these pseudo input-target pairs. Based on these pseudo pairs, we train a lightweight CNN with only 7K parameters for color rendering. The network is designed to be compact and focus on color transformation rather than structural change, which helps reduce artifacts and improve training stability. Our challenge submission achieves 22.569 PSNR, 0.675 SSIM, and 8.067 $ΔE$ on the final hidden test set, significantly improving over the baseline and achieving the 3rd best SSIM and $ΔE$ among all challenge entries. Our code is available at github.com/nuniniyujin/Unpaired-ISP .

2605.07494 2026-05-11 cs.CV

DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models

DIMoE-Adapters：动态专家进化用于视觉语言模型的持续学习

Mengxin Qin, Xiang Zhang, Xi Wang, Kun Wei, Xu Yang, Cheng Deng

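Affiliations * School of Electronic Engineering, Xidian University

AI summary: This paper proposes DIMoE-Adapters, a framework that uses dynamic expert evolution to balance stability and plasticity in continual learning, addressing large domain shifts in multi-domain task-incremental learning.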

AI中文摘要

持续学习使视觉语言模型能够在不重新训练的情况下积累知识并适应变化的任务。然而，在多领域任务增量学习中，大规模领域迁移加剧了稳定性与可塑性的矛盾。现有方法依赖固定架构和静态参数分配，限制了对新领域的适应并加剧灾难性遗忘。为了解决这些挑战，我们提出DIMoE-Adapters，一种动态增量混合专家适配框架，引入动态专家进化范式来平衡稳定性与可塑性。该范式通过两个协作组件：自校准专家进化（SCEE）和原型引导专家选择（PGES）实现。SCEE通过专家优化动态构建和进化稀疏专家池，提高可塑性并减少冗余容量。PGES根据SCEE构建的池控制专家利用，提高在之前遇到和未见任务上的稳定性。大量实验表明，DIMoE-Adapters在各种设置中均优于现有最先进方法。

英文摘要

Continual learning enables vision-language models to accumulate knowledge and adapt to evolving tasks without retraining from scratch. However, in multi-domain task-incremental learning, large domain shifts intensify the stability-plasticity dilemma. Most existing methods rely on fixed architectures with statically allocated parameters, which limits adaptation to new domains and aggravates catastrophic forgetting. To address these challenges, we propose DIMoE-Adapters, a Dynamic Incremental Mixture-of-Experts Adapters framework that introduces a dynamic expert evolution paradigm to balance stability and plasticity. This paradigm is implemented through two collaborative components: Self-Calibrated Expert Evolution (SCEE) and Prototype-Guided Expert Selection (PGES). SCEE constructs and evolves a sparse expert pool through expert optimization dynamics, improving plasticity while reducing redundant capacity. PGES controls expert utilization based on the pool shaped by SCEE, improving stability across both previously encountered and unseen tasks. Extensive experiments show that DIMoE-Adapters outperforms previous state-of-the-art methods across various settings.

2605.07492 2026-05-11 cs.CV

How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

文档解析距离解决还有多远？PureDocBench：一个跨干净、退化和现实场景的可追溯基准测试

Zhiheng Li, Zongyang Ma, Jiaxian Chen, Jianing Zhang, Zhaolong Su, Yutong Zhang, Zhiyin Yu, Ruiqi Liu, Xiaolei Lv, Bo Li, Jun Gao, Ziqi Zhang, Chunfeng Yuan, Bing Li, Weiming Hu

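Affiliations * CASIA; UCAS; NWPU; JLU; USTC; PKU; HelloGroup; ShanghaiTech; Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information

AI summary: This paper presents PureDocBench, a source-traceable benchmark covering 10 domains, 66 subcategories, and 1,475 pages, each in clean, digitally degraded, and real-degraded versions. Evaluating 40 models shows that document parsing is far from solved: the best model scores only about 74 out of 100, formula recognition remains a shared bottleneck, and general-purpose VLMs lose fewer points under degradation than pipeline specialists.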

Comments 42 pages, 20 figures, 16 tables

AI中文摘要

过去一年间，已有超过20种开源文档解析模型，但该领域仍几乎只在OmniDocBench上进行基准测试，这是一个1355页手动标注的数据集，其最高分数已超过90%。我们运行的三阶段审核流程对OmniDocBench的21353个评估块进行了审查，确认了2580个错误（12.08%）；结合一年以上的公开可用性，标注质量和污染风险都对其排名提出了质疑。为了解决这些问题，我们提出了PureDocBench，一个程序生成、可追溯的基准测试，从HTML/CSS渲染文档图像，并从同一来源生成可验证的标注，涵盖10个领域、66个子类和1475页，每页在三种版本中：清洁、数字退化和现实退化（总计4425张图像）。评估40种模型，涵盖管道专家、端到端专家和通用VLM，我们发现：（i）文档解析远未解决：最佳模型得分仅约74分，最强和最弱模型之间有44.6分的差距；（ii）参数不超过4B的专用解析器与通用VLM（大5-100倍）相当或更优，但公式识别仍是共同瓶颈，没有任何模型在三个赛道平均公式指标上超过67%；（iii）通用VLM在数字/现实退化下仅损失0.99/8.52总分，而管道专家损失4.90/14.21，产生排名反转，使仅清洁评估对部署来说误导。所有数据、代码和制品均已公开发布。

英文摘要

The past year has seen over 20 open-source document parsing models, yet the field still benchmarks almost exclusively on OmniDocBench, a 1,355-page manually annotated dataset whose top scores have saturated above 90%. A three-stage audit pipeline we run on OmniDocBench screens its 21,353 evaluator-scored blocks and confirms 2,580 errors (12.08%); combined with over a year of public availability, both annotation quality and contamination risk call its rankings into question. To address these issues, we present PureDocBench, a programmatically generated, source-traceable benchmark that renders document images from HTML/CSS and produces verifiable annotations from the same source, covering 10 domains, 66 subcategories, and 1,475 pages, each in three versions: clean, digitally degraded, and real-degraded (4,425 images total). Evaluating 40 models spanning pipeline specialists, end-to-end specialists, and general-purpose VLMs, we find: (i) document parsing is far from solved: the best model scores only ~74 out of 100, with a 44.6-point gap between the strongest and weakest models; (ii) specialist parsers with <=4B parameters rival or surpass general VLMs that are 5-100x larger, yet formula recognition remains a shared bottleneck where no model exceeds 67% when averaging the formula metric across all three tracks; (iii) general VLMs lose only 0.99/8.52 Overall points under digital/real degradation versus 4.90/14.21 for pipeline specialists, producing ranking reversals that make clean-only evaluation misleading for deployment. All data, code, and artifacts are publicly released.
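
The headline figures in this abstract can be re-derived with a few lines of arithmetic. The sketch below only recomputes numbers quoted above; the variable names are illustrative and not taken from the benchmark's code.

```python
# Figures quoted in the abstract above.
errors, blocks = 2580, 21353        # confirmed errors / audited OmniDocBench blocks
pages, versions = 1475, 3           # pages, each rendered in three versions

error_rate = round(100 * errors / blocks, 2)   # percentage of erroneous blocks
images = pages * versions                      # total number of benchmark images

# Extra Overall points lost by pipeline specialists relative to general VLMs.
extra_loss_digital = round(4.90 - 0.99, 2)
extra_loss_real = round(14.21 - 8.52, 2)

print(error_rate, images)  # 12.08 4425
```

Under these numbers, pipeline specialists lose roughly 3.9 more points than general VLMs under digital degradation and roughly 5.7 more under real degradation.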

2605.07491 2026-05-11 cs.CV

Implicit Multi-Camera System Calibration Using Gaussian Processes

基于高斯过程的隐式多摄像头系统校准

Ivan De Boi, Bart Ribbens, Veronika Golanova, Ursula Kapov, Simon Verspeek

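Affiliations * InViLab

AI summary: This paper proposes an implicit multi-camera calibration framework based on Gaussian process regression. It directly learns the non-linear mapping from 2D image coordinates across all cameras to 3D world coordinates, bypassing the estimation of explicit intrinsic and extrinsic parameters, and uses uncertainty quantification to make the calibration more reliable.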

AI中文摘要

本文提出一种基于高斯过程（GP）回归的新型隐式多摄像头系统校准框架。传统显式校准方法受限于刚性数学模型，难以处理非传统光学带来的复杂非线性失真，而现有基于神经网络的隐式方法通常数据需求高且缺乏内在不确定性量化（UQ）。我们的GP模型直接学习所有摄像头的2D图像坐标到3D世界坐标的复杂非线性映射，完全绕过了估计显式内参和外参的耗时过程。此外，内在的UQ对于将简单的3D点预测转化为可验证的3D测量至关重要，且具有统计上可信的置信区间。为进一步提高数据效率和实际部署效果，我们集成了主动学习（AL），利用GP的预测不确定性智能指导新校准数据的获取。该方法实现了稳健、高效且可靠的校准解决方案，尤其在收集大量校准数据受限的实际场景中表现突出。实验表明，3D预测的不确定性在靠近摄像头的区域更高。在$uv$坐标空间中，该区域的数据点更稀疏，尽管它们不在3D空间中。本文工作对需要校准复杂多摄像头系统的人员具有相关性。

英文摘要

This paper proposes a novel framework for implicit multi-camera system calibration utilizing Gaussian Process (GP) regression. Conventional explicit calibration methods are constrained by rigid mathematical models and struggle with complex, non-linear distortions from unconventional optics, while existing neural network-based implicit approaches are typically data-hungry and lack inherent uncertainty quantification (UQ). Our GP-based model directly learns the complex, non-linear mapping from 2D image coordinates across all cameras to a 3D world coordinate, completely bypassing time-consuming estimation of explicit intrinsic and extrinsic parameters. Moreover, the inherent UQ is critical for transforming a simple 3D point prediction into a verifiable 3D measurement, complete with statistically-sound confidence bounds. To further enhance data efficiency and practical deployment, we integrate Active Learning (AL), which intelligently leverages the GP's predictive uncertainty to strategically guide the acquisition of new calibration data. This approach results in a robust, data-efficient, and reliable calibration solution, proving particularly effective in practical scenarios where collecting extensive calibration data is a dominant constraint. Our experiments show that the uncertainty for the 3D predictions is higher closer to the cameras. The data points in $uv$-coordinate space are more sparse in that region, even though they are not in 3D space. This work is relevant for anyone who is tasked with the calibration of complex multi-camera systems.
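
As a rough illustration of the idea described above, the following minimal sketch fits a plain NumPy Gaussian process that maps stacked 2D image coordinates from two cameras to a 3D point and returns a predictive variance. Everything here is an assumption made for illustration: the toy pinhole cameras, the RBF kernel, the length scale, and the function names are hypothetical and are not the authors' model or data.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and b."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_fit_predict(x_train, y_train, x_test, noise=1e-4, length_scale=1.0):
    """Zero-mean GP regression: returns predictive mean and variance at x_test."""
    k = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    k_star = rbf_kernel(x_test, x_train, length_scale)
    mean = k_star @ alpha
    v = np.linalg.solve(chol, k_star.T)
    var = rbf_kernel(x_test, x_test, length_scale).diagonal() - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)

def project(points, offset_x):
    """Toy pinhole projection for a camera shifted along x (hypothetical setup)."""
    return np.stack([(points[:, 0] - offset_x) / points[:, 2],
                     points[:, 1] / points[:, 2]], axis=1)

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(40, 3)) + np.array([0.0, 0.0, 5.0])
pixels = np.hstack([project(points, 0.0), project(points, 0.5)])  # (40, 4) inputs

offset = points.mean(axis=0)                        # centre targets for the zero-mean GP
query = np.vstack([pixels[0], pixels[0] + 10.0])    # a seen input and a far-away one
mean, var = gp_fit_predict(pixels, points - offset, query, length_scale=0.2)
mean = mean + offset
```

In this toy setup the predictive variance is small at an input that was part of the calibration data and close to the prior variance far away from it, which is the kind of signal an active-learning loop could use to decide where to collect new calibration points.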

2605.07489 2026-05-11 cs.SD cs.MM eess.SP

A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation

为和弦生成设计的分解检索-编辑-重排序框架

Qiqi He, Dichucheng Li, Xiaoheng Sun, Anqi Huang

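Affiliations * Individual Researcher

AI summary: This paper proposes a decomposed Retrieval-Edit-Rerank framework that handles chord generation in separate stages, improving the balance between chord diversity and music-theoretic feasibility.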

Comments Accepted by the 2026 ACM International Conference on Multimedia Retrieval (ICMR 2026)

DOI: 10.1145/3805622.3810672

AI中文摘要

和弦生成是一种具有内在约束的创造性任务，需要在风格多样性与音乐理论可行性之间取得平衡。现有方法通常将候选生成与约束执行整合在单一模型中，使多样性-可行性权衡难以控制和解释。本文从系统层面出发，引入检索-编辑-重排序（RER）框架，将任务分解为三个明确阶段：i）检索，定义风格合理的候选空间；ii）编辑，通过最小修改强制音乐理论可行性；iii）重排序，解决可行候选间的软偏好。这种分离提供了一个可控的流程，每个组件解决生成过程的不同方面，从而增强输出和弦的可解释性和可调性。通过客观指标和主观评估，我们的分解系统在平衡和弦多样性和音乐理论可行性方面优于所有端到端和弦生成基线。消融研究进一步证实了每个阶段在创造性探索和约束满足中的互补作用。

英文摘要

Chord generation is an inherently constrained creative task that requires balancing stylistic diversity with music-theoretic feasibility. Existing approaches typically entangle candidate generation and constraint enforcement within a single model, making the diversity-feasibility trade-off difficult to control and interpret. In this work, we approach chord generation from a system-level perspective, introducing a Retrieval-Edit-Rerank (RER) framework that decomposes the task into three explicit stages: i) retrieval, which defines a stylistically plausible candidate space; ii) editing, which enforces music-theoretic feasibility through minimal modifications; and iii) reranking, which resolves soft preferences among feasible candidates. This separation provides a controllable pipeline, where each component addresses a distinct aspect of the generation process, thereby enhancing both the interpretability and adjustability of the output chords. Through objective metrics and subjective evaluation, our decomposed system outperforms all end-to-end chord generation baselines in balancing chord diversity and music-theoretic feasibility. Ablation studies further confirm the complementary roles of each stage in creative exploration and constraint satisfaction.
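
The three-stage decomposition can be pictured with a toy pipeline. The sketch below is purely illustrative: the chord library, the overlap-based retrieval, the "snap to the C-major chord on the same root letter" edit rule, and the reranking key are all hypothetical stand-ins, not the components used in the paper.

```python
# Diatonic triads of C major, keyed by root letter (toy feasibility rule).
DIATONIC = {"C": "C", "D": "Dm", "E": "Em", "F": "F", "G": "G", "A": "Am", "B": "Bdim"}

LIBRARY = [
    ["C", "F", "G", "C"],
    ["Am", "F", "C", "G"],
    ["C", "Eb", "F", "G"],
    ["D", "G", "C", "A"],
]

def retrieve(seed, library, k=3):
    """Stage i: keep the k progressions sharing the most chords with the seed."""
    return sorted(library, key=lambda prog: len(set(prog) & set(seed)), reverse=True)[:k]

def edit(progression):
    """Stage ii: replace out-of-key chords with the diatonic chord on the same root letter."""
    fixed, changes = [], 0
    for chord in progression:
        target = DIATONIC[chord[0]]
        changes += int(chord != target)
        fixed.append(target)
    return fixed, changes

def rerank(edited):
    """Stage iii: soft preferences, here ending on the tonic first, then fewer edits."""
    return sorted(edited, key=lambda item: (item[0][-1] != "C", item[1]))

seed = ["C", "F", "G"]
ranked = rerank([edit(prog) for prog in retrieve(seed, LIBRARY, k=3)])
best_progression, edits_needed = ranked[0]
```

Keeping the stages separate means each rule can be inspected or swapped on its own, which is the controllability argument the abstract makes.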

2605.07485 2026-05-11 cs.LG cs.AI

Excluding the Target Domain Improves Extrapolation: Deconfounded Hierarchical Physics Constraints

排除目标领域提升外推：去偏分层物理约束

Tsuyoshi Okita

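Affiliations * Kyushu Institute of Technology

AI summary: This paper proposes the Deconfounded Hierarchical Gate, which identifies how temperature confounding contaminates each constraint level in order to improve extrapolation under physics constraints. Experiments show that excluding target-domain data from pretraining improves extrapolation performance by 39%.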

Comments 16 pages, 2 figures

AI中文摘要

外推到分布外条件是物理约束深度生成模型的核心挑战。现有方法将物理约束作为单一静态正则化项统一应用于生成过程，未能处理物理定律的分层结构和混杂变量问题。我们提出去偏分层门（DHG），作为诊断和控制机制：它识别温度偏倚如何污染每个约束层级，使分层门反映内在物理不一致而非虚假温度效应。DHG结合反事实估计通过do操作符与后门调整去除混杂因素，然后逐步应用从粗到细的物理约束。我们在预训练中发现一个反直觉结果：排除目标域数据预训练在外推表现上优于包含它（RMSE 0.224 vs. 0.324）。这是因为FNO学习了领域无关的物理模式，当目标域被排除时转移更有效。在锂离子电池温度外推基准测试中（训练于24摄氏度，评估于4.0-43.0摄氏度），我们的方法达到RMSE=0.215，比无约束基线（Pure CFM: 0.397）提升了46%。

英文摘要

Extrapolation to out-of-distribution conditions is a fundamental challenge for physics-constrained deep generative models. Existing methods apply physical constraints as a single static regularization term uniformly across the generation process, and address neither the hierarchical structure of physical laws nor the confounding variable problem. We propose the Deconfounded Hierarchical Gate (DHG), which serves as a diagnostic and control mechanism: it identifies when and how strongly temperature confounding contaminates each constraint level, so that hierarchical gates reflect intrinsic physical inconsistency rather than spurious temperature effects. DHG combines counterfactual estimation via the do-operator with backdoor adjustment to remove confounding, then applies Coarse-to-Fine physical constraints progressively. We report a counter-intuitive finding in pretraining: excluding the target-domain data from pretraining outperforms including it by 39% in extrapolation performance (RMSE 0.224 vs. 0.324). This occurs because FNO learns domain-agnostic physical patterns that transfer more effectively when the target domain is withheld. On a lithium-ion battery temperature extrapolation benchmark (trained at 24 degrees Celsius, evaluated at 4.0--43.0 degrees Celsius), our method achieves RMSE = 0.215, a 46% improvement over the unconstrained baseline (Pure CFM: 0.397).

2605.07478 2026-05-11 cs.CV

AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models

AudioFace: 基于语言的语音驱动面部动画与多模态语言模型

Kai Zheng, Zejian Kang, Rui Mao, Hongyuan Zou, Yuanchen Fei, Xuanyang Xu, Xiangru Huang

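Affiliations * Westlake University; Zhejiang University; Tiangong University; Hunan University; The Chinese University of Hong Kong, Shenzhen

AI summary: This paper proposes AudioFace, a framework that uses linguistic and articulatory information to guide speech-driven facial motion generation, improving the correspondence between speech and facial movement.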

AI中文摘要

语音驱动的面部动画需要准确对应声学信号与面部运动，尤其是与发音相关的嘴部运动。然而，直接将语音音频映射到面部系数往往忽略了语音生成中的语言和发音结构。本文提出AudioFace，一种语言辅助的语音驱动blendshape生成框架，将与嘴部相关的面部系数预测视为由语言和发音信息引导的结构化生成问题。与仅依赖声学特征不同，我们的方法利用多模态大语言模型的先验知识，并引入转录和音素级别的提示，将语音信号与可解释的面部动作联系起来。大量实验表明，AudioFace在多个评估指标上均取得优越性能，验证了语言辅助和多模态先验引导的语音驱动面部动画的有效性。

英文摘要

Speech-driven facial animation requires accurate correspondence between acoustic signals and facial motion, especially for articulation-related mouth movements. However, directly mapping speech audio to facial coefficients often overlooks the linguistic and phonetic structure underlying speech production. In this paper, we propose AudioFace, a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, our method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics, validating the effectiveness of language-assisted and multimodal-prior-guided speech-driven facial animation.

2605.07477 2026-05-11 cs.CV

ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

ReasonEdit：通过强化学习实现可解释图像编辑评估

Honghua Chen, Zitong Xu, Huiyu Duan, Xinyun Zhang, Xiongkuo Min, Guangtao Zhai

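Affiliations * University of Electronic Science and Technology of China; Shanghai Jiao Tong University

AI summary: This paper proposes ReasonEdit, an interpretable image editing evaluation model trained with the ReasonEdit-22K dataset and the RE-Reward model, improving the interpretability and transparency of evaluation.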

AI中文摘要

近期文本引导图像编辑（TIE）模型取得了显著进展，但许多编辑结果仍存在伪影、意外修改和次优美学问题。尽管已有多个基准和评估方法，但大多数现有方法依赖标量评分且缺乏可解释性。这一限制主要源于缺乏高质量的TIE解释数据集和有效的奖励模型来训练可解释的评估器。为解决这些挑战，我们引入了ReasonEdit-22K，这是首个结合22K编辑图像和113K链式思考（CoT）样本的数据集，以及130万人类判断评估这些解释在逻辑性、准确性和有用性方面的表现。基于此数据集，我们提出了RE-Reward，一种基于多模态大语言模型（MLLM）的奖励模型，旨在为图像编辑的可解释推理评估提供人类对齐的反馈。此外，我们开发了ReasonEdit，该模型通过来自RE-Reward和群体相对策略优化（GRPO）算法的奖励信号进行训练，以学习可解释的评估模型。大量实验表明，ReasonEdit在与人类偏好的一致性方面表现优异，并在公共基准上表现出强大的泛化能力。此外，它能够生成高质量的可解释评估文本，使图像编辑的评估更加透明和可信。代码可在https://github.com/IntMeGroup/ReasonEdit获取。

英文摘要

Recent text-guided image editing (TIE) models have achieved remarkable progress; however, many edited results still suffer from artifacts, unintended modifications, and suboptimal aesthetics. Although several benchmarks and evaluation methods have been proposed, most existing approaches rely on scalar scores and lack interpretability. This limitation largely stems from the absence of high-quality interpretation datasets for TIE and effective reward models to train interpretable evaluators. To address these challenges, we introduce ReasonEdit-22K, the first dataset that combines 22K edited images with 113K Chain-of-Thought (CoT) samples, along with 1.3M human judgments assessing these interpretations in terms of logicality, accuracy, and usefulness. Building upon this dataset, we propose RE-Reward, a multimodal large language model (MLLM)-based reward model designed to provide human-aligned feedback for evaluating interpretable reasoning in image editing. Furthermore, we develop ReasonEdit, which is trained using reward signals derived from RE-Reward and the Group Relative Policy Optimization (GRPO) algorithm to learn an interpretable evaluation model. Extensive experiments demonstrate that ReasonEdit achieves superior alignment with human preferences and exhibits strong generalization across public benchmarks. In addition, it is capable of generating high-quality interpretable evaluation text, enabling more transparent and trustworthy assessment for image editing. The code is available at https://github.com/IntMeGroup/ReasonEdit.

2605.07476 2026-05-11 cs.LG

NPMixer: Hierarchical Neighboring Patch Mixing for Time Series Forecasting

NPMixer：用于时间序列预测的层次化邻域块混合

Jung Min Choi, Vijaya Krishna Yalavarthi, Lars Schmidt-Thieme

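Affiliations * ISMLL, University of Hildesheim, VWFS Data Analytics Research Center (VWFS-DARC), Hildesheim, Germany

AI summary: This paper proposes NPMixer, which captures local temporal dynamics and global dependencies through hierarchical neighboring patch mixing. Experiments show that it outperforms existing models on multiple datasets.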

AI中文摘要

多变量时间序列预测仍面临挑战，由于局部时间动态和多变量全局依赖的复杂性。本文提出NPMixer，一种具有可学习平稳小波变换的层次架构，该变换可自适应学习滤波器系数以数据依赖的方式将信号分解为趋势和细节成分。我们的框架引入了邻域混合块，通过一系列层次化MLP层在非重叠块上捕捉局部时间动态。具体而言，混合块利用MLP学习块内的和跨块的时间模式，扩展感受野以捕捉多尺度依赖。通道混合编码器应用于高频成分以学习通道相关性，同时保持底层全局趋势的稳定性。在七个基准数据集上的广泛实验表明，NPMixer在28个评估实验设置中（71.4%）在MSE上表现更好，优于现有最先进模型。

英文摘要

Multivariate time series forecasting remains a challenge due to the complexity of local temporal dynamics and global dependencies across multiple variables. In this paper, we propose Neighboring Patching Mixer (NPMixer), a hierarchical architecture featuring a Learnable Stationary Wavelet Transform that adaptively learns filter coefficients to decompose signals into trend and detail components in a data-dependent manner. Our framework introduces a Neighboring Mixer Block that captures local temporal dynamics through a series of hierarchical MLP layers operating on non-overlapping patches. Specifically, the mixer block utilizes MLPs to learn temporal patterns within and across these patches, expanding the receptive field to capture multi-scale dependencies. A Channel-Mixing Encoder is applied to high-frequency components to learn channel correlations while preserving the stability of the underlying global trend. Extensive experiments on seven benchmark datasets demonstrate that NPMixer consistently outperforms state-of-the-art models, achieving better performance in 20 out of 28 (71.4%) evaluated experimental setups for MSE.
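
To make the two building blocks named above more concrete, here is a minimal sketch of a fixed, undecimated Haar-style split into trend and detail, followed by non-overlapping patching. It is an assumption-laden illustration: the paper's transform learns its filter coefficients, whereas this sketch hard-codes averaging and differencing filters, and the function names are hypothetical.

```python
import numpy as np

def haar_style_split(x):
    """Undecimated averaging/differencing split with circular padding.

    Both outputs keep the input length, and trend + detail reconstructs x exactly.
    """
    shifted = np.roll(x, 1, axis=-1)
    trend = 0.5 * (x + shifted)     # low-pass: local average
    detail = 0.5 * (x - shifted)    # high-pass: local difference
    return trend, detail

def to_patches(x, patch_len):
    """Reshape (channels, time) into (channels, n_patches, patch_len), non-overlapping."""
    channels, time = x.shape
    if time % patch_len != 0:
        raise ValueError("time length must be divisible by patch_len")
    return x.reshape(channels, time // patch_len, patch_len)

rng = np.random.default_rng(0)
series = rng.normal(size=(3, 96))            # 3 channels, 96 time steps
trend, detail = haar_style_split(series)
patches = to_patches(detail, patch_len=8)    # shape (3, 12, 8)
```

An MLP applied along the last axis of `patches` would mix values within a patch, and one applied along the middle axis would mix across neighboring patches.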

2605.07474 2026-05-11 cs.CV cs.AI

ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

ForgeVLA：无需语言标注的联邦视觉-语言-动作学习

Yuhao Zhou, Yunpeng Zhu, Yang Zhou, Jindi Lyu, Jian Lan, Zhangyuan Wang, Dan Si, Thomas Seidl, Qing Ye, Jiancheng Lyu

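Affiliations * Sichuan University; Zhejiang University; Ludwig-Maximilians-Universität München; Lenovo Group Limited; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education

AI summary: This paper proposes ForgeVLA, a framework that trains federated VLA models from distributed vision-action pairs without centralizing data or requiring manual annotation. A client-side instruction classifier builds complete vision-language-action triplets, and a contrastive planning loss combined with an adaptive aggregation strategy addresses feature collapse. Experiments show that it outperforms baseline models.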

Comments 26 pages

英文摘要

Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets. Beyond triplet construction, we also identify vision-language feature collapse as a critical challenge that has been largely overlooked in prior federated VLA research. To mitigate this issue, ForgeVLA combines a client-side contrastive planning loss with a server-side adaptive aggregation strategy to learn task-discriminative representations efficiently. Extensive experiments across multiple benchmarks show that ForgeVLA significantly outperforms other baselines, and ablation studies further validate the contribution of each component.

2605.07471 2026-05-11 cs.LG hep-ex

Transfer Learning Across Fast- and Full-Simulation Domains in High-Energy Physics

在高能物理中跨快速模拟与全模拟领域的迁移学习

Matthias Schott, Lucie Flek

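Affiliations * Institute of Physics, University of Bonn; Bonn-Aachen International Center for Information Technology (b-it)

AI summary: This paper studies transfer learning between fast-simulated and fully simulated datasets in an LHC environment, showing on three tasks that pretrained models reduce target-domain data requirements and improve performance.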

Comments 16 pages, 8 figures

详情

AI中文摘要

高能物理中的机器学习模型通常在模拟数据上训练，其中全模拟样本计算成本高，而快速模拟能提供更大的统计量但现实感较低。本文系统研究了在真实LHC环境下快速模拟与全模拟数据集之间的迁移学习。我们考虑了三个代表性任务：信号-背景分类、夸克-胶子喷注标记和缺失横向能量重建，使用密集神经网络、图神经网络和基于Transformer的架构。模型在ATLAS类快速模拟上预训练，并适应于CMS类快速模拟和全模拟的ATLAS开放数据。在所有任务中，预训练模型均优于独立训练的基线模型，并显著减少目标领域训练数据的需求，通常将所需统计量减少约一半。这些结果表明快速模拟可用于学习稳健、可重用的表示，并推动发布训练模型作为可重用的科学资产，超越大型基础模型。

英文摘要

Machine-learning models in high-energy physics are often trained on simulated data, where fully simulated samples are computationally expensive while fast simulation provides large statistics at reduced realism. In this work, we systematically study transfer learning between fast-simulated and fully simulated datasets in a realistic LHC environment. We consider three representative tasks, signal-background classification, quark-gluon jet tagging, and missing transverse energy reconstruction, using dense neural networks, graph neural networks, and transformer-based architectures. Models are pretrained on ATLAS-like fast simulation and adapted to CMS-like fast simulation and to fully simulated ATLAS Open Data. Across all tasks, pretrained models consistently outperform independently trained baselines and require significantly less target-domain training data, typically reducing the needed statistics by about a factor of two. These results demonstrate that fast simulation can be used to learn robust, reusable representations and motivate publishing trained models as reusable scientific assets beyond large foundation models.

2605.07470 2026-05-11 cs.LG hep-ex

Uncovering Hidden Systematics in Neural Network Models for High Energy Physics

揭示高能物理中神经网络模型中的隐藏系统误差

Lucie Flek, Philipp Alexander Jungs, Akbar Karimi, Timo Saala, Alexander Schmid, Matthias Schott, Philipp Soldin, Christopher Wiebusch, Ulrich Willemsen

Affiliations * Bonn-Aachen Institute of Technology, University of Bonn, Germany; Physics Institute III A, RWTH Aachen University, Germany; Institute of Physics, University of Bonn, Germany; Physics Institute III B, RWTH Aachen University, Germany

AI summary: This study shows that neural networks used in high-energy physics analyses are sensitive to small variations in their inputs, and proposes a quantitative framework to evaluate and control the resulting systematic uncertainty.

Comments 18 pages, 9 figures

AI中文摘要

神经网络（NNs）是多维分类器，能学习输入可观测量之间的复杂非线性关系。尽管其灵活性在高能物理分析中表现突出，但使其对输入微小变化敏感，导致NN模型系统误差的传播和估计仍存挑战。研究发现，控制区域或输入特征名义变化导出的不确定性可能低估真实模型不确定性，从而留下偏差。受对抗攻击研究启发，探讨微小扰动如何导致NN输出显著变化，同时保持一维和相关输入分布几乎不变。通过代表性高能物理任务，如事件分类和对象识别，并测试多种网络架构，证明网络在允许不确定性范围内可被系统性“欺骗”。基于此，提出量化框架以探测和测量神经网络对现实实验变化的隐藏敏感性，提供评估和控制系统误差的实用路径。

英文摘要

Neural networks (NNs) are inherently multidimensional classifiers that learn complex, non-linear relationships among input observables. While their flexibility enables unprecedented performance in high-energy physics (HEP) analyses, it also makes them sensitive to small variations in their inputs. Consequently, the propagation and estimation of systematic uncertainties in NN-based models remain an open challenge. There are indications that uncertainties derived in control regions or from nominal variations of input features can underestimate the true model uncertainty, potentially leaving biases unaccounted for. Inspired by insights from adversarial-attack studies in machine learning, we explore how subtle perturbations, fully consistent with the experimental uncertainties on the input observables, can lead to substantial changes in NN outputs, while keeping the one-dimensional and correlated input distributions nearly unchanged. Using a set of representative HEP tasks, including event classification and object identification, and testing across a variety of network architectures, we demonstrate that networks can be systematically "fooled" at significant rates within the allowed uncertainty envelopes. Building on this observation, we introduce a quantitative framework to probe and measure the hidden sensitivity of neural networks to realistic experimental variations, providing a practical path to evaluate and control their systematic uncertainty in physics analyses.
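
The core observation, that inputs shifted only within their uncertainties can move a classifier's output, can be sketched with a generic sign-of-gradient perturbation on a tiny logistic model. This is an illustrative assumption, not the authors' framework: the weights, inputs, and per-feature uncertainties below are made up, and a real study would use a trained network and measured uncertainties.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def worst_case_shift(x, weights, bias, sigma):
    """Shift each input by at most its uncertainty, in the direction that lowers the score.

    For a logistic model the gradient of the score with respect to x is p * (1 - p) * weights,
    so its sign equals the sign of the weights.
    """
    delta = -sigma * np.sign(weights)
    before = sigmoid(x @ weights + bias)
    after = sigmoid((x + delta) @ weights + bias)
    return delta, before, after

x = np.array([0.8, -0.3, 1.2])           # hypothetical input observables
weights = np.array([2.0, -1.5, 0.7])     # hypothetical model weights
bias = -1.0
sigma = np.array([0.05, 0.10, 0.02])     # hypothetical per-feature uncertainties

delta, before, after = worst_case_shift(x, weights, bias, sigma)
```

Every component of `delta` stays inside the stated uncertainty, yet the score drops; in a deep, non-linear network the same construction can produce much larger changes than in this linear toy.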

2605.07467 2026-05-11 cs.LG cs.AI cs.ET

Physical Simulators as Do-Operators: Causal Discovery under Latent Confounders for AI-for-Science

物理模拟器作为do-运算符：在潜在混杂因素下的因果发现用于科学人工智能

Tsuyoshi Okita

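Affiliations * Kyushu Institute of Technology

AI summary: This paper proposes CFM-SD, which uses physical simulators as do-operators to address causal discovery with latent confounders and real interventional data in AI-for-Science settings. Experiments report strong results on both synthetic and real scientific data.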

Comments 17 pages, 1 figure

AI中文摘要

现有的干预性因果发现方法--IGSP、DCDI、ENCO--假设因果充分性（无潜在混杂因素）并依赖于合成模拟器中的虚拟干预。在分子设计和材料科学等AI-for-Science场景中，潜在混杂因素普遍存在，而真实干预（如基于物理的模拟）需要每数据点数小时至数天。本文提出CFM-SD（因果流匹配与模拟数据），利用第一原理物理模拟器作为Pearl的干预性算术中的do-运算符，同时处理潜在混杂因素和真实干预数据。理论上，d变量因果结构在物理可实现性约束下可识别，需要O(d)个单变量干预。在合成数据的内在评估中（γ=0.2--0.8），CFM-SD的平均F1值为0.800，远高于所有基线方法的F1值0.127--0.562。在真实科学数据的外在评估中，CFM-SD在分子毒性预测和电池电解质优化中实现了57-58%的偏倚减少，展示了超越合成基准的实用价值。

英文摘要

Existing interventional causal discovery methods -- IGSP, DCDI, ENCO -- assume causal sufficiency (no latent confounders) and rely on virtual interventions in synthetic simulators. In AI-for-Science settings such as molecular design and materials science, latent confounders are ubiquitous and real interventions (e.g., physics-based simulations) require hours to days per data point. We propose CFM-SD (Causal Flow Matching with Simulation Data), which uses first-principles physical simulators as do-operators in Pearl's interventional calculus to simultaneously handle latent confounders and real interventional data. Theoretically, $d$-variable causal structure is identifiable with $O(d)$ single-variable interventions -- the minimum under physical realizability constraints. In Intrinsic Evaluation on synthetic data ($γ=0.2$--$0.8$), CFM-SD achieves average F1 $=0.800$ vs. F1 $=0.127$--$0.562$ for all baselines. In Extrinsic Evaluation on real scientific data, CFM-SD achieves 57-58% bias reduction in molecular toxicity prediction and battery electrolyte optimization, demonstrating practical value beyond synthetic benchmarks.


2605.07466 2026-05-11 cs.CV

A Unified Framework for the Detection and Classification of Fatty Pancreas in Ultrasound Images

Ioan-Tudor-Alexandru Anghel, Ciprian-Mihai Ceausescu, Elena Dana Nedelcu, Elena Raluca Stirban, Camelia Croitoru, Despina Ungureanu, Ana Maria Palan, Gabriela Pop

Affiliations: Faculty of Mathematics and Computer Science, University of Bucharest; Ponderas Academic Hospital, Radiology; Ponderas Academic Hospital, Internal Medicine; Ponderas Academic Hospital, Gastroenterology; Emergency Clinical Hospital Bucharest, Radiology Department

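AI summary: This paper proposes an end-to-end framework that classifies normal versus fatty pancreas through segmentation-guided texture analysis. A TransUNet architecture combining a ResNet encoder with a transformer bottleneck segments the pancreas and the splenic vein, classification is done by texture comparison, and an unsupervised K-Means baseline indicates that the features remain informative without labeled training data.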

Abstract

Non-alcoholic fatty pancreas disease (NAFPD) is an underdiagnosed condition associated with metabolic syndrome, insulin resistance, and increased risk of pancreatic cancer. Diagnosis typically relies on subjective visual assessment of ultrasound images by clinicians. We propose an end-to-end framework for automatically classifying normal versus fatty pancreas from abdominal ultrasound images. Our method employs a TransUNet-based segmentation architecture with a ResNet encoder and transformer bottleneck to delineate the pancreas and the splenic vein, followed by anatomically-guided patch extraction and patient-level classification through pairwise texture comparison. The feature engineering mimics clinical reasoning by comparing the echogenicity of peri-venous fat to the pancreatic parenchyma, providing an interpretable signal for classification. The segmentation models are initialized via domain-specific transfer learning from a liver segmentation task. We validate the full pipeline on a clinical dataset of 214 abdominal ultrasound images with 107 expert-labeled cases using 5-fold cross-validation. SVM with RBF kernel achieves a mean cross-validated accuracy of 89.7% ± 1.8% and F1 of 0.898 ± 0.019, while the unsupervised K-Means baseline reaches 87.8% accuracy, demonstrating that the proposed features capture the relevant clinical signal even without labeled training data. To our knowledge, this is the first end-to-end automated framework for fatty pancreas classification from ultrasound using segmentation-guided texture analysis.
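The pairwise comparison described in the abstract can be pictured with a very small sketch. The abstract does not specify the actual texture features or patch sizes, so the mean-intensity difference and ratio below, the function names, and the pixel values are all assumptions for illustration only.

```python
def patch_mean(patch):
    """Mean gray level of a 2D patch given as a list of rows."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)

def echogenicity_features(pancreas_patch, fat_patch):
    """Pairwise comparison of two patches: difference and ratio of mean intensity."""
    p = patch_mean(pancreas_patch)
    f = patch_mean(fat_patch)
    return {"diff": p - f, "ratio": p / f if f else float("inf")}

# Hypothetical 2x2 gray-level patches (0-255), not real ultrasound data.
pancreas_patch = [[120, 130], [125, 125]]
fat_patch = [[100, 100], [100, 100]]

features = echogenicity_features(pancreas_patch, fat_patch)
print(features)
```

Feature vectors of this kind, computed per patient, are what a downstream classifier such as the SVM or K-Means mentioned in the abstract would consume.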

