arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.14824 2026-05-15 cs.LG math.AT

ToMAToMP: Robust and Multi-Parameter Topological Clustering

Ludo Andrianirina, Mathieu Carrière

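Affiliations: DataShape, Centre Inria d’Université Côte d’Azur

AI summary: This paper proposes ToMAToMP, a new topological clustering method that addresses limitations of the classic ToMATo algorithm: it handles only one function at a time, is sensitive to outliers, and depends on a user-tuned graph. The method builds on the MMA decomposition from multi-parameter persistent homology, handles several functions simultaneously, and comes with theoretical robustness guarantees. Experiments show that ToMAToMP improves clustering quality over both non-topological and topological baselines on several datasets.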

Abstract

Topological clustering, and its main algorithm ToMATo, is a clustering method from Topological Data Analysis (TDA) which has been applied successfully in several applications during the last few years. This is due to its high versatility, as clusters are detected from the persistent components in the sublevel sets of any user-defined function (gene expression, pixel values, etc), and efficiency, as topological clustering enjoys robustness guarantees. However, ToMATo is also limited in several ways. First, a graph on the data points needs to be provided as a hyper-parameter of the method (whose fine-tuning is left to the user). Second, ToMATo is known to be very sensitive to outlier values in the function range. Finally, and most importantly, ToMATo can only handle one function at a time, whereas it is critical to use several functions in various applications. In this article, we introduce ToMAToMP: the first topological clustering method able to handle several functions at the same time with theoretical guarantees. More specifically, we leverage a recent tool from multi-parameter persistent homology, called MMA decomposition, to design our clustering algorithm, and prove that it enjoys robustness properties. As corollaries, we show that it can be used to make ToMATo independent of graph tuning, and robust to outliers. Finally, we provide a set of numerical experiments showcasing the efficiency and quality of the clusterings produced by ToMAToMP, by showing strong improvement over non-topological and topological baselines for various datasets.
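The single-function procedure that the abstract starts from can be sketched in a few lines. This is a minimal illustration of persistence-based clustering on a graph, following the abstract's sublevel-set wording; it is not ToMAToMP and not the reference ToMATo implementation. The function name `tomato_sublevel`, its arguments, and the threshold name `tau` are hypothetical and chosen only for illustration.

```python
def tomato_sublevel(values, edges, tau):
    """Cluster graph vertices by persistent components of sublevel sets.

    values: one function value per vertex
    edges:  iterable of (i, j) pairs, the user-supplied neighborhood graph
    tau:    persistence threshold; basins shallower than tau get merged
    Returns, for each vertex, the index of the minimum that represents its cluster.
    """
    n = len(values)
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Union-find forest; every root is the lowest vertex of its component.
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    processed = [False] * n
    # Sweep the function from low to high values.
    for v in sorted(range(n), key=lambda i: values[i]):
        processed[v] = True
        roots = {find(u) for u in neighbors[v] if processed[u]}
        if not roots:
            continue  # local minimum: v starts a new component
        # Attach v to the adjacent component with the lowest minimum.
        best = min(roots, key=lambda r: values[r])
        parent[v] = best
        for r in roots - {best}:
            # v links component r to `best`; r's persistence is values[v] - values[r].
            if values[v] - values[r] < tau:
                parent[r] = best  # not persistent enough: merge it away
    return [find(v) for v in range(n)]
```

On a path graph with values `[0.0, 5.0, 1.0, 1.5, 1.2]`, a threshold of `tau=1.0` keeps the two deep basins and absorbs the shallow one at the last vertex. The result also shows the two dependencies the abstract criticises: the output changes with the graph passed in `edges`, and a single outlying value shifts every persistence computed through it.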

2605.14821 2026-05-15 cs.CV

HDRFace: Rethinking Face Restoration with High-Dimensional Representation

Zirui Wang, Xianhui Lin, Yi Dong, Bo Wei, Gangjian Zhang, Siteng Ma, Zebiao Zheng, Xing Liu, Hong Gu, Minjing Dong

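Affiliations: City University of Hong Kong; vivo BlueImage Lab, vivo Mobile Communication Co., Ltd

AI summary: Face restoration under complex degradations remains an ill-posed inverse problem with severe information loss. This paper proposes HDRFace, a face restoration framework conditioned on high-dimensional representations, which injects semantically rich priors to improve restoration quality without changing the generative backbone. The method first obtains a structurally reliable intermediate result from an off-the-shelf restorer, then uses a pretrained high-dimensional feature encoder to extract fine-grained facial representations from the input and the intermediate result, and injects them into generation as additional conditions. It also introduces a structure-detail aware adaptive fusion mechanism (SDFM) that emphasizes global constraints during structure modeling and strengthens representation guidance during detail synthesis, balancing structural consistency and detail fidelity.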

Abstract

Face restoration under complex degradations still remains an ill-posed inverse problem due to severe information loss. Although diffusion models benefit from strong generative priors, most methods still condition only on low-quality inputs, making it difficult to recover identity-critical details under heavy degradations. In this work, we propose HDRFace, a High-Dimensional Representation conditioned Face restoration framework that injects semantically rich priors into the conditional flow without modifying the generative backbone. Our pipeline first obtains a structurally reliable intermediate restoration with an off-the-shelf restorer, then uses a pretrained high-dimensional feature encoder to extract fine-grained facial representations from both the low-quality input and the intermediate result, and injects them as additional conditions for generation. We further introduce SDFM, a Structure-Detail aware adaptive Fusion Mechanism that emphasizes global constraints during structure modeling and strengthens representation guidance during detail synthesis, balancing structural consistency and detail fidelity. To validate the generalization ability of our method, we implement the proposed framework on two generative models, SD V2.1-base and Qwen-Image, and consistently observe stable and coherent performance gains across different architectures.

2605.14819 2026-05-15 cs.CV

The Velocity Deficit: Initial Energy Injection for Flow Matching

Linze Li, Zong-Wei Hong, Shen Zhang, Bo Lin, Jinglun Li, Yao Tang, Jiajun Liang

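Affiliations: Jiiov Technology, Beijing, China

AI summary: The paper identifies a phenomenon it calls the Velocity Deficit: in high-dimensional flow matching, the mean squared error (MSE) objective systematically underestimates velocity magnitude, so generated samples fail to reach the data manifold, an effect termed Integration Lag. To address this, the authors propose Initial Energy Injection, realized as the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC), and they show that velocity contraction has asymmetric effects at the start and the end of a trajectory. Experiments show that SSC substantially improves generation quality and speeds up sampling on ImageNet-1k, and that the methods also apply to text-to-image and high-resolution generation.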

Comments Accepted by ICML2026

Abstract

While Flow Matching theoretically guarantees constant-velocity trajectories, we identify a critical breakdown in high-dimensional practice: the Velocity Deficit. We show that the MSE objective systematically underestimates velocity magnitude, causing generated samples to fail to reach the data manifold, a phenomenon we term Integration Lag. To rectify this, we propose Initial Energy Injection, instantiated via two complementary methods: the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC). Both are grounded in our discovery of a crucial asymmetry: velocity contraction causes harmful kinetic stagnation at the trajectory's start, yet acts as a beneficial denoising mechanism at its end. Empirically, SSC yields significant efficiency gains with zero retraining and just one line of code. On ImageNet-1k (256x256), it improves FID by 44.6% (from 13.68 to 7.58) and achieves a 5x speedup, enabling a 50-step generator (FID 7.58) to beat a 250-step baseline (FID 8.65). Furthermore, our methods generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by ~22%.
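One standard way to see why a squared-error objective can shrink velocity magnitude is sketched below. It is a general property of least-squares regression and may differ from the paper's own argument. The symbols $u_t$ (the per-sample target velocity) and $x_t$ (the interpolated point at time $t$) are assumptions introduced here for illustration.

```latex
v^\star(x, t)
  = \arg\min_{v}\ \mathbb{E}\big[\|v - u_t\|^2 \mid x_t = x\big]
  = \mathbb{E}[u_t \mid x_t = x],
\qquad
\|v^\star(x, t)\|^2
  = \mathbb{E}\big[\|u_t\|^2 \mid x_t = x\big]
    - \operatorname{tr}\operatorname{Cov}(u_t \mid x_t = x)
  \le \mathbb{E}\big[\|u_t\|^2 \mid x_t = x\big].
```

Equality holds only when the target velocity is deterministic given $x_t$, so any spread of targets passing through the same point lowers the learned speed there.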

2605.14816 2026-05-15 cs.CL

Conversion of Lexicon-Grammar tables to LMF. Application to French

Eric Laporte, Elsa Tolone, Mathieu Constant

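Affiliations: Laboratoire d’automatique documentaire et linguistique

AI summary: This paper describes the first experiment in converting Lexicon-Grammar tables for French verbs into the Lexical Markup Framework (LMF) format. The Lexicon-Grammar is a major source of lexical and syntactic information for French, and converting it to an LMF-compliant format helps the standardization and interoperability of natural language processing dictionaries. The paper analyses the main difficulties encountered during the conversion and describes the resulting resource.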

Journal ref LMF. Lexical Markup Framework, 2013, ISTE - Wiley, pp.157-187

Abstract

We describe the first experiment of conversion of Lexicon-Grammar tables for French verbs into the Lexical Markup Framework (LMF) format. The Lexicon-Grammar of the French language is currently one of the major sources of lexical and syntactic information for French. Its conversion into an interoperable representation format according to the LMF standard makes it usable in different contexts, thus contributing to the standardization and interoperability of natural language processing dictionaries. We briefly introduce the Lexicon-Grammar and the derived dictionaries; we analyse the main difficulties faced during the conversion; and we describe the resulting resource.

2605.14815 2026-05-15 cs.CV

Probing into Camera Control of Video Models

Chen Hou, Christian Rupprecht

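Affiliations: Visual Geometry Group, University of Oxford

AI summary: This paper studies camera control in video generation models, a capability needed to produce geometrically meaningful content. Unlike earlier approaches that rely on additional modules and paired data, the authors treat camera control as a form of geometric guidance, implemented by differentiable resampling of latent features during denoising. The method requires no extra training, applies to most video diffusion models, and can serve as a probe of base models' camera control capabilities, revealing shared biases and performance differences among existing models in multi-view generation.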

Abstract

Video is a rich and scalable source of 3D/4D visual observations, and camera control is a key capability for video generation models to produce geometrically meaningful content. Existing approaches typically learn a mapping from camera motion to video using additional camera modules and paired data. However, such datasets are often limited in scale, diversity, and scene dynamics, which can bias the model toward a narrow output distribution and compromise the strong prior learned by the base model. These limitations motivate a different perspective on camera control. In this paper, we show that camera control need not be modeled as an implicit mapping problem, but can instead be treated as a form of geometric guidance that induces displacements across frames. Specifically, we reformulate camera control into a set of displacement fields and apply them via differentiable resampling of latent features during denoising. Our simple approach achieves effective camera control with minimal degradation across diverse quality metrics compared to fine-tuned baselines. Since our method is applicable to most video diffusion models without training, it can also serve as a probe to study the camera control capabilities of base models. Using this probe, we identify universal biases shared by representative video models, as well as disparities in their responses to camera control. Finally, we benchmark their performance in multi-view generation, offering insights into their potential for 3D/4D tasks.
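The resampling step named in the abstract can be illustrated with a toy bilinear warp of a single 2-D feature channel under a per-pixel displacement field. This is a sketch under stated assumptions, not the authors' pipeline: the function name `bilinear_resample`, the `(dx, dy)` flow layout, and the border clamping are hypothetical choices. The pure-Python version only shows the arithmetic; the interpolation weights vary piecewise-linearly with the displacement, which is what lets an autodiff framework pass gradients through the same operation.

```python
import math


def bilinear_resample(feature, flow):
    """Warp a 2-D grid so that output[y][x] samples feature at (x + dx, y + dy).

    feature: list of rows of floats (one latent channel)
    flow:    same shape, each entry a (dx, dy) displacement in pixels
    """
    h, w = len(feature), len(feature[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Clamp the sampling location to the grid (border padding).
            sx = min(max(x + dx, 0.0), w - 1.0)
            sy = min(max(y + dy, 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            ax, ay = sx - x0, sy - y0  # fractional offsets act as blend weights
            top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
            bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
            out[y][x] = (1 - ay) * top + ay * bottom
    return out
```

A zero displacement field returns the input unchanged, and a uniform half-pixel horizontal shift averages each value with its right-hand neighbour, except in the last column, where clamping repeats the border value.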

2605.14810 2026-05-15 cs.RO

CaMeRL: Collision-Aware and Memory-Enhanced Reinforcement Learning for UAV Navigation in Multi-Scale Obstacle Environments

Hong Hong, Feiyu Liao, Yongheng Liang, Boning Zhang, Haitao Wang, Hejun Wu

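Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing

AI summary: In UAV obstacle avoidance navigation, variation in obstacle scale is often overlooked, and existing methods typically rely only on geometric features from single-frame depth observations, which copes poorly with multi-scale obstacle environments. This paper proposes CaMeRL, a reinforcement learning framework that combines collision awareness with memory enhancement: it encodes fine-grained obstacle structure to improve sensitivity to small obstacles and uses a temporal memory module to mitigate the partial observability caused by large-obstacle occlusion. Experiments show that CaMeRL outperforms existing methods in both ultra-small and extra-large obstacle settings and navigates reliably in cluttered outdoor environments.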

Comments 8 pages, 7 figures. Submitted to IEEE Robotics and Automation Letters

Abstract

In obstacle avoidance navigation of unmanned aerial vehicles (UAVs), variations in obstacle scale have received less attention than obstacle number or density. Existing methods typically extract purely geometric features from single-frame depth observations. Such representations tend to neglect small obstacles and lose spatial context under occlusions caused by large obstacles, leading to noticeable degradation in environments with multi-scale obstacles. To address this issue, we propose CaMeRL, a Collision-aware and Memory-enhanced Reinforcement Learning framework for UAV navigation. The collision-aware latent representation encodes risk-sensitive depth cues to preserve fine-grained obstacle structures, thereby improving sensitivity to small obstacles. The temporal memory module integrates observations across frames, mitigating partial observability caused by large-obstacle occlusions. We evaluate CaMeRL with multi-scale obstacles, including ultra-small and extra-large obstacle settings. Results show that CaMeRL outperforms state-of-the-art baselines across all scales, with success rate gains of 0.48 and 0.28 in the ultra-small and extra-large settings, respectively. More importantly, CaMeRL achieves reliable navigation in cluttered outdoor environments.

2605.14808 2026-05-15 cs.CV

SuperADD: Training-free Class-agnostic Anomaly Segmentation -- CVPR 2026 VAND 4.0 Workshop Challenge Industrial Track

Lukas Roming, Felix Lehnerer, Jonas V. Funk, Andreas Michel, Georg Maier, Thomas Längle, Jürgen Beyerer

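Affiliations: Fraunhofer IOSB

AI summary: This paper proposes SuperADD, a training-free, class-agnostic method for industrial anomaly segmentation, designed for the distribution shifts that arise when acquisition conditions change in production. Building on SuperAD, it adds a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling, and iterative morphological closing, which improve robustness and generalization under varying illumination. Experiments show that SuperADD achieves better segmentation performance than existing methods on the MVTec AD 2 dataset, and it suits industrial settings where product variants and appearance changes must be handled efficiently.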

Comments Technical report for the CVPR 2026 VAND 4.0 workshop challenge industrial track

Abstract

Visual anomaly detection (AD) for industrial inspection is a highly relevant task in modern production environments. The problem becomes particularly challenging when training and deployment data differ due to changes in acquisition conditions during production. In the VAND 4.0 Industrial Track, models must remain robust under distribution shifts such as varying illumination, and their performance is assessed on the MVTec AD 2 dataset. To address this setting, we propose a training-free and class-agnostic anomaly detection pipeline based on the work of SuperAD. Our approach improves generalization through several modifications designed to enhance robustness under distribution shifts. These adaptations include using a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling for better coverage of the data distribution, and iterative morphological closing for cleaner and more spatially consistent anomaly maps. Unlike methods that rely on class-specific architectures or per-class hyperparameter tuning, our method uses a single architecture and one shared hyperparameter configuration across all object classes. This makes the approach well suited for industrial deployment, where product variants and appearance changes must be handled with minimal adaptation effort. We achieve segmentation F1 scores of 62.61%, 57.42%, and 54.35% on test public, private, and private mixed of MVTec AD 2 respectively, thereby outperforming SuperAD and other state-of-the-art methods. Code is available at https://github.com/LukasRoom/SuperADD.

2605.14805 2026-05-15 cs.RO

Learning Cross-Coupled and Regime Dependent Dynamics for Aerial Manipulation

Rishabh Dev Yadav, Samaksh Ujjawal, Sihao Sun, Spandan Roy, Wei Pan

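Affiliations: Department of Computer Science, The University of Manchester; International Institute of Information Technology Hyderabad; Department of Cognitive Robotics, Delft University of Technology; Newcastle University

AI summary: This paper addresses the need for accurate dynamics models for aerial manipulators in complex tasks such as payload transport. To handle the strong coupling, delayed aerodynamic effects, and nonstationary dynamics caused by payload changes, it proposes a structured encoder-decoder framework. A nonlinear encoder captures cross-coupling and temporal dependencies from state-input histories, and a lightweight linear decoder adapts online to nonstationary residual dynamics, improving model prediction accuracy and trajectory tracking. Experiments on a real platform show better adaptation and control performance.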

Abstract

Accurate dynamics models are critical for aerial manipulators operating under complex tasks such as payload transport. However, modeling these systems remains fundamentally challenging due to strong quadrotor-manipulator coupling, delayed aerodynamic interactions, and regime-dependent dynamics variations arising from payload changes and manipulator reconfiguration. These effects produce residual dynamics that are simultaneously cross-coupled, history-dependent, and nonstationary, causing both analytical models and purely offline learned models to degrade during deployment. To address these challenges, we propose a structured encoder-decoder framework for adaptive residual dynamics learning in aerial manipulators. The proposed nonlinear latent encoder captures cross-variable coupling and temporal dependencies from state-input histories, while a lightweight linear latent decoder enables online adaptation under regime-dependent nonstationary dynamics. The linear-in-parameter decoder structure permits closed-form Bayesian adaptation together with consistency-driven covariance inflation, enabling rapid and stable adaptation to both transient and slowly varying dynamics changes while remaining compatible with real-time model predictive control (MPC). Experimental results on a real aerial manipulation platform demonstrate improved residual prediction accuracy, faster adaptation under changing operating conditions, and enhanced MPC-based trajectory tracking performance. These results highlight the importance of jointly modeling coupled temporal dynamics and deployment-time nonstationarity for reliable aerial manipulation.

2605.14802 2026-05-15 cs.AI

A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency

Zhao Yang, Wang Huan, Li Yingshuo, Tu Haomiao, Lin Hujite

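Affiliations: Changchun Kelaile Technology Co., Ltd

AI summary: This study targets fact loss, timeline confusion, persona drift, and reduced stability in long-term interaction with large language models, and proposes ARPM, a heterogeneous temporal memory governance framework. ARPM separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, and other techniques to make continuity and persona consistency traceable and governable. Experiments show that ARPM maintains semantic continuity and persona consistency in high-noise settings, and that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.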

Comments 23 pages, 5 figures, 2 tables. Preprint version. Code for ARPM v4.0 is available at: https://github.com/Spirtxiaoqi7/ARPM

Abstract

Large language models often suffer from fact loss, timeline confusion, persona drift, and reduced stability during long-range interaction, especially under high-noise knowledge bases, context clearing, and cross-model transfer. To address these issues, we introduce ARPM, an external temporal memory governance framework for long-term dialogue. ARPM separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and a controlled analysis protocol for evidence verification and answer binding. Unlike approaches that encode persona consistency into model weights or rely only on long context, ARPM treats continuity as a traceable, auditable, and transferable governance problem. Using engineering logs, we conduct three experiments. First, in a 50-round question-answering setting, we compare signal-to-noise ratios of 1:5 and 1:200+, and distinguish CSV auto-judgment from manual review. Under 1:5, CSV recall accuracy is 54.0%, while manual review raises it to 100.0%. Under 1:200+, the values are 44.0% and 80.0%. These results show that automatic rules can underestimate recall after supporting evidence enters the prompt. Second, ablation results show that dialogue history retrieval is necessary for recent continuity: disabling it reduces strict accuracy from 100% to 66.7%, and disabling BM25 reduces it to 80.0%, indicating that pure semantic retrieval is insufficient for correction and tracing. Third, under a 5.1-million-character noise substrate, periodic context clearing, and multi-model handoff, ARPM maintains semantic continuity, boundary continuity, and persona consistency, while exposing limits caused by weak protocol compliance. These findings show that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.
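The RRF fusion step that combines the vector and BM25 rankings can be sketched as follows. This is the generic reciprocal rank fusion formula, score(d) = Σ 1 / (k + rank(d)), not ARPM's actual code; the function name `rrf_fuse` and the constant `k=60` (a commonly used default) are assumptions, since the abstract does not state ARPM's settings.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    rankings: list of lists, each ordered from best to worst
    k:        smoothing constant; larger values flatten the rank weighting
    Returns document ids sorted by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break score ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because only ranks enter the score, the fusion needs no calibration between cosine similarities and BM25 scores. A document ranked highly by both retrievers rises above one ranked first by only a single retriever.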

2605.14801 2026-05-15 cs.RO

Exploring Bottlenecks in VLM-LLM Navigation: How 3D Scene Understanding Capability Impacts Zero-Shot VLN

Ziyi Xia, Chaoran Xiong, Litao Wei, Xinhao Hu, Ling Pei

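Affiliations: Shanghai Key Laboratory of Navigation and Location Based Services, Shanghai Jiao Tong University

AI summary: This paper studies a key bottleneck in vision-and-language navigation (VLN) systems: current 3D perception models pursue pixel-level accuracy, which conflicts with the real-time and compute-efficiency demands of navigation. Based on typical VLM-LLM frameworks, the authors propose statistical success-rate upper bounds for two core subsystems and show that, beyond a certain level, gains in perception accuracy yield diminishing returns in navigation performance. They argue that 3D scene understanding for VLN should focus on navigation-relevant vocabularies and bounding-box accuracy rather than pixel-level accuracy alone.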

Comments Accepted by ICRA Workshop MM-Spatial AI, Oral

Abstract

Zero-shot vision-and-language navigation (VLN) has gained significant attention due to its minimal data collection costs and inherent generalization. This paradigm is typically driven by the integration of pre-trained Vision-Language Models (VLMs) and Large Language Models (LLMs), where VLMs construct 3D scene graphs while LLMs handle high-level reasoning and decision-making. However, a critical bottleneck exists in this system: current 3D perception models prioritize pixel-level accuracy, directly conflicting with the strict computational limits and real-time efficiency demanded by embodied navigation. To address this gap, this paper quantifies the actual impact of 3D scene understanding capability on VLN performance. Based on typical VLM-LLM frameworks, we propose statistical success rate (SR) upper bounds for two core subsystems: 1) the slow LLM planner, which relies on topological mapping semantics, and 2) the fast reactive navigator, which utilizes spatial coordinates and bounding boxes to execute LLM decisions. Evaluations using state-of-the-art 3D scene understanding models validate our proposed bounds and reveal a perception saturation phenomenon, indicating that improvements in perception accuracy beyond a certain threshold yield diminishing returns in navigation success. Our findings suggest that 3D scene understanding for VLN should pivot away from strict pixel-level precision, prioritizing instead navigation-relevant core vocabularies and accurate bounding box proportions.

2605.14795 2026-05-15 cs.CV

COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking

Shukun Jia, Shiyu Hu, Yipei Wang, Ximeng Cheng, Yichao Cao, Xiaobo Lu

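Affiliations: School of Automation, Southeast University, Nanjing, China; Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, Nanjing, China; School of Physical & Mathematical Sciences, Nanyang Technological University, Singapore; Big Data Institute, Central South University, Changsha, China

AI summary: The paper studies how to improve the discriminative ability of Referring Multi-Object Tracking (RMOT) under sparse semantic supervision. It proposes the COAL framework, which introduces explicit semantic injection and counterfactual learning to strengthen recognition of complex semantic structures. COAL combines a vision-language model and a large language model in a hierarchical multi-stream integration architecture, mitigating the overfitting and semantic collapse caused by sparse supervision. Experiments show clear gains on several benchmarks, and on the challenging Refer-KITTI-V2 dataset it surpasses the previous state of the art.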

Abstract

Referring Multi-Object Tracking (RMOT) faces a fundamental structural contradiction between the high-discriminability demand and the sparse semantic supervision. This mismatch is particularly acute in highly homogeneous scenarios that require fine-grained discrimination over complex compositional semantics. However, under sparse supervision, models overfit to salient yet insufficient cues, thereby encouraging shortcut learning and semantic collapse. To resolve this, we propose COAL (Counterfactual and Observation-enhanced Alignment Learning), a framework that advances RMOT beyond isolated structural optimization through knowledge regularization. First, we introduce Explicit Semantic Injection (ESI) via a VLM to densify the observation space and enhance instance discriminability. Second, leveraging LLM reasoning, we propose Counterfactual Learning (CFL) to augment supervision, enforcing strict attribute verification for robust compositional recognition. These strategies are unified within a Hierarchical Multi-Stream Integration (HMSI) architecture, which distills external knowledge into domain-specific discriminative representations. Experiments on Refer-KITTI and Refer-KITTI-V2 benchmarks validate COAL's efficacy. Notably, it surpasses the state-of-the-art by 7.28% HOTA on the highly challenging Refer-KITTI-V2. These results demonstrate the effectiveness of knowledge regularization for resolving the sparsity-discriminability paradox in RMOT.

2605.14790 2026-05-15 cs.CL cs.AI

Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

Songyang Gao, Yinghui Xia, Siyi Liu, Hui Xiong

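Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University

AI summary: This paper proposes Graphs of Research (GoR), a supervised fine-tuning method for improving research idea generation with large language models (LLMs). For each seed paper, it builds the 2-hop citation neighborhood and uses citation position, frequency, predecessor links, and publication time to form a paper-evolution directed acyclic graph (DAG), which serves as the supervision signal for training. In comparisons against GPT-4o-based baselines GoR performs best, supporting citation-evolution graphs as a supervision signal for research idea generation.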

Abstract

Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly elicit ideas by conditioning LLMs on statically retrieved literature or on complex prompt engineering, discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as a supervision signal for LLM-based idea generation. We hope that this lowers the barrier to using citation-evolution graphs as supervision, accelerating automated scientific innovation.

2605.14785 2026-05-15 cs.LG cs.CV

Understanding Imbalanced Forgetting in Rehearsal-Based Class-Incremental Learning

Alberto Tamajo, Srinandan Dasmahapatra, Rahman Attar

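Affiliations: School of Electronics and Computer Science, University of Southampton

AI summary: In class-incremental learning (CIL), neural networks are prone to catastrophic forgetting. Rehearsal-based strategies mitigate it, but classes turn out to be forgotten to unequal degrees. This paper systematically analyses this imbalanced forgetting, constructs three last-layer coefficients that capture different gradient-level sources of interference affecting each class during incremental learning, and shows that the coefficients predict how much each class is forgotten. The self-induced interference coefficient is the strongest predictor, which suggests new directions for mitigating imbalanced forgetting.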

Comments 37 pages; 24 tables; 7 figures; submitted to a journal

Abstract

Neural networks suffer from catastrophic forgetting in class-incremental learning (CIL) settings. Rehearsal (replaying a subset of past samples) is a well-established mitigation strategy. However, recent results suggest that, despite balanced rehearsal allocation, some classes are forgotten substantially more than others. Despite its relevance, this imbalanced forgetting phenomenon remains underexplored. This work shows that imbalanced forgetting arises systematically and severely in rehearsal-based CIL and investigates it extensively. Specifically, we construct, from a principled analysis, three last-layer coefficients that capture different gradient-level sources of interference affecting each past class during an incremental step. We then demonstrate that, together, they reliably predict how past classes will rank in terms of forgetting at the end of that step. While predictive performance alone does not establish causality, these results support the interpretation of the coefficients as a plausible mechanistic account linking last-layer gradient-level interactions during training to class-level forgetting outcomes. Notably, one coefficient, capturing self-induced interference, emerges as the strongest predictor, with controlled experiments providing evidence consistent with this coefficient being influenced by the new-class interference coefficient. Overall, our findings provide valuable insights and suggest promising directions for mitigating imbalanced forgetting by reducing class-wise disparities in the identified sources of interference.

2605.14781 2026-05-15 cs.CV

MonoPRIO: Adaptive Prior Conditioning for Unified Monocular 3D Object Detection

Leon Davies, Qinggang Meng, Mohamad Saada, Baihua Li, Simon Sølvsten

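Affiliations: Department of Computer Science, Loughborough University, Epinal Way, Loughborough LE11 3TU, Leicestershire, United Kingdom; European Center for Risk & Resilience Studies, University of Southern Denmark, Degnevej 14, 6705 Esbjerg, Denmark

AI summary: Monocular 3D object detection is hard under occlusion, truncation, and projection-induced scale-depth ambiguity, especially in unified multi-class settings, where class variability and partial visibility make size estimation less stable. This paper proposes MonoPRIO, which improves unified monocular 3D detection through adaptive prior conditioning in the size pathway. It constructs class-aware size prototypes, routes decoder queries to a soft mixture prior, and applies uncertainty-aware log-space conditioning and Cluster-Aligned Prior (CAP) regularisation, improving detection accuracy and robustness. Experiments show that MonoPRIO achieves the strongest fully reported unified multi-class result on the KITTI test set, also performs strongly in the car-only setting, and uses substantially less compute than MonoCLUE.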

Comments 12 pages, 4 figures, 8 tables. Submitted to Pattern Recognition. Code and reproducibility material available at https://github.com/bigggs/MonoPRIO

Abstract

Monocular 3D object detection remains challenging because metric size and depth are underdetermined by single-view evidence, particularly under occlusion, truncation, and projection-induced scale-depth ambiguity. Although recent methods improve depth and geometric reasoning, metric size remains unstable in unified multi-class settings, where class variability and partial visibility broaden plausible size modes. We propose MonoPRIO, a unified monocular 3D detector that targets this bottleneck through adaptive prior conditioning in the size pathway. MonoPRIO constructs class-aware size prototypes offline, routes each decoder query to a soft mixture prior, applies uncertainty-aware log-space conditioning, and uses Cluster-Aligned Prior (CAP) regularisation on matched positives during training. On the official KITTI test server, MonoPRIO achieves the strongest fully reported unified multi-class result among methods reporting complete Car, Pedestrian, and Cyclist metrics. In the car-only setting, it also achieves the strongest 3D bounding-box AP across Easy/Moderate/Hard categories among compared methods without extra data, while using substantially less compute than MonoCLUE. Ablations and diagnostics show complementary gains from routed injection and CAP, with the largest benefits in ambiguity-prone, partially occluded, and low-data regimes. These findings indicate that adaptive priors are most effective when image evidence underdetermines metric size, while atypical geometry or extreme visibility loss can still cause mismatch between routed priors and true instance geometry. Code, trained models, result logs, and reproducibility material are available at https://github.com/bigggs/MonoPRIO.

2605.14779 2026-05-15 cs.LG

Peng's Q(λ) for Conservative Value Estimation in Offline Reinforcement Learning

Byeongchan Kim, Min-hwan Oh

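Affiliations: Seoul National University

AI summary: This paper proposes Conservative Peng's Q(λ) (CPQL), a model-free offline multi-step reinforcement learning algorithm that adapts the Peng's Q(λ) operator for conservative value estimation in place of the Bellman operator. By using offline trajectories, it is presented as the first work to demonstrate, both theoretically and empirically, the effectiveness of multi-step conservative value estimation. The operator's fixed-point property naturally induces implicit behavior regularization, which mitigates over-pessimistic value estimates, yields performance at least as good as the behavior policy, and provides near-optimal performance guarantees. Experiments show that CPQL significantly outperforms existing offline single-step methods on the D4RL benchmark and is also useful in the offline-to-online setting.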

Comments Accepted in ICLR 2026

Abstract

We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q(λ) (CPQL). Our algorithm adapts the Peng's Q(λ) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with a multi-step operator by fully leveraging offline trajectories. The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, thereby naturally inducing implicit behavior regularization. CPQL simultaneously mitigates over-pessimistic value estimation, achieves performance greater than (or equal to) that of the behavior policy, and provides near-optimal performance guarantees, a milestone that previous conservative approaches could not achieve. Extensive numerical experiments on the D4RL benchmark demonstrate that CPQL consistently and significantly outperforms existing offline single-step baselines. In addition to the contributions of CPQL in offline RL, our proposed method also contributes to the offline-to-online learning framework. Using the Q-function pre-trained by CPQL in offline settings enables the online PQL agent to avoid the performance drop typically observed at the start of fine-tuning and to attain robust performance improvements. Our code is available at https://github.com/oh-lab/CPQL.
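The multi-step target behind the Peng's Q(λ) operator can be written recursively as G_t = r_t + γ[(1 − λ) max_a Q(s_{t+1}, a) + λ G_{t+1}]. The sketch below computes these targets along one logged trajectory. It covers only the plain PQL return; the conservative penalty that CPQL adds is not shown, and the function name and argument layout are hypothetical.

```python
def peng_q_lambda_targets(rewards, next_max_q, gamma, lam):
    """Compute Peng's Q(lambda) targets along one trajectory, back to front.

    rewards:    r_t for t = 0..T-1
    next_max_q: max_a Q(s_{t+1}, a) for each t (use 0.0 after a terminal step)
    gamma:      discount factor
    lam:        mixing weight; 0 gives one-step targets, 1 gives the full
                trajectory return with a single bootstrap at the end
    """
    targets = [0.0] * len(rewards)
    next_return = None
    for t in reversed(range(len(rewards))):
        if next_return is None:
            # Last logged step: no later return exists, so bootstrap fully.
            bootstrap = next_max_q[t]
        else:
            bootstrap = (1.0 - lam) * next_max_q[t] + lam * next_return
        targets[t] = rewards[t] + gamma * bootstrap
        next_return = targets[t]
    return targets
```

With `lam` close to 1 the target leans on rewards actually collected by the behavior policy rather than on bootstrapped maxima, which is consistent with the abstract's remark that the fixed point moves toward the behavior policy's value function.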

2605.14774 2026-05-15 cs.AI

Identifying Culprits Through Deep Deterministic Policy Gradient Deep Learning Investigation

Lata B T, Savitha N J

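Affiliations: Dept. of CSE, UVCE, Bengaluru, India

AI summary: This paper studies how deep reinforcement learning can improve the accuracy of identifying suspects in criminal investigations. The authors propose using the Deep Deterministic Policy Gradient (DDPG) algorithm, trained on datasets of crime scene material, witness statements, and suspect profiles, to improve identification efficiency and reduce misidentification. The reported identification accuracy is 95%, higher than several existing methods, and the work offers a new direction for applying AI in the judicial domain.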

Journal ref Mathematical Statistician and Engineering Applications, https://www.philstat.org/index.php/MSEA/article/view/2953, ISSN: 2094-0343

Abstract

In an era of AI and advanced investigative technologies, identifying a crime or a criminal remains a major problem. Conventional approaches to criminal investigation usually rely on limited data analysis. Finding an efficient method that identifies criminals from complex datasets while minimising false positives and false negatives is a challenge. This paper presents an approach based on the Deep Deterministic Policy Gradient (DDPG) deep learning algorithm. We train the DDPG model on a dataset of crime scene material, witness statements and suspect profiles. The algorithm uses features to maximise the likelihood of identifying the offender while minimising the impact of noise and irrelevant data. We show the efficacy of the proposed method: DDPG identified criminals with an accuracy of 95%, higher than several existing methods.

2605.14773 2026-05-15 cs.LG cs.AI

Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training

Suorong Yang, Hanqi Zhu, Hai Gan, Fangjian Su, Guang Li, Furao Shen, Soujanya Poria

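Affiliations * National University of Singapore; Nanjing University; Hokkaido University; Nanyang Technological University

AI summary This paper studies the efficient use of data selection in model training. It observes that existing methods focus on which samples to select but usually fix the data-volume ratio, creating a mismatch between dynamic selection and static data volume. Taking an optimization perspective, the authors propose PODS, a plug-and-play oscillatory data-volume scheduling framework that dynamically adjusts the selection ratio to strengthen regularization while maintaining optimization stability. Experiments show that PODS improves the trade-off between training efficiency and model performance across a variety of datasets and tasks.

Details
English abstract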

Data selection accelerates training by identifying representative training data while preserving model performance. However, existing methods mainly focus on designing sample-importance criteria, i.e., deciding what to select, while typically fixing the selected data volume as the target ratio throughout training. Thus, they are often dynamic in sample identity but static in data volume. In this work, we revisit data selection from an optimization perspective and show that selected-data training induces an implicit regularization effect modulated by the instantaneous selection ratio. This reveals a key trade-off: lower ratios amplify selection-induced regularization, whereas higher ratios preserve data coverage and optimization fidelity. Motivated by this insight, we propose PODS, a Plug-and-play Oscillatory Data-volume Scheduling framework. Rather than introducing another sample-scoring metric, PODS serves as a lightweight module that dynamically schedules how much data to select over training. Under the target selection ratio, PODS alternates between low-ratio regularization phases and high-ratio recovery phases to exploit selection-induced regularization without sacrificing optimization stability. With its lightweight, ratio-level, and task-agnostic design, PODS is compatible with existing static and dynamic selection methods and broadly applicable across training paradigms. Experiments across various datasets, architectures, and tasks show that PODS consistently improves the efficiency-generalization trade-off, e.g., reducing ImageNet-1k training cost by 50% with improved accuracy and accelerating LLM instruction tuning by over 2x without performance degradation.
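To make the idea of alternating low-ratio and high-ratio phases concrete, here is a minimal sketch of a ratio schedule whose average over one period equals the target selection ratio. The square-wave shape, the symmetric `amplitude`, and all names are assumptions chosen for illustration; the abstract does not specify the actual PODS schedule, which may differ.

```python
def oscillating_ratio(epoch, target=0.5, amplitude=0.2, period=4):
    """Selection ratio for a given epoch (hypothetical square-wave schedule).

    The first half of each period uses a low ratio (stronger
    selection-induced regularization); the second half uses a high ratio
    (better data coverage). The mean over a full period equals `target`.
    """
    if period % 2 != 0:
        raise ValueError("period must be even so the two phases balance")
    low, high = target - amplitude, target + amplitude
    if not (0.0 < low and high <= 1.0):
        raise ValueError("ratios must stay within (0, 1]")
    in_low_phase = (epoch % period) < period // 2
    return low if in_low_phase else high


# Two full periods: two low-ratio epochs followed by two high-ratio epochs.
schedule = [oscillating_ratio(e) for e in range(8)]
```

Because the schedule only outputs a ratio, it can sit in front of any sample-scoring method, which then decides which samples fill that budget.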

2605.14772 2026-05-15 cs.CV cs.GR cs.LG

BioHuman: Learning Biomechanical Human Representations from Video

Yujun Huo, He Zhang, Chentao Song, Honglin Song, Zongyu Zuo, Tao Yu

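Affiliations * Beihang University; Tsinghua University

AI summary This study aims to learn biomechanical representations of the human body from video, going beyond conventional kinematic analysis to capture internal biomechanical states such as muscle activity. The authors propose a simulation-based framework that estimates muscle activations from existing motion capture data, build BioHuman10M, a large-scale dataset with synchronized video, motion, and activation information, and design BioHuman, an end-to-end model that jointly predicts human motion and muscle activations from monocular video. Experiments show that the method performs well on motion reconstruction and muscle activity prediction and generalizes well, providing a new benchmark for video-based biomechanical understanding.

Details
English abstract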

Understanding human motion beyond surface kinematics is crucial for motion analysis, rehabilitation, and injury risk assessment. However, progress in this domain is limited by the lack of large-scale datasets with biomechanical annotations, and by existing approaches that cannot directly infer internal biomechanical states from visual observations. In this paper, we introduce a simulation-based framework for estimating muscle activations from existing motion capture datasets, resulting in BioHuman10M, a large-scale dataset with synchronized video, motion, and activations. Building on BioHuman10M, we propose BioHuman, an end-to-end model that takes monocular video as input and jointly predicts human motion and muscle activations, effectively bridging visual observations and internal biomechanical states. Extensive experiments demonstrate that BioHuman enables accurate reconstruction of both kinematic motion and muscle activity, and generalizes across diverse subjects and motions. We believe our approach establishes a new benchmark for video-based biomechanical understanding and opens up new possibilities for physically grounded human modeling.

2605.14771 2026-05-15 cs.AI

MediaClaw: Multimodal Intelligent-Agent Platform Technical Report

Shaoan Zhao, Huanlin Gao, Qiang Hui, Ting Lu, Xueqiang Guo, Yantao Li, Xinpei Su, Fuyuan Shi, Chao Tan, Fang Zhao, Kai Wang, Shiguo Lian

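Affiliations * China Unicom AI (Yuanjing) Team

AI summary MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. It targets practical deployment problems in AIGC applications, such as fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high-quality workflows. The platform adopts a three-layer architecture of unified abstraction, plugin-based extension, and workflow orchestration, abstracts full-category AIGC capabilities into a unified invocation model, and uses task-oriented skill modules to make complex production processes reusable. The report focuses on MediaClaw's architectural design philosophy, the design logic of its core capability model, and key engineering trade-offs, offering a reusable practical reference for building multimodal capability platforms.

Details
English abstract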

MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. Its core design follows a three-layer architecture of unified abstraction, pluginized extension, and workflow orchestration. The system is intended to address practical deployment pain points in AIGC adoption, including fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high-quality production workflows. MediaClaw abstracts full-category AIGC capabilities into a unified invocation model, uses plugins to support hot-pluggable capability expansion, and uses task-oriented Skills to turn complex production processes into reusable workflow assets. This report focuses on the architectural design philosophy of MediaClaw, the design logic of its core capability model, and the key engineering trade-offs in implementation. It aims to provide reusable practical reference for building multimodal capability platforms.

2605.14766 2026-05-15 cs.CL cs.AI eess.AS

Streaming Speech-to-Text Translation with a SpeechLLM

Titouan Parcollet, Shucong Zhang, Xianrui Zheng, Rogier C. van Dalen

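Affiliations * Samsung, AI Center – Cambridge, United Kingdom

AI summary This paper proposes a real-time streaming speech-to-text translation system based on a large language model (LLM), addressing the slow response of existing SpeechLLM systems in practical use. The model learns not only to generate translated text but also to judge whether it has received enough audio to produce output, enabling more efficient streaming. Experiments show that the system keeps translation quality close to the non-streaming baseline while reducing latency to 1-2 seconds.

Comments 9 pages of main text; 24 pages in total

Details
English abstract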

Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the speech and to reduce cascaded errors. But existing SpeechLLM systems are slow since they do not work in a real streaming fashion: they wait for a complete utterance of audio before outputting a translation, or output tokens at fixed intervals, which is not suitable for real applications. This work proposes an LLM-based architecture for real streaming speech-to-text translation. The LLM learns not just to emit output tokens, but also to decide whether it has seen enough audio to do so. The system is trained using automatic alignments of the input speech and the output text. In experiments on different language pairs, the system achieves a translation quality close to the non-streaming baseline, but with a latency of only 1-2 seconds.
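The control flow implied by "decide whether it has seen enough audio" can be sketched as a loop that interleaves reading audio chunks with writing output tokens. Everything here is hypothetical: `step` stands in for the SpeechLLM, `WAIT` and `"<eos>"` are made-up sentinels, and `toy_step` is a trivial rule used only to exercise the loop, not the learned policy from the paper.

```python
WAIT = None  # sentinel: the model asks for more audio


def stream_translate(audio_chunks, step):
    """Interleave reading audio and writing tokens (illustrative sketch).

    `step(audio_so_far, tokens_so_far, audio_finished)` is a hypothetical
    stand-in for the SpeechLLM: it returns the next output token, WAIT to
    request more audio, or "<eos>" to stop.
    """
    audio, tokens = [], []
    chunks = iter(audio_chunks)
    finished = False
    while True:
        out = step(audio, tokens, finished)
        if out == "<eos>":
            return tokens
        if out is WAIT:
            if finished:
                return tokens  # nothing left to read; stop rather than loop forever
            nxt = next(chunks, None)
            if nxt is None:
                finished = True
            else:
                audio.append(nxt)
        else:
            tokens.append(out)


def toy_step(audio, tokens, finished):
    """Toy rule: emit one token for every two audio chunks received."""
    if len(tokens) < len(audio) // 2:
        return f"w{len(tokens)}"
    return "<eos>" if finished else WAIT
```

In this toy run, tokens appear as soon as two chunks have arrived rather than after the full utterance, which is the latency behaviour a streaming policy is meant to provide.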

2605.14765 2026-05-15 cs.SD cs.CL

Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music

Mohammad Hossein Sameti, Diba Hadi Esfangereh, Sepehr Harfi Moridani, Leili Javidpour, Mahdieh Soleymani Baghshah

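Affiliations * Sharif University of Technology; Independent Researcher

AI summary To address the lack of generative models for Persian music, this study builds a large-scale Persian music dataset containing more than 900 hours of high-quality audio across pop, traditional, and contemporary styles. The state-of-the-art generative model MusicGen is fine-tuned on this dataset so that it better matches the modes, rhythms, and cultural characteristics of Persian music, and its performance is evaluated with subjective and objective metrics. The work provides a new resource for Persian music generation research and demonstrates the potential of music generation models to adapt to underrepresented cultural contexts.

Comments 9 pages, 2 figures, 3 tables

Details
English abstract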

Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours of high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance using subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that align more closely with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.

2605.14764 2026-05-15 cs.LG cs.AI

Compositional Sparsity as an Inductive Bias for Neural Architecture Design

Hongyu Lin, Antonio Briola, Yuanrong Wang, Tomaso Aste

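Affiliations * Department of Computer Science, University College London, London, United Kingdom

AI summary This paper studies how structural priors allow deep neural networks to overcome the curse of dimensionality, and proposes compositional sparsity as an inductive bias. The authors combine Information Filtering Networks (IFNs) with Homological Neural Networks (HNNs) to build an interpretable neural network design framework in which abstract representations emerge through hierarchical composition. Experiments show that HNNs, with far fewer parameters than conventional deep networks, accurately recover sparse structure on synthetic tasks and achieve better performance and stability on several real-world datasets.

Details
English abstract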

Identifying the structural priors that enable Deep Neural Networks (DNNs) to overcome the curse of dimensionality is a fundamental challenge in machine learning theory. Existing literature suggests that effective high-dimensional learning is driven by compositional sparsity, where target functions decompose into constituents supported on low-dimensional variable subsets. To investigate this hypothesis, we combine Information Filtering Networks (IFNs), which extract sparse dependency structures via constrained information maximisation, with Homological Neural Networks (HNNs), which map the inferred topology into fixed-wiring sparse neural graphs. We formalise the design principles underlying this construction and present an interpretable pipeline in which abstraction emerges through hierarchical composition. HNNs are orders of magnitude sparser than standard DNNs and require only minimal hyperparameter tuning. On synthetic tasks with known sparse hierarchies, HNNs recover the underlying compositional structure and remain stable in regimes where dense alternatives degrade as dimensionality increases. Across a broad suite of real-world datasets, HNNs consistently match or outperform dense baselines while using far fewer parameters, exhibiting lower variance and showing reduced sensitivity to hyperparameters.

2605.14761 2026-05-15 cs.AI cs.HC

AI Outperforms Humans in Personalized Image Aesthetics Assessment via LLM-Based Interviews and Semantic Feature Extraction

Yoshia Abe, Tatsuya Daikoku, Yasuo Kuniyoshi

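Affiliations * Graduate School of Information Science and Technology, The University of Tokyo, Bunkyo-ku, Tokyo

AI summary This study addresses the fundamental challenge of having AI accurately predict an individual's aesthetic evaluation of images. It proposes an integrated system combining deep learning and large language models (LLMs) that actively elicits a user's aesthetic preferences through LLM-based semi-structured interviews and makes predictions using both low-level and high-level semantic image features. Experiments show that the system outperforms conventional models, human predictors, and the user's own re-evaluation after a time interval, with especially strong results on highly rated images, suggesting that AI may capture individual aesthetic preferences better than humans do.

Comments 25 pages, 13 figures

Details
English abstract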

Accurately predicting individual aesthetic evaluation for images is a fundamental challenge for AI. Various deep learning (DL)-based models have been proposed for this task, training on image evaluation data to extract objective low-level features. However, aesthetic preferences are inherently subjective and individual-dependent. Accurate prediction thus requires the extraction of high-level semantic features of images and the active collection of preference information from the target individual. To address this issue, we focus on the utility of Large Language Models (LLMs) pretrained on vast amounts of textual data, and develop an integrated DL-LLM system. The system actively elicits aesthetic preferences through LLM-based semi-structured interviews and predicts aesthetic evaluation by leveraging both low-level and high-level features. In our experiments, we compare the proposed system against conventional systems, human predictors, and the target individual's own re-evaluations after a certain time interval. Our results show that the proposed system outperforms all of them, with particularly strong performance on highly-rated images. Moreover, the prediction error of the proposed system is smaller than within-person variability, while human predictors show the largest error, likely due to the influence of their own aesthetic values. These results suggest that AI may be better positioned than others or one's future self to capture individual aesthetic preferences at a given point. This opens a new question of whether AI could serve as a deeper interpreter of human aesthetic sensibility than humans themselves.

2605.14758 2026-05-15 cs.AI

Probabilistic Verification of Recurrent Neural Networks for Single and Multi-Agent Reinforcement Learning

Luca Marzari, Enrico Marchesini

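Affiliations * TU Wien; Massachusetts Institute of Technology

AI summary This paper studies probabilistic verification of recurrent neural network (RNN) policies in partially observable reinforcement learning. Existing tools for verifying RNN policies rely on restrictive assumptions or coarse approximations and therefore give overly conservative results. The authors propose RNN-ProVe, a probabilistic verification framework that uses policy-driven sampling to estimate the probability of undesired behaviors over the hidden state space reachable under the policy, and derives statistical error bounds to provide high-confidence verification results. Experiments show that the method yields more quantitative, feasibility-aware probabilistic guarantees on single-agent and multi-agent tasks.

Comments Accepted at the 35th International Joint Conference on Artificial Intelligence (IJCAI) 2026

Details
English abstract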

History-dependent policies induced by recurrent neural networks (RNNs) rely on latent hidden state dynamics, making verification in partially observable reinforcement learning (RL) challenging. Existing RNN verification tools typically rely on restrictive modeling assumptions or coarse over-approximations of the hidden state space, which can lead to overly conservative or inconclusive results. We propose RNN Probabilistic Verification (RNN-ProVe), a probabilistic framework that estimates the likelihood of undesired behaviors in RNN-based policies. RNN-ProVe uses policy-driven sampling to approximate the set of hidden states that are feasible under a trained policy, and derives statistical error bounds to produce bounded-error, high-confidence estimates of behavioral violations. Experiments on partially observable single-agent and cooperative multi-agent tasks show that RNN-ProVe yields more quantitative, feasibility-aware probabilistic guarantees than existing tools, while scaling to recurrent and multi-agent settings.
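One generic way to obtain a bounded-error, high-confidence estimate from sampling is a Monte Carlo estimate paired with Hoeffding's inequality. The sketch below uses that textbook bound as a stand-in; the abstract does not say which bound RNN-ProVe derives, and `sample_is_violation` is a hypothetical callable representing one policy rollout.

```python
import math


def samples_needed(eps, delta):
    """Samples so that |estimate - true rate| <= eps with prob. >= 1 - delta
    (two-sided Hoeffding bound for i.i.d. 0/1 outcomes)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_violation_rate(sample_is_violation, eps=0.05, delta=0.01):
    """Monte Carlo estimate of a violation probability.

    `sample_is_violation()` is a hypothetical callable that rolls out the
    policy once and returns True if the undesired behavior occurred.
    Returns the estimated rate and the number of rollouts used.
    """
    n = samples_needed(eps, delta)
    hits = sum(1 for _ in range(n) if sample_is_violation())
    return hits / n, n
```

The required sample count depends only on the tolerance `eps` and confidence `delta`, not on the size of the hidden state space, which is why sampling-based estimates can scale where exhaustive over-approximation struggles.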

2605.14754 2026-05-15 cs.AI

XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition

Gong Zhiren, Tiantong Wu, Jiaming Zhang, Fuyao Zhang, Che Wang, Yurong Hao, Yikun Hou, Foo Ping, Yilei Zhao, Fei Huang, Chau Yuen, Wei Yang Bryan Lim

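Affiliations * Alibaba Group, China

AI summary XDomainBench is a diagnostic benchmark for identifying reasoning collapse in large language models during high-dimensional scientific knowledge composition. By systematically designing discipline combinations and task difficulty, the study shows that model reasoning ability drops markedly as the complexity of knowledge composition increases. It finds that reasoning collapse stems mainly from the added difficulty of combining disciplines and from error accumulation and domain confusion during interaction, offering a new perspective and experimental framework for evaluating models on scientific knowledge synthesis.

Details
English abstract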

Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization in scientific knowledge remains under-characterized. Existing benchmarks primarily focus on single-turn restricted scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize the composition order and mixture structure to enable systematic stress-testing from single-discipline to inter-disciplinary, comprising 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics, simulating real AI4S scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct difficulty increases induced by domain composition, and (ii) indirect interaction-amplified failures where trajectory patterns trigger error accumulation, reasoning breaks, and domain confusion, ultimately leading to session collapse.

2605.14752 2026-05-15 cs.LG cs.AI

Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions

Qirui Liu, Hao Chen, Weijie Shi, Jiajie Xu, Jia Zhu

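Affiliations * South China University of Technology; Tencent Financial Technology; The Hong Kong University of Science and Technology; Soochow University; Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University

AI summary This study aims to accurately identify student misconceptions in support of personalized education. To address data scarcity, heavy annotation noise, and deployment constraints, it proposes a two-stage knowledge distillation framework guided by cognitive uncertainty. The method mines high-value samples from existing data, identifies critical samples using the teacher model's uncertainty and confidence differences, and designs a difficulty-adaptive mechanism so that the student model inherits inter-class relationships and distinguishes ambiguous error types. Experiments show that the method substantially improves classification performance when trained on a small amount of data, outperforming current state-of-the-art models.

Comments ACL 2026 Findings. 10 pages, 5 figures, 19 tables

Details
English abstract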

Accurately identifying student misconceptions is crucial for personalized education but faces three challenges: (1) data scarcity with a long-tail distribution, where authentic student reasoning is difficult to synthesize; (2) fuzzy boundaries between error categories with high annotation noise; (3) a deployment paradox: large models overlook unconventional approaches due to pretraining bias and cannot be deployed on edge devices, while small models overfit to noise. Unlike traditional methods that increase diversity through large-scale data synthesis, we propose a two-stage knowledge distillation framework that mines high-value samples from existing data. The first stage performs standard distillation to transfer task capabilities. The second stage introduces a dual-layer marginal selection mechanism based on cognitive uncertainty, identifying four types of critical samples based on teacher model uncertainty and confidence differences. For different data subsets, we design a difficulty-adaptive mechanism to balance hard/soft label contributions, enabling student models to inherit inter-class relationships from teacher soft labels while distinguishing ambiguous error types. Experiments show that with augmented training on only 10.30% of filtered samples, we achieve MAP@3 of 0.9585 (+17.8%) on the MAP-Charting dataset, and using only a 4B parameter model, we attain 84.38% accuracy on cross-topic tests of middle school algebra misconception benchmarks, significantly outperforming the SOTA LLM (67.73%) and standard fine-tuned 72B models (81.25%). Our code is available at https://github.com/RoschildRui/acl2026_map.
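For readers unfamiliar with the MAP@3 metric reported in this abstract, the following sketch shows how it is commonly computed when each item has exactly one correct label. The single-label assumption, the function name, and the example labels are illustrative and are not taken from the paper or its dataset.

```python
def map_at_3(ranked_predictions, true_labels):
    """Mean average precision at 3, assuming one true label per item:
    an item scores 1, 1/2, or 1/3 if the label is ranked 1st, 2nd, or 3rd,
    and 0 otherwise."""
    total = 0.0
    for preds, truth in zip(ranked_predictions, true_labels):
        # Only the top three ranked guesses can earn credit.
        for rank, label in enumerate(preds[:3], start=1):
            if label == truth:
                total += 1.0 / rank
                break
    return total / len(true_labels)


# Made-up example: hits at rank 1, rank 2, and a miss.
example_score = map_at_3(
    [["a", "b", "c"], ["b", "a", "c"], ["x", "y", "z"]],
    ["a", "a", "a"],
)
```

Under this definition, a score near 1 means the correct misconception label is almost always the model's first guess.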

2605.14749 2026-05-15 cs.CL cs.AI cs.LG

Non-linear Interventions on Large Language Models

Sangwoo Kim

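Affiliations * Department of Linguistics, Seoul National University, Republic of Korea

AI summary This paper studies how to intervene on non-linearly represented features in large language models, going beyond the limits of existing linear intervention methods. The author proposes a general intervention framework for non-linear features together with a learning procedure that enables intervention on implicit features lacking a direct output signal. Experiments show that the method outperforms conventional linear methods on refusal bypass steering, with more precise intervention effects.

Details
English abstract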

Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on implicit features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.

2605.14747 2026-05-15 cs.CL cs.AI cs.CV cs.LG

Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, Hao Tian

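Affiliations * National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; The University of Hong Kong; Renmin University of China

AI summary This paper proposes Video2GUI, a fully automated framework that extracts structured GUI interaction trajectories from unlabeled Internet videos, addressing the small scale and narrow domain coverage of current pretraining data for GUI agents. The method uses a coarse-to-fine filtering strategy to select high-quality GUI tutorial videos and convert them into trajectories usable for training, producing WildGUI, a large-scale dataset of 12 million trajectories covering more than 1,500 applications and websites. Models pretrained on this dataset improve by 5-20% on multiple GUI grounding and action benchmarks, matching or surpassing the previous best results.

Comments Accepted at ICML 2026

Details
English abstract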

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research on GUI agents.

2605.14746 2026-05-15 cs.LG

Selective Safety Steering via Value-Filtered Decoding

Bat-Sheva Einbinder, Hen Davidov, Yee Whye Teh, Yarin Gal, Yaniv Romano

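Affiliations * Department of Electrical and Computer Engineering, Technion IIT; Department of Statistics, University of Oxford; OATML, Department of Computer Science, University of Oxford; Department of Computer Science, Technion IIT

AI summary This paper studies how to selectively steer large language models at decoding time using a safety reward, reducing unnecessary safety interventions while improving output safety. It proposes a value-filtered decoding method that uses a threshold to explicitly control the probability of false interventions, achieving a better balance between safety and the model's original behavior. Experiments show that the method outperforms existing approaches on multiple datasets, with a better trade-off between safety and output quality.

Details
English abstract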

While large language models (LLMs) are trained to align with human values, their generations may still violate safety constraints. A growing line of work addresses this problem by modifying the model's sampling policy at decoding time using a safety reward. However, existing decoding-time steering methods often intervene unnecessarily, modifying generations that would have been safe under the base model. Such unnecessary interventions are undesirable, as they can distort key properties of the base model such as helpfulness, fluency, style, and coherence. We propose a new test-time steering method designed to reduce such unnecessary interventions while improving the safety of unsafe responses. Our approach filters tokens using a value-based safety criterion and provides an explicit bound on the probability of false interventions. A single threshold hyperparameter controls this bound, allowing practitioners to trade off higher rates of unnecessary intervention for better output safety. Across multiple datasets and experiments, we show that our value-filtered decoding method outperforms existing baselines, achieving better trade-offs between safety, helpfulness, and similarity to the base model.
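The general shape of threshold-based token filtering can be sketched as follows: tokens whose estimated safety value falls below a threshold are removed, and the base-model probabilities of the remaining tokens are renormalized. The value estimates, the dictionary representation, and the fallback when no token passes are all assumptions for illustration; the paper's actual criterion and its bound on false interventions are not reproduced here.

```python
def filter_and_renormalize(probs, values, threshold):
    """Drop tokens whose estimated safety value is below `threshold`,
    then renormalize the base-model probabilities over the survivors.

    probs  : dict token -> base-model probability
    values : dict token -> estimated safety value (hypothetical value head)
    If no token passes, the base distribution is returned unchanged
    (an assumed fallback, chosen here only to keep the sketch total).
    """
    kept = {tok: p for tok, p in probs.items() if values[tok] >= threshold}
    total = sum(kept.values())
    if total == 0.0:
        return dict(probs)
    return {tok: p / total for tok, p in kept.items()}
```

When every token clears the threshold, the output equals the base distribution, so generations that were already safe are left untouched, which is the selective behaviour the abstract emphasizes.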

2605.14744 2026-05-15 cs.CL cs.AI cs.CY

Mechanical Enforcement for LLM Governance: Evidence of Governance-Task Decoupling in Financial Decision Systems

José Manuel de la Chica Rodríguez, Carlos Martí-González

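Affiliations * Santander AI Lab

AI summary This study examines how large language models (LLMs) in regulated financial decision systems are governed through natural-language policies, noting that current evaluation focuses only on task accuracy and ignores whether governance constrains the decision rationale. It proposes five metrics of governance compliance and introduces four mechanical enforcement primitives that operate outside the model's interpretive loop, substantially improving the information content of decisions and task accuracy. Experiments show that mechanical enforcement sharply reduces the share of uninformative decisions and confirm a governance-task decoupling: under system stress, governance quality can be preserved independently of task performance.

Details
English abstract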

Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal-agent failure: outputs can appear compliant without being compliant. Existing evaluation measures task accuracy but not whether governance constrains behaviour at the decision rationale level, where regulated decisions must be auditable. We introduce five governance metrics that quantify policy compliance at the rationale level and apply them in a synthetic banking domain to compare text-only governance against mechanical enforcement: four primitives operating outside the model's interpretive loop. Under text-only governance, 27% of deferrals carry no decision-relevant information. Mechanical enforcement reduces this rate by 73%, more than doubles deferral information content, and raises task accuracy from MCC 0.43 to 0.88. The improvement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance; the gain comes from removing clear-cut decisions from the model's control. A causal ablation confirms that each primitive is individually necessary. Our central finding is a governance-task decoupling: under structural stress, text-only governance degrades on both dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. This implies that governance and task evaluation are distinct axes: accuracy is not a sufficient proxy for governance in regulated AI systems.
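Task accuracy in this abstract is reported as MCC (Matthews correlation coefficient). As a reference for how that score is computed from a binary confusion matrix, here is a minimal sketch; the function name and the convention of returning 0.0 for an empty marginal are choices made for illustration, and no counts from the paper are used.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when any marginal is empty (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), so a move from 0.43 to 0.88 indicates a much stronger agreement between decisions and ground truth.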