arXivDaily: arXiv daily academic digest, updated Monday through Friday
2411.04077 2026-05-12 cs.CV

H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models

Nhi Pham, Michael Schott

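Affiliations Max Planck Institute for Informatics; Saarland University; Zuse School

AI summary This paper proposes H-POPE, a benchmark based on hierarchical polling evaluation, for systematically assessing hallucination in large vision-language models at the level of object existence and attributes. The evaluation proceeds through a coarse-to-fine hierarchy and shows that models hallucinate more readily on fine-grained attributes. The study further examines whether models rely on visual input when generating text, offering a new perspective on the generation mechanisms of vision-language models.

Comments Poster at https://sites.google.com/berkeley.edu/bb-stat/home

Abstract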

By leveraging both texts and images, large vision language models (LVLMs) have shown significant progress in various multi-modal tasks. Nevertheless, these models often suffer from hallucinations, e.g., they exhibit inconsistencies between the visual input and the textual output. To address this, we propose H-POPE, a coarse-to-fine-grained benchmark that systematically assesses hallucination in object existence and attributes. Our evaluation shows that models are prone to hallucinations on object existence, and even more so on fine-grained attributes. We further investigate whether these models rely on visual input to formulate the output texts.
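
Polling-based probing asks a model yes/no questions and scores the answers against ground truth. The sketch below shows how such answers are commonly aggregated into accuracy, precision, recall, F1, and the share of "yes" answers. It is an illustration only: the `PollResult` type and `polling_metrics` function are hypothetical names, and the abstract does not state which metrics H-POPE actually reports.

```python
from dataclasses import dataclass


@dataclass
class PollResult:
    """One yes/no probe (hypothetical container, not from the paper)."""
    truth: bool      # True if the object/attribute is really in the image
    predicted: bool  # True if the model answered "yes"


def polling_metrics(results):
    """Aggregate yes/no probes into standard binary-classification metrics."""
    tp = sum(r.truth and r.predicted for r in results)
    fp = sum((not r.truth) and r.predicted for r in results)
    fn = sum(r.truth and (not r.predicted) for r in results)
    tn = sum((not r.truth) and (not r.predicted) for r in results)
    total = len(results)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A high yes-ratio on negative probes is a typical sign of hallucination.
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }
```

Running the same function separately on existence probes and on attribute probes would give the coarse-versus-fine comparison the abstract describes.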

2410.10247 2026-05-12 cs.CV cs.AI

LPT: Less-overfitting Prompt Tuning for Vision-Language Model

Chenhao Ding, Xinyuan Gao, Songlin Dong, Jizhou Han, Qiang Wang, Zhengdong Zhou, Yuhang He, Yihong Gong

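Affiliations IEEE

AI summary This study targets the overfitting that vision-language models are prone to during transfer and proposes LPT, a lightweight prompt-tuning framework. Its core method uses CLIP to filter fine-grained foreground information so that prompts are guided by basic visual concepts, and adds a feature-level structure-preservation constraint and an output-level hierarchical logit constraint to strengthen generalization. Experiments show that LPT significantly improves generalization on multiple benchmark tasks and effectively alleviates overfitting.

Abstract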

Vision-language models (VLMs) have demonstrated exceptional generalization capabilities for downstream tasks. Due to its efficiency, prompt learning has gradually become a more effective and efficient method for transferring VLMs to downstream tasks, surpassing traditional finetuning methods. However, during the transfer process, these models are prone to severe overfitting, leading to a significant decline in generalization ability. To address this issue, we propose a framework named LPT, specifically designed for vision-language models. Specifically, we use CLIP to filter out fine-grained foreground information that may lead to overfitting, thereby guiding the prompts with basic visual concepts. Additionally, to further mitigate overfitting, we have developed a Structural Preservation (SP) constraint at the feature level, which aligns the model's overall feature space structure with the frozen CLIP, endowing the feature space with overall plasticity and enabling effective reshaping of the feature space during optimization. Moreover, we employ Hierarchical Logit (HL) constraint at the output layer to constrain the overall class information in the output, complementing the role of SP at the output end. Extensive experiments across various benchmarks (from base-to-novel, cross-dataset transfer, and domain generalization) demonstrate that our approach significantly improves generalization capability and effectively alleviates overfitting compared to state-of-the-art methods.

2408.17366 2026-05-12 cs.LG cs.AI

Leveraging Graph Neural Networks to Forecast Electricity Consumption

Eloi Campagne, Yvenn Amara-Ouali, Yannig Goude, Argyris Kalogeratos

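Affiliations Centre Borelli, Université Paris-Saclay, CNRS, Ecole Normale Supérieure Paris-Saclay; Laboratoire de Mathématiques d’Orsay (LMO), Université Paris-Saclay, CNRS, Faculté des Sciences d’Orsay; EDF R&D, Palaiseau – France

AI summary This paper studies how to use graph neural networks for electricity demand forecasting, in response to the complexity and uncertainty introduced by renewable energy integration and decentralized grids. It proposes a graph-based approach that captures the spatial distribution and relationships among network nodes, and considers models including Graph Convolutional Networks and Graph SAGE. The approach extends beyond the conventional Generalized Additive Model framework and provides a framework for building graph models and evaluating their performance and explainability, with experiments on synthetic data and on real data from the French mainland regions.

Comments 17 pages, ECML PKDD 2024 Workshop paper

Abstract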

Accurate electricity demand forecasting is essential for several reasons, especially as the integration of renewable energy sources and the transition to a decentralized network paradigm introduce greater complexity and uncertainty. The proposed methodology leverages graph-based representations to effectively capture the spatial distribution and relational intricacies inherent in this decentralized network structure. This research work offers a novel approach that extends beyond the conventional Generalized Additive Model framework by considering models like Graph Convolutional Networks or Graph SAGE. These graph-based models enable the incorporation of various levels of interconnectedness and information sharing among nodes, where each node corresponds to the combined load (i.e. consumption) of a subset of consumers (e.g. the regions of a country). More specifically, we introduce a range of methods for inferring graphs tailored to consumption forecasting, along with a framework for evaluating the developed models in terms of both performance and explainability. We conduct experiments on electricity forecasting, in both a synthetic and a real framework considering the French mainland regions, and the performance and merits of our approach are discussed.
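
To make the idea of information sharing among nodes concrete, here is a minimal sketch of a single graph-convolution step in which each node (for example, a region) mixes its own features with those of its neighbours. It uses the widely used symmetric normalization with self-loops; this is an assumption for illustration, not necessarily the layer used in the paper, and `gcn_layer` with its arguments is a hypothetical name.

```python
import numpy as np


def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adjacency: (n, n) symmetric 0/1 matrix linking nodes (e.g. regions)
    features:  (n, f) per-node inputs (e.g. lagged load, temperature)
    weights:   (f, h) learnable projection (fixed here for illustration)
    """
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    deg = a_hat.sum(axis=1)                         # degree including self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization keeps feature scales comparable across nodes.
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)
```

With two connected nodes holding features 1 and 3 and an identity weight, both outputs become 2: each node's value is averaged with its neighbour's.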

2407.07639 2026-05-12 cs.LG cs.AI

Explaining Graph Neural Networks for Node Similarity on Graphs

Daniel Daza, Cuong Xuan Chu, Trung-Kien Tran, Daria Stepanova, Michael Cochez, Paul Groth

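Affiliations Vrije Universiteit Amsterdam; Bosch Center for Artificial Intelligence; Abo Akademi University; Elsevier discovery lab; University of Amsterdam

AI summary This paper studies how to provide explanations for GNN-based node similarity computation, to make similarity search over graph data easier to understand. The authors compare two prominent explanation approaches, mutual information (MI) and gradient-based (GB) explanations, and find that gradient-based explanations have three important advantages: they are actionable, they are consistent, and they can be pruned substantially into sparse explanations without affecting similarity scores. The study offers empirical analysis and guidance for GNN explainability.

Comments Accepted in Transactions of Machine Learning Research (2026)

Abstract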

Similarity search is a fundamental task for exploiting information in various applications dealing with graph data, such as citation networks or knowledge graphs. While this task has been intensively approached from heuristics to graph embeddings and graph neural networks (GNNs), providing explanations for similarity has received less attention. In this work we are concerned with explainable similarity search over graphs, by investigating how GNN-based methods for computing node similarities can be augmented with explanations. Specifically, we evaluate the performance of two prominent approaches towards explanations in GNNs, based on the concepts of mutual information (MI), and gradient-based explanations (GB). We discuss their suitability and empirically validate the properties of their explanations over different popular graph benchmarks. We find that unlike MI explanations, gradient-based explanations have three desirable properties. First, they are actionable: selecting inputs depending on them results in predictable changes in similarity scores. Second, they are consistent: the effect of selecting certain inputs overlaps very little with the effect of discarding them. Third, they can be pruned significantly to obtain sparse explanations that retain the effect on similarity scores.

2406.19741 2026-05-12 cs.RO cs.AI

ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning

Christopher E. Mower, Yuhui Wan, Hongzhan Yu, Antoine Grosnit, Jonas Gonzalez-Billandon, Matthieu Zimmer, Jinlong Wang, Xinyu Zhang, Yao Zhao, Anbang Zhai, Puze Liu, Daniel Palenicek, Davide Tateo, Cesar Cadena, Marco Hutter, Jan Peters, Guangjian Tian, Yuzheng Zhuang, Kun Shao, Xingyue Quan, Jianye Hao, Jun Wang, Haitham Bou-Ammar

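Affiliations Huawei Noah’s Ark Lab; University of Leeds; Technical University of Darmstadt; East China Normal University; Huawei Technologies; ETH Zurich; University College London

AI summary This paper presents ROS-LLM, a framework that lets non-expert users program robots intuitively through natural language instructions by combining the Robot Operating System (ROS) with large language models (LLMs). The system automatically extracts behaviors from LLM output and executes ROS actions, offers several behavior modes, extends the robot's action library through imitation learning, and uses human and environment feedback for LLM reflection. Experiments show robustness, scalability, and versatility across a range of complex scenarios, and the code has been open-sourced for use and reproduction.

Comments This document contains 26 pages and 13 figures

Journal ref Nature Machine Intelligence 8, 313-325 (2026)

Abstract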

We present a framework for intuitive robot programming by non-experts, leveraging natural language prompts and contextual information from the Robot Operating System (ROS). Our system integrates large language models (LLMs), enabling non-experts to articulate task requirements to the system through a chat interface. Key features of the framework include: integration of ROS with an AI agent connected to a plethora of open-source and commercial LLMs, automatic extraction of a behavior from the LLM output and execution of ROS actions/services, support for three behavior modes (sequence, behavior tree, state machine), imitation learning for adding new robot actions to the library of possible actions, and LLM reflection via human and environment feedback. Extensive experiments validate the framework, showcasing robustness, scalability, and versatility in diverse scenarios, including long-horizon tasks, tabletop rearrangements, and remote supervisory control. To facilitate the adoption of our framework and support the reproduction of our results, we have made our code open-source. You can access it at: https://github.com/huawei-noah/HEBO/tree/master/ROSLLM.

2406.12910 2026-05-12 cs.LG cs.AI cs.NE physics.chem-ph q-bio.BM

Human-level molecular optimization driven by mol-gene evolution

Jiebin Fang, Churu Mao, Yuchen Zhu, Xiaoming Chen, Chang-Yu Hsieh, Zhongjun Ma

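Affiliations Hainan Institute of Zhejiang University; Institute of Marine Biology and Pharmacology, Ocean College, Zhejiang University; College of Pharmaceutical Sciences and Cancer Center, Zhejiang University

AI summary This study proposes the Deep Genetic Molecular Modification Algorithm (DGMM) to address the balance between structural novelty and pharmacological properties in drug molecule optimization. A discrete variational autoencoder (D-VAE) encodes molecules as quantized codes called mol-genes, combining deep learning with genetic algorithms to optimize molecular structures in the manner of a medicinal chemist. The method can discover compounds that are pharmacologically similar but structurally distinct, reveals the trade-offs of structural optimization in drug discovery, and is shown to be effective in several applications.

Abstract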

De novo molecule generation allows the search for more drug-like hits across a vast chemical space. However, lead optimization is still required, and the process of optimizing molecular structures faces the challenge of balancing structural novelty with pharmacological properties. This study introduces the Deep Genetic Molecular Modification Algorithm (DGMM), which brings structure modification to the level of medicinal chemists. A discrete variational autoencoder (D-VAE) is used in DGMM to encode molecules as quantization code, mol-gene, which incorporates deep learning into genetic algorithms for flexible structural optimization. The mol-gene allows for the discovery of pharmacologically similar but structurally distinct compounds, and reveals the trade-offs of structural optimization in drug discovery. We demonstrate the effectiveness of the DGMM in several applications.

2405.17642 2026-05-12 cs.LG cs.AI stat.ME

Unifying Perspectives: Plausible Counterfactual Explanations on Global, Group-wise, and Local Levels

Oleksii Furman, Patryk Wielopolski, Łukasz Lenkiewicz, Jerzy Stefanowski, Maciej Zięba

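Affiliations Wrocław University of Science and Technology; Poznań University of Technology; Tooploox Sp. z o.o.

AI summary As AI systems grow more complex, the need for explainability becomes more pressing. This paper proposes a unified gradient-based optimization method that generates local, global, and group-wise counterfactual explanations, addressing the lack of integration across granularity levels in existing methods. By combining instance grouping and counterfactual generation into a single efficient process and introducing plausibility criteria, it improves the plausibility and practicality of group-wise counterfactuals. Experiments confirm a good balance among validity, proximity, and plausibility.

Abstract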

The growing complexity of AI systems has intensified the need for transparency through Explainable AI (XAI). Counterfactual explanations (CFs) offer actionable "what-if" scenarios on three levels: Local CFs providing instance-specific insights, Global CFs addressing broader trends, and Group-wise CFs (GWCFs) striking a balance and revealing patterns within cohesive groups. Despite the availability of methods for each granularity level, the field lacks a unified method that integrates these complementary approaches. We address this limitation by proposing a gradient-based optimization method for differentiable models that generates Local, Global, and Group-wise Counterfactual Explanations in a unified manner. We especially enhance GWCF generation by combining instance grouping and counterfactual generation into a single efficient process, replacing traditional two-step methods. Moreover, to ensure trustworthiness, we innovatively introduce the integration of plausibility criteria into the GWCF domain, making explanations both valid and realistic. Our results demonstrate the method's effectiveness in balancing validity, proximity, and plausibility while optimizing group granularity, with practical utility validated through practical use cases.

2405.12969 2026-05-12 cs.LG

EchoAlign: Bridging Generative and Discriminative Learning under Noisy Labels

Yuxiang Zheng, Zhongyi Han, Yilong Yin

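Affiliations School of Software, Shandong University, Jinan 250100, China; Sydney AI Centre, The University of Sydney, Sydney, NSW 2050, Australia

AI summary This paper proposes EchoAlign, a new framework that bridges generative and discriminative learning under noisy labels. Rather than correcting labels directly, the method uses generative models to adjust instance features so that they align with the noisy labels, and uses feature similarity to select reliable samples, improving model robustness. Experiments show that EchoAlign outperforms existing methods on several benchmark datasets, with stronger performance and stability in high-noise settings in particular.

Comments 27 pages, 7 figures. The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: 10.1007/s11704-026-51604-z

Abstract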

Noisy labels severely hinder the accuracy and generalization of machine learning models, especially when ambiguous instance features make reliable annotation difficult. Existing approaches, including transition-matrix-based label correction, struggle to capture complex relationships between instances and noisy labels, limiting their effectiveness in such settings. We present EchoAlign, a framework that bridges generative and discriminative learning under noisy labels. Instead of correcting labels, EchoAlign treats noisy labels as supervision targets and modifies the corresponding instances to align with them. The framework has two components: EchoMod uses controllable generative models to adjust instance features while preserving key instance-level structural cues, such as shape and edges, and avoiding excessive distortion; EchoSelect mitigates distribution shifts by retaining a reliable subset of original instances, guided by feature similarity between original and modified samples. This generative-discriminative interplay enables robust learning in highly noisy settings. Experiments on three benchmark datasets show that EchoAlign outperforms state-of-the-art methods in most evaluated settings. Under 30% instance-dependent noise, EchoSelect retains nearly twice as many correctly labeled samples as competing approaches while maintaining 99% selection accuracy, demonstrating the robustness and effectiveness of EchoAlign.

2404.18923 2026-05-12 cs.CL

Holmes: A Benchmark to Assess the Linguistic Competence of Language Models

Andreas Waldis, Yotam Perlitz, Leshem Choshen, Yufang Hou, Iryna Gurevych

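Affiliations Ubiquitous Knowledge Processing Lab (UKP Lab); Technical University of Darmstadt; Information Systems Research Lab; Lucerne University of Applied Sciences and Arts; IBM Research AI; MIT CSAIL; MIT-IBM Watson AI Lab; IBM Research Europe - Ireland

AI summary This paper introduces Holmes, a new benchmark for assessing the linguistic competence of language models, that is, their unconscious understanding of linguistic phenomena. Using classifier-based probing, the study analyzes models' internal representations of phenomena in syntax, morphology, semantics, and other areas, and finds that linguistic competence correlates with model size while model architecture and instruction tuning also significantly affect performance. The authors also propose FlashHolmes, a streamlined version that lowers the computational load while maintaining high precision.

Abstract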

We introduce Holmes, a new benchmark designed to assess language models' (LMs') linguistic competence, that is, their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs' internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. Composing Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version that reduces the computation load while maintaining high-ranking precision.

2404.14442 2026-05-12 cs.LG cs.AI

Toward a Unified Lyapunov-Certified ODE Convergence Analysis of Smooth Q-Learning with p-Norms

Donghwan Lee, Hyunjun Na

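Affiliations Department of Electrical Engineering

AI summary This paper studies the convergence analysis of standard Q-learning and its smooth variants and proposes a unified convergence framework based on ordinary differential equations (ODEs). By introducing a smooth p-norm Lyapunov function, it overcomes the non-smoothness issues of classical infinity-norm approaches and yields concise yet rigorous stability proofs. The framework applies to several smooth Q-learning algorithms, including those based on the log-sum-exponential softmax, Boltzmann softmax, and mellowmax operators, and remains valid when the Bellman operator is not a contraction.

Abstract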

Convergence of Q-learning has been the subject of extensive study for decades. Among the available techniques, the ordinary differential equation (ODE) method is particularly appealing as a general-purpose, off-the-shelf tool for sanity-checking the convergence of a wide range of reinforcement learning algorithms. In this paper, we develop a unified ODE-based convergence framework that applies to standard Q-learning and several soft/smoothed variants, including those built on the log-sum-exponential softmax, Boltzmann softmax, and mellowmax operators. Our analysis uses a smooth p-norm Lyapunov function, leading to concise yet rigorous stability arguments and circumventing the non-smoothness issues inherent to classical $\infty$-norm-based approaches. To the best of our knowledge, the proposed framework is among the first to provide a unified ODE-based treatment that is broadly applicable to smooth Q-learning algorithms while also encompassing standard Q-learning. Moreover, it remains valid even in settings where the associated Bellman operator is not a contraction, as may happen in Boltzmann soft Q-learning.
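
For readers unfamiliar with the operators named above, the following sketch implements their common textbook forms, together with the p-norm that serves as a smooth stand-in for the infinity norm. The function names and the parameters `beta` and `omega` are illustrative assumptions; the paper's exact parameterization may differ.

```python
import numpy as np


def log_sum_exp(q, beta):
    """(1/beta) * log(sum_a exp(beta * q_a)); never below max(q) for beta > 0."""
    q = np.asarray(q, dtype=float)
    m = q.max()  # subtract the max before exponentiating for numerical stability
    return m + np.log(np.exp(beta * (q - m)).sum()) / beta


def boltzmann_softmax(q, beta):
    """Softmax-weighted average of q; never above max(q)."""
    q = np.asarray(q, dtype=float)
    w = np.exp(beta * (q - q.max()))
    return float((w * q).sum() / w.sum())


def mellowmax(q, omega):
    """(1/omega) * log(mean_a exp(omega * q_a)); never above max(q) for omega > 0."""
    q = np.asarray(q, dtype=float)
    return log_sum_exp(q, omega) - np.log(q.size) / omega


def p_norm(x, p):
    """(sum_i |x_i|^p)^(1/p): differentiable away from 0, close to max|x_i| for large p."""
    return float((np.abs(np.asarray(x, dtype=float)) ** p).sum() ** (1.0 / p))
```

For a vector of length $n$ and $p \ge 1$, the p-norm satisfies $\|x\|_\infty \le \|x\|_p \le n^{1/p}\|x\|_\infty$, which is why a large but finite $p$ can replace the non-smooth infinity norm in a Lyapunov argument.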

2312.08413 2026-05-12 cs.LG cs.CR cs.CY

Privacy Constrained Fairness Estimation for Decision Trees

Florian van der Steen, Fré Vink, Heysem Kaya

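Affiliations Department of Information and Computing Sciences, Utrecht; Responsible AI Team, Dutch Central Government Audit Service

AI summary As data grows in value, protecting sensitive information and ensuring the fairness of AI models become especially important. This paper studies fairness estimation for decision tree models under differential privacy constraints and proposes a new method, PAFER, that estimates statistical parity accurately while preserving privacy. Experiments show that the method achieves low fairness-estimation error while retaining model interpretability, and performs better on decision trees that humans find easier to understand.

Comments 52 pages, under review in Applied Intelligence journal

Abstract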

The protection of sensitive data becomes more vital, as data increases in value and potency. Furthermore, the pressure increases from regulators and society on model developers to make their Artificial Intelligence (AI) models non-discriminatory. To boot, there is a need for interpretable, transparent AI models for high-stakes tasks. In general, measuring the fairness of any AI model requires the sensitive attributes of the individuals in the dataset, thus raising privacy concerns. In this work, the trade-offs between fairness, privacy and interpretability are further explored. We specifically examine the Statistical Parity (SP) of Decision Trees (DTs) with Differential Privacy (DP), that are each popular methods in their respective subfield. We propose a novel method, dubbed Privacy-Aware Fairness Estimation of Rules (PAFER), that can estimate SP in a DP-aware manner for DTs. DP, making use of a third-party legal entity that securely holds this sensitive data, guarantees privacy by adding noise to the sensitive data. We experimentally compare several DP mechanisms. We show that using the Laplacian mechanism, the method is able to estimate SP with low error while guaranteeing the privacy of the individuals in the dataset with high certainty. We further show experimentally and theoretically that the method performs better for DTs that humans generally find easier to interpret.
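
As a rough illustration of estimating statistical parity under differential privacy, the sketch below adds Laplace noise to per-group counts of positive and negative predictions and then compares the resulting positive rates. It is not PAFER itself: the function name, the use of a max-minus-min gap as the parity measure, and the assumption that the set of group labels is public are all choices made here for illustration.

```python
import numpy as np


def noisy_statistical_parity(groups, predictions, epsilon, rng):
    """Estimate the statistical-parity gap from Laplace-noised counts.

    Each individual falls in exactly one (group, prediction) cell, so the
    histogram has L1 sensitivity 1 under add/remove-one neighbouring datasets;
    Laplace noise of scale 1/epsilon per cell then gives epsilon-DP.
    Assumption: the set of group labels itself is public.
    """
    groups = np.asarray(groups)
    predictions = np.asarray(predictions).astype(bool)
    rates = []
    for g in np.unique(groups):
        in_group = groups == g
        pos = np.sum(in_group & predictions) + rng.laplace(scale=1.0 / epsilon)
        neg = np.sum(in_group & ~predictions) + rng.laplace(scale=1.0 / epsilon)
        # Clipping is post-processing, so it does not weaken the DP guarantee.
        pos, neg = max(pos, 0.0), max(neg, 0.0)
        rates.append(pos / max(pos + neg, 1e-12))
    return max(rates) - min(rates)
```

Smaller values of `epsilon` give stronger privacy but noisier estimates, which is the trade-off between privacy and estimation error that the abstract discusses.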

2311.03600 2026-05-12 cs.RO

Scalable and Efficient Continual Learning from Demonstration via a Hypernetwork-generated Stable Dynamics Model

Sayantan Auddy, Jakob Hollenstein, Matteo Saveriano, Antonio Rodríguez-Sánchez, Justus Piater

Affiliations Faculty of Electrical Engineering and Computer Science, Technical University of Berlin; Department of Computer Science, University of Innsbruck; Digital Science Center (DiSC), University of Innsbruck; Department of Industrial Engineering, University of Trento; Singular Research Center on Intelligent Systems (CiTIUS), University of Santiago de Compostela

AI summary This study proposes a scalable and efficient method for continual learning from demonstration, in which a hypernetwork generates a stable dynamics model to improve a robot's ability to learn and retain multiple skills in real environments. The hypernetwork generates a trajectory-learning dynamics model and a trajectory-stabilizing Lyapunov function, which together form a clock-augmented stable neural ODE solver (sNODE), and stochastic regularization of the hypernetwork reduces training time complexity. Experiments show superior trajectory accuracy, continual learning performance, and stability on several complex datasets and real-world tasks.

Comments To appear in IEEE Transactions on Cognitive and Developmental Systems

Abstract

Robots capable of learning from demonstration (LfD) must exhibit stability while executing learned motion skills. To be effective in the real world, they should also remember multiple skills over time -- a capability lacking in current stable-LfD methods. We propose an approach to stable, continual LfD, and highlight the role of stability in improving continual learning. Our proposed hypernetwork generates the parameters of two neural networks: a trajectory learning dynamics model, and a trajectory-stabilizing Lyapunov function. These generated networks form a clock-augmented stable neural ODE solver (sNODE), a stable dynamics model that offers a superior stability-accuracy trade-off compared to the state-of-the-art. We further propose stochastic hypernetwork regularization with a single, uniformly-sampled task embedding, reducing the cumulative training time for $N$ tasks from O($N^2$) to O($N$) without degrading performance on real-world tasks. We introduce high-dimensional variants of the popular LASA dataset to assess scalability and extend a dataset of robotic LfD tasks to assess real-world performance. We empirically evaluate our approach on multiple LfD datasets of varying complexity, including sequences of 7--26 tasks, trajectories of 2--32 dimensions, and real-world tasks involving position and orientation. Our thorough evaluation on multiple LfD datasets demonstrates that our approach sequentially learns and retains multiple motion skills without retraining on past demonstrations, and outperforms other relevant baselines in terms of trajectory errors, continual learning scores, and stability metrics. Notably, we show that stability greatly enhances continual learning performance, particularly in size-efficient chunked hypernetworks. Our code is available at https://github.com/sayantanauddy/clfd-snode.

2306.03606 2026-05-12 cs.AI

BioBLP: A Modular Framework for Learning on Multimodal Biomedical Knowledge Graphs

Daniel Daza, Dimitrios Alivanistos, Payal Mitra, Thom Pijnenburg, Michael Cochez, Paul Groth

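Affiliations Vrije Universiteit Amsterdam; University of Amsterdam; Elsevier B.V.; Discovery Lab, Elsevier

AI summary This paper proposes BioBLP, a modular framework for learning entity embeddings in biomedical knowledge graphs with multimodal entity attributes. It handles attribute data of different modalities and supports entities with missing attributes. It also introduces an efficient pretraining strategy that significantly improves model performance and reduces training time. Experiments show that on drug-protein interaction prediction BioBLP compares favorably with baselines that ignore attribute data, particularly for low-degree nodes.

Journal ref J Biomed Semant 14, 20 (2023)

Abstract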

Knowledge graphs (KGs) are an important tool for representing complex relationships between entities in the biomedical domain. Several methods have been proposed for learning embeddings that can be used to predict new links in such graphs. Some methods ignore valuable attribute data associated with entities in biomedical KGs, such as protein sequences, or molecular graphs. Other works incorporate such data, but assume that entities can be represented with the same data modality. This is not always the case for biomedical KGs, where entities exhibit heterogeneous modalities that are central to their representation in the subject domain. We propose a modular framework for learning embeddings in KGs with entity attributes, that allows encoding attribute data of different modalities while also supporting entities with missing attributes. We additionally propose an efficient pretraining strategy for reducing the required training runtime. We train models using a biomedical KG containing approximately 2 million triples, and evaluate the performance of the resulting entity embeddings on the tasks of link prediction, and drug-protein interaction prediction, comparing against methods that do not take attribute data into account. In the standard link prediction evaluation, the proposed method results in competitive, yet lower performance than baselines that do not use attribute data. When evaluated in the task of drug-protein interaction prediction, the method compares favorably with the baselines. We find settings involving low degree entities, which make up for a substantial amount of the set of entities in the KG, where our method outperforms the baselines. Our proposed pretraining strategy yields significantly higher performance while reducing the required training runtime. Our implementation is available at https://github.com/elsevier-AI-Lab/BioBLP .

2010.03496 2026-05-12 cs.CL cs.AI cs.LG

Inductive Entity Representations from Text via Link Prediction

Daniel Daza, Michael Cochez, Paul Groth

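Affiliations Vrije Universiteit Amsterdam; University of Amsterdam; Discovery Lab, Elsevier

AI summary This study explores how to learn inductive entity representations from textual descriptions in knowledge graphs via a link prediction objective, and evaluates how well these representations generalize across tasks. It presents an architecture based on a pretrained language model that handles entities not seen during training and achieves significant gains on link prediction, entity classification, and information retrieval. Experiments show that the learned entity representations transfer across tasks without fine-tuning and generalize better than existing methods.

Comments The Web Conference 2021

Abstract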

Knowledge Graphs (KG) are of vital importance for multiple applications on the web, including information retrieval, recommender systems, and metadata annotation. Regardless of whether they are built manually by domain experts or with automatic pipelines, KGs are often incomplete. Recent work has begun to explore the use of textual descriptions available in knowledge graphs to learn vector representations of entities in order to perform link prediction. However, the extent to which these representations learned for link prediction generalize to other tasks is unclear. This is important given the cost of learning such representations. Ideally, we would prefer representations that do not need to be trained again when transferring to a different task, while retaining reasonable performance. In this work, we propose a holistic evaluation protocol for entity representations learned via a link prediction objective. We consider the inductive link prediction and entity classification tasks, which involve entities not seen during training. We also consider an information retrieval task for entity-oriented search. We evaluate an architecture based on a pretrained language model, that exhibits strong generalization to entities not observed during training, and outperforms related state-of-the-art methods (22% MRR improvement in link prediction on average). We further provide evidence that the learned representations transfer well to other tasks without fine-tuning. In the entity classification task we obtain an average improvement of 16% in accuracy compared with baselines that also employ pre-trained models. In the information retrieval task, we obtain significant improvements of up to 8.8% in NDCG@10 for natural language queries. We thus show that the learned representations are not limited to KG-specific tasks, and have greater generalization properties than evaluated in previous work.
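
The two ranking metrics quoted above, MRR and NDCG@10, can be computed as in the following sketch. It uses the standard definitions (reciprocal of the 1-based rank of the correct answer, and discounted cumulative gain with a log2 discount); the function names are illustrative, and the paper's evaluation scripts may differ in details such as tie handling.

```python
import math


def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where each rank is the 1-based position of the correct entity."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def ndcg_at_k(relevances, k=10):
    """Normalized discounted cumulative gain over the top-k results.

    relevances: graded relevance of each returned item, in ranked order.
    """
    def dcg(rels):
        # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```

For example, if the correct entity is ranked 1st, 2nd, and 4th across three link prediction queries, the MRR is (1 + 1/2 + 1/4) / 3.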

2605.10111 2026-05-12 cs.LG cs.AI cs.CV

CFSPMNet: Cross-subject Fourier-guided Spatial-Patch Mamba Network for EEG Motor Imagery Decoding in Stroke Patients

Xiangkai Wang, Yun Zhao, Dongyi He, Qingling Xia, Gen Li, Xinlai Xing, Yuchi Pan, Bin Jiang

Affiliations School of Artificial Intelligence, Chongqing University of Technology; Chongqing Key Laboratory of Embodied Intelligence Perception and Autonomous Learning for Humanoid Robots; Key Laboratory of Advanced Equipment Intelligence of the Chongqing Education Commission; School of Smart Health, Chongqing Polytechnic University of Electronic Technology; Department of Language Science and Technology, The Hong Kong Polytechnic University; School of Pharmacy and Bioengineering, Chongqing University of Technology

AI summary This study addresses the difficulty of cross-subject use in brain-computer interface (BCI) decoding for stroke patients and proposes a new neural network framework, CFSPMNet. The method combines Fourier-domain state reorganization with shared-private prototype matching and models latent neural-state organization to improve the accuracy and robustness of cross-subject MI-EEG decoding. Experiments show that CFSPMNet outperforms current mainstream methods on two stroke MI-EEG datasets.

Abstract

Motor imagery electroencephalography (MI-EEG) decoding offers a non-invasive route for post-stroke rehabilitation, but cross-patient use remains difficult because pathological neural reorganization changes task-related EEG dynamics, aperiodic activity, local excitability, cross-regional coordination, and trial-level brain-state context. This makes source-learned MI representations unreliable for unseen patients. To address this problem, we propose CFSPMNet, a cross-patient adaptation framework that models post-stroke MI-EEG as latent neural-state organization. CFSPMNet combines a Fourier-Reorganized State Mamba Network (FRSM) with Shared-Private Prototype Matching (SPPM). FRSM represents each trial as a latent physiological token sequence, reorganizes token states in the Fourier domain, and uses Fourier-derived trial context to guide Mamba state-space propagation. SPPM improves target pseudo-label updating by combining semantic confidence with shared-private physiological consistency, filtering confident but physiologically inconsistent target predictions. Leave-one-subject-out experiments on two stroke MI-EEG datasets show that CFSPMNet outperforms representative CNN-, Transformer-, Mamba-, and adaptation-based baselines, achieving average accuracies of 68.23% on XW-Stroke and 73.33% on 2019-Stroke, with gains of 5.63 and 8.25 percentage points over the strongest competitors. Ablation, sensitivity, feature-alignment, pseudo-label selection, and neurophysiological visualization analyses further support the roles of Fourier-domain token-state reorganization and calibrated pseudo-label updating. These results suggest that latent neural-state modeling can improve rehabilitation-oriented cross-patient BCI decoding. Code is available at https://github.com/wxk1224/CFSPMNet.

2605.10108 2026-05-12 cs.CL cs.LG

GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction

Ihor Stepanov, Oleksandr Lukashov, Mykhailo Shtopko, Vivek Kalyanarangan

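Affiliations Knowledgator Engineering; Baldor Technologies Pvt. Ltd. (IDfy)

AI summary This paper proposes GLiNER-Relex, a unified framework for jointly performing named entity recognition (NER) and relation extraction (RE). Built on a shared bidirectional transformer encoder that jointly models entity type and relation type labels, it enables zero-shot extraction of arbitrary entity and relation types at inference time. Experiments show strong performance on several standard relation extraction datasets together with computational efficiency and flexibility, and the model has been released as an open-source package.

Comments 19 pages, 1 figure, 2 tables

Abstract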

Joint named entity recognition (NER) and relation extraction (RE) is a fundamental task in natural language processing for constructing knowledge graphs from unstructured text. While recent approaches treat NER and RE as separate tasks requiring distinct models, we introduce GLiNER-Relex, a unified architecture that extends the GLiNER framework to perform both entity recognition and relation extraction in a single model. Our approach leverages a shared bidirectional transformer encoder to jointly represent text, entity type labels, and relation type labels, enabling zero-shot extraction of arbitrary entity and relation types specified at inference time. GLiNER-Relex constructs entity pair representations from recognized spans and scores them against relation type embeddings using a dedicated relation scoring module. We evaluate our model on four standard relation extraction benchmarks: CoNLL04, DocRED, FewRel, and CrossRE, and demonstrate competitive performance against both specialized relation extraction models and large language models, while maintaining the computational efficiency characteristic of the GLiNER family. The model is released as an open-source Python package with a simple inference API that allows users to specify arbitrary entity and relation type labels at inference time and obtain both entities and relation triplets in a single call. All models and code are publicly available.

2605.10107 2026-05-12 cs.AI cs.AR

Arcane: An Assertion Reduction Framework through Semantic Clustering and MCTS-Guided Rule Exploring

Hongqin Lyu, Yonghao Wang, Zhiteng Chao, Tiancheng Wang, Huawei Li

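Affiliations State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China

AI summary This paper proposes Arcane, an assertion reduction framework that targets the low simulation efficiency caused by redundant assertions in assertion-based hardware verification. The method combines semantic clustering, which accurately classifies large assertion sets, with Monte Carlo Tree Search (MCTS), which explores the best order of rule applications to reduce the assertion count efficiently. Experiments show that Arcane reduces the number of assertions by up to 76.2% while preserving formal coverage and mutation-detection ability, and speeds up simulation by 2.6x to 6.1x.

Comments 6 pages, 6 figures

Abstract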

Assertion-based Verification (ABV) is essential for ensuring that hardware designs conform to their intended specifications. However, existing automated assertion-generation approaches, such as LLM-based frameworks, often generate large numbers of redundant assertions, which significantly degrade simulation efficiency. To mitigate the simulation overhead caused by redundant assertions, this paper proposes Arcane, an efficient assertion reduction framework. It integrates a two-tier assertion clustering approach for accurate semantic classification of large assertion sets, and employs Monte Carlo Tree Search (MCTS) to explore optimal rule-application sequences for efficient assertion reduction. The experimental results on Assertionbench [20] show that Arcane reduces the assertion count by up to 76.2% while fully preserving formal coverage and mutation-detection ability. Further simulation studies demonstrate a 2.6x to 6.1x speedup in simulation time. The proposed framework is released at https://anonymous.4open.science/r/Arcane1-0A6F/.

2605.10106 2026-05-12 cs.CV cs.AI

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

Tingshu Mou, Jiabo He, Renying Wang, Ce Liu, Hao Yang, Tiehua Zhang, Jingjing Chen, Xingjun Ma

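Affiliations Fudan University; Bosch Center for Artificial Intelligence (BCAI); Tongji University

AI summary This paper proposes ViSRA, a video-based 3D spatial reasoning agent that aims to improve the spatial reasoning ability of multi-modal large language models (MLLMs). ViSRA requires no additional training: it uses explicit spatial information from expert models to guide spatial reasoning in a modular and extensible way, yielding a flexible plug-and-play framework. The method performs well on several existing benchmarks and on unseen 3D spatial tasks, improving over baselines by absolute margins of up to 15.6% and 28.9%, respectively, with transferable 3D understanding and low computational cost.

Abstract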

Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet progress has been largely driven by post-training on curated benchmarks, leaving inference-time approaches relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a flexible plug-and-play paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost and no heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by absolute margins of up to 15.6% and 28.9%, respectively.

2605.10091 2026-05-12 cs.LG

TopoU-Net: a U-Net architecture for topological domains

Gaurav Gaurav, Ibrahem ALJabea, Yaroslav Zakomornyy, Eric Frank, Mohamed Elhamdadi, Theodore Papamarkou, Mustafa Hajij

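Affiliations University of South Florida; Louisiana State University; Vinci4D; PolyShape NTUA; USFCA

AI summary TopoU-Net is a U-Net architecture for topologically structured data, designed to handle data that mixes points, edges, regions, hyperedges, and other complex structures. The method treats U-Net as a hierarchical encoder-decoder framework and uses the cells, incidences, and ranks of combinatorial complexes to build its representation spaces and skip connections. By introducing the notion of a rank path, TopoU-Net transports features between topological levels and performs strongly across several tasks, with the largest gains on heterophilic graphs and higher-order structured data.

Abstract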

Many modern datasets mix points, edges, regions, groups, objects, events, hyperedges, and relations. Yet neural architectures often force such data into grids, graphs, or sequences, obscuring higher-order structure and making encoder-decoder designs domain-specific. We view U-Net not as a grid-specific architecture, but as a hierarchical encoder-decoder principle: representation spaces, transport maps between levels, and skip connections between matched levels. Combinatorial complexes naturally supply these ingredients through cells, incidences, and ranks. We introduce TopoU-Net, a rank-path U-Net for topological domains. Given a path from an input rank to a bottleneck rank and back, the encoder lifts cochains upward along incidence maps, the decoder transports them downward, and skip connections merge features at matched ranks. Rank replaces spatial scale: choosing paths through nodes, edges, faces, hyperedges, or global cells becomes the central architectural decision. A key quantity is the bottleneck support ratio, the number of cells at the bottleneck relative to the number of cells at the input rank. This ratio is fixed by the complex and chosen path rather than by arbitrary pooling, and it clarifies when skip connections are optional, useful, or structurally important. Across node classification, graph classification, hypergraph node classification, mesh classification, and image reconstruction, TopoU-Net provides a reusable encoder-decoder template for higher-order structured data. Among the evaluated baselines, it achieves the strongest mean accuracy on six of eight node-classification datasets and four of five hypergraph datasets, with the largest gains on heterophilic graphs. Ablations show that removing skip connections is most damaging under severe bottleneck compression.
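The bottleneck support ratio described above can be illustrated with a small calculation. This is a minimal sketch, not the authors' code: the function name `bottleneck_support_ratio`, the dictionary layout, and the convention that the encoder path ends at the bottleneck rank are all assumptions made for illustration.

```python
def bottleneck_support_ratio(cells_per_rank, encoder_path):
    """Cells at the bottleneck rank divided by cells at the input rank.

    cells_per_rank: dict mapping rank -> number of cells (hypothetical layout).
    encoder_path: ranks visited by the encoder; by assumption the first entry
    is the input rank and the last entry is the bottleneck rank.
    """
    input_rank = encoder_path[0]
    bottleneck_rank = encoder_path[-1]
    return cells_per_rank[bottleneck_rank] / cells_per_rank[input_rank]


# Toy complex: the surface of a cube (8 nodes, 12 edges, 6 faces),
# plus a single hypothetical "global" cell stored at rank 3.
cube = {0: 8, 1: 12, 2: 6, 3: 1}

mild = bottleneck_support_ratio(cube, [0, 1, 2])   # nodes -> edges -> faces
severe = bottleneck_support_ratio(cube, [0, 3])    # nodes -> global cell
print(mild, severe)
```

A ratio near or above 1 means the bottleneck keeps roughly as many cells as the input, while a ratio far below 1 corresponds to the severe compression regime in which the abstract reports skip connections matter most.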

2605.10087 2026-05-12 cs.CV

Initiation of Interaction Detection Framework using a Nonverbal Cue for Human-Robot Interaction

Guhnoo Yun, Juhan Yoo, Kijung Kim, Dong Hwan Kim

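Affiliations Korea Institute of Science and Technology; Department of Computer Science, Semyung University

AI summary This paper proposes an initiation of interaction detection framework for human-robot interaction (HRI) in domestic environments, based on nonverbal cues and the fusion of audio and vision sensors. The framework combines sound source localization with human tracking information to detect the initiation of interaction when the user looks at the robot; even when the user does not speak directly, interaction intent is recognized once the gaze lasts longer than a predefined threshold. The authors design a state transition model and verify it experimentally on a mobile robot, with all modules integrated in ROS.

Abstract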

This paper describes a keyword-free initiation of interaction (IoI) detection framework for human-robot interaction (HRI), based on audio and vision sensor fusion in a domestic environment. In the proposed framework, the robot has its own audio and vision sensors and can employ an external vision sensor for stable human detection and tracking. When the user starts to speak while looking at the robot, the robot can localize the user's position through sound source localization combined with human tracking information. The robot then detects the IoI if it perceives that the speaker's face is turned toward the robot. If the user does not speak directly, the robot can also detect the IoI when the user looks at the robot for longer than a predefined period of time. A state transition model for the proposed IoI detection framework is designed and verified by experiments with a mobile robot. To implement our model within a robot architecture, all components are implemented and integrated in the Robot Operating System (ROS) environment.
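The two detection rules in this abstract (speech while facing the robot, or a sufficiently long gaze without speech) can be sketched as a small state machine. This is an illustrative sketch only: the state names, the `IoIDetector` class, and the 2-second default threshold are assumptions, and the paper's actual state transition model and ROS integration are not reproduced here.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"                # nobody is facing the robot
    GAZE_TIMING = "gaze_timing"  # user faces the robot, gaze timer running
    INTERACTION = "interaction"  # initiation of interaction detected


class IoIDetector:
    """Hypothetical detector; inputs are assumed to come from perception modules."""

    def __init__(self, gaze_threshold_s=2.0):
        self.gaze_threshold_s = gaze_threshold_s  # assumed value, not from the paper
        self.state = State.IDLE
        self.gaze_time_s = 0.0

    def step(self, speaking, facing_robot, dt_s):
        if self.state is State.INTERACTION:
            return self.state  # stay latched once interaction has started
        if not facing_robot:
            # Losing the face resets the gaze timer.
            self.state = State.IDLE
            self.gaze_time_s = 0.0
            return self.state
        if speaking:
            # Rule 1: the localized speaker is facing the robot.
            self.state = State.INTERACTION
            return self.state
        # Rule 2: silent gaze longer than the predefined period.
        self.gaze_time_s += dt_s
        if self.gaze_time_s > self.gaze_threshold_s:
            self.state = State.INTERACTION
        else:
            self.state = State.GAZE_TIMING
        return self.state


detector = IoIDetector(gaze_threshold_s=1.0)
history = [detector.step(speaking=False, facing_robot=True, dt_s=0.5) for _ in range(3)]
print([s.value for s in history])
```

Keeping the gaze timer inside the detector means the perception side only has to report two booleans per frame, which is one plausible way to decouple sensing from the decision logic.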

2605.10086 2026-05-12 cs.RO

A cell-decomposition based path planner for 3D navigation in constrained workspaces

João P. L. Morais, Luciano C. A. Pimenta, Marcelo A. Santos, Guilherme V. Raffo

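Affiliations Department of Management, Information and Production Engineering, University of Bergamo, Dalmine, BG, Italy; Department of Electronic Engineering, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil

AI summary This paper proposes a cell-decomposition-based path planning algorithm for navigation in constrained 3D workspaces that guarantees mutual complete visibility between each cell and at least one of its adjacent cells. The approach yields a simplified framework for verifying path feasibility that can be easily embedded in optimization problems. By combining Yen's k-shortest path algorithm with second-order cone programming (SOCP), the authors propose a new method named KSP-SOCP that lowers the computational burden while maintaining path quality; experiments show time performance comparable to MISOCP with lower memory use, making it suitable for large-scale scenarios.

Comments Accepted for publication at the 23rd IFAC World Congress (Busan, Korea)

Abstract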

This paper proposes a cell decomposition algorithm for binary occupancy grids that ensures mutual complete visibility from each cell to at least one adjacent cell. This decomposition establishes a simplified framework for verifying path feasibility that can be easily embedded in optimization problems. To illustrate its utility, we formulate both second-order cone programs (SOCP) and their mixed-integer variant (MISOCP) within the proposed framework. Furthermore, we propose the KSP-SOCP method, which combines Yen's k-shortest path algorithm with the SOCP, achieving improved solutions compared to a standard SOCP approach while avoiding the computational burden of MISOCP. The cell decomposition algorithm, KSP-SOCP, and MISOCP approaches were evaluated in 9 city-like workspaces. The decomposition efficiently partitioned each map, enabling both optimization methods to compute feasible paths. The proposed KSP-SOCP achieved time performance comparable to the MISOCP while requiring less memory, making it highly suitable for large-scale problems.

2605.10079 2026-05-12 cs.CV

SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

Liangyang Ouyang, Ruicong Liu, Caixin Kang, Yifei Huang, Yoichi Sato

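Affiliations The University of Tokyo; Shanda AI Research Tokyo

AI summary This paper proposes SocialDirector, a training-free interaction controller that improves control over social interactions in multi-person video generation. By modulating cross-attention maps, the method controls who performs each action, when it occurs, and toward whom it is directed, addressing actor-action mismatch and disordered social dynamics in existing models. The authors also build an automated evaluation pipeline, and experiments show that SocialDirector significantly improves the interaction fidelity of generated videos, approaching the level of real videos.

Abstract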

Video generation has advanced rapidly, producing photorealistic videos from text or image prompts. Meanwhile, film production and social robotics increasingly demand multi-person videos with rich social interactions, including conversations, gestures, and coordinated actions. However, existing models offer no explicit control over interactions, such as who performs which action, when it occurs, and toward whom it is directed. This often results in the wrong person performing unintended actions (actor-action mismatch), disordered social dynamics, and wrong action targets. To address these challenges, we present SocialDirector, a training-free interaction controller that enhances the generation model by modulating cross-attention maps. SocialDirector contains two modules: Social Actor Masking and Directional Reweighting. Social Actor Masking constrains each person's visual tokens to attend only to their own textual descriptions via a spatiotemporal mask, avoiding actor-action mismatch and disordered social dynamics. Directional Reweighting amplifies attention to directional words (e.g., "leftward", "right"), leading each action towards its intended target. To evaluate generated social interactions, we annotate existing datasets with interaction descriptions and build a fully automated evaluation pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.

2605.10071 2026-05-12 cs.CV

MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization

Yaning Zhang, Tianyi Wang, Zan Gao, Yibo Zhao, Chunjie Ma, Meng Wang

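Affiliations Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences); School of Computing, National University of Singapore; Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences); Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology

AI summary With the rapid development of highly realistic face generation technology, generalizable face forgery detection and localization methods have become especially important. This paper proposes a multi-domain fine-grained vision-language reconstruction (MFVLR) model that uses language-guided fine-grained face forgery representation learning to capture visual forgery traces across multiple domains, enabling generalizable detection and localization of diffusion-generated face forgeries. The model introduces a fine-grained language transformer, a multi-domain vision encoder, and a vision decoder, together with a novel vision injection module, and improves performance in cross-generator, cross-forgery, and cross-dataset settings.

Abstract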

The swift advancement of photo-realistic face generation technology has sparked considerable concern across society and academia, underscoring the need for generalizable face forgery detection and localization methods. Prior works tend to capture face forgery patterns across multiple domains using the image modality alone; other modalities, such as fine-grained text, have not been comprehensively investigated, which restricts the generalization capability of models. In addition, they usually analyze facial images created by GANs but struggle to identify and localize those synthesized by diffusion models. To address these problems, we devise a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model, which explores comprehensive and diverse visual forgery traces via language-guided face forgery representation learning to achieve generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that learns general fine-grained language embeddings through language reconstruction. We propose a multi-domain vision encoder to capture general and complementary visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. We also propose an innovative plug-and-play vision injection module to enhance the interaction between the vision and language embeddings. Extensive experiments and visualizations demonstrate that our network outperforms the state of the art in settings such as cross-generator, cross-forgery, and cross-dataset evaluation.

2605.10065 2026-05-12 cs.CL cs.AI

NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding

Hyundong Jin, Yo-Sub Han

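Affiliations Yonsei University

AI summary Preventing large language models from generating inappropriate content, such as profanity and personally identifiable information, is increasingly important. To handle multiple hard constraints and regex constraints efficiently during decoding, this paper proposes NCO, a decoding strategy that processes constraints through online pattern matching, avoids state explosion, and is compatible with a range of sampling and search methods. Experiments show that NCO is effective for content filtering in practical tasks.

Abstract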

Controlling Large Language Models (LLMs) to prevent the generation of undesirable content, such as profanity and personally identifiable information (PII), has become increasingly critical. While earlier approaches relied on post-processing or resampling, recent research has shifted towards constrained decoding methods that control outputs during generation to mitigate high computational costs and quality degradation. However, preventing multiple forbidden hard constraints or regex constraints from appearing anywhere in the output is computationally challenging. A straightforward solution is to convert these constraints into a single automaton that tracks all forbidden patterns during decoding, but this often becomes impractically large. Standard regex engines also do not readily support the operations needed to build such a constraint, such as complement and intersection. To address these limitations, we propose NCO, a decoding strategy that performs online pattern matching over finite hard constraints and regex constraints, reducing computational overhead without inducing state explosion. NCO is fully compatible with standard inference strategies, including various sampling methods and beam search, while also supporting soft masking for probabilistic suppression. We empirically demonstrate its effectiveness across practical tasks, including PII and profanity suppression. Our implementation is available at https://github.com/hyundong98/NCO-Decoding.git.
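To make the idea of checking negative constraints online during decoding concrete, here is a deliberately naive sketch for finite string constraints. It is not the NCO algorithm and does not use the NCO-Decoding package: the function `allowed_tokens`, the string-based vocabulary, and the brute-force substring check are all assumptions chosen for brevity, and regex constraints and soft masking are left out.

```python
def allowed_tokens(generated, vocab, forbidden):
    """Return the tokens that can be appended without completing a forbidden string.

    Assumes `generated` itself contains no forbidden string and that every
    forbidden string is non-empty (illustrative preconditions).
    """
    longest = max((len(p) for p in forbidden), default=0)
    # A new match must end inside the appended token, so only the last
    # (longest - 1) characters of the existing text can take part in it.
    tail = generated[-(longest - 1):] if longest > 1 else ""
    allowed = []
    for token in vocab:
        candidate = tail + token
        if not any(pattern in candidate for pattern in forbidden):
            allowed.append(token)
    return allowed


forbidden = ["damn", "ssn"]
vocab = ["mn", "y", "m", " ssn"]
mask = allowed_tokens("well da", vocab, forbidden)
print(mask)
```

In a real decoder the disallowed tokens would have their logits set to negative infinity before sampling or beam expansion. The check above rescans every pattern for every token at each step, which is the kind of cost that automaton-based or online-matching approaches aim to reduce.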

2605.10064 2026-05-12 cs.AI

MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

Ruiyi Yang, Zechen Li, Hao Xue, Imran Razzak, Flora D. Salim

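Affiliations University of New South Wales; The Hong Kong University of Science and Technology (Guangzhou); Mohamed Bin Zayed University of Artificial Intelligence

AI summary MAGE is a multi-agent co-evolution framework that builds a co-evolutionary knowledge graph with four subgraphs to store the agent's learning experience and feedback externally, supporting stable inference-time performance with a frozen backbone model. The method uses task-conditioned guided retrieval together with task-level and skill-level bandits to accumulate and apply knowledge efficiently. Experiments show that MAGE performs strongly against prompt-based frozen-backbone baselines on several complex tasks, demonstrating its effectiveness and broad applicability for self-evolving learning.

Comments 25 pages, 3 figures

Abstract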

Self-evolving language-model agents must decide what to learn next and how to preserve what they have learned across iterations. Existing systems typically carry this cross-iteration knowledge as natural-language feedback, flat episodic memory, or implicit reinforcement signals, none of which cleanly supports a frozen weak backbone at inference time. This paper introduces MAGE (Multi-Agent Graph-guided Evolution), a framework that externalizes self-knowledge into a four-subgraph co-evolutionary knowledge graph. Its experience subgraph stores both teacher-written failure corrections and the learner's own past correct reasoning traces, which are retrieved as task-conditioned guidance for a frozen execution model. During evolution, the graph, a task-level search bandit, and a skill-level routing bandit are updated from the same reward stream, while the learner's backbone remains unchanged. We further provide structural analysis showing how append-only memory growth, bounded curriculum coverage, and task-filtered retrieval together support stable improvement of the retrieval substrate for frozen-learner evolution. Across nine benchmarks spanning mathematical reasoning, multi-hop and open-domain question answering, spatio-temporal analysis, financial numerical reasoning, medical multiple-choice, an open-world survival game, and web navigation, MAGE achieves strong performance against prompt-based frozen-backbone baselines. Ablations show that self-harvested success traces and teacher-written corrections are complementary, with success memories contributing most on reasoning-template-heavy tasks and corrective memories supporting harder composition and interaction settings.

2605.10063 2026-05-12 cs.RO

EFGCL: Learning Dynamic Motion through Spotting-Inspired External Force Guided Curriculum Learning

Keita Yoneda, Kento Kawaharazuka, Kei Okada

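Affiliations Department of Mechano-Informatics, Graduate School of Information Science and Technology, The University of Tokyo; AI Center, Graduate School of Information Science and Technology, The University of Tokyo

AI summary This paper proposes External Force Guided Curriculum Learning (EFGCL), a physically guided reinforcement learning method that addresses the low efficiency and high failure risk legged robots face when learning complex dynamic whole-body motions. Inspired by spotting in gymnastics, the method introduces assistive external forces during training so that the robot can physically experience successful motion executions, without relying on task-specific reward design or reference trajectories. Experiments show that EFGCL significantly improves the efficiency with which a quadrupedal robot learns complex motions such as jumps, and the motions learned in simulation were reproduced on a real robot, supporting the method's effectiveness and generality.

Comments Accepted at RA-L 2026, website - https://keitayoneda.github.io/kleiyn-efgcl/, YouTube - https://youtu.be/sFK00hm14No/

Journal ref IEEE Robotics and Automation Letters (RA-L) 2026

Abstract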

Learning dynamic whole-body motions for legged robots through reinforcement learning (RL) remains challenging due to the high risk of failure, which makes efficient exploration difficult and often leads to unstable learning. In this paper, we propose External Force Guided Curriculum Learning (EFGCL), a guided RL approach based on the principle of physical guidance, in which external assistive forces are introduced during training. Inspired by spotting in artistic gymnastics, EFGCL enables agents to physically experience successful motion executions without relying on task-specific reward shaping or reference trajectories. Experiments on a quadrupedal robot performing Jump, Backflip, and Lateral-Flip tasks demonstrate that EFGCL accelerates learning of the Jump task by approximately a factor of two and enables the acquisition of complex whole-body motions that conventional RL methods fail to learn. We further show that the learned policies can be deployed on a real robot, reproducing motions consistent with those observed in simulation. These results indicate that physically guided exploration, which allows agents to experience success early in training, is an effective and general strategy for improving learning efficiency in dynamic whole-body motion tasks.

2605.10054 2026-05-12 cs.CV

Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging

Zubair Faruqui, Rahul Dubey

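Affiliations Department of Computer Science, Missouri State University

AI summary This study addresses the problem that deep neural networks for medical image diagnosis rely too heavily on clinically irrelevant features, proposing to introduce explanation supervision directly during training to guide the model toward clinically meaningful regions. It systematically analyzes how different explanation loss designs and supervision strengths affect predictive performance and the faithfulness of explanations, and introduces two new quantitative metrics for assessing explanation quality. Experiments show that the method improves the clinical relevance of explanations while maintaining model accuracy, and that it applies to a range of annotated biomedical imaging tasks.

Comments Under review at IEEE Journal of Biomedical and Health Informatics (JBHI)

Abstract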

Deep neural networks for medical image diagnosis often achieve high predictive accuracy while relying on spurious or clinically irrelevant visual cues, limiting their trustworthiness in practice. Post-hoc explanation methods are widely used to visualize model decisions in the form of saliency maps; however, these explanations do not influence how models learn during training, allowing non-causal or confounding features to persist. This motivates the incorporation of explanation supervision directly into the training objective to guide model attention toward clinically meaningful regions and promote clinically grounded decision-making. This paper presents a systematic approach to integrating explanation loss into model training and analyzes how different explanation loss designs and supervision strengths influence both predictive performance and the spatial faithfulness of explanations. To quantitatively assess interpretability, two complementary explanation performance metrics, annotation coverage and saliency precision, are introduced, enabling rigorous evaluation beyond qualitative visualization. Our experimental results reveal a clear trade-off between explanation quality and explanation loss coefficients. Furthermore, quantitative statistical analysis shows consistently improved explanation alignment while maintaining comparable accuracy. Experiments were conducted on annotated chest X-ray datasets; however, the proposed framework is applicable to a broad range of annotated biomedical imaging modalities. Overall, these findings demonstrate that explanation supervision is not a monolithic design choice and provide practical guidance for incorporating explanation loss into training objectives under noisy clinical annotations.

2605.10051 2026-05-12 cs.RO cs.AI

Guided Streaming Stochastic Interpolant Policy

Puming Jiang, Meiyi Wang, Kelvin Lin, Ce Hao, Harold Soh

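Affiliations School of Computing, National University of Singapore

AI summary This paper studies how inference-time guidance can let generative robot policies adapt dynamically to objectives without retraining. Existing methods are confined to chunk-based architectures, which suffer from high latency and poor reactivity. By analyzing the time evolution of the value function, the authors derive the optimal guidance term for stochastic interpolant policies and propose the Streaming Stochastic Interpolant Policy (SSIP), which enables fast and reactive real-time control. They also propose two complementary mechanisms, supporting zero-shot adaptation and amortized inference respectively; experiments show better reactivity and physically valid guidance in dynamic, complex environments.

Comments Accepted to Robotics: Science and Systems (RSS) 2026. The first two authors contributed equally

Abstract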

Inference-time guidance is essential for steering generative robot policies toward dynamic objectives without retraining, yet existing methods are largely confined to chunk-based architectures that exhibit high latency and lack the reactivity needed for test-time preference alignment or obstacle avoidance. In this work, we formally derive the optimal guidance term for Stochastic Interpolants (SI) by analyzing the value function's time evolution via the Backward Kolmogorov Equation, establishing a modified drift that theoretically guarantees sampling from a target distribution. We apply this framework to real-time control through the Streaming Stochastic Interpolant Policy (SSIP), which generalizes the deterministic Streaming Flow Policy (SFP). Unifying this guidance law with the streaming architecture enables fast and reactive control. To support diverse deployment needs, we propose two complementary mechanisms: training-free Stochastic Trajectory Ensemble Guidance (STEG) that computes gradients on-the-fly for zero-shot adaptation, and training-based Conditional Critic Guidance (CCG) for amortized inference. Empirical evaluations demonstrate that our guided streaming approach significantly outperforms conventional chunk-based policies in reactivity and provides superior, physically valid guidance for dynamic, unstructured environments.

2605.10050 2026-05-12 cs.CV

EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs

Jiameng Li, Minye Wu, Jiezhang Cao, Aleksei Tiulpin, Matthew B. Blaschko

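Affiliations KU Leuven; Shanghai Jiaotong University; Weill Cornell Medicine

AI summary Video large language models (VideoLLMs) struggle with long videos because dense sampling produces massive numbers of visual tokens, while sparse sampling can miss critical temporal information and cause hallucination. This paper proposes EchoPrune, a lightweight, training-free token pruning method that interprets redundant tokens as temporal echoes and scores tokens by cross-modal relevance and temporal reconstruction error, raising temporal resolution under a fixed token budget. Experiments show that EchoPrune lets VideoLLMs process up to 20x as many frames under the same token budget and improves performance and inference speed on several benchmarks.

Comments 9 pages

Abstract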

Long-form video understanding remains challenging for Video Large Language Models (VideoLLMs), as dense frame sampling introduces massive numbers of visual tokens while sparse sampling risks missing critical temporal evidence and leading to LLM hallucination. Existing training-free token reduction methods either treat videos as collections of static images or rely on segment-level merging heuristics, which weaken fine-grained spatiotemporal modeling and introduce additional overhead. In this paper, we propose EchoPrune, a lightweight and training-free token pruning method that improves temporal resolution under a fixed LLM-side visual token budget. Our core idea is to interpret redundant video tokens as temporal echoes: if a token is well reconstructed from the previous frame, it is merely a temporally redundant echo; otherwise, it may capture new events, motion, or query-relevant visual evidence. Based on this insight, EchoPrune scores visual tokens by (i) query-guided cross-modal relevance and (ii) temporal reconstruction error, measured by correspondence matching and echo matching across consecutive frames. The selected tokens preserve task-relevant cues and temporal novelty while suppressing predictable redundancy, allowing VideoLLMs to observe more frames without increasing the decoding budget. Extensive experiments on LLaVA-OV, Qwen2.5VL, and Qwen3VL across six video understanding benchmarks show that EchoPrune enables VideoLLMs to process up to 20x as many frames under the same token budget, yielding improved performance (+8.6%) and inference speedup (5.6x for prefilling) on Qwen2.5VL-7B.
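The "temporal echo" intuition can be illustrated with a simplified stand-in: score each token of the current frame by how poorly its nearest neighbor in the previous frame matches it, then keep the most novel tokens. This sketch is an assumption-laden simplification, not EchoPrune itself: it uses plain cosine nearest-neighbor matching in place of the paper's correspondence and echo matching, omits the query-guided relevance term, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def novelty_scores(prev_frame, curr_frame):
    """1 - best cosine match in the previous frame; low scores mark 'echo' tokens.

    Assumes prev_frame is non-empty.
    """
    return [1.0 - max(cosine(tok, p) for p in prev_frame) for tok in curr_frame]


def keep_top_k(prev_frame, curr_frame, k):
    """Indices (in original order) of the k most novel tokens in curr_frame."""
    scores = novelty_scores(prev_frame, curr_frame)
    ranked = sorted(range(len(curr_frame)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


prev = [[1.0, 0.0], [0.0, 1.0]]
curr = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]  # token 0 repeats a previous token
scores = novelty_scores(prev, curr)
kept = keep_top_k(prev, curr, k=2)
print(scores, kept)
```

Here the first token of the current frame is an exact echo of the previous frame and is the one dropped under a budget of two tokens per frame.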

2605.10047 2026-05-12 cs.LG cs.AI

Rethinking Loss Reweighting for Imbalance Learning as an Inverse Problem: A Neural Collapse Point of View

Jinping Wang, Zixin Tong, Zhiwu Xie, Zhiqiang Gao

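Affiliations CSMT, Wenzhou-Kean University; International Frontier Interdisciplinary Research Institute, Wenzhou-Kean University

AI summary This paper rethinks loss reweighting for imbalanced learning as an inverse problem and proposes a dynamic weight adjustment strategy grounded in Neural Collapse theory. The method takes equal per-class average loss as its target and infers class weights dynamically by solving the inverse problem, mitigating the effects of class imbalance more effectively. Experiments show that it outperforms mainstream long-tailed classification methods on several datasets and aligns more closely with the ideal geometry.

Comments Accepted by ICML2026

Abstract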

Loss reweighting is a widely used strategy for long-tailed classification, but existing reweighting strategies often rely on heuristics and rarely define a well-specified target. Inspired by Neural Collapse (NC), the ideal simplex Equiangular Tight Frame (ETF) terminal geometry suggests equal per-class average loss as a reasonable target for reweighting. Based on this ideal equal-loss objective, we treat loss reweighting as an inverse problem and propose an inverse-view reweighting strategy that infers class weights dynamically to match the objective. Empirically, NC metrics suggest that our method effectively reduces the loss imbalance coefficient and achieves closer alignment with NC geometry, while consistently outperforming strong long-tailed baselines on different datasets.