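arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.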

2605.16117 2026-05-18 cs.CL

SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation

Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

AI summary: SGR improves LLM reasoning through external subgraph generation, using structured knowledge to support multi-step inference. Experiments on several benchmark datasets show consistent gains over baselines in reasoning accuracy and factual reliability.

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities across diverse NLP applications, such as translation, text generation, and question answering. Nevertheless, they remain limited in complex settings that demand deep reasoning and logical inference. Since these models are trained on large-scale text corpora, their generation process may still introduce irrelevant, noisy, or factually inconsistent content. To mitigate this problem, we introduce SGR, a stepwise framework that enhances LLM reasoning through external subgraph generation. SGR builds query-specific subgraphs from external knowledge bases and uses their semantic structure to support multi-step inference. By grounding intermediate reasoning steps in structured external knowledge, the framework helps the model concentrate on relevant entities, relations, and supporting evidence. In particular, SGR first constructs a subgraph tailored to the input question. It then guides the model to reason progressively over the generated structure and combines multiple reasoning trajectories to obtain the final prediction. Experimental results across several benchmark datasets show that SGR achieves consistent improvements over competitive baselines, highlighting its value for improving both reasoning accuracy and factual reliability.
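
The first step described above (building a query-specific subgraph) and the last (combining multiple reasoning trajectories) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the knowledge base is assumed to be a list of (head, relation, tail) triples, the subgraph is taken to be everything within a fixed number of hops of the question's seed entities, and trajectories are combined by majority vote. The abstract specifies none of these choices, and the names `extract_subgraph` and `combine_trajectories` are hypothetical.

```python
from collections import Counter, deque


def extract_subgraph(triples, seeds, max_hops=2):
    """Collect triples reachable within max_hops of the seed entities.

    Assumption: edges are traversed in both directions (undirected BFS).
    """
    neighbors = {}
    for head, rel, tail in triples:
        neighbors.setdefault(head, []).append((head, rel, tail))
        neighbors.setdefault(tail, []).append((head, rel, tail))

    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    kept = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for triple in neighbors.get(node, []):
            if triple not in kept:
                kept.append(triple)
            for other in (triple[0], triple[2]):
                if other not in seen:
                    seen.add(other)
                    frontier.append((other, depth + 1))
    return kept


def combine_trajectories(answers):
    """Majority vote over final answers; ties go to the earliest answer.

    Assumption: the abstract does not say how trajectories are combined.
    """
    if not answers:
        raise ValueError("need at least one trajectory")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer
```

In a full pipeline the triples returned by `extract_subgraph` would be serialized into the prompt for each reasoning step; that part depends on the model and is left out here.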

2605.16116 2026-05-18 cs.AI

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie, Alberto Castelo, Tianfu Wu, Lingyun Wang

AI summary: This paper proposes ShopGym, a framework that combines a simulation layer (ShopArena) and a benchmark layer (ShopGuru) to provide realistic simulation and scalable benchmarking for e-commerce web agents, and validates that synthetic shops preserve the structural properties and agent-performance signals of live storefronts.

Comments 32 pages, 10 figures

Abstract

Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.

2605.16115 2026-05-18 cs.RO

Beyond Collision Avoidance: Multi-Robot Yielding and Spatial Affordance in Emergency Evacuations

Ning Zhou, Edmund R. Hunt, Nikolai W. F. Bode

AI summary: Using a virtual evacuation experiment, this study examines how multi-robot yielding strategies affect human spatial expectations. It finds that proactive yielding outperforms freezing and efficiency-first strategies, and that environmental affordances shape cognitive expectations.

Abstract

As mobile service robots increasingly coexist with pedestrians, ensuring passively safe behaviour during confined emergency evacuations is critical. Existing multi-robot yielding strategies often focus solely on collision avoidance and macroscopic flow optimisation, overlooking environmental affordances and human spatial expectations. To bridge the gap between macroscopic theory and micro-level perception, we conducted a game-based virtual evacuation experiment (N=56). We investigated individual psychological responses to four multi-robot yielding strategies (Hide, LineEscape, Freeze, ShortestPath) across confined corridors with and without refuge niches. Our results establish a robust preference hierarchy (Hide > LineEscape > Freeze > ShortestPath), demonstrating that proactive space-yielding significantly outperforms freezing and efficiency-first approaches. Crucially, we found that environmental affordances heavily shape cognitive expectations. Actively utilising available niches amplifies the psychological comfort of proactive yielding (Hide). Conversely, failing to use an obvious niche (e.g., executing LineEscape) may trigger Expectation Violation. This is reflected in a drastically increased perceived cognitive delay, despite objectively unimpeded trajectories. Furthermore, prior robot interaction experience helps users decode complex social intents. Ultimately, this research demonstrates that safe human-robot interaction during emergencies must evolve from pure trajectory optimisation to semantically aware navigation. Future work will extend this framework to investigate complex interactions between robot swarms and pedestrian crowds.

2605.16113 2026-05-18 cs.CL cs.AI

DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation

Rui Chu, Bingyin Zhao, Thanh Quoc Hung Le, Duy Cao Hoang, Huawei Lin, Ping Li, Weijie Zhao, Khoa D Doan, Yingjie Lao

AI summary: This paper proposes DebiasRAG, a tuning-free, dynamic, query-specific debiasing framework based on retrieval-augmented generation. Its three stages (query-specific debiasing candidate generation, context candidate pool construction, and gradient-updated debiasing-guided context reranking) improve generation fairness while preserving the intrinsic properties of the LLM.

Abstract

Large language models (LLMs) have achieved unprecedented success due to their exceptional generative capabilities. However, because they depend on knowledge encapsulated from training corpora, they may produce hallucinations, stereotypes, and socially biased content. In particular, LLMs are prone to prejudiced responses involving race, gender, and age, which are collectively referred to as social biases. Prior studies have used fine-tuning and prompt engineering to mitigate such biases in LLMs, but these methods require additional training resources or domain knowledge to design the framework. Moreover, they may degrade the original capabilities of LLMs and often overlook the need for dynamic debiasing contexts for fairer inference. In this paper, we propose DebiasRAG, a novel tuning-free and dynamic query-specific debiasing framework based on retrieval-augmented generation (RAG). DebiasRAG improves fairness while preserving the intrinsic properties of LLMs, such as representation ability. DebiasRAG consists of three stages: (1) query-specific debiasing candidate generation; (2) context candidate pool construction; and (3) gradient-updated debiasing-guided context piece reranking. First, DebiasRAG leverages self-diagnosed bias contexts relevant to the query through regular retrieval, where the bias contexts are prepared offline by the DebiasRAG provider. Given the query-specific bias contexts, DebiasRAG reversely produces debiasing contexts, which are provided as additional fairness constraints for LLM outputs. Second, a regular RAG retrieval process produces query-related contexts from the regular RAG document database, such as a chunked Wikipedia dataset.

2605.16112 2026-05-18 cs.LG cs.AI

Attention Dispersion in Dynamic Graph Transformers: Diagnosis and a Transferable Fix

Jinhao Zhang, Kangfei Zhao, Qiuhao Zeng, Long-Kai Huang

AI summary: This paper identifies attention dispersion in dynamic graph Transformers under temporal distribution shift and proposes a transferable differential attention mechanism that improves performance, with the largest gains on high-shift datasets.

Abstract

Transformer-based architectures have become the dominant paradigm for Continuous-Time Dynamic Graph (CTDG) learning, yet their performance remains limited on temporally shifted datasets. In this work, we identify attention dispersion as a shared failure mode of dynamic graph Transformers under temporal distribution shift. Through controlled ablation contrasting structurally and temporally distinguished historical neighbors against random ones, we show that prediction depends on a class of critical nodes that carry consistently more predictive signal than arbitrary neighbors. However, existing Transformers fail to focus on these nodes even when they are present in the input, as temporal shift weakens attention contrast and produces overly dispersed attention distributions. This diagnosis suggests a simple and transferable fix: replace standard attention with differential attention, which suppresses common-mode attention and amplifies distinctive token-level signals. When added to three representative CTDG Transformer baselines, differential attention consistently improves performance, with gains concentrated on high-shift datasets. Attention-level measurements further confirm the mechanism, showing reduced attention entropy and increased attention mass on critical nodes. Building on these findings, we introduce DiffDyG, a reference implementation combining differential attention with standard input encodings. Across 9 benchmarks and three negative sampling protocols, DiffDyG achieves SOTA performance, with especially large gains on the most shifted datasets.
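
To make the "suppress common-mode attention" idea concrete, here is a minimal sketch of one common formulation of differential attention: the difference of two softmax attention maps, with the second scaled by a factor lambda, as in the Differential Transformer. Whether DiffDyG uses exactly this form, a learned lambda, or additional normalization is not stated in the abstract, so the fixed `lam`, the toy scores, and the renormalization used for the entropy comparison are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(weights):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def differential_weights(scores_a, scores_b, lam=0.5):
    """softmax(scores_a) - lam * softmax(scores_b) for a single query.

    The second map acts as a common-mode estimate that is subtracted out.
    """
    map_a = softmax(scores_a)
    map_b = softmax(scores_b)
    return [a - lam * b for a, b in zip(map_a, map_b)]


# Toy query over four historical neighbors; index 0 is the "critical" node.
scores_a = [2.0, 1.0, 1.0, 1.0]   # first head: mild preference for node 0
scores_b = [1.0, 1.0, 1.0, 1.0]   # second head: uniform, pure common mode

standard = softmax(scores_a)
diff = differential_weights(scores_a, scores_b)

# Renormalize only to compare shares and entropy with the standard map.
diff_share = [w / sum(diff) for w in diff]
print("critical share, standard:    ", standard[0])
print("critical share, differential:", diff_share[0])
print("entropy, standard:    ", entropy(standard))
print("entropy, differential:", entropy(diff_share))
```

In this toy case the critical node's share of attention rises and the entropy falls after subtraction, which mirrors the two attention-level measurements the abstract reports.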

2605.16107 2026-05-18 cs.CL

Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection

Chenwang Wu, Yiuming Cheung, Bo Han, Shuhai Zhang, Defu Lian

AI summary: This paper proposes a multi-level contextual token relation modeling framework that improves machine-generated text detection through a local calibration module and a global rule-based reasoning module. Experiments show substantial gains across a range of real-world scenarios.

Abstract

Machine-generated texts (MGTs) pose risks such as disinformation and phishing, underscoring the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGT generation process. We then theoretically derive the multi-hop transitions of the token-level detection score and explore their local and global relations. Based on these findings, we propose a multi-level contextual token relation modeling framework for MGT detection. Specifically, for local relations, we model them through a lightweight Markov-informed calibration module that refines token-level evidence before aggregation. For global relations, we introduce a rule-support reasoning module that uses explicit logical rules derived from contextual score statistics. Finally, we combine the local calibrated score and the global rule-support reasoning signal in a joint multi-level inference framework. Extensive experiments show broad and substantial improvements across various real-world scenarios, including cross-LLM and cross-domain settings, with low computational overhead.

2605.16103 2026-05-18 cs.AI

Sign-Separated Finite-Time Error Analysis of Q-Learning

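Donghwan Lee

AI summary: This paper proposes a sign-separated finite-time error analysis for constant step-size Q-learning. Using a switching-system representation, it decomposes the error into negative and positive parts: the negative part is governed by a linear time-invariant system associated with a fixed optimal policy, and the positive part by a linear switching system. The analysis exposes the max-induced asymmetry in Q-learning error dynamics and provides finite-time bounds for both deterministic and stochastic constant step-size recursions.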
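
As a rough illustration of the objects involved, the sketch below shows the standard tabular Q-learning update with a constant step size and an elementwise split of the error Q - Q* into its positive and negative parts. This is a sketch under assumptions, not the paper's analysis: the summary does not define the decomposition precisely, the elementwise split is a guess at its simplest reading, and `q_star` is supplied by hand here although it is unknown in practice. The function names are hypothetical.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One constant step-size Q-learning update, applied in place.

    q maps each state to a list of action values.
    """
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


def split_error(q, q_star):
    """Split the error q - q_star into positive and negative parts.

    Returns (pos, neg) with pos >= 0, neg <= 0 and pos + neg == q - q_star
    elementwise.
    """
    pos, neg = {}, {}
    for state in q:
        errors = [x - y for x, y in zip(q[state], q_star[state])]
        pos[state] = [max(e, 0.0) for e in errors]
        neg[state] = [min(e, 0.0) for e in errors]
    return pos, neg
```

The max over next-state action values in `q_update` is the nonlinearity that treats overestimates and underestimates differently, which is the asymmetry the summary refers to.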

2605.16099 2026-05-18 cs.LG cs.AI

Federated Imputation under Heterogeneous Feature Spaces

Imane Hocine, Chaimaa Medjadji, Sylvain Kubler, Gregoire Danoy, Yves Le Traon

AI summary: This paper proposes FedHF-Impute, a framework that uses a shared global feature graph to transfer knowledge across clients and improve federated imputation, outperforming baselines on datasets with simulated partial schema overlap.

Abstract

Federated Learning (FL) enables collaborative training across decentralized clients, but most methods assume aligned feature schemas, an assumption that rarely holds in tabular settings where clients observe only partially overlapping feature subsets. In these heterogeneous feature spaces, parameter-averaging methods (e.g., FedAvg) transfer little information across weakly overlapping or disjoint feature groups, limiting their effectiveness for federated imputation. To overcome this, we propose FedHF-Impute, a federated imputation framework that separates structural feature unavailability from conventional missingness and uses a shared global feature graph to propagate information across statistically related features through message passing. This enables indirect cross-client knowledge transfer, even when features are never jointly observed locally, while preserving standard federated communication. Under simulated partial schema overlap on the SECOM and AirQuality datasets, FedHF-Impute improves imputation accuracy (RMSE) over FL baselines by 26.9% and 8.4%, respectively, while achieving comparable performance on PhysioNET, with only a 0.3% difference relative to the best baseline.
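
The limitation of parameter averaging under partial schema overlap can be seen in a toy sketch. It assumes, purely for illustration, that each client holds one scalar parameter per feature it observes and that the server averages each feature's parameter over the clients that have it. This is not FedHF-Impute and not the exact FedAvg variant used as a baseline in the paper; `masked_average` is a hypothetical name.

```python
def masked_average(client_params):
    """Average per-feature parameters over the clients that observe each feature.

    client_params: list of dicts mapping feature name -> parameter value.
    Returns (averaged, support), where support[f] is the number of clients
    that contributed to feature f.
    """
    sums, support = {}, {}
    for params in client_params:
        for feature, value in params.items():
            sums[feature] = sums.get(feature, 0.0) + value
            support[feature] = support.get(feature, 0) + 1
    averaged = {f: sums[f] / support[f] for f in sums}
    return averaged, support


clients = [
    {"temp": 1.0, "pressure": 2.0},        # client A observes temp, pressure
    {"pressure": 4.0, "humidity": 6.0},    # client B observes pressure, humidity
]
averaged, support = masked_average(clients)
print(averaged)
print(support)
```

Only the shared feature receives information from both clients; features seen by a single client come back unchanged. A shared feature graph, as the abstract describes, is one way to let statistically related features exchange information even when they never co-occur on a client.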

2605.16089 2026-05-18 cs.LG cs.AI

Centralized vs Decentralized Federated Learning: A trade-off performance analysis

Chaimaa Medjadji, Guilain Leduc, Sylvain Kubler, Yves Le Traon

AI summary: Using the Fedstellar simulator, the MNIST dataset, and an MLP classifier, this paper compares the performance trade-offs of centralized, decentralized, and semi-decentralized federated learning architectures, highlighting their strengths and weaknesses in different application scenarios.

DOI: 10.1109/FiCloud62933.2024.00019

Abstract

Federated Learning (FL) has emerged as a promising paradigm for collaborative model training across distributed edge devices while preserving data privacy, especially given the rapid growth in data volume driven by the growing number of IoT devices. Storing this amount of data centrally is challenging due to issues such as limited communication, privacy, and regulations. FL can be Centralized (CFL), Decentralized (DFL), or Semi-decentralized (SDFL). Choosing the right FL architecture depends on the application's needs. However, very few studies have experimentally compared these three architectures to understand not only their respective strengths and limitations but also the trade-offs between different performance indicators. This paper addresses this gap with an experimental analysis using the Fedstellar simulator, the MNIST dataset, and an MLP classifier.

2605.16088 2026-05-18 cs.LG cs.AI

Multi-level Self-supervised Pretraining on Compositional Hierarchical Graph for Molecular Property Prediction

Xiayu Liu, Zhengyi Lu, Hou-biao Li

AI summary: This paper proposes MolCHG, a multi-level self-supervised pretraining framework for molecular property prediction. It organizes molecular structure as a compositional hierarchical graph and introduces a bond graph to strengthen bond information, so that atom-level and bond-level semantics are aggregated on an equal footing.

Comments 11 pages, 4 figures

Abstract

Self-supervised pretraining on molecular graphs has emerged as a promising approach for molecular property prediction, yet most existing methods operate at a single structural granularity and treat bond information as auxiliary edge attributes rather than as an independent semantic layer. In this work, we propose MolCHG, a multi-level self-supervised pretraining framework built upon a novel Compositional Hierarchical Graph that organizes molecular structure into four types of nodes across three semantic levels. By introducing a bond graph that operates in parallel with the atom graph, our architecture elevates bond-level information to independently evolving node representations, enabling fragment nodes to aggregate atom-level and bond-level semantics on an equal footing. We design three level-specific pretraining objectives: an atom-bond cross-view contrastive task that aligns the atom-view and bond-view representations within each fragment, a fragment-level functional group prediction task to inject domain-relevant chemical knowledge, and graph-level structure prediction tasks to encode global molecular topology. Experiments on nine MoleculeNet benchmarks demonstrate that MolCHG achieves the best performance on seven datasets across both classification and regression tasks, remaining competitive with the strongest baselines on the rest. Ablation studies further confirm that the multi-level supervision signals are complementary and that each component contributes to the overall performance.
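
The idea of promoting bonds from edge attributes to nodes can be illustrated with a line-graph construction. The sketch below assumes that two bond nodes are connected whenever the bonds share an atom; the abstract does not say how MolCHG wires its bond graph, so this adjacency rule and the name `bond_graph` are assumptions.

```python
from itertools import combinations


def bond_graph(bonds):
    """Build a bond-level graph from an atom-level bond list.

    bonds: list of (atom_i, atom_j) pairs; a bond's position in the list is
    its node id in the bond graph.
    Returns a sorted list of (bond_a, bond_b) edges, one for every pair of
    bonds that share an atom.
    """
    bonds_at_atom = {}
    for bond_id, (i, j) in enumerate(bonds):
        bonds_at_atom.setdefault(i, []).append(bond_id)
        bonds_at_atom.setdefault(j, []).append(bond_id)

    edges = set()
    for bond_ids in bonds_at_atom.values():
        for a, b in combinations(bond_ids, 2):
            edges.add((min(a, b), max(a, b)))
    return sorted(edges)


# Heavy-atom skeleton of ethanol: C0-C1-O2 has two bonds that meet at C1.
print(bond_graph([(0, 1), (1, 2)]))
```

With bonds as nodes, a message-passing network can update bond representations in parallel with atom representations, and a fragment node can then pool over both sets.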

2605.16081 2026-05-18 cs.LG cs.CV

MIND: Decoupling Model-Induced Label Noise via Latent Manifold Disentanglement

Dayong Ren

AI summary: MIND addresses model-induced label noise through latent manifold disentanglement. It dynamically projects samples into latent structural clusters to improve noise identifiability and outperforms existing methods on complex benchmarks.

Comments Accepted, to appear in ICML 2026

Abstract

The paradigm of learning from automatic annotations driven by pre-trained experts and Foundation Models dominates data-hungry applications. However, it introduces a critical challenge: model-induced label noise. Unlike stochastic noise in classical robust learning, this noise stems from annotator inductive biases, manifesting as systematic errors tightly coupled with local feature manifolds. Existing methods relying on global transition matrices underfit these structural patterns, while learning instance-specific matrices remains mathematically intractable. We propose Model-Induced Noise Decoupling (MIND), a theoretically grounded framework addressing this dilemma. We demonstrate that the high-dimensional noise manifold can be decoupled into tractable, subspace-dependent components via Latent Manifold Disentanglement. Specifically, our Latent Decoupling Estimator (LDE) dynamically projects samples into latent structural clusters with consistent error modes, facilitating noise identifiability without ground-truth anchor points. To rigorously evaluate robustness, we adopt a hierarchical protocol: moving from controlled noise on CIFAR-100 to a structural stress test on large-scale real-world 3D datasets (S3DIS, ScanNet), where error patterns explicitly couple with geometric manifolds. Empirically, MIND significantly outperforms state-of-the-art methods on these complex benchmarks and effectively corrects zero-shot hallucinations from Vision-Language Models (e.g., OpenSeg), highlighting its potential as a robust distillation framework for Foundation Models.

2605.16080 2026-05-18 cs.CV

ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation

Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang

AI summary: This paper proposes ReAlign, a framework that uses contrastive learning to distill high-quality LLM-generated reasoning texts into a lightweight AIGI detector, improving detection accuracy and generalization.

Comments Accepted by CVPR 2026

Abstract

The rise of AI-generated images (AIGIs) poses growing challenges for digital authenticity, prompting the need for efficient, generalizable image forgery detection systems. Existing methods, whether non-LLM-based or LLM-based, exhibit distinct advantages and limitations. While non-LLM-based models offer efficient low-level artifact detection, they often lack semantic understanding. Conversely, LLM-based methods provide strong semantic reasoning and explainability but are computationally intensive and less sensitive to subtle visual artifacts. Moreover, the true contribution of explanatory reasoning texts to forgery detection performance remains unclear. In this work, we investigate the intrinsic value and potential of LLM-generated reasoning texts, treating them as a source of generalization and semantic-error sensitivity. Based on these findings, we propose ReAlign, a novel framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector via contrastive learning. ReAlign effectively inherits the generalization ability and semantic sensitivity of reasoning textual representations, while remaining efficient and lightweight for deployment. Moreover, ReAlign adopts a tailored joint optimization strategy that integrates contrastive loss for image-text alignment and classification loss for accurate forgery discrimination. Experimental results on AIGCDetectBenchmark, AIGI-Holmes, and our newly constructed UltraSynth-10k demonstrate that ReAlign consistently outperforms existing state-of-the-art detectors in both accuracy and generalization, particularly when facing complex, high-fidelity forgeries from modern generative models.
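
The joint objective mentioned above, a contrastive loss for image-text alignment plus a classification loss, can be sketched as follows. This is a minimal illustration under assumptions: the contrastive term is taken to be an image-to-text InfoNCE loss over a batch of matched pairs, the classification term a binary cross-entropy, and the two are combined with a single hypothetical `weight`. The abstract does not give the actual loss forms, temperature, or weighting.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce(image_embs, text_embs, temperature=0.1):
    """Image-to-text InfoNCE: row i of each list is a matched pair."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    loss = 0.0
    for i, img in enumerate(images):
        logits = [_dot(img, txt) / temperature for txt in texts]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_z - logits[i]   # negative log-probability of the match
    return loss / len(images)


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean BCE between predicted fake-probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(probs)


def joint_loss(image_embs, text_embs, probs, labels, weight=1.0):
    """Classification loss plus weighted alignment loss (assumed form)."""
    return binary_cross_entropy(probs, labels) + weight * info_nce(image_embs, text_embs)
```

At inference time only the image branch and classifier would be needed, which is consistent with the abstract's claim that the detector stays lightweight; the text embeddings act purely as a training signal.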

2605.16079 2026-05-18 cs.CV cs.AI cs.HC

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao

AI summary: VideoSeeker integrates agentic reasoning with instance-level video understanding tasks to improve video understanding accuracy. Experiments show an average gain of 13.7% over baselines on instance-level tasks, surpassing GPT-4o and Gemini-2.5-Pro.

Comments Project Page: https://gaotiexinqu.github.io/VideoSeeker/

Abstract

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.

2605.16077 2026-05-18 cs.CL

Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction

Si-Belkacem Yamine Ketir, Lenard Paulo Tamayo, Shohei Hisada, Shaowen Peng, Shoko Wakamiya, Eiji Aramaki

AI summary: This paper proposes an LLM-driven data augmentation framework that generates oral-style monologues in different styles to improve cognitive score prediction from speech. A similarity-guided augmentation strategy reduces prediction error for the minority low-score group while maintaining performance for the majority group.

Comments 11 pages, 6 figures

Abstract

Accurate assessment of cognitive decline from spontaneous speech remains challenging due to limited dataset size and class imbalance. In this work, we propose a large language model (LLM)-driven data augmentation framework to improve the prediction of cognitive scores from speech. Experiments are conducted on a Japanese corpus in which each participant provides both a spontaneous oral narrative and a written response to the same clinical prompt. The written responses serve as semantic anchors to generate multiple oral-like monologues in different styles using GPT-5. We then predict Hasegawa Dementia Scale scores, a widely used cognitive screening tool in Japan, using a Partial Least Squares regression model trained on Sentence-BERT speech embeddings. We investigate two augmentation strategies: random class-balanced selection, which yields moderate but unstable improvements, and similarity-guided class-balanced selection. The latter prioritizes semantically close synthetic samples, leading to more consistent improvements and substantially reducing prediction error for minority low-score participants while maintaining performance for the majority group. Overall, our findings demonstrate the potential of semantically guided LLM-driven augmentation as a principled approach for addressing class imbalance and improving data efficiency in clinical speech analysis.
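
The similarity-guided, class-balanced selection step can be sketched as picking, for each score class, the synthetic samples whose embeddings are closest to a reference embedding. The abstract does not say what similarity is measured against or how classes are defined, so the per-class anchor embedding, the cosine measure, the fixed quota `per_class`, and the name `select_balanced` are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_balanced(synthetic, anchors, per_class):
    """Keep the per_class most anchor-similar synthetic samples in each class.

    synthetic: list of (sample_id, class_label, embedding)
    anchors:   dict mapping class_label -> reference embedding
    Returns a dict mapping class_label -> list of selected sample ids,
    ordered from most to least similar.
    """
    scored_by_class = {}
    for sample_id, label, embedding in synthetic:
        score = cosine(embedding, anchors[label])
        scored_by_class.setdefault(label, []).append((score, sample_id))

    selected = {}
    for label, scored in scored_by_class.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selected[label] = [sample_id for _, sample_id in scored[:per_class]]
    return selected
```

Using the same quota for every class is what makes the selection class-balanced; replacing the similarity ranking with a random shuffle gives the random baseline that the abstract describes as less stable.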

2605.16076 2026-05-18 cs.CV cs.AI

Reasoner or Translator? Contamination-Aware Evaluation and Neuro-Symbolic Robustness in Tax Law

Parisa Kordjamshidi, Samer Aslan, Madhavan Seshadri, Leslie Barrett, Enrico Santus

AI summary: This paper studies how data contamination affects LLM performance on tax law reasoning and proposes a neuro-symbolic framework to improve the reliability and robustness of legal AI.

Abstract

Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.

2605.16048 2026-05-18 cs.LG cs.AI

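Accelerated Gradient Descent for Faster Convergence with Minimal Overhead

Manuel Graca, L. Miguel Silveira, Arlindo Oliveira, Frank Liu

AI summary: This paper proposes CT-AGD, an algorithm that accelerates first-order methods by capturing local curvature, reducing the number of training epochs while keeping memory and compute overhead similar to adaptive methods such as Adam.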

Comments 17 pages

2605.16011 2026-05-18 cs.CL cs.AI

Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study

Jie Gao, Yongan Yu, Junzhu Su, Yiran Lin, Adam K. Dube, Jackie Chi Kit Cheung

AI summary: This paper examines the adaptivity of vision language models in mathematics education. It proposes a learner model-based rubric that assesses adaptivity in terms of cognitive aspects, motivational aspects, and complexity, and finds that current models struggle to produce consistent instructional responses when given limited learner information.

Abstract

Adaptive learning refers to educational technologies that track learners' learning progress and adapt the instructional process based on individual learners' learning performance. It is increasingly recognized as critical for developing an effective learning support tool. Vision language models (VLMs) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction. However, it is unknown whether VLMs have the ability to adapt to different learner profiles when providing mathematical instructions. Current VLMs lack a systematic evaluation framework for this adaptivity to different learner profiles in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and also reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when receiving limited learner information.
