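arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.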

2605.31444 2026-06-01 cs.AI cs.LO

Answer-Set-Programming-based Abstractions for Reinforcement Learning

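Rafael Bankosegger, Thomas Eiter, Johannes Oetsch

AI summary: This paper proposes using Answer-Set Programming (ASP) to implement the abstractions of the CARCASS framework, addressing the challenge that huge state spaces pose for reinforcement learning, and evaluates the approach through case studies in two domains, Blocks World and Minigrid.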

Comments Accepted for publication at the 42nd International Conference on Logic Programming (ICLP 2026). To appear in Theory and Practice of Logic Programming (TPLP)

Abstract

Reinforcement Learning (RL) enables autonomous agents to learn policies from experience, but realistic problems often involve enormous state spaces, making learning and generalisation challenging. Abstraction and approximation are therefore essential. Relational Reinforcement Learning (RRL) offers a way to reason about objects and their relations, and the CARCASS framework by Martijn van Otterlo demonstrates how logical representations can model Markov Decision Processes (MDPs) in first-order domains. Originally implemented in Prolog, CARCASS leverages domain knowledge to create powerful abstractions. We explore Answer-Set Programming (ASP), which is a rich and, contrary to Prolog, fully declarative modelling language, to realise CARCASS abstractions. We evaluate our ASP-based implementation in case studies of two domains, viz. Blocks World and Minigrid. Our results indicate that CARCASS with ASP provides a promising approach to constructing abstractions for RL, especially when domain knowledge is available.

2605.31438 2026-06-01 cs.LG

Flow map learning in nonlinear vector autoregressive models: influence of the feature-library structure on the training error

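Markus Gross

AI summary: Studies how the structure of the feature library affects the training error in nonlinear vector autoregressive models (NVAR/NG-RC), reveals the scaling laws that the training error follows as a function of time resolution, and shows that the error behavior is determined by whether the feature library can exactly represent the Lie-series coefficients of the flow map.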

Comments 35 pages, 12 figures

Abstract

Time series forecasting often requires learning nonlinear and time-delayed dependencies. A paradigmatic class of forecasting models are nonlinear vector autoregressive processes (NVAR), also known as next-generation reservoir computers (NG-RCs). These models approximate the Koopman operator on the space spanned by their explicit feature library. We consider the identifiability problem for learning Markovian nonlinear dynamical systems and show that the training error as a function of time resolution follows characteristic (pre-)asymptotic scaling laws. These laws depend on whether the feature library can represent the early Lie-series coefficients of the flow map (propagator) exactly or merely approximately. For dynamical systems governed by polynomial vector fields, we demonstrate the mechanism for NVAR/NG-RC models with monomial and Fourier feature libraries. We determine the dependence of the training error on the temporal resolution, the involved nonlinear degree, and the number of delay terms. While delay terms reduce the optimal one-step training error, they improve long-horizon forecasts only when the library provides sufficient nonlinearity. Thus, small training error coexists with weak generalization as the model class is mismatched to the true data-generating process. Numerical experiments on various chaotic dynamical systems confirm the theoretical predictions.
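
As a rough illustration of the "explicit feature library" mentioned in the abstract, the sketch below builds an NVAR/NG-RC-style feature vector from delayed states and monomials up to a chosen degree. The function name `nvar_features` and its parameters are hypothetical and not taken from the paper; the linear readout that would be fitted on these features is omitted.

```python
from itertools import combinations_with_replacement
from math import prod


def nvar_features(history, delays, degree):
    """Build a monomial feature vector from the most recent `delays` states.

    history: list of state vectors, oldest first (hypothetical layout).
    delays:  number of most recent states to include.
    degree:  highest monomial degree in the library.
    """
    # Linear part: current state first, then progressively older states.
    linear = [x for state in reversed(history[-delays:]) for x in state]
    features = [1.0] + linear  # constant term plus linear terms
    # Nonlinear part: every monomial of degree 2..degree in the linear terms.
    for d in range(2, degree + 1):
        for combo in combinations_with_replacement(linear, d):
            features.append(prod(combo))
    return features
```

With two delays of a 2-dimensional state and degree 2, this library has 1 + 4 + 10 = 15 features. Terms of the flow map that fall outside the span of such a library can only be approximated, which is the distinction the abstract ties to the different scaling regimes.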

2605.31436 2026-06-01 cs.RO

Actuator-Aware Inverse Kinematics with Joint-Limit Admissibility for Torque-Controlled Redundant Robots

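Mohammad Dastranj, Mahdi Hejrati, Jouni Mattila

AI summary: Proposes an inverse kinematics method based on convex quadratic programming that constrains joint limits with control barrier functions and resolves redundancy with a controller-compatibility objective, improving task behavior without modifying the downstream controller.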

Abstract

This paper proposes actuator-aware inverse kinematics for torque-controlled redundant robots under joint-limit constraints. In the considered architecture, the inverse-kinematic output is not merely a purely kinematic joint-velocity command; it is the required joint velocity supplied to a downstream torque-level controller. Therefore, a small commanded task residual may not necessarily improve realized motion. The proposed method formulates a convex quadratic programming problem whose decision variable is the joint-level required velocity. Control barrier function style bounds impose reference-level joint-limit admissibility, while the task equation is handled through a penalized slack variable. Redundancy is resolved using a controller-compatibility objective that accounts for previous-command consistency and actuator torque-capacity weighting. The method is independent of the particular torque-level controller and can serve as an intermediate IK layer between an endpoint trajectory and a redundant robot controller. Experiments on a virtual-decomposition-controlled seven-degree-of-freedom upper-limb exoskeleton compare the method with standard inverse-kinematic baselines and a constrained task-preserving quadratic programming baseline. The results indicate lower limit-pushing commands, bounded admissible required velocities, and improved realized task behavior in the tested trajectory, without modifying the downstream controller.
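
One common way to write control-barrier-function-style joint-limit bounds, shown here only as an assumption about what such bounds could look like, is to require the joint velocity to satisfy gamma * (q_min - q) <= qd <= gamma * (q_max - q) for a gain gamma > 0. The sketch below computes these per-joint bounds and clamps a candidate velocity to them. All names and the gain `gamma` are hypothetical, and the paper's quadratic program (slack variable, previous-command consistency, torque-capacity weighting) is not reproduced.

```python
def cbf_velocity_bounds(q, q_min, q_max, gamma):
    """Per-joint velocity bounds that tighten as a joint nears its limit.

    The upper bound gamma * (q_max - q) shrinks to zero at the upper limit,
    so the joint can approach the limit but is not commanded through it.
    """
    lower = [gamma * (lo - qi) for qi, lo in zip(q, q_min)]
    upper = [gamma * (hi - qi) for qi, hi in zip(q, q_max)]
    return lower, upper


def clamp_velocity(qd, lower, upper):
    """Project a candidate joint velocity onto the admissible box."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(qd, lower, upper)]
```

In a full formulation these bounds would enter the quadratic program as inequality constraints on the decision variable rather than being applied as a clamp after the fact.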

2605.31433 2026-06-01 cs.CL

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

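Wai-Chung Kwan, Aryo Pradipta Gema, Joshua Ong Jun Leang, Pasquale Minervini

AI summary: Proposes the SCOPE framework, which co-evolves Challenger and Solver policies to achieve data-free self-play, improving performance on open-ended tasks and surpassing methods that rely on curated prompts.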

Abstract

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.

2605.31432 2026-06-01 cs.CL cs.AI cs.SD

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

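Sara Papi, Luisa Bentivogli

AI summary: Proposes the DOA policy, which derives a proxy alignment from decoder self-attention so that SpeechLLMs can make streaming decisions in long-form simultaneous translation without any training.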

Abstract

Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based encoder-decoder models where cross-attention provides explicit alignment signals. In contrast, Speech Large Language Models (SpeechLLMs) are decoder-only architectures relying solely on self-attention. This raises a central question: whether decoder self-attention contains sufficiently stable alignment signals to guide the streaming policy. Moreover, existing approaches typically rely on training-based adaptations or heuristic wait-$k$ policies and have not been validated in long-form settings. To fill these gaps, we propose Decoder-Only Attention (DOA), a training-free policy that enables long-form simultaneous translation with off-the-shelf SpeechLLMs by deriving a proxy alignment from self-attention. Experiments on Phi4-Multimodal and Qwen3-Omni show that DOA provides an effective alignment signal for supporting streaming decisions, enabling low-latency long-form SimulST with quality close to offline decoding without retraining.
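
For readers unfamiliar with streaming policies, the sketch below illustrates the heuristic wait-k baseline that the abstract contrasts with DOA, not DOA itself: the system first reads k source units and then alternates between writing one target unit and reading one more source unit. The function name and the unit counts are hypothetical simplifications; a real system would decide on the fly rather than from known lengths.

```python
def wait_k_actions(num_source, num_target, k):
    """Return the READ/WRITE schedule of a wait-k policy.

    num_source: number of source units (e.g. speech chunks) in total.
    num_target: number of target tokens to emit.
    k:          how many source units the writer lags behind the reader.
    """
    actions = []
    read = written = 0
    while written < num_target:
        # Keep reading while the lag is below k and source input remains.
        if read < num_source and read - written < k:
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions
```

Because the schedule is fixed in advance, wait-k ignores how much of the input a given output token actually depends on, which is the gap an alignment-driven policy aims to close.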

2605.31429 2026-06-01 cs.CV

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

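Ting Chen, Geng Li, Guohao Chen, Yu Hu, Guan Huang, Mai Chen, Langsheng Lei, Jun Du

AI summary: Proposes the YARD framework, whose Y-architecture shares shallow-layer computation and branches at the middle layers, building the degraded branch by replacing visual tokens with register tokens; this enables training-free contrastive decoding that mitigates hallucinations and reduces inference latency.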

Comments 21 pages, 11 figures

Abstract

Contrastive decoding (CD) seeks to mitigate hallucinations in Large Vision-Language Models (LVLMs) by contrasting the output distributions of a standard model and a visually degraded model. However, existing training-free CD methods suffer from sub-optimal degraded branches: completely dropping visual tokens is too extreme and induces language hallucinations, while corrupting input images offers coarse control over visual evidence and suffers from high inference latency due to requiring two full forward passes. To address these dilemmas, we propose YARD, a training-free Y-Architecture Register Decoding framework. Motivated by the observation that reliable text-to-vision grounding predominantly emerges in the middle decoder layers, YARD constructs the degraded branch internally by sharing shallow-layer computations and branching exactly at this critical stage. For the degraded branch, YARD replaces patch-level visual tokens with register tokens, which preserve global image semantics but lack fine-grained local evidence. This image-aware yet locally under-grounded design provides a faithful contrastive signal without extreme modality mismatch, while the Y-architecture strictly avoids a costly second forward pass. Extensive experiments on generative and discriminative hallucination benchmarks demonstrate that YARD consistently achieves state-of-the-art hallucination mitigation across multiple LVLMs, alongside a significant reduction in inference latency.
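
The contrast between a standard and a degraded branch can be sketched with a combination rule that is common in earlier contrastive decoding work: amplify the standard logits and subtract the degraded ones. This rule, the weight `alpha`, and the function name are assumptions for illustration; the abstract does not state YARD's exact formula.

```python
def contrastive_logits(standard, degraded, alpha):
    """Combine per-token logits as (1 + alpha) * standard - alpha * degraded.

    Tokens that the degraded branch also favors (likely driven by language
    priors rather than visual evidence) are pushed down relative to tokens
    that only the standard branch supports.
    """
    return [(1.0 + alpha) * s - alpha * d for s, d in zip(standard, degraded)]
```

For example, with standard logits `[2.0, 1.5]`, degraded logits `[2.0, 0.0]`, and `alpha = 1.0`, the combined logits are `[2.0, 3.0]`: the second token, which depends on the visual evidence removed from the degraded branch, becomes the top choice.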

2605.31427 2026-06-01 cs.LG cs.DC

DG-CoLearn: An Efficient Collaborative Learning Framework for Dynamic Graphs

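Ashley Hoi-Ting Au, Zikun Zhang, Ligang He, Qiang Ni

AI summary: To address the high computational cost of repeated full-snapshot retraining in dynamic graph learning and its poor fit for collaborative settings with partitioned data, proposes DG-CoLearn, a client-oblivious collaborative learning framework built on incremental graph snapshot processing, which uses a server-mediated embedding exchange mechanism for accurate multi-hop message passing and achieves significant improvements in training speed, communication overhead, and predictive performance.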

Abstract

Dynamic graph learning (DGL) is essential for modelling evolving graph data, but existing methods suffer from significant computational overhead due to repeated full-snapshot retraining and are not well-suited for collaborative settings with partitioned data. In realistic graph systems, cross-partition edges are unavoidable, but direct sharing of graph structure between clients may violate privacy constraints. We propose DG-CoLearn, a client-oblivious collaborative dynamic graph learning framework built on incremental graph snapshot processing, which focuses computation on graph regions affected by temporal updates while preserving historical information through temporal modelling. This incremental design is consistently applied across the entire graph processing pipeline, including a server-mediated embedding exchange mechanism to enable accurate multi-hop message passing without exposing raw cross-client structural information. Extensive experiments demonstrate that DG-CoLearn achieves up to 33.8$\times$ speedup in training time and 27.4$\times$ reduction in communication overhead, while consistently improving predictive performance on both node classification (up to 13.36% F1 improvement) and link prediction (up to 8.27% MAP improvement) tasks. These results highlight the effectiveness of DG-CoLearn in bridging efficiency, scalability, and client-to-client structural privacy in collaborative dynamic graph learning.
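
One plausible reading of "focusing computation on graph regions affected by temporal updates" is to recompute only the nodes within a few hops of a changed edge. The sketch below finds that region with a breadth-first search. The selection rule, the adjacency-list layout, and all names are assumptions for illustration; the abstract does not specify how DG-CoLearn delimits the affected region.

```python
from collections import deque


def affected_nodes(adjacency, changed_edges, hops):
    """Return all nodes within `hops` steps of any endpoint of a changed edge.

    adjacency:     dict mapping node -> list of neighbor nodes (new snapshot).
    changed_edges: iterable of (u, v) pairs added or removed since the last
                   snapshot.
    hops:          message-passing depth of the model.
    """
    seeds = {node for edge in changed_edges for node in edge}
    distance = {node: 0 for node in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # do not expand beyond the requested depth
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)
```

On a path graph 0-1-2-3-4 with the edge (0, 1) changed and `hops = 1`, only nodes 0, 1, and 2 would need updated embeddings; the rest could reuse results from the previous snapshot.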

2605.31423 2026-06-01 cs.LG

Fixed Universal Transformers

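Jingwen Liu, Alexandr Andoni, Daniel Hsu

AI summary: Proposes Fixed Universal Transformers, which simulate any Transformer in a given class through the input embedding; proves that, for a sufficiently large embedding dimension, universality can be achieved with a sparse structure and that random initialization is almost surely universal, and validates the theory experimentally.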

2605.31421 2026-06-01 cs.CL cs.AI cs.DS

Neuro-symbolic Syntactic Parsing: Shaping a Neural Network with the CYK Algorithm

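Fabio Massimo Zanzotto, Federico Ranaldi, Giorgio Satta

AI summary: This paper proposes CYKNN, a recurrent neural network architecture that encodes the CYK algorithm directly as trainable matrix-vector multiplications; it outperforms large language models on simple grammar tasks and opens a new avenue for neuro-symbolic approaches.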

Comments 9 content pages
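
For context, the classical (symbolic) CYK algorithm named in the title can be sketched as a chart recognizer for grammars in Chomsky normal form. This is the textbook algorithm, not the CYKNN architecture; the rule encoding (tuples for unary and binary rules) and the function name are choices made for this illustration.

```python
def cyk_recognize(tokens, unary_rules, binary_rules, start="S"):
    """Return True if `tokens` can be derived from `start`.

    unary_rules:  list of (lhs, terminal) pairs, e.g. ("A", "a").
    binary_rules: list of (lhs, left, right) triples, e.g. ("S", "A", "B").
    """
    n = len(tokens)
    if n == 0:
        return False
    # chart[i][l - 1] holds the nonterminals deriving tokens[i : i + l].
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        chart[i][0] = {lhs for lhs, terminal in unary_rules if terminal == token}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left_cell = chart[i][split - 1]
                right_cell = chart[i + split][length - split - 1]
                for lhs, left, right in binary_rules:
                    if left in left_cell and right in right_cell:
                        chart[i][length - 1].add(lhs)
    return start in chart[0][n - 1]
```

With the rules S -> A B, S -> A T, T -> S B, A -> a, B -> b (the language of n a's followed by n b's), the input `a a b b` is accepted and `a b b` is rejected. Each chart cell is filled by combining pairs of smaller cells, which is the kind of regular structure that lends itself to a matrix formulation.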

2605.31410 2026-06-01 cs.AI

FAM-Bench: A Multimodal Benchmark for Condition-Aware Food-as-Medicine Reasoning

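Mingyang Mao, Bhargav Rishi Medisetti, Utkarsh Grover, Tanvir Ibrahim, Wenyan Li, Tingting Zhang, Xiaomin Lin

AI summary: Proposes FAM-Bench, a multimodal benchmark of 2500 nutrition-expert-verified instances that tests models' ability to reason about food choices under specific health conditions through two tasks, dish suitability assessment and comparative analysis.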

Abstract

Food-as-Medicine requires models to reason beyond what a dish is or what nutrition it contains: they must decide whether a concrete food choice is appropriate for a specific health condition. Existing food AI benchmarks primarily evaluate dish recognition, recipe understanding, nutrient estimation, or general nutrition question answering, leaving this health-aware decision layer largely untested. We introduce FAM-Bench, a multi-modal Food-as-Medicine benchmark with 2500 nutrition-expert-verified instances across 13 diet-related health conditions. The benchmark contains two complementary tasks: dish-level suitability assessment, where models judge whether a dish is suitable for a condition from its image and ingredient list, and comparative dish analysis, where models rank four candidate dishes by condition-specific suitability. Both tasks require integrating ingredient evidence, visual preparation cues, and clinical nutrition constraints, providing a standardized testbed for grounded health-aware reasoning in language and vision-language models.

2605.31408 2026-06-01 cs.CL cs.AI cs.LG

Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study

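Xiaonan Xu, Wenjing Wu

AI summary: Uses controlled experiments to study how the presentation granularity of skill knowledge affects downstream task success, finding that skill availability markedly raises success rates while changes in presentation granularity have small and uncertain effects.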

Abstract

Skill documents provide procedural knowledge to large-language-model agents at inference time. This article studies whether the presentation granularity of controlled skill knowledge changes downstream task success. The experiment uses a pinned SkillsBench version, a 30-task domain-balanced subset validated by official oracle runs, two reasoning-enabled model configurations, six skill conditions, and five trials per task-condition-model cell. Skill availability is the clearest empirical signal. Relative to no skill, skill conditions increase task-mean pass rate by 26.7 to 36.0 percentage points for GPT-5.5 and by 18.0 to 26.0 percentage points for DeepSeek V4-Flash. The final data contain 1,800 rows, with 900 rows for each model. The task is the inference unit. Five trials are aggregated within each task-condition-model cell before paired contrasts are estimated over 30 tasks. The primary presentation contrasts are smaller and uncertain. Low-abstraction guidance differs from high-abstraction guidance by +0.7 percentage points for GPT-5.5 and -6.7 percentage points for DeepSeek V4-Flash, with both 95% bootstrap confidence intervals crossing zero. Adding one worked example to medium-abstraction guidance differs from the no-example variant by +0.7 and +1.3 percentage points. Mean-reward robustness checks preserve the same substantive conclusion. In this controlled subset, skill availability is associated with higher success than no skill, while the tested presentation-granularity changes yield small, uncertain, and model-dependent effects.
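
The analysis described above (aggregate trials within each cell, then estimate paired contrasts over tasks with bootstrap confidence intervals) can be sketched as follows. This is a minimal percentile bootstrap under assumed data layouts; the function names, the number of resamples, and the seed are hypothetical, and the study's exact bootstrap variant is not stated in the abstract.

```python
import random
from statistics import mean


def task_pass_rates(trials_by_task):
    """Aggregate pass/fail trials (1 or 0) into one pass rate per task."""
    return {task: mean(trials) for task, trials in trials_by_task.items()}


def paired_bootstrap_ci(rates_a, rates_b, n_boot=2000, level=0.95, seed=0):
    """Estimate the mean per-task difference a - b and a percentile CI.

    Tasks are the resampling unit: each bootstrap replicate draws tasks
    with replacement and keeps the pairing between the two conditions.
    """
    tasks = sorted(rates_a)
    diffs = [rates_a[t] - rates_b[t] for t in tasks]
    rng = random.Random(seed)
    replicates = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    low = replicates[int((1 - level) / 2 * n_boot)]
    high = replicates[int((1 + level) / 2 * n_boot) - 1]
    return mean(diffs), low, high
```

Resampling tasks rather than individual trials matches the statement that the task is the inference unit: the five trials in a cell reduce the noise of that cell's pass rate but do not count as independent observations.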

2605.31404 2026-06-01 cs.CL cs.AI

The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning

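Xudong Zhang, Jian Yang, Shengkai Wang, Jiangpeng Tian, Shaowen Chen, Xian Wei, Ke Li, Xiong You

AI summary: Proposes a dual-intervention framework that separates linguistic structure from contextual cues through representation and context interventions, revealing a pattern in the linguistic inductive bias of LLMs for navigation planning: topological information is the backbone of robust planning, linguistic format is a double-edged sword, and semantic information is the Achilles' heel.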

Abstract

Large Language Model (LLM)-based navigation systems commonly construct explicit spatial representations (e.g., topological graphs, semantic raster maps) and translate them into textual descriptions as LLMs' inputs. However, the linguistic structures of such text-based spatial representations and the choices of contextual features (e.g., topology, geometry) they contain are often treated as neutral engineering decisions rather than key factors that shape LLMs' behavior. To fill the gap, we propose a dual-interventional framework that disentangles linguistic structures from different contextual cues to evaluate the linguistic inductive bias of LLMs for navigation planning. In the framework, representation intervention varies the linguistic format and the degree of linguistic compression, clarifying when linguistic representations support or inhibit navigation planning. Context intervention, combined with contextual feature combination and conflict probing, explicitly clarifies the preferences and weaknesses of LLMs when processing different contextual cues. Experiments across diverse spatial reasoning tasks and multiple model scales reveal a consistent pattern: topological information is a sturdy shield and the backbone of robust planning; linguistic format is a double-edged sword whose effect depends on model size, task demands, and the compression level; and semantic information is a fatal Achilles' heel -- incorrect semantic cues can systematically derail the planning process. Overall, our study shows that effective text-based spatial representations in LLM-based navigation should preserve topological integrity, calibrate representational compression to model capacity, and ensure semantic correctness, rather than simply adopting a single representation. Our code is publicly available at https://github.com/jonesdong150/LLM-Navigation-Inductive-Bias.

2605.31400 2026-06-01 cs.CV

FSM-Net: An Efficient Frequency-Spatial Network for Real-World Deblurring

Vinh-Thuan Ly

AI summary: Proposes FSM-Net, which achieves efficient dual-domain deblurring with a Frequency Attention module and a Cross-Gated Vision E-Branchformer, and took 2nd place in an NTIRE 2026 challenge.

Comments Accepted to NTIRE Workshop at CVPR 2026. Project page: https://efficient-deblurring-fsmnet.vercel.app

Abstract

Real-world image deblurring demands both high-fidelity restoration and computational efficiency, a balance existing methods often struggle to achieve. In this paper, we propose FSM-Net (Frequency-Spatial Multi-branch Network), a highly efficient solution that secured 2nd place in the NTIRE 2026 Challenge on Efficient Real-World Deblurring. FSM-Net pioneers a dual-domain approach: a novel Frequency Attention module explicitly recovers high-frequency structural details via FFT, while a Cross-Gated Vision E-Branchformer at the bottleneck captures global dependencies with linear complexity. To ensure robust convergence, we employ a progressive curriculum training strategy guided by a composite loss function (Multi-Scale Charbonnier, Structural Edge, and Frequency). Evaluated on the RSBlur benchmark, FSM-Net achieves an outstanding 33.144 dB PSNR with only 4.94M parameters and 159.35 GMACs (at 1920x1200 resolution). By effectively pushing the Pareto frontier of efficiency and quality, FSM-Net establishes a strong baseline for resource-constrained image restoration.
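
Of the loss terms listed above, the Charbonnier loss has a standard closed form, sqrt(diff^2 + eps^2) averaged over pixels, which behaves like a smooth L1 loss. The sketch below shows that single-scale form on flat lists of pixel values. The value of `eps`, the multi-scale weighting, and the edge and frequency terms are not given in the abstract and are left out.

```python
from math import sqrt


def charbonnier_loss(prediction, target, eps=1e-3):
    """Mean Charbonnier penalty between two equal-length pixel sequences.

    eps keeps the loss differentiable at zero error; its default here is
    an illustrative choice, not a value from the paper.
    """
    penalties = [sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(prediction, target)]
    return sum(penalties) / len(penalties)
```

For large errors the penalty grows linearly, so outlier pixels influence training less than under a squared-error loss.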

2605.31393 2026-06-01 cs.CL cs.AI

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

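Pedro Dal Bianco, Jean Paul Nunes Reinhold, Oscar Stanchi, Facundo Quiroga, Franco Ronchetti, Ulisses Brisolara Corrêa

AI summary: To address scarce parallel corpora and heavy-tailed target vocabularies in sign language translation, proposes target-side augmentation that uses GPT-4o to generate controlled paraphrase variants of reference sentences, and evaluates the method on datasets for three sign languages.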

Comments Accepted at GenSign (https://genai4sl.github.io/) at CVPR 2026. Non proceedings track

Abstract

Sign language translation (SLT) remains constrained by limited paired sign-video/text corpora and heavy-tailed target vocabularies. We study target-side augmentation in which GPT-4o generates controlled paraphrase variants of reference sentences while the sign input remains unchanged. A Signformer-style pose-based Transformer is trained under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate on three datasets spanning complementary challenges: PHOENIX14T (German Sign Language), with moderate lexical diversity; GSL (Greek Sign Language), with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), with severe long-tail sparsity. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33. The near-saturated GSL baseline and extremely sparse LSA-T setting reveal the limits of the approach. To our knowledge, this is the first study to apply LLM-generated target-side paraphrases and LLM-as-a-Judge evaluation to SLT. The semantic evaluation reveals gains in fidelity that lexical overlap metrics understate.

2605.31388 2026-06-01 cs.LG

Constrained Multi-Objective Reinforcement Learning with Max-Min Criterion

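Giseung Park, Hyunyoung Nam, Woohyeon Byeon, Amir Leshem, Youngchul Sung

AI summary: Proposes a multi-objective reinforcement learning framework that combines the max-min criterion with explicit constraint satisfaction, verifies convergence through theoretical analysis and tabular experiments, and demonstrates its effectiveness at balancing fairness and constraint satisfaction in building thermal control, multi-objective locomotion control, and greenhouse-gas-emission-aware traffic management.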

Comments Accepted to ICML 2026

2605.31387 2026-06-01 cs.CL cs.RO

Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

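Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

AI summary: Evaluates vision-language models on a collaborative spatial reasoning task using a multi-turn multi-agent dialogue framework, finding that visual spatial understanding remains the main bottleneck and that text representations and decomposed image representations partially improve performance.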

Comments Preprint

Abstract

Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language. Vision-language models (VLMs) support robotic tasks involving visual interpretation, question answering, and instruction following, but their capabilities in collaborative dialogue tasks requiring spatial reasoning remain underexplored. We study this gap through a collaborative structure-building task that combines visual interpretation, grounding, language-guided interaction, and action generation. We develop a framework in which VLMs use dialogue to reconstruct a target structure from visual and textual inputs. We evaluate open-weight and closed VLMs across interaction settings, input modalities, and image representations. Results show that spatial reasoning over visual representations remains difficult for the evaluated VLMs. Detailed text representations of the target yield higher reconstruction success across modality conditions, while decomposed image representations improve performance. These findings reveal limits in visual spatial grounding and grounded instruction generation for collaborative VLM agents.

2605.31378 2026-06-01 cs.CL

Unlocking Fine-Grained Translation Quality Estimation in LRMs through Synergistically Evolving Implicit and Explicit Reasoning

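Renfei Dang, Xinye Wang, Zhejian Lai, Weilu Xu, Shimin Tao, Daimeng Wei, Min Zhang, Shujian Huang

AI summary: To address the difficulty large reasoning models have with fine-grained translation quality estimation, proposes RIEQE, a two-stage training framework in which NonThinking-SFT (implicit reasoning) and Thinking-RLVR (explicit reasoning) co-evolve, surpassing baselines on the WMT test sets.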

Abstract

Large Reasoning Models (LRMs) still struggle with fine-grained translation quality estimation (QE), even with long reasoning chains. We argue that LRMs already possess strong multilingual capabilities, while the core challenge stems from the intrinsic difficulty of learning the fine-grained QE task. In this paper, we propose RIEQE (Reasoning both Implicitly and Explicitly for QE), a simple two-stage training framework that enables the co-evolution of implicit (layer-wise) and explicit (token-wise) reasoning capabilities. To make implicit reasoning feasible, we first decompose the complex QE task into straightforward subtasks. Based on this, our two-stage approach applies: (1) NonThinking-SFT, Supervised Fine-Tuning (SFT) without reasoning chains to directly boost the model's implicit reasoning tendency and capability; and (2) Thinking-RLVR, standard Reinforcement Learning with Verifiable Reward (RLVR) to subsequently strengthen explicit reasoning. Results demonstrate that implicit and explicit reasoning synergistically co-evolve under our framework. On the WMT test sets, RIEQE based on Qwen3-4B-Thinking-2507 surpasses all baselines in explicit reasoning performance, while its implicit reasoning capability is also comparable to the best current encoder-based models. We further provide evidence for the synergistic collaboration between implicit and explicit reasoning, showing how they mutually benefit each other.

2605.31376 2026-06-01 cs.RO cs.CV cs.GR

LiftNav: Path Planning via Semantic Lifting in TSDF-Guided Gaussian Splatting

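Hannah Schieber, Dominik Frischmann, Victor Schaack, Angela P. Schoellig, Daniel Roth

AI summary: Proposes LiftNav, a hybrid navigation framework that combines a dual TSDF+GS map, YOLO detection, TSDF-based 3D lifting, and B-spline trajectory optimization to enable flexible semantic navigation without dense 3D embeddings; a hinge-loss collision penalty improves trajectory smoothness and safety, and simulations on the Replica dataset achieve 100% feasibility and shorter trajectories.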

2605.31373 2026-06-01 cs.LG cs.AI

Scaling Higher-Order Graph Learning with Maximal Clique Complexes

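Antoine Vialle, Aref Einizade, Fragkiskos D. Malliaros, Jhony H. Giraldo

AI summary: Proposes simplified and decomposed cellular Weisfeiler-Leman tests and maximal clique complexes which, combined with the CliqueWalk random walk, enable scalable higher-order graph neural networks.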

2605.31371 2026-06-01 cs.LG

Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling

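Dmitrii Feoktistov, Timofey Belinsky, Andrey Veprikov, Amir Zainullin, Aleksandr Beznosikov

AI summary: Proposes the SoftSignum and SoftMuon optimizers, which replace the hard sign map with a temperature-controlled soft-sign transformation and add an adaptive quantile-based temperature schedule to address the problems sign-based optimizers have with parameter heterogeneity and terminal convergence; proves convergence in the stochastic non-convex setting, and experiments show improvements over hard-sign optimizers and AdamW on diverse deep learning tasks, including LLM pretraining.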

Comments 9 pages, 3 tables, 4 Figures

Abstract

Sign-based and LMO-inspired optimizers have recently attracted substantial attention in deep learning due to their strong performance and low memory footprint. However, their fixed-magnitude updates can hurt terminal convergence: they decouple update mechanisms from gradient magnitudes and fail to account for parameter heterogeneity, often leading to oscillation rather than convergence. We propose SoftSignum, a smooth relaxation of sign-based optimization that replaces the hard sign map with a temperature-controlled soft-sign transformation, enabling a parameter-wise transition from sign-like updates to magnitude-sensitive SGD-like steps. We complement it with an adaptive quantile-based temperature schedule and extend the same principle to matrix-valued optimizers, obtaining SoftMuon. We also develop a generalized geometry-relaxation framework based on strongly convex regularizers and Fenchel conjugates, proving convergence in the stochastic non-convex setting. Experiments on diverse deep learning tasks, including LLM pretraining, show that SoftSignum and SoftMuon consistently improve over their hard sign-based counterparts and standard AdamW.
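
The abstract does not give the exact form of the soft-sign transformation, so the following is only a minimal sketch under stated assumptions: `tanh(g / T)` stands in for the temperature-controlled soft sign, and the temperature is taken as an empirical quantile of the absolute gradient entries. The names `soft_sign`, `quantile_temperature`, and `soft_sign_step` are hypothetical and are not taken from the paper. The sketch only illustrates the qualitative behavior described above: a small temperature gives sign-like updates, a large one gives steps proportional to the gradient.

```python
import math


def soft_sign(grad, temperature):
    """Assumed soft-sign map, applied entry-wise: tanh(g / temperature)."""
    return [math.tanh(g / temperature) for g in grad]


def quantile_temperature(grad, q=0.5):
    """Hypothetical schedule: q-th empirical quantile of |g| (nearest rank)."""
    mags = sorted(abs(g) for g in grad)
    index = min(len(mags) - 1, int(q * len(mags)))
    # Guard against a zero temperature when the selected entry is exactly zero.
    return max(mags[index], 1e-12)


def soft_sign_step(params, grad, lr, temperature):
    """One descent step using the soft-sign of the gradient as the update."""
    update = soft_sign(grad, temperature)
    return [p - lr * u for p, u in zip(params, update)]


grad = [-2.0, 0.5, 3.0]
sign_like = soft_sign(grad, 1e-6)   # tiny temperature: saturates to +/-1
sgd_like = soft_sign(grad, 1e6)     # huge temperature: close to g / temperature
```

With a tiny temperature every entry moves by the same magnitude regardless of gradient size; with a large temperature the update keeps the relative magnitudes of the gradient entries.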

2605.31370 2026-06-01 cs.AI

HypoAgent: An Agentic Framework for Interactive Abductive Hypothesis Generation over Knowledge Graphs

Yisen Gao, Yixi Cai, Tianshi Zheng, Jiaxin Bai, Yangqiu Song

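AI summary: Proposes HypoAgent, a framework that uses three agents (intent recognition, hypothesis generation, and root cause analysis) for interactive abductive hypothesis generation over knowledge graphs, achieving state-of-the-art semantic similarity on commonsense and biomedical knowledge graphs.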

Comments: Under review

Abstract

Abductive reasoning over knowledge graphs aims to generate logical hypotheses that explain observed entities or facts. Existing controllable hypothesis generation methods allow users to guide this process with explicit conditions, but they remain limited in interactive settings: they struggle to ground evolving natural-language intents across multi-turn dialogues and provide little fine-grained diagnosis when generated hypotheses fail. To address these limitations, we propose HypoAgent, an Agentic framework for interactive abductive Hypothesis Generation over knowledge graphs. HypoAgent integrates three agents: an Intent Recognition Agent that grounds user utterances and dialogue history into executable KG conditions, a Hypothesis Generation Agent that performs controllable hypothesis generation according to the extracted user intention, and a Root Cause Analysis Agent that diagnoses unreliable hypothesis fragments and leverages KG neighborhood probing to identify supported refinements. Experiments on commonsense and biomedical domain-specific knowledge graphs demonstrate that HypoAgent achieves state-of-the-art semantic similarity under single-turn, multi-turn, and unconditional settings. Our code is available at https://github.com/HKUST-KnowComp/HypoAgent.

2605.31369 2026-06-01 cs.LG cs.CV

A Unifying View of Variational Generative Wasserstein Flows

Paul Caucheteux, Clément Bonet, Anna Korba

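AI summary: Presents a unified theoretical framework, Generative Wasserstein Flows (GWF), that views many existing generative models as instances of parametric JKO schemes for f-divergence objectives, extends the framework to Integral Probability Metrics and Maximum Mean Discrepancy, derives new algorithms, and clarifies the connections with GANs.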

Comments: Accepted as a spotlight at ICML 2026

Abstract

Many modern generative models can be viewed as minimizing divergences between probability distributions, yet they rely on different algorithmic and geometric principles. Wasserstein gradient flows provide a continuous-time formulation for optimizing over distributions, and can be approximated through their implicit discretization via the Jordan-Kinderlehrer-Otto (JKO) scheme. In this work, we present a unified theoretical framework for generative modeling based on Wasserstein gradient flows, which we refer to as Generative Wasserstein Flows (GWF). We show that a broad class of existing methods can be derived as instances of parametric JKO schemes for $f$-divergence objectives, and we establish equivalences between several recently proposed algorithms. We extend this framework beyond $f$-divergences to Integral Probability Metrics and squared Maximum Mean Discrepancy, deriving new JKO-based generative algorithms and clarifying their connections with GANs. We empirically study the impact of JKO regularization across a wide set of objectives. Finally, we analyze parametric Wasserstein flows, where the dynamics are restricted to distributions induced by parametrized maps.
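
For reference, the JKO scheme named in the abstract is usually written as a proximal step in Wasserstein space. The notation below ($\mathcal{F}$ for the objective functional, $\tau > 0$ for the step size, $\mu_k$ for the $k$-th iterate, $W_2$ for the 2-Wasserstein distance) is the common textbook form and is an assumption here, not copied from the paper:

```latex
\mu_{k+1} \in \operatorname*{arg\,min}_{\mu \in \mathcal{P}_2(\mathbb{R}^d)}
  \; \mathcal{F}(\mu) + \frac{1}{2\tau} W_2^2(\mu, \mu_k)
```

As $\tau \to 0$ the iterates approximate the Wasserstein gradient flow of $\mathcal{F}$. Choosing $\mathcal{F}$ as an $f$-divergence between $\mu$ and a target distribution gives the case the abstract starts from; in a parametric variant the minimization would run only over distributions induced by a parametrized map.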

2605.31367 2026-06-01 cs.LG cs.CL

Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing

Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen

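AI summary: Proposes a unified framework that decomposes token mixing layers into direct input-output influence and recurrent propagation, designs structured recurrence patterns that provably trade runtime complexity against expressivity, and validates them on synthetic tasks and language modeling.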

Comments: 20 pages, 3 figures, ICML 2026 main

Abstract

Token mixing layers play a key role in how language models learn and generate long-range dependencies. Their efficiency rests on a necessary trade-off between decoding speed and memory requirements, along with cache size. Considering causal generation, this paper explores new trade-offs through a unified framework that separates two crucial features: (i) the direct influence of inputs on outputs in one generation step; (ii) the recurrent propagation of information through past outputs. This framework encompasses major architectures such as attention and state-space models, and also generalizes the recurrence equations by allowing each state to depend on multiple past states rather than only the immediate predecessor. By introducing structure, we design new recurrence patterns that provably achieve the desired complexity, while providing theoretical insights on their expressivity, trading runtime for expressivity in a principled way. Empirical validation is performed on synthetic tasks and on language modeling. Together, these results provide a unified toolkit for the understanding and design of efficient and expressive token mixers across model families.

2605.31365 2026-06-01 cs.AI

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

Weile Chen, Bingchen Miao, Qifan Yu, Wendong Bu, Guoming Wang, Wenqiao Zhang, Shengyu Zhang, Juncheng Li, Siliang Tang

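AI summary: Proposes the SCALE framework, in which three adversarial roles (Selector, Predictor, and Judger) autonomously discover the agent's limitations and expand its cognitive boundaries through environmental exploration. Together with the SCALE-Hop graph exploration strategy and the SCALE-20k dataset, it significantly improves the performance and generalization of multimodal large language models across web environments.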

Comments: 24 pages

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have led to promising progress in web agents. However, existing web agents often rely on handcrafted execution pipelines or expensive expert trajectories, limiting their adaptability to complex, dynamic environments. To address these challenges, we propose SCALE (Self-Cognitive-Aware Learning and Exploration), which leverages three adversarial roles (Selector, Predictor, and Judger) to autonomously discover the agent's limitations and expand its cognitive boundaries through environmental exploration. Moreover, we propose SCALE-Hop, a graph exploration strategy that facilitates global planning and helps agents avoid local exploration traps. To further support learning, we construct SCALE-20k, a large-scale dataset collected from 19 real-world websites, containing diverse task types and structured demonstrations generated from SCALE's exploration traces. Experimental results show that our approach significantly improves the performance and generalization of multiple MLLMs in various web environments. Our framework offers a scalable and generalizable solution for building truly autonomous and adaptive web agents.

2605.31363 2026-06-01 cs.CL

The Latin Substrate: How Language Models Represent and Mediate Script Choice

Daniil Gurgurov, Alan Saji, Katharina Trinley, Josef van Genabith, Simon Ostermann

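AI summary: Using logit-lens, representational, and mechanistic analyses, finds consistent latent romanization during transliteration and a small set of late-layer attention heads that causally mediate script choice. Non-Latin output is produced by a compact, identifiable gate, whereas Latin output emerges from diffuse contributions across the network, suggesting that models organize script variation around shared latent representations with a bias toward a Latin substrate.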

Comments: Preprint

Abstract

Many languages are written in multiple scripts, requiring large language models (LLMs) to generate equivalent linguistic content in distinct orthographic forms. While prior work suggests that LLMs route information through shared latent representations, how they internally mediate script variation remains poorly understood. We study this question by first examining per-layer output distributions with the logit lens, which reveals consistent latent romanization during transliteration, and then through representational and mechanistic analyses of script generation. At the representational level, we show that scripts of the same language become increasingly separable across layers and that a simple linear steering direction can flip a model's output script while largely maintaining semantic content. The vector generalizes asymmetrically to writing systems unseen during construction, flipping non-Latin output to Latin reliably, but mapping Latin output into varied non-Latin scripts. At the mechanistic level, we localize a small set of late-layer attention heads that causally mediate script choice. These heads transfer across unrelated languages and writing systems, suggesting that script routing is implemented by language-agnostic components. Across both analyses, we observe a consistent directional asymmetry: non-Latin output is produced by a compact, identifiable gate, while Latin-script output emerges from diffuse contributions across the network. Collectively, our findings hint that LLMs organize script variation around shared latent representations while exhibiting a privileged substrate toward Latin script.
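
The abstract mentions a "simple linear steering direction" without saying how it is built. A common construction, assumed here and not confirmed by the source, is the difference between the mean hidden states of two groups of examples, added to a hidden state with a scale factor. The sketch below uses toy 2-D vectors in place of real model activations; `steering_direction`, `steer`, and `alpha` are hypothetical names.

```python
def mean_vector(vectors):
    """Entry-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def steering_direction(target_states, source_states):
    """Assumed construction: mean of target states minus mean of source states."""
    mean_target = mean_vector(target_states)
    mean_source = mean_vector(source_states)
    return [t - s for t, s in zip(mean_target, mean_source)]


def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the steering direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy stand-ins for hidden states of Latin-script and non-Latin-script outputs.
latin_states = [[1.0, 0.0], [3.0, 0.0]]
non_latin_states = [[0.0, 2.0], [0.0, 4.0]]

direction = steering_direction(latin_states, non_latin_states)
steered = steer([0.0, 3.0], direction, alpha=1.0)
```

In this toy setup the steered vector lands on the mean of the Latin group; in a real model the shifted state would be passed on through the remaining layers, and whether the output script flips is an empirical question.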

2605.31360 2026-06-01 cs.LG cs.AI

dashi: A Python library for Dataset Shift Characterization to Support Trustworthy AI Development and Deployment

David Fernández-Narro, Pablo Ferri, Ángel Sánchez-García, Juan M. García-Gómez, Carlos Sáez

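AI summary: Introduces dashi, an open-source Python library that explores, quantifies, and characterizes dataset shifts through unsupervised methods (based on information geometry and non-parametric statistical manifolds) and supervised methods, supporting trustworthiness assessment across the AI life cycle.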

Abstract

The Artificial Intelligence (AI) life cycle requires a thorough understanding of the underlying data dynamics for robust, safe and cost-effective AI development and use. Dataset shifts are defined as changes between train and test data distributions. Whether occurring over time (temporal) or across different sites (multi-source), they can severely degrade model performance and compromise data quality. This is particularly important in health AI, where the safety and fundamental rights of patients can be severely affected by uncontrolled shifts at both the training and operational stages. While the theoretical foundations of covariate, prior, and concept shifts are well established, there is a lack of accessible and comprehensive software tools to perform their analysis. We introduce dashi, an open-source Python library designed for the exploration, quantification, and characterization of dataset shifts. dashi provides a dual approach: an unsupervised approach that leverages information geometry and non-parametric statistical manifolds for characterizing and analyzing data variability (e.g., Information Geometric Temporal plots and Multi-Source Variability metrics like Global Probabilistic Deviation and Source Probabilistic Outlyingness), and a supervised approach that quantifies and characterizes model performance degradation. Both unsupervised and supervised approaches work across user-defined temporal and domain/source batches. We demonstrate the utility of dashi on three simulated and real-world health AI case studies on gestational diabetes mellitus, COVID-19 and emergency medical dispatch. By providing interactive visual analytics and variability metrics, dashi supports the trustworthiness of AI life cycle stages, enabling robust and safe machine learning pipelines through the assessment of data coherence and AI performance.
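
To make the idea of comparing batch distributions concrete, here is a minimal sketch that does not use dashi's API (which the abstract does not describe). It assumes each batch is summarized as histogram counts over shared bins and uses the Jensen-Shannon distance as the dissimilarity; that choice of metric is an assumption made for illustration, and all function names are hypothetical.

```python
import math


def normalize(counts):
    """Turn histogram counts into a probability vector."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(counts_a, counts_b):
    """Jensen-Shannon distance between two histograms (0 = identical, 1 = disjoint)."""
    p, q = normalize(counts_a), normalize(counts_b)
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
    return math.sqrt(max(jsd, 0.0))


# Hypothetical histograms of one variable in three temporal batches.
january = [50, 30, 20]
february = [48, 32, 20]
december = [10, 20, 70]

small_shift = js_distance(january, february)
large_shift = js_distance(january, december)
```

Computing this distance for every pair of temporal or source batches yields a dissimilarity matrix, which is the kind of input a low-dimensional embedding or variability metric could be built on.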

2605.31354 2026-06-01 cs.AI cs.LG

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Yunpeng Zhou

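AI summary: Studies failure modes of collaborative reasoning with weak learners (4B-8B models) under shared working memory through the lens of noise accumulation. Proposes CoSee, an auditing framework that traces information flow in document visual question answering, finds that naive shared workspaces amplify hallucinations rather than resolve them, and identifies two dominant failure modes: Noise Reinforcement and Policy Collapse.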

Abstract

Modular visual reasoning systems increasingly rely on shared working memory for multi-step collaboration, yet the failure dynamics of intermediate state evolution in low-capacity regimes remain underexplored. We study failure modes of collaborative reasoning with weak learners (4B-8B models) through the lens of noise accumulation. We introduce CoSee, an auditing framework that formalizes the read-write-verify loop to trace information flow in document visual question answering. Across multi-page, chart, and web-based benchmarks, we find a counter-intuitive degradation: naive shared workspaces often amplify hallucinations rather than resolve them. We identify two dominant failure modes: Noise Reinforcement, where ungrounded notes are reused as evidence, and Policy Collapse, where added context shifts the model toward under-specified, short-form answers. Using cost-accuracy Pareto frontiers, we show that increased compute can correlate negatively with performance without explicit verification. Our findings suggest that for resource-constrained agents, the bottleneck lies not in reasoning depth but in communication fidelity, providing trace-level diagnostics and a mechanistic baseline for reliable modular design.

2605.31352 2026-06-01 cs.RO

Haptic Sorter: A Unified Planning Framework for Online Shape Estimation and Real-Time Pose Inference

Zhuoyi Lu, Lin Yang, Sri Harsha Turlapati, Domenico Campolo

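AI summary: Proposes a unified model-based geometric framework for shape estimation and pose tracking in robotic manipulation, combining Bayesian-optimization-guided haptic exploration, superellipse shape approximation, an adaptive formulation of the manipulation potential, and an online ordinary differential equation for real-time pose inference.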

Abstract

Robotic manipulation usually assumes that the shape and pose of the object are known to the robot prior to motion planning. However, precise geometric information is not always available in practice, and pose inference suffers from sensor uncertainties and view occlusion. In this work, we propose a unified model-based geometric framework integrating robotic haptic perception, modeling, and manipulation planning. Our contributions are: i) introducing Bayesian Optimization (BO) to guide haptic exploration for object shape inference, where superellipses are used to approximate the geometric boundary; ii) an adaptive formulation of the manipulation potential that encodes object geometry for quasi-static robot-object interaction; iii) an online Ordinary Differential Equation (ODE) for real-time pose inference based on model prediction and tactile feedback. We deploy our system on a 2D robotic sorting task and vary object geometries to validate the robustness and generalizability of our framework in both simulation and a real-world multi-arm setup.
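
As a reminder of what a superellipse boundary looks like, the sketch below uses the standard implicit form |x/a|^n + |y/b|^n = 1 for an axis-aligned shape centered at the origin. How the paper parameterizes the shape (for example, how position and orientation enter) is not stated in the abstract, so the parameter names `a`, `b`, `n` and both function names are assumptions for illustration only.

```python
import math


def superellipse_value(x, y, a, b, n):
    """Implicit function: below 1 inside, equal to 1 on the boundary, above 1 outside."""
    return abs(x / a) ** n + abs(y / b) ** n


def superellipse_point(t, a, b, n):
    """Boundary point for angle parameter t (standard signed-power parameterization)."""
    c, s = math.cos(t), math.sin(t)
    x = a * math.copysign(abs(c) ** (2.0 / n), c)
    y = b * math.copysign(abs(s) ** (2.0 / n), s)
    return x, y


# n = 2 gives an ordinary ellipse; larger n approaches a rounded rectangle.
boundary = [superellipse_point(2 * math.pi * k / 16, 2.0, 1.0, 4.0) for k in range(16)]
residuals = [abs(superellipse_value(x, y, 2.0, 1.0, 4.0) - 1.0) for x, y in boundary]
```

A shape-fitting procedure could, for instance, score candidate `(a, b, n)` values by how close the implicit function is to 1 at measured contact points; that scoring rule is again an illustration, not the paper's method.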

2605.31351 2026-06-01 cs.CL cs.CV

A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

Yi Zhao, Siqi Wang, Zhe Hu, Yushi Li, Jing Li

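AI summary: To address the reliability of VLM-as-a-Judge evaluation for visually impaired assistance tasks, proposes the VIABLE benchmark (over 300K samples, an Effectiveness-Impartiality-Stability framework, and a 12-mode failure taxonomy), finds that existing models are unreliable, and develops the VIA-Judge-Agent method to improve diagnostic accuracy and user preference.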

Abstract

AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general domains. We therefore ask whether such judges can be trusted for VIA tasks. To investigate this question, we introduce VIABLE (Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation), the first benchmark for VLM-as-a-Judge evaluation in VIA. VIABLE contains over 300K judgment samples across three scenarios and introduces an Effectiveness-Impartiality-Stability framework with a 12-mode failure taxonomy. Based on VIABLE, our systematic study of seven judges across different model scales shows that existing models are largely unreliable across all evaluation axes. The strongest judge, GPT-5.4, achieves only 52.6% single-failure diagnostic accuracy yet exhibits the highest self-preference rate at 94.2%, while open-source judges are strongly biased and adversarially fragile. To address these issues, we propose VIA-Judge-Agent, a model-agnostic inference-time harness that augments judges with visual evidence extraction and a taxonomy-guided workflow. It improves diagnostic accuracy and yields downstream VIA responses that BLV users prefer. Data and code are available at: https://github.com/YiyiyiZhao/VIABLE

2605.31349 2026-06-01 cs.CL cs.AI cs.CV cs.MM

FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection

Paramananda Bhaskar, Naquee Rizwan, Daksh Jogchand, Saurabh Kumar Pandey, Animesh Mukherjee

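AI summary: Because existing benchmarks do not allow causal evaluation of vision-language model vulnerabilities, proposes the FBHM benchmark, built from 25 rhetorical functionalities and 10 target communities, and uses learnable steering vectors (LSV) to improve model performance by about 30 Macro-F1 points in an ultra-low data regime.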

Abstract

Hateful meme detection remains a formidable challenge for vision-language models (VLMs), as existing benchmarks are structurally observational, confounding rhetorical hate mechanisms with target community features and preventing causal evaluation of model vulnerabilities. To address this, we introduce FBHM, a systematically curated benchmark of Functionality Based Hateful Memes constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes total). Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models that are highly accurate on standard datasets drop catastrophically to near-random performance on FBHM, indicating that they exploit dataset-specific heuristics rather than robust multimodal reasoning. To efficiently close this gap, we propose LSV (learnable steering vectors), a strategy for the ultra-low data regime that applies a causal intervention objective on as few as 500 steering samples (50 unique base memes), boosting FBHM performance by about 30 Macro-F1 points while outperforming in-context learning and PEFT without degrading source-domain performance.
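
Since the headline number is given in Macro-F1 points, a short sketch of the metric may help. Macro-F1 is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of its frequency. The function name and the toy labels below are illustrative; a gain of about 30 points would correspond to about 0.30 on the 0-1 scale used here, assuming the paper reports scores on a 0-100 scale.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy binary example: 1 = hateful, 0 = not hateful.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
score = macro_f1(y_true, y_pred)  # per-class F1: 0.8 for class 0, 2/3 for class 1
```

Because the average is unweighted, a model that ignores a rare class is penalized more heavily than it would be under plain accuracy.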
