2606.14516 2026-06-15 cs.AI cs.CL cs.CY New submission

Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models

Chenyu Zhou, Qiliang Jiang, Boguang Pan

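Affiliations: School of Engineering, Institute of Science Tokyo; College of Control Science and Engineering, Zhejiang University; Graduate School of Information, Production and Systems, Waseda University

AI summary: Studies how dense coordinate-list fine-tuning affects the structured outputs of vision-language models (e.g. repetition and termination), and finds that it produces a structure-bound, cross-family interference surface that can be measured and controlled through target-signal separation and structure-axis probes.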

Abstract

Fine-tuning vision-language models to emit dense coordinate lists improves visual grounding but also changes how models serialize, repeat, and terminate structured outputs. We study this behavior as a generation and control surface. In Gemma 4 12B, high-capacity q/k/v/o LoRA raises class-aware F1@0.3 from 0.007 to 0.448 while inducing repeated-tail pressure (duplicate rate 0.080, max repeat 23). A q/v rank sweep keeps max repeat at 21-22 across ranks 4-64, showing capacity persistence. The target signal is separable: object-level repeat-stop removes exact repeated records (duplicate rate 0.000, max repeat 1) while preserving F1 (0.494 to 0.490) and stricter F1@0.5 (0.381 to 0.385). Structure-axis probes localize the effect to bbox-coordinate object lists; dense non-bbox and spatial/count JSON remain repeat-clean, including under high-capacity adapters. Qwen3-VL-8B reproduces a clean controlled endpoint (F1@0.3 0.318, duplicate rate 0.000), and COCO 2017 reproduces acquisition plus duplicate pressure. Dense coordinate-list adaptation therefore creates a structure-bound, cross-family interference surface that can be measured and controlled.
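The repeat metrics and the repeat-stop control in this abstract can be illustrated with a minimal sketch. The abstract does not define them, so the definitions here are assumptions: duplicate rate is taken as the fraction of records that exactly repeat an earlier record, max repeat as the highest count of any single record, and repeat-stop as truncating the list at the first exact repeat. The function names and the `(label, x1, y1, x2, y2)` record layout are hypothetical.

```python
from collections import Counter


def duplicate_rate(records):
    """Fraction of records that exactly repeat an earlier record (assumed definition)."""
    if not records:
        return 0.0
    return (len(records) - len(set(records))) / len(records)


def max_repeat(records):
    """Largest number of times any single record appears (assumed definition)."""
    return max(Counter(records).values(), default=0)


def repeat_stop(records):
    """Keep records up to, but not including, the first exact repeat (assumed behaviour)."""
    seen = set()
    kept = []
    for rec in records:
        if rec in seen:
            break
        seen.add(rec)
        kept.append(rec)
    return kept


# Hypothetical model output: two distinct boxes followed by a repeated tail.
raw = [("cat", 10, 20, 50, 60), ("dog", 5, 5, 30, 40)] + [("dog", 5, 5, 30, 40)] * 3
clean = repeat_stop(raw)
```

Under these assumed definitions, `raw` carries a repeated tail while `clean` contains each record once, which mirrors the "duplicate rate 0.000, max repeat 1" endpoint reported above.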

2606.14502 2026-06-15 cs.AI New submission

From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI

Yongheng Zhang, Ziang Liu, Jiaxuan Zhu, Shuai Wang, Xiangqi Chen, Haojing Huang, Jiayi Kuang, Siyu Chen, Ao Shen, Hao Wu, Qiufeng Wang, Qian-Wen Zhang, Junnan Dong, Wenhao Jiang, Ying Shen, Hai-Tao Zheng, Yinghui Li, Di Yin, Xing Sun, Philip S. Yu

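Affiliations: arXiv

AI summary: Frames the evolution of LLMs as a paradigm shift from chatbot to digital colleague along two dimensions, the cognitive core (Thinking LLMs) and tool-augmented task execution (OpenClaw-style workstation systems), enabling persistent work, state persistence, reusable skills, and self-improvement.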

Comments The paper is available on the project website: https://from-chatbot-to-digital-colleague.github.io/

Abstract

Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems capable of reasoning, action, memory, and self-improvement. We conceptualize this transition as a shift from Chatbot to Digital Colleague: from conversational answers to persistent work. We organize this transition along two tightly coupled dimensions. First, at the cognitive core level, LLMs are advancing from Chatbot-era "fast thinking" systems driven by next-token prediction toward Thinking LLMs that leverage inference-time computation, Chain-of-Thought reasoning, reflection, process supervision, and reinforcement learning to support more deliberate and reliable cognition. Second, at the tool-augmented task execution level, LLMs are progressing from tool-calling Agents that invoke external resources in an ad hoc manner toward OpenClaw-style workstation systems (OpenClaw) equipped with persistent Workspaces, skills, verification loops, and governance. The "Workspace + Skill" paradigm makes episodic tool use colleague-like via state persistence, reusable procedures, task closure, and experience reuse. We examine data construction shifts from instruction-response pairs to State-Action-Observation trajectories and evaluation from static benchmarks to sandboxed, auditable, self-evolving AI ecosystems.

2606.14492 2026-06-15 cs.LG New submission

Recipe-Controlled Decoder Audit for Structural Knowledge-Graph Completion

Xihang Shan, Ye Luo

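Affiliations: School of Mathematical Sciences, Xiamen University; School of Informatics, Xiamen University

AI summary: Proposes a recipe-controlled decoder audit that swaps decoders to measure their effect on knowledge-graph completion performance, finds that decoder effects depend on recipe and provenance, and recommends a decoder x depth sweep before making encoder-level claims.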

Comments 11 pages, 5 figures. Code and artifacts: https://github.com/AndyShan11/kgc-decoder-audit

Abstract

We present a recipe-controlled decoder audit (RCDA) for structural transductive knowledge-graph completion (KGC). The audit asks a simple reporting question: before attributing gains to an encoder or training recipe, what changes when the decoder is swapped under the same recipe? Using ComplEx and DistMult as the primary controlled pair, with targeted RotatE/TransE spot-checks, we evaluate seven benchmarks. On five standard KGs, ComplEx-vs-DistMult differences are modest but consistent under our recipe (+0.005 to +0.012 MRR), whereas CompGCN-style encoder effects vary more by dataset. On small KGs, decoder effects become the main diagnostic: Kinship shows a stable ComplEx advantage of +0.143 MRR (6 seeds), while UMLS favours ComplEx by +0.022 MRR in a clean 6-seed server rerun but reverses in an earlier provenance variant. We therefore treat small-KG decoder choice as recipe- and provenance-sensitive rather than as a fixed dataset winner. We further show that decoder choice interacts with encoder depth on WN18RR, and that under our recipe L=0 ComplEx on YAGO3-10 reaches 0.6971 +/- 0.0048 MRR at d=128. The result is a compact audit protocol: report matched decoder rows, log small-KG provenance, and sweep decoder x depth before making encoder-level claims.
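For readers unfamiliar with the two decoders being swapped, the following sketch shows the standard DistMult and ComplEx triple-scoring functions and the MRR metric in plain Python. It is an illustration only, not the authors' code: the function names are hypothetical, embeddings are plain lists, and no training recipe is implied.

```python
def distmult_score(h, r, t):
    """DistMult: trilinear product of real-valued head, relation, and tail embeddings."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def complex_score(h, r, t):
    """ComplEx: real part of the trilinear product with the tail embedding conjugated."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real


def mrr(ranks):
    """Mean reciprocal rank over 1-based ranks of the correct entity."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Toy embeddings (hypothetical values) to show the symmetry difference.
h_real, r_real, t_real = [1.0, 2.0], [0.5, -1.0], [3.0, 4.0]
h_cplx, r_cplx, t_cplx = [1 + 2j], [1j], [3 - 1j]

symmetric = distmult_score(h_real, r_real, t_real) == distmult_score(t_real, r_real, h_real)
forward = complex_score(h_cplx, r_cplx, t_cplx)
backward = complex_score(t_cplx, r_cplx, h_cplx)
```

DistMult scores a triple and its reverse identically, whereas ComplEx can distinguish them, which is one reason a matched decoder swap can change results independently of the encoder.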

2606.14476 2026-06-15 cs.AI cs.LG New submission

When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More

Zhongyuan Wang, Pratyusha Vemuri

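Affiliations: raptorX.ai

AI summary: Tests whether LLM agents exercise judgment when using a GNN tool rather than deferring blindly, and finds that the agent's predictions match the GNN output 97.6-99.2% of the time, that stronger backbones defer more, and that selective-invocation designs are limited.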

Comments 9 pages, 2 figures. Under review at TMLR

Abstract

A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool. We test this directly. We expose a frozen GNN to a ReAct-style LLM agent as an explicit tool and measure, on node classification over a text-attributed graph (ogbn-arxiv, replicated on WikiCS), whether the agent uses the tool or merely obeys it. We find the agent does not exercise judgment: its predictions agree with the raw GNN's 97.6-99.2% of the time (5 seeds), collapsing into a GNN parrot that adopts the tool's output wholesale and bypasses its own reasoning. Sweeping backbone capability (Qwen2.5 0.5B-7B), the deference is not a weak-model artifact: among models able to invoke the tool, agreement rises with capability (0.60 to 0.98 from 1.5B to 7B). Crucially, the cost of deference does not shrink as capability grows and grows where alternatives emerge: a per-node oracle over the available actions beats the parrot by 0.09-0.18 at 3B and 0.12-0.22 at 7B, roughly doubling at high homophily, because the parrot is pinned to the frozen GNN while the agent's alternatives improve; at 7B a simple neighbour-label tool overtakes the GNN at high homophily (0.81 vs 0.71) yet the agent still defers. A simple selective-invocation gate recovers about half of that high-homophily gap (0.71 to 0.83) but yields no net global gain, and held-out estimates bound the best achievable gate over standard test-time features to at most a third of the oracle headroom: reliable selective invocation looks limited by available information, not merely router design. Our results are a cautionary measurement: evaluations of agent+tool systems cannot assume the agent adds judgment on top of the tool, and selective invocation must be designed in rather than expected to emerge from scale.
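The two measurements at the centre of this abstract, agent-tool agreement and the per-node oracle over available actions, can be sketched as follows. This is a minimal illustration under assumptions: the function names, the action names, and the toy predictions are hypothetical, and the oracle is taken to be correct on a node whenever at least one available action is correct there.

```python
def agreement_rate(agent_preds, tool_preds):
    """Fraction of nodes where the agent's prediction equals the tool's prediction."""
    return sum(a == t for a, t in zip(agent_preds, tool_preds)) / len(agent_preds)


def accuracy(preds, labels):
    """Fraction of nodes predicted correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def oracle_accuracy(action_preds, labels):
    """Accuracy of a per-node oracle that picks any action that is correct on that node."""
    hits = 0
    for i, label in enumerate(labels):
        if any(preds[i] == label for preds in action_preds.values()):
            hits += 1
    return hits / len(labels)


# Toy example (hypothetical values): the agent copies the GNN on every node.
labels = [0, 1, 2, 1]
actions = {"gnn": [0, 1, 0, 0], "neighbour_label": [0, 0, 2, 1]}
agent = [0, 1, 0, 0]

agree = agreement_rate(agent, actions["gnn"])
headroom = oracle_accuracy(actions, labels) - accuracy(agent, labels)
```

Here the agent agrees with the GNN on every node, so its accuracy is pinned to the GNN's, while the oracle headroom shows what a well-routed agent could have gained from the other action.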

2606.14475 2026-06-15 cs.CV New submission

Value-order Decomposition for Generalist Anomaly Detection

Miaoyun Zhao, Jing Chen, Miaoni Zhao, Qiang Zhang

Affiliations: Dalian University of Technology; Xi’an Chang’an Vanke City Primary School; Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education

AI summary: Proposes Value-order Decomposition (VOD), which disentangles and suppresses object-category-, defect-type-, and domain-specific information to achieve strong generalization in cross-domain anomaly detection.

Abstract

Industrial anomaly detection suffers from limited data, making cross-domain generalization particularly challenging. Generalist Anomaly Detection (GAD) aims to train a unified model on a source domain that can effectively detect anomalies in unseen target domains. In the initial semantic feature space, strong entanglement between anomalies and object categories or defect types hinders effective generalization across domains. Recent works address this issue by projecting features into a residual space; however, such methods primarily increase cross-domain overlap for normal features, while anomalous features remain specific to object categories, defect types and data domains, leading to poor alignment and generalization. To address this limitation, we propose Value-order Decomposition (VOD), a simple yet effective technique that bridges three types of generalization gaps across object categories, defect types (including real and synthetic defects), and data domains. VOD disentangles and suppresses object-category-, defect-type-, and domain-specific information, promoting alignment within normal and abnormal samples while preserving their separability, thereby enabling robust generalization across the three gaps. Leveraging the strong alignment between real and synthetic defects within the same object, we perform anomaly detection using only normal and synthetic-abnormal reference, and effectively generalize to unseen real defect types. Experiments on diverse industrial and medical benchmarks demonstrate that our method, using a simple cut-and-paste anomaly simulation strategy, achieves strong generalization across the three gaps.

2606.14470 2026-06-15 cs.AI cs.CL cs.LG New submission

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

Pavan C Shekar, Abhishek H S, Aswanth Krishnan

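Affiliations: QpiAI

AI summary: Proposes GitOfThoughts, a framework that stores an agent's reasoning tree as a git repository so that reasoning can be replayed, audited, and merged. Experiments show that no memory format reliably improves accuracy on novel problems; gains appear only when the retrieved case is highly similar to the current problem (similarity above roughly 0.8), and they come from answer retrieval rather than method transfer.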

Comments 10 pages, 1 figure, 9 tables

Abstract

Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent's reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity >~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.
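The copyability threshold described above can be read as a retrieval gate: memory is consulted only when the best match is a near-duplicate of the current problem. The sketch below is one possible reading, not the paper's implementation. The abstract does not specify the similarity measure, so cosine similarity over embedding vectors is an assumption, and `retrieve_if_copyable` and the memory layout are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_if_copyable(query_vec, memory, threshold=0.8):
    """Return the stored answer of the best match only if its similarity exceeds the threshold.

    memory is a list of (embedding, answer) pairs; threshold=0.8 follows the abstract's ~0.8.
    """
    best_sim, best_answer = -1.0, None
    for vec, answer in memory:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_sim, best_answer = sim, answer
    return best_answer if best_sim > threshold else None


# Hypothetical two-entry memory with 2-D embeddings.
memory = [([1.0, 0.0], "answer A"), ([0.0, 1.0], "answer B")]
near_duplicate = retrieve_if_copyable([0.9, 0.1], memory)
novel_problem = retrieve_if_copyable([1.0, 1.0], memory)
```

A gate of this kind returns a stored answer for near-duplicates and nothing for novel problems, which matches the abstract's finding that the gain is answer retrieval rather than method transfer.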

2606.14466 2026-06-15 cs.SD cs.AI cs.LG New submission

The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions

Piotr Kitłowski, Dominik Wiącek, Mateusz Modrzejewski

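Affiliations: University of Warsaw

AI summary: Proposes a psychoacoustic framework that optimizes inaudible perturbations to decouple a model's attributions from its classification, showing that explanation heatmaps in audio deepfake detection can be systematically distorted while the predicted label stays unchanged.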

Comments Accepted to the ICML 2026 Workshop on Machine Learning for Audio: 5 pages, 4 figures

2606.14463 2026-06-15 cs.LG New submission

Causal Object-Centric Model for Monte Carlo Tree Search Planning

Rodion Vakhitov, Leonid Ugadiarov, Alexey Skrynnik, Aleksandr Panov

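Affiliations: MIRAI; CogAILab

AI summary: Proposes COMET, which combines an unsupervised object-centric encoder with a transformer world model and uses an action-slot fusion mechanism and object-causal attention for efficient planning, outperforming baselines on several benchmarks.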

Abstract

We introduce COMET (Causal Object-centric Model for Efficient Tree search), a model-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot-structured latent space. COMET pairs a frozen unsupervised object-centric encoder with a transformer-based world model, in which actions are bound to objects through a novel action-slot fusion mechanism that is used in slot transition prediction. Policy and value heads use object-causal attention, modulating token interactions by learned per-slot relevance scores so that decision-making concentrates on task-relevant entities. COMET adds an explicit object-level inductive bias to MuZero-style latent planning. Across eight visually and dynamically diverse tasks from the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object-centric and monolithic baselines.

2606.14416 2026-06-15 cs.LG stat.ML New submission

Federated Learning for Feature Generalization with Convex Constraints

Dongwon Kim, Donghee Kim, Sung Kuk Shyn, Kwangsu Kim

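AI summary: To address poor generalization caused by heterogeneous client data in federated learning, proposes FedCONST, which uses linear convex constraints to adaptively modulate update magnitudes and balance parameter learning, validates the approach with a gradient signal-to-noise ratio analysis, and achieves strong generalization across heterogeneous environments.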

Comments Accepted at the 42nd International Conference on Machine Learning (ICML 2025)

Abstract

Federated learning (FL) often struggles with generalization due to heterogeneous client data. Local models are prone to overfitting their local data distributions, and even transferable features can be distorted during aggregation. To address these challenges, we propose FedCONST, an approach that adaptively modulates update magnitudes based on the parameter strength of the global model. This prevents over-emphasizing well-learned parameters while reinforcing underdeveloped ones. Specifically, FedCONST employs linear convex constraints to ensure training stability and preserve locally learned generalization capabilities during aggregation. A Gradient Signal to Noise Ratio (GSNR) analysis further validates the effectiveness of FedCONST in enhancing feature transferability and robustness. As a result, FedCONST effectively aligns local and global objectives, mitigating overfitting and promoting stronger generalization across diverse FL environments, achieving state-of-the-art performance.

2606.14415 2026-06-15 cs.AI New submission

CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning

Ayoub Belouadah, Sylvain Kubler, Yves Le Traon

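Affiliations: University of Luxembourg

AI summary: Proposes Constraint-Sensitive Policy Optimization (CSPO), which augments the primal objective with a local constraint-sensitivity correction to speed up safety recovery and reduce oscillations, achieving higher constrained returns on navigation and locomotion benchmarks.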

Comments Accepted as a Spotlight paper at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Safe reinforcement learning (Safe RL) aims to maximize expected return while satisfying safety constraints, typically modeled as Constrained Markov Decision Processes (CMDPs). While primal-dual methods scale well to deep RL, they often suffer from delayed constraint correction, leading to oscillatory behavior and prolonged safety violations. In this paper, we propose Constraint-Sensitive Policy Optimization (CSPO), a first-order primal-dual method that incorporates local constraint sensitivity into policy updates. CSPO augments the primal objective with a constraint-sensitive correction derived from the shortest signed distance to the safety boundary, enabling smarter recovery steps back to safety, compensating for delayed Lagrange multiplier updates, reducing oscillations near the boundary, and preserving the KKT solutions of the original constrained problem. Experiments on navigation and locomotion benchmarks demonstrate that CSPO achieves faster safety recovery and high reward preservation, resulting in higher constrained returns compared to state-of-the-art primal-dual and penalty-based methods.
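As background for the primal-dual setting this abstract refers to, the textbook single-constraint CMDP problem and its Lagrangian update can be written as below. The notation is an assumption introduced here for illustration ($J_r$ for expected return, $J_c$ for expected cost, $d$ for the cost limit, $\eta$ for the dual step size); the constraint-sensitive correction that CSPO adds to the primal objective is not specified in the abstract and is not shown.

```latex
\begin{align}
  &\max_{\pi} \; J_r(\pi) \quad \text{s.t.} \quad J_c(\pi) \le d \\
  &\mathcal{L}(\pi, \lambda) = J_r(\pi) - \lambda \bigl( J_c(\pi) - d \bigr), \qquad \lambda \ge 0 \\
  &\lambda_{k+1} = \max\bigl( 0,\; \lambda_k + \eta \, ( J_c(\pi_k) - d ) \bigr)
\end{align}
```

Because $\lambda$ only grows after a violation has been measured, the penalty lags behind the policy, which is the delayed constraint correction the abstract describes.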

2606.14409 2026-06-15 cs.RO cs.AI New submission

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

He Zhang, Lingzhu Xiang, Haitao Lin, Zeyu Huang, Minghui Wang, Dingyan Zhong, Yubo Dong, Yihao Wu, Yongming Rao, Dongsheng Zhang, Wanjia He, Ling Chen, Kai Huang, Jiahao Chen, Sichang Su, Xumin Yu, Ziyi Wang, Chengwei Zhu, Xiao Teng, Yuchun Guo, Yufeng Zhang, Yuandong Liu, Rui Wang, Zisheng Lu, Han Hu, Zhengyou Zhang

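Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: Presents HyVLA-0.5, an end-to-end robot learning stack covering data collection, model design, pre-training and fine-tuning, RL post-training, and real-world deployment, with the components designed to work together.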

2606.14397 2026-06-15 cs.LG New submission

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

Mykola Vysotskyi, Runqi Lin, Grzegorz Biziel, Michal Zakrzewski, Sebastian Montagna, Damian Rynczak, Shreyansh Padarha, Kumail Alhamoud, Zihao Fu, William Lugoloobi, Kai Rawal, Hanna Yershova, Xander Davies, Taras Rumezhak, Guohao Li, Fazl Barez, Baoyuan Wu, Arkadiusz Drohomirecki, Yarin Gal, Chris Russell, Christopher Summerfield, Adam Mahdi, Volodymyr Karpiv, Philip Torr, Adel Bibi

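Affiliations: University of Oxford; SoftServe; Massachusetts Institute of Technology; The Chinese University of Hong Kong, Shenzhen; UK AI Security Institute; Ukrainian Catholic University

AI summary: Introduces GauntletBench, a benchmark of 100 vision-intensive tasks (20 in each of five professional applications) that evaluates underexplored agent capabilities such as temporal perception, graphical understanding, and 3D reasoning, and finds that the state-of-the-art agent reaches only a 19.1% success rate, far below the over 80% achieved by humans.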

Abstract

As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow set of capabilities while overlooking broader dimensions, resulting in saturated performance on modern agents and failing to probe their limitations. To this end, we introduce GauntletBench, a web-based benchmark for evaluating agent generalisation in challenging scenarios, focusing on three underexplored capabilities (temporal perception, graphical understanding, and 3D reasoning), across five less-covered professional applications (Video Editor, Workflow Builder, 3D Modeller, Flight Analyser, and Circuit Designer), each with 20 vision-intensive tasks (100 in total). Our benchmark provides a modular pipeline that comprises an environment compatible with both open- and closed-source agent frameworks, a controlled web-based application, a well-structured task suite, and an automated evaluation engine with diverse metrics. Contrary to widespread expectations, our empirical results reveal that frontier agentic systems remain far from achieving human-level performance. Even the state-of-the-art agent achieves only a 19.1% success rate on our GauntletBench, highlighting the limitations in these overlooked capabilities and generalisation. By comparison, non-expert human annotators achieve over 80% success on our challenging yet feasible tasks, revealing the substantial gap between current agent capabilities and those required for complex real-world scenarios.

2606.14391 2026-06-15 cs.CL cs.AI cs.SD New submission

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

Henri-Leon Kordt, Theresa Pekarek Rosin, Jae Hee Lee, Stefan Wermter

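Affiliations: Knowledge Technology, Department of Informatics, University of Hamburg

AI summary: To address the information lost when ASR systems ignore disfluencies, proposes an approach based on continual learning and explicit disfluency tokens, introducing the tokens into a pre-trained model and training it continually, and analyzes the trade-off between token learning and ASR performance as well as a cross-attention head mechanism shared across methods.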

Comments Accepted at Interspeech 2026

2606.14389 2026-06-15 cs.CV New submission

Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models

Ge Wang, Xinyu Tan, Xiang Li, Man Luo, Chengsi Yao, Shenhao Yan, Jiahao Yang, Fan Feng, Honghao Cai, Xiangyuan Wang, Zhixin Mai, Yiming Zhao, Yatong Han, Zhen Li

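Affiliations: Ising AI; CUHK-Shenzhen; PKU

AI summary: Proposes Elastic Queries Reinforcement Learning (EQRL), in which a lightweight latent-schedule adaptor dynamically adjusts a VLA model's inference steps and action chunk length and critic ensemble disagreement estimates state difficulty, reducing inference cost while preserving or improving task success.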

Abstract

Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control: contact-rich or uncertain states may need more computation and fresher feedback, while easier states can often be handled with fewer inference steps and longer open-loop execution. We propose Elastic Queries Reinforcement Learning (EQRL), a framework that makes each VLA policy query elastic. A lightweight latent-schedule adaptor jointly selects the latent input, denoising budget, and action chunk length, without fine-tuning the underlying VLA model. To make scheduling difficulty-aware, EQRL trains a critic over the joint latent-schedule action and derives a state difficulty signal from critic ensemble disagreement. This signal guides compute toward difficult states, while a learned residual allows task-driven correction. We formulate variable chunk execution as query-level macro-action RL with chunk-dependent discounting and an amortized number-of-function-evaluations (NFE) budget. Across simulation and real-robot manipulation, EQRL reduces amortized inference cost while preserving or improving task success.
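The idea of steering compute by critic ensemble disagreement can be illustrated with a deliberately simple stand-in. In the abstract the schedule is chosen by a learned latent-schedule adaptor; the hand-written threshold rule below is not that method, only a sketch of the signal it relies on. Disagreement is assumed to be the population standard deviation of the ensemble's Q-values, and the function names, the threshold, and the `(denoising_steps, chunk_length)` pairs are all hypothetical.

```python
import statistics


def ensemble_disagreement(q_values):
    """Population standard deviation of the critics' Q-value estimates (assumed measure)."""
    return statistics.pstdev(q_values)


def pick_schedule(q_values, threshold=0.5, easy=(2, 16), hard=(8, 4)):
    """Return (denoising_steps, chunk_length) from a fixed rule on critic disagreement.

    High disagreement is treated as a hard state: more denoising steps and a shorter
    open-loop chunk. Low disagreement is treated as easy: fewer steps, longer chunk.
    """
    if ensemble_disagreement(q_values) > threshold:
        return hard
    return easy


# Hypothetical Q-value estimates from a three-critic ensemble.
easy_state = pick_schedule([1.0, 1.1, 0.9])
hard_state = pick_schedule([0.0, 2.0, 4.0])
```

When the critics roughly agree, the rule spends less compute and replans less often; when they disagree, it spends more and shortens the open-loop chunk, which is the direction of adaptation the abstract describes.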
