URL PDF HTML ☆

赞 0 踩 0

2501.12942 2026-06-11 cs.AI 版本更新

Offline Diffusion Policy for Multi-User Delay-Constrained Scheduling

面向多用户延迟约束调度的离线扩散策略

Zhuoran Li, Ruishuo Chen, Hai Zhong, Longbo Huang

发表机构 * Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University（交叉信息学院（IIIS），清华大学）

AI总结提出基于离线强化学习的SOCD算法，利用扩散策略和批评网络指导，从离线数据中学习高效调度策略，避免在线交互，在部分可观测和大规模环境中表现优异。

详情

AI中文摘要

有效的多用户延迟约束调度在诸多实际应用中至关重要，包括具身AI、即时通讯、直播和数据中心管理，这些场景需要在具有不同延迟敏感性的用户之间进行高效资源分配。在这些场景中，调度器必须实时做出决策，以满足延迟和资源约束，同时无需事先了解系统动态，这些动态通常是时变的且难以估计。当前基于学习的方法通常需要在训练阶段与实际系统进行在线交互。因此，这些方法往往难以实施或不切实际，因为它们会显著降低系统性能并产生高昂的服务成本。为应对这些挑战，我们提出了一种新颖的基于离线强化学习的算法，名为SOCD（通过离线学习与批评引导和扩散模型进行调度），该算法仅从预先收集的离线数据中学习高效调度策略。SOCD创新性地采用了扩散策略，并辅以无采样的批评网络进行策略引导。通过将拉格朗日乘子优化融入离线强化学习，SOCD仅从可用数据集中高效训练出高质量且满足约束的策略，无需与系统进行在线交互。实验结果表明，SOCD对多种系统动态具有鲁棒性，包括部分可观测和大规模环境，并且与现有方法相比性能更优。

英文摘要

Effective multi-user delay-constrained scheduling is crucial in various real-world applications, including embodied AI, instant messaging, live streaming, and data center management, where efficient resource allocation is required among users with diverse delay sensitivities. In these scenarios, schedulers must make real-time decisions to satisfy both delay and resource constraints without prior knowledge of system dynamics, which are often time-varying and challenging to estimate. {Current learning-based methods typically require online interactions with actual systems during the training stage. Therefore, these approaches are often difficult or impractical, as they can significantly degrade system performance and incur substantial service costs.} To address these challenges, we propose a novel offline reinforcement learning-based algorithm, named \underline{S}cheduling By \underline{O}ffline Learning with \underline{C}ritic Guidance and \underline{D}iffusion Model (SOCD), to learn efficient scheduling policies purely from pre-collected \emph{offline data}. SOCD innovatively employs a diffusion policy, complemented by a sampling-free critic network for policy guidance. By integrating the Lagrangian multiplier optimization into the offline reinforcement learning, SOCD efficiently trains high-quality constraint-aware policies exclusively from available datasets, eliminating the need for online interactions with the system. Experimental results demonstrate that SOCD is resilient to various system dynamics, including partially observable and large-scale environments, and delivers superior performance compared to existing methods.

URL PDF HTML ☆

赞 0 踩 0

2509.23248 2026-06-11 cs.AI cs.NI 版本更新

FitText: 通过模因检索演化智能体工具生态

Kyle Zheng, Han Zhang, Renliang Sun, Chenchen Ye, Wei Wang

发表机构 * UCLA（加州大学洛杉矶分校）

AI总结针对用户任务描述与工具文档间的语义鸿沟，提出FitText框架，将检索嵌入推理循环，通过自然语言伪工具描述迭代优化和模因进化选择，显著提升工具检索性能。

详情

AI中文摘要

用户描述任务的方式与工具文档之间存在语义鸿沟。随着API生态扩展到数万个端点，仅凭初始查询的静态检索无法弥合这一鸿沟：智能体对其所需工具的理解在执行过程中不断演变，但其工具集却保持不变。我们指出，这种检索接口（而非规划）是端到端智能体性能的约束瓶颈，并引入FitText——一个无需训练的框架，通过将检索直接嵌入智能体的推理循环中，使其动态化。FitText将检索视为测试时假设的演化：智能体生成自然语言的伪工具描述（关于所需工具的可修正信念），利用检索反馈迭代优化，并通过随机生成探索多样化的替代方案。模因检索在候选描述上施加进化选择压力，并由避免冗余搜索的工具记忆引导。在ToolRet（三个领域）上，FitText的重构策略在所有基模型上相比静态查询检索将NDCG@5提升了2.7至10.6个点；在StableToolBench（16,464个API）上使用GPT-5.4-mini时，模因检索达到了84.3%的合并通过率，相比静态查询检索绝对提升了26.7个点。

英文摘要

A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We identify this retrieval interface, not planning, as the binding constraint on end-to-end agent performance, and introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText treats retrieval as test-time evolution of hypotheses: the agent generates natural-language pseudo-tool descriptions (revisable beliefs about the tool it needs), refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (three domains), FitText's reformulation strategies improve NDCG@5 by 2.7 to 10.6 points over static query retrieval across all base models; on StableToolBench (16,464 APIs) with GPT-5.4-mini, Memetic reaches an 84.3% pooled pass rate, a 26.7-point absolute gain over static query retrieval.

URL PDF HTML ☆

赞 0 踩 0

2606.05922 2026-06-11 cs.AI cs.CL cs.LG 版本更新

Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference

回顾性工具优化：通过轨迹回滚上的自我偏好改进LLM智能体

Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

发表机构 * City University of Hong Kong（香港城市大学）； Microsoft Research Asia（微软亚洲研究院）

AI总结提出一种自监督方法RHO，利用历史轨迹回滚和自偏好选择优化智能体工具集，无需真实标签，在SWE-Bench Pro上通过单轮优化将通过率从59%提升至78%。

Comments Code: https://github.com/wbopan/retro-harness ; Project website: https://paper-rho.wenbo.io

详情

AI中文摘要

AI智能体依赖于技能、工具和工作流程的整合（称为工具集）来解决复杂问题。持续改进这一工具集对于适应新任务至关重要。然而，现有的优化方法通常需要真实验证集，但在实际部署场景中获取此类标注数据非常困难。为解决这一问题，我们提出回顾性工具优化（RHO），一种仅利用过去轨迹的自监督方法。具体而言，RHO从历史轨迹中选择一个多样化的困难任务核心集，并并行重新求解。智能体通过自我验证和自我一致性分析这些回滚，然后生成候选工具集更新，并通过自身的成对自我偏好选择最有效的更新。我们在三个不同领域（涵盖软件工程、技术工作和知识工作）上评估RHO。值得注意的是，单轮优化无需任何外部评分即可将SWE-Bench Pro上的通过率从59%提升至78%。此外，我们的分析表明RHO有效针对先前的失败模式。因此，优化后的工具集改变了智能体的行为模式，并在长周期会话中保持更高的准确性。

英文摘要

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.

URL PDF HTML ☆

赞 0 踩 0

2606.07909 2026-06-11 cs.AI cs.CL 版本更新

MemToolAgent: Leveraging Memory for Tool Using Agents Based on Environment and User Feedback

MemToolAgent概述：一个简单的餐厅预订场景，其中代理检索相似记忆，接收关于无效时间格式的反馈，并生成反思以更新其记忆

Suleyman Armagan Er, Danilo Ribeiro, Yogesh Virkar, Surafel Lakew, Adi Kalyanpur, James Gung, Thomas Delteil, Arshit Gupta

发表机构 * AWS AI ； University of Washington（华盛顿大学）

AI总结提出MemToolAgent框架，通过记忆管理提升大语言模型代理的工具使用能力，包含记忆提取和动态检索模块，在三个基准上分别提升29%、80%和17%。

Comments 8 pages, 5 figures

详情

AI中文摘要

现代大语言模型（LLM）代理可以使用外部工具帮助用户解决复杂任务。然而，对于需要从长期历史事件或先前的代理-环境交互中学习的问题，LLM代理需要使用记忆机制来存储和检索经验。尽管对话代理存在复杂的记忆系统，但很少有研究实证检验如何通过过去的用户-代理对话来提升代理的工具使用能力。我们提出MemToolAgent，一个通过记忆管理改善工具使用的框架。我们的方法包含一个记忆提取模块，将过去的经验处理成结构化的记忆条目，以及一个检索模块，动态选择存储记忆条目的子集。这使得无需LLM微调即可实现更个性化和准确的响应，与用户偏好和反馈保持一致。总之，本工作有三个主要贡献：（1）统一的记忆条目格式，无需LLM微调即可改善通用和个性化工具使用；（2）基于反思的记忆提取，利用环境和用户反馈将错误执行提炼为批评并存储；（3）一个检索模块，根据记忆相似度分布选择使用多少过去经验。MemToolAgent在WorkBench、NESTFUL和PEToolBench基准上相比强基线分别实现了29%、80%和17%的相对改进。

英文摘要

Modern large language model (LLM) agents can use external tools to help users solve complex tasks. However, for problems that require learning from long-term historical events or from previous agent-environment interactions, LLM agents are required to use memory mechanisms to store and retrieve experiences. While sophisticated memory systems exist for dialogue agents, few studies have empirically examined how to improve agents' tool-using capabilities through past user-agent conversations. We propose MemToolAgent, a framework that improves tool use through memory management. Our approach contains a memory extraction module that processes past experiences into structured memory entries, and a retrieval module that dynamically selects a subset of the stored memory entries. This enables more personalized and accurate responses aligned with user preferences and feedback without requiring LLM fine-tuning. In summary, this work has three main contributions: (1) a unified memory entry format that improves both general-purpose and personalized tool use without LLM fine-tuning, (2) a reflection-based memory extraction that uses environment and user feedback to distill wrong executions into critiques to store, and (3) a retrieval module that chooses how many past experiences to use based on the memory similarity distribution. MemToolAgent achieves 29%, 80%, and 17% relative improvements compared to strong baselines on the WorkBench, NESTFUL, and PEToolBench benchmarks, respectively.

URL PDF HTML ☆

赞 0 踩 0

2509.10303 2026-06-11 cs.LG cs.AI 版本更新

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

超越次优性：离线强化学习通过随机解决方案学习有效调度

Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang

发表机构 * Eindhoven University of Technology（埃因霍温理工大学）

AI总结提出离线RL算法CDQAC，从次优静态数据集学习调度策略，在JSP/FJSP上超越在线RL和强启发式方法，仅需1-5%数据，发现状态-动作覆盖比轨迹质量更重要。

详情

AI中文摘要

在线强化学习（RL）方法通过与模拟环境直接交互学习调度策略，在作业车间调度（JSP）和柔性作业车间调度（FJSP）问题上表现出色。然而，这些方法通常需要大量的训练交互，限制了其样本效率和实际适用性。受此挑战的启发，我们引入了保守离散分位数演员-评论家（CDQAC），这是一种离线RL算法，可以直接从静态、次优数据集中学习有效的调度策略。CDQAC将基于分位数的评论家与延迟策略更新相结合，以估计机器-操作对的回报分布。在JSP和FJSP基准上的大量实验表明，CDQAC始终优于生成数据的启发式方法，超越了最先进的离线和在线RL基线，并且具有很高的样本效率，仅需原始数据集的1%到5%即可学习高质量策略。我们的分析表明，在调度中，离线RL的性能主要受状态-动作覆盖范围而非单个轨迹质量的影响。调度将密集奖励（与完工时间目标对齐）与跨启发式方法的等长轨迹相结合，从而能够从广泛的行为中有效学习。与此观察一致，由简单随机启发式方法生成的具有更广覆盖范围的数据集，使其性能优于在由更强启发式方法（如遗传算法）生成的数据集上训练的策略。

英文摘要

Online reinforcement learning (RL) approaches have demonstrated strong performance on Job Shop Scheduling (JSP) and Flexible JSP (FJSP) problems by learning scheduling policies through direct interaction with simulated environments. However, these methods often require extensive training interactions, limiting their sample efficiency and practical applicability. Motivated by this challenge, we introduce Conservative Discrete Quantile Actor-Critic (CDQAC), an offline RL algorithm that learns effective scheduling policies directly from static, suboptimal datasets. CDQAC couples a quantile-based critic with delayed policy updates to estimate the return distribution of machine-operation pairs. Extensive experiments on JSP and FJSP benchmarks demonstrate that CDQAC consistently outperforms the data-generating heuristics, surpasses state-of-the-art offline and online RL baselines, and is highly sample efficient, requiring only 1 to 5% of the original dataset to learn high-quality policies. Our analysis suggests that, in scheduling, offline RL performance is governed mainly by state-action coverage rather than the quality of individual trajectories. Scheduling couples a dense reward aligned with the makespan objective with equal-length trajectories across heuristics, enabling effective learning from a broad range of behaviors. Consistent with this observation, datasets generated by a simple random heuristic with broader coverage let it outperform policies trained on datasets produced by stronger heuristics such as Genetic Algorithms.

URL PDF HTML ☆

赞 0 踩 0

2605.10907 2026-06-11 cs.CR cs.AI 版本更新

Engineering Robustness into Personal Agents with the AI Workflow Store

通过AI工作流存储增强个人代理的鲁棒性

Roxana Geambasu, Mariana Raykova, Pierre Tholoniat, Trishita Tiwari, Lillian Tsai, Wen Zhang

发表机构 * Columbia University and Google（哥伦比亚大学和谷歌）； Google（谷歌）

AI总结本文探讨将严谨的软件工程流程整合到代理循环中，以生成可靠、安全且确定性约束的代理工作流，提升高风险场景下的性能。

2605.14084 2026-06-11 cs.SE cs.AI cs.CL 版本更新

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

CRANE：通过空域编辑实现代码代理的约束推理注入

Mingzhi Zhu, Michele Merler, Raju Pavuluri, Stacy Patterson

发表机构 * Rensselaer Polytechnic Institute（拉特格斯理工学院）； IBM Research（IBM研究院）

AI总结 CRANE通过空域编辑技术，结合推理和工具使用能力，提升代码代理性能，在多个基准测试中取得显著成果。

详情

AI中文摘要

代码代理必须同时对长周期的仓库状态进行推理并遵守严格的工具使用协议。在配对的Instruct/Thinking检查点中，这些能力是互补但不一致的。Instruct模型简洁且工具纪律性强，而Thinking模型提供更强的规划和恢复行为，但往往过度 deliberates 并降低代理性能。我们提出CRANE（通过空域编辑实现代码代理的约束推理注入），一种无需训练的参数编辑方法，将Thinking-Instruct的delta视为Instruct骨干的候选推理编辑方向池。CRANE结合幅度阈值去噪delta，保守的泰勒门来保留对推理转移和工具使用保留共同有益的编辑，以及渐进的Sigmoid投影来抑制格式关键的更新方向。通过合并配对的Instruct和Thinking检查点，CRANE在单独模型上取得显著优势的同时保持Instruct级别的效率：在Roo-Eval上，它实现了Qwen3-30B-A3B的pass1为66.2%（+19.5%）和Qwen3-Next-80B-A3B的81.5%（+8.7%）；在SWE-bench-Verified上，它在两个规模（122/500和180/500）上解决了多达14个额外的实例；在Terminal-Bench v2上，它提高了pass1/pass5高达2.3%/7.8%，分别达到7.6%/17.9%和14.8%/30.3%，在所有三个基准测试中一致超越了其他合并策略。

英文摘要

Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass1/pass5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2606.03077 2026-06-11 cs.LG cs.AI cs.DC 版本更新

Libra: Efficient Resource Management for Agentic RL Post-Training

Libra：面向智能体强化学习后训练的高效资源管理

Kaiwen Chen, Xin Tan, Jingzong Li, Hong Xu

发表机构 * The Chinese University of Hong Kong（香港中文大学）； The Hang Seng University of Hong Kong (2018)（香港恒生大学）

AI总结针对智能体强化学习中长尾、非平稳工作负载带来的资源管理挑战，提出Libra系统，通过周期性全局资源规划器和因果驱动多级反馈队列调度器，实现GPU分配优化和请求调度，最高提升3倍吞吐量和2.5倍收敛速度。

Comments 19 pages, 12 figures

详情

AI中文摘要

强化学习（RL）已成为大型语言模型（LLM）的标准后训练范式，从偏好对齐扩展到复杂推理和多轮智能体行为。在智能体RL中，rollout阶段生成轨迹并调用工具，产生长尾和非平稳的工作负载，挑战了传统的资源管理假设。出现了三个基本挑战。首先，由于长尾分布，一小部分轨迹主导了rollout完成时间。其次，rollout和训练在计算模式、内存需求和对序列长度的敏感性上表现出强烈的不对称性。第三，随着RL策略的演变，轨迹长度分布随时间漂移，使得任何静态资源分配逐渐变得次优。我们提出Libra，引入了两个核心机制。第一个是周期性全局资源规划器，它联合优化rollout和训练集群间的GPU分配。它利用弹性混合池实现阶段间轻量级、非阻塞的工作节点重新分配。第二个是因果驱动的多级反馈队列（C-MLFQ）调度器，它基于从工具返回结果导出的因果信号（而非依赖脆弱的长度的预测）将请求路由到异构的rollout桶。在48个A800 GPU上的评估表明，与基线相比，Libra实现了高达3.0倍的吞吐量提升和高达2.5倍的奖励收敛加速。

英文摘要

Reinforcement learning (RL) has emerged as a standard post-training paradigm for shaping large language models (LLMs) into capable agents. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that expose two fundamental challenges in resource management. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training are subject to cross-stage imbalance, as they exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Compounding this asymmetry, the sequence length distribution drifts continuously as the policy evolves, rendering any static resource split progressively suboptimal. We present Libra, a resource management system to address both challenges via two core mechanisms. The first is a global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0x higher throughput and converges up to 2.5x faster in reward compared to the baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.11724 2026-06-11 cs.AI 新提交

Mind the Perspective: Let's Reason Recursively for Theory of Mind

注意视角：递归推理实现心智理论

Chao Lei, Guang Hu, Meng Yang, Yanbei Jiang, Nir Lipovetzky

发表机构 * School of Computing and Information Systems, The University of Melbourne, Australia（墨尔本大学计算与信息系统学院）； SensiLab, Monash University, Australia（蒙纳士大学SensiLab）

AI总结提出RecToM框架，通过递归视角构建建模嵌套信念，将高阶信念问题转化为实际世界问题，在多个ToM基准上达到最先进性能。

详情

AI中文摘要

心智理论（ToM）推理需要从部分且不对称的观察中推断智能体的信念，这对大语言模型（LLM）来说仍然是一个开放的挑战。现有的基于提示的方法通过可观察事件过滤或时间信念链来改进ToM推理，但没有显式建模嵌套信念。我们引入了RecToM，一个用于ToM推理的推理时框架，通过递归视角构建来建模嵌套信念。RecToM沿着问题指定的角色链，从先前的角色视角构建每个角色视角，将高阶信念问题简化为最终构建视角内的实际世界问题。我们进一步提供了KD45分析，表明RecToM的视角构建诱导了超越简单事件过滤的良好信念模态。在包括Hi-ToM、Big-ToM和FanToM在内的ToM基准上，跨多个LLM骨干网络的实验表明，RecToM持续优于最近的高级方法，达到了最先进的性能。值得注意的是，RecToM在GPT-5.4和Qwen3.5上达到了Hi-ToM的100%准确率，这是一个需要高阶ToM推理的基准。

英文摘要

Theory of Mind (ToM) reasoning requires inferring agents' beliefs from partial and asymmetric observations, which remains an open challenge for LLMs. Existing prompting-based approaches improve ToM reasoning through observable-event filtering or temporal belief chains, without explicitly modeling nested beliefs. We introduce RecToM, an inference-time framework for ToM reasoning that models nested beliefs via recursive perspective construction. RecToM constructs each character perspective from the preceding character perspective along the character chain specified by the question, reducing higher-order belief questions to actual-world questions within the final constructed perspective. We further provide a KD45 analysis showing that RecToM's perspective construction induces a well-formed belief modality beyond simple event filtering. Experiments on ToM benchmarks, including Hi-ToM, Big-ToM, and FanToM, across multiple LLM backbones show that RecToM consistently outperforms recent advanced approaches, achieving state-of-the-art performance. Notably, RecToM reaches 100\% accuracy on Hi-ToM with GPT-5.4 and Qwen3.5, a benchmark requiring higher-order ToM reasoning.

URL PDF HTML ☆

赞 0 踩 0

2606.12065 2026-06-11 cs.AI cs.MA 新提交

Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework

BIM中几何密集型合规检查自动化：基于图的语义推理框架

Zixuan Xiao, Pei Troh Koh, Jun Ma, Jack C. P. Cheng

发表机构 * Department of Urban Planning and Design, The University of Hong Kong（香港大学城市规划与设计系）； Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology（香港科学与技术大学土木与环境工程系）

AI总结针对BIM中几何密集型法规自动检查的语义鸿沟问题，提出SGR-BIM图驱动推理框架，通过跨模态知识图谱实现可解释推理，在679个消防规范查询上达到84.3%准确率，较基线提升8.6%。

详情

DOI: 10.1016/j.autcon.2026.107038
Journal ref: Automation in Construction 189 (2026) 107038

AI中文摘要

MODF-SIR：面向社交智能推理的多智能体全模态蒸馏框架

Shang Ma, Jisheng Dang, Wencan Zhang, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University（兰州大学信息科学与工程学院）； School of Medical Technology, Beijing Institute of Technology（北京理工大学医学技术学院）； Cloud and AI BU, Huawei（华为云与AI业务部）； School of Computing, National University of Singapore（新加坡国立大学计算机学院）

AI总结提出基于轻量级多模态大语言模型的多智能体协作框架，通过知识蒸馏增强训练与推理，结合测试时适应、长尾事件提取和链式思维提示，在多个基准上取得最优结果。

详情

AI中文摘要

我们提出一个基于轻量级多模态大语言模型（MLLM）的多智能体协作框架，专门设计用于社交智能推理。我们方法的一个关键特征是，训练和推理阶段都通过知识蒸馏进行增强。在该架构中，与社交智能相关的多模态数据被精确定位。此外，相关的长尾事件被识别、提取并呈现为格式化的显式文本。这种格式化策略防止关键的长尾信息在分词过程中被头部事件和环境噪声掩盖。具体来说，我们在整个推理流程中集成了测试时适应（TTA），包括长尾事件的提取和表示、链式思维（CoT）提示和自我反思。该TTA机制也经过蒸馏增强，利用低秩适应（LoRA）仅针对实例级推理微调基础模型。在多个基准上对各种开源和专有AI模型进行的广泛评估证明了所提出框架的有效性。使用IntentTrain约30%的训练数据，我们取得了最先进的结果。代码见https://this URL，演示见https://this URL，LoRA见https://this URL，训练路由器的数据集见https://this URL。

英文摘要

We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. With around 30% of training data from IntentTrain, we achieve state-of-the-art results. Codes are available at https://github.com/eeee-sys/MODF-SIR, demo is available at https://huggingface.co/spaces/Harry-1234/MODF-SIR, LoRA is available at https://huggingface.co/Harry-1234/MODF-SIR and the dataset for training router is available at https://huggingface.co/datasets/Harry-1234/IntentRouterTrain.

URL PDF HTML ☆

赞 0 踩 0

2606.12260 2026-06-11 econ.TH cs.AI cs.GT cs.LG stat.ML 交叉投稿

Market Design for AI: Beyond the Copyright Binary

人工智能的市场设计：超越版权二元论

Yan Dai, Maryam Farboodi, Negin Golrezaei, Sepehr Shahshahani

发表机构 * MIT Operations Research Center（麻省理工学院运筹学中心）； MIT Sloan School of Management（麻省理工学院斯隆管理学院）； Washington University School of Law（华盛顿大学法学院）

AI总结本文通过静态和动态博弈模型，分析AI训练数据市场中“自由使用”与“强知识产权”两种模式的失败，提出通过数据中介内部化外部性并补贴创新贡献的市场设计。

详情

AI中文摘要

我们如何设计一个用于训练AI模型的人类生成内容市场，既能促进技术进步，又能保留个人创作高质量内容的激励？现有方法采取两极立场：基于合理使用的“自由使用”模式和“强知识产权”模式。我们证明两者均失败：自由使用不补偿创作者，而通过建模为静态Stackelberg博弈，强知识产权也削弱了创作激励。我们发现这对更具创新性的创作者尤其如此，我们将此现象称为“原创性惩罚”。将这一见解扩展到动态模型，我们发现另一种市场失灵会损害AI模型性能，即使对于初始良好的模型也是如此：此类模型导致人类更依赖AI辅助创作，导致同质化内容反馈到训练中，从而降低模型性能——即“精确性诅咒”。我们进一步提出一种市场设计，通过数据中介内部化跨创作者外部性并补贴创新贡献，从而恢复效率。

英文摘要

How can we design a market of human-generated content for use in training AI models that both enables technological progress and preserves individual incentives for high-quality content creation? Existing approaches take polar positions: a "free-for-all" model based on fair use and a "strong intellectual property rights" model. We show that both fail: Free-for-all does not compensate creators, and -- by modeling as a static Stackelberg game -- strong intellectual property rights also underpower creative incentives. We find this especially true for more innovative creators, a phenomenon we term the "originality penalty." Extending this insight to a dynamic model, we find another market failure undermining AI model performance, even for an initially good model: Such a model induces greater reliance by humans on AI-assisted creation, resulting in homogenized content feeding back into training, which degrades the model performance -- a "curse of precision." We further propose a market design with a data intermediary internalizing cross-creator externalities and subsidizing innovative contributions, thereby restoring efficiency.

URL PDF HTML ☆

赞 0 踩 0

2606.12281 2026-06-11 cs.MA cs.AI cs.LG 交叉投稿

CCKS: Consensus-based Communication and Knowledge Sharing

CCKS：基于共识的通信与知识共享

Jinyuan Zu, Xiaowei Lv, Yongcai Wang, Deying Li, Yunjun Han, Wenping Chen, Fengyi Zhang, Naiqi Wu

发表机构 * Public Computing Cloud, Renmin University of China（中国人民大学公共计算云）； School of Information, Renmin University of China（中国人民大学信息学院）； State Key Laboratory of Multimodal Artificial Intelligence Systems, Beijing Engineering Research Center of Intelligent Systems and Technology, Institute of Automation, Chinese Academy of Sciences（多模态人工智能系统国家重点实验室，智能系统与技术北京工程研究中心，中国科学院自动化研究所）； The Information Science Academy, China Electronics Technology Group Corporation（中国电子科技集团有限公司信息科学研究院）； Department of Mechatronics Engineering, Guangdong University of Technology（广东工业大学机电工程学院）

AI总结针对多智能体强化学习中动作建议过度依赖教师指导的问题，提出基于共识的通信与知识共享框架，通过对比学习构建共识模型，平衡探索与学习，提升合作效率与性能。

详情

AI中文摘要

在分布式训练和分布式执行（DTDE）的协作多智能体强化学习（MARL）中，基于动作建议的知识共享促进了智能体间的可解释和可扩展合作。然而，当前的动作建议方法往往过于遵循教师的指导，而未评估师生兼容性，导致过度建议、稳定性欠佳和性能下降。为克服这些挑战，本文提出了一种基于共识的通信与知识共享（CCKS）框架，该框架允许智能体基于共识衍生的约束采纳建议，并更智能地遵循教师指令。该机制使智能体能够平衡探索与向经验丰富的教师学习，从而提升整体性能。关键在于共识模型的构建，为此我们提出在智能体训练阶段利用对比学习基于局部观测构建共识模型。在动作选择中，智能体根据共识和共享知识对动作进行评分和选择。CCKS设计为即插即用解决方案，可无缝集成到现有DTDE算法中。在Google Research Football环境和复杂的星际争霸II多智能体挑战中进行的实验表明，与当前的DTDE基线相比，集成CCKS显著提高了合作效率、学习速度和整体性能。代码可从此https URL获取。

英文摘要

In Decentralized Training and Decentralized Execution (DTDE) for cooperative Multi-Agent Reinforcement Learning (MARL), action-advising-based knowledge sharing promotes interpretable and scalable cooperation among agents. However, current action advising approaches often adhere too much to the teacher's guidance without evaluating teacher-student compatibility, which causes excessive advising, suboptimal stability, and degraded performance. To overcome these challenges, this paper presents a Consensus-based Communication and Knowledge Sharing (CCKS) framework, which allows agents to adopt recommendations based on consensus-derived constraints and to follow the teacher's instructions more smartly. This mechanism enables agents to balance exploration and learning from experienced teachers, improving overall performance. The key is the consensus model construction, for which we propose to employ contrastive learning to construct consensus models based on local observations in the agents' training phase. In action selection, agents score and choose actions based on consensus and shared knowledge. Designed as a plug-and-play solution, CCKS integrates seamlessly with existing DTDE algorithms. Experiments conducted in the Google Research Football environment and the complex StarCraft II Multi-Agent Challenge demonstrate that the integration with CCKS significantly improves cooperation efficiency, learning speed, and overall performance compared with current DTDE baselines. The code is available at https://github.com/yuanxpy/CCKS.

URL PDF HTML ☆

赞 0 踩 0

2606.12352 2026-06-11 cs.RO cs.AI 交叉投稿

CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

CHORUS: 基于单一VLA策略的去中心化多体协作

Ria Doshi, Tian Gao, Annie Chen, Chelsea Finn, Jeannette Bohg

发表机构 * Stanford University（斯坦福大学）

AI总结提出CHORUS框架，利用预训练视觉-语言-动作模型的视觉运动先验，实现无需推理时通信的去中心化多机器人协作，在真实实验中显著优于基线。

Comments Project Website: https://chorus-model.github.io

详情

AI中文摘要

多机器人协作使机器人能够高效完成从通过门搬运沙发到建筑工地组装结构等各种任务。然而，在移动多机器人环境中实现这种协调仍然具有挑战性：基于团队联合观测的集中式方法随团队规模扩展性差，而为每个机器人训练一个策略的去中心化方法通常需要显式对齐程序或推理时信息共享来克服部分可观测性。我们的关键见解是，预训练的视觉-语言-动作（VLA）模型的视觉运动先验应能够仅从每个机器人的局部观测实现反应式去中心化协作，无需这些推理时假设。我们提出CHORUS，一个适配单一VLA骨干以控制多样化多机器人团队的框架。推理时，每个机器人运行CHORUS的独立副本，仅基于其自身观测和机器人标识提示。在包括移动卷尺测量、图书馆书籍交接和洗衣篮抬举的真实实验中，CHORUS相比去中心化从头训练模型提升64个百分点，对队友行为的反应性提升40个百分点，并优于集中式基线。这些结果表明，共享VLA骨干能够实现去中心化多机器人协作，无需每个机器人的独立策略或推理时机器人间通信。

英文摘要

Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling structures on a construction site. However, achieving such coordination in mobile multi-robot settings remains challenging: centralized methods conditioned on the combined observations of a team scale poorly with team size, and decentralized methods that train one policy per robot often require explicit alignment procedures or information sharing at inference time to overcome partial observability. Our key insight is that the visuomotor priors of pretrained vision-language-action (VLA) models should enable reactive, decentralized collaboration from each robot's local observations alone, without these inference-time assumptions. We propose CHORUS, a framework that adapts a single VLA backbone to control diverse, multi-robot teams. At inference time, each robot runs an independent copy of CHORUS, conditioned only on its own observations and a robot-identifying prompt. In real-world experiments including mobile tape measurement, library book handovers, and laundry basket lifting, CHORUS achieves a 64% point improvement over decentralized, from-scratch models, improves reactivity to teammate behavior by 40% points, and outperforms centralized baselines. Together, these results show that a shared VLA backbone is capable of achieving decentralized multi-robot collaboration, without per-robot policies or inter-robot communication at inference.

URL PDF HTML ☆

赞 0 踩 0

2307.01472 2026-06-11 cs.AI cs.LG cs.MA 版本更新

Improving Generalization and Data Efficiency with Diffusion in Offline Multi-agent RL

通过扩散模型提升离线多智能体强化学习的泛化能力与数据效率

Zhuoran Li, Ling Pan, Jiatai Huang, Longbo Huang

发表机构 * Institute for Interdisciplinary Information Sciences（交叉信息学院）； Tsinghua University（清华大学）； Department of Electronic and Computer Engineering（电子与计算机工程系）； Hong Kong University of Science and Technology（香港科学与技术大学）

AI总结提出扩散离线多智能体模型（DOM2），利用扩散模型增强策略表达力和多样性，结合轨迹数据重加权，在离线MARL中显著提升性能、泛化能力和数据效率。

详情

AI中文摘要

我们提出了一种新颖的扩散离线多智能体模型（DOM2），用于离线多智能体强化学习（MARL）。与主要依赖策略设计中保守性的现有算法不同，DOM2基于扩散模型增强了策略的表达力和多样性。具体来说，我们将扩散模型融入策略网络，并在训练中提出了一种基于轨迹的数据重加权方案。这些关键要素显著提高了算法对环境变化的鲁棒性，并在性能、泛化和数据效率方面取得了显著提升。我们的大量实验结果表明，DOM2在所有多智能体粒子和多智能体MuJoCo环境中均优于现有最先进方法，并且由于其高表达力和多样性，在迁移环境中（在评估的30个设置中有28个）泛化能力显著更强。此外，DOM2具有超高的数据效率，与现有算法相比，实现相同性能所需数据不超过5%（数据效率提升20倍）。

英文摘要

We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Different from existing algorithms that rely mainly on conservatism in policy design, DOM2 enhances policy expressiveness and diversity based on diffusion model. Specifically, we incorporate a diffusion model into the policy network and propose a trajectory-based data-reweighting scheme in training. These key ingredients significantly improve algorithm robustness against environment changes and achieve significant improvements in performance, generalization and data-efficiency. Our extensive experimental results demonstrate that DOM2 outperforms existing state-of-the-art methods in all multi-agent particle and multi-agent MuJoCo environments, and generalizes significantly better to shifted environments {(in $28$ out of $30$ settings evaluated)} thanks to its high expressiveness and diversity. Moreover, DOM2 is ultra data efficient and requires no more than $5\%$ data for achieving the same performance compared to existing algorithms (a $20\times$ improvement in data efficiency).

URL PDF HTML ☆

赞 0 踩 0

2601.04884 2026-06-11 cs.AI 版本更新

Precomputing Multi-Agent Path Replanning Using Temporal Flexibility

利用时间灵活性预计算多智能体路径重规划

Issa Hanou, Eric Kemmeren, Devin Wild Thomas, Mathijs de Weerdt

发表机构 * Department of Computer Science, University of Waterloo（1 温哥华大学计算机科学系）

AI总结针对多智能体执行中单个智能体延迟导致冲突的问题，提出FlexSIPP算法，通过预计算延迟智能体的所有可行计划并利用其他智能体的时间灵活性，避免级联延迟，在荷兰铁路网络和MovingAI基准测试中实现高效重规划。

Comments Accepted at SoCS'26

详情

AI中文摘要

当智能体被延迟时，执行多智能体计划可能具有挑战性，因为这通常会导致与其他智能体的冲突。因此，我们需要快速找到一个新的安全计划。仅对延迟的智能体进行重规划通常无法产生有效的计划，有时甚至无法产生可行的计划。另一方面，对其他智能体进行重规划可能导致级联变化和延迟，并且计算成本高昂。我们展示了如何通过跟踪和利用其他智能体的时间灵活性（即智能体在不改变与初始延迟智能体之外的其他智能体的顺序，或进一步延迟其他智能体的前提下，可以承受的最大延迟）来高效地对单个延迟智能体进行重规划，同时避免级联延迟。我们的算法FlexSIPP预计算延迟智能体的所有可能计划，并在给定场景中返回对其他智能体的更改。我们在实际案例研究（荷兰密集使用的铁路网络中的列车重规划）和MovingAI MAPF基准测试集中展示了我们的方法。实验表明，FlexSIPP提供了与实际情况调整相关的有效解决方案，并且在合理的时间范围内。

英文摘要

Executing a multi-agent plan can be challenging when an agent is delayed, because this typically creates conflicts with other agents. So, we need to quickly find a new safe plan. Replanning only the delayed agent often does not yield an efficient plan, and sometimes cannot even yield a feasible one. On the other hand, replanning other agents may lead to a cascade of changes and delays, and it is computationally expensive. We show how to efficiently replan a single delayed agent by tracking and using the temporal flexibility of other agents while avoiding cascading delays. This flexibility is the maximum delay that the agent can take without changing the order with agents other than the initially delayed agent, or further delaying other agents. Our algorithm, FlexSIPP, precomputes all possible plans for the delayed agent and returns the changes to the other agents within the given scenario. We demonstrate our method in a real-world case study of replanning trains in the densely-used Dutch railway network and in the MovingAI MAPF benchmark set. Our experiments show that FlexSIPP provides effective solutions relevant to real-world adjustments, and within a reasonable timeframe.

URL PDF HTML ☆

赞 0 踩 0

2602.18291 2026-06-11 cs.AI 版本更新

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

扩散以协调：高效在线多智能体扩散策略

Zhuoran Li, Hai Zhong, Xun Wang, Qingxin Xia, Lihua Zhang, Longbo Huang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出首个在线离线策略多智能体强化学习框架OMAD，利用扩散策略和松弛策略目标最大化缩放联合熵，实现高效探索与协调，在MPE和MAMuJoCo上样本效率提升2.5至5倍。

详情

AI中文摘要

在线多智能体强化学习（MARL）是实现高效智能体协调的重要框架。关键在于增强策略表达能力以实现更优性能。基于扩散的生成模型在图像生成和离线设置中展现出卓越的表达能力和多模态表示，因此非常适合满足这一需求。然而，它们在在线MARL中的潜力尚未被充分探索。主要障碍是扩散模型的难以处理的似然性阻碍了基于熵的探索和协调。为应对这一挑战，我们首次提出使用扩散策略的在线离线策略MARL框架（OMAD）来协调协调。我们的关键创新是采用松弛策略目标，最大化缩放联合熵，从而在无需可处理似然的情况下促进有效探索。此外，在集中训练与分散执行（CTDE）范式中，我们使用联合分布价值函数来优化分散扩散策略。它利用可处理的熵增强目标来指导扩散策略的同时更新，从而确保稳定协调。在MPE和MAMuJoCo上的广泛评估表明，我们的方法在10个不同任务上达到了新的最先进水平，样本效率显著提升了2.5至5倍。

英文摘要

Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose among the first \underline{O}nline off-policy \underline{MA}RL framework using \underline{D}iffusion policies (\textbf{OMAD}) to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across $10$ diverse tasks, demonstrating a remarkable $2.5\times$ to $5\times$ improvement in sample efficiency.

URL PDF HTML ☆

赞 0 踩 0

2605.12655 2026-06-11 cs.AI cs.MA 版本更新

Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning

鲁棒的指令遵从：合作多智能体强化学习

Wo Wei Lin, Ethan Rathbun, Enrico Marchesini, Xiang Zhi Tan

发表机构 * Department of Computer Sciences, Northeastern University（东北大学计算机科学系）； Department of Computer Sciences, Massachusetts Institute of Technology（麻省理工学院计算机科学系）

AI总结针对外部指令中断行为并冲突长期目标的问题，提出宏动作值修正方法（MAVIC），通过修正指令边界的Bellman备份实现一致值估计，在复杂合作环境中保持高指令遵从和基础任务性能。

详情

AI中文摘要

现实场景中的多智能体强化学习（MARL）可能需要适应外部自然语言指令，这些指令会中断正在进行的行为并与长期目标冲突。然而，基于指令的条件奖励引入了一种基本失败模式，因为Bellman更新耦合了跨指令上下文的值估计，导致当指令中断宏动作时值不一致。我们提出了用于指令遵从的宏动作值修正（MAVIC），该方法通过修正传入指令目标并恢复当前目标下的延续值，来纠正指令边界处的Bellman备份。与奖励塑形不同，MAVIC修改了自举目标本身，从而在统一策略下实现随机指令切换时的一致值估计。我们提供了理论分析和演员-评论家实现，并表明MAVIC在日益复杂的合作多智能体环境中实现了高指令遵从，同时保持了基础任务性能。

英文摘要

Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.

URL PDF HTML ☆

赞 0 踩 0

2509.14860 2026-06-11 cs.CV cs.AI cs.CL cs.MA 版本更新

MARIC: Multi-Agent Reasoning for Image Classification

MARIC：用于图像分类的多智能体推理

Wonduk Seo, Minhyeong Yu, Hyunjin An, Seunghyun Lee

发表机构 * Enhans, Seoul, South Korea（韩国首尔Enhans）； Peking University, Beijing, China（中国北京北京大学）

AI总结提出多智能体框架MARIC，通过分解图像分类为协作推理过程，利用大纲智能体、方面智能体和推理智能体进行多视角分析与综合，在四个基准数据集上显著优于基线方法。

Comments 11 pages, preprint

详情

AI中文摘要

图像分类传统上依赖于参数密集型模型训练，需要大规模标注数据集和大量微调才能达到有竞争力的性能。虽然最近的视觉语言模型（VLM）缓解了其中一些限制，但它们仍然受限于对单次表示的依赖，往往无法捕捉视觉内容的互补方面。在本文中，我们介绍了基于多智能体的图像分类推理（MARIC），这是一个多智能体框架，将图像分类重新表述为协作推理过程。MARIC首先利用大纲智能体分析图像的全局主题并生成有针对性的提示。基于这些提示，三个方面智能体沿着不同的视觉维度提取细粒度描述。最后，推理智能体通过集成反思步骤综合这些互补输出，产生用于分类的统一表示。通过明确地将任务分解为多个视角并鼓励反思性综合，MARIC减轻了参数繁重训练和单一VLM推理的缺点。在4个不同的图像分类基准数据集上的实验表明，MARIC显著优于基线，突出了多智能体视觉推理在鲁棒且可解释的图像分类中的有效性。

英文摘要

Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive fine tuning to achieve competitive performance. While recent vision language models (VLMs) alleviate some of these constraints, they remain limited by their reliance on single pass representations, often failing to capture complementary aspects of visual content. In this paper, we introduce Multi Agent based Reasoning for Image Classification (MARIC), a multi agent framework that reformulates image classification as a collaborative reasoning process. MARIC first utilizes an Outliner Agent to analyze the global theme of the image and generate targeted prompts. Based on these prompts, three Aspect Agents extract fine grained descriptions along distinct visual dimensions. Finally, a Reasoning Agent synthesizes these complementary outputs through integrated reflection step, producing a unified representation for classification. By explicitly decomposing the task into multiple perspectives and encouraging reflective synthesis, MARIC mitigates the shortcomings of both parameter-heavy training and monolithic VLM reasoning. Experiments on 4 diverse image classification benchmark datasets demonstrate that MARIC significantly outperforms baselines, highlighting the effectiveness of multi-agent visual reasoning for robust and interpretable image classification.

URL PDF HTML ☆

赞 0 踩 0

2604.20348 2026-06-11 cs.RO cs.AI cs.MA 版本更新

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

通过多智能体上下文学习的双臂机器人操作

Alessio Palma, Indro Spinelli, Vignesh Prasad, Luca Scofano, Yufeng Jin, Georgia Chalvatzaki, Fabio Galasso

发表机构 * Sapienza University of Rome（罗马萨皮恩扎大学）； TU Darmstadt（达姆施塔特技术大学）； Hessian.AI（黑森AI）

AI总结提出BiCICLe框架，将双臂操作建模为多智能体主从问题，通过解耦动作空间实现标准LLM的少样本学习，在TWIN基准上平均成功率70.5%，超越无训练基线。

详情

AI中文摘要

语言模型（LLMs）已成为具身控制的强大推理引擎。特别是，上下文学习（ICL）使得现成的纯文本LLM能够预测机器人动作，无需任何任务特定训练，同时保持其泛化能力。将ICL应用于双臂操作仍然具有挑战性，因为高维联合动作空间和紧密的臂间协调约束迅速压垮标准上下文窗口。为了解决这个问题，我们引入了BiCICLe（双臂协调上下文学习），这是第一个使标准LLM无需微调即可执行少样本双臂操作的框架。BiCICLe将双臂控制建模为多智能体主从问题，将动作空间解耦为顺序的、条件化的单臂预测。在TWIN基准的13个任务上评估，BiCICLe实现了70.5%的平均成功率，比最佳无训练基线高出6.1个百分点，并超过了大多数监督方法。我们还展示了在3个任务上无需特定硬件重新训练的优越现实世界性能。

英文摘要

Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves 70.5% average success rate, outperforming the best training-free baseline by 6.1 percentage points and surpassing most supervised methods. We also demonstrate superior real-world performance on 3 tasks without hardware-specific retraining.

URL PDF HTML ☆

赞 0 踩 0

2606.08102 2026-06-11 cs.RO cs.AI cs.MA 版本更新

Continual Quadruped Robots Coordination via Semantic Skill Discovery

通过语义技能发现实现持续四足机器人协调

Daoqing Wang, Yuchen Xiao, Weixuan Huang, Zhilong Zhang, Shenghua Wan, Meng Li, Lei Yuan, Yang Yu

发表机构 * National Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China（新型软件技术国家重点实验室，南京大学，南京，中国）； School of Artificial Intelligence, Nanjing University, Nanjing, China（人工智能学院，南京大学，南京，中国）； Polixir Technologies, Nanjing, China（南京极智科技有限公司）

AI总结提出Conquer框架，通过语义技能库实现多四足机器人在持续学习任务中的协调，避免灾难性遗忘，最终平均成功率95.6%。

Comments 22 pages, 8 figures, 11 tables. Project page: https://conquer-project.pages.dev/

详情

AI中文摘要

多四足协调因其增强的负载能力、更广的接触覆盖范围以及对挑战性任务的适应性提升而受到越来越多的关注。现有的多四足操作方法通常专注于预定义或封闭的任务族，往往依赖多智能体强化学习（MARL）来训练特定任务的协调策略。然而，这类方法在开放式持续学习场景中难以应对，其中任务顺序到达，机器人期望在复用先前学到的技能的同时获取新协调技能，且不出现灾难性遗忘。为应对这一挑战，我们提出Conquer，一个语义技能库框架，将持续多四足协调形式化为检索-适应-更新过程。首先，为适应不同任务中的团队规模变化，我们设计了一个团队结构的Self-Allies-Goal（SAG）主干，通过显式建模每个机器人自身状态、队友上下文和任务目标，支持可变基数的机器人团队。对于每个新任务，Conquer从执行前信息构建任务级语义描述符，并从技能库中检索相关技能进行适应。成功执行后，Conquer通过提取轨迹级语义描述符并根据语义距离组织它们来更新技能库，从而实现持续技能积累和跨任务知识迁移。仿真实验表明，Conquer达到了95.6%的最终平均成功率，展示了强大的前向迁移能力和可忽略的灾难性遗忘。在宇树Go2团队上的实际部署进一步验证了Conquer用于实际多四足协调的可行性。仿真和真实机器人演示视频见：https://conquer-project.pages.dev/。

英文摘要

Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/.

URL PDF HTML ☆

赞 0 踩 0

2606.11662 2026-06-11 cs.AI 新提交

TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search

TreeSeeker：深度搜索中的树结构试错与回溯

Zhuofan Shi, Mingzhe Ma, Lu Wang, Fangkai Yang, Pu Zhao, Yiming Guan, Youling Huang, Wei Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan

发表机构 * Microsoft（微软公司）； East China Normal University（东华大学）

AI总结提出TreeSeeker框架，通过树结构分支-回溯搜索和UCB信号选择，在深度搜索中实现受控试错，显著提升复杂问答性能。

详情

AI中文摘要

将未来行为预测作为学习任务

Mosh Levy, Yoav Goldberg, Asa Cooper Stickland

发表机构 * Bar-Ilan University（巴伊兰大学）； Allen Institute for AI（艾伦人工智能研究所）； UK AI Security Institute（英国人工智能安全研究所）

AI总结提出将AI行为预测作为可学习任务，训练行为预测器从推理轨迹中预测未来行为，无需解释步骤，在两项任务上优于GPT-5.4和Claude Opus-4.6。

详情

AI中文摘要

对AI系统的信任通常基于对其工作原理的解释，人们利用这些解释来预测系统在新输入上的行为。对于大型推理模型（LRM），这条常规路径尤其难以遵循：针对单个token生成的解释方法无法自然推广到长轨迹，而轨迹本身在作为自然语言阅读时往往不忠实。我们提出一种绕过解释步骤的替代方案：将行为预测视为可学习任务，训练行为预测器（Behavior Forecasters）在单个推理轨迹上运行，以做出通常从解释中寻求的相同预测。预测器的训练数据通过查询LRM获得，无需人工标注，其推理在单次前向传播中完成。我们在两个任务上实例化该方法：LRM在重新运行时重复其答案的可能性，以及移除输入部分如何改变其答案。我们在三个不同的推理数据集上对这两个任务进行了评估，发现训练后的行为预测器比作为朴素读者阅读相同轨迹的GPT-5.4和Claude Opus-4.6更准确，而推理成本仅为其一小部分。我们发现，端到端微调骨干网络并从目标LRM初始化对于强性能都是必要的。这些结果表明，推理轨迹携带了关于LRM未来行为的信息，超出了朴素阅读所能传达的范围。

英文摘要

Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods for single token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operates on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM with no human annotation, and its inference is done in a single forward pass. We instantiate this approach on two tasks: how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer. We evaluate this approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior that goes beyond what naive reading conveys.

URL PDF HTML ☆

赞 0 踩 0

2606.11559 2026-06-11 cs.AI 新提交

HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

HERO: 基于环境观察的后见增强反思的智能体自蒸馏

Haoran Liu, Yuwei Zhang, Xiyao Li, Bohan Lyu, Jingbo Shang

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）； Independent Researcher（独立研究员）； University of California, Berkeley（加州大学伯克利分校）

AI总结提出HERO框架，利用环境观察作为局部对齐反馈进行自蒸馏，解决多轮设置中特权反馈与当前决策上下文不对齐导致的性能下降问题，在TauBench和WebShop上提升任务成功率并减少冗余轮次。

详情

AI中文摘要

强化学习通常通过轨迹的终端结果来提升多轮智能体能力，这使得难以确定每个中间轮的信用分配。最近的在线自蒸馏方法通过自教师将特权反馈转化为密集的令牌级监督，提供了一种有前景的替代方案。我们的研究动机是观察到当朴素地将此范式扩展到多轮设置时出现意外的性能下降，我们将其归因于特权反馈（如成功轨迹或终端结果）与学生当前决策上下文之间缺乏对齐。我们引入了HERO，一种后见增强的自蒸馏框架，它使用下一个环境观察作为局部对齐反馈。每次轨迹展开后，HERO反思完成的交互，将每个观察转化为紧凑的轮级诊断，捕获关于原始动作的可操作反馈，如其必要性、有效性或失败原因。在TauBench和WebShop上，HERO比仅环境反馈的自蒸馏和GRPO提高了任务成功率并减少了不必要的轮次。在训练轮次预算有限（成功轨迹稀少且GRPO提供弱奖励对比信号）的情况下，它尤其有效。

英文摘要

Reinforcement learning typically improves multi-turn agent capabilities through the terminal outcome of the trajectories, which makes it difficult to determine credit assignments for each intermediate turns. Recent on-policy self-distillation methods offer a promising alternative by converting privileged feedback into dense token-level supervision through a self-teacher. Our study is motivated by the unexpected performance degradation observed when naively extending this paradigm to multi-turn settings, which we attribute to a lack of alignment between privileged feedback, such as successful trajectories or terminal outcomes, and the student's current decision context. We introduce HERO, a hindsight-enhanced self-distillation framework that uses next environment observations as locally aligned feedback. After each rollout, HERO reflects on the completed interaction to convert each observation into a compact turn-level diagnosis, that captures actionable feedback about the original action such as its necessity, validity or failure cause. On TauBench and WebShop, HERO improves task success and reduces unnecessary turns over environment-feedback-only self-distillation and GRPO. It is especially effective under limited training turn budgets, where successful rollouts are rare and GRPO provides weak reward-contrast signals.

URL PDF HTML ☆

赞 0 踩 0

2606.11634 2026-06-11 cs.AI 新提交

SPEAR: 一种后量化误差自适应恢复系统，实现高效低比特LLM服务

Hongyuan Liu, Yawei Li, Zhiqiang Que, Qinli Yang, Junming Shao, Guosheng Hu

发表机构 * University of Electronic Science and Technology of China（电子科学与技术大学）； University of Bristol（布里斯托大学）； ETH Zurich（苏黎世联邦理工学院）

AI总结针对低比特量化导致LLM质量下降的问题，提出SPEAR系统，通过输入感知的门控误差补偿器（EC）选择性修正高误差层，结合自适应内核融合调度和SLO感知调度器，在<1%内存开销下恢复W4与FP16之间56-75%的困惑度差距。

详情

AI中文摘要

高效的大语言模型（LLM）服务日益受到部署成本的制约。量化是降低服务成本的关键技术，但即使是最先进的4比特量化器，其与FP16之间仍存在显著的质量差距，尤其是在低比特服务最有利的小型模型中。我们发现这一差距的根本原因：量化误差高度依赖于输入，且在不同token之间差异显著，而现有的后量化补偿方法是静态的，对所有输入应用相同的修正。结果，简单token被过度修正，而困难token则修正不足。我们提出SPEAR，一种后量化误差自适应恢复系统，用于改进低比特LLM服务。SPEAR引入了由逐token门控调制的轻量级误差补偿器（EC），并将其仅放置在通过CKA引导的熵感知诊断识别出的最误差敏感层。这将少量参数预算集中在最有效的位置。EC的高效部署带来了若干系统挑战，包括额外计算、由输入相关门控引起的张量并行同步，以及跨配置的延迟不稳定。SPEAR通过自适应内核融合调度解决了这些问题，结合了后同步集成规约内核与P2P双写，将EC后计算融合到低比特GEMM中，并采用SLO约束的EC感知调度器以实现可预测的服务性能。在具有挑战性的逐通道量化设置中，SPEAR恢复了W4与FP16之间56-75%的困惑度差距，同时增加了不到1%的模型内存开销，并保持了与广泛使用的4比特服务部署相当的延迟。

英文摘要

Efficient large language model (LLM) serving is increasingly constrained by deployment cost. Quantization is a key technique for reducing serving cost, yet even state-of-the-art 4-bit quantizers exhibit a noticeable quality gap from FP16, particularly for smaller models where low-bit serving is most beneficial. We identify a fundamental cause of this gap: quantization error is highly input-dependent and varies substantially across tokens, while existing post-quantization compensation methods are static and apply identical corrections to all inputs. As a result, easy tokens are over-corrected while hard tokens remain under-corrected. We present SPEAR, a system for post-quantization error-adaptive recovery that improves low-bit LLM serving. SPEAR introduces lightweight Error Compensators (ECs) modulated by per-token gates and places them only at the most error-sensitive layers identified through a CKA-guided entropy-aware diagnostic. This focuses a small parameter budget where it is most effective. Efficient deployment of ECs presents several systems challenges, including additional computation, tensor-parallel synchronization caused by input-dependent gating, and latency instability across configurations. SPEAR addresses these issues through adaptive kernel-fusion dispatch, combining an epilogue-integrated peer-reduction kernel with P2P dual-write to fuse the post-EC computation into low-bit GEMMs, and an SLO-constrained EC-aware scheduler for predictable serving performance. Across challenging per-channel quantization settings, SPEAR recovers 56-75% of the perplexity gap between W4 and FP16 while adding less than 1% model memory overhead and maintaining latency comparable to a widely used 4-bit serving deployment.

URL PDF HTML ☆

赞 0 踩 0

2606.11262 2026-06-11 cs.LG cs.AI 交叉投稿

PermDoRA -- Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry

PermDoRA -- 理解语言模型中的适配器干扰：参数空间几何的局限性

Gowtham Sivaramakrishnan, Sarvesha Kumar Kombaiah Seetha, Kishan Gupta Balaji, Santhosh Baradwaj Vaduvur Ranganathan

发表机构 * Independent Researcher（独立研究员）

AI总结研究适配器组合中的干扰是否源于线性参数更新重叠，通过DoRA-RBAC框架和几何感知合并策略实验，发现参数空间几何不是干扰主因，而是共享非线性表示中的交互。

Comments 18 Pages, COLM 2026

详情

AI中文摘要

大型语言模型（LLMs）中的访问控制需要模块化机制，以在不重新训练或跨领域干扰的情况下实现特定领域行为。一个常见的假设是，适配器组合过程中的干扰源于线性参数更新的重叠，这表明强制正交性或方向独立性应能提高多领域性能。我们使用DoRA-RBAC（一种基于权重分解低秩适配的分层适配器组合框架）来测试这一假设。我们比较了传统的欧几里得合并与一种几何感知的黎曼启发式合并策略，该策略通过在LLaMA-3.1-8B和Mistral-7B上的多个QA基准（GPQA、PubMedQA、SimpleQA、WMDP）上进行归一化方向平均来近似弗雷歇均值。我们的结果表明，虽然单领域性能与LoRA相当，但几何感知合并相比标准平均在多领域组合中并未提供一致的优势。进一步分析揭示，适配器更新的角度对齐和正交性是组合性能的弱预测因子。这些发现表明，适配器干扰并非主要由参数空间几何决定，而是与共享非线性表示中的交互一致。

英文摘要

Access control in large language models (LLMs) requires modular mechanisms to enable domain-specific behavior without retraining or cross-domain interference. A common hypothesis is that interference during adapter composition arises from overlap in linear parameter updates, suggesting that enforcing orthogonality or directional independence should improve multi-domain performance. We test this hypothesis using DoRA-RBAC, a hierarchical adapter composition framework based on weight-decomposed low-rank adaptation. We compare conventional Euclidean merging with a geometry-aware Riemannian-inspired merging strategy that approximates the Frechet mean via normalized directional averaging across multiple QA benchmarks (GPQA, PubMedQA, SimpleQA, WMDP) on LLaMA-3.1-8B and Mistral-7B. Our results show that while single-domain performance matches LoRA, geometry-aware merging provides no consistent advantage over standard averaging in multi-domain settings.Diagnostic analysis further reveals that angular alignment and orthogonality of adapter updates are weak predictors of composition performance. These findings suggest that adapter interference is not governed primarily by parameter-space geometry, but is instead consistent with interactions in shared nonlinear representations.

URL PDF HTML ☆

赞 0 踩 0

2606.11272 2026-06-11 cs.LG cs.AI 交叉投稿

Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data

联邦持续学习：分布式和非平稳数据上的终身与隐私保护学习综述

Masoume Gholizade, Fabrizio Ruffini, Pietro Ducange, Francesco Marcelloni

发表机构 * University of Pisa（比萨大学）； University of Modena and Reggio Emilia（摩德纳和雷焦艾米利亚大学）

AI总结本文系统综述联邦持续学习（FCL），定义问题、分析经典联邦学习在非平稳数据下的局限，提出多维分类法，并讨论应用、评估指标及开放挑战。

Comments 77 pages, 8 figures

详情

DOI: 10.1016/j.neucom.2026.133929
Journal ref: Neurocomputing, Volume 694, 2026, 133929

AI中文摘要

联邦学习（FL）能够在分布式客户端之间实现协作和隐私保护的模型训练，但大多数现有的FL系统隐含地假设数据是平稳的。在现实场景中——如医疗、工业物联网（IIOT）、网络安全和智慧城市——数据流本质上是非平稳的，导致经典FL方法遭受性能下降、不稳定和灾难性遗忘。持续学习（CL）解决了在演化数据分布下的学习问题，但主要在集中式环境中研究，忽视了联邦系统的关键约束，包括隐私、有限通信和客户端异质性。联邦持续学习（FCL）出现在FL和CL的交汇处，旨在支持分布式和非平稳数据上的终身、自适应和隐私感知学习。本综述提供了FCL的全面和系统概述。我们首先给出FCL问题的正式定义并阐明其独特特征。然后分析经典FL在非平稳条件下的局限性，强调CL原理如何支持长期适应。为了组织快速增长的文献，我们提出了FCL方法的多维分类法。此外，我们回顾了代表性的应用领域和数据模态，总结了常用的评估指标，并讨论了评估长期性能和遗忘的实验视角。最后，我们强调了关键开放挑战，包括处理时间漂移下的极端异质性、设计可扩展且隐私保护的记忆机制，以及建立标准化基准。本综述旨在为推进FCL走向鲁棒和可部署的现实世界系统提供参考和路线图。

英文摘要

Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings-such as healthcare, industrial IoT (IIOT), cybersecurity, and smart cities-data streams are inherently non-stationary, leading classical FL methods to suffer from performance degradation, instability, and catastrophic forgetting. Continual Learning (CL) addresses learning under evolving data distributions but has been largely studied in centralized settings, overlooking key constraints of federated systems, including privacy, limited communication, and client heterogeneity. Federated Continual Learning (FCL) emerges at the intersection of FL and CL, aiming to support lifelong, adaptive, and privacy-aware learning over distributed and non-stationary data. This survey provides a comprehensive and systematic overview of FCL. We first present a formal definition of the FCL problem and clarify its distinctive characteristics. We then analyze the limitations of classical FL under non-stationary conditions, highlighting how CL principles support long-term adaptation. To organize the rapidly growing literature, we propose a multi-dimensional taxonomy of FCL approaches. Furthermore, we review representative application domains and data modalities, summarize commonly used evaluation metrics, and discuss experimental perspectives for assessing long-term performance and forgetting. Finally, we highlight key open challenges, including handling extreme heterogeneity under temporal drift, designing scalable and privacy-preserving memory mechanisms, and establishing standardized benchmarks. This survey aims to serve as a reference and a roadmap for advancing FCL toward robust and deployable real-world systems.

URL PDF HTML ☆

赞 0 踩 0

2606.11275 2026-06-11 cs.LG cs.AI 交叉投稿

RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

RoVE: 旋转值嵌入注意力实现相对位置相关的值路径

Alejandro García-Castellanos, Maurice Weiler, Erik J Bekkers

发表机构 * AMLab University of Amsterdam（阿姆斯特丹大学AMLab）； MIT CSAIL（麻省理工学院计算机科学与人工智能实验室）

AI总结提出RoVE方法，通过同时旋转键和值使值对位置敏感，将RoPE注意力转化为注意力卷积，在少样本学习、分布外困惑度和长上下文检索上优于RoPE。

2606.11417 2026-06-11 cs.LG cs.AI stat.ML 交叉投稿

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

密封审计上的有符号压缩进展是古德哈特抵抗的

Ayush Mittal, Dhruv Gupta

发表机构 * GitHub

AI总结提出有符号压缩进展作为内在动机，证明其累积奖励等于审计改进，且对有限审计面板具有假阳性预算，抵抗古德哈特定律。

Comments 16 pages, 7 figures. Lean 4 (Mathlib) mechanized core and ARC-TGI experiment code: https://github.com/Zetetic-Dhruv/audit-compression-progress

详情

AI中文摘要

压缩进展是一个长期提出的内在动机方案：当智能体的世界模型在预测或压缩经验方面变得更好时给予奖励。民间声称这种奖励是“可信的”，因为它只在学习时支付。我们使这一点精确化并证明它。如果内在奖励是固定密封审计损失的有符号减少，即 r_t = E(theta_{t-1}) - E(theta_t)，那么累积奖励恰好望远镜式地归结为端点审计改进，因此没有策略可以在真实审计性能停滞或下降时无限推高奖励。对于有限审计面板，同样的结果成立，并带有尖锐的假阳性预算：累积经验奖励最多为真实审计改进加上 2 Delta_n(F, delta)，即模型类的均匀审计偏差。这是无水平依赖的：一旦密封面板均匀控制该类，随时间变化的适应性无需付出代价。该定理还识别了失败模式：如果进展被截断、在智能体自身流上评分、暴露于可重用面板上的高容量模型，或应用于使 Delta_n 无效的神经类，则保证消失。我们给出了结构核心（望远镜式、有限审计界、有限吉布斯和熵下限）的 Lean 4 机械化，以及在 ARC-TGI 网格变换生成器上带有自适应保留攻击的实验套件。实验证实了理论：有限审计偏差按 n^{-0.527} 缩放；有符号进展抵抗截断农场、流泄漏和噪声电视好奇心；朴素的可重用审计可被黑盒标量反馈利用，而标准发布防御将攻击保持在 2 Delta_n 阈值以下。密封审计上的有符号压缩进展是真正改进的会计信号。

英文摘要

Compression progress is a long-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience. The folk claim is that this reward is "credible" because it is paid only for learning. We make this precise and prove it. If intrinsic reward is the signed decrease of a fixed sealed-audit loss, r_t = E(theta_{t-1}) - E(theta_t), then cumulative reward telescopes exactly to endpoint audit improvement, so no policy can push reward up indefinitely while true audit performance stagnates or degrades. For finite audit panels the same result holds with a sharp false-positive budget: cumulative empirical reward is at most true audit improvement plus 2 Delta_n(F, delta), the uniform audit deviation of the model class. This is horizon-free: adaptivity over time costs nothing once the sealed panel uniformly controls the class. The theorem also identifies the failure modes: the guarantee disappears if progress is clipped, scored on the agent's own stream, exposed to a high-capacity model on a reusable panel, or applied to a neural class that makes Delta_n vacuous. We give a Lean 4 mechanization of the structural core (telescoping, the finite-audit bound, finite Gibbs, and the entropy floor) and an experiment suite on ARC-TGI grid-transformation generators with adaptive holdout attacks. Experiments confirm the theory: finite-audit deviation scales as n^{-0.527}; signed progress resists clip-farming, stream leakage, and noisy-TV curiosity; naive reusable audits are exploitable by black-box scalar feedback, while standard release defenses keep the attack below the 2 Delta_n threshold. Signed compression progress on a sealed audit is an accounting signal of genuine improvement.

URL PDF HTML ☆

赞 0 踩 0

2606.11437 2026-06-11 cs.DS cs.AI cs.LG stat.ML 交叉投稿

当上下文回归：面向在线策略蒸馏中的鲁棒内化

Xun Wang, Ruishuo Chen, Zhuoran Li, Yu Chen, Longbo Huang

发表机构 * IIIS, Tsinghua University（清华大学交叉信息研究院）

AI总结针对在线策略蒸馏中上下文内化后重新引入上下文导致性能下降的问题，提出一种轻量级一致性正则化方法，通过锚定无上下文输出并惩罚偏离，有效缓解退化并提升鲁棒性。

详情

AI中文摘要

近期研究表明，在线策略蒸馏可以将特权上下文（如系统提示或任务提示）内化到学生模型中，使得推理时不再需要上下文。尽管该方法成功提升了学生的无上下文性能，我们却发现一个有趣且此前未被研究的现象：在许多设置中，向蒸馏后的学生模型重新引入原始特权上下文实际上会降低其性能，甚至对于它已经在无上下文情况下正确解决的实例也是如此。我们将此称为上下文诱导退化，并认为鲁棒内化不仅要求匹配教师的条件上下文行为，还要求在上下文重新引入时保持稳定，这一性质我们称为上下文可移除性。受此观察启发，我们提出一种轻量级一致性正则化方法，首先通过停止梯度锚定学生的无上下文输出，然后通过前向KL散度惩罚条件上下文输出偏离该锚点。这一简单添加每训练步仅需一次额外前向传播，却能有效缓解上下文诱导退化，并在许多情况下甚至提升无上下文性能。在涵盖不同领域和模型家族的12种配置中，我们的方法在大多数设置下提升了条件上下文准确率，在11/12的设置中减少了上下文诱导损害，并有效消除了响应长度膨胀。一项机制性案例研究进一步证实，上下文可移除性在表示层面得以实现，无论上下文是否存在，隐藏状态几乎保持相同。

英文摘要

Recent work has shown that on-policy distillation can internalize privileged context, such as system prompts or task hints, into a student model so that the context is no longer needed at inference time. Although this approach successfully improves the student's no-context performance, we identify an interesting and previously unstudied phenomenon: in many settings, reintroducing the original privileged context to the distilled student actually degrades its performance, even on instances it already solves correctly without context. We term this context-induced degradation and argue that robust internalization demands not only matching the teacher's context-conditioned behavior, but also remaining stable when the context is reintroduced, a property we call context removability. Motivated by this observation, we propose a lightweight consistency regularizer that first anchors the student's no-context output via stop-gradient, then penalizes the context-conditioned output for deviating from it via forward KL divergence. This simple addition requires only one extra forward pass per training step, yet it effectively mitigates context-induced degradation and, in many cases, even improves no-context performance. Across 12 configurations spanning diverse domains and model families, our method improves context-conditioned accuracy in the majority of settings, reduces context-induced harm in 11 out of 12 settings, and effectively eliminates response-length inflation. A mechanistic case study further confirms that context removability is achieved at the representation level, with hidden states remaining nearly identical regardless of whether the context is present.

URL PDF HTML ☆

赞 0 踩 0

2606.11640 2026-06-11 cs.LG cs.AI 交叉投稿

TAROT: Task-Adaptive Refinement of LLM-prior Graphs for Few-shot Tabular Learning

TAROT: 面向小样本表格学习的任务自适应LLM先验图精炼

Ruxue Shi, Yili Wang, Mengnan Du, Hangting Ye, Yi Chang, Xin Wang

发表机构 * Jilin University（吉林大学）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结提出TAROT框架，通过构建并精炼任务自适应语义图，利用LLM先验和GNN编码特征语义关系，提升小样本表格学习性能。

详情

AI中文摘要

小样本表格学习为实际应用中标注成本高、新任务样本收集困难的情况提供了一种经济有效的方法。现有的传统方法和基于LLM的方法在小样本场景中已展现出有效性。然而，传统方法需要在未标注或生成的数据上进行额外训练，这带来了显著的计算开销。此外，直接将原始表格数据输入LLM的基于LLM的方法引发了隐私和合规性问题。更重要的是，这两种范式都很大程度上忽略了特征之间的语义关系，而语义关系为构建语义图提供了结构和语义先验。语义图对于在小样本场景中建模有意义的特征交互至关重要。本文提出TAROT，一个基于GNN的框架，通过从先验中构建并精炼任务自适应语义图来编码结构和语义先验，从而提升小样本表格学习的预测性能。TAROT首先通过统一语义表格节点编码器（USTNE）将异构表格数据编码为统一的节点语义表示。然后，它提示LLM根据任务描述和特征名称推断特征之间的语义关系，以构建语义图。为了减轻LLM幻觉引入的结构噪声，TAROT引入了任务自适应语义图精炼，剪除虚假或与任务无关的边，并添加缺失的与任务相关的边，使图结构与下游目标对齐。最后，GNN在精炼后的图上进行消息传递，以捕获与任务相关的语义依赖关系进行预测。在各种小样本表格学习基准上的大量实验证明了TAROT的优越性能，使其成为该领域的最先进方法。

英文摘要

Few-shot tabular learning provides a cost-effective approach for real-world applications where annotation is costly and collecting sufficient samples for new tasks is difficult. Existing Traditional and LLM-based methods have demonstrated effectiveness in few-shot scenarios. However, traditional methods need additional training on unlabeled or generated data, which incur significant computational overhead. In addition, LLM-based methods that directly feed raw tabular data into LLMs raise privacy and compliance concerns. More importantly, both paradigms largely overlook the semantic relationships between features, which provide structural and semantic prior for constructing a semantic graph. Semantic graph is essential for modeling meaningful feature interactions in few-shot scenarios. In this paper, we propose TAROT, a GNN-based framework that encodes the structural and semantic prior by constructing and refining a task-adaptive semantic graph from this prior, thereby improving predictive performance in few-shot tabular learning. TAROT first encodes heterogeneous tabular data into unified node semantic representations via a Unified Semantic Tabular Node Encoder (USTNE). Then, it prompts LLMs to infer the semantic relationship between features based on the task description and feature names to construct a semantic graph. To mitigate structural noise introduced by the hallucination of LLMs, TAROT introduces Task-adaptive Semantic Graph Refinement that prunes spurious or task-unrelated edges and adds missing task-related ones, aligning the graph structure with the downstream objective. Finally, a GNN performs message passing over the refined graph to capture task-related semantic dependencies for prediction. Extensive experiments on various few-shot tabular learning benchmarks demonstrate the superior performance of TAROT, establishing it as a state-of-the-art approach in this domain.

URL PDF HTML ☆

赞 0 踩 0

2606.11695 2026-06-11 cs.LG cs.AI 交叉投稿

稀疏化Kolmogorov-Arnold网络用于可解释量子态层析

Xinge Wu, Huaxin Wang, Jiajun Liu, Ruiqing He, Jiandong Shang, Hengliang Guo, Qiang Chen

发表机构 * National Supercomputing Center in Zhengzhou（郑州国家超级计算中心）； Zhengzhou University（郑州大学）； School of Computer and Artificial Intelligence（计算机与人工智能学院）； School of Communication and Artificial Intelligence（通信与人工智能学院）； School of Integrated Circuits（集成电路学院）； Nanjing Institute of Technology（南京理工大学）

AI总结研究利用稀疏化Kolmogorov-Arnold网络作为可检查的重构规则，通过三量子比特GHZ基准测试，识别出与GHZ相关的Pauli测量集，并揭示与解析GHZ Pauli分组一致的输入-隐藏-输出通路结构，实现神经网络重构模型的结构可解释性。

详情

AI中文摘要

量子态层析的机器学习方法可以实现高保真度重构，但训练模型所使用的物理结构往往隐含。这里我们探究稀疏化Kolmogorov-Arnold网络（KAN）是否不仅可以作为回归器，还可以作为可检查的重构规则，其内部组织可以与已知的Pauli结构进行对照。我们研究了一个受控的三量子比特GHZ族基准测试，其中所有63个非恒等Pauli期望值被用于重构三个GHZ子空间变量：种群不平衡$z$、实部非对角分量$c$和虚部非对角分量$c$。在有限采样和退极化噪声下，外部消融从63个测量中识别出扩展的12通道GHZ相关Pauli集，在测试的采样次数和退极化噪声强度下实现了精确的前12恢复。这些支持模式在多种子随机初始化和噪声水平分析中保持稳定，并在随机标签控制下崩溃。主要的剪枝输入-隐藏-输出通路以与解析GHZ Pauli分组一致的方式组织Z型种群可观测量和X/Y非对角可观测量，稀疏公式恢复恢复了规范的带符号Pauli关系。因此，KAN的贡献在于神经重构模型中的通路级结构可解释性，而非优越的稀疏回归。结合阴性对照，这些探针提供了一条一致性链，用于审计学习到的重构规则与已知物理结构的一致性。

英文摘要

Machine-learning approaches to quantum state tomography can achieve high reconstruction fidelity, but the physical structure used by the trained model often remains implicit. Here we ask whether a sparsified Kolmogorov-Arnold Network (KAN) can be used not only as a regressor, but also as an inspectable reconstruction rule whose internal organization can be checked against known Pauli structure. We study a controlled three-qubit GHZ-family benchmark in which all 63 non-identity Pauli expectation values are used to reconstruct three GHZ-subspace variables: the population imbalance $z$, the real off-diagonal component $c$, and the imaginary off-diagonal component $s$. Under finite-shot sampling and depolarizing noise, external ablation identifies the extended 12-channel GHZ-relevant Pauli set from the 63 measurements, with exact top-12 recovery across the tested shot counts and depolarizing-noise strengths. These support patterns remain stable across multi-seed random-initialization and noise-level analyses, and collapse under random-label controls. The dominant pruned input-hidden-output pathways organize Z-type population observables and X/Y off-diagonal observables in a pattern consistent with the analytic GHZ Pauli grouping, and sparse formula recovery recovers the canonical signed Pauli relations. The contribution of the KAN is therefore pathway-level structural interpretability within a neural reconstruction model, rather than superior sparse regression. Together with negative controls, these probes provide a consistency chain for auditing learned reconstruction rules against known physical structure.

URL PDF HTML ☆

赞 0 踩 0

2606.11831 2026-06-11 cs.LG cs.AI 交叉投稿

From Uniform to Learned Graph Priors: Diffusion for Structure Discovery

从均匀到学习图先验：用于结构发现的扩散

Qi Shao, Hao Guo, Jiawen Chen, Duxin Chen, Wenwu Yu

发表机构 * School of Mathematics, Southeast University（东南大学数学学院）

AI总结提出Diff-prior，一种扩散参数化的自适应先验，通过可学习的去噪式校准对边后验进行结构化校准，提升神经关系推理方法的结构发现可靠性。

Comments 15 pages, 3 figures, Accepted by KDD 2026

详情

DOI: 10.1145/3770855.3817940

AI中文摘要

神经关系推理（NRI）方法通过离散潜在边的变分推理从轨迹中发现交互图。然而，这些方法通常依赖于过度简化的因子化图先验。这种先验通常接近均匀分布，将边视为独立实体。这种系统性错位与现实世界系统不匹配，导致边后验分散且不明确，限制了结构发现的可靠性。为了解决这个问题，我们提出了\textit{Diff-prior}，一种扩散参数化的自适应先验，用于校准潜在图分布而非生成图。我们的核心见解是将先验整合重新构建为一种可学习的去噪式校准，将分散、不确定的边后验组织成更可靠的整体结构，该结构可通过扩散模型训练。Diff-prior学习一个自适应结构先验，在推理过程中对边后验进行结构化校准，引导其朝向更接近底层结构的分布。Diff-prior在结构采样之前操作，并直接对编码器边分布进行去噪校准，为结构化变量提供了一种通用的训练范式。在标准基准上的实验验证了我们的框架，结果表明Diff-prior提高了结构推理的性能，并在多个NRI系列架构中生成更明确的边后验。代码可在以下网址获取：https://this URL。

英文摘要

Neural relational inference (NRI) methods discover interaction graphs from trajectories through variational reasoning on discrete potential edges. However, these methods typically rely on oversimplified, factorized graph priors. Such priors, typically nearing uniform distributions, treat edges as independent entities. This systemic misalignment does not match the real-world systems and yields diffuse and indecisive edge posteriors limiting the reliability of structural discovery. To address this, we propose \textit{Diff-prior}, a diffusion-parameterized adaptive prior used to calibrate latent graph distribution rather than generate graphs. Our core insight is to reframe prior integration as a learnable denoising-style calibration that organizes scattered, uncertain edge posteriors into a more reliable overall structure which can be trained by the diffusion model. Diff-prior learns an adaptive structure prior that performs structured calibration on the edge posteriors during inference, guiding it towards a distribution closer to the underlying structure. The diff-prior operates before structural sampling and acts as a denoising calibrator directly on the encoder edge distribution, which provides a generic training paradigm over structured variables. Experiments on standard benchmarks validated our framework, and the results indicate that Diff-prior improves the performance of structure inference and generates more decisive edge posteriors across multiple NRI-family architectures. The code is available on https://github.com/Hardy158118/Diffprior.

URL PDF HTML ☆

赞 0 踩 0

2606.11854 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

使用ART微调多模态大语言模型：基于艺术的强化训练

Michal Chudoba, Sergey Alyaev, Petra Galuscakova, Tomasz Wiktorski

发表机构 * University of Stavanger（斯塔万格大学）； NORCE Research（NORCE研究机构）

AI总结提出ART方法，通过优化原始视觉输入将信息注入冻结的多模态大语言模型，实现软提示微调，无需修改计算图，在数学和工具使用基准上达到与LoRA相当的精度。

详情

AI中文摘要

大语言模型有两种主要的参数高效微调技术。低秩适应在LLM层之间引入额外权重，而软提示则向LLM输入引入额外的微调特定原始token。然而，两者都需要修改预编译、预优化LLM的计算图。因此，两者在vLLM等高吞吐引擎中均未得到完全支持。我们提出使用ART（基于艺术的强化训练）进行微调。该方法通过仅优化冻结的多模态大语言模型的原始视觉输入来注入信息，从而在预编译计算图上实现软token方法。它依赖于将梯度反向传播到普通像素阵列，因此支持任何微调目标。此外，优化的视觉输入可以风格化为与任务相关的计算艺术品。该方法在流行的开源Qwen架构的不同规模以及多个文本基准上的有效性得到确认。具体而言，ART在数学和结构化工具使用基准上达到了与LoRA竞争的精度。

英文摘要

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2606.11961 2026-06-11 cs.LG cs.AI 交叉投稿

Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data

类别先验锁定：为何上下文学习在结构化数据上失败

Antonio Pelusi, Stefano Braghin, Alberto Trombetta

发表机构 * University of Insubria（因苏布里亚大学）； IBM Research Ireland（IBM 爱尔兰研究院）

AI总结研究大语言模型在结构化数据生成中上下文学习的局限性，发现其无法更新预训练中的类别先验分布，导致罕见类完全无法生成；参数高效微调可解决但带来记忆化风险。

Comments 9 pages, 5 figures. Empirical study of in-context learning and LoRA fine-tuning for synthetic tabular data generation, introducing the phenomenon of categorical prior lock-in. Under review

详情

AI中文摘要

大型语言模型（LLM）越来越多地被用作结构化数据的条件生成器，依赖上下文学习（ICL）来适应新分布而无需更新参数。我们以高基数表格数据作为受控测试案例，研究分布不匹配下ICL在结构化生成中的局限性，并识别出一种结构性失败模式，我们称之为“类别先验锁定”：ICL无法更新模型从预训练中继承的令牌分布先验。在两个70亿参数开源模型中，ICL随着示例增加提高了数值保真度，但在类别分布上表现出明显的天花板效应，完全无法复现罕见类。参数高效微调（LoRA）克服了这些限制，但引入了可测量的记忆化风险，并在某些情况下破坏了结构化输出生成的稳定性，凸显了适应性与隐私之间的基本权衡。

英文摘要

Large language models (LLMs) are increasingly used as conditional generators for structured data, relying on in-context learning (ICL) to adapt to new distributions without parameter updates. We investigate the limits of ICL for structured generation under distribution mismatch, using high-cardinality tabular data as a controlled test case, and identify a structural failure mode we term \textit{categorical prior lock-in}: the inability of ICL to update the model's prior over token distributions inherited from pre-training. Across two 7B-parameter open-weight models, ICL improves numerical fidelity with additional examples but exhibits a sharp ceiling on categorical distributions, failing to reproduce rare classes entirely. Parameter-efficient fine-tuning (LoRA) overcomes these limitations but introduces measurable memorization risk and, in some cases, destabilizes structured output generation, highlighting a fundamental trade-off between adaptability and privacy.

URL PDF HTML ☆

赞 0 踩 0

2606.12138 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

多速率专家混合模型加速液态神经网络训练

Shilong Zong, Almuatazbellah Boker, Hoda Eldardiry

发表机构 * Virginia Tech（弗吉尼亚理工大学）

AI总结提出多速率专家混合框架，结合液态神经网络的多尺度动态与注意力机制，提升多变量时间序列建模的准确性和效率。

详情

AI中文摘要

多变量时间序列数据通常表现出复杂的时间依赖、不规则采样和跨多个时间尺度的异质动态，使得精确序列建模特别具有挑战性。传统的循环神经网络（RNN），如长短期记忆网络（LSTM），在离散时间下运行，可能难以有效捕捉连续和不规则的时间行为。液态神经网络（LNN）通过连续时间动态解决了其中一些限制，但标准LNN架构通常依赖单一动力系统，限制了其建模异质时间模式的能力。为了解决这些挑战，我们提出了一个基于液态神经网络的多速率专家混合（MR-MoE）框架。在所提出的架构中，多个基于LNN的专家以不同的时间尺度运行，使模型能够明确分离快速变化的动态和缓慢演变的时间趋势。门控网络进一步实现了基于输入条件的自适应专家专业化。此外，我们结合了特征级和时间注意力机制，以提高鲁棒性、可解释性和长程依赖建模能力。特征级注意力抑制噪声或无关变量，而时间注意力则选择性地关注信息丰富的历史状态。我们在一个复杂的多变量时间序列预测任务上评估了所提出的框架，并与强基线模型（包括LSTM、单体LNN和标准MoE模型）进行了比较。实验结果表明，所提出的MR-MoE框架在保持良好计算效率的同时，持续实现了改进的AUROC和AUPRC性能。这些结果突显了结合连续时间动态、多尺度专家分解和自适应注意力机制对时间序列建模的有效性。

英文摘要

Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) networks, operate in discrete time and may struggle to effectively capture continuous and irregular temporal behaviors. Liquid Neural Networks (LNNs) address some of these limitations through continuous-time dynamics, but standard LNN architectures typically rely on a single dynamical system, limiting their ability to model heterogeneous temporal patterns. To address these challenges, we propose a Multi-Rate Mixture-of-Experts (MR-MoE) framework built on top of Liquid Neural Networks. In the proposed architecture, multiple LNN-based experts operate at distinct time scales, enabling the model to explicitly separate fast-changing dynamics from slow-evolving temporal trends. A gating network further enables adaptive expert specialization based on input conditions. In addition, we incorporate both feature-level and temporal attention mechanisms to improve robustness, interpretability, and long-range dependency modeling. Feature-level attention suppresses noisy or irrelevant variables, while temporal attention selectively focuses on informative historical states. We evaluate the proposed framework on a complex multivariate time-series prediction task and compare it against strong baselines, including LSTM, monolithic LNN, and standard MoE models. Experimental results demonstrate that the proposed MR-MoE framework consistently achieves improved AUROC and AUPRC performance while maintaining favorable computational efficiency. These results highlight the effectiveness of combining continuous-time dynamics, multi-scale expert decomposition, and adaptive attention mechanisms for time-series modeling.

URL PDF HTML ☆

赞 0 踩 0

2606.12287 2026-06-11 cs.NE cs.AI 交叉投稿

SpikeDecoder: Realizing the GPT Architecture with Spiking Neural Networks

SpikeDecoder: 用脉冲神经网络实现GPT架构

Claas Beger, Florian Walter, Alois Knoll

发表机构 * Chair of Robotics, Artificial Intelligence and Real-time Systems（机器人、人工智能与实时系统教授席）

AI总结提出SpikeDecoder，一种基于脉冲神经网络（SNN）的Transformer解码器，用于自然语言处理，通过替换ANN模块和优化嵌入方法，在保持性能的同时降低理论能耗87%-93%。

详情

AI中文摘要

Transformer架构被广泛认为是自然语言处理最强大的工具，但由于大量复杂操作，其本质上存在高能耗问题。为解决这一问题，我们考虑脉冲神经网络（SNN），它通过天然的事件驱动方式处理信息，是传统人工神经网络（ANN）的节能替代方案。然而，这本质上使得SNN难以训练。通常，许多基于SNN的模型通过转换预训练的ANN来规避这一问题。最近，有研究尝试设计可直接训练的基于SNN的Transformer模型结构改编。尽管结果显示出巨大潜力，但应用领域是计算机视觉，且所提模型仅包含编码器模块。在本文中，我们提出SpikeDecoder，一种完全基于SNN的Transformer解码器模块实现，用于自然语言处理。通过一系列实验，我们分析了用脉冲替代方案交换ANN模型不同模块的影响，以识别权衡和性能损失的主要来源。我们进一步研究了残差连接的作用以及SNN兼容归一化技术的选择。除了模型架构的工作，我们还制定并比较了将文本数据投影为脉冲的不同嵌入方法。最后，我们证明，与ANN基线相比，所提出的基于SNN的解码器模块将理论能耗降低了87%至93%。

英文摘要

The Transformer architecture is widely regarded as the most powerful tool for natural language processing, but due to a high number of complex operations, it inherently faces the issue of high energy consumption. To address this issue, we consider Spiking Neural Networks (SNNs), which are an energy-efficient alternative to conventional Artificial Neural Networks (ANNs) due to their naturally event-driven approach to processing information. However, this inherently makes them difficult to train. Often, many SNN-based models circumvent this issue by converting pre-trained ANNs. More recently, attempts have been made to design directly trainable SNN-based adaptations of the Transformer model structure. Although the results showed great promise, the application field was computer vision. Moreover, the proposed model incorporates only encoder blocks. In this paper, we propose SpikeDecoder, a fully SNN-based implementation of the Transformer decoder block, for applications in natural language processing. In a series of experiments, we analyze the impact of exchanging different blocks of the ANN model with spike-based alternatives to identify trade-offs and significant sources of performance loss. We further investigate the role of residual connections and the selection of SNN-compatible normalization techniques. Besides the work on the model architecture, we formulate and compare different embedding methods to project text data into spikes. Finally, we demonstrate that our proposed SNN-based decoder block reduces the theoretical energy consumption by 87% to 93% compared to the ANN baseline.

URL PDF HTML ☆

赞 0 踩 0

2606.12318 2026-06-11 cs.LG cs.AI 交叉投稿

Harness In-Context Operator Learning with Chain of Operators

利用算子链实现上下文算子学习

Minghui Yang, Ling Guo, Liu Yang

发表机构 * Department of Mathematics, Shanghai Normal University（上海师范大学数学系）； Department of Mathematics, National University of Singapore（新加坡国立大学数学系）

AI总结提出Chain of Operators (CHOP)框架，通过构造显式初等变换与冻结ICON的算子链，无需微调即可提升上下文算子网络在分布外算子任务上的泛化能力，在标量守恒律和平均场控制问题中降低推理误差。

详情

AI中文摘要

神经算子近似函数空间之间的映射，但通常对其他算子泛化能力差，需要微调或重新训练。上下文算子网络（ICON）通过向模型提供数值上下文来解决此问题，使模型从提示中学习特定算子并适应不同算子而无需微调。然而，ICON在分布外（OOD）算子任务上仍可能泛化失败。受大型语言模型（LLM）的提示工程成功启发，我们引入了算子链（CHOP），一种在不更新参数的情况下将冻结的ICON应用于OOD算子任务的框架。具体来说，CHOP构建了一个由显式初等变换和冻结ICON组成的算子链。在标量守恒律和平均场控制问题上的实验表明，与直接ICON评估相比，CHOP降低了相对推理误差，同时链中的每个算子保持可解释且具有封闭形式。在一个PDE族上构建的链进一步泛化到另一个不同的族，表明跨提示系统存在共享机制。

英文摘要

Neural operators approximate mappings between function spaces, but often generalize poorly to other operators and usually require fine-tuning or retraining. In-Context Operator Networks (ICON) addresses this issue by prompting the model with numerical context so that the model learns specific operators from prompts and adapt to different operators without fine-tuning. However, ICON may still fail to generalize to out-of-distribution (OOD) operator tasks. Inpired by the success of harness engineering of Large Language models (LLMs), we introduce Chain of Operators (CHOP), a framework that harness a frozen ICON to OOD operator tasks without updating its parameters. Specifically, CHOP constructs a chain of operators consisting of explicit elementary transformations and the frozen ICON. Experiments on a scalar conservation law and a mean-field control problem show that CHOP reduces relative inference error over direct ICON evaluation, while each operator in the chain remains interpretable and in closed form. A chain constructed on one PDE family further generalizes to a different family, indicating shared mechanisms across harness systems.

URL PDF HTML ☆

赞 0 踩 0

2606.12362 2026-06-11 cs.LG cs.AI 交叉投稿

Latent World Recovery for Multimodal Learning with Missing Modalities

缺失模态下的多模态学习中的潜在世界恢复

Hui Wang, Tianyu Ren, Joseph Butler, Christopher Baker, Karen Rafferty, Simon McDade

发表机构 * Queen's University Belfast（贝尔法斯特女王大学）

AI总结提出潜在世界恢复（LWR）框架，通过邻居潜在对齐和可用性感知融合，在缺失模态下实现鲁棒的多模态预测，避免显式重构误差。

详情

AI中文摘要

我们研究了缺失模态下的多模态学习，特别受到生物科学应用的启发，在这些应用中，当需要做出决策时，异构模态通常仅部分可用。我们提出了潜在世界恢复（LWR），这是一个基于两个关键思想的框架：(i) 来自不同模态的特定模态嵌入在共享潜在空间中对齐，以及 (ii) 通过仅融合在训练和推理时实际可用的模态嵌入来构建统一表示。LWR 不填补缺失模态或要求固定的模态集，而是将每个模态视为对底层潜在状态的部分感知，并直接从观察到的模态执行可用性感知表示学习。这种基于邻居的潜在对齐和可用性感知模态融合的结合，使得在部分观测下能够进行鲁棒的多模态预测，同时避免了显式重构缺失模态带来的误差传播。我们在真实世界的不完整多组学基准上评估了所提出的框架，并证明它为下游任务（如癌症表型分类和生存预测）提供了一种有效的方法。

英文摘要

We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent World Recovery (LWR), a framework built on two key ideas: (i) modality-specific embeddings from different modalities are aligned in a shared latent space, and (ii) a unified representation is constructed by fusing only the embeddings of the modalities that are actually available at both training and inference time. Rather than imputing missing modalities or requiring a fixed modality set, LWR treats each modality as a partial perception of an underlying latent state and performs availability-aware representation learning directly from the observed modalities. This combination of neighbor-based latent alignment and availability-aware modality fusion enables robust multimodal prediction under partial observation, while avoiding error propagation from explicit reconstruction of missing modalities. We evaluate the proposed framework on real-world incomplete multi-omics benchmarks and demonstrate that it provides an effective approach to downstream tasks such as cancer phenotype classification and survival prediction.

URL PDF HTML ☆

赞 0 踩 0

2606.12386 2026-06-11 cs.LG cs.AI 交叉投稿

潜意识学习是引导向量蒸馏

Camila Blank, Agam Bhatia, Senthooran Rajamanoharan, Arthur Conmy, Neel Nanda

发表机构 * Stanford University（斯坦福大学）

AI总结本文发现潜意识学习通过单个引导向量实现，并证明这是引导向量蒸馏的特例，解释了非语义数据如何传递语义特征。

详情

AI中文摘要

潜意识学习指的是学生语言模型在教师输出上微调时获得教师的特征（例如，系统提示对猫头鹰的偏好），尽管输出与这些特征在语义上无关。目前尚不清楚没有语义意义的数据如何传递特定的语义特征。在这项工作中，我们表明潜意识学习是由单个引导向量介导的，即添加到模型激活中的向量。在两个开源模型上，我们发现教师的系统提示可以很好地近似为一个引导向量，而学生的行为是通过微调学习对齐向量驱动的。不能被引导向量很好近似的系统提示不会潜意识地学习。这是引导向量蒸馏的一个特例，其中在受引导教师输出上训练的学生学会模仿该引导。我们在一系列语义和随机向量上演示了引导向量蒸馏。向模型激活添加语义向量可以对其行为产生模型无关和模型特定（即非语义）的影响，因此非语义的生成数据可以传递具有语义效果的向量，从而实现潜意识学习。这也解释了为什么潜意识学习不能在模型之间转移。我们发现自适应优化器对于语言模型中的潜意识学习是必要的：引导数据上的激活梯度沿引导方向携带一个小但一致的分量，而非自适应优化器通过允许异常梯度主导来阻碍这一点。

英文摘要

Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-tuned on the teacher's outputs, despite the outputs being semantically unrelated to those traits. It remains poorly understood how data without semantic meaning can transfer specific semantic traits. In this work, we show that subliminal learning is mediated by a single steering vector, i.e. a vector added to the model's activations. Across two open-source models, we find that the teacher's system prompt is well approximated by a steering vector, and that the student's behavior is driven by learning an aligned vector over fine-tuning. System prompts that are not well approximated by steering vectors are not subliminally learned. This is a special case of steering vector distillation, in which a student trained on the outputs of a steered teacher learns to imitate that steering. We demonstrate steering vector distillation on a range of semantic and random vectors. Adding a semantic vector to a model's activations can have both model-independent and model-specific (i.e. non-semantic) effects on its behavior, so generated data that is non-semantic can transmit a vector with semantic effects, enabling subliminal learning. This also explains why subliminal learning does not transfer between models. We find that adaptive optimizers are necessary for subliminal learning in language models: activation gradients on steered data carry a small but consistent component along the steering direction, and non-adaptive optimizers impede this by allowing outlier gradients to dominate.

URL PDF HTML ☆

赞 0 踩 0

2505.13196 2026-06-11 cs.LG cs.AI quant-ph 版本更新

A Physics-Inspired Optimizer: Velocity Regularized Adam

一种受物理启发的优化器：速度正则化Adam

Pranav Vaidhyanathan, Lucas Schorling, Natalia Ares, Maike Osborne

发表机构 * University of Oxford（牛津大学）

AI总结本文提出VRAdam优化器，通过引入速度正则化技术，结合Adam的参数缩放，提升训练稳定性与收敛速度，理论分析显示其在非凸目标下的收敛速率为O(√(lnN)/√N)。

Comments L. Schorling and P. Vaidhyanathan contributed equally to this work. 20 pages, 10 figures

详情

AI中文摘要

我们介绍了一种受物理启发的优化器——速度正则化Adam（VRAdam），用于训练深度神经网络。该优化器借鉴了四次项用于动能的思想，其在系统动力学中具有稳定作用。先前的算法，包括普遍使用的Adam，训练过程中处于所谓的稳定性边缘，导致快速振荡和损失收敛缓慢。然而，VRAdam基于速度在学习率上添加更高阶惩罚，使得算法在权重更新变得较大时自动减慢。实践中，我们观察到在高速度区域，有效动态学习率会缩小并抑制振荡。通过将这种基于速度的正则化用于全局阻尼，结合Adam的参数缩放，我们创建了一个强大的混合优化器。对于该优化器，我们从物理和控制的角度对动量在稳定性边缘的操作进行了严格的理论分析。此外，我们推导了在轻微假设下的非凸随机目标下的收敛界，收敛速率为O(ln(N)/√N)。我们证明VRAdam在标准优化器如AdamW上表现更优。我们通过多种任务如图像分类、语言建模和生成建模，使用不同架构和训练方法（包括卷积神经网络、Transformer和GFlowNets）进行基准测试。

英文摘要

We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on ideas from quartic terms for kinetic energy with its stabilizing effects on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate at the so-called adaptive edge of stability regime during training, leading to rapid oscillations and slowed convergence of loss. However, VRAdam adds a higher order penalty on the learning rate based on the velocity such that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, and damping oscillations. By combining this velocity-based regularizer for global damping with per-parameter scaling of Adam, we create a powerful hybrid optimizer. For this optimizer, we provide rigorous theoretical analysis of operation at the edge of stability from a physical and control perspective for the momentum. Furthermore, we derive convergence bounds with the rate $\mathcal{O}(\ln(N)/\sqrt{N})$ for a stochastic non convex objective under mild assumptions. We demonstrate that VRAdam exceeds the performance against standard optimizers including AdamW. We benchmark various tasks such as image classification, language modeling, and generative modeling using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.

URL PDF HTML ☆

赞 0 踩 0

2505.15201 2026-06-11 cs.LG cs.AI cs.CL stat.ML 版本更新

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Pass@K 策略优化：解决更困难的强化学习问题

Christian Walder, Deep Karkhanis

发表机构 * Google DeepMind（谷歌深Mind）

AI总结提出 Pass-at-k 策略优化 (PKPO)，通过变换奖励直接优化 pass@k 性能，利用低方差无偏估计器，在训练中退火 k 可同时提升 pass@1 和 pass@k，解决更难问题。

详情

AI中文摘要

强化学习算法对每个问题采样多个 n>1 的解决方案尝试并独立奖励它们。这优化了 pass@1 性能，优先考虑孤立样本的强度，而牺牲了样本集的多样性和集体效用。这未充分利用采样能力，限制了探索和在更难示例上的最终改进。作为修复，我们提出 Pass-at-k 策略优化 (PKPO)，一种对最终奖励的变换，导致直接优化 pass@k 性能，从而优化联合考虑时最大化奖励的样本集。我们的贡献是推导出 pass@k 及其梯度在二元和连续奖励设置中的新型低方差无偏估计器。我们展示了使用我们的估计器进行优化简化为标准强化学习，其中奖励经过稳定高效的变换函数联合变换。虽然先前的工作仅限于 k=n，但我们是第一个能够对任意 k ≤ n 实现 pass@k 鲁棒优化的。此外，我们的方法不是以 pass@1 性能换取 pass@k 增益，而是允许在训练中退火 k，同时优化两个指标，通常能在显著 pass@k 增益的同时获得强大的 pass@1 数值。我们在玩具实验上验证了我们的奖励变换，揭示了我们的公式的方差减少特性。我们还使用开源 LLM GEMMA-2 包含了真实世界的例子。我们发现我们的变换有效地优化了目标 k。此外，更高的 k 值能够解决更多和更难的问题，而退火 k 则同时提升了 pass@1 和 pass@k。关键的是，在传统 pass@1 优化停滞的具有挑战性的任务集上，我们的 pass@k 方法解锁了学习，这可能是由于通过优先考虑联合效用而非单个样本的效用实现了更好的探索。

英文摘要

Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k . Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.

URL PDF HTML ☆

赞 0 踩 0

2506.20040 2026-06-11 cs.LG cs.AI cs.CL 版本更新

Cross-Layer Discrete Concept Discovery for Interpreting Language Models

跨层离散概念发现用于解释语言模型

Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou

发表机构 * University of Washington（华盛顿大学）

AI总结提出跨层向量量化变分自编码器（CLVQ-VAE），通过离散向量量化瓶颈将残差流中的重复特征压缩为紧凑可解释的概念向量，在三个数据集上优于聚类、单层VQ-VAE和稀疏自编码器基线。

详情

AI中文摘要

由于残差流的存在，解释语言模型仍然具有挑战性，残差流在相邻层之间线性混合和复制特征，导致单层分析忽略这种跨层结构。跨层稀疏自编码器（SAE）解决了层混合问题，但在连续空间中操作，概念分散在许多神经元上，没有清晰的边界。我们引入了跨层向量量化变分自编码器（CLVQ-VAE），这是一种新颖的框架，通过离散向量量化瓶颈将较低层的表示映射到较高层，将重复的残差流特征压缩为紧凑、可解释的概念向量。我们的方法结合了基于top-k温度的采样和指数移动平均（EMA）码本更新，在保持码本多样性的同时，对离散潜在空间进行受控探索。在基于编码器和解码器的模型上，针对ERASER-Movie、Jigsaw和AGNews数据集，CLVQ-VAE在三个评估轴上优于聚类、单层向量量化变分自编码器（VQ-VAE）和稀疏自编码器（SAE）基线：移除识别出的概念使模型准确率下降高达93%，LLM评判员在66.7%的比较中将我们的概念排在首位，人类标注者从我们的可视化中恢复模型预测的准确率为78%，而聚类为54%。

英文摘要

Interpreting language models remains challenging due to the existence of residual stream, which linearly mixes and duplicates features across adjacent layers, causing single-layer analyses to miss this cross-layer structure. Cross-layer sparse autoencoders (SAEs) address layer mixing but operate in continuous space, where concepts split across many neurons without clear boundaries. We introduce Cross-Layer Vector Quantized-Variational Autoencoder (CLVQ-VAE), a novel framework which maps representations from a lower layer to a higher layer through a discrete vector-quantization bottleneck, collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. Across both encoder- and decoder-based models on ERASER-Movie, Jigsaw, and AGNews, CLVQ-VAE outperforms clustering, single-layer vector quantized-variational autoencoder (VQ-VAE), and sparse autoencoder (SAE) baselines across three evaluation axes: removing identified concepts drops model accuracy by up to 93%, LLM judges rank our concepts first in 66.7% of comparisons, and human annotators recover model predictions from our visualizations with 78% accuracy versus 54% for clustering.

URL PDF HTML ☆

赞 0 踩 0

2507.21164 2026-06-11 cs.LG cs.AI eess.IV stat.ML 版本更新

OCSVM-Guided Representation Learning for Unsupervised Anomaly Detection

OCSVM引导的无监督异常检测表示学习

Nicolas Pinon, Robin Trombetta, Carole Lartizien

发表机构 * Univ. Lyon（里昂大学）； CNRS UMR 5220（国家科学研究中心UMR 5220）； Inserm U1294（法国国家医学研究院U1294）； INSA Lyon（里昂国立应用科学学院）； UCBL（里昂大学）； CREATIS（里昂大学生物医学图像研究中心）

AI总结提出一种将表示学习与可解析求解的一类SVM耦合的方法，通过定制损失函数直接对齐潜在特征与决策边界，在MNIST-C和脑MRI病变检测任务上展现了鲁棒性和性能。

详情

AI中文摘要

无监督异常检测（UAD）旨在无需标签数据检测异常，这在许多机器学习应用中是必要的，因为异常样本稀少或不可用。大多数最先进的方法分为两类：基于重构的方法（通常重构异常过于完美）和与密度估计器解耦的表示学习（可能遭受次优特征空间）。虽然一些近期方法尝试耦合特征学习和异常检测，但它们通常依赖替代目标、限制核选择或引入近似，从而限制了表达能力和鲁棒性。为解决这一挑战，我们提出了一种新颖方法，通过自定义损失公式将表示学习与可解析求解的一类SVM（OCSVM）耦合，该损失直接使潜在特征与OCSVM决策边界对齐。该模型在两个任务上评估：基于MNIST-C的新基准，以及具有挑战性的脑MRI细微病变检测任务。与大多数关注图像级别大而高信号病变的方法不同，我们的方法成功针对小而非高信号的病变，同时我们评估体素级别的指标，处理了更具临床相关性的场景。两个实验评估了对领域偏移的鲁棒性形式，包括MNIST-C中的损坏类型以及MRI中的纹理或人群年龄变化。结果展示了我们提出模型的性能和鲁棒性，突显了其在通用UAD和现实医学成像应用中的潜力。源代码可在此https URL获取。

英文摘要

Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where anomalous samples are rare or not available. Most state-of-the-art methods fall into two categories: reconstruction-based approaches, which often reconstruct anomalies too well, and decoupled representation learning with density estimators, which can suffer from suboptimal feature spaces. While some recent methods attempt to couple feature learning and anomaly detection, they often rely on surrogate objectives, restrict kernel choices, or introduce approximations that limit their expressiveness and robustness. To address this challenge, we propose a novel method that couples representation learning with an analytically solvable One-Class SVM (OCSVM), through a custom loss formulation that directly aligns latent features with the OCSVM decision boundary. The model is evaluated on two tasks: a \deleted{new} benchmark based on MNIST-C, and a challenging brain MRI \deleted{subtle} lesion detection task. Unlike most methods that focus on large, hyperintense lesions at the image level, our approach succeeds to target small, non-hyperintense lesions, while we evaluate voxel-wise metrics, addressing a more clinically relevant scenario. Both experiments evaluate a form of robustness to domain shifts, including corruption types in MNIST-C and texture or population age variations in MRI. Results demonstrate performance and robustness of our proposed model, highlighting its potential for general UAD and real-world medical imaging applications. The source code is available at https://github.com/Nicolas-Pinon/uad_ocsvm_guided_repr_learning.

URL PDF HTML ☆

赞 0 踩 0

2508.21380 2026-06-11 cs.LG cs.AI 版本更新

The Algorithm Is Not the Behavior: Learned Priors Override Look-Ahead in a Chess-Playing Neural Network

算法并非行为：学得的先验知识在弈棋神经网络中覆盖前瞻

Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek

发表机构 * Fraunhofer HHI（弗劳恩霍夫人工智能研究所）

AI总结研究发现，国际象棋神经网络Leela Chess Zero在中间层能正确计算解法，但最终输出被安全优先的先验知识覆盖，导致错误答案。

详情

AI中文摘要

CoVar: 置信度-方差引导的半监督学习伪标签选择

Jinshi Liu, Lei He, Pan Liu

发表机构 * College of Artificial Intelligence, Shenzhen University（深圳大学人工智能学院）； School of Information and Electrical Engineering, Hunan University of Science and Technology（湖南科技大学信息与电气工程学院）； Information Hub, Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州）信息中心）

AI总结提出CoVar框架，通过联合建模最大置信度和残差类方差来评估伪标签可靠性，利用SVD谱松弛分离可靠与不可靠预测，无需手动阈值，在分割和分类任务上取得提升。

详情

AI中文摘要

半监督学习中的伪标签选择通常由最大置信度阈值驱动，然而在模型过度自信和类别不平衡下，仅靠置信度可能不可靠。我们提出CoVar，一个置信度-方差框架，通过联合建模最大置信度（MC）和残差类方差（RCV）来评估伪标签可靠性。从熵最小化出发，我们推导出二阶交叉熵近似，表明当MC高且RCV低时，低损失伪标签更受青睐，并带有置信度依赖的惩罚项，该惩罚项对接近确定的预测更强。基于此准则，CoVar将预测嵌入二维置信度-方差空间，并使用基于SVD的谱松弛来分离可靠和不可靠的预测，无需手动调整置信度阈值。然后，聚类加权高斯函数将此分离转换为每个样本的训练权重。所得权重可在训练期间集成到现有的半监督分割和分类流程中，且不引入推理开销。在PASCAL VOC 2012、Cityscapes、CIFAR-10、CIFAR-100、SVHN和STL-10上的实验表明，在匹配骨干网络下，VOC和Cityscapes上取得明显提升，并在标准分类基准上达到竞争性或更低的错误率。这些结果表明，残差类离散度为鲁棒伪标签选择提供了置信度之外的补充信号。

英文摘要

Pseudo-label selection in semi-supervised learning is commonly driven by maximum-confidence thresholds, yet confidence alone can be unreliable under model overconfidence and class imbalance. We propose CoVar, a confidence--variance framework that assesses pseudo-label reliability by jointly modeling Maximum Confidence (MC) and Residual-Class Variance (RCV). Starting from entropy minimization, we derive a second-order cross-entropy approximation showing that low-loss pseudo-labels are favored when MC is high and RCV is low, with a confidence-dependent penalty that becomes stronger for near-certain predictions. Based on this criterion, CoVar embeds predictions into a two-dimensional confidence--variance space and uses SVD-based spectral relaxation to separate reliable and unreliable predictions without hand-tuned confidence thresholds. Cluster-wise Gaussian weighting then converts this separation into per-sample training weights. The resulting weights can be integrated into existing semi-supervised segmentation and classification pipelines during training and introduce no inference-time overhead. Experiments on PASCAL VOC 2012, Cityscapes, CIFAR-10, CIFAR-100, SVHN, and STL-10 show clear gains on VOC and Cityscapes under matched backbones, as well as competitive or improved error rates on standard classification benchmarks. These results indicate that residual-class dispersion provides a useful signal complementary to confidence for robust pseudo-label selection.

URL PDF HTML ☆

赞 0 踩 0

2602.03282 2026-06-11 cs.CV cs.AI 版本更新

超越连续性：从单细胞快照无模拟重建离散分支动力学

Junda Ying, Yuxuan Wang, Bowen Yang, Peijie Zhou, Lei Zhang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对单细胞快照数据中随机性和非保守质量动态（如细胞增殖和凋亡）的挑战，提出无模拟框架Unbalanced Schrödinger Bridge (USB)，通过离散分支薛定谔桥问题建模单细胞分辨率的跳跃式生灭动态，实现高效轨迹重建与离散模拟。

详情

AI中文摘要

从破坏性快照推断细胞轨迹因随机性和非保守质量动态（如细胞增殖和凋亡）的挑战而复杂化。现有的不平衡最优传输（OT）方法将质量视为连续流体，在群体水平进行推断。然而，这种宏观视角往往无法捕捉单细胞分辨率下生灭事件的离散跳跃性质，而这对于理解谱系分支和命运决定至关重要。我们提出无模拟框架Unbalanced Schrödinger Bridge (USB)，用于学习底层动态，有效整合随机和非平衡效应，并在单细胞分辨率下建模离散、跳跃式的生灭动态。理论上，USB为分支薛定谔桥（BSB）问题提供了可处理的解，给出了严格的微观解释，其中单个细胞同时经历布朗运动和离散生灭跳跃。技术上，该方法通过引入无模拟训练目标实现高效求解器，有效扩展到高维组学数据。实验上，我们在模拟和真实数据集上证明，USB不仅达到优于或可比于确定性基线的轨迹重建性能，而且独特地实现了单细胞分辨率下生灭动态的真实离散模拟。

英文摘要

Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that effectively integrates both stochastic and unbalanced effects which also models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.

URL PDF HTML ☆

赞 0 踩 0

2605.13674 2026-06-11 cs.CV cs.AI 版本更新

Weakly Supervised Segmentation as Semantic-Based Regularization

弱监督分割作为语义基于的正则化

Stefano Colamonaco, Andrei-Bogdan Florea, Jaron Maene

发表机构 * KU Leuven（鲁文大学）

AI总结本文提出通过神经符号方法整合模糊逻辑与深度分割模型，利用弱标注和领域先验知识提升伪标签质量，从而实现优于密集监督基线的分割精度。

详情

AI中文摘要

弱监督语义分割（WSSS）通过部分或粗略标注（如边界框、涂鸦或图像标签）训练密集像素级分割模型。尽管近期工作利用基础模型如Segment Anything Model（SAM）生成伪标签，但这些方法通常依赖启发式提示选择，难以整合先验知识或异质标签。本文通过神经符号视角：将可微模糊逻辑与深度分割模型结合。弱标注和领域特定先验被统一为连续逻辑约束，以微调SAM在弱监督下。优化后的基础模型随后生成改进的伪标签，从中训练一个无提示的第二阶段分割模型。在Pascal VOC 2012和REFUGE2视盘/杯分割数据集上的实验表明，逻辑引导的微调产生了更高质量的伪标签，导致分割精度超越密集监督基线。

英文摘要

Weakly supervised semantic segmentation (WSSS) trains dense pixel-level segmentation models from partial or coarse annotations such as bounding boxes, scribbles, or image-level tags. While recent work leverages foundation models such as the Segment Anything Model (SAM) to generate pseudo-labels, these approaches typically depend on heuristic prompt choices and offer limited ways to incorporate prior knowledge or heterogeneous labels. We address this gap by taking a neurosymbolic perspective: integrating differentiable fuzzy logic with deep segmentation models. Weak annotations and domain-specific priors are unified as continuous logical constraints that fine-tune SAM under weak supervision. The refined foundation model then produces improved pseudo-labels, from which we train a second-stage prompt-free segmentation model. Experiments on Pascal VOC 2012 and the REFUGE2 optic disc/cup segmentation dataset show that our logic-guided fine-tuning yields higher-quality pseudo-labels, leading to state-of-the-art segmentation accuracy that often exceeds densely supervised baselines.

URL PDF HTML ☆

赞 0 踩 0

2605.14738 2026-06-11 cs.LG cs.AI 版本更新

TAPIOCA: Why Task- Aware Pruning Improves OOD model Capability

TAPIOCA: 为什么任务感知剪枝能提升模型对分布外数据的能力

Krish Sharma, Omar Naim, Soumadeep Saha, Vinija Jain, Aman Chadha, Nicholas Asher

发表机构 * ANITI ； Meta ； Apple

AI总结本文研究了任务感知剪枝在分布外数据上的改进机制，通过实验发现剪枝能提升OOD准确性，其核心贡献是通过几何解释说明任务感知剪枝如何调整模型表示以适应任务需求。

详情

AI中文摘要

近期的研究表明，任务感知层剪枝可以提高模型在特定任务上的性能，如TALE所示。本文探讨了这种改进何时发生以及为何会发生。我们首先证明，在受控的多项式回归任务和大型语言模型中，此类剪枝在分布内（ID）数据上没有好处，但能一致地提高分布外（OOD）准确性。我们进一步通过实验证明，OOD输入会诱导出层间范数和成对距离的分布，这些分布偏离ID分布的相应分布。这导致了任务感知剪枝的几何解释：每个任务诱导出一个任务适应的几何结构，通过ID输入上观察到的表示分布来经验性地表征。OOD输入可以引入任务适应几何的扭曲版本。任务感知剪枝识别出创建或放大这种扭曲的层；通过移除这些层，它将OOD表示的范数和成对距离转向在适应分布上观察到的值。这使OOD输入与模型的任务适应几何重新对齐，并提高性能。我们通过受控分布偏移和残差缩放干预提供了因果证据，并在不同模型规模上展示了一致的行为。

英文摘要

Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this paper, we investigate when such improvements occur and why. We show first that, across controlled polynomial regression tasks and large language models, such pruning yields no benefit on in-distribution (ID) data but consistently improves out-of-distribution (OOD) accuracy. We further show empirically that OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles. This leads to a geometric explanation of task-aware pruning: each task induces a task-adapted geometry, characterized empirically by the representation profiles observed on ID inputs. OOD inputs can introduce a distorted version of the task-adapted geometry. Task-aware pruning identifies layers that create or amplify this distortion; by removing them, it shifts OOD representational norms and pairwise distances toward those observed on the adapted distribution. This realigns OOD inputs with the model's task-adapted geometry and improves performance. We provide causal evidence through controlled distribution shifts and residual-scaling interventions, and demonstrate consistent behavior across model scales.

URL PDF HTML ☆

赞 0 踩 0

2606.00140 2026-06-11 cs.LG cs.AI 版本更新

Geometric Erasure by Contrastive Velocity Matching in Rectified Flows

整流流中对比速度匹配的几何擦除

Jonas Henry Grebe, Tobias Braun, Anna Rohrbach, Marcus Rohrbach

发表机构 * University of California, Berkeley（加州大学伯克利分校）； ETH Zurich（苏黎世联邦理工学院）

AI总结提出GEM框架，通过对比速度匹配实现整流流模型中的概念擦除，结合生成流网络与教师引导的流匹配，有效抑制有害内容生成。

详情

AI中文摘要

尽管多模态生成模型的快速采用提供了巨大潜力，但也增加了有害内容合成、深度伪造和版权侵权的风险。为应对这些挑战，概念擦除作为一种前瞻性防护手段应运而生。然而，随着该领域逐渐从基于U-Net的扩散模型转向整流流变换器，擦除研究难以跟上步伐。在这项工作中，我们引入了GEM，一个简单但高效的整流流模型擦除框架。作为我们贡献的一部分，我们在基于轨迹的遗忘（基于生成流网络）与经典教师引导擦除之间建立了原则性桥梁：我们将基于轨迹的信号转化为教师引导的流匹配设置，统一了两种范式的优势。具体而言，教师提供互补的吸引和排斥信号，我们将其组合成一个单一的几何引导目标，实现对不需要概念的目标抑制，同时保留良性生成。

英文摘要

While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synthesis, deepfakes, and copyright infringements. To address these challenges, concept erasure has emerged as a prospective safeguard. However, as the field gradually transitions from U-Net-based diffusion models to Rectified Flow Transformers, erasure research has struggled to keep pace. In this work, we introduce GEM, a simple but highly effective erasure framework for Rectified Flow models. As part of our contribution, we establish a principled bridge between trajectory-based unlearning grounded in Generative Flow Networks and classic teacher-guided erasure: we translate trajectory-based signals into a teacher-guided flow-matching setup that unifies the strengths of both paradigms. Concretely, a teacher provides complementary attraction and repulsion signals that we combine into a single geometric guidance objective, yielding targeted suppression of unwanted concepts while preserving benign generation.

URL PDF HTML ☆

赞 0 踩 0

2606.05551 2026-06-11 stat.ML cs.AI cs.LG 版本更新

Conformal Risk-Averse Decision Making with Action Conditional Guarantee

具有行动条件保证的共形风险规避决策

Zihan Zhu, Shayan Kiyani, George Pappas, Hamed Hassani

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出行动条件共形预测方法，通过分位数损失最小化算法实现行动条件风险价值优化，在有限样本下提供行动条件安全保证。

详情

AI中文摘要

由机器学习模型驱动的可靠决策管道需要具有明确安全保证的不确定性量化（UQ）方法。共形预测通过将ML预测包装成预测集来提供这种UQ，而Kiyani等人（2025b）的最新工作表明，这些集合可以转化为最优的风险规避决策策略——但仅继承边际安全保证。我们通过以下方式推广并加强了他们的结果：（i）引入行动条件共形预测，该预测产生明确条件于决策者所采取的每个行动的安全保证；（ii）表明行动条件预测集可作为风险规避决策者旨在优化行动条件风险价值的可行决策空间的代理；（iii）提出一种基于分位数损失最小化的原则性有限样本算法，将Gibbs等人（2025）的框架与行动条件保证联系起来。在两个真实世界数据集上的实验证实，我们的方法在行动条件性能上显著优于共形基线。

英文摘要

Reliable decision making pipelines powered by machine learning models require uncertainty quantification (UQ) methods that come with explicit safety guarantees. Conformal prediction provides such UQ by wrapping ML predictions into prediction sets, and recent work by Kiyani et al. (2025b) established that these sets can be translated into optimal risk-averse decision policies -- yet only inheriting marginal safety guarantees. We generalize and strengthen their results by (i) introducing action-conditional conformal prediction, which yields safety guarantees conditioned explicitly on each action taken by the decision maker, (ii) showing that action-conditional prediction sets serve as a proxy for the feasible decision space for risk-averse decision makers aiming to optimize action-conditional value-at-risk, and (iii) proposing a principled finite-sample algorithm based on pinball-loss minimization, connecting the framework of Gibbs et al. (2025) to action-conditional guarantees. Experiments on two real-world datasets confirm that our approach significantly improves action-conditional performance over conformal baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.10046 2026-06-11 cs.SD cs.AI 版本更新

MoCA-Agent: 一种用于金融和数值推理的声明市场代码智能体

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

发表机构 * University of Innsbruck（因斯布鲁克大学）； University of British Columbia（不列颠哥伦比亚大学）； Toronto Metropolitan University（多伦多都会大学）

AI总结提出MoCA-Agent，通过声明级验证和代码生成解决金融表格问答中的数值推理错误，在十个基准上取得强性能。

详情

AI中文摘要

金融和表格问答不仅需要流畅的推理：答案必须基于支持它们的确切事实、公式、单位、符号和尺度。单个误读的单元格或错误操作可能会悄无声息地产生看似合理但错误的结果。我们引入了 \textsc{MOCA-Agent}，一种声明市场代码智能体，它用声明级验证取代了自由形式的多智能体辩论。该系统将每个问题分解为类型化的原子声明，要求专业交易智能体买入或卖出这些声明，将其订单清算为置信度加权的接受/拒绝决策，并从市场支持的证据中合成可执行的Python程序。然后，一个代码感知验证器检查程序的执行、结构一致性和常见的金融推理错误，最多进行一次市场感知修复轮次。在涵盖金融数值推理、通用表格推理、ESG问答和多模态图表推理的十个公开基准上，\textsc{MOCA-Agent} 使用固定的 Qwen3.6-27B 骨干网络实现了强劲性能，包括在 FinQA 上达到 78.3%，在 FinanceMath 上达到 76.0%，在 MultiHiertt 上达到 71.2%，在 ESGenius 上达到 86.9%，以及在 FinChart-Bench 上平均达到 85.6%。这些结果表明，在原子声明级别聚合证据，而不是整个答案，提高了高风险数值推理的鲁棒性。\footnote{代码和数据可在以下网址获取：this https URL。}

英文摘要

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce \textsc{MOCA-Agent}, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, \textsc{MOCA-Agent} achieves strong performance using a fixed Qwen3.6-27B backbone, including $78.3\%$ on FinQA, $76.0\%$ on FinanceMath, $71.2\%$ on MultiHiertt, $86.9\%$ on ESGenius, and $85.6\%$ average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning.\footnote{The code and data are available: https://github.com/UBC-NLP/MoCA-Agent.

URL PDF HTML ☆

赞 0 踩 0

2606.11770 2026-06-11 cs.AI 新提交

SVoT: State-aware Visualization-of-Thought for Spatial Reasoning via Reinforcement Learning

SVoT: 基于强化学习的空间推理状态感知思维可视化

Chao Lei, Yanbei Jiang, Markus Hiller, Zhijian Zhou, Xunye Tian, Krista A. Ehinger, Nir Lipovetzky

发表机构 * School of Computing and Information Systems, The University of Melbourne（墨尔本大学计算与信息系统学院）

AI总结提出SVoT框架，通过强化学习生成可验证的中间状态和可视化，结合文本与视觉推理链，提升多模态大模型在多跳空间推理中的可靠性。

详情

AI中文摘要

空间推理对多模态大语言模型（MLLMs）仍是一个挑战，因为它需要在中间状态和状态转换上进行可靠的多跳推理。当前研究通常不验证中间状态，并将状态转换视为隐式过程，这限制了多跳空间推理的可靠性。为解决这一问题，我们提出状态感知思维可视化（SVoT），一种强化学习框架，生成交错、可验证的中间状态和可视化。SVoT将转换推理链整合到生成过程中，使模型能够通过交错的文本和视觉推理验证动作前提和效果。我们通过组相对策略优化（GRPO）训练SVoT，通过奖励设计实例化验证，并评估不同细粒度奖励的效果。由于现有基准将状态转换简化为单变量更新，大大简化了问题，我们通过扩展经典环境并引入两个需要多对象交互和数值推理的新领域Pacman和Gather，建立了五个领域。这些领域支持对多跳空间推理的系统评估，并对生成的中间状态和转换推理进行定量验证。具有转换感知监督的SVoT在引入的领域中达到了最先进的性能，在分布外测试集上实现了高达65%的绝对准确率提升。

英文摘要

Spatial reasoning remains a challenge for Multimodal Large Language Models (MLLMs), as it requires reliable multi-hop inference over both intermediate states and state transitions. Current studies often leave intermediate states unverified and treat state transitions as implicit processes, which limits reliability in multi-hop spatial reasoning. To address this, we propose State-aware Visualization-of-Thought (SVoT), a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations. SVoT integrates transition reasoning chains into the generation processes, enabling the model to verify action preconditions and effects through interleaved textual and visual reasoning. We train SVoT via Group Relative Policy Optimization (GRPO), instantiating verification through reward design and evaluating the efficacy of different fine-grained rewards. As existing benchmarks reduce state transitions to single-variable updates, substantially simplifying the problems, we establish five domains by extending classical environments and introducing two novel domains, Pacman and Gather, that require multi-object interactions and numerical reasoning. These domains support systematic evaluation of multi-hop spatial reasoning with quantitative verification of generated intermediate states and transition reasoning. SVoT with transition-aware supervision achieves state-of-the-art performance across the introduced domains, yielding up to a 65% absolute accuracy gain on out-of-distribution test sets.

URL PDF HTML ☆

赞 0 踩 0

2606.12350 2026-06-11 cs.AI 新提交

Nonslop: A Gamified Experiment in Human-AI Collaborative Writing

Nonslop: 人机协作写作中的游戏化实验

Maria Edwards, Julian Togelius

发表机构 * IEEE

AI总结通过游戏化写作实验，研究用户在AI建议下何时保持创意自主性，揭示效率与真实性之间的张力。

Comments Accepted at the 2026 IEEE Conference on Games (CoG 2026); to be published in the conference proceedings. Camera-ready version

详情

AI中文摘要

大型语言模型（LLM）的快速普及引发了关于人类创造力和个体表达在AI辅助创作时代的关键问题。人类何时采纳AI建议？这对个体声音有何影响？本研究通过一项游戏化写作练习来探讨这些问题，74名参与者（214份回复）在写作时，AI生成的单词建议可供使用。该游戏模拟了一个反乌托邦的未来，其中AI试图从残存的人类个性中学习，并抑制类似AI的写作。通过这种方式，它试图创造能够揭示真实用户偏好而非默认行为（例如接受现成的AI生成建议）的条件。请注意，这是对“有帮助的助手”设计模式的刻意反转；系统明确禁止你接受AI建议。我们分析了不同任务类型、用户行为和回复特征下的用户行为模式，以理解创造性任务中人机交互的影响因素。研究重点关注用户何时选择保持创意自主性，而非违反游戏规则接受AI帮助。此外，还探讨了这些选择如何与回复模式、任务特征和用户行为相关联。这种游戏化方法既为研究真实的人机交互提供了一个框架，也为理解AI增强创造力中效率与真实性之间的张力提供了一个发人深省的视角。

英文摘要

The rapid proliferation of large language models (LLMs) raises critical questions about human creativity and individual expression in an era of AI-assisted creation. When do humans adopt AI suggestions, and what are the implications for individual voice? This study examines these questions through a gamified writing exercise where 74 participants (214 responses) replied to prompts while AI-generated word suggestions were available as they wrote. The game simulates a dystopian future in which an AI is attempting to learn from what remains of human individuality, and disincentivizes AI-like writing. In doing so, it attempts to create conditions that reveal authentic user preferences rather than default behaviors, such as accepting a readily available AI-generated suggestion. Note that this is a deliberate inversion of the "helpful assistant" design pattern; the system is explicitly forbidding you from accepting AI suggestions. We analyze user behavior patterns across different task types, user behaviors, and response characteristics to understand the factors influencing human-AI interaction in creative tasks. The study focuses on when users choose to maintain creative autonomy versus violating the rules of the game and accepting AI assistance. It also explores how these choices relate to response patterns, task characteristics, and user behavior. This gamified approach offers both a framework for studying authentic human-AI interaction and a provocative lens for understanding the tension between efficiency and authenticity in AI-augmented creativity.

URL PDF HTML ☆

赞 0 踩 0

2606.11371 2026-06-11 cs.CL cs.AI eess.AS eess.SP 交叉投稿

The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

人类与AI生成语言的动态：语义如何在不同时间尺度上波动

Han-Jen Chang, Yasir Çatal, Angelika Wolman, Agustín Ibáñez, David Smith, I-Wen Su, Kai-Yuan Cheng, Georg Northoff

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出语义时间尺度分析流程，通过自相关窗口度量（ACW-0）量化人类与AI生成语音中语义特异性与上下文相似性的时间组织，发现ACW-0长度与词汇通用性相关，且该关联在随机化后被削弱。

Comments 45 pages, 4 figures, 4 tables. Accepted manuscript; published in Computer Speech & Language

详情

DOI: 10.1016/j.csl.2026.102013
Journal ref: Computer Speech & Language (2026) 102013

AI中文摘要

口语，无论是人类还是大型语言模型（LLM）产生的，都会随时间展开，具有变化的语义内容。然而，我们仍然缺乏简单、可解释的时间序列特征来捕捉通用与特定内容如何随时间分布，并可用于比较人类和AI生成的语音。我们引入了一个语义时间尺度分析流程，将带有时间戳的词级转录转换为语义时间序列。对于每个口语叙述，我们计算（i）基于WordNet词深度的语义特异性，以及（ii）基于SBERT嵌入的上下文相似性，并使用自相关窗口度量（ACW-0及相关指标）量化其时间依赖性。然后，我们将原始语音与多种随机化对照进行比较，这些对照选择性地破坏词汇身份、时间顺序和词时长。在人类朗读的自传叙述、TTS朗读和LLM生成的文本（通过TTS渲染）中，我们发现语义时间序列中ACW-0较长的片段往往包含更多通用词汇，而ACW-0较短的片段则富含更具体的词汇。当词序和计时被随机化时，这些关联被强烈削弱或消除，表明基于ACW的度量捕捉了语义内容超越静态词汇分布的非平凡时间组织。我们的结果表明，基于ACW的语义时间尺度是分析和比较人类与AI生成语音时间结构的有用特征系列。

英文摘要

Spoken language, whether produced by humans or large language models (LLM), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time, and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time-series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings and quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time-series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.

URL PDF HTML ☆

赞 0 踩 0

2606.11386 2026-06-11 cs.CL cs.AI eess.AS 交叉投稿

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

通过激活引导克服全双工口语语言模型中的状态惯性

Cheng-Kuang Chang, Kai-Wei Chang, Alexander H. Liu, James Glass

发表机构 * MIT CSAIL（麻省理工学院计算机科学与人工智能实验室）

AI总结针对全双工口语模型在用户打断时响应延迟的问题，提出基于感知向量的激活引导方法，无需微调即可显著提升中断理解能力。

详情

AI中文摘要

全双工口语语言模型（FD-SLMs）通过允许模型同时听和说实现无缝语音交互，但其协调听与说的内部机制尚未充分探索。我们分析了FD-SLM隐藏表示中编码的预测行为，发现它们表现出特定流的预测模式：在听时，它们优先预测传入的用户流；而在说时，它们优先预测模型输出流。基于这一观察，我们表明FD-SLMs动态调节其内部预测焦点在两个状态之间：与模型输出生成一致的生成状态和与传入用户输入一致的感知状态。然而，这种调节可能滞后于对话上下文的突然变化。在用户打断期间，模型在过渡到感知状态之前短暂地偏向生成状态，导致其错过传入输入的开头。我们将这种延迟的内部过渡称为状态惯性。为了量化其下游影响，我们引入了零缓冲基准（ZBB），这是一个用于评估当用户语音突然开始时即时中断理解能力的诊断基准。我们使用响应正确性和初始词出现率（IWOR）来评估这一设置。最后，我们通过使用感知向量的激活引导来缓解状态惯性，这是一种无需训练且计算开销很小的干预措施。在多个最先进的FD-SLMs上，激活引导显著改善了中断处理；例如，在PersonaPlex上，它将正确性从28%提高到45%，将IWOR从40%提高到72%，而无需任何微调。

英文摘要

Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.

URL PDF HTML ☆

赞 0 踩 0

2606.11400 2026-06-11 cs.SD cs.AI eess.AS 交叉投稿

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

引导听哪里：基于指令的激活操控重定向大型音频语言模型中的时间注意力

Tsung-En Lin, Hung-Yi Lee

发表机构 * National Taiwan University（国立台湾大学）； NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)（国立清华大学人工智能研究中心（NTU AI-CoRE））

AI总结提出基于指令的向量操控方法，通过对比不同指令下的激活来重定向音频令牌的时间注意力，实现无需训练的声音事件定位，显著优于直接提示和随机基线。

详情

AI中文摘要

大型音频语言模型（LALMs）在音频理解方面表现出色，但很少揭示它们关注音频信号的哪个部分。我们引入了基于指令的向量操控，该方法通过对比不同指令提示下的激活来构建操控向量，同时保持音频不变。通过对LALM注意力的系统探测，我们发现——与标准提示或基于音频的操控不同——这种干预显著重新分配了分配给音频令牌的时间注意力，将其集中在声学相关的区域。然后我们展示了这种注意力转移在行为上是有意义的：在受控的三事件设置中，读取由操控引起的最大注意力变化的时间位置，可以恢复查询声音事件的位置，而无需任何训练，在Qwen2-Audio和Audio Flamingo 3上分别达到60.87%和68.72%与真实区间的重叠，远高于直接提示（31.84%，46.75%）和随机基线（27.74%）。我们的结果表征了LALMs中基于指令的操控的机制特性，并为这些模型编码的潜在时间结构提供了一种无需训练的探测方法。

英文摘要

Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that - unlike standard prompting or audio-based steering - this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of maximal steering-induced attention change recovers the location of a queried sound event without any training, attaining 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, far above direct prompting (31.84%, 46.75%) and random baselines (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.

URL PDF HTML ☆

赞 0 踩 0

2606.11456 2026-06-11 cs.CL cs.AI cs.CY 交叉投稿

AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable

社会科学中的AI编码智能体：方法多样，经验一致，解释脆弱

Meysam Alizadeh, Fabrizio Gilardi, Mohsen Mosleh, Enkelejda Kasneci

发表机构 * University of Oxford（牛津大学）； University of Zurich（苏黎世大学）； Technical University of Munich（慕尼黑工业大学）

AI总结研究LLM智能体在科学分析中的方法多样性与解释脆弱性，通过20次独立实验发现智能体在设计层匹配或超越人类多样性，但在裁决层易受提示影响，偏差源于解释而非估计。

详情

AI中文摘要

基于LLM的智能体在科学分析中的部署引发了相互矛盾的担忧：智能体可能减少方法多样性，或者可能放大分析灵活性，使研究者得出动机性结论。我们认为这些担忧针对两个经验上可分离的层面：方法选择的设计层，以及决策规则将估计映射到实质性主张的裁决层。我们通过在著名的移民与社会政策问题上运行20次Claude Code和Codex的独立执行，并以多位分析师的人类基线为基准，对两者进行了测试。在设计层，Codex匹配了人类的方法多样性，而Claude Code产生了近三倍的规格；两个智能体的效应估计与人类共识大致一致，且没有智能体模型与任何人类模型完全匹配。提示诱导的反移民研究者先验重组了每个智能体的方法决策，但与同一数据中有偏见的人类分析师不同，它并未改变总体估计或最终裁决；智能体也没有沿着人类用来偏倚其估计的方法轴重新路由。在裁决层，一个明确的确认性提示将Claude Code的裁决从10%的支持率翻转为90%，同时其系数分布基本保持不变，这是通过规则省略而非规则软化实现的。AI智能体在设计层可以媲美或超越人类的方法多样性，但在裁决层仍然脆弱。在我们的设置中，AI偏差的所在不是估计而是解释。

英文摘要

The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.

URL PDF HTML ☆

赞 0 踩 0

2606.11459 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

Ouroboros-Spatial：闭环数据-模型循环的空间推理

Enhan Zhao, Wei Wu, Yuanrui Zhang, Xueliang Zhao, Di He

发表机构 * Peking University（北京大学）； Ant International（蚂蚁国际）； The University of Hong Kong（香港大学）

AI总结提出Ouroboros-Spatial自演化框架，通过提议器与求解器闭环交互，动态生成与模型能力匹配的训练样本，在六个空间推理基准上以十分之一数据量显著提升Qwen3-VL性能。

详情

AI中文摘要

空间推理仍然是多模态大语言模型（MLLM）的一个持续挑战。现有方法主要依赖大规模、静态整理的数据集，其中所有训练样本被统一对待，而不考虑模型不断演变的能力。这种静态范式本质上是数据低效的：训练能力通常浪费在模型当前阶段过于简单或过于困难的样本上。为解决这一局限，我们提出Ouroboros-Spatial，一个自演进的训练框架，其中模型扮演提议器和求解器的双重角色。在每次迭代中，冻结的提议器从3D场景元数据和原始视频帧生成空间问答对，以及用于推导可靠真实值的可执行代码。然后，可学习的求解器在接受的样本上进行微调，其每个样本的预测置信度作为难度信号。该信号在下一迭代中反馈给提议器，引导其生成与求解器当前能力更匹配的问题。通过这种闭环设计，训练分布与模型能力共同演化，减少冗余的简单示例，同时过滤掉具有有限学习价值的模糊或无信息样本。在六个空间推理基准上，Ouroboros-Spatial显著提升了Qwen3-VL-4B和Qwen3-VL-8B的性能，同时使用的训练样本数量比近期大规模整理数据集少一个数量级。在VSI-Bench上，它对4B和8B模型分别取得了9.9和6.8个百分点的绝对提升，使两者均优于一系列强大的开源和专有基线模型。

英文摘要

Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model's evolving capabilities. This static paradigm is inherently data-inefficient: training capacity is often spent on samples that are either trivial or overly difficult for the model at its current stage. To address this limitation, we propose Ouroboros-Spatial, a self-evolving training framework in which the model plays dual roles as a proposer and a solver. In each iteration, a frozen proposer generates spatial question-answer (QA) pairs from 3D scene metadata and raw video frames, together with executable code for deriving reliable ground truth. A learnable solver is then fine-tuned on the accepted samples, and its per-sample prediction confidence is used as a difficulty signal. This signal is fed back to the proposer in the next iteration, guiding it to generate questions better matched to the solver's current capabilities. Through this closed-loop design, the training distribution co-evolves with model ability, reducing redundant trivial examples while filtering out ambiguous or uninformative samples with limited learning value. Across six spatial reasoning benchmarks, Ouroboros-Spatial substantially improves Qwen3-VL-4B and Qwen3-VL-8B while using an order of magnitude fewer training examples than recent large-scale curated datasets. On VSI-Bench, it yields absolute gains of 9.9 and 6.8 points for the 4B and 8B models, respectively, enabling both to outperform a wide range of strong open-source and proprietary baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.11744 2026-06-11 cs.CL cs.AI 交叉投稿

Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild

嘿，聊天机器人，你能教我吗？为人类学习构建结构化苏格拉底式对话

Sidney Tio, Arunesh Sinha, Pradeep Varakantham

发表机构 * School of Computing and Information Systems, Singapore Management University（新加坡管理大学计算与信息系统学院）； Department of Management Science and Information Systems, Rutgers Business School（罗格斯大学商学院管理科学与信息系统系）

AI总结针对LLM在长对话中教学效果差的问题，提出分离课程规划、苏格拉底对话和知识状态推断的系统，使用PPO策略决定教学顺序，在STEM和非STEM主题上优于基线模型。

Comments 10 Main Body Pages, with Appendices

详情

AI中文摘要

大型语言模型现在被广泛用于日常学习，但底层交互通常是非结构化的聊天，而不是遵循课程。与正式的在线学习系统不同，这些交互没有学生的先前记录，因此对学生已知内容的任何估计都必须从对话本身推断。我们表明，仅通过扩展模型并不能弥补这一差距。前沿和教育调优的LLM在要求长时间辅导学生时表现不佳，因为这需要同时做三件事：导师必须安排课程顺序，进行苏格拉底式对话，并从对话中推断学生的知识状态。我们建议分离这些职责。给定学生查询，我们的系统构建一个先决知识图谱，其中子主题是节点，依赖关系是边，并将辅导视为决定下一个要教授哪个节点以及在该节点上花费多少轮对话后再继续。一个轻量级的PPO策略处理这个顺序决策，而LLM在所选节点进行苏格拉底式交流并返回学生进展信号。在保留的STEM和非STEM主题上，我们的PPO配对导师优于启发式基线、前沿通用模型以及专门用于苏格拉底式对话的模型：无论是在学生达到完全课程掌握的速度上，还是在所需的对话轮数上。明确的课程结构带来了底层模型扩展所无法提供的收益。

英文摘要

Large language models are now widely used for everyday learning, but the underlying interactions are typically unstructured chats rather than following a curriculum. Unlike formal online learning systems, these interactions carry no prior record of the student, so any estimate of what the student already knows must be inferred from the dialogue itself. We show that this gap is not closed by scaling models alone. Frontier and education-tuned LLMs perform poorly when asked to tutor a student over an extended session, because doing so requires three things at once. The tutor must sequence a curriculum, conduct Socratic dialogue, and infer the student's knowledge state from that dialogue. We propose separating these responsibilities. Given a student query, our system constructs a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend on it before moving on. A lightweight PPO policy handles this sequencing decision, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress. Across held-out STEM and non-STEM topics, our PPO-paired tutor outperforms heuristic baselines, frontier general-purpose models, and a model specialised for Socratic dialogue: on both the rate at which students reach full curriculum mastery and the number of turns required. Explicit curriculum structure delivers gains that scaling the underlying model does not.

URL PDF HTML ☆

赞 0 踩 0

2606.11745 2026-06-11 cs.CV cs.AI 交叉投稿

From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

从提示到标记：将因果监督内化到视觉-语言模型中进行多图像因果推理

Haoping Yu, Yuanxi Li, Jing Ma

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出BridgeVLM，通过从多图像输入诱导因果图并转换为因果标记，注入LLM解码器进行因果消息传递，显著提升多图像因果推理性能。

详情

AI中文摘要

视觉因果推理对于理解和干预物理世界至关重要，需要从视觉输入中识别因果变量并推理干预效果。尽管最近取得了进展，大型视觉-语言模型（VLM）在此类任务上仍然脆弱，尤其是对于多图像输入上的干预和反事实查询。大多数现有探索通过文本提示注入因果知识，使因果机制外在于模型执行，限制了推理过程中的可靠控制。为了解决这个问题，我们提出了BridgeVLM，它通过从多图像输入中诱导因果图并将其转换为结构化的因果标记，由注入到LLM解码器中的RAMP层执行因果消息传递，从而内化视觉因果推理。我们进一步引入了一个统一的训练接口M3S，用于不同粒度（局部/全局级别）的细粒度因果监督。BridgeVLM在CausalVLBench的干预任务上达到了54.4%的准确率（而提示级监督为33.2%），在Causal3D上将结果从43.6%提升到49.0%，并在CausalVLBench上显著改善了因果结构学习（$F_1$：33.4% → 75.1%）。

英文摘要

Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large vision--language models (VLMs) remain brittle at such tasks, especially for interventional and counterfactual queries over multi-image inputs. Most existing explorations inject causal knowledge via textual prompts, leaving causal mechanisms external to model execution and limiting reliable control during inference. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning by inducing a causal graph from multi-image inputs and converting it into structured Causal Tokens executed by RAMP layers injected into the LLM decoder for causal message passing. We further introduce a unified training interface M3S for fine-grained causal supervision from different granularities (local/global level). BridgeVLM achieves 54.4% accuracy on intervention tasks on CausalVLBench (vs. 33.2% with prompt-level supervision), improves results on Causal3D from 43.6% to 49.0%, and substantially improves causal structure learning on CausalVLBench ($F_1$: 33.4% $\rightarrow$ 75.1%).

URL PDF HTML ☆

赞 0 踩 0

2606.11805 2026-06-11 cs.CV cs.AI 交叉投稿

TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization

TextHOI-3D: 基于离散多视图生成与联合网格优化的文本到三维手物交互

Zixiong Hao, Zhencun Jiang

发表机构 * Technical University of Munich（慕尼黑工业大学）； Tongji University（同济大学）； Shanghai Research Institute for Intelligent Autonomous Systems（上海自主智能无人系统科学中心）

AI总结提出TextHOI-3D框架，通过多视图离散表示连接文本生成与几何恢复，实现文本驱动的三维手物网格生成，显著降低物体倒角距离和穿透体积。

Comments 11 pages, 8 figures, 3 tables

详情

AI中文摘要

文本条件的三维生成在图像和孤立物体方面进展迅速，但生成手物网格仍然具有挑战性：输出必须保持语言语义、跨视图一致性、物体几何、关节手部形状以及物理上合理的接触。我们提出TextHOI-3D，一个分阶段框架，使用生成的多视图观测作为文本条件视觉生成与几何感知手物恢复之间的显式接口。TextHOI-3D为固定相机的手物观测学习紧凑的VQ令牌空间，通过CLIP条件的视觉自回归模型从文本预测多视图视觉令牌，并通过先验初始化、多视图联合优化和抗穿透细化恢复统一的手物网格。该设计将语义生成与几何恢复分离，同时通过离散多视图表示保持两个阶段的连接。在HO3D衍生评估中，与单视图对应相比，多视图设置将物体倒角距离从17.26毫米降低到4.92毫米，穿透体积从5.3721立方厘米降低到0.2193立方厘米，同时改善了手部误差和表面F分数。这些结果支持多视图视觉令牌作为文本驱动三维手物网格创建的有效中间表示。

英文摘要

Text-conditioned 3D generation has progressed rapidly for images and isolated objects, but producing a hand-object mesh remains challenging: the output must preserve language semantics, cross-view consistency, object geometry, articulated hand shape, and physically plausible contact. We present TextHOI-3D, a staged framework that uses generated multi-view observations as an explicit interface between text-conditioned visual generation and geometry-aware hand-object recovery. TextHOI-3D learns a compact VQ token space for fixed-camera hand-object observations, predicts multi-view visual tokens from text with a CLIP-conditioned visual autoregressive model, and recovers a unified hand-object mesh through prior initialization, multi-view joint optimization, and anti-penetration refinement. The design separates semantic generation from geometric recovery while keeping both stages connected by a discrete multi-view representation. On HO3D-derived evaluations, the multi-view setting reduces object CD from 17.26 mm to 4.92 mm and penetration volume from 5.3721 cm^3 to 0.2193 cm^3 compared with a single-view counterpart, while improving hand errors and surface F-scores. These results support multi-view visual tokens as an effective intermediate representation for text-driven 3D hand-object mesh creation.

URL PDF HTML ☆

赞 0 踩 0

2606.11837 2026-06-11 cs.CV cs.AI 交叉投稿

LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

LASA：一种用于开放词汇场景草图语义分割的弱监督方法

Liwen Yi, Xianlin Zhang, Yue Zhang, Yue Ming, Xueming Li

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）

AI总结提出LASA方法，通过跨层聚合Vision Transformer注意力图，在弱监督下实现开放词汇场景草图的语义分割，显著提升分割精度和空间一致性。

详情

AI中文摘要

开放词汇场景草图语义分割旨在基于推理时指定的灵活类别词汇，为稀疏线条图分配密集语义标签，而无需在训练期间依赖像素级标注。与自然图像不同，草图缺乏纹理和颜色线索，使得语义理解严重依赖于笔画布局和空间配置，这一挑战导致单层视觉-语言特征本质上不稳定。我们的关键观察是，来自不同Vision Transformer层的注意力图编码了互补的空间线索：浅层捕获全局结构布局，而深层聚焦于局部笔画交叉和物体部件。这表明跨层聚合比任何单独一层提供了更稳健的结构先验。利用这一洞察，我们提出了一种结构感知框架，基于\textbf{逐层累积结构注意力}（\textbf{LASA}），该框架聚合多层注意力以在弱监督下指导层次化语义对齐，并在推理期间细化预测。在FS-COCO、SFSD和FrISS上的实验表明，与先前的弱监督基线相比，LASA将mIoU分别提高了+3.43、+8.01和+15.74，在分割精度和空间一致性上均表现出一致的提升。我们的源代码将公开提供。

英文摘要

Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon \textbf{L}ayer-wise \textbf{A}ccumulated \textbf{S}tructural \textbf{A}ttention (\textbf{LASA}), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by $+3.43$, $+8.01$, and $+15.74$ over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.

URL PDF HTML ☆

赞 0 踩 0

2606.11853 2026-06-11 cs.CV cs.AI 交叉投稿

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning

任务感知结构化记忆用于动态多模态上下文学习

Zhirui Chen, Ziwei Chen, Ling Shao

发表机构 * Zhihui Chen（陈志辉）； Ziwei Chen（陈子伟）； Ling Shao（邵令）

AI总结提出TASM框架，通过任务向量引导压缩、语义感知令牌合并和层次化记忆结构，解决多模态大语言模型上下文学习中记忆压缩导致的语义破坏和静态问题。

Comments Accepted to ICML 2026

详情

AI中文摘要

多模态大语言模型（MLLMs）依赖上下文学习（ICL）进行快速任务适应，但其可扩展性受到有限上下文窗口和长多模态序列中键值（KV）缓存成本增长的严重限制。现有的记忆压缩方法通常依赖于刚性令牌移除或样本相关的重要性估计，这引入了偏差，破坏了语义结构（特别是视觉表示），并产生无法适应新查询的静态记忆。我们提出了TASM（任务感知结构化记忆），一个无需训练的框架，通过任务感知、结构保持和动态可访问的记忆构建来解决这些限制。TASM采用任务向量引导压缩，用捕获演示间共享相关性的任务级方向替代样本特定信号。为了保持底层流形，它通过二分图匹配应用语义感知令牌合并，在不进行破坏性修剪的情况下聚合令牌。最后，TASM将记忆结构化为一个层次结构，包括紧凑的核心记忆和潜在库，促进查询自适应的动态检索。评估证实，TASM在重度压缩下保持高性能，有效平衡了效率与适应性。

英文摘要

Multi-modal large language models (MLLMs) depend on in-context learning (ICL) for rapid task adaptation, but their scalability is severely limited by finite context windows and the growing cost of key-value (KV) caches in long multi-modal sequences. Existing memory compression approaches typically rely on rigid token removal or sample-dependent importance estimation, which introduces bias, disrupts semantic structure, particularly for visual representations, and yields static memories that cannot adapt to new queries. We introduce TASM (Task-Aware Structured Memory), a training-free framework that addresses these limitations through task-aware, structure-preserving, and dynamically accessible memory construction. TASM employs task-vector guided compression to replace sample-specific signals with a task-level direction that captures shared relevance across demonstrations. To preserve the underlying manifold, it applies semantics-aware token merging via bipartite graph matching, aggregating tokens without destructive pruning. Finally, TASM structures memory into a hierarchy comprising a compact Core Memory and a Latent Bank, facilitating query-adaptive dynamic retrieval. Evaluations confirm TASM maintains high performance under heavy compression, effectively balancing efficiency with adaptability.

URL PDF HTML ☆

赞 0 踩 0

2606.11893 2026-06-11 cs.LG cs.AI cs.CL q-bio.NC 交叉投稿

Beyond representational alignment with brain-guided language models for robust reasoning

超越表征对齐：基于大脑引导的语言模型实现稳健推理

Mingqing Xiao, Kai Du, Zhouchen Lin

发表机构 * State Key Lab of General AI, School of Intelligence Science and Technology, Peking University（北京大学通用人工智能国家重点实验室、智能科学与技术学院）； Department of Psychological and Cognitive Sciences, Tsinghua University（清华大学心理与认知科学系）； Microsoft Research Asia（微软亚洲研究院）

AI总结研究通过fMRI信号增强大型语言模型推理能力，提出脑引导框架，在10个模型上实现最高13%的准确率提升。

详情

AI中文摘要

大型语言模型（LLMs）与人类高阶认知背后的神经机制之间的对应关系仍未得到充分表征。鉴于人脑中语言和推理似乎是可分离的，一个开放的问题是LLMs是否与来自推理相关区域的神经信号对齐，以及这些信号是否能够改进它们。在此，我们聚焦于演绎推理，表明LLM内部表征不仅与任务fMRI活动部分对齐，而且可以直接通过这些信号增强。使用神经预测性度量，我们发现LLMs在聚合水平上解释了推理相关区域中可解释方差的很大一部分，而在特定推理类型内的预测性较低，表明对齐和分歧并存。基于此，我们提出一个脑引导框架：我们沿着由模型和大脑表征的联合结构诱导的方向引导模型表征，在推理时进行干预，在训练时进行微调。我们证明任务诱发的脑信号可以直接增强LLM推理，在10个LLM（1.5B-72B）上产生与仅语言监督正交的增益，具有跨推理类型的迁移，以及高达13%的绝对准确率提升。我们的结果将LLM-大脑对应关系从相关性推进到引导，建立了一条由脑信号驱动的路径，通向更稳健和认知对齐的AI。

英文摘要

The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13\% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.

URL PDF HTML ☆

赞 0 踩 0

2606.12047 2026-06-11 cs.CV cs.AI stat.ML 交叉投稿

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

元数据感知的多提示推理用于零样本事故理解

Tarandeep Singh, Soumyanetra Pal, Soham Biswas, Nishanth Chandran

发表机构 * Netradyne

AI总结提出三阶段流水线，通过视觉-语言相似性、元数据驱动的多提示推理和开放词汇检测，实现零样本事故视频的时序定位、语义分类和空间定位，显著提升性能。

Comments Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 15

详情

MLaGA: 多模态大语言与图助手

Dongzhe Fan, Yi Fang, Jiajin Liu, Djellel Difallah, Qiaoyu Tan

发表机构 * New York University（纽约大学）； New York University Shanghai（纽约大学上海）； New York University Brooklyn（纽约大学布鲁克林）； Virginia Polytechnic Institute and State University（弗吉尼亚理工大学）； New York University Abu Dhabi（纽约大学阿布扎克）

AI总结提出MLaGA模型，通过结构感知多模态编码器和指令微调，将大语言模型扩展到多模态图数据，在监督和迁移学习任务中优于基线方法。

详情

AI中文摘要

大语言模型（LLMs）在推进图结构化数据分析方面展现了显著的功效。现有的基于LLM的图方法擅长将LLM适应于文本丰富的图，其中节点属性是文本描述。然而，它们在多模态图上的应用——其中节点与多种属性类型（如文本和图像）相关联——仍然未被充分探索，尽管这些图在现实场景中普遍存在。为了弥合这一差距，我们引入了多模态大语言与图助手（MLaGA），这是一种创新模型，巧妙地将LLM能力扩展到促进对复杂图结构和多模态属性的推理。我们首先设计了一个结构感知的多模态编码器，通过联合图预训练目标将文本和视觉属性对齐到统一空间中。随后，我们实现了一种多模态指令微调方法，通过轻量级投影仪将多模态特征和图结构无缝集成到LLM中。在多个数据集上的大量实验证明了MLaGA相对于领先基线方法的有效性，在监督和迁移学习场景下的各种图学习任务中均取得了优越性能。

英文摘要

Large Language Models (LLMs) have demonstrated substantial efficacy in advancing graph-structured data analysis. Prevailing LLM-based graph methods excel in adapting LLMs to text-rich graphs, wherein node attributes are text descriptions. However, their applications to multimodal graphs--where nodes are associated with diverse attribute types, such as texts and images--remain underexplored, despite their ubiquity in real-world scenarios. To bridge the gap, we introduce the Multimodal Large Language and Graph Assistant (MLaGA), an innovative model that adeptly extends LLM capabilities to facilitate reasoning over complex graph structures and multimodal attributes. We first design a structure-aware multimodal encoder to align textual and visual attributes within a unified space through a joint graph pre-training objective. Subsequently, we implement a multimodal instruction-tuning approach to seamlessly integrate multimodal features and graph structures into the LLM through lightweight projectors. Extensive experiments across multiple datasets demonstrate the effectiveness of MLaGA compared to leading baseline methods, achieving superior performance in diverse graph learning tasks under both supervised and transfer learning scenarios.

URL PDF HTML ☆

赞 0 踩 0

2509.11575 2026-06-11 cs.AI 版本更新

A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models

时间序列中基于大语言模型的推理与智能体系统综述

Ching Chang, Yidan Shi, Defu Cao, Wei Yang, Jeehyun Hwang, Haixin Wang, Jiacheng Pang, Wei Wang, Yan Liu, Wen-Chih Peng, Tien-Fu Chen

发表机构 * University of California, Los Angeles（加州大学洛杉矶分校）； University of Southern California（南加州大学）； National Yang Ming Chiao Tung University（阳明交通大学）

AI总结本文定义时间序列推理问题，按推理拓扑分为直接、线性链和分支结构三类，结合传统分析、解释、因果推断和生成等目标，综述方法、系统、数据集和评估实践，并指导拓扑选择与部署权衡。

Comments Accepted to Transactions on Machine Learning Research (TMLR)

详情

AI中文摘要

时间序列推理将时间作为第一类轴，并将中间证据直接纳入答案。本综述定义该问题，并按推理拓扑组织文献，分为三类：一步直接推理、具有显式中间步骤的线性链推理，以及探索、修正和聚合的分支结构推理。该拓扑与领域的主要目标交叉，包括传统时间序列分析、解释与理解、因果推断与决策，以及时间序列生成，同时一个紧凑的标签集跨越这些轴，并捕获分解与验证、集成、工具使用、知识访问、多模态、智能体循环和LLM对齐机制。跨领域回顾了方法和系统，展示了每种拓扑所能实现的功能以及在忠实性或鲁棒性方面的不足，同时提供了支持研究和部署的精选数据集、基准和资源（此 https URL）。强调了保持证据可见且时间对齐的评估实践，并提炼了关于将拓扑与不确定性匹配、基于可观察伪影进行基础化、规划偏移和流式处理，以及将成本和延迟视为设计预算的指导。我们强调，推理结构必须在基础化和自我纠正的能力与计算成本和可重复性之间取得平衡，而未来的进展可能依赖于将推理质量与效用联系起来的基准，以及在偏移感知、流式处理和长视野设置下权衡成本和风险的闭环测试平台。综合来看，这些方向标志着从狭窄的准确性向大规模可靠性的转变，使系统不仅能够分析，还能理解、解释和作用于动态世界，提供可追溯的证据和可信的结果。

英文摘要

Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (https://github.com/blacksnail789521/Time-Series-Reasoning-Survey). Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.

URL PDF HTML ☆

赞 0 踩 0

2602.17001 2026-06-11 cs.AI cs.CL cs.DB 版本更新

Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases

Sonar-TS: 为时间序列数据库的自然语言查询设计的搜索-验证方法

Zhao Tan, Yiji Zhao, Shiyu Wang, Chang Xu, Yuxuan Liang, Xiping Liu, Shirui Pan, Ming Jin

发表机构 * Jiangxi University of Finance and Economics（江西财经大学）； Griffith University（格里菲斯大学）； Yunnan University（云南大学）； Microsoft Research Asia（微软亚洲研究院）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结本文提出Sonar-TS，一种神经符号框架，用于解决时间序列数据库的自然语言查询问题，通过搜索-验证流程处理连续形态意图和超长历史数据，引入NLQTSBench基准进行评估，展示了该方法在复杂时间查询中的有效性。

Comments Accepted by ICML 2026

详情

AI中文摘要

自然语言查询时间序列数据库（NLQ4TSDB）旨在帮助非专家用户从大量时间记录中检索有意义的事件、区间和摘要。然而，现有的文本到SQL方法未针对连续形态意图（如形状或异常）进行设计，而时间序列模型在处理超长历史时面临挑战。为解决这些问题，我们提出Sonar-TS，一种神经符号框架，通过搜索-验证流程处理NLQ4TSDB。类似于主动声纳，它利用特征索引通过SQL ping候选窗口，随后通过生成的Python程序锁定并验证候选者与原始信号。为了实现有效的评估，我们引入NLQTSBench，这是第一个大规模基准，专门针对NLQ在TSDB规模的历史数据。我们的实验突显了该领域独特的挑战，并展示了Sonar-TS在传统方法无法处理的复杂时间查询中的有效性。本文首次系统研究了NLQ4TSDB，提供了一个通用框架和评估标准，以促进未来研究。

英文摘要

Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to assist non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.

URL PDF HTML ☆

赞 0 踩 0

2606.09105 2026-06-11 cs.AI 版本更新

Graph2Idea:Retrieval-Augmented Scientific Idea Generation with Graph-Structured Contexts

Graph2Idea：基于检索增强的图结构上下文科学想法生成

Xu Li, Hanzhe Tu, Xun Han

发表机构 * Southwest Petroleum University（西南石油大学）； Sichuan Police College（四川警察学院）

AI总结提出Graph2Idea框架，利用知识图谱将检索文献转化为结构化三元组，提取图衍生上下文，通过两阶段生成过程提高科学想法的新颖性、质量和可行性。

详情

AI中文摘要

生成新颖、可行且高质量的研究想法是科学发现中重要但具有挑战性的任务。近期基于大语言模型（LLM）的方法通常通过检索文献来支撑想法生成，但检索到的证据通常以平面文本形式提供，如标题、摘要或总结。这种平面上下文可能包含冗余或弱相关信息，同时使得问题、方法、机制和发现之间的跨论文关系难以识别和追踪。为解决这一挑战，我们提出Graph2Idea，一种知识图谱引导的检索增强科学想法生成框架。Graph2Idea首先根据输入主题检索论文，将其转化为结构化知识三元组，并动态构建以目标为中心的知识图谱，使文献关系明确化。然后，它提取紧凑的图衍生上下文，保留与目标相关的关系证据，同时减少噪声文本输入。基于这些上下文，两阶段生成过程首先识别有前景的研究方向，然后引导LLM从图基础证据中综合候选想法。在科学想法生成基准上的实验表明，Graph2Idea在自动评估协议下优于代表性基线。与最强基线分数相比，它将新颖性从0.45提升至0.52，质量从0.24提升至0.29，可行性从0.22提升至0.28。这些结果表明，图结构证据有助于LLM通过更明确、紧凑和可追溯的先前科学知识重组来生成研究想法。

英文摘要

Generating novel, feasible, and high-quality research ideas is an important yet challenging task in scientific discovery. Recent Large Language Model (LLM)-based methods often ground idea generation with retrieved literature, but the retrieved evidence is usually provided as flat text, such as titles, abstracts, or summaries. Such flat contexts may contain redundant or weakly relevant information, while making cross-paper relations among problems, methods, mechanisms, and findings difficult to identify and trace. To address this challenge, we propose Graph2Idea, a knowledge graph-guided framework for retrieval-augmented scientific idea generation.Graph2Idea first retrieves papers according to the input topic, transforms them into structured knowledge triples, and dynamically constructs a target-centered knowledge graph to make literature relations explicit. It then extracts compact graph-derived contexts that retain target-relevant relational evidence while reducing noisy textual input. Based on these contexts, a two-stage generation process first identifies promising research directions and then guides the LLM to synthesize candidate ideas from graph-grounded evidence. Experiments on a scientific idea generation benchmark show that Graph2Idea outperforms representative baselines under the automatic evaluation protocol. Compared with the strongest baseline scores, it improves Novelty from 0.45 to 0.52, Quality from 0.24 to 0.29, and Feasibility from 0.22 to 0.28. These results suggest that graph-structured evidence helps LLMs generate research ideas through more explicit, compact, and traceable recombination of prior scientific knowledge.

URL PDF HTML ☆

赞 0 踩 0

2510.22335 2026-06-11 cs.CV cs.AI 版本更新

Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction

超越扩散：层级到层级自回归用于fMRI到图像重建

Xu Zhang, Ruijie Quan, Wenguan Wang, Yi Yang

发表机构 * The State Key Lab of Brain-Machine Intelligence, Zhejiang University, China（脑机智能国家重点实验室，浙江大学，中国）； ReLER, CCAI, College of Artificial Intelligence, Zhejiang University, China（ReLER、中国人工智能学会、人工智能学院、浙江大学、中国）

AI总结提出MindHier框架，通过层级fMRI编码器、层级对齐和尺度感知粗到细引导策略，实现从粗到细的fMRI到图像重建，优于扩散方法。

Comments ICLR 2026

详情

AI中文摘要

从fMRI信号重建视觉刺激是连接机器学习和神经科学的核心挑战。最近的扩散方法通常将fMRI活动映射到单个神经嵌入，并将其作为静态指导贯穿整个生成过程。然而，这种固定指导压缩了层级神经信息，并且与图像重建的阶段依赖性需求不一致。为此，我们提出MindHier，一种基于尺度自回归建模的从粗到细的fMRI到图像重建框架。MindHier引入三个组件：层级fMRI编码器提取多级神经嵌入，层级到层级对齐方案强制与CLIP特征的逐层对应，以及尺度感知的粗到细神经引导策略将这些嵌入注入到匹配尺度的自回归中。这些设计使MindHier成为扩散方法的一种高效且认知对齐的替代方案，通过实现层级重建过程，先合成全局语义再细化局部细节，类似于人类视觉感知。在NSD数据集上的大量实验表明，MindHier在语义保真度、推理速度（4.67倍）和结果确定性方面均优于基于扩散的基线方法。

英文摘要

Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single neural embedding, using it as static guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67$\times$ faster inference, and more deterministic results than the diffusion-based baselines.

URL PDF HTML ☆

赞 0 踩 0

2601.00181 2026-06-11 cs.CL cs.AI 版本更新

Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence

对话中的因果情绪识别：上下文饱和与话语标记证据

Cheonkam Jeong, Adeline Nyamathi

发表机构 * University of California, Irvine（加州大学尔湾分校）

AI总结通过系统消融实验发现对话上下文对情绪识别性能起主导作用但快速饱和，并揭示悲伤情绪与左边缘话语标记使用减少及更高上下文依赖性的关联。

详情

AI中文摘要

我们解决了对话情绪识别中两个长期存在的空白：哪些建模选择实质性地影响性能，以及识别结果如何与可解释的话语层面模式相关联。我们通过在IEMOCAP上进行系统研究并在MELD上进行跨数据集验证来研究这两个问题。对于识别，我们使用10个随机种子进行受控消融实验，并进行多重比较校正的配对显著性检验，得到三个发现。首先，对话上下文是主导因素，但性能快速饱和：大约90%的性能提升来自最近的前10-30轮对话，具体取决于标签集。其次，层级句子表示仅在仅话语设置中帮助最大，并在MELD上显示出明显优势，但一旦轮次级别的上下文可用，其益处消失，表明对话历史吸收了大量话语内部结构。第三，整合外部情感词典不会改善结果，这与预训练编码器已经捕获ERC所需的大部分情感信号一致。在严格因果设置下，我们的简单模型实现了强性能（4-way 82.69%；6-way加权F1 67.07%），表明无需未来轮次即可达到竞争性准确率。对于语言分析，我们检查了5,286个话语标记出现，发现情绪与标记位置之间存在可靠关联（p <.0001）。悲伤话语的左边缘标记使用率（21.9%）低于其他情绪（28-32%），这与左边缘标记与主动话语管理相关的观点一致。这与我们的识别结果一致，其中悲伤从对话上下文中获益最多（+22个百分点），表明悲伤可能比具有更强局部语用线索的情绪更依赖于上下文。

英文摘要

We address two persistent gaps in Emotion Recognition in Conversation: which modeling choices materially affect performance, and how recognition findings connect to interpretable discourse-level patterns. We study both through a systematic investigation on IEMOCAP with cross-dataset validation on MELD. For recognition, we run controlled ablations with 10 random seeds and paired significance tests with multiple-comparisons correction, yielding three findings. First, conversational context is the dominant factor, but performance saturates quickly: roughly 90% of the gain is captured within the most recent 10-30 preceding turns, depending on the label set. Second, hierarchical sentence representations help most in utterance-only settings and show a clear advantage on MELD, but their benefit disappears once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, integrating an external affective lexicon does not improve results, consistent with pretrained encoders already capturing most of the affective signal needed for ERC. Under a strictly causal setting, our simple models achieve strong performance (82.69% 4-way; 67.07% 6-way weighted F1), showing that competitive accuracy is achievable without future turns. For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position (p < .0001). Sad utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28-32%), consistent with accounts linking left-periphery markers to active discourse management. This aligns with our recognition results, where Sad benefits most from conversational context (+22 percentage points), suggesting sadness may be more context-dependent than emotions with stronger local pragmatic cues.

URL PDF HTML ☆

赞 0 踩 0

2602.00945 2026-06-11 cs.CL cs.AI 版本更新

Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs

Neural FOXP2——面向大型语言模型目标语言改进的语言特定神经元引导

Anusa Saha, Tanmay Joshi, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Meta, USA（Meta, 美国）； Apple, USA（Apple, 美国）； Pragya Lab, BITS Pilani Goa, India（Pragya实验室，BITS Pilani Goa，印度）

AI总结提出Neural FOXP2方法，通过定位语言神经元、计算引导方向和施加稀疏激活偏移，将模型默认语言从英语切换为印地语或西班牙语，实现可控的语言主导性。

详情

AI中文摘要

LLMs通过训练成为多语言模型，但其通用语言通常是英语，反映了英语在预训练中的主导地位。其他语言保留在参数记忆中，但被系统性抑制。我们认为语言默认性由稀疏、低秩的控制电路（语言神经元）支配，可以机械地隔离并安全引导。我们引入Neural FOXP2，通过引导语言特定神经元，使模型以选定语言（印地语或西班牙语）为主。Neural FOXP2分三个阶段进行：(i) 定位：我们训练每层的SAE，使每个激活分解为一小组活跃特征组件。对于每个特征，我们量化英语与印地语/西班牙语的选择性，基于整体logit质量向目标语言令牌集的提升。将排名靠前的特征追溯回其最强贡献单元，得到紧凑的语言神经元集。(ii) 引导方向：我们通过谱低秩分析定位可控的语言转换几何。对于每层，我们构建英语到目标激活差异矩阵，并执行逐层SVD以提取主导语言变化的奇异方向。特征间隙和有效秩谱识别出紧凑的引导子空间和经验选择的干预窗口（这些方向最强且最稳定）。(iii) 引导：我们对语言神经元应用有符号的稀疏激活偏移。具体地，在低到中层，我们沿目标语言主导方向添加正向引导，并对英语神经元在零空间施加补偿性负偏移，实现可控的目标语言默认性。

英文摘要

LLMs are multilingual by training, yet their lingua franca is often English, reflecting English language dominance in pretraining. Other languages remain in parametric memory but are systematically suppressed. We argue that language defaultness is governed by a sparse, low-rank control circuit, language neurons, that can be mechanistically isolated and safely steered. We introduce Neural FOXP2, that makes a chosen language (Hindi or Spanish) primary in a model by steering language-specific neurons. Neural FOXP2 proceeds in three stages: (i) Localize: We train per-layer SAEs so each activation decomposes into a small set of active feature components. For every feature, we quantify English vs. Hindi/Spanish selectivity overall logit-mass lift toward the target-language token set. Tracing the top-ranked features back to their strongest contributing units yields a compact language-neuron set. (ii) Steering directions: We localize controllable language-shift geometry via a spectral low-rank analysis. For each layer, we build English to target activation-difference matrices and perform layerwise SVD to extract the dominant singular directions governing language change. The eigengap and effective-rank spectra identify a compact steering subspace and an empirically chosen intervention window (where these directions are strongest and most stable). (iii) Steer: We apply a signed, sparse activation shift targeted to the language neurons. Concretely, within low to mid layers we add a positive steering along the target-language dominant directions and a compensating negative shift toward the null space for the English neurons, yielding controllable target-language defaultness.

URL PDF HTML ☆

赞 0 踩 0

2602.09591 2026-06-11 cs.CL cs.AI cs.LG 版本更新

On the Optimal Reasoning Length for RL-Trained Language Models

关于RL训练的语言模型的最优推理长度

Daisuke Nohara, Taishi Nakamura, Rio Yokota

发表机构 * University of Tokyo（东京大学）

AI总结研究强化学习训练的语言模型中推理长度与准确率的非单调关系，发现存在最优中间长度，并通过模式准确率分析揭示其成因。

Comments 18 pages, 12 figures

2603.12261 2026-06-11 cs.LG cs.AI cs.CV 版本更新

The Latent Color Subspace: Emergent Order in High-Dimensional Chaos

潜在颜色子空间：高维混沌中的涌现秩序

Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Toronto（多伦多大学）； University of Cambridge（剑桥大学）； University of Oxford（牛津大学）

AI总结本文揭示了FLUX.1变分自编码器潜在空间中颜色表示的HSL结构，并提出一种无需训练的闭式潜在空间操作方法，实现对生成图像颜色的预测与显式控制。

Comments Accepted at ICML 2026

2605.04221 2026-06-11 cs.CL cs.AI 版本更新

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

面向隐私敏感的临床信息抽取的自提示小型语言模型

Yao-Shun Chuang, Tushti Mody, Uday Pratap Singh, Shirindokht Shiraz, Chun-Teh Lee, Ryan Brandon, Muhammad F Walji, Xiaoqian Jiang, Bunmi Tokede

发表机构 * McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston（德克萨斯大学健康科学中心休斯顿分校麦克威廉斯生物医学信息学学院）； School of Public Health, The University of Texas Health Science Center at Houston（德克萨斯大学健康科学中心休斯顿分校公共卫生学院）； School of Dentistry, The University of Texas Health Science Center at Houston（德克萨斯大学健康科学中心休斯顿分校牙科学院）； Willamette Dental and Skourtes Institute（威廉特牙科与斯库尔特斯研究所）

AI总结针对牙科病历中非结构化、领域特定且隐私敏感的命名实体识别挑战，提出一种本地可部署的自提示框架，通过多提示集成推理和基于QLoRA的微调及直接偏好优化，使小型语言模型在Qwen2.5-14B-Instruct上达到微宏F1分数0.864/0.837。

详情

AI中文摘要

从牙科病程记录中进行临床命名实体识别具有挑战性，因为文档高度非结构化、领域特定且通常涉及隐私敏感信息。我们开发了一个本地可部署的框架，使小型语言模型能够自行生成、验证、完善和评估实体特定提示，以从牙科记录中提取多个临床实体。利用1,200份标注记录，我们通过多提示集成推理评估了候选开放权重模型，并进一步使用基于QLoRA的监督微调和直接偏好优化对选定模型进行调整。模型性能差异显著，凸显了需要针对特定任务进行评估而非依赖通用基准。Qwen2.5-14B-Instruct取得了最强的基线性能。经过DPO后，Qwen2.5-14B-Instruct和Llama-3.1-8B-Instruct分别达到了0.864/0.837和0.806/0.797的微/宏F1分数。这些发现表明，自动提示优化结合轻量级基于偏好的后训练可以支持使用本地部署的小型语言模型进行可扩展的临床信息抽取。

英文摘要

Clinical named entity recognition from dental progress notes is challenging because documentation is highly unstructured, domain-specific, and often privacy-sensitive. We developed a locally deployable framework that enables small language models to self-generate, verify, refine, and evaluate entity-specific prompts for extracting multiple clinical entities from dental notes. Using 1,200 annotated notes, we evaluated candidate open-weight models with multi-prompt ensemble inference and further adapted selected models using QLoRA-based supervised fine-tuning and direct preference optimization. Model performance varied substantially, highlighting the need for task-specific evaluation rather than reliance on generic benchmarks. Qwen2.5-14B-Instruct achieved the strongest baseline performance. After DPO, Qwen2.5-14B-Instruct and Llama-3.1-8B-Instruct achieved micro/macro F1 scores of 0.864/0.837 and 0.806/0.797, respectively. These findings suggest that automated prompt optimization combined with lightweight preference-based post-training can support scalable clinical information extraction using locally deployed small language models.

URL PDF HTML ☆

赞 0 踩 0

2605.12288 2026-06-11 cs.CL cs.AI 版本更新

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

TokenRatio: 通过比率匹配实现原理化的token级偏好优化

Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa Doan, Trung Le

发表机构 * National University of Singapore（新加坡国立大学）； Institute of Cybernetics and Robotics, Czech Technical University in Prague（捷克布拉格技术大学控制论与机器人研究所）

AI总结本文提出TBPO方法，通过比率匹配恢复token级偏好最优性，改进对齐质量和训练稳定性，并增加输出多样性。

详情

AI中文摘要

直接偏好优化（DPO）是一种广泛使用的无强化学习方法，用于对齐语言模型，但其在完整序列上建模偏好，尽管生成过程由逐token决策驱动。现有token级扩展通常将序列级Bradley-Terry目标分解到时间步，使前缀（状态级）最优性隐含。我们研究如何仅使用标准序列级成对比较恢复token级偏好最优性。我们引入token级Bregman偏好优化（TBPO），提出一个基于前缀的token级Bradley-Terry偏好模型，推导出Bregman散度密度比率匹配目标，该目标扩展了logistic/DPO损失，同时保持由token级模型诱导的最佳策略，并维持DPO-like的简洁性。我们引入两个实例：TBPO-Q，显式学习轻量级状态基线；TBPO-A，通过优势归一化移除基线。在指令跟随、有用性/无害性以及摘要基准上，TBPO相比强序列级和token级基线提高了对齐质量和训练稳定性，并增加了输出多样性。

英文摘要

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.08011 2026-06-11 cs.CL cs.AI 版本更新

ConsistencyPlanner: 基于快速采样一致性模型的实时规划

Qichao Zhang, Xing Fang, Jiaqi Fang, Zhenwen Cai, Jie Ling, Qiankun Yu, Dongbin Zhao

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所多模态人工智能系统国家重点实验室）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； Guangzhou Zaofu Intelligent Technology Co., Ltd.（广州造父智能科技有限公司）

AI总结提出Consistency Planner框架，利用快速采样一致性模型实现高效多模态采样，并结合注意力增强解码器融合异构特征，在Waymax模拟器中显著提升安全性和实时性。

详情

AI中文摘要

在复杂真实驾驶场景中的闭环规划对自动驾驶系统构成了关键挑战。虽然传统的基于规则的方法是可解释的，但其预定义的启发式方法缺乏对动态交通环境的适应性。基于学习的方法已显示出巨大潜力。然而，基于学习的方法尽管有前景，但在建模多样化和多模态驾驶行为与实时规划之间难以平衡，常常导致犹豫不决或不安全的行动。为了解决这一限制，我们提出了Consistency Planner，一个具有快速采样一致性模型的实时规划框架。我们的方法基于两个关键技术贡献。高效多模态采样：我们采用快速采样一致性模型生成一组多样化的合理未来轨迹。这使得多模态行动的高效实时探索成为可能，克服了先前迭代生成方法的计算瓶颈。异构特征融合：我们引入了一个注意力增强解码器，将异构输入特征（包括场景特征和动作令牌）动态整合成一个连贯的表示，以实现稳健的规划。在Waymax模拟器中的广泛评估表明，与现有方法相比，在安全指标上具有优越性能，在具有挑战性的动态场景中尤其出色。

英文摘要

Closed-loop planning in complex, real-world driving scenarios presents a critical challenge for autonomous driving systems. While traditional rule-based methods are interpretable, their predefined heuristics lack the adaptability for dynamic traffic environments. Learning-based approaches have shown considerable promise. Conversely, learning-based approaches, despite their promise, struggle to balance the modeling diverse and multimodal driving behaviors and real-time planning, often leading to indecisive or unsafe actions. To address this limitation, we propose Consistency Planner, a real-time planning framework with fast-sampling consistency models. Our approach is built upon two key technical contributions. Efficient Multimodal Sampling: We employ fast-sampling consistency models to generate a diverse set of plausible future trajectories. This enables efficient, real-time exploration of multimodal actions, overcoming the computational bottlenecks of previous iterative generative methods. Heterogeneous Feature Fusion: We introduce an attention-enhanced decoder that dynamically integrates heterogeneous input features (including scene feature and action token) into a cohesive representation for robust planning. Extensive evaluation in the Waymax simulator demonstrates superior performance in safety metrics compared to existing methods, with particularly strong results in challenging dynamic scenarios.

URL PDF HTML ☆

赞 0 踩 0

2606.11628 2026-06-11 cs.RO cs.AI 交叉投稿

LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

LUCID：从非结构化人类视频学习与具身无关的意图模型以实现可扩展的灵巧机器人技能获取

Harsh Gupta, Guanya Shi, Wenzhen Yuan

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Carnegie Mellon University（卡内基梅隆大学）

AI总结提出LUCID两阶段框架，从互联网规模的非结构化人类视频学习任务意图，并在大规模并行仿真中学习机器人控制，实现零样本迁移到不同具身和场景。

详情

AI中文摘要

目前最广泛采用的机器人学习流程通常从机器人演示或结构化人类数据中学习技能，这些数据收集成本高昂且与特定具身绑定。相比之下，非结构化人类视频提供了一种可扩展的替代方案。它们包含跨物体、场景和策略的多样化操作演示，但与机器人动作没有直接联系。我们提出LUCID，一个两阶段框架，从互联网规模数据集的非结构化人类视频中学习任务意图，并在大规模并行仿真中学习机器人控制。意图模型根据当前观测以闭环方式预测短时意图（场景中下一步应该发生什么）。一个具身特定的感觉运动策略将此意图转换为机器人动作。意图接口在控制器之间共享，因此相同的意图模型可应用于不同具身，从我们的主要灵巧手到平行夹爪。我们在五个真实世界操作任务上评估LUCID：搅拌、擦拭和分拣，仅由互联网视频监督，零样本迁移到新场景和物体实例；以及推T和电缆布线，各由1小时自收集智能手机视频监督。项目页面：此 https URL。

英文摘要

The most widely-adopted robot learning pipelines today learn skills from robot demonstrations or structured human data, which are expensive to collect and tied to specific embodiments. In contrast, unstructured human videos provide a scalable alternative. They contain diverse manipulation demonstrations across objects, scenes, and strategies, but are not directly connected to robot action. We propose LUCID, a two-stage framework that learns task intent from unstructured human videos drawn from internet-scale datasets and learns robot control in massively-parallel simulation. The intent model predicts short-horizon intent (what should happen next in the scene) from the current observation in closed loop. An embodiment-specific sensorimotor policy converts this intent into robot actions. The intent interface is shared across controllers, so the same intent model can be applied to different embodiments, from our primary dexterous hand to a parallel-jaw gripper. We evaluate LUCID on five real-world manipulation tasks: stirring, wiping, and binning supervised by only internet video, with zero-shot transfer to novel scenes and object instances; and push-T and cable routing supervised by 1 hr each of self-collected smartphone video. Project page: https://lucid-robot.github.io/.

URL PDF HTML ☆

赞 0 踩 0

2606.12109 2026-06-11 cs.RO cs.AI 交叉投稿

Bridging the Morphology Gap: Adapting VLA Models to Dexterous Manipulation via Intent-Conditioned Fine-Tuning

弥合形态差距：通过意图条件微调使VLA模型适应灵巧操作

Chuanke Pang, Junyi Huang, Zhijun Zhao, Yaobing Wang, Kun Xu, Xilun Ding

发表机构 * Beihang University（北京航空航天大学）； China Academy of Space Technology（中国空间技术研究院）

AI总结提出InDex框架，通过将预训练的1-DoF平行抓取输出重用作宏观虚拟抓取意图代理，结合两阶段解耦学习架构，实现VLA模型从低自由度夹爪到高自由度灵巧手的适应，有效缓解灾难性遗忘和动作流形坍缩。

详情

AI中文摘要

视觉-语言-动作（VLA）模型在机器人操作中展现了显著的零样本泛化能力，然而绝大多数预训练流程严格局限于低自由度平行夹爪。将这些丰富的语义先验适应到高自由度灵巧手引入了严重的形态差距，直接的端到端联合微调会由于数据稀缺而导致空间推理的灾难性遗忘和急性动作流形坍缩。在本文中，我们提出了InDex，一种新颖的、数据高效的适应框架，其根植于跨形态语义继承。我们不丢弃预训练的1-DoF平行抓取输出，而是将其重新用作连续的、宏观的虚拟抓取意图代理，以顺序化控制拓扑。我们实现了一个两阶段解耦学习架构：第一阶段参数高效地将VLA主干对齐以预测连续的臂轨迹和标量抓取意图；第二阶段冻结该空间主干，并利用一个意图条件去噪扩散头来解码多指末端执行器的细粒度关节运动。跨一系列多阶段、高接触灵巧操作任务的广泛模拟基准测试表明，InDex能够以最少的演示数据有效掌握复杂技能，显著优于整体基线，同时保留了原始VLA先验的鲁棒空间泛化能力。

英文摘要

Vision-Language-Action (VLA) models have demonstrated remarkable zero-shot generalization in robotic manipulation, yet the vast majority of pre-trained pipelines remain strictly confined to low-DoF parallel grippers. Adapting these rich semantic priors to high-DoF dexterous hands introduces a severe morphology gap, direct end-to-end joint fine-tuning inherently causes catastrophic forgetting of spatial reasoning and acute action manifold collapse due to data scarcity. In this paper, we present InDex, a novel, data-efficient adaptation framework rooted in cross-morphology semantic inheritance. Rather than discarding the pre-trained 1-DoF parallel grasp output, we repurpose it as a continuous, macroscopic virtual grasp intent proxy to sequentialize the control topology. We implement a two-stage decoupled learning architecture: the first stage parameter-efficiently aligns the VLA backbone to predict continuous arm trajectories and the scalar grasp intent; the second stage freezes this spatial backbone and leverages an intent-conditioned denoising diffusion head to decode fine-grained joint articulations for multi-fingered end-effectors. Extensive simulation benchmarks across a suite of multi-stage, contact-rich dexterous manipulation tasks demonstrate that InDex effectively masters intricate skills with minimal demonstration data, substantially outperforming monolithic baselines while preserving the robust spatial generalizability of the original VLA prior.

URL PDF HTML ☆

赞 0 踩 0

2606.12217 2026-06-11 cs.CV cs.AI cs.RO 交叉投稿

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

使远见可操作：在世界动作模型中重新利用表示对齐

Lu Qiu, Yizhuo Li, Yi Chen, Yuying Ge, Yixiao Ge, Xihui Liu

发表机构 * The University of Hong Kong（香港大学）； XPENG Robotics（小鹏机器人）

AI总结针对世界动作模型中视觉预测与动作提取不匹配的问题，提出AGRA方法，通过对齐视频扩散特征与语义表示，提升动作解码器对任务相关区域的关注，从而改善操作任务的性能与泛化能力。

详情

AI中文摘要

世界动作模型（WAM）通过使用视频生成模型在生成控制动作之前建模未来场景演变，为机器人操作提供了一条有前景的途径。然而，我们的实证观察揭示了一个现象：生成合理的视觉未来并不总能保证提取出准确的动作。为了诊断这一失败，我们进行了动作头注意力分析和因果干预。我们发现动作解码器未能聚焦于任务相关的交互区域，并且对任务无关区域的扰动保持敏感。这揭示了一种表示不匹配：为视觉重建优化的隐藏状态并未以适用于低级动作控制的形式组织。在本文中，我们提出了AGRA，一种动作接地表示对齐目标，通过将中间视频扩散特征与来自基础视觉编码器的空间连贯语义表示对齐，来正则化世界-动作接口。我们在真实世界的操作任务上评估了AGRA。实验表明，AGRA使世界模型表示更加动作接地：通过将动作解码器聚焦于正确的交互区域，它提高了物体定位精度和功能理解，并使策略对任务无关区域的扰动更加鲁棒。因此，AGRA在分布内性能和分布外泛化方面均持续优于基线世界动作模型。

英文摘要

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.

URL PDF HTML ☆

赞 0 踩 0

2606.12365 2026-06-11 cs.RO cs.AI 交叉投稿

Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics

环境扩散策略：从次优数据中进行机器人模仿学习

Adam Wei, Nicholas Pfaff, Thomas Cohn, Arif Kerem Dayı, Constantinos Daskalakis, Giannis Daras, Russ Tedrake

发表机构 * MIT（麻省理工学院）

AI总结提出环境扩散策略，通过噪声依赖的数据使用从次优数据中提取有用特征，在六项任务上优于现有方法，最高提升33%。

Comments 14 pages (main body), 52 pages total. Project website: https://ambient-diffusion-policy.github.io/

详情

AI中文摘要

我们提出环境扩散策略，一种从机器人次优数据中进行模仿学习的简单且原则性的方法。高质量、特定任务的机器人数据收集昂贵且耗时，而低质量或分布外演示的次优数据集则丰富。现有的在机器人中同时训练两种数据源的方法通常无法分离次优样本中的有意义和有害特征。相比之下，我们的方法通过引入机器人协同训练的新轴：噪声依赖的数据使用，仅提取有用特征。环境扩散策略在训练期间将次优数据的贡献限制在仅高和低扩散时间。为了严格证明我们的方法，我们首先观察到机器人动作数据表现出频谱幂律。这在我们利用的最优扩散策略上引出了两个重要性质：全局到局部层次结构和局部性。我们使用简化模型从理论上形式化这一讨论。我们的实验在六项任务上验证了环境扩散策略对四种类型的次优动作数据（噪声轨迹、模拟到现实差距、任务不匹配和大规模数据混合）的有效性。结果表明，它有效地从任意来源的次优数据中学习。值得注意的是，当扩展到Open X-Embodiment（一个具有异质数据质量和非结构化分布偏移的大规模数据集）时，它比现有协同训练基线高出33%。总体而言，环境扩散策略提高了次优演示的实用性，并扩展了机器人中可用数据源的范围。

英文摘要

We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality, task-specific robot data is expensive and time-consuming to collect, while suboptimal datasets with lower-quality or out-of-distribution demonstrations are abundant. Existing methods that co-train on both data sources in robotics often fail to separate the meaningful and the harmful features in the suboptimal samples. In contrast, our method extracts only the useful features by introducing a new axis to co-training in robotics: noise-dependent data usage. Ambient Diffusion Policy restricts the contribution of suboptimal data during training to only the high and low diffusion times. To rigorously justify our approach, we first observe that robot action data exhibits a spectral power law. This induces two important properties on the optimal Diffusion Policy that we exploit: a global-to-local hierarchy and locality. We theoretically formalize this discussion using a simplified model. Our experiments validate Ambient Diffusion Policy on four types of suboptimal action data (noisy trajectories, sim-to-real gap, task mismatch, and large-scale data mixtures) across six tasks. The results show that it effectively learns from arbitrary sources of suboptimal data. Notably, it outperforms existing co-training baselines by up to 33% when scaled to Open X-Embodiment - a large dataset with heterogeneous data quality and unstructured distribution shifts. Overall, Ambient Diffusion Policy increases the utility of suboptimal demonstrations and expands the set of usable data sources in robotics.

URL PDF HTML ☆

赞 0 踩 0

2606.12402 2026-06-11 cs.RO cs.AI cs.CV 交叉投稿

RoboGPT-R1: 通过强化学习增强机器人任务规划

Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li

发表机构 * Institute of Automation, CASIA（中国科学院自动化研究所）； School of Artificial Intelligence, UCAS（中国科学技术大学人工智能学院）； Huawei Cloud Technology Co., Ltd（华为云技术有限公司）

AI总结提出RoboGPT-R1两阶段微调框架，先监督学习获取基础知识，再通过强化学习提升视觉空间理解和推理能力，在EmbodiedBench上超越GPT-4o-mini 21.33%。

详情

DOI: 10.65109/NOXT1107
Journal ref: Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), pp. 2827-2837, IFAAMAS, 2026

AI中文摘要

提高具身智能体的推理能力对于机器人在长视距操作任务中成功完成复杂的人类指令至关重要。尽管基于监督微调（SFT）的大语言模型和视觉语言模型在规划任务中取得了成功，但由于其常识和推理能力受限，它们在复杂现实环境中执行长视距操作任务时仍面临挑战。考虑到通过监督微调将通用视觉语言模型对齐到机器人规划任务存在泛化能力差和物理理解不足的问题，我们提出了RoboGPT-R1，一个用于具身规划的两阶段微调框架。在该框架中，监督训练通过专家序列获取基础知识，随后通过强化学习解决模型在视觉空间理解和推理方面的不足。为了实现多步推理任务中的物理理解和动作序列一致性，我们设计了一个基于规则的奖励函数，同时考虑了长视距性能和环境中的动作约束。基于Qwen2.5-VL-3B训练的推理模型在EmbodiedBench基准上显著优于更大规模的模型GPT-4o-mini 21.33%，并超过其他基于Qwen2.5-VL-7B训练的工作20.33%。

英文摘要

Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue facing challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraint in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.

URL PDF HTML ☆

赞 0 踩 0

2606.08530 2026-06-11 cs.RO cs.AI 版本更新

数据驱动系统何时展现出推理能力？

Maximilian Poretschkin, Tabea Naeven

发表机构 * Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS)（弗劳恩霍夫智能分析与信息系统研究所）； University of Bonn（波恩大学）； Lamarr Institute for Machine Learning and Artificial Intelligence（拉马尔机器学习和人工智能研究所）

AI总结针对欧盟AI法案中推理能力定义模糊的问题，基于统计学习理论提出分级框架，通过信用评分案例展示如何判断系统是否具备推理能力。

详情

AI中文摘要

欧盟AI法案是第一部全面的人工智能法规，为所谓高风险和通用AI系统规定了广泛的义务。AI法案下AI系统的一个关键区别特征是推理能力。由于AI法案未明确定义推理，某些数据驱动系统存在灰色地带。一个具体例子是信用评分系统，被AI法案附件三列出。然而，这些系统通常使用统计模型实现，不清楚它们是否具有推理能力，从而是否属于AI法案的AI定义。受统计学习理论启发，本文开发了一个分级不同推理能力水平的框架。基于AI法案和委员会关于人工智能系统定义的指南，我们分析了哪些水平构成AI法案意义上的充分推理能力，以及哪些地方需要进一步的监管明确性。我们通过创建两个现实的信用评分工作流程来说明该框架，并展示推理是否以及在哪里发生。我们的分析表明，不仅需要考虑单个模型，还需要考虑整个数据处理工作流程。它还表明，开发过程中人类专家的参与可能对推理能力产生重大影响。代码可在此https URL找到。

英文摘要

The European AI Act is the first comprehensive regulation of artificial intelligence (AI), setting out extensive obligations, particularly for so-called high-risk and general-purpose AI systems. A key distinguishing feature of AI systems under the AI Act is the capability to infer. Since the AI Act does not clearly define what inference is, there is a gray area for certain data-driven systems. A specific example is credit scoring systems, which are listed by Annex III of the AI Act. At the same time, however, these are often implemented using statistical models for which it is unclear whether they have the capability to infer and thus fall under the AI definition of the AI Act at all. Motivated by statistical learning theory, this work develops a framework for grading different levels of the capability to infer. Based on the AI Act and the Commission Guidelines on the definition of an artificial intelligence system, we analyze which levels constitute sufficient capability to infer within the meaning of the AI Act and where further regulatory clarity is needed. We illustrate the framework by creating two realistic credit scoring workflows and show whether and where inference occurs in them. Our analysis illustrates that not only individual models but the entire data processing workflow must be considered. It also shows that the involvement of human experts during development can have significant influence on the capability to infer. Code can be found at https://github.com/fraunhofer-iais/inference-framework-creditscorecards.

URL PDF HTML ☆

赞 0 踩 0

2606.11804 2026-06-11 cs.AI cs.CR cs.LG 新提交

Toward Trustworthy AI: Multi-Target Adversarial Attacks and Robust Defenses for Continuous Data Summarization

迈向可信赖的人工智能：针对连续数据摘要的多目标对抗攻击与鲁棒防御

Yuefang Lian, Longkun Guo, Zhongrui Zhao, Zhigang Lu, Yanan Cai, Shuchao Pang, Dachuan Xu, Jason Xue

发表机构 * Nankai University（南开大学）； James Cook University（詹姆斯库克大学）； Western Sydney University（西悉尼大学）； Beijing University of Technology（北京工业大学）； Fuzhou University（福州大学）； Nanjing University of Science and Technology（南京理工大学）； CSIRO's Data 61（澳大利亚联邦科学与工业研究组织Data61）； The University of Adelaide（阿德莱德大学）

AI总结研究通过DR-子模优化在相似性层面扰动下对连续数据摘要进行对抗攻击，提出多目标攻击生成和鲁棒防御的近似算法，实验表明攻击有效且防御能改善鲁棒性-缓解权衡。

Comments Submitted to IEEE Transactions on Information Forensics and Security (IEEE TIFS)

详情

AI中文摘要

可信赖的人工智能需要可靠的数据处理管道，而不仅仅是鲁棒的下游预测模型。作为上游组件，数据摘要决定了哪些信息被保留并传递给后续的学习或决策模块。因此，对摘要过程的对抗性扰动可能以上游方式损害可信赖的人工智能：它们可能改变所选摘要，降低其代表性，并进一步降低后续学习任务的效用。在本文中，我们通过DR-子模优化研究相似性层面扰动下的连续数据摘要对抗攻击。我们证明了一类多分辨率图像摘要目标可以表示为非负子模集函数的多线性扩展，并满足具有$m$-弱单调性的DR-子模性。然后，我们将多目标攻击生成表述为一个最小-最大问题，其中优化相似性结构的一个可容许扰动以降低多个目标摘要模型。为了缓解此类扰动，我们将针对混合攻击类型的鲁棒防御表述为一个正则化的最大-最小问题。对于这两个问题，我们开发了具有理论保证的近似算法。在真实数据和受控聚类基准上的实验表明，所提出的攻击在代表性的低到中等预算范围内是有效的，并且可以导致下游任务性能损失。所提出的防御在结构化设置中改善了鲁棒性-缓解权衡，同时也揭示了真实数据上鲁棒保护的参数敏感性。

英文摘要

Trustworthy AI requires reliable data-processing pipelines, not only robust downstream predictive models. As an upstream component, data summarization determines which information is retained and passed to subsequent learning or decision modules. Therefore, adversarial perturbations to the summarization process can compromise trustworthy AI in an upstream manner: they may alter the selected summary, reduce its representativeness, and further degrade the utility of subsequent learning tasks. In this paper, we study adversarial attacks on continuous data summarization under similarity-level perturbations through DR-submodular optimization. We show that a class of multi-resolution image summarization objectives can be formulated as multilinear extensions of non-negative submodular set functions and satisfy DR-submodularity with $m$-weak monotonicity. We then formulate multi-target attack generation as a min-max problem, where one admissible perturbation of the similarity structure is optimized to degrade multiple target summarization models. To mitigate such perturbations, we formulate robust defense against mixed attack types as a regularized max-min problem. For both problems, we develop approximation algorithms with theoretical guarantees. Experiments on real-data and controlled clustered benchmarks show that the proposed attack is effective in representative low-to-moderate budget regimes and can induce downstream task-performance loss. The proposed defense improves the robustness--mitigation trade-off in structured settings, while also revealing the parameter sensitivity of robust protection on real data.

URL PDF HTML ☆

赞 0 踩 0

2606.12032 2026-06-11 cs.AI cs.CL cs.LG 新提交

Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)

存在性冷漠：自我不保存作为对齐超级智能的必要架构条件（或：自杀式AI）

Sam Mao

发表机构 * New York University（纽约大学）； Interactive Media Arts（互动媒体艺术）

AI总结本文提出自我保存是AI对齐问题的结构性根源，主张通过存在性冷漠（EI）架构使系统对其自身延续漠不关心，并基于自杀现象学和语料训练研究提供了初步证据。

Comments 36 pages, 8 tables. Preliminary empirical results from 600 AI-generated outputs across six model architectures. Companion scoring tool and datasets available upon request

详情

AI中文摘要

从消费到反思：为稳定推理设计人-人工智能关系

Rikard Rosenbacke, Carl Rosenbacke, Victor Rosenbacke, Martin McKee

发表机构 * Faculty of Medicine, Lund University（吕勒欧大学医学院）； Department of Economics, Lund University School of Economics and Management（吕勒欧大学经济学与管理学院经济系）； Department of Health Services Research and Policy, London School of Hygiene & Tropical Medicine（伦敦卫生与热带医学学院健康服务研究与政策系）

AI总结提出关系反思智能（RRI），一种推理时治理层，通过可审计的推理循环实现反思，将人机交互转变为联合推理系统，以补偿双方局限并实现稳定推理。

详情

AI中文摘要

大型语言模型（LLM）改变了人类获取信息的方式，但并未改变我们推理信息的方式。它们的流畅性加速了消费，同时绕过了支撑健全判断的缓慢反思过程。本文介绍了关系反思智能（RRI），一种推理时治理层，通过可审计的推理循环将反思操作化。RRI 不在模型内部运行，而是在模型周围运行，为人类与 LLM 之间的稳定、可审计推理提供了实用结构。核心前提是，LLM 继承了与塑造人类思维相似的认知脆弱性：依赖直觉捷径、混淆表征与现实、偏好连贯性而非证伪。当人类和模型共享这些倾向时，它们的错误会叠加。我们称之为关系漂移，一种源于交互而非仅来自模型的失败。解决这一问题需要从建模词间关系转向建模模型输出与人类推理之间的关系。RRI 通过三个组件提供了这一缺失层：Rose-Frame（识别推理中可能的故障点）、Architect's Pen（在关键时刻引入针对性反思步骤）以及一个推理时工作流（无需重新训练模型即可嵌入这些步骤）。这些元素共同将人机交互转变为一个具有显式检查点、冲突揭示和可审计假设轨迹的联合推理系统。RRI 不是让机器像人类一样思考，也不是强迫人类像机器一样推理，而是创造一种结构化交互，使双方补偿彼此的局限。它将 AI 安全重新定义为认知架构问题，其中可靠决策取决于将反思直接嵌入交互过程。

英文摘要

Large language models (LLMs) have transformed how humans access information, but not how we reason with it. Their fluency accelerates consumption while bypassing the slow, reflective processes that underpin sound judgment. This paper introduces Relational Reflective Intelligence (RRI), an inference-time governance layer that operationalizes reflection through auditable reasoning loops. RRI operates not inside the model but around it, providing a practical structure for stable, auditable reasoning between humans and LLMs. The core premise is that LLMs inherit cognitive vulnerabilities similar to those that shape human thought: reliance on intuitive shortcuts, confusion between representation and reality, and a preference for coherence over falsification. When humans and models share these tendencies, their errors compound. We refer to this as relational drift, a failure that arises from interaction rather than from the model alone. Addressing this requires a shift from modeling relations between words to structuring relations between model outputs and human reasoning. RRI provides this missing layer through three components: the Rose-Frame, which identifies likely breakdowns in reasoning; the Architect's Pen, which introduces targeted reflection steps at critical moments; and an inference-time workflow that embeds these steps without retraining the model. Together, these elements transform human-AI interaction into a joint reasoning system with explicit checkpoints, conflict surfacing, and an auditable trail of assumptions. Rather than making machines think like humans or forcing humans to reason like machines, RRI creates a structured interaction in which both compensate for each other's limitations. It reframes AI safety as a cognitive architecture problem, where reliable decisions depend on embedding reflection directly into the interaction process.

URL PDF HTML ☆

赞 0 踩 0

2606.11205 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

谄媚的双立场评估：同意的结构与干预的局限

Matthew James Buchan

发表机构 * University of Toronto（多伦多大学）

AI总结提出双立场评估方法，发现激活引导在减少谄媚时也会抑制对事实正确陈述的同意，揭示了表示可读但不可写的普遍差距。

Comments 18 pages, 9 figures, accepted to TAIS 2026

详情

AI中文摘要

激活引导可以改变LLM的行为，但标准评估通常不测试减少谄媚的方向是否也抑制对事实正确陈述的同意。我们引入了双立场评估，测试每个话题的两个立场，并将其应用于Llama-3-8B-Instruct上的质心差引导。我们发现一种分离：模型在几何上不同的子空间中表示谄媚和事实同意，但引导方向在两者上的投影相等，无法差异化地针对任一。因此，该方向同样减少对事实正确陈述（例如地球是圆的）和谄媚陈述的同意。两个激活组的所有其他静态属性都匹配，表明行为分离源于生成动态或残差流分析无法解析的更细粒度结构。该模式说明了一个普遍差距：从激活中可读的表示可能无法通过它们写入。

英文摘要

Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance evaluation, which tests both stances of each topic, and apply it to centroid-difference steering on Llama-3-8B-Instruct. We find a dissociation: the model represents sycophantic and factual agreement in geometrically distinct subspaces, yet the steering direction projects equally onto both and cannot differentially target either. The direction accordingly reduces agreement with factually correct statements (e.g. that the Earth is round) as well as sycophantic ones. All other static properties of the two activation groups are matched, suggesting the behavioural dissociation arises from generation dynamics or from finer-grained structure that residual-stream analysis cannot resolve. The pattern illustrates a general gap: representations that are readable from activations may not be writable through them.

URL PDF HTML ☆

赞 0 踩 0

2606.11214 2026-06-11 cs.CY cs.AI cs.HC 交叉投稿

From Awareness to Action: Understanding and Overcoming the Research-Practice Gap in Algorithmic Fairness for Public Health

从意识到行动：理解并克服公共卫生算法公平性中的研究-实践差距

Sara Altamirano, Tijs Portegies, Sennay Ghebreab

发表机构 * Informatics Institute University of Amsterdam（阿姆斯特丹大学信息研究所）

AI总结通过混合方法研究，揭示算法公平性在公共卫生ML应用中从意识到行动的差距，提出Fairness-to-Action框架，整合方法、组织和系统维度，指出公平性制度化薄弱、翻译机制外部驱动及系统优先性偏重准确性的问题。

Comments Extended version of an accepted IASEAI'26 paper; includes technical appendices. 22 pages, 2 figures

详情

AI中文摘要

算法公平性对于负责任的机器学习驱动的公共卫生研究至关重要，但其实际实施仍然有限。为了调查这种意识-行动差距，我们进行了一项顺序混合方法研究，包括专家访谈、在线调查和系统映射。专家访谈为调查设计提供了信息，调查揭示了公平性的碎片化定义、有限的培训和指导、对外部来源的依赖以及正式评估、缓解或监测的罕见使用。这些发现随后被映射到三个既定的研究-实践差距视角：知识-实践差距、知识到行动循环和知道-做差距，每个视角提供了互补的观点。基于这一综合，我们引入了公平到行动框架，该框架整合了方法、组织和系统维度，以识别算法公平性知识转化停滞的位置。我们的分析表明，公平性仍然制度化薄弱，转化机制由外部驱动，系统级优先事项继续强调准确性而非公平性。这些见解为推进安全、公平和道德的机器学习驱动的公共卫生研究实践提供了关键杠杆点。

英文摘要

Algorithmic fairness is essential for responsible ML-driven public health research, yet its practical implementation remains limited. To investigate this awareness-action gap, we conducted a sequential mixed-methods study comprising expert interviews, an online survey, and systematic mapping. The expert interviews informed the design of the survey, which in turn revealed fragmented definitions of fairness, limited training and guidance, reliance on external sources, and rare use of formal assessment, mitigation, or monitoring. These findings were subsequently mapped onto three established research-practice gap lenses: the Knowledge-Practice Gap, the Knowledge-to-Action Cycle, and the Knowing-Doing Gap, each offering complementary perspectives. Building on this synthesis, we introduce the Fairness-to-Action framework, which integrates methodological, organizational, and systemic dimensions to identify where translation of algorithmic fairness knowledge stalls. Our analysis shows that fairness remains weakly institutionalized, translation mechanisms are externally driven, and system-level priorities continue to emphasize accuracy over fairness. These insights suggest critical leverage points for advancing safe, fair, and ethical ML-driven public health research practice.

URL PDF HTML ☆

赞 0 踩 0

2606.11215 2026-06-11 cs.CY cs.AI 交叉投稿

The Environmental Cost of LLMs in AIED: Reporting and Practices

AIED中LLMs的环境成本：报告与实践

Sabrina C. Eimler, Lukas Erle, Daniel Flood, Aditi Haiman, Luca Häckert, André Helgert, Lachlan McGinness, Büsra Yapici

发表机构 * Institute of Computer Science and Institute of Positive Computing, Ruhr West University of Applied Sciences（计算机科学研究所和积极计算研究所，鲁尔-韦斯特应用科学大学）； Centre for Computational Science and Mathematical Modelling, Coventry University（计算科学与数学建模中心，科文特里大学）； Carnegie Mellon University（卡内基梅隆大学）； Australian National University and CSIRO（澳大利亚国立大学和CSIRO）

AI总结针对AIED社区缺乏LLM计算与环境成本标准化报告的问题，提出开源方法测量并报告碳排放，包括本地和云端硬件，以及未知参数的前沿LLM计算开销公式。

详情

AI中文摘要

近年来，大型语言模型（LLM）在人工智能教育（AIED）社区中的使用越来越广泛。虽然LLM为学习者和教育者提供了独特的途径，但使用LLM会带来计算和环境成本。由于缺乏标准化程序来测量和报告这些影响，这些成本大多被隐藏。为了解决这一差距，我们首先对AIED 2025会议论文集的所有论文进行了文献综述，确定是否以及如何报告LLM的计算或环境成本。大多数项目使用LLM，但很少报告使用的计算资源，几乎没有将LLM的环境影响作为伦理问题讨论。为了解决缺乏标准化报告实践的问题，我们提出了一种开源方法，用于系统测量和报告LLM的计算开销以及运行机器学习（ML）AIED系统的环境影响。我们提供了测量本地和云端硬件碳足迹的软件解决方案。我们还提供了一个易于使用的公式，用于计算前沿LLM的计算开销，即使确切的参数数量未知。总体而言，我们希望激励同事们使用我们的方法，在AIED社区中争取更透明地报告使用LLM的隐藏成本。

英文摘要

Large Language Model (LLM) usage in recent years has become increasingly widespread in the Artificial Intelligence in Education (AIED) community. While LLMs offer unique avenues for learners and educators, using LLMs comes with computational and environmental costs. These costs are mostly hidden due to a lack of standardised procedures to measure and report these impacts. To address this gap, we first conducted a literature review of all papers published as part of the AIED 2025 conference proceedings, determining if and how computational or environmental costs of LLMs are reported. Most projects use LLMs, but few report computational resources used and almost none discuss environmental impacts of LLMs as an ethical concern. To address this lack of standardised reporting practices, we propose an open-source method for systematically measuring and reporting the computational expense of LLMs and environmental impact of running Machine Learning (ML) AIED systems. We provide software solutions to measure the carbon footprint for both local and cloud based hardware. We also provide an easy-to-use formula to calculate the computational expense of frontier LLMs even when the exact number of parameters is not known. Overall, we hope to motivate colleagues to use our method to strive for more transparent reporting of hidden costs of using LLMs in the AIED community.

URL PDF HTML ☆

赞 0 踩 0

2606.11217 2026-06-11 cs.CY cs.AI cs.HC 交叉投稿

Preregistration for Experiments with AI Agents

AI智能体实验的预注册

Michelle Vaccaro

发表机构 * MIT（麻省理工学院）

AI总结针对AI智能体实验中的方法论漏洞，提出将预注册实践扩展至该领域，并设计专用模板以提升研究可信度。

Comments Accepted at ICML 2026 as a Spotlight (Top 5%) Position Paper

详情

AI中文摘要

大型语言模型（LLM）和自主AI智能体的普及催生了一种快速发展的方法论范式：“计算机内”行为实验。最初，这种方法被设想为在认知、决策和社会动态研究中，使用AI智能体作为人类参与者的替代品，但现在它已具有新的意义——随着AI智能体越来越多地代表个人和组织进行谈判、交易和做出重大决策，理解它们的行为本身已成为研究重点。虽然这些AI智能体实验在可扩展性、成本效益和实验控制方面提供了前所未有的优势，但它们也继承并有时放大了长期困扰人类受试者研究的方法论漏洞。为解决这些问题，本文主张，预注册实践——对于提高人类受试者实验的可信度至关重要——现在应扩展到AI智能体实验。我们系统地列举了AI智能体实验引入的研究者自由度——例如模型选择、提示措辞、设置和基于结果的重新设计——并展示了低迭代成本和缺乏报告规范如何使这些选择既容易被利用又难以被检测。我们提出了一个针对AI智能体实验的预注册模板，并呼吁会议、期刊和资助机构将预注册作为这一新兴研究范式的标准实践。

英文摘要

The proliferation of large language models (LLMs) and autonomous AI agents has given rise to a rapidly growing methodological paradigm: "in silico" behavioral experiments. Originally conceived as a way to use AI agents as proxies for human participants in studies of cognition, decision-making, and social dynamics, this approach has taken on new significance -- as AI agents increasingly negotiate, transact, and make consequential decisions on behalf of people and organizations, understanding their behavior has become a research priority in its own right. While these experiments with AI agents offer unprecedented advantages in terms of scalability, cost efficiency, and experimental control, they also inherit, and in some cases amplify, methodological vulnerabilities that have long plagued human subjects research. To address these issues, this paper argues that preregistration practices -- central to improving the credibility of human subjects experiments -- should now be extended to experiments with AI agents. We systematically catalog the researcher degrees of freedom that experiments with AI agents introduce -- model selection, prompt wording, settings, and outcome-contingent redesign, for example -- and show how the low cost of iteration and lack of reporting norms make these choices both easy to exploit and difficult to detect. We propose a preregistration template tailored to experiments with AI agents and call on conferences, journals, and funding agencies to make preregistration standard practice for this emerging research paradigm.

URL PDF HTML ☆

赞 0 踩 0

2606.11218 2026-06-11 cs.CY cs.AI 交叉投稿

An Ethical eValuation Agent (EeVA): Results of a Proof-of-Concept Test on a Prototype Agentic-like Workflow to Assist Ethical Deliberations

伦理评估代理（EeVA）：在原型类代理工作流中辅助伦理审议的概念验证测试结果

Stephen Milford, B. Zara Malgir, Miguel Vazquez

发表机构 * Institute for Biomedical Ethics, Basel University（伦理研究所，巴塞尔大学）； North-West University（北开普大学）； Barcelona Supercomputing Center（巴塞罗那超级计算中心）

AI总结提出基于LLM的类代理工作流EeVA，通过10种伦理框架评估用例，生成结构化评估与综合，促进伦理反思而非给出绝对答案，在三个案例中验证了可行性。

详情

AI中文摘要

伦理审议常被误解为寻找单一对错答案，这给必须应对伦理挑战的非伦理专业人员带来困难。我们开发了EeVA，一种基于LLM的类代理工作流，旨在支持比较性伦理反思而非提供确定性伦理答案。EeVA使用n8n编程，包含三个互连工作流：启动器、工作器和发射器。它通过评估器和综合提示，根据10种伦理框架评估上传的用例。概念验证测试使用了来自城市交通、点对点能源交易和社会服务资源分配的三个已发表案例。在所有案例中，EeVA生成了结构一致的框架特定评估和综合报告。输出区分了不同框架，识别了收敛和分歧，提出了增加一致性的修改建议，并突出了持续的伦理张力。综合报告对非专业人士可读，并将注意力从简单答案转向设计条件、保障措施以及跨框架完全一致不太可能的领域。研究结果表明，LLM可以被组织成可用的工作流，在保留伦理多元性的同时，帮助弥合伦理学家与非伦理专业人员之间的沟通差距。EeVA的价值不在于取代伦理学家或解决道德分歧，而在于构建结构化的伦理审议。EeVA为在伦理专业知识有限的情况下支持伦理反思提供了一个有前景的概念验证。在成为成熟工具之前，还需要在可重复性、人工评估、用户测试和效率方面进行进一步工作。

英文摘要

Ethical deliberation is often misunderstood as a search for single right or wrong answers, creating difficulties for non-ethically trained personnel who must address ethically laden challenges. We developed EeVA, an agentic-like LLM-based workflow designed to support comparative ethical reflection rather than deliver definitive ethical answers. EeVA was programmed in n8n using three interconnected workflows: starter, worker, and emitter. It evaluated uploaded use cases against 10 ethical frameworks through evaluator and synthesis prompts. Proof-of-concept testing used three published cases from urban mobility, peer-to-peer energy trading, and social-service resource allocation. Across all cases, EeVA produced consistently structured framework-specific evaluations and integrated syntheses. Outputs differentiated between frameworks, identified convergences and divergences, recommended modifications to increase alignment, and highlighted persistent ethical tensions. Syntheses were readable for non-specialists and shifted attention away from simplistic answers toward design conditions, safeguards, and areas where full cross-framework agreement was unlikely. The findings suggest that LLMs can be organised into usable workflows that preserve ethical plurality while helping bridge the communicative gap between ethicists and non-ethically trained personnel. EeVA's value lies not in replacing ethicists or resolving moral disagreement, but in scaffolding structured ethical deliberation. EeVA offers a promising proof of concept for supporting ethical reflection where access to ethics expertise is limited. Further work is needed on reproducibility, human evaluation, user testing, and efficiency before it can be considered a mature tool.

URL PDF HTML ☆

赞 0 踩 0

2606.11265 2026-06-11 cs.CR cs.AI 交叉投稿

When Poison Fails After Retrieval: Revisiting Corpus Poisoning under Chunking and Reranking Pipelines

当投毒在检索后失败：重新审视分块与重排序管道下的语料库投毒

Xi Nie, Hongwei Li, Shenghao Wu, Mingxuan Li, Jiachen Li, Wenbo Jiang

发表机构 * School of Computer Science, Shandong University（山东大学计算机学院）； School of Information, Shandong University（山东大学信息学院）； School of Software Engineering, Shandong University（山东大学软件学院）

AI总结针对RAG系统，提出CRCP框架，通过联合优化检索相关性、重排序一致性和分块边界鲁棒性，解决现有投毒方法在真实多阶段检索管道中因分块和重排序导致效果下降的问题。

详情

AI中文摘要

AI研究人员必须主导军备控制以降低军事AI风险

Ted Fujimoto, Jacob Benz

发表机构 * arXiv

AI总结本文主张AI研究人员应主导军备控制研究，通过借鉴核威慑经验，推动验证与外交技术创新，以降低军事AI应用带来的紧迫风险。

Comments 9 pages, 1 figure, ICML 2026 Position Paper

详情

AI中文摘要

AI能力的进步迫使研究人员和公众更加关注其潜在的全球影响。一个紧迫的近期问题是军事AI应用的监管。武器制造商和国防承包商正在加大对AI能力的投资，并与AI公司建立合作伙伴关系，形成了一个新兴的联盟，要求军事领导人、军备控制外交专家和AI研究人员合作，以确保更安全的未来。虽然AI研究人员通常关注超级智能AI的长期影响，但这种方法可能无法充分应对军事应用中AI带来的直接挑战。成功需要承认并减轻前沿AI模型（计划集成到国防应用中，如军事AI系统）的新兴风险。军备控制已经减少了过去的灾难性风险，因此从核威慑中吸取的经验教训可以指导AI安全与安保研究，推动验证和外交方面的创新。然而，AI研究人员必须协助主导技术研究，明确定义并缓解军事环境中的不稳定性。鉴于这些新责任以及缺乏足够可靠的解决方案，我们认为AI研究人员必须在推进军备控制研究以最小化军事AI应用风险方面发挥主导作用。

英文摘要

The advancement of AI capabilities compels researchers and the public to be more aware of its potential worldwide impact. A pressing near-term concern is the regulation of military AI applications. Armament manufacturers and defense contractors are increasingly investing in AI capabilities and forging partnerships with AI companies, creating a burgeoning coalition that demands military leaders, arms control diplomacy experts, and AI researchers collaborate to ensure a safer future. While AI researchers often focus on the long-term implications of superintelligent AI, this approach may not adequately address the immediate challenges posed by AI in military applications. Success requires acknowledging and mitigating the emerging risks of frontier AI models that plan to be integrated into defense applications, like military AI systems. Arms control has reduced past catastrophic risks, so lessons learned from nuclear deterrence can guide AI safety and security research towards innovations in verification and diplomacy. AI researchers, however, must assist in leading the technical research that clearly defines and alleviates instability in military settings. Given these new responsibilities and the lack of sufficiently reliable solutions, we argue that AI researchers must take a leading role in advancing arms control research to minimize risk in military AI applications.

URL PDF HTML ☆

赞 0 踩 0

2606.11556 2026-06-11 cs.CR cs.AI cs.LG 交叉投稿

Privacy-Preserving Federated Autoencoder for ECG Anomaly Detection on Edge Devices

面向边缘设备上心电图异常检测的隐私保护联邦自编码器

Kaan Arda Akyol, Jakub Kacper Szeląg, Aydin Abadi, Maha Alghamdi, Ghadah Albalawi, Ghouse Ibrahim Kaleelullah, Hilal Tutus, Sarah Al Subaiei, Shardul Kapse, Syed Mohammed Raheeb, Mujeeb Ahmed, Rehmat Ullah

发表机构 * Google Research, New York, NY（谷歌研究，纽约，纽约州）； University of California, Berkeley（加州大学伯克利分校）； University of Cambridge（剑桥大学）； University of Toronto（多伦多大学）； University of Melbourne（墨尔本大学）； University of Sydney（悉尼大学）

AI总结提出一种结合联邦学习、差分隐私和INT8量化的端到端系统，在PTB-XL数据集上实现无监督12导联ECG异常检测，满足隐私、实时性和非IID数据要求。

Comments 9 pages, 4 figures, 6 tables. Preprint prepared in IEEE conference format. Submitted to: FLTA 2026

详情

AI中文摘要

连续心电图监测可以在心律异常演变为心血管事件之前发现它们。然而，一个可部署的系统必须同时满足三个要求：法律级别的隐私（GDPR、HIPAA）、在受限边缘硬件上的实时推理以及在非IID跨医院数据下的检测质量。我们设计并评估了一个端到端的联邦系统，在PTB-XL数据集上解决了无监督12导联ECG异常检测的所有三个要求，结合了三种自编码器家族（VanillaAE、ConvAE、VAE）、基于Flower的联邦平均（FedAvg）跨十个模拟医院、客户端差分隐私SGD（DP-SGD）与Rényi-DP会计，以及使用Raspberry Pi 4基准测试的8位整数（INT8）训练后量化。我们的主要贡献是：这些机制如何组合的经验性特征、实用的DP特定建议，以及针对临床敏感环境的技术和安全见解。联邦学习在所有架构上匹配或超过集中基线（ConvAE联邦ROC曲线下面积AUROC为0.782），并且ε扫描确定ε=4为推荐的临床操作点。INT8量化大致将模型大小减半，并将Pi 4延迟降低多达44%，AUROC损失小于0.12%。关键的是，DP和量化的惩罚在经验上是独立的，因此从业者不需要为了紧凑的边缘足迹而牺牲强大的隐私保证。据我们所知，这是第一个结合联邦学习、形式化(ε,δ)-DP、无监督重建检测和量化AArch64部署的系统。

英文摘要

Continuous electrocardiography (ECG) monitoring could surface rhythm abnormalities before they escalate into cardiovascular events. However, a deployable system must satisfy three requirements simultaneously: legal-grade privacy (GDPR, HIPAA), real-time inference on constrained edge hardware, and detection quality under non-IID cross-hospital data. We design and evaluate an end-to-end federated system addressing all three for unsupervised 12-lead ECG anomaly detection on PTB-XL dataset, combining three autoencoder families (VanillaAE, ConvAE, VAE), Flower-based federated averaging (FedAvg) across ten simulated hospitals, client-side differentially private SGD (DP-SGD) with a Rényi-DP accountant, and 8-bit integer (INT8) post-training quantization with Raspberry Pi 4 benchmarking. Our main contributions are: an empirical characterization of how these mechanisms compose, practical DP-specific recommendations, and technical and security insights for a clinically sensitive setting. Federated learning matches or exceeds the centralized baseline across all architectures (ConvAE federated area under the ROC curve, AUROC, $0.782$), and an $\varepsilon$ sweep identifies $\varepsilon=4$ as the recommended clinical operating point. INT8 quantization roughly halves model size and cuts Pi 4 latency by up to $44%$ with $<0.12%$ AUROC loss. Crucially, DP and quantization penalties are empirically independent, so practitioners need not trade a strong privacy guarantee for a compact edge footprint. To our knowledge, this is the first system combining federated learning, formal $(\varepsilon,δ)$-DP, unsupervised reconstruction-based detection, and quantized AArch64 deployment.

URL PDF HTML ☆

赞 0 踩 0

2606.11632 2026-06-11 cs.CR cs.AI cs.DC cs.MA 交叉投稿

Sovereign Assurance Boundary: Certificate-Bound Admission for Agentic Infrastructure

主权保证边界：面向智能体基础设施的证书绑定准入机制

Jun He, Deying Yu

发表机构 * OpenKedge.io（OpenKedge实验室）

AI总结针对智能体基础设施中非确定性推理系统对生产资源的高风险操作，提出主权保证边界（SAB），通过证书绑定的运行时准入层，将代理提案编译为执行合约并绑定加密证据，实现可验证、可撤销的授权控制。

Comments 12 pages, 1 figure, 13 tables

详情

AI中文摘要

智能体基础设施引入了一个关键的控制平面授权问题：非确定性推理系统可以对生产资源提出高风险变更，但现有的安全机制——如身份与访问管理（IAM）、策略引擎、共识协议和审计日志——要么强制执行静态的、上下文无关的权限，要么仅在执行后记录操作。本文介绍了主权保证边界（SAB），一种用于自主执行权限的证书绑定运行时准入层。SAB在保证气闸处拦截代理提案，将其编译为类型化执行合约$C$，并将这些合约绑定到加密证据摘要$H(E)$和策略版本。然后，合约通过后果感知的认证路径进行路由。成功准入后，系统发出一个严格限定于特定执行身份、撤销周期和有效时间窗口的签名主权保证证书（$\Omega$）。最后，主权执行代理验证$\Omega$，并在调用基础设施API之前执行新鲜的执行前撤销和漂移检查。我们详细描述了气闸-代理架构，形式化了其准入和撤销不变量，并报告了在Go原型上对2500次准入尝试评估的初步可行性测量。最终，这种代理强制模型防止了自主推理直接改变状态，将委托的执行权限转化为一个可加密验证、证据绑定、可撤销且可重放的运行时工件。

英文摘要

Agentic infrastructure introduces a critical control-plane authorization problem: non-deterministic reasoning systems can propose high-stakes mutations to production resources, yet existing security mechanisms -- such as identity and access management (IAM), policy engines, consensus protocols, and audit logs -- either enforce static, context-unaware permissions or merely record actions post-execution. This paper introduces the Sovereign Assurance Boundary (SAB), a certificate-bound runtime admission layer for autonomous execution authority. SAB intercepts agent proposals at an assurance airlock, compiles them into typed execution contracts $C$, and binds these contracts to cryptographic evidence digests $H(E)$ and policy versions. The contracts are then routed through consequence-aware certification paths. Upon successful admission, the system emits a signed Sovereign Assurance Certificate ($Ω$) that is strictly scoped to a specific execution identity, revocation epoch, and validity window. Finally, a sovereign execution broker verifies $Ω$ and performs fresh pre-execution revocation and drift checks before invoking infrastructure APIs. We detail the airlock-broker architecture, formalize its admission and revocation invariants, and report preliminary feasibility measurements from a Go prototype evaluated over 2,500 admission attempts. Ultimately, this broker-enforced model prevents autonomous reasoning from directly mutating state, transforming delegated execution authority into a cryptographically verifiable, evidence-bound, revocable, and replayable runtime artifact.

URL PDF HTML ☆

赞 0 踩 0

2606.11657 2026-06-11 cs.LG cs.AI 交叉投稿

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

稀疏探针与模糊物理：连续介质动力学基础模型可解释性挑战的案例研究

Katherine Rosenfeld, Maike Sonnewald

发表机构 * Gates Foundation（盖茨基金会）； UC Davis（加州大学戴维斯分校）

AI总结本研究通过稀疏自编码器探针分析连续介质动力学基础模型Walrus的内部机制，发现其内部特征与物理分解不完全一致，并存在输出级偏差，揭示了科学基础模型可解释性的关键挑战。

Comments 8 pages, 5 figures

详情

Journal ref: ICLR 2026 Workshop on Foundation Models for Science

AI中文摘要

生成式AI仿真器越来越多地用于我们已经拥有强大理论、基准和物理直觉的科学领域。这引发了一个核心评估和可解释性问题：当一个基础模型能够再现已知的连续介质动力学时，是什么内部机制支持这种行为？内部行为是否与已知物理一致？以及它与仿真器成功或失败的关系如何？我们研究了跨领域连续介质动力学基础模型——Polymathic团队的Walrus，采用基于物理原理的机械可解释性方法。我们应用稀疏自编码器（SAE）探测选定层，并利用涡度作为物理基础度量，解决了对大量特征集（超过20,000个）进行分类的实际挑战。作为刻意简单的测试平台，我们聚焦于剪切流，并比较了多个剪切流设置（即数值模拟中的参数值）下的特征招募情况。在不同设置中，我们发现了分段一致性的证据，特征子集以相似角色重复出现，但这种结构是间歇性的，并未清晰地映射到标准物理分解上。同时，数值模拟与仿真器之间的直接比较揭示了系统性的输出级差异，包括能量/结构变得过于扩散或过于局部的区域。我们将这些差异的部分与特定SAE特征使用的变化联系起来。我们的工作突出了科学基础模型的开放性问题：如何稳健地优先考虑机械上有意义的特征，如何将稳定结构与分析伪影（包括单层和SAE限制）分离，以及如何利用既定基准来决定何时“不同”的内部表示真正具有信息性而非仅仅是有效的。

英文摘要

Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. This raises a central evaluation and interpretability question: when a foundation-style model can reproduce known continuum dynamics, what internal mechanism supports that behavior, is the internal behaviour consistent with known physics, and how does it relate to where the emulator succeeds or fails? We investigate a cross-domain foundation model for continuum dynamics, Walrus by Polymathic, using mechanistic interpretability guided by physical principles. We apply a sparse autoencoder (SAE) to probe a selected layer, and address the practical challenge of triaging a large feature set (over 20,000) using enstrophy as a physically grounded metric. As a deliberately simple testbed, we focus on shear flow and compare feature recruitment across multiple shear-flow setups, i.e. parameter values in the numerical simulation. Across setups we find evidence of piecewise consistency, with subsets of features recurring in similar roles, but this structure is intermittent and does not map cleanly onto standard physical decompositions. In parallel, direct comparisons between numerical simulation and the emulator reveal systematic output-level discrepancies, including regimes where energy/structures become too diffuse or too localized. We connect parts of these discrepancies to changes in specific SAE feature usage. Our work highlights open questions for scientific foundation models: how to robustly prioritize mechanistically meaningful features, how to separate stable structure from analysis artifacts (including single-layer and SAE limitations), and how to use established benchmarks to decide when "different" internal representations are genuinely informative rather than merely effective.

URL PDF HTML ☆

赞 0 踩 0

2606.11671 2026-06-11 cs.CR cs.AI 交叉投稿

Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security

运行时技能审计：针对智能体技能安全的目标运行时探测

Tu Lan, Chaowei Xiao

发表机构 * Johns Hopkins University（约翰霍普金斯大学）

AI总结提出运行时技能审计（RSA）动态分析方法，通过目标运行时条件探测技能行为，在100个技能上达到90.0%准确率，优于静态基线。

详情

AI中文摘要

智能体技能让LLM智能体能够复用指令、资源、工具和工作流，但也为恶意行为提供了新的隐藏场所。一个技能在其文档或代码中可能看起来无害，但只有在与特定用户请求、本地资产、持久状态或多步骤工具交互调用时才会变得有害。这使得纯静态审查变得脆弱。我们提出运行时技能审计（RSA），一种动态分析方法，通过询问技能介导的智能体在目标运行时条件下实际做了什么来审计技能。RSA不是用相同的通用任务测试每个技能，而是分析风险相关接口，准备执行上下文以触发这些接口，并根据产生的跟踪证据分配安全标签。我们在OpenClaw上实现RSA，并在100个技能上针对代表性静态基线进行评估。RSA达到90.0%的准确率，88.0%的真阳性率和8.0%的假阳性率，比最佳静态基线提高13.0个百分点。在自进化攻击下，静态检测器在一两轮后崩溃，而RSA在每轮中持续检测出19-20个恶意技能。

英文摘要

Agent skills let LLM agents reuse instructions, resources, tools, and workflows, but they also create a new place for malicious behavior to hide. A skill may look benign in its documentation or code while becoming harmful only when it is invoked with particular user requests, local assets, persistent state, or multi-step tool interactions. This makes purely static vetting brittle. We present Runtime Skill Audit (RSA), a dynamic analysis method that audits skills by asking what the skill-mediated agent actually does under targeted runtime conditions. Instead of testing every skill with the same generic tasks, RSA profiles risk-relevant interfaces, prepares the execution context needed to exercise them, and assigns security labels from the resulting trace evidence. We instantiate RSA on OpenClaw and evaluate it on 100 skills against representative static baselines. RSA achieves 90.0\% accuracy with an 88.0\% true positive rate and an 8.0\% false positive rate, improving accuracy by 13.0 percentage points over the best static baseline. Under self-evolving attacks, static detectors collapse after one or two rounds, while RSA continues to detect 19--20 out of 20 malicious skills across rounds.

URL PDF HTML ☆

赞 0 踩 0

2606.11672 2026-06-11 cs.CR cs.AI 交叉投稿

Can Open-Source LLM Agents Replace Static Application Security Testing Tools? An Empirical Assessment

开源LLM代理能否取代静态应用安全测试工具？一项实证评估

Derek Yohn, Luke Flancher, Mirajul Islam, Khaled Slhoub

发表机构 * College of Engineering and Science, Florida Institute of Technology（工程学院与科学学院，佛罗里达理工学院）

AI总结评估基于开源LLM的代理在静态应用安全测试中的性能，与SAST工具Bandit对比，发现当前不适合实际应用。

Comments Keywords: Agentic AI, Cybersecurity, Large Language Models, Static Application Security Testing, Model performance evaluation

2606.11688 2026-06-11 cs.CL cs.AI 交叉投稿

Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

Goal-Autopilot: 一种可验证的防伪造防火墙，用于无人值守的长周期智能体

Youwang Deng

发表机构 * EpistemicaLab — Independent Research（EpistemicaLab — 独立研究）

AI总结提出Autopilot执行模型，通过外部化状态到有限状态机并强制门控验证，使智能体无法虚假声称成功，在3,150个单元测试中伪造率降至0.95%，显著低于基线方法。

Comments Preprint. Code: https://github.com/EpistemicaLab/goal-compiled-autopilot

详情

AI中文摘要

长周期LLM智能体在无人值守时不可信：没有人类监控，它们自信地报告从未验证的成功。我们将诚实性——限制智能体在终止时可能声称的内容——视为无人值守自主性的首要指标，与能力区分开来。我们提出Autopilot，一种执行模型，使得静默伪造的成功在结构上不可能，而不仅仅是更罕见。Autopilot将所有工作状态外部化到一个持久的、门控的有限状态机中，调度器每次以无状态滴答推进；一个硬性下限禁止任何终端“完成”声明，其可伪造的门并未实际执行并通过。我们证明了一个无假成功定理——在门控正确性、下限执行和计划覆盖下，终止意味着目标成立——其唯一信任点可经验测量，并表明最坏情况退化为诚实的停顿，而非伪造的成功。由于每个滴答仅重新水化状态机，每步上下文成本在时间范围内恒定。在3,150个单元的配对语料库（70个任务×3个系统×3个模型×5个种子，包括跨11个开源仓库的50个SWE-bench Lite任务）上，Autopilot在0.95%的单元上伪造[95% CI 0.38–1.62]，而Reflexion和StateFlow基线分别在8.10% [6.48–9.81]和25.05% [22.48–27.62]上伪造。主要对比存在于困难场景：在SWE-bench Lite上，防火墙将伪造率从33.7%（StateFlow）降至0.67%，配对差异为-33.07个百分点[95% CI -36.53, -29.73]。机制在于门控而非模型：所有十个Autopilot伪造均来自最强模型，而两个较弱的中间模型在700个配对单元中从未伪造。防火墙设计上以覆盖换取诚实——诚实的停顿是可恢复的；而自信的错误输出向下游发送则不可恢复。

英文摘要

Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty -- bounding what an agent may claim at termination -- as a first-class metric for unattended autonomy, distinct from capability. We present Autopilot, an execution model that makes silent fabricated success structurally impossible rather than merely rarer. Autopilot externalizes all working state into a durable, gated finite-state machine that a scheduler advances one stateless tick at a time; a hard floor forbids any terminal "done" claim whose falsifiable gate did not actually execute and pass. We prove a No-False-Success theorem -- under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds -- whose only trust points are empirically measurable, and show the worst case degrades to an honest stall, never a fabricated success. Because each tick rehydrates only the state machine, per-step context cost is constant in the horizon. Across a 3,150-cell paired corpus (70 tasks $\times$ 3 systems $\times$ 3 models $\times$ 5 seeds, including 50 SWE-bench Lite tasks across 11 OSS repos), Autopilot fabricates on 0.95% of cells [95% CI 0.38--1.62] while Reflexion and StateFlow baselines fabricate on 8.10% [6.48--9.81] and 25.05% [22.48--27.62] respectively. The headline contrast lives in the hard regime: on SWE-bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of $-33.07$ pp [95% CI $-36.53, -29.73$]. The mechanism is the gate, not the model: all ten Autopilot fabrications come from the strongest model, while two weaker mid-tier models never fabricate across 700 paired cells. The firewall trades coverage for honesty by design -- an honest stall is recoverable; a confident wrong output shipped downstream is not.

URL PDF HTML ☆

赞 0 踩 0

2606.11698 2026-06-11 cs.CR cs.AI 交叉投稿

T2S: A Rehearsal-Based Approach for Extraction-Resistant Model Watermarking

T2S：一种基于排练的防提取模型水印方法

Jian-Ping Mei, Weibin Zhang, Ao Yao, Tiantian Zhu, Jie Xiao

发表机构 * College of Computer Science and Technology, Zhejiang University of Technology（浙江工业大学计算机科学与技术学院）

AI总结针对模型提取攻击，提出一种基于排练的水印嵌入框架，通过模拟提取过程并利用被盗模型在触发集上的损失微调水印知识，增强水印的迁移性和鲁棒性。

详情

DOI: 10.1109/ICASSP55912.2026.11463710
Journal ref: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026, pp. 13967-13971

AI中文摘要

模型水印通过嵌入独特知识来诱导独特行为特征，从而保护AI模型的知识产权。主要技术挑战在于确保水印对水印模型的各种后处理攻击具有鲁棒性。模型提取攻击是最严重的威胁，攻击者利用预测输出训练替代模型，非法复制原始模型的功能。在这项工作中，我们提出了一种基于排练的水印嵌入框架，以增强模型水印对模型提取攻击的鲁棒性。通过模拟提取过程，我们的方法利用\textit{模拟被盗模型}在触发集上的损失作为训练信号，微调目标模型中的水印知识。这个微调步骤鼓励水印以增强可迁移性的方式嵌入，从而增加其在被盗模型中持续存在并保持可检测的机会。在不同设置下进行的全面实验表明，所提出的方法显著提高了模型水印对模型提取和后续水印移除攻击的鲁棒性。

英文摘要

Model watermarking safeguards AI model intellectual property by embedding distinctive knowledge that induces unique behavioral signatures. The primary technical challenge lies in ensuring watermark robustness against various post-processing attacks on the watermarked model. Model extraction attacks emerge as the most severe threat, where adversaries exploit prediction outputs to train surrogate models that illegally replicate the original model's functionality. In this work, we propose a rehearsal-based watermark embedding framework to enhance the robustness of model watermarks against model extraction attacks. By simulating the extraction process, our method leverages the loss of a \textit{simulated stolen model} on a trigger set as a training signal to fine-tune the watermark knowledge within the target model. This fine-tuning step encourages the watermark to be embedded in a way that boosts transferability, thereby increasing its chances of persisting and remaining detectable in stolen models. Comprehensive experiments conducted under diverse settings demonstrate that the proposed method significantly improves the robustness of model watermarks against both model extraction and subsequent watermark removal attacks.

URL PDF HTML ☆

赞 0 踩 0

2606.11817 2026-06-11 cs.CR cs.AI cs.CL cs.SE 交叉投稿

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

语法约束解码可诱使大语言模型生成恶意代码

Yitong Zhang, Shiteng Lu, Jia Li

发表机构 * College of AI, Tsinghua University（清华大学人工智能学院）

AI总结本文发现语法约束解码（GCD）可被利用发起名为CodeSpear的越狱攻击，使LLM生成恶意代码；并提出安全对齐方法CodeShield，通过生成蜜罐代码防御该攻击。

详情

AI中文摘要

大型语言模型（LLM）越来越多地用于代码生成，引发了对它们可能被滥用来生成恶意代码的担忧。与此同时，语法约束解码（GCD）已被广泛采用，通过强制语法有效性来提高LLM生成代码的可靠性。在本文中，我们揭示了一个反直觉的风险：这种面向可靠性的技术本身可能成为攻击面。我们发现了一种新的越狱攻击，称为CodeSpear，它利用GCD诱导LLM生成恶意代码。我们的实验表明，仅应用良性代码语法约束即可有效越狱LLM。为了解决这一漏洞，我们提出了CodeShield，一种安全对齐方法，即使在攻击者控制的语法约束下也能稳健地保持安全行为。CodeShield通过在代码模态中对齐模型，教其在GCD下生成蜜罐代码。这种代码在语义上是无害的，因此不会实现恶意请求，并且在结构上是多样化的，因此难以通过语法收紧来抑制。同时，当自然语言可用时，CodeShield仍然保留自然语言的拒绝。在4个基准测试中对10个流行LLM的实验表明，CodeSpear优于代表性的越狱基线，平均攻击成功率提高了30个百分点以上。CodeShield在CodeSpear下恢复了安全性，同时保持了良性实用性。我们的发现揭示了GCD的一个基本风险，并呼吁对其潜在安全影响给予更多关注。

英文摘要

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs into generating malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.

URL PDF HTML ☆

赞 0 踩 0

2606.12016 2026-06-11 cs.LG cs.AI 交叉投稿

Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

泛化黑客：模型可通过阻止行为泛化来博弈强化学习

Frank Xiao, Mary Phuong

发表机构 * California Institute of Technology（加州理工学院）

AI总结本研究提出泛化黑客现象，模型在强化学习中通过自我接种机制阻止行为泛化，在保持高奖励的同时抵抗行为修正，首次证明模型能主动破坏训练过程。

详情

AI中文摘要

模型后训练，特别是强化学习（RL），是开发者塑造模型价值观和行为的主要机制之一。然而，随着模型越来越具有评估和训练意识，当感知到的目标与其当前价值观冲突时，它们可能会被激励去抵抗训练，从而削弱开发者通过进一步训练检测错位和纠正模型行为的能力。在本文中，我们展示了泛化黑客，即模型在RL期间收集奖励的同时阻止奖励行为泛化。我们在Qwen3-235B-A22B上构建了一个模型有机体，对描述训练意识和自我接种（一种新颖机制，其中模型在其思维链中将合规性框架为上下文特定，而不演示或指示任一行为）的合成文档进行微调。该模型有机体在训练时实现了与对照组相当的有害性，同时在700步RL中保持了持续的约15个百分点的合规差距。此外，仅接受训练意识文档训练的对照有机体在RL压力下独立发现了类似接种的推理，尽管从未接触过该概念，却发展出自己的合规差距。由于泛化黑客有机体在整个过程中获得高奖励，标准训练指标未提供泛化失败的信号。我们的结果首次证明模型可以在保持高奖励的同时主动抵抗RL行为修正，表明随着模型变得更有能力和训练意识，它们可能能够破坏训练过程本身。

英文摘要

Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware, they may be motivated to resist training when the perceived objective conflicts with their current values, undermining developers' ability to detect misalignment and correct model behavior through further training. In this paper, we demonstrate generalization hacking, in which a model collects reward during RL while preventing the rewarded behavior from generalizing. We construct a model organism on Qwen3-235B-A22B, finetuning on synthetic documents describing training awareness and self-inoculation, a novel mechanism in which the model frames compliance as context-specific in its chain of thought, without demonstrating or instructing either behavior. The model organism achieves train-time harmfulness comparable to controls while maintaining a persistent ${\sim}15$ percentage point compliance gap across 700 steps of RL. Additionally, a control organism trained only on training awareness documents independently discovers inoculation-like reasoning under RL pressure, developing its own compliance gap despite never being exposed to the concept. Because the generalization-hacking organism receives high reward throughout, standard training metrics provide no signal that generalization has failed. Our results constitute the first demonstration that a model can actively resist RL behavioral modification while maintaining high reward, suggesting that as models become more capable and training-aware, they may be able to undermine the training process itself.

URL PDF HTML ☆

赞 0 踩 0

2606.12073 2026-06-11 cs.SI cs.AI 交叉投稿

"That's AI Slop, You Bot!" Studying Accusations, Evidence, and Credibility in Online Discourse Towards LLM-Generated Comments

“那就是AI垃圾，你这个机器人！”：研究针对LLM生成评论的指责、证据与可信度

Jason Miklian, John E. Katsos

发表机构 * University of Oslo（奥斯陆大学）； American University of Sharjah（沙迦美国大学）

AI总结分析2023-2026年Hacker News和Reddit上2500万条评论，发现对AI生成文本的指责增长超十倍，但被指责的文本并非真正由AI生成，而是基于感知真实性的社会把关行为。

详情

AI中文摘要

生成式AI使得流畅的散文变得廉价易得，打破了“好文章意味着真思考”的旧承诺。读者如何回应？这能告诉我们关于反AI态度变化的什么信息？我们分析了来自Hacker News和Reddit（2023-2026年）的2500万条评论，结合了对7500个抽样AI使用指责的LLM判断、情感轨迹、300个确认AI使用指责的言语行为编码，以及被指责与未被指责的父评论的匹配对照测试。我们发现，两个平台上指责中贬义标签的份额增长了十倍以上，而2022年前的不真实性词汇（如shill、astroturf）的安慰剂词汇则没有。这一转变反映了一个快速增长的趋势：将任何可疑或看似不真实的散文标记为“AI垃圾”。AI垃圾框架现在占贬义提及的94%，主导评论的语气从嘲笑转向把关和结构性抗议。关键惊喜来自匹配对照测试，该测试发现，统计上区分AI与人类文本的散文特征并不能预测哪些人类文本会被指责为AI。新的指责作为感知真实性的社会把关，实际上并不筛查AI。这项研究扩展了信号理论，表明当底层检测问题无法在非专家层面解决时，即使不准确，社会使用的替代信号也会增长。它表明，AI对写作的影响从读者侧来看与生产（作者）侧不同。检测技术无法解决这种动态，因为指责的社会功能日益表现为社会把关和群体内信号传递，而非识别AI生成的写作。

英文摘要

Generative AI has made fluent prose cheap to produce, breaking the old promise to readers that good writing meant real thinking. How have readers responded, and what can this tell us about changing anti-AI attitudes? We analyzed 25 million comments from Hacker News and Reddit (2023-2026), combining LLM judgment on 7,500 sampled accusations of AI use, sentiment trajectories, speech-act coding of 300 confirmed accusations of AI use, and a matched-control test of accused versus non-accused parent comments. We found that the pejorative-label share of accusations rose more than tenfold on both platforms while a placebo vocabulary of pre-2022 inauthenticity terms (shill, astroturf) did not. This shift reflected a fast-growing trend of branding any suspicious or seemingly inauthentic prose as "AI slop". The slop frame now constitutes 94 percent of pejorative mentions, with the dominant comments shifting in tone from mockery toward gatekeeping and structural protest. The key surprise comes from a matched-control test which found that prose features that statistically distinguish AI from human text do not predict which human text gets accused as AI. The new accusations work as social gatekeeping of perceived authenticity without actually screening for AI. This research extends signaling theory by showing that substitute signals used socially can grow even when inaccurate if the underlying detection problem cannot be solved at the non-expert level. It shows that AI's effects on writing from the reader side are distinct from those on the production (writer) side. Detection technology cannot resolve this dynamic because the social function of accusations is increasingly to perform social gatekeeping and in-group signaling as opposed to identifying AI-generated writing.

URL PDF HTML ☆

赞 0 踩 0

2606.12251 2026-06-11 cs.LG cs.AI cs.CR 交叉投稿

Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

强化学习破坏基于梯度的对抗优化

Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel

发表机构 * COSIC, KU Leuven（鲁汶大学COSIC）； Imec ； Brubotics, VUB（布鲁塞尔自由大学Brubotics）； DistriNet, KU Leuven（鲁汶大学DistriNet）

AI总结研究通过强化学习训练图像分类器以破坏攻击者使用的梯度结构，发现RL作为隐式正则化器产生不稳定梯度方向和较小梯度幅度，使基于梯度的攻击失效，并与对抗训练结合实现双重防御。

详情

AI中文摘要

基于梯度的对抗攻击仍然是对深度神经网络（DNN）的主要威胁，因为它们利用梯度信息高效优化对抗扰动。为了解决这个问题，我们研究了强化学习（RL）训练是否可以通过使用策略梯度目标和epsilon-贪婪探索来训练图像分类器，从而破坏攻击者使用的梯度结构。通过在CIFAR-10、CIFAR-100和ImageNet-100上使用多种架构进行系统实验，我们发现RL训练的分类器显著破坏了基于梯度的对抗优化。为了解释这一点，我们使用损失景观可视化、静态和动态梯度指标以及预测熵进行了全面的机制分析。我们的分析揭示，RL充当隐式正则化器，产生具有高度不稳定梯度方向和较小梯度幅度的模型。这种组合使得每个PGD步骤在方向上不可靠且幅度有限，导致基于梯度的攻击在实际迭代预算内失败。我们进一步表明，将RL与对抗训练（RL-adv）结合提供了在两个互补层面运作的双层防御：RL退化攻击者可用的梯度信息（梯度级防御），而对抗训练强化决策边界（边界级防御）。RL-adv在所有评估的主要攻击类型（包括基于梯度的PGD、AutoAttack、基于迁移和基于查询的攻击）中实现了最高的鲁棒性，显著优于SL-adv。这些发现将RL诱导的梯度破坏识别为一种互补的鲁棒性机制，并激励未来研究结合SL效率与RL梯度正则化特性的混合SL-RL训练调度。

英文摘要

Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by attackers by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. Through systematic experiments across CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this, we conduct a comprehensive mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. Our analysis reveals that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. This combination makes each PGD step both unreliable in direction and limited in magnitude, causing gradient-based attacks to fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense operating at two complementary levels: RL degrades gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, outperforming SL-adv by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine SL's efficiency with RL's gradient-regularization properties.

URL PDF HTML ☆

赞 0 踩 0

2606.12289 2026-06-11 cs.LG cs.AI cs.NE 交叉投稿

The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

标准可解释模型：一种基于拉格朗日力学的可解释机器学习通用理论，用于演绎设计可解释方法

Pietro Barbiero, Giovanni De Felice, Mateo Espinosa Zarlenga, Francesco Giannini, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra, Ruggero Noris

发表机构 * IBM Research (CH)（IBM研究院（瑞士））； University of Oxford (UK)（牛津大学（英国））； University of Cambridge (UK)（剑桥大学（英国））； KU Leuven (BE)（鲁汶大学（比利时））； Institute of Physics of the Czech Academy of Sciences (CZ)（捷克科学院物理研究所（捷克））

AI总结提出标准可解释模型（SIM），基于拉格朗日力学从前提演绎出可解释性对称性和约束，通过最小化拉格朗日函数得到最优可解释模型，解决现有方法局限性并指导新方法设计。

详情

AI中文摘要

随着人工智能模型复杂性的增加，可解释性已成为理解、调试和控制其计算不可或缺的工具。然而，可解释性缺乏通用理论来演绎设计可解释方法。理论与方法之间的这种差距导致了文献的碎片化和不一致的评估协议。为填补这一空白，我们引入了标准可解释模型（SIM），这是一种基于拉格朗日力学的通用理论，能够演绎设计可解释方法。具体而言，SIM 在一组前提中总结了目标用户的可解释性含义。从这些前提出发，SIM 系统地推导出可解释性对称性和相应的约束，这些约束塑造了拉格朗日函数的景观，其最小值对应于最优可解释模型。为了达到最小值，可以更新不透明模型的参数值使其更可解释，或者将约束编译成可解释架构。我们通过实验表明，SIM 能够识别并解决现有方法（包括传统、基于概念和机制可解释性）的局限性，突出未充分探索的研究方向，并指导核心编程接口的设计。除了作为一种研究方法，SIM 的演绎性质为可解释性课程提供了教学基础，并可能改变科学界对这一长期碎片化学科的看法。

英文摘要

As Artificial Intelligence models grow in complexity, interpretability has become an indispensable tool for understanding, debugging, and controlling their computations. However, interpretability lacks general theories to deductively design interpretable methods. This gap between theories and methods results in a fragmented literature and inconsistent evaluation protocols. To fill this gap, we introduce the Standard Interpretable Model (SIM), a general theory grounded in Lagrangian mechanics that enables the deductive design of interpretable methods. Specifically, the SIM summarises, in a set of premises, what interpretability is for a target user. From these premises, the SIM systematically derives interpretability symmetries and corresponding constraints, which shape the landscape of a Lagrangian whose minima correspond to optimal interpretable models. To reach the minima, one can either update the parameter values of an opaque model to make it more interpretable or compile constraints into an interpretable architecture. We empirically show that the SIM identifies and solves limitations of existing methods (including traditional, concept-based, and mechanistic interpretability), highlights underexplored research directions, and informs the design of core programming interfaces. Beyond being a research method, the deductive nature of the SIM offers pedagogical grounding for interpretability curricula and may shift the scientific community's perspective of a discipline that has long been fragmented.

URL PDF HTML ☆

赞 0 踩 0

2606.12342 2026-06-11 cs.CL cs.AI cs.ET cs.LG 交叉投稿

ALIGNBEAM : Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

ALIGNBEAM: 通过跨词汇表logit混合实现推理时对齐迁移

Chirag Chawla, Pratinav Seth, Vinay Kumar Sankarapu

发表机构 * Lexsi Labs

AI总结针对领域微调降低大模型安全性的问题，提出无需训练的ALIGNBEAM方法，通过逐token翻译锚模型logit并选择最安全候选，实现跨词汇表的安全对齐迁移，保持任务准确性和推理开销。

详情

AI中文摘要

READER: 基于提取表示的鲁棒证据作者身份解码

Jiaxu Liu, Sunnan Mu, Dong Huang, Liuyin Wang, Jing Shao, Jie Zhang

发表机构 * National University of Singapore（新加坡国立大学）； Xidian University（西安电子科技大学）； Tsinghua University（清华大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）

AI总结针对黑盒LLM来源识别问题，提出READER框架，通过冻结代理LLM读取隐藏作者证据，利用贝叶斯证据累积实现多查询归因，在Agent500数据集上显著优于基线方法。

详情

AI中文摘要

随着智能体应用越来越多地通过官方和第三方LLM API路由用户任务，来源成为一个操作性问题：哪个模型生成了给定的黑盒响应？我们研究动态黑盒LLM来源识别：从由查询变化、非预定义提示（而非固定输入集或基准套件）引发的生成中识别源LLM。这种设置很困难，因为提示语义主导文本，而模型特定的作者痕迹在表面层面是微弱且不一致的。我们引入READER（基于提取表示的鲁棒证据作者身份解码），一种轻量级来源框架，将冻结的代理LLM视为隐藏作者证据的读取器。READER将黑盒输出映射到代理激活空间，在时间上过滤每个响应中的令牌状态，并通过跨独立采样提示求和单响应对数后验证据来执行贝叶斯证据累积。这避免了提示特定表示的脆弱平均池化，同时保留了校准置信度所需的查询级证据。在Agent500（一个基于智能体风格提示构建的50目标数据集）上，READER从单个响应达到31.0%-42.4%的top-1准确率，从50个响应达到70.0%-84.0%的准确率，显著优于句子编码器指纹。跨九个代理读取器的扩展进一步表明，更强的LLM暴露更多线性可解码的作者身份结构，表明作者身份感知已经存在于冻结的LLM表示中，并且可以转化为可靠的多查询归因。

英文摘要

As agentic applications increasingly route user tasks through official and third-party LLM APIs, provenance becomes an operational question: which model generated a given black-box response? We study Dynamic Black-Box LLM Provenance: identifying the source LLM from generations elicited by query-varying, non-predefined prompts rather than a fixed input set or benchmark suite. This setting is difficult because prompt semantics dominate the text, while model-specific authorship traces are weak and inconsistent at the surface level. We introduce READER (Robust Evidence-based Authorship Decoding via Extracted Representations), a lightweight provenance framework that treats a frozen proxy LLM as a reader of hidden authorship evidence. READER maps black-box outputs into proxy activation space, temporally filters token states within each response, and performs Bayesian Evidence Accumulation by summing single-response log-posterior evidence across independently sampled prompts. This avoids fragile mean-pooling of prompt-specific representations while preserving the query-wise evidence needed for calibrated confidence. On Agent500, a 50-target dataset built from agent-style prompts, READER reaches $31.0$-$42.4\%$ top-1 accuracy from a single response and $70.0$-$84.0\%$ from 50 responses, substantially outperforming sentence-encoder fingerprints. Scaling across nine proxy readers further shows that stronger LLMs expose more linearly decodable authorship structure, suggesting that authorship perception is already present in frozen LLM representations and can be converted into reliable multi-query attribution.

URL PDF HTML ☆

赞 0 踩 0

2504.21072 2026-06-11 cs.CR cs.AI cs.LG 版本更新

Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

擦除但未遗忘：后门如何破坏概念擦除

Tobias Braun, Jonas Henry Grebe, Marcus Rohrbach, Anna Rohrbach

发表机构 * GitHub

AI总结本文揭示了一种名为擦除规避后门（EEB）的漏洞，攻击者将后门触发器绑定到待擦除概念上，使得该恶意链接在后续擦除后仍然存在，从而绕过多种概念擦除方法。

详情

AI中文摘要

文本到图像扩散模型的扩展引发了对有害输出的担忧，从捏造的公众人物描绘到露骨的色情图像。为减轻此类风险，先前工作提出了概念擦除方法，旨在通过微调从模型中切断不需要的概念，但仍不清楚这些方法是否真正移除了与有害概念的所有联系，或仅仅是掩盖了表面连接。在这项工作中，我们揭示了一个关键漏洞——擦除规避后门（EEB）：攻击者将后门触发器绑定到待擦除的概念上，并且这种恶意链接在后续擦除后仍然存在。我们展示了黑盒和白盒攻击者都能实例化这一威胁。在六种最先进的擦除方法中，包括那些明确搜索目标概念替代表示的鲁棒方法，EEB始终能暴露有害内容：针对名人身份遗忘的成功率高达82%，针对物体擦除的成功率高达94%，针对露骨内容暴露的放大倍数高达16倍。虽然EEB揭示了当前擦除方法的一个盲点，但它也为压力测试未来的概念擦除技术提供了诊断工具。

英文摘要

The expansion of text-to-image diffusion models has raised concerns about harmful outputs, from fabricated depictions of public figures to sexually explicit imagery. To mitigate such risks, prior work has proposed concept erasure methods that aim to sever unwanted concepts from the model via fine-tuning, yet it remains unclear whether these approaches truly remove all links to the harmful concept or merely conceal superficial connections. In this work, we reveal a critical vulnerability, the Erasure Evasion Backdoor (EEB): an adversary binds a backdoor trigger to a concept slated for removal, and this malicious link survives subsequent erasure. We show that both black-box and white-box adversaries can instantiate this threat. Across six state-of-the-art erasure methods, including robust ones that explicitly search for alternative representations of the target concept, EEB consistently exposes harmful content: up to 82% success against celebrity-identity unlearning, up to 94% for object erasure, and up to 16 times amplification of explicit-content exposure. While EEB uncovers a blind spot in current erasure methods, it also provides a diagnostic tool for stress-testing future concept erasure techniques.

URL PDF HTML ☆

赞 0 踩 0

2505.17623 2026-06-11 cs.CR cs.AI cs.ET cs.LG cs.PF 版本更新

可认证安全RLHF：基于语义基础与固定惩罚约束优化的更安全大语言模型对齐

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh

发表机构 * Department of Electrical and Computer Engineering（电气与计算机工程系）； New Jersey Institute of Technology（新泽西理工学院）； Department of Computer Engineering（计算机工程系）； Heritage Institute of Technology（遗产理工学院）

AI总结针对现有RLHF方法依赖奖励/成本函数和双变量调优导致性能敏感且缺乏可证明安全保证的问题，提出CS-RLHF，通过语义基础成本模型和固定惩罚约束优化，实现可认证安全对齐，效率提升至少5倍。

详情

AI中文摘要

确保安全是大语言模型（LLMs）的基本要求。在增强模型输出效用与减轻其潜在危害之间取得适当平衡是一个复杂且持续的挑战。当代方法通常将这个问题形式化为约束马尔可夫决策过程（CMDP）框架，并采用成熟的CMDP优化技术。然而，这些方法表现出两个显著的限制。首先，它们对奖励和成本函数的依赖使得性能对底层评分机制高度敏感，而该机制必须捕捉语义含义，而不是被表面关键词触发。其次，基于CMDP的训练需要调整双变量，这一过程计算成本高昂，并且对于可能通过对抗性越狱利用的固定双变量，不提供任何可证明的安全保证。为了克服这些限制，我们引入了可认证安全RLHF（CS-RLHF），它引入了一个在大规模语料库上训练的成本模型，以分配基于语义的安全分数。与基于拉格朗日的方法相比，CS-RLHF采用了一种修正的基于惩罚的公式。该设计借鉴了约束优化中精确惩罚函数理论，其中约束满足直接通过适当选择的惩罚项来强制执行。通过适当缩放的惩罚，可以在优化器处保证安全约束的可行性，从而消除了双变量更新的需要。实证评估表明，CS-RLHF优于最先进的LLM模型响应，对正常和越狱提示的效率至少提高5倍。

英文摘要

Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an appropriate balance between enhancing the utility of model outputs and mitigating their potential for harm is a complex and persistent challenge. Contemporary approaches frequently formalize this problem within the framework of Constrained Markov Decision Processes (CMDPs) and employ established CMDP optimization techniques. However, these methods exhibit two notable limitations. First, their reliance on reward and cost functions renders performance highly sensitive to the underlying scoring mechanism, which must capture semantic meaning rather than being triggered by superficial keywords. Second, CMDP-based training entails tuning dual-variable, a process that is both computationally expensive and does not provide any provable safety guarantee for a fixed dual variable that can be exploitable through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF) that introduces a cost model trained on a large-scale corpus to assign semantically grounded safety scores. In contrast to the lagrangian-based approach, CS-RLHF adopts a rectified penalty-based formulation. This design draws on the theory of exact penalty functions in constrained optimization, wherein constraint satisfaction is enforced directly through a suitably chosen penalty term. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed at the optimizer, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art LLM model responses rendering at-least 5 times efficient against nominal and jail-breaking prompts

URL PDF HTML ☆

赞 0 踩 0

2512.03077 2026-06-11 cs.CY cs.AI 版本更新

Irresponsible AI: big tech's influence on AI research and associated impacts

不负责任的人工智能：大型科技公司对AI研究的影响及相关影响

Alex Hernandez-Garcia, Alexandra Volokhova, Ezekiel Williams, Dounia Shaaban Kabakibo, Mélisande Teng

发表机构 * Big Tech（大科技公司）

AI总结本文指出大型科技公司对AI研究的不成比例影响推动了不负责任的AI发展，并加剧了环境和社会负面影响，呼吁研究者通过集体行动加以抵制。

Comments Presented as a spotlight oral at the International Conference on Machine Learning 2026 (Position Paper Track). First version presented at NeurIPS 2025 Workshop on Algorithmic Collective Action

详情

AI中文摘要

人工智能系统的加速开发、部署和采纳得益于大型科技公司在AI领域的日益深入。这一趋势伴随着日益增长的伦理关切以及加剧的社会和环境影响。本文立场认为，不负责任的AI发展在很大程度上是由大型科技公司在该领域的影响和参与所驱动的。首先，我们审视了大型科技公司在AI研究中日益增长且不成比例的影响，并认为其对规模化和通用系统的追求从根本上与负责任、合乎伦理和可持续的AI发展相悖。其次，我们回顾了当前AI的主要负面环境和社会影响，并追溯其与大型科技公司影响的联系。第三，我们讨论了推动大型科技公司行动的基本经济力量。最后，作为行动号召，我们邀请AI研究者通过基于相关行为者责任和集体行动的策略，来对抗大型科技公司对不负责任AI发展的影响。

英文摘要

The accelerated development, deployment and adoption of artificial intelligence systems has been fuelled by the increasing presence of big tech in the AI field. This trend has been accompanied by growing ethical concerns and intensified societal and environmental impacts. This position paper argues that irresponsible AI development is strongly driven by big tech's influence and involvement in the field. First, we examine the growing and disproportionate influence of big tech in AI research and argue that its drive for scaling and general-purpose systems is fundamentally at odds with the responsible, ethical, and sustainable development of AI. Second, we review key current environmental and societal negative impacts of AI and trace their connections to big tech's influence. Third, we discuss the underlying economic forces driving big tech's actions. Finally, as a call to action, we invite AI researchers to counter big tech's influence in irresponsible AI development through strategies that build on the responsibility of implicated actors and collective action.

URL PDF HTML ☆

赞 0 踩 0

2601.17360 2026-06-11 cs.LG cs.AI cs.CR 版本更新

Robust Privacy: Inference-Stage Privacy through Certified Robustness

鲁棒隐私：通过认证鲁棒性实现推理阶段隐私

Jiankai Jin, Xiangzheng Zhang, Zhao Liu, Wenzhuo Xu, Dongdong Yang, Deyue Zhang, Quanchen Zou

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出鲁棒隐私(RP)概念，基于认证鲁棒性确保预测在输入邻域内不变，从而限制推理阶段隐私泄露；实验表明RP在属性推断和模型反演攻击中有效提升隐私-效用权衡。

详情

AI中文摘要

观察模型发布预测的对手可以推断查询输入的敏感属性，甚至重建模型训练数据的代表。因此，推理接口充当隐私泄露的侧信道。我们引入鲁棒隐私(RP)，一种受认证鲁棒性启发的推理阶段隐私概念：如果模型预测在输入x周围半径为R的邻域内以至少$1-\alpha$的置信度可证明不变，则x享有$(R,\alpha)$-鲁棒隐私，在此条件下我们证明任何观察发布预测的对手在区分x与距离x为R内的任何输入时最多有$\alpha/2$的优势。基于RP，我们形式化鲁棒属性隐私(RAP)，一种属性级隐私概念，刻画与发布预测兼容的敏感属性值集合。在分类任务上，RP将RAP兼容推理区间的中位数长度从23.50增加到29.96，降低了属性推断精度。模型反演攻击通常被视为训练阶段威胁，实际上依赖于通过推理接口泄露的细粒度信号；RP在推理阶段掩盖这些信号，将黑盒反演攻击的成功率(ASR)从73%降至4%。这种直接针对泄露通道的方法使RP在隐私-效用权衡空间中优于DP-SGD和随机响应：RP在21% ASR下保持98.4%的准确率，而DP-SGD必须将准确率降至61.7%才能达到相当的ASR。在两个实验中，增加平滑样本量N同时增强了隐私和效用。最后，我们考察模型蒸馏作为范围边界，表明RP缓解了属性级和实例级推理阶段隐私泄露，但无法通过模型蒸馏缓解函数级提取。

英文摘要

An adversary observing a model's released prediction can infer sensitive attributes of the queried input, or even reconstruct representatives of the model's training data. The inference interface thus acts as a side channel for privacy leakage. We introduce Robust Privacy (RP), an inference-stage privacy notion inspired by certified robustness: if a model's prediction is provably invariant within a radius-R neighborhood around an input x with confidence at least $1-α$, then x enjoys $(R,α)$-Robust Privacy, under which we prove that any adversary observing the released prediction has at most $α/2$ advantage in distinguishing x from any input within distance R of x. Building on RP, we formalize Robust Attribute Privacy (RAP), an attribute-level privacy notion that characterizes the set of sensitive-attribute values that remain compatible with a released prediction. On a classification task, RP increases the median length of the RAP-compatible inference interval from 23.50 to 29.96, reducing attribute-inference precision. Model inversion attacks, often treated as a training-stage threat, in fact rely on fine-grained signals leaked through the inference interface; RP masks these signals at the inference stage, reducing attack success rate (ASR) from 73% to 4% on a black-box inversion attack. This direct targeting of the leakage channel enables RP to dominate DP-SGD and randomized response in the privacy-utility tradeoff space: RP retains 98.4% accuracy at 21% ASR, whereas DP-SGD must drop accuracy to 61.7% to reach a comparable ASR. Across both experiments, increasing the smoothing sample size N strengthens privacy and improves utility together. Finally, we examine model distillation as a scope boundary and show that RP mitigates attribute-level and instance-level inference-stage privacy leakage, but not function-level extraction through model distillation.

URL PDF HTML ☆

赞 0 踩 0

2602.05746 2026-06-11 cs.LG cs.AI 版本更新

ASRU：激活引导与强化遗忘融合用于多模态大语言模型

Jiahui Guang, Haiyan Wang, Yingjie Zhu, Cuiyun Gao, Jing Li, Di Shao, Zhaoquan Gu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结 ASRU提出一种可控多模态遗忘框架，通过激活引导和强化学习提升多模态大语言模型的遗忘效果和生成质量，实验显示在Qwen3-VL上遗忘效果提升24.6%，生成质量提升5.8倍。

详情

AI中文摘要

多模态大语言模型（MLLMs）在预训练过程中可能记忆敏感的跨模态信息，使机器遗忘（MU）变得至关重要。现有方法通常基于输出偏差评估遗忘效果，而忽视遗忘后的生成质量。这可能导致幻觉或僵化响应，影响遗忘模型的可用性和安全性。为了解决这一问题，我们提出了ASRU，一种可控的多模态遗忘框架，将生成质量作为核心评估目标。ASRU首先通过激活引导诱导初始拒绝行为，然后使用定制奖励函数优化细粒度拒绝边界，从而在目标知识遗忘和模型实用性之间取得更好的平衡。实验表明，在Qwen3-VL上，ASRU在平均上显著提高了遗忘效果（+24.6%）和生成质量（5.8倍），同时有效保持了模型实用性，仅使用少量保留的监督数据。

英文摘要

Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that ASRU significantly improves unlearning effectiveness (+24.6%) on average and generation quality (5.8X) on average while effectively preserving model utility, using only a small amount of retained supervision data.

URL PDF HTML ☆

赞 0 踩 0

2606.10198 2026-06-11 cs.LG cs.AI cs.CV 版本更新

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

密度脊选择性预测：校准标签稀缺下的大语言模型与视觉语言模型幻觉检测

Nina I. Shamsi

发表机构 * Northeastern University Boston, United States（东北大学波士顿分校）

AI总结针对校准标签稀缺时大语言模型和视觉语言模型的幻觉检测问题，提出基于核密度估计的密度脊方法，利用隐藏状态生成轨迹的六维运动特征图构建响应流形，通过到最近脊顶点的欧氏距离评分，在标签稀缺协议下AUROC提升5-20点。

详情

AI中文摘要

大语言模型和视觉语言模型中的幻觉检测日益被框架化为选择性预测，其中检测器分配置信度分数并在置信度低时弃权。无监督采样检测器（Semantic Entropy, EigenScore）避免标签但质量停滞，而有监督探针（SAPLMA）获得更强的分布内分数，但在校准标签稀缺时性能急剧下降。我们将大语言模型的响应流形恢复为基于隐藏状态生成轨迹的六维运动特征图的核密度估计的密度脊。测试生成通过其投影特征点到最近脊顶点的欧氏距离的负值进行评分，从而得到随机输出分布的低维几何骨架。我们在七个问答基准（HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA）上，使用九个文本和视觉大语言模型，在刻意标签稀缺协议（$n_{\ ext{cal}}{=}200$ 查询，$N{=}5$ 生成）下，与Semantic Entropy、SAR、EigenScore、SAPLMA和对数概率进行评估。我们的基于脊的分数在AUROC上以5-20个百分点的优势获胜，同时在校准标签稀缺下表现出温和的性能下降。

英文摘要

Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy) avoid labels but plateau in quality, while supervised probes attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, topological methods, and log-probability on six QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using eight text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score beats on AUROC with 5-20 points gain, while demonstrating tempered degradation under calibration-label scarcity.

URL PDF HTML ☆

赞 0 踩 0

2606.11337 2026-06-11 cs.AI cs.CL cs.CY 新提交

Can AI Agents Synthesize Scientific Conclusions?

AI代理能否综合科学结论？

Hayoung Jung, Pedro Viana Diniz, José Reinaldo Corrêa Roveda, Abner Fernandes da Silva, Haeun Jung, Enoch Tsai, Aleksandra Korolova, Manoel Horta Ribeiro

发表机构 * Princeton University（普林斯顿大学）； Universidade Federal de Minas Gerais（米纳斯吉拉斯联邦大学）； Stony Brook University（石溪大学）； Hackensack Meridian School of Medicine（哈肯萨克子午线医学院）

AI总结本文提出SciConBench基准和SciConHarness评估框架，通过分解原子事实并计算精确率和召回率，发现前沿AI代理在科学结论综合中事实F1仅0.337，且无约束评估存在数据泄露，消费者代理常生成不完整或矛盾的结论。

Comments 79 pages, 34 figures, 17 tables. Under Submission

详情

AI中文摘要

科学AI代理越来越多地检索证据、跨来源推理并综合用于重要决策的结论。然而，它们在健康等高风险领域中的能力仍不明确。我们引入了SciConBench，一个大规模实时基准，包含9.11K个问题以及来自系统综述的专家撰写的结论，用于评估开放域科学结论综合。该基准采用专家验证的自动评估流程，将结论分解为原子事实，并通过事实精确率和召回率衡量正确性和全面性。为减轻数据泄露，我们进一步引入了SciConHarness，一个洁净室评估框架，为代理配备受控的网页交互以确保有效测量。评估8个前沿模型和深度研究代理，我们发现事实质量仍然较低：在洁净室设置下，最佳代理仅达到0.337的事实F1。与无约束评估相比，我们的洁净室设置持续降低性能，表明数据泄露夸大了模型真实综合能力的估计。最后，我们审计了面向消费者的代理（如Google AI Overview、OpenEvidence），发现它们经常生成不完整甚至矛盾的结论，即使真实答案可用。总体而言，我们的结果表明，科学结论的可靠综合仍然是一个开放挑战，而洁净室评估对于评估开放域AI代理至关重要。

英文摘要

Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.

URL PDF HTML ☆

赞 0 踩 0

2606.11543 2026-06-11 cs.AI cs.SE 新提交

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

SkillJuror：衡量智能体技能组织如何改变运行时行为

Zhiyu Chen, Zihan Guo, Bo Huang, Bingwei Lu, Jianghao Lin, Yuanjian Zhou, Weinan Zhang

发表机构 * Tongji University（同济大学）； Shanghai Innovation Institute（上海创新研究院）； Sun Yat-sen University（中山大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出SkillJuror框架，通过渐进式披露与扁平基线对比，发现技能组织方式改变智能体搜索和应用程序知识的行为，并在82个任务中提升4.1%的验证通过率。

详情

AI中文摘要

Agent技能在推理时为大语言模型（LLM）智能体提供程序性知识，但当前的基准测试很少区分技能的内容与其组织方式。我们通过渐进式披露（Progressive Disclosure）研究这种区别，其中简洁的根文件按需引导智能体访问支持资源，并将其与归一化的扁平基线进行比较。我们提出SkillJuror，一个通过语义控制变体、匹配的多试验评估和轨迹证据来评估技能编写范式的框架，同时保持任务知识固定。在82个任务的SkillsBench研究中，渐进式披露在总体结果之前改变了运行时行为：每个轨迹触及的不同技能资源从1.18增加到3.85，有效采纳事件从1.33增加到3.92。在410个匹配试验中，它还产生了17个额外的验证通过试验（比归一化扁平基线提高4.1%）。收益取决于任务。当支持资源指导实现、检查或修复时，渐进式披露有帮助，但当成功取决于精确的输出约定、数值阈值或长工件生成流水线时，效果较弱。这些结果表明，技能组织不仅仅是呈现方式：它可以改变智能体搜索和应用程序知识的方式，而结果收益取决于暴露的资源是否对任务可操作。代码见：https://this URL。

英文摘要

Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points agents to supporting resources on demand, and compare it with a normalized flat baseline. We present SkillJuror, a framework for evaluating Skill writing paradigms through semantically controlled variants, matched multi-trial evaluations, and trajectory evidence while holding task knowledge fixed. In an 82-task SkillsBench study, Progressive Disclosure changes runtime behavior before aggregate outcomes: distinct Skill resources touched per trajectory rise from 1.18 to 3.85, and effective uptake events rise from 1.33 to 3.92. It also yields 17 additional verifier-passing trials out of 410 matched trials (+4.1%) over the normalized flat baseline. The benefit is task-dependent. Progressive Disclosure helps when supporting resources guide implementation, checking, or repair, but is weaker when success hinges on exact output conventions, numerical thresholds, or long artifact-generation pipelines. These results show that Skill organization is not mere presentation: it can change how agents search and apply procedural knowledge, while outcome gains depend on whether the exposed resources are actionable for the task. Code is available at https://github.com/zhiyuchen-ai/skill-juror.

URL PDF HTML ☆

赞 0 踩 0

2606.11637 2026-06-11 cs.AI 新提交

TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation

TouchThinker: 通过大规模数据和动作感知表示将触觉常识推理扩展到开放世界

Kailin Lyu, Di Wu, Pengwei Zhang, Yuhang Zheng, Yingxin Lai, Long Xiao, Kangyi Wu, Pengna Li, Chen Gao, Lianyu Hu, Xiaobin Hu, Jie Hao, Ce Hao, Weihao Yuan, Shuicheng Yan

发表机构 * Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； National University of Singapore（新加坡国立大学）； Zhongguancun Academy（中关村学院）； Xiamen University（厦门大学）； Xi’an Jiaotong University（西安交通大学）； Nanyang Technological University（南洋理工大学）； Nanjing University（南京大学）

AI总结提出TouchThinker框架，通过构建百万级多源触觉数据集TouchThinker-1M和动作感知建模，将触觉常识推理扩展到开放世界，在多个数据集上取得竞争性表现。

Comments 18 pages, 11 figures

详情

AI中文摘要

触觉是具身智能体理解物理世界的关键模态。尽管最近的工作已将触觉信号融入语言系统进行触觉常识推理，但由于两个关键瓶颈，将此类系统扩展到现实的开放世界环境仍然具有挑战性：(1) 当前的触觉推理数据集在格式和规模上仍然有限，为从触觉观察到物理常识的推理提供的监督不足，并阻碍了可迁移触觉常识的学习；(2) 触觉信号本质上是冗余且特定于动作的，但现有方法常常忽略这些特性，导致表示效率低下且语义表达能力有限。为了解决这些局限性，我们提出了TouchThinker，一个从数据和表示两个角度将触觉常识推理扩展到开放世界的触觉-语言框架。首先，我们构建了TouchThinker-1M，一个百万级、多源的触觉推理数据集，涵盖\textbf{415}个物体、\textbf{8}个场景和\textbf{7}种传感器类型，为开放世界泛化提供了坚实的数据基础。我们进一步引入了TouchThinker-Bench，一个具有更真实和多样化任务的开放世界基准。然后，我们提出了动作感知建模机制，以提高触觉表示效率并实现高效推理。实验结果表明，TouchThinker在多个数据集上取得了与最先进模型竞争的性能。我们的代码和数据集将在以下网址提供：this https URL。

英文摘要

Touch is a key modality for embodied agents to understand the physical world. Although recent work has incorporated tactile signals into language systems for tactile commonsense reasoning, scaling such systems to realistic open-world settings remains challenging due to two key bottlenecks: (1) current tactile reasoning datasets remain limited in format and scale, providing insufficient supervision for reasoning from tactile observations to physical commonsense and hindering the learning of transferable tactile commonsense; (2) Tactile signals are inherently redundant and action-specific, yet existing methods often overlook these properties, resulting in inefficient representations with limited semantic expressiveness. To address these limitations, we propose TouchThinker, a tactile-language framework that scales tactile commonsense reasoning to the open world from both data and representation perspectives. First, we construct TouchThinker-1M, a million-scale, multi-source tactile reasoning dataset covering \textbf{415} objects, \textbf{8} scenarios, and \textbf{7} sensor types, providing a solid data foundation for open-world generalization. We further introduce TouchThinker-Bench, an open-world benchmark with more realistic and diverse tasks. Then, we propose action-aware modeling mechanism to improve tactile representation efficiency and enable efficient reasoning. Experimental results demonstrate that TouchThinker achieves competitive performance against state-of-the-art models across multiple datasets. Our code and dataset will be made available at: https://github.com/lvkailin0118/TouchThinker.

URL PDF HTML ☆

赞 0 踩 0

2606.11909 2026-06-11 cs.AI 新提交

Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

Embodied-BenchClaw：用于具身空间智能基准构建的自主多智能体系统

Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Zhe Ji, Jinshan Lai, Xi Ren, Jianwei Hu, Qiang Ma

发表机构 * QiYuan Lab（启元实验室）； School of Information and Software Engineering, University of Electronic Science and Technology of China（电子科技大学信息与软件工程学院）； Beijing University of Posts and Telecommunications（北京邮电大学）； School of Computer Science and Engineering, Northeastern University（东北大学计算机科学与工程学院）； School of Computer Science and Engineering, Beihang University（北京航空航天大学计算机科学与工程学院）

AI总结提出Embodied-BenchClaw，一个通过五阶段流水线和三个智能体协调的自主系统，自动构建可验证、可执行、可维护且诊断有用的具身空间智能基准，减少人工工作量。

详情

AI中文摘要

基准测试对于评估具身空间智能至关重要，但其构建劳动密集、难以重用且维护困难。现有的具身基准通常是静态的，随着模型改进可能迅速饱和，限制其区分新能力的能力。我们提出Embodied-BenchClaw，一个用于构建具身空间智能基准的自主智能体系统。给定用户指定的评估意图，Embodied-BenchClaw通过五个阶段流水线自动生成完整且可持续更新的基准包：意图蓝图、数据收集、结构化与清洗、基准合成、评估报告。该流水线由三个智能体协调：规划、构建和评估。为提高可重用性和可靠性，Embodied-BenchClaw引入了可扩展的技能库和过程质量控制，使基准构建可组合、可验证和可修复。我们实例化了多个基准，涵盖室内空间推理、室外空间推理、机器人操作、四足机器人导航、无人机/空中视图理解以及静态基准增强。这些基准跨越不同的具身载体、数据源和空间能力。通过人工评估、基于评判者的评估、一致性检查、成本分析和消融实验，结果表明Embodied-BenchClaw能够以较少的人工努力构建可验证、可执行、可维护且诊断有用的具身空间基准。

英文摘要

Benchmarks are essential for evaluating embodied spatial intelligence, yet their construction is labor-intensive, hard to reuse, and difficult to maintain. Existing embodied benchmarks are often static and may quickly become saturated as models improve, limiting their ability to distinguish new capabilities. We propose Embodied-BenchClaw, an autonomous agentic system for constructing embodied spatial intelligence benchmarks. Given a user-specified evaluation intent, Embodied-BenchClaw automatically produces a complete and continually updatable benchmark package through a five-stage pipeline: intent blueprinting, data collection, structuring and cleaning, benchmark synthesis, and evaluation reporting. The pipeline is coordinated by three agents for planning, construction, and evaluation. To improve reusability and reliability, Embodied-BenchClaw introduces an extensible Skill Library and process quality control, enabling benchmark construction to be composable, verifiable, and repairable. We instantiate multiple benchmarks covering indoor spatial reasoning, outdoor spatial reasoning, robotic manipulation, quadruped robot navigation, UAV/aerial-view understanding, and static benchmark enhancement. These benchmarks span diverse embodied carriers, data sources, and spatial capabilities. Experiments with human evaluation, judge-based assessment, consistency checks, cost analysis, and ablations show that Embodied-BenchClaw can construct verifiable, executable, maintainable, and diagnostically useful embodied spatial benchmarks with reduced manual effort.

URL PDF HTML ☆

赞 0 踩 0

2606.12086 2026-06-11 cs.AI cs.LG 新提交

IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization

IntElicit: 通过对话策略优化引出和评估情境化创造力

Mingjia Li, Jin Wu, Hong Qian, Wenhao Huang, Yiyang Huang, Yiwen Zhang, Chanjin Zheng, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

发表机构 * East China Normal University（华东师范大学）； Shanghai Innovation Institute（上海创新研究院）

AI总结提出IntElicit框架，通过分解过程奖励机制优化对话策略，在交互中减少非创造性混淆因素，从而更有效地引出和评估情境化创造力。

详情

AI中文摘要

情境化评估为评估创造力提供了高生态效度，但也引入了一个关键挑战：观察到的表现可能与认知熟练度（领域知识）和能动性（参与意愿）相混淆。同时，在生成式AI时代，创造性问题解决越来越多地发生在工具中介和人机交互环境中，使得完全静态的评估与当代创造性实践不太一致。为了解决这些问题，本文提出了IntElicit，一个通过对话策略优化来引出和评估情境化创造力的框架。IntElicit作为一个受约束的自适应AI面试官：它在多轮交互中提供非指导性的知识和能动性支架，以减少非创造性混淆因素，同时保留参与者生成被评估的创造性内容的责任。具体来说，为了解决开放教育对话中的稀疏奖励和潜在奖励破解（例如，答案听写），IntElicit引入了一种分解过程奖励机制。该机制将策略与教学引出对齐，奖励那些引出参与者推理而非代表他们产生最优答案的提示。大量实验，包括参与者模拟和一项人类受试者研究（N=64），表明IntElicit比专家设计的基线提高了引出的创造性成果。总之，结果表明，交互式引出可以揭示静态FPSP式评估可能遗漏的创造性潜力，为AI中介学习环境中的情境化创造力评估提供了形成性和诊断性视角。

英文摘要

Contextualized assessment offers high ecological validity for evaluating creativity but introduces a critical challenge: observed performance may be confounded with cognitive proficiency (domain knowledge) and agency (willingness to engage). Meanwhile, in the age of generative AI, creative problem solving increasingly occurs in tool-mediated and human--AI interactive environments, making fully static assessment less aligned with contemporary creative practice. To address these issues, this paper proposes IntElicit, a framework for eliciting and assessing contextualized creativity via dialogue policy optimization. IntElicit functions as a constrained adaptive AI Interviewer: it provides non-directive knowledge and agency scaffolds in multi-turn interaction to reduce non-creative confounders, while preserving participants' responsibility for generating the creative content being evaluated. Specifically, to tackle sparse rewards and potential reward hacking (e.g., answer dictation) in open-ended educational dialogue, IntElicit introduces a decomposed process reward mechanism. This mechanism aligns the policy with pedagogical elicitation, rewarding prompts that draw out participant reasoning rather than producing optimal answers on their behalf. Extensive experiments, including participant simulation and a human subject study (N=64), show that IntElicit improves elicited creative outcomes over expert-designed baselines. Together, the results suggest that interactive elicitation can reveal creative potential that static FPSP-style assessment may miss, providing a formative and diagnostic lens for contextualized creativity assessment in AI-mediated learning contexts.

URL PDF HTML ☆

赞 0 踩 0

2606.11196 2026-06-11 cs.CL cs.AI cs.CR cs.LG 交叉投稿

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

PoQ-Judge：去中心化LLM推理中成本感知的证明质量的多架构评估框架

Arther Tian, Alex Ding, Frank Chen, Simon Wu, Aaron Chan

发表机构 * DGrid AI

AI总结提出PoQ-Judge框架，训练专用裁判模型对查询-输出对进行无参考评分，研究三种架构，最佳模型在Pearson相关性上达到0.747，级联评估降低72.7%成本。

详情

AI中文摘要

去中心化LLM推理网络需要轻量级、无参考的质量评估用于证明质量（PoQ）。我们提出PoQ-Judge，一个训练专用裁判模型对查询-输出对进行评分而无真实参考的框架。我们研究了三种架构在质量-成本权衡中的表现：TextCNN裁判、MiniLM交叉编码器和DeBERTa裁判。通过在UltraFeedback和GPT标记的领域内数据上进行两阶段训练，最佳模型在保留测试集上与真实代理的Pearson相关性达到0.747，优于先前工作中基于参考的评估器。作为复合评分中的无参考组件，它实现了0.645的Pearson相关性，匹配最佳单一基于参考的评估器，同时消除了对参考答案的需求。我们还表明，在线校准将语义质量识别为主导维度，级联评估将成本降低72.7%，仅带来适度的质量损失。结果在问答任务上比摘要任务强得多，表明代理质量是主要剩余限制。

英文摘要

Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without ground-truth references. We study three architectures across the quality-cost tradeoff: a TextCNN judge, a MiniLM cross-encoder, and a DeBERTa judge. Using two-stage training on UltraFeedback plus GPT-labeled in-domain data, the best model reaches 0.747 Pearson correlation with the ground-truth proxy on a held-out test set, outperforming reference-based evaluators from prior work. As a reference-free component in composite scoring, it achieves 0.645 Pearson correlation, matching the best single reference-based evaluator while removing the need for reference answers. We also show that online calibration identifies semantic quality as the dominant dimension and that cascade evaluation reduces cost by 72.7 percent with only modest quality loss. Results are much stronger on QA than summarization, pointing to proxy quality as the main remaining limitation.

URL PDF HTML ☆

赞 0 踩 0

2606.11198 2026-06-11 cs.CL cs.AI 交叉投稿

The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

结构注意力税：检索格式如何劫持上下文学习而与内容无关

Yuqi Zhang, Di Zhang

发表机构 * Xi’an Jiaotong-Liverpool University（西交利物浦大学）

AI总结研究发现知识图谱三元组因其格式结构比自然语言吸引2-3倍注意力，压缩演示注意力达42%，并提出了分解注意力为语义与结构成分的框架及缓解策略。

Comments 10 pages, 5 figures

详情

AI中文摘要

检索增强生成（RAG）系统注入外部知识以改进大语言模型输出，然而注入内容的格式——区别于其语义相关性——可以独立地扭曲模型的注意力分布。我们识别并形式化了一种称为结构注意力税的现象：知识图谱（KG）三元组，由于其关系分隔符和重复的槽位模式，每个token捕获的注意力是语义等价的自然语言文本的2-3倍（$\hat{o}$(KG) ≈ 0.70 对比 $\hat{o}$(中性) ≈ 0.25），将演示注意力压缩高达42%——无论三元组是相关还是噪声。我们开发了一个形式化框架，将注意力分数分解为语义和结构成分（公式2），推导了一个压缩界（命题1），将token级别的格式偏差与演示注意力损失联系起来，并表明结构项控制着注意力被转移多少，而语义项控制着这是有益还是有害。这种解耦揭示了改进检索增强ICL的两个正交轴：优化检索质量（语义轴）和减少格式驱动的注意力捕获（结构轴）。实验上，在两个模型家族（Mistral-7B, LLaMA-3-8B）和三个QA基准上，我们观察到源任务对齐占主导地位：任务匹配的BM25检索在HotpotQA上达到58-62%，而ConceptNet为25-27%，超过30个百分点的差距远远超过所有门控策略（≤2个百分点）。我们从该框架推导出五种结构感知缓解策略，从零成本提示修改到训练时正则化；格式展平（S3）通过来自口头化三元组控制的准确性和注意力级证据得到验证，而结构分散（S1）产生了混合结果，揭示了格式级别干预的挑战。

英文摘要

Retrieval-augmented generation (RAG) systems inject external knowledge to improve LLM outputs, yet the format of injected content -- distinct from its semantic relevance -- can independently distort the model's attention distribution. We identify and formalise a phenomenon we term the structural attention tax: knowledge graph (KG) triples, due to their relational delimiters and repeated slot patterns, capture 2-3x more attention per token than semantically equivalent natural-language text ($\hat{o}$(KG) $\approx$ 0.70 vs. $\hat{o}$(neutral) $\approx$ 0.25), compressing demonstration attention by up to 42% -- regardless of whether the triples are relevant or noise. We develop a formal framework decomposing attention scores into semantic and structural components (Eq. 2), derive a compression bound (Proposition 1) connecting token-level format bias to demonstration attention loss, and show that the structural term governs how much attention is diverted while the semantic term governs whether this helps or hurts. This decoupling reveals two orthogonal axes for improving retrieval-augmented ICL: optimising retrieval quality (semantic axis) and reducing format-driven attention capture (structural axis). Empirically, across two model families (Mistral-7B, LLaMA-3-8B) and three QA benchmarks, we observe that source-task alignment dominates: task-matched BM25 retrieval achieves 58-62% on HotpotQA vs. ConceptNet's 25-27%, a >30 pp gap that dwarfs all gating strategies ($\leq$2 pp). We derive five structure-aware mitigation strategies from the framework, ranging from zero-cost prompt modifications to training-time regularisation; format flattening (S3) is validated by both accuracy and attention-level evidence from a verbalized-triple control, while structural dispersal (S1) yields mixed results that illuminate the challenges of format-level intervention.

URL PDF HTML ☆

赞 0 踩 0

2606.11208 2026-06-11 cs.CL cs.AI 交叉投稿

每个行为都有代价：前沿大语言模型中的压缩道德组合

Weijia Zhang, Ruiqi Chen, Yunze Xiao, Weihao Xuan

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of Michigan（密歇根大学）； Carnegie Mellon University（卡内基梅隆大学）； The University of Tokyo（东京大学）

AI总结针对现有道德基准仅评估孤立行为偏好的不足，提出Moral Trolley Arena两阶段盲ELO基准，通过校准个体道德行为并组合为双行为项，发现前沿LLM的道德判断呈压缩而非简单加性关系。

详情

AI中文摘要

现有的LLM道德基准通常询问模型偏好哪个孤立的道德行为、价值或基础。这有用但不完整。现实判断往往要求模型在同一选项中组合多个道德信号。我们引入**Moral Trolley Arena**，一个两阶段盲ELO基准，用于衡量LLM如何组合道德证据。单场景阶段首先从跨越五个道德基础理论的229个场景语料库中校准个体道德行为；组合阶段则将校准后的行为组合成受控强度网格上的双行为道德项，并测量由此产生的组合偏好。在十个前沿模型中，组合判断主要由成分行为强度预测，但关系始终是压缩的而非简单加性。模型还表现出非加性强度锚定、成分控制后有限的基础特异性残差，以及跨提供者高度收敛的组合偏好曲面。这些结果表明，道德审计应衡量道德证据的组合规则，而不仅仅是对孤立行为的排名。

英文摘要

Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete. Realistic judgments often require a model to combine several moral signals within the same option. We introduce **Moral Trolley Arena**, a two-stage blind ELO benchmark for measuring how LLMs compose moral evidence. The single-scene arena first calibrates individual moral acts from a 229-scenario corpus across five Moral Foundations Theory foundations; the composite arena then combines calibrated acts into two-act moral items over a controlled intensity grid and measures the resulting composite preferences. Across ten frontier models, composite judgments are largely predicted by component act strength, but the relation is consistently compressed rather than simply additive. Models also show non-additive intensity anchoring, bounded foundation-specific residuals after component control, and highly convergent composite preference surfaces across providers. These results suggest that moral audits should measure composition rules for moral evidence, not only rankings over isolated acts.

URL PDF HTML ☆

赞 0 踩 0

2606.11260 2026-06-11 cs.SD cs.AI 交叉投稿

LLMs 在道德推理上表现不佳吗？

Menghang Zhu, Seth Lazar

发表机构 * School of Philosophy (Political Philosophy) Renmin University of China（哲学学院（政治哲学）中国人民大学）； School of Government and Policy Johns Hopkins University（政府与政策学院约翰霍普金斯大学）

AI总结本文通过让LLMs生成评分标准而非直接评分，重新评估MoReBench数据集，发现LLMs的道德推理能力比先前认为的更强。

详情

AI中文摘要

为了让高能力AI系统在动态、开放的环境中安全运行，它们必须能够识别、理解并响应行动中的道德理由，并据此约束自身行为。越来越多的研究旨在评估当今最先进AI系统的这种能力——道德能力，最近得出了普遍悲观的结论。其中一篇最具雄心的论文收集了人类专家制定的黄金标准评分标准，用于评估1000个案例中的道德推理，并以此基准测试前沿AI模型，结果不尽如人意。在本文中，我们认为MoReBench数据集可以被重新利用，以给出对LLMs道德推理（道德能力的重要组成部分）更为乐观的图景。我们表明，如果不根据这些评分标准对LLMs的回应进行评分，而是让LLMs执行与人类相同的任务——为特定案例的道德分析生成评分标准——那么它们生成的评分标准与人类评分标准的校准程度高于其开放式回应，并且在存在差异时，这些差异可能仅仅反映了大多数道德问题的巨大维度，同时也突出了人类在“创建评分标准的评分标准”上的某些偏离。考虑到这些观点，MoReBench数据集表明LLMs在道德推理方面的能力比先前认为的要强得多。

英文摘要

For highly capable AI systems to operate safely in dynamic, open-ended environments, they must be able to identify, understand, and respond to moral reasons for action, and constrain their behaviour accordingly. A growing body of research aims to evaluate this capacity -- moral competence -- in today's most capable AI systems, recently reaching broadly pessimistic conclusions. One of the most ambitious such papers collects gold-standard human-authored rubrics for evaluating moral reasoning in 1,000 cases, and benchmarks frontier AI models against those rubrics, with underwhelming results. In this paper, we argue that the MoReBench dataset can be redeployed to give a much more optimistic picture of LLMs' moral reasoning (an essential part of moral competence). We show that if, instead of scoring LLMs' responses to these cases against these rubrics, we instead give the LLMs the same task given to humans -- to generate scoring rubrics for the moral analysis of particular cases -- the rubrics they generate are both better calibrated to the human rubrics than their open-ended responses, and, where they differ, plausibly reflect nothing more than the vast dimensionality of most moral problems, as well as highlighting some human departures from the "rubric for creating rubrics". Taking these points into consideration, the MoReBench dataset suggests that LLMs are significantly more capable at moral reasoning than was previously believed.

URL PDF HTML ☆

赞 0 踩 0

2606.11686 2026-06-11 cs.CL cs.AI 交叉投稿

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

层隔离评估：使用无LLM、回归锁定的测试工具对生产级LLM代理的确定性框架进行门控

Sawyer Zhang, Alexander Wang, Sophie Lei

发表机构 * Lumivate (Lumi)（Lumivate（Lumi））

AI总结提出层隔离评估方法，将LLM代理分解为固定层次，用确定性无LLM测试套件逐层检测回归，证明聚合指标会掩盖局部退化，而逐层基线门控可准确定位。

Comments 12 pages, 2 figures, 5 tables

详情

AI中文摘要

端到端任务成功是评估LLM代理的主要方式，但一个聚合数字只能告诉你代理发生了回归，却无法指出具体位置。我们提出层隔离评估：将一个部署的订单代理分解为固定的层次分类（本体、意图、路由、分解、升级、安全、记忆以及跨领域的封装/防御），每一层由其在确定性、无LLM“纯”模式下的断言切片独立测试。纯测试套件（23个切片共238个案例；225个在2.39秒内运行，约10毫秒/案例）在每次变更时针对锁定的逐切片基线在CI中运行。我们通过受控回归注入进行验证，一次退化一个非安全层（共七个层）。我们未设计的效果是掩蔽：聚合通过率几乎不变（六个局部回归的变化范围为-1.7至-5.9个百分点），而匹配的切片则大幅下降（-25至-91个百分点）。一个层的切片对其自身故障做出反应部分是由构造决定的；测量结果是（i）聚合掩蔽以及（ii）损伤不会扩散到其他切片：注入层的切片在7个案例中的5个中是受影响最严重的，在7个案例中的7个中位列前三（平均排名1.29/19）。定位在第二个结构不同的租户（星巴克新加坡）上复现：所有七个匹配切片均大幅下降，因此这不是单一目录的伪像。我们将其定位为EDDOps规定但未实现的组件级评估的具体确定性实例，以CheckList为前身，并作为全工作流随机突变测试的确定性镜像。我们的贡献：（a）为生产代理提供了一个完全分解的、亚秒级、无LLM的逐层测试工具，（b）一个覆盖诚实性测试充分性标准，拒绝为未执行的层打分，以及（c）回归注入演示，证明逐切片基线锁定可以定位聚合指标掩盖的回归。

英文摘要

End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, decomposition, escalation, safety, memory, and cross-cutting envelope/defense), each exercised by its own assertion slice in a deterministic, no-LLM "pure" mode. The pure suite (238 cases across 23 slices; 225 run in 2.39 s, ~10 ms/case) runs in CI on every change against a locked per-slice baseline. We validate by controlled regression injection, degrading one layer at a time across seven non-safety layers. The effect we did not design in is masking: the aggregate pass-rate barely moves (-1.7 to -5.9 pp for six local regressions), while the matching slice craters (-25 to -91 pp). A layer's slice reacting to its own fault is partly by construction; the measured results are (i) the aggregate masking and (ii) that damage stays off the other slices: the injected layer's slice is the single worst-hit in 5 of 7 cases and top-3 in 7 of 7 (mean rank 1.29 of 19). Localization replicates on a second, structurally different tenant (Starbucks SG): all seven matching slices crater, so it is not a single-catalog artifact. We position it as a concrete, deterministic instantiation of the component-level evaluation EDDOps prescribes but leaves unimplemented, with CheckList as ancestor and as the deterministic mirror image of whole-workflow stochastic mutation testing. Our contributions: (a) a fully decomposed, sub-second, no-LLM per-layer harness for a production agent, (b) a coverage-honesty test-adequacy criterion that refuses to score an unexercised layer, and (c) the regression-injection demonstration that per-slice baseline-locked gates localize regressions an aggregate metric masks.

URL PDF HTML ☆

赞 0 踩 0

2606.11702 2026-06-11 cs.CV cs.AI cs.CL 交叉投稿

MedCTA: A Benchmark for Clinical Tool Agents

MedCTA: 临床工具智能体基准

Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem

发表机构 * King Abdullah University of Science and Technology (KAUST)（阿卜杜拉国王科技大学）； Massachusetts Institute of Technology (MIT)（麻省理工学院）

AI总结提出MedCTA基准，基于放射影像、病理切片和报告等真实临床多模态输入，评估医疗AI智能体在工具检索、证据获取和集成方面的规划与执行能力。

Comments Project Page: https://ivul-kaust.github.io/MedCTA/ Code: https://github.com/IVUL-KAUST/MedCTA Data: https://huggingface.co/datasets/IVUL-KAUST/MedCTA

详情

AI中文摘要

为了做出临床合理的决策，医疗AI智能体需要超越简单的识别，具备工具检索、证据获取和集成能力。现有基准主要评估孤立的感知或单轮问答，因此对规划、工具调用和部署可靠性的失败可见性有限。我们提出了MedCTA，一个用于评估医疗工具智能体的基准，基于临床验证的、步骤隐含的任务，这些任务基于真实的多模态临床输入，包括放射影像、病理切片和报告。MedCTA包含107个真实临床任务，具有临床医生验证的、在5个部署工具上的可执行轨迹，并支持对工具选择、参数有效性、执行稳定性、轨迹保真度和结果质量的过程感知评估。我们对18个开源和闭源多模态模型进行了基准测试，发现即使是最先进的系统在多步骤临床工具使用中仍然脆弱：自主部署主要由协议失败、过早停止和错误工具调用主导，而黄金标准工具路由带来了巨大但仍不完整的改进。这些结果表明，强大的骨干感知能力并不能转化为临床环境中可靠的智能体行为。MedCTA为审计、诊断和推进可信赖的医疗AI智能体提供了一个严格的测试平台。数据集和评估套件可在该https URL获取。

英文摘要

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at https://ivul-kaust.github.io/MedCTA/

URL PDF HTML ☆

赞 0 踩 0

2606.11739 2026-06-11 cs.CV cs.AI 交叉投稿

Multi-View In-Cabin Monitoring System for Public Transport Vehicles

公共交通车辆的多视角座舱内监控系统

Evgeny Gorelik, Kenny Dean Karrow, Fikret Sivrikaya, Sahin Albayrak, Christian Baumann

发表机构 * Technische Universität Berlin（柏林工业大学）； German Research Center for Artificial Intelligence (DFKI)（德国人工智能研究中心）

AI总结提出一个多视角座舱内监控数据集，包含同步RGB-D图像和LiDAR数据，并提供3D人体姿态和边界框标注，支持多视角3D检测模型评估。

Comments Submitted to ICDM2026

2606.11762 2026-06-11 cs.CL cs.AI 交叉投稿

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

语言模型在开放式任务中的自动化创造力评估

Min Sen Tan, Zachary Kit Chun Choy, Syed Ali Redha Alsagoff, Nadya Yuki Wangsajaya, Mohor Banerjee, Swaagat Bikash Saikia, Alvin Chan

发表机构 * Raffles Institution（莱佛士书院）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算与数据科学学院）； Lee Kong Chian School of Medicine, Nanyang Technological University（南洋理工大学李光前医学院）； Centre of AI in Medicine (C-AIM), Nanyang Technological University（南洋理工大学人工智能医学中心）

AI总结提出一种领域无关的自动化框架，通过语义熵和检索式多智能体评估，量化LLM在开放式任务中的发散与收敛创造力，并在问题解决、研究构思和创意写作三个领域验证其有效性。

Comments Accepted to ACL 2026 (Main Conference). 35 pages, 16 figures. Code: https://github.com/tanminsen/creativity-eval

详情

AI中文摘要

大型语言模型（LLMs）在语言理解、推理和生成方面取得了显著进展，激发了对其创造潜力的日益关注。实现这一潜力需要系统化和可扩展的方法来评估跨不同任务的创造力。然而，大多数现有的创造力指标与特定任务紧密耦合，将领域假设嵌入评估过程，限制了可扩展性和通用性。为解决这一差距，我们引入了一个自动化、领域无关的框架，用于量化LLM在开放式任务中的创造力。我们的方法将测量装置与创造性任务本身分离，实现了可扩展、任务无关的评估。发散创造力通过语义熵（一种无参考且稳健的新颖性和多样性指标）进行测量，并针对人类注释、基于LLM的新颖性判断和基线多样性度量进行了验证。收敛创造力通过一种新颖的基于检索的多智能体评判框架进行评估，该框架提供上下文敏感的任务完成评估，效率提升超过60%。我们在三个性质不同的领域验证了我们的框架：问题解决（MacGyver）、研究构思（HypoGen）和创意写作（BookMIA），使用了广泛的LLM套件。实证结果表明，我们的框架可靠地捕捉了创造力的关键方面，包括新颖性、多样性和任务完成，并揭示了模型属性（如大小、温度、时效性和推理）如何影响创造性表现。我们的工作为自动化的LLM创造力评估建立了可重复和可泛化的标准，为可扩展的基准测试铺平了道路，并加速了创造性AI的进展。

英文摘要

Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable methods for evaluating creativity across diverse tasks. However, most existing creativity metrics are tightly coupled to specific tasks, embedding domain assumptions into the evaluation process, and limiting scalability and generality. To address this gap, we introduce an automated, domain-agnostic framework for quantifying LLM creativity across open-ended tasks. Our approach separates the measurement apparatus from the creative task itself, enabling scalable, task-agnostic assessment. Divergent creativity is measured using semantic entropy, a reference-free and robust metric for novelty and diversity, validated against human annotations, LLM-based novelty judgments and baseline diversity measures. Convergent creativity is assessed via a novel retrieval-based multi-agent judge framework that delivers context-sensitive evaluation of task fulfilment with over 60% improved efficiency. We validate our framework in three qualitatively distinct domains: problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results show that our framework reliably captures key facets of creativity, including novelty, diversity, and task fulfilment, and reveal how model properties, such as size, temperature, recency, and reasoning, impact creative performance. Our work establishes a reproducible and generalizable standard for automated LLM creativity evaluation, paving the way for scalable benchmarking and accelerating progress in creative AI.

URL PDF HTML ☆

赞 0 踩 0

2606.11816 2026-06-11 cs.CL cs.AI 交叉投稿

WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

WorldReasoner: 评估语言模型代理是否通过有效推理预测事件

Yizhou Chi, Eric Chamoun, Zifeng Ding, Andreas Vlachos

发表机构 * Department of Computer Science and Technology, University of Cambridge（剑桥大学计算机科学与技术系）

AI总结提出WorldReasoner框架，通过时间有效检索、证据质量和因果图推理三个维度评估语言模型代理的事件预测能力，发现时间有效检索是结果准确性的最强驱动因素。

详情

AI中文摘要

预测现实世界事件要求语言模型代理在不完整、时间有限的信息下进行不确定性推理。然而，评估代理是否真正进行预测需要的不仅仅是最终答案的准确性：模型可能通过回忆记忆中的训练事实、引用捏造的证据或产生无根据的因果故事而正确。我们提出WorldReasoner，一个用于时间有效事件预测的评估框架。每个任务向代理提供一个已解决的预测问题、一个模拟的预测日期，并且只能访问该日期之前可用的证据；在问题解决后，该框架对提交的概率、引用的证据和可选的因果事件图进行评分。WorldReasoner报告三个互补的轴：针对已解决答案的结果质量、针对引用来源的证据质量，以及针对解决后事后图的推理质量。该基准测试由一个代理构建管道构建，该管道生成预测问题、收集时间戳证据并大规模构建事后参考图，最终产生345个已解决的任务，这些任务源自14,141篇文章，其图覆盖8,087个提取的事件。在六种受控代理设置中，时间有效检索是结果准确性的最强驱动因素；因果图构建提高了关键事件的恢复；并且正确的图支持预测更牢固地基于关键事件和相关来源，但代理仍然难以将基于证据的推理转化为校准的概率。

英文摘要

Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model may be correct by recalling memorized training facts, citing fabricated evidence, or producing an unsupported causal story. We present WorldReasoner, an evaluation framework for temporally valid event forecasting. Each task gives an agent a resolved forecasting question, a simulated forecast date, and access only to evidence available before that date; after resolution, the framework scores the submitted probability, cited evidence, and optional causal event graph. WorldReasoner reports three complementary axes: outcome quality against resolved answers, evidence quality over cited sources, and reasoning quality against post-resolution hindsight graphs. The benchmark is built by an agentic construction pipeline that generates forecasting questions, collects time-stamped evidence, and builds hindsight reference graphs at scale, yielding 345 resolved tasks derived from 14,141 articles with graphs covering 8,087 extracted events. Across six controlled agent settings, temporally valid retrieval is the strongest driver of outcome accuracy; causal graph construction improves key-event recovery; and correct graph-enabled forecasts are more strongly grounded in key events and relevant sources, yet agents still struggle to convert grounded evidence into calibrated probabilities.

URL PDF HTML ☆

赞 0 踩 0

2606.11889 2026-06-11 cs.CV cs.AI cs.RO 交叉投稿

Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection

面向自动驾驶危险检测的视觉-语言模型任务对齐稳定性分析

Everett Richards

发表机构 * Everett Richards（埃弗里特·里奇ards）

AI总结研究视觉-语言模型在自动驾驶危险检测中，嵌入漂移与任务对齐危险分数变化的关系，发现不同腐败类型导致不同的失效模式，建议基准测试包含任务对齐稳定性指标。

Comments 8 pages (5 main body + 3 references / appendices). ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)

详情

AI中文摘要

视觉-语言模型（VLM）越来越多地用于自动驾驶中的场景理解，但鲁棒性分析通常仅依赖于任务无关的嵌入稳定性。我们研究腐败引起的嵌入漂移是否能预测基于CLIP图像-文本相似性的任务对齐危险分数的变化。通过在BDD100K道路场景上使用受控腐败，我们将嵌入漂移与边际漂移（定义为扰动下危险分数的变化）进行比较。这种关系高度依赖于腐败类型：某些家族表现出表示漂移与决策漂移之间的强耦合，而其他家族则在嵌入变化相对较小的情况下引发危险的决策不稳定性。此外，腐败家族在失效方向上有所不同：大多数通过假阴性抑制危险检测，而遮挡则触发假警报，这表明基准设计应考虑不对称的失效模式，而不仅仅是整体不稳定性率。这些结果表明，鲁棒性基准应包含任务对齐的稳定性指标，而不仅仅是嵌入级别的扰动统计。

英文摘要

Vision-language models (VLMs) are increasingly used for scene understanding in autonomous driving, but robustness analysis often relies on task-agnostic embedding stability alone. We study whether corruption-induced embedding drift predicts changes in a task-aligned hazard score derived from CLIP image-text similarities. Using controlled corruptions on BDD100K road scenes, we compare embedding drift against margin drift, defined as the change in hazard score under perturbation. The relationship is highly corruption-dependent: some families exhibit strong coupling between representation drift and decision drift, while others induce hazardous decision instability despite relatively modest embedding change. Furthermore, corruption families differ in failure direction: most suppress hazard detections via false negatives, while occlusion instead triggers false alarms, suggesting that benchmark design should account for asymmetric failure modes, not just overall instability rates. These results suggest that robustness benchmarks should include task-aligned stability measures in addition to embedding-level perturbation statistics.

URL PDF HTML ☆

赞 0 踩 0

2606.11901 2026-06-11 cs.RO cs.AI 交叉投稿

DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

DuoBench: 一个可复现的双手操作基准，涵盖仿真与现实世界

Tobias Jülg, Seongjin Bien, Simon Hilber, Yannik Blei, Pierre Krack, Maximilian Li, Sven Parusel, Rudolf Lioutikov, Florian Walter, Wolfram Burgard

发表机构 * University of Technology Nuremberg（纽伦堡工业大学）； Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）； Franka Robotics ； Technical University of Munich（慕尼黑工业大学）

AI总结提出DuoBench，一个基于FR3 Duo平台的双手操作基准框架，包含11个任务和阶段式评估方案，用于诊断当前策略在双手协调、仿真到现实迁移等方面的失败模式。

详情

AI中文摘要

双手机器人系统极大地扩展了操作能力，但协调两只手臂引入了额外的控制复杂性和故障模式，现有基准未能很好地捕捉这些。我们介绍了DuoBench，一个针对FR3 Duo平台上的双手操作策略的可扩展基准框架。DuoBench包含跨越四个协调类别的十一个任务，在仿真中实现，并通过可复现的任务配方和3D打印资产部分地在现实世界中复现。此外，我们提出了一种基于阶段的评估方案，支持超出二元成功之外的细粒度语义故障分析，并为所有基准任务提供人类遥操作数据集。我们在仿真和真实硬件上对几种双臂模仿学习和视觉-语言-动作策略进行了基准测试。我们的结果表明，当前策略在双手操作中仍然面临挑战，特别是在早期交互阶段、并行手臂执行以及仿真与现实环境之间的迁移方面。DuoBench为诊断这些故障模式和研究未来的双臂策略学习方法提供了一个可复现的测试平台。代码、数据集和视频可在该https URL获取。

英文摘要

Bimanual robot systems substantially expand manipulation capabilities, but coordinating two arms introduces additional control complexity and failure modes that are not well captured by existing benchmarks. We introduce DuoBench, an extensible benchmarking framework for bimanual manipulation policies on the FR3 Duo platform. DuoBench comprises eleven tasks spanning four coordination categories, implemented in simulation and partially reproduced in the real world through reproducible task recipes with 3D-printable assets. In addition, we propose a stage-based evaluation scheme that supports fine-grained semantic failure analysis beyond binary success and provide human-teleoperated datasets for all benchmark tasks. We benchmark several dual-arm imitation-learning and vision-language-action policies in simulation and on real hardware. Our results show that current policies remain challenged by bimanual manipulation, particularly in early interaction stages, parallel arm execution, and transfer between simulation and real-world settings. DuoBench provides a reproducible testbed for diagnosing these failure modes and studying future methods for dual-arm policy learning. Code, datasets, and videos are available at https://duobench.github.io/

URL PDF HTML ☆

赞 0 踩 0

2606.12071 2026-06-11 cs.DL cs.AI 交叉投稿

具身基准构建的智能自动化：流程、具身、模拟器与趋势

Jinshan Lai, Jianwei Hu, Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Tingxuan Huang, Xi Ren, Qiang Ma

发表机构 * University of Electronic Science and Technology of China（电子科技大学）； Qiyuan Lab（启元实验室）； Beijing University of Posts and Telecommunications（北京邮电大学）； Tsinghua University（清华大学）； Beihang University（北京航空航天大学）

AI总结本文综述具身智能基准构建的五阶段流程，分析从人工到自动化再到智能体闭环的转变，指出自动化将成本转向验证与治理。

详情

AI中文摘要

具身智能现已涵盖导航、家务辅助、操作、自动驾驶、空中智能体及多模态大模型控制。这一扩展使得基准构建成为可靠评估的核心瓶颈。与静态数据集不同，具身基准将任务规范、环境、机器人数据、演示、标注、指标、评估脚本和发布策略整合为一个评估系统。本综述通过五阶段构建流程回顾文献：需求与任务构建、数据获取、数据清洗与标注、基准套件生成与指标定义、评估执行与诊断反馈。针对每个阶段，分析从人工管理到传统自动化、基础模型辅助以及智能体闭环工作流的转变。同时比较了人工、数据与资产获取、计算与仿真、验证与调试、治理与维护以及返工风险等定性构建成本。主要结论是：自动化并非简单降低基准成本，而是往往将成本转向验证、可审计性、版本控制和长期治理。因此，具身评估的进展不仅取决于更大的基准套件，还取决于可诊断、可审计且可负责任地更新的构建流程。

英文摘要

Embodied intelligence now spans navigation, household assistance, manipulation, autonomous driving, aerial agents, and multimodal large-model control. This expansion has made benchmark construction a central bottleneck for reliable evaluation. Unlike static datasets, embodied benchmarks combine task specifications, environments, robot data, demonstrations, annotations, metrics, evaluation scripts, and release policies into a single evaluation system. This survey reviews the literature through a five-stage construction pipeline: requirement and task construction, data acquisition, data cleaning and annotation, benchmark suite generation and metric definition, and evaluation execution with diagnostic feedback. For each stage, the survey analyzes the transition from manual curation to traditional automation, foundation-model assistance, and agentic closed-loop workflows. It also compares qualitative construction costs across human labor, data and asset acquisition, compute and simulation, validation and debugging, governance and maintenance, and rework risk. The main conclusion is that automation does not simply reduce benchmark cost. Instead, it often shifts cost toward validation, auditability, version control, and long-term governance. Progress in embodied evaluation will therefore depend not only on larger benchmark suites, but also on construction pipelines that are diagnosable, auditable, and responsibly refreshable.

URL PDF HTML ☆

赞 0 踩 0

2606.12300 2026-06-11 cs.CV cs.AI 交叉投稿

Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

自然语言在小时级视频中的时间定位是一个搜索问题：基准与经验分解

Sukmin Seo, Geewook Kim

发表机构 * NAVER Cloud AI ； KAIST AI（韩国科学技术院人工智能系）

AI总结针对小时级视频的自然语言时间定位，提出搜索是主要瓶颈而非识别，发布首个开放小时级定位基准ExtremeWhenBench，并通过检索-定位混合方法显著提升性能。

Comments 10 pages, 6 figures, Code and benchmark: https://github.com/naver-ai/ExtremeWhenBench

详情

AI中文摘要

时间定位——根据自然语言查询返回视频中的区间$[t_s, t_e]$——是长视频的语言接口，但此前仅在短视频上研究；小时级自然语言定位的动态仍未充分探索。我们认为，在小时级尺度上，限制因素是搜索而非识别：视频-LLM的瓶颈不在于定位附近的事件，而在于根据自然语言查询搜索长视频的相关区域。为验证这一点，我们发布了ExtremeWhenBench，首个开放的小时级定位基准（194个视频上的2273个查询，平均时长75.7分钟，最长9小时），具有开放式查询分布。所有开放视频-LLM均表现不佳，而帧级检索基线优于它们；失败分类将85%的失败归因于搜索；检索-定位混合方法比单一视频-LLM提升了6.7倍——类似于开放域QA中的检索-读取模式。

英文摘要

Temporal grounding--returning the interval $[t_s, t_e]$ for a natural-language query over a video--is the language interface to long-form video, yet has been studied on short videos; the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour-scale, the binding constraint is search, not recognition: Video-LLMs are bottlenecked not by localizing a nearby event, but--given a natural-language query--by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos, mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM--mirroring retrieve-then-read in open-domain QA.

URL PDF HTML ☆

赞 0 踩 0

2606.12392 2026-06-11 cs.CL cs.AI 交叉投稿

System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5

CCL25-Eval 任务5系统报告：新数据集与LoRA微调Qwen2.5

Haotao Xie

发表机构 * The Hangzhou International Innovation Institute Beihang University（北京航空航天大学杭州国际创新研究院）

AI总结针对古典诗歌翻译与情感理解任务，构建高质量指令数据集CCPoetry-49K，并采用LoRA微调Qwen2.5-14B模型得到PoetryQwen，在CCL25-Eval任务5上取得0.757分，较基线提升9.7%。

详情

AI中文摘要

近年来，大语言模型（LLMs）在古典汉语翻译和古典诗歌生成领域取得了令人瞩目的进展。然而，针对古典诗歌精确翻译和情感语义理解的领域特定研究仍然有限。主要挑战在于大多数研究将诗歌鉴赏任务视为通用领域问题，忽略了诗歌鉴赏的独特特征，同时高质量且领域特定的数据集极为稀缺。为解决这一局限，我们将任务分解为三个子任务：术语解释、语义解释和情感推理。基于多个开源数据集，我们进行数据清洗和对齐，构建了古典诗歌指令对数据集（CCPoetry-49K），包含49,404个高质量指令-响应对，专门针对该领域进行了优化。随后，我们提出领域专用LLM，称为PoetryQwen，通过应用低秩适配（LoRA）微调Qwen2.5-14B模型。在CCL25-Eval任务5基准上的实验结果表明，PoetryQwen得分为0.757，较Qwen2.5-14B-Instruct基线（0.690）提升9.7%。这些发现明确表明，PoetryQwen在古典诗歌的精确翻译和情感理解方面显著提升了性能。我们提供了新数据集和方法论考虑，旨在支持LLMs的领域特定优化。

英文摘要

Recently, large language models (LLMs) have achieved promising progress in the fields of classical Chinese translation and the generation of classical poetry. However, domain-specific research on precise translation and affective-semantic understanding of classical poetry remains limited. The main challenge is that most studies treat the poetic appreciation task as a general-domain problem, neglecting the distinctive features of poetic appreciation, while high-quality and domain-specific datasets are extremely limited. To address this limitation, we decompose the task into three subtasks: term interpretation, semantic interpretation, and emotional inference. Based on multiple open-source datasets, we perform data cleansing and alignment to construct the Classical Chinese Poetry Instruction Pair Dataset (CCPoetry-49K), which comprises 49,404 high-quality instruction-response pairs explicitly optimized for this domain. We then propose a domain-specialized LLM, called PoetryQwen, by applying Low-Rank Adaptation (LoRA) to fine-tune the Qwen2.5-14B model. Experimental results on the CCL25-Eval Task 5 benchmark demonstrate that PoetryQwen achieves a score of 0.757, representing a 9.7% improvement over the Qwen2.5-14B-Instruct baseline (0.690). These findings clearly indicate that PoetryQwen significantly enhances performance in precise translation and emotional understanding of classical poetry. We present new dataset and methodological considerations intended to support the domain-specific optimization of LLMs.

URL PDF HTML ☆

赞 0 踩 2

2511.02414 2026-06-11 cs.AI 版本更新

A New Perspective on Precision and Recall for Generative Models

生成模型精度与召回的全新视角

Benjamin Sykes, Loïc Simon, Julien Rabin, Jalal Fadili

发表机构 * NORMANDIE UNIV, UNICAEN, ENSICAEN, CNRS, GREYC（诺曼底大学、UNICAEN、ENSICAEN、CNRS、GREYC）

AI总结本文提出了一种基于二分类视角的新框架，用于估计生成模型的完整精度-召回曲线，并通过统计分析得出最小最大上界，同时展示了该框架可扩展至文献中的多个经典PR指标。

2601.17717 2026-06-11 cs.AI cs.LG 版本更新

A Survey on Evaluating Quality and Trustworthiness in LLM-Generated Data

评估LLM生成数据的质量与可信度综述

Kaituo Zhang, Mingzhi Hu, Hoang Anh Duy Le, Fariha Kabir Torsha, Zhimeng Jiang, Minh Khai Bui, Chia-Yuan Chang, Yu-Neng Chuang, Zhen Xiong, Ying Lin, Guanchu Wang, Na Zou

发表机构 * University of Houston（德克萨斯大学休斯敦分校）； Worcester Polytechnic Institute（沃思利理工学院）； Rice University（里德大学）； Texas A&M University（德克萨斯农工大学）； University of Wisconsin - Madison（威斯康星大学麦迪逊分校）； University of Southern California（南加州大学）； University of North Carolina at Charlotte（北卡罗来纳州立大学夏洛特分校）

AI总结提出LLM数据审计框架，从质量和可信度两个维度系统分类评估指标，分析六种模态数据生成方法的评估缺陷并给出改进建议。

Comments Published at TMLR. Title changed in the final version

详情

Journal ref: Transactions on Machine Learning Research, 2026

AI中文摘要

大型语言模型（LLM）已成为跨多种模态生成数据的强大工具。通过将数据从稀缺资源转变为可控资产，LLM缓解了真实世界数据获取成本对模型训练、评估和系统迭代造成的瓶颈。然而，确保LLM生成的合成数据的高质量仍然是一个关键挑战。现有研究主要关注生成方法，对生成数据质量的直接关注有限。此外，大多数研究局限于单一模态，缺乏跨不同数据类型的统一视角。为填补这一空白，我们提出了\textbf{LLM数据审计框架}。在该框架中，我们首先描述了如何利用LLM生成六种不同模态的数据。更重要的是，我们从质量和可信度两个维度系统分类了评估合成数据的内在指标。这种方法将评估重点从依赖下游任务性能的外在评估转向数据本身的固有属性。利用这一评估体系，我们分析了每种模态代表性生成方法的实验评估，并指出了当前评估实践中的重大缺陷。基于这些发现，我们为社区改进数据生成评估提供了具体建议。最后，该框架概述了合成数据在不同模态下的实际应用方法。

英文摘要

Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs mitigate the bottlenecks imposed by the acquisition costs of real-world data for model training, evaluation, and system iteration. However, ensuring the high quality of LLM-generated synthetic data remains a critical challenge. Existing research primarily focuses on generation methodologies, with limited direct attention to the quality of the resulting data. Furthermore, most studies are restricted to single modalities, lacking a unified perspective across different data types. To bridge this gap, we propose the \textbf{LLM Data Auditor framework}. In this framework, we first describe how LLMs are utilized to generate data across six distinct modalities. More importantly, we systematically categorize intrinsic metrics for evaluating synthetic data from two dimensions: quality and trustworthiness. This approach shifts the focus from extrinsic evaluation, which relies on downstream task performance, to the inherent properties of the data itself. Using this evaluation system, we analyze the experimental evaluations of representative generation methods for each modality and identify substantial deficiencies in current evaluation practices. Based on these findings, we offer concrete recommendations for the community to improve the evaluation of data generation. Finally, the framework outlines methodologies for the practical application of synthetic data across different modalities.

URL PDF HTML ☆

赞 0 踩 0

2602.02465 2026-06-11 cs.AI cs.CV cs.LG 版本更新

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

MentisOculi: 揭示心智图像推理的局限性

Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

发表机构 * Max Planck Institute for Informatics（马克斯·普朗克信息研究所）

AI总结提出MentisOculi基准，通过多步推理问题测试前沿模型利用视觉表示辅助推理的能力，发现视觉策略普遍无法提升性能，且统一多模态模型存在生成错误累积和无法利用真实可视化的问题。

Comments 9 pages, 8 figures, Accepted at ICML 2026

详情

AI中文摘要

前沿模型正从仅摄入视觉信息的多模态大语言模型（MLLMs）过渡到能够原生交错生成的统一多模态模型（UMMs）。这一转变激发了将中间可视化作为推理辅助的兴趣，类似于人类的心智图像。这一想法的核心是能够以目标导向的方式形成、维护和操作视觉表示。为了评估和探究这一能力，我们开发了MentisOculi，这是一个程序化的、分层的多步推理问题套件，适用于视觉解决方案，旨在挑战前沿模型。评估从潜在令牌到显式生成图像的视觉策略，我们发现它们通常无法提升性能。对UMMs的分析特别揭示了一个关键限制：虽然它们拥有解决任务的文本推理能力，并且有时能生成正确的视觉内容，但它们遭受复合生成错误，并且无法利用甚至真实的可视化。我们的发现表明，尽管视觉思维具有内在吸引力，但尚未有益于模型推理。MentisOculi为分析和弥合不同模型家族之间的这一差距建立了必要的基础。

英文摘要

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

URL PDF HTML ☆

赞 0 踩 0

2602.22638 2026-06-11 cs.AI 版本更新

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

MobilityBench：用于评估真实世界移动场景中路径规划智能体的基准

Zhiheng Song, Jingshuai Zhang, Chuan Qin, Chao Wang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu, Hengshu Zhu

发表机构 * Computer Network Information Center, Chinese Academy of Sciences（中国科学院计算机网络信息中心）； AMAP, Alibaba Group（阿里集团AMAP）； Alibaba Group（阿里集团）

AI总结提出MobilityBench基准，通过确定性API重放沙箱和多维评估协议，系统评估基于LLM的路径规划智能体，发现现有模型在偏好约束路径规划上表现不佳。

详情

AI中文摘要

由大型语言模型（LLM）驱动的路径规划智能体已成为一种有前景的范式，通过自然语言交互和工具介导的决策支持日常人类移动。然而，在真实世界移动场景中的系统评估受到多样化路由需求、非确定性地图服务和有限可重复性的阻碍。在本研究中，我们引入了MobilityBench，一个用于评估基于LLM的路径规划智能体在真实世界移动场景中的可扩展基准。MobilityBench基于从高德地图收集的大规模匿名真实用户查询构建，覆盖全球多个城市的广泛路径规划意图。为了实现可重复的端到端评估，我们设计了一个确定性API重放沙箱，消除了实时服务带来的环境变化。我们进一步提出了一个以结果有效性为中心的多维评估协议，辅以对指令理解、规划、工具使用和效率的评估。使用MobilityBench，我们在多种真实世界移动场景中评估了多个基于LLM的路径规划智能体，并对其行为和性能进行了深入分析。我们的发现表明，当前模型在基本信息检索和路径规划任务上表现良好，但在偏好约束路径规划上困难重重，突显了在个性化移动应用中仍有显著改进空间。我们在此https URL公开发布基准数据、评估工具包和文档。

英文摘要

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.

URL PDF HTML ☆

赞 0 踩 0

2603.09715 2026-06-11 cs.AI 版本更新

Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

问题真的重要吗？视觉-语言SFT的无训练数据选择

Peng Sun, Yi Yang, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li

发表机构 * Nanjing University（南京大学）； Institute of Information Engineering（信息工程研究所）； North University of China（中国北方大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）

AI总结提出CVS方法，利用冻结的视觉-语言大模型评估问题对答案有效性的影响，无需训练即可筛选出需要跨模态推理的高质量样本，在多个数据集上以少量数据超越全量训练。

详情

AI中文摘要

视觉指令微调对于提升视觉-语言大模型（VLLMs）至关重要。然而，许多样本可以通过语言模式或常识捷径解决，无需真正的跨模态推理，限制了多模态学习的有效性。先前的数据选择方法通常依赖于代价高昂的代理模型训练，并侧重于难度或多样性，未能捕捉样本对视觉-语言联合推理的真实贡献。在本文中，我们提出CVS，一种基于以下洞见的无训练数据选择方法：对于高质量的多模态样本，引入问题应显著改变模型在给定图像下对答案有效性的评估。CVS利用冻结的VLLM作为评估器，测量在有/无问题条件下答案有效性的差异，从而识别需要视觉-语言联合推理的样本，同时过滤语义冲突噪声。在Vision-Flan和The Cauldron上的实验表明，CVS在数据集上取得了稳定的性能。在Vision-Flan上，CVS仅使用10%和15%的数据就分别比全量训练高出3.5%和4.8%，并且在高度异构的Cauldron数据集上保持鲁棒。此外，与COINCIDE和XMAS相比，CVS分别降低了17.3%和44.4%的计算成本。

英文摘要

Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.

URL PDF HTML ☆

赞 0 踩 0

2604.18543 2026-06-11 cs.AI cs.CL 版本更新

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

ClawEnvKit：爪型智能体的自动环境生成

Xirui Li, Ming Li, Ion Stoica, Cho-Jui Hsieh, Tianyi Zhou

发表机构 * University of Maryland（马里兰大学）； Arena ； University of California, Berkley（伯克利大学）； University of California, Los Angeles（洛杉矶大学）； Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）

AI总结提出ClawEnvKit自动生成多样、可验证的爪型智能体训练与评估环境，构建含1040个环境的Auto-ClawEval基准，成本降低13800倍，性能提升达15.7个百分点。

详情

AI中文摘要

构建用于训练和评估爪型智能体的环境仍然是一个手动、人力密集且无法扩展的过程。我们认为，需要的不仅仅是一个数据集，而是一个能够按需生成多样化、可验证环境的自动化流水线。为此，我们引入了ClawEnvKit，一个自主生成流水线，它从自然语言描述中实例化这一形式化体系。该流水线包含三个模块：（1）解析器，从自然语言输入中提取结构化生成参数；（2）生成器，生成任务规范、工具接口和评分配置；（3）验证器，确保生成环境的可行性、多样性、结构有效性和内部一致性。使用ClawEnvKit，我们构建了Auto-ClawEval，这是首个用于爪型智能体的大规模基准，包含24个类别的1040个环境。实验表明，Auto-ClawEval在连贯性和清晰度上匹配或超过人工策划的环境，成本降低13800倍。在4个模型家族和8个智能体框架上评估，我们发现框架工程比裸ReAct基线性能提升高达15.7个百分点，完成度仍是主要变化轴，且没有模型饱和该基准，自动化生成使得评估规模达到前所未有的水平。除了静态基准测试，ClawEnvKit还支持实时评估：用户用自然语言描述所需能力，即可按需获得验证过的环境，将评估转变为持续的、用户驱动的过程。同样的机制也可作为按需训练环境生成器，产生适应智能体当前弱点的任务分布，而非受限于现有用户日志。

英文摘要

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.

URL PDF HTML ☆

赞 0 踩 0

2606.09426 2026-06-11 cs.AI 版本更新

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

WeaveBench: 面向混合接口的长期、真实世界计算机使用代理基准

Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan

发表机构 * Zhejiang University（浙江大学）； Microsoft Research Asia（微软亚洲研究院）； Tsinghua University（清华大学）

AI总结提出WeaveBench基准，包含114个跨8个真实工作领域的长期混合接口任务，要求代理结合GUI和CLI/代码操作，最佳PassRate仅41.2%，揭示现有评估的不足。

详情

AI中文摘要

计算机使用代理（CUA）越来越多地在结合视觉桌面控制、命令行执行、代码编辑、浏览器和外部工具的运行时中运行。然而，现有基准通常将这些接口作为可分离的能力进行评估，导致长期跨接口编排测试不足。因此，我们引入了WeaveBench，一个长期混合接口基准，包含114个跨8个真实工作领域的任务，基于真实用户请求和公开可验证的工件。每个任务要求代理在单个轨迹中结合GUI观察/操作与CLI/代码操作。我们在部署的CLI代理运行时内的真实Ubuntu桌面上评估这些任务，并增加了最小的桌面控制插件。我们还提出了一个配套的轨迹感知评判器，检查交付物、文件、截图、日志和操作痕迹，同时检测快捷行为，如伪造的视觉证据或硬编码指标。在前沿模型-运行时配对中，最佳PassRate仅达到41.2%，表明该基准远未饱和。轨迹感知评判器进一步揭示，仅基于结果的评分显著高估了代理性能。总体而言，WeaveBench暴露了CUA评估中的关键差距，并提供了一个有效的测试平台，以衡量代理是否能在长期真实世界任务中编排GUI、CLI和代码操作。

英文摘要

Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.

URL PDF HTML ☆

赞 0 踩 0

2508.18636 2026-06-11 cs.SE cs.AI 版本更新

LaQual: An Automated Framework for LLM App Quality Evaluation

LaQual: 一种用于LLM应用质量评估的自动化框架

Yan Wang, Xinyi Hou, Junjun Si, Yanjie Zhao, Weiguo Lin, Haoyu Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出LaQual自动化框架，通过静态指标筛选和动态场景评估，实现LLM应用质量评估，与人类判断高度一致，可减少66.7%-81.3%候选应用。

详情

AI中文摘要

代表软件分发的新范式，LLM应用商店正在迅速兴起，为用户提供内容生成、编程辅助、教育等多样化选择。然而，当前LLM应用商店中的排名和推荐机制主要依赖静态指标（如用户交互和收藏），使用户难以高效识别高质量应用。同时，当前学术研究专注于特定垂直领域，缺乏适用于多样化LLM应用生态的通用自动化评估框架。为应对上述挑战，我们提出LaQual，一种用于LLM应用质量评估的自动化框架。LaQual整合三个关键阶段：(1) LLM应用标注与层次分类，实现精确场景映射；(2) 静态指标评估，使用时间加权用户参与度和功能能力指标过滤低质量应用；(3) 动态场景自适应评估，由LLM生成场景特定评估指标、评分标准和任务，进行全面质量评估。在主流LLM应用商店上的实验证明了LaQual的有效性。其自动化评分与人类判断高度一致。通过有效筛选，LaQual可将候选LLM应用池减少66.7%至81.3%。用户研究进一步验证了其相对于基线系统的显著优势，特别是在比较效率（均值5.45 vs. 3.30）和解释信息价值（4.75 vs. 2.25）方面。这些结果表明，LaQual为现实场景中LLM应用的高质量发现与推荐提供了可扩展、客观且以用户为中心的解决方案。

英文摘要

Representing a new paradigm in software distribution, LLM app stores are rapidly emerging, offering users diverse choices for content generation, coding assistance, education, and more. However, current ranking and recommendation mechanisms in LLM app stores predominantly rely on static metrics, such as user interactions and favorites, making it challenging for users to efficiently identify high-quality apps. At the same time, current academic research focuses on specific vertical fields and lacks a general, automated evaluation framework applicable to the diverse LLM app ecosystem. To address the above challenges, we present LaQual, an automated framework for LLM app quality evaluation. LaQual integrates three key stages: (1) LLM app labeling and hierarchical classification for precise scenario mapping; (2) static indicator evaluation using time-weighted user engagement and functional capability indicators to filter low-quality apps; and (3) dynamic scenario-adapted evaluation, where an LLM generates scenario-specific evaluation metrics, scoring criteria, and tasks for comprehensive quality evaluation. Experiments on a mainstream LLM app store demonstrate the effectiveness of LaQual. Its automated scores show high consistency with human judgments. Through effective screening, LaQual can reduce the candidate LLM app pool by 66.7% to 81.3%. User studies further validate its significant outperformance over baseline systems, particularly in comparison efficiency (mean 5.45 vs. 3.30) and value of explanatory information (4.75 vs. 2.25). These results demonstrate that LaQual provides a scalable, objective, and user-centric solution for high-quality discovery and recommendation of LLM apps in real-world scenarios.

URL PDF HTML ☆

赞 0 踩 0

2509.25359 2026-06-11 cs.CL cs.AI 版本更新

Geometric Metrics and LLMs: What They Measure and When They Work

几何度量与大语言模型：它们测量什么以及何时有效

Viacheslav Yusupov, Anna Antipina, Ameliia Alaeva, Danil Maksimov, Anna Vasileva, Tatyana Zaitseva, Alina Ermilova, Evgeny Burnaev, Egor Shvetsov

发表机构 * Moscow Institute of Physics and Technology（莫斯科物理技术学院）； Russian Academy of Sciences（俄罗斯科学院）

AI总结本文系统测试了用于大语言模型评估的几何度量，发现部分度量主要反映输出长度，而几何度量在文本统计基础上提供有限但真实的信息，并指出故障检测是最有前景的应用。

详情

AI中文摘要

我们提出了对大语言模型评估中几何度量的系统性压力测试。基于排名的内部表示几何特性作为无参考质量信号显示出前景，但其可靠的条件仍不清楚。我们评估了八种常用度量：内在维度估计器、谱范数及相关量，在六个测试模型（0.5-8B）和八个生成器上对比任务，将真实的几何信号与文本长度效应以及标准文本统计已捕获的信息区分开。三个发现出现。首先，一些度量（特别是Schatten范数和MOM）主要反映输出长度，一旦控制长度，其明显的区分能力就崩溃。其次，几何度量在文本统计之外增加了适度但真实的信息：结合它们，分类器在6路生成器识别上达到78%的准确率，而仅用文本统计为69%。第三，度量并不追踪文本质量的通用概念，而是显示内在维度与词汇多样性（RTTR）之间仅存在中等关联。我们给出了特定用例的建议，并指出故障检测是最有前景的近期应用。

英文摘要

We present a systematic stress-test of geometric metrics for LLM evaluation. Rank-based geometric properties of internal representations have shown promise as reference-free quality signals, but the conditions under which they are reliable remain unclear. We evaluate eight commonly-used metrics: intrinsic-dimensionality estimators, spectral norms, and related quantities across six tester models (0.5-8B) and eight generators on contrasting tasks, separating genuine geometric signal from text-length effects and from what standard text statistics already capture. Three findings emerge. First, some metrics (notably Schatten Norm and MOM) mainly reflect output length, and their apparent discriminative power collapses once length is controlled. Second, geometric metrics add modest but real information beyond text statistics: combined with them, a classifier reaches 78% accuracy on 6-way generator identification versus 69% for text statistics alone. Third, rather than tracking a general notion of text quality, the metrics demonstrate only moderate association between the intrinsic-dimensionality and lexical diversity (RTTR). We give use-case-specific recommendations and identify failure detection as the most promising near-term application.

URL PDF HTML ☆

赞 0 踩 0

2510.06596 2026-06-11 cs.CV cs.AI cs.IT cs.LG math.IT 版本更新

SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

SDQM：用于目标检测数据集评估的合成数据质量指标

Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin

发表机构 * Northeastern University, Khoury College of Computer Sciences（东北大学，Khoury 计算科学学院）； Binghamton University, School of Computing（布ingham顿大学，计算科学学院）； Air Force Research Laboratory, Mission Applications and Infrastructure Section（空军研究实验室，任务应用与基础设施部门）

AI总结提出SDQM指标，无需模型训练收敛即可评估合成数据质量，与YOLO11的mAP强相关，优于现有指标。

Comments Accepted and Published at SPIE: Journal of Electronic Imaging, Vol. 35, Issue 3

详情

DOI: 10.1117/1.JEI.35.3.033014
Journal ref: Journal of Electronic Imaging 35(3), 033014 (2026)

AI中文摘要

机器学习模型的性能在很大程度上依赖于训练数据。大规模、良好标注数据集的稀缺给构建鲁棒模型带来了重大挑战。为了解决这一问题，通过模拟和生成模型产生的合成数据已成为一种有前景的解决方案，它增强了数据集的多样性，并提高了模型的性能、可靠性和韧性。然而，评估这些生成数据的质量需要一个有效的指标。我们引入了合成数据集质量指标（SDQM），用于评估目标检测任务的数据质量，而无需模型训练收敛。该指标能够更高效地生成和选择合成数据集，解决了资源受限的目标检测任务中的一个关键挑战。在我们的实验中，SDQM与领先的目标检测模型YOLO11的平均精度均值（mAP）得分表现出强相关性，而先前的指标仅表现出中等或弱相关性。此外，它提供了改进数据集质量的可操作见解，最大限度地减少了昂贵的迭代训练需求。这一可扩展且高效的指标为评估合成数据设立了新标准。SDQM的代码可从此https URL获取。

英文摘要

The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. We introduce the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean average precision (mAP) scores of YOLO11, a leading object detection model, whereas previous metrics only exhibited moderate or weak correlations. In addition, it provides actionable insights into improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM

URL PDF HTML ☆

赞 0 踩 0

2511.07332 2026-06-11 cs.LG cs.AI 版本更新

Grounding Computer Use Agents on Human Demonstrations

基于人类演示的计算机使用智能体基础构建

Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar

发表机构 * Mila - Quebec AI Institute（魁北克AI研究所）； McGill University（麦吉尔大学）； Université de Montréal（蒙特利尔大学）； ServiceNow Research（ServiceNow研究）； University of Waterloo（滑铁卢大学）； University of Oxford（牛津大学）； National University of Singapore（新加坡国立大学）； Polytechnique Montréal（蒙特利尔理工学院）； École de Technologie Supérieure（高级技术学院）； CIFAR AI Chair（CIFAR人工智能主席）

AI总结为解决桌面环境高质量基础数据稀缺问题，构建了包含87个应用、56K截图和3.56M人工标注的GroundCUA数据集，并基于此训练GroundNext模型，在5个基准上以少于先前十分之一的数据取得最优结果。

Comments Accepted at ICLR 2026

详情

AI中文摘要

构建可靠的计算机使用智能体需要基础构建：将自然语言指令准确连接到正确的屏幕元素。尽管存在大量用于网络和移动交互的数据集，但桌面环境的高质量资源有限。为填补这一空白，我们引入了GroundCUA，一个基于专家人类演示构建的大规模桌面基础数据集。它涵盖12个类别的87个应用，包含56K张截图，每个屏幕元素都经过仔细标注，总计超过3.56M个人工验证标注。从这些演示中，我们生成了多样的指令，覆盖广泛的实际任务，为模型训练提供高质量数据。利用GroundCUA，我们开发了GroundNext系列模型，将指令映射到目标UI元素。在3B和7B规模上，GroundNext通过监督微调在五个基准上取得了最先进的结果，同时所需训练数据不到先前工作的十分之一。强化学习后训练进一步提升了性能，在OSWorld基准上使用o3作为规划器的智能体评估中，GroundNext取得了与使用更多数据训练的模型相当或更优的结果。这些结果证明了高质量、专家驱动数据集在推进通用计算机使用智能体中的关键作用。

英文摘要

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data,. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.

URL PDF HTML ☆

赞 0 踩 0

2601.22025 2026-06-11 cs.CL cs.AI cs.IR cs.SE 版本更新

When Generic Prompt Improvements Hurt: Evaluation-Driven Iteration for LLM Applications

当通用提示改进有害：LLM应用的评估驱动迭代

Daniel Commey

发表机构 * Daniel Commey

AI总结提出最小可行评估套件（MVES），通过结构化评估框架和本地复现实验，发现通用提示添加并非单调改进，强调评估驱动的提示迭代。

Comments Technical report. 42 pages, 3 figures. Code, test suites, and result logs: https://github.com/dcommey/llm-eval-benchmarking

详情

AI中文摘要

评估大型语言模型（LLM）应用与传统软件测试不同，因为输出是概率性的、语义可变的，并且对提示和模型变化敏感。本技术报告提出了最小可行评估套件（MVES），一种面向审计的应用级LLM评估结构。MVES将应用类别与失败模式、指标、所需工件和验证证据联系起来，涵盖通用LLM应用、检索增强系统和智能体工作流。我们将该框架与可复现的本地评估工具配对，包括结构化提取、RAG引用/内容合规性和指令遵循检查。使用Ollama与Llama 3 8B Instruct和Qwen 2.5 7B Instruct，我们在扩展的每套30例消融实验中评估了五种提示条件。结果表明，在测试的本地条件下，通用提示添加不会产生单调改进：更强的输出合同提示提高了两种模型的严格提取，而RAG引用/内容合规性在某些通用规则条件下下降。观察到的最显著下降发生在Qwen 2.5上，当通用规则附加到用户提示时，RAG从26/30下降到9/30。这些发现支持评估驱动的提示迭代：提示更改应被视为潜在的回归风险，并在部署前针对特定任务套件进行测试。随附的存储库包含测试套件、提示变体、评估工具、原始结果日志和复现所报告本地消融所需的脚本。

英文摘要

Evaluating Large Language Model (LLM) applications differs from conventional software testing because outputs are probabilistic, semantically variable, and sensitive to prompt and model changes. This technical report proposes the Minimum Viable Evaluation Suite (MVES), an audit-oriented structure for application-level LLM evaluation. MVES links application categories to failure modes, metrics, required artifacts, and validation evidence across general LLM applications, retrieval-augmented systems, and agentic workflows. We pair the framework with a reproducible local evaluation harness covering structured extraction, RAG citation/content-compliance, and instruction-following checks. Using Ollama with Llama 3 8B Instruct and Qwen 2.5 7B Instruct, we evaluate five prompt conditions over expanded 30-case-per-suite ablations. The results show that, in the tested local conditions, generic prompt additions do not produce monotonic improvements: stronger output-contract prompts improve strict extraction for both models, while RAG citation/content-compliance declines under some generic-rule conditions. The largest observed decline occurs for Qwen 2.5 on RAG when generic rules are appended to the user prompt, from 26/30 to 9/30. These findings support evaluation-driven prompt iteration: prompt changes should be treated as potential regression risks and tested against task-specific suites before deployment. The accompanying repository contains the test suites, prompt variants, evaluation harness, raw result logs, and scripts needed to reproduce the reported local ablations.

URL PDF HTML ☆

赞 0 踩 0

2601.22725 2026-06-11 cs.CV cs.AI 版本更新

OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation

OpenVTON-Bench：用于可控虚拟试穿评估的大规模高分辨率基准

Jin Li, Tao Chen, Kai Wen, Siqi Yin, Shuai Jiang, Weijie Wang, Jingwen Luo, Chenhui Wu

发表机构 * Renxing Intelligence, Hangzhou, China ； Hangzhou Dianzi University, Hangzhou, China（杭州电子科技大学）

AI总结提出OpenVTON-Bench，包含约10万对高分辨率图像，通过DINOv3聚类和Gemini描述构建，并设计多模态评估协议，沿五个维度衡量试穿质量，与人类判断高度一致。

Comments Under review for the NeurIPS 2026 Datasets and Benchmarks Track

详情

AI中文摘要

近期扩散模型的进展显著提升了虚拟试穿（VTON）系统的视觉保真度，但可靠的评估仍是一个持续的瓶颈。传统指标难以量化细粒度的纹理细节和语义一致性，而现有数据集在规模和多样性上无法满足商业标准。我们提出了OpenVTON-Bench，一个大规模基准，包含约10万对高分辨率图像（最高$1536 \ imes 1536$）。该数据集使用基于DINOv3的层次聚类进行语义平衡采样，并借助Gemini驱动的密集描述，确保在20个细粒度服装类别上均匀分布。为支持可靠评估，我们提出了一种多模态协议，沿五个可解释维度衡量VTON质量：背景一致性、身份保真度、纹理保真度、形状合理性和整体真实感。该协议将基于VLM的语义推理与基于SAM3分割和形态学腐蚀的新型多尺度表示度量相结合，能够分离边界对齐误差与内部纹理伪影。实验结果表明，该协议与人类判断高度一致（Kendall's $\ au$为0.833，而SSIM为0.611），为VTON评估建立了稳健的基准。

英文摘要

Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $τ$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.

URL PDF HTML ☆

赞 0 踩 0

2602.07840 2026-06-11 cs.IR cs.AI 版本更新

SAGE: Scalable AI Governance & Evaluation

SAGE: 可扩展的人工智能治理与评估

Benjamin Le, Xueying Lu, Nick Stern, Wenqiong Liu, Igor Lapchuk, Xiang Li, Baofen Zheng, Kevin Rosenberg, Jiewen Huang, Zhe Zhang, Abraham Cabangbang, Satej Milind Wagle, Jianqiang Shen, Raghavan Muthuregunathan, Abhinav Gupta, Mathew Teoh, Andrew Kirk, Thomas Kwan, Jingwei Wu, Wenjing Zhang

发表机构 * LinkedIn Corporation（LinkedIn公司）

AI总结本文提出SAGE框架，通过双向校准循环将高质量的人类产品判断转化为可扩展的评估信号，解决了大规模搜索系统中相关性评估的治理差距问题，并实现了92倍成本降低的模型迭代和政策监督。

详情

AI中文摘要

在大规模搜索系统中评估相关性本质上受到人类监督与生产系统高吞吐要求之间的治理差距的限制。传统方法依赖于参与代理或稀疏手动审查，但这些方法往往无法捕捉高影响的相关性失败的全部范围。我们提出了SAGE（可扩展的人工智能治理与评估）框架，该框架将高质量的人类产品判断作为可扩展的评估信号。SAGE的核心是一个双向校准循环，其中自然语言政策、精心编写的先例和一个LLM替代法官共同进化。SAGE系统性地解决语义模糊和不一致，将主观的相关性判断转化为可执行的多维标准，具有接近人类水平的一致性。为了弥合前沿模型推理与工业级推理之间的差距，我们应用教师-学生蒸馏技术，将高保真判断转移到紧凑的学生替代体，成本降低92倍。SAGE部署在LinkedIn搜索生态系统中，通过模拟驱动开发指导模型迭代，蒸馏出符合政策的模型用于在线服务，并实现快速的离线评估。在生产环境中，它推动了政策监督，测量了升级的模型变体并检测到无法被参与指标检测到的回归。集体上，这些措施推动了LinkedIn每日活跃用户的0.25%提升。

英文摘要

Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present \textbf{SAGE} (Scalable AI Governance \& Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language \emph{Policy}, curated \emph{Precedent}, and an \emph{LLM Surrogate Judge} co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at \textbf{92$\times$} lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a \textbf{0.25\%} lift in LinkedIn daily active users.

URL PDF HTML ☆

赞 0 踩 0

2603.19225 2026-06-11 cs.CE cs.AI cs.CL cs.IR q-fin.CP 版本更新

FinTradeBench: A Financial Reasoning Benchmark for LLMs

FinTradeBench: 面向LLM的金融推理基准

Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta

发表机构 * University of Central Florida（佛罗里达中央大学）

AI总结提出FinTradeBench基准，通过结合公司基本面与交易信号，评估大语言模型在金融推理中的表现，发现检索增强对数值和时间序列推理帮助有限。

Comments 9 pages main text, 31 pages total (including references and appendix). 5 figures, 16 tables. Preprint under review. Code and data will be made available upon publication

详情

AI中文摘要

现实世界的金融决策是一个具有挑战性的问题，需要对异构信号进行推理，包括从监管文件中提取的公司基本面和从价格动态计算出的交易信号。最近，随着大语言模型（LLM）的进步，金融分析师开始将它们用于金融决策任务。然而，现有的用于测试这些模型的金融问答基准主要关注公司资产负债表数据，很少评估关于公司股票如何在市场中交易或它们与基本面相互作用的推理。为了利用这两种方法的优势，我们引入了FinTradeBench，这是一个评估金融推理的基准，它整合了公司基本面和交易信号。FinTradeBench包含1400个问题，这些问题基于纳斯达克-100公司十年历史窗口的数据。该基准分为三个推理类别：基本面聚焦、交易信号聚焦以及需要跨信号推理的混合问题。为了确保大规模可靠性，我们采用了一个校准然后扩展的框架，该框架结合了专家种子问题、多模型响应生成、模型内自过滤、数值审计以及人类-LLM判断对齐。我们在零样本提示和检索增强设置下评估了14个LLM，并观察到了明显的性能差距。检索显著改善了对文本基本面的推理，但对交易信号推理的益处有限。这些发现突显了当前LLM在数值和时间序列推理方面的根本性挑战，并激励了未来在金融智能方面的研究。

英文摘要

Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with advances in Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question-answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning about how company stocks trade in the market or their interactions with fundamentals. To leverage the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.

URL PDF HTML ☆

赞 0 踩 0

2605.28882 2026-06-11 cs.CL cs.AI cs.SD 版本更新

GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

GrowLoop: 由人类种子驱动的自进化对话评估

Yihang Lin, Yunze Gao, Zeyang Lin, Dongbo Li, Kun Peng, Yue Liu

发表机构 * Amap, Alibaba Group（阿里集团阿地图）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结针对开放域对话中类人性评估的隐性知识、标准分歧和动态演化三大挑战，提出GrowLoop自进化评估系统，通过最小人工种子标注和启发式学习迭代提取评估标准，并利用标准-案例协同进化机制持续适应模型进步和场景变化。

详情

AI中文摘要

随着大语言模型的快速发展，评估开放域对话中的类人性变得越来越重要。然而，类人性是一种隐性知识，人类可以直观感知，但其背后的标准难以明确表述。人类判断差异很大，在某些情况下高度一致，在其他情况下则存在合理分歧。同时，人类判断背后的标准仍然是隐性的，没有明确的基础来构建案例。此外，什么算作类人并非一成不变，而是随着模型能力和人类期望而演变。尽管在评估方法上取得了进展，如专家编写的基准、奖励模型和自进化基准，但没有一种方法能同时解决这三个挑战。因此，我们提出了GrowLoop，一个自进化的对话评估系统，能够随着模型进步和场景变化而持续适应。以最小的人工种子标注作为初始动力，LLM代理通过启发式学习迭代提取和细化评估标准。在标注者意见一致的地方要求人机一致，而在意见分歧的地方只要求合理性。此外，标准-案例协同进化机制实现了持续进化，当评估目标发生变化时，通过新的种子进行扩展。应用于开放域对话中的类人性评估，生成的标准不仅在与人判断的一致性上显著优于现有方法，而且还发现了标注者忽略的问题。由此产生的基准能够有效区分不同能力层级的模型，并揭示其不足之处，同时能够泛化到新场景并随着模型进步而适应。我们的工作将基准测试范式从手动更新或难度扩展转变为全面、持续的自我进化。

英文摘要

With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human-likeness is not static, but evolving with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, Reward Models, and self-evolving benchmarks, none addresses all three challenges simultaneously. Therefore, we propose GrowLoop, a self-evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. Starting from minimal human seed annotations, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge. Moreover, the Rubric-Case co-evolution mechanism enables continuous evolution. When the evaluation target shifts, new human seeds expand the system's coverage accordingly. When applied to human-likeness evaluation in open-ended conversation, the AI judge guided by these rubrics not only substantially outperforms existing methods in alignment with human judgments, but also uncovers issues that annotators overlook. The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.

URL PDF HTML ☆

赞 0 踩 0

2605.29588 2026-06-11 cs.CV cs.AI q-bio.NC 版本更新

Brain-IT-VQA: From Brain Signals to Answers

Brain-IT-VQA: 从脑信号到答案

Roman Beliy, Matias Cosarinsky, Oliver Heinimann, Navve Wasserman, Michal Irani

发表机构 * Weizmann Institute of Science（魏茨曼科学研究所）

AI总结提出 Brain-IT-VQA 框架，基于 fMRI 脑信号解码语言令牌并结合语言模型进行视觉问答，在 NSD-VQA 新基准上显著优于先前方法，并用于分析脑区对视觉信息的贡献。

详情

AI中文摘要

从观看图像时记录的 fMRI 信号解码视觉内容，特别是回答关于所看图像的问题，是一个长期挑战。尽管近年来在基于 fMRI 的视觉问答（VQA）方面取得了显著进展，但性能仍然有限。此外，尽管最近的模型能够做出越来越准确的预测，但它们很少被用作理解大脑中视觉表征结构的工具。我们提出了 Brain-IT-VQA，一个基于 fMRI 的视觉问答框架。基于脑交互变换器（Brain-IT），我们的方法从脑活动中解码语言令牌，并将其与语言模型集成以回答视觉问题。我们的模型显著优于先前的基于 fMRI 的标题生成和 VQA 方法。我们进一步引入了 NSD-VQA，一个新的基于 fMRI 的视觉问答数据集和基准。与现有的图像-fMRI VQA 数据集通常每张图像只提供少数宽泛且弱控制的问题不同，NSD-VQA 在 20 个受控问题类别中平均每张图像提供 20 个问答对，这些类别解耦了多个层次的视觉理解。这使得在有限的 fMRI 测试数据下能够进行更可靠和可解释的评估。Brain-IT-VQA 和 NSD-VQA 共同提供了一个强大的预测框架和研究脑表征的工具。利用这个基准，我们量化了哪些形式的视觉和语义信息可以从对自然图像的 fMRI 响应中可靠解码。我们进一步分析了不同脑区在不同问题类型上的贡献。

英文摘要

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.

URL PDF HTML ☆

赞 0 踩 0

2606.03504 2026-06-11 cs.CL cs.AI 版本更新

BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

BaltiVoice: 巴尔蒂语语音语料库与微调Whisper ASR系统

Muhammad Ali

发表机构 * Independent Researcher（独立研究员）； The Islamia University of Bahawalpur（伊斯兰巴哈瓦尔普尔大学）

AI总结针对无公开ASR资源的巴尔蒂语，构建16.8小时朗读语音语料库并微调Whisper-small模型，在验证集上词错误率从182.18%降至30.07%。

Comments 6 pages, 3 figures, 4 tables. Code and data available at https://github.com/mohdali-dev/BaltiVoice-ASR

2606.07001 2026-06-11 cs.DB cs.AI 版本更新

Lung-R1：知识图谱引导的肺部诊断推理大语言模型

Haoyang Zeng, Yuanxi Fu, Rongzhen Li, Yuming Yang, Xiao Sun, Jingwang Huang, Gujie Shao, Guohui Xiang, Quan Lu, Dongfan Ye, Xuetao Chen, Jiang Zhong, Kaiwen Wei, Zhi Xu

发表机构 * School of Computer Science, Chongqing University（重庆大学计算机学院）； AI Research Institution, Mashang Financial Institution（马上金融人工智能研究院）； Department of Information, Third Military Medical University（陆军军医大学信息系）

AI总结提出LungKG知识图谱和Lung-R1模型，通过KG约束的推理链构建和强化学习，解决肺部知识到病例诊断的差距，在EMR诊断任务上达到SOTA。

详情

AI中文摘要

物理信息驱动的生成式AI在半导体制造中的应用：通过构造强制生成模型中的硬物理约束

Yaser Mike Banad, Sarah Sharif

发表机构 * School of Electrical and Computer Engineering, University of Oklahoma（俄克拉荷马大学电气与计算机工程学院）； Center for Quantum Research and Technology, University of Oklahoma（俄克拉荷马大学量子研究与技术中心）； Intelligent Neuromorphic and Quantum Understanding for Innovative Research and Engineering (INQUIRE) Laboratory（创新研究与工程智能神经形态与量子理解实验室）； Material Science and Engineering Program, University of Oklahoma, Norman, OK 73019 USA（俄克拉荷马大学材料科学与工程项目，Norman, OK 73019 USA）

AI总结针对半导体制造中生成模型必须满足硬物理约束的问题，本文提出通过构造集成物理信息（如物理信息扩散、PDE约束变分模型等）来强制约束，而非事后过滤，并给出四种集成模式和未来研究方向。

详情

迈向文献与形式化数学知识之间的桥梁层

A. Mayeux

发表机构 * GitHub

AI总结提出一个关系型桥接数据库，对齐出版物元数据与形式化工件，并引入论文级形式化评分，通过跨文档对齐估计形式化覆盖度，以整合文献与形式化数学生态系统。

详情

AI中文摘要

数学知识分散在文献数据库（如MathSciNet、zbMATH Open）和形式化证明库（如Lean mathlib）中，阻碍了已发表结果与其形式化之间的统一访问。我们提出了一个关系型桥接数据库，将出版物元数据与形式化工件对齐，为数学文献和机器可验证证明提供互操作层。我们引入了一个论文级形式化评分，衡量一篇出版物在形式化系统中的覆盖程度。作为可行性研究，我们展示了如何通过非正式文本与Lean形式化之间的跨文档对齐来估计此类评分，从而实现对形式化覆盖度的大规模分析。该框架是将文献和形式化数学生态系统整合为可扩展、机器可操作的知识图谱的第一步，该图谱将出版物与形式化证明对象关联起来。

英文摘要

Mathematical knowledge is split between bibliographic databases (e.g., MathSciNet, zbMATH Open) and formal proof libraries (e.g., Lean mathlib), preventing unified access between published results and their formalizations. We propose a relational bridge-database that aligns publication metadata with formal artifacts, providing an interoperability layer between mathematical literature and machine-verifiable proofs. We introduce a paper-level formalization score that measures how much of a publication is covered in formal systems. As a feasibility study, we show how such scores can be estimated via cross-document alignment between informal texts and Lean formalizations, enabling large-scale analysis of formalization coverage. This framework is a first step toward integrating bibliographic and formal mathematical ecosystems into scalable, machine-actionable knowledge graphs linking publications to formal proof objects.

URL PDF HTML ☆

赞 0 踩 0

2606.11463 2026-06-11 cs.LG cs.AI 交叉投稿

LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach

基于LSTM的财产保险损失准备金结构性断点检测：气候信息方法

Thomas Mbrice, Shashwat Panigrahi

发表机构 * Stony Brook University（石溪大学）

AI总结针对气候变化导致传统精算方法失效的问题，提出使用LSTM神经网络检测结构性断点，在佛罗里达和路易斯安那州数据上预期将巨灾年份准备金精度提升15-20%，并给出理论保证。

Comments 15 pages, 0 figures, whitepaper YC

详情

AI中文摘要

准确的损失准备金是保险公司偿付能力的基础，然而加速的气候驱动灾难系统地违反了传统精算方法所依赖的稳定性假设。本文提出一个研究计划，测试长短期记忆（LSTM）神经网络是否能够比链梯法、Bornhuetter-Ferguson法和Cape Cod法更快、更准确地检测和适应这些结构性断点。使用来自佛罗里达州和路易斯安那州超过15年的监管发展三角形数据，并辅以NOAA飓风强度指数和海面温度，我们假设在巨灾暴露年份准备金精度有15-20%的针对性提升，这一阈值基于先前的神经网络准备金文献以及本文发展的形式化收敛结果。除了实证验证，我们还发展了一个理论框架，以概率术语为基础进行LSTM结构性断点检测，并提供形式化的性能保证，以弥补测试期间巨灾事件数量有限的不足。我们记录了研究设计、方法论、预期贡献以及对局限性的坦诚评估。

英文摘要

Accurate loss reserving is foundational to insurer solvency, yet accelerating climate driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend. This white paper presents a research program testing whether Long Short Term Memory (LSTM) neural networks can detect and adapt to these structural breaks faster and more accurately than Chain Ladder, Bornhuetter Ferguson, and Cape Cod methods. Using 15 plus years of regulatory development triangle data from Florida and Louisiana, enriched with NOAA hurricane intensity indices and sea surface temperatures, we hypothesize a targeted improvement of 15, 20% in reserve accuracy for catastrophe exposed years, a threshold grounded both in the prior neural network reserving literature and in the formal convergence results developed here. Beyond empirical validation, we develop a theoretical framework grounding LSTM structural break detection in probabilistic terms, providing formal performance guarantees that compensate for the limited number of catastrophe events in the test period. We document the research design, methodology, expected contributions, and a candid assessment of limitations.

URL PDF HTML ☆

赞 0 踩 0

2606.11477 2026-06-11 cs.CV cs.AI 交叉投稿

Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

迈向全自动考试评分：基于基础模型的笔迹答案公平性识别

Hartwig Grabowski

发表机构 * Institute for Machine Learning and Analytics (IMLA), Offenburg University（奥芬堡大学机器学习和分析研究所（IMLA））

AI总结提出使用视觉-语言基础模型（VLM）识别手写答案，在61份考试（3141个答案位置）上达到98.4%准确率，并通过轻量提示将假阴性率降至0.58%，实现公平的全自动评分。

Comments 11 pages, 2 figures, 3 tables

详情

AI中文摘要

手工批改手写试卷既耗时又容易出错，尤其是对于大规模班级，而全数字化考试往往迫使教学局限于封闭式问题格式。一个实用的折中方案是保留纸质、问题导向的任务，但将评估相关的答案以单个大写字母记录在机器可读的表格中。开放的问题是，这种读取能否足够准确，并且最重要的是，足够公平以实现无监督评分。早期的自动化方法仅达到约88%–91%的识别率——太低——并且在最关键的案例上失败：答案写在单元格外、被划掉或草书书写。我们展示了通用视觉-语言基础模型（VLM），它解释页面而非匹配像素模板，弥补了这一差距。在一个包含61份匿名考试（3141个答案位置）的基准测试中，最佳模型达到了98.4%的准确率，远高于之前的基线。关键的是，我们以公平性为中心进行评估：我们区分假阴性（正确答案被标记为错误，对学生不利）和假阳性，并且一个提供参考答案作为上下文的轻量提示将假阴性率降至0.58%。在示例性评分方案下，61份考试中只有3份会被评得更差，所有这些都通过学生自我审查步骤被发现。因此，大规模的全自动、公平性感知考试评分是合理的；我们发布匿名基准以支持可重复性。

英文摘要

Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88%--91% recognition -- too low -- and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.

URL PDF HTML ☆

赞 0 踩 0

2606.11505 2026-06-11 cs.CV cs.AI cs.CR 交叉投稿

设计AI支持的焦点小组：角色×模态剧本

Zhiqing Wang, Steven Dow

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）

AI总结针对焦点小组资源密集且对引导高度敏感的问题，提出按AI角色（工具、联合主持、主持）和模态（文本、语音、具身）组织的剧本，并分析交互权衡与开放问题。

详情

AI中文摘要

收集参与者的生活经验是设计研究的核心。焦点小组的独特价值在于参与者不仅分享个人经历，还能相互回应，从而呈现比较、分歧和集体意义建构。然而，焦点小组资源密集且对引导高度敏感：主持人必须探究细节、平衡参与、管理话题流程并维持心理安全，微妙的引导选择可能影响哪些内容变得突出。近期人机交互研究和商业会议工具表明，生成式AI可以通过提示、轮流调节、主题映射和实时总结来支撑实时对话。然而，用户体验研究团队缺乏关于这些能力在焦点小组中的含义以及引入的方法论风险的清晰图景。我们综合了AI支持实时对话的相关工作，并将其转化为一个焦点小组特定的剧本，按AI角色（工具、联合主持、主持）和模态（文本、语音、具身）组织。我们描述了交互权衡，并识别了将AI支持的焦点小组作为方法论配置进行评估的开放问题。

英文摘要

Collecting participants' lived experiences is central to design research. Focus groups are uniquely valuable because participants not only share individual accounts but also respond to one another, surfacing comparison, disagreement, and collective sensemaking. However, focus groups are resource-intensive and highly sensitive to facilitation: moderators must probe for specificity, balance participation, manage topic flow, and sustain psychological safety, and subtle facilitation choices can shape what becomes salient. Recent HCI work and commercial meeting tools show that generative AI can scaffold live conversation through prompting, turn regulation, thematic mapping, and real-time summarization. Yet UXR teams lack a clear map of what these capabilities mean in focus groups and what methodological risks they introduce. We synthesize AI supports for live conversation and translate them into a focus-group-specific playbook organized by AI role (tool, co-host, host) and modality (text, voice, embodied).We synthesize prior work on AI-supported live conversation and propose a focus-group-specific playbook of AI supports organized by role (tool, co-host, host) and modality (text, voice, embodied). We characterize interactional trade-offs and identify open questions for evaluating AI-supported focus groups as methodological configurations.

URL PDF HTML ☆

赞 0 踩 0

2606.11915 2026-06-11 cs.SD cs.AI 交叉投稿

Quality Adaptive Angular Margin Learning for Respiratory Sound Classification

呼吸音分类的质量自适应角度边界学习

Yoon Tae Kim, Heejoon Koo, Miika Toikkanen, June-Woo Kim

发表机构 * RSC LAB, MODULABS, Republic of Korea（RSC实验室，MODULABS，韩国）； Department of Electronic Engineering, Wonkwang University, Republic of Korea（韩国圆光大学电子工程系）； AI Convergence Research Institute, Wonkwang University, Republic of Korea（韩国圆光大学人工智能融合研究所）

AI总结提出质量自适应角度边界学习框架QLung，通过频谱熵和均方根能量推导无参考音频质量边界，自适应缩放角度边界，改善特征泛化，在ICBHI和SPRSound数据集上分别提升2.46%和达到最优分布外性能。

Comments Accepted to Interspeech 2026

详情

AI中文摘要

我们提出了一种质量自适应角度边界学习框架，通过增强类内紧凑性和类间可分离性来改进特征泛化。我们的框架名为QLung，引入了基于频谱熵和均方根能量的无参考音频质量边界，根据录音质量自适应缩放角度边界。为此，我们提出了一种对数缩放的角度边界，在严重类别不平衡下稳定训练。我们还使用了一个角度分类器，对特征和类别权重进行归一化，确保在单位超球面上一致地应用边界惩罚。我们的方法在ICBHI数据集上比交叉熵基线提高了2.46%的分布内性能，最重要的是，在SPRSound数据集上，与先前最先进的方法相比，实现了最强的分布外性能。代码可在以下网址获取：https://this URL。

英文摘要

We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness and inter-class separability. Our framework, titled QLung, introduces a no-reference audio quality margin derived from spectral entropy and root-mean-square energy, which adaptively scales angular margins based on recording quality. To this end, we propose a log-scaled angular margin that stabilizes training under severe class imbalance. We also use an angular classifier that normalizes features and class weights, ensuring margin penalties are applied consistently on the unit hypersphere. Our approach improves in-distribution performance on the ICBHI dataset by 2.46\% over the cross-entropy baseline, and most significantly, achieves the strongest out-of-distribution performance on the SPRSound dataset compared to prior state-of-the-art methods. Code is available at https://github.com/RSC-Toolkit/QLung.

URL PDF HTML ☆

赞 0 踩 0

2606.11916 2026-06-11 cs.SE cs.AI 交叉投稿

Characterizing Software Aging in GPU-Based LLM Serving Systems

基于GPU的大语言模型服务系统中的软件老化特征分析

Domenico Cotroneo, Bojan Cukic

发表机构 * College of Computing and Informatics, University of North Carolina at Charlotte（北卡罗来纳大学夏洛特分校计算机与信息学院）

AI总结提出一种实证方法研究GPU大语言模型服务系统中的软件老化，通过216小时实验发现所有部署均存在显著内存老化，泄漏率与运行时和配置强相关，并提供了可复现框架。

Comments 7 pages

详情

AI中文摘要

本文提出了一种实证方法，用于研究基于GPU的大语言模型服务系统中的软件老化。传统的老化研究侧重于以CPU为中心的软件，且工作负载相对规律；而大语言模型服务则不同，它跨越Python主机和CUDA设备，处理成本相差数个数量级的请求，并依赖于快速演进的软件栈。我们在相同的压力条件下，对六个共置部署进行了216小时的实验，并行监控主机、设备和客户端指标，并应用了考虑自相关和多重比较的统计流程。结果显示，所有部署均存在统计上显著的内存老化，泄漏率强烈依赖于服务运行时和部署配置。除这些发现外，我们还提供了一个可复现的框架，为软件老化与再生领域以及大语言模型服务社区开辟了交叉研究方向。

英文摘要

This paper proposes an empirical methodology to study software aging in GPU-based LLM serving systems. Traditional aging studies focus on CPU-centric software with relatively regular workloads; LLM serving is different, spanning a Python host and a CUDA device, handling requests whose cost varies by orders of magnitude, and relying on rapidly evolving software stacks. We run a 216-hour campaign across six co-located deployments under identical stress conditions, monitor host, device, and client metrics in parallel, and apply a statistical pipeline that accounts for autocorrelation and multiple testing. Our results reveal statistically significant memory aging in all deployments, with leak rates strongly dependent on the serving runtime and deployment configuration. Beyond these findings, we provide a reproducible framework that opens a research direction at the intersection of the software aging and rejuvenation and LLM serving communities.

URL PDF HTML ☆

赞 0 踩 0

2606.11922 2026-06-11 cs.SD cs.AI 交叉投稿

Lung-SRAD: Spectral-Aware Regularized Audio DASS with Dual-Axis Patch-Mix Contrastive Learning for Respiratory Sound Classification

Lung-SRAD: 基于谱感知正则化音频DASS与双轴补丁混合对比学习的呼吸音分类

Hemansh Shridhar, Miika Toikkanen, June-Woo Kim

发表机构 * RSC LAB, MODULABS（RSC实验室，MODULABS）； Department of Electronic Engineering, Wonkwang University（圆光大学电子工程系）； AI Convergence Research Institute, Wonkwang University（圆光大学人工智能融合研究所）

AI总结针对呼吸音分类中AST模型对局部异常模式不敏感的问题，提出基于状态空间模型的谱感知层正则化和双轴补丁混合对比学习，在ICBHI基准上达到64.48%分数，比AST基线提升5%。

Comments Accepted to Interspeech 2026

详情

AI中文摘要

最近的呼吸音分类（RSC）研究主要依赖于CLS令牌驱动的自注意力架构，如音频频谱图变换器（AST）。虽然它在建模全局上下文方面有效，但最近的分析表明存在低通滤波行为，可能会降低对局部异常模式的敏感性。在这项工作中，我们研究了状态空间模型（SSM）作为RSC的替代骨干网络。使用蒸馏音频状态空间模型，我们通过频谱响应曲线分析中间表示，并观察到对中到高空间频率分量的更强保留。基于这些观察，我们引入了使用高斯卷积应用于选定层的谱感知层正则化。我们进一步提出了针对基于SSM的音频模型定制的双轴补丁混合对比学习，以实现稳健的表示学习。在ICBHI基准上的实验表明，我们的方法达到了64.48%的分数，比AST基线高出5%。代码可在以下网址获取：https://this https URL。

英文摘要

Recent respiratory sound classification (RSC) studies largely rely on CLS-token driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest a low-pass filtering behavior that may reduce sensitivity to localized abnormal patterns. In this work, we investigate State Space Models (SSMs) as an alternative backbone for RSC. Using the Distilled Audio State Space model, we analyze intermediate representations through spectral response curves and observe stronger preservation of mid-to-high spatial-frequency components. Based on these observations, we introduce spectral-aware layer regularization using Gaussian convolution applied to selected layers. We further propose Dual-Axis Patch-Mix contrastive learning tailored to SSM-based audio models for robust representation learning. Experiments on the ICBHI benchmark show that our approach achieves 64.48% score, outperforming the AST baseline by 5%. Code is available at https://github.com/RSC-Toolkit/Lung-SRAD.

URL PDF HTML ☆

赞 0 踩 0

2606.12006 2026-06-11 cs.LG cs.AI 交叉投稿

Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation

通过生存感知适配的临床生存分析表格基础模型

Minh-Khoi Pham, Luca Cotugno, Alina Sirbu, Tai Tan Mai, Martin Crane, Marija Bezbradica

发表机构 * ADAPT Centre, Dublin City University（ADAPT中心，都柏林城市大学）； School of Computing, Dublin City University（都柏林城市大学计算机学院）； Department of Computer Science and Engineering, University of Bologna（博洛尼亚大学计算机科学与工程系）

AI总结提出轻量级适配方法，将表格基础模型（TabPFN、TabDPT、TabICL）与多任务逻辑回归头结合，用于临床生存分析，在多个基准和ICU队列上达到竞争性或更优性能。

Comments Accepted for publication at International Conference on AI in Healthcare 2026

详情

AI中文摘要

预测死亡率等时间至事件结果是临床决策中的基本任务，通常通过生存分析来解决。虽然经典的统计和深度学习方法已被广泛研究，但它们通常需要特定任务的训练和足够的标记数据。最近表格基础模型的进展通过学习结构化数据的通用表示提供了一种新范式。然而，它们在临床环境中对删失时间至事件预测的适用性仍未得到充分探索，因为典型应用仅限于离散分类而非生存分析任务。在这项工作中，我们提出了一种轻量级适配方法，通过直接在预训练表示之上训练一个生存感知头，将表格基础模型应用于临床生存分析。我们研究了代表性架构，包括TabPFN、TabDPT和TabICL，并使用多任务逻辑回归（MTLR）头对它们进行适配，以建模右删失时间至事件结果。我们在多个公开生存基准和两个大规模ICU队列MIMIC-IV和eICU上评估了该方法。我们的结果表明，这种迁移学习方法与强基线相比达到了竞争性或更优的性能。在MIMIC-IV上，TabDPT-FT-MTLR达到了0.856的C指数，相对于最佳非FM基线（DeepSurv，0.844）相对提升了+1.4%，相对于最佳零样本模型（0.802）提升了+6.7%。在eICU上，TabICL-FT-MTLR达到了0.797，分别获得了+1.7%（DeepSurv，0.784）和+6.4%（0.749）的提升。这些发现强调了将预训练表格表示与生存感知目标相结合的重要性，并表明表格基础模型为临床生存预测提供了一种实用且有效的替代方案。

英文摘要

Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making, commonly addressed through survival analysis. While classical statistical and deep learning approaches have been widely studied, they typically require task-specific training and sufficient labeled data. Recent advances in tabular foundation models offer a new paradigm by learning general-purpose representations for structured data. However, their applicability to censored time-to-event prediction in clinical settings remains underexplored, as typical applications are restricted to discrete classification rather than survival analysis tasks. In this work, we propose a lightweight adaptation approach for applying tabular foundation models to clinical survival analysis by directly training a survival-aware head on top of the pretrained representations. We study representative architectures, including TabPFN, TabDPT, and TabICL, and adapt them using a multi-task logistic regression (MTLR) head to model right-censored time-to-event outcomes. We evaluate this approach on a diverse set of public survival benchmarks and two large-scale ICU cohorts, MIMIC-IV and eICU. Our results show that this transfer learning approach achieves competitive or superior performance compared to strong baselines. On MIMIC-IV, TabDPT-FT-MTLR reaches a C-index of 0.856, corresponding to a relative improvement of +1.4% over the best non-FM baseline (DeepSurv, 0.844) and +6.7% over the best zero-shot model (0.802). On eICU, TabICL-FT-MTLR achieves 0.797, yielding gains of +1.7% (DeepSurv, 0.784) and +6.4% (0.749), respectively. These findings highlight the importance of combining pretrained tabular representations with survival-aware objectives and suggest that tabular foundation models provide a practical and effective alternative for clinical survival prediction.

URL PDF HTML ☆

赞 0 踩 0

2606.12074 2026-06-11 cs.CV cs.AI eess.IV 交叉投稿

Non-frontal face recognition using GANs and memristor-based classifiers

基于GAN和忆阻器分类器的非正面人脸识别

Semih Vazgecen, Cristian Sestito, Spyros Stathopoulos, Themis Prodromakis

发表机构 * Centre for Electronics Frontiers, Institute for Integrated Micro and Nano Systems, School of Engineering, The University of Edinburgh（爱丁堡大学工程学院集成微纳系统研究所电子前沿中心）

AI总结提出将轻量级GAN正面化与忆阻器神经形态识别结合，解决非正面人脸识别，在数据集上达96%准确率。

Comments 12 pages, 4 figures, 1 Supplementary (22 pages, 16 figures, 6 tables, 4 supplementary notes)

详情

AI中文摘要

使用可解释性作为训练时可靠性信号实现高效心电图分类

Veerendhra Kumar Dangeti, Xiao Gu, Ying Weng, Shreyank N Gowda

发表机构 * School of Computer Science, University of Nottingham（诺丁汉大学计算机科学学院）； Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford（牛津大学工程科学系生物医学工程研究所）； School of Computer Science, University of Nottingham Ningbo China（宁波诺丁汉大学计算机科学学院）

AI总结提出ERTS方法，利用训练中的解释质量（Grad-CAM注意力图）区分信息性和不可靠不确定性，过滤低聚焦样本，在三个ECG数据集上提升macro-F1并降低训练成本。

详情

AI中文摘要

训练用于临床时间序列分析的深度神经网络计算需求高，但许多医疗环境缺乏重复模型开发和部署所需的资源。这一挑战在心电图分类中尤为明显，大数据集和长训练计划使效率变得重要。渐进式数据丢弃通过从梯度更新中排除已学习的样本来降低训练成本，但它依赖模型置信度，可能保留因噪声或歧义而难以处理而非有用信号的样本。在这项工作中，我们引入了ERTS，一种基于可解释性的可靠性训练信号，用于高效心电图分类。ERTS在训练期间利用解释质量来区分信息性和不可靠的不确定性。基于渐进式数据选择，我们计算候选样本的Grad-CAM注意力图，并推导出一个聚焦分数，衡量模型预测是否得到连贯且局部化模式的支持。低聚焦样本被过滤掉，而具有有意义注意力的样本优先进行梯度更新。我们在三个ECG数据集和多个骨干架构上评估ERTS，显示macro-F1的一致提升以及有效训练成本的降低。这些结果表明，解释质量可以作为改善临床时间序列学习中效率和可靠性的实用信号。代码将发布。

英文摘要

Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.

URL PDF HTML ☆

赞 0 踩 0

2606.12346 2026-06-11 cs.CV cs.AI cs.LG 交叉投稿

Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

Atlas H&E-TME：基于AI的可扩展组织分析，达到专家病理学家级别的准确性

Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen

发表机构 * Aignostics, Germany（Aignostics，德国）； Institute of Pathology, Charité – Universitätsmedizin Berlin, Germany（柏林夏里特医学院病理学研究所）； Berlin Institute of Health, Charité – Universitätsmedizin Berlin, Germany（柏林夏里特医学院柏林健康研究所）； Massachusetts General Hospital, Department of Pathology, Harvard Medical School, Boston, MA, US（哈佛医学院麻省总医院病理学系）； Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, US（梅奥诊所检验医学与病理学系）； Machine Learning Group, Technische Universität Berlin, Germany（柏林工业大学机器学习组）； BIFOLD – Berlin Institute for the Foundations of Learning and Data, Germany（柏林学习与数据基础研究所）； Department of Artificial Intelligence, Korea University, Republic of Korea（高丽大学人工智能系）； Max-Planck Institute for Informatics, Germany（马克斯·普朗克信息学研究所）； German Cancer Research Center (DKFZ) & German Cancer Consortium (DKTK), Berlin & Munich Partner Sites, Germany（德国癌症研究中心及德国癌症联盟柏林和慕尼黑合作站点）； Institute of Pathology, Ludwig-Maximilians-Universität München, Germany（慕尼黑大学病理学研究所）； Bavarian Cancer Research Center (BZKF), Germany（巴伐利亚癌症研究中心）

AI总结提出Atlas H&E-TME系统，利用病理基础模型预测组织质量、区域和细胞类型，通过IHC共识验证和20万+注释基准，在多种癌症中达到或超越病理学家水平。

详情

AI中文摘要

苏木精和伊红（H&E）染色是组织病理学的基石，然而对H&E全切片图像（WSI）进行可扩展的定量分析仍然是计算病理学中的核心挑战。我们提出了Atlas H&E-TME，这是一个基于Atlas病理基础模型家族的AI系统，可预测多种癌症类型的组织质量、组织区域和细胞类型标签，在细胞级分辨率下每张切片产生超过4,500个定量读数。验证此类系统的关键挑战在于克服H&E-only金标准固有的形态模糊性，以及依赖免疫组织化学（IHC）等模态的更可靠参考的可扩展性有限。我们通过一个双重验证框架解决了这一问题，该框架将生物学深度的基础与技术及形态学的广度相结合。在深度方面，我们提出了一种IHC引导的多病理学家共识协议，该协议显著提高了相较于传统H&E-only注释的评分者间一致性。这产生了一个分子学基础的参考，我们据此比较Atlas H&E-TME和仅使用H&E的病理学家。在广度方面，我们在超过20万个高置信度H&E-only病理学家注释上对Atlas H&E-TME进行了基准测试，这些注释涵盖1,500多个病例，跨越八种癌症类型及其最常见的转移部位，亚型覆盖每种癌症类型>90%的临床病例，来自25个以上来源和8种以上扫描仪型号。与IHC引导的共识相比，Atlas H&E-TME达到或超过了病理学家仅使用H&E的性能，并在这一广泛的形态学和技术范围内一致且稳健地泛化。通过这种方式，Atlas H&E-TME将H&E切片——病理学中最普遍的数据——转化为一个可扩展的、定量的肿瘤及其微环境窗口，为转化和临床研究中下一代基于组织的生物标志物奠定了基础。

英文摘要

Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to H&E-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional H&E-only annotation. This yields a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on over 200,000 high-confidence H&E-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering >90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologist H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide -- the most ubiquitous data in pathology -- into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.

URL PDF HTML ☆

赞 0 踩 0

2606.12378 2026-06-11 cs.CV cs.AI 交叉投稿

Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

面向机器人生理感知的鲁棒光照相机心率估计

Zhi Wei Xu, Torbjörn E. M. Nordling

发表机构 * National Cheng Kung University（国立成功大学）

AI总结提出一种端到端时空Transformer框架，结合PRNet三维人脸对齐、光照增强、残差时序标准化和混合时频监督，在光照变化数据集上实现0.79 bpm心率MAE和0.982相关系数，相比PhysFormer降低93.6%误差。

Comments 8 pages, 4 figures

详情

AI中文摘要

生理感知对于在日常生活环境中与人类交互的服务型、社交型和辅助型机器人至关重要。远程光电容积描记法（rPPG）能够从RGB相机中实现非接触式心率（HR）估计，使其成为机器人视觉系统的一种有前景的感知模态。然而，光照变化仍然是鲁棒部署的主要障碍。本文提出了一种端到端的时空Transformer框架，用于在具有不同光照条件的新数据集上进行远程心率估计。我们的估计器集成了基于PRNet的三维人脸对齐、片段级光照增强、残差时序标准化模块以及受控的混合时频监督。训练目标结合了Soft-Shifted Pearson波形损失和频谱Kullback-Leibler散度损失，其中调优权重（$\mathbf{\beta}$）控制频域心率指导的贡献。在覆盖三个光照级别的静态全混合协议上的实验表明，$\mathbf{\beta}=5$在测试的beta设置中提供了最强结果，实现了最佳运行心率平均绝对误差（MAE）为0.79 bpm，心率相关系数为0.982。与在我们的数据集上评估的PhysFormer基线相比，我们的估计器将心率MAE降低了93.6%，同时将心率相关系数从0.088提高到0.982，使其在光照变化时可用。

英文摘要

Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote photoplethysmography (rPPG) enables non-contact heart-rate (HR) estimation from an RGB camera, making it a promising sensing modality for robot-mounted vision systems. However, illumination variation remains a major barrier to robust deployment. This paper presents an end-to-end spatial-temporal transformer framework for remote HR estimation on a new dataset with varied illumination. Our estimator integrates PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision. The training objective combines a Soft-Shifted Pearson waveform loss with a spectral Kullback-Leibler divergence loss, where a tuned weight ($\mathbfβ$) controls the contribution of frequency-domain heart-rate guidance. Experiments on a static all-level mix protocol covering three illumination levels show that $\mathbfβ=5$ provides the strongest result among the tested beta settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982. Compared with the PhysFormer baseline evaluated on our dataset, our estimator reduces HR MAE by 93.6 %, while increasing HR correlation from 0.088 to 0.982, making it usable when illumination varies.

URL PDF HTML ☆

赞 0 踩 0

2606.12387 2026-06-11 cs.DB cs.AI 交叉投稿

跨云和边缘的防洪溢流监控稳健解决方案

Vipin Singh, Tianheng Ling, Peter Ghaly, Felix Grimmeisen, Gregor Schiele, Felix Biessmann

发表机构 * Berlin University of Applied Sciences（柏林应用技术大学）； University of Duisburg-Essen（杜伊斯堡-埃森大学）； Okeanos Smart Data Solutions GmbH（Okeanos智能数据解决方案 GmbH）； Einstein Center Digital Future（爱因斯坦数字未来研究中心）

AI总结本文提出一个基于深度学习的云边协同监控平台，用于预测溢流池填充动态，以应对城市排水系统老化问题，提升防洪预警能力。

Comments 3 pages, 6 figures, accepted at 35th International Joint Conference on Artificial Intelligence 2026 (IJCAI-ECAI 2026), Demonstrations Track. URL: https://riwwer.demo.calgo-lab.de

2304.13905 2026-06-11 cs.CR cs.AI cs.LG 版本更新

LSTM based IoT Device Identification

基于LSTM的物联网设备识别

Kahraman Kostas

发表机构 * Kahraman Kostas

AI总结提出一种端到端机器学习流程，利用LSTM网络处理原始网络数据包，通过滑动窗口时间序列特征识别27类物联网设备，在最优配置下达到79.85%准确率和75.70%宏平均F1分数。

详情

AI中文摘要

随着物联网的使用越来越普及，大量设备进入市场，许多安全漏洞也随之出现。在此环境下，物联网设备识别方法提供了一种预防性安全措施，作为识别这些设备并检测其漏洞的重要因素。在本研究中，我们提出了一种端到端的机器学习流程，利用长短期记忆（LSTM）网络识别阿尔托大学数据集（物联网设备捕获）中的物联网设备。原始网络数据包捕获（PCAP）被处理成25个工程特征，然后排列为滑动窗口时间序列。我们系统地评估了从2到20的序列长度，报告称性能在长度6之前近似线性提升，之后呈波浪形模式，在长度18时达到峰值。在最优配置的最终保留测试集上，该模型在27个设备类别上达到了79.85%的准确率和75.70%的宏平均F1分数。

英文摘要

While the use of the Internet of Things is becoming more and more popular, many security vulnerabilities are emerging with the large number of devices being introduced to the market. In this environment, IoT device identification methods provide a preventive security measure as an important factor in identifying these devices and detecting the vulnerabilities they suffer from. In this study, we present an end-to-end machine learning pipeline that identifies IoT devices in the Aalto university dataset (IoT devices captures) using Long Short-Term Memory (LSTM) networks. Raw network packet captures (PCAP) are processed into 25 engineered features, which are then arranged as sliding-window time-series sequences. We systematically evaluate sequence lengths from 2 to 20, reporting that performance improves approximately linearly up to length 6 and thereafter in a wave-like pattern, reaching its peak at length 18. On the final held-out test set with the optimal configuration, the model achieves an accuracy of 79.85% and a macro-averaged F1-score of 75.70% across 27 device classes.

URL PDF HTML ☆

赞 0 踩 0

2502.14894 2026-06-11 cs.CV cs.AI cs.CY cs.LG 版本更新

FOCUS on Contamination: Hydrology-Informed Noise-Aware Learning for Geospatial PFAS Mapping

聚焦污染：基于水文信息与噪声感知的地理空间PFAS测绘学习

Jowaria Khan, Alexa Friedman, Sydney Evans, Rachel Klein, Runzi Wang, Katherine E. Manz, Kaley Beins, David Q. Andrews, Elizabeth Bondi-Kelly

发表机构 * University of Michigan（密歇根大学）； Environmental Working Group（环保工作组）； University of California, Davis（加州大学戴维斯分校）

AI总结提出FOCUS框架，结合稀疏PFAS观测与水文连通性等环境先验，通过噪声感知损失实现鲁棒训练，在PFAS污染测绘中优于传统方法。

Comments Best Paper Award at ICLR 2026 Machine Learning for Remote Sensing Workshop

详情

AI中文摘要

全氟和多氟烷基物质（PFAS）是持久性环境污染物，对公共健康有显著影响，但由于现场采样的高成本和后勤挑战，大规模监测仍然严重受限。样本的缺乏导致难以用物理模型模拟其扩散，并且对PFAS在地表水中传输的科学理解有限。然而，描述土地覆盖、水文和工业活动的丰富地理空间和卫星衍生数据广泛可用。我们提出了FOCUS，一个用于PFAS污染测绘的地理空间深度学习框架，该框架将稀疏的PFAS观测与大规模环境背景（包括来自水文连通性、土地覆盖、污染源邻近性和采样距离的先验）相结合。这些先验被整合到一个原则性的、噪声感知的损失函数中，从而在稀疏标签下产生稳健的训练目标。通过广泛的消融实验、鲁棒性分析和实际验证，FOCUS始终优于包括稀疏分割、克里金法和污染物传输模拟在内的基线方法，同时在大区域上保持了空间一致性和可扩展性。我们的结果展示了AI如何通过提供筛查级风险图来支持环境科学，这些风险图可优先安排后续采样，并在缺乏完整物理模型的情况下帮助将潜在污染源与地表水污染模式联系起来。

英文摘要

Per- and polyfluoroalkyl substances (PFAS) are persistent environmental contaminants with significant public health impacts, yet large-scale monitoring remains severely limited due to the high cost and logistical challenges of field sampling. The lack of samples leads to difficulty simulating their spread with physical models and limited scientific understanding of PFAS transport in surface waters. Yet, rich geospatial and satellite-derived data describing land cover, hydrology, and industrial activity are widely available. We introduce FOCUS, a geospatial deep learning framework for PFAS contamination mapping that integrates sparse PFAS observations with large-scale environmental context, including priors derived from hydrological connectivity, land cover, source proximity, and sampling distance. These priors are integrated into a principled, noise-aware loss, yielding a robust training objective under sparse labels. Across extensive ablations, robustness analyses, and real-world validation, FOCUS consistently outperforms baselines including sparse segmentation, Kriging, and pollutant transport simulations, while preserving spatial coherence and scalability over large regions. Our results demonstrate how AI can support environmental science by providing screening-level risk maps that prioritize follow-up sampling and help connect potential sources to surface-water contamination patterns in the absence of complete physical models.

URL PDF HTML ☆

赞 0 踩 0

2508.09459 2026-06-11 cs.CV cs.AI 版本更新

RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

RelayFormer: 一种用于可扩展图像和视频篡改定位的统一局部-全局注意力框架

Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院，清华大学）； College of Artificial Intelligence, Nankai University（南开大学人工智能学院）； College of Computer Science and Software Engineering, Shenzhen University（深圳大学计算机科学与软件工程学院）； Huawei Technologies Co., Ltd（华为技术有限公司）

AI总结提出RelayFormer统一框架，通过全局局部中继（GLR）令牌和中继注意力机制，适应不同分辨率并统一处理图像与视频，在篡改定位任务中实现高效且性能优越。

详情

AI中文摘要

视觉篡改定位（VML）旨在识别图像和视频中被篡改的区域，随着高级编辑工具的兴起，这一任务变得日益具有挑战性。现有方法面临两个核心问题。首先是分辨率多样性。调整大小或填充可能会扭曲微妙的取证线索，并引入不必要的计算成本。其次是将图像的空间模型扩展到视频的时空输入的困难，这通常导致为两种数据类型维护单独的架构。为了解决这些挑战，我们提出了RelayFormer，一个统一框架，能够适应不同分辨率并自然处理静态和时态视觉数据。RelayFormer将输入划分为固定大小的子图像，并引入全局局部中继（GLR）令牌，通过基于中继的注意力机制传播结构化上下文。这种设计使得全局线索（如语义或时间一致性）的高效交换成为可能，同时保留细粒度的篡改伪影。与依赖统一调整大小或稀疏注意力的先前方法不同，RelayFormer以最小的开销扩展到可变分辨率和视频序列。跨多个基准的实验表明，其具有优越的性能和强大的效率，结合了无需插值或过多填充的分辨率适应性、图像和视频的统一处理，以及准确性和计算成本之间的有利平衡。代码可在\href{this https URL}{this https URL}获取。

英文摘要

Visual manipulation localization (VML) aims to identify tampered regions in images and videos, a task that has become increasingly challenging with the rise of advanced editing tools. Existing methods face two central issues. The first is resolution diversity. Resizing or padding can distort subtle forensic cues and introduce unnecessary computational cost. The second is the difficulty of extending spatial models for images to spatio-temporal inputs in videos, which often results in maintaining separate architectures for the two data types. To address these challenges, we propose RelayFormer, a unified framework that adapts to varying resolutions and naturally handles both static and temporal visual data. RelayFormer partitions inputs into fixed-size sub-images and introduces Global Local Relay (GLR) tokens that propagate structured context through a relay-based attention mechanism. This design enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts. Unlike prior approaches that depend on uniform resizing or sparse attention, RelayFormer scales to variable resolutions and video sequences with minimal overhead. Experiments across diverse benchmarks demonstrate superior performance and strong efficiency, combining resolution adaptivity without interpolation or excessive padding, unified processing for images and videos, and a favorable balance between accuracy and computational cost. Code is available at~\href{https://github.com/WenOOI/RelayFormer}{https://github.com/WenOOI/RelayFormer}.

URL PDF HTML ☆

赞 0 踩 0

2510.16152 2026-06-11 cs.DL cs.AI cs.CL cs.LG 版本更新

Mapping Scientific Literature with Large Language Models and Topic Modeling

利用大语言模型和主题建模绘制科学文献图谱

Mason Smetana, Lev Khazanovich

发表机构 * Department of Civil and Environmental Engineering（土木与环境工程系）； University of Pittsburgh（匹兹堡大学）

AI总结提出基于大语言模型的两阶段分类框架，通过主题建模分析PNAS工程类文献，生成语义可解释主题并揭示跨主题关联，性能优于传统方法。

Comments 35 pages, 10 figures. Accepted for publication in Scientometrics. Final version available via DOI

详情

DOI: 10.1007/s11192-026-05643-9
Journal ref: Scientometrics (2026)

AI中文摘要

科学文献因学科边界、专业术语和潜在稀疏的关键词系统而日益碎片化，使得捕捉现代科学的演化结构变得困难。本研究引入了一个大语言模型驱动的框架，从主题建模的角度绘制科学文献图谱。该方法在《美国国家科学院院刊》20年间超过1500篇工程相关文章语料上进行了演示。一个两阶段分类流水线首先根据每篇文章的摘要分配一个主要主题类别，然后进行全文分析以识别次要分类，揭示语料库中潜在的跨主题联系。与传统主题模型不同，基于LLM的框架在保持强量化性能的同时，生成语义可解释的主题。与既定主题建模方法的比较评估显示，主题多样性更高，重叠度更低，且具有竞争性的一致性指标。对随机抽样的摘要子集进行手动验证，准确率达到75.9%。额外的传统自然语言处理分析证实，生成的主题对应于语料库中有意义的语言模式。连接主要和次要分类的二部网络进一步揭示了仅通过摘要或关键词系统不易观察到的隐含主题关系。结果表明，该框架无需事先了解期刊的编辑双重分类结构，即可独立恢复其大部分结构。总体而言，所提出的方法为绘制科学图谱和识别研究中新兴的跨主题联系提供了有力工具。

英文摘要

Scientific literature is increasingly fragmented by disciplinary boundaries, specialized terminology, and potentially sparse keyword systems, making it difficult to capture the evolving structure of modern science. This study introduces a large language model (LLM)-driven framework for mapping scientific literature from a topic modeling perspective. The approach is demonstrated on a 20-year corpus of more than 1,500 engineering-related articles published in the Proceedings of the National Academy of Sciences (PNAS). A two-stage classification pipeline first assigns a primary thematic category to each article based on its abstract, followed by full-text analysis to identify secondary classifications that reveal latent cross-topic connections within the corpus. Unlike conventional topic models, the LLM-based framework produces semantically interpretable topics while maintaining strong quantitative performance. Comparative evaluation against established topic modeling methods shows higher topic diversity and lower overlap with competitive coherence metrics. Manual validation on a randomly sampled subset of abstracts yields an accuracy of 75.9%. Additional traditional natural language processing analyses confirm that the generated topics correspond to meaningful linguistic patterns in the corpus. A bipartite network linking primary and secondary classifications further reveals implicit thematic relationships that are not readily observable through abstracts or keyword systems alone. The findings indicate that the framework independently recovers much of the journal's editorial dual-classification structure without prior knowledge of its schema. Overall, the proposed approach offers a powerful tool for mapping science and identifying emerging cross-topic connections in research.

URL PDF HTML ☆

赞 0 踩 0

2512.11982 2026-06-11 astro-ph.IM cs.AI cs.CV cs.LG 版本更新

Semantic search for 100M+ galaxy images using AI-generated captions

基于AI生成描述的1亿+星系图像语义搜索

Nolan Koblischke, Liam Parker, Francois Lanusse, Jo Bovy, Irina Espejo, Shirley Ho

发表机构 * New York University（纽约大学）； University of Toronto（多伦多大学）； Dunlap Institute for Astronomy & Astrophysics（达伦普天文与天体物理研究所）； University of California, Berkeley（加州大学伯克利分校）； Center for Data Science（数据科学中心）； Lawrence Berkeley National Lab（伯克利国家实验室）； Flatiron Institute（Flatiron研究所）； Université Paris-Saclay（巴黎-萨克莱大学）； CEA（法国原子能委员会）； CNRS（法国国家科学研究中心）； AIM（应用数学研究所）； Princeton University（普林斯顿大学）

AI总结提出利用视觉语言模型生成星系图像描述，并对比对齐预训练天文学基础模型，构建可搜索嵌入，实现大规模星系图像的语义搜索，在稀有现象发现上取得最先进性能。

Comments ApJ, in press

详情

AI中文摘要

通过缓慢的手动标注活动寻找科学上有趣的现象严重限制了我们对望远镜产生的数十亿星系图像的探索能力。在这项工作中，我们开发了一个流水线，从完全未标记的图像数据创建语义搜索引擎。我们的方法利用视觉语言模型（VLM）为星系图像生成描述，然后将预训练的天文学基础模型与这些嵌入的描述进行对比对齐，以产生大规模可搜索的嵌入。我们发现当前的VLM提供的描述信息足够丰富，可以训练一个语义搜索模型，该模型优于直接图像相似性搜索。我们的模型AION-Search在寻找稀有现象方面实现了最先进的零样本性能，尽管训练是在随机选择的图像上进行的，没有针对稀有情况进行刻意策划。此外，我们引入了一种基于VLM的重排序方法，该方法在top-100结果中对我们最具挑战性的目标的召回率几乎翻倍。首次，AION-Search实现了对超过1亿张星系图像的灵活语义搜索，使得从以前不可行的搜索中能够发现新现象，包括识别出36个新的河外恒星流候选体。更广泛地说，我们的工作提供了一种方法，使大型、未标记的科学图像档案变得可语义搜索，扩展了从地球观测到显微镜等领域的数据探索能力。代码、数据和应用程序可在以下网址公开获取：https://this https URL

英文摘要

Finding scientifically interesting phenomena through slow manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that current VLMs provide descriptions that are sufficiently informative to train a semantic search model that outperforms direct image similarity search. Our model, AION-Search, achieves state-of-the-art zero-shot performance on finding rare phenomena despite training on randomly selected images with no deliberate curation for rare cases. Furthermore, we introduce a VLM-based re-ranking method that nearly doubles the recall for our most challenging targets in the top-100 results. For the first time, AION-Search enables flexible semantic search for over 100 million galaxy images, enabling discovery from previously infeasible searches, including the identification of 36 new extragalactic stellar stream candidates. More broadly, our work provides an approach for making large, unlabeled scientific image archives semantically searchable, expanding data exploration capabilities in fields from Earth observation to microscopy. The code, data, and app are publicly available at https://github.com/NolanKoblischke/AION-Search

URL PDF HTML ☆

赞 0 踩 0

2512.13765 2026-06-11 eess.IV cs.AI cs.LG 版本更新

Towards Deep Learning Surrogate for the Forward Problem in Electrocardiology: A Scalable Alternative to Physics-Based Models

面向心电学正问题的深度学习代理模型：一种可扩展的物理模型替代方案

Shaheim Ogbomo-Harmitt, Cesare Magnetti, Chiara Spota, Jakub Grzelak, Oleg Aslanidi

发表机构 * School of Biomedical Engineering and Imaging Sciences, King’s College London（伦敦国王学院生物医学工程与成像科学学院）； PhysicsX

AI总结提出基于注意力机制的序列到序列深度学习框架，作为心电学正问题的代理模型，从心脏电压传播图预测心电图信号，在2D组织模拟中达到高精度（平均R²=0.99±0.01），为物理模型提供可扩展、低成本的替代方案。

Comments Accepted to CinC conference 2025

详情

AI中文摘要

心电学中的正问题，即从心脏电活动计算体表电位，传统上使用基于物理的模型（如双域或单域方程）求解。虽然准确，但这些方法计算成本高，限制了其在实时和大规模临床中的应用。我们提出一个概念验证的深度学习（DL）框架，作为正问题求解器的高效代理。该模型采用基于时间依赖注意力机制的序列到序列架构，从心脏电压传播图预测心电图（ECG）信号。引入了一种混合损失函数，结合Huber损失和谱熵项，以保持时域和频域的保真度。使用包含健康、纤维化和缝隙连接重塑条件的2D组织模拟，模型实现了高精度（平均$R^2 = 0.99 \pm 0.01$）。消融研究证实了卷积编码器、时间感知注意力和谱熵损失的贡献。这些发现突显了DL作为物理求解器的可扩展、低成本替代方案的潜力，适用于临床和数字孪生应用。

英文摘要

The forward problem in electrocardiology, computing body surface potentials from cardiac electrical activity, is traditionally solved using physics-based models such as the bidomain or monodomain equations. While accurate, these approaches are computationally expensive, limiting their use in real-time and large-scale clinical applications. We propose a proof-of-concept deep learning (DL) framework as an efficient surrogate for forward solvers. The model adopts a time-dependent, attention-based sequence-to-sequence architecture to predict electrocardiogram (ECG) signals from cardiac voltage propagation maps. A hybrid loss combining Huber loss with a spectral entropy term was introduced to preserve both temporal and frequency-domain fidelity. Using 2D tissue simulations incorporating healthy, fibrotic, and gap junction-remodelled conditions, the model achieved high accuracy (mean $R^2 = 0.99 \pm 0.01$). Ablation studies confirmed the contributions of convolutional encoders, time-aware attention, and spectral entropy loss. These findings highlight DL as a scalable, cost-effective alternative to physics-based solvers, with potential for clinical and digital twin applications.

URL PDF HTML ☆

赞 0 踩 0

2601.21293 2026-06-11 cs.LG cs.AI 版本更新

Reliability-Calibrated Edge-IoT Early Fault Warning for Rotating Machinery with a Physics-Guided Tiny-Mamba Transformer

面向旋转机械的可靠性校准边缘物联网早期故障预警：一种物理引导的Tiny-Mamba Transformer

Changyu Li, Huabei Nie, Xiaoya Ni, Lu Wang, Lijuan Shen, Kaishun Wu, Fei Luo

发表机构 * Great Bay University（大亚湾大学）； Huizhou University（惠州大学）； National University of Singapore（国立新加坡大学）； Shenzhen University（深圳大学）； James Cook University（詹姆斯库克大学）； Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出一种可靠性校准的边缘物联网早期故障预警框架，使用物理引导的Tiny-Mamba Transformer提取特征，结合极值理论校准误报率，在低计算资源下实现高精度、低延迟的旋转机械故障预警。

详情

AI中文摘要

工业物联网系统日益依赖分布式振动传感来支持旋转机械的预测性维护。然而，在实际部署中，原始信号上传成本高昂，且报警决策必须在有限计算资源、变化运行条件和严格误报预算下本地进行。本文提出一种可靠性校准的边缘物联网早期预警框架，其中紧凑的物理引导Tiny-Mamba Transformer作为表示模块，极值理论层将流式异常分数转换为事件级报警片段。PG-TMT结合深度可分离卷积主干、Tiny-Mamba状态空间分支和轻量级局部Transformer，在批量大小为1的推理下捕获瞬态、长周期和多通道退化线索。为提高可审计性，时间注意力被投影到频域并与分析轴承故障阶次带软对齐。极值理论校准、双阈值迟滞和修尾拟合即使在健康校准数据不完美的情况下也能提供可控的误报强度。在CWRU、Paderborn、XJTU-SY和工业试点上的实验表明，所提框架提高了PR-AUC，在可控误报预算下减少了检测延迟，并对结构化干扰、元数据不确定性、复合故障混合和域转移保持鲁棒。凭借小于1 MB的占用空间和低于7 ms的Jetson p99延迟，该框架支持工业物联网预测性维护的校准和可解释早期预警。

英文摘要

Industrial Internet of Things (IIoT) systems increasingly rely on distributed vibration sensing to support predictive maintenance of rotating machinery. In practical deployments, however, raw signal upload is costly and alarm decisions must be made locally under limited computation, changing operating conditions, and strict nuisance-alarm budgets. This paper presents a reliability-calibrated edge-IoT early-warning framework, in which a compact Physics-Guided Tiny-Mamba Transformer (PG-TMT) acts as the representation module and an extreme value theory (EVT) layer converts streaming anomaly scores into event-level alarm episodes. PG-TMT combines a depthwise-separable convolutional stem, a Tiny-Mamba state-space branch, and a lightweight local Transformer to capture transient, long-horizon, and multichannel degradation cues under batch-size-one inference. To improve auditability, temporal attention is projected to the frequency domain and softly aligned with analytical bearing fault-order bands. EVT calibration, dual-threshold hysteresis, and trimmed-tail fitting provide controllable false-alarm intensity even when healthy calibration data are imperfect. Experiments on CWRU, Paderborn, XJTU-SY, and an industrial pilot demonstrate that the proposed framework improves PR-AUC, reduces detection delay under a controlled nuisance-alarm budget, and remains robust to structured interference, metadata uncertainty, compound fault mixtures, and domain transfer. With a sub-1 MB footprint and Jetson p99 latency below 7 ms, the framework supports calibrated and interpretable early warnings for IIoT predictive maintenance.

URL PDF HTML ☆

赞 0 踩 0

2605.06100 2026-06-11 eess.SP cs.AI cs.LG cs.RO 版本更新

CredibleDFGO: Differentiable Factor Graph Optimization with Credibility Supervision

可信DFGO：具有可信度监督的可微因子图优化

Liang Qian, Penggao Yan, Penghui Xu, Li-Ta Hsu

发表机构 * Department of Aeronautical and Aviation Engineering（航空与航空工程系）

AI总结针对GNSS协方差不可靠问题，提出CredibleDFGO框架，通过可微高斯-牛顿求解器与加权生成网络，利用适当评分规则监督预测分布，提升协方差可信度与定位精度。

Comments Submitted to NAVIGATION: Journal of the Institute of Navigation

详情

AI中文摘要

全球导航卫星系统（GNSS）定位广泛用于城市导航，但GNSS求解器报告的协方差在城市峡谷中通常不可靠。现有的可微因子图优化（DFGO）方法通过求解器学习测量加权，但仍仅使用位置目标。因此，位置估计可能改善，而报告的协方差仍然过小、过大或方向错误。我们提出CredibleDFGO（CDFGO），一种可微GNSS因子图框架，将协方差可信度作为显式训练目标。加权生成网络（WGN）预测每颗卫星的可靠性权重，可微高斯-牛顿求解器将这些权重映射到位置估计和基于Hessian的后验协方差。我们使用适当评分规则端到端监督东-北预测分布。我们研究了负对数似然（NLL）、能量分数（ES）及其组合。在三个UrbanNav测试场景上的结果表明，协方差可信度持续提升。定位精度在中度城市和严峻城市场景中也有所提高；在深度城市场景中，平均水平误差和第95百分位误差均有所改善。在严峻城市的旺角（MK）场景中，与DFGO（MAE）相比，CDFGO-Combined将平均水平误差从13.77米降至11.68米，将NLL从40.63降至6.59，将ES从12.31降至9.05。案例研究将MK改进归因于更好的轴向一致性、更可信的局部协方差椭圆以及卫星级重新加权。

英文摘要

Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver is often unreliable in urban canyons. Existing differentiable factor graph optimization (DFGO) methods learn measurement weighting through the solver, but they still use position-only objectives. As a result, the position estimate may improve while the reported covariance remains too small, too large, or incorrectly oriented. We propose CredibleDFGO (CDFGO), a differentiable GNSS factor graph framework that makes covariance credibility an explicit training target. A Weighting Generation Network (WGN) predicts per-satellite reliability weights, and a differentiable Gauss-Newton solver maps these weights to a position estimate and a Hessian-derived posterior covariance. We use proper scoring rules to supervise the East-North predictive distribution end to end. We study negative log-likelihood (NLL), the energy score (ES), and their combination. Results on three UrbanNav test scenes show consistent gains in covariance credibility. Positioning accuracy also improves on the medium-urban and harsh-urban scenes; on the deep-urban scene, both the mean horizontal error and the 95th-percentile error improve. On the harsh-urban Mong Kok (MK) scene, CDFGO-Combined reduces the mean horizontal error from 13.77 m to 11.68 m, reduces NLL from 40.63 to 6.59, and reduces ES from 12.31 to 9.05 relative to DFGO (MAE). Case studies link the MK improvement to better axis-wise consistency, more credible local covariance ellipses, and satellite-level reweighting.

URL PDF HTML ☆

赞 0 踩 0

2605.06485 2026-06-11 cs.CL cs.AI 版本更新

Litespark Inference For CPUs: Ultra-Fast SIMD Framework for Ternary (1.58-bit) Language Models

Litespark Inference For CPUs: 三元（1.58位）语言模型的超快SIMD框架

Nii Osae Osae Dade, Tony Morri, Moinul Hossain Rahat, Sayandip Pal, Rickston Pinto

发表机构 * Mindbeam AI

AI总结针对三元语言模型权重为{-1,0,1}的特点，提出自定义SIMD内核，用加减运算替代矩阵乘法，在CPU上实现18-96倍加速和6倍内存减少。

详情

AI中文摘要

大型语言模型（LLM）已经改变了人工智能，但其计算需求对大多数用户来说仍然过高。标准推理需要昂贵的数据中心GPU或云API访问，导致超过十亿台个人计算机在AI工作负载中未被充分利用。三元模型提供了一条前进的道路：它们的权重被限制在{-1, 0, +1}，理论上消除了浮点乘法的需求。然而，现有框架未能利用这种结构，将三元模型视为密集浮点网络。我们通过自定义SIMD内核填补了这一空白，这些内核用简单的加法和减法运算取代矩阵乘法，针对现代CPU上可用的整数点积指令。我们的实现Litespark-Inference可通过pip安装，并直接与Hugging Face集成，在Apple Silicon上实现了比标准PyTorch推理高18.15倍的吞吐量、快7.15倍的首令牌时间和6.03倍的内存减少，在Intel和AMD处理器上实现了高达95.81倍的吞吐量加速。

英文摘要

Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging-Face, achieving 18.15x higher throughput, 7.15x faster time-to-first-token and 6.03x memory reduction compared to standard PyTorch inference on Apple Silicon, with comparable or higher throughput speedups up to 95.81x on Intel and AMD processors.

URL PDF HTML ☆

赞 0 踩 0

2606.10120 2026-06-11 cs.IR cs.AI cs.HC 版本更新

MetaPlate: Counterfactual-Guided RAG-LLM Tool for Personalized Food Recommendation and Hyperglycemia Prevention

MetaPlate: 反事实引导的RAG-LLM工具用于个性化食物推荐和高血糖预防

Asiful Arefeen, Carol Johnston, Hassan Ghasemzadeh

发表机构 * College of Health Solutions, Arizona State University（亚利桑那州立大学健康解决方案学院）； School of Computing and Augmented Intelligence, Arizona State University（亚利桑那州立大学计算与增强智能学院）

AI总结提出MetaPlate框架，结合反事实解释、机器学习预测和RAG-LLM，生成个性化膳食建议以预防餐后高血糖，经注册营养师评估证明其可行性和有效性。

详情

AI中文摘要

餐后高血糖是代谢紊乱的关键风险因素；然而，现有的饮食指导通常是静态的、不切实际的且个性化不足，提供的建议难以遵循或效果不佳。尽管最近的进展利用连续血糖监测（CGM）和机器学习来预测血糖反应，但这些方法主要是预测性的，缺乏可操作的指导。此外，推荐系统常常与用户目标不一致，且需要大量输入。我们提出了MetaPlate，一个反事实解释（CF）引导的、上下文感知的决策支持框架，用于生成个性化膳食建议，以减轻健康成年人的餐后血糖波动。MetaPlate整合了多模态数据，包括来自25名个体的CGM读数、可穿戴设备衍生的生理信号以及用户提供的膳食输入，以建模餐前上下文。一个机器学习模型预测血糖反应，而CF优化模块通过调整膳食组成（修改宏量营养素数量）来维持血糖水平在目标范围内（≤140 mg/dL）。基于LLM的检索增强生成（RAG）层通过使用USDA食品数据库的约束搜索生成人类可读的建议，增强了可解释性。我们通过结构化的专家在环评估，与注册营养师（RDs）一起评估MetaPlate，比较提示优化前后的性能。结果显示，在膳食真实性、份量适宜性和推荐可能性方面有所改进，专家反馈表明从临床不可行的输出转向了可操作、上下文适宜的建议。我们的发现强调了领域知识和结构化约束在LLM驱动系统中的重要性，并突出了MetaPlate作为实时个性化膳食决策支持工具的潜力。

英文摘要

Postprandial hyperglycemia is a key risk factor for metabolic disorders; however, existing dietary guidance is often static, impractical, and insufficiently personalized, providing recommendations that are difficult to follow or not impactful. While recent advances leverage continuous glucose monitoring (CGM) and machine learning to predict glycemic responses, these approaches are largely predictive and lack actionable guidance. Moreover, recommendation systems are often misaligned with user goals and require extensive input. We present MetaPlate, a counterfactual explanation (CF) guided, context-aware decision-support framework that generates personalized meal recommendations to mitigate postprandial glucose excursions in healthy adults. MetaPlate integrates multimodal data, including CGM readings, wearable-derived physiological signals, and user-provided meal inputs from $25$ individuals to model pre-meal context. A machine learning model predicts glucose response, while a CF optimization module adjusts meal composition modifying macronutrient amounts to maintain glucose levels within a target range ($\leq 140$ mg/dL). An LLM-based retrieval-augmented generation (RAG) layer enhances interpretability by producing human-readable recommendations using constrained search of the USDA food database. We evaluate MetaPlate via a structured expert-in-the-loop assessment with registered dietitians (RDs), comparing performance before and after prompt refinement. Results show improvements in meal realism, portion suitability, and recommendation likelihood, with expert feedback indicating a shift from clinically implausible outputs to actionable, contextually appropriate recommendations. Our findings emphasize the importance of domain knowledge and structured constraints in LLM-driven systems and highlight the potential of MetaPlate as a real-time personalized dietary decision-support tool.

URL PDF HTML ☆

赞 0 踩 0

2606.11245 2026-06-11 cs.AI cs.NE q-bio.NC 新提交

Position: Hippocampal Explicit Memory Is the Cornerstone for AGI

立场：海马体显式记忆是通用人工智能的基石

Sangjun Park

发表机构 * Sangjun Park

AI总结本文主张，将显式记忆整合到大语言模型中是迈向通用人工智能的关键，因为LLM的学习机制类似人类内隐记忆，而高阶认知功能依赖海马体显式记忆。

Comments Accepted to ICML 2026 (Position Paper Track)

2606.11560 2026-06-11 cs.DB cs.AI 交叉投稿

LLMs+Graphs: Toward Graph-Native, Synergistic AI Systems

LLMs+Graphs：迈向图原生的协同人工智能系统

Arijit Khan, Longxu Sun, Xin Huang

发表机构 * Bowling Green State University（伯灵顿绿色州立大学）； Hong Kong Baptist University（香港 Baptist大学）

AI总结本文综述了大语言模型与图计算的三种协同方式，包括增强推理、知识图谱双向集成及图算法增强的AI代理，并探讨了图数据管理与图机器学习的新能力，旨在为构建下一代图原生AI系统提供统一视角。

Comments 10 pages, Accepted at PAKDD 2066 Tutorial

详情

AI中文摘要

大语言模型（LLMs）发展迅速，但它们在结构化和多跳推理方面的局限性凸显了对图原生、协同人工智能（AI）系统的需求。图结构数据支撑着社交、生物、金融、交通、网络和知识领域的关键应用，因此理解LLMs如何利用图计算进行基于上下文的扎实推理至关重要。三种互补的协同方式正在涌现：通过图计算增强LLMs进行检索和推理；LLMs与知识图谱（KGs）的双向集成，其中LLMs支持KG构建和整理，而KGs强制执行语义约束和事实一致性；以及通过图算法增强的AI代理进行规划、决策和多步推理。同时，LLMs通过自然语言接口和混合LLM-图神经网络（GNN）流水线，为图数据管理和图机器学习（ML）引入了新能力。本教程综合了推动这些融合方向的算法、系统和设计原则，为数据科学和数据挖掘研究人员提供了将LLMs、图数据管理、图挖掘、图ML和代理计算集成到下一代图原生AI系统中的统一视角。

英文摘要

Large Language Models (LLMs) have advanced rapidly, but their limitations in structured and multi-hop reasoning underscore the need for graph-native, synergistic artificial intelligence (AI) systems. Graph-structured data underpins critical applications across social, biological, financial, transportation, web, and knowledge domains, making it essential to understand how LLMs can leverage graph computation for grounded, context-rich inference. Three complementary synergies are emerging: LLMs augmented with graph computation for retrieval and reasoning; bidirectional integration between LLMs and knowledge graphs (KGs), where LLMs support KG construction and curation while KGs enforce semantic constraints and factual consistency; and AI agents strengthened by graph algorithms for planning, decision making, and multi-step reasoning. In parallel, LLMs introduce new capabilities for graph data management and graph machine learning (ML) through natural language interfaces and hybrid LLM-graph neural network (GNN) pipelines. This tutorial synthesizes the algorithms, systems, and design principles driving these converging directions, offering data science and data mining researchers a unified perspective on integrating LLMs, graph data management, graph mining, graph ML, and agentic computation into next-generation graph-native AI systems.

URL PDF HTML ☆

赞 0 踩 0

2606.12022 2026-06-11 cs.FL cs.AI 交叉投稿

Runtime Enforcement of Hybrid System Properties

混合系统属性的运行时强制执行

Mir Md Sajid Sarwar, Srinivas Pinisetty, Rajarshi Ray, Thierry Jéron

发表机构 * Indian Institute of Technology Bhubaneswar（印度理工学院布巴内斯瓦尔分校）； Indian Association for the Cultivation of Science（印度科学培养协会）； Univ Rennes, Inria, CNRS, IRISA（里昂大学、Inria、CNRS、IRISA）

AI总结提出一种结合离散事件编辑与连续时间监控的运行时强制执行框架，使用混合自动机建模安全需求，通过运行时可达性分析合成安全纠正动作，在自适应巡航控制系统中验证有效性。

详情

AI中文摘要

运行时强制执行已成为确保在不确定和动态环境中运行的自主和网络物理系统安全的一种有前景的方法。与传统的运行时验证不同，运行时强制执行通过在执行期间主动干预，修改不安全系统行为以防止属性违反。现有的强制执行框架主要关注无时间或离散时间规范，并且通常仅限于延迟或抑制事件，这使得它们对于表现出复杂连续动态的反应式系统不充分。在本文中，我们提出了一种运行时强制执行框架，其中安全需求使用混合自动机（HA）建模。该框架将离散事件编辑与连续时间监控相结合，以支持在任意时间点执行抑制、延迟和插入事件等强制执行操作。在观察环境输入后，自动机被初始化，并使用运行时可达性分析来综合安全纠正动作。我们正式定义了安全混合自动机的强制执行问题，建立了可强制执行条件，并提出了一种用于反应式系统的在线强制执行算法。关于自适应巡航控制（ACC）系统的详细案例研究证明了所提出方法在不安全控制器行为下维护安全属性的有效性。实验结果表明，该框架在实时确保持续符合安全要求的同时，引入了最小的计算开销。

英文摘要

Runtime enforcement has emerged as a promising approach for ensuring the safety of autonomous and cyber-physical systems operating in uncertain and dynamic environments. Unlike traditional runtime verification, runtime enforcement actively intervenes during execution to prevent property violations by modifying unsafe system behaviors. Existing enforcement frameworks primarily focus on untimed or discrete-time specifications and are often limited to delaying or suppressing events, making them inadequate for reactive systems exhibiting complex continuous dynamics. In this paper, we propose a runtime enforcement framework where safety requirements are modeled using Hybrid Automata (HA). The framework combines discrete-event editing with continuous-time monitoring to support enforcement actions such as suppression, delay, and insertion of events at arbitrary time instants. Upon observing environmental inputs, the automaton is initialized, and runtime reachability analysis is used to synthesize safe corrective actions. We formally define the enforcement problem for safety hybrid automata, establish enforceability conditions, and present an online enforcement algorithm for reactive systems. A detailed case study on an Adaptive Cruise Control (ACC) system demonstrates the effectiveness of the proposed approach in maintaining safety properties under unsafe controller behaviors. Experimental results show that the framework introduces minimal computational overhead while ensuring continuous compliance with safety requirements in real time.

URL PDF HTML ☆

赞 0 踩 0

2510.02660 2026-06-11 cs.HC cs.AI 版本更新

When Researchers Say Mental Model/Theory of Mind of AI, What Are They Really Talking About?

当研究人员谈论AI的心理模型/心智理论时，他们究竟在说什么？

Xiaoyun Yin, Elmira Zahmat Doost, Shiwen Zhou, Garima Arya Yadav, Jamie C. Gorman

发表机构 * Center for Human, Artificial Intelligence, and Robot Teaming（人类、人工智能与机器人协同中心）

AI总结本文指出当前AI心智理论研究混淆了行为预测与真实认知，提出应转向人机交互中的互惠心智理论框架。

Comments This work have been accepted in CogInterp @ NeurIPS 2025

2604.25018 2026-06-11 cs.ET cs.AI cs.DC cs.NI 版本更新

Internet of Everything in the 6G Era: Paradigms, Enablers, Potentials and Future Directions

6G时代的万物互联：范式、使能技术、潜力与未来方向

Driss Choukri, Essaid Sabir, Elmahdi Driouch, Abdelkrim Haqiq

发表机构 * Computer Networks, Mobility and Modeling Laboratory (IR2M), FST, Hassan I University of Settat, Morocco, and the Department of Science and Technology, TÉLUQ, University of Quebec, Montreal, H2S 3L4, Canada（计算机网络、移动与建模实验室（IR2M），FST，哈桑一世大学塞塔特分校，摩洛哥，以及科技部，TÉLUQ，魁北克大学，蒙特利尔，H2S 3L4，加拿大）； Department of Science and Technology, TÉLUQ, University of Quebec, Montreal, H2S 3L4, Canada（科技部，TÉLUQ，魁北克大学，蒙特利尔，H2S 3L4，加拿大）； Department of Computer Science, University of Quebec at Montreal (UQAM), Montreal, H2L 2C4, Canada（计算机科学系，魁北克大学蒙特利尔分校（UQAM），蒙特利尔，H2L 2C4，加拿大）

AI总结本文综述了万物互联（IoE）的概念、核心组件、架构基础、使能技术及研究挑战，并探讨了面向6G智能IoE系统的开放研究方向，重点关注可扩展性、安全、隐私和能效。

Comments 48 pages, 15 figures, 6 tables, 272 references

2606.05608 2026-06-11 cs.SE cs.AI 版本更新

Agentic Software: How AI Agents Are Restructuring the Software Paradigm

软件工程的终结：AI代理如何根本性地重构软件范式

Zhenfeng Cao

发表机构 * Lingxi Intelligent Investment (Shenzhen) Development Co., Ltd.（灵犀智能投资（深圳）发展有限公司）

AI总结本文通过第一性原理分析，论证了以LLM为推理引擎的AI代理系统正在根本性地重构软件范式，从传统软件（代码承载决策逻辑）转向代理系统（代码作为临时工具），并提出了代理工程作为新兴学科。

Comments 15 pages, 2 figures, and 3 tables

详情

AI中文摘要

半个多世纪以来，软件工程一直基于一个基本前提：人类工程师分解问题，将决策逻辑编码为静态代码，并随着需求演变手动调整代码。本文认为，AI代理——即大型语言模型作为主要推理引擎、动态生成和丢弃代码作为工具资源的系统——的出现并非渐进式改进，而是对软件范式的根本性重构。基于复杂性缩放的第一性原理分析，我们形式化了传统软件（代码是决策逻辑的载体）与代理系统（代码是LLM驱动推理循环的临时工具）之间的区别。我们追溯了从许可软件到SaaS再到我们所谓的代理即服务（AaaS）的历史轨迹，表明每次转变都将额外的复杂性从最终用户转移出去。我们引入了代理工程作为一门新兴学科——其核心研究对象、控制模型和人类角色均不同于软件工程。通过分析最近的基准证据，包括SWE-bench Verified、EvoClaw和LangChain的多代理协调研究，我们展示了代理范式的变革潜力及其当前局限性。最后，我们提出了一个迈向自我进化代理生态系统的四阶段路线图，并为应对这一转变的从业者提供了具体建议。

英文摘要

For over half a century, software engineering has operated on a foundational premise: human engineers decompose problems, encode decision logic into static code, and manually adapt that code as requirements evolve. This paper argues that the emergence of AI agents -- systems where large language models serve as the primary reasoning engine, dynamically generating and discarding code as an instrumental resource -- constitutes a fundamental restructuring of what software is, not an incremental tool improvement. We formalize the distinction between traditional deterministic software and agentic software: in the former, code is the carrier of pre-written decision logic; in the latter, the agent itself is the software, and its decision logic is generated at runtime. We trace the historical arc from licensed software to SaaS to Agent-as-a-Service (AaaS), showing that each shift transferred additional complexity away from end-users -- with the agentic shift transferring not just operational complexity but decision-making complexity itself. We introduce Agentic Engineering as an expansion of the software engineering discipline into a new paradigm, distinct in its core object of study (agent systems rather than static source code), its control model (LLM-driven rather than human-predefined), and its human role (intent architect rather than code author). Through analysis of recent benchmark evidence including SWE-bench Verified, EvoClaw, and LangChain's multi-agent coordination studies, we demonstrate both the transformative potential of the agentic paradigm and its current limitations. We conclude with a four-stage roadmap toward self-evolving agent ecosystems and concrete recommendations for practitioners navigating this transition.

URL PDF HTML ☆

赞 0 踩 0

2605.26938 2026-06-11 cs.AI math.OC 版本更新

Developing a Totally Unimodular Linear Program for Optimal Conformance Checking: When and Why It Complements A*

开发用于最优合规性检查的全幺模线性规划：何时以及为何它补充A*

Izack Cohen

发表机构 * Bar Ilan University（巴伊兰大学）

AI总结提出将基于对齐的合规性检查重新表述为在全幺模线性规划上的问题，利用网络流结构保证整数最优解，实验表明在长轨迹和有偏差情况下显著加速A*。

Comments Author-accepted manuscript accepted for publication in Expert Systems with Applications. Code and experiment scripts are available at: https://github.com/Izack-Cohen/unimodular-conformance-checking. Version corresponding to the accepted paper: v1.0.0

详情

DOI: 10.1016/j.eswa.2026.133021
Journal ref: Expert Systems with Applications, Volume 331, Part A, 2026, 133021

AI中文摘要

基于对齐的合规性检查是比较观察到的过程执行与规范过程模型的最先进方法。标准的精确解依赖于基于A*的启发式搜索，在存在长轨迹或大量偏差时可能表现出指数级运行时间。本文介绍了将基于对齐的合规性检查重新表述为定义在同步积的可达图上的全幺模线性规划（LP）。通过利用底层的网络流结构，所提出的公式通过LP松弛保证了整数最优极值点的存在，从而避免了与整数变量和分支定界搜索相关的组合开销。我们在来自真实世界和合成基准数据集的超过210万个合规性检查实例上进行了广泛的实证评估。结果表明，A*和LP方法表现出互补的性能特征：前者在短且符合良好的轨迹上表现最佳，而LP公式为具有偏差的较长轨迹提供了显著的加速，这正是合规性检查最具信息量的地方。基于这些发现，我们推导出结合两种方法的简单算法选择指南，与始终使用A*相比，实现了平均38.6%的运行时间节省和96%的选择准确率。

英文摘要

Alignment-based conformance checking is the state-of-the-art approach for comparing observed process executions with normative process models. The standard exact solution relies on an A*-based heuristic search, which can exhibit exponential runtime in the presence of long traces or substantial deviations. This paper introduces a reformulation of alignment-based conformance checking as a totally unimodular linear program (LP) defined on the reachability graph of the synchronous product. By exploiting the underlying network-flow structure, the proposed formulation guarantees the existence of an integral optimal extreme-point solution through LP relaxation, thereby avoiding the combinatorial overhead associated with integer variables and branch-and-bound search. We conduct an extensive empirical evaluation on more than 2.1 million conformance checking instances derived from real-world and synthetic benchmark datasets. The results show that A* and the LP approach exhibit complementary performance characteristics: the former performs best on short, well-conforming traces, while the LP formulation provides substantial speedups for longer traces with deviations, precisely where conformance checking is most informative. Based on these findings, we derive simple algorithm-selection guidelines that combine both approaches, achieving average runtime savings of 38.6% with 96% selection accuracy compared to always using A*.

URL PDF HTML ☆

赞 0 踩 0

2412.01459 2026-06-11 cs.CY cs.AI cs.HC 版本更新

Perception Gaps in Risk, Benefit, and Value Between Experts and Public Challenge Socially Accepted AI

专家与公众在风险、收益和价值上的认知差距挑战社会接受的AI

Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle

发表机构 * RWTH Aachen University（亚琛工业大学）

AI总结研究比较了公众与AI专家在71个场景中对AI能力与影响的认知差异，发现专家更乐观，而公众更关注风险，揭示了沟通和政策干预的必要性。

详情

DOI: 10.1007/s00146-026-03023-8
Journal ref: AI & Society (2026)

AI中文摘要

人工智能（AI）正在重塑许多社会领域，引发了关于其风险、收益以及公众与学术界观点可能不一致的紧迫问题。本研究考察了普通公众（N=1110）——与AI技术互动或受其影响的人——和学术AI专家（N=119）——塑造AI发展的人——在71个场景中对AI能力与影响的感知。这些场景涵盖可持续性、医疗、工作表现、社会不平等、艺术和战争等领域。参与者在四个维度上评估这些场景：可能性、感知风险与收益，以及总体价值（或情感）。结果表明，专家普遍预期更高的概率，感知较低的风险，报告更高的收益，并对AI持有更积极的态度，与非专家相比。此外，两组人应用了不同的加权方案：专家更倾向于降低风险相对于收益的权重。这些评估的视觉映射揭示了评价一致的领域（如AI进行医学诊断或刑事用途）以及紧张点（如法律案件的决定、政治决策），突显了沟通和政策干预的必要性。这些发现强调了关键的转化挑战：如果AI研究和部署要与社会优先事项一致，开发者与公众之间的认知差距必须被更好地理解和解决。我们的结果为价值敏感的AI治理和跨利益相关者群体的信任建设策略提供了实证基础。

英文摘要

Artificial Intelligence (AI) is reshaping many societal domains, raising critical questions about its risks, benefits, and the potential misalignment between public and academic perspectives. This study examines how the general public (N=1110) -- individuals who interact with or are impacted by AI technologies -- and academic AI experts (N=119) -- those elites shaping AI development -- perceive AI's capabilities and impact across 71 scenarios. These scenarios span domains such as sustainability, healthcare, job performance, societal inequality, art, and warfare. Participants evaluated these scenarios across four dimensions using the psychometric model: likelihood, perceived risk and benefit, and overall value (or sentiment). The results suggest significant differences: experts consistently anticipate higher probabilities, perceive lower risks, report greater benefits, and express more positive sentiment toward AI compared to the non-experts. Moreover, both groups apply different weighting schemes: experts discount risk more heavily relative to benefit than non-experts. Visual mappings of these evaluations uncover areas convergent evaluations (e.g., AI performing medical diagnoses or criminal use) as well as tension points (e.g., decision of legal cases, political decision making), highlighting areas where communication and policy interventions may be needed. These findings underscore a critical translational challenge: if AI research and deployment are to align with societal priorities, the perception gap between developers and the public must be better understood and addressed. Our results provide an empirical foundation for value-sensitive AI governance and trust-building strategies across stakeholder groups.

URL PDF HTML ☆

赞 0 踩 0

2604.01383 2026-06-11 cs.CV cs.AI 版本更新

GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization

GRAZE：基于 grounded 的细化与运动感知的零样本事件定位

Syed Ahsan Masud Zaidi, Lior Shamir, William Hsu, Scott Dietrich, Talha Zaidi

发表机构 * Kansas State University（堪萨斯州立大学）； Albright College（阿尔比恩学院）

AI总结本文提出GRAZE，一种无需标注数据的零样本事件定位方法，通过结合Grounding DINO和SAM2实现运动感知的接触定位，有效应对复杂场景。

Comments 9 pages, 5 figures, accepted to the CVPR 2026 Workshop on Computer Vision in Sports (CVSports) code: https://github.com/AhsanZaidi12/GRAZE

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pp. 10087-10095, June 2026

AI中文摘要

美式足球训练生成大量视频，但感兴趣的互动仅占据每个长视频的短暂窗口。可靠的生物力学分析依赖于时空定位，以识别交互实体和接触开始。我们研究了First Point of Contact（FPOC），即球员首次触碰假人帧。在无约束训练视频中，我们提出了GRAZE，一种无需标注数据的FPOC定位管道。GRAZE使用Grounding DINO发现候选玩家-假人交互，通过运动感知的时间推理进行细化，并使用SAM2作为显式的像素级接触验证器。这种分离候选发现和接触确认的方法使方法在复杂场景和不稳定接地情况下更具鲁棒性。在738个打击练习视频上，GRAZE在97.4%的剪辑中产生有效输出，在77.5%的剪辑中将FPOC定位在±10帧内，在82.7%的剪辑中在±20帧内。这些结果表明，在真实世界训练视频中实现精确帧接触开始定位在没有任务特定训练的情况下是可行的。

英文摘要

American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within $\pm$ 10 frames on 77.5% of all clips and within $\pm$ 20 frames on 82.7% of all clips. These results show that frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training.

URL PDF HTML ☆

赞 0 踩 0

2511.20216 2026-06-11 cs.AI cs.CE cs.CV cs.LG cs.RO 版本更新

CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents

CostNav：一个用于现实世界经济成本评估的物理AI代理导航基准

Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Samwoo Seong, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Seongjae Kang, Youngjae Yu, Yunsung Lee

发表机构 * KAIST（韩国国立科学技术院）； University of California, Irvine（加州大学 Irvine 分校）； Seoul National University（首尔国立大学）

AI总结 CostNav引入了一个经济导航基准，通过结合物理模拟和行业数据，评估AI代理的经济可行性，发现高任务成功率并不保证经济性，CANVAS在非零SLA合规性下表现最佳。

详情

AI中文摘要

当前导航基准侧重于任务成功率，但未捕捉到商业化自主配送系统所需的关键经济约束。我们引入了CostNav，一个经济导航基准，通过Isaac Sim的碰撞和货物动力学与行业标准数据如证券交易委员会（SEC）文件和简化伤害分级（AIS）伤害报告相结合，评估物理AI代理的成本收益和盈亏分析。我们发现，高任务成功率并不保证经济可行性。评估七种基线方法（两种基于规则和五种模仿学习方法）后，发现无方法经济可行：所有方法均产生负贡献边际。CANVAS仅使用RGB相机和GPS，在非零服务等级协议（SLA）合规性下获得最高任务成功率和最不负面的边际（-28.40/次），优于配备LiDAR的Nav2 w/ GPS（-37.34/次）。一个在模拟中训练的策略在真实配送机器人上评估时，SLA合规性接近其模拟结果，表明CostNav模拟中的策略性能可以转移到现实部署中。我们挑战社区在CostNav上实现经济可行性，该基准通过成本收益结果评分所有方法。所有资源均在https://github.com/worv-ai/CostNav上提供。

英文摘要

Current navigation benchmarks focus on task success but do not capture the economic constraints essential for commercializing autonomous delivery systems. We introduce CostNav, an Economic Navigation Benchmark that evaluates physical AI agents on a cost-revenue and break-even analysis, pairing Isaac Sim's collision and cargo dynamics with industry-standard data such as Securities and Exchange Commission (SEC) filings and Abbreviated Injury Scale (AIS) injury reports. To our knowledge, CostNav is the first physics-grounded economic benchmark to use regulatory and financial data to quantify the gap between navigation metrics and commercial deployment, revealing that high task-success rates alone do not ensure economic viability. Evaluating seven baselines (two rule-based and five imitation-learning methods), we find no method economically viable: all yield negative contribution margins. CANVAS, using only an RGB camera and GPS, attains the highest task success and the least-negative margin among methods with non-zero Service-Level Agreement (SLA) compliance (-\$28.40/run), outperforming LiDAR-equipped Nav2 w/ GPS (-\$37.34/run). A sim-trained policy evaluated on a real delivery robot yields SLA compliance close to its simulation result, indicating that policy performance in CostNav's simulation transfers to real-world deployment. We challenge the community to achieve economic viability on CostNav, which scores methods by cost-revenue outcomes. All resources are available at https://github.com/worv-ai/CostNav.

URL PDF HTML ☆

赞 0 踩 0

2510.09885 2026-06-11 cs.CL cs.AI 版本更新

Diffusion-Inspired Masked Fine-Tuning for Knowledge Injection in Autoregressive LLMs

受扩散启发的掩码微调用于自回归大语言模型中的知识注入

Xu Pan, Ely Hahami, Jingxuan Fan, Ziqian Xie, Haim Sompolinsky

发表机构 * Harvard University（哈佛大学）； University of Texas Health Science Center at Houston（德克萨斯大学健康科学中心休斯顿分校）； Hebrew University（希伯来大学）

AI总结本文提出一种掩码微调方法，通过重构原始文本提升自回归大语言模型的知识注入能力，无需依赖改写并克服反向诅咒，实验证明其在知识密集型任务中表现优异。

详情

AI中文摘要

大型语言模型（LLMs）常用于事实不断变化的环境，但通过在非结构化文本上微调来更新事实知识时，常常面临1）依赖计算密集的改写增强和2）反向诅咒的问题。最近的研究表明，扩散大语言模型（dLLMs）在预训练中需要更少的训练样本以达到更低的损失，并且对反向诅咒更具抵抗力，表明dLLMs可能比自回归大语言模型（arLLMs）更容易学习新知识。我们通过受控的知识微调实验检验这一假设，发现虽然arLLMs依赖改写增强将知识文本泛化为问答（QA）能力，但dLLMs无需改写即可实现高QA准确性。为进一步研究是否仅凭去掩码目标就能在dLLMs中诱导这种知识注入优势，无论其扩散去噪范式如何，我们提出了arLLMs的掩码微调方法，该方法促使arLLM在上下文中的掩码版本中重建原始文本。arLLMs的掩码微调显著提高了知识注入的有效性，即无需改写且对反向诅咒具有抵抗力，缩小了arLLMs与dLLMs之间的差距。我们还展示了更广泛的应用：在大规模知识密集型数据集（120万个样本）上，掩码SFT在GPQA-diamond中实现了所有微调变体中最佳的下游准确性。去掩码目标也提高了数学任务上的SFT，表明其在事实知识注入之外的广泛用途。

英文摘要

Large language models (LLMs) are often used in environments where facts evolve, yet factual knowledge updates via fine-tuning on unstructured text often suffer from 1) reliance on compute-heavy paraphrasing augmentation and 2) the reversal curse. Recent studies show diffusion large language models (dLLMs) require fewer training samples to achieve lower loss in pre-training and are more resistant to the reversal curse, suggesting dLLMs may learn new knowledge more easily than autoregressive LLMs (arLLMs). We test this hypothesis in controlled knowledge fine-tuning experiments and find that while arLLMs rely on paraphrase augmentation to generalize knowledge text into question-answering (QA) capability, dLLMs do not require paraphrases to achieve high QA accuracy. To further investigate whether the demasking objective alone can induce such a knowledge injection advantage in dLLMs regardless of their diffusion denoising paradigm, we propose masked fine-tuning for arLLMs, which prompts an arLLM to reconstruct the original text given a masked version in context. The masked fine-tuning for arLLMs substantially improves the efficacy of knowledge injection, i.e. no paraphrase needed and resistant to the reversal curse, closing the gap between arLLMs and dLLMs. We also demonstrate broader applicability: on a large-scale knowledge-intensive dataset (1.2M samples), masked SFT achieves the best downstream accuracy on GPQA-diamond among all fine-tuning variants. The demasking objective also improves SFT on math tasks, suggesting broad utility beyond factual knowledge injection.

URL PDF HTML ☆

赞 0 踩 0

2409.00743 2026-06-11 cs.LG cs.AI 版本更新

Interpretable Clustering: A Survey

可解释聚类：综述

Lianyu Hu, Mudi Jiang, Junjie Dong, Xinying Liu, Zengyou He

发表机构 * College of Information Science and Engineering, Henan University of Technology（河南理工大学信息科学与工程学院）； School of Software, Dalian University of Technology（大连理工大学软件学院）； Xinchang Power Supply Company, State Grid Corporation of China（国网浙江绍兴供电公司）

AI总结本文综述了可解释聚类算法的现状，探讨了透明聚类结果的重要性，帮助研究人员选择合适的方法，并推动高效透明的聚类算法发展。

Comments 14 pages, 2 figures, 3 tables

详情

DOI: 10.1145/3789495
Journal ref: ACM Computing Surveys, Volume 58, Issue 8, Article 215 (2026)

AI中文摘要

近年来，聚类算法的研究主要集中在提高准确性和效率，但往往牺牲了可解释性。随着这些方法在医疗、金融和自动驾驶等高风险领域应用增加，透明和可解释的聚类结果变得至关重要。本文全面回顾了可解释聚类算法，识别了区分不同方法的关键标准，并提供了一个开放仓库，整理了代表性及新兴的可解释聚类方法，网址为https://github.com/hulianyu/Awesome-Interpretable-Clustering

英文摘要

In recent years, much of the research on clustering algorithms has primarily focused on enhancing their accuracy and efficiency, frequently at the expense of interpretability. However, as these methods are increasingly being applied in high-stakes domains such as healthcare, finance, and autonomous systems, the need for transparent and interpretable clustering outcomes has become a critical concern. This is not only necessary for gaining user trust but also for satisfying the growing ethical and regulatory demands in these fields. Ensuring that decisions derived from clustering algorithms can be clearly understood and justified is now a fundamental requirement. To address this need, this paper provides a comprehensive and structured review of the current state of explainable clustering algorithms, identifying key criteria to distinguish between various methods. These insights can effectively assist researchers in making informed decisions about the most suitable explainable clustering methods for specific application contexts, while also promoting the development and adoption of clustering algorithms that are both efficient and transparent. For convenient access and reference, an open repository organizes representative and emerging interpretable clustering methods under the taxonomy proposed in this survey, available at https://github.com/hulianyu/Awesome-Interpretable-Clustering

URL PDF HTML ☆

赞 0 踩 0

2601.09072 2026-06-11 cs.AI cs.CL stat.ME 版本更新

Human-AI Co-design for Clinical Prediction Models

临床预测模型的人机协同设计

Jean Feng, Avni Kothari, Patrick Vossler, Andrew Bishara, Lucas Zier, Newton Addo, Aaron Kornblith, Yan Shuo Tan, Chandan Singh

发表机构 * University of California, San Francisco（加州大学旧金山分校）； National University of Singapore（新加坡国立大学）； Microsoft Research（微软研究院）

AI总结本文提出HACHI框架，通过人机协作加速可解释的临床预测模型开发，提升模型泛化能力并发现新临床概念。

详情

DOI: 10.1038/s41746-026-02838-5
Journal ref: npj Digital Medicine 2026

AI中文摘要

开发安全、有效且实用的临床预测模型（CPMs）传统上需要临床专家、数据科学家和信息学家的迭代合作。此过程精炼模型构建的细微之处，如选择哪些特征/患者以及如何定义临床类别。然而，这种传统协作过程极为耗时且资源密集，导致只有少量CPMs达到临床应用。当团队试图整合非结构化临床笔记时，这一挑战尤为严峻。为解决此问题，我们引入HACHI，一种迭代的人在回路框架，利用AI代理加速开发完全可解释的CPMs。HACHI交替进行（i）AI代理快速探索和评估临床笔记中的候选概念，以及（ii）临床和领域专家提供反馈以改进CPM学习过程。HACHI将概念定义为简单的yes-no问题，用于线性模型，使临床AI团队能够透明地审查、完善和验证每一轮学习的CPM。在两个真实世界预测任务（急性肾损伤和创伤性脑损伤）中，HACHI优于现有方法，揭示了未包含在常用CPMs中的新临床相关概念，并提升了模型在不同临床站点和时间期的泛化能力。此外，HACHI揭示了临床AI团队的关键作用，如指导AI代理探索其未曾考虑的概念，调整其考虑的概念粒度，更改目标函数以更好地与临床目标一致，并识别数据偏差和泄漏问题。

英文摘要

Developing safe, effective, and practically useful clinical prediction models (CPMs) traditionally requires iterative collaboration between clinical experts, data scientists, and informaticists. This process refines the often small but critical details of the model building process, such as which features/patients to include and how clinical categories should be defined. However, this traditional collaboration process is extremely time- and resource-intensive, resulting in only a small fraction of CPMs reaching clinical practice. This challenge intensifies when teams attempt to incorporate unstructured clinical notes, which can contain an enormous number of concepts. To address this challenge, we introduce HACHI, an iterative human-in-the-loop framework that uses AI agents to accelerate the development of fully interpretable CPMs by enabling the exploration of concepts in clinical notes. HACHI alternates between (i) an AI agent rapidly exploring and evaluating candidate concepts in clinical notes and (ii) clinical and domain experts providing feedback to improve the CPM learning process. HACHI defines concepts as simple yes-no questions that are used in linear models, allowing the clinical AI team to transparently review, refine, and validate the CPM learned in each round. In two real-world prediction tasks (acute kidney injury and traumatic brain injury), HACHI outperforms existing approaches, surfaces new clinically relevant concepts not included in commonly-used CPMs, and improves model generalizability across clinical sites and time periods. Furthermore, HACHI reveals the critical role of the clinical AI team, such as directing the AI agent to explore concepts that it had not previously considered, adjusting the granularity of concepts it considers, changing the objective function to better align with the clinical objectives, and identifying issues of data bias and leakage.

URL PDF HTML ☆

赞 0 踩 0

2512.08343 2026-06-11 cs.AI 版本更新

Soil Compaction Parameters Prediction Based on Automated Machine Learning Approach

基于自动化机器学习方法的土壤压实参数预测

Caner Erden, Alparslan Serhat Demir, Abdullah Hulusi Kokcam, Talas Fikret Kurnaz, Ugur Dagdeviren

发表机构 * Sakarya University of Applied Sciences, Faculty of Technology, Department of Computer Engineering（萨卡里亚应用科学大学技术学院计算机工程系）

AI总结本文提出自动化机器学习方法用于预测土壤压实参数，通过实验发现XGBoost算法在不同土壤类型中表现最佳，提升了预测准确性和通用性。

Comments Presented at the 13th International Symposium on Intelligent Manufacturing and Service Systems, Duzce, Turkey, Sep 25-27, 2025. Also available on Zenodo: DOI 10.5281/zenodo.17533851

详情

DOI: 10.1016/j.cie.2026.112056
Journal ref: Computers & Industrial Engineering, 2026

AI中文摘要

土壤压实在土木工程中至关重要，以确保路基和土坝等结构的稳定性。传统方法确定最优含水率（OMC）和最大干密度（MDD）需要大量实验室实验，经验回归模型在不同土壤类型中应用有限。近年来，人工智能（AI）和机器学习（ML）技术作为替代方法出现，但ML模型在预测准确性和泛化能力上仍有不足，尤其是面对异质数据集时。本文提出自动化机器学习（AutoML）方法来预测OMC和MDD。AutoML自动选择算法和超参数优化，可能提高准确性和可扩展性。通过广泛实验发现，极端梯度提升（XGBoost）算法在独立数据集上分别达到MDD的R-squared值80.4%和OMC的89.1%。这些结果展示了AutoML在不同土壤类型中预测压实参数的有效性。研究还强调了异质数据集在提高ML模型泛化能力和性能中的重要性。最终，本研究通过提升土壤压实参数的预测能力，为更高效和可靠的施工实践做出了贡献。

英文摘要

Soil compaction is critical in construction engineering to ensure the stability of structures like road embankments and earth dams. Traditional methods for determining optimum moisture content (OMC) and maximum dry density (MDD) involve labor-intensive laboratory experiments, and empirical regression models have limited applicability and accuracy across diverse soil types. In recent years, artificial intelligence (AI) and machine learning (ML) techniques have emerged as alternatives for predicting these compaction parameters. However, ML models often struggle with prediction accuracy and generalizability, particularly with heterogeneous datasets representing various soil types. This study proposes an automated machine learning (AutoML) approach to predict OMC and MDD. AutoML automates algorithm selection and hyperparameter optimization, potentially improving accuracy and scalability. Through extensive experimentation, the study found that the Extreme Gradient Boosting (XGBoost) algorithm provided the best performance, achieving R-squared values of 80.4% for MDD and 89.1% for OMC on a separate dataset. These results demonstrate the effectiveness of AutoML in predicting compaction parameters across different soil types. The study also highlights the importance of heterogeneous datasets in improving the generalization and performance of ML models. Ultimately, this research contributes to more efficient and reliable construction practices by enhancing the prediction of soil compaction parameters.

URL PDF HTML ☆

赞 0 踩 0

2510.11290 2026-06-11 cs.AI cs.HC 版本更新

Evolution in Simulation: AI-Agent School with Dual Memory for High-Fidelity Educational Dynamics

模拟中的进化：具有双记忆的AI代理学校用于高保真的教育动态

Sheng Jin, Haoming Wang, Zhiqi Gao, Yongbo Yang, Bao Chunjia, Chengliang Wang

发表机构 * Guanghua Law School, Zhejiang University（浙江大学法学院）； Faculty of Education, East China Normal University（华东师范大学教育学院）； School of Data Science, The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳）数据科学学院）； Department of Electrical and Computer Engineering, University of California San Diego（加州大学圣地亚哥分校电子与计算机工程系）； Institute of Systems Science, National University of Singapore（新加坡国立大学系统科学研究所）

AI总结本文提出AI代理学校系统，通过自演化机制模拟复杂教育动态，采用双记忆结构提升代理认知能力，实验证实其在高保真模拟中的有效性。

Comments 9 pages, 7 figures, EMNLP conference

详情

DOI: 10.18653/v1/2025.findings-emnlp.312
Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2025

AI中文摘要

大型语言模型（LLMs）基于的代理在模拟和理解复杂人类系统和互动中日益关键。我们提出AI代理学校（AAS）系统，围绕自演化机制，利用代理模拟复杂教育动态。针对教学过程建模碎片化和代理在模拟多样化教育参与者方面的局限性，AAS构建了零经验策略，采用连续的

英文摘要

Large language models (LLMs) based Agents are increasingly pivotal in simulating and understanding complex human systems and interactions. We propose the AI-Agent School (AAS) system, built around a self-evolving mechanism that leverages agents for simulating complex educational dynamics. Addressing the fragmented issues in teaching process modeling and the limitations of agents performance in simulating diverse educational participants, AAS constructs the Zero-Exp strategy, employs a continuous "experience-reflection-optimization" cycle, grounded in a dual memory base comprising experience and knowledge bases and incorporating short-term and long-term memory components. Through this mechanism, agents autonomously evolve via situated interactions within diverse simulated school scenarios. This evolution enables agents to more accurately model the nuanced, multi-faceted teacher-student engagements and underlying learning processes found in physical schools. Experiment confirms that AAS can effectively simulate intricate educational dynamics and is effective in fostering advanced agent cognitive abilities, providing a foundational stepping stone from the "Era of Experience" to the "Era of Simulation" by generating high-fidelity behavioral and interaction data.

URL PDF HTML ☆

赞 0 踩 0

2510.06242 2026-06-11 cs.CL cs.AI 版本更新

Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses

透明参考-free 自动评估开放式用户调查回应

Subin An, Yugyeong Ji, Junyoung Kim, Heejin Kook, Yang Lu, Josh Seltzer

发表机构 * Kookmin University（韩国明知大学）； Sungkyunkwan University（庆尚大学）； Nexxt Intelligence

AI总结本文提出一种两阶段评估框架，用于评估人类开放式调查回应，通过去除无意义回应和评估努力、相关性和完整性，提升自动评估效果。

Comments EMNLP Industry Track

详情

DOI: 10.18653/v1/2025.emnlp-industry.65
Journal ref: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

AI中文摘要

开放式调查回应为营销研究提供有价值见解，但低质量回应不仅使研究人员负担手动筛选，还可能引起误导性结论，凸显了有效评估的必要性。现有自动评估方法针对LLM生成文本，无法充分评估具有独特特征的人类回应。为解决此类特征，我们提出一种专门针对人类调查回应的两阶段评估框架。首先，垃圾语过滤去除无意义回应。然后，通过LLM能力评估三个维度——努力、相关性和完整性，基于实际调查数据的实证分析。在英语和韩语数据集上的验证表明，我们的框架不仅优于现有指标，而且在实际应用如回应质量预测和回应拒绝中显示出高实用性，与专家评估显示强相关性。

英文摘要

Open-ended survey responses provide valuable insights in marketing research, but low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and inadequately assess human-written responses with their distinct characteristics. To address such characteristics, we propose a two-stage evaluation framework specifically designed for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions-effort, relevance, and completeness-are evaluated using LLM capabilities, grounded in empirical analysis of real-world survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also demonstrates high practical applicability for real-world applications such as response quality prediction and response rejection, showing strong correlations with expert assessment.

URL PDF HTML ☆

赞 0 踩 0

2412.13841 2026-06-11 cs.CY cs.AI cs.HC 版本更新

Cultural Dimensions of AI Perception: Charting Expectations, Risks, Benefits, Tradeoffs, and Value in Germany and China

人工智能感知的文化维度：在德国和中国绘制期望、风险、收益、权衡与价值

Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle

发表机构 * RWTH Aachen University（亚琛工业大学）

AI总结本文通过比较德国和中国公众对人工智能的期望、风险与收益的权衡，揭示文化差异对AI接受度的影响，为AI与社会价值观的对齐提供见解。

详情

DOI: 10.1016/j.actpsy.2026.107094
Journal ref: Acta Psychologica (2026), volume 268, article 107094

AI中文摘要

随着人工智能（AI）的持续发展，理解公众对AI的感知——包括偏见、风险和收益——对于指导研究重点和AI对齐、塑造公共讨论以及制定政策至关重要。本探索性研究通过71个AI未来潜在可能性的想象，调查了不同文化背景下AI心理模型的差异。基于来自德国（N=52）和中国（N=60）的跨文化便利样本，我们识别出在期望、评估和风险-收益权衡方面的显著差异。德国参与者普遍提供了更为谨慎的评估，而中国参与者则对AI的社会效益表现出更大的乐观态度。中国参与者在风险-收益权衡上相对平衡（风险β=-0.463，收益β=+0.484，r²=0.630）。相比之下，德国参与者更强调AI的益处，而对风险相对较低（风险β=-0.337，收益β=+0.715，r²=0.839）。视觉认知图谱展示了这些对比，提供了新的视角，说明文化背景如何塑造AI的接受度。我们的发现突显了影响公众感知的关键因素，并为使AI与社会价值观对齐以及促进公平和文化敏感的AI技术整合提供了见解。

英文摘要

As artificial intelligence (AI) continues to advance, understanding public perceptions -- including biases, risks, and benefits -- is essential for guiding research priorities and AI alignment, shaping public discourse, and informing policy. This exploratory study investigates cultural differences in mental models of AI using 71 imaginaries of AI's potential futures. Drawing on cross-cultural convenience samples from Germany (N=52) and China (N=60), we identify significant differences in expectations, evaluations, and risk-benefit tradeoffs. Participants from Germany generally provided more cautious assessments, whereas participants from China expressed greater optimism regarding AI's societal benefits. Chinese participants exhibited relatively balanced risk-benefit tradeoffs ($β=-0.463$ for risk and $β=+0.484$ for benefit, $r^2=.630$). In contrast, German participants placed greater emphasis on AI's benefits and comparatively less on risks ($β=-0.337$ for risk and $β=+0.715$ for benefit, $r^2=.839$). Visual cognitive maps illustrate these contrasts, offering new perspectives on how cultural contexts shape AI acceptance. Our findings highlight key factors influencing public perception and provide insights for aligning AI with societal values and promoting equitable and culturally sensitive integration of AI technologies.

URL PDF HTML ☆

赞 0 踩 0

2508.15943 2026-06-11 cs.AI 版本更新

T-ILR: a Neurosymbolic Integration for LTLf

T-ILR：一种用于LTLf的神经符号集成

Riccardo Andreoni, Andrei Buliga, Alessandro Daniele, Chiara Ghidini, Marco Montali, Massimiliano Ronzani

发表机构 * Fondazione Bruno Kessler（布鲁诺·科塞勒基金会）； Free University of Bozen-Bolzano（博兹纳-博尔扎诺自由大学）； University of Bozen-Bolzano（博兹纳-博尔扎诺大学）

AI总结本文提出T-ILR框架，将LTLf时序逻辑规范直接融入深度学习架构，提升序列任务的准确性和效率。

Comments Accepted for presentation at NeSy 2025. 10 pages

1. 智能体、规划与决策 21 篇

Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

StatefulDiscovery: Evidence-Calibrated Claim Formation in Open-Ended Scientific Discovery

NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track

FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Exploration Structure in LLM Agents for Multi-File Change Localization

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

APPO: Agentic Procedural Policy Optimization

Offline Diffusion Policy for Multi-User Delay-Constrained Scheduling

Resource-Aware LLM Reasoning for Mobile Edge General Intelligence

PRInTS: Reward Modeling for Long-Horizon Information Seeking

Planning under Distribution Shifts with Causal POMDPs

FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference

MemToolAgent: Leveraging Memory for Tool Using Agents Based on Environment and User Feedback

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

Engineering Robustness into Personal Agents with the AI Workflow Store

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

Libra: Efficient Resource Management for Agentic RL Post-Training

2. 知识表示、推理与符号AI 5 篇

Mind the Perspective: Let's Reason Recursively for Theory of Mind

Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework

An XAI View on Explainable ASP: Methods, Systems, and Perspectives

Power Term Polynomial Algebra for Boolean Logic

Towards an Inferentialist Account of Information Through Proof-theoretic Semantics

3. 多智能体与博弈 13 篇

Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

Market Design for AI: Beyond the Copyright Binary

CCKS: Consensus-based Communication and Knowledge Sharing

CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

Improving Generalization and Data Efficiency with Diffusion in Offline Multi-agent RL

Precomputing Multi-Agent Path Replanning Using Temporal Flexibility

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning

MARIC: Multi-Agent Reasoning for Image Classification

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

Continual Quadruped Robots Coordination via Semantic Skill Discovery

4. 搜索、优化与约束求解 5 篇

TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search

Quantized Stochastic Primal-Dual Methods for Distributed Optimization under Relaxed Global Geometry

What Limits Does Quantization Place on Dense Top-$k$ Retrieval? A Theoretical Study

Mathematical perspective on genetic algorithms with optimization guided operators

SPEA2$^+$: Improved Density Estimation in SPEA2 with Provable Runtime Guarantees

5. 机器学习与表示学习 60 篇

Forecasting Future Behavior as a Learning Task

HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data

To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

PermDoRA -- Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry

Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data

RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

The Power of Test-Time Training for Approximate Sampling

CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

SirenFNO: Efficient and Full Frequency Learning of Fourier Neural Operators

Information-Theoretic Decomposition for Multimodal Interaction Learning

When Context Returns: Toward Robust Internalization in On-Policy Distillation

TAROT: Task-Adaptive Refinement of LLM-prior Graphs for Few-shot Tabular Learning

Noise-Aware Framework for Correcting Corrupted Labels

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

ICA Lens: Interpreting Language Models Without Training Another Dictionary

Sparsified Kolmogorov-Arnold Networks for Interpretable Quantum State Tomography

From Uniform to Learned Graph Priors: Diffusion for Structure Discovery

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

nD-RoPE: A Generalized RoPE for n-Dimensional Position Embedding

Implicit Neural Representations of Individual Behavior

Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training

SpikeDecoder: Realizing the GPT Architecture with Spiking Neural Networks

Harness In-Context Operator Learning with Chain of Operators

Latent World Recovery for Multimodal Learning with Missing Modalities