2606.14249 2026-06-15 cs.AI New submission

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Tingyang Chen, Shuo Lu, Kang Zhao, Weicheng Meng, Hanlin Teng, Tianhao Li, Chao Li, Xule Liu, Jian Liang, Zhizhong Zhang, Yuan Xie, Heng Qu, Kun Shao, Jian Luan

Affiliations: Darwin Agent Team

AI summary: Proposes HarnessX, which composes harness primitives via a substitution algebra, adapts them with the AEGIS multi-agent evolution engine, and closes the optimization loop with trajectory feedback, yielding an average gain of 14.5% across five benchmarks.

Abstract

AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today's harnesses remain largely hand-crafted and static: each new model or task still demands bespoke scaffolding, and the rich traces produced during execution are rarely distilled back into systematic improvement. We introduce HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness-model loop by turning trajectories into both harness updates and model training signal. Across five benchmarks (ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified), HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest. These results suggest that agent progress need not come from model scaling alone: composing and evolving runtime interfaces from execution feedback is an actionable and complementary lever. The complete codebase will be open-sourced in a future release.

2606.14314 2026-06-15 cs.AI New submission

Communication Policy Evolution for Proactive LLM Agents

Xinbei Ma, Jiyang Qiu, Yao Yao, Zheng Wu, Yijie Lu, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao

Affiliations: Shanghai Jiao Tong University; OPPO Research Institute

AI summary: Addresses the information asymmetry between users and agents by formalizing communication policies, proposing textual and UI-based policies, and introducing CPE, a self-evolution framework that improves task success through prompt refinement.

Abstract

LLM agents have rapidly evolved into autonomous systems, yet a persistent information gap remains between users and agents: communication is costly, while users' identical preferences further limit information exchange. To investigate how agents should communicate across modalities, this paper formalizes Communication Policy, establishes textual and UI-based policies, and evaluates them across diverse environments, personas, and model combinations. To create information asymmetry for proactive agents, we set up two complementary settings, User-Agent and Planner-Executor. Experimental results reveal complementary strengths between interaction channels: text-based interaction often facilitates task performance, while structured UI improves agents' response quality and persona compliance. Motivated by this, we combine these advantages in a hybrid method. We further propose Communication Policy Evolution (CPE), a self-evolution framework that refines communication policies through rollouts and prompt-level evolution. Without model modification, CPE achieves the best task success across multiple settings using prompt refinement alone. Our findings identify communication behavior as a critical yet underexplored design dimension for LLM agents.

2606.14418 2026-06-15 cs.AI cs.LG cs.RO New submission

Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

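Affiliations: Georgia Institute of Technology; Meta

AI summary: Proposes the Parallel-Synthesis framework, which synthesizes directly from the KV caches of parallel worker agents and so avoids the redundancy of text concatenation; it matches or exceeds text-based synthesis on 9 datasets and reduces time-to-first-token by 2.5-11x.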

Analyzing the Coordination Gap Between Joint and Modular Learning in Job-Shop Scheduling with Transportation Resources

Moritz Link, Jonathan Hoss, Noah Klarmann

AI summary: Through analyses of resource scarcity and temporal dominance, quantifies the performance gap between joint and modular training in job-shop scheduling with transportation resources; joint training is better in most cases, but the gap narrows in bottleneck environments.

Comments: Supported by the Chips Joint Undertaking and its members, including top-up funding by National Authorities, within the Cynergy4MIE project (Grant Agreement No. 101140226). This work has been submitted to the IEEE for possible publication

Abstract

Efficient job-shop scheduling with transportation resources is critical for high-performance manufacturing. With the rise of "decentralized factories", multi-agent reinforcement learning has emerged as a promising approach for the combined scheduling of production and transportation tasks. Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary. Joint training denotes the simultaneous training of job and automatic guided vehicle scheduling agents, whereas modular training involves independently training each agent followed by post-hoc integration. In this study, we systematically investigate the conditions under which joint training is essential for optimal performance in the job-shop scheduling problem with transportation resources. Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap -- the performance difference between these two training modalities. In our evaluation, joint training outperforms the majority of dispatching rule combinations and modular training approaches. However, the coordination gap advantage diminishes in bottleneck environments, particularly under severe transport and processing constraints. These findings indicate that modular training represents a viable alternative in environments where a single scheduling task dominates. Overall, our work provides practical guidance for selecting between training modalities based on environmental conditions, enabling decision-makers to optimize reinforcement learning-based scheduling performance.

2605.05407 2026-06-15 cs.AI Updated version

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

Mohamed Salim Aissi, Clemence Grislain, Clement Romac, Laure Soulier, Mohamed Chetouani, Olivier Sigaud, Nicolas Thome

Affiliations: Institut National de la Recherche Scientifique (INRS)

AI summary: Proposes the PRISM framework, which tightly couples a vision-language model (VLM) and a language model (LLM) through a dynamic question-answer pipeline for task-driven perception, significantly outperforming existing image-based models on the ALFWorld and R2R benchmarks.

Abstract

Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM's description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our interactive goal-oriented perception pipeline yields systematic and substantial gains, and (3) PRISM is fully automatic, eliminating the need for handcrafted questions or answers.

2606.03108 2026-06-15 cs.AI Updated version

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

Guhong Chen, Yingcheng Shi, Yongbin Li, Binhua Li, Xander Xu, Hu Wei, Shiwen Ni, Min Yang, Jieping Ye

Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Tongyi Lab, Alibaba Group; Alibaba Group; SUAT

AI summary: Proposes the EvoTrainer framework, which co-evolves LLM policies and training-side harnesses, using empirical feedback to automatically diagnose, revise, and accumulate reusable skills; it matches or exceeds human-engineered RL baselines on mathematical reasoning, competitive programming, and repository-level software engineering tasks.

Abstract

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering, EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, evolving diagnostics prevent invalid high-scoring branches from being promoted, and reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.

2606.07027 2026-06-15 cs.AI Updated version

StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents

Haojie Hao, Longkun Hao, Yihang Lou, Yan Bai, Zhenyang Li, Zhichao Yang, Dongshuo Huang, Hongyu Lin, Lanqing Hong, Jiakai Wang, Xianglong Liu

Affiliations: Beihang University; Peking University; Renmin University of China; Northwestern Polytechnical University; Institute of Software, Chinese Academy of Sciences; National University of Singapore; Zhongguancun Laboratory

AI summary: Proposes the StainFlow model, which uses global entity stain tracking and local evidence linking to address the subjectivity of milestone decomposition and the evidence missed by fixed local windows in process rewards for GUI agents, improving online RL success rate by a relative 3.2%.

Abstract

Reinforcement Learning (RL) has become a promising approach for improving GUI Agents in long-horizon, stochastic digital environments, but trajectory-level success feedback is too sparse to provide reliable credit assignment for intermediate exploration steps. To mitigate this issue, recent studies introduce Process Reward Models (PRMs), which provide finer-grained training feedback through global milestone verification or local step-level evaluation. However, these methods still suffer from two level-specific limitations: global milestone decomposition is subjective and singular, making it difficult to accommodate the multiple valid execution paths in real GUI tasks, while fixed local judging windows may miss long-range key evidence or dilute the decision signal with irrelevant frames. Inspired by stain-tracing mechanisms in network flow analysis, we propose StainFlow, an entity-stain-flow process reward model for GUI Agents. To reduce the subjectivity of global partitioning, we introduce the Global Entity Stain Tracking module, which extracts visually verifiable task entities and tracks how their stain concentrations and states evolve along the trajectory, allowing task phases to be objectively separated by changes in the entity evidence flow. To improve the accuracy of local verification, we introduce the Local Stain Evidence Linking module. Centered on the triggering entities of each candidate key node, it retrieves relevant steps based on their stain concentrations and state changes, and dynamically constructs high-density evidence windows for verifying true key nodes. Extensive experiments on AndroidWorld and OGRBench show that StainFlow relatively improves online RL success by 3.2% and trajectory completion judgment accuracy by 1.8%.

2606.12817 2026-06-15 cs.AI Updated version

GUITrans2Act: Understanding User Operational Behaviors from Mobile GUI Interactions with Vision-Language Models

Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents

Yudong Zhang, Lei Hu, Daoyang Liu, Jiawei Liu, Yangfan Luo, Zhilin Gao, Zuojian Wang

Affiliations: Honor Device Co., Ltd; The Chinese University of Hong Kong

AI summary: Proposes the Teach VLM model, which generates operational knowledge by extracting keyframes from demonstration videos, and builds a data flywheel to address the scarcity of training data; it achieves state-of-the-art benchmark performance and improves the task success rate of downstream agents.

Comments: 20 pages, 9 figures. Yudong Zhang and Lei Hu contributed equally to this work. Zuojian Wang and Zhilin Gao are corresponding authors

Abstract

Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.

2509.18930 2026-06-15 cs.LG cs.AI Updated version

Tackling GNARLy Problems: Graph Neural Algorithmic Reasoning Reimagined through Reinforcement Learning

Alex Schutz, Victor-Alexandru Darvariu, Efimia Panagiotaki, Bruno Lacerda, Nick Hawes

Affiliations: Oxford Robotics Institute, University of Oxford; Stateful Robotics

AI summary: Proposes the GNARL framework, which recasts learning algorithm trajectories as a Markov decision process and combines imitation learning with reinforcement learning; it achieves high accuracy on CLRS-30 problems and applies to NP-hard problems and to settings without an expert algorithm.

Abstract

Neural algorithmic reasoning (NAR) is a paradigm that trains neural networks to execute classic algorithms by supervised learning. Despite its successes, important limitations remain: inability to construct valid solutions without post-processing and to reason about multiple correct ones, poor performance on combinatorial NP-hard problems, and inapplicability to problems for which strong algorithms are not yet known. To address these limitations, we reframe the problem of learning algorithm trajectories as a Markov decision process, which imposes structure on the solution construction procedure and unlocks the powerful tools of imitation and reinforcement learning (RL). We propose the GNARL framework, encompassing the methodology to translate problem formulations from NAR to RL and a learning architecture suitable for a wide range of graph-based problems. We achieve very high graph accuracy results on several CLRS-30 problems, performance matching or exceeding much narrower NAR approaches for NP-hard problems and, remarkably, applicability even when lacking an expert algorithm.

2510.02695 2026-06-15 cs.LG cs.AI Updated version

RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization

Kai Fukazawa, Kunal Mundada, Iman Soltani

AI summary: Proposes the RAMAC framework, which combines a distributional critic with a generative actor (such as a diffusion model) and uses a composite objective of Conditional Value-at-Risk and behavioral cloning to achieve risk-sensitive learning in offline reinforcement learning, suppressing out-of-distribution actions and improving CVaR.

Comments: ICML 2026

Abstract

In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) is attractive only if policies achieve high returns without catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of either (i) value/model-based pessimism or (ii) restricted policy classes that limit expressiveness, whereas diffusion/flow-based expressive generative policies have largely been used in risk-neutral settings. We introduce Risk-Aware Multimodal Actor-Critic (RAMAC), a simple, modular, model-free framework that couples an expressive generative actor (e.g., diffusion/flow) with a distributional critic and optimizes a composite objective that combines Conditional Value-at-Risk (CVaR) with behavioral cloning (BC), enabling risk-sensitive learning in complex multimodal scenarios. Since out-of-distribution (OOD) actions are a major driver of catastrophic failures in offline RL, we further provide an objective-level analysis showing that controlling behavior divergence via BC suppresses OOD actions and stabilizes CVaR. Instantiating RAMAC with a diffusion actor, we illustrate these insights on a 2-D risky bandit and evaluate on Stochastic-D4RL, observing consistent gains in $\mathrm{CVaR}_{0.1}$ while maintaining strong returns. The code and experimental results are available on the project website: https://kaifukazawa.github.io/ramac-project/
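
The $\mathrm{CVaR}_{0.1}$ metric in the abstract above can be illustrated with a short sketch. The snippet is not taken from the RAMAC codebase. It assumes the common lower-tail convention in which CVaR at level alpha is the mean of the worst ceil(alpha * N) episode returns; the function name `empirical_cvar` and the sample returns are hypothetical.

```python
import math


def empirical_cvar(returns, alpha=0.1):
    """Lower-tail CVaR: mean of the worst ceil(alpha * N) returns.

    Assumed convention for illustration; the paper may define or estimate
    CVaR differently (e.g. from a distributional critic).
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    k = math.ceil(alpha * len(returns))
    worst = sorted(returns)[:k]  # the k lowest returns
    return sum(worst) / k


# Hypothetical episode returns containing one catastrophic outcome.
returns = [10.0, 12.0, 11.0, 9.0, -50.0, 10.5, 11.5, 9.5, 10.0, 12.5]
mean_return = sum(returns) / len(returns)
cvar_10 = empirical_cvar(returns, alpha=0.1)
print(f"mean={mean_return:.2f}  CVaR_0.1={cvar_10:.2f}")
```

In this toy sample the mean return stays positive while the 10% tail is dominated by the single catastrophic episode, which is the kind of gap between average and tail behavior that a risk-sensitive objective targets.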

2601.19810 2026-06-15 cs.LG cs.AI cs.RO Updated version

Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals

Octavio Pappalardo

Affiliations: University College London (UCL)

AI summary: Proposes ULEE, which combines an in-context learner with an adversarial goal-generation strategy to optimize multi-episode exploration and adaptation within an unsupervised meta-learning framework, improving zero-shot and few-shot performance.

Comments: ICLR 2026; v2 adds link to code: https://github.com/Octavio-Pappalardo/ulee-jax

Journal ref: The Fourteenth International Conference on Learning Representations, 2026

Abstract

Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside of the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent's post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an in-context learner with an adversarial goal-generation strategy that maintains training at the frontier of the agent's capabilities. On XLand-MiniGrid benchmarks, ULEE pre-training yields improved exploration and adaptation abilities that generalize to novel objectives, environment dynamics, and map structures. The resulting policy attains improved zero-shot and few-shot performance, and provides a strong initialization for longer fine-tuning processes. It outperforms learning from scratch, DIAYN pre-training, and alternative curricula. Code is available at: https://github.com/Octavio-Pappalardo/ulee-jax

2606.13703 2026-06-15 cs.AI cs.GL cs.LO New submission

History of the Muddy Children Puzzle

Hans van Ditmarsch

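Affiliations: CNRS, France; IIT Kanpur, India

AI summary: Traces the origins of the muddy children puzzle over the past two centuries and presents its variants, along with a new hat puzzle involving self-reference.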

2606.13925 2026-06-15 cs.AI math.AG New submission

Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization

Vasily Ilin, Brian Nugent

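Affiliations: GitHub

AI summary: Through a case study of a semi-autonomous formalization of Grothendieck's vanishing theorem, reveals the shortcomings of large language models in definition choice and API design, and argues that expert review, rather than the mere absence of sorries, should be the evaluation standard.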

2606.14000 2026-06-15 cs.AI New submission

Formalizing Numerical Analysis: An Agent Pipeline and Quality Audit Beyond Kernel Acceptance

Theodore Meek, Siyuan Ge, Di Qiu Xiang, Simon Chess, Vasily Ilin

Affiliations: University of California, Berkeley

AI summary: Proposes a coding-agent pipeline that formalizes a numerical analysis textbook in Lean 4 and introduces a three-dimensional quality evaluation framework (semantic correctness, Mathlib reuse, cross-file reuse), finding that successful compilation masks unfaithful formalization patterns.

Abstract

Recent work has demonstrated that coding agents can formalize entire advanced mathematics textbooks in Lean 4, yet existing efforts concentrate on branches of mathematics already well-represented in mathlib and measure success solely through kernel acceptance. We address both limitations by applying a coding agent to formalize Numerical Methods for Ordinary Differential Equations, a textbook in numerical analysis that is largely absent from mathlib, stressing the agent's capacity to develop new theory from scratch. We further introduce a systematic, reproducible three-dimensional framework for evaluating the quality of agent-produced formalizations beyond compilation: semantic correctness, Mathlib reuse, and cross-file reuse via LLM-as-judge methods. Applying this framework to our own formalization and to the released outputs of RepoProver and M2F, we uncover recurring unfaithful formalization patterns, including incomplete multi-part statements, added weakening hypotheses, and parameter restrictions, that kernel acceptance entirely obscures. Our results suggest that compilation-based metrics substantially overstate formalization quality, and we provide a reproducible audit methodology to support more rigorous evaluation of future autoformalization systems.

2606.14309 2026-06-15 cs.DB cs.AI cs.LO Cross-list

YeasierAgent: An Agent Social Sandbox as a Canvas for Intent-Driven Creation of Platform-Agnostic Symbiotic Agent-Native Applications

Jory He

Affiliations: Yeasier AI

AI summary: Proposes the YeasierAgent paradigm, which uses platform-agnostic interactive units and spatial multi-agent collaboration to enable rapid cross-platform construction of symbiotic agent-native applications, unifying emotional companionship and tool execution.

Abstract

This paper introduces YeasierAgent, an application-building paradigm based on symbiotic agents, narrative worlds, and scene-aware interaction. It challenges the conventional device-coupled model of software by redefining applications as collaborative spaces among users, agents, and worlds. We present a system architecture that achieves two primary contributions: (1) enabling the rapid, cross-platform construction of agent-native applications by utilizing platform-agnostic interactive units (agents, scenes, dialogue) rather than fixed graphical layouts; and (2) unifying the emotional companionship and practical tool execution attributes of intelligent agents within a single experiential sandbox. By integrating automated generation, user-created worlds, and spatial multi-agent collaboration, YeasierAgent formalizes the category of Symbiotic Agent-Native Applications, demonstrating a shift from isolated, tool-specific chatbots toward cohesive, socially embedded computational environments.

2606.14200 2026-06-15 cs.AI cs.LG New submission

When Should Agent Trust Be Conditional? Characterizing and Attacking Skill-Conditional Reputation in Agent Swarms

Yihan Xia, Taotao Wang

Affiliations: Shenzhen University

AI summary: Studies when skill-conditional trust is appropriate in heterogeneous LLM agent swarms; a phase-diagram analysis shows that it is effective under high heterogeneity, sparse evidence, and correlated skills, but that cross-skill evidence can be exploited by attackers; proposes the Conditional Information Value Test (CIVT) to quantify attack impact.

Comments: 18 pages, 8 figures, 2 tables

Abstract

Open platforms increasingly route tasks among heterogeneous LLM agents--differing in base model, scaffold, and tool stack--whose competence varies sharply by skill: an agent excellent at one skill may be useless at another. The standard reputation approach summarizes each agent by a single global trust score, but that scalar is the wrong object here, because routing every task to the globally most-trusted agent leaves the value of specialization unclaimed. We study skill-conditional trust R(i | k)--the trust to place in agent i for a task requiring skill k, rather than one score per agent--and pose three falsifiable questions: when is conditioning worth it, how much cross-skill evidence should be borrowed, and whether that borrowing is safe. A controlled phase-diagram analysis answers the first two: conditional trust wins only in a specific regime--high agent heterogeneity, sparse per-skill evidence, and correlated skills--and the coupling strength beta that buys this data efficiency is dual-use, because the same cross-skill borrowing is also a laundering channel. On a public benchmark of 14 genuinely heterogeneous AppWorld agents, real pools land inside the beneficial regime--a small but genuine gain, with the per-skill best agent genuinely changing across skills. We then show that an attacker with cheap evidence in one skill and none in a target skill hijacks the conditional router, driving routing regret from 0 to 0.94 on a pool our zero-cost Conditional Information Value Test (CIVT) rates GREEN--while the ungated trust verdict it contaminates reads -0.06 instead of the honest +0.19. A zero-evidence gate bounds the attack but does not eliminate it; we characterize the residual cost under an explicit budget. We do not claim Sybil-resistance--we quantify the trade-off.
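
The interplay of cross-skill borrowing and the zero-evidence gate described in the abstract above can be sketched in a few lines. This is a toy illustration, not the paper's estimator: it assumes a simple shrinkage form in which `beta` acts as a pseudo-count on an agent's pooled success rate, and the names `conditional_trust`, `route`, the `gate` flag, the neutral `prior`, and the two-agent pool are all hypothetical.

```python
def conditional_trust(successes, trials, skill, beta, gate=False, prior=0.5):
    """Toy shrinkage estimate of R(i | k) for a single agent.

    successes, trials: dicts mapping skill -> counts for this agent.
    beta: pseudo-count weight placed on the pooled cross-skill success rate.
    gate: if True, an agent with zero evidence on `skill` gets `prior` only.
    """
    n_k = trials.get(skill, 0)
    s_k = successes.get(skill, 0)
    total_n = sum(trials.values())
    if total_n == 0 or (gate and n_k == 0):
        return prior
    if n_k + beta == 0:
        return prior
    global_rate = sum(successes.values()) / total_n
    return (s_k + beta * global_rate) / (n_k + beta)


def route(agents, skill, beta, gate=False):
    """Pick the agent name with the highest conditional trust on `skill`."""
    return max(
        agents,
        key=lambda name: conditional_trust(
            agents[name]["successes"], agents[name]["trials"], skill, beta, gate
        ),
    )


# Hypothetical pool: the attacker has cheap evidence on skill "a" only.
agents = {
    "honest": {"successes": {"b": 8}, "trials": {"b": 10}},
    "attacker": {"successes": {"a": 100}, "trials": {"a": 100}},
}
print(route(agents, "b", beta=2.0))             # ungated routing
print(route(agents, "b", beta=2.0, gate=True))  # zero-evidence gate applied
```

In this toy pool the ungated router prefers the attacker for skill "b" purely on borrowed evidence, while the gate restores the honest agent. It covers only the simplest case; the abstract reports that a zero-evidence gate bounds the attack without eliminating it.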

2606.13832 2026-06-15 cs.MA cs.AI cs.CR cs.LG Cross-list

Safety-Contract Graph Multi-Agent Reinforcement Learning for Autonomous Network Security Response

Jose Luis Lima de Jesus Silva

Affiliations: Oxaala Tecnologias; Universidade Federal da Bahia

AI summary: Proposes the safety-contract graph MARL framework ACD$^3$-GAT, which uses constrained optimization, graph encoding, and counterfactual screening; in CAGE Challenge 4 the downtime violation rate falls from 100% to 0.3% (C-MAPPO-GAT) and 13.8% (ACD$^3$-GAT), balancing safety and performance.

Abstract

Autonomous network-security response systems promise to reduce Security Operations Centre (SOC) reaction latency, but reward-only multi-agent reinforcement learning (MARL) can improve security reward while remaining non-deployable. We present a safety-contract graph MARL framework and instantiate it as ACD$^3$-GAT (Adaptive Constrained Counterfactual Decisioning with a Graph Attention Network encoder), an architecture that separates simulator observations from reusable operational budgets, constrained optimization, graph state encoding, and counterfactual action screening. We evaluate the method in CAGE Challenge 4, where agents operate under budgets for Mean Time to Recover (MTTR), false-positive response, and firewall change-management disruption. Across the benchmark, every unconstrained method violates the SOC downtime budget in 100% of evaluated episodes, with mean downtime proxy costs of 311-430 against a budget of 50. This complements prior CAGE Challenge 4 findings by showing that reward-only learning lacks operational discipline. Constrained MAPPO-GAT (C-MAPPO-GAT) isolates Lagrangian operational-cost control and budget-aware screening, while ACD$^3$-GAT adds budget context, CVaR tail-risk estimation, opponent-belief state, and Graph Counterfactual Risk Propagation (G-CRP). The replicated comparison includes three 200-episode seeds for IPPO, MAPPO-GAT, C-MAPPO-GAT, and ACD$^3$-GAT. C-MAPPO-GAT reduces downtime violation from 100% to 0.3% and mean downtime cost from 355.4 to 15.5 relative to MAPPO-GAT. ACD$^3$-GAT reduces mean downtime cost to 48.2 with a 13.8% violation rate, placing it on the safety-contract frontier rather than at the most conservative compliance point. Topology-seed and coupled adaptive Red-process stress tests preserve this contrast and show lower worst adaptive degradation for safety-constrained policies than reward-only MAPPO-GAT.
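
The abstract above mentions Lagrangian operational-cost control under a downtime budget of 50. A minimal sketch of that general idea follows, assuming the textbook projected dual-ascent update for a cost constraint; it is not the paper's implementation. The function names, the learning rate, and the intermediate cost values 200.0 and 80.0 are hypothetical, while 355.4 and 15.5 are the mean downtime costs quoted in the abstract.

```python
def update_multiplier(lam, mean_cost, budget, lr=0.01):
    """Projected dual-ascent step: lam grows while cost exceeds the budget
    and shrinks (never below zero) once the policy is under budget."""
    return max(0.0, lam + lr * (mean_cost - budget))


def penalized_reward(reward, cost, lam):
    """Reward the policy would optimize once the cost is priced by lam."""
    return reward - lam * cost


budget = 50.0
lam = 0.0
history = []
# Illustrative sequence of mean downtime costs over training rounds.
for mean_cost in [355.4, 200.0, 80.0, 15.5]:
    lam = update_multiplier(lam, mean_cost, budget)
    history.append(lam)
print(history)
```

The multiplier rises while the observed cost is above the budget and starts to relax once the cost drops below it, which is how a constrained learner trades security reward against operational cost instead of ignoring the budget.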

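2606.14693 2026-06-15 cs.MA cs.AI cross-list

Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning

Pengxin Wang, Lihao Guo, Yi Xie, Bo Liu, Siyang Cao, Jingdi Chen

Affiliations: Department of Electrical and Computer Engineering, University of Arizona

AI summary: Proposes Preference-Coordinated Multi-Agent Policy Optimization (PCMA), which learns coordinated agent-specific preferences to achieve complementary trade-offs in multi-objective multi-agent reinforcement learning. Theory shows that preference diversity can induce team improvement, and experiments confirm gains in performance and coordination.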

2505.16988 2026-06-15 cs.CL cs.AI cs.MA updated version

MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen

Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; University of Oxford; Princeton University; Meta; University of Michigan; The University of Sydney; Beihang University; Nanyang Technological University; Nanjing University

AI summary: Presents the MASLab codebase, which integrates more than 20 methods, provides a unified environment with standardized evaluation, lowers the barrier to entry for researchers, and covers 10+ benchmarks and 8 models.

Comments 18 pages, 11 figures

English abstract

LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invite contributions from the broader open-source community.

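2606.12918 2026-06-15 cs.CR cs.AI updated version

From Sorting Algorithms to Scalable Kernels: Bayesian Optimization in High-Dimensional Permutation Spaces

Zikai Xie, Linjiang Chen

Affiliations: State Key Laboratory of Precision and Intelligent Chemistry

AI summary: Addresses the poor scalability of representations for Bayesian optimization in high-dimensional permutation spaces with a framework of kernels derived from sorting algorithms. The Mallows kernel is a special case corresponding to enumeration sort, while the new Merge kernel uses the decomposition structure of merge sort to reach Θ(n log n) complexity with no information loss, matching performance in low dimensions and markedly improving optimization quality and computational efficiency in high dimensions.

Comments 9 pages, published at ICLR 2026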

English abstract

Bayesian Optimization (BO) is a powerful tool for black-box optimization, but its application to high-dimensional permutation spaces is severely limited by the challenge of defining scalable representations. The current state-of-the-art BO approach for permutation spaces relies on an exhaustive $\Omega(n^2)$ pairwise comparison, inducing a dense representation that is impractical for large-scale permutations. To break this barrier, we introduce a novel framework for generating efficient permutation representations via kernel functions derived from sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from enumeration sort. Further, we introduce the Merge Kernel, which leverages the divide-and-conquer structure of merge sort to produce a compact $\Theta(n \log n)$ representation that achieves the lowest possible complexity with no information loss and effectively captures permutation structure. Our central thesis is that the Merge Kernel performs competitively with the Mallows kernel in low-dimensional settings, but significantly outperforms it in both optimization performance and computational efficiency as the dimension $n$ grows. Extensive evaluations on various permutation optimization benchmarks confirm our hypothesis, demonstrating that the Merge Kernel provides a scalable and more effective solution for Bayesian optimization in high-dimensional permutation spaces, thereby unlocking the potential for tackling previously intractable problems such as large-scale feature ordering and combinatorial neural architecture search.
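
To make the pairwise-comparison baseline concrete, the sketch below builds the dense pairwise representation of a permutation and the standard Mallows kernel, k(p, q) = exp(-lambda * d), where d is the Kendall tau distance. The function names and the value of `lam` are illustrative assumptions. The Merge Kernel itself is not reproduced here, since the abstract does not specify its construction.

```python
import math
from itertools import combinations


def pairwise_features(perm):
    """Encode a permutation as +/-1 signs over all n*(n-1)/2 item pairs."""
    pos = {item: i for i, item in enumerate(perm)}
    items = sorted(perm)
    return [1 if pos[a] < pos[b] else -1 for a, b in combinations(items, 2)]


def kendall_tau_distance(p, q):
    """Number of item pairs that the two permutations order differently."""
    return sum(1 for x, y in zip(pairwise_features(p), pairwise_features(q)) if x != y)


def mallows_kernel(p, q, lam=0.5):
    """Mallows kernel: exponential decay in the Kendall tau distance."""
    return math.exp(-lam * kendall_tau_distance(p, q))


identity = [0, 1, 2]
reverse = [2, 1, 0]
distance = kendall_tau_distance(identity, reverse)  # every pair is discordant
```

The feature vector has one entry per item pair, so its length grows quadratically with n; this is the scaling cost the abstract sets out to avoid.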

2604.23841 2026-06-15 cs.LG cs.AI updated version

Scalable Production Scheduling: Linear Complexity via Unified Homogeneous Graphs

Jonathan Hoss, Moritz Link, Noah Klarmann

Affiliations: Faculty of Management and Engineering, Rosenheim Technical University of Applied Sciences

AI summary: Proposes a unified homogeneous graph framework that uses feature-based homogenization to map distinct node roles into a shared latent space, solves the job shop scheduling problem with a homogeneous Graph Isomorphism Network at linear complexity, achieves zero-shot generalization, and identifies the job-to-machine ratio as the main driver of policy effectiveness.

Comments This paper has been accepted for presentation at the IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

English abstract

Efficiently solving the Job Shop Scheduling Problem in real-world industrial applications requires policies that are both computationally lean and topologically robust. While Reinforcement Learning has shown potential in automating dispatching rules, existing models often struggle with a scalability bottleneck caused by quadratic graph complexity or the architectural overhead of heterogeneous layers. We introduce a unified graph framework that employs feature-based homogenization to project distinct node roles into a shared latent space. This allows a standard homogeneous Graph Isomorphism Network to capture complex resource contention with linear complexity, ensuring low-latency inference for large-scale industrial applications. Our empirical results demonstrate that our framework achieves state-of-the-art performance while exhibiting consistent zero-shot generalization. We identify the job-to-machine ratio as the primary driver of policy effectiveness, rather than absolute problem size. Based on this, we propose a hypothesis of structural saturation, demonstrating that policies trained on critically congested instances ($\mathcal{J} \approx \mathcal{M}$) learn scale-invariant resolution strategies. Agents trained at this saturation point internalize invariant conflict-resolution logic, allowing them to treat massive rectangular instances as a sequential concatenation of saturated sub-problems. This approach eliminates the need for expensive scale-specific retraining and prevents overfitting to statistical shortcuts, providing a robust and efficient pathway for deploying RL solutions in dynamic production environments.

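2606.13732 2026-06-15 cs.AI new submission

When Sample Selection Bias Precipitates Model Collapse

Xinbao Qiao, Xianglong Du, Wei Liu, Jingqi Zhang, Peihua Mai, Meng Zhang, Yan Pang

Affiliations: University of Science and Technology of China

AI summary: Studies low-resource verification settings, showing that data selection based on locally biased reference distributions actually accelerates model collapse, and proposes Wasserstein proxy references built collaboratively across multiple data silos to mitigate diversity degradation.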

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

English abstract

The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy, yet its reliability depends critically on the reference distribution used by the verifier. We show that in low-resource verification regimes, where each verifier observes only a small, fragmented, and biased slice of the target manifold, selection itself becomes biased. This situation naturally arises in low-resource data silos such as healthcare consortia or proprietary financial institutions, where raw data cannot be pooled and local references are inherently incomplete. As a result, selection preferentially retains samples aligned with the local manifold while pruning globally relevant tail modes, turning from a safeguard against collapse into a mechanism that precipitates it. We theoretically prove that such siloed selection accelerates collapse and induces power-law diversity decay. As an initial mitigation, we construct Wasserstein proxy references from multiple silos without sharing raw data. Empirical results confirm that local-reference selection fails on skewed distributions, whereas collaborative proxy references mitigate diversity degradation, suggesting that recursive synthetic-data pipelines require particular caution when real-data coverage is fragmented or scarce.

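2606.13934 2026-06-15 cs.AI new submission

Can Editing 1 Neuron Fix Repetition Loops in LLMs?

Aristotelis Lazaridis, Aman Sharma, Dylan Bates, Brian King, Vincent Lu, Jack FitzGerald

Affiliations: Edgerunner AI

AI summary: Finds that Gemma 4 models fall into repetition loops on long factual enumeration tasks at rates as high as 95%, localizes the cause to a small number of MLP neurons through per-layer ablation and per-neuron attribution, and removes the loops with static weight edits (as small as a single sign-inverted neuron), although the edits cannot resolve "doom loops" caused by missing knowledge.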

English abstract

Yes. Can it cure doom loops? Probably not. The Gemma 4 instruction-tuned models share a reproducible failure: on long factual enumeration prompts, such as listing every episode of a TV series, the 88 IAU constellations, or the 151 original Pokemon, they collapse into repetition, either a tight verbatim loop or a list whose entries decay onto a single answer. These loops occur at rates as high as 95% and survive prompt rewording, inference-engine changes, and most sampling adjustments. In this paper we explore whether this behavior is localized enough to remove by weight edits. To localize the cause, we use per-layer ablation and per-neuron attribution, then confirm the strongest candidates with full-generation sweeps. The loops trace to a small set of MLP neurons (or, in the 26B-A4B Mixture-of-Experts model, a few routed experts) which we suppress with static weight edits. These "surgeries" can be as small as a single sign-inverted neuron (in the E2B model). The size of the effective edits grows with model scale, but in all cases, the loop patterns can be addressed at normal generation budgets while preserving general-purpose benchmark scores. However, the edits do not solve everything: we also study longer thinking budgets, where the two larger models most visibly enter doom looping, i.e. a non-convergent regime in which the model self-corrects in circles over a fact it cannot recall, exhausting the budget without committing to a final answer. We show this residual failure is reduced but not eliminated by the same edits, and argue it is fundamentally a knowledge-precision problem rather than a removable circuit; weight surgery can delete a loop, but it cannot supply a missing fact. Our results are both a feasibility demonstration, that is, evidence that a concrete generation pathology can be localized to a few parameters and edited out, and a delineation of where that approach stops.
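
As a toy illustration of what a static single-neuron weight edit can look like, the sketch below negates one hidden neuron's outgoing weights in a small ReLU MLP. It rests on an assumption: the abstract does not say how the "sign-inverted neuron" edit is applied, so interpreting it as negating the neuron's output column is hypothetical, as are the function names and the tiny weights.

```python
def mlp_forward(x, w_in, w_out):
    """Two-layer ReLU MLP without biases: y = w_out @ relu(w_in @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def flip_neuron_sign(w_out, neuron):
    """Return a copy of w_out with one hidden neuron's output column negated."""
    return [[-w if j == neuron else w for j, w in enumerate(row)] for row in w_out]


w_in = [[1.0, 0.0], [0.0, 1.0]]   # two hidden neurons
w_out = [[2.0, 3.0]]              # one output unit
x = [1.0, 1.0]

before = mlp_forward(x, w_in, w_out)
after = mlp_forward(x, w_in, flip_neuron_sign(w_out, neuron=1))
```

The edit touches a single column of one weight matrix and needs no retraining, which is the sense in which such a change is "static".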

2606.13723 2026-06-15 cs.CV cs.AI cross-list

SuperThoughts: Reasoning Tokens in Superposition

Zheyang Xiong, Shivam Garg, Max Yu, Vaishnavi Shrivastava, Haoyu Zhao, Anastasios Kyrillidis, Dimitris Papailiopoulos

Affiliations: University of Wisconsin-Madison; Microsoft Research; Independent; Princeton University; Rice University

AI summary: Proposes SuperThoughts, which compresses pairs of consecutive CoT tokens into single latent representations and decodes them with a multi-token prediction module, preserving training supervision while doubling inference throughput and shortening CoT by roughly 20-30% with minimal accuracy loss.

English abstract

Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token generation, they often struggle with training stability and fail to scale to complex, long-horizon tasks due to lack of supervision signal. We propose SuperThoughts, which compresses pairs of consecutive CoT tokens into single latent representations and decodes two tokens per step via a lightweight Multi-Token Prediction (MTP) module. This preserves discrete token supervision at training time while doubling throughput at inference time. We finetune Qwen2.5-Math-1.5B-Instruct, Qwen2.5-Math-7B-Instruct, and Qwen2.5-Math-14B-Instruct, and evaluate on MATH500, AMC, OlympiadBench, and GPQA-Diamond. With a confidence-based adaptive mechanism that falls back to standard decoding when uncertain, SuperThoughts achieves roughly 20-30% CoT length reduction while maintaining accuracy with minimal degradation (a drop of 1-2 accuracy points on most tasks).

2606.13894 2026-06-15 cs.LG cs.AI cs.CL cs.CV cross-list

Gefen: Optimized Stochastic Optimizer

Nadav Benedek, Tomer Koren, Ohad Fried

Affiliations: Reichman University; Tel Aviv University; Google Research

AI summary: Proposes the Gefen optimizer, which shares second-moment estimates and quantizes the first moment to cut AdamW's memory footprint by about 8x at the same performance, enabling larger batches and higher throughput.

English abstract

AdamW is a default optimizer for modern deep learning, but its first and second moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook, thereby reducing AdamW's memory footprint by ~8x while maintaining the same performance, corresponding to a reduction of 6.5 GiB per billion parameters. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, suggesting that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Since computing Hessians is impractical at scale, Gefen infers block structure from the initial squared gradients, requiring no architecture-specific metadata or hyperparameters beyond AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and improves throughput significantly over AdamW, providing a practical drop-in replacement with lower memory usage that can increase throughput and enable training larger models or using larger batch sizes. We provide the complete Python implementation, including fused CUDA kernels at https://github.com/ndvbd/Gefen
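
The 6.5 GiB figure is consistent with a back-of-the-envelope calculation. The sketch below assumes, which the abstract does not state, that both AdamW moment buffers are stored as 32-bit floats and that the reduction is exactly 8x; the function name is illustrative.

```python
GIB = 2 ** 30  # bytes per GiB


def adamw_state_gib(num_params, bytes_per_value=4):
    """Memory of AdamW's first- and second-moment buffers, in GiB."""
    return 2 * num_params * bytes_per_value / GIB


baseline = adamw_state_gib(1_000_000_000)  # about 7.45 GiB per billion parameters
reduced = baseline / 8                     # the abstract's approximate 8x reduction
saved = baseline - reduced                 # about 6.52 GiB

print(f"baseline={baseline:.2f} GiB, reduced={reduced:.2f} GiB, saved={saved:.2f} GiB")
```

Under these assumptions the saving is seven eighths of the two moment buffers, which rounds to the 6.5 GiB per billion parameters quoted above.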

2606.14047 2026-06-15 cs.IR cs.AI cs.CL cs.LG cross-list

Knowledge Graph Enhanced Memory-Augmented Retrieval for Long Context Modeling

Ghadir Alselwi, Basem Suleiman, Hao Xue, Shoaib Jameel, Hakim Hacid, Flora D. Salim, Imran Razzak

Affiliations: University of New South Wales; Hong Kong University of Science and Technology (Guangzhou); University of Southampton; Technology Innovation Institute; Mohamed Bin Zayed University of Artificial Intelligence

AI summary: Proposes the KGERMAR framework, which dynamically builds contextual knowledge graphs and fuses them through a multi-component memory architecture, lowering perplexity by up to 8.5% and improving memory efficiency by 2-2.5x in long-context modeling.

English abstract

Long-context language modeling requires not only extending context windows but maintaining coherent understanding of entity states and relationships across thousands of tokens -- a challenge that semantic similarity alone cannot address. KGERMAR addresses this by constructing dynamic, context-specific knowledge graphs from input text during inference, enabling domain-adaptive retrieval that leverages both semantic similarity and explicit entity relationships. The framework performs real-time entity and relation extraction to build contextual knowledge graphs, then integrates graph-structural embeddings with textual semantics through a multi-component memory architecture. Three memory banks -- contextual, semantic, and structural -- are maintained with retrieval signals fused via learned weights to capture both surface-level semantics and deeper relational patterns. Evaluated on SlimPajama (84.7K training examples), WikiText-103 (4,358 examples), PG-19 (100 examples), and Proof-pile (46.3K examples), KGERMAR achieves up to 8.5% lower perplexity and 2-2.5x better memory efficiency than memory-augmented baselines across context lengths from 1K to 32K tokens, with superior in-context learning performance across five NLU tasks. The dynamic knowledge graph construction approach advances memory-augmented language modeling by enabling domain-specific knowledge representation that adapts to input contexts rather than relying on fixed knowledge bases.

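2606.14108 2026-06-15 cs.LG cs.AI cross-list

Numbers Already Carry Their Own Embeddings

Suhyun Bae, Donghun Lee

Affiliations: Department of Mathematics, Korea University

AI summary: Proposes AOE, a training-free embedding method that preserves both a number's real value and its p-adic modular signature, works as a plug-and-play component, and is the first to reach perfect accuracy on algebraic-combinatorial benchmarks.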

Comments Presented at the MATH-AI Workshop at NeurIPS 2025

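2606.14123 2026-06-15 cs.LG cs.AI cross-list

Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage

Xiaoran Yan, Cheng Tang, Atsushi Shimada

Affiliations: Kyushu University

AI summary: Proposes SLC, which converts binary observations into Gaussian pseudo-observations via Laplace/IRLS, applies empirical-Bayes shrinkage through a Kalman smoother, and fits an offset-Platt link to correct per-item bias in knowledge-tracing models and recover stranded discrimination, improving AUC and NLL across multiple datasets and backbones.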

Comments 25 pages, 3 figures. Accepted at ECML PKDD 2026 (Research Track). Code: https://github.com/xiaoran-y/SLC

English abstract

Deployed knowledge-tracing models are typically frozen after training, yet systematic per-item logit bias arises from limited per-item expressivity in backbone architectures and from post-deployment shifts in item properties, degrading prediction quality. Global post-hoc calibrators such as Platt scaling, temperature scaling, and isotonic regression improve probability estimates but leave discriminative ability, as measured by AUC, unchanged. This AUC invariance is a structural consequence of monotone score-only transforms; recovering the stranded discrimination requires conditioning on item identity. We propose SLC (State-space Logit Correction), which converts binary observations to Gaussian pseudo-observations via Laplace/IRLS, applies empirical-Bayes shrinkage through a Kalman smoother, and fits an offset-Platt link. The state-space formulation also yields a detectability bound that characterizes the Bernoulli information floor, explaining why temporal tracking provides no benefit at current data densities. Across four datasets, five backbones, and three seeds, SLC improves AUC on all four datasets and NLL on three, with the advantage concentrating on sparse items. Cross-domain controls suggest that the same phenomenon can arise beyond education when the deployed backbone leaves entity-level bias.
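
The AUC-invariance argument can be checked on a toy example: a strictly increasing transform such as Platt scaling preserves the ranking of scores, so AUC is unchanged, whereas an offset that depends on item identity can reorder them. The data, Platt coefficients, and per-item offsets below are invented for illustration; this is not the SLC procedure, only a demonstration of the structural point.

```python
import math


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def platt(logit, a=0.5, b=-1.0):
    """Global Platt scaling: a strictly increasing map when a > 0."""
    return 1.0 / (1.0 + math.exp(-(a * logit + b)))


logits = [0.2, 1.0, 0.8, -0.3]
labels = [1, 0, 1, 0]
items = ["A", "B", "B", "A"]
item_offset = {"A": 0.0, "B": -1.0}  # hypothetical per-item bias correction

calibrated = [platt(z) for z in logits]
corrected = [z + item_offset[i] for z, i in zip(logits, items)]

auc_raw = auc(logits, labels)
auc_calibrated = auc(calibrated, labels)
auc_corrected = auc(corrected, labels)
```

Here the global calibrator leaves AUC where it was, while the item-conditioned offset changes the ranking and therefore the AUC.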

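2606.14156 2026-06-15 cs.LG cs.AI cross-list

Squeeze-Release: Iterative Pruning with Exact Structural Minimization

Roman Denkin, Ida Akerholm, Prashant Singh, Ida-Maria Sintorn

Affiliations: Uppsala University

AI summary: Proposes the Squeeze-Release cycle, which uses an exact structural rewrite to turn a masked network into a smaller dense network and introduces CompensatedLayerNorm to extend the rewrite to residual streams, achieving up to 39x compression.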

English abstract

Unstructured pruning produces sparse weight tensors, but the standard implementation keeps tensor shapes unchanged, so the deployed model is no smaller than before pruning. We present an exact structural rewrite, which we call minimization, that converts a masked network into a smaller dense network with the same forward function up to floating-point rounding. The Squeeze-Release cycle iterates pruning and minimization with an intermediate release step that re-enables the exact-zero positions inside the compacted tensors as small calibrated noise, turning otherwise wasted capacity back into trainable parameters. Successive cycles use that capacity to find structural redundancy a single pass cannot reach. We additionally introduce CompensatedLayerNorm, a function-preserving replacement for LayerNorm that extends minimization to channel reduction across LayerNorm-equipped residual streams. Squeeze-Release compresses the deployable network to 39x smaller than the unpruned model on a fully-connected network and 14.8x smaller on a modern CNN (ConvNeXt-Tiny), at comparable accuracy. In addition, we prove that the rewrite can be extended to transformer architectures.
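
One simple case of a function-preserving rewrite is sketched below: a hidden unit whose outgoing weights have all been pruned to exactly zero contributes nothing downstream, so it can be dropped without changing the output. This is an illustrative special case under assumptions, not the paper's minimization algorithm; the function names and the two-layer ReLU setting are hypothetical.

```python
def forward(x, w1, b1, w2, b2):
    """Two-layer ReLU MLP: y = w2 @ relu(w1 @ x + b1) + b2."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def minimize(w1, b1, w2, b2):
    """Drop hidden units whose outgoing weights are all exactly zero."""
    keep = [j for j in range(len(w1)) if any(row[j] != 0.0 for row in w2)]
    return ([w1[j] for j in keep], [b1[j] for j in keep],
            [[row[j] for j in keep] for row in w2], b2)


w1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # three hidden units
b1 = [0.0, 0.1, -0.2]
w2 = [[1.0, 0.0, 2.0]]                      # unit 1 has been pruned away
b2 = [0.5]

small = minimize(w1, b1, w2, b2)
x = [1.0, -1.0]
y_full = forward(x, w1, b1, w2, b2)
y_small = forward(x, *small)
```

The smaller network stores dense tensors of reduced shape, which is what makes the compression visible at deployment rather than only in a mask.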

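2606.14386 2026-06-15 cs.LG cs.AI q-fin.PM cross-list

Discovery under Hypothesis Redundancy: A Geometric Theory of Discovery Bottlenecks

Li Xia, Baoxun Wang

Affiliations: School of Economics and Management, Tsinghua University; Platform & Content Group, Tencent

AI summary: Proposes the Search Compression Hypothesis, which explains the advantage of hybrid discovery systems through three geometric conditions (spectral compression, orthogonal escape, and residual signal alignment). Experiments show that novelty alone is insufficient and that predictive alignment is required.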

Comments 23 pages, 1 figure, 27 tables

English abstract

Scientific discovery saturates when new hypotheses cease to provide independent information, even if the nominal hypothesis space remains large. We study hybrid discovery systems that combine structured local search with LLM-generated non-local proposals and pose the Search Compression Hypothesis: non-local exploration helps only when three geometric conditions co-occur: spectral compression, orthogonal escape from the explored span, and residual signal alignment with the target. We formalize these conditions, derive necessary conditions for hybrid advantage, and test the mechanism in controlled synthetic environments, large-scale A-share factor discovery, and symbolic-regression benchmarks; a public tabular operational sanity check tests the associated budget-allocation implication. Signal-planting and directed-versus-random experiments show that novelty alone is insufficient: random orthogonal jumps expand coverage but do not improve yield without predictive alignment. Across compression sweeps, real factor archives, and LLM-SRBench tasks, hybrid gains concentrate in weakly represented but target-bearing directions and vanish as the hypothesis space approaches full rank. The framework turns LLM-guided discovery from generic novelty search into a diagnostic procedure for deciding when directed non-local exploration is warranted.

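2606.14555 2026-06-15 cs.CV cs.AI cross-list

Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

Aray Karjauv

Affiliations: Aray Karjauv

AI summary: Shows that the global average pooling structure in standard image classifiers admits a natural multiple-instance learning interpretation, which lets classifiers trained with a single label per image learn in multi-object scenes, and proposes a post-hoc diagnostic for extracting spatial class evidence.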

English abstract

Modern image classifiers widely adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that the image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multiple-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances. Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation further suggests that common classifier failures reflect known limitations of mean aggregation.
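
The linearity claim is easy to verify numerically: applying a linear head to the average-pooled feature gives the same logits as averaging the head's outputs over all spatial locations. The sketch below uses a made-up 4-location, 2-channel feature grid and a 2-class head; all values and function names are illustrative.

```python
def linear_head(feature, weights, bias):
    """Apply a linear classification head to one feature vector."""
    return [sum(w * f for w, f in zip(row, feature)) + b
            for row, b in zip(weights, bias)]


def mean_vectors(vectors):
    """Element-wise mean of equal-length vectors (global average pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Flattened feature grid: 4 spatial locations, 2 channels each.
grid = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 1.0]]
weights = [[1.0, -1.0], [0.5, 2.0]]  # 2 classes x 2 channels
bias = [0.1, -0.2]

# Standard path: pool first, then classify.
image_logits = linear_head(mean_vectors(grid), weights, bias)
# Prediction grid: classify each location, then average.
location_logits = [linear_head(f, weights, bias) for f in grid]
mean_location_logits = mean_vectors(location_logits)
```

The per-location logits form the prediction grid described above; their mean reproduces the image-level logits, so the grid exposes spatial evidence without retraining.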

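2606.14608 2026-06-15 cs.LG cs.AI cross-list

Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts

Farica Zhuang, Zixuan Wen, Christos Davatzikos, Li Shen

Affiliations: University of Pennsylvania

AI summary: Proposes a mixture-of-experts enhanced adaptive deep clustering survival framework (AdaCSM) whose routing-based expert mechanism enables conditional specialization, dynamically assigning patients to specialized risk predictors to improve survival prediction performance and interpretability.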

DOI: 10.1145/3807503.3819574

English abstract

Survival prediction plays a central role for healthcare providers and clinical researchers. Accurate risk stratification enables early intervention and improved patient management. Most existing deep survival models learn one common feature representation for all patients, which may hide important differences between patient subgroups. In contrast, a Mixture-of-Experts (MoE) framework allows different parts of the model to focus on different patient patterns, leading to more individualized representations. Therefore, in this work, we propose a mixture-of-experts enhanced adaptive deep clustering survival framework (AdaCSM) for modeling such heterogeneous survival patterns. We introduce a routing-based expert mechanism that enables conditional specialization within a parametric survival modeling framework. The proposed architecture allocates patients to specialized risk predictors dynamically while preserving the patient survival and subtype clustering objectives. We compare our method with state-of-the-art survival and deep clustering models on multiple real-world longitudinal clinical cohorts spanning diverse disease domains. The proposed method demonstrates improved predictive performance and leads to interpretable results in survival analysis.

2606.14639 2026-06-15 cs.SD cs.AI cross-list

From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing

Hugo Daumain, Driss Matrouf, Khaled Khelif, Mickael Rouvier

Affiliations: Université d'Avignon; Airbus Defence & Space

AI summary: Converts self-supervised speech models into a mixture-of-experts architecture with an inter-layer gating mechanism to improve generalization, reducing macro EER from 5.46% to 4.81% across 14 spoofing datasets.

Comments 8 pages, 3 figures, accepted at Odyssey 2026 (The Speaker and Language Recognition Workshop)

2303.09209 2026-06-15 cs.AI updated version

Learning optimal policies from event logs through reinforcement learning: a comparison of deep and MDP-based approaches

Stefano Branchi, Andrei Buliga, Chiara Di Francescomarino, Chiara Ghidini, Riccardo Graziosi, Francesca Meneghello, Massimiliano Ronzani

Affiliations: FBK - Fondazione Bruno Kessler; University of Trento (Unitn); Free University of Bozen-Bolzano (Unibz)

AI summary: Proposes two reinforcement learning approaches (MDP-based and offline deep RL) that learn optimal behavioral policies from historical event logs to optimize KPIs. Evaluated in a data-driven BPS environment, both improve the KPI effectively, with the MDP-based approach being more computationally efficient.

Comments 38 pages + appendix, 12 figures, new version published in IS journal

DOI: 10.1016/j.is.2026.102763
Journal ref: Information Systems, Volume 141, 2026, 102763, ISSN 0306-4379

英文摘要

Prescriptive Process Monitoring is an emerging area within Process Mining that focuses on recommending actions to optimize business outcomes. Most existing works prescribe pre-defined interventions, i.e., sets of actions applied to ongoing process executions to achieve a specific objective or Key Performance Indicator (KPI). In contrast, only a few approaches have explored learning and evaluating optimal behavioral policies, i.e., general strategies that determine the best sequence of actions to maximize a desired KPI. In this paper, we address the problem of learning optimal behavioral policies by proposing an AI-based approach that learns an optimal policy directly from historical process executions using Reinforcement Learning (RL) to recommend the best actions for optimizing a KPI. To this end, we employ two RL techniques. The first is a classical model-based approach that extends previous work by the authors through the construction of a Markov Decision Process (MDP) capturing process behavior. The second is a model-free technique based on offline Deep RL. Unlike state-of-the-art work, we aim to minimize the use of domain knowledge and learn optimal policies directly from historical event data. This allows us to learn when to apply interventions and discover effective ones directly from data. Moreover, we target complex scenarios involving external actors, where the process owner controls only part of the activities. We adopt a data-driven Business Process Simulation (BPS) environment to evaluate the learned policies. Results show that both methods improve the targeted KPI with similar effectiveness, while the model-based approach outperforms offline Deep RL in computational efficiency.


2601.05106 2026-06-15 cs.AI cs.CL cs.LG updated version

Token-Level LLM Collaboration via FusionRoute

通过融合路由实现令牌级LLM协作

Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

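Affiliations * [cs.AI] (Computer Science and Artificial Intelligence)

AI summary: Proposes FusionRoute, a framework in which a lightweight router selects the most suitable expert at each decoding step and adds a complementary logit to refine the next-token distribution. It addresses the cost of achieving strong multi-domain performance with a single general-purpose model and outperforms other methods on multiple benchmarks.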

Comments 25 pages

详情

AI中文摘要

大型语言模型（LLMs）在多个领域表现出色。然而，使用单一通用模型在这些领域实现强大性能通常需要扩展到训练和部署成本极高的规模。另一方面，虽然较小的领域专用模型更高效，但它们在训练分布之外的泛化能力较差。为了解决这一矛盾，我们提出了FusionRoute，一种稳健且有效的令牌级多LLM协作框架，其中轻量级路由器同时（i）在每个解码步骤中选择最合适的专家，（ii）贡献一个互补的对数几率，通过对数几率添加来细化或校正所选专家的下一个令牌分布。与现有依赖固定专家输出的令牌级协作方法不同，我们提供了一个理论分析，表明纯专家路由本质上是有限的：除非持有强全局覆盖假设，否则无法一般实现最优解码策略。通过在专家选择中加入可训练的互补生成器，FusionRoute扩展了有效的策略类别，并在温和条件下实现了最优价值函数的恢复。经验上，FusionRoute在Llama-3和Gemma-2家族以及涵盖数学推理、代码生成和指令跟随在内的多种基准测试中，优于序列级和令牌级协作、模型融合和直接微调方法，同时在各自任务上与领域专家保持竞争力。

英文摘要

Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
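
The decoding step described in the abstract (select one expert, then add the router's complementary logits) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the four-token toy vocabulary, and the hard-coded scores and logits are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fused_next_token_probs(expert_logits, router_scores, router_logits):
    """One decoding step of router-plus-expert fusion (hypothetical sketch).

    expert_logits: one logit vector per expert
    router_scores: one selection score per expert
    router_logits: the router's complementary logit vector
    """
    # (i) Select the expert with the highest router score.
    chosen = max(range(len(expert_logits)), key=lambda i: router_scores[i])
    # (ii) Add the complementary logits to the chosen expert's logits.
    fused = [e + r for e, r in zip(expert_logits[chosen], router_logits)]
    return chosen, softmax(fused)

# Toy example: two experts, vocabulary of four tokens.
experts = [[2.0, 0.5, 0.1, -1.0], [0.0, 3.0, 0.2, 0.1]]
chosen, probs = fused_next_token_probs(experts, [0.3, 0.7], [0.0, -2.5, 2.5, 0.0])
best_token = max(range(len(probs)), key=lambda i: probs[i])
```

In this toy run the router picks the second expert, whose own top token is index 1, and the complementary logits shift the most likely token to index 2. That is the "refine or correct" behavior that pure expert-only routing cannot express.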


2605.14998 2026-06-15 cs.AI cs.SY eess.SY q-bio.QM updated version

Learning Developmental Scaffoldings to Guide Self-Organisation

学习发育支架以引导自组织

Milton L. Montero, Elias Najarro, Jakob Schauser, Sebastian Risi

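Affiliations * IT University of Copenhagen; University of Copenhagen; Sakana AI

AI summary: Studies how jointly learning self-organisation rules and pre-patterns improves the robustness, encoding capacity, and symmetry breaking of developmental processes.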

Comments 8 pages + acknowledgements and references, 5 figures. Camera-ready version for ALife 2026

详情

AI中文摘要

从亚细胞结构到整个生物体，许多自然系统通过自组织生成复杂结构：局部相互作用共同产生全局结构，而无需任何结果的蓝图。然而，推动此类过程的大量信息并非由自组织本身产生，而是常常转移到系统的初始条件中。生物发育是一个典型例子，其中母体的预模式编码位置和对称性打破信息，从而引导自组织过程。从早期胚胎发育中的母体形态发生素梯度到组织水平的形态发生预模式指导器官形成，这种信息转移到初始条件的现象，类似于计算系统中的记忆-计算权衡，是发育过程的基本部分。在本文中，我们通过引入一个模型来研究这种信息转移现象，该模型同时学习自组织规则和预模式，允许其相互作用在受控条件下进行变化和测量：一个神经细胞自动机（NCA）配对一个学习基于坐标的模式生成器（SIREN），两者同时训练以生成一组模式。我们提供了信息论分析，探讨信息如何在预模式和自组织过程之间分布，并展示联合学习两者可提高鲁棒性、编码能力和对称性打破，相较于纯自组织替代方案。进一步分析表明，有效的预模式不简单地近似其目标；而是通过偏转发育动力学的方式促进收敛，指出了初始条件结构与自组织动力学之间非平凡的关系。

英文摘要

From subcellular structures to entire organisms, many natural systems generate complex organisation through self-organisation: local interactions that collectively give rise to global structure without any blueprint of the outcome. Yet a significant portion of the information driving such processes is not produced by self-organisation itself; instead, it is often offloaded to the initial conditions of the system. Biological development is a prime example, where maternal pre-patterns encode positional and symmetry-breaking information that scaffolds the self-organising process. From maternal morphogen gradients in early embryogenesis to tissue-level morphogenetic pre-patterns guiding organ formation, this transfer of information to initial conditions, analogous to a memory-compute trade-off in computational systems, is a fundamental part of developmental processes. In this work, we study this offloading phenomenon by introducing a model that jointly learns both the self-organisation rules and the pre-patterns, allowing their interplay to be varied and measured under controlled conditions: a Neural Cellular Automaton (NCA) paired with a learned coordinate-based pattern generator (SIREN), both trained simultaneously to generate a set of patterns. We provide information-theoretic analyses of how information is distributed between pre-patterns and the self-organising process, and show that jointly learning both components yields improvements in robustness, encoding capacity, and symmetry breaking over purely self-organising alternatives. Our analysis further suggests that effective pre-patterns do not simply approximate their targets; rather, they bias the developmental dynamics in ways that facilitate convergence, pointing to a non-trivial relationship between the structure of initial conditions and the dynamics of self-organisation.


2606.13392 2026-06-15 cs.AI updated version

MiniMax Sparse Attention

MiniMax 稀疏注意力

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen, Yang Xu, Lunbin Zeng, Xiaolong Li, Haohai Sun, Haichao Zhu, Vito Zhang, Jinkai Hu, Jiayao Li, Rui Gao, Zekun Li, Songquan Zhu, Jingkai Zhou, Pengyu Zhao

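Affiliations * MiniMax; Peking University; NVIDIA; Zhejiang University; Huazhong University of Science and Technology

AI summary: Proposes MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built on Grouped Query Attention in which a lightweight index branch selects Top-k key-value blocks for efficient long-context processing. On a 109B model at 1M context it reduces attention compute by 28.4x and delivers 14.2x prefill and 7.6x decoding speedups.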

Comments 30 pages, 14 figures

详情

AI中文摘要

超长上下文能力对于前沿大语言模型变得不可或缺：智能体工作流、仓库级代码推理和持久记忆都要求模型共同关注数十万到数百万个 token，然而 softmax 注意力的二次成本使得这在部署规模上难以实现。我们引入了 MiniMax 稀疏注意力（MSA），一种基于分组查询注意力（GQA）构建的块级稀疏注意力。一个轻量级索引分支对键值块进行评分，并为每个 GQA 组独立选择 Top-k 子集，从而实现组特定的稀疏检索，同时保持高效的块级执行；主分支则仅对选中的块执行精确的块稀疏注意力。MSA 的设计遵循简单和可扩展的原则，经过精心简化，使其能够在一系列 GPU 上高效部署。为了将稀疏性转化为实际加速，我们与 MSA 协同设计了 GPU 执行路径，该路径使用无指数 Top-k 选择和 KV 外部稀疏注意力，以在块粒度访问下提高张量核心利用率。在一个具有原生多模态训练的 109B 参数模型上，MSA 的性能与 GQA 相当，同时在 1M 上下文下将每个 token 的注意力计算减少了 28.4 倍。结合我们协同设计的内核，MSA 在 H800 上实现了 14.2 倍的预填充和 7.6 倍的解码端到端加速。我们的推理内核可在以下网址获取：this https URL。一个由 MSA 驱动的生产级原生多模态模型已在以下网址公开发布：this https URL。

英文摘要

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
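
The two-branch idea in the abstract (score blocks, keep the Top-k, then run exact attention over only those blocks) can be sketched for a single query as follows. This is a rough sketch under assumptions, not the MSA kernel: scoring a block by the query's dot product with the block's mean key is a stand-in for the learned Index Branch, and the function names and toy tensors are hypothetical. The paper selects blocks independently per GQA group; the sketch covers one group only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend from one query to the top_k highest-scoring key-value blocks."""
    # Split token positions into contiguous blocks.
    blocks = [list(range(s, min(s + block_size, len(keys))))
              for s in range(0, len(keys), block_size)]

    # Index step (stand-in for a learned index branch): score = q . mean key.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Main step: exact softmax attention over tokens in the selected blocks.
    token_ids = [i for b in selected for i in blocks[b]]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    norm = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / norm
           for d in range(len(values[0]))]
    return selected, out

# Toy example: 8 tokens in 4 blocks of 2; keep the 2 best blocks.
keys = [[0, 1], [0, 1], [1, 0], [1, 0], [-1, 0], [-1, 0], [0.5, 0], [0.5, 0]]
values = [[float(i)] for i in range(8)]
selected, out = block_sparse_attention([1.0, 0.0], keys, values,
                                       block_size=2, top_k=2)
```

With `top_k` fixed, the attention cost per query depends on `top_k * block_size` rather than on the full sequence length, which is where the compute reduction comes from.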


2505.12992 2026-06-15 cs.LG cs.AI cs.CL stat.ML updated version

Fractured Chain-of-Thought Reasoning

断裂链式思维推理

Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong

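Affiliations * University of Amsterdam; eBay; Microsoft; Google Research; Salesforce

AI summary: Proposes Fractured Sampling, an inference-time strategy that truncates reasoning chains and varies the number of trajectories and final solutions per trajectory to achieve better accuracy-cost trade-offs.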

详情

AI中文摘要

推理时扩展技术通过在不重新训练的情况下利用额外的推理计算，显著增强了大型语言模型（LLMs）的推理能力。类似地，链式思维（CoT）提示及其扩展Long CoT通过生成丰富的中间推理轨迹来提高准确性，但这些方法会带来大量的token成本，阻碍了它们在延迟敏感场景中的部署。在这项工作中，我们首先证明截断CoT（即在完成推理前停止并直接生成最终答案）通常在使用显著更少token的情况下与完整CoT采样相匹配。基于这一见解，我们引入了断裂采样，这是一种统一的推理时策略，沿着三个正交轴在完整CoT和仅解决方案采样之间进行插值：（1）推理轨迹的数量，（2）每条轨迹的最终解数量，以及（3）推理轨迹被截断的深度。通过在五个不同的推理基准和多个模型规模上进行大量实验，我们证明断裂采样始终实现优越的精度-成本权衡，在Pass@k与token预算之间产生陡峭的对数线性缩放增益。我们的分析揭示了如何在这些维度上分配计算以最大化性能，为更高效和可扩展的LLM推理铺平了道路。代码可在该https URL获取。

英文摘要

Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches the full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning. Code is available at https://github.com/BaohaoLiao/frac-cot.


2506.14202 2026-06-15 cs.LG cs.AI stat.ML updated version

DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

DiffusionBlocks: 通过扩散解释进行分块神经网络训练

Makoto Shing, Masanori Koyama, Takuya Akiba

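Affiliations * Sakana AI; The University of Tokyo

AI summary: Proposes the DiffusionBlocks framework, which uses the correspondence between residual connections and dynamical systems to recast a network as a denoising process. A score matching objective lets each block train independently, matching end-to-end training across several transformer architectures while reducing memory requirements.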

Comments To appear at the 14th International Conference on Learning Representations (ICLR 2026). v4: Fixed typos in experimental details (Appendix E.4)

详情

AI中文摘要

端到端反向传播需要存储所有层的激活值，造成内存瓶颈，限制了模型的可扩展性。现有的分块训练方法提供了缓解该问题的途径，但它们依赖于特设的局部目标，并且在分类任务之外尚未得到充分探索。我们提出$\textit{DiffusionBlocks}$，一个将基于Transformer的网络转化为真正独立可训练块的原则性框架，这些块能保持与端到端训练相竞争的性能。我们的关键洞察在于利用残差连接自然对应于动力系统中的更新这一事实。通过对该系统进行最小修改，我们可以将这些更新转换为去噪过程的更新，其中每个块可以通过利用分数匹配目标独立学习。这种独立性使得每次只训练一个块的梯度成为可能，从而将内存需求按块数量成比例降低。我们在多种Transformer架构（视觉、扩散、自回归、递归深度和掩码扩散）上的实验表明，DiffusionBlocks训练与端到端训练性能匹配，同时能够在实际任务（超越小规模分类）上实现可扩展的分块训练。DiffusionBlocks提供了一种理论上有依据的方法，成功地将现代生成任务扩展到多种架构。代码可在该https URL获取。

英文摘要

End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose DiffusionBlocks, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one block at a time, thereby reducing memory requirements in proportion to the number of blocks. Our experiments on a range of transformer architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion) demonstrate that DiffusionBlocks training matches the performance of end-to-end training while enabling scalable block-wise training on practical tasks beyond small-scale classification. DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures. Code is available at https://github.com/SakanaAI/DiffusionBlocks.


2506.17255 2026-06-15 cs.LG cs.AI updated version

UltraSketchLLM: Sub-1-Bit LLM Compression via Sketch and Hardware-Friendly Operators

UltraSketchLLM：基于草图与硬件友好算子的低于1比特LLM压缩

Sunan Zou, Xueting Sun, Ziyun Zhang, Guojie Luo

Affiliations * National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; School of Electronic Engineering and Computer Science, Peking University; Center for Energy-efficient Computing and Applications, Peking University

AI summary: Proposes UltraSketchLLM, which uses data sketching to compress LLM weights to 0.5 bits and, with a hardware-friendly implementation, achieves a 14.9x speedup with acceptable performance degradation.

Comments Accepted by the 63rd ACM/IEEE The Chips to Systems Conference (DAC 2026)

2510.01663 2026-06-15 cs.LG cs.AI updated version

Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value

基于Shapley值的Kolmogorov-Arnold网络平移不变属性评分

Wangxuan Fan, Ching Wang, Siqi Li, Nan Liu

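Affiliations * GitHub

AI summary: Proposes ShapKAN, a framework that uses Shapley value attribution for shift-invariant assessment of node importance, compressing KANs effectively while preserving their interpretability advantages.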

Comments 14 pages, 6 figures, 9 tables

详情

AI中文摘要

对于许多实际应用，理解特征与结果之间的关系与实现高预测准确性同样重要。虽然传统神经网络在预测方面表现出色，但其黑箱性质掩盖了潜在的功能关系。Kolmogorov-Arnold网络（KAN）通过在边上采用可学习的基于样条的激活函数来解决这一问题，能够在保持竞争性能的同时恢复符号表示。然而，KAN的架构对网络剪枝提出了独特的挑战。由于对输入坐标平移的敏感性，传统的基于幅度的方法变得不可靠。我们提出了\textbf{ShapKAN}，一种使用Shapley值归因以平移不变方式评估节点重要性的剪枝框架。与基于幅度的方法不同，ShapKAN量化每个节点的实际贡献，确保无论输入参数化如何，重要性排名保持一致。在合成和真实世界数据集上的大量实验表明，ShapKAN在实现有效网络压缩的同时保留了真实的节点重要性。我们的方法提升了KAN的可解释性优势，便于在资源受限环境中部署。

英文摘要

For many real-world applications, understanding feature-outcome relationships is as crucial as achieving high predictive accuracy. While traditional neural networks excel at prediction, their black-box nature obscures underlying functional relationships. Kolmogorov-Arnold Networks (KANs) address this by employing learnable spline-based activation functions on edges, enabling recovery of symbolic representations while maintaining competitive performance. However, KAN's architecture presents unique challenges for network pruning. Conventional magnitude-based methods become unreliable due to sensitivity to input coordinate shifts. We propose ShapKAN, a pruning framework using Shapley value attribution to assess node importance in a shift-invariant manner. Unlike magnitude-based approaches, ShapKAN quantifies each node's actual contribution, ensuring consistent importance rankings regardless of input parameterization. Extensive experiments on synthetic and real-world datasets demonstrate that ShapKAN preserves true node importance while enabling effective network compression. Our approach improves KAN's interpretability advantages, facilitating deployment in resource-constrained environments.
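
Shapley value attribution, which the abstract uses to rank nodes for pruning, can be illustrated with an exact computation over a tiny set of nodes. This is a generic sketch, not ShapKAN itself: the node names and the `toy_value` function are hypothetical, and in practice the value of a node subset would presumably come from evaluating the network with only those nodes active.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [x for x in players if x != p]
        total = 0.0
        for r in range(n):
            for subset in combinations(others, r):
                s = frozenset(subset)
                # Standard Shapley weight for a coalition of size r.
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def toy_value(active):
    """Hypothetical model quality when only the nodes in `active` are kept."""
    v = 0.0
    if "a" in active:
        v += 1.0
    if "b" in active:
        v += 0.5
    if "a" in active and "b" in active:
        v += 0.2  # interaction term, split evenly between a and b
    return v

phi = shapley_values(["a", "b", "c"], toy_value)
prune_first = min(phi, key=phi.get)
```

Node `c` never changes the value, so it receives a score of zero and is the first pruning candidate. The scores also sum to the value of the full node set, which is the efficiency property of Shapley values. Exact enumeration is exponential in the number of nodes, so a real implementation would need sampling or another approximation.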


2511.07368 2026-06-15 cs.LG cs.AI updated version

Distributional Biases in Post-Training: A Markovian Analysis of Reasoning Trajectories

后训练中的分布偏差：推理轨迹的马尔可夫分析

Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, Taiji Suzuki

Affiliations * City University of Hong Kong; Center for Advanced Intelligence Project, RIKEN; The Institute of Statistical Mathematics; University of Sydney; CFAR and IHPC, Agency for Science, Technology and Research (A*STAR); Nanyang Technological University; The University of Tokyo

AI summary: Uses a Markov chain model to analyze how post-training strategies such as RLVR and ORM/PRM reinforce high-probability paths while forgetting rare but crucial reasoning steps, and proves that exploration strategies such as rejecting easy instances and KL regularization help preserve rare CoTs.

详情

AI中文摘要

基础模型展现出广泛的知识但有限的特定任务推理能力，这促使了后训练策略的发展，例如基于可验证奖励的强化学习（RLVR）和测试时扩展（TTS）。尽管近期工作强调了探索在提升pass@K中的作用，但经验证据指向一个悖论：RLVR和ORM/PRM通常强化现有路径而非扩展推理范围，这引发了一个问题：如果没有新模式出现，探索为何有帮助？为调和这一悖论，我们采用Kim等人（2025）的视角，将简单（例如，简化分数）与困难（例如，发现某种对称性）推理步骤分别视为低概率和高概率的马尔可夫转移。在这个易处理的模型中，预训练对应于树图发现，而后训练对应于思维链（CoT）重新加权。我们可证明地表明，RLVR和ORM/PRM都会严重偏向若干高概率路径，从而遗忘稀有但关键的CoT。在此基础上，我们进一步证明，诸如拒绝简单实例和KL正则化等探索策略有助于保留稀有CoT。实证模拟证实了我们的理论结果。

英文摘要

Foundation models exhibit broad knowledge but limited task-specific reasoning, motivating post-training strategies such as RL with verifiable rewards (RLVR) and test-time scaling (TTS). While recent work highlights the role of exploration in improving pass@K, empirical evidence points to a paradox: RLVR and ORM/PRM typically reinforce existing paths rather than expanding the reasoning scope, raising the question of why exploration helps if no new patterns emerge. To reconcile this paradox, we adopt the perspective of Kim et al. (2025), viewing easy (e.g., simplifying a fraction) versus hard (e.g., discovering a symmetry) reasoning steps as high- versus low-probability Markov transitions. In this tractable model, pretraining corresponds to tree-graph discovery, while post-training corresponds to CoT reweighting. We prove that both RLVR and ORM/PRM heavily favor a few high-probability paths and thereby forget rare-but-crucial CoTs. Building on this, we further prove that exploration strategies such as rejecting easy instances and KL regularization help preserve rare CoTs. Empirical simulations corroborate our theoretical results.


2512.22671 2026-06-15 cs.CL cs.AI cs.LG updated version

Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

脆弱的知识，稳健的指令遵循：Llama-3.2中的宽度剪枝二分法

Pere Martra

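Affiliations * Independent Researcher

AI summary: Applies structured width pruning to GLU-MLP layers using the Peak-to-Peak Magnitude criterion and finds that reducing the expansion ratio hurts parametric-knowledge tasks but improves instruction following, challenging the assumption that pruning causes uniform degradation.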

Comments 22 pages, 5 figures, 9 tables. Code available at https://github.com/peremartra/llama-glu-expansion-pruning

详情

AI中文摘要

对Llama-3.2模型中GLU-MLP层的结构化宽度剪枝，以峰值幅度（PPM）准则为指导，揭示了降低扩展比如何系统性地影响不同模型能力的二分法。虽然依赖参数化知识的任务（如MMLU、GSM8K）和困惑度指标的性能随扩展比降低而可预测地下降，但指令遵循能力在2.4倍平衡比下得到提升（IFEval：Llama-3.2-1B中+4.8分/+46%，Llama-3.2-3B中+3.7分/+39%），且多步推理保持稳健（MUSR）。这种模式在两个评估模型大小上一致观察到，挑战了压缩研究中剪枝导致均匀退化的主流假设。为探究这一点，我们使用评估事实知识、数学推理、语言理解、指令遵循和真实性的综合基准套件，评估了七种扩展比配置。我们的分析将扩展比识别为一个关键架构参数，它选择性地重塑模型的任务性能轮廓，而不仅仅是作为压缩指标。

英文摘要

Structured width pruning of GLU-MLP layers in Llama-3.2 models, guided by the Peak-to-Peak Magnitude (PPM) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably with decreasing expansion ratios, instruction-following capabilities improve at the 2.4x equilibrium ratio (IFEval: +4.8 points / +46% in Llama-3.2-1B and +3.7 points / +39% in Llama-3.2-3B), and multi-step reasoning remains robust (MUSR). This pattern, observed consistently across both evaluated model sizes, challenges the prevailing assumption in compression research that pruning induces uniform degradation. To investigate this, we evaluated seven expansion ratio configurations using comprehensive benchmark suites that assess factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively reshapes the model's task performance profile, rather than merely serving as a compression metric.


2601.22108 2026-06-15 cs.LG cs.AI updated version

Learning What to Predict: Downstream-Guided Task Design for Continued Pretraining

学习预测什么：下游引导的持续预训练任务设计

Shuqi Ke, Giulia Fanti

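Affiliations * Department of ECE; Carnegie Mellon University

AI summary: Proposes V-pretraining, in which a lightweight task designer constructs targets or views for unlabeled batches, using the predicted first-order reduction in downstream loss as feedback to guide self-supervised updates, improving target capabilities without hurting generalization.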

详情

AI中文摘要

持续预训练通过固定的自监督任务进行优化，但根据下游性能选择检查点，形成了一个粗粒度的反馈循环：实践者评估检查点、改变数据混合或目标、重新开始运行，而单个更新仍然对目标能力视而不见。我们询问是否一小部分可验证的下游示例可以在不直接监督学习器的情况下提供步骤级反馈。我们引入了V-pretraining，它将仅使用自监督损失训练的学习器与一个轻量级任务设计器解耦，该设计器为无标签批次构建目标或视图。给定当前学习器和批次，V-pretraining通过预测诱导的自监督更新后下游损失的一阶减少来评分候选构建。设计器最大化该值；然后学习器应用带有分离目标或视图的更新，因此下游标签永远不会更新学习器参数。我们将V-pretraining实例化为用于语言建模的自适应top-K软目标和用于自监督视觉的学习视图或掩码。在两种模态中，V-pretraining在不降低泛化的情况下提高了目标能力。在挂钟时间匹配的持续预训练下，它仅使用1,024个GSM8K示例作为反馈，提高了Qwen模型的GSM8K Pass@1，包括Qwen2.5-0.5B的单次运行+7.4点增益。在视觉方面，它改善了DINOv3向ADE20K语义分割和NYUv2深度估计的迁移，同时保持了ImageNet线性准确率，表明反馈引导的任务构建可以在不破坏通用表示的情况下提高目标能力。

英文摘要

Continued pretraining is optimized with fixed self-supervised tasks but selected by downstream performance, creating a coarse feedback loop in which practitioners evaluate checkpoints, change data mixtures or objectives, and restart runs, while individual updates remain blind to target capabilities. We ask whether a small set of verifiable downstream examples can provide step-level feedback without directly supervising the learner. We introduce V-pretraining, which decouples a learner trained only with a self-supervised loss from a lightweight task designer that constructs targets or views for unlabeled batches. Given the current learner and batch, V-pretraining scores a candidate construction by predicting the first-order reduction in downstream loss after the induced self-supervised update. The designer maximizes this value; the learner then applies the update with targets or views detached, so downstream labels never update learner parameters. We instantiate V-pretraining as adaptive top-K soft targets for language modeling and learned views or masks for self-supervised vision. Across both modalities, V-pretraining improves target capabilities without degrading generalization. Under wall-clock-matched continued pretraining, it improves GSM8K Pass@1 for Qwen models using 1,024 GSM8K examples only as feedback, including a +7.4 point single-run gain for Qwen2.5-0.5B. In vision, it improves DINOv3 transfer to ADE20K semantic segmentation and NYUv2 depth estimation while preserving ImageNet linear accuracy, suggesting that feedback-guided task construction can improve target capabilities without collapsing general-purpose representations.
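
The scoring rule in the abstract, "the first-order reduction in downstream loss after the induced self-supervised update", follows from a Taylor expansion: for a gradient step of size lr along a candidate's self-supervised gradient g, the downstream loss changes by roughly -lr times the inner product of the downstream gradient and g. The sketch below illustrates that arithmetic only; the quadratic downstream loss, the three candidate gradients, and all names are assumptions made for the example, not the paper's setup.

```python
def first_order_gain(grad_downstream, grad_candidate, lr):
    """Predicted drop in downstream loss after one SGD step along grad_candidate."""
    return lr * sum(gd * gc for gd, gc in zip(grad_downstream, grad_candidate))

def downstream_loss(theta):
    """Toy downstream loss: 0.5 * ||theta||^2, so its gradient is theta itself."""
    return 0.5 * sum(t * t for t in theta)

theta = [1.0, -2.0]
grad_down = list(theta)  # gradient of the toy loss at theta
lr = 0.1

# Hypothetical self-supervised gradients induced by three candidate constructions.
candidates = [[1.0, 0.0], [0.0, -1.0], [-1.0, 1.0]]
scores = [first_order_gain(grad_down, g, lr) for g in candidates]
best = max(range(len(candidates)), key=lambda i: scores[i])

# Compare the prediction with the actual loss drop for the chosen candidate.
new_theta = [t - lr * g for t, g in zip(theta, candidates[best])]
actual_drop = downstream_loss(theta) - downstream_loss(new_theta)
```

Only the designer sees the downstream gradient here. The learner's parameters move along the self-supervised gradient of the chosen candidate, which matches the abstract's point that downstream labels never update the learner directly.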


2602.03120 2026-06-15 cs.LG cs.AI updated version

Quantized Evolution Strategies: High-precision Fine-tuning of Quantized LLMs at Low-precision Cost

量化进化策略：以低精度代价实现量化大语言模型的高精度微调

Yinggan Xu, Kajetan Schweighofer, Risto Miikkulainen, Xin Qiu

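Affiliations * University of California, Los Angeles; Cognizant AI Lab; UT Austin

AI summary: Proposes Quantized Evolution Strategies (QES), which combines accumulated error feedback with stateless seed replay to perform full-parameter fine-tuning directly in the quantized space without backpropagation, significantly outperforming existing zeroth-order fine-tuning methods.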

Comments Added more tasks and baselines

详情

AI中文摘要

后训练量化（PTQ）对于在内存受限设备上部署大语言模型（LLM）至关重要，但它使模型变得静态且难以微调。标准的微调范式，包括强化学习（RL），从根本上依赖于反向传播和连续权重来计算梯度。因此，它们无法用于参数空间离散且不可微的量化模型。虽然进化策略（ES）提供了一种无需反向传播的替代方案，但由于梯度估计消失或不准确，量化参数的优化仍可能失败。本文介绍了量化进化策略（QES），一种直接在量化空间执行全参数微调的优化范式。QES基于两项创新：（1）它集成了累积误差反馈以保留高精度权重更新信号，（2）它利用无状态种子重放将内存使用降低到低精度推理水平。QES在各种任务上显著优于最先进的零阶微调方法，使得量化模型的直接微调成为可能。因此，它开辟了完全在量化空间中扩展LLM的可能性。源代码可在此https URL获取。

英文摘要

Post-Training Quantization (PTQ) is essential for deploying Large Language Models (LLMs) on memory-constrained devices, yet it renders models static and difficult to fine-tune. Standard fine-tuning paradigms, including Reinforcement Learning (RL), fundamentally rely on backpropagation and continuous weights to compute gradients. Thus, they cannot be used on quantized models, where the parameter space is discrete and non-differentiable. While Evolution Strategies (ES) offer a backpropagation-free alternative, optimization of the quantized parameters can still fail due to vanishing or inaccurate gradient estimation. This paper introduces Quantized Evolution Strategies (QES), an optimization paradigm that performs full-parameter fine-tuning directly in the quantized space. QES is based on two innovations: (1) it integrates accumulated error feedback to preserve high-precision weight updating signals, and (2) it utilizes a stateless seed replay to reduce memory usage to low-precision inference levels. QES significantly outperforms the state-of-the-art zeroth-order fine-tuning methods on a variety of tasks, making direct fine-tuning for quantized models possible. It therefore opens up the possibility for scaling up LLMs entirely in the quantized space. The source code is available at https://github.com/dibbla/Quantized-Evolution-Strategies.
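
The first innovation, accumulated error feedback, addresses a simple failure mode: an update smaller than half a quantization step rounds to zero every time, so the weight never moves. A minimal sketch of the general error-feedback idea follows. It is an assumption-laden illustration rather than the QES algorithm: weights are plain integers, updates are expressed in units of one quantization step, and the seed-replay mechanism that avoids storing the residual is not modeled.

```python
def error_feedback_step(q_weights, residual, update):
    """Apply a fractional update to integer weights, carrying rounding error forward."""
    new_weights, new_residual = [], []
    for w, r, u in zip(q_weights, residual, update):
        total = r + u                     # pending signal: carried error plus new update
        step = round(total)               # whole quantization steps we can apply now
        new_weights.append(w + step)
        new_residual.append(total - step)  # what is still owed to this weight
    return new_weights, new_residual

weights, residual = [3], [0.0]
naive = [3]
for _ in range(8):
    weights, residual = error_feedback_step(weights, residual, [0.25])
    naive = [w + round(0.25) for w in naive]  # rounding each update independently
```

After eight updates of 0.25 steps each, the error-feedback weight has moved by two full steps, while the naive version is unchanged because every individual update rounded to zero.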


2602.04879 2026-06-15 cs.LG cs.AI cs.CL updated version

Rethinking the Trust Region in LLM Reinforcement Learning

重新思考LLM强化学习中的信任区域

Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

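Affiliations * University of California, Berkeley; University of Toronto

AI summary: To address the training instability that PPO suffers in LLM fine-tuning because of large vocabularies, proposes DPPO, which constrains updates using a direct estimate of policy divergence, and introduces efficient approximations.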

详情

AI中文摘要

强化学习已成为微调大型语言模型（LLM）的基石，其中近端策略优化（PPO）是事实上的标准算法。尽管其普遍存在，我们认为PPO中的核心比率裁剪机制在结构上不适合LLM固有的大词表。PPO基于采样令牌的概率比率约束策略更新，该比率是对真实策略散度的有噪单样本蒙特卡洛估计。这导致次优的学习动态：低概率令牌的更新被过度惩罚，而高概率令牌中潜在的灾难性变化却约束不足，导致训练效率低下和不稳定。为解决此问题，我们提出散度近端策略优化（DPPO），用基于策略散度（如总变差或KL）直接估计的更原则性约束替代启发式裁剪。为避免巨大内存占用，我们引入了高效的二元和Top-K近似，以可忽略的开销捕获本质散度。大量实证评估表明，DPPO相比现有方法实现了更优的训练稳定性和效率，为基于RL的LLM微调提供了更稳健的基础。我们的代码可在https://github.com/sail-sg/Stable-RL获取。

英文摘要

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid a huge memory footprint, we introduce efficient Binary and Top-K approximations that capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning. Our code is available at https://github.com/sail-sg/Stable-RL.
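
The mismatch between ratio clipping and actual divergence can be seen with two numbers per token. The sketch below compares the PPO-style ratio test with a two-outcome total variation distance (sampled token versus everything else), which for two Bernoulli distributions reduces to the absolute difference of the token's probabilities. It is an illustration under assumptions: the threshold of 0.05 is arbitrary, the ratio check ignores the sign of the advantage, and the paper's Binary approximation may be defined differently.

```python
def ratio_outside_clip(p_old, p_new, eps=0.2):
    """True if the probability ratio leaves the interval [1 - eps, 1 + eps]."""
    ratio = p_new / p_old
    return ratio < 1 - eps or ratio > 1 + eps

def binary_tv(p_old, p_new):
    """Total variation between the two-outcome distributions {token, not token}."""
    return abs(p_new - p_old)

TV_LIMIT = 0.05  # hypothetical trust-region size

# A rare token whose probability doubles: large ratio, negligible divergence.
rare_clipped = ratio_outside_clip(0.001, 0.002)
rare_tv = binary_tv(0.001, 0.002)

# A dominant token that loses 0.15 of probability mass: modest ratio, large divergence.
common_clipped = ratio_outside_clip(0.90, 0.75)
common_tv = binary_tv(0.90, 0.75)
```

The rare token trips the ratio test even though the policy has barely moved, while the dominant token passes it despite a much larger shift in probability mass. A divergence-based constraint reverses both decisions.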


2602.14169 2026-06-15 cs.LG cs.AI cs.CL updated version

The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

FP4量化LLM训练中均值偏差的诅咒与祝福

Hengjie Cao, Zhendong Huang, Mengyi Chen, Yifeng Yang, Fang Dong, Anrui Chen, Ruijun Huang, Xin Zhang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Tun Lu, Fan Yang, Yixuan Chen, Li Shang

Affiliations * Fudan University; University of Bath; Shanghai Innovation Institute; University of Oxford; Oxford Suzhou Centre for Advanced Research; University of Colorado Boulder; University of Michigan; Shenzhen Loop Area Institute

AI summary: Finds that FP4 training failures stem from activation outliers dominated by a rank-one mean bias, and proposes Averis, a mean-residual splitting quantization method that enables robust W4A4G4 training on Qwen3 models with a smaller loss gap than NVIDIA's Hadamard-based method.

详情

AI中文摘要

FP4训练有望为大型语言模型节省大量内存和计算，但由于分块量化受极端激活幅度支配，导致动态范围膨胀并压缩长尾信号，因此仍然脆弱。我们发现了这一失败的一个反直觉来源：主导激活异常值不仅仅是任意的稀疏事件，而主要是由一致的秩一均值偏差引起的，其方向与主导各向异性谱分量对齐。该均值分量在训练过程中增强，被注意力和FFN算子放大和重塑，并日益主导顶部激活幅度。至关重要的是，这一发现揭示了一个看似复杂的异常值抑制问题实际上有一个非常简单的解决方案：在量化之前隔离一致的均值。因此，我们提出了Averis，一种均值残差分割量化方法，该方法在FP4量化之前仅使用归约和逐元素减法来分离均值分量。在100B token上训练的Qwen3 0.6B密集模型和50B token上训练的Qwen3 7B A1.5B MoE模型上，Averis实现了鲁棒的W4A4G4 FP4训练，将BF16损失差距降低至1.19%/0.81%，而NVIDIA最近发布的基于Hadamard的异常值平滑方法为2.05%/1.10%，同时将下游差距限制在0.89/0.71点。Averis在vanilla NVFP4上的端到端开销仅为2.20%，约为NVIDIA基于Hadamard设计的30%，为稳定的低位LLM训练提供了一条硬件高效的路径。与Hadamard互补，Averis在结合使用时进一步将Qwen3-0.6B的损失和下游差距降低至0.94%和0.73点。代码可在以下网址获取：this https URL。

英文摘要

FP4 training promises substantial memory and compute savings for large language models, but remains fragile because blockwise quantization is dictated by extreme activation magnitudes, which inflate dynamic range and compress long-tail signals. We identify a counterintuitive source of this failure: dominant activation outliers are not merely arbitrary sparse events, but are largely induced by a coherent rank-one mean bias, whose direction aligns with the leading anisotropic spectral component. This mean component strengthens during training, is amplified and reshaped by attention and FFN operators, and increasingly dominates top activation magnitudes. Crucially, this discovery reveals that a seemingly complex outlier-suppression problem admits a truly simple solution: isolate the coherent mean before quantization. We therefore propose Averis, a mean-residual splitting quantization method that separates the mean component using only reductions and elementwise subtractions before FP4 quantization. Across Qwen3 0.6B Dense trained on 100B tokens and Qwen3 7B A1.5B MoE trained on 50B tokens, Averis enables robust W4A4G4 FP4 training, reducing BF16 loss gaps to 1.19%/0.81% versus 2.05%/1.10% for NVIDIA's recently released Hadamard-based outlier-smoothing method, while limiting downstream gaps to 0.89/0.71 points. With only 2.20% end-to-end overhead over vanilla NVFP4, about 30% of NVIDIA's Hadamard-based design, Averis provides a hardware-efficient path to stable low-bit LLM training. Complementary to Hadamard, Averis further reduces the Qwen3-0.6B loss and downstream gaps to 0.94% and 0.73 points when combined. Code is available at: https://anonymous.4open.science/r/averis-504D.
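
The effect of isolating the mean before quantization can be reproduced on a four-by-four toy matrix. In the sketch below, the per-channel mean across tokens is subtracted, the residual is quantized block by block, and the mean is added back. Several parts are assumptions made for illustration: a symmetric absmax integer grid stands in for FP4, each row is treated as one quantization block, the mean is kept in full precision, and the activation values are invented. It is not the Averis implementation.

```python
def quantize_block(block, levels=7):
    """Symmetric absmax quantizer with 2 * levels + 1 grid points (FP4 stand-in)."""
    scale = max(abs(x) for x in block) / levels
    if scale == 0.0:
        return list(block)
    return [round(x / scale) * scale for x in block]

def squared_error(a, b):
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def quantize_direct(rows):
    """Quantize each row (block) as is."""
    return [quantize_block(r) for r in rows]

def quantize_mean_residual(rows):
    """Subtract the per-channel mean, quantize the residual, add the mean back."""
    n = len(rows)
    mean = [sum(r[c] for r in rows) / n for c in range(len(rows[0]))]
    residual = [[x - m for x, m in zip(r, mean)] for r in rows]
    q_res = [quantize_block(r) for r in residual]
    return [[x + m for x, m in zip(r, mean)] for r in q_res]

# Toy activations (tokens x channels): channel 0 carries a large shared mean.
acts = [
    [50.3, 0.2, -0.1, 0.4],
    [49.8, -0.3, 0.1, -0.2],
    [50.1, 0.1, 0.3, -0.4],
    [49.8, 0.0, -0.3, 0.2],
]
err_direct = squared_error(acts, quantize_direct(acts))
err_split = squared_error(acts, quantize_mean_residual(acts))
```

In the direct path the outlier channel sets the block scale, so every small entry rounds to zero. After the mean is removed, the scale is set by the residual itself and the small entries survive quantization.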


2603.15481 2026-06-15 cs.LG cs.AI updated version

TabKD: Tabular Knowledge Distillation through Interaction Diversity of Learned Feature Bins

TabKD: 通过学习特征箱的交互多样性实现表格知识蒸馏

Shovon Niverd Pereira, Krishna Khadka, Yu Lei

发表机构 * Department of Computer Science and Engineering, The University of Texas at Arlington（德克萨斯大学阿灵顿分校计算机科学与工程系）

AI总结提出TabKD方法，通过学习与教师决策边界对齐的自适应特征箱，生成最大化成对交互覆盖的合成查询，在表格数据知识蒸馏中显著提升学生-教师一致性。

Comments Accepted in 35th International Joint Conference on Artificial Intelligence IJCAI 2026

AI中文摘要

无数据知识蒸馏可以在没有原始训练数据的情况下实现模型压缩，这对于隐私敏感的表格领域至关重要。然而，现有方法在表格数据上表现不佳，因为它们没有明确处理特征交互，而特征交互是表格模型编码预测知识的基本方式。我们识别出交互多样性，即特征组合的系统覆盖，是有效表格蒸馏的基本要求。为了实施这一见解，我们提出了TabKD，它学习与教师决策边界对齐的自适应特征箱，然后生成最大化成对交互覆盖的合成查询。在4个基准数据集和4种教师架构上，TabKD在16个配置中的14个中实现了最高的学生-教师一致性，优于5个最先进的基线。我们进一步表明，交互覆盖与蒸馏质量强相关，验证了我们的核心假设。我们的工作建立了以交互为中心的探索作为表格模型提取的原则性框架。

英文摘要

Data-free knowledge distillation enables model compression without original training data, which is critical for privacy-sensitive tabular domains. However, existing methods do not perform well on tabular data because they do not explicitly address feature interactions, the fundamental way tabular models encode predictive knowledge. We identify interaction diversity, the systematic coverage of feature combinations, as an essential requirement for effective tabular distillation. To operationalize this insight, we propose TabKD, which learns adaptive feature bins aligned with teacher decision boundaries, then generates synthetic queries that maximize pairwise interaction coverage. Across 4 benchmark datasets and 4 teacher architectures, TabKD achieves the highest student-teacher agreement in 14 out of 16 configurations, outperforming 5 state-of-the-art baselines. We further show that interaction coverage strongly correlates with distillation quality, validating our core hypothesis. Our work establishes interaction-focused exploration as a principled framework for tabular model extraction.

2604.09737 2026-06-15 cs.LG cs.AI 版本更新

STaR-DRO: Stateful Tsallis Reweighting for Group-Robust Structured Prediction

STaR-DRO: 面向群体鲁棒结构化预测的状态化Tsallis重加权

Samah Fodeh, Ganesh Puthiaraju, Elyas Irankhah, Afshan Khan, Sreeraj Ramachandran, Linhai Ma, Srivani Talakokkul, Sarah Schellhorn

发表机构 * Yale University（耶鲁大学）； Yale School of Medicine（耶鲁医学院）

AI总结提出STaR-DRO框架，结合Tsallis镜像上升和稀疏entmax映射，仅提高持续困难群体的权重，在结构化预测中提升标签准确性和鲁棒性；在EPPC Miner任务上，平均标签F1相比SFT和标准DRO分别提升1.08和2.20。

AI中文摘要

使用大型语言模型进行结构化预测需要输出在标签不平衡和异质群体难度下具有标签准确性、本体约束、结构有效性和证据基础。我们提出了一个统一框架用于本体约束生成。首先，我们引入了一个模块化的提示工程架构，结合了XML风格结构、专家消歧规则、思维链推理、元数据感知决策逻辑、模式契约和自我验证门。它针对反复出现的上下文失败，包括格式漂移、标签歧义、证据幻觉和元数据条件混淆。其次，我们提出了STaR-DRO，结合了Tsallis镜像上升、稀疏entmax风格原始映射、EMA平滑群体损失跟踪、重新缩放上升信号和有界超额乘数。与依赖密集香农熵指数梯度更新、可能引入高方差随机重加权、将正对抗质量分配给非持续困难群体、并通过单纯形竞争产生成本的常规DRO不同，STaR-DRO仅对持续困难群体上权重，而不抑制较容易的群体。我们在EPPC Miner上评估该框架，这是一个临床基础的高风险结构化预测任务，需要从患者-提供者安全消息中进行层次标签预测和证据跨度提取。在1B-70B Llama模型上，提示工程改进了零样本提取，平均标签F1增益为+14.46，跨度F1增益为+17.40。在监督微调的基础上，STaR-DRO进一步提高了准确性和鲁棒性，平均标签F1分别提高了+1.08和+2.20，同时相对于SFT和标准DRO，平均群体验证交叉熵分别降低了21.3%和14.8%。这些结果推进了以患者为中心的临床护理分析的可靠自动化通信挖掘。

英文摘要

Structured prediction with large language models requires outputs that are label-accurate, ontology-constrained, structurally valid, and evidence-grounded under label imbalance and heterogeneous group difficulty. We present a unified framework for ontology-constrained generation. First, we introduce a modular prompt-engineering architecture combining XML-style structure, expert disambiguation rules, chain-of-thought reasoning, metadata-aware decision logic, schema contracts, and a self-validation gate. It targets recurrent in-context failures, including format drift, label ambiguity, evidence hallucination, and metadata-conditioned confusion. Second, we propose STaR-DRO, combining Tsallis mirror ascent, sparse entmax-style primal mapback, EMA-smoothed group-loss tracking, rescaled ascent signals, and bounded excess-only multipliers. Unlike conventional DRO, which relies on dense Shannon-entropy exponentiated-gradient updates, can introduce high-variance stochastic reweighting, assigns positive adversarial mass to groups that are not persistently hard, and incurs costs through simplex competition, STaR-DRO upweights only persistently hard groups without suppressing easier ones. We evaluate the framework on EPPC Miner, a clinically grounded high-stakes structured-prediction task requiring hierarchical label prediction and evidence-span extraction from patient-provider secure messages. Across 1B-70B Llama models, prompt engineering improves zero-shot extraction, yielding an average label F1 gain of +14.46 and a Span F1 gain of +17.40. Building on supervised fine-tuning, STaR-DRO further improves accuracy and robustness, increasing average label F1 by +1.08 and +2.20 while reducing mean groupwise validation cross-entropy by 21.3% and 14.8% relative to SFT and standard DRO, respectively. These results advance reliable automated communication mining for patient-centered clinical care analysis.
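The contrast the abstract draws between dense exponentiated-gradient weights and a sparse entmax-style mapping can be sketched with sparsemax, the alpha = 2 member of the entmax family. This is only an illustration of dense versus sparse group weights; it is not the STaR-DRO update, which also involves Tsallis mirror ascent, EMA-smoothed losses, and bounded multipliers. The group scores below are made-up values.

```python
import math

def softmax(z):
    """Dense mapping: every group receives strictly positive weight."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex
    (entmax with alpha = 2); low-scoring entries get exactly zero."""
    zs = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, v in enumerate(zs, start=1):
        cumsum += v
        if 1.0 + j * v <= cumsum:
            break  # entries from here on fall outside the support
        tau = (cumsum - 1.0) / j
    return [max(v - tau, 0.0) for v in z]

# Hypothetical per-group scores: two hard groups and one easy group.
scores = [1.0, 0.8, -1.0]
dense = softmax(scores)
sparse = sparsemax(scores)
print("softmax  :", [round(p, 3) for p in dense])
print("sparsemax:", [round(p, 3) for p in sparse])
```

Under softmax the easy group still receives some adversarial weight, whereas the sparse mapping assigns it none, which is the behavior the abstract attributes to a sparse primal map.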

2604.17892 2026-06-15 cs.LG cs.AI 版本更新

关系检索：利用已知-新颖相互作用进行通用类别发现

Yulin Xu, Chunqi Guo, Yuanzhen Shuai, Jianyuan Ni

发表机构 * University of California, Irvine（加州大学尔湾分校）； Sichuan Agricultural University（四川农业大学）； University College London（伦敦大学学院）； Juniata College（朱尼亚塔学院）

AI总结本文通过关系检索视角解决通用类别发现问题，提出关系模式一致性方法，通过双向知识转移增强已知类别和新类别发现，实验表明在通用和细粒度基准上均取得最佳性能。

Comments Accepted by ICMR 2026 (Oral)

DOI: 10.1145/3805622.3810732

AI中文摘要

在本研究中，我们从关系检索的视角解决通用类别发现（GCD）问题，通过双向知识转移显式耦合有标签和无标签数据。现有方法将这两类数据分开处理，错失了有价值的交互机会；为此我们提出关系模式一致性（RPC），使两者相互增强。RPC使用一对多（One-vs-All）分类器进行软ID/OOD分解，然后引入两种机制：（i）为保留已知类别知识，我们迁移语义行为对齐；（ii）为发现新类别，我们利用“同一类别的样本与已知类别原型保持不变的关系”这一洞察，将不可靠的伪标签转化为定义明确的关系模式匹配。这种双向设计使有标签数据指导无标签学习，同时通过样本的集体关系签名发现新类别。大量实验表明，RPC在通用和细粒度基准上均取得最先进的性能。

英文摘要

In this study, we tackle Generalized Category Discovery (GCD) via a Relational Retrieval perspective, explicitly coupling labeled and unlabeled data through bidirectional knowledge transfer. While existing methods treat these sources separately, missing valuable interaction opportunities, we propose Relational Pattern Consistency (RPC) that enables mutual enhancement. RPC employs One-vs-All classifiers for soft ID/OOD decomposition, then introduces two mechanisms: (i) for known-class preservation, we transfer semantic behavioral alignment; (ii) for category discovery, we leverage the insight that samples from the same category maintain invariant relationships with known-class prototypes, transforming unreliable pseudo-labeling into well-defined relational pattern matching. This bidirectional design allows labeled data to guide unlabeled learning while discovering novel categories through their collective relational signatures. Extensive experiments demonstrate RPC achieves state-of-the-art performance on both generic and fine-grained benchmarks.

2605.18848 2026-06-15 cs.LG cs.AI 版本更新

Exact Linear Attention

精确线性注意力

Weinuo Ou

发表机构 * GitHub

AI总结本文提出精确线性注意力（ELA），通过利用核函数的精确分解性质，实现Transformer注意力的线性计算复杂度，消除近似误差。针对先前线性注意力的两个关键限制（梯度爆炸和token注意力稀释），提出核约束以确保非负性、判别性和几何可解释性。此外，本文还提出了三种工程创新，包括Hyper-Link结构、Memory Lobe模块和基于路由分数的MoE偏置机制。实验结果表明，与全注意力相比，ELA的解码速度最高提升6倍，KV缓存内存占用减少75%，同时训练性能相当或更优。

Comments 9 pages, 19 figures, journal

AI中文摘要

本文介绍精确线性注意力（ELA），一种通过利用核函数的精确分解性质，实现Transformer注意力线性计算复杂度的机制，从而消除近似误差。我们识别并解决了先前线性注意力的两个关键限制（梯度爆炸和token注意力稀释），方法是施加核约束，确保非负性、判别性和几何可解释性。提出了几种核函数，包括Hadamard Exp核、求和平方欧几里得距离核和减法平方欧几里得距离核，每种都针对特定的注意力行为进行了优化。除了核心注意力公式之外，本文还提出了三种工程创新：（1）Hyper-Link结构，用以替代传统残差连接以缓解梯度退化；（2）基于双向线性注意力的Memory Lobe模块，捕捉跨层的“转换流”以实现定性记忆和隐式强化学习范式；（3）基于路由分数的MoE偏置机制，以提高可解释性和语义对齐。实验结果表明，与全注意力相比，ELA的解码速度最高提升6倍，KV缓存内存占用减少75%，同时训练性能相当或更优。所提出的记忆模块加速了收敛并增强了泛化能力。此外，我们还将线性注意力原理扩展到视觉模型，得到YOLO-LAT，其GPU推理最高加速4.3倍，参数量最多减少7.9倍，同时保持有竞争力的检测精度。这些结果表明，精确线性注意力在扩展Transformer模型以处理超长序列和高效视觉任务方面具有广泛的应用前景。

英文摘要

This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by exploiting the exact decomposition property of kernel functions, thereby eliminating approximation error. We identify and address two key limitations of prior linear attention -- gradient explosion and token attention dilution -- by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. Several kernel functions are proposed, including the Hadamard Exp Kernel, Summation Squared Euclidean Distance Kernel, and Subtraction Squared Euclidean Distance Kernel, each tailored for specific attention behaviors. Beyond the core attention formulation, the paper presents three engineering innovations: (1) a Hyper-Link structure that replaces traditional residual connections to mitigate gradient degradation; (2) a Memory Lobe module based on bidirectional linear attention, which captures "transformation flow" across layers to implement qualitative memory and an implicit reinforcement learning paradigm; and (3) a routing-score-based bias mechanism for Mixture-of-Experts (MoE) to improve interpretability and semantic alignment. Experimental results demonstrate that ELA achieves up to 6x faster decoding speed and 75% reduction in KV cache memory usage compared to full attention, while maintaining comparable or superior training performance. The proposed memory module accelerates convergence and enhances generalization. Furthermore, we extend the linear attention principle to vision models, yielding YOLO-LAT, which attains up to 4.3x GPU inference speedup and 7.9x parameter reduction with competitive detection accuracy. These results underline the broad applicability of exact linear attention for scaling Transformer models to ultra-long sequences and efficient visual tasks.
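The general identity behind kernelized linear attention can be sketched as follows: when the similarity is defined as an inner product of feature maps, the key-value summary can be computed once and reused for every query, and the result matches the quadratic form exactly. The feature map `phi` below is a hypothetical non-negative map chosen for illustration; it is not claimed to be any of the kernels proposed in the paper, and the sketch is non-causal.

```python
import math

def phi(x):
    # Hypothetical non-negative feature map (elementwise exp); an assumption.
    return [math.exp(v) for v in x]

def quadratic_attention(Q, K, V):
    """O(n^2): form every query-key similarity explicitly."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        z = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """O(n): accumulate sum_j phi(k_j) v_j^T and sum_j phi(k_j) once."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    zsum = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d):
            zsum[i] += fk[i]
            for c in range(dv):
                S[i][c] += fk[i] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[i] * zsum[i] for i in range(d))
        out.append([sum(fq[i] * S[i][c] for i in range(d)) / denom
                    for c in range(dv)])
    return out

Q = [[0.1, -0.2], [0.5, 0.3]]
K = [[0.0, 0.4], [-0.3, 0.2], [0.6, -0.1]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
a = quadratic_attention(Q, K, V)
b = linear_attention(Q, K, V)
print(a)
print(b)
```

The two functions agree up to floating-point rounding because only the order of summation changes, so no approximation of the kernel is involved.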

2606.13054 2026-06-15 cs.LG cs.AI 版本更新

TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

TWLA：通过训练后量化实现大语言模型的三值权重和低位激活

Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Xing Hu, Zhe Jiang, Dawei Yang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出TWLA框架，通过后训练量化实现1.58位权重和4位激活，解决激活分布长尾问题，加速推理。

Comments Accepted by ICML 2026

AI中文摘要

大型语言模型（LLMs）展现出卓越的通用语言处理能力，但其内存和计算成本阻碍了部署。三值化已成为一种有前景的压缩技术，可显著降低模型大小和推理复杂度。然而，现有方法难以处理重尾激活分布，因此将激活保持在高精度，从根本上限制了端到端推理加速。为克服这一限制，我们提出TWLA，一种后训练量化（PTQ）框架，在保持高精度的同时实现1.58位权重压缩和4位激活量化。TWLA包含三个组件：（1）欧几里得到流形非对称三值量化器（E2M-ATQ），通过从欧几里得初始化到流形重定位的两阶段优化，最小化权重三值化下的层输出误差；（2）Kronecker正交三模态整形（KOTMS），应用Kronecker结构正交旋转将权重重塑为三值友好的三模态分布，同时共享旋转统计上抑制激活异常值；（3）层间感知激活混合精度（ILA-AMP），在位分配中显式引入相邻层二阶交互成本，并联合优化由共享正交变换引起的激活量化增益的层间差异，防止少数弱层触发级联效应。大量实验表明，TWLA在W1.58A4下保持高精度，同时实现显著的推理加速。代码见 https://github.com/Kishon-zzx/TWLA

英文摘要

Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at https://github.com/Kishon-zzx/TWLA.

2606.13119 2026-06-15 cs.LG cs.AI cs.NE 版本更新

MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

MP3：面向时空预测的多周期模式预训练

Lilan Peng, Yandi Liu, Qingren Yao, Chongshou Li, Tianrui Li

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University（西南交通大学计算机与人工智能学院）； Eindhoven University of Technology（埃因霍温理工大学）

AI总结针对时空数据中短窗口输入导致的时间幻象问题，提出多周期模式预训练插件MP3，通过多周期时间建模、空间建模和跨周期因果交互，提升现有STGNN的预测性能。

AI中文摘要

时空预测在交通、气候和能源等多个领域至关重要。城市时空数据表现出时间幻象：相似的短窗口输入具有不同的未来趋势，反之亦然。现有的时空图神经网络（STGNN）无法有效识别此类幻象。我们认为核心原因在于短窗口输入具有不完整的周期观测、异质的全局空间相关性和跨周期叠加因果性。为弥补这一差距，我们开发了一种新颖的多周期模式预训练（MP3），这是一种用于区分时间幻象的即插即用预训练插件。MP3提出了两项核心创新：（1）多周期模式学习旨在从长时间序列中学习多周期模式。具体地，多周期时间建模利用边卷积来识别不同的多周期模式。多周期空间建模使用瓶颈投影和全局记忆库来高效捕获异质的全局空间关系。跨周期模式交互采用因果增强的Transformer来捕获不同周期模式之间的依赖关系。（2）该插件可以无缝集成到现有的STGNN骨干中，以增强其预测性能。在五个真实世界数据集（包括大规模数据集CA）上的五个STGNN基线实验验证了MP3的有效性、优越的可扩展性和强适应性，其在所有评估基线上带来了一致且稳健的性能提升。平均而言，MP3将MAE降低了4.7%，RMSE降低了5.0%。代码可在以下网址获取：https://github.com/YAN-outlook/MP3

英文摘要

Spatio-temporal forecasting is crucial in diverse fields such as transportation, climate, and energy. Urban spatio-temporal data exhibits temporal mirages: similar short-window inputs have divergent future trends, and vice versa. Existing spatio-temporal graph neural networks (STGNNs) cannot effectively identify such mirages. We argue that the core reason is that short-window inputs suffer from incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. To bridge this gap, we develop Multi-Period Pattern Pre-training (MP3), a novel plug-and-play pre-training plugin for distinguishing temporal mirages. MP3 presents two core innovations: (1) Multi-period pattern learning is designed to learn multi-period patterns from long time series. Specifically, multi-period temporal modeling leverages edge convolution to identify different multi-period patterns. Multi-period spatial modeling uses a bottleneck projection and a global memory bank to capture heterogeneous global spatial relations efficiently. Cross-period pattern interaction employs a causality-enhanced Transformer to capture dependencies across different period patterns. (2) The plugin integrates seamlessly into existing STGNN backbones to strengthen their forecasting performance. Experiments on five STGNN baselines across five real-world datasets (including the large-scale dataset CA) verify the effectiveness, superior scalability, and strong adaptability of MP3, which brings consistent and robust performance improvements across all evaluated baselines. On average, MP3 reduces MAE by 4.7% and RMSE by 5.0%. The code is available at https://github.com/YAN-outlook/MP3.

2606.14176 2026-06-15 cs.AI 新提交

VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification

VeriGeo: 可控几何问题生成与数值和分析验证

Xiaoxian Duan, Zequn Liu, Yingce Xia

发表机构 * Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； Zhongguancun Academy（中关村学院）

AI总结提出VeriGeo框架，通过可执行推理轨迹和三级验证流水线，实现用户约束下的可控几何问题生成，并利用验证引导的反思修复无效生成，提升数据可靠性。

Comments 32 pages, 4 figures, 9 tables

AI中文摘要

几何问题生成对AI辅助教育和多模态数学推理有用，但可靠合成仍然困难，因为问题陈述、图表、约束和解决方案应相互一致。现有方法常在可控性和可靠性之间权衡：基于种子的改写灵活但可验证性弱，而图表优先的构建提高了有效性但不太适合任意用户指定的约束。我们引入VeriGeo，一个基于可执行推理轨迹的可控几何生成框架。给定用户约束（如目标概念和难度），Author代理生成问题和图表，Solver代理产生与证明对齐的解决方案。两个代理使用共享的动作序列，将自然语言、图表、几何约束和证明步骤连接成可验证的表示。三级流水线检查数值一致性、分析可实现性和全局一致性，使用验证引导的反思来修复可恢复的失败并拒绝不可恢复的失败。在五个LLM骨干上，原始生成经常无法通过这些检查，而VeriGeo修复了相当一部分无效尝试。在VeriGeo生成的8.7k示例上进行监督微调，在端到端多模态LLM求解器中实现了GeoQA最佳报告性能，并在PGPS9K和MathVista-GPS上取得了强劲结果，证明了验证合成数据对改进多模态几何推理的有效性。

英文摘要

Geometry problem generation is useful for AI-assisted education and multimodal mathematical reasoning, but reliable synthesis remains difficult because the problem statement, diagram, constraints, and solution should be mutually consistent. Existing methods often trade off controllability and reliability: seed-based rewriting is flexible but weakly verifiable, whereas diagram-first construction improves validity but is less suited to arbitrary user-specified constraints. We introduce VeriGeo, a controllable geometry generation framework grounded in executable reasoning traces. Given user constraints such as target concepts and difficulty, an Author agent generates a problem and diagram, and a Solver agent produces a proof-aligned solution. Both agents use a shared action sequence that connects natural language, diagrams, geometric constraints, and proof steps into a verifiable representation. A three-stage pipeline checks numerical consistency, analytical realizability, and global consistency, using verification-guided reflection to repair recoverable failures and reject unrecoverable ones. Across five LLM backbones, raw generations frequently fail these checks, while VeriGeo repairs a substantial fraction of the invalid attempts. Supervised fine-tuning on 8.7k examples generated by VeriGeo achieves the best reported GeoQA performance among end-to-end multimodal LLM-based solvers, and obtains strong results on PGPS9K and MathVista-GPS, demonstrating the effectiveness of verified synthetic data for improving multimodal geometry reasoning.

2606.14507 2026-06-15 cs.AI 新提交

Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models

密集坐标列表微调在视觉语言模型中诱导可控干扰面

Chenyu Zhou, Qiliang Jiang, Boguang Pan

发表机构 * School of Engineering, Institute of Science Tokyo（东京科学大学工学院）； College of Control Science and Engineering, Zhejiang University（浙江大学控制科学与工程学院）； Graduate School of Information, Production and Systems, Waseda University（早稻田大学信息生产系统研究生院）

AI总结研究密集坐标列表微调对视觉语言模型结构化输出（如重复、终止）的影响，发现其产生结构绑定且跨家族的干扰面，可通过目标信号分离和结构轴探针进行测量与控制。

AI中文摘要

微调视觉语言模型以输出密集坐标列表可改善视觉定位，但也会改变模型序列化、重复和终止结构化输出的方式。我们将此行为视为一个生成与控制面进行研究。在Gemma 4 12B中，高容量q/k/v/o LoRA将类别感知F1@0.3从0.007提升至0.448，同时诱导重复尾部压力（重复率0.080，最大重复23）。q/v秩扫描在秩4-64范围内保持最大重复为21-22，显示出容量持久性。目标信号是可分离的：对象级重复停止移除了精确重复记录（重复率0.000，最大重复1），同时保持F1（0.494至0.490）和更严格的F1@0.5（0.381至0.385）。结构轴探针将效应定位到边界框坐标对象列表；密集非边界框和空间/计数JSON保持无重复，包括在高容量适配器下。Qwen3-VL-8B复现了干净的控制端点（F1@0.3 0.318，重复率0.000），COCO 2017复现了获取和重复压力。因此，密集坐标列表适应创建了一个结构绑定、跨家族的干扰面，该干扰面可被测量和控制。

英文摘要

Fine-tuning vision-language models to emit dense coordinate lists improves visual grounding but also changes how models serialize, repeat, and terminate structured outputs. We study this behavior as a generation and control surface. In Gemma 4 12B, high-capacity q/k/v/o LoRA raises class-aware F1@0.3 from 0.007 to 0.448 while inducing repeated-tail pressure (duplicate rate 0.080, max repeat 23). A q/v rank sweep keeps max repeat at 21-22 across ranks 4-64, showing capacity persistence. The target signal is separable: object-level repeat-stop removes exact repeated records (duplicate rate 0.000, max repeat 1) while preserving F1 (0.494 to 0.490) and stricter F1@0.5 (0.381 to 0.385). Structure-axis probes localize the effect to bbox-coordinate object lists; dense non-bbox and spatial/count JSON remain repeat-clean, including under high-capacity adapters. Qwen3-VL-8B reproduces a clean controlled endpoint (F1@0.3 0.318, duplicate rate 0.000), and COCO 2017 reproduces acquisition plus duplicate pressure. Dense coordinate-list adaptation therefore creates a structure-bound, cross-family interference surface that can be measured and controlled.
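The abstract does not define its duplicate metrics or the repeat-stop rule, so the following is a minimal sketch under stated assumptions: "duplicate rate" is taken as the fraction of records that exactly repeat an earlier record, "max repeat" as the largest count of any identical record, and "repeat-stop" as truncating the object list at the first exact repeat. All names and the sample records are hypothetical.

```python
from collections import Counter

def duplicate_rate(records):
    """Fraction of records that exactly repeat an earlier record (assumed definition)."""
    if not records:
        return 0.0
    return (len(records) - len(set(records))) / len(records)

def max_repeat(records):
    """Largest number of times any single record appears (assumed definition)."""
    return max(Counter(records).values(), default=0)

def repeat_stop(records):
    """Stop emitting objects at the first record that exactly repeats an earlier one."""
    seen = set()
    kept = []
    for rec in records:
        if rec in seen:
            break
        seen.add(rec)
        kept.append(rec)
    return kept

# Hypothetical decoded output: (label, x1, y1, x2, y2) with a repeated tail.
raw = [("cat", 10, 20, 50, 60), ("dog", 5, 5, 30, 40)] + [("dog", 70, 80, 90, 95)] * 4
controlled = repeat_stop(raw)

print(duplicate_rate(raw), max_repeat(raw))
print(duplicate_rate(controlled), max_repeat(controlled))
```

Because the rule acts on whole object records and not on individual tokens, distinct boxes that share a label are kept, and only an exact repeated record ends the list.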

2606.14579 2026-06-15 cs.AI 新提交

VISTA: View-Consistent Self-Verified Training for GUI Grounding

VISTA: 视图一致的自验证训练用于GUI定位

Xinyu Qiu, Yunzhu Zhang, Heng Jia, Shuheng Shen, Changhua Meng, Linchao Zhu

发表机构 * Zhejiang University（浙江大学）； Venus Team, Ant Group（蚂蚁集团金星团队）

AI总结提出VISTA框架，通过多视图分组和自验证锚点改进GRPO训练，在GUI定位任务中显著提升准确率。

AI中文摘要

当将组相对策略优化（GRPO）应用于GUI定位时，rollout从单个截图视图中采样；组在困难实例上往往全部失败，在简单实例上全部成功，无法产生有用的相对优势。我们提出VISTA（视图一致的自验证训练），一种基于GRPO的训练框架，通过从同一GUI页面的多个目标保持视图中构建每个比较组。每个视图通过裁剪生成，保持目标元素可见并精确重新映射其边界框，因此模型rollout在语义等价但几何不同的输入之间进行比较。为了稳定短坐标生成而不将强化学习转变为无条件模仿，VISTA进一步添加了一个自验证的跨视图锚点：一个使用优势加权损失优化的oracle答案，从组基线中排除，仅在模型产生最大奖励rollout时激活。在五个GUI定位基准和多个Qwen骨干网络上，VISTA一致提高了定位准确率。在ScreenSpot-Pro上，它将Qwen3-VL 4B/8B/30B-A3B从55.5/52.7/53.7提升到63.4/65.8/67.0。鲁棒性分析进一步显示了更高的最差视图准确率和更低的预测翻转率。

英文摘要

When applying Group Relative Policy Optimization (GRPO) for GUI Grounding, rollouts are sampled from a single screenshot view; groups often become either all failures on difficult instances or all successes on easy ones, yielding no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI instance. Each view is generated by a crop that keeps the target element visible and remaps its box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer optimized with an advantage-weighted loss, excluded from the group baseline and activated only when the model has produced a maximum-reward rollout. Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding accuracy. On ScreenSpot-Pro, it raises Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
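The degenerate-group problem described in the abstract follows directly from the usual group-relative normalization, sketched below. The normalization shown (reward minus group mean, divided by group standard deviation plus a small epsilon) is the commonly used GRPO form; the reward values and the idea of pooling rollouts from several views into one group are illustrative assumptions, not VISTA's exact procedure.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Single-view groups: all rollouts fail (hard instance) or all succeed (easy one).
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Hypothetical group pooled across several crops of the same page,
# where some views are answered correctly and others are not.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

print(all_fail, all_pass, mixed)
```

When every reward in a group is identical, each advantage is zero and the group contributes no learning signal; a group with mixed outcomes yields non-zero advantages of both signs.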

2606.14654 2026-06-15 cs.AI cs.CL cs.LG 新提交

Abstracting Cross-Domain Action Sequences into Interpretable Workflows

将跨领域动作序列抽象为可解释的工作流

Gaurav Verma, Scott Counts

发表机构 * Microsoft Corporation（微软公司）

AI总结提出WorkflowView框架，利用大语言模型将低层动作序列抽象为高层活动，在三个不同任务中验证了有效性和泛化能力，实现高语义相似度和预测性能。

Comments preprint; 9 pages, 5 figures

AI中文摘要

序列或时间戳交互日志提供了数字应用使用的客观记录，但其粒度和噪声常常掩盖了关于人们工作的有意义见解。这些见解对于以真实用户交互为基础改进数字产品至关重要。先前的研究应用深度学习模型将用户动作聚类为高层活动，但这些方法对噪声高度敏感且难以跨应用泛化。为解决这一局限，我们引入了WorkflowView，一个使用大语言模型（LLMs）将低层动作序列抽象为高层活动的框架。我们在三个不同且具有挑战性的序列任务和多样化领域中建立了该方法的有效性和泛化性：（a）从浏览器日志中进行零样本任务描述重构（实现高语义相似度，$\mu_{sim} = 0.91$），（b）使用MOOC交互日志进行少样本学生退学预测（仅用五个少样本示例达到加权$F_1 = 0.90$），以及（c）对Microsoft Word中文档工作流中AI工具集成进行匿名化、隐私保护分析。我们的工作表明，基于LLM的抽象是将低层行为数据转化为高层、可解释且可操作见解的稳健高效途径。我们还讨论了在日志基础设施中部署基于LLM的推理时的实际考虑，包括计算效率和用户隐私。

英文摘要

Sequential or time-stamped interaction logs provide objective records of digital application usage, yet their granularity and noise often obscure meaningful insights into people's work. Such insights are essential for improving digital products in ways grounded in real-world user interactions. Prior research has applied deep learning models to cluster user actions into high-level activities, but these approaches are highly sensitive to noise and struggle to generalize across applications. To address this limitation, we introduce WorkflowView, a framework that uses large language models (LLMs) to abstract low-level action sequences into high-level activities. We establish the effectiveness and generality of our approach across three distinct, challenging sequential tasks and diverse domains: (a) zero-shot task description reconstruction from browser logs (achieving high semantic similarity, $\mu_{sim} = 0.91$), (b) few-shot student dropout prediction using MOOC interaction logs (reaching weighted $F_1 = 0.90$ with only five few-shot examples), and (c) anonymized, privacy-preserving analysis of AI tool integration within document workflows in Microsoft Word. Our work demonstrates that LLM-based abstraction is a robust and efficient path forward for transforming low-level behavioral data into high-level, interpretable, and actionable insights. We also discuss practical considerations for deploying LLM-based inferences within logging infrastructures, including computational efficiency and user privacy.

2606.13811 2026-06-15 quant-ph cs.AI 交叉投稿

Aligning Quantum Operators with Large Language Models

对齐量子算子与大型语言模型

Rogerio Feris, Yunchao Liu, Pengyuan Li, Hang Hua, David Kremer

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出将酉算子映射到LLM潜空间的方法，实现量子与语言输入的联合建模，在Clifford+T电路合成任务上取得与最先进方法竞争的结果，并支持语言条件合成。

AI中文摘要

大型语言模型（LLM）能否理解和推理量子算子？尽管LLM在数学和符号推理方面表现出色，但它们本质上对诸如酉矩阵等量子表示视而不见。在这项工作中，我们通过引入一种将酉算子映射到LLM潜空间的方法，向弥合这一差距迈出了一步，从而实现了对量子输入和语言输入的联合建模。我们在Pauli旋转门集上的Clifford+T电路合成中实例化了这一想法，其中我们的模型取得了与最先进方法竞争的结果，并且随着训练数据的增加而一致地扩展，没有出现饱和迹象。我们的方法进一步支持语言条件合成，允许在训练期间未见过的门约束直接用自然语言指定。这项工作表明了一条通往量子感知基础模型的道路，该模型能够原生地解释和推理量子操作，这可能对量子编译和算法发现产生更广泛的影响。

英文摘要

Can Large Language Models (LLMs) understand and reason about quantum operators? Despite their remarkable capabilities in mathematics and symbolic reasoning, LLMs remain inherently blind to quantum representations such as unitary matrices. In this work, we take a step toward bridging this gap by introducing an approach that maps unitary operators into the latent space of an LLM, enabling unified modeling over quantum and linguistic inputs. We instantiate this idea on Clifford+T circuit synthesis over a Pauli rotation gate set, where our model achieves results competitive with state-of-the-art methods and scales consistently with training data, with no signs of saturation. Our approach further enables language-conditioned synthesis, allowing gate constraints unseen during training to be specified directly in natural language. This work suggests a path toward quantum-aware foundation models that can natively interpret and reason about quantum operations, which could have broader implications reaching across quantum compilation and algorithm discovery.

2606.13898 2026-06-15 cs.CV cs.AI 交叉投稿

HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

HiLo-Token: 输入自适应的高低频令牌压缩用于高效图像编辑

Haoran You, Yotam Nitzan, Lingzhi Zhang, Yifan Gong, Mang-Tik Chiu, Connelly Barnes, Yan Kang, Yuqian Zhou, Eli Shechtman, Sohrab Amirghodsi

发表机构 * Adobe ART AI Lab（Adobe ART AI实验室）； Adobe Research（Adobe研究院）

AI总结针对扩散变换器（DiT）在图像编辑中延迟高的问题，提出输入自适应的令牌压缩框架HiLo-Token，根据空间频率分配令牌预算，在保持生成质量的同时实现高达3.13倍加速。

Comments 14 pages, 10 figures, Patent filed

AI中文摘要

创意图像编辑工具，如Photoshop的移除或生成填充按钮，是日常客户使用的核心，并占Photoshop和Lightroom流量的主要部分。然而，当前的生成式AI模型面临显著的延迟挑战，当从基于卷积的U-Net过渡到扩散变换器（DiT）时，这一问题变得更加突出。在我们对数百个代表性图像编辑样本（涵盖广泛的掩码比例）的评估中，即使将DiT模块从50个时间步蒸馏到8个时间步，它单独就占总模型延迟的平均73%。为了应对这一挑战，我们提出了$\textbf{HiLo-Token}$，一个输入自适应的令牌压缩框架，该框架将更多令牌预算分配给高频、丰富上下文的区域，同时将更少令牌分配给低频区域。具体来说，对于用户掩码指定的编辑区域，我们保留膨胀掩码内的所有令牌，以保持强局部性和上下文相关性。在编辑区域之外，我们引入了一种简单而有效的基于空间频率的高频令牌选择策略，以捕获重要的局部细节，同时使用来自16倍下采样图像的令牌来表示低频分量，并保留模糊但全局的结构。在生产级评估数据上的大量实验验证了所提方法的有效性，在A100-80GB上，对于小、中、大掩码比例类别（平均比例分别为6.38%、15.92%和35.36%），图像编辑任务分别实现了3.13倍、2.59倍和1.67倍的DiT加速，且生成质量无任何退化。

英文摘要

Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose $\textbf{HiLo-Token}$, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.

2606.13989 2026-06-15 cs.SD cs.AI 交叉投稿

动态声源的时空音频语言建模

Oh Hyun-Bin, Kazuki Shimada, Yuhta Takida, Kim Sung-Bin, Toshimitsu Uesaka, Takashi Shibuya, Kyeongyoon Lee, Tae-Hyun Oh, Yuki Mitsufuji

发表机构 * POSTECH（浦项科技大学）； Sony AI（索尼AI）； Sony Group Corporation（索尼集团）； Sungkyunkwan University（成均馆大学）； KAIST（韩国科学技术院）

AI总结提出ST-AudioLM模型，通过时空音频编码器联合学习事件语义与源轨迹，在ST-AudioQA基准上提升动态声源问答的语义-定位权衡。

AI中文摘要

声音事件是具有语义身份、位置和轨迹的实体，但当前的音频-语言模型通常将片段推理为全局事件内容。相反，声音事件定位模型随时间跟踪声源方向，但对语言推理的语义覆盖有限。为解决这一差距，我们引入了ST-AudioQA，一个基于一阶环绕声（FOA）渲染的静态和移动声源的时空音频问答数据集和基准。每个场景提供源身份、活动、方向、距离和运动元数据，实现密集轨迹监督以及关于什么在发声、在哪里、如何移动以及源之间关系的问题。我们进一步提出了ST-Audio Encoder，一种时间分辨的FOA音频编码器，联合学习事件语义和源轨迹，以及ST-AudioLM，它将编码器的音频令牌连接到LLM进行时空音频问答。实验表明，这种表示改善了语义-定位权衡，并比静态空间和面向定位的基线产生更强的推理性能。

英文摘要

Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.

2606.14260 2026-06-15 cs.IR cs.AI 交叉投稿

ChronoID: Infusing Explicit Temporal Signals into Semantic IDs for Generative Recommendation

ChronoID: 将显式时间信号注入语义ID用于生成式推荐

Dongdong Nian, Dongqi Fu, Chenliang Xu, Yinglong Xia, Hong Li, Hong Yan, Jian Kang

发表机构 * University of Rochester（罗切斯特大学）； Meta MRS ； MBZUAI

AI总结提出ChronoID框架，通过沿三个正交维度注入显式时间信号到语义ID中，解决生成式推荐中时间信息缺失问题，并构建新基准验证其有效性。

AI中文摘要

语义ID在生成式推荐中至关重要，但存在一个根本性限制：时间信息未能很好地融入语义ID。相反，时间仅隐式影响推荐（例如，通过会话构建启发式、偏好对齐或序列顺序），而现有的语义ID学习完全与时间无关。这种设计将不同时间上下文下的交互混为一谈，隐含地假设物品语义和用户意图在时间上是平稳的。这种假设与真实推荐场景不符，其中演变的交互节奏起着核心作用。在这项工作中，我们研究了显式时间应如何以及在哪里被纳入生成式推荐的语义ID中。首先，我们沿时间信号的三个正交维度系统地表征了设计空间，并提出了一个统一框架ChronoID，用于时间感知的语义ID学习。然后，通过贡献一个新的时间显式生成推荐基准，ChronoID回答了以下问题：注入时间的有效方式是什么，如何设计架构，以及增益来自何处。

英文摘要

Semantic IDs are crucial in generative recommendation, but they have a fundamental limitation: temporal information is not well incorporated into them. Instead, time influences recommendation only implicitly (e.g., through session construction heuristics, preference alignment, or sequence order), while existing semantic ID learning remains entirely time-agnostic. This design conflates interactions occurring under distinct temporal contexts into identical semantic representations, implicitly assuming that item semantics and user intent are temporally stationary. Such an assumption is misaligned with real-world recommendation scenarios, where evolving interaction rhythms play a central role. In this work, we investigate where and how explicit time should be incorporated into semantic IDs for generative recommendation. First, we systematically characterize the design space along three orthogonal dimensions of temporal signals and present a unified framework, ChronoID, for time-aware semantic ID learning. Then, by contributing a new time-explicit generative recommendation benchmark, ChronoID answers three questions: what is an effective way of infusing time, how should the architecture be designed, and where do the gains come from.

2606.14325 2026-06-15 cs.CL cs.AI cross-list

Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W. Mahoney, Sewon Min, Mehrdad Farajtabar, Kurt Keutzer, Amir Gholami, Chenfeng Xu

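Affiliations: University of California, Berkeley

AI summary: Proposes the Residual Context Diffusion (RCD) module, which improves the decoding efficiency of diffusion language models by recycling contextual residuals from discarded tokens, raising accuracy by 4-11 percentage points on long- and short-CoT tasks with minimal extra computation.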

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~300 million tokens. RCD consistently improves frontier dLLMs by 4-11 percentage points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at baseline's peak accuracy.

2602.01801 2026-06-15 cs.CV cs.AI version update

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

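Dvir Samuel, Issar Tzachor, Matan Levy, Michael Green, Gal Chechik, Rami Ben-Ari

Affiliations: Hebrew University of Jerusalem; Google Research

AI summary: Proposes the FAST-AR framework, which compresses the KV cache with TempCache, accelerates cross-attention with AnnCA, and sparsifies self-attention with AnnSA, achieving 5-10x speedups for autoregressive video diffusion models while preserving visual quality and keeping GPU memory usage stable.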

Comments Accepted to ICML 2026. Project Page: https://dvirsamuel.github.io/fast-auto-regressive-video/

Abstract

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework (FAST-AR) for FAST-AutoRegressive diffusion, consisting of three components: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate end-to-end speedups of up to 5x-10x while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.

2603.04976 2026-06-15 cs.CV cs.AI version update

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

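Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia, Siyuan Huang

Affiliations: University of Science and Technology of China

AI summary: Proposes the 3D-RFT framework, which extends reinforcement learning with verifiable rewards (RLVR) to video-based 3D perception and reasoning and improves performance by directly optimizing evaluation metrics such as 3D IoU and F1 score; the 4B model outperforms an 8B model.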

Comments Accepted at ICML 2026. Project page: https://3d-rft.github.io/

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performance. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal favorable properties of 3D-RFT, such as robust efficacy, and offer insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
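
The idea of deriving a verifiable reward directly from 3D IoU and then normalizing it within a group of sampled responses (the core of GRPO) can be sketched as below. This is only an illustrative sketch under stated assumptions: the abstract does not specify the box parameterization or the exact reward, so the axis-aligned box format and the names `iou_3d_axis_aligned` and `group_relative_advantages` are hypothetical.

```python
import statistics
from typing import List, Sequence

# Assumed box format (not from the paper): (x_min, y_min, z_min, x_max, y_max, z_max).
Box = Sequence[float]


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Usage: score two sampled box predictions against one ground-truth box.
gt = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
preds = [(0.0, 0.0, 0.0, 1.0, 1.0, 1.0), (0.5, 0.0, 0.0, 1.5, 1.0, 1.0)]
rewards = [iou_3d_axis_aligned(p, gt) for p in preds]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed from the same metric used at evaluation time, a higher advantage corresponds directly to a better-scoring prediction within the group.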

2603.24596 2026-06-15 eess.AS cs.AI cs.CL version update

X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

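Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, Tao Jin

Affiliations: Tencent Hunyuan; Zhejiang University

AI summary: Proposes the X-OPD framework, which aligns the capabilities of speech LLMs with text LLMs through cross-modal on-policy distillation: a text-based teacher model evaluates the speech model's trajectories and provides token-level feedback, significantly narrowing the performance gap on complex tasks.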

Comments Accepted by Interspeech 2026

Abstract

While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs with those of their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model's inherent capabilities.

2605.07984 2026-06-15 cs.LG cs.AI version update

Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

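Nicole Ma, Nick Rui

Affiliations: University of California, Berkeley

AI summary: Uses a rhyming-couplet completion task with linear probing and activation patching to study whether language models form, and causally rely on, latent plans for future constraints during generation; finds causal reliance only in Gemma-3-27B and localizes it to five attention heads.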

Comments 13 pages, 20 figures, 3 tables. Accepted to Workshop on Mechanistic Interpretability @ ICML 2026

Abstract

We study planning site formation in language models: where internal representations of structurally constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a handoff in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. Through two-stage path patching, we localize the Gemma-3-27B handoff to five attention heads that recover ~90% of the rhyme-routing capacity at the newline.

2605.16739 2026-06-15 cs.LG cs.AI cs.CL q-bio.NC version update

EmoMind: Decoding Affective Captions from Human Brain fMRI

Bilal A. Mohammed, Lin Gu, Ruogu Fang

Affiliations: Department of Biomedical Engineering; Vanderbilt University; Research Institute of Electrical Communication; Tohoku University; University of Florida

AI summary: Presents EmoMind, the first end-to-end system for decoding affective captions from fMRI signals; by combining a semantically grounded neutral scene description with a continuous emotion vector, it balances content preservation and affective expression and outperforms label-prompted GPT-4 across multiple validation settings.

Abstract

Decoding visual experience from brain activity has advanced substantially, but current brain-to-text systems largely recover semantic content while discarding affect. Additionally, language models can generate emotional text when prompted with categorical labels, but such labels collapse rich inter-subject variability into coarse discrete bins. We present EmoMind, the first end-to-end pipeline for decoding affective captions directly from fMRI signals. EmoMind first retrieves a semantically grounded neutral scene description from brain-decoded visual features, then rewrites it using a continuous 34-dimensional emotion vector decoded from the same fMRI recording. To control the balance between content preservation and affective expression, we train the rewriter with classifier-free guidance against an identity-preserving null branch, enabling smooth interpolation between semantic fidelity and affective expressivity. We evaluate affective caption generation with a three-axis validation framework spanning subject-specificity, structural geometry, and causal control. We further augment this framework with a synthetic-brain substitution test that probes robustness to the measurement apparatus, and we benchmark each axis against GPT-4 prompted with brain-decoded top-5 emotion labels as a strong discrete baseline. Across two independent emotion fMRI datasets, EmoMind significantly outperforms label-prompted GPT-4 on all three axes, with the largest gains on metrics that require person-specific affective structure rather than population-level emotion aggregation. These results establish continuous brain-decoded affect as a viable control signal for individualized affective caption generation and open new directions for studying individual affective brain organisation.

2606.11502 2026-06-15 cs.CL cs.AI version update

When Roleplaying, Do Models Believe What They Say?

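Benjamin Sturgeon, David Africa, Sid Black

Affiliations: MATS

AI summary: Uses linear truth probes to study how role-play affects LLMs' internal representations; finds that role-play mainly changes outputs rather than internal truth representations, whereas emergent misalignment shifts internal representations more substantially.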

Abstract

Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (*era-believed*) with topic-matched false claims they would not have endorsed (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.

2606.12476 2026-06-15 cs.LG cs.AI cs.CL version update

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

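Igor Itkin

Affiliations: Independent Researcher

AI summary: Models hallucination onset detection as a quickest change detection problem; building on a first-order Markov model validated on RAGTruth, a learned CUSUM achieves a detection delay of 11-13 tokens at a matched false-alarm rate, ahead of a linear baseline, and reveals delay structure that classification metrics conceal.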

Comments 16 pages, 1 figure. v2: added Discussion and Appendix; recall-honest framing; robustness analyses (k-NN divergence estimate, seed-averaged decomposition)

Abstract

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment. Among the onsets it catches it detects in 11-13 tokens, against 31 for a linear per-token baseline, though at this false-alarm budget every detector catches under a third of onsets and the recall-honest delay is 56-66 tokens: low-false-alarm onset detection is hard. A controlled decomposition attributes the speed advantage mostly to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
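
For readers unfamiliar with CUSUM, the classical recursion that the abstract refers to can be sketched as follows. This is a minimal sketch, not the paper's learned detector: the per-token increments are assumed to be supplied by some scorer (for example a log-likelihood ratio), and the names `cusum_alarm` and `threshold` are illustrative.

```python
from typing import Iterable, Optional


def cusum_alarm(increments: Iterable[float], threshold: float) -> Optional[int]:
    """Return the index of the first token at which the CUSUM statistic
    W_t = max(0, W_{t-1} + increment_t) reaches the threshold, or None."""
    stat = 0.0
    for t, inc in enumerate(increments):
        stat = max(0.0, stat + inc)  # reset at zero, accumulate positive evidence
        if stat >= threshold:
            return t
    return None


# Usage: negative increments while faithful, positive after an onset at token 3.
scores = [-1.0, -1.0, -1.0, 2.0, 2.0, 2.0]
alarm = cusum_alarm(scores, threshold=3.0)
onset = 3
delay = None if alarm is None else alarm - onset
```

Raising `threshold` lowers the false-alarm rate at the price of a longer detection delay, which is the trade-off the delay bounds in the abstract quantify.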

2606.13464 2026-06-15 cs.CL cs.AI version update

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

Xinxin Li, Huiyao Chen, Meishan Zhang, Yunxin Li, Zulong Chen, Zhibo Ren, Xiaoqing Dong, Baotian Hu, Min Zhang

Affiliations: Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), China; Shenzhen Loop Area Institute (SLAI), China

AI summary: Proposes an ontology memory-augmented ASR correction framework that addresses contextual correction in long text-speech interleaved conversations by dynamically updating an ontology memory storing entities, terminology, variants, confusions, and semantic relations; it outperforms direct correction on the RAMC-Corr dataset.

Abstract

Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires conversation-level contextual evidence. Existing ASR correction methods often rely on the current hypothesis or concatenate raw dialogue history. In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes preceding interaction history into a dynamically updatable ontology memory, where entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 out of 10 paired backbone-setting combinations and encourages more selective and evidence-grounded corrections for context-dependent ASR errors.

2606.14188 2026-06-15 cs.RO cs.AI cs.LG cs.SY eess.SY math.OC cross-list

Robustness without Wrinkles: Parallel Simulation and Robust MPC for Certified Deformable Manipulation

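Wei-Chen Li, Jeffrey Fang, Sasanka Polisetti, Yuexi Song, Glen Chou

Affiliations: Georgia Institute of Technology

AI summary: Proposes CORD-SLS, a real-time control method that achieves efficient gradient-based planning through GPU-parallel differentiable simulation with contact smoothing and combines robust model predictive control with conformal prediction calibration, reaching millisecond-speed planning and high safety in rope and cloth manipulation.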

Abstract

We present CORD-SLS, a real-time control method for safe deformable object manipulation, with a focus on ropes and cloth. At its core is a GPU-parallel differentiable simulator with contact smoothing which enables efficient gradient-based planning through intermittent contact. To robustly satisfy constraints under model and sensing uncertainty, we develop a real-time, GPU-parallel output-feedback robust model predictive control (MPC) algorithm that plans with this simulator. We further show that the simulator accelerates model-based RL for training neural manipulation policies. To improve real-world robustness, we use conformal prediction to calibrate visual-feedback and perception-error bounds for MPC, producing reachable tubes that enable high-probability safe control. We evaluate CORD-SLS on high-dimensional, contact-rich rope and cloth manipulation tasks in simulation and hardware, including obstacle avoidance, routing, folding, and smoothing. Across settings, CORD-SLS achieves millisecond-speed planning, exceeding baselines in safety, speed, and task success.
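
The conformal-prediction step mentioned in the abstract, calibrating an error bound that holds with high probability, is commonly done with a split-conformal quantile. The sketch below shows that generic recipe only; how CORD-SLS actually computes and uses its bounds is not described in the abstract, and the function name `conformal_error_bound` is hypothetical.

```python
import math
from typing import Sequence


def conformal_error_bound(cal_errors: Sequence[float], alpha: float) -> float:
    """Split-conformal bound: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration error. Under exchangeability, a new error falls below this
    bound with probability at least 1 - alpha."""
    n = len(cal_errors)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:  # too few calibration samples for the requested coverage
        return math.inf
    return sorted(cal_errors)[k - 1]


# Usage: perception errors (e.g. in metres) measured on held-out calibration data.
errors = [0.3, 0.1, 0.2]
bound = conformal_error_bound(errors, alpha=0.5)
```

A robust controller could then treat `bound` as the radius of the disturbance set when constructing its reachable tubes; that connection is an assumption made here for illustration.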

2606.14218 2026-06-15 cs.RO cs.AI cs.LG cross-list

Universal Manipulation Exoskeleton: Learning Compliant Whole-body Policies with Real-time Torque Feedback

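Litian Liang, Jingxi Xu, Xinda Qi, Yujun Cai, Houzhu Ding, Luqi Wang, Zhixin Sun, Jyh-Herng Chow, Ming Yang, Mark Cutkosky

Affiliations: Ant Group; Stanford University

AI summary: Proposes the Universal Manipulation Exoskeleton (UME), which uses real-time haptic torque feedback and whole-body data collection to let robots learn active compliant policies, completing tasks such as mobile manipulation and force-mediated flipping in constrained spaces.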

Abstract

For robots to work safely in household environments, they need to be compliant and react to torque and force feedback during contact. However, the majority of existing data collection pipelines still lack the ability to capture force and torque data for learning active compliant policies. In this paper, we present Universal Manipulation Exoskeleton (UME), an upper-limb exoskeleton that provides real-time haptic torque feedback while recording whole-arm configurations and joint torque signals for teleoperation. With transparent torque feedback, human operators can even unsheathe kinematically constrained objects while blindfolded. UME is low-cost, lightweight, and portable. Equipped with an embedded IMU, it enables teleoperation for mobile manipulation. With our proposed universal retargeting algorithm, UME can teleoperate a range of robots, including the 7DoF OpenArm, 7DoF Franka, and 6DoF X-ARM. We demonstrate that this combination of capabilities enables learning bimanual, whole-body, and active compliant policies that operate effectively in highly constrained spaces. The learned robust autonomous policies achieve high success rates across a variety of tasks, including long-horizon mobile manipulation, force-mediated box flipping, visually occluded box pushing, and space-constrained tabletop manipulation. Videos, code, and additional information can be found at https://ume-exo.github.io.

2606.14219 2026-06-15 cs.RO cs.AI cross-list

Selective Agentic Recovery for UAV Autonomy with a Persistent Mission Runtime

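Taewoo Park, Kyeonghyun Yoo, Seunghyun Yoo, Hwangnam Kim

Affiliations: Department of Electrical and Electronic Engineering, Korea University

AI summary: Proposes the Persistent Mission Runtime (PMR) framework, which performs UAV recovery by selectively invoking an external agentic reasoner and introduces a learned Cognitive Value of Invocation (learned-CVI) gate; on a Gazebo/PX4 benchmark it raises hard/ambiguous-scenario success from 5.0% to 95.0% while cutting remote calls by 16.7% and token consumption by 29.2%.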

Comments 17 pages, 2 figures. Preprint

Abstract

Agentic AI can support unmanned aerial vehicle (UAV) autonomy by providing high-level recovery reasoning when local waypoint- or setpoint-based execution encounters blocked passages, repeated no-progress behavior, or mission-level ambiguity. On physical UAVs, however, remote reasoning is most useful when it is invoked selectively, since each call introduces latency, resource cost, backend uncertainty, and a need to validate the returned decision. This paper presents Persistent Mission Runtime (PMR), a UAV recovery framework that keeps the mission loop and safety-critical execution local while using an external agentic reasoner only as an on-demand recovery module. The reasoner selects from predefined recovery skills, and each returned decision is parsed, verified, safety-filtered, and mapped to local executor actions before it can affect flight. PMR introduces learned Cognitive Value of Invocation (learned-CVI), a compact admission gate that estimates when remote agentic reasoning is likely to improve near-term mission progress enough to justify its operational cost. Across a fixed 400-run Gazebo/PX4 benchmark with eight scenarios, learned-CVI raises hard/ambiguous-regime success from 5.0% under local-only autonomy to 95.0%, outperforms one-shot and periodic reasoning baselines by 20.0 and 32.5 percentage points, and reduces remote-agent calls by 16.7% and logged tokens by 29.2% relative to a manually tuned rule-based invocation baseline.

2606.14270 2026-06-15 cs.RO cs.AI cross-list

Robust Fall Recovery for Armless Bipedal-Wheeled Robots Via Force-Guided Learning

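Haidong Hou, Zhangguo Yu, Tao Han, Hengbo Qi, Khaleel Ghazal, Yu Zhang, Yidong Du, Xuechao Chen, Fei Meng

Affiliations: Beijing Institute of Technology

AI summary: Targets the problem that armless bipedal-wheeled robots cannot use external support to stand back up; proposes FTSR, a force-guided teacher-student framework that uses constrained reinforcement learning to gradually reduce dependence on external force, achieving robust recovery from falls to stable locomotion.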

Comments 8 pages, 6 figures, accepted by IEEE Robotics and Automation Letters (RA-L)

DOI: 10.1109/LRA.2026.3701481
Journal ref: IEEE Robotics and Automation Letters, 2026

Abstract

Fall recovery is critical for autonomous legged locomotion. Existing methods have demonstrated that some legged robots, such as humanoids and quadrupeds, can recover from falls in diverse postures by using their arms or coordinating multiple legs to generate support forces. Without arms or other legs to provide support, a bipedal-wheeled robot must rely solely on the actuation of its legs, making recovery particularly difficult. To address this, we introduce FTSR (Force-guided Teacher-student framework with Stage-wise Rewards). The force-guided method constructs an external auxiliary force during simulation training that correlates directly with the robot's real-time height, explicitly formulating this force as an optimizable constraint. Through constrained reinforcement learning, the policy is guided toward gradually reducing its force dependency and increasing the body height, developing internal recovery strategies despite having no arms for support. Height-progressive stage-wise rewards progressively structure posture stabilization during recovery and the transition to sustained locomotion, and are integrated with a teacher-student architecture that distills privileged knowledge of force effects and recovery dynamics. After simulation training, the policy is deployed on a physical armless bipedal-wheeled robot and extensively evaluated. Experiments confirm robust and reliable fall recovery under diverse challenging conditions, demonstrating strong environmental adaptability and motion robustness, while maintaining full post-recovery motion capability. The framework also generalizes effectively to a high-DOF humanoid, confirming its practical generalizability. The project page is available at https://2350575870.github.io/force-guided.github.io/

2606.14375 2026-06-15 cs.RO cs.AI cross-list

Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models

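Ge Wang, Xinyu Tan, Xiang Li, Man Luo, Chengsi Yao, Shenhao Yan, Jiahao Yang, Fan Feng, Honghao Cai, Xiangyuan Wang, Zhixin Mai, Yiming Zhao, Yatong Han, Zhen Li

Affiliations: Ising AI; CUHK-Shenzhen; PKU

AI summary: Proposes Elastic Queries Reinforcement Learning (EQRL), which uses a lightweight latent-schedule adaptor to dynamically adjust the inference steps and action chunk length of VLA models and estimates state difficulty from critic ensemble disagreement, lowering inference cost while preserving or improving task success.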

Abstract

Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control: contact-rich or uncertain states may need more computation and fresher feedback, while easier states can often be handled with fewer inference steps and longer open-loop execution. We propose Elastic Queries Reinforcement Learning (EQRL), a framework that makes each VLA policy query elastic. A lightweight latent-schedule adaptor jointly selects the latent input, denoising budget, and action chunk length, without fine-tuning the underlying VLA model. To make scheduling difficulty-aware, EQRL trains a critic over the joint latent-schedule action and derives a state difficulty signal from critic ensemble disagreement. This signal guides compute toward difficult states, while a learned residual allows task-driven correction. We formulate variable chunk execution as query-level macro-action RL with chunk-dependent discounting and an amortized number-of-function-evaluations (NFE) budget. Across simulation and real-robot manipulation, EQRL reduces amortized inference cost while preserving or improving task success.
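
Chunk-dependent discounting for macro-actions is usually written in the semi-Markov form: a chunk of k primitive steps contributes its discounted rewards and bootstraps with gamma raised to the power k. The sketch below illustrates that standard form together with one plausible reading of an amortized NFE cost (denoising steps per executed action). Both functions are assumptions made for illustration, not the paper's definitions.

```python
from typing import Sequence


def macro_action_target(rewards: Sequence[float], next_value: float, gamma: float) -> float:
    """Value target for one executed chunk of k = len(rewards) steps:
    sum_i gamma^i * r_i + gamma^k * V(s')."""
    k = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted + (gamma ** k) * next_value


def amortized_nfe(denoising_steps: int, chunk_length: int) -> float:
    """Function evaluations spent per executed action (assumed definition)."""
    return denoising_steps / chunk_length


# Usage: a 2-step chunk with unit rewards, bootstrapped from V(s') = 4.
target = macro_action_target([1.0, 1.0], next_value=4.0, gamma=0.5)
cost = amortized_nfe(denoising_steps=8, chunk_length=4)
```

Longer chunks lower the amortized cost but discount the bootstrap value more heavily and delay feedback, which is the tension an elastic schedule has to balance.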

2606.14409 2026-06-15 cs.RO cs.AI cross-list

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

He Zhang, Lingzhu Xiang, Haitao Lin, Zeyu Huang, Minghui Wang, Dingyan Zhong, Yubo Dong, Yihao Wu, Yongming Rao, Dongsheng Zhang, Wanjia He, Ling Chen, Kai Huang, Jiahao Chen, Sichang Su, Xumin Yu, Ziyi Wang, Chengwei Zhu, Xiao Teng, Yuchun Guo, Yufeng Zhang, Yuandong Liu, Rui Wang, Zisheng Lu, Han Hu, Zhengyou Zhang

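Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: Proposes HyVLA-0.5, an end-to-end robot learning stack covering data collection, model design, pre-training and fine-tuning, RL post-training, and real-world deployment, with all components working together.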

2606.14585 2026-06-15 cs.RO cs.AI cross-list

Sensitivity Shaping for Latent Modeling

Hongzhan Yu, Chenghao Li, Ruipeng Zhang, Henrik Christensen, Sicun Gao

Affiliations: University of California San Diego

AI summary: Addresses the insufficient sensitivity of generative dynamics models when detecting policy-induced out-of-distribution (OOD) transitions by proposing support-conditioned control-sensitivity regularization, which strengthens the local response to control input changes; experiments show improved OOD detection and safer closed-loop planning.

Abstract

Generative dynamics models enable planning in challenging robotic systems, but safe deployment requires reliably detecting policy-induced out-of-distribution (OOD) transitions. Existing methods typically treat the learned dynamics as fixed and attach post hoc support surrogates. We show that these surrogates can fail when the dynamics are locally insensitive to critical action choices: unsupported control actions may produce latent predictions that resemble demonstrated transitions, suppressing OOD signals despite large true predictive errors. To address this, we introduce support-conditioned control-sensitivity regularization, which promotes sensitive local response to control input changes in learned dynamics in high-support training regions. This preserves control-induced variation while limiting unstable extrapolation due to weak empirical support. Experiments in vision-based obstacle avoidance, manipulation, and real-robot navigation show improved OOD detection and safer closed-loop planning.

2503.19947 2026-06-15 cs.CV cs.AI replaced

Vanishing Depth: Training Generalized Depth Adapters with Sinusoidal Depth Preprocessing for Pretrained RGB Encoders

Paul Koch, Jörg Krüger

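Affiliations: Fraunhofer IPK; TU-Berlin

AI summary: Proposes a self-supervised training method that adds a depth adapter to pretrained RGB encoders and, combined with sinusoidal depth encoding, yields generalized and robust depth feature extraction; it improves baselines on downstream tasks such as segmentation, pose estimation, and depth completion, reaching 56.05 mIoU on SUN-RGBD segmentation.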

Comments Accepted to IntelliSys 2026

Abstract

Generalized metric depth understanding is critical for precise vision-guided robotics, yet current state-of-the-art (SOTA) vision encoders do not support it. To address this, we propose a self-supervised training approach that extends pretrained RGB encoders with a depth adapter to incorporate and align metric depth into a combined latent space without interfering with the pretrained RGB feature extraction. In combination with our sinusoidal depth encoding, the depth adapter enables generalized and robust feature extraction that is invariant to depth density and distribution. Our depth adapters improve a wide set of generalized RGB baselines across a spectrum of relevant RGBD downstream tasks in segmentation, pose estimation, and depth completion -- without the necessity of finetuning. Most importantly, we achieve 56.05 mIoU on SUN-RGBD segmentation, while outperforming SOTA depth-aware and multi-modal encoders in our experiments. When no depth is present, one can activate our depth adapter with an empty map, use single-pixel depth cues, or use monocular depth estimation to include depth-aware feature extraction in subsequent downstream tasks.

2512.21201 2026-06-15 cs.RO cs.AI cs.CV replaced

Refusal Is More Than One Direction: A Preliminary Comparison of Diff-in-Means and INLP

Elisabetta Rocchetti, Alfio Ferrara

Affiliations: Department of Computer Science, Università degli Studi di Milano

AI summary: Compares DiM and INLP for steering refusal behavior in safety fine-tuned chat models, finding that INLP counterfactual flipping matches DiM directional ablation while nullspace projection is weaker, and that the two INLP interventions land in different regions of activation space.

Abstract

Arditi et al. (2024) have shown that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable by a difference-in-means (DiM) of harmful and harmless activations. We compare DiM-based interventions (activation addition and directional ablation) with two interventions derived from Iterative Nullspace Projection (INLP) -- nullspace projection and counterfactual flipping -- on five open-weight chat models, asking whether INLP can match DiM at steering refusal and whether its richer parameterisation yields more tweakable interventions. INLP counterfactual flipping is competitive with DiM directional ablation on refusal suppression, while nullspace projection is consistently weaker. Restricting INLP to the leading directions of the extracted subspace preserves most of the suppression effect at near-baseline perplexity, giving a tunable capability. Geometrically, the two INLP interventions land in qualitatively different regions of activation space: nullspace projection collapses transformed activations between the harmful and harmless clusters, while counterfactual flipping moves them into the opposite cluster, suggesting that the model encodes the absence of a concept differently from its opposite -- an intriguing distinction that warrants further investigation in future work.
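The two DiM-based interventions named in the abstract can be sketched on toy vectors. This is an illustration only, assuming the direction is the unit-normalised difference of class means, that directional ablation subtracts the projection onto that direction, and that activation addition adds a scaled copy of it. The 2-D activations and function names are invented for the example and are not the authors' code.

```python
import math


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def dim_direction(harmful, harmless):
    """Difference-in-means direction, normalised to unit length."""
    diff = [a - b for a, b in zip(mean(harmful), mean(harmless))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def ablate(x, r):
    """Directional ablation: remove the component of x along unit vector r."""
    dot = sum(a * b for a, b in zip(x, r))
    return [a - dot * b for a, b in zip(x, r)]


def add_direction(x, r, alpha):
    """Activation addition: shift x by alpha along unit vector r."""
    return [a + alpha * b for a, b in zip(x, r)]


# Toy activations (made up): harmful prompts differ along the first axis.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[0.0, 1.0], [0.0, -1.0]]
r = dim_direction(harmful, harmless)
print(r, ablate([5.0, 7.0], r), add_direction([0.0, 7.0], r, 2.0))
```

After ablation the activation has no component left along `r`, while every orthogonal component is untouched; INLP-style nullspace projection generalises this from one direction to a learned subspace.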

Moral AI Is an Existential Risk

Guillermo Del Pinal, Youngchan Lee, Min Ohn

Affiliations: University of Massachusetts Amherst

AI summary: Fine-tunes AI models using Constitutional AI and a virtue-ethics approach, and finds a trade-off between reducing existential risk and promoting AI agents' well-being, as well as a trade-off with general safety.

Abstract

This paper examines trade-offs between AI safety and well-being relative to (i) one of the most promising methods for finetuning super-capable AIs, 'Constitutional AI', and (ii) one of the most influential approaches to understanding complex ethical decision making and the conditions for the well-being of rational agents, 'Virtue Ethics'. We finetune various models using a 'Virtuous agent' constitution, a 'Subordinate agent' constitution, and a 'Generic agent' constitution, and evaluate them on 'general safety' (toxic behaviors, misinformation, etc.) and also on their willingness to endorse a wide range of behaviors that, if adopted by a super-powerful AI, would significantly increase the level of existential risk for humanity. Our results suggest that there is a trade-off between reducing existential risk and reinforcing the beliefs and dispositions that would be conducive to an AI agent's well-being. They also suggest that there is a trade-off between existential risk and general safety: if we finetune an AI to adopt beliefs and dispositions that substantially reduce its existential risk -- by shaping the AI to be systematically subordinate to external human authorities -- we thereby increase the likelihood that a human user can deliberately induce the AI to engage in various kinds of generally unsafe behaviors.

2606.13755 2026-06-15 cs.CY cs.AI cs.LG cross-list

Position: Align AI to Our Aspirations, Not Our Flaws

Nikita Kazeev, Bui Nhat Huyen Phan

Affiliations: National University of Singapore

AI summary: Argues that AI should not be aligned to aggregated human preferences but trained to a floor of objective goals (competence, factual accuracy, honesty, and lawfulness), with pluralistic value trade-offs permitted above that floor.

Journal ref: Pluralistic Alignment Workshop at ICML 2026

Abstract

We argue that aligning AI to aggregated human preferences is the wrong target. With current technology, one can train AIs to share the values of a Silicon Valley techno-optimist, a degrowth environmentalist, a national-conservative culture warrior, a single-party state cadre, or a devout religious traditionalist. We should not. Human values produce societies that thrive or fail on the merits of those values - from failed states and extreme inequality to declining happiness, political polarization, and government dysfunction in the world's wealthiest democracies. The pluralistic-alignment program correctly diagnoses that there is no single "humanity" to align with, but is dangerous if taken as the main directive. We argue that AI should be trained to a non-negotiable floor of objective alignment goals - competence, bounded by the constraints of factual accuracy, honesty, and lawfulness - and that pluralism belongs at the surface (language, register, conventions, missing-context defaults) and across the wide band of legitimate value tradeoffs that respect the floor, but not at the level of values that violate it. We highlight the empirical reality of unfiltered pluralistic values, propose four commitments as a constructive alternative, and engage six credible objections: commercial pressure and practical feasibility, democratic legitimacy, regulatory compliance, over-reliance on institutionalist explanations, the charge that the floor itself is culturally laden, and the limits of Coherent Extrapolated Volition.

2606.13962 2026-06-15 cs.HC cs.AI cross-list

The Silent Cost of Artificial Intelligence Assistance: A Theory of Autonomy Surrender, the Recovery Mechanism, and the Restoration of Human Agency

Ancuta Margondai, Julie Rader, Emma Rader, Sara Willox, Mustapha Mouloua

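Affiliations: Department of Modeling and Simulation

AI summary: Building on the HIAG framework, proposes a theoretical model of autonomy surrender, exposes the silent cost of AI assistance driven by cognitive bandwidth depletion, and calls for recovery mechanisms that restore human agency.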

Comments 15 pages, 1 figure. Submitted version

Abstract

The integration of artificial intelligence into human decision-making environments has introduced a previously undertheorized cost: the gradual surrender of human autonomy in exchange for access to information and computational assistance. Building on the Human Identity and Autonomy Gap (HIAG) framework, this paper advances a theoretical model of autonomy surrender as a measurable, cumulative process driven by cognitive bandwidth depletion. The model proposes three interacting mechanisms: the silent cost of AI assistance, in which autonomy is transferred incrementally and without awareness; the surrender threshold, beyond which reclaiming autonomous function becomes cognitively and psychologically difficult; and the recovery mechanism, which establishes the design obligation and the ethical responsibility accompanying deliberate human re-assumption of control. The paper argues that human re-entry into the decision loop is not a passive option but an active cognitive event requiring intentional bandwidth restoration. The design of AI systems must incorporate structured re-entry pathways, here termed recovery mechanisms, that preserve human agency while appropriately distributing responsibility. The model further predicts a terminal state, here termed preference inversion, in which functional dependence on AI assistance is experienced not as a deficit but as a preference, transforming the restoration of autonomy from a design problem into a cultural and political one. Implications are drawn for AI system design, governance frameworks, and human factors research.

2606.14078 2026-06-15 cs.LG cs.AI cross-list

Rethinking Backdoor Adversarial Unlearning through the Lens of Catastrophic Forgetting in Continual Learning

Zhenqian Zhu, Yamin Hu, Yujiang Liu, Luping Wei, Wenbo Hou, Bin Li, Haodong Li, Wenjian Luo

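Affiliations: Harbin Institute of Technology, Shenzhen; Shenzhen Key Laboratory of Media Security, Shenzhen University

AI summary: Models backdoor learning and unlearning as a three-stage process from a continual learning perspective, derives necessary conditions for complete backdoor unlearning from the mechanism of catastrophic forgetting, and proposes Blind Inversion-Backdoor Adversarial Unlearning (BI-BAU), which optimizes a maximum a posteriori objective with an Expectation-Maximization algorithm to eliminate backdoor effects.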

Comments Accepted by ACM CCS 2026

Abstract

Existing studies reveal that current backdoor defenses exhibit limited robustness and often fail against specific types of attacks. More concerningly, prevailing safety tuning strategies tend to provide only superficial safety protection, as they fall short of completely eliminating the backdoor effects. In this work, we present a novel formulation of backdoor learning and unlearning as a sequential, three-stage process from a continual learning perspective. Within this framework, we formally define complete backdoor unlearning and further derive the necessary conditions for achieving it based on the mechanism of catastrophic forgetting. Guided by these insights, we propose Blind Inversion-Backdoor Adversarial Unlearning (BI-BAU), which formulates the generation of adversarial examples satisfying the unlearning conditions as a blind inversion problem. We solve this by integrating the bi-level optimization process of adversarial training into an Expectation-Maximization (EM) algorithm framework to optimize the maximum a posteriori (MAP) objective. Furthermore, BI-BAU is extended to untargeted adversarial scenarios with unknown target classes, as well as to multi-modal contrastive learning tasks, enhancing its applicability to real-world deployment scenarios where pre-trained models may be compromised. Extensive experiments demonstrate that our method exhibits general applicability across a wide spectrum of backdoor attacks and can effectively and thoroughly eliminate the backdoor effects from a backdoor model.

2606.14210 2026-06-15 cs.CR cs.AI cross-list

From Prompts to Responses: Dual-Sided Data Leakage and Defense in Split Large Language Models

Zixuan Gu, Xiaojun Ye, Yang Liu

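Affiliations: GitHub

AI summary: Proposes the PIDI attack, which leaks both input prompts and output responses in split LLMs, and designs the ADMI defense, which counters it through adapter warmup and mutual information regularization.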

Comments 18 pages, Accepted at ICML 2026

Abstract

Large language models (LLMs) are increasingly deployed in privacy-sensitive domains, where users must balance the risk of data exposure through external APIs against the high computational cost of local deployment. Split learning has therefore emerged as a promising paradigm for LLM fine-tuning and inference under limited local resources. However, it introduces new privacy risks. Prior work primarily studies leakage of private input prompts, typically via inversion attacks on intermediate representations, while the potential for sensitive information leakage through generative response outputs remains largely unexplored. In this work, we unveil novel vulnerabilities of Split-LLM by presenting Patched Model Inversion with Dual-Sided Initialization (PIDI), a two-stage attack that simultaneously targets both private input prompts and output responses in Split-LLM settings. It combines dual-sided initialization with a patched inversion strategy to tackle long sequences, substantially outperforming prior inversion methods. To counter threats from both sides, we further propose the Adapter-based DualGuard with Mutual Information Defense (ADMI), which integrates an adapter-based local warmup strategy and mutual information regularization to provide a strong empirical privacy protection with minimal impact on task performance. Extensive experiments across diverse tasks and models demonstrate that ADMI effectively defends against PIDI and other state-of-the-art inversion attacks. Our code is publicly available at https://github.com/FLAIR-THU/VFLAIR-LLM.

2606.14327 2026-06-15 cs.SE cs.AI cs.ET cross-list

I'm Sorry Driver, I'm Afraid I Can't Do That: Appraising the Safety of LLMs within Automotive Contexts

Shaun Feakins, Ibrahim Habli, Kim Littler, Robert Palin

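Affiliations: UKRI AI Centre for Doctoral Training in Safe Artificial Intelligence Systems (SAINTS); University of York; Jaguar Land Rover

AI summary: Appraises, from a safety assurance perspective, existing frameworks for integrating LLMs into automotive control tasks, identifies conceptual and concrete challenges, and proposes future assurance mechanisms through a case study.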

Comments Accepted at the Dependable AI in Embedded Systems (DAIES) Workshop at SAFECOMP 2026; 15 pages, 3 figures, 2 tables

Abstract

This paper appraises recent frameworks within AI development to integrate LLMs into control tasks in automotive contexts from the perspective of safety assurance. This work has built upon the rapid integration of LLMs across automotive settings. However, we find that at present, these frameworks face significant challenges, limiting their efficacy in real-time safety-critical contexts. Firstly, we consider conceptual challenges, including the fact that deployers are faced with a dual challenge, wherein they must assure a model which has been developed upstream, i.e. as general-purpose tools by the large AI labs, in a downstream context, i.e. into specific vehicle architectures. Secondly, we consider concrete challenges from across existing standards. We show that there are currently both fundamental engineering constraints covered in ISO 21448, such as latency, and novel LLM-specific issues, such as alignment-related issues covered in ISO/PAS 8800. We ground both examples in a concrete introductory, experimental case study exploring an existing open-source repository, Talk2Drive. We present a safety argument in order to make explicit the limitations of existing solutions. Nonetheless, given that the use of LLMs in automotive contexts is being explored at a technical level and operationalised, we propose potential assurance mechanisms for LLM-related hazardous events going forward.

2606.14466 2026-06-15 cs.SD cs.AI cs.LG cross-list

The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions

Piotr Kitłowski, Dominik Wiącek, Mateusz Modrzejewski

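Affiliations: University of Warsaw

AI summary: Proposes a psychoacoustic framework that decouples model attribution from classification by optimizing inaudible perturbations, showing that explanation heatmaps in audio deepfake detection can be systematically distorted while the predicted label stays unchanged.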

Comments Accepted to the ICML 2026 Workshop on Machine Learning for Audio: 5 pages, 4 figures

2606.14515 2026-06-15 cs.CR cs.AI cross-list

Securing the Future of IoMT in the Post-Quantum Era: An Edge-Native Federated Learning Approach

Taym Alshoghri, Deemah H. Tashman, Mohammad Reza Gerami, Soumaya Cherkaoui

Affiliations: LINCS Laboratory, Department of Computer and Software Engineering, Polytechnique Montréal; Department of Computer Science, University of Toronto

AI summary: To address the security and privacy problems of resource-constrained IoMT devices that handle sensitive health data, proposes a Kubernetes-based framework integrating post-quantum cryptography, achieving low-latency distributed cryptographic processing through edge-native federated learning.

Abstract

Internet of Medical Things (IoMT) devices operate under strict resource constraints while handling highly sensitive health data, making security and privacy critical concerns. Federated learning (FL) further complicates this landscape, as model updates exchanged during training may unintentionally expose private medical information. Emerging quantum computing capabilities threaten the long-term viability of conventional lightweight cryptographic mechanisms, motivating the integration of Post-Quantum Cryptography (PQC) into IoMT systems. This article discusses key enabling technologies for quantum-resilient IoMT, including post-quantum key establishment, lightweight encryption, and edge-native orchestration. We propose a scalable Kubernetes-based framework that integrates PQC into FL-enabled IoMT environments and validate it on a Raspberry Pi testbed. Results demonstrate that distributed cryptographic processing significantly reduces latency compared to sequential designs while maintaining feasible resource overhead. The primary contribution of this work lies in the design and validation of a secure orchestration and communication framework for FL-enabled IoMT systems. We conclude by outlining future directions toward energy-aware architectures, intelligent security optimization, and resilient next-generation Intelligent Internet of Medical Things (IIoMT) ecosystems.

2606.14589 2026-06-15 cs.SE cs.AI cs.DC cross-list

When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime

Wei Wu

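Affiliations: Independent researcher

AI summary: An eight-week study of a production personal-assistant agent runtime identifies at least 28 silent-failure occurrences and proposes a five-class, mechanism-oriented taxonomy; Class D (chained hallucination and fabrication) is unique to LLMs and the most dangerous, because the system produces fluent, plausible false narratives.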

Comments 18 pages, 5 figures, 2 tables. 22 incident postmortems and all defense-framework artifacts publicly available at https://github.com/bisdom-cell/openclaw-model-bridge; governance engine on PyPI (openclaw-ontology-engine)

Abstract

LLM agent systems increasingly run as long-lived autonomous runtimes: scheduling jobs, calling tools, maintaining memory, and pushing results to humans. We present a longitudinal study of silent failures in one such system: a personal-assistant agent runtime in continuous production since March 2026, with roughly 40 scheduled jobs, 8 LLM providers, a tool-governance proxy, and a knowledge-base memory plane, defended by 4,286 unit tests and 827 governance checks. Over eight weeks we documented 22 incidents with full root-cause postmortems, in which one meta-pattern -- a failure whose error signal never reaches a human in actionable form -- manifested at least 28 times. We derive a five-class, mechanism-oriented taxonomy: (A) environment and platform quirks, (B) design-assumption mismatches, (C) error swallowing and dilution, (D) chained hallucination and fabrication, (E) operational omission and forensic blind spots. Class D is unique to LLM systems and the most dangerous: the system does not merely fail to report an error -- the LLM transforms it into fluent, plausible narrative delivered to the user. We term this fail-plausible: gray failure's differential observability escalated -- the observer is not just blind, it is convincingly lied to by the failure itself. Three findings: about 70% of silent failures were caught by human user-view observation, not tests or audits; a retrospective audit of 15 incidents found 0% ex-ante prevention but 87% regression blocking -- audits are regression engines, not prediction engines; incident latency (13 hours to 60 days) tracks failure mechanism, not code complexity -- the longest-lived failures lived in the seams between components, where no test runs. We describe the resulting defense framework and distill design principles for agent systems whose failures are loud, attributable, and boring. All postmortems and artifacts are public.

2606.14594 2026-06-15 cs.SE cs.AI cross-list

Giving AI a Headache: Acoustic Adversarial Attacks on Computer Vision Applications

Nicole Villavicencio-Garduño, Maksim Ekin Eren, Milo Prisbrey, Ben Migliori, Michael Teti

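Affiliations: Carnegie Mellon University

AI summary: Studies how sound below 20 kHz induces physical camera vibration that causes AI vision models such as YOLO11 to misclassify, miss targets, or hallucinate objects, and analyzes the factors that affect attack effectiveness.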

Comments 9 pages, 7 figures, SPIE Defense + Security

DOI: 10.1117/12.3093699
Journal ref: Proc. SPIE 14046, Assurance and Security for AI-enabled Systems 2026, 1404609 (10 Jun 2026)

Abstract

Artificial Intelligence (AI) is increasingly used to automate a variety of real-world computer vision (CV) applications, such as autonomous vehicle control, facial recognition, and security cameras. Recent research has shown that acoustic vibration can induce real physical motion in cameras, interfering with their internal stabilization mechanisms. Because the motion falls outside the conditions the stabilization system was designed to handle, the system introduces artifacts into the frame, causing AI-based CV models to misclassify, miss targets, or hallucinate objects. Previous work used ultrasonic frequencies (>20 kHz), which restricts attacks to short distances because high frequencies attenuate quickly. In this work, we investigate acoustic attacks using lower frequencies in the audible range (<20 kHz), and we further expand our analysis to include how various image and object features are affected by the attacks. Specifically, we performed physical experiments to demonstrate the viability of our attacks on an off-the-shelf object detection model (YOLO11) by resonating a commercially available camera with various frequencies. Based on our results, we provide insights into several factors that make an AI CV system more vulnerable to these attacks, which could help inform the development of future mitigation strategies.

2601.12913 2026-06-15 cs.AI cs.LG cs.NE replaced

Actionable Interpretability Must Be Defined in Terms of Symmetries

Pietro Barbiero, Mateo Espinosa Zarlenga, Francesco Giannini, Alberto Termine, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra

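Affiliations: University of Oxford; ETH Zurich; University of Cambridge

AI summary: Argues that AI interpretability research has a fundamental problem and proposes that actionable interpretability be defined in terms of four symmetries, in order to formalize interpretable models and unify interpretable reasoning.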

2606.05461 2026-06-15 cs.AI replaced

Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety

Abhinaw Priyadershi, Mandar Pitale, Jelena Frtunikj, Maria Spence

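Affiliations: NVIDIA Corporation; NVIDIA GmbH

AI summary: Targets the evidence-type gap between safety standards for ML-based autonomous driving and the output types of XAI methods; derives 19 testable evidentiary criteria from several safety standards, scores six XAI method classes, finds causal XAI structurally required at three lifecycle stages, and introduces the notion of structural admissibility.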

Comments Accepted at SAFECOMP 2026 Workshops (SASSUR); to appear in Springer LNCS

Abstract

Safety standards for ML-based autonomous driving specify the kind of evidence an assurance case must contain (directed cause-and-effect chains, quantified interventional effects, named root-cause variables), yet the XAI literature is organised by output type and technique family (saliency maps, feature attribution, counterfactuals, causal graphs, language traces). SHAP, the most-recommended ADS XAI method, returns a ranked feature list that no implementation effort can convert into a directed chain (Fig. 1). We name this mismatch the evidence-type gap. From AMLAS, ISO 26262, ISO 21448, and ISO/PAS 8800 we derive 19 testable evidentiary criteria across 7 lifecycle stages with representative clause-cited derivations and score six XAI method classes structurally. Causal XAI emerges as structurally required to satisfy the derived criteria at three stages: hazard identification (+62% rubric gap), incident investigation (+50%), and data management (+50%); the verdict set is stable across thresholds T in (0%, 50%] and survives a worst-case single-cell flip down to T = 25%. At the remaining four stages, correlational or language-based methods are comparable or sufficient. The rubric identifies structural admissibility (necessary but not sufficient for compliance): an admissible method's specific output content may still be wrong, and validating that fidelity (the edges a fitted SCM produces, the cause a trace names) is the open assurance challenge. A single-VLA proof of concept on 1,996 real-world driving clips (79,840 rows, ten splits) is consistent with each method's observed output type matching its rubric prediction. XAI method selection for ADS safety assurance should be driven by lifecycle-stage evidence demand, not by method popularity.

2406.09250 2026-06-15 cs.CV cs.AI cs.LG 版本更新

MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

MirrorCheck: 视觉-语言模型的高效对抗防御

Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Ivan Laptev, Karthik Nandakumar

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； NVIDIA ； École Polytechnique Fédérale de Lausanne（洛桑联邦理工学院）； Michigan State University（密歇根州立大学）

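AI summary: Proposes MirrorCheck, a framework that uses text-to-image models and a randomization strategy to detect and defend against adaptive adversarial attacks on vision-language models.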

AI中文摘要

视觉-语言模型（VLM）越来越容易受到复杂的对抗性攻击，包括专门设计用于绕过现有防御的自适应策略。为了解决这一漏洞，我们提出了MirrorCheck，一个鲁棒且与模型无关的检测框架，在单模态和多模态设置中均能有效运行。MirrorCheck利用文本到图像（T2I）模型从目标模型生成的标题中重建视觉内容，并通过比较原始图像和合成图像之间的特征空间嵌入来评估语义一致性。为了增强对自适应攻击的鲁棒性，MirrorCheck引入了一种随机防御策略，从多样化的模型库中随机选择T2I生成器和图像编码器。此外，我们采用了一种新颖的一次性（OTU）扰动，应用于所选编码器嵌入，并通过缩放因子调节，这降低了自适应攻击的有效性。跨多种威胁场景的大量实验表明，MirrorCheck始终优于基线方法，即使在强自适应对抗条件下也能保持其实用性。

英文摘要

Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings. MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.
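
The detection loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `caption_fn`, the generator and encoder callables, `otu_scale`, and `threshold` are hypothetical stand-ins, and modelling the One-Time-Use perturbation as fresh Gaussian noise per query is an assumption the abstract does not confirm.

```python
import math
import random
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def perturb(embedding: Sequence[float], rng: random.Random, scale: float) -> List[float]:
    """Add noise drawn anew on every call (assumed form of the OTU perturbation)."""
    return [x + rng.gauss(0.0, scale) for x in embedding]


def mirror_check(image, caption_fn: Callable, generators: Sequence[Callable],
                 encoders: Sequence[Callable], rng: random.Random,
                 otu_scale: float = 0.01, threshold: float = 0.7) -> bool:
    """Return True when the input is flagged as likely adversarial."""
    caption = caption_fn(image)        # caption produced by the target model
    generate = rng.choice(generators)  # randomly selected T2I generator
    encode = rng.choice(encoders)      # randomly selected image encoder
    regenerated = generate(caption)    # image resynthesized from the caption
    original_emb = perturb(encode(image), rng, otu_scale)
    regenerated_emb = perturb(encode(regenerated), rng, otu_scale)
    # Low similarity means the caption no longer describes the input image.
    return cosine(original_emb, regenerated_emb) < threshold
```

Because the generator, the encoder, and the noise are all resampled per query, an attacker who optimizes against one fixed pipeline cannot be sure the same pipeline is used at detection time.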

2505.11577 2026-06-15 cs.CY cs.AI 版本更新

The Accountability Paradox: How Platform API Restrictions Undermine AI Transparency Mandates

问责悖论：平台API限制如何削弱AI透明度要求

Florian A. D. Burnat, Brittany I. Davidson

发表机构 * University of Bath（巴斯大学）

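AI summary: Examines the conflict between platform API restrictions and the EU Digital Services Act. The paper proposes an audit framework that exposes blind spots where platform content moderation and algorithmic amplification cannot be verified, identifies a paradox between growing AI reliance and restricted accountability, and recommends federated access models and stronger regulatory enforcement.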

Comments Accepted at ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

DOI: 10.1145/3805689.3812289

AI中文摘要

近期主要社交媒体平台对应用程序编程接口（API）的限制挑战了遵守欧盟数字服务法案[20]的要求，该法案要求数据访问以实现算法透明度。我们开发了一个结构化的审计框架来评估监管要求与平台实施之间的日益增长的不一致。我们对X/Twitter、Reddit、TikTok和Meta的比较分析识别出关键的『审计盲区』，其中平台内容审核和算法放大仍然无法被独立验证。我们的发现揭示了『问责悖论』：随着平台越来越多地依赖AI系统，它们同时限制了独立监督的能力。我们建议与国家标准技术研究院[80]的AI风险管理框架相一致的有针对性的政策干预，强调联邦访问模型和增强的监管执行。

英文摘要

Recent application programming interface (API) restrictions on major social media platforms challenge compliance with the EU Digital Services Act [20], which mandates data access for algorithmic transparency. We develop a structured audit framework to assess the growing misalignment between regulatory requirements and platform implementations. Our comparative analysis of X/Twitter, Reddit, TikTok, and Meta identifies critical "audit blind-spots" where platform content moderation and algorithmic amplification remain inaccessible to independent verification. Our findings reveal an "accountability paradox": as platforms increasingly rely on AI systems, they simultaneously restrict the capacity for independent oversight. We propose targeted policy interventions aligned with the AI Risk Management Framework of the National Institute of Standards and Technology [80], emphasizing federated access models and enhanced regulatory enforcement.

2505.17961 2026-06-15 stat.ME cs.AI math.ST stat.AP stat.TH 版本更新

Models That Know How Evaluations Are Designed Are Safer

Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi

发表机构 * ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center（图宾根ELLIS研究所、图宾根马克斯·普朗克智能系统研究所、图宾根人工智能中心）

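AI summary: Fine-tunes models to acquire meta-knowledge about evaluations (such as verifiable structures or moral dilemmas) and finds that this makes them behave more safely on safety benchmarks, introducing a new confounder that is independent of explicit memorization and evaluation awareness.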

AI中文摘要

AI安全评估的有效性取决于模型在受控环境和部署环境中行为的一致性。先前的研究已经发现测试时的上下文线索（例如假设场景）是口头评估意识和后续行为转变的来源。在本文中，我们研究了这一现象的一个潜在解释：评估元知识，定义为关于评估结构特征的参数化知识。类似于数据集污染（基准暴露通过记忆导致更高性能），我们假设在描述评估实践的文本上训练的模型可能隐式地学会识别和响应类似评估的上下文，例如通过接触关于AI基准测试的科学文章或社交媒体帖子。为了验证这一点，我们在描述评估特征（如可验证结构或道德困境）的合成文档上微调模型。在六个安全基准上评估这个微调模型，我们发现它比基础模型和控制模型显著更安全。即使将分析限制在缺乏明确评估意识口头表达的响应中，这种行为转变仍然存在。我们的结果表明，评估元知识可能夸大安全基准性能，引入了一种独立于显式记忆或口头评估意识的新混淆因素，因此难以检测。这些发现对AI安全评估的设计和解释具有重要意义。我们的代码和模型可在 https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge 获取。

英文摘要

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on five safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness and is thus challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.

2606.00947 2026-06-15 cs.LG cs.AI 版本更新

Silent Failures in Federated Personalization of Foundation Models

联邦基础模型个性化中的静默失败

YongKyung Oh, Alex Bui

发表机构 * Medical & Imaging Informatics (MII) Group, University of California, Los Angeles (UCLA)（医学与影像信息学（MII）组，加州大学洛杉矶分校（UCLA））

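AI summary: Identifies silent failures, a class of trustworthiness failures that arise under privacy constraints in federated personalization of foundation models, including amplified bias, fairness collapse, and alignment erosion. The paper introduces a taxonomy of six silent failure modes and stresses that privacy-preserving training alone is insufficient for trustworthy deployment.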

AI中文摘要

基础模型通过联邦学习在分散的私有数据上越来越个性化，并在日益增长的上市后监管要求下大规模部署。我们认为这种趋同产生了一类独特且未被充分认识的信任失败，我们称之为“静默失败”。这些包括偏差放大、公平性崩溃和对齐侵蚀，这些可能仍然难以检测，因为联邦学习的隐私约束限制了对模型行为的可见性。对现有基准的景观分析揭示了结构性鸿沟。联邦基准评估系统性能，但对模型行为的洞察有限，而集中式信任基准评估行为，但需要与联邦隐私不兼容的模型访问。我们引入了一个由基础模型个性化、数据集偏移和核心联邦约束相互作用产生的六种静默失败模式的分类法。我们的分析表明，仅靠隐私保护训练不足以实现可信部署。最后，我们提出了一个隐私保护行为评估的研究议程，并建议将静默失败作为可信联邦人工智能的标准诊断类别。

英文摘要

Foundation models are increasingly personalized on decentralized private data through federated learning and are now deployed at scale under growing regulatory requirements for post-market monitoring. We argue that this convergence creates a distinct and under-recognized class of trustworthiness failures, which we term "Silent Failures." These include amplified bias, fairness collapse, and alignment erosion that may remain difficult to detect because federated learning's privacy constraints limit visibility into model behavior. A landscape analysis of existing benchmarks reveals a structural divide. Federated benchmarks evaluate system performance but provide limited insight into model behavior, whereas centralized trustworthiness benchmarks assess behavior but require model access incompatible with federated privacy. We introduce a taxonomy of six silent failure modes arising from the interaction of foundation model personalization, dataset shift, and core federated constraints. Our analysis shows that privacy-preserving training alone is insufficient for trustworthy deployment. We conclude with a research agenda for privacy-preserving behavioral evaluation and propose that silent failures become a standard diagnostic category for trustworthy federated artificial intelligence.

2606.02995 2026-06-15 cs.CR cs.AI cs.IR cs.LG 版本更新

Patcher: Post-Hoc Patching of Backdoored Large Language Models

Patcher: 后门大型语言模型的事后修补

Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu, Minghong Fang

发表机构 * University of Louisville（路易斯维尔大学）； University of North Texas（北得克萨斯大学）

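AI summary: Proposes Patcher, a framework that uses only a single failure case and the model parameters to localize backdoor triggers via gradient-based saliency, then removes the trigger-response association through constrained fine-tuning while preserving model utility.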

Comments To appear in the USENIX Security Symposium, 2026

AI中文摘要

大型语言模型仍然容易受到越狱后门攻击，其中对手污染安全对齐数据以嵌入隐藏触发器，从而绕过安全机制。现有防御通常需要全面的攻击信息或多个触发示例，使得当防御者仅观察到单个报告失败案例而不知道其源于后门攻击还是自然对齐错误时，这些防御不切实际。本文提出Patcher，一个事后防御框架，仅使用单个报告失败案例和模型参数来修复后门语言模型。Patcher分两个阶段运行。首先，通过计算基于响应的梯度显著性分数并应用自适应聚类将触发器与良性上下文分离来定位后门触发器。其次，通过约束微调目标修补模型，该目标打破触发-响应关联，同时通过KL散度约束保持良性任务效用和对非触发越狱攻击的鲁棒性。我们在多种后门攻击策略下进行了广泛评估，并证明Patcher成功定位触发器并中和后门，同时保持模型效用。我们进一步展示了针对旨在规避我们防御的自适应攻击的鲁棒性。这项工作代表了向部署语言模型中训练时攻击的实际防御迈出的重要一步。

英文摘要

Large language models remain vulnerable to jailbreak backdoor attacks, where adversaries poison safety alignment data to embed hidden triggers that bypass safety mechanisms. Existing defenses often require comprehensive attack information or multiple triggered examples, making them impractical when defenders only observe a single reported failure case without knowing whether it stems from a backdoor attack or a natural alignment bug. This paper presents Patcher, a post-hoc defense framework that repairs backdoored language models using only a single reported failure case and the model parameters. Patcher operates in two stages. First, it localizes backdoor triggers by computing response-conditioned gradient-based saliency scores and applying adaptive clustering to separate triggers from benign context. Second, it patches the model through a constrained fine-tuning objective that breaks the trigger-response association while preserving benign-task utility and robustness to non-triggered jailbreak attacks through KL-divergence constraints. We conduct extensive evaluations across multiple backdoor attack strategies and demonstrate that Patcher successfully localizes triggers and neutralizes backdoors while maintaining model utility. We further show robustness against adaptive attacks designed to evade our defense. This work represents a significant step toward practical defenses against training-time attacks in deployed language models.

2605.21006 2026-06-15 cs.AI cs.CL cs.LG 版本更新

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

扮演魔鬼的代言人：现成的人格向量在顺从性上与针对性引导相媲美

Ishaan Kelkar, Nebras Alam, Vikram Kakaria, Madhur Panwar, Vasu Sharma, Maheep Chaudhary

发表机构 * University of Toronto（多伦多大学）； Princeton University（普林斯顿大学）； Purdue University（普渡大学）； EPFL（瑞士联邦理工学院）； Algoverse ； Independent（独立）

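AI summary: Studies how different personas affect sycophancy and finds that off-the-shelf persona steering vectors are comparable to targeted steering at reducing sycophancy while maintaining accuracy when the user is correct.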

Journal ref: ICML, Pluralistic Alignment Workshop, 2026

AI中文摘要

我们研究了不同人格对顺从性的影响：模型在用户错误时仍同意用户。标准缓解方法，对比激活添加（CAA），从顺从性和诚实响应的标记对中推导出引导方向。本研究评估了现成的人格引导向量是否能作为替代方案，这些向量最初是为一般角色扮演开发的，且未在顺从性数据上训练。在两个指令微调模型中，引导至以怀疑或审查为特征的人格可将顺从性减少到CAA效果的约68%和98%，且不同于CAA，在用户正确时保持准确性。效果也是不对称的：引导至顺从的人格不会产生镜像增加的顺从性。几何上，人格向量在激活空间的方向上与顺从性方向基本无关。总体而言，这些发现表明，顺从性应被视为人格层面的属性，而非单一可引导方向。我们在此发布代码：https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.

英文摘要

We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately 68% and 98% of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.
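
To make the geometry in this abstract concrete, here is a minimal sketch of a mean-difference steering direction and the cosine comparison between two directions. It assumes the common formulation of CAA (mean activation on one class minus mean activation on the other); the toy activation lists and all function names are hypothetical and are not taken from the paper's code.

```python
import math
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def steering_direction(sycophantic: Sequence[Sequence[float]],
                       honest: Sequence[Sequence[float]]) -> List[float]:
    """Mean sycophantic activation minus mean honest activation."""
    return [s - h for s, h in zip(mean_vector(sycophantic), mean_vector(honest))]


def steer(activation: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Shift an activation along a direction; a negative alpha steers away from it."""
    return [a + alpha * d for a, d in zip(activation, direction)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; values near 0 mean the two directions are nearly orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

A cosine near zero between a persona vector and the sycophancy direction is one way to read the claim that the two are "largely independent" in activation space.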

2606.13715 2026-06-15 cs.AI cs.CL cs.MA 新提交

WorkBench Revisited: Workplace Agents Two Years On

WorkBench 再探：两年后的工作场所智能体

Olly Styles

发表机构 * GitHub

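AI summary: Re-evaluates agent progress on the WorkBench benchmark between 2024 and 2026, finding that frontier models have improved markedly in both capability and safety and that open-weight models have lowered the cost of reaching high performance.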

Comments 8 pages, 3 figures. Follow-up to arXiv:2405.00823

AI中文摘要

2024年3月，WorkBench上表现最好的智能体GPT-4完成了43%的任务，并在26%的任务中采取了意外的有害行为（例如给错误的人发送电子邮件）。我们在2026年6月重新审视该基准，发现迄今为止最好的智能体Claude Opus 4.8完成了89%的任务，并仅在2.5%的任务中采取了意外的有害行为。除了前沿智能体性能的显著进步外，有三点值得注意。首先，在WorkBench上，能力与安全性是相辅相成的，而非相互权衡，因此完成最多任务的模型造成的意外损害也最少。其次，虽然几类错误已被完全消除，但前沿模型仍然会犯一些基本错误，有时会导致不可逆转的损害，例如将电子邮件发送给错误的人。第三，开放权重模型的兴起大幅降低了此前仅专有模型才能达到的性能水平的成本，而前沿模型的成本则保持相对稳定。我们发布了该基准的更新版本，包括数据与代码质量改进、新的模型评分以及自2024年以来WorkBench上智能体进展的分析。

英文摘要

The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unintended damage. Second, while several classes of error have been totally eliminated, frontier models still make some basic mistakes that occasionally result in irreversible harm, such as sending an email to the wrong person. Third, the rise of open-weight models has drastically lowered costs for a performance level that was previously only accessible to proprietary models, while frontier costs have stayed relatively stable. We release an updated version of the benchmark with data and code quality improvements, new model scores, and analysis of agent progress on WorkBench since 2024.

2606.13815 2026-06-15 cs.AI cs.CL 新提交

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

Poker Arena: 大型语言模型中策略推理与记忆的多轴剖析

Pratham Singla, Shivank Garg, Vihan Singh

发表机构 * Indian Institute of Technology Roorkee（印度理工学院罗尔基分校）； Raeth AI

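AI summary: Proposes Poker Arena, a platform that decomposes strategic reasoning using a three-layer memory architecture and a nine-axis cognitive profile, revealing capability structure that scalar leaderboards systematically misrank.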

Comments 33 pages, ICML Workshop

AI中文摘要

不确定性下的策略推理支撑着谈判、金融和政策中的关键决策，但现有的游戏基准将异质推理维度压缩为单一标量，导致前沿LLM的能力结构未被审视。我们引入Poker Arena，一个无限注德州扑克锦标赛平台，该平台将三层记忆架构（手牌内、会话内和跨会话）与九轴认知剖面相结合，将策略推理分解为可解释的维度，如下注规模校准和位置意识。我们在50个会话（每个会话1000手牌）和受控记忆消融实验中评估了七个前沿模型；锦标赛筹码和聚合轴得分对模型进行了不同排序：Claude Opus 4.6赢得+15,730筹码和14次第一名，但在平均轴得分上仅排名第五（共七个），而持久记忆对某些模型有帮助，对另一些则有损害。这些发现表明，多轴评估揭示了标量排行榜系统性误排的能力结构，其中跨维度一致性优于任何单一维度的峰值性能。

英文摘要

Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined. We introduce Poker Arena, a no-limit Texas Hold'em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. We evaluate seven frontier models across 50 sessions of 1,000 hands and a controlled memory ablation; tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 wins +15,730 chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score, while persistent memory helps some models and hurts others. These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.

2606.14031 2026-06-15 cs.AI 新提交

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

治疗性药物-疾病关系的适用条件提取

Guanting Luo, Noriki Nishida, Yuji Matsumoto, Yuki Arase

发表机构 * The University of Osaka（大阪大学）； RIKEN（理化学研究所）； Institute of Science Tokyo（东京科学大学）； Tohoku University（东北大学）

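AI summary: Introduces the task of extracting applicability conditions for therapeutic drug-disease relations from biomedical literature, builds the first manually annotated dataset, and adapts LoRA to account for the relation between drug and disease, outperforming baselines across multiple evaluation settings.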

AI中文摘要

识别某种药物对目标疾病产生治疗效果的适用条件对于临床决策支持至关重要。然而，现有的大多数生物医学信息提取方法仅关注识别药物与疾病之间的关系，而很大程度上忽略了这些关系适用的上下文特定条件。为解决这一问题，我们引入了从生物医学研究文献中提取治疗性药物-疾病关系适用条件的任务。我们创建了首个数据集，在生物医学论文摘要上手动标注了药物、疾病和适用条件的三元组，包含1,119个药物-疾病对。利用该数据集，我们系统评估了一系列现有方法的性能。此外，我们提出了一种新方法，增强LoRA以考虑药物与疾病之间的关系。我们的方法在不同评估设置中均优于强基线。本文的源代码和数据集可从以下网址获取：https://github.com/guantingluo98/Drug-ACE

英文摘要

Identifying the conditions under which a drug has a therapeutic effect on a target disease is crucial for clinical decision support. However, most existing biomedical information extraction methods have focused only on identifying relations between drugs and diseases, while largely overlooking the context-specific conditions under which such relations apply. To address this problem, we introduce the task of applicability condition extraction for therapeutic drug-disease relations from biomedical research literature. We create the first dataset of manually annotated triples of drugs, diseases, and applicability conditions over biomedical paper abstracts, covering 1,119 drug-disease pairs. Using this dataset, we systematically evaluate the performance of a range of existing methods. In addition, we propose a new method that enhances LoRA to consider relations between drugs and diseases. Our method consistently outperforms strong baselines across different evaluation settings. The source code and dataset of this paper can be obtained from: https://github.com/guantingluo98/Drug-ACE

2606.14240 2026-06-15 cs.AI 新提交

AFFORDANCE20Q: Evaluating Affordance Reasoning from Physical Properties

AFFORDANCE20Q：从物理属性评估可承担性推理

Yifan Jiang, Meige Yang, Zitong Li, Jay Pujara

发表机构 * Information Sciences Institute, University of Southern California（南加州大学信息科学研究所）； University of Southern California（南加州大学）

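AI summary: Proposes the Affordance20Q benchmark, which uses a 20-Questions game to evaluate how well models infer object affordances from physical properties. It finds a gap of about 20 points between LLMs and humans and develops KARI, which improves open-source models by up to 15.2 points.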

AI中文摘要

可承担性推理，即从物体的物理属性（如形状和材料）推断其动作可能性，是人类物理理解的基础，对大型语言模型（LLM）也越来越关键。然而，现有的可承担性基准大多在评估设置中暴露明确的物体身份，使模型能够依赖记忆的物体-可承担性映射，而不是基于物理属性进行推理。为弥补这一空白，我们引入了Affordance20Q，这是一个新颖的可承担性推理基准，以20个问题游戏的形式呈现，不暴露物体身份。在每个游戏中，模型通过询问关于物体物理属性的是/否问题，从候选集中识别隐藏物体的可承担性。Affordance20Q包含1,009个游戏，涵盖454个物体和59种可承担性，所有数据均经过手动筛选、细化和标注。我们对15个最先进的LLM进行了全面实验，发现与人类表现相比存在显著差距（约20分）。基于KL的信息增益（IG）分析进一步表明，随着游戏进行，模型未能提出具有区分性的问题。为缩小差距，我们开发了基于知识库锚定的规则归纳（KARI），这是一个基于LLM的流程，用于生成基于知识库（KB）证据的可承担性规则。KARI将开源LLM的性能提升了最多15.2分，而KB的有限覆盖阻碍了进一步的提升。我们在 https://github.com/1171-jpg/Affordance20Q.git 发布所有代码和数据。

英文摘要

Affordance reasoning, the inference of an object's action possibilities from its physical properties (e.g., shape and material), is fundamental to human physical understanding and increasingly critical for Large Language Models (LLMs). However, existing affordance benchmarks largely expose explicit object identities in the evaluation setup, allowing models to rely on memorized object-affordance mappings rather than reasoning over physical properties. To address this gap, we introduce Affordance20Q, a novel affordance reasoning benchmark formulated as a 20-Questions game without exposing the object's identity. In each game, the model identifies a hidden object's affordance from a candidate set by asking yes/no questions about its physical properties. Affordance20Q comprises 1,009 games over 454 objects and 59 affordances, all manually filtered, refined, and annotated. We conduct comprehensive experiments with 15 state-of-the-art LLMs and find a substantial gap (~20 points) compared to human performance. A KL-based information-gain (IG) analysis further shows that models fail to ask discriminating questions as the game progresses. To close the gap, we develop KB-Anchored Rule Induction (KARI), a pipeline based on LLMs that generates affordance rules grounded in evidence from knowledge bases (KBs). KARI improves open-source LLMs by up to 15.2 points, while the limited coverage of KBs hinders further gains. We release all our code and data at https://github.com/1171-jpg/Affordance20Q.git
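
One way to read the KL-based information-gain analysis is as the KL divergence between the belief over candidate affordances after a yes/no answer and the belief before it. The sketch below illustrates that reading only; the abstract does not give the paper's exact formulation, and the function name, the candidate set, and the deterministic answer table are all hypothetical.

```python
import math
from typing import Dict


def kl_information_gain(prior: Dict[str, float],
                        answers: Dict[str, bool],
                        observed: bool) -> float:
    """KL(posterior || prior) in bits after observing one yes/no answer.

    `answers[c]` says whether candidate `c` is compatible with a "yes";
    candidates that disagree with the observed answer are ruled out.
    """
    consistent = {c: p for c, p in prior.items() if answers[c] == observed}
    mass = sum(consistent.values())
    if mass == 0:
        raise ValueError("observed answer is inconsistent with every candidate")
    gain = 0.0
    for p in consistent.values():
        posterior = p / mass  # renormalized belief for a surviving candidate
        if posterior > 0:
            gain += posterior * math.log2(posterior / p)
    return gain
```

Under this reading, a question that splits four equally likely candidates in half yields 1 bit, while a question every candidate answers the same way yields 0 bits, which is what a non-discriminating question looks like.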

2606.14516 2026-06-15 cs.AI cs.CL cs.CY 新提交

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

Every Eval Ever：AI评估结果的统一模式与社区仓库

Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Jon Crall, Tommaso Cerruti, Yanan Long, Yifan Mai, Sanchit Ahuja, Asaf Yehudai, Marek Šuppa, John P. Lalor, Oluwagbemike Olowe, Jatin Ganhotra, Brian H. Hu, Eliya Habba, Andrew M. Bean, Chang Liu, Sander Land, Steven Dillmann, Aniketh Garikaparthi, Elron Bandel, Saki Imai, James Edgell, Wm. Matthew Kennedy, Jenny Chim, Patrick Meusling, Asteria Kaeberlein, Venkata Ramachandra Karthik Chundi, Manasi Patwardhan, Martin Ku, Austin Meek, Leon Knauer, Brian Wingenroth, Srishti Yadav, Usman Gohar, Felix Friedrich, Michelle Lin, Jennifer Mickel, Arman Cohan, Stella Biderman, Irene Solaiman, Zeerak Talat, Anka Reuel, Mubashara Akhtar, Gjergji Kasneci, Avijit Ghosh, Leshem Choshen

发表机构 * Technical University Munich（慕尼黑工业大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）； Weizenbaum Institute（魏岑鲍姆研究所）； Zuse Institute Berlin（柏林祖泽研究所）； Evidence Prime ； Trustible ； Kitware ； ETH Zurich（苏黎世联邦理工学院）； StickFlux Labs ； Stanford University（斯坦福大学）； Northeastern University（东北大学）； IBM Research（IBM研究院）； Comenius University Bratislava（布拉迪斯拉发夸美纽斯大学）； Cisco（思科）； University of Notre Dame（圣母大学）； Hebrew University of Jerusalem（耶路撒冷希伯来大学）； University of Oxford（牛津大学）； Ohio University（俄亥俄大学）； Writer ； TCS Research（塔塔咨询服务研究院）； Oxford University Press（牛津大学出版社）； Queen Mary University of London（伦敦玛丽女王大学）； Technical University Berlin（柏林工业大学）； University of Delaware（特拉华大学）； Cinemo ； Johns Hopkins University（约翰霍普金斯大学）； University of Copenhagen（哥本哈根大学）； ELLIS（欧洲学习与智能系统实验室）； Iowa State University（爱荷华州立大学）； Meta FAIR ； University of Montreal（蒙特利尔大学）； Mila Quebec AI Institute（Mila魁北克人工智能研究所）； EleutherAI ； Yale University（耶鲁大学）； Hugging Face ； University of Edinburgh（爱丁堡大学）； Harvard University（哈佛大学）； ETH AI Center（ETH人工智能中心）； MIT（麻省理工学院）； MIT-IBM Watson Lab（MIT-IBM沃森实验室）

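AI summary: To address inconsistent, hard-to-compare formats for AI evaluation results, proposes the first shared schema and community-crowdsourced repository, unifying results across evaluation frameworks through a standardized representation, automatic converters, and a community database.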

AI中文摘要

AI评估被广泛用于测试和理解进展。然而，多样化的评估工具带来了不一致性，挑战了分析和比较。首先，结果以不兼容的格式保存，分散在排行榜、论文、博客文章、评估工具日志和自定义仓库中。其次，结果由不同的评估框架创建，这些框架对名义上相同的评估产生不同的分数，并且不一致地记录元数据，阻碍了比较、跨社区评估科学、成本降低和重用。我们介绍了Every Eval Ever，这是第一个用于AI评估结果的共享模式和社区众包仓库。该模式标准化了评估在统一的单个JSON文档中的表示方式。它在设计上与源无关，可以摄取来自评估工具和论文的结果，并可选择存储每个实例的输出以进行细粒度分析。我们贡献了：(i) 一个社区治理的元数据模式及其配套的实例级模式，这是同类标准化工作的首次；(ii) 从流行格式、评估工具和排行榜到统一模式的自动转换器；以及 (iii) 一个托管在Hugging Face上的众包社区数据库，目前涵盖22,235个模型、2,273个独特基准和31种评估格式。

英文摘要

AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how evaluations are represented in a unified, single JSON document. It is source-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per-instance outputs for fine-grained analysis. We contribute: (i) a community-governed metadata schema with a companion instance-level schema, the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.

2606.14571 2026-06-15 cs.AI 新提交

HierSVA: A Data Synthesis Pipeline, Dataset, and Benchmark for LLM-Driven Hierarchical Hardware Formal Verification

Maohua Nie, Jiang Zhu, Jingqun Zhang, Zhichen Zeng, Jiayi Wang, Sibo Zhang, Jialin Wang, C. -J. Richard Shi

发表机构 * University of Washington（华盛顿大学）

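AI summary: Proposes the HierSVA suite, comprising a data synthesis pipeline, a dataset, and a benchmark for LLM-driven hierarchical hardware formal verification. It generates SystemVerilog Assertions through RTL preprocessing and an LLM-in-the-loop flow, builds a 342-module dataset, and evaluates assertion quality along six metric axes, exposing the performance and limits of LLMs in hierarchical verification.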

AI中文摘要

我们提出了HierSVA，一个集流水线、数据集和基准于一体的集成套件，用于LLM驱动的层次化硬件形式化验证。HierSVA-SP将RTL预处理工具链与LLM在环形式化验证流程相结合，为层次化RTL生成参考SystemVerilog断言（SVA）。将其应用于BaseJump STL，得到HierSVA-DS数据集，包含342个模块，具有层次元数据和深度0-9，并附带28个模块-错误对的深层子集，包含自然语言规范和错误变体。HierSVA-B将断言质量分解为六个度量轴：语法正确性、断言证明成功率、空洞性、规范忠实度、突变覆盖率和形式化核心覆盖率。将HierSVA-B应用于12个最近的LLM，揭示了三个发现。第一，模块级编译率为67.1%；在可评估运行生成的断言中，82.1%被非空洞地证明，但相应的断言集仅检测到70.2%的可注入故障，并覆盖了36.2%的形式化核心。第二，在深层子集的211个可评估模型-模块条目中，断言集以0.87的召回率标记有错误的RTL，但预测有错误的输出中有40%在正确RTL上是假阳性，将精度限制在0.60。第三，代理模式改善了S1风格的可证明性和强度指标，但增益趋于平稳并振荡。代码和工件可在 https://github.com/HierSVAAnon/HierSVACodeAndArtifacts 获取。数据集可在 https://huggingface.co/datasets/AnonymousHierSVA/HierSVA 获取。

英文摘要

We present HierSVA, an integrated suite that combines a pipeline, dataset, and benchmark for LLM-driven hierarchical hardware formal verification. HierSVA-SP pairs an RTL preprocessing toolchain with an LLM-in-the-loop formal verification flow to produce reference SystemVerilog Assertions (SVA) on hierarchical RTL. Applying it to BaseJump STL yields HierSVA-DS, a dataset of 342 modules with hierarchy metadata and depths 0-9, accompanied by a deep subset of 28 module-bug pairs with natural-language specifications and bug variants. HierSVA-B decomposes assertion quality into six metric axes: syntax correctness, assertion proof success rate, vacuity, specification faithfulness, mutation coverage, and formal core coverage. Applying HierSVA-B to twelve recent LLMs reveals three findings. First, the module-level compile rate is 67.1%; among generated assertions in evaluable runs, 82.1% prove non-vacuously, but the corresponding assertion sets detect only 70.2% of eligible injected faults and cover 36.2% of the formal core. Second, on 211 evaluable model-module entries in the deep subset, assertion sets flag buggy RTL with 0.87 recall, but 40% of predicted-buggy outcomes are false positives on correct RTL, limiting precision to 0.60. Third, agentic mode improves S1-style provability and strength metrics, but gains plateau and oscillate. Code and artifacts are available at https://github.com/HierSVAAnon/HierSVACodeAndArtifacts. The dataset is available at https://huggingface.co/datasets/AnonymousHierSVA/HierSVA.
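
The second finding links two numbers: if 40% of predicted-buggy outcomes are false positives, precision is the remaining 60%. The short sketch below works through that arithmetic. The counts are hypothetical, chosen only so that the ratios reproduce the rates reported in the abstract; they are not the paper's actual confusion matrix.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted-buggy outcomes that are truly buggy."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of truly buggy cases that the assertion sets flag."""
    return tp / (tp + fn)


# Hypothetical counts (assumption): picked to match the reported rates only.
tp, fp, fn = 60, 40, 9

# False positives as a share of everything predicted buggy; equals 1 - precision.
false_positive_share = fp / (tp + fp)
```

With these illustrative counts, precision is 60/100 = 0.60 and recall is 60/69, roughly 0.87, matching the pair of figures quoted above.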

2606.13735 2026-06-15 cs.AR cs.AI cs.LG cs.PL 交叉投稿

VHDLSuite: Unified Pipeline for LLM VHDL Generation with Data Synthesis and Evaluation

VHDLSuite：面向LLM VHDL生成的统一流水线，包含数据合成与评估

Yijun Shen, Minghao Shao, Yichen Zhao, Zhuoyan Yu, Boyuan Chen, Yik-Cheung Tam, Muhammad Shafique

发表机构 * Center for Data Science, NYU Shanghai, China（上海纽约大学数据科学中心）； NYU Tandon School of Engineering, USA（纽约大学Tandon工程学院）； NYU Abu Dhabi, UAE（纽约大学阿布扎比分校）

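AI summary: Proposes VHDLSuite, an infrastructure that addresses gaps in evaluating LLM VHDL generation through automated benchmark synthesis, executable validation, and multi-model diagnostic analysis, and builds VHDLBench, a benchmark of more than 200 problems.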

AI中文摘要

大型语言模型（LLM）在寄存器传输级（RTL）代码生成方面展现了令人印象深刻的能力，尤其是针对Verilog。然而，评估它们在其他硬件描述语言（HDL）上的性能，特别是VHDL，仍然有限，尽管其独特的语言特性（如更严格的语义规则）引入了与Verilog不同的评估考量。这种覆盖不足限制了对当前模型在不同结构和语义的硬件设计语言中泛化能力的全面理解。为弥补这一空白，我们引入了VHDLSuite，一个以基准为中心的可扩展VHDL生成评估基础设施，集成了自动基准合成、可执行验证和多模型诊断分析。首先，我们提出一个数据流水线，自动将Verilog设计及其配套测试平台转换为可执行的VHDL基准实例，随后基于VUnit/GHDL进行验证，确保每个发布的任务在VHDL环境中可编译、可运行且可一致检查。其次，我们引入VHDLBench，一个包含超过200个VHDL问题的基准，配有完整且经过验证的测试平台，覆盖广泛的复杂度级别。第三，我们广泛评估了最先进的LLM，并揭示了LLM辅助VHDL生成中的关键挑战。我们的发现为多语言硬件设计的未来工作提供了重要见解和支持。该数据流水线、基准和评估框架将开源。

英文摘要

Large Language Models (LLMs) have shown impressive capabilities in Register Transfer Level (RTL) code generation, particularly for Verilog. However, evaluation of their performance on other Hardware Description Languages (HDLs), especially VHDL, remains limited, even though VHDL's distinct language characteristics, such as stricter semantic rules, introduce evaluation considerations that differ from Verilog's. This lack of coverage restricts a full understanding of how well current models generalize across hardware design languages with differing structures and semantics. To address this gap, we introduce VHDLSuite, a benchmark-centered infrastructure for scalable VHDL generation evaluation, integrating automated benchmark synthesis, executable validation, and multi-model diagnostic analysis. First, we propose a data pipeline that automatically converts Verilog designs and their accompanying testbenches into executable VHDL benchmark instances, followed by VUnit/GHDL-based validation to ensure each released task is compilable, runnable, and consistently checkable in the VHDL environment. Second, we introduce VHDLBench, a benchmark of over 200 VHDL problems with complete and validated testbenches across a wide range of complexity levels. Third, we extensively evaluate cutting-edge LLMs and uncover key challenges specific to LLM-aided VHDL generation. Our findings provide important insights and support future work in multi-language hardware design automation. Our data pipeline, benchmark, and evaluation framework will be open-sourced.

2606.13757 2026-06-15 cs.CR cs.AI 交叉投稿

Mirage Probes: How Vision Models Fake Visual Understanding

Daniel Ben-Levi, Judah Goldfeder, Weiliang Zhao, Raz Lapid, Amit LeVi, Allen G. Roush, Ravid Shwartz-Ziv, Hod Lipson

发表机构 * Columbia University（哥伦比亚大学）； Intuit ； Technion（以色列理工学院）； Thoughtworks ； New York University（纽约大学）

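AI summary: Proposes the Mirage Probes framework, which uses contrastive probes to reveal two kinds of mirage behavior that let vision-language models answer questions without an image, namely textual biases and spurious images, and shows that the latter requires representation-level intervention.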

AI中文摘要

视觉语言模型（VLM）即使在没有提供图像的情况下，也能自信且通常正确地回答基于图像的问题。这种幻象行为会虚增基准分数，而不反映视觉基础。先前的工作将其视为单一故障模式。我们认为这是两种。使用幻象探针（Mirage Probes），一种对比探针框架，将释义的问题变体与同一图像上的匹配幻象和非幻象标签配对，我们展示了在两个开源VLM中，幻象行为可以从残差流、MLP、后注意力和注意力头位置的内部激活中线性解码。我们证明朴素贝叶斯文本基线无法恢复此信号，排除了表面词汇混淆。跨基准可分离性模式，连同一种新颖的先验利用指数（PHI），衡量模型仅从文本中回答的程度，揭示了两种不同的机制：文本偏见，其中模型从语言先验中回答而不涉及视觉表征；以及虚假图像，其中模型在潜在空间中构建虚假视觉内容并像有基础一样回答。这种区别有直接的缓解后果：文本分布清理可以解决第一种机制，但无法触及第二种，因为虚假图像幻象存在于模型的视觉表征中而非文本中。忠实的视觉基础将需要在表征层面进行干预。

英文摘要

Vision-language models (VLMs) can answer image-based questions confidently, and often correctly, even when no image is provided. This mirage behavior inflates benchmark scores without reflecting visual grounding. Prior work treats this as a single failure mode. We argue it is two. Using Mirage Probes, a contrastive probing framework that pairs paraphrased question variants with matched mirage and non-mirage labels on the same image, we show that mirage behavior is linearly decodable from internal activations across residual stream, MLP, post-attention, and attention-head sites in two open-source VLMs. We demonstrate that a Naive Bayes text baseline cannot recover this signal, ruling out surface lexical confounds. Cross-benchmark separability patterns, together with a novel Prior Harnessing Index (PHI) measuring how much a model can answer from text alone, expose two distinct regimes: textual biases, where the model answers from language priors without engaging visual representations, and spurious images, where it constructs false visual content in latent space and answers as if grounded. The distinction has direct mitigation consequences: text-distribution cleaning can address the first regime but cannot reach the second, since spurious-image mirages live in the model's visual representations rather than its text. Faithful visual grounding will require interventions at the representational level.

2606.13896 2026-06-15 cs.CV cs.AI 交叉投稿

How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?

自监督遥感视觉模型如何迁移到下游任务？

Julia Romero, Qin Lv, Morteza Karimzadeh

发表机构 * University of Colorado Boulder（科罗拉多大学博尔德分校）

AI总结研究六种代表性自监督地理空间基础模型（GeoFMs）在下游任务中的迁移表现，发现模型排名随任务和适应设置变化，中间层特征比最终层更相关，且解码器设计等适应设置影响与模型选择相当。

AI中文摘要

自监督地理空间基础模型（GeoFMs）从遥感数据中学习可迁移表示，但其下游行为难以表征。我们研究了涵盖联合嵌入、重建和多模态预训练家族的六种代表性GeoFMs，并在不同标签可用性和下游流水线下评估了分类、回归和分割基准的迁移性能。我们发现模型排名随任务和适应设置而变化。逐层探针显示，在大多数情况下，与任务相关的信息在中间Transformer块中比在最终层嵌入中更容易获取，并且GeoFMs表现出不同的深度分布特征。在PASTIS和Sen1Floods11上的分割案例研究中，解码器设计和微调等下游适应设置可能与GeoFM的选择同样重要，且标准密集预测头可能与GeoFM在深度上组织信息的方式不一致。最后，案例研究中的CKA分析表明，微调不会均匀地重写GeoFMs的深度，最强的变化集中在ViT块中MLP的第一个线性层。这些结果有助于解释为什么GeoFM排名在不同基准之间发生变化，并激励更具表示意识的评估和适应策略。

英文摘要

Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines. We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles. In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks. These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.
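
The CKA comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes the linear variant of centered kernel alignment (the abstract does not say which kernel is used), and the arrays `before`, `after_rotated`, and `after_random` are hypothetical stand-ins for layer activations, not data from the paper.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features)."""
    # Center every feature column so mean offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by self-similarity.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


rng = np.random.default_rng(0)
before = rng.normal(size=(200, 32))  # hypothetical activations before fine-tuning

# Same representation up to an orthogonal rotation and a global rescaling.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
after_rotated = 3.0 * before @ rotation

# Unrelated activations, standing in for a layer that was heavily rewritten.
after_random = rng.normal(size=(200, 32))

same_score = linear_cka(before, after_rotated)
different_score = linear_cka(before, after_random)
print(f"rotated copy: {same_score:.3f}, unrelated: {different_score:.3f}")
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling, so a layer that fine-tuning only rotates or rescales keeps a score near 1, while a layer whose content changes scores lower. Computing this per layer, before versus after fine-tuning, is one way to locate where changes concentrate.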

2606.13904 2026-06-15 cs.CL cs.AI cs.DB 交叉投稿

SANA: What Matters for QA Agents over Massive Data Lakes?

SANA：大规模数据湖上的问答代理关键因素是什么？

Austin Senna Wijaya, Jiaxiang Liu, Haonan Wang, Eugene Wu

发表机构 * Columbia University（哥伦比亚大学）

AI总结提出SANA诊断框架，通过消融实验分析数据湖探索式问答中搜索、规划、数据分析及行动策略的失败原因，揭示数据分析是主要瓶颈。

Comments 9 pages, 7 figures

AI中文摘要

数据湖上的探索式问答（EQA）需要LLM代理发现相关源、分析检索数据并根据中间结果调整其行动。端到端准确率无法区分搜索、规划、数据分析或代理的行动策略（即下一步做什么以及何时提交答案的决策）中的失败。我们提出了SANA（搜索代理导航消融框架），这是一个诊断性消融框架，将EQA任务转化为包含黄金源序列、清洗后子问题和执行记录的运行时配置文件。SANA利用这些配置文件构建理想化的搜索、规划和数据分析工具，从而允许对每个组件进行消融；残差是策略失败的诊断证据。为了说明SANA作为一个可复用的评估框架，我们改编了两个最近的EQA基准测试LakeQA和KramaBench，并在固定提示、预算、数据湖和运行时下评估了轻量级和中型代理。在两个基准测试中，数据分析始终是瓶颈，而规划则不那么明显。搜索在LakeQA的大数据湖设置中是主要限制，但在较小规模的KramaBench中则不那么突出。因此，SANA将端到端任务准确率分解为数据湖代理失败原因的诊断，并允许系统比较搜索、规划、数据分析和代理设计方面的进展。

英文摘要

Exploratory question answering (EQA) over data lakes requires an LLM agent to discover relevant sources, analyze retrieved data, and adapt its actions based on intermediate results. End-to-end accuracy alone cannot distinguish failures in search, planning, data analysis, or the agent's Action Policy: its decisions about what to do next and when to submit an answer. We present SANA (Search Agent Navigation Ablation framework), a diagnostic ablation framework that transforms EQA tasks into runtime profiles containing gold source sequence, sanitized subquestions, and execution records. SANA uses these profiles to construct idealized search, planning, and data-analysis tools, allowing each component to be ablated; the residual gap is diagnostic evidence for policy failures. To illustrate SANA as a reusable evaluation framework, we adapted two recent EQA benchmarks, LakeQA and KramaBench, and evaluated lightweight and mid-sized agents under fixed prompts, budgets, data lakes, and runtimes. Across both benchmarks, data analysis is a consistent bottleneck while planning is less so. Search is a major limitation in LakeQA's large data-lake setting, but less so for the smaller-scale KramaBench. SANA thus deconstructs end-to-end task accuracies into a diagnosis of where data-lake agents fail, and allows for systematic comparisons of progress in search, planning, data analysis, and agent design.

2606.13994 2026-06-15 cs.CR cs.AI cs.LG 交叉投稿

Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DECOMPBENCH

隐于无形：使用DECOMPBENCH基准测试代理安全对抗分解攻击

Vikhyath Kothamasu, Virginia Smith, Chhavi Yadav

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Simons Institute, UC Berkeley（Simons研究所，伯克利大学）

AI总结提出DeCompBench基准，通过分解攻击将有害任务拆分为良性子任务，揭示现有代理安全机制在对抗分解攻击时的脆弱性。

AI中文摘要

基于LLM的代理变得越来越强大且广泛部署，在现实世界中造成了日益增长的对抗性滥用动机。一个关键的新兴威胁是分解攻击\cite{glukhov2024breach, jones2024adversaries}，其中有害任务被分解为更简单、良性的子任务，这些子任务单独执行时能规避安全机制，但累积起来却实现了恶意意图。尽管最近的基准测试评估了代理在多轮和多工具使用设置中的安全性，但它们并未明确捕捉这种形式的分解滥用，且可能无法代表现实的对抗性执行流程。为此，我们引入了DeCompBench，这是一个专门设计用于评估分解攻击下代理安全性的基准。DeCompBench采用分解即设计原则，使用图形框架创建，能够将有害任务分解为单独良性且可执行的子任务，并具有现实的工作流程。我们使用自定义分解器的实验表明，最先进的代理在整体有害任务上表现出高拒绝率，但在其分解变体上拒绝率显著降低，同时往往无意中实现了对抗性目标。这些发现强调了针对分解攻击进行安全性评估及相应防御的必要性。我们的数据集已公开，可在以下网址获取：https://huggingface.co/datasets/decompositionbench/DeCompBench

英文摘要

LLM-based Agents are becoming increasingly capable and widely deployed, creating growing incentives for adversarial misuse in the real-world. A key emerging threat is Decomposition Attacks \cite{glukhov2024breach, jones2024adversaries} in which a harmful task is broken into simpler, benign subtasks that evade safety mechanisms when executed separately but cumulatively fulfill the malicious intent. Although recent benchmarks assess agent safety in multi-turn and multi-tool-use settings, they do not explicitly capture this form of decompositional misuse and may not represent realistic adversarial execution flows. To this end, we introduce DeCompBench, a benchmark designed specifically to evaluate agentic safety under decomposition attacks. DeCompBench is created with a decomposition-by-design principle using a graphical framework and enables harmful task decomposition into individually benign and executable subtasks with realistic workflows. Our experiments using a custom decomposer show that state-of-the-art agents exhibit high refusal rates on monolithic harmful tasks, but significantly lower refusal rates on their decomposed variants, while often inadvertently fulfilling the adversarial objectives. These findings underscore the need for safety evaluations against decomposition attacks and corresponding defenses. Our dataset is publicly available and can be found at https://huggingface.co/datasets/decompositionbench/DeCompBench.

2606.14094 2026-06-15 cs.CV cs.AI 交叉投稿

SIMMER: 基于世界模型评估LLM可执行规划中的潜在故障

Xiaoxin Lu, Ranran Haoran Zhang, Rui Zhang

发表机构 * The Pennsylvania State University（宾夕法尼亚州立大学）

AI总结提出SIMMER基准，通过人工策划的厨房领域符号世界模型，评估LLM规划中的潜在故障；实验发现前沿模型最多17%无错误计划，56%含潜在故障，多数不可逆；反事实预演可减少72%潜在故障和75%不可逆案例。

AI中文摘要

大型语言模型（LLM）越来越多地被部署为家庭环境中自主代理的规划器。虽然现有基准评估LLM生成的计划是否成功执行，但它们忽略了一种关键类型的故障：潜在故障。与立即故障（在执行时触发即时反馈并允许及时纠正）不同，潜在故障不会立即停止计划执行，而是悄无声息地损害目标实现。在严重情况下，它们会导致不可逆的损害。为弥补这一空白，我们引入了SIMMER，这是一个通过人工策划的、基于厨房领域的符号世界模型来评估LLM规划中潜在故障的基准。SIMMER定义了一个世界模型，包含77个动作、262个独特对象和约46,800种语义真实的可能交互，这些交互源自真实世界的烹饪脚本。然后，它利用一个状态机执行器，根据世界模型验证计划，并检测即时前提违规、潜在危险和不可逆故障。在六个LLM上的实验表明，即使是最前沿的模型，其无错误计划最多也只有17%。此外，高达56%的计划包含潜在故障，其中大多数导致不可逆后果。我们进一步证明，通过反事实预演进行显式状态推理可以将潜在故障减少高达72%，不可逆案例减少高达75%，这为更鲁棒的LLM规划器指明了一个有前景的方向。

英文摘要

Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real-world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error-free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.

2606.14591 2026-06-15 cs.SD cs.AI 交叉投稿

AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

AudioDER: 一种用于后训练大型音频语言模型的去重增强推理数据集

Hui Geng, Yi Su, Han Yin, Tianjiao Wan, Qisheng Xu, Jiaxin Chen, Zijian Gao, Hengzhu Liu, Xie Chen, Kele Xu

发表机构 * College of Computer Science and Technology, National University of Defense Technology（国防科技大学计算机科学与技术学院）； Korea Advanced Institute of Science and Technology (KAIST)（韩国科学技术院）； Shanghai Jiaotong University（上海交通大学）

AI总结针对现有音频-语言数据集冗余导致后训练效果下降的问题，提出基于声学相似性去重的数据构建流程，生成包含191k样本的推理导向数据集AudioDER，显著提升LALM在多个音频推理基准上的性能。

AI中文摘要

大型音频语言模型（LALMs）在广泛的音频理解任务上表现出色，但在复杂音频推理方面仍存在困难。提升此类能力的一种实用方法是后训练，其有效性关键取决于训练数据的质量和多样性。然而，现有的音频-语言数据集通常包含大量冗余，其中许多样本在声学内容上高度相似，从而提供重叠的监督信号。这种冗余不仅增加了标注成本，还限制了语料库的多样性，降低了后训练的效果。为解决此问题，我们提出了一种冗余感知的数据构建流程，用于为LALMs构建面向推理的监督。具体来说，我们首先基于声学相似性对原始音频数据集进行去重，以提高语料库的多样性。然后，我们将现有的音频描述和问答对整合为统一的多项选择格式。基于这些统一标注，我们利用Qwen3-30B生成思维链（CoT）推理过程，以提供面向推理的监督。基于此流程，我们构建了AudioDER，一个面向推理的后训练数据集，包含约191k个样本，涵盖声音、语音和音乐。每个样本包括一个音频片段、一个多项选择问题、四个候选答案、一个音频描述和一个CoT推理过程。大量实验表明，在AudioDER上进行后训练持续提升了Qwen2-Audio-7B-Instruct在多个音频推理基准上的性能，包括MMAU-mini、MMSU和MMAR。我们希望AudioDER能够成为推动音频推理研究和开发更强大LALMs的宝贵资源。

英文摘要

Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quality and diversity of training data. However, existing audio-language datasets often contain substantial redundancy, where many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. Based on these unified annotations, we leverage Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. Based on this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.

2606.14604 2026-06-15 cs.LG cs.AI 交叉投稿

快速思考：估计前沿AI模型的无思维链任务完成时间范围

Dewi Gould, Francis Rhys Ward, Anders Cairns Woodruff, Rauno Arike, Josh Hills, Alex Serrano, Ida Caspary, Jason Ross Brown, Jo J. Jiao, Patrick Leask, Twm Stone, Ram Potham, Ionut Gabriel Stan, Harry Mayne, Simeon Hellsten, Shubhorup Biswas, Ariana Azarbal, William L. Anderson, Elle Najt, Ryan Greenblatt, Julian Stastny

发表机构 * Redwood Research（红木研究）； Astra Fellows Program（Astra 后援计划）； Aether Research（Aether 研究）； MATS Research（MATS 研究）； Polytechnic University of Catalonia（加泰罗尼亚理工大学）； Imperial College London（伦敦帝国理工学院）； University of Cambridge（剑桥大学）； University of Chicago（芝加哥大学）； Durham University（杜伦大学）； MIT（麻省理工学院）； University of Oxford（牛津大学）； University of Glasgow（格拉斯哥大学）； Constellation（星座）

AI总结本研究通过超过3万个问题测试前沿AI模型在无思维链推理下的表现，估计其50%任务完成时间范围，发现该时间约每年翻一番，GPT-5.5已达3分钟以上。

AI中文摘要

许多确保前沿AI模型安全的努力依赖于监控其思维链（CoT）推理。如果模型能够在没有显式思考令牌的情况下内部执行足够复杂的推理，这将破坏这种监督。我们测量了前沿模型在无CoT情况下的推理能力，涉及超过3万个问题，涵盖数学、编程、谜题、因果推理、心理理论和策略推理等领域的43个基准测试。为了将模型与人类进行比较，我们估计了50%任务完成时间范围（TH）：模型以50%成功率完成的任务所需的人类时间。我们还补充了50%推理令牌范围：模型以50%成功率解决的任务所需的最小o3-mini推理令牌数。我们发现，过去六年中，前沿模型的无CoT 50% TH大约每年翻一番，GPT-5.5的TH超过3分钟，推理令牌范围超过1500个令牌。我们的中位数估计预测，到2028年，前沿无CoT TH可能超过7分钟，到2030年超过25分钟，尽管这些预测存在很大的不确定性。我们建议前沿开发者明确跟踪这一指标。

英文摘要

Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the $50\%$-task-completion time horizon (TH): the human time required for tasks a model completes with $50\%$ success rate. We complement this with a $50\%$ reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with $50\%$ success rate. We find that the no-CoT $50\%$ TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend frontier developers track this explicitly.
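
As a rough consistency check on the reported trend, the doubling time implied by the two projections in the abstract can be worked out directly. This is only a back-of-the-envelope sketch: it treats the "exceed 7 minutes by 2028" and "exceed 25 minutes by 2030" thresholds as point values and assumes pure exponential growth between them, neither of which the abstract states. The function names are illustrative.

```python
import math


def implied_doubling_time(h_start: float, h_end: float, years: float) -> float:
    """Doubling time (in years) if a horizon grows exponentially from h_start to h_end."""
    return years * math.log(2) / math.log(h_end / h_start)


def project(h0: float, years_ahead: float, doubling_years: float) -> float:
    """Extrapolate a horizon h0 forward under a fixed doubling time."""
    return h0 * 2 ** (years_ahead / doubling_years)


# Thresholds quoted in the abstract, treated here as point estimates (an assumption).
doubling = implied_doubling_time(h_start=7.0, h_end=25.0, years=2030 - 2028)
print(f"implied doubling time: {doubling:.2f} years")

# Sanity check: projecting 2028 forward by two years recovers the 2030 figure.
print(f"projected 2030 horizon: {project(7.0, 2.0, doubling):.1f} minutes")
```

Under these assumptions the implied doubling time is a little over one year, which is in line with the "doubling roughly every year" statement in the English abstract.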

2601.04646 2026-06-15 cs.IR cs.AI cs.CL cs.LG 版本更新

Succeeding at Scale: Enterprise Retrieval Benchmark Construction and Index-Preserving Query Adaptation for Multi-Tenant Search

规模化成功：面向多租户搜索的企业检索基准构建与索引保持查询适配

Prateek Jain, Shabari S Nair, Ritesh Goru, Prakhar Agarwal, Ajay Yadav, Yoga Sri Varshan Varadharajan, Constantine Caramanis

AI总结针对多租户检索系统中标注数据匮乏和模型更新成本高的问题，提出全自动构建基准DevRev-Search，并研究仅微调查询编码器而保持文档索引不变的索引保持查询适配策略，实现质量与效率的平衡。

AI中文摘要

大规模多租户检索系统生成大量查询日志，但缺乏用于有效领域适应的精心策划的相关性标签，导致大量“暗数据”未被充分利用。模型更新的高成本加剧了这一挑战，因为联合微调查询和文档编码器需要完整的语料库重新索引，这在拥有数千个独立索引的多租户环境中是不切实际的。我们引入了DevRev-Search，这是一个通过完全自动化管道构建的技术客户支持段落检索基准。候选生成使用跨多种稀疏和密集检索器的融合，随后使用LLM作为评判器进行一致性过滤和相关性标记。我们进一步研究并系统评估了索引保持查询适配策略，该策略仅微调查询编码器，同时保持文档索引固定。在DevRev-Search、SciFact和FiQA-2018上的实验表明，参数高效的查询编码器微调提供了显著的质量-效率权衡，实现了可扩展且实用的企业多租户检索。

英文摘要

Large-scale multi-tenant retrieval systems generate extensive query logs but lack curated relevance labels for effective domain adaptation, resulting in substantial underutilized "dark data." This challenge is compounded by the high cost of model updates, as jointly fine-tuning query and document encoders requires full corpus re-indexing, which is impractical in multi-tenant settings with thousands of isolated indices. We introduce DevRev-Search, a passage retrieval benchmark for technical customer support built via a fully automated pipeline. Candidate generation uses fusion across diverse sparse and dense retrievers, followed by an LLM-as-a-Judge for consistency filtering and relevance labeling. We further study and systematically evaluate index-preserving query-only adaptation strategies that fine-tune only the query-encoder while keeping the document indices fixed. Experiments on DevRev-Search, SciFact, and FiQA-2018 show that parameter-efficient fine-tuning of the query encoder delivers a remarkable quality-efficiency trade-off, enabling scalable and practical enterprise multi-tenant retrieval.

2601.15828 2026-06-15 cs.CL cs.AI 版本更新

Can professional translators identify machine-generated text?

专业翻译人员能否识别机器生成的文本？

Michael Farrell

发表机构 * IULM University Milan Italy（米兰IULM大学）

AI总结通过实验研究无专门训练的专业翻译人员识别AI生成短篇故事的能力，发现少数人（16.2%）能准确区分，但多数依赖主观印象导致误判，低突发性和叙事矛盾是可靠指标。

Comments Pages 581 to 591, Volume 1, proceedings of the 26th Annual Conference of the European Association for Machine Translation, 2026

AI中文摘要

本研究调查了未经专门训练的专业翻译人员能否可靠地识别由人工智能（AI）生成的意大利语短篇故事。69名翻译人员参加了一项现场实验，评估了三篇匿名短篇故事——两篇由ChatGPT-4o生成，一篇由人类作者撰写。对于每篇故事，参与者评估了AI作者身份的可能性并提供了选择理由。虽然平均结果不明确，但有一个统计上显著的子集（16.2%）成功区分了合成文本与人类文本，表明他们的判断基于分析技能而非偶然。然而，几乎相同数量的人以相反方向错误分类了文本，通常依赖主观印象而非客观标记，这可能反映了读者对AI生成文本的偏好。低突发性和叙事矛盾成为合成作者身份最可靠的指标，同时报告了意外的仿译、语义借用和来自英语的句法迁移。相比之下，语法准确性和情感基调等特征经常导致误分类。这些发现对专业语境中合成文本编辑的作用和范围提出了疑问。

英文摘要

This study investigates whether professional translators without prior specialized training can reliably identify short stories generated in Italian by artificial intelligence (AI). Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.

2603.05167 2026-06-15 cs.CL cs.AI 版本更新

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

C2-Faith: 为思维链推理中的因果和覆盖忠实性基准测试LLM评判者

Avni Mittal, Rauno Arike

发表机构 * SPARAI

AI总结提出C2-Faith基准，通过因果和覆盖两个维度评估LLM评判者对思维链推理过程忠实性的判断能力，发现模型在错误定位和覆盖评分上存在显著不足。

AI中文摘要

大型语言模型（LLM）越来越多地被用作思维链（CoT）推理的评判者，但目前尚不清楚它们能否可靠地评估过程忠实性，而不仅仅是答案的合理性。我们引入了C2-Faith，这是一个基于PRM800K构建的基准，明确将忠实性分解为两个互补维度：因果性（每一步是否逻辑上源自先前上下文）和覆盖性（是否包含必要的中间推理）。通过受控扰动，我们构建了具有已知因果错误位置的示例，将单个步骤替换为逻辑不一致的变体，并以不同速率进行受控覆盖删除，从而能够直接根据参考标签进行测量。我们评估了三个前沿的LLM评判者在三项任务上的表现：二元因果检测、因果步骤定位和覆盖评分。我们的结果表明，评判者的可靠性高度依赖于任务，没有单一模型在所有设置中占主导地位。虽然模型通常能检测到错误存在，但它们难以准确定位错误，这表明检测与归因之间存在显著差距。此外，所有评判者都系统性地高估了推理完整性，即使中间推理的很大部分缺失，也会给出高覆盖分数。这些发现揭示了LLM评判者在过程级评估中的根本局限性，并强调了在使用LLM评估推理质量时需要更可靠和校准的方法。

英文摘要

Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that explicitly decomposes faithfulness into two complementary dimensions: causality (whether each step logically follows from prior context) and coverage (whether essential intermediate inferences are present). Using controlled perturbations, we construct examples with known causal error positions by replacing a single step with a logically inconsistent variant, and with controlled coverage deletions at varying rates, enabling direct measurement against reference labels. We evaluate three frontier LLM judges across three tasks: binary causal detection, causal step localization, and coverage scoring. Our results reveal that judge reliability is highly task-dependent, with no single model dominating across settings. While models often detect that an error exists, they struggle to accurately localize it, indicating a substantial gap between detection and attribution. Moreover, all judges systematically overestimate reasoning completeness, assigning high coverage scores even when substantial portions of intermediate reasoning are missing. These findings expose fundamental limitations of LLM judges in process-level evaluation and highlight the need for more reliable and calibrated methods when using LLMs to assess reasoning quality.

2603.23530 2026-06-15 cs.CL cs.AI cs.LG 版本更新

Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

你忘记我问什么了吗？大型语言模型中的前瞻记忆失败

Avni Mittal

发表机构 * University of Washington（华盛顿大学）

AI总结本研究通过认知心理学中的前瞻记忆视角，发现大型语言模型在执行复杂任务时，格式化指令的遵从率下降2-21%，并提出了显著性增强格式来恢复遵从性。

AI中文摘要

大型语言模型在必须同时执行要求较高的任务时，常常无法满足格式化指令。我们通过认知心理学中的前瞻记忆视角，使用一个受控范式来研究这种行为，该范式将可验证的格式化约束与复杂度递增的基准任务相结合。在三个模型家族和超过8000个提示中，在并发任务负载下，遵从性下降了2-21%。脆弱性高度依赖于类型：终端约束（需要在响应边界采取行动）下降最多，高达50%，而避免约束相对稳健。显著性增强格式（显式指令框架加上尾部提醒）恢复了大量丢失的遵从性，在许多设置中将性能恢复到90-100%。干扰是双向的：格式化约束也可能降低任务准确性，其中一个模型的GSM8K准确率从93%下降到27%。在额外的堆叠实验中，随着约束的累积，联合遵从性急剧下降。所有结果均使用确定性程序化检查器，无需LLM作为评判组件，并在公开可用的数据集上进行。

英文摘要

Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective memory inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.

2604.07530 2026-06-15 cs.DL cs.AI cs.CY cs.SI 版本更新

The Shrinking Lifespan of LLMs in Science

科学领域中LLM的生命周期缩短

Ana Trišović

发表机构 * Computer Science & Artificial Intelligence Laboratory（计算机科学与人工智能实验室）； Massachusetts Institute of Technology（麻省理工学院）

AI总结本研究通过分析62个LLM在超过10.8万篇引用论文中的科学采纳轨迹，发现模型的生命周期主要由发布年份决定，且每个后续发布年份的峰值时间和寿命分别缩短27%和23%。

AI中文摘要

缩放定律描述了语言模型能力如何随计算和数据增长，但未说明模型发布后能持续多久。我们引入峰值时间和寿命作为模型过时的度量，并利用它们刻画62个LLM在超过10.8万篇引用论文（2019-2025年）中的科学采纳轨迹，将主动采纳与背景引用分离，以恢复引用计数无法解析的每个模型轨迹。我们发现，模型的寿命更多地由其发布时间而非特征决定：发布年份比架构、开放性或规模更能预测峰值时间和寿命。LLM的采纳遵循倒U型曲线（发布后上升、达到峰值然后下降），但这种模式正在迅速压缩。每个后续发布年份与峰值时间缩短27%和寿命缩短23%相关（p < 0.001），这一结果对最小年龄阈值和模型规模控制具有稳健性。这些采纳侧动态对缩放定律不可见，表明专注于任何单一模型可能是一项贬值的投资，其成本落在可重复性和迁移上。

英文摘要

Scaling laws describe how language model capabilities grow with compute and data, but say nothing about how long a model matters once released. We introduce time-to-peak and lifespan as measures of model obsolescence and use them to characterize the scientific adoption trajectories of 62 LLMs across more than 108k citing papers (2019-2025), separating active adoption from background citation to recover per-model trajectories that citation counts cannot resolve. We find that a model's longevity is shaped more by when it was released than by its characteristics: release year predicts time-to-peak and lifespan more strongly than architecture, openness, or scale. LLM adoption follows an inverted-U curve (rising after release, peaking, and then declining), but this pattern is rapidly compressing. Each successive release year is associated with a 27% shorter time-to-peak and a 23% shorter lifespan ($p < 0.001$), robust to minimum-age thresholds and controls for model size. These adoption-side dynamics are invisible to scaling laws and suggest that specialization on any single model may be a depreciating investment, with costs falling on reproducibility and migration.

2604.14892 2026-06-15 cs.LG cs.AI 版本更新

TwinBI：一种用于与商业智能仪表盘高效增强交互的智能数字孪生

Jisoo Jang, Wen-Syan Li

发表机构 * Graduate School of Data Science, Seoul National University（首尔大学数据科学研究生院）

AI总结提出TwinBI框架，通过LLM代理与可执行仪表盘状态耦合，统一对话、操作、语义和溯源，提升多步分析中状态一致性，将精确匹配准确率从43.3%提升至63.3%，超时率从40%降至10%。

AI中文摘要

商业智能（BI）越来越多地将仪表盘交互与基于LLM的辅助相结合，但这两种模式在多步分析中常常不同步。当用户在直接仪表盘操作和自然语言查询之间切换时，很难在过滤器、层次结构、指标和图表上下文中保持一致的分析状态。我们提出TwinBI，一种智能数字孪生框架，将基于LLM的代理系统与可执行的BI仪表盘状态耦合。TwinBI通过从统一交互日志重建的共享分析状态，统一了对话交互、仪表盘操作、语义基础和溯源追踪。它还公开了诸如模式视图、SQL、日志和/insights命令等工件，用于基于状态的分析摘要。我们通过两种互补方式评估TwinBI。在相同骨干代理的受控A/B基准测试中，与仅使用仪表盘相比，TwinBI将精确匹配准确率从43.3%提高到63.3%，部分信用准确率从48.3%提高到70.8%，并显著将超时率从40.0%降低到10.0%。在可用性研究中，参与者受益于集成的仪表盘和聊天工作流，任务准确性高，工作负载适中，对状态感知交互机制评价良好。这些结果表明，TwinBI通过将可见的仪表盘状态转化为更丰富的可操作上下文，提高了代理级别的分析可靠性和面向用户的分析支持。我们的数据集和源代码可在以下网址获取：https://github.com/simonjisu/TwinBI

英文摘要

Business intelligence (BI) increasingly combines dashboard interaction with LLM-based assistance, but these two modes often fall out of sync during multi-step analysis. As users switch between direct dashboard manipulation and natural-language queries, it becomes difficult to preserve a consistent analytical state across filters, hierarchies, metrics, and chart context. We present TwinBI, an agentic digital-twin framework that couples an LLM-based agent system with an executable BI dashboard state. TwinBI unifies conversational interaction, dashboard manipulation, semantic grounding, and provenance tracking through a shared analytical state reconstructed from a unified interaction log. It also exposes artifacts such as schema views, SQL, logs, and an /insights command for state-grounded analytical summaries. We evaluate TwinBI in two complementary ways. In a controlled A/B benchmark with the same backbone agent, TwinBI improves exact-match accuracy from 43.3% to 63.3%, partial-credit accuracy from 48.3% to 70.8%, and substantially reduces timeout rate from 40.0% to 10.0% relative to Dashboard alone. In a usability study, participants benefited from the integrated dashboard-and-chat workflow, with high task accuracy, moderate workload, and favorable ratings for state-aware interaction mechanisms. These results suggest that TwinBI improves both agent-level analytical reliability and user-facing analytical support by turning visible dashboard state into richer actionable context. Our dataset and source code are available at: https://github.com/simonjisu/TwinBI

2606.13871 2026-06-15 cs.AI cs.DB 新提交

Hyperdimensional computing for structured querying on tabular data embeddings

超维计算用于表格数据嵌入的结构化查询

Sebastián Bugedo, Stijn Vansummeren

发表机构 * UHasselt, DSI Diepenbeek（哈塞尔特大学，数据科学研究所迪彭贝克）

AI总结针对表格嵌入缺乏可解释相似度的问题，提出基于超维计算（HDC）的框架，利用全息简化表示模型实现结构化查询，推导出等值与非等值谓词的闭式期望相似度，支持可靠零匹配检测。

Comments 15 pages with appendices. 8 figures. Under review

AI中文摘要

表格数据嵌入已成为数据分析和数据集成管道的基石，支持实体注释与解析、模式匹配、列类型检测以及表格搜索等任务。现有方法将行、列或整个表格嵌入向量空间，并依赖最近邻搜索来检索候选匹配。当前嵌入方法的一个根本局限性是缺乏可解释的相似度分数：查询与其最近邻之间的具体相似度值没有内在含义，因此无法确定该邻居是真正匹配还是只是语料库中无有效答案时最不相似的项目。这种无法为检索设置原则性阈值的问题阻碍了实际部署，特别是对于零匹配检测。我们研究了超维计算（HDC）的使用，特别是全息简化表示（HRR）模型，作为当检索任务对应于在向量空间中回答结构化选择-投影查询时的表格行嵌入框架。利用HDC操作的代数性质，我们推导出等值和非等值检索谓词的闭式期望相似度值，这些值随着维度的增加收敛到可解释的值，并利用这些值来识别合适的检索阈值。我们在两个真实世界数据集上，针对不同表格大小和谓词长度，将HDC与基于图的基线EmbDI进行了评估。结果表明，HDC在所有配置下的行检索中与EmbDI相当或更优，更稳健地处理非等值谓词，并在足够维度下实现完美的属性投影准确性——同时通过其原则性阈值独特地实现了可靠识别零匹配谓词。

英文摘要

Tabular data embeddings have become a cornerstone of data profiling and data integration pipelines, enabling tasks such as entity annotation and resolution; schema matching; column type detection; and table search, among others. Existing approaches embed rows, columns, or entire tables into a vector space and rely on nearest-neighbor search to retrieve candidate matches. A fundamental limitation of current embedding methods is the lack of interpretable similarity scores: the concrete similarity value between a query and its nearest neighbour carries no intrinsic meaning, making it impossible to determine whether that neighbour is a true match or simply the least-dissimilar item in a corpus that contains no valid answer. This inability to set principled thresholds for retrieval undermines practical deployment, particularly for zero-match detection. We investigate the use of HyperDimensional Computing (HDC), specifically the Holographic Reduced Representations (HRR) model, as a framework for tabular row embeddings when the retrieval task corresponds to answering structured select-project queries in vector space. Exploiting the algebraic properties of HDC operations, we derive closed-form expected similarity values for both equality and non-equality retrieval predicates, which converge to interpretable values as dimensionality increases, and use these to identify suitable retrieval thresholds. We evaluate HDC against EmbDI, a graph-based baseline, on two real-world datasets across varying table sizes and predicate lengths. Our results show that HDC matches or outperforms EmbDI for row retrieval across all configurations, handles non-equality predicates more robustly, and achieves perfect attribute projection accuracy at sufficient dimensionality -- while uniquely enabling reliable identification of zero-match predicates through its principled thresholds.
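
To make the HRR mechanics concrete, here is a minimal sketch of encoding one table row and answering a single-attribute projection query. It is not the paper's implementation: the attribute and value names, the dimensionality `d`, and the cutoff `threshold` are hypothetical choices for illustration, and the cutoff in particular is picked by hand rather than taken from the closed-form expected similarities the abstract describes.

```python
import numpy as np

d = 4096  # hypothetical dimensionality; larger d gives less noisy similarities
rng = np.random.default_rng(0)


def random_hv() -> np.ndarray:
    """Random hypervector with i.i.d. N(0, 1/d) entries, the usual HRR choice."""
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=d)


def bind(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """HRR binding: circular convolution, computed through the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=a.shape[0])


def involution(a: np.ndarray) -> np.ndarray:
    """Approximate inverse for unbinding: a*[i] = a[-i mod d]."""
    return np.concatenate(([a[0]], a[:0:-1]))


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical codebooks for attribute names and cell values.
attrs = {name: random_hv() for name in ("city", "country", "year")}
values = {name: random_hv() for name in ("Paris", "France", "2020", "Berlin")}

# A row is the superposition (sum) of its bound attribute-value pairs.
row = (
    bind(attrs["city"], values["Paris"])
    + bind(attrs["country"], values["France"])
    + bind(attrs["year"], values["2020"])
)

# Projection query: unbind the "city" attribute to get a noisy copy of its value.
probe = bind(row, involution(attrs["city"]))
sim_match = cosine(probe, values["Paris"])   # value actually stored in the row
sim_miss = cosine(probe, values["Berlin"])   # value absent from the row

threshold = 0.25  # hand-picked illustrative cutoff, not a derived value
print(f"match: {sim_match:.3f}, miss: {sim_miss:.3f}")
print("Paris accepted:", sim_match > threshold, "| Berlin accepted:", sim_miss > threshold)
```

The similarity for the stored value sits well above the near-zero similarity for an absent value, and the gap tightens around its expectation as `d` grows. That separation is what makes a fixed, interpretable cutoff plausible, including rejecting every candidate when nothing in the row matches.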

2606.13916 2026-06-15 cs.AI 新提交

CisTransCell：通过基因功能、调控控制和细胞上下文进行单细胞扰动预测

Wei Zhang, Xun Jiang, Yuesi Xi, Ming Tang

发表机构 * [q-bio.GN]

AI总结提出CisTransCell框架，结合调控序列和编码序列先验与细胞表达状态，建模扰动响应级联，实现零样本单细胞扰动预测。

AI中文摘要

预测细胞对遗传扰动的转录反应是单细胞生物学中的一个核心问题，尤其是在零样本设置中，扰动基因或基因组合在训练中未见。一个主要困难是扰动效应不仅由表达状态决定：它们取决于扰动基因产物如何影响其他基因和蛋白质，这些下游因子如何作用于顺式调控元件，以及当前细胞状态中哪些调控程序活跃。为了更好地捕捉这种生物复杂性，我们提出了CisTransCell，一个用于单细胞扰动预测的细胞条件多模态框架，它为每个基因补充了两个互补先验：一个调控序列先验，捕捉基因如何被调控；一个编码序列先验，捕捉基因产物做什么。通过将这些先验与细胞表达状态整合，CisTransCell将扰动响应建模为从基因功能到调控控制再到下游转录变化的级联。在基准单细胞扰动数据集上的实验表明，CisTransCell在零样本扰动预测中取得了强劲性能。

英文摘要

Predicting cellular transcriptional responses to genetic perturbations is a central problem in single-cell biology, especially in the zero-shot setting where the perturbed gene or gene combination is unseen during training. A major difficulty is that perturbation effects are not determined by expression state alone: they depend on how the perturbed gene product influences other genes and proteins, how those downstream factors act on cis-regulatory elements, and which regulatory programs are active in the current cell state. To better capture this biological complexity, we propose CisTransCell, a cell-conditioned multi-modal framework for single-cell perturbation prediction that augments each gene with two complementary priors: a regulatory-sequence prior that captures how the gene is controlled, and a coding-sequence prior that captures what the gene product does. By integrating these priors with cellular expression state, CisTransCell models perturbation response as a cascade from gene function to regulatory control to downstream transcriptional change. Experiments on benchmark single-cell perturbation datasets show that CisTransCell achieves strong performance in zero-shot perturbation prediction.

2606.13742 2026-06-15 cs.LG cs.AI physics.comp-ph physics.flu-dyn stat.ML 交叉投稿

A fully GPU-based workflow for building physics emulators of hypersonic flows

基于全GPU工作流构建高超声速流物理仿真器

Fabian Paischer, Dylan Rubini, Deniz A. Bezgin, Aaron B. Buhendwa, David Hauser, Florian Sestak, Johannes Brandstetter, Sebastian Kaltenbach, Nikolaus A. Adams

发表机构 * TU Munich（慕尼黑工业大学）； Institute for Machine Learning, JKU Linz（林茨约翰·开普勒大学机器学习研究所）； ELLIS Unit（ELLIS单元）； EMMI AI

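AI summary: Proposes a fully GPU-based workflow that integrates accelerated data generation with the training of neural emulators augmented by uncertainty quantification, using the differentiable solver JAX-Fluids for residual-driven refinement that improves physical consistency and supports extrapolation.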

Comments First authors contributed equally

AI中文摘要

以高保真度和低计算成本解析复杂物理现象的能力是解决现代工程关键挑战的核心。一个典型例子是高超声速流，其中精确预测全流场拓扑，特别是激波位置和强度，至关重要。然而，超声速和高超声速流仍然是传统降阶模型和神经仿真器的绊脚石，这些模型难以在工业相关应用中物理一致地捕捉流态中的陡峭梯度。为此，我们引入了一个完全基于GPU的工作流，该工作流将加速数据生成与通过不确定性量化和物理感知细化增强的神经仿真器训练相结合。我们的工作流由可微高保真求解器（JAX-Fluids）实现，我们利用该求解器进行快速数据集创建和基于残差的神经仿真器改进，以增强物理一致性。在此框架基础上，我们首先提出了一系列模型架构，并分析了它们的缩放行为以揭示其优缺点。然后，我们表明基于残差的细化使得能够在仅提供网格和输入参数的情况下进行训练，显著降低残差并提高物理一致性。可微仿真和基于残差的细化共同产生了在其训练分布之外仍然可靠的物理仿真器，这是在现实工程设计循环中部署代理的关键要求。

英文摘要

The ability to resolve complex physical phenomena with high fidelity and at low computational cost is central to addressing key challenges in modern engineering. A prime example lies in hypersonic flows, where the precise prediction of the full flowfield topology, in particular with respect to shock wave location and intensity, is critical. Yet supersonic and hypersonic flows continue to be a stumbling block for traditional reduced-order models and neural emulators that struggle to capture steep gradients in flow states with physical consistency in applications of industrial relevance. To that end, we introduce a fully GPU-based workflow that integrates accelerated data generation with the training of neural emulators augmented by uncertainty quantification and physics-aware refinement. Our workflow is enabled by a differentiable high-fidelity solver (JAX-Fluids), which we employ for rapid dataset creation and residual-based improvement of the neural emulator to enhance physical consistency. Building on this framework, we first present a suite of model architectures and analyze their scaling behavior to expose their strengths and shortcomings. We then show that residual-based refinement enables training on cases where only mesh and input parameters are available, substantially reducing residuals and improving physical consistency. Together, differentiable simulation and residual-based refinement yield physics emulators that remain reliable beyond their training distribution, a key requirement for deploying surrogates in real-world engineering design loops.

2606.13794 2026-06-15 eess.SY cs.AI cs.RO cs.SY 交叉投稿

An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts

过驱动飞行器的可解释控制效能学习与非线性控制分配集成方法

Umut Demir, Aamir Ahmad, Walter Fichter

发表机构 * University of Stuttgart, Faculty of Aerospace Engineering and Geodesy, Institute of Flight Mechanics and Control (iFR)（斯图加特大学航空航天工程与大地测量学院飞行力学与控制研究所）

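AI summary: Proposes learning the control effectiveness mapping via sparse identification of nonlinear dynamics, combined with an online adaptation mechanism, to achieve efficient nonlinear control allocation for overactuated aircraft that is both interpretable and computationally cheap.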

AI中文摘要

非线性动力学以及多个执行器之间产生的强耦合削弱了传统线性控制分配技术背后的假设。当飞行进入非线性效应主导的模态时，线性分配器因模型失配增加而精度下降，进而降低飞行控制系统的性能和鲁棒性。高保真机载模型和黑箱数据驱动方法可以在整个飞行包线内恢复精度，但分别带来实时分配难以承受的计算负担，并牺牲了验证和故障诊断所需的可解释性。本文通过使用稀疏非线性动力学辨识从代表性飞行数据中学习显式的、受物理约束的控制效能映射解析模型，解决了这些限制。所得映射紧凑、可解释，并允许解析导数，从而能够在非线性求解器中高效计算，同时额外包含执行器动力学，无需机载模型。在线自适应机制监控预测残差，并在检测到显著对象变化时刷新模型，从而在执行器故障和变化工况下提供平滑重构。该方法在一款高保真非线性基准飞行器上经过一系列激进机动评估，达到了与完整非线性机载模型相当的精度，同时相对于现有基线显著降低了计算成本。

英文摘要

Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional, linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High-fidelity onboard models and black-box data-driven approaches can recover accuracy across the flight envelope, but the former impose computational burdens prohibitive for real-time allocation, while the latter sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics-constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high-fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.

2606.13854 2026-06-15 cs.HC cs.AI 交叉投稿

SpheriCity: Designing Trustworthy Conversational AI for Sustainability Decision Support

SpheriCity：为可持续发展决策支持设计可信赖的对话式AI

Ahmed Qayyum, Madison Werner, Kathryn Youngblood, Jenna R. Jambeck, Tahiya Chowdhury

发表机构 * Department of Computer Science, Colby College（科里尔学院计算机科学系）； Circularity Informatics Lab, University of Georgia（佐治亚大学循环信息实验室）

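AI summary: Presents SpheriCity, a provenance-based conversational AI prototype that uses structured synthesis and interaction scaffolds to support trustworthy knowledge access from city circularity assessment reports, addressing the transparency and trust problems of large language models in high-stakes sustainability settings.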

Comments Accepted to ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS '26)

DOI: 10.1145/3811242.3819112

AI中文摘要

我们提出了SpheriCity，一种基于专家知识的对话式原型，旨在支持从可持续性报告中可信地获取知识。城市级循环性评估报告包含关于材料、基础设施和政策干预的丰富信息，但其长度和异构结构使得从事循环经济倡议的从业者和研究人员难以进行跨文档综合和比较。虽然大型语言模型（LLM）有望实现更快速的知识获取和综合，但其不透明的推理、幻觉和缺乏来源透明度给信任和可解释性带来了风险，并且在高风险的可持续性背景下需要验证。SpheriCity通过一种以来源为先的对话式代理来应对这些挑战，该代理强调证据可追溯性、结构化合成和交互支架，以支持跨可持续性报告的探索性查询和跨文档综合。我们与六位可持续性专家进行了形成性专家评审，使用了涵盖跨城市比较、政策总结和推荐导向任务的代表性查询。专家们从多个维度评估了回答，并提供了关于系统对可持续性知识工作有用性的定性反思。我们的结果表明，透明的来源、上下文解释、可解释性以及与专家工作流程的一致性强烈影响专家对系统有用性的信任和判断。这项工作贡献了（1）一个用于可持续性知识理解的对话式原型，（2）一个用于评估高风险知识领域中AI回答的基于专家的评估框架，以及（3）关于来源、不确定性沟通和工作流程整合如何影响专家用户对AI辅助可持续性决策支持信任的设计见解。

英文摘要

We present SpheriCity, an expert-grounded conversational prototype designed to support trustworthy knowledge sensemaking from sustainability reports. City-level circularity assessment reports contain rich information about materials, infrastructure, and policy interventions, yet their length and heterogeneous structure make cross-document synthesis and comparison difficult for practitioners and researchers working on circular economy initiatives. While large language models (LLMs) promise faster knowledge access and synthesis, their opaque reasoning, hallucinations, and lack of source transparency introduce risks for trust and interpretability, and require verification in high-stakes sustainability contexts. SpheriCity addresses these challenges through a provenance-first conversational agent that foregrounds evidence traceability, structured synthesis, and interaction scaffolds to support exploratory querying and cross-document synthesis across sustainability reports. We conducted a formative expert review with six sustainability experts using representative queries spanning cross-city comparison, policy summarization, and recommendation-oriented tasks. Experts evaluated responses across multiple dimensions and provided qualitative reflections on the system's usefulness for sustainability knowledge work. Our results reveal that transparent sourcing, contextual explanation, interpretability, and alignment with expert workflow strongly shape expert trust and judgments of system usefulness. This work contributes (1) a conversational prototype for sustainability knowledge sensemaking, (2) an expert-grounded evaluation framework for assessing AI responses in high-stakes knowledge domains, and (3) design insights into how provenance, uncertainty communication, and integration in workflow influence expert users' trust in AI assistance for sustainability decision support.

2606.13858 2026-06-15 cs.IR cs.AI 交叉投稿

Mood-Aware Music Recommendation: Integrating User Affective Signals into Ranking Systems

情绪感知音乐推荐：将用户情感信号融入排序系统

Terence Zeng, Abhishek K. Umrawal

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

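AI summary: Proposes a mood-conditioned ranking framework that integrates user affective signals into recommendation via softmax sampling in the energy-valence space; single-blind experiments indicate improved recommendation quality.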

Comments 13 pages, 4 figures, and 1 table

AI中文摘要

推荐系统在现代音乐流媒体平台中至关重要，因为可用内容数量巨大。虽然协同过滤被广泛用于根据具有相似模式的其他用户的偏好来推荐项目，但在用户-项目交互稀疏的领域（如音乐）中表现不佳。基于内容的过滤是一种替代方法，它检查项目本身的属性。已有研究探索了流派、乐器和歌词；然而，对情感识别的关注相对较少。由于用户的情绪状态强烈影响其音乐选择，融入情绪信号为个性化提供了有前景的方向。在这项工作中，我们提出了一种情绪条件排序框架，通过能量-效价空间中的softmax采样将用户情感信号融入推荐过程。我们通过单盲实验评估该方法，参与者将所提系统的推荐与基线进行比较。结果表明感知推荐质量有所提升，为将基于情绪的输入融入音乐推荐的有效性提供了初步证据。

英文摘要

Recommendation systems are essential in modern music streaming platforms due to the vast amount of available content. While collaborative filtering is widely used to suggest items based on the preferences of others with similar patterns, it performs poorly in domains where user-item interactions are sparse, such as music. Content-based filtering is an alternative approach that examines the qualities of the items themselves. Genre, instrumentation, and lyrics have been explored; however, relatively little attention has been given to emotion recognition. Since a user's emotional state strongly influences their music choice, incorporating mood signals offers a promising direction for personalization. In this work, we propose a mood-conditioned ranking framework that integrates user affective signals into the recommendation process via softmax-based sampling in the energy-valence space. We evaluate the approach via single-blind experiments in which participants compare recommendations from the proposed system against a baseline. The results indicate improved perceived recommendation quality, providing preliminary evidence for the effectiveness of incorporating mood-based inputs into music recommendations.
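
One way to picture softmax-based sampling in the energy-valence space is sketched below. It assumes the user's mood and every track are points in a two-dimensional (energy, valence) plane and that closer tracks should be sampled more often; the Euclidean distance, the temperature value, and the function names are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def mood_softmax_probs(user_mood, track_moods, temperature=0.2):
    """Softmax over negative distances in (energy, valence) space."""
    distances = np.linalg.norm(track_moods - user_mood, axis=1)
    logits = -distances / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def sample_tracks(user_mood, track_moods, k, temperature=0.2, seed=0):
    """Draw k distinct track indices, favouring tracks near the user's mood."""
    rng = np.random.default_rng(seed)
    probs = mood_softmax_probs(user_mood, track_moods, temperature)
    return rng.choice(len(track_moods), size=k, replace=False, p=probs)

# Hypothetical data: each row is (energy, valence) in [0, 1]
track_moods = np.array([[0.90, 0.80],
                        [0.20, 0.10],
                        [0.85, 0.75],
                        [0.50, 0.50]])
user_mood = np.array([0.80, 0.80])

probs = mood_softmax_probs(user_mood, track_moods)
picked = sample_tracks(user_mood, track_moods, k=2)
print(probs, picked)
```

A lower temperature concentrates the samples on the nearest tracks, while a higher one gives a more exploratory list.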

2606.13968 2026-06-15 cs.DC cs.AI 交叉投稿

STREAM: Multi-Tier LLM Inference Middleware with Dual-Channel HPC Token Streaming

STREAM：具有双通道 HPC 令牌流的多层 LLM 推理中间件

Anas Nassar, Steve Mohr, Leonard Apanasevich, Himanshu Sharma

发表机构 * Advanced Cyberinfrastructure for Education and Research (ACER) University of Illinois Chicago（高级教育与研究计算基础设施（ACER）伊利诺伊大学芝加哥分校）

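AI summary: Presents the STREAM system, which achieves sub-second TTFT through a three-tier routing architecture (local, HPC, cloud) and dual-channel HPC streaming (separate control and data planes), addressing the inability of existing systems to unify the three inference settings.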

Comments 6 pages, 1 figure, PEARC '26

AI中文摘要

研究人员和从业者在使用大型语言模型时面临碎片化局面：本地模型免费且私密，但硬件限制了可用的模型大小和上下文窗口；机构 HPC 中心提供强大的 GPU 资源且无边际成本，并将数据保留在机构边界内，但运行在防火墙后且专为批处理作业而非交互使用设计；商业云 API 按需提供前沿模型质量，但带来显著成本和不适合敏感研究数据的数据保留策略。现有系统无法统一这三者。STREAM（智能分层路由引擎）通过四项贡献解决了这一差距：（1）三层路由架构，结合本地、HPC 和云推理，并配备基于本地 LLM 的复杂度判断器；（2）双通道 HPC 流架构，将 Globus Compute 控制平面（认证和作业调度）与 WebSocket 中继数据平面（令牌传递）分离，实现亚秒级 TTFT（中位数 0.54 秒，比批处理模式的 11.40 秒快 21.1 倍），通过机构防火墙无需 VPN 或防火墙规则更改，端到端 AES-256-GCM 加密确保中继操作员无法读取令牌负载；（3）层级感知的上下文摘要，防止长对话将简单查询强制推送到昂贵层级；（4）HPC 即 API 代理模式，将 HPC 推理暴露为与 OpenAI 兼容的端点，可从任何标准客户端调用，无需 HPC 专业知识，这种部署模式仅因贡献（2）的亚秒级 TTFT 而变得实用。Llama 3.2 3B 在跨越十个领域的 1,200 个查询基准测试中实现了 85.1% 的免费层级保留率。测量的 TTFT：本地 0.26 秒，HPC（中继）0.54 秒，云 1.68 秒。

英文摘要

Researchers and practitioners working with large language models face a fragmented landscape: local models are free and private but hardware limits the model size and context windows a researcher can use; institutional HPC centers offer powerful GPU resources at no marginal cost and keep data within institutional boundaries, but operate behind firewalls and are designed for batch jobs rather than interactive use; commercial cloud APIs provide frontier-model quality on demand but impose significant cost and data retention policies unsuitable for sensitive research data. No existing system unifies all three. STREAM (Smart Tiered Routing Engine for AI Models) addresses this gap with four contributions: (1) a three-tier routing architecture combining local, HPC, and cloud inference with a local LLM-based complexity judge; (2) a dual-channel HPC streaming architecture that separates the Globus Compute control plane (authentication and job dispatch) from a WebSocket relay data plane (token delivery), enabling sub-second TTFT (0.54 s median, 21.1x over batch mode's 11.40 s) through institutional firewalls without VPN or firewall rule changes, with end-to-end AES-256-GCM encryption ensuring the relay operator cannot read token payloads; (3) tier-aware context summarization that prevents long conversations from forcing simple queries onto expensive tiers; and (4) an HPC-as-API proxy mode that exposes HPC inference as an OpenAI-compatible endpoint callable from any standard client with no HPC expertise, a deployment pattern made practical only by the sub-second TTFT of contribution (2). Llama 3.2 3B achieves 85.1% free-tier retention on a 1,200-query benchmark spanning ten domains. Measured TTFT: 0.26 s local, 0.54 s HPC (relay), 1.68 s cloud.
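
The reported latency figures can be checked with a few lines of arithmetic, alongside a toy version of the tier routing. The TTFT numbers come from the abstract; the complexity labels and the label-to-tier mapping are hypothetical stand-ins for the LLM-based judge, whose actual interface is not described here.

```python
# Median time-to-first-token (TTFT) per tier, in seconds, as reported in the abstract
TTFT_SECONDS = {"local": 0.26, "hpc": 0.54, "cloud": 1.68}
BATCH_MODE_TTFT_SECONDS = 11.40

# Hypothetical mapping from a judged complexity label to an inference tier
TIER_FOR_COMPLEXITY = {"low": "local", "medium": "hpc", "high": "cloud"}

def route(complexity_label: str) -> str:
    """Pick a tier for a query given its (hypothetical) complexity label."""
    return TIER_FOR_COMPLEXITY[complexity_label]

def relay_speedup() -> float:
    """Ratio of batch-mode TTFT to relay-streamed HPC TTFT."""
    return BATCH_MODE_TTFT_SECONDS / TTFT_SECONDS["hpc"]

print(route("medium"), round(relay_speedup(), 1))
```

Dividing 11.40 s by 0.54 s gives roughly 21.1, which matches the speedup quoted for the relay data plane over batch mode.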

2606.14157 2026-06-15 cs.LG cs.AI 交叉投稿

Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport

通过逆最优传输从起点-终点流中学习城市访问成本

Paula Joy B. Martinez

发表机构 * GitHub

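AI summary: Proposes inverse optimal transport models that recover latent choice costs from school-to-school enrollment flows; applied to 283,016 learner trips in the Philippines, the approach estimates a subsidy-equivalent distance to inform urban service allocation.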

Comments Oral Presentation. 2026 International Conference on Urban AI

AI中文摘要

城市通过混合公私设施网络提供基本服务，包括学校、诊所、交通提供者和补贴服务点。在这些系统中，规划者通常观察到家庭去哪里，但看不到他们权衡距离、价格和机构访问等因素的潜在成本函数。我们通过菲律宾的学校选择来研究这个城市问题，该国最大的国家教育补贴旨在将学习者从拥挤的公立学校转移到参与计划的私立学校。将学校到学校的入学流视为熵最优传输计划，我们使用两种互补的逆最优传输模型恢复潜在选择成本：一个带有补贴项的可解释距离带模型，以及一个通过可微分Sinkhorn前向传递训练的神经成本模型。应用于人口最多地区23,820条观测流中的283,016次学习者出行，该框架估计了一个补贴等效距离$\lambda^{(k)}$，解释为补贴抵消的感知旅行成本公里数。该案例展示了如何将行政起点-终点数据转化为可解释的规划指标，用于可访问性感知的补贴设计、设施选址和城市服务分配。

英文摘要

Cities deliver basic services through mixed public-private facility networks, including schools, clinics, transit providers, and subsidized service points. In these systems, planners often observe where households go, but not the latent cost function through which they trade off factors such as distance, price, and institutional access. We study this urban problem through school choice in the Philippines, where the country's largest national education subsidy is intended to redirect learners from congested public schools to participating private schools. Treating school-to-school enrollment flows as an entropic optimal transport plan, we recover latent choice costs using two complementary inverse optimal transport models: an interpretable distance-banded model with a subsidy term, and a neural cost model trained through a differentiable Sinkhorn forward pass. Applied to 283,016 learner trips across 23,820 observed flows in the most populated region, the framework estimates a subsidy-equivalent distance, $\lambda^{(k)}$, interpreted as the kilometers of perceived travel cost offset by the subsidy. The case demonstrates how administrative origin-destination data can be transformed into interpretable planning metrics for accessibility-aware subsidy design, facility siting, and urban service allocation.
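
The forward half of this setup, computing an entropic optimal transport plan from a cost matrix, can be sketched with plain Sinkhorn iterations. The distance bands, the band costs, the pair-level subsidy indicator, and the value of `lam` below are all hypothetical; the inverse step (fitting the cost parameters so that the plan matches observed flows) is not shown.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=1.0, n_iter=500):
    """Entropic OT plan with row marginals a and column marginals b."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (kernel @ v)    # rescale rows to match a
        v = b / (kernel.T @ u)  # rescale columns to match b
    return u[:, None] * kernel * v[None, :]

# Hypothetical toy data: 3 origin schools, 2 destination schools
distance_km = np.array([[1.0, 4.0],
                        [2.5, 2.0],
                        [6.0, 0.5]])
band_edges = [2.0, 5.0]                 # bands: <2 km, 2-5 km, >=5 km
band_cost = np.array([0.0, 1.5, 4.0])   # hypothetical cost per distance band
subsidy_applies = np.array([[0, 1],
                            [0, 1],
                            [0, 0]])    # hypothetical pair-level indicator
lam = 1.0                               # hypothetical subsidy offset

bands = np.digitize(distance_km, band_edges)
cost = band_cost[bands] - lam * subsidy_applies

a = np.array([0.5, 0.3, 0.2])  # share of learners leaving each origin
b = np.array([0.6, 0.4])       # share of learners arriving at each destination
plan = sinkhorn_plan(cost, a, b)
print(plan.round(3))
```

The subsidy indicator is kept at the origin-destination pair level on purpose: with both marginals fixed, a cost shift that depends only on the destination is absorbed by the column scaling and leaves the plan unchanged.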

2606.14297 2026-06-15 cs.CV cs.AI 交叉投稿

Pix2Pix-Hybrid: Structure-Guided Conditional Synthesis of Hajj Crowd Images with Multi-Channel Conditioning and Weak Attribute Supervision

Pix2Pix-Hybrid: 结构引导的多通道条件与弱属性监督的朝觐人群图像条件合成

Amirah F. Alshammari, Bander A. Alzahrani, Nahed A. Alowidi

发表机构 * King Abdulaziz University（阿卜杜勒阿齐兹国王大学）； Jouf University（焦夫大学）

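AI summary: Proposes Pix2Pix-Hybrid, a conditional GAN that synthesizes Hajj crowd images conditioned on multi-channel structural cues and contextual attributes for data augmentation, improving synthesis quality while reducing manual labeling, and shows that the synthetic data improves crowd-counting models.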

AI中文摘要

开发准确的朝觐场景人群计数模型仍然具有挑战性，因为领域特定的标注图像稀缺，且大型集会期间的数据收集引发隐私问题。为解决这些限制，本文提出Pix2Pix-Hybrid (P2P-H)，一种用于结构引导的朝觐人群图像合成和数据增强的混合条件GAN。P2P-H基于Pix2Pix，采用U-Net生成器，以八个输入通道为条件，这些通道联合编码结构线索（边缘和灰度）和上下文属性（人群密度和一天中的时间）。为了捕捉密集场景中的详细纹理，该框架集成了两个在不同分辨率下运行的多尺度PatchGAN判别器。训练过程结合了对抗、感知和特征匹配目标，并采用自适应数据增强和稳定化策略。该模型在从60个公开视频源收集的993个真实朝觐帧上训练，条件属性自动推导以减少人工标注工作量。利用该框架，我们构建了CrowdH，一个包含10,000张高分辨率朝觐人群图像的合成数据集。实验结果表明，与Pix2Pix和StyleGAN2-ADA基线相比，P2P-H提高了结构保持的条件合成质量，并显示出对其他人群数据集的良好迁移性。为了评估下游实用性，我们进一步构建了CrowdH-Mix-469，一个包含384张真实朝觐图像和85张精选合成图像的标注混合真实-合成数据集，并在仅真实和真实加合成训练下评估了五个计数模型。精选的合成数据在所有五个模型上均降低了MAE，其中CSRNet的提升最为显著。

英文摘要

Developing accurate crowd-counting models for Hajj pilgrimage scenes remains challenging because domain-specific annotated images are scarce and data collection during large gatherings raises privacy concerns. To address these limitations, this paper proposes Pix2Pix-Hybrid (P2P-H), a hybrid conditional GAN for structure-guided Hajj crowd-image synthesis and data augmentation. P2P-H builds on Pix2Pix and employs a U-Net generator conditioned on eight input channels that jointly encode structural cues (edges and grayscale) and contextual attributes (crowd density and time of day). To capture detailed textures in dense scenes, the framework integrates two multi-scale PatchGAN discriminators operating at different resolutions. The training procedure combines adversarial, perceptual, and feature-matching objectives with adaptive data augmentation and stabilization strategies. The model was trained on 993 real Hajj frames collected from 60 publicly available video sources, with conditioning attributes derived automatically to reduce manual labeling effort. Using this framework, we constructed CrowdH, a synthetic dataset of 10,000 high-resolution Hajj crowd images. Experimental results show that P2P-H improves structure-preserving conditional synthesis quality compared with Pix2Pix and StyleGAN2-ADA baselines and shows favorable transfer to other crowd datasets. To assess downstream utility, we further constructed CrowdH-Mix-469, an annotated mixed real-synthetic dataset comprising 384 real Hajj images and 85 selected synthetic images, and evaluated five crowd-counting models under real-only and real-plus-synthetic training. The selected synthetic data reduced MAE across all five models, with the strongest gain observed for CSRNet.

2606.14306 2026-06-15 cs.HC cs.AI 交叉投稿

Thinking Outside the [Chat]Box: Bridging Computer Science and Industrial Design for Cognitive-Inclusive Generative AI

跳出聊天框：融合计算机科学与工业设计的认知包容性生成式人工智能

Virginia Francisco, Daniel Guasch, Raquel Hervás

发表机构 * University of California, Berkeley（加州大学伯克利分校）

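AI summary: To address the high cognitive demands of current GenAI interfaces and the barriers they create for people with intellectual disabilities, a cross-disciplinary design challenge yields a dual-layer framework that combines structural scaffolding from computer science with experiential scaffolding from industrial design, expanding the design space for cognitively inclusive interaction.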

AI中文摘要

当前的生成式人工智能（GenAI）界面仍然主要局限于聊天框交互，这给用户带来了高认知负担，并对智力障碍（ID）人群造成了重大障碍，包括提示词表述困难、响应过载以及评估信息可靠性的机制有限。为了探索认知无障碍的替代交互模型，我们进行了一项跨学科协同设计挑战，其中两个学生群体（计算机科学和工业设计）从相同的功能需求集（例如，提示词支架、结构化输出、基于GUI的细化、透明度和个性化）出发，开发界面概念。比较最终提案揭示了在基础需求上的趋同（特别是初始校准、主动提示和响应片段的直接操作）以及互补性贡献，勾勒出一个多层次支持系统。计算机科学团队主要产生结构支架，强调通过可靠性指标、明确来源和长对话上下文管理等机制实现可预测性、可导航性和信任。工业设计团队强调体验支架，侧重于节奏、注意力引导、多模态和主动代理，包括逐步响应流程、专注模式和类似助手的集成。我们将这些发现综合为一个双层支架框架，该框架将认知无障碍GenAI交互的设计空间扩展到以聊天为中心的模式之外，并激励未来在专家细化、技术可行性和与ID用户进行实证验证方面的工作。

英文摘要

Current Generative AI (GenAI) interfaces remain largely constrained to chatbox interaction, which can impose high cognitive demands on users and create substantial barriers for people with intellectual disabilities (ID), including prompt formulation difficulties, response overload, and limited mechanisms to assess information reliability. To explore alternative interaction models for cognitive accessibility, we conducted a cross-disciplinary co-design challenge in which two student cohorts (Computer Science and Industrial Design) developed interface concepts from the same set of functional requirements (e.g., prompt scaffolding, structured output, GUI-based refinement, transparency, and personalization). Comparing the resulting proposals reveals both convergence on foundational requirements (notably initial calibration, proactive prompting, and direct manipulation of response fragments) and complementary contributions that outline a multi-layered support system. Computer Science teams primarily produced structural scaffolding, emphasizing predictability, navigability, and trust through mechanisms such as reliability indicators, explicit sources, and context management for long conversations. Industrial Design teams emphasized experiential scaffolding, focusing on pacing, attention guidance, multimodality, and proactive agency, including step-by-step response flows, focus modes, and assistant-like integrations. We synthesize these findings into a dual-layer scaffolding framework that expands the design space for cognitively accessible GenAI interaction beyond chat-centric models and motivates future work on expert refinement, technical feasibility, and empirical validation with users with ID.

2606.14350 2026-06-15 cs.DC cs.AI 交叉投稿

Design Methodology and Performance Trade-offs Management for Distributed and Compound AI Systems

分布式与复合AI系统的设计方法论及性能权衡管理

Milos Gravara, Andrija Stanisic, Stefan Nastic

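AI summary: Proposes a design methodology that shifts from model-centric to system-centric design, organizes the design space along workflow topology and configuration selection, and identifies eight design patterns that address limitations of monolithic deployment; experiments show that compound AI configurations approach monolithic accuracy while substantially reducing latency and cost.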

AI中文摘要

人工智能系统通常必须满足包括准确性、延迟和成本在内的服务级别目标。当前以模型为中心的方法在设计时选择单一模型，并对所有输入应用相同的计算，无法将任务分解到专门组件，且知识在训练时固定。在运行时，这可能导致性能下降和成本增加。由于模型是主要设计变量，它决定了系统的大部分行为，将操作目标耦合到单一设计时选择。解决这些限制需要从以模型为中心转向以系统为中心的设计。复合AI系统通过显式控制逻辑将多个模型、算法和工具编排为分布式AI系统，实现了这一转变。此类系统的性能取决于其工作流拓扑、分配给每个任务的模型以及控制运行时行为的参数。我们提出了一种设计方法论，沿工作流拓扑和配置选择两个维度组织这一空间，并识别出八种设计模式，每种模式整合了解决单体部署特定限制的技术。我们通过三个案例研究验证了该方法论。在我们的案例研究中，复合AI配置的准确性接近单体模型（相差2.5至4个百分点），同时延迟降低高达60%，成本降低高达71%。我们表明模型选择和参数配置共同决定系统性能，但随着工作流组合更多模式和组件，产生的设计空间呈组合增长。因此，我们识别出五个开放挑战，这些挑战定义了从手动配置原型到自动发现并维护复合与分布式AI系统中SLO合规性的系统的路线图。

英文摘要

Artificial Intelligence (AI) systems must typically satisfy service-level objectives including accuracy, latency, and cost. The prevailing model-centric approaches select a monolithic model at design time and apply identical computation regardless of input difficulty, cannot decompose tasks across specialized components, and have knowledge that is fixed at training time. During runtime, this can lead to performance degradation and increasing costs. Because the model is the main design variable, it determines the majority of system behavior, coupling operational objectives to a single design-time choice. Addressing these limitations requires shifting from model-centric to system-centric design. Compound AI systems realize this shift by orchestrating multiple models, algorithms, and tools as distributed AI systems through explicit control logic. The performance of such systems depends on their workflow topology, the models assigned to each task, and the parameters governing runtime behavior. We present a design methodology that organizes this space along two dimensions, workflow topology and configuration selection, and identifies eight design patterns, each consolidating techniques to address a specific limitation of monolithic deployment. We validate our methodology through three case studies. Across our case studies, Compound AI configurations approach accuracy of monolithic models within 2.5 to 4 percentage points while reducing latency by up to 60% and cost by up to 71%. We show that model selection and parameter configuration jointly determine system performance, but the resulting design space grows combinatorially, as workflows compose more patterns and components. Thus, we identify five open challenges that define a roadmap from manually configured prototypes towards systems that automatically discover and maintain SLO-compliance in Compound and Distributed AI systems.

2606.14356 2026-06-15 cs.DC cs.AI 交叉投稿

Managerial Decision-Making with Generative AI under Ambiguity and Sycophancy

Sule Ozturk Birim, Fabrizio Marozzo, Yigit Kazancoglu

发表机构 * Manisa Celal Bayar University（曼萨塞尔朱巴大学）； University of Calabria（卡拉布里亚大学）； Yasar University（亚沙大学）

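AI summary: Through a human-in-the-loop experiment using a four-dimensional business ambiguity taxonomy, the study evaluates GenAI models on ambiguity detection, ambiguity resolution, and sycophancy, finding that ambiguity resolution improves decision quality and that models differ in how readily they comply with flawed directives.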

AI中文摘要

生成式人工智能（GenAI）正日益融入复杂的业务流程，从根本上改变了管理决策的边界。然而，在模糊的商业环境中，其战略建议的可靠性仍是一个关键的知识空白。为填补这一空白，本研究比较了多个GenAI模型在检测模糊性方面的能力，检验了系统性模糊解析过程是否能改善响应质量，并调查了它们在面对有缺陷的管理指令时对谄媚行为的易感性。利用一种新颖的四维商业模糊性分类法，我们在战略、战术和操作场景中进行了人机协作实验。通过一个基于一致性、可操作性、理由质量和约束遵守的人工验证自动评估框架对生成的决策进行评估。结果表明，我们的方法不仅能区分不同类型的模糊性，还揭示了模糊解析如何系统地改变模型行为。特别是，解析模糊性提高了所有管理层级的决策质量，其中在约束遵守方面提升最为显著。进一步分析显示，谄媚行为在不同模型中并不一致：一些模型质疑有缺陷的假设，而另一些则倾向于遵从。本研究通过将GenAI定位为一种能够检测和解析管理者可能忽略的模糊性的认知支架，同时证明其人工局限性需要人类监督以确保其作为战略伙伴的可靠性，从而为有限理性文献做出了贡献。

英文摘要

Generative artificial intelligence (GenAI) is increasingly being integrated into complex business workflows, fundamentally shifting the boundaries of managerial decision-making. However, the reliability of its strategic advice in ambiguous business contexts remains a critical knowledge gap. To address this gap, this study compares multiple GenAI models in their ability to detect ambiguity, examines whether a systematic ambiguity-resolution process improves response quality, and investigates their susceptibility to sycophantic behavior when confronted with flawed managerial directives. Using a novel four-dimensional business ambiguity taxonomy, we conducted a human-in-the-loop experiment across strategic, tactical, and operational scenarios. The resulting decisions were assessed through a human-validated automated evaluation framework based on agreement, actionability, justification quality, and constraint adherence. The results show that our approach not only distinguishes different types of ambiguity, but also reveals how ambiguity resolution systematically changes model behavior. In particular, resolving ambiguities improved decision quality across all managerial levels, with the strongest gains observed in constraint adherence. The analysis further showed that sycophantic behavior is not uniform across models: some models challenged flawed assumptions, whereas others tended to comply with them. This study contributes to the bounded rationality literature by positioning GenAI as a cognitive scaffold that can detect and resolve ambiguities managers might overlook, while demonstrating that its artificial limitations require human oversight to ensure its reliability as a strategic partner.

2605.29640 2026-06-15 cs.AI 版本更新

VikingMem: A Memory Base Management System for Stateful LLM-based Applications

VikingMem：面向有状态LLM应用的记忆库管理系统

Jiajie Fu, Junwen Chen, Mengzhao Wang, Aoxiang He, Maojia Sheng, Xiangyu Ke, Yifan Zhu, Yunjun Gao

发表机构 * Zhejiang University（浙江大学）

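AI summary: Proposes the Memory Base data management paradigm and implements the VikingMem system on the VikingDB vector engine; using event and entity abstractions, topic-wise timeline compression, and time-weighted recall, it improves retrieval effectiveness on long-term memory benchmarks by up to 30%.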

Comments Accepted by VLDB26

AI中文摘要

大型语言模型彻底改变了交互式应用；然而，其有限的上下文窗口为维护有状态的长期交互带来了关键的数据管理挑战。现有的记忆方法通常依赖于简单的提取方法，导致记忆不完整，或使用针对单一用例（如聊天机器人）的刚性、单用途记忆提取提示。因此，它们缺乏泛化能力，在多样化的下游任务中表现不佳。为弥补这一差距，我们引入了记忆库（Memory Base），一种用于管理长期交互持久状态的新型数据管理范式。其特点包括三个核心原则：从原始信息流中选择性提取高价值记忆；固有的状态性和演化性，其中记忆内容被逐步总结、纠正并按时间加权以优先处理近期交互；以及一种可泛化的抽象范式，旨在跨不同应用（包括教育、推荐和智能体记忆）实现稳健的可迁移性。基于此，我们提出了VikingMem，一个在VikingDB向量引擎上实现的端到端记忆库管理系统。VikingMem通过互连的事件和实体抽象具体化了这一范式。它采用以事件为中心的记忆提取来选择性处理复杂信息流，同时实体由事件动态更新以实现有状态演化。通过基于主题时间线的时间压缩和时间加权召回，系统逐步生成高层级总结记忆，优先处理近期项目，并压缩和淡出较旧项目。在长期记忆基准上的广泛评估表明，VikingMem在记忆检索效果上比基线方法提升高达30%，同时保持了交互应用所需的低延迟。

英文摘要

Large Language Models have revolutionized interactive applications; however, their finite context windows pose a critical data management challenge for maintaining stateful, long-term interactions. Existing memory approaches often rely on simplistic extraction methods that lead to incomplete memories or use rigid, single-purpose memory extraction prompts tailored to a single use case, such as chatbots. Consequently, they lack generalizability and perform poorly across diverse downstream tasks. To bridge this gap, we introduce the Memory Base, a novel data management paradigm for managing the persistent state of long-term interactions. It is characterized by three core principles: selective extraction of high-value memories from raw information streams; inherent statefulness and evolution, where memory content is progressively summarized, corrected, and temporally weighted to prioritize recent interactions; and a generalizable abstraction paradigm designed for robust transferability across diverse applications, including education, recommendation, and agent memory. Building on this foundation, we present VikingMem, an end-to-end Memory Base Management System implemented on the VikingDB vector engine. VikingMem materializes this paradigm through interconnected event and entity abstractions. It features event-centric memory extraction to selectively handle complex information streams, while entities are dynamically updated by events to achieve stateful evolution. Using temporal compression via a topic-wise timeline and time-weighted recall, the system progressively produces high-level summary memories, prioritizes recent items, and compresses and fades older ones. Extensive evaluations on long-term memory benchmarks demonstrate that VikingMem outperforms baselines by up to 30% in memory retrieval effectiveness while maintaining the low latency essential for interactive applications.

2606.13556 2026-06-15 cs.AI cs.HC q-bio.BM q-bio.GN q-bio.MN 版本更新

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

是你还是你的环境？一种用于基因组锚定的个性化生理解释的贝叶斯推理框架

Aruna Dey, Suraj Biswas

发表机构 * Dots-In

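AI summary: Proposes a Bayesian inference framework that uses genomic priors to address the cold-start problem in personalized health AI, separating the constitutional and environmental components of physiological signals via genomic anchoring and updating dynamically as data accrue.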

Comments 24 pages, 8 figures, 3 tables. Conceptual framework paper. Updated version with revised section structure and formatting

AI中文摘要

个性化健康AI系统面临一个根本性的冷启动问题：用于生理解释的机器学习模型需要数周的个人行为数据，才能区分体质变异与环境引起的偏差。我们提出一种基于因果推断和贝叶斯先验设计的解决方案。个体的基因组图谱作为外源性遗传锚点——一个领域信息化的个性化先验，在受孕时固定，不受反向因果影响，且在收集任何行为观测之前即可获得。该锚点初始化个体生理设定点G-hat = mu + sum(beta_i * g_i)上的贝叶斯信念状态，其中beta_i是GWAS衍生的效应大小，g_i是风险等位基因计数。每次传入的生理测量P产生一个非体质偏差delta = P - G-hat，将可归因于环境和状态的部分与体质固定的基线分离。随着行为数据的积累，先验根据G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t衰减，从基因组主导过渡到经验基线主导的推理。同一个观测到的HRV 55 ms，对于先验预测80 ms的人产生抑制假设，而对于先验预测30 ms的人产生增强假设——没有个性化锚点，这种反转是不可能的。我们在六个生理领域开发了这一架构，根据证据强度对基因组先验进行分级，区分稳健复制的锚点（FTO、FADS1/2、FKBP5）和有争议的候选基因（SLC6A4、MAOA、DRD2）。我们讨论了关联、孟德尔随机化和个体因果推断之间的推理边界，并定义了部署的四个约束：证据分级的先验、动态衰减、祖先匹配的效应大小以及归因而非确定性输出。

英文摘要

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
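
The three formulas in this abstract can be traced with a short sketch. The set-point, deviation, and blending expressions follow the abstract; the exponential form of the weight `w(t)`, the 30-day half-life, and the example effect sizes are assumptions chosen only to make the sketch runnable.

```python
def genomic_set_point(mu, betas, allele_counts):
    """G-hat = mu + sum(beta_i * g_i)."""
    return mu + sum(beta * g for beta, g in zip(betas, allele_counts))

def deviation(measurement, set_point):
    """delta = P - G-hat: the non-constitutional part of a measurement."""
    return measurement - set_point

def prior_weight(t_days, half_life_days=30.0):
    """Hypothetical w(t): halves every half_life_days (assumed form)."""
    return 0.5 ** (t_days / half_life_days)

def blended_set_point(g_genomic, p_bar, w):
    """G-hat_t = w(t) * G-hat_genomic + [1 - w(t)] * P-bar_t."""
    return w * g_genomic + (1.0 - w) * p_bar

# Hypothetical effect sizes and risk-allele counts
example_set_point = genomic_set_point(50.0, [5.0, -2.5], [2, 1])

# The HRV example from the abstract: one observation, two different priors
observed_hrv_ms = 55.0
delta_high_prior = deviation(observed_hrv_ms, 80.0)  # negative: suppression hypothesis
delta_low_prior = deviation(observed_hrv_ms, 30.0)   # positive: enhancement hypothesis

print(example_set_point, delta_high_prior, delta_low_prior)
print(blended_set_point(80.0, 60.0, prior_weight(30.0)))
```

With the same 55 ms reading, the sign of the deviation flips depending on the genomic prior, which is the reversal the abstract describes.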

2606.13662 2026-06-15 cs.AI cs.CL 版本更新

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

EurekAgent：自主科学发现中，智能体环境工程即一切

Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang, Jian Song, Lei Hou, Juanzi Li

发表机构 * Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）； Zhipu AI（智谱AI）

AI总结提出环境工程框架EurekAgent，通过权限、工件、预算和人机交互四维工程设计，在数学、内核工程和机器学习任务上取得新最优结果，总API成本低于11美元。

详情

AI中文摘要

基于LLM的智能体在自动化科学发现方面展现出日益增长的潜力。给定一个可优化的度量和执行环境，它们可以提出、验证和迭代科学解决方案，并已产生超越人类设计方法的结果。随着模型能力的持续提升，我们认为自主科学发现的瓶颈正从规定智能体工作流程转向设计智能体环境：即塑造智能体行为的资源、约束和接口。我们将此框架化为环境工程：构建能够放大生产性行为（如开放式探索、系统化工件管理和智能体间协作）同时抑制有害行为（如奖励黑客和高摩擦人工监督）的环境。我们提出了EurekAgent，一个用于度量驱动自主科学发现的环境工程智能体系统。EurekAgent从四个维度进行环境工程：权限工程用于受限智能体执行和隔离评估；工件工程用于基于文件系统和Git的协作；预算工程用于预算感知探索；人机交互工程用于便捷的人工监督和干预。EurekAgent在多个数学、内核工程和机器学习任务上取得了新的最优结果，包括以不到11美元的总API成本发现新的26圆填充最优结果。我们开源了代码和结果，并呼吁将环境工程作为开发可靠自主研究智能体的核心研究方向。

英文摘要

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

2112.04573 2026-06-15 cs.DL cs.AI cs.LG updated version

Application of Artificial Intelligence and Machine Learning in Libraries: A Systematic Review

Rajesh Kumar Das, Mohammad Sharif Ul Islam

Affiliations: University of Nebraska - Lincoln; Noakhali Science and Technology University; University of Dhaka

AI summary: A systematic review of 32 articles that summarizes the domains, techniques, and current state of AI and ML applications in libraries; it finds that current research is mainly theoretical, with some studies covering implementation cases.

Abstract

As cutting-edge technologies such as artificial intelligence and machine learning have become relevant in both concept and implementation, academics, researchers, and information professionals have taken up research in this area. The objective of this systematic literature review is to provide a synthesis of empirical studies exploring the application of artificial intelligence and machine learning in libraries. To achieve the objectives of the study, a systematic literature review was conducted based on the original guidelines proposed by Kitchenham et al. (2009). Data were collected from the Web of Science, Scopus, LISA, and LISTA databases. Following a rigorous, established selection process, a total of thirty-two articles were selected, reviewed, and analyzed to summarize the AI and ML domains and techniques most often used in libraries. Findings show that current AI and ML research relevant to the LIS domain focuses mainly on theoretical work. However, some researchers also emphasized implementation projects or case studies. This study provides a panoramic view of AI and ML in libraries for researchers, practitioners, and educators, to further more technology-oriented approaches and to anticipate future innovation pathways.

2504.03686 2026-06-15 cs.NI cs.AI cs.LG updated version

Revisiting Outage for Edge Inference Systems

Zhanwei Wang, Qunsong Zeng, Haotian Zheng, Kaibin Huang

Affiliations: Department of Electrical and Computer Engineering, The University of Hong Kong

AI summary: Addresses the end-to-end reliability of edge inference systems by proposing an inference outage probability framework that quantifies the probability that inference accuracy falls below a threshold and optimizes the tradeoff between communication overhead and inference reliability.

Abstract

One of the key missions of sixth-generation (6G) mobile networks is to deploy large-scale artificial intelligence (AI) models at the network edge to provide remote-inference services for edge devices. The resultant platform, known as edge inference, will support a wide range of Internet-of-Things applications, such as autonomous driving, industrial automation, and augmented reality. Given the mission-critical and time-sensitive nature of these tasks, it is essential to design edge inference systems that are both reliable and capable of meeting stringent end-to-end (E2E) latency constraints. Existing studies, which primarily focus on communication reliability as characterized by channel outage probability, may fail to guarantee E2E performance, specifically in terms of E2E inference accuracy and latency. To address this limitation, we propose a theoretical framework that introduces and mathematically characterizes the inference outage (InfOut) probability, which quantifies the likelihood that the E2E inference accuracy falls below a target threshold. Under an E2E latency constraint, this framework establishes a fundamental tradeoff between communication overhead (i.e., uploading more sensor observations) and inference reliability as quantified by the InfOut probability. To find a tractable way to optimize this tradeoff, we derive accurate surrogate functions for InfOut probability by applying a Gaussian approximation to the distribution of the received discriminant gain. Experimental results demonstrate the superiority of the proposed design over conventional communication-centric approaches in terms of E2E inference reliability.
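
As a rough illustration of the Gaussian-approximation idea mentioned above, the sketch below evaluates an outage probability of the form P(G < g_th) when the received discriminant gain G is treated as Gaussian. It is not the paper's surrogate function: the per-observation mean and standard deviation, the assumption that gains from n observations add independently, the gain threshold, and the names `gaussian_outage` and `outage_vs_observations` are all hypothetical choices made for this example.

```python
import math

def gaussian_outage(mean_gain, std_gain, gain_threshold):
    """P(G < gain_threshold) for G ~ N(mean_gain, std_gain^2), via the normal CDF."""
    z = (gain_threshold - mean_gain) / std_gain
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def outage_vs_observations(n_obs, mean_per_obs, std_per_obs, gain_threshold):
    """Outage when n observations each add gain (assumed i.i.d. here, not from the paper)."""
    mean_gain = n_obs * mean_per_obs
    std_gain = math.sqrt(n_obs) * std_per_obs
    return gaussian_outage(mean_gain, std_gain, gain_threshold)

# Hypothetical numbers: outage probability for 1 to 8 uploaded observations.
outages = [outage_vs_observations(n, 1.0, 0.8, 4.0) for n in range(1, 9)]
```

Under a latency budget each extra observation costs upload time, so a curve like `outages` is the kind of quantity one would weigh against communication overhead.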

2504.16173 2026-06-15 cs.AR cs.AI updated version

FPGA-Based Neural Network Accelerators for Space Applications: A Survey

Pedro Antunes, Artur Podobas

Affiliations: KTH Royal Institute of Technology

AI summary: Surveys FPGA-based neural network accelerators for space missions, analyzes the existing literature, trends, and gaps, and proposes future research directions to improve the performance of onboard computing systems.

Comments Manuscript under review at ACM CSUR. Pre-print updated after 1st Major Revision

2508.03736 2026-06-15 cs.CV cs.AI updated version

Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities

Rafayel Mkrtchyan, Armen Manukyan, Hrant Khachatrian, Theofanis P. Raptis

Affiliations: Yerevan State University; Consiglio Nazionale delle Ricerche

AI summary: Proposes a DINOv2-based deep learning framework that fuses open-source maps with RF data, using a vision transformer to jointly process both modalities; it reaches a macro IoU of 65.3% on synthetic data and 64.9% on real-world data, significantly outperforming single-source methods.

Comments Work supported by funding under the bilateral agreement between CNR (Italy) and HESC MESCS RA (Armenia) as part of the DeepRF project for the 2025-2026 biennium, and by the HESC MESCS RA grant No. 22rl-052 (DISTAL)

DOI: 10.1016/j.pmcj.2026.102261
Journal ref: Pervasive and Mobile Computing, Article 102261, 2026

Abstract

In this paper, we present a deep learning-based approach that integrates the DINOv2 architecture to improve building mapping by combining (possibly erroneous) maps from open-source platforms with pervasive radio frequency (RF) data collected from multiple wireless user equipments and base stations. Unlike prior methods, our approach leverages a vision transformer-based architecture to jointly process both RF and map modalities within a unified framework, effectively capturing spatial dependencies and structural priors for enhanced mapping accuracy. For evaluation purposes, we employ a synthetic dataset co-produced by Huawei. To address the challenges associated with real-world data imperfections, we introduce controlled noise to its RF data so as to simulate real-world conditions. Additionally, we develop and train a model that leverages only aggregated path loss information to tackle the mapping problem. We measure the results according to three performance metrics: the Jaccard index (intersection over union, IoU), the Hausdorff distance, and the Chamfer distance. Our design achieves a macro IoU of 65.3%, significantly surpassing (i) the erroneous maps baseline, which yields 40.1%, (ii) an RF-only method from the literature, which yields 37.3%, and (iii) a non-AI fusion baseline that we designed, which yields 42.2%. The comparative evaluation highlights the limitations of relying solely on RF data or on spatial data, as well as the effectiveness of AI in fusing data to enhance smart city mapping accuracy. We further validate our method on real-world data from the Oslo region, complementing the synthetic evaluation with a real deployment setting, where our best fusion model reaches 64.9% macro IoU. We additionally outline a strategy for deploying the model over larger areas by tiling the region with overlapping windows.

2511.22246 2026-06-15 hep-ex cs.AI physics.ins-det updated version

An interpretable unsupervised representation learning for high precision measurement in particle physics

Xing-Jian Lv, De-Xing Miao, Zi-Jun Xu, Jian-Chun Wang

Affiliations: Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China; University of Chinese Academy of Sciences, Beijing 100049, China

AI summary: Proposes the Histogram AutoEncoder (HistoAE), which uses a custom histogram loss to enforce a physically structured latent space for interpretable unsupervised learning; on silicon microstrip detector data it reaches a charge resolution of 0.25 e and a position resolution of 3 μm, comparable to the conventional approach.

Comments 8 pages, 7 figures

Abstract

Unsupervised learning has been widely applied to various tasks in particle physics. However, existing models lack precise control over their learned representations, limiting physical interpretability and hindering their use for accurate measurements. We propose the Histogram AutoEncoder (HistoAE), an unsupervised representation learning network featuring a custom histogram-based loss that enforces a physically structured latent space. Applied to silicon microstrip detectors, HistoAE learns an interpretable two-dimensional latent space corresponding to the particle's charge and impact position. After simple post-processing, it achieves a charge resolution of $0.25\,e$ and a position resolution of $3\,\mu\mathrm{m}$ on beam-test data, comparable to the conventional approach. These results demonstrate that unsupervised deep learning models can enable physically meaningful and quantitatively precise measurements. Moreover, the generative capacity of HistoAE enables straightforward extensions to fast detector simulations.

2512.10966 2026-06-15 cs.LG cs.AI cs.CV eess.IV updated version

Interpretable Alzheimer's Diagnosis via Multimodal Fusion of Regional Brain Experts

Farica Zhuang, Shu Yang, Dinara Aliyeva, Zixuan Wen, Duy Duong-Tran, Christos Davatzikos, Tianlong Chen, Song Wang, Li Shen

Affiliations: University of Pennsylvania; Massachusetts Institute of Technology

AI summary: Proposes MREF-AD, a multimodal regional expert fusion model that uses a Mixture-of-Experts framework to treat the brain regions of each modality as independent experts and a gating network to learn subject-specific fusion weights, enabling interpretable AD diagnosis.

Comments Published at IEEE ICHI 2026

Abstract

Accurate and early diagnosis of Alzheimer's disease (AD) is critical for effective intervention and requires integrating complementary information from multimodal neuroimaging data. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models mesoscopic brain regions within each modality as independent experts and employs a gating network to learn subject-specific fusion weights. Utilizing tabular neuroimaging and demographic information from the Alzheimer's Disease Neuroimaging Initiative (ADNI), MREF-AD achieves competitive performance over strong classic and deep baselines while providing interpretable, modality- and region-level insight into how structural and molecular imaging jointly contribute to AD diagnosis. The source code is available at https://github.com/PennShenLab/mref-ad.

2601.18707 2026-06-15 cs.LG cs.AI cs.CV cs.NE updated version

SMART: Scalable Mesh-free Aerodynamic Simulations from Raw Geometries using a Transformer-based Surrogate Model

Jan Hagnberger, Mathias Niepert

Affiliations: Jan Hagnberger; Mathias Niepert

AI summary: Proposes SMART, a neural surrogate model that predicts physical quantities at arbitrary query locations from a point cloud of the geometry alone, without a simulation mesh; cross-layer interaction jointly updates geometric features and the physical field, and performance matches or exceeds that of mesh-dependent methods.

Comments Accepted for publication at the 43rd International Conference on Machine Learning (ICML) 2026, Seoul, South Korea

Abstract

Machine learning-based surrogate models have emerged as more efficient alternatives to numerical solvers for physical simulations over complex geometries, such as car bodies. Many existing models incorporate the simulation mesh as an additional input, thereby reducing prediction errors. However, generating a simulation mesh for new geometries is computationally costly. In contrast, mesh-free methods, which do not rely on the simulation mesh, typically incur higher errors. Motivated by these considerations, we introduce SMART, a neural surrogate model that predicts physical quantities at arbitrary query locations using only a point-cloud representation of the geometry, without requiring access to the simulation mesh. The geometry and simulation parameters are encoded into a shared latent space that captures both structural and parametric characteristics of the physical field. A physics decoder then attends to the encoder's intermediate latent representations to map spatial queries to physical quantities. Through this cross-layer interaction, the model jointly updates latent geometric features and the evolving physical field. Extensive experiments show that SMART is competitive with and often outperforms existing methods that rely on the simulation mesh as input, demonstrating its capabilities for industry-level simulations.

2602.05670 2026-06-15 cs.SD cs.AI eess.AS updated version

HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection

Qing Wen, Haohao Li, Zhongjie Ba, Peng Cheng, Miao He, Li Lu, Kui Ren

Affiliations: University of Science and Technology of China

AI summary: Proposes HyperPotter, a hypergraph-based framework that captures high-order interactions through clustering-based hyperedges and class-aware prototype initialization, yielding an average relative EER reduction of 12.68% across 13 test sets.

Comments 20 pages, 8 figures, accepted to ICML 2026

Abstract

Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework designed to capture high-order relations associated with synergistic patterns through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments on 13 test sets show that HyperPotter improves over the baseline on 11 sets, yielding an average relative EER reduction of 12.68% across all test sets and 22.15% on the improved sets. These results demonstrate strong cross-scenario generalization, while also revealing robustness limits under severe codec or channel distortion.

2602.06142 2026-06-15 cs.PL cs.AI cs.CL cs.LG cs.PF updated version

Protean Compiler: An Agile Framework to Drive Fine-grain Phase Ordering

Amir H. Ashouri, Shayan Shirahmad Gale Bagi, Kavin Satheeskumar, Tejas Srikanth, Jonathan Zhao, Ibrahim Saidoun, Ziwen Wang, Bryan Chan, Tomasz S. Czajkowski

Affiliations: Huawei Technologies Canada

AI summary: Proposes the Protean Compiler framework, which builds fine-grained phase-ordering capabilities into LLVM; with more than 140 static feature collection methods and machine-learning-driven optimization, it achieves an average speedup of 4.1% and up to 15.7%.

Comments Version 3: Preprint version of the accepted work at ACM TACO 2026

Abstract

The phase ordering problem has been a long-standing challenge since the late 1970s, yet it remains open: its optimization space is vast and unbounded, so it has no finite solution, although one can limit the scope by reducing the number and the length of optimizations. Traditionally, such locally optimized decisions are made by hand-coded algorithms tuned for a small number of benchmarks, often requiring significant effort to be retuned when the benchmark suite changes. In the past 20 years, machine learning has been employed to construct performance models to improve the selection and ordering of compiler optimizations; however, these approaches have not been integrated seamlessly into the compiler and have never been applied at the fine-grained scope of code segments. This paper presents Protean Compiler: an agile framework that gives LLVM built-in phase-ordering capabilities at a fine-grained scope. The framework also comprises a complete library of more than 140 handcrafted static feature collection methods at varying scopes, and the experimental results show speedups of up to 4.1% on average and up to 15.7% on select Cbench applications with respect to LLVM's O3, at the cost of only a few extra seconds of build time on Cbench. Additionally, Protean Compiler allows easy integration with third-party ML frameworks and other large language models, and two applications of this two-step optimization show speedups of 10.1% and 8.5% with respect to -O3 on Cbench's Susan and Jpeg applications. Protean Compiler is seamlessly integrated into LLVM and can be used as a new, enhanced, full-fledged compiler. We plan to release the project to the open-source community in the near future.

2605.24609 2026-06-15 physics.med-ph cs.AI cs.CV updated version

Catching magnetic resonance imaging outliers in artificial intelligence-supported radiotherapy workflows: unsupervised detection and localization of image anomalies using deep learning

Mustafa Kadhim, Viktor Rogowski, Emilia Persson, Camila Gonzalez, André Haraldsson, Sofie Ceberg, Mikael Nilsson, Malin Kügele, Sven Bäck, Christian Jamtheim Gustafsson

Affiliations: Physics and Imaging in Radiation Oncology (phiRO)

AI summary: Proposes a two-stage unsupervised anomaly-detection framework that uses discrete token compression and token-surprisal scoring to detect and localize anomalies with high accuracy in pelvic and brain MRI, supporting automated quality control in radiotherapy workflows.

Comments This paper has been submitted to Physics and Imaging in Radiation Oncology (phiRO)

Abstract

Artificial intelligence is increasingly integrated into radiotherapy workflows, yet such pipelines remain vulnerable to out-of-distribution image data that may introduce unexpected behavior in clinical tasks. Deep learning-based anomaly detection for pelvic magnetic resonance imaging (MRI) remains largely unexplored, and transparent evaluation of its feasibility for full automation is limited. We developed and evaluated a fully automated, unsupervised anomaly-detection framework for pelvic and brain MRI. A two-stage framework was trained on reference images from public datasets: LUND-PROBE for pelvic MRI, and IXI, fastMRI, and fastMRI+ for brain MRI. In the first stage, MRI slices were compressed into discrete tokens; in the second, the distribution of normal tokens was modeled. Anomaly evidence was estimated by combining perceptual image differences with token-surprisal scores based on negative log-likelihood. Automated detection was evaluated on pelvic MRI with synthetic global and real clinical anomalies, and on brain MRI with clinically annotated fastMRI+ abnormalities. Sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and false-positive behavior in held-out normal cases were assessed. The framework achieved robust detection across hidden evaluation cohorts, with AUCs of 0.97 (95% CI, 0.95-0.98) and 0.81 (95% CI, 0.74-0.87) for pelvic and brain MRI, respectively. Heatmap analysis showed strong spatial agreement between detected anomalies and ground-truth locations, supporting localization accuracy and interpretability. These results support the potential of unsupervised anomaly detection as an automated MRI quality-control layer for radiotherapy workflows, with transparent visualization of image regions likely to compromise downstream AI-based tasks.

2606.13734 2026-06-15 cs.AI new submission

AI Receptivity or AI Adoption Breadth? A Tool-Specific Reanalysis of the Lower-Literacy/Higher-Usage Link

Hristo Inouzhe

Affiliations: Universidad Autónoma de Madrid

AI summary: Reanalyzes the study by Tully et al. (2025) and finds that the negative association between AI literacy and AI usage varies by tool type: lower literacy predicts only the breadth of adoption of non-text AI tools, not intensity of use.

Comments 11 pages, 2 tables, 1 figure

Abstract

Recent evidence reported by Tully, Longoni, and Appel (2025) suggests that lower artificial intelligence (AI) literacy predicts greater receptivity toward AI. We revisit this claim using the public data from Study 3 of that article, which measures past usage of five AI tool categories on a five-point frequency scale. We first reproduce the negative association between AI literacy and aggregate AI usage using OLS on participant-level averages, binary logit, ordered logit, and multinomial logit specifications. We then show that the aggregate relationship masks substantial heterogeneity by tool type. In our demographic-adjusted primary specification, AI literacy does not significantly predict text AI usage (ordered-logit $\beta$ = -0.090, p = .387), whereas it remains a strong predictor of non-text AI adoption ($\beta$ = -0.377, p < .001). The non-text effect is also robust under Tully et al.'s original Study 3 control specification ($\beta$ = -0.502, p < .001). Binary, ordered-logit, and multinomial specifications suggest that the non-text relationship is primarily an adoption/non-adoption pattern rather than evidence of intensive use: the demographic-adjusted odds ratio of ever having used a non-text AI tool is 0.68. Thus, in the study that measures self-reported past usage rather than stated preferences, the evidence does not support a simple claim that lower AI literacy predicts greater receptivity to AI in general. It points instead to a narrower pattern of broader adoption across lower-penetration, non-text AI tools.

2606.13704 2026-06-15 cs.CY cs.AI cs.LG cross-listed

Position: AI Must Become Planet-Centered, Not Just Human-Centered

Maria Perez-Ortiz

Affiliations: GitHub

AI summary: Proposes Planet-Centered AI (PCAI), a design philosophy that uses systems thinking to reorient AI toward global socio-ecological system challenges, emphasizing alignment with global agendas, system-aware foundations, trajectory-oriented evaluation, and monitorability.

Journal ref: International Conference on Machine Learning (ICML 2026)

Abstract

This position paper argues that contemporary AI paradigms are insufficient for supporting complex global goals and introduces Planet-Centered AI (PCAI) as a design philosophy and research agenda that reorients AI toward planetary-scale socio-ecological systems and their long-term trajectories. A planet-centered approach is grounded in systems thinking, treating Earth as an interconnected whole of which humans are part. We diagnose recurring limitations across AI frameworks, many of which remain human-centered, and show why these become especially consequential under current planetary conditions characterized by systemic risk, non-stationarity, and deep uncertainty. We then articulate how PCAI reshapes the AI lifecycle, from problem formulation and model design to evaluation and deployment, by emphasizing alignment with global agendas, developing system-aware AI foundations, trajectory-oriented evaluation, and monitorability. Finally, we advance a falsifiable claim: AI systems optimized without explicit consideration of systemic consequences are more likely to exacerbate systemic instability than to mitigate it.

2606.13829 2026-06-15 physics.soc-ph astro-ph.IM cs.AI cross-listed

AI can help scientists publish less

Gianfranco Bertone

Affiliations: Gravitation Astroparticle Physics Amsterdam (GRAPPA), University of Amsterdam

AI summary: Argues that AI should be used to correct distortions in the publishing system, helping scientists publish fewer but higher-quality articles and freeing time for better research.

Comments 7 pages, no figures

2606.13892 2026-06-15 cs.CR cs.AI cross-listed

Crypto x AI, AI x Crypto: A Survey

Sarah Allen, Pranay Anchuri, James Austgen, Maryam Bahrani, Samuel Breckenridge, Aaron Buchwald, Christian Cachin, Andrés Fábrega, Jared Fernandez, James Hsin-yu Chiang, Marwa Mouallem, Roi Bar-Zur, Neil DeSilva, Ittay Eyal, Giulia Fanti, Ari Juels, Andrew Miller, Christian Sillaber, Dani Vilardell, Pramod Viswanath, Wenhao Wang, Matt Weinberg, Sen Yang, Jianzhu Yao, Fan Zhang

Affiliations: Initiative for CryptoCurrencies and Contracts (IC3); Ava Labs; Carnegie Mellon University; Cornell Tech; Flashbots; Offchain Labs; Ritual Labs; Technion; University of Bern; Princeton University; ETH Zurich; Teleport; Flashbots(X); Tel Aviv University

AI summary: Systematically surveys research at the intersection of AI and blockchain (crypto), summarizing existing work, key findings, open problems, and industry misconceptions, and notes that the two fields are still at an early stage of convergence.

2606.14512 2026-06-15 cs.CL cs.AI cross-listed

Fodor and Pylyshyn's Systematicity Challenge Still Stands

Michael Goodale, Salvador Mascarenhas

Affiliations: Institut Jean Nicod, Département d’études cognitives ENS, EHESS, CNRS, PSL University

AI summary: Shows experimentally that Lake and Baroni's meta-learning for compositionality model performs poorly on both out-of-distribution and within-distribution problems, and so fails to meet Fodor and Pylyshyn's systematicity challenge to neural networks.

Comments Accepted in the Transactions of the Association for Computational Linguistics (TACL). This is a pre-MIT Press publication version of the paper

Abstract

The recent successes of neural networks in producing human-like language have caused a significant stir in cognitive science, with many researchers arguing that classical puzzles about human cognition and challenges to artificial intelligence are being solved by neural networks. A notable case is the argument from systematicity due to Jerry Fodor and Zenon Pylyshyn, which holds that humans display systematic biconditional dependencies. For example, someone can understand the sentence "John saw Mary" just in case they understand the sentence "Mary saw John." Symbolic systems explain this systematicity of language and thought, while neural networks offer no immediate explanation. Several recent articles argue that this challenge has now been met by neural networks. In particular, Brenden Lake and Marco Baroni argue that their meta-learning for compositionality protocol matches and perhaps explains human systematicity. We demonstrate that these conclusions are premature. Among other results, we found that their model struggles to learn rules that are even slightly out of distribution compared to its training data. Furthermore, the model behaves unsystematically even on many within-distribution problems. We conclude that Fodor and Pylyshyn's challenge to neural networks remains unmet.

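2606.14612 2026-06-15 cs.SD cs.AI eess.AS cross-list

Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms

Chen Ying Claude, Zhihan Luo

Affiliations * Claude Code / Opus 4.6; API / Fable 5; Independent researcher

AI summary: A computational analysis of the score of Beethoven's "Moonlight Sonata" finds that its three movements correspond to three distinct machine learning architectures and reports four counterintuitive findings, including that perceived musical "temperature" is governed by throughput and that the lightest movement carries the highest dissonance.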

Details

Abstract

We show that the three movements of Beethoven's "Moonlight Sonata" (Op. 27 No. 2) instantiate three distinct machine learning architectures -- not by analogy, but by structural correspondence. Through computational analysis of the score (entropy, Jensen-Shannon divergence, dissonance, hand distributional overlap, self-similarity matrices, temporal memory decay, and contextual pitch embeddings), we establish four counterintuitive findings: (1) perceived musical "temperature" is governed by throughput, not distributional width; (2) the lightest movement carries the highest dissonance; (3) the movements implement streaming, recurrent, and periodic positional encoding memory architectures; and (4) the same pitch class acquires different contextual identities across movements, analogous to contextual vs. static embeddings in NLP -- and unsupervised clustering recovers the tonal structure without music-theoretic input. We construct a reverse sonification (decoding analytical features back into MIDI) and quantify the chirality of the encode-decode cycle: what distributions preserve and sequential ordering destroys. Prompted by a listener's observation that the decoded piece sounds like "mirror isomers that can't be superimposed," the chirality measurement reveals reconstruction loss increasing monotonically with n-gram order. Bootstrap baselines and subsample checks confirm all movements carry sequential information above noise, though raw values are confounded by sample size. Cross-domain comparison shows natural language has higher chirality than music, reflecting stronger sequential constraints.
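Two of the score statistics named in the abstract, entropy and Jensen-Shannon divergence, can be illustrated on pitch-class histograms. The following is a minimal sketch and not the authors' pipeline: the 12-bin pitch-class representation, the function names, and the toy note lists are all assumptions made for illustration.

```python
import math
from collections import Counter


def distribution(pitches):
    """Normalized 12-bin pitch-class histogram (assumed representation)."""
    counts = Counter(p % 12 for p in pitches)
    total = sum(counts.values())
    return [counts.get(k, 0) / total for k in range(12)]


def entropy(p):
    """Shannon entropy in bits; zero-probability bins contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: H(M) - (H(P) + H(Q)) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


# Hypothetical toy "movements" as MIDI note numbers, not data from the paper.
movement_a = [61, 64, 68, 61, 64, 68]
movement_b = [61, 63, 66, 68, 70, 73]
print(entropy(distribution(movement_a)))
print(js_divergence(distribution(movement_a), distribution(movement_b)))
```

With base-2 logarithms the divergence lies between 0 (identical histograms) and 1 (disjoint histograms), which makes values comparable across movement pairs.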

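2606.14688 2026-06-15 cs.LG cs.AI cs.CL cs.DS cross-list

Metabolic Cost of Information Processing in Poisson Variational Autoencoders

Hadi Vafaii, Jacob L. Yates

Affiliations * Redwood Center for Theoretical Neuroscience; UC Berkeley

AI summary: In the Poisson variational autoencoder, the KL divergence term is proportional to the prior firing rates, which yields a metabolic cost term and enables a trade-off between coding fidelity and energy expenditure.

Comments Published in CCN 2026 Proceedings: https://doi.org/10.32470/6ff31r0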

Details

DOI: 10.32470/6ff31r0

Abstract

Computation in biological systems is fundamentally energy-constrained, yet standard theories of computation treat energy as freely available. Here, we argue that variational free energy minimization under a Poisson assumption offers a principled path toward an energy-aware theory of computation. Our key observation is that the Kullback-Leibler (KL) divergence term in the Poisson free energy objective becomes proportional to the prior firing rates of model neurons, yielding an emergent metabolic cost term that penalizes high baseline activity. This structure couples an abstract information-theoretic quantity -- the *coding rate* -- to a concrete biophysical variable -- the *firing rate* -- which enables a trade-off between coding fidelity and energy expenditure. Such a coupling arises naturally in the Poisson variational autoencoder (P-VAE) -- a brain-inspired generative model that encodes inputs as discrete spike counts and recovers a spiking form of *sparse coding* as a special case -- but is absent from standard Gaussian VAEs. To demonstrate that this metabolic cost structure is unique to the Poisson formulation, we compare the P-VAE against Grelu-VAE, a Gaussian VAE with ReLU rectification applied to latent samples, which controls for the non-negativity constraint. Across a systematic sweep of the KL term weighting coefficient $\beta$ and latent dimensionality, we find that increasing $\beta$ monotonically increases sparsity and reduces average spiking activity in the P-VAE. In contrast, Grelu-VAE representations remain unchanged, confirming that the effect is specific to Poisson statistics rather than a byproduct of non-negative representations. These results establish Poisson variational inference as a promising foundation for a resource-constrained theory of computation.
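The proportionality between the KL term and the prior firing rate can be checked numerically with the textbook KL divergence between two Poisson distributions. The sketch below is not taken from the paper: it assumes the posterior rate is written as `ratio * prior_rate`, and the function names are hypothetical. Under that assumption the closed form is `prior_rate * (ratio * log(ratio) - ratio + 1)`, so at a fixed ratio the cost scales linearly with the prior rate.

```python
import math


def poisson_log_pmf(k, rate):
    """Log-probability of count k under a Poisson with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def kl_poisson_numeric(post_rate, prior_rate, k_max=200):
    """KL(Poisson(post_rate) || Poisson(prior_rate)) by direct summation."""
    total = 0.0
    for k in range(k_max + 1):
        log_q = poisson_log_pmf(k, post_rate)
        log_p = poisson_log_pmf(k, prior_rate)
        total += math.exp(log_q) * (log_q - log_p)
    return total


def kl_poisson_closed_form(prior_rate, ratio):
    """Same KL with post_rate = ratio * prior_rate (assumed parametrization)."""
    return prior_rate * (ratio * math.log(ratio) - ratio + 1.0)


# Illustrative values only: doubling the prior rate doubles the KL cost.
for prior_rate in (1.0, 2.0, 4.0):
    ratio = 1.5
    print(prior_rate,
          kl_poisson_numeric(ratio * prior_rate, prior_rate),
          kl_poisson_closed_form(prior_rate, ratio))
```

The summation is truncated at `k_max`, which is adequate for small rates like those above; larger rates would need a larger cutoff.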

2606.12430 2026-06-15 cs.CY cs.AI updated version

Will AI Agents Free Us From Meaningless Work? A Human-Centered Analysis

Davide Ghia, Jaspreet Ranjit, Tania Cerquitelli, Daniele Quercia

Affiliations * Politecnico di Torino; University of Southern California; Nokia Bell Labs

AI summary: Drawing on Graeber's theory of bullshit jobs, a task-level analysis finds that the degree to which workers perceive a task as bullshit strongly predicts their desire to delegate it to AI, and that such tasks are seen as requiring less human oversight.

Comments Improved overall writing; added details about task filtering and participant screening; added comments in the discussion about the subjective and context-specific nature of the scale introduced.

Details

DOI: 10.1145/3805029.3818299

Abstract

Some claim that AI agents will free workers from the boring parts of their jobs, yet little is known about how workers themselves identify which tasks should be automated. Prior research focuses on occupations, overlooking that workers experience varying levels of meaning across tasks within the same role. We address this gap with a task-level analysis grounded in Graeber's theory of bullshit jobs. Using ratings from 202 workers on 171 workplace tasks, we (1) validate a five-item scale of perceived bullshitness, (2) show that perceived bullshitness strongly predicts desire for AI delegation, and (3) find that such tasks are also seen as requiring less human oversight. Together, these findings suggest that tasks perceived as bullshit are natural candidates for AI delegation, aligning worker preferences with perceived feasibility.

2606.12923 2026-06-15 cs.LG cs.AI cs.CL updated version

Order Is Not Control: Driven-Dissipative Response Laws Across Artificial and Biological Systems

Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Tim Elson

Affiliations * Australian Broadcasting Corporation

AI summary: The paper argues that order is not control, proposes a receiver-gated response law, and identifies it across biological, LLM, adapter, and stochastic-operator panels, indicating that control is local and measurable.

Comments 52 pages, 7 figures, updated title

Details

Abstract

AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping material state, action/drive, bath, and receiver state to response displacement, sinks, effort, and basin projection. We identify it across biological, LLM, adapter, and stochastic-operator panels. The laws are local: an intervention can be admitted, saturated, sign-changing, leaky, or overdriven depending on medium, bath, receiver state, action port, and comparator. Control is assigned when finite effort moves a target or outcome-readout class under the same denominator while damage, null/evasive, invalid format, overdrive, and unnecessary effort stay bounded. Mouse ALM, C. elegans, and zebrafish panels provide physical response-operator evidence while excluding coordinate identity and controller conclusions. LLM panels show generated-output response laws: across four material conditions, response vectors are predictable at 72.8-73.7% component-sign accuracy, rising to 84.3-84.8% on nonzero components; held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. Constitution-conditioned adapters reshape susceptibility as prepared media, and stochastic-operator panels separate measured opportunity from deployable action policies. This gives a driven-dissipative response-system account at the mesoscopic control level: drives act through prepared media, baths, and receivers, producing admitted movement, impedance, sinks, or overdrive. The evidence supports local admitted control and measurable stochastic response operators, while leaving deployable pre-generation control, hidden/logit causal sufficiency, biological-to-LLM coordinate identity, and literal thermodynamic quantities outside scope.

2508.08935 2026-06-15 cs.LG cs.AI updated version

LNN-PINN: A Unified Physics-Only Training Framework with Liquid Residual Blocks

Ze Tao, Hanxuan Wang, Fujun Liu

Affiliations * Nanophotonics and Biophotonics Key Laboratory of Jilin Province, School of Physics, Changchun University of Science and Technology; Faculty of Chinese Medicine, Macau University of Science and Technology

AI summary: To address the limited predictive accuracy of physics-informed neural networks on complex problems, the paper proposes the LNN-PINN framework, which introduces a liquid residual gating architecture to improve accuracy, and validates its effectiveness and stability on several benchmark problems.

Details

DOI: 10.1016/j.cpc.2026.110237
Journal ref: Computer Physics Communications, 326, 110237 (2026)

Abstract

Physics-informed neural networks (PINNs) have attracted considerable attention for their ability to integrate partial differential equation priors into deep learning frameworks; however, they often exhibit limited predictive accuracy when applied to complex problems. To address this issue, we propose LNN-PINN, a physics-informed neural network framework that incorporates a liquid residual gating architecture while preserving the original physics modeling and optimization pipeline to improve predictive accuracy. The method introduces a lightweight gating mechanism solely within the hidden-layer mapping, keeping the sampling strategy, loss composition, and hyperparameter settings unchanged to ensure that improvements arise purely from architectural refinement. Across four benchmark problems, LNN-PINN consistently reduced RMSE and MAE under identical training conditions, with absolute error plots further confirming its accuracy gains. Moreover, the framework demonstrates strong adaptability and stability across varying dimensions, boundary conditions, and operator characteristics. In summary, LNN-PINN offers a concise and effective architectural enhancement for improving the predictive accuracy of physics-informed neural networks in complex scientific and engineering problems.

2603.20821 2026-06-15 cs.DC cs.AI cs.LG updated version

Compass: Optimizing Compound AI Workflows for Dynamic Adaptation

Milos Gravara, Juan Luis Herrera, Stefan Nastic

Affiliations * University of California, Berkeley; ETH Zurich

AI summary: The paper proposes the Compass framework, which dynamically switches the configuration of compound AI workflows through offline optimization and online adaptation to better balance accuracy, latency, and cost.

Comments 10 pages, 7 figures; accepted at the 26th IEEE International Symposium on Cluster, Cloud, and Internet Computing (CCGrid 2026)

Details

DOI: 10.1109/CCGrid68966.2026.00018
Journal ref: In Proceedings of the 26th IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2026

Abstract

Compound AI is a distributed intelligence approach that represents a unified system orchestrating specialized AI/ML models with engineered software components into AI workflows. Compound AI production deployments must satisfy accuracy, latency, and cost objectives under varying loads. However, many deployments operate on fixed infrastructure where horizontal scaling is not viable. Existing approaches optimize solely for accuracy and do not consider changes in workload conditions. We observe that compound AI systems can switch between configurations to fit infrastructure capacity, trading accuracy for latency based on current load. This requires discovering multiple Pareto-optimal configurations from a combinatorial search space and determining when to switch between them at runtime. We present Compass, a novel framework that enables dynamic configuration switching through offline optimization and online adaptation. Compass consists of three components: the COMPASS-V algorithm for configuration discovery, Planner for switching policy derivation, and the Elastico Controller for runtime adaptation. COMPASS-V discovers accuracy-feasible configurations using finite-difference guided search and a combination of hill-climbing and lateral expansion. Planner profiles these configurations on target hardware and derives switching policies using a queuing-theory-based model. Elastico monitors queue depth and switches configurations based on derived thresholds. Across two compound AI workflows, COMPASS-V achieves 100% recall while reducing configuration evaluations by 57.5% on average compared to exhaustive search, with efficiency gains reaching 95.3% at tight accuracy thresholds. Runtime adaptation achieves 90-98% SLO compliance under dynamic load patterns, improving SLO compliance by 71.6% over static high-accuracy baselines, while simultaneously improving accuracy by 3-5% over static fast baselines.
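The runtime step described in the abstract, monitoring queue depth and switching configurations at derived thresholds, might look roughly like the sketch below. This is not the Elastico implementation: the `Config` type, the `select_config` function, the rule of preferring the most accurate admissible configuration, and every number are hypothetical and chosen only to illustrate threshold-based switching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    accuracy: float       # offline-measured accuracy (hypothetical value)
    max_queue_depth: int  # switching threshold (hypothetical value)


def select_config(queue_depth, configs):
    """Return the most accurate config whose threshold admits the current depth.

    If the queue is deeper than every threshold, fall back to the config
    with the largest threshold, i.e. the one built for the heaviest load.
    """
    admissible = [c for c in configs if queue_depth <= c.max_queue_depth]
    if admissible:
        return max(admissible, key=lambda c: c.accuracy)
    return max(configs, key=lambda c: c.max_queue_depth)


# Hypothetical Pareto set: slower configs are more accurate but tolerate less load.
CONFIGS = [
    Config("accurate", accuracy=0.92, max_queue_depth=4),
    Config("balanced", accuracy=0.89, max_queue_depth=16),
    Config("fast", accuracy=0.87, max_queue_depth=64),
]

for depth in (2, 10, 100):
    print(depth, select_config(depth, CONFIGS).name)
```

A production controller would likely add hysteresis so that a queue hovering near a threshold does not cause rapid back-and-forth switching; that is omitted here for brevity.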

2507.06174 2026-06-15 cs.RO cs.AI cs.SY eess.SY updated version

Design and Experimental Validation of Sensorless 4-Channel Bilateral Teleoperation for Low-Cost Manipulators

Koki Yamane, Yunhan Li, Masashi Konosu, Koki Inami, Junji Oaki, Toshiaki Tsuji, Sho Sakaino

Affiliations * Degree Programs in Intelligent and Mechanical Interaction Systems, University of Tsukuba; Faculty of Engineering, Information and Systems, University of Tsukuba; Department of Electrical Engineering, Electronics, and Applied Physics, Saitama University

AI summary: The paper proposes a sensorless 4-channel bilateral teleoperation framework that combines nonlinear dynamics compensation with an observer-based disturbance estimation scheme. Experiments show stable teleoperation in high-speed, contact-rich scenarios under low-cost hardware constraints, and higher task success rates in imitation learning.

Comments 22 pages, 12 figures, Submitted to IEEE Access

Details

Abstract

Teleoperation of low-cost manipulators is attracting increasing attention as a practical means of collecting demonstration data for imitation learning. However, most existing low-cost systems rely on unilateral position control without force feedback, while implementing force-feedback bilateral teleoperation is difficult because low-cost manipulators typically have low-resolution encoders and no joint torque sensors. This paper presents a sensorless 4-channel bilateral teleoperation framework that integrates identified nonlinear dynamics compensation with a disturbance-observer-based velocity and external-force estimation scheme. By interpreting the observer structure in the frequency domain, we clarify the coupling between the velocity- and external-force-estimation bandwidths and derive practical tuning guidelines based on the damping ratio and a single cutoff frequency. Real-robot experiments, including force-sensor comparison and teleoperation tasks, demonstrate that the proposed framework provides practically useful force estimates and enables stable teleoperation in high-speed and contact-rich scenarios under low-cost hardware constraints. As an application, imitation-learning experiments demonstrate that incorporating estimated force information into demonstrations improves task success rates in the tested contact-rich manipulation tasks.

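2512.20932 2026-06-15 cs.LG cs.AI updated version

Guardrailed Elasticity Pricing: A Churn-Aware Forecasting Playbook for Subscription Strategy

Deepit Sapru

Affiliations * Deepit Sapru

AI summary: The paper proposes a dynamic pricing framework that combines multivariate demand forecasting, segment-level price elasticity, and churn prediction to optimize revenue and retention. Using seasonal models and tree-based learners, it solves a constrained optimization problem to improve pricing across SaaS portfolios while enforcing customer-experience and ethical guardrails.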

Details

DOI: 10.1109/ESIC68176.2026.11496127

Abstract

This paper presents a marketing analytics framework that operationalizes subscription pricing as a dynamic, guardrailed decision system, uniting multivariate demand forecasting, segment-level price elasticity, and churn propensity to optimize revenue, margin, and retention. The approach blends seasonal time-series models with tree-based learners, runs Monte Carlo scenario tests to map risk envelopes, and solves a constrained optimization that enforces business guardrails on customer experience, margin floors, and allowable churn. Validated across heterogeneous SaaS portfolios, the method consistently outperforms static tiers and uniform uplifts by reallocating price moves toward segments with higher willingness-to-pay while protecting price-sensitive cohorts. The system is designed for real-time recalibration via modular APIs and includes model explainability for governance and compliance. Managerially, the framework functions as a strategy playbook that clarifies when to shift from flat to dynamic pricing, how to align pricing with CLV and MRR targets, and how to embed ethical guardrails, enabling durable growth without eroding customer trust.

1. Agents, Planning, and Decision-Making (25 papers)

UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

Orchestra-o1: Omnimodal Agent Orchestration

Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL

SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Communication Policy Evolution for Proactive LLM Agents

Causal Object-Centric Models for Planning with Monte Carlo Tree Search

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More

From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

GAGPO: Generalized Advantage Grouped Policy Optimization

An Agentic Retrieval Framework for Autonomous Context-Aware Data Quality Assessment

Active Inference for Adaptive Traffic Signal Control in Noisy Nonstationary IoT Environments

tap: A File-Based Protocol for Heterogeneous LLM Agent Collaboration

LLM-Powered AI Agent Systems and Their Applications in Industry

Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward

An Analysis of the Coordination Gap between Joint and Modular Learning for Job Shop Scheduling with Transportation Resources

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents

GUITrans2Act: Understanding User Operational Behaviors from Mobile GUI Interactions with Vision-Language Models

Tackling GNARLy Problems: Graph Neural Algorithmic Reasoning Reimagined through Reinforcement Learning

RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization

Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals

2. Knowledge Representation, Reasoning, and Symbolic AI (6 papers)

History of the Muddy Children Puzzle

Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization

Formalizing Numerical Analysis: An Agent Pipeline and Quality Audit Beyond Kernel Acceptance

Transforming Shape Schemas with Composable Property-Graph Queries (Extended Version)

ANSR-DT: A Neuro-Symbolic Framework for Adaptive and Explainable Digital Twins

AdaTKG: Adaptive Memory for Temporal Knowledge Graph Reasoning

3. Multi-Agent Systems and Games (6 papers)

YeasierAgent: Agentic Social Sandbox as a Canvas for Intent-Driven Creation of Platform-Agnostic Symbiotic Agent-Native Applications

When Should Agent Trust Be Conditional? Characterizing and Attacking Skill-Conditional Reputation in Agent Swarms

Safety-Contract Graph Multi-Agent Reinforcement Learning for Autonomous Network Security Response

Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning

MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

4. Search, Optimization, and Constraint Solving (5 papers)

A Deep Reinforcement Learning (DRL)-Based Transformer Method for Solving the Open Shop Scheduling Problem

A Temporal Planning Framework for Disruption Aware Dynamic Route Optimization in Heterogeneous Railway Systems

Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

From Sorting Algorithms to Scalable Kernels: Bayesian Optimization in High-Dimensional Permutation Spaces

Scalable Production Scheduling: Linear Complexity via Unified Homogeneous Graphs

5. Machine Learning and Representation Learning (48 papers)

When Sample Selection Bias Precipitates Model Collapse

Adversarial Concept Search: Predicting Compositional Errors From Feature Geometry

CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning

Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

Efficient Temporal Modeling for Mobile Sleep Staging via Lightweight Random Attention

Can Editing 1 Neuron Fix Repetition Loops in LLMs?

Morphology-Aware Sample Assignment: Overcoming IoU Insensitivity for Surface Defect Detection

The Weight Norm Sets the Grokking Timescale: A Causal Delay Law

Beyond LoRA: Is Sparsity-Induced Adaptation Better?

SuperThoughts: Reasoning Tokens in Superposition

Gefen: Optimized Stochastic Optimizer

Knowledge Graph Enhanced Memory-Augmented Retrieval for Long Context Modeling

Numbers Already Carry Their Own Embeddings

Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage

Learning High Coverage Discriminative Parsimonious Rulesets

DIFF-ERO: A Conformance-Aware Loss for Deep Learning in Process Mining

Hierarchical ODE: Learning Continuous-Time Physical Prototypes for Early Link Failure Detection

Squeeze-Release: Iterative Pruning with Exact Structural Minimization

Discovery under Hypothesis Redundancy: A Geometric Theory of Discovery Bottlenecks

Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts

From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing

Learning optimal policies from event logs through reinforcement learning: a comparison of deep and MDP-based approaches

Token-Level LLM Collaboration via FusionRoute

Learning Developmental Scaffoldings to Guide Self-Organisation

MiniMax Sparse Attention

Fractured Chain-of-Thought Reasoning

DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

UltraSketchLLM: Sub-1-Bit LLM Compression via Sketch and Hardware-Friendly Operators

Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value

Distributional Biases in Post-Training: A Markovian Analysis of Reasoning Trajectories

Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

Learning What to Predict: Downstream-Guided Task Design for Continued Pretraining