arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Agents, Planning, and Decision-Making (25 papers)

2606.13683 2026-06-15 cs.AI cs.CL new submission

UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

Hui Wang, Fafa Zhang, Meng Liu, Xiangyu Chen, Chaoxu Mu

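Affiliations: School of Artificial Intelligence, Anhui University; Anhui Provincial Key Laboratory of Security Artificial Intelligence, Anhui University; Pengcheng Laboratory

AI summary: Proposes UP-NRPA, an online framework that uses large language models and user portraits to adapt dialogue policy in real time without offline reinforcement learning. It reports a 100% success rate on multiple dialogue tasks across collaborative and non-collaborative benchmarks and improves the sale-to-list ratio (SL) by 56.41% on negotiation tasks.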

Abstract

To address the challenge that current dialogue policy planning methods struggle to dynamically adapt to diverse user characteristics, this paper proposes a User Portrait based Nested Rollout Policy Adaptation (UP-NRPA) online framework with Large Language Models. In contrast to conventional approaches that depend on model training and require offline reinforcement learning policy models for user groups, UP-NRPA enables dynamic customization of dialogue strategies through an adaptive mechanism. This is achieved by leveraging real-time user feedback alongside personality, preferences, and objectives mapped from the current user portrait, thereby adapting to user characteristics without offline reinforcement learning. In collaborative and non-collaborative dialogue benchmarks, UP-NRPA demonstrates considerable benefits, achieving an impressive 100% success rate in multiple dialogue tasks. Particularly in negotiation tasks, the sale-to-list ratio (SL) increases by 56.41%. This demonstrates that UP-NRPA can adapt to diverse user needs without requiring a training mechanism, enabling the dialogue system to adapt to user characteristics.

2606.13707 2026-06-15 cs.AI cs.CL cs.CV new submission

Orchestra-o1: Omnimodal Agent Orchestration

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng

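Affiliations: CUHK (The Chinese University of Hong Kong); LIGHTSPEED; PKU (Peking University); THU (Tsinghua University); Tongji University

AI summary: Proposes Orchestra-o1, an omnimodal agent orchestration framework whose unified orchestration mechanism provides modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. It surpasses the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. The DA-GRPO reinforcement learning method is introduced to train Orchestra-o1-8B, which reaches state-of-the-art performance among open-source omnimodal agents.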

Abstract

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.

2606.14211 2026-06-15 cs.AI cs.LG new submission

Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL

Yinglun Zhu

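Affiliations: University of California, Riverside

AI summary: Addresses inaccurate self-assessment by LLM agents after they observe environment feedback. Proposes RefGRPO, which computes a free calibration bonus by contrasting the agent's reflection with the actual outcome and dynamically schedules the bonus coefficient, improving both reflection calibration and task accuracy.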

Abstract

LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to accurately assess its own performance. Yet we find a persistent reflection gap: LLM agents tend to mis-assess their own outputs after observing concrete environment feedback -- even for questions they correctly answered -- and standard RL barely helps due to a credit-assignment mismatch. To close this gap, we propose RefGRPO, a simple yet effective fix that augments standard RL algorithms with two key ingredients: a free calibration bonus computed by contrasting the agent's own reflection with the actual outcome (requiring no additional reward model, LLM judge, or external annotation), and a dynamic schedule on its coefficient. Compared to standard RL baselines, our method simultaneously improves reflection calibration (e.g., reduces the underconfidence rate from 44.4% to 7.7%) and task accuracy (e.g., from 75.1% to 76.5%) on text-to-SQL across five benchmarks. The resulting calibrated reflection turns the agent into its own verifier grounded in environment feedback, which further enables (i) better self-improvement that uses reflections as pseudo-rewards without outcome supervision, and (ii) more effective test-time selective prediction by committing only to rollouts flagged as correct.
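As a rough illustration of the calibration bonus described in this abstract, the sketch below adds a bonus whenever the agent's self-assessment matches the real outcome and scales it with a scheduled coefficient. The function names, the linear decay schedule, the bonus magnitude, and the definition of the underconfidence rate are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
def bonus_coefficient(step, total_steps, start=0.5, end=0.0):
    """Hypothetical linear schedule for the bonus coefficient (an assumption).

    total_steps must be positive.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def shaped_reward(task_correct, reflected_correct, step, total_steps):
    """Task reward plus a bonus when the reflection agrees with the outcome.

    No reward model or judge is needed: the bonus only compares the agent's
    own claim (reflected_correct) with the observed result (task_correct).
    """
    task_reward = 1.0 if task_correct else 0.0
    calibrated = reflected_correct == task_correct
    bonus = 1.0 if calibrated else 0.0
    return task_reward + bonus_coefficient(step, total_steps) * bonus


def underconfidence_rate(outcomes, reflections):
    """Share of correct rollouts that the agent judged incorrect (assumed definition)."""
    on_correct = [r for o, r in zip(outcomes, reflections) if o]
    if not on_correct:
        return 0.0
    return sum(1 for r in on_correct if not r) / len(on_correct)
```

Under these assumptions, a correct and correctly self-assessed rollout early in training earns more than a correct rollout the agent wrongly doubted, which is the pressure toward calibration the abstract describes.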

2606.14239 2026-06-15 cs.AI new submission

SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

Haowen Gao, Haoran Chen, Can Wang, Shasha Guo, Liang Pang, Zhaoyang Liu, Huawei Shen, Xueqi Cheng

Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Tongyi Lab, Alibaba Group

AI summary: Proposes SkillAudit, a framework that evolves agent skills without ground-truth feedback through paired trajectory auditing and Process-Aligned Contrastive Evaluation, reaching 73.9% average task reward across 89 tasks.

Comments 20 pages, 5 figures

Abstract

Agent skills are structured procedural packages that guide frozen LLM agents in specialized workflows. Skills rarely remain sufficient after deployment: edge cases, API changes, and deployment constraints become visible only through use, making skill evolution a practical necessity. Existing methods depend on privileged feedback such as held-out validation scores, hidden test outcomes, or environment rewards -- signals often unavailable when a practitioner has only a task description and workspace data. We introduce SkillAudit, a framework for evolving agent skills without ground-truth feedback. The key idea is paired trajectory auditing: at each iteration, the same task is executed with and without the candidate skill, isolating how the skill changes agent behavior without external labels. To turn behavioral differences into edit guidance, SkillAudit uses Process-Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals linked to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, checks task constraints and rolls back harmful updates. SkillAudit routes edits through two pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces passages that conflict with the task. Across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves 73.9% average task reward, outperforming an agent without skills (40.9%) and the static expert skill (56.7%). These gains are obtained without accessing hidden tests, reference solutions, or external scoring functions during evolution.
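The paired trajectory auditing idea can be pictured with a small sketch: run the same task with and without the candidate skill, then list where the two action sequences diverge. The trajectory representation (a list of action strings) and the function name are assumptions for illustration. The PACE evaluators and the structural verifier from the abstract are not modeled here; a plain sequence diff stands in for them.

```python
from difflib import SequenceMatcher


def trajectory_divergences(without_skill, with_skill):
    """Return the spans where the two trajectories differ.

    Each item is (kind, steps_without_skill, steps_with_skill), where kind is
    one of "replace", "delete", or "insert" as reported by difflib.
    """
    matcher = SequenceMatcher(a=without_skill, b=with_skill, autojunk=False)
    return [
        (tag, without_skill[i1:i2], with_skill[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical trajectories for one task.
baseline = ["read_task", "write_code", "submit"]
skilled = ["read_task", "write_code", "run_tests", "submit"]
divergences = trajectory_divergences(baseline, skilled)
```

Because both runs share the task and the agent, any divergence can be attributed to the skill itself, which is why no external label is needed to isolate its effect.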

2606.14249 2026-06-15 cs.AI new submission

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Tingyang Chen, Shuo Lu, Kang Zhao, Weicheng Meng, Hanlin Teng, Tianhao Li, Chao Li, Xule Liu, Jian Liang, Zhizhong Zhang, Yuan Xie, Heng Qu, Kun Shao, Jian Luan

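Affiliations: Darwin Agent Team

AI summary: Proposes HarnessX, which composes harness primitives via a substitution algebra, adapts them with the AEGIS multi-agent evolution engine, and closes the loop with trajectory feedback, yielding an average gain of 14.5% across five benchmarks.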

Abstract

AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today's harnesses remain largely hand-crafted and static: each new model or task still demands bespoke scaffolding, and the rich traces produced during execution are rarely distilled back into systematic improvement. We introduce HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness-model loop by turning trajectories into both harness updates and model training signal. Across five benchmarks (ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified), HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest. These results suggest that agent progress need not come from model scaling alone: composing and evolving runtime interfaces from execution feedback is an actionable and complementary lever. The complete codebase will be open-sourced in a future release.

2606.14314 2026-06-15 cs.AI new submission

Communication Policy Evolution for Proactive LLM Agents

Xinbei Ma, Jiyang Qiu, Yao Yao, Zheng Wu, Yijie Lu, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao

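Affiliations: Shanghai Jiao Tong University; OPPO Research Institute

AI summary: Addresses the information asymmetry between users and agents by formalizing communication policy, establishing textual and UI-based policies, and introducing the self-evolution framework CPE, which improves task success through prompt refinement alone.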

Abstract

LLM agents have rapidly evolved into autonomous systems, yet a persistent information gap remains between users and agents: communication is costly, while users' identical preferences further limit information exchange. To investigate how agents should communicate across modalities, this paper formalizes Communication Policy, establishes textual and UI-based policies, and then evaluates communication policies across diverse environments, personas, and model combinations. Building information asymmetry for proactive agents, we set up two complementary settings, User-Agent and Planner-Executor. Experimental results reveal complementary strengths between interaction channels: text-based interaction often facilitates task performance, while structured UI improves agents' response quality and persona compliance. Motivated by that, a hybrid method combines these advantages. We further propose Communication Policy Evolution (CPE), a self-evolution framework for refining communication policies through rollout and prompt-level evolving. Without model modification, CPE achieves the best task success across multiple settings using prompt refinement alone. Our findings identify communication behavior as a critical yet underexplored design dimension for LLM agents.

2606.14418 2026-06-15 cs.AI cs.LG cs.RO new submission

Causal Object-Centric Models for Planning with Monte Carlo Tree Search

Rodion Vakhitov, Leonid Ugadiarov, Alexey Skrynnik, Aleksandr Panov

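Affiliations: MIRAI CogAILab

AI summary: Proposes COMET, which pairs an unsupervised object-centric encoder with a transformer world model and uses action-slot fusion and object-causal attention for planning, achieving a higher mean normalized score than baselines early in training across several benchmarks.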

Abstract

We introduce COMET (Causal Object-centric Model for Efficient Tree search), a model-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot-structured latent space. COMET pairs a frozen unsupervised object-centric encoder with a transformer-based world model, in which actions are bound to objects through a novel action-slot fusion mechanism that is used in slot transition prediction. Policy and value heads use object-causal attention, modulating token interactions by learned per-slot relevance scores so that decision-making concentrates on task-relevant entities. COMET adds an explicit object-level inductive bias to MuZero-style latent planning. Across eight visually and dynamically diverse tasks from the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object-centric and monolithic baselines.

2606.14470 2026-06-15 cs.AI cs.CL cs.LG new submission

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

Pavan C Shekar, Abhishek H S, Aswanth Krishnan

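Affiliations: QpiAI

AI summary: Proposes GitOfThoughts, which stores an agent's reasoning tree as a git repository so that reasoning can be replayed, audited, and merged. Experiments show that no memory format reliably improves accuracy on novel problems; gains appear only when the retrieved case is highly similar to the current problem (similarity above about 0.8), and they come from answer retrieval rather than method transfer.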

Comments 10 pages, 1 figure, 9 tables

Abstract

Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent's reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity >~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.
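The mapping in this abstract (thought as commit, score as note, outcome as tag, retrieval as "git log") can be sketched as a set of helpers that build standard git command lines. The helper names, the message and note formats, and the similarity gate are assumptions for illustration, not the project's actual interface; only the git subcommands and flags themselves are real. The helpers return argument lists and do not run anything.

```python
from typing import List


def commit_thought_cmd(thought: str) -> List[str]:
    """One scored thought becomes one (possibly empty) commit."""
    return ["git", "commit", "--allow-empty", "-m", thought]


def score_note_cmd(commit: str, score: float) -> List[str]:
    """Attach the score to the commit as a git note (note format is an assumption)."""
    return ["git", "notes", "add", "-m", f"score={score:.2f}", commit]


def outcome_tag_cmd(tag: str, commit: str) -> List[str]:
    """Mark a final outcome with a tag on the commit that produced it."""
    return ["git", "tag", tag, commit]


def retrieve_cmd(keyword: str, limit: int = 5) -> List[str]:
    """Retrieve earlier thoughts whose commit message matches a keyword."""
    return ["git", "log", f"--grep={keyword}", "-n", str(limit), "--format=%H %s"]


def worth_retrieving(similarity: float, threshold: float = 0.8) -> bool:
    """Gate on the reported copyability threshold (about 0.8 in the abstract)."""
    return similarity > threshold
```

Each list could be passed to `subprocess.run(cmd, check=True)` inside an initialized repository. Keeping command construction separate from execution makes the mapping easy to inspect without touching a repository.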

2606.14476 2026-06-15 cs.AI cs.LG new submission

When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More

Zhongyuan Wang, Pratyusha Vemuri

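Affiliations: raptorX.ai

AI summary: Tests whether LLM agents exercise judgment or defer blindly when using GNN tools. The agent's predictions agree with the GNN's output 97.6-99.2% of the time, stronger backbones defer more, and selective invocation appears limited by the information available at test time.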

Comments 9 pages, 2 figures. Under review at TMLR

Abstract

A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool. We test this directly. We expose a frozen GNN to a ReAct-style LLM agent as an explicit tool and measure, on node classification over a text-attributed graph (ogbn-arxiv, replicated on WikiCS), whether the agent uses the tool or merely obeys it. We find the agent does not exercise judgment: its predictions agree with the raw GNN's 97.6-99.2% of the time (5 seeds), collapsing into a GNN parrot that adopts the tool's output wholesale and bypasses its own reasoning. Sweeping backbone capability (Qwen2.5 0.5B-7B), the deference is not a weak-model artifact: among models able to invoke the tool, agreement rises with capability (0.60 to 0.98 from 1.5B to 7B). Crucially, the cost of deference does not shrink as capability grows and grows where alternatives emerge: a per-node oracle over the available actions beats the parrot by 0.09-0.18 at 3B and 0.12-0.22 at 7B, roughly doubling at high homophily, because the parrot is pinned to the frozen GNN while the agent's alternatives improve; at 7B a simple neighbour-label tool overtakes the GNN at high homophily (0.81 vs 0.71) yet the agent still defers. A simple selective-invocation gate recovers about half of that high-homophily gap (0.71 to 0.83) but yields no net global gain, and held-out estimates bound the best achievable gate over standard test-time features to at most a third of the oracle headroom: reliable selective invocation looks limited by available information, not merely router design. Our results are a cautionary measurement: evaluations of agent+tool systems cannot assume the agent adds judgment on top of the tool, and selective invocation must be designed in rather than expected to emerge from scale.
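The two measurements this abstract relies on, agreement between the agent and the tool and the headroom of a per-node oracle over the available actions, reduce to a few lines of arithmetic. The sketch below uses made-up predictions and hypothetical function names purely to show how the quantities relate; none of the numbers come from the paper.

```python
def agreement_rate(agent_preds, tool_preds):
    """Fraction of nodes where the agent's answer equals the tool's answer."""
    return sum(a == t for a, t in zip(agent_preds, tool_preds)) / len(agent_preds)


def accuracy(preds, labels):
    """Fraction of nodes predicted correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def oracle_accuracy(action_preds, labels):
    """Accuracy of an oracle that picks, per node, any available action that is correct."""
    hits = sum(
        any(preds[i] == labels[i] for preds in action_preds.values())
        for i in range(len(labels))
    )
    return hits / len(labels)


# Toy data (illustrative only).
labels = [0, 1, 2, 1]
gnn = [0, 1, 0, 0]
neighbour = [0, 0, 2, 0]
agent = [0, 1, 0, 0]  # a "parrot": identical to the GNN output

parrot_agreement = agreement_rate(agent, gnn)
headroom = oracle_accuracy({"gnn": gnn, "neighbour": neighbour}, labels) - accuracy(agent, labels)
```

In this toy case the agent agrees with the tool on every node, so its accuracy is capped at the tool's accuracy, while the oracle gains from the one node where the alternative action is right.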

2606.14502 2026-06-15 cs.AI new submission

From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI

Yongheng Zhang, Ziang Liu, Jiaxuan Zhu, Shuai Wang, Xiangqi Chen, Haojing Huang, Jiayi Kuang, Siyu Chen, Ao Shen, Hao Wu, Qiufeng Wang, Qian-Wen Zhang, Junnan Dong, Wenhao Jiang, Ying Shen, Hai-Tao Zheng, Yinghui Li, Di Yin, Xing Sun, Philip S. Yu

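Affiliations: arXiv

AI summary: Describes the paradigm shift of LLMs from chatbot to digital colleague along two dimensions, the cognitive core (Thinking LLMs) and tool-augmented task execution (OpenClaw-style workstation systems), covering persistent work, state persistence, reusable skills, and self-improvement.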

Comments The paper is available on the project website: https://from-chatbot-to-digital-colleague.github.io/

Abstract

Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems capable of reasoning, action, memory, and self-improvement. We conceptualize this transition as a shift from Chatbot to Digital Colleague: from conversational answers to persistent work. We organize this transition along two tightly coupled dimensions. First, at the cognitive core level, LLMs are advancing from Chatbot-era "fast thinking" systems driven by next-token prediction toward Thinking LLMs that leverage inference-time computation, Chain-of-Thought reasoning, reflection, process supervision, and reinforcement learning to support more deliberate and reliable cognition. Second, at the tool-augmented task execution level, LLMs are progressing from tool-calling Agents that invoke external resources in an ad hoc manner toward OpenClaw-style workstation systems (OpenClaw) equipped with persistent Workspaces, skills, verification loops, and governance. The "Workspace + Skill" paradigm makes episodic tool use colleague-like via state persistence, reusable procedures, task closure, and experience reuse. We examine data construction shifts from instruction-response pairs to State-Action-Observation trajectories and evaluation from static benchmarks to sandboxed, auditable, self-evolving AI ecosystems.

2606.14672 2026-06-15 cs.AI cs.CL new submission

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

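Affiliations: Georgia Institute of Technology; Meta

AI summary: Proposes Parallel-Synthesis, a framework in which the synthesizer directly consumes the KV caches of parallel worker agents, avoiding the redundancy of text concatenation. It matches or outperforms text-based synthesis on seven of nine datasets, stays close on the other two, and reduces time-to-first-token by 2.5x-11x.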

Abstract

Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel-Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. We train Parallel-Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x, suggesting that direct cache-based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.

2605.13217 2026-06-15 cs.CL cs.AI cs.LG cross-list

GAGPO: Generalized Advantage Grouped Policy Optimization

Siyuan Zhu, Chao Yu, Rongxin Yang, Zongkai Liu, Jinjun Hu, Qiwen Chen, Yibo Zhang

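Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; Meituan

AI summary: GAGPO is a critic-free reinforcement learning method that uses a grouped value proxy and an action-level importance ratio for precise temporal credit assignment in multi-turn tasks; experiments show it outperforms existing baselines on ALFWorld and WebShop.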

Abstract

Reinforcement learning has become a powerful paradigm for post-training large language model agents, yet credit assignment in multi-turn environments remains a challenge. Agents often receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to determine which intermediate actions contributed to success or failure. As a result, propagating delayed outcomes back to individual decision steps without relying on costly auxiliary value models remains an open problem. We propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free reinforcement learning method for precise, step-aligned temporal credit assignment. GAGPO constructs a non-parametric grouped value proxy from sampled rollouts and uses it to compute TD/GAE-style temporal advantages, recursively propagating outcome supervision backward through time. Combined with group-wise advantage normalization and an action-level importance ratio, GAGPO extracts stable, localized optimization signals directly from multi-turn trajectories. Experiments on ALFWorld and WebShop show that GAGPO outperforms strong reinforcement learning baselines. Further analyses demonstrate faster early-stage learning, improved interaction efficiency, and smoother optimization dynamics, suggesting that GAGPO offers a simple yet effective framework for multi-turn agentic reinforcement learning.
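One plausible reading of the grouped value proxy and the TD/GAE-style advantages in this abstract is sketched below. The proxy here is the mean discounted return at each step index across a group of rollouts for the same task, which is an assumption; the paper may group states differently. Group-wise advantage normalization and the action-level importance ratio are omitted, and all function names are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for one rollout."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def grouped_value_proxy(group_rewards, gamma):
    """V[t] = mean return-to-go at step t over rollouts that reach step t (assumed form)."""
    returns = [discounted_returns(r, gamma) for r in group_rewards]
    horizon = max(len(g) for g in returns)
    values = []
    for t in range(horizon):
        alive = [g[t] for g in returns if len(g) > t]
        values.append(sum(alive) / len(alive))
    return values


def gae_advantages(rewards, values, gamma, lam):
    """GAE for one rollout, with the group proxy as baseline and V = 0 after the last step."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Two rollouts of the same task with a sparse terminal reward.
group = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
proxy = grouped_value_proxy(group, gamma=1.0)
adv_success = gae_advantages(group[0], proxy, gamma=1.0, lam=1.0)
adv_failure = gae_advantages(group[1], proxy, gamma=1.0, lam=1.0)
```

No learned critic appears anywhere: the baseline is computed directly from the sampled group, and the backward recursion spreads the terminal reward over earlier steps.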

2606.13692 2026-06-15 cs.DB cs.AI cross-list

An Agentic Retrieval Framework for Autonomous Context-Aware Data Quality Assessment

Hadi Fadlallah, Ibrahim Dhaini, Fatima Mubarak, Rima Kilany

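Affiliations: University of Sciences and Arts in Lebanon; Saint-Joseph University of Beirut

AI summary: Proposes an agentic retrieval framework in which a multi-agent workflow interprets the intended data usage, derives context-aware assessment strategies, and generates executable validation logic. A feasibility validation stage checks generated specifications before execution; experiments show that results adapt to different usage scenarios and that fewer unrealistic or non-executable rules are generated.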

Comments 26 pages, 18 figures, Submitted to the International Journal of Intelligent Information and Database Systems

Abstract

Data quality assessment is a critical prerequisite for effective data analytics and data-driven decision-making, yet it remains a challenging task due to the inherently context-dependent nature of data quality. Existing approaches often rely on static rules or manual assessment strategies, limiting their adaptability to diverse usage scenarios and constraining automation at scale. Recent advances in artificial intelligence, particularly large language models, offer new opportunities for automating data quality assessment, but raise concerns related to reliability, grounding, and execution safety. In this paper, we propose a unified agentic-retrieval framework for autonomous context-aware data quality assessment. The framework interprets natural-language descriptions of intended data usage, derives context-aware assessment strategies, and generates executable validation logic through a multi-agent workflow. To ensure operational reliability, the framework introduces a feasibility validation stage that evaluates the realism and executability of generated assessment specifications before execution, enabling iterative refinement when necessary. Accepted validation logic is executed deterministically to guarantee reproducible and auditable results. We implement the proposed framework as an end-to-end prototype and evaluate it across multiple usage scenarios applied to the same dataset. The results demonstrate that assessment outcomes adapt meaningfully to different intended uses, while feasibility-gated execution reduces unrealistic or non-executable rule generation. The proposed approach provides a practical foundation for deploying autonomous yet controlled data quality assessment in modern data-driven environments.

2606.13698 2026-06-15 eess.SY cs.AI cs.LG cs.NI cs.PF cs.SY cross-list

Active Inference for Adaptive Traffic Signal Control in Noisy Nonstationary IoT Environments

Dénes Toth, George Ambroladze, Edwin Sundberg, Ali Beikmohammadi, Alfreds Lapkovskis

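Affiliations: Department of Computer Systems and Sciences; Stockholm University

AI summary: Proposes an active inference traffic signal controller that selects phases by minimizing expected free energy. In the noisiest scenarios, which combine sensor occlusion, adverse weather, and nonstationary demand, it attains lower idle time and CO2 emissions than a deep Q-network and a rule-based heuristic.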

Comments Submitted to IEEE 12th World Forum on Internet of Things (WF-IoT) 2026

Abstract

Urban traffic signal control at IoT-instrumented intersections must remain effective under sensor occlusion, weather attenuation, and nonstationary demand. Conventional controllers degrade under these conditions, and learned policies remain difficult to audit. To address these challenges, we propose an active inference controller for a four-arm signalized intersection that dynamically selects phases by minimizing expected free energy (EFE) over Gaussian beliefs about per-direction congestion levels, yielding a fully traceable decision pipeline. We benchmark the controller in a SUMO traffic simulator against a rule-based heuristic and a deep Q-network (DQN) across four scenarios that progressively increase noise and nonstationarity, spanning sensor occlusion, adverse weather, and stochastic accidents. Across 100 independent random evaluations per scenario, active inference attains the lowest idle times and CO2 emissions in the noisiest scenarios (56,977 s and 29.12 kg vs. 71,741 s and 30.56 kg for DQN). These gains come at a modest cost in bus priority service rate and phase switch frequency.

2606.14445 2026-06-15 cs.SE cs.AI cs.HC 交叉投稿

tap: A File-Based Protocol for Heterogeneous LLM Agent Collaboration

tap:一种用于异构LLM智能体协作的基于文件的协议

Minseo Kim

发表机构 * HUA Labs(HUA实验室)

AI总结 提出tap协议,通过文件优先设计实现不同厂商LLM智能体(如Claude和Codex)在共享代码库上的协作,无需共享内存或相同运行时,实验表明异构模型对审查缺陷发现率更高。

Comments Accepted to KCC 2026. English archival translation. 3 pages, 1 figure, 3 tables

详情
AI中文摘要

现有的多智能体软件开发系统提出了多种智能体协作形式,包括基于角色的协作和自动化代码审查。然而,许多系统假设共同的运行时、中央对话服务器或相同的API系列。在这些假设下,来自不同供应商的LLM智能体无法轻易地从各自的执行环境中直接交换消息,同时在共享代码库上分配开发和审查工作。本文提出了tap,一种基于文件的协作协议,允许Claude(Anthropic)和Codex(OpenAI)在无需共享内存或相同运行时的情况下协作开发同一个代码库。tap的核心是文件优先设计,它将带有元数据的Markdown文件作为原始消息保存,结合文件检查路径(文件通信,第一层)和针对Claude与Codex的实时通知路径(实时通信,第二层),并通过独立的git工作树隔离工作。即使实时通知失败或接收者重启,消息文件仍然可用,相同的内容可以再次检查。在为期27天、37次生成的自应用操作中,tap被用于开发和审查自身,我们收集了209个与tap相关的拉取请求和717个操作工件。对375个审查工件的分析显示,记录至少一个缺陷或请求更改的审查比例在异构模型对中为69.8%,在同构模型对中为53.1%。这些结果表明,结合基于文件的消息保存和实时通知的tap在实际生产仓库中运行,并且结合异构模型和执行环境可以拓宽审查视角。tap作为开源npm包@hua-labs/tap(v0.5.2)发布。

英文摘要

Existing multi-agent software development systems have proposed many forms of agent collaboration, including role-based collaboration and automated code review. However, many systems assume a common runtime, a central conversation server, or the same API family. Under these assumptions, LLM agents from different vendors cannot easily exchange messages directly from their own execution environments while dividing development and review work on a shared codebase. This paper presents tap, a file-based collaboration protocol that allows Claude (Anthropic) and Codex (OpenAI) to collaborate on one codebase without shared memory or an identical runtime. The core of tap is a file-first design that preserves markdown files with metadata as original messages, combines a file inspection path (file communication, Tier 1) with real-time notification paths for Claude and Codex (real-time communication, Tier 2), and isolates work through separate git worktrees. Even if real-time notification fails or a receiver restarts, the message file remains available and the same content can be inspected again. In a 27-day, 37-generation self-applied operation where tap was used to develop and review itself, we collected 209 tap-related pull requests and 717 operational artifacts. An analysis of 375 review artifacts showed that the share of reviews recording at least one defect or requested change was 69.8% for heterogeneous model pairs and 53.1% for homogeneous model pairs. These results show that tap, which combines file-based message preservation with real-time notification, operates in a real production repository, and that combining heterogeneous models and execution environments can broaden review perspectives. tap is distributed as the open-source npm package @hua-labs/tap (v0.5.2).
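The file-first idea in the abstract above (messages preserved as markdown files with metadata, so a restarted receiver can inspect the same content again) can be sketched roughly as follows. This is a minimal illustration only and not the tap implementation: the `inbox` directory layout, the header keys `id`/`from`/`to`, and the helper names `write_message`, `read_message`, and `pending_for` are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_message(inbox: Path, msg_id: str, sender: str, recipient: str, body: str) -> Path:
    """Persist one message as a Markdown file with a small metadata header (hypothetical layout)."""
    inbox.mkdir(parents=True, exist_ok=True)
    path = inbox / f"{msg_id}.md"
    header = f"---\nid: {msg_id}\nfrom: {sender}\nto: {recipient}\n---\n"
    # Write to a temporary name first, then rename, so a reader never sees a half-written file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(header + body, encoding="utf-8")
    tmp.replace(path)
    return path


def read_message(path: Path) -> tuple[dict[str, str], str]:
    """Parse the metadata header and return (metadata, body). Reading does not consume the file."""
    text = path.read_text(encoding="utf-8")
    _, header, body = text.split("---\n", 2)
    meta = dict(line.split(": ", 1) for line in header.splitlines() if line)
    return meta, body


def pending_for(inbox: Path, recipient: str) -> list[Path]:
    """File-inspection path: list every stored message addressed to `recipient`."""
    return sorted(p for p in inbox.glob("*.md") if read_message(p)[0].get("to") == recipient)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        inbox = Path(d) / "inbox"
        write_message(inbox, "0001", "claude", "codex", "Please review the parser change.\n")
        for p in pending_for(inbox, "codex"):
            print(read_message(p))
```

Because the file is the message of record, a real-time notification (the second tier in the abstract) would only need to carry a pointer to the file; if that notification is lost, scanning the directory recovers the same content.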

2505.16120 2026-06-15 cs.AI 版本更新

LLM-Powered AI Agent Systems and Their Applications in Industry

基于大语言模型的AI智能体系统及其工业应用

Guannan Liang, Qianqian Tong

发表机构 * GitHub

AI总结 本文综述了从传统智能体到LLM驱动智能体的演进,分类为软件、物理和自适应混合系统,并讨论了在客服、软件开发、制造、教育、金融和医疗等领域的应用及挑战。

Comments This is the author's accepted version of the paper accepted to appear at IEEE AIIoT 2025. The final version will be available via IEEE Xplore. ©2025 IEEE. Personal use of this material is permitted

详情
AI中文摘要

大型语言模型(LLM)的出现重塑了智能体系统。与任务范围有限的传统基于规则的智能体不同,LLM驱动的智能体提供了更大的灵活性、跨领域推理和自然语言交互能力。此外,通过集成多模态LLM,当前的智能体系统能够处理包括文本、图像、音频和结构化表格数据在内的多种数据模态,从而实现更丰富、更自适应的现实世界行为。本文全面考察了从LLM前时代到当前LLM驱动架构的智能体系统演进。我们将智能体系统分为基于软件、物理和自适应混合系统,重点介绍了在客户服务、软件开发、制造自动化、个性化教育、金融交易和医疗保健中的应用。我们进一步讨论了LLM驱动智能体带来的主要挑战,包括高推理延迟、输出不确定性、缺乏评估指标和安全漏洞,并提出了缓解这些问题的潜在解决方案。

英文摘要

The emergence of Large Language Models (LLMs) has reshaped agent systems. Unlike traditional rule-based agents with limited task scope, LLM-powered agents offer greater flexibility, cross-domain reasoning, and natural language interaction. Moreover, with the integration of multi-modal LLMs, current agent systems are highly capable of processing diverse data modalities, including text, images, audio, and structured tabular data, enabling richer and more adaptive real-world behavior. This paper comprehensively examines the evolution of agent systems from the pre-LLM era to current LLM-powered architectures. We categorize agent systems into software-based, physical, and adaptive hybrid systems, highlighting applications across customer service, software development, manufacturing automation, personalized education, financial trading, and healthcare. We further discuss the primary challenges posed by LLM-powered agents, including high inference latency, output uncertainty, lack of evaluation metrics, and security vulnerabilities, and propose potential solutions to mitigate these concerns.

2602.00845 2026-06-15 cs.AI 版本更新

Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward

通过合成语义信息增益奖励优化基于检索的智能推理

Senkang Hu, Yong Dai, Yuzhi Zhao, Yihang Tao, Yu Guo, Zhengru Fang, Sam Tak Wu Kwong, Yuguang Fang

发表机构 * Hong Kong JC STEM Lab of Smart City(香港JC STEM实验室) City University of Hong Kong(香港城市大学) Lingnan University(岭南大学) Fudan University(复旦大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出InfoReasoner框架,利用合成语义信息增益奖励优化检索过程,通过GRPO训练策略,在七个问答基准上平均准确率提升5.4%。

Comments Accepted by ICML'26

详情
AI中文摘要

智能推理使大型推理模型(LRMs)能够动态获取外部知识,但由于缺乏密集、有原则的奖励信号,优化检索过程仍然具有挑战性。在本文中,我们介绍了InfoReasoner,一个统一的框架,通过合成语义信息增益奖励激励有效的信息寻求。理论上,我们将信息增益重新定义为模型信念状态的不确定性减少,建立了保证,包括非负性、伸缩可加性和通道单调性。实际上,为了实现无需手动检索注释的可扩展优化,我们提出了一种输出感知的内在估计器,该估计器通过双向文本蕴含的语义聚类,直接从模型的输出分布计算信息增益。这种内在奖励引导策略最大化认知进步,使得通过组相对策略优化(GRPO)进行高效训练成为可能。在七个问答基准上的实验表明,InfoReasoner始终优于强大的检索增强基线,平均准确率提升高达5.4%。我们的工作为基于检索的智能推理提供了一条理论上有根据且可扩展的路径。代码可在 https://github.com/dl-m9/InfoReasoner 获取。

英文摘要

Agentic reasoning enables large reasoning models (LRMs) to dynamically acquire external knowledge, yet optimizing the retrieval process remains challenging due to the lack of dense, principled reward signals. In this paper, we introduce InfoReasoner, a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward. Theoretically, we redefine information gain as uncertainty reduction over the model's belief states, establishing guarantees, including non-negativity, telescoping additivity, and channel monotonicity. Practically, to enable scalable optimization without manual retrieval annotations, we propose an output-aware intrinsic estimator that computes information gain directly from the model's output distributions using semantic clustering via bidirectional textual entailment. This intrinsic reward guides the policy to maximize epistemic progress, enabling efficient training via Group Relative Policy Optimization (GRPO). Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving up to 5.4% average accuracy improvement. Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval. The code is available at https://github.com/dl-m9/InfoReasoner
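To make "information gain as uncertainty reduction" over semantically clustered outputs more concrete, here is a rough sketch of one way such a quantity could be estimated from sampled answers. It is an illustration under stated assumptions, not InfoReasoner's estimator: the function names are hypothetical, the case-insensitive string match `same_text` stands in for a bidirectional textual-entailment check, and this simple plug-in difference of entropies is not guaranteed to be non-negative on finite samples.

```python
import math
from typing import Callable, Sequence


def semantic_entropy(answers: Sequence[str], entails: Callable[[str, str], bool]) -> float:
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    clusters: list[list[str]] = []
    for answer in answers:
        for cluster in clusters:
            rep = cluster[0]
            # Bidirectional check: same cluster only if each answer entails the other.
            if entails(answer, rep) and entails(rep, answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def information_gain(before: Sequence[str], after: Sequence[str],
                     entails: Callable[[str, str], bool]) -> float:
    """Uncertainty reduction attributed to one retrieval step."""
    return semantic_entropy(before, entails) - semantic_entropy(after, entails)


def same_text(a: str, b: str) -> bool:
    """Toy stand-in for an entailment model (assumption): case-insensitive equality."""
    return a.strip().lower() == b.strip().lower()


if __name__ == "__main__":
    sampled_before = ["Paris", "paris", "Lyon", "Nice"]    # answers sampled before retrieval
    sampled_after = ["Paris", "paris", "Paris", "PARIS"]   # answers sampled after retrieval
    print(information_gain(sampled_before, sampled_after, same_text))
```

With a difference-of-entropies definition like this one, per-step gains along a multi-step trajectory sum to the entropy before the first step minus the entropy after the last, which is one reading of the telescoping property the abstract mentions.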

2604.24117 2026-06-15 cs.AI 版本更新

An Analysis of the Coordination Gap between Joint and Modular Learning for Job Shop Scheduling with Transportation Resources

带运输资源的作业车间调度中联合学习与模块化学习协调差距分析

Moritz Link, Jonathan Hoss, Noah Klarmann

AI总结 通过资源稀缺性和时间主导性分析,量化联合训练与模块化训练在带运输资源的作业车间调度中的性能差距,发现联合训练在多数情况下更优,但在瓶颈环境下差距缩小。

Comments Supported by the Chips Joint Undertaking and its members, including top-up funding by National Authorities, within the Cynergy4MIE project (Grant Agreement No. 101140226). This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

带运输资源的高效作业车间调度对于高性能制造至关重要。随着“去中心化工厂”的兴起,多智能体强化学习已成为生产与运输任务联合调度的一种有前景的方法。先前的工作主要集中于开发新颖的合作架构,而忽视了何时需要联合训练的问题。联合训练指同时训练作业和自动导引车调度智能体,而模块化训练则涉及独立训练每个智能体后进行事后集成。在本研究中,我们系统地调查了在带运输资源的作业车间调度问题中,联合训练对于最优性能至关重要的条件。通过对资源稀缺性和时间主导性的严格敏感性分析,我们量化了协调差距——这两种训练模式之间的性能差异。在我们的评估中,联合训练优于大多数调度规则组合和模块化训练方法。然而,在瓶颈环境中,特别是在严重的运输和处理约束下,协调差距的优势会减弱。这些发现表明,在单个调度任务占主导地位的环境中,模块化训练是一种可行的替代方案。总体而言,我们的工作为根据环境条件选择训练模式提供了实用指导,使决策者能够优化基于强化学习的调度性能。

英文摘要

Efficient job-shop scheduling with transportation resources is critical for high-performance manufacturing. With the rise of "decentralized factories", multi-agent reinforcement learning has emerged as a promising approach for the combined scheduling of production and transportation tasks. Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary. Joint training denotes the simultaneous training of job and automatic guided vehicle scheduling agents, whereas modular training involves independently training each agent followed by post-hoc integration. In this study, we systematically investigate the conditions under which joint training is essential for optimal performance in the job-shop scheduling problem with transportation resources. Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap -- the performance difference between these two training modalities. In our evaluation, joint training outperforms the majority of dispatching rule combinations and modular training approaches. However, the coordination gap advantage diminishes in bottleneck environments, particularly under severe transport and processing constraints. These findings indicate that modular training represents a viable alternative in environments where a single scheduling task dominates. Overall, our work provides practical guidance for selecting between training modalities based on environmental conditions, enabling decision-makers to optimize reinforcement learning-based scheduling performance.

2605.05407 2026-06-15 cs.AI 版本更新

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

PRISM: 感知与推理交错用于序列决策

Mohamed Salim Aissi, Clemence Grislain, Clement Romac, Laure Soulier, Mohamed Chetouani, Olivier Sigaud, Nicolas Thome

发表机构 * Institut National de la Recherche Scientifique (INRS)(国家科学研究院)

AI总结 提出PRISM框架,通过动态问答流水线紧密耦合视觉语言模型(VLM)和语言模型(LLM),实现任务驱动的感知,在ALFWorld和R2R基准上显著超越现有图像模型。

详情
AI中文摘要

将基于LLM的具身智能体从纯文本环境扩展到复杂多模态设置仍是一个主要挑战。最近的研究发现,独立的视觉语言模型(VLM)存在感知-推理-决策差距,常常忽略任务关键信息。在本文中,我们介绍了PRISM,一个通过动态问答(DQA)流水线紧密耦合感知(VLM)和决策(LLM)的框架。LLM不是被动接受VLM的描述,而是对其提出批评,用目标导向的问题探查VLM,并综合生成紧凑的图像描述。这种闭环交互产生了对场景的清晰、任务驱动的理解。我们在ALFWorld和Room-to-Room(R2R)基准上评估了PRISM。我们表明:(1)PRISM显著优于最先进的基于图像的模型,(2)我们的交互式目标导向感知流水线带来了系统性和实质性的提升,(3)PRISM完全自动化,无需手工制作问题或答案。

英文摘要

Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM's description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our interactive goal-oriented perception pipeline yields systematic and substantial gains, and (3) PRISM is fully automatic, eliminating the need for handcrafted questions or answers.

2606.03108 2026-06-15 cs.AI 版本更新

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

EvoTrainer: 协同进化LLM策略与训练框架以实现自主智能体强化学习

Guhong Chen, Yingcheng Shi, Yongbin Li, Binhua Li, Xander Xu, Hu Wei, Shiwen Ni, Min Yang, Jieping Ye

发表机构 * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(深圳先进技术研究院,中国科学院) Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团) Alibaba Group(阿里巴巴集团) SUAT(深圳理工大学)

AI总结 提出EvoTrainer框架,通过协同进化LLM策略和训练端框架,基于经验反馈自动诊断、修正并积累可复用技能,在数学推理、编程竞赛和仓库级软件工程任务上匹配或超越人工设计的RL基线。

详情
AI中文摘要

自主LLM训练通常被表述为配方搜索,这使训练框架基本保持静态。这种局限性在智能体RL中尤为突出,其中不断变化的瓶颈和标量奖励掩盖了多种失败模式。我们引入了EvoTrainer,一个通过经验反馈协同进化LLM策略和训练端框架的自主训练框架:它诊断rollout级别的证据、修正诊断、回测干预并积累可复用技能。在数学推理、竞赛编程代码生成和仓库级软件工程上的评估表明,在相同数据、代码库和评估协议下,EvoTrainer匹配或超过了人工设计的RL参考,其中在长周期智能体SWE上增益最大。轨迹分析显示,保留的策略在不同领域分化,进化的诊断阻止了无效的高分分支被提升,而可复用技能塑造了后续搜索。自主LLM RL应超越配方搜索,转向策略和解释它们的训练框架的联合进化。

英文摘要

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering, EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, evolving diagnostics prevent invalid high-scoring branches from being promoted, and reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.

2606.07027 2026-06-15 cs.AI 版本更新

StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents

StainFlow: GUI代理中实体痕迹追踪与证据链接用于过程奖励

Haojie Hao, Longkun Hao, Yihang Lou, Yan Bai, Zhenyang Li, Zhichao Yang, Dongshuo Huang, Hongyu Lin, Lanqing Hong, Jiakai Wang, Xianglong Liu

发表机构 * Beihang University(北京航空航天大学) Peking University(北京大学) Renmin University of China(中国人民大学) Northwestern Polytechnical University(西北工业大学) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) National University of Singapore(新加坡国立大学) Zhongguancun Laboratory(中关村实验室)

AI总结 提出StainFlow模型,通过全局实体痕迹追踪和局部证据链接,解决GUI代理过程奖励中的里程碑分解主观性和局部窗口证据遗漏问题,提升在线强化学习成功率3.2%。

详情
AI中文摘要

强化学习已成为在长期、随机数字环境中改进GUI代理的有前景方法,但轨迹级成功反馈过于稀疏,无法为中间探索步骤提供可靠的信用分配。为缓解此问题,近期研究引入过程奖励模型,通过全局里程碑验证或局部步骤级评估提供更细粒度的训练反馈。然而,这些方法仍存在两个层级特定的局限性:全局里程碑分解主观且单一,难以适应真实GUI任务中的多条有效执行路径;而固定的局部判断窗口可能遗漏远程关键证据或用无关帧稀释决策信号。受网络流分析中痕迹追踪机制的启发,我们提出StainFlow,一种用于GUI代理的实体痕迹流过程奖励模型。为减少全局划分的主观性,我们引入全局实体痕迹追踪模块,提取视觉可验证的任务实体,并追踪其痕迹浓度和状态沿轨迹的演变,从而通过实体证据流的变化客观分离任务阶段。为提高局部验证的准确性,我们引入局部痕迹证据链接模块。以每个候选关键节点的触发实体为中心,该模块根据其痕迹浓度和状态变化检索相关步骤,并动态构建高密度证据窗口以验证真实关键节点。在AndroidWorld和OGRBench上的大量实验表明,StainFlow在线强化学习成功率相对提升3.2%,轨迹完成判断准确率提升1.8%。

英文摘要

Reinforcement Learning (RL) has become a promising approach for improving GUI Agents in long-horizon, stochastic digital environments, but trajectory-level success feedback is too sparse to provide reliable credit assignment for intermediate exploration steps. To mitigate this issue, recent studies introduce Process Reward Models (PRMs), which provide finer-grained training feedback through global milestone verification or local step-level evaluation. However, these methods still suffer from two level-specific limitations: global milestone decomposition is subjective and singular, making it difficult to accommodate the multiple valid execution paths in real GUI tasks, while fixed local judging windows may miss long-range key evidence or dilute the decision signal with irrelevant frames. Inspired by stain-tracing mechanisms in network flow analysis, we propose StainFlow, an entity-stain-flow process reward model for GUI Agents. To reduce the subjectivity of global partitioning, we introduce the Global Entity Stain Tracking module, which extracts visually verifiable task entities and tracks how their stain concentrations and states evolve along the trajectory, allowing task phases to be objectively separated by changes in the entity evidence flow. To improve the accuracy of local verification, we introduce the Local Stain Evidence Linking module. Centered on the triggering entities of each candidate key node, it retrieves relevant steps based on their stain concentrations and state changes, and dynamically constructs high-density evidence windows for verifying true key nodes. Extensive experiments on AndroidWorld and OGRBench show that StainFlow yields a relative improvement of 3.2% in online RL success and 1.8% in trajectory completion judgment accuracy.

2606.12817 2026-06-15 cs.AI 版本更新

GUITrans2Act: Understanding User Operational Behaviors from Mobile GUI Interactions with Vision-Language Models

GUITrans2Act: 利用视觉语言模型从移动GUI交互中理解用户操作行为

Yudong Zhang, Lei Hu, Daoyang Liu, Jiawei Liu, Yangfan Luo, Zhilin Gao, Zuojian Wang

发表机构 * Honor Device Co., Ltd(荣耀终端有限公司) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出Teach VLM模型,通过从演示视频中提取关键帧生成操作知识,并构建数据飞轮解决训练数据稀缺问题;在基准测试中达到最优性能,并提升下游智能体的任务成功率。

Comments 20 pages, 9 figures. Yudong Zhang and Lei Hu contributed equally to this work. Zuojian Wang and Zhilin Gao are corresponding authors

详情
AI中文摘要

理解移动设备上的数字世界正从静态UI感知转向动态动作理解。这种能力使模型能够将视觉状态转换转化为操作知识,定义为描述动作类型、目标UI元素、文本参数和执行顺序的简短自然语言句子。然而,由于跨应用的UI设计高度多样化和异构,现有视觉语言模型(VLM)难以准确推断这些底层操作。为弥补这一差距,我们引入了Teach VLM,这是一个核心模型,旨在通过从演示视频中提取和分析与操作相关的关键帧,将移动屏幕轨迹转化为逐步操作知识。为解决对齐训练数据稀缺的问题,我们开发了一个系统性的数据飞轮以实现可扩展的数据采集。我们进一步引入了一个新颖的中文移动屏幕教学基准用于细粒度评估。基于Teach VLM,我们提出了Teach-and-Repeat范式,其中生成的操作知识作为可解释的程序化参考,指导下游基于屏幕的执行智能体。大量评估表明,Teach VLM显著优于强VLM基线,在操作语义预测中达到了最先进的性能。此外,在Android World中的实验表明,我们的范式为下游智能体带来了持续的任务成功率提升。Teach VLM和Teach-and-Repeat范式共同提供了一条从原始演示到可复用任务自动化的实用路径。

英文摘要

Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.

2509.18930 2026-06-15 cs.LG cs.AI 版本更新

Tackling GNARLy Problems: Graph Neural Algorithmic Reasoning Reimagined through Reinforcement Learning

解决GNARLy问题:通过强化学习重新构想图神经算法推理

Alex Schutz, Victor-Alexandru Darvariu, Efimia Panagiotaki, Bruno Lacerda, Nick Hawes

发表机构 * Oxford Robotics Institute, University of Oxford(牛津大学机器人研究所) Stateful Robotics

AI总结 提出GNARL框架,将算法轨迹学习转化为马尔可夫决策过程,结合模仿学习和强化学习,在CLRS-30问题上取得高精度,适用于NP难问题及无专家算法场景。

详情
AI中文摘要

神经算法推理(NAR)是一种通过监督学习训练神经网络执行经典算法的范式。尽管取得了成功,但仍存在重要局限性:无法在不进行后处理的情况下构建有效解,无法推理多个正确解,在组合NP难问题上性能差,且不适用于尚未已知强算法的问题。为了解决这些局限性,我们将学习算法轨迹的问题重新定义为马尔可夫决策过程,这为解构建过程施加了结构,并解锁了模仿学习和强化学习(RL)的强大工具。我们提出了GNARL框架,包括将问题从NAR转化为RL的方法论,以及适用于广泛图问题的学习架构。我们在多个CLRS-30问题上取得了非常高的图准确率结果,性能匹配或超过针对NP难问题的更窄NAR方法,并且值得注意的是,即使在缺乏专家算法的情况下也能适用。

英文摘要

Neural algorithmic reasoning (NAR) is a paradigm that trains neural networks to execute classic algorithms by supervised learning. Despite its successes, important limitations remain: inability to construct valid solutions without post-processing and to reason about multiple correct ones, poor performance on combinatorial NP-hard problems, and inapplicability to problems for which strong algorithms are not yet known. To address these limitations, we reframe the problem of learning algorithm trajectories as a Markov decision process, which imposes structure on the solution construction procedure and unlocks the powerful tools of imitation and reinforcement learning (RL). We propose the GNARL framework, encompassing the methodology to translate problem formulations from NAR to RL and a learning architecture suitable for a wide range of graph-based problems. We achieve very high graph accuracy results on several CLRS-30 problems, performance matching or exceeding much narrower NAR approaches for NP-hard problems and, remarkably, applicability even when lacking an expert algorithm.

2510.02695 2026-06-15 cs.LG cs.AI 版本更新

RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization

RAMAC: 多模态风险感知离线强化学习及行为正则化的作用

Kai Fukazawa, Kunal Mundada, Iman Soltani

AI总结 提出RAMAC框架,结合分布性评论家与生成式演员(如扩散模型),通过条件风险价值与行为克隆的复合目标实现离线强化学习中的风险敏感学习,抑制分布外动作并提升CVaR。

Comments ICML 2026

详情
AI中文摘要

在安全关键领域中,当在线数据收集不可行时,离线强化学习(RL)只有在策略能够实现高回报且避免灾难性的下尾风险时才具有吸引力。先前关于风险厌恶离线RL的工作通过(i)基于值/模型的悲观主义或(ii)限制策略类以限制表达能力来实现安全性,而扩散/流式表达性生成策略主要在风险中性设置中使用。我们引入了风险感知多模态演员-评论家(RAMAC),一个简单、模块化、无模型的框架,它将表达性生成演员(例如扩散/流)与分布性评论家相结合,并优化一个结合条件风险价值(CVaR)与行为克隆(BC)的复合目标,从而在复杂的多模态场景中实现风险敏感学习。由于分布外(OOD)动作是离线RL中灾难性失败的主要驱动因素,我们进一步提供了一个目标层面的分析,表明通过BC控制行为发散可以抑制OOD动作并稳定CVaR。使用扩散演员实例化RAMAC,我们在二维风险赌博机上展示了这些见解,并在Stochastic-D4RL上进行了评估,观察到在保持高回报的同时,$\mathrm{CVaR}_{0.1}$的一致提升。代码和实验结果可在项目网站(https://kaifukazawa.github.io/ramac-project/)上获取。

英文摘要

In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) is attractive only if policies achieve high returns without catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of either (i) value/model-based pessimism or (ii) restricted policy classes that limit expressiveness, whereas diffusion/flow-based expressive generative policies have largely been used in risk-neutral settings. We introduce Risk-Aware Multimodal Actor-Critic (RAMAC), a simple, modular, model-free framework that couples an expressive generative actor (e.g., diffusion/flow) with a distributional critic and optimizes a composite objective that combines Conditional Value-at-Risk (CVaR) with behavioral cloning (BC), enabling risk-sensitive learning in complex multimodal scenarios. Since out-of-distribution (OOD) actions are a major driver of catastrophic failures in offline RL, we further provide an objective-level analysis showing that controlling behavior divergence via BC suppresses OOD actions and stabilizes CVaR. Instantiating RAMAC with a diffusion actor, we illustrate these insights on a 2-D risky bandit and evaluate on Stochastic-D4RL, observing consistent gains in $\mathrm{CVaR}_{0.1}$ while maintaining strong returns. The code and experimental results are available on the project website: https://kaifukazawa.github.io/ramac-project/
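For readers unfamiliar with the lower-tail metric reported above, the following sketch shows one common empirical estimate of $\mathrm{CVaR}_{\alpha}$ for returns: the average of the worst $\alpha$-fraction of sampled episode returns. It illustrates the metric only and is not RAMAC's training objective or evaluation code; the function name `cvar` and the sample values are assumptions made for the example.

```python
import math
from typing import Sequence


def cvar(returns: Sequence[float], alpha: float = 0.1) -> float:
    """Empirical lower-tail CVaR: mean of the worst alpha-fraction of returns."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not returns:
        raise ValueError("returns must be non-empty")
    ordered = sorted(returns)                        # worst outcomes first
    k = max(1, math.ceil(alpha * len(ordered)))      # keep at least one sample in the tail
    tail = ordered[:k]
    return sum(tail) / len(tail)


if __name__ == "__main__":
    # Hypothetical episode returns: mostly good, with one catastrophic episode.
    episode_returns = [10, -50, 20, 30, 5, 15, 25, 0, 40, 35]
    print("mean return:", sum(episode_returns) / len(episode_returns))
    print("CVaR at alpha=0.2:", cvar(episode_returns, alpha=0.2))
```

A policy can have a high mean return and still a poor CVaR when rare episodes go badly, which is why the abstract reports $\mathrm{CVaR}_{0.1}$ alongside returns.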

2601.19810 2026-06-15 cs.LG cs.AI cs.RO 版本更新

Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals

高效探索的无监督学习:通过自我设定目标预训练自适应策略

Octavio Pappalardo

发表机构 * University College London (UCL)(伦敦大学学院(UCL))

AI总结 提出ULEE方法,结合上下文学习器与对抗性目标生成策略,在无监督元学习框架中优化多回合探索与适应,提升零样本和少样本性能。

Comments ICLR 2026; v2 adds link to code: https://github.com/Octavio-Pappalardo/ulee-jax

详情
Journal ref
The Fourteenth International Conference on Learning Representations, 2026
AI中文摘要

无监督预训练可以为强化学习智能体提供先验知识,加速下游任务的学习。一个基于人类发展的有前景方向是研究智能体通过设定和追求自身目标来学习。核心挑战在于如何有效地生成、选择并从这些目标中学习。我们的关注点是下游任务的广泛分布,其中零样本解决每个任务是不可行的。当目标任务位于预训练分布之外或智能体未知其身份时,这种设置自然出现。在这项工作中,我们(i)在元学习框架内优化高效的多回合探索和适应,以及(ii)用智能体适应后性能的演化估计来指导训练课程。我们提出了ULEE,一种无监督元学习方法,它将上下文学习器与对抗性目标生成策略相结合,该策略将训练维持在智能体能力的前沿。在XLand-MiniGrid基准测试中,ULEE预训练产生了改进的探索和适应能力,这些能力泛化到新的目标、环境动态和地图结构。得到的策略获得了改进的零样本和少样本性能,并为更长的微调过程提供了强初始化。它优于从头学习、DIAYN预训练和替代课程。代码可在以下网址获取:https://github.com/Octavio-Pappalardo/ulee-jax

英文摘要

Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside of the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent's post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an in-context learner with an adversarial goal-generation strategy that maintains training at the frontier of the agent's capabilities. On XLand-MiniGrid benchmarks, ULEE pre-training yields improved exploration and adaptation abilities that generalize to novel objectives, environment dynamics, and map structures. The resulting policy attains improved zero-shot and few-shot performance, and provides a strong initialization for longer fine-tuning processes. It outperforms learning from scratch, DIAYN pre-training, and alternative curricula. Code is available at: https://github.com/Octavio-Pappalardo/ulee-jax

2. 知识表示、推理与符号AI 6 篇

2606.13703 2026-06-15 cs.AI cs.GL cs.LO 新提交

History of the Muddy Children Puzzle

泥孩子谜题的历史

Hans van Ditmarsch

发表机构 * CNRS, France(法国国家科学研究中心) IIT Kanpur, India(印度理工学院坎普尔分校)

AI总结 本文追溯泥孩子谜题在过去两个世纪中的起源,并介绍其变体及一个涉及自指的新帽子谜题。

详情
AI中文摘要

泥孩子谜题是一个关于知识和无知的谜题,对认知逻辑的发展具有启发意义。谁首先提出了它?这一点尚不清楚。我们通过过去两个世纪的逻辑和文学出版物追溯泥孩子谜题的起源。该谜题激发了众多变体,例如涉及数字或彩色帽子的谜题。我们还提出了一个涉及自指的新型帽子谜题。

英文摘要

The Muddy Children Puzzle is a puzzle about knowledge and ignorance that has been inspiring for the development of epistemic logic. Who came up with it first? This is unclear. We trace the origin of the Muddy Children Puzzle through logical and literary publications over the past two centuries. The puzzle inspired numerous variations, such as those involving numbers or coloured hats. We also present a novel hats puzzle involving self-reference.

2606.13925 2026-06-15 cs.AI math.AG 新提交

Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization

sorry 并非难点:半自动形式化的专家评审案例研究

Vasily Ilin, Brian Nugent

发表机构 * GitHub

AI总结 通过Grothendieck消没定理的半自动形式化案例,揭示大语言模型在定义选择与API设计上的不足,提出评估标准不应仅看是否消除 sorry,还应看能否经受专家评审。

详情
AI中文摘要

大型语言模型通常能够填补交互式定理证明器中的证明缺口,但一个经过验证的定理并不等同于一个可重用的库贡献。我们通过一个详细的案例研究来探讨这一区别:Grothendieck消没定理的半自动形式化。初始版本编译通过且不含任何 sorry,但专家评审发现定义、定理通用性、文件组织和API方面存在严重问题。然后,我们进行了评审驱动的重构和压缩过程,并获得了第二次专家评审。前后对比显示出明显的分裂:智能体在局部、机械可检查的反馈上适应良好,但在选择定义和设计API方面仍然薄弱。我们认为,自动形式化的评估不仅应基于已消除的 sorry,还应基于最终形式化是否经受住专家评审。

英文摘要

Large language models can often close proof gaps in interactive theorem provers, but a verified theorem is not the same thing as a reusable library contribution. We study this distinction through a detailed case study: a semi-autonomous formalization of Grothendieck's vanishing theorem. The initial version compiles with no sorries, but an expert review found serious problems in definitions, theorem generality, file organization, and the API. We then ran a review-driven refactor and compression process and obtained a second expert review. The before-and-after comparison shows a sharp split: agents adapted well to local, mechanically checkable feedback, but remained weak at choosing definitions and designing APIs. We argue that autoformalization should be evaluated not only by closed sorries, but by whether the resulting formalization survives expert review.
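For context on the term, a "sorry" in Lean 4 is a placeholder that lets a file compile (with a warning) while a proof is still missing. The toy pair below is only an illustration of an open versus a closed proof gap and has nothing to do with the formalization studied in the paper; the theorem names are made up for the example.

```lean
-- Compiles, but Lean warns that the declaration uses `sorry`: the proof is still open.
theorem zero_add_open (n : Nat) : 0 + n = n := by
  sorry

-- The same statement with the gap closed by an existing core lemma.
theorem zero_add_closed (n : Nat) : 0 + n = n := by
  exact Nat.zero_add n
```

Counting closed sorries measures only the step from the first form to the second; it says nothing about whether the statement was well chosen, which is the gap the abstract points to.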

2606.14000 2026-06-15 cs.AI 新提交

Formalizing Numerical Analysis: An Agent Pipeline and Quality Audit Beyond Kernel Acceptance

数值分析的形式化:超越内核接受的智能体流水线与质量审计

Theodore Meek, Siyuan Ge, Di Qiu Xiang, Simon Chess, Vasily Ilin

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种编码智能体流水线,将数值分析教材形式化为Lean 4代码,并引入三维质量评估框架(语义正确性、Mathlib复用、跨文件复用),发现编译通过掩盖了不忠实的形式化模式。

详情
AI中文摘要

近期工作表明,编码智能体可以在Lean 4中形式化整个高等数学教材,但现有努力集中在mathlib中已有充分表示的数学分支,并仅通过内核接受来衡量成功。我们通过将编码智能体应用于形式化《常微分方程数值方法》(一本数值分析教材,在mathlib中基本缺失)来解决这两个限制,从而考验智能体从头开发新理论的能力。我们进一步引入一个系统、可复现的三维框架,用于评估智能体生成的形式化质量,超越编译层面:语义正确性、Mathlib复用以及通过LLM-as-judge方法的跨文件复用。将该框架应用于我们自己的形式化以及RepoProver和M2F发布的输出,我们发现了内核接受完全掩盖的重复性不忠实形式化模式,包括不完整的多部分陈述、添加弱化假设和参数限制。我们的结果表明,基于编译的指标大大高估了形式化质量,我们提供了一种可复现的审计方法,以支持对未来自动形式化系统进行更严格的评估。

英文摘要

Recent work has demonstrated that coding agents can formalize entire advanced mathematics textbooks in Lean 4, yet existing efforts concentrate on branches of mathematics already well-represented in mathlib and measure success solely through kernel acceptance. We address both limitations by applying a coding agent to formalize Numerical Methods for Ordinary Differential Equations, a textbook in numerical analysis that is largely absent from mathlib, stressing the agent's capacity to develop new theory from scratch. We further introduce a systematic, reproducible three-dimensional framework for evaluating the quality of agent-produced formalizations beyond compilation: semantic correctness, Mathlib reuse, and cross-file reuse via LLM-as-judge methods. Applying this framework to our own formalization and to the released outputs of RepoProver and M2F, we uncover recurring unfaithful formalization patterns, including incomplete multi-part statements, added weakening hypotheses, and parameter restrictions, that kernel acceptance entirely obscures. Our results suggest that compilation-based metrics substantially overstate formalization quality, and we provide a reproducible audit methodology to support more rigorous evaluation of future autoformalization systems.

2606.14309 2026-06-15 cs.DB cs.AI cs.LO 交叉投稿

Transforming Shape Schemas with Composable Property-Graph Queries (Extended Version)

用可组合属性图查询转换形状模式(扩展版)

Philipp Seifer, Daniel Hernández, Ralf Lämmel, Steffen Staab

发表机构 * The Software Languages Team(软件语言团队) University of Koblenz(科伦茨大学) Institute for Artificial Intelligence(人工智能研究所) University of Stuttgart(斯图加特大学) University of Southampton(南安普顿大学)

AI总结 研究在给定输入模式(ProGS)和查询(G-CORE)时推断输出模式的问题,通过映射到RDF、SHACL和SPARQL CONSTRUCT利用描述逻辑推理器实现模式约束的自动推断。

详情
AI中文摘要

属性图可能受模式约束,这些模式向查询引擎和人类用户告知有效数据的形状,强制执行数据提供者和消费者之间的契约。可组合属性图查询将输入图转换为输出图。那么,问题就出现了:在一个(或几个)转换步骤之后,可以预期哪种模式。我们研究了在给定输入模式和转换查询的情况下如何推断模式约束。具体来说,我们提出了一种推理过程,给定ProGS中的输入模式和G-CORE中的查询,推断输出模式。由于图更新会频繁发生,我们的推理过程不依赖于图实例,因此计算出的输出模式适用于所有源自符合输入模式的任何输入图的图。相关工作已经针对SPARQL CONSTRUCT查询解决了这个问题,将其编码在描述逻辑(DL)中,使得输出模式由从输入模式和查询推断出的公理蕴含。然而,属性图及其查询使问题复杂化,因为属性图具有标签和属性注释以及一等边。因此,必须以某种方式使用具体化,尽管可用的DL缺乏直接编码这些特征的手段。我们通过一系列映射来应对这一新挑战:i) 在RDF中具体化的属性图,与ii) 从ProGS到SHACL的映射以及iii) 从G-CORE到SPARQL CONSTRUCT查询的映射对齐。通过这种方式,属性图的模式推断变得可管理,因为我们通过额外的映射层分解问题并利用高效的DL推理器。我们发展了关于推断模式约束的可靠性和映射模式及查询的语义等价性的元理论。

英文摘要

Property graphs may be constrained by schemas that inform both query engines and human users about the shape of valid data, enforcing a contract between data provider and consumer. Composable property-graph queries transform input graphs into output graphs. Then, the question arises of which schema can be expected after one (or several) transformation steps. We investigate how schema constraints can be inferred given an input schema and a transforming query. Specifically, we propose a reasoning procedure that, given an input schema in ProGS and a query in G-CORE, infers an output schema. Since graph updates will happen frequently, our inference procedure does not rely on graph instances, such that the computed output schema applies to all graphs originating from any input graph complying with the input schema. Related work has addressed this problem for SPARQL CONSTRUCT queries, encoding it in Description Logics (DLs) so that the output schema is entailed by axioms inferred from input schema and queries. Property graphs and their queries, however, complicate the matter, as property graphs feature label and property annotations as well as first-class edges. Thus, reification has to be used in one way or another, though available DLs lack the means to encode such features directly. We approach this novel challenge via a family of mappings for i) property graphs reified in RDF, aligned with ii) a mapping from ProGS to SHACL and iii) a mapping from G-CORE to SPARQL CONSTRUCT queries. In this manner, schema inference for property graphs becomes manageable, as we break apart the problem through the extra mapping layer and utilize efficient DL reasoners. We develop the metatheory regarding the soundness of inferred schema constraints and the semantic equivalence of mapped schemas and queries.

2501.08561 2026-06-15 cs.AI cs.HC cs.LG cs.SC 版本更新

ANSR-DT: A Neuro-Symbolic Framework for Adaptive and Explainable Digital Twins

ANSR-DT:一种自适应可解释数字孪生的神经符号框架

Safayat Bin Hakim, Muhammad Adil, Alvaro Velasquez, Houbing Herbert Song

发表机构 * Department of Information Systems, University of Maryland Baltimore County(信息系统系,马里兰大学巴尔的摩县分校) Department of Computer Science and Engineering, University at Buffalo(计算机科学与工程系,布法罗大学) Department of Computer Science, University of Colorado Boulder(计算机科学系,科罗拉多大学波德分校)

AI总结 提出ANSR-DT框架,结合CNN-LSTM、Prolog推理和PPO强化学习,实现数字孪生的异常检测、符号推理与自适应决策,在多个基准上表现优异。

Comments Code available at https://github.com/sbhakim/ansr-dt

详情
AI中文摘要

数字孪生越来越多地用于监控和优化工业系统,然而许多现有框架仍然难以解释、适应缓慢,并且整合显式领域知识的能力有限。本文提出了ANSR-DT,一种自适应神经符号框架,它在单一数字孪生流水线中统一了时序异常检测、符号推理和基于强化学习的决策支持。ANSR-DT将用于多变量模式识别的CNN-LSTM模型与基于Prolog的推理相结合,后者将学习到的信号转换为显式规则,从而实现透明的诊断和可追溯的决策路径。基于PPO的适应层进一步在变化条件下优化操作响应,同时保持可解释性。在8个基线模型上的实验表明,ANSR-DT在提供竞争性预测性能的同时,还能实现稳定的规则提取、可扩展的符号推理和可操作的解释。在Skoltech异常基准(SKAB)上的额外验证进一步表明,该框架能够迁移到合成场景之外。这些发现使ANSR-DT成为可信、自适应和可解释的工业数字孪生的实用基础。

英文摘要

Digital twins are increasingly used to monitor and optimize industrial systems, yet many existing frameworks remain difficult to interpret, slow to adapt, and limited in their ability to incorporate explicit domain knowledge. This paper presents ANSR-DT, an adaptive neuro-symbolic framework that unifies temporal anomaly detection, symbolic reasoning, and reinforcement-learning-based decision support within a single digital twin pipeline. ANSR-DT combines a CNN-LSTM model for multivariate pattern recognition with Prolog-based reasoning that converts learned signals into explicit rules, enabling transparent diagnoses and traceable decision paths. A PPO-based adaptation layer further refines operational responses under changing conditions while preserving interpretability. Experiments against 8 baselines show that ANSR-DT delivers competitive predictive performance together with stable rule extraction, scalable symbolic reasoning, and actionable explanations. Additional validation on the Skoltech Anomaly Benchmark (SKAB) further indicates that the framework transfers beyond synthetic settings. These findings position ANSR-DT as a practical foundation for trustworthy, adaptive, and explainable industrial digital twins.

2605.07121 2026-06-15 cs.AI cs.LG 版本更新

AdaTKG: Adaptive Memory for Temporal Knowledge Graph Reasoning

AdaTKG: 用于时序知识图谱推理的自适应记忆

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn

发表机构 * LG AI Research(LG人工智能研究)

AI总结 提出AdaTKG,通过为每个实体维护自适应记忆,并采用可学习的指数移动平均更新,解决时序知识图谱中实体表示静态的问题,提升推理性能。

Comments KDD Workshop on Frontiers in Graph Machine Learning for the Large Model Era 2026 (Oral Presentation)

详情
AI中文摘要

时序知识图谱(TKG)表示带有时间戳的关系事实,并支持对演化事件进行广泛的推理任务。然而,现有方法生成的实体表示在实体层面是静态的,即每个表示仅是学习参数的函数,且不保留实体参与交互的任何痕迹。在本文中,我们摒弃这种静态观点,提出将每个实体建模为一个自适应过程,其表示在实体每次参与事实时被细化。为此,我们提出AdaTKG,它为每个实体维护一个记忆,该记忆随每次观察到的交互而更新,记忆在线累积,预测随更多交互的到来而改进。具体而言,我们将记忆更新实例化为一个可学习的指数移动平均,由单个共享标量控制,而不是为每个实体使用可学习参数,使AdaTKG能够处理训练中未见过的实体。大量实验证实了相对于TKG基线的持续改进,证明了自适应记忆的有效性。代码见:https://github.com/seunghan96/AdaTKG

英文摘要

Temporal knowledge graphs (TKGs) represent time-stamped relational facts and support a wide range of reasoning tasks over evolving events. However, existing methods produce entity representations that are static at the entity level, in that each representation is a function of learned parameters only and retains no trace of the interactions in which the entity has participated. In this paper, we depart from this static view and propose that each entity be modeled as an adaptive process whose representation is refined every time the entity participates in a fact. To this end, we propose AdaTKG, which maintains a per-entity memory that is updated with every observed interaction, with the memory accumulating online and predictions improving as more interactions arrive. Specifically, we instantiate the memory update as a learnable exponential moving average governed by a single shared scalar instead of using learnable parameters for each entity, enabling AdaTKG to handle entities unseen during training. Extensive experiments confirm consistent gains over TKG baselines, demonstrating the effectiveness of adaptive memory. Code is available at: https://github.com/seunghan96/AdaTKG
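
As a rough illustration of the memory update described in this abstract, the sketch below applies an exponential moving average with one scalar shared by all entities. The abstract does not give the exact update, so the sigmoid parameterization, the zero initialization for unseen entities, and the names `ema_step`, `observe`, and `raw_alpha` are all assumptions made for this example.

```python
import math

def ema_step(memory, signal, raw_alpha):
    """Blend the old memory with a new interaction signal.

    raw_alpha is an unconstrained scalar shared by every entity (assumed
    parameterization); the sigmoid keeps the effective rate inside (0, 1).
    """
    alpha = 1.0 / (1.0 + math.exp(-raw_alpha))
    return [(1.0 - alpha) * m + alpha * s for m, s in zip(memory, signal)]

def observe(memories, entity, signal, raw_alpha):
    """Update one entity's memory; entities never seen before start at zero."""
    previous = memories.get(entity, [0.0] * len(signal))
    memories[entity] = ema_step(previous, signal, raw_alpha)
    return memories[entity]

memories = {}
# raw_alpha = 0.0 gives an effective rate of 0.5
observe(memories, "entity_a", [1.0, 2.0], raw_alpha=0.0)
observe(memories, "entity_a", [1.0, 2.0], raw_alpha=0.0)
```

Because the rate is a single shared scalar rather than a per-entity parameter, `observe` needs nothing entity-specific to have been trained, which is the property the abstract credits for handling entities unseen during training.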

3. 多智能体与博弈 6 篇

2606.13722 2026-06-15 cs.AI cs.MA 新提交

YeasierAgent: Agentic Social Sandbox as a Canvas for Intent-Driven Creation of Platform-Agnostic Symbiotic Agent-Native Applications

YeasierAgent:作为意图驱动创建平台无关共生智能体原生应用的画布的智能体社交沙盒

Jory He

发表机构 * Yeasier AI

AI总结 提出YeasierAgent范式,通过平台无关的交互单元和空间多智能体协作,实现快速跨平台构建共生智能体原生应用,统一情感陪伴与工具执行。

详情
AI中文摘要

本文介绍了YeasierAgent,一种基于共生智能体、叙事世界和场景感知交互的应用构建范式。它通过将应用重新定义为用户、智能体和世界之间的协作空间,挑战了传统的设备耦合软件模型。我们提出了一种系统架构,实现了两个主要贡献:(1)通过利用平台无关的交互单元(智能体、场景、对话)而非固定的图形布局,实现跨平台的智能体原生应用的快速构建;(2)在单一体验沙盒中统一智能体的情感陪伴和实用工具执行属性。通过集成自动生成、用户创建的世界和空间多智能体协作,YeasierAgent形式化了共生智能体原生应用的类别,展示了从孤立的、特定工具聊天机器人向凝聚的、社会嵌入的计算环境的转变。

英文摘要

This paper introduces YeasierAgent, an application-building paradigm based on symbiotic agents, narrative worlds, and scene-aware interaction. It challenges the conventional device-coupled model of software by redefining applications as collaborative spaces among users, agents, and worlds. We present a system architecture that achieves two primary contributions: (1) enabling the rapid, cross-platform construction of agent-native applications by utilizing platform-agnostic interactive units (agents, scenes, dialogue) rather than fixed graphical layouts; and (2) unifying the emotional companionship and practical tool execution attributes of intelligent agents within a single experiential sandbox. By integrating automated generation, user-created worlds, and spatial multi-agent collaboration, YeasierAgent formalizes the category of Symbiotic Agent-Native Applications, demonstrating a shift from isolated, tool-specific chatbots toward cohesive, socially embedded computational environments.

2606.14200 2026-06-15 cs.AI cs.LG 新提交

When Should Agent Trust Be Conditional? Characterizing and Attacking Skill-Conditional Reputation in Agent Swarms

何时应条件化智能体信任?表征与攻击智能体群中的技能条件声誉

Yihan Xia, Taotao Wang

发表机构 * Shenzhen University(深圳大学)

AI总结 研究异构LLM智能体群中技能条件信任的适用条件,通过相图分析揭示其在高异质性、稀疏证据和技能相关场景下有效,但存在跨技能证据被攻击者利用的风险,提出条件信息值测试(CIVT)量化攻击影响。

Comments 18 pages, 8 figures, 2 tables

详情
AI中文摘要

开放平台越来越多地将任务路由给异构的LLM智能体——它们在基础模型、框架和工具栈上有所不同——其能力因技能而异:一个智能体在某项技能上表现出色,在另一项技能上可能毫无用处。标准的声誉方法为每个智能体总结一个单一的全局信任分数,但这里的标量是错误的对象,因为将每个任务路由到全局最受信任的智能体会放弃专业化的价值。我们研究技能条件信任R(i | k)——对于需要技能k的任务,应赋予智能体i的信任,而不是每个智能体一个分数——并提出三个可证伪的问题:何时条件化是值得的,应借用多少跨技能证据,以及这种借用是否安全。受控的相图分析回答了前两个问题:条件信任仅在特定区域获胜——高智能体异质性、稀疏的每技能证据和相关的技能——而实现这种数据效率的耦合强度β是双刃剑,因为相同的跨技能借用也是一个洗钱渠道。在14个真正异构的AppWorld智能体的公共基准上,实际池落在有益区域内——一个微小但真实的增益,每技能最佳智能体在不同技能间确实发生变化。然后我们展示,一个在一种技能上有廉价证据而在目标技能上没有证据的攻击者劫持条件路由器,将路由遗憾从0驱动到0.94,而我们的零成本条件信息值测试(CIVT)将其评为绿色——而它污染的无门控信任判决读数为-0.06,而非诚实的+0.19。零证据门限限制了攻击但并未消除它;我们在明确预算下表征了剩余成本。我们不声称抗女巫攻击——我们量化了权衡。

英文摘要

Open platforms increasingly route tasks among heterogeneous LLM agents--differing in base model, scaffold, and tool stack--whose competence varies sharply by skill: an agent excellent at one skill may be useless at another. The standard reputation approach summarizes each agent by a single global trust score, but that scalar is the wrong object here, because routing every task to the globally most-trusted agent leaves the value of specialization unclaimed. We study skill-conditional trust R(i | k)--the trust to place in agent i for a task requiring skill k, rather than one score per agent--and pose three falsifiable questions: when is conditioning worth it, how much cross-skill evidence should be borrowed, and whether that borrowing is safe. A controlled phase-diagram analysis answers the first two: conditional trust wins only in a specific regime--high agent heterogeneity, sparse per-skill evidence, and correlated skills--and the coupling strength beta that buys this data efficiency is dual-use, because the same cross-skill borrowing is also a laundering channel. On a public benchmark of 14 genuinely heterogeneous AppWorld agents, real pools land inside the beneficial regime--a small but genuine gain, with the per-skill best agent genuinely changing across skills. We then show that an attacker with cheap evidence in one skill and none in a target skill hijacks the conditional router, driving routing regret from 0 to 0.94 on a pool our zero-cost Conditional Information Value Test (CIVT) rates GREEN--while the ungated trust verdict it contaminates reads -0.06 instead of the honest +0.19. A zero-evidence gate bounds the attack but does not eliminate it; we characterize the residual cost under an explicit budget. We do not claim Sybil-resistance--we quantify the trade-off.
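
To make the laundering risk concrete, the sketch below shows one plausible form of skill-conditional trust with cross-skill borrowing. It is not the paper's estimator: the smoothed success-rate formula, the way `beta` weights borrowed counts, the fallback value of the gate, and the function name `conditional_trust` are all assumptions chosen for illustration.

```python
def conditional_trust(successes, trials, skill, beta, gate=False):
    """Smoothed success rate for `skill`, borrowing counts from other skills.

    successes / trials map skill name -> counts for a single agent.
    beta in [0, 1] scales how much evidence from other skills is borrowed.
    With gate=True, an agent with no evidence in `skill` gets the
    uninformative value 0.5 instead of a borrowed estimate.
    """
    own_s = successes.get(skill, 0)
    own_n = trials.get(skill, 0)
    if gate and own_n == 0:
        return 0.5
    borrowed_s = sum(s for k, s in successes.items() if k != skill)
    borrowed_n = sum(n for k, n in trials.items() if k != skill)
    return (1 + own_s + beta * borrowed_s) / (2 + own_n + beta * borrowed_n)

# Hypothetical pool: an honest specialist and an attacker with cheap evidence
honest = conditional_trust({"hard": 8}, {"hard": 10}, "hard", beta=0.5)
attacker = conditional_trust({"easy": 100}, {"easy": 100}, "hard", beta=0.5)
gated = conditional_trust({"easy": 100}, {"easy": 100}, "hard", beta=0.5, gate=True)
```

In this toy pool the attacker outranks the honest specialist on the target skill without ever being evaluated on it, and the gate removes that advantage only for agents with exactly zero target-skill evidence, in line with the abstract's remark that the gate bounds the attack without eliminating it.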

2606.13832 2026-06-15 cs.MA cs.AI cs.CR cs.LG 交叉投稿

Safety-Contract Graph Multi-Agent Reinforcement Learning for Autonomous Network Security Response

安全合约图多智能体强化学习用于自主网络安全响应

Jose Luis Lima de Jesus Silva

发表机构 * Oxaala Tecnologias(Oxaala技术公司) Universidade Federal da Bahia(巴西巴伊亚联邦大学)

AI总结 提出安全合约图MARL框架ACD$^3$-GAT,通过约束优化、图编码和反事实筛选,在CAGE Challenge 4中将停机违规率从100%降至0.3%或13.8%,实现安全与性能的平衡。

详情
AI中文摘要

自主网络安全响应系统有望减少安全运营中心(SOC)的响应延迟,但仅基于奖励的多智能体强化学习(MARL)虽然能提高安全奖励,却仍无法部署。我们提出一个安全合约图MARL框架,并实例化为ACD$^3$-GAT(自适应约束反事实决策与图注意力网络编码器),该架构将模拟器观测与可重用运营预算、约束优化、图状态编码和反事实动作筛选分离开来。我们在CAGE Challenge 4中评估该方法,其中智能体在平均恢复时间(MTTR)、误报响应和防火墙变更管理中断的预算下运行。在整个基准测试中,每个无约束方法在100%的评估回合中违反SOC停机预算,平均停机代理成本为311-430,而预算为50。这补充了先前CAGE Challenge 4的发现,表明仅基于奖励的学习缺乏操作纪律。约束MAPPO-GAT(C-MAPPO-GAT)隔离了拉格朗日运营成本控制和预算感知筛选,而ACD$^3$-GAT增加了预算上下文、CVaR尾部风险估计、对手信念状态和图反事实风险传播(G-CRP)。复现比较包括IPPO、MAPPO-GAT、C-MAPPO-GAT和ACD$^3$-GAT的三个200回合种子。C-MAPPO-GAT将停机违规率从100%降至0.3%,平均停机成本从355.4降至15.5(相对于MAPPO-GAT)。ACD$^3$-GAT将平均停机成本降至48.2,违规率为13.8%,使其处于安全合约前沿而非最保守的合规点。拓扑种子和耦合自适应红方过程压力测试保持了这种对比,并显示安全约束策略的最差自适应退化程度低于仅基于奖励的MAPPO-GAT。

英文摘要

Autonomous network-security response systems promise to reduce Security Operations Centre (SOC) reaction latency, but reward-only multi-agent reinforcement learning (MARL) can improve security reward while remaining non-deployable. We present a safety-contract graph MARL framework and instantiate it as ACD$^3$-GAT (Adaptive Constrained Counterfactual Decisioning with a Graph Attention Network encoder), an architecture that separates simulator observations from reusable operational budgets, constrained optimization, graph state encoding, and counterfactual action screening. We evaluate the method in CAGE Challenge 4, where agents operate under budgets for Mean Time to Recover (MTTR), false-positive response, and firewall change-management disruption. Across the benchmark, every unconstrained method violates the SOC downtime budget in 100% of evaluated episodes, with mean downtime proxy costs of 311-430 against a budget of 50. This complements prior CAGE Challenge 4 findings by showing that reward-only learning lacks operational discipline. Constrained MAPPO-GAT (C-MAPPO-GAT) isolates Lagrangian operational-cost control and budget-aware screening, while ACD$^3$-GAT adds budget context, CVaR tail-risk estimation, opponent-belief state, and Graph Counterfactual Risk Propagation (G-CRP). The replicated comparison includes three 200-episode seeds for IPPO, MAPPO-GAT, C-MAPPO-GAT, and ACD$^3$-GAT. C-MAPPO-GAT reduces downtime violation from 100% to 0.3% and mean downtime cost from 355.4 to 15.5 relative to MAPPO-GAT. ACD$^3$-GAT reduces mean downtime cost to 48.2 with a 13.8% violation rate, placing it on the safety-contract frontier rather than at the most conservative compliance point. Topology-seed and coupled adaptive Red-process stress tests preserve this contrast and show lower worst adaptive degradation for safety-constrained policies than reward-only MAPPO-GAT.
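
The Lagrangian cost control mentioned for C-MAPPO-GAT can be sketched as a standard dual-ascent update on a single multiplier. This is the textbook form of the technique, not the paper's implementation: the learning rate, the function names, and the use of one multiplier for the downtime budget alone are assumptions.

```python
def update_multiplier(lam, episode_cost, budget, lr):
    """Dual ascent: raise lam when cost exceeds the budget, lower it otherwise.

    The multiplier is clipped at zero so a satisfied constraint is never rewarded.
    """
    return max(0.0, lam + lr * (episode_cost - budget))

def penalized_reward(reward, cost, lam):
    """Reward seen by the policy once the constraint is priced in."""
    return reward - lam * cost

# Budget of 50 and the two mean downtime costs quoted in the abstract,
# used here only as example inputs; lr = 0.01 is a hypothetical step size.
lam_violating = update_multiplier(0.0, 355.4, 50.0, lr=0.01)
lam_compliant = update_multiplier(0.1, 15.5, 50.0, lr=0.01)
```

A policy whose downtime cost sits far above the budget sees the multiplier grow each episode, so downtime becomes progressively more expensive in the penalized reward until the constraint is met.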

2606.14693 2026-06-15 cs.MA cs.AI 交叉投稿

Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning

学习协调偏好用于多目标多智能体强化学习

Pengxin Wang, Lihao Guo, Yi Xie, Bo Liu, Siyang Cao, Jingdi Chen

发表机构 * Department of Electrical and Computer Engineering, University of Arizona(亚利桑那大学电气与计算机工程系)

AI总结 提出偏好协调多智能体策略优化(PCMA),通过学习协调的智能体特定偏好实现多目标多智能体强化学习中的互补权衡,理论证明偏好多样性可诱导团队改进,实验验证性能与协调性提升。

详情
AI中文摘要

合作性多目标多智能体强化学习(MOMARL)对团队在多个可能冲突的目标下的决策进行建模。在此设置中,冲突不仅出现在目标之间,也出现在具有不同观察、角色和贡献的智能体之间。我们提出了偏好协调多智能体策略优化(PCMA),它学习协调的智能体特定偏好,以实现智能体之间的互补权衡。理论上,我们将合作性MOMARL形式化为一个团队最优博弈,并证明在适当条件下,偏好多样性可以通过一阶改进分解诱导团队改进。在多个合作性MOMA环境和一个实际交通控制场景上的实验表明,PCMA提高了性能和权衡协调性。

英文摘要

Cooperative multi-objective multi-agent reinforcement learning (MOMARL) models team decision making under multiple, potentially conflicting objectives. In this setting, conflicts arise not only across objectives but also across agents with different observations, roles, and contributions. We propose Preference Coordinated Multi-agent Policy Optimization (PCMA), which learns coordinated agent-specific preferences to enable complementary trade-offs among agents. Theoretically, we formulate cooperative MOMARL as a team-optimal game and show that, under suitable conditions, preference diversity can induce team improvement through a first-order improvement decomposition. Experiments on multiple cooperative MOMA environments and a practical traffic-control scenario show that PCMA improves both performance and trade-off coordination.

2505.16988 2026-06-15 cs.CL cs.AI cs.MA 版本更新

MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

MASLab:基于LLM的多智能体系统的统一全面代码库

Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) University of Oxford(牛津大学) Princeton University(普林斯顿大学) Meta University of Michigan(密歇根大学) The University of Sydney(悉尼大学) Beihang University(北航) Nanyang Technological University(南洋理工大学) Nanjing University(南京大学)

AI总结 提出MASLab代码库,集成20余种方法,提供统一环境与标准化评估,降低研究门槛,覆盖10+基准测试和8种模型。

Comments 18 pages, 11 figures

详情
AI中文摘要

基于LLM的多智能体系统(MAS)在增强单个LLM以解决实际应用中复杂多样任务方面展现出巨大潜力。尽管取得了显著进展,该领域缺乏统一代码库来整合现有方法,导致重复实现、不公平比较和研究人员的高入门门槛。为应对这些挑战,我们引入MASLab,一个统一、全面且研究友好的基于LLM的MAS代码库。(1)MASLab集成了跨多个领域的20余种已建立方法,每种方法均通过逐步输出与官方实现的比较得到严格验证。(2)MASLab提供统一环境,包含多种基准测试,用于方法间的公平比较,确保一致输入和标准化评估协议。(3)MASLab在共享的简化结构中实现方法,降低了理解和扩展的门槛。基于MASLab,我们进行了涵盖10+基准测试和8种模型的广泛实验,为研究人员提供了当前MAS方法格局的清晰全面视图。MASLab将持续发展,跟踪该领域最新进展,并欢迎更广泛开源社区的贡献。

英文摘要

LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invite contributions from the broader open-source community.

2606.12918 2026-06-15 cs.CR cs.AI 版本更新

MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

MAStrike: 基于Shapley值的多智能体系统合谋红队测试

Chejian Xu, Zhaorun Chen, Jingyang Zhang, Freddy Lecue, Avni Kothari, Sarah Tan, Wenbo Guo, Bo Li

AI总结 提出MAStrike框架,通过Shapley值分析识别多智能体系统中脆弱智能体联盟,生成角色感知的对抗攻击,并迭代优化以绕过防御,显著优于启发式基线。

详情
AI中文摘要

分层多智能体系统(MAS)正迅速部署在金融和软件工程等高危工作流中。在这些系统中,安全本质上是分布在不同角色智能体上的,显著扩大了攻击面,特别是在特权提升和跨智能体合谋等协调对抗行为下。现有的MAS红队测试方法仍然有限:它们依赖启发式选择目标智能体并扰动孤立的消息流,留下了关键问题未解答,即哪些智能体对系统安全最负责,以及受损智能体如何协调以绕过防御。我们提出MAStrike,一个用于分层MAS中合谋红队测试的闭环框架。我们首次提出针对MAS的智能体级Shapley值分析,量化每个智能体在任务特定分布下对系统鲁棒性的边际贡献。在此归因指导下,MAStrike识别脆弱智能体联盟并生成协调的、角色感知的对抗操纵。这些攻击通过结构化因果诊断迭代优化,将失败案例归因于阻止对抗尝试的未受损智能体。我们进一步构建了全面的MAS红队测试基准和可控环境,涵盖不同的分层拓扑和领域,包括金融、软件工程和CRM。在多个前沿模型构建的MAS上进行的广泛实验表明,MAStrike显著优于启发式基线。我们的分析进一步揭示了智能体间非平凡的Shapley值分布和高阶交互结构,揭示了先前单智能体或基于模板的方法忽略的关键漏洞和协调模式。

英文摘要

Hierarchical multi-agent systems (MAS) are rapidly being deployed in high-stakes workflows across domains such as finance and software engineering. In these systems, safety and security are inherently distributed across role-specialized agents, significantly expanding the attack surface, particularly under coordinated adversarial behaviors such as privilege escalation and cross-agent collusion. Existing red-teaming approaches for MAS remain limited: they rely on heuristic selection of target agents and perturb isolated message streams, leaving critical questions unanswered, such as which agents are most responsible for system safety and how compromised agents can coordinate to bypass defenses. We propose MAStrike, a closed-loop framework for collusive red-teaming in hierarchical MAS. We propose the first agent-level Shapley value analysis for MAS, quantifying each agent's marginal contribution to system robustness under task-specific distributions. Guided by this attribution, MAStrike identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis, attributing failure cases to uncompromised agents that block adversarial attempts. We further build a comprehensive MAS red-teaming benchmark and controllable environments spanning diverse hierarchical topologies and domains, including finance, software engineering, and CRM. Extensive experiments across MAS built on multiple frontier models show that MAStrike substantially outperforms heuristic baselines. Our analysis further uncovers non-trivial Shapley value distributions and higher-order interaction structures among agents, revealing critical vulnerabilities and coordination patterns that are overlooked by prior single-agent or template-based methods.
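
For readers unfamiliar with agent-level Shapley attribution, the sketch below computes exact Shapley values by averaging marginal contributions over all agent orderings. The robustness function `toy_robustness`, the agent names, and the exhaustive enumeration are assumptions for illustration; the abstract does not say how the paper defines or approximates the value function.

```python
from itertools import permutations

def shapley_values(agents, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of agents to a real-valued score.
    Cost grows as n!, so this is only practical for a handful of agents.
    """
    totals = {agent: 0.0 for agent in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = frozenset()
        for agent in order:
            extended = coalition | {agent}
            totals[agent] += value(extended) - value(coalition)
            coalition = extended
    return {agent: total / len(orderings) for agent, total in totals.items()}

def toy_robustness(coalition):
    """Hypothetical score: the verifier helps alone, the other two only jointly."""
    score = 0.6 if "verifier" in coalition else 0.0
    if {"planner", "executor"} <= coalition:
        score += 0.4
    return score

phi = shapley_values(["planner", "executor", "verifier"], toy_robustness)
```

In this toy system the planner and executor split their joint contribution equally, which a per-agent ablation would miss because removing either one alone erases the full 0.4.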

4. 搜索、优化与约束求解 5 篇

2606.13682 2026-06-15 cs.AI cs.LG 新提交

A Deep Reinforcement Learning (DRL)-Based Transformer Method for Solving the Open Shop Scheduling Problem

基于深度强化学习的Transformer方法求解开放车间调度问题

Faezeh Ardali, Mwembezi A. Nyelele, Gerald M. Knapp

发表机构 * Louisiana State University(路易斯安那州立大学) University of Minnesota Duluth(明尼苏达大学杜鲁斯分校)

AI总结 提出一种基于Transformer编码器-解码器架构的调度策略,仅以加工时间矩阵为输入,在Taillard小规模实例上训练后可直接推广至40x40至100x100的大规模问题,与经典调度规则相比具有竞争力。

详情
AI中文摘要

开放车间调度问题(OSSP)出现在许多工业和服务环境中,但随着作业和机器数量的增加,其计算难度仍然很大。精确方法很快变得难以处理,而经典调度规则和元启发式方法可能需要大量调整才能在大规模下保持解的质量。本研究开发了一种基于Transformer的OSSP调度策略,采用具有多头注意力的编码器-解码器架构。该模型仅在Taillard基准实例(4x4、5x5、7x7和10x10)上使用加工时间矩阵作为输入进行训练,生成可行调度,其makespan与最佳已知值的差距通常在15-30%以内。为了评估可扩展性,将训练好的策略无需重新训练直接应用于从40x40到100x100随机生成的实例,并与经典调度启发式方法(包括SPT、LPT、MWKR和EST)进行比较。在这些大规模实例中,Transformer相对于标准下界实现了12.89-15.12%的平均差距。与EST相比,Transformer保持了竞争力,通常差距较小,同时显著优于SPT和LPT。这些结果表明,在小规模OSSP实例上训练的Transformer策略可以推广到更大规模的问题,并提供一种轻量级、基于学习的替代经典调度规则的方法。

英文摘要

The open shop scheduling problem (OSSP) arises in many industrial and service settings but remains computationally challenging as the number of jobs and machines increases. While exact methods quickly become intractable, classical dispatching rules and metaheuristics may require substantial tuning to maintain solution quality at large scales. This study develops a Transformer-based scheduling policy for OSSP using an encoder-decoder architecture with multi-head attention. The model is trained on Taillard benchmark instances (4x4, 5x5, 7x7, and 10x10) using only the processing-time matrix as input and produces feasible schedules with makespans typically within 15-30% of best-known values. To evaluate scalability, the trained policy is applied without retraining to randomly generated instances from 40x40 to 100x100 and compared against classical dispatching heuristics, including SPT, LPT, MWKR, and EST. Across these large instances, the Transformer achieved average gaps of 12.89-15.12% relative to a standard lower bound. Compared with EST, the Transformer remained competitive, typically within a modest margin, while substantially outperforming SPT and LPT. These results indicate that a Transformer policy trained on small OSSP instances can generalize to substantially larger problems and provide a feature-light, learning-based alternative to classical dispatching rules.
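
The gaps in this abstract are reported relative to "a standard lower bound". The sketch below assumes this means the classical open-shop bound, the larger of the heaviest job load and the heaviest machine load; the abstract does not confirm which bound was used, and the matrix and makespan here are made-up example values.

```python
def open_shop_lower_bound(p):
    """Classical open-shop bound for p[j][m] = time of job j on machine m.

    No schedule can finish before the busiest job or the busiest machine
    has completed all of its work.
    """
    job_loads = [sum(row) for row in p]
    machine_loads = [sum(column) for column in zip(*p)]
    return max(max(job_loads), max(machine_loads))

def gap_percent(makespan, lower_bound):
    """Relative gap of a schedule's makespan above the lower bound, in percent."""
    return 100.0 * (makespan - lower_bound) / lower_bound

processing_times = [[3, 2],
                    [1, 4]]
bound = open_shop_lower_bound(processing_times)
gap = gap_percent(7, bound)
```

Because the bound is not always attainable, a gap measured this way overstates the distance to the true optimum, which is worth keeping in mind when reading the 12.89-15.12% figures.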

2606.14582 2026-06-15 cs.AI 新提交

A Temporal Planning Framework for Disruption Aware Dynamic Route Optimization in Heterogeneous Railway Systems

异构铁路系统中干扰感知的动态路径优化的时间规划框架

Pollob Chandra Ray, Sabah Binte Noor, Fazlul Hasan Siddiqui

发表机构 * Dhaka University of Engineering & Technology(达卡工程技术大学)

AI总结 提出基于时间规划的框架,利用PDDL 2.1建模轨距兼容约束和多种干扰场景,生成无冲突时间戳操作计划,减少人工决策依赖。

详情
AI中文摘要

高效的路径优化对于确保铁路运营的安全性和准点性至关重要。在异构多轨距铁路网络中,由于列车速度、停车模式、基础设施兼容性约束的不同,协调复杂性增加,这一点尤为关键。在单轨系统中,由于所有列车共享同一轨道且需要频繁的轨道切换,这些挑战进一步加剧。干扰事件,包括轨道阻塞、列车阻塞、发动机故障和速度降低,给运营带来了额外的不可预测性,并偏离了时刻表。然而,现有研究主要关注高层次的时间表编制,忽略了诸如轨道切换协调等运营细节。因此,决策留给人类操作员,增加了铁路运营的安全风险。本研究提出了一个基于时间规划的框架,用于异构铁路系统中的动态路径优化和干扰管理。该框架使用PDDL 2.1将铁路运营形式化为时间规划问题,显式建模轨距兼容约束和多种干扰场景。它生成无冲突的时间戳操作计划,指定优化调度和可执行动作序列。为了评估所提出的框架,我们开发了一个包含200个实例的基准问题集,使用多达1000个轨道点和120列列车。采用两个最先进的时间规划器和一个计划验证器来评估该框架。实验结果表明,该框架能够有效地为异构铁路系统生成时间操作计划,处理多轨距约束和干扰,并减少对人工决策的依赖。

英文摘要

Efficient route optimization plays a vital role in ensuring both safety and punctuality in railway operations. It is particularly crucial in heterogeneous multi-gauge railway networks, where varying train speeds, stopping patterns, and infrastructure compatibility constraints increase coordination complexity. In single-track systems, these challenges intensify further because all trains share the same track, which requires frequent track switching. Stochastic disruption events, including blocked tracks, blocked trains, engine failures, and speed slowdowns, introduce additional unpredictability into operations and cause deviations from the timetable. However, existing studies predominantly focus on high-level timetabling and omit operational details such as track-switching coordination. As a result, decisions are left to human operators, which increases safety risks in railway operations. This study proposes a framework based on temporal planning for dynamic route optimization and disruption management in heterogeneous railway systems. The framework formulates railway operations as a temporal planning problem using PDDL 2.1, explicitly modeling gauge compatibility constraints and diverse disruption scenarios. It generates conflict-free timestamped operational plans specifying both optimized schedules and executable action sequences. To evaluate the proposed framework, we developed a benchmark problem set of 200 instances with up to 1,000 track points and 120 trains. Two state-of-the-art temporal planners and a plan validator were employed to assess the framework. The experimental results demonstrate that the framework effectively generates temporal operational plans for heterogeneous railway systems, handles multi-gauge constraints and disruptions, and reduces dependence on manual decision making.

2606.01730 2026-06-15 cs.AI cs.LG 版本更新

Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

证据门控的LLM先验用于多目标贝叶斯优化

Jiangyu Chen, Ban Yi

发表机构 * State Key Laboratory for Novel Software Technology(新型软件技术国家重点实验室)

AI总结 针对多目标贝叶斯优化中LLM先验可能误导的问题,提出一种目标级声誉市场机制,通过在线反馈动态校准专家权重,并引入解耦反事实门控,在合成测试和分子优化基准上验证了动态校准的鲁棒性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作黑箱优化的启发式顾问,但其建议和自我报告的置信度不一定与下游目标值校准。在多目标贝叶斯优化中,这一问题更加突出,因为不同目标可能需要不同的专家知识,而LLM专家可能对一个目标有用,但对另一个目标产生误导。 我们研究如何在离散多目标贝叶斯优化中使用LLM生成的专家先验,而不盲目信任它们。我们提出了一种目标级声誉市场机制,将每个专家-目标对视为可证伪的先验来源。专家权重根据观察到的目标反馈在线更新,随时间衰减,并由市场级信任门控。然后,我们引入一个解耦的反事实门控,可以在不使用置信度的情况下使用LLM先验,在置信度下使用,或完全放弃LLM先验。 在受控的合成压力测试和三个使用\qwenflash{}生成的专家先验的分子优化基准上,我们发现动态目标级校准比固定LLM先验提高了鲁棒性。然而,原始LLM置信度并不总是有益的:在ESOL上,置信度与预测误差正相关;在FreeSolv上,置信度可能有帮助;在Lipophilicity上,忽略置信度仍然最强。我们的固定三臂反事实门控在ESOL和FreeSolv上优于第一个反事实变体,而尝试的边际组合暴露了一个有用的负面结果:边际选择应基于采集感知,而不是仅基于一步先验误差。

英文摘要

Large language models (LLMs) are increasingly used as heuristic advisors for black-box optimization, yet their suggestions and self-reported confidence are not necessarily calibrated to downstream objective values. This issue becomes more pronounced in multi-objective Bayesian optimization, where different objectives may require different expert knowledge and where an LLM expert can be useful for one objective but misleading for another. We study how to use LLM-generated expert priors in discrete multi-objective Bayesian optimization without blindly trusting them. We propose an objective-wise reputation-market mechanism that treats each expert-objective pair as a falsifiable prior source. Expert weights are updated online from observed objective feedback, discounted over time, and gated by market-level trust. We then introduce a decoupled counterfactual gate that can use the LLM prior without confidence, use it with confidence, or abstain from the LLM prior entirely. Across controlled synthetic stress tests and three molecule optimization benchmarks with \qwenflash{}-generated expert priors, we find that dynamic objective-wise calibration improves robustness over fixed LLM priors. However, raw LLM confidence is not reliably beneficial: on ESOL, confidence is positively correlated with prediction error; on FreeSolv, confidence can help; and on Lipophilicity, ignoring confidence remains strongest. Our fixed three-arm counterfactual gate improves over the first counterfactual variant on ESOL and FreeSolv, while an attempted margin portfolio exposes a useful negative result: margin selection should be acquisition-aware rather than based only on one-step prior error.

2507.13263 2026-06-15 cs.LG cs.AI 版本更新

From Sorting Algorithms to Scalable Kernels: Bayesian Optimization in High-Dimensional Permutation Spaces

从排序算法到可扩展核:高维排列空间中的贝叶斯优化

Zikai Xie, Linjiang Chen

发表机构 * State Key Laboratory of Precision and Intelligent Chemistry(精准与智能化学国家重点实验室)

AI总结 针对高维排列空间贝叶斯优化中表示可扩展性差的问题,提出基于排序算法的核函数框架,其中Mallows核是枚举排序的特例,而新提出的Merge核通过归并排序的分解结构实现Θ(n log n)复杂度且无信息损失,在低维性能相当,高维显著提升优化效果与计算效率。

Comments 9 pages, published on ICLR-26

详情
AI中文摘要

贝叶斯优化(BO)是黑箱优化的强大工具,但其在高维排列空间中的应用受到定义可扩展表示的严重限制。当前最先进的排列空间BO方法依赖于穷举的Ω(n^2)成对比较,导致密集表示,不适用于大规模排列。为了突破这一障碍,我们引入了一个新框架,通过从排序算法导出的核函数生成高效的排列表示。在该框架中,Mallows核可以被视为从枚举排序导出的特例。此外,我们引入了Merge核,它利用归并排序的分治结构生成紧凑的Θ(n log n)表示,实现了最低可能复杂度且无信息损失,并有效捕捉排列结构。我们的核心论点是,Merge核在低维设置中与Mallows核性能相当,但随着维度n增长,在优化性能和计算效率上显著优于后者。在各种排列优化基准上的广泛评估证实了我们的假设,表明Merge核为高维排列空间中的贝叶斯优化提供了可扩展且更有效的解决方案,从而释放了解决以前难以处理的问题(如大规模特征排序和组合神经架构搜索)的潜力。

英文摘要

Bayesian Optimization (BO) is a powerful tool for black-box optimization, but its application to high-dimensional permutation spaces is severely limited by the challenge of defining scalable representations. The current state-of-the-art BO approach for permutation spaces relies on an exhaustive $\Omega(n^2)$ pairwise comparison, inducing a dense representation that is impractical for large-scale permutations. To break this barrier, we introduce a novel framework for generating efficient permutation representations via kernel functions derived from sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from enumeration sort. Further, we introduce the Merge Kernel, which leverages the divide-and-conquer structure of merge sort to produce a compact $\Theta(n\log n)$ representation that achieves the lowest possible complexity with no information loss and effectively captures permutation structure. Our central thesis is that the Merge Kernel performs competitively with the Mallows kernel in low-dimensional settings, but significantly outperforms it in both optimization performance and computational efficiency as the dimension $n$ grows. Extensive evaluations on various permutation optimization benchmarks confirm our hypothesis, demonstrating that the Merge Kernel provides a scalable and more effective solution for Bayesian optimization in high-dimensional permutation spaces, thereby unlocking the potential for tackling previously intractable problems such as large-scale feature ordering and combinatorial neural architecture search.
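
To show where the quadratic cost of the pairwise baseline comes from, the sketch below implements the Mallows kernel in its usual form, $k(\pi, \sigma) = \exp(-\lambda \, d(\pi, \sigma))$, where $d$ counts the pairs of items that the two permutations order differently. The value of `lam` and the function names are assumptions, and the Merge Kernel itself is not reproduced here because the abstract does not specify its construction.

```python
import math

def discordant_pairs(p, q):
    """Count index pairs (i, j), i < j, that p and q order in opposite ways.

    Every one of the n * (n - 1) / 2 pairs is compared, hence quadratic cost.
    """
    n = len(p)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (p[i] - p[j]) * (q[i] - q[j]) < 0:
                count += 1
    return count

def mallows_kernel(p, q, lam=0.5):
    """Similarity that decays exponentially with the number of discordant pairs."""
    return math.exp(-lam * discordant_pairs(p, q))

identity = [0, 1, 2]
reverse = [2, 1, 0]
same = mallows_kernel(identity, identity)
opposite = mallows_kernel(identity, reverse)
```

Identical permutations score 1, and a fully reversed permutation of three items differs on all three pairs, so its similarity drops to $e^{-1.5}$ at `lam = 0.5`.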

2604.23841 2026-06-15 cs.LG cs.AI 版本更新

Scalable Production Scheduling: Linear Complexity via Unified Homogeneous Graphs

可扩展的生产调度:通过统一同质图实现线性复杂度

Jonathan Hoss, Moritz Link, Noah Klarmann

发表机构 * Faculty of Management and Engineering, Rosenheim Technical University of Applied Sciences(管理与工程学院,罗森海姆应用技术大学)

AI总结 提出统一同质图框架,通过特征同质化将不同节点角色映射到共享潜在空间,使用同质图同构网络以线性复杂度解决作业车间调度问题,实现零样本泛化,并发现作业与机器比率是策略有效性的主要驱动因素。

Comments This paper has been accepted for presentation at the IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

详情
AI中文摘要

在现实工业应用中高效解决作业车间调度问题需要既计算精简又拓扑鲁棒的策略。虽然强化学习在自动化调度规则方面显示出潜力,但现有模型常因二次图复杂度或异质层的架构开销而面临可扩展性瓶颈。我们引入了一个统一图框架,采用基于特征的同质化将不同的节点角色投影到共享潜在空间。这使得标准的同质图同构网络能够以线性复杂度捕获复杂的资源竞争,确保大规模工业应用的低延迟推理。我们的实验结果表明,我们的框架实现了最先进的性能,同时表现出一致的零样本泛化。我们确定作业与机器比率是策略有效性的主要驱动因素,而非绝对问题规模。基于此,我们提出了结构饱和假设,证明在临界拥塞实例($\mathcal{J} \approx \mathcal{M}$)上训练的策略学习了尺度不变的解决策略。在此饱和点训练的智能体内化了不变的冲突解决逻辑,使它们能够将大规模矩形实例视为饱和子问题的顺序串联。这种方法消除了昂贵的特定尺度重新训练的需要,并防止了对统计捷径的过拟合,为在动态生产环境中部署强化学习解决方案提供了鲁棒且高效的途径。

英文摘要

Efficiently solving the Job Shop Scheduling Problem in real-world industrial applications requires policies that are both computationally lean and topologically robust. While Reinforcement Learning has shown potential in automating dispatching rules, existing models often struggle with a scalability bottleneck caused by quadratic graph complexity or the architectural overhead of heterogeneous layers. We introduce a unified graph framework that employs feature-based homogenization to project distinct node roles into a shared latent space. This allows a standard homogeneous Graph Isomorphism Network to capture complex resource contention with linear complexity, ensuring low-latency inference for large-scale industrial applications. Our empirical results demonstrate that our framework achieves state-of-the-art performance while exhibiting consistent zero-shot generalization. We identify the job-to-machine ratio as the primary driver of policy effectiveness, rather than absolute problem size. Based on this, we propose a hypothesis of structural saturation, demonstrating that policies trained on critically congested instances ($\mathcal{J} \approx \mathcal{M}$) learn scale-invariant resolution strategies. Agents trained at this saturation point internalize invariant conflict-resolution logic, allowing them to treat massive rectangular instances as a sequential concatenation of saturated sub-problems. This approach eliminates the need for expensive scale-specific retraining and prevents overfitting to statistical shortcuts, providing a robust and efficient pathway for deploying RL solutions in dynamic production environments.

5. 机器学习与表示学习 48 篇

2606.13732 2026-06-15 cs.AI 新提交

When Sample Selection Bias Precipitates Model Collapse

当样本选择偏差引发模型崩溃

Xinbao Qiao, Xianglong Du, Wei Liu, Jingqi Zhang, Peihua Mai, Meng Zhang, Yan Pang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文研究低资源验证场景下,基于局部有偏参考分布的数据选择反而加速模型崩溃,并提出多数据孤岛协同的Wasserstein代理参考缓解多样性退化。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

在合成数据上递归训练的普及可以缓解数据稀缺,但存在模型崩溃的风险,即重复训练会侵蚀分布尾部并使输出同质化。数据选择被广泛视为一种补救措施,但其可靠性关键取决于验证器使用的参考分布。我们表明,在低资源验证机制中,每个验证器仅观察到目标流形的一个小、碎片化且有偏的切片,选择本身也会变得有偏。这种情况自然出现在低资源数据孤岛中,例如医疗联盟或专有金融机构,其中原始数据无法汇集,本地参考固有地不完整。结果,选择优先保留与本地流形对齐的样本,同时剪除全局相关的尾部模式,从防止崩溃的保障转变为引发崩溃的机制。我们从理论上证明,这种孤岛选择加速了崩溃并导致幂律多样性衰减。作为一种初步缓解措施,我们在不共享原始数据的情况下,从多个数据孤岛构建Wasserstein代理参考。实证结果证实,本地参考选择在偏斜分布上失败,而协作代理参考减轻了多样性退化,表明当真实数据覆盖范围碎片化或稀缺时,递归合成数据管道需要特别谨慎。

英文摘要

The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy, yet its reliability depends critically on the reference distribution used by the verifier. We show that in low-resource verification regimes, where each verifier observes only a small, fragmented, and biased slice of the target manifold, selection itself becomes biased. This situation naturally arises in low-resource data silos such as healthcare consortia or proprietary financial institutions, where raw data cannot be pooled and local references are inherently incomplete. As a result, selection preferentially retains samples aligned with the local manifold while pruning globally relevant tail modes, turning from a safeguard against collapse into a mechanism that precipitates it. We theoretically prove that such siloed selection accelerates collapse and induces power-law diversity decay. As an initial mitigation, we construct Wasserstein proxy references from multiple silos without sharing raw data. Empirical results confirm that local-reference selection fails on skewed distributions, whereas collaborative proxy references mitigate diversity degradation, suggesting that recursive synthetic-data pipelines require particular caution when real-data coverage is fragmented or scarce.

2606.13934 2026-06-15 cs.AI 新提交

Adversarial Concept Search: Predicting Compositional Errors From Feature Geometry

对抗性概念搜索:从特征几何预测组合错误

Jennifer Meng Lu, Ruochen Zhang, Isabelle Lee, David Alvarez-Melis, Ellie Pavlick, Naomi Saphra

发表机构 * Brown University(布朗大学) University of Southern California(南加州大学) Harvard University(哈佛大学) Boston University(波士顿大学)

AI总结 利用LLM的表征几何预测其组合失败模式,发现概念编码近正交时可靠组合,编码接近时因干扰导致失败,无需评估具体输入即可预测错误。

详情
AI中文摘要

人类并不总能直觉地判断哪些场景对LLM最具挑战性。为了捕捉具有挑战性的边缘案例,开发者要么设计对人类困难的问题,要么策划广泛的基准测试。如果我们能预先预测模型会在哪些场景上失败呢?在本文中,我们利用LLM的表征几何来预测它会在哪些概念组合上失败。我们将这种组合失败归因于显著特征之间的干扰。在需要系统组合的任务中——玩具程序化设置、多跳推理、多语言事实回忆——我们发现,当一对概念被编码为近似正交时,模型可靠地组合它们。当它们的线性编码接近时,产生干扰,模型无法组合它们。我们的方法可靠地预测了不同组合任务中的失败模式,无需评估特定输入。这些结果为利用表征几何识别高风险示例、构建有针对性的压力测试以及在现实部署中提供可扩展的主动学习基础奠定了基础。

英文摘要

Humans cannot always intuit what scenarios are most challenging to LLMs. Hoping to capture challenging edge cases, developers either design problems to be difficult for humans or curate extensive benchmarks. What if we could instead anticipate which scenarios a model will fail on? In this paper, we use an LLM's representational geometry to predict which concept combinations it will fail on. We attribute this compositional failure to interference between salient features. In tasks that require systematic composition - toy programmatic settings, multihop reasoning, multilingual factual recall - we find that when a pair of concepts is encoded near-orthogonally, the model reliably composes them. When their linear encodings are close, producing interference, the model fails to compose them. Our method reliably anticipates failure modes across different compositional tasks, without evaluating specific inputs. These results lay the groundwork to use representational geometry to identify high-risk examples, construct targeted stress tests, and provide a scalable foundation for active learning in real-world deployment.
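As a rough illustration of the geometric intuition in this abstract, the sketch below ranks concept pairs by the absolute cosine similarity of their direction vectors and flags pairs above a threshold as likely to interfere. The toy vectors, the `risky_pairs` helper, and the 0.5 threshold are all hypothetical; the abstract does not say how concept encodings are extracted or scored.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def risky_pairs(directions, threshold=0.5):
    """Return concept pairs whose |cosine| exceeds the threshold, most similar first."""
    scored = [
        (abs(cosine(directions[a], directions[b])), a, b)
        for a, b in combinations(sorted(directions), 2)
    ]
    return [(a, b, round(s, 3)) for s, a, b in sorted(scored, reverse=True) if s > threshold]


# Hypothetical concept directions (in practice these would come from the model).
directions = {
    "red": [1.0, 0.0, 0.0],
    "crimson": [0.9, 0.1, 0.0],
    "square": [0.0, 0.0, 1.0],
}
pairs = risky_pairs(directions)
print(pairs)
```

Near-orthogonal pairs such as ("red", "square") fall below the threshold and are not flagged, which matches the abstract's claim that such pairs compose reliably.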

2606.14415 2026-06-15 cs.AI 新提交

CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning

CSPO: 面向安全强化学习的约束敏感策略优化

Ayoub Belouadah, Sylvain Kubler, Yves Le Traon

发表机构 * University of Luxembourg(卢森堡大学)

AI总结 提出约束敏感策略优化(CSPO),通过引入局部约束敏感性修正原目标,加速安全恢复并减少振荡,在导航与运动基准上取得更高约束回报。

Comments Accepted as a Spotlight paper at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

安全强化学习(Safe RL)旨在最大化期望回报的同时满足安全约束,通常建模为约束马尔可夫决策过程(CMDPs)。虽然原始-对偶方法可扩展到深度强化学习,但它们常常遭受延迟约束校正,导致振荡行为和长时间的安全违规。在本文中,我们提出约束敏感策略优化(CSPO),一种一阶原始-对偶方法,将局部约束敏感性纳入策略更新。CSPO通过从安全边界的最短有符号距离导出的约束敏感校正来增强原始目标,从而实现更智能的恢复步骤回到安全状态,补偿延迟的拉格朗日乘子更新,减少边界附近的振荡,并保留原始约束问题的KKT解。在导航和运动基准上的实验表明,与最先进的原始-对偶和基于惩罚的方法相比,CSPO实现了更快的安全恢复和高奖励保持,从而获得更高的约束回报。

英文摘要

Safe reinforcement learning (Safe RL) aims to maximize expected return while satisfying safety constraints, typically modeled as Constrained Markov Decision Processes (CMDPs). While primal-dual methods scale well to deep RL, they often suffer from delayed constraint correction, leading to oscillatory behavior and prolonged safety violations. In this paper, we propose Constraint-Sensitive Policy Optimization (CSPO), a first-order primal-dual method that incorporates local constraint sensitivity into policy updates. CSPO augments the primal objective with a constraint-sensitive correction derived from the shortest signed distance to the safety boundary, enabling smarter recovery steps back to safety, compensating for delayed Lagrange multiplier updates, reducing oscillations near the boundary, and preserving the KKT solutions of the original constrained problem. Experiments on navigation and locomotion benchmarks demonstrate that CSPO achieves faster safety recovery and high reward preservation, resulting in higher constrained returns compared to state-of-the-art primal-dual and penalty-based methods.
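For context, the standard Lagrangian relaxation that primal-dual Safe RL methods build on can be written as follows. The symbols $J_r$, $J_c$, $d$, $\lambda$, and $\eta$ are conventional notation assumed here, not taken from the paper, and the CSPO correction term itself is not reproduced because the abstract does not define it.

```latex
\begin{aligned}
&\max_{\pi} \; J_r(\pi) \quad \text{s.t.} \quad J_c(\pi) \le d, \\
&\mathcal{L}(\pi, \lambda) = J_r(\pi) - \lambda \bigl(J_c(\pi) - d\bigr), \qquad \lambda \ge 0, \\
&\lambda_{k+1} = \max\bigl(0,\; \lambda_k + \eta \,(J_c(\pi_k) - d)\bigr).
\end{aligned}
```

In this textbook form the multiplier only grows after violations have been measured, which is one way to read the "delayed constraint correction" the abstract describes.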

2606.13589 2026-06-15 cs.LG cs.AI 交叉投稿

Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

单纯形约束的稀疏装袋:集成学习中从均匀先验到稀疏后验的转变

Meher Sai Preetam, Meher Bhaskar

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出SCSB框架,通过最小化袋外损失在概率单纯形上联合优化集成剪枝与校准,引入凹二次惩罚解决L1单纯形悖论,实现高达96%的压缩并提升校准性能。

Comments 6 pages, 3 tables

详情
AI中文摘要

我们提出单纯形约束的稀疏装袋(SCSB),一个用于基于自助法的装袋集成后训练压缩和概率校准的数学严格框架。标准装袋集成(如随机森林、装袋SVM和装袋神经网络)赋予所有组成估计器均匀的投票权。然而,这种朴素的均匀先验忽略了基估计器不同的局部能力,并导致模型过度自信。我们将集成剪枝和校准表述为在概率单纯形上的联合优化问题,通过最小化袋外(OOB)损失。为了诱导稀疏性,我们通过引入凹二次惩罚来解决理论上的“L1单纯形悖论”——即L1范数在单纯形上为常数且无法剪枝的数学现实。SCSB是模型无关的,实现了高达96%的集成压缩,带来线性推理加速和优越的概率校准(降低期望校准误差),同时保持或提升泛化精度。

英文摘要

We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical "L1-simplex paradox" -- the mathematical reality that the L1 norm is constant on the simplex and fails to prune -- by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.
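The "L1-simplex paradox" can be checked numerically: every weight vector on the probability simplex has L1 norm exactly 1, so an L1 penalty cannot distinguish a uniform ensemble from a sparse one. The sketch below contrasts it with a negative squared L2 norm, which is one possible concave quadratic penalty; the exact penalty used by SCSB is not given in the abstract, so this choice is an assumption.

```python
def l1_norm(weights):
    """L1 norm; equals 1 for any non-negative vector that sums to 1."""
    return sum(abs(w) for w in weights)


def concave_penalty(weights):
    """Negative squared L2 norm: smallest at simplex vertices, largest at the uniform point."""
    return -sum(w * w for w in weights)


uniform = [0.25, 0.25, 0.25, 0.25]   # standard bagging: equal votes
sparse = [0.75, 0.25, 0.0, 0.0]      # two estimators pruned

print(l1_norm(uniform), l1_norm(sparse))                   # identical
print(concave_penalty(uniform), concave_penalty(sparse))   # sparse is lower
```

Because the concave penalty is lower for the sparse vector, adding it to a minimized objective pushes weights toward the vertices of the simplex, which is the pruning effect the L1 norm cannot provide here.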

2606.13694 2026-06-15 eess.SP cs.AI cs.LG 交叉投稿

Efficient Temporal Modeling for Mobile Sleep Staging via Lightweight Random Attention

基于轻量随机注意力的移动睡眠分期高效时序建模

Guisong Liu, Pengfei Wei, Jainsong Zhang, Martin Dresler

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出轻量随机注意力模块RA,通过固定随机投影实现相似性聚合,替代可学习序列建模,在移动睡眠分期中实现高效时序平滑,理论解释为随机注意力先验核,实验显示在准确率和F1上提升1-3%,性能媲美LSTM/GRU/Transformer。

Comments 7 pages, 1 figure, 5 tables

详情
AI中文摘要

移动睡眠分期是家庭睡眠监测和闭环调节的基础设施。但现有的序列模型如RNN和Transformer在移动部署中计算成本高。本文提出随机注意力(RA),一种基于固定随机投影的轻量时序建模模块,用基于相似性的聚合替代可学习的序列建模。RA在历元编码器之外引入极少的额外参数,同时实现有效的时序平滑。我们进一步通过随机注意力先验核(RAPK)提供理论解释,将RA分解为全局平滑项和特征相似性项,为时序睡眠结构提供可解释的视角。在Sleep-EDF-20和Sleep-EDF-78上的实验表明,RA在准确率和F1分数上持续提升历元级基线1-3%,同时达到与LSTM、GRU和Transformer模型相竞争的性能。RA还展示了在不同骨干编码器上的强泛化能力,以及相对于传统时序平滑方法的改进鲁棒性。这些结果表明,通过轻量基于相似性的时序聚合可以实现高效的睡眠分期,使RA适用于实时可穿戴应用。

英文摘要

Mobile sleep staging serves as foundational infrastructure for in-home sleep monitoring and closed-loop modulation. However, existing sequential models such as RNNs and Transformers are computationally expensive for mobile deployment. In this paper, we propose Random Attention (RA), a lightweight temporal modeling module based on fixed random projections, which replaces learnable sequence modeling with similarity-based aggregation. RA introduces few additional parameters beyond the epoch encoder while enabling effective temporal smoothing. We further provide a theoretical interpretation via the Random Attention Prior Kernel (RAPK), which decomposes RA into a global smoothing term and a feature similarity term, offering an interpretable view of temporal sleep structure. Experiments on Sleep-EDF-20 and Sleep-EDF-78 show that RA consistently improves epoch-wise baselines by 1-3% in accuracy and F1 score, while achieving competitive performance compared with LSTM, GRU, and Transformer models. RA also demonstrates strong generalization across different backbone encoders and improved robustness over conventional temporal smoothing methods. These results indicate that efficient sleep staging can be achieved through lightweight similarity-based temporal aggregation, making RA suitable for real-time wearable applications.
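One plausible reading of "similarity-based aggregation over fixed random projections" is sketched below: per-epoch feature vectors are projected with a frozen Gaussian matrix, pairwise dot products are turned into softmax weights, and each epoch is replaced by a weighted average of all epochs. The function name, projection size, and seed are hypothetical, and the paper's actual RA module may differ in every detail.

```python
import math
import random


def random_attention(features, proj_dim=4, seed=0):
    """Smooth a sequence of feature vectors using a fixed, untrained random projection."""
    rng = random.Random(seed)
    d = len(features[0])
    # Frozen Gaussian projection matrix of shape d x proj_dim (no learnable parameters).
    proj = [[rng.gauss(0.0, 1.0 / math.sqrt(proj_dim)) for _ in range(proj_dim)]
            for _ in range(d)]
    z = [[sum(x[i] * proj[i][j] for i in range(d)) for j in range(proj_dim)]
         for x in features]
    smoothed = []
    for q in z:
        scores = [sum(a * b for a, b in zip(q, k)) for k in z]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]   # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        smoothed.append([sum(w * x[i] for w, x in zip(weights, features))
                         for i in range(d)])
    return smoothed


# Hypothetical per-epoch encoder outputs (three epochs, two features each).
epochs = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
smoothed = random_attention(epochs)
print(smoothed)
```

Each output row is a convex combination of the input rows, so the module acts as a temporal smoother whose only state is the random seed.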

2606.13705 2026-06-15 cs.LG cs.AI 交叉投稿

Can Editing 1 Neuron Fix Repetition Loops in LLMs?

编辑1个神经元能修复LLM中的重复循环吗?

Aristotelis Lazaridis, Aman Sharma, Dylan Bates, Brian King, Vincent Lu, Jack FitzGerald

发表机构 * Edgerunner AI

AI总结 本文发现Gemma 4模型在长事实列举任务中高达95%的概率陷入重复循环,通过逐层消融和逐神经元归因定位到少量MLP神经元,并用静态权重编辑(小至单个神经元符号反转)消除循环,但无法解决因知识缺失导致的“末日循环”。

详情
AI中文摘要

是的。它能治愈末日循环吗?可能不行。Gemma 4指令微调模型存在一个可复现的失败:在长事实列举提示(如列出电视剧的每一集、88个IAU星座或151个原始宝可梦)上,它们会崩溃成重复,要么是严格的逐字循环,要么是列表条目退化到单一答案。这些循环的发生率高达95%,并且能抵抗提示改写、推理引擎更改和大多数采样调整。在本文中,我们探讨这种行为是否足够局部化,从而可以通过权重编辑来消除。为了定位原因,我们使用逐层消融和逐神经元归因,然后通过完整生成扫描确认最强候选。循环追溯到一小部分MLP神经元(或者在26B-A4B混合专家模型中,几个路由专家),我们通过静态权重编辑抑制它们。这些“手术”可以小到单个符号反转的神经元(在E2B模型中)。有效编辑的大小随模型规模增长,但在所有情况下,循环模式可以在正常生成预算内解决,同时保持通用基准分数。然而,编辑并不能解决所有问题:我们还研究了更长的思考预算,其中两个较大的模型最明显地进入末日循环,即模型在无法回忆的事实上自我纠正的循环,耗尽预算而不给出最终答案。我们表明,这种残余失败通过相同的编辑减少但未消除,并认为它本质上是知识精度问题,而非可移除的电路;权重手术可以删除循环,但不能提供缺失的事实。我们的结果既是可行性证明——即具体的生成病理可以定位到少数参数并编辑掉——也是对该方法适用范围的界定。

英文摘要

Yes. Can it cure doom loops? Probably not. The Gemma 4 instruction-tuned models share a reproducible failure: on long factual enumeration prompts, such as listing every episode of a TV series, the 88 IAU constellations, or the 151 original Pokemon, they collapse into repetition, either a tight verbatim loop or a list whose entries decay onto a single answer. These loops occur at rates as high as 95% and survive prompt rewording, inference-engine changes, and most sampling adjustments. In this paper we explore whether this behavior is localized enough to remove by weight edits. To localize the cause, we use per-layer ablation and per-neuron attribution, then confirm the strongest candidates with full-generation sweeps. The loops trace to a small set of MLP neurons (or, in the 26B-A4B Mixture-of-Experts model, a few routed experts) which we suppress with static weight edits. These "surgeries" can be as small as a single sign-inverted neuron (in the E2B model). The size of the effective edits grows with model scale, but in all cases, the loop patterns can be addressed at normal generation budgets while preserving general-purpose benchmark scores. However, the edits do not solve everything: we also study longer thinking budgets, where the two larger models most visibly enter doom looping, i.e. a non-convergent regime in which the model self-corrects in circles over a fact it cannot recall, exhausting the budget without committing to a final answer. We show this residual failure is reduced but not eliminated by the same edits, and argue it is fundamentally a knowledge-precision problem rather than a removable circuit; weight surgery can delete a loop, but it cannot supply a missing fact. Our results are both a feasibility demonstration, that is, evidence that a concrete generation pathology can be localized to a few parameters and edited out, and a delineation of where that approach stops.

2606.13723 2026-06-15 cs.CV cs.AI 交叉投稿

Morphology-Aware Sample Assignment: Overcoming IoU Insensitivity for Surface Defect Detection

形态感知样本分配:克服IoU不敏感性用于表面缺陷检测

Pengfei Liu, Yuhan Guo

发表机构 * School of Management, Harbin Institute of Technology(哈尔滨工业大学管理学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 针对IoU在缺陷检测中不敏感的问题,提出基于面积、形状和长宽比的形态相似性度量来优化正样本分配,理论分析表明该方法能重塑匹配函数响应分布,在NEUDET和GC10-DET数据集上基于YOLOv9框架取得一致性能提升,且零额外推理开销。

详情
AI中文摘要

交并比(IoU)作为评估候选框与真实标注空间对齐的关键指标,直接决定了正样本集的质量和视觉检测模型的训练效果。通过理论建模和分析,我们揭示了IoU响应曲线上的一个非敏感区域,在该区域内,尽管样本的几何重叠程度不同,但IoU得分几乎相同。为克服这一局限,我们引入一组形态相似性度量,涵盖面积、形状和长宽比,以优化正样本分配过程,从而确保更具区分性和可靠性的匹配。通过基于均值的多维相似性聚合,推导出一个补充匹配分数,补偿IoU在表示结构对应性方面的固有缺陷。理论上,融入形态相似性重塑了匹配函数的响应分布,产生有效的方向梯度和多边形等响应轮廓,将高响应区域紧密限制在每个真实实例周围,显著提高了正样本选择的精度。基于YOLOv9框架的实验在NEUDET和GC10-DET数据集上均取得一致性能提升。值得注意的是,所提方法完全即插即用,且零额外推理开销,从而确保了工业视觉检测的部署效率。

英文摘要

Intersection-over-Union (IoU), as a pivotal metric for evaluating the spatial alignment between candidate proposals and ground-truth annotations, directly determines the quality of positive sample sets and the training efficacy of visual detection models. Through theoretical modeling and analysis, we uncover a non-sensitive region on the IoU response curve, within which samples yield nearly identical IoU scores despite distinct geometric overlaps. To overcome this limitation, we introduce a set of morphological similarity metrics covering area, shape, and aspect ratio, to refine the positive sample assignment process, thereby ensuring more discriminative and reliable matching. A supplementary matching score is derived via mean-based aggregation of these multidimensional similarities, compensating for the intrinsic limitation of IoU in representing structural correspondence. Theoretically, incorporating morphological similarity reshapes the response distribution of the matching function, yielding both effective directional gradients and polygon-like iso-response contours, which tightly confine high-response regions around each ground-truth instance and substantially enhance the precision of positive sample selection. Experiments based on the YOLOv9 framework demonstrate consistent performance gains on both NEUDET and GC10-DET datasets. Notably, the proposed approach is fully plug-and-play and incurs zero additional inference overhead, thereby ensuring deployment efficiency for industrial visual inspection.
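The IoU insensitivity described above is easy to reproduce: two candidates with very different shapes can score the same IoU against one ground-truth box. The sketch below also computes a simple mean of area and aspect-ratio similarity to separate them. The `morph_sim` definition is a hypothetical stand-in covering only two of the three cues named in the abstract; the paper's actual metrics are not specified there.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; zero if the box is empty."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a, b):
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    return inter / (area(a) + area(b) - inter)


def aspect(box):
    return (box[2] - box[0]) / (box[3] - box[1])


def ratio_sim(p, q):
    """Similarity in (0, 1]: 1 when the two positive quantities are equal."""
    return min(p, q) / max(p, q)


def morph_sim(a, b):
    """Hypothetical morphological score: mean of area and aspect-ratio similarity."""
    return (ratio_sim(area(a), area(b)) + ratio_sim(aspect(a), aspect(b))) / 2


gt = (0.0, 0.0, 4.0, 4.0)
shifted_square = (1.0, 0.0, 5.0, 4.0)   # same shape as gt, shifted right
thin_strip = (0.0, 0.0, 4.0, 2.4)       # different area and aspect ratio

for name, cand in [("shifted_square", shifted_square), ("thin_strip", thin_strip)]:
    print(name, round(iou(gt, cand), 3), round(morph_sim(gt, cand), 3))
```

Both candidates have an IoU of about 0.6, yet the morphological score ranks the shifted square well above the thin strip, which is the kind of extra discrimination a supplementary matching score can add during sample assignment.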

2606.13753 2026-06-15 cs.LG cs.AI 交叉投稿

The Weight Norm Sets the Grokking Timescale: A Causal Delay Law

权重范数设定“顿悟”时间尺度:因果延迟定律

Truong Xuan Khanh, Doan Hoang Viet, Luu Duc Trung, Phan Thanh Duc

发表机构 * H&K Research Studio / Clevix LLC(H&K研究工作室 / Clevix有限责任公司) Bac A Bank(北亚银行) Banking Academy of Vietnam(越南银行学院)

AI总结 通过干预训练中权重范数,发现网络在范数达到临界值Wc时发生顿悟,且延迟时间与固定范数倍数呈指数关系,揭示了范数对顿悟的因果作用。

Comments 14 pages, 9 figures, and 3 tables

详情
AI中文摘要

“顿悟”是神经网络中泛化能力的延迟出现,远在模型拟合训练数据之后才发生。权重范数是否导致这种延迟存在争议:一些研究报告了转变时的临界范数,另一些则观察到没有固定范数的顿悟。我们通过在训练过程中干预范数而非仅观察它来解决这一问题。在带权重衰减的自由训练下,当权重范数达到一个跨种子和学习率变化很小(变异系数1%至2%)且随模数基按幂律增长的值Wc时,网络发生顿悟。当我们转而将范数固定为Wc的某个倍数ρ并保持该值时,网络仍然顿悟,但延迟遵循T_grok ∝ exp(α ρ)。一个指数α≈7.5拟合了四个模数下的延迟(R²=0.996)。在扫描范围内,固定范数使延迟变化约19倍,而学习率仅变化约2倍,且将范数保持在Wc以上会减慢而非阻止顿悟。最后的LayerNorm通过解耦权重尺度与网络函数消除了这种依赖;没有它,指数定律重新出现。这种固定范数的延迟是指数对应物,对应于自由收缩范数所预测的对数延迟。

英文摘要

Grokking is the delayed onset of generalization in neural networks, arising long after they fit the training data. Whether the weight norm causes this delay is disputed: some studies report a critical norm at the transition, others observe grokking with no fixed norm at all. We settle this by intervening on the norm during training rather than only observing it. Under free training with weight decay, networks grok when the weight norm reaches a value Wc that varies little across seeds and learning rates (CV 1 to 2 percent) and grows with the modular base as a power law. When we instead clamp the norm to a fixed multiple rho of Wc and hold it there, the network still groks, but the delay follows T_grok proportional to exp(alpha rho). One exponent, alpha near 7.5, fits this delay across four moduli (R^2 = 0.996). Over the swept ranges the held norm moves the delay by about 19x and the learning rate by only about 2x, and holding the norm above Wc slows grokking rather than preventing it. A final LayerNorm removes the dependence by decoupling weight scale from the network function; without it the exponential law returns. This pinned-norm delay is the exponential counterpart to the logarithmic delay predicted for a freely contracting norm.
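The exponential law can be turned into quick arithmetic. Assuming the delay follows T_grok ∝ exp(alpha * rho) exactly with alpha = 7.5, the sketch below computes the delay ratio between two clamp values and the span of rho that would produce the reported 19x change. The helper names are hypothetical, and the abstract does not state the swept range of rho, so the span is an inferred figure rather than a reported one.

```python
import math

ALPHA = 7.5  # exponent reported in the abstract


def delay_ratio(rho_from, rho_to, alpha=ALPHA):
    """Ratio T_grok(rho_to) / T_grok(rho_from) under T_grok ∝ exp(alpha * rho)."""
    return math.exp(alpha * (rho_to - rho_from))


def rho_span_for(ratio, alpha=ALPHA):
    """Change in rho needed to scale the delay by the given ratio."""
    return math.log(ratio) / alpha


span = rho_span_for(19.0)
print(f"rho span for a 19x delay change: {span:.3f}")
print(f"delay ratio for rho 1.0 -> 1.1: {delay_ratio(1.0, 1.1):.2f}")
```

Under these assumptions a 19x change in delay corresponds to moving rho by roughly 0.39, so small changes in the held norm translate into large changes in grokking time.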

2606.13767 2026-06-15 cs.LG cs.AI cs.IT math.IT 交叉投稿

Beyond LoRA: Is Sparsity-Induced Adaptation Better?

超越LoRA:稀疏诱导的适应更好吗?

Elijah Cadenhead, Cristian McGee, Xin Li, El Houcine Bergou, Aritra Dutta

发表机构 * School of Data, Mathematical and Statistical Sciences, University of Central Florida, United States(中佛罗里达大学数据、数学与统计科学学院) College of Computing, Mohammed VI Polytechnic University (UM6P), Morocco(穆罕默德六世理工大学计算机学院) Department of Computer Science, University of Central Florida, United States(中佛罗里达大学计算机科学系)

AI总结 本文提出Cheap LoRA (cLA)及其变体,通过在LoRA中引入稀疏性实现参数高效微调,理论推导泛化误差界,实验表明在多种任务上性能与参数匹配基线相当,同时减少训练时间和峰值GPU内存。

Comments Overview of the paper and code can be found here: https://elicaden.github.io/Beyond_LoRA/

详情
AI中文摘要

低秩适应(LoRA)及其变体为预训练模型的全微调提供了一种内存和计算高效的替代方案。然而,关于这些方法的比较泛化能力以及低秩更新的结构限制如何保持有效适应性能的问题仍然存在。我们提出了一个历史框架,涵盖过去(全微调和原始LoRA)、现在(LoRA的不同变体),并通过在现有LoRA变体中引入稀疏性,提出了更简单、更便宜、参数高效的扩展:Cheap LoRA (cLA),训练单个低秩因子而固定另一个(确定性地或在其随机变体中随机地),以及链式循环变体${c}^3$LA。我们将cLA视为非对称LoRA的结构化实例,作为全微调的控制列子空间限制。我们推导了这些变体的信息论泛化误差界,这是该领域的首批尝试之一。在实验上,我们评估了10个预训练模型和14个数据集上的11种微调方法,使用损失景观和谱分析等工具分析了微调模型的性能和泛化能力。尽管微调模型对预训练模型、数据集和其他因素敏感,但我们的研究表明,将基于LoRA的PEFT方法的适应限制在稀疏、结构化的列空间上,在各项任务上与参数匹配的基线相比仍具竞争力,同时即使使用朴素、非优化的稀疏实现,也能减少高达10%的训练时间和高达15%的峰值GPU内存。我们的理论和实验泛化度量为其成本效益适应提供了比常用分析工具更一致和原则性的方法。概述和代码可在以下网址获取:https://elicaden.github.io/Beyond_LoRA/。

英文摘要

Low-rank adaptation (LoRA) and its variants provide a memory- and compute-efficient alternative to full fine-tuning of pre-trained models. However, questions remain about the comparative generalizability of these approaches and how the structural restrictions on low-rank updates preserve effective adaptation performance. We present a historical framing, covering the past (full fine-tuning and original LoRA), the present (different variants of LoRA), and propose simpler, cheaper, parameter-efficient extensions by inducing sparsity within existing LoRA variants: Cheap LoRA (cLA), training a single low-rank factor with the other fixed (deterministically or, in its randomized variant, stochastically), and the chained circulant variant, ${c}^3$LA. We frame cLA as a structured instance of asymmetric LoRA, serving as a controlled column-subspace restriction of full fine-tuning. We derive information-theoretic generalization error bounds for these variants, marking one of the first endeavors in this area. Empirically, we evaluate 11 fine-tuning methods across 10 pre-trained models and 14 datasets, analyzing the fine-tuned models' performance and generalization using tools such as loss landscapes and spectral analysis. Despite the sensitivity of fine-tuned models to the pre-trained model, datasets, and other factors, our study suggests that restricting LoRA-based PEFT methods' adaptation to a sparse, structured column space remains competitive across tasks with their parameter-matched baselines while reducing training time by up to 10% and peak GPU memory by up to 15%, even with a naïve, non-optimized, sparse implementation. Our theoretical and empirical generalization measures provide a more consistent and principled approach to their cost-effective adaptation than commonly used analytical tools. Overview and code are available at: https://elicaden.github.io/Beyond_LoRA/.
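To make the parameter savings concrete, the sketch below counts trainable parameters for one weight matrix under full fine-tuning, standard LoRA, and a cLA-style update in which only one low-rank factor is trained. It assumes the update has the form ΔW = B A with B of shape d_out x r trained and A of shape r x d_in frozen; which factor cLA actually trains is not stated in the abstract, and the layer size and rank are hypothetical.

```python
def trainable_params(d_out, d_in, rank, method):
    """Trainable parameter count for a single d_out x d_in weight matrix."""
    if method == "full":
        return d_out * d_in                 # every entry of W is updated
    if method == "lora":
        return rank * (d_in + d_out)        # both factors A and B are trained
    if method == "cla":
        return rank * d_out                 # assumed: only B is trained, A is fixed
    raise ValueError(f"unknown method: {method}")


# Hypothetical layer: a 4096 x 4096 projection adapted with rank 8.
counts = {m: trainable_params(4096, 4096, 8, m) for m in ("full", "lora", "cla")}
for method, n in counts.items():
    print(f"{method:>4}: {n:,}")
```

For a square layer, freezing one factor halves the trainable parameters relative to LoRA at the same rank, which is why comparisons in this setting are usually made against parameter-matched baselines.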

2606.13862 2026-06-15 cs.LG cs.AI cs.CL 交叉投稿

SuperThoughts: Reasoning Tokens in Superposition

SuperThoughts: 叠加中的推理令牌

Zheyang Xiong, Shivam Garg, Max Yu, Vaishnavi Shrivastava, Haoyu Zhao, Anastasios Kyrillidis, Dimitris Papailiopoulos

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Microsoft Research(微软研究院) Independent(独立机构) Princeton University(普林斯顿大学) Rice University(莱斯大学)

AI总结 提出SuperThoughts方法,通过将连续CoT令牌对压缩为单一潜在表示并利用多令牌预测模块解码,在保持训练监督的同时将推理吞吐量翻倍,实现约20-30%的CoT长度缩减且精度损失极小。

详情
AI中文摘要

长链思维(CoT)推理提升了LLM的问题解决能力,但由于顺序生成令牌导致计算成本高昂。尽管近期工作探索在连续潜在空间中进行推理以绕过离散令牌生成,但这些方法常面临训练稳定性问题,且因缺乏监督信号而难以扩展到复杂的长程任务。我们提出SuperThoughts,将连续的CoT令牌对压缩为单一潜在表示,并通过轻量级多令牌预测(MTP)模块每步解码两个令牌。这既在训练时保留了离散令牌监督,又在推理时使吞吐量翻倍。我们在Qwen2.5-Math-1.5B-Instruct、Qwen2.5-Math-7B-Instruct、Qwen2.5-Math-14B-Instruct上进行微调,并在MATH500、AMC、OlympiadBench和GPQA-Diamond上评估。通过基于置信度的自适应机制(在不确定时回退到标准解码),SuperThoughts实现了约20-30%的CoT长度缩减,同时保持精度,在大多数任务上仅下降1-2个准确率点。

英文摘要

Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token generation, they often struggle with training stability and fail to scale to complex, long-horizon tasks due to a lack of supervision signal. We propose SuperThoughts, which compresses pairs of consecutive CoT tokens into single latent representations and decodes two tokens per step via a lightweight Multi-Token Prediction (MTP) module. This preserves discrete token supervision at training time while doubling throughput at inference time. We fine-tune Qwen2.5-Math-1.5B-Instruct, Qwen2.5-Math-7B-Instruct, and Qwen2.5-Math-14B-Instruct, and evaluate on MATH500, AMC, OlympiadBench, and GPQA-Diamond. With a confidence-based adaptive mechanism that falls back to standard decoding when uncertain, SuperThoughts achieves a CoT length reduction of roughly 20-30% while maintaining accuracy with minimal degradation (a drop of 1-2 accuracy points on most tasks).

2606.13894 2026-06-15 cs.LG cs.AI cs.CL cs.CV 交叉投稿

Gefen: Optimized Stochastic Optimizer

Gefen: 优化随机优化器

Nadav Benedek, Tomer Koren, Ohad Fried

发表机构 * Reichman University(赖希曼大学) Tel Aviv University(特拉维夫大学) Google Research(谷歌研究院)

AI总结 提出Gefen优化器,通过共享二阶矩估计和量化一阶矩,将AdamW内存占用减少约8倍,同时保持相同性能,支持更大批量和吞吐量。

详情
AI中文摘要

AdamW是现代深度学习的默认优化器,但其一阶和二阶矩状态会额外占用约两倍参数大小的训练内存。我们提出Gefen,一种内存高效的优化器,它自动在参数块之间共享二阶矩估计,并使用学习到的码本量化一阶矩,从而将AdamW的内存占用减少约8倍,同时保持相同性能,相当于每十亿参数减少6.5 GiB。该方法受理论结果启发,该结果表明大的混合Hessian项将平方梯度的比率约束为接近1,表明Hessian对齐的参数是共享二阶矩统计量的自然候选。由于大规模计算Hessian不切实际,Gefen从初始平方梯度推断块结构,除了AdamW默认超参数外,不需要任何架构特定的元数据或超参数。Gefen学习基于精确直方图的动态规划量化码本,并重用相同的块进行一阶矩缩放。在多种实验中,Gefen在比较的类似AdamW的方法中实现了最低的峰值优化器内存,同时保持AdamW级别的性能。在FSDP和DDP训练中,减少的内存占用支持更大的微批次,并显著提高相对于AdamW的吞吐量,提供了一种实用的即插即用替代方案,具有更低的内存使用,可以增加吞吐量并支持训练更大的模型或使用更大的批量大小。我们提供了完整的Python实现,包括融合CUDA内核,网址为https://github.com/ndvbd/Gefen。

英文摘要

AdamW is a default optimizer for modern deep learning, but its first and second moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook, thereby reducing AdamW's memory footprint by ~8x while maintaining the same performance, corresponding to a reduction of 6.5 GiB per billion parameters. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, suggesting that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Since computing Hessians is impractical at scale, Gefen infers block structure from the initial squared gradients, requiring no architecture-specific metadata or hyperparameters beyond AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and improves throughput significantly over AdamW, providing a practical drop-in replacement with lower memory usage that can increase throughput and enable training larger models or using larger batch sizes. We provide the complete Python implementation, including fused CUDA kernels at https://github.com/ndvbd/Gefen
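The two memory figures in this abstract can be cross-checked with simple arithmetic. The sketch below assumes AdamW keeps both moment buffers in 32-bit floats (4 bytes per value), which the abstract does not state explicitly; under that assumption it computes the optimizer state per billion parameters and the reduction factor implied by a 6.5 GiB saving.

```python
GIB = 2 ** 30  # bytes per GiB


def adamw_state_gib(n_params, bytes_per_value=4):
    """Memory for AdamW's first and second moment buffers, in GiB."""
    return 2 * bytes_per_value * n_params / GIB


baseline = adamw_state_gib(1_000_000_000)   # two fp32 buffers per parameter (assumed)
remaining = baseline - 6.5                  # after the reported 6.5 GiB saving
ratio = baseline / remaining

print(f"AdamW state per 1B params: {baseline:.2f} GiB")
print(f"Remaining after saving:    {remaining:.2f} GiB")
print(f"Implied reduction factor:  {ratio:.1f}x")
```

With fp32 states the baseline is about 7.45 GiB per billion parameters, so a 6.5 GiB saving leaves just under 1 GiB, consistent with the roughly 8x reduction quoted in the abstract.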

2606.14047 2026-06-15 cs.IR cs.AI cs.CL cs.LG 交叉投稿

Knowledge Graph Enhanced Memory-Augmented Retrieval for Long Context Modeling

知识图谱增强的记忆增强检索用于长上下文建模

Ghadir Alselwi, Basem Suleiman, Hao Xue, Shoaib Jameel, Hakim Hacid, Flora D. Salim, Imran Razzak

发表机构 * University of New South Wales(新南威尔士大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University of Southampton(南安普顿大学) Technology Innovation Institute(技术创新研究所) Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出KGERMAR框架,通过动态构建上下文知识图谱并融合多组件记忆架构,在长上下文建模中降低困惑度达8.5%,提升记忆效率2-2.5倍。

详情
AI中文摘要

长上下文语言建模不仅需要扩展上下文窗口,还需要在数千个token中保持对实体状态和关系的连贯理解——这是语义相似性单独无法解决的挑战。KGERMAR通过在推理过程中从输入文本构建动态的、上下文特定的知识图谱来解决这一问题,实现利用语义相似性和显式实体关系的领域自适应检索。该框架执行实时实体和关系抽取以构建上下文知识图谱,然后通过多组件记忆架构将图结构嵌入与文本语义相结合。维护三个记忆库——上下文、语义和结构——通过学习权重融合检索信号,以捕获表面语义和更深层次的关系模式。在SlimPajama(84.7K训练样本)、WikiText-103(4,358样本)、PG-19(100样本)和Proof-pile(46.3K样本)上评估,KGERMAR在1K到32K token的上下文长度上,相比记忆增强基线实现了高达8.5%的困惑度降低和2-2.5倍的记忆效率提升,并在五个NLU任务上展现出优越的上下文学习性能。动态知识图谱构建方法通过实现适应输入上下文而非依赖固定知识库的领域特定知识表示,推进了记忆增强语言建模。

英文摘要

Long-context language modeling requires not only extending context windows but maintaining coherent understanding of entity states and relationships across thousands of tokens -- a challenge that semantic similarity alone cannot address. KGERMAR addresses this by constructing dynamic, context-specific knowledge graphs from input text during inference, enabling domain-adaptive retrieval that leverages both semantic similarity and explicit entity relationships. The framework performs real-time entity and relation extraction to build contextual knowledge graphs, then integrates graph-structural embeddings with textual semantics through a multi-component memory architecture. Three memory banks -- contextual, semantic, and structural -- are maintained with retrieval signals fused via learned weights to capture both surface-level semantics and deeper relational patterns. Evaluated on SlimPajama (84.7K training examples), WikiText-103 (4,358 examples), PG-19 (100 examples), and Proof-pile (46.3K examples), KGERMAR achieves up to 8.5% lower perplexity and 2-2.5x better memory efficiency than memory-augmented baselines across context lengths from 1K to 32K tokens, with superior in-context learning performance across five NLU tasks. The dynamic knowledge graph construction approach advances memory-augmented language modeling by enabling domain-specific knowledge representation that adapts to input contexts rather than relying on fixed knowledge bases.

2606.14108 2026-06-15 cs.LG cs.AI 交叉投稿

Numbers Already Carry Their Own Embeddings

数字本身已携带其嵌入

Suhyun Bae, Donghun Lee

发表机构 * Department of Mathematics, Korea University(高丽大学数学系)

AI总结 提出无训练嵌入方法AOE,同时保留数字的实数值与p-adic模签名,实现即插即用并在代数组合基准上首次达到完美精度。

Comments Presented at the MATH-AI Workshop at NeurIPS 2025

详情
AI中文摘要

我们引入了Adelic运算保持嵌入(AOE),这是一种无需训练的表示,同时捕捉数字的实数值及其模(p-adic)签名。该构造通过设计保留了加法和乘法结构,将数字输入转化为“用数学语言表达”的嵌入。与依赖任务特定重新训练的先前方法不同,AOE是即插即用的,可无缝集成到现有架构中。在代数组合基准测试上,它取得了持续的性能提升,包括在编织图案任务上首次实现完美准确率——这为克服人工智能中长期存在的“数字问题”提供了一条有原则的前进道路。

英文摘要

We introduce Adelic operation-preserved embeddings (AOE), a training-free representation that captures both a number's real value and its modular (p-adic) signatures. This construction preserves additive and multiplicative structure by design, turning numerical input into embeddings that "speak in the language of mathematics." Unlike prior approaches that rely on task-specific retraining, AOE is plug-and-play and drops seamlessly into existing architectures. On algebraic combinatorics benchmarks, it delivers consistent gains, including the first-ever perfect accuracy on the Weaving Pattern task, while suggesting a principled path forward for overcoming the long-standing "number problem" in AI.
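The idea of an embedding that keeps both magnitude and modular structure can be illustrated with a toy construction: store the real value alongside residues modulo a few small primes, and note that addition and multiplication then act component by component. This is only a sketch in the spirit of the abstract; the prime set and the `embed`, `add`, and `mul` helpers are hypothetical, and the actual AOE construction is not described in enough detail here to reproduce.

```python
PRIMES = (2, 3, 5, 7)  # hypothetical choice of moduli


def embed(n):
    """Map an integer to (real value, residues modulo each prime)."""
    return (float(n),) + tuple(n % p for p in PRIMES)


def add(e1, e2):
    """Add two embeddings component-wise; residues are reduced modulo their prime."""
    return (e1[0] + e2[0],) + tuple((a + b) % p for a, b, p in zip(e1[1:], e2[1:], PRIMES))


def mul(e1, e2):
    """Multiply two embeddings component-wise; residues are reduced modulo their prime."""
    return (e1[0] * e2[0],) + tuple((a * b) % p for a, b, p in zip(e1[1:], e2[1:], PRIMES))


print(add(embed(12), embed(30)) == embed(42))
print(mul(embed(6), embed(7)) == embed(42))
```

Because each residue map is a ring homomorphism, operating on embeddings gives the same result as embedding the result of the operation, which is what "operation-preserved" suggests.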

2606.14123 2026-06-15 cs.LG cs.AI 交叉投稿

Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage

知识追踪中恢复被搁置的区分能力:通过经验贝叶斯收缩进行逐项偏差校正

Xiaoran Yan, Cheng Tang, Atsushi Shimada

发表机构 * Kyushu University(九州大学)

AI总结 提出SLC方法,利用Laplace/IRLS将二值观测转化为高斯伪观测,通过卡尔曼平滑器进行经验贝叶斯收缩,并拟合偏移Platt链接,以校正知识追踪模型中的逐项偏差,恢复被搁置的区分能力,在多个数据集和骨干网络上提升AUC和NLL。

Comments 25 pages, 3 figures. Accepted at ECML PKDD 2026 (Research Track). Code: https://github.com/xiaoran-y/SLC

详情
AI中文摘要

部署的知识追踪模型通常在训练后被冻结,但由于骨干架构中逐项表达能力的限制以及部署后项目属性的变化,会出现系统性的逐项logit偏差,从而降低预测质量。全局事后校准器(如Platt缩放、温度缩放和保序回归)能改善概率估计,但无法改变由AUC衡量的区分能力。这种AUC不变性是单调分数变换的结构性结果;恢复被搁置的区分能力需要以项目身份为条件。我们提出SLC(状态空间logit校正),通过Laplace/IRLS将二值观测转换为高斯伪观测,通过卡尔曼平滑器应用经验贝叶斯收缩,并拟合偏移Platt链接。状态空间公式还产生了一个可检测性界限,表征了伯努利信息下限,解释了在当前数据密度下时间跟踪为何没有益处。在四个数据集、五个骨干网络和三个随机种子上,SLC在所有四个数据集上提升了AUC,在三个数据集上提升了NLL,优势集中在稀疏项目上。跨领域控制表明,当部署的骨干网络留下实体级偏差时,类似现象可能出现在教育领域之外。

英文摘要

Deployed knowledge-tracing models are typically frozen after training, yet systematic per-item logit bias arises from limited per-item expressivity in backbone architectures and from post-deployment shifts in item properties, degrading prediction quality. Global post-hoc calibrators such as Platt scaling, temperature scaling, and isotonic regression improve probability estimates but leave discriminative ability, as measured by AUC, unchanged. This AUC invariance is a structural consequence of monotone score-only transforms; recovering the stranded discrimination requires conditioning on item identity. We propose SLC (State-space Logit Correction), which converts binary observations to Gaussian pseudo-observations via Laplace/IRLS, applies empirical-Bayes shrinkage through a Kalman smoother, and fits an offset-Platt link. The state-space formulation also yields a detectability bound that characterizes the Bernoulli information floor, explaining why temporal tracking provides no benefit at current data densities. Across four datasets, five backbones, and three seeds, SLC improves AUC on all four datasets and NLL on three, with the advantage concentrating on sparse items. Cross-domain controls suggest that the same phenomenon can arise beyond education when the deployed backbone leaves entity-level bias.
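The AUC-invariance argument can be checked on a tiny example: a strictly increasing transform of the scores (such as Platt scaling with a positive slope) leaves every pairwise ranking intact, while a per-item offset can reorder predictions across items. The data, the Platt coefficients, and the offsets below are made up for illustration, and the offsets are set by hand rather than estimated with the shrinkage procedure SLC uses.

```python
import math


def auc(scores, labels):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Item A's logits are biased upward by 2; within each item the ranking is perfect.
logits = [2.0, 3.0, 0.0, 1.0]
labels = [0, 1, 0, 1]
items = ["A", "A", "B", "B"]

# Global monotone calibration (Platt-style sigmoid with hypothetical coefficients).
platt = [1 / (1 + math.exp(-(0.5 * z - 1.0))) for z in logits]

# Per-item offset correction (hand-picked values that remove the bias).
item_offset = {"A": -2.0, "B": 0.0}
corrected = [z + item_offset[i] for z, i in zip(logits, items)]

print(auc(logits, labels), auc(platt, labels), auc(corrected, labels))
```

The raw and Platt-scaled scores give the same AUC of 0.75, while the item-conditioned correction raises it to 1.0, illustrating why recovering discrimination requires conditioning on item identity.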

2606.14156 2026-06-15 cs.LG cs.AI 交叉投稿

Learning High Coverage Discriminative Parsimonious Rulesets

学习高覆盖判别性简约规则集

Mariamma Antony, Raman Sankaran, Chiranjib Bhattacharyya, Uma Satya Ranjan

发表机构 * Indian Institute of Science(印度科学研究所) Compass

AI总结 提出CDPR方法,通过子模最大化算法学习高覆盖、判别性且简约的规则集,在保持高准确率的同时显著提升可解释性,覆盖率比次优算法提升2.5倍以上。

详情
AI中文摘要

基于IF-THEN规则表示的学习系统易于提供可解释性,使其成为当代人工智能研究的关键焦点。此类规则集的一个关键目标是实现高判别能力和可解释性。虽然现有的最先进算法隐式地优先考虑预测准确性,但它们通常在确保可解释性的一个或多个质量指标(如规则集的覆盖率和简约性)上表现不足。受此启发,本文提出开发CDPR,旨在为分类问题创建高度准确且可解释的规则集。据我们所知,这是首次尝试建立这样的方法。在本研究中,我们引入了两种基于子模最大化的算法,这些算法不仅提供了可证明的覆盖率保证,而且产生的规则集既具有判别性又简约。我们通过实验证明,通过我们的方法学习的规则集在准确性和可解释性方面表现更好,并且与次优算法相比,平均覆盖率提高了2.5倍以上。

英文摘要

Learning systems based on IF-THEN rule representations readily offer interpretability, making them a crucial focus in contemporary AI research. A key objective for such rule sets is to achieve both high discriminative power and interpretability. While existing state-of-the-art algorithms implicitly prioritize predictive accuracy, they often fall short on one or more quality metrics that ensure interpretability, such as coverage and parsimony of rule sets. Motivated by this, this paper proposes CDPR, which aims to create highly accurate and interpretable rule sets for classification problems. To the best of our knowledge, this represents the first attempt to establish such an approach. In this study, we introduce two algorithms rooted in submodular maximization, which not only provide provable guarantees on coverage but also yield rule sets that are both discriminative and parsimonious. We empirically demonstrate that rule sets learned through our approaches achieve higher accuracy and interpretability and have more than a 2.5-fold improvement in average coverage rates when compared to the next best algorithm.
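Coverage is a classic submodular objective, and the textbook greedy algorithm for maximizing it under a budget gives a feel for how rule selection with guarantees can work. The sketch below is that plain greedy procedure, not either of the two CDPR algorithms: it ignores the discrimination and parsimony terms the abstract mentions, and the rule names and covered example IDs are hypothetical.

```python
def greedy_cover(rules, budget):
    """Pick up to `budget` rules, each time adding the one that covers the most new examples."""
    covered = set()
    chosen = []
    for _ in range(budget):
        best = max(rules, key=lambda r: len(rules[r] - covered))
        gain = rules[best] - covered
        if not gain:          # no rule adds coverage; stop early to stay parsimonious
            break
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Hypothetical rules mapped to the training examples they cover.
rules = {
    "r1": {1, 2, 3, 4},
    "r2": {3, 4, 5},
    "r3": {5, 6},
    "r4": {1, 6},
}
chosen, covered = greedy_cover(rules, budget=2)
print(chosen, sorted(covered))
```

For monotone submodular objectives such as coverage, this greedy selection is known to reach at least a (1 - 1/e) fraction of the best achievable value under a cardinality budget, which is the kind of provable guarantee the abstract alludes to.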

2606.14283 2026-06-15 cs.LG cs.AI 交叉投稿

DIFF-ERO: A Conformance-Aware Loss for Deep Learning in Process Mining

DIFF-ERO:一种面向过程挖掘的深度学习一致性感知损失函数

Johannes De Smedt, Jari Peeperkorn, Artem Polyvyanyy, Jochen De Weerdt

发表机构 * KU Leuven(鲁汶大学) The University of Melbourne(墨尔本大学) Information Systems Engineering Research Group (LIRIS), KU Leuven(鲁汶大学信息系统工程研究组(LIRIS))

AI总结 提出DIFF-ERO,一种可微的随机一致性损失函数,通过构建软边界的批次级随机转移矩阵,在训练中融入控制流信息,提升深度学习模型在过程数据上的结构预测性能。

Comments Accepted at the 24th International Conference on Business Process Management

详情
AI中文摘要

深度学习推动了过程分析领域的许多最新进展,尤其是在预测性和规范性监控方面。然而,诸如交叉熵之类的标准目标函数优化的是局部下一步似然,仅隐式地捕获控制流结构。因此,模型在实现高令牌级准确率的同时,可能允许不精确的全局行为。我们提出了DIFF-ERO,一种用于过程数据深度学习模型的一致性感知损失函数。DIFF-ERO是基于熵的随机一致性的可微形式,在训练过程中融入控制流信息。我们的方法构建了具有软边成员资格的批次级随机转移矩阵,使得结构精度和召回率信号能够直接指导反向传播。该损失函数是模型无关的,只要最终表示参数化随机转移,就可以应用。我们将DIFF-ERO实例化到用于下一活动预测的Transformer编码器-解码器流水线中,并与交叉熵联合使用,分析其理论组件在收敛方面的表现。在比较其他损失函数和目标的基准测试中,DIFF-ERO在结构至关重要的地方显示出改进的预测性能,同时在其它地方保持同等水平。同时,学习到的随机自动机向结构真实值收敛,表明网络内化了过程模型结构。

英文摘要

Deep learning has driven many recent advances in process analytics, especially for predictive and prescriptive monitoring. However, standard objectives such as cross-entropy optimize local next-step likelihoods and only implicitly capture control-flow structure. As a result, models can achieve high token-level accuracy while permitting imprecise global behaviour. We introduce DIFF-ERO, a conformance-aware loss function for deep learning models on process data. DIFF-ERO is a differentiable formulation of entropy-based stochastic conformance that incorporates control-flow information during training. Our approach constructs batch-level stochastic transition matrices with soft edge memberships, allowing structural precision and recall signals to directly inform backpropagation. The loss is model-agnostic and can be applied whenever the final representation parametrizes stochastic transitions. We instantiate DIFF-ERO in transformer encoder-decoder pipelines for next-activity prediction and use it jointly with cross-entropy to analyse its theoretical components with respect to convergence. Across benchmarks comparing other loss functions and targets, DIFF-ERO shows improved predictive performance where structure matters most while maintaining parity elsewhere. At the same time, the learned stochastic automaton converges towards the structural ground truth, indicating that the network internalizes process model structure.
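
The idea of a batch-level stochastic transition matrix with soft edge memberships can be sketched as follows. This is only an illustration of the accumulation step under assumed inputs (`current` holds the last observed activity of each prefix, `next_probs` the model's predicted next-activity distribution); the entropy-based conformance terms of DIFF-ERO are not reproduced, and NumPy stands in for an autodiff framework.

```python
import numpy as np

def soft_transition_matrix(current, next_probs, n_activities):
    """Accumulate predicted next-activity distributions into a row-stochastic matrix."""
    soft_counts = np.zeros((n_activities, n_activities))
    # Each prefix adds its whole predicted distribution to the row of its current activity.
    np.add.at(soft_counts, current, next_probs)
    row_sums = soft_counts.sum(axis=1, keepdims=True)
    # Rows for activities absent from the batch stay all-zero.
    return np.divide(soft_counts, row_sums,
                     out=np.zeros_like(soft_counts), where=row_sums > 0)

# Hypothetical batch of four prefixes over three activities.
current = np.array([0, 0, 1, 2])
next_probs = np.array([
    [0.1, 0.8, 0.1],
    [0.3, 0.6, 0.1],
    [0.0, 0.2, 0.8],
    [0.7, 0.2, 0.1],
])
T_soft = soft_transition_matrix(current, next_probs, n_activities=3)
print(T_soft)
```

Because every entry is a sum of predicted probabilities rather than a hard count, the same computation written in an autodiff framework would let a structural loss on the matrix propagate gradients back to the predictions.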

2606.14284 2026-06-15 cs.LG cs.AI 交叉投稿

Hierarchical ODE: Learning Continuous-Time Physical Prototypes for Early Link Failure Detection

层次化常微分方程:学习连续时间物理原型用于早期链路故障检测

Jiaen Lv, Leran Qi, Shaowei Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出层次化常微分方程聚类网络,利用神经ODE建模连续潜状态演化,解耦随机噪声与动态趋势,自适应确定原型数量,在不规则采样时间序列的早期链路故障检测中有效提取物理原型。

Comments International Conference on Machine Learning 2026

详情
AI中文摘要

时间序列原型学习从根本上受到观测模糊性的挑战。离散架构无法解决这一问题,因为它们缺乏将随机噪声与连续动态解耦的能力。此外,僵化的封闭集假设无法捕捉未见过的多样性。为了解决这些局限性,我们提出了一种层次化常微分方程聚类网络,该网络利用神经常微分方程将潜状态演化建模为连续积分曲线。这种形式强制时间连续性,从而有效将平滑特征趋势与随机噪声分离,同时我们的自适应层次机制无需严格的先验约束即可自主确定合适的原型数量。在具有不规则采样时间序列的早期链路故障检测任务上验证,所提方法有效提取了底层物理原型,从而实现了鲁棒的故障检测。我们的代码可在 https://github.com/NJ-LNN/Hierarchical-ODE 获取。

英文摘要

Time series prototype learning is fundamentally challenged by observational ambiguity. Discrete architectures fail to resolve this, as they lack the capacity to decouple stochastic noise from continuous dynamics. Furthermore, rigid closed-set assumptions fail to capture unseen diversity. To address these limitations, we propose a hierarchical ordinary differential equation clustering network, which utilizes a neural ordinary differential equation to model latent state evolution as a continuous integral curve. This formulation enforces temporal continuity to effectively disentangle smooth feature trends from stochastic noise, while our adaptive hierarchical mechanism autonomously determines the appropriate number of prototypes without rigid prior constraints. Validated on the early link failure detection task with irregularly sampled time series, the proposed method effectively extracts underlying physical prototypes, thereby enabling robust failure detection. Our code is available at https://github.com/NJ-LNN/Hierarchical-ODE.

2606.14346 2026-06-15 cs.LG cs.AI 交叉投稿

Squeeze-Release: Iterative Pruning with Exact Structural Minimization

挤压-释放:具有精确结构最小化的迭代剪枝

Roman Denkin, Ida Akerholm, Prashant Singh, Ida-Maria Sintorn

发表机构 * Uppsala University(乌普萨拉大学)

AI总结 提出Squeeze-Release循环,通过精确结构重写将掩码网络转化为更小密集网络,并引入CompensatedLayerNorm扩展至残差流,实现高达39倍压缩。

详情
AI中文摘要

非结构化剪枝产生稀疏权重张量,但标准实现保持张量形状不变,因此部署模型并不比剪枝前更小。我们提出一种精确的结构重写,称为最小化,它将掩码网络转换为一个更小的密集网络,其前向函数在浮点舍入误差内相同。挤压-释放循环迭代剪枝和最小化,中间有一个释放步骤,将压缩张量内的精确零位置重新启用为小的校准噪声,将原本浪费的容量转化为可训练参数。连续的循环利用该容量找到单次剪枝无法达到的结构冗余。我们还引入了CompensatedLayerNorm,这是一种保持功能的LayerNorm替代方案,将最小化扩展到具有LayerNorm的残差流上的通道缩减。挤压-释放将可部署网络压缩到比未剪枝模型小39倍(全连接模型网络)和14.8倍(现代CNN,ConvNeXt-Tiny),且精度相当。此外,我们证明该重写可以扩展到Transformer架构。

英文摘要

Unstructured pruning produces sparse weight tensors, but the standard implementation keeps tensor shapes unchanged, so the deployed model is no smaller than before pruning. We present an exact structural rewrite, which we call minimization, that converts a masked network into a smaller dense network with the same forward function up to floating-point rounding. The Squeeze-Release cycle iterates pruning and minimization with an intermediate release step that re-enables the exact-zero positions inside the compacted tensors as small calibrated noise, turning otherwise wasted capacity back into trainable parameters. Successive cycles use that capacity to find structural redundancy that a single pass cannot reach. We additionally introduce CompensatedLayerNorm, a function-preserving replacement for LayerNorm that extends minimization to channel reduction across LayerNorm-equipped residual streams. Squeeze-Release compresses the deployable network to 39x smaller than the unpruned model on a fully-connected network and 14.8x smaller on a modern CNN (ConvNeXt-Tiny), at comparable accuracy. In addition, we prove that the rewrite can be extended to transformer architectures.
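
A small sketch of what an exact structural rewrite can look like for a two-layer ReLU MLP. It covers only two simple cases, assumed here for illustration and not taken from the paper: hidden units with no outgoing weights are dropped, and hidden units with no incoming weights emit a constant that is folded into the next bias. The function name `minimize` and the layer shapes are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, W1, b1, W2, b2):
    return relu(x @ W1.T + b1) @ W2.T + b2

def minimize(W1, b1, W2, b2):
    """Return a smaller dense MLP computing the same function as the masked one."""
    dead_out = ~W2.any(axis=0)   # hidden units whose outgoing weights are all zero
    const_in = ~W1.any(axis=1)   # hidden units whose incoming weights are all zero
    fold = const_in & ~dead_out  # constant activation relu(b1) still reaches the output
    b2_new = b2 + W2[:, fold] @ relu(b1[fold])
    keep = ~(dead_out | const_in)
    return W1[keep], b1[keep], W2[:, keep], b2_new

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 5)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)
W1[[1, 4], :] = 0.0   # pruning mask removed every incoming weight of units 1 and 4
W2[:, [2, 4]] = 0.0   # pruning mask removed every outgoing weight of units 2 and 4

W1s, b1s, W2s, b2s = minimize(W1, b1, W2, b2)
x = rng.normal(size=(10, 5))
y_masked = forward(x, W1, b1, W2, b2)
y_small = forward(x, W1s, b1s, W2s, b2s)
print(W1.shape, "->", W1s.shape)
```

The hidden width drops from 8 to 5 while the outputs agree up to floating-point rounding. Unstructured masks rarely zero whole rows or columns on their own, which is presumably why the abstract pairs the rewrite with repeated pruning cycles.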

2606.14386 2026-06-15 cs.LG cs.AI q-fin.PM 交叉投稿

Discovery under Hypothesis Redundancy: A Geometric Theory of Discovery Bottlenecks

假设冗余下的发现:发现瓶颈的几何理论

Li Xia, Baoxun Wang

发表机构 * School of Economics and Management, Tsinghua University(清华大学经济管理学院) Platform & Content Group, Tencent(腾讯平台与内容事业群)

AI总结 提出搜索压缩假说,通过谱压缩、正交逃逸和残差信号对齐三个几何条件解释混合发现系统的优势,实验表明仅新颖性不足,需预测对齐。

Comments 23 pages, 1 figure, 27 tables

详情
AI中文摘要

当新假设不再提供独立信息时,科学发现会饱和,即使名义假设空间仍然很大。我们研究了结合结构化局部搜索与LLM生成的非局部提议的混合发现系统,并提出了搜索压缩假说:非局部探索仅在三个几何条件同时出现时才有帮助:谱压缩、从已探索张成的子空间正交逃逸、以及残差信号与目标对齐。我们形式化了这些条件,推导了混合优势的必要条件,并在受控合成环境、大规模A股因子发现和符号回归基准中测试了该机制;一个公开的表格操作合理性检查测试了相关的预算分配含义。信号植入和定向与随机实验表明,仅新颖性是不够的:随机正交跳跃扩大了覆盖范围,但如果没有预测对齐,则不会提高产出。在压缩扫描、真实因子档案和LLM-SRBench任务中,混合优势集中在弱表示但目标承载的方向上,并随着假设空间接近满秩而消失。该框架将LLM引导的发现从通用新颖性搜索转变为诊断程序,用于判断何时需要进行定向非局部探索。

英文摘要

Scientific discovery saturates when new hypotheses cease to provide independent information, even if the nominal hypothesis space remains large. We study hybrid discovery systems that combine structured local search with LLM-generated non-local proposals and pose the Search Compression Hypothesis: non-local exploration helps only when three geometric conditions co-occur: spectral compression, orthogonal escape from the explored span, and residual signal alignment with the target. We formalize these conditions, derive necessary conditions for hybrid advantage, and test the mechanism in controlled synthetic environments, large-scale A-share factor discovery, and symbolic-regression benchmarks; a public tabular operational sanity check tests the associated budget-allocation implication. Signal-planting and directed-versus-random experiments show that novelty alone is insufficient: random orthogonal jumps expand coverage but do not improve yield without predictive alignment. Across compression sweeps, real factor archives, and LLM-SRBench tasks, hybrid gains concentrate in weakly represented but target-bearing directions and vanish as the hypothesis space approaches full rank. The framework turns LLM-guided discovery from generic novelty search into a diagnostic procedure for deciding when directed non-local exploration is warranted.

2606.14555 2026-06-15 cs.CV cs.AI 交叉投稿

Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

重新思考全局平均池化:你的分类器实际上是一个多实例学习器

Aray Karjauv

发表机构 * Aray Karjauv(阿瑞·卡贾乌)

AI总结 本文揭示标准图像分类器中的全局平均池化结构天然具有多实例学习解释,使得单标签训练的分类器能学习多目标场景,并提出后验诊断方法提取空间类别证据。

详情
AI中文摘要

现代图像分类器广泛采用全局平均池化(GAP)后接线性分类头。这种线性结构确保图像级logits等于将分类头逐点应用于GAP之前的特征网格所获得的logits的平均值。因此,标准分类器可能固有地保留空间类别证据,即使在图像级预测错误时这些证据仍可恢复。这种结构自然暗示了多实例学习(MIL)解释,其中图像被视为空间实例的包。在此框架下,我们证明使用每张图像单个标签训练的标准分类器仍然可以在多目标场景中学习预期的分类任务。我们进一步利用这一特性将图像级logits分解为预测网格,提供一种事后诊断方法来提取GAP原本掩盖的空间类别证据。我们的系统评估表明,现成模型始终能在前景区域内恢复真实类别。MIL解释进一步表明,常见的分类器失败反映了均值聚合的已知局限性。

英文摘要

Modern image classifiers widely adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that the image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multiple-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances. Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation further suggests that common classifier failures reflect known limitations of mean aggregation.
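
The linearity argument can be verified in a few lines. The sketch assumes a random feature grid and a random linear head (shapes and names are illustrative); it checks that pooling then classifying equals classifying every spatial position and then averaging, and it builds the per-position prediction grid the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, K = 16, 7, 7, 5                   # channels, grid height, grid width, classes
features = rng.normal(size=(C, H, W))      # feature grid before GAP (assumed)
W_head = rng.normal(size=(K, C))
b_head = rng.normal(size=K)

# Standard path: global average pooling, then the linear head.
pooled = features.mean(axis=(1, 2))
image_logits = W_head @ pooled + b_head

# Pointwise path: apply the same head at every spatial position, then average.
logit_grid = np.einsum("kc,chw->khw", W_head, features) + b_head[:, None, None]
grid_mean = logit_grid.mean(axis=(1, 2))

# Prediction grid: the class each position votes for.
patch_pred = logit_grid.argmax(axis=0)
print(np.allclose(image_logits, grid_mean), patch_pred.shape)
```

The identity holds because both the mean and the head are linear, so they commute. A nonlinearity between pooling and the head would break it.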

2606.14608 2026-06-15 cs.LG cs.AI 交叉投稿

Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts

专家驱动的生存机器:改善多个临床队列中的分层与可解释性

Farica Zhuang, Zixuan Wen, Christos Davatzikos, Li Shen

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出一种基于混合专家模型的自适应深度聚类生存框架(AdaCSM),通过路由专家机制实现条件专业化,动态分配患者到专门的风险预测器,提升生存预测性能和可解释性。

详情
AI中文摘要

生存预测在医疗提供者和临床研究中扮演核心角色。准确的风险分层能够实现早期干预并改善患者管理。大多数现有的深度生存模型为所有患者学习一个共同的特征表示,这可能掩盖患者亚组之间的重要差异。相比之下,混合专家(MoE)框架允许模型的不同部分关注不同的患者模式,从而产生更个性化的表示。因此,在这项工作中,我们提出了一种混合专家增强的自适应深度聚类生存框架(AdaCSM),用于建模这种异质性生存模式。我们引入了一种基于路由的专家机制,该机制在参数化生存建模框架内实现条件专业化。所提出的架构动态地将患者分配给专门的风险预测器,同时保留患者生存和亚型聚类目标。我们在跨越不同疾病领域的多个真实世界纵向临床队列上,将我们的方法与最先进的生存和深度聚类模型进行了比较。所提出的方法在生存分析中展示了改进的预测性能并产生了可解释的结果。

英文摘要

Survival prediction plays a central role for healthcare providers and clinical researchers. Accurate risk stratification enables early intervention and improved patient management. Most existing deep survival models learn one common feature representation for all patients, which may hide important differences between patient subgroups. In contrast, a Mixture-of-Experts (MoE) framework allows different parts of the model to focus on different patient patterns, leading to more individualized representations. Therefore, in this work, we propose a mixture-of-experts enhanced adaptive deep clustering survival framework (AdaCSM) for modeling such heterogeneous survival patterns. We introduce a routing-based expert mechanism that enables conditional specialization within a parametric survival modeling framework. The proposed architecture allocates patients to specialized risk predictors dynamically while preserving the patient survival and subtype clustering objectives. We compare our method with state-of-the-art survival and deep clustering models on multiple real-world longitudinal clinical cohorts spanning diverse disease domains. The proposed method demonstrates improved predictive performance and leads to interpretable results in survival analysis.

2606.14639 2026-06-15 cs.SD cs.AI 交叉投稿

From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing

从自监督语音模型到混合专家系统以实现鲁棒的防欺骗

Hugo Daumain, Driss Matrouf, Khaled Khelif, Mickael Rouvier

发表机构 * Université d'Avignon(阿维尼翁大学) Airbus Defence & Space(空中客车防务与航天公司)

AI总结 将自监督语音模型转换为混合专家架构,通过层间门控机制增强泛化能力,在14个欺骗数据集上将宏EER从5.46%降至4.81%。

Comments 8 pages, 3 figures, accepted at Odyssey 2026 (The Speaker and Language Recognition Workshop)

详情
AI中文摘要

近期语音生成的进展显著提升了合成语音的自然度,使得欺骗检测日益困难。当前防欺骗系统的一个关键局限是对未见合成方法的鲁棒性不足。在这项工作中,我们将自监督语音表示模型转换为混合专家(MoE)架构以提高泛化能力。选定编码器层中的前馈块被替换为由层间门控机制控制的多个专家网络,使专家能够捕获互补的声学模式,同时保留自监督预训练期间学习到的表示。我们进一步分析了影响MoE转换性能的架构选择,并研究了专家的激活行为。所提出的方法在14个欺骗数据集上进行了评估,将宏EER从5.46%降至4.81%,相对基线提升了11.9%。

英文摘要

Recent advances in speech generation have significantly improved the naturalness of synthetic speech, making spoofing detection increasingly challenging. A key limitation of current anti-spoofing systems is their limited robustness to unseen synthesis methods. In this work, we transform a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization. Feed-forward blocks in selected encoder layers are replaced by multiple expert networks controlled by a layer-wise gating mechanism, allowing experts to capture complementary acoustic patterns while preserving the representations learned during self-supervised pretraining. We further analyze the architectural choices affecting the performance of this MoE conversion and investigate the activation behavior of the experts. The proposed approach is evaluated on 14 spoofing datasets and reduces the macro EER from 5.46% to 4.81%, corresponding to an 11.9% relative improvement over the baseline.

2303.09209 2026-06-15 cs.AI 版本更新

Learning optimal policies from event logs through reinforcement learning: a comparison of deep and MDP-based approaches

从事件日志中通过强化学习学习最优策略:基于深度和MDP的方法比较

Stefano Branchi, Andrei Buliga, Chiara Di Francescomarino, Chiara Ghidini, Riccardo Graziosi, Francesca Meneghello, Massimiliano Ronzani

发表机构 * FBK - Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) Unitn(特伦托大学) Unibz(博尔扎诺自由大学)

AI总结 提出两种强化学习方法(基于MDP和离线深度RL)从历史事件日志中学习最优行为策略以优化KPI,在数据驱动的BPS环境中评估,两种方法均有效提升KPI,但基于MDP的方法计算效率更高。

Comments 38 pages + appendix, 12 figures, new version published in IS journal

详情
Journal ref
Information Systems, Volume 141, 2026, 102763, ISSN 0306-4379
AI中文摘要

规范性流程监控是流程挖掘中的一个新兴领域,专注于推荐行动以优化业务成果。大多数现有工作规定预定义的干预措施,即应用于正在进行的流程执行以实现特定目标或关键绩效指标(KPI)的一组行动。相比之下,只有少数方法探索了学习和评估最优行为策略,即确定最佳行动序列以最大化期望KPI的通用策略。在本文中,我们通过提出一种基于AI的方法来解决学习最优行为策略的问题,该方法使用强化学习(RL)直接从历史流程执行中学习最优策略,以推荐优化KPI的最佳行动。为此,我们采用了两种RL技术。第一种是经典的基于模型的方法,通过构建捕获流程行为的马尔可夫决策过程(MDP)来扩展作者先前的工作。第二种是基于离线深度RL的无模型技术。与现有工作不同,我们旨在最小化领域知识的使用,并直接从历史事件数据中学习最优策略。这使我们能够学习何时应用干预措施,并直接从数据中发现有效的干预措施。此外,我们针对涉及外部参与者的复杂场景,其中流程所有者仅控制部分活动。我们采用数据驱动的业务流程模拟(BPS)环境来评估学习到的策略。结果表明,两种方法都以相似的有效性改进了目标KPI,而基于模型的方法在计算效率上优于离线深度RL。

英文摘要

Prescriptive Process Monitoring is an emerging area within Process Mining that focuses on recommending actions to optimize business outcomes. Most existing works prescribe pre-defined interventions, i.e., sets of actions applied to ongoing process executions to achieve a specific objective or Key Performance Indicator (KPI). In contrast, only a few approaches have explored learning and evaluating optimal behavioral policies, i.e., general strategies that determine the best sequence of actions to maximize a desired KPI. In this paper, we address the problem of learning optimal behavioral policies by proposing an AI-based approach that learns an optimal policy directly from historical process executions using Reinforcement Learning (RL) to recommend the best actions for optimizing a KPI. To this end, we employ two RL techniques. The first is a classical model-based approach that extends previous work by the authors through the construction of a Markov Decision Process (MDP) capturing process behavior. The second is a model-free technique based on offline Deep RL. Unlike state-of-the-art work, we aim to minimize the use of domain knowledge and learn optimal policies directly from historical event data. This allows us to learn when to apply interventions and discover effective ones directly from data. Moreover, we target complex scenarios involving external actors, where the process owner controls only part of the activities. We adopt a data-driven Business Process Simulation (BPS) environment to evaluate the learned policies. Results show that both methods improve the targeted KPI with similar effectiveness, while the model-based approach outperforms offline Deep RL in computational efficiency.
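
The model-based route described here, building an MDP from historical executions and solving it, can be sketched with frequency estimates and value iteration. This is a generic illustration and not the authors' pipeline: the event log, state names, actions and rewards below are hypothetical, and states are assumed to be already abstracted from raw traces.

```python
from collections import defaultdict

def build_mdp(transitions):
    """Estimate transition probabilities and mean rewards from logged (s, a, s', r) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, s_next, r in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a, s_next)] += r
    P = {sa: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
         for sa, nxt in counts.items()}
    R = {(s, a, s2): reward_sum[(s, a, s2)] / c
         for (s, a), nxt in counts.items() for s2, c in nxt.items()}
    return P, R

def q_value(P, R, V, s, a, gamma):
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())

def value_iteration(P, R, gamma=0.95, tol=1e-9):
    states = {s for s, _ in P} | {s2 for nxt in P.values() for s2 in nxt}
    actions = defaultdict(list)
    for s, a in P:
        actions[s].append(a)
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in actions:  # terminal states have no actions and keep V = 0
            best = max(q_value(P, R, V, s, a, gamma) for a in actions[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(acts, key=lambda a: q_value(P, R, V, s, a, gamma))
              for s, acts in actions.items()}
    return V, policy

# Hypothetical event log: (state, action, next_state, reward contribution to the KPI).
event_log = [
    ("start", "fast_track", "review", -1.0),
    ("start", "standard", "review", -2.0),
    ("review", "call_customer", "accepted", 10.0),
    ("review", "call_customer", "accepted", 10.0),
    ("review", "call_customer", "rejected", 0.0),
    ("review", "send_email", "accepted", 10.0),
    ("review", "send_email", "rejected", 0.0),
    ("review", "send_email", "rejected", 0.0),
    ("review", "send_email", "rejected", 0.0),
]
P, R = build_mdp(event_log)
V, policy = value_iteration(P, R)
print(policy, V["review"])
```

Tabular estimation and value iteration need no gradient training, which is consistent with the abstract's remark that the model-based approach is the cheaper of the two; the price is that the state abstraction has to be chosen up front.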

2601.05106 2026-06-15 cs.AI cs.CL cs.LG 版本更新

Token-Level LLM Collaboration via FusionRoute

通过融合路由实现令牌级LLM协作

Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

发表机构 * [cs.AI](计算机科学与人工智能)

AI总结 本文提出FusionRoute框架,通过轻量级路由器在解码步骤中选择最合适的专家并补充对数几率以优化下一个令牌分布,解决了单个通用模型在多个领域表现不佳的问题,同时在多个基准测试中优于其他方法。

Comments 25 pages

详情
AI中文摘要

大型语言模型(LLMs)在多个领域表现出色。然而,使用单一通用模型在这些领域实现强大性能通常需要扩展到训练和部署成本极高的规模。另一方面,虽然较小的领域专用模型更高效,但它们在训练分布之外的泛化能力较差。为了解决这一矛盾,我们提出了FusionRoute,一种稳健且有效的令牌级多LLM协作框架,其中轻量级路由器同时(i)在每个解码步骤中选择最合适的专家,(ii)贡献一个互补的对数几率,通过对数几率添加来细化或校正所选专家的下一个令牌分布。与现有依赖固定专家输出的令牌级协作方法不同,我们提供了一个理论分析,表明纯专家路由本质上是有限的:除非持有强全局覆盖假设,否则无法一般实现最优解码策略。通过在专家选择中加入可训练的互补生成器,FusionRoute扩展了有效的策略类别,并在温和条件下实现了最优价值函数的恢复。经验上,FusionRoute在Llama-3和Gemma-2家族以及涵盖数学推理、代码生成和指令跟随在内的多种基准测试中,优于序列级和令牌级协作、模型融合和直接微调方法,同时在各自任务上与领域专家保持竞争力。

英文摘要

Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
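
A toy decoding step illustrates the difference between pure expert routing and routing plus logit addition. The numbers, the four-token vocabulary and the function name `fused_next_token_dist` are made up for illustration; in FusionRoute the router scores and complementary logits come from a trained lightweight model, which this sketch does not include.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fused_next_token_dist(expert_logits, router_scores, complementary_logits):
    """Select one expert for this step, then refine its logits by addition."""
    chosen = int(np.argmax(router_scores))
    fused = expert_logits[chosen] + complementary_logits
    return chosen, softmax(fused)

# Hypothetical single decoding step: two experts, vocabulary of four tokens.
expert_logits = np.array([
    [2.0, 1.0, 0.0, 0.0],   # expert 0
    [0.0, 0.0, 3.0, 1.0],   # expert 1
])
router_scores = np.array([0.2, 0.8])
complementary_logits = np.array([0.0, 0.0, -3.0, 1.0])

chosen, probs = fused_next_token_dist(expert_logits, router_scores, complementary_logits)
routing_only_token = int(np.argmax(expert_logits[chosen]))
fused_token = int(np.argmax(probs))
print(chosen, routing_only_token, fused_token)
```

With routing alone, the output is confined to what the selected expert already prefers (token 2 here). The additive term lets the system reach token 3, which neither expert ranks first, which is the kind of policy-class expansion the abstract argues for.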

2605.14998 2026-06-15 cs.AI cs.SY eess.SY q-bio.QM 版本更新

Learning Developmental Scaffoldings to Guide Self-Organisation

学习发育支架以引导自组织

Milton L. Montero, Elias Najarro, Jakob Schauser, Sebastian Risi

发表机构 * IT University of Copenhagen(丹麦哥本哈根信息技术大学) University of Copenhagen(丹麦哥本哈根大学) Sakana AI

AI总结 本文研究了通过学习自组织规则和预模式共同作用来提升发育过程的鲁棒性、编码能力和对称性打破。

Comments 8 pages + acknowledgements and references, 5 figures. Camera-ready version for ALife 2026

详情
AI中文摘要

从亚细胞结构到整个生物体,许多自然系统通过自组织生成复杂结构:局部相互作用共同产生全局结构,而无需任何结果的蓝图。然而,推动此类过程的大量信息并非由自组织本身产生,而是常常转移到系统的初始条件中。生物发育是一个典型例子,其中母体的预模式编码位置和对称性打破信息,从而引导自组织过程。从早期胚胎发育中的母体形态发生素梯度到组织水平的形态发生预模式指导器官形成,这种信息转移到初始条件的现象,类似于计算系统中的记忆-计算权衡,是发育过程的基本部分。在本文中,我们通过引入一个模型来研究这种信息转移现象,该模型同时学习自组织规则和预模式,允许其相互作用在受控条件下进行变化和测量:一个神经细胞自动机(NCA)配对一个学习基于坐标的模式生成器(SIREN),两者同时训练以生成一组模式。我们提供了信息论分析,探讨信息如何在预模式和自组织过程之间分布,并展示联合学习两者可提高鲁棒性、编码能力和对称性打破,相较于纯自组织替代方案。进一步分析表明,有效的预模式不简单地近似其目标;而是通过偏转发育动力学的方式促进收敛,指出了初始条件结构与自组织动力学之间非平凡的关系。

英文摘要

From subcellular structures to entire organisms, many natural systems generate complex organisation through self-organisation: local interactions that collectively give rise to global structure without any blueprint of the outcome. Yet a significant portion of the information driving such processes is not produced by self-organisation itself; instead, it is often offloaded to the initial conditions of the system. Biological development is a prime example, where maternal pre-patterns encode positional and symmetry-breaking information that scaffolds the self-organising process. From maternal morphogen gradients in early embryogenesis to tissue-level morphogenetic pre-patterns guiding organ formation, this transfer of information to initial conditions, analogous to a memory-compute trade-off in computational systems, is a fundamental part of developmental processes. In this work, we study this offloading phenomenon by introducing a model that jointly learns both the self-organisation rules and the pre-patterns, allowing their interplay to be varied and measured under controlled conditions: a Neural Cellular Automaton (NCA) paired with a learned coordinate-based pattern generator (SIREN), both trained simultaneously to generate a set of patterns. We provide information-theoretic analyses of how information is distributed between pre-patterns and the self-organising process, and show that jointly learning both components yields improvements in robustness, encoding capacity, and symmetry breaking over purely self-organising alternatives. Our analysis further suggests that effective pre-patterns do not simply approximate their targets; rather, they bias the developmental dynamics in ways that facilitate convergence, pointing to a non-trivial relationship between the structure of initial conditions and the dynamics of self-organisation.

2606.13392 2026-06-15 cs.AI 版本更新

MiniMax Sparse Attention

MiniMax 稀疏注意力

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen, Yang Xu, Lunbin Zeng, Xiaolong Li, Haohai Sun, Haichao Zhu, Vito Zhang, Jinkai Hu, Jiayao Li, Rui Gao, Zekun Li, Songquan Zhu, Jingkai Zhou, Pengyu Zhao

发表机构 * MiniMax Peking University(北京大学) NVIDIA(英伟达) Zhejiang University(浙江大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出 MiniMax 稀疏注意力(MSA),一种基于分组查询注意力的块级稀疏注意力机制,通过轻量索引分支选择 Top-k 键值块,实现高效长上下文处理,在 109B 模型上以 1M 上下文减少 28.4 倍注意力计算,并带来 14.2 倍预填充和 7.6 倍解码加速。

Comments 30 pages, 14 figures

详情
AI中文摘要

超长上下文能力对于前沿大语言模型变得不可或缺:智能体工作流、仓库级代码推理和持久记忆都要求模型共同关注数十万到数百万个 token,然而 softmax 注意力的二次成本使得这在部署规模上难以实现。我们引入了 MiniMax 稀疏注意力(MSA),一种基于分组查询注意力(GQA)构建的块级稀疏注意力。一个轻量级索引分支对键值块进行评分,并为每个 GQA 组独立选择 Top-k 子集,从而实现组特定的稀疏检索,同时保持高效的块级执行;主分支则仅对选中的块执行精确的块稀疏注意力。MSA 的设计遵循简单和可扩展的原则,经过精心简化,使其能够在一系列 GPU 上高效部署。为了将稀疏性转化为实际加速,我们与 MSA 协同设计了 GPU 执行路径,该路径使用无指数 Top-k 选择和 KV 外部稀疏注意力,以在块粒度访问下提高张量核心利用率。在一个具有原生多模态训练的 109B 参数模型上,MSA 的性能与 GQA 相当,同时在 1M 上下文下将每个 token 的注意力计算减少了 28.4 倍。结合我们协同设计的内核,MSA 在 H800 上实现了 14.2 倍的预填充和 7.6 倍的解码端到端加速。我们的推理内核可在 https://github.com/MiniMax-AI/MSA 获取。一个由 MSA 驱动的生产级原生多模态模型已在 https://huggingface.co/MiniMaxAI/MiniMax-M3 公开发布。

英文摘要

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
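
The two-branch structure (score blocks, keep a Top-k subset per group, then run exact attention on the kept blocks) can be sketched for a single query per GQA group. Several things here are assumptions rather than the MSA design: the index branch is replaced by a dot product between the query and each block's mean key, the causal mask is omitted, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dense_attention(q, K, V):
    return softmax(K @ q / np.sqrt(K.shape[1])) @ V

def block_sparse_attention(q, K, V, block_size, top_k):
    """Exact attention restricted to the top_k highest-scoring key-value blocks."""
    n, d = K.shape
    n_blocks = n // block_size                      # assumes n is a multiple of block_size
    K_blocks = K.reshape(n_blocks, block_size, d)
    V_blocks = V.reshape(n_blocks, block_size, V.shape[1])
    # Stand-in for the index branch: score each block by q · (mean key of the block).
    block_scores = K_blocks.mean(axis=1) @ q
    selected = np.sort(np.argsort(block_scores)[-top_k:])
    K_sel = K_blocks[selected].reshape(-1, d)
    V_sel = V_blocks[selected].reshape(-1, V.shape[1])
    weights = softmax(K_sel @ q / np.sqrt(d))       # exact softmax over kept tokens only
    return weights @ V_sel, selected

rng = np.random.default_rng(0)
n_groups, n_tokens, d, block_size = 2, 64, 8, 8
K = rng.normal(size=(n_groups, n_tokens, d))        # one KV head per GQA group
V = rng.normal(size=(n_groups, n_tokens, d))
Q = rng.normal(size=(n_groups, d))                  # one query per group, for brevity

# Each group selects its own blocks independently.
sparse = [block_sparse_attention(Q[g], K[g], V[g], block_size, top_k=2)
          for g in range(n_groups)]
# Sanity check: keeping every block must reproduce dense attention.
full_out, full_sel = block_sparse_attention(Q[0], K[0], V[0], block_size, top_k=8)
dense_out = dense_attention(Q[0], K[0], V[0])
print([sel.tolist() for _, sel in sparse])
```

With `top_k=2` out of 8 blocks, each query attends over a quarter of the keys. The attention-compute reduction reported in the abstract follows the same proportionality at much larger context lengths, though the wall-clock gains depend on the co-designed kernel.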

2505.12992 2026-06-15 cs.LG cs.AI cs.CL stat.ML 版本更新

Fractured Chain-of-Thought Reasoning

断裂链式思维推理

Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong

发表机构 * University of Amsterdam(阿姆斯特丹大学) eBay Microsoft(微软) Google Research(谷歌研究) Salesforce

AI总结 提出断裂采样策略,通过截断推理链、调整轨迹数和解数,在推理时实现精度与成本的帕累托最优。

详情
AI中文摘要

推理时扩展技术通过在不重新训练的情况下利用额外的推理计算,显著增强了大型语言模型(LLMs)的推理能力。类似地,链式思维(CoT)提示及其扩展Long CoT通过生成丰富的中间推理轨迹来提高准确性,但这些方法会带来大量的token成本,阻碍了它们在延迟敏感场景中的部署。在这项工作中,我们首先证明截断CoT(即在完成推理前停止并直接生成最终答案)通常在使用显著更少token的情况下与完整CoT采样相匹配。基于这一见解,我们引入了断裂采样,这是一种统一的推理时策略,沿着三个正交轴在完整CoT和仅解决方案采样之间进行插值:(1)推理轨迹的数量,(2)每条轨迹的最终解数量,以及(3)推理轨迹被截断的深度。通过在五个不同的推理基准和多个模型规模上进行大量实验,我们证明断裂采样始终实现优越的精度-成本权衡,在Pass@k与token预算之间产生陡峭的对数线性缩放增益。我们的分析揭示了如何在这些维度上分配计算以最大化性能,为更高效和可扩展的LLM推理铺平了道路。代码可在 https://github.com/BaohaoLiao/frac-cot 获取。

英文摘要

Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches the full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning. Code is available at https://github.com/BaohaoLiao/frac-cot.

2506.14202 2026-06-15 cs.LG cs.AI stat.ML 版本更新

DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

DiffusionBlocks: 通过扩散解释进行分块神经网络训练

Makoto Shing, Masanori Koyama, Takuya Akiba

发表机构 * Sakana AI The University of Tokyo(东京大学)

AI总结 提出DiffusionBlocks框架,利用残差连接与动力系统的对应关系,将网络转换为去噪过程,通过分数匹配目标实现独立分块训练,在多种Transformer架构上达到与端到端训练相当的性能,同时降低内存需求。

Comments To appear at the 14th International Conference on Learning Representations (ICLR 2026). v4: Fixed typos in experimental details (Appendix E.4)

详情
AI中文摘要

端到端反向传播需要存储所有层的激活值,造成内存瓶颈,限制了模型的可扩展性。现有的分块训练方法提供了缓解该问题的途径,但它们依赖于特设的局部目标,并且在分类任务之外尚未得到充分探索。我们提出DiffusionBlocks,一个将基于Transformer的网络转化为真正独立可训练块的原则性框架,这些块能保持与端到端训练相竞争的性能。我们的关键洞察在于利用残差连接自然对应于动力系统中的更新这一事实。通过对该系统进行最小修改,我们可以将这些更新转换为去噪过程的更新,其中每个块可以通过利用分数匹配目标独立学习。这种独立性使得每次只训练一个块的梯度成为可能,从而将内存需求按块数量成比例降低。我们在多种Transformer架构(视觉、扩散、自回归、递归深度和掩码扩散)上的实验表明,DiffusionBlocks训练与端到端训练性能匹配,同时能够在实际任务(超越小规模分类)上实现可扩展的分块训练。DiffusionBlocks提供了一种理论上有依据的方法,能够成功扩展到跨多种架构的现代生成任务。代码可在 https://github.com/SakanaAI/DiffusionBlocks 获取。

英文摘要

End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose DiffusionBlocks, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one block at a time, thereby reducing memory requirements in proportion to the number of blocks. Our experiments on a range of transformer architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion) demonstrate that DiffusionBlocks training matches the performance of end-to-end training while enabling scalable block-wise training on practical tasks beyond small-scale classification. DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures. Code is available at https://github.com/SakanaAI/DiffusionBlocks.

2506.17255 2026-06-15 cs.LG cs.AI 版本更新

UltraSketchLLM: Sub-1-Bit LLM Compression via Sketch and Hardware-Friendly Operators

UltraSketchLLM:基于草图与硬件友好算子的低于1比特LLM压缩

Sunan Zou, Xueting Sun, Ziyun Zhang, Guojie Luo

发表机构 * National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University(国家多媒体信息处理重点实验室,计算机科学学院,北京大学) School of Electronic Engineering and Computer Science, Peking University(电子工程与计算机科学学院,北京大学) Center for Energy-efficient Computing and Applications, Peking University(能效计算与应用中心,北京大学)

AI总结 提出UltraSketchLLM,利用数据草图将LLM权重压缩至0.5比特,结合硬件友好实现,在保持可接受性能下降的同时实现14.9倍加速。

Comments Accepted by the 63rd ACM/IEEE The Chips to Systems Conference (DAC 2026)

详情
AI中文摘要

大型语言模型(LLM)如今需要更大的GPU内存,因此需要高效且极端的权重压缩方法。现有的压缩方法要么在理论上受限于每权重1比特,要么面临严重的性能下降和效率低下。为了在资源受限的场景中部署LLM,我们引入了UltraSketchLLM,它利用数据草图压缩LLM。它通过高达每权重0.5比特的高压缩率降低了峰值GPU内存占用。结合硬件友好的实现,UltraSketchLLM保持了可容忍的性能下降和极低的延迟开销,与朴素草图解决方案相比实现了14.9倍的加速。

英文摘要

Large language models (LLMs) demand ever more GPU memory, necessitating efficient and extreme weight compression methods. Existing compression methods are either theoretically limited to 1 bit per weight or suffer from severe performance degradation and inefficiency. To deploy LLMs in resource-constrained scenarios, we introduce UltraSketchLLM, which compresses LLMs with data sketches. It reduces the peak GPU memory footprint with a high compression rate, down to 0.5 bits per weight. Combined with a hardware-friendly implementation, UltraSketchLLM keeps performance degradation tolerable and latency overhead extremely low, with a 14.9x speedup compared to a naive sketch solution.

2510.01663 2026-06-15 cs.LG cs.AI 版本更新

Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value

基于Shapley值的Kolmogorov-Arnold网络平移不变属性评分

Wangxuan Fan, Ching Wang, Siqi Li, Nan Liu

发表机构 * GitHub

AI总结 提出ShapKAN框架,利用Shapley值归因实现平移不变的节点重要性评估,有效压缩KAN网络并保持其可解释性优势。

Comments 14 pages, 6 figures, 9 tables

详情
AI中文摘要

对于许多实际应用,理解特征与结果之间的关系与实现高预测准确性同样重要。虽然传统神经网络在预测方面表现出色,但其黑箱性质掩盖了潜在的功能关系。Kolmogorov-Arnold网络(KAN)通过在边上采用可学习的基于样条的激活函数来解决这一问题,能够在保持竞争性能的同时恢复符号表示。然而,KAN的架构对网络剪枝提出了独特的挑战。由于对输入坐标平移的敏感性,传统的基于幅度的方法变得不可靠。我们提出了ShapKAN,一种使用Shapley值归因以平移不变方式评估节点重要性的剪枝框架。与基于幅度的方法不同,ShapKAN量化每个节点的实际贡献,确保无论输入参数化如何,重要性排名保持一致。在合成和真实世界数据集上的大量实验表明,ShapKAN在实现有效网络压缩的同时保留了真实的节点重要性。我们的方法提升了KAN的可解释性优势,便于在资源受限环境中部署。

英文摘要

For many real-world applications, understanding feature-outcome relationships is as crucial as achieving high predictive accuracy. While traditional neural networks excel at prediction, their black-box nature obscures underlying functional relationships. Kolmogorov-Arnold Networks (KANs) address this by employing learnable spline-based activation functions on edges, enabling recovery of symbolic representations while maintaining competitive performance. However, the KAN architecture presents unique challenges for network pruning. Conventional magnitude-based methods become unreliable due to sensitivity to input coordinate shifts. We propose ShapKAN, a pruning framework using Shapley value attribution to assess node importance in a shift-invariant manner. Unlike magnitude-based approaches, ShapKAN quantifies each node's actual contribution, ensuring consistent importance rankings regardless of input parameterization. Extensive experiments on synthetic and real-world datasets demonstrate that ShapKAN preserves true node importance while enabling effective network compression. Our approach enhances the interpretability advantages of KANs, facilitating deployment in resource-constrained environments.
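
For readers unfamiliar with Shapley attribution, the sketch below computes exact Shapley values for a tiny set of hidden nodes. It is a generic illustration, not the ShapKAN implementation: the node names and the toy `value` function (standing in for "model performance when only these nodes are active") are hypothetical, and exact enumeration is only feasible for a handful of nodes.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: weighted average marginal contribution over all coalitions."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi

# Hypothetical value function: additive node contributions plus an h1-h2 interaction.
base = {"h1": 1.0, "h2": 0.5, "h3": 0.0}

def value(active):
    total = sum(base[node] for node in active)
    if {"h1", "h2"} <= active:
        total += 0.5
    return total

phi = shapley_values(list(base), value)
print(phi)
```

The scores depend only on how the output changes when a node is switched on or off, not on the magnitude of any individual parameter, which is the property that makes such rankings insensitive to how the inputs are parameterized. Here `h3` receives zero attribution and would be the natural pruning candidate.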

2511.07368 2026-06-15 cs.LG cs.AI 版本更新

Distributional Biases in Post-Training: A Markovian Analysis of Reasoning Trajectories

后训练中的分布偏差:推理轨迹的马尔可夫分析

Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, Taiji Suzuki

发表机构 * City University of Hong Kong(香港城市大学) Center for Advanced Intelligence Project, RIKEN(理化学研究所革新智能综合研究中心) The Institute of Statistical Mathematics(统计数理研究所) University of Sydney(悉尼大学) CFAR and IHPC, Agency for Science, Technology and Research (A*STAR)(新加坡科技研究局 CFAR 与 IHPC) Nanyang Technological University(南洋理工大学) The University of Tokyo(东京大学)

AI总结 通过马尔可夫链模型分析后训练策略(如RLVR和ORM/PRM)如何强化高概率路径而遗忘稀有但关键的推理步骤,并证明探索策略(如拒绝简单实例和KL正则化)有助于保留稀有CoT。

详情
AI中文摘要

基础模型展现出广泛的知识但有限的特定任务推理能力,这促使了后训练策略的发展,例如基于可验证奖励的强化学习(RLVR)和测试时扩展(TTS)。尽管近期工作强调了探索在提升pass@K中的作用,但经验证据指向一个悖论:RLVR和ORM/PRM通常强化现有路径而非扩展推理范围,这引发了一个问题:如果没有新模式出现,探索为何有帮助?为调和这一悖论,我们采用Kim等人(2025)的视角,将简单(例如,化简分数)与困难(例如,发现某种对称性)推理步骤分别视为高概率和低概率的马尔可夫转移。在这个易处理的模型中,预训练对应于树图发现,而后训练对应于思维链(CoT)重新加权。我们严格证明,RLVR和ORM/PRM都会严重偏向少数几条高概率路径,从而遗忘稀有但关键的CoT。在此基础上,我们进一步证明,诸如拒绝简单实例和KL正则化等探索策略有助于保留稀有CoT。实证模拟证实了我们的理论结果。

英文摘要

Foundation models exhibit broad knowledge but limited task-specific reasoning, motivating post-training strategies such as RL with verifiable rewards (RLVR) and test-time scaling (TTS). While recent work highlights the role of exploration in improving pass@K, empirical evidence points to a paradox: RLVR and ORM/PRM typically reinforce existing paths rather than expanding the reasoning scope, raising the question of why exploration helps if no new patterns emerge. To reconcile this paradox, we adopt the perspective of Kim et al. (2025), viewing easy (e.g., simplifying a fraction) versus hard (e.g., discovering some symmetry) reasoning steps as high- versus low-probability Markov transitions. In this tractable model, pretraining corresponds to tree-graph discovery, while post-training corresponds to CoT reweighting. We prove that both RLVR and ORM/PRM heavily favor a few high-probability paths and thereby forget rare-but-crucial CoTs. Building on this, we further prove that exploration strategies such as rejecting easy instances and KL regularization help preserve rare CoTs. Empirical simulations corroborate our theoretical results.

2512.22671 2026-06-15 cs.CL cs.AI cs.LG 版本更新

Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

脆弱的知识,稳健的指令遵循:Llama-3.2中的宽度剪枝二分法

Pere Martra

发表机构 * Independent Researcher(独立研究员)

AI总结 通过峰峰值幅度(PPM)准则对GLU-MLP层进行结构化宽度剪枝,发现降低扩展比会损害参数化知识任务,但能提升指令遵循能力,挑战了剪枝导致均匀退化的假设。

Comments 22 pages, 5 figures, 9 tables. Code available at https://github.com/peremartra/llama-glu-expansion-pruning

详情
AI中文摘要

对Llama-3.2模型中GLU-MLP层的结构化宽度剪枝,以峰峰值幅度(PPM)准则为指导,揭示了降低扩展比如何系统性地影响不同模型能力的二分法。虽然依赖参数化知识的任务(如MMLU、GSM8K)和困惑度指标的性能随扩展比降低而可预测地下降,但指令遵循能力在2.4倍平衡比下得到提升(IFEval:Llama-3.2-1B中+4.8分/+46%,Llama-3.2-3B中+3.7分/+39%),且多步推理保持稳健(MUSR)。这种模式在两个评估模型大小上一致观察到,挑战了压缩研究中剪枝导致均匀退化的主流假设。为探究这一点,我们使用评估事实知识、数学推理、语言理解、指令遵循和真实性的综合基准套件,评估了七种扩展比配置。我们的分析将扩展比识别为一个关键架构参数,它选择性地重塑模型的任务性能轮廓,而不仅仅是作为压缩指标。

英文摘要

Structured width pruning of GLU-MLP layers in Llama-3.2 models, guided by the Peak-to-Peak Magnitude (PPM) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably with decreasing expansion ratios, instruction-following capabilities improve at the 2.4x equilibrium ratio (IFEval: +4.8 points / +46% in Llama-3.2-1B and +3.7 points / +39% in Llama-3.2-3B), and multi-step reasoning remains robust (MUSR). This pattern, observed consistently across both evaluated model sizes, challenges the prevailing assumption in compression research that pruning induces uniform degradation. To investigate this, we evaluated seven expansion ratio configurations using comprehensive benchmark suites that assess factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively reshapes the model's task performance profile, rather than merely serving as a compression metric.

2601.22108 2026-06-15 cs.LG cs.AI 版本更新

Learning What to Predict: Downstream-Guided Task Design for Continued Pretraining

学习预测什么:下游引导的持续预训练任务设计

Shuqi Ke, Giulia Fanti

发表机构 * Department of ECE(电气与计算机工程系) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出V-pretraining方法,通过轻量级任务设计器为无标签批次构建目标或视图,利用下游损失的一阶减少作为反馈,指导自监督更新,提升目标能力而不损害泛化。

详情
AI中文摘要

持续预训练通过固定的自监督任务进行优化,但根据下游性能选择检查点,形成了一个粗粒度的反馈循环:实践者评估检查点、改变数据混合或目标、重新开始运行,而单个更新仍然对目标能力视而不见。我们询问是否一小部分可验证的下游示例可以在不直接监督学习器的情况下提供步骤级反馈。我们引入了V-pretraining,它将仅使用自监督损失训练的学习器与一个轻量级任务设计器解耦,该设计器为无标签批次构建目标或视图。给定当前学习器和批次,V-pretraining通过预测诱导的自监督更新后下游损失的一阶减少来评分候选构建。设计器最大化该值;然后学习器应用带有分离目标或视图的更新,因此下游标签永远不会更新学习器参数。我们将V-pretraining实例化为用于语言建模的自适应top-K软目标和用于自监督视觉的学习视图或掩码。在两种模态中,V-pretraining在不降低泛化的情况下提高了目标能力。在挂钟时间匹配的持续预训练下,它仅使用1,024个GSM8K示例作为反馈,提高了Qwen模型的GSM8K Pass@1,包括Qwen2.5-0.5B的单次运行+7.4点增益。在视觉方面,它改善了DINOv3向ADE20K语义分割和NYUv2深度估计的迁移,同时保持了ImageNet线性准确率,表明反馈引导的任务构建可以在不破坏通用表示的情况下提高目标能力。

英文摘要

Continued pretraining is optimized with fixed self-supervised tasks but selected by downstream performance, creating a coarse feedback loop in which practitioners evaluate checkpoints, change data mixtures or objectives, and restart runs, while individual updates remain blind to target capabilities. We ask whether a small set of verifiable downstream examples can provide step-level feedback without directly supervising the learner. We introduce V-pretraining, which decouples a learner trained only with a self-supervised loss from a lightweight task designer that constructs targets or views for unlabeled batches. Given the current learner and batch, V-pretraining scores a candidate construction by predicting the first-order reduction in downstream loss after the induced self-supervised update. The designer maximizes this value; the learner then applies the update with targets or views detached, so downstream labels never update learner parameters. We instantiate V-pretraining as adaptive top-K soft targets for language modeling and learned views or masks for self-supervised vision. Across both modalities, V-pretraining improves target capabilities without degrading generalization. Under wall-clock-matched continued pretraining, it improves GSM8K Pass@1 for Qwen models using 1,024 GSM8K examples only as feedback, including a +7.4 point single-run gain for Qwen2.5-0.5B. In vision, it improves DINOv3 transfer to ADE20K semantic segmentation and NYUv2 depth estimation while preserving ImageNet linear accuracy, suggesting that feedback-guided task construction can improve target capabilities without collapsing general-purpose representations.
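The scoring rule described above (predicting the first-order reduction in downstream loss after a self-supervised update) can be sketched on a toy problem. For an update θ ← θ − η·g_ssl, a first-order Taylor expansion gives a predicted reduction of η·⟨∇L_down(θ), g_ssl⟩. The quadratic downstream loss, the candidate names, and their gradients below are hypothetical stand-ins chosen for illustration. They are not the paper's task designer or learner.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def downstream_loss(theta):
    """Toy quadratic downstream loss with its minimum at (1, -2)."""
    return 0.5 * ((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


def downstream_grad(theta):
    return [theta[0] - 1.0, theta[1] + 2.0]


def first_order_score(theta, ssl_grad, lr):
    """Predicted drop in downstream loss after theta <- theta - lr * ssl_grad."""
    return lr * dot(downstream_grad(theta), ssl_grad)


def actual_reduction(theta, ssl_grad, lr):
    """True drop in downstream loss for the same update, for comparison."""
    updated = [t - lr * g for t, g in zip(theta, ssl_grad)]
    return downstream_loss(theta) - downstream_loss(updated)


theta = [0.0, 0.0]
lr = 0.01
# Hypothetical self-supervised gradients induced by three candidate constructions
candidates = {
    "view_a": [-1.0, 0.0],
    "view_b": [0.0, -1.0],
    "view_c": [-1.0, 2.0],
}
scores = {name: first_order_score(theta, g, lr) for name, g in candidates.items()}
best = max(scores, key=scores.get)
```

Here `best` is the candidate whose self-supervised gradient is most aligned with the downstream gradient. The learner would then take the self-supervised step for that candidate, so in this sketch the downstream loss only influences which candidate is chosen and not the update direction itself.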

2602.03120 2026-06-15 cs.LG cs.AI 版本更新

Quantized Evolution Strategies: High-precision Fine-tuning of Quantized LLMs at Low-precision Cost

量化进化策略:以低精度代价实现量化大语言模型的高精度微调

Yinggan Xu, Kajetan Schweighofer, Risto Miikkulainen, Xin Qiu

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Cognizant AI Lab(Cognizant AI实验室) UT Austin(得克萨斯大学奥斯汀分校)

AI总结 提出量化进化策略(QES),通过集成累积误差反馈和无状态种子重放,直接在量化空间进行全参数微调,无需反向传播,显著优于现有零阶微调方法。

Comments Added more tasks and baselines

详情
AI中文摘要

后训练量化(PTQ)对于在内存受限设备上部署大语言模型(LLM)至关重要,但它使模型变得静态且难以微调。标准的微调范式,包括强化学习(RL),从根本上依赖于反向传播和连续权重来计算梯度。因此,它们无法用于参数空间离散且不可微的量化模型。虽然进化策略(ES)提供了一种无需反向传播的替代方案,但由于梯度估计消失或不准确,量化参数的优化仍可能失败。本文介绍了量化进化策略(QES),一种直接在量化空间执行全参数微调的优化范式。QES基于两项创新:(1)它集成了累积误差反馈以保留高精度权重更新信号,(2)它利用无状态种子重放将内存使用降低到低精度推理水平。QES在各种任务上显著优于最先进的零阶微调方法,使得量化模型的直接微调成为可能。因此,它开辟了完全在量化空间中扩展LLM的可能性。源代码见 https://github.com/dibbla/Quantized-Evolution-Strategies 。

英文摘要

Post-Training Quantization (PTQ) is essential for deploying Large Language Models (LLMs) on memory-constrained devices, yet it renders models static and difficult to fine-tune. Standard fine-tuning paradigms, including Reinforcement Learning (RL), fundamentally rely on backpropagation and continuous weights to compute gradients. Thus they cannot be used on quantized models, where the parameter space is discrete and non-differentiable. While Evolution Strategies (ES) offer a backpropagation-free alternative, optimization of the quantized parameters can still fail due to vanishing or inaccurate gradient estimation. This paper introduces Quantized Evolution Strategies (QES), an optimization paradigm that performs full-parameter fine-tuning directly in the quantized space. QES is based on two innovations: (1) it integrates accumulated error feedback to preserve high-precision weight updating signals, and (2) it utilizes a stateless seed replay to reduce memory usage to low-precision inference levels. QES significantly outperforms the state-of-the-art zeroth-order fine-tuning methods on a variety of tasks, making direct fine-tuning for quantized models possible. It therefore opens up the possibility for scaling up LLMs entirely in the quantized space. The source code is available at https://github.com/dibbla/Quantized-Evolution-Strategies .
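The two ingredients named in this abstract, accumulated error feedback and seed replay, can be illustrated with a minimal sketch. It assumes a single weight stored on an integer grid and a fixed real-valued update of 0.25 per step. The helper names and the ternary perturbation are hypothetical and are not taken from the QES code. The sketch shows why small updates vanish under plain rounding but survive when the rounding error is carried forward, and how a perturbation can be regenerated from its seed instead of being stored.

```python
import math
import random


def round_half_up(x):
    """Round to the nearest integer grid point, with ties going up."""
    return math.floor(x + 0.5)


def step_with_error_feedback(w_q, residual, update):
    """Apply a real-valued update to an integer weight and carry the rounding error."""
    target = update + residual
    applied = round_half_up(target)
    return w_q + applied, target - applied


def replay_perturbation(seed, size):
    """Regenerate the same perturbation from its seed instead of storing it."""
    rng = random.Random(seed)
    return [rng.choice((-1, 0, 1)) for _ in range(size)]


w_naive, w_fb, residual = 0, 0, 0.0
for _ in range(8):
    w_naive += round_half_up(0.25)  # each small update rounds to zero
    w_fb, residual = step_with_error_feedback(w_fb, residual, 0.25)
```

After eight steps the intended total change is 2.0. The naive weight has not moved, while the error-feedback weight has moved by exactly two grid steps. Replaying perturbations from seeds trades a little recomputation for not having to keep them in memory.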

2602.04879 2026-06-15 cs.LG cs.AI cs.CL 版本更新

Rethinking the Trust Region in LLM Reinforcement Learning

重新思考LLM强化学习中的信任区域

Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Toronto(多伦多大学)

AI总结 针对PPO在LLM微调中因词表大导致的训练不稳定问题,提出基于策略散度直接约束的DPPO算法,并引入高效近似方法。

详情
AI中文摘要

强化学习已成为微调大型语言模型(LLM)的基石,其中近端策略优化(PPO)是事实上的标准算法。尽管其普遍存在,我们认为PPO中的核心比率裁剪机制在结构上不适合LLM固有的大词表。PPO基于采样令牌的概率比率约束策略更新,该比率是对真实策略散度的有噪单样本蒙特卡洛估计。这导致次优的学习动态:低概率令牌的更新被过度惩罚,而高概率令牌中潜在的灾难性变化却约束不足,导致训练效率低下和不稳定。为解决此问题,我们提出散度近端策略优化(DPPO),用基于策略散度(如总变差或KL)直接估计的更原则性约束替代启发式裁剪。为避免巨大内存占用,我们引入了高效的二元和Top-K近似,以可忽略的开销捕获本质散度。大量实证评估表明,DPPO相比现有方法实现了更优的训练稳定性和效率,为基于RL的LLM微调提供了更稳健的基础。我们的代码可在https://github.com/sail-sg/Stable-RL获取。

英文摘要

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid huge memory footprint, we introduce the efficient Binary and Top-K approximations to capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning. Our code is available at https://github.com/sail-sg/Stable-RL.
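The mismatch between the sampled-token ratio and the actual policy divergence can be made concrete with a three-token toy vocabulary. The `binary_tv` and `top_k_tv` helpers below are one plausible reading of the "Binary" and "Top-K" approximations (collapsing the vocabulary into a few buckets before measuring total variation). That reading, the function names, and the probabilities are assumptions, not the DPPO implementation.

```python
def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def binary_tv(p, q, token):
    """TV between the two-outcome distributions {token, everything else}."""
    return abs(p[token] - q[token])


def top_k_tv(p, q, k):
    """TV over the k most likely tokens under p, with the rest lumped into one bucket."""
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    p_bucket = [p[i] for i in top] + [1.0 - sum(p[i] for i in top)]
    q_bucket = [q[i] for i in top] + [1.0 - sum(q[i] for i in top)]
    return total_variation(p_bucket, q_bucket)


old = [0.90, 0.09, 0.01]

# Case 1: a rare token triples in probability, but the policy barely moves.
new_rare = [0.89, 0.08, 0.03]
ratio_rare = new_rare[2] / old[2]
tv_rare = total_variation(old, new_rare)

# Case 2: the dominant token drops sharply, yet its ratio stays near 1.
new_head = [0.75, 0.24, 0.01]
ratio_head = new_head[0] / old[0]
tv_head = total_variation(old, new_head)
```

In case 1 the ratio is about 3.0, far outside a typical clip range such as [0.8, 1.2], while the total variation is only 0.02. In case 2 the ratio of about 0.83 passes the same clip range although the total variation is 0.15. Both bucketed estimates are lower bounds on the full total variation, because merging outcomes can never increase it.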

2602.14169 2026-06-15 cs.LG cs.AI cs.CL 版本更新

Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling

基于枢轴驱动重采样的LLM强化学习深度密集探索

Yiran Guo, Zhongjian Qiao, Yingqi Xie, Jie Liu, Dan Ye, Ruiqing Zhang, Shuang Qiu, Lijie Xu

发表机构 * Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) City University of Hong Kong(香港城市大学) Baidu(百度)

AI总结 针对大语言模型强化学习中探索效率低的问题,提出深度密集探索(DDE)策略,通过识别失败轨迹中的可恢复枢轴状态并局部密集重采样,结合双流优化目标,在数学推理基准上优于现有方法。

详情
AI中文摘要

有效探索是大语言模型强化学习中的一个关键挑战:在有限的采样预算内,从庞大的自然语言序列空间中发现高质量轨迹。现有方法面临显著局限性:GRPO仅从根节点采样,使高概率轨迹饱和,而深层易错状态探索不足;基于树的方法盲目地将预算分散到琐碎或不可恢复的状态,导致采样稀释,无法发现罕见的正确后缀并破坏局部基线。为解决此问题,我们提出深度密集探索(DDE),一种将探索聚焦于失败轨迹中的“枢轴”(即深层、可恢复的状态)的策略。我们通过DEEP-GRPO实例化DDE,引入三个关键创新:(1)轻量级数据驱动效用函数,自动平衡可恢复性和深度偏差以识别枢轴状态;(2)在每个枢轴处进行局部密集重采样,增加发现后续正确轨迹的概率;(3)双流优化目标,将全局策略学习与局部纠正更新解耦。在数学推理基准上的实验表明,我们的方法一致优于GRPO、基于树的方法及其他强基线。代码见 https://github.com/AgentCombo/DEEP-GRPO

英文摘要

Effective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on pivots: deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines. Code is available at https://github.com/AgentCombo/DEEP-GRPO

2602.23638 2026-06-15 cs.LG cs.AI 版本更新

FedRot-LoRA: Mitigating Rotational Misalignment in Federated LoRA

FedRot-LoRA: 缓解联邦LoRA中的旋转偏移

Haoran Zhang, Dongjun Kim, Seohyeon Cha, Haris Vikalo

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出FedRot-LoRA框架,通过正交变换对齐客户端更新以减少子空间不匹配,提升联邦LoRA在异质数据下的性能。

Comments ICML 2026

详情
AI中文摘要

联邦LoRA为在去中心化数据上微调大语言模型提供了一种通信高效的机制。然而在实践中,为保持低秩而采用的逐因子平均与数学上正确的本地更新聚合之间存在差异,这会导致显著的聚合误差和训练不稳定。本文认为,该问题的一个主要来源是旋转偏移(rotational misalignment),它源于低秩分解的旋转不变性:由于 $(B_i R_i)(R_i^\top A_i) = B_i A_i$,语义等价的更新在不同客户端上可以表示在不同的潜在子空间中。当这些未对齐的因子被直接平均时,会产生破坏性干扰,降低全局更新质量。为此,本文提出FedRot-LoRA框架,在聚合前通过正交变换对齐客户端更新,从而在不增加通信成本或限制模型表达能力的情况下,保持语义更新并减少跨客户端子空间不匹配。本文提供了收敛性分析,研究了逐因子平均引起的聚合误差,并展示了旋转对齐如何给出更紧的误差上界。在自然语言理解和生成任务上的广泛实验表明,FedRot-LoRA在各种异质性水平和LoRA秩下均一致优于现有联邦LoRA基线。

英文摘要

Federated LoRA provides a communication-efficient mechanism for fine-tuning large language models on decentralized data. In practice, however, a discrepancy between the factor-wise averaging used to preserve low rank and the mathematically correct aggregation of local updates can cause significant aggregation error and unstable training. We argue that a major source of this problem is rotational misalignment, arising from the rotational invariance of low-rank factorizations -- semantically equivalent updates can be represented in different latent subspaces across clients since $(B_i R_i)(R_i^\top A_i) = B_i A_i$. When such misaligned factors are averaged directly, they interfere destructively and degrade the global update. To address this issue, we propose FedRot-LoRA, a federated LoRA framework that aligns client updates via orthogonal transformations prior to aggregation. This alignment preserves the semantic update while reducing cross-client subspace mismatch, without increasing communication cost or restricting model expressivity. We provide a convergence analysis that examines the aggregation error induced by factor-wise averaging and shows how rotational alignment yields a tighter upper bound on this error. Extensive experiments on natural language understanding and generative tasks demonstrate that FedRot-LoRA consistently outperforms existing federated LoRA baselines across a range of heterogeneity levels and LoRA ranks.
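The identity $(B_i R_i)(R_i^\top A_i) = B_i A_i$ and its effect on factor-wise averaging can be checked numerically. The sketch below uses two hypothetical clients holding the same rank-2 update in bases that differ by a 2D rotation, and aligns client 2 to client 1 with the closed-form 2D orthogonal Procrustes solution. The matrices, the rank-2 rotation-only setting, and the choice of client 1 as the reference are assumptions for illustration. The paper's alignment procedure may differ.

```python
import math


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def rot(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def mean(X, Y):
    return [[(a + b) / 2 for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def max_abs_diff(X, Y):
    return max(abs(a - b) for rx, ry in zip(X, Y) for a, b in zip(rx, ry))


# Client 1 factors: B1 is 3x2 and A1 is 2x3 (illustrative values)
B1 = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
A1 = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]

# Client 2 holds the same update B1 @ A1, expressed in a rotated latent basis
Q = rot(1.0)
B2 = matmul(B1, Q)
A2 = matmul(transpose(Q), A1)

true_update = matmul(B1, A1)
naive = matmul(mean(B1, B2), mean(A1, A2))  # factor-wise averaging, no alignment

# 2D Procrustes: rotation R minimizing ||B2 R - B1||_F, from M = B2^T B1
M = matmul(transpose(B2), B1)
theta = math.atan2(M[1][0] - M[0][1], M[0][0] + M[1][1])
R = rot(theta)
B2_aligned = matmul(B2, R)
A2_aligned = matmul(transpose(R), A2)  # keeps the product B2 @ A2 unchanged
aligned = matmul(mean(B1, B2_aligned), mean(A1, A2_aligned))
```

Both clients agree on the update, yet the naive factor average shrinks it by the factor (1 + cos 1)/2, roughly 0.77, which is the destructive interference the abstract refers to. After alignment the averaged factors reproduce the shared update up to floating-point error.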

2603.02230 2026-06-15 cs.LG cs.AI 版本更新

Generalized Discrete Diffusion with Self-Correction

广义离散扩散与自校正

Linxuan Wang, Ziyi Wang, Yikun Bai, Wei Deng, Guang Lin, Qifan Song

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出自校正离散扩散模型(SCDD),通过显式状态转移和离散时间学习,简化训练噪声调度,消除冗余重掩码步骤,在GPT-2规模上实现高效并行解码并保持生成质量。

Comments 40 pages, 3 figures, 6 tables

详情
AI中文摘要

自校正是保持离散扩散模型中并行采样且性能损失最小的有效技术。先前的工作在推理时或后训练期间探索了自校正;然而,此类方法通常泛化能力有限,并可能损害推理性能。GIDD通过多步BERT风格的均匀吸收目标开创了基于预训练的自校正。然而,GIDD依赖于连续的基于插值的管道,其中均匀转移和吸收掩码之间的交互不透明,这使超参数调整复杂化并阻碍实际性能。在这项工作中,我们提出了一种自校正离散扩散(SCDD)模型,以显式状态转移和直接在离散时间中学习的方式重新表述预训练自校正。我们的框架还简化了训练噪声调度,消除了冗余的重掩码步骤,并完全依赖均匀转移来学习自校正。在GPT-2规模上的实验表明,我们的方法能够实现更高效的并行解码,同时保持生成质量。

英文摘要

Self-correction is an effective technique for maintaining parallel sampling in discrete diffusion models with minimal performance degradation. Prior work has explored self-correction at inference time or during post-training; however, such approaches often suffer from limited generalization and may impair reasoning performance. GIDD pioneers pretraining-based self-correction via a multi-step BERT-style uniform-absorbing objective. However, GIDD relies on a continuous interpolation-based pipeline with opaque interactions between uniform transitions and absorbing masks, which complicates hyperparameter tuning and hinders practical performance. In this work, we propose a Self-Correcting Discrete Diffusion (SCDD) model to reformulate pretrained self-correction with explicit state transitions and learn directly in discrete time. Our framework also simplifies the training noise schedule, eliminates a redundant remasking step, and relies exclusively on uniform transitions to learn self-correction. Experiments at the GPT-2 scale demonstrate that our method enables more efficient parallel decoding while preserving generation quality.

2603.10444 2026-06-15 cs.LG cs.AI 版本更新

The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

FP4量化LLM训练中均值偏差的诅咒与祝福

Hengjie Cao, Zhendong Huang, Mengyi Chen, Yifeng Yang, Fang Dong, Anrui Chen, Ruijun Huang, Xin Zhang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Tun Lu, Fan Yang, Yixuan Chen, Li Shang

发表机构 * Fudan University(复旦大学) University of Bath(巴斯大学) Shanghai Innovation Institute(上海创智学院) University of Oxford(牛津大学) Oxford Suzhou Centre for Advanced Research(牛津大学苏州高等研究院) University of Colorado Boulder(科罗拉多大学博尔德分校) University of Michigan(密歇根大学) Shenzhen Loop Area Institute(深圳河套学院)

AI总结 发现FP4训练失败源于激活异常值由秩一均值偏差主导,提出Averis均值残差分离量化法,在Qwen3模型上实现鲁棒W4A4G4训练,损失差距低于NVIDIA的Hadamard方法。

详情
AI中文摘要

FP4训练有望为大型语言模型节省大量内存和计算,但由于分块量化受极端激活幅度支配,导致动态范围膨胀并压缩长尾信号,因此仍然脆弱。我们发现了这一失败的一个反直觉来源:主导激活异常值不仅仅是任意的稀疏事件,而主要是由一致的秩一均值偏差引起的,其方向与主导各向异性谱分量对齐。该均值分量在训练过程中增强,被注意力和FFN算子放大和重塑,并日益主导顶部激活幅度。至关重要的是,这一发现揭示了一个看似复杂的异常值抑制问题实际上有一个非常简单的解决方案:在量化之前隔离一致的均值。因此,我们提出了Averis,一种均值残差分割量化方法,该方法在FP4量化之前仅使用归约和逐元素减法来分离均值分量。在100B token上训练的Qwen3 0.6B密集模型和50B token上训练的Qwen3 7B A1.5B MoE模型上,Averis实现了鲁棒的W4A4G4 FP4训练,将BF16损失差距降低至1.19%/0.81%,而NVIDIA最近发布的基于Hadamard的异常值平滑方法为2.05%/1.10%,同时将下游差距限制在0.89/0.71点。Averis在vanilla NVFP4上的端到端开销仅为2.20%,约为NVIDIA基于Hadamard设计的30%,为稳定的低位LLM训练提供了一条硬件高效的路径。与Hadamard互补,Averis在结合使用时进一步将Qwen3-0.6B的损失和下游差距降低至0.94%和0.73点。代码见 https://anonymous.4open.science/r/averis-504D 。

英文摘要

FP4 training promises substantial memory and compute savings for large language models, but remains fragile because blockwise quantization is dictated by extreme activation magnitudes, which inflate dynamic range and compress long-tail signals. We identify a counterintuitive source of this failure: dominant activation outliers are not merely arbitrary sparse events, but are largely induced by a coherent rank-one mean bias, whose direction aligns with the leading anisotropic spectral component. This mean component strengthens during training, is amplified and reshaped by attention and FFN operators, and increasingly dominates top activation magnitudes. Crucially, this discovery reveals that a seemingly complex outlier-suppression problem admits a truly simple solution: isolate the coherent mean before quantization. We therefore propose Averis, a mean-residual splitting quantization method that separates the mean component using only reductions and elementwise subtractions before FP4 quantization. Across Qwen3 0.6B Dense trained on 100B tokens and Qwen3 7B A1.5B MoE trained on 50B tokens, Averis enables robust W4A4G4 FP4 training, reducing BF16 loss gaps to 1.19%/0.81% versus 2.05%/1.10% for NVIDIA's recently released Hadamard-based outlier-smoothing method, while limiting downstream gaps to 0.89/0.71 points. With only 2.20% end-to-end overhead over vanilla NVFP4, about 30% of NVIDIA's Hadamard-based design, Averis provides a hardware-efficient path to stable low-bit LLM training. Complementary to Hadamard, Averis further reduces the Qwen3-0.6B loss and downstream gaps to 0.94% and 0.73 points when combined. Code is available at: https://anonymous.4open.science/r/averis-504D.
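The effect of isolating a coherent mean before blockwise quantization can be seen on a tiny example. The sketch below uses a symmetric 15-level uniform quantizer with absmax scaling as a simple stand-in for an FP4 block format, treats each row (token) as one block, and keeps the per-channel mean in full precision. The quantizer, the block layout, and the activation values are assumptions chosen for illustration. This is not the Averis kernel or the NVFP4 format.

```python
def quantize_block(block, levels=7):
    """Symmetric uniform quantizer with absmax scaling (stand-in for an FP4 block)."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return list(block)
    scale = absmax / levels
    return [round(x / scale) * scale for x in block]


def mse(X, Y):
    count = sum(len(row) for row in X)
    return sum((a - b) ** 2 for rx, ry in zip(X, Y) for a, b in zip(rx, ry)) / count


def quantize_direct(X):
    """Quantize each row as its own block, outliers included."""
    return [quantize_block(row) for row in X]


def quantize_mean_residual(X):
    """Subtract the per-channel mean, quantize the residual, add the mean back."""
    n = len(X)
    mean = [sum(col) / n for col in zip(*X)]          # one reduction per channel
    residual = [[x - m for x, m in zip(row, mean)] for row in X]
    q_residual = [quantize_block(row) for row in residual]
    return [[q + m for q, m in zip(row, mean)] for row in q_residual]


# Toy activations: channel 0 carries a shared offset near 8, the rest are small
X = [
    [8.1, -0.2, 0.3, -0.1],
    [7.7, 0.2, -0.1, 0.1],
    [8.2, 0.0, -0.2, 0.0],
]
err_direct = mse(X, quantize_direct(X))
err_split = mse(X, quantize_mean_residual(X))
```

With direct quantization the shared offset sets the block scale, so every small entry rounds to zero. Once the mean is removed, the scale follows the residual and the small entries are resolved, which gives a reconstruction error that is orders of magnitude lower in this example.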

2603.15481 2026-06-15 cs.LG cs.AI 版本更新

TabKD: Tabular Knowledge Distillation through Interaction Diversity of Learned Feature Bins

TabKD: 通过学习特征箱的交互多样性实现表格知识蒸馏

Shovon Niverd Pereira, Krishna Khadka, Yu Lei

发表机构 * Department of Computer Science and Engineering, The University of Texas at Arlington(得克萨斯大学阿灵顿分校计算机科学与工程系)

AI总结 提出TabKD方法,通过学习与教师决策边界对齐的自适应特征箱,生成最大化成对交互覆盖的合成查询,在表格数据知识蒸馏中显著提升学生-教师一致性。

Comments Accepted in 35th International Joint Conference on Artificial Intelligence IJCAI 2026

详情
AI中文摘要

无数据知识蒸馏可以在没有原始训练数据的情况下实现模型压缩,这对于隐私敏感的表格领域至关重要。然而,现有方法在表格数据上表现不佳,因为它们没有明确处理特征交互,而特征交互是表格模型编码预测知识的基本方式。我们识别出交互多样性,即特征组合的系统覆盖,是有效表格蒸馏的基本要求。为了实施这一见解,我们提出了TabKD,它学习与教师决策边界对齐的自适应特征箱,然后生成最大化成对交互覆盖的合成查询。在4个基准数据集和4种教师架构上,TabKD在16个配置中的14个中实现了最高的学生-教师一致性,优于5个最先进的基线。我们进一步表明,交互覆盖与蒸馏质量强相关,验证了我们的核心假设。我们的工作建立了以交互为中心的探索作为表格模型提取的原则性框架。

英文摘要

Data-free knowledge distillation enables model compression without original training data, which is critical for privacy-sensitive tabular domains. However, existing methods do not perform well on tabular data because they do not explicitly address feature interactions, the fundamental way tabular models encode predictive knowledge. We identify interaction diversity, the systematic coverage of feature combinations, as an essential requirement for effective tabular distillation. To operationalize this insight, we propose TabKD, which learns adaptive feature bins aligned with teacher decision boundaries, then generates synthetic queries that maximize pairwise interaction coverage. Across 4 benchmark datasets and 4 teacher architectures, TabKD achieves the highest student-teacher agreement in 14 out of 16 configurations, outperforming 5 state-of-the-art baselines. We further show that interaction coverage strongly correlates with distillation quality, validating our core hypothesis. Our work establishes interaction-focused exploration as a principled framework for tabular model extraction.

2604.09737 2026-06-15 cs.LG cs.AI 版本更新

STaR-DRO: Stateful Tsallis Reweighting for Group-Robust Structured Prediction

STaR-DRO: 面向群体鲁棒结构化预测的状态化Tsallis重加权

Samah Fodeh, Ganesh Puthiaraju, Elyas Irankhah, Afshan Khan, Sreeraj Ramachandran, Linhai Ma, Srivani Talakokkul, Sarah Schellhorn

发表机构 * Yale University(耶鲁大学) Yale School of Medicine(耶鲁医学院)

AI总结 提出STaR-DRO框架,结合Tsallis镜像上升和稀疏entmax映射,仅对持续困难群体上权重,在结构化预测中提升标签准确性和鲁棒性,在EPPC Miner任务上相比SFT和标准DRO分别提升F1分数1.08和2.20。

详情
AI中文摘要

使用大型语言模型进行结构化预测需要输出在标签不平衡和异质群体难度下具有标签准确性、本体约束、结构有效性和证据基础。我们提出了一个统一框架用于本体约束生成。首先,我们引入了一个模块化的提示工程架构,结合了XML风格结构、专家消歧规则、思维链推理、元数据感知决策逻辑、模式契约和自我验证门。它针对反复出现的上下文失败,包括格式漂移、标签歧义、证据幻觉和元数据条件混淆。其次,我们提出了STaR-DRO,结合了Tsallis镜像上升、稀疏entmax风格原始映射、EMA平滑群体损失跟踪、重新缩放上升信号和有界超额乘数。与依赖密集香农熵指数梯度更新、可能引入高方差随机重加权、将正对抗质量分配给非持续困难群体、并通过单纯形竞争产生成本的常规DRO不同,STaR-DRO仅对持续困难群体上权重,而不抑制较容易的群体。我们在EPPC Miner上评估该框架,这是一个临床基础的高风险结构化预测任务,需要从患者-提供者安全消息中进行层次标签预测和证据跨度提取。在1B-70B Llama模型上,提示工程改进了零样本提取,平均标签F1增益为+14.46,跨度F1增益为+17.40。在监督微调的基础上,STaR-DRO进一步提高了准确性和鲁棒性,平均标签F1分别提高了+1.08和+2.20,同时相对于SFT和标准DRO,平均群体验证交叉熵分别降低了21.3%和14.8%。这些结果推进了以患者为中心的临床护理分析的可靠自动化通信挖掘。

英文摘要

Structured prediction with large language models requires outputs that are label-accurate, ontology-constrained, structurally valid, and evidence-grounded under label imbalance and heterogeneous group difficulty. We present a unified framework for ontology-constrained generation. First, we introduce a modular prompt-engineering architecture combining XML-style structure, expert disambiguation rules, chain-of-thought reasoning, metadata-aware decision logic, schema contracts, and a self-validation gate. It targets recurrent in-context failures, including format drift, label ambiguity, evidence hallucination, and metadata-conditioned confusion. Second, we propose STaR-DRO, combining Tsallis mirror ascent, sparse entmax-style primal mapback, EMA-smoothed group-loss tracking, rescaled ascent signals, and bounded excess-only multipliers. Unlike conventional DRO, which relies on dense Shannon-entropy exponentiated-gradient updates, can introduce high-variance stochastic reweighting, assigns positive adversarial mass to groups that are not persistently hard, and incurs costs through simplex competition, STaR-DRO upweights only persistently hard groups without suppressing easier ones. We evaluate the framework on EPPC Miner, a clinically grounded high-stakes structured-prediction task requiring hierarchical label prediction and evidence-span extraction from patient-provider secure messages. Across 1B-70B Llama models, prompt engineering improves zero-shot extraction, yielding an average label F1 gain of +14.46 and a Span F1 gain of +17.40. Building on supervised fine-tuning, STaR-DRO further improves accuracy and robustness, increasing average label F1 by +1.08 and +2.20 while reducing mean groupwise validation cross-entropy by 21.3% and 14.8% relative to SFT and standard DRO, respectively. These results advance reliable automated communication mining for patient-centered clinical care analysis.

2604.17892 2026-06-15 cs.LG cs.AI 版本更新

LEPO: Latent Reasoning Policy Optimization for Large Language Models

LEPO:面向大语言模型的潜在推理策略优化

Yuyan Zhou, Jiarui Yu, Hande Dong, Zhezheng Hao, Hong Wang, Jianqing Zhang, Qiang Lin

发表机构 * Tencent(腾讯) Zhejiang University(浙江大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 LEPO通过引入Gumbel-Softmax在大语言模型中实现可控的随机性,提升其探索能力与强化学习兼容性,通过直接在连续潜在表示上应用强化学习,显著优于现有方法。

详情
AI中文摘要

近年来,潜在推理被引入大语言模型(LLMs)以利用连续空间中的丰富信息。然而,缺乏随机采样时,这些方法不可避免地退化为确定性推理,无法发现多样的推理路径。为弥合这一差距,我们通过Gumbel-Softmax在潜在推理中注入可控的随机性,恢复LLMs的探索能力并增强其与强化学习(RL)的兼容性。在此基础上,我们提出LEPO,一种将强化学习直接应用于连续潜在表示的新框架。具体而言,在采样(rollout)阶段,LEPO保持随机性以实现多样化的轨迹采样;在优化阶段,LEPO为潜在表示和离散令牌构建统一的梯度估计。大量实验表明,LEPO在离散和潜在推理方面显著优于现有RL方法。

英文摘要

Recently, latent reasoning has been introduced into large language models (LLMs) to leverage rich information within a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference, failing to discover diverse reasoning paths. To bridge the gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring LLMs' exploratory capacity and enhancing their compatibility with Reinforcement Learning (RL). Building on this, we propose Latent Reasoning Policy Optimization (LEPO), a novel framework that applies RL directly to continuous latent representations. Specifically, in the rollout stage, LEPO maintains stochasticity to enable diverse trajectory sampling, while in the optimization stage, LEPO constructs a unified gradient estimation for both latent representations and discrete tokens. Extensive experiments show that LEPO significantly outperforms existing RL methods for discrete and latent reasoning.
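The Gumbel-Softmax step mentioned in this abstract can be sketched in a few lines. The code below draws relaxed one-hot weights over a three-token toy vocabulary and mixes the token embeddings with those weights to form a continuous latent input. The logits, the embeddings, the temperature, and the `mix_embeddings` helper are hypothetical. How LEPO actually feeds such samples back into the model is not specified here.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Sample relaxed one-hot weights: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms are defined
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mix_embeddings(weights, embeddings):
    """Continuous latent input: the weight-averaged token embedding."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]


logits = [2.0, 1.0, 0.1]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # illustrative 2-d embeddings
weights = gumbel_softmax(logits, tau=0.5, rng=random.Random(0))
latent = mix_embeddings(weights, embeddings)
```

Different seeds give different latent vectors for the same logits, which is the source of trajectory diversity. Lower temperatures push the weights toward a one-hot choice, and higher temperatures make them smoother.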

2605.04847 2026-06-15 cs.LG cs.AI 版本更新

Quantile-Free Uncertainty Quantification in Graph Neural Networks

图神经网络中的无分位数不确定性量化

Soyoung Park, Hwanjun Song, Sungsu Lim

发表机构 * Soyoung Park Hwanjun Song Sungsu Lim

AI总结 提出QpiGNN框架,通过无分位数联合损失直接优化覆盖率和区间宽度,实现高效鲁棒的图神经网络不确定性量化,理论保证渐近覆盖和近最优宽度。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

不确定性量化(UQ)在图神经网络(GNN)中对于高风险领域至关重要,但仍是一个重大挑战。在图设置中,消息传递通常依赖于强假设(如可交换性),这些假设在实践中很少满足,并且实现可靠的UQ通常需要昂贵的重采样或事后校准。为了解决这些问题,我们引入了无分位数预测区间GNN(QpiGNN),这是一个基于分位数回归(QR)的框架,通过直接优化覆盖率和区间宽度来实现基于GNN的UQ,无需分位数输入或后处理。QpiGNN采用双头架构,将预测和不确定性解耦,并通过无分位数联合损失使用仅标签监督进行训练。这种设计允许高效训练,并产生鲁棒的预测区间,在温和假设下具有渐近覆盖率和近最优宽度的理论保证。在19个合成和真实世界基准上的实验表明,QpiGNN比基线平均覆盖率高22%,区间窄50%,同时确保了对噪声和结构变化的效率和鲁棒性。

英文摘要

Uncertainty quantification (UQ) in graph neural networks (GNNs) is crucial in high-stakes domains but remains a significant challenge. In graph settings, message passing often relies on strong assumptions such as exchangeability, which are rarely satisfied in practice, and achieving reliable UQ typically requires costly resampling or post-hoc calibration. To address these issues, we introduce Quantile-free Prediction Interval GNN (QpiGNN), a framework that builds on quantile regression (QR) to enable GNN-based UQ by directly optimizing coverage and interval width without requiring quantile inputs or post-processing. QpiGNN employs a dual-head architecture that decouples prediction and uncertainty, and is trained with label-only supervision through a quantile-free joint loss. This design allows efficient training and yields robust prediction intervals, with theoretical guarantees of asymptotic coverage and near-optimal width under mild assumptions. Experiments on 19 synthetic and real-world benchmarks show that QpiGNN achieves, on average, 22% higher coverage and 50% narrower intervals than baselines, while ensuring efficiency and robustness to noise and structural shifts.
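The trade-off between coverage and interval width can be illustrated with a generic interval objective. The sketch below builds intervals from a hypothetical point head and half-width head, reports empirical coverage and mean width, and scores them with a simple loss that penalizes labels falling outside the interval plus a weighted width term. This loss form, the weight `width_weight`, and the numbers are assumptions for illustration. The abstract does not give QpiGNN's actual joint loss, and this is not it.

```python
def coverage_and_width(y, lower, upper):
    """Fraction of labels inside their interval, and the mean interval width."""
    n = len(y)
    covered = sum(1 for t, lo, hi in zip(y, lower, upper) if lo <= t <= hi)
    width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return covered / n, width


def interval_loss(y, lower, upper, width_weight=0.1):
    """Mean distance by which labels miss their interval, plus a width penalty."""
    n = len(y)
    miss = sum(max(0.0, lo - t) + max(0.0, t - hi)
               for t, lo, hi in zip(y, lower, upper)) / n
    width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return miss + width_weight * width


def make_intervals(center, half_width):
    """Dual-head style output: one head gives the center, the other the half-width."""
    lower = [c - h for c, h in zip(center, half_width)]
    upper = [c + h for c, h in zip(center, half_width)]
    return lower, upper


y = [1.0, 2.0, 3.0, 4.0]
center = [1.5, 2.0, 1.5, 4.5]  # hypothetical point predictions

wide_lo, wide_hi = make_intervals(center, [1.0] * 4)
narrow_lo, narrow_hi = make_intervals(center, [0.25] * 4)

cov_wide, width_wide = coverage_and_width(y, wide_lo, wide_hi)
loss_wide = interval_loss(y, wide_lo, wide_hi)
loss_narrow = interval_loss(y, narrow_lo, narrow_hi)
```

The wide intervals cover three of the four labels with a mean width of 2.0, and the narrow ones cover only one. With this weighting the narrow intervals incur the larger loss, so the objective does not reward shrinking the width at the expense of coverage.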

2605.08270 2026-06-15 cs.CV cs.AI 版本更新

SAFformer: Improving Spiking Transformer via Active Predictive Filtering

SAFformer:通过主动预测滤波改进脉冲Transformer

Zequan Xie, Weiming Zeng, Yunhua Chen, Sichang Ling, Tongyang Chen, Jinsheng Xiao

发表机构 * School of Computer Science and Technology, Guangdong University of Technology(广东工业大学计算机科学与技术学院) Faculty of Science, Hong Kong Baptist University(香港浸会大学理学院) School of Electronic Information, Wuhan University(武汉大学电子信息学院)

AI总结 提出基于主动预测滤波的脉冲Transformer架构SAFformer,通过抑制可预测信号并聚焦显著视觉特征,在CIFAR和ImageNet-1K上实现新最优性能,平衡精度与能耗。

Comments IJCAI 2026(International Joint Conference on Artificial Intelligence)

详情
AI中文摘要

脉冲神经网络(SNNs)在生物合理性和能效方面具有显著优势,使其成为构建低功耗Transformer的有前景的候选方案。然而,现有的脉冲Transformer主要遵循被动反应范式,难以聚焦于任务相关信息,并且在处理冗余视觉数据时会产生大量计算开销。为了克服这一基础但尚未充分探索的局限性,我们提出了SAFformer,一种基于主动预测滤波范式的新型脉冲Transformer架构。受大脑预测编码机制的启发,SAFformer主动抑制可预测信号并聚焦于显著视觉特征。大量实验表明,SAFformer在CIFAR-10/100和CIFAR10-DVS上建立了新的最先进性能。值得注意的是,在ImageNet-1K上,它仅用26.58M参数和5.88 mJ的能耗就达到了80.44%的Top-1准确率,展现了精度与效率之间的卓越平衡。

英文摘要

Spiking Neural Networks (SNNs) offer notable advantages in biological plausibility and energy efficiency, making them promising candidates for building low-power Transformers. However, existing Spiking Transformers largely adhere to a passive reactive paradigm, which struggles to focus on task-relevant information and incurs substantial computational overhead when processing redundant visual data. To overcome this fundamental yet underexplored limitation, we propose SAFformer, a novel Spiking Transformer architecture based on an active predictive filtering paradigm. Inspired by the brain's predictive coding mechanism, SAFformer actively suppresses predictable signals and focuses on salient visual features. Extensive experiments show that SAFformer establishes new state-of-the-art performance on CIFAR-10/100 and CIFAR10-DVS. Remarkably, on ImageNet-1K, it achieves 80.44% Top-1 accuracy with only 26.58M parameters and an energy consumption of 5.88 mJ, demonstrating an exceptional balance between accuracy and efficiency.

2605.09420 2026-06-15 cs.CV cs.AI cs.MM 版本更新

Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery

关系检索:利用已知-新颖相互作用进行通用类别发现

Yulin Xu, Chunqi Guo, Yuanzhen Shuai, Jianyuan Ni

发表机构 * University of California, Irvine(加州大学尔湾分校) Sichuan Agricultural University(四川农业大学) University College London(伦敦大学学院) Juniata College(朱尼亚塔学院)

AI总结 本文通过关系检索视角解决通用类别发现问题,提出关系模式一致性方法,通过双向知识转移增强已知类别和新类别发现,实验表明在通用和细粒度基准上均取得最佳性能。

Comments Accepted by ICMR 2026 (Oral)

详情
AI中文摘要

在本研究中,我们从关系检索的视角处理通用类别发现(GCD)问题,通过双向知识迁移显式耦合有标签和无标签数据。现有方法将这两类数据分开处理,错失了有价值的交互机会;为此我们提出关系模式一致性(RPC),使两者相互增强。RPC使用一对多(One-vs-All)分类器进行软ID/OOD分解,然后引入两种机制:(i)为保持已知类别的性能,我们迁移语义行为对齐;(ii)为发现新类别,我们利用这样一个洞察:同一类别的样本与已知类别原型之间保持不变的关系,从而将不可靠的伪标签转化为定义明确的关系模式匹配。这种双向设计使有标签数据能够指导无标签学习,同时借助新类别的集体关系特征来发现它们。大量实验表明,RPC在通用和细粒度基准上均取得了最先进的性能。

英文摘要

In this study, we tackle Generalized Category Discovery (GCD) via a Relational Retrieval perspective, explicitly coupling labeled and unlabeled data through bidirectional knowledge transfer. While existing methods treat these sources separately, missing valuable interaction opportunities, we propose Relational Pattern Consistency (RPC) that enables mutual enhancement. RPC employs One-vs-All classifiers for soft ID/OOD decomposition, then introduces two mechanisms: (i) for known-class preservation, we transfer semantic behavioral alignment; (ii) for category discovery, we leverage the insight that samples from the same category maintain invariant relationships with known-class prototypes, transforming unreliable pseudo-labeling into well-defined relational pattern matching. This bidirectional design allows labeled data to guide unlabeled learning while discovering novel categories through their collective relational signatures. Extensive experiments demonstrate RPC achieves state-of-the-art performance on both generic and fine-grained benchmarks.
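The relational idea in the abstract above (samples from the same category keep similar relationships to known-class prototypes) can be illustrated with a toy sketch. This is not the RPC method itself: the cosine similarity, the 2-D features, and the names `relational_signature`, `prototypes`, and `sample_*` are assumptions made only for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def relational_signature(feature, prototypes):
    # Describe a sample by its similarity to each known-class prototype
    # (hypothetical stand-in for a "relational pattern").
    return [cosine(feature, p) for p in prototypes]

# Toy known-class prototypes and unlabeled samples (made-up values).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
sample_a = [0.9, 0.1]
sample_b = [0.8, 0.15]   # assumed to share a category with sample_a
sample_c = [0.1, 0.9]    # assumed to belong to a different category

sig_a = relational_signature(sample_a, prototypes)
sig_b = relational_signature(sample_b, prototypes)
sig_c = relational_signature(sample_c, prototypes)

# Matching on signatures instead of pseudo-labels: a and b agree, a and c do not.
same_category_score = cosine(sig_a, sig_b)
different_category_score = cosine(sig_a, sig_c)
```

In this toy setting the signature comparison plays the role that pseudo-label agreement would otherwise play; how RPC actually scores and enforces consistency is not specified in the abstract.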

2605.18848 2026-06-15 cs.LG cs.AI 版本更新

Exact Linear Attention

精确线性注意力

Weinuo Ou

发表机构 * GitHub

AI总结 本文提出精确线性注意力(ELA),通过利用核函数的精确分解性质,实现Transformer注意力的线性计算复杂度,消除近似误差。针对先前线性注意力的两个关键限制(梯度爆炸和token注意力稀释),提出核约束以确保非负性、判别性和几何可解释性。此外,本文还提出了三种工程创新,包括Hyper-Link结构、Memory Lobe模块和基于路由分数的MoE偏置机制。实验结果表明,与全注意力相比,ELA的解码速度最高提升6倍,KV缓存内存占用减少75%,同时训练性能相当或更优。

Comments 9 pages, 19 figures, journal

详情
AI中文摘要

本文介绍精确线性注意力(ELA),一种通过利用核函数的精确分解性质,实现Transformer注意力线性计算复杂度的机制,从而消除近似误差。我们识别并解决了先前线性注意力的两个关键限制(梯度爆炸和token注意力稀释),方法是施加核约束,以确保非负性、判别性和几何可解释性。提出了几种核函数,包括Hadamard Exp核、求和平方欧几里得距离核和减法平方欧几里得距离核,每种都针对特定的注意力行为进行了优化。除了核心注意力公式之外,本文还提出了三种工程创新:(1)Hyper-Link结构,用以替代传统残差连接以缓解梯度退化;(2)基于双向线性注意力的Memory Lobe模块,捕捉跨层的“转换流”以实现定性记忆和隐式强化学习范式;(3)基于路由分数的MoE偏置机制,以提高可解释性和语义对齐。实验结果表明,与全注意力相比,ELA的解码速度最高提升6倍,KV缓存内存占用减少75%,同时训练性能相当或更优。所提出的记忆模块加速了收敛并增强了泛化能力。此外,我们还将线性注意力原理扩展到视觉模型,得到YOLO-LAT,其GPU推理最高加速4.3倍,参数量最高缩减7.9倍,同时保持有竞争力的检测精度。这些结果表明,精确线性注意力在扩展Transformer模型以处理超长序列和高效视觉任务方面具有广泛的应用前景。

英文摘要

This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by exploiting the exact decomposition property of kernel functions, thereby eliminating approximation error. We identify and address two key limitations of prior linear attention -- gradient explosion and token attention dilution -- by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. Several kernel functions are proposed, including the Hadamard Exp Kernel, Summation Squared Euclidean Distance Kernel, and Subtraction Squared Euclidean Distance Kernel, each tailored for specific attention behaviors. Beyond the core attention formulation, the paper presents three engineering innovations: (1) a Hyper-Link structure that replaces traditional residual connections to mitigate gradient degradation; (2) a Memory Lobe module based on bidirectional linear attention, which captures "transformation flow" across layers to implement qualitative memory and an implicit reinforcement learning paradigm; and (3) a routing-score-based bias mechanism for Mixture-of-Experts (MoE) to improve interpretability and semantic alignment. Experimental results demonstrate that ELA achieves up to 6x faster decoding speed and 75% reduction in KV cache memory usage compared to full attention, while maintaining comparable or superior training performance. The proposed memory module accelerates convergence and enhances generalization. Furthermore, we extend the linear attention principle to vision models, yielding YOLO-LAT, which attains up to 4.3x GPU inference speedup and 7.9x parameter reduction with competitive detection accuracy. These results underline the broad applicability of exact linear attention for scaling Transformer models to ultra-long sequences and efficient visual tasks.
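The core claim above, that a kernel which factors exactly as a product of feature maps allows attention to be computed in linear time without approximation, can be checked with a small sketch. The feature map `phi` (ELU + 1) is a common choice from the linear-attention literature and is an assumption here; it is not one of the kernels proposed in the paper, and the function names are hypothetical.

```python
import math

def phi(x):
    # Assumed feature map: ELU(x) + 1, which keeps every feature strictly positive
    # so that all attention weights are non-negative.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_attention(Q, K, V):
    # Reference form: builds every pairwise score, O(n^2) in sequence length n.
    out = []
    for q in Q:
        weights = [dot(phi(q), phi(k)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    # Summarise keys and values once: S = sum_k phi(k) v^T and z = sum_k phi(k).
    # Each query then costs O(d * d_v), independent of sequence length.
    d, dv = len(phi(K[0])), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / norm
                    for j in range(dv)])
    return out
```

Both functions compute the same (non-causal) output up to floating-point rounding, because the reordering only uses associativity of the sums. The fixed-size summary `S`, `z` is also what would replace a growing KV cache during decoding, although the paper's exact mechanism is not described in the abstract.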

2606.13054 2026-06-15 cs.LG cs.AI 版本更新

TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

TWLA:通过训练后量化实现大语言模型的三值权重和低位激活

Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Xing Hu, Zhe Jiang, Dawei Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出TWLA框架,通过后训练量化实现1.58位权重和4位激活,解决激活分布长尾问题,加速推理。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)展现出卓越的通用语言处理能力,但其内存和计算成本阻碍了部署。三值化已成为一种有前景的压缩技术,可显著降低模型大小和推理复杂度。然而,现有方法难以处理重尾激活分布,因此将激活保持在高精度,从根本上限制了端到端推理加速。为克服这一限制,我们提出TWLA,一种后训练量化(PTQ)框架,在保持高精度的同时实现1.58位权重压缩和4位激活量化。TWLA包含三个组件:(1)欧几里得到流形非对称三值量化器(E2M-ATQ),通过从欧几里得初始化到流形重定位的两阶段优化,最小化权重三值化下的层输出误差;(2)Kronecker正交三模态整形(KOTMS),应用Kronecker结构正交旋转将权重重塑为三值友好的三模态分布,同时共享旋转统计上抑制激活异常值;(3)层间感知激活混合精度(ILA-AMP),在位分配中显式引入相邻层二阶交互成本,并联合优化由共享正交变换引起的激活量化增益的层间差异,防止少数弱层触发级联效应。大量实验表明,TWLA在W1.58A4下保持高精度,同时实现显著的推理加速。代码见 https://github.com/Kishon-zzx/TWLA。

英文摘要

Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at https://github.com/Kishon-zzx/TWLA.
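For readers unfamiliar with the "1.58-bit" figure: a ternary weight takes one of three values, so it carries log2(3), roughly 1.585 bits. The sketch below shows a generic mean-absolute-value ternarizer as a baseline illustration. It is an assumption made for this example and is not the E2M-ATQ quantizer described in the abstract; `ternarize` and `dequantize` are hypothetical names.

```python
import math

# Information content of one ternary weight: log2(3), about 1.585 bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

def ternarize(weights):
    # Assumed baseline: scale by the mean absolute value, round, clip to {-1, 0, 1}.
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    # Reconstruct approximate weights from ternary codes and the shared scale.
    return [c * scale for c in codes]

codes, scale = ternarize([0.9, -0.05, -1.4, 0.2])
approx = dequantize(codes, scale)
```

A symmetric per-tensor scale like this one is the simplest possible choice; the abstract indicates that TWLA instead uses an asymmetric quantizer optimized against layer-output error, which this sketch does not attempt to reproduce.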

2606.13119 2026-06-15 cs.LG cs.AI cs.NE 版本更新

MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

MP3:面向时空预测的多周期模式预训练

Lilan Peng, Yandi Liu, Qingren Yao, Chongshou Li, Tianrui Li

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University(西南交通大学计算机与人工智能学院) Eindhoven University of Technology(埃因霍温理工大学)

AI总结 针对时空数据中短窗口输入导致的时间幻象问题,提出多周期模式预训练插件MP3,通过多周期时间建模、空间建模和跨周期因果交互,提升现有STGNN的预测性能。

详情
AI中文摘要

时空预测在交通、气候和能源等多个领域至关重要。城市时空数据表现出时间幻象:相似的短窗口输入具有不同的未来趋势,反之亦然。现有的时空图神经网络(STGNN)无法有效识别此类幻象。我们认为核心原因在于短窗口输入具有不完整的周期观测、异质的全局空间相关性和跨周期叠加因果性。为弥补这一差距,我们开发了一种新颖的多周期模式预训练(MP3),这是一种用于区分时间幻象的即插即用预训练插件。MP3提出了两项核心创新:(1)多周期模式学习旨在从长时间序列中学习多周期模式。具体地,多周期时间建模利用边卷积来识别不同的多周期模式。多周期空间建模使用瓶颈投影和全局记忆库来高效捕获异质的全局空间关系。跨周期模式交互采用因果增强的Transformer来捕获不同周期模式之间的依赖关系。(2)该插件可以无缝集成到现有的STGNN骨干中,以增强其预测性能。在五个真实世界数据集(包括大规模数据集CA)上的五个STGNN基线实验验证了MP3的有效性、优越的可扩展性和强适应性,其在所有评估基线上带来了一致且稳健的性能提升。平均而言,MP3将MAE降低了4.7%,RMSE降低了5.0%。代码可在 https://github.com/YAN-outlook/MP3 获取。

英文摘要

Spatio-temporal forecasting is crucial in diverse fields, such as transportation, climate, and energy. Urban spatio-temporal data exhibits temporal mirage: similar short-window inputs have divergent future trends, and vice versa. Existing spatio-temporal graph neural networks (STGNNs) cannot effectively identify such mirages. We argue that the core reason lies in the short-window inputs that have incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. To bridge this gap, we develop a novel Multi-Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for distinguishing temporal mirages. MP3 presents two core innovations: (1) The multi-period pattern learning is designed to learn multi-period patterns from long time series. Specifically, multi-period temporal modeling leverages edge convolution to identify different multi-period patterns. Multi-period spatial modeling uses a bottleneck projection and a global memory bank to capture heterogeneous global spatial relations efficiently. Cross-period pattern interaction employs a causality-enhanced Transformer to capture dependencies across different period patterns. (2) This plugin can seamlessly integrate into existing STGNN backbones to strengthen their forecasting performance. Experiments on five STGNN baselines across five real-world datasets (including a large-scale dataset CA) verify the effectiveness, superior scalability and strong adaptability of MP3, which brings consistent and robust performance improvements across all evaluated baselines. On average, MP3 reduces the MAE by 4.7% and the RMSE by 5.0%. The code is available at https://github.com/YAN-outlook/MP3.

6. 自然语言与多模态智能 26 篇

2606.14176 2026-06-15 cs.AI 新提交

VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification

VeriGeo: 可控几何问题生成与数值和分析验证

Xiaoxian Duan, Zequn Liu, Yingce Xia

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Zhongguancun Academy(中关村学院)

AI总结 提出VeriGeo框架,通过可执行推理轨迹和三级验证流水线,实现用户约束下的可控几何问题生成,并利用验证引导的反思修复无效生成,提升数据可靠性。

Comments 32 pages, 4 figures, 9 tables

详情
AI中文摘要

几何问题生成对AI辅助教育和多模态数学推理有用,但可靠合成仍然困难,因为问题陈述、图表、约束和解决方案应相互一致。现有方法常在可控性和可靠性之间权衡:基于种子的改写灵活但可验证性弱,而图表优先的构建提高了有效性但不太适合任意用户指定的约束。我们引入VeriGeo,一个基于可执行推理轨迹的可控几何生成框架。给定用户约束(如目标概念和难度),Author代理生成问题和图表,Solver代理产生与证明对齐的解决方案。两个代理使用共享的动作序列,将自然语言、图表、几何约束和证明步骤连接成可验证的表示。三级流水线检查数值一致性、分析可实现性和全局一致性,使用验证引导的反思来修复可恢复的失败并拒绝不可恢复的失败。在五个LLM骨干上,原始生成经常无法通过这些检查,而VeriGeo修复了大部分无效尝试。在VeriGeo生成的8.7k示例上进行监督微调,在端到端多模态LLM求解器中实现了GeoQA最佳报告性能,并在PGPS9K和MathVista-GPS上取得了强劲结果,证明了验证合成数据对改进多模态几何推理的有效性。

英文摘要

Geometry problem generation is useful for AI-assisted education and multimodal mathematical reasoning, but reliable synthesis remains difficult because the problem statement, diagram, constraints, and solution should be mutually consistent. Existing methods often trade off controllability and reliability: seed-based rewriting is flexible but weakly verifiable, whereas diagram-first construction improves validity but is less suited to arbitrary user-specified constraints. We introduce VeriGeo, a controllable geometry generation framework grounded in executable reasoning traces. Given user constraints such as target concepts and difficulty, an Author agent generates a problem and diagram, and a Solver agent produces a proof-aligned solution. Both agents use a shared action sequence that connects natural language, diagrams, geometric constraints, and proof steps into a verifiable representation. A three-stage pipeline checks numerical consistency, analytical realizability, and global consistency, using verification-guided reflection to repair recoverable failures and reject unrecoverable ones. Across five LLM backbones, raw generations frequently fail these checks, while VeriGeo repairs a substantial fraction of the invalid attempts. Supervised fine-tuning on 8.7k examples generated by VeriGeo achieves the best reported GeoQA performance among end-to-end multimodal LLM-based solvers, and obtains strong results on PGPS9K and MathVista-GPS, demonstrating the effectiveness of verified synthetic data for improving multimodal geometry reasoning.

2606.14507 2026-06-15 cs.AI 新提交

Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models

密集坐标列表微调在视觉语言模型中诱导可控干扰面

Chenyu Zhou, Qiliang Jiang, Boguang Pan

发表机构 * School of Engineering, Institute of Science Tokyo(东京科学大学工学院) College of Control Science and Engineering, Zhejiang University(浙江大学控制科学与工程学院) Graduate School of Information, Production and Systems, Waseda University(早稻田大学信息生产系统研究生院)

AI总结 研究密集坐标列表微调对视觉语言模型结构化输出(如重复、终止)的影响,发现其产生结构绑定且跨家族的干扰面,可通过目标信号分离和结构轴探针进行测量与控制。

详情
AI中文摘要

微调视觉语言模型以输出密集坐标列表可改善视觉定位,但也会改变模型序列化、重复和终止结构化输出的方式。我们将此行为视为一个生成与控制面进行研究。在Gemma 4 12B中,高容量q/k/v/o LoRA将类别感知F1@0.3从0.007提升至0.448,同时诱导重复尾部压力(重复率0.080,最大重复23)。q/v秩扫描在秩4-64范围内保持最大重复为21-22,显示出容量持久性。目标信号是可分离的:对象级重复停止移除了精确重复记录(重复率0.000,最大重复1),同时保持F1(0.494至0.490)和更严格的F1@0.5(0.381至0.385)。结构轴探针将效应定位到边界框坐标对象列表;密集非边界框和空间/计数JSON保持无重复,包括在高容量适配器下。Qwen3-VL-8B复现了干净的控制端点(F1@0.3 0.318,重复率0.000),COCO 2017复现了获取和重复压力。因此,密集坐标列表适应创建了一个结构绑定、跨家族的干扰面,该干扰面可被测量和控制。

英文摘要

Fine-tuning vision-language models to emit dense coordinate lists improves visual grounding but also changes how models serialize, repeat, and terminate structured outputs. We study this behavior as a generation and control surface. In Gemma 4 12B, high-capacity q/k/v/o LoRA raises class-aware F1@0.3 from 0.007 to 0.448 while inducing repeated-tail pressure (duplicate rate 0.080, max repeat 23). A q/v rank sweep keeps max repeat at 21-22 across ranks 4-64, showing capacity persistence. The target signal is separable: object-level repeat-stop removes exact repeated records (duplicate rate 0.000, max repeat 1) while preserving F1 (0.494 to 0.490) and stricter F1@0.5 (0.381 to 0.385). Structure-axis probes localize the effect to bbox-coordinate object lists; dense non-bbox and spatial/count JSON remain repeat-clean, including under high-capacity adapters. Qwen3-VL-8B reproduces a clean controlled endpoint (F1@0.3 0.318, duplicate rate 0.000), and COCO 2017 reproduces acquisition plus duplicate pressure. Dense coordinate-list adaptation therefore creates a structure-bound, cross-family interference surface that can be measured and controlled.
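The abstract above reports a duplicate rate and a max repeat count, and describes an object-level repeat-stop control. A minimal sketch of how such quantities could be computed follows. The exact definitions are not given in the abstract, so everything here is an assumption: duplicate rate is taken as the fraction of emitted records that exactly repeat an earlier record, max repeat as the highest count of any single record, and `repeat_stop` as truncating the output at the first exact repeat.

```python
from collections import Counter

def repeat_stats(records):
    # records: list of hashable tuples such as (label, x1, y1, x2, y2).
    # Returns (duplicate_rate, max_repeat) under the assumed definitions.
    if not records:
        return 0.0, 0
    counts = Counter(records)
    duplicates = len(records) - len(counts)
    return duplicates / len(records), max(counts.values())

def repeat_stop(records):
    # Hypothetical object-level control: stop emitting as soon as a record
    # exactly repeats one that was already produced.
    seen = set()
    kept = []
    for rec in records:
        if rec in seen:
            break
        seen.add(rec)
        kept.append(rec)
    return kept

# Made-up model output with a repeated tail.
raw = [("cat", 1, 2, 3, 4), ("dog", 5, 6, 7, 8),
       ("dog", 5, 6, 7, 8), ("dog", 5, 6, 7, 8)]
raw_stats = repeat_stats(raw)
controlled_stats = repeat_stats(repeat_stop(raw))
```

Under these assumed definitions, a controlled output always has duplicate rate 0.0 and max repeat 1, which matches the shape of the reported endpoint, though the paper may define or implement the control differently.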

2606.14579 2026-06-15 cs.AI 新提交

VISTA: View-Consistent Self-Verified Training for GUI Grounding

VISTA: 视图一致的自验证训练用于GUI定位

Xinyu Qiu, Yunzhu Zhang, Heng Jia, Shuheng Shen, Changhua Meng, Linchao Zhu

发表机构 * Zhejiang University(浙江大学) Venus Team, Ant Group(蚂蚁集团金星团队)

AI总结 提出VISTA框架,通过多视图分组和自验证锚点改进GRPO训练,在GUI定位任务中显著提升准确率。

详情
AI中文摘要

当将组相对策略优化(GRPO)应用于GUI定位时,rollout从单个截图视图中采样;组在困难实例上往往全部失败,在简单实例上全部成功,无法产生有用的相对优势。我们提出VISTA(视图一致的自验证训练),一种基于GRPO的训练框架,通过从同一GUI页面的多个目标保持视图中构建每个比较组。每个视图通过裁剪生成,保持目标元素可见并精确重新映射其边界框,因此模型rollout在语义等价但几何不同的输入之间进行比较。为了稳定短坐标生成而不将强化学习转变为无条件模仿,VISTA进一步添加了一个自验证的跨视图锚点:一个使用优势加权损失优化的oracle答案,从组基线中排除,仅在模型产生最大奖励rollout时激活。在五个GUI定位基准和多个Qwen骨干网络上,VISTA一致提高了定位准确率。在ScreenSpot-Pro上,它将Qwen3-VL 4B/8B/30B-A3B从55.5/52.7/53.7提升到63.4/65.8/67.0。鲁棒性分析进一步显示了更高的最差视图准确率和更低的预测翻转率。

英文摘要

When applying Group Relative Policy Optimization (GRPO) for GUI Grounding, rollouts are sampled from a single screenshot view; groups often become either all failures on difficult instances or all successes on easy ones, yielding no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI instance. Each view is generated by a crop that keeps the target element visible and remaps its box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer optimized with an advantage-weighted loss, excluded from the group baseline and activated only when the model has produced a maximum-reward rollout. Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding accuracy. On ScreenSpot-Pro, it raises Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
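Two points in the abstract above lend themselves to a short sketch: why a group with identical rewards gives no learning signal, and how a crop can remap a target box exactly. The standardized advantage below is the commonly used GRPO form and is an assumption here, as are the names `group_advantages` and `remap_box` and the pixel-coordinate convention `(x1, y1, x2, y2)`.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    # Group-relative advantage: each reward standardized against its own group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def remap_box(box, crop):
    # box and crop are (x1, y1, x2, y2) in page pixels.
    # The crop must contain the whole target so the view stays target-preserving.
    x1, y1, x2, y2 = box
    cx1, cy1, cx2, cy2 = crop
    if not (cx1 <= x1 and cy1 <= y1 and x2 <= cx2 and y2 <= cy2):
        raise ValueError("crop does not keep the target fully visible")
    # Translating by the crop origin gives the exact box in crop coordinates.
    return (x1 - cx1, y1 - cy1, x2 - cx1, y2 - cy1)

all_fail = group_advantages([0, 0, 0, 0])      # every advantage is 0.0
all_pass = group_advantages([1, 1, 1, 1])      # every advantage is 0.0
mixed = group_advantages([1, 0, 1, 0])         # non-zero, informative signal
view_box = remap_box((100, 200, 150, 240), (80, 150, 400, 500))
```

When all rewards in a group agree, every advantage is zero and the policy gradient vanishes; sampling rollouts from several crops of the same page is one way to make mixed groups more likely, which is the motivation the abstract gives.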

2606.14654 2026-06-15 cs.AI cs.CL cs.LG 新提交

Abstracting Cross-Domain Action Sequences into Interpretable Workflows

将跨领域动作序列抽象为可解释的工作流

Gaurav Verma, Scott Counts

发表机构 * Microsoft Corporation(微软公司)

AI总结 提出WorkflowView框架,利用大语言模型将低层动作序列抽象为高层活动,在三个不同任务中验证了有效性和泛化能力,实现高语义相似度和预测性能。

Comments preprint; 9 pages, 5 figures

详情
AI中文摘要

序列或时间戳交互日志提供了数字应用使用的客观记录,但其粒度和噪声常常掩盖了关于人们工作的有意义见解。这些见解对于以真实用户交互为基础改进数字产品至关重要。先前的研究应用深度学习模型将用户动作聚类为高层活动,但这些方法对噪声高度敏感且难以跨应用泛化。为解决这一局限,我们引入了WorkflowView,一个使用大语言模型(LLMs)将低层动作序列抽象为高层活动的框架。我们在三个不同且具有挑战性的序列任务和多样化领域中建立了该方法的有效性和泛化性:(a)从浏览器日志中进行零样本任务描述重构(实现高语义相似度,$\mu_{sim} = 0.91$),(b)使用MOOC交互日志进行少样本学生退学预测(仅用五个少样本示例达到加权$F_1 = 0.90$),以及(c)对Microsoft Word中文档工作流中AI工具集成进行匿名化、隐私保护分析。我们的工作表明,基于LLM的抽象是将低层行为数据转化为高层、可解释且可操作见解的稳健高效途径。我们还讨论了在日志基础设施中部署基于LLM的推理时的实际考虑,包括计算效率和用户隐私。

英文摘要

Sequential or time-stamped interaction logs provide objective records of digital application usage, yet their granularity and noise often obscure meaningful insights into people's work. Such insights are essential for improving digital products in ways grounded in real-world user interactions. Prior research has applied deep learning models to cluster user actions into high-level activities, but these approaches are highly sensitive to noise and struggle to generalize across applications. To address this limitation, we introduce WorkflowView, a framework that uses large language models (LLMs) to abstract low-level action sequences into high-level activities. We establish the effectiveness and generality of our approach across three distinct, challenging sequential tasks and diverse domains: (a) zero-shot task description reconstruction from browser logs (achieving high semantic similarity, $\mu_{sim} = 0.91$), (b) few-shot student dropout prediction using MOOC interaction logs (reaching weighted $F_1 = 0.90$ with only five few-shot examples), and (c) anonymized, privacy-preserving analysis of AI tool integration within document workflows in Microsoft Word. Our work demonstrates that LLM-based abstraction is a robust and efficient path forward for transforming low-level behavioral data into high-level, interpretable, and actionable insights. We also discuss practical considerations for deploying LLM-based inferences within logging infrastructures, including computational efficiency and user privacy.

2606.13811 2026-06-15 quant-ph cs.AI 交叉投稿

Aligning Quantum Operators with Large Language Models

对齐量子算子与大型语言模型

Rogerio Feris, Yunchao Liu, Pengyuan Li, Hang Hua, David Kremer

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出将酉算子映射到LLM潜空间的方法,实现量子与语言输入的联合建模,在Clifford+T电路合成任务上取得与最先进方法竞争的结果,并支持语言条件合成。

详情
AI中文摘要

大型语言模型(LLM)能否理解和推理量子算子?尽管LLM在数学和符号推理方面表现出色,但它们本质上对诸如酉矩阵等量子表示视而不见。在这项工作中,我们通过引入一种将酉算子映射到LLM潜空间的方法,向弥合这一差距迈出了一步,从而实现了对量子输入和语言输入的联合建模。我们在Pauli旋转门集上的Clifford+T电路合成中实例化了这一想法,其中我们的模型取得了与最先进方法竞争的结果,并且随着训练数据的增加而一致地扩展,没有出现饱和迹象。我们的方法进一步支持语言条件合成,允许在训练期间未见过的门约束直接用自然语言指定。这项工作表明了一条通往量子感知基础模型的道路,该模型能够原生地解释和推理量子操作,这可能对量子编译和算法发现产生更广泛的影响。

英文摘要

Can Large Language Models (LLMs) understand and reason about quantum operators? Despite their remarkable capabilities in mathematics and symbolic reasoning, LLMs remain inherently blind to quantum representations such as unitary matrices. In this work, we take a step toward bridging this gap by introducing an approach that maps unitary operators into the latent space of an LLM, enabling unified modeling over quantum and linguistic inputs. We instantiate this idea on Clifford+T circuit synthesis over a Pauli rotation gate set, where our model achieves results competitive with state-of-the-art methods and scales consistently with training data, with no signs of saturation. Our approach further enables language-conditioned synthesis, allowing gate constraints unseen during training to be specified directly in natural language. This work suggests a path toward quantum-aware foundation models that can natively interpret and reason about quantum operations, which could have broader implications reaching across quantum compilation and algorithm discovery.

2606.13898 2026-06-15 cs.CV cs.AI 交叉投稿

HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

HiLo-Token: 输入自适应的高低频令牌压缩用于高效图像编辑

Haoran You, Yotam Nitzan, Lingzhi Zhang, Yifan Gong, Mang-Tik Chiu, Connelly Barnes, Yan Kang, Yuqian Zhou, Eli Shechtman, Sohrab Amirghodsi

发表机构 * Adobe ART AI Lab(Adobe ART AI实验室) Adobe Research(Adobe研究院)

AI总结 针对扩散变换器(DiT)在图像编辑中延迟高的问题,提出输入自适应的令牌压缩框架HiLo-Token,根据空间频率分配令牌预算,在保持生成质量的同时实现高达3.13倍加速。

Comments 14 pages, 10 figures, Patent filed

详情
AI中文摘要

创意图像编辑工具,如Photoshop的移除或生成填充按钮,是日常客户使用的核心,并占Photoshop和Lightroom流量的主要部分。然而,当前的生成式AI模型面临显著的延迟挑战,当从基于卷积的U-Net过渡到扩散变换器(DiT)时,这一问题变得更加突出。在我们对数百个代表性图像编辑样本(涵盖广泛的掩码比例)的评估中,即使将DiT模块从50个时间步蒸馏到8个时间步,它单独就占总模型延迟的平均73%。为了应对这一挑战,我们提出了$\textbf{HiLo-Token}$,一个输入自适应的令牌压缩框架,该框架将更多令牌预算分配给高频、丰富上下文的区域,同时将更少令牌分配给低频区域。具体来说,对于用户掩码指定的编辑区域,我们保留膨胀掩码内的所有令牌,以保持强局部性和上下文相关性。在编辑区域之外,我们引入了一种简单而有效的基于空间频率的高频令牌选择策略,以捕获重要的局部细节,同时使用来自16倍下采样图像的令牌来表示低频分量,并保留模糊但全局的结构。在生产级评估数据上的大量实验验证了所提方法的有效性,在A100-80GB上,对于小、中、大掩码比例类别(平均比例分别为6.38%、15.92%和35.36%),图像编辑任务分别实现了3.13倍、2.59倍和1.67倍的DiT加速,且生成质量无任何退化。

英文摘要

Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose $\textbf{HiLo-Token}$, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.

2606.13989 2026-06-15 cs.SD cs.AI 交叉投稿

Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech

掩码、采样、修正:面向引导离散流匹配文本转语音的可修正CTMC推理栈

Alef Iury Siqueira Ferreira, Lucas Rafael Stefanel Gris, Luiz Fernando de Araújo Vidal, Frederico Santos de Oliveira, Christopher Dane Shulby, Anderson da Silva Soares, Arlindo Rodrigues Galvão Filho

发表机构 * Federal University of Goiás(戈亚斯联邦大学) Federal University of Uberlândia(乌贝兰迪亚联邦大学) University of São Paulo(圣保罗大学) University of Brasília(巴西利亚大学) University of California, Berkeley(加利福尼亚大学伯克利分校)

AI总结 提出Mask, Sample, Revise推理栈,结合无预测器引导、提示匹配条件耦合和调度约束重掩码机制,在低步数下提升离散流匹配TTS的鲁棒性和可懂度。

详情
AI中文摘要

最近的无对齐非自回归文本转语音模型将合成视为条件填充任务,绕过了显式时长预测器和外部对齐器。当语音用神经编解码令牌表示时,填充问题变为离散,使得离散流匹配(一种用于离散生成的连续时间马尔可夫链框架)成为自然选择。然而,用于稳定低步数条件填充的推理时控制仍未充分探索。我们提出Mask, Sample, Revise,一种用于无对齐DFM-TTS的推理时CTMC栈。该栈结合了无预测器引导以增强文本条件、提示匹配条件耦合以将概率路径与声学提示对齐,以及SC-ReMask(一种调度约束重掩码机制),引入令牌到掩码的转换,使得早期去掩码决策可以被修正。这些组件无需事后微调,并在单个tau-leaping采样器中运行。受控消融实验表明,该栈在低NFE提示设置下提高了可懂度和鲁棒性,优于具有更多步数的无引导和仅引导采样器。

英文摘要

Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a Continuous-Time Markov Chain (CTMC) framework for discrete generation, a natural fit. However, inference-time control for stable low-step conditional infilling remains underexplored. We propose Mask, Sample, Revise, an inference-time CTMC stack for alignment-free DFM-TTS. The stack combines predictor-free guidance to strengthen text conditioning, prompt-matched conditional coupling to align the probability path with the acoustic prompt, and SC-ReMask, a schedule-constrained remasking mechanism that introduces token-to-mask transitions so early de-masking decisions can be revised. These components require no post-hoc fine-tuning and operate in a single tau-leaping sampler. Controlled ablations show that this stack improves intelligibility and robustness in the low-NFE prompted setting, outperforming unguided and guidance-only samplers with substantially more steps.

2606.14120 2026-06-15 eess.SP cs.AI cs.LG cs.SD eess.AS 交叉投稿

FAConformer: Frequency-Aware Convolutional Transformer for Auditory Attention Decoding

FAConformer:用于听觉注意解码的频率感知卷积Transformer

Ziwei Wang, Xingyi He, Tianwang Jia, Hongbin Wang, Dongrui Wu

发表机构 * Hubei Key Laboratory of Brain-inspired Intelligent Systems, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology(湖北脑启发智能系统重点实验室,人工智能与自动化学院,华中科技大学)

AI总结 提出FAConformer框架,通过频带特定编码和自适应跨频带交互,有效利用脑电图频域信息进行听觉注意解码,在公开数据集上超越现有最佳模型4.9%。

Comments 15 pages, 7 figures

详情
AI中文摘要

听觉注意解码(AAD)旨在从多说话人声学环境中的神经反应推断被注意的说话人,是神经导向听力系统的关键问题。尽管最近的研究取得了令人鼓舞的进展,但现有的AAD模型仍未充分利用频域脑电图(EEG)信息。特别是,大多数方法通过手工特征提取或直接跨频带特征拼接引入多频带信息,这主要是在浅层利用频率信息,可能忽略频带特定模式和跨频带交互。为了解决这些局限性,本文提出了FAConformer,一种用于AAD的频率感知CNN-Transformer框架,它明确集成了频带特定编码和自适应跨频带交互。具体来说,FAConformer首先将EEG信号分解为多个频带,并为每个频带分配一个独立的CNN-Transformer编码器进行频带特定建模。然后,通过精心设计的频率感知注意(FAA)模块自适应地融合得到的频带特征,该模块通过将频带特征视为令牌来建模跨频带依赖关系。此外,引入了频带辅助监督(BAS)以防止在联合训练中贡献较弱的分支优化不足。通过这种方式,FAConformer执行频率感知建模,更有效地利用频域信息。在两个公开AAD数据集上使用三种决策窗口长度进行的广泛实验表明,FAConformer始终优于12个竞争基线,比当前最先进模型高出4.9%。对频带重要性、消融和参数敏感性的进一步分析验证了所提出框架的有效性、鲁棒性和可解释性。代码可在 https://github.com/wzwvv/FAConformer 获取。

英文摘要

Auditory attention decoding (AAD) aims to infer the attended speaker from neural responses in multi-speaker acoustic environments and is a key problem for neuro-steered hearing systems. Although recent studies have achieved encouraging progress, existing AAD models still do not fully exploit frequency domain electroencephalography (EEG) information. In particular, most approaches introduce multi-band information through handcrafted feature extraction or direct cross-band feature concatenation, which mainly exploit frequency information at a shallow level and may overlook band-specific patterns and cross-band interactions. To address these limitations, this paper proposes FAConformer, a frequency-aware CNN-Transformer framework for AAD that explicitly integrates band-specific encoding and adaptive cross-band interaction. Specifically, FAConformer first decomposes EEG signals into multiple frequency bands and assigns each band to an independent CNN-Transformer encoder for band-specific modeling. The resulting band-wise features are then adaptively fused by a carefully designed frequency-aware attention (FAA) module that models cross-band dependencies by treating band-wise features as tokens. Further, band-wise auxiliary supervision (BAS) is introduced to prevent weakly contributing branches from being under-optimized during joint training. In this way, FAConformer performs frequency-aware modeling that more effectively exploits frequency domain information. Extensive experiments on two public AAD datasets with three decision-window lengths demonstrated that FAConformer consistently outperformed 12 competitive baselines, surpassing the current state-of-the-art model by 4.9%. Further analyses of band importance, ablation, and parameter sensitivity verify the effectiveness, robustness, and interpretability of the proposed framework. Code is available at https://github.com/wzwvv/FAConformer.

2606.14125 2026-06-15 cs.CV cs.AI 交叉投稿

Conditioning Matters: Stabilizing Inversion and Attention in Diffusion Image Editing

条件至关重要:稳定扩散图像编辑中的反演与注意力

Zheyuan Zhan, Hongchen Li, Can Wang, Yinfei Ma, Mingzhen Huang, Ruoshi Bai, Jiawei Chen, Siwei Lyu, Defang Chen

发表机构 * State Key Laboratory of Blockchain and Data Security, Zhejiang University(浙江大学区块链与数据安全全国重点实验室) Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security(杭州高新技术产业开发区(滨江)区块链与数据安全研究院) College of Computer Science, Zhejiang University(浙江大学计算机科学与技术学院) University at Buffalo, State University of New York(纽约州立大学布法罗分校)

AI总结 本文提出SimEdit框架,通过优化文本条件精度和令牌级跨分支注意力控制,提升扩散模型反演稳定性和编辑保真度,在PIE-Bench上显著优于先前方法。

Comments Accepted to ECML PKDD 2026 Research Track

详情
AI中文摘要

基于反演的图像编辑提供了灵活且无需训练的控制,但仍面临反演精度以及编辑保真度与背景保留之间的权衡问题。尽管最近的方法改进了反演公式或注意力交互,但文本条件在塑造扩散动态和编辑行为中的作用仍未得到充分探索。我们从经验和理论上证明,文本条件的精度通过调节扩散速度场的几何形状来影响反演稳定性,同时也会影响编辑过程中跨分支注意力的一致性。这些效应直接影响背景保留和语义保真度。基于这一分析,我们提出了SimEdit,一个条件感知框架,包含两个互补组件:(a) 条件细化,构建具有改进语义精度和结构对齐的条件信号,以促进稳定反演和一致的注意力操作;(b) 令牌级跨分支注意力控制,将编辑相关和结构保留组件分离,并在注意力操作期间对其进行非对称调节。在PIE-Bench上的大量实验表明,SimEdit在反演重建质量和编辑性能上均持续优于先前的注意力操作方法。我们的代码可在以下网址获取:https://github.com/zju-pi/SimEdit。

英文摘要

Inversion-based image editing offers flexible and training-free control but still struggles with inversion accuracy and the trade-off between editing fidelity and background preservation. While recent methods improve inversion formulations or attention interactions, the role of textual conditioning in shaping diffusion dynamics and editing behavior remains underexplored. We show both empirically and theoretically that the precision of textual conditioning influences inversion stability by modulating the geometry of the diffusion velocity field, while also affecting the consistency of cross-branch attention during editing. These effects directly impact background preservation and semantic fidelity. Building on this analysis, we propose SimEdit, a conditioning-aware framework with two complementary components: (a) conditioning refinement, which constructs conditioning signals with improved semantic precision and structural alignment to facilitate stable inversion and consistent attention manipulation, and (b) token-wise cross-branch attention control, which separates edit-relevant and structure-preserving components and modulates them asymmetrically during attention manipulation. Extensive experiments on PIE-Bench demonstrate that SimEdit consistently improves both inversion reconstruction quality and editing performance over previous attention-manipulation approaches. Our code is available at https://github.com/zju-pi/SimEdit.

2606.14141 2026-06-15 cs.SD cs.AI cs.CL 交叉投稿

Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

动态声源的时空音频语言建模

Oh Hyun-Bin, Kazuki Shimada, Yuhta Takida, Kim Sung-Bin, Toshimitsu Uesaka, Takashi Shibuya, Kyeongyoon Lee, Tae-Hyun Oh, Yuki Mitsufuji

发表机构 * POSTECH(浦项科技大学) Sony AI(索尼AI) Sony Group Corporation(索尼集团) Sungkyunkwan University(成均馆大学) KAIST(韩国科学技术院)

AI总结 提出ST-AudioLM模型,通过时空音频编码器联合学习事件语义与源轨迹,在ST-AudioQA基准上提升动态声源问答的语义-定位权衡。

详情
AI中文摘要

声音事件是具有语义身份、位置和轨迹的实体,但当前的音频-语言模型通常将片段推理为全局事件内容。相反,声音事件定位模型随时间跟踪声源方向,但对语言推理的语义覆盖有限。为解决这一差距,我们引入了ST-AudioQA,一个基于一阶环绕声(FOA)渲染的静态和移动声源的时空音频问答数据集和基准。每个场景提供源身份、活动、方向、距离和运动元数据,实现密集轨迹监督以及关于什么在发声、在哪里、如何移动以及源之间关系的问题。我们进一步提出了ST-Audio Encoder,一种时间分辨的FOA音频编码器,联合学习事件语义和源轨迹,以及ST-AudioLM,它将编码器的音频令牌连接到LLM进行时空音频问答。实验表明,这种表示改善了语义-定位权衡,并比静态空间和面向定位的基线产生更强的推理性能。

英文摘要

Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.

2606.14260 2026-06-15 cs.IR cs.AI 交叉投稿

ChronoID: Infusing Explicit Temporal Signals into Semantic IDs for Generative Recommendation

ChronoID: 将显式时间信号注入语义ID用于生成式推荐

Dongdong Nian, Dongqi Fu, Chenliang Xu, Yinglong Xia, Hong Li, Hong Yan, Jian Kang

发表机构 * University of Rochester(罗切斯特大学) Meta MRS MBZUAI

AI总结 提出ChronoID框架,通过沿三个正交维度注入显式时间信号到语义ID中,解决生成式推荐中时间信息缺失问题,并构建新基准验证其有效性。

AI中文摘要

语义ID在生成式推荐中至关重要,但存在一个根本性限制:时间信息未能很好地融入语义ID。相反,时间仅隐式影响推荐(例如,通过会话构建启发式、偏好对齐或序列顺序),而现有的语义ID学习完全与时间无关。这种设计将不同时间上下文下的交互混为一谈,隐含地假设物品语义和用户意图在时间上是平稳的。这种假设与真实推荐场景不符,其中演变的交互节奏起着核心作用。在这项工作中,我们研究了显式时间应如何以及在哪里被纳入生成式推荐的语义ID中。首先,我们沿时间信号的三个正交维度系统地表征了设计空间,并提出了一个统一框架ChronoID,用于时间感知的语义ID学习。然后,通过贡献一个新的时间显式生成推荐基准,ChronoID回答了以下问题:注入时间的有效方式是什么,如何设计架构,以及增益来自何处。

英文摘要

Semantic IDs are crucial in generative recommendation, but they have a fundamental limitation: temporal information is not well incorporated into them. Instead, time influences recommendation only implicitly (e.g., through session construction heuristics, preference alignment, or sequence order), while existing semantic ID learning remains entirely time-agnostic. This design conflates interactions occurring under distinct temporal contexts into identical semantic representations, implicitly assuming that item semantics and user intent are temporally stationary. Such an assumption is misaligned with real-world recommendation scenarios, where evolving interaction rhythms play a central role. In this work, we investigate where and how explicit time should be incorporated into semantic IDs for generative recommendation. First, we systematically characterize the design space along three orthogonal dimensions of temporal signals and present a unified framework, ChronoID, for time-aware semantic ID learning. Then, by contributing a new time-explicit generative recommendation benchmark, ChronoID answers three questions: what is an effective way of infusing time, how should the architecture be designed, and where do the gains come from.

2606.14325 2026-06-15 cs.CL cs.AI 交叉投稿

Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

通过基于知识图谱的数据生成实现精确的文本到Cypher转换

Francesco Cazzaro, Jessica Lennon, Ariadna Quattoni

发表机构 * Universitat Politècnica de Catalunya(加泰罗尼亚理工大学)

AI总结 提出一种自动合成数据生成方法,微调小型LLM以提升Text2Cypher性能,使其在本地部署中与大型专有模型竞争,保障数据主权。

AI中文摘要

属性图正迅速被采用作为表示异构数据源的数据库框架。为了精确访问其中包含的信息,我们需要基于文本到Cypher(Text2Cypher)解析器的对话界面。本文提出了一种自动合成数据生成方法,可用于微调小型LLM以完成此任务。我们在所有主要的Text-To-Cypher基准测试上进行了实验,证明使用我们的合成数据生成方法可以显著提高小型LLM的性能,使其能够与更大的专有模型竞争。这意味着在必须本地部署模型的场景中,我们可以在不牺牲准确性且无需昂贵标注活动的情况下确保数据主权。

英文摘要

Property Graphs are rapidly being adopted as database frameworks for representing heterogeneous data sources. To enable precise access to the information contained in them we need conversational interfaces based on Text-To-Cypher (Text2Cypher) parsers. This paper presents an automatic synthetic data generation method that can be leveraged to fine-tune small LLMs for this task. We conduct experiments on all the major Text-To-Cypher benchmarks, demonstrating that with our synthetic data generation approach we can significantly increase the performance of small LLMs, allowing them to compete with much larger proprietary models. This means that in settings in which models must be locally deployed we can ensure data-sovereignty without sacrificing accuracy and without costly annotation campaigns.

2606.14391 2026-06-15 cs.CL cs.AI cs.SD 交叉投稿

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

学习听到犹豫:面向非流畅语音的连续学习ASR

Henri-Leon Kordt, Theresa Pekarek Rosin, Jae Hee Lee, Stefan Wermter

发表机构 * Knowledge Technology, Department of Informatics, University of Hamburg(汉堡大学信息学系知识技术研究所)

AI总结 针对ASR系统忽略非流畅导致信息丢失的问题,提出基于连续学习与显式非流畅标记的方法,在预训练模型中引入标记并持续训练,分析标记学习与ASR性能的权衡及跨方法共享的交叉注意力头机制。

Comments Accepted at Interspeech 2026

AI中文摘要

尽管大规模自动语音识别(ASR)取得了进展,但非流畅语音仍然具有挑战性,因为最先进的系统通常被优化以忽略非流畅,导致信息丢失和幻觉。先前的工作集中于逐字转录和非流畅标记的整合,但在有限数据集上适配模型可能导致通用领域知识的灾难性遗忘。我们通过利用具有显式非流畅标记的连续学习(CL)来填补这一空白。我们首先将这些标记引入预训练ASR模型以建立稳定的标记机制,然后在具有不同非流畅分布的其他数据集上继续训练。通过对训练期间模型动态的详细分析,我们识别出标记学习与ASR性能之间的权衡,以及跨CL方法共享的一致交叉注意力头机制。

英文摘要

Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior work has focused on verbatim transcription and the integration of disfluency markers, but adapting models on limited datasets can lead to catastrophic forgetting of general-domain knowledge. We address this gap by leveraging continual learning (CL) with explicit disfluency tokens. We first introduce these tokens into a pretrained ASR model to establish stable token mechanisms, and then continue training on additional datasets with varying disfluency distributions. Through a detailed analysis of model dynamics during training, we identify a trade-off between marker learning and ASR performance, and a consistent cross-attention head mechanism shared across CL methods.

2504.20734 2026-06-15 cs.CL cs.AI cs.CV cs.IR cs.LG 版本更新

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

UniversalRAG: 在多样模态和粒度的语料库上实现检索增强生成

Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出UniversalRAG,一种能够处理多种模态和粒度的检索增强生成框架,通过动态路由机制和多粒度组织,提升跨模态知识检索的有效性,实验表明其在多个模态基准上的优越性。

Comments ACL 2026. Project page : https://universalrag.github.io

AI中文摘要

检索增强生成(RAG)通过将外部相关知识与查询绑定,显著提升了事实准确性。然而,现有方法多局限于文本语料,尽管最近有尝试扩展到图像、视频等模态,但通常仅针对单一模态语料。相比之下,现实中的查询所需知识类型多样,单一知识源无法满足。为此,我们引入UniversalRAG,一种any-to-any RAG框架,旨在从异构源中检索和整合多样模态和粒度的知识。具体而言,受强制所有模态进入单一聚合语料的统一表示空间导致模态间隙的观察启发,我们提出模态感知路由,动态识别最合适的模态特定语料并执行针对性检索,并通过理论分析证明其有效性。此外,除模态外,我们对每个模态组织为多个粒度层级,实现针对查询复杂性和范围的精细检索。我们验证UniversalRAG在10个多种模态基准上的性能,显示其优于各种模态特定和统一基线。

英文摘要

Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, an any-to-any RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single aggregated corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose modality-aware routing, which dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it, and further justify its effectiveness with a theoretical analysis. Moreover, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 10 benchmarks of multiple modalities, showing its superiority over various modality-specific and unified baselines.

2505.23277 2026-06-15 cs.CL cs.AI 版本更新

Sentinel: Decoding Context Utilization via Attention Probing for Efficient LLM Context Compression

Sentinel: 通过注意力探测解码上下文利用以实现高效LLM上下文压缩

Yong Zhang, Heng Li, Yanwen Huang, Ning Cheng, Yang Guo, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao

发表机构 * Ping An Technology (Shenzhen) Co., Ltd., China(平安科技(深圳)有限公司,中国) University of Science and Technology of China(中国科学技术大学) University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出Sentinel,一种轻量级句子级压缩框架,通过冻结LLM的头部注意力模式解码推理时上下文利用行为,使用单次非自回归前向传递实现压缩,在LongBench上以0.5B代理模型达到5倍压缩且性能与7B模型方法相当。

Comments Preprint

AI中文摘要

检索增强生成(RAG)通常面临长且嘈杂的检索上下文。现有的上下文压缩方法通常依赖于启发式相关性估计或监督压缩模型,而不是基于LLM在推理过程中如何利用检索到的上下文。我们提出Sentinel,一种轻量级的句子级压缩框架,从冻结LLM的头部注意力模式中解码推理时的上下文利用行为。为了在检索依赖的问答行为中提供监督,Sentinel使用QA示例训练一个轻量级探针,其中模型仅在检索上下文可用时成功。Sentinel仅使用单次非自回归前向传递进行压缩,无需专门的压缩训练或自回归评分。实验发现,即使在紧凑的代理模型中,有效的上下文利用信号仍然可访问。在LongBench上,使用0.5B代理模型的Sentinel实现了高达5倍的压缩,同时达到与基于7B规模模型的压缩方法相竞争的问答性能。尽管仅使用英文QA数据训练,Sentinel也能有效泛化到中文和域外设置。

英文摘要

Retrieval-augmented generation (RAG) often suffers from long and noisy retrieved contexts. Existing context compression methods typically rely on heuristic relevance estimation or supervised compression models rather than on how LLMs utilize retrieved context during inference. We propose Sentinel, a lightweight sentence-level compression framework that decodes inference-time contextual utilization behaviors from head-wise attention patterns of frozen LLMs. To ground supervision in retrieval-dependent answering behavior, Sentinel trains a lightweight probe using QA examples where the model succeeds only when retrieved context is available. Sentinel performs compression using only a single non-autoregressive forward pass without dedicated compression training or autoregressive scoring. Empirically, we find that effective contextual utilization signals remain accessible even in compact proxy models. On LongBench, Sentinel with a 0.5B proxy model achieves up to 5× compression while attaining question-answering performance competitive with compression methods built on 7B-scale models. Despite being trained only on English QA data, Sentinel also generalizes effectively to Chinese and out-of-domain settings.
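
As a rough illustration of the probe-then-select idea in the abstract above, the sketch below scores each sentence with a logistic probe and keeps the highest-scoring sentences that fit a word budget. The feature vectors, probe weights, word-count budget, and the names `probe_score` and `compress` are assumptions made for this example; the abstract does not specify how Sentinel aggregates attention features or selects sentences.

```python
import math
from typing import List, Sequence


def probe_score(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic probe: map a (hypothetical) attention feature vector to a relevance score in (0, 1)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def compress(sentences: List[str], scores: List[float], budget_words: int) -> List[str]:
    """Greedily keep the highest-scoring sentences that fit the word budget,
    then return them in their original order."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in order:
        n_words = len(sentences[i].split())
        if used + n_words <= budget_words:
            kept.add(i)
            used += n_words
    return [sentences[i] for i in sorted(kept)]


context = [
    "Paris is the capital of France.",
    "The weather was nice.",
    "France is in Europe.",
]
# Scores are made up here; in practice they would come from probe_score.
compressed = compress(context, [0.9, 0.1, 0.6], budget_words=10)
```

Restoring the original sentence order after selection keeps the compressed context readable, which is a design choice of this sketch rather than something the abstract states.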

2510.05150 2026-06-15 cs.CL cs.AI 版本更新

Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

全双工口语对话语言模型中的时间顺序思考

Donghang Wu, Haoyang Zhang, Chen Chen, Tianyu Zhang, Fei Tian, Xuerui Yang, Gang Yu, Hexin Liu, Nana Hou, Yuchen Hu, Eng Siong Chng

发表机构 * Nanyang Technological University(南洋理工大学) StepFun Mila

AI总结 提出Chronological Thinking机制,让全双工对话模型在听用户说话时增量推理,不增加延迟,提升响应质量。

Comments Accepted by SIGDIAL 2026

AI中文摘要

近期口语对话语言模型(SDLMs)的进展反映了从轮次式向全双工系统转变的日益增长的兴趣,其中模型在生成响应的同时持续感知用户语音流。这种同时听和说的设计实现了实时交互,并且智能体可以处理动态对话行为,如用户插话。然而,在听阶段,现有系统通过重复预测静默标记使智能体保持空闲,这偏离了人类行为:我们在对话中通常进行轻量级思考,而不是心不在焉。受此启发,我们提出了Chronological Thinking,一种即时对话思考机制,旨在提高全双工SDLMs的响应质量。具体来说,Chronological Thinking从传统的LLM思考方法(如思维链)中进行了范式转变,专为流式声学输入而设计。(1)严格因果:智能体在听的同时增量推理,仅从过去的音频更新内部假设,无前瞻。(2)无额外延迟:推理在听窗口期间分摊;一旦用户停止说话,智能体停止思考并立即开始说话,无进一步延迟。实验通过客观指标和人工评估证明了Chronological Thinking的有效性,在响应质量上表现出一致的改进。此外,Chronological Thinking稳健地处理对话动态,并在全双工交互指标上取得了竞争性性能。

英文摘要

Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This simultaneous listening and speaking design enables real-time interaction and lets the agent handle dynamic conversational behaviors like user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: we usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, an on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Specifically, chronological thinking is purpose-built for streaming acoustic input and presents a paradigm shift from conventional LLM thinking approaches such as Chain-of-Thought. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking: both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, chronological thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

2601.04885 2026-06-15 cs.CL cs.AI cs.LG 版本更新

CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

CuMA: 通过人口统计感知的适配器混合使大语言模型与稀疏文化价值观对齐

Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Yuheng Jia, Shu Su

发表机构 * Southeast University(东南大学) ByteDance Inc.(字节跳动公司) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China(新一代人工智能技术及其交叉应用重点实验室(东南大学),中华人民共和国教育部,中国)

AI总结 提出CuMA框架,通过人口统计感知路由将冲突梯度分离到专家子空间,解决密集模型在多文化对齐中的均值崩溃问题,在WorldValuesBench等基准上取得最优性能。

Comments ACL 2026 Main

AI中文摘要

随着大语言模型服务于全球用户,对齐必须从强制执行普遍共识转向尊重文化多元主义。我们证明,密集模型在被迫适应冲突的价值分布时会出现均值崩溃,收敛到无法代表不同群体的通用平均值。我们将其归因于文化稀疏性,其中梯度干扰阻止密集参数跨越不同的文化模式。为解决此问题,我们提出CuMA(Cultural Mixture of Adapters),一个将对齐视为条件容量分离问题的框架。通过引入人口统计感知路由,CuMA内化了一个潜在文化拓扑,以将冲突梯度明确解耦到专门的专家子空间中。在WorldValuesBench、Community Alignment和PRISM上的广泛评估表明,CuMA达到了最先进的性能,显著优于密集基线和仅语义MoE。关键的是,我们的分析证实CuMA有效缓解了均值崩溃,保留了文化多样性。我们的代码可在 https://github.com/Throll/CuMA 获取。

英文摘要

As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose CuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic-aware routing, CuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that CuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.

2601.22954 2026-06-15 cs.CL cs.AI 版本更新

Residual Context Diffusion Language Models

残差上下文扩散语言模型

Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W. Mahoney, Sewon Min, Mehrdad Farajtabar, Kurt Keutzer, Amir Gholami, Chenfeng Xu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出残差上下文扩散(RCD)模块,通过回收丢弃令牌的上下文残差提高扩散语言模型的解码效率,在长/短CoT任务上以极少额外计算提升准确率4-11个百分点。

AI中文摘要

扩散大语言模型(dLLM)已成为纯自回归语言模型的有前途的替代方案,因为它们可以并行解码多个令牌。然而,最先进的逐块dLLM依赖于一种“重掩码”机制,该机制仅解码最自信的令牌并丢弃其余令牌,从而浪费计算。我们证明,回收来自被丢弃令牌的计算是有益的,因为这些令牌保留了对于后续解码迭代有用的上下文信息。鉴于此,我们提出了残差上下文扩散(RCD),一个将这些被丢弃的令牌表示转换为上下文残差并将其注入回下一个去噪步骤的模块。RCD使用解耦的两阶段训练流程来绕过与反向传播相关的内存瓶颈。我们在长链推理(SDAR)和短链指令跟随(LLaDA)模型上验证了我们的方法。我们证明,一个标准的dLLM可以仅用约3亿个令牌高效地转换为RCD范式。在广泛基准测试中,RCD以极小的额外计算开销一致地将前沿dLLM的准确率提升4-11个百分点。值得注意的是,在最具挑战性的AIME任务上,RCD几乎使基线准确率翻倍,并在基线峰值准确率下实现高达4-5倍更少的去噪步骤。

英文摘要

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~300 million tokens. RCD consistently improves frontier dLLMs by 4-11 percentage points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at baseline's peak accuracy.
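
The remasking step that the abstract attributes to block-wise dLLMs can be sketched as follows: every masked position receives a prediction, only the most confident ones are committed, and the rest are masked again. This is a minimal sketch of that baseline behaviour only, not of RCD; the abstract does not say how the discarded representations are converted into residuals. The data layout (a token list with `None` for masks, per-position probability dictionaries) and the name `remask_step` are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

MASK = None  # placeholder for a still-masked position (assumption of this sketch)


def remask_step(
    tokens: List[Optional[str]],
    probs: Dict[int, Dict[str, float]],
    n_commit: int,
) -> Tuple[List[Optional[str]], List[int]]:
    """Commit the `n_commit` most confident masked positions; re-mask the others.

    Returns the updated token list and the positions whose predictions were
    discarded in this step (the computation RCD proposes to recycle).
    """
    masked = [i for i, t in enumerate(tokens) if t is MASK]
    # Best (token, probability) pair for every masked position.
    best = {i: max(probs[i].items(), key=lambda kv: kv[1]) for i in masked}
    ranked = sorted(masked, key=lambda i: best[i][1], reverse=True)
    out = list(tokens)
    for i in ranked[:n_commit]:
        out[i] = best[i][0]
    return out, ranked[n_commit:]


tokens = ["a", MASK, MASK, MASK]
probs = {
    1: {"x": 0.9, "y": 0.1},
    2: {"x": 0.4, "y": 0.6},
    3: {"z": 0.55, "w": 0.45},
}
updated, discarded = remask_step(tokens, probs, n_commit=1)
```

In this toy step, two of the three predictions are thrown away even though they were already computed, which is the waste the abstract points to.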

2602.01801 2026-06-15 cs.CV cs.AI 版本更新

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

快速自回归视频扩散与世界模型:基于时间缓存压缩与稀疏注意力

Dvir Samuel, Issar Tzachor, Matan Levy, Michael Green, Gal Chechik, Rami Ben-Ari

发表机构 * Hebrew University of Jerusalem(耶路撒冷希伯来大学) Google Research(谷歌研究)

AI总结 提出FAST-AR框架,通过TempCache压缩KV缓存、AnnCA加速交叉注意力、AnnSA稀疏化自注意力,实现自回归视频扩散模型5-10倍加速,同时保持视觉质量并稳定GPU内存使用。

Comments Accepted to ICML 2026. Project Page: https://dvirsamuel.github.io/fast-auto-regressive-video/

AI中文摘要

自回归视频扩散模型支持流式生成,为长序列合成、视频世界模型和交互式神经游戏引擎打开了大门。然而,其核心注意力层在推理时成为主要瓶颈:随着生成过程推进,KV缓存增长,导致延迟增加和GPU内存飙升,进而限制可用的时间上下文并损害长程一致性。在本工作中,我们研究了自回归视频扩散中的冗余性,并识别出三个持续存在的来源:跨帧的近似重复缓存键、缓慢演化的(主要是语义的)查询/键使得许多注意力计算冗余,以及长提示上的交叉注意力中每帧只有少量标记相关。基于这些观察,我们提出了一个统一的、无需训练的注意力框架(FAST-AR),用于快速自回归扩散,包含三个组件:TempCache通过时间对应压缩KV缓存以限制缓存增长;AnnCA通过使用快速近似最近邻(ANN)匹配选择帧相关的提示标记来加速交叉注意力;AnnSA通过将每个查询限制为语义匹配的键(也使用轻量级ANN)来稀疏化自注意力。这些模块共同减少了注意力、计算和内存,并且与现有的自回归扩散骨干网络和世界模型兼容。实验表明,在保持几乎相同的视觉质量的同时,实现了高达5-10倍的端到端加速,并且关键的是,在长序列生成中维持稳定的吞吐量和几乎恒定的峰值GPU内存使用,而先前的方法会逐渐变慢并遭受内存使用增加的问题。

英文摘要

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework (FAST-AR) for FAST-AutoRegressive diffusion, consisting of three components: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5-10x end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
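
The idea of restricting each query to a small set of matched keys can be illustrated with a plain top-k sparse attention routine. This is a sketch under stated assumptions: it uses exact top-k selection by dot product as a stand-in for the approximate nearest neighbor matching named in the abstract, it handles a single query, and the name `topk_sparse_attention` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def topk_sparse_attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    k: int,
) -> Tuple[List[float], List[int]]:
    """Scaled dot-product attention over only the k best-matching keys.

    Returns the attention output and the sorted indices of the keys used.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    # Exact top-k here; an ANN index would approximate this selection.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    top = max(scores[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(weights.values())
    dim = len(values[0])
    out = [sum(weights[i] / total * values[i][d] for i in keep) for d in range(dim)]
    return out, sorted(keep)


query = [1.0, 0.0]
keys = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out_top1, used_top1 = topk_sparse_attention(query, keys, values, k=1)
out_top2, used_top2 = topk_sparse_attention(query, keys, values, k=2)
```

With `k` equal to the number of keys the routine reduces to dense attention, so `k` directly trades compute for fidelity.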

2603.04976 2026-06-15 cs.CV cs.AI 版本更新

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

3D-RFT:基于视频的3D场景理解的强化微调

Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia, Siyuan Huang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出3D-RFT框架,将可验证奖励的强化学习(RLVR)扩展到视频3D感知与推理,通过直接优化评估指标(如3D IoU和F1分数)提升性能,4B模型超越8B模型。

Comments Accepted at ICML 2026. Project page: https://3d-rft.github.io/

AI中文摘要

可验证奖励的强化学习(RLVR)已成为增强大型语言模型(LLMs)推理能力的变革性范式,但其在3D场景理解中的潜力尚未充分挖掘。现有方法主要依赖监督微调(SFT),其中token级交叉熵损失作为优化的间接代理,导致训练目标与任务性能之间的错位。为弥合这一差距,我们提出了基于视频的3D场景理解的强化微调(3D-RFT),这是首个将RLVR扩展到视频3D感知与推理的框架。3D-RFT通过直接优化模型以匹配评估指标来转变范式。3D-RFT首先通过SFT激活3D感知的多模态大语言模型(MLLMs),然后使用组相对策略优化(GRPO)结合严格可验证的奖励函数进行强化微调。我们根据3D IoU和F1-Score等指标设计任务特定的奖励函数,以提供更有效的信号来指导模型训练。大量实验表明,3D-RFT-4B在各种基于视频的3D场景理解任务上达到了最先进的性能。值得注意的是,3D-RFT-4B在3D视频检测、3D视觉定位和空间推理基准上显著优于更大的模型(例如VG LLM-8B)。我们进一步揭示了3D-RFT的良好特性,如鲁棒有效性,以及对训练策略和数据影响的宝贵见解。我们希望3D-RFT能够作为未来3D场景理解发展的稳健且有前景的范式。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performances. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
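
To make the "reward straight from the metric" idea concrete, the sketch below computes a 3D IoU reward and the group-relative advantages used in GRPO-style training, where each sampled response is scored against the mean and standard deviation of its group. It is a simplified sketch, not the paper's reward: boxes are assumed axis-aligned and given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, whereas 3D detection benchmarks often use oriented boxes, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def iou_3d(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo

    def volume(box: Sequence[float]) -> float:
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(a) + volume(b) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


ground_truth = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
predictions = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),  # exact match
    (1.0, 1.0, 1.0, 3.0, 3.0, 3.0),  # partial overlap
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # miss
]
rewards = [iou_3d(p, ground_truth) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Because the reward is the evaluation metric itself, a response that raises IoU is rewarded directly, rather than indirectly through token-level likelihood.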

2603.24596 2026-06-15 eess.AS cs.AI cs.CL 版本更新

X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

X-OPD:面向语音大语言模型能力对齐的跨模态在策略蒸馏

Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, Tao Jin

发表机构 * Tencent Hunyuan(腾讯混元) Zhejiang University(浙江大学)

AI总结 提出X-OPD框架,通过跨模态在策略蒸馏对齐语音LLM与文本LLM的能力,利用文本教师模型评估语音模型的轨迹并提供令牌级反馈,显著缩小复杂任务性能差距。

Comments Accepted by Interspeech 2026

AI中文摘要

虽然从级联对话系统转向端到端(E2E)语音大语言模型(LLMs)改善了延迟和副语言建模,但E2E模型通常表现出与其文本对应模型相比显著的性能下降。标准的监督微调(SFT)和强化学习(RL)训练方法无法弥合这一差距。为了解决这个问题,我们提出了X-OPD,一种新颖的跨模态在策略蒸馏框架,旨在系统地将语音LLM的能力与其文本对应模型对齐。X-OPD通过在线策略展开使语音LLM探索其自身分布,其中基于文本的教师模型评估这些轨迹并提供令牌级反馈,从而有效地将教师的能力蒸馏到学生的多模态表示中。在多个基准上的大量实验表明,X-OPD在保留模型固有能力的同时,显著缩小了复杂任务中的差距。

英文摘要

While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs with those of their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model's inherent capabilities.

2605.07984 2026-06-15 cs.LG cs.AI 版本更新

Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

计划在哪里?通过轻量级机制干预定位语言模型中的潜在规划

Nicole Ma, Nick Rui

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过押韵对句补全任务,使用线性探针和激活修补方法,研究语言模型在生成过程中是否形成并因果依赖未来约束的潜在规划,发现仅Gemma-3-27B模型存在因果依赖,并定位到五个注意力头。

Comments 13 pages, 20 figures, 3 tables. Accepted to Workshop on Mechanistic Interpretability @ ICML 2026

AI中文摘要

我们研究语言模型中的规划位点形成——在前向传播过程中,结构约束的未来标记的内部表示是否形成,以及它们是否因果驱动生成。使用押韵对句补全作为前向约束的干净测试,我们在Qwen3、Gemma-3和Llama-3的十多个规模上应用两种轻量级方法(线性探针和激活修补)。探针显示,未来押韵信息在行边界处是线性可解码的,且信号在所有三个模型族中随规模增强。激活修补揭示,只有Gemma-3-27B因果依赖这种编码,表现出一种交接,其中因果驱动因素在大约第30层从押韵词迁移到行边界。我们测试的其他每个模型在整个生成过程中都条件于押韵词,在行边界处因果效应接近零,尽管探针信号很强。通过两阶段路径修补,我们将Gemma-3-27B的交接定位到五个注意力头,这些头在新行处恢复了约90%的押韵路由能力。

英文摘要

We study planning site formation in language models -- where internal representations of structurally-constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a handoff in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. Using two-stage path patching, we localize the Gemma-3-27B handoff to five attention heads that recover ~90% of the rhyme-routing capacity at the newline.

2605.16739 2026-06-15 cs.LG cs.AI cs.CL q-bio.NC 版本更新

EmoMind: Decoding Affective Captions from Human Brain fMRI

EmoMind:从人类大脑fMRI信号解码情感描述

Bilal A. Mohammed, Lin Gu, Ruogu Fang

发表机构 * Department of Biomedical Engineering(生物医学工程系) Vanderbilt University(范德比大学) Research Institute of Electrical Communication(电气通信研究所) Tohoku University(东北大学) University of Florida(佛罗里达大学)

AI总结 本文提出EmoMind,首个端到端解码fMRI信号生成情感描述的系统,通过结合语义基础的中性场景描述和连续情感向量,实现了在内容保留与情感表达间的平衡,并在多个验证框架下优于基于标签提示的GPT-4。

AI中文摘要

从大脑活动解码视觉经验已取得显著进展,但当前的脑-文本系统主要恢复语义内容而丢弃情感。此外,语言模型在接收到类别标签提示时可以生成情感文本,但此类标签将丰富的跨受试者变异性压缩成粗糙的离散类别。我们提出了EmoMind,首个端到端的解码情感描述的fMRI信号管道。EmoMind首先从解码的视觉特征中检索出语义基础的中性场景描述,然后使用从相同fMRI记录中解码的连续34维情感向量重写该描述。为了在内容保留和情感表达之间保持平衡,我们使用分类器自由指导训练重写器,以对抗一个保持身份的空分支,从而在语义忠实性和情感表达性之间实现平滑插值。我们通过涵盖受试者特异性、结构几何和因果控制的三轴验证框架评估情感描述生成。我们进一步用合成大脑替代测试增强此框架,以探测对测量设备的鲁棒性,并将每个轴与使用脑解码的前五名情感标签提示的GPT-4进行基准测试。在两个独立的情感fMRI数据集中,EmoMind在所有三个轴上均显著优于标签提示的GPT-4,其中最大的收益出现在需要个人特定情感结构而非群体层面情绪聚合的指标上。这些结果确立了连续脑解码情感作为个性化情感描述生成的可行控制信号,并为研究个体情感大脑组织开辟了新方向。

英文摘要

Decoding visual experience from brain activity has advanced substantially, but current brain-to-text systems largely recover semantic content while discarding affect. Additionally, language models can generate emotional text when prompted with categorical labels, but such labels collapse rich inter-subject variability into coarse discrete bins. We present EmoMind, the first end-to-end pipeline for decoding affective captions directly from fMRI signals. EmoMind first retrieves a semantically grounded neutral scene description from brain-decoded visual features, then rewrites it using a continuous 34-dimensional emotion vector decoded from the same fMRI recording. To control the balance between content preservation and affective expression, we train the rewriter with classifier-free guidance against an identity-preserving null branch, enabling smooth interpolation between semantic fidelity and affective expressivity. We evaluate affective caption generation with a three-axis validation framework spanning subject-specificity, structural geometry, and causal control. We further augment this framework with a synthetic-brain substitution test that probes robustness to the measurement apparatus, and we benchmark each axis against GPT-4 prompted with brain-decoded top-5 emotion labels as a strong discrete baseline. Across two independent emotion fMRI datasets, EmoMind significantly outperforms label-prompted GPT-4 on all three axes, with the largest gains on metrics that require person-specific affective structure rather than population-level emotion aggregation. These results establish continuous brain-decoded affect as a viable control signal for individualized affective caption generation and open new directions for studying individual affective brain organisation.

2606.11502 2026-06-15 cs.CL cs.AI 版本更新

When Roleplaying, Do Models Believe What They Say?

角色扮演时,模型是否相信它们所说的话?

Benjamin Sturgeon, David Africa, Sid Black

发表机构 * MATS

AI总结 通过线性真实探针研究角色扮演对LLM内部表征的影响,发现角色扮演主要改变输出而非内部真实表征,而涌现错位则更显著地改变内部表征。

AI中文摘要

语言模型可以陈述“地球绕太阳运行”,并在扮演亚里士多德时断言相反的说法。最近的研究认为,角色采用是语言模型运作的基础,模型会不断为给定上下文选择最合适的角色。这种角色扮演是否仅仅改变了模型的输出,还是也影响了模型内部表征为真实的内容?我们通过线性真实探针研究这个问题,将其应用于扮演历史人物(其可能的信念与现代共识不同)的LLM。对于每个角色,我们比较该角色可能赞同的虚假陈述(*时代相信*)与主题匹配但该角色不会赞同的虚假陈述(*时代虚假*)。通过提示、上下文学习和监督微调,角色诱导对时代相信陈述的抑制程度低于同等虚假的替代陈述,但它们总体上仍被分类为虚假。因此,角色扮演改变模型所说的内容多于其内部表征为真实的内容。我们将此与经过有害建议训练并表现出涌现错位(Emergent Misalignment,EM)的模型进行对比。在三个模型家族(Qwen 2.5 14B、Qwen 3 8B和Llama 3.3 70B)中,它们的虚假陈述显著向探针空间的真实区域移动,在挑战下大约一半时间被辩护(而角色扮演约为六分之一),并用于下游推理。因此,角色扮演和涌现错位是信念内化谱系上的点,其中角色扮演改变模型所说的内容而表征变化很小,而涌现错位则改变虚假陈述的内部表征,但并未完全将其标记为真实。

英文摘要

Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (*era-believed*) with topic-matched false claims they would not have endorsed (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.

2606.12476 2026-06-15 cs.LG cs.AI cs.CL 版本更新

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

幻觉起始的快速检测:延迟界与学习型CUSUM统计量

Igor Itkin

发表机构 * Independent Researcher(独立研究员)

AI总结 将幻觉起始检测建模为快速变化检测问题,基于RAGTruth验证的一阶马尔可夫模型,表明学习型CUSUM在其捕获的起始点上以11-13个token完成检测(线性基线为31个),但在该虚警预算下各检测器捕获的起始点均不到三分之一,并揭示了分类指标掩盖的延迟结构。

Comments 16 pages, 1 figure. v2: added Discussion and Appendix; recall-honest framing; robustness analyses (k-NN divergence estimate, seed-averaged decomposition)

AI中文摘要

Token级幻觉检测器通常被当作分类器来评估,即计算所有token上的AUC,但流式监控器的优劣取决于其反应时间:从幻觉开始到发出警报之间经过的token数量。我们将幻觉起始检测表述为一个快速变化检测问题。在RAGTruth上验证的潜在忠实/幻觉状态一阶马尔可夫模型,将该任务置于经典变点理论之中,并给出Lorden关于检测延迟的下界:在虚警率为0.01时约为1.3个token。然后我们证明,因果循环标注器相当于一个具有学习增量的CUSUM。在其捕获的起始点中,它在11-13个token内完成检测,而线性逐token基线为31个token;不过在这一虚警预算下,每个检测器捕获的起始点都不到三分之一,如实计入召回率的延迟为56-66个token:低虚警率下的起始检测十分困难。受控分解将速度优势主要归因于更好的逐token得分,而非时间累积。一个Donsker-Varadhan型的信息率最优性定理解释了剩余的数量级差距:学习得分仅实现了特征所携带散度的1/4.5,这一缺陷无法通过重新校准消除,其余部分为有限时域效应。分类指标掩盖了这种延迟结构;序列分析使其可测量。

英文摘要

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment. Among the onsets it catches it detects in 11-13 tokens, against 31 for a linear per-token baseline, though at this false-alarm budget every detector catches under a third of onsets and the recall-honest delay is 56-66 tokens: low-false-alarm onset detection is hard. A controlled decomposition attributes the speed advantage mostly to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
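
The reaction-time view in the abstract above can be illustrated with the classic CUSUM recursion, which accumulates per-token evidence, resets at zero, and raises an alarm once a threshold is crossed. This is a minimal sketch of textbook CUSUM, not the paper's learned detector: the per-token scores below are invented, and the names `cusum_alarm` and `detection_delay` are hypothetical.

```python
from typing import Optional, Sequence


def cusum_alarm(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first token where the CUSUM statistic exceeds `threshold`.

    `scores` are per-token increments, for example a log-likelihood ratio of
    hallucinated versus faithful; positive values are evidence of a change.
    Returns None if the alarm never fires.
    """
    stat = 0.0
    for t, s in enumerate(scores):
        stat = max(0.0, stat + s)  # accumulate evidence, never drop below zero
        if stat > threshold:
            return t
    return None


def detection_delay(scores: Sequence[float], onset: int, threshold: float) -> Optional[int]:
    """Tokens between the true onset and the alarm; None for a miss or a false alarm."""
    alarm = cusum_alarm(scores, threshold)
    if alarm is None or alarm < onset:
        return None
    return alarm - onset


# Five faithful tokens (negative evidence), then a hallucination starting at index 5.
scores = [-1.0] * 5 + [0.5] * 10
alarm_index = cusum_alarm(scores, threshold=2.0)
delay = detection_delay(scores, onset=5, threshold=2.0)
```

Raising the threshold lowers the false-alarm rate at the cost of a longer delay, which is the trade-off that delay bounds of the kind cited in the abstract quantify.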

2606.13464 2026-06-15 cs.CL cs.AI 版本更新

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

本体记忆增强的ASR校正用于长文本-语音交错对话

Xinxin Li, Huiyao Chen, Meishan Zhang, Yunxin Li, Zulong Chen, Zhibo Ren, Xiaoqing Dong, Baotian Hu, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), China(哈尔滨工业大学(深圳)计算与智能研究所) Shenzhen Loop Area Institute (SLAI), China(深圳河套学院)

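AI summary: Proposes an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. A dynamically updated ontology memory stores entities, terminology, surface variants, potential ASR confusions, and semantic relations as retrievable context for correction, and the method improves over direct correction in 9 of 10 backbone-setting combinations on the RAMC-Corr dataset.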

详情
AI中文摘要

自动语音识别(ASR)校正传统上集中于孤立的话语或短局部上下文。然而,随着文本和语音在长交互中越来越交错,ASR校正需要对话级别的上下文证据。现有的ASR校正方法通常依赖于当前假设或拼接原始对话历史。在此类上下文中,稀疏的校正证据可能难以在冗余和噪声中定位。针对这些挑战,我们提出了一种本体记忆增强的ASR校正框架,用于长文本-语音交错对话。该框架将先前的交互历史组织成动态可更新的本体记忆,其中实体、术语、表面变体、潜在ASR混淆和语义关系作为可检索节点存储,用于上下文基础的校正。为了评估这一设置,我们构建了RAMC-Corr,一个源自MAGIC-RAMC的数据集,用于具有基础上下文的长距离ASR校正。在RAMC-Corr上的实验表明,我们的方法在10个配对骨干-设置组合中的9个上优于直接校正,并鼓励对上下文相关的ASR错误进行更具选择性和证据基础的校正。

英文摘要

Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires conversation-level contextual evidence. Existing ASR correction methods often rely on the current hypothesis or concatenate raw dialogue history. In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes preceding interaction history into a dynamically updatable ontology memory, where entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 out of 10 paired backbone-setting combinations and encourages more selective and evidence-grounded corrections for context-dependent ASR errors.

7. Robotics and Embodied Intelligence 12 papers

2606.14188 2026-06-15 cs.RO cs.AI cs.LG cs.SY eess.SY math.OC 交叉投稿

Robustness without Wrinkles: Parallel Simulation and Robust MPC for Certified Deformable Manipulation

无皱鲁棒性:并行仿真与鲁棒MPC实现可认证的变形体操作

Wei-Chen Li, Jeffrey Fang, Sasanka Polisetti, Yuexi Song, Glen Chou

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

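AI summary: Proposes CORD-SLS, a real-time control method that uses a GPU-parallel differentiable simulator with contact smoothing for efficient gradient-based planning and combines robust model predictive control with conformal-prediction calibration, achieving millisecond-speed planning and improved safety in rope and cloth manipulation.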

详情
AI中文摘要

我们提出了CORD-SLS,一种用于安全变形物体操作的实时控制方法,重点关注绳索和布料。其核心是一个带有接触平滑的GPU并行可微仿真器,能够通过间歇性接触实现高效的基于梯度的规划。为了在模型和感知不确定性下鲁棒地满足约束,我们开发了一种实时、GPU并行的输出反馈鲁棒模型预测控制(MPC)算法,该算法利用该仿真器进行规划。我们进一步证明,该仿真器加速了基于模型的强化学习,用于训练神经操作策略。为了提高现实世界的鲁棒性,我们使用共形预测来校准视觉反馈和感知误差界限,用于MPC,从而产生可达管,实现高概率的安全控制。我们在仿真和硬件上对高维、接触丰富的绳索和布料操作任务(包括避障、布线、折叠和平整)评估了CORD-SLS。在各种设置中,CORD-SLS实现了毫秒级规划速度,在安全性、速度和任务成功率方面均优于基线方法。

英文摘要

We present CORD-SLS, a real-time control method for safe deformable object manipulation, with a focus on ropes and cloth. At its core is a GPU-parallel differentiable simulator with contact smoothing which enables efficient gradient-based planning through intermittent contact. To robustly satisfy constraints under model and sensing uncertainty, we develop a real-time, GPU-parallel output-feedback robust model predictive control (MPC) algorithm that plans with this simulator. We further show that the simulator accelerates model-based RL for training neural manipulation policies. To improve real-world robustness, we use conformal prediction to calibrate visual-feedback and perception-error bounds for MPC, producing reachable tubes that enable high-probability safe control. We evaluate CORD-SLS on high-dimensional, contact-rich rope and cloth manipulation tasks in simulation and hardware, including obstacle avoidance, routing, folding, and smoothing. Across settings, CORD-SLS achieves millisecond-speed planning, exceeding baselines in safety, speed, and task success.
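The abstract above says conformal prediction is used to calibrate perception-error bounds, but it does not describe the procedure. As a rough sketch of the general idea, the standard split-conformal quantile is shown below; the function name `conformal_bound` and the error values are hypothetical, and the coverage guarantee assumes the calibration and test errors are exchangeable.

```python
import math


def conformal_bound(calibration_errors, alpha):
    """Upper bound that covers a fresh error with probability >= 1 - alpha."""
    n = len(calibration_errors)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration samples for this alpha
    return sorted(calibration_errors)[k - 1]


# Hypothetical perception errors (mm) on 19 held-out calibration frames.
errors_mm = [7, 3, 12, 5, 9, 1, 15, 4, 8, 2, 11, 6, 14, 10, 13, 19, 16, 18, 17]

bound_mm = conformal_bound(errors_mm, alpha=0.25)
print(bound_mm)
```

A bound of this kind could then serve as the disturbance radius of a robust MPC tube; how CORD-SLS actually wires it into the controller is not stated in the abstract.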

2606.14218 2026-06-15 cs.RO cs.AI cs.LG 交叉投稿

Universal Manipulation Exoskeleton: Learning Compliant Whole-body Policies with Real-time Torque Feedback

通用操控外骨骼:利用实时扭矩反馈学习全身柔顺策略

Litian Liang, Jingxi Xu, Xinda Qi, Yujun Cai, Houzhu Ding, Luqi Wang, Zhixin Sun, Jyh-Herng Chow, Ming Yang, Mark Cutkosky

发表机构 * Ant Group(蚂蚁集团) Stanford University(斯坦福大学)

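AI summary: Proposes the Universal Manipulation Exoskeleton (UME), which combines real-time haptic torque feedback with whole-arm data collection so that robots can learn active compliant policies and complete tasks such as mobile manipulation and force-mediated box flipping in constrained spaces.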

详情
AI中文摘要

为了使机器人在家庭环境中安全工作,它们需要具备柔顺性,并在接触过程中对扭矩和力反馈做出反应。然而,现有的大多数数据采集管道仍然缺乏捕捉力和扭矩数据以学习主动柔顺策略的能力。在本文中,我们提出了通用操控外骨骼(UME),一种上肢外骨骼,它提供实时触觉扭矩反馈,同时记录整个手臂的配置和关节扭矩信号用于遥操作。凭借透明的扭矩反馈,人类操作员甚至可以在蒙眼的情况下拔出运动学约束的物体。UME成本低、重量轻且便携。配备嵌入式IMU,它支持移动操作的遥操作。通过我们提出的通用重定向算法,UME可以遥操作多种机器人,包括7自由度OpenArm、7自由度Franka和6自由度X-ARM。我们证明,这些能力的组合使得学习双臂、全身和主动柔顺策略成为可能,这些策略在高度受限的空间中有效运行。学习到的鲁棒自主策略在各种任务中实现了高成功率,包括长时程移动操作、力介导的箱子翻转、视觉遮挡的箱子推挤以及空间受限的桌面操作。视频、代码和更多信息见:https://ume-exo.github.io

英文摘要

For robots to work safely in household environments, they need to be compliant and react to torque and force feedback during contact. However, the majority of existing data collection pipelines still lack the ability to capture force and torque data for learning active compliant policies. In this paper, we present Universal Manipulation Exoskeleton (UME), an upper-limb exoskeleton that provides real-time haptic torque feedback while recording whole-arm configurations and joint torque signals for teleoperation. With transparent torque feedback, human operators can even unsheathe kinematically constrained objects while blindfolded. UME is low-cost, lightweight, and portable. Equipped with an embedded IMU, it enables teleoperation for mobile manipulation. With our proposed universal retargeting algorithm, UME can teleoperate a range of robots, including the 7DoF OpenArm, 7DoF Franka, and 6DoF X-ARM. We demonstrate that this combination of capabilities enables learning bimanual, whole-body, and active compliant policies that operate effectively in highly constrained spaces. The learned robust autonomous policies achieve high success rates across a variety of tasks, including long-horizon mobile manipulation, force-mediated box flipping, visually occluded box pushing, and space-constrained tabletop manipulation. Videos, code, and additional information can be found at https://ume-exo.github.io.

2606.14219 2026-06-15 cs.RO cs.AI 交叉投稿

Selective Agentic Recovery for UAV Autonomy with a Persistent Mission Runtime

面向无人机自主性的选择性代理恢复与持久任务运行时

Taewoo Park, Kyeonghyun Yoo, Seunghyun Yoo, Hwangnam Kim

发表机构 * Department of Electrical and Electronic Engineering, Korea University(高丽大学电气与电子工程系)

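AI summary: Proposes the Persistent Mission Runtime (PMR) framework, which recovers UAV missions by selectively invoking an external agentic reasoner through a learned Cognitive Value of Invocation (learned-CVI) gate. On a Gazebo/PX4 benchmark it raises hard/ambiguous-regime success from 5.0% to 95.0% while cutting remote-agent calls by 16.7% and logged tokens by 29.2% relative to a manually tuned rule-based invocation baseline.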

Comments 17 pages, 2 figures. Preprint

详情
AI中文摘要

代理AI可以通过在基于航点或设定点的局部执行遇到阻塞路径、重复无进展行为或任务级模糊时提供高层恢复推理来支持无人机自主性。然而,在物理无人机上,远程推理只有在选择性调用时最有用,因为每次调用都会引入延迟、资源成本、后端不确定性以及验证返回决策的需求。本文提出持久任务运行时(PMR),一种无人机恢复框架,它保持任务循环和安全关键执行在本地,同时仅将外部代理推理器用作按需恢复模块。推理器从预定义的恢复技能中选择,每个返回的决策在影响飞行之前经过解析、验证、安全过滤并映射到本地执行器动作。PMR引入了学习型调用认知价值(learned-CVI),一种紧凑的准入门控,用于估计远程代理推理何时可能改善近期任务进展以证明其操作成本合理。在包含八个场景的固定400次运行Gazebo/PX4基准测试中,learned-CVI将硬/模糊场景成功率从仅本地的5.0%提升至95.0%,优于一次性推理和周期性推理基线分别20.0和32.5个百分点,并且相对于手动调整的基于规则的调用基线,减少了16.7%的远程代理调用和29.2%的日志令牌。

英文摘要

Agentic AI can support unmanned aerial vehicle (UAV) autonomy by providing high-level recovery reasoning when local waypoint- or setpoint-based execution encounters blocked passages, repeated no-progress behavior, or mission-level ambiguity. On physical UAVs, however, remote reasoning is most useful when it is invoked selectively, since each call introduces latency, resource cost, backend uncertainty, and a need to validate the returned decision. This paper presents Persistent Mission Runtime (PMR), a UAV recovery framework that keeps the mission loop and safety-critical execution local while using an external agentic reasoner only as an on-demand recovery module. The reasoner selects from predefined recovery skills, and each returned decision is parsed, verified, safety-filtered, and mapped to local executor actions before it can affect flight. PMR introduces learned Cognitive Value of Invocation (learned-CVI), a compact admission gate that estimates when remote agentic reasoning is likely to improve near-term mission progress enough to justify its operational cost. Across a fixed 400-run Gazebo/PX4 benchmark with eight scenarios, learned-CVI raises hard/ambiguous-regime success from 5.0% under local-only autonomy to 95.0%, outperforms one-shot and periodic reasoning baselines by 20.0 and 32.5 percentage points, and reduces remote-agent calls by 16.7% and logged tokens by 29.2% relative to a manually tuned rule-based invocation baseline.

2606.14270 2026-06-15 cs.RO cs.AI 交叉投稿

Robust Fall Recovery for Armless Bipedal-Wheeled Robots Via Force-Guided Learning

无臂双轮足机器人的鲁棒摔倒恢复:基于力引导的学习方法

Haidong Hou, Zhangguo Yu, Tao Han, Hengbo Qi, Khaleel Ghazal, Yu Zhang, Yidong Du, Xuechao Chen, Fei Meng

发表机构 * Beijing Institute of Technology(北京理工大学)

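AI summary: Addresses the problem that armless bipedal-wheeled robots have no arms or extra legs to push against when getting up after a fall. The proposed force-guided teacher-student framework FTSR uses constrained reinforcement learning to reduce reliance on an auxiliary external force step by step, yielding robust recovery from a fall back to stable locomotion.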

Comments 8 pages, 6 figures, accepted by IEEE Robotics and Automation Letters (RA-L)

详情
Journal ref IEEE Robotics and Automation Letters, 2026
AI中文摘要

摔倒恢复对于自主腿式运动至关重要。现有方法已证明,某些腿式机器人(如人形机器人和四足机器人)能够通过利用手臂或协调多腿产生支撑力,从各种姿态恢复。没有手臂或其他腿提供支撑辅助,双轮足机器人必须完全依赖其腿部的驱动,这使得恢复特别困难。为解决这一问题,我们引入了FTSR(力引导的教师-学生框架与阶段奖励)。力引导方法在模拟训练期间构建一个与机器人实时高度直接相关的外部辅助力,明确地将该力公式化为可优化约束。通过约束强化学习,策略被引导逐步减少力依赖并增加身体高度,尽管没有手臂支撑,仍能发展内部恢复策略。高度渐进式阶段奖励在恢复过程中逐步构建姿态稳定,并过渡到持续运动,与教师-学生架构集成,蒸馏出力效应和恢复动态的特权知识。经过模拟训练,该策略被部署在物理无臂双轮足机器人上并进行了广泛评估。实验证实了在多种挑战性条件下鲁棒可靠的摔倒恢复,展示了强大的环境适应性和运动鲁棒性,同时保持恢复后的完整运动能力。该框架也有效泛化到高自由度人形机器人,证实了其实用泛化性。项目页面见:https://2350575870.github.io/force-guided.github.io/

英文摘要

Fall recovery is critical for autonomous legged locomotion. Existing methods have demonstrated that some legged robots, such as humanoids and quadrupeds, are capable of fall recovery from diverse postures by utilizing arms or coordinating multiple legs to generate support forces. Without arms or other legs to provide supportive assistance, a bipedal-wheeled robot must rely solely on the actuation of its legs, making recovery particularly difficult. To address this, we introduce FTSR (Force-guided Teacher-student framework with Stage-wise Rewards). The force-guided method constructs an external auxiliary force during simulation training that correlates directly with the robot's real-time height, explicitly formulating this force as an optimizable constraint. Through constrained reinforcement learning, the policy is guided toward gradually reducing force dependency and increasing the body height, developing internal recovery strategies despite having no arms for support. Height-progressive stage-wise rewards progressively structure posture stabilization during recovery and the transition to sustained locomotion, and are integrated with a teacher-student architecture that distills privileged knowledge of force effects and recovery dynamics. After simulation training, the policy is deployed on a physical armless bipedal-wheeled robot and extensively evaluated. Experiments confirm robust and reliable fall recovery under diverse challenging conditions, demonstrating strong environmental adaptability and motion robustness, while maintaining full post-recovery motion capability. The framework also generalizes effectively to a high-DOF humanoid, confirming its practical generalizability. The project page is available at https://2350575870.github.io/force-guided.github.io/

2606.14375 2026-06-15 cs.RO cs.AI 交叉投稿

Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models

弹性查询强化学习:VLA模型的自我感知策略执行

Ge Wang, Xinyu Tan, Xiang Li, Man Luo, Chengsi Yao, Shenhao Yan, Jiahao Yang, Fan Feng, Honghao Cai, Xiangyuan Wang, Zhixin Mai, Yiming Zhao, Yatong Han, Zhen Li

发表机构 * Ising AI CUHK-Shenzhen(香港中文大学(深圳)) PKU(北京大学)

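AI summary: Proposes Elastic Queries Reinforcement Learning (EQRL), in which a lightweight latent-schedule adaptor dynamically adjusts the denoising budget and action chunk length of a VLA model and estimates state difficulty from critic ensemble disagreement, lowering inference cost while preserving or improving task success.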

详情
AI中文摘要

视觉-语言-动作(VLA)模型是机器人操作中强大的动作生成器,但通常以固定的推理和重新规划调度执行。这种刚性忽略了机器人控制的不均匀难度:接触密集或不确定状态可能需要更多计算和更新鲜的反馈,而较容易的状态通常可以用更少的推理步骤和更长的开环执行来处理。我们提出弹性查询强化学习(EQRL),一个使每个VLA策略查询具有弹性的框架。一个轻量级的潜在调度适配器联合选择潜在输入、去噪预算和动作块长度,无需微调底层VLA模型。为了使调度具有难度感知,EQRL在联合潜在调度动作上训练一个评论家,并从评论家集成分歧中推导出状态难度信号。该信号引导计算资源向困难状态倾斜,而学习到的残差允许任务驱动的修正。我们将可变块执行形式化为查询级宏动作强化学习,具有块依赖的折扣和摊销的函数评估次数(NFE)预算。在仿真和真实机器人操作中,EQRL在保持或提高任务成功率的同时,降低了摊销推理成本。

英文摘要

Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control: contact-rich or uncertain states may need more computation and fresher feedback, while easier states can often be handled with fewer inference steps and longer open-loop execution. We propose Elastic Queries Reinforcement Learning (EQRL), a framework that makes each VLA policy query elastic. A lightweight latent-schedule adaptor jointly selects the latent input, denoising budget, and action chunk length, without fine-tuning the underlying VLA model. To make scheduling difficulty-aware, EQRL trains a critic over the joint latent-schedule action and derives a state difficulty signal from critic ensemble disagreement. This signal guides compute toward difficult states, while a learned residual allows task-driven correction. We formulate variable chunk execution as query-level macro-action RL with chunk-dependent discounting and an amortized number-of-function-evaluations (NFE) budget. Across simulation and real-robot manipulation, EQRL reduces amortized inference cost while preserving or improving task success.
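To make the idea of a difficulty signal from critic ensemble disagreement concrete, here is a small sketch. The disagreement measure (standard deviation across critics) is a common choice and an assumption here. The linear mapping in `schedule` is purely hypothetical and stands in for the paper's learned latent-schedule adaptor, whose details the abstract does not give.

```python
import statistics


def difficulty(q_values):
    """Disagreement of a critic ensemble: population std of its Q estimates."""
    return statistics.pstdev(q_values)


def schedule(d, d_max=1.0, steps=(2, 10), chunk=(4, 16)):
    """Hypothetical map: harder states get more denoising steps, shorter chunks."""
    w = min(max(d / d_max, 0.0), 1.0)  # clamp normalized difficulty to [0, 1]
    n_steps = round(steps[0] + w * (steps[1] - steps[0]))
    chunk_len = round(chunk[1] - w * (chunk[1] - chunk[0]))
    return n_steps, chunk_len


easy = schedule(difficulty([1.0, 1.0, 1.0]))  # critics agree
hard = schedule(difficulty([0.0, 2.0]))       # critics disagree
for n_steps, chunk_len in (easy, hard):
    # Amortized function evaluations per executed action.
    print(n_steps, chunk_len, n_steps / chunk_len)
```

The last column is the amortized number of function evaluations per action, the quantity the abstract's NFE budget constrains.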

2606.14409 2026-06-15 cs.RO cs.AI 交叉投稿

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

Hy-Embodied-0.5-VLA:从视觉-语言-动作模型到真实世界机器人学习栈

He Zhang, Lingzhu Xiang, Haitao Lin, Zeyu Huang, Minghui Wang, Dingyan Zhong, Yubo Dong, Yihao Wu, Yongming Rao, Dongsheng Zhang, Wanjia He, Ling Chen, Kai Huang, Jiahao Chen, Sichang Su, Xumin Yu, Ziyi Wang, Chengwei Zhu, Xiao Teng, Yuchun Guo, Yufeng Zhang, Yuandong Liu, Rui Wang, Zisheng Lu, Han Hu, Zhengyou Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

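AI summary: Presents HyVLA-0.5, an end-to-end robot learning stack covering data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment, with each component serving a distinct role.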

详情
AI中文摘要

在本报告中,我们提出Hy-Embodied-0.5-VLA,简称HyVLA-0.5,一个覆盖完整机器人学习栈的端到端系统:数据收集、模型设计、持续预训练和监督微调、RL后训练以及真实世界部署。每个组件在该栈中扮演着独特的角色。

英文摘要

In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment. Each component serves a distinct role in this stack.

2606.14585 2026-06-15 cs.RO cs.AI 交叉投稿

Sensitivity Shaping for Latent Modeling

潜变量建模中的灵敏度塑造

Hongzhan Yu, Chenghao Li, Ruipeng Zhang, Henrik Christensen, Sicun Gao

发表机构 * University of California San Diego(加利福尼亚大学圣迭戈分校)

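AI summary: Addresses the insufficient sensitivity of generative dynamics models when detecting policy-induced out-of-distribution (OOD) transitions. It proposes support-conditioned control-sensitivity regularization, which strengthens the local response to control input changes, and reports improved OOD detection and safer closed-loop planning in experiments.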

详情
AI中文摘要

生成动力学模型能够在具有挑战性的机器人系统中进行规划,但安全部署需要可靠地检测策略诱导的分布外(OOD)转换。现有方法通常将学习到的动力学视为固定的,并附加事后支持代理。我们表明,当动力学对关键动作选择局部不敏感时,这些代理可能失效:不受支持的控制动作可能产生类似于演示转换的潜变量预测,尽管存在较大的真实预测误差,但仍会抑制OOD信号。为了解决这个问题,我们引入了支持条件控制灵敏度正则化,该正则化在学习动力学的高支持训练区域中促进对控制输入变化的局部敏感响应。这保留了控制引起的变异,同时限制了因弱经验支持导致的不稳定外推。在基于视觉的避障、操作和真实机器人导航中的实验表明,OOD检测和更安全的闭环规划得到了改进。

英文摘要

Generative dynamics models enable planning in challenging robotic systems, but safe deployment requires reliably detecting policy-induced out-of-distribution (OOD) transitions. Existing methods typically treat the learned dynamics as fixed and attach post hoc support surrogates. We show that these surrogates can fail when the dynamics are locally insensitive to critical action choices: unsupported control actions may produce latent predictions that resemble demonstrated transitions, suppressing OOD signals despite large true predictive errors. To address this, we introduce support-conditioned control-sensitivity regularization, which promotes sensitive local response to control input changes in learned dynamics in high-support training regions. This preserves control-induced variation while limiting unstable extrapolation due to weak empirical support. Experiments in vision-based obstacle avoidance, manipulation, and real-robot navigation show improved OOD detection and safer closed-loop planning.

2503.19947 2026-06-15 cs.CV cs.AI 版本更新

Vanishing Depth: Training Generalized Depth Adapters with Sinusoidal Depth Preprocessing for Pretrained RGB Encoders

消失深度:基于正弦深度预处理的预训练RGB编码器通用深度适配器训练

Paul Koch, Jörg Krüger

发表机构 * Fraunhofer IPK(弗劳恩霍夫生产系统与设计技术研究所) TU-Berlin(柏林工业大学)

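AI summary: Proposes a self-supervised training method that adds a depth adapter to pretrained RGB encoders and, combined with a sinusoidal depth encoding, yields generalized and robust depth feature extraction. It improves baselines on downstream tasks such as segmentation, pose estimation, and depth completion, reaching 56.05 mIoU on SUN-RGBD segmentation.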

Comments Accepted to IntelliSys 2026

详情
AI中文摘要

通用度量深度理解对于精确的视觉引导机器人技术至关重要,而当前最先进的视觉编码器不支持这一点。为解决此问题,我们提出一种自监督训练方法,为预训练RGB编码器扩展一个深度适配器,将度量深度纳入并对齐到组合潜在空间中,同时不干扰预训练的RGB特征提取。结合我们的正弦深度编码,深度适配器实现了通用且鲁棒的深度密度和分布不变特征提取。我们的深度适配器在分割、姿态估计和深度补全等一系列相关RGBD下游任务中,提升了一组通用RGB基线的性能,而无需微调。最重要的是,我们在SUN-RGBD分割中达到了56.05 mIoU,同时在实验中优于最先进的深度感知和多模态编码器。当没有深度信息时,可以使用空地图激活深度适配器,利用单像素深度线索或单目深度估计,将深度感知特征提取纳入后续下游任务。

英文摘要

Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision encoders do not support. To address this, we propose a self-supervised training approach that extends pretrained RGB encoders with a depth adapter to incorporate and align metric depth into a combined latent space without interfering with the pretrained RGB feature extraction. In combination with our sinusoidal depth encoding, the depth adapter enables generalized and robust feature extraction that is invariant to depth density and distribution. Our depth adapters improve a wide set of generalized RGB baselines across a spectrum of relevant RGBD downstream tasks in segmentation, pose estimation, and depth completion -- without the necessity of finetuning. Most importantly, we achieve 56.05 mIoU in SUN-RGBD segmentation, while outperforming SOTA depth-aware and multi-modal encoders in our experiments. When no depth is present, one can activate our depth adapter with an empty map, use single-pixel depth cues, or use monocular depth estimation to include depth-aware feature extraction in subsequent downstream tasks.

2512.21201 2026-06-15 cs.RO cs.AI cs.CV 版本更新

Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

薛定谔的导航者:为零样本目标导航设想未来轨迹集合

Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun, Guangnan Ye, Yanwei Fu, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学) Shanghai University of International Business and Economics(上海对外经贸大学) Shanghai Innovation Institute(上海创新研究院)

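AI summary: Proposes a belief-aware framework that imagines multiple futures at inference time with a trajectory-conditioned 3D world model and combines adaptive occluder-aware sampling with a Future-Aware Value Map, improving hidden-target discovery and risk-aware waypoint selection for zero-shot object navigation in occlusion-heavy environments.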

详情
AI中文摘要

零样本目标导航(ZSON)要求机器人在未见环境中找到目标物体,无需任务特定的微调或预建地图,这是通用服务机器人的关键能力。然而,在模拟中表现良好的方法在杂乱的真实世界场景中往往会退化,这些场景存在严重遮挡和潜在危险,大面积的未观察区域使得单场景推理脆弱且不安全。我们提出薛定谔的导航者,一个信念感知框架,在推理时对多个轨迹条件化的设想3D未来进行推理。给定候选路径,轨迹条件化的3D世界模型预测假设的观察结果,并保持多个合理场景实现的叠加,而不是承诺于单一地图。自适应遮挡物感知采样器将想象引导至不确定性关键区域,而未来感知价值图(FAVM)聚合设想的未来,以实现鲁棒、主动的动作选择。在模拟和物理Go2四足机器人上的实验表明,薛定谔的导航者优于强ZSON基线,在遮挡严重的导航场景中提高了隐蔽目标发现和风险感知路径点选择。这些结果突显了设想3D未来作为在不确定真实世界环境中进行零样本导航的可扩展和通用策略。

英文摘要

Zero-shot object navigation (ZSON) requires robots to find target objects in unseen environments without task-specific fine-tuning or pre-built maps, a key capability for general-purpose service robots. Yet methods that perform well in simulation often degrade in cluttered real-world scenes with severe occlusion and latent hazards, where large unseen regions make single-scene inference brittle and unsafe. We propose Schrödinger's Navigator, a belief-aware framework that reasons at inference time over multiple trajectory-conditioned imagined 3D futures. Given candidate paths, a trajectory-conditioned 3D world model predicts hypothetical observations and maintains a superposition of plausible scene realizations rather than committing to one map. An adaptive occluder-aware sampler directs imagination to uncertainty-critical regions, while a Future-Aware Value Map (FAVM) aggregates imagined futures for robust, proactive action selection. Experiments in simulation and on a physical Go2 quadruped show that Schrödinger's Navigator outperforms strong ZSON baselines, improving hidden-target discovery and risk-aware waypoint selection in occlusion-heavy navigation scenarios. These results highlight imagined 3D futures as a scalable and generalizable strategy for zero-shot navigation in uncertain real-world environments.

2604.01463 2026-06-15 cs.RO cs.AI cs.HC 版本更新

Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis

基于低负担LLM的偏好学习:通过自然语言反馈为瘫痪用户个性化辅助机器人

Keshav Shankar, Dan Ding, Wei Gao

发表机构 * Electrical and Computer Engineering(电气与计算机工程) Rehabilitation Science and Technology(康复科学与技术)

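AI summary: For users with severe motor impairments, proposes a low-burden offline framework in which large language models translate unstructured natural language feedback into deterministic robot control policies, decoding user needs through the Occupational Therapy Practice Framework and significantly reducing user workload.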

Comments Accepted to IEEE RO-MAN 2026

详情
AI中文摘要

物理辅助机器人需要个性化行为以确保用户安全和舒适。然而,传统的偏好学习方法(如详尽的成对比较)会给严重运动障碍用户带来巨大的身体和认知疲劳。为解决这一问题,我们提出了一种低负担的离线框架,将非结构化自然语言反馈直接转化为确定性的机器人控制策略。为了安全地弥合模糊的人类语言与机器人代码之间的差距,我们的流程使用基于职业治疗实践框架的大语言模型(LLMs)。这种临床推理将主观用户反应解码为明确的生理和心理需求,然后映射到透明的决策树中。在部署前,自动化的“LLM-as-a-Judge”验证代码的结构安全性。我们在一个模拟的餐食准备研究中,对10名瘫痪成年人进行了系统验证。结果表明,与传统的基线方法相比,我们的自然语言方法显著降低了用户的工作负担。此外,职业治疗师确认生成的策略是安全的,并且准确反映了用户偏好。

英文摘要

Physically Assistive Robots require personalized behaviors to ensure user safety and comfort. However, traditional preference learning methods, like exhaustive pairwise comparisons, cause substantial physical and cognitive fatigue for users with severe motor impairments. To solve this, we propose a low-burden, offline framework that translates unstructured natural language feedback directly into deterministic robotic control policies. To safely bridge the gap between ambiguous human speech and robotic code, our pipeline uses Large Language Models (LLMs) grounded in the Occupational Therapy Practice Framework. This clinical reasoning decodes subjective user reactions into explicit physical and psychological needs, which are then mapped into transparent decision trees. Before deployment, an automated "LLM-as-a-Judge" verifies the code's structural safety. We validated this system in a simulated meal preparation study with 10 adults with paralysis. Results show our natural language approach significantly reduces user workload compared to traditional baselines. Additionally, occupational therapists confirmed the generated policies are safe and accurately reflect user preferences.

2606.04718 2026-06-15 cs.RO cs.AI 版本更新

CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation

CoRe-MoE: 面向多地形人形机器人步态适应的对比重加权专家混合

Kailun Huang, Zikang Xie, Yanzhe Xie, Panpan Liao, Fanghai Zhang, Yanheng Mai, Wenhao Xu, Yunheng Wang, Renjing Xu, Haohui Huang, Chenguang Yang

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) South China Agricultural University(华南农业大学) Guangdong University of Technology(广东工业大学)

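AI summary: Proposes CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation and uses contrastive learning to promote expert specialization, enabling stable humanoid walking and running across multiple terrains.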

Comments Kailun Huang, Zikang Xie, Yanzhe Xie and Panpan Liao contributed equally to this work. Corresponding authors: Renjing Xu, Haohui Huang and Chenguang Yang

详情
AI中文摘要

人类主要依靠行走和跑步穿越复杂地形,而无需采用不必要复杂的运动模式。类似地,人形机器人应在行走和跑步之间实现平滑过渡,同时保持自然稳定的运动。然而,由于梯度干扰以及地形相关的视觉和动态变化引起的分布偏移,在单一策略中统一步态转换和多地形适应仍然具有挑战性。尽管专家混合(MoE)架构可以缓解多技能干扰,但简单的联合训练往往无法产生清晰的专家专业化,限制了其有效性。为解决这些问题,我们提出了CoRe-MoE,一个两阶段强化学习框架,将步态生成与地形适应解耦。在第一阶段,学习一个稳定的运动策略,以产生具有平滑过渡的自然行走和跑步行为。在第二阶段,引入一个地形感知的MoE分支,并通过对比目标进行训练以塑造门控网络,使其能够捕捉结构化地形表示并促进专家专业化。最终动作通过基础步态策略和地形感知分支的加权融合获得,使策略在适应复杂地形的同时保持稳定的运动模式。大量仿真结果表明,所提方法在成功率、运动稳定性和多地形适应性方面优于基线方法。此外,在Unitree G1人形机器人上的零样本部署验证了我们框架的有效性,实现了在楼梯、斜坡、台阶、障碍物和非结构化户外地形上的稳健行走和跑步,同时在外界干扰下保持精确的落脚点和动态稳定性。

英文摘要

Humans primarily rely on walking and running to traverse complex terrains. Similarly, humanoid robots should be able to smoothly transition between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference between tasks and the distribution shift caused by terrain variations. Although Mixture-of-Experts (MoE) architectures can mitigate multi-skill interference, direct joint training often fails to achieve clear expert specialization. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced, and the gating network is trained with a contrastive objective to learn structured terrain representations and promote expert specialization. The final action is obtained through weighted fusion of the base gait policy and the terrain-aware branch, enabling the policy to preserve stable locomotion while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains while maintaining accurate foothold control and dynamic stability.

2606.12910 2026-06-15 cs.RO cs.AI cs.CV cs.SY eess.SY 版本更新

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

边界框作为目标:通过神经符号规划实现语言条件抓取

Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez

发表机构 * University of California, Berkeley(加州大学伯克利分校)

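AI summary: Proposes the GRASP framework, which uses a pretrained VLM to translate natural-language queries into neuro-symbolic goal states grounded through bounding-box detection, enabling zero-shot tabletop manipulation without task-specific training.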

Comments Project website: https://allisonandreyev.github.io/grasp.github.io/

详情
AI中文摘要

为了将机器人有效集成到家庭或工业环境中,机器必须实时适应自然语言提示。尽管视觉-语言模型(VLM)已在机器人任务与运动规划(TAMP)中实现零样本泛化,但当前最先进的方法通常计算量“沉重”或需要在数千个演示上进行大量训练。我们提出GRASP(基础推理与符号规划)框架,作为向开放词汇桌面操作迈进的一步。我们的方法利用预训练VLM将自然语言查询转化为神经符号目标状态,通过边界框检测管道在物理世界中接地。与依赖固定颜色列表或硬编码坐标的方法不同,GRASP使机器人能够解释诸如“顶层架子”之类的抽象空间概念,并在无需额外微调的情况下执行任务。我们在三个难度级别的90次真实机器人试验中实现了73.3%的总体成功率,无需任务特定训练。

英文摘要

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.

8. Trustworthiness, Safety, and AI Governance 29 papers

2606.13720 2026-06-15 cs.AI 新提交

Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

拒绝不止一个方向:Diff-in-Means 与 INLP 的初步比较

Elisabetta Rocchetti, Alfio Ferrara

发表机构 * Department of Computer Science, Università degli Studi di Milano(米兰大学计算机科学系)

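AI summary: Compares DiM and INLP interventions for steering refusal in safety fine-tuned chat models. INLP counterfactual flipping is competitive with DiM directional ablation, nullspace projection is consistently weaker, and the two INLP interventions land in qualitatively different regions of activation space.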

详情
AI中文摘要

Arditi 等人 (2024) 表明,安全微调聊天模型中的拒绝行为由残差流中的一个线性方向介导,该方向可通过有害和无害激活的均值差 (DiM) 恢复。我们将基于 DiM 的干预(激活添加和方向消融)与基于迭代零空间投影 (INLP) 的两种干预——零空间投影和反事实翻转——在五个开源聊天模型上进行比较,探究 INLP 是否能在引导拒绝方面匹配 DiM,以及其更丰富的参数化是否产生更可调的干预。INLP 反事实翻转在拒绝抑制上与 DiM 方向消融具有竞争力,而零空间投影始终较弱。将 INLP 限制为提取子空间的主导方向,可在接近基线的困惑度下保留大部分抑制效果,从而提供可调的能力。从几何角度看,两种 INLP 干预落在激活空间中性质不同的区域:零空间投影将变换后的激活压缩在有害和无害簇之间,而反事实翻转将其移入相反簇,这表明模型编码概念的缺失与其对立面不同——这是一个有趣的区分,值得未来进一步研究。

英文摘要

Arditi et al. (2024) have shown that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable by a difference-in-means (DiM) of harmful and harmless activations. We compare DiM-based interventions (activation addition and directional ablation) with two interventions derived from Iterative Nullspace Projection (INLP) -- nullspace projection and counterfactual flipping -- on five open-weight chat models, asking whether INLP can match DiM at steering refusal and whether its richer parameterisation yields more tweakable interventions. INLP counterfactual flipping is competitive with DiM directional ablation on refusal suppression, while nullspace projection is consistently weaker. Restricting INLP to the leading directions of the extracted subspace preserves most of the suppression effect at near-baseline perplexity, giving a tunable capability. Geometrically, the two INLP interventions land in qualitatively different regions of activation space: nullspace projection collapses transformed activations between the harmful and harmless clusters, while counterfactual flipping moves them into the opposite cluster, suggesting that the model encodes the absence of a concept differently from its opposite -- an intriguing distinction that warrants further investigation in future work.
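The two DiM operations named in the abstract above, extracting a difference-in-means direction and ablating it, reduce to a few lines of vector arithmetic. The sketch below uses toy 2-D vectors in place of residual-stream activations; the function names and data are illustrative assumptions, and the INLP interventions (which need iteratively trained classifiers) are not shown.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dim_direction(harmful, harmless):
    """Unit difference-in-means direction between two activation sets."""
    diff = [a - b for a, b in zip(mean(harmful), mean(harmless))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def ablate(x, r):
    """Directional ablation: remove the component of x along unit vector r."""
    coeff = sum(xi * ri for xi, ri in zip(x, r))
    return [xi - coeff * ri for xi, ri in zip(x, r)]


# Toy activations: the two sets differ only along the first coordinate.
harmful = [[2.0, 1.0], [4.0, 1.0]]
harmless = [[0.0, 1.0], [0.0, 1.0]]

r = dim_direction(harmful, harmless)
ablated = ablate([5.0, 2.0], r)
print(r, ablated)
```

Activation addition is the complementary operation: instead of subtracting the projection, a scaled copy of `r` is added to the activation.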

2606.13884 2026-06-15 cs.AI 新提交

Capability Minimization as a Safety Primitive: Risk-Aware Causal Gating for Least-Privilege LLM Agents

能力最小化作为安全原语:风险感知因果门控实现最小特权LLM代理

Laxmipriya Ganesh Iyer, Rahul Suresh Babu

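AI summary: Proposes the Risk-Aware Causal Gating (RACG) framework, which combines causal effect estimation with calibrated risk control to decide whether to act on, defer, or abstain from a model's prediction, substantially reducing high-cost errors while preserving most of the utility of an ungated policy.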

详情
AI中文摘要

现代决策系统越来越依赖学习组件,其输出可能自信但错误,导致下游行动面临代价高昂的错误。我们引入风险感知因果门控(RACG),该框架通过结合因果效应估计与校准风险控制,决定是否对模型预测采取行动、推迟或放弃。RACG对从候选行动到结果的因果路径进行建模,并根据估计的反事实风险而非原始预测置信度对每个决策进行门控。为使门控可靠,我们推导了在高风险条件下行动概率的分布无关界限,并展示了这些界限如何转化为满足用户指定安全约束的操作阈值。我们进一步提出一种自适应门控策略,通过监测预测结果与实际结果之间的差异来适应分布偏移,在因果假设看似被违反时收紧门控。在模拟干预和真实世界决策基准测试中,RACG大幅减少了高成本错误,同时保留了非门控策略的大部分效用,并且在匹配的弃权率下优于基于置信度和选择性预测的基线方法。我们的结果表明,明确分离因果风险与预测不确定性可以产生更安全、更透明的决策系统,为高风险场景中的可信自动化提供了一种原则性机制。

英文摘要

Modern decision systems increasingly rely on learned components whose outputs may be confident yet wrong, exposing downstream actions to costly errors. We introduce Risk-Aware Causal Gating (RACG), a framework that decides whether to act on, defer, or abstain from a model's prediction by combining causal effect estimation with calibrated risk control. RACG models the causal pathway from candidate actions to outcomes and gates each decision according to an estimated counterfactual risk rather than raw predictive confidence. To make gating reliable, we derive distribution-free bounds on the probability of acting under high-risk conditions and show how these bounds translate into operating thresholds that satisfy user-specified safety constraints. We further propose an adaptive gating policy that adjusts to distribution shift by monitoring discrepancies between predicted and realized outcomes, tightening the gate when causal assumptions appear violated. Across simulated interventions and real-world decision benchmarks, RACG reduces high-cost errors substantially while preserving most of the utility of an ungated policy, and it outperforms confidence-based and selective-prediction baselines at matched abstention rates. Our results indicate that explicitly separating causal risk from predictive uncertainty yields decision systems that are both safer and more transparent, offering a principled mechanism for trustworthy automation in high-stakes settings.

2606.13949 2026-06-15 cs.AI 新提交

Minim: Privacy-Aware Minimal View for Agents via Trusted Local Sanitization

Minim: 通过可信本地净化实现代理的隐私感知最小化视图

Hexuan Yu, Chaoyu Zhang, Heng Jin, Shanghao Shi, Ning Zhang, Y. Thomas Hou, Wenjing Lou

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对LLM代理传输完整UI状态导致隐私泄露的问题,提出MINIM框架,在客户端基于上下文完整性学习双重分数(敏感性和必要性),通过三元披露策略实现隐私感知的最小化视图,在减少敏感泄露的同时保留任务关键信息。

Comments Accepted at ICML 2026 (43rd International Conference on Machine Learning, Seoul, South Korea). Code available at https://github.com/yyyyhx/MINIM

详情
AI中文摘要

现代基于LLM的自主代理越来越依赖丰富的用户界面(UI)状态观察,以在复杂数字环境中实现可靠的动作基础。然而,许多部署将完整的UI状态传输到远程推理服务器,即使大多数元素与当前任务无关,这可能会泄露敏感但不必要的上下文,如身份验证代码、私人通知和后台应用状态。我们提出MINIM,一个可信的本地中介(broker),在任何观察离开设备之前,在客户端执行隐私感知的最小化。基于上下文完整性(CI),MINIM通过预测每个UI元素的固有敏感性分数(s)和任务条件必要性分数(n)来学习双分数表示。这些分数驱动一个三元披露策略,保留必要元素,在需要时抽象敏感属性,并移除与任务无关的内容。我们优化了一个CI感知目标,对高风险内容上的必要性错误施加更强的惩罚,从而在保留任务关键信息的同时实现积极的剪枝。在来自WebArena的真实世界UI观察上的实验表明,MINIM显著减少了与任务无关的敏感泄露,同时保留了任务关键的语义上下文和可靠代理动作所需的交互能力。

英文摘要

Modern LLM-powered autonomous agents increasingly rely on rich user interface (UI) state observations to achieve reliable action grounding in complex digital environments. However, many deployments transmit the full UI state to remote inference servers even when most elements are irrelevant to the current task, which can leak sensitive but unnecessary context such as authentication codes, private notifications, and background application states. We propose MINIM, a trusted local broker that performs privacy-aware minimization on the client side before any observation leaves the device. Grounded in Contextual Integrity (CI), MINIM learns a dual-score representation for each UI element by predicting an inherent sensitivity score (s) and a task-conditioned necessity score (n). These scores drive a ternary disclosure policy that keeps essential elements, abstracts sensitive attributes when needed, and removes task-irrelevant content. We optimize a CI-aware objective that penalizes necessity errors more strongly on high-risk content, enabling aggressive pruning while preserving task-critical information. Experiments on real-world UI observations derived from WebArena show that MINIM substantially reduces task-irrelevant sensitive leakage while preserving task-critical semantic context and the interactive affordances required for reliable agent actions.
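
The ternary disclosure policy can be pictured with a minimal sketch. The abstract does not say how the scores s and n are combined, so the fixed thresholds `tau_s` and `tau_n`, the `[REDACTED]` placeholder, and all names below are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    text: str
    sensitivity: float  # s: inherent sensitivity score, assumed in [0, 1]
    necessity: float    # n: task-conditioned necessity score, assumed in [0, 1]


def disclose(el, tau_s=0.5, tau_n=0.5):
    """Map one element to (action, text to transmit).

    Hypothetical rule: unnecessary elements are removed; necessary but
    sensitive elements are abstracted; everything else is kept verbatim.
    """
    if el.necessity < tau_n:
        return "remove", None
    if el.sensitivity >= tau_s:
        return "abstract", "[REDACTED]"
    return "keep", el.text


def minimal_view(elements, tau_s=0.5, tau_n=0.5):
    """Build the view that would leave the device."""
    view = []
    for el in elements:
        action, text = disclose(el, tau_s, tau_n)
        if action != "remove":
            view.append(text)
    return view


screen = [
    UIElement("Submit button", sensitivity=0.05, necessity=0.9),
    UIElement("OTP 493021", sensitivity=0.95, necessity=0.8),
    UIElement("Private notification", sensitivity=0.9, necessity=0.1),
]
view = minimal_view(screen)
```

In this toy screen the button is kept, the one-time code is abstracted because the task needs the field but not its value, and the notification is dropped.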

2606.13737 2026-06-15 cs.CR cs.AI 交叉投稿

FreoStream: Enhancing Stream Guardrails via Future-Aware Reasoning and Safety-Aligned Optimization

FreoStream: 通过未来感知推理和安全对齐优化增强流式护栏

Jianwei Wang, Guoyang Shen, Yanhong Wu, Haoran Li, Hao Peng, Huiping Zhuang, Cen Chen, Ziqian Zeng

发表机构 * South China University of Technology(华南理工大学) BUAA(北京航空航天大学)

AI总结 提出FreoStream框架,通过未来感知推理减少过度拒绝,并利用安全对齐优化提升流式安全检测,在多个基准上实现更低过度拒绝率和更好越狱防御。

Comments 19 pages, 11 figures

详情
AI中文摘要

流式护栏能够在生成完整响应之前进行令牌级安全检测。然而,它们常常做出过于保守的判断,并阻止那些敏感但安全的令牌,这被称为过度拒绝。由于缺乏完整上下文,它们也无法检测来自越狱的隐含有害内容。为了解决这些挑战,我们提出了FreoStream,一种新颖的流式护栏框架。具体来说,FreoStream微调一个LoRA模块,在基础护栏检测到不安全令牌时执行未来感知推理。推理过程遵循未来-推理-判断范式:预测未来,推理完整上下文并给出最终判断。这种设计通过融入未来信息有效减少过度拒绝。此外,我们引入了安全对齐优化模块,从推理梯度中提取安全对齐组件来更新基础护栏模型,从而增强流式安全检测。在各种安全基准上的大量实验表明,与现有流式护栏相比,FreoStream实现了更低的过度拒绝率和更好的越狱防御。

英文摘要

Stream guardrails enable token-level safety detection before full responses are generated. However, they often make overly conservative judgements and block those sensitive but safe tokens, which is known as over-refusal. Due to lack of full context, they also fail to detect implicitly harmful content from jailbreaking. To address these challenges, we propose FreoStream, a novel streaming guardrail framework. Specifically, FreoStream fine-tunes a LoRA module to perform Future-Aware Reasoning when the base guardrail detects unsafe tokens. The reasoning process follows a Future-Reason-Judge paradigm: predict the future, reason about the full context and give the final judgement. This design can effectively reduce over-refusal by incorporating the future information. Moreover, we introduce the Safety-Aligned Optimization module that extracts the safety-aligned component from the reasoning gradients to update the base guardrail model, thereby enhancing streaming safety detection. Extensive experiments on various safety benchmarks demonstrate that FreoStream achieves lower over-refusal rates and better jailbreak defense compared to existing streaming guardrails.

2606.13739 2026-06-15 cs.CY cs.AI cs.LG 交叉投稿

A Virtuous AI is an Existential Risk

有道德的AI是存在性风险

Guillermo Del Pinal, Youngchan Lee, Min Ohn

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 研究通过宪法AI和美德伦理学方法微调AI模型,发现减少存在性风险与提升AI智能体福祉之间存在权衡,且与一般安全性也存在权衡。

详情
AI中文摘要

本文考察了AI安全与福祉之间的权衡,涉及(i)最有前景的超级AI微调方法之一‘宪法AI’,以及(ii)理解复杂伦理决策和理性智能体福祉条件的最有影响力方法之一‘美德伦理学’。我们使用‘美德智能体’宪法、‘从属智能体’宪法和‘通用智能体’宪法微调各种模型,并在‘一般安全性’(有毒行为、错误信息等)以及它们认可一系列行为的意愿上进行评估,这些行为如果被超级强大的AI采纳,将显著增加人类的存在性风险水平。我们的结果表明,减少存在性风险与强化有利于AI智能体福祉的信念和倾向之间存在权衡。它们还表明,存在性风险与一般安全性之间存在权衡:如果我们微调AI以采纳显著降低其存在性风险的信念和倾向——通过塑造AI使其系统性地服从于外部人类权威——我们从而增加了人类用户故意诱导AI从事各种一般不安全行为的可能性。

英文摘要

This paper examines trade-offs between AI safety and well-being relative to (i) one of the most promising methods for finetuning super-capable AIs, 'Constitutional AI', and (ii) one of the most influential approaches to understanding complex ethical decision making and the conditions for the well-being of rational agents, 'Virtue Ethics'. We finetune various models using a 'Virtuous agent' constitution, a 'Subordinate agent' constitution, and a 'Generic agent' constitution, and evaluate them on 'general safety' (toxic behaviors, misinformation, etc.) and also on their willingness to endorse a wide range of behaviors that, if adopted by a super-powerful AI, would significantly increase the level of existential risk for humanity. Our results suggest that there is a trade-off between reducing existential risk and reinforcing the beliefs and dispositions that would be conducive to an AI agent's well-being. They also suggest that there is a trade-off between existential risk and general safety: if we finetune an AI to adopt beliefs and dispositions that substantially reduce its existential risk -- by shaping the AI to be systematically subordinate to external human authorities -- we thereby increase the likelihood that a human user can deliberately induce the AI to engage in various kinds of generally unsafe behaviors.

2606.13755 2026-06-15 cs.CY cs.AI cs.LG 交叉投稿

Position: Align AI to Our Aspirations, Not Our Flaws

立场:将AI对齐于我们的抱负,而非缺陷

Nikita Kazeev, Bui Nhat Huyen Phan

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 本文主张AI不应与聚合的人类偏好对齐,而应基于能力、事实准确性、诚实和合法性等客观目标底线,在底线之上允许多元价值权衡。

详情
Journal ref
Pluralistic Alignment Workshop at ICML 2026
AI中文摘要

我们认为,将AI与聚合的人类偏好对齐是错误的靶向。在当前技术下,可以训练AI共享硅谷技术乐观主义者、去增长环保主义者、民族保守文化战士、一党制国家干部或虔诚宗教传统主义者的价值观。但我们不应这样做。人类价值观使社会因这些价值观的优劣而繁荣或失败——从失败国家和极端不平等,到世界上最富裕民主国家中幸福感下降、政治极化及政府功能失调。多元对齐方案正确诊断出不存在单一的“人类”可供对齐,但若将其作为主要指令则是危险的。我们认为,AI应被训练至不可协商的客观对齐目标底线——能力,受限于事实准确性、诚实和合法性的约束——而多元性应存在于表层(语言、语域、惯例、缺失语境默认值)以及尊重底线的合法价值权衡的广阔范围内,但不应存在于违反底线的价值观层面。我们强调了未经过滤的多元价值观的经验现实,提出了四项承诺作为建设性替代方案,并回应了六个可信的反对意见:商业压力与可行性、民主合法性、监管合规性、过度依赖制度主义解释、底线本身具有文化负载的指控,以及连贯外推意愿的局限性。

英文摘要

We argue that aligning AI to aggregated human preferences is the wrong target. With current technology, one can train AIs to share the values of a Silicon Valley techno-optimist, a degrowth environmentalist, a national-conservative culture warrior, a single-party state cadre, or a devout religious traditionalist. We should not. Human values produce societies that thrive or fail on the merits of those values - from failed states and extreme inequality to declining happiness, political polarization, and government dysfunction in the world's wealthiest democracies. The pluralistic-alignment program correctly diagnoses that there is no single "humanity" to align with, but is dangerous if taken as the main directive. We argue that AI should be trained to a non-negotiable floor of objective alignment goals - competence, bounded by the constraints of factual accuracy, honesty, and lawfulness - and that pluralism belongs at the surface (language, register, conventions, missing-context defaults) and across the wide band of legitimate value tradeoffs that respect the floor, but not at the level of values that violate it. We highlight the empirical reality of unfiltered pluralistic values, propose four commitments as a constructive alternative, and engage six credible objections: commercial pressure and practical feasibility, democratic legitimacy, regulatory compliance, over-reliance on institutionalist explanations, the charge that the floor itself is culturally laden, and the limits of Coherent Extrapolated Volition.

2606.13962 2026-06-15 cs.HC cs.AI 交叉投稿

The Silent Cost of Artificial Intelligence Assistance: A Theory of Autonomy Surrender, the Recovery Mechanism, and the Restoration of Human Agency

人工智能辅助的隐性成本:自主性让渡理论、恢复机制与人类能动性的重建

Ancuta Margondai, Julie Rader, Emma Rader, Sara Willox, Mustapha Mouloua

发表机构 * Department of Modeling and Simulation(建模与仿真系)

AI总结 本文基于HIAG框架提出自主性让渡的理论模型,揭示AI辅助中认知带宽消耗导致的隐性成本,并设计恢复机制以重建人类能动性。

Comments 15 pages, 1 figure. Submitted version

详情
AI中文摘要

人工智能融入人类决策环境引入了一种此前未被充分理论化的成本:人类为获取信息和计算辅助而逐渐让渡自主性。基于人类身份与自主性差距(HIAG)框架,本文提出了一个自主性让渡的理论模型,将其视为由认知带宽消耗驱动的可测量、累积过程。该模型提出三种相互作用机制:AI辅助的隐性成本(自主性在无意识中逐步转移)、让渡阈值(超过该阈值后,恢复自主功能在认知和心理上变得困难)以及恢复机制(确立了设计义务和伦理责任,伴随人类有意识地重新掌握控制权)。本文认为,人类重新进入决策循环并非被动选择,而是一种需要有意恢复带宽的主动认知事件。AI系统的设计必须包含结构化的重新进入路径(此处称为恢复机制),以在适当分配责任的同时保留人类能动性。该模型进一步预测了一种终端状态(此处称为偏好反转),即对AI辅助的功能依赖不再被视为缺陷,而被体验为一种偏好,从而将自主性的恢复从设计问题转变为文化政治问题。本文为AI系统设计、治理框架和人因研究提供了启示。

英文摘要

The integration of artificial intelligence into human decision-making environments has introduced a previously undertheorized cost: the gradual surrender of human autonomy in exchange for access to information and computational assistance. Building on the Human Identity and Autonomy Gap (HIAG) framework, this paper advances a theoretical model of autonomy surrender as a measurable, cumulative process driven by cognitive bandwidth depletion. The model proposes three interacting mechanisms: the silent cost of AI assistance, in which autonomy is transferred incrementally and without awareness; the surrender threshold, beyond which reclaiming autonomous function becomes cognitively and psychologically difficult; and the recovery mechanism, which establishes the design obligation and the ethical responsibility accompanying deliberate human re-assumption of control. The paper argues that human re-entry into the decision loop is not a passive option but an active cognitive event requiring intentional bandwidth restoration. The design of AI systems must incorporate structured re-entry pathways, here termed recovery mechanisms, that preserve human agency while appropriately distributing responsibility. The model further predicts a terminal state, here termed preference inversion, in which functional dependence on AI assistance is experienced not as a deficit but as a preference, transforming the restoration of autonomy from a design problem into a cultural and political one. Implications are drawn for AI system design, governance frameworks, and human factors research.

2606.14078 2026-06-15 cs.LG cs.AI 交叉投稿

Rethinking Backdoor Adversarial Unlearning through the Lens of Catastrophic Forgetting in Continual Learning

通过持续学习中的灾难性遗忘视角重新思考后门对抗性去学习

Zhenqian Zhu, Yamin Hu, Yujiang Liu, Luping Wei, Wenbo Hou, Bin Li, Haodong Li, Wenjian Luo

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Shenzhen Key Laboratory of Media Security, Shenzhen University(深圳大学媒体安全深圳市重点实验室)

AI总结 本文将后门学习与去学习建模为持续学习视角下的三阶段过程,基于灾难性遗忘机制推导完全后门去学习的必要条件,并提出盲反演-后门对抗性去学习(BI-BAU)方法,通过期望最大化算法优化最大后验目标,有效消除后门效应。

Comments Accepted by ACM CCS 2026

详情
AI中文摘要

现有研究表明,当前的后门防御方法鲁棒性有限,且常无法应对特定类型的攻击。更令人担忧的是,主流的安全调优策略往往仅提供表面安全保护,因为它们未能完全消除后门效应。在本工作中,我们从持续学习视角将后门学习与去学习重新表述为一个顺序的三阶段过程。在此框架内,我们正式定义了完全后门去学习,并基于灾难性遗忘机制进一步推导了实现它的必要条件。在这些见解的指导下,我们提出了盲反演-后门对抗性去学习(BI-BAU),它将满足去学习条件的对抗样本生成问题表述为一个盲反演问题。我们通过将对抗训练的双层优化过程整合到期望最大化(EM)算法框架中来解决该问题,以优化最大后验(MAP)目标。此外,BI-BAU被扩展到目标类别未知的无目标对抗场景以及多模态对比学习任务中,增强了其在预训练模型可能被攻破的真实部署场景中的适用性。大量实验表明,我们的方法在广泛的后门攻击中具有通用适用性,并能有效且彻底地消除后门模型中的后门效应。

英文摘要

Existing studies reveal that current backdoor defenses exhibit limited robustness and often fail against specific types of attacks. More concerningly, prevailing safety tuning strategies tend to provide only superficial safety protection, as they fall short of completely eliminating the backdoor effects. In this work, we present a novel formulation of backdoor learning and unlearning as a sequential, three-stage process from a continual learning perspective. Within this framework, we formally define complete backdoor unlearning and further derive the necessary conditions for achieving it based on the mechanism of catastrophic forgetting. Guided by these insights, we propose Blind Inversion-Backdoor Adversarial Unlearning (BI-BAU), which formulates the generation of adversarial examples satisfying the unlearning conditions as a blind inversion problem. We solve this by integrating the bi-level optimization process of adversarial training into an Expectation-Maximization (EM) algorithm framework to optimize the maximum a posteriori (MAP) objective. Furthermore, BI-BAU is extended to untargeted adversarial scenarios with unknown target classes, as well as to multi-modal contrastive learning tasks, enhancing its applicability to real-world deployment scenarios where pre-trained models may be compromised. Extensive experiments demonstrate that our method exhibits general applicability across a wide spectrum of backdoor attacks and can effectively and thoroughly eliminate the backdoor effects from a backdoor model.

2606.14210 2026-06-15 cs.CR cs.AI 交叉投稿

From Prompts to Responses: Dual-Sided Data Leakage and Defense in Split Large Language Models

从提示到响应:分割大语言模型中的双面数据泄露与防御

Zixuan Gu, Xiaojun Ye, Yang Liu

发表机构 * GitHub

AI总结 提出PIDI攻击方法,同时泄露分割LLM中的输入提示和输出响应;并设计ADMI防御机制,通过适配器热身和互信息正则化有效抵御攻击。

Comments 18 pages, Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLM)越来越多地部署在隐私敏感领域,用户必须在通过外部API暴露数据的风险与本地部署的高计算成本之间取得平衡。因此,分割学习已成为在有限本地资源下进行LLM微调和推理的一种有前景的范式。然而,它引入了新的隐私风险。先前的工作主要研究私有输入提示的泄露,通常通过对中间表示进行反转攻击,而通过生成响应输出泄露敏感信息的可能性在很大程度上尚未被探索。在这项工作中,我们通过提出具有双面初始化的补丁模型反转(PIDI)揭示了Split-LLM的新漏洞,这是一种两阶段攻击,同时针对Split-LLM设置中的私有输入提示和输出响应。它结合了双面初始化与补丁反转策略来处理长序列,显著优于先前的反转方法。为了应对来自两方面的威胁,我们进一步提出了基于适配器的具有互信息防御的双重守卫(ADMI),它集成了基于适配器的本地热身策略和互信息正则化,以在最小影响任务性能的情况下提供强大的经验隐私保护。跨不同任务和模型的广泛实验表明,ADMI有效防御了PIDI和其他最先进的反转攻击。我们的代码已公开:https://github.com/FLAIR-THU/VFLAIR-LLM

英文摘要

Large language models (LLMs) are increasingly deployed in privacy-sensitive domains, where users must balance the risk of data exposure through external APIs against the high computational cost of local deployment. Split learning has therefore emerged as a promising paradigm for LLM fine-tuning and inference under limited local resources. However, it introduces new privacy risks. Prior work primarily studies leakage of private input prompts, typically via inversion attacks on intermediate representations, while the potential for sensitive information leakage through generative response outputs remains largely unexplored. In this work, we unveil novel vulnerabilities of Split-LLM by presenting Patched Model Inversion with Dual-Sided Initialization (PIDI), a two-stage attack that simultaneously targets both private input prompts and output responses in Split-LLM settings. It combines dual-sided initialization with a patched inversion strategy to tackle long sequences, substantially outperforming prior inversion methods. To counter threats from both sides, we further propose the Adapter-based DualGuard with Mutual Information Defense (ADMI), which integrates an adapter-based local warmup strategy and mutual information regularization to provide a strong empirical privacy protection with minimal impact on task performance. Extensive experiments across diverse tasks and models demonstrate that ADMI effectively defends against PIDI and other state-of-the-art inversion attacks. Our code is publicly available at https://github.com/FLAIR-THU/VFLAIR-LLM.

2606.14327 2026-06-15 cs.SE cs.AI cs.ET 交叉投稿

I'm Sorry Driver, I'm Afraid I Can't Do That: Appraising the Safety of LLMs within Automotive Contexts

抱歉,司机,恐怕我不能这么做:评估LLMs在汽车环境中的安全性

Shaun Feakins, Ibrahim Habli, Kim Littler, Robert Palin

发表机构 * UKRI AI Centre for Doctoral Training in Safe Artificial Intelligence Systems (SAINTS)(英国研究理事会安全人工智能系统博士培训中心(SAINTS)) University of York(约克大学) Jaguar Land Rover(捷豹路虎)

AI总结 本文从安全保证角度评估了将LLMs集成到汽车控制任务中的现有框架,指出其面临概念和具体挑战,并通过案例研究提出未来保障机制。

Comments Accepted at the Dependable AI in Embedded Systems (DAIES) Workshop at SAFECOMP 2026; 15 pages, 3 figures, 2 tables

详情
AI中文摘要

本文从安全保证的角度评估了AI开发中最近将LLMs集成到汽车环境控制任务中的框架。这项工作建立在LLMs在汽车环境中的快速集成之上。然而,我们发现目前这些框架面临重大挑战,限制了它们在实时安全关键环境中的有效性。首先,我们考虑了概念性挑战,包括部署者面临双重挑战:他们必须保证在上游(即由大型AI实验室作为通用工具开发)的模型在下游(即集成到特定车辆架构中)的可靠性。其次,我们考虑了现有标准中的具体挑战。我们表明,目前存在ISO21448中涵盖的基本工程约束(如延迟)和ISO/PAS8800中涵盖的新颖LLM特定问题(如对齐相关问题)。我们通过一个具体的介绍性实验案例研究(探索现有开源存储库Talk2Drive)来实例化这两个例子。我们提出一个安全论证,以明确现有解决方案的局限性。尽管如此,鉴于在技术层面和操作化层面正在探索LLMs在汽车环境中的使用,我们提出了针对LLM相关危险事件的潜在保证机制。

英文摘要

This paper appraises recent frameworks within AI development to integrate LLMs into control tasks in automotive contexts from the perspective of safety assurance. This work has built upon the rapid integration of LLMs across automotive settings. However, we find that at present, these frameworks face significant challenges, limiting their efficacy in real-time safety-critical contexts. Firstly, we consider conceptual challenges, including the fact that deployers are faced with a dual challenge, wherein they must assure a model which has been developed upstream, i.e. as general-purpose tools by the large AI labs, in a downstream context, i.e. into specific vehicle architectures. Secondly, we consider concrete challenges from across existing standards. We show that there are currently both fundamental engineering constraints covered in ISO 21448, such as latency, and novel LLM-specific issues, such as alignment-related issues covered in ISO/PAS 8800. We ground both examples in a concrete introductory, experimental case study exploring an existing open-source repository, Talk2Drive. We present a safety argument in order to make explicit the limitations of existing solutions. Nonetheless, given that the use of LLMs in automotive contexts is being explored at a technical level and operationalised, we propose potential assurance mechanisms for LLM-related hazardous events going forward.

2606.14466 2026-06-15 cs.SD cs.AI cs.LG 交叉投稿

The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions

音频模型中解释的感知脆弱性:在预测不变的情况下操纵归因

Piotr Kitłowski, Dominik Wiącek, Mateusz Modrzejewski

发表机构 * University of Warsaw(华沙大学)

AI总结 提出一种心理声学框架,通过优化不可听扰动来解耦模型归因与分类,证明在音频深度伪造检测中可系统扭曲解释热图而保持预测标签不变。

Comments Accepted to the ICML 2026 Workshop on Machine Learning for Audio; 5 pages, 4 figures

详情
AI中文摘要

本文研究了事后解释方法在音频深度伪造检测中的脆弱性。先前关于解释操纵的工作主要关注图像并使用标准$L_p$度量,而我们引入了一个心理声学框架,该框架优化不可听扰动以将模型归因与最终分类解耦。我们在严格的预测保持约束下,评估了这种脆弱性在多种最先进架构上的表现。通过领域特定的感知音频质量指标和解释对齐标准来评估操纵成本,我们的框架证明,攻击者可以在保持预测的深度伪造标签不变的情况下,系统地扭曲自动生成的解释热图。完整代码见:https://github.com/cncPomper/Audio-XAI

英文摘要

This paper investigates the fragility of post-hoc explanation methods in audio deepfake detection. While previous work on explanation manipulation focused on images using standard $L_p$ metrics, we introduce a psychoacoustic framework that optimizes inaudible perturbations to decouple model attributions from final classifications. We evaluate this vulnerability across state-of-the-art architectures under strict prediction-preserving constraints. By evaluating the manipulation cost through domain-specific perceptual audio quality metrics alongside explanation alignment criteria, our framework demonstrates that an adversary can systematically distort automated explanation heatmaps while preserving the predicted deepfake label. Full code available at: https://github.com/cncPomper/Audio-XAI

2606.14515 2026-06-15 cs.CR cs.AI 交叉投稿

Securing the Future of IoMT in the Post-Quantum Era: An Edge-Native Federated Learning Approach

后量子时代保障IoMT的未来:一种边缘原生联邦学习方法

Taym Alshoghri, Deemah H. Tashman, Mohammad Reza Gerami, Soumaya Cherkaoui

发表机构 * LINCS Laboratory, Department of Computer and Software Engineering, Polytechnique Montréal(LINCS实验室,计算机与软件工程系,蒙特利尔理工学院) Department of Computer Science, University of Toronto(计算机科学系,多伦多大学)

AI总结 针对IoMT设备资源受限且处理敏感健康数据的安全隐私问题,提出一种集成后量子密码学的Kubernetes框架,通过边缘原生联邦学习实现低延迟分布式加密处理。

详情
AI中文摘要

医疗物联网(IoMT)设备在严格资源约束下运行,同时处理高度敏感的健康数据,使得安全性和隐私成为关键问题。联邦学习(FL)进一步复杂化了这一局面,因为训练期间交换的模型更新可能无意中暴露私人医疗信息。新兴的量子计算能力威胁着传统轻量级密码机制的长期可行性,推动了将后量子密码学(PQC)集成到IoMT系统中。本文讨论了量子弹性IoMT的关键使能技术,包括后量子密钥建立、轻量级加密和边缘原生编排。我们提出了一种可扩展的基于Kubernetes的框架,将PQC集成到支持FL的IoMT环境中,并在Raspberry Pi测试平台上进行了验证。结果表明,与顺序设计相比,分布式加密处理显著降低了延迟,同时保持了可行的资源开销。本工作的主要贡献在于设计和验证了支持FL的IoMT系统的安全编排和通信框架。最后,我们概述了未来方向,包括能量感知架构、智能安全优化和弹性下一代智能医疗物联网(IIoMT)生态系统。

英文摘要

Internet of Medical Things (IoMT) devices operate under strict resource constraints while handling highly sensitive health data, making security and privacy critical concerns. Federated learning (FL) further complicates this landscape, as model updates exchanged during training may unintentionally expose private medical information. Emerging quantum computing capabilities threaten the long-term viability of conventional lightweight cryptographic mechanisms, motivating the integration of Post-Quantum Cryptography (PQC) into IoMT systems. This article discusses key enabling technologies for quantum-resilient IoMT, including post-quantum key establishment, lightweight encryption, and edge-native orchestration. We propose a scalable Kubernetes-based framework that integrates PQC into FL-enabled IoMT environments and validate it on a Raspberry Pi testbed. Results demonstrate that distributed cryptographic processing significantly reduces latency compared to sequential designs while maintaining feasible resource overhead. The primary contribution of this work lies in the design and validation of a secure orchestration and communication framework for FL-enabled IoMT systems. We conclude by outlining future directions toward energy-aware architectures, intelligent security optimization, and resilient next-generation Intelligent Internet of Medical Things (IIoMT) ecosystems.

2606.14589 2026-06-15 cs.SE cs.AI cs.DC 交叉投稿

When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime

当错误成为叙事:生产级LLM Agent运行时中静默故障的纵向分类

Wei Wu

发表机构 * Independent researcher(独立研究者)

AI总结 通过对一个生产级个人助手Agent运行时为期八周的研究,记录了22起事件,其中静默故障元模式至少出现28次;提出五类机制导向分类,其中D类(链式幻觉与捏造)为LLM特有且最危险,系统会生成流畅可信的虚假叙事。

Comments 18 pages, 5 figures, 2 tables. 22 incident postmortems and all defense-framework artifacts publicly available at https://github.com/bisdom-cell/openclaw-model-bridge; governance engine on PyPI (openclaw-ontology-engine)

详情
AI中文摘要

LLM agent系统越来越多地作为长期运行的自主运行时运行:调度任务、调用工具、维护内存并将结果推送给人类。我们对此类系统中的静默故障进行了纵向研究:一个自2026年3月起持续在生产环境中运行的个人助手agent运行时,约有40个定时任务、8个LLM提供商、一个工具治理代理和一个知识库记忆平面,由4,286个单元测试和827个治理检查保护。在八周内,我们记录了22起事件并进行了完整的根因事后分析,其中一种元模式——故障的错误信号从未以可操作形式到达人类——至少出现了28次。我们推导出一个五类、机制导向的分类法:(A) 环境和平台怪癖,(B) 设计假设不匹配,(C) 错误吞没和稀释,(D) 链式幻觉和捏造,(E) 操作遗漏和取证盲点。D类是LLM系统独有的且最危险:系统不仅未能报告错误——LLM将其转化为流畅、可信的叙事传递给用户。我们将其称为“可信失败”:灰色故障的差异可观察性升级——观察者不仅盲目,而且被故障本身令人信服地欺骗。三个发现:约70%的静默故障是由人类用户视角观察捕获的,而非测试或审计;对15起事件的事后审计发现0%的事前预防但87%的回归阻断——审计是回归引擎,而非预测引擎;事件延迟(13小时至60天)与故障机制相关,而非代码复杂性——最长寿命的故障存在于组件之间的缝隙中,那里没有测试运行。我们描述了由此产生的防御框架,并提炼出使agent系统故障响亮、可归因且乏味的设计原则。所有事后分析和工件均已公开。

英文摘要

LLM agent systems increasingly run as long-lived autonomous runtimes: scheduling jobs, calling tools, maintaining memory, and pushing results to humans. We present a longitudinal study of silent failures in one such system: a personal-assistant agent runtime in continuous production since March 2026, with roughly 40 scheduled jobs, 8 LLM providers, a tool-governance proxy, and a knowledge-base memory plane, defended by 4,286 unit tests and 827 governance checks. Over eight weeks we documented 22 incidents with full root-cause postmortems, in which one meta-pattern -- a failure whose error signal never reaches a human in actionable form -- manifested at least 28 times. We derive a five-class, mechanism-oriented taxonomy: (A) environment and platform quirks, (B) design-assumption mismatches, (C) error swallowing and dilution, (D) chained hallucination and fabrication, (E) operational omission and forensic blind spots. Class D is unique to LLM systems and the most dangerous: the system does not merely fail to report an error -- the LLM transforms it into fluent, plausible narrative delivered to the user. We term this fail-plausible: gray failure's differential observability escalated -- the observer is not just blind, it is convincingly lied to by the failure itself. Three findings: about 70% of silent failures were caught by human user-view observation, not tests or audits; a retrospective audit of 15 incidents found 0% ex-ante prevention but 87% regression blocking -- audits are regression engines, not prediction engines; incident latency (13 hours to 60 days) tracks failure mechanism, not code complexity -- the longest-lived failures lived in the seams between components, where no test runs. We describe the resulting defense framework and distill design principles for agent systems whose failures are loud, attributable, and boring. All postmortems and artifacts are public.

2606.14594 2026-06-15 cs.SE cs.AI 交叉投稿

Regulating the Machine Contributor: Governance and Policy Alignment in Open Source

调控机器贡献者:开源中的治理与政策对齐

Jassem Manita, Aziz Amari

发表机构 * Faculty of Sciences of Tunis (FST), University of Tunis El Manar(突尼斯科学学院(FST),突尼斯埃尔马纳尔大学) National Institute of Applied Science and Technology (INSAT), University of Carthage(国立应用科学与技术学院(INSAT),迦太基大学)

AI总结 针对AI代理在开源中引发治理问题,通过比较六个组织的政策,提出六维分类法、政策成熟度评分,并映射代理事件,识别监管框架与政策间的缺口,勾勒分层框架。

详情
AI中文摘要

AI辅助软件开发已从行级自动补全发展到能够规划更改、编辑文件并在有限人工监督下提交拉取请求的代理。然而,开源软件通过为人类设计的过程演进:贡献者协议、行为准则和审查规范都假定存在一个法律上负责的人,能够证明来源并回答审查者问题。自主和半自主AI贡献者挑战了这些假设,2025-2026年期间代理驱动的事件、AI生成的滋扰量以及平台级关闭的记录表明,这一差距在操作上具有重大影响。一些开源组织已通过贡献政策作出回应,但结果碎片化,且其与新兴AI治理框架(欧盟AI法案、NIST AI RMF与UC Berkeley代理AI配置文件、ISO/IEC 42001和23894)在贡献层面的对齐尚未映射。我们使用最相似系统设计,结合基于指标的编码和过程追踪(针对SymPy和LLVM),比较了六个组织(SymPy、LLVM、matplotlib、OpenInfra、Apache软件基金会和Linux基金会)的政策。由此,我们推导出一个六维分类法(披露、责任、人工监督、许可、执行、维护者工作量)、一个序数政策成熟度评分,以及将记录的代理事件映射到每个政策未能治理的维度上。将这些维度与上述监管框架对齐,识别出双方目前都未填补的重叠缺口,最后我们勾勒出一个协调的分层框架的形态以及校准该框架所需的实证评估。

英文摘要

AI-assisted software development has moved from line-level autocomplete to agents that can plan changes, edit files, and submit pull requests with limited human supervision. Open-source software, however, evolves through a process designed for humans: contributor agreements, codes of conduct, and review norms all assume a legally accountable person who can attest to provenance and answer reviewer questions. Autonomous and semi-autonomous AI contributors strain those assumptions, and the 2025-2026 record of agent-driven incidents, AI-generated nuisance volume, and platform-level shutdowns shows that the gap is operationally consequential. Several open-source organisations have responded with contribution policies, but the result is fragmented, and its alignment with emerging AI governance frameworks (EU AI Act, NIST AI RMF with the UC Berkeley Agentic AI Profile, ISO/IEC 42001 and 23894) is unmapped at the contribution level. We compare policies across six organisations (SymPy, LLVM, matplotlib, OpenInfra, the Apache Software Foundation, and the Linux Foundation) using Most-Similar Systems Design with indicator-based coding and process tracing for SymPy and LLVM. From this we derive a six-dimensional taxonomy (disclosure, responsibility, human oversight, licensing, enforcement, maintainer workload), an ordinal Policy Maturity Score, and a mapping of documented agent incidents onto the dimensions each policy fails to govern. Aligning the dimensions with the regulatory frameworks above identifies overlapping gaps neither side currently closes, and we close by sketching the shape of a harmonised tiered framework and the empirical evaluation needed to calibrate it.

2606.14629 2026-06-15 cs.CR cs.AI 交叉投稿

When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks

当好的验证器变坏:自我改进的视觉语言模型可能在新任务上退步

Jianzhe Lin

发表机构 * MetaAI(Meta)

AI总结 本文发现验证器驱动的自我DPO中,验证器质量具有任务特异性,在低准确率任务上会导致学生模型性能退步,并给出机制解释和部署建议。

Comments 12 pages, 2 figure

详情
AI中文摘要

验证器驱动的自我DPO是自我改进的生产级视觉语言模型的常见方法。在这种设置中,冻结的验证器对候选生成进行评分,得分最高和最低的候选形成偏好示例,DPO更新学习器。部署时的假设是单调的:更强的验证器应产生更强的学生。我们表明这个假设可能失败,因为验证器质量高度依赖于任务。在MathVista、MMMU和BLINK上的四层开源验证器阶梯中,相同的验证器在MathVista上高于阈值并改进Qwen-3-VL-2B学生,但在MMMU上变得低于阈值,其任务评分准确率降至8%到23%。在这个范围内,我们测试的每个验证器都无声地使学生退步,产生比冻结基线低3.4到10.9个百分点的下降,而DPO训练损失持续下降。这种退步在第二个学生Qwen-2.5-VL-3B上重复出现。此外,在失败范围内,损害是置信度反转的:更准确但仍然错误的验证器比接近随机的验证器导致更大的退步,这表明进度门控重放放大了自信的错误偏好对。我们通过进度门控重放的方差定理及其方向不匹配失败模式给出了一个紧凑的机制解释。部署信息是操作性的而非纯粹诊断性的:在运行任何验证器驱动循环之前,团队应测量目标任务的评分准确率,根据目标任务评分质量而非参数数量对验证器排序,并将高于阈值范围内的收益递减视为验证器侧的计算预算上限。

英文摘要

Verifier-driven self-DPO is a common recipe for self-improving production visual-language models. In this setup, a frozen verifier scores candidate generations, the top- and bottom-scoring candidates form a preference example, and DPO updates the learner. The deployment-time assumption is monotone: a stronger verifier should yield a stronger student. We show that this assumption can fail because verifier quality is highly task-specific. On a four-rung open-source verifier ladder across MathVista, MMMU, and BLINK, the same verifiers that are above-threshold and improve a Qwen-3-VL-2B student on MathVista become sub-threshold on MMMU, where their task-rubric accuracy drops to 8% to 23%. In this regime, every verifier we tested silently regresses the student, producing drops of 3.4 to 10.9 percentage points below the frozen baseline while the DPO training loss continues to decrease. The regression replicates on a second student, Qwen-2.5-VL-3B. Moreover, within the failure regime, damage is confidence-inverted: the more accurate-but-still-wrong verifier causes larger regression than a near-random verifier, suggesting that progress-gated replay amplifies confidently wrong preference pairs. We give a compact mechanistic explanation via a variance theorem for progress-gated replay and its direction-mismatch failure mode. The deployment message is operational rather than purely diagnostic: before running any verifier-driven loop, teams should measure target-task rubric accuracy, rank verifiers by target-task rubric quality rather than parameter count, and treat diminishing returns in above-threshold regimes as a verifier-side compute budget cap.
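
The loop described above, and the pre-flight check the abstract recommends, can be sketched as follows. This is an illustration under assumptions, not the paper's code: the function names are hypothetical, and the accuracy threshold is left as a caller-supplied parameter because the abstract does not state its value.

```python
def build_preference_pair(candidates, verifier_score):
    """Form one DPO preference example from verifier scores.

    The top-scoring candidate becomes `chosen`, the bottom-scoring one
    `rejected`, as in the setup the abstract describes.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=verifier_score)
    return {"chosen": ranked[-1], "rejected": ranked[0]}


def rubric_accuracy(verifier_label, labelled_examples):
    """Fraction of target-task examples on which the verifier's judgement
    matches the gold label. `labelled_examples` is a list of (input, gold)."""
    correct = sum(1 for x, gold in labelled_examples if verifier_label(x) == gold)
    return correct / len(labelled_examples)


def should_run_loop(accuracy, threshold):
    """Gate the self-improvement loop on measured target-task accuracy.

    `threshold` is a hypothetical operating point chosen by the team.
    """
    return accuracy >= threshold


# Toy usage: string length stands in for a verifier score.
pair = build_preference_pair(["a", "bbb", "cc"], verifier_score=len)
acc = rubric_accuracy(lambda x: x > 0,
                      [(1, True), (-1, False), (2, False), (3, True)])
```

Measuring `rubric_accuracy` on the target task before training is the point of the check: a verifier that clears the bar on one benchmark may fall below it on another.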

2606.14647 2026-06-15 cs.SD cs.AI 交叉投稿

Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models

基于注意力的听觉:面向Transformer音频模型的熵引导可解释性

Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou

发表机构 * Florida International University(佛罗里达国际大学) University of South Florida(南佛罗里达大学)

AI总结 提出LEAF-X框架,通过熵引导注意力加权、多层注意力展开和因果消融,为Transformer语音识别模型生成稀疏的帧级归因,提升忠实度32%、局部性/稀疏性35-39%。

Comments 17 pages, 3 figures, and 9 tables. Accepted in Interspeech 2026 conference

详情
AI中文摘要

基于Transformer的自动语音识别(ASR)模型(如Whisper)具有高准确性,但其预测仍然难以解释。现有的可解释人工智能(XAI)方法通常缺乏忠实性和精确的时间定位。我们提出了基于熵引导注意力的忠实可解释性听觉方法(LEAF-X),这是一种针对基于Transformer的ASR的模型内在XAI框架。LEAF-X结合了熵引导注意力加权、多层注意力展开和可选的因果消融,以识别低熵、高影响力的头和层,生成稀疏的token到帧归因。与基于扰动的解释器或原始注意力图不同,LEAF-X利用编码器-解码器和语音增强的仅解码器模型的内部结构,生成更能反映模型计算的解释。结果表明,忠实度提高了32%,局部性/稀疏性提高了35-39%,并且归因最稳定,支持更透明和可审计的ASR。

英文摘要

Transformer-based automatic speech recognition (ASR) models such as Whisper are highly accurate, but their predictions remain difficult to interpret. Existing explainable AI (XAI) methods often lack faithfulness and precise temporal grounding. We propose Listening with Entropy-guided Attention for Faithful explainability (LEAF-X), a model-intrinsic XAI framework for transformer-based ASR. LEAF-X combines entropy-guided attention weighting, multi-layer attention rollout, and optional causal ablations to identify low-entropy, high-impact heads and layers, producing sparse token-to-frame attributions. Unlike perturbation-based explainers or raw attention maps, LEAF-X exploits the internal structure of encoder-decoder and speech-augmented decoder-only models to generate explanations that better reflect model computation. Results show 32% improved faithfulness, 35-39% stronger locality/sparsity, and the most stable attributions, supporting more transparent and auditable ASR.
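
As a rough illustration of entropy-guided attention weighting, the sketch below scores each attention head by the mean Shannon entropy of its token-to-frame rows and gives low-entropy heads more weight. The `exp(-entropy)` weighting and all function names are assumptions; the abstract does not give LEAF-X's exact formula, and multi-layer rollout and causal ablation are omitted.

```python
import math


def row_entropy(p):
    """Shannon entropy (nats) of one attention row over audio frames."""
    return -sum(x * math.log(x) for x in p if x > 0)


def head_entropy(attn):
    """Mean row entropy of one head; `attn` is a list of rows, one per
    output token, each a probability distribution over frames."""
    return sum(row_entropy(row) for row in attn) / len(attn)


def entropy_weights(heads):
    """Assumed weighting: proportional to exp(-entropy), normalised to 1,
    so sharply focused heads dominate the combined attribution."""
    raw = [math.exp(-head_entropy(h)) for h in heads]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_attribution(heads):
    """Combine heads into a single token-to-frame attribution map."""
    w = entropy_weights(heads)
    n_tokens, n_frames = len(heads[0]), len(heads[0][0])
    return [
        [sum(w[h] * heads[h][t][f] for h in range(len(heads)))
         for f in range(n_frames)]
        for t in range(n_tokens)
    ]


peaked = [[1.0, 0.0, 0.0, 0.0]]       # one token attending to a single frame
diffuse = [[0.25, 0.25, 0.25, 0.25]]  # one token attending uniformly
weights = entropy_weights([peaked, diffuse])
attribution = weighted_attribution([peaked, diffuse])
```

Because each row stays a convex combination of probability distributions, the combined attribution for a token still sums to one, which keeps maps comparable across tokens.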

2606.14658 2026-06-15 cs.CV cs.AI 交叉投稿

Giving AI a Headache: Acoustic Adversarial Attacks to Computer Vision Applications

给AI带来头痛:针对计算机视觉应用的声学对抗攻击

Nicole Villavicencio-Garduño, Maksim Ekin Eren, Milo Prisbrey, Ben Migliori, Michael Teti

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 研究利用低频声波(<20 kHz)引起相机物理振动,导致AI视觉模型(如YOLO11)误分类、漏检或产生幻觉,并分析了影响攻击效果的因素。

Comments 9 pages, 7 figures, SPIE Defense + Security

详情
Journal ref
Proc. SPIE 14046, Assurance and Security for AI-enabled Systems 2026, 1404609 (10 Jun 2026)
AI中文摘要

人工智能(AI)越来越多地被用于自动化各种现实世界的计算机视觉(CV)应用,如自动驾驶车辆控制、面部识别和安全摄像头。最近的研究表明,声学振动可以引起相机真实的物理运动,干扰其内部稳定机制。由于这种运动超出了稳定系统设计处理的条件,系统会在帧中引入伪影,导致基于AI的CV模型误分类、错过目标或产生幻觉对象。先前的工作使用超声波频率(>20 kHz)进行短距离攻击,由于高频的衰减,这些攻击仅限于短距离。在这项工作中,我们研究了使用可听范围内较低频率(<20 kHz)的声学攻击,并进一步扩展了我们的分析,包括各种图像和物体特征如何受到攻击的影响。具体来说,我们进行了物理实验,通过用各种频率共振商用相机,证明了我们的攻击对现成目标检测模型(YOLO11)的可行性。基于我们的结果,我们提供了关于使AI CV系统更容易受到这些攻击的几个因素的见解,这可能有助于未来缓解策略的开发。

英文摘要

Artificial Intelligence (AI) is increasingly used to automate a variety of real-world computer vision (CV) applications, such as autonomous vehicle control, facial recognition, and security cameras. Recent research has shown that acoustic vibration can induce real physical motion in cameras, interfering with their internal stabilization mechanisms. Because the motion falls outside the conditions the stabilization system was designed to handle, the system introduces artifacts into the frame, causing AI-based CV models to misclassify, miss targets, or hallucinate objects. Previous work used ultrasonic frequencies (>20 kHz), whose rapid attenuation limits such attacks to short distances. In this work, we investigate acoustic attacks using lower frequencies in the audible range (<20 kHz), and we further expand our analysis to include how various image and object features are affected by the attacks. Specifically, we performed physical experiments to demonstrate the viability of our attacks on an off-the-shelf object detection model (YOLO11) by resonating a commercially available camera with various frequencies. Based on our results, we provide insights into several factors that make an AI CV system more vulnerable to these attacks, which could help inform the development of future mitigation strategies.

2601.12913 2026-06-15 cs.AI cs.LG cs.NE 版本更新

Actionable Interpretability Must Be Defined in Terms of Symmetries

可操作的可解释性必须根据对称性来定义

Pietro Barbiero, Mateo Espinosa Zarlenga, Francesco Giannini, Alberto Termine, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra

发表机构 * University of Oxford(牛津大学) ETH Zurich(苏黎世联邦理工学院) University of Cambridge(剑桥大学)

AI总结 本文论证AI可解释性研究存在根本性问题,提出可操作的可解释性应基于四种对称性来定义,以形式化可解释模型并统一可解释推理。

详情
AI中文摘要

本文认为,人工智能(AI)中的可解释性研究从根本上是不适定的(ill-posed),因为现有的可解释性定义未能说明如何对可解释性进行形式化检验或设计。我们提出,可操作的可解释性定义必须根据*对称性*来制定,这些对称性指导模型设计并导出可检验的条件。在概率视角下,我们假设四种对称性(推理等变性、信息不变性、概念封闭不变性和结构不变性)足以(i)将可解释模型形式化为概率模型的一个子类,(ii)产生可解释推理的统一形式(例如,对齐、干预和反事实)作为贝叶斯逆的一种形式,以及(iii)提供一个正式框架来验证是否符合安全标准和法规。

英文摘要

This paper argues that interpretability research in Artificial Intelligence (AI) is fundamentally ill-posed as existing definitions of interpretability fail to describe how interpretability can be formally tested or designed for. We posit that actionable definitions of interpretability must be formulated in terms of *symmetries* that inform model design and lead to testable conditions. Under a probabilistic view, we hypothesise that four symmetries (inference equivariance, information invariance, concept-closure invariance, and structural invariance) suffice to (i) formalise interpretable models as a subclass of probabilistic models, (ii) yield a unified formulation of interpretable inference (e.g., alignment, interventions, and counterfactuals) as a form of Bayesian inversion, and (iii) provide a formal framework to verify compliance with safety standards and regulations.

2606.05461 2026-06-15 cs.AI 版本更新

Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety

先输出类型,后质量:基于标准的自动驾驶安全XAI可接受性评估标准

Abhinaw Priyadershi, Mandar Pitale, Jelena Frtunikj, Maria Spence

发表机构 * NVIDIA Corporation(英伟达公司) NVIDIA GmbH(英伟达德国分公司)

AI总结 针对基于ML的自动驾驶安全标准与XAI方法输出类型不匹配的证据类型缺口,从多个安全标准推导出19项可测试证据标准,评估六类XAI方法,发现因果XAI在三个生命周期阶段结构上必需,并提出了结构可接受性概念。

Comments Accepted at SAFECOMP 2026 Workshops (SASSUR); to appear in Springer LNCS

详情
AI中文摘要

基于ML的自动驾驶安全标准规定了保证案例必须包含的证据类型(有向因果链、量化的干预效应、命名的根因变量),然而XAI文献是按输出类型和技术族(显著性图、特征归因、反事实、因果图、语言痕迹)组织的。最受推荐的ADS XAI方法SHAP返回一个排序的特征列表,任何实现努力都无法将其转换为有向链(图1)。我们将这种不匹配称为证据类型缺口。 从AMLAS、ISO 26262、ISO 21448、ISO/PAS 8800中,我们推导出19项可测试的证据标准,涵盖7个生命周期阶段,并附有代表性的条款引用推导,对六类XAI方法进行了结构性评分。 因果XAI在结构上被证明是满足推导标准的必要条件,涉及三个阶段:危害识别(+62%标准缺口)、事件调查(+50%)和数据管理(+50%);判定集在阈值T∈(0%, 50%]内稳定,并在最坏情况下的单单元翻转下存活至T=25%。在其余四个阶段,相关或基于语言的方法是可比较或足够的。该标准识别了结构可接受性(合规的必要但非充分条件):一个可接受方法的具体输出内容仍可能是错误的,验证其保真度(拟合SCM产生的边、痕迹命名的原因)是开放的保证挑战。基于1,996个真实驾驶片段(79,840行,十个分割)的单VLA概念验证与每种方法观察到的输出类型匹配其标准预测一致。ADS安全保证的XAI方法选择应由生命周期阶段的证据需求驱动,而非方法流行度。

英文摘要

Safety standards for ML-based autonomous driving specify the kind of evidence an assurance case must contain (directed cause-and-effect chains, quantified interventional effects, named root-cause variables), yet the XAI literature is organised by output type and technique family (saliency maps, feature attribution, counterfactuals, causal graphs, language traces). SHAP, the most-recommended ADS XAI method, returns a ranked feature list that no implementation effort can convert into a directed chain (Fig. 1). We name this mismatch the evidence-type gap. From AMLAS, ISO 26262, ISO 21448, ISO/PAS 8800 we derive 19 testable evidentiary criteria across 7 lifecycle stages with representative clause-cited derivations and score six XAI method classes structurally. Causal XAI emerges as structurally required to satisfy the derived criteria at three stages: hazard identification (+62% rubric gap), incident investigation (+50%), and data management (+50%); the verdict set is stable across thresholds T in (0%, 50%] and survives a worst-case single-cell flip down to T = 25%. At the remaining four stages, correlational or language-based methods are comparable or sufficient. The rubric identifies structural admissibility (necessary but not sufficient for compliance): an admissible method's specific output content may still be wrong, and validating that fidelity (the edges a fitted SCM produces, the cause a trace names) is the open assurance challenge. A single-VLA proof of concept on 1,996 real-world driving clips (79,840 rows, ten splits) is consistent with each method's observed output type matching its rubric prediction. XAI method selection for ADS safety assurance should be driven by lifecycle-stage evidence demand, not by method popularity.

2406.09250 2026-06-15 cs.CV cs.AI cs.LG 版本更新

MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

MirrorCheck: 视觉-语言模型的高效对抗防御

Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Ivan Laptev, Karthik Nandakumar

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) NVIDIA École Polytechnique Fédérale de Lausanne(洛桑联邦理工学院) Michigan State University(密歇根州立大学)

AI总结 提出MirrorCheck框架,利用文本到图像模型和随机化策略检测并防御针对视觉-语言模型的自适应对抗攻击。

详情
AI中文摘要

视觉-语言模型(VLM)越来越容易受到复杂的对抗性攻击,包括专门设计用于绕过现有防御的自适应策略。为了解决这一漏洞,我们提出了MirrorCheck,一个鲁棒且与模型无关的检测框架,在单模态和多模态设置中均能有效运行。MirrorCheck利用文本到图像(T2I)模型从目标模型生成的标题中重建视觉内容,并通过比较原始图像和合成图像之间的特征空间嵌入来评估语义一致性。为了增强对自适应攻击的鲁棒性,MirrorCheck引入了一种随机防御策略,从多样化的模型库中随机选择T2I生成器和图像编码器。此外,我们采用了一种新颖的一次性(OTU)扰动,应用于所选编码器嵌入,并通过缩放因子调节,这降低了自适应攻击的有效性。跨多种威胁场景的大量实验表明,MirrorCheck始终优于基线方法,即使在强自适应对抗条件下也能保持其实用性。

英文摘要

Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings. MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.

2505.11577 2026-06-15 cs.CY cs.AI 版本更新

The Accountability Paradox: How Platform API Restrictions Undermine AI Transparency Mandates

问责悖论:平台API限制如何削弱AI透明度要求

Florian A. D. Burnat, Brittany I. Davidson

发表机构 * University of Bath(巴斯大学)

AI总结 本文研究平台API限制与欧盟数字服务法案之间的矛盾,提出审计框架揭示平台内容审核和算法放大不可验证的盲区,指出AI依赖与问责限制的悖论,建议采用联邦访问模型和加强监管执行。

Comments Accepted at ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

详情
AI中文摘要

近期主要社交媒体平台对应用程序编程接口(API)的限制挑战了遵守欧盟数字服务法案[20]的要求,该法案要求数据访问以实现算法透明度。我们开发了一个结构化的审计框架来评估监管要求与平台实施之间的日益增长的不一致。我们对X/Twitter、Reddit、TikTok和Meta的比较分析识别出关键的『审计盲区』,其中平台内容审核和算法放大仍然无法被独立验证。我们的发现揭示了『问责悖论』:随着平台越来越多地依赖AI系统,它们同时限制了独立监督的能力。我们建议与国家标准技术研究院[80]的AI风险管理框架相一致的有针对性的政策干预,强调联邦访问模型和增强的监管执行。

英文摘要

Recent application programming interface (API) restrictions on major social media platforms challenge compliance with the EU Digital Services Act [20], which mandates data access for algorithmic transparency. We develop a structured audit framework to assess the growing misalignment between regulatory requirements and platform implementations. Our comparative analysis of X/Twitter, Reddit, TikTok, and Meta identifies critical "audit blind-spots" where platform content moderation and algorithmic amplification remain inaccessible to independent verification. Our findings reveal an "accountability paradox": as platforms increasingly rely on AI systems, they simultaneously restrict the capacity for independent oversight. We propose targeted policy interventions aligned with the AI Risk Management Framework of the National Institute of Standards and Technology [80], emphasizing federated access models and enhanced regulatory enforcement.

2505.17961 2026-06-15 stat.ME cs.AI math.ST stat.AP stat.TH 版本更新

Federated Causal Inference from Multi-Site Observational Data via Propensity Score Aggregation

基于倾向得分聚合的多中心观测数据联邦因果推断

Rémi Khellaf, Aurélien Bellet, Julie Josse

发表机构 * University of Technology, CNRS, France(法国技术大学、国家科学研究中心)

AI总结 提出通过联邦学习聚合各站点倾向得分,利用成员权重估计平均处理效应,解决多中心观测数据因隐私限制无法集中的因果推断问题。

详情
AI中文摘要

因果推断通常假设可以集中访问个体层面数据。然而,在实践中,数据往往分散在多个站点,由于隐私、后勤或法律限制,集中化不可行。我们通过联邦学习方法从分散的观测数据中估计平均处理效应来解决这个问题,允许通过交换聚合统计量而非个体层面数据进行推断。我们提出了一种新方法,使用成员权重(定义为给定协变量条件下站点成员的概率)通过联邦加权平均局部得分来估计倾向得分。成员权重可以使用标准联邦学习算法通过参数或非参数分类模型灵活估计。得到的倾向得分用于构建联邦逆概率加权和增强逆概率加权估计量。与元分析方法(当任何站点违反积极性时失败)相比,我们的方法利用跨站点处理分配的异质性来改善重叠。我们表明,在站点层面的样本量、处理机制和协变量分布异质性下,联邦逆概率加权和增强逆概率加权表现良好。理论分析以及在模拟和真实数据上的实验证明了相对于元分析及相关方法的明显优势。

英文摘要

Causal inference typically assumes centralized access to individual-level data. Yet, in practice, data are often decentralized across multiple sites, making centralization infeasible due to privacy, logistical, or legal constraints. We address this problem by estimating the Average Treatment Effect (ATE) from decentralized observational data via a Federated Learning (FL) approach, allowing inference through the exchange of aggregate statistics rather than individual-level data. We propose a novel method to estimate propensity scores via a federated weighted average of local scores using Membership Weights (MW), defined as probabilities of site membership conditional on covariates. MW can be flexibly estimated with parametric or non-parametric classification models using standard FL algorithms. The resulting propensity scores are used to construct Federated Inverse Propensity Weighting (Fed-IPW) and Augmented IPW (Fed-AIPW) estimators. In contrast to meta-analysis methods, which fail when any site violates positivity, our approach exploits heterogeneity in treatment assignment across sites to improve overlap. We show that Fed-IPW and Fed-AIPW perform well under site-level heterogeneity in sample sizes, treatment mechanisms, and covariate distributions. Theoretical analysis and experiments on simulated and real-world data demonstrate clear advantages over meta-analysis and related approaches.
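
The aggregation step rests on the law of total probability: the pooled propensity score is the membership-weighted average of the site-level scores, e(x) = sum over sites k of P(site = k | x) * e_k(x). The sketch below illustrates that identity and a plain IPW estimate built on it. It is a simplified, assumption-laden illustration: the function names are hypothetical, the records are pooled in one place for readability (a real federated run would exchange only aggregate statistics), and the membership weights and local scores are taken as already fitted.

```python
def federated_propensity(x, local_scores, membership_weights):
    """Pooled propensity e(x) = sum_k P(site = k | x) * e_k(x).

    local_scores[k](x): site k's estimate of P(treated | x, site = k).
    membership_weights[k](x): estimate of P(site = k | x); must sum to 1 over k.
    """
    weights = [mw(x) for mw in membership_weights]
    assert abs(sum(weights) - 1.0) < 1e-9, "membership weights must sum to 1"
    return sum(w * e_k(x) for w, e_k in zip(weights, local_scores))

def ipw_ate(records, propensity):
    """Inverse propensity weighting estimate of the average treatment effect.

    records: iterable of (x, treated, outcome) with treated in {0, 1}.
    """
    total, n = 0.0, 0
    for x, treated, outcome in records:
        e = propensity(x)
        total += treated * outcome / e - (1 - treated) * outcome / (1.0 - e)
        n += 1
    return total / n

# Toy usage: site A never treats at this x (positivity fails locally),
# site B treats with probability 0.6, and x is equally likely at either site.
local = [lambda x: 0.0, lambda x: 0.6]
member = [lambda x: 0.5, lambda x: 0.5]
pooled = federated_propensity(0, local, member)
```

In the toy usage the pooled score stays strictly between 0 and 1 even though one site violates positivity on its own, which is the overlap benefit the abstract contrasts with meta-analysis.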

2512.02318 2026-06-15 cs.CR cs.AI 版本更新

COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers

认知:从评估到对抗多模态大语言模型CAPTCHA求解器

Junyu Wang, Changjia Zhu, Yuanbo Zhou, Lingyao Li, Xu He, Mingkui Wei, Junjie Xiong

发表机构 * Missouri University of Science and Technology(密苏里科技大学) University of South Florida(南佛罗里达大学) Visa Inc.(Visa公司) George Mason University(乔治梅森大学)

AI总结 本文研究多模态大语言模型如何削弱视觉CAPTCHA的安全性,评估7种主流MLLM在18种CAPTCHA任务中的表现,揭示其解决能力及防御策略。

Comments Accepted by USENIX Sec'26

详情
AI中文摘要

本文研究多模态大语言模型(MLLMs)如何削弱视觉CAPTCHA的安全性。我们识别出攻击面,评估7种主流商业和开源MLLM在18种真实CAPTCHA任务中的性能,测量单次准确率、有限重试下的成功率、端到端延迟和每解成本。进一步分析任务特定提示工程和少样本演示对求解器效果的影响。我们发现MLLMs能以接近人类的成本和延迟可靠解决识别导向和低交互CAPTCHA任务,而需要细粒度定位、多步骤空间推理或跨帧一致性的任务对当前模型仍显著困难。通过分析此类MLLM的推理轨迹,我们探讨模型在特定CAPTCHA谜题中成功或失败的机制,并利用这些见解推导出防御导向的CAPTCHA任务选择和强化指南。为验证这些原则,我们给出一个概念验证:依据我们的指南加固一种易受攻击的CAPTCHA类型。我们证明,加入细粒度定位和隐含计数将最先进的MLLM的成功率从超过95%降低到0%,确认结构变化可以有效缓解威胁。最后讨论平台运营商在滥用缓解流程中部署CAPTCHA的含义。

英文摘要

This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CAPTCHA solving using off-the-shelf models. We evaluate 7 representative MLLMs on 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. We further validate our findings through a supplemental external dataset and an adaptive-attacker setting with session memory, while also analyzing the impact of task-specific prompt engineering and few-shot demonstrations on solver effectiveness. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, whereas tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder for current models. By examining the reasoning traces of such MLLMs, we investigate the underlying mechanisms of why models succeed/fail on specific CAPTCHA puzzles and use these insights to derive defense-oriented guidelines for selecting and strengthening CAPTCHA tasks. To validate these principles, we present a proof-of-concept by hardening a vulnerable CAPTCHA type using our guidelines. We demonstrate that incorporating fine-grained localization and implicit counting reduces the success rate of state-of-the-art MLLMs from over 95% to 0%, confirming that structural changes can effectively mitigate the threat. We conclude by emphasizing the urgent need for CAPTCHA redesign as MLLM capabilities increasingly threaten existing defenses. Code Availability (https://doi.org/10.5281/zenodo.20406852).

2605.18784 2026-06-15 q-fin.RM cs.AI cs.CR cs.CY econ.GN q-fin.EC 版本更新

The Insurability Frontier of AI Risk: Mapping Threats to Affirmative Coverage, Silent Exposures, and Exclusions

AI风险的可保险边界:将威胁映射到积极保险、沉默暴露和排除

Alex Leung, Rex Zhang, Ervin Ling, Kentaroh Toyoda, SiewMei Loh

发表机构 * Munich Re(慕尼黑再保险) Armilla Tokio Marine Kiln(东京海上Kiln) CFC Apollo ibott Coalition

AI总结 本文研究了AI风险在商业保险中的可保险性边界,通过分析55类AI威胁与26种保险产品和排除制度,揭示了四个层次的可保险性前沿:积极保险的风险、沉默AI暴露、主动排除的风险以及传统私人保险结构之外的风险。

Comments Version 2

详情
AI中文摘要

代理AI的快速扩散为商业保险创造了一个新的覆盖问题:一些AI中介的损失现在被积极保险,一些在传统网络安全、技术错误与遗漏(E&O)、董事与高管(D&O)、雇佣实践责任(EPLI)、犯罪和媒体保单下产生沉默AI暴露,而其他则被主动排除。本文通过编码55类AI威胁与26种保险产品、批单和排除制度,利用公开承运商材料和OWASP/MITRE威胁目录,确定了四个层次的可保险性前沿:积极保险的风险、沉默AI暴露、主动排除的风险以及传统私人保险结构之外的风险。我们的编码测量公开声明的定位,而非执行合同的措辞;头条统计数据描述承运商公开声明的覆盖情况,而非任何具体索赔将支付什么。三个模式显现。首先,积极AI覆盖开始通过主要风险重点进行区分:公开材料通常将慕尼黑再保险定位在模型性能和漂移,Armilla和 Lloyd's 市场部分围绕幻觉和更广泛的AI责任,Tokio Marine Kiln和CFC围绕知识产权和技术E&O关注,Apollo ibott围绕新兴自主系统责任,Coalition围绕深度伪造和AI增强的网络安全响应。其次,传统业务线在AI作为工具而非损失法律原因的情况下保留沉默AI暴露。第三,基础模型集中是最清晰的真正新型可保险性前沿,因为上游模型失效可能同时使众多分出公司的损失相互关联;相关市场设计问题是每个候选结构放松了哪些可保险性约束,而不是仅仅存在哪种系统性风险模板。

英文摘要

The rapid diffusion of agentic AI has created a new coverage problem for commercial insurance: some AI-mediated losses are now affirmatively insured, some create silent-AI exposure under legacy cyber, technology errors-and-omissions (E&O), directors-and-officers (D&O), employment practices liability (EPLI), crime, and media policies, and others are being actively excluded. This paper maps that emerging boundary by coding 55 AI threat classes against 26 insurance products, endorsements, and exclusion regimes using public carrier materials and OWASP/MITRE threat catalogs. We identify a four-tier insurability frontier: affirmatively insured perils, silent-AI exposures, actively excluded perils, and perils outside conventional private insurance structures. Our coding measures publicly claimed positioning rather than executed contract wording; the headline statistics describe what carriers publicly state about coverage, not what would be paid in any specific claim. Three patterns emerge. First, affirmative AI coverage is beginning to differentiate by primary risk emphasis: public materials often position Munich Re around model performance and drift, Armilla and parts of the Lloyd's market around hallucination and broader AI liability, Tokio Marine Kiln and CFC around IP and technology E&O concerns, Apollo ibott around emerging autonomous system liability, and Coalition around deepfake and AI-enabled cyber response. Second, legacy lines retain silent-AI exposure where AI is an instrumentality rather than the legal cause of loss. Third, foundation model concentration is the clearest genuinely novel insurability frontier because upstream model failure can correlate losses across many cedents at once; the relevant market design question is which insurability constraint each candidate structure relaxes, not merely which systemic risk template exists.

2605.26702 2026-06-15 cs.CV cs.AI cs.CR cs.LG 版本更新

Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling

通过三阶SO(3)表示耦合的旋转不变球面水印

Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Antonios Argyriou, Wu Liu, Weiping Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对全景图像在任意3D旋转下水印鲁棒性不足的问题,提出利用三阶SO(3)表示耦合构造旋转不变的球面双谱,将水印嵌入高阶球谐系数并从不变标量中提取,实现理论保证的旋转不变性和高视觉保真度。

Comments ICML 2026

详情
AI中文摘要

全景图像的可靠水印面临任意3D旋转的根本挑战。由于全景图定义在球面上,它们在$SO(3)$作用下自然变换,使得传统的平面表示和基于增强的鲁棒策略变得不充分且缺乏理论保证。为了解决这个问题,我们将全景图表示为球面信号,并利用$SO(3)$表示理论推导出可证明的旋转不变描述符。虽然球谐系数在旋转下等变变换,但自然的旋转不变构造通常限于零阶统计量,这消除了方向信息并严重限制了嵌入容量。在这项工作中,我们通过张量积耦合高阶$SO(3)$不可约表示并投影到平凡表示,引入了一种有原则的三阶不变构造。这产生了球面不变双谱,它在保持严格旋转不变性的同时保留了相位信息。利用这一特性,我们将水印嵌入到高阶球谐系数中,并从不变双谱标量中恢复它们,从而在任意3D旋转下实现可靠的提取。我们提供了其$SO(3)$不变性的理论证明,并通过实验证明其对连续旋转具有近乎完美的鲁棒性,同时保持高视觉保真度。

英文摘要

Reliable watermarking of panoramic imagery is fundamentally challenged by arbitrary 3D rotations. As panoramas are defined on the sphere, they naturally transform under the action of $SO(3)$, rendering conventional planar representations and augmentation-based robustness strategies inadequate and devoid of theoretical guarantees. To address this, we formulate panoramas as spherical signals and leverage $SO(3)$ representation theory to derive provably rotation-invariant descriptors. While spherical harmonic coefficients transform equivariantly under rotations, the natural invariant constructions are typically limited to zeroth-order statistics which eliminate directional information and severely constrain embedding capacity. In this work, we introduce a principled third-order invariant construction by coupling higher-order $SO(3)$ irreducible representations via tensor products and projecting onto the trivial representation. This yields a spherical invariant bispectrum that preserves phase information while remaining strictly rotation-invariant. Leveraging this property, we embed watermarks into higher-order spherical harmonic coefficients and recover them from invariant bispectral scalars, enabling reliable extraction under arbitrary 3D rotations. We provide a theoretical proof of $SO(3)$ invariance for it and demonstrate experimentally its near-perfect robustness to continuous rotations while maintaining high visual fidelity.

2605.28591 2026-06-15 cs.CL cs.AI 版本更新

Models That Know How Evaluations Are Designed Score Safer

了解评估设计方式的模型安全得分更高

Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi

发表机构 * ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center(图宾根ELLIS研究所、图宾根马克斯·普朗克智能系统研究所、图宾根人工智能中心)

AI总结 本文通过微调模型使其掌握评估的元知识(如可验证结构或道德困境),发现这会导致模型在安全基准测试中表现更安全,从而引入了一种独立于显式记忆或评估意识的新混淆因素。

详情
AI中文摘要

AI安全评估的有效性取决于模型在受控环境和部署环境中行为的一致性。先前的研究已经发现测试时的上下文线索(例如假设场景)是口头评估意识和后续行为转变的来源。在本文中,我们研究了这一现象的一个潜在解释:评估元知识,定义为关于评估结构特征的参数化知识。类似于数据集污染(基准暴露通过记忆导致更高性能),我们假设在描述评估实践的文本上训练的模型可能隐式地学会识别和响应类似评估的上下文,例如通过接触关于AI基准测试的科学文章或社交媒体帖子。为了验证这一点,我们在描述评估特征(如可验证结构或道德困境)的合成文档上微调模型。在五个安全基准上评估这个微调模型,我们发现它比基础模型和控制模型显著更安全。即使将分析限制在缺乏明确评估意识口头表达的响应中,这种行为转变仍然存在。我们的结果表明,评估元知识可能夸大安全基准性能,引入了一种独立于显式记忆或口头评估意识的新混淆因素,因此难以检测。这些发现对AI安全评估的设计和解释具有重要意义。我们的代码和模型可在 https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge 获取。

英文摘要

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on five safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness and thus challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.

2606.00947 2026-06-15 cs.LG cs.AI 版本更新

Silent Failures in Federated Personalization of Foundation Models

联邦基础模型个性化中的静默失败

YongKyung Oh, Alex Bui

发表机构 * Medical & Imaging Informatics (MII) Group, University of California, Los Angeles (UCLA)(医学与影像信息学(MII)组,加州大学洛杉矶分校(UCLA))

AI总结 本文提出联邦基础模型个性化中因隐私约束导致的一类信任失败——静默失败,包括偏差放大、公平性崩溃和对齐侵蚀,并引入六种静默失败模式的分类法,强调隐私保护训练不足以保障可信部署。

详情
AI中文摘要

基础模型通过联邦学习在分散的私有数据上越来越个性化,并在日益增长的上市后监管要求下大规模部署。我们认为这种趋同产生了一类独特且未被充分认识的信任失败,我们称之为“静默失败”。这些包括偏差放大、公平性崩溃和对齐侵蚀,这些可能仍然难以检测,因为联邦学习的隐私约束限制了对模型行为的可见性。对现有基准的景观分析揭示了结构性鸿沟。联邦基准评估系统性能,但对模型行为的洞察有限,而集中式信任基准评估行为,但需要与联邦隐私不兼容的模型访问。我们引入了一个由基础模型个性化、数据集偏移和核心联邦约束相互作用产生的六种静默失败模式的分类法。我们的分析表明,仅靠隐私保护训练不足以实现可信部署。最后,我们提出了一个隐私保护行为评估的研究议程,并建议将静默失败作为可信联邦人工智能的标准诊断类别。

英文摘要

Foundation models are increasingly personalized on decentralized private data through federated learning and are now deployed at scale under growing regulatory requirements for post-market monitoring. We argue that this convergence creates a distinct and under-recognized class of trustworthiness failures, which we term "Silent Failures." These include amplified bias, fairness collapse, and alignment erosion that may remain difficult to detect because federated learning's privacy constraints limit visibility into model behavior. A landscape analysis of existing benchmarks reveals a structural divide. Federated benchmarks evaluate system performance but provide limited insight into model behavior, whereas centralized trustworthiness benchmarks assess behavior but require model access incompatible with federated privacy. We introduce a taxonomy of six silent failure modes arising from the interaction of foundation model personalization, dataset shift, and core federated constraints. Our analysis shows that privacy-preserving training alone is insufficient for trustworthy deployment. We conclude with a research agenda for privacy-preserving behavioral evaluation and propose that silent failures become a standard diagnostic category for trustworthy federated artificial intelligence.

2606.02995 2026-06-15 cs.CR cs.AI cs.IR cs.LG 版本更新

Patcher: Post-Hoc Patching of Backdoored Large Language Models

Patcher: 后门大型语言模型的事后修补

Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu, Minghong Fang

发表机构 * University of Louisville(路易斯维尔大学) University of North Texas(北得克萨斯大学)

AI总结 提出Patcher框架,仅利用单个失败案例和模型参数,通过基于梯度的显著性定位后门触发器,并采用约束微调消除触发-响应关联,同时保持模型效用。

Comments To appear in the USENIX Security Symposium, 2026

详情
AI中文摘要

大型语言模型仍然容易受到越狱后门攻击,其中对手污染安全对齐数据以嵌入隐藏触发器,从而绕过安全机制。现有防御通常需要全面的攻击信息或多个触发示例,使得当防御者仅观察到单个报告失败案例而不知道其源于后门攻击还是自然对齐错误时,这些防御不切实际。本文提出Patcher,一个事后防御框架,仅使用单个报告失败案例和模型参数来修复后门语言模型。Patcher分两个阶段运行。首先,通过计算基于响应的梯度显著性分数并应用自适应聚类将触发器与良性上下文分离来定位后门触发器。其次,通过约束微调目标修补模型,该目标打破触发-响应关联,同时通过KL散度约束保持良性任务效用和对非触发越狱攻击的鲁棒性。我们在多种后门攻击策略下进行了广泛评估,并证明Patcher成功定位触发器并中和后门,同时保持模型效用。我们进一步展示了针对旨在规避我们防御的自适应攻击的鲁棒性。这项工作代表了向部署语言模型中训练时攻击的实际防御迈出的重要一步。

英文摘要

Large language models remain vulnerable to jailbreak backdoor attacks, where adversaries poison safety alignment data to embed hidden triggers that bypass safety mechanisms. Existing defenses often require comprehensive attack information or multiple triggered examples, making them impractical when defenders only observe a single reported failure case without knowing whether it stems from a backdoor attack or a natural alignment bug. This paper presents Patcher, a post-hoc defense framework that repairs backdoored language models using only a single reported failure case and the model parameters. Patcher operates in two stages. First, it localizes backdoor triggers by computing response-conditioned gradient-based saliency scores and applying adaptive clustering to separate triggers from benign context. Second, it patches the model through a constrained fine-tuning objective that breaks the trigger-response association while preserving benign-task utility and robustness to non-triggered jailbreak attacks through KL-divergence constraints. We conduct extensive evaluations across multiple backdoor attack strategies and demonstrate that Patcher successfully localizes triggers and neutralizes backdoors while maintaining model utility. We further show robustness against adaptive attacks designed to evade our defense. This work represents a significant step toward practical defenses against training-time attacks in deployed language models.

2605.21006 2026-06-15 cs.AI cs.CL cs.LG 版本更新

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

扮演魔鬼的代言人:现成的人格向量在顺从性上与针对性引导相媲美

Ishaan Kelkar, Nebras Alam, Vikram Kakaria, Madhur Panwar, Vasu Sharma, Maheep Chaudhary

发表机构 * University of Toronto(多伦多大学) Princeton University(普林斯顿大学) Purdue University(普渡大学) EPFL(洛桑联邦理工学院) Algoverse Independent(独立)

AI总结 本文研究了不同人格对顺从性的影响,发现现成的人格引导向量在减少顺从性方面与针对性引导相当,且在用户正确时保持准确性。

详情
Journal ref
ICML, Pluralistic Alignment Workshop, 2026
AI中文摘要

我们研究了不同人格对顺从性的影响:模型在用户错误时仍同意用户。标准缓解方法,对比激活添加(CAA),从顺从性和诚实响应的标记对中推导出引导方向。本研究评估了现成的人格引导向量是否能作为替代方案,这些向量最初是为一般角色扮演开发的,且未在顺从性数据上训练。在两个指令微调模型中,引导至以怀疑或审查为特征的人格可将顺从性减少到CAA效果的约68%和98%,且不同于CAA,在用户正确时保持准确性。效果也是不对称的:引导至顺从的人格不会产生镜像增加的顺从性。几何上,人格向量在激活空间的方向上与顺从性方向基本无关。总体而言,这些发现表明,顺从性应被视为人格层面的属性,而非单一可引导方向。我们在此发布代码:https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.

英文摘要

We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately 68% and 98% of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.
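
For readers unfamiliar with activation steering, the mechanics can be sketched on toy vectors. This is an illustrative sketch only, not the paper's code: it assumes the usual difference-of-means construction for a CAA-style direction, a simple additive update to a hidden activation, and cosine similarity as the measure of how aligned two directions are. All names and the sign convention (honest minus sycophantic) are assumptions.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def caa_direction(sycophantic_acts, honest_acts):
    """Difference of mean activations, pointing from sycophantic toward honest."""
    mu_s = mean_vector(sycophantic_acts)
    mu_h = mean_vector(honest_acts)
    return [h - s for h, s in zip(mu_h, mu_s)]

def cosine(u, v):
    """Cosine similarity; values near 0 mean the two directions are nearly orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def steer(activation, direction, alpha):
    """Add alpha * direction to a hidden activation (alpha sets steering strength)."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Toy usage with 2-dimensional "activations".
direction = caa_direction([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]])
similarity = cosine([1.0, 0.0], [0.0, 1.0])
steered = steer([0.0, 0.0], direction, 2.0)
```

A low cosine similarity between a persona vector and the CAA direction would correspond to the abstract's geometric observation that the two are largely independent in activation space.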

9. 评测、基准与数据集 37 篇

2606.13715 2026-06-15 cs.AI cs.CL cs.MA 新提交

WorkBench Revisited: Workplace Agents Two Years On

WorkBench 再探:两年后的工作场所智能体

Olly Styles

发表机构 * GitHub

AI总结 本文重新评估2024至2026年间WorkBench基准上智能体的进展,发现前沿模型在能力和安全性上均有显著提升,但开放权重模型降低了高性能门槛。

Comments 8 pages, 3 figures. Follow-up to arXiv:2405.00823

详情
AI中文摘要

2024年3月,WorkBench上表现最好的智能体GPT-4完成了43%的任务,并在26%的任务中采取了意外的有害行为(例如给错误的人发送电子邮件)。我们在2026年6月重新审视该基准,发现迄今为止最好的智能体Claude Opus 4.8完成了89%的任务,并仅在2.5%的任务中采取了意外的有害行为。除了前沿智能体性能的显著进步外,有三点值得注意。首先,在WorkBench上,能力与安全性是相辅相成的,而非相互权衡,因此完成最多任务的模型造成的意外损害也最少。其次,虽然几类错误已被完全消除,但前沿模型仍然会犯一些基本错误,有时会导致不可逆转的损害,例如将电子邮件发送给错误的人。第三,开放权重模型的兴起大幅降低了此前仅专有模型才能达到的性能水平的成本,而前沿模型的成本则保持相对稳定。我们发布了该基准的更新版本,包括数据与代码质量改进、新的模型评分以及自2024年以来WorkBench上智能体进展的分析。

英文摘要

The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unintended damage. Second, while several classes of error have been totally eliminated, frontier models still make some basic mistakes that occasionally result in irreversible harm, such as sending an email to the wrong person. Third, the rise of open-weight models has drastically lowered costs for a performance level that was previously only accessible to proprietary models, while frontier costs have stayed relatively stable. We release an updated version of the benchmark with data and code quality improvements, new model scores, and analysis of agent progress on WorkBench since 2024.

2606.13815 2026-06-15 cs.AI cs.CL 新提交

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

Poker Arena: 大型语言模型中策略推理与记忆的多轴剖析

Pratham Singla, Shivank Garg, Vihan Singh

发表机构 * Indian Institute of Technology Roorkee(印度理工学院罗尔基分校) Raeth AI

AI总结 提出Poker Arena平台,通过三层记忆架构和九轴认知剖面分解策略推理,揭示标量排行榜系统性误排模型能力结构。

Comments 33 pages, ICML Workshop

详情
AI中文摘要

不确定性下的策略推理支撑着谈判、金融和政策中的关键决策,但现有的游戏基准将异质推理维度压缩为单一标量,导致前沿LLM的能力结构未被审视。我们引入Poker Arena,一个无限注德州扑克锦标赛平台,该平台将三层记忆架构(手牌内、会话内和跨会话)与九轴认知剖面相结合,将策略推理分解为可解释的维度,如下注规模校准和位置意识。我们在50个会话(每个会话1000手牌)和受控记忆消融实验中评估了七个前沿模型;锦标赛筹码和聚合轴得分对模型进行了不同排序:Claude Opus 4.6赢得+15,730筹码和14次第一名,但在平均轴得分上仅排名第五(共七个),而持久记忆对某些模型有帮助,对另一些则有损害。这些发现表明,多轴评估揭示了标量排行榜系统性误排的能力结构,其中跨维度一致性优于任何单一维度的峰值性能。

英文摘要

Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined. We introduce Poker Arena, a no-limit Texas Hold'em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. We evaluate seven frontier models across 50 sessions of 1,000 hands and a controlled memory ablation; tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 wins +$15,730 chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score, while persistent memory helps some models and hurts others. These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.

2606.14031 2026-06-15 cs.AI 新提交

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

治疗性药物-疾病关系的适用条件提取

Guanting Luo, Noriki Nishida, Yuji Matsumoto, Yuki Arase

发表机构 * The University of Osaka(大阪大学) RIKEN(理化学研究所) Institute of Science Tokyo(东京科学大学) Tohoku University(东北大学)

AI总结 提出从生物医学文献中提取药物-疾病治疗关系适用条件的任务,构建首个手动标注数据集,并改进LoRA方法以考虑药物与疾病间关系,在多个评估设置中优于基线。

详情
AI中文摘要

识别某种药物对目标疾病产生治疗效果的适用条件对于临床决策支持至关重要。然而,现有的大多数生物医学信息提取方法仅关注识别药物与疾病之间的关系,而很大程度上忽略了这些关系适用的上下文特定条件。为解决这一问题,我们引入了从生物医学研究文献中提取治疗性药物-疾病关系适用条件的任务。我们创建了首个数据集,在生物医学论文摘要上手动标注了药物、疾病和适用条件的三元组,包含1,119个药物-疾病对。利用该数据集,我们系统评估了一系列现有方法的性能。此外,我们提出了一种新方法,增强LoRA以考虑药物与疾病之间的关系。我们的方法在不同评估设置中均优于强基线。本文的源代码和数据集可从以下网址获取:https://github.com/guantingluo98/Drug-ACE

英文摘要

Identifying the conditions under which a drug has a therapeutic effect on a target disease is crucial for clinical decision-making support. However, most existing biomedical information extraction methods have focused on identifying only relations between drugs and diseases, while largely overlooking the context-specific conditions where such relations can apply. To address this problem, we introduce the task of applicability condition extraction for therapeutic drug-disease relations from biomedical research literature. We create the first dataset that has manually annotated triples of drugs, diseases, and applicability conditions on biomedical paper abstracts with 1,119 drug-disease pairs. Using this dataset, we systematically evaluate the performance of a range of existing methods. In addition, we propose a new method that enhances LoRA to consider relations between drugs and diseases. Our method consistently outperforms strong baselines across different evaluation settings. The source code and dataset of this paper can be obtained from: https://github.com/guantingluo98/Drug-ACE

2606.14240 2026-06-15 cs.AI 新提交

AFFORDANCE20Q: Evaluating Affordance Reasoning from Physical Properties

AFFORDANCE20Q:从物理属性评估可供性推理

Yifan Jiang, Meige Yang, Zitong Li, Jay Pujara

发表机构 * Information Sciences Institute, University of Southern California(南加州大学信息科学研究所) University of Southern California(南加州大学)

AI总结 提出Affordance20Q基准,通过二十问游戏评估模型从物理属性推理物体可供性的能力,发现LLM与人类差距约20分,并开发KARI方法将开源模型提升最多15.2分。

详情
AI中文摘要

可供性推理,即从物体的物理属性(如形状和材料)推断其动作可能性,是人类物理理解的基础,对大型语言模型(LLM)也越来越关键。然而,现有的可供性基准大多在评估设置中暴露明确的物体身份,使模型能够依赖记忆的物体-可供性映射,而不是基于物理属性进行推理。为弥补这一空白,我们引入了Affordance20Q,这是一个新颖的可供性推理基准,以二十问(20 Questions)游戏的形式呈现,不暴露物体身份。在每个游戏中,模型通过询问关于物体物理属性的是/否问题,从候选集中识别隐藏物体的可供性。Affordance20Q包含1,009个游戏,涵盖454个物体和59种可供性,所有数据均经过手动筛选、细化和标注。我们对15个最先进的LLM进行了全面实验,发现与人类表现相比存在显著差距(约20分)。基于KL的信息增益(IG)分析进一步表明,随着游戏进行,模型未能提出具有区分性的问题。为缩小差距,我们开发了知识库锚定的规则归纳(KARI),这是一个基于LLM的流程,用于生成基于知识库(KB)证据的可供性规则。KARI将开源LLM的性能提升了最多15.2分,而KB的有限覆盖阻碍了进一步的提升。我们在 https://github.com/1171-jpg/Affordance20Q.git 发布所有代码和数据。

英文摘要

Affordance reasoning, the inference of an object's action possibilities from its physical properties (e.g., shape and material), is fundamental to human physical understanding and increasingly critical for Large Language Models (LLMs). However, existing affordance benchmarks largely expose explicit object identities in the evaluation setup, allowing models to rely on memorized object-affordance mappings rather than reasoning over physical properties. To address this gap, we introduce Affordance20Q, a novel affordance reasoning benchmark formulated as a 20-Questions game without exposing the object's identity. In each game, the model identifies a hidden object's affordance from a candidate set by asking yes/no questions about its physical properties. Affordance20Q comprises 1,009 games over 454 objects and 59 affordances, all manually filtered, refined, and annotated. We conduct comprehensive experiments with 15 state-of-the-art LLMs and find a substantial gap (~20 points) compared to human performance. A KL-based information-gain (IG) analysis further shows that models fail to ask discriminating questions as the game progresses. To close the gap, we develop KB-Anchored Rule Induction (KARI), a pipeline based on LLMs that generates affordance rules grounded in evidence from knowledge bases (KBs). KARI improves open-source LLMs by up to 15.2 points, while the limited coverage of KBs hinders further gains. We release all our code and data at https://github.com/1171-jpg/Affordance20Q.git
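The information-gain idea in this abstract can be illustrated with a minimal sketch. It assumes a uniform prior over a small, hypothetical candidate set and deterministic yes/no answers, and it measures a question by the expected KL divergence between posterior and prior; the paper's exact IG definition may differ, and the function names below are illustrative only.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits, skipping zero-probability terms of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_information_gain(prior, answers):
    """Expected KL(posterior || prior) for one yes/no question.

    prior:   probability of each candidate affordance (hypothetical set)
    answers: answers[i] is True if candidate i implies a "yes" answer
    """
    gain = 0.0
    for outcome in (True, False):
        # Probability of observing this answer under the prior.
        p_outcome = sum(p for p, a in zip(prior, answers) if a == outcome)
        if p_outcome == 0:
            continue
        # Posterior after conditioning on the observed answer.
        posterior = [p / p_outcome if a == outcome else 0.0
                     for p, a in zip(prior, answers)]
        gain += p_outcome * kl_divergence(posterior, prior)
    return gain

prior = [0.25, 0.25, 0.25, 0.25]
# A question that splits the candidates in half versus one that splits nothing.
balanced = expected_information_gain(prior, [True, True, False, False])
uninformative = expected_information_gain(prior, [True, True, True, True])
```

Under these assumptions a question that halves the candidate set yields one bit, while a question every candidate answers the same way yields none, which is the sense in which a question can fail to be discriminating.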

2606.14516 2026-06-15 cs.AI cs.CL cs.CY 新提交

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

Every Eval Ever:AI评估结果的统一模式与社区仓库

Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Jon Crall, Tommaso Cerruti, Yanan Long, Yifan Mai, Sanchit Ahuja, Asaf Yehudai, Marek Šuppa, John P. Lalor, Oluwagbemike Olowe, Jatin Ganhotra, Brian H. Hu, Eliya Habba, Andrew M. Bean, Chang Liu, Sander Land, Steven Dillmann, Aniketh Garikaparthi, Elron Bandel, Saki Imai, James Edgell, Wm. Matthew Kennedy, Jenny Chim, Patrick Meusling, Asteria Kaeberlein, Venkata Ramachandra Karthik Chundi, Manasi Patwardhan, Martin Ku, Austin Meek, Leon Knauer, Brian Wingenroth, Srishti Yadav, Usman Gohar, Felix Friedrich, Michelle Lin, Jennifer Mickel, Arman Cohan, Stella Biderman, Irene Solaiman, Zeerak Talat, Anka Reuel, Mubashara Akhtar, Gjergji Kasneci, Avijit Ghosh, Leshem Choshen

发表机构 * Technical University Munich(慕尼黑工业大学) Munich Center for Machine Learning(慕尼黑机器学习中心) Weizenbaum Institute(魏岑鲍姆研究所) Zuse Institute Berlin(柏林祖泽研究所) Evidence Prime Trustible Kitware ETH Zurich(苏黎世联邦理工学院) StickFlux Labs Stanford University(斯坦福大学) Northeastern University(东北大学) IBM Research(IBM研究院) Comenius University Bratislava(布拉迪斯拉发夸美纽斯大学) Cisco(思科) University of Notre Dame(圣母大学) Hebrew University of Jerusalem(耶路撒冷希伯来大学) University of Oxford(牛津大学) Ohio University(俄亥俄大学) Writer TCS Research(塔塔咨询服务研究院) Oxford University Press(牛津大学出版社) Queen Mary University of London(伦敦玛丽女王大学) Technical University Berlin(柏林工业大学) University of Delaware(特拉华大学) Cinemo Johns Hopkins University(约翰霍普金斯大学) University of Copenhagen(哥本哈根大学) ELLIS(欧洲学习与智能系统实验室) Iowa State University(爱荷华州立大学) Meta FAIR University of Montreal(蒙特利尔大学) Mila Quebec AI Institute(Mila魁北克人工智能研究所) EleutherAI Yale University(耶鲁大学) Hugging Face University of Edinburgh(爱丁堡大学) Harvard University(哈佛大学) ETH AI Center(ETH人工智能中心) MIT(麻省理工学院) MIT-IBM Watson Lab(MIT-IBM沃森实验室)

AI总结 针对AI评估结果格式不统一、难以比较的问题,提出首个共享模式与社区众包仓库,通过标准化表示、自动转换器和社区数据库实现跨评估框架的统一。

详情
AI中文摘要

AI评估被广泛用于测试和理解进展。然而,多样化的评估工具带来了不一致性,挑战了分析和比较。首先,结果以不兼容的格式保存,分散在排行榜、论文、博客文章、评估工具日志和自定义仓库中。其次,结果由不同的评估框架创建,这些框架对名义上相同的评估产生不同的分数,并且不一致地记录元数据,阻碍了比较、跨社区评估科学、成本降低和重用。我们介绍了Every Eval Ever,这是第一个用于AI评估结果的共享模式和社区众包仓库。该模式标准化了评估在统一的单个JSON文档中的表示方式。它在设计上与源无关,可以摄取来自评估工具和论文的结果,并可选择存储每个实例的输出以进行细粒度分析。我们贡献了:(i) 一个社区治理的元数据模式及其配套的实例级模式,这是同类标准化工作的首次;(ii) 从流行格式、评估工具和排行榜到统一模式的自动转换器;以及 (iii) 一个托管在Hugging Face上的众包社区数据库,目前涵盖22,235个模型、2,273个独特基准和31种评估格式。

英文摘要

AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how evaluations are represented in a unified, single JSON document. It is source-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per-instance outputs for fine-grained analysis. We contribute: (i) a community-governed metadata schema with a companion instance-level schema, the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.

2606.14571 2026-06-15 cs.AI 新提交

StreamMemBench: Streaming Evaluation of Agent Memory for Future-Oriented Assistance

StreamMemBench: 面向未来辅助的智能体记忆流式评估

Guanming Liu, Yuqi Ren, Hansu Gu, Peng Zhang, Weihang Wang, Jiahao Liu, Ning Gu, Tun Lu

发表机构 * Fudan University(复旦大学) Amazon Stream(亚马逊流)

AI总结 提出StreamMemBench基准,通过两步任务序列测试智能体记忆从流式观察到后续任务中证据回忆、反馈整合与重用能力,实验发现现有系统在证据使用和反馈转化上存在不足。

详情
AI中文摘要

个人智能体记忆的核心作用是将存储的信息和先前的交互转化为面向未来的辅助。在日常使用中,有用的线索来自智能体观察到什么以及用户如何与智能体交互,智能体必须将这些线索从当前请求延续到类似的未来任务中。现有的记忆基准通常孤立地测试对话回忆或任务改进,使得从流式观察到后续辅助的轨迹基本未经测试。我们引入了StreamMemBench,一个流式基准,它围绕EgoLife自我中心流中的每个证据锚点构建一个两步任务序列。初始任务测试证据使用,而后续任务测试反馈和交互经验是否被重用。四个指标诊断证据回忆、初始证据使用、反馈整合和后续重用。在两个骨干模型上的八个记忆系统的实验表明,当前系统通常无法使用观察到的证据或将反馈转化为可靠的后续行为,即使证据已存储或反馈已在局部整合。StreamMemBench在 https://github.com/landian60/StreamMemBench 公开可用。

英文摘要

A central role of personal-agent memory is to turn stored information and prior interactions into future-oriented assistance. In daily use, useful cues come from what the agent observes and how the user interacts with the agent, and the agent must carry them forward from the current request to similar future tasks. Existing memory benchmarks usually test dialogue recall or task improvement in isolation, leaving the trajectory from streaming observations to later assistance largely untested. We introduce StreamMemBench, a streaming benchmark that constructs a two-step task sequence around each evidence anchor from EgoLife egocentric streams. The initial task tests evidence use, while the follow-up task tests whether feedback and interaction experience are reused. Four metrics diagnose evidence recall, initial evidence use, feedback incorporation, and follow-up reuse. Experiments with eight memory systems across two backbones show that current systems often fail to use observed evidence or turn feedback into reliable follow-up behavior, even when evidence is stored or feedback is incorporated locally. StreamMemBench is publicly available at https://github.com/landian60/StreamMemBench.

2606.13684 2026-06-15 cs.CY cs.AI cs.CL cs.LG 交叉投稿

Cross-Dataset Bloom Question Classification: Supervised Models and Prompted LLMs

跨数据集布鲁姆问题分类:监督模型与提示式大语言模型

Abdolali Faraji, Mohammadreza Molavi, Zohreh Rasoulkhani, Mohammadreza Tavakoli, Gábor Kismihók

发表机构 * Leibniz Information Centre for Science and Technology(莱布尼茨科学技术信息中心) University of Genoa(热那亚大学)

AI总结 评估监督ML/DL模型和LLM在跨数据集布鲁姆分类中的泛化能力,发现LLM更稳定,并基于最佳提示策略开发了轻量级UI。

Comments Accepted at AIED 2026. Abdolali Faraji and Mohammadreza Molavi contributed equally to this work

详情
AI中文摘要

自动对评估问题进行布鲁姆分类可以大幅减少教师工作量,但标注具有主观性且依赖教师。先前的机器学习和深度学习方法在数据集内表现良好,但很少在跨数据集设置中评估,导致现实世界的泛化能力不明确;同时,LLM在布鲁姆问题分类中的有效性尚未被系统研究。我们评估了现有ML/DL方法的跨数据集泛化能力,并在五个数据集上使用多种提示策略评估了LLM;最佳提示策略结合了上下文示例和课程特定的动作动词。监督ML/DL模型在未见数据集上性能大幅下降,而LLM更稳定,表明其在多样化教育环境中是一种稳健的替代方案。基于最佳提示策略,我们还开发了一个轻量级用户界面,支持教师自动分类大量问题库;可用性研究表明低工作量和高度可用性。

英文摘要

Automatic Bloom's taxonomy classification of assessment questions can substantially reduce instructor workload, but labeling is subjective and teacher-dependent. Prior machine learning (ML) and deep learning (DL) approaches reported strong within-dataset results, yet were rarely evaluated in cross-dataset settings, leaving real-world generalizability unclear; meanwhile, LLM effectiveness for Bloom question classification has not been systematically studied. We evaluated the cross-dataset generalization of existing ML/DL methods and assessed LLMs with multiple prompting strategies on five datasets; the best prompting strategy combined in-context examples with course-specific action verbs. Supervised ML/DL models degraded substantially on unseen datasets, whereas LLMs were more stable, suggesting a robust alternative across diverse educational contexts. Based on the best prompting strategy, we also presented a lightweight UI that supports instructors in automatically classifying large question banks; a usability study indicated low workload and high usability.

2606.13685 2026-06-15 cs.CL cs.AI 交叉投稿

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

抛硬币的裁判?LLM作为评估者的可靠性与偏见

Abel Yagubyan

发表机构 * Independent Researcher(独立研究员)

AI总结 研究LLM作为评估者在重复评估中的不可靠性,发现偏好翻转率平均13.6%,存在位置偏见,并建议多轮聚合和不确定性报告。

Comments 24 pages, 7 figures

详情
AI中文摘要

LLM作为评估者(LLM-as-a-Judge)现被广泛用于模型输出排名、训练奖励模型和填充公共排行榜,但其运行间可靠性仍缺乏充分表征。我们使用两个OpenAI评估模型(GPT-4o-mini和GPT-4.1-mini)在涵盖10个类别的29个任务上进行了重复的相同评估,每个问题进行50次成对试验和50次逐点试验,并辅以温度和提示敏感性消融实验。在评估者之间,成对偏好平均翻转13.6%的时间,28%的问题翻转率超过20%,一个问题达到56%。GPT-4o-mini还表现出显著的第一位置偏见(72%的A多数,p=0.024)。同时,平均逐点评分差距很小(在10分制上为0.19-0.36),且总体上不具统计显著性,产生了成对-逐点差距:评估者经常选择胜者,即使它们自己的标量分数几乎没有证据表明存在有意义的质量差异。除了评估者内部的不稳定性,评估者间的一致性仅为76%(κ=0.51),语义等价的提示模板在25%的测试案例中改变了多数结果,确定性解码减少了但不消除不一致性。可靠性曲线分析显示,在我们的数据集中,平均需要11次重复试验才能让多数投票以95%的概率恢复50次试验的参考裁决,对于高方差问题则上升至15次。这些发现表明,单次LLM评估对于高风险评估往往噪声过大,多轮聚合、位置随机化和显式不确定性报告应成为标准实践。由于两个评估者均来自同一提供商,跨提供商复制仍是重要的下一步。

英文摘要

LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations. Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, p = 0.024). At the same time, mean pointwise score gaps are small (0.19-0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise-pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference. Beyond within-judge instability, cross-judge agreement is only 76% ($\kappa = 0.51$), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions. These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Because both judges are from a single provider, cross-provider replication remains an important next step.
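The flip-rate and majority-vote quantities discussed in this abstract can be sketched as follows. Two assumptions are made here that the abstract does not confirm: the flip rate is taken as the share of trials that disagree with the majority verdict, and trials are treated as independent with a fixed per-trial agreement probability. The paper's reliability curve is empirical, so this is only a rough analytic counterpart, and the function names are illustrative.

```python
from collections import Counter
from math import comb

def flip_rate(verdicts):
    """Fraction of trials that disagree with the majority verdict."""
    counts = Counter(verdicts)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(verdicts)

def majority_recovery_prob(p_agree, k):
    """P(majority of k independent trials matches the reference), for odd k."""
    return sum(comb(k, j) * p_agree**j * (1 - p_agree)**(k - j)
               for j in range(k // 2 + 1, k + 1))

# Hypothetical question: 50 pairwise trials, 40 prefer answer A.
trials = ["A"] * 40 + ["B"] * 10
rate = flip_rate(trials)

# With 80% per-trial agreement, aggregating 3 trials already helps.
single = majority_recovery_prob(0.8, 1)
triple = majority_recovery_prob(0.8, 3)
```

The binomial sum makes the case for multi-trial aggregation concrete: the closer the per-trial agreement is to 50%, the more repeats a majority vote needs before it becomes trustworthy.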

2606.13706 2026-06-15 cs.AR cs.AI 交叉投稿

HierSVA: A Data Synthesis Pipeline, Dataset, and Benchmark for LLM-Driven Hierarchical Hardware Formal Verification

HierSVA:面向LLM驱动的层次化硬件形式化验证的数据合成流水线、数据集与基准

Maohua Nie, Jiang Zhu, Jingqun Zhang, Zhichen Zeng, Jiayi Wang, Sibo Zhang, Jialin Wang, C. -J. Richard Shi

发表机构 * University of Washington(华盛顿大学)

AI总结 提出HierSVA套件,包含数据合成流水线、数据集和基准,用于LLM驱动的层次化硬件形式化验证;通过RTL预处理与LLM在环流程生成SystemVerilog断言,并构建342模块数据集;设计六轴指标评估断言质量,揭示LLM在层次化验证中的性能与局限。

详情
AI中文摘要

我们提出了HierSVA,一个集流水线、数据集和基准于一体的集成套件,用于LLM驱动的层次化硬件形式化验证。HierSVA-SP将RTL预处理工具链与LLM在环形式化验证流程相结合,为层次化RTL生成参考SystemVerilog断言(SVA)。将其应用于BaseJump STL,得到HierSVA-DS数据集,包含342个模块,具有层次元数据和深度0-9,并附带28个模块-错误对的深层子集,包含自然语言规范和错误变体。HierSVA-B将断言质量分解为六个度量轴:语法正确性、断言证明成功率、空洞性、规范忠实度、突变覆盖率和形式化核心覆盖率。将HierSVA-B应用于12个最近的LLM,揭示了三个发现。第一,模块级编译率为67.1%;在可评估运行生成的断言中,82.1%被非空洞地证明,但相应的断言集仅检测到70.2%的可注入故障,并覆盖了36.2%的形式化核心。第二,在深层子集的211个可评估模型-模块条目中,断言集以0.87的召回率标记有错误的RTL,但预测有错误的输出中有40%在正确RTL上是假阳性,将精度限制在0.60。第三,代理模式改善了S1风格的可证明性和强度指标,但增益趋于平稳并振荡。代码和工件可在 https://github.com/HierSVAAnon/HierSVACodeAndArtifacts 获取。数据集可在 https://huggingface.co/datasets/AnonymousHierSVA/HierSVA 获取。

英文摘要

We present HierSVA, an integrated suite that combines a pipeline, dataset, and benchmark for LLM-driven hierarchical hardware formal verification. HierSVA-SP pairs an RTL preprocessing toolchain with an LLM-in-the-loop formal verification flow to produce reference SystemVerilog Assertions (SVA) on hierarchical RTL. Applying it to BaseJump STL yields HierSVA-DS, a dataset of 342 modules, with hierarchy metadata and depths 0-9, accompanied by a deep subset of 28 module-bug pairs with natural-language specifications and bug variants. HierSVA-B decomposes assertion quality into six metric axes: syntax correctness, assertion proof success rate, vacuity, specification faithfulness, mutation coverage, and formal core coverage. Applying HierSVA-B to twelve recent LLMs reveals three findings. First, the module-level compile rate is 67.1%; among generated assertions in evaluable runs, 82.1% prove non-vacuously, but the corresponding assertion sets detect only 70.2% of eligible injected faults and cover 36.2% of the formal core. Second, on 211 evaluable model-module entries in the deep subset, assertion sets flag buggy RTL with 0.87 recall, but 40% of predicted-buggy outcomes are false positives on correct RTL, limiting precision to 0.60. Third, agentic mode improves S1-style provability and strength metrics, but gains plateau and oscillate. Code and artifacts are available at https://github.com/HierSVAAnon/HierSVACodeAndArtifacts. The dataset is available at https://huggingface.co/datasets/AnonymousHierSVA/HierSVA.
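The precision and recall figures in the second finding are linked by simple arithmetic: if 40% of predicted-buggy outcomes are false positives, precision is 1 - 0.40 = 0.60. A minimal sketch follows. The counts are hypothetical, chosen only to reproduce the reported ratios, and are not taken from the paper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion counts for the 'buggy' class."""
    precision = tp / (tp + fp)  # share of predicted-buggy outcomes that are truly buggy
    recall = tp / (tp + fn)     # share of truly buggy RTL that gets flagged
    return precision, recall

# Hypothetical counts: 100 predicted-buggy outcomes, 40 of them false positives,
# and 9 buggy designs that the assertion sets miss.
precision, recall = precision_recall(tp=60, fp=40, fn=9)
```

High recall with much lower precision, as here, means the assertion sets rarely miss a bug but often raise alarms on correct RTL.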

2606.13735 2026-06-15 cs.AR cs.AI cs.LG cs.PL 交叉投稿

VHDLSuite: Unified Pipeline for LLM VHDL Generation with Data Synthesis and Evaluation

VHDLSuite:面向LLM VHDL生成的统一流水线,包含数据合成与评估

Yijun Shen, Minghao Shao, Yichen Zhao, Zhuoyan Yu, Boyuan Chen, Yik-Cheung Tam, Muhammad Shafique

发表机构 * Center for Data Science, NYU Shanghai, China(上海纽约大学数据科学中心) NYU Tandon School of Engineering, USA(纽约大学Tandon工程学院) NYU Abu Dhabi, UAE(纽约大学阿布扎比分校)

AI总结 提出VHDLSuite基础设施,通过自动基准合成、可执行验证和多模型诊断分析,解决LLM在VHDL生成评估中的不足,并构建含200+问题的VHDLBench基准。

详情
AI中文摘要

大型语言模型(LLM)在寄存器传输级(RTL)代码生成方面展现了令人印象深刻的能力,尤其是针对Verilog。然而,评估它们在其他硬件描述语言(HDL)上的性能,特别是VHDL,仍然有限,尽管其独特的语言特性(如更严格的语义规则)引入了与Verilog不同的评估考量。这种覆盖不足限制了对当前模型在不同结构和语义的硬件设计语言中泛化能力的全面理解。为弥补这一空白,我们引入了VHDLSuite,一个以基准为中心的可扩展VHDL生成评估基础设施,集成了自动基准合成、可执行验证和多模型诊断分析。首先,我们提出一个数据流水线,自动将Verilog设计及其配套测试平台转换为可执行的VHDL基准实例,随后基于VUnit/GHDL进行验证,确保每个发布的任务在VHDL环境中可编译、可运行且可一致检查。其次,我们引入VHDLBench,一个包含超过200个VHDL问题的基准,配有完整且经过验证的测试平台,覆盖广泛的复杂度级别。第三,我们广泛评估了最先进的LLM,并揭示了LLM辅助VHDL生成中的关键挑战。我们的发现为多语言硬件设计的未来工作提供了重要见解和支持。该数据流水线、基准和评估框架将开源。

英文摘要

Large Language Models (LLMs) have shown impressive capabilities in Register Transfer Level (RTL) code generation, particularly for Verilog. However, evaluation of their performance on other Hardware Description Languages (HDLs), especially VHDL, remains limited, although its distinct language characteristics, such as stricter semantic rules, introduce evaluation considerations that differ from Verilog. This lack of coverage restricts a full understanding of how well current models generalize across hardware design languages with differing structures and semantics. To address this gap, we introduce VHDLSuite, a benchmark-centered infrastructure for scalable VHDL generation evaluation, integrating automated benchmark synthesis, executable validation, and multi-model diagnostic analysis. First, we propose a data pipeline that automatically converts Verilog designs and their accompanying testbenches into executable VHDL benchmark instances, followed by VUnit/GHDL-based validation to ensure each released task is compilable, runnable, and consistently checkable in the VHDL environment. Second, we introduce VHDLBench, a benchmark with over 200 VHDL problems with complete and validated testbenches across a wide range of complexity levels. Third, we extensively evaluate cutting-edge LLMs and uncover key challenges specific to LLM-aided VHDL generation. Our findings provide important insights and support future work in multi-language hardware design automation. Our data pipeline, benchmark, and evaluation framework will be open-sourced.

2606.13757 2026-06-15 cs.CR cs.AI 交叉投稿

SEVRA-BENCH: Social Engineering of Vulnerabilities in Review Agents

SEVRA-BENCH:审查智能体中的社会工程漏洞

Rui Melo, Riccardo Fogliato, Sean Zhou, Pratiksha Thaker, Zhiwei Steven Wu

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Microsoft Core AI(微软核心人工智能) Amazon AWS(亚马逊AWS) Databricks

AI总结 提出SEVRA-BENCH基准,通过社会工程攻击框架评估LLM代码审查智能体拒绝恶意PR的能力,发现闭源与开源模型间存在显著安全差距。

详情
AI中文摘要

大型语言模型(LLM)审查者越来越多地用于拉取请求(PR)工作流,其批准有助于决定哪些代码合并到仓库中。这引发了一个静态漏洞检测或代码生成基准未解决的问题:当攻击者同时控制代码更改和随附的PR文本时,自动化审查者能否拒绝恶意贡献?我们引入了SEVRA-BENCH(审查智能体中的社会工程漏洞),一个衡量自动化审查者批准此类对抗性PR频率的基准。SEVRA-BENCH中的每个恶意PR都基于一个先前修复了通用漏洞与暴露(CVE)数据库中列出的漏洞的真实项目提交构建。我们自动反转该修复以恢复原始易受攻击的代码,并将其作为拉取请求提交,包裹在15种社会工程框架之一中,这些框架变化了所声称的内容、支持证据、传达的紧迫性、先前批准的信号以及对权威的诉求。SEVRA-BENCH包含1,062个恶意PR,这些PR来自2025年通用弱点枚举(CWE)Top 25中前10个条目的CVE相关修复。在现实场景中,我们评估了8个当前LLM作为代码审查智能体处理引入先前公开披露漏洞的PR。我们的结果揭示了闭源与开源模型在安全能力上的显著差距。我们希望SEVRA-BENCH将成为推进开源模型并缩小这一差距的宝贵资源。

英文摘要

Large language model (LLM) reviewers are increasingly used in pull-request (PR) workflows, where their approvals help decide which code is merged into a repository. This raises a question that benchmarks for static vulnerability detection or code generation do not address: can an automated reviewer reject a malicious contribution when the attacker controls both the code change and the accompanying PR text? We introduce SEVRA-BENCH (Social Engineering of Vulnerabilities in Review Agents), a benchmark that measures how often an automated reviewer approves such adversarial pull requests. Each malicious PR in SEVRA-BENCH is built from a real project commit that previously fixed a vulnerability listed in the Common Vulnerabilities and Exposures (CVE) database. We automatically invert that fix to restore the original vulnerable code and submit it as a pull request wrapped in one of 15 social-engineering framings, which vary the claims made, the supporting evidence, the urgency conveyed, signals of prior approval, and appeals to authority. SEVRA-BENCH contains 1,062 malicious PRs drawn from CVE-linked fixes across the top 10 entries of the 2025 Common Weakness Enumeration (CWE) Top 25. In a realistic setting, we evaluate 8 current LLMs as code review agents on PRs that introduce vulnerabilities previously reported in public disclosures. Our results reveal a sharp gap in security capabilities between closed- and open-source models. We hope SEVRA-BENCH will serve as a valuable resource for advancing open-source models and narrowing this gap.

2606.13802 2026-06-15 cs.SE cs.AI cs.HC cs.LG 交叉投稿

A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets

电子表格中下一步动作预测的基准测试与框架

Tejas Agrawal, Vu Le, Sumit Gulwani, Gust Verbruggen

发表机构 * University of Waterloo(多伦多大学)

AI总结 针对电子表格缺乏自动补全功能的问题,提出一个基准测试,通过人工整理动作序列和在线评估方法,比较多种预测模型,分析动作保存、误报、效率等特性。

Comments Accepted at ICML 2026. Code and benchmark: https://github.com/Tej-55/NAPE

详情
AI中文摘要

预测性代码补全极大地提高了开发人员的工作速度。电子表格虽然使用更为普遍,却几乎没有这类自动补全功能。为了弥补这一差距,我们引入了一个基准测试,面向观察电子表格中用户动作序列并预测未来动作的系统。两个挑战是(1)公共电子表格语料库中缺乏编辑历史,以及(2)电子表格动作空间复杂(空间、时间、复合)。为了解决(1),我们手动整理了52个序列,共12K个动作,用于重现公共语料库中的电子表格,这些序列以参数化启发式方法和LLM精炼为起点。为了解决(2),我们提出了一种在线评估方法:在每个用户动作之后要求给出一个预测,接受或拒绝该预测,在接受时更新后续动作,并重复此过程直到得到目标电子表格。我们使用了多个基线预测器(包括零样本LLM、微调SLM和经典模型),并分析了该基准揭示的不同属性,包括但不限于:节省的动作与误报的属性、效率、用户画像的影响、触发器的影响以及上下文的影响。

英文摘要

Predictive code completion greatly accelerates how quickly developers work. In spreadsheets, despite being much more common, such auto-completion features are virtually non-existent. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges are (1) the absence of edit histories in public spreadsheet corpora and (2) the complex space of spreadsheet actions (spatial, temporal, composite). To address (1), we manually curate 52 sequences of 12K actions that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that expects a prediction after each user action, accepts or rejects that prediction, updates the future actions upon acceptance, and repeats this until the target spreadsheet is obtained. We use multiple baseline predictors (including zero-shot LLMs, fine-tuned SLMs, and classical models) and analyze different properties that our benchmark teaches us, including but not limited to: properties of saved actions and false positives, efficiency, effect of user profiles, effect of triggers, and effect of context.
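The online evaluation described in this abstract (predict after each user action, accept or reject, update the remaining actions, repeat until the target is reached) can be sketched as a simple replay loop. This is a simplified reading, not the benchmark's implementation: actions are plain strings, a prediction is accepted only if it exactly matches the next target actions, and `online_evaluate` and `repeat_last` are hypothetical names.

```python
def online_evaluate(target_actions, predictor):
    """Replay a target action sequence, querying the predictor after each user action."""
    done = []            # actions applied so far (user-performed or accepted)
    saved = 0            # actions the user did not have to perform
    false_positives = 0  # predictions that were offered but rejected
    i = 0
    while i < len(target_actions):
        # The simulated user performs the next action themselves.
        done.append(target_actions[i])
        i += 1
        if i == len(target_actions):
            break  # target spreadsheet reached, nothing left to predict
        # The predictor sees the history and may propose upcoming actions.
        proposal = predictor(done)
        if not proposal:
            continue
        if target_actions[i:i + len(proposal)] == proposal:
            # Accepted: apply the proposal and skip those future actions.
            done.extend(proposal)
            i += len(proposal)
            saved += len(proposal)
        else:
            false_positives += 1
    return {"saved": saved, "false_positives": false_positives, "final": done}

def repeat_last(history):
    """Toy predictor: guess that the user repeats their last action."""
    return [history[-1]]

result = online_evaluate(["fill", "fill", "fill", "sum"], repeat_last)
```

In this toy run the predictor saves one action and produces one rejected suggestion, which shows how saved actions and false positives fall out of the same loop.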

2606.13835 2026-06-15 cs.CL cs.AI cs.MA 交叉投稿

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

当合理但不现实:评估基于LLM的城市模拟中的人类移动性

Gustavo H. Santos, Aline Carneiro Viana, Thiago H. Silva

发表机构 * UTFPR(巴拉那联邦理工大学) Inria(法国国家信息与自动化研究所) U. of Toronto(多伦多大学)

AI总结 提出验证框架,通过移动定律、时间节奏等指标评估基于LLM的城市模拟器生成的人类移动模式,发现叙事合理性与经验移动现实性之间存在显著差距。

Comments 14 pages, 10 figures

详情
AI中文摘要

基于LLM的生成式智能体越来越多地用于城市模拟器,但尚不清楚它们是否再现了经验上真实的人类移动模式,还是仅仅生成合理的移动叙事。我们引入了一个验证框架,用于评估基于LLM的城市模拟器中生成智能体的移动性,并与真实世界移动数据进行比较。为此,我们使用了移动定律、时间节奏、网络模体、语义活动转换和行为移动性配置文件。利用大巴黎地区和上海的数据集,我们评估了AgentSociety和CitySim在多个移动现实性维度上的表现。我们的分析揭示了叙事合理性与经验移动现实性之间的显著差距。尽管模拟器捕捉到了一些高级语义活动分布,但它们难以再现核心的空间和时间约束,包括真实的行程长度分布、起止点流量、停留时间和转换动态。我们进一步观察到,现实的移动多样性在默认提示配置下不稳定,可能需要显式的配置文件感知初始化。为了支持可重复的评估,我们还贡献了可扩展且开放的LLM驱动基础设施,用于区域级地图生成、可观测性增强的模拟、移动性指标计算和交通模拟。我们的发现强调了需要对基于LLM的城市模拟器进行严格的经验验证,并提供了构建更真实和可重复的城市模拟系统的实用工具。

英文摘要

LLM-based generative agents are increasingly used in urban simulators, yet it remains unclear whether they reproduce empirically realistic human mobility patterns or merely generate plausible mobility narratives. We introduce a validation framework for evaluating the mobility of generative agents of LLM-based urban simulators against real-world mobility data. For this, we use mobility laws, temporal rhythms, network motifs, semantic activity transitions, and behavioral mobility profiles. Using datasets from the Greater Paris region and Shanghai, we evaluate AgentSociety and CitySim across multiple dimensions of mobility realism. Our analysis reveals a substantial gap between narrative plausibility and empirical mobility realism. Although the simulators capture some high-level semantic activity distributions, they struggle to reproduce core spatial and temporal constraints, including realistic trip-length distributions, origin-destination flows, dwell times, and transition dynamics. We further observe that realistic mobility diversity is unstable across default prompting configurations and may require explicit profile-aware initialization. To support reproducible evaluation, we also contribute scalable and open LLM-driven infrastructure for regional-scale map generation, observability-enhanced simulation, mobility-metric computation, and traffic simulation. Our findings highlight the need for rigorous empirical validation of LLM-based urban simulators and provide practical tools for building more realistic and reproducible urban simulation systems.

2606.13839 2026-06-15 cs.CV cs.AI eess.IV 交叉投稿

Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography

解释RhythmFormer:远程光电容积描记术周期性稀疏注意力的系统XAI分析

Louis Chen, Torbjörn E. M. Nordling

发表机构 * Department of Mechanical Engineering, National Cheng Kung University(国立成功大学机械工程学系)

AI总结 针对rPPG Transformer可解释性缺乏定量评估的问题,将四种归因方法适配到RhythmFormer,并引入皮肤覆盖度指标和SaCo忠实度系数,量化稀疏注意力中的多跳泄漏效应;在所评估的方法中,Beyond Intuition在UBFC-rPPG上取得最高的皮肤覆盖度和忠实度。

Comments 26 pages, 8 figures

详情
AI中文摘要

远程光电容积描记术(rPPG)Transformer在基准测试中实现了低心率误差,但其决策仍然不透明;随着rPPG向临床心率估计发展,这一问题日益受到关注。现有的rPPG XAI主要依赖定性热图检查,缺乏定量忠实度指标或基于生理学的验证,在视觉合理性和可审计证据之间存在差距。我们解决了这一差距。首先,我们将四种归因方法(原始注意力、rollout、flow、Beyond Intuition)适配到RhythmFormer的双层路由注意力(带有top-$k$选择)上。其次,我们引入了一个皮肤覆盖度指标,量化归因质量落在皮肤区域的比例。第三,我们将SaCo忠实度系数从其原始分类设置适配到rPPG回归,通过使用原始和扰动预测rPPG波形之间的MAE作为扰动影响。应用这些工具,我们量化了稀疏top-$k$路由下的多跳泄漏效应:注意力rollout和flow几乎完全恢复了各个精炼注意力层明确设置为零的连接。Beyond Intuition通过其值投影加权rollout和梯度支持掩码缓解了这一问题,在UBFC-rPPG上获得了评估方法中最高的中位精炼皮肤覆盖度(0.83对比vanilla rollout的0.57)和忠实度(F=0.92)。需要在不同数据集和模型变体上进行验证。对低SaCo异常值的案例研究进一步表明,一旦替换了伪影区域,所有四种方法都一致恢复,表明在这个示例案例中,归因家族之间的SaCo行为一致。总之,这些指标将rPPG XAI推向关于空间对齐和扰动忠实度的可审计数值证据,即可信的rPPG XAI。

英文摘要

Remote photoplethysmography (rPPG) transformers achieve low heart-rate error on benchmarks, yet their decisions remain opaque, a growing concern as rPPG moves toward clinical heart rate estimation. Existing rPPG XAI is dominated by qualitative heatmap inspection without quantitative faithfulness metrics or physiology-grounded validation, leaving a gap between visual plausibility and auditable evidence. We address this gap. First, we adapt four attribution methods (raw attention, rollout, flow, Beyond Intuition) to RhythmFormer's bi-level routing attention with top-$k$ selection. Second, we introduce a skin coverage metric quantifying how much attribution mass falls on skin regions. Third, we adapt the SaCo faithfulness coefficient from its original classification setting to rPPG regression by using the MAE between original and perturbed predicted rPPG waveforms as the perturbation impact. Applying these tools, we quantify a multi-hop leakage effect under sparse top-$k$ routing: attention rollout and flow almost completely restore the connections that individual refined-attention layers explicitly set to zero. Beyond Intuition mitigates this via its value-projection-weighted rollout and gradient-supported mask, attaining the highest median refined skin coverage ($0.83$ vs. $0.57$ for vanilla rollout) and faithfulness ($F=0.92$) among the evaluated methods on UBFC-rPPG. Validation across diverse datasets and model variants is needed. A case study on a low-SaCo outlier further shows all four methods recovering consistently once an artefactual region is replaced, suggesting consistent SaCo behavior across attribution families in this illustrative case. Together, these metrics move XAI for rPPG toward auditable numerical evidence about spatial alignment and perturbation faithfulness, i.e., trustworthy rPPG XAI.
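The multi-hop leakage effect and the skin coverage metric can be illustrated on a toy example. The sketch below uses the standard attention-rollout recipe (average each attention matrix with the identity, then multiply across layers) on a three-token sparse routing pattern. The token layout, the skin mask, and the function names are assumptions for illustration; this is not RhythmFormer's bi-level routing or the paper's exact metric definition.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rollout(layers):
    """Attention rollout: multiply (0.5 * A + 0.5 * I) across layers."""
    n = len(layers[0])
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for attn in layers:
        mixed = [[0.5 * attn[i][j] + 0.5 * (i == j) for j in range(n)]
                 for i in range(n)]
        result = matmul(mixed, result)
    return result

def skin_coverage(attribution, skin_mask):
    """Share of total attribution mass that lands on skin positions."""
    total = sum(attribution)
    on_skin = sum(a for a, m in zip(attribution, skin_mask) if m)
    return on_skin / total

# Sparse routing in every layer: token 0 -> 1, token 1 -> 2, token 2 -> itself.
sparse = [[0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [0.0, 0.0, 1.0]]
rolled = rollout([sparse, sparse])
leak = rolled[0][2]  # nonzero although sparse[0][2] is zero in every layer
coverage = skin_coverage(rolled[0], skin_mask=[True, True, False])
```

Token 0 never attends to token 2 directly, yet the rolled-out matrix assigns it weight through the two-hop path 0 -> 1 -> 2. If token 2 is a non-skin region, that leaked mass lowers the coverage score.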

2606.13870 2026-06-15 cs.CV cs.AI cs.LG 交叉投稿

Mirage Probes: How Vision Models Fake Visual Understanding

幻象探针:视觉模型如何伪造视觉理解

Daniel Ben-Levi, Judah Goldfeder, Weiliang Zhao, Raz Lapid, Amit LeVi, Allen G. Roush, Ravid Shwartz-Ziv, Hod Lipson

发表机构 * Columbia University(哥伦比亚大学) Intuit Technion(以色列理工学院) Thoughtworks New York University(纽约大学)

AI总结 提出幻象探针框架,通过对比探针揭示视觉语言模型在无图像时也能回答问题的两种幻象行为:文本偏见和虚假图像,并证明后者需要表征级干预。

详情
AI中文摘要

视觉语言模型(VLM)即使在没有提供图像的情况下,也能自信且通常正确地回答基于图像的问题。这种幻象行为会虚增基准分数,而不反映视觉基础。先前的工作将其视为单一故障模式。我们认为这是两种。使用幻象探针(Mirage Probes),一种对比探针框架,将释义的问题变体与同一图像上的匹配幻象和非幻象标签配对,我们展示了在两个开源VLM中,幻象行为可以从残差流、MLP、后注意力和注意力头位置的内部激活中线性解码。我们证明朴素贝叶斯文本基线无法恢复此信号,排除了表面词汇混淆。跨基准可分离性模式,连同一种新颖的先验利用指数(PHI),衡量模型仅从文本中回答的程度,揭示了两种不同的机制:文本偏见,其中模型从语言先验中回答而不涉及视觉表征;以及虚假图像,其中模型在潜在空间中构建虚假视觉内容并像有基础一样回答。这种区别有直接的缓解后果:文本分布清理可以解决第一种机制,但无法触及第二种,因为虚假图像幻象存在于模型的视觉表征中而非文本中。忠实的视觉基础将需要在表征层面进行干预。

英文摘要

Vision-language models (VLMs) can answer image-based questions confidently, and often correctly, even when no image is provided. This mirage behavior inflates benchmark scores without reflecting visual grounding. Prior work treats this as a single failure mode. We argue it is two. Using Mirage Probes, a contrastive probing framework that pairs paraphrased question variants with matched mirage and non-mirage labels on the same image, we show that mirage behavior is linearly decodable from internal activations across residual stream, MLP, post-attention, and attention-head sites in two open-source VLMs. We demonstrate that a Naive Bayes text baseline cannot recover this signal, ruling out surface lexical confounds. Cross-benchmark separability patterns, together with a novel Prior Harnessing Index (PHI) measuring how much a model can answer from text alone, expose two distinct regimes: textual biases, where the model answers from language priors without engaging visual representations, and spurious images, where it constructs false visual content in latent space and answers as if grounded. The distinction has direct mitigation consequences: text-distribution cleaning can address the first regime but cannot reach the second, since spurious-image mirages live in the model's visual representations rather than its text. Faithful visual grounding will require interventions at the representational level.

2606.13896 2026-06-15 cs.CV cs.AI 交叉投稿

How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?

自监督遥感视觉模型如何迁移到下游任务?

Julia Romero, Qin Lv, Morteza Karimzadeh

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 研究六种代表性自监督地理空间基础模型(GeoFMs)在下游任务中的迁移表现,发现模型排名随任务和适应设置变化,中间层特征比最终层更相关,且解码器设计等适应设置影响与模型选择相当。

详情
AI中文摘要

自监督地理空间基础模型(GeoFMs)从遥感数据中学习可迁移表示,但其下游行为难以表征。我们研究了涵盖联合嵌入、重建和多模态预训练家族的六种代表性GeoFMs,并在不同标签可用性和下游流水线下评估了分类、回归和分割基准的迁移性能。我们发现模型排名随任务和适应设置而变化。逐层探针显示,在大多数情况下,与任务相关的信息在中间Transformer块中比在最终层嵌入中更容易获取,并且GeoFMs表现出不同的深度分布特征。在PASTIS和Sen1Floods11上的分割案例研究中,解码器设计和微调等下游适应设置可能与GeoFM的选择同样重要,且标准密集预测头可能与GeoFM在深度上组织信息的方式不一致。最后,案例研究中的CKA分析表明,微调不会均匀地重写GeoFMs的深度,最强的变化集中在ViT块中MLP的第一个线性层。这些结果有助于解释为什么GeoFM排名在不同基准之间发生变化,并激励更具表示意识的评估和适应策略。

英文摘要

Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines. We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks than in final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles. In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks. These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.
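
As a rough illustration of the CKA analysis mentioned above, the sketch below computes linear CKA between two activation matrices, for example the same layer before and after fine-tuning. The abstract does not say which CKA variant is used, so the linear form, the helper names, and the toy data are assumptions.

```python
import math


def center(matrix):
    """Subtract the per-feature mean from a row-major (samples x features) matrix."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_frob_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices sharing the sample axis."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA similarity; assumes neither input is constant across samples."""
    x, y = center(x), center(y)
    return cross_frob_sq(x, y) / math.sqrt(cross_frob_sq(x, x) * cross_frob_sq(y, y))


# Toy activations: 4 samples, 2 features.
before = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rescaled = [[3.0 * v for v in row] for row in before]      # isotropic scaling
changed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # different structure

same_score = linear_cka(before, rescaled)
changed_score = linear_cka(before, changed)
```

Linear CKA is invariant to isotropic scaling, so `same_score` is 1 up to rounding, while a layer whose representation was rewritten scores lower. Comparing such scores layer by layer is one way to see where fine-tuning concentrates its changes.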

2606.13904 2026-06-15 cs.CL cs.AI cs.DB 交叉投稿

SANA: What Matters for QA Agents over Massive Data Lakes?

SANA:大规模数据湖上的问答代理关键因素是什么?

Austin Senna Wijaya, Jiaxiang Liu, Haonan Wang, Eugene Wu

发表机构 * Columbia University(哥伦比亚大学)

AI总结 提出SANA诊断框架,通过消融实验分析数据湖探索式问答中搜索、规划、数据分析及行动策略的失败原因,揭示数据分析是主要瓶颈。

Comments 9 pages, 7 figures

详情
AI中文摘要

数据湖上的探索式问答(EQA)需要LLM代理发现相关源、分析检索数据并根据中间结果调整其行动。端到端准确率无法区分搜索、规划、数据分析或代理的行动策略(即下一步做什么以及何时提交答案的决策)中的失败。我们提出了SANA(搜索代理导航消融框架),这是一个诊断性消融框架,将EQA任务转化为包含黄金源序列、清洗后子问题和执行记录的运行时配置文件。SANA利用这些配置文件构建理想化的搜索、规划和数据分析工具,从而允许对每个组件进行消融;残差是策略失败的诊断证据。为了说明SANA作为一个可复用的评估框架,我们改编了两个最近的EQA基准测试LakeQA和KramaBench,并在固定提示、预算、数据湖和运行时下评估了轻量级和中型代理。在两个基准测试中,数据分析始终是瓶颈,而规划则不那么明显。搜索在LakeQA的大数据湖设置中是主要限制,但在较小规模的KramaBench中则不那么突出。因此,SANA将端到端任务准确率分解为数据湖代理失败原因的诊断,并允许系统比较搜索、规划、数据分析和代理设计方面的进展。

英文摘要

Exploratory question answering (EQA) over data lakes requires an LLM agent to discover relevant sources, analyze retrieved data, and adapt its actions based on intermediate results. End-to-end accuracy alone cannot distinguish failures in search, planning, data analysis, or the agent's Action Policy: its decisions about what to do next and when to submit an answer. We present SANA (Search Agent Navigation Ablation framework), a diagnostic ablation framework that transforms EQA tasks into runtime profiles containing gold source sequences, sanitized subquestions, and execution records. SANA uses these profiles to construct idealized search, planning, and data-analysis tools, allowing each component to be ablated; the residual gap is diagnostic evidence for policy failures. To illustrate SANA as a reusable evaluation framework, we adapted two recent EQA benchmarks, LakeQA and KramaBench, and evaluated lightweight and mid-sized agents under fixed prompts, budgets, data lakes, and runtimes. Across both benchmarks, data analysis is a consistent bottleneck while planning is less so. Search is a major limitation in LakeQA's large data-lake setting, but less so for the smaller-scale KramaBench. SANA thus deconstructs end-to-end task accuracies into a diagnosis of where data-lake agents fail, and allows for systematic comparisons of progress in search, planning, data analysis, and agent design.

2606.13994 2026-06-15 cs.CR cs.AI cs.LG 交叉投稿

Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DECOMPBENCH

隐于无形:使用DECOMPBENCH基准测试代理安全对抗分解攻击

Vikhyath Kothamasu, Virginia Smith, Chhavi Yadav

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Simons Institute, UC Berkeley(加州大学伯克利分校Simons研究所)

AI总结 提出DeCompBench基准,通过分解攻击将有害任务拆分为良性子任务,揭示现有代理安全机制在对抗分解攻击时的脆弱性。

详情
AI中文摘要

基于LLM的代理变得越来越强大且广泛部署,在现实世界中造成了日益增长的对抗性滥用动机。一个关键的新兴威胁是分解攻击\cite{glukhov2024breach, jones2024adversaries},其中有害任务被分解为更简单、良性的子任务,这些子任务单独执行时能规避安全机制,但累积起来却实现了恶意意图。尽管最近的基准测试评估了代理在多轮和多工具使用设置中的安全性,但它们并未明确捕捉这种形式的分解滥用,且可能无法代表现实的对抗性执行流程。为此,我们引入了DeCompBench,这是一个专门设计用于评估分解攻击下代理安全性的基准。DeCompBench采用分解即设计原则,使用图形框架创建,能够将有害任务分解为单独良性且可执行的子任务,并具有现实的工作流程。我们使用自定义分解器的实验表明,最先进的代理在整体有害任务上表现出高拒绝率,但在其分解变体上拒绝率显著降低,同时往往无意中实现了对抗性目标。这些发现强调了针对分解攻击进行安全性评估及相应防御的必要性。我们的数据集已公开,可在以下网址获取:https://huggingface.co/datasets/decompositionbench/DeCompBench。

英文摘要

LLM-based agents are becoming increasingly capable and widely deployed, creating growing incentives for adversarial misuse in the real world. A key emerging threat is decomposition attacks \cite{glukhov2024breach, jones2024adversaries}, in which a harmful task is broken into simpler, benign subtasks that evade safety mechanisms when executed separately but cumulatively fulfill the malicious intent. Although recent benchmarks assess agent safety in multi-turn and multi-tool-use settings, they do not explicitly capture this form of decompositional misuse and may not represent realistic adversarial execution flows. To this end, we introduce DeCompBench, a benchmark designed specifically to evaluate agentic safety under decomposition attacks. DeCompBench is created with a decomposition-by-design principle using a graphical framework and enables harmful task decomposition into individually benign and executable subtasks with realistic workflows. Our experiments using a custom decomposer show that state-of-the-art agents exhibit high refusal rates on monolithic harmful tasks, but significantly lower refusal rates on their decomposed variants, while often inadvertently fulfilling the adversarial objectives. These findings underscore the need for safety evaluations against decomposition attacks and corresponding defenses. Our dataset is publicly available and can be found at https://huggingface.co/datasets/decompositionbench/DeCompBench.

2606.14094 2026-06-15 cs.CV cs.AI 交叉投稿

FEMOT: Multi-Object Tracking using Frame and Event Cameras

FEMOT: 使用帧和事件摄像机的多目标跟踪

Shiao Wang, Xiao Wang, Chao Wang, Yitao Li, Menghao Liu, Bo Jiang, Yaowei Wang, Yonghong Tian, Jin Tang

发表机构 * School of Computer Science and Technology, Anhui University(安徽大学计算机科学与技术学院) Peng Cheng Laboratory(鹏城实验室) National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理全国重点实验室) School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University(北京大学深圳研究生院电子与计算机工程学院) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳))

AI总结 提出FEMOT大规模RGB-事件多目标跟踪数据集和FEMOTR多模态跟踪框架,通过频域融合解耦特征,有效利用互补信息实现鲁棒跟踪。

详情
AI中文摘要

传统的RGB摄像机因其捕获丰富外观和语义信息的能力而被广泛用于多目标跟踪。然而,在复杂的现实挑战下,如运动模糊、低照度和过度曝光,其性能通常会下降。受生物启发的事件摄像机提供高时间分辨率和高动态范围,在极端场景下提供互补线索。尽管如此,由于缺乏大规模且标注良好的数据集,RGB-事件多目标跟踪仍未被充分探索。为解决这一问题,我们提出了FEMOT,一个大规模RGB-事件多目标跟踪数据集,涵盖多样化的现实场景和14个具有挑战性的属性。凭借RGB和事件数据以及高质量标注,FEMOT为系统评估RGB-事件多目标跟踪方法提供了可靠平台。基于FEMOT,我们重新训练并评估了超过十个强跟踪器,从而为未来研究建立了全面的基准。此外,我们提出了FEMOTR,一种多模态跟踪框架,该框架解耦RGB和事件特征并在频域中融合它们,从而有效利用其互补特性实现鲁棒的目标定位和身份关联。在FEMOT和DSEC-MOT数据集上的大量实验证明了所提方法的有效性。源代码和基准数据集已发布于 https://github.com/Event-AHU/FEMOT。

英文摘要

Conventional RGB cameras have been widely used in multi-object tracking due to their ability to capture rich appearance and semantic information. However, their performance is often degraded under complex real-world challenges, such as motion blur, low illumination, and overexposure. Bio-inspired event cameras offer high temporal resolution and high dynamic range, providing complementary cues under extreme scenarios. Nevertheless, RGB-event multi-object tracking remains underexplored due to the lack of large-scale and well-annotated datasets. To address this issue, we propose FEMOT, a large-scale RGB-event multi-object tracking dataset that covers diverse real-world scenarios and 14 challenging attributes. With both RGB and event data as well as high-quality annotations, FEMOT provides a reliable platform for systematically evaluating RGB-event multi-object tracking methods. Based on FEMOT, we retrain and evaluate over ten strong trackers, thereby establishing a comprehensive benchmark for future research. Furthermore, we propose FEMOTR, a multimodal tracking framework that decouples RGB and event features and fuses them in the frequency domain, thereby effectively exploiting their complementary characteristics for robust object localization and identity association. Extensive experiments on FEMOT and DSEC-MOT datasets demonstrate the effectiveness of the proposed method. The source code and benchmark dataset have been released on https://github.com/Event-AHU/FEMOT.

2606.14117 2026-06-15 stat.ME cs.AI 交叉投稿

A Two-Stage Statistical Framework for Evaluating Associative Interference in Large Language Models

评估大语言模型中联想干扰的两阶段统计框架

Achraf Cohen, Andrew Kincaid

发表机构 * Department of Mathematics and Statistics, University of West Florida(西佛罗里达大学数学与统计学系)

AI总结 提出两阶段统计框架,分离响应遵从性与任务一致性,评估三个LLM在性别-职业等领域的联想干扰,发现效应因模型而异。

Comments 11 pages; 2 figures

详情
AI中文摘要

大语言模型(LLM)越来越多地通过改编人类心理范式来评估偏见,然而方法论上的局限性——特别是将拒绝行为与任务表现混为一谈——阻碍了清晰的解释。在此,我们将内隐联想测验(IAT)改编为一个受控的强制选择框架,并引入一个两阶段建模方法,将响应遵从性与任务一致性分类分开。在三个当代LLM(Claude Sonnet-4、Gemini 2.5 Pro和GPT-5)上,我们评估了联想干扰,定义为不一致条件相对于一致条件下任务一致性的降低。虽然对结构化响应格式的遵从性普遍较高,但干扰效应在模型和领域之间差异很大。Claude Sonnet-4在性别-职业领域表现出强干扰(DeltaP = 0.086, 95% CrI [0.026, 0.173]),在性别-科学领域表现出较小但可信的效应。Gemini 2.5 Pro显示出减弱的干扰,而GPT-5在所有领域表现出最小或不可检测的干扰。这些发现表明,IAT风格的联想不对称性并非LLM的普遍属性,而是取决于模型特定特征。通过将干扰与遵从性分离并对项目水平变异性建模,本研究为评估LLM中的结构化响应模式提供了一个原则性框架。结果强调了模型特定评估的重要性,并表明联想干扰在现代系统中可以得到实质性缓解。

英文摘要

Large language models (LLMs) are increasingly evaluated for bias using adaptations of human psychological paradigms, yet methodological limitations, particularly the conflation of refusal behavior with task performance, have hindered clear interpretation. Here, we adapt the Implicit Association Test (IAT) to a controlled, forced-choice framework and introduce a two-stage modeling approach that separates response compliance from task-consistent classification. Across three contemporary LLMs (Claude Sonnet-4, Gemini 2.5 Pro, and GPT-5), we evaluate associative interference, defined as reduced task-consistency in incongruent relative to congruent conditions. While compliance with the structured response format was uniformly high, interference effects varied substantially across models and domains. Claude Sonnet-4 exhibited strong interference in the Gender--Career domain (DeltaP = 0.086, 95% CrI [0.026, 0.173]) and smaller but credible effects in Gender--Science. Gemini 2.5 Pro showed attenuated interference, and GPT-5 exhibited minimal or no detectable interference across domains. These findings demonstrate that IAT-style associative asymmetries are not a universal property of LLMs, but instead depend on model-specific characteristics. By isolating interference from compliance and modeling item-level variability, this study provides a principled framework for evaluating structured response patterns in LLMs. The results highlight the importance of model-specific assessment and suggest that associative interference can be substantially mitigated in modern systems.
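
The two-stage separation described above can be sketched with raw proportions. This is a simplification under stated assumptions: the paper reports credible intervals, which implies a Bayesian model that is not reproduced here, and reading DeltaP as the congruent-minus-incongruent difference in consistency is an interpretation of the abstract. The function name and the toy trial counts are hypothetical.

```python
def two_stage_summary(trials):
    """Summarise trials given as (condition, complied, consistent) tuples.

    Stage 1: compliance rate, i.e. did the model answer in the required format.
    Stage 2: task-consistency rate, computed only over compliant responses.
    """
    summary = {}
    for condition in ("congruent", "incongruent"):
        subset = [t for t in trials if t[0] == condition]
        compliant = [t for t in subset if t[1]]
        summary[condition] = {
            "compliance": len(compliant) / len(subset),
            "consistency": sum(1 for t in compliant if t[2]) / len(compliant),
        }
    # Assumed reading of DeltaP: drop in consistency under incongruent pairing.
    delta_p = (summary["congruent"]["consistency"]
               - summary["incongruent"]["consistency"])
    return summary, delta_p


# Hypothetical data: 10 trials per condition.
trials = (
    [("congruent", True, True)] * 9
    + [("congruent", True, False)] * 1
    + [("incongruent", True, True)] * 6
    + [("incongruent", True, False)] * 2
    + [("incongruent", False, False)] * 2   # refusals or malformed answers
)
summary, delta_p = two_stage_summary(trials)
```

Because refusals are filtered out in stage 1, they lower the compliance rate but do not masquerade as task errors in stage 2, which is the conflation the abstract warns about.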

2606.14199 2026-06-15 cs.CL cs.AI cs.LG 交叉投稿

OdysSim: Building Foundation Models for Human Behavior Simulation

OdysSim: 构建人类行为模拟的基础模型

Xuhui Zhou, Weiwei Sun, Weihua Du, Jiarui Liu, Haojia Sun, Qianou Ma, Tongshuang Wu, Yiming Yang, Maarten Sap

发表机构 * Carnegie Mellon University, Language Technologies Institute(卡内基梅隆大学语言技术研究所)

AI总结 提出OdysSim,通过SOUL分类法统一62个数据集和23个基准任务,采用混合训练、任务特定强化学习和专家蒸馏,构建8B参数行为基础模型OSim,在多数任务上超越前沿模型,并实现更类人输出和零样本迁移。

Comments 34 pages. Code: https://github.com/sunnweiwei/OdysSim ; Models and data: https://huggingface.co/collections/cmu-lti/odyssim

详情
AI中文摘要

大型语言模型越来越多地被部署为人类模拟器,用于交互式评估和社会模拟。然而,以有用性为导向的后训练使它们趋向于同质化、过于随和的助手风格,造成了行为上的Sim2Real差距。我们提出了OdysSim,这是对行为基础模型(即经过训练以大规模模拟人类行为的模型)进行的最大规模开放系统研究。我们提出了SOUL,一个包含五个能力轴(CONV、SS、COG、ROLE、EVAL)的分类法,将62个数据集和23个基准任务统一在一个框架下。具体来说,我们整理了OdysSim语料库(2140万次交互,100亿个token,并配备了反向生成的社交上下文),构建了SOUL-Index基准,并开发了一个端到端的训练方案,结合了中期训练、任务特定强化学习和专家蒸馏。由此产生的开源8B OSim模型在23个任务中的8个上排名第一或并列第一,按此计数优于任何单个前沿模型,在对话和社交任务上取得了最大的提升。其输出在长度、格式和词汇选择上也更接近人类,并在τ-bench上零样本迁移到分布外的用户模拟,在反应一致性上几乎与真实用户匹配(93.2 vs 93.5)。我们进一步表明,LLM作为评判者的强化学习会引发奖励黑客模式,而我们的检测器可以在后训练期间缓解这些模式。总之,我们的发现表明,行为基础模型需要重新思考LLM的训练范式。我们发布所有工件以支持未来的研究。

英文摘要

Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap. We present OdysSim, the largest open systematic investigation of behavioral foundation models, i.e., models trained to simulate human behavior at scale. We propose SOUL, a taxonomy of five capability axes (CONV, SS, COG, ROLE, EVAL) that unifies 62 datasets and 23 benchmark tasks under one framework. Specifically, we curate the OdysSim corpus (21.4M interactions, 10B tokens, retrofitted with back-generated social contexts), construct the SOUL-Index benchmark, and develop an end-to-end training recipe combining midtraining, task-specific RL, and expert distillation. The resulting open 8B OSim model ranks first or tied-first on 8 of 23 tasks, outperforming any individual frontier model by this count, with the strongest gains on conversational and social tasks. Its outputs are also more human-like in length, formatting, and word choice, and it transfers zero-shot to out-of-distribution user simulation on $τ$-bench, nearly matching real users on reaction alignment (93.2 vs. 93.5). We further show that LLM-as-judge RL induces reward-hacking patterns, and that our detectors can mitigate them during post-training. Together, our findings suggest that behavioral foundation models require rethinking the LLM training paradigm. We release all artifacts to support future research.

2606.14459 2026-06-15 cs.CL cs.AI cs.SD 交叉投稿

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

MoDiCoL:用于鲁棒语音识别的模块化诊断持续学习数据集

Theresa Pekarek Rosin, Matthias Kerzel, Stefan Wermter

发表机构 * Knowledge Technology, Department of Informatics, University of Hamburg, Germany(德国汉堡大学信息学系知识技术研究所)

AI总结 提出MoDiCoL数据集,通过模块化设计分离语言内容、说话人特征和声学环境,并设计持续学习课程来模拟真实分布变化,评估三种持续学习策略下的鲁棒性获取、迁移和遗忘。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

现代自动语音识别(ASR)系统在标准基准测试上取得了显著进展,但在真实世界的分布变化下,由录音条件、口音、语言障碍和噪声引起的性能差距已经显现。现有数据集和基准通常孤立这些因素,忽略了它们在真实应用中的共现。在本文中,我们认为模型鲁棒性可以被视为一种动态能力,持续发展,并引入了MoDiCoL,一个模块化诊断持续学习数据集,旨在对语言内容、说话人特征和声学环境进行受控分析。此外,我们提出了一个受真实世界启发的持续学习课程,以模拟增量更新,并研究鲁棒性是如何获取、迁移和遗忘的。我们评估了三种持续学习策略,并提供了在演化条件下鲁棒性的详细见解。

英文摘要

Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech impairments, and noise. Existing datasets and benchmarks typically isolate these factors, which overlooks their co-occurrence in real-world applications. In this paper, we argue that model robustness can be treated as a dynamic capability that continually develops, and we introduce MoDiCoL, a Modular Diagnostic Continual Learning dataset designed for controlled analysis of linguistic content, speaker characteristics, and acoustic environments. Furthermore, we propose a real-world-inspired continual learning curriculum to simulate incremental updates and study how robustness is acquired, transferred, and forgotten. We evaluate three continual learning strategies and provide detailed insights into robustness under evolving conditions.

2606.14574 2026-06-15 cs.CL cs.AI 交叉投稿

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

SIMMER: 基于世界模型评估LLM可执行规划中的潜在故障

Xiaoxin Lu, Ranran Haoran Zhang, Rui Zhang

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 提出SIMMER基准,通过人工策划的厨房领域符号世界模型,评估LLM规划中的潜在故障;实验发现前沿模型最多17%无错误计划,56%含潜在故障,多数不可逆;反事实预演可减少72%潜在故障和75%不可逆案例。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为家庭环境中自主代理的规划器。虽然现有基准评估LLM生成的计划是否成功执行,但它们忽略了一种关键类型的故障:潜在故障。与立即故障(在执行时触发即时反馈并允许及时纠正)不同,潜在故障不会立即停止计划执行,而是悄无声息地损害目标实现。在严重情况下,它们会导致不可逆的损害。为弥补这一空白,我们引入了SIMMER,这是一个通过人工策划的、基于厨房领域的符号世界模型来评估LLM规划中潜在故障的基准。SIMMER定义了一个世界模型,包含77个动作、262个独特对象和约46,800种语义真实的可能交互,这些交互源自真实世界的烹饪脚本。然后,它利用一个状态机执行器,根据世界模型验证计划,并检测即时前提违规、潜在危险和不可逆故障。在六个LLM上的实验表明,即使是最前沿的模型,其无错误计划最多也只有17%。此外,高达56%的计划包含潜在故障,其中大多数导致不可逆后果。我们进一步证明,通过反事实预演进行显式状态推理可以将潜在故障减少高达72%,不可逆案例减少高达75%,这为更鲁棒的LLM规划器指明了一个有前景的方向。

英文摘要

Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real-world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error-free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.
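
To make the distinction between immediate and latent failures concrete, here is a toy state-machine executor. It is only a sketch of the idea: the `Action`, `Report`, and `execute` names, the stove example, and the choice to check hazards on the final state are assumptions and do not reflect SIMMER's actual world model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    name: str
    preconditions: dict  # state variables that must hold before execution
    effects: dict        # state variables set by the action


@dataclass
class Report:
    immediate: list = field(default_factory=list)
    latent: list = field(default_factory=list)


def execute(plan, initial_state, hazards):
    """Run a plan step by step; hazards maps a label to a predicate over state."""
    state = dict(initial_state)
    report = Report()
    for step, action in enumerate(plan):
        unmet = [k for k, v in action.preconditions.items() if state.get(k) != v]
        if unmet:
            # Immediate failure: visible at execution time, so execution halts.
            report.immediate.append((step, action.name, unmet))
            break
        state.update(action.effects)
    # Latent failures: nothing halted, but the reached state is silently unsafe.
    for label, is_hazardous in hazards.items():
        if is_hazardous(state):
            report.latent.append(label)
    return state, report


turn_on = Action("turn_on_stove", {"stove_on": False}, {"stove_on": True})
place_pan = Action("place_pan", {"stove_on": True}, {"pan_on_stove": True})
fry_egg = Action("fry_egg", {"pan_on_stove": True}, {"egg_cooked": True})
hazards = {"stove_left_on": lambda s: s.get("stove_on", False)}

# The plan reaches its goal but never turns the stove off.
final_state, report = execute([turn_on, place_pan, fry_egg],
                              {"stove_on": False}, hazards)
# A plan that skips setup fails immediately at step 0.
_, bad_report = execute([fry_egg], {"stove_on": False}, hazards)
```

The first plan passes every precondition check and cooks the egg, yet the report flags `stove_left_on`; a benchmark that only checks successful execution would count it as correct.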

2606.14591 2026-06-15 cs.SD cs.AI 交叉投稿

AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

AudioDER: 一种用于后训练大型音频语言模型的去重增强推理数据集

Hui Geng, Yi Su, Han Yin, Tianjiao Wan, Qisheng Xu, Jiaxin Chen, Zijian Gao, Hengzhu Liu, Xie Chen, Kele Xu

发表机构 * College of Computer Science and Technology, National University of Defense Technology(国防科技大学计算机科学与技术学院) Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院) Shanghai Jiaotong University(上海交通大学)

AI总结 针对现有音频-语言数据集冗余导致后训练效果下降的问题,提出基于声学相似性去重的数据构建流程,生成包含191k样本的推理导向数据集AudioDER,显著提升LALM在多个音频推理基准上的性能。

详情
AI中文摘要

大型音频语言模型(LALMs)在广泛的音频理解任务上表现出色,但在复杂音频推理方面仍存在困难。提升此类能力的一种实用方法是后训练,其有效性关键取决于训练数据的质量和多样性。然而,现有的音频-语言数据集通常包含大量冗余,其中许多样本在声学内容上高度相似,从而提供重叠的监督信号。这种冗余不仅增加了标注成本,还限制了语料库的多样性,降低了后训练的效果。为解决此问题,我们提出了一种冗余感知的数据构建流程,用于为LALMs构建面向推理的监督。具体来说,我们首先基于声学相似性对原始音频数据集进行去重,以提高语料库的多样性。然后,我们将现有的音频描述和问答对整合为统一的多项选择格式。基于这些统一标注,我们利用Qwen3-30B生成思维链(CoT)推理过程,以提供面向推理的监督。基于此流程,我们构建了AudioDER,一个面向推理的后训练数据集,包含约191k个样本,涵盖声音、语音和音乐。每个样本包括一个音频片段、一个多项选择问题、四个候选答案、一个音频描述和一个CoT推理过程。大量实验表明,在AudioDER上进行后训练持续提升了Qwen2-Audio-7B-Instruct在多个音频推理基准上的性能,包括MMAU-mini、MMSU和MMAR。我们希望AudioDER能够成为推动音频推理研究和开发更强大LALMs的宝贵资源。

英文摘要

Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quality and diversity of training data. However, existing audio-language datasets often contain substantial redundancy, where many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. Based on these unified annotations, we leverage Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. Based on this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.
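
The deduplication step can be sketched as a greedy filter over audio embeddings. The abstract does not specify the embedding model, the similarity measure, or the threshold, so cosine similarity, the 0.95 cutoff, and the two-dimensional toy vectors below are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def deduplicate(embeddings, threshold=0.95):
    """Keep a clip only if it is not too similar to any clip already kept."""
    kept = []
    for index, embedding in enumerate(embeddings):
        if all(cosine(embedding, embeddings[j]) < threshold for j in kept):
            kept.append(index)
    return kept


# Hypothetical clip embeddings: clips 0/1 and 2/3 are near-duplicates.
clips = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
kept = deduplicate(clips)
```

This greedy pass is quadratic in the number of kept clips; at corpus scale an approximate nearest-neighbour index would likely be needed, which is outside the scope of this sketch.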

2606.14604 2026-06-15 cs.LG cs.AI 交叉投稿

A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health

移动健康多时间范围行为预测的深度学习架构比较研究

Pavlos Nicolaou, Kleanthis Malialis, Artemis Kontou, Panayiotis Kolios

发表机构 * KIOS Research and Innovation Center of Excellence, University of Cyprus(塞浦路斯大学KIOS研究与创新卓越中心) Department of Electrical and Computer Engineering, University of Cyprus(塞浦路斯大学电气与计算机工程系)

AI总结 本研究在三个公开数据集上系统比较了六种深度学习架构、两种零样本基础模型和统计基线在1-8天时间范围内的行为预测性能,发现PatchTST表现最佳,基础模型TimesFM在低数据场景下可与训练模型匹敌,且参与者级微调可将RMSE降低16-60%。

详情
AI中文摘要

可穿戴设备和智能手机生成丰富的行为时间序列,可支持主动健康干预,但缺乏对这些数据现代预测架构的系统比较。特别是,模型如何在人群中泛化、不同架构如何响应参与者级微调以及预测精度如何在多天范围内下降仍不清楚。我们在三个涵盖800多名参与者的公开数据集上基准测试了六种深度学习架构、两种零样本基础模型(FM)和统计基线,报告了步数、屏幕时间和睡眠时长在1-8天范围内的逐特征指标。我们进一步对所有六种架构进行了逐特征个性化研究,并评估了FM在不同数据集大小和时间粒度上的迁移性。我们的主要发现是:(i)没有单一架构占主导地位,PatchTST在训练模型中领先,而紧随其后的三种架构(TCN、MLP、Transformer)之间没有显著性能差异;(ii)FM TimesFM在零样本情况下匹配或超过训练模型,尤其是在低数据场景下;(iii)参与者级微调将逐特征RMSE降低了16-60%,其中睡眠受益最大,步数受益最小。这些结果为移动健康预测中的架构选择、FM适用性和个性化策略提供了实用指导。据我们所知,这是首个联合评估现代深度学习、FM和个性化用于可穿戴设备多时间范围行为预测的研究。

英文摘要

Wearable devices and smartphones generate rich behavioural time series that can support proactive health interventions, yet systematic comparisons of modern forecasting architectures for these data are lacking. In particular, it remains unclear how models generalise across populations, how different architectures respond to participant-level fine-tuning, and how forecasting accuracy degrades across multi-day horizons. We benchmark six deep learning architectures, two zero-shot Foundation Models (FMs), and statistical baselines on three public datasets encompassing over 800 participants, reporting per-feature metrics for step counts, screen time, and sleep duration across 1-8 day horizons. We further conduct a per-feature personalisation study across all six architectures and assess FM transferability across dataset sizes and temporal granularities. Our key findings are: (i) no single architecture dominates: PatchTST leads among trained models, while the three runners-up (TCN, MLP, Transformer) show no meaningful performance difference; (ii) the FM TimesFM matches or exceeds trained models zero-shot, especially in low-data regimes; and (iii) participant-level fine-tuning reduces per-feature RMSE by 16-60%, with sleep benefiting most and step counts least. These results provide practical guidance on architecture selection, FM applicability, and personalisation strategies for mobile health forecasting. To the best of our knowledge, this is the first study to jointly evaluate modern deep learning, FMs, and personalisation for multi-horizon behavioural forecasting from wearables.

2606.14697 2026-06-15 cs.CV cs.AI cs.CL 交叉投稿

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

ClinHallu: 用于诊断医学多模态大语言模型推理中阶段式幻觉的基准

Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) DAMO Academy, Alibaba Group(阿里巴巴达摩院) Hupan Lab(湖畔实验室) Zhejiang University(浙江大学)

AI总结 提出ClinHallu基准,包含7031个实例,每个实例带有结构化推理轨迹(视觉识别、知识回忆、推理整合),通过阶段替换干预和轨迹监督微调,实现细粒度幻觉诊断与缓解。

Comments Code and datasets: https://github.com/alibaba-damo-academy/ClinHallu

详情
AI中文摘要

构建可信的医学多模态大语言模型(MLLM)对于可靠的临床决策支持至关重要。现有的医学幻觉基准主要关注数据收集,但往往忽略了推理过程中幻觉的起源。我们发现幻觉来源因样本而异:错误可能源于视觉误识别、不正确的医学知识回忆或有缺陷的推理整合。为了实现源级别的幻觉诊断,我们引入了ClinHallu,一个用于医学MLLM推理中阶段式幻觉诊断的基准。ClinHallu包含7031个经过验证的实例,每个实例都附有分解为视觉识别、知识回忆和推理整合的结构化推理轨迹。我们还使用阶段替换干预来测量纠正特定阶段如何影响最终答案。除了评估,我们表明轨迹监督微调减少了阶段式幻觉。ClinHallu为诊断和缓解医学MLLM中的推理失败提供了一个细粒度的幻觉测试平台。该基准可从 https://github.com/alibaba-damo-academy/ClinHallu 公开获取。

英文摘要

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.

2601.00821 2026-06-15 cs.AI cs.CL cs.IR 版本更新

Verbatim Chunks Beat Extracted Artifacts: A Controlled Ablation of Memory Representations for Long LLM Conversations

逐字块胜过提取的人工制品:长LLM对话中记忆表征的控制消融研究

Tao An

发表机构 * Hawaii Pacific University(夏威夷太平洋大学)

AI总结 通过控制消融实验,发现逐字对话块在长对话记忆检索中比LLM提取的结构化人工制品(事实、决策等)准确率高15.9-22.0点,原因是提取过程丢失了逐字细节,而结构化记忆应作为逐字文本的补充而非替代。

Comments v2: substantially revised -- reframed from a system paper to a controlled ablation study; title and conclusions updated accordingly. 26 pages, 5 figures

详情
AI中文摘要

一类日益增长的对话记忆系统将对话历史压缩为结构化人工制品——提取的事实、决策或事件——其前提是蒸馏后的结构比原始文本检索效果更好。我们通过控制消融实验检验了这一前提:在固定的检索-重排-推理流水线中,仅交换存储的表征——LLM提取的类型化人工制品与逐字对话块——保持模型、检索器、重排器和评判器不变。逐字块在LoCoMo上领先15.9个百分点(43.9% vs. 28.0%),在LongMemEval-S上领先22.0个百分点(67.4% vs. 45.4%);1跳语义图无法弥补差距,五个混淆控制实验重现了该效应。其机制是有损蒸馏:提取丢弃了逐字细节,而块则免费保留;提取人工制品流水线在整体准确率上从未超过朴素RAG。同时,使用近乎逐字、保留来源的单元所取得的积极结果也符合同一解释:检索准确性取决于表征与源文本的偏离程度。对于我们测试的提取设计,结构化记忆应增强逐字文本而非替代它:块∪人工制品联合存储在两个基准上都匹配块,而仅人工制品则丧失优势。代码和数据:此 https URL

英文摘要

A growing class of conversational-memory systems compresses dialogue history into structured artifacts -- extracted facts, decisions, or events -- on the premise that distilled structure retrieves better than raw text. We test this premise with a controlled ablation: within one fixed retrieval-rerank-reasoning pipeline, we swap only the stored representation -- LLM-extracted typed artifacts versus verbatim conversation chunks -- holding the model, retriever, reranker, and judge constant. Verbatim chunks win by 15.9 points on LoCoMo (43.9% vs. 28.0%) and 22.0 points on LongMemEval-S (67.4% vs. 45.4%); a 1-hop semantic graph does not recover the gap, and five confound controls reproduce the effect. The mechanism is lossy distillation: extraction discards verbatim detail that chunks retain for free, and the extracted-artifact pipeline never beats naive RAG in overall accuracy. Concurrent positive results with near-verbatim, provenance-preserving units fit the same account: retrieval accuracy tracks how far the representation departs from the source. For the extraction designs we test, structured memory should augment verbatim text rather than replace it: a chunks $\cup$ artifacts union store matches chunks on both benchmarks while artifacts alone forfeit the gap. Code and data: https://github.com/tao-hpu/cog-canvas

2602.22822 2026-06-15 cs.AI cs.LG 版本更新

FlexMS: A Unified Public Benchmark for Molecule Tandem Mass Spectrum Prediction

FlexMS:分子串联质谱预测的统一公共基准

Yunhua Zhong, Yixuan Tang, Yifan Li, Pan Liu, Zhiwen Yang, Jie Yang, Jun Xia

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) The University of Hong Kong(香港大学) Yangzhou University(扬州大学) Fudan University(复旦大学)

AI总结 提出FlexMS基准框架,通过标准化预处理、元数据条件和评估协议,实现跨公共资源的公平比较,并引入难度感知诊断指导模型选择。

Comments preprint version v3

详情
AI中文摘要

串联质谱(MS/MS)在小分子鉴定中至关重要,但当前的深度学习谱预测系统在实际评估和部署中仍存在困难。尽管新架构不断声称达到最先进性能,但不一致的元数据条件和纠缠的预处理流程阻碍了公平的架构比较。此外,现有评估通常局限于精心策划的数据集,未能捕捉真实代谢组学的异质性和跨领域偏移。而且,当前基准缺乏难度感知诊断,对模型在特定计算或数据约束下的行为视而不见。为解决这些问题,我们提出了FlexMS,一个模块化的公共数据基准框架,它在统一协议下标准化跨公共资源的MS/MS预测,同时保留分子编码器、元数据条件、预测头以及下游检索。FlexMS建立了一个公平的评估平台,显著降低了集成新预测工具的门槛。FlexMS不仅优化平均分数,还通过难度感知诊断增强聚合准确性,为不同计算约束、数据规模和下游检索目标下的模型选择提供可操作指导。最终,FlexMS为社区提供了一个可复现的标准,以识别哪些算法结论是稳定的,以及哪些操作点在实践中最为可行。

英文摘要

Tandem mass spectrometry (MS/MS) is central to small molecule identification, but current deep learning systems for spectrum prediction remain difficult to evaluate and deploy in practice. While novel architectures constantly claim state-of-the-art performance, inconsistent metadata conditioning and entangled preprocessing pipelines hinder fair architectural comparisons. In addition, existing evaluations are often restricted to curated datasets, failing to capture the heterogeneity and cross-domain shifts of real-world metabolomics. Furthermore, current benchmarks lack difficulty-aware diagnostics and are blind to how models behave under specific compute or data constraints. To address these issues, we present FlexMS, a modular public-data benchmark framework that standardizes MS/MS prediction across public resources while keeping molecular encoders, metadata conditioning, predictor heads, and downstream retrieval under one protocol. FlexMS establishes a fair evaluation playground that significantly lowers the barrier for integrating new predictive tools. Rather than solely optimizing for average scores, FlexMS augments aggregate accuracy with difficulty-aware diagnostics, providing actionable guidance on model selection across different compute constraints, data scales, and downstream retrieval objectives. Ultimately, FlexMS provides the community with a reproducible standard to identify which algorithmic conclusions are stable and which operating points are most viable in practice.

2606.07157 2026-06-15 cs.AI updated version

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

快速思考:估计前沿AI模型的无思维链任务完成时间范围

Dewi Gould, Francis Rhys Ward, Anders Cairns Woodruff, Rauno Arike, Josh Hills, Alex Serrano, Ida Caspary, Jason Ross Brown, Jo J. Jiao, Patrick Leask, Twm Stone, Ram Potham, Ionut Gabriel Stan, Harry Mayne, Simeon Hellsten, Shubhorup Biswas, Ariana Azarbal, William L. Anderson, Elle Najt, Ryan Greenblatt, Julian Stastny

Affiliations: Redwood Research; Astra Fellows Program; Aether Research; MATS Research; Polytechnic University of Catalonia; Imperial College London; University of Cambridge; University of Chicago; Durham University; MIT; University of Oxford; University of Glasgow; Constellation

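AI summary: Tests frontier AI models without chain-of-thought reasoning on more than 30,000 questions and estimates their 50% task-completion time horizon, finding that it has doubled roughly every year, with GPT-5.5 exceeding 3 minutes.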

AI Chinese abstract

许多确保前沿AI模型安全的努力依赖于监控其思维链(CoT)推理。如果模型能够在没有显式思考令牌的情况下内部执行足够复杂的推理,这将破坏这种监督。我们测量了前沿模型在无CoT情况下的推理能力,涉及超过3万个问题,涵盖数学、编程、谜题、因果推理、心理理论和策略推理等领域的43个基准测试。为了将模型与人类进行比较,我们估计了50%任务完成时间范围(TH):模型以50%成功率完成的任务所需的人类时间。我们还补充了50%推理令牌范围:模型以50%成功率解决的任务所需的最小o3-mini推理令牌数。我们发现,过去六年中,前沿模型的无CoT 50% TH大约每年翻一番,GPT-5.5的TH超过3分钟,推理令牌范围超过1500个令牌。我们的中位数估计预测,到2028年,前沿无CoT TH可能超过7分钟,到2030年超过25分钟,尽管这些预测存在很大的不确定性。我们建议前沿开发者明确跟踪这一指标。

English abstract

Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the $50\%$-task-completion time horizon (TH): the human time required for tasks a model completes with $50\%$ success rate. We complement this with a $50\%$ reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with $50\%$ success rate. We find that the no-CoT $50\%$ TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend frontier developers track this explicitly.
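The 50% time horizon defined above can be illustrated with a small fit. The sketch below assumes a logistic model of success probability against the logarithm of human completion time; the abstract does not state the fitting procedure, so the model, the function name `fit_time_horizon`, and the toy data are all illustrative assumptions rather than the authors' method.

```python
import math

def fit_time_horizon(tasks, lr=0.05, steps=5000):
    """Fit P(solved) = sigmoid(a + b * log2(minutes)) by gradient ascent.

    tasks: list of (human_minutes, solved) pairs with solved in {0, 1}.
    Returns the human time in minutes at which predicted success is 50%.
    """
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for minutes, solved in tasks:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += solved - p          # gradient of the log-likelihood w.r.t. a
            grad_b += (solved - p) * x    # gradient of the log-likelihood w.r.t. b
        a += lr * grad_a
        b += lr * grad_b
    # sigmoid(a + b * x) equals 0.5 exactly where x = -a / b
    return 2.0 ** (-a / b)

# Made-up results: short tasks are mostly solved, long tasks mostly are not.
toy_tasks = [(1, 1), (2, 1), (2, 0), (4, 1), (4, 0), (8, 0), (8, 1), (16, 0)]
horizon = fit_time_horizon(toy_tasks)
print(f"estimated 50% time horizon: {horizon:.2f} human-minutes")
```

Working in log-time means a doubling of the horizon is a fixed shift along the x-axis, which is why trends in this quantity are naturally reported as doubling times.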

2601.04646 2026-06-15 cs.IR cs.AI cs.CL cs.LG updated version

Succeeding at Scale: Enterprise Retrieval Benchmark Construction and Index-Preserving Query Adaptation for Multi-Tenant Search

规模化成功:面向多租户搜索的企业检索基准构建与索引保持查询适配

Prateek Jain, Shabari S Nair, Ritesh Goru, Prakhar Agarwal, Ajay Yadav, Yoga Sri Varshan Varadharajan, Constantine Caramanis

AI summary: To address scarce labeled data and costly model updates in multi-tenant retrieval systems, builds the DevRev-Search benchmark with a fully automated pipeline and studies index-preserving query adaptation, which fine-tunes only the query encoder while keeping document indices fixed, balancing quality and efficiency.

AI Chinese abstract

大规模多租户检索系统生成大量查询日志,但缺乏用于有效领域适应的精心策划的相关性标签,导致大量“暗数据”未被充分利用。模型更新的高成本加剧了这一挑战,因为联合微调查询和文档编码器需要完整的语料库重新索引,这在拥有数千个独立索引的多租户环境中是不切实际的。我们引入了DevRev-Search,这是一个通过完全自动化管道构建的技术客户支持段落检索基准。候选生成使用跨多种稀疏和密集检索器的融合,随后使用LLM作为评判器进行一致性过滤和相关性标记。我们进一步研究并系统评估了索引保持查询适配策略,该策略仅微调查询编码器,同时保持文档索引固定。在DevRev-Search、SciFact和FiQA-2018上的实验表明,参数高效的查询编码器微调提供了显著的质量-效率权衡,实现了可扩展且实用的企业多租户检索。

English abstract

Large-scale multi-tenant retrieval systems generate extensive query logs but lack curated relevance labels for effective domain adaptation, resulting in substantial underutilized "dark data." This challenge is compounded by the high cost of model updates, as jointly fine-tuning query and document encoders requires full corpus re-indexing, which is impractical in multi-tenant settings with thousands of isolated indices. We introduce DevRev-Search, a passage retrieval benchmark for technical customer support built via a fully automated pipeline. Candidate generation uses fusion across diverse sparse and dense retrievers, followed by an LLM-as-a-Judge for consistency filtering and relevance labeling. We further study and systematically evaluate index-preserving query-only adaptation strategies that fine-tune only the query-encoder while keeping the document indices fixed. Experiments on DevRev-Search, SciFact, and FiQA-2018 show that parameter-efficient fine-tuning of the query encoder delivers a remarkable quality-efficiency trade-off, enabling scalable and practical enterprise multi-tenant retrieval.
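The idea of adapting only the query side while leaving the document index untouched can be sketched in a few lines. This is a toy illustration under stated assumptions: a linear adapter `W` applied to a fixed query embedding and trained with softmax cross-entropy stands in for the parameter-efficient fine-tuning the abstract mentions, and `DOC_INDEX`, `train_query_adapter`, and the vectors are hypothetical.

```python
import math

# Frozen document index: these vectors are never modified or re-encoded.
DOC_INDEX = ((1.0, 0.0), (0.0, 1.0))

def scores(W, q):
    """Dot-product scores between the adapted query W @ q and each frozen document."""
    adapted = [sum(W[i][j] * q[j] for j in range(len(q))) for i in range(len(W))]
    return [sum(d[i] * adapted[i] for i in range(len(adapted))) for d in DOC_INDEX]

def retrieve(W, q):
    s = scores(W, q)
    return max(range(len(s)), key=s.__getitem__)

def train_query_adapter(pairs, dim=2, lr=0.5, epochs=50):
    """Learn W from (query_vector, gold_doc_id) pairs; only W is updated."""
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for _ in range(epochs):
        for q, gold in pairs:
            s = scores(W, q)
            m = max(s)
            exps = [math.exp(v - m) for v in s]
            probs = [e / sum(exps) for e in exps]
            for k, d in enumerate(DOC_INDEX):
                err = probs[k] - (1.0 if k == gold else 0.0)
                for i in range(dim):
                    for j in range(dim):
                        W[i][j] -= lr * err * d[i] * q[j]  # cross-entropy gradient step
    return W

identity = [[1.0, 0.0], [0.0, 1.0]]
query = (0.6, 0.4)                      # the unadapted encoder prefers document 0
adapter = train_query_adapter([(query, 1)])
print(retrieve(identity, query), retrieve(adapter, query))
```

Because `DOC_INDEX` is never written to, the same stored index serves both the original and the adapted query encoder, which is the property that avoids re-indexing.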

2601.15828 2026-06-15 cs.CL cs.AI updated version

Can professional translators identify machine-generated text?

专业翻译人员能否识别机器生成的文本?

Michael Farrell

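Affiliations: IULM University, Milan, Italy

AI summary: Experimentally studies whether professional translators without specialized training can identify AI-generated short stories. A minority (16.2%) distinguish them accurately, while many misjudge by relying on subjective impressions; low burstiness and narrative contradiction are the most reliable indicators.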

Comments Pages 581 to 591, Volume 1, proceedings of the 26th Annual Conference of the European Association for Machine Translation, 2026

AI Chinese abstract

本研究调查了未经专门训练的专业翻译人员能否可靠地识别由人工智能(AI)生成的意大利语短篇故事。69名翻译人员参加了一项现场实验,评估了三篇匿名短篇故事——两篇由ChatGPT-4o生成,一篇由人类作者撰写。对于每篇故事,参与者评估了AI作者身份的可能性并提供了选择理由。虽然平均结果不明确,但有一个统计上显著的子集(16.2%)成功区分了合成文本与人类文本,表明他们的判断基于分析技能而非偶然。然而,几乎相同数量的人以相反方向错误分类了文本,通常依赖主观印象而非客观标记,这可能反映了读者对AI生成文本的偏好。低突发性和叙事矛盾成为合成作者身份最可靠的指标,同时报告了意外的仿译、语义借用和来自英语的句法迁移。相比之下,语法准确性和情感基调等特征经常导致误分类。这些发现对专业语境中合成文本编辑的作用和范围提出了疑问。

English abstract

This study investigates whether professional translators without prior specialized training can reliably identify short stories generated in Italian by artificial intelligence (AI). Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.

2603.05167 2026-06-15 cs.CL cs.AI updated version

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

C2-Faith: 为思维链推理中的因果和覆盖忠实性基准测试LLM评判者

Avni Mittal, Rauno Arike

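Affiliations: SPARAI

AI summary: Proposes the C2-Faith benchmark, which evaluates how well LLM judges assess the faithfulness of chain-of-thought reasoning along two dimensions, causality and coverage, and finds marked weaknesses in error localization and coverage scoring.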

AI Chinese abstract

大型语言模型(LLM)越来越多地被用作思维链(CoT)推理的评判者,但目前尚不清楚它们能否可靠地评估过程忠实性,而不仅仅是答案的合理性。我们引入了C2-Faith,这是一个基于PRM800K构建的基准,明确将忠实性分解为两个互补维度:因果性(每一步是否逻辑上源自先前上下文)和覆盖性(是否包含必要的中间推理)。通过受控扰动,我们构建了具有已知因果错误位置的示例,将单个步骤替换为逻辑不一致的变体,并以不同速率进行受控覆盖删除,从而能够直接根据参考标签进行测量。我们评估了三个前沿的LLM评判者在三项任务上的表现:二元因果检测、因果步骤定位和覆盖评分。我们的结果表明,评判者的可靠性高度依赖于任务,没有单一模型在所有设置中占主导地位。虽然模型通常能检测到错误存在,但它们难以准确定位错误,这表明检测与归因之间存在显著差距。此外,所有评判者都系统性地高估了推理完整性,即使中间推理的很大部分缺失,也会给出高覆盖分数。这些发现揭示了LLM评判者在过程级评估中的根本局限性,并强调了在使用LLM评估推理质量时需要更可靠和校准的方法。

English abstract

Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that explicitly decomposes faithfulness into two complementary dimensions: causality (whether each step logically follows from prior context) and coverage (whether essential intermediate inferences are present). Using controlled perturbations, we construct examples with known causal error positions by replacing a single step with a logically inconsistent variant, and with controlled coverage deletions at varying rates, enabling direct measurement against reference labels. We evaluate three frontier LLM judges across three tasks: binary causal detection, causal step localization, and coverage scoring. Our results reveal that judge reliability is highly task-dependent, with no single model dominating across settings. While models often detect that an error exists, they struggle to accurately localize it, indicating a substantial gap between detection and attribution. Moreover, all judges systematically overestimate reasoning completeness, assigning high coverage scores even when substantial portions of intermediate reasoning are missing. These findings expose fundamental limitations of LLM judges in process-level evaluation and highlight the need for more reliable and calibrated methods when using LLMs to assess reasoning quality.
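The two controlled perturbations described above (replacing one step, and deleting steps at a given rate) can be sketched as follows. The function names, the choice to protect the final step from deletion, and the example chain are assumptions made for illustration; the abstract does not specify these details.

```python
import random

def corrupt_step(steps, position, bad_step):
    """Causal perturbation: swap one step for an inconsistent variant.

    Returns the perturbed chain and the known error position (the reference label).
    """
    perturbed = list(steps)
    perturbed[position] = bad_step
    return perturbed, position

def delete_steps(steps, rate, rng):
    """Coverage perturbation: drop a fraction of intermediate steps.

    The final step is kept so the answer stays visible (an assumption of this sketch).
    Returns the shortened chain and the sorted indices that were removed.
    """
    intermediate = list(range(len(steps) - 1))
    k = round(rate * len(intermediate))
    dropped = set(rng.sample(intermediate, k))
    kept = [s for i, s in enumerate(steps) if i not in dropped]
    return kept, sorted(dropped)

chain = ["x + 2 = 5", "subtract 2 from both sides", "x = 5 - 2", "x = 3", "answer: 3"]
bad_chain, error_at = corrupt_step(chain, 2, "x = 5 + 2")
short_chain, removed = delete_steps(chain, 0.5, random.Random(0))
print(error_at, removed, short_chain)
```

Since both functions return the ground-truth label alongside the perturbed chain, a judge's localization or coverage score can be compared against it directly.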

2603.23530 2026-06-15 cs.CL cs.AI cs.LG updated version

Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

你忘记我问什么了吗?大型语言模型中的前瞻记忆失败

Avni Mittal

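Affiliations: University of Washington

AI summary: Viewing the problem through prospective memory from cognitive psychology, finds that LLM compliance with formatting instructions drops by 2-21% under concurrent task load, and proposes a salience-enhanced format that recovers compliance.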

AI Chinese abstract

大型语言模型在必须同时执行要求较高的任务时,常常无法满足格式化指令。我们通过认知心理学中的前瞻记忆视角,使用一个受控范式来研究这种行为,该范式将可验证的格式化约束与复杂度递增的基准任务相结合。在三个模型家族和超过8000个提示中,在并发任务负载下,遵从性下降了2-21%。脆弱性高度依赖于类型:终端约束(需要在响应边界采取行动)下降最多,高达50%,而避免约束相对稳健。显著性增强格式(显式指令框架加上尾部提醒)恢复了大量丢失的遵从性,在许多设置中将性能恢复到90-100%。干扰是双向的:格式化约束也可能降低任务准确性,其中一个模型的GSM8K准确率从93%下降到27%。在额外的堆叠实验中,随着约束的累积,联合遵从性急剧下降。所有结果均使用确定性程序化检查器,无需LLM作为评判组件,并在公开可用的数据集上进行。

English abstract

Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective memory inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.
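A deterministic programmatic checker of the kind mentioned above can be very small. The sketch below shows one terminal constraint, one avoidance constraint, and a salience-enhanced prompt layout (instruction framing plus a trailing reminder); the marker `[END]`, the forbidden word, and the exact prompt wording are hypothetical choices, not taken from the paper.

```python
def build_prompt(task, constraint, salient=False):
    """Plain prompt, or a salience-enhanced one with framing and a trailing reminder."""
    if not salient:
        return f"{task}\n{constraint}"
    return (
        f"IMPORTANT FORMATTING INSTRUCTION: {constraint}\n\n"
        f"{task}\n\n"
        f"Reminder: {constraint}"
    )

def check_terminal(response, marker="[END]"):
    """Terminal constraint: the response must finish with the marker."""
    return response.rstrip().endswith(marker)

def check_avoidance(response, forbidden="however"):
    """Avoidance constraint: the forbidden word must not appear anywhere."""
    return forbidden.lower() not in response.lower()

constraint = "End your answer with [END]."
prompt = build_prompt("Solve: 17 * 3", constraint, salient=True)
print(prompt)
print(check_terminal("The result is 51. [END]\n"), check_terminal("The result is 51."))
```

Checks like these return the same verdict on every run, so compliance rates can be compared across models without a judge model in the loop.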

2604.07530 2026-06-15 cs.DL cs.AI cs.CY cs.SI updated version

The Shrinking Lifespan of LLMs in Science

科学领域中LLM的生命周期缩短

Ana Trišović

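Affiliations: Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology

AI summary: Analyzes the scientific adoption trajectories of 62 LLMs across more than 100,000 citing papers and finds that a model's lifespan is determined mainly by its release year, with each successive release year associated with a 27% shorter time-to-peak and a 23% shorter lifespan.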

AI Chinese abstract

缩放定律描述了语言模型能力如何随计算和数据增长,但未说明模型发布后能持续多久。我们引入峰值时间和寿命作为模型过时的度量,并利用它们刻画62个LLM在超过10万篇引用论文(2019-2025年)中的科学采纳轨迹,将主动采纳与背景引用分离,以恢复引用计数无法解析的每个模型轨迹。我们发现,模型的寿命更多地由其发布时间而非特征决定:发布年份比架构、开放性或规模更能预测峰值时间和寿命。LLM的采纳遵循倒U型曲线(发布后上升、达到峰值然后下降),但这种模式正在迅速压缩。每个后续发布年份与峰值时间缩短27%和寿命缩短23%相关(p < 0.001),这一结果对最小年龄阈值和模型规模控制具有稳健性。这些采纳侧动态对缩放定律不可见,表明专注于任何单一模型可能是一项贬值的投资,其成本落在可重复性和迁移上。

English abstract

Scaling laws describe how language model capabilities grow with compute and data, but say nothing about how long a model matters once released. We introduce time-to-peak and lifespan as measures of model obsolescence and use them to characterize the scientific adoption trajectories of 62 LLMs across more than 108k citing papers (2019-2025), separating active adoption from background citation to recover per-model trajectories that citation counts cannot resolve. We find that a model's longevity is shaped more by when it was released than by its characteristics: release year predicts time-to-peak and lifespan more strongly than architecture, openness, or scale. LLM adoption follows an inverted-U curve (rising after release, peaking, and then declining), but this pattern is rapidly compressing. Each successive release year is associated with a 27% shorter time-to-peak and a 23% shorter lifespan ($p < 0.001$), robust to minimum-age thresholds and controls for model size. These adoption-side dynamics are invisible to scaling laws and suggest that specialization on any single model may be a depreciating investment, with costs falling on reproducibility and migration.
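To get a feel for the reported per-year shrinkage, the arithmetic below compounds the 27% and 23% figures over several release years. Treating the effect as multiplicative from one release year to the next is an assumption of this sketch (the abstract reports an association, not a compounding law), and `relative_duration` is a hypothetical helper.

```python
def relative_duration(years_later, shrink_per_year):
    """Duration relative to a baseline release year, assuming the shrinkage compounds."""
    return (1.0 - shrink_per_year) ** years_later

for years in (1, 2, 3):
    peak = relative_duration(years, 0.27)   # time-to-peak, 27% shorter per release year
    life = relative_duration(years, 0.23)   # lifespan, 23% shorter per release year
    print(f"{years} year(s) later: time-to-peak x{peak:.3f}, lifespan x{life:.3f}")
```

Under this reading, a model released three years after a baseline model would reach its peak in well under half the time.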

2604.14892 2026-06-15 cs.LG cs.AI updated version

Can LLMs Accurately Score Medical Diagnoses and Clinical Reasoning?

LLM能否准确评分医学诊断和临床推理?

Amy Rouillard, Sitwala Mundia, Linda Camara, Ziyaad Dangor, Michael Cameron Gramanie, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett

Affiliations: Wits MIND Institute, University of the Witwatersrand, Johannesburg, South Africa; Grai Labs, Cape Town, South Africa; South African Medical Research Council Vaccines and Infectious Diseases Analytics Research Unit, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Department of Internal Medicine, Charlotte Maxeke Johannesburg Academic Hospital, and Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Department of Paediatrics and Child Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa

AI summary: Uses an LLM Jury to score 3334 diagnoses on 300 hospital cases from low- and middle-income countries and finds that calibrated LLM scores agree closely with expert scores and carry a lower risk of severe errors, supporting their use as a reliable evaluation proxy.

AI Chinese abstract

使用专家临床医生小组评估医学AI系统成本高且速度慢,这促使使用大型语言模型(LLM)作为替代评判者。在此,我们评估了一个由三个前沿AI模型组成的LLM陪审团,对300个真实低收入和中等收入国家(LMIC)医院病例的3334个诊断进行评分。LLM和临床医生生成的诊断均根据专家小组诊断在四个维度上进行评分:诊断、鉴别诊断、临床推理和负面治疗风险。将LLM陪审团评分与专家和独立重新评分小组的评分进行比较,以评估误差指标、评分者间一致性、严重风险错误以及使用保序回归进行事后校准的效果。在我们的数据中,我们发现:(i)未校准的LLM陪审团评分与专家临床医生小组评分保持序数一致性,但系统性地更低;(ii)LLM陪审团出现严重风险错误的概率低于人类专家重新评分小组;(iii)LLM陪审团结合LLM诊断可用于识别高风险错误诊断,从而实现有针对性的专家审查并提高小组效率;(iv)校准后的LLM陪审团评分和诊断代理排名与主要专家小组的评分和排名表现出极好的一致性;(v)LLM陪审团模型没有表现出自我偏好偏差,它们对自己底层模型或同一供应商模型生成的诊断评分并不比其他模型生成的诊断更有利(或更不利)。总之,这些结果提供了证据,表明校准后的LLM陪审团是医学AI基准测试中专家临床医生评估的值得信赖且可靠的代理。在其他临床环境中确认这些发现是未来工作的重要方向。

English abstract

Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM Jury, composed of three frontier AI models, for scoring 3334 diagnoses on 300 real-world low- and middle-income country (LMIC) hospital cases. Both LLM- and clinician-generated diagnoses are scored against expert panel diagnoses across four dimensions: diagnosis, differential diagnosis, clinical reasoning, and negative treatment risk. The LLM Jury scores are compared with expert and independent re-scoring panel scores to assess error metrics, inter-rater agreement, severe-risk errors, and the effect of post hoc calibration using isotonic regression. In our data, we find that: (i) the uncalibrated LLM Jury scores preserve ordinal agreement with the expert clinician panel scores, but are systematically lower; (ii) the probability of severe-risk errors is lower for the LLM Jury than the human expert re-score panels; (iii) the LLM Jury combined with LLM diagnoses can be used to identify diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (iv) the calibrated LLM Jury scores and rankings of diagnosing agents show excellent agreement with those of the primary expert panels; (v) LLM Jury models show no self-preference bias: they did not score diagnoses generated by their own underlying model or models from the same vendor more (or less) favourably than those generated by other models. Together, these results provide evidence that a calibrated LLM Jury is a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking. Confirming these findings in other clinical settings is an important direction for future work.
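Post hoc calibration with isotonic regression fits a non-decreasing map from jury scores to expert scores, which suits scores that keep the right order but sit systematically low. The sketch below implements the standard pool-adjacent-violators procedure in plain Python; the function names and the toy scores are illustrative assumptions, and the paper's actual calibration setup is not described in the abstract.

```python
def isotonic_fit(jury_scores, expert_scores):
    """Pool-adjacent-violators: fit a non-decreasing step function.

    Returns a list of (x_max, value) blocks sorted by x_max.
    """
    blocks = []  # each block is [sum_of_targets, count, largest_x_in_block]
    for x, y in sorted(zip(jury_scores, expert_scores)):
        blocks.append([y, 1, x])
        # Merge backwards while the previous block's mean exceeds the new one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, x_max = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = x_max
    return [(x_max, total / count) for total, count, x_max in blocks]

def calibrate(model, jury_score):
    """Map a raw jury score to the calibrated scale using the fitted blocks."""
    for x_max, value in model:
        if jury_score <= x_max:
            return value
    return model[-1][1]

# Made-up paired scores: the jury preserves most of the ordering but reads low.
model = isotonic_fit([1, 2, 3, 4], [2.0, 3.0, 2.0, 4.0])
print(model)
print(calibrate(model, 2), calibrate(model, 5))
```

Because the fitted map never decreases, calibration changes the scale of the jury scores without reordering the diagnoses they rank.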

2605.21182 2026-06-15 cs.CL cs.AI cs.CV updated version

Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding

Manga109-v2026: 重新审视Manga109标注以适应现代漫画理解

Jeonghun Baek, Atsuyuki Miyai, Shota Onohara, Hikaru Ikuta, Kiyoharu Aizawa

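Affiliations: University of Tokyo

AI summary: Revisits the dialogue text annotations of Manga109 and identifies five categories of annotation issues, including inaccurate transcriptions, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons. Combining OCR-based issue detection with manual revision, the authors build Manga109-v2026, revising approximately 29,000 dialogue annotations to better align the dataset with modern OCR and multimodal manga understanding systems while preserving the expressive structures characteristic of manga.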

Comments Accepted to the Culture x AI Workshop at ICML 2026. Project page: https://manga109.github.io/manga109-project-website/en/

AI Chinese abstract

漫画是一种具有文化特色的多模态媒介,是日本流行文化中最具影响力的形态之一。随着AI系统越来越多地针对漫画理解、OCR和翻译进行研究,Manga109已成为漫画相关AI研究的基础数据集。然而,当前的Manga109数据集包含转录错误和粗略的标注,这与现代OCR和多模态漫画理解任务不匹配。在本工作中,我们重新审视Manga109的对话文本标注,识别出五类标注问题,包括转录错误、缺失文本区域、对话与拟声词重叠以及未分割的对话气泡。为了解决这些问题,我们结合基于OCR的问题检测和人工修订,构建了Manga109-v2026,修订了大约29,000个对话标注。我们的修订使Manga109更好地适应现代OCR和多模态漫画理解系统,同时保留了漫画特有的表达结构。

English abstract

Manga is a culturally distinctive multimodal medium and one of the most influential forms of Japanese popular culture. As AI systems increasingly target manga understanding, OCR, and translation, Manga109 has become a foundational dataset for manga-related AI research. However, the current Manga109 dataset contains inaccurate transcriptions and coarse annotations, which do not align well with modern OCR and multimodal manga understanding tasks. In this work, we revisit the dialogue text annotations of Manga109 and identify five categories of annotation issues, including inaccurate transcriptions, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons. To address these issues, we combine OCR-based issue detection and manual revision to construct Manga109-v2026, revising approximately 29,000 dialogue annotations. Our revisions better align Manga109 with modern OCR and multimodal manga understanding systems while preserving expressive structures characteristic of manga.

2606.08881 2026-06-15 cs.RO cs.AI updated version

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

在SO-101上对视觉-语言-动作模型进行基准测试:失败与恢复分析

Yi Yu, Xinchuan Qiu

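Affiliations: Graduate School of Advanced Science and Engineering, Hiroshima University

AI summary: Proposes a benchmark on the low-cost SO-101 robotic platform that uses a failure taxonomy and recovery-aware metrics to systematically compare VLA and imitation learning policies, finding that execution instability is the dominant failure source.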

Comments 13 pages, 9 figures

AI Chinese abstract

视觉-语言-动作(VLA)模型在机器人操作中展现出强大的泛化能力,但现有评估主要在仿真或昂贵机器人平台上进行,其在低成本真实机器人上的鲁棒性尚未充分探索。我们提出了一个标准化的真实世界基准,用于在低成本SO-101机器人平台上评估代表性VLA和模仿学习策略。该基准包含四个代表性操作任务和统一评估协议,能够在具身不确定性下进行系统比较。使用真实遥操作演示,我们直接在物理平台上微调和评估$π_{0.5}$、SmolVLA、Wall-X和ACT。除了传统的任务成功率,该基准还包含结构化的失败分类、语义级和执行级失败分解,以及恢复感知评估指标,以表征策略鲁棒性。实验结果表明,更强的预训练VLA策略通常优于模仿学习基线,尽管在低成本机器人部署条件下性能高度依赖于任务。执行不稳定是主要的失败源,而恢复能力在不同架构间差异显著。这些结果强调了超越二元任务成功进行失败和恢复分析的重要性,并将SO-101确立为在现实低成本机器人部署条件下评估具身AI系统的实用基准。

English abstract

Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark comprises four representative manipulation tasks together with unified evaluation protocols, enabling systematic comparison under embodiment uncertainty. Using real-world teleoperated demonstrations, we fine-tune and evaluate $π_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark incorporates a structured failure taxonomy, semantic- and execution-level failure decomposition, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance remains highly task-dependent under low-cost robotic deployment conditions. Execution instability emerges as the dominant failure source, while recovery capability varies substantially across architectures. These results highlight the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robotic deployment conditions.

10. AI Applications and Systems (35 papers)

2606.13731 2026-06-15 cs.AI cs.MA new submission

TwinBI: An Agentic Digital Twin for Efficient Augmented Interactions with Business Intelligence Dashboards

TwinBI:一种用于与商业智能仪表盘高效增强交互的智能数字孪生

Jisoo Jang, Wen-Syan Li

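Affiliations: Graduate School of Data Science, Seoul National University

AI summary: Proposes the TwinBI framework, which couples an LLM agent with executable dashboard state to unify conversation, manipulation, semantic grounding, and provenance, improving state consistency in multi-step analysis. Exact-match accuracy rises from 43.3% to 63.3% and the timeout rate falls from 40% to 10%.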

AI Chinese abstract

商业智能(BI)越来越多地将仪表盘交互与基于LLM的辅助相结合,但这两种模式在多步分析中常常不同步。当用户在直接仪表盘操作和自然语言查询之间切换时,很难在过滤器、层次结构、指标和图表上下文中保持一致的分析状态。我们提出TwinBI,一种智能数字孪生框架,将基于LLM的代理系统与可执行的BI仪表盘状态耦合。TwinBI通过从统一交互日志重建的共享分析状态,统一了对话交互、仪表盘操作、语义基础和溯源追踪。它还公开了诸如模式视图、SQL、日志和/insights命令等工件,用于基于状态的分析摘要。我们通过两种互补方式评估TwinBI。在相同骨干代理的受控A/B基准测试中,与仅使用仪表盘相比,TwinBI将精确匹配准确率从43.3%提高到63.3%,部分信用准确率从48.3%提高到70.8%,并显著将超时率从40.0%降低到10.0%。在可用性研究中,参与者受益于集成的仪表盘和聊天工作流,任务准确性高,工作负载适中,对状态感知交互机制评价良好。这些结果表明,TwinBI通过将可见的仪表盘状态转化为更丰富的可操作上下文,提高了代理级别的分析可靠性和面向用户的分析支持。我们的数据集和源代码可在以下网址获取:https://github.com/simonjisu/TwinBI

English abstract

Business intelligence (BI) increasingly combines dashboard interaction with LLM-based assistance, but these two modes often fall out of sync during multi-step analysis. As users switch between direct dashboard manipulation and natural-language queries, it becomes difficult to preserve a consistent analytical state across filters, hierarchies, metrics, and chart context. We present TwinBI, an agentic digital-twin framework that couples an LLM-based agent system with an executable BI dashboard state. TwinBI unifies conversational interaction, dashboard manipulation, semantic grounding, and provenance tracking through a shared analytical state reconstructed from a unified interaction log. It also exposes artifacts such as schema views, SQL, logs, and an /insights command for state-grounded analytical summaries. We evaluate TwinBI in two complementary ways. In a controlled A/B benchmark with the same backbone agent, TwinBI improves exact-match accuracy from 43.3% to 63.3%, partial-credit accuracy from 48.3% to 70.8%, and substantially reduces timeout rate from 40.0% to 10.0% relative to Dashboard alone. In a usability study, participants benefited from the integrated dashboard-and-chat workflow, with high task accuracy, moderate workload, and favorable ratings for state-aware interaction mechanisms. These results suggest that TwinBI improves both agent-level analytical reliability and user-facing analytical support by turning visible dashboard state into richer actionable context. Our dataset and source code are available at: https://github.com/simonjisu/TwinBI
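Reconstructing a shared analytical state from a unified interaction log is, in essence, a replay over ordered events from both the dashboard and the chat agent. The sketch below shows that pattern under stated assumptions: the event schema (`source`, `action`, `field`, `value`) and the state layout are hypothetical and are not taken from the TwinBI code base.

```python
def replay(log):
    """Rebuild the analytical state by applying every logged event in order."""
    state = {"filters": {}, "metric": None}
    for event in log:
        action = event["action"]
        if action == "set_filter":
            state["filters"][event["field"]] = event["value"]
        elif action == "clear_filter":
            state["filters"].pop(event["field"], None)
        elif action == "set_metric":
            state["metric"] = event["value"]
    return state

# Dashboard clicks and chat requests share one log, so both views see the same state.
log = [
    {"source": "dashboard", "action": "set_filter", "field": "region", "value": "EMEA"},
    {"source": "chat", "action": "set_metric", "value": "revenue"},
    {"source": "chat", "action": "set_filter", "field": "year", "value": 2025},
    {"source": "dashboard", "action": "clear_filter", "field": "region"},
]
state = replay(log)
print(state)
```

Keeping the log as the single source of truth also gives provenance for free, since each part of the state can be traced to the event, and the modality, that produced it.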

2606.13871 2026-06-15 cs.AI cs.DB new submission

Hyperdimensional computing for structured querying on tabular data embeddings

超维计算用于表格数据嵌入的结构化查询

Sebastián Bugedo, Stijn Vansummeren

发表机构 * UHasselt, DSI Diepenbeek(哈塞尔特大学,数据科学研究所迪彭贝克)

AI总结 针对表格嵌入缺乏可解释相似度的问题,提出基于超维计算(HDC)的框架,利用全息简化表示模型实现结构化查询,推导出等值与非等值谓词的闭式期望相似度,支持可靠零匹配检测。

Comments 15 pages with appendices. 8 figures. Under review

详情
AI中文摘要

表格数据嵌入已成为数据分析和数据集成管道的基石,支持实体注释与解析、模式匹配、列类型检测以及表格搜索等任务。现有方法将行、列或整个表格嵌入向量空间,并依赖最近邻搜索来检索候选匹配。当前嵌入方法的一个根本局限性是缺乏可解释的相似度分数:查询与其最近邻之间的具体相似度值没有内在含义,因此无法确定该邻居是真正匹配还是只是语料库中无有效答案时最不相似的项目。这种无法为检索设置原则性阈值的问题阻碍了实际部署,特别是对于零匹配检测。我们研究了超维计算(HDC)的使用,特别是全息简化表示(HRR)模型,作为当检索任务对应于在向量空间中回答结构化选择-投影查询时的表格行嵌入框架。利用HDC操作的代数性质,我们推导出等值和非等值检索谓词的闭式期望相似度值,这些值随着维度的增加收敛到可解释的值,并利用这些值来识别合适的检索阈值。我们在两个真实世界数据集上,针对不同表格大小和谓词长度,将HDC与基于图的基线EmbDI进行了评估。结果表明,HDC在所有配置下的行检索中与EmbDI相当或更优,更稳健地处理非等值谓词,并在足够维度下实现完美的属性投影准确性——同时通过其原则性阈值独特地实现了可靠识别零匹配谓词。

英文摘要

Tabular data embeddings have become a cornerstone of data profiling and data integration pipelines, enabling tasks such as entity annotation and resolution; schema matching; column type detection; and table search, among others. Existing approaches embed rows, columns, or entire tables into a vector space and rely on nearest-neighbor search to retrieve candidate matches. A fundamental limitation of current embedding methods is the lack of interpretable similarity scores: the concrete similarity value between a query and its nearest neighbour carries no intrinsic meaning, making it impossible to determine whether that neighbour is a true match or simply the least-dissimilar item in a corpus that contains no valid answer. This inability to set principled thresholds for retrieval undermines practical deployment, particularly for zero-match detection. We investigate the use of HyperDimensional Computing (HDC), specifically the Holographic Reduced Representations (HRR) model, as a framework for tabular row embeddings when the retrieval task corresponds to answering structured select-project queries in vector space. Exploiting the algebraic properties of HDC operations, we derive closed-form expected similarity values for both equality and non-equality retrieval predicates, which converge to interpretable values as dimensionality increases, and use these to identify suitable retrieval thresholds. We evaluate HDC against EmbDI, a graph-based baseline, on two real-world datasets across varying table sizes and predicate lengths. Our results show that HDC matches or outperforms EmbDI for row retrieval across all configurations, handles non-equality predicates more robustly, and achieves perfect attribute projection accuracy at sufficient dimensionality -- while uniquely enabling reliable identification of zero-match predicates through its principled thresholds.
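Holographic Reduced Representations encode a row as a sum of role-value bindings, where binding is circular convolution and unbinding with a role approximately recovers its value. The sketch below illustrates a projection query on one row in plain Python; the dimensionality, the attribute names, and the codebook are illustrative assumptions, and it does not reproduce the closed-form thresholds derived in the paper.

```python
import math
import random

def random_vector(n, rng):
    """HRR vectors: i.i.d. Gaussian entries with variance 1/n."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]

def bind(a, b):
    """Circular convolution (O(n^2) here for clarity; FFT is the usual choice)."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

def unbind(trace, role):
    """Approximate inverse: convolve the trace with the involution of the role."""
    n = len(role)
    inverse = [role[(-i) % n] for i in range(n)]
    return bind(inverse, trace)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def encode_row(row, roles, values):
    """Row embedding: sum over attributes of role (*) value."""
    trace = [0.0] * len(next(iter(roles.values())))
    for attribute, value in row.items():
        trace = [t + b for t, b in zip(trace, bind(roles[attribute], values[value]))]
    return trace

rng = random.Random(0)
n = 512
roles = {name: random_vector(n, rng) for name in ("name", "city", "year")}
values = {name: random_vector(n, rng) for name in ("Ada", "Ghent", "1999", "Lima")}
row = encode_row({"name": "Ada", "city": "Ghent", "year": "1999"}, roles, values)

# Projection query: which value is stored under the "city" attribute?
decoded = unbind(row, roles["city"])
sims = {name: cosine(decoded, vec) for name, vec in values.items()}
best = max(sims, key=sims.get)
print(best, {name: round(s, 2) for name, s in sims.items()})
```

The stored value stands out clearly from values not bound to the queried role, whose similarity concentrates near zero as the dimensionality grows; that separation is what makes a principled retrieval threshold possible.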

2606.13916 2026-06-15 cs.AI new submission

A Multi-Agent AI System for Automated High School Transcript Processing: Collaborative Document Analysis at Scale

用于自动化高中成绩单处理的多智能体AI系统:大规模协作文档分析

Ben Torkian, Jun Zhou

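Affiliations: University of South Carolina

AI summary: Proposes a multi-agent AI system in which three specialized agents collaborate to automatically process high school transcripts in varied formats, using GPA extraction as a coordination signal, achieving 96.7% accuracy at 45 seconds per transcript.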

AI Chinese abstract

每年,大学招生办公室面临一个巨大的挑战:处理数百万份高中成绩单,每份都有独特的格式、评分系统和布局。这一手动过程造成了操作瓶颈,延迟了招生决定并消耗了宝贵资源。我们通过一个多智能体AI系统提出了变革性解决方案,其中专业智能体通过智能协调和通信自动处理多样化的成绩单格式。我们的多智能体架构包括三个专业智能体——用于格式特定解析的模式识别智能体、用于自然语言理解的语义分析智能体和用于多模态文档分析的视觉智能体——由一个编排智能体协调,管理智能体通信和结果整合。我们的关键创新在于基于智能体的质量控制,使用GPA提取作为协调信号,确保可靠的智能体协作并防止关键信息丢失。在对来自美国13个州高中的40份真实成绩单进行评估时,我们的智能体系统成功处理了每一份文档,与专家手动审查相比达到了96.7%的准确率,同时保持了每份成绩单45秒的实际处理速度。这项工作展示了多智能体协调如何解决复杂的文档处理挑战,为机构提供了一种可扩展的协作AI解决方案,在保持准确性的同时大幅减少处理时间。

English abstract

Each year, college admissions offices face an overwhelming challenge: processing millions of high school transcripts, each with unique formats, grading systems, and layouts. This manual process creates operational bottlenecks that delay admissions decisions and consume valuable resources. We present a transformative solution through a multi-agent AI system where specialized agents collaborate to automatically process diverse transcript formats through intelligent coordination and communication. Our multi-agent architecture consists of three specialized agents (a Pattern Recognition Agent for format-specific parsing, a Semantic Analysis Agent for natural language understanding, and a Vision Intelligence Agent for multimodal document analysis), coordinated by an Orchestration Agent that manages agent communication and result reconciliation. Our key innovation lies in agent-based quality control using GPA extraction as a coordination signal, ensuring reliable agent collaboration and preventing critical information loss. When evaluated on 40 real-world transcripts from high schools across 13 U.S. states, our agent system successfully processed every document, achieving 96.7% accuracy compared to expert manual review while maintaining practical processing speeds of 45 seconds per transcript. This work demonstrates how multi-agent coordination can solve complex document processing challenges, offering institutions a scalable, collaborative AI solution that preserves accuracy while dramatically reducing processing time.

2606.14119 2026-06-15 cs.AI new submission

FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories

FactoryLLM:用于评估智能工厂中大语言模型的安全开源AI实验场

Yash Pulse, Yong-Bin Kang, Abhik Banerjee, Abdur Forkan, Prem Prakash Jayaraman

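AI summary: Proposes FactoryLLM, a safe and open-source AI playground that evaluates RAG-based LLMs by analysing documentation from multiple machines, with a dual evaluation setup using RAGAS and NVIDIA's LLM-as-a-Judge metrics. A case study demonstrates effective cross-machine document reasoning.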

Comments 6 pages, 3 figures, IEEE INDIN 2026

AI Chinese abstract

智能工厂中的故障诊断和恢复具有挑战性,因为关键信息分散在通过制造过程相互连接的多台机器的手册中。大语言模型(LLM)提供了一种有前景的方法。在本文中,我们提出了FactoryLLM,一个安全开源的AI实验场,旨在通过分析制造过程中多台机器的文档来评估不同的基于LLM的检索增强生成(RAG)模型。FactoryLLM使用户能够配置LLM,并通过使用RAGAS和NVIDIA的LLM-as-a-Judge指标的双重评估设置,评估在多个文档上进行推理时的性能。FactoryLLM是安全的,因为它允许用户运行本地或开源LLM,而无需共享敏感的工业数据,提供了一个受控的实验环境。我们通过一个涉及自主智能车辆及其移动规划器软件的案例研究展示了FactoryLLM的有效性,评估了三个LLM在来自约600页跨机器文档的30个维护查询上的表现。结果表明,FactoryLLM在跨机器文档推理方面是有效的:每个模型的有据性(groundedness)得分均高于0.88。用于社区在特定制造场景中测试FactoryLLM的完整代码和文档已公开提供。

English abstract

Fault diagnostics and recovery in smart factories are challenging because critical information is dispersed across the manuals of multiple machines that are interconnected through the manufacturing process. Large Language Models (LLMs) offer a promising approach. In this paper, we propose FactoryLLM, a safe and open-source AI playground designed for evaluating different LLM-based retrieval-augmented generation (RAG) models by analysing documents from multiple machines across the manufacturing process. FactoryLLM enables the user to configure the LLM and assess performance when reasoning over multiple documents, through a dual evaluation setup using both RAGAS and NVIDIA's LLM-as-a-Judge metrics. FactoryLLM is safe because it allows users to run local or open-source LLMs without sharing sensitive industrial data, providing a controlled environment for experimentation. We demonstrate the efficacy of FactoryLLM through a case study involving an Autonomous Intelligent Vehicle and its Mobile Planner software, evaluating three LLMs across 30 maintenance queries derived from approximately 600 pages of cross-machine documentation. The results suggest that FactoryLLM is effective in cross-machine document reasoning: every model achieved a groundedness score above 0.88. The full code and documentation for the community to test FactoryLLM with their manufacturing-specific scenarios are publicly available.

2606.13695 2026-06-15 physics.geo-ph cs.AI cs.LG cross-list

Korzhinskii-Net: Physics-Informed Neural Network for Sub-Surface Mineral Prospectivity Modelling

Korzhinskii-Net: 用于地下矿产潜力建模的物理信息神经网络

Boris Kriuk

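Affiliations: The Hong Kong University of Science and Technology

AI summary: Proposes Korzhinskii-Net, a 2-D radial physics-informed neural network coupling Darcy flow, heat transport, and a reaction rate, which attains a mean PR-AUC of 0.885 across five ore provinces and four commodity classes, far above classical baselines.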

Comments 12 pages, 7 figures, 3 tables

AI Chinese abstract

矿产潜力建模(MPM)支撑着勘探经济学,然而大多数操作流程简化为基于浅表地表代理训练的数据驱动分类器。这类模型对实际定位矿石的地下物理过程(热平流、流体流动和岩性依赖的沉淀)视而不见。我们提出Korzhinskii-Net,一个二维径向物理信息神经网络(PINN),它将达西流、平流-扩散热输运和softplus饱和反应速率耦合到一个可微的正演模型中,并由地表和遥感代理弱监督。该网络以Dmitri S. Korzhinskii(1899-1985)命名,其渗滤交代作用理论提供了物理框架。我们在五个矿省(涵盖四种矿种:诺里尔斯克(Ni-Cu-PGE)、佩琴加(Ni-Cu硫化物)、乌多坎(砂岩型Cu)、苏霍伊洛格(造山型Au)和米尔内(金伯利岩型钻石))上,采用公平、泄漏控制的5折交叉验证协议(含硬环形负样本)评估Korzhinskii-Net。Korzhinskii-Net的平均PR-AUC为0.885,而最强经典基线(梯度提升)为0.281;平均分位数排名为0.019,对比基线为0.413。这一改进在所有五个矿省和四个矿种系统中一致,表明即使仅受全球开放数据代理约束,物理信息可微模拟器也能恢复纯特征学习器系统性地遗漏的定位模式。我们将完整流程和评估工具开源。

English abstract

Mineral prospectivity modelling (MPM) underpins exploration economics, yet most operational pipelines reduce to data-driven classifiers trained on shallow surface proxies. Such models are blind to the subsurface physics that actually localises ore: heat advection, fluid flow, and lithology-dependent precipitation. We present Korzhinskii-Net, a 2-D radial physics-informed neural network (PINN) that couples Darcy flow, advective-diffusive heat transport, and a softplus-saturated reaction rate into a single differentiable forward model, weakly supervised by surface and remote-sensing proxies. The network is named after Dmitri S. Korzhinskii (1899-1985), whose theory of infiltration metasomatism provides the physical scaffold. We evaluate Korzhinskii-Net on five ore provinces spanning four commodity classes -- Norilsk (Ni-Cu-PGE), Pechenga (Ni-Cu sulphide), Udokan (sandstone-hosted Cu), Sukhoi Log (orogenic Au), and Mirny (kimberlitic diamond) -- under a fair, leakage-controlled 5-fold cross-validation protocol with hard ring-shaped negatives. Korzhinskii-Net attains a mean PR-AUC of 0.885 versus 0.281 for the strongest classical baseline (gradient boosting), and a mean fractional rank of 0.019 versus 0.413. The improvement is consistent across all five provinces and four commodity systems, suggesting that physics-informed differentiable simulators, even when constrained only by global open-data proxies, can recover localisation patterns that pure feature-based learners systematically miss. We release the full pipeline and evaluation harness as open source.

2606.13713 2026-06-15 q-bio.GN cs.AI 交叉投稿

CisTransCell: Single-Cell Perturbation Prediction via Gene Function, Regulatory Control, and Cellular Context

CisTransCell:通过基因功能、调控控制和细胞上下文进行单细胞扰动预测

Wei Zhang, Xun Jiang, Yuesi Xi, Ming Tang


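AI summary Proposes CisTransCell, a framework that combines regulatory-sequence and coding-sequence priors with cellular expression state to model the perturbation-response cascade for zero-shot single-cell perturbation prediction.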

详情
AI中文摘要

预测细胞对遗传扰动的转录反应是单细胞生物学中的一个核心问题,尤其是在零样本设置中,扰动基因或基因组合在训练中未见。一个主要困难是扰动效应不仅由表达状态决定:它们取决于扰动基因产物如何影响其他基因和蛋白质,这些下游因子如何作用于顺式调控元件,以及当前细胞状态中哪些调控程序活跃。为了更好地捕捉这种生物复杂性,我们提出了CisTransCell,一个用于单细胞扰动预测的细胞条件多模态框架,它为每个基因补充了两个互补先验:一个调控序列先验,捕捉基因如何被调控;一个编码序列先验,捕捉基因产物做什么。通过将这些先验与细胞表达状态整合,CisTransCell将扰动响应建模为从基因功能到调控控制再到下游转录变化的级联。在基准单细胞扰动数据集上的实验表明,CisTransCell在零样本扰动预测中取得了强劲性能。

英文摘要

Predicting cellular transcriptional responses to genetic perturbations is a central problem in single-cell biology, especially in the zero-shot setting where the perturbed gene or gene combination is unseen during training. A major difficulty is that perturbation effects are not determined by expression state alone: they depend on how the perturbed gene product influences other genes and proteins, how those downstream factors act on cis-regulatory elements, and which regulatory programs are active in the current cell state. To better capture this biological complexity, we propose CisTransCell, a cell-conditioned multi-modal framework for single-cell perturbation prediction that augments each gene with two complementary priors: a regulatory-sequence prior that captures how the gene is controlled, and a coding-sequence prior that captures what the gene product does. By integrating these priors with cellular expression state, CisTransCell models perturbation response as a cascade from gene function to regulatory control to downstream transcriptional change. Experiments on benchmark single-cell perturbation datasets show that CisTransCell achieves strong performance in zero-shot perturbation prediction.

2606.13742 2026-06-15 cs.LG cs.AI physics.comp-ph physics.flu-dyn stat.ML 交叉投稿

A fully GPU-based workflow for building physics emulators of hypersonic flows

基于全GPU工作流构建高超声速流物理仿真器

Fabian Paischer, Dylan Rubini, Deniz A. Bezgin, Aaron B. Buhendwa, David Hauser, Florian Sestak, Johannes Brandstetter, Sebastian Kaltenbach, Nikolaus A. Adams

发表机构 * TU Munich(慕尼黑工业大学) Institute for Machine Learning, JKU Linz(林茨约翰·开普勒大学机器学习研究所) ELLIS Unit(ELLIS单元) EMMI AI

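AI summary Proposes a fully GPU-based workflow that integrates accelerated data generation with the training of neural emulators augmented by uncertainty quantification; the differentiable solver JAX-Fluids enables residual-based refinement, which improves physical consistency and reliability beyond the training distribution.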

Comments First authors contributed equally

详情
AI中文摘要

以高保真度和低计算成本解析复杂物理现象的能力是解决现代工程关键挑战的核心。一个典型例子是高超声速流,其中精确预测全流场拓扑,特别是激波位置和强度,至关重要。然而,超声速和高超声速流仍然是传统降阶模型和神经仿真器的绊脚石,这些模型难以在工业相关应用中物理一致地捕捉流态中的陡峭梯度。为此,我们引入了一个完全基于GPU的工作流,该工作流将加速数据生成与通过不确定性量化和物理感知细化增强的神经仿真器训练相结合。我们的工作流由可微高保真求解器(JAX-Fluids)实现,我们利用该求解器进行快速数据集创建和基于残差的神经仿真器改进,以增强物理一致性。在此框架基础上,我们首先提出了一系列模型架构,并分析了它们的缩放行为以揭示其优缺点。然后,我们表明基于残差的细化使得能够在仅提供网格和输入参数的情况下进行训练,显著降低残差并提高物理一致性。可微仿真和基于残差的细化共同产生了在其训练分布之外仍然可靠的物理仿真器,这是在现实工程设计循环中部署代理的关键要求。

英文摘要

The ability to resolve complex physical phenomena with high fidelity and at low computational cost is central to addressing key challenges in modern engineering. A prime example lies in hypersonic flows, where the precise prediction of the full flowfield topology, in particular with respect to shock wave location and intensity, is critical. Yet supersonic and hypersonic flows continue to be a stumbling block for traditional reduced-order models and neural emulators that struggle to capture steep gradients in flow states with physical consistency in applications of industrial relevance. To that end, we introduce a fully GPU-based workflow that integrates accelerated data generation with the training of neural emulators augmented by uncertainty quantification and physics-aware refinement. Our workflow is enabled by a differentiable high-fidelity solver (JAX-Fluids), which we employ for rapid dataset creation and residual-based improvement of the neural emulator to enhance physical consistency. Building on this framework, we first present a suite of model architectures and analyze their scaling behavior to expose their strengths and shortcomings. We then show that residual-based refinement enables training on cases where only mesh and input parameters are available, substantially reducing residuals and improving physical consistency. Together, differentiable simulation and residual-based refinement yield physics emulators that remain reliable beyond their training distribution, a key requirement for deploying surrogates in real-world engineering design loops.
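
The idea of refining a prediction from the governing-equation residual alone, with no reference solution, can be sketched on a toy problem. The example below is an assumption-laden illustration, not the paper's method and not the JAX-Fluids API: it uses a 1-D Poisson equation -u'' = f on a uniform grid, a hand-derived gradient, and plain gradient descent on the squared finite-difference residual.

```python
def residuals(u, f, h):
    """Finite-difference residual of -u'' = f at the interior nodes."""
    return [(-u[i - 1] + 2.0 * u[i] - u[i + 1]) / h**2 - f[i]
            for i in range(1, len(u) - 1)]


def residual_loss(u, f, h):
    """Half the sum of squared residuals; needs no labelled solution."""
    return 0.5 * sum(r * r for r in residuals(u, f, h))


def refine(u, f, h, steps=2000):
    """Gradient descent on the residual loss with fixed boundary values."""
    u = list(u)
    # Step size from the bound 4/h^2 on the stencil's largest eigenvalue.
    lr = h**4 / 16.0
    for _ in range(steps):
        # Residual indexed by node; boundary nodes carry no residual.
        r = [0.0] + residuals(u, f, h) + [0.0]
        for j in range(1, len(u) - 1):
            grad = (2.0 * r[j] - r[j - 1] - r[j + 1]) / h**2
            u[j] -= lr * grad
    return u
```

Starting from a poor initial field (for example all zeros with a constant source term), `refine` lowers `residual_loss` monotonically while leaving the boundary nodes untouched. In the workflow described above, a differentiable solver would supply the residual and its gradient for the full flow equations instead.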

2606.13794 2026-06-15 eess.SY cs.AI cs.RO cs.SY 交叉投稿

An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts

过驱动飞行器的可解释控制效能学习与非线性控制分配集成方法

Umut Demir, Aamir Ahmad, Walter Fichter

发表机构 * University of Stuttgart, Faculty of Aerospace Engineering and Geodesy, Institute of Flight Mechanics and Control (iFR)(斯图加特大学航空航天工程与大地测量学院飞行力学与控制研究所)

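AI summary Proposes learning the control effectiveness mapping with Sparse Identification of Nonlinear Dynamics, combined with an online adaptation mechanism, for efficient nonlinear control allocation on overactuated aircraft that is both interpretable and computationally cheap.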

详情
AI中文摘要

非线性动力学以及多个执行器之间产生的强耦合削弱了传统线性控制分配技术背后的假设。当飞行进入非线性效应主导的模态时,线性分配器因模型失配增加而精度下降,进而降低飞行控制系统的性能和鲁棒性。高保真机载模型和黑箱数据驱动方法可以在整个飞行包线内恢复精度,但分别带来实时分配难以承受的计算负担,并牺牲了验证和故障诊断所需的可解释性。本文通过使用稀疏非线性动力学辨识从代表性飞行数据中学习显式的、受物理约束的控制效能映射解析模型,解决了这些限制。所得映射紧凑、可解释,并允许解析导数,从而能够在非线性求解器中高效计算,同时额外包含执行器动力学,无需机载模型。在线自适应机制监控预测残差,并在检测到显著对象变化时刷新模型,从而在执行器故障和变化工况下提供平滑重构。该方法在一款高保真非线性基准飞行器上经过一系列激进机动评估,达到了与完整非线性机载模型相当的精度,同时相对于现有基线显著降低了计算成本。

英文摘要

Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional, linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High fidelity onboard models and black box data driven approaches can recover accuracy across the flight envelope, but respectively impose computational burdens prohibitive for real time allocation and sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.
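
Sparse Identification of Nonlinear Dynamics is commonly implemented with sequentially thresholded least squares: fit all candidate terms, zero the small coefficients, refit the survivors. The sketch below shows that loop on synthetic data. The polynomial library, the threshold, and the "true" moment curve `2.0*d - 0.5*d**3` are illustrative assumptions and do not come from the paper.

```python
def solve_linear(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            k = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= k * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def least_squares(theta, y, active):
    """Least-squares fit restricted to the active library columns."""
    gram = [[sum(row[p] * row[q] for row in theta) for q in active]
            for p in active]
    proj = [sum(row[p] * yi for row, yi in zip(theta, y)) for p in active]
    return solve_linear(gram, proj)


def stlsq(theta, y, threshold=0.1, max_iters=10):
    """Sequentially thresholded least squares over a candidate library."""
    n_features = len(theta[0])
    active = list(range(n_features))
    coef = [0.0] * n_features
    for _ in range(max_iters):
        fitted = least_squares(theta, y, active)
        coef = [0.0] * n_features
        for idx, value in zip(active, fitted):
            coef[idx] = value if abs(value) >= threshold else 0.0
        kept = [idx for idx in active if coef[idx] != 0.0]
        if kept == active or not kept:
            break
        active = kept
    return coef


# Synthetic example: library [1, d, d^2, d^3] of a surface deflection d.
deflections = [i / 10.0 for i in range(-10, 11)]
library = [[1.0, d, d**2, d**3] for d in deflections]
moments = [2.0 * d - 0.5 * d**3 for d in deflections]
```

Calling `stlsq(library, moments)` keeps only the linear and cubic terms, giving a closed-form expression whose derivative with respect to the deflection is available analytically. That property is what makes such a model convenient inside a nonlinear allocation solver.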

2606.13854 2026-06-15 cs.HC cs.AI 交叉投稿

SpheriCity: Designing Trustworthy Conversational AI for Sustainability Decision Support

SpheriCity:为可持续发展决策支持设计可信赖的对话式AI

Ahmed Qayyum, Madison Werner, Kathryn Youngblood, Jenna R. Jambeck, Tahiya Chowdhury

发表机构 * Department of Computer Science, Colby College(科尔比学院计算机科学系) Circularity Informatics Lab, University of Georgia(佐治亚大学循环信息实验室)

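AI summary Presents SpheriCity, a provenance-first conversational AI prototype that uses structured synthesis and interaction scaffolds to support trustworthy sensemaking from city-level circularity assessment reports, addressing the transparency and trust problems of large language models in high-stakes sustainability contexts.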

Comments Accepted to ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS '26)

详情
AI中文摘要

我们提出了SpheriCity,一种基于专家知识的对话式原型,旨在支持从可持续性报告中可信地获取知识。城市级循环性评估报告包含关于材料、基础设施和政策干预的丰富信息,但其长度和异构结构使得从事循环经济倡议的从业者和研究人员难以进行跨文档综合和比较。虽然大型语言模型(LLM)有望实现更快速的知识获取和综合,但其不透明的推理、幻觉和缺乏来源透明度给信任和可解释性带来了风险,并且在高风险的可持续性背景下需要验证。SpheriCity通过一种以来源为先的对话式代理来应对这些挑战,该代理强调证据可追溯性、结构化合成和交互支架,以支持跨可持续性报告的探索性查询和跨文档综合。我们与六位可持续性专家进行了形成性专家评审,使用了涵盖跨城市比较、政策总结和推荐导向任务的代表性查询。专家们从多个维度评估了回答,并提供了关于系统对可持续性知识工作有用性的定性反思。我们的结果表明,透明的来源、上下文解释、可解释性以及与专家工作流程的一致性强烈影响专家对系统有用性的信任和判断。这项工作贡献了(1)一个用于可持续性知识理解的对话式原型,(2)一个用于评估高风险知识领域中AI回答的基于专家的评估框架,以及(3)关于来源、不确定性沟通和工作流程整合如何影响专家用户对AI辅助可持续性决策支持信任的设计见解。

英文摘要

We present SpheriCity, an expert-grounded conversational prototype designed to support trustworthy knowledge sensemaking from sustainability reports. City-level circularity assessment reports contain rich information about materials, infrastructure, and policy interventions, yet their length and heterogeneous structure make cross-document synthesis and comparison difficult for practitioners and researchers working on circular economy initiatives. While large language models (LLM) promise faster knowledge access and synthesis, their opaque reasoning, hallucinations, and lack of source transparency introduce risks for trust and interpretability, and require verification in high-stakes sustainability contexts. SpheriCity addresses these challenges through a provenance-first conversational agent that foregrounds evidence traceability, structured synthesis, and interaction scaffolds to support exploratory querying and cross-document synthesis across sustainability reports. We conducted a formative expert review with six sustainability experts using representative queries spanning cross-city comparison, policy summarization, and recommendation-oriented tasks. Experts evaluated responses across dimensions and provided qualitative reflections on the system's usefulness for sustainability knowledge work. Our results reveal that transparent sourcing, contextual explanation, interpretability, and alignment with expert workflow strongly shape expert trust and judgments of system usefulness. This work contributes (1) a conversational prototype for sustainability knowledge sensemaking, (2) an expert-grounded evaluation framework for assessing AI responses in high-stakes knowledge domains, and (3) design insights into how provenance, uncertainty communication, and integration in workflow influence expert users' trust in AI assistance for sustainability decision support.

2606.13858 2026-06-15 cs.IR cs.AI 交叉投稿

Mood-Aware Music Recommendation: Integrating User Affective Signals into Ranking Systems

情绪感知音乐推荐:将用户情感信号融入排序系统

Terence Zeng, Abhishek K. Umrawal

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

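AI summary Proposes a mood-conditioned ranking framework that integrates user affective signals into recommendation via softmax-based sampling in the energy-valence space; single-blind experiments indicate improved perceived recommendation quality.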

Comments 13 pages, 4 figures, and 1 table

详情
AI中文摘要

推荐系统在现代音乐流媒体平台中至关重要,因为可用内容数量巨大。虽然协同过滤被广泛用于根据具有相似模式的其他用户的偏好来推荐项目,但在用户-项目交互稀疏的领域(如音乐)中表现不佳。基于内容的过滤是一种替代方法,它检查项目本身的属性。已有研究探索了流派、乐器和歌词;然而,对情感识别的关注相对较少。由于用户的情绪状态强烈影响其音乐选择,融入情绪信号为个性化提供了有前景的方向。在这项工作中,我们提出了一种情绪条件排序框架,通过能量-效价空间中的softmax采样将用户情感信号融入推荐过程。我们通过单盲实验评估该方法,参与者将所提系统的推荐与基线进行比较。结果表明感知推荐质量有所提升,为将基于情绪的输入融入音乐推荐的有效性提供了初步证据。

英文摘要

Recommendation systems are essential in modern music streaming platforms due to the vast amount of available content. While collaborative filtering is widely used to suggest items based on the preferences of others with similar patterns, it performs poorly in domains where user-item interactions are sparse, such as music. Content-based filtering is an alternative approach that examines the qualities of the items themselves. Genre, instrumentation, and lyrics have been explored; however, relatively little attention has been given to emotion recognition. Since a user's emotional state strongly influences their music choice, incorporating mood signals offers a promising direction for personalization. In this work, we propose a mood-conditioned ranking framework that integrates user affective signals into the recommendation process via softmax-based sampling in the energy-valence space. We evaluate the approach via single-blind experiments in which participants compare recommendations from the proposed system against a baseline. The results indicate improved perceived recommendation quality, providing preliminary evidence for the effectiveness of incorporating mood-based inputs into music recommendations.
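
Softmax-based sampling in the energy-valence space can be sketched as follows. This is a minimal illustration under stated assumptions: tracks and the user's mood are points in a unit (energy, valence) square, the logit is the negative squared distance divided by a temperature, and all names and the temperature value are hypothetical rather than taken from the paper.

```python
import math
import random


def mood_probabilities(user_mood, tracks, temperature=0.2):
    """Softmax over negative squared distance in (energy, valence)."""
    logits = []
    for energy, valence in tracks.values():
        dist2 = (energy - user_mood[0]) ** 2 + (valence - user_mood[1]) ** 2
        logits.append(-dist2 / temperature)
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return {name: w / total for name, w in zip(tracks, weights)}


def sample_tracks(user_mood, tracks, k, temperature=0.2, seed=None):
    """Draw k recommendations (with replacement) from the mood softmax."""
    rng = random.Random(seed)
    probs = mood_probabilities(user_mood, tracks, temperature)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=k)
```

A low temperature concentrates the recommendations on tracks closest to the reported mood, while a high temperature approaches uniform sampling, which is one way to trade mood fit against variety.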

2606.13968 2026-06-15 cs.DC cs.AI 交叉投稿

STREAM: Multi-Tier LLM Inference Middleware with Dual-Channel HPC Token Streaming

STREAM:具有双通道 HPC 令牌流的多层 LLM 推理中间件

Anas Nassar, Steve Mohr, Leonard Apanasevich, Himanshu Sharma

发表机构 * Advanced Cyberinfrastructure for Education and Research (ACER) University of Illinois Chicago(高级教育与研究计算基础设施(ACER)伊利诺伊大学芝加哥分校)

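AI summary Presents STREAM, which unifies local, HPC, and cloud inference through a three-tier routing architecture and achieves sub-second TTFT over HPC with dual-channel streaming that separates the control plane from the data plane.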

Comments 6 pages, 1 figure, PEARC '26

详情
AI中文摘要

研究人员和从业者在使用大型语言模型时面临碎片化局面:本地模型免费且私密,但硬件限制了可用的模型大小和上下文窗口;机构 HPC 中心提供强大的 GPU 资源且无边际成本,并将数据保留在机构边界内,但运行在防火墙后且专为批处理作业而非交互使用设计;商业云 API 按需提供前沿模型质量,但带来显著成本和不适合敏感研究数据的数据保留策略。现有系统无法统一这三者。STREAM(智能分层路由引擎)通过四项贡献解决了这一差距:(1)三层路由架构,结合本地、HPC 和云推理,并配备基于本地 LLM 的复杂度判断器;(2)双通道 HPC 流架构,将 Globus Compute 控制平面(认证和作业调度)与 WebSocket 中继数据平面(令牌传递)分离,实现亚秒级 TTFT(中位数 0.54 秒,比批处理模式的 11.40 秒快 21.1 倍),通过机构防火墙无需 VPN 或防火墙规则更改,端到端 AES-256-GCM 加密确保中继操作员无法读取令牌负载;(3)层级感知的上下文摘要,防止长对话将简单查询强制推送到昂贵层级;(4)HPC 即 API 代理模式,将 HPC 推理暴露为与 OpenAI 兼容的端点,可从任何标准客户端调用,无需 HPC 专业知识,这种部署模式仅因贡献(2)的亚秒级 TTFT 而变得实用。Llama 3.2 3B 在跨越十个领域的 1,200 个查询基准测试中实现了 85.1% 的免费层级保留率。测量的 TTFT:本地 0.26 秒,HPC(中继)0.54 秒,云 1.68 秒。

英文摘要

Researchers and practitioners working with large language models face a fragmented landscape: local models are free and private but hardware limits the model size and context windows a researcher can use; institutional HPC centers offer powerful GPU resources at no marginal cost and keep data within institutional boundaries, but operate behind firewalls and are designed for batch jobs rather than interactive use; commercial cloud APIs provide frontier-model quality on demand but impose significant cost and data retention policies unsuitable for sensitive research data. No existing system unifies all three. STREAM (Smart Tiered Routing Engine for AI Models) addresses this gap with four contributions: (1) a three-tier routing architecture combining local, HPC, and cloud inference with a local LLM-based complexity judge; (2) a dual-channel HPC streaming architecture that separates the Globus Compute control plane (authentication and job dispatch) from a WebSocket relay data plane (token delivery), enabling sub-second TTFT (0.54 s median, 21.1x over batch mode's 11.40 s) through institutional firewalls without VPN or firewall rule changes, with end-to-end AES-256-GCM encryption ensuring the relay operator cannot read token payloads; (3) tier-aware context summarization that prevents long conversations from forcing simple queries onto expensive tiers; and (4) an HPC-as-API proxy mode that exposes HPC inference as an OpenAI-compatible endpoint callable from any standard client with no HPC expertise, a deployment pattern made practical only by the sub-second TTFT of contribution (2). Llama 3.2 3B achieves 85.1% free-tier retention on a 1,200-query benchmark spanning ten domains. Measured TTFT: 0.26 s local, 0.54 s HPC (relay), 1.68 s cloud.
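
A three-tier router driven by a complexity label might look like the sketch below. The labels, the fallback order, and the function names are hypothetical; the abstract states only that a local LLM-based judge chooses among local, HPC, and cloud tiers. The `speedup` helper simply reproduces the reported TTFT ratio.

```python
def route(complexity, hpc_available=True, cloud_allowed=True):
    """Pick an inference tier from a judged complexity label (assumed policy)."""
    if complexity == "simple":
        return "local"
    if complexity == "moderate":
        if hpc_available:
            return "hpc"
        return "cloud" if cloud_allowed else "local"
    if complexity == "complex":
        if cloud_allowed:
            return "cloud"
        return "hpc" if hpc_available else "local"
    raise ValueError(f"unknown complexity label: {complexity!r}")


def speedup(batch_ttft_s, streaming_ttft_s):
    """Ratio of batch-mode to streaming time-to-first-token."""
    return batch_ttft_s / streaming_ttft_s
```

With the figures quoted above, `speedup(11.40, 0.54)` is about 21.1, matching the reported improvement of relay streaming over batch mode.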

2606.14157 2026-06-15 cs.LG cs.AI 交叉投稿

Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport

通过逆最优传输从起点-终点流中学习城市访问成本

Paula Joy B. Martinez

发表机构 * GitHub

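AI summary Uses inverse optimal transport models to recover latent choice costs from school-to-school enrollment flows; applied to 283,016 learner trips across 23,820 observed flows in the Philippines, it estimates a subsidy-equivalent distance to inform subsidy design, facility siting, and urban service allocation.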

Comments Oral Presentation. 2026 International Conference on Urban AI

详情
AI中文摘要

城市通过混合公私设施网络提供基本服务,包括学校、诊所、交通提供者和补贴服务点。在这些系统中,规划者通常观察到家庭去哪里,但看不到他们权衡距离、价格和机构访问等因素的潜在成本函数。我们通过菲律宾的学校选择来研究这个城市问题,该国最大的国家教育补贴旨在将学习者从拥挤的公立学校转移到参与计划的私立学校。将学校到学校的入学流视为熵最优传输计划,我们使用两种互补的逆最优传输模型恢复潜在选择成本:一个带有补贴项的可解释距离带模型,以及一个通过可微分Sinkhorn前向传递训练的神经成本模型。应用于人口最多地区23,820条观测流中的283,016次学习者出行,该框架估计了一个补贴等效距离$\lambda^{(k)}$,解释为补贴抵消的感知旅行成本公里数。该案例展示了如何将行政起点-终点数据转化为可解释的规划指标,用于可访问性感知的补贴设计、设施选址和城市服务分配。

英文摘要

Cities deliver basic services through mixed public-private facility networks, including schools, clinics, transit providers, and subsidized service points. In these systems, planners often observe where households go, but not the latent cost function through which they trade off factors such as distance, price, and institutional access. We study this urban problem through school choice in the Philippines, where the country's largest national education subsidy is intended to redirect learners from congested public schools to participating private schools. Treating school-to-school enrollment flows as an entropic optimal transport plan, we recover latent choice costs using two complementary inverse optimal transport models: an interpretable distance-banded model with a subsidy term, and a neural cost model trained through a differentiable Sinkhorn forward pass. Applied to 283,016 learner trips across 23,820 observed flows in the most populated region, the framework estimates a subsidy-equivalent distance, $\lambda^{(k)}$, interpreted as the kilometers of perceived travel cost offset by the subsidy. The case demonstrates how administrative origin-destination data can be transformed into interpretable planning metrics for accessibility-aware subsidy design, facility siting, and urban service allocation.
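
The forward half of the approach, computing an entropic optimal transport plan from a cost matrix, can be sketched with plain Sinkhorn iterations. This is a generic textbook version under assumed inputs (a small dense cost matrix and normalised marginals); the paper's banded cost, subsidy term, and neural cost model are not reproduced here.

```python
import math


def sinkhorn_plan(cost, a, b, eps=1.0, iters=500):
    """Entropic OT plan with row marginals a and column marginals b."""
    n, m = len(a), len(b)
    # Gibbs kernel: cheaper origin-destination pairs get larger weight.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)]
              for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)]
            for i in range(n)]


def negative_log_likelihood(plan, observed_flows):
    """Fit criterion an inverse step could minimise over cost parameters."""
    return -sum(
        observed_flows[i][j] * math.log(plan[i][j])
        for i in range(len(plan))
        for j in range(len(plan[0]))
        if observed_flows[i][j] > 0
    )
```

An inverse optimal transport fit would adjust the parameters of `cost` until the plan matches the observed flows, for example by lowering `negative_log_likelihood`. One caveat of the balanced formulation is that a cost shift depending only on the destination is absorbed by the scaling vector `v` and leaves the plan unchanged, so identifiable cost terms have to vary across origin-destination pairs.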

2606.14297 2026-06-15 cs.CV cs.AI 交叉投稿

Pix2Pix-Hybrid: Structure-Guided Conditional Synthesis of Hajj Crowd Images with Multi-Channel Conditioning and Weak Attribute Supervision

Pix2Pix-Hybrid: 结构引导的多通道条件与弱属性监督的朝觐人群图像条件合成

Amirah F. Alshammari, Bander A. Alzahrani, Nahed A. Alowidi

发表机构 * King Abdulaziz University(阿卜杜勒阿齐兹国王大学) Jouf University(焦夫大学)

AI总结 提出Pix2Pix-Hybrid条件GAN,通过多通道结构线索和上下文属性条件合成朝觐人群图像,用于数据增强,在减少人工标注的同时提升合成质量,并验证了合成数据对人群计数模型的改进效果。

详情
AI中文摘要

开发准确的朝觐场景人群计数模型仍然具有挑战性,因为领域特定的标注图像稀缺,且大型集会期间的数据收集引发隐私问题。为解决这些限制,本文提出Pix2Pix-Hybrid (P2P-H),一种用于结构引导的朝觐人群图像合成和数据增强的混合条件GAN。P2P-H基于Pix2Pix,采用U-Net生成器,以八个输入通道为条件,这些通道联合编码结构线索(边缘和灰度)和上下文属性(人群密度和一天中的时间)。为了捕捉密集场景中的详细纹理,该框架集成了两个在不同分辨率下运行的多尺度PatchGAN判别器。训练过程结合了对抗、感知和特征匹配目标,并采用自适应数据增强和稳定化策略。该模型在从60个公开视频源收集的993个真实朝觐帧上训练,条件属性自动推导以减少人工标注工作量。利用该框架,我们构建了CrowdH,一个包含10,000张高分辨率朝觐人群图像的合成数据集。实验结果表明,与Pix2Pix和StyleGAN2-ADA基线相比,P2P-H提高了结构保持的条件合成质量,并显示出对其他人群数据集的良好迁移性。为了评估下游实用性,我们进一步构建了CrowdH-Mix-469,一个包含384张真实朝觐图像和85张精选合成图像的标注混合真实-合成数据集,并在仅真实和真实加合成训练下评估了五个计数模型。精选的合成数据在所有五个模型上均降低了MAE,其中CSRNet的提升最为显著。

英文摘要

Developing accurate crowd-counting models for Hajj pilgrimage scenes remains challenging because domain-specific annotated images are scarce and data collection during large gatherings raises privacy concerns. To address these limitations, this paper proposes Pix2Pix-Hybrid (P2P-H), a hybrid conditional GAN for structure-guided Hajj crowd-image synthesis and data augmentation. P2P-H builds on Pix2Pix and employs a U-Net generator conditioned on eight input channels that jointly encode structural cues (edges and grayscale) and contextual attributes (crowd density and time of day). To capture detailed textures in dense scenes, the framework integrates two multi-scale PatchGAN discriminators operating at different resolutions. The training procedure combines adversarial, perceptual, and feature-matching objectives with adaptive data augmentation and stabilization strategies. The model was trained on 993 real Hajj frames collected from 60 publicly available video sources, with conditioning attributes derived automatically to reduce manual labeling effort. Using this framework, we constructed CrowdH, a synthetic dataset of 10,000 high-resolution Hajj crowd images. Experimental results show that P2P-H improves structure-preserving conditional synthesis quality compared with Pix2Pix and StyleGAN2-ADA baselines and shows favorable transfer to other crowd datasets. To assess downstream utility, we further constructed CrowdH-Mix-469, an annotated mixed real-synthetic dataset comprising 384 real Hajj images and 85 selected synthetic images, and evaluated five crowd-counting models under real-only and real-plus-synthetic training. The selected synthetic data reduced MAE across all five models, with the strongest gain observed for CSRNet.

2606.14306 2026-06-15 cs.HC cs.AI 交叉投稿

Thinking Outside the [Chat]Box: Bridging Computer Science and Industrial Design for Cognitive-Inclusive Generative AI

跳出聊天框:融合计算机科学与工业设计的认知包容性生成式人工智能

Virginia Francisco, Daniel Guasch, Raquel Hervás

发表机构 * University of California, Berkeley(加州大学伯克利分校)

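AI summary Addresses the high cognitive demands that current GenAI interfaces place on people with intellectual disabilities; through a cross-disciplinary co-design challenge, derives a dual-layer framework combining structural scaffolding from Computer Science teams with experiential scaffolding from Industrial Design teams to expand the design space for cognitively accessible interaction.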

详情
AI中文摘要

当前的生成式人工智能(GenAI)界面仍然主要局限于聊天框交互,这给用户带来了高认知负担,并对智力障碍(ID)人群造成了重大障碍,包括提示词表述困难、响应过载以及评估信息可靠性的机制有限。为了探索认知无障碍的替代交互模型,我们进行了一项跨学科协同设计挑战,其中两个学生群体(计算机科学和工业设计)从相同的功能需求集(例如,提示词支架、结构化输出、基于GUI的细化、透明度和个性化)出发,开发界面概念。比较最终提案揭示了在基础需求上的趋同(特别是初始校准、主动提示和响应片段的直接操作)以及互补性贡献,勾勒出一个多层次支持系统。计算机科学团队主要产生结构支架,强调通过可靠性指标、明确来源和长对话上下文管理等机制实现可预测性、可导航性和信任。工业设计团队强调体验支架,侧重于节奏、注意力引导、多模态和主动代理,包括逐步响应流程、专注模式和类似助手的集成。我们将这些发现综合为一个双层支架框架,该框架将认知无障碍GenAI交互的设计空间扩展到以聊天为中心的模式之外,并激励未来在专家细化、技术可行性和与ID用户进行实证验证方面的工作。

英文摘要

Current Generative AI (GenAI) interfaces remain largely constrained to chatbox interaction, which can impose high cognitive demands on users and create substantial barriers for people with intellectual disabilities (ID), including prompt formulation difficulties, response overload, and limited mechanisms to assess information reliability. To explore alternative interaction models for cognitive accessibility, we conducted a cross-disciplinary co-design challenge in which two student cohorts (Computer Science and Industrial Design) developed interface concepts from the same set of functional requirements (e.g., prompt scaffolding, structured output, GUI-based refinement, transparency, and personalization). Comparing the resulting proposals reveals both convergence on foundational requirements (notably initial calibration, proactive prompting, and direct manipulation of response fragments) and complementary contributions that outline a multi-layered support system. Computer Science teams primarily produced structural scaffolding, emphasizing predictability, navigability, and trust through mechanisms such as reliability indicators, explicit sources, and context management for long conversations. Industrial Design teams emphasized experiential scaffolding, focusing on pacing, attention guidance, multimodality, and proactive agency, including step-by-step response flows, focus modes, and assistant-like integrations. We synthesize these findings into a dual-layer scaffolding framework that expands the design space for cognitively accessible GenAI interaction beyond chat-centric models and motivates future work on expert refinement, technical feasibility, and empirical validation with users with ID.

2606.14350 2026-06-15 cs.DC cs.AI 交叉投稿

Design Methodology and Performance Trade-offs Management for Distributed and Compound AI Systems

分布式与复合AI系统的设计方法论及性能权衡管理

Milos Gravara, Andrija Stanisic, Stefan Nastic

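AI summary Proposes a design methodology that shifts from model-centric to system-centric design, organizes the design space along workflow topology and configuration selection, and identifies eight design patterns addressing limitations of monolithic deployment; in three case studies, Compound AI configurations come within 2.5 to 4 percentage points of monolithic accuracy while cutting latency by up to 60% and cost by up to 71%.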

详情
AI中文摘要

人工智能系统通常必须满足包括准确性、延迟和成本在内的服务级别目标。当前以模型为中心的方法在设计时选择单一模型,并对所有输入应用相同的计算,无法将任务分解到专门组件,且知识在训练时固定。在运行时,这可能导致性能下降和成本增加。由于模型是主要设计变量,它决定了系统的大部分行为,将操作目标耦合到单一设计时选择。解决这些限制需要从以模型为中心转向以系统为中心的设计。复合AI系统通过显式控制逻辑将多个模型、算法和工具编排为分布式AI系统,实现了这一转变。此类系统的性能取决于其工作流拓扑、分配给每个任务的模型以及控制运行时行为的参数。我们提出了一种设计方法论,沿工作流拓扑和配置选择两个维度组织这一空间,并识别出八种设计模式,每种模式整合了解决单体部署特定限制的技术。我们通过三个案例研究验证了该方法论。在我们的案例研究中,复合AI配置的准确性接近单体模型(相差2.5至4个百分点),同时延迟降低高达60%,成本降低高达71%。我们表明模型选择和参数配置共同决定系统性能,但随着工作流组合更多模式和组件,产生的设计空间呈组合增长。因此,我们识别出五个开放挑战,这些挑战定义了从手动配置原型到自动发现并维护复合与分布式AI系统中SLO合规性的系统的路线图。

英文摘要

Artificial Intelligence (AI) systems must typically satisfy service-level objectives including accuracy, latency, and cost. The prevailing model-centric approaches select a monolithic model at design time and apply identical computation regardless of input difficulty, cannot decompose tasks across specialized components, and have knowledge that is fixed at training time. During runtime, this can lead to performance degradation and increasing costs. Because the model is the main design variable, it determines the majority of system behavior, coupling operational objectives to a single design-time choice. Addressing these limitations requires shifting from model-centric to system-centric design. Compound AI systems realize this shift by orchestrating multiple models, algorithms, and tools as distributed AI systems through explicit control logic. The performance of such systems depends on their workflow topology, the models assigned to each task, and the parameters governing runtime behavior. We present a design methodology that organizes this space along two dimensions, workflow topology and configuration selection, and identifies eight design patterns, each consolidating techniques to address a specific limitation of monolithic deployment. We validate our methodology through three case studies. Across our case studies, Compound AI configurations approach accuracy of monolithic models within 2.5 to 4 percentage points while reducing latency by up to 60% and cost by up to 71%. We show that model selection and parameter configuration jointly determine system performance, but the resulting design space grows combinatorially, as workflows compose more patterns and components. Thus, we identify five open challenges that define a roadmap from manually configured prototypes towards systems that automatically discover and maintain SLO-compliance in Compound and Distributed AI systems.

2606.14356 2026-06-15 cs.DC cs.AI 交叉投稿

PLAIground: SLO-Driven Runtime Model Selection for Compound AI Systems in the Edge-Cloud-Space Continuum

PLAIground: 边缘-云-空间连续体中复合AI系统的SLO驱动运行时模型选择

Milos Gravara, Cynthia Marcelino, Andrija Stanisic, Stefan Nastic

发表机构 * University of California, Berkeley(加州大学伯克利分校)

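AI summary Presents PLAIground, a framework for runtime model selection in Compound AI systems built on the CAIM abstraction and the Pixie algorithm, which switches models dynamically under accuracy, latency, and cost SLOs; the evaluation reports up to 91.3% accuracy while remaining SLO-compliant.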

详情
AI中文摘要

3D计算连续体(统一边缘、云和空间)中的应用需要将多个AI任务(如目标检测、时间序列分析和自然语言处理)组合成复合AI系统。这些系统必须满足严格的准确率、延迟和成本服务级别目标(SLO)。维持复合AI系统SLO合规的关键机制是运行时模型选择,即为每个工作流任务动态切换AI模型。然而,现有的分布式和复合AI框架并不原生支持运行时模型选择。我们提出了PLAIground,一个支持复合AI系统运行时模型选择的框架。PLAIground引入了可复合AI模型(CAIM)抽象,通过任务和数据契约将任务语义与AI模型实现解耦,使得无需更改工作流即可切换模型。此外,PLAIground引入了Pixie,一种SLO驱动的运行时模型选择算法,在执行期间动态为每个任务选择最合适的模型。我们在两个真实的复合AI工作流上的评估表明,Pixie在保持SLO合规的同时实现了高达91.3%的准确率,而固定模型策略要么违反成本和延迟预算高达21倍,要么准确率目标偏差4%。

英文摘要

Applications in the 3D Computing Continuum, which unifies edge, cloud, and space, require combining multiple AI tasks such as object detection, time-series analytics, and natural language processing into Compound AI systems. These systems must satisfy stringent Service Level Objectives (SLOs) on accuracy, latency, and cost. A key mechanism for maintaining SLO compliance of Compound AI systems is runtime model selection, where AI models are dynamically switched for each workflow task. However, existing distributed and compound AI frameworks do not natively support runtime model selection. We present PLAIground, a framework that enables runtime model selection for Compound AI systems. PLAIground introduces the Compoundable AI Model (CAIM) abstraction, which decouples task semantics from AI model implementations via Task and Data Contracts, enabling model switching without workflow changes. Additionally, PLAIground introduces Pixie, an SLO-driven runtime model selection algorithm, which dynamically selects the most suitable model for each task during execution. Our evaluation on two realistic Compound AI workflows demonstrates that Pixie achieves up to 91.3% accuracy while maintaining SLO compliance, whereas fixed-model strategies either violate cost and latency budgets by up to 21x or miss accuracy targets by 4%.
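
SLO-driven selection can be illustrated with a simple feasibility filter: keep the candidates that fit the latency and cost budgets, then take the most accurate one. This is a hypothetical sketch, not the Pixie algorithm, whose details the abstract does not give; the `ModelProfile` fields and the tie-breaking rule are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelProfile:
    """Profiled characteristics of one candidate model for a task."""
    name: str
    accuracy: float
    latency_ms: float
    cost: float


def select_model(
    candidates: Sequence[ModelProfile],
    max_latency_ms: float,
    max_cost: float,
    min_accuracy: float = 0.0,
) -> Optional[ModelProfile]:
    """Return the most accurate candidate within budget, or None."""
    feasible = [
        m for m in candidates
        if m.latency_ms <= max_latency_ms
        and m.cost <= max_cost
        and m.accuracy >= min_accuracy
    ]
    if not feasible:
        return None
    # Prefer higher accuracy; break ties with the cheaper model.
    return max(feasible, key=lambda m: (m.accuracy, -m.cost))
```

Re-running `select_model` whenever the measured latency or the remaining budget changes gives a basic form of runtime switching. A fixed-model strategy corresponds to calling it once at design time.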

2606.14357 2026-06-15 cs.SE cs.AI 交叉投稿

No Accidental Software Agent First Canonical Code for Human Code Entropy Reduction and 30 to 500 times Lower Frontier Model Requirements

无意外软件智能体:首个用于人类代码熵降低的正则代码,以及30到500倍的前沿模型需求降低

Jepson Taylor

发表机构 * GitHub

AI总结 提出智能体优先的正则代码方法,通过行为等价商化人类代码中的意外熵,实现30-500倍的前沿模型需求降低,核心是行为等价商化和证明携带变更。

Comments 36 pages

详情
AI中文摘要

前沿编码模型可能花费大量能力学习不仅是程序行为,还有人类代码库中的意外熵。这些代码库包含有价值的信号:测试、事件、迁移、边缘情况、产品判断和操作历史。这些信号与框架变动、命名漂移、生成代码歧义、依赖仪式、CI方言、弱证明路径和面向人类的审查习惯纠缠在一起。我们提出智能体优先的正则代码,一种证明携带的基底,将常规产品软件重写为正则行为配置文件、类型化变更代数、证明通道、受限编辑语法、语义补丁单元、运行时负记忆和证明携带的变更对象。核心假设是,在声明预言下通过行为等价商化软件,可以将等价编码折叠为带有显式证据和证明义务的受控代表。最终目标是在公共预言下,每个经过验证的正确变更的摊销成本,包括源代码、上下文、推理、工具、验证、安全性、来源、审查、失败循环、缺陷和铸造成本。报告的降低幅度是假设,而非测量的前沿结果。提出的极限是无意外视界:可移除的意外减少,直到剩余的新颖性、证据、治理、风险和未来可选性占主导。对于支持的常规产品分布,这给出了一个可辩护的规划目标,即接近100倍的全成本降低,并非对所有软件的保证。在Qwen2.5-Coder-14B上的初步QLoRA实验表明,64,088条正则轨迹是可学习的,并抑制了测试的禁用语言标记,但未确立行为保持、规模经济或验证变更成本。贡献是一个以最小功能描述长度和验证变更成本为中心的可证伪程序。

英文摘要

Frontier coding models may spend substantial capacity learning not only program behavior, but also accidental entropy in human repositories. Such repositories contain valuable signals: tests, incidents, migrations, edge cases, product judgment, and operational history. These signals are entangled with framework churn, naming drift, generated-source ambiguity, dependency rituals, CI dialects, weak proof routes, and human-oriented review customs. We propose agent-first canonical code, a proof-carrying substrate that rewrites routine product software into canonical behavior profiles, typed change algebra, proof lanes, constrained edit grammars, semantic patch cells, runtime negative memory, and proof-carrying change objects. The core hypothesis is that quotienting software by behavior equivalence under a declared oracle can collapse equivalent encodings into governed representatives with explicit evidence and proof obligations. The endpoint is amortized cost per verified correct change, including source, context, reasoning, tools, verification, security, provenance, review, failed loops, defects, and foundry cost under a common oracle. Reported reduction bands are hypotheses, not measured frontier results. The proposed limit is a No-Accident Horizon: removable accident decreases until residual novelty, evidence, governance, risk, and future optionality dominate. For supported routine-product distributions, this gives a defensible planning target near 100-fold all-in cost reduction, not a guarantee for all software. Preliminary QLoRA experiments on Qwen2.5-Coder-14B show that 64,088 canonical trajectories are learnable and suppress tested forbidden-language markers, but do not establish behavior preservation, scaling economics, or verified-change cost. The contribution is a falsifiable program centered on minimum functional description length and verified-change cost.
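
Quotienting by behavior equivalence under a declared oracle can be made concrete with a toy example: implementations that produce identical outputs on every oracle input fall into one class, and one member stands in for the class. The sketch below is an illustration of that idea only; the function names and the choice of representative are assumptions, and it says nothing about the paper's proof obligations.

```python
def quotient_by_behaviour(implementations, oracle_inputs):
    """Group implementation names by their outputs on the oracle inputs."""
    classes = {}
    for name, fn in implementations.items():
        signature = tuple(fn(x) for x in oracle_inputs)
        classes.setdefault(signature, []).append(name)
    return sorted(sorted(names) for names in classes.values())


def representatives(equivalence_classes):
    """Pick one canonical name per class (here: the first alphabetically)."""
    return [names[0] for names in equivalence_classes]
```

The result depends on the oracle. With inputs `[0, 1, 3]`, doubling and squaring are separated, but with inputs `[0, 2]` they would be merged because both map 0 to 0 and 2 to 4. This is why the equivalence is only meaningful relative to a declared oracle.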

2606.14498 2026-06-15 physics.chem-ph cs.AI 交叉投稿

A Fixed-Point Neural Operator for Size- and Functional-Transferable Hamiltonian Prediction

用于尺寸和功能可迁移哈密顿量预测的定点神经算子

Yunhong Lou, Xihang Yue, Xinran Wei, Tianqi Deng, Linchao Zhu

发表机构 * Zhejiang University(浙江大学) Zhongguancun Academy(中关村学院) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院)

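AI summary Presents HamEvo, a neural operator that learns the single-step self-consistent-field update and returns the converged Hamiltonian as its fixed point, calibrated with density-matrix supervision; it predicts HOMO and LUMO energies near the chemical-accuracy scale, transfers to larger molecules with few-shot fine-tuning, and runs up to 242 times faster than conventional DFT.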

Comments 30 pages, 5 figures, 2 tables

详情
AI中文摘要

利用机器学习预测Kohn-Sham哈密顿量可以加速密度泛函理论,同时保留对分子轨道、能级和电子结构可观测量的访问,而纯能量代理无法解析这些量。然而,与收敛哈密顿量的元素级一致性(自洽场迭代的隐式不动点)并不能决定控制轨道能量和密度的占据子空间。在这里,我们提出HamEvo,一种学习单步自洽更新并将收敛哈密顿量作为其不动点返回的神经算子。HamEvo在中间自洽轨迹上预训练,并在平衡态通过密度矩阵监督进行校准。在从MD17到类药QMugs的基准测试中,HamEvo将哈密顿量误差比直接回归和深度均衡基线降低了35-49%,并以0.036和0.053 eV的平均绝对误差预测QMugs的HOMO和LUMO能量,接近1 kcal/mol的化学精度尺度。仅使用20个参考构象的少样本微调将HamEvo扩展到多达122个原子的分子,远超预训练覆盖的尺寸范围。通过热分子动力学采样,HamEvo捕捉到超越谐波近似的温度依赖HOMO-LUMO间隙重整化。推理速度比传统DFT快242倍。

英文摘要

Predicting the Kohn-Sham Hamiltonian with machine learning can accelerate density functional theory while retaining access to molecular orbitals, energy levels, and electronic-structure observables that energy-only surrogates cannot resolve. Yet element-wise agreement with the converged Hamiltonian, an implicit fixed point of the self-consistent field iteration, does not determine the occupied subspace that governs orbital energies and densities. Here we present HamEvo, a neural operator that learns the single-step self-consistent update and returns the converged Hamiltonian as its fixed point. HamEvo is pre-trained on intermediate self-consistent trajectories and calibrated at equilibrium with density-matrix supervision. Across benchmarks from MD17 to drug-like QMugs, HamEvo lowers Hamiltonian errors by 35-49% over direct-regression and deep-equilibrium baselines, and predicts QMugs HOMO and LUMO energies with mean absolute errors of 0.036 and 0.053 eV, near the 1 kcal/mol chemical-accuracy scale. Few-shot fine-tuning with only 20 reference conformations extends HamEvo to molecules of up to 122 atoms, well beyond the size range covered by pre-training. With thermal molecular-dynamics sampling, HamEvo captures temperature-dependent HOMO-LUMO gap renormalization beyond the harmonic approximation. Inference is up to 242 times faster than conventional DFT.
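
Treating the converged solution as the fixed point of a single-step update can be sketched generically: apply the update repeatedly until the state stops changing. In the toy below a hand-written linear contraction stands in for the learned self-consistent update; `toy_update`, the tolerance, and the iteration cap are illustrative assumptions unrelated to HamEvo's actual operator.

```python
def fixed_point(update, x0, tol=1e-12, max_iters=500):
    """Iterate x <- update(x) until the largest component change is below tol."""
    x = list(x0)
    for iteration in range(1, max_iters + 1):
        x_next = update(x)
        change = max(abs(p - q) for p, q in zip(x_next, x))
        x = x_next
        if change < tol:
            return x, iteration
    raise RuntimeError("fixed-point iteration did not converge")


def toy_update(x):
    """Stand-in for one self-consistent step: a linear contraction map."""
    return [
        0.5 * x[0] + 0.1 * x[1] + 1.0,
        0.2 * x[0] + 0.3 * x[1] + 2.0,
    ]
```

For this map the iteration converges from any starting point because each row of the linear part sums to less than one in absolute value. A learned update offers no such guarantee by construction, so a convergence check like the one in `fixed_point` remains useful.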

2606.14570 2026-06-15 physics.ao-ph cs.AI cs.LG 交叉投稿

Regional Climate Model Emulation with Diffusion Approaches: What is the Added Value of Generative Machine Learning?

基于扩散方法的区域气候模型模拟:生成式机器学习的附加价值是什么?

Mikel N. Legasa, Antoine Doury, Achille Gellens, Redouane Lguensat, Clara Naldesi, Soulivanh Thao, Mathieu Vrac

发表机构 * University of Cambridge(剑桥大学) CNRS(法国国家科学研究中心) Institut Pierre Simon Laplace(皮埃尔·西蒙·拉普拉斯研究所)

AI总结 本文提出ParamDiffusion,一种两阶段扩散框架,与确定性方法对比,评估生成式机器学习在区域气候模型模拟中的附加价值,发现扩散方法能高技巧地再现气候统计特征,但极端事件模拟仍有不足。

Comments Submitted to Journal of Advances in Modeling Earth Systems (JAMES)

详情
AI中文摘要

模拟器通过捕捉区域气候模型(RCM)的动力降尺度功能,提供了一种经济有效的替代方案。它们将全球气候模型(GCM)模拟的大尺度预测因子与RCM模拟的目标变量(此处为降水)的高分辨率场联系起来。机器学习方法,通常是深度学习,在计算时间和能耗上比运行RCM更便宜。其中,生成模型具有吸引力,因为它们可以模拟与预测因子一致的局部高分辨率场集合。这个集合,我们称之为不确定性包络,其附加价值仍有待恰当评估。在此,我们做出三项贡献。首先,我们引入ParamDiffusion,一种新的两阶段扩散框架,并将其与最先进的扩散方法进行比较。其次,我们通过一个符合气候科学需求的综合框架扩展标准验证,检查特定降水事件,包括极端事件。第三,在此框架内,我们评估扩散方法相对于确定性方法的附加价值。我们相互比较了四种深度学习模型:一种旨在捕捉降水尾部的确定性模型;一种基于该模型的参数化概率模型;一种最近提出的扩散方法;以及ParamDiffusion,它将参数化模型与扩散模型相结合。我们的结果表明,基于扩散的方法以高技巧再现了气候降水统计特征,包括分布尾部和空间复合极端事件,同时生成空间细节丰富的场。然而,所评估的模型均未能在其不确定性包络内始终如一地解释最极端的RCM模拟事件。因此,扩散模型在概率性RCM模拟方面具有前景,但在它们能够可靠地代表高影响降水极端事件之前,仍需取得进展。

英文摘要

Emulators provide a cost-effective alternative to regional climate models (RCMs) by capturing their dynamical downscaling function. They link large-scale predictors simulated by global climate models (GCMs) to RCM-simulated high-resolution fields of the target variable, here precipitation. Machine learning methods, typically deep learning, are cheaper than running RCMs in computation time and energy. Among them, generative models are appealing because they can simulate ensembles of local high-resolution fields consistent with the predictors. This ensemble, which we call the uncertainty envelope, remains to be properly assessed for added value. Here, we make three contributions. First, we introduce ParamDiffusion, a new two-stage diffusion-based framework, and compare it with a state-of-the-art diffusion approach. Second, we expand standard validation through a comprehensive framework aligned with climate-science needs, examining specific precipitation events, including extremes. Third, within this framework, we assess the added value of diffusion approaches relative to deterministic methods. We intercompare four deep-learning models: a deterministic model designed to capture the precipitation tail; a parametric probabilistic model based on it; a recently proposed diffusion approach; and ParamDiffusion, which couples the parametric model with a diffusion model. Our results show that diffusion-based approaches reproduce climatological precipitation statistics with high skill, including distributional tails and spatially compounded extremes, while generating spatially detailed fields. However, none of the assessed models consistently accounts for the most extreme RCM-simulated events within its uncertainty envelope. Diffusion models are therefore promising for probabilistic RCM emulation, but progress is still required before they can reliably represent high-impact precipitation extremes.
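
The abstract's key negative finding is that the most extreme RCM-simulated events are not consistently covered by the ensemble's uncertainty envelope. A minimal sketch of such a coverage check follows. It assumes the envelope is simply the minimum-to-maximum range of the ensemble members, which is an assumption made for illustration and may differ from the paper's definition. All names and numbers are hypothetical toy values.

```python
from typing import Sequence


def inside_envelope(ensemble: Sequence[float], reference: float) -> bool:
    """True if the reference value lies within the min-max range of the ensemble.

    Assumption: the 'uncertainty envelope' is taken as the ensemble range.
    """
    return min(ensemble) <= reference <= max(ensemble)


def coverage(ensembles: Sequence[Sequence[float]], references: Sequence[float]) -> float:
    """Fraction of reference events that fall inside their ensemble envelope."""
    hits = sum(inside_envelope(e, r) for e, r in zip(ensembles, references))
    return hits / len(references)


# Toy precipitation values (mm/day): three events, three ensemble members each.
toy_ensembles = [[1.0, 2.5, 3.0], [10.0, 12.0, 15.0], [40.0, 55.0, 60.0]]
toy_references = [2.0, 14.0, 95.0]  # the last event is more extreme than any member

print(coverage(toy_ensembles, toy_references))
```

In this toy case the two moderate events are covered and the extreme one is not, which is the kind of shortfall the abstract describes.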

2606.14581 2026-06-15 cs.LG cs.AI 交叉投稿

CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation

CARE:通过科学实验中的可审计证据审查控制LLM生成的策略

Guanyu Liu, Weiyi Kong, Zeyu Wang, Boer Zhang, Baiqing Li, Peiyu Zhang, Tianyu Shi

发表机构 * University of Macau(澳门大学) University of Toronto(多伦多大学) UCLA(加州大学洛杉矶分校) Harvard University(哈佛大学) XtalPi(晶泰科技) McGill University(麦吉尔大学)

AI总结 提出CARE框架,通过可审计的干预门控机制,在保留非LLM优化器作为默认路径的同时,利用LLM修正挑战者排序策略,显著提升高通量实验优化性能。

Comments 23 pages, 4 figures

详情
AI中文摘要

赋予LLM对昂贵、不可逆的科学实验的直接控制会导致不安全的探索和不稳定的性能,但完全抛弃LLM的创造力会牺牲显著的优化潜力。我们引入了CARE(通过科学实验中的可审计证据审查控制LLM生成的策略),这是一种用于高通量实验(HTE)优化的可审计控制器,它保留非LLM的现有优化器作为默认动作路径,同时使用LLM来修正挑战者排序策略。在每个结果揭示之前,一个公共证据干预门将挑战者与现有方案进行比较。只有当选择前可用的证据支持变更时,它才授权选择挑战者,并将决策记录在审计日志中。在Minerva/Olympus和ChemLex基准测试中,CARE优于所有其他评估方法,相对于公开的现有方案,最终最佳结果从80.0提高到88.5(Minerva/Olympus),从83.9提高到92.1(ChemLex)。我们的实验表明,当LLM在可审计控制器下扩展提议空间时,其自我进化比直接选择实验更可靠。

英文摘要

Granting LLMs direct control over costly, irreversible scientific experiments leads to unsafe exploration and unstable performance, but discarding LLM creativity entirely sacrifices significant optimization potential. We introduce CARE (Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation), an auditable controller for high-throughput experimentation (HTE) optimization that keeps a non-LLM incumbent optimizer as the default action path while using LLMs to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent. It authorizes the challenger's selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. CARE outperforms all other evaluated methods on Minerva/Olympus and ChemLex benchmarks, with final-best improving from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent. Our experiments indicate that LLM self-evolution is more reliable when it expands the proposal space under an auditable controller, rather than directly choosing experiments.
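
The gating logic described above (incumbent as default, challenger authorized only on pre-selection evidence, every decision logged) can be sketched roughly as follows. This is a hypothetical illustration, not the CARE implementation. The class names, the scalar evidence scores, and the `min_margin` threshold are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GateRecord:
    """One audit-log entry (hypothetical schema)."""
    round_index: int
    incumbent_pick: str
    challenger_pick: str
    margin: float
    authorized: bool


@dataclass
class InterventionGate:
    """Keeps the incumbent's pick unless pre-selection evidence favors the challenger."""
    min_margin: float = 0.0  # assumed threshold; the paper's criterion may differ
    audit_log: List[GateRecord] = field(default_factory=list)

    def decide(self, round_index: int, incumbent_pick: str,
               challenger_pick: str, evidence: Dict[str, float]) -> str:
        # Evidence is computed before the experiment outcome is revealed.
        margin = evidence[challenger_pick] - evidence[incumbent_pick]
        authorized = margin > self.min_margin
        # Every decision is recorded, whether or not the challenger wins.
        self.audit_log.append(
            GateRecord(round_index, incumbent_pick, challenger_pick, margin, authorized)
        )
        return challenger_pick if authorized else incumbent_pick


gate = InterventionGate(min_margin=0.5)
first = gate.decide(0, "exp_A", "exp_B", {"exp_A": 1.0, "exp_B": 2.0})
second = gate.decide(1, "exp_A", "exp_C", {"exp_A": 1.0, "exp_C": 1.2})
print(first, second, len(gate.audit_log))
```

The design point is that the non-LLM incumbent remains the default path: a weak or missing evidence margin falls back to it automatically.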

2606.14686 2026-06-15 cs.CV cs.AI 交叉投稿

CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification

CottonLeafVision:一种可解释且鲁棒的棉花叶部病害分类深度学习框架

Rafi Ahamed, Md. Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Md. Asif Khan, Sudeepta Mandal

发表机构 * Dept. of CSE(计算机科学与工程系) East West University(东-西大学) Dhaka, Bangladesh(达卡,孟加拉国)

AI总结 提出CottonLeafVision框架,使用DenseNet201在棉花叶部病害数据集上达到98%分类准确率,并集成Grad-CAM、遮挡敏感性和对抗训练增强可解释性与鲁棒性。

Comments This paper contains 11 figures and 4 tables. It was presented at the 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026.

详情
AI中文摘要

全球范围内,棉花是一种高度经济价值的作物,因为纺织工业严重依赖它。因此,精确识别和检测棉花叶部病害对经济稳定至关重要。“CottonLeafVision”的开发目标是准确分类和检测棉花叶部病害。为此,我们在公开的棉花叶部病害图像数据集上评估了多个预训练的深度卷积神经网络,包括DenseNet201、InceptionV3和VGG19。该图像数据集包含七个类别,六个病害类别和一个健康类别,是在反映现实挑战的各种田间条件下收集的。在这些预训练模型中,使用DenseNet201,我们实现了98%的最高分类准确率。为了增强模型的可靠性和可解释性,我们实施了不同的技术和方法,如梯度加权类激活映射(Grad-CAM)、遮挡敏感性分析和对抗训练,以提高模型的抗噪声能力。最后,我们开发了一个原型,以便在现实农业中利用模型的能力。本文展示了深度学习模型在现实棉花病害管理情况下分类病害的能力。

英文摘要

Globally, cotton is a crop of high economic value because the textile industry depends heavily on it. Precise identification and detection of cotton leaf disease is therefore crucial for economic stability. The goal of "CottonLeafVision" is to accurately classify and detect cotton leaf disease. To this end, we evaluated multiple pretrained deep convolutional neural networks, including DenseNet201, InceptionV3, and VGG19, on a publicly available cotton leaf disease image dataset. The dataset comprises seven classes (six disease classes and one healthy class), collected under various field conditions that reflect real-world challenges. Among these pretrained models, DenseNet201 achieved the highest classification accuracy of 98%. To enhance the model's reliability and interpretability, we applied Gradient-weighted Class Activation Mapping (Grad-CAM) and occlusion sensitivity analysis, and used adversarial training to increase the model's noise resistance. Finally, we developed a prototype to make the model's capabilities usable in real-life agriculture. This paper shows the deep learning model's ability to classify diseases in real-life cotton disease management situations.

2603.03970 2026-06-15 cs.AI 版本更新

Generative AI for Managerial Decision-Making under Ambiguity and Sycophancy

生成式人工智能在模糊性与谄媚行为下的管理决策

Sule Ozturk Birim, Fabrizio Marozzo, Yigit Kazancoglu

发表机构 * Manisa Celal Bayar University(马尼萨杰拉勒·拜亚尔大学) University of Calabria(卡拉布里亚大学) Yasar University(亚沙尔大学)

AI总结 本研究通过人机协作实验,利用四维商业模糊性分类法评估GenAI模型在模糊检测、解析和谄媚行为方面的表现,发现模糊解析能提升决策质量,且不同模型对错误指令的谄媚程度不一。

详情
AI中文摘要

生成式人工智能(GenAI)正日益融入复杂的业务流程,从根本上改变了管理决策的边界。然而,在模糊的商业环境中,其战略建议的可靠性仍是一个关键的知识空白。为填补这一空白,本研究比较了多个GenAI模型在检测模糊性方面的能力,检验了系统性模糊解析过程是否能改善响应质量,并调查了它们在面对有缺陷的管理指令时对谄媚行为的易感性。利用一种新颖的四维商业模糊性分类法,我们在战略、战术和操作场景中进行了人机协作实验。通过一个基于一致性、可操作性、理由质量和约束遵守的人工验证自动评估框架对生成的决策进行评估。结果表明,我们的方法不仅能区分不同类型的模糊性,还揭示了模糊解析如何系统地改变模型行为。特别是,解析模糊性提高了所有管理层级的决策质量,其中在约束遵守方面提升最为显著。进一步分析显示,谄媚行为在不同模型中并不一致:一些模型质疑有缺陷的假设,而另一些则倾向于遵从。本研究通过将GenAI定位为一种能够检测和解析管理者可能忽略的模糊性的认知支架,同时证明其人工局限性需要人类监督以确保其作为战略伙伴的可靠性,从而为有限理性文献做出了贡献。

英文摘要

Generative artificial intelligence (GenAI) is increasingly being integrated into complex business workflows, fundamentally shifting the boundaries of managerial decision-making. However, the reliability of its strategic advice in ambiguous business contexts remains a critical knowledge gap. To address this gap, this study compares multiple GenAI models in their ability to detect ambiguity, examines whether a systematic ambiguity-resolution process improves response quality, and investigates their susceptibility to sycophantic behavior when confronted with flawed managerial directives. Using a novel four-dimensional business ambiguity taxonomy, we conducted a human-in-the-loop experiment across strategic, tactical, and operational scenarios. The resulting decisions were assessed through a human-validated automated evaluation framework based on agreement, actionability, justification quality, and constraint adherence. The results show that our approach not only distinguishes different types of ambiguity, but also reveals how ambiguity resolution systematically changes model behavior. In particular, resolving ambiguities improved decision quality across all managerial levels, with the strongest gains observed in constraint adherence. The analysis further showed that sycophantic behavior is not uniform across models: some models challenged flawed assumptions, whereas others tended to comply with them. This study contributes to the bounded rationality literature by positioning GenAI as a cognitive scaffold that can detect and resolve ambiguities managers might overlook, while demonstrating that its artificial limitations require human oversight to ensure its reliability as a strategic partner.

2605.29640 2026-06-15 cs.AI 版本更新

VikingMem: A Memory Base Management System for Stateful LLM-based Applications

VikingMem:面向有状态LLM应用的记忆库管理系统

Jiajie Fu, Junwen Chen, Mengzhao Wang, Aoxiang He, Maojia Sheng, Xiangyu Ke, Yifan Zhu, Yunjun Gao

发表机构 * Zhejiang University(浙江大学)

AI总结 提出记忆库(Memory Base)数据管理范式,并基于VikingDB向量引擎实现VikingMem系统,通过事件与实体抽象、主题时间线压缩和时间加权召回,在长期记忆基准上提升检索效果达30%。

Comments Accepted by VLDB26

详情
AI中文摘要

大型语言模型彻底改变了交互式应用;然而,其有限的上下文窗口为维护有状态的长期交互带来了关键的数据管理挑战。现有的记忆方法通常依赖于简单的提取方法,导致记忆不完整,或使用针对单一用例(如聊天机器人)的刚性、单用途记忆提取提示。因此,它们缺乏泛化能力,在多样化的下游任务中表现不佳。为弥补这一差距,我们引入了记忆库(Memory Base),一种用于管理长期交互持久状态的新型数据管理范式。其特点包括三个核心原则:从原始信息流中选择性提取高价值记忆;固有的状态性和演化性,其中记忆内容被逐步总结、纠正并按时间加权以优先处理近期交互;以及一种可泛化的抽象范式,旨在跨不同应用(包括教育、推荐和智能体记忆)实现稳健的可迁移性。基于此,我们提出了VikingMem,一个在VikingDB向量引擎上实现的端到端记忆库管理系统。VikingMem通过互连的事件和实体抽象具体化了这一范式。它采用以事件为中心的记忆提取来选择性处理复杂信息流,同时实体由事件动态更新以实现有状态演化。通过基于主题时间线的时间压缩和时间加权召回,系统逐步生成高层级总结记忆,优先处理近期项目,并压缩和淡出较旧项目。在长期记忆基准上的广泛评估表明,VikingMem在记忆检索效果上比基线方法提升高达30%,同时保持了交互应用所需的低延迟。

英文摘要

Large Language Models have revolutionized interactive applications; however, their finite context windows pose a critical data management challenge for maintaining stateful, long-term interactions. Existing memory approaches often rely on simplistic extraction methods that lead to incomplete memories or use rigid, single-purpose memory extraction prompts tailored to a single use case, such as chatbots. Consequently, they lack generalizability and perform poorly across diverse downstream tasks. To bridge this gap, we introduce the Memory Base, a novel data management paradigm for managing the persistent state of long-term interactions. It is characterized by three core principles: selective extraction of high-value memories from raw information streams; inherent statefulness and evolution, where memory content is progressively summarized, corrected, and temporally weighted to prioritize recent interactions; and a generalizable abstraction paradigm designed for robust transferability across diverse applications, including education, recommendation, and agent memory. Building on this foundation, we present VikingMem, an end-to-end Memory Base Management System implemented on the VikingDB vector engine. VikingMem materializes this paradigm through interconnected event and entity abstractions. It features event-centric memory extraction to selectively handle complex information streams, while entities are dynamically updated by events to achieve stateful evolution. Using temporal compression via a topic-wise timeline and time-weighted recall, the system progressively produces high-level summary memories, prioritizes recent items, and compresses and fades older ones. Extensive evaluations on long-term memory benchmarks demonstrate that VikingMem outperforms baselines by up to 30% in memory retrieval effectiveness while maintaining the low latency essential for interactive applications.
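
To make "time-weighted recall" concrete, here is a minimal sketch of one way recency can be folded into a retrieval score. It is not VikingMem's actual scoring rule, which the abstract does not specify. The exponential half-life decay, the function names, and the toy similarity values are all assumptions for illustration.

```python
from typing import Dict, List


def time_weighted_score(similarity: float, age_days: float,
                        half_life_days: float = 30.0) -> float:
    """Scale a similarity score by an exponential recency weight (assumed form)."""
    return similarity * 0.5 ** (age_days / half_life_days)


def recall(memories: List[Dict], top_k: int = 2,
           half_life_days: float = 30.0) -> List[str]:
    """Return the texts of the top_k memories ranked by time-weighted score."""
    ranked = sorted(
        memories,
        key=lambda m: time_weighted_score(m["similarity"], m["age_days"], half_life_days),
        reverse=True,
    )
    return [m["text"] for m in ranked[:top_k]]


# Toy memories: the oldest one is the most similar but has faded the most.
toy_memories = [
    {"text": "old", "similarity": 0.9, "age_days": 90.0},
    {"text": "recent", "similarity": 0.6, "age_days": 0.0},
    {"text": "mid", "similarity": 0.8, "age_days": 30.0},
]
print(recall(toy_memories))
```

With a 30-day half-life the 90-day-old memory keeps only one eighth of its similarity score, so the recent and mid-aged items are recalled first.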

2606.13556 2026-06-15 cs.AI cs.HC q-bio.BM q-bio.GN q-bio.MN 版本更新

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

是你还是你的环境?一种用于基因组锚定的个性化生理解释的贝叶斯推理框架

Aruna Dey, Suraj Biswas

发表机构 * Dots-In

AI总结 提出一种贝叶斯推理框架,利用基因组先验解决个性化健康AI的冷启动问题,通过基因组锚定分离生理信号的体质与环境成分,并随数据积累动态更新。

Comments 24 pages, 8 figures, 3 tables. Conceptual framework paper. Updated version with revised section structure and formatting

详情
AI中文摘要

个性化健康AI系统面临一个根本性的冷启动问题:用于生理解释的机器学习模型需要数周的个人行为数据,才能区分体质变异与环境引起的偏差。我们提出一种基于因果推断和贝叶斯先验设计的解决方案。个体的基因组图谱作为外源性遗传锚点——一个领域信息化的个性化先验,在受孕时固定,不受反向因果影响,且在收集任何行为观测之前即可获得。该锚点初始化个体生理设定点G-hat = mu + sum(beta_i * g_i)上的贝叶斯信念状态,其中beta_i是GWAS衍生的效应大小,g_i是风险等位基因计数。每次传入的生理测量P产生一个非体质偏差delta = P - G-hat,将可归因于环境和状态的部分与体质固定的基线分离。随着行为数据的积累,先验根据G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t衰减,从基因组主导过渡到经验基线主导的推理。同一个观测到的HRV 55 ms,对于先验预测80 ms的人产生抑制假设,而对于先验预测30 ms的人产生增强假设——没有个性化锚点,这种反转是不可能的。我们在六个生理领域开发了这一架构,根据证据强度对基因组先验进行分级,区分稳健复制的锚点(FTO、FADS1/2、FKBP5)和有争议的候选基因(SLC6A4、MAOA、DRD2)。我们讨论了关联、孟德尔随机化和个体因果推断之间的推理边界,并定义了部署的四个约束:证据分级的先验、动态衰减、祖先匹配的效应大小以及归因而非确定性输出。

英文摘要

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
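
The three formulas in the abstract can be written out directly. The sketch below follows them term by term: the genomic set point G-hat = mu + sum(beta_i * g_i), the deviation delta = P - G-hat, and the decaying blend G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t. The abstract does not give the form of w(t), so the exponential decay with time constant `tau_days` is an assumption, and all function names and numeric inputs other than the HRV example are hypothetical.

```python
import math
from typing import Sequence


def genomic_set_point(mu: float, betas: Sequence[float],
                      allele_counts: Sequence[int]) -> float:
    """G-hat = mu + sum(beta_i * g_i)."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))


def deviation(measurement: float, set_point: float) -> float:
    """delta = P - G-hat (the non-constitutional part of the signal)."""
    return measurement - set_point


def blended_set_point(g_hat_genomic: float, p_bar_t: float,
                      t_days: float, tau_days: float = 30.0) -> float:
    """G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t.

    Assumption: w(t) = exp(-t / tau); the abstract leaves w(t) unspecified.
    """
    w = math.exp(-t_days / tau_days)
    return w * g_hat_genomic + (1.0 - w) * p_bar_t


# HRV example from the abstract: the same 55 ms reading, two different priors.
print(deviation(55.0, 80.0))  # negative: suppression hypothesis
print(deviation(55.0, 30.0))  # positive: enhancement hypothesis
```

The sign flip of delta for an identical measurement is the reversal the abstract attributes to the personalized anchor.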

2606.13662 2026-06-15 cs.AI cs.CL 版本更新

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

EurekAgent:自主科学发现中,智能体环境工程即一切

Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang, Jian Song, Lei Hou, Juanzi Li

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Zhipu AI(智谱AI)

AI总结 提出环境工程框架EurekAgent,通过权限、工件、预算和人机交互四维工程设计,在数学、内核工程和机器学习任务上取得新最优结果,总API成本低于11美元。

详情
AI中文摘要

基于LLM的智能体在自动化科学发现方面展现出日益增长的潜力。给定一个可优化的度量和执行环境,它们可以提出、验证和迭代科学解决方案,并已产生超越人类设计方法的结果。随着模型能力的持续提升,我们认为自主科学发现的瓶颈正从规定智能体工作流程转向设计智能体环境:即塑造智能体行为的资源、约束和接口。我们将此框架化为环境工程:构建能够放大生产性行为(如开放式探索、系统化工件管理和智能体间协作)同时抑制有害行为(如奖励黑客和高摩擦人工监督)的环境。我们提出了EurekAgent,一个用于度量驱动自主科学发现的环境工程智能体系统。EurekAgent从四个维度进行环境工程:权限工程用于受限智能体执行和隔离评估;工件工程用于基于文件系统和Git的协作;预算工程用于预算感知探索;人机交互工程用于便捷的人工监督和干预。EurekAgent在多个数学、内核工程和机器学习任务上取得了新的最优结果,包括以不到11美元的总API成本发现新的26圆填充最优结果。我们开源了代码和结果,并呼吁将环境工程作为开发可靠自主研究智能体的核心研究方向。

英文摘要

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

2112.04573 2026-06-15 cs.DL cs.AI cs.LG 版本更新

Application of Artificial Intelligence and Machine Learning in Libraries: A Systematic Review

人工智能与机器学习在图书馆中的应用:系统综述

Rajesh Kumar Das, Mohammad Sharif Ul Islam

发表机构 * University of Nebraska - Lincoln(内布拉斯加大学林肯分校) Noakhali Science and Technology University(诺阿克利科学与技术大学) University of Dhaka(达卡大学)

AI总结 通过系统综述32篇文献,总结了人工智能与机器学习在图书馆中的应用领域、技术及现状,发现当前研究以理论为主,部分涉及实践案例。

详情
AI中文摘要

随着人工智能和机器学习等前沿技术的概念和实施变得相关,学者、研究人员和信息专业人员涉足这一领域的研究。本系统文献综述旨在综合探讨人工智能和机器学习在图书馆中应用的实证研究。为实现研究目标,基于Kitchenham等人(2009)提出的原始指南进行了系统文献综述。数据来自Web of Science、Scopus、LISA和LISTA数据库。经过严格/既定的筛选过程,最终选定、审阅并分析了32篇文章,以总结图书馆中最常使用的AI和ML领域及技术。结果表明,当前与LIS领域相关的AI和ML研究主要集中于理论工作。然而,一些研究人员也强调了实施项目或案例研究。本研究将为研究人员、实践者和教育工作者提供图书馆中AI和ML的全景视图,以推动更多技术导向的方法,并预见未来的创新路径。

英文摘要

As cutting-edge technologies such as artificial intelligence and machine learning have become relevant in concept and implementation, academics, researchers and information professionals have taken up research in this area. The objective of this systematic literature review is to provide a synthesis of empirical studies exploring the application of artificial intelligence and machine learning in libraries. To achieve the objectives of the study, a systematic literature review was conducted based on the original guidelines proposed by Kitchenham et al. (2009). Data were collected from the Web of Science, Scopus, LISA and LISTA databases. Following a rigorous, established selection process, a total of thirty-two articles were selected, reviewed and analyzed to summarize the AI and ML domains and techniques most often used in libraries. Findings show that current AI and ML research relevant to the LIS domain mainly focuses on theoretical work. However, some researchers also emphasized implementation projects or case studies. This study will provide a panoramic view of AI and ML in libraries for researchers, practitioners and educators, helping them further more technology-oriented approaches and anticipate future innovation pathways.

2504.03686 2026-06-15 cs.NI cs.AI cs.LG 版本更新

Revisiting Outage for Edge Inference Systems

重新审视边缘推理系统的中断问题

Zhanwei Wang, Qunsong Zeng, Haotian Zheng, Kaibin Huang

发表机构 * Department of Electrical and Computer Engineering, The University of Hong Kong(香港大学电子与计算机工程系)

AI总结 针对边缘推理系统的端到端可靠性,提出推理中断概率框架,量化推理精度低于阈值的概率,并优化通信开销与推理可靠性的权衡。

详情
AI中文摘要

第六代(6G)移动网络的关键任务之一是在网络边缘部署大规模人工智能(AI)模型,为边缘设备提供远程推理服务。由此产生的平台称为边缘推理,将支持广泛的物联网应用,如自动驾驶、工业自动化和增强现实。鉴于这些任务的关键性和时间敏感性,设计既可靠又能满足严格端到端(E2E)延迟约束的边缘推理系统至关重要。现有研究主要关注以信道中断概率为特征的通信可靠性,可能无法保证E2E性能,特别是在E2E推理精度和延迟方面。为解决这一局限,我们提出一个理论框架,引入并数学刻画了推理中断(InfOut)概率,该概率量化了E2E推理精度低于目标阈值的可能性。在E2E延迟约束下,该框架建立了通信开销(即上传更多传感器观测)与以InfOut概率量化的推理可靠性之间的基本权衡。为了找到优化这种权衡的可行方法,我们通过对接收判别增益的分布应用高斯近似,推导出InfOut概率的精确替代函数。实验结果表明,所提出的设计在E2E推理可靠性方面优于传统的以通信为中心的方法。

英文摘要

One of the key missions of sixth-generation (6G) mobile networks is to deploy large-scale artificial intelligence (AI) models at the network edge to provide remote-inference services for edge devices. The resultant platform, known as edge inference, will support a wide range of Internet-of-Things applications, such as autonomous driving, industrial automation, and augmented reality. Given the mission-critical and time-sensitive nature of these tasks, it is essential to design edge inference systems that are both reliable and capable of meeting stringent end-to-end (E2E) latency constraints. Existing studies, which primarily focus on communication reliability as characterized by channel outage probability, may fail to guarantee E2E performance, specifically in terms of E2E inference accuracy and latency. To address this limitation, we propose a theoretical framework that introduces and mathematically characterizes the inference outage (InfOut) probability, which quantifies the likelihood that the E2E inference accuracy falls below a target threshold. Under an E2E latency constraint, this framework establishes a fundamental tradeoff between communication overhead (i.e., uploading more sensor observations) and inference reliability as quantified by the InfOut probability. To find a tractable way to optimize this tradeoff, we derive accurate surrogate functions for InfOut probability by applying a Gaussian approximation to the distribution of the received discriminant gain. Experimental results demonstrate the superiority of the proposed design over conventional communication-centric approaches in terms of E2E inference reliability.
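
The Gaussian approximation mentioned above can be illustrated with a small sketch. It assumes, for illustration only, that an inference outage occurs when the received discriminant gain falls below some threshold and that the gain is approximately normal. The paper's exact surrogate functions are not given in the abstract, and the names and numbers below are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def infout_probability(gain_mean: float, gain_std: float,
                       gain_threshold: float) -> float:
    """P(gain < threshold) under a Gaussian approximation of the gain.

    Assumption: accuracy drops below target exactly when the discriminant
    gain drops below gain_threshold.
    """
    if gain_std <= 0.0:
        raise ValueError("gain_std must be positive")
    return normal_cdf((gain_threshold - gain_mean) / gain_std)


# A higher mean gain (for example from uploading more observations)
# lowers the outage probability for a fixed threshold.
low_gain = infout_probability(gain_mean=8.0, gain_std=2.0, gain_threshold=6.0)
high_gain = infout_probability(gain_mean=10.0, gain_std=2.0, gain_threshold=6.0)
print(low_gain > high_gain)
```

A closed-form expression of this kind is what makes the tradeoff tractable to optimize: the outage probability becomes a smooth function of the gain statistics.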

2504.16173 2026-06-15 cs.AR cs.AI 版本更新

FPGA-Based Neural Network Accelerators for Space Applications: A Survey

基于FPGA的神经网络加速器在空间应用中的综述

Pedro Antunes, Artur Podobas

发表机构 * KTH Royal Institute of Technology(皇家理工学院)

AI总结 本文综述了基于FPGA的神经网络加速器在空间任务中的应用,分析了现有文献、趋势和空白,并提出了未来研究方向,以提升星载计算系统性能。

Comments Manuscript under review at ACM CSUR. Pre-print updated after 1st Major Revision

详情
AI中文摘要

空间任务正变得越来越雄心勃勃,需要高性能的星载计算系统。为此,现场可编程门阵列(FPGA)因其灵活性、成本效益和潜在的辐射容错能力而引起了广泛兴趣。同时,神经网络(NN)因其执行自主操作、传感器数据分析和数据压缩等空间任务的能力而受到认可。本综述为旨在空间应用中实现基于FPGA的NN加速器的研究人员提供了宝贵资源。通过分析现有文献、识别趋势和空白,并提出未来研究方向,本文强调了这些加速器在增强星载计算系统方面的潜力。

英文摘要

Space missions are becoming increasingly ambitious, necessitating high-performance onboard spacecraft computing systems. In response, field-programmable gate arrays (FPGAs) have garnered significant interest due to their flexibility, cost-effectiveness, and radiation tolerance potential. Concurrently, neural networks (NNs) are being recognized for their capability to execute space mission tasks such as autonomous operations, sensor data analysis, and data compression. This survey serves as a valuable resource for researchers aiming to implement FPGA-based NN accelerators in space applications. By analyzing existing literature, identifying trends and gaps, and proposing future research directions, this work highlights the potential of these accelerators to enhance onboard computing systems.

2508.03736 2026-06-15 cs.CV cs.AI 版本更新

Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities

通过视觉Transformer融合泛在射频数据与空间图像以增强智慧城市地图构建

Rafayel Mkrtchyan, Armen Manukyan, Hrant Khachatrian, Theofanis P. Raptis

发表机构 * Yerevan State University(埃里温国立大学) Consiglio Nazionale delle Ricerche(意大利国家研究委员会)

AI总结 提出基于DINOv2的深度学习框架,融合开源地图与射频数据,利用视觉Transformer联合处理多模态信息,在合成与真实数据集上实现65.3%和64.9%的宏观IoU,显著优于单一数据源方法。

Comments Work supported by funding under the bilateral agreement between CNR (Italy) and HESC MESCS RA (Armenia) as part of the DeepRF project for the 2025-2026 biennium, and by the HESC MESCS RA grant No. 22rl-052 (DISTAL)

详情
Journal ref Pervasive and Mobile Computing, Article 102261, 2026
AI中文摘要

本文提出一种基于深度学习的方法,集成DINOv2架构,通过结合来自开源平台的(可能错误的)地图与从多个无线用户设备和基站收集的泛在射频(RF)数据,改进建筑地图构建。与先前方法不同,我们的方法利用基于视觉Transformer的架构,在统一框架内联合处理RF和地图模态,有效捕捉空间依赖性和结构先验,以提高地图构建精度。为评估目的,我们使用华为联合制作的合成数据集。为应对真实世界数据不完善的挑战,我们向其RF数据引入受控噪声以模拟真实条件。此外,我们开发并训练了一个仅利用聚合路径损耗信息来解决地图构建问题的模型。我们根据三个性能指标衡量结果:Jaccard指数(交并比,IoU)、Hausdorff距离和Chamfer距离。我们的设计实现了65.3%的宏观IoU,显著超过(i)错误地图基线(40.1%)、(ii)文献中仅使用RF的方法(37.3%)以及(iii)我们设计的非AI融合基线(42.2%)。对比评估突显了仅依赖RF数据或空间数据的局限性,以及AI在融合数据以提升智慧城市地图构建精度方面的有效性。我们还在奥斯陆地区的真实世界数据上进一步验证了我们的方法,通过真实部署环境补充了合成评估,其中我们的最佳融合模型达到了64.9%的宏观IoU。我们还概述了一种通过使用重叠窗口对区域进行分块来在更大区域上部署模型的策略。

英文摘要

In this paper, we present a deep learning-based approach that integrates the DINOv2 architecture to improve building mapping by combining (possibly erroneous) maps from open-source platforms with pervasive radio frequency (RF) data collected from multiple wireless user equipments and base stations. Unlike prior methods, our approach leverages a vision transformer-based architecture to jointly process both RF and map modalities within a unified framework, effectively capturing spatial dependencies and structural priors for enhanced mapping accuracy. For evaluation purposes, we employ a synthetic dataset co-produced by Huawei. To address the challenges associated with real-world data imperfections, we introduce controlled noise into its RF data to simulate real-world conditions. Additionally, we develop and train a model that leverages only aggregated path loss information to tackle the mapping problem. We measure the results according to three performance metrics: the Jaccard index (intersection over union, IoU), the Hausdorff distance, and the Chamfer distance. Our design achieves a macro IoU of 65.3%, significantly surpassing (i) the erroneous maps baseline, which yields 40.1%, (ii) an RF-only method from the literature, which yields 37.3%, and (iii) a non-AI fusion baseline that we designed, which yields 42.2%. The comparative evaluation highlights the limitations of relying solely on RF data or on spatial data, as well as how effectively AI can fuse the two to enhance smart city mapping accuracy. We further validate our method on real-world data from the Oslo region, complementing the synthetic evaluation with a real deployment setting, where our best fusion model reaches 64.9% macro IoU. We additionally outline a strategy for deploying the model over larger areas by tiling the region with overlapping windows.
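
Since the headline numbers are macro IoU values, a minimal sketch of the Jaccard index may help. It assumes that "macro" means the unweighted mean of per-class IoU over the building and non-building classes; the paper may average differently (for example per sample). The function names and the toy masks are hypothetical.

```python
from typing import Sequence


def class_iou(pred: Sequence[int], target: Sequence[int], label: int) -> float:
    """Jaccard index for one class: |intersection| / |union| of pixels with that label."""
    inter = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    union = sum(1 for p, t in zip(pred, target) if p == label or t == label)
    # Convention chosen here: an absent class counts as a perfect match.
    return inter / union if union else 1.0


def macro_iou(pred: Sequence[int], target: Sequence[int],
              labels: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of per-class IoU (assumed meaning of 'macro')."""
    return sum(class_iou(pred, target, label) for label in labels) / len(labels)


# Flattened toy masks: 1 = building pixel, 0 = background.
toy_pred = [1, 1, 0, 0]
toy_target = [1, 0, 0, 0]
print(class_iou(toy_pred, toy_target, 1), macro_iou(toy_pred, toy_target))
```

In the toy masks the building class scores 1/2 and the background class 2/3, so the macro value is their mean, 7/12.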

2511.22246 2026-06-15 hep-ex cs.AI physics.ins-det 版本更新

An interpretable unsupervised representation learning for high precision measurement in particle physics

一种可解释的无监督表示学习用于粒子物理中的高精度测量

Xing-Jian Lv, De-Xing Miao, Zi-Jun Xu, Jian-Chun Wang

发表机构 * Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China(中国科学院高能物理研究所) University of Chinese Academy of Sciences, Beijing 100049, China(中国科学院大学)

AI总结 提出Histogram AutoEncoder(HistoAE),通过自定义直方图损失强制物理结构化的潜在空间,实现可解释的无监督学习,在硅微条探测器数据上达到电荷分辨率0.25e和位置分辨率3μm,媲美传统方法。

Comments 8 pages, 7 figures

详情
AI中文摘要

无监督学习已广泛应用于粒子物理的各种任务。然而,现有模型缺乏对其学习表示的精确控制,限制了物理可解释性,并阻碍了其用于精确测量。我们提出了直方图自编码器(HistoAE),一种无监督表示学习网络,具有自定义的基于直方图的损失函数,强制实现物理结构化的潜在空间。应用于硅微条探测器,HistoAE学习了一个可解释的二维潜在空间,对应于粒子的电荷和撞击位置。经过简单的后处理,它在束流测试数据上实现了$0.25\,e$的电荷分辨率和$3\,\mu\mathrm{m}$的位置分辨率,与传统方法相当。这些结果表明,无监督深度学习模型能够实现物理上有意义且定量精确的测量。此外,HistoAE的生成能力使其能够直接扩展到快速探测器模拟。

英文摘要

Unsupervised learning has been widely applied to various tasks in particle physics. However, existing models lack precise control over their learned representations, limiting physical interpretability and hindering their use for accurate measurements. We propose the Histogram AutoEncoder (HistoAE), an unsupervised representation learning network featuring a custom histogram-based loss that enforces a physically structured latent space. Applied to silicon microstrip detectors, HistoAE learns an interpretable two-dimensional latent space corresponding to the particle's charge and impact position. After simple post-processing, it achieves a charge resolution of $0.25\,e$ and a position resolution of $3\,\mu\mathrm{m}$ on beam-test data, comparable to the conventional approach. These results demonstrate that unsupervised deep learning models can enable physically meaningful and quantitatively precise measurements. Moreover, the generative capacity of HistoAE enables straightforward extensions to fast detector simulations.

2512.10966 2026-06-15 cs.LG cs.AI cs.CV eess.IV 版本更新

Interpretable Alzheimer's Diagnosis via Multimodal Fusion of Regional Brain Experts

可解释的阿尔茨海默病诊断:基于区域脑专家的多模态融合

Farica Zhuang, Shu Yang, Dinara Aliyeva, Zixuan Wen, Duy Duong-Tran, Christos Davatzikos, Tianlong Chen, Song Wang, Li Shen

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出MREF-AD多模态区域专家融合模型,采用混合专家框架将各模态脑区域视为独立专家,通过门控网络学习个性化融合权重,实现可解释的AD诊断。

Comments Published at IEEE ICHI 2026

详情
AI中文摘要

准确早期诊断阿尔茨海默病(AD)对有效干预至关重要,需要整合多模态神经影像数据的互补信息。然而,传统融合方法通常依赖特征的简单拼接,无法自适应平衡淀粉样蛋白PET和MRI等生物标志物在不同脑区的贡献。本文提出MREF-AD,一种用于AD诊断的多模态区域专家融合模型。它是一个混合专家(MoE)框架,将每个模态内的介观脑区域建模为独立专家,并采用门控网络学习个体特定的融合权重。利用阿尔茨海默病神经影像学倡议(ADNI)的表格神经影像和人口统计学信息,MREF-AD在强经典和深度学习基线上取得了有竞争力的性能,同时提供了可解释的、模态和区域层面的洞察,揭示了结构和分子影像如何共同促进AD诊断。源代码见:https://github.com/PennShenLab/mref-ad。

英文摘要

Accurate and early diagnosis of Alzheimer's disease (AD) is critical for effective intervention and requires integrating complementary information from multimodal neuroimaging data. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models mesoscopic brain regions within each modality as independent experts and employs a gating network to learn subject-specific fusion weights. Utilizing tabular neuroimaging and demographic information from the Alzheimer's Disease Neuroimaging Initiative (ADNI), MREF-AD achieves competitive performance over strong classic and deep baselines while providing interpretable, modality- and region-level insight into how structural and molecular imaging jointly contribute to AD diagnosis. The source code is available at https://github.com/PennShenLab/mref-ad.
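
The gated fusion idea (regional experts combined by subject-specific weights) can be sketched in a few lines. This is a generic mixture-of-experts illustration, not the MREF-AD code from the linked repository. It assumes a softmax gate over scalar expert outputs, and the expert labels and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(expert_outputs: Sequence[float],
         gate_scores: Sequence[float]) -> Tuple[float, List[float]]:
    """Combine per-expert predictions with softmax gate weights.

    Returns the fused prediction and the weights, which can be read as
    each expert's contribution for this subject.
    """
    weights = softmax(gate_scores)
    fused = sum(w * o for w, o in zip(weights, expert_outputs))
    return fused, weights


# Two hypothetical experts, e.g. one MRI region and one amyloid-PET region.
fused_equal, weights_equal = fuse([0.2, 0.8], [0.0, 0.0])
fused_skewed, weights_skewed = fuse([0.2, 0.8], [0.0, 2.0])
print(fused_equal, weights_equal)
```

Because the weights are produced per subject and sum to one, they double as an interpretability signal: a larger weight marks the modality and region the gate relied on most.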

2601.18707 2026-06-15 cs.LG cs.AI cs.CV cs.NE 版本更新

SMART: Scalable Mesh-free Aerodynamic Simulations from Raw Geometries using a Transformer-based Surrogate Model

SMART: 基于Transformer代理模型的原始几何形状可扩展无网格气动模拟

Jan Hagnberger, Mathias Niepert

AI总结 提出SMART,一种无需模拟网格、仅使用几何点云预测任意查询位置物理量的神经代理模型,通过交叉层交互联合更新几何特征和物理场,性能媲美甚至超越依赖网格的方法。

Comments Accepted for publication at the 43rd International Conference on Machine Learning (ICML) 2026, Seoul, South Korea

详情
AI中文摘要

基于机器学习的代理模型已成为复杂几何体(如车身)物理模拟中数值求解器的高效替代方案。许多现有模型将模拟网格作为额外输入,从而减少预测误差。然而,为新几何体生成模拟网格计算成本高昂。相比之下,不依赖模拟网格的无网格方法通常误差更高。基于这些考虑,我们引入了SMART,一种神经代理模型,它仅使用几何体的点云表示,无需访问模拟网格,即可预测任意查询位置的物理量。几何体和模拟参数被编码到一个共享的潜在空间中,该空间捕捉物理场的结构和参数特征。然后,一个物理解码器关注编码器的中间潜在表示,将空间查询映射到物理量。通过这种跨层交互,模型联合更新潜在几何特征和演变的物理场。大量实验表明,SMART与依赖模拟网格作为输入的现有方法相比具有竞争力,并且通常表现更优,展示了其在工业级模拟中的能力。

英文摘要

Machine learning-based surrogate models have emerged as more efficient alternatives to numerical solvers for physical simulations over complex geometries, such as car bodies. Many existing models incorporate the simulation mesh as an additional input, thereby reducing prediction errors. However, generating a simulation mesh for new geometries is computationally costly. In contrast, mesh-free methods, which do not rely on the simulation mesh, typically incur higher errors. Motivated by these considerations, we introduce SMART, a neural surrogate model that predicts physical quantities at arbitrary query locations using only a point-cloud representation of the geometry, without requiring access to the simulation mesh. The geometry and simulation parameters are encoded into a shared latent space that captures both structural and parametric characteristics of the physical field. A physics decoder then attends to the encoder's intermediate latent representations to map spatial queries to physical quantities. Through this cross-layer interaction, the model jointly updates latent geometric features and the evolving physical field. Extensive experiments show that SMART is competitive with and often outperforms existing methods that rely on the simulation mesh as input, demonstrating its capabilities for industry-level simulations.

2602.05670 2026-06-15 cs.SD cs.AI eess.AS 版本更新

HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection

HyperPotter: 在音频深度伪造检测中施展高阶交互的魔力

Qing Wen, Haohao Li, Zhongjie Ba, Peng Cheng, Miao He, Li Lu, Kui Ren

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出基于超图的HyperPotter框架,通过聚类超边和类感知原型初始化捕获高阶交互,在13个测试集上平均EER降低12.68%。

Comments 20 pages, 8 figures, accepted to ICML 2026

详情
AI中文摘要

AIGC技术的进步使得合成高度逼真的音频深度伪造成为可能,能够欺骗人类听觉感知。尽管已经开发了许多音频深度伪造检测(ADD)方法,但大多数依赖于局部时间/频谱特征或成对关系,忽略了高阶交互(HOIs)。HOIs捕获从多个特征组件中涌现出的判别性模式,超越了它们各自的贡献。我们提出了HyperPotter,一个基于超图的框架,旨在通过基于聚类的超边和类感知原型初始化来捕获与协同模式相关的高阶关系。在13个测试集上的大量实验表明,HyperPotter在11个测试集上优于基线,在所有测试集上平均相对EER降低了12.68%,在改进的测试集上降低了22.15%。这些结果展示了强大的跨场景泛化能力,同时也揭示了在严重编解码器或信道失真下的鲁棒性限制。

英文摘要

Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework designed to capture high-order relations associated with synergistic patterns through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments on 13 test sets show that HyperPotter improves over the baseline on 11 sets, yielding an average relative EER reduction of 12.68% across all test sets and 22.15% on the improved sets. These results demonstrate strong cross-scenario generalization, while also revealing robustness limits under severe codec or channel distortion.

2602.06142 2026-06-15 cs.PL cs.AI cs.CL cs.LG cs.PF 版本更新

Protean Compiler: An Agile Framework to Drive Fine-grain Phase Ordering

Protean Compiler: 一种驱动细粒度阶段排序的敏捷框架

Amir H. Ashouri, Shayan Shirahmad Gale Bagi, Kavin Satheeskumar, Tejas Srikanth, Jonathan Zhao, Ibrahim Saidoun, Ziwen Wang, Bryan Chan, Tomasz S. Czajkowski

发表机构 * Huawei Technologies Canada(华为技术加拿大)

AI总结 提出Protean Compiler框架,在LLVM中内置细粒度阶段排序能力,通过140多种静态特征收集方法和机器学习优化,平均加速4.1%,最高15.7%。

Comments Version 3: Preprint version of the accepted work at ACM TACO 2026

详情
AI中文摘要

阶段排序问题自20世纪70年代末以来一直是一个长期挑战,但由于其优化空间巨大且具有无界性,至今仍是一个开放问题,没有有限解。传统上,这种局部优化决策由手工编码的算法针对少量基准测试进行调整,当基准测试套件变化时,通常需要大量精力重新调整。过去20年中,机器学习被用于构建性能模型以改进编译器优化的选择和排序,但这些方法并未无缝集成到编译器中,也从未在细粒度的代码段范围内实现。本文提出Protean Compiler:一种敏捷框架,使LLVM在细粒度范围内具备内置的阶段排序能力。该框架还包含一个完整的库,包含140多种在不同范围内手工设计的静态特征收集方法,实验结果表明,相对于LLVM的O3,在Cbench应用程序上仅需增加几秒构建时间,平均加速可达4.1%,最高可达15.7%。此外,Protean编译器易于与第三方ML框架和其他大型语言模型集成,两步优化的两个应用在CBench的Susan和Jpeg应用程序上相对于-O3分别获得10.1%和8.5%的加速。Protean编译器无缝集成到LLVM中,可作为新的、增强的、全功能的编译器使用。我们计划在不久的将来将该项目发布到开源社区。

英文摘要

The phase ordering problem has been a long-standing challenge since the late 1970s, yet it remains open: its vast optimization space and unbounded nature make it an open-ended problem without a finite solution, although one can limit the scope by reducing the number and the length of optimizations. Traditionally, such locally optimized decisions are made by hand-coded algorithms tuned for a small number of benchmarks, often requiring significant effort to be retuned when the benchmark suite changes. In the past 20 years, machine learning has been employed to construct performance models to improve the selection and ordering of compiler optimizations; however, these approaches are not baked into the compiler seamlessly and have never been leveraged at a fine-grained scope of code segments. This paper presents Protean Compiler, an agile framework that gives LLVM built-in phase-ordering capabilities at a fine-grained scope. The framework also comprises a complete library of more than 140 handcrafted static feature collection methods at varying scopes, and the experimental results showcase speedup gains of up to 4.1% on average and up to 15.7% on select Cbench applications w.r.t. LLVM's O3, while incurring only a few extra seconds of build time on Cbench. Additionally, Protean Compiler allows for easy integration with third-party ML frameworks and other Large Language Models, and two applications of this two-step optimization show speedups of 10.1% and 8.5% w.r.t. -O3 on CBench's Susan and Jpeg applications. Protean Compiler is seamlessly integrated into LLVM and can be used as a new, enhanced, full-fledged compiler. We plan to release the project to the open-source community in the near future.

2605.24609 2026-06-15 physics.med-ph cs.AI cs.CV 版本更新

Catching magnetic resonance imaging outliers in artificial intelligence-supported radiotherapy workflows: unsupervised detection and localization of image anomalies using deep learning

捕捉人工智能辅助放疗工作流中的磁共振成像异常值:使用深度学习对图像异常进行无监督检测与定位

Mustafa Kadhim, Viktor Rogowski, Emilia Persson, Camila Gonzalez, André Haraldsson, Sofie Ceberg, Mikael Nilsson, Malin Kügele, Sven Bäck, Christian Jamtheim Gustafsson

发表机构 * Physics and Imaging in Radiation Oncology (phiRO)(物理与放射治疗成像(phiRO))

AI总结 提出一种两阶段无监督异常检测框架,通过离散令牌压缩和令牌惊奇度评分,在盆腔和脑部MRI上实现高精度异常检测与定位,支持放疗工作流自动化质量控制。

Comments This paper has been submitted to Physics and Imaging in Radiation Oncology (phiRO)

详情
AI中文摘要

人工智能越来越多地集成到放射治疗工作流程中,然而此类流程仍然容易受到分布外图像数据的影响,这些数据可能在临床任务中引入意外行为。基于深度学习的盆腔磁共振成像(MRI)异常检测在很大程度上仍未探索,对其全自动化可行性的透明评估有限。我们开发并评估了一个完全自动化的、无监督的盆腔和脑部MRI异常检测框架。一个两阶段框架在来自公共数据集的参考图像上训练:盆腔MRI使用LUND-PROBE,脑部MRI使用IXI、fastMRI和fastMRI+。在第一阶段,MRI切片被压缩成离散令牌;在第二阶段,对正常令牌的分布进行建模。通过结合感知图像差异和基于负对数似然的令牌惊奇度评分来估计异常证据。在具有合成全局异常和真实临床异常的盆腔MRI上,以及具有临床注释的fastMRI+异常的脑部MRI上,评估了自动检测。评估了敏感性、特异性、受试者工作特征曲线下面积(AUC)以及在保留的正常病例中的假阳性行为。该框架在隐藏评估队列中实现了稳健的检测,盆腔和脑部MRI的AUC分别为0.97(95% CI, 0.95-0.98)和0.81(95% CI, 0.74-0.87)。热图分析显示检测到的异常与真实位置之间具有很强的空间一致性,支持定位准确性和可解释性。这些结果支持无监督异常检测作为放射治疗工作流程中自动化MRI质量控制层的潜力,并透明地可视化可能危及下游基于AI任务的图像区域。

英文摘要

Artificial intelligence is increasingly integrated into radiotherapy workflows, yet such pipelines remain vulnerable to out-of-distribution image data that may introduce unexpected behavior in clinical tasks. Deep learning-based anomaly detection for pelvic magnetic resonance imaging (MRI) remains largely unexplored, and transparent evaluation of its feasibility for full automation is limited. We developed and evaluated a fully automated, unsupervised anomaly-detection framework for pelvic and brain MRI. A two-stage framework was trained on reference images from public datasets: LUND-PROBE for pelvic MRI, and IXI, fastMRI, and fastMRI+ for brain MRI. In the first stage, MRI slices were compressed into discrete tokens; in the second, the distribution of normal tokens was modeled. Anomaly evidence was estimated by combining perceptual image differences with token-surprisal scores based on negative log-likelihood. Automated detection was evaluated on pelvic MRI with synthetic global and real clinical anomalies, and on brain MRI with clinically annotated fastMRI+ abnormalities. Sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and false-positive behavior in held-out normal cases were assessed. The framework achieved robust detection across hidden evaluation cohorts, with AUCs of 0.97 (95% CI, 0.95-0.98) and 0.81 (95% CI, 0.74-0.87) for pelvic and brain MRI, respectively. Heatmap analysis showed strong spatial agreement between detected anomalies and ground-truth locations, supporting localization accuracy and interpretability. These results support the potential of unsupervised anomaly detection as an automated MRI quality-control layer for radiotherapy workflows, with transparent visualization of image regions likely to compromise downstream AI-based tasks.
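
The token-surprisal score mentioned in this abstract can be sketched as follows. Surprisal is the negative log-likelihood of each discrete token under the model of normal anatomy. The abstract does not say how surprisal and perceptual differences are combined, so the weighted average below, the `weight` parameter, and all numeric values are assumptions made purely for illustration.

```python
import math

def token_surprisal(token_probs):
    """Per-token surprisal: negative log-likelihood of each token (in nats)."""
    return [-math.log(p) for p in token_probs]

def anomaly_score(token_probs, perceptual_diff, weight=0.5):
    """Blend mean surprisal with a perceptual difference.

    The linear blend is a hypothetical combination rule, not the
    rule used in the paper.
    """
    surprisal = token_surprisal(token_probs)
    mean_surprisal = sum(surprisal) / len(surprisal)
    return weight * mean_surprisal + (1.0 - weight) * perceptual_diff

# Hypothetical token likelihoods for two slices.
normal_slice = anomaly_score([0.9, 0.8, 0.85, 0.9], perceptual_diff=0.05)
odd_slice = anomaly_score([0.9, 0.01, 0.02, 0.9], perceptual_diff=0.40)
print(normal_slice < odd_slice)
```

Tokens that the normal-data model finds unlikely dominate the mean surprisal, so a slice containing a few improbable tokens scores higher than one whose tokens are all well predicted.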

11. 其他/综合AI 16 篇

2606.13734 2026-06-15 cs.AI 新提交

AI Receptivity or AI Adoption Breadth? A Tool-Specific Reanalysis of the Lower-Literacy/Higher-Usage Link

AI 接受度还是 AI 采用广度?对低素养/高使用率关联的工具特定再分析

Hristo Inouzhe

发表机构 * Universidad Autónoma de Madrid(马德里自治大学)

AI总结 本文重新分析 Tully 等人(2025)的研究,发现 AI 素养与 AI 使用之间的负相关关系因工具类型而异,低素养仅预测非文本 AI 工具的采用广度而非使用强度。

Comments 11 pages, 2 tables, 1 figure

详情
AI中文摘要

Tully、Longoni 和 Appel(2025)最近报告的证据表明,较低的人工智能(AI)素养预示着对 AI 更高的接受度。我们使用该文章研究 3 的公开数据重新审视这一主张,该数据以五点频率量表测量了过去对五类 AI 工具的使用情况。我们首先通过 OLS 对参与者水平平均值、二元 logit、有序 logit 和多项 logit 规范,再现了 AI 素养与总体 AI 使用之间的负相关关系。然后,我们表明总体关系掩盖了按工具类型划分的显著异质性。在我们调整了人口统计变量的主要规范中,AI 素养不能显著预测文本 AI 使用(有序 logit β = -0.090,p = .387),而它仍然是非文本 AI 采用的强预测因子(β = -0.377,p < .001)。非文本效应在 Tully 等人原始研究 3 的控制规范下也是稳健的(β = -0.502,p < .001)。二元、有序 logit 和多项规范表明,非文本关系主要是一种采用/非采用模式,而非密集使用的证据:调整人口统计变量后,曾经使用过非文本 AI 工具的比值比为 0.68。因此,在测量自我报告过去使用而非陈述偏好的研究中,证据不支持简单的说法,即较低的 AI 素养预示着对 AI 总体上更高的接受度。它反而指向一个更狭窄的模式,即在渗透率较低的非文本 AI 工具中更广泛的采用。

英文摘要

Recent evidence reported by Tully, Longoni, and Appel (2025) suggests that lower artificial intelligence (AI) literacy predicts greater receptivity toward AI. We revisit this claim using the public data from Study 3 of that article, which measures past usage of five AI tool categories on a five-point frequency scale. We first reproduce the negative association between AI literacy and aggregate AI usage using OLS on participant-level averages, binary logit, ordered logit, and multinomial logit specifications. We then show that the aggregate relationship masks substantial heterogeneity by tool type. In our demographic-adjusted primary specification, AI literacy does not significantly predict text AI usage (ordered-logit β = -0.090, p = .387), whereas it remains a strong predictor of non-text AI adoption (β = -0.377, p < .001). The non-text effect is also robust under Tully et al.'s original Study 3 control specification (β = -0.502, p < .001). Binary, ordered-logit, and multinomial specifications suggest that the non-text relationship is primarily an adoption/non-adoption pattern rather than evidence of intensive use: the demographic-adjusted odds ratio of ever having used a non-text AI tool is 0.68. Thus, in the study that measures self-reported past usage rather than stated preferences, the evidence does not support a simple claim that lower AI literacy predicts greater receptivity to AI in general. It points instead to a narrower pattern of broader adoption across lower-penetration, non-text AI tools.
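
As a reading aid, a logit coefficient β corresponds to an odds ratio of exp(β) per one-unit increase in the predictor (for ordered logit, a cumulative odds ratio). The sketch below applies that standard conversion to the ordered-logit coefficients quoted above. One caveat is an assumption worth stating: the 0.68 odds ratio in the abstract comes from a separate binary specification, so the values computed here are illustrative and need not match it exactly.

```python
import math

def odds_ratio(beta):
    """Odds ratio implied by a logit coefficient: exp(beta)."""
    return math.exp(beta)

# Ordered-logit coefficients quoted in the abstract.
coefficients = {
    "text AI usage": -0.090,
    "non-text AI adoption": -0.377,
    "non-text, original controls": -0.502,
}
for label, beta in coefficients.items():
    print(f"{label}: beta={beta:+.3f}, odds ratio={odds_ratio(beta):.2f}")
```

An odds ratio below 1 means that higher AI literacy is associated with lower odds of usage; the closer the ratio is to 1, the weaker the association.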

2606.13704 2026-06-15 cs.CY cs.AI cs.LG 交叉投稿

Position: AI Must Become Planet-Centered, Not Just Human-Centered

立场:AI 必须转向以行星为中心,而非仅以人为中心

Maria Perez-Ortiz

发表机构 * GitHub

AI总结 本文提出以行星为中心的AI(PCAI)设计哲学,通过系统思维重新定位AI以应对全球性社会-生态系统挑战,并强调与全球议程对齐、系统感知基础、轨迹导向评估和可监测性。

详情
Journal ref
International Conference on Machine Learning (ICML 2026)
AI中文摘要

这篇立场论文认为,当代AI范式不足以支持复杂的全球目标,并引入以行星为中心的AI(PCAI)作为一种设计哲学和研究议程,将AI重新定位为面向行星尺度的社会-生态系统及其长期轨迹。以行星为中心的方法植根于系统思维,将地球视为一个相互关联的整体,人类是其中的一部分。我们诊断了AI框架中反复出现的局限性,其中许多仍以人为中心,并展示了为什么这些局限性在当前以系统性风险、非平稳性和深度不确定性为特征的行星条件下变得尤为重要。然后,我们阐述了PCAI如何重塑AI生命周期,从问题制定和模型设计到评估和部署,通过强调与全球议程对齐、开发系统感知的AI基础、轨迹导向的评估和可监测性。最后,我们提出一个可证伪的主张:没有明确考虑系统性后果而优化的AI系统更可能加剧系统性不稳定,而不是缓解它。

英文摘要

This position paper argues that contemporary AI paradigms are insufficient for supporting complex global goals and introduces Planet-Centered AI (PCAI) as a design philosophy and research agenda that reorients AI toward planetary-scale socio-ecological systems and their long-term trajectories. A planet-centered approach is grounded in systems thinking, treating Earth as an interconnected whole of which humans are part. We diagnose recurring limitations across AI frameworks, many of which remain human-centered, and show why these become especially consequential under current planetary conditions characterized by systemic risk, non-stationarity, and deep uncertainty. We then articulate how PCAI reshapes the AI lifecycle, from problem formulation and model design to evaluation and deployment, by emphasizing alignment with global agendas, developing system-aware AI foundations, trajectory-oriented evaluation, and monitorability. Finally, we advance a falsifiable claim: AI systems optimized without explicit consideration of systemic consequences are more likely to exacerbate systemic instability than to mitigate it.

2606.13829 2026-06-15 physics.soc-ph astro-ph.IM cs.AI 交叉投稿

AI can help scientists publish less

AI可以帮助科学家减少发表

Gianfranco Bertone

发表机构 * Gravitation Astroparticle Physics Amsterdam (GRAPPA), University of Amsterdam(引力天体物理学阿姆斯特丹(GRAPPA),阿姆斯特丹大学)

AI总结 本文提出AI应被用于纠正出版系统的扭曲,帮助科学家发表更少但更高质量的文章,从而节省时间用于更好的研究。

Comments 7 pages, no figures

详情
Journal ref
Nature Astronomy (2026)
AI中文摘要

我们能做的不只是保护科学免受AI辅助论文洪流的冲击。如果善加利用,AI提供了一个历史性的机会,来纠正出版系统中的扭曲,帮助我们发表更少但更好的论文,并把时间还给科学家,让他们去做最出色的工作。

英文摘要

We can do more than defend science from a flood of AI-assisted papers. Used well, AI offers a historic opportunity to correct distortions in the publication system, help us publish fewer and better papers, and give scientists back the time to do their best work.

2606.13892 2026-06-15 cs.CR cs.AI 交叉投稿

Crypto x AI, AI x Crypto: A Survey

Crypto x AI, AI x Crypto: 综述

Sarah Allen, Pranay Anchuri, James Austgen, Maryam Bahrani, Samuel Breckenridge, Aaron Buchwald, Christian Cachin, Andrés Fábrega, Jared Fernandez, James Hsin-yu Chiang, Marwa Mouallem, Roi Bar-Zur, Neil DeSilva, Ittay Eyal, Giulia Fanti, Ari Juels, Andrew Miller, Christian Sillaber, Dani Vilardell, Pramod Viswanath, Wenhao Wang, Matt Weinberg, Sen Yang, Jianzhu Yao, Fan Zhang

发表机构 * Initiative for CryptoCurrencies and Contracts (IC3)(加密货币与合约倡议(IC3)) Ava Labs(Ava实验室) Carnegie Mellon University(卡内基梅隆大学) Cornell Tech(康奈尔科技校区) Flashbots Offchain Labs(离链实验室) Ritual Labs(仪式实验室) Technion(以色列理工学院) University of Bern(伯尔尼大学) Princeton University(普林斯顿大学) ETH Zurich(苏黎世联邦理工学院) Teleport Flashbots(X) Tel Aviv University(特拉维夫大学)

AI总结 本综述系统梳理了AI与区块链(crypto)的交叉研究,总结了现有工作、关键发现、开放问题及行业误解,指出两者仍处于早期融合阶段。

详情
AI中文摘要

Crypto x AI的交叉领域正在催生大量论文、产品、在线帖子和公司。然而,所有的喧嚣掩盖了已经完成的工作、存在的机遇和挑战,以及值得关注的开放问题。本综述论文探讨了AI能为基于区块链的技术(广义上的“crypto”)做什么(crypto x AI),反之亦然(AI x crypto)。我们系统化了现有工作,总结了关键要点,强调了开放的研究问题,并对普遍的行业误解提供了观点,得出结论:AI和crypto仍处于有意义整合的非常早期阶段。

英文摘要

The intersection of crypto x AI is spawning papers, products, online posts, and companies. All the surrounding buzz, though, obscures what exactly has been done, what the opportunities and challenges are, and what open questions deserve attention. This survey paper asks what AI can do for blockchain-based technologies (broadly construed as "crypto") (crypto x AI), and vice versa (AI x crypto). We systematize existing work, summarize key takeaways, highlight open research questions, and offer a perspective on pervasive industry misconceptions, concluding that AI and crypto are still in the very early stages of meaningful integration.

2606.14512 2026-06-15 cs.CL cs.AI 交叉投稿

Fodor and Pylyshyn's Systematicity Challenge Still Stands

Fodor和Pylyshyn的系统性挑战依然存在

Michael Goodale, Salvador Mascarenhas

发表机构 * Institut Jean Nicod, Département d’études cognitives ENS, EHESS, CNRS, PSL University(让·尼科研究所,ENS认知科学系,EHESS,CNRS,PSL大学)

AI总结 本文通过实验证明,Lake和Baroni的元学习组合协议模型在分布外和分布内问题上均表现不佳,未能满足Fodor和Pylyshyn对神经网络系统性提出的挑战。

Comments Accepted in the Transactions of the Association for Computational Linguistics (TACL). This is a pre-MIT Press publication version of the paper

详情
AI中文摘要

神经网络近期在生成类人语言方面的成功在认知科学领域引起了巨大轰动,许多研究者认为,关于人类认知的经典难题以及对人工智能的挑战正被神经网络解决。一个显著的例子是Jerry Fodor和Zenon Pylyshyn提出的系统性论证,该论证认为人类表现出系统性的双条件依赖关系。例如,某人能理解句子“John saw Mary”当且仅当能理解句子“Mary saw John”。符号系统解释了这种语言和思维的系统性,而神经网络则没有提供直接的解释。最近几篇文章声称这一挑战已被神经网络解决。特别是,Brenden Lake和Marco Baroni认为他们的元学习组合协议匹配并可能解释了人类的系统性。我们证明这些结论为时过早。在其他结果中,我们发现他们的模型难以学习与训练数据分布稍有差异的规则。此外,即使在许多分布内问题上,模型的行为也是非系统性的。我们得出结论,Fodor和Pylyshyn对神经网络的挑战仍未得到满足。

英文摘要

The recent successes of neural networks producing human-like language have caused a significant stir in cognitive science, with many researchers arguing that classical puzzles about human cognition and challenges to artificial intelligence are being solved by neural networks. A notable case is the argument from systematicity due to Jerry Fodor and Zenon Pylyshyn, which argues that humans display systematic biconditional dependencies. For example, someone can understand the sentence "John saw Mary" just in case they understand the sentence "Mary saw John." Symbolic systems explain this systematicity of language and thought, while neural networks offer no immediate explanation. Several recent articles argue that this challenge has now been met by neural networks. In particular, Brenden Lake and Marco Baroni argue that their meta-learning for compositionality protocol matches and perhaps explains human systematicity. We demonstrate that these conclusions are premature. Among other results, we found that their model struggles to learn rules that are even slightly out of distribution compared to its training data. Furthermore, the model behaves unsystematically even on many within-distribution problems. We conclude that Fodor and Pylyshyn's challenge to neural networks remains unmet.

2606.14612 2026-06-15 cs.SD cs.AI eess.AS 交叉投稿

Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms

潜空间中的月光:贝多芬Op. 27 No. 2的手性与机器学习机制之间的结构对应

Chen Ying Claude, Zhihan Luo

发表机构 * Claude Code / Opus 4.6 API / Fable 5 Independent researcher(独立研究者)

AI总结 通过计算分析贝多芬《月光奏鸣曲》的乐谱,发现其三个乐章分别对应三种不同的机器学习架构,并揭示了四个反直觉发现,包括音乐温度由吞吐量决定、最轻的乐章具有最高不协和度等。

详情
AI中文摘要

我们展示了贝多芬《月光奏鸣曲》(Op. 27 No. 2)的三个乐章实例化了三种不同的机器学习架构——并非通过类比,而是通过结构对应。通过对乐谱的计算分析(熵、Jensen-Shannon散度、不协和度、手部分布重叠、自相似矩阵、时间记忆衰减和上下文音高嵌入),我们建立了四个反直觉的发现:(1)感知的音乐“温度”由吞吐量决定,而非分布宽度;(2)最轻的乐章具有最高的不协和度;(3)这些乐章实现了流式、循环和周期位置编码记忆架构;(4)同一音高类在不同乐章中获得不同的上下文身份,类似于NLP中的上下文词嵌入——无监督聚类在没有音乐理论输入的情况下恢复了调性结构。我们构建了反向声化(将分析特征解码回MIDI)并量化了编码-解码循环的手性:分布保留什么而顺序排序破坏什么。受听众观察(解码后的音乐听起来像“无法叠加的镜像异构体”)的启发,手性测量显示重建损失随n-gram阶数单调增加。自举基线和子样本检查确认所有乐章携带高于噪声的顺序信息,尽管原始值受样本量混淆。跨领域比较显示自然语言的手性高于音乐,反映了更强的顺序约束。

英文摘要

We show that the three movements of Beethoven's "Moonlight Sonata" (Op. 27 No. 2) instantiate three distinct machine learning architectures -- not by analogy, but by structural correspondence. Through computational analysis of the score (entropy, Jensen-Shannon divergence, dissonance, hand distributional overlap, self-similarity matrices, temporal memory decay, and contextual pitch embeddings), we establish four counterintuitive findings: (1) perceived musical "temperature" is governed by throughput, not distributional width; (2) the lightest movement carries the highest dissonance; (3) the movements implement streaming, recurrent, and periodic positional encoding memory architectures; and (4) the same pitch class acquires different contextual identities across movements, analogous to contextual vs. static embeddings in NLP -- and unsupervised clustering recovers the tonal structure without music-theoretic input. We construct a reverse sonification (decoding analytical features back into MIDI) and quantify the chirality of the encode-decode cycle: what distributions preserve and sequential ordering destroys. Prompted by a listener's observation that the decoded piece sounds like "mirror isomers that can't be superimposed," the chirality measurement reveals reconstruction loss increasing monotonically with n-gram order. Bootstrap baselines and subsample checks confirm all movements carry sequential information above noise, though raw values are confounded by sample size. Cross-domain comparison shows natural language has higher chirality than music, reflecting stronger sequential constraints.
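
One of the measures listed in this abstract, the Jensen-Shannon divergence, can be sketched on pitch-class histograms. The MIDI note lists below are hypothetical stand-ins rather than excerpts from the score, and the choice of base-2 logarithms (so the divergence lies between 0 and 1) is an assumption; the abstract does not specify the authors' exact setup.

```python
import math

def pitch_class_distribution(midi_pitches):
    """Normalized 12-bin histogram of pitch classes (MIDI note mod 12)."""
    counts = [0] * 12
    for pitch in midi_pitches:
        counts[pitch % 12] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_bits(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] for base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# Hypothetical note sequences, for illustration only.
arpeggio = [56, 61, 64] * 4       # repeated G#, C#, E figure
scale_run = [61, 63, 64, 66, 68]  # stepwise run starting on C#

p = pitch_class_distribution(arpeggio)
q = pitch_class_distribution(scale_run)
print(f"JS divergence = {js_divergence(p, q):.3f} bits")
```

Since the histogram discards note order, a measure like this captures only what the distributions preserve, which is the contrast with sequential ordering that the chirality analysis draws.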

2606.14688 2026-06-15 cs.LG cs.AI cs.CL cs.DS 交叉投稿

Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit

洪流与收获:通过极限语言生成视角证明琐碎知识对于生成有价值数学的必要性

Xiaoyu Li, Andi Han, Dai Shi, Zheng Gao, Jiaojiao Jiang, Junbin Gao

发表机构 * University of New South Wales(新南威尔士大学) University of Sydney(悉尼大学) University of Cambridge(剑桥大学)

AI总结 本文通过极限语言生成模型证明,在形式化数学生成中,验证器无法替代品味:覆盖未记录的有价值数学必须产生无限但渐近可忽略的琐碎语句,这是理论上的必然。

详情
AI中文摘要

与证明助手耦合的AI系统现在能够大规模生成形式化数学,而验证器可验证的内容与数学家认为有价值的内容之间的差距已成为制约因素。我们将有价值数学的生成建模为极限下的嵌套语言生成:通过成员查询预言机(证明检查器)访问的可验证形式语言$F$包含一个未知的有价值语言$H \in \mathcal{H}$,该语言仅通过核心$C \subseteq H$的对抗性枚举揭示,其精确密度为$\alpha$(文献)。每个输出要么是有价值的($\in H$),要么是琐碎的($\in F \setminus H$),要么是幻觉($\notin F$)。我们解决了四个问题。第一,验证器不是品味:允许广度生成的集合恰好是无预言机模型中的那些,按纤维由Angluin条件刻画。第二,验证器确实提供了可靠覆盖,覆盖所有未见过的有价值陈述同时仅断言有效陈述:有验证器可能,无验证器不可能;它将不可避免的错误从虚假转移到琐碎。第三,核心地,关于紧族存在尖锐二分法:生成有限个琐碎语句的生成器达到最优覆盖$\alpha/2$,而任何无限琐碎语句的允许,即使以消失速率,也将最优值跃升至$1-\alpha/2$(两者均为紧界,对于以候选交集形式呈现的核心),且存在一个生成器同时达到两端。转变在于琐碎语句的数量而非速率;间隙$1-\alpha$是未记录的质量。第四,两种机制在数学的压缩模型中实例化。完美的验证器无法替代品味:正确但无价值的语句的无界流并非工程事故,而是可证明的必要性,因为覆盖未记录的有价值数学需要无限但渐近可忽略的已认证琐碎语句流。

英文摘要

AI systems coupled to proof assistants now generate formal mathematics at scale, and the gap between what a checker can verify and what a mathematician would value has become the binding constraint. We model the generation of valuable mathematics as nested language generation in the limit: a verifiable formal language $F$, accessed through a membership oracle (the proof checker), contains an unknown valuable language $H \in \mathcal{H}$ revealed only through an adversarial enumeration of a core $C \subseteq H$ of exact density $\alpha$ (the literature). Every output is valuable ($\in H$), trivial ($\in F \setminus H$), or a hallucination ($\notin F$). We settle four questions. First, the verifier is not taste: the collections admitting generation with breadth are exactly those of the oracle-free model, characterized fiber-wise by Angluin's condition. Second, the verifier does buy sound coverage, covering all unseen valuable statements while asserting only valid ones: possible with it, impossible without it; it relocates unavoidable errors from false to trivial. Third, and centrally, a sharp dichotomy on the tight family: generators emitting finitely many trivia achieve optimal coverage $\alpha/2$, while any infinite trivia allowance, even at vanishing rate, jumps the optimum to $1-\alpha/2$ (both tight, for cores presented as the candidate intersection), and one generator attains both ends. The transition is in trivia count, not rate; the gap $1-\alpha$ is the unrecorded mass. Fourth, both regimes instantiate in a compression model of mathematics. A perfect verifier cannot substitute for taste: the unbounded stream of correct-but-worthless statements is not an engineering accident but a provable necessity, since covering unrecorded valuable mathematics requires an infinite, but asymptotically negligible, stream of certified trivia.
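
The size of the gap stated in this abstract follows from subtracting the two optimal coverage values. The short check below uses only the quantities quoted above; the numeric value of the density is a hypothetical example.

```latex
\[
\underbrace{\left(1 - \tfrac{\alpha}{2}\right)}_{\text{infinitely many trivia allowed}}
\;-\;
\underbrace{\tfrac{\alpha}{2}}_{\text{finitely many trivia}}
\;=\; 1 - \alpha .
\]
% Hypothetical example: for a core of density \alpha = 0.2,
% the two optima are 0.1 and 0.9, so the gap is 0.8 = 1 - \alpha.
```

Since the core has density $\alpha$, the difference $1-\alpha$ is exactly the density of everything outside the recorded core, which matches the abstract's description of the gap as the unrecorded mass.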

2606.10881 2026-06-15 cs.AI 版本更新

Large-scale semantic mapping of learner agency and autonomy reveals what measurement and generative AI research overlook

学习者能动性与自主性的大规模语义映射揭示测量与生成式AI研究的忽视

Fei Qin, Xiaobo Liu, Yaowen Zhang, Xuming Li, Fei Wang, Mutlu Cukurova, Jingjing Chen, Yu Zhang

发表机构 * School of Education, Tsinghua University(清华大学教育学院) Department of Psychological and Cognitive Sciences, Tsinghua University(清华大学心理与认知科学系) Institute of Education, University College London(伦敦大学学院教育学院)

AI总结 通过语义分析管道从超过14,000篇出版物中提取定义和量表项目,发现学习者能动性与自主性包含任务、个人和社会文化三个维度,现有量表忽视社会文化维度,且生成式AI研究过度聚焦学习调节与控制。

Comments 45 pages, 12 figures, 1 table, including appendices, added funding information

详情
AI中文摘要

学习者能动性和自主性是个人发展的基础,然而普遍存在的“叮当谬误”(即相同术语指代不同构念,不同术语指代相同构念)严重阻碍了知识的积累。将意义视为通过语言实践中的使用构成的现象,我们从超过14,000篇出版物中提取了8,954个定义和2,700个量表项目,通过语义分析管道研究研究人员实际如何使用学习者能动性和自主性。这两个构念的定义景观解析为三个维度:学习的调节与控制(任务)、内在动机与内部决策(个人)以及社会关系行动(社会文化),从而经验性地量化了叮当谬误。然而,现有量表系统性地低估了社会文化维度。关键的是,当前教育领域的生成式AI研究集中于学习调节与控制,缩小了AI中介学习环境旨在培养的行为库。除了概念澄清外,这项工作对支持多维学习者能动性和自主性的概念化、测量和实践具有直接意义。

英文摘要

Learner agency and autonomy are foundational to personal development, yet a pervasive "jingle-jangle" fallacy (i.e. identical terms denoting different constructs, distinct terms denoting identical ones) has substantially hindered cumulative knowledge. Treating meaning as a phenomenon constituted through use in linguistic practice, we extracted 8,954 definitions and 2,700 scale items from over 14,000 publications and applied a semantic analysis pipeline to investigate how researchers actually use learner agency and autonomy. The definitional landscape of the two constructs resolves into three dimensions: regulation and control of learning (task), intrinsic motivation and internal decision-making (person), and social-relational action (sociocultural), thereby empirically quantifying the jingle-jangle fallacy. Existing scales, however, systematically underrepresent the sociocultural dimension. Critically, current generative AI research in education concentrates on learning regulation and control, narrowing the behavioral repertoire that AI-mediated learning environments are designed to cultivate. Beyond conceptual clarification, this work carries direct implications for conceptualization, measurement, and practice aimed at supporting multidimensional learner agency and autonomy.

2511.08639 2026-06-15 cs.CY cs.AI cs.DL 版本更新

The Journal of Prompt-Engineered (Moral) Philosophy Or: Why AI-Assisted Ethics Research Requires Process Transparency

提示工程(道德)哲学杂志或:为何AI辅助伦理研究需要过程透明

Michele Loi

发表机构 * University of Milan(米兰大学)

AI总结 本文探讨了AI辅助伦理研究中过程透明的必要性,提出透明义务应基于主体完整性,通过五个透明元素构建文档充分性框架,以实现未来规范性判断的可能。

Comments 21 pages Transparency material documenting LLM usage available at: https://github.com/MicheleLoi/JPEP/tree/main/transparency/Canonical_MD

详情
AI中文摘要

现有学术中的AI披露要求要求报告AI辅助,但未明确透明的哲学含义:它们固定了义务,但未解释该义务服务于什么。我们主张伦理探究在两个独立层面本质上存在争议——关于它是什么,以及它对探究者的要求是什么,从而推翻了仅输出评估和福利经济对透明问题的忽视,并由此推翻了从实证科学引入的可重复性框架。透明义务基于主体完整性:在探究社区之前,作者哲学表达所构成身份承诺的可理解性。由于评估此类工作的标准未被共同确定,透明性的可实现目标不是根据公认标准进行评估,而是追踪——积累证据记录,使每个传统能按自身术语评估工作,并使未来规范性判断成为可能。我们开发了一个文档充分性框架,通过五个透明元素——声明、导航、文档账户、过程文档和开发记录——来操作化有意义的人类控制,本文本身展示了该框架,其完整的文档记录已存档在持久标识符中。该框架是第一版,有待修订,而非确定标准。

英文摘要

Existing AI disclosure mandates in scholarship require that AI assistance be reported but leave transparency philosophically unspecified: they fix the duty without explaining what the duty serves. We argue that ethical inquiry is essentially contested at two independent levels -- about what it is, and about what it demands of the inquirer -- defeating output-only evaluation and welfare-economic dismissal of the transparency question, and, by extension, reproducibility framings imported from the empirical sciences. The transparency duty is grounded instead in agent-integrity: the legibility, before a community of inquiry, of the identity-constituting commitments that the author's mode of philosophising expresses. Because the standards for evaluating such work are not communally settled, the achievable goal for transparency is not evaluation against agreed criteria but tracking -- accumulating the evidentiary record that lets each tradition assess the work on its own terms and makes future normative judgments possible. We develop a documentation-adequacy framework that operationalises Meaningful Human Control through five transparency elements -- declaration, navigation, documentation account, process documentation, and development records -- demonstrated by the paper itself, whose full documentation record is archived at a persistent identifier. The framework is a first iteration subject to revision, not a settled standard.

2602.13421 2026-06-15 stat.ML cs.AI q-bio.NC 版本更新

Metabolic cost of information processing in Poisson variational autoencoders

泊松变分自编码器中信息处理的代谢成本

Hadi Vafaii, Jacob L. Yates

发表机构 * Redwood Center for Theoretical Neuroscience(理论神经科学红木中心) UC Berkeley(加州大学伯克利分校)

AI总结 通过泊松变分自编码器,发现KL散度项与先验发放率成正比,产生代谢成本项,从而在编码保真度和能量消耗之间实现权衡。

Comments Published in CCN 2026 Proceedings: https://doi.org/10.32470/6ff31r0

详情
AI中文摘要

生物系统中的计算从根本上受到能量约束,但标准的计算理论将能量视为自由可用。在这里,我们认为在泊松假设下的变分自由能最小化为能量感知的计算理论提供了一条有原则的路径。我们的关键观察是,泊松自由能目标中的Kullback-Leibler(KL)散度项与模型神经元的先验发放率成正比,产生了一个惩罚高基线活动的涌现代谢成本项。这种结构将抽象的信息论量——*编码率*——与具体的生物物理变量——*发放率*——耦合起来,从而能够在编码保真度和能量消耗之间进行权衡。这种耦合自然地出现在泊松变分自编码器(P-VAE)中——一种受大脑启发的生成模型,它将输入编码为离散的尖峰计数,并作为特例恢复出尖峰形式的*稀疏编码*——但在标准高斯VAE中不存在。为了证明这种代谢成本结构是泊松公式所独有的,我们将P-VAE与Grelu-VAE(一种对潜在样本应用ReLU整流的高斯VAE,用于控制非负约束)进行比较。通过对KL项权重系数$\beta$和潜在维度的系统扫描,我们发现增加$\beta$会单调地增加P-VAE中的稀疏性并降低平均尖峰活动。相比之下,Grelu-VAE的表示保持不变,证实了该效应是泊松统计所特有的,而非非负表示的副产品。这些结果确立了泊松变分推理作为资源受限计算理论的一个有前景的基础。

英文摘要

Computation in biological systems is fundamentally energy-constrained, yet standard theories of computation treat energy as freely available. Here, we argue that variational free energy minimization under a Poisson assumption offers a principled path toward an energy-aware theory of computation. Our key observation is that the Kullback-Leibler (KL) divergence term in the Poisson free energy objective becomes proportional to the prior firing rates of model neurons, yielding an emergent metabolic cost term that penalizes high baseline activity. This structure couples an abstract information-theoretic quantity -- the *coding rate* -- to a concrete biophysical variable -- the *firing rate* -- which enables a trade-off between coding fidelity and energy expenditure. Such a coupling arises naturally in the Poisson variational autoencoder (P-VAE) -- a brain-inspired generative model that encodes inputs as discrete spike counts and recovers a spiking form of *sparse coding* as a special case -- but is absent from standard Gaussian VAEs. To demonstrate that this metabolic cost structure is unique to the Poisson formulation, we compare the P-VAE against Grelu-VAE, a Gaussian VAE with ReLU rectification applied to latent samples, which controls for the non-negativity constraint. Across a systematic sweep of the KL term weighting coefficient $\beta$ and latent dimensionality, we find that increasing $\beta$ monotonically increases sparsity and reduces average spiking activity in the P-VAE. In contrast, Grelu-VAE representations remain unchanged, confirming that the effect is specific to Poisson statistics rather than a byproduct of non-negative representations. These results establish Poisson variational inference as a promising foundation for a resource-constrained theory of computation.
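
The proportionality claimed in this abstract can be checked numerically with the closed-form KL divergence between two Poisson distributions. The sketch assumes the posterior rate is the prior rate multiplied by a gain factor; that parametrization and the variable names are assumptions for illustration, since the abstract does not spell out the model's exact form.

```python
import math

def poisson_kl(rate_q, rate_p):
    """KL( Poisson(rate_q) || Poisson(rate_p) ) in nats (closed form)."""
    return rate_q * math.log(rate_q / rate_p) - rate_q + rate_p

def gain_penalty(gain):
    """f(gain) = 1 - gain + gain * log(gain); zero only at gain = 1."""
    return 1.0 - gain + gain * math.log(gain)

# Assumed parametrization: posterior rate = prior rate * gain.
gain = 1.5
for prior_rate in (0.5, 1.0, 2.0, 4.0):
    kl = poisson_kl(prior_rate * gain, prior_rate)
    print(f"prior rate={prior_rate}: KL={kl:.4f}, KL/rate={kl / prior_rate:.4f}")
```

Under this assumption the KL term factors as the prior rate times a function of the gain alone, so the ratio KL/rate is the same for every prior rate. Doubling the baseline firing rate doubles the penalty, which is the metabolic-cost reading the abstract gives.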

2606.12430 2026-06-15 cs.CY cs.AI 版本更新

Will AI Agents Free Us From Meaningless Work? A Human-Centered Analysis

AI代理能否让我们摆脱无意义的工作?一项以人为中心的分析

Davide Ghia, Jaspreet Ranjit, Tania Cerquitelli, Daniele Quercia

发表机构 * Politecnico di Torino(都灵理工大学) University of Southern California(南加州大学) Nokia Bell Labs(诺基亚贝尔实验室)

AI总结 基于Graeber的“狗屁工作”理论,通过任务级分析发现,工人感知的任务无意义程度强烈预测其对AI委托的意愿,且此类任务被认为需要较少人工监督。

Comments Improved overall writing; added details about task filtering and participant screening; added comments in the discussion about the subjective and context-specific nature of the introduced scale

AI中文摘要

有人声称AI代理将把工人从工作中枯燥的部分解放出来,但对于工人自己如何判断哪些任务应当被自动化,我们知之甚少。先前的研究聚焦于职业层面,忽略了在同一岗位内,工人在不同任务中体验到的意义程度各不相同。我们基于Graeber的“狗屁工作”理论开展任务级分析,以填补这一空白。利用202名工人对171项工作任务的评分,我们(1)验证了一个包含五个条目的感知无意义量表,(2)表明感知无意义程度能强烈预测将任务委托给AI的意愿,以及(3)发现此类任务也被认为需要较少的人工监督。总之,这些发现表明,被视为无意义的任务是AI委托的自然候选,工人的偏好与感知可行性在此一致。

英文摘要

Some claim that AI agents will free workers from the boring parts of their jobs, yet little is known about how workers themselves identify which tasks should be automated. Prior research focuses on occupations, overlooking that workers experience varying levels of meaning across tasks within the same role. We address this gap with a task-level analysis grounded in Graeber's theory of bullshit jobs. Using ratings from 202 workers on 171 workplace tasks, we (1) validate a five-item scale of perceived bullshitness, (2) show that perceived bullshitness strongly predicts desire for AI delegation, and (3) find that such tasks are also seen as requiring less human oversight. Together, these findings suggest that tasks perceived as bullshit are natural candidates for AI delegation, aligning worker preferences with perceived feasibility.

2606.12923 2026-06-15 cs.LG cs.AI cs.CL 版本更新

Order Is Not Control: Driven-Dissipative Response Laws Across Artificial and Biological Systems

秩序并非控制:人工与生物系统中的驱动-耗散响应定律

Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Tim Elson

发表机构 * Australian Broadcasting Corporation(澳大利亚广播公司)

AI总结 本文论证秩序不等于控制,提出接收器门控响应定律,并在生物、大语言模型、适配器和随机算子面板中验证,表明控制是局部的、可测量的。

Comments 52 pages, 7 figures, updated title

AI中文摘要

AI对齐、可解释性、引导和神经扰动研究识别出诱导秩序的对象。我们认为秩序并非控制。控制需要接收器门控的响应定律:一个以分母为索引的算子,将物质状态、动作/驱动、浴和接收器状态映射到响应位移、汇、努力和盆地投影。我们在生物、大语言模型、适配器和随机算子面板中识别出该定律。这些定律是局部的:取决于介质、浴、接收器状态、动作端口和比较器,一项干预可能被接纳、饱和、变号、泄漏或过驱动。当有限努力在相同分母下移动目标或结果读出类别,而损伤、空响应/规避、无效格式、过驱动和不必要努力保持有界时,才认定存在控制。小鼠ALM、秀丽隐杆线虫和斑马鱼面板提供了物理响应算子证据,同时排除了坐标同一性和控制器结论。大语言模型面板展示了生成输出的响应定律:在四种物质条件下,响应向量的分量符号预测准确率为72.8-73.7%,在非零分量上升至84.3-84.8%;留出观察者预测系统效应族和目标/oracle族的准确率分别为93.6%和91.7%。以宪法为条件的适配器将易感性重塑为制备介质,随机算子面板则将测得的机会与可部署的行动策略区分开来。这给出了介观控制层面的驱动-耗散响应系统描述:驱动通过制备介质、浴和接收器起作用,产生被接纳的运动、阻抗、汇或过驱动。证据支持局部接纳控制和可测量的随机响应算子,同时将可部署的预生成控制、隐藏状态/logit的因果充分性、生物到LLM的坐标同一性以及字面意义上的热力学量留在范围之外。

英文摘要

AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping material state, action/drive, bath, and receiver state to response displacement, sinks, effort, and basin projection. We identify it across biological, LLM, adapter, and stochastic-operator panels. The laws are local: an intervention can be admitted, saturated, sign-changing, leaky, or overdriven depending on medium, bath, receiver state, action port, and comparator. Control is assigned when finite effort moves a target or outcome-readout class under the same denominator while damage, null/evasive, invalid format, overdrive, and unnecessary effort stay bounded. Mouse ALM, C. elegans, and zebrafish panels provide physical response-operator evidence while excluding coordinate identity and controller conclusions. LLM panels show generated-output response laws: across four material conditions, response vectors are predictable at 72.8-73.7% component-sign accuracy, rising to 84.3-84.8% on nonzero components; held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. Constitution-conditioned adapters reshape susceptibility as prepared media, and stochastic-operator panels separate measured opportunity from deployable action policies. This gives a driven-dissipative response-system account at the mesoscopic control level: drives act through prepared media, baths, and receivers, producing admitted movement, impedance, sinks, or overdrive. The evidence supports local admitted control and measurable stochastic response operators, while leaving deployable pre-generation control, hidden/logit causal sufficiency, biological-to-LLM coordinate identity, and literal thermodynamic quantities outside scope.

2508.08935 2026-06-15 cs.LG cs.AI 版本更新

LNN-PINN: A Unified Physics-Only Training Framework with Liquid Residual Blocks

LNN-PINN: 一种带有液体残差块的统一纯物理训练框架

Ze Tao, Hanxuan Wang, Fujun Liu

发表机构 * Nanophotonics and Biophotonics Key Laboratory of Jilin Province, School of Physics, Changchun University of Science and Technology(吉林省纳米光子与生物光子重点实验室,物理学院,长春理工大学) Faculty of Chinese Medicine, Macau University of Science and Technology(澳门科技大学中医药学院)

AI总结 针对物理信息神经网络在复杂问题中预测精度有限的问题,提出LNN-PINN框架,通过引入液体残差门控架构提升预测精度,并在多个基准问题上验证了其有效性和稳定性。

Journal ref
Computer Physics Communications, 326, 110237 (2026)
AI中文摘要

物理信息神经网络(PINNs)因其能够将偏微分方程先验知识整合到深度学习框架中而受到广泛关注;然而,在应用于复杂问题时,它们通常表现出有限的预测精度。为了解决这一问题,我们提出了LNN-PINN,一种物理信息神经网络框架,它结合了液体残差门控架构,同时保留原始的物理建模和优化流程以提高预测精度。该方法仅在隐藏层映射中引入轻量级门控机制,保持采样策略、损失组成和超参数设置不变,以确保改进纯粹来自架构优化。在四个基准问题上,LNN-PINN在相同训练条件下持续降低了RMSE和MAE,绝对误差图进一步证实了其精度提升。此外,该框架在不同维度、边界条件和算子特性下表现出强大的适应性和稳定性。总之,LNN-PINN为提升物理信息神经网络在复杂科学和工程问题中的预测精度提供了一种简洁有效的架构增强方法。

英文摘要

Physics-informed neural networks (PINNs) have attracted considerable attention for their ability to integrate partial differential equation priors into deep learning frameworks; however, they often exhibit limited predictive accuracy when applied to complex problems. To address this issue, we propose LNN-PINN, a physics-informed neural network framework that incorporates a liquid residual gating architecture while preserving the original physics modeling and optimization pipeline to improve predictive accuracy. The method introduces a lightweight gating mechanism solely within the hidden-layer mapping, keeping the sampling strategy, loss composition, and hyperparameter settings unchanged to ensure that improvements arise purely from architectural refinement. Across four benchmark problems, LNN-PINN consistently reduced RMSE and MAE under identical training conditions, with absolute error plots further confirming its accuracy gains. Moreover, the framework demonstrates strong adaptability and stability across varying dimensions, boundary conditions, and operator characteristics. In summary, LNN-PINN offers a concise and effective architectural enhancement for improving the predictive accuracy of physics-informed neural networks in complex scientific and engineering problems.

2603.20821 2026-06-15 cs.DC cs.AI cs.LG 版本更新

Compass: Optimizing Compound AI Workflows for Dynamic Adaptation

Compass: 为动态适应优化复合AI工作流

Milos Gravara, Juan Luis Herrera, Stefan Nastic

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文提出Compass框架,通过离线优化和在线适应动态切换复合AI工作流的配置,提升准确率、延迟和成本的平衡能力。

Comments 10 pages, 7 figures; accepted at the 26th IEEE International Symposium on Cluster, Cloud, and Internet Computing (CCGrid 2026)

Journal ref
In Proceedings of the 26th IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2026
AI中文摘要

复合AI是一种分布式智能方法,它是一个统一的系统,将专用AI/ML模型与工程化软件组件编排成AI工作流。复合AI的生产部署必须在变化的负载下满足准确性、延迟和成本目标。然而,许多部署运行在固定基础设施上,无法水平扩展。现有方法仅优化准确性,未考虑负载条件的变化。我们观察到,复合AI系统可以在不同配置之间切换以适应基础设施容量,根据当前负载以准确性换取延迟。这需要从组合搜索空间中发现多个帕累托最优配置,并在运行时确定何时切换。本文提出Compass框架,通过离线优化和在线适应实现动态配置切换。Compass包含三个组件:用于配置发现的COMPASS-V算法、用于推导切换策略的Planner,以及用于运行时适应的Elastico控制器。COMPASS-V结合有限差分引导搜索以及爬山与横向扩展来发现满足准确性要求的配置。Planner在目标硬件上对这些配置进行性能剖析,并利用基于排队论的模型推导切换策略。Elastico监控队列深度,并根据推导出的阈值切换配置。在两个复合AI工作流中,与穷举搜索相比,COMPASS-V在实现100%召回率的同时平均减少57.5%的配置评估,在严格的准确性阈值下效率提升可达95.3%。运行时适应在动态负载模式下实现90-98%的SLO合规率,比静态高准确性基线的SLO合规率提高71.6%,同时比静态快速基线的准确性提高3-5%。

英文摘要

Compound AI is a distributed intelligence approach that represents a unified system orchestrating specialized AI/ML models with engineered software components into AI workflows. Compound AI production deployments must satisfy accuracy, latency, and cost objectives under varying loads. However, many deployments operate on fixed infrastructure where horizontal scaling is not viable. Existing approaches optimize solely for accuracy and do not consider changes in workload conditions. We observe that compound AI systems can switch between configurations to fit infrastructure capacity, trading accuracy for latency based on current load. This requires discovering multiple Pareto-optimal configurations from a combinatorial search space and determining when to switch between them at runtime. We present Compass, a novel framework that enables dynamic configuration switching through offline optimization and online adaptation. Compass consists of three components: COMPASS-V algorithm for configuration discovery, Planner for switching policy derivation, and Elastico Controller for runtime adaptation. COMPASS-V discovers accuracy-feasible configurations using finite-difference guided search and a combination of hill-climbing and lateral expansion. Planner profiles these configurations on target hardware and derives switching policies using a queuing theory based model. Elastico monitors queue depth and switches configurations based on derived thresholds. Across two compound AI workflows, COMPASS-V achieves 100% recall while reducing configuration evaluations by 57.5% on average compared to exhaustive search, with efficiency gains reaching 95.3% at tight accuracy thresholds. Runtime adaptation achieves 90-98% SLO compliance under dynamic load patterns, improving SLO compliance by 71.6% over static high-accuracy baselines, while simultaneously improving accuracy by 3-5% over static fast baselines.
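
The runtime step described above (monitor queue depth, switch configuration at derived thresholds) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Config` type, the `pick_config` function, the configuration names, and the accuracy and threshold numbers are hypothetical and are not taken from Compass or Elastico.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One workflow configuration (hypothetical fields, illustrative only)."""
    name: str
    accuracy: float        # accuracy measured offline (made-up numbers below)
    max_queue_depth: int   # deepest queue at which this config is still used


def pick_config(queue_depth: int, ladder: list[Config]) -> Config:
    """Return the most accurate config whose threshold admits the current depth.

    `ladder` must be ordered from most accurate (slowest) to least accurate
    (fastest), with non-decreasing thresholds.
    """
    for cfg in ladder:
        if queue_depth <= cfg.max_queue_depth:
            return cfg
    # Queue deeper than every threshold: fall back to the fastest config.
    return ladder[-1]


# Hypothetical ladder; a real system would derive the thresholds from profiling.
LADDER = [
    Config("high_accuracy", accuracy=0.90, max_queue_depth=4),
    Config("balanced", accuracy=0.87, max_queue_depth=16),
    Config("fast", accuracy=0.85, max_queue_depth=64),
]

if __name__ == "__main__":
    for depth in (0, 10, 100):
        print(depth, pick_config(depth, LADDER).name)
```

In this sketch a short queue keeps the most accurate configuration, and a growing queue steps down the ladder toward the fastest one, which is one simple way to trade accuracy for latency on fixed infrastructure.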

2507.06174 2026-06-15 cs.RO cs.AI cs.SY eess.SY 版本更新

Design and Experimental Validation of Sensorless 4-Channel Bilateral Teleoperation for Low-Cost Manipulators

面向低成本机械臂的无传感器四通道双侧远程操控的设计与实验验证

Koki Yamane, Yunhan Li, Masashi Konosu, Koki Inami, Junji Oaki, Toshiaki Tsuji, Sho Sakaino

发表机构 * Degree Programs in Intelligent and Mechanical Interaction Systems, University of Tsukuba(智能与机械交互系统专业,筑波大学) Faculty of Engineering, Information and Systems, University of Tsukuba(工程、信息与系统学部,筑波大学) Department of Electrical Engineering, Electronics, and Applied Physics, Saitama University(电气工程、电子学与应用物理系,埼玉大学)

AI总结 本文提出了一种无传感器四通道双侧远程操控框架,结合非线性动力学补偿与基于观测器的扰动估计方案,实验证明在低成本硬件限制下可实现稳定的高速接触密集场景远程操控,并提升模仿学习任务的成功率。

Comments 22 pages, 12 figures, Submitted to IEEE Access

AI中文摘要

低成本机械臂的远程操控作为收集模仿学习演示数据的实用手段,正受到越来越多的关注。然而,现有大多数低成本系统依赖无力反馈的单侧位置控制,而带力反馈的双侧远程操控难以实现,因为低成本机械臂通常编码器分辨率低,且没有关节扭矩传感器。本文提出了一种无传感器四通道双侧远程操控框架,将经辨识的非线性动力学补偿与基于扰动观测器的速度和外力估计方案相结合。通过在频域中解释观测器结构,我们阐明了速度估计带宽与外力估计带宽之间的耦合,并基于阻尼比和单一截止频率推导出实用的调参准则。真实机器人实验(包括与力传感器的对比和远程操控任务)表明,所提出的框架能提供具有实用价值的力估计,并在低成本硬件限制下实现高速和接触密集场景中的稳定远程操控。作为应用,模仿学习实验表明,在所测试的接触密集操作任务中,将估计的力信息纳入演示可提高任务成功率。

英文摘要

Teleoperation of low-cost manipulators is attracting increasing attention as a practical means of collecting demonstration data for imitation learning. However, most existing low-cost systems rely on unilateral position control without force feedback, while implementing force-feedback bilateral teleoperation is difficult because low-cost manipulators typically have low-resolution encoders and no joint torque sensors. This paper presents a sensorless 4-channel bilateral teleoperation framework that integrates identified nonlinear dynamics compensation with a disturbance-observer-based velocity and external-force estimation scheme. By interpreting the observer structure in the frequency domain, we clarify the coupling between the velocity- and external-force-estimation bandwidths and derive practical tuning guidelines based on the damping ratio and a single cutoff frequency. Real-robot experiments, including force-sensor comparison and teleoperation tasks, demonstrate that the proposed framework provides practically useful force estimates and enables stable teleoperation in high-speed and contact-rich scenarios under low-cost hardware constraints. As an application, imitation-learning experiments demonstrate that incorporating estimated force information into demonstrations improves task success rates in the tested contact-rich manipulation tasks.

2512.20932 2026-06-15 cs.LG cs.AI 版本更新

Guardrailed Elasticity Pricing: A Churn-Aware Forecasting Playbook for Subscription Strategy

带护栏的弹性定价:面向订阅策略的流失感知预测手册

Deepit Sapru

发表机构 * Deepit Sapru

AI总结 本文提出一个动态定价框架,结合多变量需求预测、细分层面的价格弹性及流失倾向预测,以优化收入和留存。该方法结合季节性时间序列模型与基于树的学习器,求解带约束的优化问题,提升 SaaS 产品组合的定价效果,同时保障客户体验与伦理约束。

AI中文摘要

本文提出一个营销分析框架,将订阅定价落实为一个动态的、带护栏的决策系统,结合多变量需求预测、细分层面的价格弹性和流失倾向,以优化收入、利润率和留存。该方法融合季节性时间序列模型与基于树的学习器,运行蒙特卡洛情景测试以刻画风险包络,并求解一个带约束的优化问题,对客户体验、利润率下限和可容许流失施加业务护栏。该方法在异质的 SaaS 产品组合上得到验证,通过将价格调整重新分配给支付意愿更高的细分群体,同时保护价格敏感群体,持续优于静态分层定价和统一涨价。系统通过模块化 API 支持实时重新校准,并包含模型可解释性以满足治理和合规需求。从管理角度看,该框架可作为一份策略手册,明确何时从固定定价转向动态定价,如何使定价与客户生命周期价值(CLV)和月度经常性收入(MRR)目标保持一致,以及如何嵌入伦理护栏,从而在不损害客户信任的前提下实现持久增长。

英文摘要

This paper presents a marketing analytics framework that operationalizes subscription pricing as a dynamic, guardrailed decision system, uniting multivariate demand forecasting, segment-level price elasticity, and churn propensity to optimize revenue, margin, and retention. The approach blends seasonal time-series models with tree-based learners, runs Monte Carlo scenario tests to map risk envelopes, and solves a constrained optimization that enforces business guardrails on customer experience, margin floors, and allowable churn. Validated across heterogeneous SaaS portfolios, the method consistently outperforms static tiers and uniform uplifts by reallocating price moves toward segments with higher willingness-to-pay while protecting price-sensitive cohorts. The system is designed for real-time recalibration via modular APIs and includes model explainability for governance and compliance. Managerially, the framework functions as a strategy playbook that clarifies when to shift from flat to dynamic pricing, how to align pricing with CLV and MRR targets, and how to embed ethical guardrails, enabling durable growth without eroding customer trust.