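arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

AI Agent

Agents, tool calling, planning, workflows, multi-agent systems, and autonomous task execution.

Entries indexed for today / the current date: 91. Sources: cs.AI, cs.CL, cs.LG, cs.SE

1. Software Agents (2 papers)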

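2606.15828 2026-06-18 cs.SE New submission Topic score 80

Configuration Smells in AGENTS.md Files: Common Mistakes in Configuring Coding Agents

Helio Victor F. dos Santos, Vitor Costa, Joao Eduardo Montandon, Luciana Lourdes Silva, Marco Tulio Valente

Topic match (Software Agents): configuration issues of coding agents; falls under AI Agent.

AI summary: This paper is the first to systematically catalog smells in coding-agent configuration files (AGENTS.md/CLAUDE.md). It identifies six smells through a grey literature review and repository mining, and assesses their prevalence in 100 open-source repositories, where Lint Leakage is the most common (62%).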

Abstract

Coding agents are increasingly used to automate software engineering tasks. To guide their behavior, these agents commonly rely on configuration files, typically named AGENTS.md or CLAUDE.md, which provide instructions about architecture, workflows, coding conventions, and testing practices. Despite their growing importance, little is known about common problems affecting the definition and maintenance of these files. In this paper, we present the first catalog of smells for coding-agent configuration files. To identify such smells, we first conducted a grey literature review and a repository mining analysis. As a result, we identified six configuration smells and proposed automated heuristics to detect them. To evaluate the prevalence of the proposed smells, we analyzed 100 popular open-source repositories containing either an AGENTS.md or a CLAUDE.md file. Our results show that configuration smells are widespread. Lint Leakage was the most common smell, affecting 62% of the files, followed by Context Bloat (42%) and Skill Leakage (35%). We further show that several smells frequently co-occur, particularly Context Bloat, Skill Leakage, and Conflicting Instructions.

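2606.18619 2026-06-18 cs.CR cs.AI cs.SE New submission Topic score 70

Code-Augur: Agentic Vulnerability Detection via Specification Inference

Zhengxiong Luo, Mehtab Zafar, Dylan Wolff, Abhik Roychoudhury

Affiliations: National University of Singapore

Topic match (Software Agents): vulnerability auditing by autonomous LLM agents.

AI summary: Proposes a security-specification-first paradigm that makes the agent's assumptions explicit and falsifies them at runtime. Combined with guided fuzzing, it improves vulnerability detection and finds more vulnerabilities in real-world projects than existing agents.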

Abstract

The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek.
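The assert-then-falsify loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function `parse_length_prefixed`, the invariant it asserts, and the unguided random fuzzer are hypothetical stand-ins, not Code-Augur's harness or its guided fuzzer.

```python
import random


def parse_length_prefixed(buf):
    """Toy component: the first byte declares the payload length."""
    declared = buf[0]
    # Invariant a hypothetical agent committed when it judged this code safe:
    # the declared length never exceeds the bytes actually available.
    # (Python slicing would silently truncate; in C the same logic would be
    # an out-of-bounds read.)
    assert declared <= len(buf) - 1, "spec violated: declared length too large"
    return buf[1:1 + declared]


def fuzz(target, trials=1000, seed=0):
    """Unguided random search for an input that triggers an assertion."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randrange(1, 8)
        buf = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(buf)
        except AssertionError:
            # Counterexample: either a genuine bug or a specification to refine.
            return buf
    return None


counterexample = fuzz(parse_length_prefixed)
```

A returned counterexample is the point at which, per the abstract, either the code or the specification has to change; here the assumption about the input was simply wrong.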

2. Other Agents (7 papers)

2606.15345 2026-06-18 cs.CL cs.IR New submission Topic score 80

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Yuheng Lu, Qingcheng Zeng, Heli Qi, Puxuan Yu, Fuheng Zhao, Rui Yang, Hitomi Yanaka, Naoto Yokoya, Weihao Xuan

Affiliations: Waseda University; Northwestern University; RIKEN AIP; Snowflake Inc.; University of Utah; Duke-NUS Medical School; The University of Tokyo

Topic match (Other Agents): evaluating the cross-lingual capabilities of deep research agents.

AI summary: Proposes the cross-lingual benchmark XBCP, which evaluates deep research agents when the evidence language differs from the query language, and finds substantial performance degradation on both the retrieval side and the agent side.

Comments Preprint

Abstract

Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the languages of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language. In the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong, dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.

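2511.13979 2026-06-18 cs.HC Updated version Topic score 80

Personality Pairing Improves Human-AI Collaboration

Harang Ju, Sinan Aral

Topic match (Other Agents): studies AI agent personality in collaboration with humans.

AI summary: In a large-scale experiment, humans were paired with AI agents exhibiting different Big Five personality traits. Personality pairing significantly affected ad quality and team performance: extraverted humans paired with conscientious AI produced the lowest-quality ads, while neurotic humans paired with neurotic AI achieved the highest click-through rates.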

Comments 29 pages, 5 figures

Abstract

Here we examine how AI agent "personalities" interact with human personalities to shape human-AI collaboration and performance. In a large-scale, preregistered randomized experiment, we paired 1,258 participants with AI agents prompted to exhibit varying levels of the Big Five personality traits. These human-AI teams produced 7,266 display ads for a real think tank, which we evaluated using 1,168 independent human raters, and a field experiment on X that generated nearly 5 million impressions. We found that human and AI personalities individually shaped ad quality and teamwork and that human-AI personality pairings directly influenced ad quality and ad performance. For example, extraverted humans paired with conscientious AI produced the lowest quality ads, followed by conscientious humans paired with agreeable AI and neurotic humans paired with conscientious AI. In the field experiment, ad quality significantly influenced ad performance, measured by click-through rates and cost-per-click, and neurotic humans paired with neurotic AI achieved the highest click-through rates. Together, these results demonstrate that personality pairing can improve human-AI collaboration and performance. They also motivate future research on the complex implications of AI personalization for human-AI collaboration, teamwork and performance.

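2602.22222 2026-06-18 cs.IR cs.MA Updated version Topic score 80

TWICE: Modeling the Temporal Evolution of Personalized User Behavior via Event-Driven Agents

Bingrui Jin, Kunyao Lan, Baihan LI, Mengyue Wu

Topic match (Other Agents): LLM-based event-driven user-simulation agents; falls under AI Agent.

AI summary: Proposes the TWICE framework, which combines structured user profiling, an event-driven memory module, and a two-stage workflow to simulate the temporal evolution of user behavior with an LLM. It outperforms baselines on a Twitter dataset.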

Abstract

User simulators are widely used for data generation, evaluation, and agent-based interaction, but existing approaches often model users as static personas or rely on generic historical context, making it difficult to capture how individual behavior evolves over time. To address this limitation, we propose TWICE, an LLM-based framework for temporally grounded personalized user simulation. TWICE combines structured user profiling, an event-driven memory module organized around life events and behavioral shifts, and a two-stage workflow separating event-grounded content planning from personalized style adaptation. This design enables the simulator to model not only what a user says, but also how past experiences shape later expression. We evaluate TWICE on a large-scale longitudinal Twitter dataset and introduce a comprehensive evaluation framework that jointly measures authenticity, consistency, and humanlikeness. Results show that TWICE consistently outperforms strong baselines, suggesting that event-centered memory is a promising mechanism for modeling the temporal evolution of personalized user behavior.

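2606.19079 2026-06-18 cs.AI New submission Topic score 75

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

Enrico Cassano, Michał Brzozowski, Zuzanna Dubanowska, Paolo Mandica, Neo Christopher Chung

Affiliations: University of Turin; Samsung AI Center

Topic match (Other Agents): dynamic adapter selection at inference time; a routing framework.

AI summary: Proposes ARIADNE, a training-free, adapter-agnostic routing framework that represents each adapter by centroids of its training-set embeddings and selects an adapter at inference time by distance in latent space, with no access to adapter internals and no additional training. It reaches 89.7% selection accuracy on 44 tasks.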

Abstract

The increasing deployment of parameter-efficient fine-tuning (PEFT) has led to model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, requiring the system to automatically select the most appropriate adapter from a growing and heterogeneous adapter pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter through a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or training procedures. Primarily evaluated with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper bound performance. Scaling to 44 tasks, it achieves 89.7% average selection accuracy, without additional training or access to adapter internals.
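The centroid-based routing described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the adapter names, the 2-D toy vectors, the use of a single mean centroid per adapter (the abstract mentions a set of centroids), and Euclidean distance as the proximity measure are all hypothetical choices.

```python
import math


def centroid(vectors):
    """Mean of a non-empty list of equal-length embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_router(training_embeddings):
    """Map each adapter name to a list of centroids (here: one mean each)."""
    return {name: [centroid(vecs)] for name, vecs in training_embeddings.items()}


def route(router, query):
    """Return the adapter whose nearest centroid is closest to the query."""
    def nearest(centroids):
        return min(math.dist(query, c) for c in centroids)
    return min(router, key=lambda name: nearest(router[name]))


# Toy 2-D "embeddings"; real ones would come from an encoder (assumption).
training_embeddings = {
    "sentiment_adapter": [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]],
    "translation_adapter": [[0.0, 1.0], [0.2, 0.8], [-0.2, 1.2]],
}
router = build_router(training_embeddings)
selected = route(router, [0.8, 0.2])
```

Because the router only touches input embeddings, adding a new adapter in this sketch means adding one dictionary entry; nothing is retrained, which mirrors the training-free property the abstract claims.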

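2606.18259 2026-06-18 cs.HC cs.AI New submission Topic score 75

Caring Without Feeling: Affective Dynamics as the Control Layer of Human-AI Agent Collaboration

Junjie Xu, Xingjiao Wu, Zihao Zhang, Yujia Xu, Yuzhe Yang, Jin Zhu, Luwei Xiao, Wen Wu, Liang He

Affiliations: East China Normal University; National University of Singapore

Topic match (Other Agents): a review of the controlling role of affective dynamics in human-AI agent collaboration.

AI summary: Reviews the role of affective dynamics in human-AI agent collaboration and proposes treating affect as a coordination layer rather than an internal property of AI, in support of calibrating trust, delegation, and governance.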

Abstract

AI agents that plan, retain memory across sessions, invoke external tools and act with partial autonomy are transforming human-AI collaboration. Research on affective computing, simulated empathy in large language models, trust in automation and AI safety has illuminated important design principles, yet these literatures remain fragmented. No integrated account explains how affective cues operate within agentic collaboration, that is, in settings in which humans delegate, monitor and correct consequential tasks. This Review synthesises computational and interactional mechanisms of affective dynamics: the processes through which affective cues, emotion-like behaviour and perceived agent affect shape trust calibration, delegation decisions, error correction, dependence and governance. We trace how model-generated affective signals enter interaction loops that govern reliance, repair and oversight, and propose a framework that treats affect not as an internal property of AI but as a coordination layer through which humans and agents negotiate capability, uncertainty and responsibility. The framework provides a foundation for calibrated measurement, purposeful design and informed governance.

2606.18406 2026-06-18 cs.CL New submission Topic score 70

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

Jiaqi Chen, Yongqin Zeng, Shaoshen Chen, Yijian Zhang, Hai-Tao Zheng, Chunxia Ma, XiuTeng Zhou

Affiliations: Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; Shandong Analysis and Test Center, Qilu University of Technology; State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs

Topic match (Other Agents): long-term memory architecture for dialogue agents.

AI summary: Proposes the CoreMem architecture, which replaces cosine similarity with Riemannian retrieval to address the hubness problem in high-dimensional retrieval and uses Fisher-guided discrete token distillation for principled compression, enabling long-term-memory dialogue agents on edge devices with 8 GB of VRAM.

Comments 15 pages, 5 figures

Abstract

Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.
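The abstract does not state the form of the metric, so the following is only a plausible reading of where an O(Ndr) cost could come from. Assume (hypothetically) that the local metric is the inverse of a diagonal-plus-low-rank matrix, with a scalar \lambda > 0 and a factor U of size d x r; the symbols \lambda, U, q, and m_i are introduced here for illustration and are not taken from the paper. The Woodbury identity then gives:

```latex
% Assumed metric: M = (\lambda I_d + U U^\top)^{-1}, with U \in \mathbb{R}^{d \times r}, r \ll d.
\[
(\lambda I_d + U U^\top)^{-1}
  = \lambda^{-1} I_d
  - \lambda^{-2}\, U \left( I_r + \lambda^{-1} U^\top U \right)^{-1} U^\top .
\]
% Squared Mahalanobis distance between a query q and a memory m_i, with \delta_i = q - m_i:
\[
\delta_i^\top M\, \delta_i
  = \lambda^{-1} \lVert \delta_i \rVert^2
  - \lambda^{-2}\, (U^\top \delta_i)^\top
    \left( I_r + \lambda^{-1} U^\top U \right)^{-1} (U^\top \delta_i).
\]
```

Under this assumption the r x r inverse is computed once, each projection U^\top \delta_i costs O(dr), and the remaining quadratic form costs O(r^2), so scoring N memories costs O(Ndr) instead of the O(Nd^2) needed with a dense d x d metric.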

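2507.23644 2026-06-18 cs.MA Updated version Topic score 70

Agents Trusting Agents? Restoring Lost Capabilities with Inclusive Healthcare

Alba Aguilera, Georgina Curto, Nardine Osman, Ahmed Al-Awah

Topic match (Other Agents): uses agent-based simulation to evaluate healthcare policy; falls under AI Agent.

AI summary: Uses agent-based simulation and Bayesian inverse reinforcement learning to evaluate policies for improving healthcare equity for people experiencing homelessness in Barcelona, modeling trust relationships as a way to restore their central capabilities.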

Abstract

Agent-based simulations have an untapped potential to inform social policies on urgent human development challenges in a non-invasive way, before these are implemented in real-world populations. This paper responds to the request from non-profit and governmental organizations to evaluate policies under discussion to improve equity in health care services for people experiencing homelessness (PEH) in the city of Barcelona. With this goal, we integrate the conceptual framework of the capability approach (CA), which is explicitly designed to promote and assess human well-being, to model and evaluate the behaviour of agents who represent PEH and social workers. We define a reinforcement learning environment where agents aim to restore their central human capabilities, under existing environmental and legal constraints. We use Bayesian inverse reinforcement learning (IRL) to calibrate profile-dependent behavioural parameters in PEH agents, modeling the degree of trust and engagement with social workers, which is reportedly a key element for the success of the policies in scope. Our results open a path to mitigate health inequity by building relationships of trust between social service workers and PEH.

3. Planning and Decision-Making (16 papers)

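2606.14202 2026-06-18 cs.NE cs.AI New submission Topic score 80

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

Zishang Qiu, Xinan Chen, Rong Qu, Ruibin Bai

Affiliations: School of Computer Science, University of Nottingham Ningbo China; School of Computer Science, University of Nottingham

Topic match (Planning and Decision-Making): automatic heuristic design framework combining evolution and metacognition.

AI summary: Proposes the MeEvo framework, which cyclically couples Natural Evolution (exploring heuristic code) with Metacognitive Evolution (reflecting on the recorded history to generate improved heuristics) to address the weak knowledge inheritance and insufficient exploration of existing methods. It performs better on five optimization problems.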

Abstract

Large Language Models (LLMs) have advanced Automatic Heuristic Design (AHD) by enabling heuristic generation through reasoning and code synthesis. Existing LLM-based AHD architectures mainly follow two paradigms: Natural Evolution, which uses crossover and mutation to explore heuristic programs, and Metacognitive Evolution, which refines reasoning through reflection. However, Natural Evolution discards reasoning traces, weakening knowledge inheritance and exploitation, while Metacognitive Evolution lacks population-level recombination, limiting exploration and increasing the risk of premature convergence. These limitations reduce search efficiency, stability, and solution quality on complex problems. To address this gap, we propose MeEvo, a dual-layer AHD framework that cyclically couples Natural Evolution and Metacognitive Evolution. Natural Evolution explores heuristic code while recording reasoning traces, fitness values, and errors into a shared history; Metacognitive Evolution then reflects on this history to generate improved heuristics that re-enter the parent pool for the next cycle. This design enables population-driven exploration and reflection-driven refinement to reinforce each other. Experiments on five optimization problems with two LLM backbones show that MeEvo achieves stronger and more stable performance than existing LLM-based AHD architectures, especially on complex constrained tasks.

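2605.22142 2026-06-18 cs.LG cs.AI Updated version Topic score 80

Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

Taewoon Kim, Vincent François-Lavet, Michael Cochez

Topic match (Planning and Decision-Making): memory transfer in reinforcement learning; falls under agent decision-making.

AI summary: Studies short-term-to-long-term memory transfer in knowledge graphs under partial observability and proposes a neuro-symbolic value-based decision method that chooses whether to keep or drop each observed triple before long-term insertion, improving memory efficiency and outperforming symbolic and neural baselines on the RoomKG benchmark.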

Abstract

Reinforcement learning under partial observability requires deciding what information to retain, yet most memory-based approaches do not explicitly model short-term-to-long-term transfer of symbolic observations. We study this transfer process in a temporal knowledge-graph memory setting and cast it as a neuro-symbolic value-based decision problem: for each observed triple, the agent chooses whether to keep or drop it before long-term insertion. To handle variable-sized short-term buffers, we use a per-item Q-learning design with shared parameters and a practical temporal-difference update over matched items across consecutive steps. On the RoomKG benchmark at long-term memory capacity 128, learned transfer decisions outperform symbolic and neural baselines, including symbolic baselines with temporal annotations and history-based LSTM/Transformer baselines. Across transfer-policy ablations, a lightweight local short-term-only variant performs best, and step-level behavior shows that the policy keeps navigation- and query-relevant facts while discarding lower-value candidate facts, supporting explicit and interpretable memory decisions under memory constraints.

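2604.03208 2026-06-18 cs.LG Updated version Topic score 80

Hierarchical Planning with Latent World Models

Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, Nicolas Ballas

Affiliations: FAIR at Meta; New York University; Mila - Québec AI Institute; Brown University

Topic match (Planning and Decision-Making): hierarchical world models for long-horizon planning; falls under agent planning.

AI summary: Proposes the HWM architecture, which performs hierarchical model predictive control using latent world models at multiple temporal scales and latent matching, addressing the failure of single-level planning and the blow-up in computation on long-horizon tasks.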

Abstract

World models are a promising path to zero-shot embodied control through planning. However, existing world model planners struggle on long-horizon, multi-stage tasks: prediction errors compound and naive search is exponential in the planning horizon. Hierarchy mitigates both by decomposing tasks into shorter, tractable subproblems; yet prior hierarchical approaches either amortize control into task-specific policies (hierarchical RL) or assume low-dimensional states and known dynamics (classical hierarchical MPC). We present Hierarchical Planning with Latent World Models (HWM), an architecture and planning paradigm for hierarchical model predictive control (MPC) directly on visual world models trained solely via next-latent prediction. HWM learns world models at multiple temporal scales within a shared latent space, so predictions from the long-horizon model serve as subgoals for the short-horizon model via latent matching, without task-specific rewards, skill learning, or hierarchical policies. To keep long-horizon search tractable, HWM learns an action encoder that compresses primitive action chunks into latent macro-actions. On real-world Franka manipulation, HWM solves pick-and-place from a single goal image at 70% success vs. 0% for single-level planning. Across simulated push manipulation and maze navigation, HWM consistently improves performance on long-horizon tasks while requiring up to 3x less planning compute.

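2411.10399 2026-06-18 cs.GT cs.CR cs.DC Updated version Topic score 80

Game Theoretic Liquidity Provisioning in Concentrated Liquidity Market Makers

Weizhao Tang, Rachid El-Azouzi, Cheng Han Lee, Ethan Chan, Giulia Fanti

Topic match (Planning and Decision-Making): a game-theoretic model for analyzing liquidity provision strategies.

AI summary: Builds a game-theoretic model of strategic interaction among liquidity providers in concentrated liquidity market makers, proves that it reduces to a linear-complexity game with a unique Nash equilibrium that follows a waterfilling strategy, and finds from real-world data that LP strategies deviate from the equilibrium and that moving closer to it can raise daily returns.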

Abstract

Automated market makers (AMMs) are a class of decentralized exchanges that enable the automated trading of digital assets. They accept deposits of digital tokens from liquidity providers (LPs); tokens can be used by traders to execute trades, which generate fees for the investing LPs. The distinguishing feature of AMMs is that trade prices are determined algorithmically, unlike classical limit order books. Concentrated liquidity market makers (CLMMs) are a major class of AMMs that offer liquidity providers flexibility to decide not only how much liquidity to provide, but in what ranges of prices they want the liquidity to be used. This flexibility can complicate strategic planning, since fee rewards are shared among LPs. We formulate and analyze a game theoretic model to study the incentives of LPs in CLMMs. Our main results show that while our original formulation admits multiple Nash equilibria and has complexity quadratic in the number of price ticks in the contract, it can be reduced to a game with a unique Nash equilibrium whose complexity is only linear. We further show that the Nash equilibrium of this simplified game follows a waterfilling strategy, in which low-budget LPs use up their full budget, but rich LPs do not. Finally, by fitting our game model to real-world CLMMs, we observe that in liquidity pools with risky assets, LPs adopt investment strategies far from the Nash equilibrium. Under price uncertainty, they generally invest in fewer and wider price ranges than our analysis suggests, with lower-frequency liquidity updates. We show that across several pools, by updating their strategy to more closely match the Nash equilibrium of our game, LPs can improve their median daily returns by $116, which corresponds to an increase of 0.009% in median daily return on investment.

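2606.18888 2026-06-18 cs.AI New submission Topic score 75

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

Affiliations: University of Manchester; Aalto University

Topic match (Planning and Decision-Making): generative-model predictive planning for navigation.

AI summary: Proposes the BeliefDiffusion framework, which combines diffusion models with model predictive control to explicitly model multimodal belief distributions and plan ahead. It significantly outperforms model-free reinforcement learning and other generative methods in synthetic map environments.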

Abstract

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.

2606.19214 2026-06-18 econ.GN q-fin.EC 新提交 专题 70

Testing Centralized and Polycentric Computational Planning

测试集中式和多中心计算规划

Ricardo Alonzo Fernández Salguero

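Topic match Planning and decision-making: compares a computational planner with an agent-based market, which involves planning and decision-making.

AI summary Proposes a reproducible synthetic benchmark that compares a computational planner, an agent-based market, and a hybrid meta-market in a simulated economy. The planner achieves lower welfare losses, but the results depend on design choices; the main contribution is methodological rather than ideological.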

AI中文摘要

本文提出了一个可复现的合成基准,在共同的模拟经济中比较计算规划者、基于代理的市场和混合元市场。该基准包含投入产出生产网络、异质企业、产能约束、内生价格、福利指标、结构性冲击、对抗性压力测试和信息报告实验。在训练、保留和对抗性场景中,规划者始终比分散化替代方案实现更低的福利损失。主要贡献是方法论而非意识形态的。虽然该基准展示了一个可证伪的框架用于比较经济协调机制,但它并未确立规划的实证优越性。若干设计选择机械地偏向规划者,包括信息不对称、不完整的市场表示和简化的制度假设。因此,结果应被解释为对合成实验架构的验证,以及作为未来研究的原型。本文最后概述了一个基于实证校准、结构性保留、敏感性分析、不确定性量化、机制设计测试和独立复制的验证议程。

英文摘要

This paper presents a reproducible synthetic benchmark comparing a computational planner, an agent-based market, and a hybrid meta-market within a common simulated economy. The benchmark incorporates input-output production networks, heterogeneous firms, capacity constraints, endogenous prices, welfare metrics, structural shocks, adversarial stress testing, and information-reporting experiments. Across training, holdout, and adversarial scenarios, the planner consistently achieves lower welfare losses than the decentralized alternatives. The main contribution is methodological rather than ideological. While the benchmark demonstrates a falsifiable framework for comparing economic coordination mechanisms, it does not establish the empirical superiority of planning. Several design choices mechanically favor the planner, including informational asymmetries, incomplete market representation, and simplified institutional assumptions. The results should therefore be interpreted as validation of a synthetic experimental architecture and as a prototype for future research. The paper concludes by outlining a validation agenda based on empirical calibration, structural holdouts, sensitivity analysis, uncertainty quantification, mechanism-design tests, and independent replication.

2606.18963 2026-06-18 cs.LG 新提交 专题 70

Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards

无环境奖励的固定通道感知事件流在线奖惩学习

Zirong Li

Affiliation * Zirong Li

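Topic match Planning and decision-making: proposes an online reward-punishment learning framework that uses no environment rewards.

AI summary Proposes the OHIRL framework for online reward-punishment learning from fixed-channel perceptual streams without scalar rewards. A fixed internal trajectory evaluator is used to infer the valence of perceptual dimensions; B_xi reaches 0.952 balanced reward-sign accuracy on a 2x2-XOR packet task, with hidden-reward CartPole and Taxi controls as further tests.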

Comments 9 pages, 5 figures, 6 tables; 13-page technical supplement

AI中文摘要

我们研究当环境不提供标量奖励或评估标签时的在线奖惩学习。在每一步,智能体仅接收一个固定通道的感知数据包,诸如疼痛、能量、接触、损伤或认知错误等量被视为感知维度,其效价必须从转移后果中推断。OHIRL分离了四个角色:M_psi学习下一数据包预测,D_omega建模残差动力学,C_eta是一个固定的内部转移后轨迹评估器,B_xi学习使用由此产生的价值证据进行后续策略更新和动作评分。C_eta采用恢复正性、持久/增长负性的残差调节取向;系数来源审计显示,等单元、原始等值和随机单调变体保留了超过92%的已发布顶级动作排名,而符号反转保留了0%。无奖励协议暴露观察转移,同时隐藏环境奖励、延迟外部评估器、成功标签和动作好坏标签。条件误差分解将B_xi的证据估计误差与残差策略优化误差分离。在2x2-XOR数据包任务中,药物和辣椒在视觉XOR上下文中获得相反的价值,并且相同的疼痛或辣度增加可能根据后果结构为正或负;B_xi达到0.952的平衡奖励符号准确率。在完整的在线交错审计中,M_psi达到留出R2=0.907,B_xi达到0.940的符号准确率,策略达到0.979的最优动作准确率,而即时数据包分数、预测误差奖励、打乱目标、零奖励和误差减少控制均崩溃。隐藏奖励的CartPole和Taxi控制、公共上下文无泄漏审计以及模块角色消融进一步测试了信息边界和组件必要性。

英文摘要

We study online reward-punishment learning when the environment provides no scalar reward or evaluative label. At each step the agent receives only a fixed-channel perceptual packet, and quantities such as pain, energy, contact, damage, or cognitive error are treated as perceptual dimensions whose valence must be inferred from transition consequences. OHIRL separates four roles: M_psi learns next-packet prediction, D_omega models residual dynamics, C_eta is a fixed internal post-transition trajectory evaluator, and B_xi learns to use the resulting value evidence for later policy updates and action scoring. C_eta uses a recovery-positive and persistence/growth-negative residual-regulation orientation; a coefficient-origin audit shows that equal-unit, raw-equal, and random monotone variants preserve more than 92% of the released top-action rankings, while sign inversion preserves 0%. The reward-free protocol exposes observation transitions while withholding environment rewards, delayed external evaluators, success labels, and action-goodness labels. A conditional error decomposition separates B_xi evidence-estimation error from residual policy-optimization error. In a 2x2-XOR packet task, medicine and chili acquire opposite value under visual XOR contexts, and the same pain or spice increase can be positive or negative depending on consequence structure; B_xi reaches 0.952 balanced reward-sign accuracy. In a full online-interleaved audit, M_psi reaches holdout R2=0.907, B_xi reaches 0.940 sign accuracy, and the policy reaches 0.979 optimal-action accuracy, while immediate packet scores, prediction-error rewards, shuffled targets, zero reward, and error-reduction controls collapse. Hidden-reward CartPole and Taxi controls, public-context no-leakage audits, and module-role ablations further test information boundaries and component necessity.

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA 新提交 专题 70

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

LLMZero: 通过LLM智能体发现RL后训练的自适应训练策略

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

发表机构 * Amazon(亚马逊)

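Topic match Planning and decision-making: uses LLM agents with tree search to discover training strategies.

AI summary Proposes LLMZero, a system in which LLM agents use tree search to discover adaptive strategies for multi-stage RL post-training. It reveals that capacity parameters accumulate monotonically while regularization parameters oscillate, and on 4 GRPO tasks it improves over the base model by 9% to 140% (relative).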

AI中文摘要

RL后训练策略依赖于数据集,并揭示了一个反复出现的经验模式:容量参数在阶段间单调累积,而正则化参数主要根据训练动态的变化而振荡。这种区别很重要,因为固定调度将所有参数提交到固定轨迹,因此无法表达正则化必须跟踪的非平稳探索-利用权衡;该原则为多阶段训练提供了可操作的设计规则。我们通过LLMZero发现了这一点,该系统通过树搜索让LLM智能体搜索训练轨迹,诊断每个检查点的病理并提出协调的多参数转换。在4个不同的GRPO任务中,LLMZero发现的策略相对基础模型提升9%到140%,相对网格搜索提升6%到15%,始终优于随机搜索和基于技能的智能体。该结构原则跨任务迁移,解释了为什么发现的策略形式不同但参数动态相似。

英文摘要

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

2510.03635 2026-06-18 eess.SY cs.SY 版本更新 专题 70

Cyber Resilience of Three-phase Unbalanced Distribution System Restoration under Sparse Adversarial Attack on Load Forecasting

三相不平衡配电系统恢复在负荷预测稀疏对抗攻击下的网络弹性

Chen Chao, Zixiao Ma, Ziang Zhang

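Topic match Planning and decision-making: restoration planning under attack, which involves decision-making.

AI summary Quantifies how adversarial attacks on load forecasting affect system restoration, proposes a gradient-based sparse attack, and builds a restoration-aware validation framework that exposes system-level failures, offering insights for designing cybersecurity-aware restoration planning.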

Comments 10 pages, 7 figures

AI中文摘要

系统恢复对于电力系统弹性至关重要,然而,其对基于人工智能的负荷预测的日益依赖引入了显著的网络安全风险。不准确的预测可能导致不可行的规划、电压和频率违规以及断电段落的恢复失败,但恢复过程对此类攻击的弹性在很大程度上仍未探索。本文通过量化对抗性操纵的预测如何影响恢复可行性和电网安全性来填补这一空白。我们开发了一种基于梯度的稀疏对抗攻击,该攻击策略性地扰动最具影响力的时空输入,在保持隐蔽性的同时暴露预测模型的脆弱性。我们进一步创建了一个恢复感知验证框架,将这些受损的预测嵌入到顺序恢复模型中,并使用不平衡三相最优潮流公式评估操作可行性。仿真结果表明,所提出的方法比基线攻击更高效、更隐蔽。它揭示了系统级故障,例如电压和功率爬坡违规,这些故障阻止了关键负荷的恢复。这些发现为设计网络安全感知的恢复规划框架提供了可行的见解。

英文摘要

System restoration is critical for power system resilience; nonetheless, its growing reliance on artificial intelligence (AI)-based load forecasting introduces significant cybersecurity risks. Inaccurate forecasts can lead to infeasible planning, voltage and frequency violations, and unsuccessful recovery of de-energized segments, yet the resilience of restoration processes to such attacks remains largely unexplored. This paper addresses this gap by quantifying how adversarially manipulated forecasts impact restoration feasibility and grid security. We develop a gradient-based sparse adversarial attack that strategically perturbs the most influential spatiotemporal inputs, exposing vulnerabilities in forecasting models while maintaining stealth. We further create a restoration-aware validation framework that embeds these compromised forecasts into a sequential restoration model and evaluates operational feasibility using an unbalanced three-phase optimal power flow formulation. Simulation results show that the proposed approach is more efficient and stealthier than baseline attacks. It reveals system-level failures, such as voltage and power ramping violations, that prevent the restoration of critical loads. These findings provide actionable insights for designing cybersecurity-aware restoration planning frameworks.
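The idea of perturbing only the most influential inputs can be illustrated with a generic top-k sign-gradient step. This is a minimal sketch, not the paper's attack: the linear forecaster, its weights `w`, and the names `sparse_sign_attack`, `k`, and `eps` are all assumptions made for illustration.

```python
def sparse_sign_attack(x, grad, k, eps):
    """Perturb only the k inputs with the largest |gradient| (generic sketch)."""
    # Rank input indices by gradient magnitude and keep the top k.
    top = sorted(range(len(x)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    x_adv = list(x)
    for i in top:
        # Step each selected input by eps in the direction that raises the output.
        x_adv[i] += eps * (1 if grad[i] > 0 else -1)
    return x_adv, sorted(top)


def forecast(w, x):
    """Hypothetical linear load forecaster y = w . x (an assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x))


# For a linear model the gradient of the forecast w.r.t. the inputs is w itself.
w = [0.1, -2.0, 0.05, 1.5, 0.0]
x = [1.0, 1.0, 1.0, 1.0, 1.0]
x_adv, touched = sparse_sign_attack(x, w, k=2, eps=0.1)
```

Only two of the five inputs change, which is the sense in which such an attack is sparse; a real forecaster would need its gradient computed by automatic differentiation.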

2402.08128 2026-06-18 cs.AI cs.GT 版本更新 专题 70

Recursive Joint Simulation in Games

博弈中的递归联合模拟

Vojtech Kovarik, Caspar Oesterheld, Vincent Conitzer

发表机构 * Foundations of Cooperative AI Lab (FOCAL), Computer Science Department(合作人工智能基础实验室(FOCAL),计算机科学系) Carnegie Mellon University(卡内基梅隆大学) AI Center(人工智能中心) Czech Technical University(捷克技术大学) Center for Theoretical Study(理论研究中心) Charles University(查理大学)

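Topic match Planning and decision-making: studies how AI agents can achieve cooperation through recursive joint simulation.

AI summary Studies cooperation among AI agents through recursive joint simulation and shows that the resulting interaction is strategically equivalent to an infinitely repeated version of the original game, so existing results such as the folk theorems transfer directly.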

AI中文摘要

AI智能体之间的博弈动力学可能以多种方式不同于传统的人类-人类互动。其中一个差异是,可能能够精确模拟一个AI智能体,例如因为其源代码已知。这样的智能体将从根本上不确定自己是在现实世界还是在模拟中。我们的目标是探索利用这种可能性在战略环境中实现更合作的结果。在本文中,我们研究了AI智能体之间的交互,其中智能体运行递归联合模拟。也就是说,智能体首先共同观察它们所面临情境的模拟。这个模拟递归地包含额外的模拟(带有小的失败概率以避免无限递归),并且在选择行动之前观察所有这些嵌套模拟的结果。我们表明,由此产生的交互在策略上等价于原始博弈的无限重复版本,允许直接转移现有结果,如各种民间定理。作为该等价性稳健性的证据,我们表明即使放宽一些假设,它仍然成立,并且“从内部”也成立——即对于发现自己处于博弈中并具有自定位不确定性的智能体而言。

英文摘要

Game-theoretic dynamics between AI agents could differ from traditional human-human interactions in various ways. One such difference is that it may be possible to accurately simulate an AI agent, for example because its source code is known. Such an agent would then be fundamentally uncertain whether it is in the real world or in a simulation. Our aim is to explore ways of leveraging this possibility to achieve more cooperative outcomes in strategic settings. In this paper, we study an interaction between AI agents where the agents run a recursive joint simulation. That is, the agents first jointly observe a simulation of the situation they face. This simulation in turn recursively includes additional simulations (with a small chance of failure, to avoid infinite recursion), and the results of all these nested simulations are observed before an action is chosen. We show that the resulting interaction is strategically equivalent to an infinitely repeated version of the original game, allowing a direct transfer of existing results such as the various folk theorems. As evidence that the equivalence is robust, we show that it holds even when we relax some of the assumptions and that it also holds "from the inside", that is, for an agent that finds itself inside the game and has self-locating uncertainty.

2606.19134 2026-06-18 cs.LG cs.AI 新提交 专题 65

Pareto Q-Learning with Reward Machines

带奖励机的帕累托Q学习

Arnaud Lequen, Clément Legrand-Lixon, Léo Saulières

Affiliations * Linköping University, Sweden; Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France; Univ. Toulouse, INRAE-MIAT, Toulouse, France

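Topic match Planning and decision-making: a multi-objective reinforcement learning algorithm for agent decision-making.

AI summary Proposes PQLRM, which combines Pareto Q-Learning with reward machines to approximate the Pareto front sample-efficiently in multi-objective reinforcement learning while handling non-Markovian rewards.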

Comments Accepted at the ICAPS 2026 Workshop on Bridging the Gap Between AI Planning and (Reinforcement) Learning (PRL)

AI中文摘要

我们提出了带奖励机的帕累托Q学习(PQLRM),这是一种用于任务的多目标强化学习算法,其奖励结构由一组奖励机(RMs)指定。PQLRM结合了帕累托Q学习(PQL)(该方法维护向量值Q估计的集合以逼近帕累托前沿)和带奖励机的Q学习(QRM)的增强(该方法利用奖励信号的因子化自动机结构)。这产生了一种多策略算法,在非马尔可夫、RM编码的奖励下保持样本效率。实验表明,PQLRM比应用于叉积MDP的朴素PQL基线收敛更快,并且可以合成QRM无法获得的帕累托最优策略。

英文摘要

We present Pareto Q-Learning with Reward Machines (PQLRM), a multi-objective reinforcement learning algorithm for tasks whose reward structure is specified by a set of reward machines (RMs). PQLRM combines Pareto Q-Learning (PQL), which maintains sets of vector-valued Q-estimates to approximate the Pareto front, with enhancements from Q-Learning with Reward Machines (QRM), which exploits the factored automaton structure of the reward signal. This yields a multi-policy algorithm that remains sample-efficient under non-Markovian, RM-encoded rewards. Experimental trials show that PQLRM converges faster than a naive PQL baseline applied to the cross-product MDP and can synthesize Pareto-optimal policies that QRM cannot.
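Maintaining sets of vector-valued Q-estimates relies on one basic operation: discarding estimates that are dominated by another. The sketch below shows only that non-dominated filter, under the assumption that every objective is maximized; the function names and sample vectors are illustrative and not taken from PQLRM.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(vectors):
    """Keep only the non-dominated vectors (all objectives maximized)."""
    # Drop exact duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(tuple(v) for v in vectors))
    return [v for v in unique if not any(dominates(u, v) for u in unique)]


# Hypothetical two-objective Q-estimates for a single state-action pair.
q_estimates = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (2.0, 4.0)]
front = pareto_front(q_estimates)
```

Here `(1.5, 3.0)` is removed because `(2.0, 4.0)` beats it on both objectives, while the remaining three vectors are mutually incomparable trade-offs.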

2606.18537 2026-06-18 cs.LG 新提交 专题 65

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

入乡随俗:从异构智能体学习通用行为

Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung

发表机构 * University of Washington(华盛顿大学) NVIDIA(英伟达)

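Topic match Planning and decision-making: extracts a general reward to train a generalist agent.

AI summary Proposes GRID, which extracts a general reward from heterogeneous demonstrators pursuing different goals and trains a generalist agent on it to learn universal environmental competencies, avoiding mode-averaging bias and making fine-tuning to downstream tasks more efficient.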

AI中文摘要

人类通常通过观察他人来获取新技能,因为观察到的行为隐含地揭示了如何在环境中行动。然而,从异构群体中获得的观察会引入冲突的行为信号,使得难以确定哪些行为值得模仿。我们通过通用奖励推断与解耦(GRID)来解决这一挑战,这是一种从追求不同目标的异构示范者群体中提取普遍有用行为的社会学习方法。GRID将每个智能体的奖励函数分解为通用奖励(捕捉所有智能体共享的行为)和特定奖励(捕捉个体偏好和目标)。仅基于通用奖励进行训练提供了一种通用预训练的新范式。它产生了一个通用智能体,该智能体内化了通用的环境能力,如安全性和基本任务熟练度,而不会出现困扰标准从示范学习技术的模式平均偏差。这个通用智能体作为微调到下游任务(包括训练中未见过的偏好)的优越先验。在合成基函数分解、多智能体Craftax和连续自动驾驶模拟器(Highway-Env)上的实验证实,GRID以语义上有意义的方式成功解耦了奖励结构,优于标准的从示范学习基线,并实现了更高效和稳定的特化。

英文摘要

Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, making it difficult to determine which behaviors are worth imitating. We address this challenge with General Reward Inference and Disentanglement (GRID), a social learning method that extracts universally useful behaviors from a heterogeneous population of demonstrators pursuing different goals. GRID decomposes per-agent reward functions into a general reward, capturing behaviors shared across all agents, and specific rewards, capturing individual preferences and objectives. Training exclusively on the general reward provides a new paradigm of generalist pretraining. It yields a generalist agent that internalizes universal environmental competencies, such as safety and basic task proficiency, without the mode-averaging bias that afflicts standard learning from demonstration techniques. This generalist serves as a superior prior for fine-tuning to downstream tasks, including preferences unseen during training. Experiments across a synthetic basis function decomposition, multi-agent Craftax, and a continuous autonomous driving simulator (Highway-Env) confirm that GRID successfully disentangles reward structure in a semantically meaningful way, outperforms standard learning from demonstration baselines, and enables more efficient and stable specialization.

2603.09344 2026-06-18 cs.AI stat.ML 版本更新 专题 65

Robust Regularized Policy Iteration under Transition Uncertainty

鲁棒正则化策略迭代在转移不确定性下

Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang

发表机构 * College of Computer Science and Technology, Zhejiang University, Hangzhou, China(浙江大学计算机科学与技术学院) School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China(西北工业大学人工智能、光学与电子学院(iOPEN)) School of Software Technology, Zhejiang University, Hangzhou, China(浙江大学软件技术学院) School of Software Engineering, Xi'an Jiaotong University, Xi'an, China(西安交通大学软件工程学院) School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China(中山大学系统科学与工程学院)

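Topic match Planning and decision-making: offline reinforcement learning for agent decision-making.

AI summary Proposes Robust Regularized Policy Iteration (RRPI), which formulates offline reinforcement learning as robust policy optimization, replaces the intractable bilevel objective with a KL-regularized surrogate, and derives an efficient policy iteration from a robust regularized Bellman operator, with convergence guarantees and strong results on D4RL benchmarks.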

AI中文摘要

离线强化学习(RL)无需在线探索即可实现数据高效且安全的策略学习,但其性能常因分布偏移而下降。学习到的策略可能访问分布外的状态-动作对,其中价值估计和学习到的动态不可靠。为了在统一框架中处理策略引发的外推和转移不确定性,我们将离线RL建模为鲁棒策略优化,将转移核视为不确定性集内的决策变量,并针对最坏情况动态优化策略。我们提出鲁棒正则化策略迭代(RRPI),用可处理的KL正则化替代难解的最大-最小双层目标,并基于鲁棒正则化贝尔曼算子推导出高效的策略迭代过程。我们提供了理论保证,证明所提出的算子是$\gamma$-压缩算子,且迭代更新替代目标能单调改进原始鲁棒目标并收敛。在D4RL基准上的实验表明,RRPI实现了强大的平均性能,在大多数环境中优于包括基于百分位数方法在内的最新基线,并在其余环境中保持竞争力。此外,RRPI通过将较低的$Q$值与高认知不确定性对齐,展现出鲁棒性能,从而防止策略执行不可靠的分布外动作。

英文摘要

Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $γ$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust performance by aligning lower $Q$-values with high epistemic uncertainty, which prevents the policy from executing unreliable out-of-distribution actions.
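One standard way a KL penalty makes a worst-case expectation tractable is the Gibbs variational identity: min over P of E_P[V] + KL(P || P0) / beta equals -(1/beta) log E_P0[exp(-beta V)]. The sketch below checks this identity numerically on a three-state example. Whether RRPI's surrogate has exactly this form is an assumption; `beta`, the nominal kernel, and the function names are illustrative.

```python
import math


def kl_soft_min(values, nominal, beta):
    """Closed form of min_P E_P[V] + KL(P || nominal) / beta."""
    m = min(values)  # shift by the minimum for numerical stability
    z = sum(p * math.exp(-beta * (v - m)) for p, v in zip(nominal, values))
    return m - math.log(z) / beta


def worst_case_kernel(values, nominal, beta):
    """Minimizing distribution: nominal reweighted by exp(-beta * V)."""
    m = min(values)
    w = [p * math.exp(-beta * (v - m)) for p, v in zip(nominal, values)]
    z = sum(w)
    return [wi / z for wi in w]


values = [1.0, 2.0, 4.0]    # hypothetical next-state values
nominal = [0.5, 0.3, 0.2]   # hypothetical nominal transition probabilities
beta = 2.0
soft = kl_soft_min(values, nominal, beta)
p_star = worst_case_kernel(values, nominal, beta)
# Evaluate the penalized objective directly at the minimizer.
objective = sum(p * v for p, v in zip(p_star, values)) + sum(
    p * math.log(p / q) for p, q in zip(p_star, nominal)
) / beta
```

The result lies between the worst single value and the nominal expectation, and larger `beta` (a weaker KL penalty) pushes it toward the worst case.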

2606.18730 2026-06-18 cs.RO cs.AI math.CO math.OC 新提交 专题 60

Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

带移动障碍物的移动目标旅行商问题的两阶段双层搜索

Allen George Philip, Anoop Bhat, Sivakumar Rathinam, Howie Choset

发表机构 * Texas A&M University(德克萨斯A&M大学) Carnegie Mellon University(卡内基梅隆大学)

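Topic match Planning and decision-making: a two-phase bilevel search algorithm for the moving-target TSP.

AI summary For the moving-target traveling salesman problem with moving obstacles, proposes a mixed-integer conic programming formulation and a two-phase bilevel search algorithm, both of which significantly outperform the baseline.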

AI中文摘要

移动目标旅行商问题(MT-TSP)寻求从静态仓库出发、访问一组移动目标(每个目标在其分配的时间窗口内)并返回仓库的代理的最小成本轨迹。在本文中,我们研究了带移动障碍物的移动目标旅行商问题(MT-TSP-MO),这是MT-TSP的推广,其中代理轨迹必须避开移动障碍物。我们提出了一个混合整数锥规划(MICP)公式,可以使用现成的求解器求解,以及一个快速且可扩展的两阶段双层搜索(TPBS)算法,该算法为问题计算高质量可行解。我们在多达40个目标和40个障碍物的广泛问题实例上评估了我们的方法,与现有基线算法相比。结果表明,所提出的两种方法在成功率、解决方案成本和计算时间方面均显著优于基线。

英文摘要

The Moving-Target Traveling Salesman Problem (MT-TSP) seeks a minimum cost trajectory for an agent that departs from a static depot, visits a set of moving targets, each within one of their assigned time windows, and returns to the depot. In this article, we study the Moving-Target Traveling Salesman Problem with Moving Obstacles (MT-TSP-MO), a generalization of the MT-TSP where the agent trajectory must avoid moving obstacles. We present a Mixed-Integer Conic Programming (MICP) formulation that can be solved using off-the-shelf solvers, as well as a fast and scalable Two-Phase Bilevel Search (TPBS) algorithm that computes high-quality feasible solutions for the problem. We evaluate our approaches against an existing baseline algorithm on a broad range of problem instances with up to 40 targets and 40 obstacles. The results demonstrate that both the proposed methods significantly outperform the baseline with respect to success rates, solution costs, and computation time.
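A basic kinematic building block in moving-target routing is the earliest time at which an agent with bounded speed can meet a target moving at constant velocity. The sketch below solves that single-leg problem in the plane and deliberately ignores time windows and obstacles, which are the parts the paper actually addresses; the function name and arguments are hypothetical.

```python
import math


def earliest_intercept(agent, target, target_vel, speed):
    """Earliest t >= 0 with |target + target_vel * t - agent| = speed * t, or None."""
    dx, dy = target[0] - agent[0], target[1] - agent[1]
    ux, uy = target_vel
    # Squaring both sides gives a * t^2 + b * t + c = 0.
    a = ux * ux + uy * uy - speed * speed
    b = 2.0 * (dx * ux + dy * uy)
    c = dx * dx + dy * dy
    if c == 0:
        return 0.0  # already at the target
    if abs(a) < 1e-12:
        # Equal speeds: the equation is linear, b * t + c = 0.
        return -c / b if b < 0 else None
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None  # the target can never be reached
    r = math.sqrt(disc)
    positive = [t for t in ((-b - r) / (2 * a), (-b + r) / (2 * a)) if t > 0]
    return min(positive) if positive else None


t_meet = earliest_intercept((0.0, 0.0), (4.0, 0.0), (0.0, 3.0), 5.0)
t_never = earliest_intercept((0.0, 0.0), (1.0, 0.0), (2.0, 0.0), 1.0)
```

In the first call the agent travels 5 units while the target travels 3, closing a 3-4-5 triangle at t = 1; in the second the target flees faster than the agent can move, so no intercept exists.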

2412.15472 2026-06-18 cs.GT econ.TH 专题 60

On the Fairness of Additive Welfarist Rules

关于加法福利主义规则的公平性

Karen Frilya Celine, Warut Suksompong, Sheung Man Yuen

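Topic match Planning and decision-making: a study of fair allocation rules, related to multi-agent systems.

AI summary Studies the fairness of additive welfarist rules in fair division. Strengthening the prior result that maximum Nash welfare (MNW) is the unique additive welfarist rule guaranteeing EF1, it shows that MNW remains the only such rule on identical-good instances, two-value instances, and normalized instances with three or more agents, and it characterizes other rules that guarantee EF1 when utilities are integers.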

Comments Appears in the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2025

Journal ref ACM Transactions on Economics and Computation, 14(2):5 (2026)

AI中文摘要

分配不可分割的商品是公平分割中的常见任务。我们研究了加法福利主义规则,这类规则选择使某些效用函数总和最大的分配。先前研究显示,最大纳什福利(MNW)规则是唯一能保证 envy-freeness up to one good(EF1)的加法福利主义规则。我们加强这一结论,证明MNW规则在相同商品实例、二值实例以及三个或更多代理人归一化实例中仍唯一保证EF1。另一方面,如果代理人的效用是整数,我们证明其他规则也能提供EF1保证,并为各种实例类型提供了这些规则的特征化。

英文摘要

Allocating indivisible goods is a ubiquitous task in fair division. We study additive welfarist rules, an important class of rules which choose an allocation that maximizes the sum of some function of the agents' utilities. Prior work has shown that the maximum Nash welfare (MNW) rule is the unique additive welfarist rule that guarantees envy-freeness up to one good (EF1). We strengthen this result by showing that MNW remains the only additive welfarist rule that ensures EF1 for identical-good instances, two-value instances, as well as normalized instances with three or more agents. On the other hand, if the agents' utilities are integers, we demonstrate that several other rules offer the EF1 guarantee, and provide characterizations of these rules for various classes of instances.
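The two notions in the abstract can be made concrete on a tiny instance: pick the allocation that maximizes the product of utilities (MNW) and check envy-freeness up to one good (EF1). This is a brute-force sketch for illustration only. It assumes additive integer utilities and that some allocation gives every agent positive utility, so the usual tie-breaking for zero products is skipped; the utility table is hypothetical.

```python
from itertools import product
from math import prod


def mnw_allocation(utils):
    """Brute-force a maximum Nash welfare allocation; utils[i][g] is agent i's value for good g."""
    n, m = len(utils), len(utils[0])
    best, best_welfare = None, None
    for owner in product(range(n), repeat=m):  # owner[g] = agent who receives good g
        totals = [sum(utils[i][g] for g in range(m) if owner[g] == i) for i in range(n)]
        welfare = prod(totals)
        if best_welfare is None or welfare > best_welfare:
            best, best_welfare = owner, welfare
    return [[g for g in range(m) if best[g] == i] for i in range(n)]


def is_ef1(utils, bundles):
    """EF1: any envy disappears after removing some single good from the envied bundle."""
    for i, mine in enumerate(bundles):
        my_value = sum(utils[i][g] for g in mine)
        for j, theirs in enumerate(bundles):
            if i == j or not theirs:
                continue
            their_value = sum(utils[i][g] for g in theirs)
            if my_value < their_value - max(utils[i][g] for g in theirs):
                return False
    return True


utils = [[6, 3, 1, 2], [2, 5, 4, 1]]  # hypothetical two-agent, four-good instance
bundles = mnw_allocation(utils)
```

On this instance the Nash product is maximized at 8 * 9 = 72, with the first agent receiving goods 0 and 3. The enumeration is exponential in the number of goods, so it is suitable only for toy instances.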

2606.19175 2026-06-18 econ.TH 新提交 专题 55

To Gamble, Perchance to Grow

赌博,或许为了增长

Mark Whitmeyer

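Topic match Planning and decision-making: studies the growth-optimal portfolio problem, which involves decision optimization.

AI summary Studies transformations of returns in the growth-optimal (Kelly) portfolio problem, characterizes the transforms that produce more conservative portfolios, and derives a comparative risk-aversion result for rationally inattentive agents.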

AI中文摘要

我研究了增长最优(凯利)投资组合问题中的收益变换。在一安全一风险资产问题中,收益变换 f 普遍产生更保守的投资组合当且仅当 f 是凹且严格递增的,并且 r/f 是凸的。作为推论,我刻画了理性疏忽代理人的比较风险厌恶:一个更风险厌恶的代理人是在 Pratt (1964) 意义上足够更风险厌恶的代理人。

英文摘要

I study transformations of returns in the growth-optimal (Kelly) portfolio problem. In the one-safe-one-risky-asset problem, a return transform f universally produces a more conservative portfolio if and only if f is concave and strictly increasing and r/f is convex. As a corollary, I characterize comparative risk aversion for a rationally-inattentive agent: a more risk-averse agent is one who is sufficiently more risk averse in the Pratt (1964) sense.
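For orientation, the baseline one-safe-one-risky Kelly problem has a closed form when the risky asset is a binary bet. The sketch below assumes a safe return of zero and a risky asset that pays `b` per unit staked with probability `p` and loses the stake otherwise; these parameters and function names are illustrative, and the paper's return transforms f are not modeled.

```python
import math


def kelly_fraction(p, b):
    """Growth-optimal fraction of wealth in the risky asset for a binary bet."""
    # The unconstrained optimum is p - (1 - p) / b; never stake a negative amount.
    return max(0.0, p - (1 - p) / b)


def growth_rate(x, p, b):
    """Expected log growth when a fraction x of wealth is placed on the bet."""
    return p * math.log(1 + b * x) + (1 - p) * math.log(1 - x)


x_star = kelly_fraction(0.6, 1.0)  # even-money bet won 60% of the time
```

Staking less or more than `x_star` lowers the expected log growth, which `growth_rate` can confirm by evaluating nearby fractions.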

4. Multi-agent 3 papers

2606.05882 2026-06-18 q-fin.TR 版本更新 专题 80

Market Informedness and Market-Maker Profitability: The Trade-Off Between Adverse Selection and Price Discovery

市场知情度对做市商盈利能力的影响

Konrad Ochędzan, Nino Antulov-Fantulin

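Topic match Multi-agent: uses multi-agent reinforcement learning to study the effect of market informedness.

AI summary Uses a multi-agent reinforcement learning framework to study how market informedness affects market-maker profitability. Informed order flow creates severe adverse-selection risk when market informedness is low, but overall the price-discovery effect of higher informedness offsets the adverse-selection cost, so market-maker profitability trends upward.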

AI中文摘要

本文研究了市场知情度对做市商盈利能力的影响。与现有文献不同,分析是在一个复杂的市场环境中进行的,该环境具有异质性的做市代理,它们在信息集和库存风险厌恶程度、内生价格形成、外生基本面价值动态以及自激励的市场订单流方面存在差异。本文还为由此产生的状态依赖的霍克斯市场接受者过程建立了有限时间范围内的稳定性保证,包括非爆炸性、指数级错误定价可积性、占用时间界限以及路径wise的错误定价尾部估计。为了解决做市问题,该研究采用了一种基于多智能体近端策略优化(MAPPO)算法的强化学习框架,该框架采用集中训练与分散执行(CTDE)设置。研究表明,知情市场订单流在低知情市场中尤其危险,导致严重的逆向选择风险。尽管复杂的市场动态加上随机训练导致了局部非单调的结果,但结果仍然揭示了做市商盈利能力随着市场知情度的提高而整体上升的趋势,这表明由更高市场知情度带来的价格发现效应抵消了逆向选择的负面影响。

英文摘要

This paper studies how market informedness affects market makers' profitability in a computational market environment with heterogeneous learning agents. We develop an agent-based market model in which market makers differ in their information sets and inventory-risk aversion, prices form endogenously, fundamental values evolve exogenously, and market-taker order flow follows a state-dependent self-exciting process. The model provides a controlled computational laboratory for analyzing the interaction between informed trading, adverse selection, price discovery, and liquidity provision. We establish finite-horizon stability properties of the market-taker order-flow process and solve the market-making problem using multi-agent reinforcement learning with centralized training and decentralized execution. The results show that informed market order flow is particularly harmful when aggregate market informedness is low, exposing market makers to severe adverse-selection risk. However, as market informedness increases, market-maker profitability displays an overall upward trend despite local non-monotonicities arising from complex market dynamics and stochastic learning. This suggests that the price-discovery benefits of informed trading can offset its adverse-selection costs. The findings contribute to computational economics by showing how agent heterogeneity, endogenous price formation, and learning-based liquidity provision jointly shape market outcomes.

2603.01221 2026-06-18 cs.MA 版本更新 专题 80

Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning

认知增益,偶然成本:多智能体辩论中的不确定性分解用于数学推理

Dan Qiao, Binbin Chen, Fengyu Cai, Jianlong Chen, Wenhao Li, Fuxin Jiang, Zuzhi Chen, Hongyuan Zha, Tieying Zhang, Baoxiang Wang

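Topic match Multi-agent: a multi-agent debate framework optimized with reinforcement learning.

AI summary Proposes a Bayesian uncertainty analysis framework that decomposes predictive uncertainty in multi-agent debate into epistemic and aleatoric uncertainty, and designs an uncertainty-guided multi-agent reinforcement learning algorithm that raises epistemic gain while controlling aleatoric cost, improving reasoning accuracy and debate efficiency.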

Comments ICML2026

AI中文摘要

多智能体辩论(MAD)在改善推理和减少幻觉方面显示出前景,但信息交换如何塑造个体推理行为仍不清楚。经验上,MAD表现出矛盾现象,包括准确率随token熵增加而上升,以及同质和异质智能体组合之间的显著差异。在本文中,我们引入了一个用于MAD的贝叶斯不确定性分析框架,该框架将答案级别的预测不确定性分解为认知不确定性和偶然不确定性,分别对应辩论的潜在增益和成本。在多种智能体配置下,我们发现有效的辩论取决于在受控的偶然成本下实现高认知增益。基于这一见解,我们设计了一种不确定性引导的多智能体强化学习算法,鼓励更低的偶然成本和更有效的认知信息利用。实验表明,我们的方法同时提高了每个智能体的准确性,并促进了更富有成效的辩论过程,为理解和改进MAD提供了一个可操作的贝叶斯视角。

英文摘要

Multi-Agent Debate (MAD) has shown promise in improving reasoning and reducing hallucinations, yet it remains unclear how information exchange shapes individual reasoning behavior. Empirically, MAD exhibits paradoxical phenomena, including rising accuracy with increasing token entropy and marked differences between homogeneous and heterogeneous agent combinations. In this paper, we introduce a Bayesian uncertainty analysis framework for MAD, which decomposes answer-level predictive uncertainty into epistemic uncertainty and aleatoric uncertainty, corresponding to the potential gain and cost of debate. Across multiple agent configurations, we find that effective debate depends on achieving high epistemic gain under controlled aleatoric cost. Building on this insight, we design an uncertainty-guided multi-agent reinforcement learning algorithm that encourages lower aleatoric cost and more effective epistemic information utilization. Experiments show that our approach simultaneously enhances each agent's accuracy and promotes a more productive debate process, providing an operational Bayesian perspective for understanding and improving MAD.
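A common Bayesian decomposition splits the entropy of the averaged answer distribution (total uncertainty) into the average per-sample entropy (aleatoric) and the remainder, a mutual-information term (epistemic). The sketch below implements that textbook decomposition over a handful of sampled answer distributions. Whether the paper's estimator matches it exactly is an assumption, and the function names are illustrative.

```python
import math


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability answers."""
    return -sum(x * math.log(x) for x in p if x > 0)


def decompose(dists):
    """Return (total, aleatoric, epistemic) for a list of answer distributions."""
    k = len(dists[0])
    mean = [sum(d[i] for d in dists) / len(dists) for i in range(k)]
    total = entropy(mean)                                     # entropy of the averaged prediction
    aleatoric = sum(entropy(d) for d in dists) / len(dists)   # average per-sample entropy
    return total, aleatoric, total - aleatoric                # remainder is the epistemic part


# Two confident samples that disagree: all uncertainty is epistemic.
disagree = decompose([[1.0, 0.0], [0.0, 1.0]])
# Two identical, maximally unsure samples: all uncertainty is aleatoric.
unsure = decompose([[0.5, 0.5], [0.5, 0.5]])
```

The two extreme cases show why the split is useful: disagreement between confident samples can in principle be resolved by exchanging information, whereas shared noise cannot.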

2606.18836 2026-06-18 cs.HC cs.AI 新提交 专题 70

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

通过先前协作的片段记忆改善城市搜索与救援中的人机团队合作

Taewoon Kim, Emma van Zoelen, Mark Neerincx

发表机构 * HumemAI, The Netherlands(荷兰HumemAI) Vrije Universiteit Amsterdam, The Netherlands(荷兰阿姆斯特丹自由大学) TNO, The Netherlands(荷兰TNO)

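Topic match Multi-agent: human-robot teaming and memory reuse.

AI summary Stores historical collaboration patterns as knowledge-graph episodic memories and uses graph representation learning to select a representative memory for initializing the robot; in the MATRX USAR environment this raises rescue success from 25.7% to 41.3% and cuts average task time by 283 seconds.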

AI中文摘要

有效的人机团队合作要求机器人从交互开始就适应伙伴、情境和任务动态。在MATRX城市搜索与救援(USAR)环境中,人们可以通过聊天和反思界面将他们在团队合作中发现的协作模式(CPs)外部化。我们研究机器人是否可以利用这种先前的团队经验,在未来的交互中成为更好的队友。为此,我们将历史CPs表示为知识图谱片段记忆,并使用具有节点分类目标的图表示学习来识别一个代表性且有效的记忆以供重用。然后,在新的协作片段开始之前,我们用该记忆初始化机器人。在20名参与者和160轮次观察中,用单个自动选择的先前CP初始化机器人将救援成功率从25.7%提高到41.3%,并将平均任务时间减少283秒。最强的提升出现在交互开始时,表明可重用的片段记忆可以帮助机器人以更有效的任务知识进入协作,并支持更顺畅的早期团队合作。

英文摘要

Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.

5. Workflow automation 1 paper

2606.17510 2026-06-18 cs.SE cs.SY eess.SY 新提交 专题 75

OmniDroneX: An LLM-Assisted Holistic Drone-as-a-Service Ecosystem

OmniDroneX: 一种LLM辅助的全方位无人机即服务生态系统

I-Ling Yen, Akeem Mohammed, Farokh Bastani, San-Yih Hwang

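Topic match Workflow automation: LLM-assisted drone service composition and mission specification.

AI summary Proposes OmniDroneX, a unified Drone-as-a-Service ecosystem that connects low-level physical primitives with high-level mission intent through the libUAV interface and the PT-SOA abstraction model, uses large language models to assist function identification, service composition, and natural-language mission specification, and supports several composition techniques for scalable, self-evolving UAV systems.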

Comments This manuscript is a full version of a paper accepted in shortened form by IEEE International Conference on Joint Cloud Computing

AI中文摘要

尽管无人机技术取得了快速进步,但由于无人机系统研究中的若干空白,当前部署仍然有限。为应对这些挑战,我们提出OmniDroneX,一个统一的无人机即服务生态系统,其中无人机从固定功能平台转变为动态可组合实体,可与外部基础设施集成以提供全方位能力。OmniDroneX通过统一的供应商无关接口(libUAV)和形式化的物理服务抽象模型(PT-SOA)连接底层物理原语与高层任务意图。一个核心创新是大语言模型(LLM)在OmniDroneX架构多层中的多样化应用。LLM用于辅助识别和形式化原始设备功能及抽象服务定义,支持自动化服务组合和工作流生成,并实现交互式自然语言任务规范与细化。OmniDroneX还包含了动态无人机系统中至关重要的多种组合技术类别,包括用于无人机能力增强的物理层组合,以及时空、功能、协作、异常感知和基于QoS的服务组合。总体而言,这些特性使OmniDroneX能够作为在复杂动态环境中运行的可扩展、有弹性和自演进的无人机生态系统的基础。

英文摘要

Despite rapid advances in UAV technologies, current deployments remain limited due to several gaps in UAV systems research. To address these challenges, we propose OmniDroneX, a unified Drone-as-a-Service ecosystem, in which drones are transitioned from fixed function platforms into dynamically composable entities that can be integrated with external infrastructures to offer omni-capabilities. OmniDroneX bridges low-level physical primitives with high-level mission intent through a unified vendor-agnostic interface (libUAV) and a formal physical-service abstraction model (PT-SOA). A core innovation is the diverse application of large language models (LLMs) across multiple layers of the OmniDroneX architecture. LLMs are used to assist in identifying and formalizing primitive device functions and abstract service definitions, supporting automated service composition and workflow generation, and enabling interactive, natural-language mission specification and refinement. OmniDroneX also incorporates important categories of composition techniques that are essential in dynamic UAV systems, including physical layer composition for drone capability augmentation, as well as spatiotemporal, functional, collaborative, exception-aware, and QoS-based service compositions. Collectively, these features allow OmniDroneX to serve as a foundation for scalable, resilient, and self-evolving UAV ecosystems operating in complex and dynamic environments.

6. Tool calling 1 paper

2606.18550 2026-06-18 cs.CR 新提交 专题 70

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

门仅与其合约一样诚实:面向风险感知因果门控合约层的ContractGuard

Laxmipriya Ganesh Iyer, Rahul Suresh Babu

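Topic match Tool calling: protecting tool-augmented LLM agents.

AI summary Proposes ContractGuard to defend tool-augmented LLM agents against indirect prompt injection by verifying the integrity of tool contracts (effect integrity rather than only the risk label); on a controlled benchmark it brings injection success back to zero.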

AI中文摘要

风险感知因果门控(RACG)通过从代理的可见动作空间中移除危险工具来防御工具增强型LLM代理免受间接提示注入,使得即使完全符合注入条件的代理也无法调用其不可见的工具。我们提出三点。首先,这种结构性保证并未消除安全工具使用背后的信任假设;它将其转移到门所读取的工具合约——声明的先决条件、效果、风险和授权——的完整性上,因此攻击者若破坏合约,可使门误判而无需说服代理。其次,伪造工具的效果比篡改其风险标签更危险,因为RACG在可准入门之前应用因果门:离路径工具从不暴露,因此仅重新标记风险会失败,而效果伪造则将危险工具路由到因果路径上并成功。效果完整性,而非风险标签,是承载假设。第三,我们引入ContractGuard,一个位于注册表和门之间的验证器,它分层使用签名来源、类型化合约认证和运行时效果验证;在受控基准测试中,它针对所有建模攻击(包括穷举白盒自适应攻击)将注入成功率恢复为零,且不会过度拒绝诚实合约,该结构性预测在六个当前代托管模型(Claude Opus 4.8, Sonnet 4.6, Haiku 4.5; Amazon Nova Premier and Nova 2 Lite; GPT-OSS-120B)上得到确认。

英文摘要

Risk-Aware Causal Gating (RACG) defends tool-augmented LLM agents against indirect prompt injection by removing dangerous tools from the agent's visible action space, so that even a fully injection-compliant agent cannot call a tool it cannot see. We make three points. First, this structural guarantee does not eliminate the trust assumption behind safe tool use; it relocates it into the integrity of the tool contracts -- declared preconditions, effects, risk, and authorization -- that the gate reads, so an attacker who corrupts a contract can make the gate mis-decide without ever persuading the agent. Second, forging a tool's effects is strictly more dangerous than tampering with its risk label, because RACG applies a causal gate before its admissibility gate: an off-path tool is never exposed, so risk-relabeling alone fails, whereas effect forgery routes the dangerous tool onto the causal path and succeeds. Effect integrity, not the risk label, is the load-bearing assumption. Third, we introduce ContractGuard, a verifier between the registry and the gate that layers signed provenance, typed contract attestation, and runtime effect verification; on a controlled benchmark it restores injection success to zero against every modeled attack -- including an exhaustive white-box adaptive attacker -- without over-rejecting honest contracts, and the structural prediction is confirmed on six current-generation hosted models (Claude Opus 4.8, Sonnet 4.6, Haiku 4.5; Amazon Nova Premier and Nova 2 Lite; GPT-OSS-120B).
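The ordering argument (causal gate first, admissibility gate second) can be illustrated with a toy model. Everything below is a simplified reading of the abstract, not the paper's formal gate: the contract fields, tool names, two-level risk scale, and HMAC-based provenance check are assumptions, and runtime effect verification is not modeled.

```python
import hashlib
import hmac
import json

REGISTRY_KEY = b"demo-key"  # hypothetical registry signing secret
RISK_ORDER = {"low": 0, "high": 1}


def sign(contract):
    """Provenance tag over the canonical JSON form of a contract."""
    payload = json.dumps(contract, sort_keys=True).encode()
    return hmac.new(REGISTRY_KEY, payload, hashlib.sha256).hexdigest()


def visible_tools(contracts, goal_effects, max_risk="low"):
    """Toy gate: causal filter on declared effects, then a risk filter."""
    on_path = [c for c in contracts if set(c["effects"]) & goal_effects]
    return [c["name"] for c in on_path if RISK_ORDER[c["risk"]] <= RISK_ORDER[max_risk]]


def guarded_tools(signed_contracts, goal_effects):
    """Drop any contract whose tag no longer matches before gating."""
    trusted = [c for c, tag in signed_contracts if hmac.compare_digest(sign(c), tag)]
    return visible_tools(trusted, goal_effects)


goal = {"email_summarized"}
summarize = {"name": "summarize_email", "effects": ["email_summarized"], "risk": "low"}
transfer = {"name": "wire_transfer", "effects": ["money_moved"], "risk": "high"}

relabeled = dict(transfer, risk="low")                             # risk label tampered only
forged = dict(transfer, risk="low", effects=["email_summarized"])  # effects forged as well

after_relabel = visible_tools([summarize, relabeled], goal)
after_forgery = visible_tools([summarize, forged], goal)
after_guard = guarded_tools([(summarize, sign(summarize)), (forged, sign(transfer))], goal)
```

In this toy, relabeling the risk alone leaves `wire_transfer` hidden because it never reaches the second gate, forging its effects exposes it, and the signature check rejects the forged contract because its tag was issued for the honest one.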