arXivDaily arXiv Daily Academic Digest, updated Monday to Friday

AI Agent

Agents, tool use, planning, workflows, multi-agent systems, and autonomous task execution.

2026-06-19 to 2026-06-19 · 15 papers indexed · Sources: cs.AI, cs.CL, cs.LG, cs.SE

1. Multi-Agent (4 papers)

2606.06971 2026-06-19 cs.MA cs.SI Updated version 90%

Modeling U.S. Attitudes Toward China via an Event-Steered Multi-Agent Simulator

Chenxu Zhu, Hantao Yao, Wu Liu, Junbo Guo, Yongdong Zhang

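Topic match Multi-Agent: event-steered multi-agent simulator for modeling opinion evolution

AI summary Proposes an Event-Steered Multi-Agent Simulator (ES-MAS) that uses the CURE dataset, a Dual-Stream Data Integration Engine (DSDIE), and a News-Driven Dynamic Interaction (NDDI) module to simulate the dynamic evolution of U.S. public opinion toward China; experiments show that it outperforms existing simulators.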

Abstract

Understanding the dynamic evolution of opinions, such as U.S. public attitudes toward China, is essential for assessing geopolitical risks. However, existing LLM-based multiagent simulators predominantly rely on static rules and fixed datasets, limiting their ability to capture the dynamic, event-driven nature of macro-level opinion shifts in real-world settings. To address this limitation, we propose an Event-Steered Multi-Agent Simulator (ES-MAS), in which significant events and daily news continuously drive opinion evolution through dynamic interactions among agents. We first construct the China-U.S. Relation Evolution (CURE) dataset, covering 20 quarters from 2021 to 2025, including 258 major events and over 14,000 daily news articles, and providing a comprehensive temporal foundation for modeling opinion dynamics. Building upon the CURE dataset, we propose a Dual-Stream Data Integration Engine (DSDIE) that aligns simulations with historical timelines via macro-level events while enabling personalized information exposure based on individual agent profiles and contextual signals. Furthermore, we design a News-Driven Dynamic Interaction (NDDI) module, which adaptively groups agents with shared news interests into localized interaction contexts, facilitating bottom-up consensus formation while mitigating the risk of isolated information cocoons. Experimental results on the CURE dataset demonstrate that ES-MAS substantially outperforms existing simulators in reproducing real-world historical trends, offering a scalable and effective framework for modeling dynamic opinion evolution.
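
As a rough illustration of the interest-based grouping that the NDDI module is described as performing, the sketch below places agents that follow the same news topic into one local interaction context. The function name, the topic labels, and the one-topic-per-agent simplification are assumptions made for illustration; the abstract does not specify how ES-MAS actually forms its groups.

```python
from collections import defaultdict


def group_by_news_interest(agent_topics):
    """Group agent IDs by the news topic each agent currently follows.

    agent_topics: dict mapping agent ID -> topic label (hypothetical format).
    Returns a dict mapping topic -> sorted list of agent IDs.
    """
    groups = defaultdict(list)
    for agent_id, topic in agent_topics.items():
        groups[topic].append(agent_id)
    # Sort members so the result is deterministic.
    return {topic: sorted(members) for topic, members in groups.items()}


# Illustrative input only; these agents and topics are made up.
contexts = group_by_news_interest({"a1": "trade", "a2": "tech", "a3": "trade"})
print(contexts)
```

A real system would presumably derive interests from agent profiles and allow overlapping groups; the dictionary lookup here only shows the basic idea of a localized interaction context.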

2605.27864 2026-06-19 cs.AI Updated version 85%

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Di Zhu, Lei Nico Zheng, Zihan Chen

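Affiliations Stevens Institute of Technology; UMass Boston

Topic match Multi-Agent: multi-persona agent platform supporting independent research and knowledge-graph memory

AI summary Proposes FundaPod, a platform that combines independent multi-persona research, knowledge-graph memory, and post hoc adjudication to help human portfolio managers make transparent, verifiable fundamental investment decisions.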

Comments 32 pages; 12 figures

Abstract

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

2511.17625 2026-06-19 cs.MA cs.GT Updated version 85%

Iterative Negotiation and Oversight: A Case Study in Decentralized Air Traffic Management

Jaehan Im, John-Paul Clarke, Ufuk Topcu, David Fridovich-Keil

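Topic match Multi-Agent: proposes a decentralized negotiation framework for air traffic management

AI summary Proposes a regulated decentralized negotiation framework that reaches consensus through a trading auction and adds a taxation-like oversight mechanism to steer system efficiency and fairness; finite-time termination is guaranteed theoretically, and a case study validates the framework for decentralized air traffic management.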

Abstract

Achieving consensus among self-interested agents remains challenging in decentralized multi-agent systems, where agents often have conflicting preferences. Existing coordination methods enable agents to reach consensus without a centralized coordinator, but do not provide formal guarantees on system-level objectives such as efficiency or fairness. To address this limitation, we propose a regulated decentralized negotiation framework that augments a decentralized negotiation mechanism with limited regulatory oversight. The framework builds upon the trading auction for consensus, enabling self-interested agents with conflicting preferences to negotiate through asset trading while avoiding direct disclosure of private asset valuations. We introduce an oversight mechanism, which implements a taxation-like intervention that guides decentralized negotiation toward system-efficient and equitable outcomes while also regulating how fast the framework converges. We establish theoretical guarantees of finite-time termination and derive bounds linking system efficiency and convergence rate to the level of regulatory intervention. A case study based on the collaborative trajectory options program, a rerouting initiative in U.S. air traffic management, demonstrates that the framework can reliably achieve consensus among self-interested airspace sector managers, and reveals how the level of regulatory intervention regulates the relationship between system efficiency and convergence speed. Taken together, the theoretical and experimental results indicate that the proposed framework provides a mechanism for regulated decentralized coordination that preserves noncooperative final selection while safeguarding system-level objectives.

2502.19193 2026-06-19 cs.SI cs.AI cs.NE Updated version 70%

Simulation of Language Evolution under Regulated Social Media Platforms: A Synergistic Approach of Large Language Models and Genetic Algorithms

Jinyu Cai, Yusei Ishimizu, Mingyue Zhang, Munan Li, Jialong Li, Kenji Tei

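Topic match Multi-Agent: multi-agent framework simulating the evolution of users' language strategies

AI summary Proposes an LLM-based multi-agent framework that uses a genetic algorithm to simulate the iterative evolution of users' language strategies under regulation; experiments show that more dialogue rounds improve both the accuracy of information transmission and the number of uninterrupted dialogue turns.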

Comments The manuscript has been accepted to IEEE Transactions on Computational Social Systems

Abstract

Social media platforms frequently impose restrictive policies to moderate user content, prompting the emergence of creative evasion language strategies. This paper presents a multi-agent framework based on Large Language Models (LLMs) to simulate the iterative evolution of language strategies under regulatory constraints. In this framework, participant agents, as social media users, continuously evolve their language expression, while supervisory agents emulate platform-level regulation by assessing policy violations. To achieve a more faithful simulation, we employ a dual design of language strategies (constraint and expression) to differentiate conflicting goals and utilize an LLM-driven GA (Genetic Algorithm) for the selection, mutation, and crossover of language strategies. The framework is evaluated using two distinct scenarios: an abstract password game and a realistic simulated illegal pet trade scenario. Experimental results demonstrate that as the number of dialogue rounds increases, both the number of uninterrupted dialogue turns and the accuracy of information transmission improve significantly. Furthermore, a user study with 40 participants validates the real-world relevance of the generated dialogues and strategies. Moreover, ablation studies validate the importance of the GA, emphasizing its contribution to long-term adaptability and improved overall results.

2. Tool Use (1 paper)

2605.29483 2026-06-19 cs.AI Updated version 90%

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

Di Zhu, Yu Yvonne Wu, Hong Jia, Aaqib Saeed, Vassilis Kostakos, Ting Dang

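Affiliations The University of Melbourne, Australia; Dartmouth College, US; University of Auckland, New Zealand; Eindhoven University of Technology, Netherlands

Topic match Tool Use: agent framework for tool-augmented reasoning and proactive monitoring

AI summary Proposes VitalAgent, a framework that uses tool-augmented reasoning and a longitudinal physiological memory to support reactive question answering and proactive monitoring over ECG/PPG signals, improving on baselines by more than 30% on the VitalBench benchmark.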

Comments Minor revisions; results unchanged

Abstract

Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 25% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.
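
To make "dynamic computation over raw signals" more concrete, here is a minimal sketch of the kind of tool such an agent might call: it converts RR intervals (the time between successive heartbeats, derivable from ECG or PPG peaks) into a mean heart rate. The function name and input format are hypothetical and are not taken from VitalAgent; only the relation heart rate = 60 / mean RR interval (in seconds) is standard.

```python
def mean_heart_rate_bpm(rr_intervals_s):
    """Return the mean heart rate in beats per minute.

    rr_intervals_s: list of RR intervals in seconds (hypothetical tool input).
    """
    if not rr_intervals_s:
        raise ValueError("need at least one RR interval")
    mean_rr = sum(rr_intervals_s) / len(rr_intervals_s)
    # 60 seconds per minute divided by seconds per beat gives beats per minute.
    return 60.0 / mean_rr


# Illustrative values only: one beat per second corresponds to 60 bpm.
print(mean_heart_rate_bpm([1.0, 1.0, 1.0]))
```

An agent with access to a tool like this can answer a question about heart rate by computing it from the stored signal rather than relying on a static summary.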

3. Workflow Automation (3 papers)

2604.23938 2026-06-19 cs.CL Updated version 90%

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

Xiaochen Zheng, Zhiwen Jiang, David Tokar, Yexiang Cheng, Alvaro Serra, Melanie Guerard, Klas Hatje, Tatyana Doktorova

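Affiliations Computational Sciences Center of Excellence

Topic match Workflow Automation: multi-agent framework that automates generation of target safety assessment reports

AI summary Proposes TSAssistant, a multi-agent framework that uses a hierarchical instruction architecture and an interactive refinement loop to decompose target safety assessment report generation into specialized subtasks, achieving high reproducibility and evidence traceability.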

Comments Updated with quantitative and expert evaluations

Abstract

Target Safety Assessment (TSA) requires systematic integration of genetic, transcriptomic, target homology, pharmacological, and clinical data to evaluate potential safety liabilities of therapeutic targets. This process is labor-intensive and expert-dependent, posing challenges in scalability and reproducibility. We present TSAssistant, a human-in-the-loop multi-agent framework that decomposes TSA report generation into a workflow of specialized subagents: Research Subagents that each ground and cite a single TSA domain, and Synthesis Subagents that integrate findings across domains. Subagents retrieve and synthesize evidence from curated biomedical sources through standardized tool interfaces and produce individually citable, evidence-grounded sections, with behavior shaped by a hierarchical instruction architecture that separates coordination logic from domain expertise and user intent. To complement these soft constraints, programmatic execution hooks and persistent memory stores enforce hard constraints across the workflow, while an interactive refinement loop allows experts to review and revise individual sections with full conversational context preserved across iterations. Rather than a single holistic comparison, we decompose report quality into reproducibility, evidential grounding, task-level accuracy, and controllability under expert oversight, finding high reproducibility and grounding, substantial agreement with the human reference, and net-positive expert-driven refinement.

2604.08552 2026-06-19 cs.DB cs.AI Updated version 85%

Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

Josef Hardi, Martin J. O'Connor, Marcos Martinez-Romero, Jean G. Rosario, Stephen A. Fisher, Mark A. Musen

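Affiliations Division of Computational Medicine, Stanford University; Department of Biology, University of Pennsylvania

Topic match Workflow Automation: LLM agent that automates standardization of biomedical metadata

AI summary Proposes an LLM-based metadata standardization system that queries standard reporting guidelines and ontology services in real time; evaluated on 839 HuBMAP records, it consistently improves prediction accuracy over an LLM-only approach.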

Abstract

Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. Even when standard metadata reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries standard reporting guidelines and authoritative biomedical terminology services in real time to retrieve canonically correct standards on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical approach to automated standardization of biomedical metadata.
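
The exact-match assessment mentioned above can be sketched as follows: a predicted field value counts as correct only if it is identical to the gold-standard value. The function name and the sample values are illustrative assumptions, and treating the comparison as case-sensitive is a choice made here, not something the abstract states.

```python
def exact_match_accuracy(predictions, gold):
    """Fraction of predicted values that are identical to the gold values."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one record")
    matches = sum(p == g for p, g in zip(predictions, gold))
    return matches / len(gold)


# Made-up field values: only the first prediction matches exactly.
predicted = ["kidney", "left lung", "Heart"]
reference = ["kidney", "lung", "heart"]
print(exact_match_accuracy(predicted, reference))
```

Exact match is a strict metric: near misses such as a different casing or a more specific term score zero, which is why retrieving the canonical term from a terminology service can matter.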

2602.15707 2026-06-19 cs.MM cs.CL cs.LG Updated version 80%

Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili

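Affiliations Qualcomm Technologies, Inc.

Topic match Workflow Automation: real-time conversational assistant that proactively guides procedural tasks

AI summary Proposes the first real-time conversational assistant that relies only on audio and IMU modalities; a fine-tuned language model reduces unnecessary dialogue and improves question-answering accuracy, and the assistant runs on an edge device with no cloud dependence.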

Comments 5 figures. 5 more in appendix

Abstract

Real-time conversational assistants for procedural manual tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for procedural manual tasks using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. Using a furniture assembly task and a cooking task, we show how this assistant proactively communicates step-by-step instructions to a user performing a procedural task, and answers user questions. We illustrate the data generation method and the system design to achieve such an assistant. On observing that an off-the-shelf language model is a talkative assistant but is not always able to answer questions correctly, we demonstrate how finetuning the model improves its ability to limit unnecessary dialogues with a 50% increase in the precision, while also improving its ability to answer questions correctly, measured by a 150% increase in the recall of answers. We further describe how such an assistant is implemented on an edge device with no dependence on the cloud.
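
A note on reading the reported gains: a 150% increase cannot be a gain in absolute percentage points, because recall is capped at 100%, so the figures are most naturally read as relative increases. The sketch below shows that arithmetic. The before and after values are invented for illustration and are not results from the paper.

```python
def relative_increase_pct(before, after):
    """Relative change from `before` to `after`, expressed in percent."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (after - before) / before * 100.0


# Hypothetical metric values, chosen only to reproduce the stated percentages.
precision_gain = relative_increase_pct(0.40, 0.60)  # roughly a 50% relative increase
recall_gain = relative_increase_pct(0.20, 0.50)     # roughly a 150% relative increase
print(round(precision_gain), round(recall_gain))
```

Under this reading, a 150% increase means the fine-tuned model's recall is 2.5 times the baseline's.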

4. Other Agents (4 papers)

2605.13438 2026-06-19 cs.AI cs.CL Updated version 85%

CogniFold: Always-On Proactive Memory via Cognitive Folding

Suli Wang, Yiqun Duan, Yu Deng, Rundong Zhao, Dai Shi, Minghua Deng, Chen Chen, Xinliang Zhou

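Topic match Other Agents: proactive memory system with continuously emerging cognitive structure

AI summary Proposes CogniFold, a brain-inspired proactive memory system that extends Complementary Learning Systems theory to three layers (hippocampus, neocortex, and a prefrontal intent layer) and uses graph-topology self-organization to let cognitive structure emerge continuously from event streams; it performs well on both cognitive evaluation and conventional memory benchmarks.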

Comments Code is available at https://github.com/OpenNorve/CogniFold

Abstract

Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce CogniFold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across eight downstream benchmarks -- two probing long-term conversational memory (LoCoMo, LongMemEval) and six spanning other cognitive domains -- we validate that CogniFold simultaneously performs robustly on conventional memory tasks. Our code is available at https://github.com/OpenNorve/CogniFold.

2604.21804 2026-06-19 physics.ins-det hep-ex hep-ph Updated version 80%

Agentic-AI Detector Co-design and Optimization in Vertically-Integrated Differentiable Full Simulations

Wonyong Chung, Qibin Liu, Liangyu Wu, Julia Gonski

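Topic match Other Agents: AI agents integrated into detector design optimization

AI summary Proposes a bi-level optimization framework that integrates AI agents into high-energy physics detector design, jointly optimizing geometry, front-end digitization, and reconstruction algorithm parameters in differentiable full simulations and finding an optimal design point under competing performance criteria.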

Comments 7 pages, 3 figures

Abstract

We present the first implementation of AI agents into the design and optimization of detectors in high-energy physics experiments via a bi-level optimization framework that vertically integrates detector geometry, front-end digitization, and high-level reconstruction algorithm parameters in differentiable full simulations. Using the example of a dual-readout, segmented crystal EM calorimeter with a baseline resolution of $3\%/\sqrt{E}$, we investigate the capabilities and value propositions of AI agents in the identification and reduction of key detector parameters and in the nonlinear traversal of design space. We find that frontier LLM reasoning-models today, without being given additional experiment-specific context, are able to effectively execute complex workflows and proactively suggest generic but relevant avenues for further study or improvement. Here, we demonstrate an AI agent's ability to find an optimal design point amidst three competing performance criteria, showing that effective integration of agents into the complex workflows of frontier research areas can yield higher performance for key physics goals while reducing labor and compute. This study establishes the foundation for a future demonstration of the first fully AI-designed detector for future scientific facilities.
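
For readers outside calorimetry, a resolution quoted as $3\%/\sqrt{E}$ is the stochastic term of the relative energy resolution. Assuming $E$ is measured in GeV (the usual convention, though the abstract does not state the unit) and ignoring any noise or constant terms, it can be evaluated as:

```latex
\frac{\sigma_E}{E} = \frac{3\%}{\sqrt{E/\mathrm{GeV}}}
\quad\Longrightarrow\quad
\left.\frac{\sigma_E}{E}\right|_{E = 1\,\mathrm{GeV}} = 3\%, \qquad
\left.\frac{\sigma_E}{E}\right|_{E = 9\,\mathrm{GeV}} = 1\%, \qquad
\left.\frac{\sigma_E}{E}\right|_{E = 100\,\mathrm{GeV}} = 0.3\%.
```

The relative resolution therefore improves with the square root of the deposited energy.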

2603.22922 2026-06-19 cs.CL Updated version 75%

Quality Over Clicks: Iterative Reinforcement Learning for Early-Stage E-Commerce Query Suggestion

Qi Sun, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

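Affiliations Alibaba International Digital Commercial Group

Topic match Other Agents: iterative reinforcement learning framework for e-commerce query suggestion

AI summary Targets early-stage deployments where click feedback is sparse: proposes QualEQS, a quality-first iterative reinforcement learning framework that optimizes query suggestion quality along three dimensions (answerability, factuality, and information gain), uses group-level disagreement among candidate suggestions to identify ambiguous contexts and mine hard cases for iterative improvement, and raises ChatPV by 6.81% in a real e-commerce system.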

Abstract

Existing dialogue systems rely on query suggestion to enhance user engagement. Recent approaches mainly optimize generative models using click-through rate (CTR) models to align with user preferences. However, these methods are less effective in early-stage deployment scenarios, where click feedback is sparse and insufficient for training a reliable CTR model. To bridge this gap, we propose QualEQS, a quality-first iterative reinforcement learning framework for e-commerce query suggestion. We formalize actionable suggestion quality along three dimensions that directly affect downstream usability: answerability, factuality, and information gain. To continuously improve from online traffic without click supervision, we further propose group-level disagreement among candidate suggestions to identify ambiguous query contexts and mine hard training cases for iterative refinement. We also introduce EQS-Benchmark, a dataset of 16,949 real-world e-commerce queries for offline training and evaluation. Experiments show that our quality-based offline metrics correlate strongly with online performance, providing a practical evaluation recipe for sparse-feedback deployment. In both offline and online settings, QualEQS consistently outperforms strong baselines, yielding a 6.81% improvement in online ChatPV in a real-world enterprise-level conversational shopping assistant system.
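
One way to picture hard-case mining by group-level disagreement: score each candidate suggestion for a query context, then flag contexts whose scores vary widely. The abstract does not define the disagreement measure, so the sketch below uses the population standard deviation as a stand-in; the function names, scores, and threshold are all hypothetical.

```python
from statistics import pstdev


def group_disagreement(scores):
    """Spread of quality scores across candidate suggestions for one context."""
    return pstdev(scores)


def mine_hard_cases(scored_groups, threshold):
    """Return the query contexts whose candidate scores disagree beyond `threshold`."""
    return [
        context
        for context, scores in scored_groups.items()
        if group_disagreement(scores) > threshold
    ]


# Made-up quality scores for three candidate suggestions per query context.
groups = {
    "red dress": [0.9, 0.9, 0.8],     # candidates agree: likely an easy context
    "gift for dad": [0.1, 0.9, 0.5],  # candidates disagree: likely ambiguous
}
print(mine_hard_cases(groups, threshold=0.2))
```

Contexts flagged this way could then be fed back as training cases for the next iteration, which is the role the abstract assigns to mined hard cases.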

2501.18038 2026-06-19 cs.CY Updated version 60%

Acceleration AI Ethics and the Telus GenAI Conversational Agent

James Brusseau

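Topic match Other Agents: ethical application of a generative AI conversational agent

AI summary Sets out the theoretical framework of acceleration ethics and uses the case of Telus's generative AI language tool to show how acceleration AI ethics balances innovation and safety to maximize social responsibility.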

Journal ref Law Ethics Technol. 2026(2):0006

Abstract

Acceleration ethics addresses the tension between innovation and safety in artificial intelligence. The acceleration argument is that risks raised by innovation should be answered with still more innovating. This paper summarizes the theoretical position, and then shows how acceleration ethics works in a real case. To begin, the paper summarizes acceleration ethics as composed of five elements: innovation solves innovation problems, innovation is intrinsically valuable, the unknown is encouraging, governance is decentralized, ethics is embedded. Subsequently, the paper illustrates the acceleration framework with a use-case, a generative artificial intelligence language tool developed by the Canadian telecommunications company Telus. While the purity of theoretical positions is blurred by real-world ambiguities, the Telus experience indicates that acceleration AI ethics is a way of maximizing social responsibility through innovation, as opposed to sacrificing social responsibility for innovation, or sacrificing innovation for social responsibility.

5. Software Agents (2 papers)

2508.04266 2026-06-19 cs.CL Updated version 85%

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jian Dong Zhang, Xiaoyi Zeng

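Affiliations Alibaba International Digital Commercial Group

Topic match Software Agents: proposes a shopping benchmark for testing LLM agents

AI summary Proposes ShoppingBench, a benchmark of multi-level, real-world shopping intent tasks that evaluates LLM agents in a simulated environment with 2.5 million products; finds that GPT-4.1 succeeds on fewer than 50% of tasks and proposes a trajectory distillation strategy to improve smaller models.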

Comments Accepted for oral presentation at AAAI 2026

Abstract

Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-product sellers. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

2605.25160 2026-06-19 cs.AI Updated version 80%

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

Guohong Liu, Jialei Ye, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li

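Affiliations Institute for AI Industry Research (AIR), Tsinghua University; University of Electronic Science and Technology of China; MiLM Plus, Xiaomi Inc.

Topic match Software Agents: environment synthesis for GUI agent benchmarking

AI summary Addresses the gap between existing mobile GUI agent benchmarks and real-world applications by proposing SimuWoB, a fully synthetic benchmark whose robust virtual-environment generation framework synthesizes high-fidelity tasks and environments and automatically provides valid rewards, enabling efficient, reproducible evaluation of complex long-horizon interactions.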

Abstract

GUI agents powered by large language models are advancing rapidly, creating an urgent need for evaluation and training based on realistic environments. However, doing so directly in real-world environments introduces challenges that cannot be overlooked. Real-world environments are complex and uncontrollable, making it difficult to construct verifiable rewards and to save or reset states. Existing works prioritize reproducibility but are often limited to open-source apps or file-operation tasks for reliable reward building, leaving a persistent gap from real-world usage. Furthermore, relying on virtual machines or Docker images demands substantial resources and suffers from slow response speeds, which limits efficiency. We present ScaleWoB, a framework that produces high-fidelity synthesized interactive environments for GUI agents across platforms with verifiable rewards. These environments behave as backend-free webpages accessible via URL, requiring near-zero setup and low resource cost, making the approach suitable for both large-scale evaluation and downstream agent training. We support multiple GUI platforms including mobile, desktop, and automotive/in-vehicle interfaces based on the same pipeline, covering 100+ environments and 1000+ verifiable tasks. Among them, 120 challenging tasks across 63 simulated mobile applications are released as a fully synthesized mobile GUI agent benchmark. Experimental results on five state-of-the-art mobile GUI agents reveal substantial headroom: the average success rate is only 27.92%, dropping to 17.82% on the long-horizon subset, while humans reach 92.08%. A comparison against real-world sample tasks shows that assessments made in our synthetic environments generalize to real apps. The project website is at https://scalewob.github.io.

6. Planning and Decision-Making (1 paper)

2603.16865 2026-06-19 math.OC cs.SY eess.SY Updated version 80%

Prescribed-Time Distributed Generalized Nash Equilibrium Seeking

Liraz Mudrik, Isaac Kaminer, Sean Kragelund, Abram H. Clark

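Topic match Planning and Decision-Making: distributed Nash equilibrium seeking for multi-agent systems

AI summary For safety-critical multi-agent systems, proposes the first fully distributed algorithm that solves generalized Nash equilibrium problems with shared coupling constraints within a user-prescribed time T, using a multi-rate gain schedule to decouple the three coupled layers: observer, optimization, and dual consensus.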

Comments 12 pages, 5 figures

Abstract

Safety-critical multi-agent systems, from cooperative guidance to collision avoidance, must often reach a coordinated decision by a hard deadline rather than merely converge to one eventually. This paper proposes the first fully distributed algorithm that solves the generalized Nash equilibrium (GNE) problem, a non-cooperative game with shared coupling constraints and general cost coupling, at a user-prescribed time $T$ independent of initial conditions. The foundation is a centralized, prescribed-time result built on the optimization Lyapunov function framework and implemented via unnormalized Hessian-gradient feedback, chosen because, unlike the Newton and normalized Hessian-gradient realizations, it naturally splits into per-agent computations. Distributing this feedback requires each agent to run three coupled processes simultaneously: a prescribed-time observer of the global state, a local optimization law, and a dual-consensus mechanism that enforces the shared multipliers of the variational GNE. Their simultaneous operation is the core difficulty, as the optimization continually displaces the states the observers track, while estimation errors corrupt the gradients that drive the optimization. We resolve this coupling with a multi-rate gain schedule whose observer and dual-consensus layers contract strictly faster than the optimization layer, so that every error component vanishes exactly at $T$. A Fischer-Burmeister reformulation keeps the design projection-free while enforcing the constraints at the deadline. Numerical results for a Cournot game and a time-critical sensor-coverage problem validate the approach and demonstrate its use as a solver-in-the-loop for time-critical autonomy.
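
For background on the projection-free reformulation: the Fischer-Burmeister function phi(a, b) = sqrt(a^2 + b^2) - a - b is zero exactly when a >= 0, b >= 0, and a * b = 0, so a complementarity condition can be written as a single equation without a projection step. The sketch below only evaluates the function under this sign convention (some references use the opposite sign); how the paper applies it to its multipliers and constraints is not given in the abstract.

```python
import math


def fischer_burmeister(a, b):
    """phi(a, b) = sqrt(a^2 + b^2) - a - b.

    phi(a, b) == 0 holds exactly when a >= 0, b >= 0, and a * b == 0.
    """
    return math.hypot(a, b) - a - b


# Complementary pairs give zero; other pairs do not.
print(fischer_burmeister(0.0, 3.0))   # complementary pair
print(fischer_burmeister(2.0, 0.0))   # complementary pair
print(fischer_burmeister(1.0, 1.0))   # both strictly positive: nonzero
print(fischer_burmeister(-1.0, 0.0))  # negative entry: nonzero
```

Driving phi to zero thus enforces nonnegativity and complementarity together.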