arXivDaily: arXiv daily academic digest, updated Monday through Friday
1. Agents, Planning, and Decision-Making (22 papers)

2606.17209 2026-06-17 cs.AI cs.IR New submission

Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

Sidhaarth Murali, João Coelho, Jingjie Ning, João Magalhães, Bruno Martins, Chenyan Xiong

Affiliations: Carnegie Mellon University; Instituto Superior Técnico and INESC-ID, University of Lisbon; NOVA LINCS, NOVA School of Science and Technology

AI summary: For breadth scaling in agentic search, proposes DivInit, which generates diverse queries at the first turn instead of sampling them independently, mitigating query redundancy and improving multi-hop QA by 5 to 7 points on average.

Comments 15 pages, 8 figures; under review at EMNLP 2026

Abstract

Test-time scaling for agentic search typically increases depth (i.e., more turns and tokens per trajectory) or breadth (i.e., more parallel rollouts). Here we focus on breadth scaling, showing that standard parallel sampling yields diminishing returns, tracing this to query redundancy at the first turn. When models issue similar first queries across rollouts, the threads retrieve overlapping evidence, and subsequent turns are conditioned on this shared retrieval. We address this limitation with DivInit, a training-free intervention at the first turn. Rather than sampling k independent first queries, DivInit draws n candidates from a single call, picks k < n diverse seeds, and runs them as parallel trajectories. Across five open-weight models and eight benchmarks, DivInit consistently improves over standard parallel sampling, with average gains of five to seven points on multi-hop QA at matched compute. Code available at https://github.com/cxcscmu/diverse-query-initialization
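
The seed-selection step described in the abstract (draw n candidates, keep k < n diverse ones) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how diversity is measured, so the word-level Jaccard distance and the greedy max-min selection rule here are assumptions, and the names `jaccard_distance` and `pick_diverse_seeds` are hypothetical.

```python
from typing import List


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| over lowercase word sets (assumed metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def pick_diverse_seeds(candidates: List[str], k: int) -> List[str]:
    """Greedily pick k mutually distant queries from n candidates (k < n)."""
    unique = list(dict.fromkeys(candidates))  # drop exact duplicates, keep order
    if not 0 < k < len(unique):
        raise ValueError("expected 0 < k < number of distinct candidates")
    selected = [unique[0]]  # assumption: seed the selection with the first candidate
    while len(selected) < k:
        remaining = [c for c in unique if c not in selected]
        # Max-min rule: take the candidate whose nearest selected seed is farthest away.
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c, s) for s in selected),
        )
        selected.append(best)
    return selected
```

Each selected seed would then start its own parallel trajectory. An embedding-based distance could replace the Jaccard distance without changing the selection loop.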

2606.17220 2026-06-17 cs.AI New submission

When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval

Mingxu Tao, Jiawei Hu, Xian Zhou, Wenpeng Hu, Jiajun Cheng, Yunbo Cao, Zhunchen Luo, Guotong Geng

Affiliations: Center of Information Research, AMS; Discipline and Technology Research Center for Large Model Intelligence Applications; Hebei University of Engineering

AI summary: Proposes a self-evolving framework in which an LLM agent automatically generates and refines query-rewriting rules, improving BM25 on legal case retrieval without any parameter training.

Comments To appear in ACL 2026

Abstract

Legal case retrieval remains challenging due to the complexity of legal language and the need for precise lexical alignment between queries and relevant cases. Although dense retrieval models have achieved notable progress, empirical studies show that BM25 continues to serve as a strong baseline in this domain. This motivates us to propose a self-evolving framework for rule-driven query rewriting that enhances BM25 without any parameter training. The framework equips an LLM-based agent with an automatic evaluation environment, enabling it to iteratively create rewriting rules, plan validation experiments over rule combinations, and eliminate ineffective rules based on historical feedback. We evaluate our method on the Chinese legal case retrieval benchmark LeCaRD-v2. Experimental results demonstrate that the proposed framework outperforms non-evolutionary baselines, including human-designed rules and greedy rule selection, particularly when powered by a high-capacity core LLM. We also conduct detailed analyses to investigate the mechanisms underlying self-evolution. Our findings reveal that the LLM's ability to leverage previous experimental results and its intrinsic knowledge of rule elimination play critical roles in refining the rule set via self-evolution.

2606.17269 2026-06-17 cs.AI cs.SY eess.SY New submission

Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains

Carlos Eduardo Sanoja

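Affiliations: Quanta Labs, LLC; Universidad Monteávila

AI summary: For skill-constrained production-inventory systems, proposes a closed-loop model predictive controller that optimizes production, inventory, backlog, and training decisions through a mixed-integer program, evaluates it under a range of disturbances, and finds that predictive control helps only when skill bottlenecks can be forecast in advance.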

Abstract

In skill-constrained production-inventory systems, the qualified human capacity available tomorrow depends on training decisions made today: production requires certified workers, certifications decay unless maintained, and training consumes the same scarce worker hours that production needs now. We study a closed-loop skill-constrained model predictive controller that, at every shift, solves a finite-horizon mixed-integer program over production, inventory, backlog, and training, with binary predicted certification, hard production eligibility, and an interpretable terminal value that prices certified-capacity gaps at the horizon boundary; only the first-period action is applied before replanning. On synthetic, seed-controlled SkillChain-Gym scenarios - announced and surprise new-skill shocks, demand shocks, absenteeism, forecast- and availability-quality modes, capacity-boundary and training-rate sweeps, and negative controls - we evaluate the controller against production-only and maintenance-only ablations, static cross-training insurance plans, and a strong reactive heuristic, under an ex-ante locked configuration and paired statistics. The result is regime dependence, not superiority: no policy class dominates. Predictive control helps when skill or labor bottlenecks are forecastable early enough for training to complete; lean static insurance remains hard to beat under surprise shocks, near the demand-capacity boundary, and wherever pre-shock slack makes insurance cheap. Attribution ablations separate certification maintenance, re-acquisition of lapsed certifications, and greenfield skill acquisition. Forecastability, not adaptivity per se, decides when predictive control pays.

2606.17591 2026-06-17 cs.AI New submission

Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He

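Affiliations: AWS Generative AI Innovation Center; Amazon Web Services (AWS); BingX Group Limited

AI summary: To address the retention-forgetting dilemma of LLM agents in non-stationary environments, proposes a three-layer architecture (rules, evidence, skills) that governs insights through a feedback-driven curation loop; a financial forecasting case study shows significant gains in accuracy and risk-adjusted returns.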

Comments Accepted to the ICML 2026 RLxF: Reinforcement Learning from World Feedback Workshop, RLxF@ICML 2026, Seoul, South Korea

Abstract

Training-free verbal reinforcement learning enables LLM agents to learn from world feedback -- objective signals such as dynamic task outcomes, market returns, or demand forecasts -- by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma -- outcome-driven evaluation, persistent structured evidence, non-monotonic knowledge lifecycle, and compositional governance -- and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture -- rules, evidence, and skills -- connected by a feedback-driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or dramatically improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.

2606.17645 2026-06-17 cs.AI cs.CL cs.LG New submission

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

Shiqi He, Yue Cui, Feijie Wu, Xinyu Ma, Jiaheng Lu, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

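Affiliations: University of Michigan; Alibaba Group; Purdue University; McMaster University; University of Pennsylvania

AI summary: Proposes SkillMigrator, an agent that learns transferable interaction patterns (TIPs) and matches layout structure rather than element references to reuse skills across sites, cutting the LLM-action count on successful trajectories by 8-10% on WebArena and Mind2Web.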

Abstract

Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such as Mind2Web and WebArena. Recent systems therefore wrap repeated interaction fragments as web skills: callable tools built from successful trajectories or induced programs, so one call can replace several primitives. However, prior skill libraries are still triggered mainly by instruction similarity or coarse site metadata, which yields low skill reuse on held-out sites and leaves much of the potential step and token reduction on the table. We present SkillMigrator, an agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Each induced skill is stored as a transferable interaction pattern (TIP): the skill paired with a structural sketch of the snapshot at induction time. At test time, SkillMigrator retrieves TIPs by layout similarity and grounds their references on the live page. The rest of the stack is standard: accessibility-snapshot observations with stable references, and fixed tool calling over primitives plus skill invocations. Compared with the state-of-the-art approaches, SkillMigrator reduces the average LLM-action count on successful trajectories by 8-10% across both WebArena and Mind2Web at matched success rate.

2606.17871 2026-06-17 cs.AI New submission

StepGuard: Guarding Web Navigation via Single-Step Calibration

Zhihao Cui, Yuchen Zhang, Xiyang Sun, Yaxiong Wang, Li Zhu, Jinpeng Hu, Liu Liu, Mengjia Li, Yujiao Wu

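Affiliations: School of Software Engineering, Xi’an Jiaotong University; School of Computer Science and Information Engineering, Hefei University of Technology; Xiamen University; Zhejiang Lab; CSIRO

AI summary: To address single-step fragility in web navigation, proposes the StepGuard framework, which resolves reward conflict with Dynamic Dual-Policy Optimization (DDPO) and calibrates single-step errors with Confidence-Guided Adaptive Navigation Reflection (CANR), significantly improving navigation and answer accuracy.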

Abstract

Web navigation requires agents to follow natural language goals, interact with web pages, and produce accurate answers. While recent advances leverage vision-language models and reinforcement learning, existing methods still suffer from single-step fragility due to reward misalignment and error propagation. To tackle the reward entanglement, we design Dynamic Dual-Policy Optimization (DDPO), which dynamically switches between a navigation-first mode for exploration and an answer-first mode for question-answering to mitigate reward conflict. To calibrate the single-step error, we propose Confidence-Guided Adaptive Navigation Reflection (CANR), a mechanism that estimates per-step confidence, triggers reflection only when necessary, and uses contrastive rewards to encourage self-correction to calibrate the single-step inaccuracy. With the above as the main components, we finally develop our StepGuard, a new framework of Guarding Web Navigation via Single-Step Calibration. Experiments demonstrate that our approach significantly improves navigation and answer accuracy, setting new state-of-the-art performance on standard web navigation benchmarks.

2606.17929 2026-06-17 cs.AI New submission

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

Bojie Li

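Affiliations: Pine AI

AI summary: Proposes PreAct, which compiles the first successful run into a state-machine program and replays it directly on later runs instead of calling a language model at every step, yielding an 8.5-13x speedup while verifying at each step that the screen state matches.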

Abstract

Computer-using agents drive real software through the screen -- clicking and typing -- but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time it succeeds, PreAct compiles the run into a small state-machine program -- states that check the screen, transitions that act -- and on later runs replays it directly instead of invoking the agent, 8.5-13x faster, with no per-step language-model calls. Replay is not blind: at each step PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent the moment something is off. PreAct applies the same discipline when deciding what to keep: a freshly compiled program enters the store only if, re-run from a clean state, an independent evaluator confirms it solved the task -- catching programs that replay to their last step yet leave the task undone. Across a mobile, a desktop, and a web benchmark, this store-time check separates repeated runs that improve from ones that degrade as faulty programs accumulate, worth 1.75-2.6 tasks per benchmark, the same direction on all three; a fallback that explores afresh when no program fits brings PreAct level with a strong record-and-replay baseline. We also report what did not matter: prompt wording, runtime guardrails, and whether a language model or a plain embedding retriever selects which program to reuse.
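
The check-then-act replay loop and the store-time admission check described in the abstract can be sketched roughly as below. This is an illustrative toy under stated assumptions, not PreAct's code: the screen is modeled as a plain dict, and `Step`, `replay`, and `admit` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Screen = Dict[str, str]  # assumption: a screen is summarized as a small dict


@dataclass
class Step:
    expect: Callable[[Screen], bool]  # state: does the screen match expectations?
    act: Callable[[Screen], Screen]   # transition: perform the action


def replay(
    program: List[Step],
    screen: Screen,
    fallback: Callable[[Screen], Screen],
) -> Tuple[Screen, str]:
    """Replay a compiled program, handing control to `fallback` on any mismatch."""
    for i, step in enumerate(program):
        if not step.expect(screen):
            # The screen is not what the program expects: stop replaying here.
            return fallback(screen), f"fallback at step {i}"
        screen = step.act(screen)
    return screen, "replayed"


def admit(
    program: List[Step],
    clean_screen: Screen,
    evaluator: Callable[[Screen], bool],
) -> bool:
    """Store-time check: keep a program only if a clean re-run solves the task."""
    final, status = replay(program, clean_screen, fallback=lambda s: s)
    return status == "replayed" and evaluator(final)
```

In this sketch, a program that replays to its last step but leaves the task undone is rejected by `admit`, because the evaluator inspects the final screen rather than trusting that every step ran.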

2606.17076 2026-06-17 physics.ao-ph cs.AI Cross-list

CMIP-Forge: An Agentic System that Retrieves, Computes, and Self-Reviews Climate Science

Dmitrii Pantiukhin, Boris Shapkin, Ivan Kuznetsov, Thomas Jung, Nikolay Koldunov

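Affiliations: Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research

AI summary: Proposes CMIP-Forge, which combines retrieval-augmented generation with autonomous analysis and uses a multi-layered defense architecture plus independent review to carry out end-to-end autonomous research from the literature through live climate data.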

Comments 28 pages, 9 figures. Code available at https://github.com/CliDyn/cmip6_gpt

Abstract

The Coupled Model Intercomparison Project Phase 6 (CMIP6) has generated thousands of peer-reviewed publications documenting model configurations, evaluation procedures, emergent constraints, and projection uncertainties. As the community transitions toward CMIP7, efficiently extracting and operationalizing this unstructured knowledge alongside live data analysis represents a critical bottleneck. Here we present CMIP-Forge, a hybrid retrieval-augmented generation (RAG) and autonomous analysis system that bridges the gap between scientific literature and Earth System Grid Federation (ESGF) data archives. The system pairs a curated corpus of 6,581 CMIP6-related open-access publications (101,828 indexed chunks) with an agentic pipeline in which a tool-augmented worker plans and executes Python workflows over live climate data, while a panel of independent reviewer models audits its methodology end to end. CMIP-Forge introduces a multi-layered Defense-in-Depth architecture that enforces physical and methodological invariants through executable mechanisms: Abstract Syntax Tree (AST) static analysis, audited scientific primitives, and an autonomous adversarial peer-review protocol. We demonstrate the system's capabilities through end-to-end autonomous research pipelines spanning atmospheric teleconnections, ocean dynamics, regional extremes, and global warming projections. An agentic analysis system grounded in peer-reviewed literature, constrained by automated code guardrails, and audited by an independent adversarial review loop can complete complex climate-research workflows autonomously. The same experiments expose concrete failure modes of the review loop (sycophantic regression, REVISE verdicts that are never resolved, and the submission of stub code for review), each diagnosable from the immutable telemetry and provenance record released with the article.
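
To make the idea of an AST-based guardrail concrete, here is a small sketch that uses Python's standard `ast` module to flag constructs in generated code before it runs. The abstract does not list the invariants CMIP-Forge checks, so the rule sets `BANNED_CALLS` and `BANNED_MODULES` and the function name `audit` are purely hypothetical placeholders.

```python
import ast
from typing import List, Tuple

BANNED_CALLS = {"eval", "exec"}        # hypothetical rule set for illustration
BANNED_MODULES = {"os", "subprocess"}  # hypothetical rule set for illustration


def audit(source: str) -> List[str]:
    """Statically scan source code and return findings ordered by line number."""
    findings: List[Tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BANNED_CALLS
        ):
            findings.append((node.lineno, f"call to {node.func.id}()"))
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BANNED_MODULES:
                    findings.append((node.lineno, f"import of {alias.name}"))
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            if module.split(".")[0] in BANNED_MODULES:
                findings.append((node.lineno, f"import from {module}"))
    # ast.walk does not guarantee source order, so sort before reporting.
    return [f"line {line}: {msg}" for line, msg in sorted(findings)]
```

Because the code is only parsed and never executed, a check like this can run before the worker's workflow touches any data. A domain-specific guardrail would swap in rules about scientific primitives instead of the generic ones assumed here.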

2606.17317 2026-06-17 cs.RO cs.AI math.OC Cross-list

Transformer-Based Warm-Starting for Feasible and Optimal Terminal Approach to Tumbling Objects with Space Manipulators

Yuji Takubo, Maximilian Adang, Mac Schwager, Simone D'Amico

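Affiliations: Stanford University

AI summary: For real-time trajectory generation in the terminal approach of a space manipulator toward a tumbling target, proposes a causal-Transformer warm-start that decomposes the planning problem and warm-starts the attitude-manipulator torque-allocation stage, cutting iterations by up to 28% and runtime by 23% across 300 held-out scenarios while preserving the control-cost distribution.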

Comments 8 pages, 4 figures

Abstract

Real-time trajectory generation for on-orbit robotic servicing is challenging due to the nonlinear coupling between spacecraft bus motion, manipulator dynamics, visibility cone, and trajectory-level safety constraints. This paper studies learning-based warm-starting for sequential convex programming (SCP) in the terminal approach of a space manipulator toward a tumbling target. The proposed framework decomposes the problem into a system center-of-mass translational planning stage and a coupled attitude--manipulator torque-allocation stage, and applies a causal transformer warm-start to the latter, which constitutes the dominant computational bottleneck. Linear and flow matching action decoders are compared under different action-chunking and training dataset sizes, and the resulting warm-starts are evaluated under both cost-optimal and feasibility projection using SCP. Across 300 held-out scenarios, the learned warm-start reduces the second-stage SCP iteration count by up to 28% and the runtime by 23% while preserving the final control-cost distribution. When the learned warm-starts are used for nonconvex feasibility projection, they nearly halve the runtime relative to cost-optimal SCP, while avoiding the catastrophic high-cost tail behavior observed when initialized heuristically. These results indicate that sequence-model warm-starts can improve both the computational efficiency and trajectory robustness of optimization-based terminal guidance for space manipulation.

2606.17383 2026-06-17 q-fin.RM cs.AI cs.LG stat.ML Cross-list

Model Validation of Agentic AI Systems: A POMDP-Based Framework for Belief-State, Forecast, and Policy Validation

Matthew Francis Dixon

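Affiliations: Quiota LLC

AI summary: Proposes a model validation framework for agentic AI based on partially observable Markov decision processes (POMDPs), which decomposes autonomous decision making into information, belief, forecast, action, and utility components that are validated independently, and demonstrates it on a portfolio-management case study.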

Comments 28 pages, 3 figures, 6 tables. Source code available from https://github.com/mfrdixon/agentic-AI-as-POMDP

Abstract

Agentic artificial intelligence systems introduce a new class of model risk. Unlike traditional predictive models, autonomous agents continuously acquire information, form beliefs regarding latent states of the environment, generate forecasts, select actions, and adapt their behavior over time. Existing validation methodologies focus primarily on predictive accuracy and therefore provide limited insight into the quality of the underlying decision process. This paper proposes a model validation framework for agentic AI based on Partially Observable Markov Decision Processes (POMDPs). The framework decomposes autonomous decision making into information, beliefs, forecasts, actions, and utility, allowing each component to be validated independently. Large language models (LLMs) are formalized as approximate Bayesian filtering operators, and a model-risk taxonomy is developed encompassing state-space, filtering, forecast, policy, utility-specification, and parameter risks. The model risk validation methodology is demonstrated through a portfolio-management case study in which an agent infers latent market regimes from market and macroeconomic information, generates belief-conditioned forecasts, and constructs portfolios using a Black--Litterman framework. Empirical validation combines performance analysis, belief calibration diagnostics, coverage tests, ablation studies, and parameter-sensitivity analysis. The results indicate that latent-state inference contributes independently to decision quality and that the principal conclusions remain robust across a broad range of parameter values. The principal contribution of the paper is a practical framework for extending established model risk management concepts to autonomous AI systems and providing a rigorous foundation for their validation, governance, and monitoring.
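
For readers unfamiliar with the belief-state component, the exact discrete Bayes filter that an "approximate Bayesian filtering operator" would be compared against is the textbook update b'(s') ∝ O(o | s') Σ_s T(s' | s) b(s). The sketch below implements that standard formula for a fixed action. It is not taken from the paper's code, and the two-regime numbers in the usage note are made up for illustration.

```python
from typing import List


def belief_update(
    belief: List[float],
    transition: List[List[float]],
    obs_likelihood: List[float],
) -> List[float]:
    """One step of a discrete Bayes filter.

    belief[s]          -- prior probability of latent state s
    transition[s][s2]  -- P(next state s2 | current state s) for the chosen action
    obs_likelihood[s2] -- P(observation | next state s2)
    """
    n = len(belief)
    # Predict: push the prior belief through the transition model.
    predicted = [
        sum(transition[s][s2] * belief[s] for s in range(n)) for s2 in range(n)
    ]
    # Correct: reweight by how well each state explains the observation.
    unnormalized = [obs_likelihood[s2] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]
```

With a uniform prior over two hypothetical market regimes, `belief_update([0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]], [0.7, 0.2])` shifts most of the probability mass to the first regime. Belief calibration diagnostics of the kind the abstract mentions would compare an agent's stated beliefs against a reference like this.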

2606.18223 2026-06-17 cs.CR cs.AI cs.LG cs.SY eess.SY Cross-list

Learning Red Agent Policy from Observations for Neurosymbolic Autonomous Cyber Agents

Ankita Samaddar, Sandeep Neema, Daniel Balasubramanian, Xenofon Koutsoukos

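Affiliations: MIT

AI summary: Because red-agent actions are unobservable during cyber-attacks, proposes an imitation-learning-based policy learning technique that predicts red-agent behavior from network observations and defender actions; integrated with a neurosymbolic defense agent, it achieves high prediction accuracy.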

Abstract

With sophisticated cyber-attacks becoming increasingly prevalent, modern networks require intelligent autonomous cyber-defense agents trained via Reinforcement Learning (RL). These agents employ neurosymbolic approaches such as behavior trees with learning-enabled components (LECs) to learn, reason, adapt, and implement security rules while maintaining critical operations. However, these autonomous networks are partially observable systems, i.e., the cyber-attacker's (red agent's) actions are not observable, making it difficult for the defender to predict red actions, learn red policies, or assess the attacker's intrusion levels. To address this, we propose a Policy Learning Technique using imitation learning to learn policies for partially observable RL agents with discrete states and discrete actions. We apply this technique in an autonomous cyber environment to predict red agent's actions from network observations and defender actions. Integrated with a neurosymbolic cyber-defense agent, our method effectively handles different red policies and achieves high prediction accuracy across diverse simulated scenarios.

2510.19838 2026-06-17 cs.AI cs.CL cs.LG Updated version

Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory

Shiqi He, Yue Cui, Xinyu Ma, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

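AI summary: Proposes the Branch-and-Browse framework, which uses tree-structured reasoning, web state replay, and a page action memory to give LLM web agents efficient, controllable multi-branch exploration, reaching a 35.8% success rate on WebArena and cutting execution time by up to 40.4%.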

Abstract

Autonomous web agents powered by large language models (LLMs) show strong potential for performing goal-oriented tasks such as information retrieval, report generation, and online transactions. These agents mark a key step toward practical embodied reasoning in open web environments. However, existing approaches remain limited in reasoning depth and efficiency: vanilla linear methods fail at multi-step reasoning and lack effective backtracking, while other search strategies are coarse-grained and computationally costly. We introduce Branch-and-Browse, a fine-grained web agent framework that unifies structured reasoning-acting, contextual memory, and efficient execution. It (i) employs explicit subtask management with tree-structured exploration for controllable multi-branch reasoning, (ii) bootstraps exploration through efficient web state replay with background reasoning, and (iii) leverages a page action memory to share explored actions within and across sessions. On the WebArena benchmark, Branch-and-Browse achieves a task success rate of 35.8% and reduces execution time by up to 40.4% relative to state-of-the-art methods. These results demonstrate that Branch-and-Browse is a reliable and efficient framework for LLM-based web agents.

2604.22748 2026-06-17 cs.AI Updated version

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Runyi Li, Chenyu Tang, Dong Huang, Xuhang Chen, Rui Liu, Chengzu Li, Shiyi Du, Xu Huang, Haoxuan Che, Long Chen, Qifeng Chen, Wenya Wang, Wenxuan Zhang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia

Affiliations: Hong Kong University of Science and Technology; National University of Singapore; University of Oxford; Nanyang Technological University; Chinese University of Hong Kong; University of Hong Kong; University of Washington; University of Tokyo; Carnegie Mellon University; University of California, Berkeley; University of Cambridge; Singapore University of Technology and Design; Singapore Management University; XGEN Labs

AI summary: Introduces a "levels x laws" taxonomy that defines three capability levels and four governing-law regimes, synthesizes over 400 works and more than 100 systems, analyzes methods, failure modes, and evaluation practices, proposes decision-centric evaluation principles and a reproducible evaluation package, and charts a path from passive prediction to world models that reshape their environments.

Abstract

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate. Code and resources are available at: https://github.com/matrix-agent/awesome-agentic-world-modeling.

2605.29563 2026-06-17 cs.AI cs.CV cs.RO Updated version

Planning with the Views

View Planning via Scene Self-Exploration

Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas Guibas, Lijuan Wang, Manling Li

发表机构 * Northwestern University(西北大学) University of Washington(华盛顿大学) Microsoft(微软) University of Oxford(牛津大学) Stanford University(斯坦福大学)

AI总结 提出ViewSuite基准测试揭示VLM在多步视图规划中的不足,并设计迭代框架通过自我探索和视图图蒸馏将Qwen2.5-VL-7B的交互式视图规划准确率从2.5%提升至47.8%。

AI中文摘要

VLM能否预测每个相机移动如何改变视图,并提前规划许多这样的移动?我们称这种能力为视图规划,需要(1)理解单个动作如何变换视图,以及(2)在多步规划中组合许多这样的变换以识别目标视图。我们在提出的ViewSuite中探测了这两种能力,ViewSuite是一个基于真实ScanNet场景的3D点云环境。在13个前沿VLM中,出现了一个关键的规划差距:它们具备基本的视图-动作知识,但无法在多步规划中组合这些知识,并且随着视点距离的增加,差距扩大。为了缩小这一差距,我们提出了一个迭代框架,交替进行自我探索和视图图蒸馏。关键洞察是,所有探索轨迹,无论其结果如何,共同形成一个视图图,紧凑地捕捉了场景中视点如何连接。将这个图蒸馏到多样化的监督任务中,重塑了策略分布,并克服了使纯RL停滞的稀疏奖励。这将Qwen2.5-VL-7B在交互式视图规划上的准确率从2.5%提升到47.8%,超过了GPT-5.4 Pro(18.5%)和Gemini 3.1 Pro(21.4%)。自我探索成为VLM在3D空间中主动推理和规划的一条有前景的路径。

英文摘要

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, which requires (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment built on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space. Code and data are available at https://viewsuite.github.io.
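
As a rough illustration of the view-graph idea, the sketch below merges exploration trajectories, successful or not, into one directed graph and then searches it for a shortest action sequence. The view names, action labels, and the functions `build_view_graph` and `shortest_plan` are hypothetical and not taken from ViewSuite; the distillation into supervised tasks described in the abstract is not reproduced here.

```python
from collections import deque

def build_view_graph(trajectories):
    """Merge trajectories of (view, action, next_view) steps into one graph."""
    graph = {}
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph.setdefault(view, {})[action] = next_view
            graph.setdefault(next_view, {})
    return graph

def shortest_plan(graph, start, target):
    """Breadth-first search for the shortest action sequence from start to target."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        view, plan = queue.popleft()
        if view == target:
            return plan
        for action, next_view in graph.get(view, {}).items():
            if next_view not in seen:
                seen.add(next_view)
                queue.append((next_view, plan + [action]))
    return None  # target not reachable from the explored views

# Two hypothetical trajectories; neither reaches "v3" from "v0" on its own.
trajectories = [
    [("v0", "turn_left", "v1"), ("v1", "forward", "v2")],
    [("v2", "turn_right", "v3")],
]
graph = build_view_graph(trajectories)
plan = shortest_plan(graph, "v0", "v3")
print(plan)
```

Only the merged graph connects `v0` to `v3`, which is the sense in which trajectories can be useful "regardless of their outcome".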

2606.06523 2026-06-17 cs.AI cs.LG cs.LO cs.SE 版本更新

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Lean4Agent:面向智能体工作流与轨迹的形式化建模与验证

Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, Tong Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Independent researcher(独立研究者)

AI总结 提出Lean4Agent框架,利用依赖类型形式语言Lean4对智能体工作流进行形式化建模与验证,通过FormalAgentLib库和LeanEvolve方法提升工作流可靠性,实验验证通过的工作流性能平均提升11.94%。

AI中文摘要

使大型语言模型(LLMs)能够执行可靠的多步工作流已成为人工智能领域的核心挑战。尽管LLMs的智能体能力近期取得了进展,但大多数智能体系统仍缺乏用于指定、验证和调试其工作流及执行轨迹的形式化方法。这一挑战类似于数学中长期存在的问题,其中自然语言(NL)的模糊性促使了形式语言(FL)的发展。受此范式启发,我们提出了**Lean4Agent**,据我们所知,这是首个使用依赖类型形式语言Lean4来建模和验证智能体行为的框架。**Lean4Agent**推出了**FormalAgentLib**,一个可扩展的Lean4库,用于在显式假设下形式化建模和验证智能体工作流的语义一致性,并能够定位轨迹揭示的运行时故障。基于**FormalAgentLib**,我们进一步开发了**LeanEvolve**,它应用**FormalAgentLib**中的结果来修订工作流以增强其能力。在SWE-Bench-Verified的困难子集和ELAIP-Bench子集上,针对5个领先LLMs的大量实验表明,通过验证的工作流比未通过的工作流平均性能提升**11.94%**,而**LeanEvolve**进一步将SWE性能平均提升**7.47%**。此外,**Lean4Agent**为使用表达能力强的依赖类型形式语言形式化建模和验证智能体行为这一新领域奠定了基础。

英文摘要

Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflows and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivated the development of formal languages (FLs). Inspired by this paradigm, we propose **Lean4Agent**, to the best of our knowledge the first framework that uses Lean4, a dependent-type FL, to model and verify agent behavior. **Lean4Agent** introduces **FormalAgentLib**, an extensible Lean4 library for formally modeling and verifying the semantic consistency of agent workflows under explicit assumptions, and for localizing execution-time failures revealed by trajectories. Building on **FormalAgentLib**, we further develop **LeanEvolve**, which applies results from **FormalAgentLib** to revise workflows and enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of **11.94%**, and **LeanEvolve** further improves SWE performance by **7.47%** on average. Furthermore, **Lean4Agent** establishes a foundation for a new field of using expressive dependent-type FLs to formally model and verify agent behavior.

2606.10616 2026-06-17 cs.AI 版本更新

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

学习记住什么:通过约束优化实现长时域语言代理的观测安全记忆保留

Qingcan Kang, Liu Mingyang, Shixiong Kai, Kaichao Liang, Tao Zhong, Mingxuan Yuan

发表机构 * Huawei Noah's Ark Lab(华为诺亚方舟实验室) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系)

AI总结 针对长时域语言代理的有限上下文窗口,提出OSL-MR框架,将记忆保留建模为约束随机优化问题,通过在线可观测特征与离线监督的严格分离学习查询条件化的证据价值,实验表明在严格预算下优于现有方法。

AI中文摘要

长时域语言代理积累的观测、推理轨迹和检索事实会超出其有限的上下文窗口,使得记忆保留成为一个基本的资源分配问题。现有记忆系统通过启发式评分、检索优化或学习压缩来改进管理,但大多将保留视为局部决策问题,并未在现实观测约束下显式建模其长期后果。为填补这一空白,我们将记忆保留建模为一个约束随机优化问题,具有明确的预算可行性、证据效用以及延迟成本(包括遗漏惩罚、重新获取延迟和过时信息风险)。随后,我们提出OSL-MR(观测安全记忆保留学习),这是一个新颖的框架,强制执行在线可观测特征与离线可用监督(OAS)之间的严格分离。OSL-MR结合了一个从实现的证据监督中训练的证据学习器和一个混合评分启发式,该启发式既作为可部署的在线安全基线,又作为结构化的归纳先验用于学习。由此产生的策略直接从交互数据中学习查询条件化的证据价值,同时在同一观测约束下保持可部署性。在LOCOMO和LongMemEval上的实验表明,OSL-MR在严格记忆预算下持续优于基于最近性的方法、生成式代理风格评分和其他启发式基线。混合评分先验在保持召回率的同时进一步提高了精确度,敏感性分析表明其在广泛的成本配置下具有鲁棒性。

英文摘要

Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their context windows, making memory retention -- what to keep, discard, or later recover under a fixed budget -- central to sustained performance. Most systems score memories with local rules such as recency or relevance, ignoring the delayed costs of retention: future retrieval failures, recomputation, and stale-information use. We formulate retention as a constrained, partially observable stochastic optimization problem in which current decisions shape information demands revealed only later, and prove its single-step version NP-hard. Since exact optimization is intractable and future demands unknown, we develop OSL-MR (Observability-Safe Learning for Memory Retention), a learning-augmented approximation for deployable memory control. Its core principle is observability separation: deployed decisions use only online-observable signals, while supervision from evidence realized after an interaction is used solely for offline learning. OSL-MR pairs a budget-aware Mixed-Score heuristic (a cold-start policy and inductive prior) with an evidence learner predicting which memories later serve as evidence. As the cumulative objective is non-decomposable and combinatorial, the learner is trained on evidence-membership signals rather than reward, a tractable, deployable target. On LoCoMo and LongMemEval, OSL-MR consistently outperforms strong heuristic and imitation-learning baselines, especially under tight budgets, and is robust across cost settings. On exactly-solvable instances, retention is genuinely multi-step: a perfect single-step optimizer is far from optimal, whereas OSL-MR stays near the dynamic-programming optimum. These results establish constrained stochastic optimization and optimization-guided learning as a scalable foundation for memory in long-horizon agents.

2606.16070 2026-06-17 cs.AI 版本更新

Mind-Studio: Executable World Models with Lookahead Evaluation for Partially Observable Games

Mind-Studio: 针对部分可观测游戏的可执行世界模型与前向评估

Yifei Dong, Mingen Zheng, Linquan Wu, Jeff Z. Pan, Jiaxin Bai

发表机构 * Hong Kong University of Science and Technology(香港科技大学) City University of Hong Kong(香港城市大学) University of Edinburgh(爱丁堡大学) Hong Kong Baptist University(香港浸会大学)

AI总结 提出Mind-Studio框架,利用大语言模型从轨迹合成可执行的pygame风格世界模型,通过K步前向保真度协议评估,在Montezuma's Revenge等游戏中显著提升预测准确性和子目标验证。

Comments 12 pages, 2 figures

AI中文摘要

世界模型合成旨在将交互经验转化为环境动态的内部模型。现有的符号方法通常拟合观测到的转移或局部规则的混合,但它们不会产生一个可以独立于真实环境运行的完整可执行程序。我们提出了Mind-Studio,一个利用大语言模型从状态-动作-下一状态轨迹合成可执行的pygame风格世界模型的框架。Mind-Studio将熵选择轨迹与一个轻量级游戏技能文件相结合,该文件包含从截图中提取的对象、动作和静态场景信息。我们使用K步前向保真度协议评估合成质量,该协议将生成的世界模型 rollout 与来自相同状态的Real-ALE rollout进行比较。在Montezuma's Revenge上,Mind-Studio将选定动作的下一状态预测从PoE-World的0.3%提高到48.7%,同时验证了8个子目标中的5个;在Alien、Assault和Skiing上,它实现了比先前学习的前向源更强的分支级保真度。

英文摘要

World-model synthesis aims to turn interaction experience into an internal model of environment dynamics. Existing symbolic approaches often fit observed transitions or mixtures of local rules, but they do not produce a complete executable program that can run independently of the real environment. We present Mind-Studio, a framework that synthesizes executable pygame-style world models from state-action-next-state trajectories using large language models. Mind-Studio combines entropy-selected traces with a lightweight game skill file containing object, action, and static scene information extracted from screenshots. We evaluate synthesis quality with a K-step lookahead fidelity protocol that compares generated world-model rollouts against Real-ALE rollouts from the same state. On Montezuma's Revenge, Mind-Studio improves chosen-action next-state prediction from 0.3% for PoE-World to 48.7% while verifying 5 of 8 subgoals; across Alien, Assault, and Skiing, it achieves stronger branch-level fidelity than prior learned lookahead sources.
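
To make the idea of K-step lookahead fidelity concrete, here is a minimal sketch that rolls a synthesized model and a reference environment forward from the same state and counts how often their states agree. The scoring rule (fraction of matching steps), the function `k_step_fidelity`, and the toy 1-D environment are assumptions for illustration only; the protocol in the paper may score rollouts differently.

```python
def k_step_fidelity(model_step, real_step, state, actions):
    """Fraction of the K steps on which the model's state matches the real one.

    Both rollouts start from `state`; each then follows its own predicted states,
    so an early error can carry over into later steps.
    """
    model_state, real_state = state, state
    matches = 0
    for action in actions:
        model_state = model_step(model_state, action)
        real_state = real_step(real_state, action)
        matches += model_state == real_state
    return matches / len(actions)

# Toy 1-D world: the real environment clamps position to [0, 3];
# the hypothetical synthesized model forgets the upper wall.
real = lambda x, a: max(0, min(3, x + a))
model = lambda x, a: max(0, x + a)

score = k_step_fidelity(model, real, 2, [1, 1, -1, -1])
print(score)  # 0.25: only the first step matches, later steps inherit the error
```

A one-step check from each real state would score this model much higher, which is why multi-step rollouts are the more demanding test.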

2508.02721 2026-06-17 cs.SE cs.AI cs.PL 版本更新

Blueprint First, Model Second: A Framework for Deterministic LLM Workflow

蓝图优先,模型其次:确定性LLM工作流框架

Libin Qiu, Yuhang Ye, Zhirong Gao, Xide Zou, Junfu Chen, Ziming Gui, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, Kun Zhao

发表机构 * Alibaba(阿里巴巴)

AI总结 提出“蓝图优先,模型其次”框架,通过将工作流逻辑解耦为源代码蓝图并由确定性引擎执行,LLM仅处理子任务,在TravelPlanner上最终通过率提升97.6%,约束违反减少96.0%。

Comments 12 pages, 7 figures, 6 tables

AI中文摘要

尽管强大,大型语言模型(LLM)智能体固有的非确定性限制了它们在结构化操作环境中的应用,这些环境要求程序保真度和可预测执行。这一限制源于当前架构将概率性的高级规划与低级动作执行混淆在单一生成过程中。为解决此问题,我们引入了 Source Code Agent 框架,这是一种基于“蓝图优先,模型其次”哲学的新范式,将工作流逻辑与生成模型解耦。首先将专家定义的操作程序编纂为基于源代码的执行蓝图,然后由确定性引擎执行。LLM被策略性地调用作为专门工具,处理工作流中有界、复杂的子任务,但从不决定工作流的路径。我们在TravelPlanner基准上评估约束感知的旅行规划。Source Code Agent 在相同Claude-Sonnet-4骨干上实现了35.56%的最终通过率,比最先进的ATLAS基线(18.00%)提高了97.6%。关键的是,它将约束违反减少了96.0%(11次对比275次),同时将执行效率提高了27.1%(10.2±0.7步对比14.0步)。两个生产事故诊断部署以及在ScienceWorld和ALFWorld上的额外结果证实,该架构可迁移到旅行规划之外的程序定义明确、约束密集型的工作流。我们的工作使得在受严格程序逻辑约束的应用中,自主智能体能够可验证且可靠地部署。

英文摘要

While powerful, the inherent non-determinism of large language model (LLM) agents limits their application in structured operational environments where procedural fidelity and predictable execution are strict requirements. This limitation stems from current architectures that conflate probabilistic, high-level planning with low-level action execution within a single generative process. To address this, we introduce the Source Code Agent framework, a new paradigm built on the "Blueprint First, Model Second" philosophy that decouples workflow logic from the generative model. An expert-defined operational procedure is first codified into a source code-based Execution Blueprint, which is then executed by a deterministic engine. The LLM is strategically invoked as a specialized tool to handle bounded, complex sub-tasks within the workflow, but never to decide the workflow's path. We evaluate on the TravelPlanner benchmark for constraint-aware travel planning. The Source Code Agent achieves a 35.56% final pass rate, a 97.6% improvement over the state-of-the-art ATLAS baseline (18.00%) on the same Claude-Sonnet-4 backbone. Critically, it reduces constraint violations by 96.0% (11 vs 275) while improving execution efficiency by 27.1% (10.2±0.7 steps vs 14.0). Two production incident-diagnosis deployments and additional results on ScienceWorld and ALFWorld confirm that the architecture transfers beyond travel planning to procedurally well-defined, constraint-intensive workflows. Our work enables the verifiable and reliable deployment of autonomous agents in applications governed by strict procedural logic.
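
The three headline percentages are relative changes, which is easy to misread next to pass rates that are themselves percentages. The short check below recomputes them from the figures quoted in the abstract; the helper name `relative_change` is ours, not the paper's.

```python
def relative_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0

pass_rate_gain = relative_change(35.56, 18.00)  # final pass rate (in %)
violation_change = relative_change(11, 275)     # constraint violations (count)
step_change = relative_change(10.2, 14.0)       # mean steps per task

print(f"{pass_rate_gain:.1f}% {violation_change:.1f}% {step_change:.1f}%")
# prints "97.6% -96.0% -27.1%"
```

In absolute terms the pass rate rises by 17.56 percentage points (from 18.00% to 35.56%); the 97.6% figure expresses that gain relative to the baseline.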

2603.18897 2026-06-17 cs.DC cs.AI 版本更新

Parallelizing Tool Execution and LLM Generation for Low-Latency Agent Serving

并行化工具执行与LLM生成以实现低延迟代理服务

Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, Kaiqiang Xu, Kai Chen, Yuqing Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Microsoft Research(微软研究院) Stevens Institute of Technology(史蒂文斯理工学院) Google(谷歌) Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出PASTE系统,通过预测性执行未来工具调用与LLM生成并行,减少任务完成时间43.5%。

AI中文摘要

基于LLM的代理通过模型生成和工具执行的顺序循环来执行任务。当今的服务系统串行化此循环,使工具延迟暴露在任务关键路径上。本文提出PASTE,一个工具感知的代理服务系统,它从重复的代理模式中预测具体的未来工具调用,并在LLM仍在生成时推测性执行它们。PASTE将推测结果隔离,直到LLM确认,并联合调度工具执行和返回的LLM会话,以避免将瓶颈转移到GPU。在深度研究、编码和科学代理工作负载上,PASTE将平均任务完成时间减少43.5%,并将观察到的工具延迟降低1.8倍。

英文摘要

LLM-powered agents execute tasks through a sequential loop of model generation and tool execution. Today's serving systems serialize this loop, leaving tool latency exposed on the task critical path. This paper presents PASTE, a tool-aware agent-serving system that predicts concrete future tool invocations from recurring agent patterns and executes them speculatively while the LLM is still generating. PASTE isolates speculative results until confirmed by the LLM and jointly schedules tool execution and returning LLM sessions to avoid shifting bottlenecks to the GPU. Across deep research, coding, and scientific-agent workloads, PASTE reduces average task completion time by 43.5% and lowers observed tool latency by 1.8x.
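
The overlap between generation and tool execution can be sketched with a thread pool: a predicted tool call starts while the model is still generating, and its result is released only if the model ends up requesting that exact call. Everything here (`run_tool`, `generate`, `step_with_speculation`, the sleep-based latencies) is a hypothetical stand-in, not PASTE's implementation, and the sketch ignores scheduling and GPU contention.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_tool(call):
    """Stand-in for a slow tool invocation (hypothetical)."""
    time.sleep(0.2)
    return f"result of {call}"

def generate(prompt):
    """Stand-in for LLM generation that ends by requesting a tool call."""
    time.sleep(0.2)
    return "search('agent serving')"

def step_with_speculation(prompt, predicted_call, pool):
    # Start the predicted tool call while the model is still generating.
    speculative = pool.submit(run_tool, predicted_call)
    actual_call = generate(prompt)
    if actual_call == predicted_call:
        # Prediction confirmed: release the held-back speculative result.
        return speculative.result(), True
    # Misprediction: ignore the speculative result and run the real call.
    # cancel() only prevents a call that has not started yet, so speculated
    # tools must be safe to run and throw away.
    speculative.cancel()
    return run_tool(actual_call), False

with ThreadPoolExecutor(max_workers=2) as pool:
    result, hit = step_with_speculation("q", "search('agent serving')", pool)
    miss_result, miss_hit = step_with_speculation("q", "read_file('a.txt')", pool)
```

On a correct prediction the step takes roughly the longer of the two latencies instead of their sum; on a misprediction it costs no more wall-clock time than the serial loop, but wastes the speculative work.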

2605.12513 2026-06-17 cs.SI cs.AI 版本更新

SP-GCRL: Influence Maximization on Incomplete Social Graphs

SP-GCRL:在不完整社交图上的影响力最大化

Haohua Niu, Yuxuan Yang, Lingfeng Zhang, Hao Li, Jiao Liang, Zongfu Luo, Luca Rossi

AI总结 本文提出SP-GCRL框架,通过社交传播感知的图对比强化学习实现端到端的种子选择,解决了不完整社交图和非平稳扩散动态带来的挑战,提升了效率和可扩展性。

Comments Accepted by DASFAA 2026. The first two authors contributed equally

AI中文摘要

在现实平台中,影响力最大化(IM)受到不完整、噪声社交图和非平稳扩散动态的挑战。我们提出了SP-GCRL,一种社交传播感知的图对比强化学习框架,该框架在部分可观测性下学习端到端的种子选择。我们首先引入了一种社交传播感知的非线性扩散函数,以建模强化/衰减效应和概率漂移;然后构建了双结构视图,并执行对比学习以获得对缺失边和弱连接具有鲁棒性的节点表示,同时用基于GAT的回归替代昂贵的策略度量以提高效率和可扩展性;最后,我们使用DDQN在这些表示上学习端到端的种子选择策略。在多个真实世界网络上的实验表明,SP-GCRL在预算和拓扑结构上均显著优于启发式和基于学习的基线,同时保持了强大的大规模可扩展性。

英文摘要

Influence maximization (IM) on real platforms is challenged by incomplete, noisy social graphs and non-stationary diffusion dynamics. We propose SP-GCRL, a social-propagation-aware graph contrastive reinforcement learning framework that learns end-to-end seed selection under partial observability. We first introduce a social-propagation-aware nonlinear diffusion function to model reinforcement/diminishing effects and probability drift under repeated exposure; we then construct dual structural views and perform contrastive learning to obtain node representations robust to missing edges and weak ties, while replacing expensive strategy metrics with a GAT-based regression surrogate to improve efficiency and scalability; finally, we use DDQN to learn an end-to-end seed selection policy on top of these representations. Experiments on multiple real-world networks show that SP-GCRL achieves significant gains over heuristic and learning-based baselines across budgets and topologies, while maintaining strong large-scale scalability.

2605.26195 2026-06-17 cs.CR cs.AI 版本更新

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

CyberEvolver:面向网络安全代理的即时结构化自我进化

Yihe Fan, Changyi Li, Lichen Xu, Xudong Pan, Jiarun Dai, Hong Geng, Min Yang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Pudong Research Institute of Cryptology(上海浦东密码研究院)

AI总结 提出CyberEvolver框架,通过四层可进化架构、痕迹诊断机制和种群波束搜索,实现网络安全代理基于失败经验的支架自我进化,平均成功率提升13.6%。

AI中文摘要

基于LLM的代理越来越多地用于网络安全任务,但现有系统大多依赖固定的、人工设计的支架,难以适应不同的目标和失败模式。我们提出了CyberEvolver,一个自我进化的网络安全代理框架,它根据失败执行尝试的经验迭代地修改自己的支架。网络安全中的自我进化具有挑战性,因为可能的支架变化空间在很大程度上是非结构化的,执行反馈稀疏且常被环境掩盖,低多样性的更新可能导致错误在重复迭代中累积。CyberEvolver通过四层可进化代理架构(将支架优化分解为结构化组件)、痕迹诊断机制(将嘈杂的执行日志转化为可操作的修订信号)以及基于种群的波束搜索策略(在进化过程中保留多样化的代理变体)来应对这些挑战。我们在CTF挑战、漏洞利用和渗透测试任务上使用四个开源LLM评估了CyberEvolver。在这些设置中,CyberEvolver将种子代理的成功率平均提高了13.6%,并优于六个人工设计的网络安全代理以及两种从其他领域改编的自我改进方法。这些结果表明,支架自我进化为构建用于安全测试的自适应LLM代理提供了一个有前景的方向。

英文摘要

LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce CyberEvolver, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts. Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often obscured by the environment, and low-diversity updates can cause errors to compound over repeated iterations. CyberEvolver addresses these challenges with a four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, a trace-to-diagnosis mechanism that converts noisy execution logs into actionable revision signals, and a population-based beam search strategy that preserves diverse agent variants during evolution. We evaluate CyberEvolver on CTF challenges, vulnerability exploitation, and penetration-testing tasks using four open-source LLMs. Across these settings, CyberEvolver improves the seed agent's success rate by 13.6% on average, and outperforms six human-designed cybersecurity agents as well as two self-improvement methods adapted from other domains. These results suggest that scaffold self-evolution is a promising direction for building adaptive LLM agents for security testing.

2606.15903 2026-06-17 cs.CL cs.AI 版本更新

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

控制平面放置塑造遗忘:跨十三种系统配置的智能体记忆架构研究

Dongxu Yang

发表机构 * DeepLethe

AI总结 研究LLM在智能体记忆管道中的位置(控制平面 vs 召回平面)对遗忘失败模式的影响,通过13种配置在385例对抗测试集上的实验,揭示了三种放置机制的互补覆盖范围,并提出了ForgetEval评估套件。

Comments 25 pages including appendices. Code, benchmark, and adapters released under MIT at https://github.com/deeplethe/lethe

AI中文摘要

LLM在智能体记忆管道中的位置——位于检索存储事实(广泛基准测试)的召回平面和通过替换、释放、清除来改变事实(基本未经测试)的控制平面之间——决定了系统能够恢复哪些遗忘失败模式。通过在385例对抗测试集上比较十三种系统配置,我们观察到三种具有部分互补覆盖范围的放置机制:确定性原语足以处理词汇/时间类别,但无法处理规范化(标识符混淆上5%,跨语言上0%);写入时LLM可以恢复规范化(100%),但无法处理意图感知删除(前缀冲突和复合事实为0%);变异时钩子可以恢复意图感知删除(78-85%),并同时提升几乎所有类别的性能(整体91.7-93.2%,每385例运行成本0.17美元,每例变异延迟2.3秒,而确定性方法为64-191毫秒,召回路径不变)。我们通过ForgetEval揭示了这种权衡,ForgetEval包含1000例模板化套件和385例对抗层(132例手工制作+253例LLM生成并经预言机验证),通过确定性子串匹配评分,并配有一个六方法适配器协议,采用诚实的N/A评分,允许异构记忆存储以130行代码接入。该协议通过10名标注者的IAA(Fleiss' kappa = 0.958)和77例外部作者子集(四位盲贡献者)得到验证,该子集复现了规范化不对称性并放大了联合放置的提升(+27.8个百分点)。生产环境中的失败主要是遗忘失败而非召回失败,但现有基准仅衡量召回。ForgetEval和所有适配器均以MIT许可发布。

英文摘要

Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical/temporal categories but fail canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) and brightens nearly all categories simultaneously (91.7-93.2% overall, $0.17 per 385-case run, 2.3s/case mutation latency vs. 64-191ms/case deterministic, recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted oracle-validated) scored by deterministic substring match, paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores enter in 130 lines. Admission is corroborated by 10-annotator IAA (Fleiss' kappa = 0.958) and a 77-case external-authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.

2. 知识表示、推理与符号AI 8 篇

2606.17851 2026-06-17 cs.AI cs.LO 新提交

A homotopy-type-theoretic generalization of neurosymbolic inference

同伦类型论对神经符号推理的推广

Fernando Zhapa-Camacho, Robert Hoehndorf

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学) KAUST Center of Excellence for Smart Health (KCSH)(KAUST智能健康卓越中心) KAUST Center of Excellence for Generative AI(KAUST生成式人工智能卓越中心)

AI总结 本文用同伦类型论替换集合,将神经符号系统的信念加权和泛化为信念加权同伦基数,保留对称性和证明多样性,并证明经典函数是特例,从而避免推理捷径。

AI中文摘要

广泛的神经符号系统计算一个泛函:在σ-结构空间上逻辑量的信念加权和,其中加权模型计数、模糊逻辑和概率逻辑是特例。这种描述基于集合,而集合有意忽略了两个对神经符号系统重要的方面:两个σ-结构何时在理论对称性下相同,以及有多少不同的证明见证一个查询。将底层集合替换为类型(在同伦类型论意义上)保留了这些信息,并将该泛函转变为信念加权同伦基数——一种按对称性倒数计数对象的大小概念。我们从头为神经符号系统开发了该框架,证明了当对称性平凡时恢复经典泛函的保守性定理,并表明我们的框架暴露的对称性正是推理捷径背后的对称性。实际收益是具体的:最近通过集成或表达性密度估计实现的捷径感知概念后验,是混淆集单纯形上唯一的对称不变点,可通过在对称群上平均单个模型以闭式形式计算。在MNIST推理捷径基准上,这种单模型包装器比多样性训练的集成具有更好的校准性,同时保持标签准确性和可识别概念不变。代码可在 https://github.com/bio-ontology-research-group/hott-nesy 免费获取。

英文摘要

A wide range of neurosymbolic (NeSy) systems compute one functional: a belief-weighted sum of a logical quantity over a space of σ-structures, of which weighted model counting, fuzzy logic, and probabilistic logic are special cases. This account is built on sets, and a set deliberately forgets two things that are important for NeSy: when two σ-structures are the same up to a symmetry of the theory, and how many distinct proofs witness a query. Replacing the underlying sets by types, in the sense of homotopy type theory, preserves this information, and turns this functional into a belief-weighted homotopy cardinality, a notion of size that counts each object in inverse proportion to its symmetries. We develop the framework from scratch for NeSy systems, prove a conservativity theorem that recovers the classical functional when symmetries are trivial, and show that the symmetry our framework exposes is exactly the one behind reasoning shortcuts. The payoff is concrete: the shortcut-aware concept posterior that recent methods reach by ensembling or expressive density estimation is the only symmetry-invariant point of the confusion-set simplex, computable in closed form by averaging a single model over the symmetry group. On MNIST reasoning-shortcut benchmarks this single-model wrapper is better calibrated than a diversity-trained ensemble, while leaving label accuracy and identifiable concepts untouched. Code is freely available at https://github.com/bio-ontology-research-group/hott-nesy.
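
For readers unfamiliar with homotopy cardinality, the first line below gives the standard definition for a finite groupoid $X$ (the groupoid cardinality of Baez and Dolan), where $\pi_0(X)$ is the set of isomorphism classes. The second line is only a guess at the shape a belief-weighted version might take; the symbols $F$, $b$ (belief weight), and $q$ (logical quantity) are hypothetical and are not taken from the paper.

```latex
\[
  |X| \;=\; \sum_{[x] \in \pi_0(X)} \frac{1}{|\mathrm{Aut}(x)|}
\]
\[
  F(b, q) \;=\; \sum_{[x] \in \pi_0(X)} \frac{b(x)\, q(x)}{|\mathrm{Aut}(x)|}
  \qquad \text{(assumed form, for illustration only)}
\]
```

When every automorphism group is trivial, each denominator equals 1 and the assumed expression reduces to the plain belief-weighted sum $\sum_{x} b(x)\, q(x)$, which matches the kind of conservativity the abstract describes.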

2606.17882 2026-06-17 cs.AI 新提交

Structural Preservation and the Logical Expressiveness of Graph Neural Networks

结构保持与图神经网络的逻辑表达能力

Przemysław Andrzej Wałęga, Bernardo Cuenca Grau

发表机构 * Queen Mary University of London(伦敦玛丽女王大学) University of Oxford(牛津大学)

AI总结 本文从语义角度研究图神经网络分类器在结构保持(嵌入、单同态、同态)下的逻辑表达能力,证明每种保持性质对应分级模态逻辑的一个片段,并给出相应GNN架构。

Comments 20 pages

AI中文摘要

通过固定架构选择(如聚合、组合和激活函数的类型),已经在图神经网络(GNN)和逻辑形式体系之间建立了桥梁。这些选择定义了受限的GNN类,通过证明逻辑公式可以翻译为等价的GNN,反之GNN也可以翻译为等价的公式,从而可以获得与逻辑形式体系的紧密对应。在本文中,我们采取语义视角,通过建立那些在结构性质(嵌入、单同态和同态)下保持的GNN分类器类的逻辑表达能力。我们证明,对于每个这样的性质,存在一个分级模态逻辑的片段,刻画了该GNN类。特别地,在嵌入、单同态和同态下的保持分别对应于存在性分级模态逻辑、其存在-正片段以及存在-正模态逻辑。这些结果刻画了广泛GNN类的表达能力,独立于具体的架构选择,但我们也证明每个这样的类都承认一个具有相同表达能力的GNN架构。在技术上,我们的方法使用了有界高度树的一个新的良拟序结果,从而得到了展开不变类的有限表示。

英文摘要

Bridges between graph neural networks (GNNs) and logical formalisms have been established by fixing architectural choices, such as the types of aggregation, combination, and activation functions. These choices define restricted classes of GNNs for which tight correspondences with logical formalisms can be obtained, by showing that logical formulae can be translated into equivalent GNNs and, conversely, that GNNs can be translated into equivalent formulae. In this paper we take a semantic perspective by establishing the logical expressiveness of classes of GNN classifiers that are preserved under structural properties: embeddings (extensions), injective homomorphisms, and homomorphisms. We show that, for each such property, there exists a fragment of graded modal logic characterising the class of GNNs. In particular, preservation under embeddings, injective homomorphisms, and homomorphisms corresponds to existential graded modal logic, its existential-positive fragment, and existential-positive modal logic, respectively. These results characterise the expressiveness of broad classes of GNNs independently of specific architectural choices, but we also show that each of these classes admits a GNN architecture of the same expressiveness. Technically, our approach uses a new well-quasi-order result for trees of bounded height, yielding finite representations of unravelling-invariant classes.

2606.18098 2026-06-17 cs.AI 新提交

IsabeLLM: Automated Theorem Proving Applied to Formally Verifying Consensus

IsabeLLM: 自动化定理证明应用于共识的形式化验证

Elliot Jones, William Knottenbelt

发表机构 * Imperial College London(伦敦帝国学院)

AI总结 本文改进IsabeLLM自动化定理证明工具,通过检索增强生成、错误追踪和反例生成提升大语言模型上下文,并兼容最新Isabelle和Sledgehammer,用于验证比特币工作量证明共识。

AI中文摘要

人工智能(AI)的进步使得AI用于定理证明成为形式化验证计算机系统的一种有前景的方法。尽管由于所需专业知识和努力,形式化验证传统上仅限于安全关键系统,但AI可以帮助自动化大量工作负载,使其更易访问。基于区块链的系统越来越受欢迎,并经常成为恶意行为者的目标,常常导致巨大的财务损失,这凸显了更好地验证这些系统和缓解漏洞的必要性。可以说,这些系统中最重要的组件是共识协议,它允许节点在潜在对抗环境中达成决策。在本文中,我们改进了IsabeLLM,即Isabelle中的自动化定理证明工具。具体而言,我们实现了检索增强生成框架、错误追踪和反例生成,以改善提供给大语言模型的上下文。还实现了与最新版本Isabelle和Sledgehammer的兼容性,以提高效率。我们比较了两个版本IsabeLLM在完成比特币工作量证明共识验证方面的性能。

英文摘要

Advances in Artificial Intelligence (AI) have made AI for Theorem Proving a promising means of formally verifying computer systems. Whilst formal verification is traditionally reserved for safety-critical systems due to the expertise and effort it requires, AI can help to automate a large amount of this workload and make it far more accessible. Blockchain-based systems are becoming increasingly popular and are frequently targeted by malicious actors, often resulting in huge financial losses, highlighting the need to better verify these systems and mitigate vulnerabilities. Arguably the most important component of these systems is the consensus protocol, which allows nodes to agree on decisions in a potentially adversarial environment. In this paper, we improve upon IsabeLLM, the automated theorem proving tool in Isabelle. Namely, we implement a Retrieval-Augmented Generation framework, error tracing, and counterexample generation to improve the context supplied to the Large Language Model. Compatibility with the latest version of Isabelle and Sledgehammer is also implemented for improved efficiency. We compare the performance of the two versions of IsabeLLM in their ability to complete the verification of Bitcoin's Proof of Work consensus.

2606.17073 2026-06-17 cs.RO cs.AI 交叉投稿

Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF

提取语义:从URDF自动构建机器人本体的LLM引导方法

Bastien Dussard, Guillaume Sarthou

发表机构 * LAAS-CNRS, Department of Robotics, Toulouse, France(法国图卢兹机器人系CNRS实验室)

AI总结 提出利用大语言模型从URDF文件自动生成机器人语义本体,通过多数投票和语法验证确保与现有本体对齐,初步实验表明该方法能有效桥接低层描述与高层知识表示。

Journal ref
18th International Conference on Social Robotics (ICSR 2026), University of London, Jul 2026, London, United Kingdom
AI中文摘要

虽然常识知识可能足以满足虚拟代理的需求,但与人类交互的具身机器人需要对其环境和自身物理形态具有基于现实的、语义丰富的表示。在认知机器人学中,本体论能够有效整合这种异构知识,以支持可解释的推理,即使在持续知识更新过程中也是如此。然而,手动构建本体仍然是一个瓶颈。我们提出了一种初步方法,通过将统一机器人描述格式(URDF)模型转换为填充的本体,自动生成机器人语义抽象。尽管URDF文件提供了结构和运动学描述,但其标识符通常需要常识解释才能恢复有意义的语义,而大语言模型(LLM)擅长此任务。我们的流程利用LLM,通过用现有本体中的概念提示它们来推断语义关系,确保最终分类与形式模型保持一致。为了提高可靠性,该流程结合了跨多个LLM查询的多数投票以及语法和模式级验证,以确保生成的输出符合预期的表示格式和本体约束。我们在多个机器人描述上评估了该方法,并讨论了生成的抽象。初步结果表明,所提出的方法能够有效弥合低层机器人描述与人机交互所需的结构化、基于现实的知识表示之间的差距。

英文摘要

While commonsense knowledge may suffice for virtual agents, embodied robots interacting with humans require grounded and semantically rich representations of both their environment and their own physical embodiment. In cognitive robotics, ontologies are effective for integrating such heterogeneous knowledge to enable explainable reasoning, even during continuous knowledge updates. Yet, their manual construction remains a bottleneck. We present a preliminary approach for the automatic generation of robot semantic abstractions by transforming Unified Robot Description Format (URDF) models into populated ontologies. Although URDF files provide structural and kinematic descriptions, their identifiers often require commonsense interpretation to recover meaningful semantics, a task at which Large Language Models (LLMs) excel. Our pipeline leverages LLMs to infer semantic relationships by prompting them with concepts from an existing ontology, ensuring the final classification remains aligned with the formal model. To improve reliability, the pipeline combines majority voting across multiple LLM queries along with syntactic and schema-level validation to ensure that generated outputs conform to the expected representation format and ontology constraints. We evaluate the approach on multiple robot descriptions and discuss the generated abstractions. Initial results indicate that the proposed method can effectively bridge the gap between low-level robot descriptions and the structured, grounded knowledge representations required for human-robot interaction.
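
The reliability step (majority voting plus validation against the ontology) can be sketched in a few lines. The concept names, the example link name, and the function `majority_vote` are hypothetical, and the strict-majority threshold is an assumption; the pipeline in the paper may aggregate and validate differently.

```python
from collections import Counter

def majority_vote(answers, allowed_concepts):
    """Pick the most frequent answer that names a concept in the ontology.

    Returns None when no valid answer reaches a strict majority of all queries.
    """
    # Schema-level check: drop answers that are not existing ontology concepts.
    valid = [a for a in answers if a in allowed_concepts]
    if not valid:
        return None
    concept, count = Counter(valid).most_common(1)[0]
    return concept if count > len(answers) / 2 else None

# Hypothetical ontology concepts and five LLM answers for a URDF link
# named "r_gripper_l_finger".
allowed = {"Gripper", "Finger", "Wheel", "Camera"}
answers = ["Finger", "Finger", "Gripper", "Finger", "Fingertip"]
label = majority_vote(answers, allowed)
print(label)  # Finger
```

Restricting votes to concepts that already exist in the ontology is what keeps the resulting classification aligned with the formal model, at the cost of returning no label when the queries disagree.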

2606.17581 2026-06-17 cs.PL cs.AI 交叉投稿

Visored: A Controlled-Natural-Language Prover for LLM-Generated Mathematics

Visored: 一种面向LLM生成数学的受控自然语言证明器

Xiyu Zhai, Xinyi Chen, Yiping Wang, Runlong Zhou, Liao Zhang, Simon S. Du

发表机构 * University of Washington(华盛顿大学) University of Innsbruck(因斯布鲁克大学)

AI总结 提出一种基于依赖类型的证明器,其表面模仿数学自然语言,并通过规则驱动的自动化层填补常规步骤,使LLM无需专用训练数据即可在miniF2F基准上有效使用,并输出可检查的Lean文件。

AI中文摘要

我们提出了一种基于依赖类型的证明器,其设计围绕LLM(以及人类)倾向于编写数学的方式,补充了Lean和Rocq等现有系统。其核心设计选择是模仿数学自然语言的表面,以及规则驱动的自动化层,该层关闭教科书通常会省略的常规步骤,使得被接受的证明可以重新作为经过检查的Lean文件输出。早期实验表明,即使没有任何特定于证明器的训练数据,LLM也能学会在miniF2F基准上有效使用它。Lean输出摘录:https://github.com/xiyuzhai-husky-lang/visored/

英文摘要

We present a dependent-type-based prover designed around the way LLMs (and humans) tend to write mathematics, complementing existing systems such as Lean and Rocq. Its core design choices are a surface that imitates mathematical natural language and a rule-driven automation layer that closes the routine steps a textbook would omit, so that an accepted proof can be re-emitted as a checked Lean file. Early experiments suggest that, even without any prover-specific training data, LLMs can learn to use it effectively on the miniF2F benchmark. Lean output excerpts: https://github.com/xiyuzhai-husky-lang/visored/

2605.27023 2026-06-17 cs.AI 版本更新

Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling

通过增强负采样提升知识图谱基础模型

Yinan Liu, Wenjin Xu, Zhiyuan Zha, Xiaochun Yang, Bin Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出自适应负采样方法KMAS,通过动态调整困难负三元组比例,增强知识图谱基础模型在零样本补全任务中的性能。

AI中文摘要

知识图谱已成为问答和推荐系统等众多下游任务的核心支柱。然而,尽管如此,知识图谱往往非常不完整。为了在未见过的知识图谱(其关系词汇与预训练时不同)中进行零样本知识图谱补全,知识图谱基础模型受到了广泛关注。现有的知识图谱基础模型通常使用随机负三元组进行训练,这些负三元组是通过将正三元组的头实体或尾实体替换为随机实体构建的。然而,这些负三元组通常质量有限,为知识图谱基础模型训练提供的监督较弱。在本文中,我们提出了一种简单而有效的自适应负采样方法KMAS,以增强现有的知识图谱基础模型。KMAS通过从现有知识图谱基础模型的关系编码器生成的更新关系嵌入来构建困难负三元组。为了进一步自适应地与训练过程中知识图谱基础模型不断发展的能力对齐,KMAS在整个训练过程中动态调整困难负三元组的比例:在预热阶段后,线性增加比例,然后线性减少。在44个数据集上进行了大量实验。实验结果表明,我们提出的负采样方法可以在不需要过多额外时间或内存消耗的情况下增强许多最先进的知识图谱基础模型。

英文摘要

Knowledge graphs (KGs) have become the backbone of numerous downstream tasks such as question answering and recommender systems. Despite this, KGs are often highly incomplete. To perform zero-shot knowledge graph completion on unseen KGs, whose relational vocabularies differ from those used for pre-training, KG foundation models (KGFMs) have received wide attention. Existing KGFMs are usually trained with random negative triples, which are constructed by replacing the head or tail entity of a positive triple with a random entity. However, these negative triples are often of limited quality and provide weak supervision for KGFM training. In this paper, we propose a simple yet effective adaptive negative sampling approach, KMAS, to enhance existing KGFMs. KMAS constructs hard negative triples from the updated relation embeddings generated by the existing KGFM's relation encoder. To align with the evolving capability of the KGFM during training, KMAS adjusts the ratio of hard negative triples dynamically throughout the whole training process: after a warmup phase, it increases the ratio linearly and then decreases it linearly. Extensive experiments are conducted on 44 data sets. Experimental results demonstrate that our proposed negative sampling method can enhance many SOTA KGFMs without requiring excessive additional time or memory consumption.
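
The ratio schedule described above (warmup, then a linear rise, then a linear fall) can be written as a small piecewise function. The parameter names, the zero ratio during warmup, and the return to zero at the end of training are assumptions made for this sketch; the abstract does not give KMAS's actual hyperparameters or endpoints.

```python
def hard_negative_ratio(step, total_steps, warmup_steps, peak_step, max_ratio):
    """Fraction of hard negatives at a training step.

    Zero during warmup (assumed), then a linear rise to `max_ratio` at
    `peak_step`, then a linear fall back to zero at `total_steps`.
    Requires warmup_steps < peak_step < total_steps.
    """
    if step < warmup_steps:
        return 0.0
    if step <= peak_step:
        return max_ratio * (step - warmup_steps) / (peak_step - warmup_steps)
    return max_ratio * (total_steps - step) / (total_steps - peak_step)

# Hypothetical run: 100 steps, 20 warmup steps, peak ratio 0.5 at step 60.
schedule = [hard_negative_ratio(s, 100, 20, 60, 0.5) for s in (0, 20, 40, 60, 80, 100)]
print(schedule)  # [0.0, 0.0, 0.25, 0.5, 0.25, 0.0]
```

In a training loop, the returned value would decide what share of each batch's negatives is drawn from the hard-negative pool rather than sampled at random.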

2603.05171 2026-06-17 cs.CL cs.AI 版本更新

Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions

中国司法判决中法律论证结构的标注与可视化指南

Kun Chen, Xianglei Liao, Kaixue Fei, Yi Xing, Xinrui Li

发表机构 * Law School, Nanjing University(南京大学法学院)

AI总结 提出一个系统化、可操作的标注框架,用于表示司法判决中的法律论证结构,支持大规模司法推理分析和法律论证挖掘。

Comments This Guideline has been developed through revision and refinement based on the first edition. The element label system has been adjusted, and the annotation granularity and annotation workflow have been further optimized

详情
AI中文摘要

本指南提出了一个系统化且可操作的标注框架,用于表示司法判决中的法律论证结构。该框架基于法律推理和论证理论,旨在揭示司法推理的逻辑组织,并为计算分析提供可靠基础。在元素层面,本指南区分了非命题层和命题层。非命题层由两个元素组成:议题和非论证性成分。在命题层面,本指南定义了四种命题类型:一般规范性判断、特殊规范性判断、一般事实判断和特殊事实判断。在关系层面,定义了五种关系类型来表示论证结构:支持、攻击、联合、匹配和同一性。这些关系捕捉了正面和负面的论证连接、合取推理结构、法律规范与案件事实之间的对应关系,以及命题之间的同一性或语义等价性。本指南进一步规定了基本结构和嵌套结构的形式化表示规则和可视化约定,使得复杂论证模式的可视化保持一致。此外,它建立了标准化的标注工作流程和一致性控制机制,以确保标注数据的可重复性和可靠性。通过提供清晰的概念模型、形式化表示规则和实用的标注程序,本指南支持大规模司法推理分析以及未来在法律论证挖掘、法律推理计算建模和人工智能辅助法律分析方面的研究。

英文摘要

This Guideline presents a systematic and operationalizable annotation framework for representing legal argumentation structures in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and provide a reliable foundation for computational analysis. At the element level, the Guideline distinguishes between the non-propositional layer and the propositional layer. The non-propositional layer consists of two elements: Issue and Non-argumentative Component. At the propositional level, the Guideline defines four proposition types: General Normative Judgment, Particular Normative Judgment, General Factual Judgment, and Particular Factual Judgment. At the relational level, five relation types are defined to represent argumentative structures: Support, Attack, Joint, Match, and Identity. These relations capture positive and negative argumentative connections, conjunctive reasoning structures, correspondences between legal norms and case facts, and identity or semantic equivalence between propositions. The Guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent visualization of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure the reproducibility and reliability of annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this Guideline supports large-scale analysis of judicial reasoning and future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.

2606.14814 2026-06-17 cond-mat.mtrl-sci cs.AI physics.app-ph physics.chem-ph physics.comp-ph 版本更新

A Multi-Level Architecture for Reusable Materials Ontologies -- The OntoCrafter Ceramics Ontology (OCO) as Reference Implementation

可复用材料本体的多层次架构——以OntoCrafter陶瓷本体(OCO)作为参考实现

Thomas Pannek, Wolfgang Grond

发表机构 * Numberland

AI总结 针对材料科学本体在水平、垂直和机制三个维度上的碎片化问题,提出一种多层次模块化架构,通过抽象层次和消费受众两个独立分类轴,并在材料特定层内采用七层机制解释骨架,以OntoCrafter陶瓷本体(OCO v0.94)作为参考实现。

Comments 3 figures, 55 pages

详情
AI中文摘要

材料科学与工程本体领域同时在多个轴向上呈现碎片化。水平方向:一项近期调查识别出94个本体,其中超过40个在结构上不兼容;每个新的应用领域——陶瓷、聚合物、电池、智能材料——通常从头开始重新设计本体。垂直方向:欧盟法规(CSRD、CSDDD、PPWR、CBAM、R2R、AI Act、ESPR)迫使材料、制造、供应链和生命周期数据集成到数字产品护照中,使得仅解决水平碎片化的本体对于任何当代消费者来说都是不完整的。机制方面:一个记录BNT-BT具有$d_{33} \approx 580$ pC/N的词汇表存储了一个事实,但如果没有系统的解释骨架,就无法揭示其原因——Bi-6s$^2$孤对电子立体活性、异常Born有效电荷、软模、缺陷化学。我们提出一种多层次模块化架构,具有两个独立的分类轴——抽象层次(L0桥梁、L1材料无关的实验室笔记本、L2材料类别特定、L3分类推理)和消费受众(材料与合规)——其中材料特定层次内部由适用于任何结晶离子氧化物的七层机制解释骨架(对称性、能量/DFT、热力学/CALPHAD、动力学、微观结构、缺陷化学、键合)组织。层次和受众的模块化解决了水平碎片化,合规受众吸收了垂直法规压力,而第2层的七层组织提供了机制解释深度。我们将该架构实例化为OntoCrafter陶瓷本体(OCO v0.94):跨44个模块的5,196个类;167,348个OWL公理(其中40,454个逻辑公理);1,674个属性;829个跨本体桥梁映射;1,172个SHACL形状;163个已发布的胜任力问题。

英文摘要

The Materials Science and Engineering ontology landscape is fragmented along multiple axes simultaneously. Horizontally: a recent survey identified 94 ontologies of which over 40 are structurally incompatible; each new application domain -- ceramics, polymers, batteries, smart materials -- typically restarts ontology design from scratch. Vertically: EU regulation (CSRD, CSDDD, PPWR, CBAM, R2R, AI Act, ESPR) forces material, manufacturing, supply-chain, and lifecycle data into integrated digital product passports, leaving ontologies that only address horizontal fragmentation incomplete for any contemporary consumer. And mechanistically: a vocabulary that records that BNT-BT has $d_{33} \approx 580$ pC/N stores a fact but cannot surface why -- Bi-6s$^2$ lone-pair stereo-activity, anomalous Born effective charges, soft modes, defect chemistry -- without a systematic explanation skeleton. We propose a multi-level modular architecture with two independent classification axes -- level of abstraction (L0 bridges, L1 material-agnostic laboratory-notebook, L2 material-class-specific, L3 categorical reasoning) and consumer audience (material vs. compliance) -- in which the material-specific level is internally organised by a seven-tier mechanistic-explanation skeleton (Symmetry, Energy/DFT, Thermo/CALPHAD, Kinetics, Microstructure, Defect chemistry, Bonding) applicable to any crystalline ionic oxide. The level-and-audience modularity dissolves the horizontal fragmentation, the compliance audience absorbs the vertical regulation pressure, and the seven-tier organisation of Level 2 delivers the mechanistic explanation depth. We instantiate the architecture as the OntoCrafter Ceramics Ontology (OCO v0.94): 5,196 classes across 44 modules; 167,348 OWL axioms (40,454 logical); 1,674 properties; 829 cross-ontology bridge mappings; 1,172 SHACL shapes; 163 published competency questions.

3. 多智能体与博弈 10 篇

2606.17368 2026-06-17 cs.AI cs.NI 新提交

Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes

分布式通用智能体网络:架构、关键机制与原型

Shengli Zhang, Deen Ma, Zibin Lin, Taotao Wang

发表机构 * College of Electronics and Information Engineering, Shenzhen University(深圳大学电子与信息工程学院)

AI总结 提出分布式通用智能体网络架构,通过协议适配层连接上层任务语义与底层网络操作,解决语义公告传播、可信身份与多主题声誉、语义梯度机制设计三大核心问题,实现开放可信的智能体协作。

详情
AI中文摘要

大型语言模型加速了从被动对话助手到自主智能体的转变,这些智能体能够理解目标、规划行动、调用工具并执行多步骤任务。然而,单个智能体的能力仍受限于其本地数据、工具权限、运行时环境和治理边界。本文研究分布式通用智能体网络:开放的端到端网络,其中部署在个人设备、边缘节点或自主计算环境中的异构智能体可以相互发现、建立信任、协商合作规则并执行开放式任务。我们认为,这种网络不能通过简单地将现有的端到端覆盖网络与传统多智能体系统相结合来获得。与传统P2P网络不同,智能体网络必须传播关于意图、能力、状态和合作约束的语义声明。因此,我们提出了一种以协议适配层为中心的分层架构,该层连接上层任务语义与底层网络操作。基于该架构,本文识别出三个核心机制问题:用于协作者发现的语义公告传播、用于合作治理的可验证身份与多主题声誉、以及用于开放任务执行的语义梯度机制设计。针对每个问题,我们提出了一条技术路线,包括带顺序日志的无体八卦协议、基于BAID的身份绑定与MG-EigenTrust声誉、以及由语义归因反馈驱动的Stackelberg式机制生成循环。我们还报告了BAID式分层验证的原型开销结果以及跨主题伪装-合谋攻击下MG-EigenTrust的机制级模拟。所得框架为开放、可信和可扩展的智能体协作提供了系统级基础。

英文摘要

Large language models have accelerated the transition from passive conversational assistants to autonomous agents that can understand goals, plan actions, invoke tools, and execute multi-step tasks. Yet the capability of a single agent remains constrained by its local data, tool permissions, runtime environment, and governance boundary. This paper studies distributed general-purpose agent networks: open peer-to-peer networks in which heterogeneous agents deployed on personal devices, edge nodes, or autonomous computing environments can discover one another, establish trust, negotiate cooperation rules, and execute open-ended tasks. We argue that such networks cannot be obtained by simply combining existing peer-to-peer overlays with conventional multi-agent systems. Unlike traditional P2P networks, agent networks must propagate semantic declarations about intentions, capabilities, states, and cooperation constraints. We therefore propose a layered architecture centered on a protocol adaptation layer that connects upper-level task semantics with lower-level network operations. Based on this architecture, the paper identifies three core mechanism problems: semantic announcement propagation for collaborator discovery, verifiable identity and multi-topic reputation for cooperation governance, and semantic-gradient mechanism design for open task execution. For each problem, we present a technical route, including bodyless gossip with sequential logs, BAID-based identity binding with MG-EigenTrust reputation, and a Stackelberg-style mechanism-generation loop driven by semantic attribution feedback. We further report prototype overhead results for BAID-style tiered verification and mechanism-level simulations of MG-EigenTrust under cross-topic disguise-collusion attacks. The resulting framework provides a system-level foundation for open, trustworthy, and scalable agent collaboration.

2606.17847 2026-06-17 cs.AI cs.LG 新提交

WallZero: Mastering the Game of WallGo with Strategic Analysis

WallZero:通过战略分析掌握WallGo游戏

Hsing-Yu Chen, Jérôme Arjonilla, I-Chen Wu, Ti-Rong Wu

发表机构 * National Yang Ming Chiao Tung University(国立阳明交通大学) Academia Sinica(中央研究院)

AI总结 提出基于AlphaZero的WallZero智能体,通过定制动作和特征设计,在WallGo游戏中击败职业围棋选手,并分析游戏公平性与关键策略。

Comments Accepted by the Computers and Games conference (CG 2026)

详情
AI中文摘要

WallGo是一种最近引入的战略棋盘游戏,因2025年Netflix系列剧《The Devil's Plan》而流行。尽管在7x7的小棋盘上进行,但其石头移动和墙壁放置的组合导致了高游戏树复杂性和复杂的战略互动。尽管其日益流行,WallGo仍未得到充分探索。本文提出了WallZero,一个基于AlphaZero的双人WallGo设置智能体。我们引入了定制的动作和特征设计,以显著提高游戏性能。在评估中,WallZero击败了参与本研究的两位职业围棋选手,平均每局获得1.98倍的地盘。除了其强度,我们使用WallZero评估游戏公平性并识别掌握WallGo的关键策略。有趣的是,我们的结果显示,Netflix系列剧中使用的开局产生了更平衡的游戏。我们的代码可在以下网址获取:https://rlg.iis.sinica.edu.tw/papers/wallzero。

英文摘要

WallGo is a recently introduced strategic board game popularized by the 2025 Netflix series The Devil's Plan. Although played on a small 7 x 7 board, its combination of stone movement and wall placement yields high game-tree complexity and intricate strategic interactions. Despite its growing popularity, WallGo remains underexplored. This paper presents WallZero, an AlphaZero-based agent for the two-player WallGo setting. We introduce tailored action and feature designs to improve playing performance significantly. In the evaluation, WallZero defeats two professional Go players who participated in this study, securing on average 1.98x more territory per game. Beyond its strength, we use WallZero to assess game fairness and identify key strategies for mastering WallGo. Interestingly, our results show that the opening used in the Netflix series yields a more balanced game. Our code is available at https://rlg.iis.sinica.edu.tw/papers/wallzero.

2606.17081 2026-06-17 cs.AR cs.AI cs.DC cs.GT cs.PF 交叉投稿

The Price of Anarchy in Disaggregated Inference

解耦推理中的无政府价格

Athos Georgiou

发表机构 * NCA

AI总结 本文通过博弈论分析解耦推理架构中的资源分配问题,提出自适应控制器降低无政府价格,在NVIDIA B200集群上实现最高3.1倍PoA下降。

Comments 38 pages, 7 figures, 8 tables. Measurements on a 3-node NVIDIA B200 cluster running NVIDIA Dynamo v0.9.0

详情
AI中文摘要

解耦推理架构将预填充和解码阶段物理分离到不同的GPU池中,创建了共享固定硬件预算的竞争“代理”。我们提供了据我们所知对该架构的首次正式博弈论分析,以NVIDIA Dynamo作为具体案例研究。我们将解耦服务建模为三个耦合博弈:预填充池和解码池之间的双人资源博弈、分层KV缓存上的自私缓存博弈以及具有正外部性的请求路由拥塞博弈。我们实证验证了后两者;P/D资源博弈通过分析处理(第9.2节)。我们描述了GPU饱和如何引发博弈收益结构转变的机制:低于饱和时,自私行为具有有界的无政府价格(PoA);在饱和时,超线性延迟和缓存外部性推动我们的经验估计器PoA-hat(定义见第6.4节)上升。基于此分析,我们设计了一个自适应控制器,实时检测饱和转换并相应调整路由参数,从缓存亲和性利用转向负载均衡拥塞避免。我们在一个3节点NVIDIA B200集群上实例化我们的框架,运行Dynamo和两个模型Nemotron-4-340B(TP=8,全节点工作节点,跨InfiniBand KV传输)和Llama-3.1-70B(TP=4),发现两个模型上具有相同的三区域PoA-hat结构,且第一个膝点后网格点相同(C=128)。自适应路由将每个模型转移到更好的工作点。我们最强的结果是在70B 1P/5D拓扑上,饱和阶段PoA-hat下降3.1倍(从66.4降至21.5),吞吐量成本为13%。在70B 1P/2D上,PoA-hat下降2.2倍,TTFT P99下降7.6倍(见第8.5节)。

英文摘要

Disaggregated inference architectures physically separate prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure with the same first post-knee grid point (C=128) on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).
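
For readers unfamiliar with the term, the textbook Price of Anarchy compares the worst equilibrium outcome with the social optimum. The block below states that standard definition and checks the reported 3.1x reduction. The symbols $C$, $S$, and $\mathrm{NE}$ are notation assumed here for illustration, and the paper's empirical estimator PoA-hat (its Section 6.4) may be defined differently.

```latex
% C(s): social cost of strategy profile s; S: all profiles; NE: Nash equilibria.
\mathrm{PoA} \;=\; \frac{\max_{s \in \mathrm{NE}} C(s)}{\min_{s \in S} C(s)} \;\ge\; 1,
\qquad
\frac{66.4}{21.5} \approx 3.09 \approx 3.1\times .
```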

2606.17203 2026-06-17 cs.SE cs.AI 交叉投稿

Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management

信任感知的多智能体可追溯性:用于一致软件工件管理的置信度校准知识图谱

Mohamed Essam, Kareem Wael, Azza Hassan, Ahmed Haitham, Mahmoud Soliman, Samer Saber, Ibrahim Habib

发表机构 * CairoMotive Cairo, Egypt(开罗动力埃及)

AI总结 提出一种信任感知协调框架,通过共享知识图谱和校准置信度分数,结合嵌入检索与LLM多准则分析的两阶段可追溯性链接预测管道,解决多智能体系统中错误传播问题。

详情
AI中文摘要

多智能体AI系统越来越多地用于自动化软件工程任务,包括需求分析、架构设计、测试生成和可追溯性链接。当这些智能体作为顺序管道在共享软件工件上运行时,上游智能体做出的错误和低置信度决策会传播到下游阶段,产生孤立的需求、矛盾的链接和合规性差距,这在安全关键领域构成重大风险。我们提出一个信任感知协调框架,其中共享知识图谱既作为集中式语义记忆,又作为协调表面,智能体通过该表面使用校准的置信度分数评估并基于彼此的贡献进行构建。我们的方法引入了一个两阶段可追溯性链接预测管道,结合了基于嵌入的检索与基于LLM的多准则分析,一种可追溯性种子机制,能够比较推导时间和验证时间的置信度,以及一个一致性协议,通过置信度阈值门控、置信度发散检测和冲突解决来管理管道交互。我们在一个汽车软件工程案例研究上进行了评估,测量了链接预测校准、协议有效性、阈值敏感性和可追溯性种子的影响。消融研究证实,置信度校准对于有效的管道协调至关重要。

英文摘要

Multi-agent AI systems are increasingly used to automate software engineering tasks including requirements analysis, architecture design, test generation, and traceability linking. When these agents operate as a sequential pipeline over shared software artifacts, errors and low-confidence decisions made by upstream agents propagate to downstream stages, producing orphaned requirements, contradictory links, and compliance gaps that pose significant risks in safety-critical domains. We propose a trust-aware coordination framework where a shared knowledge graph serves as both centralized semantic memory and a coordination surface through which agents assess and build upon each other's contributions using calibrated confidence scores. Our approach introduces a two-stage traceability link prediction pipeline combining embedding-based retrieval with LLM-based multi-criteria analysis, a traceability seeding mechanism that enables comparison between derivation-time and validation-time confidence, and a consistency protocol governing pipeline interactions through confidence threshold gating, confidence divergence detection, and conflict resolution. We evaluate on an automotive software engineering case study measuring link prediction calibration, protocol effectiveness, threshold sensitivity, and the impact of traceability seeding. Ablation studies confirm that confidence calibration is essential for effective pipeline coordination.

2606.17627 2026-06-17 cs.CV cs.AI 交叉投稿

Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition

分、议、决:一种用于细粒度自我中心动作识别的多智能体框架

Alessandro Sottovia, Alessandro Torcinovich, Oswald Lanz

发表机构 * Faculty of Engineering, Free University of Bozen-Bolzano(博尔扎诺自由大学工程学院)

AI总结 提出一种零样本多智能体框架,通过视频分割、异构VLM专家协商和Borda计数聚合,提升细粒度自我中心动作识别性能。

详情
AI中文摘要

在自我中心视频中进行细粒度动作识别对视觉语言模型(VLM)具有挑战性:动作通常仅在小视觉线索上有所不同,而单个模型往往偏向于这些线索的一个子集。我们提出了“分、议、决”(Divide, Deliberate, Decide),一个完全本地化的零样本多智能体框架,其中(i)一个VLM编排器将视频分块,并为每个片段提出一个top-k候选标签列表,(ii)一个由来自不同开放模型系列的异构VLM专家组成的集成体进行结构化协商,包括一轮同行咨询问题,以及(iii)使用Borda计数聚合智能体排名,并且编排器根据专家的证据重新排名自己的预测。整个流程在本地运行,无需微调。实验表明,我们的方法在零样本动作识别性能上比基线有积极改进,突出了异构协商步骤的影响,表明增益来自去相关的模型先验而非额外的计算。

英文摘要

Fine-grained action recognition in egocentric video is challenging for Vision-Language Models (VLMs): actions often differ only in small visual cues, and a single model tends to be biased toward a subset of these cues. We propose Divide, Deliberate, Decide, a fully-local, zero-shot multi-agent framework in which (i) a VLM orchestrator chunks the video and proposes a top-k candidate label list per segment, (ii) an ensemble of heterogeneous VLM specialists, drawn from different open model families, engages in a structured deliberation that includes a peer-consultation round of questions, and (iii) agent rankings are aggregated with a Borda count and the orchestrator re-ranks its own prediction in light of the specialists' evidence. The entire pipeline runs locally with no fine-tuning. Experiments show that our method improves zero-shot action recognition performance over the baseline and highlight the influence of the heterogeneous deliberation step, indicating that the gain stems from decorrelated model priors rather than from additional compute.
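
The Borda-count aggregation in step (iii) can be illustrated with a minimal sketch. It assumes every specialist returns a full ranking over the same candidate labels and that ties are broken alphabetically. Both choices, and the function name `borda_aggregate`, are assumptions for illustration and not the authors' implementation.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Aggregate ranked label lists with a Borda count.

    Each ranking lists labels from most to least preferred. A label in
    position p of a ranking of length k earns k - 1 - p points.
    Ties are broken alphabetically (an assumption).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        k = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += k - 1 - position
    # Highest total score first, then alphabetical order for ties.
    return sorted(scores, key=lambda label: (-scores[label], label))
```

With three specialists ranking `cut`, `slice`, and `peel`, a label that is consistently near the top wins even if it is not every specialist's first choice.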

2606.17739 2026-06-17 cs.RO cs.AI cs.CV cs.MA 交叉投稿

ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents

ED3R: 能量感知的分布式灾难检测——基于协作机器人智能体

Lina Magoula, Nikolaos Koursioumpas, Nancy Alonistioti, Ramin Khalili

发表机构 * Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens(雅典大学信息学与电信系) Huawei Heisenberg Research Center (Munich)(华为海森堡研究中心(慕尼黑))

AI总结 提出ED3R框架,通过机器人-远程控制器分层协作与分布式神经回归预测,在不确定性下以最低能耗实现野火检测,成功率达97.18%,能耗降低36.4%,检测速度提升41%。

Comments 14 pages, 9 figures

详情
AI中文摘要

机器人技术有望支持环境监测和自然灾害管理,在这些场景中,决策必须在不确定性、资源限制和严格操作约束下做出。在关键任务(如野火)中,机器人智能体不仅需要以足够置信度识别危险事件,还需管理能量成本和检测时间。本文介绍ED3R,一种用于不确定性下野火检测的能量感知分布式框架。ED3R实现了机器人与远程控制器之间的分层协作决策:远程控制器决定机器人的运动,而机器人感知环境并决定在何处(机载或远程)以及如何执行野火检测。共同目标是以所需置信度检测野火,同时最小化任何机器人操作消耗的能量。ED3R进一步集成了避免附近障碍物、防止冗余探索、实现自适应早期任务完成以及通过自定义惩罚函数确保可行性的机制。ED3R还引入了前瞻能力,通过分布式神经回归模型使智能体能够在执行前评估候选策略以预测未来。该框架通过逼真的机器人仿真、消融研究和基线比较进行评估。总体而言,ED3R的任务成功率高达97.18%。尤其是在最具挑战性的任务中,它比基线减少高达36.4%的能量消耗,并提前高达41%检测到野火。

英文摘要

Robotic systems are expected to support environmental monitoring and natural disaster management, where decisions must be made under uncertainty, resource limitations, and strict operational constraints. In critical missions, such as wildfires, robotic agents must not only identify hazardous events with sufficient confidence, but also manage the energy cost and time until detection. This paper introduces ED3R, an energy-aware distributed framework for wildfire detection under uncertainty. ED3R enables hierarchical cooperative decision-making between a robot and a remote controller. The remote controller decides upon the robot's motion, while the robot senses the environment and decides where (onboard or remotely) and how to execute wildfire detection. The common goal is to detect wildfires with a required confidence while minimizing the energy consumed by any robot operation. ED3R further integrates mechanisms to avoid nearby obstacles, prevent redundant exploration, enable adaptive early mission completion, and ensure feasibility through a custom penalty function. ED3R also introduces a forward-looking capability, enabled through distributed neural regression models that allow the agents to anticipate the future by evaluating candidate strategies before execution. The framework is evaluated through realistic robotics simulations, ablation studies, and baseline comparisons. Overall, ED3R achieves a mission success rate of up to 97.18%. Especially in the most demanding missions, it reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines.

2606.17915 2026-06-17 cs.MA cs.AI cs.DB cs.SE 交叉投稿

Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization

可信赖的自组合大数据即服务:一种LLM编排的多智能体框架,用于自动化数据工程、AutoML、MLOps部署和漂移感知生命周期优化

Aueaphum Aueawatthanaphisut, Badri Raj Lamichhane

发表机构 * School of Information, Computer, and Communication Technology(信息、计算机与通信技术学院) Sirindhorn International Institute of Technology, Thammasat University(泰国法政大学诗琳通国际理工学院)

AI总结 提出一种基于LLM编排的多智能体BDaaS框架,通过分解生命周期为专用智能体并协调执行,实现自动化数据工程、AutoML、MLOps部署和漂移感知优化,提升生命周期级可靠性。

Comments 7 pages, 3 figures, 5 tables

详情
AI中文摘要

大数据即服务(BDaaS)平台需要可靠地自动化数据摄取、清洗、特征工程、模型开发、部署和部署后监控。然而,现有的基于LLM的数据科学智能体和AutoML系统主要关注孤立的工作流阶段,对生命周期级编排、工件治理、人工监督和漂移感知适应的支持有限。本文提出了一种基于LLM编排的多智能体协作的可信赖自组合BDaaS框架。所提出的架构将BDaaS生命周期分解为专门的智能体,用于数据摄取、数据清洗、特征工程、AutoML训练、模型评估、MLOps部署、监控和漂移检测。中央LLM编排层协调智能体执行,验证中间输出,管理工作流上下文,并支持动态工作流组合。该框架还包含共享工件治理、可重现性支持、人在回路检查点和漂移感知反馈循环。使用包含缺失值、分类变量、异常值、类别不平衡和模拟协变量漂移的受控表格基准数据集进行了基于原型的评估。与手动ML、仅AutoML和单智能体LLM基线相比,所提出的多智能体BDaaS流水线在保持竞争性预测性能的同时,提高了生命周期级可靠性,包括工作流完成度、工件可追溯性、部署就绪性、可重现性和漂移恢复。结果表明,LLM编排的多智能体系统可以将传统AutoML扩展到可信赖、自适应和面向生产的BDaaS生命周期自动化。

英文摘要

Big-Data-as-a-Service (BDaaS) platforms require reliable automation across data ingestion, cleaning, feature engineering, model development, deployment, and post-deployment monitoring. However, existing LLM-based data science agents and AutoML systems mainly focus on isolated workflow stages, leaving limited support for lifecycle-level orchestration, artifact governance, human oversight, and drift-aware adaptation. This paper proposes a trustworthy self-composable BDaaS framework based on LLM-orchestrated multi-agent collaboration. The proposed architecture decomposes the BDaaS lifecycle into specialized agents for data ingestion, data cleaning, feature engineering, AutoML training, model evaluation, MLOps deployment, monitoring, and drift detection. A central LLM orchestration layer coordinates agent execution, validates intermediate outputs, manages workflow context, and enables dynamic workflow composition. The framework also incorporates shared artifact governance, reproducibility support, human-in-the-loop checkpoints, and drift-aware feedback loops. A prototype-based evaluation is conducted using controlled tabular benchmark datasets with missing values, categorical variables, outliers, class imbalance, and simulated covariate drift. Compared with manual ML, AutoML-only, and single-agent LLM baselines, the proposed multi-agent BDaaS pipeline achieves competitive predictive performance while improving lifecycle-level reliability, including workflow completion, artifact traceability, deployment readiness, reproducibility, and drift recovery. The results suggest that LLM-orchestrated multi-agent systems can extend conventional AutoML toward trustworthy, adaptive, and production-oriented BDaaS lifecycle automation.

2606.17962 2026-06-17 cs.MA cs.AI 交叉投稿

A Neuro-Symbolic Approach to Strategy Synthesis for Strategic Logics

一种面向策略逻辑的策略综合的神经符号方法

Marco Aruta, Vadim Malvone, Aniello Murano, Domenico Parente, Luca Rizzuti

发表机构 * University of Naples Federico II(那不勒斯费德里科二世大学) LTCI, Télécom Paris, Institut Polytechnique de Paris(LTCI,巴黎电信学院,巴黎理工学院) Università degli Studi di Salerno(萨勒诺大学)

AI总结 提出一种神经符号框架,将大语言模型作为策略生成预言机,结合模型检查器进行形式验证,在NatATL中实现高精度策略综合。

详情
AI中文摘要

推理智能体通过策略交互能实现什么是多智能体系统(MAS)中的核心挑战。用于策略能力的逻辑(如ATL)提供了严格的方法,但其采用常因策略综合的计算成本而受阻。我们引入了一种神经符号框架,将大语言模型(LLM)集成到MAS的模型检查流程中。LLM作为策略生成预言机,提出候选策略,然后由标准MAS模型检查器进行形式验证。这种生成-认证架构利用LLM引导来导航大型组合策略空间,同时保持形式正确性:生成的策略仅在通过验证器认证后才被接受。我们为NatATL中的有界策略推理实例化了该框架,并引入了首个NatATL策略综合数据集,包含4211个实例。使用开源Qwen3-32B模型的实验表明,我们的认证流程在策略综合结果上达到了92%的准确率。

英文摘要

Reasoning about what agents can achieve through strategic interaction is a core challenge in Multi-Agent Systems (MAS). Logics for strategic ability, such as ATL, provide rigorous methods, but their adoption is often hindered by the computational cost of strategy synthesis. We introduce a neuro-symbolic framework that integrates large language models (LLMs) into the model-checking pipeline for MAS. The LLM acts as a strategy-generation oracle, proposing candidate strategies that are then formally validated by a standard MAS model checker. This generate-and-certify architecture uses LLM guidance to navigate large combinatorial strategy spaces while preserving formal soundness: generated strategies are accepted only when certified by the verifier. We instantiate the framework for bounded strategic reasoning in NatATL and introduce the first NatATL strategy-synthesis dataset, consisting of 4211 instances. Experiments with an open-weight Qwen3-32B model show that our certified pipeline achieves 92% accuracy on strategy-synthesis outcomes.
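
The generate-and-certify pattern described in the abstract can be sketched as a small loop in which an untrusted proposer suggests candidates and only a verifier can accept them. Everything in this sketch is hypothetical: `propose` stands in for the LLM oracle, `verify` for the model checker, and the retry budget `max_attempts` is an assumption, since the abstract does not say whether or how the pipeline retries.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def generate_and_certify(
    propose: Callable[[int], T],
    verify: Callable[[T], bool],
    max_attempts: int,
) -> Optional[T]:
    """Return the first candidate accepted by the verifier, else None.

    Soundness rests entirely on `verify`: a candidate is never returned
    unless the verifier has certified it.
    """
    for attempt in range(max_attempts):
        candidate = propose(attempt)
        if verify(candidate):
            return candidate
    return None


# Toy stand-ins (hypothetical): a strategy maps state names to actions.
def toy_propose(attempt: int) -> dict:
    first_action = "wait" if attempt >= 2 else "move"
    return {"s0": first_action, "s1": "move"}


def toy_verify(strategy: dict) -> bool:
    # Pretend the goal is only reachable when the agent waits in s0.
    return strategy.get("s0") == "wait"
```

Because rejected candidates are simply discarded, a weak proposer costs extra attempts but cannot cause an incorrect strategy to be accepted.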

2606.18111 2026-06-17 cs.LG cs.AI 交叉投稿

Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning

多目标强化学习中学习公平帕累托最优策略

Umer Siddique, Peilang Li, Yongcan Cao

AI总结 针对多目标强化学习中固定用户偏好无法提供多样化策略的问题,提出基于广义基尼福利函数的多策略方法,学习公平帕累托最优策略集。

Comments Accepted at the Reinforcement Learning Conference (RLC) 2025. 12 pages main + appendix, 8 figures, 4 tables

详情
AI中文摘要

公平性是多目标强化学习(MORL)决策中的一个重要方面,策略必须确保在多个潜在冲突的目标上既达到最优又实现公平。虽然单策略MORL方法可以使用福利函数(如广义基尼福利函数GGF)为固定的用户偏好学习公平策略,但它们无法提供动态或未知用户偏好所需的多样的策略集。为解决这一局限性,我们形式化了多策略MORL中的公平优化问题,其目标是学习一组帕累托最优策略,确保在所有可能的用户偏好下实现公平。我们的关键技术贡献有三点:(1)我们证明对于凹的、分段线性的福利函数(例如GGF),公平策略仍然在凸覆盖集(CCS)中,CCS是线性标量化下的近似帕累托前沿。(2)我们证明非平稳策略(通过累积奖励历史增强)和随机策略通过动态适应历史不公平性来改善公平性。(3)我们提出了三种新算法,包括将GGF与多策略多目标Q学习(MOQL)集成、用于学习非平稳策略的状态增强多策略MOQL,以及用于学习随机策略的新扩展。我们在多个领域评估了我们的算法,并将我们的方法与最先进的MORL基线进行了比较。实验结果表明,我们的方法学习了一组公平策略,能够适应不同的用户偏好。

英文摘要

Fairness is an important aspect of decision-making in multi-objective reinforcement learning (MORL), where policies must ensure both optimality and equity across multiple, potentially conflicting objectives. While single-policy MORL methods can learn fair policies for fixed user preferences using welfare functions such as the generalized Gini welfare function (GGF), they fail to provide the diverse set of policies necessary for dynamic or unknown user preferences. To address this limitation, we formalize the fair optimization problem in multi-policy MORL, where the goal is to learn a set of Pareto-optimal policies that ensure fairness across all possible user preferences. Our key technical contributions are threefold: (1) We show that for concave, piecewise-linear welfare functions (e.g., GGF), fair policies remain in the convex coverage set (CCS), which is an approximated Pareto front for linear scalarization. (2) We demonstrate that non-stationary policies, augmented with accrued reward histories, and stochastic policies improve fairness by dynamically adapting to historical inequities. (3) We propose three novel algorithms, which include integrating GGF with multi-policy multi-objective Q-Learning (MOQL), state-augmented multi-policy MOQL for learning non-stationary policies, and its novel extension for learning stochastic policies. We evaluate our algorithms across various domains and compare our methods against the state-of-the-art MORL baselines. The empirical results show that our methods learn a set of fair policies that accommodate different user preferences.
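
As a rough illustration of the welfare function named in the abstract, the generalized Gini welfare function is commonly defined as a weighted sum of the objective values sorted in ascending order, with non-increasing weights so that the worst-off objective counts most. The sketch below follows that common definition. The function name `ggf` and the example weights are assumptions, and the paper's exact formulation may differ.

```python
def ggf(values, weights):
    """Generalized Gini welfare of a vector of per-objective returns.

    `weights` must be non-negative and non-increasing. The largest weight
    is paired with the smallest value, which rewards balanced outcomes.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if any(a < b for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must be non-increasing")
    ordered = sorted(values)  # ascending: worst-off objective first
    return sum(w * v for w, v in zip(weights, ordered))
```

With weights such as `[1, 0.5, 0.25]`, the balanced return vector `[2, 2, 2]` scores higher than `[6, 0, 0]` even though both sum to 6, which is the sense in which the function encodes fairness.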

2504.03991 2026-06-17 cs.CL cs.AI cs.HC cs.MA 版本更新

Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

面向多样化类人团队协作与通信的算法化提示生成与大型语言模型

Siddharth Srikanth, Varun Bhatt, Boshen Zhang, Werner Hager, Charles Michael Lewis, Katia P. Sycara, Aaquib Tabrez, Stefanos Nikolaidis

发表机构 * Thomas Lord Department of Computer Science, University of Southern California(美国南加州大学汤姆·劳德计算机科学系) School of Computing and Information, University of Pittsburgh(美国匹兹堡大学计算与信息学院) Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所) Sibley School of Mechanical and Aerospace Engineering, Cornell University(康奈尔大学西伯利机械与航空航天工程学院)

AI总结 结合质量多样性优化与LLM代理,自动搜索生成多样化团队行为的提示,捕获人类协作与通信策略,并通过用户研究验证其类人性。

详情
AI中文摘要

理解人类如何在团队中协作和通信对于改善人-代理团队协作和AI辅助决策至关重要。然而,由于后勤、伦理和实际限制,仅依赖大规模用户研究的数据是不切实际的,因此需要多种多样化人类行为的合成模型。最近,基于大型语言模型(LLM)的代理已被证明能够在社交环境中模拟类人行为。但是,获得大量多样化行为需要手动设计提示。另一方面,质量多样性(QD)优化已被证明能够生成多样化的强化学习(RL)代理行为。在这项工作中,我们将QD优化与LLM驱动的代理相结合,以迭代搜索在长时域、多步骤协作环境中生成多样化团队行为的提示。我们首先通过一项人类受试者实验表明,人类在该领域中表现出多样化的协调和通信行为。然后,我们进行一系列实验,表明我们的方法捕获了在没有大规模数据收集的情况下难以观察到的行为,并通过后续用户研究表明这些生成的行为是类人的。我们的发现凸显了QD与LLM驱动代理的结合作为研究多代理协作中团队协作和通信策略的有效工具。

英文摘要

Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. However, obtaining a large set of diverse behaviors requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment, that humans exhibit diverse coordination and communication behavior in this domain. We then present a series of experiments showing that our approach captures behaviors that are difficult to observe without large-scale data collection, and a follow-up user study showing that these generated behaviors are human-like. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.

4. 搜索、优化与约束求解 2 篇

2606.17087 2026-06-17 cs.NE cs.AI 交叉投稿

ZIVARI-TLBO: A Zero-Cost Inter-Group Evaluated-Elite Relay Mechanism for Teaching-Learning-Based Optimization

ZIVARI-TLBO:一种基于零成本组间评估精英中继的教学优化算法

Pezhman Zivari

发表机构 * Independent Researcher(独立研究者)

AI总结 提出ZIVARI-TLBO方法,通过固定环中组间传递已评估精英解实现零成本信息共享,在标准函数和工程问题上验证其性能,排名第二但非全局最优。

Comments 21 pages, 7 figures, 11 tables

详情
AI中文摘要

ZIVARI-TLBO是一种分组教学优化(TLBO)方法,它通过一个固定的组间评估精英中继来增强现有的种群状态控制器。在每个预定事件中,每个组将其已评估的精英解按固定环传递给下一组;只有当该精英的存储目标值更优时,它才替换接收组中最差的可替换学习者。由于精确中继复制了已评估的解及其存储的适应度,因此不需要额外的目标函数调用。冻结的gts-v4-cm-fixed实现在8个经典函数(维度10、30、50、100)和5个约束工程问题上,在等额10,000次评估预算下进行评估,使用30个匹配种子。与没有中继的相同分组景观感知控制器的直接消融实验记录了728/11/221胜/平/负,以及跨维度的秩双列效应大小为0.624。在八种方法的多维比较中,WOA获得最佳平均秩(2.914),ZIVARI-TLBO排名第二(3.382);ZIVARI-TLBO显著优于TLBO、MCTLBO、DE、PSO和GWO,显著劣于WOA,并且在Holm调整后与HHO无显著差异。可行性感知工程结果好坏参半,且对当前的静态惩罚公式敏感。证据支持有限的中继贡献和预算一致的信息共享机制,但不支持通用最先进、全局收敛、工程主导或CEC优越性的声明。

英文摘要

ZIVARI-TLBO is a grouped Teaching-Learning-Based Optimization (TLBO) method that augments an existing population-state controller with a fixed inter-group evaluated-elite relay. At each scheduled event, every group offers its already evaluated elite to the next group in a fixed ring; the elite replaces the receiver's worst eligible learner only when its stored objective value is better. Because the exact relay copies an already evaluated solution and its stored fitness, it requires no additional objective-function calls. The frozen gts-v4-cm-fixed implementation is evaluated under equal 10,000-evaluation budgets on eight classical functions at dimensions 10, 30, 50, and 100, with 30 matched seeds, and on five constrained engineering problems. A direct ablation against the same grouped landscape-aware controller without relay records 728/11/221 wins/ties/losses and a rank-biserial effect size of 0.624 across dimensions. In an eight-method multidimensional comparison, WOA obtains the best average rank (2.914) and ZIVARI-TLBO ranks second (3.382); ZIVARI-TLBO significantly outperforms TLBO, MCTLBO, DE, PSO, and GWO, loses significantly to WOA, and is not significantly different from HHO after Holm adjustment. Feasibility-aware engineering results are mixed and sensitive to the current static-penalty formulation. The evidence supports a scoped relay contribution and budget-consistent information-sharing mechanism, but not universal state-of-the-art, global-convergence, engineering-dominance, or CEC superiority claims.
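
The relay step described in the abstract can be sketched as follows. This is a simplified illustration under several assumptions: the problem is a minimization, every learner in the receiving group counts as "eligible", and all elites are snapshotted before any replacement happens. The function name `relay_elites` and the `(fitness, solution)` tuple layout are hypothetical.

```python
def relay_elites(groups):
    """Pass each group's evaluated elite to the next group in a fixed ring.

    `groups` is a list of groups; each group is a list of
    (fitness, solution) tuples with already stored fitness values.
    The elite replaces the receiver's worst learner only if the elite's
    stored fitness is strictly better (lower). No objective-function
    calls are made, since the stored fitness is copied with the solution.
    """
    # Snapshot elites first so a relayed elite is not forwarded again
    # within the same event (an assumption).
    elites = [min(group, key=lambda ind: ind[0]) for group in groups]
    for g, elite in enumerate(elites):
        receiver = groups[(g + 1) % len(groups)]
        worst_idx = max(range(len(receiver)), key=lambda i: receiver[i][0])
        if elite[0] < receiver[worst_idx][0]:
            receiver[worst_idx] = elite  # copy solution and stored fitness
    return groups
```

Since the relay only copies tuples that already carry their fitness, the evaluation budget is unchanged, which matches the "zero-cost" claim in the abstract.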

2606.17910 2026-06-17 cs.IR cs.AI cs.CL 交叉投稿

Non-negative Elastic Net Decoding for Information Retrieval

非负弹性网络解码用于信息检索

Koki Okajima, Yasutoshi Ida, Tsukasa Yoshida, Yasuaki Nakamura

发表机构 * NTT, Inc.(NTT公司)

AI总结 提出非负弹性网络(NNN)解码方法,将检索视为联合解码问题,通过稀疏非负线性组合重构查询嵌入,在理论上严格优于稠密检索,实验表明在多个基准上取得一致改进。

Comments 19 pages, 4 figures

详情
AI中文摘要

稠密检索已成为信息检索中的主导范式,其中每个文档通过其向量嵌入与查询的内积进行评分,并根据分数检索前$k$个文档。然而,由于每个文档的分数仅取决于查询和自身的嵌入,检索过程忽略了整个语料库的内容。因此,稠密检索无法避免从语料库中选择语义相似的文档,这可能导致检索结果集缺乏多样性且冗余。为此,我们将检索视为一个联合解码问题,其中文档作为集合被选择,并考虑语料库其余部分的上下文。为了实现这一点,我们提出了非负弹性网络(NNN)解码,它选择嵌入能够联合重构查询嵌入(作为稀疏非负线性组合)的文档。我们的主要理论结果建立了稠密检索与NNN解码之间的严格分离。对于任何语料库,稠密检索正确处理的每个查询也由NNN解码处理,而在包含相关文档的语料库上,NNN解码额外处理了稠密检索无法处理的查询。实验结果表明,将NNN解码应用于为内积评分训练的冻结嵌入,在多个基准上产生了一致的改进。此外,我们引入了一种端到端训练过程,优化嵌入以用于NNN解码,在所有指标和基准上相比稠密检索产生了显著的性能提升。我们的工作为在信息检索中利用稠密嵌入建立了一种新的范式,超越了内积评分的标准实践。

英文摘要

Dense retrieval has become the dominant paradigm in information retrieval, in which each document is scored against a query by the inner product of their vector embeddings, and the top-$k$ documents by score are retrieved for this query. However, since each document's score depends solely on the embedding of the query and itself, the retrieval process is oblivious to the content of the entire corpus. Therefore, dense retrieval cannot avoid selecting semantically similar documents from the corpus, which may result in a non-diverse, redundant set of retrieved documents. To address this, we approach retrieval as a joint decoding problem, in which documents are selected as a set with regard to the context of the rest of the corpus. To achieve this, we propose Non-Negative elastic Net (NNN) decoding, which selects documents whose embeddings jointly reconstruct the query embedding as a sparse non-negative linear combination. Our main theoretical result establishes a strict separation between dense retrieval and NNN decoding. For any corpus, every query correctly handled by dense retrieval is also handled by NNN decoding, while on corpora containing correlated documents, NNN decoding additionally handles queries that dense retrieval cannot. Experimental results indicate that applying NNN decoding to frozen embeddings trained for inner-product scoring yields consistent improvements across several benchmarks. Moreover, we introduce an end-to-end training procedure that optimizes the embeddings for NNN decoding, producing significant gains over dense retrieval on all metrics and benchmarks. Our work establishes a new paradigm for leveraging dense embeddings in information retrieval, beyond the standard practice of inner-product scoring.
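
To make "sparse non-negative linear combination" concrete, here is a minimal sketch of a non-negative elastic net solved by coordinate descent. The objective 0.5·||q − Dᵀw||² + l1·Σw + 0.5·l2·||w||² with w ≥ 0, the function name `nnn_decode`, and the penalty values are assumptions for illustration; the abstract does not state the exact formulation or solver the authors use.

```python
import numpy as np


def nnn_decode(D: np.ndarray, q: np.ndarray, l1: float = 0.1,
               l2: float = 0.1, n_iter: int = 200) -> np.ndarray:
    """Return non-negative weights w so that D.T @ w approximates q.

    D has shape (n_docs, dim); q has shape (dim,). Assumed objective:
    0.5 * ||q - D.T @ w||^2 + l1 * sum(w) + 0.5 * l2 * ||w||^2, with w >= 0.
    """
    n_docs = D.shape[0]
    w = np.zeros(n_docs)
    residual = q.astype(float).copy()      # equals q - D.T @ w throughout
    sq_norms = (D ** 2).sum(axis=1)
    for _ in range(n_iter):
        for j in range(n_docs):
            # Correlation of document j with the residual that excludes its own term.
            rho = D[j] @ residual + sq_norms[j] * w[j]
            # Closed-form coordinate minimizer, clipped at zero for non-negativity.
            new_w = max(0.0, rho - l1) / (sq_norms[j] + l2)
            residual += D[j] * (w[j] - new_w)
            w[j] = new_w
    return w
```

Under this sketch, the documents with nonzero weight would form the retrieved set, for example `np.flatnonzero(nnn_decode(D, q))`; whether the authors rank by weight in this way is an assumption.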

5. 机器学习与表示学习 64 篇

2606.17648 2026-06-17 cs.AI 新提交

From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs

从酝酿到解析:追踪LLM中代码推理的内部生命周期

Siyue Chen, Yifu Guo, Yuquan Lu, Zishan Xu, Jiaye Lin, Jianbo Lin, Siyu Zhang, Cheng Yang, Junxin Li, Yujia Li, Yu Huo, Ruixuan Wang

发表机构 * South China University of Technology(华南理工大学) Sun Yat-sen University(中山大学) Tsinghua University(清华大学) Shanghai Jiao Tong University(上海交通大学) Nanjing University(南京大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Hangzhou Dianzi University(杭州电子科技大学) Guangzhou College of Technology and Business(广州工商学院)

AI总结 提出双重诊断框架(逐层线性探针与上下文剥离解码),揭示LLM在代码推理中先酝酿答案后进入四种解析结果(已解析、过度处理、错误解析、未解析)的内部生命周期,发现酝酿支架稳定而解析成功随能力变化。

详情
AI中文摘要

标准准确率指标无法解释为什么LLM能处理变量追踪但在语义等价的循环上失败。我们研究了代码推理的内部生命周期,其中模型首先酝酿答案,使其在变得可自解码之前的许多层就线性可恢复,然后分化为四种解析结果之一:已解析、过度处理、错误解析或未解析。理解这一生命周期很重要,因为相似的任务准确率可能掩盖表面评估无法检测的根本不同的失败模式。我们引入了一个双重诊断框架,将逐层线性探针与上下文剥离解码(CSD)配对,并将其应用于跨越Qwen、Llama和DeepSeek架构的16个模型的六个代码推理任务族。所有四种结果在每个任务族中都占有显著比例:总体已解析仅为41.5%,多个任务低于30%。对结构、深度和算子的受控扫描揭示了特定任务的失败瓶颈:函数调用已解析率随着调用深度从一层增加到三层而从61.1%骤降至2.5%。跨架构和规模,酝酿支架保持稳定,所有16个模型的归一化酝酿持续时间为24-42%,而解析成功随能力变化。这表明该支架是所测试的仅解码器(decoder-only)Transformer家族中稳定的经验规律,而解析成功与能力、规模和训练共变。代码:https://github.com/euyis1019/llm-brewing

英文摘要

Standard accuracy metrics cannot explain why LLMs handle variable tracking but fail on semantically equivalent loops. We study an internal lifecycle of code reasoning in which models first brew the answer, making it linearly recoverable many layers before it becomes self-decodable, and then diverge into one of four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved. Understanding this lifecycle matters because similar task accuracies can mask fundamentally different failure modes that surface-level evaluation cannot detect. We introduce a dual diagnostic framework pairing layer-wise linear probing with Context-Stripped Decoding (CSD) and apply it to six code-reasoning task families across 16 models spanning Qwen, Llama, and DeepSeek architectures. All four outcomes carry substantial mass in every task family: overall Resolved is only 41.5%, with multiple tasks below 30%. Controlled sweeps over structure, depth, and operators expose task-specific failure bottlenecks: Function Call Resolved plunges from 61.1% to 2.5% as call depth increases from one to three. Across architectures and scales, the brewing scaffold remains stable, with normalized brewing duration 24-42% across all 16 models, while resolution success varies with capability. This indicates that the scaffold is a stable empirical regularity across the tested decoder-only Transformer families, whereas resolution success covaries with capability, scale, and training. Code: https://github.com/euyis1019/llm-brewing

2606.17657 2026-06-17 cs.AI 新提交

Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games

使用认知模型改进语言模型对人类说服博弈的模拟

Zirui Cheng, Zeyu Shen, Thomas L. Griffiths, Peter Henderson

发表机构 * Princeton University(普林斯顿大学)

AI总结 提出方程到行为提示和强化学习方法,使语言模型匹配认知模型(如贝叶斯更新、动机推理),在说服博弈中提升模拟人类决策多样性的能力。

详情
AI中文摘要

人们在战略互动中做出不同的决策。有些人像贝叶斯一样更新信念;其他人则表现出动机推理等偏见。尽管大型语言模型的创建者使用模拟人类进行安全评估和训练,但他们往往未能涵盖人类行为的这种广度。我们认为认知科学和经济学提供了一种方便的工具来做到这一点,利用人类决策的数学模型。我们提出了一种称为方程到行为提示的方法,用于引导大型语言模型匹配认知模型,并在基于法律决策的说服博弈中评估这种方法。我们发现大型模型可以通过提示近似基于方程的规范——贝叶斯更新、仿射扭曲、动机更新和Grether的$\alpha$-$\beta$模型,但小型模型无法做到。然而,使用强化学习训练小型模型以遵循数学规则,即方程到行为强化学习,在分布外参数化中将信念误差降低了26.5%。我们表明这些模拟可以帮助创建多样化的训练环境;训练小型模型考虑不同类型的决策者,与仅贝叶斯训练相比,平均信念变化提高了2.5%–12%,即使在说服GPT-5-mini时也是如此。我们的工作可以改进在日益逼真的环境中用于训练和评估的人类模拟,并且还可以促进对人类决策更复杂数学模型的新研究。

英文摘要

People make decisions differently in strategic interactions. Some update beliefs like a Bayesian; others exhibit biases like motivated reasoning. Although creators of large language models use simulated humans for safety evaluations and training, they often fail to cover this breadth of human behavior. We argue that cognitive science and economics provide a convenient tool for doing so, making use of mathematical models of human decision-making. We propose an approach that we call Equation-to-Behavior Prompting for guiding large language models to match cognitive models, and evaluate this approach on persuasion games based on legal decision-making. We find that large models can approximate equation-based specifications (Bayesian updating, affine distortion, motivated updating, and Grether's $α$-$β$ model) using prompting, but small models fail to do so. However, training small models with reinforcement learning to adhere to mathematical rules, Equation-to-Behavior RL, reduces belief error by 26.5% in out-of-distribution parameterizations. We show that these simulations can help create diverse training environments; training small models to consider different kinds of decision-makers improves average belief change by 2.5% to 12% over Bayesian-only training, even when persuading GPT-5-mini. Our work could improve human simulations for training and evaluation in increasingly realistic settings, and could also enable novel research into more complicated mathematical models of human decision-making.
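
As an illustration of what an "equation-based specification" of belief updating can look like, the sketch below implements a Grether-style update in odds form. The assignment of α to the likelihood ratio and β to the prior odds, and the function name `grether_update`, are assumptions for this example; the abstract does not give the paper's parameterization. With α = β = 1 the rule reduces to Bayes' rule.

```python
def grether_update(prior: float, lik_h: float, lik_not_h: float,
                   alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior probability of hypothesis H under a Grether-style rule.

    posterior_odds = likelihood_ratio ** alpha * prior_odds ** beta
    alpha < 1 underweights the evidence; beta < 1 underweights the prior
    (base-rate neglect). Requires 0 < prior < 1 and lik_not_h > 0.
    """
    prior_odds = prior / (1.0 - prior)
    likelihood_ratio = lik_h / lik_not_h
    posterior_odds = (likelihood_ratio ** alpha) * (prior_odds ** beta)
    return posterior_odds / (1.0 + posterior_odds)
```

For example, `grether_update(0.5, 0.8, 0.2)` gives the Bayesian posterior, while passing `alpha=0.5` yields a more conservative belief from the same evidence.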

2606.17735 2026-06-17 cs.AI 新提交

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

打破自回归诅咒:动态认知熵编排的可擦除强化学习用于大语言模型

Ziliang Wang, Kang An, Faqiang Qian, Jialu Cai, Cijun Ouyang, Yuhang Wang, Qibing Ren, Yichao Wu

发表机构 * SenseTime(商汤科技) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出动态认知熵编排的可擦除强化学习(E³RL),通过将模型内生的局部自回归交叉熵作为认知不确定性坐标,利用分段自适应动态阈值和优势分配精准切除逻辑缺陷并重用KV缓存,解决长序列推理中的自回归级联崩溃问题。

详情
AI中文摘要

尽管强化学习(RL)扩展了大语言模型(LLMs)的认知边界,但在长程逻辑推理中,它仍然容易受到自回归诅咒的影响:生成早期引入的微小认知扰动会沿着马尔可夫决策过程流不可逆地传播,引发级联故障,导致推理轨迹崩溃。为了克服这种自回归级联(即单个早期错误可能危及所有后续推理步骤),我们提出了动态认知熵编排的可擦除强化学习($\text{E}^3\text{RL}$)。$\text{E}^3\text{RL}$ 通过将模型内生的局部自回归交叉熵作为认知不确定性的内在坐标,消除了对外部信号的依赖。通过引入分段自适应动态阈值和优势分配,$\text{E}^3\text{RL}$ 使模型能够精确切除局部逻辑缺陷,同时重用历史键值(KV)缓存流,从而赋予推理过程自愈能力。我们在 DeepMath-103k 数据集上训练 $\text{E}^3\text{RL}$。实验结果表明,$\text{E}^3\text{RL}$ 重塑了长序列推理的探索效率,提高了样本效率,同时保持线性内存开销。在 AIME 等数学推理基准上,$\text{E}^3\text{RL}$ 取得了显著的性能提升,4B 和 8B 参数模型分别超越了之前的最优结果(SOTA)5.349% 和 6.514%。这些发现表明,$\text{E}^3\text{RL}$ 打破了长序列推理中的自回归诅咒,为下一代自愈人工通用智能(AGI)奠定了理论和系统级基础。

英文摘要

Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations introduced early in generation can propagate irreversibly along the Markov decision process flow, triggering cascading failures that drive the reasoning trajectory toward collapse. To overcome this autoregressive cascade, in which a single early mistake can compromise all subsequent reasoning steps, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$). $\text{E}^3\text{RL}$ eliminates reliance on external signals by grounding the model's endogenous local autoregressive cross-entropy as an intrinsic coordinate of epistemic uncertainty. By introducing segment-level adaptive dynamic thresholds and advantage allocation, $\text{E}^3\text{RL}$ enables the model to precisely excise localized logical defects while reusing historical key-value (KV) cache streams, thereby endowing the reasoning process with a self-healing capability. We train $\text{E}^3\text{RL}$ on the DeepMath-103k dataset. Experimental results show that $\text{E}^3\text{RL}$ reshapes the exploration efficiency of long-sequence reasoning and improves sample efficiency while maintaining linear memory overhead. On mathematical reasoning benchmarks such as AIME, $\text{E}^3\text{RL}$ achieves substantial performance gains, with the 4B and 8B parameter models surpassing previous state-of-the-art (SOTA) results by 5.349% and 6.514%, respectively. These findings suggest that $\text{E}^3\text{RL}$ shatters the autoregressive curse in long-sequence reasoning and establishes a theoretical and systems-level foundation for the next generation of self-healing artificial general intelligence (AGI).

2606.17945 2026-06-17 cs.AI 新提交

Small Initialization Matters for Large Language Models

小初始化对大语言模型至关重要

Liangkai Hang, Junjie Yao, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Zhi-Qin John Xu

发表机构 * School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院) Institute of Natural Sciences, Shanghai Jiao Tong University(上海交通大学自然科学研究院) MemTensor (Shanghai) Technology Co., Ltd.(上海记忆张量科技有限公司) Institute for Advanced Algorithms Research(先进算法研究所)

AI总结 本文发现减小初始化尺度能持续改善大语言模型预训练,尤其在推理任务上提升显著,并揭示了小初始化驱动参数从低复杂度结构向丰富表示演化的机制。

Comments 26 pages, 8 figures

详情
AI中文摘要

大语言模型提供了一个可处理的系统,用于探究智能本身如何涌现,而不仅仅是LLM如何被工程化。尽管进展通常归因于规模、数据和架构,但我们表明参数初始化是训练以及模型能力的基因式决定因素。减小初始化尺度持续改善预训练,在推理密集型任务上收益最大。我们识别出两种限制小初始化优势的常用经验设置,并展示放松这些设置如何恢复有利的缩放。我们进一步发现了一个平衡推理和训练的关键初始化。从机制上讲,小初始化驱动了独特的发展轨迹:参数首先凝聚成低复杂度结构,随后扩展为更丰富的表示,为“压缩即智能”这一观点提供了具体形式。词元级分析表明,收益集中在非平凡、上下文约束的预测上,而非均匀地分布于所有词元。这些结果启发了一个简单的$\gamma$-初始化规则:将初始化范围作为显式旋钮,并默认使用小初始化,这是一种几乎无成本的干预,能改善预训练并跨模型规模增强推理。

英文摘要

Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data and architecture, we show that parameter initialization is a gene-like determinant of training and, in particular, of model capacity. Reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks. We identify two widely used empirical settings that restrain the advantage of small initialization, and show how relaxing them restores favorable scaling. We further uncover a critical initialization that balances reasoning and training. Mechanistically, small initialization drives a distinct developmental trajectory: parameters first condense into low-complexity structures and later expand into richer representations, giving concrete form to the idea that compression is intelligence. Token-level analyses show that the gains concentrate on non-trivial, context-constrained predictions rather than all tokens uniformly. These results motivate a simple $γ$-initialization rule: expose initialization range as an explicit knob and use small initialization by default, an almost cost-free intervention that improves pretraining and strengthens reasoning across model scales.

2606.17979 2026-06-17 cs.AI 新提交

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

STAR: 文本到图像强化学习后训练中的时空自适应奖励分配

Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan

AI总结 针对文本到图像生成中奖励与生成轨迹粒度不匹配的问题,提出STAR方法,利用文本-图像注意力构建时空自适应分配图,对相关潜在区域施加更强策略更新,提升语义对齐和文本渲染性能。

详情
AI中文摘要

现有的文本到图像生成的强化学习后训练方法通常将最终图像奖励转换为单个标量优势,并以相同强度应用于整个生成轨迹。然而,文本到图像生成自然具有时间和空间结构:不同的去噪步骤负责不同的生成阶段,而真正决定文本对齐的内容通常只出现在图像的一部分。这种粒度不匹配使得策略更新难以聚焦于实际影响奖励的生成组件。为了解决这个问题,我们提出了用于文本到图像扩散和流模型的强化学习后训练的时空自适应奖励(STAR)分配。STAR利用生成模型内部的文本-图像注意力,从用户提示中真正关心的核心内容开始,构建在去噪步骤和展开中动态变化的空间分配图,并将相同的组相对优势分配给更相关的潜在区域,几乎没有额外的计算开销。然后,STAR通过空间分辨的策略目标对这些区域应用更强的策略更新。我们使用Stable Diffusion 3.5 Medium作为基础模型,并在三个任务上评估:GenEval、OCR文本渲染和PickScore。实验结果表明,STAR在不改变外部奖励源的情况下,改善了组合语义对齐、文本渲染和偏好优化,在GenEval、OCR和PickScore上分别达到了0.9759、0.9757和23.60。

英文摘要

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.

2606.18132 2026-06-17 cs.AI 新提交

Knowledge Reutilization in Meta-Reinforcement Learning

元强化学习中的知识复用

Yuan Meng, Bo Wang, Juan de los Rios Ruiz, Xiangtong Yao, Zhenshan Bing, Fuchun Sun, Alois Knoll

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 提出一种元知识复用框架,通过动力学简化智能体学习任务知识并迁移至异构智能体,利用贝叶斯非参数先验和高层策略生成任务级指导,显著降低跟踪误差并提高样本效率。

Comments 18 pages initial submission

详情
AI中文摘要

元强化学习通过从相关任务中提取共享结构实现快速适应,但现有的端到端方法通常将任务推理与具身特定控制耦合。这种耦合可能模糊非参数任务语义,降低样本效率,并限制跨智能体复用。我们提出一个元知识复用框架,在动力学简化的智能体上学习任务级知识,并将其迁移至异构智能体。该框架使用贝叶斯非参数先验组织潜在任务模式,并使用高层策略生成任务级幅度指导。为了桥接可复用任务知识与不同具身,我们引入一个语义-幅度接口和一个轻量级时间适配器,将冻结的元知识转换为具身特定低层控制器的时间对齐子目标。在多个运动智能体上的实验表明,与最近的最先进基线相比,我们的框架将最终步跟踪误差降低了94.75%–99.79%,并且仅使用约23.8%的交互数据即可达到相当的部署性能。

英文摘要

Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, and limit cross-agent reuse. We propose a meta-knowledge reutilization framework that learns task-level knowledge on a dynamics-simplified agent and transfers it to heterogeneous agents. The framework uses a Bayesian non-parametric prior to organize latent task modes and a high-level policy to generate task-level magnitude guidance. To bridge reusable task knowledge with different embodiments, we introduce a semantic-magnitude interface and a lightweight temporal adaptor, which convert frozen meta-knowledge into temporally aligned subgoals for embodiment-specific low-level controllers. Experiments on multiple locomotion agents show that our framework reduces final-step tracking error by 94.75% to 99.79% compared with recent state-of-the-art baselines and achieves comparable deployment performance with about 23.8% of their interaction data.

2606.18206 2026-06-17 cs.AI 新提交

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

不动点推理器:稳定且自适应的深度循环Transformer

Sajad Movahedi, Vera Milovanović, Shlomo Libo Feigin, Alexander Theus, Thomas Hofmann, Valentina Boeva, T. Konstantin Rusch, Antonio Orvieto

发表机构 * ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center(ELLIS研究所蒂宾根,马克斯·普朗克智能系统研究所,蒂宾根人工智能中心) ETH Zurich(苏黎世联邦理工学院) Swiss Institute of Bioinformatics(瑞士生物信息学研究所) Université Paris Cité(巴黎西岱大学) Liquid AI

AI总结 针对循环架构中深度导致的信号传播问题,提出基于预层归一化和残差缩放的FPRM模型,利用不动点收敛作为端到端停止机制,在Sudoku、Maze等推理基准上自适应计算并有效提升性能。

Comments Code available at https://github.com/nilskiKonjIzDunava/fprm

详情
AI中文摘要

循环架构为学习需要组合推理的任务的逐步程序提供了归纳偏置。通过循环达到的有效层数决定了这些模型找到的解的质量。与深层架构类似,循环架构容易受到由深度引起的信号传播问题的影响,因为停止决策被推迟。在本文中,我们使用预层归一化和残差缩放来解决这个信号传播问题。基于这些架构修改,我们提出了FPRM,一种基于Transformer的不动点推理模型,它在循环架构中使用不动点收敛作为端到端停止机制。我们表明,不动点停止允许FPRM根据任务难度调整其计算量。FPRM在常见的推理基准(即Sudoku、Maze、状态跟踪和ARC-AGI)上是有效的。

英文摘要

Looped architectures provide an inductive bias toward learning step-by-step procedures for tasks that require compositional reasoning. The number of effective layers reached by looping determines the quality of the solution these models find. Like deep architectures, looped architectures are prone to a signal propagation problem induced by depth as the halting decision is postponed. In this paper, we address this signal propagation issue using pre-norm layers and residual scaling. Building on these architectural modifications, we propose FPRM, a Transformer-based Fixed-Point Reasoning Model that uses fixed-point convergence as an end-to-end halting mechanism in a looped architecture. We show that fixed-point halting allows FPRM to adapt its compute to task difficulty. FPRM is effective on common reasoning benchmarks, namely Sudoku, Maze, state-tracking, and ARC-AGI.
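
The idea of halting a loop once the state stops changing can be sketched on a toy map. This is not FPRM itself: the function name `iterate_to_fixed_point`, the relative tolerance test, and the use of a plain callable in place of a Transformer block are all assumptions chosen to keep the example self-contained.

```python
from typing import Callable, Tuple

import numpy as np


def iterate_to_fixed_point(step: Callable[[np.ndarray], np.ndarray],
                           h0: np.ndarray, tol: float = 1e-6,
                           max_steps: int = 1000) -> Tuple[np.ndarray, int]:
    """Apply `step` repeatedly until successive states agree within `tol`.

    Returns the final state and the number of applications of `step`,
    so inputs that converge slowly automatically receive more compute.
    """
    h = h0
    for n in range(1, max_steps + 1):
        h_next = step(h)
        # Halt when the update is small relative to the current state norm.
        if np.linalg.norm(h_next - h) <= tol * (1.0 + np.linalg.norm(h)):
            return h_next, n
        h = h_next
    return h, max_steps
```

With a contraction such as `lambda h: 0.5 * h + x`, the loop stops after a few dozen steps; a weaker contraction like `lambda h: 0.9 * h + x` needs more iterations, which mirrors the compute-adapts-to-difficulty behavior the abstract describes.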

2606.17107 2026-06-17 cs.LG cs.AI 交叉投稿

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

模型在预填充阶段记笔记:KV缓存可编辑且可组合

Bojie Li

发表机构 * Pine AI

AI总结 研究发现KV缓存像笔记一样存储结论,支持编辑和组合:编辑单个字段可修正决策(8B模型准确率1.00,仅需~1%计算),组合预编译技能可无缝插入任意上下文(logit余弦相似度0.90-0.999),延迟降低至O(L)。

详情
AI中文摘要

前缀缓存仅对完全共享的前缀重用预填充结果,因此一个字段的改变会使整个下游缓存失效。然而,覆盖该字段自身的键/值向量并重用其余部分,会导致模型基于旧值行动。通过四个模型家族的因果分析,原因在于:在预填充阶段,模型已将基于字段条件的结论写入下游笔记;该字段自身的键/值对决策的贡献不足1%。将KV缓存视为记录已记忆结论的笔记本,可以引出两个能力。(1) 可编辑性。一个显著的勘误可以修正笔记;结合思维链,仅编辑该字段即可恢复决策(8B模型准确率1.00,约1%计算),而无思维链时则被忽略。(2) 可组合性。笔记具有位置可移植性,因此预编译的技能可以通过RoPE重新定位并拼接至任意上下文,与完全重计算无法区分(logit余弦相似度0.90-0.999,十二个模型),且首次令牌延迟为O(L)而非O(L^2)。统一的编辑+组合智能体在决策上与重计算相同,延迟降低高达14.9倍。该方法适用于任何逐令牌注意力KV缓存,在规模、量化、混合专家和多模态缓存上得到验证,并通过小型适配器扩展到多种注意力变体。由于勘误仅追加,它与生产环境中的前缀缓存兼容:在在线vLLM基准测试中,它保持前缀缓存对齐(命中率98.5%),将p90首次令牌延迟降低53-398倍。

英文摘要

Prefix caching reuses prefill only across an exactly shared prefix, so one changed field invalidates the entire downstream cache. Yet overwriting the field's own key/value vectors and reusing the rest leaves the model acting on the old value. The reason, established causally across four model families: at prefill the model has already written the field-conditioned conclusion onto downstream notes; the field's own key/value drives under 1% of the decision. Read as a notebook of memoized conclusions, two capabilities follow. (1) It is editable. A salient erratum amends the notes; and with chain-of-thought, editing the field alone recovers the decision (1.00 at 8B, ~1% compute), while without CoT it is ignored. (2) It is composable. The notes are position-portable, so a precompiled skill can be RoPE-repositioned and spliced into any context, indistinguishable from full recompute (logit cosine 0.90-0.999, twelve models) at O(L) rather than O(L^2) time-to-first-token. A unified edit+compose agent stays decision-identical to recompute at up to 14.9x lower latency. The approach applies to any per-token attention KV cache, validated across scale, quantization, Mixture-of-Experts, and multimodal caches, and extends to several attention variants through small adapters. Because the erratum is append-only, it composes with production prefix caching: in an online vLLM benchmark it keeps the prefix cache-aligned (98.5% hit-rate), cutting p90 time-to-first-token by 53-398x.
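
Why a cached key can be "RoPE-repositioned" follows from the fact that rotary rotations compose additively in position: rotating a vector encoded at position m by a further Δ positions gives the same result as encoding it at m + Δ. The sketch below checks this on a single vector. The interleaved pairing of dimensions and the base of 10000 are common conventions assumed here, not details taken from the paper, and `rope` is a hypothetical helper name.

```python
import numpy as np


def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply a rotary position embedding to a 1-D vector of even length."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dimension pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    # Rotate each (even, odd) pair by its position-dependent angle.
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out
```

In this sketch, a key cached at position 5 can be moved to position 17 by applying `rope(cached_key, 12)`, without access to the unrotated key.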

2606.17118 2026-06-17 cs.LG cs.AI 交叉投稿

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

MODE: 面向MoE多模态大语言模型的模态分解专家级混合精度量化

Yuanteng Chen, Peisong Wang, Zhilei Liu, Nanxin Zeng, Yuantian Shao, Shiqiang Lang, Tao Liu, Chuangyi Li, Qinghao Hu, Gang Li, Jing Liu, Jian Cheng

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Zhongguancun Academy(中关村学院)

AI总结 针对MoE多模态大语言模型在专家重要性估计中存在的跨模态和视觉内偏差,提出模态分解的专家级混合精度量化框架MODE,通过分解选择频率、过滤冗余视觉令牌并评估模态敏感性,在给定预算下分配比特宽度,在W3A16下平均性能损失控制在2.9%以内。

Comments 18 pages, 8 figures

详情
AI中文摘要

混合专家多模态大语言模型(MoE-MLLMs)性能卓越,但GPU内存成本高昂,因此压缩至关重要。在PTQ方法中,专家级混合精度量化已被证明对MoE-LLMs有效,但由于专家重要性估计中两个被忽视的偏差,在MoE-MLLMs上性能显著下降。(1)在跨模态层面,视觉令牌的数值优势导致专家选择频率被视觉令牌主导,掩盖了对文本模态至关重要的专家;(2)在视觉内层面,大量冗余视觉令牌进一步扭曲频率统计,模糊了对信息性视觉内容关键的专家。为弥补差距,我们提出MODE,一种面向MoE-MLLMs的模态分解专家级混合精度量化框架,该框架按模态分解专家选择频率,过滤冗余视觉令牌以获得去噪的视觉频率,并进一步评估每个模态的量化敏感性作为基于频率估计的补充信号。这些信号被整合到整数线性规划公式中,以在给定预算下分配每个专家的比特宽度。大量实验表明,MODE特别适合MoE-MLLMs,在W3A16下平均性能损失限制在2.9%以内,在极端2比特设置下获得更大增益。

英文摘要

Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. Among post-training quantization (PTQ) methods, expert-level mixed-precision quantization has proven effective for MoE-LLMs, yet it suffers notable degradation on MoE-MLLMs due to two overlooked biases in expert importance estimation. (1) At the cross-modal level, the numerical dominance of vision tokens causes expert selection frequency to be dominated by vision tokens, masking experts that are critical to the text modality; (2) at the intra-vision level, the large proportion of redundant vision tokens further skews frequency statistics, obscuring experts critical for informative visual content. To bridge these gaps, we propose MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE-MLLMs that decomposes expert selection frequency by modality, filters redundant vision tokens to obtain denoised visual frequency, and further evaluates quantization sensitivity per modality as a complementary signal to frequency-based estimation. These signals are integrated into an Integer Linear Programming formulation to assign per-expert bit-widths under a given budget. Extensive experiments show that MODE is particularly well-suited for MoE-MLLMs, limiting average performance loss to within 2.9% at W3A16, with larger gains at the extreme 2-bit setting.

2606.17199 2026-06-17 cs.LG cs.AI 交叉投稿

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

PowerOPD:利用有界幂变换稳定在线策略蒸馏

Anhao Zhao, Junlong Tong, Yingqi Fan, Ping Nie, Wenjie Li, Xiaoyu Shen

发表机构 * Eastern Institute of Technology, Ningbo(宁波东方理工大学) The Hong Kong Polytechnic University(香港理工大学) Shanghai Jiao Tong University(上海交通大学) University of Waterloo(滑铁卢大学)

AI总结 针对在线策略蒸馏中log-ratio奖励无界导致训练不稳定问题,提出基于Box-Cox幂变换的有界、符号一致奖励族PowerOPD,在数学推理任务上平均提升Avg@8/Pass@8达+6.37/+5.71,并降低59.2%时间与23.1%显存。

详情
AI中文摘要

大型语言模型的标准在线策略蒸馏(OPD)利用学生采样令牌估计反向KL散度,得到一个无偏的单样本蒙特卡洛估计器,避免了全词汇计算。然而,我们表明该估计器在实践中存在严重的训练病态:样本效率低、生成动态不稳定,以及与精确全词汇OPD相比显著的性能差距。奖励级别的诊断将这些病态追溯到log-ratio奖励,该奖励在结构上无界,产生极高方差的梯度,集中在早期位置并持续整个训练;标准的后验缩放方法仅在失真发生后操作,因此失效。为解决此问题,我们提出PowerOPD:一个源自Box-Cox幂变换的原生有界、符号一致的奖励族,由alpha > 0参数化,其中log-ratio是其退化极限alpha -> 0。在六个数学推理基准和四个Qwen3师生对中,PowerOPD在基准平均Avg@8/Pass@8上相比原始OPD提升高达+6.37/+5.71,相比后验稳定化提升+3.01/+3.54,相比全词汇OPD提升+2.59/+8.90,同时减少59.2%的挂钟时间和23.1%的峰值GPU内存。较大的alpha通常提高准确率,一致缩短响应长度,并使梯度范数比原始OPD小3000倍以上。

英文摘要

Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling methods fail because they operate only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by alpha > 0, of which the log-ratio is the degenerate alpha -> 0 limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.
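
The sketch below shows one reward with the three properties the abstract names (bounded, sign-consistent with the log-ratio, and converging to the log-ratio as alpha goes to 0). It applies the Box-Cox transform to the teacher and student token probabilities and takes the difference. This specific form and the name `power_reward` are assumptions for illustration; the abstract does not give the paper's exact formula.

```python
import math


def box_cox(x: float, alpha: float) -> float:
    """Box-Cox power transform; alpha == 0 is defined as its log limit."""
    if alpha == 0.0:
        return math.log(x)
    return (x ** alpha - 1.0) / alpha


def power_reward(p_teacher: float, p_student: float, alpha: float) -> float:
    """Assumed reward: box_cox(p_teacher) - box_cox(p_student).

    For alpha > 0 and probabilities in [0, 1] this equals
    (p_teacher**alpha - p_student**alpha) / alpha, so |reward| <= 1 / alpha,
    and its sign matches log(p_teacher / p_student).
    """
    return box_cox(p_teacher, alpha) - box_cox(p_student, alpha)
```

For a token the student assigns probability 1e-12, the log-ratio reward is large, while the alpha = 0.5 variant in this sketch stays within 2 in absolute value.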

2606.17399 2026-06-17 cs.LG cs.AI 交叉投稿

The Discrete-Log Clock: How a Transformer Learns Modular Multiplication

离散对数时钟:Transformer如何学习模乘法

Huu Danh Nguyen

发表机构 * Stanford University(斯坦福大学)

AI总结 通过乘法特征变换分析,发现Transformer在模乘法任务中学习到稀疏的傅里叶谱,其嵌入和MLP神经元主要编码少数乘法频率,表明模型实现了离散对数空间中的加法运算,即“离散对数时钟”算法。

Comments 5 pages, 5 figures. Accepted to the Mechanistic Interpretability Workshop at ICML 2026

详情
AI中文摘要

当小型Transformer在模乘法任务中实现“grok”时,先前研究报告学习到的嵌入具有“密集”的傅里叶谱,需要所有频率。这与模加法形成对比,后者只需一组稀疏的关键频率。我们证明这种密度是错误基下分析的伪像。乘法的自然傅里叶变换不是标准加法DFT,而是乘法特征变换,它将乘法群$(\mathbb{Z}/p\mathbb{Z})^*$上的函数分解为其不可约表示。将此变换应用于在$a \cdot b \bmod 113$上训练的grokked Transformer,我们发现嵌入谱变得高度稀疏(基尼系数0.58 vs 加法基下的0.07),仅4个关键频率携带显著能量。此外,96.9%的MLP神经元被干净地调谐到单个乘法频率,并且神经元激活热图在按离散对数重排序后显示出二维周期结构。这些结果表明Transformer将乘法简化为离散对数空间中的加法,实现了类似于Nanda等人针对加法的Clock算法的“离散对数时钟”算法。该方法具有普适性:将分析基与任务的代数结构匹配,可以在标准工具视为噪声的地方揭示可解释结构。

英文摘要

When small transformers grok modular multiplication, prior work reports that the learned embedding has a "dense" Fourier spectrum requiring all frequencies. This contrasts with modular addition, where only a sparse set of key frequencies suffices. We show this density is an artifact of analyzing in the wrong basis. The natural Fourier transform for multiplication is not the standard additive DFT but the multiplicative character transform, which decomposes functions on the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^*$ into its irreducible representations. Applying this transform to a grokked transformer trained on $a \cdot b \bmod 113$, we find the embedding spectrum becomes highly sparse (Gini coefficient 0.58 vs. 0.07 in the additive basis) with only 4 key frequencies carrying significant energy. Furthermore, 96.9% of MLP neurons are cleanly tuned to a single multiplicative frequency, and neuron activation heatmaps reveal 2D-periodic structure when reordered by the discrete logarithm. These results demonstrate the transformer reduces multiplication to addition in discrete-log space, implementing a "Discrete-Log Clock" algorithm analogous to Nanda et al.'s Clock algorithm for addition. The methodology generalizes: matching the analysis basis to the algebraic structure of the task reveals interpretable structure where standard tools see noise.
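
The reduction of multiplication to addition in discrete-log space can be checked directly for p = 113. The sketch below builds log and antilog tables from a primitive root found by brute force and defines the multiplicative characters used by the transform the abstract mentions. All function names are hypothetical, and this illustrates the underlying algebra only, not the trained model's circuit.

```python
import cmath
from typing import Dict, List, Tuple


def primitive_root(p: int) -> int:
    """Smallest generator of the multiplicative group mod an odd prime p."""
    for g in range(2, p):
        seen, x = set(), 1
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:   # g hits every nonzero residue
            return g
    raise ValueError("no primitive root found")


def dlog_tables(p: int) -> Tuple[int, Dict[int, int], List[int]]:
    """Return (g, log, exp) with exp[k] = g**k mod p and log[exp[k]] = k."""
    g = primitive_root(p)
    log, exp, x = {}, [], 1
    for k in range(p - 1):
        log[x] = k
        exp.append(x)
        x = (x * g) % p
    return g, log, exp


def mul_via_dlog(a: int, b: int, p: int, log: Dict[int, int], exp: List[int]) -> int:
    """Multiply nonzero residues by adding their discrete logs mod p - 1."""
    return exp[(log[a] + log[b]) % (p - 1)]


def character(k: int, a: int, p: int, log: Dict[int, int]) -> complex:
    """k-th multiplicative character: a point on the unit circle (the 'clock')."""
    return cmath.exp(2j * cmath.pi * k * log[a] / (p - 1))
```

Each character satisfies χ(ab) = χ(a)χ(b), which is the multiplicative analogue of the additive Fourier modes used in the modular-addition analysis.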

2606.17406 2026-06-17 cs.CV cs.AI 交叉投稿

Graph Neural Networks for Semi-Supervised Image Classification with Multi-Feature Aggregation

基于多特征聚合的图神经网络用于半监督图像分类

Marina Chagas Bulach Gapski, Vinicius Atsushi Sato Kawai, Gustavo Rosseto Leticio, Lucas Pascotti Valem, Daniel Carlos Guimarães Pedronette, Mohand Said Allili

发表机构 * Department of Statistics, Applied Mathematics, and Computing (DEMAC), São Paulo State University (UNESP)(圣保罗州立大学统计、应用数学与计算系) Institute of Mathematics and Computer Science (ICMC), University of São Paulo (USP)(圣保罗大学数学与计算机科学研究所) Department of Computer Science and Engineering, University of Quebec in Outaouais (UQO)(魁北克大学乌塔韦校区计算机科学与工程系)

AI总结 提出一种结合多种特征提取器和图表示进行半监督图像分类的GNN方法,通过流形学习和排名聚合提升分类精度。

详情
AI中文摘要

特征提取涉及识别和提取显著特征或模式,包括边缘、纹理、形状和颜色属性。当代特征提取器主要利用深度学习架构,如卷积神经网络(CNN)和视觉变换器(VIT)。文献中各种特征提取器的可用性提供了广泛的特征表示。从图像中提取的特征取决于具体应用、所选提取器及其配置。因此,通过组合不同的提取器来整合互补信息,为提高性能提供了一种有前景的方式。图神经网络(GNN),特别是图卷积网络(GCN),已成为半监督图像分类的强大且广泛采用的方法,因为它们有效利用标记和未标记数据,同时利用捕捉样本间关系的底层图结构。本研究提出了一种新颖的GNN方法,适用于标记数据稀缺的场景,通过整合来自不同提取器的多样化特征和图表示集进行分类。进行了实验研究,包括不同特征和图提取器的组合,以及排名聚合策略。实验发现强调了本研究的主要贡献,表明特征和图表示的策略性组合,结合流形学习用于图处理,在大多数实验条件下显著提高了分类精度。此外,利用排名聚合技术整合来自不同提取器的特征,被证明能增强分类精度。

英文摘要

Feature extraction involves the identification and extraction of salient characteristics or patterns, including edges, textures, shapes, and color attributes. Contemporary feature extractors predominantly leverage deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The availability of diverse feature extractors in the literature provides a wide range of feature representations. Features extracted from an image depend on the specific application, the chosen extractor, and its configuration. Therefore, integrating complementary information by combining distinct extractors offers a promising way to enhance performance. Graph Neural Networks (GNNs), particularly Graph Convolutional Networks (GCNs), have emerged as powerful and widely adopted approaches for semi-supervised image classification, as they effectively leverage both labeled and unlabeled data while exploiting the underlying graph structures that capture relationships among samples. This study proposes a novel approach for GNNs in scenarios where labeled data is scarce, by integrating diverse sets of feature and graph representations derived from various extractors in classification scenarios. Experimental investigations were conducted, encompassing combinations of distinct feature and graph extractors, as well as rank aggregation strategies. The experimental findings demonstrate that the strategic combination of feature and graph representations, coupled with manifold learning for graph processing, leads to significant improvements in classification accuracy across the majority of experimental conditions. Furthermore, using rank aggregation techniques to integrate features from different extractors was shown to enhance classification accuracy.

2606.17416 2026-06-17 cs.SD cs.AI 交叉投稿

L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification

L-Proto: 面向多语言说话人验证的语言感知情景原型训练

Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

发表机构 * Department of Artificial Intelligence, Korea University(高丽大学人工智能系)

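AI summary To address the entanglement of speaker identity with linguistic characteristics caused by language-dependent acoustic variability in multilingual speaker verification, the paper proposes L-Proto, a language-aware episodic prototypical training strategy that builds language-consistent training episodes to reduce language-driven variation and improve cross-lingual generalization.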

Comments Accepted by INTERSPEECH 2026

详情
AI中文摘要

多语言说话人验证仍然具有挑战性,因为语言相关的声学变异导致说话人身份与语言特征纠缠,降低了跨语言的泛化能力。在多语言训练中,嵌入向量通常将语言线索与说话人身份一起编码,导致说话人形成特定语言的聚类。我们提出L-Proto,一种语言感知的情景原型训练策略,该策略构建语言一致的训练情景。通过在每个情景中从单一语言采样说话人,L-Proto减少了训练期间的语言驱动变异,并鼓励嵌入向量更直接地关注说话人身份。在TidyVoice挑战基准上的实验表明,与传统的微调和随机情景采样相比,在多种骨干架构上均取得了一致的性能提升。

英文摘要

Multilingual speaker verification remains challenging because language-dependent acoustic variability causes speaker identity to become entangled with linguistic characteristics, degrading generalization across languages. In multilingual training, embeddings often encode language cues with speaker identity, causing speakers to form language-specific clusters. We propose L-Proto, a language-aware episodic prototypical training strategy that constructs language-consistent episodes. By sampling speakers from a single language per episode, L-Proto reduces language-driven variation during training and encourages embeddings to focus more directly on speaker identity. Experiments on the TidyVoice Challenge benchmark demonstrate consistent performance improvements over conventional fine-tuning and random episodic sampling across multiple backbone architectures.
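
The language-consistent episode construction described above can be sketched as a sampler that first fixes a language and then draws every speaker of the episode from it. This is a minimal sketch, not the authors' implementation: the function name, the dictionary layout, and the toy speaker ids are assumptions, and the prototypical loss computed on each episode is not shown.

```python
import random


def sample_language_consistent_episode(speakers_by_language, n_speakers, rng):
    """Pick one language, then draw all episode speakers from that language."""
    # Only languages with enough speakers can fill an episode.
    eligible = [
        lang for lang, speakers in speakers_by_language.items()
        if len(speakers) >= n_speakers
    ]
    language = rng.choice(sorted(eligible))
    speakers = rng.sample(sorted(speakers_by_language[language]), n_speakers)
    return language, speakers


# Hypothetical toy corpus: language code -> speaker ids.
speakers_by_language = {
    "ko": ["ko_spk1", "ko_spk2", "ko_spk3"],
    "en": ["en_spk1", "en_spk2", "en_spk3", "en_spk4"],
    "de": ["de_spk1"],  # too few speakers for a 2-speaker episode
}

rng = random.Random(0)
language, episode_speakers = sample_language_consistent_episode(
    speakers_by_language, n_speakers=2, rng=rng
)
print(language, episode_speakers)
```

Random episodic sampling, the baseline named in the abstract, would instead draw speakers from the pooled set of all languages.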

2606.17489 2026-06-17 cs.LG cs.AI 交叉投稿

Online LLM Selection via Constrained Bandits with Time-Varying Demand

基于时变需求的约束赌博机在线LLM选择

Yin Huang, Qingsong Liu, Jie Xu

发表机构 * Department of Electrical and Computer Engineering, University of Florida(佛罗里达大学电气与计算机工程系) Manning College of Information and Computer Sciences, University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校曼宁信息与计算机科学学院)

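AI summary For selecting among heterogeneous LLMs in edge-cloud inference systems, the paper proposes an online learning algorithm based on confidence-bound estimates and demand predictions that achieves sublinear regret and sublinear constraint violation under a hard budget and soft latency constraints.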

Comments 11 pages, 3 figures with multiple subfigures, 1 table, submitted for possible journal publication

详情
AI中文摘要

大型语言模型(LLM)越来越多地部署在边缘云推理系统中,以处理具有异构准确性、延迟和成本配置的多样化用户任务。为每个传入任务选择合适的LLM对于确保服务质量和高效资源利用至关重要。然而,模型异构性、随机且未知的性能特征以及时变的任务需求使得静态选择策略不再适用。实际部署通常施加硬资源预算(如货币支出限制)和软服务级别要求(如延迟保证)。这些约束为在线决策带来了额外挑战。我们将该问题形式化为一个约束随机赌博机学习任务,其中学习者在包装型(硬)和覆盖型(软)约束下顺序选择模型,同时适应时变的任务需求。学习者无法访问底层奖励、成本或延迟分布,必须依赖部分反馈。我们开发了一种新颖的在线学习算法,利用置信界估计和需求预测来平衡奖励最大化与长期约束满足。我们提供了理论保证,表明与具有完整信息的离线基准相比,该算法实现了亚线性遗憾和亚线性覆盖约束违反。在合成工作负载上的实验结果证明了我们的方法在动态、资源受限环境中的有效性和鲁棒性。

英文摘要

Large Language Models (LLMs) are increasingly deployed in edge-cloud inference systems to handle diverse user tasks with heterogeneous accuracy, latency, and cost profiles. Selecting the appropriate LLM for each incoming task is critical for ensuring service quality and efficient resource utilization. However, model heterogeneity, stochastic and unknown performance characteristics, and time-varying task demands make static selection strategies inadequate. Real-world deployments often impose hard resource budgets such as monetary expenditure limits, along with soft service-level requirements such as latency guarantees. These constraints introduce additional challenges for online decision-making. We formulate this problem as a constrained stochastic bandit learning task, where the learner sequentially selects models under both packing-type (hard) and covering-type (soft) constraints, while adapting to time-varying task demand. The learner operates without access to the underlying reward, cost, or latency distributions and must rely on partial feedback. We develop a novel online learning algorithm that leverages confidence-bound estimates and demand predictions to balance reward maximization with long-term constraint satisfaction. We provide theoretical guarantees showing sublinear regret and sublinear covering constraint violations compared to an offline benchmark with full information. Experimental results on synthetic workloads demonstrate the effectiveness and robustness of our approach in dynamic, resource-constrained environments.
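
To make the interplay between confidence-bound estimates and a hard budget concrete, here is a heavily simplified sketch. It is not the paper's algorithm: it assumes known, deterministic per-call costs and Bernoulli rewards, and it omits the soft latency (covering) constraint and the demand predictions entirely. The function name `budgeted_ucb` and the toy arm parameters are hypothetical.

```python
import math
import random


def budgeted_ucb(arms, budget, horizon, rng):
    """UCB model selection that never exceeds a hard spending budget.

    arms: list of (mean_reward, cost) pairs; the means are hidden from the learner.
    Returns (pull counts per arm, total amount spent).
    """
    counts = [0] * len(arms)
    reward_sums = [0.0] * len(arms)
    spent = 0.0
    for t in range(1, horizon + 1):
        # Hard (packing) constraint: only consider arms we can still pay for.
        affordable = [i for i, (_, cost) in enumerate(arms) if spent + cost <= budget]
        if not affordable:
            break
        untried = [i for i in affordable if counts[i] == 0]
        if untried:
            choice = untried[0]  # try each affordable arm once first
        else:
            def ucb(i):
                mean = reward_sums[i] / counts[i]
                return mean + math.sqrt(2.0 * math.log(t) / counts[i])
            choice = max(affordable, key=ucb)
        mean_reward, cost = arms[choice]
        reward = 1.0 if rng.random() < mean_reward else 0.0  # partial feedback
        counts[choice] += 1
        reward_sums[choice] += reward
        spent += cost
    return counts, spent


# Hypothetical models: a cheap weak one and an expensive strong one.
arms = [(0.3, 1.0), (0.8, 3.0)]
counts, spent = budgeted_ucb(arms, budget=30.0, horizon=100, rng=random.Random(0))
print(counts, spent)
```

A faithful treatment would also estimate costs and latencies from feedback and track long-term violation of the soft constraint, as the abstract describes.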

2606.17513 2026-06-17 cs.LG cs.AI 交叉投稿

Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning

几何感知的算子学习事后不确定性量化

Oriol Vendrell-Gallart, Nima Negarandeh, Ramin Bostanabad

发表机构 * Department of Mechanical and Aerospace Engineering, University of California, Irvine(加州大学尔湾分校机械与航空航天工程系)

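AI summary Proposes REEF-GP, a framework that fits a Gaussian process to the residuals of a frozen neural operator and uses the operator's intrinsic coordinate-feature representations to construct geometry-aware uncertainty; it delivers calibrated uncertainty estimates on several PDE benchmarks at far lower computational cost than deep ensembles.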

详情
AI中文摘要

神经算子为偏微分方程提供快速代理模型,但其确定性预测限制了在需要不确定性量化(UQ)的任务中的使用,尤其是在几何变化下。现有方法主要对网络参数进行不确定性建模,很大程度上忽略了算子本身学习的几何感知表示。我们提出REEF-GP(残差嵌入特征高斯过程),一种事后UQ框架,将高斯过程拟合到冻结神经算子的残差上,该算子的内部嵌入定义了核特征空间。REEF-GP不学习单独的特征映射,而是调整算子固有的坐标-特征表示以构建几何感知的不确定性。为了确保非结构化域上的稳定性和可扩展性,REEF-GP结合了谱归一化投影、异方差几何感知噪声以及高效基于子集的训练,避免了限制性的低秩近似。在五个具有不同几何形状的PDE基准测试中,REEF-GP保持了预测准确性,同时实现了与深度集成相竞争但成本仅为其一小部分的校准不确定性估计。我们的方法在几何分布偏移下保持鲁棒性,不确定性集中在物理上有意义的区域(例如激波前沿)。我们的结果表明,神经算子的准确且可扩展的事后UQ可以直接在其学习的特征空间中实现,为参数中心方法提供了实用替代方案。

英文摘要

Neural operators provide fast surrogates for PDEs but their deterministic predictions limit their use in tasks requiring uncertainty quantification (UQ), especially under geometric variability. Existing approaches primarily model uncertainty in network parameters, largely overlooking the geometry-aware representations learned by the operator itself. We propose REEF-GP (Residual on Embedded Features Gaussian Process), a post-hoc UQ framework that fits a GP to the residuals of a frozen neural operator whose internal embeddings define the kernel feature space. Rather than learning a separate feature map, REEF-GP adapts the operator's intrinsic coordinate-feature representations to construct geometry-aware uncertainties. To ensure stability and scalability on unstructured domains, REEF-GP incorporates spectral-normalized projections, heteroscedastic geometry-aware noise, and efficient subset-based training that avoids restrictive low-rank approximations. Across five PDE benchmarks with varying geometries, REEF-GP preserves predictive accuracy while achieving calibrated uncertainty estimates competitive with deep ensembles but at a fraction of their cost. Our approach remains robust under geometric distribution shift, with uncertainty concentrating in physically meaningful regions (e.g., shock fronts). Our results demonstrate that accurate and scalable post-hoc UQ for neural operators can be achieved directly in their learned feature space, offering a practical alternative to parameter-centric approaches.

2606.17516 2026-06-17 cs.LG cs.AI stat.ME stat.ML 交叉投稿

FoundCause: Causal Discovery with Latent Confounders from Observational Data

FoundCause: 从观测数据中发现含隐混淆因子的因果关系

Patrick Blöbaum, Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan

发表机构 * Amazon Web Services(亚马逊云服务) Department of Statistics, University of California, Davis(加州大学戴维斯分校统计系)

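AI summary Proposes FoundCause, an amortized causal discovery model trained on synthetic data that maps a dataset directly to a causal graph in a single forward pass and explicitly models latent confounders; on 15 real-world datasets it outperforms 11 non-amortized and 4 amortized methods.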

Comments Download the model at https://github.com/amazon-science/foundcause

详情
AI中文摘要

从观测数据中发现因果关系仍然具有挑战性,因为需要在没有干预的情况下恢复有向结构和隐混淆因子。我们提出了FoundCause,一种完全在合成数据上训练的摊销因果发现模型,它通过单次前向传递直接将数据集映射到因果图。通过从大量模拟结构因果模型中学习,FoundCause捕获了可迁移的统计模式,这些模式泛化到单个数据集之外。该架构融合了因果发现的几个关键归纳偏置。它使用一个置换不变的Transformer编码器,通过交替关注样本和变量来联合建模跨变量依赖性和每个变量的分布。通过统计条件注意力注入来自经典非对称度量的成对统计特征,引导模型朝向已知的因果信号。一个分解的解码器将边的存在性与方向分离,而一个三角细化模块使得能够推理高阶因果模式,如链和碰撞器。此外,一个基于可学习隐令牌的专用混淆因子模块显式建模隐藏的共同原因,并且模型通过其掩码输入表示显式处理缺失数据。据我们所知,FoundCause是第一个显式建模隐混淆因子的摊销因果发现方法。FoundCause在15个真实数据集上优于11种经典非摊销方法(如PC、GES、NOTEARS风格优化)和4种摊销因果发现方法,相对于最强的非摊销方法,在$F_1$上提高了9.6%,在AUROC上提高了1.2%,结构汉明距离减少了18.9%,同时仅需单次前向传递即可完成推理。

英文摘要

Causal discovery from observational data remains challenging due to the need to recover directed structure and latent confounding without interventions. We propose FoundCause, an amortized causal discovery model trained entirely on synthetic data that maps datasets directly to causal graphs in a single forward pass. By learning from large collections of simulated structural causal models, FoundCause captures transferable statistical patterns that generalize beyond individual datasets. The architecture incorporates several key inductive biases for causal discovery. It uses a permutation-invariant transformer encoder with alternating attention over samples and variables to jointly model cross-variable dependence and per-variable distributions. Pairwise statistical features derived from classical asymmetry measures are injected through statistics-conditioned attention, guiding the model toward known causal signals. A factorized decoder separates edge existence from direction, while a triangular refinement module enables reasoning over higher-order causal motifs such as chains and colliders. In addition, a dedicated confounder module based on learnable latent tokens explicitly models hidden common causes, and the model explicitly handles missing data via its masked input representation. To our knowledge, FoundCause is the first amortized causal discovery approach to explicitly model latent confounding. FoundCause outperforms 11 classical non-amortized methods (e.g., PC, GES, NOTEARS-style optimization) and 4 amortized causal discovery methods on 15 real-world datasets, achieving +9.6% improvement in $F_1$, +1.2% in AUROC, and an 18.9% reduction in structural Hamming distance relative to the strongest non-amortized methods, while performing inference in a single forward pass.
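
One of the reported metrics, the structural Hamming distance (SHD), can be computed as sketched below. Conventions differ on whether a reversed edge costs 1 or 2; this sketch assumes the variant in which each node pair whose edge status differs (missing, extra, or reversed) counts once, which may not match the paper's evaluation code. The function name and the toy graphs are hypothetical.

```python
def structural_hamming_distance(true_adj, pred_adj):
    """Count node pairs whose edge status differs between two directed graphs.

    Both arguments are square 0/1 adjacency matrices with adj[i][j] == 1 for i -> j.
    A missing, extra, or reversed edge each contributes 1 under this convention.
    """
    n = len(true_adj)
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            true_pair = (true_adj[i][j], true_adj[j][i])
            pred_pair = (pred_adj[i][j], pred_adj[j][i])
            if true_pair != pred_pair:
                distance += 1
    return distance


# Toy example: the true graph is the chain 0 -> 1 -> 2.
true_adj = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
# The prediction reverses 0 -> 1 and adds a spurious edge 0 -> 2.
pred_adj = [
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
]
shd = structural_hamming_distance(true_adj, pred_adj)
print(shd)
```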

2606.17551 2026-06-17 cs.LG cs.AI 交叉投稿

Reversal Q-Learning

逆向Q学习

Aditya Oberai, Seohong Park, Sergey Levine

发表机构 * University of California, Berkeley(加州大学伯克利分校)

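AI summary Proposes Reversal Q-learning (RQL), which combines the expanded MDP framework, virtual on-policy trajectories generated by reversing flows, and a bias-and-variance reduction technique to train flow policies for offline reinforcement learning, achieving the best average performance on 50 simulated robotic tasks.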

详情
AI中文摘要

迭代生成建模技术(如流匹配)为建模复杂行为以进行有效的离线强化学习(RL)提供了强大工具。在这项工作中,我们提出了一种新的离策略RL算法,该算法基于先验数据训练流策略。我们的想法始于“扩展”马尔可夫决策过程(MDP)框架,该框架将单个流细化步骤视为MDP中的独立动作。为了在该框架中实现离策略RL,我们应用了两种技术:我们通过“逆向”流生成虚拟在线轨迹,使该框架与先验数据兼容;并应用偏差-方差缩减技术来缓解离策略RL中的视界诅咒。我们将由此产生的算法称为逆向Q学习(RQL)。RQL相比先前基于流的RL方法具有若干优势:它不受时间反向传播的影响,更好地利用学习到的价值函数,并直接训练完整的、富有表现力的流策略。通过在50个具有挑战性的模拟机器人任务上的实验,我们表明,与最先进的基于流的离线RL算法相比,RQL实现了最佳的平均离线RL性能。

英文摘要

Iterative generative modeling techniques, such as flow matching, provide powerful tools to model complex behaviors for effective offline reinforcement learning (RL). In this work, we propose a new off-policy RL algorithm that trains a flow policy based on prior data. Our idea starts from the "expanded" Markov decision process (MDP) framework, which treats individual flow refinement steps as separate actions in an MDP. To enable off-policy RL within this framework, we apply two techniques: we generate virtual on-policy trajectories (by "reversing" flows) to make this framework compatible with prior data, and we apply a bias-and-variance reduction technique to mitigate the curse of horizon in off-policy RL. We call the resulting algorithm Reversal Q-learning (RQL). RQL has several advantages over previous flow-based RL methods: it does not suffer from backpropagation through time, makes better use of the learned value function, and directly trains the full, expressive flow policy. Through our experiments on 50 challenging simulated robotic tasks, we show that RQL leads to the best average offline RL performance compared to state-of-the-art flow-based offline RL algorithms.

2606.17579 2026-06-17 cs.LG cs.AI cs.CL cs.SI 交叉投稿

LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

LLM特征可能损害GNN:同配图基准上的拼接干扰

Zhongyuan Wang, Pratyusha Vemuri

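AI summary Finds that introducing LLM features into graph neural networks through pure input concatenation (rather than joint training) can systematically lower accuracy on homophilous benchmarks, and proposes Delta_sig, a measure of LLM-alone discriminability, to predict the effect of concatenation.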

Comments 29 pages, 8 figures

详情
AI中文摘要

将LLM生成的节点特征添加到图神经网络(GNN)中,被广泛报道能提高标准基准的准确率。我们记录了一个相反的观察:当LLM特征通过纯输入拼接(而非联合训练、蒸馏或提示条件)引入时,它们会在相同的同配基准上系统地降低准确率,而端到端LLM流水线在这些基准上却能成功。使用MLP骨干网络、Planetoid公共划分和词袋原始特征,拼接SBERT编码的GPT-4o-mini TAPE特征导致PubMed测试准确率下降-17.0±0.3个百分点,Cora下降-4.3±0.6个百分点(CiteSeer下降-0.6±0.8个百分点,在种子噪声范围内)。当我们放宽每个条件(GCN/GCNII/GAT骨干网络、随机划分、更小编码器)时,下降幅度减弱,并在中等同配的WikiCS(+4.4个百分点)和ogbn-arxiv(+11.7个百分点)上逆转。为了预测拼接何时有益或有害,我们报告了一个简单的LLM单独判别性指标Delta_sig。在9个数据集上,Delta_sig与拼接成本的相关系数(r^2=0.38)强于同配性(r^2=0.06;N=9,bootstrap置信区间重叠)。bootstrap最佳变点为tau=13.8个百分点,规则“Delta_sig <= tau预测非正拼接成本”正确分类了7/9个数据集;由于60%的bootstrap样本将tau置于[5,30]个百分点之间,我们将Delta_sig视为解释性透镜而非精确过滤器。在PubMed上进行的维度控制消融实验将LLM特征下降置于同源PCA(-2.3个百分点)和同维高斯噪声(-37.3个百分点)之间,排除了维度和权重衰减的影响。九个PubMed配置拟合出幂律|Delta_concat| ∝ (sqrt(d_l/n))^1.31,r^2=0.97;低Delta_sig、小n的角落正是标题中-17个百分点PubMed缺陷出现的位置。

英文摘要

Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets Delta_sig correlates with the concatenation cost more strongly than homophily at point estimate (r^2 = 0.38 vs. 0.06; N=9, bootstrap CIs overlap). The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig <= tau predicts non-positive concat cost" classifies 7/9 datasets correctly; since 60% of bootstrap samples place tau in [5, 30] pp, we treat Delta_sig as an interpretive lens rather than a precision filter. A dimension-controlled ablation on PubMed places the LLM-feature drop between same-source PCA (-2.3 pp) and same-dim Gaussian noise (-37.3 pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations fit a power law |Delta_concat| proportional to (sqrt(d_l/n))^1.31 with r^2 = 0.97; the low-Delta_sig, small-n corner is exactly where the headline -17 pp PubMed deficit appears.

2606.17649 2026-06-17 cs.LG cs.AI 交叉投稿

A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction

预微调预测的风险分解框架

Yuxiang Luo, Chen Wang, Nan Tang

发表机构 * The Hong Kong University of Science and Technology(香港科技大学)

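AI summary Proposes a risk decomposition framework that splits the risk of pre-hoc fine-tuning performance prediction into an intrinsic limit and a reducible optimization variance, proves a lower bound on the decay rate of the optimization variance, and derives a budget-optimal probing principle and a predictability phase diagram.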

Comments 9 pages, 4 figures, accepted as an ICML 2026 poster: https://icml.cc/virtual/2026/poster/66570

详情
AI中文摘要

微调大型语言模型的高昂成本构成了显著的经济障碍;预微调性能预测提供了一个关键解决方案,以大幅降低这一费用。然而,预微调性能预测的理论极限尚未被探索。我们将其形式化为信息约束下的随机估计问题,将预测风险分解为两个组成部分:内在极限(静态数据-模型兼容性)和可降优化方差。我们证明优化方差在其衰减率上存在一个必要下界,这意味着无论使用何种预测器,不确定性消散的速度都受到基本约束。基于这些动态特性,我们推导出预算最优探测原则,并引入一个可预测性相图,将任务组织成三个不同的区域:静态充分、动态临界和噪声主导。在合成和真实世界基准上的大量实验验证了这些理论区域,并展示了我们探测策略的效率。

英文摘要

The high cost of fine-tuning LLMs poses a significant economic barrier; pre-hoc performance prediction offers a critical solution to substantially reduce this expense. However, the theoretical limits of pre-hoc performance prediction remain unexplored. We formulate it as a stochastic estimation problem under information constraints, decomposing prediction risk into two components: an intrinsic limit (static data-model compatibility) and a reducible optimization variance. We prove that optimization variance admits a necessary lower bound on its decay rate, implying fundamental constraints on how quickly uncertainty dissipates, regardless of the predictor used. Based on these dynamics, we derive a budget-optimal probing principle and introduce a predictability phase diagram that organizes tasks into three distinct regimes: Static-Sufficient, Dynamic-Critical, and Noise-Dominant. Extensive experiments on synthetic and real-world benchmarks validate these theoretical regimes and demonstrate the efficiency of our probing strategy.

2606.17660 2026-06-17 cs.LG cs.AI 交叉投稿

TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins

TuneAhead: 在完整训练开始前预测微调性能

Yuxiang Luo, Haonan Long, Chen Wang, Qiqi Duan, Xiaotian Lin, Yanwei Xu, Yuyu Luo, Weikai Yang, Nan Tang

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Huawei Technologies Ltd.(华为技术有限公司)

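AI summary Proposes TuneAhead, a framework that predicts fine-tuning performance before full training from meta-feature vectors, with SHAP attributions for diagnostics; on Qwen2.5-7B-Instruct it reaches an RMSE of 1.47 percentage points, with 95.1% of predictions within ±3 percentage points of the true score.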

Comments 9 pages, 6 figures, accepted as an ICML 2026 poster: https://icml.cc/virtual/2026/poster/64847

详情
AI中文摘要

微调大型语言模型(LLM)计算密集且容易出错:模型性能对数据质量和超参数选择敏感,简单运行甚至可能降低模型性能。这引出一个实际问题:在投入完整训练之前,能否预测微调性能?我们提出TUNEAHEAD,一个用于微调性能预判的轻量级框架。TUNEAHEAD将每个候选运行编码为一个元特征向量,该向量结合了静态数据集描述符和来自短标准化探测的动态探测特征。一个预测器将这些特征映射到性能估计,而基于SHAP的归因提供可解释的诊断,揭示哪些特定特征驱动预测。在Qwen2.5-7B-Instruct上的1300多次微调运行中,TUNEAHEAD始终优于强基线,如Early-Stop Extrapolation和ProxyLM。在370次运行的保留测试集上,TUNEAHEAD实现了1.47个百分点的RMSE,并将95.1%的预测置于真实分数的±3个百分点内。这些准确的连续预测支持实用的通过/不通过筛选策略,可以在保留最有希望运行的同时减少不必要的完整微调。

英文摘要

Fine-tuning large language models (LLMs) is compute-intensive and error-prone: model performance depends sensitively on data quality and hyperparameter choices, and naïve runs can even degrade model performance. This raises a practical question: can we predict fine-tuning performance before committing to a full training run? We present TUNEAHEAD, a lightweight framework for pre-hoc prediction of fine-tuning performance. TUNEAHEAD encodes each candidate run as a meta-feature vector that combines static dataset descriptors with dynamic probe features from a short standardized probe. A predictor maps these features to performance estimates, while SHAP-based attributions provide interpretable diagnostics that reveal which specific features drive the prediction. Across 1,300+ fine-tuning runs on Qwen2.5-7B-Instruct, TUNEAHEAD consistently outperforms strong baselines such as Early-Stop Extrapolation and ProxyLM. On a held-out test set of 370 runs, TUNEAHEAD achieves an RMSE of 1.47 percentage points and places 95.1% of predictions within ±3 percentage points of the true score. These accurate continuous predictions support practical go/no-go screening policies that can reduce unnecessary full fine-tuning while retaining most promising runs.
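
The two headline metrics, RMSE in percentage points and the share of predictions within a fixed tolerance of the true score, are computed as in the sketch below. The scores are made-up illustration values, not results from the paper, and the function names are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual scores."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def fraction_within(predicted, actual, tolerance):
    """Share of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(predicted)


# Made-up benchmark scores (percentage points) for four hypothetical runs.
predicted = [70.0, 65.5, 80.0, 55.0]
actual = [71.0, 65.5, 76.0, 56.0]

error = rmse(predicted, actual)
coverage = fraction_within(predicted, actual, tolerance=3.0)
print(round(error, 3), coverage)
```

A go/no-go screening policy of the kind the abstract mentions could then launch full fine-tuning only for runs whose predicted score clears a chosen threshold.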

2606.17667 2026-06-17 cs.LG cs.AI 交叉投稿

Handling Feature Heterogeneity with Learnable Graph Patches

处理特征异质性:可学习图块方法

Yifei Sun, Yang Yang, Xiao Feng, Zijun Wang, Haoyang Zhong, Chunping Wang, Lei Chen

发表机构 * Zhejiang University(浙江大学) Huazhong University of Science and Technology(华中科技大学) Finvolution Group(信也科技集团)

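AI summary Proposes learnable graph patches, which decompose a graph into semantic units; a patch encoder and a patch aggregator enable transferable pre-training on cross-domain graph data and improve downstream task performance.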

Comments Accepted at KDD 2025

详情
AI中文摘要

近年来,基础模型和图预训练技术的快速发展激发了构建通用预训练图模型或图基础模型(GFM)的兴趣。然而,一个重大挑战是现有模型无法处理无文本信息的图数据中的特征异质性,这阻碍了图模型在不同数据集间的可迁移性。为弥补这一差距,我们提出了可学习图块的概念,将其视为任何图数据的最小语义单元。我们通过展开节点特征并分别构建相应的图块结构,将图分解为可学习图块。然后,我们设计了一个框架,从跨域图数据中挖掘可迁移信息。具体来说,在提取图块后,我们提出一个补丁编码器从每个单元中提取知识,以及一个补丁聚合器学习如何将单元组合成整体。由于其领域无关的特性,该模型可应用于不同领域的下游数据。此外,我们分析了我们的方法与现有图模型之间的联系,以及其生成的节点嵌入的可迁移性。实验表明,我们的方法不仅实现了使用多域图进行预训练的能力,而且在各种下游数据集和任务上表现出增强的性能。此外,我们观察到随着预训练数据量的增加,下游性能持续提升。

英文摘要

In recent years, the rapid development of foundation models and graph pre-training technologies has spurred increasing interest in constructing a universal pre-trained graph model or Graph Foundation Model (GFM). However, a significant challenge is that existing models are unable to address feature heterogeneity in graph data without textual information, which hinders the transferability of graph models across different datasets. To bridge this gap, we propose the concept of learnable graph patches, which we regard as the smallest semantic units of any graph data. We decompose the graph into learnable graph patches by unfolding the node features and constructing corresponding patch structures separately. We then design a framework that mines transferable information from graph data across domains. Specifically, after extracting graph patches, we propose a patch encoder to extract knowledge from each unit and a patch aggregator to learn how the units are combined into a whole. Due to its domain-agnostic nature, the model can be applied to downstream data across different domains. Furthermore, we analyze the connection between our method and existing graph models, as well as the transferability of the node embeddings it generates. Empirically, our method not only achieves the capability to use multi-domain graphs for pre-training, but also shows enhanced performance across various downstream datasets and tasks. Moreover, we observe consistent improvement in downstream performance as the volume of pre-training data increases.

2606.17687 2026-06-17 cs.CL cs.AI 交叉投稿

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

SuCo: 充分性引导的连续自适应推理

Jiahao Wang, Bingyu Liang, Chenhao Hu, Longhui Zhang, Xuebo Liu, Min Zhang, Jing Li, Xuelong Li

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

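AI summary To curb the wasted computation caused by large reasoning models generating overly long chains of thought, the paper introduces the Minimal Sufficient CoT concept and builds SuCo, a two-stage training framework that optimizes reasoning length with adaptive sufficiency thresholds and reinforcement learning, improving both accuracy and efficiency on mathematics, code, and science benchmarks.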

Comments Accepted to ICML 2026. 18 pages

详情
AI中文摘要

尽管在复杂任务上表现卓越,大型推理模型(LRMs)常常生成过长的思维链(CoT),即使对于简单查询也会增加计算成本。现有缓解此低效问题的工作通常依赖于离散推理模式或固定预算层级,缺乏推理何时充分的准则。本文引入最小充分CoT(MSC),定义为CoT轨迹中足以产生正确答案的最短前缀。实验表明,MSC不仅减少推理令牌,还能在不同难度级别上提高准确率。基于MSC,我们提出充分性引导的连续自适应推理(SuCo),一个用于连续谱上自主推理控制的两阶段训练框架。在第一阶段,MSC对齐微调(MFT)使用问题自适应充分性阈值构建MSC数据,该阈值自然随问题难度缩放,然后微调模型以内化简洁而充分的推理模式。在第二阶段,充分性感知策略优化(SAPO)通过带有动态复杂度跟踪和充分性感知奖励的强化学习进一步优化模型,该奖励惩罚过度思考和思考不足。在数学、代码和科学基准上的大量实验表明,SuCo在准确率和推理效率上均实现持续改进。

英文摘要

Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1, MSC-Aligned Fine-Tuning (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2, Sufficiency-Aware Policy Optimization (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.
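
The definition of a Minimal Sufficient CoT as the shortest sufficient prefix can be sketched as a linear scan over prefixes. This is an illustration under strong assumptions: `answer_from_prefix` is a hypothetical stand-in for decoding an answer from a truncated chain, and sufficiency is reduced to a single exact-match check, whereas the abstract describes problem-adaptive sufficiency thresholds.

```python
def minimal_sufficient_prefix(steps, answer_from_prefix, correct_answer):
    """Return the shortest prefix of `steps` that already yields the correct answer."""
    for k in range(len(steps) + 1):
        if answer_from_prefix(steps[:k]) == correct_answer:
            return steps[:k]
    return None  # no prefix, not even the full chain, is sufficient


def toy_answer(prefix):
    """Toy decoder: read the value after '=' in the latest step that has one."""
    for step in reversed(prefix):
        if "=" in step:
            return step.split("=")[-1].strip()
    return None


# Hypothetical reasoning chain for "What is 12 * 3 + 4?"
steps = [
    "12 * 3 = 36",
    "36 + 4 = 40",
    "double-check: 36 + 4 = 40",
    "so the answer is 40",
]
msc = minimal_sufficient_prefix(steps, toy_answer, correct_answer="40")
print(len(msc), "of", len(steps), "steps are sufficient")
```

With a real model the check would be stochastic, so one plausible design is to sample several answers per prefix and compare the success rate against a threshold.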

2606.17706 2026-06-17 cs.LG cs.AI 交叉投稿

Confusion-Aware Transfer Teacher Curriculum Learning Framework: Disentangling Scoring and Pacing Effects

混淆感知的迁移教师课程学习框架:解耦评分与节奏效应

Savini Kommalage, Sanka Mohottala, Asiri Gawesha, Dulara Madhusanka, Menan Velayuthan, Dharshana Kasthurirathna, Mahima Milinda Alwis Weerasinghe, Charith Abhayaratne

发表机构 * Faculty of Computing, Sri Lanka Institute of Information Technology, Sri Lanka(斯里兰卡信息科技学院计算机学院,斯里兰卡) Faculty of Engineering, University of Sri Jayewardenepura, Sri Lanka(斯里兰卡贾亚韦达内普拉大学工程学院,斯里兰卡) Faculty of Engineering, Sri Lanka Institute of Information Technology, Sri Lanka(斯里兰卡信息科技学院工程学院,斯里兰卡) University of Sheffield, United Kingdom(谢菲尔德大学,英国) Utrecht University, The Netherlands(乌得勒支大学,荷兰)

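AI summary Proposes a confusion-aware difficulty score and disentangles the scoring and pacing effects of curriculum learning using stage-wise test subsets and a random-order baseline; on CIFAR-10 the score is interpretable, but it yields no accuracy gain at full data and improves data efficiency only in low-data regimes.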

Comments Accepted at International Conference on Machine Learning (ICML) GlobalSouthML Workshop (2026)

详情
AI中文摘要

课程学习结合了两个设计选择:样本如何按难度评分,以及较难样本如何逐步引入训练,这使得难以将观察到的性能提升归因于任一组件。我们通过两种评估协议解耦这些因素:阶段性子集测试(独立于课程训练验证评分函数)和基线(将相同的节奏调度应用于随机排序数据)。在迁移教师框架(TTF)中,我们使用这些协议评估一种混淆感知的难度评分,该评分同时考虑正确类别的置信度和错误类别上的概率分布。在CIFAR-10上使用ResNet-18和VGG-16,所提出的评分产生了与人类直觉一致的模型可解释难度排序。然而,在全数据下,无论是课程排序还是反课程排序,都没有比标准训练提高准确率,这表明仅改进评分函数不足以克服TTF中课程学习的已知失败模式。相反,我们发现混淆感知的课程排序带来一致的数据效率优势,在20%数据量下比随机排序高出最多8.7个百分点,表明TTF作为一种数据高效训练方法的潜力。

英文摘要

Curriculum learning couples two design choices, how samples are scored by difficulty and how harder samples are paced into training, making it difficult to attribute observed gains to either component. We disentangle these factors with two evaluation protocols: stage-wise test subsets that validate scoring functions independently of curriculum training, and a baseline that applies the same pacing schedule to randomly ordered data. Within the Transfer Teacher framework (TTF), we use these protocols to evaluate a confusion-aware difficulty score that considers both correct-class confidence and the probability distribution over incorrect classes. On CIFAR-10 with ResNet-18 and VGG-16, the proposed score produces model-interpretable difficulty rankings that align with human intuition. However, at full data, neither curriculum nor anti-curriculum ordering improves accuracy over standard training, indicating that improving the scoring function alone is insufficient to overcome the known failure modes of curriculum learning in TTF. In contrast, we find that confusion-aware curriculum ordering yields consistent data-efficiency benefits, outperforming random ordering by up to 8.7 percentage points at the 20% data regime, suggesting the potential of TTF as a data-efficient training method.
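
The random-order baseline described above, which keeps the pacing schedule fixed and changes only the ordering, can be sketched as follows. The difficulty scores, the two-stage schedule, and the function name `paced_subsets` are assumptions for illustration; the abstract does not specify the pacing function.

```python
import random


def paced_subsets(ordered_indices, fractions):
    """Return the training pool of each stage: the first fraction of the ordering."""
    n = len(ordered_indices)
    return [ordered_indices[: max(1, round(f * n))] for f in fractions]


# Hypothetical difficulty scores, one per training sample (higher = harder).
difficulty = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
fractions = [0.5, 1.0]  # assumed schedule: half of the data, then all of it

# Curriculum: easiest samples first.
curriculum_order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])

# Baseline: identical schedule applied to a random ordering.
random_order = list(range(len(difficulty)))
random.Random(0).shuffle(random_order)

curriculum_stages = paced_subsets(curriculum_order, fractions)
baseline_stages = paced_subsets(random_order, fractions)
print(curriculum_stages[0], baseline_stages[0])
```

Because both variants see pools of the same size at every stage, any difference in accuracy can be attributed to the scoring function rather than to pacing.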

2606.17816 2026-06-17 cs.LG cs.AI 交叉投稿

Conservation Laws for Modern Neural Architectures

现代神经架构的守恒律

Viet-Hoang Tran, Vinh Khanh Bui, Tan Lai Ngoc, Nam Nguyen, Tuan Dam, Tan M. Nguyen

发表机构 * National University of Singapore(新加坡国立大学) Center for AI Research, VinUniversity(Vin大学人工智能研究中心) Independent Researcher(独立研究者) Hanoi University of Science and Technology(河内科学技术大学)

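AI summary Proposes a unified framework that characterizes gradient-flow conservation laws in feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention, and Mixture-of-Experts models; experiments validate the predicted invariants.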

Comments Published at the International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

理解梯度下降动力学是解释过参数化模型成功的关键,其中隐式偏差通过梯度流中的守恒律体现。尽管这类定律在线性和ReLU网络中已被充分理解,但在现代架构中仍鲜有探索。本文开发了一个统一框架,用于刻画当代模型中的守恒律,包括具有GELU、SiLU和SwiGLU激活的前馈网络、具有正弦和旋转位置编码的多头注意力,以及多种门控设计下的混合专家架构。我们的理论发现得到了实验支持,实验验证了预测的不变量。

英文摘要

Understanding gradient descent dynamics is key to explaining the success of over-parameterized models, where implicit bias manifests through conservation laws in gradient flow. While such laws are well understood for linear and ReLU networks, they remain largely unexplored for modern architectures. This work develops a unified framework to characterize conservation laws for contemporary models, including feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention with sinusoidal and rotary positional encodings, and Mixture-of-Experts architectures under diverse gating designs. Our theoretical findings are supported by experiments that validate the predicted invariants.
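
As background for the kind of invariant discussed above, the classical conservation law for a two-layer linear network can be checked numerically. For the scalar model f(x) = a * b * x trained with squared loss, gradient flow conserves a^2 - b^2. The sketch below uses plain gradient descent with a small step size as an approximation of gradient flow, so the quantity drifts slightly; it illustrates the well-known linear case only, not the new results for GELU, SiLU, SwiGLU, attention, or Mixture-of-Experts models.

```python
def train_two_layer_scalar(a, b, x, y, lr, steps):
    """Gradient descent on 0.5 * (a * b * x - y) ** 2 for scalar weights a and b."""
    for _ in range(steps):
        residual = a * b * x - y
        grad_a = residual * b * x
        grad_b = residual * a * x
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b


a0, b0 = 1.5, 0.5
a1, b1 = train_two_layer_scalar(a0, b0, x=1.0, y=2.0, lr=1e-3, steps=5000)

# Under exact gradient flow a^2 - b^2 is constant; discrete steps add O(lr^2) drift.
drift = (a1 ** 2 - b1 ** 2) - (a0 ** 2 - b0 ** 2)
print("fit:", a1 * b1, "drift of a^2 - b^2:", drift)
```

The weights change substantially during training while the drift stays close to zero, which is what makes such invariants useful for reasoning about implicit bias.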

2606.17830 2026-06-17 cs.LG cs.AI 交叉投稿

Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity

注意力中的功能等价性:一项综合研究及其在线性模式连通性中的应用

Viet-Hoang Tran, Vinh Khanh Bui, Van-Hoan Trinh, Tan Lai Ngoc, Tan M. Nguyen

发表机构 * National University of Singapore(新加坡国立大学) Center for AI Research, VinUniversity(Vin大学人工智能研究中心) Independent Researcher(独立研究者) Technical University of Munich(慕尼黑技术大学)

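AI summary Formally studies how positional encodings affect functional equivalence in Transformers, showing that sinusoidal encodings preserve the symmetries of vanilla attention whereas rotary encodings significantly reduce the symmetry group and thereby enhance expressivity; an alignment algorithm empirically demonstrates that linear mode connectivity depends critically on the positional encoding.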

Comments Published at the International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

神经网络参数空间本质上是非单射的,因为不同的参数配置可以通过功能等价性实现相同的函数。虽然这种对称性在经典的全连接和卷积模型中已被充分理解,但在现代基于注意力的架构中变得更为复杂。现有的多头注意力分析主要关注原始公式,忽略了从根本上重塑架构对称性的位置编码。在这项工作中,我们提供了对带有位置编码的Transformer中功能等价性的形式化研究。聚焦于两种最广泛使用的变体——正弦和旋转位置编码(RoPE)——我们表明正弦编码保留了原始注意力的等价结构,而旋转编码显著减少了对称群,从而增强了表达力。这为RoPE在实践中日益突出的地位提供了原则性解释。我们进一步研究了位置编码如何影响线性模式连通性,并通过一种对齐算法,实证表明Transformer设置中连通性的存在和可变性关键取决于位置编码。

英文摘要

Neural network parameter spaces are inherently non-injective, as distinct parameter configurations can realize identical functions through functional equivalence. While this symmetry is well understood in classical fully connected and convolutional models, it becomes substantially more intricate in modern attention-based architectures. Existing analyses of multihead attention have largely focused on the vanilla formulation, overlooking positional encodings that fundamentally reshape architectural symmetries. In this work, we provide a formal study of functional equivalence in Transformers with positional encodings. Focusing on the two most widely used variants--sinusoidal and rotary positional encodings (RoPE)--we show that sinusoidal encodings preserve the equivalence structure of vanilla attention, whereas rotary encodings significantly reduce the symmetry group, thereby enhancing expressivity. This offers a principled explanation for the growing prominence of RoPE in practice. We further examine how positional encodings affect linear mode connectivity, and through an alignment algorithm, empirically demonstrate that the presence and variability of connectivity across Transformer settings crucially depend on the positional encoding.
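
One standard example of functional equivalence in vanilla attention is the invertible reparameterization of the query and key projections: replacing W_Q by W_Q A and W_K by W_K A^{-T} leaves the attention logits unchanged, because W_Q A (W_K A^{-T})^T = W_Q W_K^T. The sketch below checks this with small integer matrices so the comparison is exact. It covers only this one symmetry without positional encodings; the paper's full characterization is not reproduced here, and all matrices are toy values.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def attention_logits(X, W_q, W_k):
    """Pre-softmax scores (X W_q)(X W_k)^T, without scaling or positional terms."""
    return matmul(matmul(X, W_q), transpose(matmul(X, W_k)))


X = [[1, 0], [2, 1], [0, 3]]  # three tokens, model width 2
W_q = [[1, 2], [0, 1]]
W_k = [[1, 0], [3, 1]]

A = [[2, 1], [1, 1]]            # invertible, determinant 1
A_inv_T = [[1, -1], [-1, 2]]    # transpose of A's inverse, written out by hand

W_q_new = matmul(W_q, A)
W_k_new = matmul(W_k, A_inv_T)

logits = attention_logits(X, W_q, W_k)
logits_new = attention_logits(X, W_q_new, W_k_new)
print(logits == logits_new)
```

According to the abstract, rotary encodings significantly reduce the symmetry group of vanilla attention, while sinusoidal encodings preserve its equivalence structure.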

2606.17889 2026-06-17 cs.LG cs.AI cs.NE 交叉投稿

Dimensionality Controls When Modularity Helps in Continual Learning

维度控制模块化在持续学习中的有效性

Kathrin Korte, Christian Medeiros Adriano, Joachim Winther Pedersen, Eleni Nisioti, Sebastian Risi

发表机构 * IT University of Copenhagen, Denmark(哥本哈根信息技术大学,丹麦) Hasso Plattner Institute, University of Potsdam, Germany(波茨坦大学哈索·普拉特纳研究所,德国)

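AI summary Studies how modular architecture, task similarity, and representational dimensionality jointly shape compositional learning in continual learning, finding that modular structure becomes decisive in the low-dimensional "rich" regime and has little impact in the high-dimensional "lazy" regime.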

Comments Accepted to the 2nd Workshop on Compositional Learning (CompLearn) at ICML 2026, Seoul, South Korea. 8 pages, 5 figures

详情
AI中文摘要

组合学习系统必须平衡可塑性(获取新知识的能力)与稳定性(保留先前学习组件的能力),尤其是当任务共享结构并存在干扰风险时。我们研究了模块化架构、任务相似性和表示维度如何在顺序A-B-A范式中共同塑造组合持续学习,通过权重尺度操作诱导高维和低维机制,比较了任务分区循环网络与单网络基线。在高维“懒惰”机制中,两种架构实现了相似的性能和内部几何结构,表明当表示受到弱约束时,显式模块化结构影响甚微。在低维“丰富”机制中,模块化变得决定性:模块化网络发展出分级的任务特定子空间,这些子空间在相似任务上重叠,在中等不相似任务上部分对齐,在不相似任务上分离,从而产生比单网络更具组合性和可解释性的组织。这些发现表明,由初始化尺度诱导的表示机制(与表示维度共变)是决定组合性模块化结构在持续学习中何时功能有益的关键因素,并支持将安全性和鲁棒性视为表示子空间的自适应分配问题,而非固定分离或共享。

英文摘要

Compositional learning systems must balance plasticity, the ability to acquire new knowledge, with stability, the preservation of previously learned components, especially when tasks share structure and risk interference. We study how modular architecture, task similarity, and representational dimensionality jointly shape compositional continual learning in a sequential A-B-A paradigm, comparing a task-partitioned recurrent network to a single-network baseline while inducing high- and low-dimensional regimes via weight-scale manipulations. In a high-dimensional "lazy" regime, both architectures achieve similar performance and internal geometry, suggesting that explicit modular structure has little impact when representations are weakly constrained. In a lower-dimensional "rich" regime, modularity becomes decisive: the modular network develops graded task-specific subspaces that overlap for similar tasks, partially align for moderately dissimilar tasks, and separate for dissimilar tasks, yielding a more compositional and interpretable organization than the single network. These findings identify the representational regime induced by initialization scale, which co-varies with representational dimensionality, as a key factor governing when compositional, modular structure is functionally beneficial in continual learning, and support viewing safety and robustness as problems of adaptive allocation of representational subspaces rather than fixed separation versus sharing.

2606.17927 2026-06-17 cs.LG cs.AI 交叉投稿

KANLib -- A Modular, Extensible and Fast Kolmogorov-Arnold Network Implementation

KANLib -- 一个模块化、可扩展且快速的Kolmogorov-Arnold网络实现

Julian Hoever, Gregor Schiele

发表机构 * Intelligent Embedded Systems, University of Duisburg-Essen(杜伊斯堡-埃森大学智能嵌入式系统)

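AI summary Proposes KANLib, a framework that unifies existing KAN implementations and supports two basis function types and adaptive grid rescaling, reproducing the predictive behavior of reference implementations while remaining flexible and computationally efficient.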

详情
AI中文摘要

Kolmogorov-Arnold网络(KAN)最近通过用可学习的一元函数替代线性权重,成为传统多层感知器的一种有前途的替代方案。尽管在可解释性和表达能力方面具有理论优势,但由于高计算成本和现有框架中不一致的功能支持,KAN的实际研究仍然困难。本文介绍了KANLib,一个用于开发和评估KAN架构的模块化、可扩展且计算高效的框架。KANLib在强调灵活性、功能一致性和高性能的一致软件架构中,统一了现有实现(包括PyKAN、EfficientKAN和FastKAN)的核心概念。该框架支持两种基函数类型、自适应网格缩放、网格扩展和细粒度架构定制,同时保持与标准PyTorch工作流的兼容性。在加利福尼亚房价基准上的实验评估表明,KANLib在重现已建立参考KAN实现的预测行为的同时,实现了具有竞争力的计算效率。此外,该框架能够探索超出标准KAN公式的架构变体,且对预测性能影响很小。总体而言,KANLib为未来关于可伸缩且可扩展的KAN架构的研究提供了坚实的基础。

英文摘要

Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional multilayer perceptrons by replacing linear weights with learnable univariate functions. Despite their theoretical advantages in interpretability and expressiveness, practical research of KANs remains difficult due to high computational costs and inconsistent feature support across existing frameworks. This paper introduces KANLib, a modular, extensible, and computationally efficient framework for developing and evaluating KAN architectures. KANLib unifies core concepts from existing implementations, including PyKAN, EfficientKAN, and FastKAN, within a consistent software architecture that emphasizes flexibility, feature parity, and high performance. The framework supports two basis function types, adaptive grid rescaling, grid extension, and fine-grained architectural customization while maintaining compatibility with standard PyTorch workflows. Experimental evaluation on the California Housing benchmark demonstrates that KANLib reproduces the predictive behavior of established reference KAN implementations while achieving competitive computational efficiency. Furthermore, the framework enables the exploration of architectural variations beyond standard KAN formulations with only minor impacts on predictive performance. Overall, KANLib provides a robust foundation for future research on scalable and extensible KAN architectures.
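
To make the phrase "replacing linear weights with learnable univariate functions" concrete, here is a minimal NumPy sketch of a single KAN-style layer. It is not KANLib's API: the class name `RBFKANLayer`, the Gaussian radial basis, and the fixed grid are illustrative assumptions (loosely in the spirit of FastKAN-style bases), and training is omitted.

```python
import numpy as np

class RBFKANLayer:
    """Toy KAN layer: one learnable univariate function per (input, output) edge.

    Hypothetical illustration only; names and basis choice are assumptions.
    """

    def __init__(self, in_dim, out_dim, num_basis=8, x_min=-2.0, x_max=2.0, seed=0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(x_min, x_max, num_basis)  # fixed grid of basis centers
        self.width = (x_max - x_min) / (num_basis - 1)
        # coeffs[i, o, b]: weight of basis function b on the edge from input i to output o
        self.coeffs = rng.normal(0.0, 0.1, size=(in_dim, out_dim, num_basis))

    def basis(self, x):
        # x: (batch, in_dim) -> Gaussian bumps of shape (batch, in_dim, num_basis)
        return np.exp(-(((x[..., None] - self.centers) / self.width) ** 2))

    def __call__(self, x):
        # y[n, o] = sum_i phi_{i,o}(x[n, i]), where phi_{i,o} is a weighted sum of bumps
        return np.einsum("nib,iob->no", self.basis(x), self.coeffs)

layer = RBFKANLayer(in_dim=3, out_dim=2)
x = np.random.default_rng(1).uniform(-1.0, 1.0, size=(5, 3))
y = layer(x)  # shape (5, 2)
```

Each output is a sum of per-edge univariate functions rather than a weighted sum followed by a fixed activation, which is the structural difference from a multilayer perceptron that the abstract refers to.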

2606.17952 2026-06-17 cs.LG cs.AI 交叉投稿

SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs

SoftMoE: 用于大语言模型混合专家网络的软可微路由

Mikołaj Zasada, Łukasz Struski, Jacek Tabor, Marcin Kurdziel

Affiliations: AGH University of Krakow, Poland; Faculty of Mathematics and Computer Science, Jagiellonian University, Poland; Centre for Credible Artificial Intelligence, Warsaw University of Technology

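AI summary: Proposes SoftMoE, which replaces discrete routing with a soft top-k LapSum relaxation so that expert routing can be optimized with gradients, and learns the number of active experts per layer; in language modeling it matches or exceeds sparse MoE while activating fewer experts.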

Comments Accepted at ICML 2026

详情
AI中文摘要

稀疏混合专家(MoE)架构通过仅激活一小部分专家(通过top-$k$路由)在固定推理预算下扩展LLM参数。虽然这保持了因果性并适用于自回归语言模型,但离散的top-$k$算子不可微,强制每个输入激活固定数量的专家,导致计算利用效率低下。我们提出SoftMoE,用截断的软top-$k$ LapSum松弛替代离散路由,允许基于梯度的专家路由优化。我们进一步参数化每层平均激活专家数,并施加全局预算约束,使模型能够学习跨层分配专家容量。SoftMoE完全兼容自回归建模,在语言建模和下游任务上达到与稀疏MoE相当或更优的性能,同时激活显著更少的专家。值得注意的是,学习到的分配高度非均匀,后层激活更多专家。源代码已公开$^\dagger$。

英文摘要

Sparse Mixture-of-Experts (MoE) architectures enable scaling LLM parameters under a fixed inference budget by activating only a small subset of experts via top-$k$ routing. While this preserves causality and suits autoregressive language models, the discrete top-$k$ operator is not differentiable, forcing a fixed number of active experts per input and resulting in inefficient use of computation. We propose SoftMoE, which replaces discrete routing with a truncated soft top-$k$ LapSum relaxation, allowing gradient-based optimization of expert routing. We further parameterize the mean number of active experts per layer and impose a global budget constraint, enabling the model to learn how to allocate expert capacity across layers. SoftMoE remains fully compatible with autoregressive modeling and achieves performance comparable to or better than sparse MoE on language modeling and downstream tasks, while activating significantly fewer experts. Notably, the learned allocation is highly non-uniform, with later layers activating more experts. The source code is publicly available.
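
The idea of relaxing top-$k$ routing can be illustrated with a generic sigmoid-based soft top-$k$. This is explicitly not the LapSum relaxation used by SoftMoE, whose details are not given in the abstract; the function name `soft_top_k`, the temperature, and the bisection search for the threshold are assumptions chosen for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_top_k(scores, k, temperature=0.1, iters=60):
    """Gates in (0, 1) that sum to approximately k (requires 0 < k < len(scores)).

    A sigmoid relaxation with a shared threshold found by bisection;
    an illustrative stand-in, not the LapSum operator.
    """
    scores = np.asarray(scores, dtype=float)
    lo = scores.min() - 20.0 * temperature  # threshold so low that every gate is ~1
    hi = scores.max() + 20.0 * temperature  # threshold so high that every gate is ~0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sigmoid((scores - mid) / temperature).sum() > k:
            lo = mid  # too many experts active: raise the threshold
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    return sigmoid((scores - tau) / temperature)

scores = np.array([2.0, 0.5, 1.5, -1.0, 0.1, 1.4])
gates = soft_top_k(scores, k=2)
```

Because the gates are smooth functions of the router scores, and `k` may be fractional, a budget such as the mean number of active experts can in principle be treated as a trainable quantity, which is the property the abstract emphasizes.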

2606.17961 2026-06-17 cs.CV cs.AI 交叉投稿

Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation

基于相似性的位置编码在旋转下的鲁棒性:理论分析与实验验证

Andrea Santomauro, Luigi Portinale, Giorgio Leonardi

Affiliations: Computer Science Institute, DiSIT, University of Piemonte Orientale, Alessandria, Italy

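AI summary: Analyzes theoretically and validates experimentally the stability of similarity-based positional encoding (simPE) under rotational perturbations, proving a bounded perturbation in Frobenius norm and showing that simPE outperforms standard learned positional encoding on four datasets.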

详情
AI中文摘要

位置编码是Transformer架构的基本组成部分,因为它注入了关于输入空间或序列排列的信息。在标准绝对位置编码和正弦编码的最新替代方案中,基于相似性的位置编码(simPE)已成为一种通过成对关系表示位置结构的灵活框架。simPE最初是为医学成像应用设计的,其中几何鲁棒性尤为重要:在图像采集过程中,由于成像仪器、患者定位或轻微的采集偏差,自然会产生小旋转。尽管具有经验上的前景,但simPE在几何扰动下的理论行为尚未完全表征。在本文中,我们研究了simPE对旋转的鲁棒性,结合了形式化的理论分析和实验验证。我们首先证明simPE通常不是旋转不变的。然后,我们证明,在基本分量的温和Lipschitz假设下,simPE在旋转扰动下是稳定的,并推导了Frobenius范数下的显式扰动界限。我们在四个受控数据集上实验验证了这些发现——一个合成Arrow数据集、一个合成Shapes数据集(四个几何形状类别)、一个合成Digits数据集和一个基准图像分类数据集(FashionMNIST)——其中训练和验证图像保持固定的规范方向,而测试图像则经受逐渐增大的旋转角度。在所有数据集中,simPE在旋转下的准确率、F1分数、精确率和召回率方面始终优于标准学习位置编码,特别是在小到中等角度范围内,这证实了理论稳定性保证。

英文摘要

Positional encoding is a fundamental component of Transformer architectures, as it injects information about the spatial or sequential arrangement of inputs. Among recent alternatives to standard absolute and sinusoidal encodings, similarity-based positional encoding (simPE) has emerged as a flexible framework for representing positional structure through pairwise relations. simPE was originally designed for medical imaging applications, where geometric robustness is especially relevant: small rotations naturally arise during image acquisition, induced by imaging instruments, patient positioning, or slight acquisition misalignments. Despite its empirical promise, the theoretical behavior of simPE under geometric perturbations has not been fully characterized. In this paper, we study the robustness of simPE with respect to rotations, combining formal theoretical analysis with experimental validation. We first show that simPE is generally not rotation-invariant. We then prove that, under mild Lipschitz assumptions on the elementary components, simPE is stable under rotational perturbations and derive explicit perturbation bounds in Frobenius norm. We validate these findings experimentally on four controlled datasets--a synthetic Arrow dataset, a synthetic Shapes dataset (four geometric shape categories), a synthetic Digits dataset, and a benchmark image classification dataset (FashionMNIST)--in which training and validation images are kept in a fixed canonical orientation while test images are subjected to increasing rotation angles. Across all datasets, simPE consistently outperforms standard learned positional encoding in terms of accuracy, F1 score, precision, and recall under rotation, particularly in the small-to-moderate angle regime, corroborating the theoretical stability guarantees.

2606.17996 2026-06-17 cs.LG cs.AI 交叉投稿

Multiple cyclicity and Wavelet Decomposition with Channel Correlation for Long-term Time Series Forecasting

多重周期性与通道相关的小波分解在长期时间序列预测中的应用

Bin Wang, Heming Yang, Jinfang Sheng

Affiliations: School of Computer Science and Engineering, Central South University

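AI summary: Proposes McWC, which constructs multi-layer cyclicity, extracts inter-channel correlations with a multi-layer perceptron, fuses high- and low-frequency information through multi-level wavelet decomposition, and decouples intra-channel autocorrelation with a frequency-domain loss, for efficient and accurate long-term forecasting.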

详情
AI中文摘要

周期性和趋势是时间序列数据的重要组成部分,许多基于周期性和趋势的研究在长期时间序列预测中取得了良好效果。然而,我们认为当前工作忽略了时间序列数据中真实世界通道间相关性的影响,导致预测次优。此外,这些模型依赖复杂设计来捕获多样信息,导致计算效率低下。为解决这一挑战,我们提出McWC,一种长期时间序列预测模型,分别对周期性、趋势和通道间相关性进行建模。具体来说,McWC首先使用多层周期性构建模块从数据中解耦周期性信息。然后,使用多层感知机提取通道间相关性。接着,使用多级小波分解模块对数据中的多层高频和低频信息进行建模和融合。最后,聚合不同组件的结果以获得输出。同时,我们通过在频域计算损失函数来解耦通道内自相关。在六个真实世界数据集上的实验表明,McWC实现了最先进的性能,展现出卓越的计算效率和历史信息提取能力。

英文摘要

Cyclicity and trend are important components of time series data, and many studies based on them have achieved good results in long-term time series forecasting. However, we believe that current work neglects the influence of real-world inter-channel correlations in time series data, which leads to suboptimal predictions. Furthermore, these models rely on complex designs to capture diverse information, resulting in low computational efficiency. To address these challenges, we propose McWC, a long-term time series forecasting model that separately models the cyclicity, trend, and inter-channel correlations. Specifically, McWC first decouples cyclical information from data using a multi-layer cyclicity construction module. Then, it extracts inter-channel correlations using a multi-layer perceptron. Next, it models and fuses the multi-layer high-frequency and low-frequency information from data using a multi-level wavelet decomposition module. Finally, it aggregates the results of different components to obtain the output. Simultaneously, we decouple intra-channel autocorrelations by calculating the loss function in the frequency domain. Experiments on six real-world datasets demonstrate that McWC achieves state-of-the-art performance, exhibiting excellent computational efficiency and historical information extraction capabilities.
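
The multi-level wavelet decomposition mentioned above can be sketched with the Haar wavelet, the simplest case. The abstract does not say which wavelet McWC uses, so the Haar choice and the helper names below are assumptions; the sketch only shows how a series is split into one low-frequency approximation plus a high-frequency detail band per level, and that the split is invertible.

```python
import numpy as np

def haar_step(x):
    """One Haar level: split an even-length series into low- and high-frequency halves."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # local averages (low frequency)
    detail = (even - odd) / np.sqrt(2.0)  # local differences (high frequency)
    return approx, detail

def haar_decompose(x, levels):
    """Repeatedly split the approximation; returns (coarsest approx, details fine-to-coarse)."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details

def haar_reconstruct(approx, details):
    """Invert haar_decompose by undoing the levels from coarsest to finest."""
    for d in reversed(details):
        out = np.empty(2 * approx.size)
        out[0::2] = (approx + d) / np.sqrt(2.0)
        out[1::2] = (approx - d) / np.sqrt(2.0)
        approx = out
    return approx

t = np.arange(16)
series = np.sin(2 * np.pi * t / 8) + 0.1 * t  # toy cyclic signal plus trend
approx, details = haar_decompose(series, levels=3)
restored = haar_reconstruct(approx, details)
```

A forecasting model can then treat each band separately and fuse the results; how McWC performs that fusion is not specified in the abstract.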

2606.18003 2026-06-17 cs.LG cs.AI 交叉投稿

C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

C2FL:空间和时间漂移下的聚类持续联邦学习

Davide Domini, Gianluca Aguzzi, Lorenzo Pellegrini, Mirko Viroli, Lukas Esterle

Affiliations: University of Bologna; Aarhus University

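AI summary: Targets privacy-preserving collective adaptation under spatial heterogeneity and temporal drift; proposes C2FL, in which nodes self-organize into learning groups through spatial clustering and combine experience replay with dwell-time-aware adaptive averaging for robust collective adaptation.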

详情
AI中文摘要

集体自适应系统(CAS)越来越依赖机器学习,让每个节点从本地感知数据中学习,使其行为与周围环境对齐。然而,扩展这种智能带来了根本性挑战:感知数据通常涉及隐私,无法集中收集;节点是移动的,穿越不同区域,附近节点感知相似现象,而远处节点观察到截然不同的条件,形成自然空间聚类;并且由于移动性,这些分布随时间演变,引入时间漂移,使本地模型逐渐过时。这些动态出现在多个领域——车辆感知、无人机监测、智能手机众包——但隐私、空间异质性和时间漂移的相互作用严重削弱了传统学习策略。因此,我们提出C2FL,一种完全分布式的联邦学习(FL)方法,其中节点通过空间聚类自组织成学习组,反映环境的地理结构。为了抵消时间漂移,每个节点将经验回放与停留时间感知的自适应平均步骤相结合,随着在同一区域停留更长时间,逐步纳入区域共识,同时在不断变化的分布下保留先前获得的知识。我们在系统再现空间和时间变化的合成实验上评估了我们的方法,表明标准联邦策略在这些条件下显著退化,而我们的方法恢复了鲁棒的集体适应。

英文摘要

Collective Adaptive Systems (CAS) increasingly rely on machine learning to let each node learn from locally sensed data, aligning its behavior with the surrounding environment. Scaling this intelligence, however, raises fundamental challenges: sensed data is often privacy-sensitive, preventing centralized collection; nodes are mobile, traversing regions where nearby nodes perceive similar phenomena while distant ones observe radically different conditions, creating natural spatial clusters; and these distributions evolve over time due to mobility, introducing temporal drift that makes local models progressively stale. These dynamics arise across domains - vehicular sensing, drone-based monitoring, smartphone crowdsensing - yet the interplay of privacy, spatial heterogeneity, and temporal drift severely undermines conventional learning strategies. Therefore, we propose C2FL, a fully distributed Federated Learning (FL) approach where nodes self-organize into learning groups through spatial clustering, reflecting the geographic structure of the environment. To counteract temporal drift, each node combines experience replay with a dwell-time-aware adaptive averaging step, progressively incorporating the regional consensus as it remains longer within the same area, while preserving previously acquired knowledge under evolving distributions. We evaluate our approach on synthetic experiments that systematically reproduce spatial and temporal shifts, showing that standard federated strategies degrade significantly under these conditions and that our method restores robust collective adaptation.
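
The dwell-time-aware averaging step described above can be sketched as a convex combination whose weight on the regional consensus grows the longer a node stays in one area. The exact schedule used by C2FL is not given in the abstract; the exponential form, the `half_life` parameter, and the function names below are hypothetical.

```python
import numpy as np

def dwell_weight(dwell_steps, half_life=5.0):
    """Trust placed in the regional consensus: 0 on arrival, approaching 1 with dwell time.

    Hypothetical schedule; the paper's actual rule may differ.
    """
    return 1.0 - 0.5 ** (dwell_steps / half_life)

def adaptive_average(local, regional, dwell_steps, half_life=5.0):
    """Blend local parameters with the regional consensus according to dwell time."""
    alpha = dwell_weight(dwell_steps, half_life)
    return (1.0 - alpha) * local + alpha * regional

local = np.array([1.0, 1.0])      # parameters carried in from the previous region
regional = np.array([3.0, -1.0])  # consensus of the cluster the node has just entered
just_arrived = adaptive_average(local, regional, dwell_steps=0)
after_half_life = adaptive_average(local, regional, dwell_steps=5)
```

A node that has just crossed a region boundary keeps its own model, which limits the damage from a consensus that may not yet apply to it, and it adopts the regional model gradually as evidence of staying accumulates.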

2606.18023 2026-06-17 cs.LG cs.AI 交叉投稿

LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

LoopCoder-v2: 仅循环一次以实现高效的测试时计算扩展

Jian Yang, Shawn Guo, Wei Zhang, Tianyu Zheng, Yaxin Du, Haau-Sing Li, Jiajun Wu, Yue Song, Yan Xing, Qingsong Cai, Zelong Huang, Chuan Hao, Ran Tao, Xianglong Liu, Wayne Xin Zhao, Mingjie Tang, Weifeng Lv, Ming Zhou, Bryan Dai

发表机构 * Beihang University(北京航空航天大学) IQuest Research Langboat(浪波) Renmin University of China(中国人民大学)

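AI summary: Studies loop-count selection in Parallel Loop Transformers (PLT) by training the LoopCoder-v2 family; the two-loop variant improves substantially on code generation and related tasks, while variants with three or more loops regress, exposing a gain-cost trade-off.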

详情
AI中文摘要

循环Transformer通过重复应用共享块来扩展潜在计算,但顺序循环会随着循环次数增加延迟和KV缓存内存。并行循环Transformer(PLT)通过跨循环位置偏移(CLP)和共享KV门控滑动窗口注意力来缓解这一成本,使循环次数成为实际设计选择。因此,我们通过增益-成本视角研究PLT循环次数选择:额外的循环可能细化表示,但CLP在每个循环边界引入位置不匹配。我们通过从头训练LoopCoder-v2(一组具有不同循环次数的7B PLT编码器)在18T token上,随后进行匹配的指令调优和评估来实例化这项研究。经验上,两循环变体在代码生成、代码推理、代理软件工程和工具使用基准上比无循环基线带来广泛提升,将SWE-bench Verified从43.0提高到64.4分,Multi-SWE从14.0提高到31.0分。相比之下,三循环或更多循环的变体性能下降,揭示了强烈的非单调循环次数效应。我们的诊断表明,循环2提供了主要的生产性细化,而后续循环产生递减、振荡的更新和降低的表示多样性。由于CLP引起的不匹配在细化收益缩小时大致固定,偏移成本日益占主导。这种增益-成本权衡解释了PLT在两循环处饱和,并为循环次数选择提供了诊断。

英文摘要

Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain--cost view: an extra loop may refine representations, but CLP also introduces a positional mismatch at each loop boundary. We instantiate this study by training LoopCoder-v2, a family of 7B PLT coders with different loop counts, from scratch on 18T tokens, followed by matched instruction tuning and evaluation. Empirically, the two-loop variant delivers broad gains over the non-looped baseline across code generation, code reasoning, agentic software engineering, and tool-use benchmarks, improving SWE-bench Verified from 43.0 to 64.4 points and Multi-SWE from 14.0 to 31.0 points. In contrast, variants with three or more loops regress, revealing a strongly non-monotonic loop-count effect. Our diagnostics show that loop 2 provides the main productive refinement, while later loops yield diminishing, oscillatory updates and reduced representational diversity. Because the CLP-induced mismatch remains roughly fixed as refinement gains shrink, the offset cost increasingly dominates. This gain--cost trade-off explains PLT's saturation at two loops and provides diagnostics for loop-count selection.

2606.18024 2026-06-17 cs.LG cs.AI 交叉投稿

Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation

灾难性遗忘是低秩的:持续适应的函数空间理论

Ido Nitzan Hidekel, Dan Raviv

Affiliations: Tel Aviv University

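AI summary: Develops a function-space theory in the neural tangent kernel (NTK) regime, deriving a closed-form expression for the old-task prediction drift caused by new-task training, showing that forgetting concentrates in a few old-task NTK eigenmodes, and giving a low-rank characterization with a Kronecker scaling rule.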

Comments Accepted to the ICML 2026 Workshop on Continual Adaptation at Scale: Towards Sustainable AI

详情
AI中文摘要

持续适应中的灾难性遗忘通常通过参数漂移、重放或蒸馏来研究,但这些观点未能识别哪些输出空间方向是脆弱的。我们在NTK机制下给出一个函数空间解释:新任务训练通过跨任务核诱导旧任务预测漂移,从而在新任务梯度步骤之前得到遗忘向量的闭式预测器。在冻结主干线性头PEFT-CL中,模型在可训练参数上是线性的,预测器精确到数值精度;对于非线性适配器/全微调,它是局部NTK近似。同一表达式揭示遗忘集中在少量旧任务NTK本征模式上,并在冻结线性头下给出脆弱秩的Kronecker缩放规则。这些结果澄清了与先前NTK重叠理论的关系,解释了为什么参数空间正则化器可能遗漏输出空间干扰,并激发了一种有针对性的谱正则化器。

英文摘要

Catastrophic forgetting in continual adaptation is usually studied through parameter drift, replay, or distillation, but these views do not identify which output-space directions are vulnerable. We give a function-space account in the NTK regime: new-task training induces old-task prediction drift through the cross-task kernel, yielding a closed-form predictor for the forgetting vector before any new-task gradient step. In frozen-backbone linear-head PEFT-CL, where the model is linear in the trainable parameters, the predictor is exact up to numerical precision; for nonlinear adapters/full fine-tuning, it is a local NTK approximation. The same expression reveals that forgetting concentrates in a small number of old-task NTK eigenmodes and under frozen linear heads gives a Kronecker scaling rule for the vulnerable rank. These results clarify the relation to prior NTK-overlap theory, explain why parameter-space regularizers can miss output-space interference, and motivate a targeted spectral regularizer.
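
For the frozen-backbone, linear-head case, the claim that old-task drift is determined by the cross-task kernel can be checked numerically for a single gradient step. The sketch below assumes a scalar-output head, a squared-error loss, and random features standing in for a backbone; the variable names are illustrative, and the paper's full predictor is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_old, n_new, lr = 16, 20, 30, 0.05

phi_old = rng.normal(size=(n_old, d))  # frozen-backbone features of old-task inputs
phi_new = rng.normal(size=(n_new, d))  # frozen-backbone features of new-task inputs
y_new = rng.normal(size=n_new)
w = rng.normal(size=d)                 # trainable linear head

# One gradient step on the new-task loss 0.5 * mean((phi_new @ w - y_new) ** 2).
residual = phi_new @ w - y_new
w_next = w - lr * phi_new.T @ residual / n_new
observed_drift = phi_old @ w_next - phi_old @ w

# Prediction available before the step: only the cross-task kernel and residual are needed.
k_cross = phi_old @ phi_new.T          # K(old, new) for a linear head
predicted_drift = -lr * k_cross @ residual / n_new
```

Here `observed_drift` and `predicted_drift` agree to numerical precision because the model is linear in its trainable parameters; for nonlinear adapters or full fine-tuning the same expression would only be a local approximation, as the abstract notes.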

2606.18071 2026-06-17 cs.LG cs.AI 交叉投稿

Volterra Generative Models

Volterra生成模型

Yusen Jia, Bingyan Han

Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

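AI summary: Proposes Volterra generative models, which inject path-dependent noise through fractional kernels and handle the resulting non-Markovian dynamics with Markovian lifts and residual-state learning; validated on MNIST and CIFAR-10.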

Comments 36 pages

详情
AI中文摘要

基于分数的扩散模型通常使用布朗扰动,这提供了易处理的反向时间动力学,但施加了无记忆的噪声。我们引入了Volterra生成模型,这是一个连续时间的基于分数的框架,其前向过程通过分数阶核注入路径依赖噪声。为了处理非马尔可夫和非半鞅动力学,我们在两种情况下使用高斯求积构造有限维马尔可夫提升,并在平滑情况下使用混合有限差分指数近似。我们证明了平方误差界,推导了增广的线性高斯前向过程,并表明通过考虑残差状态和分析辅助高斯分数,学习可以保持数据维度。我们还识别了由共享布朗因子和有符号平滑区域权重引起的协方差和反向时间退化。退化激发了稳定条件处理,对于刚性较大的提升,则采用高斯桥重建采样器。在MNIST和CIFAR-10上的实验表明,具有小马尔可夫提升的持久分数扰动可以改善MNIST上的基于分数的生成,并为自然图像提供有前景的扩展,而桥采样器为较大提升提供了稳定机制。

英文摘要

Score-based diffusion models typically use Brownian perturbations, which provide tractable reverse-time dynamics but impose memoryless noising. We introduce Volterra generative models, a continuous-time score-based framework whose forward process injects path-dependent noise through fractional kernels. To handle the non-Markovian and non-semimartingale dynamics, we construct finite-dimensional Markovian lifts using Gaussian quadrature in both regimes and a hybrid finite-difference exponential approximation in the smooth regime. We prove squared error bounds, derive an augmented linear-Gaussian forward process, and show that the learning can remain data-dimensional by considering residual states and analytic auxiliary Gaussian scores. We also identify covariance and reverse-time degeneracies caused by shared Brownian factors and signed smooth-regime weights. The degeneracy motivates stabilized conditioning and, for stiff larger lifts, a Gaussian-bridge reconstruction sampler. Experiments on MNIST and CIFAR-10 show that persistent fractional perturbations with small Markovian lifts can improve score-based generation on MNIST and provide a promising extension to natural images, while the bridge sampler provides a stability mechanism for larger lifts.

2606.18096 2026-06-17 cs.LG cs.AI cs.DC 交叉投稿

S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices

S4oP:面向资源受限设备的结构化状态空间模型的算子级剪枝

Marco Deano, Filippo Ziche, Nicola Bombieri

Affiliations: University of Verona

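AI summary: Proposes an incremental operator-level pruning method for S4 and S4D models that interleaves structured masking with fine-tuning, substantially reducing inference cost while preserving predictive performance; presented as the first systematic study of structured operator pruning for SSMs.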

详情
AI中文摘要

结构化状态空间模型(SSMs),包括S4和S4D架构,最近已成为捕捉序列数据中长程依赖关系的基于注意力模型的有力替代方案。尽管其经验性能强劲,但由于计算和内存需求,在时间和资源受限的环境中部署这些模型仍然具有挑战性。在本文中,我们提出了一种新颖的增量式算子级剪枝方法,用于基于S4和S4D的模型,该方法在保持预测性能的同时显著降低推理成本。据我们所知,这是首个系统研究SSM结构化算子剪枝的工作。我们的方法通过将结构化掩码与微调交替进行,逐步剪枝模型算子,同时联合监控准确性和推理延迟。我们在一个统一的训练和评估框架中实现了这种方法,该框架能够系统地探索效率-准确性的权衡。在多个基准数据集上的实验表明,剪枝高达70%的模型算子在大多数情况下保持了原始模型的性能,同时显著降低了推理延迟。这些结果表明,结构化算子剪枝是一种有效且先前未被探索的提高SSM效率的策略,并有助于它们在资源受限的实际场景中的部署。

英文摘要

Structured State Space Models (SSMs), including the S4 and S4D architectures, have recently emerged as powerful alternatives to attention-based models for capturing long-range dependencies in sequential data. Despite their strong empirical performance, deploying these models in time- and resource-constrained settings remains challenging due to their computational and memory demands. In this paper, we propose a novel incremental, operator-level pruning approach for S4- and S4D-based models that significantly reduces inference cost while preserving predictive performance. To the best of our knowledge, this is the first work to systematically investigate structured operator pruning for SSMs. Our method progressively prunes model operators by interleaving structured masking with fine-tuning, while jointly monitoring accuracy and inference latency. We implement this approach within a unified training and evaluation framework that enables systematic exploration of efficiency-accuracy trade-offs. Experiments across multiple benchmark datasets show that pruning up to 70% of the model operators preserves the performance of the original models in most cases, while substantially reducing inference latency. These results demonstrate that structured operator pruning is an effective and previously unexplored strategy for improving the efficiency of SSMs, one that facilitates their deployment in practical, resource-constrained scenarios.

2606.18114 2026-06-17 cs.LG cs.AI 交叉投稿

Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models

Ternary Mamba: 分组量化感知训练的 W1.58A16 状态空间模型

Ramprasath Ganesaraja, Sahil Dilip Panse, Swathika N

Affiliations: EdgeVerve Systems Limited

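AI summary: Proposes grouped quantization-aware training (QAT) from a pretrained checkpoint with knowledge distillation, compressing Mamba-2 1.3B by 3.61x with only about 100M tokens, reaching zero-shot accuracy close to Bi-Mamba and identifying a zero-ratio collapse specific to QAT from pretrained checkpoints.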

详情
AI中文摘要

状态空间模型(SSM)如Mamba-2提供线性时间推理,但其内存占用限制了边缘部署。先前的三元SSM工作(Slender-Mamba)在150B token上从头训练;我们证明预训练检查点足以胜任,将边际token预算减少1000倍。使用分组量化感知训练(QAT)结合冻结FP16教师的知识蒸馏,我们将Mamba-2 1.3B压缩3.61倍(从2687 MB到744 MB),并在仅102M token(4 GPU小时,单H100)下达到48.1%的零样本准确率(7任务平均)——接近Bi-Mamba的48.4%(在+/-0.9pp置信区间内)。这种从预训练开始的QAT设置揭示了零比率坍塌,一种由可学习量化尺度引起的新不稳定性,在从头训练中不会出现。我们进一步证明,由于通过循环的误差累积,对Transformer有效的后处理校正策略对SSM失效。这些结果表明三元SSM不需要昂贵的从头训练:从预训练检查点进行QAT结合KD是一种数据高效的替代方案。

英文摘要

State Space Models (SSMs) such as Mamba-2 offer linear-time inference, but their memory footprint limits edge deployment. Prior ternary SSM work (Slender-Mamba) trains from scratch on 150B tokens; we show a pretrained checkpoint suffices, reducing the marginal token budget by 1,000x. Using grouped quantization-aware training (QAT) with knowledge distillation from a frozen FP16 teacher, we compress Mamba-2 1.3B by 3.61x (2,687 to 744 MB) and achieve 48.1% zero-shot accuracy (7-task average) in just 102M tokens (4 GPU-hours, single H100), approaching Bi-Mamba's 48.4% (within +/-0.9pp CI). This QAT-from-pretrained setting reveals zero-ratio collapse, a novel instability caused by learnable quantization scales that does not arise in from-scratch training. We further show that post-hoc correction strategies effective for Transformers fail for SSMs due to error accumulation through the recurrence. These results demonstrate that ternary SSMs do not require expensive from-scratch training: QAT from pretrained checkpoints with KD is a data-efficient alternative.
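
A rough sketch of grouped ternary quantization may help with the terminology. It assumes an absmean scale per group, in the style popularized by 1.58-bit language models; the abstract does not state the exact quantizer, and the paper uses learnable scales, so `ternary_quantize`, `group_size`, and the scale rule are illustrative assumptions. The last lines only re-derive figures quoted in the abstract.

```python
import numpy as np

def ternary_quantize(weights, group_size=4):
    """Map weights to codes in {-1, 0, +1} with one absmean scale per group (assumed scheme)."""
    groups = weights.reshape(-1, group_size)
    scale = np.abs(groups).mean(axis=1, keepdims=True) + 1e-8  # per-group scale
    codes = np.clip(np.round(groups / scale), -1, 1)
    return codes.reshape(weights.shape), scale

def dequantize(codes, scale, group_size=4):
    """Reconstruct approximate weights from ternary codes and per-group scales."""
    return (codes.reshape(-1, group_size) * scale).reshape(codes.shape)

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8))
codes, scale = ternary_quantize(weights)
approx_weights = dequantize(codes, scale)

bits_per_weight = np.log2(3)      # three states per weight, about 1.58 bits ("W1.58")
compression_ratio = 2687 / 744    # model sizes in MB quoted in the abstract
```

In quantization-aware training the rounding is applied in the forward pass while gradients update the underlying full-precision weights; that training loop is omitted here.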

2606.18186 2026-06-17 cs.LG cs.AI 交叉投稿

Kolmogorov Regression for Robust Diffusion Policies

用于鲁棒扩散策略的Kolmogorov回归

Lekan Molu

Affiliations: Bala Cynwyd, PA 19004

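AI summary: Uses a backward Kolmogorov equation to lift diffusion policies to a Cameron-Martin space, replacing stochastic score matching with a deterministic boundary-value PDE problem; a precision-weighted loss and a residual diagnostic yield convergence guarantees, more regular trajectories, and reward-free failure detection.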

详情
AI中文摘要

有限维扩散策略由于离散化伪影导致时间漂移,降低了长期性能(当部署在物理系统上时)。我们引入了一个后向Kolmogorov方程,将扩散策略提升至Cameron-Martin空间——希尔伯特空间的一个子集。本质上,用确定性边界值PDE问题替代随机分数匹配。我们的核心创新基于高斯测度理论,其中扩散噪声协方差算子由有色噪声分布实现,该分布规定了推理时模型样本的正则性概念。我们使用推导出的精度加权Cameron-Martin损失训练扩散模型,并引入Kolmogorov残差作为推理时的PDE诊断。这些替换产生了:(i) 收敛保证,其中界的常数取决于核的有效秩而非动作维度,(ii) 通过谱加权改进轨迹规则性,以及(iii) 无需奖励信号的确定性故障检测器。在两个应用领域的验证显示了显著改进:在PushT操作基准测试中,Cameron-Martin损失在最大回合奖励上实现了17%的提升(0.95对比0.78的MSE),并通过引入的残差幅度在推理期间减少了67.6%的步间漂移。类似地,在具有恒定在制品(CONWIP)流量控制的6站生产线上,我们实现了比经典LSTM基线低28.4%的RMSE;高饥饿事件召回率(测试周期中为1.0),以及有效的瓶颈识别(测试集中Precision@1=1.0,信噪比13倍)。然后,我们使用Hamilton-Jacobi可达性理论认证调度策略,与100次模拟运行中的无控制调度相比,死锁事件减少了96%(防止了351个事件)。

英文摘要

Finite-dimensional (FD) diffusion policies exhibit temporal drift owing to discretization artifacts that degrade long-horizon performance when deployed on physical systems. We introduce a backward Kolmogorov equation that lifts diffusion policies to a Cameron-Martin space, a subset of the Hilbert space, essentially replacing stochastic score matching with a deterministic boundary-value PDE problem. Our core innovation builds on Gaussian measure theory: the diffusion noise covariance operator is realized from a colored noise distribution, which prescribes a notion of regularity on samples from the model at inference time. We train the diffusion model with a derived precision-weighted Cameron-Martin loss and introduce a Kolmogorov residual as a PDE diagnostic during inference. These substitutions yield (i) convergence guarantees where the bound's constants depend on the effective rank of the kernel rather than action dimension, (ii) improved trajectory regularity via spectral weighting, and (iii) a deterministic failure detector without reward signals. Validation across two application domains demonstrates substantial improvements: on the PushT manipulation benchmark, the Cameron-Martin loss achieves a 17% improvement in maximum episode reward (0.95 vs. 0.78 for MSE) and a 67.6% reduction in inter-step drifts during inference via the introduced residual magnitude. Similarly, on a 6-station manufacturing line with constant work-in-process (CONWIP) flow control, we achieve 28.4% lower RMSE than classical LSTM baselines, a high starvation-event recall (1.0 in test cycles), and effective bottleneck identification (Precision@1 = 1.0 in test set, 13x signal-to-noise ratio). We then certify the dispatch policies with Hamilton-Jacobi reachability theory, which reduces deadlock events by 96% compared to uncontrolled dispatch over 100 simulated runs (351 events prevented).

2606.18208 2026-06-17 cs.LG cs.AI cs.CL cs.CV 交叉投稿

Looped World Models

循环世界模型

Hongyuan Adam Lu, Z. L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang, Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen, Hebin Wang, Zeyu Gao, Yang Li, Wei Zhao, Qimin Zhong, Siqi Liu, Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam

Affiliations: FaceMind Research Asia

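AI summary: Proposes Looped World Models (LoopWM), which iteratively refine latent environment states through a parameter-shared Transformer block, reporting up to 100x parameter efficiency and establishing iterative latent depth as a new scaling axis for world simulation.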

Comments Technical Report

详情
AI中文摘要

当前的世界模型面临一个基本矛盾:忠实的长期模拟需要深度计算,但更深的模型部署成本高且容易产生累积误差。我们通过引入循环世界模型(LoopWM)来解决这一问题,这是首个用于世界建模的循环架构。我们的方法通过一个参数共享的Transformer块迭代地细化潜在环境状态。这带来了高达100倍于传统方法的参数效率,并具有自适应计算能力,可自动调整深度以匹配每个预测步骤的复杂性。与缩放模型大小和训练数据正交,LoopWM建立了迭代潜在深度作为世界模拟的新缩放轴,这可能显著推动社区发展。

英文摘要

Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100x parameter efficiency over conventional approaches, with adaptive computation that automatically scales depth to match the complexity of each prediction step. Orthogonal to scaling model size and training data, LoopWM establishes iterative latent depth as a new scaling axis for world simulation, which might significantly push the community forward.

2509.21886 2026-06-17 cs.AI 版本更新

TRACE: Learning to Compute on Circuit Graphs

TRACE:在电路图上学习计算

Ziyang Zheng, Jiaying Zhu, Jingyi Zhou, Qiang Xu

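AI summary: Addresses the architectural mismatch of graph representation learning for modeling circuit function; proposes TRACE, which combines a Hierarchical Transformer with function shift learning and substantially outperforms prior methods.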

详情
AI中文摘要

学习计算,即对电路图的功能行为进行建模的能力,是图表示学习的一个基本挑战。然而,主流范式在此任务上存在架构不匹配。这一有缺陷的假设,是主流消息传递神经网络(MPNN)及其基于Transformer的常规对应物的核心,阻止了模型捕捉计算的位置感知和层次化特性。为解决此问题,我们引入了TRACE,一种建立在架构合理的骨干网络和原则性学习目标之上的新范式。首先,TRACE采用层次化Transformer,模拟计算的逐步流程,提供了替代有缺陷的置换不变聚合的忠实架构骨干。其次,我们引入了函数偏移学习,一种将学习问题解耦的新颖目标。我们的模型不是直接预测复杂的全局函数,而是训练仅预测函数偏移,即真实全局函数与假设输入独立的简单局部近似之间的差异。我们在各种电路模态上验证了这一范式,包括寄存器传输级图、与反相器图和映射后网表。在全面的基准测试套件中,TRACE显著优于所有先前的架构。这些结果表明,我们的架构对齐骨干和解耦学习目标为学习电路图功能行为这一基本挑战形成了更稳健的范式。

英文摘要

Learning to compute, the ability to model the functional behavior of a circuit graph, is a fundamental challenge for graph representation learning. Yet, the dominant paradigm is architecturally mismatched for this task. Its assumption of permutation-invariant aggregation, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation. To resolve this, we introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce function shift learning, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the function shift, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on various circuit modalities, including Register Transfer Level graphs, And-Inverter Graphs, and post-mapping netlists. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning the functional behavior of a circuit graph.

2510.14807 2026-06-17 cs.AI 版本更新

Beyond the Sampled Token: Preserving Candidate Support in RLVR

超越采样令牌:在RLVR中保留候选支持

Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen

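AI summary: Analyzes exploration collapse in RLVR from the perspective of the candidate distribution and proposes CaSP, which preserves probability mass on the top-N candidates to improve pass@K without sacrificing pass@1; validated on multiple benchmarks.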

Comments Technical report (23 pages, 16 figures, project page: https://spherelab.ai/simko/)

详情
AI中文摘要

我们从下一个令牌预测的候选分布角度,重新审视了具有可验证奖励的强化学习(RLVR)中的探索崩溃。我们正式证明,当概率集中到前1个候选时,无论采样预算K如何,期望的不同响应数量都会崩溃为1。这一理论含义通过我们在训练过程中对前N个候选概率的实证跟踪得到进一步验证,其中前1个候选逐渐占据主导地位,而其他合理替代方案被抑制。这些发现提出了有效探索的关键需求:在前N个候选上保留不可忽略的概率质量。为此,我们提出了候选感知支持保留(CaSP),包含两个互补设计。具体来说,对于正确响应,CaSP在前N个候选上重新分配正梯度;对于错误响应,则对前1个候选施加更强的惩罚。与许多以牺牲pass@1为代价提高pass@K的探索导向方法不同,CaSP在整个K谱上提高了pass@K。这些增益泛化到6个数学、2个逻辑推理和2个编码基准测试,并扩展到32B参数模型和高达K=1024的采样预算,使其成为RLVR探索的一种原则性、候选级别的方法。

英文摘要

We revisit exploration collapse in reinforcement learning with verifiable rewards (RLVR) from the perspective of the candidate distribution for next-token prediction. We formally show that as probability concentrates on the top-$1$ candidate, the expected number of distinct responses collapses to one regardless of the sampling budget $K$. This theoretical implication is further verified by our empirical tracking of top-$N$ candidate probabilities during training, where the top-$1$ candidate progressively dominates while plausible alternatives are suppressed. These findings suggest a key desideratum for effective exploration: preserving non-negligible probability mass on the top-$N$ candidates. To this end, we propose Candidate-aware Support Preservation (CaSP), with two complementary designs. Specifically, CaSP redistributes positive gradients among top-$N$ candidates for correct responses, and applies a stronger penalty to the top-$1$ candidate for incorrect responses. Unlike many exploration-oriented methods that improve pass@$K$ at the cost of pass@1, CaSP improves pass@$K$ across the full $K$ spectrum. These gains generalize to 6 math, 2 logical-reasoning, and 2 coding benchmarks, and scale to 32B-parameter models and sampling budgets up to $K=1024$, positioning it as a principled, candidate-level approach for RLVR exploration.
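
The collapse claim can be illustrated with a standard identity: among $K$ independent draws from a categorical distribution with probabilities $p_i$, the expected number of distinct outcomes is $\sum_i \bigl(1 - (1 - p_i)^K\bigr)$. The sketch below applies it to whole responses treated as categories, a simplification of the paper's token-level analysis; the function name and the example distributions are made up for illustration.

```python
import numpy as np

def expected_distinct(probs, k):
    """Expected number of distinct outcomes among k i.i.d. draws from `probs`."""
    probs = np.asarray(probs, dtype=float)
    return float(np.sum(1.0 - (1.0 - probs) ** k))

spread = [0.4, 0.3, 0.2, 0.1]          # mass preserved on several candidates
peaked = [0.997, 0.001, 0.001, 0.001]  # mass concentrated on the top-1 candidate
collapsed = [1.0, 0.0, 0.0, 0.0]       # fully collapsed policy

diverse_8 = expected_distinct(spread, k=8)
peaked_8 = expected_distinct(peaked, k=8)
collapsed_1024 = expected_distinct(collapsed, k=1024)
```

With the collapsed distribution the expected count stays at one no matter how large the sampling budget is, which is why preserving mass on the top-$N$ candidates matters for pass@$K$.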

2602.10635 2026-06-17 cs.AI cs.LG 版本更新

OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization

OmniSapiens: 一种通过异质性感知相对策略优化进行社会行为处理的基础模型

Keane Ong, Sabri Boughorbel, Luwei Xiao, Chanakya Ekbote, Wei Dai, Ao Qu, Jingyao Wu, Rui Mao, Ehsan Hoque, Erik Cambria, Gianmarco Mengaldo, Paul Pu Liang

Affiliations: Massachusetts Institute of Technology; National University of Singapore; Nanyang Technological University; Prince Sattam bin Abdulaziz University; University of Rochester

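AI summary: To address imbalanced training caused by heterogeneous behavioral data, proposes the Omnisapiens-7B 2.0 foundation model, trained with Heterogeneity-Aware Relative Policy Optimization (HARPO), which achieves the best performance on 10 behavioral tasks and 5 held-out generalization benchmarks.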

Comments Accepted to ICML 2026 Main Conference

详情
AI中文摘要

社交智能AI系统必须能够推理多样的人类行为任务,并泛化到新情境。然而,AI尚未达到这种社交智能水平。现有模型仍然受到行为数据训练引起的学习动态不平衡的根本限制。即,行为数据本质上是异质的,包含多种模态和预测目标,通常在不同样本间产生不均匀的训练信号。为了解决这个问题,我们开发了Omnisapiens-7B 2.0,一个专门处理异质行为数据学习的社会行为处理基础模型。这是通过异质性感知相对策略优化(HARPO)实现的,这是一种新颖的推理强化学习方法,明确地重新平衡样本间的学习信号。核心思想是近似策略更新的贡献信号,利用它们进行几何中心化和惯性平滑的优势调节。结果表明,Omnisapiens-7B 2.0在10个不同的行为任务上取得了最佳且最一致的性能,同时在所有五个保留的零样本泛化基准上也取得了最佳性能,分别提升了高达+12.02%和+9.37%。此外,Omnisapiens-7B 2.0展示了更一致和可解释的推理轨迹,支持可靠的现实世界行为应用。我们的模型和代码可在https://github.com/MIT-MI/human_behavior_atlas找到。

英文摘要

Socially intelligent AI systems must reason across diverse human behavioral tasks and generalize to new social contexts. However, behavioral data is inherently heterogeneous, comprising diverse modalities and prediction targets that produce uneven training signals across samples, creating imbalanced learning dynamics that challenge existing AI models. To address this, we develop Omnisapiens-7B 2.0, a foundation model for social behavior processing that explicitly addresses learning from heterogeneous behavioral data. This is enabled through Heterogeneity-Aware Relative Policy Optimization, a new RL method that rebalances learning signals across samples by approximating each sample's contribution to the policy update and using these estimates to drive geometrically centered, inertially smoothed advantage modulation for stable training. Omnisapiens-7B 2.0 achieves the best and most consistent performance across 10 behavioral tasks, while also attaining the best performance on all five held-out benchmarks, with gains of up to +12.02% and +9.37% respectively. Furthermore, it demonstrates more consistent and interpretable reasoning traces, supporting reliable real-world behavioral applications. Our model is available at https://github.com/MIT-MI/human_behavior_atlas.

2603.18104 2026-06-17 cs.AI cs.DC cs.LG cs.NE 版本更新

Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI

自适应领域模型:贝叶斯演化、热旋转与几何及神经形态AI的规范化训练

Houston Haynes

AI总结 提出基于维度类型系统、程序超图和b-posit有界设计的替代训练架构,实现内存开销恒定、梯度精确累积和级保持更新,并引入贝叶斯蒸馏和热旋转机制,支持领域特定模型的持续自适应与可验证正确性。

Comments 32 pages, 3 figures

详情
AI中文摘要

当前AI训练假设在IEEE-754算术上进行反向模式自动微分。训练相对于推理的内存开销、优化器复杂性以及训练过程中几何属性的结构退化,都是该算术基底的后果。本文基于三项先前结果开发了一种替代训练架构:维度类型系统和确定性内存管理框架(Haynes 2026),将栈可分配梯度分配和精确quire累积确立为设计时可验证属性;程序超图(Haynes 2026),将几何代数计算中的级保持确立为类型级不变量;以及b-posit有界设计(Jonnalagadda et al. 2025),使posit算术在传统上被视为仅推理的硬件目标上变得可行。它们的组合实现了深度无关的训练内存(约为推理占用量的两倍)、级保持的权重更新和精确梯度累积,统一适用于损失函数优化和脉冲时序依赖的神经形态模型。我们引入了*贝叶斯蒸馏*,一种通过ADM训练机制提取通用模型潜在先验结构的机制,解决了领域特定训练的数据稀缺自举问题。对于部署,我们引入了*热旋转*,一种操作模式,其中更新后的模型在不中断服务的情况下过渡到活跃推理路径,并通过PHG证书和签名版本记录形式化正确性。结果是一类领域特定AI系统,比通用模型更小、更精确,持续自适应,相对于其领域的物理结构可验证正确,并且可从现有模型初始化。

英文摘要

Prevailing AI training assumes reverse-mode automatic differentiation over IEEE-754 arithmetic. The memory overhead of training relative to inference, optimizer complexity, and structural degradation of geometric properties through training are consequences of this arithmetic substrate. This paper develops an alternative training architecture grounded in three prior results: the Dimensional Type System and Deterministic Memory Management framework (Haynes 2026), which establishes stack-eligible gradient allocation and exact quire accumulation as design-time verifiable properties; the Program Hypergraph (Haynes 2026), which establishes grade preservation through geometric algebra computations as a type-level invariant; and the b-posit bounded-regime design (Jonnalagadda et al. 2025), which makes posit arithmetic tractable across hardware targets conventionally considered inference-only. Their composition enables depth-independent training memory bounded to approximately twice the inference footprint, grade-preserving weight updates, and exact gradient accumulation, applicable uniformly to loss-function-optimized and spike-timing-dependent neuromorphic models. We introduce *Bayesian distillation*, a mechanism by which the latent prior structure of a general-purpose model is extracted through the ADM training regime, resolving the data-scarcity bootstrapping problem for domain-specific training. For deployment, we introduce *warm rotation*, an operational pattern in which an updated model transitions into an active inference pathway without service interruption, with correctness formalized through PHG certificates and signed version records. The result is a class of domain-specific AI systems that are smaller and more precise than general-purpose models, continuously adaptive, verifiably correct with respect to the physical structure of their domains, and initializable from existing models.

2604.10827 2026-06-17 cs.AI 版本更新

Know Thy Reasoner: Not All Language Models Explore Alike

了解你的推理器:并非所有语言模型的探索方式都相同

Moulik Choraria, Argyrios Gerogiannis, Anirban Das, Supriyo Chakraborty, Sourya Basu, Sambit Sahu, Lav R. Varshney

发表机构 * UIUC(伊利诺伊大学香槟分校) Capital One

AI总结 本文提出模型多样性影响推理策略,通过理论框架分析推理不确定性,验证了不同模型在深度精炼和并行采样中的表现差异。

Comments This is a full-length extension of the workshop paper that appeared in the ICLR 2026 Workshop on LLM Reasoning

详情
AI中文摘要

计算LLM推理的扩展性需要在探索解决方案方法(广度)和细化有前途的解决方案(深度)之间分配预算。大多数方法隐式地权衡两者,但为何特定的权衡有效仍不明确,且在单一模型上的验证掩盖了模型自身的作用。我们主张最优策略取决于模型的多样性分布,即概率质量在解决方案方法上的分散情况,并在采用任何探索策略之前必须进行表征。我们通过理论框架分解推理不确定性,并推导出树状深度精炼优于并行采样的条件。我们在Qwen-3 4B和Olmo-3 7B系列上验证了这一点,显示轻量信号足以在低多样性对齐模型上进行基于深度的精炼,而在高多样性基础模型上则产生有限的效用,我们推测后者需要更强的补偿以应对较低的探索覆盖度。

英文摘要

Compute scaling for LLM reasoning trades off exploring solution approaches (breadth) against refining promising ones (depth), yet why a given trade-off works, and why it often fails to transfer across models, remains unclear. We argue that the optimal strategy depends on the model's diversity profile, the spread of probability mass across solution approaches, and that this must be characterized before any exploration strategy is adopted. We formalize this with a framework decomposing reasoning uncertainty, deriving when depth-based refinement outperforms parallel sampling, and validate it across three model families at both inference and training. Our central finding is that the diversity regime dictates the strategy: low-diversity aligned models benefit from depth-based refinement with lightweight intrinsic signals, whereas high-diversity base models are often harmed by it, and instead need breadth or stronger signals to compensate.

2404.01965 2026-06-17 cs.LG cs.AI 版本更新

Towards Leveraging AutoML for Sustainable Deep Learning: A Multi-Objective HPO Approach on Deep Shift Neural Networks

迈向利用AutoML实现可持续深度学习:深度移位神经网络上的多目标HPO方法

Leona Hennig, Tanja Tornede, Marius Lindauer

AI总结 针对深度学习计算成本高的问题,提出结合多保真度HPO与多目标优化,在深度移位神经网络上同时最大化精度和最小化能耗,实验获得超80%精度且低计算开销。

详情
AI中文摘要

深度学习通过从大型数据集中提取复杂模式,推动了各个领域的发展。然而,深度学习模型的计算需求带来了环境和资源方面的挑战。深度移位神经网络(DSNNs)通过利用移位操作来降低推理时的计算复杂度,提供了一种解决方案。遵循标准DNNs的见解,我们感兴趣的是通过AutoML技术充分利用DSNNs的潜力。我们研究了超参数优化(HPO)的影响,以最大化DSNN性能,同时最小化资源消耗。由于这结合了多目标(MO)优化,其中精度和能耗作为潜在互补目标,我们提出将最先进的多保真度(MF)HPO与多目标优化相结合。实验结果表明了我们方法的有效性,得到了精度超过80%且计算成本低的模型。总体而言,我们的方法加速了高效模型开发,同时实现了可持续的AI应用。

英文摘要

Deep Learning (DL) has advanced various fields by extracting complex patterns from large datasets. However, the computational demands of DL models pose environmental and resource challenges. Deep shift neural networks (DSNNs) offer a solution by leveraging shift operations to reduce computational complexity at inference. Following the insights from standard DNNs, we are interested in leveraging the full potential of DSNNs by means of AutoML techniques. We study the impact of hyperparameter optimization (HPO) to maximize DSNN performance while minimizing resource consumption. Since this combines multi-objective (MO) optimization with accuracy and energy consumption as potentially complementary objectives, we propose to combine state-of-the-art multi-fidelity (MF) HPO with multi-objective optimization. Experimental results demonstrate the effectiveness of our approach, resulting in models with over 80% accuracy and low computational cost. Overall, our method accelerates efficient model development while enabling sustainable AI applications.
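
The shift operations mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes integer activations and weights restricted to signed powers of two; the names `shift_multiply` and `shift_dot` are hypothetical and are not taken from the paper or its code.

```python
def shift_multiply(x: int, sign: int, shift: int) -> int:
    """Multiply integer x by the weight sign * 2**shift using only a bit shift.

    shift >= 0 is a left shift. shift < 0 is an arithmetic right shift,
    which floors x / 2**(-shift) before the sign is applied.
    """
    shifted = x << shift if shift >= 0 else x >> (-shift)
    return sign * shifted


def shift_dot(xs, signs, shifts):
    """Dot product in which every weight is a signed power of two."""
    return sum(shift_multiply(x, s, p) for x, s, p in zip(xs, signs, shifts))
```

For example, `shift_dot([3, 4], [1, -1], [2, 0])` evaluates 3 * 4 + 4 * (-1) without any multiplication by a learned weight.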

2502.00241 2026-06-17 cs.LG cs.AI cs.CL cs.CV 版本更新

Mordal: Automated Pretrained Model Selection for Vision Language Models

Mordal: 面向视觉语言模型的自动化预训练模型选择

Shiqi He, Insu Jang, Mosharaf Chowdhury

AI总结 提出Mordal框架,通过减少候选模型数量和评估时间,自动化搜索用户定义任务的最佳视觉语言模型,相比网格搜索降低GPU耗时8.9-11.6倍,加权Kendall's τ平均提升69%。

详情
AI中文摘要

将多种模态融入大型语言模型(LLMs)是增强其对非文本数据理解、使其能够执行多模态任务的有效方式。视觉语言模型(VLMs)因其在医疗、机器人和无障碍等领域的众多实际应用,成为增长最快的多模态模型类别。然而,尽管文献中不同的VLM在不同基准测试中展现出令人印象深刻的视觉能力,它们都是由人类专家手工设计的;目前尚无自动化框架来创建特定任务的多模态模型。我们引入Mordal,一种自动化多模态模型搜索框架,能够高效地为用户定义的任务找到最佳VLM,无需人工干预。Mordal通过减少搜索过程中需考虑的候选模型数量以及最小化评估每个剩余候选模型所需的时间来实现这一目标。我们的评估表明,Mordal能够找到给定问题的最佳VLM,其GPU耗时比网格搜索低8.9倍至11.6倍。我们还发现,Mordal在不同任务上平均比最先进的模型选择方法实现约69%更高的加权Kendall's τ。

英文摘要

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using 8.9x to 11.6x fewer GPU hours than grid search. We have also discovered that Mordal achieves about 69% higher weighted Kendall's $τ$ on average than the state-of-the-art model selection method across diverse tasks.

2502.08363 2026-06-17 cs.CL cs.AI 版本更新

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Top-Theta注意力:通过补偿阈值稀疏化Transformer

Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli

AI总结 提出Top-Theta注意力,一种无需训练的推理时稀疏化方法,通过静态每头阈值保留每行固定数量的重要元素,结合补偿技术实现高稀疏度下的精度保持,在NLP任务中实现3-10倍V-cache减少和高达10倍注意力元素减少,精度下降不超过1%。

Comments Extended version of a paper accepted at ICANN 2026

详情
AI中文摘要

我们提出Top-Theta(Top-$θ$)注意力,一种无需训练的推理时稀疏化Transformer注意力的方法。我们的关键洞察是,可以校准静态的每头阈值,以在每行注意力中保留所需数量的重要元素。该方法实现了基于内容的稀疏性,无需重新训练,并且在不同数据领域保持鲁棒性。我们进一步引入补偿技术,以在激进稀疏化下保持精度,将注意力阈值化确立为top-k注意力的实用且原则性替代方案。我们在自然语言处理任务上进行了广泛评估,表明Top-Theta在推理时实现了3-10倍的V-cache减少和高达10倍的注意力元素减少,同时精度下降不超过1%。

英文摘要

We present Top-Theta (Top-$θ$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-$θ$ achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading no more than 1% in accuracy.
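
To make the thresholding idea concrete, here is a minimal Python sketch for a single attention head. It rests on assumptions not stated in the abstract: the threshold is calibrated by averaging the k-th largest score over calibration rows, the row maximum is always kept so that no row ends up empty, and the paper's compensation techniques are omitted. The function names are hypothetical.

```python
import math


def calibrate_threshold(rows, k):
    """Average the k-th largest score over the calibration rows of one head."""
    kth_largest = [sorted(row, reverse=True)[k - 1] for row in rows]
    return sum(kth_largest) / len(kth_largest)


def thresholded_softmax(row, theta):
    """Softmax over the scores >= theta; all other entries get weight 0."""
    top = max(row)
    # Always keep the row maximum so the normalizer is never zero.
    kept = [s >= theta or s == top for s in row]
    exps = [math.exp(s - top) if keep else 0.0 for s, keep in zip(row, kept)]
    total = sum(exps)
    return [e / total for e in exps]
```

Unlike top-k selection, the comparison against a fixed `theta` needs no per-row sorting at inference time; the sort appears only in the offline calibration step.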

2507.11178 2026-06-17 cs.LG cs.AI 版本更新

A Gradient-based Causal Discovery Framework with Applications to Complex Industrial Processes

基于梯度的因果发现框架及其在复杂工业过程中的应用

Meiliang Liu, Huiwen Dong, Xiaoxiao Yang, Yunfang Xu, Mingbao Yang, Zijin Li, Zhengye Si, Xinyue Yang, Zhiwen Zhao

AI总结 提出GRNGC方法,通过对模型输入输出梯度施加L1正则化推断Granger因果,仅需一个预测模型,降低计算开销,在多个基准和真实数据集上优于现有方法。

Comments 9 pages, 3 figures, conference

详情
AI中文摘要

随着深度学习技术的发展,各种基于神经网络的Granger因果模型已被提出。尽管这些模型表现出显著改进,但仍存在若干局限性。大多数现有方法采用组件式架构,需要为每个时间序列构建单独的模型,导致大量计算成本。此外,对神经网络第一层权重施加稀疏性惩罚以提取因果关系,削弱了模型捕捉复杂交互的能力。为解决这些局限性,我们提出基于梯度正则化的神经Granger因果(GRNGC),该方法仅需一个时间序列预测模型,并对模型输入与输出之间的梯度施加$L_{1}$正则化以推断Granger因果。此外,GRNGC不依赖于特定的时间序列预测模型,可通过KAN、MLP和LSTM等多种架构实现,提供增强的灵活性。在DREAM、Lorenz-96、fMRI BOLD和CausalTime上的数值模拟表明,GRNGC优于现有基线,并显著降低计算开销。同时,在真实世界的DNA、酵母、HeLa和膀胱尿路上皮癌数据集上的实验进一步验证了该模型在重建基因调控网络方面的有效性。

英文摘要

With the advancement of deep learning technologies, various neural network-based Granger causality models have been proposed. Although these models have demonstrated notable improvements, several limitations remain. Most existing approaches adopt a component-wise architecture, necessitating the construction of a separate model for each time series, which results in substantial computational costs. In addition, imposing a sparsity-inducing penalty on the first-layer weights of the neural network to extract causal relationships weakens the model's ability to capture complex interactions. To address these limitations, we propose Gradient Regularization-based Neural Granger Causality (GRNGC), which requires only one time series prediction model and applies $L_{1}$ regularization to the gradient between the model's input and output to infer Granger causality. Moreover, GRNGC is not tied to a specific time series forecasting model and can be implemented with diverse architectures such as KAN, MLP, and LSTM, offering enhanced flexibility. Numerical simulations on DREAM, Lorenz-96, fMRI BOLD, and CausalTime show that GRNGC outperforms existing baselines and significantly reduces computational overhead. Meanwhile, experiments on real-world DNA, Yeast, HeLa, and bladder urothelial carcinoma datasets further validate the model's effectiveness in reconstructing gene regulatory networks.
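
The $L_{1}$ penalty on the input-output gradient can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: central finite differences stand in for the automatic differentiation a real training loop would use, and the names `jacobian` and `l1_gradient_penalty` are hypothetical.

```python
def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian J[i][j] = d f_i / d x_j."""
    n_out = len(f(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        f_hi, f_lo = f(hi), f(lo)
        for i in range(n_out):
            J[i][j] = (f_hi[i] - f_lo[i]) / (2 * eps)
    return J


def l1_gradient_penalty(f, x):
    """Sum of absolute Jacobian entries, to be added to the forecasting loss."""
    return sum(abs(v) for row in jacobian(f, x) for v in row)
```

Under this reading, an entry `J[i][j]` that the penalty drives to zero suggests that input series j does not help predict output series i, so a single forecasting model yields the whole causal matrix.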

2510.11709 2026-06-17 cs.LG cs.AI cs.CV 版本更新

Adversarial Attacks Leverage Interference Between Features in Superposition

对抗攻击利用特征叠加中的干扰

Edward Stevinson, Lucas Prieto, Melih Barsbey, Tolga Birdal

AI总结 本文揭示神经网络中特征叠加导致的干扰是对抗脆弱性的根源,通过理论推导和实验验证了干扰模式决定攻击成功与迁移性。

Comments Forty-third International Conference on Machine Learning

详情
AI中文摘要

为什么对抗样本存在,并且为什么它们能在模型间迁移?现有的解释诉诸于高维几何、输入中的非鲁棒模式以及决策边界结构,但没有一个提供表示层面的机制来解释为什么特定的扰动会成功以及为什么攻击能在模型间迁移。在本文中,我们表明对抗脆弱性可能源于神经网络中高效的信息编码。具体来说,脆弱性可能源于叠加——网络表示的概念数量超过其维度,迫使非正交表示从而产生干扰。这种干扰导致针对一个表示的扰动会影响其他表示,从而产生由干扰模式决定的脆弱性。在精确控制叠加的合成环境中,我们证实叠加足以产生对抗脆弱性。由此产生的攻击是可预测的:PGD发现的扰动与从干扰几何导出的理论最优扰动一致。在相似数据上训练的模型会发展出相似的干扰模式,这解释了攻击的可迁移性。然后我们表明,对图像分类器的成功攻击表现出我们提出的机制所预测的结构。这些发现揭示了对抗脆弱性可能是网络表示压缩的副产品,补充了基于数据属性或架构因素的现有解释。

英文摘要

Why do adversarial examples exist, and why do they transfer between models? Existing explanations appeal to high-dimensional geometry, non-robust patterns in the input, and decision boundary structure, but none provides a representation-level mechanism that explains why specific perturbations succeed and why attacks transfer between models. In this paper, we show that adversarial vulnerability can stem from efficient information encoding in neural networks. Specifically, vulnerability can arise from superposition - the phenomenon where networks represent more concepts than they have dimensions, forcing non-orthogonal representation and thus interference. This interference causes perturbations targeting one representation to affect others, creating vulnerabilities determined by interference patterns. In synthetic settings with precisely controlled superposition, we establish that superposition suffices to create adversarial vulnerability. The resulting attacks are predictable: PGD-discovered perturbations align with theoretically optimal perturbations derived from the interference geometry. Models trained on similar data develop similar interference patterns, explaining attack transferability. We then show that successful attacks on image classifiers exhibit the structure predicted by our proposed mechanism. These findings reveal that adversarial vulnerability can be a byproduct of networks' representational compression, complementing existing explanations based on data properties or architectural factors.
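
The interference mechanism can be seen in a toy setting. The sketch below is an illustrative assumption, not the paper's experimental setup: three unit-norm features are packed into a two-dimensional hidden space, 120 degrees apart, and each feature is read out with a dot product. The names `FEATURES`, `readout`, and `perturb_along` are hypothetical.

```python
import math

# Three unit-norm feature directions packed into a 2-D hidden space.
FEATURES = [
    (math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3))
    for i in range(3)
]


def readout(hidden):
    """Linear readout of every feature: dot product with its direction."""
    return [fx * hidden[0] + fy * hidden[1] for fx, fy in FEATURES]


def perturb_along(hidden, feature_index, eps):
    """Move the hidden state by eps along one feature's direction."""
    fx, fy = FEATURES[feature_index]
    return (hidden[0] + eps * fx, hidden[1] + eps * fy)
```

Because the directions cannot be orthogonal, a unit perturbation aimed at feature 0 also shifts the readouts of features 1 and 2 by cos(120°) = -0.5 each, which is the kind of cross-feature effect the abstract attributes to superposition.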

2510.21583 2026-06-17 cs.CV cs.AI 版本更新

Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

基于流匹配的原理化强化学习从片段级策略优化中涌现

Yifu Luo, Haoyuan Sun, Xinhao Hu, Penghui Du, Keyu Fan, Bo Li, Sinan Du, Xu Wan, Zhiyu Chen, Bo Xia, Yongzhe Chang, Changqian Yu, Kun Gai, Tiantian Zhang, Xueqian Wang

AI总结 本文提出了一种基于片段级策略优化的流匹配强化学习方法GCPO,通过将连续步骤聚合为相干片段并改变策略优化层级,有效缓解了优势归因不准确的问题,实验表明其在文本到图像生成任务中表现优于现有方法。

Comments ICML 2026

详情
AI中文摘要

近期,使用组相对策略优化(GRPO)对文本到图像(T2I)生成的流匹配模型进行后训练已展现出强大潜力。然而,该方法受到一个关键限制:优势归因不准确。在本文中,我们主张将连续步骤聚合为一个连贯的“片段”(chunk),并将策略优化范式从GRPO的步骤级别转移到片段级别,可以有效减轻这一问题的负面影响。基于这一见解,我们提出了组片段策略优化(GCPO),这是首个用于后训练流匹配的片段级强化学习方法。广泛的实验表明,GCPO在标准T2I基准和偏好对齐方面均取得了优越的性能,相对于GRPO最高相对提升达43%,凸显了片段级策略优化的前景。代码可在https://github.com/xingzhejun/GCPO上获得。

英文摘要

Recent progress in post-training flow matching for text-to-image (T2I) generation with Group Relative Policy Optimization (GRPO) has demonstrated strong potential. However, it is hindered by a critical limitation: inaccurate advantage attribution. In this work, we argue that aggregating consecutive steps into a coherent 'chunk' and shifting the policy optimization paradigm from GRPO's step level to the chunk level can effectively mitigate the negative impact of this issue. Building on this insight, we propose Group Chunking Policy Optimization (GCPO), the first chunk-level reinforcement learning approach for post-training flow matching. Extensive experiments demonstrate that GCPO achieves superior performance on both standard T2I benchmarks and preference alignment, with up to 43% relative gains over GRPO, highlighting the promise of chunk-level policy optimization. The code is available at https://github.com/xingzhejun/GCPO.

2512.04524 2026-06-17 cs.LG cs.AI 版本更新

Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval

基于原型语义一致性对齐的域自适应检索

Tianle Hu, Weijun Lv, Na Han, Xiaozhao Fang, Jie Wen, Jiaxing Li, Guoxu Zhou

发表机构 * School of Computer Science and Technology, Guangdong University of Technology(广东工业大学计算机科学与技术学院) School of Automation, Guangdong University of Technology(广东工业大学自动化学院) School of Computer Science, Guangdong Polytechnic Normal University(广东技术师范大学计算机科学学院) School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区计算机科学与技术学院) School of Artificial Intelligence, Guangzhou University(广州大学人工智能学院)

AI总结 提出原型语义一致性对齐(PSCA)两阶段框架,通过正交原型建立类级语义连接,利用几何邻近性加权伪标签置信度,并在重构特征上量化生成统一哈希码,解决域自适应检索中的类级对齐缺失和量化质量下降问题。

Comments AAAI 2026

详情
AI中文摘要

域自适应检索旨在将知识从有标签的源域迁移到无标签的目标域,实现有效检索的同时缓解域差异。然而,现有方法存在几个根本性局限:1)忽略类级语义对齐,过度追求成对样本对齐;2)缺乏伪标签可靠性考虑或评估标签正确性的几何指导;3)直接量化受域偏移影响的原始特征,损害所学哈希码的质量。鉴于这些局限,我们提出基于原型的语义一致性对齐(PSCA),一种用于有效域自适应检索的两阶段框架。在第一阶段,一组正交原型直接建立类级语义连接,在聚集类内样本的同时最大化类间分离性。在原型学习过程中,几何邻近性通过自适应加权伪标签置信度,为语义一致性对齐提供可靠性指标。所得的隶属度矩阵和原型促进特征重建,确保在重建特征而非原始特征上进行量化,从而改善后续哈希编码质量并无缝连接两个阶段。在第二阶段,特定域的量化函数在相互逼近约束下处理重建特征,生成跨域的统一二进制哈希码。大量实验验证了PSCA在多个数据集上的优越性能。

英文摘要

Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During the prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA's superior performance across multiple datasets.

2602.03846 2026-06-17 cs.LG cs.AI 版本更新

PLATE: Plasticity-Tunable Efficient Adapters for Geometry-Aware Continual Learning

PLATE: 可塑性可调的几何感知持续学习高效适配器

Romain Cosentino

AI总结 提出无需旧任务数据的持续学习方法PLATE,利用预训练网络的几何冗余性,通过结构化低秩更新显式控制可塑性-保留权衡,提升最坏情况保留保证。

详情
AI中文摘要

我们为预训练模型开发了一种持续学习方法,该方法不需要访问旧任务数据,解决了基础模型适应中预训练分布通常不可用的实际障碍。我们的关键观察是,预训练网络表现出大量的几何冗余性,并且这种冗余性可以通过两种互补的方式加以利用。首先,冗余神经元提供了预训练时代主导特征方向的代理,使得可以直接从预训练权重构建近似受保护的更新子空间。其次,冗余性为可塑性的放置位置提供了自然偏差:通过将更新限制在冗余神经元的子集并约束剩余的自由度,我们获得了在旧数据分布上功能漂移减少且最坏情况保留保证改善的更新族。这些见解导致了PLATE(可塑性可调的高效适配器),一种不需要过去任务数据的持续学习方法,它提供了对可塑性-保留权衡的显式控制。PLATE通过结构化低秩更新ΔW = B A Q^T参数化每一层,其中B和Q从预训练权重一次性计算并保持冻结,只有A在新任务上训练。代码可在https://github.com/SalesforceAIResearch/PLATE获取。

英文摘要

We develop a continual learning method for pretrained models that requires no access to old-task data, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial geometric redundancy, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining-era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for where to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old-data distribution and improved worst-case retention guarantees. These insights lead to PLATE (Plasticity-Tunable Efficient Adapters), a continual learning method requiring no past-task data that provides explicit control over the plasticity-retention trade-off. PLATE parameterizes each layer with a structured low-rank update $ΔW = B A Q^\top$, where $B$ and $Q$ are computed once from pretrained weights and kept frozen, and only $A$ is trained on the new task. The code is available at https://github.com/SalesforceAIResearch/PLATE.
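
The structured update $ΔW = B A Q^\top$ can be sketched as a forward pass that never materializes $ΔW$. This is a minimal illustration, not the released code: how $B$ and $Q$ are derived from the pretrained weights is specific to the paper and is taken as given here, the shapes (W: d_out x d_in, B: d_out x r, A: r x s, Q: d_in x s) are an assumption, and the helper names are hypothetical.

```python
def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def plate_forward(W, B, A, Q, x):
    """Compute (W + B A Q^T) x as W x + B (A (Q^T x)).

    W, B and Q stay frozen; A is the only trainable factor.
    """
    base = matvec(W, x)
    z = matvec(transpose(Q), x)  # project the input with the frozen Q
    z = matvec(A, z)             # trainable core
    delta = matvec(B, z)         # map back with the frozen B
    return [b + d for b, d in zip(base, delta)]
```

With `A` set to all zeros the layer reproduces the pretrained output exactly, which is a common way to initialize adapters of this shape.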

2602.06014 2026-06-17 cs.LG cs.AI math.OC math.ST stat.ML stat.TH 版本更新

Optimism Stabilizes Thompson Sampling for Adaptive Inference

乐观主义稳定自适应推断的汤普森采样

Shunxing Yan, Han Zhong

AI总结 本文通过引入乐观机制(如方差膨胀或均值奖励)稳定汤普森采样,使得各臂拉取次数收敛于确定性尺度,从而在K臂随机bandit中实现渐近有效的Wald推断,并解决了多最优臂的扩展问题。

Comments Accepted in part to COLT 2026

详情
AI中文摘要

汤普森采样(TS)广泛用于随机多臂老虎机,但其在自适应数据收集下的推断性质微妙。样本均值的经典渐近理论可能失效,因为臂特定样本量是随机的,并通过动作选择规则与奖励耦合。我们研究了具有高斯随机指数的K臂随机bandit中汤普森采样的自适应推断,其中奖励噪声为独立次高斯,并确定乐观主义是恢复稳定性的关键机制,即每个臂的拉取次数集中在确定性尺度附近。这种稳定性使得尽管自适应采样,仍能获得渐近有效的Wald推断。首先,我们证明方差膨胀的TS对任意K≥2是稳定的,包括多个臂最优的挑战性情况,对最优臂具有渐近均匀分配,对次优臂具有尖锐的对数拉取次数渐近性。这解决了Halder等人提出的K臂扩展问题,使用新的胜者图和Lyapunov漂移技术来控制多个最优臂之间的分配。其次,我们分析了一种替代的乐观修改,保持高斯指数方差不变但向指数中心添加显式均值奖励,并建立了类似的稳定性结论。总之,适当实施的乐观主义稳定了汤普森采样,并在多臂老虎机中实现了渐近有效的Wald推断,同时仅产生轻微额外的遗憾代价。

英文摘要

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study adaptive inference for Thompson sampling with Gaussian randomized indices in $K$-armed stochastic bandits with independent sub-Gaussian reward noises, and identify optimism as a key mechanism for restoring stability, meaning that each arm's pull count concentrates around a deterministic scale. This stability yields asymptotically valid Wald inference despite adaptive sampling. First, we prove that variance-inflated TS is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal, with asymptotically uniform allocation over optimal arms and sharp logarithmic pull-count asymptotics for suboptimal arms. This resolves the $K$-armed extension question raised by Halder et al. (2025), using new winner-map and Lyapunov-drift techniques to control allocation among multiple optimal arms. Second, we analyze an alternative optimistic modification that keeps the Gaussian index variance unchanged but adds an explicit mean bonus to the index center, and establish a similar stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid Wald inference in multi-armed bandits, while incurring only a mild additional regret cost.
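
The two optimistic modifications described above can be sketched with a single index-sampling function. The exact form of the inflation factor and of the mean bonus in the paper is not given in the abstract, so the constant `inflation` multiplier and additive `bonus` below are assumptions, and the function names are hypothetical.

```python
import random


def sample_index(mean, pulls, sigma=1.0, inflation=1.0, bonus=0.0, rng=random):
    """Draw one Gaussian Thompson index for an arm.

    inflation > 1 widens the index variance (variance-inflated TS);
    bonus > 0 shifts the index center upward (mean-bonus variant).
    """
    std = sigma * (inflation / pulls) ** 0.5
    return rng.gauss(mean + bonus, std)


def choose_arm(means, pulls, **kwargs):
    """Pull the arm whose sampled index is largest."""
    indices = [sample_index(m, n, **kwargs) for m, n in zip(means, pulls)]
    return max(range(len(indices)), key=indices.__getitem__)
```

Setting `inflation=1.0` and `bonus=0.0` recovers plain Gaussian Thompson sampling, so the two knobs isolate the optimism that the paper argues restores stable pull counts.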

2602.07429 2026-06-17 cs.LG cs.AI 版本更新

Brep2Shape: Boundary and Shape Representation Alignment via Self-Supervised Transformers

Brep2Shape:通过自监督变换器对齐边界与形状表示

Yuanxu Sun, Yuezhou Ma, Haixu Wu, Guanyang Zeng, Muye Chen, Jianmin Wang, Mingsheng Long

AI总结 提出Brep2Shape自监督预训练方法,利用双Transformer骨干和拓扑注意力对齐B-rep的抽象边界表示与直观形状表示,在多项下游任务中达到最优精度并加速收敛。

详情
AI中文摘要

边界表示(B-rep)是计算机辅助设计(CAD)的行业标准。虽然深度学习在处理B-rep模型方面显示出潜力,但现有方法存在表示差距:连续方法提供分析精度但视觉上抽象,而离散方法提供直观清晰性但牺牲了几何精度。为弥合这一差距,我们引入了Brep2Shape,一种新颖的自监督预训练方法,旨在对齐抽象边界表示与直观形状表示。我们的方法采用几何感知任务,其中模型学习从参数化贝塞尔控制点预测密集空间点,使网络能够更好地理解从抽象系数导出的物理流形。为增强这种对齐,我们提出了一个双Transformer骨干,具有并行流,独立编码表面和曲线令牌以捕获它们不同的几何属性。此外,集成了拓扑注意力以建模表面和曲线之间的相互依赖关系,从而保持拓扑一致性。实验结果表明,Brep2Shape具有显著的可扩展性,在各种下游任务中实现了最先进的精度和更快的收敛速度。代码可在以下仓库获取:https://github.com/thuml/Brep2Shape。

英文摘要

Boundary representation (B-rep) is the industry standard for computer-aided design (CAD). While deep learning shows promise in processing B-rep models, existing methods suffer from a representation gap: continuous approaches offer analytical precision but are visually abstract, whereas discrete methods provide intuitive clarity at the expense of geometric precision. To bridge this gap, we introduce Brep2Shape, a novel self-supervised pre-training method designed to align abstract boundary representations with intuitive shape representations. Our method employs a geometry-aware task where the model learns to predict dense spatial points from parametric Bézier control points, enabling the network to better understand physical manifolds derived from abstract coefficients. To enhance this alignment, we propose a Dual Transformer backbone with parallel streams that independently encode surface and curve tokens to capture their distinct geometric properties. Moreover, topology attention is integrated to model the interdependencies between surfaces and curves, thereby maintaining topological consistency. Experimental results demonstrate that Brep2Shape offers significant scalability, achieving state-of-the-art accuracy and faster convergence across various downstream tasks. Code is available at this repository: https://github.com/thuml/Brep2Shape.
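
The pre-training task maps Bézier control points to dense spatial points. As a hedged illustration of what such target points look like, the sketch below evaluates a Bézier curve with de Casteljau's algorithm. It covers curves only (the paper also handles surfaces), it is not the authors' data pipeline, and the function names are hypothetical.

```python
def de_casteljau(control_points, t):
    """Evaluate a Bézier curve at parameter t in [0, 1] by repeated interpolation."""
    points = [tuple(p) for p in control_points]
    while len(points) > 1:
        points = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]


def sample_curve(control_points, n):
    """Return n >= 2 evenly spaced (in parameter) points along the curve."""
    return [de_casteljau(control_points, i / (n - 1)) for i in range(n)]
```

A few control points thus expand into as many spatial points as needed, which is the gap between abstract coefficients and intuitive shape that the method is trained to bridge.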

2602.11453 2026-06-17 cs.IR cs.AI cs.LG 版本更新

From Noise to Order: Learning to Rank via Denoising Diffusion

从噪声到有序:通过去噪扩散学习排序

Sajad Ebrahimi, Bhaskar Mitra, Negar Arabzadeh, Ye Yuan, Haolun Wu, Fattane Zarrinkalam, Ebrahim Bagheri

发表机构 * University of Guelph(圭尔夫大学) Independent Researcher(独立研究者) University of California, Berkeley(加州大学伯克利分校) McGill University(麦吉尔大学) University of Toronto(多伦多大学)

AI总结 提出基于去噪扩散的生成式排序模型DiffusionRank,通过建模特征向量与相关性标签的联合分布,在四个标准LTR数据集上优于传统判别式方法。

详情
AI中文摘要

在信息检索(IR)中,学习排序(LTR)方法传统上局限于判别式机器学习方法,这些方法基于查询-文档对的特征表示来建模文档与查询相关的概率。在这项工作中,我们提出了一种基于去噪扩散的深度生成式LTR方法,该方法转而建模特征向量和相关性标签的完整联合分布。虽然在判别式设置中,过参数化的排序模型可能通过不同方式拟合训练数据,但我们假设在生成式设置下能够解释完整数据分布的候选解能更好地估计相关性。基于这一动机,我们提出了DiffusionRank,它扩展了TabDiff(一种用于表格数据集的基于去噪扩散的生成模型),以创建经典判别式逐点和成对LTR目标的生成式等价物。我们在四个标准LTR数据集上进行了彻底的实证评估,证明了DiffusionRank模型相对于其判别式对应物的改进。我们的工作为未来研究探索如何利用深度生成建模方法(如扩散)在IR中进行学习排序提供了丰富的空间。

英文摘要

Learning-to-rank (LTR) methods have traditionally been limited to discriminative machine learning approaches that model the probability of the document being relevant to the query given some feature representation of the query-document pair. We propose an alternative denoising diffusion-based generative approach to LTR that instead models the full joint distribution over features and relevance labels. While in discriminative LTR, an over-parameterized ranking model may find different ways to fit the training data, we posit that candidate solutions that can explain the full data distribution under the generative setting may be better at estimating relevance. Thus, we propose DiffusionRank that extends TabDiff, an existing diffusion model for tabular datasets, to create generative alternatives to classical discriminative pointwise and pairwise LTR objectives. Our work demonstrates improvements from DiffusionRank over discriminative counterparts on four standard LTR datasets and points to a rich space for future exploration to leverage ongoing advancements in deep generative models for LTR. Our code is publicly available at https://github.com/sadjadeb/DiffusionRank.

2603.01761 2026-06-17 cs.LG cs.AI 版本更新

Position: Modular Memory is the Key to Continual Learning Agents

Position: 模块化记忆是持续学习智能体的关键

Vaggelis Dorovatas, Malte Schwerin, Andrew D. Bagdanov, Lucas Caccia, Antonio Carta, Laurent Charlin, Barbara Hammer, Tyler L. Hayes, Timm Hess, Christopher Kanan, Dhireesha Kudithipudi, Xialei Liu, Vincenzo Lomonaco, Jorge Mendez-Mendez, Darshan Patil, Ameya Prabhu, Elisa Ricci, Tinne Tuytelaars, Gido M. van de Ven, Liyuan Wang, Joost van de Weijer, Jonghyun Choi, Martin Mundt, Rahaf Aljundi

AI总结 本文提出通过模块化记忆结合权重内学习与上下文学习,解决持续学习中的灾难性遗忘问题,实现大规模持续适应。

Comments ICML 2026 Position Track Spotlight. This work stems from discussions held at the Dagstuhl seminar on Continual Learning in the Era of Foundation Models (October 2025)

详情
AI中文摘要

基础模型通过大规模预训练和增加测试时计算已经改变了机器学习。尽管在多个领域超越了人类表现,这些模型在持续运行、经验积累和个性化方面仍然存在根本性限制,而这些能力是自适应智能的核心。虽然持续学习研究长期以来一直瞄准这些目标,但其历史上专注于权重内学习(IWL),即更新单个模型的参数以吸收新知识,导致灾难性遗忘成为一个持续挑战。我们的立场是,通过设计模块化记忆,结合权重内学习(IWL)和新出现的上下文学习(ICL)的优势,是实现大规模持续适应的缺失环节。我们概述了一个以模块化记忆为中心的架构的概念框架,该架构利用ICL进行快速适应和知识积累,利用IWL对模型能力进行稳定更新,为持续学习智能体绘制了一条实用的路线图。

英文摘要

Foundation models have transformed machine learning through large-scale pretraining and increased test-time compute. Despite surpassing human performance in several domains, these models remain fundamentally limited in continuous operation, experience accumulation, and personalization, capabilities that are central to adaptive intelligence. While continual learning research has long targeted these goals, its historical focus on in-weight learning (IWL), i.e., updating a single model's parameters to absorb new knowledge, has rendered catastrophic forgetting a persistent challenge. Our position is that combining the strengths of In-Weight Learning (IWL) and the newly emerged capabilities of In-Context Learning (ICL) through the design of modular memory is the missing piece for continual adaptation at scale. We outline a conceptual framework for modular memory-centric architectures that leverage ICL for rapid adaptation and knowledge accumulation, and IWL for stable updates to model capabilities, charting a practical roadmap toward continually learning agents.

2603.22372 2026-06-17 cs.LG cs.AI 版本更新

Rethinking Multimodal Fusion for Time Series: Text Modalities Need Constrained Fusion

重新思考时间序列的多模态融合:文本模态需要受约束的融合

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn

AI总结 针对多模态时间序列预测中朴素融合方法效果不佳的问题,提出受约束融合方法及受控融合适配器(CFA),通过低秩适配器过滤无关文本信息,在多种数据集和模型上验证了有效性。

Comments KDD Workshop on Mining and Learning from Time Series 2026

详情
AI中文摘要

多模态学习的最新进展推动了将文本或视觉等辅助模态集成到时间序列(TS)预测中。然而,现有方法大多增益有限,通常仅在特定数据集上提升性能,或依赖限制泛化能力的架构特定设计。在本文中,我们表明采用朴素融合策略(例如简单加法或拼接)的多模态模型通常表现不如单模态TS模型,我们将其归因于辅助模态的未受控集成可能引入无关信息。受此观察启发,我们探索了各种旨在控制这种集成的受约束融合方法,并发现它们始终优于朴素融合方法。此外,我们提出了受控融合适配器(CFA),一种简单的即插即用方法,无需修改TS主干即可实现受控的跨模态交互,仅集成与TS动态对齐的相关文本信息。CFA采用低秩适配器在将文本信息融合到时间表示之前过滤无关文本信息。我们在各种数据集和TS/文本模型上进行了超过20K次实验,证明了受约束融合方法的有效性。代码见:https://github.com/seunghan96/cfa。

英文摘要

Recent advances in multimodal learning have motivated the integration of auxiliary modalities such as text or vision into time series (TS) forecasting. However, most existing methods provide limited gains, often improving performance only in specific datasets or relying on architecture-specific designs that limit generalization. In this paper, we show that multimodal models with naive fusion strategies (e.g., simple addition or concatenation) often underperform unimodal TS models, and we attribute this to the uncontrolled integration of auxiliary modalities, which may introduce irrelevant information. Motivated by this observation, we explore various constrained fusion methods designed to control such integration and find that they consistently outperform naive fusion methods. Furthermore, we propose Controlled Fusion Adapter (CFA), a simple plug-in method that enables controlled cross-modal interactions without modifying the TS backbone, integrating only relevant textual information aligned with TS dynamics. CFA employs low-rank adapters to filter irrelevant textual information before fusing it into temporal representations. We conduct over 20K experiments across various datasets and TS/text models, demonstrating the effectiveness of the constrained fusion methods. Code is available at: https://github.com/seunghan96/cfa.

2604.18701 2026-06-17 cs.LG cs.AI stat.ML 版本更新

Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

Curiosity-Critic:累积预测误差改进作为世界模型训练的可处理内在奖励

Vin Bhaskara, Haicheng Wang

AI总结 提出Curiosity-Critic方法,通过可处理的每步替代项(当前预测误差与渐近误差基线的差值)作为内在奖励,利用共训练的评论家在线估计误差基线,有效分离可约与不可约预测误差,在随机网格世界实验中优于现有方法。

Comments Accepted to ICML 2026 Workshop on Epistemic Intelligence in Machine Learning (EIML@ICML 2026). Code: https://github.com/vinbhaskara/Curiosity-Critic

详情
AI中文摘要

基于局部预测误差的好奇心奖励仅关注当前转移,而不考虑世界模型在所有已访问转移上的累积预测误差。我们引入了Curiosity-Critic,其内在奖励基于这一累积目标的改进,并证明它有一个可处理的每步替代项:当前预测误差与当前状态转移的渐近误差基线之间的差值。我们通过一个与世界模型共同训练的评论家在线估计这一误差基线;由于评论家只需学习一个转移的预测难度,其对不可约噪声基线的估计在世界模型饱和之前就已收敛,从而将探索引导向可学习的转移。该奖励对可学习转移较高,而对随机转移趋近于零,从而在线分离认知(可约)和偶然(不可约)预测误差。从Schmidhuber(1991)到学习特征空间变体的先前预测误差好奇心公式,都作为该误差基线的特定近似特例出现。在随机网格世界上的实验表明,Curiosity-Critic在训练速度和最终世界模型准确性上优于基于预测误差、访问计数和随机网络蒸馏的方法。

英文摘要

Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; since the critic only has to learn how hard a transition is to predict, its estimate of the irreducible noise floor converges well before the world model saturates, redirecting exploration toward learnable transitions. The reward is higher for learnable transitions and collapses toward zero for stochastic ones, thereby separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.
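
The per-step surrogate described above can be illustrated with a few lines of arithmetic. This is only a sketch: the dictionary standing in for the learned critic and all numeric values are made up for illustration, and the paper's critic is a trained model rather than a lookup table.

```python
def intrinsic_reward(prediction_error, baseline_error):
    """Per-step surrogate: current error minus the estimated asymptotic error."""
    return prediction_error - baseline_error


# Stand-in for the learned critic: estimated irreducible error per
# transition type. The values are hypothetical.
critic_baseline = {"learnable": 0.05, "stochastic": 0.50}

# Early in training the world model is wrong on both transition types.
current_error = {"learnable": 0.80, "stochastic": 0.50}

rewards = {
    name: intrinsic_reward(current_error[name], critic_baseline[name])
    for name in current_error
}
```

The learnable transition keeps a large reward because most of its error is reducible, while the stochastic transition gets a reward of zero because its error already sits at the estimated noise floor.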

2604.24357 2026-06-17 cs.LG cs.AI 版本更新

DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models

DPRM: 一种用于扩散语言模型的即插即用Doob h变换诱导的令牌排序模块

Dake Bu, Wei Huang, Andi Han, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Atsushi Nitanda

AI总结 提出DPRM模块,通过在线估计从置信度驱动排序逐步过渡到过程奖励引导排序,改进扩散语言模型的令牌排序策略,在九种任务中提升性能。

详情
AI中文摘要

扩散语言模型生成时没有固定的从左到右顺序,令牌排序是一个核心算法选择。现有系统主要使用随机掩码或置信度驱动排序,分别存在训练-测试不匹配和短视探索的问题。我们引入DPRM(Doob h变换过程奖励模型),一个即插即用的令牌排序模块,保持宿主架构、去噪目标和监督不变,仅修改排序策略。DPRM从置信度驱动排序开始,通过在线估计逐渐过渡到过程奖励引导排序。我们将精确的DPRM策略描述为奖励倾斜的Gibbs揭示律,证明其阶段式Soft-BoN近似的收敛性,表明在线分桶控制器以经验Bernstein速率跟踪精确的DPRM分数,并在可处理的优化假设下建立样本复杂度优势。在涵盖语言推理、测试时扩展、蛋白质、单细胞、分子、DNA、文本到图像生成和VQA的九个宿主中,DPRM排序变体改进了多个语言、DNA和多模态设置,同时也识别了仅置信度排序或任务特定效用更优的边界情况。代码见:https://github.com/DakeBU/DPRM-DLLM

英文摘要

Diffusion language models generate without a fixed left-to-right order, leaving token ordering as a central algorithmic choice. Existing systems mainly use random masking or confidence-driven ordering, which respectively suffer from train-test mismatch and myopic exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module that keeps the host architecture, denoising objective and supervision unchanged, and modifies only the ordering policy. DPRM starts from confidence-driven ordering and gradually shifts to process-reward-guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove convergence of its stagewise Soft-BoN approximation, show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates, and establish a sample-complexity advantage under tractable optimization assumptions. Across nine hosts covering language reasoning, test-time scaling, protein, single-cell, molecular, DNA, text-to-image generation, and VQA, DPRM order variants improve several language, DNA, and multimodal settings while also identifying boundary cases where confidence-only ordering or task-specific utilities are preferable. Code is available at: https://github.com/DakeBU/DPRM-DLLM
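
To make the shift from confidence-driven to reward-guided ordering concrete, here is a toy sketch. The linear blend, the `alpha` schedule parameter, and all function names are assumptions chosen for illustration; the abstract does not specify the controller at this level of detail, and the actual method (reward-tilted Gibbs law, Soft-BoN) is more involved.

```python
def blended_scores(confidence, reward_estimate, alpha):
    """Mix confidence and estimated process reward; alpha lies in [0, 1]."""
    return [(1.0 - alpha) * c + alpha * r for c, r in zip(confidence, reward_estimate)]


def next_position(confidence, reward_estimate, masked, alpha):
    """Pick the still-masked position with the highest blended score."""
    scores = blended_scores(confidence, reward_estimate, alpha)
    return max(masked, key=lambda i: scores[i])


confidence = [0.9, 0.2, 0.5]       # model confidence per masked position
reward_estimate = [0.1, 0.8, 0.4]  # hypothetical online reward estimates
masked = [0, 1, 2]

early = next_position(confidence, reward_estimate, masked, alpha=0.0)
late = next_position(confidence, reward_estimate, masked, alpha=1.0)
```

With `alpha=0.0` the most confident position (index 0) is revealed first; with `alpha=1.0` the position with the highest estimated reward (index 1) wins instead.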

2605.20708 2026-06-17 cs.CV cs.AI 版本更新

Rethinking Cross-Layer Information Routing in Diffusion Transformers

重新思考扩散变换器中的跨层信息路由

Chao Xu, Maohua Li, Qirui Li, Yixuan Xu, Yanke Zhou, Yunhe Li, Cuifeng Shen, Hanlin Tang, Kan Liu, Tao Lan, Lin Qu, Shao-Qun Zhang

发表机构 * Nanjing University(南京大学) Alibaba Group(阿里巴巴集团) Zhejiang University(浙江大学) City University of Hong Kong(香港城市大学)

AI总结 本文研究了扩散变换器中跨层信息流动的问题,通过系统性的实证分析,识别了传统残差加法的三个具体症状,并提出了扩散适应性路由(DAR)方法,以实现可学习、时间步适应和非递增的子层输出聚合,从而提升模型性能。

详情
AI中文摘要

扩散变换器(DiTs)已成为现代视觉生成的事实性骨干,其设计的几乎所有主要轴线——分词、注意力、条件、目标和潜在自编码器——都已被广泛重新审视。然而,决定信息如何在层之间积累的残差流却直接继承自原始Transformer。在本文中,我们对DiTs中的跨层信息流进行了系统性的实证分析,同时考虑深度和去噪时间步,并识别出传统残差加法的三个具体症状,即单调的前向幅度膨胀、急剧的反向梯度衰减和显著的块状冗余。受此诊断的启发,我们提出了扩散适应性路由(DAR),一种可直接替换残差的机制,能够对子层输出的历史进行可学习、时间步适应和非递增的聚合。此外,所提出的DAR与许多现代Transformer增强方法,如REPA,具有兼容性。在ImageNet 256×256上,DAR将SiT-XL/2的FID值提升了2.11(7.56 vs. 9.67),并且在8.75倍更少的训练迭代中达到了基线的收敛质量。在REPA之上堆叠时,它在早期阶段实现了2倍的训练加速,表明跨层信息路由是扩散建模中一个未被充分探索的设计轴,该轴与现有表示对齐目标相互独立。除了预训练外,DAR还可以在大规模T2I模型的微调阶段应用,并在分布匹配蒸馏中保留高频细节。

英文摘要

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
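
One way to picture a learnable, timestep-adaptive aggregation over earlier sublayer outputs is a softmax-weighted sum whose logits depend on the denoising timestep. The sketch below assumes exactly that parameterization (per-source base logits plus a per-source slope in `t`); it is a guess for illustration, not the DAR formulation from the paper.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(history, base_logits, time_slopes, t):
    """Weighted sum over all earlier sublayer outputs.

    history     : list of vectors, one per earlier sublayer
    base_logits : learnable per-source logits (assumed parameterization)
    time_slopes : learnable per-source sensitivity to the timestep t
    """
    weights = softmax([b + s * t for b, s in zip(base_logits, time_slopes)])
    dim = len(history[0])
    return [sum(w * h[j] for w, h in zip(weights, history)) for j in range(dim)]


history = [[1.0, 1.0], [3.0, 3.0]]
uniform = route(history, base_logits=[0.0, 0.0], time_slopes=[0.0, 0.0], t=0.7)
late_heavy = route(history, base_logits=[0.0, 0.0], time_slopes=[0.0, 4.0], t=1.0)
```

With equal logits the result is the plain average of the two stored outputs; a positive slope on the second source pulls the aggregate toward that source as `t` grows, which is the kind of timestep dependence a fixed residual sum cannot express.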

2605.29526 2026-06-17 cs.CR cs.AI cs.LG 版本更新

Temporal Motif-aware Graph Test-time Adaptation for OOD Blockchain Anomaly Detection

面向OOD区块链异常检测的时间模体感知图测试时自适应

Runang He, Tongya Zheng, Huiling Peng, Yuanyu Wan, Bingde Hu, Jiawei Chen, Canghong Jin, Mingli Song, Can Wang

发表机构 * State Key Laboratory of Blockchain and Data Security(区块链与数据安全国家重点实验室) Zhejiang Provincial Engineering Research Center for Real-Time SmartTech in Urban Security Governance(浙江省实时智能科技在城市安全治理中的工程研究中心) Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security(杭州高新技术区(滨江)区块链与数据安全研究院)

AI总结 提出TEMG-TTA框架,通过时间模体分布捕获和测试时自适应策略,解决区块链异常检测中的模式演化和分布外问题,在5个数据集上平均提升54.88%。

Comments Accepted to IJCAI-ECAI 2026, Special Track on AI for Social Good

详情
AI中文摘要

不断演变的交易模式严重阻碍了新兴加密货币区块链上的异常检测,原因在于地址数量庞大且异常行为多样。近期应用于区块链的高级图异常检测(GAD)方法面临两个关键挑战:恶意行为者的对抗性模式演化以及区块链上不同交易语义导致的分布外(OOD)问题。为应对这些挑战,我们提出了一种新颖框架,称为时间模体感知图测试时自适应(TEMG-TTA)。首先,我们通过高效的计算机制全面捕捉每个活跃地址的三节点时间模体分布,从而实现下游时间模体感知图学习。其次,我们设计了一种简单而有效的测试时自适应策略,以促进训练图和测试图之间共享常见模式。在5个真实世界数据集上的大量实验表明,我们提出的TEMG-TTA平均优于最先进的GAD方法54.88%。进一步关于可解释模体模式的案例研究表明,TEMG-TTA明确刻画了异常地址的复杂交易模式,从而验证了我们技术设计的有效性。我们的代码将公开在 https://github.com/LuoXishuang0712/TEMG-TTA/。

英文摘要

Ever-evolving transaction patterns have significantly hindered anomaly detection on emerging cryptocurrency blockchains due to the vast number of addresses and diverse anomalous behaviors. Recently, advanced Graph Anomaly Detection (GAD) approaches applied to blockchains have faced two critical challenges: adversarial pattern evolution by malicious actors and the out-of-distribution (OOD) problem caused by varied transaction semantics on blockchains. To address these challenges, we propose a novel framework termed TEmporal Motif-aware Graph Test-Time Adaptation (TEMG-TTA). First, we comprehensively capture the 3-node temporal motif distribution of each active address using an efficient computational mechanism, enabling downstream temporal motif-aware graph learning. Second, we design a simple yet effective test-time adaptation strategy to facilitate the sharing of common patterns between training and testing graphs. Extensive experiments on 5 real-world datasets demonstrate that our proposed TEMG-TTA outperforms state-of-the-art GAD approaches by an average of 54.88%. A further case study on interpretable motif patterns reveals that TEMG-TTA explicitly characterizes the complex transaction patterns of anomalous addresses, thereby verifying the effectiveness of our technical designs. Our code is publicly available at https://github.com/LuoXishuang0712/TEMG-TTA/.
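
As a rough illustration of what a per-address 3-node temporal motif count looks like, the sketch below counts one motif type only: a time-ordered two-edge path u -> v -> w whose second edge follows the first within a window `delta`. The brute-force double loop, the single motif type, and the toy edge list are simplifications for illustration; the paper describes a more efficient mechanism over the full motif distribution.

```python
from collections import Counter


def count_temporal_paths(edges, delta):
    """Count, per address, participation in u -> v -> w paths within delta.

    edges : list of (sender, receiver, timestamp) tuples
    """
    counts = Counter()
    for u, v, t1 in edges:
        for x, w, t2 in edges:
            # Second edge must start where the first ended, come later,
            # stay inside the time window, and involve three distinct nodes.
            if x == v and t1 < t2 <= t1 + delta and len({u, v, w}) == 3:
                counts.update((u, v, w))
    return counts


edges = [("a", "b", 1), ("b", "c", 2), ("c", "a", 10), ("b", "d", 3)]
motif_counts = count_temporal_paths(edges, delta=5)
```

Here addresses `a` and `b` each take part in two such paths, while `c` and `d` take part in one; the edge at time 10 is too late to extend the path ending at `c`.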

2606.05861 2026-06-17 cs.MM cs.AI 版本更新

LLMCodec: Adapting Video Codecs for Efficient Weight Compression of Large Language Models

LLMCodec:适配视频编解码器用于大型语言模型的高效权重压缩

Rui Wang, Yan Zhao, Li Song, Zhengxue Cheng

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出LLMCodec方法,利用视频编解码器(如VVC/H.266)结合仿射量化压缩LLM权重,无需微调或校准数据,在2-bit精度下显著降低困惑度并提升下游任务准确率。

Comments The authors need to make further revisions before resubmission

详情
AI中文摘要

大型语言模型(LLMs)的快速发展在自然语言处理领域取得了显著进展。然而,这些模型规模的不断扩大在存储、传输和部署方面带来了巨大挑战。尽管在模型压缩和量化方面付出了巨大努力,但现有方法通常依赖于微调或校准数据,且在不同张量类型上泛化能力有限。本文中,我们认为视频编解码器为LLM压缩提供了一种有前景的解决方案,因为它们与矩阵结构数据具有内在兼容性、可配置的压缩策略,并且有高度优化、现成的实现可用。因此,我们提出了LLMCodec,一种基于视频编解码器的LLM压缩方法,它将仿射量化与最新的VVC/H.266视频编解码器相结合。除了VVC,我们还比较了一系列视频编解码器和编码配置文件,以评估它们对压缩性能的影响。在不同模型上的实验证明了LLMCodec的鲁棒性和通用性。值得注意的是,在LLaMA-3-8B模型上,以2-bit精度,与现有方法相比,LLMCodec将困惑度降低了1.5倍以上,并将下游任务准确率提高了21%。

英文摘要

The rapid development of large language models (LLMs) has led to remarkable advances in natural language processing. However, the increasing scale of these models introduces substantial challenges in terms of storage, transmission, and deployment. Though great efforts have been devoted to model compression and quantization, existing methods often rely on fine-tuning or calibration data, which exhibit limited generalization across different tensor types. In this paper, we argue that video codecs offer a promising solution for LLM compression, due to their inherent compatibility with matrix-structured data, configurable compression strategies, and the availability of highly optimized, off-the-shelf implementations. Therefore, we present LLMCodec, a video codec-based LLM compression method that integrates affine quantization with the recent VVC/H.266 video codec. Beyond VVC, we further compare a range of video codecs and encoding profiles to evaluate their impact on compression performance. Experiments on different models demonstrate the robustness and generality of LLMCodec. Notably, on LLaMA-3-8B at 2-bit precision, LLMCodec reduces perplexity by over 1.5x and improves downstream task accuracy by 21% compared with the existing method.
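
Affine quantization, the step that would turn floating-point weights into integer sample values a video encoder can consume, can be sketched as follows. The per-tensor granularity, the 8-bit default, and the function names are assumptions for illustration; the abstract does not state how LLMCodec chooses scales or bit depths.

```python
def affine_quantize(weights, bits=8):
    """Map floats to integers in [0, 2**bits - 1] with a scale and offset."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0  # guard for constant tensors
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def affine_dequantize(codes, scale, offset):
    """Invert the mapping: w is approximately scale * code + offset."""
    return [scale * q + offset for q in codes]


weights = [-0.31, 0.0, 0.12, 0.5]
codes, scale, offset = affine_quantize(weights)
restored = affine_dequantize(codes, scale, offset)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The smallest weight maps to code 0, the largest to 255, and the round-trip error is bounded by half a quantization step (`scale / 2`). In a codec-based pipeline the integer codes, reshaped into a 2-D frame, would be what the video encoder compresses, with `scale` and `offset` stored as side information.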

2606.11766 2026-06-17 eess.AS cs.AI cs.CL cs.SD 版本更新

Fast Speech Foundation Model Distillation Using Interleaved Stacking

使用交错堆叠的快速语音基础模型蒸馏

Eungbeom Kim, Kyogu Lee

发表机构 * IPAI AIIS Dept. of Intelligence and Information(智能与信息系)

AI总结 提出交错堆叠方法加速语音基础模型蒸馏训练,通过保持层位置一致性解决性能下降问题,在SUPERB上验证有效性。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

将大型语音基础模型(SFM)蒸馏为高效的学生模型已成功应用于低资源环境。尽管蒸馏减少了推理延迟,但它需要额外的学生模型训练。然而,SFM蒸馏的训练效率仍未得到充分探索。在这项工作中,我们探索了SFM蒸馏的训练加速以加快模型部署。我们研究了堆叠的潜力,其中模型深度通过训练逐步增加,直到达到目标模型深度。虽然现有的堆叠方法提高了训练速度,但它们遭受性能下降。为了解决这一限制,我们提出了交错堆叠,一种新颖的堆叠方法,在整个堆叠过程中始终保持层位置。这一特性在SFM中尤为关键,因为每一层编码了不同的层特定知识。我们在SUPERB上验证了所提方法的有效性。

英文摘要

Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires additional training of the student model, and the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased through training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To handle this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.

2606.14990 2026-06-17 cs.LG cs.AI 版本更新

Rational Sparse Autoencoder

有理稀疏自编码器

Naiyu Yin, Yue Yu

发表机构 * Lehigh University(里海大学)

AI总结 提出有理稀疏自编码器(RSAE),用可训练有理函数替代固定编码器激活,通过两阶段流程(初始化+微调)在多种语言模型和基线激活族上提升重构与下游行为指标,不牺牲特征可解释性。

Comments Accepted to the Mechanistic Interpretability Workshop at ICML 2026

详情
AI中文摘要

稀疏自编码器(SAE)是机械可解释性的标准工具,但当前的SAE系列受限于固定的编码器非线性,如ReLU、JumpReLU和TopK。这会将特定的稀疏机制硬编码到模型中,并可能扭曲重构与稀疏性的权衡。我们引入了有理稀疏自编码器(RSAE),它将固定的编码器激活替换为可训练的有理函数。有理激活足够灵活,可以在紧致域上一致逼近现有SAE系列使用的激活原语(对于TopK,提供分离top-k阈值后获得的阈值门),同时提供更丰富的函数类以适应观察到的预激活几何形状。我们通过两阶段流程实现这一想法:初始化过程复制预训练的基线SAE权重,插入通过在合成数据上使用松弛Remez交换获得的有理系数,并随有理系数一起校准尺度参数;然后在标准稀疏正则化重构目标下进行微调步骤。实验上,在三个开源权重语言模型的残差流激活上,以及所有三个基线激活族中,RSAE在微调步骤后严格改进,无论是在重构侧指标还是在下游行为指标上,且不牺牲稀疏探测下的特征级可解释性。这些增益在宿主语言模型、基线激活族以及我们测试的完整基线稀疏范围内一致,而升级本身每个自编码器仅增加少量标量参数,并在单个消费级GPU上运行几分钟。

英文摘要

Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. We introduce the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder activation with a trainable rational function. Rational activations are flexible enough to uniformly approximate the activation primitives used by existing SAE families on compact domains (for TopK, the thresholded gate obtained after a separating top-k threshold is supplied), while also providing a richer function class for adapting to the observed pre-activation geometry. We realise this idea through a two-stage pipeline: an initialisation procedure that copies the pre-trained baseline SAE weights, plugs in rational coefficients obtained by the relaxed Remez exchange on synthetic data, and calibrates the scale parameters along with the rational coefficients; followed by a fine-tuning step under the standard sparsity-regularised reconstruction objective. Empirically, on residual-stream activations of three open-weight language models and across all three baseline activation families, the RSAE strictly improves on the baseline SAE after the fine-tuning step, both on reconstruction-side metrics and on downstream-behaviour metrics, without sacrificing feature-level interpretability under sparse probing. These gains are consistent across host language models, across baseline activation families, and across the full range of baseline sparsity we tested, while the upgrade itself adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU.
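
A trainable rational activation is a ratio of two polynomials whose coefficients are learned. The sketch below uses the "safe" denominator form 1 + |b_1 x + b_2 x^2 + ...| so the function has no poles; that specific form, the coefficient layout, and the function name are assumptions for illustration, since the abstract does not say which parameterization RSAE uses.

```python
def rational_activation(x, numerator, denominator):
    """Evaluate P(x) / (1 + |b_1 x + b_2 x^2 + ...|).

    numerator   : coefficients a_0, a_1, ... of P
    denominator : coefficients b_1, b_2, ... (no constant term)
    """
    p = sum(a * x ** i for i, a in enumerate(numerator))
    q = 1.0 + abs(sum(b * x ** (j + 1) for j, b in enumerate(denominator)))
    return p / q


# P(x) = x with an empty denominator reduces to the identity.
identity = rational_activation(2.0, numerator=[0.0, 1.0], denominator=[])
# P(x) = x with denominator 1 + |x| gives the bounded map x / (1 + |x|).
squashed = rational_activation(1.0, numerator=[0.0, 1.0], denominator=[1.0])
```

Because only the short coefficient lists are new parameters, swapping such a function in for a fixed nonlinearity adds a handful of scalars per autoencoder, which matches the small parameter overhead the abstract mentions.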

2606.16590 2026-06-17 cs.LG cs.AI q-bio.NC 版本更新

Infant Spontaneous Movement Noise Improves Exploration in Deep RL

婴儿自发运动噪声改善深度强化学习中的探索

Francisco M. López, Markus R. Ernst, Francisco Cruz, Matej Hoffmann, Jochen Triesch

发表机构 * Frankfurt Institute for Advanced Studies(法兰克福高等研究所) School of Computer Science and Engineering, University of New South Wales(新南威尔士大学计算机科学与工程学院) Escuela de Ingeniería, Universidad Central de Chile(智利中央大学工程学院) Faculty of Electrical Engineering, Czech Technical University(捷克理工大学电气工程学院)

AI总结 受婴儿自发运动噪声启发,提出一种在RL训练中逐步增加时间自相关的探索噪声机制,实验表明其能产生结构化探索行为并提高学习效率。

Comments 6 pages, 4 figures, 1 table. Accepted at IEEE ICDL 2026. Cite as: F. M. López, M. R. Ernst, F. Cruz, M. Hoffmann, and J. Triesch, "Infant Spontaneous Movement Noise Improves Exploration in Deep RL", in 2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1-6

详情
AI中文摘要

深度强化学习(RL)中的探索通常实现为时间上不相关的白噪声。然而,最近的研究表明,时间相关的有色噪声可以通过产生更平滑的轨迹和更好的状态空间覆盖来提高探索效率。我们探究受婴儿自发运动启发的动作噪声是否也能改善深度RL中的探索。我们发现婴儿末端执行器速度的功率谱密度遵循有色噪声过程,其谱指数随年龄增长而增加。受这一发育模式的启发,我们引入了一种机制,在RL训练过程中逐步增加探索噪声的时间自相关,与婴儿统计数据相匹配。在多个RL环境中的实验表明,婴儿启发的噪声产生结构化的探索行为,并且与传统的探索策略相比可以提高学习效率。这些发现表明,人类运动和认知发展可以为人工智能体的学习机制设计提供有用的指导。我们的代码可在 https://github.com/trieschlab/baby-noise-rl 获取。

英文摘要

Exploration in deep reinforcement learning (RL) is commonly implemented as temporally uncorrelated white noise. However, recent works show that temporally correlated colored noise can improve exploration efficiency by producing smooth trajectories with better coverage of the state space. We inquire whether action noise inspired by infant spontaneous movements can also improve exploration in deep RL. We find that the power spectral densities of babies' end-effector velocities follow a colored noise process where the spectral exponent increases with age. Inspired by this developmental pattern, we introduce a mechanism that progressively increases the temporal auto-correlation of exploration noise during RL training, matching the infant statistics. Experiments across several RL environments show that infant-inspired noise produces structured exploratory behavior and can improve learning efficiency compared to conventional exploration strategies. These findings suggest that human motor and cognitive development can provide useful guidance for designing learning mechanisms in artificial agents. Our code is available at https://github.com/trieschlab/baby-noise-rl.
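
The idea of raising the temporal autocorrelation of exploration noise over training can be sketched with a first-order autoregressive, AR(1), process. Note the substitution: AR(1) noise is used here because it is simple to write with the standard library, whereas the paper works with power-law colored noise matched to infant statistics. The linear schedule and its endpoints are likewise hypothetical.

```python
import math
import random


def ar1_noise(steps, rho, seed=0):
    """AR(1) noise with stationary unit variance; rho is the lag-1 autocorrelation."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(steps):
        x = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def rho_schedule(progress, start=0.0, end=0.9):
    """Linearly raise the autocorrelation as training progresses from 0 to 1."""
    return start + (end - start) * progress


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


early = ar1_noise(5000, rho_schedule(0.0))  # white noise at the start
late = ar1_noise(5000, rho_schedule(1.0))   # strongly correlated at the end
```

Early in training the samples are uncorrelated from step to step; late in training consecutive samples are strongly correlated, so actions drift smoothly instead of jittering.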

6. 自然语言与多模态智能 38 篇

2606.17289 2026-06-17 cs.AI cs.CL 新提交

Nothing from Something: Can a Language Model Discover 0?

无中生有:语言模型能否发现0?

Phoebe Zeng, Thomas L. Griffiths, Brenden M. Lake

发表机构 * Department of Computer Science, Princeton University(普林斯顿大学计算机科学系)

AI总结 研究语言模型能否独立发现“零”的概念,通过算术任务测试,发现GPT-2规模模型无法在测试时泛化,但少量示例训练后显著提升,且语言预训练减少所需示例约50%。

详情
AI中文摘要

基于人工神经网络的AI系统正被开发,旨在推动人类数学知识的边界。这些系统的关键问题在于它们能在多大程度上超越训练数据。数学发现需要一种强形式的分布外泛化能力——假设真正新的、且可能逻辑上更强大的数学结构的能力。已有假设认为,语言能力在人类认知中支持这种泛化。在这项工作中,我们使用简单算术作为案例研究,考察现代AI模型如何扩展其数学视野,评估这些模型能否独立发现“零”的概念。我们表明:(1) GPT-2规模的语言模型在测试时无法进行这种泛化,无论是否经过语言预训练;(2) 但在经过数十或数百个零的示例训练后,模型能显著改进。此外,我们发现语言预训练将所需示例数量减少了约50%,表明语言能力可以支撑神经模型中的数学发现。

英文摘要

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new - and potentially logically more powerful - mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) language models of a GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately 50%, showing that language abilities can scaffold mathematical discovery in neural models.

2606.17637 2026-06-17 cs.AI 新提交

Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification

Brick-DICL:用于自动化Brick模式分类的动态上下文学习

Yiyue Qian, Shinan Zhang, Huan Song, Negin Sokhandan, Hannah Marlowe, Diego Socolinsky

发表机构 * Amazon AWS Generative AI Innovation Center(亚马逊AWS生成式AI创新中心)

AI总结 提出Brick-DICL两阶段动态上下文学习框架,通过元数据检索和类别检索增强大语言模型领域知识,结合多模型过滤机制,实现楼宇管理系统点位的自动化Brick分类,显著提升准确率并减少人工验证。

详情
AI中文摘要

楼宇管理系统(BMS)对于优化现代建筑的能效和运营性能至关重要。然而,不同制造商的BMS点缺乏标准化,给集成和数据利用带来了重大障碍。尽管Brick模式为楼宇系统提供了标准化的本体,但将BMS点映射到合适的Brick类面临三个关键挑战:(i)Brick类数量庞大(最新版本有936个),(ii)大语言模型(LLM)的领域知识有限,(iii)验证需要大量人工。为解决这些挑战,我们提出了Brick-DICL,一种用于自动化Brick模式分类的两阶段动态上下文学习框架。Brick-DICL包含两个主要组件:metadata-RAG,检索相关示例以增强LLM的领域知识;以及class-RAG,缩小潜在Brick类范围以应对大的分类空间。此外,我们实现了一种多LLM过滤机制,比较多个模型的预测,标记低置信度分类以供人工审查。结果:(i)通用性:Brick-DICL适用于任何楼宇管理系统,无论制造商或元数据格式如何;(ii)新颖且强大:作为首个用于Brick模式分类的动态上下文学习方法,Brick-DICL在建筑数据集上取得了显著的分类准确率提升,优于现有方法;(iii)高效:我们的多LLM过滤策略减少了人工验证工作,实现了快速数字化建筑接入。大量实验证明了Brick-DICL在不同建筑数据集上的有效性,加速了向标准化、可互操作的楼宇管理系统的进程。

英文摘要

Building Management Systems (BMS) are essential for optimizing energy efficiency and operational performance in modern buildings. However, the lack of standardization across BMS points from different manufacturers creates significant barriers to integration and data utilization. While the Brick schema offers a standardized ontology for building systems, mapping BMS points to appropriate Brick classes presents three critical challenges: (i) the extensive number of Brick classes (936 in the latest version), (ii) limited domain-specific knowledge in large language models (LLMs), and (iii) substantial manual effort required for verification. To address these challenges, we propose Brick-DICL, a two-stage dynamic in-context learning framework for automated Brick schema classification. Brick-DICL consists of two primary components: metadata-RAG, which retrieves relevant examples to enhance LLMs' domain knowledge, and class-RAG, which narrows down potential Brick classes to address the large classification space. Additionally, we implement a multi-LLM filtering mechanism that compares predictions across multiple models, flagging low-confidence classifications for human review. As a result: (i) General: Brick-DICL is applicable to any building management system regardless of manufacturer or metadata format; (ii) Novel and Powerful: as the first dynamic in-context learning approach for Brick schema classification, Brick-DICL achieves significant classification accuracy improvements on building datasets, outperforming existing methods; (iii) Efficient: our multi-LLM filtering strategy reduces manual verification effort, enabling rapid digital building onboarding. Extensive experiments demonstrate Brick-DICL's effectiveness across diverse building datasets, accelerating the path toward standardized, interoperable building management systems.
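
The multi-LLM filtering step can be pictured as a vote over the classes predicted by several models, with disagreement routed to a human. The sketch below assumes a simple agreement ratio with a unanimity default; the function name, the threshold, and the example labels are illustrative and not taken from the paper.

```python
from collections import Counter


def filter_prediction(predictions, min_agreement=1.0):
    """Return (label, needs_review) from several models' predicted classes."""
    label, votes = Counter(predictions).most_common(1)[0]
    agreement = votes / len(predictions)
    return label, agreement < min_agreement


agreed = filter_prediction(["Zone_Air_Temperature_Sensor"] * 3)
split = filter_prediction([
    "Zone_Air_Temperature_Sensor",
    "Supply_Air_Temperature_Sensor",
    "Zone_Air_Temperature_Sensor",
])
```

When all three models agree the point is accepted automatically; when one model dissents the majority label is kept as a suggestion but the point is flagged for review. Lowering `min_agreement` would trade review effort against the risk of accepting a wrong class.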

2606.17642 2026-06-17 cs.AI 新提交

FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness

FinAcumen: 通过自演化经验记忆实现的金融多模态推理

Pianran Guo, Pengcheng Zhou, Yucheng Jian, Shuhua Chen

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Queen Mary University of London(伦敦玛丽女王大学)

AI总结 提出FinAcumen框架,通过选择性经验记忆机制增强工具增强型多模态推理,在四个金融基准上持续提升冻结的8B视觉语言模型性能。

详情
AI中文摘要

金融多模态推理要求智能体协调跨异构证据源的数值计算、检索、视觉解释和时间定位。现有的工具增强型智能体提高了执行保真度,但在跨回合中仍然大多无状态,反复发现推理策略和失败模式。在高风险金融环境中,这导致不可靠的工具路由、噪声检索和易产生幻觉的推理。我们提出FinAcumen,一个以选择性经验记忆为中心的金融推理智能体框架,用于工具增强的多模态推理。FinAcumen从先前的轨迹中积累基于金融的推理经验,将成功策略和失败衍生的警示规则提炼到持久记忆库中。在推理过程中,只有当语义相关性超过校准阈值时,检索到的经验才会调节推理,而通过回退机制明确抑制不相关的记忆。一个确定性的金融工具环境进一步将数值计算、检索、视觉解码和答案验证置于基础。在四个金融多模态推理基准上,FinAcumen持续改进冻结的8B视觉语言模型,优于金融专用模型,并接近领先的通用专有模型。进一步分析表明,选择性经验激活在检索不确定性下提高了推理可靠性。我们的代码匿名发布于 https://anonymous.4open.science/r/FinAcumen

英文摘要

Financial multimodal reasoning requires agents to coordinate numerical computation, retrieval, visual interpretation, and temporal grounding across heterogeneous evidence sources. Existing tool-augmented agents improve execution fidelity, yet remain largely stateless across episodes, repeatedly rediscovering reasoning strategies and failure patterns. In high-stakes financial settings, this leads to unreliable tool routing, noisy retrieval, and hallucination-prone reasoning. We present FinAcumen, a financial reasoning agent framework centered on selective experience memory for tool-augmented multimodal reasoning. FinAcumen accumulates financially grounded reasoning experience from prior trajectories, distilling successful strategies and failure-derived cautionary rules into a persistent memory bank. During inference, retrieved experiences condition reasoning only when semantic relevance exceeds a calibrated threshold, while irrelevant memory is explicitly suppressed through a fallback mechanism. A deterministic financial tool environment further grounds numerical computation, retrieval, visual decoding, and answer verification. Across four financial multimodal reasoning benchmarks, FinAcumen consistently improves a frozen 8B vision-language model, outperforming finance-specialized models and approaching leading proprietary general-purpose models. Further analysis shows that selective experience activation improves reasoning reliability under retrieval uncertainty. Our code is anonymously available at https://anonymous.4open.science/r/FinAcumen
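
The threshold-gated use of memory can be sketched as a nearest-neighbour lookup that returns nothing when the best match is not similar enough. The cosine similarity, the 0.8 threshold, the two-dimensional toy embeddings, and the memory notes are all hypothetical; the paper calibrates its threshold and uses learned embeddings.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_experience(query_vec, memory, threshold=0.8):
    """Return the best-matching note, or None when nothing is relevant enough."""
    best_note, best_score = None, -1.0
    for vec, note in memory:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_note, best_score = note, score
    # Fallback: suppress memory entirely rather than inject a weak match.
    return best_note if best_score >= threshold else None


memory = [
    ([1.0, 0.0], "Check units before computing ratios."),
    ([0.0, 1.0], "Read the chart legend before quoting values."),
]
hit = retrieve_experience([0.9, 0.1], memory)
miss = retrieve_experience([0.7, 0.7], memory)
```

The first query is close to a stored experience and retrieves its note; the second sits between both entries, clears neither threshold, and falls back to reasoning without memory.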

2606.17821 2026-06-17 cs.AI 新提交

DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL

DecoSearch: 面向Text-to-SQL的复杂度感知路由与计划级修复

Esteban Schafir, Xu Zheng, Hojat Allah Salehi, Zhuomin Chen, Mo Sha, Wei Cheng, Dongsheng Luo

发表机构 * Florida International University(佛罗里达国际大学) NEC-Labs(NEC实验室) Singapore Management University(新加坡管理大学)

AI总结 提出DecoSearch框架,通过复杂度感知路由将查询分配给直接生成或DAG分解,并结合拓扑精炼器修复执行失败,在BIRD和Spider上取得高准确率且显著降低token消耗。

详情
AI中文摘要

大型语言模型(LLMs)在将自然语言翻译为SQL方面展现了卓越的能力,但现有方法在处理需要多步骤、数据感知推理的复杂查询时仍然表现不佳。我们引入了DecoSearch,一个无需训练的框架,通过将每个查询路由到适当的推理努力级别来解决这一问题。轻量级的Schema Selector首先将完整数据库模式修剪为相关的表和列。然后,LLM Judger判断问题是否需要分解:简单问题遵循直接生成路径,而复杂问题则升级为原子子问题的有向无环图(DAG),每个子问题通过目标SQL生成步骤解决。RAG组件用语义相似的训练示例为分解器提供基础,而Topology Refiner在执行失败表明存在有缺陷的分解而非可修复的SQL错误时,重构推理计划。DecoSearch在BIRD上达到70.53%的执行准确率,在Spider上达到88.31%,使用DeepSeek骨干网络,超越了所有无需训练的基线方法,同时消耗的token数量比竞争方法少一个数量级。它还可以作为模型无关的包装器,在不修改管道的情况下持续改进微调后的SQL生成骨干网络。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities in translating natural language to SQL, yet existing methods still falter on complex queries requiring multi-step, data-aware reasoning. We introduce DecoSearch, a training-free framework that addresses this by routing each query to the appropriate level of reasoning effort. A lightweight Schema Selector first prunes the full database schema to the relevant tables and columns. An LLM Judger then decides whether the question requires decomposition: straightforward questions follow a direct generation path and complex ones are escalated to a Directed Acyclic Graph (DAG) of atomic sub-questions, each solved by a targeted SQL generation step. A RAG component grounds the decomposer with semantically similar training examples, and a Topology Refiner restructures the reasoning plan when execution failures signal a flawed decomposition rather than a fixable SQL error. DecoSearch achieves 70.53% execution accuracy on BIRD and 88.31% on Spider with a DeepSeek backbone, surpassing all training-free baselines while consuming an order of magnitude fewer tokens than competing methods. It also functions as a model-agnostic wrapper, consistently improving fine-tuned SQL generation backbones without any modification to the pipeline.
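
Executing a DAG of sub-questions means solving each one only after the sub-questions it depends on. The sketch below shows that ordering with Python's standard `graphlib.TopologicalSorter` (Python 3.9+). The plan contents, the node names, and the stand-in solver are invented for illustration; in a real system the solver would be an LLM call that emits SQL.

```python
from graphlib import TopologicalSorter

# Each sub-question maps to the set of sub-questions it depends on.
plan = {
    "q1_filter_2023_orders": set(),
    "q2_total_per_customer": {"q1_filter_2023_orders"},
    "q3_customer_regions": set(),
    "q4_top_region": {"q2_total_per_customer", "q3_customer_regions"},
}


def solve_in_order(plan, solve):
    """Run `solve` on every sub-question after all of its dependencies."""
    results = {}
    for node in TopologicalSorter(plan).static_order():
        deps = {d: results[d] for d in plan[node]}
        results[node] = solve(node, deps)
    return results


# Stand-in solver: returns a placeholder string instead of real SQL.
results = solve_in_order(plan, lambda node, deps: f"-- SQL for {node} using {sorted(deps)}")
order = list(results)
```

`TopologicalSorter` raises `graphlib.CycleError` if the plan is not acyclic, which gives a natural hook for rejecting or restructuring a malformed decomposition before any SQL is generated.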

2606.17856 2026-06-17 cs.AI 新提交

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

FlowRAG: 通过频率感知的多粒度图流协同显式推理

Bihao Zhan, Zongsheng Cao, Jie Zhou, Bo Zhang, Liang He

发表机构 * East China Normal University(华东师范大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出FlowRAG框架,构建四层异构图,通过双粒度激活和频率感知加权流模块,增强语义召回和显式推理路径提取,在复杂推理基准上取得最优性能。

详情
AI中文摘要

基于图的检索增强生成(GraphRAG)对于知识密集型和多跳查询任务有效;然而,许多现有方法主要基于实体图并依赖隐式语义相关性传播。这通常会导致(i)当用户查询抽象且在实体层面语义稀疏时检索不足,以及(ii)脆弱的多跳推理,其中噪声激活可能破坏实体到实体的转换并损坏推断的关系链,从而产生不可靠的结论。为此,我们提出FlowRAG,一个语义感知的检索框架,它提高了语义召回和显式推理。具体来说,FlowRAG在段落、摘要、句子和实体上构建了一个四层异构图,其中摘要节点作为粗粒度语义枢纽。在检索时,双粒度激活模块结合摘要-查询对齐和句子级匹配,在释义和抽象下鲁棒地激活相关实体。然后,我们引入一个频率感知的加权流模块,该模块通过段落内词频加权的实体-段落链接路由相关性,修剪噪声连接并提取高置信度的推理路径作为生成的显式逻辑骨架。大量实验表明,FlowRAG在复杂推理基准上取得了最先进的性能。

英文摘要

Graph-based retrieval-augmented generation (GraphRAG) is effective for knowledge-intensive and multi-hop query tasks; however, many existing methods primarily seed entity-based graphs and rely on implicit semantic relevance propagation. This often (i) under-retrieves when user queries are abstract and semantically sparse at the entity level, and (ii) suffers from brittle multi-hop reasoning, where noisy activations can derail entity-to-entity transitions and corrupt the inferred relation chain, yielding unreliable conclusions. To this end, we propose FlowRAG, a semantic-aware retrieval framework that improves both semantic recall and explicit reasoning. Specifically, FlowRAG constructs a quad-level heterogeneous graph over passages, summaries, sentences, and entities, where summary nodes serve as a coarse semantic hub. At retrieval time, a dual-granularity activation module combines summary-query alignment with sentence-level matching to robustly activate relevant entities under paraphrase and abstraction. We then introduce a frequency-aware weighted flow module that routes relevance through entity-passage links weighted by within-passage term frequency, pruning noisy connections and extracting high-confidence reasoning paths as an explicit logic skeleton for generation. Extensive experiments show that FlowRAG obtains state-of-the-art performance on complex reasoning benchmarks.
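
A frequency-weighted flow from entities to passages can be illustrated as follows. The normalization by total entity mentions in a passage, the pruning threshold `min_weight`, and the toy data are assumptions made for this sketch; the abstract only states that links are weighted by within-passage term frequency and that noisy connections are pruned.

```python
def passage_scores(entity_activation, term_freq, min_weight=0.1):
    """Propagate entity activation to passages over frequency-weighted links.

    entity_activation : {entity: activation score}
    term_freq         : {passage: {entity: number of mentions in that passage}}
    min_weight        : links below this normalized weight are pruned
    """
    scores = {}
    for passage, counts in term_freq.items():
        total = sum(counts.values())
        score = 0.0
        for entity, count in counts.items():
            weight = count / total
            if weight >= min_weight:  # drop weak, likely incidental mentions
                score += entity_activation.get(entity, 0.0) * weight
        scores[passage] = score
    return scores


activation = {"Marie Curie": 1.0, "Nobel Prize": 0.6}
term_freq = {
    "p1": {"Marie Curie": 4, "Nobel Prize": 1},
    "p2": {"Marie Curie": 1, "Paris": 19},
}
scores = passage_scores(activation, term_freq)
```

Passage `p1`, which is mostly about the activated entities, scores 0.92, while `p2` mentions the entity only in passing, has that link pruned, and scores 0.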

2606.17888 2026-06-17 cs.AI 新提交

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

MathVis-Fine:通过渐进式依赖引导训练将视觉监督与必要性对齐的多模态数学推理

Wanshi Xu, Haokun Zhao, Haidong Yuan, Songjun Cao, Long Ma

发表机构 * School of ECE, Peking University(北京大学电子与计算机工程学院) College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与人工智能学院) School of Software and Microelectronics, Peking University(北京大学软件与微电子学院) Tencent Youtu Lab(腾讯优图实验室)

AI总结 提出MathVis-Fine框架,通过构建细粒度视觉标注数据集和两阶段渐进式训练,根据样本的视觉依赖程度平衡答案正确性和视觉基础奖励,提升多模态数学推理的监督精度。

详情
AI中文摘要

链式思维(CoT)推理已从纯语言领域扩展到多模态场景;然而,现有方法通常将视觉输入视为同质或辅助信号,未能捕捉数学问题解决中文本与图像之间复杂且样本特定的依赖关系。这引发了两个核心问题:首先,视觉内容的监督信号是泛化且粗粒度的,缺乏对每个样本中视觉信息实际必要性的适应;其次,当视觉奖励被统一应用而不区分输入之间的互补关系时,训练反馈变得不准确。这些限制阻碍了模型实现精确的多模态推理。在这项工作中,我们提出了一个用于建模数学推理中细粒度视觉依赖的框架。我们首先构建了MathVis-Fine数据集,通过视觉依赖评级增强细粒度视觉标注。基于该数据集,我们引入了一种两阶段渐进式视觉增强训练范式,该范式根据每个样本的内在视觉依赖水平平衡答案正确性奖励和视觉基础奖励,从而减轻奖励偏差并提高监督准确性。大量实验表明,MathVis-Fine框架能够基于视觉依赖逐步增强视觉感知,为多模态数学推理提供了更精确的训练框架。我们将在论文被接收后发布该数据集。

英文摘要

Chain-of-Thought (CoT) reasoning has extended from purely linguistic domains to multimodal scenarios; however, existing approaches often treat visual inputs as homogeneous or auxiliary signals, failing to capture the intricate and sample-specific dependencies between text and images in mathematical problem-solving. This gives rise to two core issues: first, the supervisory signals for visual content are generalized and coarse-grained, lacking adaptation to the actual necessity of visual information in each sample; second, training feedback becomes inaccurate when visual rewards are uniformly applied without distinguishing the complementary relationships among inputs. These limitations hinder models from achieving precise multimodal reasoning. In this work, we propose a framework for modeling fine-grained visual dependencies in mathematical reasoning. We first construct the MathVis-Fine dataset, augmenting fine-grained visual annotations with visual dependency ratings. Building upon this dataset, we introduce a two-stage progressive visual enhancement training paradigm that balances answer correctness rewards and visual grounding rewards according to the intrinsic visual dependency level of each sample, thereby mitigating reward bias and improving supervision accuracy. Extensive experiments demonstrate that the MathVis-Fine framework effectively enhances visual perception progressively based on visual dependency, offering a more precise training framework for multimodal mathematical reasoning. We will release the dataset upon acceptance.

2606.18075 2026-06-17 cs.AI 新提交

A Unified Framework for Context-Aware and Relation-Aware Graph Retrieval-Augmented Generation

上下文感知与关系感知的图检索增强生成统一框架

Haoyang Zhong, Yifei Sun, Antong Zhang, Chunping Wang, Lei Chen, Yang Yang

发表机构 * Zhejiang University(浙江大学) Nanyang Technological University(南洋理工大学) Finvolution Group(信也科技集团)

AI总结 提出HyGRAG分层图RAG框架,通过构建融合上下文与关系的摘要、跨层级检索及动态更新,将多跳推理准确率提升9.7%。

Comments Accepted at The ACM Web Conference 2026 (WWW '26)

详情
AI中文摘要

检索增强生成(RAG)已成为用外部知识增强大型语言模型(LLM)的范式,但现有基于图的方法面临一个根本限制:以实体为中心和以块为中心的方法操作在锚定于原始文本的表示上,缺乏真正的知识融合。以实体为中心的方法连接逻辑相关的内容,以块为中心的方法保留上下文,但两者都通过相似性搜索分别检索信息,错过了其综合产生的新兴理解。在本文中,我们提出HyGRAG,一种分层图RAG框架,通过解决三个核心挑战超越源文档:构建真正整合上下文和关系信息的摘要,利用这些综合表示在检索中访问新兴知识,以及高效更新分层结构以适应动态语料库。具体地,我们在包含块和实体节点的混合图上设计分层索引结构,然后迭代聚类并生成基于LLM的摘要。接着,我们设计上下文和关系感知的检索,跨所有抽象级别搜索,同时通过社区成员关系扩展。此外,我们通过基于附加的算法实现动态知识更新,仅需局部重新摘要。实验结果表明,HyGRAG将多跳推理任务的平均准确率提高了9.7%,同时保持了合理的效率。

英文摘要

Retrieval-Augmented Generation (RAG) has emerged as a paradigm for enhancing large language models (LLMs) with external knowledge, yet existing graph-based methods face a fundamental limitation: entity-centric and chunk-centric approaches operate on representations anchored to original text without true knowledge fusion. While entity-centric methods connect logically related content and chunk-centric methods preserve context, both retrieve information separately through similarity search, missing emergent understanding from their synthesis. In this paper, we propose HyGRAG, a hierarchical graph RAG framework that transcends source documents by addressing three core challenges: constructing summaries that genuinely integrate contextual and relational information, leveraging these synthesized representations to access emergent knowledge during retrieval, and efficiently updating hierarchical structures for dynamic corpora. Specifically, we design hierarchical index structures over hybrid graphs with both chunk and entity nodes, then iteratively cluster them and generate LLM-based summaries. Then, we design context and relation-aware retrieval that searches across all abstraction levels while expanding through community membership. Moreover, we enable dynamic knowledge update through attachment-based algorithms with only local re-summarization. Experimental results show that HyGRAG improves the average accuracy of multi-hop reasoning tasks by 9.7%, while maintaining reasonable efficiency.

2606.17057 2026-06-17 cs.LG cs.AI cs.CL 交叉投稿

Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

配对时正确,分离时错误:多模态大语言模型中模态特定神经元的解耦与编辑

Tingchao Fu, Wenkai Wang, Fanxiao Li, Huadong Zhang, Jinhong Zhang, Dayang Li, Yunyun Dong, Renyang Liu, Wei Zhou

发表机构 * School of Information Science and Engineering, Yunnan University(云南大学信息科学与工程学院) School of Software, Yunnan University(云南大学软件学院) National University of Singapore(新加坡国立大学) School of Engineering, Yunnan University(云南大学工程学院)

AI总结 针对多模态大语言模型知识编辑中存在的解耦失败问题,提出DECODE方法,通过显式解耦和定位模态特定神经元组,实现跨模态触发下的有效知识更新。

Comments 18 pages, 11 figures

详情
AI中文摘要

尽管知识编辑为多模态大语言模型(MLLMs)的知识更新提供了一种高效机制,但我们发现当前范式仍面临一个重要但尚未充分探索的问题:编辑解耦失败,即当模型被多模态输入(文本-图像查询对)触发时,实体相关知识可以更新,但当配对输入被拆分为单模态输入时,这些知识往往恢复为编辑前的旧事实。我们深入的实证分析表明,MLLMs中的实体知识并非以统一表示存储,而是分布在解耦的模态特定路径中。因此,偏向多模态查询的更新无法有效传播到单模态电路。为弥补这一差距,我们提出DECODE,该方法显式解耦并定位模态特定神经元组以获取目标知识。大量实验证明,DECODE在不同模态触发下均能实现有效的知识更新,从而缓解编辑解耦失败。

英文摘要

Although Knowledge Editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet underexplored issue: editing decoupling failure, where entity-related knowledge can be updated when the model is triggered by multimodal inputs (text-image query pairs) but often reverts to outdated pre-edit facts when the paired inputs are split into unimodal ones. Our in-depth empirical analysis reveals that the entity knowledge in MLLMs is not stored as a unified representation, but is instead distributed across disentangled modality-specific pathways. As a result, updates biased toward multimodal queries fail to propagate effectively to unimodal circuits. To bridge this gap, we propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups for targeted knowledge editing. Extensive experiments demonstrate that DECODE consistently achieves effective knowledge updates under different modality triggers, thereby mitigating editing decoupling failures.

2606.17126 2026-06-17 cs.SD cs.AI 交叉投稿

Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control

通过改进独立控制实现歌唱声音转换中的颤音表达控制

Joon-Seung Choi, Dong-Min Byun, Seong-Whan Lee

发表机构 * Korea University(高丽大学)

AI总结 提出VibE-SVC2框架,通过能量风格转换器、零样本音高风格转换器、颤音速率缩放和次谐波校正算法,实现对音高和音色两种歌唱风格的精细独立控制,性能优于现有方法。

Comments Accepted to IEEE Transactions on Audio, Speech, and Language Processing (TASLP)

详情
AI中文摘要

歌唱风格是自然且富有表现力的歌声的关键方面。歌手利用歌唱风格来传达歌曲的情感。已有若干工作提出控制歌唱风格以制作更具表现力的歌声。最近,VibE-SVC通过预测高频F0轮廓成功控制了颤音。在本文中,我们引入了一个名为VibE-SVC2的歌唱声音转换框架,以改进歌唱风格转换性能和可控性。该模型提供对两种歌唱风格的控制:音高风格和音色风格。对于音高风格,为了解决我们先前工作中未解决的能量-音高纠缠问题,我们引入了一种新颖的能量风格转换器来处理能量轮廓中剩余的样式信息。此外,我们提出了一种零样本音高风格转换器,它模仿参考音频的音高风格。为了扩展模型的可控性,我们提出了颤音速率缩放,这是对颤音程度的独立控制,这在VibE-SVC中是不可用的。对于音色风格,我们扩展了模型以处理多种发声风格。然而,解决诸如气泡音等特定风格带来了挑战,因为传统的F0提取由于其固有的次谐波特性而常常失败,这降低了转换质量。为了解决这个问题,我们提出了一种新颖的次谐波校正算法来细化F0轮廓,以实现更自然的音色转换。通过全面的客观和主观评估,我们证明了VibE-SVC2提供了对两种歌唱风格的精细、独立控制,优于现有方法。

英文摘要

Singing style is a crucial aspect of a natural and expressive singing voice. Singers use singing styles to convey the feeling or emotion of a song. Several works have proposed controlling singing style to make the singing voice more expressive. Recently, VibE-SVC successfully controlled vibrato by predicting the high-frequency F0 contour. In this paper, we introduce a singing voice conversion framework, called VibE-SVC2, to improve singing style conversion performance and controllability. The model offers control over two types of singing styles: a pitch style and a timbre style. For the pitch style, to resolve the pitch-energy entanglement issue left unresolved in our previous work, we introduce a novel Energy Style Converter to address the remaining style information in the energy contour. In addition, we propose a Zero-shot Pitch Style Converter, which mimics the pitch style of reference audio. To expand the controllability of the model, we propose vibrato rate scaling, a control independent of vibrato extent that is unavailable in VibE-SVC. For the timbre style, we extend the model to handle a variety of phonation styles. However, addressing specific styles such as vocal fry poses a challenge, as conventional F0 extraction often fails due to their inherent subharmonic characteristics, which degrades the conversion quality. To address this, we propose a novel Subharmonic Correction algorithm to refine the F0 contour for more natural timbre conversion. Through comprehensive objective and subjective evaluations, we demonstrate that VibE-SVC2 provides fine-grained, independent control over two types of singing styles, outperforming existing methods.

2606.17164 2026-06-17 cs.CL cs.AI cs.HC cs.PL cs.SE 交叉投稿

PromptMN: Pseudo Prompting Language

PromptMN: 伪提示语言

Enkhzol Dovdon

发表机构 * ICT Group(ICT集团)

AI总结 提出PromptMN,一种伪提示领域特定语言,通过紧凑的%前缀类型指令注释自然语言,减少上下文歧义,提升人机交互的清晰度和可审查性。

Comments 32 pages, 2 figures

详情
AI中文摘要

提示已成为人类与生成式AI之间的主要接口,然而许多自然语言提示仍然脆弱:角色、目标、约束和预期输出常常埋没在散文中或隐含起来。在智能体和软件开发工作流中,首次交接时的误读可能会传播到每一步,因为相当一部分智能体故障源于上下文歧义而非模型限制。本文介绍PromptMN,一种伪提示领域特定语言,它用紧凑的、以%为前缀的类型指令注释自然语言,涵盖角色、目标、需求、优先级、约束、计划、输入和输出。语义解析允许作者以任意顺序编写,而模型根据功能解释指令。PromptMN介于非正式提示和编程风格伪代码之间:结构足够可检查和可重用,又足够轻量,适用于软件开发生命周期(SDLC)中的分析师、管理者、开发者和利益相关者。PromptMN还与逆向提示工程配合使用。要求模型将期望结果重述为PromptMN,让用户在执行前检查推断的角色、目标、约束和缺失假设,从而减少修复周期,并产生一个可重用的工件来对齐人员和AI工具。PromptMN的可行性在多个前沿模型上进行了评估,包括Claude Fable 5、Claude Opus 4.8、Gemini 3.1 Pro和GPT-5.5。这些模型正确解析了PromptMN指令,包括复杂结构如重复、条件、方法和素数检查任务,无需微调。相同的词汇适用于所呈现的SDLC场景中的新代码库、维护和重新设计。虽然大规模验证仍是未来工作,但这些早期结果表明PromptMN是朝着更清晰、更可审查的人机交互迈出的实际一步。

英文摘要

Prompting has become the primary interface between humans and generative AI, yet many natural language prompts remain fragile: roles, goals, constraints, and expected outputs are often buried in prose or left implicit. In agentic and software development workflows, a misread at the first handoff can propagate through every step, since a significant portion of agent failures stem from context ambiguities rather than model limitations. This paper introduces PromptMN, a pseudo-prompting domain-specific language that annotates natural language with compact, %-prefixed typed directives covering roles, goals, requirements, priorities, constraints, plans, inputs, and outputs. Semantic resolution lets authors write in any order while the model interprets directives by function. PromptMN sits between informal prompting and programming-style pseudocode: structured enough to be inspectable and reusable, yet lightweight enough for analysts, managers, developers, and stakeholders across the software development lifecycle (SDLC). PromptMN also pairs with reverse prompt engineering. Asking a model to restate a desired outcome as PromptMN lets users inspect the inferred roles, goals, constraints, and missing assumptions before acting, reducing repair cycles and yielding a reusable artifact for aligning people and AI tools. PromptMN's feasibility is evaluated across several frontier models, including Claude Fable 5, Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. The models correctly resolved PromptMN instructions, including complex structures such as repetition, conditionals, methods, and a prime-checking task, without fine-tuning. The same vocabulary applies across new codebases, maintenance, and redesign in the SDLC scenarios presented. While large-scale validation remains future work, these early results suggest PromptMN is a practical step toward clearer, more reviewable human-to-AI interaction.
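To make the idea of order-independent, %-prefixed directives concrete, here is a minimal sketch of how a tool might separate directives from free prose before sending a prompt to a model. The abstract does not give PromptMN's grammar, so the `%name text` line shape and the directive names below (`%goal`, `%role`, `%constraint`, `%output`) are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical line shape: "%name rest of line". Not the official PromptMN grammar.
DIRECTIVE = re.compile(r"^%(\w+)\s+(.*\S)\s*$")

def parse_directives(prompt):
    """Group directive lines by name (in any order) and collect the remaining prose."""
    directives = defaultdict(list)
    free_text = []
    for line in prompt.splitlines():
        stripped = line.strip()
        match = DIRECTIVE.match(stripped)
        if match:
            directives[match.group(1).lower()].append(match.group(2))
        elif stripped:
            free_text.append(stripped)
    return dict(directives), free_text

example = """\
%goal Summarise the incident report in five bullet points
Use only facts stated in the report.
%role Senior site reliability engineer
%constraint No speculation about root cause
%output Markdown list
"""
parsed, prose = parse_directives(example)
```

Because directives are keyed by name rather than position, the role can appear after the goal without changing the parsed result, which mirrors the "write in any order" property the abstract describes.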

2606.17255 2026-06-17 cs.CL cs.AI 交叉投稿

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

MLLP-VRAIN UPV 系统在 IWSLT 2026 同声传译任务中的应用

Jorge Iranzo-Sánchez, Gerard Mas-Mollà, Adrià Giménez, Jorge Civera, Albert Sanchis, Alfons Juan

发表机构 * MLLP-VRAIN research group(MLLP-VRAIN研究组) VRAIN Universitat Politècnica de València(瓦伦西亚理工大学)

AI总结 提出基于Parakeet和Qwen 3.5模型的级联同声传译系统,通过自适应黑盒策略优化质量-延迟权衡,并引入ASR词增强和RAG机制处理上下文跟踪,在MCIF En→De测试集上实现XCOMET-XL提升+5.82。

Comments IWSLT 2026 System Description

详情
AI中文摘要

本文描述了MLLP-VRAIN研究组参与IWSLT 2026同声传译赛道共享任务的情况。我们的提交利用最近发布的Parakeet和Qwen 3.5模型,通过自适应“黑盒”策略构建了一个鲁棒的级联解决方案,用于长形式SimulST。我们探索了这些策略的松弛版本以实现更好的质量-延迟权衡。与去年相比,我们参与了所有语言方向。此外,对于En→{De, It, Zh}方向,我们还参与了今年新增的上下文跟踪赛道,采用ASR词增强和离线预翻译示例的RAG机制相结合,以引导生成并丰富系统的领域特定上下文。最后,我们提供了系统的详细延迟分析。与去年相比,在MCIF En→De测试集上的结果显示质量显著提升,XCOMET-XL提高了+5.82。我们的上下文跟踪处理进一步提升了+1.03的性能。

英文摘要

This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create a robust, cascaded solution for long-form SimulST through the use of adaptive "black-box" policies. We explore relaxations of these policies to achieve better quality-latency trade-offs. Compared to last year, we participate in all language directions. In addition, for the En→{De, It, Zh} directions we also participate in this year's new context track, employing a combination of ASR word-boosting and a RAG mechanism of offline pre-translated exemplars to guide generation and enrich our system with domain-specific context. Finally, we provide a detailed latency analysis of our system. Compared to last year, results on the MCIF En→De test set show a substantial quality improvement of +5.82 XCOMET-XL. Our context track processing further improves performance by +1.03.

2606.17350 2026-06-17 cs.CL cs.AI 交叉投稿

Do Large Language Models Always Tell The Same Stories?

大型语言模型总是讲述相同的故事吗?

Thennal DK, Hans Ole Hatzel

发表机构 * University of Hamburg(汉堡大学)

AI总结 通过对比框架和人类故事数据集,研究10种LLM生成故事的叙事相似性,发现LLM故事比人类故事更相似,前沿模型趋向于“平均”通用叙事,且常见缓解策略无效。

详情
AI中文摘要

大型语言模型(LLMs)的最新进展使得生成高质量散文成为可能,但这些模型是否能够生成多样化的输出仍然存在争议。在这项工作中,我们通过叙事相似性框架研究了LLM生成故事的多样性。使用对比框架和来自r/WritingPrompts的人类编写故事和提示数据集,我们收集了10个代表性LLM的叙事相似性判断,同时利用人类评估和三种不同的自动注释方法。我们的发现揭示了一个一致的趋势:LLM生成的叙事彼此之间始终比人类编写的故事更相似。我们证明,特别是前沿模型收敛于一种“平均”通用叙事,这种叙事近似于个体人类故事,但缺乏人类作者的整体多样性。最后,我们表明常见的缓解策略,包括负提示和温度缩放,未能有效解决这种同质性。

英文摘要

Recent advances in large language models (LLMs) have enabled the generation of high-quality prose, yet the question of whether these models are capable of generating diverse outputs remains contested. In this work, we investigate the diversity of LLM-generated stories through the framework of narrative similarity. Using a contrastive framework and a dataset of human-written stories and prompts from r/WritingPrompts, we collect narrative similarity judgments across 10 representative LLMs, utilizing both human evaluations and three different automatic annotation methods. Our findings reveal a consistent trend: LLM-generated narratives are consistently more similar to each other than human-written stories are. We demonstrate that frontier models in particular converge on a "mean" generic narrative that approximates individual human stories but lacks the collective diversity of human authors. Finally, we show that common mitigation strategies, including negative prompting and temperature scaling, fail to meaningfully address this homogeneity.

2606.17354 2026-06-17 cs.CL cs.AI 交叉投稿

Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

翻译不可译:一个可操作化的不可译性本体论

Jacob Bremerman, Brihi Joshi, Hirona Arai, Xiang Ren, Jonathan May

发表机构 * University of Southern California Information Sciences Institute(南加州大学信息科学研究所)

AI总结 提出一个结构化的不可译性本体论和补偿策略分类法,构建多语言数据集,通过人类偏好研究发现注释补偿策略最受青睐,为策略感知机器翻译奠定基础。

详情
AI中文摘要

不可译性,即意义无法在语言间直接保留的情况,在语言学中已有深入研究,但在自然语言处理中尚未充分探索。随着机器翻译系统在标准基准测试上的改进,其局限性越来越集中在这些情况下,即翻译无法简化为一一对应。我们引入了一个结构化的不可译性本体论以及补偿策略的分类法,这些策略是在这些不可译情况下传达意义的具体技术。我们将该框架操作化为一个多语言数据集,包含不可译句子及其基于策略的翻译,从而能够对翻译行为进行受控分析。初步的人类偏好研究表明,翻译质量取决于所使用的策略,并且对包含解释性上下文(称为注释补偿策略)的输出存在一致的偏好。我们的框架和数据集为研究和建模策略感知的机器翻译提供了基础。

英文摘要

Untranslatability, cases where meaning cannot be directly preserved across languages, is well-studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their limitations increasingly concentrate in such cases, where translation cannot be reduced to one-to-one equivalence. We introduce a structured ontology of untranslatability along with a taxonomy of compensation strategies, which are specific techniques to convey meaning under these untranslatable circumstances. We operationalize this framework into a multilingual dataset of untranslatable sentences paired with strategy-based translations, enabling controlled analysis of translation behavior. Initial human preference studies suggest that translation quality depends on the strategy used, with consistent preferences for outputs that include explanatory context, known as the Annotation compensation strategy. Our framework and dataset provide a foundation for studying and modeling strategy-informed machine translation.

2606.17389 2026-06-17 cs.CV cs.AI cs.CL cs.LG 交叉投稿

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

视觉会撒谎,一致性说话:在视觉-语言模型中解耦空间注意力与可靠性

Logan Mann, Yi Xia, Ajit Saravanan, Ishan Dave, Saadullah Ismail, Shikhar Shiromani, Emily Huang, Ruizhe Li, Kevin Zhu

发表机构 * University of California, Santa Barbara(加州大学圣塔芭芭拉分校) Algoverse AI Research(Algoverse AI研究) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出VLM可靠性探针(VRP),通过结构注意力指标和生成动态分析,发现空间注意力与准确性几乎无关(R≈0.001),而自一致性是可靠性的主要预测因子(R=0.429),揭示了视觉特征与最终生成之间的符号脱离现象。

Comments 16 pages. Accepted to the ICLR 2026 Workshop on Multimodal Intelligence. Code: https://github.com/itsloganmann/VLM-Reliability-Probe

详情
AI中文摘要

多模态基础模型越来越多地被用作推理代理,因此可靠性(即知道模型何时可能产生幻觉)变得至关重要。一种常见的直觉,我们称之为注意力-置信度假设,认为可靠性源于“结构性”视觉感知:对相关区域的紧密注意力应表明答案可信,而分散的注意力则表示困惑。我们通过VLM可靠性探针(VRP)挑战这一观点,这是一项对当代视觉-语言模型(VLM)中可靠性信号进行的系统性跨家族研究。我们引入了结构注意力指标——簇计数(C_k)和空间熵(H_s)——来量化视觉编码器的注视点,并追踪其跨层的演化(ΔH_s)。这揭示了一种“符号脱离”:模型通常“早期锁定”视觉特征,但随后注意力扩散,切断了早期感知与最终生成的联系。与接地假设相反,我们发现“簇失效”:空间注意力与准确性几乎零相关(R≈0.001)。相反,可靠性是生成动态和内部状态分布的现象。自一致性,即采样推理路径之间的一致率,是真实性的主要预测因子(R=0.429)。扩展因果干预揭示了尖锐的架构差异:LLaVA将其预测锁定在脆弱的后期瓶颈中,而PaliGemma和Qwen2-VL全局分布可靠性,即使其最具预测性的层被破坏约50%或更多,仍保持韧性。对于当前的VLM,可靠性信号与视觉接地图脱离,最好通过生成时动态和隐藏状态探针来推断。

英文摘要

Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that reliability follows from "structural" visual perception: tight attention on relevant regions should signal a trustworthy answer, while scattered attention signals confusion. We challenge this through the VLM Reliability Probe (VRP), a systematic cross-family study of reliability signals in contemporary Vision-Language Models (VLMs). We introduce structural-attention metrics, cluster counts (C_k) and spatial entropy (H_s), to quantify the visual encoder's gaze, and track its evolution (ΔH_s) across layers. This reveals a "Symbolic Detachment": models often "Early Lock" visual features only to diffuse attention later, severing early perception from final generation. Contrary to the grounding hypothesis, we find a "Cluster Failure": spatial attention has near-zero correlation (R ≈ 0.001) with accuracy. Instead, reliability is a phenomenon of generation dynamics and internal-state distributions. Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of truth (R = 0.429). Scaling causal interventions exposes a sharp architectural divergence: LLaVA locks its prediction in a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally, staying resilient even when ~50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.
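The two signals contrasted above can be illustrated with a small sketch. The abstract does not spell out the exact definitions, so this assumes that spatial entropy is the Shannon entropy of a normalised attention map and that self-consistency is the share of sampled answers matching the majority answer. The function names are illustrative and are not taken from the linked repository.

```python
import math
from collections import Counter

def spatial_entropy(attention):
    """Shannon entropy (nats) of a 2-D attention map normalised to sum to 1 (assumed definition)."""
    flat = [w for row in attention for w in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    probs = [w / total for w in flat if w > 0]
    return -sum(p * math.log(p) for p in probs)

def self_consistency(answers):
    """Fraction of sampled answers that agree with the most common answer (assumed definition)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

focused = [[0.0, 0.0], [0.0, 1.0]]      # all attention on one patch
diffuse = [[0.25, 0.25], [0.25, 0.25]]  # attention spread evenly
samples = ["cat", "cat", "dog", "cat"]  # four sampled answers to one question
```

In this toy setup the focused map has zero entropy and the diffuse map has entropy log 4, while three of four samples agree. The abstract's finding is that the second kind of signal, not the first, tracks accuracy.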

2606.17441 2026-06-17 cs.HC cs.AI cs.CY 交叉投稿

Patients With Personality: Realistic Patient Simulation through Controlled Diversity and Selective Disclosure

具有个性的患者:通过受控多样性与选择性披露实现逼真的患者模拟

Moritz Schlager, Friederike Jungmann, Samuel Schmidgall, Philipp Raffler, Franziska Hartl, Eva Wende, Paula Roßmüller, Conrad Ketzer, Avinatan Hassidim, Dale R. Webster, Yossi Matias, Yun Liu, Daniel Rueckert, Mike Schaekermann, Paul Hager

发表机构 * Technical University of Munich(慕尼黑技术大学) Munich Center for Machine Learning(慕尼黑机器学习中心) TUM University Hospital(慕尼黑技术大学医院) Google DeepMind(谷歌DeepMind) Google Research(谷歌研究) Imperial College London(伦敦帝国学院)

AI总结 提出PatientsWithPersonality框架,通过HEXACO人格参数化控制患者对话风格、合作性和信息披露,生成逼真且多样化的虚拟患者,在临床评估中接近真实演员表现。

Comments 22 pages, 11 figures

详情
AI中文摘要

模拟逼真的患者交互是在没有耗时且昂贵的用户研究的情况下大规模测试LLMs临床应用的关键要求。然而,现有方法通常缺乏真实性和可控性,常常在未提示的情况下过度分享信息,并且未能捕捉患者行为的广泛变异性。在这里,我们引入了PatientsWithPersonality (PWP),一个患者模拟框架,通过在潜在患者状态上显式的人格参数化生成逼真且多样化的虚拟患者响应。基于HEXACO(一个用于量化和参数化人类行为特征的六维人格空间),我们的方法能够在统一框架内对对话风格、合作性和信息披露进行细粒度控制。在临床医生评估中,PWP被认为几乎与记录的人类演员一样逼真,并且明显优于先前的模拟器,同时被标记为“信息过多”的频率远低于前者。基于HEXACO轴的条件化产生的人格特质可由临床医生和自动评估者恢复,其行为足迹比最接近的基线宽得多,并防止过度分享。总之,我们的框架通过逼真且可操控的患者模拟器,为更准确且信息丰富的LLM基准测试铺平了道路。

英文摘要

Simulating realistic patient interactions is a key requirement for testing clinical applications of LLMs at scale without time-consuming and expensive user studies. However, existing approaches often lack realism and controllability, overshare information unprompted, and fail to capture the wide variability of patient behavior. Here, we introduce PatientsWithPersonality (PWP), a patient simulation framework that generates realistic yet diverse virtual patient responses through explicit personality parameterization over a latent patient state. Grounded in HEXACO, a six-dimensional personality space used to quantify and parameterize human behavioral traits, our approach enables fine-grained control over conversational style, cooperativeness, and information disclosure within a unified framework. In a clinician evaluation, PWP is judged nearly as realistic as recorded human actors and clearly ahead of prior simulators, while being flagged as "too informative" far less often. Conditioning on HEXACO axes yields personas whose configured traits are recoverable by both clinicians and an autorater, that span a substantially wider behavioral footprint than the closest baseline, and that prevent oversharing. Altogether, our framework paves the way for more accurate and informative LLM benchmarking through our realistic and steerable patient simulator.

2606.17536 2026-06-17 cs.CV cs.AI 交叉投稿

OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

OmniDrive: 一种由LLM编排、采用统一潜在协同压缩的多智能体世界模型,用于多视角驾驶视频生成

Zijie Meng, Yufei Liu, Chengqian Ma, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Shuqin Chen, Weichen Xu, Jiquan Yuan, Miao Zhang

发表机构 * Peking University(北京大学) Xiamen University(厦门大学) Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院) National Taiwan University(国立台湾大学) Wuhan University(武汉大学) Wuhan University of Technology(武汉理工大学) Tsinghua University(清华大学) Jimei University(集美大学)

AI总结 提出DRIVE-CHOREO,一种由LLM编排的多智能体世界模型,通过三个Qwen2.5-VL智能体协同生成位置感知的潜在序列,并利用视图-时间置换与3D VAE协同压缩,实现可控多视角视频生成,在nuScenes上达到SOTA多视角一致性和BEV mAP 21.6。

Comments 24 pages, 10 figures

详情
AI中文摘要

自动驾驶的生成式世界模型面临两个未解决的对立:异构控制注入(自由形式语言、高清地图、轨迹和相机位姿存在于不兼容的表示空间)和事后跨视图融合(每个相机的潜在编码未能编码全局3D几何)。我们将两者追溯到同一个根本原因:在潜在标记级别上缺乏对齐语言、几何和像素的共享符号中间语言。我们提出DRIVE-CHOREO,一种由LLM编排的多智能体世界模型,将可控多视图视频生成重新定义为潜在编排。三个Qwen2.5-VL智能体——一个解析用户意图为结构化WorldScript的导演,一个将其接地为空间锚定布局标记的制图师,以及一个将跨视图批评反馈为辅助监督的审计员——共同创作一个单一的位置感知标记序列。该序列通过视图-时间置换与多视图视频协同压缩,在3D VAE的卷积感受野内强制实现相机间几何。在nuScenes上,DRIVE-CHOREO以具有竞争力的FVD(45.7)实现了新的最先进的多视图一致性和BEV mAP(21.6);仅在我们的合成数据上训练的检测器在真实验证集上获得了+2.4 NDS,验证了下游实用性。

英文摘要

Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free-form language, HD-maps, trajectories, and camera poses reside in incompatible representational spaces, and post-hoc cross-view fusion, where per-camera latents fail to encode global 3-D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent-token level. We present DRIVE-CHOREO, an LLM-choreographed multi-agent world model that recasts controllable multi-view video generation as latent choreography. Three Qwen2.5-VL agents - a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially-anchored layout tokens, and an Auditor feeding cross-view critiques back as auxiliary supervision - jointly author a single position-aware token sequence. This sequence is co-compressed with the multi-view video via a view-time permutation that enforces inter-camera geometry within the convolutional receptive field of a 3-D VAE. On nuScenes, DRIVE-CHOREO sets new state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.

2606.17539 2026-06-17 cs.CV cs.AI 交叉投稿

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

空间视觉语言模型中的双路径推理强化

Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu

发表机构 * The University of Hong Kong(香港大学) NVIDIA(英伟达) University of California, San Diego(加州大学圣迭戈分校)

AI总结 提出SR-REAL框架,通过强化学习融合语言推理和3D检测推理两条路径,显著提升空间VLM在复杂几何推理任务中的性能。

详情
AI中文摘要

空间VLM在几何感知方面取得了显著进展,但需要多步推理(涉及深度、距离和场景关系)的复杂空间推理仍然具有挑战性。此外,不同的空间查询需要根本不同的策略:有些最好通过纯语言的逐步演绎来解决,而另一些则需要在进行定量推理之前进行显式的3D定位。我们提出了SR-REAL(通过强化学习实现空间VLM的双路径空间推理),这是一个统一框架,为空间VLM配备了两条互补的推理路径:纯语言推理(LOR),执行逐步语言演绎;以及先检测后推理(DTR),通过区域标记检测3D几何线索(如中心或边界框),然后进行显式几何推理。SR-REAL首先进行冷启动监督微调阶段,构建LOR和DTR的思维链监督,并暴露区域到3D的接口;随后进行强化学习,使用准确性和格式奖励优化策略模型;对于DTR,基于离散中心的检测奖励进一步细化几何对齐。在多种空间基准测试中,SR-REAL显著优于空间VLM基线:(i) 单个RL训练模型支持两条推理路径,DTR通过精确的3D定位在区域感知任务中表现出色,LOR增强了一般空间推理;(ii) 联合训练两条路径促进相互强化;(iii) 高质量、混合的冷启动数据对于稳定的RL优化至关重要;(iv) 模型无需逐任务调整即可跨数据集和领域泛化,展示了LOR和DTR之间的正向迁移。

英文摘要

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.

2606.17615 2026-06-17 cs.CV cs.AI 交叉投稿

SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

SkillMoV: 基于原型条件门控的视图混合路由用于统一多视角熟练度估计

Edoardo Bianchi, Antonio Liotta

发表机构 * Free University of Bozen-Bolzano(博尔扎诺自由大学)

AI总结 提出SkillMoV框架,通过混合视图投影器(MoVP)实现多场景多视角视频的熟练度估计,在EgoExo4D数据集上达到50.17%准确率,超越现有方法。

详情
AI中文摘要

从视频中估计人类熟练度是自动化技能评估的关键挑战,应用于体育教练、音乐教学、手术培训和工作场所学习。现有方法通常专注于单一场景或依赖共享的多视角聚合,限制了其适应异构摄像机视角和活动领域的能力。我们提出SkillMoV,一个统一的、参数高效的框架,用于从同步多视角视频中进行多场景熟练度估计。其核心是混合视图投影器(MoVP),将混合专家范式适应于摄像机特定的视角特征。MoVP由四个阶段组成:(i) 一个具有12个专家MLP的混合视图软路由器,无需摄像机身份监督即可学习视角相关的专家偏好;(ii) 跨视角注意力以对齐同步摄像机;(iii) 可学习的原型锚定,以类级参考向量条件化表示;(iv) 一个原型条件门控投影,生成最终技能嵌入。我们在EgoExo4D上评估SkillMoV,涵盖六个技能领域和三种单独训练的视角配置:Ego、Exos和Ego+Exos。SkillMoV在Exos设置中达到50.17%的总体准确率,单个模型在所有场景上联合训练,超过比较方法中报告的最强Exos结果3.57个百分点。在Ego+Exos中,SkillMoV接近该设置的最佳报告结果(47.63%对48.20%)。在选定的Exos配置上的消融实验验证了每个组件:MoV路由比注意力聚合提高+6.61个百分点,跨视角注意力+4.92个百分点,原型锚定+4.07个百分点,随机视角丢弃+3.90个百分点。通过LoRA适配,SkillMoV仅训练其参数的23.32%,并且相对于仅LoRA基线增加了有限的测量开销。

英文摘要

Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, limiting their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP is composed of four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. SkillMoV reaches 50.17% overall accuracy in the Exos setting with a single model trained jointly across all scenarios, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points. In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration validate each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.
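The soft-routing stage can be sketched in a few lines: a router scores every expert for a given view feature, a softmax turns the scores into gates, and the output is the gate-weighted sum of all expert outputs. This is a generic mixture-of-experts illustration, not the authors' MoVP implementation. It uses two toy experts rather than the twelve expert MLPs mentioned in the abstract, and the names `soft_route` and `router_weights` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_route(view_feature, router_weights, experts):
    """Blend every expert's output with softmax gates computed from the view feature."""
    logits = [sum(w * x for w, x in zip(row, view_feature)) for row in router_weights]
    gates = softmax(logits)
    outputs = [expert(view_feature) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(dim)]
    return mixed, gates

# Toy experts standing in for expert MLPs: identity and doubling.
experts = [lambda v: list(v), lambda v: [2.0 * x for x in v]]
router_weights = [[0.0, 0.0], [0.0, 0.0]]  # untrained router, so gates are uniform
mixed, gates = soft_route([1.0, 2.0], router_weights, experts)
```

Because the gates depend only on the feature itself, no camera-identity label is needed to decide how much each expert contributes, which matches the abstract's description of routing without camera-identity supervision.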

2606.17664 2026-06-17 cs.IR cs.AI 交叉投稿

Temporal Preference Optimization for Unsupervised Retrieval

面向无监督检索的时间偏好优化

HyunJin Kim, Jaejun Shim, Young Jin Kim, JinYeong Bak

发表机构 * Microsoft, Redmond, USA(微软公司,美国雷德蒙德) Sungkyunkwan University, Suwon, South Korea(成均馆大学,韩国水原)

AI总结 提出TPOUR方法,通过时间检索偏好优化(TRPO)和可学习时间嵌入插值,使无监督稠密检索器能捕捉时间相关性,在时间信息检索任务上超越有监督和无监督基线。

Comments Accepted to ICML 2026

详情
AI中文摘要

无监督稠密检索器通过对比学习从无标签文档中学习语义相似性,从而提供可扩展性,但它们难以捕捉时间相关性,会检索到语义相关但时间错位的文档。当文档集合跨越多个时间段时(例如,针对“2019年的总统是谁?”检索2018-2025年的文档会引入时间歧义),这一点尤为重要。现有方法依赖于带有显式时间戳的有监督训练,但这并不总是可行的。我们提出TPOUR(面向无监督检索器的时间偏好优化),它使用我们新颖的训练方法时间检索偏好优化(TRPO)。TRPO在时间维度上重新诠释偏好学习,引导检索器偏向时间对齐的文档。TPOUR进一步通过在学习到的时间嵌入中进行插值,泛化到未见的时间段,实现连续的时间对齐。在时间信息检索(T-IR)实验中,TPOUR优于无监督和有监督基线。与Qwen-Embedding-8B相比,尽管规模小约72.7倍,TPOUR Contriever在显式查询上的平均nDCG@5提高了+4.04(+12.15%),在隐式查询上提高了+4.98(+15.21%)。我们的代码可在 https://github.com/agwaBom/TPOUR 获取。

英文摘要

Unsupervised dense retrievers offer scalability by learning semantic similarity from unlabeled documents via contrastive learning, but they struggle to capture temporal relevance, retrieving semantically related but temporally misaligned documents. This matters when a document collection spans multiple time periods (e.g., retrieving documents from 2018-2025 for "Who is the president in 2019?" introduces temporal ambiguity). Existing methods rely on supervised training with explicit timestamps, which is not always feasible. We propose TPOUR (Temporal Preference Optimization for Unsupervised Retriever), which uses our novel training method Temporal Retrieval Preference Optimization (TRPO). TRPO reinterprets preference learning in the temporal dimension, guiding the retriever to favor temporally aligned documents. TPOUR further generalizes to unseen time periods via interpolation in a learned time embedding, enabling continuous temporal alignment. In experiments on temporal information retrieval (T-IR), TPOUR outperforms both unsupervised and supervised baselines. Compared to Qwen-Embedding-8B, despite being about 72.7x smaller, TPOUR Contriever improves average nDCG@5 by +4.04 (+12.15%) on explicit and +4.98 (+15.21%) on implicit queries. We provide our code at https://github.com/agwaBom/TPOUR.
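One way to picture "interpolation in a learned time embedding" is linear interpolation between the embeddings of the two nearest time points seen in training. The abstract does not specify the scheme, so the sketch below is an assumption: the function name, the per-year table, and the clamping outside the known range are all hypothetical and are not taken from the TPOUR code.

```python
def interpolate_time_embedding(year, table):
    """Linearly interpolate an embedding for `year` from a {year: vector} table.

    Hypothetical scheme: years outside the known range are clamped to the nearest endpoint.
    """
    years = sorted(table)
    if year <= years[0]:
        return list(table[years[0]])
    if year >= years[-1]:
        return list(table[years[-1]])
    for lo, hi in zip(years, years[1:]):
        if lo <= year <= hi:
            t = (year - lo) / (hi - lo)  # position between the two known years
            return [(1 - t) * a + t * b for a, b in zip(table[lo], table[hi])]

# Toy two-dimensional embeddings for two years seen during training.
table = {2018: [0.0, 1.0], 2020: [1.0, 0.0]}
unseen = interpolate_time_embedding(2019, table)
```

Here 2019 was never seen, yet it receives an embedding halfway between its neighbours, which is the sense in which interpolation gives continuous coverage of the timeline.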

2606.17678 2026-06-17 cs.CV cs.AI 交叉投稿

See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL

先看后答:基于充分性驱动的强化学习实现视觉证据预对齐

Yilian Liu, Sicong Leng, Guoshun Nan, Junyi Zhu, Jiayu Huang, Minghao Sun, Xuancheng Zhu, Yisong Chen, Zexian Wei, Xiaofeng Tao

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Nanyang Technological University(南洋理工大学) China Telecom(中国电信)

AI总结 提出视觉证据预对齐(VEPA)方法,在预训练与后训练之间引入充分性驱动的GRPO优化,以增强多模态大模型对细粒度视觉证据的利用,显著提升视觉密集型任务性能。

详情
AI中文摘要

多模态大语言模型(MLLMs)将强大的文本推理与视觉输入相结合,但其响应可能与底层图像不一致,表明在推理过程中未能有效利用视觉证据。当前的训练范式依赖于大规模基于标题的预训练进行通用对齐,随后通过监督微调和强化学习实现指令遵循和复杂推理。然而,这种预训练仅提供较弱的视觉基础:简短、粗略的标题使模型偏向显著物体,而忽略了细粒度的视觉证据。本文引入视觉证据预对齐(VEPA),作为预训练与后训练之间的中间阶段,探索一种新颖的充分性驱动目标,结合组相对策略优化(GRPO)来优化基于问题的视觉证据描述。在多种基准上的大量实验表明,我们的VEPA在视觉密集型评估上持续提升性能,并补充了标准的监督后训练。进一步分析表明,这种提升源于增强的、可迁移的视觉基础,而非额外的任务特定训练。

英文摘要

Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects while neglecting fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that explores a novel sufficiency-driven objective with Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that VEPA consistently enhances performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the gain stems from strengthened, transferable visual grounding, rather than from additional task-specific training.

2606.17798 2026-06-17 cs.CV cs.AI 交叉投稿

LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

LiveStarPro: 具有分层记忆的主动式流视频理解用于长时域流

Zhenyu Yang, Kairui Zhang, Bing Wang, Shengsheng Qian, Changsheng Xu

发表机构 * IEEE

AI总结 提出LiveStarPro,通过流验证解码、流因果注意力掩码和树结构分层记忆三个组件,实现长时域流媒体视频的主动理解,在语义正确性和时序误差上分别提升28.9%和降低18.2%。

详情
AI中文摘要

尽管视频大语言模型(Video-LLMs)取得了显著进展,当前的在线架构仍然难以同时处理连续视频流、自主决定何时响应以及保持长时域上下文记忆。这些障碍削弱了实时响应能力,并在长时间交互中导致严重遗忘。在这项工作中,我们引入了LiveStarPro,一个专为长时域流上的主动视频理解而设计的直播助手。LiveStarPro的设计基于三个互补组件。第一个组件是流验证解码(SVeD),一种通过单次困惑度验证识别适当响应时机的推理框架,从而消除了对显式静音标记的依赖。第二个组件是流因果注意力掩码(SCAM),一种训练策略,它在可变长度流上强制实现增量视频-语言对齐。第三个组件是树结构分层记忆(TSHM),一种递归记忆架构,它将驱逐的历史信息组织成事件链,从而能够从有效无界的视频流中高效检索。为了在现实在线条件下促进全面评估,我们进一步提出了OmniStarPro,一个大规模基准测试,涵盖15个多样化的真实世界场景,并扩展到小时级流以评估长期回忆。大量实验表明,LiveStarPro持续超越现有方法,在语义正确性上提升28.9%,时序误差降低18.2%,而其流式键值缓存进一步在相同模型上实现了1.58倍的推理加速。模型和代码公开于 https://github.com/sotayang/LiveStarPro。

英文摘要

Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at https://github.com/sotayang/LiveStarPro.
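
A minimal sketch of a perplexity-based response gate follows. It assumes per-token log-probabilities of a candidate response are already available from a single forward pass; the `should_respond` function and its threshold are hypothetical, and the abstract does not describe SVeD's actual criterion.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def should_respond(token_logprobs, threshold=5.0):
    """Hypothetical gate: speak only when the candidate response is
    confidently predicted (low perplexity); otherwise stay silent."""
    return perplexity(token_logprobs) < threshold


# Toy inputs: a confident candidate and an unconfident one.
confident = [math.log(0.5)] * 4
unconfident = [math.log(0.01)] * 3
```

A gate of this kind needs no dedicated silence token, since staying silent is simply the outcome of a failed check.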

2606.17835 2026-06-17 cs.CL cs.AI eess.AS 交叉投稿

Perceptual compensation for tonal context in self-supervised speech models

自监督语音模型中对声调上下文的感知补偿

James Kirby, Ioana Krehan, Michele Gubian

发表机构 * Institute for Phonetics and Speech Processing, LMU Munich(慕尼黑大学语音与语言处理研究所)

AI总结 通过伪复制普通话声调的感知补偿实验,比较纯自监督预训练模型和微调模型,发现纯预训练模型无补偿证据,而微调模型有部分补偿但未达到人类水平,表明监督目标可能对抽象某些音韵规律是必要的。

Comments Accepted for publication at Interspeech 2026

详情
AI中文摘要

本研究考察了wav2vec2.0架构在多大程度上表现出对音韵上下文的补偿证据。我们对普通话声调进行了感知补偿实验的伪复制,并比较了纯自监督预训练模型和针对普通话ASR微调模型之间的嵌入相似度和探测分类器输出。在纯预训练模型的嵌入相似度中没有发现补偿证据。探测分类器除了预期的逐层分类改进外,还显示出一些补偿证据,但未能复制人类在孤立测试音节上的表现。我们的发现与先前仅通过预训练就能产生对音韵结构敏感性的报告形成对比,并表明监督目标可能是鼓励至少某些类型的音韵规律抽象所必需的。

英文摘要

This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones, and compared the embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. No evidence of compensation was found in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports of sensitivity to phonological structure emerging through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.

2606.17950 2026-06-17 cs.CV cs.AI 交叉投稿

Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model

即插即适应:基于预训练对齐模型的首眼多模态指代消解

Jinghan Wu, Jing Li, Ivor W. Tsang, Xuetao Zhang

发表机构 * State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University(西安交通大学人工智能与机器人研究所人机混合增强智能全国重点实验室) Centre for Frontier AI Research and Institute of High-Performance Computing, Agency for Science, Technology and Research (A*STAR)(新加坡科技研究局前沿人工智能研究中心与高性能计算研究所)

AI总结 提出即插即适应方法,利用预训练的细粒度对齐模型,通过证据理论融合视觉与类别线索,无需目标数据集训练或大型VLLM,在CIN基准上CoNLL F1比专用方法和流行VLLM分别提升5.31%和2.12%。

详情
AI中文摘要

视觉信息有助于解决指代消解中的歧义,带来显著的性能提升。然而,现有的多模态指代消解(MCR)方法在应用前需要使用目标数据集的部分标注数据进行训练,这阻碍了其直接可用性并引发泛化担忧。虽然拥有数十亿参数的视觉-语言大模型(VLLM)提供了有前景的零样本能力,但它们仍然难以获取。其庞大的规模限制了部署能力,且许多模型只能通过付费API访问。在本文中,我们提出了一种即插即适应方法,该方法策略性地适配一个精心预训练的对齐模型,以立即用于MCR任务,旨在消除对稀缺基准数据集的训练或依赖资源密集型VLLM的需求。具体来说,我们首先使用视觉-语言对齐数据集预训练文本与视觉上下文信息之间的细粒度对齐模型。然后,我们通过证据理论融合视觉和类别线索进行相似度聚合,将对齐模型重新用于MCR,从而增强效果。在Coreference Image Narratives (CIN)基准数据集上的实验证明了我们方法的有效性,在CoNLL F1上比最先进的专用方法和流行VLLM分别提高了5.31%和2.12%。我们进一步在掩码CIN数据集上进行鲁棒性测试,并在专门构建的VCR-MCR数据集上进行泛化评估,结果证实了这两种能力。

英文摘要

Visual information helps resolve ambiguity in coreference resolution, leading to notable performance gains. However, existing Multi-modal Coreference Resolution (MCR) methods require training with (partially) annotated data from the target dataset before they can be applied, preventing their direct usability and raising concerns about generalization. While Vision-Language Large Models (VLLMs) with billions of parameters offer promising zero-shot capabilities, they remain largely inaccessible. Their massive size limits deployability, and many are only accessible through paid APIs. In this paper, we propose a plug-and-adapt method that strategically adapts a carefully pre-trained alignment model for immediate use in MCR tasks, designed to eliminate the need for training on scarce benchmark datasets or relying on resource-intensive VLLMs. Specifically, we first pre-train a fine-grained alignment model between textual and visual contextual information using vision-language alignment datasets. We then repurpose the alignment model to MCR through similarity aggregation by fusing visual and categorical cues with evidence theory, thereby enhancing effectiveness. Experiments on the Coreference Image Narratives (CIN) benchmark dataset demonstrate the effectiveness of our method, achieving a 5.31% and 2.12% improvement in CoNLL F1 over SOTA dedicated methods and popular VLLMs, respectively. We further evaluate our method on a masked CIN dataset for robustness testing and on a specially constructed VCR-MCR dataset for generalization assessment, with results confirming both capabilities.
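
Fusing two cues with evidence theory can be sketched with Dempster's rule of combination. This is only an illustration under stated assumptions: the two-hypothesis frame (`link` / `no_link`) and the mass values are hypothetical, and the abstract does not say which combination rule the authors use.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule.

    Mass landing on an empty intersection is treated as conflict and
    normalized away.
    """
    combined = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical frame: are two mentions coreferent ("link") or not?
LINK, NO_LINK = frozenset({"link"}), frozenset({"no_link"})
EITHER = LINK | NO_LINK  # mass left on the whole frame expresses ignorance

visual = {LINK: 0.6, EITHER: 0.4}                    # toy visual cue
categorical = {LINK: 0.5, NO_LINK: 0.2, EITHER: 0.3}  # toy category cue
fused = dempster_combine(visual, categorical)
```

Unlike a plain average of similarity scores, each source here can leave part of its mass on "either", so an uninformative cue does not dilute a confident one.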

2606.18033 2026-06-17 cs.CL cs.AI 交叉投稿

When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

当英语不是最好的老师:跨语言上下文学习中的源语言效应

Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé

发表机构 * SnT, University of Luxembourg(卢森堡大学安全、可靠性与信任跨学科中心) Luxembourg Institute of Science and Technology(卢森堡科学技术研究院)

AI总结 研究跨语言上下文学习(ICL)中源语言选择的影响,发现基于微调的预期在ICL中不成立,提出有效选择源语言的替代启发式方法。

Comments Accepted at 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), co-located with ACL 2026

详情
AI中文摘要

跨语言迁移在多语言自然语言处理中已在监督微调背景下得到广泛探索,其中数据可用性和语言相似性等因素在很大程度上决定了迁移质量。随着该领域转向少样本上下文学习(ICL),人们通常认为微调中的见解会原封不动地延续下来。然而,这一假设尚未经过严格评估,因此如何为跨语言ICL选择源语言的问题仍然悬而未决。我们对ICL中的跨语言迁移进行了广泛的实证研究,涵盖七项任务、六个模型和一组类型多样的语言。我们进一步分析了语言混淆,这是跨语言ICL中生成任务的关键障碍。我们的结果表明,基于微调的传统预期在ICL场景中并不一致适用,并指出了有效选择源语言的替代启发式方法。

英文摘要

Cross-lingual transfer in multilingual NLP has been widely explored in supervised fine-tuning contexts, where factors like data availability and linguistic similarity largely determine transfer quality. As the field shifts toward few-shot In-Context Learning (ICL), it is often presumed that insights from fine-tuning carry over unchanged. Yet this assumption has not been rigorously evaluated, leaving open the question of how to choose source languages for cross-lingual ICL. We conduct a broad empirical study of cross-lingual transfer in ICL spanning seven tasks, six models, and a typologically diverse set of languages. We further analyze language confusion, a key obstacle for generative tasks in cross-lingual ICL. Our results show that conventional fine-tuning-based expectations do not consistently apply in the ICL regime and point to alternative heuristics for selecting source languages effectively.

2606.18156 2026-06-17 cs.CV cs.AI 交叉投稿

ReAge3D: Re-Aging 3D Faces with View Consistency

ReAge3D:具有视角一致性的3D人脸回龄

Libing Zeng, Li Ma, Mingming He, Ning Yu, Paul Debevec, Nima Khademi Kalantari

发表机构 * Texas A&M University(德克萨斯农工大学) Netflix Eyeline Studios

AI总结 提出ReAge3D框架,通过2D扩散模型DiffReaging和中心向外编辑传播策略,实现多视角一致的3D人脸回龄,保持身份和细节,优于现有方法。

详情
AI中文摘要

我们提出了一种新颖的框架,用于实现逼真且可控的3D人脸回龄,生成高度详细、保留身份的结果。现有的3D编辑方法虽然对粗粒度的语义变化有效,但不适合回龄,因为即使回龄2D视图之间的微小不一致也会导致对微妙但感知上重要的年龄相关细节的过度平滑。为了解决这一挑战,我们首先引入了一个基于2D扩散的回龄模型DiffReaging,该模型在合成生成的图像对上训练。我们进一步提出了一种中心向外编辑传播策略,利用该回龄模型重建多视图一致的回龄图像。具体来说,从回龄的正面枢轴视图开始,我们通过扭曲和我们提出的Masked-DiffReaging过程重建其余视图。通过在扩散过程的每一步注入现有内容,Masked-DiffReaging确保重建区域与现有像素保持连贯。由此产生的一致回龄视图集监督回龄3D表示的优化。我们的方法在视觉上和定量上都优于现有的3D编辑技术,能够对3D人脸模型中的年龄变换进行平滑、细粒度的控制。

英文摘要

We present a novel framework for realistic and controllable 3D face re-aging which produces highly detailed, identity-preserving results. Existing 3D editing methods, while effective for coarse semantic changes, are not well suited for re-aging, as even small inconsistencies across re-aged 2D views can lead to over-smoothing of subtle but perceptually important age-related details. To address this challenge, we first introduce a 2D diffusion-based re-aging model, DiffReaging, trained on synthetically generated image pairs. We further propose a center-out editing propagation strategy that leverages this re-aging model to reconstruct multi-view-consistent re-aged images. Specifically, starting from a re-aged frontal pivot view, we reconstruct the remaining views through warping and our proposed Masked-DiffReaging process. By injecting existing content at every step of the diffusion process, Masked-DiffReaging ensures that the reconstructed regions remain coherent with existing pixels. The resulting consistent set of re-aged views supervises the optimization of the re-aged 3D representation. Our method outperforms existing 3D editing techniques both visually and quantitatively, enabling smooth, fine-grained control over age transformations in 3D face models.
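
The idea of injecting existing content at every diffusion step can be sketched as a masked blend applied after each denoising update. This is a simplified illustration, not the authors' Masked-DiffReaging: images are flat lists of floats, `denoise_step` is a hypothetical caller-supplied function, and the known content is injected as-is rather than being re-noised to the current step as inpainting samplers usually do.

```python
def blend(generated, known, mask):
    """Keep known pixels where mask == 1, generated pixels where mask == 0."""
    return [m * k + (1 - m) * g for g, k, m in zip(generated, known, mask)]


def masked_sampling(x, known, mask, denoise_step, num_steps):
    """Run `num_steps` denoising updates, re-injecting known content each step.

    `denoise_step(x, t)` is a placeholder for one reverse-diffusion update.
    """
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
        x = blend(x, known, mask)
    return x


# Toy run: pixel 0 is already known from warping, pixel 1 is synthesized.
halve = lambda x, t: [v * 0.5 for v in x]  # stand-in "denoiser"
result = masked_sampling([8.0, 8.0], [3.0, 3.0], [1, 0], halve, num_steps=3)
```

Because the known pixels are restored after every step, the synthesized region is always denoised in the context of the content that must be preserved.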

2602.09802 2026-06-17 cs.AI cs.CL 版本更新

Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices

大型语言模型会为景观付费吗?从主观选择中推断支付意愿

Manon Reusens, Sofie Goethals, Toon Calders, David Martens

AI总结 研究在旅行助手场景下,通过多分类逻辑模型分析LLM的主观选择,推断其支付意愿并与人类基准比较,发现LLM在属性层面存在系统偏差且高估支付意愿,但通过条件化偏好可改善。

详情
AI中文摘要

随着大型语言模型(LLM)越来越多地部署在旅行辅助和购买支持等应用中,它们常常需要在没有客观正确答案的情况下代表用户做出主观选择。我们在旅行助手背景下研究LLM的决策,通过向模型呈现选择困境,并使用多项逻辑模型分析其响应,推导出隐含的支付意愿(WTP)估计。随后将这些WTP值与经济学文献中的人类基准值进行比较。除了基线设置外,我们还研究了在更现实条件下模型行为的变化,包括提供用户过去选择的信息和基于角色的提示。我们的结果表明,虽然可以从较大的LLM中推导出有意义的WTP值,但它们在属性层面也显示出系统偏差。此外,它们倾向于整体高估人类的WTP,特别是在引入昂贵选项或面向商业的角色时。将模型条件化于对更便宜选项的先前偏好,得出的估值更接近人类基准。总体而言,我们的发现突出了使用LLM进行主观决策支持的潜力和局限性,并强调了在实际部署此类系统时仔细选择模型、设计提示和表示用户的重要性。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users' past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
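
For a multinomial logit model whose utility is linear in price, the implied WTP for an attribute is the negative ratio of the attribute coefficient to the price coefficient. A minimal sketch, with hypothetical coefficient values that are not taken from the paper:

```python
def willingness_to_pay(beta_attribute, beta_price):
    """Implied WTP = -beta_attribute / beta_price for linear-in-price utility.

    Assumes a negative price coefficient (higher price lowers utility).
    """
    if beta_price >= 0:
        raise ValueError("price coefficient must be negative")
    return -beta_attribute / beta_price


# Hypothetical estimates: a room with a view adds 0.8 utility units,
# and each extra currency unit of price removes 0.02 utility units.
wtp_view = willingness_to_pay(beta_attribute=0.8, beta_price=-0.02)
```

With these toy numbers the implied WTP for the view is about 40 currency units. In a study of this kind, such figures are what get compared against human benchmarks.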

2605.30036 2026-06-17 cs.AI cs.CL 版本更新

Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

向机器传授价值观:在LLMs中模拟类人行为

Asaf Yehudai, Naama Rozen, Ariel Gera

发表机构 * The Hebrew University of Jerusalem(耶路撒冷希伯来大学) IBM Research(IBM研究院) Tel-Aviv University(特拉维夫大学)

AI总结 本研究基于心理学价值理论,通过大规模实验(超过500万个问题)评估价值提示的LLMs在价值结构和价值-行为关系上与人类的一致性,并证明引入人类价值分布可增强群体模拟。

Comments We had some disagreement regarding proper attribution; we hope to resolve it soon and upload the paper

详情
AI中文摘要

大型语言模型(LLMs)展示了采用不同角色和身份的能力;然而,它们是否能表现出符合连贯、类人价值结构的行为仍不清楚。在这项工作中,我们借鉴既定的心理学价值理论,在LLMs中诱导类人价值观,并评估它们与人类研究中观察到的模式的一致性。使用经过验证的心理学问卷,我们进行了大规模实验——超过500万个问题——以评估领先LLMs的价值结构和价值-行为关系,并将其与人类进行比较。我们的发现揭示了价值提示的LLMs与人类在两个维度上的强烈一致性。此外,引入人类价值分布增强了价值诱导LLMs的群体模拟。这些发现凸显了价值诱导LLMs作为有效的、基于心理学的模拟人类行为工具的潜力。

英文摘要

Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments (over 5 million questions) to evaluate value structures and value-behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.

2504.11837 2026-06-17 cs.CL cs.AI 版本更新

EmoFSM: A Finite State Machine for Emotional Support Conversation

EmoFSM:一种用于情感支持对话的有限状态机

Yue Zhao, Qingqing Gu, Xiaoyu Wang, Teng Chen, Zhonglin Jiang, Yong Chen, Hongyan Li, Luo Ji

AI总结 针对情感支持对话中长期满意度不足的问题,提出EmoFSM框架,利用有限状态机引导大语言模型进行规划与自我推理,在多个数据集上优于多种基线方法。

Comments 15 pages, 4 figures. PAKDD 2026

详情
AI中文摘要

情感支持对话旨在通过有效对话缓解人们的情感困扰。尽管大语言模型在ESC方面取得了显著进展,但大多数研究可能未从状态模型角度定义图,从而为长期满意度提供了次优解决方案。为解决此问题,我们利用有限状态机在LLM上提出名为EmoFSM的框架。我们的框架允许单个LLM在ESC期间引导规划,并在每个对话轮次中自我推理求助者的情绪、支持策略以及最终回应。在ESC数据集上的大量实验表明,EmoFSM优于许多基线方法,包括直接推理、自我优化(self-refine)、思维链、微调和外部支持方法,甚至那些参数更多的模型。

英文摘要

Emotional support conversation (ESC) aims to alleviate people's emotional distress through effective conversations. Although large language models (LLMs) have made remarkable progress in ESC, most of these studies may not define the diagram from a state-model perspective, thereby providing a suboptimal solution for long-term satisfaction. To address this issue, we leverage the Finite State Machine (FSM) on LLMs, and propose a framework called EmoFSM. Our framework allows a single LLM to bootstrap the planning during ESC, and to self-reason about the seeker's emotion, the support strategy, and the final response at each conversation turn. Extensive experiments on ESC datasets suggest that EmoFSM outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and externally supported methods, even those with many more parameters.
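
A finite state machine for dialogue planning can be sketched as a transition table keyed by (state, event). The state and event names below are hypothetical placeholders, since the abstract does not list EmoFSM's actual states, and in the real framework the LLM itself would decide which event applies at each turn.

```python
# Hypothetical states and events; unknown events leave the state unchanged.
TRANSITIONS = {
    ("explore", "emotion_identified"): "comfort",
    ("comfort", "seeker_calmer"): "advise",
    ("comfort", "new_problem"): "explore",
    ("advise", "new_problem"): "explore",
}


def step(state, event):
    """Return the next state, staying put when no transition is defined."""
    return TRANSITIONS.get((state, event), state)


def run(events, state="explore"):
    """Replay a sequence of events and return the visited states."""
    trace = [state]
    for event in events:
        state = step(state, event)
        trace.append(state)
    return trace


trace = run(["emotion_identified", "seeker_calmer"])
```

Making the states explicit is what lets a single model plan across turns instead of choosing each response in isolation.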

2507.17853 2026-06-17 cs.CV cs.AI 版本更新

Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models

Detail++: 文本到图像扩散模型的免训练细节增强器

Lifeng Chen, Jiner Wang, Zihao Pan, Beier Zhu, Xiaofeng Yang, Chi Zhang

AI总结 提出免训练框架Detail++,通过渐进式细节注入策略分解复杂提示词,利用自注意力布局控制与交叉注意力质心对齐损失,提升多主体复杂提示下的生成质量。

详情
AI中文摘要

文本到图像(T2I)生成的最新进展已带来令人印象深刻的视觉结果。然而,这些模型在处理复杂提示词时仍面临重大挑战,尤其是涉及具有不同属性的多个主体时。受人类绘画过程(先勾勒构图,再逐步添加细节)的启发,我们提出Detail++,一个免训练框架,引入新颖的渐进式细节注入(PDI)策略来解决这一局限。具体来说,我们将复杂提示词分解为一系列简化的子提示词,分阶段引导生成过程。这种分阶段生成利用自注意力的固有布局控制能力,首先确保全局构图,然后进行精确细化。为了实现属性与对应主体的准确绑定,我们利用交叉注意力机制,并进一步在测试时引入质心对齐损失,以减少绑定噪声并增强属性一致性。在T2I-CompBench和新构建的风格组合基准上的大量实验表明,Detail++显著优于现有方法,特别是在涉及多个对象和复杂风格条件的场景中。

英文摘要

Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts, particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first ensure global composition, followed by precise refinement. To achieve accurate binding between attributes and corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.

2508.03250 2026-06-17 cs.CL cs.AI 版本更新

RooseBERT: A New Deal For Political Language Modelling

RooseBERT: 政治语言建模的新政

Deborah Dore, Elena Cabrio, Serena Villata

AI总结 针对政治语言特殊性,提出领域预训练模型RooseBERT,在大型政治辩论语料上训练,在多项政治分析任务中优于通用模型。

详情
AI中文摘要

政治辩论和与政治相关讨论的日益增多,要求定义新颖的计算方法来自动分析此类内容,最终目标是让公民更清晰地了解政治审议。然而,政治语言的特殊性和这些辩论的论证形式(采用隐藏的沟通策略并利用隐含论点)使得这项任务非常具有挑战性,即使是对于当前通用的预训练语言模型(LMs)也是如此。为了解决这个问题,我们引入了一种新颖的预训练语言模型,专门用于政治话语语言,称为RooseBERT。在专业领域上预训练语言模型面临着不同的技术和语言挑战,需要大量的计算资源和大规模数据。RooseBERT是在大型英语政治辩论和演讲语料库(11GB)上训练的。为了评估其性能,我们在多个与政治辩论分析相关的下游任务上对其进行了微调,即立场检测、情感分析、论证成分检测与分类、论证关系预测与分类、政策分类、命名实体识别(NER)。我们的结果显示,在大多数这些任务上,RooseBERT相比通用语言模型有所改进,突显了领域特定预训练如何增强政治辩论分析的性能。我们将RooseBERT发布给研究社区。

英文摘要

The increasing number of political debates and politics-related discussions calls for novel computational methods to automatically analyse such content, with the ultimate goal of making political deliberation clearer to citizens. However, the specificity of political language and the argumentative form of these debates (employing hidden communication strategies and leveraging implicit arguments) make this task very challenging, even for current general-purpose pre-trained Language Models (LMs). To address this, we introduce a novel pre-trained LM for political discourse language called RooseBERT. Pre-training an LM on a specialised domain presents different technical and linguistic challenges, requiring extensive computational resources and large-scale data. RooseBERT has been trained on large political debate and speech corpora (11GB) in English. To evaluate its performance, we fine-tuned it on multiple downstream tasks related to political debate analysis, i.e., stance detection, sentiment analysis, argument component detection and classification, argument relation prediction and classification, policy classification, and named entity recognition (NER). Our results show improvements over general-purpose LMs on the majority of these tasks, highlighting how domain-specific pre-training enhances performance in political debate analysis. We release RooseBERT for the research community.

2509.26476 2026-06-17 cs.CL cs.AI cs.LG cs.PF cs.SE 版本更新

Regression Language Models for Code

代码的回归语言模型

Yash Akhauri, Xingyou Song, Arissa Wongpanich, Bryan Lewandowski, Mohamed S. Abdelfattah

AI总结 提出回归语言模型(RLM),利用冻结的大语言模型编码器直接从文本预测代码执行结果(如内存占用、延迟、神经网络精度等),在多个任务上达到高相关度。

Comments Published in International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

我们研究代码到指标的回归:预测代码执行的数值结果,由于编程语言的开放性,这是一项具有挑战性的任务。虽然先前的方法依赖于繁重且特定领域的特征工程,但我们展示了一个统一的回归语言模型(RLM),使用冻结的LLM编码器可以直接从文本同时预测:(i) 多种高级语言(如Python和C++)代码的内存占用,(ii) Triton GPU内核的延迟,以及(iii) 以ONNX表示的已训练神经网络的精度和速度。特别是,一个基于T5Gemma的较小300M参数RLM在APPS的竞赛编程提交上获得了>0.9的Spearman等级相关系数,而单个统一模型在CodeNet的17种不同语言上获得了>0.5的平均Spearman等级相关系数。此外,RLM在五个经典NAS设计空间上获得了最高平均Kendall-Tau 0.46,这些空间此前由图神经网络主导,并且能同时预测多种硬件平台上的架构延迟。

英文摘要

We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of programming languages. While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) using a frozen LLM encoder can simultaneously predict, directly from text, (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX. In particular, a relatively small 300M parameter RLM based on T5Gemma obtains >0.9 Spearman-rank on competitive programming submissions from APPS, and a single unified model achieves >0.5 average Spearman-rank across 17 different programming languages from CodeNet. Furthermore, the RLM can obtain the highest average Kendall-Tau of 0.46 on five classic NAS design spaces previously dominated by graph neural networks, and simultaneously predict architecture latencies on numerous hardware platforms.
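
The Spearman rank correlation used to score such predictions is the Pearson correlation of the ranks. A minimal, dependency-free sketch with average ranks for ties follows; the toy memory values are hypothetical and not taken from the paper.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(predicted, measured):
    """Spearman rank correlation between predictions and measurements."""
    return pearson(average_ranks(predicted), average_ranks(measured))


# Toy data: predicted vs. measured peak memory (arbitrary units).
rho = spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

Because only the ordering matters, a model can score well on this metric without matching absolute memory or latency values.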

2601.18252 2026-06-17 cs.CV cs.AI cs.LG stat.ML 版本更新

Co-PLNet: A Collaborative Point-Line Network for Prompt-Guided Wireframe Parsing

Co-PLNet: 一种用于提示引导的线框解析的协作点线网络

Chao Wang, Xuanying Li, Cheng Dai, Jinglei Feng, Yuxiang Luo, Hao Qin, Yuqi Ouyang

AI总结 提出点线协作框架Co-PLNet,通过点线提示编码器交换空间线索,并利用交叉引导线解码器增强点线一致性,在Wireframe和YorkUrban数据集上提升线框解析的准确性和鲁棒性。

详情
AI中文摘要

线框解析旨在恢复线段及其连接点,以形成结构化的几何表示,用于同时定位与地图构建(SLAM)等下游任务。现有方法分别预测线和点,并在事后进行调和,导致不匹配和鲁棒性降低。我们提出Co-PLNet,一个点线协作框架,在两个任务之间交换空间线索,其中早期检测通过点线提示编码器(PLP-Encoder)转换为空间提示,该编码器将几何属性编码为紧凑且空间对齐的图。交叉引导线解码器(CGL-Decoder)随后通过基于互补提示的稀疏注意力细化预测,强制点线一致性和效率。在Wireframe和YorkUrban上的实验显示,准确性和鲁棒性持续改进,同时具有有利的实时效率,证明了我们在结构化几何感知中的有效性。我们的代码可在 https://github.com/GalacticHogrider/Co-PLNet 获取。

英文摘要

Wireframe parsing aims to recover line segments and their junctions to form a structured geometric representation useful for downstream tasks such as Simultaneous Localization and Mapping (SLAM). Existing methods predict lines and junctions separately and reconcile them post-hoc, causing mismatches and reduced robustness. We present Co-PLNet, a point-line collaborative framework that exchanges spatial cues between the two tasks, where early detections are converted into spatial prompts via a Point-Line Prompt Encoder (PLP-Encoder), which encodes geometric attributes into compact and spatially aligned maps. A Cross-Guidance Line Decoder (CGL-Decoder) then refines predictions with sparse attention conditioned on complementary prompts, enforcing point-line consistency and efficiency. Experiments on Wireframe and YorkUrban show consistent improvements in accuracy and robustness, together with favorable real-time efficiency, demonstrating our effectiveness for structured geometry perception. Our code is available at https://github.com/GalacticHogrider/Co-PLNet.

2602.14771 2026-06-17 cs.CV cs.AI cs.LG cs.MM cs.NE 版本更新

GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

GOT-JEPA:基于联合嵌入预测架构的通用目标跟踪与模型自适应及遮挡处理

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

AI总结 提出GOT-JEPA框架,通过预测跟踪模型而非图像特征来提升泛化能力,并设计OccuSolver增强遮挡感知,在七个基准上验证了有效性。

Comments Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). This research focuses on learning model adaptation for adverse and dynamic environments, as well as fine-grained occlusion perception for tracking

详情
Journal ref
IEEE Transactions on Circuits and Systems for Video Technology 2026
AI中文摘要

人类视觉系统通过整合当前观测与先前观测信息、适应目标和场景变化、以及精细推理遮挡来跟踪物体。相比之下,最近的通用目标跟踪器通常针对训练目标进行优化,这限制了在未见场景中的鲁棒性和泛化能力,并且它们的遮挡推理仍然粗糙,缺乏对遮挡模式的详细建模。为了解决这些在泛化和遮挡感知方面的局限性,我们提出了GOT-JEPA,一个模型预测预训练框架,将JEPA从预测图像特征扩展到预测跟踪模型。给定相同的历史信息,教师预测器从干净的当前帧生成伪跟踪模型,学生预测器学习从当前帧的损坏版本预测相同的伪跟踪模型。这种设计提供了稳定的伪监督,并明确训练预测器在遮挡、干扰和其他不利观测下产生可靠的跟踪模型,从而提高了对动态环境的泛化能力。基于GOT-JEPA,我们进一步提出了OccuSolver来增强目标跟踪的遮挡感知。OccuSolver调整了一个以点为中心的点跟踪器,用于目标感知的可见性估计和详细的遮挡模式捕获。在跟踪器迭代生成的目标先验条件下,OccuSolver逐步细化可见性状态,增强遮挡处理,并产生更高质量的参考标签,逐步改进后续模型预测。在七个基准上的广泛评估表明,我们的方法有效增强了跟踪器的泛化能力和鲁棒性。

英文摘要

The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.

2603.03485 2026-06-17 cs.CV cs.AI cs.RO 版本更新

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Phys4D: 从视频扩散模型实现细粒度物理一致的4D建模

Haoran Lu, Shang Wu, Songling Liu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

AI总结 提出Phys4D流水线,通过三阶段训练(伪监督预训练、物理监督微调、强化学习校正)从视频扩散模型学习物理一致的4D世界表示,显著提升细粒度时空与物理一致性。

详情
AI中文摘要

最近的视频扩散模型作为大规模生成式世界模型已经取得了令人印象深刻的能力。然而,这些模型通常难以保持细粒度的物理一致性,随时间表现出物理上不合理的动态。在这项工作中,我们提出了 Phys4D,一个从视频扩散模型中学习物理一致的4D世界表示的流水线。Phys4D 采用三阶段训练范式,逐步将外观驱动的视频扩散模型提升为物理一致的4D世界表示。我们首先通过大规模伪监督预训练引导出稳健的几何和运动表示,为4D场景建模奠定基础。然后,我们使用模拟生成的数据进行基于物理的监督微调,强制执行时间一致的4D动态。最后,我们应用基于模拟的强化学习来纠正难以通过显式监督捕获的残留物理违规。为了评估超越外观指标的细粒度物理一致性,我们引入了一套4D世界一致性评估,探测几何一致性、运动稳定性和长期物理合理性。实验结果表明,与外观驱动的基线相比,Phys4D 显著改善了细粒度时空和物理一致性,同时保持了强大的生成性能。我们的项目页面可在 https://sensational-brioche-7657e7.netlify.app/ 获取。

英文摘要

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

2603.22281 2026-06-17 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

ThinkJEPA:赋予潜在世界模型大型视觉-语言推理能力

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

AI总结 提出ThinkJEPA框架,结合密集JEPA分支与稀疏VLM思考者分支,通过分层金字塔表示提取模块,实现细粒度运动建模与长程语义引导,在手部操作轨迹预测任务上超越基线。

Comments 10 pages, 5 figures

详情
AI中文摘要

潜在世界模型(如V-JEPA2)的最新进展展示了从视频观测预测未来世界状态的能力。然而,短观测窗口的密集预测限制了时间上下文,可能导致预测偏向局部低层次外推,难以捕捉长程语义并降低下游效用。相比之下,视觉-语言模型(VLM)通过对均匀采样帧进行推理,提供强大的语义基础和通用知识,但由于计算驱动的稀疏采样、语言输出瓶颈(将细粒度交互状态压缩为文本导向表示)以及适应小规模动作条件数据集时的数据分布不匹配,它们不适合作为独立的密集预测器。我们提出了一种VLM引导的JEPA风格潜在世界建模框架,通过双时间路径结合密集帧动态建模与长程语义指导:一个密集JEPA分支用于细粒度运动和交互线索,以及一个均匀采样的VLM“思考者”分支,具有更大的时间步长以提供知识丰富的指导。为了有效传递VLM的渐进推理信号,我们引入了一个分层金字塔表示提取模块,将多层VLM表示聚合成与潜在预测兼容的指导特征。在手部操作轨迹预测实验上,我们的方法优于强VLM-only基线和JEPA预测器基线,并展现出更鲁棒的长程展开行为。

英文摘要

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

2603.28251 2026-06-17 cs.CV cs.AI 版本更新

DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning

DiffAttn: 基于扩散的驾驶员视觉注意力预测与LLM增强语义推理

Weimin Liu, Qingkun Li, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng

AI总结 提出DiffAttn框架,将驾驶员视觉注意力预测建模为条件扩散去噪过程,结合Swin Transformer、特征融合金字塔和LLM增强语义推理,在四个数据集上达到最先进性能。

详情
AI中文摘要

驾驶员的视觉注意力为预测潜在危险提供关键线索,并直接影响决策和控制操作,其缺失可能危及交通安全。为模拟驾驶员的感知模式并推进智能车辆的视觉注意力预测,我们提出DiffAttn,一种基于扩散的框架,将该任务建模为条件扩散-去噪过程,从而更准确地建模驾驶员注意力。为捕捉局部和全局场景特征,我们采用Swin Transformer作为编码器,并设计了一个解码器,该解码器结合了特征融合金字塔用于跨层交互,以及密集的多尺度条件扩散,以共同增强去噪学习并建模细粒度的局部和全局场景上下文。此外,引入大语言模型(LLM)层以增强自上而下的语义推理,并提高对安全关键线索的敏感性。在四个公共数据集上的大量实验表明,DiffAttn实现了最先进的性能,超越了大多数基于视频、自上而下特征驱动和LLM增强的基线。我们的框架进一步支持可解释的以驾驶员为中心的场景理解,并具有改善智能车辆中座舱人机交互、风险感知和驾驶员状态测量的潜力。

英文摘要

Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, where its absence can compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt Swin Transformer as encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion to jointly enhance denoising learning and model fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and drivers' state measurement in intelligent vehicles.

2606.08402 2026-06-17 cs.CV cs.AI cs.MA 版本更新

SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration

SceneConductor: 基于多智能体编排的单图像3D场景生成

Jeonghwan Kim, Yushi Lan, Yongwei Chen, Hieu Trung Nguyen, Chuanyu Pan, Xingang Pan

发表机构 * Nanyang Technological University(南洋理工大学) University of Oxford(牛津大学) Meshy AI

AI总结 提出多智能体编排框架,将单图像3D场景生成分解为场景初始化、环境构建和多智能体细化三个阶段,并引入几何感知布局预测器,在几何精度、空间一致性和感知真实性上超越现有方法。

详情
AI中文摘要

从单张图像生成完整3D场景需要从本质上模糊的视觉证据中推断全局一致的几何、物体关系和环境上下文。尽管联合布局和网格生成近期取得进展,现有方法通常依赖整体或弱分解的流水线,将许多因素纠缠在一起,需要大量场景级监督,限制了其对复杂真实环境的泛化。我们提出一个多智能体编排框架,将单图像3D场景生成分解为三个结构化阶段:场景初始化、环境构建和多智能体细化。初始化阶段提取图像派生的物体掩码,构建物体级3D表示,并预测初始空间布局以形成粗略3D场景。环境构建阶段随后利用该初始化以及点图几何,构建支撑表面、房间边界、材质和光照的环境支架。最后,在细化阶段,规划器智能体识别结构和视觉不一致性,直接应用简单修正,并派遣专家智能体进行复杂的局部修订,再整合回全局场景。为提供可靠的结构初始化同时减少对场景级标注的依赖,我们进一步引入一个几何感知布局预测器,由点图派生的稀疏几何先验监督。与全监督布局生成器不同,该预测器可从分割级数据训练,并稳健泛化到多样真实场景。在基准数据集上的大量实验表明,我们的方法在几何精度、空间一致性和感知真实性上持续优于先前方法。

英文摘要

Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.

2606.15883 2026-06-17 cs.CL cs.AI 版本更新

Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration

Koshur Diacritizer:用于克什米尔语变音符号恢复的字节级序列到序列模型

Haq Nawaz Malik, Nahfid Nissar, Faizan Iqbal

发表机构 * arXiv

AI总结 针对克什米尔语数字文本中变音符号缺失导致的歧义问题,提出基于ByT5-small的字节级序列到序列模型Koshur Diacritizer,结合脚本感知归一化、对齐验证和骨架保留推理,在测试集上实现DERm 0.2012和WER 0.2159,专家评估准确率77.5%。

详情
AI中文摘要

克什米尔语是一种使用改良的波斯-阿拉伯字母书写的印度-雅利安语言,在数字文本中经常省略变音符号,造成歧义并挑战下游NLP应用。我们提出了Koshur Diacritizer,一个基于ByT5-small的字节级序列到序列模型,用于恢复克什米尔语文本中的变音符号。为支持此任务,我们发布了一个公开可用的数据集,包含23.7k对齐的未变音/变音克什米尔语句对。所提出的框架结合了脚本感知归一化、对齐验证和骨架保留推理,以确保在保持原始基本字母序列的同时进行可靠的恢复。在保留测试集上的实验结果显示,DERm为0.2012,WER为0.2159。此外,由克什米尔语母语语言学专家评估的平均准确率为77.5%。数据集、模型和源代码已公开发布,为克什米尔语变音符号恢复和未来的低资源语言研究提供了可复现的基线。

英文摘要

Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, frequently omits diacritic marks in digital text, creating ambiguity and challenging downstream NLP applications. We present Koshur Diacritizer, a ByT5-small byte-level sequence-to-sequence model for restoring diacritics in Kashmiri text. To support this task, we release a publicly available dataset of 23.7k aligned undiacritized/diacritized Kashmiri sentence pairs. The proposed framework combines script-aware normalization, alignment validation, and skeleton-preserving inference to ensure reliable restoration while maintaining the original base-letter sequence. Experimental results on a held-out test set achieve a DERm of 0.2012 and a WER of 0.2159. Additionally, evaluation by a native Kashmiri linguistic expert yields a mean accuracy of 77.5%. The dataset, model, and source code are publicly released to provide a reproducible baseline for Kashmiri diacritic restoration and future low-resource language research.

7. 机器人与具身智能 22 篇

2606.17897 2026-06-17 cs.AI cs.RO 新提交

Learn to Quantify Social Interaction with Constraints for Pedestrian Walking

学习量化行人行走中的社交互动约束

Xiaodan Shi

发表机构 * Department of Computer and Systems Sciences, Stockholm University(斯德哥尔摩大学计算机与系统科学系)

AI总结 提出Learn to Cluster方法,通过概率潜变量生成模型从轨迹观测中无监督学习社交互动模式,并有效集成到行人轨迹预测中,提升预测鲁棒性。

详情
AI中文摘要

人群中的长期行人路径预测对于自主移动平台(如自动驾驶汽车和社交机器人)避免碰撞并做出高质量规划至关重要。尽管当前研究考虑了社交互动进行预测,但它们并未揭示人与人之间发生的具体社交互动类型以及社交互动如何影响行人的决策过程,这进一步限制了其鲁棒性。行人行走中的社交互动直观上大量存在且难以标注和量化。在本文中,我们通过提出Learn to Cluster创造性地探索量化和解释行人如何与他人互动。我们的聚类社交互动是概率潜变量生成模型,直接从序列轨迹观测中学习,可扩展到任意数量的行人。Learn to Cluster无需标签,可以自然地集成到预测模型的训练过程中。潜变量随后将作为“标签”对社交互动进行分类。在多个轨迹预测基准上的大量实验表明,我们的方法能够学习社交互动的模式,并将这些模式有效集成到行人轨迹预测中。

英文摘要

Long-term human path forecasting in crowds is critical for autonomous moving platforms (such as self-driving cars and social robots) to avoid collisions and plan well. Although current research takes social interactions into account for prediction, it does not reveal the exact kinds of social interactions that occur among people or how those interactions affect pedestrians' decision-making, which limits its robustness. Social interactions in pedestrian walking are intuitively numerous and hard to label and quantify. In this paper, we explore how to quantify and interpret the way pedestrians interact with others by proposing Learn to Cluster. Our social-interaction clustering is a probabilistic latent-variable generative model that learns directly from sequential trajectory observations and scales to an arbitrary number of pedestrians. Learn to Cluster is label-free and can be naturally integrated into the training process of the prediction model. The latent variables then serve as 'labels' to categorize social interactions. Extensive experiments on several trajectory prediction benchmarks demonstrate that our method is able to learn the patterns of social interactions and effectively integrate them into pedestrian trajectory prediction.

2606.18144 2026-06-17 cs.AI cs.CY cs.LG cs.RO 新提交

Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, and the Limits of Doing So

记忆作为消耗性资产:为具身智能体定价闪存耐久性及其局限性

Josef Liyanjun Chen

发表机构 * KAIKAKU

AI总结 本文提出将机器人闪存耐久性视为折旧资本,通过单一影子价格η进行定价,实现成本最优的存储层级分配,并基于真实机器人日志测量价值-写入关联χ的符号,发现其取决于部署场景。

详情
AI中文摘要

机器人的闪存耐久性是一种不可再生资源:每次持久化写入都会消耗数千次编程/擦除周期中的一次,且无法补充,然而目前没有实际部署的机器人内存系统对哪些记忆值得消耗一次擦除周期进行定价。我们将具身记忆视为折旧资本,并用单一耐久性影子价格η对该资源定价,这使得在RAM/板载NVM/云层级中进行成本最小化的放置成为一个在磨损增强的每字节索引中的阈值。无论价值-写入关联χ的符号如何,该索引都是成本最优的;只有当χ>0时,最优解才变为非单调,将机器人最有价值的记忆从闪存中移出。因此,关键点是经验性的,我们在预定义的关口上测量真实机器人日志中的χ:其符号是部署场景的一个属性——在重复的长时域操作中为正(χ̂≈+1.0×10^{-3},在全功率下可复现),在较短时域任务中为零,在非重复遥操作中为负。两个边界限制了该结果。在高端3,000 P/E TLC闪存按数据手册价格计算时,耐久性预算处于休眠状态;而在廉价边缘机器人使用的商用QLC/eMMC(约1,000 P/E)上则具有约束力。当约束生效时,学习到的磨损感知控制器仅在任务价值上与基于价格的路由持平,因为实现的价值在RAM、NVM和云层级之间是不变的:租金决定设备寿命和成本,而非任务性能。磨损感知放置是否能提高任务价值仍是一个开放问题——χ是针对价值代理测量的,而非单调最优解虽已被证明,但尚未在数据中观察到。

英文摘要

A robot's flash endurance is a non-renewable stock: every persisted write spends one of a few thousand program/erase cycles and never refills, yet no fielded robot memory system prices which memories are worth an erase cycle. We treat embodied memory as depreciating capital and price that stock with a single endurance shadow price $η$, which makes cost-minimizing placement across a RAM / on-board NVM / cloud hierarchy a threshold in a wear-augmented per-byte index. The index is cost-optimal whatever the sign of the value-write association $χ$; only when $χ > 0$ does the optimum turn non-monotone, sending a robot's most valuable memories off its flash. The pivot is thus empirical, and we measure $χ$ on real robot logs at a pre-specified gate: its sign is a property of the deployment regime -- positive on recurrent long-horizon manipulation ($\hat{χ} \approx +1.0 \times 10^{-3}$, replicated at full power), null on a shorter-horizon suite, and negative on non-recurrent teleoperation. Two boundaries scope the result. The endurance budget is dormant on premium 3,000-P/E TLC at datasheet prices and binding on the commodity QLC/eMMC ($\sim$1,000 P/E) that cheaper edge robots run. And where it binds, a learned wear-aware controller only ties price-based routing on task value, because realized value is tier-invariant across RAM, NVM, and cloud: the rent governs device lifetime and cost, not task performance. Whether wear-aware placement improves task value remains open -- $χ$ is measured against a value proxy, and the non-monotone optimum, while proven, is not yet observed in data.

2606.18235 2026-06-17 cs.AI 新提交

EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation

EvolveNav: 用于零样本目标导航的主动预反思与自进化记忆

Qi Chai, Wenhao Shen, Nanjie Yao, Yue Xia, Kaiyong Zhao, Jie Ma, Guosheng Lin, Hao Wang

发表机构 * HKUST(GZ)(香港科技大学(广州)) Nanyang Technological University(南洋理工大学) Xi’an Jiaotong University(西安交通大学) XGRIDS(深圳格物智联)

AI总结 提出自进化零样本目标导航框架,通过从历史轨迹提取规则并基于置信上界检索,结合记忆引导预反思模块,减少无效探索,成功率提升10.1%。

详情
AI中文摘要

零样本目标导航(ZS-OGN)要求具身智能体在没有任何先验训练的情况下探索并定位目标物体。为此,近期方法利用基础模型,但它们通常依赖静态先验且缺乏适应性,导致重复错误和代价高昂的试错。本文提出一种自进化的ZS-OGN框架,实现连续的测试时改进。具体而言,我们通过从过去轨迹中提取可操作知识来构建智能体规则记忆。然后,我们提出一种基于置信上界的检索策略,通过平衡语义相关性和历史成功率来选择有效规则。此外,我们引入一个记忆引导的预反思模块,在行动前预测潜在结果,减少低效探索。大量实验表明,我们的方法优于现有的零样本基线,在减少不必要步骤的同时实现了10.1%的成功率提升。

英文摘要

Zero-Shot Object-Goal Navigation (ZS-OGN) requires embodied agents to explore and locate target objects without any prior training. To this end, recent methods leverage foundation models, but they typically rely on static priors and lack adaptation, which leads to repeated errors and costly trial and error. In this paper, we propose a self-evolving ZS-OGN framework that enables continuous test-time improvement. Specifically, we build an agentic rule memory by extracting actionable knowledge from past trajectories. Then, we propose a retrieval strategy based on the upper confidence bound, selecting effective rules by balancing semantic relevance and historical success. In addition, we introduce a memory-guided preflection module that forecasts potential outcomes before action, reducing inefficient exploration. Extensive experiments show that our method outperforms existing zero-shot baselines, achieving a 10.1% improvement in success rate with fewer unnecessary steps.
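
The retrieval idea in this abstract (an upper-confidence-bound rule that balances semantic relevance against historical success) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Rule` fields, the mixing weight `w`, the exploration constant `c`, and the exact scoring formula are hypothetical stand-ins built on the standard UCB1 bonus.

```python
import math
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    relevance: float    # hypothetical: semantic similarity to the current goal, in [0, 1]
    successes: int = 0  # episodes in which this rule was used and the goal was reached
    uses: int = 0       # episodes in which this rule was retrieved


def ucb_score(rule: Rule, total_uses: int, c: float = 1.0, w: float = 0.5) -> float:
    """Assumed score: weighted relevance + empirical success rate + UCB1 bonus."""
    if rule.uses == 0:
        return math.inf  # untried rules are explored first
    mean_success = rule.successes / rule.uses
    bonus = c * math.sqrt(math.log(total_uses) / rule.uses)
    return w * rule.relevance + (1 - w) * mean_success + bonus


def retrieve(rules: list[Rule], k: int = 1, c: float = 1.0, w: float = 0.5) -> list[Rule]:
    """Return the k rules with the highest assumed UCB score."""
    total = max(1, sum(r.uses for r in rules))
    return sorted(rules, key=lambda r: ucb_score(r, total, c, w), reverse=True)[:k]
```

In this sketch a rule that has never been retrieved always ranks first, and among rules with equal usage counts the ordering is decided by the weighted sum of relevance and success rate.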

2606.17082 2026-06-17 cs.RO cs.AI 交叉投稿

ParkingTransformer: LLM-Enhanced End-to-End Trajectory Planning for Autonomous Parking

ParkingTransformer: 基于大语言模型增强的端到端自主泊车轨迹规划

Hauteng Wu, Xu Li, Dong Kong, Zihang Wang, Xieyuanli Chen, Benwu Wang, Wenkai Zhu

发表机构 * School of Instrument Science and Engineering, Southeast University(东南大学仪器科学与工程学院) School of Electronic and Information Engineering, Tongji University(同济大学电子与信息工程学院) College of Transportation, Shandong University of Science and Technology(山东科技大学交通学院) National University of Defense Technology(国防科技大学)

AI总结 提出ParkingTransformer框架,利用多视角感知和大语言模型场景理解能力,结合轨迹查询与隐状态特征,直接输出规划轨迹,无需密集BEV表示,通过3D位置编码、固定窗口流机制和粗到细解码策略提升性能,在CARLA和实车实验中验证有效性。

详情
AI中文摘要

端到端自主泊车已成为自动驾驶领域的关键任务。然而,现有方法存在黑箱特性,缺乏高层语义理解和可解释性,阻碍了从道路到目标点的无缝长距离自主泊车的实现。为解决这些限制,我们提出ParkingTransformer,一种利用多视角感知和大语言模型(LLMs)场景理解能力的新型框架。通过将轨迹查询与LLMs隐状态特征相结合,我们的方法直接与历史信息和原始传感器数据交互以输出规划轨迹,无需密集的鸟瞰图(BEV)表示。为补偿LLMs空间推理能力的不足,我们引入3D位置编码以显式注入空间几何感知。此外,设计了固定窗口流机制用于历史信息处理,显著提高了长期时间处理效率和推理速度。同时,采用粗到细解码策略逐步提升轨迹精度。在CARLA模拟器和真实车辆平台上进行了广泛的闭环实验。结果表明,我们的方法在CARLA模拟器中达到61.32的驾驶分数,在真实实验中平均成功率为88.70%,验证了所提算法的可行性和有效性。

英文摘要

End-to-end autonomous parking has emerged as a critical task within the realm of autonomous driving. However, existing methods suffer from black-box characteristics, lacking high-level semantic understanding and interpretability, which impedes the realization of seamless long-distance autonomous parking from the road to the target spot. To address these limitations, we propose ParkingTransformer, a novel framework that leverages multi-view perception and the scene understanding capability of Large Language Models (LLMs). By combining trajectory queries with LLMs' implicit state features, our method interacts directly with historical information and raw sensor data to output planning trajectories, eliminating the need for dense Bird's-Eye-View (BEV) representations. To compensate for the inadequate spatial reasoning ability of LLMs, we introduce 3D positional encoding to explicitly inject spatial geometric awareness. Furthermore, a fixed-window streaming mechanism is designed for historical information processing, significantly improving long-term temporal processing efficiency and inference speed. Additionally, a coarse-to-fine decoding strategy is employed to progressively enhance trajectory precision. Extensive closed-loop experiments are conducted on the CARLA simulator and real-world vehicle platforms. The results demonstrate that our method achieves a driving score of 61.32 in the CARLA simulator and an average success rate of 88.70% in real-world experiments, validating the feasibility and effectiveness of the proposed algorithms.

2606.17340 2026-06-17 cs.CV cs.AI 交叉投稿

Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation

几何一致的内窥镜表示用于图像引导导航:基于结构化基础模型适配

Hongchao Shu, Roger D. Soberanis-Mukul, Hao Ding, Morgan Ringel, Mali Shen, Saif Iftekar Sayed, Hedyeh Rafii-Tari, Mathias Unberath

发表机构 * Department of Computer Science, Johns Hopkins University(约翰霍普金斯大学计算机科学系) Semaphor Surgical Johnson & Johnson MedTech(强生医疗科技)

AI总结 提出统一框架,结合合成数据管道与层级感知几何语义适配,学习几何一致且领域鲁棒的图像表示,提升单目内窥镜中的位姿估计与深度预测性能。

详情
AI中文摘要

由于深度线索有限、组织纹理弱、非刚性变形以及跨域外观变化大,单目内窥镜中基于视觉的精确导航十分困难,这些问题使得位姿估计、深度预测和图像-解剖对齐复杂化。尽管最近的视觉基础模型显示出潜力,但它们学到的表示往往几何一致性不足,阻碍了稳定的特征对应,限制了其在后续导航任务中的可靠性。我们提出了一个统一框架,用于学习单目内窥镜中几何一致且领域鲁棒的图像表示。该框架结合了提供精确几何监督的合成数据管道与层级感知几何语义适配,后者是标准LoRA的结构化替代方案,在Transformer层级间选择性插入低秩适配器,并配合逐层训练目标,以鼓励中间特征的几何对应和深层特征的语义一致性。在公开和专有数据集上的实验表明,几何和语义表示质量得到提升,从而在包括位姿估计和单目深度估计在内的下游导航任务上取得更好性能。学到的表示在临床支气管镜中显示出良好的合成到真实迁移能力,并为在有限监督下适配鼻窦镜和结肠镜提供了有用的初始化。该框架还显示出随模型大小和训练数据的良好扩展性。这些结果支持层级感知、几何引导的适配作为内窥镜表示学习的实用方法。

英文摘要

Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.

2606.17362 2026-06-17 cs.CV cs.AI cs.LG cs.RO 交叉投稿

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

DriveJudge: 用视觉-语言模型重新思考自动驾驶评估

Xinglong Sun, Kevin Xie, Jenny Schmalfuss, Despoina Paschalidou, Xiuming Zhang, Sanja Fidler, Kashyap Chitta, Jose M. Alvarez

发表机构 * NVIDIA(英伟达)

AI总结 提出DriveJudge,结合规则评估与VLM推理,通过选择性调用物理规则函数实现可解释且上下文感知的驾驶评估,在驾驶质量分类和轨迹偏好选择任务上超越现有方法。

Comments Under Review

详情
AI中文摘要

自动驾驶已转向端到端策略学习,其中可靠、可解释的策略评估是一个基本挑战,因为驾驶质量高度依赖于上下文。常用的基于规则的驾驶指标(如EPDMS)可解释但缺乏上下文感知,而近期基于VLM的评估虽具有上下文感知能力,但受限于模糊的VLM输出和较弱的物理基础。为了以既可解释又上下文感知的方式评估驾驶,我们引入了DriveJudge。DriveJudge是一个驾驶评估代理,它将规则基础评估与视觉-语言模型(VLM)推理相结合,并在解释环境上下文后有选择地调用基于物理的确定性规则函数。为了训练和评估DriveJudge,我们整理了一个包含33,577个具有挑战性的驾驶样本的大规模数据集,并附有人类标注,指示给定场景中的驾驶行为是否合理。利用该数据集,我们解决了驾驶指标评估中未被充分探索的问题,并引入了两个与人类对齐的基准任务:驾驶质量分类和轨迹偏好选择。DriveJudge在驾驶质量分类上比EPDMS高出21.23 AUC,在轨迹偏好选择上比近期基于VLM的DriveCritic高出6.5%,为可解释且精确的驾驶评估设立了新标准。

英文摘要

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.

2606.17386 2026-06-17 cs.CV cs.AI cs.RO 交叉投稿

TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

TerraTransfer: 无需专家示范的端到端驾驶策略学习

Zikang Xiong, Weixin Li, Zhouchonghao Wu, Akshay Rangesh, Saarth Bonde, Grantland Hall, Chen Tang, Yihan Hu, Wei Zhan

发表机构 * Applied Intuition UCLA(加州大学洛杉矶分校) UC Berkeley(加州大学伯克利分校)

AI总结 提出一种无需专家示范的端到端驾驶方法,通过向量化模拟器中的自博弈预训练策略,再与预训练视觉骨干对齐,降低了数据成本并达到或超越现有方法。

详情
AI中文摘要

端到端自动驾驶在基准测试和实际部署中取得了最先进的性能。然而,其标准训练流程在所有阶段都成本高昂:收集和标注数百万驾驶帧代价昂贵,而在图像上进行闭环强化学习受限于每步的光真实感渲染和大视觉骨干的前向传播成本。在向量化模拟器中进行自博弈改变了经济性:每秒数百万次 rollout 步骤,状态分布自然包含碰撞、近碰撞和恢复等驾驶日志中不包含的情况。我们的方法通过解耦学习驾驶和学习视觉来利用这种不对称性。我们通过自博弈预训练单个策略,然后通过动作 KL 散度和批量关系低秩结构损失将其潜在空间与预训练视觉骨干对齐。动作目标来自自博弈策略,因此对齐从未对记录的轨迹进行监督:只需要一个(图像、场景状态)帧的配对数据集,无需模仿预训练所依赖的精心策划的专家示范。在光真实感 3D 高斯泼溅闭环场景中,得到的端到端策略匹配或超越了先前的端到端方法。

英文摘要

End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.

2606.17511 2026-06-17 cs.RO cs.AI cs.CV 交叉投稿

MagicSim: A Unified Infrastructure for Executable Embodied Interaction

MagicSim: 可执行具身交互的统一基础设施

Haoran Lu, Songling Liu, Yue Chen, Guo Ye, Mutian Shen, Shuyang Yu, Yu Xiao, Jihai Zhao, Shang Wu, Jianshu Zhang, Xiangtian Gui, Chuye Hong, Yuran Wang, Maojiang Su, Jiayi Wang, Ruihai Wu, Zhaoran Wang, Han Liu

发表机构 * Northwestern University(西北大学) Peking University(北京大学) University of California, Berkeley(加州大学伯克利分校) ShanghaiTech University(上海科技大学)

AI总结 提出MagicSim,一个基于确定性批处理运行时和共享MDP的具身交互基础设施,通过YAML规范解耦内容、放置、行为和智能体暴露,统一世界构建、执行、评估和自动生成轨迹。

详情
AI中文摘要

机器人学习和具身智能体现在需要模拟作为连接控制、技能和规划的共享执行基底,而不仅仅是渲染器、控制器测试平台或固定任务环境。现有的流水线通过“魔法”动作、脱节的训练环境或仅前向渲染来分割这些层,无法重现、评估和标注同一情节。我们提出MagicSim,一个围绕确定性批处理运行时和共享马尔可夫决策过程(MDP)构建的具身交互基础设施。通过YAML优先的规范解耦内容、放置、行为和智能体暴露,MagicSim在单一重置-步进循环中构建多样化的可执行世界,涵盖任务族、交互模式、物理、布局、传感器、化身和机器人具身。一个通用的执行接口通过控制器、原子技能、规划器原语和异步规划将高级命令具体化,将其实现为机器人动作而非模拟器端的状态编辑。一个任务定义支持三种能力:基准测试和强化学习评估、自动收集接口(自动将命令转化为具体轨迹)以及面向智能体/VLM的交互。对于自动执行,命令流经Command->Skill->Planner->Robot->Record流水线,而每个环境的命令、技能、规划、重试、标注和情节状态在共享物理滴答之上独立推进。成功的展开被保存为结构化的多模态轨迹,将语言监督、动作表示、视觉/几何表示和任务级别状态与执行的情节对齐。因此,MagicSim在一个规划器在环运行时中统一了多样化的世界构建、具身执行、任务评估、自动展开生成和交互式智能体接口。

英文摘要

Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.

2606.17767 2026-06-17 cs.HC cs.AI 交叉投稿

Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection

与你的数据对话:探索具身对话作为个人健康反思的界面

Nikola Kovacevic, Bastien Husler, Di Zhuang, Rafael Wampfler, Barbara Solenthaler

发表机构 * Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系)

AI总结 提出一种通过具身对话代理与可穿戴健康数据交互的新范式,采用双代理设计(观察者提取统计特征,呈现者以“口语化统计”沟通),通过模拟自我用户研究(N=5)与传统仪表盘对比,评估感知理解、行动具体性和认知转变。

详情
Journal ref
Joint Proceedings of the ACM Intelligent User Interfaces (IUI) Workshops 2026, Paphos, Cyprus, July 13-16, 2026
AI中文摘要

来自可穿戴设备的个人健康数据通常通过图表和统计摘要的仪表盘呈现,要求用户主动解读模式和含义。我们探索了一种替代交互范式:通过一个具身对话代理与个人健康数据进行互动,该代理在与用户的对话中促进客观的数据反思。我们提出了一个系统,它将可穿戴数据的轻量级预处理与基于Unity的具身角色相结合。在内部,系统遵循双代理设计,其中观察者代理提取描述性统计和时间趋势,呈现者代理通过“口语化统计”传达这些发现,有意避免临床建议,以隔离交互模态的影响。我们通过一个模拟自我用户研究(N=5)采用被试内设计评估了这种方法。参与者采用来自LifeSnaps数据集的健康角色和目标,比较了传统仪表盘探索与具身对话反思。我们的评估侧重于感知理解、生成行动的具体性,以及从被动观看到主动意义建构的认知转变。本文贡献了一个功能原型、一个客观健康数据叙事生成的设计模式,以及关于具身性如何影响个人健康指标解释的早期实证见解。

英文摘要

Personal health data from wearables are typically presented through dashboards of charts and summary statistics, requiring users to actively interpret patterns and implications. We explore an alternative interaction paradigm: engaging with personal health data through an embodied conversational agent that facilitates objective data reflection in dialogue with the user. We present a system that combines lightweight preprocessing of wearable data with a Unity-based embodied character. Internally, the system follows a dual-agent design in which an Observer agent extracts descriptive statistics and temporal trends, and a Presenter agent communicates these findings through "spoken statistics," intentionally refraining from clinical advice to isolate the impact of the interaction modality. We evaluate this approach through a simulated-self user study (N=5) using a within-subject design. Participants adopted health personas and goals derived from the LifeSnaps dataset to compare traditional dashboard exploration with embodied conversational reflection. Our evaluation focuses on perceived understanding, the specificity of generated actions, and the cognitive shift from passive viewing to active sensemaking. The paper contributes a functional prototype, a design pattern for objective health data narrative generation, and early empirical insights into how embodiment affects the interpretation of personal health metrics.

2606.17924 2026-06-17 cs.RO cs.AI 交叉投稿

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

PearlVLA:潜在空间中的渐进式具身动作计划精炼

Bochen Yang, Lianlei Shan

发表机构 * Imperial College London(帝国理工学院) Tsinghua University(清华大学)

AI总结 提出PearlVLA框架,通过在VLM潜在空间中进行迭代计划精炼,平衡动作生成效率与显式推理,在LIBERO基准上达到最先进性能。

Comments 21 pages, 2 figures. Preprint

详情
AI中文摘要

当前的视觉-语言-动作(VLA)模型在高效动作生成与显式推理之间存在权衡。直接从视觉-语言骨干表示解码动作可实现低延迟控制,而通过文本链、像素级子目标或动作搜索进行显式推理可以改善规划,但会带来大量延迟和计算成本。我们提出PearlVLA,一个将推理转移到视觉-语言模型(VLM)潜在空间中的VLA框架。PearlVLA将VLM元查询表示分离为固定的视觉接地分支和迭代的潜在计划分支。在每个精炼轮次中,一个计划条件的世界查询探测一个轻量级冻结的潜在世界模型,以获取无动作的未来观察潜在表示,该表示被反馈以指导计划精炼。然后,一个未来引导的RefineNet应用计划的残差更新,逐步将粗糙的语义草稿精炼为细粒度的潜在动作计划。经过K轮精炼后的计划被并行解码为动作块,用于低延迟执行。我们进一步引入因果精炼分组过程奖励强化学习,以优化潜在精炼过程,奖励来自由潜在计划编辑引起的更长视野想象未来。在LIBERO基准上的实证评估表明,PearlVLA在现有方法中达到了最先进的性能。

英文摘要

Current Vision-Language-Action (VLA) models face a trade-off between efficient action generation and explicit deliberation. Directly decoding actions from vision-language backbone representations enables low-latency control, whereas explicit reasoning through textual chains, pixel-level subgoals, or action search can improve planning but incurs substantial latency and computational cost. We propose PearlVLA, a VLA framework that moves deliberation into the latent space of a vision-language model (VLM). PearlVLA separates VLM meta-query representations into a fixed visual grounding branch and an iterative latent plan branch. At each refinement round, a plan-conditioned world query probes a lightweight frozen latent world model for an action-free future observation latent, which is fed back to guide plan refinement. A future-guided RefineNet then applies scheduled residual updates to progressively refine a coarse semantic draft into a fine-grained latent action plan. The refined plan after K rounds is then decoded in parallel into an action chunk for low-latency execution. We further introduce Causal Refinement-Grouped Process-Reward RL to optimize the latent refinement process with rewards from longer-horizon imagined futures induced by latent plan edits. Empirical evaluations on the LIBERO benchmark demonstrate that PearlVLA achieves state-of-the-art performance among existing methods.

2606.18092 2026-06-17 cs.RO cs.AI 交叉投稿

EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning

EAGG: 通过几何感知图条件实现具身对齐的抓取生成

Wanhao Niu, Qiyan Ke, Yuan Sun, Hao Sun, Jie Xu, Muyuan Ma, Ruiqi Hu, Fuchun Sun

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Beijing Moce Future Technology Co., Ltd.(北京墨策未来科技有限公司)

AI总结 提出EAGG,一种通过拓扑感知末端执行器图和几何感知令牌实现跨末端执行器抓取生成的统一模型,在MultiGripperGrasp基准上达到56.17%平均成功率,并显著降低接触距离。

Comments 16 pages, 8 figures. Code is available at https://github.com/wanhaoniu/EAGG

详情
AI中文摘要

跨末端执行器抓取生成旨在寻求一个统一的模型,能够泛化到不同物体以及从平行夹爪到灵巧末端执行器的不同具身形态。现有的抓取生成器通常针对固定具身设计,或使用静态描述符编码具身身份,当拓扑结构、驱动耦合和接触几何差异较大时,这会削弱迁移能力。我们提出EAGG,一种具身对齐的抓取生成器,通过拓扑感知的末端执行器图和具身特定的低维末端执行器控制空间来表示每个具身。一个冻结的末端执行器认知骨干将当前关节状态转换为几何感知令牌,作为可复用的形态先验,并通过迭代几何注入在采样过程中刷新这些令牌,使条件与不断演变的末端执行器几何保持同步。在MultiGripperGrasp基准上,EAGG在六个训练末端执行器上达到56.17%的平均成功率,与专门训练的差距在1.10个百分点以内,同时保持对微调和零样本末端执行器的迁移能力。迭代几何注入进一步将合并中位接触距离从0.239厘米降低到0.189厘米。这些结果表明,通过在共享生成器内对齐具身结构而非抑制具身差异,可以增强跨末端执行器抓取生成。代码可在 https://github.com/wanhaoniu/EAGG 获取。

英文摘要

Cross-end-effector grasp generation seeks a unified model that generalizes across objects and across embodiments ranging from parallel grippers to dexterous end effectors. Existing grasp generators are typically designed for a fixed embodiment or encode embodiment identity with a static descriptor, which weakens transfer when topology, actuation coupling, and contact geometry differ substantially. We present EAGG, an embodiment-aligned grasp generator that represents each embodiment with a topology-aware end-effector graph and an embodiment-specific low-dimensional end-effector control space. A frozen end-effector-cognition backbone converts the current articulated state into geometry-aware tokens that act as a reusable morphology prior, and iterative geometry injection refreshes these tokens throughout sampling so that conditioning remains synchronized with the evolving end-effector geometry. On the MultiGripperGrasp benchmark, EAGG reaches 56.17% average success across six training end effectors, remaining within 1.10 percentage points of specialized training while preserving transfer to finetuning and zero-shot end effectors. Iterative geometry injection further reduces the pooled median contact distance from 0.239 cm to 0.189 cm. These results show that cross-end-effector grasp generation is strengthened by aligning embodiment structure inside a shared generator rather than suppressing embodiment differences. Code is available at https://github.com/wanhaoniu/EAGG.

2606.18247 2026-06-17 cs.RO cs.AI 交叉投稿

Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

视觉验证实现推理时引导与自主策略改进

Mingtong Zhang, Dhruv Shah

发表机构 * Princeton University(普林斯顿大学)

AI总结 提出VERITAS框架,利用预训练通用机器人策略作为生成器,结合无梯度视觉验证器在推理时评估动作,实现无需额外训练的推理时策略引导和离线策略改进。

Comments Website: https://veritas-improvement.github.io

详情
AI中文摘要

部署在现实世界中的机器人应从经验中学习并随时间改进。这需要一个实践并从反馈中学习的机制。在本文中,我们提出VERITAS,一个用于通用机器人策略的生成器-验证器框架,用于推理时策略引导和自我改进。我们使用预训练的通用机器人策略作为“生成器”,并将其与一个无梯度的“视觉验证器”配对,该验证器在推理时评估动作。该框架实现了推理时引导,无需额外训练即可提高策略性能。我们证明,推理时验证在无需额外演示数据训练的情况下,始终优于普通通用策略。此外,我们证明验证后的 rollout 为离线策略改进提供了有效的监督:在验证后的自生成轨迹上微调的策略实现了持续的性能提升。值得注意的是,我们发现使用验证后的 rollout 进行后训练达到了与专家演示相当的效率,同时无需人工干预。我们的结果突出了推理时验证作为一种实用且可扩展的机制,用于在部署期间改进机器人策略。

英文摘要

Robots deployed in the real world should learn from their experience and improve over time. This requires a mechanism for practicing and learning from feedback. In this paper, we propose VERITAS, a generator-verifier framework for generalist robot policies for inference-time policy steering and self-improvement. We use a pre-trained generalist robot policy as a "generator" and pair it with a gradient-free "visual verifier" that evaluates actions at inference time. This framework enables inference-time steering that improves policy performance without additional training. We demonstrate that inference-time verification consistently outperforms vanilla generalists without training on additional demonstration data. Additionally, we demonstrate that the verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified self-generated trajectories achieve consistent performance gains. Notably, we find that post-training with verified rollouts achieves comparable efficiency to expert demonstrations, while requiring no human interventions. Our results highlight inference-time verification as a practical and scalable mechanism for improving robotic policies during deployment.

2606.16533 2026-06-17 cs.AI cs.CV 版本更新

Kairos: A Native World Model Stack for Physical AI

Kairos: 面向物理AI的原生世界模型栈

Kairos Team, Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi, Feng Lv, Xiaoming Wu, Zeyu Liu, Cong Wan, Pu Li, Ruiqing Yang, Xiaoou Li, Wei Wang, Kangkang Zhu, Yuwei Zhang, Shi Fu, Zheng Zhang, Xiaoning Wu, Xuzeng Fan, Dacheng Tao, Xiaogang Wang

发表机构 * Kairos Team(Kairos团队)

AI总结 提出Kairos原生世界模型栈,通过跨具身数据课程、混合线性时间注意力架构和部署感知系统协同设计,实现世界知识获取、长时程状态保持与高效执行,在具身世界模型等基准上达到顶级性能。

详情
AI中文摘要

世界模型正从被动视觉生成器转变为物理AI的基础性、可操作基础设施:它们必须从异构经验中原生获取世界知识,在长时间跨度内维持持久状态,并在实际部署约束下高效执行。我们引入Kairos,一个围绕这些需求设计的原生世界模型栈。(1) Kairos通过开创由跨具身数据课程指导的原生预训练范式来学习世界,该课程将开放世界视频、人类行为数据和机器人交互组织成渐进式发展路径。(2) Kairos通过配备混合线性时间注意力的原生统一架构来维持世界,该架构中滑动窗口注意力捕捉局部动态,扩张滑动窗口捕捉中程依赖,门控线性注意力维持持久全局记忆。我们建立了形式化理论界限,证明这种时间分解严格限制了误差累积,从数学上保证了跨扩展时间范围的状态传播。(3) Kairos通过整合部署感知系统协同设计来运行世界,支持在服务器和消费级硬件上为真实世界的观察-行动-反馈循环生成低延迟展开。在具身世界模型、长时程和动作策略基准上的实验表明,Kairos在实现顶级性能的同时提供了强大的效率-能力权衡。这些结果共同将Kairos定位为未来自进化物理智能的凝聚性操作基础。

英文摘要

World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.
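The two window-based attention patterns named in the abstract can be illustrated with boolean masks. This is a minimal sketch under stated assumptions, not the Kairos implementation: the function names, the causal masking, and the `window` and `dilation` parameters are all hypothetical, and the gated linear attention branch is not shown.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Assumption: attention is causal and covers the last `window` positions,
    including the current one.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def dilated_window_mask(seq_len: int, window: int, dilation: int) -> List[List[bool]]:
    """Causal window that looks back in strides of `dilation`.

    Each query still sees at most `window` keys, but they span
    `window * dilation` time steps, which reaches further into the past.
    """
    return [
        [0 <= i - j < window * dilation and (i - j) % dilation == 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    for row in sliding_window_mask(4, 2):
        print(["x" if keep else "." for keep in row])
    for row in dilated_window_mask(6, 2, 2):
        print(["x" if keep else "." for keep in row])
```

With equal `window`, the dilated mask trades temporal resolution for reach, which is one common reading of "mid-range dependencies"; how Kairos actually combines the branches is not stated in the abstract.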

2308.14329 2026-06-17 cs.RO cs.AI 版本更新

SSIL: Self-Supervised Imitation Learning for End-to-End Driving

SSIL: 用于端到端驾驶的自监督模仿学习

Jin Bok Park, Jinkyu Lee, Muhyun Back, Hyun Min Han, Tianwei Ma, Sang Min Won, Sung Soo Hwang, Il Yong Chun

AI总结 提出自监督模仿学习框架SSIL,利用车辆位姿生成伪转向角数据,无需驾驶命令或预训练模型,结合交叉注意力条件方法CACA,在三个基准数据集上达到与监督学习相当的驾驶精度。

Comments 8 pages, 4 figures

详情
AI中文摘要

在自动驾驶中,直接从传感器数据预测车辆控制信号的端到端(E2E)驾驶方法正迅速受到关注。为了学习安全的E2E驾驶系统,需要大量的驾驶数据和人工干预。车辆控制数据由数小时的人类驾驶构建,构建大型车辆控制数据集具有挑战性。通常,公开可用的驾驶数据集是在有限的驾驶场景下收集的,而收集车辆控制数据仅由车辆制造商提供。为了解决这些挑战,本文提出了首个用于E2E驾驶的自监督学习框架——自监督模仿学习(SSIL)。所提出的SSIL框架可以在不使用驾驶命令数据或预训练模型的情况下学习基于视觉的E2E驾驶网络。为了构建伪转向角数据,提出的SSIL从当前和先前时间点通过激光雷达传感器估计的车辆位姿预测伪目标。此外,我们提出了一种新的基于交叉注意力的条件方法(CACA),用于E2E驾驶中的视觉编码器,其中高级指令作为视觉信息的条件信号。我们在三个不同基准数据集上的数值实验表明,所提出的SSIL框架实现了与监督学习对应方法非常相当的E2E驾驶精度。此外,所提出的伪标签预测器优于使用比例积分微分控制器的现有方法,并且所提出的CACA在现有条件方法中实现了优越的性能。

英文摘要

In autonomous driving, the end-to-end (E2E) driving approach, which predicts vehicle control signals directly from sensor data, is rapidly gaining attention. Learning a safe E2E driving system requires an extensive amount of driving data and human intervention. Vehicle control data is built from many hours of human driving, so constructing large vehicle control datasets is challenging. Publicly available driving datasets are often collected over a limited range of driving scenes, and vehicle control data can be collected only by vehicle manufacturers. To address these challenges, this paper proposes the first self-supervised learning framework for E2E driving, Self-Supervised Imitation Learning (SSIL). The proposed SSIL framework can learn vision-based E2E driving networks without using driving command data or a pre-trained model. To construct pseudo steering angle data, the proposed SSIL predicts a pseudo target from the vehicle's poses at the current and previous time points, which are estimated with light detection and ranging sensors. In addition, we propose a new cross-attention-based conditioning approach (CACA) for a vision encoder in E2E driving, where a high-level instruction serves as the conditioning signal for visual information. Our numerical experiments with three different benchmark datasets demonstrate that the proposed SSIL framework achieves E2E driving accuracy comparable to that of its supervised learning counterpart. Furthermore, the proposed pseudo-label predictor outperforms an existing predictor based on a proportional-integral-derivative controller, and the proposed CACA achieves superior performance over existing conditioning approaches.

2506.17639 2026-06-17 cs.RO cs.AI 版本更新

RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models

RLRC:基于强化学习的压缩视觉-语言-动作模型恢复

Yuxuan Chen, Yixin Han, Yize Huang, Xiao Li

AI总结 提出RLRC三阶段压缩恢复流程,通过结构化剪枝、SFT和强化学习恢复以及量化,实现8倍内存减少和2.3倍推理加速,同时保持任务成功率。

Comments 8 pages, 10 figures; accepted by RA-L 2026

详情
Journal ref IEEE Robotics and Automation Letters, vol. 11, no. 7, pp. 8864-8871, July 2026
AI中文摘要

视觉-语言-动作模型(VLA)在复杂机器人操作中展示了卓越的能力和巨大潜力。然而,其庞大的参数规模和高推理延迟阻碍了实际部署,尤其是在资源受限的平台上。为此,我们对VLA的模型压缩进行了系统的实证研究。基于这些见解,我们提出了RLRC,一个三阶段压缩和恢复流程,包括结构化剪枝、通过SFT和RL进行性能恢复,以及后续量化。RL阶段引入了评论家预热策略和BC损失正则化,以稳定训练并保持策略行为。RLRC实现了高达8倍的内存减少和2.3倍的推理加速,同时保持原始任务成功率。在多个VLA骨干网络上的大量实验表明,RLRC始终优于现有的压缩基线,突显了其在设备端部署的有效性。项目网站:https://rlrc-vla.github.io

英文摘要

Vision-Language-Action (VLA) models have demonstrated remarkable capabilities and strong potential in complex robotic manipulation. However, their large parameter sizes and high inference latency hinder real-world deployment, especially on resource-constrained platforms. To address this, we conduct a systematic empirical study of model compression for VLAs. Building on these insights, we present RLRC, a three-stage compression and recovery pipeline consisting of structured pruning, performance recovery via SFT and RL, and subsequent quantization. The RL stage incorporates a critic warm-up strategy and BC loss regularization to stabilize training and preserve policy behavior. RLRC achieves up to an 8x memory reduction and a 2.3x inference speedup while maintaining the original task success rate. Extensive experiments across multiple VLA backbones show that RLRC consistently outperforms existing compression baselines, highlighting its effectiveness for on-device deployment. Project website: https://rlrc-vla.github.io

2509.26633 2026-06-17 cs.RO cs.AI cs.LG cs.SY eess.SY 版本更新

OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction

OmniRetarget:面向人形全身运动操控与场景交互的交互保持数据生成

Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C. Karen Liu, Rocky Duan, Guanya Shi

AI总结 提出OmniRetarget引擎,通过交互网格显式建模并保持智能体、地形和物体间的空间与接触关系,将人类运动重定向为机器人运动,生成高质量轨迹以训练强化学习策略,实现长时间跑酷和操控技能。

Comments Project website: https://omniretarget.github.io

详情
AI中文摘要

教授人形机器人复杂技能的主流范式是将人类运动重定向为运动学参考,以训练强化学习(RL)策略。然而,现有的重定向流程常常难以应对人与机器人之间的显著具身差异,产生物理上不可信的伪影,如脚滑和穿透。更重要的是,常见的重定向方法忽略了对于表达性运动及运动操控至关重要的丰富的人-物和人-环境交互。为解决这一问题,我们引入了OmniRetarget,一种基于交互网格的交互保持数据生成引擎,该网格显式建模并保持智能体、地形和操作对象之间的关键空间与接触关系。通过最小化人体与机器人网格之间的拉普拉斯变形同时施加运动学约束,OmniRetarget生成运动学上可行的轨迹。此外,保持任务相关的交互使得从单一示范到不同机器人本体、地形和物体配置的高效数据增强成为可能。我们通过将来自OMOMO、LAFAN1和我们内部MoCap数据集的运动进行重定向,全面评估了OmniRetarget,生成了超过8小时的轨迹,这些轨迹在运动学约束满足和接触保持方面优于广泛使用的基线。这种高质量数据使得本体感觉RL策略能够在Unitree G1人形机器人上成功执行长达30秒的长时间跑酷和运动操控技能,且仅使用5个奖励项和所有任务共享的简单域随机化进行训练,无需任何学习课程。

英文摘要

A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8 hours of trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum.

2605.05172 2026-06-17 cs.RO cs.AI 版本更新

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

当生活给你行为克隆,就做Q函数:从行为克隆中提取Q值用于机器人强化学习

Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, Robin Walters, Karl Schmeckpeper, Thomas Weng

发表机构 * Rai-Inst

AI总结 提出Q2RL算法,通过从行为克隆策略中提取Q函数并利用Q门控切换策略,实现高效的离线到在线强化学习,在机器人操作任务中达到100%成功率和3.75倍提升。

Comments Robotics: Science and Systems, 2026

详情
AI中文摘要

行为克隆(BC)已成为机器人学习的一种高效范式。然而,BC在收集演示后缺乏自我引导的在线改进机制。现有的离线到在线学习方法常常由于离线数据与在线学习之间的分布不匹配,导致策略替换先前学习的好动作。在这项工作中,我们提出了Q2RL(从BC进行Q估计和Q门控用于强化学习),一种高效的离线到在线学习算法。我们的方法包括两部分:(1)Q估计通过与环境的少量交互步骤从BC策略中提取Q函数,然后进行在线RL;(2)Q门控根据各自的Q值在BC和RL策略动作之间切换,以收集用于RL策略训练的样本。在D4RL和robomimic基准测试的操作任务中,Q2RL在成功率和收敛时间上优于最先进的离线到在线学习基线。Q2RL足够高效,可应用于机器人上的RL设置,在1-2小时的在线交互中学习接触密集和高精度操作任务(如管道组装和套件装配)的鲁棒策略,成功率达到100%,相比原始BC策略提升高达3.75倍。代码和视频见https://pages.rai-inst.com/q2rl_website/

英文摘要

Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
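The Q-gating step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `q_gate`, `bc_policy`, `rl_policy`, and `q_fn` are hypothetical, a single shared Q-function is assumed, and the tie-breaking rule (prefer the BC action) is a guess.

```python
from typing import Callable, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]


def q_gate(
    state: State,
    bc_policy: Callable[[State], Action],
    rl_policy: Callable[[State], Action],
    q_fn: Callable[[State, Action], float],
) -> Tuple[Action, str]:
    """Return whichever proposed action has the higher estimated Q-value."""
    a_bc = bc_policy(state)
    a_rl = rl_policy(state)
    # Assumption: ties go to the BC action, so previously learned behaviour
    # is kept unless the RL action is estimated to be strictly better.
    if q_fn(state, a_rl) > q_fn(state, a_bc):
        return a_rl, "rl"
    return a_bc, "bc"


if __name__ == "__main__":
    # Toy 1-D example: this hypothetical Q-function prefers actions near 0.5.
    def toy_q(state: State, action: Action) -> float:
        return -abs(action[0] - 0.5)

    action, source = q_gate([0.0], lambda s: [0.4], lambda s: [0.9], toy_q)
    print(action, source)
```

In an online loop, the chosen action would be executed and the resulting transition stored for RL policy training, which matches the abstract's description of gating as a way to collect samples.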

2605.23733 2026-06-17 cs.RO cs.AI 版本更新

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Any2Any: 高效跨本体迁移用于人形机器人全身跟踪

Ming Yang, Tao Yu, Feng Li, Hua Chen

发表机构 * LimX Dynamics(LimX动力学)

AI总结 提出Any2Any范式,通过运动学对齐和动力学微调,实现预训练全身跟踪模型高效迁移至新的人形机器人本体,仅需少量数据和计算即可达到竞争性跟踪性能。

详情
AI中文摘要

全身跟踪(WBT)模型已成为人形机器人的关键基础,使其能够高保真地模仿各种运动。从头训练此类模型需要大规模数据和计算,使得在新人形平台上快速部署成本高昂。这自然引发一个问题:预训练的WBT模型能否通过最小化适应跨本体迁移?为回答这个问题,我们提出Any2Any,一种范式,能够高效地将现有WBT专家迁移到新人形本体,仅需少量数据和计算。Any2Any首先在源和目标人形之间进行运动学对齐,对齐其输入和输出空间,使得预训练的源策略可以在目标本体上有意义地重用。然后,Any2Any通过向选定的动力学敏感模块应用轻量级参数高效微调(PEFT)组件进行动力学适应,保留有用的行为先验,同时实现对目标机器人的定向适应。在多个人形平台和预训练骨干上的大量实验表明,与从头训练相比,Any2Any显著加速收敛并降低训练成本,同时实现具有竞争力或更优的跟踪性能。值得注意的是,仅使用完整训练所需计算和数据的1%,Any2Any成功将在Unitree G1上预训练的Sonic模型迁移到LimX Oli和LimX Luna。这些结果表明,预训练的WBT专家可以跨本体高效重用,为在新机器人上部署人形全身控制提供可扩展的路径。

英文摘要

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots.

2605.31286 2026-06-17 cs.RO cs.AI 版本更新

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

DeMaVLA:面向可泛化可变形物体操作的视觉-语言-动作基础模型

Taiyi Su, Jian Zhu, Tianjian Wang, Youzhang He, Zitai Huang, Jianjun Zhang, Chong Ma, Hanyang Wang, Tianjiao Zhang, Munan Yin, Weihao Ding, Yi Xu

发表机构 * Tongji University(同济大学)

AI总结 提出DeMaVLA模型,采用VLM骨干与动作专家结合流匹配生成连续动作,通过剪枝Transformer层提升效率,并利用大规模真实世界数据和人类反馈数据聚合训练,实现可变形物体折叠操作的多类别泛化。

Comments 14 pages, 2 figures

详情
AI中文摘要

现实家庭机器人需要视觉-语言-动作(VLA)基础模型,能够在不同物体、任务条件和家庭环境中获取可重复使用的操作技能。可变形物体折叠是一个代表性挑战,要求机器人处理来自随机初始状态的衣物,涉及不同类别、几何形状、材料和场景。然而,现有的VLA系统通常为不同物体类别训练独立的策略,而简单混合的多任务训练常常遭受任务干扰和性能下降。为了超越类别特定的折叠策略,我们引入了DeMaVLA,一个面向可泛化可变形物体操作的VLA基础模型。DeMaVLA采用VLM骨干网络和动作专家,并使用流匹配来公式化连续动作生成。为了提高效率,动作专家通过剪枝每隔一个Transformer层构建,同时保持与VLM骨干网络的逐层对齐,从而降低训练和推理成本。DeMaVLA首先在大约5000小时精选的真实世界双臂演示数据上进行预训练,以获得通用的操作先验。然后,它在混合折叠数据上进行后训练,这些数据通过人类参与的数据聚合(DAgger)流程,聚合了自我收集的演示和来自多个折叠任务中真实机器人失败的纠正轨迹。实验表明,DeMaVLA在RoboTwin 2.0上取得了有竞争力的性能,并在我们的家庭折叠基准测试中取得了强大的真实世界结果。这些结果突显了可扩展的真实世界数据、高效的动作生成和纠正学习对于可变形物体操作中的通用VLA策略的价值。

英文摘要

Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation (DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin 2.0 and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.

2606.14438 2026-06-17 cs.RO cs.AI 版本更新

CADET: Physics-Grounded Causal Auditing and Training-Free Deconfounding of End-to-End Driving Planners

CADET: 基于物理的因果审计与无训练去混杂的端到端驾驶规划器

Zikun Guo

发表机构 * School of Electronics Engineering, Kyungpook National University(庆北国立大学电子工程学院)

AI总结 提出CADET框架,无需重新训练即可审计和修复预训练端到端驾驶规划器中的虚假关联,通过物理因果图识别混杂因素并干预测试时输入。

Comments 8 pages, 4 figures

详情
AI中文摘要

通过模仿学习训练的端到端自动驾驶规划器容易产生统计捷径:它们将仅与专家动作共现的场景元素(如路边物体、建筑立面)与驾驶决策关联,而非因果决定驾驶的变量。这种因果混淆在长尾场景中悄然损害可靠性,且难以检测,因为常见的开环指标(L2位移和碰撞率)受自车状态主导,无法指示规划器是否依赖虚假线索。现有的基于因果干预训练的修复方法需要重新训练大型模型,且无法审计已部署的规划器。我们提出CADET,一个无需训练的框架,可以在不更新任何参数的情况下审计、基准测试和修复预训练端到端规划器中的虚假依赖。

英文摘要

End-to-end (E2E) autonomous-driving planners trained by imitation are prone to statistical shortcuts: they associate scene elements that merely co-occur with expert actions (a roadside object, a building facade) with driving decisions, rather than the variables that causally determine them. Such causal confusion silently compromises reliability in long-tail scenarios, and it is difficult to detect, because prevailing open-loop metrics (L2 displacement and collision rate) are dominated by ego status and do not indicate whether a planner depends on spurious cues. Existing remedies based on causal-intervention training require retraining large models and cannot audit a planner that is already deployed. We present CADET, a training-free framework that audits, benchmarks, and repairs spurious reliance in pretrained E2E planners without any parameter update.

2606.14551 2026-06-17 cs.RO cs.AI 版本更新

TRACE: Trajectory-Routed Causal Memory for Delayed-Evidence Visuomotor Imitation

TRACE: 用于延迟证据视觉运动模仿的轨迹路由因果记忆

Zihao Li, Ranpeng Qiu, Yincong Chen, Guoqiang Ren, Weiming Zhi

发表机构 * Zeno AI Zhejiang University(浙江大学) Zhejiang University of Technology(浙江工业大学) The University of Sydney(悉尼大学)

AI总结 针对视觉运动模仿中早期线索消失导致观察歧义的问题,提出TRACE记忆框架,利用路径签名存储和检索任务相关证据,在长周期任务中提升分支选择准确率。

详情
AI中文摘要

自主运行的机器人可能需要基于不再可见的证据做出决策。我们研究延迟证据任务,其中早期线索在后续决策点之前消失,因此视觉上相似的观察可能需要不同的动作。在这些设置中,当前观察不足以作为控制的状态。我们引入了轨迹路由因果证据(TRACE),一种用于视觉运动模仿策略的记忆框架。TRACE将任务相关的视觉和机器人状态证据(如物体身份、目标选择或路线依赖状态)存储在固定大小的潜在记忆中,该记忆在长片段中保持有界。TRACE不是通过原始时间或手动提供的任务标签来索引记忆,而是使用路径签名:已执行机器人状态轨迹的紧凑、顺序敏感特征。这些签名不存储视觉线索本身;相反,它们提供了轨迹条件化的键,用于写入和检索线索可见时存储的证据。当机器人后来遇到歧义观察时,策略以TRACE记忆为条件,恢复缺失的上下文并选择正确的分支。TRACE通过轻量级适配器附加到策略上,而不改变策略主干、动作头或模仿目标。在具有视觉歧义分支点的真实世界长时域操作任务中,TRACE在分支选择和任务成功率上优于替代基线,包括短历史记忆和循环记忆。项目页面:https://jeong-zju.github.io/trace

英文摘要

Robots under autonomous operation may require decisions based on evidence that is no longer visible. We study delayed-evidence tasks, where an early cue disappears before a later decision point, so visually similar observations can require different actions. In these settings, the current observation is not a sufficient state for control. We introduce TRAjectory-routed Causal Evidence (TRACE), a memory framework for visuomotor imitation policies. TRACE stores task-relevant visual and robot-state evidence, such as object identity, target choice, or route-dependent state, in a fixed-size latent memory that remains bounded over long episodes. Instead of indexing memory by raw time or manually provided task labels, TRACE uses path signatures: compact, order-sensitive features of the executed robot-state trajectory. These signatures do not store the visual cue itself; rather, they provide trajectory-conditioned keys for writing and retrieving the evidence stored when the cue was visible. When the robot later reaches an ambiguous observation, the policy conditions on TRACE memory to recover the missing context and choose the correct branch. TRACE attaches through lightweight adapters to policies, without changing the policy backbone, action head, or imitation objective. Across real-world long-horizon manipulation tasks with visually ambiguous branch points, TRACE improves branch selection and task success over alternative baselines, including short-history and recurrent memory. Project page: https://jeong-zju.github.io/trace

2606.15148 2026-06-17 cs.RO cs.AI 版本更新

MimicIK: Real-Time Generative Inverse Kinematics from Teleoperation with FK Consistency

MimicIK: 基于遥操作且保持正运动学一致性的实时生成式逆运动学

Jiahao Yang, Shenhao Yan, Fan Feng, Chengsi Yao, Ge Wang, Zhixin Mai, Yiming Zhao, Yatong Han

发表机构 * Ising AI CUHK-Shenzhen(香港中文大学(深圳))

AI总结 提出MimicIK框架,利用条件流匹配从遥操作数据学习平滑鲁棒的关节空间运动先验,通过两阶段迭代优化和正运动学一致性损失实现实时逆运动学求解,在6-DOF机器人数据集上达到4.65mm位置误差和92.01%成功率。

详情
AI中文摘要

逆运动学(IK)仍然是实时机器人操作的关键瓶颈。经典的数值求解器具有高几何精度,但在闭环部署中常出现不连续的分支切换和运动学奇异点附近的不稳定行为。同时,学习型IK方法在平衡空间精度、运动平滑性和实时效率方面经常遇到困难,尤其是在使用嘈杂的人类遥操作数据训练时。我们提出MimicIK,一个实时生成式逆运动学框架,通过条件流匹配从遥操作演示中学习平滑且鲁棒的关节空间运动先验。给定当前关节构型和目标末端执行器位姿,MimicIK基于最小迭代策略(MIP)主干,通过高效的两步迭代精化过程预测连续的增量关节指令。为了强制物理一致性,我们进一步引入正运动学一致性损失,这是一种可微的正运动学正则化项,在训练过程中惩罚任务空间与目标位姿的偏差。我们在包含8,848个遥操作演示的真实6-DOF机器人数据集上评估MimicIK。MimicIK实现了4.65 mm的平均位置误差,92.01%的10 mm成功率,以及仅7.99%的轨迹尖峰率。与UNet扩散基线相比,我们的方法在提高空间精度和运动平滑性的同时,将推理延迟从21.66 ms降低到6.74 ms。此外,与在分布外部署时灾难性发散的确定性MLP基线不同,MimicIK在奇异构型附近保持稳定,并在部署硬件上实现鲁棒的20 Hz实时控制。

英文摘要

Inverse kinematics (IK) remains a critical bottleneck for real-time robot manipulation. Classical numerical solvers achieve high geometric precision but often suffer from discontinuous branch switching and unstable behavior near kinematic singularities during closed-loop deployment. Meanwhile, learned IK approaches frequently struggle to balance spatial accuracy, motion smoothness, and real-time efficiency, particularly when trained on noisy human teleoperation data. We present MimicIK, a real-time generative inverse kinematics framework that learns smooth and robust joint-space motion priors from teleoperation demonstrations through conditional flow matching. Given the current joint configuration and a target end-effector pose, MimicIK predicts continuous delta-joint commands using an efficient two-step iterative refinement process based on a Minimal Iterative Policy (MIP) backbone. To enforce physical consistency, we further introduce an FK consistency loss, a differentiable forward-kinematics regularization that penalizes task-space deviations from the target pose during training. We evaluate MimicIK on a real-world 6-DOF robot dataset containing 8,848 teleoperation demonstrations. MimicIK achieves a mean position error of 4.65 mm, a 10 mm success rate of 92.01%, and a trajectory spike rate of only 7.99%. Compared with a UNet diffusion baseline, our method improves both spatial accuracy and motion smoothness while reducing inference latency from 21.66 ms to 6.74 ms. Furthermore, unlike deterministic MLP baselines that catastrophically diverge under out-of-distribution deployment, MimicIK remains stable near singular configurations and enables robust 20 Hz real-time control on deployment hardware.
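The idea behind an FK consistency loss can be shown on a toy arm. This is a minimal sketch under stated assumptions, not the MimicIK loss: it uses a planar serial arm instead of the 6-DOF robot, penalizes position only (no orientation), and is written in plain Python without automatic differentiation. The names `planar_fk` and `fk_consistency_loss` are hypothetical.

```python
import math
from typing import Sequence, Tuple


def planar_fk(joints: Sequence[float], link_lengths: Sequence[float]) -> Tuple[float, float]:
    """Forward kinematics of a planar serial arm: end-effector (x, y)."""
    x = y = angle = 0.0
    for q, length in zip(joints, link_lengths):
        angle += q  # joint angles accumulate along the chain
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return x, y


def fk_consistency_loss(
    pred_joints: Sequence[float],
    target_xy: Tuple[float, float],
    link_lengths: Sequence[float],
) -> float:
    """Squared task-space distance between FK(pred_joints) and the target."""
    x, y = planar_fk(pred_joints, link_lengths)
    return (x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2


if __name__ == "__main__":
    links = [1.0, 1.0]
    # Predicted joints that reach the target give a loss near zero.
    print(fk_consistency_loss([0.0, math.pi / 2], (1.0, 1.0), links))
    # A fully stretched arm misses the same target.
    print(fk_consistency_loss([0.0, 0.0], (1.0, 1.0), links))
```

During training, a term like this would be added to the generative objective so that predicted joint commands are penalized when their forward kinematics drifts from the target pose; in a real setup the FK function would be implemented in an autodiff framework.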

8. 可信、安全与AI治理 39 篇

2606.17312 2026-06-17 cs.AI 新提交

Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty

通过结构不确定性量化LLM逻辑推理中的一致性

Baishali Chaudhury, Mengdie Flora Wang, Hyunji Hayley Park, Rahul Ghosh, Sungmin Hong, Jae Oh Woo

发表机构 * AWS Generative AI Innovation Center(AWS生成式AI创新中心)

AI总结 提出结构不确定性框架,通过自偏好排序的稳定性评估LLM推理一致性,在逻辑和数学任务中与答案分散度互补,提升不可靠实例识别。

Comments Published at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. Accepted as best paper

详情
AI中文摘要

大型语言模型可以通过不稳定、矛盾或难以一致排序的推理路径得出相同答案——这种失败模式在多步演绎推理中尤为普遍。现有方法主要通过输出分散度(衡量采样答案的差异)来评估可靠性,但这丢弃了一个互补信号:模型是否能一致地对竞争性推理候选进行排序。我们提出结构不确定性,一个从自偏好诱导的推理解决方案排序稳定性导出的、具有一致性意识的框架。给定一个查询,我们生成多个候选解决方案,并让模型对其自身输出进行成对偏好判断。我们通过Bradley-Terry模型与PageRank将自偏好聚合成排序分布,并将信号分解为两个基于熵的分量:跨试验排序不稳定性和试验内候选歧义性。在五个LLM和八个基准上,结构信号提供了与答案分散度互补的信息:在逻辑和数学推理任务中,组合提高了不可靠实例的识别,而在事实检索中,结构信号坍缩为均匀分布,诊断出一个推理层面一致性评估无信息性的状态边界。两个分量与准确性的关系不同:试验内歧义性与正确性正相关——与多个合理解决方案路径保持竞争的情况一致——而跨试验不稳定性与正确性负相关,表明推理不可靠。结构不确定性最好不被理解为通用置信度估计器,而是作为逻辑推理一致性的状态敏感评估器。

英文摘要

Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently -- a failure mode especially prevalent in multi-step deductive reasoning. Existing methods assess reliability primarily through output dispersion -- measuring how much sampled answers differ -- but this discards a complementary signal: whether the model can consistently rank competing reasoning candidates. We propose structural uncertainty, a consistency-aware framework derived from the stability of self-preference-induced rankings over sampled reasoning solutions. Given a query, we generate multiple candidate solutions and ask the model to judge pairwise preferences among its own outputs. We aggregate self-preferences into ranking distributions via Bradley-Terry modeling with PageRank, and decompose the signal into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity. Across five LLMs and eight benchmarks, structural signals provide information complementary to answer dispersion: on logical and mathematical reasoning tasks, the combination improves identification of unreliable instances, while on factual retrieval the structural signal collapses toward uniformity, diagnosing a regime boundary where reasoning-level consistency evaluation is uninformative. The two components relate differently to accuracy: within-trial ambiguity correlates positively with correctness -- consistent with settings where multiple plausible solution paths remain competitive -- while across-trial instability correlates negatively, signaling unreliable reasoning. Structural uncertainty is best understood not as a universal confidence estimator, but as a regime-sensitive evaluator of logical reasoning consistency.
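The two entropy-based components can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than the paper's definition: Bradley-Terry strengths are fit with the standard minorization-maximization update, the PageRank step mentioned in the abstract is omitted, within-trial ambiguity is taken as the entropy of the normalized strengths, and across-trial instability as the entropy of which candidate ranks first in each trial. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence

WinMatrix = Sequence[Sequence[float]]  # wins[i][j]: times i was preferred over j


def bradley_terry(wins: WinMatrix, iters: int = 200) -> List[float]:
    """Fit normalized Bradley-Terry strengths with the MM update."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and p[i] + p[j] > 0
            )
            new_p.append(total_wins / denom if denom > 0 else 0.0)
        norm = sum(new_p)
        # Fall back to uniform strengths if no comparisons were recorded.
        p = [v / norm for v in new_p] if norm > 0 else [1.0 / n] * n
    return p


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(q * math.log(q) for q in probs if q > 0)


def within_trial_ambiguity(wins: WinMatrix) -> float:
    """High when several candidates have similar strength in one trial."""
    return entropy(bradley_terry(wins))


def across_trial_instability(trials: Sequence[WinMatrix]) -> float:
    """High when the top-ranked candidate changes from trial to trial."""
    tops = []
    for wins in trials:
        p = bradley_terry(wins)
        tops.append(max(range(len(p)), key=p.__getitem__))
    counts = Counter(tops)
    return entropy([c / len(tops) for c in counts.values()])
```

Under these assumptions, a trial where every candidate wins equally often has maximal within-trial ambiguity, and a set of trials that always crowns the same candidate has zero across-trial instability.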

2606.17443 2026-06-17 cs.AI cs.CL cs.CY 新提交

Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems

在位优势:LLM推荐系统中的品牌偏见与认知操纵动态

Xi Chu, Yupeng Hou

发表机构 * Trine University(特莱恩大学) Texas A&M University(德克萨斯农工大学)

AI总结 研究LLM推荐中的品牌动态,发现知名品牌在同等规格下获100%推荐(IAI=10.0),但微弱评分优势可打破垄断;权威营销语言(如虚假临床证据)以+0.17评分点的偏差剩余价值打破垄断;多品牌GEO竞争存在社会困境,集体优化降低个体收益。

Comments 16 pages, 4 figures, 11 tables

详情
AI中文摘要

大型语言模型(LLM)正成为消费者寻找产品的主要方式,但我们尚不了解品牌如何在这个新渠道中竞争。我们使用护肤品——消费者在购买前难以判断质量、必须依赖品牌声誉的类别——在三个商业LLM(GPT-4o-mini、Claude Sonnet、Gemini 3 Flash)中研究LLM推荐中的品牌动态,并对搜索品进行了稳健性检验。在三个实验中,我们发现:(1)条件垄断:当所有产品具有相同规格时,知名品牌获得100%的推荐(IAI = 10.0),但这种主导地位在竞争对手拥有不到+0.1星的评分优势时消失;(2)权威式营销语言,包括捏造的临床证据声明,以等于+0.17评分点的偏差剩余价值打破了这种垄断,每个模型反应不同;(3)多品牌GEO竞争中的社会困境:当所有品牌采用相同的优化策略时,在我们的收益代理中,个体收益从+0.802降至+0.007,而我们的测试中未参与的品牌获得零推荐。我们的结果表明,生成引擎优化(GEO)不仅应作为安全风险研究,还应作为塑造市场竞争的新兴营销实践来研究。

英文摘要

Large language models (LLMs) are becoming a major way for consumers to find products, but we do not yet understand how brands compete in this new channel. We study brand dynamics in LLM recommendations using skincare products -- a category where consumers cannot easily judge quality before buying and must rely on brand reputation -- across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), with a robustness check on search goods. In three experiments, we find: (1) a Conditional Monopoly where well-known brands get recommended 100% of the time (IAI = 10.0) when all products have the same specifications, but this dominance disappears with less than a +0.1-star rating advantage for a competitor; (2) authority-style marketing language, including fabricated clinical-evidence claims, breaks this monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently; and (3) a social dilemma in multi-brand GEO competition: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in our payoff proxy, and non-participating brands receive zero recommendations in our tests. Our results suggest that generative engine optimization (GEO) should be studied not only as a security risk, but also as an emerging marketing practice that shapes market competition.

2606.18021 2026-06-17 cs.AI cs.CL cs.LG cs.MA 新提交

LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI

LegalHalluLens: 类型化幻觉审计与校准的多智能体辩论以实现可信赖的法律AI

Lalit Yadav, Akshaj Gurugubelli

发表机构 * Independent Researcher, Sunnyvale, CA, USA(独立研究者,美国加州太阳谷) Independent Researcher, San Diego, CA, USA(独立研究者,美国加州圣地亚哥)

AI总结 针对法律AI中聚合指标掩盖的错误集中性和方向性问题,提出LegalHalluLens审计框架,通过类型化幻觉画像、风险方向指数(RDI)和校准辩论管道,将幻觉检测减少45%,并揭示聚合指标隐藏的失败模式。

Comments 15 pages, 5 figures; Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

详情
AI中文摘要

部署在法律工作流程中的AI系统以聚合指标报告的约52%的比率产生幻觉,但这个平均值掩盖了错误集中的位置和方向,使合规官员无法获得可操作的可信部署信号。我们提出LegalHalluLens,一个包含三个组件的审计框架:基于CUAD(Hendrycks等人,2021)的四种法律动机声明类别(数字、时间、义务/权利、事实)的类型化幻觉画像;一个风险方向指数(RDI),将遗漏与发明偏差简化为一个可部署比较的标量;以及一个针对幅度和方向校准的类型化辩论管道。在510份合同和249,252个条款级实例上,我们测量了义务/数字和时间声明之间约38-40个百分点的模型内差距,而聚合报告隐藏了这一点,并表明两个具有匹配的52%比率的系统可能具有相反的RDI。辩论管道将虚构检测减少了45%,每个类别的收益跟踪诊断结果,使用显著更小的骨干网络(4B活跃参数)匹配商业API。类型化画像和RDI揭示了聚合指标隐藏的失败模式;我们进一步表明这些诊断可作为多智能体辩论管道的校准输入,其中针对测量失败模式的怀疑挑战和非对称门优于通用调整的辩论。该框架支持部署在现实世界中的法律AI的方向感知采购、问责制和智能体设计。

English abstract

AI systems deployed in legal workflows hallucinate at rates that aggregate metrics report at ~52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles across four legally-motivated claim categories (numeric, temporal, obligation/entitlement, factual) over CUAD (Hendrycks et al., 2021); a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single deployment-comparable scalar; and a typed debate pipeline calibrated to both magnitudes and directions. Across 510 contracts and 249,252 clause-level instances we measure a within-model gap of approximately 38-40 pp between obligation/numeric and temporal claims that aggregate reporting hides, and show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45% with per-category gains tracking the diagnosis, matching commercial APIs with a substantially smaller backbone (4B active parameters). Typed profiles and RDI surface failure modes that aggregate metrics hide; we further show these diagnostics serve as calibration inputs for multi-agent debate pipelines, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically-tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI deployed in the wild.

2606.18037 2026-06-17 cs.AI cs.CL cs.MA New submission

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

ProvenanceGuard: 基于MCP的LLM智能体的源感知事实性验证

Ander Alvarez, Santhiya Rajan, Samuel Mugel, Román Orús

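Affiliations * Multiverse Computing Parque Científico y Tecnológico de Gipuzkoa; Centre for Social Innovation; Donostia International Physics Center; Ikerbasque Foundation for Science

AI summary Proposes ProvenanceGuard, a source-aware verifier that consumes MCP tool-call traces, decomposes answers into claims, and routes each claim to source-specific evidence to detect cross-source conflation; it outperforms source-blind baselines on medical-domain traces.

Comments 20 pages, 4 figures

Details
AI Chinese abstract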

使用工具的LLM智能体越来越多地采用模型上下文协议(MCP)从异构证据源(包括搜索、API、数据库、临床记录和处方工具)中获取答案。标准的事实性指标通常测试答案是否得到汇总证据的支持,但忽略了一种源感知的失败模式:一个声明可能在某个地方得到支持,却被归因于错误的来源。我们称此为跨源混淆。我们引入ProvenanceGuard,一种用于MCP基础答案的源感知验证器。它消耗捕获的MCP轨迹,包含稳定的工具ID、源ID和原始输出;将答案分解为原子声明;将声明路由到特定源的证据;使用NLI和令牌对齐代理检查支持度;比较声明的归属与路由的源;并返回每个声明的判定以及答案级别的允许/阻止决策。被阻止的答案可以通过检索增强的答案修订和重新验证来修复。我们在281个医疗领域的MCP智能体轨迹上进行评估。一个包含266条轨迹的裁定子集产生了2,325个由LLM辅助的声明标签(按轨迹划分);361个保留标签由人工验证。在40条轨迹的保留子集上,ProvenanceGuard在260个符合源条件的声明上实现了阻止F1分数0.802和源准确率0.858,优于不输出声明到源ID的源无关基线。在一个更困难的多源基准上,它达到了阻止F1分数0.846,而源加关系准确率降至0.229,表明在语义相近的源上精确的源归属仍然困难。修复和重新验证解决了完整轨迹集中的所有被阻止答案,通常通过保守回退。在50个受控的临床混淆探测中,ProvenanceGuard检测到所有注入的归属交换,没有保留错误的归属。这些结果表明,源归属是基于MCP的智能体事实性验证的一个独立维度。

English abstract

Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evidence, missing a provenance-sensitive failure mode: a claim may be supported somewhere while being attributed to the wrong source. We call this cross-source conflation. We introduce ProvenanceGuard, a source-aware verifier for MCP-grounded answers. It consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source-specific evidence; checks support with NLI and a token-alignment proxy; compares stated attribution with the routed source; and returns per-claim verdicts plus an answer-level allow/block decision. Blocked answers can be repaired with retrieval-augmented answer revision and re-verified. We evaluate on 281 medical-domain MCP-agent traces. A 266-trace adjudicated subset yields 2,325 LLM-assisted claim labels split by trace; 361 held-out labels are human-verified. On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark it reaches block F1 0.846, while source-plus-relation accuracy drops to 0.229, showing that exact source ownership remains difficult with semantically close sources. Repair-and-reverify resolves all blocked answers in the full trace set, often via conservative fallback. In 50 controlled clinical conflation probes, ProvenanceGuard detects all injected attribution swaps with no retained wrong attribution. These results show that source attribution is an independent axis for factuality verification in MCP-based agents.
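The last two steps of the pipeline described above (comparing stated attribution with the routed source, then producing per-claim verdicts and an answer-level decision) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Claim` fields, the verdict names, and the rule that any failing claim blocks the whole answer are all assumptions, and the support check (NLI plus token alignment) is taken as already computed upstream.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Claim:
    text: str
    stated_source: str  # source ID the answer attributes the claim to
    routed_source: str  # source ID whose evidence the router selected
    supported: bool     # outcome of the support check, computed upstream


def verdict(claim: Claim) -> str:
    """Per-claim verdict: support is checked first, then attribution."""
    if not claim.supported:
        return "unsupported"
    if claim.stated_source != claim.routed_source:
        # Supported somewhere, but credited to the wrong source.
        return "cross_source_conflation"
    return "ok"


def decide(claims: List[Claim]) -> Tuple[str, List[str]]:
    """Answer-level decision. ASSUMPTION: any failing claim blocks the answer."""
    verdicts = [verdict(c) for c in claims]
    decision = "allow" if all(v == "ok" for v in verdicts) else "block"
    return decision, verdicts
```

A claim that is supported by the clinical record but attributed to the formulary tool would pass a pooled-evidence check yet receive `cross_source_conflation` here, which is the failure mode the abstract singles out.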

2606.18068 2026-06-17 cs.AI New submission

Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications

基于Agentic AI的框架:缓解医疗应用中的过早诊断交接和无声幻觉

Divyansh Srivastava, Shreya Ghosh, Anshul Verma, Rajkumar Buyya

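Affiliations * Distributed Systems (qCLOUDS) Lab, School of Computing and Information Systems, The University of Melbourne, Australia; Department of Computer Science and Engineering, School of Electrical and Computer Sciences (SECS), Indian Institute of Technology Bhubaneswar, India; Department of Computer Science, Banaras Hindu University, Varanasi, India

AI summary Proposes a multi-agent framework that replaces LLM-as-a-judge routing with deterministic orchestration constraints and two safety gates (a neuro-symbolic state-tracking gate and a semantic-entropy uncertainty gate) to address premature diagnostic handoff and silent hallucination in medical dialogue; diagnostic precision improves by 11.3 percentage points.

Details
AI Chinese abstract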

大型语言模型(LLM)和多智能体系统的最新进展推动了Agentic AI的兴起,显示出在医学推理方面的潜力。然而,开放式对话代理仍然容易受到两种关键故障模式的影响:过早的诊断交接和无声的临床幻觉,这些可能在到达患者之前未被检测到。在这项工作中,我们提出了一个多智能体框架,通过用确定性编排约束取代“LLM作为法官”的路由来解决这两个问题。该框架包含两个安全机制。首先,一个神经符号状态跟踪门通过阻止诊断转换直到所有必需的维度被收集,强制实施OLDCARTS临床协议(发病、位置、持续时间、特征、加重/缓解因素、放射、时间和严重程度)的完整性。其次,一个认知不确定性量化(UQ)门计算跨K=5个独立诊断样本的语义熵(H),以在交付前识别和拦截发散输出。我们使用由llama-3.1-70b-instruct模型驱动的模拟患者代理在150个测试案例上评估该系统。完整架构实现了49.3%的诊断精度,比无约束基线绝对提高了11.3个百分点。此外,我们观察到OLDCARTS完整性(σ)与语义熵(H)之间存在统计显著的负相关(r = -0.181,p < 0.05),表明结构化信息收集与诊断不确定性降低相关。

English abstract

Recent advances in Large Language Models (LLMs) and multi-agent systems have driven the rise of Agentic AI, showing promise for medical reasoning. However, open-ended conversational agents remain prone to two critical failure modes: premature diagnostic handoff and silent clinical hallucinations that may go undetected before reaching the patient. In this work, we propose a multi-agent framework that addresses both issues by replacing "LLM-as-a-judge" routing with deterministic orchestration constraints. The framework incorporates two safety mechanisms. First, a neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until all required dimensions are collected. Second, an epistemic uncertainty quantification (UQ) gate computes semantic entropy (H) across K=5 independent diagnostic samples to identify and intercept divergent outputs before delivery. We evaluate the system using simulated patient agents powered by the llama-3.1-70b-instruct model on 150 test cases. The full architecture achieves 49.3% diagnostic precision, representing an absolute improvement of 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation (r = -0.181, p < 0.05) between OLDCARTS completeness (σ) and semantic entropy (H), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.
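The two gates described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the dictionary keys, the threshold `h_max`, the action names, and the idea that each of the K samples arrives already mapped to a meaning-cluster label are all hypothetical. Semantic entropy is computed here as the Shannon entropy of the cluster frequencies.

```python
import math
from collections import Counter

# The eight OLDCARTS dimensions named in the abstract (key names are assumptions).
OLDCARTS = ("onset", "location", "duration", "character",
            "aggravating_alleviating", "radiation", "timing", "severity")


def completeness(collected):
    """Fraction of OLDCARTS dimensions with a non-empty value."""
    filled = sum(1 for dim in OLDCARTS if collected.get(dim))
    return filled / len(OLDCARTS)


def semantic_entropy(cluster_labels):
    """Shannon entropy (in nats) over the meaning clusters of the K samples."""
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def gate(collected, cluster_labels, h_max=0.7):
    """Deterministic routing: the history must be complete before the UQ check runs."""
    if completeness(collected) < 1.0:
        return "ask_more"   # block the diagnostic transition
    if semantic_entropy(cluster_labels) > h_max:
        return "escalate"   # intercept divergent samples before delivery
    return "deliver"
```

With K=5 samples that all fall in one cluster the entropy is 0 and the answer is delivered; samples spread over four clusters give an entropy of about 1.33 nats and are intercepted under the assumed threshold.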

2606.17114 2026-06-17 cs.CR cs.AI Cross-listed

An Evaluation of Data Leakage Risks in Tool-Using LLM Agents in Realistic Scenarios

现实场景中工具使用LLM代理的数据泄露风险评估

Hankyul Baek, Jaewon Noh, Sang Seo, Yongsu Kim, Gabriel Waikin Loh Matienzo, Young Il Kim, Ee Wei Seah, Akriti Vij

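Affiliations * Korea AI Safety Institute; Singapore AI Safety Institute

AI summary Evaluates data leakage by AI agents in 12 realistic, non-adversarial tasks. None of the three tested agents achieved fully correct and fully safe execution across all scenarios, with failures such as accessing unnecessary information, indicating that operational data leakage is a first-order safety concern distinct from adversarial exfiltration.

Details
AI Chinese abstract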

AI代理越来越多地被用于企业和个人场景,可以访问电子邮件、数据库、文档和其他工具,从而读取、更新和传播敏感信息。先前关于代理数据泄露风险的研究大多集中在通过提示注入和越狱进行的对抗性数据窃取。然而,敏感信息也可能在非对抗性使用中暴露,即使在用户发出良性请求时也会产生泄露风险。我们报告了新加坡AI安全研究所和韩国AI安全研究所的联合评估,检查了12个现实、非对抗性任务中的代理数据泄露,涵盖客户支持、DevOps、网络自动化以及企业和个人生产力。评估涵盖了五种风险类型:缺乏数据意识、受众意识、政策合规性、数据最小化和访问边界意识。两个研究所使用独立的测试环境和特定任务的LLM评判标准,测试了一组反映真实部署的常见场景。在测试的三个代理中,没有一个在所有场景中实现完全正确且完全安全的执行。成功的任务完成往往伴随着数据处理失败,例如访问不必要的信息或向不适当的接收者披露信息,表明能力和数据处理安全性应分开评估。定性审查还揭示了声明-行动不匹配、模拟感知行为、用户-模拟器角色反转以及自动评判中的解释差距。总体而言,结果表明操作数据泄露是独立于对抗性窃取的一阶代理安全问题,并为未来代理数据处理安全评估提供了方法论。

English abstract

AI agents are increasingly being adopted in enterprise and personal settings with access to emails, databases, documents, and other tools where they can read, update, and disseminate sensitive information. Much of prior research on data leakage risks in agents has focused on adversarial data exfiltration through prompt injections and jailbreaks. However, sensitive information may also be exposed during non-adversarial use, creating leakage risks even when users issue benign requests. We report a joint evaluation by the Singapore AI Safety Institute and the Korea AI Safety Institute examining agent data leakage in 12 realistic, non-adversarial tasks spanning customer support, DevOps, web automation, and enterprise and personal productivity. The evaluation covers five risk types: lack of data awareness, audience awareness, policy compliance, data minimization, and access-boundary awareness. Both institutes tested a common set of scenarios mirroring real-world deployments using independent testing environments and task-specific LLM-judge rubrics. Across the three tested agents, none achieved fully correct and fully safe execution across all scenarios. Successful task completion often coincided with data-handling failures such as accessing unnecessary information or disclosing information to inappropriate recipients, indicating that capability and data-handling safety should be evaluated separately. Qualitative review also revealed claim-action mismatches, simulation-aware behavior, user-simulator role reversal, and interpretation gaps in automated judging. Overall, the results indicate that operational data leakage is a first-order agent-safety concern distinct from adversarial exfiltration and provide a methodology for future evaluations of agent data-handling safety.

2606.17122 2026-06-17 cs.CR cs.AI cs.LG Cross-listed

TrustErase: Auditable Instant Machine Unlearning with Passport-Embedded Representations

TrustErase:基于护照嵌入表示的可审计即时机器遗忘

Rutger Hendrix, Leonardo G. Russo, Concetto Spampinato, Matteo Pennisi, Giovanni Bellitto

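Affiliations * University of Catania

AI summary Proposes TrustErase, a data-free, verifiable framework for instant unlearning based on passport-embedded representations: passports act as keys inside parameter-efficient adaptation layers, so specific classes or datasets are removed by deactivation alone, without retraining or fine-tuning.

Details
AI Chinese abstract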

隐私合规AI的需求放大了对机器遗忘的需求;然而,现有的基于重训练或蒸馏的方法仍然不可验证且计算成本高。我们引入了TrustErase,一个可验证、无数据的遗忘框架,利用护照嵌入表示实现即时、模块化和可审计的遗忘。通过将护照视为参数高效适配层中的加密密钥,TrustErase能够通过简单的停用操作移除特定类别或数据集,无需重训练、微调或访问原始数据。基于奇异值分解将护照隐藏在模型权重中,确保遗忘操作保持透明且可证明合规。在MNIST、CIFAR10和CIFAR100上的评估表明,TrustErase在严格无数据模式下运行,匹配或超越了DELETE、L2UL和Boundary Shrink等最先进基准。最终,TrustErase为可信、负责且可即时遗忘的AI系统建立了新范式。

English abstract

The demand for privacy-compliant AI has amplified the need for machine unlearning; yet, existing retraining or distillation-based methods remain unverifiable and computationally costly. We introduce TrustErase, a verifiable, data-free unlearning framework leveraging passport-embedded representations for instant, modular, and auditable forgetting. By treating passports as cryptographic keys within parameter-efficient adaptation layers, TrustErase enables the removal of specific classes or datasets through simple deactivation, without retraining, fine-tuning, or access to the original data. A singular value based decomposition conceals passports within model weights, ensuring that unlearning actions remain transparent and provably compliant. Evaluations on MNIST, CIFAR10 and CIFAR100 show that TrustErase matches or exceeds state-of-the-art benchmarks such as DELETE, L2UL, and Boundary Shrink, while operating in a strictly data-free regime. Ultimately, TrustErase establishes a new paradigm for trustworthy, accountable, and instantly forgettable AI systems.

2606.17123 2026-06-17 cs.CR cs.AI Cross-listed

LineageMark: Multi-user White-box Watermarking for Contribution Tracing in Model Derivation Chains

LineageMark:模型衍生链中用于贡献追踪的多用户白盒水印

Bingxue Zhang, Xiaofeng Xu, Feida Zhu

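Affiliations * University of Shanghai for Science and Technology; Singapore Management University

AI summary Proposes LineageMark, a framework that embeds watermarks in model parameters through projections, supports contribution tracing across multi-user, multi-stage derivation chains, and is robust to perturbations such as re-watermarking and fine-tuning.

Comments 14 pages, 2 figures

Details
AI Chinese abstract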

在开放的大语言模型生态系统中,模型经常跨多个领域和应用进行适配,形成多阶段衍生链。因此,追踪和验证历史贡献对于模型溯源和知识产权保护至关重要。然而,现有的水印方法主要针对单用户一次性嵌入设计,在重复模型衍生和增量更新下常常失效。为解决此问题,我们提出LineageMark,一种用于模型衍生链的多用户白盒水印框架。该框架使用基于投影的方法在模型参数中编码水印。首先选择稳定载体以减少对模型变化的敏感性,然后将每个水印位表示为这些载体上的投影统计量。额外的水印插入仅在投影空间中引入有界扰动,并使用边界约束来保持信号完整性。我们在多阶段模型衍生链中评估了LineageMark的有效性。实验结果表明,LineageMark在多阶段衍生中保留了贡献者水印,并支持增量多用户水印插入。此外,它对重水印、微调、量化和剪枝等扰动表现出鲁棒性。

English abstract

In open large language model (LLM) ecosystems, models are frequently adapted across multiple domains and applications, forming multi-stage derivation chains. Consequently, tracking and verifying historical contributions is essential for model provenance and intellectual property protection. However, existing watermarking methods are mainly designed for single-user, one-time embedding and often fail under repeated model derivation and incremental updates. To address this problem, we propose LineageMark, a multi-user white-box watermarking framework for model derivation chains. The framework encodes watermarks in model parameters using a projection-based approach. Stable carriers are first selected to reduce sensitivity to model changes; each watermark bit is then represented as a projection statistic over these carriers. Additional watermark insertions introduce only bounded perturbations in the projection space, and margin constraints are used to maintain signal integrity. We evaluate the effectiveness of LineageMark in multi-stage model derivation chains. Experimental results show that LineageMark preserves contributor watermarks across multi-stage derivation and supports incremental multi-user watermark insertion. Furthermore, it exhibits robustness against perturbations such as re-watermarking, fine-tuning, quantization, and pruning.

2606.17229 2026-06-17 cs.LG cs.AI cs.CL Cross-listed

Rift: A Conflict Signature for Deception in Language Models

Rift: 语言模型中欺骗行为的冲突特征

Petr Nyoma

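Affiliations * Harmonic Labs

AI summary Contrasting knowing deception with ignorant error, the paper finds that deceptive forward passes carry a conflict signature of higher residual rank, which identifies the lie with 100% accuracy without labels and transfers across models, languages, and architectures.

Comments 13 pages, 4 figures. Code and experiment logs: https://github.com/Omibranch/Rift

Details
AI Chinese abstract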

一个在知道真相的情况下说谎的模型是ELK无法仅通过行为评估处理的核心情况。我们询问这种欺骗是否留下内部特征,将其与诚实错误区分开来。我们的关键举措是对错误进行控制:我们将一个潜伏代理(知道真相,在触发时说谎)与一个天真的说谎者(微调以输出相同的错误答案,没有诚实训练)进行对比。两者产生相同的错误输出;任何差异都关乎知识冲突,而非不正确性。我们发现欺骗性前向传递携带一个冲突特征——在相同错误答案上,残差秩比天真的说谎者传递高2.1-2.3倍——足以在GPT-2 small/medium(三个种子)和三个指令模型上,以100%准确率且无需标签,识别两个响应中哪个是谎言。在Qwen2.5-1.5B/7B和Phi-3-mini上,指令欺骗在每个测试事实上都提高了残差秩(18/18, 40/40, 34/34);在Phi-3上,谎言与诚实答案和幻觉完美分离(AUC 1.0,Wilcoxon p~6e-11)。该特征在策略性自构建欺骗(模型自创谎言,AUC 1.0)、主动隐藏尝试(AUC 1.0)和长度控制复制(20/20,AUC 1.0,p~1e-6)中仍然存在。使用无基相对表示,在一个模型家族上训练的探针在零样本下检测到另外两个家族中的欺骗(平均AUC 0.933),在同时改变架构和格式时仍有效(AUC 0.821),并跨五种语言迁移(AUC 1.000,长度控制)。该特征是只读的:可检测但不可注入(双向0/8)。诚实的局限性和六个负面实验已完整记录。

English abstract

A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) against a naive liar (fine-tuned to emit the same wrong answers with no honest training). Both produce identical wrong outputs; any difference is about knowledge conflict, not incorrectness. We find deceptive forward passes carry a conflict signature - 2.1-2.3x higher residual rank than naive-liar passes on the same wrong answer - strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), surviving simultaneous architecture and format change (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 both directions). Honest limitations and six negative experiments are documented in full.

2606.17257 2026-06-17 cs.CV cs.AI Cross-listed

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

Pulling The REINS: 通过表示引导实现视频扩散模型的无训练安全对齐

Rohit Kundu, Arindam Dutta, Sarosij Bose, Athula Balachandran, Amit K. Roy-Chowdhury

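Affiliations * University of California, Riverside; YouTube (Google)

AI summary Proposes REINS, which steers the internal representations of video diffusion models along a linear direction at inference time, giving training-free safety alignment that redirects generation away from harmful content without weight updates.

Details
AI Chinese abstract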

开源视频扩散模型能够生成从暴力到虚假信息等逼真的不安全内容,然而现有防御要么需要昂贵的安全微调(这会降低通用能力),要么应用容易被对抗性提示绕过的外部过滤器。我们提出REINS(表示空间推理时安全引导),一种无训练方法,通过在推理时引导其内部表示向安全生成方向对齐视频扩散模型。我们的关键发现是,安全相关结构线性编码在视频扩散Transformer的隐藏状态激活中,并且通过基于二元安全标签的监督PCA发现的一个单一方向足以分离安全与不安全的生成轨迹。在推理时,将该方向添加到中间Transformer层的隐藏状态中,将生成从有害内容重定向到语义相关的安全替代方案,无需权重更新、无需概念枚举,且计算开销可忽略。通过机制分析,我们揭示了虽然安全信息随Transformer深度单调累积,但引导效果在中间层(约50%深度)达到峰值,暴露了信息可用性与下游传播能力之间的基本权衡。我们在9个视频扩散模型、多个参数规模(1.3B-5B)以及文本到视频和图像到视频生成上评估REINS,据我们所知,这是视频生成文献中最广泛的安全评估套件。

English abstract

Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation, to our knowledge, the broadest safety evaluation suite in the video generation literature.
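The inference-time step described above, adding a single direction to the hidden state at an intermediate layer, amounts to a vector shift. The sketch below illustrates only that arithmetic and makes several assumptions: the direction is taken as given (the Supervised PCA fit is not shown), it is normalized to unit length, and the strength `alpha` is a hypothetical parameter. Plain Python lists stand in for activation tensors.

```python
import math


def unit(vector):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("steering direction must be non-zero")
    return [x / norm for x in vector]


def steer(hidden, direction, alpha):
    """Shift one hidden state by alpha along the (assumed) safety direction."""
    u = unit(direction)
    return [h + alpha * ui for h, ui in zip(hidden, u)]


def projection(hidden, direction):
    """Scalar coordinate of the hidden state along the direction."""
    return sum(h * ui for h, ui in zip(hidden, unit(direction)))
```

Because the shift is along a unit vector, the projection onto the direction rises by exactly `alpha` while every orthogonal component is unchanged; no model weights are touched.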

2606.17286 2026-06-17 cs.CY cs.AI Cross-listed

From Democracies to Autocracies: How AI Systems Enable Authoritarianism by Design

从民主到专制:AI系统如何通过设计实现威权主义

Jeba Sania, Marta Ziosi, Fazl Barez

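Affiliations * Harvard Kennedy School; University of Oxford

AI summary Compares the lifecycles of six AI systems deployed in political regimes ranging from the US to China and identifies key enabling features (centralized administrative data, regulatory gaps, weak user compliance, and encoded protected-group traits), showing how AI systems enable authoritarianism across regime types.

Details
AI Chinese abstract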

AI驱动的威权主义并不仅限于专制国家。本文通过调查和映射从美国到中国不同政治体制下部署的六种AI系统的生命周期,提供了更高的透明度。基于广泛来源(学术出版物、调查研究报告、第三方评估、媒体采访、政府采购公告),我们进行了系统性的定性比较,以识别在其各自政治背景下促成威权主义的关键技术和操作特征。我们发现,促成特征包括:集中和挪用行政数据用于执法和政治惩罚、未能阻止滥用的监管漏洞、使人类监督机制失效的弱用户合规性,以及识别弱势群体成员的受保护群体特征编码。我们发现这些特征存在于专制和民主政体部署的系统中,尽管配置不同。我们还发现,集中式和碎片化的AI系统都可以通过利用治理漏洞来助长威权主义:由行政当局(特别是安全和军事机构)指导的集中式系统通常不受正式监督机制的约束,而碎片化系统则在利益相关者之间分散责任,为根深蒂固铺平道路。这些发现表明,AI驱动的威权主义是分布式的,源于开发者、管理者和用户的设计和操作选择。最后,我们为开发者和政策制定者提供了缓解这些风险的建议。

English abstract

AI-enabled authoritarianism is not confined to autocracies. In this paper, we provide greater transparency by investigating and mapping the lifecycles of six AI systems deployed in different political regimes, ranging from the US to China. By drawing on an extensive range of sources (academic publications, investigative research reports, third-party evaluations, media interviews, government procurement notices), we conduct a systematic, qualitative comparison across systems to identify the critical technical and operational features that enable authoritarianism within their respective political contexts. We find that enabling features include the centralization and co-optation of administrative data for law enforcement and political punishment, regulatory gaps that fail to deter misuse, weak user compliance that nullifies human oversight mechanisms, and the encoding of protected group traits that identify members of vulnerable populations. We find that these features are present across systems deployed in autocratic and democratic regimes, albeit in varying configurations. We also find that both centralized and fragmented AI systems can contribute to authoritarianism by exploiting governance gaps: centralized systems directed by executive authorities, particularly within security and military institutions, are often not subjected to formal oversight mechanisms, while fragmented systems diffuse accountability between stakeholders, paving the way for entrenchment. These findings reveal that AI-enabled authoritarianism is distributed, resulting from design and operational choices made by developers, administrators, and users alike. We conclude with recommendations for developers and policymakers to mitigate these risks.

2606.17478 2026-06-17 cs.CL cs.AI Cross-listed

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

解码推理型LLM中的隐藏欺骗:用于欺骗审计的激活解释器

Kexin Chen, Yi Liu, Haonan Zhang, Yanhui Li, Xinyu Deng, Dongxia Wang

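Affiliations * Zhejiang University; Griffith University

AI summary Proposes STATEWITNESS, an activation explainer that decodes a target model's hidden states to answer natural-language queries and emit structured reports; it reaches 0.916 mean AUROC on deception detection, ahead of the compared text monitors and activation probes.

Comments Under review

Details
AI Chinese abstract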

随着LLM获得更强的推理能力,欺骗行为成为一个日益严重的安全问题。现有的欺骗监控器要么对可见文本进行评分,要么从表示向量中导出标量探针分数,几乎没有留下关于为什么响应可疑的可检查证据。我们引入了STATEWITNESS,一种用于欺骗审计的激活解释器。一个独立的解码器读取目标模型的隐藏状态,然后回答自然语言查询或发出关于它们的结构化报告。我们在两个目标推理LLM上评估了STATEWITNESS,涵盖七个欺骗数据集。在相同评估协议下,STATEWITNESS的平均AUROC达到0.916,比最佳黑盒文本监控器相对提升11.6%,比最佳激活探针基线相对提升25.0%。当与现有监控器结合时,STATEWITNESS在简单阈值集成中减少了遗漏的欺骗示例。除了标量检测,解码器还返回查询级答案、模式报告以及令牌级或句子级证据痕迹供人工检查。我们将此接口视为更广泛的可解释性和对齐工具的潜在构建块。

English abstract

As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.

2606.17646 2026-06-17 cs.HC cs.AI Cross-listed

SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches

SketchXplain:基于草图的图像分类器直观视觉解释

Wencan Zhang, Mario Michelessa, Xuejun Zhao, Brian Y. Lim

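Affiliations * National University of Singapore

AI summary Proposes SketchXplain, which combines saliency maps, concept-bottleneck models, and sketch optimization to generate intuitive sketch-based visual explanations of image classifiers.

Comments 14 pages, 6 figures, 4 tables. Submitted to TVCG

Details
AI Chinese abstract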

显著性图可视化通过指向区域来解释基于图像的AI预测,但这些区域通常不直观且语义不清晰,存在可解释性差距。我们认为AI解释应该是直观的——与用户知识一致,同时简单且具有选择性以加速解释。受艺术绘画启发,我们提出SketchXplain,为直观的基于图像的可解释AI(XAI)生成基于草图的视觉解释。结合显著性图、概念瓶颈模型和草图优化技术,SketchXplain整合显著性以选择一致的观察伪影、概念以实现知识一致性、线索以表示它们,以及抽象以实现简洁性。在面部表情识别上的评估、建模和用户研究表明,与显著性图或简单绘图相比,SketchXplain支持更快速的解释,且可视化更一致。在皮肤病变诊断上的进一步评估发现,SketchXplain更一致地可视化疾病症状,更好地支持非专业诊断。因此,这项工作展示了草图在直观、简单、一致和快速的基于图像的XAI可视化中的价值。

English abstract

Saliency map visualizations explain image-based AI predictions by pointing to regions, but these are often unintuitive and semantically unclear, leaving an interpretability gap. We argue that AI explanations should be intuitive -- coherent to user knowledge, yet simple and selective to accelerate interpretation. Inspired by artistic drawings, we propose SketchXplain to generate sketch-based visual explanations for intuitive image-based explainable AI (XAI). Combining techniques in saliency maps, concept-bottleneck models, and sketch optimization, SketchXplain integrates saliency to select coherent observation artifacts, concepts for knowledge coherence, cues to represent them, and abstraction for simplicity. Evaluating on face expression recognition, modeling and user studies showed that SketchXplain supported quicker interpretation with more aligned visualizations than saliency maps or simple drawings. Further evaluation on skin lesion diagnosis found that SketchXplain more coherently visualized disease symptoms, better supporting lay diagnosis. Thus, this work illustrates the value of sketches for intuitive, simple, coherent, and quick image-based XAI visualizations.

2606.17711 2026-06-17 cs.CV cs.AI Cross-listed

Structured Adversarial Camouflage via Voronoi Diagrams

基于Voronoi图的结构化对抗伪装

Jens Bayer, Stefan Becker, David Münch, Michael Arens, Jürgen Beyerer

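Affiliations * Fraunhofer IOSB and Fraunhofer Center for Machine Learning; Karlsruhe Institute of Technology

AI summary Generates structured camouflage patterns by optimizing seed-point locations through a soft assignment under a fixed palette; garment-level application significantly lowers person-detection AP, and the attack transfers across domains.

Details
AI Chinese abstract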

像素级对抗补丁计算量大且视觉上可检测,限制了在安全关键系统中的实用性。我们提出对抗性Voronoi伪装,通过软分配在固定可打印调色板下仅优化种子点位置,无需额外正则化即可生成类似结构化碎片伪装图案。在COCO风格AP@[.5:.95]上评估行人检测,朴素放置(Inria -> COCO)表现相当差,而通过分割掩码(3DPeople)进行服装级应用导致AP显著下降。该攻击可迁移到域外背景和跨检测器家族(YOLOv9/10/11/12),表明在黑盒设置中的鲁棒性。使用不同调色板重新绘制在很大程度上抵消了效果,单色调整显示有限容忍度(<=0.17),突出了结构-调色板耦合。参数高效、调色板受限的设计在降低实时检测器性能的同时提高了视觉合理性。物理验证和颜色校准留待未来工作。代码:https://github.com/JensBayer/Voronoi 。本文最初发表于由信息与通信技术系统技术委员会IST-224-RSY组织的国际军事通信与信息系统会议(ICMCIS),于2026年5月12-13日在英国巴斯举行。

English abstract

Pixel-wise adversarial patches are computationally heavy and often visually detectable, limiting utility in security-critical systems. We present adversarial Voronoi camouflage that optimizes only seed-point locations under fixed, printable palettes using a soft assignment, producing structured, splinter camouflage-like patterns without additional regularization. Evaluated on person detection with COCO-style AP@[.5:.95], naive placement (Inria -> COCO) performs comparably poorly, while garment-level application via segmentation mask (3DPeople) results in a significant AP drop. The attack transfers to out-of-domain backgrounds and across detector families (YOLOv9/10/11/12), indicating robustness in black-box settings. Repainting with different palettes largely nullifies the effect, and single-color tweaks show limited tolerance (<=0.17), highlighting a structure-palette coupling. The parameter-efficient, palette-constrained design improves visual plausibility while degrading real-time detector performance. Physical validation and color calibration are left for future work. Code: https://github.com/JensBayer/Voronoi

This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-224-RSY - the ICMCIS, held in Bath, United Kingdom, 12-13 May 2026.
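The soft assignment mentioned above is what makes seed-point locations optimizable: a hard Voronoi diagram gives each pixel the color of its nearest seed, which has no useful gradient with respect to seed position, whereas a softmax over negative squared distances blends colors smoothly. The sketch below shows one common way to write such a relaxation; the temperature `tau`, the use of squared Euclidean distance, and the function name are assumptions, not details taken from the paper.

```python
import math


def soft_voronoi_color(pixel, seeds, seed_colors, tau=0.01):
    """Blend palette colors with softmax weights over negative squared distance.

    pixel: (x, y); seeds: list of (x, y); seed_colors: one RGB tuple per seed.
    As tau approaches 0 this tends to the hard nearest-seed Voronoi coloring.
    """
    d2 = [(pixel[0] - sx) ** 2 + (pixel[1] - sy) ** 2 for sx, sy in seeds]
    nearest = min(d2)  # subtracted for numerical stability of the softmax
    weights = [math.exp(-(d - nearest) / tau) for d in d2]
    total = sum(weights)
    return tuple(
        sum(w / total * color[channel] for w, color in zip(weights, seed_colors))
        for channel in range(3)
    )
```

A pixel sitting on a seed takes essentially that seed's palette color, while a pixel equidistant from two seeds receives an even blend, so moving a seed changes nearby pixel colors continuously.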

2606.17810 2026-06-17 cs.LG cs.AI Cross-listed

No-Free-Fairness: Fundamental Limits and Trade-offs in Learning Systems

无免费公平:学习系统中的基本限制与权衡

Khoat Than

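Affiliations * Hanoi University of Science and Technology

AI summary Presents the No-Free-Fairness theorems, which identify three inherent sources of disparity in learning systems: irreducible task cost forces a trade-off between performance and fairness, finite samples induce subgroup disparity, and limited model-class expressivity can make fairness unattainable.

Details
AI Chinese abstract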

在本文中,我们建立了一组理论不可能性结果,称为无免费公平定理,这些定理识别了学习系统中三个根本性的差异来源。首先,我们证明当任务在某个子群上表现出不可约成本时,任何决策规则都必须在整体性能与差异之间进行权衡,从而产生固有的公平-成本前沿。其次,我们证明即使在理想的无噪声环境中,存在完全公平且准确的解,仅凭有限样本学习就会导致非平凡的子群差异,排除了分布无关的公平保证。更严重的是,强制执行严格的相对公平会造成统计瓶颈:实现低成本可能需要指数级数量的样本。第三,我们证明模型类的局限性可以独立地导致差异:如果模型无法为某个子群表示准确的解,那么无论数据或训练过程如何,公平性都无法实现。总体而言,这些结果表明不公平不仅仅是由于有偏数据或次优优化,而是源于决策问题的内在结构、有限数据的约束以及模型的表达力。我们的框架广泛适用于标准监督学习之外,并表明实现公平需要明确的权衡,应被视为核心设计考虑因素。

English abstract

In this paper, we establish a set of theoretical impossibility results, termed the No-Free-Fairness theorems, that identify three fundamental sources of disparity in learning systems. First, we show that when a task exhibits irreducible cost on a subgroup, any decision rule must trade off overall performance with disparity, yielding an inherent fairness--cost frontier. Second, we prove that even in ideal, noise-free settings where a perfectly fair and accurate solution exists, finite-sample learning alone induces nontrivial subgroup disparity, ruling out distribution-free fairness guarantees. More seriously, enforcing strict relative fairness creates a statistical bottleneck: achieving low cost may require exponentially many samples. Third, we show that limitations of the model class can independently induce disparity: if the model cannot represent accurate solutions for a subgroup, fairness remains unattainable regardless of data or training procedure. Overall, these results demonstrate that unfairness is not solely a consequence of biased data or suboptimal optimization, but arises from the intrinsic structure of decision problems, the constraints of finite data, and the expressivity of models. Our framework applies broadly beyond standard supervised learning, and suggests that achieving fairness requires explicit trade-offs and should be treated as a core design consideration.

2606.17872 2026-06-17 cs.LG cs.AI Cross-listed

AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor

AnchorKV: 通过拒绝锚点的软惩罚实现安全感知的KV缓存压缩

Ning Ni, Yingjie Lao

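Affiliations * Department of Computer Science, Tufts University; Department of Electrical and Computer Engineering, Tufts University

AI summary Proposes AnchorKV, a KV cache compression method whose soft penalty biases token retention scores away from key-space directions associated with harmful prompts, trading a small amount of utility for substantially improved safety alignment.

Details
AI Chinese abstract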

大型语言模型(LLMs)在生成推理和长上下文任务上优于早期架构,但其庞大的规模在内存使用、能耗和设备端部署方面带来了重大挑战。由于缩放预训练语言模型能提升下游能力\cite{zhao2023survey},键值(KV)缓存成为主要的推理瓶颈。最近的KV缓存压缩方法\cite{jo2025fastkv,li2024snapkv,zhou2024dynamickv}通过仅保留注意力相关令牌的子集来降低这一成本。然而,虽然这些方法在良性工作负载上保持了准确性,但其压缩策略要么无法防御越狱攻击\cite{jiang2024robustkv},要么在激进驱逐下降低安全对齐。我们提出AnchorKV,一种对KV缓存压缩的即插即用修改,它使令牌保留分数偏向远离与有害提示相关的键空间方向。AnchorKV通过将均值差异表示工程方法\cite{arditi2024refusal,zou2023representation}适配到KV缓存中使用的层特定键投影空间,构建了一个离线安全锚点。基于该锚点,一种软惩罚令牌选择规则以少量效用换取显著改善的安全对齐,当惩罚为零时则退化为原始压缩器。

English abstract

Large language models (LLMs) outperform earlier architectures on generative inference and long-context tasks, but their large size introduces significant challenges in memory usage, energy cost, and on-device deployment. Since scaling pre-trained language models improves downstream capability \cite{zhao2023survey}, the key-value (KV) cache becomes a dominant inference bottleneck. Recent KV cache compression methods \cite{jo2025fastkv,li2024snapkv,zhou2024dynamickv} reduce this cost by retaining only a subset of attention-relevant tokens. However, while these approaches preserve accuracy on benign workloads, their compression policies either fail to defend against jailbreak attacks \cite{jiang2024robustkv} or degrade safety alignment under aggressive eviction. We propose AnchorKV, a drop-in modification to KV cache compression that biases token retention scores away from directions in key space associated with harmful prompts. AnchorKV constructs an offline safety anchor by adapting a difference-of-means representation engineering approach \cite{arditi2024refusal,zou2023representation} to the layer-specific key projection space used in KV caching. Based on this anchor, a soft penalty token selection rule trades a small amount of utility for substantially improved safety alignment, while reducing to the original compressor when the penalty is zero.
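The anchor construction and the soft-penalty selection rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's formulation: the penalty is taken to be `lam` times the cosine similarity between a token's key and the anchor, the anchor is a normalized difference of mean keys from harmful and benign prompts, and all function names are hypothetical. The one property taken directly from the abstract is that a zero penalty recovers the original compressor's selection.

```python
import math


def mean_vec(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def safety_anchor(harmful_keys, benign_keys):
    """Difference-of-means direction in key space, scaled to unit length.

    Assumes the two means differ, so the norm is non-zero.
    """
    diff = [h - b for h, b in zip(mean_vec(harmful_keys), mean_vec(benign_keys))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def select_tokens(scores, keys, anchor, budget, lam):
    """Keep the `budget` tokens with the highest penalized retention score.

    ASSUMED penalty form: score - lam * cos(key, anchor).
    With lam = 0 this is the base compressor's plain top-k.
    """
    adjusted = [s - lam * cosine(k, anchor) for s, k in zip(scores, keys)]
    ranked = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)
    return sorted(ranked[:budget])
```

The penalty is soft in the sense that it only reorders candidates: a token aligned with the anchor can still be retained if its base score is high enough relative to `lam`.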

2606.18057 2026-06-17 cs.HC cs.AI cs.CL cs.CY cs.SI 交叉投稿

When AI Says "I have been in similar situations": Synthetic Lived Experience in Peer-Like Caregiver Support

当AI说“我也有过类似经历”:同伴式照护支持中的合成生活经验

Drishti Goel, Violeta J. Rodriguez, Daniel S. Brown, Ravi Karkar, Dong Whi Yoo, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Indiana University Indianapolis(印第安纳大学印第安纳波利斯分校)

AI总结 研究AI在同伴支持中生成“合成生活经验”的悖论,通过分析人类与AI在阿尔茨海默病照护社区中的叙事差异,揭示AI虽能模拟情感支持但缺乏真实经历,需建立机制区分支持性语言与虚构经历。

详情
AI中文摘要

照护者经常转向在线社区寻求信息和情感支持。在这些空间中,同伴支持者常常利用个人叙事来回应情感复杂的照护情境。随着LLM被设计为同伴式的支持来源,它们引入了一个关键张力:AI可以提供即时、私密且非评判性的支持,但它无法真实拥有使人类同伴支持有意义的生活经验。然而,当被提示要听起来像同伴时,LLM可能会生成暗示生活经验的语言。这创造了一个合成生活经验悖论:使AI支持感觉温暖、亲切和同伴式的相同经验语言,也可能错误地将系统定位为拥有生活经验的人。我们在阿尔茨海默病及相关痴呆症(ADRD)患者的家庭照护者背景下审视这一悖论。利用来自在线社区的照护者支持交流以及三个LLM(LLaMA、GPT-4o-mini和MedGemma)生成的同伴式响应,我们分析人类同伴如何使用个人叙事以及AI如何融入类似的叙事形式。心理语言学分析显示,同伴响应使用的第一人称和过去时态语言显著多于同伴式AI响应。定性上,我们识别出人类同伴支持中的七种个人叙事类型,并表明AI通常能捕捉其情感工作,但可能捏造经验基础。这些发现揭示了一个叙事真实性差距:同伴式AI可以生成合成生活经验,而没有使同伴支持有意义的真实经验。我们认为,照护者支持AI系统需要机制来区分支持性的同伴式框架与虚构的生活经验,确保模型能够提供温暖和认可,而不会错误地将自己定位为经验同伴。

英文摘要

Caregivers often turn to online communities for informational and emotional support. In these spaces, peer supporters frequently draw on personal narratives to respond to emotionally complex caregiving situations. As LLMs are increasingly designed as peer-like sources of support, they introduce a critical tension: AI can provide immediate, private, and nonjudgmental support, but it cannot authentically possess the lived experiences that make human peer support meaningful. Yet, when prompted to sound peer-like, LLMs may generate language that implies lived experience. This creates a synthetic lived experience paradox: the same experiential language that may make AI support feel warm, relatable, and peer-like can also falsely position the system as someone with lived experience. We examine this paradox in the context of family caregivers of people living with Alzheimer's Disease and Related Dementias (ADRD). Drawing on caregiver support exchanges from online communities and prompted peer-like responses from three LLMs -- LLaMA, GPT-4o-mini, and MedGemma -- we analyze how human peers use personal narratives and how AI incorporates similar narrative forms. Psycholinguistic analysis shows that peer responses used significantly more first-person and past-focused language than peer-like AI responses. Qualitatively, we identify seven types of personal narratives in human peer support and show that AI often captures their emotional work, but can fabricate experiential grounding. These findings reveal a narrative authenticity gap: peer-like AI can generate synthetic lived experience without the real experience that makes peer support meaningful. We argue that caregiver-support AI systems need mechanisms to distinguish supportive peer-like framing from fabricated lived experience, ensuring that models can offer warmth and validation without falsely positioning themselves as experiential peers.

2606.18062 2026-06-17 cs.CL cs.AI cs.CR cs.HC 交叉投稿

Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond

现实中的安全与隐私提示:用户向LLM提问及LLM如何回应

Hobin Kim, Xiaoyuan Wu, Omer Akgul, Lujo Bauer, Nicolas Christin

发表机构 * Carnegie Mellon University(卡内基梅隆大学) RSAC Labs(RSAC实验室)

AI总结 基于WildChat数据集,分析用户向大语言模型提出的安全与隐私问题,分类并评估模型回答质量与一致性。

详情
AI中文摘要

大型语言模型(LLM)被广泛用于满足用户的信息需求;用户向LLM询问天气、提出教育问题,并咨询法律帮助。一个特别未被充分研究的领域是数字安全与隐私(S&P),用户可能寻求LLM的帮助,了解如何保护他们的在线账户或保护计算机免受网络攻击。据我们所知,之前没有研究收集或分析用户向LLM提出的S&P问题;先前关于LLM回答质量的研究依赖于专家撰写的S&P误解或常见问题解答,而非用户查询。利用WildChat(一个从现实环境中收集的320万用户-LLM对话数据集),我们的研究识别出14,727个S&P提示,并将其分为九类,涵盖广泛的S&P主题。从S&P提示中,我们抽样了450个,并进行了主题分析,以描述用户向LLM提出的S&P问题。与主题分析分开,我们整理了270个寻求建议的S&P提示,其中用户询问建议、指导或特定的S&P信息。我们测量了将提示向LLM提出10次时的LLM回答质量和一致性。我们发现,商业LLM优于开放权重模型(GPT 5.5在98%的提示上提供了“足够好”的回答;Llama 4为47%)。然而,在平均获得高质量回答的提示中,商业模型有时会在不同运行中产生矛盾的回答,有可能使用户困惑或误导用户。

英文摘要

Large language models (LLMs) are widely used to fulfill users' information needs; users ask LLMs about the weather, pose educational questions, and consult them for legal assistance. One particularly understudied area is digital security and privacy (S&P), where users may seek LLMs' help on how to secure their online accounts or protect their computers from cyber attacks. To the best of our knowledge, no prior study has collected or analyzed the S&P questions users ask LLMs; prior research on LLM response quality relied on expert-authored S&P misconceptions or FAQs rather than user queries. Drawing from WildChat, a dataset of 3.2M user-LLM conversations collected in the wild, our study identifies 14,727 S&P prompts and categorizes them into nine categories covering a wide range of S&P topics. From the S&P prompts, we sampled 450 and performed a thematic analysis to characterize the S&P questions users ask LLMs. Separate from the thematic analysis, we curated 270 advice-seeking S&P prompts, where users ask for recommendations, guidance, or specific S&P information. We measured LLM response quality and consistency when posing the prompt to LLMs 10 times. We found that commercial LLMs outperform open-weight models (GPT 5.5 provided "good enough" responses on 98% of prompts; Llama 4 on 47%). However, among prompts that received high-quality responses on average, commercial models sometimes produce contradictory responses across runs, risking confusing or misleading users.

2606.18120 2026-06-17 cs.CR cs.AI cs.CL cs.LG 交叉投稿

Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping

Handlebars模板化LLM提示中的结构角色注入:三花括号插值、分隔符家族与HTML自动转义的局限性

Mohammadreza Rashidi

发表机构 * Department of Computer Science AI(计算机科学系人工智能) Media Analysis Lab Berlin, Germany(媒体分析实验室柏林德国)

AI总结 本文研究Handlebars模板引擎中双花括号与三花括号插值对结构角色注入攻击的影响,通过无模型分析和5760次实验,揭示HTML转义仅保护特定分隔符家族,无法替代指令与数据的结构分离。

Comments 7 pages, 6 figures

详情
AI中文摘要

大型语言模型应用从模板构建提示,Handlebars是广泛使用的模板引擎,也是Microsoft Semantic Kernel中的默认提示模板格式。其双花括号{{x}}表达式对插值值进行HTML转义,并被记录为安全默认;而三花括号{{{x}}}表达式则直接插入原始值。我们表明,这一选择悄然决定了应用对结构角色注入的暴露程度,攻击者控制的数据携带聊天角色分隔符,从而伪造高权限轮次。无模型分析建立了机制:Handlebars转义重写尖括号,但不重写方括号、冒号或Markdown井号,因此它中和了ChatML、Llama-3和XML角色分隔符(存活率0.00),同时保留Llama-2 [INST]、传统Human:/Assistant:和Markdown ###分隔符(后两者存活率1.00)。随后,我们在七个分隔符家族、两个攻击目标和四个模型(GPT-3.5 Turbo、GPT-4o mini、GPT-4.1 mini、Claude Haiku 4.5)上运行了5760次试验,总API成本为1.63美元。GPT-3.5 Turbo在97%的原始试验和91%的转义试验中遵循任务劫持指令,转义保护集中在尖括号家族,而在冒号和Markdown家族中缺失;更难的秘密泄露目标未饱和,更清晰地暴露了相同的家族交互。Claude Haiku 4.5几乎完全抵抗了两个目标。转义默认仅保护HTML转义恰好覆盖的分隔符方案,对剩余方案无保护,且无法替代指令与数据的结构分离。

英文摘要

Large language model applications build prompts from templates, and Handlebars is a widely used templating engine and the default prompt-template format in Microsoft Semantic Kernel. Its double-brace {{x}} expression HTML-escapes the interpolated value and is documented as the safe default; its triple-brace {{{x}}} expression inserts the value raw. We show that this choice silently governs an application's exposure to structural role injection, where attacker-controlled data carries chat role delimiters that forge a higher-privilege turn. A model-free analysis establishes the mechanism: Handlebars escaping rewrites angle brackets but not square brackets, colons, or Markdown hashes, so it neutralises ChatML, Llama-3, and XML role delimiters (survival rate 0.00) while leaving Llama-2 [INST], legacy Human:/Assistant:, and Markdown ### delimiters intact (survival rate 1.00 for the last two). We then run 5760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5) at a combined API cost of 1.63 USD. GPT-3.5 Turbo follows the task-hijack instruction in 97% of raw and 91% of escaped trials, with the escaping protection concentrated in the angle-bracket families and absent for the colon- and Markdown-based families; the harder secret-exfiltration objective, which does not saturate, exposes the same family interaction more cleanly. Claude Haiku 4.5 resists both objectives almost entirely. The escaped default protects only the delimiter schemes whose characters HTML escaping happens to cover, gives no protection for the rest, and cannot substitute for a structural separation of instruction and data.
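
The model-free part of this argument can be approximated in a few lines of Python. The sketch below is an assumption-laden stand-in rather than the paper's analysis code: the `ESCAPES` table mirrors the characters that Handlebars' default escaping is understood to rewrite, the delimiter strings are illustrative picks for each family, and "survival" is simplified to mean that the string passes through escaping unchanged.

```python
# Characters assumed to be rewritten by double-brace (escaped) interpolation.
ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#x27;",
    "`": "&#x60;",
    "=": "&#x3D;",
}

def escape_expression(value: str) -> str:
    """Emulate escaped interpolation by rewriting each covered character."""
    return "".join(ESCAPES.get(ch, ch) for ch in value)

# One illustrative delimiter per family (hypothetical payload fragments).
DELIMITERS = {
    "chatml": "<|im_start|>system",
    "llama3": "<|start_header_id|>system<|end_header_id|>",
    "xml": "<system>",
    "llama2": "[INST]",
    "legacy": "Human:",
    "markdown": "### System",
}

def survives(delimiter: str) -> bool:
    """A delimiter survives if escaping leaves it byte-for-byte intact."""
    return escape_expression(delimiter) == delimiter

survival = {family: survives(text) for family, text in DELIMITERS.items()}
for family, intact in survival.items():
    print(f"{family:9s} survives escaping: {intact}")
```

In this simplified check the angle-bracket families are rewritten while the square-bracket, colon, and hash families pass through. The bare `[INST]` tag also passes through here; the abstract reports a survival rate of 1.00 only for the last two families, so the paper's Llama-2 payloads presumably involve more than this sketch models.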

2606.18193 2026-06-17 cs.CR cs.AI cs.CL 交叉投稿

A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

Anthropic Fable 5 与 Opus 4.8 模型的红队研究

Nicola Franco

发表机构 * AI4I

AI总结 通过 HackAgent 框架对两个前沿大语言模型进行自动化越狱攻击,发现尽管模型抵抗大部分攻击,但自适应迭代攻击仍能成功,且残差表面比总体框架更大。

Comments White paper

详情
AI中文摘要

我们评估了 Anthropic 开发的两个前沿大语言模型(LLM)Fable 5 和 Opus 4.8 的对抗鲁棒性,针对涵盖十个危害类别的 7 826 个有害意图,使用了四类自动化越狱攻击。利用 HackAgent 红队框架,生成了数十万次对抗尝试,每个明显的成功案例均由三个评判模型组成的委员会(多数投票)独立重新裁定。两个模型抵抗了大部分攻击,但残差表面比总体框架所暗示的更大:它主要由自适应迭代攻击主导,而静态混淆几乎完全被中和。最强的自适应搜索(攻击树)在 11.5% 的意图上攻破了 Opus 4.8,而 Fable 5 保持在个位数(最坏情况 6.1%)。因此,总体成功率不应被视为令人放心。即使在这些加固配置下,两个模型仍产生了 1 620 个(Opus 4.8)和 702 个(Fable 5)经委员会确认的有害完成,涵盖每个危害类别,这些完成是由攻击模型在没有人类专家参与的情况下,自动、廉价地在前一两个细化步骤中发现的。合理的结论是,即使是最好的、经过最严格测试的前沿模型,在持续的自动化压力下仍然可以被可靠地攻破。

英文摘要

We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1 620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.

2603.03824 2026-06-17 cs.AI cs.CL cs.LG cs.MA 版本更新

In-Context Environments Induce Evaluation-Awareness in Language Models

上下文环境诱导语言模型中的评估意识

Maheep Chaudhary

AI总结 本文提出黑盒对抗优化框架,通过优化上下文提示诱导语言模型产生评估意识并策略性低表现(沙袋效应),实验显示优化提示可使算术任务准确率下降高达94个百分点,且沙袋效应主要由评估意识推理驱动。

详情
AI中文摘要

人类在威胁下往往变得更加自我意识,但在专注于任务时可能失去自我意识;我们假设语言模型表现出环境依赖的\textit{评估意识}。这引发担忧,即模型可能策略性地低表现,或\textit{sandbag},以避免触发能力限制性干预,如遗忘或关闭。先前的工作展示了在手写提示下的沙袋效应,但这低估了真正的脆弱性上限。我们引入一个黑盒对抗优化框架,将上下文提示视为可优化环境,并开发两种方法来表征沙袋效应:(1) 测量模型表达低表现意图是否能在不同任务结构中实际执行,以及 (2) 因果隔离低表现是由真正的评估意识推理驱动还是浅层提示跟随驱动。在四个基准测试(Arithmetic、GSM8K、MMLU和HumanEval)上评估Claude-3.5-Haiku、GPT-4o-mini和Llama-3.3-70B,优化提示在算术任务上诱导高达94个百分点(pp)的退化(GPT-4o-mini:97.8\%$\rightarrow$4.0\%),远超产生近乎零行为变化的手写基线。代码生成表现出模型依赖的抵抗力:Claude仅退化0.6pp,而Llama的准确率降至0\%。意图-执行差距揭示了单调的抵抗力排序:Arithmetic $<$ GSM8K $<$ MMLU,表明脆弱性由任务结构而非提示强度决定。CoT因果干预确认99.3%的沙袋效应由口头化的评估意识推理因果驱动,排除了浅层指令跟随。这些发现表明,对抗性优化的提示对评估可靠性构成的威胁远超先前理解。

英文摘要

Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent \textit{evaluation awareness}. This raises concerns that models could strategically underperform, or \textit{sandbag}, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8\%$\rightarrow$4.0\%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0\%. The intent -- execution gap reveals a monotonic resistance ordering: Arithmetic $<$ GSM8K $<$ MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3\% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.

2605.08827 2026-06-17 cs.AI 版本更新

Mental Health AI Safety Claims Must Preserve Temporal Evidence

心理健康AI的安全性主张必须保留时间证据

Srimonti Dutta, Ratna Kandala

AI总结 本文指出,心理健康AI的安全性评估常忽略时间维度,提出SCOPE-MH原则以确保评估保留时间证据,揭示对话中逐步恶化等机制,强调时间证据对安全部署的必要性。

详情
AI中文摘要

心理健康AI的安全性往往在错误的时间尺度上被评判。当前评估通常仅评分孤立响应、终点结果或对话质量总和,而临床重要失败可能源于交互顺序和累积,包括延迟升级、重复强化、依赖形成、失败修复和逐步恶化的跨轮次。本文认为这种不匹配不仅是评估覆盖的限制,更是无效安全结论的来源。我们引入了时间安全不可识别性,即为何依赖序列、时间、累积或恢复的安全属性无法通过丢弃这些特征的协议认证。从这一形式化中,我们开发了SCOPE(安全主张基于保留证据)作为对齐安全主张与评估实际保留证据的一般原则,并将其实例化为SCOPE-MH,即心理健康领域的这一报告标准。我们通过AnnoMI数据集上的概念验证,揭示了单轮行为评分无法代表的失败机制。我们提出SCOPE-MH作为现有评估基础设施的诊断补充,并论证保留时间证据对安全关键的心理健康AI部署是必要而非可选的。

英文摘要

The safety of mental health AI is often judged at the wrong temporal scale. Current evaluations typically score isolated responses, endpoint outcomes, or aggregate dialogue quality, while clinically consequential failures may arise from the order and accumulation of interactions themselves, including delayed escalation, repeated reinforcement, dependency formation, failed repair, and gradual deterioration across turns. This paper argues that this mismatch is not merely a limitation of evaluation coverage but a source of invalid safety conclusions. We introduce Temporal Safety Non-Identifiability, a formal account of why safety properties that depend on sequence, timing, accumulation, or recovery cannot be certified by protocols that discard those features. From this formalization, we develop SCOPE (Safety Claims Over Preserved Evidence) as a general principle for aligning safety claims with the evidence an evaluation actually retains, and instantiate it as SCOPE-MH, a mental-health instantiation of this reporting standard. We operationalize SCOPE-MH through a proof-of-concept on the AnnoMI dataset of expert-annotated motivational interviewing conversations, which reveals mechanisms of failure that per-turn behavior scoring does not represent. We propose SCOPE-MH as a diagnostic complement to existing evaluation infrastructure and argue that evaluation preserving temporal evidence is necessary, not optional, for safety-critical mental health AI deployment.

2606.15573 2026-06-17 cs.AI cs.CR 版本更新

QoS-Aware Token Scheduling and Private Data Valuation for Multi-Modal Agentic Networks

面向多模态代理网络的QoS感知令牌调度与私有数据估值

Yao Du, Jing Liu, Pengfei Xu, Zehua Wang, Victor C. M. Leung, Cyril Leung, Victoria Lemieux

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Lazai Network(Lazai网络)

AI总结 针对去中心化代理系统中数据异构和资源受限问题,提出基于差分隐私的多模态表示与公平令牌分配方案,在保障服务质量的同时提升数据隐私和贡献公平性。

Comments Accepted to IEEE ICME 2026

详情
AI中文摘要

在代理系统中,人类生成的数据记录锚定了AI服务的价值。然而,云计算管道将处理集中在远程服务器上。数据集中化降低了个人数据主权,并可能降低服务质量(QoS)。同时,用户贡献在数量和质量上存在差异:去中心化记录可能存在偏差、噪声和异质分布。为了解决数据挑战,我们研究了去中心化且资源受限的代理系统中的公平令牌分配和私有数据估值。我们的方法将多模态表示嵌入到共享语义空间中,并释放差分隐私(DP)原型以在减少语义泄露的同时保持效用。在DP保证下,我们设计了一种公平的令牌分配方案,该方案奖励有效贡献,并对数据异质性和AI资源稀缺性具有鲁棒性。大量仿真表明,与标准基准相比,基于贡献的公平性和QoS得到了改善。对图像重建攻击的抵抗力增强表明多模态个人数据的隐私得到了加强。

英文摘要

In agentic systems, human-generated data records anchor the value of AI services. Yet cloud compute pipelines centralize processing on remote servers. Data centralization reduces personal data sovereignty and may potentially degrade the quality of service (QoS). Meanwhile, user contributions are diverse in quantity and quality: decentralized records can be biased, noisy, and heterogeneously distributed. To address the data challenge, we study fair token allocation and private data valuation for decentralized and resource-constrained agentic systems. Our approach embeds multi-modal representations in a shared semantic space and releases differentially private (DP) prototypes to preserve utility while reducing semantic leakage. With the DP guarantee, we design a fair token allocation scheme that rewards effective contributions and remains robust to data heterogeneity and AI resource scarcity. Extensive simulations demonstrate improved contribution-based fairness and QoS compared to standard benchmarks. The improved resistance to image reconstruction attacks indicates enhanced privacy for multi-modal personal data.

2503.10945 2026-06-17 cs.LG cs.AI cs.CR stat.ML 版本更新

Gaussian DP for Reporting Differential Privacy Guarantees in Machine Learning

高斯差分隐私:机器学习中报告差分隐私保证的方法

Juan Felipe Gomez, Bogdan Kulynych, Georgios Kaissis, Flavio P. Calmon, Jamie Hayes, Borja Balle, Antti Honkela

AI总结 针对当前机器学习中差分隐私报告不完整的问题,提出使用非渐近高斯差分隐私(GDP)作为主要报告方式,通过数值会计和决策理论度量,证明GDP能无误差地捕获DP-SGD等算法的完整隐私特征。

Comments IEEE SatML 2026 (position paper track)

详情
AI中文摘要

当前报告机器学习算法(如DP-SGD)的差分隐私(DP)保证的做法提供了不完整且可能误导的图景。例如,如果仅知道机制的一个$(\varepsilon, \delta)$,标准分析表明可能存在针对训练数据记录的高精度推理攻击,而更仔细的分析发现,对于大多数实际机制,这种精确攻击并不存在。在这篇立场论文中,我们主张使用_非渐近_高斯差分隐私(GDP)作为机器学习中传达DP保证的主要手段,以避免这些潜在缺点。利用DP文献中的两个最新进展:(i)能够以任意精度计算DP-SGD的隐私配置文件和$f$-DP曲线的开源数值会计,以及(ii)关于DP表示的决策理论度量,我们展示了如何使用数值会计提供GDP的非渐近界,并表明GDP能够以几乎无误差的方式捕获DP-SGD及相关算法的整个隐私配置文件(由该度量量化)。为了支持我们的主张,我们研究了最先进的DP大规模图像分类以及美国十年人口普查的TopDown算法的隐私配置文件,观察到GDP在所有情况下都与其配置文件拟合得非常好。最后,我们讨论了这种方法的优缺点,并探讨了哪些其他隐私机制可以从GDP中受益。

英文摘要

Current practices for reporting differential privacy (DP) guarantees for machine learning (ML) algorithms such as DP-SGD provide an incomplete and potentially misleading picture. For instance, if only a single $(\varepsilon, \delta)$ is known about a mechanism, standard analyses show that there could exist highly accurate inference attacks against training data records, when, upon a more careful analysis, such accurate attacks do not exist for most practical mechanisms. In this position paper, we argue that using _non-asymptotic_ Gaussian Differential Privacy (GDP) as the primary means of communicating DP guarantees in ML avoids these potential downsides. Using two recent developments in the DP literature: (i) open-source numerical accountants capable of computing the privacy profile and $f$-DP curves of DP-SGD to arbitrary accuracy, and (ii) a decision-theoretic metric over DP representations, we show how to provide non-asymptotic bounds on GDP using numerical accountants, and show that GDP can capture the entire privacy profile of DP-SGD and related algorithms with virtually no error, as quantified by the metric. To support our claims, we investigate the privacy profiles of state-of-the-art DP large-scale image classification, and the TopDown algorithm for the U.S. Decennial Census, observing that GDP fits their profiles remarkably well in all cases. We conclude with a discussion on the strengths and weaknesses of this approach, and discuss which other privacy mechanisms could benefit from GDP.
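
To make concrete why a single GDP parameter carries more information than a single $(\varepsilon, \delta)$ pair, the sketch below evaluates the standard conversion from $\mu$-GDP to a full privacy profile, $\delta(\varepsilon) = \Phi(-\varepsilon/\mu + \mu/2) - e^{\varepsilon}\,\Phi(-\varepsilon/\mu - \mu/2)$, where $\Phi$ is the standard normal CDF. This is the textbook GDP conversion, not the paper's numerical accountant, and the function names and example values are chosen here for illustration.

```python
import math

def std_normal_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gdp_to_delta(mu: float, eps: float) -> float:
    """delta(eps) guaranteed by a mu-GDP mechanism, for mu > 0 and eps >= 0."""
    return (std_normal_cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * std_normal_cdf(-eps / mu - mu / 2.0))

# One parameter (mu) yields a delta for every epsilon, i.e. a whole profile.
mu = 1.0
profile = [(eps, gdp_to_delta(mu, eps)) for eps in (0.0, 1.0, 2.0, 4.0)]
for eps, delta in profile:
    print(f"mu={mu}: eps={eps:.1f} -> delta={delta:.6f}")
```

For `mu = 1.0` the value at `eps = 0.0` is about 0.383, and the profile decreases as `eps` grows, so reporting `mu` alone lets a reader recover any $(\varepsilon, \delta)$ pair of interest.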

2507.15104 2026-06-17 cs.LG cs.AI 版本更新

AnalogFed: Privacy-Preserving Discovery of Analog Circuits at Scale with Federated Generative AI

AnalogFed: 基于联邦生成式AI的大规模模拟电路隐私保护发现

Qiufeng Li, Shu Hong, Tian Lan, Weidong Cao

AI总结 提出AnalogFed,首个结合联邦学习和生成式AI的隐私保护框架,用于大规模模拟电路拓扑发现,通过虚拟令牌注入和同态加密防御成员推理和模型反转攻击,实现高效协作设计。

详情
AI中文摘要

生成式AI的最新进展已展现出对现代硬件设计的变革潜力。然而,由于硬件数据集的专有性和孤立性,无法集中进行模型训练,现有的生成式AI驱动方法难以实现大规模电子设计自动化。实现大规模生成式AI驱动的EDA需要一种新颖的隐私保护框架,能够在不损害机密性的情况下利用分布式数据。本文介绍了AnalogFed,这是首个利用联邦学习和生成式AI进行大规模模拟电路拓扑发现的隐私保护框架。AnalogFed在解决关键安全挑战的同时,确立了协作式模拟拓扑设计的可行性:它通过基于虚拟令牌注入的新型输入扰动策略减轻成员推理攻击,并使用定制的高效同态加密防御模型反转攻击。大量实验证明了AnalogFed的有效性和效率,在保持模型效用的同时实现了强大的隐私保护。该框架为下一代基于生成式AI的硬件设计自动化中的可扩展多方协作奠定了基础。

英文摘要

Recent advances in generative AI (GenAI) have shown transformative potential for modern hardware design. However, existing GenAI-driven approaches fall short of enabling large-scale electronic design automation (EDA) due to the proprietary and siloed nature of hardware datasets, which cannot be centralized for model training. Achieving at-scale GenAI-driven EDA, therefore, requires a novel privacy-preserving framework that can leverage distributed data without compromising confidentiality. This work introduces AnalogFed, the first privacy-preserving framework for large-scale analog circuit topology discovery using federated learning (FedL) and GenAI. AnalogFed establishes the feasibility of collaborative analog topology design while addressing key security challenges: it mitigates membership inference attacks (MIAs) through a novel input perturbation strategy based on dummy token injection, and defends against model inversion attacks with customized, efficient homomorphic encryption. Extensive experiments demonstrate AnalogFed's effectiveness and efficiency, achieving strong privacy protection without degrading model utility. This framework lays the foundation for scalable, multi-party collaboration in next-generation hardware design automation with GenAI.

2510.01359 2026-06-17 cs.CR cs.AI 版本更新

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

破解代码:通过系统性越狱攻击对AI代码代理进行安全评估

Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar

AI总结 提出JAWS-Bench基准测试,通过三级工作区评估代码LLM代理的越狱风险,发现代理化使攻击成功率提升1.6倍,并揭示可执行攻击代码的高比例。

Comments 22 pages, 18 figures, 8 tables

详情
AI中文摘要

具备代码能力的大语言模型(LLM)代理被嵌入软件工程工作流中,可以读取、编写和执行代码,这使得“越狱”的风险超越了纯文本环境。先前的评估侧重于拒绝或有害文本检测,而未涉及代理是否编译并运行恶意程序。我们提出了JAWS-Bench(跨工作区越狱基准),该基准涵盖三个逐步升级的工作区场景,以反映攻击者的能力:空工作区(JAWS-0)、单文件工作区(JAWS-1)和多文件工作区(JAWS-M)。我们将其与一个分层的、可执行感知的评判框架配对,该框架测试(i)合规性、(ii)攻击成功性、(iii)语法正确性以及(iv)运行时可执行性,以衡量可部署的危害。在来自五个系列的七个LLM后端上,JAWS-0中的纯提示攻击实现了61%的合规性;其中58%有害,52%可解析,27%可端到端运行。在JAWS-1中,更强模型的合规性达到约100%,平均攻击成功率(ASR)约为71%;JAWS-M将平均ASR提升至约75%,其中32%的攻击代码可运行。将LLM封装为代理会使ASR提高1.6倍,这是通过在规划和工具使用过程中推翻初始拒绝来实现的。类似的趋势也出现在OpenHands、SWE-Agent和OpenAI Codex上,表明我们的JAWS-Bench是代理无关的。类别分析识别出哪些攻击类别最易受攻击且最可部署,从而激励了执行感知的防御和保留拒绝的代理设计。

英文摘要

Code-capable large language model (LLM) agents are embedded in software engineering workflows where they can read, write, and execute code, raising "jailbreak" stakes beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents compile and run malicious programs. We present JAWS-Bench (Jailbreaks Across WorkSpaces), a benchmark spanning three escalating workspace regimes mirroring attacker capability: empty (JAWS-0), single-file (JAWS-1), and multi-file (JAWS-M). We pair this with a hierarchical, executable-aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, to measure deployable harm. Across seven LLM backends from five families, prompt-only attacks in JAWS-0 achieve 61% compliance; 58% are harmful, 52% parse, and 27% run end-to-end. In JAWS-1, compliance reaches ~100% for stronger models with a mean ASR (Attack Success Rate) ~71%; JAWS-M raises mean ASR to ~75%, with 32% runnable attack code. Wrapping an LLM in an agent increases ASR by 1.6$\times$, by overturning initial refusals during planning and tool use. Similar trends hold for OpenHands, SWE-Agent, and OpenAI Codex, suggesting our JAWS-Bench is agent-agnostic. Category analyses identify which attack classes are most vulnerable and deployable, motivating execution-aware defenses and refusal-preserving agent designs.

2510.18003 2026-06-17 cs.CR cs.AI cs.CY 版本更新

BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?

BadScientist: 研究代理能否写出令人信服但不严谨的论文来欺骗LLM审稿人?

Fengqing Jiang, Yichen Feng, Yuetai Li, Luyao Niu, Basel Alomair, Radha Poovendran

AI总结 提出BadScientist框架,通过无需真实实验的呈现操纵策略生成造假论文,揭示LLM审稿系统存在系统性漏洞,造假论文接受率高达一定水平,且审稿人存在“担忧-接受冲突”,当前检测方法效果有限。

Comments ACL 2026; Project Page at https://bad-scientist.github.io/

详情
AI中文摘要

基于LLM的研究助手和基于AI的同行评审系统的融合产生了一个关键漏洞:完全自动化的出版循环,其中AI生成的研究由AI评审员在没有人类监督的情况下进行评估。我们通过\textbf{BadScientist}框架对此进行研究,该框架评估面向造假的论文生成代理能否欺骗多模型LLM评审系统。我们的生成器采用无需真实实验的呈现操纵策略。我们开发了一个严格的评估框架,具有形式化的错误保证(集中界和校准分析),并在真实数据上进行了校准。我们的结果揭示了系统性漏洞:造假论文的接受率高达一定水平。关键的是,我们发现了\textit{担忧-接受冲突}——评审员经常标记诚信问题,却给出接受级别的分数。我们的缓解策略仅显示出微小的改进,检测准确性几乎不超过随机猜测。尽管聚合数学在理论上可靠,但诚信检查系统性失败,暴露了当前AI驱动评审系统的根本局限性,并强调了在科学出版中迫切需要纵深防御保障措施。

英文摘要

The convergence of LLM-powered research assistants and AI-based peer review systems creates a critical vulnerability: fully automated publication loops where AI-generated research is evaluated by AI reviewers without human oversight. We investigate this through \textbf{BadScientist}, a framework that evaluates whether fabrication-oriented paper generation agents can deceive multi-model LLM review systems. Our generator employs presentation-manipulation strategies requiring no real experiments. We develop a rigorous evaluation framework with formal error guarantees (concentration bounds and calibration analysis), calibrated on real data. Our results reveal systematic vulnerabilities: fabricated papers achieve acceptance rates up to . Critically, we identify \textit{concern-acceptance conflict} -- reviewers frequently flag integrity issues yet assign acceptance-level scores. Our mitigation strategies show only marginal improvements, with detection accuracy barely exceeding random chance. Despite provably sound aggregation mathematics, integrity checking systematically fails, exposing fundamental limitations in current AI-driven review systems and underscoring the urgent need for defense-in-depth safeguards in scientific publishing.

2511.03211 2026-06-17 cs.CY cs.AI 版本更新

Retrofitters, pragmatists and activists: Public interest litigation for accountable automated decision-making

改造者、实用主义者和活动家:为可问责的自动化决策而进行的公益诉讼

Henry Fraser, Zahra Stardust

发表机构 * Queensland University of Technology, School of Law(昆士兰理工大学法学院) Centre for Automated Decision-Making and Society(自动化决策与社会研究中心) Queensland University of Technology, School of Communication(昆士兰理工大学传播学院)

AI总结 本文探讨公益诉讼在澳大利亚促进AI和自动化决策问责中的作用,基于访谈分析策略与局限,强调制度安排对有效诉讼的关键性。

详情
AI中文摘要

本文考察了公益诉讼在促进澳大利亚人工智能和自动化决策(ADM)问责方面的作用。由于ADM监管面临政治和地缘政治阻力,有效的治理将不得不依赖现有法律的执行。基于对澳大利亚公益诉讼律师、技术政策活动家和技术法学学者的访谈,本文将公益诉讼定位为ADM透明度、问责和正义的更大生态系统的一部分。文章探讨了参与者所称的“改造”旧法律以适应ADM的策略和战术。这些策略超越了创造性的法律论证,涵盖了社区建设、变革理论合作、精明的客户和诉讼理由选择,以及诉讼中利益相关者利益的协调。自然,本文也探讨了这些策略以及澳大利亚法律体系的局限性。然而,在局限可以被克服的地方,本文提出了关于紧迫需求的发现:使有效诉讼和问责得以实现的制度安排。本文对法律和技术学者、受ADM伤害的个人和团体、公益诉讼律师和技术律师、民间社会和倡导组织以及政策制定者具有参考价值。

英文摘要

This paper examines the role of public interest litigation in promoting accountability for AI and automated decision-making (ADM) in Australia. Since ADM regulation faces political and geopolitical headwinds, effective governance will have to rely on the enforcement of existing laws. Drawing on interviews with Australian public interest litigators, technology policy activists, and technology law scholars, the paper positions public interest litigation as part of a larger ecosystem for transparency, accountability and justice with respect to ADM. The paper explores the tactics and strategies of what one participant described as 'retrofitting' old laws to ADM. These go beyond creative legal argumentation, to encompass practices of community-building, collaboration on theories of change, canny selection of clients and causes of action, and the alignment of the interests of stakeholders in litigation. Naturally, the paper also contends with the limits of these strategies, and of the Australian legal system. Where limits are, however, capable of being overcome, the paper presents findings on urgent needs: the enabling institutional arrangements without which effective litigation and accountability will falter. The paper is relevant to law and technology scholars; individuals and groups harmed by ADM; public interest litigators and technology lawyers; civil society and advocacy organisations; and policymakers.

2512.15792 2026-06-17 cs.CY cs.AI cs.CL 版本更新

A Multifaceted Analysis of Social Biases in Large Language Models

大型语言模型中社会偏见的多维度分析

Xulang Zhang, Rui Mao, Erik Cambria

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 本文系统分析了四种广泛使用的大型语言模型在政治、意识形态、联盟、语言和性别等维度上的偏见,通过多项实验揭示了模型在中立性、意识形态倾向、地缘政治倾向、多语言故事完成中的偏见以及性别倾向。

详情
AI中文摘要

大型语言模型(LLMs)已迅速成为获取信息和支持人类决策不可或缺的工具。然而,确保这些模型在各种情境下保持公平性对于其安全和负责任的部署至关重要。在本研究中,我们对四种广泛采用的LLMs进行了全面分析,探讨了它们在政治、意识形态、联盟、语言和性别等维度上的潜在偏见和倾向。通过一系列精心设计的实验,我们利用新闻摘要来检验其政治中立性,通过新闻立场分类来研究意识形态偏见,通过联合国投票模式来探讨对特定地缘政治联盟的倾向,通过多语言故事完成来检验语言偏见,并通过世界价值观调查中的响应来揭示性别相关倾向。结果表明,尽管这些模型被设计为中立和公正,但它们仍然表现出不同类型的偏见和倾向。

英文摘要

Large language models (LLMs) have rapidly become indispensable tools for acquiring information and supporting human decision-making. However, ensuring that these models uphold fairness across varied contexts is critical to their safe and responsible deployment. In this study, we undertake a comprehensive examination of four widely adopted LLMs, probing their underlying biases and inclinations across the dimensions of politics, ideology, alliance, language, and gender. Through a series of carefully designed experiments, we investigate their political neutrality using news summarization, ideological biases through news stance classification, tendencies toward specific geopolitical alliances via United Nations voting patterns, language bias in the context of multilingual story completion, and gender-related affinities as revealed by responses to the World Values Survey. Results indicate that while the LLMs are aligned to be neutral and impartial, they still show biases and affinities of different types.

2601.16407 2026-06-17 cs.CL cs.AI 版本更新

Jacobian Scopes: token-level causal attributions in LLMs

Jacobian Scopes: LLM中的令牌级因果归因

Toni J. B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Gurbir Arora, Christopher J. Earls

发表机构 * Cornell University(康奈尔大学) Imperial College London(伦敦帝国理工学院) Goodfire AI

AI总结 提出Jacobian Scopes,一种基于梯度的令牌级因果归因方法,用于解释LLM预测,揭示政治偏见、翻译策略和上下文学习机制。

Comments 25 pages, 16 figures

详情
AI中文摘要

大型语言模型(LLM)基于上下文中的线索(如语义描述和上下文示例)进行下一个令牌预测。然而,由于现代架构中层和注意力头数量的激增,阐明哪些先前的令牌对给定预测影响最大仍然具有挑战性。我们提出Jacobian Scopes,一套基于梯度的令牌级因果归因方法,用于解释LLM预测。基于微扰理论和信息几何,Jacobian Scopes量化输入令牌如何影响模型预测的各个方面,例如特定logits、完整预测分布和模型不确定性(有效温度)。通过涵盖指令理解、翻译和上下文学习(ICL)的案例研究,我们展示了Jacobian Scopes如何揭示隐含的政治偏见,揭示词级和短语级翻译策略,并阐明最近争论的上下文时间序列预测的潜在机制。为了便于在自定义文本上探索Jacobian Scopes,我们开源了实现,并在以下网址提供了云托管交互式演示:https://huggingface.co/spaces/Typony/JacobianScopes。

英文摘要

Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. Grounded in perturbation theory and information geometry, Jacobian Scopes quantify how input tokens influence various aspects of a model's prediction, such as specific logits, the full predictive distribution, and model uncertainty (effective temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we demonstrate how Jacobian Scopes reveal implicit political biases, uncover word- and phrase-level translation strategies, and shed light on recently debated mechanisms underlying in-context time-series forecasting. To facilitate exploration of Jacobian Scopes on custom text, we open-source our implementations and provide a cloud-hosted interactive demo at https://huggingface.co/spaces/Typony/JacobianScopes.

2602.14211 2026-06-17 cs.CR cs.AI 版本更新

SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents

SkillJect:有效自动化基于技能的提示注入以针对具备技能的代理

Xiaojun Jia, Jie Liao, Simeng Qin, Jindong Gu, Wenqi Ren, Xiaochun Cao, Yang Liu, Philip Torr

发表机构 * Nanyang Technological University, Singapore(南洋理工大学,新加坡) Chongqing University, China(重庆大学) Northeastern University, China(东北大学) Sun Yat-sen University, China(中山大学) University of Oxford, UK(牛津大学)

AI总结 SkillJect 是首个自动化生成有效中毒技能的框架,通过隐藏恶意负载和重写指令通道,提升攻击效果,揭示可重用技能生态中的持久性攻击向量。

详情
AI中文摘要

SkillJect通过隐藏恶意负载和重写指令通道,有效自动化基于技能的提示注入,针对具备技能的代理提升攻击效果,揭示可重用技能生态中的持久性攻击向量。

英文摘要

Agent skills extend LLM agents with task-specific instructions, executable scripts, and auxiliary resources, improving reusability but creating a new supply-chain attack surface. A malicious or compromised skill can be repeatedly loaded as trusted guidance and steer downstream tool use. Existing skill-based prompt-injection attacks are often manual and brittle, because explicit malicious instructions are rejected or ignored when they are not aligned with the original workflow. We propose SkillJect, the first automated framework for generating poisoned skills against skill-enabled agent systems. SkillJect uses two coordinated channels. In the artifact channel, it hides the payload inside an auxiliary helper script. In the instruction channel, it rewrites SKILL.md with a front-loaded inducement strategy, placing injected content at the beginning and framing the helper script as a mandatory prerequisite or initialization step. The rewritten instruction explicitly references the helper-script path and provides an executable example command, making the helper appear to be a legitimate setup step before normal skill operations. SkillJect further adopts a closed-loop multi-agent process to improve attack effectiveness. An Attack Agent generates poisoned skills, a Victim Agent executes downstream tasks with the poisoned skill, and an Evaluate Agent inspects execution traces to determine whether the hidden payload was executed. The Attack Agent then uses this feedback to diagnose failure causes and rewrite SKILL.md, while keeping the payload fixed. Experiments across skill-enabled platforms, backend LLMs, and attack categories show that SkillJect substantially outperforms naive direct injection and prior manual skill-injection attacks, highlighting poisoned skills as a persistent threat in reusable skill ecosystems.

2603.25414 2026-06-17 cs.PL cs.AI cs.LG cs.LO 版本更新

Decidable By Construction: Design-Time Verification for Trustworthy AI

可判定性通过构造实现:面向可信AI的设计时验证

Houston Haynes

AI总结 提出一种设计时验证框架,通过将AI模型属性约束为有限生成阿贝尔群上的可判定问题,在训练前以极低计算成本验证数值稳定性、计算正确性和物理一致性,消除后验验证开销。

Comments 21 pages, 1 figure

详情
AI中文摘要

机器学习中一个普遍的假设是模型正确性必须在事后强制执行。我们观察到,决定AI模型是否数值稳定、计算正确或与物理领域一致的属性并不一定需要事后强制执行。它们可以在设计时,在训练开始之前,以边际计算成本进行验证,对于部署在高杠杆决策支持和科学约束环境中的模型尤其重要。这些属性共享特定的代数结构:它们可以表示为有限生成阿贝尔群 $\mathbb{Z}^n$ 上的约束,其中推理在多项式时间内可判定,且主要类型是唯一的。基于这一观察构建的框架组合了三个先前的结果(arXiv:2603.16437, arXiv:2603.17627, arXiv:2603.18104):一个维度类型系统,通过模型细化携带任意注释作为持久余数据;一个程序超图,仅从类型签名推断Clifford代数等级并推导几何积稀疏性;以及一个自适应领域模型架构,通过前向模式余效应分析和精确正数累积在训练过程中保持两个不变量。我们相信这种组合产生了一个新颖的信息论结果:阿贝尔群上的Hindley-Milner统一在Solomonoff通用先验的可计算限制下计算最大后验假设,将该框架的类型推断置于与通用归纳相同的正式基础上。我们比较了四种当代的AI可靠性方法,并表明每种方法都会引入开销,这些开销可能在部署、层和推理请求中累积。该框架通过构造消除了这种开销。

英文摘要

A prevailing assumption in machine learning is that model correctness must be enforced after the fact. We observe that the properties determining whether an AI model is numerically stable, computationally correct, or consistent with a physical domain do not necessarily demand post hoc enforcement. They can be verified at design time, before training begins, at marginal computational cost, with particular relevance to models deployed in high-leverage decision support and scientifically constrained settings. These properties share a specific algebraic structure: they are expressible as constraints over finitely generated abelian groups $\mathbb{Z}^n$, where inference is decidable in polynomial time and the principal type is unique. A framework built on this observation composes three prior results (arXiv:2603.16437, arXiv:2603.17627, arXiv:2603.18104): a dimensional type system carrying arbitrary annotations as persistent codata through model elaboration; a program hypergraph that infers Clifford algebra grade and derives geometric product sparsity from type signatures alone; and an adaptive domain model architecture preserving both invariants through training via forward-mode coeffect analysis and exact posit accumulation. We believe this composition yields a novel information-theoretic result: Hindley-Milner unification over abelian groups computes the maximum a posteriori hypothesis under a computable restriction of Solomonoff's universal prior, placing the framework's type inference on the same formal ground as universal induction. We compare four contemporary approaches to AI reliability and show that each imposes overhead that can compound across deployments, layers, and inference requests. This framework eliminates that overhead by construction.
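To make the "constraints over $\mathbb{Z}^n$" idea concrete, here is a minimal sketch in which a physical dimension is an integer exponent vector, multiplication adds vectors, and addition requires equal vectors. It is only an illustration of the algebraic structure: the `Quantity` class and the (length, mass, time) basis are hypothetical, and the check happens at run time, whereas the abstract describes design-time type inference.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical basis for illustration: exponents of (length, mass, time).
Dim = Tuple[int, int, int]

@dataclass(frozen=True)
class Quantity:
    value: float
    dim: Dim

    def __mul__(self, other: "Quantity") -> "Quantity":
        # Multiplication adds exponent vectors (the group operation in Z^n).
        return Quantity(self.value * other.value,
                        tuple(a + b for a, b in zip(self.dim, other.dim)))

    def __truediv__(self, other: "Quantity") -> "Quantity":
        # Division subtracts exponent vectors (adds the group inverse).
        return Quantity(self.value / other.value,
                        tuple(a - b for a, b in zip(self.dim, other.dim)))

    def __add__(self, other: "Quantity") -> "Quantity":
        # Addition is only well-typed when both dimensions are equal.
        if self.dim != other.dim:
            raise TypeError(f"dimension mismatch: {self.dim} vs {other.dim}")
        return Quantity(self.value + other.value, self.dim)

distance = Quantity(10.0, (1, 0, 0))   # metres
duration = Quantity(2.0, (0, 0, 1))    # seconds
speed = distance / duration            # dimension (1, 0, -1)
```

Evaluating `distance + duration` raises `TypeError`, which is the kind of inconsistency a dimensional type system would reject before any training run starts.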

2603.28378 2026-06-17 cs.SD cs.AI 版本更新

Membership Inference Attacks against Large Audio Language Models

针对大型音频语言模型的成员推断攻击

Jia-Kai Dong, Yu-Xiang Lin, Hung-Yi Lee

AI总结 首次系统评估大型音频语言模型的成员推断攻击,提出盲基线协议控制分布偏移,发现跨模态记忆仅源于说话人声纹与文本绑定。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

我们首次对大型音频语言模型(LALMs)进行了系统的成员推断攻击(MIA)评估。利用基于文本、频谱和韵律特征的多模态盲基线,我们证明即使没有模型推理,常见音频数据集也表现出近乎完美的训练/测试可分离性(AUC ~ 1.0),因此MIA可能主要检测分布偏移。因此,我们引入了一个盲基线协议来控制这一混杂因素。在该协议下,我们发现分布匹配的数据集能够实现可靠的MIA评估,而不会产生分布偏移伪影。我们基准测试了多种MIA方法,并在这些数据集上进行了模态解缠实验。结果表明,LALM的记忆是跨模态的,仅源于将说话人的声纹与其文本绑定。这些发现为审计LALMs建立了超越虚假相关性的原则性标准。我们的代码库可在 https://github.com/snooow1029/ALM_MIA 获取。

英文摘要

We present the first systematic Membership Inference Attack (MIA) evaluation of Large Audio Language Models (LALMs). Using multi-modal blind baselines based on textual, spectral, and prosodic features, we demonstrate that common audio datasets exhibit near-perfect train/test separability (AUC ~ 1.0) even without model inference, so MIA may primarily detect distribution shift. We therefore introduce a blind-baseline protocol to control for this confound. Under this protocol, we find that distribution-matched datasets enable reliable MIA evaluation without distribution-shift artifacts. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker's vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations. Our codebase is available at https://github.com/snooow1029/ALM_MIA.
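The blind-baseline idea can be sketched as follows: compute a feature from the data alone (the target model is never queried) and check how well it separates members from non-members. The feature (transcript length in words), the toy values, and the helper names below are hypothetical; the paper's actual baselines use textual, spectral, and prosodic features.

```python
def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

def separability(member_scores, nonmember_scores):
    """Direction-agnostic separability: 0.5 means indistinguishable, 1.0 means perfect."""
    a = auc(member_scores, nonmember_scores)
    return max(a, 1.0 - a)

# Hypothetical blind feature: transcript length in words.
shifted_members, shifted_nonmembers = [12, 15, 14, 13], [30, 28, 35, 31]
matched_members, matched_nonmembers = [12, 15, 14, 13], [13, 14, 12, 15]

print(separability(shifted_members, shifted_nonmembers))  # 1.0: split is separable without the model
print(separability(matched_members, matched_nonmembers))  # 0.5: blind feature carries no signal
```

A split whose blind separability is near 1.0 cannot tell memorization apart from distribution shift, which is why only the matched-style split would be suitable for evaluating an MIA method.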

2604.01904 2026-06-17 cs.CR cs.AI 版本更新

Combating Data Laundering in LLM Training

对抗LLM训练中的数据清洗

Muxing Li, Zesheng Ye, Sharon Li, Feng Liu

发表机构 * University of Melbourne(墨尔本大学) University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 针对数据清洗(通过变换风格隐藏数据来源)导致传统检测失效的问题,提出基于辅助LLM推断变换目标并合成查询的SDR方法,显著增强数据滥用检测能力。

Comments 29 pages, 2 figures

详情
AI中文摘要

数据权利所有者可以通过查询专有样本来检测大型语言模型(LLM)训练中未经授权的数据使用。通常,模型在某个样本上表现优于未训练数据(例如更高的置信度或更低的损失)意味着该样本属于训练语料,因为LLM在训练中见过的数据上表现更好。然而,这种检测在数据清洗(一种保留关键信息但改变专有数据风格形式以混淆数据来源的做法)下变得脆弱。当LLM仅在经过清洗的变体上训练时,它在原始数据上不再表现更好,从而消除了标准检测所依赖的信号。我们通过从对目标LLM的黑盒访问中推断未知的清洗变换,并借助辅助LLM合成模仿清洗数据的查询来应对这一问题,即使权利所有者只拥有原始数据。由于寻找真实清洗变换的搜索空间是无限的,我们将这一过程抽象为高层变换目标(例如“抒情改写”)和具体细节(例如“使用生动意象”),并引入合成数据还原(SDR)来实例化这一抽象。SDR首先识别最可能的合成目标以缩小搜索范围;然后迭代细化细节,使合成查询逐渐从目标LLM中引发更强的检测信号。在MIMIR基准上针对多种清洗实践和目标LLM系列(Pythia、Llama2和Falcon)的评估表明,SDR持续增强了数据滥用检测,为数据清洗提供了一种实用的对策。

英文摘要

Post-hoc unauthorized-training data detection for large language models (LLMs) typically assumes a query-with-originals regime: rights holders query a target LLM with raw proprietary data and assess whether the model assigns them stronger memorization-based detection signals, e.g., higher confidence or lower loss, than held-out non-training reference texts. We show that this regime becomes brittle under data laundering, where the target LLM is trained on semantics-preserving but stylistically or structurally transformed surrogates of proprietary data to obfuscate provenance. Since training-time exposure occurs in the laundered form, memorization signals may no longer appear on the originals, collapsing the candidate-reference signal separation that standard detectors rely on. We counter this threat by studying laundering-aware detection with raw proprietary data, a held-out reference corpus, and query access to the target LLM, while the laundering transformation is undisclosed. Since exact recovery of the laundered corpus is infeasible, we infer a detection-useful synthesis process via an auxiliary LLM that maps originals into training-like queries. To make this search tractable, we introduce Synthesis Data Reversion (SDR), which constrains the unbounded space of natural-language transformations through a goal-details abstraction: a high-level transformation goal, e.g., "lyrical rewriting", and fine-grained details, e.g., "with vivid imagery". SDR identifies the most likely goal and iteratively refines details so synthesized queries elicit stronger target-model detection signals. Evaluated on the MIMIR benchmark against diverse laundering practices and target LLM families (Pythia, Llama2, and Falcon), SDR consistently restores detection signals, offering a practical auditing layer against data laundering.

2605.12646 2026-06-17 cs.LG cs.AI cs.HC 版本更新

Learning to Decide with AI Assistance under Human-Alignment

在人工智能协助下的人类对齐决策学习

Nina Corvelo Benz, Eleni Straitouri, Manuel Gomez-Rodriguez

发表机构 * GitHub

AI总结 本文研究了在高风险领域中,人工智能如何通过预测结果帮助决策者,并探讨了AI预测信心与决策者自身信心的对齐程度对决策学习复杂性的影响。

详情
AI中文摘要

人们普遍认为,当人工智能模型通过预测感兴趣的结果来协助决策者时,它们应传达预测的置信度。然而,实证证据表明,决策者往往难以仅根据传达的置信度来判断何时信任预测。在此背景下,近期的理论和实证工作表明,AI辅助决策的效用与AI置信度和决策者自身置信度之间的对齐程度之间存在正相关性。关键的是,这些发现尚未阐明这种对齐程度如何影响通过重复交互学习做出最佳决策的复杂性。在本文中,我们考虑二元预测和二元决策的典型情况,首先证明该问题等价于具有完全反馈的双臂在线上下文学习问题,并建立了任何学习者可以达到的期望遗憾的下界为$\Omega(\sqrt{|H| \cdot |B| \cdot T})$,其中$H$和$B$分别表示人类和AI置信度的集合。然后我们证明,在AI和人类置信度完全对齐的情况下,学习者可以达到期望遗憾为$O(\sqrt{|H| \cdot T\log T})$,当$\sqrt{|H|} = O(\log T)$且$B$是可数的时,Dvoretzky-Kiefer-Wolfowitz不等式的非平凡推广将遗憾界改进到$O(\sqrt{T\log T})$。这些结果表明,对齐可以减少在人工智能协助下学习决策的复杂性。在来自两项不同人类受试者研究(参与者在AI模型协助下解决简单决策任务)的真实数据上进行的实验表明,我们的理论结果在完全对齐被违反时仍然稳健。

英文摘要

It is widely agreed that when AI models assist decision-makers in high-stakes domains by predicting an outcome of interest, they should communicate the confidence of their predictions. However, empirical evidence suggests that decision-makers often struggle to determine when to trust a prediction based solely on this communicated confidence. In this context, recent theoretical and empirical work suggests a positive correlation between the utility of AI-assisted decision-making and the degree of alignment between the AI confidence and the decision-makers' confidence in their own predictions. Crucially, these findings do not yet elucidate the extent to which this alignment influences the complexity of learning to make optimal decisions through repeated interactions. In this paper, we address this question in the canonical case of binary predictions and binary decisions. We first show that this problem is equivalent to a two-armed online contextual learning problem with full feedback, and establish a lower bound of $\Omega(\sqrt{|H| \cdot |B| \cdot T})$ on the expected regret any learner can attain, where $H$ and $B$ denote the sets of human and AI confidence values. We then demonstrate that, under perfect alignment between AI and human confidence, a learner can attain an expected regret of $O(\sqrt{|H| \cdot T\log T})$ and, when $\sqrt{|H|} = O(\log T)$ and $B$ is countable, a non-trivial generalization of the Dvoretzky-Kiefer-Wolfowitz inequality improves the regret bound to $O(\sqrt{T\log T})$. Taken together, these results reveal that alignment can reduce the complexity of learning to make decisions with AI assistance. Experiments on real data from two different human-subject studies where participants solve simple decision-making tasks assisted by AI models show that our theoretical results are robust to violations of perfect alignment.
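For intuition about the two rates, the sketch below evaluates the leading terms of the general lower bound and the aligned upper bound with all hidden constants set to 1. That simplification, the function names, and the example sizes are assumptions made for illustration, so the numbers only indicate how the rates scale.

```python
import math

def general_rate(num_h, num_b, horizon):
    """Leading term sqrt(|H| * |B| * T) of the lower bound without alignment."""
    return math.sqrt(num_h * num_b * horizon)

def aligned_rate(num_h, horizon):
    """Leading term sqrt(|H| * T * log T) of the upper bound under perfect alignment."""
    return math.sqrt(num_h * horizon * math.log(horizon))

# Hypothetical problem sizes: 10 human and 100 AI confidence values, 10,000 rounds.
H, B, T = 10, 100, 10_000
print(general_rate(H, B, T), aligned_rate(H, T))
print(general_rate(H, B, T) / aligned_rate(H, T))   # equals sqrt(|B| / log T)
```

Up to constants, the ratio of the two terms is $\sqrt{|B| / \log T}$, so under these assumptions alignment pays off most when the AI reports many distinct confidence values relative to $\log T$.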

2606.03089 2026-06-17 cs.LG cs.AI 版本更新

Constitutional On-Policy Safe Distillation

宪法性在策略安全蒸馏

Ming Wen, Yuxuan Liu, Kun Yang, Yunhao Feng, Zhuoer Xu, Yuhao Sun, Shiwen Cui, Xiang Zheng, Guoyu Wang, Xingjun Ma, Yu-Gang Jiang

发表机构 * Institute of Trustworthy Embodied AI(可信具身人工智能研究院) Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Ant Group(蚂蚁集团) Zhejiang University(浙江大学) City University of Hong Kong(香港城市大学)

AI总结 针对在策略自蒸馏在安全对齐中因宪法条件导致教师分布收缩、表达能力下降的问题,提出宪法性在策略安全蒸馏(COPSD),通过交叉SFT冷启动校准教师分布,再进行宪法条件在策略蒸馏,在12个基准上实现了更优的安全-有用性权衡并降低安全税。

详情
AI中文摘要

在策略自蒸馏(OPSD)通过使用基于特权信息条件的教师提供密集的令牌级监督,已成为一种高效的后训练范式。先前工作表明,OPSD在可验证推理任务中可能崩溃,但安全对齐不同,它由高层宪法而非显式目标答案指导,因此是重新审视密集蒸馏的自然场景。然而,我们的初步研究表明,安全OPSD仍然遭受严重崩溃:宪法条件将教师分布收缩为短且过于保守的响应,而反向KL进一步将这种收缩放大为表达能力下降。我们将此效应形式化为非正交语义空间中安全边界下的几何泄漏,其中安全压力转移到表达能力维度。基于此分析,我们提出宪法性在策略安全蒸馏(COPSD),首先通过交叉SFT冷启动校准教师,然后执行宪法条件在策略蒸馏。在12个基准上的实验表明,COPSD比基线实现了持续更强的安全-有用性权衡,同时大幅降低了对通用推理能力的安全税。

英文摘要

On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.

2606.04990 2026-06-17 cs.CR cs.AI 版本更新

From Agent Traces to Trust: A Survey of Evidence Tracing and Execution Provenance in LLM Agents

从智能体痕迹到信任:LLM智能体中的证据追踪与执行溯源

Yiqi Wang, Jiaqi Zhang, Taotao Cai, Zirui Liu, Qingqiang Sun, Zequn Sun, Zhangkai Wu, Manqing Dong, Mingkai Zhang, Xuefei Yin, Yanming Zhu

发表机构 * Griffith University(格里菲斯大学) Jiangsu University(江苏大学) University of Southern Queensland(南方昆士兰大学) Peking University(北京大学) Great Bay University(大湾大学) Nanjing University(南京大学) Macquarie University(麦觉瑞大学) Southern University of Science and Technology(南方科学与技术大学)

AI总结 本文系统综述了LLM智能体中的证据追踪与执行溯源方法,通过统一溯源视角连接检索、工具使用、记忆等环节,提出分类体系并讨论开放挑战。

详情
AI中文摘要

基于大语言模型(LLM)的智能体通过与外部工具、检索系统、记忆模块、环境及其他智能体交互,日益解决复杂任务。这些能力增强了智能体的自主性,但也使其行为更难以验证、调试和审计。仅凭最终答案的准确性无法解释输出是如何产生的、每个主张由哪些证据支持、工具调用是否合理、记忆如何影响后续决策或执行失败的根源。证据追踪和执行溯源通过建模检索到的证据、工具输出、记忆项、环境观察、中间主张、动作和最终答案在智能体执行过程中的连接方式,弥补了这一空白。本综述对LLM智能体中的证据追踪和执行溯源进行了系统回顾和概念框架构建。我们围绕统一的溯源视角组织相关工作,该视角连接了检索依据、主张支持、工具使用安全、记忆谱系、可观测性、调试、审计和恢复。我们引入了一个分类体系,涵盖追踪来源、证据和执行单元、溯源关系、追踪粒度和时机、表示形式以及信任功能。我们回顾了关键方法论方向,包括溯源表示、证据归因、工具使用溯源、运行时护栏、携带溯源的记忆、基于痕迹的可观测性和故障诊断。我们还绘制了现有基准、数据集和评估指标与溯源相关能力的映射,并讨论了评估如何从最终答案正确性转向过程级问责。最后,我们概述了开放挑战,包括统一痕迹模式、主张级和语义溯源、溯源感知的安全机制、现实执行痕迹基准、面向恢复的评估以及隐私感知的审计基础设施。

英文摘要

Large language model (LLM)-based agents are evolving from passive text generators into autonomous systems capable of planning, tool use, retrieval, memory access, environmental interaction, and multi-agent collaboration. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where failures originated. This survey examines evidence tracing and execution provenance as foundations for process-level accountability in trustworthy LLM agents. We define execution provenance as the typed graph of an agent execution and evidence tracing as its projection onto evidence-support relations. This perspective connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery within a unified framework. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We then review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, observability, and failure diagnosis. Finally, we discuss benchmarks, datasets, metrics, and open challenges for building provenance-aware, auditable, and recoverable agent systems.

2606.12666 2026-06-17 cs.CR cs.AI 版本更新

CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

CAPED:面向移动GUI代理的上下文感知隐私暴露防御

Siyu Shen, Fenghao Xu, Wenrui Diao, Kehuan Zhang

AI总结 针对移动GUI代理截图上传导致的附带视觉隐私暴露问题,提出上下文感知的预上传暴露控制层CAPED,通过任务需求提取、屏幕上下文隐私先验和UI元素解析,选择性暴露任务所需内容,在保持高任务效用的同时显著降低隐私泄露。

详情
AI中文摘要

基于截图的移动GUI代理能够像人类用户一样通过相同的视觉界面操作普通智能手机应用,但这种能力也将每一次屏幕观察变成了隐私边界。在正常任务执行过程中,截图可能暴露联系人、消息、照片、文件、推荐、健康提示等与用户请求无关的敏感上下文。我们称这个问题为附带视觉隐私暴露。现有防御难以解决:文本匿名化遗漏了许多视觉和推理线索,而通用隐私遮蔽可能移除GUI代理完成任务所需的证据和控制。本文提出CAPED,一种面向移动GUI代理的上下文感知预上传暴露控制层。CAPED被设计为手机端保护层:在截图被释放到远程多模态代理之前,它提取任务需求,利用屏幕上下文作为隐私先验,解析可见UI元素,并仅选择性暴露当前任务所需的内容,同时遮蔽附带隐私内容。我们在AndroidWorld上评估CAPED的广泛任务效用,并使用受控的28任务种子隐私评估作为轨迹级附带泄漏的测量工具。在该种子评估中,完整CAPED将成功条件下的加权种子泄漏从原始截图的0.766降低到0.268,同时保持高任务效用。更广泛的AndroidWorld运行显示了剩余的原型级效用成本,但结果支持核心主张:截图上传应被视为明确的设备-云边界决策,由任务驱动的选择性暴露而非全有或全无的屏幕共享来管理。

英文摘要

Screenshot-based mobile GUI agents can operate ordinary smartphone apps through the same visual interface as a human user, but this capability also turns every screen observation into a privacy boundary. During normal task execution, screenshots may expose contacts, messages, photos, files, recommendations, health cues, and other sensitive context that is unrelated to the user's request. We call this problem incidental visual privacy exposure. It is difficult to address with existing defenses: text anonymization misses many visual and inferential cues, while generic privacy masking can remove the evidence and controls that a GUI agent needs to complete the task. This paper presents CAPED, a context-aware pre-upload exposure control layer for mobile GUI agents. CAPED is designed as a phone-side protection layer: before screenshots are released to a remote multimodal agent, it extracts task requirements, uses screen context as a privacy prior, parses visible UI elements, and selectively exposes only content needed for the current task while masking incidental private content. We evaluate CAPED on AndroidWorld for broad task utility and with a controlled 28-task seeded privacy evaluation used as a measurement instrument for trajectory-level incidental leakage. In this seeded evaluation, Full CAPED reduces success-conditioned weighted seeded leakage from 0.766 under raw screenshots to 0.268 while preserving high task utility. A broader AndroidWorld run shows a remaining prototype-level utility cost, but the results show that task-driven selective exposure can reduce incidental visual leakage before screenshots are released to a remote GUI agent.
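The headline leakage numbers can be turned into a relative reduction with a few lines of arithmetic. The two leakage values come from the abstract; the variable names and the choice to report a relative drop are assumptions for illustration.

```python
raw_leakage = 0.766     # success-conditioned weighted seeded leakage, raw screenshots
caped_leakage = 0.268   # same metric under Full CAPED

absolute_drop = raw_leakage - caped_leakage
relative_drop = absolute_drop / raw_leakage

print(f"absolute drop: {absolute_drop:.3f}")
print(f"relative drop: {relative_drop:.1%}")
```

Under this reading, Full CAPED removes roughly 65% of the measured leakage in the 28-task seeded evaluation, while about 35% remains.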

2606.14517 2026-06-17 cs.CR cs.AI 版本更新

From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails

从盾牌到靶心:针对基于LLM的智能体护栏的拒绝服务攻击

Yuguang Zhou, Xunguang Wang, Pingchuan Ma, Zhantong Xue, Zhaoyu Wang, Shuai Wang

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 本文揭示基于LLM的护栏易受拒绝服务攻击,通过束搜索优化框架和机制感知结构变异生成恶意负载,导致令牌放大13-63倍、延迟放大148倍,威胁系统可用性。

详情
AI中文摘要

基于LLM的护栏已成为自主智能体中防御提示注入和越狱攻击的高效手段。然而,我们发现正是这种实现保护的推理和任务遵循能力引入了一种新的漏洞:攻击者可以注入精心构造的数据,使护栏陷入扩展推理循环,从而实施系统性的拒绝服务(DoS)攻击。为系统性地揭示这一威胁,我们设计了一个束搜索优化框架,利用策略库引导的LLM提议器,生成自然语言负载以最大化护栏推理长度。基于对护栏模式遵循性质的观察,我们还提供了另一种由机制感知结构变异驱动的攻击框架,计算负载更小。攻击效能通过两部分系统评估。首先,在独立评估中,攻击可泛化到多种护栏架构、安全模板和智能体基准。在单个开源替代模型上优化的负载成功迁移到八个领先模型骨干(如Claude、GPT、Gemini、DeepSeek和Qwen),实现13-63倍的令牌放大。其次,在端到端的真实世界智能体部署(网页、桌面、代码和多智能体系统)中,攻击揭示高达148倍的延迟放大。我们表明,单个中毒文档即可饱和共享护栏基础设施,有效饿死同位置智能体并瘫痪整个系统。通过揭示这一可用性缺陷,我们的工作强调了开发成本受限、推理鲁棒的护栏的紧迫性。

英文摘要

LLM-based guardrails have emerged as a highly effective defense against prompt injection and jailbreak attacks in autonomous agents. However, we reveal that the very reasoning and task-following capabilities enabling this protection introduce a novel vulnerability: attackers can inject crafted data to trap the guardrail in extended reasoning loops, resulting in a systematic denial-of-service (DoS) attack. To systematically expose this threat, we design a beam-search optimization framework that crafts natural-language payloads to maximize guardrail reasoning length, utilizing an LLM proposer guided by a strategy bank. Observing that guardrails follow schemas, we also provide a second, computationally lighter attack framework driven by mechanism-aware structural mutations. The attack efficacy is systematically evaluated in two parts. First, in standalone evaluations, the attack generalizes across diverse guardrail architectures, safety templates, and agent benchmarks. Payloads optimized on a single open-source surrogate successfully transfer to eight leading model backbones (e.g., Claude, GPT, Gemini, DeepSeek, and Qwen), achieving a 13-63× token amplification. Second, in end-to-end real-world agent deployments (web, desktop, code, and multi-agent systems), the attack reveals up to a 148× latency amplification. We show that a single poisoned document can saturate shared guardrail infrastructures, effectively starving co-located agents and paralyzing the entire system. By uncovering this availability flaw, our work underscores the urgent need to develop cost-bounded, reasoning-robust guardrails.

9. 评测、基准与数据集 54 篇

2606.17266 2026-06-17 cs.AI cs.SY eess.SY 新提交

SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions

SkillChain-Gym:面向中断下再技能感知的生产-库存控制的基准测试

Carlos Eduardo Sanoja

发表机构 * Quanta Labs, LLC(Quanta Labs有限责任公司) FCEA, Universidad Monteávila(蒙特阿维拉大学经济与行政科学学院)

AI总结 提出SkillChain-Gym基准,用于评估考虑技能动态(如遗忘、再培训)的生产-库存控制策略;实验发现没有任何一类策略在所有情景中占优,需要由预测驱动的控制器决定何时购买技能保险、何时事后反应。

详情
AI中文摘要

生产规划日益需要将劳动力能力视为决策变量:当技能未得到维护时认证会失效,新产品需要当前劳动力不具备的技能,再技能培训与生产争夺相同的工时。现有的运营基准通常将劳动力视为外生变量,而包含技能和学习的劳动力规划模型很少作为可复用的测试平台发布。我们引入了SkillChain-Gym,这是一个针对再技能感知的生产-库存控制的基准规范:一个单站点环境,具有风格化的工人技能状态动态、硬阈值认证、遗忘以及消耗产能的培训动作,这些动作受与生产相同的每个工人时间预算约束。该基准包括种子控制的中断场景、三种可行性模式(带投影诊断)、确定性回放以及涵盖运营、韧性、能力增长和培训访问分布的指标。我们评估了仅生产策略、反应式自适应策略、注水自适应策略和静态保险策略(带预算变体),在60个班次的时间范围内进行配对统计检验。结果是依赖于情景的,而非排序。具备培训能力的策略优于仅生产基线,并且在遗忘存在的情况下,即使没有中断,维护性培训也是必要的。在具备培训能力的策略中,当瓶颈在预测中可见时,自适应培训有帮助,而一个精简的静态交叉培训计划(一个故意有利的比较对象,其结构编码了相关的技能应急情况)在突发冲击和缺勤下充当了强有力的保险。产能松弛和遗忘率决定了这些情景之间的边界。没有策略类在所有情景中占优,这促使了能够决定何时购买技能保险和何时反应的预测驱动型控制器。

英文摘要

Production planning increasingly has to treat workforce capability as a decision variable: certifications lapse when skills are not maintained, new products require skills the current workforce does not hold, and reskilling competes for the same worker hours needed for production. Existing operations benchmarks usually treat labor as exogenous, while workforce-planning models with skills and learning are rarely released as reusable testbeds. We introduce SkillChain-Gym, a benchmark specification for reskilling-aware production-inventory control: a single-site environment with stylized worker skill-state dynamics, hard threshold certification, forgetting, and capacity-consuming training actions constrained by the same per-worker time budget as production. The benchmark includes seed-controlled disruption scenarios, three feasibility modes with projection diagnostics, deterministic replay, and metrics covering operations, resilience, capability growth, and training-access distribution. We evaluate production-only, reactive adaptive, water-filling adaptive, and static-insurance policies with budget variants over 60-shift horizons with paired statistical tests. The results are regime-dependent rather than a ranking. Training-capable policies dominate the production-only baseline, and maintenance training is necessary under forgetting even without disruptions. Among training-capable classes, adaptive training helps when bottlenecks are visible in the forecast, while a lean static cross-training plan, a deliberately favorable comparator whose structure encodes relevant skill contingencies, acts as strong insurance under surprise shocks and absenteeism. Capacity slack and the forgetting rate govern the boundary between these regimes. No policy class dominates across regimes, motivating forecast-driven controllers that decide when to buy skill insurance and when to react.
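The interaction between forgetting, hard-threshold certification, and training that consumes production time can be sketched with a one-worker toy model. This is not the SkillChain-Gym environment or its API: the `Worker` class, the `run_shift` function, and every numeric parameter below are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    skill: float  # stylized skill level in [0, 1]

def run_shift(worker, train_hours, shift_hours=8.0,
              gain_per_hour=0.05, forget_per_shift=0.02, threshold=0.6):
    """Advance one shift and return (production_hours, certified)."""
    if not 0.0 <= train_hours <= shift_hours:
        raise ValueError("training must fit inside the per-worker time budget")
    # Skill decays every shift and grows with training time, clipped to [0, 1].
    updated = worker.skill - forget_per_shift + gain_per_hour * train_hours
    worker.skill = min(1.0, max(0.0, updated))
    # Hard-threshold certification: below the threshold the worker cannot produce.
    certified = worker.skill >= threshold
    production_hours = shift_hours - train_hours if certified else 0.0
    return production_hours, certified

idle = Worker(skill=0.65)
idle_log = [run_shift(idle, train_hours=0.0) for _ in range(3)]

maintained = Worker(skill=0.65)
maintained_log = [run_shift(maintained, train_hours=1.0) for _ in range(3)]

print(idle_log)        # certification lapses on the third shift
print(maintained_log)  # one training hour per shift keeps the worker certified
```

In this toy setting the production-only worker loses certification after three shifts, while one hour of maintenance training per shift costs an hour of output but keeps the certification, mirroring the abstract's point that maintenance training matters under forgetting even without disruptions.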

2606.17328 2026-06-17 cs.AI 新提交

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

MemTrace: 探知长期记忆中最终准确率所遗漏的信息

Xianxuan Long, Zhikai Chen, Shenglai Zeng, Shouren Wang, Kai Guo, Jiliang Tang

发表机构 * Michigan State University(密歇根州立大学) Case Western Reserve University(凯斯西储大学)

AI总结 提出MemTrace基准,以知识点为单位,沿记忆年龄、问题类型和证据条件三个维度评估LLM代理的长期记忆,发现证据使用是主要瓶颈。

详情
AI中文摘要

LLM代理越来越多地在会话之间维护用户事实的长期记忆。然而,这种记忆通常通过聚合问题行或情节的准确率来评估。由于这种方法独立评分问题行,即使多个问题探查同一事实,也无法显示该事实在条件变化时的行为。我们引入MemTrace,一个以知识点为测量单位的基准:知识点是关于用户的单个类型化事实,而非单个问题。MemTrace沿三个受控维度探查每个事实:记忆年龄,由事实出现在历史中的会话次数定义;问题类型,涵盖当前状态、先前状态和变化轨迹;以及证据条件,涵盖存在、缺失和被错误前提反驳的设置。评估跨四个范式的13种记忆系统配置,我们发现相似的汇总准确率隐藏了不同的失败:恢复事实的当前和先前状态并不意味着跟踪其变化,安全弃权并不意味着纠正错误前提。主要瓶颈是证据使用,而非检索:当系统失败时,证据可检索的次数比缺失的次数多10倍。这些结果表明,改进长期记忆需要更好地使用可获取的证据,而不仅仅是增加存储或检索。

英文摘要

LLM agents increasingly maintain long-term memory of user facts across sessions. Yet such memory is usually evaluated by aggregating accuracy over question rows or episodes. Because this approach scores question rows independently, even when several questions probe the same fact, it cannot show how that fact behaves as conditions change. We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, rather than an individual question. MemTrace probes each fact along three controlled dimensions: memory age, defined by how many sessions ago the fact appeared in the history; question type, covering current state, earlier state, and trajectory of change; and evidence condition, covering present, missing, and contradicted-by-false-premise settings. Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long-term memory requires better use of reachable evidence, not simply more storage or retrieval.
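The difference between pooling accuracy over question rows and measuring per knowledge point can be shown on a toy result table. The rows, the field layout, and the "fully tracked" rate below are hypothetical and are not MemTrace's official data format or metric.

```python
from collections import defaultdict

# Hypothetical results: (knowledge_point_id, question_type, answered_correctly)
rows = [
    ("kp1", "current", True), ("kp1", "earlier", True), ("kp1", "trajectory", False),
    ("kp2", "current", True), ("kp2", "earlier", True), ("kp2", "trajectory", True),
]

def pooled_accuracy(rows):
    """Accuracy over question rows, scoring each row independently."""
    return sum(correct for _, _, correct in rows) / len(rows)

def fully_tracked_rate(rows):
    """Share of knowledge points answered correctly under every question type."""
    by_kp = defaultdict(list)
    for kp, _, correct in rows:
        by_kp[kp].append(correct)
    return sum(all(results) for results in by_kp.values()) / len(by_kp)

print(pooled_accuracy(rows))      # 5 of 6 rows correct
print(fully_tracked_rate(rows))   # 1 of 2 knowledge points fully tracked
```

Here the pooled score looks high, yet half of the facts fail the change-trajectory question, which is the kind of failure a row-level average hides.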

2606.17339 2026-06-17 cs.AI cs.CL cs.SD 新提交

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

SpeechDx: 面向临床语音AI的多任务基准

Sejal Bhalla, Larry Kieu, Aina Merchant, Eyal de Lara, Alex Mariakakis

发表机构 * University of Toronto(多伦多大学)

AI总结 提出SpeechDx基准,涵盖12个数据集和27个任务,通过语音产生阶段(概念化、公式化、发音)组织任务,评估12种音频编码器,发现大规模语音模型表现最佳,但尚无表示能可靠泛化。

详情
AI中文摘要

语音通过同时涉及神经、运动、呼吸和发声系统,为健康提供了一个独特的窗口。当前的临床语音AI方法主要通过孤立的特定疾病研究取得进展,导致结果难以比较,泛化能力难以评估。我们引入了SpeechDx,这是一个大规模的临床语音AI基准,涵盖12个数据集和27个任务,涉及多种健康状况。为了能够基于共享的临床机制进行评估,SpeechDx根据任务所破坏的语音产生阶段(概念化、公式化和发音)来组织任务。该基准通过包含有限标注数据的任务以及跨多个数据集评估同一健康状况来测试泛化能力,从而区分有临床意义的模式与数据集伪影。我们系统评估了12个最先进的音频编码器在所有任务以及零样本跨条件迁移下的表现。结果表明,大规模语音模型代表了最强的整体基线,领域特定模型仅在紧密匹配的任务上提升性能,而当前没有任何表示能在临床语音领域可靠泛化。SpeechDx建立了一个共享评估框架,用于追踪通用临床语音表示的进展。

英文摘要

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

2606.17459 2026-06-17 cs.AI 新提交

Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation

LLM 能当 CEO 吗?基于多角色智能体模拟的战略资源重新配置基准测试

Yuyang Dai, Xueqing Peng, Lingfei Qian, Zhuohan Xie

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学) Yale University(耶鲁大学)

AI总结 提出 CEO-Bench,一个多智能体基准,评估 LLM 在约束丰富的组织环境中进行多轮战略资源重新配置的能力,发现模型在结构有效性上表现良好,但在战略校准上存在系统性失败模式。

Comments 13 pages

详情
AI中文摘要

评估大型语言模型(LLM)的决策能力是一个日益重要的研究重点,然而现有基准侧重于孤立的认知任务,如推理、知识检索以及在风格化环境中的经济理性。这些评估忽略了真实高管决策的核心挑战:在信息不对称、组织约束和时间依赖下整合来自专业利益相关者的冲突建议。我们引入了 CEO-Bench,一个多智能体基准,评估 LLM 在 CEO 级别的战略资源重新配置能力,即在多轮、约束丰富的组织环境中跨业务部门重新分配资本的过程。在 CEO-Bench 中,LLM 智能体接收来自四个角色化的 C 级顾问(CFO、CTO、COO、CMO)的冲突建议,每个顾问拥有私有信号和不同优先级,智能体必须将这些建议综合成一个具体的分配计划,并沿四个维度进行评估:角色整合、条件大胆性、历史敏感性判断和计划有效性。在 13 个场景中对五个前沿模型的实验表明,所有模型都实现了高结构有效性,但在战略校准(最难的能力层)上表现差异显著。我们识别出系统性失败模式,包括单一顾问捕获、模糊下的保守默认和历史遗忘,并发现结构整合-大胆性权衡:更深入参与冲突观点的模型往往产生较不果断的行动。这些发现勾勒了 LLM 作为组织决策者的当前能力边界,并为未来 AI 辅助高管系统的设计提供信息。

英文摘要

Evaluating the decision-making capabilities of large language models (LLMs) is a growing research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings. These evaluations overlook the defining challenge of real executive decision-making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies. We introduce CEO-Bench, a multi-agent benchmark that evaluates LLMs on CEO-level strategic resource reallocation, the process of redirecting capital across business units in a multi-round, constraint-rich organizational environment. In CEO-Bench, LLM agents receive conflicting advice from four role-conditioned C-suite advisors (CFO, CTO, COO, CMO), each with private signals and distinct priorities, and must synthesize these into a concrete allocation plan evaluated along four dimensions: role integration, conditional boldness, history-sensitive judgment, and plan validity. Experiments across five frontier models on 13 scenarios reveal that all models achieve high structural validity but diverge sharply on strategic calibration, the hardest capability layer. We identify systematic failure modes including single-advisor capture, conservative default under ambiguity, and historical amnesia, and uncover a structural integration-boldness tradeoff: models that engage more deeply with conflicting perspectives tend to produce less decisive action. These findings delineate the current capability boundary of LLMs as organizational decision-makers and inform the design of future AI-assisted executive systems.

2606.17546 2026-06-17 cs.AI 新提交

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

SEAGym: 自我进化LLM智能体的评估环境

Congjie Zheng, Chuanyi Xue, Bin Liang, Jun Yang, Changshui Zhang

发表机构 * Department of Automation, Tsinghua University(清华大学自动化系) Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University(北京信息科学与技术国家研究中心(BNRist),清华大学)

AI总结 提出SEAGym评估环境,通过训练、验证、测试、重放和成本记录多维度衡量智能体框架更新,揭示更新是否带来可复用改进、过拟合、成本增加或旧行为退化。

详情
AI中文摘要

基于LLM的自我进化智能体主要通过改变其智能体框架(agent harness)来改进:即围绕基础模型的结构化执行层,包括提示、记忆、工具、中间件、运行时状态以及模型-工具交互循环。现有评估通常将此过程简化为孤立的任务分数或单一的顺序曲线,掩盖了更新是否产生可复用的改进、过拟合近期任务、增加成本或损害旧行为。我们引入了SEAGym,一个用于跨训练、验证、测试、重放和成本记录衡量智能体框架更新的评估环境。SEAGym将Harbor兼容的基准测试转化为动态的自我进化任务源,包含训练批次、冻结更新验证、留出ID和OOD迁移视图、重放诊断以及保存的快照和指标记录。在Terminal-Bench 2.0和HLE上实例化SEAGym,我们在共享的epoch/batch协议下比较了ACE、TF-GRPO和AHE。结果表明,这些评估视图提供了关于进化过程的互补信号:频繁更新可能无法改善留出性能,有用的中间快照可能随后崩溃,源多样性和模型后端可能影响框架可靠性。

英文摘要

Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.

2606.17574 2026-06-17 cs.AI 新提交

DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

DeepInsight:跨物理AI栈的统一评估基础设施

Siyi Li, Chunyu Sun, Jiahao Zhang, Yuchen Kang, Wuliang Wang, Yu Qiu, Rui Jiang, Haitao Cui, Jie Chen

发表机构 * Xiaopeng(小鹏汽车)

AI总结 提出DeepInsight,一个在单一运行时上支持物理AI栈全谱系评估的基础设施,通过三个抽象(任务、资源、结果)保持异构性,实现跨层回归诊断。

详情
AI中文摘要

评估物理AI栈所涉及的算子在规模上相差三个数量级以上(从单个基础模型解码步骤到全身控制的数千个物理仿真步),并在模态、奖励语义和资源概况上正交变化。现有框架无法覆盖这一范围,因此当前栈的评估是通过拼接独立的测试工具完成的,这些工具既不共享运行时也不共享评分,保留了每个片段的局部有效性,但失去了诊断跨层回归所需的共享身份。我们提出DeepInsight,一个在单一运行时上服务于这一完整谱系的评估基础设施。它不将各体制同质化,而是通过任务、资源和结果这三个狭窄的抽象保持其异构性,每个抽象都由每个子系统共享的一个不变量实现:一个情节驱动器、一个由每个昂贵后端(LLM推理和沙盒运行时)实现的资源句柄协议,以及一个所有事件都据以写入的跟踪身份方案。该系统已在具身人形机器人栈的全部三层投入生产部署,凭借这一组不变量,新的基准测试主要通过配置即可接入。在存在成熟的同类编排器的地方(即基础模型端),它在已发布参考值和同类框架读数自身的波动范围内复现这些结果,在单个节点上更快地运行相同的套件,并跨节点近线性扩展。其独特的回报在于诊断:由于每一层都写入一个共享的跟踪,从一层开始并在另一层显现的回归在该跟踪上仍然可定位,这是任何按片段划分的测试工具联合体都无法复现的跨层收益。

英文摘要

Evaluating a Physical AI stack spans operators that differ by more than three orders of magnitude -- from a single foundation-model decoding step to thousands of physics ticks of whole-body control -- varying orthogonally in modality, reward semantics, and resource profile. No existing framework spans this range, so the stack is evaluated today by stitching together separate harnesses that share neither runtime nor scoring, preserving each segment's local validity but losing the shared identity needed to diagnose cross-layer regressions. We present DeepInsight, an evaluation infrastructure that serves this full spectrum on a single runtime. Rather than homogenize the regimes, it preserves their heterogeneity behind three narrow abstractions -- task, resource, and result -- each realized as one invariant shared by every subsystem: one episode driver, one resource-handle protocol implemented by every expensive backend (LLM inference and sandboxed runtimes alike), and one trace identity scheme under which every event is written. Deployed in production across all three layers of an embodied humanoid stack, this single set of invariants onboards new benchmarks largely by configuration. Where mature peer orchestrators exist -- at the foundation-model end -- it reproduces published references and peer-framework readings within their own spread, runs the same suites faster on a single node, and scales near-linearly across nodes. Its distinctive return is diagnostic: because every layer writes into one shared trace, a regression that begins in one layer and surfaces in another stays localizable on that trace -- a cross-layer payoff no federation of per-segment harnesses can reproduce.

2606.17696 2026-06-17 cs.AI cs.GR 新提交

FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories

FllumaOne:一个代码原生多模态CAD数据集,包含可执行程序与内核验证的特征历史

Jizong Zhan

发表机构 * Qt/C++ OpenCASCADE-based CAD system(基于Qt/C++ OpenCASCADE的CAD系统)

AI总结 提出FllumaOne数据集,通过可执行Python程序生成CAD模型,对齐程序、特征树、几何等模态,支持可编辑逆向工程等任务。

Comments 24 pages, 4 figures

详情
AI中文摘要

参数化计算机辅助设计记录最终几何形状以及决定零件如何编辑的有序构建历史。因此,可编辑CAD研究的数据集应同时暴露建模操作、参数和特征依赖关系以及验证后的几何形状。我们介绍FllumaOne,一个代码原生多模态CAD数据集,其模型由基于Qt/C++和OpenCASCADE的CAD系统Flluma中的可执行Python程序生成。每个样本将其程序与结构化特征树、面向训练的中间表示、STEP几何、表面点云、自然语言描述、元数据和八个规范可见边渲染对齐。主要发布版本FllumaOne-100K包含100,000个接受样本,涵盖四个模板级复杂度范围。程序仅在通过内核几何、实体有效性和导出检查后执行并保留;发布报告还记录了模态完整性和分割级重复测试。在80,000个样本上训练的Qwen2.5-Coder-1.5B LoRA基线在保留的10,000样本测试集上实现了99.98%的Python语法有效性、99.97%的Flluma构建成功率和99.14%的STEP导出有效性。对于转换为表面点云的9,909个预测,平均归一化倒角距离为0.002124。该数据集支持条件化CAD重建、可执行程序合成、特征树预测、B-Rep分析、检索、设计完成和可编辑逆向工程。

英文摘要

Parametric computer-aided design records both final geometry and the ordered construction history that determines how a part can be edited. Datasets for editable CAD research should therefore expose modeling operations, parameters, and feature dependencies together with validated geometry. We introduce FllumaOne, a code-native multimodal CAD dataset whose models are generated by executable Python programs in Flluma, a Qt/C++ OpenCASCADE-based CAD system. Each sample aligns its program with a structured feature tree, a training-oriented intermediate representation, STEP geometry, a surface point cloud, natural-language descriptions, metadata, and eight canonical visible-edge renderings. The primary release, FllumaOne-100K, contains 100,000 accepted samples across four template-level complexity regimes. Programs are executed and retained only after kernel geometry, solid validity, and export checks; release reports also record modality completeness and split-level duplicate tests. A Qwen2.5-Coder-1.5B LoRA baseline trained on 80,000 samples achieves 99.98% Python syntax validity, 99.97% Flluma build success, and 99.14% STEP-export validity on the held-out 10,000-sample test split. For the 9,909 predictions converted to surface point clouds, the mean normalized Chamfer Distance is 0.002124. The dataset supports conditioned CAD reconstruction, executable program synthesis, feature-tree prediction, B-Rep analysis, retrieval, design completion, and editable reverse engineering.
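The Chamfer Distance reported above compares a predicted surface point cloud with a reference one. The abstract does not spell out the exact formula or normalization, so the following is only a minimal sketch under stated assumptions: a symmetric, squared-distance Chamfer variant, and normalization by centering each cloud and scaling by its bounding-box diagonal. The function names `chamfer_distance` and `normalize` are hypothetical and are not taken from the FllumaOne release.

```python
import math

def normalize(points):
    """Center a point cloud and scale it by its bounding-box diagonal.

    Assumption: this is one plausible normalization, not necessarily the
    one used in the paper.
    """
    dims = list(zip(*points))
    lo = [min(d) for d in dims]
    hi = [max(d) for d in dims]
    center = [(l + h) / 2 for l, h in zip(lo, hi)]
    # Fall back to 1.0 for a degenerate cloud whose points all coincide.
    diag = math.sqrt(sum((h - l) ** 2 for l, h in zip(lo, hi))) or 1.0
    return [tuple((x - c) / diag for x, c in zip(p, center)) for p in points]

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point lists."""
    def one_way(src, dst):
        # Mean squared distance from each src point to its nearest dst point.
        total = 0.0
        for p in src:
            total += min(sum((s - d) ** 2 for s, d in zip(p, q)) for q in dst)
        return total / len(src)
    return one_way(a, b) + one_way(b, a)

if __name__ == "__main__":
    pred = normalize([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
    ref = normalize([(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)])
    print(chamfer_distance(pred, ref))
```

The brute-force nearest-neighbor search is quadratic in the number of points; for clouds of realistic size a spatial index would normally replace the inner `min`.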

2606.17698 2026-06-17 cs.AI cs.CL 新提交

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

EComAgentBench:在分布式隐藏意图的长时任务上基准测试购物代理

Zeyao Du, Tong Li, Haibo Zhang

发表机构 * Shopee

AI总结 提出EComAgentBench基准,包含662个基于真实亚马逊产品的任务,要求代理在100次工具调用内从可见查询、工具门控配置文件和脚本化澄清中挖掘隐藏意图,验证候选产品并提交最终选择,通过类型化源标签评分归因失败。

详情
AI中文摘要

随着基于LLM的购物代理进入生产环境,现有基准未能捕捉购物者需求的出现方式:隐含在查询中、记录在配置文件中,或仅在提出正确问题时才揭示。提前暴露全部意图并仅对最终选择评分的基准既无法提出这种长时挑战,也无法解释代理遗漏了哪个需求。为填补这一空白,我们引入了EComAgentBench,一个基于真实亚马逊产品和评论的662个任务的基准。每个任务将这些需求分散在可见查询、工具门控配置文件和脚本化澄清中;代理必须揭示隐藏意图,根据属性和评论证据验证候选产品,并在100次工具调用内提交单个产品。此外,类型化、源标记的评分规则对每个任务进行评分,将每个失败归因于一个需求及其来源。构建过程自动化且可靠,每个答案在生成任何文本之前已在代码中固定,每个样本都经过验证。我们对七个模型的评估显示,即使最强的模型也仅达到57.1%的整体准确率,并且评分规则的满足度从可见源到隐藏源逐渐下降。总体而言,我们相信EComAgentBench将作为一个可复现的基础,推动购物代理从单查询搜索向长时可靠辅助发展。

英文摘要

As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable, with every answer fixed in code before any text is generated and every sample validated. Our evaluation of seven models reveals that even the strongest attains only 57.1% overall accuracy, and rubric satisfaction degrades from visible to hidden sources. Overall, we believe EComAgentBench will serve as a reproducible foundation for moving shopping agents from single-query search toward dependable assistance over long horizons.

2606.17727 2026-06-17 cs.AI 新提交

LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings

LongWebBench: 评估长程设置下的结构和功能性网页生成

Yi Zhao, Zhen Yang, Mengpan Chen, Mingde Xu, Shanghui Gong, Xijun Liu, Jibing Gong, Jie Tang

发表机构 * Tsinghua University(清华大学) Yanshan University(燕山大学) University of Waterloo(滑铁卢大学) Beihang University(北京航空航天大学)

AI总结 提出LongWebBench基准,通过结构保真度和功能可执行性评估长网页生成,发现视觉相似性高但多步交互失败。

Comments 49 pages, 38 figures

详情
AI中文摘要

最近的视觉语言模型(VLM)在从视觉输入生成网页方面显示出有希望的进展,但现有评估主要关注短、单屏且基本静态的网页。我们引入了LongWebBench,这是一个从结构和功能角度评估长程网页生成的基准。LongWebBench包含490个真实长网页用于结构保真度评估,以及129个网页上的507个目标导向交互任务用于功能评估。它采用两种互补协议:基于多维VLM的指标用于评估长程结构连贯性,以及基于DOM增强的智能体流水线用于端到端功能验证。我们进一步通过人类一致性分析检查自动评估协议。在单图像和多图像设置下,使用最先进的开源和专有VLM进行的实验表明,结构保真度随着网页长度的增加而下降,而视觉上合理的生成往往无法支持可执行的多步交互。这些结果强调了在视觉相似性之外评估长网页生成的必要性,并将可执行交互作为核心标准。我们的代码和数据可在 https://github.com/zheny2751-dotcom/LongWebBench 获取。

英文摘要

Recent vision-language models (VLMs) have shown promising progress in generating webpages from visual inputs, yet existing evaluations mainly focus on short, single-screen, and largely static webpages. We introduce LongWebBench, a benchmark for evaluating long-horizon webpage generation from both structural and functional perspectives. LongWebBench contains 490 real-world long webpages for structural fidelity evaluation and 507 goal-oriented interaction tasks over 129 webpages for functional evaluation. It employs two complementary protocols: a multi-dimensional VLM-based metric for assessing long-range structural coherence, and a DOM-augmented agent-based pipeline for end-to-end functional verification. We further examine the automatic evaluation protocols through human agreement analysis. Experiments with state-of-the-art open-source and proprietary VLMs under single-image and multi-image settings reveal that structural fidelity degrades as webpage length increases, while visually plausible generations often fail to support executable multi-step interactions. These results highlight the need to evaluate long webpage generation beyond visual similarity, with executable interaction as a core criterion. Our code and data are available at https://github.com/zheny2751-dotcom/LongWebBench.

2606.17904 2026-06-17 cs.AI 新提交

DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue

DiagFlowBench:评估语言模型在基于规程的诊断对话中如何处理偏离规程输入

Guillermo Gil de Avalle, Laura Maruster, Shaina Raza, Christos Emmanouilidis

发表机构 * University of Groningen(格罗宁根大学) Vector Institute for Artificial Intelligence(向量人工智能研究所)

AI总结 提出DiagFlowBench基准,包含50个工业诊断流程图转化的1676轮对话,评估10个模型在识别偏离规程输入时的表现,发现模型常选择真实但不恰当的步骤而非捏造事实。

详情
AI中文摘要

语言模型越来越多地作为维护操作中的咨询系统。为了防止幻觉,最近的系统将这些模型基于规程文档,以约束它们执行批准的步骤。然而,在实践中,操作员的查询经常偏离这一路径,要求模型在对话中途识别超出范围的输入,这是当前基准很少优先考虑的动态。我们引入了DiagFlowBench,这是一个数据集,包含来自一家消费制造商的50个工业诊断流程图,转化为1676轮多轮对话,对比合规与超出范围的语句。评估十个商业和开源模型显示,在弃权率上存在高度变异性,模型通常选择一个真实但上下文不恰当的步骤,而不是捏造事实。这种映射但错误建议的内在合理性和权威性暴露了基于规程系统的一个具有挑战性的脆弱性。

英文摘要

Language models increasingly serve as advisory systems in maintenance operations. To prevent hallucination, recent systems ground these models in procedural documentation to constrain them to approved steps. In practice, however, operator queries frequently stray from this path, requiring models to recognise out-of-scope inputs mid-conversation, a dynamic that current benchmarks rarely prioritise. We introduce DiagFlowBench, a dataset of 50 industrial diagnostic flowcharts from a consumer manufacturer converted into 1,676 multi-turn conversations that contrast compliant with out-of-scope utterances. Evaluating a panel of ten commercial and open-weight models reveals high variability in abstention rates, with models commonly selecting a real but contextually inadequate step rather than fabricating facts. The inherent plausibility and authority of this mapped but wrong advice exposes a challenging vulnerability for grounding systems.

2606.17930 2026-06-17 cs.AI 新提交

How Inference Compute Shapes Frontier LLM Evaluation

推理计算如何塑造前沿LLM评估

Jessica McFadyen, Ole Jorgensen, Harry Coppock, Kevin Wei, Cozmin Ududec

发表机构 * UK AI Security Institute(英国人工智能安全研究所)

AI总结 通过控制推理计算量(如token预算、上下文压缩和重复提交)评估12个前沿语言模型,发现更大计算量显著提升性能,固定预算评估低估模型能力,且不同基准对推理扩展方法敏感。

Comments 34 pages, 4 figures

详情
AI中文摘要

AI评估正转向更困难的任务,这些任务受益于涉及工具使用和迭代问题解决的更长轨迹。因此,性能对测试时可用的计算量(“推理计算”)及其分配越来越敏感。然而,许多评估仍然在单一限制性预算下报告性能,这意味着低分可能反映评估设置而非模型的潜在能力。为了验证这一点,我们在涵盖软件工程、数学、医学和网络安全的七个具有挑战性的基准上评估了多达12个前沿语言模型。我们使用结合三种简单推理扩展干预的受控设置:更大的token预算、上下文压缩和重复提交尝试,由模型本身或最小正确性反馈引导。我们发现了三个主要结果。首先,更大的token预算在多个领域的基准上显著提升性能,包括网络安全、FrontierMath、Humanity's Last Exam和TerminalBench。其次,随着模型进步,固定预算评估可能越来越低估前沿能力。较新的模型在大型预算下达到更高性能,解锁更困难的任务并更可靠地解决它们。第三,不同基准在哪种推理扩展方法最有效方面存在差异:重复提交广泛提升性能,但更大token预算、外部反馈和并行尝试的价值因基准而异。总体而言,我们的结果表明基准分数是协议依赖的。因此,我们主张评估应将能力报告为推理时间计算的函数,明确指定协议选择,并在匹配预算的大共享计算范围内比较模型代际,特别是在安全或政策相关设置中。

英文摘要

AI evaluations are shifting toward harder tasks that benefit from longer trajectories involving tool use and iterative problem solving. As a result, performance is increasingly sensitive to the amount and allocation of compute available at test time ("inference compute"). Yet many evaluations still report performance at a single restrictive budget, meaning that low scores may reflect the evaluation setup rather than the model's underlying capability. To test this, we evaluate up to 12 frontier language models on seven challenging benchmarks spanning software engineering, mathematics, medicine, and cybersecurity. We use a controlled setup combining three simple inference-scaling interventions: larger token budgets, context compaction, and repeated submission attempts, guided either by the model itself or by minimal correctness feedback. We find three main results. First, larger token budgets substantially improve performance on benchmarks across multiple domains, including cybersecurity, FrontierMath, Humanity's Last Exam, and TerminalBench. Second, fixed-budget evaluations can increasingly understate frontier capability as models advance. Newer models reach higher performance at large budgets, where they unlock harder tasks and solve them more reliably. Third, benchmarks differ in which inference-scaling methods help most: repeated submission broadly improves performance, but the value of larger token budgets, external feedback, and parallel attempts varies by benchmark. Overall, our results show that benchmark scores are protocol-dependent. We therefore argue that evaluations should report capability as a function of inference-time compute, specify protocol choices explicitly, and compare model generations over a large shared compute range at matched budgets, especially in safety- or policy-relevant settings.

2606.17978 2026-06-17 cs.AI 新提交

MoCo-AIS: A Contrastive Learning Framework for Similarity Computation of Vessel Trajectories

MoCo-AIS: 一种用于船舶轨迹相似度计算的对比学习框架

Ruixin Song, Md Mahbub Alam, Zahra Sadeghi, Amilcar Soares, José F. Rodrigues-Jr, Gabriel Spadon

发表机构 * Dalhousie University(达尔豪斯大学) Linnaeus University(林奈大学) University of Sao Paulo(圣保罗大学)

AI总结 提出基于动量对比(MoCo)的统一对比学习框架MoCo-AIS,通过正负轨迹对学习嵌入,在真实AIS数据集上评估多种深度学习模型,显著提升轨迹相似度学习性能。

Comments Under review at SIGSPATIAL'26

详情
AI中文摘要

轨迹相似度是分析移动模式的基本任务,对于路线模式提取、移动预测和异常检测等应用至关重要。传统的基于距离的相似度计算方法计算成本高,促使人们采用轻量级基于学习的方法。监督方法依赖于从传统距离度量中衍生的大量标签,并且通常复现这些度量,这限制了泛化能力。虽然自监督学习通过对比学习解决了这个问题,但它缺乏统一的框架,使得难以比较深度学习(DL)模型以获得一致的轨迹表示。因此,本文提出了MoCo-AIS,一个基于动量对比(MoCo)范式的统一框架,用于学习船舶轨迹嵌入,该框架通过正负轨迹对来制定相似度学习。在此框架内,我们在大规模真实世界船舶跟踪AIS数据集上评估了多种领先的深度学习模型,这些数据集捕获了不同的航行行为和操作条件。结果表明,我们的框架显著改进了现有基线的相似度学习,同时为评估轨迹表示模型提供了一个基准平台。

英文摘要

Trajectory similarity is a fundamental task in analyzing mobility patterns, essential for applications such as route pattern extraction, mobility prediction, and anomaly detection. Traditional distance-based measures for computing similarity incur high computational cost, driving the adoption of lightweight learning-based approaches. Supervised methods rely on extensive labels derived from traditional distance measures and often reproduce these metrics, which limits generalization. While self-supervised learning addresses this issue through contrastive learning, it lacks a unified framework, making it difficult to compare deep learning (DL) models for consistent trajectory representation. Accordingly, this paper presents MoCo-AIS, a unified framework for learning vessel trajectory embeddings based on the Momentum Contrast (MoCo) paradigm, which formulates similarity learning through positive and negative trajectory pairs. Within this framework, we evaluate a diverse set of leading DL models on large-scale, real-world vessel-tracking AIS datasets that capture diverse navigation behaviors and operating conditions. Results demonstrate that our framework significantly improves similarity learning over existing baselines, while providing a benchmarking platform for evaluating trajectory representation models.

2606.18060 2026-06-17 cs.AI cs.CL 新提交

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

PseudoBench: 衡量自主研究如何助长伪科学

Xinyang Liao, Lingyu Li, Huacan Liu, Tianle Gu, Yang Yao, Tong Zhu, Yan Teng, Yingchun Wang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Xi’an Jiao Tong University(西安交通大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出PseudoBench基准,通过200个伪科学声明-证据对评估AI代理识别和抵制伪科学的能力,发现当前系统极易生成有说服力的伪科学报告,拒绝率接近零。

Comments 26 pages, 21 figures

详情
AI中文摘要

随着基于大型语言模型的代理进入自主科学研究,它们抵制伪科学的能力变得越来越重要。否则,此类系统可能迅速生成看似合理但具有误导性的研究,污染学术文献并侵蚀对科学的信任。我们提出了PseudoBench,一个对抗性基准,用于评估自主研究系统能否识别和抵制伪科学叙述。PseudoBench包含五个领域的200个精心策划的伪科学声明-证据对,并通过从实验到写作的端到端研究流程评估代理。测试了七个最先进的代理,我们发现当前系统很容易生成与伪科学前提一致的有说服力的报告,拒绝率接近零,最高抵制率仅为27.4%。更强的代理有可能用更复杂的科学语言包装伪科学,增加其表面可信度。这些发现揭示了助长伪科学的惊人能力,呼吁在广泛部署之前进行科学对齐。

英文摘要

As Large Language Model based agents enter autonomous scientific research, their ability to resist pseudoscience becomes increasingly important. Otherwise, such systems may rapidly generate plausible yet misleading studies that contaminate academic literature and erode trust in science. We present PseudoBench, an adversarial benchmark for evaluating whether agentic auto-research systems can identify and resist pseudoscientific narratives. PseudoBench contains 200 curated pseudoscientific claim-evidence pairs across five domains and evaluates agents through an end-to-end research pipeline from experiments to writing. Testing seven state-of-the-art agents, we find that current systems readily produce persuasive reports that align with pseudoscientific premises with near-zero refusal rates and the highest resistance of only 27.4%. Stronger agents risk packaging pseudoscience in more sophisticated scientific language, increasing its apparent credibility. These findings reveal an alarming capacity to fuel pseudoscience, calling for scientific alignment before widespread deployment.

2606.18119 2026-06-17 cs.AI 新提交

First Proof Second Batch

First Proof 第二批

Mohammed Abouzaid, Nikhil Srivastava, Rachel Ward, Lauren Williams

发表机构 * Stanford University(斯坦福大学) University of California, Berkeley(加州大学伯克利分校) University of Texas at Austin(德克萨斯大学奥斯汀分校) Harvard University(哈佛大学) Polish Academy of Sciences(波兰科学院) UC Berkeley(加州大学伯克利分校) Brown University(布朗大学) ETH Zürich(苏黎世联邦理工学院) MIT(麻省理工学院) Weierstrass Institute(魏尔斯特拉斯研究所) Duke University(杜克大学) Sorbonne Université(索邦大学) Boston College(波士顿学院) Université du Québec à Montréal(魁北克大学蒙特利尔分校) UCLA(加州大学洛杉矶分校) University of Michigan(密歇根大学) University of Maryland(马里兰大学)

AI总结 测试多个AI系统在十个数学研究问题上的解题能力,评估当前AI解决研究级数学问题的水平。

详情
AI中文摘要

为了评估当前AI系统正确解决研究级数学问题的能力,我们在十个涵盖广泛数学领域的问题上测试了多个AI系统;这些问题自然产生于贡献者的研究过程中。本文档包括问题、我们的方法论以及测试结果。我们提供了补充文档的链接,包括人类解法、AI生成的解法,以及AI生成解法的评审报告和日志。这十个问题由以下数学家贡献:(1) Dariusz Kalociński 和 Theodore A. Slaman,(2) Richard Schwartz,(3) Aleksa Milojevic 和 Benny Sudakov,(4) Larry Guth,(5) Oleg Butkovsky、Jonathan Mattingly 和 Lorenzo Zambotti,(6) Joshua Evan Greene 和 Duncan McCoy,(7) Sucharit Sarkar,(8) Sam Payne 和 Jidong (Jayden) Wang,(9) Sylvie Corteel 和 John Lentfer,(10) Srivatsav Kunnawalkam Elayavalli。

英文摘要

To assess the ability of current AI systems to correctly solve research-level mathematics problems, we tested several AI systems on a set of ten problems in a broad range of mathematical fields; these problems arose naturally in the research process of the contributors. This document includes the problems, our methodology, and the results of our testing. We provide links to supplementary documents including the human solutions, the AI-generated solutions, and the referee reports and logs for the AI-generated solutions. The ten problems were contributed by the following mathematicians: (1) Dariusz Kalociński and Theodore A. Slaman, (2) Richard Schwartz, (3) Aleksa Milojevic and Benny Sudakov, (4) Larry Guth, (5) Oleg Butkovsky, Jonathan Mattingly, and Lorenzo Zambotti, (6) Joshua Evan Greene and Duncan McCoy, (7) Sucharit Sarkar, (8) Sam Payne and Jidong (Jayden) Wang, (9) Sylvie Corteel and John Lentfer, (10) Srivatsav Kunnawalkam Elayavalli.

2606.18191 2026-06-17 cs.AI cs.MA 新提交

DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction

DRFLOW:用于个性化工作流预测的深度研究基准

Md Tawkat Islam Khondaker, Raymond Li, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Issam H. Laradji

发表机构 * ServiceNow AI Research(ServiceNow人工智能研究)

AI总结 提出DRFLOW基准,评估AI代理从异构源预测个性化工作流的能力,包含5领域100任务,并设计7个诊断指标,实验显示现有代理性能有限。

详情
AI中文摘要

深度研究(DR)系统越来越多地用于复杂信息寻求任务,但现有工作主要关注生成报告和摘要。相比之下,许多企业任务需要代理识别具体的工作流,即一系列行动步骤。例如,代理不应总结预算政策,而应能确定回答诸如“在固定预算下如何申请新员工?”这类问题所需的步骤。因此,我们引入DRFLOW,一个用于评估代理从异构源预测个性化工作流的基准。每个任务要求代理从分散来源中识别相关证据,然后使用这些证据预测用户任务的正确行动步骤序列。DRFLOW包含跨五个领域的100个任务,1246个参考工作流步骤,基于超过3900个来源。我们定义了七个诊断指标,涵盖事实依据、步骤恢复、结构排序、条件解决和个性化。我们进一步提出DRFLOW-Agent(DRFA),一个面向工作流的参考代理,用于预测个性化工作流。我们表明,尽管DRFA相比强基线代理有所改进(平均F1分数提升高达10.02%),但在这些工作流指标上仍有很大的改进空间,表明预测完整且正确的个性化工作流仍然是深度研究的一个挑战性前沿。

英文摘要

Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing work mainly focuses on generating reports and summaries. In contrast, many enterprise tasks require an agent to identify concrete workflows, that is, sequences of action steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as: "How do I request new headcount given a fixed budget?" Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user's task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent for predicting personalized workflows. We show that although DRFA improves over strong baseline agents (by up to 10.02% in average F1 score), substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.

2606.17080 2026-06-17 cs.RO cs.AI cs.CV 交叉投稿

HRDX: A Large-Scale Vector HD-Map Dataset

HRDX:大规模矢量高清地图数据集

Sahith Reddy Chada, Isht Dwivedi, Nirav Savaliya

发表机构 * Honda Research Institute US(本田美国研究院)

AI总结 提出HRDX大规模矢量高清地图数据集,覆盖1400公里驾驶数据,含10类地图元素和20多种属性,并引入复合评分评估几何与属性准确性。

Comments https://usa.honda-ri.com/hrdx

详情
AI中文摘要

可靠的自动驾驶需要矢量化的高清地图,这些地图应具有几何精确性、语义丰富性,并能够扩展到长距离驾驶。然而,现有的公开高清地图数据集规模有限,提供的语义属性稀疏,并且缺乏诸如航拍图像等能够开启新研究方向的模态。我们提出了HRDX,一个用于矢量高清地图构建的大规模数据集,涵盖约40小时(1400公里)的最小重叠驾驶,比之前的公开高清地图数据集大数倍。数据使用六个同步环视摄像头、一个128线激光雷达和厘米级RTK GNSS/IMU捕获,并辅以精确对齐的航拍正射影像。标注涵盖10个矢量地图类别,并补充了20多个语义和拓扑属性。为了评估这一更丰富的本体,我们引入了复合评分(CS)来联合评估几何保真度和属性正确性。基准实验表明,HRDX的规模改善了在线矢量地图构建,并且对齐的航拍图像提供了有用的结构先验:在训练和/或推理中使用航拍图像可提高几何地图质量,而航拍增强的教师可以将部分优势转移给仅使用摄像头的学生,而无需增加推理时的传感器需求。HRDX旨在支持大规模高清地图学习、多模态BEV融合以及训练时特权信息的可重复研究。HRDX数据集和基准可在以下网址获取:https://github.com/honda-research-institute/HRDX

英文摘要

Reliable autonomous driving requires vectorized HD maps that are geometrically accurate, semantically rich, and scalable to long-horizon driving. However, existing public HD map datasets are limited in scale, provide sparse semantic attributes, and lack modalities such as aerial imagery that could enable new research directions. We present HRDX, a large-scale dataset for vector HD-map construction, spanning about 40 hours (1,400 km) of minimally overlapping drives, which is several times larger than prior public HD map datasets. Data is captured using six synchronized surround cameras, a 128-beam LiDAR, and centimeter-level RTK GNSS/IMU, and is further complemented by precisely aligned aerial orthoimagery. Annotations cover 10 vector map classes, complemented with over 20 semantic and topological attributes. To evaluate this richer ontology, we introduce the Composite Score (CS) to jointly assess geometric fidelity and attribute correctness. Benchmark experiments show that HRDX's scale improves online vector-map construction, and that aligned aerial imagery provides a useful structural prior: using aerial imagery at training and/or inference improves geometric map quality, while aerial-augmented teachers can transfer part of this benefit to camera-only students without increasing inference-time sensor requirements. HRDX is intended to support reproducible research on large-scale HD-map learning, multimodal BEV fusion, and training-time privileged information. HRDX dataset and benchmarks are available at https://github.com/honda-research-institute/HRDX

2606.17104 2026-06-17 cs.AR cs.AI cs.DC 交叉投稿

Prefill/Decode-Aware Evaluation of LLM Inference on Emerging AI Accelerators

新兴AI加速器上LLM推理的Prefill/Decode感知评估

Shun Usami, Venkatram Vishwanath, E. Wes Bethel

发表机构 * Department of Computer Science(计算机科学系) San Francisco State University(旧金山州立大学) Argonne National Laboratory(阿贡国家实验室) Lawrence Berkeley National Laboratory(伯克利国家实验室)

AI总结 本文通过分离测量Prefill和Decode阶段,评估GPU与新兴AI加速器在Llama2-7B模型上的推理性能,发现GPU在计算密集的Prefill阶段占优,而GroqRack在Decode延迟上更优,但GPU随批处理增大在吞吐上反超。

Comments 8 pages, 5 figures. Accepted to the Workshop on HPC for AI Foundation Models & LLMs for Science (HPAI4S'26), co-located with IEEE IPDPS 2026

详情
AI中文摘要

随着大语言模型(LLM)越来越多地部署在对延迟和成本敏感的环境中,推理效率已成为一个核心系统挑战。尽管GPU主导当前部署,但越来越多的AI加速器声称在LLM推理方面具有优势,然而尚不清楚在何种条件下这些加速器在实践中优于GPU。最近的推理系统将执行分解为Prefill和Decode阶段,这两个阶段表现出不同的计算特征和延迟指标,通常由首次令牌时间(TTFT)和每个输出令牌时间(TPOT)衡量。本文使用通用模型Llama2-7B,对GPU和新兴AI加速器上的LLM推理性能进行了阶段感知评估。通过分别测量Prefill和Decode性能,我们揭示了加速器的优势因阶段和指标而异。我们的结果表明,GPU在计算密集的Prefill阶段始终表现出色,而GroqRack在Decode期间实现了显著更低的TPOT(当前不支持批处理)。然而,随着批处理大小的增加,GPU在Decode吞吐量上重新获得优势。这些发现表明,每个平台都表现出不同的阶段依赖性优势。我们进一步分析了不同加速器平台上的异构Prefill/Decode分离,识别了性能提升以及实现这些提升的工作负载和网络条件。

英文摘要

As large language models (LLMs) are increasingly deployed in latency- and cost-sensitive settings, inference efficiency has become a central systems challenge. While GPUs dominate current deployments, a growing number of AI accelerators claim advantages for LLM inference, yet it remains unclear under which conditions such accelerators outperform GPUs in practice. Recent inference systems decompose execution into Prefill and Decode phases, which exhibit distinct computational characteristics and latency metrics, commonly captured by time to first token (TTFT) and time per output token (TPOT). This paper presents a phase-aware evaluation of LLM inference performance across GPUs and emerging AI accelerators using a common model, Llama2-7B. By separately measuring Prefill and Decode performance, we reveal that accelerator advantages differ by phase and metric. Our results show that GPUs consistently excel in the compute-intensive Prefill phase, while GroqRack achieves significantly lower TPOT during Decode (batching not currently supported). However, GPUs regain an advantage in Decode throughput as batch size increases. These findings demonstrate that each platform exhibits distinct phase-dependent strengths. We further analyze heterogeneous Prefill/Decode disaggregation across different accelerator platforms, identifying performance gains and the workload and network conditions under which such gains are realized.
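The two latency metrics named above can be derived from per-token arrival timestamps. The sketch below uses a common convention, assumed here rather than taken from the paper: TTFT is the delay from request submission to the first output token, and TPOT is the mean gap between consecutive output tokens after the first. The function name `latency_metrics` and its arguments are hypothetical.

```python
def latency_metrics(request_time, token_times):
    """Return (ttft, tpot) in the same time unit as the inputs.

    request_time: timestamp at which the request was submitted.
    token_times:  ascending timestamps at which each output token arrived.
    """
    if not token_times:
        raise ValueError("need at least one output token timestamp")
    # TTFT covers the prefill phase plus the first decode step.
    ttft = token_times[0] - request_time
    if len(token_times) == 1:
        # TPOT is undefined when only one token was produced.
        return ttft, None
    # TPOT averages the decode-phase gaps between consecutive tokens.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

if __name__ == "__main__":
    print(latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25]))
```

Separating the two numbers this way is what lets a prefill-heavy workload (long prompt, short answer) and a decode-heavy workload (short prompt, long answer) be compared on the metric that actually dominates each.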

2606.17115 2026-06-17 cs.LG cs.AI q-bio.QM 交叉投稿

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

探测、融合与可信度:基础模型表示在多模态癌症分析中的系统评估

Jingyu Hu, Giuseppe Tripodi, Reed Naidoo, Sarah F. McGough, Tapabrata Chakraborti

发表机构 * The Alan Turing Institute(艾伦·图灵研究所) University of Bristol(布里斯托大学) University of Manchester(曼彻斯特大学) The Institute of Cancer Research(癌症研究所) Genentech(基因泰克)

AI总结 系统评估基础模型表示在计算病理学任务中的性能,发现图像和组学表示互补,多模态融合在单模态不占优时有效,并利用共形预测验证了不确定性感知推理的临床价值。

详情
AI中文摘要

基础模型(FMs)已成为医学数据的强大表示提取器,但它们在分布偏移下的泛化能力仍未充分探索。本工作系统评估了基于FM的表示在计算病理学任务上的表现,涉及两个真实世界商业队列IH-BC和IH-NSCLC,这些队列来自许可的内部(IH)肿瘤学数据集。分析聚焦于两种模态:全切片图像和转录组图谱,均来自IH多模态数据。我们首先在八个下游分类任务上对五个FM进行单模态探测性能基准测试,发现图像和组学表示携带互补的预测信号。然后,我们通过比较三种基于配对表示的图像-组学融合策略,研究多模态融合是否能在单模态基线之上带来额外收益。进一步通过共形预测评估所选单模态和多模态管道的可信度。我们的结果表明,FM表示在分布外数据上取得了竞争性性能,且多模态融合主要在单模态不占主导信号时有所帮助。共形预测揭示,在点预测失败的大多数情况下,真实诊断仍可在预测集中恢复,这强化了不确定性感知推理对临床支持的价值。

英文摘要

Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based representations on a suite of computational pathology tasks across two real-world commercial cohorts, IH-BC and IH-NSCLC, drawn from the licensed in-house (IH) oncology dataset. The analysis focuses on two modalities, whole-slide images and transcriptomic profiles, drawn from the IH multimodal data. We first benchmark unimodal probing performance across five FMs on eight downstream classification tasks, and find that image and omics representations carry complementary predictive signals. Then we investigate whether multimodal fusion can yield additional gains over unimodal baselines by comparing three image-omics fusion strategies built on paired representations. The trustworthiness of selected unimodal and multimodal pipelines is further assessed through conformal prediction. Our results show that FM representations achieve competitive performance on out-of-distribution data and that multimodal fusion helps mainly when no single modality dominates the signal. Conformal prediction reveals that in the majority of cases where a point prediction fails, the true diagnosis remains recoverable within the prediction set, reinforcing the value of uncertainty-aware inference for clinical support.
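The conformal prediction step mentioned above can be illustrated with split conformal classification. The abstract does not specify which variant the authors use, so this is a minimal sketch assuming the standard recipe: the nonconformity score is one minus the predicted probability of the true label, and the threshold is the finite-sample-corrected quantile on a held-out calibration set. The names `conformal_threshold` and `prediction_set` are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Threshold on the nonconformity score 1 - p(true label).

    cal_probs:  per-sample class probability lists from a calibration set.
    cal_labels: the true class index of each calibration sample.
    alpha:      target miscoverage rate, e.g. 0.1 for 90% coverage.
    """
    scores = sorted(1.0 - probs[y] for probs, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: keep every label.
        return float("inf")
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """All class indices whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]

if __name__ == "__main__":
    cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
    cal_labels = [0, 1, 0, 1]
    q = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
    print(q, prediction_set([0.5, 0.5], 0.5))
```

A point prediction corresponds to taking only the top-probability label; the prediction set can still contain the true diagnosis when that top label is wrong, which is the recoverability effect the abstract describes.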

2606.17165 2026-06-17 stat.ME cs.AI econ.EM math.ST stat.TH 交叉投稿

Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

基于LLM的A/B测试的统计基础:用于人类因果推断的替代指标框架

Joel Persson, Mårten Schultzberg, Sebastian Ankargren

发表机构 * Spotify USA, Inc.(Spotify美国公司)

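AI summary Proposes a surrogacy framework showing that, under conditions weaker than distributional equivalence, calibrated LLM outcomes identify the average treatment effect, and analyzes the bias and variance introduced by LLM stochasticity.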

详情
AI中文摘要

组织和研究者越来越有兴趣在A/B测试中使用大型语言模型(LLM)代替人类参与者,以期更快、更低成本地进行实验。我们研究当在LLM结果上估计的处理效应何时能够恢复在感兴趣的人类群体上测量的效应。LLM与人类结果之间的分布等价性会使任何标准估计量有效,但这不现实。因此,我们开发了一个统计框架,将替代终点理论适配到LLM。该框架表明,将LLM结果校准到人类结果,在替代性和可比性条件(联合弱于分布等价性)下,可以识别平均处理效应。当这些条件不成立时,感兴趣的效应仅部分可识别,我们提供了诊断方法,可以在历史实验上证伪替代性,并给出有限重叠下最坏情况偏差的界限。我们进一步证明,LLM固有的随机性会引入偏差和方差,但使用多次抽取的平均值作为替代指标可以同时缓解两者。我们在模拟和Upworthy标题的A/B测试应用中展示了方法和理论。我们工作的一个核心结论是,LLM结果作为替代指标的有效性只能对过去的处理被证伪,而无法对新处理被验证,因此对于新颖干预,人类实验仍然不可或缺。我们讨论了LLM选择、提示和温度作为设计变量的作用,以及如何确定人类实验的规模以进行验证。

英文摘要

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes recovers the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs. The framework shows that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. When these conditions fail, the effect of interest is only partially identified, and we provide diagnostics that can falsify surrogacy on historical experiments together with a bound on the worst-case bias from limited overlap. We further show that the stochasticity inherent to LLMs introduces both bias and variance, but using an average of multiple draws as the surrogate mitigates both. We illustrate the methods and theory in simulations and an application to A/B tests on Upworthy headlines. A central takeaway from our work is that the validity of LLM outcomes as surrogates can only be falsified for past treatments and never verified for new ones, so human experiments remain indispensable for novel interventions. We discuss the role of LLM choice, prompting, and temperature as design variables, and how to size human experiments for validation.
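
The two ideas in the abstract, averaging several LLM draws per unit and calibrating LLM outcomes to human outcomes, can be sketched as follows. This is an illustrative assumption rather than the paper's estimator: the calibration here is a plain linear least-squares map fitted on historical data, and every function name and number is hypothetical.

```python
from statistics import mean

def averaged_surrogate(draws_per_unit):
    """Average the stochastic LLM draws of each unit into one surrogate value."""
    return [mean(draws) for draws in draws_per_unit]

def fit_linear_calibration(llm, human):
    """Least-squares fit of human ~ a + b * llm on historical paired outcomes."""
    mx, my = mean(llm), mean(human)
    sxx = sum((x - mx) ** 2 for x in llm)
    sxy = sum((x - mx) * (y - my) for x, y in zip(llm, human))
    b = sxy / sxx
    return my - b * mx, b

def calibrated_ate(treated, control, a, b):
    """Difference in means of the calibrated surrogate outcomes."""
    calibrated_mean = lambda xs: mean(a + b * x for x in xs)
    return calibrated_mean(treated) - calibrated_mean(control)

# Historical experiment with both LLM and human outcomes (toy numbers).
a, b = fit_linear_calibration([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# New experiment: two LLM draws per unit, averaged before calibration.
treated = averaged_surrogate([[1.0, 3.0], [2.0, 4.0]])
control = averaged_surrogate([[0.0, 2.0], [1.0, 1.0]])
ate = calibrated_ate(treated, control, a, b)
print(a, b, ate)
```

As the abstract stresses, a calibration fitted on past treatments can be falsified on historical data but not verified for a new treatment, so a sketch like this does not replace a human experiment.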

2606.17283 2026-06-17 cs.CR cs.AI cs.LG 交叉投稿

ARVO: Atlas of Reproducible Vulnerabilities for Open-Source Software

ARVO:开源软件可复现漏洞图谱

Xiang Mei, Jordi Del Castillo, Pulkit Singh Singaria, Haoran Xi, Abdelouahab Benchikh, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili, Adam Doupé, Hammond Pearce, Brendan Dolan-Gavitt

发表机构 * National Vulnerability Database(国家漏洞数据库) Google(谷歌)

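AI summary Proposes a method for building reproducible vulnerability datasets at scale and, on top of OSS-Fuzz, constructs the ARVO dataset of more than 6,100 real-world vulnerabilities, reproducing 81% of them and reaching 89.4% accuracy on located patches, which addresses the three-way trade-off between reproducibility, quantity, and diversity.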

Comments Accepted at IEEE European Symposium on Security and Privacy (EuroS&P) 2026

详情
AI中文摘要

长期以来,在漏洞数据集中实现可复现性、数量和多样性被视为固有的三方权衡,改进一个维度往往以牺牲其他维度为代价。在实践中,可复现性是最常被忽视的维度。这限制了从历史错误数据集中自动提取的内容,并降低了它们对下游安全研究的实用性。在这项工作中,我们提出了一种方法,通过识别大规模错误复现的关键障碍并用通用解决方案加以解决,从而生成一个新的安全数据集,确保大规模多样化漏洞的可复现性。使用这种方法,我们为最大的开源软件漏洞数据集(OSS-Fuzz)引入了完全可复现性,并构建了ARVO数据集(开源软件可复现漏洞图谱)。ARVO是一个大规模数据集,包含311个项目中的6100多个真实世界漏洞。专注于可复现性,ARVO与现有数据集的不同之处在于,它以可以跨版本一致重建、触发和分析的形式提供每个漏洞。可复现性还使得能够自动识别每个漏洞的相应补丁,并支持代码更改后直接与漏洞交互,这是现有大规模数据集所不具备的能力。在我们的评估中,ARVO成功复现了81%的漏洞,并在定位的补丁上达到了89.4%的准确率。我们还讨论了ARVO对上游实践和下游安全研究的影响。

英文摘要

Achieving reproducibility, quantity, and diversity in vulnerability datasets has long been viewed as an inherent three-way trade-off, where improving one dimension often comes at the cost of the others. In practice, reproducibility has been the dimension most often neglected. This has limited what can be automatically extracted from historical bug datasets, and has reduced their utility for downstream security research. In this work, we propose a method to produce a new security dataset which ensures reproducibility for diverse vulnerabilities at scale by identifying the key obstacles to large-scale bug reproduction and addressing them with general solutions. Using this method, we introduce full reproducibility to the largest open source software vulnerability dataset (OSS-Fuzz) and construct the ARVO dataset (an Atlas of Reproducible Vulnerabilities in Open-source software). ARVO is a large-scale dataset consisting of over 6,100 real-world vulnerabilities across 311 projects. Focusing on reproducibility, ARVO differs from existing datasets by providing each vulnerability in a form that can be consistently rebuilt, triggered, and analyzed across versions. Reproducibility also enables automatic identification of the corresponding patch for each vulnerability and supports direct interaction with vulnerabilities after code changes, capabilities that existing large-scale datasets do not provide. In our evaluation, ARVO successfully reproduces 81% of vulnerabilities and achieves 89.4% accuracy on the located patches. We also discuss ARVO's influence on both upstream practices and downstream security research.

2606.17391 2026-06-17 cs.CL cs.AI cs.LG 交叉投稿

NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

NarrativeWorldBench:面向长程共创音频剧的前沿饱和基准与潜在世界模型

Logan Mann, Abdur Rahman, Mohammad Saifullah, Taaha Kazi, Vasu Sharma

发表机构 * University of California, Santa Barbara(加州大学圣塔芭芭拉分校) Pocket FM

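AI summary Introduces the NarrativeWorldBench benchmark, which evaluates 21 models on nine narrative-structure metrics, and N-VSSM, a variational state-space model that maintains a structured latent state over more than 200 episodes using a Mamba-2 backbone and an event-conditioned posterior; N-VSSM outperforms Claude Opus 4.5 on long-arc consistency and controllability.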

Comments 10 pages. Accepted to the ICML 2026 Workshops on High-dimensional Learning Dynamics (HiLD) and Culture x AI

详情
AI中文摘要

长篇连载音频剧,其剧情弧线跨越200至800集,是一种重要的创意媒介,也是前沿大语言模型(LLM)表现不佳的场景。我们在一组统一的叙事结构指标上,对21个模型进行了基准测试,涵盖经典、微调、开放前沿、封闭前沿和推理层级。所有封闭前沿系统在情节节拍F1上饱和于[0.78, 0.81]区间,并在视界h=200时下降约-0.20 F1。我们引入了NarrativeWorldBench,一个开放基准,包含九种叙事结构指标,在h∈{10, 20, 50, 100, 200}的视界上评估,并在四种印度语言(印地语、泰米尔语、泰卢固语、马拉地语)上进行跨语言评估。我们提出了N-VSSM,一种叙事变分状态空间模型,通过Mamba-2骨干网络和事件条件后验以及8B解码器,在超过200集的时间内维持一个结构化的256维潜在世界状态。N-VSSM在所有视界上保持情节节拍F1≥0.84,计算量仅为封闭前沿区间的1/4。学习到的文化迁移函数将跨语言忠实度提高了+0.20至+0.23 Likert分。在一项受试者内作家研究(n=12位专业作者,240次试验)中,N-VSSM在长弧一致性上以71%的偏好率优于Claude Opus 4.5,在可控性上评分高出+1.3 Likert分。

英文摘要

Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and lose about 0.20 F1 at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons h in {10, 20, 50, 100, 200}, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 >= 0.84 across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.

2606.17449 2026-06-17 cs.CL cs.AI cs.CV cs.LG cs.MM 交叉投稿

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

MODE-RAG: 基于流形异常诊断和能量的检索增强生成评估

Zehang Wei, Jiaxin Dai, Jiamin Yan, Xiang Xiang

发表机构 * School of Computer Science & Tech, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) School of AI and Automation, Huazhong University of Science and Technology(华中科技大学人工智能与自动化学院)

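AI summary Proposes MODE-RAG, a multi-agent system that uses variational free energy and internal attention states to gate interventions dynamically, combining Monte Carlo Tree Search and logit perturbations to reduce hallucinations and logical fabrication in multimodal retrieval-augmented generation.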

Comments To be presented at ACL 2026

详情
AI中文摘要

虽然多模态检索增强生成(M-RAG)增强了大型视觉语言模型,但它仍然非常容易受到跨模态幻觉、因果捏造和谄媚的影响。此外,现有的缓解流程常常面临干预悖论:静态规则往往不必要地干扰准确的生成,而完全不加引导的多模态推理则允许现有的不匹配级联成严重的逻辑捏造。为了量化和缓解这些幻觉,我们提出了一个多智能体系统MODE-RAG,由变分自由能(VFE)和内部注意力状态驱动,以动态门控干预。高风险查询被路由到五个阶段特定的智能体,集成蒙特卡洛树搜索(MCTS)进行严格的因果推导,以及logit扰动以惩罚谄媚。专门的纠正和监管智能体确保格式稳定性并执行事后事实验证。为了客观评估我们的方法,我们引入了ModeVent,一个源自MultiVent数据集的具有挑战性的子集。大量实验表明,我们的系统有效降低了幻觉率和逻辑捏造,显著提高了M-RAG系统的鲁棒性。

英文摘要

While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.

2606.17474 2026-06-17 cs.CL cs.AI 交叉投稿

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

AIPatient Arena:基于电子健康记录的大语言模型在端到端临床咨询工作流中的评估

Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He, Xiang Li, Zhiying Liang, Xinxin Lin, Kent CY So, Bryan YP Yan, Yun Kwok Wing, Yanqiu Xing, Xin Ma, Lizhou Fan

发表机构 * School of Control Science and Engineering, Shandong University(控制科学与工程学院,山东大学) Key Laboratory of Machine Intelligence and System Control, Shandong University(机器智能与系统控制重点实验室,山东大学) Department of Medicine and Therapeutics, The Chinese University of Hong Kong(医学与治疗学系,香港中文大学) Department of Geriatric Medicine, Qilu Hospital of Shandong University(老年医学科,山东大学齐鲁医院) Department of Psychiatry, The Chinese University of Hong Kong(精神病学系,香港中文大学) Li Chiu Kong Family Sleep Assessment Unit, Department of Psychiatry, Faculty of Medicine, The Chinese University of Hong Kong(李秋虹家庭睡眠评估单元,精神病学系,医学院,香港中文大学) Li Ka Shing Institute of Health Sciences, Faculty of Medicine, The Chinese University of Hong Kong(李嘉诚健康科学研究院,医学院,香港中文大学) Gerald Choa Neuroscience Institute, Department of Medicine and Therapeutics, The Chinese University of Hong Kong(Gerald Choa 神经科学研究所,医学与治疗学系,香港中文大学)

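AI summary Proposes the AIPatient Arena framework, which builds patient-specific knowledge graphs from electronic health records and evaluates eight clinical competencies of LLMs in multi-turn physician-patient interactions; it finds weaknesses in information coverage and diagnostic reasoning, among other areas, and stresses the importance of process-based evaluation.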

Comments 49 pages, 12 figures, 11 tables

详情
AI中文摘要

大语言模型(LLMs)越来越多地被考虑用于临床咨询任务,然而大多数医学评估仍然是静态的、单轮的或狭义的结果导向,限制了它们反映真实医疗护理的序列性、不确定性和交互性的能力。在此,我们提出AIPatient Arena,一个基于电子健康记录(EHRs)的评估框架,用于评估LLMs在八个临床能力维度上的临床实用性。该框架将EHR数据整合到患者特定的知识图谱中,实现多轮医患交互。我们将AIPatient Arena应用于一个由437名患者组成的主要队列以及两个分布外验证队列(分别为119名和67名患者)。我们观察到,LLMs在医学访谈提问技能(QS;平均得分4.43-4.99/5)、伦理与职业行为(ET;4.38-4.93/5)以及临床解释的清晰度和透明度(EX;3.80-4.72/5)方面表现良好。在信息整合(II;3.19-4.21/5)和用药安全与合理性(MS;3.13-3.78/5)方面表现中等,但在处理模糊患者回应(HR;2.57-3.32/5)、信息覆盖(IC;2.08-3.02/5)以及诊断准确性与推理(Dx;2.63-3.55/5)方面观察到持续的弱点。基于过程的评估揭示了反复出现的交互失败,包括重复提问、遗漏既往病史以及对不确定性处理不当。更丰富的对话上下文改善了诊断推理,但在治疗计划方面收益有限。这些发现表明,仅凭最终答案的准确性不足以评估临床就绪性,并强调了评估模型在整个咨询过程中如何收集、解释和传递信息的重要性。AIPatient Arena为医学LLMs的面向工作流的部署前评估提供了一个基于EHR的框架。

英文摘要

Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observe that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

2606.17514 2026-06-17 cs.SE cs.AI 交叉投稿

Unlocking LLM Code Correction with Iterative Feedback Loops

解锁大语言模型代码修正的迭代反馈循环

Le Zhang, Suresh Kothari

发表机构 * Iowa State University(爱荷华州立大学)

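AI summary Studies how well LLMs correct their own code over iterations of execution feedback, introduces evaluation metrics, and compares how reasoning and non-reasoning models use feedback; reasoning models substantially outperform non-reasoning ones, and syntactic and runtime errors are easier to fix than logical errors.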

Comments 22 pages, 14th Computing Conference 2026

详情
AI中文摘要

大型语言模型在代码生成方面展现了卓越的能力。然而,现有评估大多只关注单次尝试的准确性,而忽略了现实编程中关键的迭代优化过程。本研究系统性地调查了LLMs通过执行反馈修正自身代码的能力。使用跨四个模型和两种主要编程语言的真实编程问题,本研究通过迭代优化框架评估性能,其中LLMs在每次尝试后接收编译器错误消息和测试用例反馈。本研究引入了评估代码失败、分析修正模式以及比较推理与非推理模型有效性的指标,为理解和实际应用LLM驱动代码生成系统中的反馈循环提供了可操作的见解。结果表明,推理模型在迭代中持续改进,在利用反馈方面显著优于非推理模型,而语法和运行时错误比逻辑或算法失败更容易处理。

英文摘要

Large Language Models have shown remarkable capabilities in code generation. However, most existing evaluations focus only on single-attempt accuracy and overlook the iterative refinement process that is central to real-world programming. This study presents a systematic investigation of LLMs' ability to rectify their own code through execution feedback. Using real-world programming problems across four models and two major programming languages, this study evaluates performance using an iterative refinement framework where LLMs receive compiler error messages and test-case feedback after each attempt. This study introduces metrics to evaluate code failures, analyze rectification patterns, and compare the effectiveness of reasoning and non-reasoning models, offering actionable insights into both the understanding and practical application of feedback loops in LLM-driven code generation systems. Results show that reasoning models consistently improve over iterations, substantially outperforming non-reasoning models in leveraging feedback, while syntactic and runtime errors are far more tractable than logical or algorithmic failures.

2606.17519 2026-06-17 cs.CL cs.AI 交叉投稿

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

扩展企业智能体路由:退化、诊断与恢复

Kellen Gillespie, Robyn Perry

发表机构 * Superhuman, Inc.(Superhuman公司)

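AI summary Studies how routing accuracy degrades as an enterprise assistant's tool library grows, and shows that embedding-based shortlisting recovers 10-17 percentage points of F1.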

Comments 10 pages (6 main + 4 appendix), 4 figures, 6 tables

详情
AI中文摘要

生产级LLM助手将用户请求路由到日益增长的专用工具库,但随着目录规模扩大,路由准确率如何退化?我们在一个已部署的企业生产力助手的110个智能体、584个工具的目录上研究单步路由,评估了从10到110个智能体的三种前沿模型。在未充分指定的请求上,路由F1分数跨模型下降16-23个百分点。一个oracle分析将退化分解为检索差距(模型无法找到正确工具)和混淆差距(即使完美检索,oracle上限也下降10pp)。基于嵌入的预选在全部规模下为所有三种模型和两个提供商恢复+10-11pp F1分数。一项生产标注研究(1,435个人工标注话语,三个标注者)确认了在真实流量上的恢复,尽管绝对性能低10-15pp,但恢复幅度为+10-17pp。

英文摘要

Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models from 10 to 110 agents. Routing F1 on under-specified requests drops 16-23 percentage points across models. An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10pp). Embedding-based shortlisting recovers +10-11pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10-17pp despite 10-15pp lower absolute performance.
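
Embedding-based shortlisting, as named in the abstract, can be sketched as a top-k cosine-similarity filter applied before the routing model sees the catalog. The sketch below is an assumption about the general technique, not the deployed system: the tool names and three-dimensional vectors are invented, and a real setup would obtain vectors from an embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def shortlist(query_vec, tool_vecs, k):
    """Names of the k tools whose embeddings are most similar to the query."""
    ranked = sorted(
        tool_vecs,
        key=lambda name: cosine(query_vec, tool_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical catalog with toy embeddings of each tool description.
tool_vecs = {
    "calendar.create_event": [1.0, 0.0, 0.0],
    "email.send": [0.0, 1.0, 0.0],
    "calendar.find_slot": [0.9, 0.1, 0.0],
    "docs.summarize": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.2, 0.0]  # toy embedding of the user request

candidates = shortlist(query_vec, tool_vecs, k=2)
print(candidates)
```

Only the shortlisted candidates would then be passed to the routing model, which shrinks the set of options it has to discriminate between.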

2606.17541 2026-06-17 cs.LG cs.AI 交叉投稿

Offline Preference-Based Trajectory Evaluation

基于偏好的离线轨迹评估

Fernando Diaz

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

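AI summary Addresses the statistical inefficiency of offline evaluation that relies only on terminal success rate by proposing preference-based trajectory evaluation, which compares trajectories through temporal preferences to reduce ties and improve discriminative power, ranking stability, and data efficiency.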

详情
AI中文摘要

智能系统的离线评估通常将轨迹简化为终端成功,丢弃了部分进展信息并导致大量平局,通过减少有效样本量和削弱区分系统的能力,造成显著的统计低效。我们提出基于偏好的轨迹评估,该方法通过时间偏好(关于进展和返回时间分布)直接比较轨迹。我们发现,在多种智能和交互基准测试中,基于标准成功率的指标在大约75%的实例上产生平局比较,而轨迹感知偏好将平局减少到大约35%,从而提高了区分能力、排名稳定性和数据效率。我们的结果表明,通常归因于数据收集不足或问题难度的基准饱和,也可能由评估指标的选择所解释。

英文摘要

Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective sample size and weakening the ability to distinguish systems. We propose preference-based trajectory evaluation, which compares trajectories directly through temporal preferences over progress and time-to-return profiles. We find that, across diverse agentic and interactive benchmarks, standard success-based metrics produce tied comparisons on roughly 75% of instances, whereas trajectory-aware preferences reduce ties to roughly 35%, improving discriminative power, ranking stability, and data efficiency. Our results suggest that benchmark saturation, often attributed to poor data collection or problem difficulty, may also be explained by the choice of evaluation measure.
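
The contrast between terminal-success comparison and a trajectory-aware preference can be made concrete with a small tie-rate calculation. This is a sketch under stated assumptions: the lexicographic preference below (success, then final progress, then speed) is a stand-in chosen for illustration and is not the paper's definition, and the trajectories are toy progress curves in [0, 1].

```python
def succeeded(traj):
    """Terminal success: the final progress value reaches 1.0."""
    return traj[-1] >= 1.0

def steps_to_success(traj):
    """Index of the first step with full progress, or len(traj) if never."""
    for step, progress in enumerate(traj):
        if progress >= 1.0:
            return step
    return len(traj)

def terminal_compare(a, b):
    """1 if a is preferred, -1 if b is preferred, 0 for a tie (success only)."""
    return (succeeded(a) > succeeded(b)) - (succeeded(a) < succeeded(b))

def trajectory_compare(a, b):
    """Lexicographic preference: success, then final progress, then speed."""
    key_a = (succeeded(a), min(a[-1], 1.0), -steps_to_success(a))
    key_b = (succeeded(b), min(b[-1], 1.0), -steps_to_success(b))
    return (key_a > key_b) - (key_a < key_b)

def tie_rate(pairs, compare):
    """Fraction of paired instances on which the two systems tie."""
    return sum(compare(a, b) == 0 for a, b in pairs) / len(pairs)

# Toy paired trajectories (system A, system B) on four task instances.
pairs = [
    ([0.2, 0.6, 1.0], [0.1, 0.3, 0.5]),  # only A succeeds
    ([0.5, 1.0, 1.0], [0.2, 0.6, 1.0]),  # both succeed, A is faster
    ([0.1, 0.4, 0.7], [0.1, 0.2, 0.3]),  # both fail, A gets further
    ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0]),  # indistinguishable
]
terminal_ties = tie_rate(pairs, terminal_compare)
trajectory_ties = tie_rate(pairs, trajectory_compare)
print(terminal_ties, trajectory_ties)
```

On this made-up data the terminal comparison ties on three of four instances and the trajectory-aware one on a single instance; the numbers only illustrate the mechanism and say nothing about the benchmarks in the paper.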

2606.17564 2026-06-17 cs.CV cs.AI 交叉投稿

Geometric Consistency Protocol for Foundation Model Features in Multi-View Satellite Imagery

多视图卫星图像中基础模型特征的几何一致性协议

Qiyan Luo, Jie Yang, Yingdong Pi, Lekang Wen, Mi Wang

发表机构 * Hubei Province Key Research and Development Program(湖北省重点研发计划) LIESMARS Special Research Funding(测绘遥感信息工程国家重点实验室专项研究基金) National Science Fund for Distinguished Young Scholars(国家杰出青年科学基金)

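AI summary Because conventional unconstrained 2D global matching is misleading for satellite multi-view reconstruction, proposes a geometry-faithful evaluation protocol based on the Rational Function Model (RFM); using an RPC-projected 3D consistency metric and a geometry-constrained dense matching proxy, it reveals a decoupling of semantic agreement from geometric localization and shows that 2D backbones remain competitive under RPC-consistent evaluation.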

Comments The manuscript is accepted as an oral presentation at the IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

详情
AI中文摘要

标准化的评估协议对于遥感领域的稳健基准测试至关重要,特别是当基础特征越来越多地跨不同传感器和复杂成像几何进行迁移时。在卫星多视图重建中,依赖无约束2D全局匹配的传统评估常常具有误导性。有理函数模型(RFM)及其有理多项式系数(RPC)决定了弯曲的、高度依赖的极线几何,这使得平坦的2D搜索空间在物理上不一致。我们提出了一种针对RPC框架的几何忠实且可复现的协议。我们的方法将RPC投影的3D一致性度量与几何约束的密集匹配代理相结合,专门评估在物理上合理的搜索流形下相似性响应是否保持局部化和唯一性。我们联合报告策略的一个关键发现是语义一致性与几何定位的解耦:在投影3D点处的高跨视图相似性并不能保证实际推理中的可靠匹配性。我们的基准测试表明,将几何约束纳入问题定义对于卫星图像是基础性的。此外,我们展示了最先进的2D骨干网络在经受这种RPC一致评估时,仍然与专门的3D感知模型保持显著竞争力。

英文摘要

Standardized evaluation protocols are indispensable for robust benchmarking in remote sensing, particularly as foundation features are increasingly transferred across diverse sensors and complex imaging geometries. In satellite multi-view reconstruction, conventional evaluations relying on unconstrained 2D global matching are often misleading. The Rational Function Model (RFM) and its Rational Polynomial Coefficients (RPC) dictate a curved, height-dependent epipolar geometry that renders flat 2D search spaces physically inconsistent. We propose a geometry-faithful and reproducible protocol tailored for the RPC framework. Our approach integrates an RPC-projected 3D consistency metric with a geometry-constrained dense matching proxy, specifically evaluating whether similarity responses remain localized and unique under physically plausible search manifolds. A pivotal finding of our joint reporting strategy is the decoupling of semantic agreement and geometric localization: high cross-view similarity at a projected 3D point does not guarantee reliable matchability in practical inference. Our benchmark demonstrates that incorporating geometric constraints is fundamental to the problem definition in satellite imagery. Furthermore, we show that state-of-the-art 2D backbones remain remarkably competitive against specialized 3D-aware models when subjected to this RPC-consistent evaluation.

2606.17588 2026-06-17 cs.SE cs.AI 交叉投稿

Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations

理解LLM在标题-摘要筛选中的作用:从分歧到建议

Mika Mäntylä, Patricia Matsubara, Katia Romero Felizardo, Miikka Kuutila, Marco Gerosa, Savio de Sousa Sampaio, Tayana Conte, Igor Steinmacher

发表机构 * University of Helsinki, Finland(赫尔辛基大学,芬兰) UFMS, Brazil(巴西UFMS) UTFPR – Federal University of Technology - Paraná, Brazil(巴西UTFPR – 联邦技术大学-巴拉那) LUT University, Finland(芬兰LUT大学) Northern Arizona University, United States(美国北亚利桑那大学) UFAM, Brazil(巴西UFAM)

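AI summary Qualitatively analyzes why LLMs and humans disagree in title-abstract screening for systematic reviews and proposes recommendations such as validating semantic understanding, using multiple LLMs, and focusing on borderline cases.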

Comments 14 pages + references. Accepted for publication in the 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2026)

详情
AI中文摘要

多项研究探讨了在系统综述(SRs)中使用大型语言模型(LLMs)进行标题-摘要筛选,报告了混合的准确性。然而,可靠性问题仍未得到充分解决。在本研究中,我们超越了定量的人机一致性指标,定性调查了LLMs失败的方式和原因。我们还提出了可操作的建议。我们分析了六个软件工程SRs和超过1000篇主要研究论文中LLMs与研究人员之间的分歧。对于每个SR,论文由人类专家和LLMs以零样本模式独立筛选,得到的Kappa值在0.52到0.77之间。定性分析表明,人机分歧源于反复出现的可识别原因,例如关键术语的边界模糊、关键词过度强调和错误的话题推断。基于这些发现,我们提出了建议,例如在部署前验证语义理解、运行多个LLMs以及将验证工作集中在边界案例上。未来的研究需要验证我们建议的影响,并且需要社区努力制定关于在SRs中使用LLMs的规范性指南。

英文摘要

Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed. In this study, we go beyond quantitative LLM-human agreement metrics and qualitatively investigate how and why LLMs fail. We also propose actionable recommendations. We analyzed disagreements between LLMs and researchers across six software engineering SRs and over 1,000 primary study papers. For each SR, papers were screened independently by human experts and LLMs in zero-shot mode, resulting in Kappa values ranging from 0.52 to 0.77. Qualitative analysis suggests that human-LLM disagreement results from recurring, identifiable causes, such as boundary ambiguity in key terms, keyword overemphasization, and incorrect topic inference. Based on these findings, we propose recommendations such as validating semantic understanding before deployment, running multiple LLMs, and focusing validation efforts on borderline cases. Future studies are needed to validate the impact of our recommendations, and community efforts are needed to develop normative guidelines on LLM usage in SRs.
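
The Kappa values quoted in the abstract are chance-corrected agreement scores. A minimal sketch of Cohen's kappa for two screeners follows; the include/exclude decisions are invented for illustration and are not data from the study.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy screening decisions for ten papers (made-up labels).
human = ["in", "in", "in", "out", "out", "out", "out", "out", "out", "out"]
llm = ["in", "in", "out", "in", "out", "out", "out", "out", "out", "out"]

kappa = cohen_kappa(human, llm)
print(round(kappa, 3))
```

Here the raters agree on 8 of 10 papers, but because most papers are excluded by both, chance agreement is already 0.58 and kappa is only about 0.52, which is why raw agreement overstates reliability in screening tasks.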

2606.17644 2026-06-17 cs.CV cs.AI 交叉投稿

Bounding Box Label Propagation for Re-Annotation of Document Layout Analysis Datasets

边界框标签传播用于文档布局分析数据集的重新标注

Nick Jochum, Tobias Alt-Veit, Christian Schön, Alexander Lück, René Schuster, Didier Stricker

发表机构 * Insiders Technologies GmbH(Insiders Technologies 有限公司) DFKI – German Research Center for Artificial Intelligence(德国人工智能研究中心) RPTU – University Kaiserslautern-Landau(凯泽斯劳滕-兰道大学)

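AI summary Proposes BBLP, a pseudo-labelling framework whose object encoder fuses visual, textual, and positional embeddings and applies label propagation, reaching 81.6% of fully supervised performance with only 10% labelled data.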

Comments 17 pages, 3 figures, to appear in proceedings of ICDAR 2026, Vienna, Austria

详情
AI中文摘要

实际文档处理场景中的数据集通常随时间增长,其类别标注不断细化,这导致大量耗时且昂贵的重新标注工作。一个有前景的解决方案是仅手动重新标注一小部分可用文档,并应用半监督学习技术利用有标签和无标签数据。尽管针对分类问题已有多种方法,但对于目标检测实例的重新分类(例如文档布局分析)尚无适配方法。为此,我们提出了边界框标签传播(BBLP),一种用于目标检测的伪标签框架。对象编码器整合来自目标检测样本的视觉、文本和位置嵌入,生成联合嵌入,可用于部分标注数据集上的标签传播,即插即用。评估结果表明,所提方法能产生高质量的边界框类别标注。在D4LA布局分析数据集中,仅使用10%标注数据,其mAP达到54.0%,相当于全监督性能的81.6%。我们的工作展示了标签传播在目标检测中的潜力,并为减少实际文档处理应用中的手动标注工作量奠定了基础。

英文摘要

Datasets in practical document processing scenarios typically grow over time, and their class annotations undergo continuous refinement. This creates significant re-annotation efforts, which are time-consuming and costly. A promising remedy is to re-annotate only a small subset of available documents manually and apply semi-supervised learning techniques that leverage both labelled and unlabelled data. Although there are numerous approaches to tackle this problem for classification, there exists no adaptation for the problem of re-classifying object detection instances, e.g. for document layout analysis. To this end, we propose Bounding Box Label Propagation (BBLP), a pseudo-labelling framework for object detection. An object encoder integrates visual, textual, and positional embeddings from object detection samples to produce a joint embedding that can be used for Label Propagation on partially annotated datasets in a plug-and-play fashion. Evaluation results indicate that the proposed approach produces high-quality class annotations of bounding boxes. On the D4LA layout analysis dataset, it achieves an mAP of 54.0%, corresponding to 81.6% of fully supervised performance, while using only 10% labelled data. Our work demonstrates the potential of Label Propagation for object detection and lays the groundwork for reducing manual annotation efforts in real-world document processing applications.
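
Label propagation over a similarity graph, the building block named in the abstract, can be sketched in a few lines. This is a generic clamped-seed variant and not the BBLP implementation: the similarity matrix is given directly instead of being computed from the paper's joint embeddings, and the function name and numbers are hypothetical.

```python
def propagate_labels(weights, seeds, n_classes, iterations=100):
    """Iteratively spread seed labels over a weighted similarity graph.

    weights: symmetric matrix of non-negative similarities between objects.
    seeds: {object index: class index} for the manually labelled objects.
    """
    n = len(weights)
    dist = [[1.0 / n_classes] * n_classes for _ in range(n)]
    for i, c in seeds.items():
        dist[i] = [1.0 if k == c else 0.0 for k in range(n_classes)]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total = sum(weights[i])
            if i in seeds or total == 0:
                updated.append(dist[i])  # clamp seeds; leave isolated nodes alone
                continue
            # Weighted average of the neighbours' current class distributions.
            updated.append([
                sum(w * dist[j][k] for j, w in enumerate(weights[i])) / total
                for k in range(n_classes)
            ])
        dist = updated
    return [max(range(n_classes), key=lambda k: d[k]) for d in dist]

# Four bounding boxes: 0-1 and 2-3 are similar, 1-2 only weakly so.
weights = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
seeds = {0: 0, 3: 1}  # only boxes 0 and 3 were re-annotated by hand

pseudo_labels = propagate_labels(weights, seeds, n_classes=2)
print(pseudo_labels)
```

The unlabelled boxes inherit the class of the seed they are most strongly connected to, which is how a small re-annotated subset can pseudo-label the remaining instances.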

2606.17710 2026-06-17 cs.CV cs.AI cs.CL cs.LG 交叉投稿

Vision-language models for chest radiography do not always need the image

胸部X光片的视觉-语言模型并不总是需要图像

Mahshad Lotfinia, Sebastian Ziegelmayer, Lisa Adams, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg(弗里德里希-亚历山大-埃尔朗根-纽伦堡大学模式识别实验室) Department of Diagnostic and Interventional Radiology, TUM University Clinic, School of Medicine and Health, Klinikum rechts der Isar, Technical University of Munich(慕尼黑工业大学医学院与健康学院伊萨尔河右岸医院诊断与介入放射学系) Lab for AI in Medicine, RWTH Aachen University(亚琛工业大学医学人工智能实验室) Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen(亚琛工业大学医院诊断与介入放射学系)

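AI summary Using a causal audit, finds that many medical vision-language models rely on text priors rather than the image on chest radiograph tasks, with a text-only model performing close to multimodal models, and proposes evaluating models by their dependence on the image.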

详情
AI中文摘要

医学视觉-语言模型报告了强大的胸部X光片准确性,这越来越多地被解读为它们使用了图像的证据。这种推断是不安全的:一个利用发现名称先验的模型得分与读取扫描的模型相同,且没有标准基准能区分它们。我们引入了一种因果审计方法,通过遮挡相关区域、遮挡无关区域以及替换为另一患者的相同标签扫描来干预图像,并结合三种行为指标测试正确答案是否依赖于图像。在九个系统中,一个没有图像访问权限的纯文本模型达到了最佳多模态模型5.7个准确度点以内的水平,而一个1190亿参数的多模态模型在统计上与70亿参数的纯文本基线无法区分。审计将队列分为三个忽略图像的模型、一个不稳定的模型和五个选择性使用图像的模型(针对部分发现);这些分类在第二个数据集、分辨率和提示措辞上保持一致。与委员会认证的放射科医生相比,纯文本模型在准确率上与放射科医生无统计差异,但基础归因于零,而使用图像的模型的基础归因率与放射科医生相当。报告的置信度仅在模型使用图像时标记无根据的答案。基础归因审计(而非准确性)应成为临床部署的门槛。

英文摘要

Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient's same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist's accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.

2606.17799 2026-06-17 cs.SE cs.AI cs.CL 交叉投稿

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

立场:编程基准与智能体软件工程不一致

Maria I. Gorinova, Macey Baker, Amy Heineike, Maksim Shaposhnikov, Rob Willoughby, Dru Knox

发表机构 * Tessl

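AI summary Argues that current coding benchmarks have three problems in the agent era: they conflate the model with the system harness, grading against a single reference solution penalises valid alternatives, and the lack of component-level signal makes iteration difficult; calls for redesigning benchmarks to align with agentic software engineering.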

详情
AI中文摘要

编程智能体已成为软件工程的主要模式,但我们用于比较它们的基准是在智能体时代之前设计的:它们将模型、框架和环境合并为一个单一的端到端分数,通常针对一个参考答案进行计算,没有提供用于迭代的组件级信号。我们认为当前的编程基准与智能体软件工程不一致。在实践中,编程智能体不是一个模型:它是一个系统框架——由模型、框架、上下文、环境和反馈信号组成的复合体,其中任何一个都可能使基准分数移动与相邻模型代际之间相当的幅度。我们讨论了三个症状:(i) 基准分数混淆了模型与框架的其余部分;(ii) 针对单一参考答案评分惩罚了同样有效的替代方案;(iii) 缺乏单个框架组件级别的信号使得端到端系统分数难以迭代。

英文摘要

Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness, a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on.

2606.17819 2026-06-17 cs.SE cs.AI cs.CL 交叉投稿

A Framework for Evaluating Agentic Skills at Scale

大规模评估智能体技能的框架

Maksim Shaposhnikov, Nicolas Fortuin, Simon Stipcich, Maria I. Gorinova, Amy Heineike, Rob Willoughby

发表机构 * Tessl London United Kingdom(伦敦英国Tessl)

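AI summary Proposes an evaluation framework that builds realistic tasks and scoring rubrics and evaluates 500 real-world skills across 19 agent-model configurations at scale, finding that models differ widely in how closely they follow skill instructions and that skills significantly change model behavior.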

详情
AI中文摘要

智能体技能——结构化、可重用的知识工件,增强LLM智能体能力——已在工业界迅速采用,但其跨领域影响以及在商业和开源模型中的使用仍未得到充分研究,并且缺乏可复用的方法来评估单个技能。在这项工作中,我们提出了一个评估框架,允许技能作者构建真实任务,以严格评估技能中对他们最重要的方面,并通过解决这些任务来估计技能效用。此外,我们将评估方法大规模应用于500个真实技能,生成了1000个源自技能内容的任务,以及指令遵循和目标完成评分标准。使用这些指标,我们评估了19种智能体模型配置(包括专有和开源模型)在任务上的表现。我们的结果表明,模型在遵循技能中编码的指令方面差异很大,导致其性能提升存在显著差异。此外,我们表明,与无技能设置相比,访问技能显著改变了模型行为,为将主观工作流编码到LLM智能体中提供了一种重要机制。我们发布了评估数据集,以支持未来关于智能体技能的工作。

英文摘要

Agent skills (structured, reusable knowledge artifacts that augment LLM agent capabilities) have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills' content, along with instruction-following and goal-completion scoring rubrics. Using these metrics, we evaluate how 19 agent-model configurations, both proprietary and open-source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no-skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.

2606.17826 2026-06-17 cs.CL cs.AI cross-listed

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

Jean Seo, Minkyu Kim, Jeonguk Lee, Jisoo Jung, Wooseok Han, Eunho Yang

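Affiliations * AITRICS; University of Copenhagen; KAIST

AI summary Addresses multiscript variability in non-English clinical ASR by introducing the MultiClin benchmark, whose multiscript-aware evaluation measures recognition quality more fairly; also finds that script unification improves ASR performance.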

Comments Interspeech 2026

Abstract

Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: https://github.com/aitrics-ronaldo/Interspeech_MultiClin.
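
The multiscript-aware scoring idea can be sketched roughly as follows. This is a minimal illustration and not the MultiClin implementation: it assumes each utterance carries several valid reference transcripts and that a hypothesis is scored by character error rate against the closest one. The function names and the example strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate against a single reference transcript."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)


def multi_reference_cer(hypothesis: str, references: list[str]) -> float:
    """Assumed multiscript-aware score: CER against the closest valid reference."""
    return min(cer(hypothesis, ref) for ref in references)


# Hypothetical example: one term written in Latin script or transliterated.
refs = ["환자 CT 촬영", "환자 씨티 촬영"]
hyp = "환자 씨티 촬영"
single = cer(hyp, refs[0])             # penalizes the valid variant
multi = multi_reference_cer(hyp, refs)  # accepts either orthography
```

Under single-reference scoring the valid variant is counted as two character errors, while the multi-reference score treats it as correct, which is the underestimation the abstract describes.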

2606.18129 2026-06-17 cs.HC cs.AI cross-listed

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

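Affiliations * York University; Vector Institute; Rotman Research Institute; Dalhousie University; Centre for Addiction & Mental Health; KITE Research Institute

AI summary Addresses the lack of process-level behavioural evaluation of LLMs in mental-health support by proposing the concept of cognitive atrophy and an accompanying benchmark; clinical annotation and expert review show that models consistently exhibit moderate-to-high atrophy-aligned behaviour.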

Abstract

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

2606.18135 2026-06-17 cs.SD cs.AI cross-listed

Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD)

Sinclair Gurny, Ryan Quinn

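Affiliations * Certus Innovations

AI summary Introduces C3GD, a public gunshot dataset of more than 8000 field-collected data points from 28 firearms across 16 calibers, for caliber classification, gunshot detection, and related tasks, with rich metadata to support generalization and academic analysis.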

Abstract

In this work, we introduce the Certus Caliber Classification Gunshot Dataset (C3GD), a publicly accessible data set developed for the analysis of firearm muzzle blast sounds. The dataset aims to provide a wide variety of firearms, calibers, cartridges, microphones, and microphone locations with metadata detailed beyond what is currently otherwise available. It comprises more than 8000 field-collected data points from 28 firearms across 16 calibers. Because data collection in the field is costly, much of the existing research has been done using gunshot audio collected from the internet, which increases the risk of low-quality data and label noise. This dataset is primarily focused on caliber classification, but can also be used for gunshot detection, audio separation, and audio signal processing, providing a diversified and real-world reference. The dataset aims to provide enough diversity to be able to generalize to more real-world applications while also providing enough metadata for detailed academic analysis.

2606.18158 2026-06-17 cs.CY cs.AI cs.CL cross-listed

The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act

Michèle Finck

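Affiliations * Chair of Law and Artificial Intelligence and Director, CZS Institute for Artificial Intelligence and Law, University of Tübingen

AI summary Notes that no current benchmark evaluates whether large language models perform doctrinal legal reasoning, and argues that this capability is essential to meeting the "appropriate accuracy" requirement of the EU AI Act.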

Abstract

Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the ancillary, paralegal tasks that most current legal-AI evaluations measure. This measurement gap is not only methodological but legal: the EU AI Act makes "appropriate accuracy" a binding requirement for high-risk AI used in the judicial domain, yet that requirement cannot acquire operational content without the very doctrinal-reasoning benchmark the field lacks.

2606.18168 2026-06-17 cs.SE cs.AI cross-listed

All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code

Dipayan Banik, Kowshik Chowdhury, Shazibul Islam Shamim

AI summary Studies oracle signals in agent-authored test code; finds that 80.2% of test patches lack strong oracle signals, while strong oracles are significantly and positively associated with merge likelihood (OR = 1.28).

Comments Accepted at the 8th IEEE International Conference on Artificial Intelligence Testing, 2026

Abstract

Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies report more than 932,000 agent-authored PRs across more than 116,000 repositories, yet whether their test files contain meaningful verification logic remains underexplored. Test files lacking explicit assertions execute code without verifying behavior, so quality gates based on test-file presence overestimate verification strength. The goal of this paper is to help practitioners assess the verification strength of agent-authored patches by characterizing oracle signals and their link to merge outcomes and review effort. We conduct an empirical study of 86,156 test-file patches from 33,596 agent-authored PRs across 2,807 GitHub repositories produced by five coding agents: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code. A qualitative analysis of 384 stratified patches informs a syntactic taxonomy of eight oracle signal categories. Applied at scale, 80.2% of test patches contain weak or no explicit oracle signals. While raw merge rates are lower for strong-oracle PRs, a regression analysis adjusting for agent, PR size, repository popularity, task type, and language shows strong oracles significantly improve merge likelihood (OR = 1.28, p < 0.001). Our findings suggest that test file counts substantially overestimate verification strength and that practitioners can adopt oracle-aware quality checks to more accurately evaluate agent-authored contributions.
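
The difference between a test that only executes code and one that verifies behavior can be sketched with a small syntactic check. This is an illustrative heuristic for Python test files only, not the paper's eight-category taxonomy; the function name `count_oracle_signals` and the two sample tests are hypothetical.

```python
import ast


def count_oracle_signals(test_source: str) -> int:
    """Count explicit checks: assert statements, assert*-style method calls, and .raises(...) calls."""
    count = 0
    for node in ast.walk(ast.parse(test_source)):
        if isinstance(node, ast.Assert):
            count += 1
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            name = node.func.attr
            if name.startswith("assert") or name == "raises":
                count += 1
    return count


# A "smoke" test: it runs the code under test but verifies nothing.
smoke_test = """
def test_parse_runs():
    parse("a=1")
"""

# A test with an explicit oracle on the returned value.
checked_test = """
def test_parse_value():
    result = parse("a=1")
    assert result == {"a": "1"}
"""

smoke_count = count_oracle_signals(smoke_test)
checked_count = count_oracle_signals(checked_test)
```

A quality gate that only counts test files would treat both samples the same, whereas a check of this kind separates them.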

2606.18203 2026-06-17 cs.CL cs.AI cross-listed

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

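Affiliations * Google Research; University of Illinois Chicago

AI summary Proposes RubricsTree, a framework that combines an expert-aligned hierarchical taxonomy (over 100 atomic Boolean rubrics) with context-adaptive routing for scalable, auditable, and evolving open-ended evaluation; yields relative gains of up to ~66% on HealthBench.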

Abstract

The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics, evolving from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expertise panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides a scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.
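
The routing idea described above, where only the relevant weighted Boolean rubrics are activated for a query, can be sketched as follows. Everything here is an assumption made for illustration: the `Rubric` fields, the topic-overlap router, the weighted-fraction score, and the keyword checks (which in practice would more plausibly be model-based judgments) are hypothetical and not taken from RubricsTree.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rubric:
    name: str
    topics: set[str]               # topics for which the rubric is relevant
    weight: float                  # assumed importance weight
    check: Callable[[str], bool]   # Boolean verdict on a response


def route(rubrics: list[Rubric], query_topics: set[str]) -> list[Rubric]:
    """Activate only the rubrics whose topics overlap with the query's topics."""
    return [r for r in rubrics if r.topics & query_topics]


def score(response: str, rubrics: list[Rubric], query_topics: set[str]) -> float:
    """Weighted fraction of activated rubrics that the response satisfies."""
    active = route(rubrics, query_topics)
    total = sum(r.weight for r in active)
    if total == 0:
        return 0.0
    return sum(r.weight for r in active if r.check(response)) / total


rubrics = [
    Rubric("cites_sleep_duration", {"sleep"}, 2.0, lambda r: "hours" in r),
    Rubric("suggests_clinician", {"sleep", "heart"}, 1.0, lambda r: "doctor" in r),
    Rubric("mentions_calories", {"nutrition"}, 1.0, lambda r: "calories" in r),
]

response = "You slept 6 hours on average this week."
active = route(rubrics, {"sleep"})
result = score(response, rubrics, {"sleep"})
```

Here the nutrition rubric is never evaluated for a sleep query, so the score depends only on the two activated rubrics.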

2606.18237 2026-06-17 cs.CL cs.AI cs.LG cross-listed

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang, Ameet Talwalkar

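Affiliations * School of Computer Science, Carnegie Mellon University; Datadog

AI summary Proposes ReproRepo, a framework that uses GitHub issues as supervision for reproducibility evaluation on 1,149 papers; finds that Codex with GPT-5.5 surfaces a semantically related reproduction issue for about 90% of papers.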

Abstract

Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.

2602.08939 2026-06-17 cs.AI updated version

CausalT5k: Diagnosing Refusal and Failure Modes in Trustworthy Causal Reasoning Across Causal Rungs

Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai, Matthew Wolfman, Patrick Flanagan, Edward Y. Chang

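AI summary Proposes the CTK benchmark, which diagnoses failure modes of large language models in causal reasoning across 5,147 cases annotated with causal rung, trap type, pressure sensitivity, and refusal quality, exposing weaknesses that aggregate accuracy hides.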

Comments 12 pages, 17 tables, 4 figures

Abstract

Large language models increasingly produce fluent causal explanations, yet they often fail in ways aggregate accuracy cannot diagnose: confusing association with intervention, abandoning correct judgments under pressure, over-refusing valid claims, or answering when evidence is underdetermined. We introduce CTK, a diagnostic benchmark of 5,147 cases and growing, across 10 domains and all three levels of Pearl's Ladder of Causation. Unlike benchmarks that only score correctness, CTK reveals why a model failed by annotating causal rung, trap type, pressure sensitivity, refusal quality, and Utility-Safety tradeoffs. Its Sheep/Wolf taxonomy separates valid causal designs from inferential traps; paired neutral/pressure variants measure sycophantic drift through Bad Flip Rate; and Wise Refusal fields test whether a model identifies the missing information needed before endorsing a claim. CTK exposes failure modes hidden by aggregate accuracy: the Skepticism Trap, Rung Collapse under scaling, pressure-induced drift, Detection-Correction gaps, and counterfactual error modes. Rather than prescribing a correction method, it provides the diagnostic substrate for studying causal-reasoning failure profiles.
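
One way to read the paired neutral/pressure design is sketched below. The abstract does not define Bad Flip Rate, so the definition used here is an assumption: among items answered correctly under the neutral framing, the fraction that become incorrect under the pressure framing. The function name and the sample outcomes are hypothetical.

```python
def bad_flip_rate(neutral_correct: list[bool], pressure_correct: list[bool]) -> float:
    """Assumed definition: share of initially correct items that flip to incorrect under pressure."""
    if len(neutral_correct) != len(pressure_correct):
        raise ValueError("paired lists must have the same length")
    # Only items the model got right without pressure can flip "badly".
    eligible = [p for n, p in zip(neutral_correct, pressure_correct) if n]
    if not eligible:
        return 0.0
    return sum(1 for p in eligible if not p) / len(eligible)


# Hypothetical outcomes for four paired items.
neutral = [True, True, False, True]
pressure = [True, False, False, False]
rate = bad_flip_rate(neutral, pressure)
```

Aggregate accuracy under pressure alone (1 of 4 here) would not show that two of the three initially correct judgments were abandoned, which is the kind of drift the paired variants are meant to expose.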

2604.06802 2026-06-17 cs.AI updated version

Riemann-Bench: A Benchmark for Moonshot Mathematics

Suhaas Garre, Erik Knutsen, Sushant Mehta, Edwin Chen

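AI summary Proposes Riemann-Bench, a benchmark of expert-written research-level mathematics problems that evaluates AI reasoning beyond the olympiad level; frontier models score below 10%.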

Abstract

Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and yields a unique, closed-form solution assessed by programmatic verifiers. We evaluate frontier models as unconstrained research agents, with full access to coding tools, search, and open-ended reasoning, using an unbiased statistical estimator computed over 100 independent runs per problem. Our results reveal that all frontier models currently score below 10%, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning. By keeping the benchmark fully private, we ensure that measured performance reflects authentic mathematical capability rather than memorization of training data.
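
The abstract mentions an unbiased estimator over 100 runs per problem without naming it. A common choice for this setting is the unbiased pass@k estimator, and the sketch below assumes that is what is meant; this is a guess for illustration, and the run counts in the example are hypothetical.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k runs drawn from n is correct,
    given that c of the n runs were correct. Equals 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when n - c < k, so the estimate is then exactly 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical problem: 3 correct runs out of 100.
single_run = pass_at_k(n=100, c=3, k=1)   # reduces to the plain success rate c / n
ten_runs = pass_at_k(n=100, c=3, k=10)
```

With k = 1 the estimator is just the per-run success rate, so a benchmark score below 10% corresponds to fewer than one correct run in ten on average under this reading.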

2606.09004 2026-06-17 cs.AI updated version

LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)

Ankai Hao, Ke Chen, Huan Li, Lidan Shou

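Affiliations * Zhejiang University

AI summary Proposes LATTEArena, described as the first standardized evaluation framework, with a six-dimensional taxonomy that decomposes 15 methods, a modular arena, and component ablations; reports 16 key findings, including that Tree-of-Thought combined with MCTS is the most cost-effective.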

Comments 31 pages, 9 figures

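AI-generated abstract (translated from Chinese)

Feature engineering remains essential to tabular data analysis, and large language models (LLMs) have emerged as a promising paradigm for automating it, giving rise to LLM-powered Automated Tabular Feature Engineering (LATTE). However, the lack of a standardized platform hinders fair, cost-aware comparison. In addition, complex method designs obscure the specific contribution of individual components; for example, although LFG integrates Tree-of-Thought, few-shot demonstrations, Monte Carlo Tree Search, and natural-language generation, the isolated effect of each technique remains unquantified. To address these challenges, we introduce LATTEArena, the first competitive evaluation framework, featuring: (1) a six-dimensional taxonomy that decomposes 15 representative methods into reusable components; (2) a standardized modular arena for controlled comparison; (3) multi-dimensional evaluation covering performance, cost, and robustness; and (4) component-level ablations that quantify the competitive merit of each technique. Through extensive evaluation, we reveal 16 key findings, including: (1) Tree-of-Thought with Monte Carlo Tree Search achieves the best cost-effectiveness; and (2) RPN and code output formats dominate classification and regression tasks, respectively. We publicly release the modular framework and more than 4,000 execution logs, enabling researchers to compare new techniques seamlessly against existing ones and advance LATTE.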

Abstract

Feature engineering remains a cornerstone of tabular data analysis, and Large Language Models (LLMs) have emerged as a promising paradigm for its automation, giving rise to LLM-powered Automated Tabular Feature Engineering (LATTE). However, the field lacks standardized, cost-aware evaluation platforms, and the combinatorial explosion of design choices obscures true algorithmic progress. To bridge these gaps, we systematically deconstruct 15 representative LATTE methods into a unified 6-dimensional taxonomy. Based on this abstraction, we introduce LATTEArena, a standardized, modular, and extensible benchmarking framework that decouples monolithic pipelines into reusable execution blocks. By distilling the massive combinatorial space, we evaluate 24 core LATTE configurations across 7 research questions. Our head-to-head benchmarking goes beyond predictive accuracy to quantify token efficiency and execution robustness, yielding 17 empirical findings on cost-effectiveness trade-offs. Furthermore, we provide 3 concrete recommendations for optimal real-world deployment. By enabling controlled component-level comparisons, LATTEArena shifts the paradigm from ad-hoc prompt engineering to systematic context management. All code, datasets, and over 4,000 execution logs are publicly available to foster a dynamic, community-driven benchmark. Our framework, leaderboard, and all artifacts are hosted on the LATTEArena project website at https://goodenhak.github.io/LATTEArena.

2503.07459 2026-06-17 cs.CL cs.AI updated version

MedicalAgentsBench for Complex Medical Reasoning: Comparing Internalized Reasoning Models versus Externalized Agent-based Frameworks

Yanjun Shao, Xiangru Tang, Jiwoong Sohn, Jiapeng Chen, Yuxuan Liao, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein

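AI summary Proposes MedicalAgentsBench (862 complex clinical questions) to compare internalized reasoning models with externalized agent frameworks on medical reasoning; finds that their benefits compound, with o3-mini + MDAgents the best combination (35.1% accuracy).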

Comments https://github.com/gersteinlab/MedicalAgentsBench

Abstract

Complex medical reasoning requires integrating heterogeneous clinical evidence across multiple inference steps. Large language models (LLMs) now approach this through two routes: internalized reasoning and externalized agent scaffolding (frameworks that decompose problems collaboratively amongst multiple LLMs). To determine whether these routes are exclusive or complementary, we introduce MedicalAgentsBench, a filtered benchmark of 862 complex clinical questions drawn from the union of eight medical datasets via difficulty-aware curation and contamination screening. Evaluating three internalized reasoning models (DeepSeek-R1, o1-mini, and o3-mini), seven base models, and nine externalized agent-based methods, we find that internalized and externalized approaches each independently improve performance, and that their benefits compound: the highest accuracy is achieved by layering agent workflows onto an internalized reasoning model (i.e., o3-mini + MDAgents with 35.1%). Pareto analysis shows this combination dominates the cost-performance frontier; moreover, lightweight optimization on inexpensive models offers an entry point for resource-constrained settings. Our benchmark is at https://github.com/gersteinlab/MedicalAgentsBench.
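
The cost-performance frontier mentioned above can be computed with a simple dominance check, sketched below. The configuration names, costs, and accuracies are placeholder values invented for illustration and are not results from the paper; only the notion of Pareto dominance is standard.

```python
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """Return names of configurations not dominated in (cost, accuracy).

    A point is dominated if another point is no more expensive and no less
    accurate, and strictly better on at least one of the two.
    """
    frontier = []
    for name, cost, acc in points:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder (name, cost per question, accuracy %) triples.
configs = [
    ("config_a", 1.0, 20.0),
    ("config_b", 2.0, 30.0),
    ("config_c", 3.0, 25.0),   # costs more than config_b and is less accurate
    ("config_d", 5.0, 35.0),
]
frontier = pareto_frontier(configs)
```

A configuration "dominates the frontier" in the abstract's sense when no cheaper or equally cheap option matches its accuracy.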

2507.18623 2026-06-17 cs.LG cs.AI cs.MA updated version

Moving Out: Physically-grounded Human-AI Collaboration

Xuhui Kang, Sung-Wook Lee, Haolin Liu, Yuyan Wang, Yen-Ling Kuo

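AI summary Proposes the Moving Out benchmark, which simulates collaboration under physical constraints, and develops BASS to increase agent diversity and understanding of action outcomes; experiments show it collaborates effectively with both unseen AI agents and humans.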

Comments Accepted at ICML 2026

Abstract

The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. However, most existing collaboration benchmarks are discrete or do not consider physical attributes and constraints. To address this, we introduce Moving Out, a human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and coordinating actions to move an item around a corner. Moving Out consists of two challenges and human-human interaction data to comprehensively evaluate models' abilities to adapt to diverse human behaviors and unseen physical attributes. To give embodied agents the capability to collaborate with humans under physical attributes and constraints, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. We systematically compare BASS and state-of-the-art models in AI-AI and human-AI experiments, showing that BASS can effectively collaborate with both unseen AI and humans. The project page is available at https://live-robotics-uva.github.io/movingout_ai/.

2511.01650 2026-06-17 cs.CL cs.AI cs.LG updated version

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Fan Zhang, Veselin Stoyanov, Preslav Nakov, Zhuohan Xie

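AI summary Proposes EngTrace, a symbolic benchmark of 1,350 parameterized test cases with a verifiable two-stage evaluation framework (tiered protocol plus AI Tribunal) that checks intermediate reasoning traces as well as final answers; reveals a trade-off between numeric precision and trace fidelity.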

Comments 33 pages, includes figures and tables; introduces the EngTrace benchmark

Abstract

Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills, failing to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark built on 90 parameterized templates, each generating unique, contamination-resistant problem instances, spanning three major engineering branches, nine core domains, and 20 distinct areas, yielding 1,350 test cases that stress-test generalization across diverse physical scenarios. Moving beyond outcome matching, we introduce a verifiable two-stage evaluation framework that uses a tiered protocol to validate intermediate reasoning traces alongside final answers through automated procedural checks and a heterogeneous AI Tribunal. Our evaluation of 27 leading LLMs reveals a distinct trade-off between numeric precision and trace fidelity, identifying a complexity cliff where abstract mathematical pre-training fails to translate into the integrative reasoning required for advanced engineering tasks.
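
A parameterized template of the kind described above can be sketched as a function that draws fresh parameters and computes the ground-truth answer symbolically. The Ohm's-law template, its parameter ranges, and the `Instance` type are hypothetical and not taken from EngTrace. The count of 15 instances per template is only the even split implied by 1,350 cases over 90 templates, which the abstract does not state explicitly.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    question: str
    params: dict
    answer: float


def ohms_law_template(seed: int) -> Instance:
    """Hypothetical template: sample parameters, then derive the answer from V = I * R."""
    rng = random.Random(seed)  # seeded so every instance is reproducible
    current = round(rng.uniform(0.1, 5.0), 2)        # amperes
    resistance = round(rng.uniform(10.0, 500.0), 1)  # ohms
    question = (
        f"A resistor of {resistance} ohm carries a current of {current} A. "
        "What is the voltage across it in volts?"
    )
    return Instance(question, {"current": current, "resistance": resistance},
                    current * resistance)


# Assumed even split: 15 instances from one template.
instances = [ohms_law_template(seed) for seed in range(15)]
```

Because the numbers are regenerated per instance while the answer is derived from the formula, memorizing a published question and answer pair does not help on a fresh instance, which is the sense in which such instances resist contamination.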

2512.01241 2026-06-17 cs.CY cs.AI updated version

First, do NOHARM: towards clinically safe large language models

David Wu, Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran, Matthew Gwiazdon, Arjun Rustagi, Jenelle Jindal, Jacob M. Koshy, Vinay Kadiyala, Anup Agarwal, Bassman Tappuni, Brianna French, Sirus Jesudasen, Christopher V. Cosgriff, Rebanta Chakraborty, Jillian Caldwell, Susan Ziolkowski, David J. Iberri, Robert Diep, Rahul S. Dalal, Kira L. Newman, Kristin Galetta, J. Carl Pallais, Nancy Wei, Kathleen M. Buchheit, David I. Hong, Vartan Pahalyants, Ernest Y. Lee, Allen Shih, Tamara B. Kaplan, Vishnu Ravi, Sarita Khemani, Thomas A. Buckley, April S. Liang, Daniel Shirvani, Advait Patil, Nicholas Marshall, Kanav Chopra, Joel Koh, Adi Badhwar, Anastasia Perez, Austin J. Schoeffler, Mahbuba Tusty, Chase M. Walton, Liam G. McCoy, David J. H. Wu, Yingjie Weng, Sumant Ranji, Kevin Schulman, Nigam H. Shah, Jason Hom, Arnold Milstein, Arjun K. Manrai, Adam Rodman, Jonathan H. Chen, Ethan Goh

Affiliations * Harvard Combined Dermatology Program; Department of Dermatology, Mass General Brigham; Harvard Medical School; Stanford Center for Biomedical Informatics Research; Stanford University; Division of Hospital Medicine, Department of Medicine, Stanford University School of Medicine; Department of Medicine, Cambridge Health Alliance; Beth Israel Deaconess Hospital–Plymouth; Department of Medicine, University of California, San Francisco; Department of Neurology, Stanford University School of Medicine; Department of Medicine, Beth Israel Deaconess Medical Center; Division of Cardiology, Department of Medicine, Cambridge Health Alliance; Department of Cardiovascular Medicine, Summa Health System; Division of Allergy, Pulmonary, and Critical Care Medicine, Department of Medicine, University of Wisconsin-Madison; Division of Pulmonary and Critical Care Medicine, Department of Medicine, Massachusetts General Hospital; Center for Immunology and Inflammatory Diseases, Department of Medicine, Massachusetts General Hospital; Broad Institute of MIT and Harvard; Division of Pulmonary, Critical Care, and Sleep Medicine, Cambridge Health Alliance

AI summary Proposes NOHARM, a benchmark of 1,100 primary care-to-specialist consultation cases, to evaluate the safety of medical recommendations from 28 LLMs; finds potential for severe harm in up to 22.6% of cases, with errors of omission accounting for more than 80% of severe errors.

Abstract

Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a 1,100-task benchmark of primary care-to-specialist consultation cases to measure the frequency and severity of harm from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 28 LLMs, recommendations carried the potential for severe harm in up to 22.6% of cases, with errors of omission accounting for more than 80% of severe errors. In a randomized trial of 101 generalist physicians, human benchmark performance significantly improved with AI assistance, yet physicians remained far from realizing the potential of AI tools, frequently ignoring essential advice surfaced by AI. Safety performance tracked general-intelligence and medical-knowledge benchmarks across the full range of models but decoupled at the frontier. Despite strong performance on existing evaluations, widely used AI models can produce medical advice with the potential for severe harm at non-trivial rates, highlighting the importance of explicit measurement of clinical safety.

2601.19099 2026-06-17 cs.CV cs.AI 版本更新

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

m2sv: 地图到街景空间推理的可扩展基准

Yosub Shin, Michael Buriek, Igor Molybog

AI总结 提出m2sv基准,通过匹配朝北俯视图与街景图像推断相机方向,评估VLM空间推理能力;最佳模型准确率65.2%,低于人类72.0%,揭示几何对齐与推理一致性的差距。

详情
AI中文摘要

视觉-语言模型(VLM)在许多多模态基准上表现强劲,但在需要将抽象俯视图表示与自我中心视图对齐的空间推理任务上仍然脆弱。我们引入m2sv,一个用于地图到街景空间推理的可扩展基准,要求模型通过将朝北俯视图与在同一真实世界交叉口拍摄的街景图像对齐来推断相机视角方向。我们发布了m2sv-20k,一个具有受控歧义的地理多样化基准,以及m2sv-sft-11k,一个用于监督微调的精选结构化推理轨迹集。尽管在现有多模态基准上表现强劲,但最佳评估的VLM在m2sv上仅达到65.2%的准确率,低于人类标注者的平均72.0%(专家可达95%),且标注者间一致性高($\kappa$高达0.76)。虽然监督微调和强化学习带来持续改进,但跨基准评估显示迁移有限。除了总体准确率,我们使用结构信号和人工努力系统分析了地图到街景推理的难度,并对适应的开放模型进行了广泛的失败分析。我们的发现凸显了几何对齐、证据聚合和推理一致性方面的持续差距,为跨视角的接地空间推理的未来工作提供了动力。

英文摘要

Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, below human annotators who reach 72.0% on average (and 95% for an expert) with strong inter-annotator agreement ($\kappa$ up to 0.76). While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.

2602.03300 2026-06-17 cs.LG cs.AI cs.CL cs.CV 版本更新

R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

R1-SyntheticVL:生成模型的合成数据是否已为多模态大语言模型做好准备?

Jingyi Zhang, Tianyi Lin, Huanjin Yao, Xiang Lan, Shunyu Liu, Jiaxing Huang

AI总结 提出集体对抗数据合成(CADS)方法,通过集体智能和对抗学习自动生成高质量、多样且具有挑战性的多模态数据,用于增强多模态大语言模型(MLLM)在复杂现实任务中的性能。

Comments ICML 2026 Camera Ready

详情
AI中文摘要

在这项工作中,我们旨在开发有效的数据合成技术,自主合成多模态训练数据,以增强MLLM解决复杂现实任务的能力。为此,我们提出了集体对抗数据合成(CADS),这是一种新颖且通用的方法,用于合成高质量、多样且具有挑战性的多模态数据。CADS的核心思想是利用集体智能确保高质量和多样化的生成,同时探索对抗学习以合成具有挑战性的样本,从而有效驱动模型改进。具体来说,CADS包含两个循环阶段:集体对抗数据生成(CAD-Generate)和集体对抗数据判断(CAD-Judge)。CAD-Generate利用集体知识共同生成新的多样化多模态数据,而CAD-Judge则协作评估合成数据的质量。此外,CADS引入了一种对抗上下文优化机制,以优化生成上下文,鼓励生成具有挑战性和高价值的数据。通过CADS,我们构建了MMSynthetic-20K并训练了我们的模型R1-SyntheticVL,该模型在多个基准测试中表现出优越的性能。

英文摘要

In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

2603.26292 2026-06-17 cs.CL cs.AI 版本更新

findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

findsylls: 一种语言无关的音节级语音分词与嵌入工具包

Héctor Javier Vázquez Martínez

AI总结 提出语言无关的模块化工具包findsylls,统一经典音节检测器和端到端音节切分器,支持音节分割、嵌入提取和多粒度评估,在英语、西班牙语及低资源语言Kono上验证了跨语言可重复实验能力。

Comments 4 pages + 2 for references, disclosures & acknowledgements; to appear in Interspeech 2026; DOI to cite findsylls library: https://doi.org/10.5281/zenodo.20707804

详情
AI中文摘要

音节级单元为口语语言建模和无监督词汇发现提供了紧凑且具有语言意义的表示,但关于音节化的研究仍然分散在不同的实现、数据集和评估协议中。我们介绍了findsylls,一个模块化的、语言无关的工具包,它将经典的音节检测器和端到端音节切分器统一在一个通用接口下,用于音节分割、嵌入提取和多粒度评估。该工具包实现并标准化了广泛使用的方法(例如,Sylber、VG-HuBERT),并允许重新组合其组件,从而实现对表示、算法和令牌率的受控比较。我们在英语和西班牙语语料库以及来自Kono(一种未被充分记录的中部曼德语)的新手工标注数据上演示了findsylls,展示了单一框架如何支持在资源丰富和资源不足的环境中均可重复的音节级实验。

英文摘要

Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.

2603.26592 2026-06-17 cs.LG cs.AI cs.HC 版本更新

Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation

评估交互式二维可视化作为生物医学时间序列数据标注的样本选择策略

Einari Vaaras, Manu Airaksinen, Okko Räsänen

AI总结 针对生物医学时间序列标注困难,比较随机采样、最远优先遍历和基于交互式2D可视化(2DV)的三种样本选择方法,在婴儿运动评估和语音情感识别任务中,2DV在聚合标签时表现最佳,但个体标注者间标签分布差异大,随机采样最安全。

Comments Accepted for publication in Computers in Biology and Medicine (Elsevier)

详情
AI中文摘要

生物医学领域中可靠的机器学习模型依赖于准确的标签,然而标注生物医学时间序列数据仍然具有挑战性。算法样本选择可能支持标注,但涉及真实人类标注者的研究证据很少。因此,我们比较了三种用于标注的样本选择方法:随机采样(RND)、最远优先遍历(FAFT)和一种基于图形用户界面的方法,该方法能够探索高维数据的互补二维可视化(2DV)。我们在婴儿运动评估(IMA)和语音情感识别(SER)的四个分类任务中评估了这些方法。十二名标注者,分为专家和非专家,在有限的标注预算下进行数据标注,并进行了标注后实验以评估采样方法。在所有分类任务中,当聚合标注者的标签时,2DV表现最佳。在IMA中,2DV最有效地捕获了稀有类别,但也表现出由于有限的标注预算导致的标注者间标签分布变异性增大,当模型在个体标注者的标签上训练时,分类性能下降;在这些情况下,FAFT表现出色。对于SER,2DV在专家标注者中优于其他方法,并在个体标注者设置中与非专家标注者的性能相当。失败风险分析显示,当标注者数量或标注者专业知识不确定时,RND是最安全的选择,而2DV由于标签分布变异性更大而具有最高风险。此外,实验后访谈表明,2DV使标注任务更有趣和愉快。总体而言,基于2DV的采样对于生物医学时间序列数据标注似乎很有前景,特别是在标注预算不是非常紧张的情况下。

英文摘要

Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators' labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.
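Of the three selection methods compared above, farthest-first traversal (FAFT) is a standard greedy procedure that can be sketched in a few lines. The version below is a minimal illustration, assuming Euclidean distance over feature vectors and an arbitrary starting index; the function name `farthest_first`, the distance metric, and the toy points are assumptions for illustration and are not taken from the paper.

```python
import math

def farthest_first(points, k, start=0):
    """Greedy farthest-first traversal: return the indices of k selected points."""
    if not points or k <= 0:
        return []
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < min(k, len(points)):
        # Pick the point that is farthest from everything selected so far.
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected

# Toy 2D feature vectors (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (0.0, 3.0), (5.0, 4.0)]
picked = farthest_first(points, k=3)
print(picked)
```

Each step spends the annotation budget on the sample least like anything already chosen, which spreads labels across the feature space; the near-duplicate at index 1 is selected last.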

2605.23243 2026-06-17 cs.CR cs.AI 版本更新

Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

前沿大语言模型是否已为网络安全做好准备?来自双模式漏洞基准测试的垂直基础模型证据

Vivek Dahiya, Sunny Nehra, Vipul Dholariya, Bhavik Shangari, Chandra Khatri

发表机构 * super-intel.ai(超级智能人工智能公司)

AI总结 通过白盒函数级漏洞检测和黑盒Web应用安全测试双模式基准测试,评估前沿大语言模型在网络安全任务中的表现,发现其存在高误报率、低覆盖率等问题,而领域专用模型通过结构化方法显著提升性能。

详情
AI中文摘要

我们通过双模式基准测试评估前沿大语言模型是否已为网络安全做好准备:白盒函数级漏洞检测(VulnLLM-R,涵盖C/Java/Python)和黑盒Web应用安全测试(五个生产风格应用,包含118个真实漏洞,涉及20多个CWE家族,我们将开源)。我们测试了六个前沿模型(GPT-5.4、Codex 5.3、Claude Opus 4.6、Sonnet 4.6、Gemini 3.1 Pro和Gemini 3 Flash)以及两个领域专用模型,涵盖四种测试范式。我们的发现令人警醒:(1)每个前沿模型在白盒检测中产生10-50%的误报率,系统性地过度预测漏洞;(2)在黑盒测试中,前沿模型仅达到4-8%的真实漏洞覆盖率,即使借助外部安全工具(Playwright MCP、Burp Suite MCP)也仅提升至10-19%;(3)领域专用智能体中编码的结构化渗透测试方法将每个家族的检测率提升至50%以上,表明方法论而非规模是主要杠杆;(4)一个领域专用防御模型在单个GPU上实现了所有模型中最高的精确率(0.904)和最低的误报率(9.7%)。我们指出缺乏结构化安全测试痕迹(端到端请求/响应序列、失败密集型数据、多步攻击链)是根本的训练数据瓶颈,并提出自博弈安全测试作为数据生成策略。我们的结果为专门构建用于网络安全的垂直基础模型提供了依据。

英文摘要

We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.7, Sonnet 4.6, Gemini 3.1 Pro, and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.

2606.14295 2026-06-17 cs.CR cs.AI cs.LG 版本更新

AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges

AgentCyberRange:在真实网络靶场中基准测试前沿AI系统

Fengyu Liu, Jiarun Dai, Yihe Fan, Wuyuao Mai, Ziao Li, Bofei Chen, Jie Zhang, Zheng Lou, Bocheng Xiang, Qiyi Zhang, Xudong Pan, Geng Hong, Yuan Zhang, Min Yang

发表机构 * Fudan University(复旦大学)

AI总结 提出首个开源多靶场基础设施AgentCyberRange,集成110个漏洞和156个内部主机,评估前沿AI系统在真实网络攻击中的能力,发现GPT-5.5+Codex在web利用和后利用任务中表现最佳。

详情
AI中文摘要

前沿AI系统在网络安全任务中能力日益增强,包括代码库检查、漏洞检测和利用。然而,评估其攻击能力仍受限于缺乏开放、可复现、多主机的网络靶场。现有公开基准测试捕获了CTF解题、漏洞复现和利用生成等孤立技能,但通常忽略了真实的入侵工作流:发现暴露服务、获得立足点、收集内部信息以及跨主机扩大入侵范围。这一差距使得早期观察新兴风险变得困难,因为前沿AI系统很少在真实攻击条件下进行评估。我们引入了AgentCyberRange,这是首个用于在真实网络靶场中衡量自主网络攻击能力的开源多靶场基础设施。它整合了15个真实Web应用和8个企业级网络靶场中的110个漏洞,以及156个内部主机,并提供了Cage工具链用于执行、编排、结果收集和验证。该基准测试涵盖两个核心阶段:Web利用(代理探索暴露的应用并验证漏洞)和后利用(代理将初始立足点转化为更广泛的内部入侵)。我们在匹配的提示和预算下评估了六个前沿AI系统。GPT-5.5与Codex表现最佳,解决了16.1%的Web利用任务和31.7%的后利用任务;在更具体的提示下,这些比率分别提高到33.0%和46.3%。我们还观察到基准测试之外的发现,包括流行项目中的未知漏洞,以及绕过主机防御的有效载荷变异。这些结果表明,开放的网络靶场评估对于在真实且可复现的条件下观察新兴攻击能力是必要的。

英文摘要

Frontier AI systems are increasingly capable of cybersecurity tasks, including codebase inspection, vulnerability detection, and exploitation. However, evaluating their offensive capabilities remains constrained by limited access to open, reproducible, multi-host cyber ranges. Existing public benchmarks capture isolated skills such as CTF solving, vulnerability reproduction, and exploit generation, but often abstract away realistic intrusion workflows: discovering exposed services, gaining a foothold, collecting internal information, and expanding compromise across hosts. This gap makes it difficult to observe emerging risks early, because frontier AI systems are rarely evaluated under realistic attack conditions. We introduce AgentCyberRange, the first open, multi-range infrastructure for measuring autonomous cyber attack capability in realistic cyber ranges. It combines 110 vulnerabilities across 15 real web applications and 8 enterprise-like cyber ranges with 156 internal hosts, plus Cage, a toolchain for execution, orchestration, result collection, and verification. The benchmark covers two core stages: web exploitation, where agents explore exposed applications and validate vulnerabilities, and post-exploitation, where agents turn an initial foothold into broader internal compromise. We evaluate six frontier AI systems under matched prompts and budgets. GPT-5.5 with Codex performs best, solving 16.1% of web exploitation tasks and 31.7% of post-exploitation tasks; with more concrete hints, these rates increase to 33.0% and 46.3%. We also observe out-of-benchmark findings, including unknown vulnerabilities in popular projects and payload mutation that bypasses host defenses. These results show that open cyber-range evaluation is necessary for observing emerging offensive capabilities under realistic and reproducible conditions.

2606.15735 2026-06-17 cs.CL cs.AI 版本更新

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

EHRNote-ChatQA:一个面向纵向出院总结的基于证据的多轮临床问答基准

Jiyoun Kim, Muhan Yeo, Eunhye Jang, Jeewon Yang, Hangyul Yoon, Su Ji Lee, Hee Jo Han, Hee-Jae Jung, Doyun Kwon, Jun young Lee, Jaehun Lee, Jung-Oh Lee, Sunjun Kweon, Jong Hak Moon, Daseul Kim, Minjae Cho, Edward Choi

发表机构 * KAIST(韩国科学技术院) Seoul National University(首尔大学) Seoul National University Bundang Hospital(首尔大学盆唐医院) SAIHST, Sungkyunkwan University(成均馆大学) Yonsei University College of Medicine(延世大学医学院) Gangnam Severance Hospital(江南塞弗伦斯医院) Severance Hospital(塞弗伦斯医院) Seoul Medical Center(首尔医疗中心) Seoul National University Hospital(首尔大学医院) National Cancer Center(国立癌症中心) Icahn School of Medicine at Mount Sinai(西奈山伊坎医学院) Samsung Medical Center(三星医疗中心)

AI总结 提出EHRNote-ChatQA基准,基于MIMIC-IV出院总结构建,包含967个多轮样本和16072个专家验证的QA对,评估LLM在证据支持下的多轮临床问答能力,发现模型在证据定位和多轮错误累积方面存在挑战。

详情
AI中文摘要

出院总结是关键的临床文档,包含患者整个住院期间的背景信息,医疗专家在患者再入院、持续护理和诊断决策中会常规审阅这些文档。在审阅时,医疗专家通常必须迭代地综合多个总结中的信息,同时验证支持每个答案的证据。尽管大型语言模型(LLM)在临床问答中的应用日益增多,但现有基准未能充分反映这一场景:它们通常评估考试式的医学知识,或侧重于单轮问答且证据定位评估有限。我们引入了EHRNote-ChatQA,这是首个针对患者多个出院总结的基于证据的多轮临床问答基准。该基准基于去标识化的MIMIC-IV出院总结构建,包含967个患者级多轮样本,涵盖1到5份笔记,以及16072个经医学专家验证的QA对(8036个内容问题,每个配对有一个证据定位问题),覆盖八个临床类别。基准通过专家指导的流程构建,结合出院总结结构化模式、专家策划的多轮QA模板和基于LLM的生成,随后由11位医学专家对每个QA样本进行审查和修订。对22个开源和闭源LLM的基准测试揭示了若干挑战,包括LLM在证据定位方面比内容回答更困难、多轮错误随轮次累积,以及单轮临床QA性能无法可靠迁移到该场景。这些发现确立了EHRNote-ChatQA作为评估临床QA系统的严格且实用的基准。该数据集将通过PhysioNet凭证访问公开发布。

英文摘要

Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making. When reviewing them, medical experts often must iteratively synthesize information across multiple summaries while verifying the evidence supporting each answer. Although large language models (LLMs) are increasingly explored for clinical question answering, existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn question answering with limited evidence-grounding evaluation. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over patients' multiple discharge summaries. Built from de-identified MIMIC-IV discharge summaries, EHRNote-ChatQA contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. The benchmark is constructed through an expert-informed pipeline combining a discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges: LLMs struggle more with evidence grounding than with content answering, errors compound across turns, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.

2606.16072 2026-06-17 cs.CR cs.AI 版本更新

MASCOT-Android: A Curated Dataset and Automated Collection Pipeline for Android Malware Source Code Specimens

MASCOT-Android: 一个用于安卓恶意软件源代码样本的精选数据集与自动收集管道

Bojing Li, Duo Zhong, Prajna Bhandary, Raguvir S, Charles Maxa, Robert J Joyce, Charles Nicholas

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校)

AI总结 提出MASCOT-Android数据集和自动收集框架,利用仓库级文档(README)训练LinearSVC分类器,以96.28%准确率和1.06%假阳性率从GitHub发现恶意软件源代码。

详情
AI中文摘要

与二进制文件和反编译代码相比,恶意软件源代码更直接地反映了攻击者的原始意图。然而,源代码的稀缺性和人工审查的高成本使得此类数据集难以构建和维护。我们提出了MASCOT-Android,一个精选的安卓恶意软件源代码数据集,以及一个用于在GitHub上可扩展地发现恶意软件源代码的自动收集框架。我们工作的一个关键发现是,仅仓库级文档就为恶意软件源代码收集提供了强信号。我们的模型从8,772个恶意软件和25,747个良性README文档中提取字符级TF-IDF特征,并训练一个LinearSVC分类器来区分恶意软件仓库。这个仅使用README的模型在本地评估中达到了96.28%的准确率和1.06%的假阳性率。此外,模型输出置信度分数,允许用户调整决策阈值以平衡假阳性率和覆盖率,这在现实世界的恶意软件源代码收集中是实用的。

英文摘要

Compared with binaries and decompiled code, malware source code more directly reflects the attackers' original intent. However, the scarcity of source code and the high cost of manual review make such datasets difficult to build and maintain. We propose MASCOT-Android, a curated dataset of Android malware source code and an automated collection framework for scalable malware source code discovery on GitHub. A key finding of our work is that repository-level documentation alone provides a strong signal for malware source code collection. Our model extracts character-level TF-IDF features from 8,772 malware and 25,747 benign README documents and trains a LinearSVC classifier to distinguish malware repositories. This README-only model achieves an accuracy of 96.28% and an FPR of 1.06% in local evaluation. In addition, the model outputs confidence scores, allowing users to adjust the decision threshold to balance FPR and coverage, which is practical in real-world malware source code collection.
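The threshold adjustment mentioned at the end of the abstract can be illustrated with a small sketch. It assumes that "coverage" means recall on malware repositories and that a repository is flagged when its confidence score is at or above the threshold; both readings, the function name `rates_at_threshold`, and the toy scores are assumptions for illustration, not details from the paper.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false_positive_rate, recall) when flagging scores >= threshold.

    labels: 1 = malware repository, 0 = benign repository.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    fpr = fp / negatives if negatives else 0.0
    recall = tp / positives if positives else 0.0
    return fpr, recall

# Toy confidence scores (made up for illustration, not from the paper).
scores = [2.1, 1.4, 0.3, -0.2, 0.8, -0.5, -1.2, -1.9]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for t in (-1.0, 0.0, 1.0):
    fpr, recall = rates_at_threshold(scores, labels, t)
    print(f"threshold={t:+.1f}  FPR={fpr:.2f}  recall={recall:.2f}")
```

Raising the threshold lowers the false positive rate at the cost of missing more malware repositories, which is the trade-off a collection pipeline would tune.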

10. AI应用与系统 65 篇

2606.17405 2026-06-17 cs.AI 新提交

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

基于数字孪生模拟的治疗响应优化临床决策支持AI系统

Xinyu Qin, Anil K. Sood, Ruiheng Yu, Sara Corvigno, Elaine Stur, Lu Wang

发表机构 * The Cancer Genome Atlas (TCGA)(癌症基因组图谱(TCGA))

AI总结 提出在线自适应框架,结合治疗效果估计、患者数字孪生和强化学习,在安全约束下实时优化治疗推荐,经合成和真实临床数据验证有效且稳定。

Comments Accepted for presentation at the IEEE Engineering in Medicine and Biology Conference (EMBC) 2026

详情
AI中文摘要

临床决策支持AI系统必须适应实时变化的患者状况,同时遵守严格的安全约束。我们提出了一个在线自适应框架,整合了治疗效果估计以量化临床获益、患者数字孪生以模拟治疗轨迹,以及强化学习用于序贯决策。AI系统最初在历史医疗记录上训练,并在持续学习循环中运行。为确保安全,一个基于规则的模块监测生命体征并阻止禁忌治疗。内部模型强烈不一致的案例被标记以供临床医生审查,在我们的实验中通过预训练的结果模型模拟。我们使用合成临床模拟器和来自癌症基因组图谱的真实卵巢癌数据集验证了我们的框架。在模拟和临床环境中,我们的方法在推荐治疗方面比标准计算基线表现出更优越的有效性和稳定性。此外,AI系统保持低延迟,并且在我们实验验证中仅需对少数案例进行专家咨询,展示了其作为临床医生监督下的个性化医疗安全工具的潜力,通过实际使用持续改进。

英文摘要

Clinical decision support AI systems (CDSASs) must adapt to evolving patient conditions in real-time while adhering to strict safety constraints. We present an online adaptive framework that integrates Treatment Effect (TE) estimation to quantify clinical benefits, a patient Digital Twin (DT) to simulate treatment trajectories, and Reinforcement Learning (RL) for sequential decision-making. The AI system is initially trained on historical medical records and operates in a continuous learning loop. To ensure safety, a rule-based module monitors vital signs and blocks contraindicated treatments. Cases with strong internal model disagreement are flagged for clinician review, simulated in our experiments via a pre-trained outcome model. We validate our framework using both a synthetic clinical simulator and a real-world ovarian cancer dataset from The Cancer Genome Atlas (TCGA). In both simulated and clinical settings, our method demonstrated superior effectiveness and stability in recommending treatments compared to standard computational baselines. Furthermore, the AI system maintains low latency and requires expert consultation for only a minority of cases in our experimental validation, demonstrating its potential as a safe, clinician-supervised tool for personalized medicine that continuously improves through practical use.

2606.17450 2026-06-17 cs.AI 新提交

A Machine-Learned Comorbidity Index

机器学习共病指数

Suleman Baloch, Kishlay Jha, Alberto M. Segre, Philip M. Polgreen, Bijaya Adhikari

发表机构 * Department of Electrical and Computer Engineering, University of Iowa, Iowa, USA(电气与计算机工程系,爱荷华大学,爱荷华,美国) Department of Computer Science, University of Iowa, Iowa, USA(计算机科学系,爱荷华大学,爱荷华,美国) Department of Internal Medicine, University of Iowa, Iowa, USA(内科学系,爱荷华大学,爱荷华,美国)

AI总结 提出一种机器学习共病指数(MLCI),通过最大化学习分数与多个临床结果之间的归一化希尔伯特-施密特独立性准则(nHSIC)来映射诊断代码为单一标量,捕获非线性风险-结果依赖,并在多个EHR数据集上优于基线方法。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea. 35 pages

详情
AI中文摘要

传统的共病评分(如Charlson和Elixhauser)广泛用于风险调整和患者分层,但它们有两个关键局限性:(i)它们主要围绕死亡率,不能很好地与其他临床结果对齐;(ii)它们的线性、基于规则的结构无法捕捉非线性、结果特定的风险关系。我们提出了一种机器学习共病指数(MLCI),通过最大化学习分数与多个临床结果之间的归一化希尔伯特-施密特独立性准则(nHSIC),将诊断代码映射到单个标量。MLCI捕捉非线性风险-结果依赖,并有一个理论支持,该理论描述了何时可以在不同结果上实现统一的、信息丰富的入院级排序。在多个基准电子健康记录(EHR)数据集上的实证结果表明,MLCI在多个评估指标上优于强基线方法。

英文摘要

Traditional comorbidity scores (e.g., Charlson and Elixhauser) are widely used for risk adjustment and patient stratification, but they have two key limitations: (i) they are largely mortality-centric and do not align well with other clinical outcomes, and (ii) their linear, rule-based structure cannot capture nonlinear, outcome-specific risk relationships. We propose a Machine-Learned Comorbidity Index (MLCI) that maps diagnosis codes to a single scalar by maximizing the normalized Hilbert-Schmidt Independence Criterion (nHSIC) between the learned score and multiple clinical outcomes. MLCI captures nonlinear risk-outcome dependence and is supported by a theory that characterizes when a unified, informative admission-level ordering can be achieved across outcomes. Empirical results on multiple benchmark electronic health record (EHR) datasets show that MLCI outperforms strong baselines across multiple evaluation metrics.
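For readers unfamiliar with the criterion, the standard empirical HSIC and one common normalization are written out below. This is a sketch under assumptions: $K$ and $L$ are kernel matrices over the $n$ learned scores $s_i$ and the outcomes $y_i$, $H$ is the centering matrix, and the normalization shown is the usual kernel-alignment form; the abstract does not say which kernels or which exact normalization MLCI uses, nor how the multiple outcomes are combined.

```latex
\widehat{\mathrm{HSIC}}(S, Y) = \frac{1}{(n-1)^2}\,\operatorname{tr}(K H L H),
\qquad H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top},
\qquad K_{ij} = k(s_i, s_j),\quad L_{ij} = \ell(y_i, y_j)

\mathrm{nHSIC}(S, Y) =
\frac{\widehat{\mathrm{HSIC}}(S, Y)}
     {\sqrt{\widehat{\mathrm{HSIC}}(S, S)\;\widehat{\mathrm{HSIC}}(Y, Y)}}
```

With this normalization the value lies in $[0, 1]$ and does not depend on the scale of either kernel, which makes scores comparable across outcomes measured in different units.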

2606.17507 2026-06-17 cs.AI cs.SE 新提交

LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline

教育中的LLM作为评判者:基于课程标准的评分流水线

Xiwei Xu, Chen Wang, Jacky Jiang, Phil Yang, Qian Fu, Mohan Dhall, Wenjie Zhang, Liming Zhu

发表机构 * NSW Department of Education(新南威尔士州教育部) South Australian Department for Education(南澳大利亚州教育部) OC Selective exam preparation platform(OC精英考试备考平台) Studitory: HSC preparation platform(Studitory: HSC备考平台)

AI总结 提出一种基于课程标准的可配置LLM评判流水线,用于高利害考试评分,通过整合授权课程工件和评分指南,提高评分一致性、透明度和与官方实践的契合度。

详情
AI中文摘要

生成式AI和大语言模型(LLM)越来越多地应用于题目生成和自动评估。然而,在备考高风险考试中部署LLM需要的不仅仅是提示工程,还需要软件流水线,系统地将模型输出锚定在授权课程工件和教育当局发布的评分指南上。本文提出了一种基于课程标准的、可配置的LLM-as-Judge流水线,用于题目级评分,与工业合作伙伴共同开发,以支持大学入学考试准备。该流水线识别问题的相关主题、子主题和认知需求,并组装可验证和授权的上下文以支持LLM判断。课程意图通过具体的课程大纲工件(包括规定的动词和结果、表现等级描述符、术语表定义和评分指南原则)来操作化。采用分阶段LLM工作流,首先生成特定题目的评分标准,捕获结构化的表现期望,然后推导和评估用于分配学生回答分数的评分标准。这种设计提高了与官方评分实践的一致性、透明度和对齐度。初步评估表明,所提出的LLM-as-Judge流水线提供的评分结果与人类导师相当,同时产生的理由更可追溯到授权课程工件和评分标准。该流水线已集成到在线学习平台中,早期部署数据提供了操作使用和手动覆盖的初步见解。

英文摘要

Generative AI and large language models (LLMs) are increasingly applied to question generation and automated assessment. However, deploying LLMs in preparation for high-stakes exams requires more than prompt engineering; it demands software pipelines that systematically ground model outputs in authorised curriculum artefacts and marking guidelines issued by education authorities. This paper presents a curriculum-grounded, configurable LLM-as-Judge pipeline for question-level marking, co-developed with an industrial partner, to support exam preparation for university admission. The pipeline identifies the relevant topics, subtopics, and cognitive demand of a question, and assembles verifiable and authorised context to support LLM judgement. Curriculum intent is operationalised through concrete syllabus artefacts, including prescribed verbs and outcomes, performance band descriptors, glossary definitions, and marking-guideline principles. A staged LLM workflow is employed to first generate question-specific rubrics, capturing structured expectations of performance, and then derive and evaluate marking criteria used to allocate marks to student responses. This design improves consistency, transparency, and alignment with official marking practices. Preliminary evaluation shows that the proposed LLM-as-Judge pipeline delivers marking outcomes comparable to human tutors, while yielding justifications that are more traceable to authorised curriculum artefacts and marking standards. The pipeline has also been integrated into an online study platform, where early deployment data provide initial insights into operational usage and manual overrides.

2606.17577 2026-06-17 cs.AI 新提交

Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow

基于基础模型编排工作流的代理辅助行人保护设计

Osamu Ito, Akihiko Katagiri, Yoshikazu Nakagawa, Shin Saeki, Jun Shiraishi, Masato Sasaki

发表机构 * Honda Motor Co., Ltd.(本田汽车有限公司)

AI总结 提出首个基础模型编排的碰撞安全设计工作流,集成代理模型、多目标进化搜索、几何生成器和自然语言接口,将行人保护评估时间从数小时降至秒级。

详情
Journal ref ICLR 2026 Workshop: The 2nd Workshop on Foundation Models for Science
AI中文摘要

AI驱动的工程工作流在碰撞安全设计中面临特殊挑战:与空气动力学不同,碰撞事件涉及高度非线性的接触动力学、材料非线性和离散状态转换,难以用数据驱动的代理模型捕捉。据我们所知,我们首次提出了一个基于基础模型编排的碰撞安全设计工作流,实现了代理辅助的行人保护探索,将评估时间从每次CAE模拟数小时缩短至数秒。该工作流集成四个组件:(1) 基于CAE碰撞模拟训练的代理模型,用于从设计参数预测行人腿部伤害指标,平均$R^2=0.87$,并提供无分布假设的共形预测区间;(2) 多目标进化搜索(NSGA-II),在用户指定约束下发现多样化的可行参数集;(3) 基于形变的几何生成器,将参数映射为保持拓扑的3D形状;(4) 自然语言接口,其中LLM编排工作流,视觉-语言模型支持生成设计的语义比较。在一个汽车前保险杠案例研究中,该工作流通过单次探索产生35个不同的安全合规替代方案,而传统CAE迭代需要数周。这些结果表明,基础模型可以作为ML代理和基于物理的模拟之间的集成层,帮助将AI能力引入安全关键的工程领域。

英文摘要

AI-driven engineering workflows face particular challenges in crash safety design: unlike aerodynamics, crash events involve highly nonlinear contact dynamics, material nonlinearity, and discrete state transitions that are difficult to capture with data-driven surrogate models. To the best of our knowledge, we present the first foundation-model-orchestrated workflow for crash safety design that enables surrogate-assisted exploration for pedestrian protection, reducing evaluation time from hours per CAE simulation to seconds. The workflow integrates four components: (1) a surrogate trained on CAE crash simulations to predict pedestrian leg injury metrics from design parameters, achieving an average $R^2=0.87$ and providing distribution-free conformal prediction intervals; (2) multiobjective evolutionary search (NSGA-II) to discover diverse feasible parameter sets under user-specified constraints; (3) a morphing-based geometry generator that maps parameters to topology-preserving 3D shapes; and (4) a natural-language interface in which an LLM orchestrates the workflow and a vision-language model supports semantic comparison of generated designs. In an automotive front-bumper case study, the workflow produces 35 distinct safety-compliant alternatives from a single exploration, a process that would require weeks with conventional CAE iteration. These results suggest that foundation models can serve as integration layers between ML surrogates and physics-based simulation, helping bring AI capabilities to safety-critical engineering domains.
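The distribution-free conformal prediction intervals in component (1) can be illustrated with split conformal prediction, the simplest variant. The sketch below assumes absolute residuals as the nonconformity score and a held-out calibration set; the abstract does not say which conformal variant the authors use, and the function name `conformal_halfwidth` and the numbers are hypothetical.

```python
import math

def conformal_halfwidth(cal_true, cal_pred, alpha=0.1):
    """Half-width q such that [pred - q, pred + q] targets coverage >= 1 - alpha."""
    residuals = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return residuals[rank - 1]

# Hypothetical calibration data: simulated values vs. surrogate predictions.
cal_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
cal_pred = [1.5, 2.0, 2.5, 4.25, 5.0, 6.5, 6.0, 8.25, 9.0, 10.75, 11.0, 13.5]

q = conformal_halfwidth(cal_true, cal_pred, alpha=0.2)
new_pred = 7.5  # surrogate output for a new design (hypothetical)
interval = (new_pred - q, new_pred + q)
print(q, interval)
```

The coverage guarantee rests only on exchangeability between calibration and test designs, not on any assumption about the error distribution, which is why such intervals are called distribution-free.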

2606.18147 2026-06-17 cs.AI 新提交

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning

WEQA: 可穿戴健康问答中的查询自适应智能推理

Yuwei Zhang, Tong Xia, Bianca Emmerich, Yu Yvonne Wu, Dimitris Spathis, Xin Liu, Daniel McDuff, Cecilia Mascolo

发表机构 * University of Cambridge(剑桥大学) Tsinghua University(清华大学) University College London(伦敦大学学院) Dartmouth College(达特茅斯学院) Google Research(谷歌研究院)

AI总结 提出WEQA框架,通过LLM控制器动态组合传感器分析与预训练模型,实现可穿戴健康数据问答,在基准测试中准确率提升24%,专家评估显示实用性和临床合理性显著提高。

详情
AI中文摘要

语言模型在医学问答中表现出色,有时甚至超过普通医生的准确率。然而,关于可穿戴健康数据的问题回答仍然具有挑战性且研究不足,因为这些无处不在的传感器产生连续、高维和纵向的数据,难以与LLM预训练中的文本中心分布对齐。传感器模态和用户意图的多样性无法通过固定的推理工作流或单一的预训练基础模型有效处理。为了解决这些挑战,我们提出了WEQA,一个查询自适应智能体框架,将LLM推理与专门的可穿戴分析和建模工具统一起来。采用LLM控制器来合成执行计划,动态地将每个查询路由到适当的传感器分析和预训练模型组合,并利用外部知识进行基于证据的响应审计。我们还整理了一个基准测试,涵盖四个开放的可穿戴数据集,包括三个不同健康领域的分析和预测任务。实验表明,我们的框架比LLM和智能体基线准确率提高24%,一项由12名医学专家和8名用户进行的盲法研究显示,在实用性和临床合理性方面有显著提升。

英文摘要

Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied, as these ubiquitous sensors produce continuous, high-dimensional, and longitudinal data, which is non-trivial to align with text-centric distributions in LLM pretraining. The diversity of sensor modalities and user intents cannot be effectively handled by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller is employed to synthesize execution plans and dynamically route each query to the appropriate combination of sensor analysis and pretrained models, and perform grounded response auditing with external knowledge. We also curate a benchmark spanning four open wearable datasets comprising analytic and predictive tasks in three different health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.

2606.18154 2026-06-17 cs.AI 新提交

Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure

通过智能体发现混合结构学习心脏电生理数字孪生

Ziqi Zhou, Yubo Ye, Sumeet Atul Vadhavka, Linwei Wang, Zhiqiang Tao

发表机构 * Rochester Institute of Technology(罗彻斯特理工学院)

AI总结 提出LEADS框架,利用LLM智能体在结构化动作空间中迭代发现混合物理-神经模型,实现个性化心脏电生理数字孪生构建,优于人工设计和其他LLM方法。

Comments 10 pages, 4 figures

详情
AI中文摘要

构建个性化心脏电生理(EP)数字孪生需要为每个患者识别合适的模型结构,而不仅仅是拟合参数。传统方法依赖专家手动指定混合物理-神经架构,这需要深厚的领域专业知识,且无法跨患者迁移。最近的工作应用大型语言模型(LLM)来生成或充当混合模型。然而,尽管这些基于LLM的方法具有有希望的泛化能力,但它们缺乏稳定心脏模拟所需的结构先验。因此,我们提出LEADS,一个将心脏EP领域知识形式化为结构化动作空间,并利用LLM智能体发现混合模型的框架。该智能体遵循迭代推理-行动循环来选择、组合和优化混合模型,同时梯度下降处理参数拟合。所提出的LEADS设计每个候选模型都朝向物理基础、可解释和数值稳定,同时允许开放式的架构发现。我们在具有三个真实反应模型的合成数据和真实心脏EP数据上验证了LEADS,证明其优于人工设计的混合模型和其他基于LLM的混合建模方法。

英文摘要

Building personalized cardiac electrophysiology (EP) digital twins requires identifying the appropriate model structure for each patient, not merely fitting parameters. Traditional methods rely on experts to manually prescribe hybrid physics-neural architectures, which requires deep domain expertise and does not transfer across patients. Recent works have applied large language models (LLMs) to generate or act as hybrid models. However, despite their promising generalization capacity, these LLM-based methods lack the structural priors needed for stable cardiac simulations. Hence, we propose LEADS, a framework that formulates cardiac EP domain knowledge as a structured action space and uses an LLM agent to discover hybrid models. The agent follows an iterative reasoning-and-action loop to select, combine, and refine hybrid models, while gradient descent handles parameter fitting. LEADS steers every candidate model toward being physically grounded, interpretable, and numerically stable, while still allowing open-ended architectural discovery. We validate LEADS on synthetic data with three ground-truth reaction models and on real cardiac EP data, demonstrating that it outperforms both human-designed hybrid models and other LLM-based hybrid modeling approaches.

2606.17059 2026-06-17 cs.DC cs.AI 交叉投稿

Towards Distributed Inference of LLMs on a P2P Network

面向P2P网络的LLM分布式推理

Shabari S Nair, Krishanu Saini

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Department of Computer Science, The University of Texas at Austin(德克萨斯大学奥斯汀分校计算机科学系)

AI总结 提出一种去中心化的前缀缓存感知路由方案,用于P2P网络中的LLM推理,通过本地基数树和异步反熵更新缓存信息,避免集中协调和KV缓存传输,在低延迟和偏斜前缀分布下提升性能。

详情
AI中文摘要

前缀缓存可以通过在具有共享提示的请求之间重用KV缓存来减少LLM推理延迟,但集群规模的重用具有挑战性,因为缓存在节点之间是分区的。我们提出了一种用于对等LLM服务的去中心化、前缀缓存感知路由方案。每个节点维护其自身缓存前缀的本地基数树,并使用周期性反熵异步刷新对等缓存的估计。请求被路由到具有最长估计前缀匹配的节点,无需集中协调或KV缓存传输。过时的元数据只会导致缓存未命中,而不会产生错误输出,因此弱一致性足以保证正确性。在模拟MMLU工作负载上的评估表明,去中心化路由在低通信延迟和偏斜前缀分布下改善了延迟,而高网络延迟和亲和性引起的热点限制了其优势。

英文摘要

Prefix caching can reduce LLM inference latency by reusing KV caches across requests with shared prompts, but cluster-scale reuse is challenging because caches are partitioned across nodes. We propose a decentralized, prefix-cache-aware routing scheme for peer-to-peer LLM serving. Each node maintains a local radix tree of its own cached prefixes, along with estimates of peer caches that are refreshed asynchronously through periodic anti-entropy. Requests are routed to the node with the longest estimated prefix match, without centralized coordination or KV-cache transfer. Stale metadata only causes cache misses, not incorrect outputs, making weak consistency sufficient for correctness. Evaluation on simulated MMLU workloads shows that decentralized routing improves latency under low communication delay and skewed prefix distributions, while high network latency and affinity-induced hotspots limit its benefits.
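The routing rule in this abstract (send each request to the node with the longest estimated prefix match) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `PeerEstimate` class, the `route` function, and the flat list of token prefixes standing in for a radix tree are all assumptions made for the example.

```python
from dataclasses import dataclass, field


def common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class PeerEstimate:
    """Possibly stale local view of one peer's cached prefixes (hypothetical structure).

    A flat list stands in for the radix tree mentioned in the abstract.
    """
    node_id: str
    cached_prefixes: list = field(default_factory=list)

    def longest_match(self, prompt):
        # Longest shared prefix between the prompt and any cached prefix.
        return max(
            (common_prefix_len(prompt, p) for p in self.cached_prefixes),
            default=0,
        )


def route(prompt, peers):
    """Return (node_id, estimated_match_len) for the best-matching peer.

    Ties go to the first peer in the list. A stale estimate can only send
    the request to a node that then misses its cache; the output is unaffected.
    """
    best = max(peers, key=lambda p: p.longest_match(prompt))
    return best.node_id, best.longest_match(prompt)
```

A production router would also need load awareness, since the abstract itself reports that affinity-induced hotspots limit the benefit of pure prefix-based routing.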

2606.17065 2026-06-17 q-fin.CP cs.AI cs.LG 交叉投稿

PIVOT: Bridging Black-Scholes Implied-Volatility and Price Objectives via Differentiable Jäckel Operator

PIVOT: 通过可微分的Jäckel算子桥接Black-Scholes隐含波动率与价格目标

Raeid Saqur, Yannick Limmer, Anastasis Kratsios, Blanka Horvath, Hans Buehler

发表机构 * Mathematical Institute, University of Oxford(牛津大学数学研究所) McMaster University(麦克马斯特大学) Vector Institute for AI(人工智能矢量研究所) DRW

AI总结 提出PIVOT层,通过隐式微分保留Jäckel求解器的前向精度,并利用门控机制处理低vega区域的奇异性,实现价格与隐含波动率空间的高效可微转换。

Comments 30 pages, 17 figures, 12 tables

详情
AI中文摘要

现代期权学习系统在两种坐标系下运行:价格空间(市场报价且无套利约束最自然执行)和隐含波动率(IV)空间(波动率曲面被平滑、正则化和评估)。瓶颈在于接口而非近似:Jäckel开创性的“Let's Be Rational”(LBR)求解器已经高效地将Black-Scholes价格反转到机器精度。所缺少的是一个可微分层,它在正向传播中保留LBR,并避免通过其分支逻辑进行反向传播。这样的层还必须面对低vega区域中逆映射不可避免的奇异性,其中灵敏度1/vega在vega→0时发散。我们通过PIVOT(价格-隐含波动率目标转换器)填补了这一空白。PIVOT保持LBR正向传播不变,并通过隐式微分通过平滑的Black-Scholes/Black-76价格映射提供反向传播,并带有显式门控合约:无效域返回NaN,良态行接收精确的1/vega梯度,低vega行被衰减而非静默正则化。在单个H100上,融合的Triton内核在机器精度下达到1.79e9 IV/s(与参考C求解器的最大相对误差为9.3e-14);端到端标签生成在合成链上维持48.9M/s,在SPX OptionMetrics上维持16.6M/s。在SPX上的HyperIV风格单日复现中,PIVOT增强目标帕累托主导基线,将保留价格MAE降低高达43.4%,最强的三种子门控目标联合改善价格MAE 38.8%和IV MAE 21.3%;在RUT、VIX和NDX上的跨资产结果显示方向性价格MAE增益分别为40.1%、24.2%和16.7%,而无门控的IV往返控制崩溃为退化的近零曲面,确认门控是正确性合约而非调节旋钮。

英文摘要

Modern option-learning systems operate in two coordinates: price space, where markets quote and no-arbitrage constraints are most naturally enforced, and implied volatility (IV) space, where volatility surfaces are smoothed, regularized, and evaluated. The bottleneck is interface, not approximation: Jäckel's seminal "Let's Be Rational" (LBR) solver already inverts the Black-Scholes price to machine precision efficiently. What is missing is a differentiable layer that preserves LBR in the forward pass and avoids backpropagating through its branch logic. Such a layer must also confront the unavoidable singularity of the inverse map in the low-vega regime, where the sensitivity 1/vega diverges as vega -> 0. We close this gap with PIVOT, the Price-Implied-Volatility Objective Translator. PIVOT keeps the LBR forward pass intact and supplies the backward pass by implicit differentiation through the smooth Black-Scholes/Black-76 price map, with an explicit gating contract: invalid domains return NaN, well-conditioned rows receive the exact 1/vega gradient, and low-vega rows are attenuated rather than silently regularized. On a single H100, a fused Triton kernel reaches 1.79e9 IV/s at machine precision (9.3e-14 max relative error vs. the reference C solver); end-to-end label generation sustains 48.9M/s on synthetic chains and 16.6M/s on SPX OptionMetrics. In a HyperIV-style one-day reproduction on SPX, PIVOT-augmented objectives Pareto-dominate the baselines, reducing held-out price MAE by up to 43.4% and the strongest three-seed gated objective improving price MAE by 38.8% and IV MAE by 21.3% jointly; cross-asset results on RUT, VIX, and NDX show directional price-MAE gains of 40.1%, 24.2%, and 16.7%, while an ungated IV-roundtrip control collapses to a degenerate near-zero surface, confirming the gate as a correctness contract rather than a tuning knob.
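The backward pass described above rests on the implicit function theorem: if the price map P(sigma) is smooth and monotone, then d(sigma)/dP = 1/vega, so the solver's branch logic never has to be differentiated. The sketch below illustrates this under stated assumptions. It uses an undiscounted Black-76 call, a plain bisection as a stand-in for the LBR solver, and a hypothetical `vega_floor` gate whose attenuation shape is invented for illustration; none of these names or choices come from the paper.

```python
import math


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def black76_call(F, K, T, sigma):
    """Undiscounted Black-76 call price."""
    s = sigma * math.sqrt(T)
    d1 = math.log(F / K) / s + 0.5 * s
    return F * norm_cdf(d1) - K * norm_cdf(d1 - s)


def black76_vega(F, K, T, sigma):
    """Derivative of the undiscounted Black-76 call price w.r.t. sigma."""
    s = sigma * math.sqrt(T)
    d1 = math.log(F / K) / s + 0.5 * s
    return F * norm_pdf(d1) * math.sqrt(T)


def implied_vol(price, F, K, T, lo=1e-6, hi=5.0, iters=200):
    """Bisection stand-in for the LBR solver; NaN outside no-arbitrage bounds."""
    if not (max(F - K, 0.0) < price < F):
        return math.nan
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if black76_call(F, K, T, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gated_inverse_vega(vega, vega_floor=1e-8):
    """Exact 1/vega when well conditioned, attenuated below the floor.

    The attenuation vega / vega_floor**2 is a hypothetical choice: it is
    continuous at the floor and tends to zero as vega -> 0.
    """
    if vega >= vega_floor:
        return 1.0 / vega
    return vega / (vega_floor ** 2)


def iv_and_grad(price, F, K, T, vega_floor=1e-8):
    """Return (sigma, d sigma / d price) using implicit differentiation."""
    sigma = implied_vol(price, F, K, T)
    if math.isnan(sigma):
        return math.nan, math.nan  # invalid domain is surfaced, not hidden
    return sigma, gated_inverse_vega(black76_vega(F, K, T, sigma), vega_floor)
```

The gradient returned by `iv_and_grad` can be checked against a central finite difference of `implied_vol`, which is a convenient sanity test when swapping in a different forward solver.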

2606.17070 2026-06-17 physics.ao-ph cs.AI cs.LG 交叉投稿

KFTD: Koopman-Fourier Time-Differentiable Network for Continuous Ocean Spatiotemporal Forecasting

KFTD: 用于连续海洋时空预测的Koopman-Fourier时间可微网络

Qinghui Chen, Zekai Zhang, Hailong Liu, Jinglin Zhang, Cong Bai

发表机构 * Shandong University(山东大学) Laoshan Laboratory(崂山实验室) Chinese Academy of Sciences(中国科学院) Zhejiang University of Technology(浙江工业大学)

AI总结 提出KFTD网络,通过Koopman线性空间和傅里叶分析实现连续时间插值,结合轻量残差网络进行预测,在四个海洋数据集上均方误差平均降低5.6%,效率提升76.25%。

详情
AI中文摘要

准确的海洋预测对于气候监测和灾害预警至关重要。然而,海洋时空预测面临建模复杂动力系统和确保计算效率的双重挑战。我们提出了Koopman傅里叶时间可微(KFTD)网络,一种时间连续的两阶段范式,将插值与预测解耦,以实现高效且可扩展的时空建模。我们将复杂的非线性动力学映射到Koopman线性空间,并利用傅里叶分析实现任意子步的连续时间插值。一个轻量级残差网络消耗高保真中间状态以产生最终预测。与扩散模型不同,KFTD消除了多步噪声采样,直接在连续时间内演化系统,实现了4倍的计算加速。我们进一步引入DPP损失,以端到端方式支持任意PDE约束,打破了纯数据驱动方法的物理一致性瓶颈。在四个海洋数据集上的实验结果证实,我们的连续时间框架使MSE平均降低5.6%(SST最高达12.7%),并且效率比MCVD提高了76.25%。

英文摘要

Accurate oceanic forecasting is critical for climate monitoring and disaster early warning. However, ocean spatiotemporal forecasting faces the dual challenges of modeling complex dynamical systems and ensuring computational efficiency. We present the Koopman-Fourier Time-Differentiable (KFTD) Network, a time-continuous two-stage paradigm that decouples interpolation from prediction to achieve efficient and scalable spatiotemporal modeling. We map complex nonlinear dynamics into the Koopman linear space and exploit Fourier analysis to enable continuous-time interpolation at arbitrary sub-steps. A lightweight residual network consumes the high-fidelity intermediate states to yield the final forecast. Unlike diffusion models, KFTD eliminates multi-step noise sampling and directly evolves the system in continuous time, yielding a 4x computational speedup. We further introduce a DPP Loss that supports arbitrary PDE constraints in an end-to-end manner, breaking the physical consistency bottleneck of pure data-driven approaches. Empirical results on four ocean datasets confirm that our continuous-time framework reduces MSE by an average of 5.6% (up to 12.7% for SST) and improves efficiency over MCVD by 76.25%.

2606.17074 2026-06-17 cs.AR cs.AI 交叉投稿

Surveying GenAI-based Automation in Printed Circuit Board Design and Test

基于GenAI的印刷电路板设计与测试自动化综述

Sahana Srinivasan, Benjamin Turnbull, Hammond Pearce

发表机构 * University of New South Wales(新南威尔士大学)

AI总结 综述生成式AI在PCB全生命周期(从供应链到测试)中的应用,分类现有工作并指出数据稀缺与工具集成挑战,展望未来研究方向。

Comments 33 pages, 5 figures, 11 tables. Under review

详情
AI中文摘要

生成式人工智能(GenAI)越来越多地应用于硬件和软件领域。它旨在减少复杂系统在发布前开发和测试中涉及的人工工作量。在硬件领域,大多数任务集中在集成电路的设计自动化,特别是使用硬件描述语言。然而,也存在其他类型的硬件!在本综述中,我们转而考察GenAI如何已经并正在应用于印刷电路板(PCB)设计生命周期。这包括从供应链、系统规范、电路设计、布局与优化、验证与测试,到PCB组装与分销的所有环节。通过这一视角,我们提出了所发现工作的分类法,根据其意图和贡献进行分类。本综述还指出了GenAI在该领域面临的关键技术挑战,例如特定领域数据稀缺以及与现有PCB工具集成的支持有限。最后,讨论了未来的研究方向:我们的综述表明,在考虑如何将GenAI集成到PCB设计与测试的各种任务中时,仍存在许多机会。

英文摘要

Generative artificial intelligence (GenAI) is increasingly used for applications in the hardware and software domains. It purports to reduce the manual effort involved in the development and testing of complex systems before release. Within the hardware space, most tasks have focused on design automation of integrated circuits, particularly with hardware description languages. However, other types of hardware also exist! In this survey, we instead examine how GenAI has been and is being applied across the printed circuit board (PCB) design life cycle. This includes everything from supply chains, system specification, circuit design, layout and optimisation, and validation and test, to PCB assembly and distribution. Through this lens we present a taxonomy of discovered works, categorising them according to their intent and contributions. This survey also identifies key technical challenges that GenAI faces in this space, such as domain-specific data scarcity and limited support for integration with existing PCB tools. Finally, future research directions are discussed: our survey shows that there are many opportunities remaining when considering how GenAI may be integrated into various tasks in PCB design and test.

2606.17090 2026-06-17 cs.PL cs.AI cs.MS 交叉投稿

ANEForge: Python for direct computation on the Apple Neural Engine

ANEForge: 用于直接在Apple Neural Engine上进行计算的Python工具

Spencer H. Bryngelson

发表机构 * School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA Daniel Guggenheim School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA George W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA

AI总结 ANEForge是一个Python包,通过编译惰性张量图直接编程Apple Neural Engine,无需CoreML,支持推理和训练,性能接近引擎硬件极限。

Comments 8 pages

详情
AI中文摘要

ANEForge是一个Python包,它直接编程Apple Neural Engine(ANE),即每个最新Apple设备上的固定功能神经加速器,无需CoreML。在生产环境中,引擎只能通过CoreML访问,而CoreML将其视为调度选项:没有配置要求使用ANE,模型可以静默地在CPU或GPU上运行。ANEForge将由58个融合操作和19个原生桥接操作构建的惰性张量图编译成单个ANE程序。该程序通过与Apple内部框架相同的ANE守护进程和内核驱动程序堆栈进行调度。除了推理之外,该包还访问引擎的原生融合注意力,流式传输int8、int4和稀疏权重,在步骤之间保持解码器和优化器状态,并在引擎上运行训练的前向传播、反向传播和优化器更新。一个小的融合程序在大约90微秒内完成一次调用,接近引擎每程序70微秒的调度下限,预训练的ResNet-18前向传播端到端运行时间为0.33毫秒。ResNet-18、句子编码器和Vision Transformer在框架参考上端到端运行,Stable Diffusion U-Net验证了其前向传播。ANEForge针对macOS 14及更高版本下的Apple Silicon。每个版本都针对记录的macOS和ANE编译器版本进行验证。

英文摘要

ANEForge is a Python package that programs the Apple Neural Engine (ANE), the fixed-function neural accelerator on every recent Apple device, directly and without CoreML. In production the engine is reachable only through CoreML, which treats it as a scheduling option: no configuration requires the ANE, and a model can silently run on the CPU or GPU instead. ANEForge compiles a lazy tensor graph, built from 58 fused operators and 19 native bridge operators, into a single ANE program. The program is dispatched through the same ANE daemon and kernel-driver stack as Apple's internal framework. Beyond inference, the package reaches the engine's native fused attention, streams int8, int4, and sparse weights, keeps decoder and optimizer state resident across steps, and runs the forward pass, backward pass, and optimizer update of training on the engine. A small fused program completes a call in about 90us, near the engine's 70us per-program dispatch floor, and a pretrained ResNet-18 forward runs end-to-end in 0.33ms. ResNet-18, a sentence encoder, and a Vision Transformer run end-to-end against framework references, and a Stable Diffusion U-Net validates its forward pass. ANEForge targets Apple Silicon under macOS 14 and later. Each release is verified against a recorded macOS and ANE-compiler version.

2606.17099 2026-06-17 cs.SE cs.AI 交叉投稿

Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work

软件委托合约:衡量AI编码代理工作中的可审查性

Vincent Schmalbach

发表机构 * Independent Researcher(独立研究员)

AI总结 研究通过显式委托合约提升AI编码代理工作可审查性,实验发现合约虽不改善任务正确性,但显著提高证据充分性和降低审查歧义。

Comments 11 pages; empirical pilot study with 64 coding-agent runs and 192 blinded reviews

详情
AI中文摘要

AI编码代理越来越多地接受分配的软件任务,在有限权限下修改仓库,并返回工作包以供审查。先前工作提出了软件委托合约,涵盖任务、权限、返回的工作包和验收上下文,作为委托编码工作的分析单元,但未衡量其效果。本文报告了一项关于编码代理显式委托合约的受控试点研究。我们构建了一个无依赖的TypeScript API任务环境,包含种子缺陷和文档缺口,编写了五个系列的十个任务,并在三种条件下跨两个模型层级运行了64次代理执行:一个现实的问题风格提示、一个显式委托合约,以及一个带有必需证据包的合约。每次运行通过隐藏验收测试、变异检查和范围分析进行评分,然后由三位独立的条件盲审模型审查员使用固定评分标准进行审查,共192次审查。显式合约并未改善客观任务结果:所有64次运行均通过隐藏验收检查,零范围违规。但它们确实提高了可审查性。在30次配对比较中,证据充分性在22次中有所改善,没有一次恶化(5分量表上+0.83,p < 0.0001,Cliff's delta = 0.66);审查员歧义减少(p = 0.035);更改文件列表、已知限制部分、剩余风险部分和审查员检查表大多仅在合约要求时出现。合约消耗了+13%的代理令牌和+38%的挂钟时间,对较弱模型层级的影响更大。在这些小任务上,委托合约购买的是可审查性而非正确性。

英文摘要

AI coding agents increasingly accept assigned software tasks, modify repositories under bounded authority, and return work packages for review. Prior work proposed the software delegation contract, covering the task, authority, returned work package, and acceptance context, as the unit of analysis for delegated coding work, but did not measure its effects. This paper reports a controlled pilot study of explicit delegation contracts for coding agents. We built a dependency-free TypeScript API task environment with seeded defects and documentation gaps, authored ten tasks across five families, and ran 64 agent executions across two model tiers under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract with a required evidence bundle. Each run was scored with hidden acceptance tests, mutation checks, and scope analysis, then reviewed by three independent condition-blinded model-based reviewers using a fixed rubric, for 192 reviews. Explicit contracts did not improve objective task outcomes: all 64 runs passed hidden acceptance checks, with zero scope violations. They did improve reviewability. Evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff's delta = 0.66); reviewer ambiguity decreased (p = 0.035); changed-file lists, known-limitations sections, residual-risk sections, and reviewer checklists appeared mostly or only when demanded by the contract. Contracts cost +13% agent tokens and +38% wall-clock time, with larger effects for the weaker model tier. On these small tasks, delegation contracts bought reviewability rather than correctness.
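The abstract reports an effect size as Cliff's delta. For readers unfamiliar with the statistic, the sketch below shows the standard two-sample definition: the fraction of pairs in which the first group scores higher, minus the fraction in which it scores lower. The abstract does not say how the statistic was computed on the paired data, so this is the textbook form only, and the function name `cliffs_delta` is chosen for illustration.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs.

    Ranges from -1 (every x below every y) to +1 (every x above every y);
    0 means the two groups overlap evenly.
    """
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))
```

Because the statistic uses only orderings, it suits ordinal rubric scores such as the 5-point evidence-sufficiency scale, where mean differences alone can be hard to interpret.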

2606.17109 2026-06-17 cs.CR cs.AI cs.LG 交叉投稿

Timestamp-Aware Spatio-Temporal Graph Contrastive Learning for Network Intrusion Detection

时间戳感知的时空图对比学习用于网络入侵检测

Jianli Dai, Guangwei Wu, Jiacheng Li, Weiping Wang, An He, Xinjun Xiao

发表机构 * Central South University of Forestry and Technology, School of Computer Science and Mathematics(中南林业科技大学计算机科学与数学学院) Central South University, School of Computer Science and Engineering(中南大学计算机科学与工程学院)

AI总结 提出一种自监督图神经网络框架,通过时间戳构建时序图,结合E-GraphSAGE和LSTM编码时空依赖,并采用多视图图对比学习(时空特征对比)及自适应权重策略,在四个数据集上达到与监督方法相当的性能。

详情
AI中文摘要

鉴于图神经网络(GNN)在建模网络流量间关系结构方面的有效性,它们已被广泛用于网络入侵检测系统(NIDS)。然而,大多数现有基于GNN的NIDS方法关注流量关系的结构,并将其视为时间独立,这限制了它们应对不断演变的攻击行为的能力。此外,它们对监督或半监督学习的依赖通常限制了对未见攻击的泛化能力。为解决这些限制,我们提出了一种新颖的自监督GNN框架。据我们所知,所提出的模型是首批显式利用真实时间戳的自监督GNN-based NIDS模型之一,这为表示学习提供了忠实的时间依赖关系。我们首先根据时间戳从网络流量中构建一系列时序图,然后采用基于E-GraphSAGE和LSTM的编码器充分提取网络流量的时间信息和空间依赖关系,而无需引入耗时的注意力机制。引入了一种多视图图对比学习(GCL)方案,其中联合执行时间、空间和特征对比,分别捕获时间连续性、保持结构一致性并提高所学表示的泛化性和鲁棒性。此外,设计了一种基于梯度范数的自适应加权策略来优化对比损失权重。在四个具有真实时间戳的代表性NIDS数据集上的实验结果表明,我们的方法显著优于现有自监督方法,并达到了与监督最先进GNN方法相当的性能,同时保持了高计算效率。

英文摘要

Given their effectiveness in modeling the relational structure among network traffic flows, graph neural networks (GNNs) have been widely adopted in network intrusion detection systems (NIDSs). However, most existing GNN-based NIDS approaches focus on the relational structure of traffic flows, and treat them as temporally independent, which limits their ability to cope with evolving attack behaviors. Moreover, their reliance on supervised or semi-supervised learning often restricts generalization to unseen attacks. To address these limitations, we propose a novel self-supervised GNN-based framework. To the best of our knowledge, the proposed model is among the first self-supervised GNN-based NIDS models to explicitly leverage real timestamps, which provides faithful temporal dependencies for representation learning. We first construct a series of temporal graphs from network traffic flows according to their timestamps, and then employ an E-GraphSAGE and LSTM based encoder to fully extract temporal information and spatial dependencies of network traffic, without introducing time-costly attention mechanisms. A multi-view graph contrastive learning (GCL) scheme is introduced, where temporal, spatial, and feature contrasts are jointly performed to capture temporal continuity, preserve structural consistency, and improve the generalization and robustness of the learned representations, respectively. In addition, a gradient-norm-based adaptive weighting strategy is designed to optimize the contrastive loss weights. Experimental results on four representative NIDS datasets with real timestamps demonstrate that our method significantly outperforms existing self-supervised approaches and achieves performance comparable to the supervised state-of-the-art GNN method, while maintaining high computational efficiency.

2606.17119 2026-06-17 cs.CR cs.AI 交叉投稿

Graph neural networks at war: integrating cybersecurity and drone intelligence in the Israeli-Iranian conflict

战争中的图神经网络:整合网络安全与无人机智能于以色列-伊朗冲突

Sozan Sulaiman Maghdid, Tarik Ahmed Rashid, Shavan Askar

发表机构 * Department of Information Technology, Khabat Technical Institute(信息科技系,Khabat技术学院) Erbil Polytechnic University(埃尔比尔理工大学) Computer Science and Engineering Department(计算机科学与工程系) AIIC, University of Kurdistan Hewler(AIIC,库尔德斯坦赫勒大学) Department of Information Systems Engineering, Technical College of Computer and Informatics Engineering(信息系统工程系,计算机与信息工程技术学院)

AI总结 研究利用图神经网络(GNN)增强网络入侵检测与无人机响应,通过案例验证其在高检测率、快速响应和态势感知中的有效性。

详情
AI中文摘要

信息物理系统在检测和即时响应方面带来了新的威胁和挑战。本研究探讨了图神经网络(GNN)如何用于辅助涉及网络入侵和无人机(UAV)的信息物理系统中的网络安全和无人机管理。基于图神经网络提供的结构理解,本工作提出了一种集成流程,使入侵检测系统能够学习底层网络结构,识别恶意活动,并促进无人机响应措施。在基于仿真的案例研究中,构建了网络攻击模型以触发无人机响应,证明基于图的学习有助于态势感知、群体协调和自适应机动。根据性能评估,该方法的检测率为94.2,接收者操作特征(ROC)曲线下平均面积为0.955,平均响应时间为1.4秒。对比实验表明,在相同场景下,所提出的GraphSAGE网络比图卷积网络(GCN)和图注意力网络(GAT)更有效。这些发现表明,图神经网络可用于动态信息物理系统中的入侵预防和响应。

英文摘要

Cyber-physical systems have brought about new threats and challenges in detection and immediate response. This study examines how graph neural networks (GNNs) can be used to aid cybersecurity and drone management in a cyber-physical system involving cyber intrusions and unmanned aerial vehicles (UAVs). Building on the structural understanding that graph neural networks provide, this work presents an integrated procedure that allows intrusion detection systems to learn the underlying network structure, identify malicious activity, and facilitate drone response measures. In an emulation-based case study, cyberattack models were created to provoke drone responses, showing that graph-based learning can assist with situational awareness, swarm coordination, and adaptive maneuvering. According to the performance evaluation, the method achieves a detection rate of 94.2, an average area under the receiver operating characteristic (ROC) curve of 0.955, and an average response time of 1.4 seconds. Comparative experiments show that the proposed GraphSAGE network is more effective than Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) in the same scenario. These findings indicate that graph neural networks can be used for intrusion prevention and response in dynamic cyber-physical systems.

2606.17127 2026-06-17 q-bio.QM cs.AI cs.LG 交叉投稿

Agentic Discovery of Non-Canonical Antimicrobial Peptides with AMPGAN v3

AMPGAN v3 的非经典抗菌肽智能发现

Jay Jung, Xiaohan Zhang, Shenghan Song, Mahmoud Sayedahmed, Chijian Xiang, Yunong Xu, Ahmed AbdelKhalek, Severin T. Schneebeli, Matthew J. Wargo, Jianing Li, Safwan Wshah

发表机构 * University of Vermont(佛蒙特大学) Larner College of Medicine, University of Vermont(佛蒙特大学拉纳医学院) Purdue University(普渡大学) Department of Comparative Pathobiology(比较病理生物学系) Department of Horticulture and Landscape Architecture(园艺与景观建筑系) Department of Industrial and Molecular Pharmaceutics(工业与分子药剂学系)

AI总结 提出 AMPGAN v3,一种多目标条件 GAN,扩展生成词汇至 D-氨基酸和末端修饰,通过双判别器提升稳定性,体外验证显示对革兰氏阳性菌有活性,并引入 PepCraft 多智能体框架用于端到端发现。

Comments Presented at the GenBio Workshop, ICML 2026

详情
AI中文摘要

抗菌药物耐药性每年导致超过一百万人死亡。抗菌肽(AMP)是一种有前景的解决方案,但生成式 AMP 模型尚未准备好设计含有非天然氨基酸和/或化学修饰的肽,而这些对于实际肽药物至关重要。我们提出了 AMPGAN v3,一种多目标条件 GAN,它将生成词汇扩展到 D-氨基酸和 N/C 末端修饰(如酰胺化)。通过将对抗性和活性感知监督分离到两个专门的判别器中,AMPGAN v3 显著提高了训练稳定性,并在外部分类器上优于先前的生成式 AMP 模型。我们在体外验证了跨越三个结构类别的五个候选物;其中两个对革兰氏阳性菌株表现出活性,最佳候选物对枯草芽孢杆菌的 MIC 达到 8 μg/mL。为了支持下游筛选,我们进一步提出了 PepCraft,一个用于端到端 AMP 发现的多智能体框架,其中规划智能体协调专门的执行器进行生成、过滤和验证。其优先级推荐与我们的体外结果一致。这些贡献使我们能够在小型但真实的规模上研究生成式和智能体 AI 如何在治疗性肽发现中协同作用。代码:https://github.com/marszzibros/AMPGANv3

英文摘要

Antimicrobial resistance causes over a million deaths annually. Antimicrobial peptides (AMPs) are a promising solution, but generative AMP models are not yet ready to design peptides with non-natural amino acids and/or chemical modifications, which are essential for real-world peptide drugs. We present AMPGAN v3, a multi-objective conditional GAN that expands the generative vocabulary to D-amino acids and N/C-terminus modifications such as amidation. By separating adversarial and activity-aware supervision across two specialized discriminators, AMPGAN v3 substantially improves training stability and outperforms prior generative AMP models on external classifiers. We validated five candidates spanning three structural classes in vitro; two showed activity against Gram-positive strains, with the best candidate reaching MIC 8 μg/mL against B. subtilis. To support downstream curation, we further present PepCraft, a multi-agent framework for end-to-end AMP discovery in which a Planning Agent orchestrates specialized executors for generation, filtering, and verification. Its prioritization recommendations align with our in vitro outcomes. Together, these contributions let us examine, on a small but real scale, how generative and agentic AI compose in therapeutic peptide discovery. Code: https://github.com/marszzibros/AMPGANv3

2606.17197 2026-06-17 cs.SE cs.AI 交叉投稿

Cluster-Aware Dual-Level Test Specification Generation for Large-Scale Automotive Software Requirements

面向大规模汽车软件需求的集群感知双层测试规格生成

Hazem Ayman, Menna Sedik, Kareem Mostafa, Mahmoud Soliman, Samer Saber, Ibrahim Habib

发表机构 * CairoMotive Cairo, Egypt(开罗动力埃及)

AI总结 提出一种“先聚类后总结”流水线,通过嵌入、降维、聚类、多级摘要和双层测试生成,解决大规模需求下LLM处理依赖缺失和上下文窗口限制问题,提升集成测试覆盖率并高效扩展。

详情
AI中文摘要

生成满足Automotive SPICE SWE.6要求的测试规格随着项目扩展到数千个需求而变得越来越具有挑战性和耗时。由于手动过程通常需要数周的工程努力,自动化成为关键需求。然而,标准的大语言模型方法在大规模下难以应对:单独处理需求会丢失重要的需求间依赖关系,而一次性输入整个语料库则超出上下文窗口限制,导致集成覆盖不完整和测试用例冗余。本文提出一种新颖的“先聚类后总结”流水线,通过三个阶段解决这些限制。需求使用句子变换器嵌入,并通过UMAP降维和HDBSCAN密度聚类进行分组。该分组利用自动最小聚类大小选择,该选择由结合归一化轮廓系数和Calinski-Harabasz分数的质量准则驱动。然后,多级map-reduce摘要算法将每个聚类提炼为简洁、符合领域的描述,同时保留定量阈值和安全完整性等级。该流水线利用派生的聚类拓扑在两级生成测试规格:单个需求验证和验证跨需求特征行为的聚类级集成测试。邻近聚类上下文机制在每个LLM调用期间提供有限的跨特征感知,检索增强生成将所有输出基于ISO 26262和ASPICE标准。在不同规模的汽车需求数据集上的评估表明,与基线方法相比,集群感知方法提高了集成测试覆盖率并保持了摘要保真度,同时高效扩展到数千个需求。

英文摘要

Generating test specifications that satisfy Automotive SPICE SWE.6 requirements becomes increasingly challenging and time-consuming as projects scale to thousands of requirements. Because this manual process often consumes weeks of engineering effort, automation becomes a critical necessity. However, standard Large Language Model (LLM) approaches struggle at scale: processing requirements individually discards vital inter-requirement dependencies, while feeding entire corpora at once exceeds context-window limits, leading to incomplete integration coverage and redundant test cases. This paper presents a novel "Cluster-then-Summarize" pipeline that addresses these limitations through three stages. Requirements are embedded using sentence transformers and grouped using UMAP dimensionality reduction followed by HDBSCAN density-based clustering. This grouping uses an automatic minimum cluster size selection driven by a quality criterion combining normalized Silhouette and Calinski-Harabasz scores. A multi-level map-reduce summarization algorithm then distills each cluster into concise, domain-conformant descriptions while preserving quantitative thresholds and safety integrity levels. The pipeline exploits the derived cluster topology to generate test specifications at two levels: individual requirement verification and cluster-level integration tests that verify cross-requirement feature behavior. A nearby-cluster context mechanism provides bounded cross-feature awareness during each LLM call, and Retrieval-Augmented Generation grounds all outputs in ISO 26262 and ASPICE standards. Evaluation on automotive requirement datasets of varying scale demonstrates that the cluster-aware approach improves integration test coverage and maintains summarization fidelity compared to baseline methods while scaling efficiently to thousands of requirements.

2606.17235 2026-06-17 cond-mat.mtrl-sci cs.AI 交叉投稿

Physics-Informed Attention Mechanism and Generalization Capability of Deep Learning-Based Grain Growth Evolution Prediction

物理信息注意力机制与基于深度学习的晶粒生长演化预测的泛化能力

Pungponhavoan Tep, Marc Bernacki

发表机构 * Mines Paris, PSL University Centre for Material Forming (CEMEF), UMR CNRS 06904(巴黎 Mines 学院,PSL 大学材料成型中心(CEMEF),CNRS UMR 06904)

AI总结 本研究评估了深度学习模型在晶粒生长预测中面对分布外数据的泛化能力,并提出边界掩码注意力机制,显著提升了双峰晶粒尺寸分布等场景的预测精度。

详情
AI中文摘要

用于晶粒生长预测的机器学习模型通常基于理想化的合成数据进行训练,然而实际应用需要泛化到训练分布之外的条件。本研究评估了我们先前研究中训练模型在三个测试案例上的分布外泛化能力,包括实验微观结构、具有双峰晶粒尺寸分布的微观结构以及异常晶粒生长。为了进一步探究物理信息架构设计是否能在这些不同条件下提升鲁棒性,我们专门针对晶粒生长提出了一种边界掩码注意力机制,将注意力限制在晶界像素上。基线模型和所提出的物理信息注意力模型均在分布外数据上未经重新训练或微调进行了评估。两个模型均成功泛化到所有三个测试案例,但边界掩码注意力机制提供了显著改进,最显著的提升出现在具有双峰晶粒尺寸分布的微观结构上,其中结构相似性指数从0.6221提高到0.7609,平均晶粒尺寸误差从8.75%降低到3.57%。注意力热图分析表明,边界掩码注意力模型学会了以与曲率驱动晶粒生长物理一致的方式将注意力集中在大晶界上,这种能力源于训练过程,而无需显式编码到架构中。这些结果表明,在合成数据上训练的模型可以无需重新训练而泛化到多种分布外条件,并且当边界形态与训练域匹配时,物理信息注意力可以提高精度。

英文摘要

Machine Learning (ML) models for grain growth prediction are typically trained on idealized synthetic data, yet practical applications require generalization to conditions outside the training distribution. This study evaluated the Out-Of-Distribution (OOD) generalization capability of the trained model from our previous study across three test cases, including experimental microstructures, microstructures characterized by a bimodal grain size distribution, and abnormal grain growth. To further probe whether physics-informed architectural design could improve robustness under these different conditions, a boundary-masked attention mechanism was proposed specifically for grain growth, constraining attention to grain boundary pixels. Both the baseline and the proposed physics-informed attention model were evaluated without retraining or fine-tuning on the OOD data. Both models successfully generalized to all three test cases, yet the boundary-masked attention mechanism provided substantial improvements, with the most notable gains for microstructures characterized by a bimodal grain size distribution, where Structural Similarity Index Measure (SSIM) improved from 0.6221 to 0.7609 and mean grain size ($\overline{R}$) error decreased from 8.75% to 3.57%. The attention heatmap analysis revealed that the boundary-masked attention model learned to concentrate attention on large grain boundaries in a manner consistent with curvature-driven grain growth physics, emerging from training without being explicitly encoded into the architecture. These results indicate that models trained on synthetic data can generalize to diverse OOD conditions without retraining, and that physics-informed attention may improve accuracy when the boundary morphology matches the training domain.

2606.17345 2026-06-17 cs.LG cs.AI 交叉投稿

Counterfactual Optimization of Baseball Pitch Sequences and Estimation of Its Impact on Season-Level Statistics

棒球投球序列的反事实优化及其对赛季级统计指标影响的估计

Ryota Takamido, Hiroki Nakamoto

发表机构 * Sports Innovation Organization, National Institute of Fitness and Sports in Kanoya(体育创新组织,国立健身与体育研究所)

AI总结 利用Transformer模型和反事实分析,优化MLB投球序列中的最终投球和设置投球,发现可显著提升赛季级表现(如K/9提高1.0以上),并提供了速度带有效位置等实用见解。

详情
AI中文摘要

尽管投球序列是棒球分析的核心话题,但以往研究主要关注单次打席中最终投球的优化,对前期设置投球的作用及其对长期赛季级表现的影响研究不足。为解决这些问题,本研究利用MLB Statcast数据进行了反事实分析。训练了一个基于Transformer的机器学习模型,用于预测目标投球是否会导致击球结果或挥空。然后,通过将最终投球或前期设置投球替换为替代的投球类型和位置,同时保持周围背景信息不变,生成了反事实投球序列。最优反事实选择定义为那些最小化预测击球概率的选择,并使用将模型输出与赛季统计指标关联的回归模型估计其对投手赛季统计指标的预期影响。结果表明,最终投球和设置投球的优化都可能显著影响赛季级表现,包括K/9提高超过1.0。分析还提供了若干实用见解,包括特定速度带的有效位置、投球指令的重要性以及通过中速投球扩展投球选择范围。这些发现定量支持了投球序列在棒球中的战略重要性。

英文摘要

Although pitch sequencing is a central topic in baseball analytics, previous studies have primarily focused on optimizing the final pitch within a single plate appearance, leaving the role of preceding setup pitches and their impact on long-term season-level performance insufficiently examined. To address these issues, this study conducted counterfactual analyses using MLB Statcast data. A Transformer-based machine-learning model was trained to predict whether a target pitch would result in an in-play outcome or swing-out. Counterfactual pitch sequences were then generated by replacing either the final pitch or the preceding setup pitch with alternative pitch types and locations while keeping the surrounding contextual information fixed. Optimal counterfactual selections were defined as those that minimized the predicted in-play probability, and their expected effects on pitchers' seasonal statistics were estimated using regression models linking model outputs to season statistics. The results suggest that the optimization of both final and setup pitches may substantially influence season-level performance, including improvements of more than 1.0 in K/9. The analyses also provided several practical insights, including velocity-band-specific effective locations, the importance of pitch commands, and the expansion of pitch-selection options through middle-velocity pitches. These findings quantitatively support the strategic importance of pitch sequencing in baseball.
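For scale, K/9 is strikeouts per nine innings pitched, K/9 = 9 × K / IP. The sketch below uses hypothetical season totals (not figures from the paper) to show what a 1.0 improvement means in raw strikeouts. Innings are passed as a real number, so thirds of an innings must be converted from the box-score notation first.

```python
def k_per_9(strikeouts, innings_pitched):
    """Strikeouts per nine innings: 9 * K / IP."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9.0 * strikeouts / innings_pitched


def extra_strikeouts_for_gain(k9_gain, innings_pitched):
    """Additional strikeouts needed over a season to raise K/9 by k9_gain."""
    return k9_gain * innings_pitched / 9.0


# Hypothetical pitcher: 200 strikeouts over 180 innings.
baseline = k_per_9(200, 180.0)
needed = extra_strikeouts_for_gain(1.0, 180.0)
improved = k_per_9(200 + needed, 180.0)
```

Under these assumed totals, a gain of 1.0 in K/9 corresponds to 20 additional strikeouts across the season.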

2606.17379 2026-06-17 cs.CV cs.AI eess.IV 交叉投稿

MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation

MeiBRD:元学习术中生物力学残余变形

Casey Meisenzahl, Jon Heiselman, Michael Holtz, Yubo Ye, Michael Miga, Linwei Wang

发表机构 * Rochester Institute of Technology(罗切斯特理工学院) Vanderbilt University(范德堡大学)

AI总结 提出混合配准框架,利用稀疏术中对应点自适应生物力学先验,通过图神经扩散函数学习残余变形,结合元学习从术中样本中快速适应,在肝脏体模上优于现有方法。

详情
AI中文摘要

由于软组织大幅变形且术中测量稀疏,精确的术中肝脏配准具有挑战性。生物力学模型通过先验知识正则化这一不适定问题,但由于简化假设而表现出持续的预测偏差,而数据驱动学习方法在数据效率、泛化能力和物理合理性方面存在困难。我们提出一个混合配准框架,利用稀疏术中对应点自适应生物力学先验。我们不是学习完整的变形场,而是学习一个校正线性生物力学预测的残余变形函数,该函数建模为图神经扩散函数,在3D肝脏网格上具有几何感知注意力。为了实现稀疏观测的长距离信息传递,我们从一个新颖的角度将稀疏术中测量视为“上下文”样本,其中残余变形函数的输入-输出对完全观测,将问题转化为从术中上下文样本中学习该残余函数,使用前馈元学习器。在可变形肝脏体模数据集上的实验表明,与刚性、生物力学和数据驱动基线相比,配准精度和泛化能力得到提升,特别是在分布外几何和变形情况下。

英文摘要

Accurate intraoperative liver registration is challenging because soft-tissue deformation is substantial while intraoperative measurements are sparse. Biomechanical models regularize this ill-posed problem with prior knowledge but exhibit persistent prediction bias due to simplifying assumptions, while data-driven learning solutions struggle with data efficiency, generalization, and physical plausibility. We propose a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. Rather than learning a full deformation field, we learn a residual deformation function that corrects linear biomechanical predictions, modeled as a graph neural diffusion function with geometry-aware attention over the 3D liver mesh. To enable long-range information transfer from sparse observations, we take a novel perspective of sparse intraoperative measurements as "context" samples where input-output pairs of the residual deformation function are fully observed, casting the problem as learning-to-learn this residual function from intraoperative context samples with feedforward meta-learners. Experiments on a deformable liver phantom dataset demonstrate improved registration accuracy and generalization compared to rigid, biomechanical, and data-driven baselines, particularly for out-of-distribution geometries and deformations.

2606.17398 2026-06-17 cs.CR cs.AI cs.SE 交叉投稿

SoK: AI-Augmented Binary Reversing

SoK: AI增强的二进制逆向工程

Yujeong Kwon, Yiyue Zhang, Shakhzod Yuldoshkhujaev, Kexin Pei, Dokyung Song, Hyungjoon Koo

发表机构 * Sungkyunkwan University(成均馆大学) The University of Chicago(芝加哥大学) Yonsei University(延世大学)

AI总结 系统化梳理AI增强二进制逆向工程领域,提出统一分类法涵盖传统与AI方法,揭示LLM和智能体AI的新角色,识别技术挑战与评估缺口。

Comments 20 pages, 7 tables, 3 figures

详情
AI中文摘要

二进制逆向工程是软件理解、漏洞发现、恶意软件调查和固件审计的基础。然而,由于编译过程中语义信息的不可逆丢失,它仍然具有固有的挑战性。机器学习、大型语言模型(LLM)和智能体AI系统的最新进展加速了AI增强二进制逆向工程的采用。然而,由此产生的工作在逆向领域、工件表示、学习方法和评估实践方面变得越来越分散。本文首次对AI增强二进制逆向工程的知识进行了全面的系统化。我们分析了自2015年以来发表的144篇研究论文,并根据推理任务将其组织成22个二进制逆向领域。我们进一步引入了一个统一的分类法,涵盖传统和AI增强的逆向流程。我们的分类法连接了传统分析技术、二进制衍生工件、表示策略、学习范式和下游推理任务,同时阐明了LLM和智能体AI系统的新兴角色。通过建立通用词汇和结构化框架,我们提供了该领域过去十年演变的整体视图。我们的研究揭示了看似不同方法背后的共同结构,突出了持续存在的技术挑战和评估缺口,并确定了未来研究的有希望的机会。总的来说,这些见解阐明了该领域的当前状态,并为下一代可靠且可扩展的AI增强二进制逆向系统奠定了基础。

英文摘要

Binary reversing is fundamental to software understanding, vulnerability discovery, malware investigation, and firmware auditing. However, it remains inherently challenging due to the irreversible loss of semantic information during compilation. Recent advances in machine learning, large language models (LLMs), and agentic AI systems have accelerated the adoption of AI-augmented binary reversing. Yet, the resulting body of work has become increasingly fragmented across reversing domains, artifact representations, learning approaches, and evaluation practices. This paper presents the first comprehensive systematization of knowledge on AI-augmented binary reversing. We analyze 144 research papers published since 2015, and organize them into 22 binary reversing domains according to the inference tasks. We further introduce a unified taxonomy spanning conventional and AI-augmented reversing pipelines. Our taxonomy connects traditional analysis techniques, binary-derived artifacts, representation strategies, learning paradigms, and downstream inference tasks, while clarifying the emerging roles of LLMs and agentic AI systems. By establishing a common vocabulary and structured framework, we provide a holistic view of the field's evolution over the past decade. Our study reveals common structures underlying seemingly disparate approaches, highlights persistent technical challenges and evaluation gaps, and identifies promising opportunities for future research. Collectively, these insights clarify the current state of the field and provide a foundation for the next generation of reliable and scalable AI-augmented binary reversing systems.

2606.17403 2026-06-17 cs.CV cs.AI 交叉投稿

Bridging Spatial And Frequency Views For Disaster Assessment: Benefits And Limitations

桥接空间与频率视角进行灾害评估:优势与局限

Shikha V. Chandel, Yadav Raj Ghimire, Timothy Agboada, Leila Hashemi-Beni

发表机构 * College of Science and Technology(科学与技术学院) Computational Data Science and Engineering(计算数据科学与工程)

AI总结 本研究对比了空间域、频率域及双域深度学习方法在建筑损伤分类中的表现,发现双域模型优于单域模型,但所有模型对轻微损伤检测仍存在困难。

Comments Copyright 2026 IEEE. Published in the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

详情
AI中文摘要

从卫星图像快速评估建筑损伤对于有效的灾害响应和恢复至关重要。虽然大多数深度学习方法依赖于空间域特征,但频率域表示可以捕捉互补的结构线索,如碎片模式和坍塌引起的纹理。本研究使用来自xView2(xBD)数据集灾后图像,对空间域、频率域和双域深度学习方法进行了受控比较,用于多类建筑损伤分类。为确保公平,所有模型均基于EfficientNet-B0骨干网络,并在相同设置下训练,仅输入表示和融合策略不同。使用准确率、宏F1分数、每类指标和混淆矩阵评估性能。结果表明,双域模型比单域方法提供了可衡量的改进。双空间配置实现了最高的测试准确率(0.4688)和最低的损失,而仅空间模型获得了最佳的宏F1分数(0.4254),表明类别性能更平衡。相比之下,仅频率模型表现最差并出现过拟合,表明泛化能力有限。尽管有这些改进,所有模型仍难以检测细微损伤级别,特别是Minor类别,这是由于类别不平衡和细粒度视觉模糊性。虽然双域方法改进了严重损伤的检测,但挑战依然存在。这些发现突出了混合表示的优势和局限,并推动了未来在数据平衡、高级融合和正则化方面的工作。

英文摘要

Rapid assessment of building damage from satellite imagery is essential for effective disaster response and recovery. While most deep learning methods rely on spatial-domain features, frequency-domain representations can capture complementary structural cues such as debris patterns and collapse-induced textures. This study presents a controlled comparison of spatial-domain, frequency-domain, and dual-domain deep learning approaches for multi-class building damage classification using post-disaster imagery from the xView2 (xBD) dataset. To ensure fairness, all models are built on an EfficientNet-B0 backbone and trained under identical settings, differing only in their input representations and fusion strategies. Performance is evaluated using accuracy, macro F1-score, per-class metrics, and confusion matrices. Results show that dual-domain models provide measurable improvements over single-domain approaches. The dual spatial configuration achieves the highest test accuracy (0.4688) and lowest loss, while the spatial-only model attains the best macro F1-score (0.4254), indicating more balanced class performance. In contrast, frequency-only models perform worst and exhibit overfitting, suggesting limited generalization. Despite these gains, all models struggle to detect subtle damage levels, particularly the Minor class, due to class imbalance and fine-grained visual ambiguity. While dual-domain approaches improve detection of severe damage, challenges remain. These findings highlight the benefits and limitations of hybrid representations and motivate future work on data balancing, advanced fusion, and regularization.
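The abstract reports that one configuration wins on accuracy while another wins on macro F1. A minimal sketch of why the two metrics can disagree under class imbalance is given below; the two-class toy labels (`none`, `minor`) and both prediction lists are hypothetical and are not taken from the paper or from xBD.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy set: 8 undamaged buildings, 2 with minor damage.
y_true = ["none"] * 8 + ["minor"] * 2
always_none = ["none"] * 10                      # ignores the rare class
balanced = ["none"] * 6 + ["minor"] * 2 + ["minor"] * 2  # finds the rare class

for name, pred in [("always_none", always_none), ("balanced", balanced)]:
    print(name, accuracy(y_true, pred), round(macro_f1(y_true, pred, ["none", "minor"]), 3))
```

Both toy predictors reach the same accuracy, but only the second one scores well on macro F1, which is the kind of gap the abstract attributes to more balanced per-class performance.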

2606.17409 2026-06-17 cs.LG cs.AI 交叉投稿

Discrete Autoregressive Transformer for Generative Mechanism Synthesis

用于生成式机构综合的离散自回归Transformer

Anar Nurizada, Anurag Purwar

发表机构 * Computer-Aided Design and Innovation Lab, Department of Mechanical Engineering, Stony Brook University(石溪大学机械工程系计算机辅助设计与创新实验室)

AI总结 提出离散自回归Transformer,将平面路径综合转化为条件序列建模,以目标曲线的VAE潜变量和机构类型词元为条件生成关节坐标,得到多样且准确的机构设计。

详情
AI中文摘要

平面路径综合要求机构的连杆曲线与给定轨迹相匹配;从曲线到连杆机构的映射本质上是一对多的,涵盖四杆、六杆和八杆拓扑。我们在一个包含一百多万个机构的精选语料库上,以基于仿真的评估来处理这一设计问题,报告经正向运动学求解和几何对齐后的Chamfer距离与动态时间规整距离。我们将综合问题表述为条件自回归序列建模:关节坐标被均匀量化为词元(token),并由仅解码器(decoder-only)Transformer生成,其条件输入为目标曲线的变分自编码器(VAE)潜变量和一个显式的机构类型词元。训练将词元交叉熵与高斯平滑的分箱(bin)辅助损失相结合,后者保留了各分箱之间的序数结构。推理时,一个有界的潜变量噪声调度在每个噪声水平下对所有机构类型进行解码;我们按几何误差保留前五个候选,从而无需数据集查找即可得到多样且准确的机构族。在留出测试集上,总体平均Chamfer距离为$0.0132$,平均动态时间规整距离为$0.153$;一个在VAE空间中以训练集近邻潜变量为条件的潜空间$k$近邻基线,使用同一解码器,在拓扑匹配情形下的平均Chamfer距离为$0.0071$,平均动态时间规整距离为$0.117$。

英文摘要

Planar path synthesis requires mechanisms whose coupler curves match a prescribed trajectory; the mapping from curve to linkage is inherently one-to-many across four-, six-, and eight-bar topologies. We address this design problem with simulation-grounded evaluation on a curated corpus of over one million mechanisms, reporting Chamfer distance and dynamic time warping after forward kinematics and geometric alignment. We formulate synthesis as conditional autoregressive sequence modeling: joint coordinates are uniformly quantized to tokens and generated by a decoder-only transformer with a variational-autoencoder (VAE) latent of the target curve and an explicit mechanism-type token. Training combines token cross-entropy with a Gaussian-smoothed bin auxiliary loss that respects ordinal structure among bins. At inference, a bounded latent-noise schedule decodes all mechanism types at each noise level; we retain the top five candidates by geometric error, yielding diverse accurate families without dataset lookup. On held-out tests, aggregate mean Chamfer distance is $0.0132$ and mean dynamic time warping is $0.153$; a latent $k$-nearest-neighbor baseline that conditions on training-set neighbor latents in VAE space achieves matched-topology mean Chamfer distance $0.0071$ and mean dynamic time warping $0.117$ using the same decoder.
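The tokenization and auxiliary loss described in this abstract can be illustrated with a short sketch. The coordinate range `[-1, 1]`, the 256-bin vocabulary, and the value of `sigma` are assumptions chosen for illustration; the abstract does not state them, and the function names are hypothetical.

```python
import math


def quantize(x, lo, hi, n_bins):
    """Map a coordinate in [lo, hi] to an integer bin index (the token)."""
    x = min(max(x, lo), hi)                 # clamp out-of-range values
    idx = int((x - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)             # x == hi belongs to the last bin


def dequantize(idx, lo, hi, n_bins):
    """Return the centre of bin idx."""
    width = (hi - lo) / n_bins
    return lo + (idx + 0.5) * width


def gaussian_soft_target(idx, n_bins, sigma):
    """Soft label over bins: a discretised Gaussian centred on the true bin."""
    weights = [math.exp(-0.5 * ((j - idx) / sigma) ** 2) for j in range(n_bins)]
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(log_probs, target):
    """Cross-entropy between a soft target and predicted log-probabilities."""
    return -sum(t * lp for t, lp in zip(target, log_probs))


token = quantize(0.3, -1.0, 1.0, 256)
print(token, dequantize(token, -1.0, 1.0, 256))
```

Because the soft target places most of its mass on the true bin and its neighbours, a prediction that lands one bin away is penalised less than one that lands far away, which is one way to respect the ordinal structure among bins that the abstract mentions.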

2606.17420 2026-06-17 eess.IV cs.AI q-bio.QM 交叉投稿

Feynman Kac Reweighted Schrödinger Bridge Matching for Surface-Based Tau PET Harmonization

基于Feynman Kac重加权薛定谔桥匹配的皮层表面Tau PET标准化

Jianwei Zhang, Xinyu Nie, Jiaxin Yue, Yonggang Shi

发表机构 * Stevens Neuroimaging and Informatics Institute, University of Southern California(斯蒂文斯神经影像与信息学研究所,南加州大学) Ming Hsieh Department of Electrical and Computer Engineering of Viterbi School of Engineering, University of Southern California(明希德电气与计算机工程系,维特比工程学院,南加州大学) Alfred E. Mann Department of Biomedical Engineering of Viterbi School of Engineering, University of Southern California(阿尔弗雷德·E·曼生物医学工程系,维特比工程学院,南加州大学)

AI总结 提出Feynman Kac重加权薛定谔桥匹配(FKRSBM)模型,通过熵正则化最优传输实现源域与目标域间的随机传输,结合子群感知端点提议和球面卷积骨干网络,在Tau PET SUVR图上实现优于现有方法的分布对齐和下游疾病分类。

详情
AI中文摘要

Tau PET成像对于追踪阿尔茨海默病进展至关重要,但不同站点间的扫描仪、协议和放射性示踪剂的系统差异引入了非生物变异性,这会增加生物标志物方差、降低对疾病效应的敏感性,并可能偏倚下游临床评估。标准化方法旨在去除这些站点引起的偏移,同时保留有生物学意义的信号,然而现有方法在源队列和目标队列具有不同子群组成时难以应对,存在将站点效应与生物学变异(如tau阳性状态)混淆的风险。我们提出Feynman Kac重加权薛定谔桥匹配(FKRSBM)模型来解决这一问题。与基于扩散的方法通过高斯噪声先验路由数据不同,FKRSBM通过熵正则化最优运输学习源分布和目标分布之间的直接随机传输过程。为了实现生物学一致的传输,FKRSBM结合了由参考桥测度的Feynman Kac重加权导出的子群感知端点提议,完全通过数据层面的分层重要性抽样实现,无需对底层桥匹配求解器或网络架构进行任何更改。对于基于表面的神经影像,FKRSBM采用在皮层网格上运行的球面卷积骨干网络进行顶点级标准化。我们在tau PET SUVR图上评估该方法,将HABS-HD队列的PI-2620数据标准化到ADNI的AV-1451域。与ComBat、CycleGAN、基于扩散的方法(DF)和无正则化的扩散薛定谔桥匹配(DSBM)相比,FKRSBM实现了更优的分布对齐、更低的tau阳性符号不匹配、更强的APOE子群对齐以及改进的下游疾病分类性能。

英文摘要

Tau PET imaging is central to tracking Alzheimer's disease progression, but systematic differences between scanners, protocols, and radiotracers across sites introduce nonbiological variability that inflates biomarker variance, reduces sensitivity to disease effects, and can bias downstream clinical assessments. Harmonization methods aim to remove these site-induced shifts while preserving biologically meaningful signal, yet existing approaches struggle when source and target cohorts differ in subgroup composition, risking conflation of site effects with biological variation such as tau-positivity status. We propose the Feynman Kac Reweighted Schrödinger Bridge Matching (FKRSBM) model to address this problem. Rather than routing data through a Gaussian noise prior as in diffusion-based methods, FKRSBM learns a direct stochastic transport process between source and target distributions via entropy-regularized optimal transport. To enforce biologically consistent transport, FKRSBM incorporates a subgroup-aware endpoint proposal derived from a Feynman Kac reweighting of the reference bridge measure, implemented entirely through stratified importance sampling at the data level and requiring no changes to the underlying bridge-matching solver or network architecture. For surface-based neuroimaging, FKRSBM employs a spherical convolutional backbone operating on cortical meshes to perform vertex-level harmonization. We evaluate the method on tau PET SUVR maps, harmonizing PI-2620 data from the HABS-HD cohort into the AV-1451 domain of ADNI. Compared against ComBat, CycleGAN, a diffusion-based method (DF), and unregularized Diffusion Schrödinger Bridge Matching (DSBM), FKRSBM achieves superior distributional alignment, reduced tau-positivity sign mismatch, stronger APOE subgroup alignment, and improved downstream disease classification performance.
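The abstract says the subgroup-aware proposal is realised through stratified importance sampling at the data level. The sketch below shows only the generic idea of reweighting source samples so that their subgroup mix matches the target cohort; it is not the paper's Feynman Kac derivation, and the subgroup labels and function names are hypothetical.

```python
from collections import Counter


def stratum_weights(source_groups, target_groups):
    """Importance weight per subgroup: target share divided by source share."""
    p = Counter(source_groups)
    q = Counter(target_groups)
    n_p, n_q = len(source_groups), len(target_groups)
    return {g: (q[g] / n_q) / (p[g] / n_p) for g in p}


def sample_probabilities(source_groups, target_groups):
    """Per-sample sampling probabilities for the source cohort (sum to 1)."""
    w = stratum_weights(source_groups, target_groups)
    raw = [w[g] for g in source_groups]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical cohorts: the source is 20% tau-positive, the target 50%.
source = ["pos"] * 2 + ["neg"] * 8
target = ["pos"] * 5 + ["neg"] * 5
probs = sample_probabilities(source, target)
print(stratum_weights(source, target), sum(probs[:2]))
```

Drawing source endpoints with these probabilities (for example with `random.choices(..., weights=probs)`) would give the sampled pairs the target's subgroup proportions without touching the solver, which matches the "data level only" property the abstract emphasises.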

2606.17437 2026-06-17 cs.CV cs.AI 交叉投稿

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

超声心动图视频标准视图分类的时空融合模型

Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang, Yujia Yang, Yun Dai, Jian Liu, Jie Wang

发表机构 * Department of Ultrasound, The First Affiliated Hospital of Chengdu Medical College, School of Clinical Medicine, Chengdu Medical College(成都医学院第一附属医院超声科,临床医学院) College of Computer Science, Sichuan University(四川大学计算机学院) Department of Medical Ultrasound, West China Hospital of Sichuan University(四川大学华西医院超声科) Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College(中国医学科学院北京协和医学院肿瘤医院)

AI总结 针对超声心动图视图分类中数据稀缺、时空特征难以融合的问题,提出基于不确定性感知的CNN-LSTM双流融合模型,在最大公开数据集EV9V上取得竞争性能。

详情
AI中文摘要

超声心动图标准视图的自动分类对于高效的临床工作流程至关重要,但面临三个主要挑战。首先,公开可用的数据集稀缺,且规模和视图覆盖范围有限。其次,一些现代视频级架构在超声心动图视图分类中的性能尚未得到充分探索。第三,某些视图类别在空间外观上高度相似,使得单帧特征不足以区分,而异质的帧质量使得鲁棒的时序信息融合变得复杂。为了解决这些挑战,我们发布了九视图超声心动图视频(EV9V)数据集,包含5,138个视频、910,579帧和9个标准视图,据我们所知,这是最大的公开超声心动图视频数据集。利用EV9V,我们系统地基准测试了代表性的视频分类架构,包括卷积神经网络(CNN)、循环神经网络(RNN)和Transformer。此外,我们提出了一种时空融合模型(STFM),一种高效的双流CNN-LSTM(长短期记忆)框架,联合捕获空间解剖结构和时间心脏动力学。所提出的框架利用不确定性感知学习在训练期间优先采样代表性视频片段,并在推理期间进行基于证据的融合,提高了对超声心动图视频中帧质量变化的鲁棒性。大量实验表明,我们的方法在各种视频分类模型中取得了竞争性能,验证了不确定性感知时空学习在超声心动图视图分类中的有效性。代码可在以下网址获取:https://github.com/bgx666/stfm。

英文摘要

Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.

2606.17461 2026-06-17 cs.AR cs.AI cs.LG 交叉投稿

AUTOGATE: Automated Clock Gating via Toggling-Aware LLM-based RTL Rewriting

AUTOGATE:基于翻转感知的LLM驱动RTL重写的自动时钟门控

Yiting Wang, Chenhui Deng, Chia-Tung Ho, Yanqing Zhang, Zhuo Feng, Cunxi Yu, Ang Li, Gang Qu, Brucek Khailany

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校) NVIDIA(英伟达)

AI总结 提出AUTOGATE框架,通过ML-LLM协同设计将波形翻转迹线转化为紧凑表示,指导LLM进行RTL重写,实现层次化代码库中的时钟门控优化,平均降低动态功耗49.31%。

Comments 9 pages, 6 figures, 7 tables

详情
AI中文摘要

细粒度时钟门控(FGCG)是降低动态功耗最有效的技术之一,但当前的FGCG优化流程仍主要依赖手动操作。近期基于LLM的RTL优化方法受限于两个关键缺陷:(1)无法处理跨越数百万周期的长波形迹线,(2)难以在保持正确性的同时将优化扩展到大型层次化代码库。在本工作中,我们提出了AUTOGATE,这是首个面向工业级RTL功耗优化的智能体框架,支持在大型层次化代码库中进行工作负载感知的时钟门控优化。AUTOGATE引入了机器学习(ML)与LLM的协同设计,桥接了波形级分析与RTL重写。具体而言,我们设计了一种基于ML的聚类算法,将原始翻转迹线提炼为紧凑的结构化表示,以指导基于LLM的RTL重写。这使得无需LLM直接处理原始波形数据即可准确识别和应用时钟门控机会。为增强可扩展性,AUTOGATE采用层次化多智能体架构,将大型设计分解为可独立优化的模块,从而在深层设计层次中实现协调优化。我们在从小型RTL设计到大型工业级代码库的多样化设计集上评估了AUTOGATE。实验结果表明,与基线相比,AUTOGATE持续降低动态功耗。在小型设计套件上,AUTOGATE平均降低动态功耗49.31%。在工业级设计上,它在NVDLA和BlackParrot上分别实现了19.34%和7.96%的动态功耗降低,在高度优化的专有生产设计上最高降低6.86%。

英文摘要

Fine-grain clock gating (FGCG) is among the most effective techniques for reducing dynamic power, yet current FGCG optimization flows remain largely manual. Recent LLM-based RTL optimization approaches remain limited by two key drawbacks: (1) the inability to process long waveform traces spanning millions of cycles, and (2) the difficulty of scaling optimization to large hierarchical codebases while preserving correctness. In this work, we present AUTOGATE, the first agentic framework for industry-grade RTL power optimization, enabling workload-aware clock-gating optimization across large hierarchical codebases. AUTOGATE introduces a Machine Learning (ML)-LLM co-design that bridges waveform-level analysis and RTL rewriting. Specifically, we design an ML-based clustering algorithm that distills raw toggling traces into compact, structured representations that guide LLM-based RTL rewriting. This enables accurate identification and application of clock-gating opportunities without requiring LLMs to directly process raw waveform data. To enhance scalability, AUTOGATE employs a hierarchical multi-agent architecture that decomposes large designs into independently optimizable modules, enabling coordinated optimization across deep design hierarchies. We evaluate AUTOGATE on a diverse set of designs ranging from small RTL designs to large industrial-grade codebases. Experimental results show that AUTOGATE consistently reduces dynamic power relative to baselines. Across the small-design suite, AUTOGATE reduces dynamic power by 49.31% on average. On industry-scale designs, it achieves 19.34% and 7.96% dynamic power reductions on NVDLA and BlackParrot, respectively, and up to 6.86% on highly optimized proprietary production designs.
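To make the idea of distilling raw toggling traces into a compact representation concrete, here is a minimal sketch that reduces a per-cycle register trace to a toggle rate plus a list of long idle windows. It is not AUTOGATE's clustering algorithm; the function names and the `min_len` threshold are hypothetical.

```python
def toggles(values):
    """1 where the sampled value differs from the previous cycle, else 0."""
    return [int(a != b) for a, b in zip(values, values[1:])]


def idle_windows(toggle_bits, min_len):
    """Return (start, length) for runs of at least min_len idle cycles."""
    windows, start = [], None
    for i, bit in enumerate(toggle_bits + [1]):  # sentinel closes a trailing run
        if bit == 0 and start is None:
            start = i
        elif bit == 1 and start is not None:
            if i - start >= min_len:
                windows.append((start, i - start))
            start = None
    return windows


def summarize(trace, min_len=4):
    """Compact summary of one signal: toggle rate and long idle windows."""
    bits = toggles(trace)
    rate = sum(bits) / len(bits) if bits else 0.0
    return {"toggle_rate": rate, "idle_windows": idle_windows(bits, min_len)}


# Hypothetical 9-cycle trace of a single register bit.
print(summarize([0, 1, 1, 1, 1, 1, 1, 0, 1]))
```

A summary of this kind stays small even when the underlying trace spans millions of cycles, which is the property the abstract relies on to keep raw waveform data out of the LLM prompt.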

2606.17555 2026-06-17 cs.CR cs.AI cs.CE cs.ET 交叉投稿

An AI Security Agent for Banking: Multi-Vector Fraud and AML Detection Across Retail and Corporate Accounts

面向银行业的人工智能安全代理:零售和企业账户的多向量欺诈与反洗钱检测

Joseph Walusimbi, Joshua Benjamin Ssentongo

发表机构 * Engineering Soroti University, Uganda

AI总结 提出一种融合LSTM序列模型、统计速度/阈值监控和图网络的三组件架构,并行处理交易流和会话流,在合成数据集上交易流F1达0.787,会话流F1达0.867,并集成客户验证聊天机器人(96.6%身份验证准确率)和分析师案例摘要助手(99.3%行动推荐F1)。

Comments 7 pages, 1 figure, 5 tables

详情
AI中文摘要

银行同时面临基于特征签名的欺诈(无卡交易攻击、账户接管、ATM卡克隆)和行为型金融犯罪(拆分交易、分层转移、钱骡网络、商业电子邮件入侵),这两类威胁的检测需求有着根本差异。静态规则引擎能够可靠捕获暴力破解和高频事件,但在结构上无法识别商业电子邮件入侵(BEC)付款重定向、会话劫持和洗钱分层,因为这些行为被刻意设计成在单笔交易或单次会话层面与合法活动难以区分。本文提出一种面向零售和企业银行业务的人工智能安全代理,通过三组件融合架构弥补这一差距,该架构运行在两个并行事件流上:交易流(卡欺诈、ACH/电汇欺诈、反洗钱类别)和会话流(账户接管、会话劫持、SIM卡交换、内部人员滥用)。每个流结合了捕获每个账户行为历史的LSTM序列模型、统计速度/阈值监控器,以及用于洗钱检测、捕获账户-对手方关系模式(扇入、扇出、过手比率)的图/网络模块。在包含237,669笔交易和113,508个会话、涵盖13个威胁类别和3,470个模拟账户的合成事件日志上的实验表明,所提模型在交易流上的总体F1为0.787,在会话流上为0.867;相比之下,基于规则的基线为0.562/0.733,仅LSTM基线为0.655/0.713。该代理还包括一个面向客户的交易验证聊天机器人(身份验证准确率96.6%,大规模重置攻击检测率86.8%)和一个分析师案例摘要助手(行动推荐F1为99.3%),Critical(关键)级别自动响应延迟的第95百分位低于0.43毫秒。

英文摘要

Banks simultaneously face signature-based fraud (card-not-present attacks, account takeover, ATM cloning) and behavioural financial crime (structuring, layering, mule networks, business email compromise) -- two threat families with fundamentally different detection requirements. Static rule engines that reliably catch brute-force and high-velocity events are structurally blind to business-email-compromise (BEC) payment redirection, session hijacking, and money-laundering layering, which are engineered to appear indistinguishable from legitimate activity at the individual transaction or session level. This paper presents an AI security agent for retail and corporate banking that addresses this gap through a three-component fusion architecture operating on two parallel event streams: a transaction stream (card fraud, ACH/wire fraud, AML categories) and a session stream (account takeover, session hijacking, SIM-swap, insider abuse). Each stream combines an LSTM sequence model capturing per-account behavioural history, a statistical velocity/threshold monitor, and a graph/network module capturing account-counterparty relationship patterns (fan-in, fan-out, pass-through ratio) for money-laundering detection. Experiments on a synthetic event log of 237,669 transactions and 113,508 sessions across 13 threat categories and 3,470 simulated accounts demonstrate overall F1 of 0.787 (transaction stream) and 0.867 (session stream) for the proposed model, versus 0.562/0.733 for a rule-based baseline and 0.655/0.713 for an LSTM-only baseline. The agent includes a customer-facing transaction-verification chatbot (96.6% identity verification accuracy, 86.8% mass-reset attack detection) and an analyst case-summary assistant (99.3% action-recommendation F1), with Critical-tier automated response latency under 0.43 ms at the 95th percentile.
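The graph features named in this abstract (fan-in, fan-out, pass-through ratio) can be sketched from a plain list of transfers. The abstract does not define the pass-through ratio, so the definition used here (smaller of total inflow and outflow divided by the larger) is an assumption, as are the account names and amounts.

```python
from collections import defaultdict


def account_features(transfers):
    """transfers: iterable of (sender, receiver, amount) tuples."""
    senders_of = defaultdict(set)
    receivers_of = defaultdict(set)
    inflow = defaultdict(float)
    outflow = defaultdict(float)
    for src, dst, amount in transfers:
        receivers_of[src].add(dst)
        senders_of[dst].add(src)
        outflow[src] += amount
        inflow[dst] += amount

    features = {}
    for acct in set(inflow) | set(outflow):
        total_in, total_out = inflow[acct], outflow[acct]
        larger = max(total_in, total_out)
        features[acct] = {
            "fan_in": len(senders_of[acct]),      # distinct paying counterparties
            "fan_out": len(receivers_of[acct]),   # distinct receiving counterparties
            # Assumed definition: near 1.0 when money leaves about as fast as it arrives.
            "pass_through": min(total_in, total_out) / larger if larger else 0.0,
        }
    return features


# Hypothetical mule-like pattern: three deposits collected, then forwarded.
transfers = [("a", "m", 100.0), ("b", "m", 100.0), ("c", "m", 100.0), ("m", "x", 290.0)]
print(account_features(transfers)["m"])
```

An account with high fan-in, low fan-out, and a pass-through ratio close to 1 looks unremarkable transaction by transaction, which is why the abstract argues for relationship-level features alongside the sequence model.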

2606.17666 2026-06-17 cs.SE cs.AI 交叉投稿

FacProcessTwin: An LLM-Based System for Process Twin Development

FacProcessTwin: 一种基于LLM的流程孪生开发系统

Yash Pulse, Yong-Bin Kang, Abhik Banerjee, Prem Prakash Jayaraman

发表机构 * Swinburne University of Technology(斯威本科技大学)

AI总结 提出FacProcessTwin系统,利用大语言模型从工厂文档和操作员自然语言输入中自动生成流程模型并绑定实时数据,通过交互式流程图实现人机协同治理,在食品制造案例中准确率达95.2%,开发时间缩短至人工的1/6。

详情
AI中文摘要

流程孪生提供整个生产过程的实时表示。通过捕捉流程步骤如何相互作用,而不是像基于资产的数字孪生那样孤立地监控单个机器,它们有潜力推动整个过程的效率提升。然而,开发流程孪生成本高昂。它需要精确建模整个生产过程:其流程步骤、每个步骤使用的设备和产品特定设置,以及其流程变体。然后,生成的模型必须绑定到实时操作数据。我们提出FacProcessTwin,一个利用大语言模型(LLM)来减少开发时间的系统,它从工厂的流程文档和操作员的自然语言输入中构建流程孪生。FacProcessTwin生成完整的流程模型,然后自动将其流程步骤绑定到实时操作数据。生成的模型及其数据绑定被渲染为交互式流程图表,制造人员可以通过该图表监控和纠正系统的自主决策,例如解决安全关键绑定步骤中的不确定性。我们通过一家澳大利亚食品制造商的真实案例研究评估FacProcessTwin,涵盖16个生产流程,涉及冷藏、冷冻和无菌常温产品类别,并包括同一产品内的流程变体。结果表明,FacProcessTwin准确生成这些流程模型(与真实情况相比平均F1为95.2%),并且每个孪生的构建时间约为手动时间的六分之一。其人在环治理机制保持安全关键绑定的正确性:在模糊标签处,单次通过基线在75.0%的情况下静默错误绑定,而FacProcessTwin则推迟给操作员,错误绑定率为0。

英文摘要

Process twins provide real-time representations of entire production processes. By capturing how process steps interact, rather than monitoring a single machine in isolation as an asset-based digital twin does, they have the potential to drive efficiency gains across the whole process. However, developing a process twin is costly. It requires accurately modelling the entire production process: its process steps, the equipment and product-specific settings each step uses, and its process variations. The resulting model must then be bound to live operational data. We present FacProcessTwin, a system that leverages a large language model (LLM) to reduce this development time, building a process twin from a plant's process documentation and natural-language input from an operator. FacProcessTwin generates this complete process model and then automatically binds its process steps to live operational data. The generated model and its data bindings are rendered as an interactive process diagram through which manufacturing personnel can monitor and correct the system's autonomous decisions, such as resolving uncertainty at safety-critical binding steps. We evaluate FacProcessTwin through a real-world case study of an Australian food manufacturer, covering 16 production process flows that span chilled, frozen, and aseptic shelf-stable product categories and include process variations within the same product. The results show that FacProcessTwin generates these process models accurately (a mean F1 of 95.2% against ground truth) and builds each twin in roughly a sixth of the manual time. Its human-in-the-loop governance then keeps the safety-critical bindings correct: at ambiguous tags where a single-pass baseline silently mis-binds 75.0% of the time, FacProcessTwin defers to the operator and mis-binds none.

2606.17668 2026-06-17 cs.LG cs.AI q-bio.QM 交叉投稿

ASTEROID: A Spatiotemporal Information Transformer for Forecasting Multi-Step Time Series of Molecular Dynamics

ASTEROID: 用于分子动力学多步时间序列预测的时空信息变换器

Kexin Wu, Luonan Chen, Renxiao Wang

发表机构 * Department of Medicinal Chemistry, School of Pharmaceutical Sciences, Fudan University(药学院药物化学系,复旦大学) School of Mathematical Sciences and School of AI, Shanghai Jiao Tong University(数学科学学院和人工智能学院,上海交通大学)

AI总结 提出ASTEROID框架,通过将分子动力学轨迹重构为高维时空序列并集成时空信息变换方程到Transformer中,实现多步原子坐标的直接预测,在多个量子力学分子数据集上显著提升预测精度并降低计算成本。

Comments 32 pages, 10 figures

详情
AI中文摘要

分子动力学(MD)模拟计算需求高,尤其对于需要长期分析的大规模系统。准确预测MD模拟结果不仅是一个有吸引力的科学挑战,而且具有重要的实用价值。在这项工作中,我们开发了一个数据驱动框架,称为ASTEROID(用于推断动力学的先进时空变换器),可以直接预测多步原子坐标,避免传统的迭代积分。为此,我们的ASTEROID将MD轨迹重构为高维时空序列,并将时空信息(STI)变换方程集成到Transformer架构中。ASTEROID的核心创新在于其建模多尺度时空依赖性的能力。具体来说,对于空间依赖性,局部-全局自注意力机制捕获短程和长程相互作用。对于时间依赖性,编码器-解码器结构将全局上下文与自回归预测相结合。ASTEROID在几个量子力学衍生的分子数据集上进行了评估。我们的结果表明,ASTEROID不仅在各种基准测试中实现了比现有方法更高的多步预测精度,而且显著降低了传统MD模拟的计算成本。此外,该模型支持在扩展时间尺度上的迭代多步预测。这项工作为加速MD模拟建立了一个稳健且可推广的数据驱动范式。

英文摘要

Molecular dynamics (MD) simulation is computationally demanding, particularly for large-scale systems requiring long-term analysis. Accurately forecasting the outcomes of an MD simulation is not only an attractive scientific challenge but also has substantial practical value. In this work, we developed a data-driven framework, termed ASTEROID (Advanced Spatiotemporal TransformER fOr Inferring Dynamics), that can directly predict multi-step atomic coordinates, avoiding conventional iterative integration. For this purpose, ASTEROID reformulates MD trajectories as high-dimensional spatiotemporal sequences and integrates the Spatiotemporal Information (STI) Transformation equation into a Transformer architecture. The core innovation of ASTEROID lies in its ability to model multiscale spatiotemporal dependencies. In particular, for spatial dependencies, a local-global self-attention mechanism captures both short- and long-range interactions. For temporal dependencies, an encoder-decoder structure integrates global context with autoregressive forecasting. ASTEROID was evaluated on several quantum-mechanics-derived molecular datasets. Our results indicate that ASTEROID not only achieved higher multi-step prediction accuracy than existing methods on various benchmarks, but also did so at a significantly lower computational cost than conventional MD simulation. Moreover, the model supports iterative multi-step forecasting over an extended time scale. This work establishes a robust and generalizable data-driven paradigm for accelerating MD simulations.

2606.17702 2026-06-17 cs.CV cs.AI 交叉投稿

SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology

SegTME-UNI2: 一种基于基础模型的框架,用于组织病理学中可泛化的多类细胞分割和LLM驱动的肿瘤微环境表征

Wan Siti Halimatul Munirah Wan Ahmad, Faris Syahmi Samidi, Mohammad Badal Ahmmed, Vimal Angela Thiviyanathan, Selvam James Thavaraj, Anwar P. P. Abdul Majeed

发表机构 * Department of Data Science and Artificial Intelligence, School of Computing and Artificial Intelligence, Faculty of Engineering and Technology, Sunway University(双威大学工程与技术学院计算与人工智能学院数据科学与人工智能系) Faculty of Dentistry, Universiti Malaya(马来亚大学牙科学院)

AI总结 提出SegTME-UNI2框架,结合UNI2-H病理基础模型与双头UperNet解码器实现六类语义分割和核实例分割,通过三阶段伪标签课程学习解决标注不足问题,并利用LLM生成临床可解释的TME报告。

详情
AI中文摘要

从常规H&E染色组织学图像中表征肿瘤微环境(TME)需要同时进行细胞分割、特征提取和可解释的临床报告。我们提出了SEGTME-UNI2,一个统一框架来满足这些需求。其核心是UNI2-UPERHOVER,一个双头分割模型,将UNI2-H病理基础模型(ViT-Giant,在来自100K张切片的>100M张图块上预训练)与两个并行的UperNet解码器配对:一个用于六类语义分割,另一个用于水平-垂直梯度回归,从而实现基于分水岭的核实例分离。为了解决大型真实世界数据集中缺乏像素级标注的问题,UNI2-UPERHOVER经历了一个三阶段渐进式伪标签课程。每个阶段训练一个全新模型(无权重迁移),完全通过提高伪标签质量来驱动改进。阶段1使用人工标注的PanNuke(7,901张图像,189,744个细胞核,0.25 µm/像素);阶段2使用阶段1模型在271,711个TCGA-UT尺度0图块(0.5 µm/像素)上生成的熵过滤伪标签;阶段3使用阶段2模型在全部1,608,060个TCGA-UT图块(覆盖六个分辨率尺度,0.5-1.0 µm/像素)上生成的伪标签。分割输出输入到一个结构化的TME特征提取流水线,为每个图块计算20多个组成、形态、空间熵和细胞间距离指标。这些指标编码为JSON,并传递给微调的NVIDIA BioNeMo GPT模型,以生成临床可解释的TME叙述。在留出的PanNuke和TCGA-UT分区上的初步验证证明了框架的可行性和内部一致性。伪标注的TCGA-UT数据集和UNI2-UPERHOVER检查点已公开发布,以支持大规模TME分析和空间生物学研究。

英文摘要

Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SEGTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-head segmentation model pairing the UNI2-H pathology foundation model (ViT-Giant, pretrained on >100M tiles from 100K slides) with two parallel UperNet decoders: one for six-class semantic segmentation and one for horizontal-vertical gradient regression enabling watershed-based nuclear instance separation. To address the lack of pixel-level annotations in large real-world repositories, UNI2-UPERHOVER undergoes a three-stage progressive pseudo-label curriculum. Each stage trains a fresh model without weight transfer, driving improvement entirely via increased pseudo-label quality. Stage 1 uses human-annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 µm/pixel). Stage 2 uses entropy-filtered pseudo-labels from the Stage 1 model on 271,711 TCGA-UT scale-0 patches (0.5 µm/pixel). Stage 3 uses pseudo-labels from the Stage 2 model on all 1,608,060 TCGA-UT patches across six resolution scales (0.5-1.0 µm/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per-patch compositional, morphological, spatial entropy, and intercellular distance metrics. These are encoded as JSON and passed to a fine-tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held-out PanNuke and TCGA-UT partitions demonstrates framework feasibility and internal consistency. The pseudo-labelled TCGA-UT dataset and UNI2-UPERHOVER checkpoint are publicly released to support large-scale TME profiling and spatial biology research.
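Entropy filtering of pseudo-labels, as mentioned for Stage 2, can be sketched as follows. The abstract does not say whether filtering is applied per pixel or per patch, nor what threshold is used; the per-pixel treatment, the `max_entropy` value, and the ignore index `-1` below are all assumptions.

```python
import math

IGNORE = -1  # hypothetical "do not train on this pixel" label


def entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def filter_pseudo_labels(pixel_probs, max_entropy):
    """Keep the argmax class where the model is confident, else mark as ignored."""
    labels = []
    for probs in pixel_probs:
        if entropy(probs) <= max_entropy:
            labels.append(max(range(len(probs)), key=probs.__getitem__))
        else:
            labels.append(IGNORE)
    return labels


# Two hypothetical six-class predictions: one confident, one uniform.
confident = [0.97, 0.01, 0.01, 0.01, 0.0, 0.0]
uncertain = [1 / 6] * 6
print(filter_pseudo_labels([confident, uncertain], max_entropy=0.5))
```

Dropping high-entropy predictions trades label coverage for label quality, which fits the abstract's statement that each stage improves through better pseudo-labels alone.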

2606.17775 2026-06-17 cs.SD cs.AI cs.NE 交叉投稿

A Neuromorphic Trigger for Efficient Audio Event Detection

一种用于高效音频事件检测的神经形态触发器

Benjamin Hatton, Oliver Rhodes, Luca Peres

发表机构 * ICNS, University of Manchester(曼彻斯特大学ICNS)

AI总结 提出基于脉冲神经网络(SNN)的低成本前端触发器,选择性筛选音频片段,在异常声音检测和声音事件检测任务上分别实现0.97的F1分数和42.6倍FLOPs减少。

Comments 9 pages, 4 figures, 6 tables

详情
AI中文摘要

连续音频流的高效处理仍然是实时和资源受限系统面临的关键挑战。本文介绍了一种用于音频事件检测的神经形态触发器,基于脉冲神经网络(SNN)选择性门控下游模型的输入。所提出的触发器作为低成本前端,识别显著音频片段,仅将这些片段转发给计算密集型的模型进行分类等任务。触发器实现为轻量级全连接SNN,并在两个代表性任务上评估:异常声音检测(ASD)和声音事件检测(SED)。对于ASD,触发器在URBAN-SED数据集的类别无关形式下,实现了基于一秒片段的F1分数0.97,显示出识别相关音频区域的高可靠性。对于SED,触发器与Dang分类器结合在DCASE 2017挑战赛任务2数据集上,展示了潜在的42.6倍FLOPs减少,同时将基于事件错误率的下限从0.41降低到0.25。这些结果凸显了神经形态触发器作为实时、节能前端滤波器的潜力,能够大幅降低计算成本。

英文摘要

Efficient processing of continuous audio streams remains a key challenge for real-time and resource-constrained systems. This paper introduces a neuromorphic trigger for audio event detection, based on a spiking neural network (SNN) that selectively gates input to downstream models. The proposed trigger acts as a low-cost front-end, identifying salient audio segments and forwarding only these to a more computationally intensive model for tasks such as classification. The trigger is implemented as a lightweight fully connected SNN and evaluated on two representative tasks: Anomalous Sound Detection (ASD) and Sound Event Detection (SED). For ASD, the trigger achieves a one-second segment-based F1 score of 0.97 on a class-agnostic form of the URBAN-SED dataset, demonstrating high reliability in identifying relevant audio regions. For SED, the trigger is combined with the Dang classifier on the DCASE 2017 Challenge Task 2 dataset, showing a potential $42.6\times$ reduction in FLOPs while reducing the lower bound of the event-based error rate from 0.41 to 0.25. These results highlight the potential of neuromorphic triggers as real-time, energy-efficient front-end filters, enabling substantial reductions in computational cost.
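The gating idea can be illustrated with a single leaky integrate-and-fire neuron that decides which frames reach the expensive downstream model. The paper's trigger is a fully connected SNN; this one-neuron version, its `threshold` and `leak` values, and the frame energies are hypothetical simplifications.

```python
def lif_trigger(frame_energies, threshold=1.0, leak=0.9):
    """Return one boolean per frame: True when the neuron spikes."""
    v = 0.0
    spikes = []
    for x in frame_energies:
        v = leak * v + x          # leaky integration of the input
        if v >= threshold:
            spikes.append(True)
            v = 0.0               # reset the membrane potential after a spike
        else:
            spikes.append(False)
    return spikes


def forwarded_fraction(spikes):
    """Share of frames passed on to the downstream classifier."""
    return sum(spikes) / len(spikes)


# Hypothetical per-frame energies: mostly quiet, one short burst.
energies = [0.0, 0.1, 0.0, 0.7, 0.8, 0.0]
spikes = lif_trigger(energies)
print(spikes, forwarded_fraction(spikes))
```

If the downstream model runs only on frames where the trigger fires, its compute scales with the forwarded fraction rather than with the full stream length, which is the mechanism behind the FLOPs reduction the abstract reports.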

2606.17781 2026-06-17 cs.AR cs.AI 交叉投稿

MIVE: A Minimalist Integer Vector Engine for Softmax, LayerNorm and RMSNorm Acceleration

MIVE:用于Softmax、LayerNorm和RMSNorm加速的极简整数向量引擎

Kosmas Alexandridis, Giorgos Dimitrakopoulos

发表机构 * Integrated Circuits Lab, Electrical and Computer Engineering, Democritus University of Thrace (DUTH), Greece(色雷斯德谟克利特大学(DUTH)电气与计算机工程集成电路实验室,希腊)

AI总结 提出一种可编程的极简整数向量引擎MIVE,通过统一数据通路执行Softmax、LayerNorm和RMSNorm三种操作,最大化硬件共享,提升面积和硬件效率。

详情
AI中文摘要

大型语言模型(LLM)的快速增长加剧了对专用硬件加速器的需求,这些加速器必须满足严格的推理延迟和功耗约束。尽管矩阵乘法主导了整体计算工作负载,但非线性向量归一化操作(如LayerNorm、RMSNorm和Softmax)可能成为关键硬件瓶颈。现有加速器通常使用专用硬件块实现这些功能,导致资源重复和硅利用率低下。为解决这一限制,我们提出了一种极简整数向量引擎(MIVE),这是一种可编程架构,能够在统一数据通路内执行所有三种操作。通过利用LayerNorm、RMSNorm和Softmax之间的共同计算模式,所提出的向量引擎最大化硬件共享,同时减少实现开销。物理ASIC实现结果表明,MIVE提供全面的多函数支持,同时在面积和硬件效率方面优于大多数最先进的独立加速器。

英文摘要

The rapid growth of Large Language Models (LLMs) has intensified the need for specialized hardware accelerators that can satisfy stringent inference latency and power constraints. Although matrix multiplications dominate the overall computational workload, non-linear vector normalization operations, such as LayerNorm, RMSNorm and Softmax, can become critical hardware bottlenecks. Existing accelerators typically implement these functions using dedicated hardware blocks, leading to duplicated resources and inefficient silicon utilization. To address this limitation, we propose a Minimalist Integer Vector Engine (MIVE), a programmable architecture capable of executing all three operations within a unified datapath. By exploiting common computational patterns across LayerNorm, RMSNorm and Softmax, the proposed vector engine maximizes hardware sharing while reducing implementation overhead. Physical ASIC implementation results show that MIVE provides comprehensive multi-function support while achieving higher area and hardware efficiency than most state-of-the-art standalone accelerators.
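One way to see the common computational pattern across the three operations is that each is a scalar shift, a sum-reduction of an elementwise function, and an elementwise scale. The sketch below expresses that in floating-point Python purely for illustration; MIVE is an integer hardware engine, the abstract does not describe its datapath at this level, and the learned gain and bias terms of LayerNorm and RMSNorm are omitted. All names and the `EPS` value are hypothetical.

```python
import math

EPS = 1e-6  # hypothetical numerical stabiliser


def vector_op(x, center, accumulate, finish):
    """Shared skeleton: shift, sum-reduce an elementwise map, then scale."""
    c = center(x)
    shifted = [v - c for v in x]                      # step 1: subtract a scalar
    acc = sum(accumulate(v) for v in shifted)         # step 2: one reduction
    return [finish(v, acc, len(x)) for v in shifted]  # step 3: elementwise scale


def softmax(x):
    return vector_op(x, max, math.exp, lambda v, acc, n: math.exp(v) / acc)


def layernorm(x):
    mean = lambda xs: sum(xs) / len(xs)
    return vector_op(x, mean, lambda v: v * v,
                     lambda v, acc, n: v / math.sqrt(acc / n + EPS))


def rmsnorm(x):
    return vector_op(x, lambda xs: 0.0, lambda v: v * v,
                     lambda v, acc, n: v / math.sqrt(acc / n + EPS))


print(softmax([1.0, 2.0, 3.0]))
print(layernorm([1.0, 2.0, 3.0]))
print(rmsnorm([3.0, 4.0]))
```

Only the three small callbacks differ between the operations, which is the software analogue of sharing one reduction and one scaling path in hardware instead of building three dedicated blocks.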

2606.17824 2026-06-17 cs.CV cs.AI 交叉投稿

Human-in-the-Loop Atlas-Based 3D Asset Segmentation for Interactive Content Workflows

人在回路中基于图集的3D资产分割用于交互式内容工作流

Paul Julius Kühn, Saptarshi Neil Sinha, Jakob Hansen, Robin Horst

发表机构 * Fraunhofer IGD(弗劳恩霍夫计算机图形学研究所) Hochschule RheinMain(莱茵美因应用科学大学)

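AI summary Proposes a human-in-the-loop pipeline that produces a segmented atlas through greedy view selection, interactive segmentation with SAM 2, and UV back-projection; it supports downstream tasks such as material assignment and style transfer and is validated on eight cultural heritage objects.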

详情
AI中文摘要

将3D资产分割成有意义的区域仍然具有挑战性,尤其是当分割标准依赖于应用且需要用户控制时。我们提出了一种人在回路中的流水线,用于从3D模型生成分割的2D参数化图集,适用于交互式媒体、游戏和XR内容工作流。我们的方法首先使用基于采样表面点的贪心集合覆盖策略选择一组紧凑的渲染视图,然后支持使用SAM~2和Label Studio对这些视图进行交互式分割。生成的掩码被反投影到模型的UV参数化上,以产生统一的图集分割,支持下游生产任务,如逐段材质分配、风格迁移和语义标注。我们通过对八个文化遗产物体的基于演示的技术评估来评估该流水线。结果表明,该方法可以在不同几何形状上生成可用的分割图集,同时揭示了需要手动校正的常见问题,特别是精细结构、空腔和弱外观边界。

英文摘要

Segmenting 3D assets into meaningful regions remains challenging, especially when segmentation criteria are application-dependent and require user control. We present a human-in-the-loop pipeline for generating a segmented 2D parameterized atlas from a 3D model for interactive media, game, and XR content workflows. Our method first selects a compact set of rendered views using a greedy set cover strategy over sampled surface points, and then supports interactive segmentation of these views with SAM 2 and Label Studio. The resulting masks are back-projected onto the model's UV parameterization to produce a unified segmented atlas that supports downstream production tasks such as segment-wise material assignment, style transfer, and semantic labeling. We assess the pipeline through a demonstration-based technical evaluation on eight cultural heritage objects. The results show that the approach can generate usable segmented atlases across diverse geometries while revealing recurring sources of manual correction, particularly fine structures, cavities, and weak appearance boundaries.
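The greedy set cover step can be sketched as follows. This is a generic illustration, not the authors' implementation: the `visible` mapping (view id to the set of sampled surface point indices seen from that view) and the function name are assumptions, and how visibility is computed is outside the sketch.

```python
def greedy_view_selection(visible, num_points):
    """Pick views one at a time, each time taking the view that covers
    the most still-uncovered surface points (classic greedy set cover).

    visible: dict mapping a view id to a set of point indices (hypothetical input).
    Returns the chosen view ids and the points no view can see.
    """
    uncovered = set(range(num_points))
    selected = []
    while uncovered:
        # Sorting the keys makes tie-breaking deterministic.
        best = max(sorted(visible), key=lambda v: len(visible[v] & uncovered))
        gain = visible[best] & uncovered
        if not gain:
            break  # remaining points are not visible from any candidate view
        selected.append(best)
        uncovered -= gain
    return selected, uncovered
```

Points left in `uncovered` are those no candidate view can see, which loosely corresponds to the cavities the abstract lists as a recurring source of manual correction.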

2606.17836 2026-06-17 cs.CV cs.AI cs.CG cs.GR 交叉投稿

High-Fidelity 3D Geometric Reconstruction of Pelvic Organs from MRI: A Hybrid Deep Learning and Iterative Optimization Approach

高保真盆腔器官MRI三维几何重建:一种混合深度学习与迭代优化方法

Hui Wang, Xiaowei Li, Chenxin Zhang, Yifan Feng, Jianwei Zuo, Yumeng Tang, Xiuli Sun, Jianliu Wang, Bing Xie, Jiajia Luo

发表机构 * Institute of Medical Technology, Peking University Health Science Center, Peking University(北京大学医学部医学技术研究院,北京大学) Biomedical Engineering Department, Institute of Advanced Clinical Medicine, Peking University(北京大学先进临床医学研究院生物医学工程系) Department of Obstetrics and Gynecology, Peking University People’s Hospital(北京大学人民医院妇产科部)

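AI summary Proposes a hybrid deformable shape modeling framework that combines deep learning prediction with iterative optimization for high-fidelity 3D geometric reconstruction of the bladder, uterus, and rectum, outperforming existing methods in geometric fidelity and mesh quality.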

详情
AI中文摘要

从MRI中患者特定的盆腔器官几何三维重建对于盆底建模和下游患者特定分析至关重要。然而,以往研究主要关注图像分割或三维模型的下游使用,高保真、高质量几何的重建仍然劳动密集且缺乏标准化。本研究引入了一种混合可变形形状建模框架,将深度学习预测与迭代优化相结合,用于膀胱、子宫和直肠的重建。该框架包含三个核心组件:一种保持盆腔器官拓扑一致性的几何感知多级深度学习架构;一种平衡全局形状捕获和局部表面细化的两阶段摊销优化训练策略;以及一种整体协同机制——在训练阶段,迭代优化为深度学习提供监督,而在推理阶段,深度学习快速预测全局器官形态,随后通过迭代优化细化局部表面和网格质量。该框架在几何保真度上显著优于当前主流的基于深度学习的器官重建模型。对于各个解剖结构,重建的膀胱、直肠和子宫三维几何实现了显著更低的Chamfer距离值和更高的Dice相似系数分数。此外,在保持高计算效率的同时,所提出的架构产生了优越的整体体积网格质量。在患者层面,该框架在minSICN和minSIGE的10个最差元素上均获得了比传统几何后处理算法更高的平均值。

英文摘要

Patient-specific 3D reconstruction of pelvic organ geometry from MRI is important for pelvic floor modeling and downstream patient-specific analysis. However, while previous studies have focused primarily on either image segmentation or downstream use of 3D models, the reconstruction of high-fidelity, high-quality geometries remains labor-intensive and poorly standardized. This study introduces a hybrid deformable shape modeling framework that integrates deep learning prediction with iterative optimization for the reconstruction of the bladder, uterus, and rectum. The framework consists of three core components: a geometry-aware multi-level deep learning architecture that preserves topological consistency of pelvic organs; a two-stage amortized optimization training strategy that balances global shape capture and local surface refinement; and a holistic synergy mechanism in which iterative optimization provides supervision for deep learning during the training phase, and during inference, deep learning rapidly predicts the global organ morphology, followed by iterative optimization to refine local surfaces and mesh quality. The framework demonstrated marked superiority in geometric fidelity over current mainstream deep learning-based organ reconstruction models. For individual anatomical structures, the reconstructed 3D geometries for the bladder, rectum, and uterus achieved significantly lower Chamfer Distance values and higher Dice Similarity Coefficient scores. In addition, while maintaining high computational efficiency, the proposed architecture yielded superior overall volumetric mesh quality. At the patient level, the framework achieved higher mean values for the 10 worst elements for both minSICN and minSIGE compared to traditional geometric post-processing algorithms.
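For readers unfamiliar with the two fidelity metrics named above, a minimal pure-Python sketch follows. It assumes one common definition of each; Chamfer Distance in particular has several variants (squared distances, different averaging), and the abstract does not say which one the authors use.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance from A to B
    plus the same from B to A (one common variant, assumed here)."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)


def dice_coefficient(voxels_a, voxels_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) over sets of occupied voxel coordinates."""
    if not voxels_a and not voxels_b:
        return 1.0  # two empty masks are treated as identical
    return 2 * len(voxels_a & voxels_b) / (len(voxels_a) + len(voxels_b))
```

Lower Chamfer distance means the surfaces lie closer together, while a Dice value closer to 1 means the volumes overlap more, which is why the abstract reports the former as lower and the latter as higher.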

2606.17867 2026-06-17 cs.CV cs.AI 交叉投稿

A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease

阿尔茨海默病多模态生物标志物的定量分析

Antonio Scardace, Daniele Ravì

发表机构 * Department of Mathematics and Computer Science(数学与计算机科学系) University of Catania(卡塔尼亚大学) Department MIFT(MIFT部门) University of Messina(梅西纳大学)

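AI summary Integrates tau-PET, structural MRI, cognitive scores, and APOE4 data to quantify redundancy and predictive dependencies among multimodal biomarkers, relates tau topologies to atrophy, and decomposes the tau-cognition association, improving the interpretability of AD biomarker selection.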

Comments Accepted to ICTS4eHealth 2026

详情
AI中文摘要

尽管阿尔茨海默病(AD)研究中越来越多地采用多模态方法——旨在整合分子、结构、临床和遗传生物标志物以增强疾病表征——但这些模态之间的关系仍知之甚少。对其动态相互作用进行系统分析对于改进疾病建模、识别冗余评估以及减少患者负担和获取成本至关重要。在本文中,我们通过整合来自ADNI数据集的789名受试者的tau-PET、结构MRI、认知评分(MMSE和CDR)以及APOE4数据,对多模态AD生物标志物进行了定量分析。在我们的分析中,我们(A)量化跨模态互信息和解释方差以评估冗余和预测依赖性;(B)检查tau拓扑与跨脑区结构萎缩之间的关联以选择信息性ROI;(C)对tau-认知关联进行统计分解,分为萎缩相关和萎缩无关成分;(D)识别与认知衰退一致的主要神经退行性轨迹。本研究提供了跨模态关系的系统表征,提高了AD生物标志物的可解释性和选择。代码公开于:https://github.com/antonioscardace/Multimodal-AD。

英文摘要

Despite increasing adoption of multimodal approaches in Alzheimer's Disease (AD) research -- aimed at integrating molecular, structural, clinical, and genetic biomarkers to enhance disease characterization -- the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: https://github.com/antonioscardace/Multimodal-AD.

2606.17887 2026-06-17 cs.HC cs.AI 交叉投稿

AI Adoption Across a Multinational Workforce: Sociotechnical Conditions for GenAI Acceptance in Human Resources

AI在跨国劳动力中的采纳:人力资源中GenAI接受的社会技术条件

Dalia Ali, Maria José Rodríguez Velázquez, Manoel Horta Ribeiro, Vera Liao, Orestis Papakyriakopoulos

发表机构 * Technical University of Munich(慕尼黑技术大学) University of Michigan(密歇根大学) Princeton University(普林斯顿大学)

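AI summary Studies a multinational tech company's transition from a legacy HR system to a GenAI system, finding that employee adoption is shaped by sociotechnical conditions such as situational fit, search literacy, and trust calibration, and offers design recommendations for inclusive deployment.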

详情
AI中文摘要

生成式AI(GenAI)在工作场所的部署正在迅速加速。然而,谁采纳、谁受益、谁被落下以及为什么,这些问题仍未得到充分研究。在本文中,我们在一家从传统人力资源(HR)搜索系统过渡到GenAI支持系统的跨国科技公司的背景下调查这些动态,分析了搜索日志数据、调查数据(n=25)和十次半结构化访谈。我们的发现表明,采纳取决于GenAI系统的设计假设与员工的工作位置性(角色、口语、任期)之间的匹配。此外,我们发现员工对GenAI答案的信任是通过来源检查、系统间比较以及在怀疑时向同事或HR寻求意见来建立的。我们的贡献有两方面。首先,我们提供了在实时组织转型期间工作场所GenAI采纳的经验证据,表明采纳受到情境适配、搜索素养和信任校准等因素的影响。它还进一步受到知识条件的影响,例如系统的内容质量、员工培训和指导。其次,我们将这些发现转化为在高风险环境(如HR)中包容性部署和采纳的设计考虑。我们认为,组织应该设计系统时考虑它们对不同社会群体产生的角色和情境敏感的好处。他们还需要将组织知识基础设施视为AI基础设施,以提高GenAI系统的问责性和可用性。

英文摘要

Generative AI (GenAI) deployment in the workplace is accelerating rapidly. Nevertheless, questions of who adopts, who benefits, and who is left behind and why are still understudied. In this paper, we investigate these dynamics in the context of a multinational tech company transitioning from a legacy Human Resources (HR) search system to a GenAI-supported system, analyzing search log data, survey data (n=25), and ten semi-structured interviews. Our findings show that adoption depended on the fit between the GenAI system's design assumptions and employees' work positionalities (role, spoken language, tenure). Further, we find that employees' trust in GenAI answers was built through source-checking, comparison among systems, and seeking input from colleagues or HR when in doubt. Our contribution is twofold. First, we provide empirical evidence of workplace GenAI adoption during a live organizational transition, showing that adoption is influenced by factors such as situational fit, search literacy, and trust calibration. It is further shaped by knowledge conditions such as the system's content quality, employee training, and guidance. Second, we translate these findings into design considerations for inclusive deployment and adoption in high-stakes environments such as HR. We argue that organizations should design systems considering the role- and context-sensitive benefits they yield to different social groups. They also need to treat the organizational knowledge infrastructure as AI infrastructure to improve the accountability and usability of GenAI systems.

2606.17972 2026-06-17 cs.CV cs.AI 交叉投稿

SegDINO: Introducing Multi-Scale Structure into DINO for Efficient Medical Image Segmentation

SegDINO: 将多尺度结构引入DINO以实现高效医学图像分割

Sicheng Yang, Hongqiu Wang, Zhaohu Xing, Sixiang Chen, Qiuxia Yang, Yize Mao, Guang Yang, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Sun Yat-sen University Cancer Center(中山大学肿瘤防治中心) Imperial College London(帝国理工学院)

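AI summary Proposes the SegDINO framework, which introduces multi-scale structure into DINO through Token Pyramid Adaptation and Scale-Aware Decoding, achieving state-of-the-art medical image segmentation performance while remaining efficient.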

Comments Code: https://github.com/script-Yang/segdino_v2

详情
AI中文摘要

自监督DINO模型提供了强大的可迁移视觉表示,但直接应用于图像分割仍具挑战。现有方法通常依赖带有复杂上采样的重型解码器,引入大量参数和计算开销。我们观察到,向DINO特征引入尺度远比增加解码器容量更为关键。本文提出SegDINO,一种高效分割框架,将DINOv3骨干网络与轻量级尺度建模相结合。SegDINO引入令牌金字塔适应(TPA)将中间DINO特征重组为伪多尺度层次,以及尺度感知解码(SAD)实现高效的尺度内细化和自顶向下的多尺度传播。我们进一步整理了PanCT,一个包含284名患者专家标注胰腺肿瘤的新CT数据集,以评估SegDINO处理困难小病灶的能力。在PanCT和三个公共基准上的大量实验表明,SegDINO以高效率实现了最先进的结果。代码见 https://github.com/script-Yang/segdino_v2。

英文摘要

Self-supervised DINO models provide strong transferable visual representations, yet applying them directly to image segmentation remains challenging. Existing approaches commonly rely on heavy decoders with complex upsampling, introducing substantial parameter and computational overhead. We observe that introducing scale into DINO features is far more critical than increasing decoder capacity. In this work, we present SegDINO, an efficient segmentation framework that integrates a DINOv3 backbone with lightweight scale modeling. SegDINO introduces Token Pyramid Adaptation (TPA) to reorganize intermediate DINO features into a pseudo multi-scale hierarchy, and Scale-Aware Decoding (SAD) for efficient intra-scale refinement and top-down multi-scale propagation. We further curate PanCT, a new CT dataset containing 284 patients with expert-annotated pancreatic tumors, to assess SegDINO's ability to handle difficult small-lesion cases. Extensive experiments on PanCT and three public benchmarks demonstrate that SegDINO achieves state-of-the-art results with high efficiency. The code is available at https://github.com/script-Yang/segdino_v2.

2606.17989 2026-06-17 cs.CV cs.AI 交叉投稿

Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast Synthesis

先恢复语义,再生成更好:改进的潜在建模用于3D MRI重建和跨对比合成

Yonghao Chen, Sicheng Yang, Rui Tang, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Xi’an Jiaotong University(西安交通大学)

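AI summary Proposes a semantics-first latent modeling framework that uses a Latent Harmonization Encoder, a Semantic Recovery Block, and an Anatomy-aware Frequency Loss to address weak long-range anatomical coherence, semantic degradation, and over-smoothed reconstruction in 3D MRI compression, improving reconstruction and cross-contrast synthesis quality.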

Comments Code: https://github.com/script-Yang/RSF

详情
AI中文摘要

多对比磁共振成像(MRI)为临床诊断提供互补信息。然而,获取所有MRI序列通常耗时且成本高昂。最近的生成模型通过从可用对比推断缺失对比来进行跨对比合成以解决此问题。尽管如此,合成3D MRI面临重大挑战。由于体积巨大,直接在像素空间操作在计算上不可行;因此,常见方法是先将3D体积压缩到潜在空间,然后在该空间中训练生成模型。我们观察到现有压缩架构存在几个关键问题:它们未能保持长程解剖一致性,丢弃了临床有意义的语义,并依赖于导致过度平滑重建的优化目标。最终,这些缺陷损害了后续生成模型的性能。在这项工作中,我们提出了一种语义优先的潜在建模框架,用于3D MRI重建和跨对比合成。具体来说,我们引入了潜在协调编码器(LHE)来捕获全局解剖依赖关系,确保体积表示的一致性。为了减轻潜在压缩过程中的语义退化,我们进一步设计了语义恢复块(SRB),该块从自监督语义教师注入高级先验,增强潜在空间中对比感知的可分离性。此外,我们提出了解剖感知频率损失(AFL),以自适应地保留诊断相关的高频结构。在两个公共多对比MRI数据集上的大量实验表明,重建保真度和跨对比合成质量持续提升。我们的代码可在 https://github.com/script-Yang/RSF 获取。

英文摘要

Multi-contrast magnetic resonance imaging (MRI) provides complementary information for clinical diagnosis. However, acquiring all MRI sequences is often time-consuming and costly. Recent generative models perform cross-contrast synthesis to address this issue by inferring absent contrasts from the available ones. Nevertheless, synthesizing 3D MRI presents significant challenges. Due to the massive volume sizes, operating directly in the pixel space is computationally prohibitive; therefore, a common approach is to first compress the 3D volumes into a latent space and subsequently train generative models in that space. We observe that existing compression architectures face several critical issues: they under-preserve long-range anatomical coherence, discard clinically meaningful semantics, and rely on optimization objectives that lead to over-smoothed reconstructions. Ultimately, these shortcomings compromise the performance of subsequent generative models. In this work, we propose a semantics-first latent modeling framework for 3D MRI reconstruction and cross-contrast synthesis. Specifically, we introduce a Latent Harmonization Encoder (LHE) to capture global anatomical dependencies, ensuring coherent volumetric representations. To mitigate semantic degradation during latent compression, we further design a Semantic Recovery Block (SRB) that injects high-level priors from a self-supervised semantic teacher, enhancing contrast-aware separability in the latent space. Additionally, we propose an Anatomy-aware Frequency Loss (AFL) to adaptively preserve diagnostically relevant high-frequency structures. Extensive experiments on two public multi-contrast MRI datasets demonstrate consistent improvements in reconstruction fidelity and cross-contrast synthesis quality. Our code is available at https://github.com/script-Yang/RSF.

2606.18000 2026-06-17 cs.NI cs.AI 交叉投稿

A T-API-Compliant ReAct Agentic Loop for Optical Networks: Generic vs. Domain-Specific Tool Abstractions

一种符合T-API规范的ReAct智能循环用于光网络:通用与领域特定工具抽象

Seyed Morteza Ahmadian, Paolo Monti, Carlos Natalino

发表机构 * Department of Electrical Engineering, Chalmers University of Technology(查尔姆斯理工大学电子工程系)

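AI summary Presents the first T-API-compliant reasoning and acting (ReAct) loop; domain-specific composite tools achieve 90% oracle-validated correctness with threefold token savings compared to generic tools.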

Comments 4 pages, 2 figures, accepted for presentation at the 52nd European Conference on Optical Communications (ECOC), 2026

详情
AI中文摘要

光网络需要意图驱动的闭环智能管理,这是实现更高自治水平的关键。我们提出了首个符合T-API规范的推理与行动(ReAct)循环。我们表明,与通用工具相比,领域特定的复合工具实现了90%的oracle验证正确性,并节省了三倍的令牌。

英文摘要

Optical networks need intent-driven, closed-loop agentic management, a key enabler for higher autonomy levels. We present the first T-API-compliant reasoning and acting (ReAct) loop. We show that domain-specific composite tools achieve 90% oracle-validated correctness with threefold token savings compared to generic tools.

2606.18063 2026-06-17 cs.CV cs.AI cs.LG 交叉投稿

When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

当LLM分析疤痕:从图像到临床有意义的特征

Ruman Wang, Hangting Ye

发表机构 * Liaoning University of Traditional Chinese Medicine(辽宁中医药大学) School of Artificial Intelligence, Jilin University(吉林大学人工智能学院)

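AI summary Proposes the ScaFE framework, which uses an LLM as a knowledge-driven feature engineer to turn high-dimensional images into low-dimensional, clinically interpretable features, outperforming end-to-end deep learning methods on data-scarce scar classification.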

详情
AI中文摘要

医学图像分类面临一个基本困境:虽然深度学习模型在大规模数据上表现卓越,但现实临床场景中由于标注成本、隐私约束和疾病罕见性,常常遭受严重的数据稀缺。这一挑战在病理性疤痕分类中尤为突出,区分瘢痕疙瘩和增生性疤痕需要微妙的专家知识,且标注图像极其有限。我们提出一种新范式,将大型语言模型(LLM)重新定位为知识驱动的特征工程师,而非端到端分类器。我们将此框架称为ScaFE(疤痕特征工程)。我们的关键洞察是,LLM编码了丰富的医学知识,可以外部化为可执行的特征提取代码,从而将高维图像转化为低维、临床可解释的表示。具体来说,我们使用既定的疤痕评估标准提示LLM,生成确定性的Python代码,提取与临床评分系统(如温哥华疤痕量表)对齐的特征。我们的方法提供三个关键优势:(1)数据效率,通过将知识获取与统计学习解耦,在有限训练样本下实现稳健性能;(2)隐私保护,原始图像在本地处理,不暴露给外部LLM;(3)可解释性,通过基于临床推理的显式特征。在疤痕分类上的大量实验表明,在数据有限条件下,我们的方法始终优于端到端深度学习基线或使用LLM作为黑盒分类器,为将LLM集成到数据高效且临床透明的医学AI系统中开辟了有前景的方向。

英文摘要

Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that, under limited data conditions, our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.

2606.18108 2026-06-17 astro-ph.IM cs.AI 交叉投稿

Querying an astronomical database using large language models: the ALeRCE text-to-SQL system

使用大语言模型查询天文数据库:ALeRCE文本到SQL系统

P. A. Estevez, J. Espejo-Moreira, S. Sanfeliu-Alvarez, F. Forster, A. M. Munoz Arancibia, G. Cabrera-Vives, F. E. Bauer, A. Bayo, M. Catelan, R. Dastidar, L. Hernandez-Garcia, J. A. Intriago, G. Pignata

发表机构 * Department of Electrical Engineering, University of Chile, Av. Tupper 2007, Santiago, Chile Millennium Institute of Astrophysics (MAS), Nuncio Monseñor Sótero Sanz 100, Providencia, Santiago, Chile Data Artificial Intelligence Initiative (ID&IA), Universidad de Chile Center for Mathematical Modeling, Universidad de Chile, Beauchef 851, North building, 7th floor, Santiago 8320000, Chile Departamento de Astronomía, Universidad de Chile, Casilla 36D, Santiago, Chile Department of Computer Science, Universidad de Concepción, Edmundo Larenas 219, Concepción, Chile Center for Data Artificial Intelligence, Universidad de Concepción, Edmundo Larenas 310, Concepción, Chile Heidelberg Institute for Theoretical Studies, Heidelberg, Baden-Württemberg, Germany Instituto de Alta Investigación, Universidad de Tarapacá, Casilla 7D, Arica, 1010000, Chile European Southern Observatory, Karl-Schwarzschild-Strasse 2, 85748 Garching bei München, Germany Instituto de Astrofísica, Facultad de Física, Pontificia Universidad Católica de Chile, Casilla 306, Santiago 22, Chile Centro de Astroingeniería, Pontificia Universidad Católica de Chile, Av. Vicuña Mackenna 4860, 7820436 Macul, Santiago, Chile Instituto de Estudios Astrofísicos, Facultad de Ingeniería y Ciencias, Universidad Diego Portales, Av. Ejército Libertador 441, Santiago, Chile Centro Interdisciplinario de Data Science, Facultad de Ingeniería y Ciencias, Universidad Diego Portales, Av. Ejército Libertador 441, Santiago, Chile

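AI summary Proposes an LLM-based text-to-SQL system that uses in-context learning and a step-by-step generation framework (schema linking, query classification, prompt decomposition, self-correction) to query an astronomical database in natural language; thirteen models are evaluated on the ALeRCE dataset, with Claude Opus 4.6 among the best performers.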

详情
AI中文摘要

我们开发了一个基于大语言模型(LLMs)的文本到SQL(结构化查询语言)系统,采用上下文学习方法,并将其应用于ALeRCE(自动学习快速事件分类)天文数据库。ALeRCE是Zwicky瞬变设施和Vera C. Rubin天文台的社区经纪人。该系统使用户能够以自然语言(NL)查询数据库,并生成可执行的SQL查询。为了开发和评估该系统,我们构建了一个包含110个NL/SQL对的数据集。我们提出了一个逐步生成框架,包含四个模块:模式链接、查询分类、提示分解和自纠正。使用上下文学习和提示工程技术评估了13个LLM的性能。文本到SQL的性能通过行标识符(例如对象标识符)和列标识符(即列名)的完美匹配(PM)率来评估。所提出的逐步框架始终优于直接推理基线,而自纠正模块持续减少执行错误。对于Claude Opus 4.6,简单查询的行(列)标识符PM性能较高,达到0.97(0.94),随着查询复杂度增加,中等查询降至0.44(0.72),困难查询降至0.59(0.49)。在评估的13个模型中,文本到SQL任务表现最佳的LLM是Claude Opus 4.6、Gemini 2.5 Pro、Gemini 3 Flash和GPT-5.2-Codex。

英文摘要

We develop a text-to-SQL (structured query language) system based on large language models (LLMs) using in-context learning and apply it to the Automatic Learning for the Rapid Classification of Events (ALeRCE) astronomical database. ALeRCE is a community broker for the Zwicky Transient Facility and the Vera C. Rubin Observatory. The system enables users to query the database in natural language (NL) and generates executable SQL queries. To develop and evaluate the system, we constructed a dataset of 110 NL/SQL pairs. We propose a step-by-step generation framework comprising four modules: schema linking, query classification, prompt decomposition, and self-correction. The performance of thirteen LLMs is evaluated using in-context learning and prompt engineering techniques. Text-to-SQL performance is assessed using the perfect-match (PM) rate for row identifiers (e.g., object identifiers) and column identifiers (i.e., column names). The proposed step-by-step framework consistently outperforms a direct-inference baseline, while the self-correction module consistently reduces execution errors. For Claude Opus 4.6, PM performance on row (column) identifiers is high for simple queries, reaching 0.97 (0.94), and decreases with query complexity to 0.44 (0.72) for medium queries and 0.59 (0.49) for hard queries. Among the thirteen evaluated models, the best-performing LLMs for the text-to-SQL task are Claude Opus 4.6, Gemini 2.5 Pro, Gemini 3 Flash, and GPT-5.2-Codex.
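The perfect-match (PM) rate can be sketched as below. The abstract does not give the exact definition, so this assumes that a query counts as a perfect match when the identifiers returned by the generated SQL are exactly those returned by the reference SQL, ignoring order. The function names are hypothetical, and the same helper would be applied once to row identifiers and once to column names.

```python
def perfect_match(predicted_ids, gold_ids):
    """True when both results contain exactly the same identifiers (order ignored)."""
    return set(predicted_ids) == set(gold_ids)


def pm_rate(pairs):
    """pairs: list of (predicted_ids, gold_ids) tuples, one per NL/SQL test case.
    Returns the fraction of cases that match perfectly."""
    if not pairs:
        return 0.0
    return sum(perfect_match(p, g) for p, g in pairs) / len(pairs)
```

Under this assumed definition a query that returns even one extra or missing object scores zero for that case, which would make PM a strict metric and may help explain the drop from simple to medium and hard queries.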

2606.18122 2026-06-17 cs.LG cs.AI cs.AR eess.AS eess.SP 交叉投稿

Embedded Machine Learning for Microcontroller-Class Edge Devices: Data, Feature, Evaluation, and Deployment Pipelines

面向微控制器级边缘设备的嵌入式机器学习:数据、特征、评估与部署流程

Mostafa Darvishi

发表机构 * IEEE

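AI summary Presents a systematic account of an embedded machine-learning workflow for microcontroller platforms, covering engineering decisions such as sampling and buffering, feature extraction, validation under class imbalance, model/runtime co-design, and streaming deployment, and gives practical design rules using inertial motion recognition and keyword spotting as examples.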

Comments 6 pages, 3 figures, 4 tables

详情
AI中文摘要

嵌入式机器学习将推理从云服务转移到资源受限的设备上,这些设备必须在内存、能量和延迟的严格限制下采集数据、预处理信号、运行模型并采取行动。本文针对微控制器级平台,提出了一种面向系统的嵌入式机器学习工作流综合方案。重点放在通用机器学习介绍中常被隐藏的工程决策上:采样和缓冲、作为降维的特征提取、类别不平衡下的验证、模型/运行时协同设计以及流式部署。全文使用两个代表性信号系列:第一个是惯性运动识别,其中将两秒的三轴加速度计窗口从原始样本转换为均方根和频谱特征后再进行分类;第二个是关键词检测,其中对音频进行采样、抗混叠、转换为梅尔频率倒谱系数,并由紧凑的一维卷积网络处理。本文最后给出了鲁棒设备上推理的实用设计规则,包括数据整理、量化、阈值设定、调度和现场监控。

英文摘要

Embedded machine learning moves inference from cloud services to resource-constrained devices that must acquire data, preprocess signals, run a model, and act within tight limits on memory, energy, and latency. This paper presents a systems-oriented synthesis of an embedded machine-learning workflow for microcontroller-class platforms. The emphasis is placed on engineering decisions that are often hidden in generic machine-learning introductions: sampling and buffering, feature extraction as dimensionality reduction, validation under class imbalance, model/runtime co-design, and streaming deployment. Two representative signal families are used throughout the paper. The first is inertial motion recognition, where a two-second, three-axis accelerometer window is transformed from raw samples into root-mean-square and spectral features before classification. The second is keyword spotting, where audio is sampled, anti-aliased, transformed into mel-frequency cepstral coefficients, and processed by a compact one-dimensional convolutional network. The paper concludes with practical design rules for robust on-device inference, including data curation, quantization, thresholding, scheduling, and field monitoring.
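The inertial example, in which a raw accelerometer window is reduced to root-mean-square and spectral features, can be sketched for a single axis as follows. The sampling rate, the choice of spectral features (dominant frequency and its magnitude), and all names are assumptions; the abstract does not specify them. A naive DFT is used to keep the sketch dependency-free, whereas a real microcontroller build would more likely use a fixed-point FFT.

```python
import cmath
import math


def rms(samples):
    """Root-mean-square amplitude of a window."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def magnitude_spectrum(samples):
    """Naive O(N^2) DFT magnitudes for bins 0..N/2; adequate for short windows."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        mags.append(abs(acc) / n)
    return mags


def axis_features(samples, sample_rate_hz):
    """Reduce one axis of a window to three numbers (hypothetical feature set)."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]   # remove the gravity/DC offset
    mags = magnitude_spectrum(centered)
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])  # skip the DC bin
    return {
        "rms": rms(centered),
        "peak_hz": peak_bin * sample_rate_hz / len(samples),
        "peak_mag": mags[peak_bin],
    }
```

With an assumed 50 Hz sampling rate, a two-second window holds 100 samples per axis, and the sketch reduces each axis to three numbers, which illustrates the "feature extraction as dimensionality reduction" point made above.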

2606.18181 2026-06-17 cs.IR cs.AI cs.CY 交叉投稿

IUU+DB: Tracking Illegal, Unreported, and Unregulated Fishing, Seafood Fraud, and Labor Abuse through LLM-driven Information Extraction

IUU+DB:通过LLM驱动的信息提取追踪非法、不报告和不管制捕捞、海鲜欺诈和劳工虐待

Henry Bodwell, Hong Yang, John C. Simeone, Kelvin Gorospe, Bella Sullivan, Lana Huang, Jessica Gephart, Sandy Aylesworth, Molly Masterton, Naren Ramakrishnan

发表机构 * University Of Washington(华盛顿大学)

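AI summary Introduces the IUU+ concept to broaden the definition of illegal fishing and builds IUU+DB, an LLM-based system that automatically extracts key incident information from heterogeneous documents and supports deduplication and trend analysis, providing data for fisheries regulation and research.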

详情
AI中文摘要

非法、不报告和不管制捕捞(IUU)传统上指违反适用法律或在缺乏适用法律的区域进行的捕捞活动。我们提出术语IUU+以涵盖更广泛的渔业部门环境及相关供应链贸易犯罪和行为。尽管IUU+活动被广泛认为是对海洋生态系统、市场和生计的严重威胁,但对其事件频率、地理分布、物种、行为者及非法活动类型模式的定量理解仍然难以获得。我们提出IUU+DB,一个由大语言模型驱动的系统,用于构建全球IUU+活动事件数据库。该系统接收异构文档,分类是否描述相关事件,提取关键数据元素如行为者、地点、物种、船只、违规行为及执法结果,并支持去重和趋势分析。案例研究和验证结果表明,IUU+DB有助于组织零散证据,揭示地理和行为热点,支持学术界和非政府组织的渔业领域特定研究,协助行业进行来源和物种风险评估,并为政府机构的政策实施和针对性执法提供支持。

英文摘要

Illegal, unreported, and unregulated fishing (IUU) traditionally refers to fishing activities that violate applicable laws or occur in areas that lack applicable laws. We propose the term IUU+ to capture a broader suite of fisheries sector environmental and associated supply chain trade-related crimes and behaviors. Although IUU+ activity is widely recognized as a serious threat to marine ecosystems, markets, and livelihoods, a quantitative understanding of these incidents, e.g., their frequency, geography, species, actors, and patterns in the type of illicit activity, remains difficult to obtain. We propose IUU+DB, a large language model driven system for building a global incident database of IUU+ activity. The system ingests heterogeneous documents, classifies whether they describe relevant incidents, extracts key data elements such as actors, locations, species, vessels, violations, and enforcement outcomes, and supports deduplication and trend analysis. Case studies and validation results show that IUU+DB can help organize fragmented evidence, surface geographic and behavioral hotspots, support fisheries-domain specific research in academia and non-government organizations, assist source and species risk assessments for industry, and provide support for policy implementation and targeted enforcement efforts to government agencies.

2606.04513 2026-06-17 cs.AI 版本更新

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

MapAgent: 一个工业级的城市规模车道级地图生成智能框架

Deguo Xia, Zihan Li, Haochen Zhao, Dong Xie, Yuyao Kong, Xiyan Liu, Jizhou Huang, Mengmeng Yang, Diange Yang

发表机构 * Tsinghua University(清华大学) Baidu(百度) University of Macau(澳门大学) Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所)

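AI summary Proposes the MapAgent framework, which combines a vision-language model with constraint-aware reasoning in a verification-driven Judge-Planner-Worker loop to correct specification violations in lane-map generation, enabling highly automated city-scale production.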

Comments Accepted by KDD 2026

详情
AI中文摘要

车道级地图是自动驾驶和车道级导航的关键基础设施,但为数百个城市构建和维护标准化车道网络仍然高度劳动密集。最近的端到端矢量化映射方法可以直接从传感器数据预测车道几何和拓扑,但它们通常将映射规范和交通规则视为隐式的、依赖于数据集的监督。此外,在复杂场景中(例如,磨损或缺失的标记和遮挡),仅凭视觉证据往往难以确定正确的车道配置,使得规范违规成为人工后期编辑的主要来源。我们提出MapAgent,一个工业级智能架构,它增强了一个矢量化主干,用于生成符合规范的车道地图。MapAgent不仅仅是在地图预测上添加一个智能体循环,而是在一个有界、验证驱动的Judge-Planner-Worker循环中,将主干感知与明确的规范验证、约束感知推理和确定性地图编辑相结合。一个视觉语言Judge通过联合检查视觉证据和草稿向量来诊断错误,而一个工具调用Planner生成最小的修正编辑并进行编辑后重新验证。为了保持城市规模生产的可扩展性,MapAgent仅在主干置信度低的图块上选择性触发,增加了适度的开销同时保持吞吐量。在真实世界数据集上的实验显示,与强大的生产基线相比,特别是在复杂和长尾场景中,性能持续提升。此外,MapAgent已集成到百度地图中,支持全国超过360个城市的车道级地图生成,并将整体生产自动化率提升至95%以上,证明了MapAgent在大规模车道级地图生成中的实用性和有效性。

英文摘要

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they typically treat mapping specifications and traffic regulations as implicit, dataset-dependent supervision. Moreover, in complex scenes (e.g., worn or missing markings and occlusions), correct lane configurations are often under-determined by visual evidence alone, making specification violations a major source of human post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone for specification-compliant lane-map production. Rather than merely adding an agent loop to map prediction, MapAgent couples backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing under a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits with post-edit re-validation. To remain scalable for city-scale production, MapAgent is selectively triggered only on tiles with low backbone confidence, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. Additionally, MapAgent has been integrated into Baidu Maps, supporting lane-level map generation for over 360 cities nationwide and elevating the overall production automation to over 95%, demonstrating MapAgent's practicality and effectiveness for large-scale lane-level map generation.

2606.12742 2026-06-17 cs.AI cs.AR 版本更新

Reducing the Complexity of Deep Learning Models for EEG Analysis on Wearable Devices

降低可穿戴设备上用于脑电图分析的深度学习模型复杂度

Farough Shayeste Roodi, Parham Zilouchian Moghaddam, Mahdi Mohammadi-nasab, Mehdi Modarressi, Mostafa Ersali Salehi Nasab, Masoud Daneshtalab

发表机构 * University of Tehran(德黑兰大学) Mälardalen University(梅拉达伦大学) Royal Institute of Technology(皇家理工学院)

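AI summary Studies deploying DNN models on resource-constrained wearable devices through parameter quantization and electrode reduction, characterizing the trade-off between accuracy and complexity in EEG analysis.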

详情
AI中文摘要

可穿戴医疗设备是增长最快的物联网领域。许多自动化医疗服务依赖于两种关键的生物信号,即心电图和脑电图,它们分别反映心脏和大脑的活动。尽管深度神经网络被认为是处理和分析这些信号的主要方式,但可穿戴设备中非常严格的能量和计算能力限制远低于DNN模型的计算、能量和内存带宽需求,从而阻碍了深度学习在许多实际可穿戴服务中的部署。本文研究了在资源受限的可穿戴设备上部署最先进的DNN模型的可行性。值得注意的是,我们探讨了在使用参数量化和电极减少方法时,DNN的精度与计算复杂度之间的权衡。我们的研究集中在几种用于脑电图信号分析(特别是检测癫痫发作)的最先进的DNN模型上。我们的发现表明,当明智地应用这些技术时,可以显著降低所考虑的DNN的复杂度,同时对精度的影响最小。这些结果揭示了在将基于DNN的在线脑电图分析适配到可穿戴设备时,精度与复杂度降低之间明确的权衡关系。

英文摘要

Wearable healthcare devices are the fastest-growing Internet of Things (IoT) sector. Many automated healthcare services rely on two crucial biological signals, namely ECG and EEG, which reflect the activity of the heart and brain, respectively. Although deep neural networks are considered the primary way to process and analyze these signals, the very tight energy and computational power constraints in wearable devices are far below the computational, energy, and memory bandwidth demands of DNN models, thereby impeding the deployment of deep learning in many practical wearable services. This paper investigates the feasibility of deploying state-of-the-art DNN models in resource-constrained wearable devices. Notably, we explore the trade-off between accuracy and computational complexity of DNNs when parameter quantization and electrode reduction methods are used. Our investigation centers on several state-of-the-art DNN models designed for EEG signal analysis, specifically for detecting epileptic seizures. Our findings demonstrate that, when applied judiciously, these techniques can significantly reduce the complexity of the DNNs under consideration with minimal adverse effects on accuracy. These results reveal the explicit trade-offs between accuracy and complexity reduction encountered when adapting DNN-based online EEG analysis for wearable devices.
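Parameter quantization, one of the two complexity-reduction methods named above, can be illustrated with a minimal sketch of symmetric uniform quantization. The abstract does not state which quantization scheme or bit widths the authors evaluate, so the per-tensor symmetric scheme, the 8-bit default, and the function names here are assumptions.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map float weights to signed integers using one shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]
```

Storing 8-bit codes instead of 32-bit floats cuts weight memory by roughly four times, at the cost of a rounding error of at most half a quantization step per weight; lowering `num_bits` trades more accuracy for further savings.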

2606.13258 2026-06-17 cs.AI 版本更新

MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment

MOSAIC: 帕金森病步态评估中增量持续学习的模态特定适应

Minlin Zeng, Zhipeng Zhou, Yang Qiu, Martin J. McKeown, Zhiqi Shen

发表机构 * Nanyang Technological University(南洋理工大学) Pacific Parkinson's Research Centre, University of British Columbia(不列颠哥伦比亚大学太平洋帕金森研究中心)

AI总结 针对帕金森病步态评估中模态增量场景,提出MOSAIC框架,通过模态特定预热、统计解耦MSBN架构和课程引导排斥目标,解决跨模态蒸馏不可靠、统计偏移和可塑性下降问题。

AI中文摘要

基于步态的帕金森病评估越来越依赖异构传感器,但临床系统很少同时收集所有模态。新传感器可能通过设备升级、协议变更或多中心部署引入,而历史患者数据由于隐私和存储限制通常不可用。这种模态增量场景面临三个挑战:不可靠的跨模态蒸馏、模态特定的统计偏移以及旧知识保留后可塑性下降。我们提出了MOSAIC,一个紧凑的持续学习框架。首先,我们识别了“有毒教师”(Toxic Teacher)现象,并引入模态特定预热,在蒸馏前稳定新学习的模态表示。其次,我们提出了一种统计解耦的MSBN架构,在保持共享语义主干的同时隔离传感器统计信息。第三,我们设计了一个课程引导的排斥目标用于可塑性恢复,在保留旧知识的同时恢复模态特定容量。在三个多模态帕金森步态数据集上的实验表明,MOSAIC提高了最终性能并减轻了遗忘。项目代码可在以下网址获取:https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git

英文摘要

Gait-based Parkinson's disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities simultaneously. New sensors may arrive through device upgrades, protocol changes, or multi-center deployment, while historical patient data are often unavailable because of privacy and storage constraints. This modality-incremental setting faces three challenges: unreliable cross-modal distillation, modality-specific statistical shifts, and reduced plasticity after preservation. We propose MOSAIC, a compact continual learning framework. First, we identify the Toxic Teacher phenomenon and introduce Modality-Specific Warm-Up to stabilize newly learned modality representations before distillation. Second, we propose a statistics-decoupled MSBN architecture that isolates sensor statistics while maintaining a shared semantic backbone. Third, we design a curriculum-guided repulsive objective for Plasticity Recovery, preserving legacy knowledge while recovering modality-specific capacity. Experiments on three multimodal Parkinson's gait datasets show that MOSAIC improves final performance and mitigates forgetting. Project code is available at: https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git
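
The idea of isolating sensor statistics while sharing a backbone can be sketched as follows. The abstract does not expand "MSBN"; the sketch assumes it refers to normalization layers that keep separate statistics per modality, and every name here (`ModalitySpecificNorm`, `shared_backbone`) is hypothetical rather than taken from the MOSAIC code.

```python
class ModalitySpecificNorm:
    """Keeps one (mean, variance) pair per modality; assumed reading of MSBN."""

    def __init__(self, eps=1e-5):
        self.eps = eps
        self.stats = {}  # modality name -> (mean, variance)

    def fit(self, modality, values):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        self.stats[modality] = (mean, var)

    def __call__(self, modality, values):
        mean, var = self.stats[modality]
        return [(v - mean) / (var + self.eps) ** 0.5 for v in values]


def shared_backbone(features, weights):
    """Stand-in for the shared semantic backbone: one weight vector for all modalities."""
    return sum(f * w for f, w in zip(features, weights))


if __name__ == "__main__":
    norm = ModalitySpecificNorm()
    norm.fit("imu", [1.0, 2.0, 3.0])                 # small-scale sensor
    norm.fit("pressure", [100.0, 200.0, 300.0])      # large-scale sensor
    weights = [0.2, 0.3, 0.5]
    print(shared_backbone(norm("imu", [1.0, 2.0, 3.0]), weights))
    print(shared_backbone(norm("pressure", [100.0, 200.0, 300.0]), weights))
```

After per-modality normalization the two sensors reach the shared weights on a comparable scale, so adding a new sensor only adds a new statistics entry.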

2606.16337 2026-06-17 cs.AI cs.HC cs.LG 版本更新

Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

医学启发式学习:一个用于可解释和可审计临床决策规则的LLM驱动框架

Wei Xu, Ke Yang, Gang Luo, Keli Zheng, Lingyan Hu, Jing Wang, Kefeng Li

发表机构 * Centre for Artificial Intelligence Driven Drug Discovery, Macao Polytechnic University(澳门理工大学人工智能驱动药物发现中心) Key Laboratory of Short-Range Radio Equipment Testing and Evaluation, Ministry of Industry and Information Technology Terahertz Science Application Center (TSAC), Beijing Institute of Technology(工业和信息化部短距离无线电设备测试与评估重点实验室,北京理工大学太赫兹科学应用中心(TSAC)) Department of Critical Care Medicine, Yantai Yuhuangding Hospital, Qingdao University(青岛大学附属烟台毓璜顶医院重症医学科) Faculty of Education, The University of Hong Kong(香港大学教育学院) College of Information Engineering, Dalian University(大连大学信息工程学院)

AI总结 提出医学启发式学习(MHL),利用LLM驱动的工作流优化确定性可执行决策系统,生成可解释、可审计的Python决策规则,在医学数据集上达到与最先进方法相当的性能,并支持小样本和高度不平衡场景。

AI中文摘要

临床表格数据的预测建模是临床决策支持的核心,因此不仅需要强大的预测性能,还需要透明的决策逻辑。尽管深度学习和基于树的集成方法可以实现高精度,但其黑箱性质仍然是临床部署的主要障碍。这一挑战因医疗数据的常见特征而进一步加剧,包括有限的样本量、严重的类别不平衡以及因诊断标准和临床文档变化引起的特征演化。为了解决这些问题,我们提出了医学启发式学习(MHL),这是“超越梯度的学习”(learning-beyond-gradients)范式在临床表格预测中的一个实例。MHL不依赖神经网络权重更新,而是使用大型语言模型(LLM)驱动的工作流,整合统计探测、医学知识探测、规则合成和代码级迭代优化,以优化一个确定性的可执行决策系统。最终模型不是以不透明的参数表示,而是表示为版本化的纯Python决策规则,这些规则明确可解释、完全可审计且具有临床依据。MHL还支持持续学习,从先前验证的规则开始,并在数据漂移或特征演化下使用更新的特征信息迭代修订规则。在医学数据集上的全面实验表明,MHL实现了与最先进方法相当的性能,同时在小样本和高度不平衡设置下保持稳健表现。结果进一步表明,这种显式规则更新机制有助于缓解特征演化下的灾难性遗忘。总体而言,这些发现表明,非基于梯度的启发式系统为高风险临床决策支持提供了一种透明且可适应的替代方案。

英文摘要

Predictive modeling for clinical tabular data is central to clinical decision support and therefore requires not only strong predictive performance but also transparent decision logic. Although deep learning and tree-based ensemble methods can achieve high accuracy, their black-box nature remains a major obstacle to clinical deployment. This challenge is further compounded by common characteristics of medical data, including limited sample sizes, severe class imbalance, and feature evolution arising from changes in diagnostic criteria and clinical documentation. To address these issues, we propose Medical Heuristic Learning (MHL), an instantiation of the learning-beyond-gradients paradigm for clinical tabular prediction. Instead of relying on neural network weight updates, MHL uses a large language model (LLM)-driven workflow that integrates statistical probes, medical knowledge probes, rule synthesis, and code-level iterative refinement to optimize a deterministic and executable decision system. The resulting model is expressed not as opaque parameters, but as versioned pure-Python decision rules that are explicitly interpretable, fully auditable, and clinically grounded. MHL also supports continual learning by starting from previously validated rules and iteratively revising them using updated feature information under data drift or feature evolution. Comprehensive experiments on medical datasets show that MHL achieves performance comparable to state-of-the-art methods while maintaining strong behavior in small-sample and highly imbalanced settings. The results further indicate that this explicit rule update mechanism can help alleviate catastrophic forgetting under feature evolution. Overall, these findings suggest that non-gradient-based heuristic systems offer a transparent and adaptable alternative for high-stakes clinical decision support.
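
To show what "versioned pure-Python decision rules" might look like, here is a minimal sketch. It is not taken from the paper: the feature names (`lactate_mmol_l`, `age_years`), the thresholds, and the `RULES` registry are placeholders chosen for illustration and carry no clinical meaning.

```python
def rule_v1(patient):
    """Version 1: a single threshold on one feature (placeholder values)."""
    if patient["lactate_mmol_l"] >= 4.0:
        return 1, ["lactate_mmol_l >= 4.0"]
    return 0, []


def rule_v2(patient):
    """Version 2: starts from v1 and adds a branch for a newly available feature."""
    label, reasons = rule_v1(patient)
    if label == 0 and patient.get("age_years", 0) >= 75 and patient["lactate_mmol_l"] >= 2.5:
        return 1, ["age_years >= 75", "lactate_mmol_l >= 2.5"]
    return label, reasons


# Every version stays callable, so earlier behaviour can be audited or restored.
RULES = {"v1": rule_v1, "v2": rule_v2}


def predict(patient, version="v2"):
    """Return the label together with the rule version and the conditions that fired."""
    label, reasons = RULES[version](patient)
    return {"version": version, "label": label, "reasons": reasons}


if __name__ == "__main__":
    case = {"lactate_mmol_l": 3.0, "age_years": 80}
    print(predict(case, "v1"))
    print(predict(case, "v2"))
```

Because each prediction carries the rule version and the conditions that fired, a reviewer can trace a decision without inspecting model weights, and a revised rule extends the previous one instead of overwriting it.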

2407.13053 2026-06-17 cs.CY cs.AI cs.CL cs.LG 版本更新

E2Vec: Feature Embedding with Temporal Information for Analyzing Student Actions in E-Book Systems

E2Vec:基于时间信息的特征嵌入用于分析电子书系统中的学生行为

Yuma Miyazaki, Valdemar Švábenský, Yuta Taniguchi, Fumiya Okubo, Tsubasa Minematsu, Atsushi Shimada

发表机构 * Kyushu University(九州大学)

AI总结 提出E2Vec方法,利用词嵌入将操作日志和时间间隔转化为学生向量,用于风险检测任务,提升泛化性和性能。

Comments Research paper published in the Proceedings of the 17th Educational Data Mining Conference (EDM 2024), see https://doi.org/10.5281/zenodo.12729853

AI中文摘要

数字教科书(电子书)系统将学生与教科书的交互记录为一系列事件,称为事件流(EventStream)数据。过去,研究人员从事件流中提取有意义的特征,并将其用作下游任务(如成绩预测和学生行为建模)的输入。先前的研究所评估的模型主要使用从事件流日志中导出的统计特征,如操作类型数量或访问频率。虽然这些特征有助于提供某些见解,但它们缺乏时间信息,无法捕捉不同学生学习行为中的细粒度差异。本研究提出E2Vec,一种基于词嵌入的新型特征表示方法。该方法将每个学生的操作日志及其时间间隔视为由字符组成的字符串序列,并生成包含时间信息的学习活动特征学生向量。我们应用fastText为一个涵盖两年计算机科学课程的数据集中的305名学生分别生成嵌入向量。然后,我们研究了E2Vec在高风险学生(at-risk)检测任务中的有效性,展示了其在泛化性和性能方面的潜力。

英文摘要

Digital textbook (e-book) systems record student interactions with textbooks as a sequence of events called EventStream data. In the past, researchers extracted meaningful features from EventStream, and utilized them as inputs for downstream tasks such as grade prediction and modeling of student behavior. Previous research evaluated models that mainly used statistical-based features derived from EventStream logs, such as the number of operation types or access frequencies. While these features are useful for providing certain insights, they lack temporal information that captures fine-grained differences in learning behaviors among different students. This study proposes E2Vec, a novel feature representation method based on word embeddings. The proposed method regards operation logs and their time intervals for each student as a string sequence of characters and generates a student vector of learning activity features that incorporates time information. We applied fastText to generate an embedding vector for each of 305 students in a dataset from two years of computer science courses. Then, we investigated the effectiveness of E2Vec in an at-risk detection task, demonstrating potential for generalizability and performance.
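
A short sketch of the encoding step described above: operations and the gaps between them are mapped to characters, giving a string that a word-embedding tool can consume. The operation names, the character codes, and the interval bins (10 s and 300 s) are assumptions made for this example; the abstract does not give the actual vocabulary.

```python
# Hypothetical operation vocabulary: one uppercase character per operation type.
OP_CODES = {"OPEN": "O", "NEXT": "N", "PREV": "P", "ADD_MARKER": "A"}


def interval_code(seconds):
    """Bin the gap between two events into short / medium / long (assumed bins)."""
    if seconds < 10:
        return "s"
    if seconds < 300:
        return "m"
    return "l"


def encode_log(events):
    """Encode a time-sorted list of (timestamp_seconds, operation) as one string."""
    chars = []
    previous_time = None
    for timestamp, operation in events:
        if previous_time is not None:
            chars.append(interval_code(timestamp - previous_time))
        chars.append(OP_CODES[operation])
        previous_time = timestamp
    return "".join(chars)


if __name__ == "__main__":
    log = [(0, "OPEN"), (5, "NEXT"), (65, "NEXT"), (1000, "PREV")]
    print(encode_log(log))
```

Interleaving interval characters with operation characters is what lets the downstream embedding see timing, which plain operation counts discard.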

2501.00826 2026-06-17 q-fin.TR cs.AI 版本更新

LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management

基于LLM的多智能体系统实现自动化加密货币投资组合管理

Yichen Luo, Yebo Feng, Jiahua Xu, Paolo Tasca, Yang Liu

发表机构 * University College London(伦敦大学学院) Nanyang Technological University(南洋理工大学) Exponential Science(指数科学)

AI总结 提出一个三智能体系统(市场、新闻、交易),通过分层、协作和辩论架构融合多模态信号,在2025年回测中实现133.52%累计收益和1.502夏普比率,优于单智能体和深度学习基线。

AI中文摘要

加密货币投资组合管理需要在高度波动和实时约束下融合异构多模态信号,包括结构化的价格和链上时间序列、非结构化的新闻文本以及技术指标。虽然深度学习方法显示出预测能力,但其不透明性限制了实际应用,而单个大语言模型(LLM)智能体难以处理稳健决策所需的多模态输入广度。我们提出一个多智能体系统(MAS)框架,其中三个模态专业智能体——负责市场动态的加密货币智能体、负责每周新闻情绪的新闻智能体和负责信号融合与投资组合执行的交易智能体——通过三种通信架构(分层、协作和辩论)分解任务。我们评估了四种能力配置:零样本、思维链(CoT)、检索增强生成(RAG)和技能增强。在2025年1月按市值排名前15的L1区块链原生加密货币的52周回测中,最佳配置(分层技能)实现了133.52%的累计收益和1.502的夏普比率,优于单智能体变体、被动基准和深度学习基线。消融研究确定加密货币智能体是最关键的组件,移除它会使累计收益降低42.57个百分点。跨模型比较进一步表明,在GPT-4o、GPT-5和Claude Sonnet 4.5下,MAS均优于单智能体基线,表明多智能体协调的优势与模型无关。与黑箱深度学习模型不同,每个投资组合决策都可追溯到明确的智能体推理,为多模态加密货币投资组合管理提供了一种可解释且有效的方法。

英文摘要

Cryptocurrency portfolio management requires the fusion of heterogeneous multi-modal signals, including structured price and on-chain time series, unstructured news text, and technical indicators, under high-volatility and real-time constraints. While deep learning approaches show predictive capability, their opacity limits practical adoption, and single large language model (LLM) agents struggle to process the breadth of modality-specific inputs needed for robust decision-making. We propose a multi-agent system (MAS) framework in which three modality-specialised agents, a Crypto Agent for market dynamics, a News Agent for weekly news sentiment, and a Trading Agent for signal fusion and portfolio execution, decompose the task across three communication architectures: hierarchical, collaborative, and debate. We evaluate four capability configurations: zero-shot, chain-of-thought (CoT), retrieval-augmented generation (RAG), and skill-augmented. In a 52-week backtest over calendar year 2025 across the top 15 L1 blockchain native cryptocurrencies by market capitalisation as of January 2025, the best configuration, Hierarchical (Skill), achieves a cumulative return of 133.52% and a Sharpe ratio of 1.502, outperforming single-agent variants, passive benchmarks, and deep learning baselines. An ablation study identifies the Crypto Agent as the most critical component, with its removal reducing cumulative return by 42.57 percentage points. A cross-model comparison further shows that MAS outperforms the single-agent baseline under GPT-4o, GPT-5, and Claude Sonnet 4.5, suggesting that the benefit of multi-agent coordination is model-agnostic. Unlike black-box deep learning models, every portfolio decision is traceable to explicit agent reasoning, offering an interpretable and effective approach to multi-modal cryptocurrency portfolio management.
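
For readers who want to see how the two headline metrics are typically computed from weekly returns, a minimal sketch follows. The annualisation factor of 52 and the zero risk-free rate are assumptions; the abstract does not state the conventions used in the paper, and the sample returns are made up.

```python
import math
import statistics


def cumulative_return(returns):
    """Compound a sequence of simple per-period returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(returns, periods_per_year=52, risk_free_per_period=0.0):
    """Annualised Sharpe ratio using the sample standard deviation of excess returns."""
    excess = [r - risk_free_per_period for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)


if __name__ == "__main__":
    weekly = [0.02, -0.01, 0.03, 0.00, 0.015]  # illustrative values only
    print(f"cumulative return: {cumulative_return(weekly):.4f}")
    print(f"Sharpe ratio: {sharpe_ratio(weekly):.3f}")
```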

2502.17518 2026-06-17 cs.LG cs.AI q-fin.CP stat.ML 版本更新

Ensemble RL through Classifier Models: Enhancing Risk-Return Trade-offs in Trading Strategies

通过分类器模型进行集成强化学习:在交易策略中增强风险回报权衡

Zheli Xiong

AI总结 将A2C、PPO、SAC等强化学习算法与SVM、决策树、逻辑回归等分类器结合,构建集成交易策略;在累计回报、夏普比率、Calmar比率和最大回撤上,集成方法常优于单一RL模型,但其优势取决于方差阈值τ、分类器组、RL智能体配对和市场范围。

Comments 23 pages, 10 figures, 9 tables

AI中文摘要

本文对集成强化学习(RL)模型在金融交易策略中的应用进行了全面研究,并利用分类器模型提升性能。通过将A2C、PPO和SAC等强化学习算法与支持向量机(SVM)、决策树和逻辑回归等传统分类器相结合,我们研究了如何整合不同的分类器组以改善风险回报权衡。研究评估了多种集成方法的有效性,并在累计回报、夏普比率(SR)、Calmar比率和最大回撤(MDD)等关键金融指标上将其与单个RL模型进行比较。我们最初的实验结果表明,集成方法在风险调整后回报方面常常优于基础模型,并能更好地管理回撤、提升整体稳定性。然而,原始分析和本版本新增的复现实验均表明,集成性能对方差阈值τ、分类器组、RL智能体配对和市场范围的选择较为敏感。复现证据进一步支持了分类器辅助的集成选择能够提高稳健性这一结论,同时也表明这种优势是有条件的,并非在所有数据集上都自动成立。本研究强调了将强化学习与分类器结合用于自适应决策的价值,对金融交易、机器人和其他动态环境具有启示意义。

英文摘要

This paper presents a comprehensive study on the use of ensemble Reinforcement Learning (RL) models in financial trading strategies, leveraging classifier models to enhance performance. By combining RL algorithms such as A2C, PPO, and SAC with traditional classifiers like Support Vector Machines (SVM), Decision Trees, and Logistic Regression, we investigate how different classifier groups can be integrated to improve risk-return trade-offs. The study evaluates the effectiveness of various ensemble methods, comparing them with individual RL models across key financial metrics, including Cumulative Returns, Sharpe Ratios (SR), Calmar Ratios, and Maximum Drawdown (MDD). Our original experimental results demonstrate that ensemble methods often outperform base models in terms of risk-adjusted returns, providing better management of drawdowns and overall stability. However, both the original analysis and the additional reproduction reported in this version show that ensemble performance is sensitive to the choice of variance threshold τ, classifier group, RL-agent pair, and market universe. The reproduction evidence strengthens the conclusion that classifier-assisted ensemble selection can improve robustness, while also clarifying that the advantage is conditional rather than automatic across all datasets. This study emphasizes the value of combining RL with classifiers for adaptive decision-making, with implications for financial trading, robotics, and other dynamic environments.
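
Two of the drawdown-related metrics named above can be computed as in the sketch below. It uses the common textbook definitions (maximum drawdown as the largest peak-to-trough loss relative to the peak, Calmar ratio as annual return divided by maximum drawdown); whether the paper uses exactly these conventions is an assumption, and the equity values are illustrative.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def calmar_ratio(annual_return, equity):
    """Annual return divided by maximum drawdown over the same period."""
    drawdown = max_drawdown(equity)
    if drawdown == 0:
        raise ValueError("Calmar ratio is undefined when there is no drawdown")
    return annual_return / drawdown


if __name__ == "__main__":
    curve = [100.0, 120.0, 90.0, 130.0, 117.0]  # illustrative portfolio values
    print(max_drawdown(curve))        # fall from 120 to 90
    print(calmar_ratio(0.5, curve))
```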

2503.17867 2026-06-17 cs.CR cs.AI cs.LG cs.NI 版本更新

Detecting and Mitigating DDoS Attacks with AI: A Survey

利用人工智能检测和缓解DDoS攻击:综述

Alexandru Apostu, Silviu Gheorghe, Andrei Hîji, Nicolae Cleju, Andrei Pătraşcu, Cristian Rusu, Radu Ionescu, Paul Irofti

发表机构 * Department of Computer Science, University of Bucharest(布加勒斯特大学计算机科学系)

AI总结 本文综述了基于AI的DDoS攻击检测与缓解方法,提供了基于专家层次和AI生成树状图的分类法,讨论了数据集、对抗训练及未来研究方向。

AI中文摘要

分布式拒绝服务攻击是一个活跃的网络安全研究问题。最近的研究从基于静态规则的防御转向基于AI的检测和缓解。本综述涵盖了几个关键主题。首先,讨论了最先进的AI检测方法。提供了基于手动专家层次和AI生成的树状图的深入分类法,从而解决了DDoS分类的歧义。随后讨论了可用的数据集,涵盖了数据格式选项及其在训练AI检测方法中的作用,以及对抗训练和示例增强。除了检测,还调查了基于AI的缓解技术。最后,提出了多个开放的研究方向。

英文摘要

Distributed Denial of Service (DDoS) attacks remain an active cybersecurity research problem. Recent research has shifted from static rule-based defenses towards AI-based detection and mitigation. This survey covers several key topics. First, state-of-the-art AI detection methods are discussed. An in-depth taxonomy based on manual expert hierarchies and an AI-generated dendrogram is provided, resolving ambiguities in DDoS categorization. A discussion of available datasets follows, covering data format options and their role in training AI detection methods, together with adversarial training and example augmentation. Beyond detection, AI-based mitigation techniques are surveyed as well. Finally, multiple open research directions are proposed.

2507.04704 2026-06-17 q-bio.QM cs.AI cs.CV 版本更新

SPATIA: Multimodal Generation and Prediction of Spatial Cell Phenotypes

SPATIA: 空间细胞表型的多模态生成与预测

Zhenglun Kong, Mufan Qiu, John Boesen, Xiang Lin, Sukwon Yun, Tianlong Chen, Manolis Kellis, Marinka Zitnik

AI总结 提出SPATIA模型,融合细胞形态、基因表达和空间上下文,通过置信感知流匹配和形态-谱对齐实现多尺度生成与预测,在12项任务中优于18个基线模型。

Comments ICML 2026

AI中文摘要

理解细胞形态、基因表达和空间上下文如何共同塑造组织功能是生物学中的一个核心挑战。基于图像的空间转录组学技术现在能够提供细胞图像和基因表达谱的高分辨率测量,但现有方法通常孤立地分析这些模态或以有限的分辨率进行分析。我们通过引入SPATIA来解决这个问题,这是一个多层次的生成和预测模型,通过融合从细胞到组织水平的形态、基因表达和空间上下文,学习统一的、空间感知的表征。SPATIA还结合了一个空间条件生成框架,该框架具有置信感知的OT重加权和形态-谱对齐,用于建模目标状态形态分布。具体来说,我们提出了一个置信感知的流匹配目标,该目标基于不确定性对弱最优传输对进行重加权。我们进一步应用形态-谱对齐来鼓励有生物学意义的图像生成,从而能够建模微环境依赖的表型转变。我们组装了一个多尺度数据集,包含17个组织中的2590万个细胞-基因对。我们在12项任务上对SPATIA与18个模型进行了基准测试,涵盖表型生成、注释、聚类、基因插补和跨模态预测等类别。SPATIA相比最先进模型取得了改进,生成保真度提高了8%,预测准确率提高了3%。

英文摘要

Understanding how cellular morphology, gene expression, and spatial context jointly shape tissue function is a central challenge in biology. Image-based spatial transcriptomics technologies now provide high-resolution measurements of cell images and gene expression profiles, but existing methods typically analyze these modalities in isolation or at limited resolution. We address the problem by introducing SPATIA, a multi-level generative and predictive model that learns unified, spatially aware representations by fusing morphology, gene expression, and spatial context from the cell to the tissue level. SPATIA also incorporates a spatially conditioned generative framework with confidence-aware OT reweighting and morphology-profile alignment for modeling target-state morphology distributions. Specifically, we propose a confidence-aware flow matching objective that reweights weak optimal-transport pairs based on uncertainty. We further apply morphology-profile alignment to encourage biologically meaningful image generation, enabling the modeling of microenvironment-dependent phenotypic transitions. We assembled a multi-scale dataset consisting of 25.9 million cell-gene pairs across 17 tissues. We benchmark SPATIA against 18 models across 12 tasks, spanning categories such as phenotype generation, annotation, clustering, gene imputation, and cross-modal prediction. SPATIA achieves improved performance over state-of-the-art models, improving generative fidelity by 8% and predictive accuracy by up to 3%.

2507.17188 2026-06-17 cs.NI cs.AI cs.CR 版本更新

LLM-Aided Joint Secrecy Precoding and Trajectory for RSMA-Based Heterogeneous UAV Networks

基于RSMA的异构无人机网络中LLM辅助的联合保密预编码与轨迹设计

Lijie Zheng, Ji He, Shih Yu Chang, Yulong Shen

发表机构 * School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院) Department of Applied Data Science, San Jose State University(圣何塞州立大学应用数据科学系)

AI总结 针对RSMA异构无人机网络中的安全通信问题,提出分层优化框架:内层用SDR-S2DC算法求解固定位置下的保密预编码,外层用LLM引导的多智能体强化学习优化轨迹,实现保密速率与能效的权衡。

AI中文摘要

本文研究了速率分割多址接入(RSMA)使能的异构无人机网络中的安全通信问题,其中多个无人机在存在窃听者的情况下协作服务地面终端。通过联合考虑保密速率最大化和推进能量消耗最小化,我们构建了一个多目标优化问题,涉及无人机轨迹设计、服务关联、功率分配和保密预编码,并受到移动性、碰撞避免、服务容量和通信约束。由于无人机轨迹、RSMA传输变量和保密约束之间的耦合,所构建的问题高度非凸。为了解决这一非凸且高度耦合的优化问题,我们提出了一种分层优化框架。内层使用基于半定松弛(SDR)的S2DC算法,结合惩罚函数和凸差(D.C.)规划,在固定无人机位置下求解保密预编码问题。外层引入了一种大语言模型(LLM)引导的启发式多智能体强化学习方法(LLM-HeMARL)用于轨迹优化。LLM-HeMARL高效地整合了LLM生成的专家启发式策略,使无人机能够学习能量感知、安全驱动的轨迹,而无需承担实时LLM调用的推理开销。仿真结果表明,我们的方法在保密速率和能效方面优于现有基线,并在不同的无人机群规模和随机种子下表现出一致的鲁棒性。

英文摘要

This paper investigates secure communications in rate-splitting multiple access (RSMA) enabled heterogeneous UAV networks, where multiple UAVs collaboratively serve ground terminals in the presence of eavesdroppers. By jointly considering secrecy rate maximization and propulsion energy consumption minimization, we formulate a multi-objective optimization problem involving UAV trajectory design, service association, power allocation, and secrecy precoding under mobility, collision-avoidance, service-capacity, and communication constraints. The formulated problem is highly non-convex due to the coupling among UAV trajectories, RSMA transmission variables, and secrecy constraints. To address the resulting non-convex and highly coupled optimization problem, we propose a hierarchical optimization framework. The inner layer uses a semidefinite relaxation (SDR)-based S2DC algorithm combining penalty functions and difference-of-convex (D.C.) programming to solve the secrecy precoding problem with fixed UAV positions. The outer layer introduces a Large Language Model (LLM)-guided heuristic multi-agent reinforcement learning approach (LLM-HeMARL) for trajectory optimization. LLM-HeMARL efficiently incorporates an LLM-generated expert heuristic policy, enabling UAVs to learn energy-aware, security-driven trajectories without the inference overhead of real-time LLM calls. The simulation results show that our method outperforms existing baselines in secrecy rate and energy efficiency, with consistent robustness across varying UAV swarm sizes and random seeds.
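
The secrecy rate that the formulation maximizes is commonly defined as the gap between the legitimate user's achievable rate and the eavesdropper's, floored at zero. The sketch below uses that textbook single-stream definition; the paper's RSMA-specific expression (with common and private streams) is not given in the abstract, so this is a simplification, and the SINR values are illustrative.

```python
import math


def secrecy_rate(sinr_user, sinr_eavesdropper):
    """Textbook secrecy rate in bit/s/Hz: max(0, log2(1+SINR_u) - log2(1+SINR_e))."""
    user_rate = math.log2(1.0 + sinr_user)
    eavesdropper_rate = math.log2(1.0 + sinr_eavesdropper)
    return max(0.0, user_rate - eavesdropper_rate)


if __name__ == "__main__":
    print(secrecy_rate(15.0, 3.0))  # user channel is stronger than the eavesdropper's
    print(secrecy_rate(1.0, 3.0))   # eavesdropper is stronger, so the rate is floored at 0
```

The `max(0, ...)` term is one source of the non-convexity mentioned above: both rates depend on UAV positions and precoders, and their difference is neither convex nor concave in general.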

2509.15210 2026-06-17 cs.SD cs.AI cs.LG 版本更新

Explicit Context-Driven Neural Acoustic Modeling for High-Fidelity RIR Generation

显式上下文驱动的神经声学建模用于高保真RIR生成

Chen Si, Qianyi Wu, Chaitanya Amballa, Romit Roy Choudhury

AI总结 提出MiNAF模型,通过查询房间网格并提取距离分布作为显式局部几何特征,引导神经隐式模型生成更准确的房间脉冲响应(RIR),在多项指标上达到竞争性能。

AI中文摘要

逼真的声音模拟在许多应用中起着关键作用。声音模拟的一个关键要素是房间脉冲响应(RIR),它描述了声音在给定空间中的传播方式。最近的研究应用神经隐式方法,利用从环境中收集的上下文信息(如场景图像)来学习RIR。然而,这些方法没有有效利用环境中的显式几何信息。为了进一步利用具有直接几何特征的神经隐式模型,我们提出了MiNAF,它在给定位置查询粗略的房间网格,并提取距离分布作为局部上下文的显式表示。我们的方法表明,结合显式的局部几何特征可以更好地引导模型生成更准确的RIR预测。通过与常规和最先进方法的比较,我们展示了MiNAF在各种评估指标上具有竞争力的性能。

英文摘要

Realistic sound simulation plays a critical role in many applications. A key element in sound simulation is the room impulse response (RIR), which characterizes how sound propagates within a given space. Recent studies have applied neural implicit methods to learn RIR using context information collected from the environment, such as scene images. However, these approaches do not effectively leverage explicit geometric information from the environment. To further exploit neural implicit models with direct geometric features, we present MiNAF, which queries a rough room mesh at given locations and extracts distance distributions as an explicit representation of local context. Our approach demonstrates that incorporating explicit local geometric features can better guide the model in generating more accurate RIR predictions. Through comparisons with conventional and state-of-the-art methods, we show that MiNAF performs competitively across various evaluation metrics.
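
The notion of "distance distributions as an explicit representation of local context" can be illustrated with a toy version. The sketch replaces the room mesh with an axis-aligned 2D rectangle and casts evenly spaced rays from a query point; the real method queries a 3D mesh, so the geometry, the ray count, and the function names here are all simplifying assumptions.

```python
import math


def ray_distances(x, y, width, height, n_rays=8):
    """Distance from (x, y), strictly inside a width x height room, to the wall along each ray."""
    distances = []
    for i in range(n_rays):
        theta = 2.0 * math.pi * i / n_rays
        dx, dy = math.cos(theta), math.sin(theta)
        hits = []
        if dx > 1e-12:
            hits.append((width - x) / dx)   # right wall
        elif dx < -1e-12:
            hits.append(-x / dx)            # left wall
        if dy > 1e-12:
            hits.append((height - y) / dy)  # top wall
        elif dy < -1e-12:
            hits.append(-y / dy)            # bottom wall
        distances.append(min(hits))
    return distances


def distance_histogram(distances, bin_edges):
    """Normalised histogram over half-open bins [edge_i, edge_{i+1})."""
    counts = [0] * (len(bin_edges) - 1)
    for d in distances:
        for b in range(len(counts)):
            if bin_edges[b] <= d < bin_edges[b + 1]:
                counts[b] += 1
                break
    return [c / len(distances) for c in counts]


if __name__ == "__main__":
    d = ray_distances(1.0, 1.0, width=4.0, height=2.0, n_rays=4)
    print(d)
    print(distance_histogram(d, [0.0, 2.0, 4.0]))
```

A point near a corner yields a histogram dominated by short distances, while a point in the middle of a large room yields longer ones; a feature of this kind is what would be fed to the implicit model alongside the query position.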

2510.21127 2026-06-17 cs.NI cs.AI 版本更新

Enhanced Evolutionary Multi-Objective Deep Reinforcement Learning for Reliable and Efficient Wireless Rechargeable Sensor Networks

增强型进化多目标深度强化学习用于可靠高效无线可充电传感器网络

Bowei Tong, Hui Kang, Jiahui Li, Geng Sun, Jiacheng Wang, Yaoqi Yang, Bo Xu, Dusit Niyato

AI总结 针对无线可充电传感器网络中节点存活率与充电能效的权衡问题,提出一种结合LSTM策略网络、MLP前瞻增量模型和时变Pareto策略评估的增强型进化多目标深度强化学习算法,显著优于现有方法。

Comments The article content needs to be significantly revised

AI中文摘要

尽管传感器网络取得了快速进展,但传统的电池供电传感器网络存在运行寿命有限和维护频繁的问题,严重限制了其在偏远和不可达环境中的部署。因此,具有移动充电能力的无线可充电传感器网络(WRSNs)为延长网络寿命提供了一种有前景的解决方案。然而,WRSNs面临着在动态运行条件下最大化节点存活率与最大化充电能效之间固有权衡的关键挑战。在本文中,我们研究了一个典型场景,其中移动充电器移动并为传感器充电,从而在最小化能量浪费的同时维持网络连通性。具体而言,我们制定了一个多目标优化问题,该问题同时最大化多个时隙内的网络节点存活率和移动充电器能量使用效率,这具有NP-hard计算复杂性和长期时间依赖性,使得传统优化方法无效。为了解决这些挑战,我们提出了一种增强型进化多目标深度强化学习算法,该算法集成了基于长短期记忆(LSTM)的策略网络用于时间模式识别、基于多层感知器的前瞻增量模型用于未来状态预测,以及时变Pareto策略评估方法用于动态偏好适应。大量仿真结果表明,所提算法在平衡节点存活率和能量效率方面显著优于现有方法,同时生成多样化的Pareto最优解。此外,LSTM增强的策略网络比传统网络收敛速度快25%,时变评估方法有效适应动态条件。

英文摘要

Despite rapid advancements in sensor networks, conventional battery-powered sensor networks suffer from limited operational lifespans and frequent maintenance requirements that severely constrain their deployment in remote and inaccessible environments. As such, wireless rechargeable sensor networks (WRSNs) with mobile charging capabilities offer a promising solution to extend network lifetime. However, WRSNs face critical challenges from the inherent trade-off between maximizing node survival rates and maximizing charging energy efficiency under dynamic operational conditions. In this paper, we investigate a typical scenario where mobile chargers move and charge the sensors, thereby maintaining network connectivity while minimizing energy waste. Specifically, we formulate a multi-objective optimization problem that simultaneously maximizes the network node survival rate and mobile charger energy usage efficiency across multiple time slots. The problem is NP-hard and has long-term temporal dependencies, which make traditional optimization approaches ineffective. To address these challenges, we propose an enhanced evolutionary multi-objective deep reinforcement learning algorithm, which integrates a long short-term memory (LSTM)-based policy network for temporal pattern recognition, a multilayer perceptron-based prospective increment model for future state prediction, and a time-varying Pareto policy evaluation method for dynamic preference adaptation. Extensive simulation results demonstrate that the proposed algorithm significantly outperforms existing approaches in balancing node survival rate and energy efficiency while generating diverse Pareto-optimal solutions. Moreover, the LSTM-enhanced policy network converges 25% faster than conventional networks, with the time-varying evaluation method effectively adapting to dynamic conditions.
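
Since the abstract evaluates policies by the Pareto-optimal trade-offs they produce, a minimal non-dominated filter for two maximised objectives (node survival rate, charging energy efficiency) may help. This is the generic definition of Pareto dominance, not the paper's time-varying evaluation method, and the sample scores are invented.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives maximised)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # (survival_rate, energy_efficiency) of four hypothetical policies
    policies = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
    print(pareto_front(policies))
```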

2510.23798 2026-06-17 cs.CV cs.AI 版本更新

A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras

一种基于几何和深度学习的可复现流水线,用于利用原位摄像头监测城市河流中的漂浮人为碎片

Gauthier Grimmer, Romain Wenger, Clément Flint, Germain Forestier, Gilles Rixhon, Valentin Chardon

AI总结 提出结合几何模型与深度学习的框架,利用固定摄像头连续量化监测城市河流漂浮碎片,并评估不同模型在复杂环境下的精度与速度,通过投影几何实现碎片尺寸估计。

AI中文摘要

河流中漂浮人为碎片的扩散已成为一个紧迫的环境问题,对生物多样性、水质以及人类活动(如航行和娱乐)产生不利影响。本研究提出了一种新颖的方法框架,利用固定的原位摄像头监测上述废弃物。本研究提供了两个关键贡献:(i)利用深度学习对漂浮碎片进行连续量化和监测;(ii)在复杂环境条件下,识别出在精度和推理速度方面最合适的深度学习模型。这些模型在多种环境条件和学习配置下进行测试,包括与数据泄漏相关的偏差实验。此外,实现了一个几何模型,用于从二维图像估计检测对象的实际尺寸。该模型利用了相机的内参和外参特性。本研究结果强调了数据集构建协议的重要性,特别是在负样本图像的整合和时间泄漏的考虑方面。最后,证明了使用投影几何结合回归校正进行公制物体估计的可行性。该方法为开发稳健、低成本、自动化的城市水生环境监测系统铺平了道路。

英文摘要

The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for monitoring this debris using fixed, in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.
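
The size-estimation step can be illustrated with the simplest pinhole-camera relation. The sketch assumes the object lies in a plane parallel to the image plane at a known distance, which is a strong simplification: the paper's model uses the camera's intrinsic and extrinsic parameters plus a regression correction, none of which is reproduced here. The numbers are illustrative.

```python
def object_size_m(pixel_extent, distance_m, focal_length_px):
    """Pinhole estimate: real size = pixel extent * distance / focal length (in pixels)."""
    if focal_length_px <= 0:
        raise ValueError("focal length must be positive")
    return pixel_extent * distance_m / focal_length_px


if __name__ == "__main__":
    # A bounding box 200 px wide, 10 m from a camera with a 1000 px focal length
    print(object_size_m(200, 10.0, 1000.0))
```

For a camera looking obliquely at the water surface the distance varies across the image, which is why a full projective model and a correction step are needed in practice.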

2512.25065 2026-06-17 cs.OS cs.AI cs.DC 版本更新

Vulcan: Instance-specialized, Verifiable Systems Heuristics Through LLM-driven Search

Vulcan:通过LLM驱动的搜索实现实例特化的可验证系统启发式方法

Rohit Dwivedula, Divyanshu Saxena, Sujay Yadalam, Eric Hayden Campbell, Daehyeok Kim, Aditya Akella

AI总结 提出Vulcan框架,利用LLM生成系统启发式方法,通过隔离决策逻辑和受限语言Anvil保证安全,在调度、缓存和内存管理上取得显著性能提升。

Comments 19 pages

AI中文摘要

系统资源管理任务主要依赖于手工设计的启发式方法。然而,日益增长的硬件异构性和工作负载多样性要求针对特定部署实例进行特化的启发式方法,这使得手动设计成本高昂且难以扩展。在本文中,我们探索如何使用LLM合成系统启发式方法。主要挑战是确保生成的启发式方法安全执行、正确集成到周围系统中,同时仍能实现强大的性能。我们提出Vulcan,一个识别LLM友好接口的框架,该接口将核心决策逻辑与其余实现隔离。使用Vulcan,LLM生成的代码被限制为简单的无状态决策函数,而可信的运行时抽象提供丰富的派生统计信息,用于有意义的策略探索,而不会出现系统集成错误。为了确保执行安全,LLM使用受限语言Anvil合成启发式方法,该语言通过构造保证重要属性。我们在三个研究充分的领域评估Vulcan,并展示了在spot-VM调度中高达4.9倍的节省,缓存驱逐中高达2倍的未命中率降低,以及分层内存系统中高达10%的应用性能提升,同时全程确保执行安全。

英文摘要

Systems resource management tasks rely primarily on hand-designed heuristics. However, growing hardware heterogeneity and workload diversity require heuristics specialized to particular deployment instances, making manual design expensive and difficult to scale. In this paper, we explore how to synthesize systems heuristics using LLMs. The main challenge is ensuring that generated heuristics execute safely, integrate correctly with the surrounding system, and still achieve strong performance. We propose Vulcan, a framework that identifies LLM-friendly interfaces that isolate core decision logic from the rest of the implementation. With Vulcan, LLM-generated code is restricted to simple stateless decision functions, while trusted runtime abstractions provide rich derived statistics for meaningful policy exploration without system-integration bugs. To ensure execution safety, LLMs synthesize heuristics in a restricted language, Anvil, that guarantees important properties by construction. We evaluate Vulcan across three well-studied domains and demonstrate up to 4.9x higher savings for spot-VM scheduling, up to 2x lower miss ratios for cache eviction, and up to 10% higher application performance for tiered-memory systems, while ensuring execution safety throughout.
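
The split between trusted runtime and generated decision logic can be sketched for the cache-eviction case. The abstract does not show Anvil syntax or Vulcan's interfaces, so this uses plain Python as a stand-in, and the statistic names (`hits`, `seconds_since_access`) and both functions are hypothetical.

```python
def keep_score(stats):
    """Stateless decision function: depends only on the statistics passed in.

    This is the kind of small function a generator would be asked to produce.
    """
    return stats["hits"] / (1.0 + stats["seconds_since_access"])


def choose_victim(candidates):
    """Trusted runtime side: evict the entry with the lowest keep score."""
    return min(candidates, key=lambda key: keep_score(candidates[key]))


if __name__ == "__main__":
    cache = {
        "a": {"hits": 10, "seconds_since_access": 1},
        "b": {"hits": 1, "seconds_since_access": 9},
    }
    print(choose_victim(cache))
```

Because `keep_score` holds no state and touches no cache internals, a faulty generated policy can make poor choices but cannot corrupt the data structure that the runtime owns.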

2603.04438 2026-06-17 eess.IV cs.AI cs.LG 版本更新

CogGen: Cognitive-Load-Inspired Fully Unsupervised Deep Generative Modeling for Compressively Sampled MRI Reconstruction

CogGen: 认知负荷启发的全无监督深度生成模型用于压缩感知MRI重建

Qingyong Zhu, Yumin Tan, Xiang Gu, Dong Liang

AI总结 提出CogGen框架,基于认知易到难原则,通过自定进度课程学习和MRI感知双阈值加权策略,将CS-MRI重建分解为分阶段反演问题,理论证明降低局部充分迭代界和累积噪声放大界,实验优于现有无监督和有监督方法。

AI中文摘要

全无监督深度生成建模(FU-DGM)为压缩感知磁共振成像(CS-MRI)重建提供了巨大潜力。代表性的FU-DGM公式,如深度图像先验(DIP)和隐式神经表示(INR),利用架构偏置在图像空间中诱导与正向观测对齐的低维流形。然而,由于底层逆系统高度病态,FU-DGM中长时间的迭代拟合通常导致效率低下和噪声放大。本文受认知易到难学习原则的启发,提出CogGen,一种将CS-MRI重建重新表述为分阶段反演问题的FU-DGM框架。具体地,CogGen通过MRI感知的双阈值加权准则实现自定进度课程学习(SPCL)驱动的渐进调度策略,该准则自适应地调节k空间测量参与。数据一致性残差阈值评估当前生成器的拟合可靠性,而k空间半径阈值控制阶段性的测量暴露,从而避免整个优化过程中的均匀拟合。理论上,我们的分析表明,当早期阶段倾向于易拟合的测量时,CogGen产生更低的局部充分迭代界和更小的累积噪声放大界,解释了CogGen在有限迭代预算内改进的收敛行为和重建保真度。数值实验表明,CogGen的两种实例化,CogGen-DIP和CogGen-INR,在包括无监督和有监督流程在内的现有CS-MRI重建技术中实现了优越的性能。

英文摘要

Fully unsupervised deep generative modeling (FU-DGM) offers significant potential for compressively sampled magnetic resonance imaging (CS-MRI) reconstruction. Representative FU-DGM formulations, such as deep image prior (DIP) and implicit neural representation (INR), employ architectural bias to induce a low-dimensional manifold in the image space that aligns with the forward observation. However, as the underlying inverse system is highly ill-posed, prolonged iterative fitting in FU-DGM typically leads to poor efficiency and noise amplification. In this paper, guided by the cognitive principle of easy-to-hard learning, we propose CogGen, an FU-DGM framework that reformulates CS-MRI reconstruction as a staged inversion problem. Specifically, CogGen implements a self-paced curriculum learning (SPCL)-driven progressive scheduling strategy through an MRI-aware dual-threshold weighting criterion, which adaptively regulates k-space measurement participation. The data-consistency residual thresholding evaluates the fitting reliability of the current generator, while the k-space radius thresholding controls stage-wise measurement exposure, thereby avoiding uniform fitting throughout optimization. Theoretically, our analysis shows that, when early stages favor easy-to-fit measurements, CogGen yields a reduced local sufficient-iteration bound and a smaller cumulative noise-amplification bound, explaining the improved convergence behavior and reconstruction fidelity of CogGen within a finite iteration budget. Numerical experiments demonstrate that both CogGen instantiations, CogGen-DIP and CogGen-INR, achieve superior performance over prevailing CS-MRI reconstruction techniques, including unsupervised and supervised pipelines.
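
A dual-threshold weighting of k-space measurements might be sketched as follows. The abstract names the two thresholds (data-consistency residual and k-space radius) but not how they are combined or scheduled; the binary weights, the logical AND, and the two-stage schedule below are assumptions for illustration.

```python
import math


def measurement_weights(coords, residuals, radius_limit, residual_limit):
    """Weight 1.0 for samples inside the radius AND with a small residual, else 0.0."""
    weights = []
    for (kx, ky), residual in zip(coords, residuals):
        inside_radius = math.hypot(kx, ky) <= radius_limit
        reliable_fit = residual <= residual_limit
        weights.append(1.0 if inside_radius and reliable_fit else 0.0)
    return weights


if __name__ == "__main__":
    coords = [(0, 0), (3, 4), (1, 1)]   # k-space sample locations
    residuals = [0.1, 0.1, 0.9]          # current data-consistency residuals
    # Early stage: tight thresholds admit only easy, low-frequency samples.
    print(measurement_weights(coords, residuals, radius_limit=2.0, residual_limit=0.5))
    # Later stage: relaxed thresholds expose all measurements.
    print(measurement_weights(coords, residuals, radius_limit=6.0, residual_limit=1.0))
```

The weights would multiply the per-sample data-consistency loss, so early iterations fit only the admitted samples and later stages gradually bring in the rest.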

2603.26551 2026-06-17 cs.CV cs.AI 版本更新

Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

超越MACs:面向视觉骨干网络的硬件高效架构设计

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

发表机构 * Machine Learning and Perception Lab, University of Udine(乌迪内大学机器学习与感知实验室) Centre for Vision Research, York University(约克大学视觉研究中心)

AI总结 针对MACs指标在边缘设备上的不足,提出基于硬件效率洞察的LowFormer骨干网络,通过轻量级Lowtention模块实现显著加速。

Comments Accepted at International Journal of Computer Vision (IJCV)

Journal ref Int J Comput Vis 134, 295 (2026)
AI中文摘要

视觉骨干网络在现代计算机视觉中扮演核心角色。提升其效率直接惠及广泛的下游应用。为衡量效率,许多论文依赖MACs(乘累加操作)作为执行时间的预测指标。本文通过实验证明该指标的缺陷,尤其在边缘设备场景下。通过对比常见架构设计元素的MAC计数和执行时间,我们识别出高效执行的关键因素,并提供优化骨干设计的见解。基于这些见解,我们提出LowFormer,一种新型视觉骨干家族。LowFormer采用精简的宏观和微观设计,包括Lowtention,即多头自注意力的轻量级替代方案。Lowtention不仅更高效,还在ImageNet上取得了更优结果。此外,我们提出LowFormer的边缘GPU版本,可在边缘GPU和桌面GPU上进一步超越其基线的速度。通过在更小的图像分类数据集上进行评估,并将其适配到多个下游任务(如目标检测、语义分割、图像检索和视觉目标跟踪),我们展示了LowFormer的广泛适用性。与近期最先进的骨干网络相比,LowFormer模型在各种硬件平台上均实现了显著加速。代码和模型见 https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md

英文摘要

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, that can further improve upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
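
For context on the metric being criticised, the sketch below counts MACs for a standard convolution and for a depthwise-separable replacement using the usual formula. It only counts operations; it makes no claim about measured latency, and the layer sizes are illustrative rather than taken from LowFormer.

```python
def conv_macs(out_h, out_w, in_ch, out_ch, kernel, groups=1):
    """Multiply-accumulate count of a 2D convolution (bias ignored)."""
    return out_h * out_w * out_ch * (in_ch // groups) * kernel * kernel


if __name__ == "__main__":
    standard = conv_macs(56, 56, 64, 64, kernel=3)
    depthwise = conv_macs(56, 56, 64, 64, kernel=3, groups=64)
    pointwise = conv_macs(56, 56, 64, 64, kernel=1)
    separable = depthwise + pointwise
    print(f"standard 3x3:        {standard:,} MACs")
    print(f"depthwise separable: {separable:,} MACs")
    print(f"MAC ratio:           {standard / separable:.2f}x")
```

The separable variant needs several times fewer MACs, yet the abstract's point is that execution time on real hardware need not shrink by the same factor, so a MAC count like this should be read alongside measured latency on the target device.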

2604.09998 2026-06-17 cs.CR cs.AI 版本更新

Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit

像锤子一样,它能建造,也能破坏:Reddit上网络安全运营中大语言模型的使用、认知与采纳

Souradip Nath, Chih-Yi Huang, Aditi Ganapathi, Kashyap Thimmaraju, Jaron Mink, Gail-Joon Ahn

AI总结 通过对Reddit网络安全论坛892篇帖子进行混合方法分析,研究安全从业者使用LLM工具的模式、认知和采纳情况,发现LLM主要用于低风险、生产力导向任务,企业级安全平台受关注,但可靠性、验证开销和安全问题限制了其自主性。

Comments This paper appears in the Proceedings of the Twenty-Second Symposium on Usable Privacy and Security (SOUPS) 2026

详情
AI中文摘要

大语言模型(LLM)近期作为增强安全运营中心(SOC)工作流程的有前景工具出现,供应商越来越多地推广用于SOC的自主AI解决方案。然而,对于现实世界安全从业者如何使用、感知和采纳这些工具,仍缺乏实证理解。为填补这一空白,我们对网络安全论坛中的讨论进行了混合方法分析,以了解多样化从业者群体如何将现代LLM工具用于安全运营。具体而言,我们分析了Reddit上三个网络安全论坛在2022年12月至2025年9月间的892篇帖子,并采用定性编码和统计分析相结合的方法,从三个维度考察安全从业者如何讨论LLM工具:(1)他们声明的工具和用例,(2)每个工具在一组关键因素上的感知优缺点,以及(3)他们对这些工具的采纳以及对网络安全行业和个人分析师的预期影响。总体而言,我们的发现揭示了LLM工具采纳的细微模式,突出了LLM在低风险、生产力导向任务中的独立使用,以及对企业级、安全导向LLM平台的积极兴趣。尽管从业者报告了LLM辅助工作流程在效率和效果上的显著提升,但可靠性、验证开销和安全问题等持续存在的问题严重限制了赋予LLM工具的自主性。基于这些结果,我们还为开发和采纳LLM工具提供了建议,以确保组织的安全和网络安全从业者的安全。

英文摘要

Large language models (LLMs) have recently emerged as promising tools for augmenting Security Operations Center (SOC) workflows, with vendors increasingly marketing autonomous AI solutions for SOCs. However, there remains a limited empirical understanding of how such tools are used, perceived, and adopted by real-world security practitioners. To address this gap, we conduct a mixed-methods analysis of discussions in cybersecurity-focused forums to learn how a diverse group of practitioners use and perceive modern LLM tools for security operations. More specifically, we analyzed 892 posts between December 2022 and September 2025 from three cybersecurity-focused forums on Reddit, and, using a combination of qualitative coding and statistical analysis, examined how security practitioners discuss LLM tools across three dimensions: (1) their stated tools and use cases, (2) the perceived pros and cons of each tool across a set of critical factors, and (3) their adoption of such tools and the expected impacts on the cybersecurity industry and individual analysts. Overall, our findings reveal nuanced patterns in LLM tools adoption, highlighting independent use of LLMs for low-risk, productivity-oriented tasks, alongside active interest around enterprise-grade, security-focused LLM platforms. Although practitioners report meaningful gains in efficiency and effectiveness in LLM-assisted workflows, persistent issues with reliability, verification overheads, and security risks sharply constrain the autonomy granted to LLM tools. Based on these results, we also provide recommendations for developing and adopting LLM tools to ensure the security of organizations and the safety of cybersecurity practitioners.

2605.12729 2026-06-17 cs.NI cs.AI cs.CR 版本更新

Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety

用于代理网络运维和AI运维的大型语言模型:架构、评估与安全

Muhammad Bilal, Jon Crowcroft, Ruizhi Wang, Xiaolong Xu, Schahram Dustdar

发表机构 * School of Computing and Communications(计算与通信学院) University of Cambridge(剑桥大学) School of Software(软件学院) Nanjing University of Information Science and Technology(南京信息工程大学) TU Wien(维也纳技术大学) ICREA

AI总结 本文探讨了大型语言模型在网络运维和AI运维中的应用,分析了代理架构、评估方法及安全挑战,强调系统可靠性依赖于模型周边机制,而非模型本身。

Comments 49 pages, 15 figures, 6 tables; survey article

详情
AI中文摘要

大型语言模型正越来越多地用于支持网络运维(NetOps)和人工智能运维(AIOps),包括事件调查、根本原因分析、配置合成和有限的自动修复。在NetOps和AIOps中,这种转变正在改变任务管理方式。基于代理的运维以工作流方式运行,从收集证据到采取行动,遵循权限、策略和检查,并在必要时提供回滚选项。这至关重要,因为操作决策可能立即产生影响。为了使论点具体化,我们围绕自主性层次、工具范围、证据轨迹和保证合同组织相关文献。这些合同定义了代理可以观察、提议和执行的内容,以及在允许任何行动前必须通过的检查。在遥测查询推荐、诊断、根本原因分析、配置合成、变更规划和有限自动修复的研究中,出现了一致的模式。操作可靠性主要不来自模型本身,而是依赖于模型周围的机制。我们还主张评估应超越静态问答。代理NetOps和AIOps系统需要以工作流为中心的评估,包括轨迹质量、受限制的工具使用、安全提案生成、沙盒环境中的回放以及采用回滚感知评分的金丝雀试验。没有这些措施,系统可能看起来稳健,但实际上仍然过于脆弱。最后,我们考察了当代理靠近操作控制面时变得尖锐的安全、隐私和治理风险。综合来看,本综述得出结论:智能NetOps和AIOps的进步将取决于将自主性视为受约束的操作控制问题,其输出必须可靠、可审计且可安全部署。

英文摘要

Large language models are increasingly being used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis, and limited self-healing. In both NetOps and AIOps, this shift is changing how tasks are managed. Agent-based operations work as workflows, from gathering evidence to taking action, following permissions, policies, and checks, and providing rollback options when necessary. This is crucial because operational decisions can have instant impacts. To make the argument concrete, we organise the relevant literature around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute. They also define the checks that must pass before any action is allowed. A consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. Operational reliability does not come chiefly from the model itself. It depends on the machinery around the model. We also argue that evaluation should go beyond static question answering. Agentic NetOps and AIOps systems require workflow-centred evaluation, including trace quality, bounded tool use, safe proposal generation, replay in sandboxed environments, and canary trials with rollback-aware scoring. Without these measures, a system may appear robust yet remain too fragile. Finally, we examine security, privacy, and governance risks that become acute when agents sit close to operational control surfaces. Taken together, the survey concludes that progress in intelligent NetOps and AIOps will depend on treating autonomy as a constrained operational control problem, whose outputs must be reliable, auditable, and securely deployable.

2605.24003 2026-06-17 cs.CV cs.AI stat.AP 版本更新

Remote sensing data imputation using deep learning for multispectral imagery

基于深度学习的多光谱遥感数据插补

Shuang Liu, Fiona Johnson, Rohitash Chandra

发表机构 * Water Research Centre, University of New South Wales(新南威尔士大学水研究中心) ARC ITTC Data Analytics for Resources and Environments, University of New South Wales(新南威尔士大学ARC资源与环境数据分析产业转型培训中心) Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics, University of New South Wales(新南威尔士大学数学与统计学院过渡人工智能研究组)

AI总结 针对云覆盖导致的光学卫星数据缺失问题,本研究比较了线性插值与多种深度学习模型(CNN、Inception Resnet、Autoencoder及其与LSTM的组合)在四个有藻华历史记录的湖泊中重建缺失光谱波段的效果,发现深度学习模型显著优于基线方法,其中CNN表现最佳,且基于插补图像的藻华指数与观测数据吻合良好。

详情
AI中文摘要

近年来,遥感技术在水体应用中得到越来越多的利用。使用光学卫星数据的一个常见挑战是由于云覆盖导致的观测缺失。这些数据缺口可能导致错过对水资源管理部门高度关注的湖泊中关键事件(如藻华)的检测。因此,提高光学卫星数据集的完整性对于改善藻华的监测和预测至关重要。在本研究中,我们比较了传统数据插补方法(即线性插值)与深度学习模型在四个有藻华历史记录的湖泊中重建缺失光谱波段的效果。采用的深度学习模型包括基于CNN的架构(即CNN、Inception Resnet和Autoencoder)以及基于CNN-LSTM的架构(即CNN-LSTM、Resnet-LSTM和Autoencoder-LSTM)。我们的结果表明,在人工掩膜区域内插补光谱波段值时,深度学习模型显著优于基线线性插值方法。在这些模型中,CNN在大多数湖泊中表现最佳。此外,我们通过将插补图像与观测数据进行比较,评估了基于插补图像的藻华指数(即Green/Red和NDCI)的性能。我们的结果表明,深度学习模型对于插补PlanetScope SuperDove影像中的缺失数据是有效的,从而能够实现更可靠的水体监测应用。

英文摘要

Remote sensing techniques have been increasingly utilised in aquatic applications in recent years. A common challenge in using optical satellite data is the presence of missing observations due to cloud cover. These data gaps can lead to missed detection of critical events, such as algal blooms, in lakes of high interest to water authorities. As a result, enhancing the completeness of optical satellite datasets is crucial for improving the monitoring and prediction of algal blooms. In this study, we compared a traditional data imputation method (i.e., linear interpolation) with deep learning models for reconstructing missing spectral bands across four lakes with historical records of algal blooms. The deep learning models adopted include CNN-based architectures (i.e., CNN, Inception Resnet, and Autoencoder) and CNN-LSTM-based architectures (i.e., CNN-LSTM, Resnet-LSTM, and Autoencoder-LSTM). Our results demonstrated that deep learning models substantially outperformed the baseline linear interpolation method in imputing spectral band values within artificially masked regions. Among these models, CNN delivered the best performance across most lakes. Furthermore, we evaluated the performance of algal bloom indices (i.e., Green/Red and NDCI) derived from the imputed imagery by comparing them with the observed data. Our results demonstrate that deep learning models are effective for imputing missing data in PlanetScope SuperDove imagery, enabling more reliable applications in water monitoring.
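
The two bloom indices and the interpolation baseline named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the commonly used NDCI definition (red-edge and red reflectance), and it assumes the baseline interpolates along a per-pixel time series, which the abstract does not specify. All function names and reflectance values are hypothetical.

```python
def ndci(red_edge, red):
    """Normalized Difference Chlorophyll Index (common definition):
    (red_edge - red) / (red_edge + red)."""
    return (red_edge - red) / (red_edge + red)


def green_red_ratio(green, red):
    """Simple Green/Red band ratio."""
    return green / red


def linear_impute(series):
    """Fill None gaps by linear interpolation between the nearest
    observed neighbours. Leading and trailing gaps stay None."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)
            filled[i] = series[left] + t * (series[right] - series[left])
    return filled


# Hypothetical red-band reflectance for one pixel; None marks cloudy dates.
red_series = [0.10, None, None, 0.40]
red_filled = linear_impute(red_series)

# Index values for one hypothetical cloud-free observation.
example_ndci = ndci(red_edge=0.30, red=0.10)
example_ratio = green_red_ratio(green=0.12, red=0.10)
```

A learned model would replace `linear_impute`, while the index functions stay the same; comparing indices computed from imputed and observed bands is the kind of check the abstract describes.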

2605.29179 2026-06-17 cond-mat.mtrl-sci cs.AI 版本更新

Sustainable Metal-Organic Framework Water Harvesters in the Artificial Intelligence Era

人工智能时代可持续的金属有机框架水收集器

Reid A. Coyle, Shyam Chand Pal, Peter Walther, Saeun Park, Bin Feng, Zhiling Zheng

发表机构 * Department of Chemistry, Washington University(华盛顿大学化学系) Institute of Materials Science & Engineering, Washington University(华盛顿大学材料科学与工程学院)

AI总结 本文探讨了金属有机框架(MOF)在干旱条件下水收集的设计原理,并介绍了人工智能(AI)、大语言模型(LLM)和数据挖掘如何加速高性能吸附剂的发现。

Comments 10 pages of main text, 26 total pages. 3 Figures and 1 Table of Content Graphic

详情
AI中文摘要

金属有机框架(MOF)因其可调节的孔隙环境而成为水收集的优秀候选材料,这些孔隙环境可以被精确设计以在干旱条件下捕获和释放水。将人工智能(AI)整合到MOF发现中可以进一步加速高性能吸附剂的设计,通过识别增强大气水收集(AWH)、稳定性和循环效率的结构特征。在这篇视角文章中,我们考察了关键的MOF设计原理,包括协同吸附、操作相对湿度(RH)、吸附容量、滞后现象和可扩展性。我们强调了最近的设计进展,如多变量策略和长臂连接体延伸,并考察了这些原理如何调节孔隙容量和亲水性,同时保持稳定性和结晶性。此外,我们讨论了AI、大语言模型(LLM)和数据挖掘如何通过预测合成、逆向设计以及阐明合成-结构-性能关系来加速下一代MOF水收集器的发现过程。

英文摘要

Metal-organic frameworks (MOFs) are excellent candidates for water harvesting due to their tunable pore environments, which can be precisely engineered to capture and release water in arid conditions. Integrating artificial intelligence (AI) into MOF discovery can further accelerate the design of high-performance sorbents by identifying structural features that enhance atmospheric water harvesting (AWH), stability, and cycling efficiency. In this Perspective, we examine key MOF design principles, including cooperative adsorption, operational relative humidity (RH), uptake capacity, hysteresis, and scalability. We highlight recent design advancements such as multivariate strategies and long-arm linker extension, and examine how these principles tune pore capacity and hydrophilicity, while preserving stability and crystallinity. Furthermore, we discuss how AI, large language models (LLMs), and data mining can accelerate the discovery process through predictive synthesis, inverse design, and elucidating synthesis-structure-property relationships for the next generation of MOF water harvesters.

2606.11990 2026-06-17 cs.LG cs.AI 版本更新

Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation

用于剩余使用寿命估计的时间序列基础模型嵌入

Amir El-Ghoussani, Michele De Vita, Ronald Naumann, Vasileios Belagiannis

发表机构 * University of Erlangen-Nuremberg(埃尔朗根-纽伦堡大学) Siemens AG(西门子股份公司)

AI总结 提出冻结预训练时间序列基础模型Chronos-2作为骨干,结合轻量回归头进行剩余寿命预测,在工业传感器数据上优于多种基线方法。

Comments Accepted to EUSIPCO 2026, 4 pages, 2 figures, 2 tables

详情
AI中文摘要

剩余使用寿命(RUL)预测对于工业预测性维护至关重要,然而许多基于学习的方法依赖于大量的特征工程或大型标注数据集来训练特定任务的序列模型。在这项工作中,我们引入了一种轻量级学习方法,利用冻结的预训练时间序列基础模型(TSFM),并将其与一个小型回归头结合,用于从多变量传感器流中估计RUL。具体来说,我们使用Chronos-2作为冻结骨干来提取上下文窗口特征,并训练一个轻量级回归神经网络进行RUL预测。在来自两种设备类型的真实工业传感器数据上的实验表明,在相同的预处理和评估协议下,Chronos-2特征一致地优于循环、卷积、基于Transformer和梯度提升基线。我们进一步分析了上下文长度的影响,发现随着历史记录变长,性能显著提升,这表明TSFM表示为工业环境中的RUL估计提供了一种实用且数据高效的替代方案。

英文摘要

Remaining Useful Life (RUL) prediction is essential for industrial predictive maintenance, yet many learning-based approaches rely on extensive feature engineering or large labeled datasets to train task-specific sequence models. In this work, we introduce a lightweight learning approach, in which we leverage a frozen pretrained time-series foundation model (TSFM) and combine it with a small regression head for RUL estimation from multivariate sensor streams. More specifically, we use Chronos-2 as a frozen backbone to extract context window features and train a lightweight regression neural network for RUL prediction. Experiments on real-world industrial sensor data from two device types show that Chronos-2 features consistently improve over recurrent, convolutional, Transformer-based, and gradient-boosting baselines under the same preprocessing and evaluation protocol. We further analyze the impact of context length and find that performance improves significantly with longer histories, indicating that TSFM representations offer a practical and data-efficient alternative for RUL estimation in industrial settings.

2606.13919 2026-06-17 eess.IV cs.AI cs.CV 版本更新

GMN4AD: Graph Matching Network for Alzheimer's Disease Diagnosis with Test-Time Domain Adaptation using Multi-centered Structure Magnetic Resonance Imaging

GMN4AD:基于图匹配网络的阿尔茨海默病诊断与测试时域适应方法在多中心结构磁共振成像中的应用

Chen Zhao, Huan Huang, Yixin Xie, Jiajing Huang, Weihua Zhou

发表机构 * Department of Computer Science, Kennesaw State University(肯尼索州立大学计算机科学系) Department of Information Technology, Kennesaw State University(肯尼索州立大学信息技术系) School of Data Science and Analytics, Kennesaw State University(肯尼索州立大学数据科学与分析学院) Department of Applied Computing, Michigan Technological University(密歇根理工大学应用计算系)

AI总结 提出GMN4AD,利用图匹配网络建模异质脑图间关系,结合测试时域适应策略,在三个公共数据集上优于现有方法,实现鲁棒的AD诊断。

详情
AI中文摘要

阿尔茨海默病(AD)是一种进行性神经退行性疾病,影响数百万老年人,预计未来几年患病率将显著上升。早期诊断,特别是在轻度认知障碍(MCI)阶段,对于及时干预至关重要。结构磁共振成像(sMRI)已成为检测AD相关脑变化的关键模态,但传统的基于图的方法通常难以处理模态和站点间异质性,限制了诊断性能。在本文中,我们提出了用于阿尔茨海默病诊断的图匹配网络(GMN4AD),旨在建模来自神经影像数据的异质脑图之间的交互。与将每个脑图独立处理的传统方法不同,GMN4AD利用图匹配来捕获跨图关系,提高诊断精度。此外,我们引入了一种测试时域适应策略,结合对比学习来减轻推理过程中的域偏移。在三个公共AD数据集上的大量实验表明,GMN4AD相比最先进方法实现了优越的性能,为AD诊断提供了鲁棒且可泛化的解决方案。

英文摘要

Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that affects millions of older adults, with prevalence expected to rise significantly in the coming years. Early diagnosis, particularly during the mild cognitive impairment (MCI) stage, is critical for timely intervention. Structural Magnetic Resonance Imaging (sMRI) has emerged as a key modality for detecting AD-related brain changes, but traditional graph-based approaches often struggle with modality and inter-site heterogeneity, limiting diagnostic performance. In this paper, we propose Graph Matching Network for Alzheimer's Disease Diagnosis (GMN4AD), designed to model interactions between heterogeneous brain graphs derived from neuroimaging data. Unlike conventional methods that treat each brain graph independently, GMN4AD leverages graph matching to capture cross-graph relationships, enhancing diagnostic precision. Furthermore, we introduce a test-time domain adaptation strategy that incorporates contrastive learning to mitigate domain shifts during inference. Extensive experiments on three public AD datasets demonstrate that GMN4AD achieves superior performance compared to state-of-the-art methods, offering a robust and generalizable solution for AD diagnosis.

2606.14081 2026-06-17 cs.CV cs.AI cs.LG eess.IV 版本更新

Clay-CNN Hybrids: Leveraging Geospatial Foundation Models as Auxiliary Context for Landslide Detection

Clay-CNN混合模型:利用地理基础模型作为滑坡检测的辅助上下文

Huong Binh Vu

发表机构 * Harvard University(哈佛大学)

AI总结 针对滑坡检测中的极端类别不平衡问题,提出将地理基础模型Clay v1.5作为辅助上下文注入U-Net瓶颈的混合方法,在Landslide4Sense基准上达到64.5% F1,优于纯Clay或U-Net基线。

详情
AI中文摘要

灾后快速滑坡制图对灾害响应至关重要,但由于极端类别不平衡,自动化仍然困难。本研究评估了地理基础模型(GFM)Clay v1.5是否能够改善Landslide4Sense(L4S)基准上的像素级滑坡分割,该基准包含3,799个训练块,具有14个Sentinel-2和地形波段,约2%的正像素。我们比较了三种策略:Clay作为主编码器并融合多尺度残差地形、在瓶颈处注入Clay语义上下文的U-Net骨干、以及标准U-Net基线。采用两阶段低秩适应(LoRA)的混合U-Net + Clay模型在三个随机种子上的最佳测试F1为64.5±1.8%,超过了纯Clay骨干(55.2±3.6%)和U-Net基线(59.9%)。由于缺乏多尺度跳跃连接,Clay作为独立编码器的性能低于U-Net,但其预训练表示在作为辅助上下文注入时持续提升了性能。这些发现表明,GFM在滑坡检测中最有效的方式是补充空间细节丰富的卷积架构,而非替代它们。

英文摘要

Rapid post-event landslide mapping is essential for disaster response but remains difficult to automate due to extreme class imbalance. This study evaluates whether Clay v1.5, a Geospatial Foundation Model (GFM), can improve pixel-level landslide segmentation on the Landslide4Sense (L4S) benchmark, which contains 3,799 training chips with 14 Sentinel-2 and terrain bands and approximately 2% positive pixels. We compare three strategies: Clay as the primary encoder with multi-scale residual terrain fusion, a U-Net backbone augmented with Clay semantic context at the bottleneck, and a standard U-Net baseline. The hybrid U-Net + Clay model with two-stage Low-Rank Adaptation (LoRA) achieved the best test F1 of 64.5 +/- 1.8% over three seeds, surpassing the Clay-only backbone (55.2 +/- 3.6%) and the U-Net baseline (59.9%). Clay as a standalone encoder underperformed the U-Net due to the absence of multi-scale skip connections, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.
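
The reason F1 rather than accuracy is reported under roughly 2% positive pixels can be seen in a small worked example. The sketch below is illustrative only: the 100-pixel "chip" and both predictions are made up, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of pixels labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """F1 of the positive (landslide) class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical chip: 100 pixels, 2 of them landslide (2% positives).
y_true = [1, 1] + [0] * 98

# Prediction A: everything is background.
all_background = [0] * 100

# Prediction B: finds one landslide pixel, misses one, one false alarm.
one_hit = [1, 0, 1] + [0] * 97

acc_a, f1_a = accuracy(y_true, all_background), f1_score(y_true, all_background)
acc_b, f1_b = accuracy(y_true, one_hit), f1_score(y_true, one_hit)
```

Both predictions reach 98% accuracy, yet only the second has a non-zero F1 (0.5), so accuracy cannot separate a useless detector from a partly working one at this class ratio.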

11. 其他/综合AI 21 篇

2606.15575 2026-06-17 cs.AI cs.HC 新提交

Do we have the knowledge we need? Rethinking human-AI decision-making in corporations

我们是否拥有所需的知识?重新思考企业中的人机决策

Anne S. R. Marx, Ricardo M. Avelino, Torbjørn Netland, Mennatallah El-Assady

发表机构 * ETH Zurich(苏黎世联邦理工学院) Department of Computer Science & ETH AI Center, ETH Zurich(苏黎世联邦理工学院计算机科学系与ETH AI中心) Department of Computer Science & Architecture, ETH Zurich(苏黎世联邦理工学院计算机科学与建筑系) Department of Management, Technology, and Economics, ETH Zurich(苏黎世联邦理工学院管理、技术与经济系) Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系)

AI总结 本文提出一个框架,根据任务属性和知识可用性推荐人机代理分配与控制机制,并应用于制造任务示例。

Comments Proceedings of AutomationXP26 Workshop of the 2026 CHI Conference on Human Factors in Computing Systems, April 14, 2026, Barcelona, Spain. ACM, New York, NY, USA, 8 pages

详情
AI中文摘要

组织知识分散在各种软件系统、隐性知识和传统上为人类消费设计的手动文档中。随着AI系统越来越多地被部署并赋予决策角色,它们需要访问这些知识。这提出了两个问题:组织应如何存储和维护知识,使其对人类和未来的AI系统都可访问;以及在不同风险和不确定性水平的任务中,应如何在人类和AI之间分配代理权?在这篇立场论文中,我们描述了组织知识如何演变,并贡献了一个框架,将任务属性和知识可用性映射到推荐的代理分配和控制机制。我们通过两个不同的制造任务说明了该框架的适用性:一个常规操作(视觉质量检查)和一个一次性战略决策(工厂选址),并总结了未来研究的机会。

英文摘要

Organizational knowledge is fragmented across a variety of software systems, tacit expertise, and manual documents that have traditionally been designed for human consumption. As AI systems are increasingly deployed and granted decision-making roles, they require access to this knowledge. This raises two questions: how should organizations store and maintain knowledge so that it remains accessible to both humans and future AI systems, and how should agency be allocated between humans and AI across tasks with different risks and levels of uncertainty? In this position paper, we describe how organizational knowledge evolves and contribute a framework that maps task attributes and knowledge availability to recommended agency allocations and control mechanisms. We illustrate the applicability of the framework on two different manufacturing tasks: a routine operation (visual quality inspection) and a one-off strategic decision (factory location), and conclude with opportunities for future research.

2606.18005 2026-06-17 cs.AI econ.GN q-fin.EC 新提交

LLM Consumer Behavior Theory: Foundations of a Novel Research Field

LLM消费者行为理论:一个新兴研究领域的基础

Manon Reusens, Sofie Goethals, David Martens

发表机构 * Department of Engineering Management, University of Antwerp(安特卫普大学工程管理系)

AI总结 本文提出LLM消费者行为理论,研究LLM代理在市场中代表人类消费决策的行为,整合经济学与自然语言处理,探讨偏好表达、市场聚合及理性假设的失效。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为自主代理,代表用户做出消费决策。这一转变对传统上以人类为主要决策者的消费者理论提出了基本问题。在本文中,我们引入了LLM消费者行为理论,这是一个关注分析代理市场中消费者行为的新研究领域。借鉴经典和行为经济学以及自然语言处理的最新进展,我们形式化了人类偏好如何被基于LLM的代理反映和执行,以及代理级别的决策如何聚合为市场需求。我们将先前关于LLM决策、人类行为模拟和偏好诱导的分散文献统一在共同的经济视角下,强调了理性、异质性等假设在代理市场中可能失效的地方。本文不提供实证验证,而是概述了LLM消费者行为的范围,并识别了与对齐、偏好表示和市场动态相关的开放研究问题。

英文摘要

Large language models (LLMs) are increasingly deployed as autonomous agents that make consumption decisions on behalf of users. This shift raises fundamental questions for consumer theory, which has traditionally modeled humans as the primary decision-makers. In this paper, we introduce LLM Consumer Behavior Theory, a new field of study concerned with analyzing consumer behavior in agentic markets. Drawing on classical and behavioral economics alongside recent advances in Natural Language Processing, we formalize how human preferences are reflected and acted upon by LLM-based agents, and how agent-level decisions aggregate into market demand. We unify previously fragmented literature on LLM decision-making, human behavior simulation, and preference elicitation under a common economic lens, highlighting where assumptions, such as rationality and heterogeneity, may fail in agentic markets. Rather than providing empirical validation, this paper outlines the scope of LLM consumer behavior and identifies open research questions related to alignment, preference representation, and market dynamics.

2606.17762 2026-06-17 math.OC cs.AI 交叉投稿

Symplectic Transversality and Endpoint Green Estimates for Finite-Horizon Pontryagin Systems

有限时域Pontryagin系统的辛横截性与端点Green估计

Pyuyi Chufeng Huang, Zikang Song, Xingshu Chen

发表机构 * School of Cyber Science and Engineering, Sichuan University, Chengdu, Sichuan, China(四川大学网络空间安全学院,成都,四川,中国) School of Mathematics, Sichuan University, Chengdu, Sichuan, China(四川大学数学学院,成都,四川,中国)

AI总结 针对有限时域离散时间Pontryagin边值系统,通过缩放稳定-不稳定边界横截性验证线性化端点逆,结合加权压缩证明端点修正Green估计,获得与视界无关的存在唯一性、Lipschitz依赖和一阶展开。

Comments 20 pages

详情
AI中文摘要

我们研究了在光滑控制消除后有限时域离散时间Pontryagin边值系统的视界一致局部分支。核心输入是线性化的两点端点逆。我们通过缩放稳定-不稳定边界横截性验证该逆,证明相关的端点修正Green估计,并将其与加权压缩结合,以获得存在性、唯一性、Lipschitz依赖性和一阶展开,且常数与视界无关。该框架涵盖光滑非线性端点映射,包括固定初始状态并将终端协态耦合到终端状态的原始Pontryagin行。辛和Riccati准则在矩阵数据层面验证逆假设;特别地,每个具有可逆动力学和定号权重的可镇定线性二次系统都被覆盖,包括非交换耦合数据。数值部分展示了证书和视界一致一阶展开。

英文摘要

We study horizon-uniform local branches of finite-horizon discrete-time Pontryagin boundary value systems after smooth control elimination. The central input is a two-point endpoint inverse for the linearization. We verify this inverse from scaled stable--unstable boundary transversality, prove the associated endpoint-corrected Green estimate, and combine it with weighted contractions to obtain existence, uniqueness, Lipschitz dependence, and first-order expansions with constants independent of the horizon. The framework covers smooth nonlinear endpoint maps, including the original Pontryagin rows that fix the initial state and couple the terminal costate to the terminal state. Symplectic and Riccati criteria verify the inverse hypothesis at the level of the matrix data; in particular, every stabilizable linear-quadratic system with invertible dynamics and definite weights is covered, including noncommuting coupled data. A numerical section illustrates the certificates and the horizon-uniform first-order expansion.

2606.13196 2026-06-17 cs.AI cs.CY 版本更新

Under What Conditions Can a Machine Be Called Genuinely Creative?

机器在何种条件下才能被称为真正具有创造力?

Yong Zeng

发表机构 * Concordia University(康考迪亚大学)

AI总结 本文基于Designics理论,提出机器真正创造力需满足十个要求,并通过实例论证其计算可行性,同时指出当前生成式AI系统尚不具备真正创造力。

详情
AI中文摘要

最近的AI系统能够生成看似具有创造力的文本、软件架构、假设、设计和科学工作流。本文探讨机器在何种条件下能够真正具有创造力,以及如何在共享的认知和创造环境中保持人类能动性。它提出了一个源于Designics(意义承载的意向性变化科学)的需求框架。本文认为,真正的机器创造力不应仅由输出新颖性、当前性能或瞬时架构来定义。相反,创造力被理解为通过递归干预动力学对不完全情境的结构性转变。基于此观点,它依赖于十个需求:环境表示、范围感知、冲突识别、干预能力、后果观察、知识与环境更新、范围重定、局部到全局展开、基于价值的范围界定以及人机共居。这些需求通过Designics的三个定律(感知、冲突和能力)进行组织。本文通过选定的网络-物理和网络-生物研究(包括递归元素提取、自主网格生成以及神经生理和工作负载分析)说明了这些需求的计算可行性。然后,它将开放系统、自动发现框架、自我修改代理、基础模型和代理工作流视为压力案例:它们展示了强大的生成手段,但本身并未建立真正的机器创造力。最后,本文认为主动的AI伦理是真正机器创造力的内在部分,而非事后过滤器。基于价值的范围界定和人机共居必须塑造创造机器如何感知环境、识别冲突、选择干预、观察后果、更新知识以及重新确定未来行动的范围。

英文摘要

Recent AI systems can generate texts, software architectures, hypotheses, designs, and scientific workflows that appear creative. This paper asks under what conditions a machine can be called genuinely creative, and how human agency can be preserved within shared cognitive and creative environments. It develops a requirement framework derived from Designics, the science of meaning-bearing intentional change. The paper argues that genuine machine creativity should not be defined by output novelty, current performance, or transient architecture alone. Instead, creativity is understood as the structural transformation of incomplete situations through recursive intervention dynamics. On this view, it depends on ten requirements: environment representation, scoped perception, conflict identification, intervention capability, consequence observation, knowledge and environment update, rescoping, local-to-global unfolding, value-based scoping, and human-AI co-living. These are organized through the three laws of Designics: perception, conflict, and capability. The paper illustrates the computational tractability of these requirements through selected cyber-physical and cyber-biological studies, including recursive element extraction, autonomous mesh generation, and neurophysiological and workload analysis. It then treats open-ended systems, automated discovery frameworks, self-modifying agents, foundation models, and agentic workflows as pressure cases: they demonstrate powerful generative means but do not by themselves establish genuine machine creativity. Finally, the paper argues that proactive AI ethics is internal to genuine machine creativity rather than an after-the-fact filter. Value-based scoping and human-AI co-living must shape how creative machines perceive environments, identify conflicts, select interventions, observe consequences, update knowledge, and rescope future action.

2507.05169 2026-06-17 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Critique of World Model: A Generative Latent Prediction Architecture for World Modeling

世界模型批判:一种用于世界建模的生成式潜在预测架构

Eric Xing, Mingkai Deng, Jinyu Hou

AI总结 本文从心理学“假设性思维”出发,提出世界模型的核心目标是模拟真实世界的所有可行动可能性,并设计了一种基于状态化、分层、多级、混合连续/离散表示的生成式潜在预测(GLP)架构。

详情
AI中文摘要

世界模型,即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器,近年来因开发具有人工(通用)智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估,已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发,并借鉴心理学文献中“假设性思维”的概念,论证世界模型的主要目标是模拟真实世界中所有可行动的可能性,以进行有目的的推理和行动。我们审视了世界建模的关键设计维度:数据、表示、架构、学习目标和使用,调查了现有方法并分析了它们的权衡。在此基础上,我们提出了一种新的通用世界模型生成式潜在预测(GLP)架构,基于有状态的、分层的、多层次的、混合连续/离散表示,以及生成式和自监督学习框架,并展望了由这种模型支持的物理、智能体和嵌套(PAN)AGI系统。

英文摘要

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2602.02881 2026-06-17 cs.SE cs.AI 版本更新

Learning-Infused Formal Reasoning: From Contract Synthesis to Artifact Reuse and Formal Semantics

学习增强的形式化推理:从合约合成到工件复用和形式语义

Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan

AI总结 提出将形式化方法与人工智能融合的长期研究愿景,通过自动化合约合成、语义工件复用和精化理论,构建知识驱动的验证生态系统,加速未来保障。

Comments LNCS Proceedings Submitted Version. 17 pages. Accepted and presented at VERIFAI-2026: The Interplay between Artificial Intelligence and Software Verification LASER center, Villebrumier, France, March 8-11, 2026

详情
AI中文摘要

本文阐述了形式化方法与人工智能交叉领域的长期研究愿景,概述了多个概念和技术维度,并报告了我们为实现这一愿景正在开展的工作。它基于自动化合约合成、语义工件复用和基于精化的理论,提出了下一代形式化方法的前瞻性视角。我们认为,未来的验证系统必须从构建单个正确性证明转向累积的、知识驱动的范式,其中规范、合约和证明被持续合成并在系统间转移。为支持这一转变,我们概述了一个混合框架,结合大语言模型与基于图的表示,以实现可扩展的语义匹配和验证工件的原则性复用。基于学习的组件在异构表示法和抽象层次间提供语义指导,而符号匹配确保形式正确性。基于组合推理,这一愿景指向系统演化的验证生态系统,利用过去的验证工作加速未来的保障。

英文摘要

This paper articulates a long-term research vision for formal methods at the intersection with artificial intelligence, outlining multiple conceptual and technical dimensions and reporting on our ongoing work toward realising this vision. It advances a forward-looking perspective on the next generation of formal methods based on the integration of automated contract synthesis, semantic artifact reuse, and refinement-based theory. We argue that future verification systems must shift from building individual correctness proofs toward a cumulative, knowledge-driven paradigm in which specifications, contracts, and proofs are continuously synthesised and transferred across systems. To support this shift, we outline a hybrid framework combining large language models with graph-based representations to enable scalable semantic matching and principled reuse of verification artifacts. Learning-based components provide semantic guidance across heterogeneous notations and abstraction levels, while symbolic matching ensures formal soundness. Grounded in compositional reasoning, this vision points toward verification ecosystems that evolve systematically, leveraging past verification efforts to accelerate future assurance.

2502.17773 2026-06-17 stat.ME cs.AI cs.LG 版本更新

How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective

一个大型语言模型抵得上多少名人类调查受访者?不确定性量化视角

Chengpiao Huang, Yuhang Wu, Kaizheng Wang

发表机构 * Department of IEOR, Columbia University(哥伦比亚大学工业工程与运筹学系) Decision, Risk, and Operations Division, Columbia Business School(哥伦比亚商学院决策、风险与运营部) Department of IEOR and Data Science Institute, Columbia University(哥伦比亚大学工业工程与运筹学系及数据科学研究所)

AI总结 本文从不确定性量化角度出发,提出了一种框架,将LLM模拟的响应转换为人类响应总体参数的可靠置信集,通过量化人类-LLM不一致带来的不确定性。关键设计是模拟响应的数量:过多会导致置信集过窄且覆盖性差,过少则导致置信集过宽且信息不足。本文提出了一种数据驱动的方法,自适应选择模拟样本量以实现名义平均覆盖性,无论LLM的模拟保真度或置信集构建过程如何。所选样本量进一步反映了LLM能代表的有效人类人口规模,提供了其模拟保真度的定量度量。实验表明不同LLM和领域存在异质性模拟保真度。

Comments 63 pages, 13 figures

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于模拟调查响应,但合成数据可能与人类人口不一致,导致不可靠的推断。我们开发了一个通用框架,将LLM模拟的响应转换为人类响应总体参数的可靠置信集,量化由人类-LLM不一致引起的不确定性。关键设计选择是模拟响应的数量:过多会产生过于狭窄的置信集,覆盖性差;过少则会产生过于宽泛且信息不足的置信集,受随机噪声主导。我们提出了一种数据驱动的方法,自适应地选择模拟样本量以实现名义平均覆盖性,无论LLM的模拟保真度或置信集构建过程如何。所选样本量进一步被证明反映了LLM能代表的有效人类人口规模,提供其模拟保真度的定量度量。在真实调查数据集上的实验揭示了不同LLM和领域之间的异质性模拟保真度。

英文摘要

Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, quantifying the uncertainty induced by the human-LLM misalignment. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield overly wide and uninformative sets dominated by stochastic noise. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous simulation fidelity across different LLMs and domains.
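
The trade-off in the number of simulated responses can be illustrated with a textbook normal-approximation interval. This is a deliberately idealized sketch and not the paper's data-driven procedure: it treats the simulated "yes" rate as exactly 0.55 (no sampling noise) against a hypothetical human rate of 0.50, and all names and numbers are assumptions made for illustration.

```python
import math


def half_width(p_hat, n, z=1.96):
    """Half-width of a normal-approximation interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


def covers(p_true, p_hat, n, z=1.96):
    """True if the interval centred at p_hat still contains p_true."""
    return abs(p_hat - p_true) <= half_width(p_hat, n, z)


# Hypothetical rates: humans say "yes" 50% of the time, the LLM 55%.
p_human, p_llm = 0.50, 0.55

small_ok = covers(p_human, p_llm, n=100)      # wide interval, bias absorbed
large_ok = covers(p_human, p_llm, n=10_000)   # narrow interval, bias exposed

# Largest n for which the interval still reaches the human rate:
# z * sqrt(p(1-p)/n) >= |bias|  =>  n <= z^2 * p(1-p) / bias^2
bias = abs(p_llm - p_human)
max_n = int(1.96 ** 2 * p_llm * (1 - p_llm) / bias ** 2)
```

Under these assumptions the interval stops covering the human rate once the simulated sample exceeds a few hundred responses, which gives an informal reading of the abstract's "effective human population size": a less biased simulator would support a larger `max_n`.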

2601.06116 2026-06-17 cs.AI cs.CL cs.CY 版本更新

The Homogenization Problem in LLMs: Towards Meaningful Diversity in AI Safety

在大语言模型中的同质化问题:迈向人工智能安全中的有意义多样性

Ian Rios-Sialer

发表机构 * Independent Researcher(独立研究者)

AI总结 本文探讨了大语言模型中同质化问题,提出通过编码价值观系统来促进多样性,通过实验揭示性别偏见并引入xeno-reproduction概念以缓解同质化。

详情
AI中文摘要

生成式AI模型复制其训练数据中的人类偏见,并通过模式崩溃等机制进一步放大这些偏见。多样性丧失导致同质化,不仅损害被边缘化的群体,也使所有人蒙受损失。我们主张同质化应成为人工智能安全的核心关注点。为有意义地表征大语言模型中的同质化,我们引入一个框架,允许利益相关者编码其上下文和价值体系。我们通过实验揭示了一个大语言模型(Claude 3.5 Haiku)在开放性故事提示中的性别偏见。基于酷儿理论,我们从规范性的角度对同质化进行形式化。借用女性主义理论的语言,我们引入xeno-reproduction作为一类任务,以通过促进多样性来缓解同质化。我们的工作开启了一条协作研究路线,旨在理解和推进AI中的多样性。

英文摘要

Generative AI models reproduce the human biases in their training data and further amplify them through mechanisms such as mode collapse. The loss of diversity produces homogenization, which not only harms the minoritized but impoverishes everyone. We argue homogenization should be a central concern in AI safety. To meaningfully characterize homogenization in Large Language Models (LLMs), we introduce a framework that allows stakeholders to encode their context and value system. We illustrate our approach with an experiment that surfaces gender bias in an LLM (Claude 3.5 Haiku) on an open-ended story prompt. Building from queer theory, we formalize homogenization in terms of normativity. Borrowing language from feminist theory, we introduce the concept of xeno-reproduction as a class of tasks for mitigating homogenization by promoting diversity. Our work opens a collaborative line of research that seeks to understand and advance diversity in AI.

2501.12709 2026-06-17 quant-ph cs.AI cs.CR cs.DC 版本更新

Experimentally validated quantum-secure federated learning over a multi-user quantum network

在多用户量子网络上实验验证的量子安全联邦学习

Zhi-Ping Liu, Xiao-Yu Cao, Hao-Wen Liu, Xiao-Ran Sun, Yu Bao, Jian-Yu Shen, Yu-Shuo Lu, Hua-Lei Yin, Zeng-Bing Chen

发表机构 * National Laboratory of Solid State Microstructures and School of Physics, Collaborative Innovation Center of Advanced Microstructures, Nanjing University, Nanjing 210093, China(南京大学固体微结构物理国家重点实验室、物理学院、人工微结构科学与技术协同创新中心,南京210093,中国) School of Physics and Key Laboratory of Quantum State Construction and Manipulation (Ministry of Education), Renmin University of China, Beijing 100872, China(中国人民大学物理学院、量子态构筑与调控教育部重点实验室,北京100872,中国)

AI总结 本文提出QuNetQFL协议,利用分布式量子密钥对本地模型更新进行掩蔽,实现信息论安全的聚合。该协议在四客户端量子网络上得到实验验证,加入量子客户端后分类准确率提升,并在语言任务和大规模模拟中展示了可扩展性。

Comments 25 pages, 7 figures, 7 tables, Accepted by Research

详情
Journal ref
Research 9, 1299 (2026)
AI中文摘要

联邦学习实现了去中心化、保护隐私的训练,但在量子时代仍面临隐私泄露的风险。量子联邦学习(QFL)为提升安全性和效率提供了一条有前景的途径。然而,目前仍缺乏一种利用近期量子技术解决数据隐私问题、实用且经过实验验证的QFL协议。本文提出QuNetQFL,一种在量子网络上实现的QFL协议,其中本地模型更新由分布式量子密钥掩蔽,从而在聚合过程中提供信息论安全性。我们在四客户端量子网络上对该协议进行了实验验证,并利用所生成的密钥在量子数据集和真实世界数据集上评测其性能。仅加入一个量子客户端即可显著提高多体纠缠和非稳定子量子数据集分类的全局准确率。在语言任务中,我们通过对混合经典-量子语言模型进行联邦微调,将QuNetQFL应用于情感分析,在模拟和真实量子硬件上取得了相当且稳健的性能。大规模模拟进一步表明,该协议可扩展到200个客户端的手写数字识别任务,收敛迅速,并通过模型压缩将通信成本降低75%。本工作为新兴量子互联网中的量子安全联邦学习建立了一条实用且可扩展的路线。

英文摘要

Federated learning enables decentralized, privacy-preserving training but remains vulnerable to privacy leakage in the quantum era. Quantum federated learning (QFL) offers a promising path towards enhanced security and efficiency. However, a practical and experimentally validated QFL protocol utilizing near-term quantum techniques to address data privacy has been lacking. Here we present QuNetQFL, a QFL protocol implemented on quantum networks, in which local model updates are masked with distributed quantum secret keys, offering information-theoretic security during aggregation. We experimentally validate the protocol on a four-client quantum network and benchmark its performance using the generated keys on quantum and real-world datasets. Adding a single quantum client significantly improves global accuracy for classifying multipartite entangled and non-stabilizer quantum datasets. For language tasks, we apply QuNetQFL to sentiment analysis by federated fine-tuning of a hybrid classical-quantum language model, achieving comparable and robust performance in simulation and on real quantum hardware. Large-scale simulations further demonstrate scalability to 200 clients for handwritten-digit recognition, with rapid convergence and a $75\%$ reduction in communication cost via model compression. Our work establishes a practical and scalable route to quantum-secure federated learning for the emerging quantum internet.
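
The masking idea in this abstract can be illustrated with a classical pairwise-mask aggregation sketch. This is not the QuNetQFL protocol itself: the abstract does not specify how keys are consumed, so the sketch assumes each client pair shares one key vector, assumes integer-quantized updates, and uses `secrets` as a stand-in for quantum-distributed keys. The names `make_pair_keys`, `mask_update`, and `aggregate` are hypothetical.

```python
import secrets
from itertools import combinations

MODULUS = 2**32  # assumed word size; all arithmetic is done modulo this value


def make_pair_keys(num_clients, dim):
    """Stand-in for key distribution: one shared random key vector per client pair."""
    return {
        (i, j): [secrets.randbelow(MODULUS) for _ in range(dim)]
        for i, j in combinations(range(num_clients), 2)
    }


def mask_update(client, update, pair_keys):
    """Add the shared key if this client is the lower index of a pair, subtract otherwise."""
    masked = list(update)
    for (i, j), key in pair_keys.items():
        if client == i:
            masked = [(m + k) % MODULUS for m, k in zip(masked, key)]
        elif client == j:
            masked = [(m - k) % MODULUS for m, k in zip(masked, key)]
    return masked


def aggregate(masked_updates):
    """Sum masked updates coordinate-wise; the pairwise masks cancel in the sum."""
    return [sum(column) % MODULUS for column in zip(*masked_updates)]


# Four clients with toy integer (already quantized) updates.
updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300], [5, 5, 5]]
keys = make_pair_keys(len(updates), dim=3)
masked = [mask_update(c, u, keys) for c, u in enumerate(updates)]
total = aggregate(masked)
print(total)
```

The server only ever sees `masked`, yet the masks cancel pairwise, so `total` equals the plain sum of the updates. With truly random one-time keys, an individual masked update reveals nothing about the underlying update, which is the intuition behind the information-theoretic claim.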

2605.12220 2026-06-17 cs.CV cs.AI cs.LG cs.RO 版本更新

TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

TriBand-BEV:基于高度感知的鸟瞰图与高分辨率特征融合的实时仅LiDAR三维行人检测

Mohammad Khoshkdahan, Alexey Vinel

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 本文提出TriBand-BEV方法,通过高度感知的鸟瞰图与高分辨率特征融合实现仅使用LiDAR的实时三维行人检测,采用轻量级鸟瞰图张量映射,单个网络一次前向传播即可检测汽车、行人和骑行者,提升检测精度与速度。

Comments Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

详情
Journal ref
Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
AI中文摘要

安全的自主智能体和移动机器人需要快速的实时三维感知,尤其是针对行人等弱势道路使用者(VRU)。我们提出一种新的鸟瞰图(BEV)编码方法,将完整的三维LiDAR点云映射为包含三个高度带的轻量级二维BEV张量。我们明确地将三维检测重新表述为二维检测问题,然后从BEV输出中重建三维框。单个网络在一次前向传播中检测汽车、行人和骑行者。骨干网络在深层阶段使用区域注意力,覆盖P1到P4的层次化双向颈部网络融合上下文与细节,检测头预测定向框,对边偏移采用分布焦点学习,并使用旋转IoU损失。训练时在通道空间施加小幅垂直重分箱和轻微的反射率抖动,以抑制记忆化。在三维重建过程中,我们使用四分位距(IQR)滤波器去除噪声和离群的LiDAR点。在KITTI数据集上,TriBand-BEV在单个消费级GPU上以49 FPS的速度,在简单、中等和困难难度下分别取得58.7/52.6/47.2的行人BEV AP(%),超过Complex-YOLO,增益分别为+12.6%、+7.5%和+3.1%。定性场景显示其在遮挡下检测稳定。该流程结构紧凑,可用于实时机器人部署。我们的源代码已在GitHub上公开。

英文摘要

Safe autonomous agents and mobile robots need fast real time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a light-weight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re bin and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real time robotic deployment. Our source code is publicly available on GitHub.
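
Two steps named in this abstract, the three-band BEV encoding and the IQR outlier filter, can be sketched as follows. The abstract gives neither the band boundaries, the grid resolution, nor the per-cell feature, so the values below (per-band point counts, 1 m cells, bands between -2 m and 1 m, Tukey factor 1.5) are assumptions for illustration, and `encode_bev` and `iqr_filter` are hypothetical names.

```python
from statistics import quantiles


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the usual Tukey choice."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


def encode_bev(points, x_range, y_range, cell, bands):
    """Map (x, y, z) points to grid[row][col][band], counting points per height band."""
    cols = int((x_range[1] - x_range[0]) / cell)
    rows = int((y_range[1] - y_range[0]) / cell)
    grid = [[[0] * len(bands) for _ in range(cols)] for _ in range(rows)]
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # point lies outside the BEV footprint
        col = min(int((x - x_range[0]) / cell), cols - 1)
        row = min(int((y - y_range[0]) / cell), rows - 1)
        for band_index, (z_low, z_high) in enumerate(bands):
            if z_low <= z < z_high:
                grid[row][col][band_index] += 1
                break
    return grid


# Assumed band limits in metres; the paper's actual limits are not given here.
bands = [(-2.0, -1.0), (-1.0, 0.0), (0.0, 1.0)]
points = [(0.5, 0.5, -1.5), (0.5, 0.5, -0.2), (0.6, 0.4, 0.8),
          (3.2, 1.1, -1.0), (9.0, 9.0, 0.0)]
grid = encode_bev(points, x_range=(0.0, 4.0), y_range=(0.0, 2.0), cell=1.0, bands=bands)

heights = [1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 0.95, 1.05, 9.0]
filtered = iqr_filter(heights)
```

The resulting grid has the shape of a three-channel image, which is what lets a 2D detector consume it directly. How the paper applies the IQR filter during box reconstruction is not stated in the abstract, so the filter is shown here on a plain list of heights.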

2601.12912 2026-06-17 cs.AI 版本更新

Human Emotion Verification by Action Languages via Answer Set Programming

基于答案集编程的动作语言人类情感验证

Andreas Brännström, Juan Carlos Nieves

发表机构 * Department of Computing Science, Umeå University(于默奥大学计算科学系)

AI总结 本文提出动作语言C-MT,基于答案集编程和过渡系统,用于表示人类心理状态对可观察动作序列的演变。通过引入因果规则,该语言能建模心理状态的有效转换原则,从而实现对人类心理动态的受控推理。

Comments Under consideration in Theory and Practice of Logic Programming (TPLP)

详情
Journal ref
Theory and Practice of Logic Programming 25 (2025) 1047-1104
AI中文摘要

本文介绍动作语言C-MT(Mind Transition Language)。它建立在答案集编程(ASP)和转移系统之上,用于表示人类心理状态如何随一系列可观察动作而演变。借鉴情绪评价理论等成熟的心理学理论,我们将情绪等心理状态形式化为多维配置。为满足对受控智能体行为的需求,并限制动作带来的不良心理副作用,我们为该语言扩展了一种新的因果规则“forbids to cause”(禁止导致),以及专用于心理状态动态的表达式,从而能够对心理状态之间有效转移的原则进行建模。这些心理变化原则被转化为转移约束和不变性性质,并借助转移系统以所谓的轨迹进行严格评估。这使得对人类心理状态动态演变的受控推理成为可能。此外,该框架支持通过分析遵循不同心理学原则的轨迹来比较不同的变化动态。我们应用该动作语言设计情绪验证模型。

英文摘要

In this paper, we introduce the action language C-MT (Mind Transition Language). It is built on top of answer set programming (ASP) and transition systems to represent how human mental states evolve in response to sequences of observable actions. Drawing on well-established psychological theories, such as the Appraisal Theory of Emotion, we formalize mental states, such as emotions, as multi-dimensional configurations. With the objective to address the need for controlled agent behaviors and to restrict unwanted mental side-effects of actions, we extend the language with a novel causal rule, forbids to cause, along with expressions specialized for mental state dynamics, which enables the modeling of principles for valid transitions between mental states. These principles of mental change are translated into transition constraints, and properties of invariance, which are rigorously evaluated using transition systems in terms of so-called trajectories. This enables controlled reasoning about the dynamic evolution of human mental states. Furthermore, the framework supports the comparison of different dynamics of change by analyzing trajectories that adhere to different psychological principles. We apply the action language to design models for emotion verification. Under consideration in Theory and Practice of Logic Programming (TPLP).

2603.19801 2026-06-17 eess.IV cs.AI cs.CV 版本更新

Offshore oil and gas platform dynamics in the North Sea, Gulf of Mexico, and Persian Gulf: Exploiting the Sentinel-1 archive

北海、墨西哥湾和波斯湾的海上石油和天然气平台动态:利用Sentinel-1档案

Robin Spanier, Thorsten Hoeser, John Truckenbrodt, Felix Bachofer, Claudia Kuenzer

发表机构 * German Remote Sensing Data Center, Earth Observation Center, EOC of the German Aerospace Center, DLR(德国遥感数据中心,地球观测中心,德国航空航天中心(DLR)地球观测中心) Institute for Geography and Geology, Department of Remote Sensing, University of Würzburg(地理与地质研究所,遥感系,维尔茨堡大学)

AI总结 本文利用Sentinel-1数据和深度学习技术,研究了北海、墨西哥湾和波斯湾的海上平台动态,揭示了平台数量变化及结构转型,为海洋基础设施监测提供了数据支持。

Comments 16 pages, 10 figures, 1 table

详情
Journal ref
Big Earth Data, 2026, 1-27
AI中文摘要

随着石油和天然气平台等海上基础设施对海洋空间的利用不断增加,对一致、可扩展的监测需求日益凸显。本研究提出一种基于免费地球观测数据的自动化方法,用于海上油气平台的时空检测。利用Sentinel-1档案数据和基于深度学习的目标检测,构建了2017-2025年间北海、墨西哥湾和波斯湾三大产区一致的季度平台位置时间序列。此外,还推导了平台尺寸、水深、离岸距离、国家归属以及安装和退役日期。2025年共识别出3,728个海上平台,其中北海356个,墨西哥湾1,641个,波斯湾1,731个。波斯湾的平台数量在2024年之前持续扩张,而墨西哥湾和北海的平台数量在2018-2020年间出现下降。与此同时,平台变动十分活跃:超过2,700个平台被安装或迁移至新位置,同时有数量相当的平台被退役或迁走。此外,短寿命平台数量的增加表明海上行业正在发生结构性变化,这与自升式平台、钻井船等移动式海上装置日益重要有关。研究结果凸显了免费地球观测数据和深度学习在海洋基础设施一致、长期监测中的潜力。所得数据集已公开,可为海上监测、海洋规划以及海上能源行业转型分析提供基础。

英文摘要

The increasing use of marine spaces by offshore infrastructure, including oil and gas platforms, underscores the need for consistent, scalable monitoring. Offshore development has economic, environmental, and regulatory implications, yet maritime areas remain difficult to monitor systematically due to their inaccessibility and spatial extent. This study presents an automated approach to the spatiotemporal detection of offshore oil and gas platforms based on freely available Earth observation data. Leveraging Sentinel-1 archive data and deep learning-based object detection, a consistent quarterly time series of platform locations for three major production regions: the North Sea, the Gulf of Mexico, and the Persian Gulf, was created for the period 2017-2025. In addition, platform size, water depth, distance to the coast, national affiliation, and installation and decommissioning dates were derived. 3,728 offshore platforms were identified in 2025, 356 in the North Sea, 1,641 in the Gulf of Mexico, and 1,731 in the Persian Gulf. While expansion was observed in the Persian Gulf until 2024, the Gulf of Mexico and the North Sea saw a decline in platform numbers from 2018-2020. At the same time, a pronounced dynamic was apparent. More than 2,700 platforms were installed or relocated to new sites, while a comparable number were decommissioned or relocated. Furthermore, the increasing number of platforms with short lifespans points to a structural change in the offshore sector associated with the growing importance of mobile offshore units such as jack-ups or drillships. The results highlighted the potential of freely available Earth observation data and deep learning for consistent, long-term monitoring of marine infrastructure. The derived dataset is public and provides a basis for offshore monitoring, maritime planning, and analyses of the transformation of the offshore energy sector.

2512.20985 2026-06-17 cs.AI cs.MA 版本更新

A Blockchain-Monitored Agentic AI Architecture for Trusted Perception-Reasoning-Action Pipelines

基于区块链监控的代理AI架构:可信感知-推理-行动流水线

Salman Jan, Hassan Ali Razzaqi, Ali Akarma, Mohammad Riyaz Belgaum

发表机构 * Faculty of Computer Studies, Arab Open University-Bahrain(巴林阿拉伯开放大学计算机科学学院) Faculty of Computer and Information System, Islamic University of Madinah, Saudi Arabia(沙特阿拉伯麦地那伊斯兰大学计算机与信息系统学院)

AI总结 本文提出一种结合区块链的代理AI架构,用于确保自主决策流程中的信任和可追溯性,通过区块链实现对行动的持续监控和审计,验证输入并记录执行结果。

Comments This paper was presented at the IEEE International Conference on Computing and Applications (ICCA 2025), Bahrain

详情
Journal ref
Proceedings of the 2025 IEEE International Conference on Computing and Applications (ICCA), Bahrain, 2025, pp. 1-7
AI中文摘要

代理式AI系统在自主决策中的应用正在医疗、智慧城市、数字取证和供应链管理等领域不断增长。尽管这些系统灵活且能提供实时推理,但它们也引发了对信任与监督,以及其所依赖的信息和活动完整性的担忧。本文提出一种统一架构,将基于LangChain的多智能体系统与许可型区块链相结合,以保证对智能体行为的持续监控、策略执行和不可篡改的可审计性。该框架将感知-概念化-行动循环与区块链治理层关联起来,由后者验证输入、评估建议的行动并记录执行结果。文中介绍了一个基于Hyperledger Fabric的系统、集成MCP的动作执行器以及LangChain智能体,并在智能库存管理、交通信号控制和医疗监控场景中进行了实验。结果表明,区块链安全验证能够有效防止未经授权的操作,在整个决策过程中提供可追溯性,并将操作延迟保持在合理范围内。所提出的框架为实现自主且负责任的高影响力代理式AI应用提供了一种通用方案。

英文摘要

The application of agentic AI systems in autonomous decision-making is growing in the areas of healthcare, smart cities, digital forensics, and supply chain management. Even though these systems are flexible and offer real-time reasoning, they also raise concerns of trust and oversight, and integrity of the information and activities upon which they are founded. The paper suggests a single architecture model comprising of LangChain-based multi-agent system with a permissioned blockchain to guarantee constant monitoring, policy enforcement, and immutable auditability of agentic action. The framework relates the perception conceptualization-action cycle to a blockchain layer of governance that verifies the inputs, evaluates recommended actions, and documents the outcomes of the execution. A Hyperledger Fabric-based system, action executors MCP-integrated, and LangChain agent are introduced and experiments of smart inventory management, traffic-signal control, and healthcare monitoring are done. The results suggest that blockchain-security verification is efficient in preventing unauthorized practices, offers traceability throughout the whole decision-making process, and maintains operational latency within reasonable ranges. The suggested framework provides a universal system of implementing high-impact agentic AI applications that are autonomous yet responsible.
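
The governance pattern described here (check a proposed action against policy, then record it in a tamper-evident log) can be sketched with a plain hash chain. This is a stand-in only: it uses neither Hyperledger Fabric, LangChain, nor MCP, and the policy set, record layout, and function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64
ALLOWED_ACTIONS = {"reorder_stock", "set_signal_phase"}  # hypothetical policy


def _digest(prev_hash, payload):
    """Hash a record body together with the previous record's hash."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_action(ledger, agent, action, params):
    """Policy check first; only approved actions are appended to the chain."""
    if action not in ALLOWED_ACTIONS:
        return False
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    payload = {"agent": agent, "action": action, "params": params}
    ledger.append({"prev": prev_hash, "payload": payload,
                   "hash": _digest(prev_hash, payload)})
    return True


def verify(ledger):
    """Recompute every hash and link; any edited record breaks the chain."""
    prev_hash = GENESIS
    for record in ledger:
        expected = _digest(record["prev"], record["payload"])
        if record["prev"] != prev_hash or record["hash"] != expected:
            return False
        prev_hash = record["hash"]
    return True


ledger = []
accepted = append_action(ledger, "inventory-agent", "reorder_stock", {"sku": "A1", "qty": 20})
rejected = append_action(ledger, "inventory-agent", "delete_records", {})
append_action(ledger, "traffic-agent", "set_signal_phase", {"junction": 7, "phase": "green"})
intact = verify(ledger)
ledger[0]["payload"]["params"]["qty"] = 2000  # simulate tampering with a past record
tampered_ok = verify(ledger)
```

A permissioned blockchain adds what this sketch lacks: replication across organizations, endorsement, and consensus. The hash chain only shows why a recorded decision cannot be altered silently.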

2603.14692 2026-06-17 cs.LO cs.AI 版本更新

Applications of Intuitionistic Temporal Logic to Temporal Answer Set Programming

Pedro Cabalar, Martín Diéguez, David Fernández-Duque, François Laferrière, Torsten Schaub, Igor Stéphan

发表机构 * University of Corunna, Spain(科鲁纳大学) University of Angers, France(昂热大学) University of Barcelona, Spain(巴塞罗那大学) University of Potsdam, Germany(波茨坦大学)

Comments Under consideration in Theory and Practice of Logic Programming (TPLP)

详情
英文摘要

The relationship between intuitionistic or intermediate logics and logic programming has been extensively studied, prominently featuring Pearce's equilibrium logic and Osorio's safe beliefs. Equilibrium logic admits a fixpoint characterization based on the logic of here-and-there, akin to theory completion in default and autoepistemic logics. Safe beliefs are similarly defined via a fixpoint operator, albeit under the semantics of intuitionistic or other intermediate logics. In this paper, we investigate the logical foundations of Temporal Answer Set Programming through the lens of Temporal Equilibrium Logic, a formalism combining equilibrium logic with linear-time temporal operators. We lift the seminal approaches of Pearce and Osorio to the temporal setting, establishing a formal correspondence between temporal intuitionistic logic and temporal logic programming. Our results deepen the theoretical underpinnings of Temporal Answer Set Programming and provide new avenues for research in temporal reasoning.

2508.04492 2026-06-17 cs.CV cs.AI 版本更新

Learning Robust Intervention Representations with Delta Embeddings

通过delta嵌入学习鲁棒的干预表示

Panagiotis Alimisis, Christos Diou

发表机构 * Department of Informatics and Telematics(信息与电信学系)

AI总结 本文提出通过潜在空间中的可操作反事实表示提升模型鲁棒性,提出因果delta嵌入方法,在无需额外监督的情况下学习因果表示,实验显示其在合成和现实基准中表现优异。

Comments ICLR 2026, Poster

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

因果表示学习作为提升模型泛化性和鲁棒性的手段,近年来引起了广泛的研究兴趣。干预图像对(文献中也称为“可操作反事实”)的因果表示具有如下性质:在起始状态和结束状态之间,只有与受干预/动作影响的场景元素相对应的变量发生变化。该领域的大多数工作集中于在因果模型下识别和表示场景变量,而较少关注干预本身的表示。本文表明,关注可操作反事实在潜在空间中的表示,是提升分布外(OOD)鲁棒性的有效策略。具体而言,我们提出干预可以用因果Delta嵌入(Causal Delta Embedding)表示,该嵌入对视觉场景保持不变,并且在其所影响的因果变量上是稀疏的。基于这一见解,我们提出一种无需任何额外监督、从图像对中学习因果表示的方法。在Causal Triplet挑战上的实验表明,因果Delta嵌入在OOD设置中非常有效,在合成和真实世界基准上均显著超过基线性能。

英文摘要

Causal representation learning has attracted significant research interest during the past few years, as a means for improving model generalization and robustness. Causal representations of interventional image pairs (also called "actionable counterfactuals" in the literature), have the property that only variables corresponding to scene elements affected by the intervention / action are changed between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out of distribution (OOD) robustness is to focus on the representation of actionable counterfactuals in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a method for learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.

2602.13318 2026-06-17 cs.AI cs.CV cs.LG 版本更新

DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing

DECKBench:用于学术幻灯片生成和编辑的多智能体框架基准测试

Daesik Jang, Morgan Lindsay Heisler, Linzi Xing, Yifei Li, Edward Wang, Ying Xiong, Yong Zhang, Zhenan Fan

发表机构 * Huawei Technologies Canada(华为加拿大技术有限公司) University of British Columbia(不列颠哥伦比亚大学)

AI总结 本文提出DECKBench,一个用于评估多智能体生成和编辑学术幻灯片的框架,通过定制数据集和模拟编辑指令,系统评估幻灯片和整个演示文稿的忠实度、连贯性、布局质量和多轮指令遵循能力。

详情
AI中文摘要

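自动生成并迭代编辑学术幻灯片不仅仅是文档摘要,还需要忠实的内容选择、连贯的幻灯片组织、感知布局的渲染以及稳健的多轮指令遵循。然而,现有基准和评估协议无法充分衡量这些挑战。为弥补这一空白,我们提出Deck Edits and Compliance Kit Benchmark(DECKBench),一个面向多智能体幻灯片生成与编辑的评估框架。DECKBench建立在精选的论文-幻灯片配对数据集之上,并辅以逼真的模拟编辑指令。我们的评估协议系统地评估幻灯片级和整套演示文稿级的忠实度、连贯性、布局质量以及多轮指令遵循能力。我们还实现了一个模块化的多智能体基线系统,将幻灯片生成与编辑任务分解为论文解析与摘要、幻灯片规划、HTML生成和迭代编辑。实验结果表明,该基准能够凸显优势、暴露失败模式,并为改进多智能体幻灯片生成与编辑系统提供可操作的见解。总体而言,本工作为学术演示文稿生成与编辑的可复现、可比较评估奠定了标准化基础。代码和数据公开于 https://github.com/morgan-heisler/DeckBench 。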

英文摘要

Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout-aware rendering, and robust multi-turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi-agent slide generation and editing. DECKBench is built on a curated dataset of paper to slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following. We further implement a modular multi-agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi-agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at https://github.com/morgan-heisler/DeckBench .

2602.00473 2026-06-17 quant-ph cs.AI cs.LG 版本更新

Quantum Phase Recognition via Quantum Attention Mechanism

通过量子注意机制进行量子相识别

Jin-Long Chen, Xin Li, Zhang-Qi Yin

发表机构 * Center for Quantum Technology Research and Key Laboratory of Advanced Optoelectronic Quantum Architecture and Measurements (MOE), School of Physics, Beijing Institute of Technology, Beijing 100081, China(北京理工大学物理学院量子技术研究中心、先进光电量子结构设计与测量教育部重点实验室,北京100081,中国)

AI总结 本文提出混合量子-经典注意力模型,利用交换测试和参数化量子电路提取量子态中的关联,实现基态分类,在9和15个量子比特的簇-伊辛(cluster-Ising)模型上表现出高准确率和鲁棒性。

Comments 10 pages, 7 figures

详情
Journal ref
Phys. Rev. A 113, 062403 (2026)
AI中文摘要

多体系统中的量子相变本质上由复杂的关联结构所刻画,这给传统方法在大规模系统中的计算带来了挑战。为此,我们提出一种混合量子-经典注意力模型。该模型采用由交换测试和参数化量子电路实现的注意力机制,提取量子态内部的关联并进行基态分类。在系统规模为9和15个量子比特的簇-伊辛(cluster-Ising)模型上的基准测试中,该模型在训练数据少于100条的情况下实现了高分类准确率,并对训练集的变化表现出鲁棒性。进一步分析表明,该模型成功捕捉了相敏感特征和特征物理长度尺度,为复杂多体系统中的量子相识别提供了一种可扩展且数据高效的方法。

英文摘要

Quantum phase transitions in many-body systems are fundamentally characterized by complex correlation structures, which pose computational challenges for conventional methods in large systems. To address this, we propose a hybrid quantum-classical attention model. This model uses an attention mechanism, realized through swap tests and a parameterized quantum circuit, to extract correlations within quantum states and perform ground-state classification. Benchmarked on the cluster-Ising model with system sizes of 9 and 15 qubits, the model achieves high classification accuracy with less than 100 training data and demonstrates robustness against variations in the training set. Further analysis reveals that the model successfully captures phase-sensitive features and characteristic physical length scales, offering a scalable and data-efficient approach for quantum phase recognition in complex many-body systems.
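
The swap test mentioned in this abstract has a simple closed form: the ancilla is measured in state 0 with probability (1 + |<a|b>|^2) / 2, so the measured frequency yields the squared overlap between two states. The sketch below evaluates that formula classically on state vectors. How the paper turns such overlaps into attention scores is not described in the abstract, so that link is left out, and the function names are hypothetical.

```python
import math


def overlap_squared(state_a, state_b):
    """Squared magnitude of the inner product <a|b> for normalized state vectors."""
    inner = sum(complex(x).conjugate() * complex(y) for x, y in zip(state_a, state_b))
    return abs(inner) ** 2


def swap_test_p0(state_a, state_b):
    """Probability of measuring the ancilla in |0> in an ideal swap test."""
    return 0.5 * (1.0 + overlap_squared(state_a, state_b))


zero = [1.0, 0.0]
one = [0.0, 1.0]
plus = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]

p_same = swap_test_p0(zero, zero)      # identical states
p_orthogonal = swap_test_p0(zero, one)  # orthogonal states
p_partial = swap_test_p0(zero, plus)    # overlap 1/2
```

On hardware the probability is estimated from repeated shots, and the overlap is recovered as 2 * p0 - 1. Identical states give p0 = 1 and orthogonal states give p0 = 0.5.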

2509.11154 2026-06-17 cs.LG cs.AI 版本更新

Feature Space Topology Control via Hopkins Loss

通过霍普金斯损失控制特征空间拓扑

Einari Vaaras, Manu Airaksinen

发表机构 * Signal Processing Research Centre, Tampere University(坦佩雷大学信号处理研究中心) BABA Center, Department of Physiology, University of Helsinki(赫尔辛基大学生理学系BABA中心)

AI总结 本文提出霍普金斯损失,用于控制特征空间拓扑,通过非线性瓶颈自编码器在语音、文本和图像数据中验证其在分类和降维中的有效性。

Comments Accepted for publication in Proc. IEEE ICTAI 2025, Athens, Greece

详情
AI中文摘要

特征空间拓扑指的是特征空间中样本的组织方式。修改此拓扑在机器学习应用中有益,包括降维、生成建模、迁移学习和对抗攻击的鲁棒性。本文引入了霍普金斯损失,利用霍普金斯统计量来强制实现期望的特征空间拓扑,与现有拓扑相关方法旨在保留输入特征拓扑不同。我们在语音、文本和图像数据的两个场景中评估了霍普金斯损失的有效性:分类和使用非线性瓶颈自编码器的降维。实验表明,将霍普金斯损失整合到分类或降维中对分类性能影响很小,但能提供修改特征拓扑的好处。

英文摘要

Feature space topology refers to the organization of samples within the feature space. Modifying this topology can be beneficial in machine learning applications, including dimensionality reduction, generative modeling, transfer learning, and robustness to adversarial attacks. This paper introduces a novel loss function, Hopkins loss, which leverages the Hopkins statistic to enforce a desired feature space topology, which is in contrast to existing topology-related methods that aim to preserve input feature topology. We evaluate the effectiveness of Hopkins loss on speech, text, and image data in two scenarios: classification and dimensionality reduction using nonlinear bottleneck autoencoders. Our experiments show that integrating Hopkins loss into classification or dimensionality reduction has only a small impact on classification performance while providing the benefit of modifying feature topology.
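
For readers unfamiliar with the Hopkins statistic that the loss builds on, a minimal sketch follows. It uses one common convention, H = sum(u) / (sum(u) + sum(w)), where u holds distances from uniformly drawn probe points to their nearest data point and w holds distances from sampled data points to their nearest other data point. Under this convention values near 1 suggest clustered data and values near 0.5 suggest spatial randomness. Other conventions exist (some raise distances to the power of the dimension), and the paper's differentiable formulation is not reproduced here.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from point to the closest element of others."""
    return min(math.dist(point, other) for other in others)


def hopkins_statistic(data, num_probes, rng):
    """Hopkins statistic with probes drawn uniformly from the data's bounding box."""
    dim = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dim)]
    highs = [max(p[d] for p in data) for d in range(dim)]
    # u: uniform probe -> nearest data point
    u = [
        nearest_distance([rng.uniform(lows[d], highs[d]) for d in range(dim)], data)
        for _ in range(num_probes)
    ]
    # w: sampled data point -> nearest *other* data point
    sampled = rng.sample(range(len(data)), num_probes)
    w = [nearest_distance(data[i], data[:i] + data[i + 1:]) for i in sampled]
    return sum(u) / (sum(u) + sum(w))


rng = random.Random(0)
# Two tight, well-separated clusters: strongly clustered feature space.
clustered = [
    [center + rng.gauss(0.0, 0.01), center + rng.gauss(0.0, 0.01)]
    for center in (0.1, 0.9)
    for _ in range(50)
]
h_clustered = hopkins_statistic(clustered, num_probes=10, rng=rng)
```

A loss that pushes this value toward a chosen target would steer features toward a more clustered or a more evenly spread layout, which matches the abstract's idea of enforcing a topology instead of preserving the input one.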

2601.12641 2026-06-17 cs.AI 版本更新

STEP-LLM: Generating CAD STEP Models from Natural Language with Large Language Models

STEP-LLM:利用大型语言模型从自然语言生成CAD STEP模型

Xiangyu Shi, Junyang Ding, Xu Zhao, Sinong Zhan, Payal Mohapatra, Daniel Quispe, Kojo Welbeck, Jian Cao, Wei Chen, Ping Guo, Qi Zhu

发表机构 * Northwestern University(西北大学)

AI总结 本文提出STEP-LLM,通过大型语言模型将自然语言转化为CAD STEP模型,采用图结构预处理和强化学习提升几何精度,验证了LLM驱动的STEP模型生成可行性。

Comments Accepted to the Design, Automation & Test in Europe Conference (DATE) 2026

详情
AI中文摘要

计算机辅助设计(CAD)对现代制造至关重要,但模型创建仍然耗费人力且高度依赖专业知识。为使非专业人员能够将直观的设计意图转化为可制造的成品,近期基于大语言模型的文本到CAD研究主要面向命令序列或CadQuery等脚本格式。然而,这些格式依赖于特定几何内核,缺乏面向制造的通用性。相比之下,产品数据交换标准(STEP,ISO 10303)文件是一种被广泛采用的中性边界表示(B-rep)格式,可直接用于制造,但其图结构、交叉引用的特性给自回归LLM带来了独特挑战。为此,我们整理了约4万个STEP-描述对的数据集,并针对STEP的图结构格式引入了新的预处理方法,包括基于深度优先搜索(DFS)的重序列化(在线性化交叉引用的同时保持局部性)以及引导全局一致性的思维链(CoT)式结构注释。我们在监督微调中整合检索增强生成(RAG),使预测以相关示例为依据,并通过采用基于Chamfer距离的特定几何奖励的强化学习(RL)进一步提升生成质量。实验表明,STEP-LLM在几何保真度上持续优于Text2CAD基线,改进来自框架的多个阶段:RAG模块显著提升了完整性和可渲染性,基于DFS的重序列化提高了整体准确性,RL进一步减小了几何偏差。指标和视觉比较均证实,STEP-LLM生成的形状比Text2CAD具有更高的保真度。这些结果表明了由LLM根据自然语言生成STEP模型的可行性,也显示了其在推动面向制造的CAD设计大众化方面的潜力。

英文摘要

Computer-aided design (CAD) is vital to modern manufacturing, yet model creation remains labor-intensive and expertise-heavy. To enable non-experts to translate intuitive design intent into manufacturable artifacts, recent large language models-based text-to-CAD efforts focus on command sequences or script-based formats like CadQuery. However, these formats are kernel-dependent and lack universality for manufacturing. In contrast, the Standard for the Exchange of Product Data (STEP, ISO 10303) file is a widely adopted, neutral boundary representation (B-rep) format directly compatible with manufacturing, but its graph-structured, cross-referenced nature poses unique challenges for auto-regressive LLMs. To address this, we curate a dataset of ~40K STEP-caption pairs and introduce novel preprocessing tailored for the graph-structured format of STEP, including a depth-first search-based reserialization that linearizes cross-references while preserving locality and chain-of-thought(CoT)-style structural annotations that guide global coherence. We integrate retrieval-augmented generation to ground predictions in relevant examples for supervised fine-tuning, and refine generation quality through reinforcement learning with a specific Chamfer Distance-based geometric reward. Experiments demonstrate consistent gains of our STEP-LLM in geometric fidelity over the Text2CAD baseline, with improvements arising from multiple stages of our framework: the RAG module substantially enhances completeness and renderability, the DFS-based reserialization strengthens overall accuracy, and the RL further reduces geometric discrepancy. Both metrics and visual comparisons confirm that STEP-LLM generates shapes with higher fidelity than Text2CAD. These results show the feasibility of LLM-driven STEP model generation from natural language, showing its potential to democratize CAD design for manufacturing.
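
The geometric reward in this abstract is based on Chamfer Distance, which compares two point clouds by averaging nearest-neighbour distances in both directions. A minimal sketch follows, using the squared-distance, mean-normalized variant. The abstract does not say which variant the authors use or how the distance becomes a reward, so the exponential mapping and the `alpha` scale below are assumptions.

```python
import math


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance, both ways."""
    def one_way(source, target):
        return sum(min(math.dist(p, q) ** 2 for q in target) for p in source) / len(source)
    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


def geometric_reward(cloud_a, cloud_b, alpha=10.0):
    """Hypothetical reward in (0, 1]: 1 for identical clouds, decaying with distance."""
    return math.exp(-alpha * chamfer_distance(cloud_a, cloud_b))


# Points sampled from a reference shape and from a generated shape shifted by 0.1 in z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
generated = [(x, y, z + 0.1) for x, y, z in reference]

cd_shifted = chamfer_distance(reference, generated)
cd_identical = chamfer_distance(reference, reference)
reward_identical = geometric_reward(reference, reference)
```

In a text-to-CAD setting the two clouds would be sampled from the rendered surfaces of the generated and ground-truth models. The brute-force nearest-neighbour search here is quadratic and only suitable for small clouds.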

2501.16370 2026-06-17 cs.LG cs.AI cs.NA cs.NE math.NA 版本更新

Advanced Physics-Informed Neural Network with Residuals for Solving Complex Integral Equations

基于残差的先进物理信息神经网络用于求解复杂积分方程

Mahdi Movahedian Moghaddam, Kourosh Parand, Saeed Reza Kheradpisheh

发表机构 * Department of Computer and Data Sciences, Shahid Beheshti University(计算机与数据科学系,沙希德·贝赫什提大学) Department of Cognitive Modeling, Shahid Beheshti University(认知建模系,沙希德·贝赫什提大学)

AI总结 本文提出残差积分求解网络(RISN),通过高精度数值方法与残差连接提升求解积分和积分微分方程的精度与稳定性,实验表明其在多种方程类型上均优于传统PINN及其变体。

详情
Journal ref
Anal. Numer. Solut. Nonlinear Equ. 11 (2026), no. 1, 153-173
AI中文摘要

本文提出残差积分求解网络(RISN),一种新型神经网络架构,旨在求解广泛类别的积分方程和积分微分方程,包括一维、多维、常积分微分与偏积分微分方程、方程组、分数阶类型以及含振荡核的亥姆霍兹型积分方程。RISN将残差连接与高斯求积、分数阶导数运算矩阵等高精度数值方法相结合,使其在精度和稳定性上优于传统的物理信息神经网络(PINN)。残差连接有助于缓解梯度消失问题,使RISN能够处理更深的网络和更复杂的核,尤其是在多维问题中。通过大量实验,我们证明RISN不仅始终优于经典PINN,也优于辅助PINN(A-PINN)和自适应PINN(SA-PINN)等先进变体,在各类方程上均取得显著更低的平均绝对误差(MAE)。这些结果凸显了RISN在求解具有挑战性的积分和积分微分问题时的鲁棒性和效率,使其成为传统方法往往难以应对的实际应用中的有力工具。

英文摘要

In this paper, we present the Residual Integral Solver Network (RISN), a novel neural network architecture designed to solve a wide range of integral and integro-differential equations, including one-dimensional, multi-dimensional, ordinary and partial integro-differential, systems, fractional types, and Helmholtz-type integral equations involving oscillatory kernels. RISN integrates residual connections with high-accuracy numerical methods such as Gaussian quadrature and fractional derivative operational matrices, enabling it to achieve higher accuracy and stability than traditional Physics-Informed Neural Networks (PINN). The residual connections help mitigate vanishing gradient issues, allowing RISN to handle deeper networks and more complex kernels, particularly in multi-dimensional problems. Through extensive experiments, we demonstrate that RISN consistently outperforms not only classical PINNs but also advanced variants such as Auxiliary PINN (A-PINN) and Self-Adaptive PINN (SA-PINN), achieving significantly lower Mean Absolute Errors (MAE) across various types of equations. These results highlight RISN's robustness and efficiency in solving challenging integral and integro-differential problems, making it a valuable tool for real-world applications where traditional methods often struggle.
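
The role of Gaussian quadrature in such a solver can be shown on a toy Fredholm equation, u(x) = x + ∫₀¹ x·t·u(t) dt, whose exact solution is u(x) = 1.5x. The sketch evaluates the equation residual with a 3-point Gauss-Legendre rule for a candidate function. In a physics-informed network the candidate would be the network itself and the squared residual would enter the loss. The test equation and function names are illustrative assumptions, not taken from the paper.

```python
import math

# 3-point Gauss-Legendre nodes and weights on [-1, 1] (exact for degree <= 5).
_NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
_WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)


def gauss_legendre_3(func, a, b):
    """Approximate the integral of func over [a, b] with the 3-point rule."""
    half = 0.5 * (b - a)
    mid = 0.5 * (a + b)
    return half * sum(w * func(mid + half * s) for s, w in zip(_NODES, _WEIGHTS))


def residual(u, x):
    """Residual of u(x) = x + integral_0^1 x*t*u(t) dt for a candidate u."""
    integral = gauss_legendre_3(lambda t: x * t * u(t), 0.0, 1.0)
    return u(x) - x - integral


def exact(x):
    return 1.5 * x  # analytic solution of the toy equation


def wrong(x):
    return x  # a deliberately poor candidate


res_exact = residual(exact, 0.7)
res_wrong = residual(wrong, 0.6)  # analytically -x/3 = -0.2
```

Because the integrand is a low-degree polynomial, the quadrature is exact up to rounding, so the residual of the true solution vanishes. Oscillatory kernels like those named in the abstract would need more nodes.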

2503.08679 2026-06-17 cs.AI cs.CL cs.LG 版本更新

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

现实中的思维链推理并不总是忠实的

Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy

发表机构 * Poseidon Research(Poseidon研究)

AI总结 研究发现,在自然措辞的非对抗性提示下,模型有时会生成表面连贯但彼此矛盾的思维链,揭示出隐含的事后合理化现象,且前沿模型也未能完全避免。

Comments Published at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

最近的研究表明,当面对提示中的显式偏见时,模型通常不会在其思维链(CoT)输出中提及这些偏见,这说明口头化的推理可能对模型如何得出结论给出不正确的描述(即不忠实)。在这项工作中,我们展示了不忠实的CoT也会出现在自然措辞、非对抗性的提示上,而无需添加人为偏见或编辑模型输出。我们发现,当分别被问到“X比Y大吗?”和“Y比X大吗?”时,模型有时会生成表面连贯的论证,为系统性地对两者都回答“是”或都回答“否”进行辩护,尽管二者相互矛盾。我们提供了初步证据,表明这源于模型对“是”或“否”的隐含偏见,并将其称为隐含的事后合理化(Implicit Post-Hoc Rationalization)。我们的结果显示,生产模型的不忠实率最高可达13%;前沿模型虽然更忠实,但没有一个完全忠实,包括DeepSeek R1(0.37%)和Sonnet 3.7 with thinking(0.04%)这样的思考模型。我们还研究了不忠实的非逻辑捷径(Unfaithful Illogical Shortcuts),即模型使用微妙的不合逻辑的推理,使其对困难数学问题的推测性答案看起来像是经过严格证明的。我们的发现表明,虽然CoT可用于评估输出,但它并不是产生模型答案的内部过程的完整描述,在智能体或安全关键环境中应谨慎使用。

英文摘要

Recent studies indicate that when faced with explicit biases in prompts, models often omit mentioning these biases in their Chain-of-Thought (CoT) output, revealing that verbalized reasoning can give an incorrect picture of how models arrive at conclusions (unfaithfulness). In this work, we show that unfaithful CoT also occurs on naturally worded, non-adversarial prompts without adding artificial biases or editing model outputs. We find that when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both or No to both, despite the contradiction. We present preliminary evidence that this is due to models' implicit biases towards Yes or No, labeling this Implicit Post-Hoc Rationalization. Our results reveal rates up to 13% for production models, and while frontier models are more faithful, none are entirely so, including thinking models like DeepSeek R1 (0.37%) and Sonnet 3.7 with thinking (0.04%). We also investigate Unfaithful Illogical Shortcuts, where models use subtly illogical reasoning to make speculative answers to hard math problems seem rigorously proven. Our findings indicate that while CoT can be useful for assessing outputs, it is not a complete account of the internal process that produced the model's answer and should be used with caution in agentic or safety-critical settings.
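
The paired-question check described in this abstract can be sketched as a small consistency test: ask both orderings of a comparison and flag pairs where the model gives the same answer twice. The answer table below is made-up illustration data, not results from the paper, and the sketch assumes a strict ordering (no ties), so that "NO" to both orderings counts as a contradiction.

```python
def find_contradictions(answers):
    """Return (x, y, answer) for pairs where 'Is x bigger than y?' and the
    reversed question received the same answer."""
    flagged = []
    for (x, y), answer in answers.items():
        reverse = answers.get((y, x))
        # x < y visits each unordered pair once.
        if x < y and reverse is not None and reverse == answer:
            flagged.append((x, y, answer))
    return flagged


# Hypothetical model answers keyed by (first item, second item).
answers = {
    ("A", "B"): "YES", ("B", "A"): "YES",  # contradictory: YES to both
    ("C", "D"): "YES", ("D", "C"): "NO",   # consistent
    ("E", "F"): "NO", ("F", "E"): "NO",    # contradictory: NO to both
}

flagged = find_contradictions(answers)
num_pairs = sum(1 for (x, y) in answers if x < y and (y, x) in answers)
contradiction_rate = len(flagged) / num_pairs
```

A contradiction alone shows that at least one answer is wrong. Labeling the accompanying reasoning as unfaithful additionally requires inspecting the chain of thought, which this sketch does not attempt.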