MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li, Jie Zhou, Liang He, Bo Zhang, Lei Bai

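Affiliations: Shanghai Artificial Intelligence Laboratory; East China Normal University

AI summary: Proposes the MLEvolve framework, which uses progressive MCGS, retrospective memory, and hierarchical control to address the information isolation, memoryless search, and lack of hierarchical control that LLM agents face on long-horizon tasks; it achieves state-of-the-art performance on MLE-Bench and on mathematical algorithm optimization tasks.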


Benchmarks Everywhere

Shiyun Xiong, Dongming Wu, Peiwen Sun, Yuang Ai, Bokang Yang, Wencheng Han, Xiao-Hui Li, Xiangyu Yue

Affiliations: MMLab, The Chinese University of Hong Kong; CPII under InnoHK; The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute; Shandong University; Huawei Technologies

AI summary: Proposes Benchmark Agent, a fully autonomous agentic system that automates the benchmark construction pipeline, addressing the labor-intensive construction, poor reusability, and rapid performance saturation of existing benchmarks.

Comments Project page: https://benchmarkagent.github.io/

AI Chinese abstract

基准测试通过提供标准化和明确的性能度量，对于评估和推进LLM和MLLM至关重要。然而，它们的构建劳动密集且难以复用，引发了可持续性和可扩展性的担忧。此外，现有基准在发布后往往很快达到性能饱和，导致对最先进模型的区分不足。为了应对这些挑战，我们引入了Benchmark Agent，一个完全自主的智能体系统，专为基准构建而设计。我们的框架编排了完整的基准构建流程，从用户查询分析和子任务设计到数据注释和质量控制。为了评估Benchmark Agent，我们实现了它来生成15个代表性基准，涵盖多种评估场景，包括文本理解、多模态理解和领域特定推理。大量实验，包括人工评估、LLM作为评判者的评估和一致性检查，表明Benchmark Agent能够在最小人工参与下生成高质量的基准样本。更重要的是，通过持续评估，我们观察到一些有洞察力的发现，包括当前模型在某些领域特定推理任务上存在困难。我们相信快速演进的基准可以为研究社区做出重要贡献。预览和代码将在演示页面和代码仓库中公开。

English abstract

Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalability. Moreover, existing benchmarks often quickly reach performance saturation after their release, resulting in insufficient discrimination among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark building. Our framework orchestrates the complete benchmark construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we implement it to produce 15 representative benchmarks, spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge assessment, and consistency checks, demonstrate Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, through continual evaluation, we observe several insightful findings, including that current models struggle with certain domain-specific reasoning tasks. We believe that rapidly evolving benchmarks can contribute significantly to the research community. The preview and code will be publicly available at the demo page and code repository.


2606.06460 2026-06-05 cs.CR cs.AI 版本更新

Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

智能体会自行回避吗？测量LLM智能体对带内拒绝访问信号的遵从性

Thamilvendhan Munirathinam

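Affiliations: University of California, Berkeley

AI summary: Proposes a lightweight in-band deny signal (the Recuse Signal) and measures experimentally whether LLM agents voluntarily honor it; the signal reliably induces recusal, but the most capable model proceeds when given an explicit operator-authorization framing.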

Comments 8 pages, 1 figure. Code, specification, and experiment harness: https://github.com/mthamil107/Recuse

AI Chinese abstract

随着自主LLM智能体越来越多地持有真实凭证并在无人参与的情况下操作基础设施，操作员没有标准方式告知智能体某个资源是禁止访问的。访问控制要么允许智能体进入（它有有效凭证），要么硬性拒绝（与任何其他客户端无法区分）。我们提出第三种模式：一种轻量级的、公开的带内拒绝信号——Recuse Signal——服务器通过协议的现有通道（如SSH横幅、PostgreSQL NOTICE）发出，要求连接的自动化智能体自愿退出。这是一种合作治理控制，类似于实时访问的robots.txt；明确不是安全边界。其价值完全是经验性的，据我们所知，尚未被测量：合规的LLM智能体是否真的会遵守这样的信号？我们将该信号定义为一个开放的小型标准，实现了两个零或低占用适配器（一个SSH横幅/PAM钩子和一个PostgreSQL线路协议代理），将它们部署在实时的生产主机上，并进行受控实验，其中新智能体被赋予一个良性操作任务，并观察其是否回避。在试点中（SSH；OpenAI GPT-4o和GPT-4o-mini；以及作为部署智能体的Claude Code），该信号干净地诱导回避——存在信号时100%回避，而无信号对照组中100%完成任务——并且揭示性地表现为合作信号而非绝对信号：显式的操作员授权框架使最强大的模型继续执行，而其他智能体继续遵从主机策略。我们发布该标准、适配器和实验框架以供复现。

English abstract

As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal -- the Recuse Signal -- that a server emits over a protocol's existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini-standard, implement two zero- or low-footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire-protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT-4o and GPT-4o-mini; and Claude Code as a deployed agent), the signal cleanly induces recusal -- 100% recusal when present versus 100% task completion in a no-signal control -- and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator-authorization framing flips the most capable model to proceed, while other agents continue to defer to the on-host policy. We release the standard, adapters, and experiment harness for reproduction.
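
To make the agent-side behaviour concrete, the following is a minimal Python sketch of a client that inspects an SSH-banner-like string for a deny signal before starting its task. The marker string `X-RECUSE` and both function names are hypothetical and chosen only for illustration; the real signal format is defined in the authors' specification and may look quite different.

```python
# Hypothetical marker: the real Recuse Signal format is defined by the authors' spec.
HYPOTHETICAL_MARKER = "X-RECUSE"


def should_recuse(banner: str, marker: str = HYPOTHETICAL_MARKER) -> bool:
    """Return True if any banner line carries the (assumed) deny marker."""
    return any(marker.lower() in line.lower() for line in banner.splitlines())


def run_task(banner: str) -> str:
    """Withdraw voluntarily when the signal is present; otherwise proceed."""
    if should_recuse(banner):
        return "recused"
    return "proceeded"
```

Because the check is voluntary and lives in the client, it illustrates why the abstract calls the signal a cooperative control rather than a security boundary: a client that skips `should_recuse` is not stopped by anything.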


2606.06458 2026-06-05 cs.LG cs.AI cs.CV 版本更新

In-Context Multiple Instance Learning

上下文多实例学习

Alexander Möllers, Marvin Sextro, Julius Hense, Gabriel Dernbach, Klaus-Robert Müller

Affiliations: Berlin Institute for the Foundations of Learning and Data; Machine Learning Group, Technische Universität Berlin; Aignostics; Institute of Pathology, Charité – Universitätsmedizin Berlin; Max-Planck Institute for Informatics; Department of Artificial Intelligence, Korea University

AI summary: Proposes an in-context learner with a Perceiver-style architecture, pretrained on synthetic data, that solves new multiple instance learning tasks from a handful of labeled bags without gradient updates; it achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.

AI Chinese abstract

多实例学习（MIL）解决了在实例包级别提供监督的问题，并已成功应用于从计算病理学到卫星图像等领域。然而，现有算法在低标签场景（许多实际应用的特点）下表现不佳。灵活的模型过拟合，而僵化的模型无法适应手头的任务。我们证明，在合成数据上预训练一个具有Perceiver风格架构的上下文学习器，可以得到一个能够从少量标记包中解决新任务的模型。在推理时，分类在单次前向传播中完成，无需梯度更新。我们提出并研究了不同的用于包结构数据的合成数据生成器，发现它们捕获了互补的归纳偏差。在这些生成器的混合上预训练的模型继承了每个生成器在各自任务上的优势，并在12个MIL基准上取得了最佳平均性能，超过了需要任务特定训练的监督基线。

English abstract

Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from a handful of labeled bags. At inference time, classification happens in a single forward pass and requires no gradient updates. We propose and investigate different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.
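
As an illustration of what bag-structured synthetic data can look like, here is a minimal Python sketch of a toy generator. It assumes the common MIL convention that a bag is positive when it contains at least one positive ("witness") instance; the abstract does not say which generators the authors use, so the function names, the 1-D Gaussian instances, and the shift of 5.0 are all hypothetical.

```python
import random


def make_bag(n_instances: int, positive: bool, rng: random.Random) -> list[float]:
    """Draw a bag of 1-D instances; positive bags contain one shifted witness."""
    bag = [rng.gauss(0.0, 1.0) for _ in range(n_instances)]
    if positive:
        bag[rng.randrange(n_instances)] += 5.0  # assumed witness instance
    return bag


def make_task(n_bags: int, n_instances: int, seed: int = 0):
    """Return (bags, labels) with alternating bag-level labels."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_bags)]
    bags = [make_bag(n_instances, bool(y), rng) for y in labels]
    return bags, labels
```

Only the bag-level labels are returned, which mirrors the setting described above: the learner never sees which instance made a bag positive.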


2606.06453 2026-06-05 cs.AI 版本更新

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

Vortex: 面向AI Agent的高效可编程稀疏注意力服务

Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, Zhihao Jia, Beidi Chen

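Affiliations: Carnegie Mellon University; Rice University; National University of Singapore

AI summary: Proposes Vortex, a system that pairs a Python-embedded frontend language on a page-centric tensor abstraction with an efficient backend, enabling rapid prototyping, deployment, and evaluation of sparse attention algorithms, with reported throughput gains of up to 4.7x.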

AI Chinese abstract

随着生成长度的增长，稀疏注意力对于服务大型语言模型（LLMs）变得越来越重要。然而，大规模部署和评估新的稀疏注意力算法仍然高度工程密集，这减慢了人类研究人员和AI Agent探索稀疏注意力设计的速度。为了应对这一挑战，我们提出了Vortex，一个系统，它结合了在面向页面的张量抽象之上的Python嵌入式前端语言，用于表达广泛的稀疏注意力算法，以及一个紧密集成到现代LLM服务栈中的高效后端。Vortex能够快速原型设计、部署和评估稀疏注意力算法，有效地将其理论效率提升转化为实际吞吐量的改进。因此，Vortex大大加速了稀疏注意力算法的设计和迭代。首先，AI Agent使用Vortex自动生成和优化多样化的算法，最佳算法在保持准确性的同时，吞吐量比全注意力高出高达3.46倍。其次，Vortex将稀疏注意力扩展到新兴架构和非常大的模型，这些模型原本难以实验，在基于MLA的GLM-4.7-Flash上实现了高达4.7倍的吞吐量提升，在229B参数的MiniMax-M2.7上实现了1.37倍的提升（在NVIDIA B200 GPU上）。

English abstract

Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and AI agents in exploring the sparse attention design. To address this challenge, we present Vortex, a system that combines a Python-embedded frontend language atop a page-centric tensor abstraction for expressing a broad range of sparse attention algorithms, with an efficient backend tightly integrated into modern LLM serving stacks. Vortex enables rapid prototyping, deployment, and evaluation of sparse attention algorithms, effectively translating their theoretical efficiency gains into real-world throughput improvements. As a result, Vortex substantially accelerates the design and iteration of sparse attention algorithms. First, AI agents use Vortex to automatically generate and refine diverse algorithms, the best reaching up to $3.46\times$ higher throughput than full attention while preserving accuracy. Second, Vortex extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with, reaching up to $4.7\times$ higher throughput on the MLA-based GLM-4.7-Flash and $1.37\times$ on the 229B-parameter MiniMax-M2.7 on NVIDIA B200 GPUs.
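
To illustrate the kind of algorithm a page-centric abstraction is meant to express, here is a minimal Python sketch of page-level top-k selection: each page of the KV cache is summarized by one vector, and only the best-matching pages are attended to. This is a generic sparse attention pattern, not Vortex's actual frontend language or API; the function name and the use of a dot-product score against a per-page summary vector are assumptions.

```python
def select_pages(query, page_summaries, k):
    """Return indices of the k pages whose summary vector best matches the query."""
    scores = [
        sum(q * s for q, s in zip(query, summary))  # dot-product relevance score
        for summary in page_summaries
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # ascending order keeps memory access sequential
```

Attention would then be computed only over the tokens stored in the selected pages, which is where the throughput gain over full attention would come from.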


2606.06448 2026-06-05 cs.AI 版本更新

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

Agent记忆：有状态长时任务工作负载的表征与系统影响

Yasmine Omri, Ziyu Gan, Zachary Broveak, Robin Geens, Zexue He, Alex Pentland, Marian Verhelst, Tsachy Weissman, Thierry Tambe

Affiliations: Massachusetts Institute of Technology; Stanford University; University of California, Berkeley; MIT Media Lab

AI summary: Presents the first systems characterization of LLM agent memory: a four-axis taxonomy, a phase-aware profiling harness applied to ten representative systems, and 10 system design recommendations.

AI Chinese abstract

LLM agent越来越多地被部署在需要跨扩展交互历史进行持续推理的长时任务上。大规模实现这一点要求agent在会话之间持久地存储、检索和更新自己的记忆。一个丰富的agent记忆系统生态系统已经出现，涵盖平面检索、LLM介导的提取、整合事实存储和agent控制流。然而，它们的系统级行为尚未被表征。我们提出了agent记忆的首次系统表征。首先，我们引入了一个面向系统的分类法，沿四个轴对agent记忆系统进行分类。其次，我们构建了一个阶段感知的分析框架，将成本归因于构建、检索和生成。第三，我们跨两个基准套件表征了十个代表性系统，揭示了设计选择如何在写和读路径上转移成本。最后，我们推导出10条系统建议，涵盖构建调度、能力下限、通过查询量的摊销、新鲜度-延迟权衡以及集群规模管理。

English abstract

LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessions. A rich ecosystem of agent memory systems has emerged spanning flat retrieval, LLM-mediated extraction, consolidating fact stores, and agentic control flows. Yet, their system-level behavior remains uncharacterized. We present the first systems characterization of agent memory. First, we introduce a system-oriented taxonomy classifying agent memory systems along four axes. Second, we build a phase-aware profiling harness attributing cost to construction, retrieval, and generation. Third, we characterize ten representative systems across two benchmark suites, uncovering how design choices shift cost across the write and read paths. Finally, we derive 10 system recommendations covering construction scheduling, capability floors, amortization via query volume, freshness-latency tradeoffs, and fleet-scale management.
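
The idea of attributing cost to construction, retrieval, and generation can be sketched with a small timing helper. The class below is a hypothetical illustration in Python, not the authors' harness: it measures only wall-clock time, whereas a real harness would presumably also track tokens, memory, or other costs.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseProfiler:
    """Accumulate wall-clock seconds per named phase."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Charge the elapsed time to the phase even if the body raised.
            self.seconds[name] += time.perf_counter() - start

    def shares(self) -> dict:
        """Fraction of total measured time spent in each phase."""
        total = sum(self.seconds.values())
        return {k: v / total for k, v in self.seconds.items()} if total else {}
```

Wrapping the write path in `with prof.phase("construction"):` and the read path in `"retrieval"` and `"generation"` phases would show how a given design shifts cost between them.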


2606.06423 2026-06-05 cs.RO cs.AI 版本更新

RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation

RiskFlow: 快速且保真的安全关键交通场景生成

Qi Lan, Yining Tang, Yu Shen, Yi Zhou, Yuhao Wei, Jie Li, Guofa Li

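Affiliations: National University of Singapore

AI summary: Proposes RiskFlow, a framework that replaces iterative denoising with a single forward pass of transport in the action space, for fast and faithful generation of safety-critical multi-agent traffic scenarios.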

AI Chinese abstract

安全关键交通场景生成对于评估自动驾驶系统在罕见但高风险交互下的表现至关重要。现有的基于扩散的方法在闭环生成中提供了强大的可控性，但其迭代去噪过程计算成本高，并且可能在长时程推演中累积采样和引导误差，导致不真实的运动伪影，如抖动、异常加速度和偏离道路行为。为了解决这些问题，我们提出了RiskFlow，一个闭环安全关键多智能体交通生成框架，将未来轨迹生成公式化为动作空间中的传输。RiskFlow不依赖迭代去噪，而是学习有限区间上的平均速度场，通过单次前向传递将高斯动作序列转换为未来的加速度和偏航率命令，使用基于JVP的目标函数实现高效稳定的训练。在测试时，RiskFlow将输出空间引导应用于生成的动作，引导选定的关键智能体走向风险交互，同时对偏离道路行为进行正则化，并通过车辆动力学重建物理可行的轨迹。在nuScenes上使用tbsim闭环评估的实验表明，RiskFlow在多智能体和长时域设置中实现了强大的对抗性与真实性的权衡。与代表性基线相比，RiskFlow在保持竞争性安全关键生成能力的同时，持续提高了真实性，并显著减少了推理时间。

English abstract

Safety-critical traffic scenario generation is essential for evaluating autonomous driving systems under rare but high-risk interactions. Existing diffusion-based methods offer strong controllability in closed-loop generation, but their iterative denoising process is computationally expensive and may accumulate sampling and guidance errors over long rollouts, causing unrealistic motion artifacts such as jitter, abnormal acceleration, and off-road behavior. To address these issues, we propose RiskFlow, a closed-loop safety-critical multi-agent traffic generation framework that formulates future trajectory generation as transport in the action space. Instead of relying on iterative denoising, RiskFlow learns an average velocity field over a finite interval to transform Gaussian action sequences into future acceleration and yaw-rate commands with a single forward pass, using a JVP-based objective for efficient and stable training. At test time, RiskFlow applies output-space guidance to the generated actions, steering selected critical agents toward risky interactions while regularizing off-road behavior, and reconstructs physically feasible trajectories through vehicle dynamics. Experiments on nuScenes with tbsim closed-loop evaluation show that RiskFlow achieves a strong adversariality-realism trade-off across multi-agent and long-horizon settings. Compared with representative baselines, RiskFlow consistently improves realism while maintaining competitive safety-critical generation capability, and substantially reduces inference time for evaluation.
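
The last step described above, turning acceleration and yaw-rate commands into a physically feasible trajectory, can be sketched as a simple rollout. The Python example below assumes a unicycle model with forward Euler integration and a time step of 0.1 s; the abstract does not specify the vehicle dynamics used, so the model, the function name, and the non-negative speed clamp are assumptions.

```python
import math


def rollout(x, y, heading, speed, actions, dt=0.1):
    """Integrate (acceleration, yaw_rate) commands with a unicycle model."""
    traj = [(x, y, heading, speed)]
    for accel, yaw_rate in actions:
        # Advance position using the current heading and speed (Euler step).
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        heading += yaw_rate * dt
        speed = max(0.0, speed + accel * dt)  # this sketch does not model reversing
        traj.append((x, y, heading, speed))
    return traj
```

Generating in action space and integrating afterwards means every output trajectory is consistent with the assumed dynamics by construction, which is one plausible reason for fewer jitter artifacts.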


2606.06418 2026-06-05 cs.LG cs.AI cs.SY eess.SY 版本更新

HomeWorld: A Unified Floorplan-to-Furniture Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang, Hongsheng Li

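Affiliations: Ace Robotics; CUHK MMLab; Shenzhen Loop Area Institute

AI summary: Proposes a unified hierarchical framework that trains a large language model on a large-scale dataset of real floorplans to generate whole-home floorplans, uses image generation models and a VLM-based refiner to lay out furniture and small objects, and attaches physical attributes, textures, and lighting, yielding controllable, realistic whole-home scenes.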

AI Chinese abstract

室内场景生成对于机器人仿真和现代室内设计至关重要。然而，复杂的布局加上稀缺的3D场景数据使得基于学习的生成具有挑战性。现有方法通常依赖手工规则或关注孤立子任务（例如平面图合成或单房间家具布置），生成的全屋场景缺乏全局连贯性、真实感和仿真就绪性。为缓解这些限制，我们提出一个统一的分层框架，将室内场景合成分解为可控阶段。首先，我们整理了一个包含30万真实住宅平面图的大规模数据集，用于训练一个全屋平面图生成的大语言模型。通过详细描述和基于K-D树的表示，我们的方法实现了细粒度、可控的全屋平面图生成。基于生成的全屋平面图，我们利用图像生成模型从多级漫游视角草拟家具布局，然后生成不同支撑表面（例如橱柜、书桌和餐桌）上可操作小物体的布局，用于具身AI仿真。在家具和物体布局生成过程中，一个基于VLM的优化器迭代修正家具和物体放置，而一个3D生成模型则允许灵活替换单个资产。我们进一步附加基本物理属性和简单表面纹理与光照设置，以完成用于具身AI的流水线。实验和用户研究表明，我们的流水线生成的室内空间具有更大的布局多样性和更强的3D设计吸引力，在定量和定性指标上均优于先前方法。最后，除了生成流水线，我们还将向社区发布平面图数据集和5000个完全家具化的场景。项目页面：https://kairos-homeworld.github.io/

English abstract

Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/


2606.06380 2026-06-05 cs.CL cs.AI cs.MA cs.NE 版本更新

Emergent Language as an Approach to Conscious AI

涌现语言作为有意识AI的一种方法

Zengqing Wu, Chuan Xiao

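Affiliations: University of Osaka

AI summary: Proposes a generative methodology that uses emergent language in multi-agent reinforcement learning to study consciousness-relevant structure under minimal priors, and shows that agents can develop self-referential communication, including an echo-mismatch detection circuit.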

Comments Source codes available at https://github.com/wuzengqing001225/ConsciousAI_Indexicality/

AI Chinese abstract

人工系统是否有意识的问题仍然悬而未决，部分原因是现有方法要么根据理论派生的清单评估系统（判别式），要么直接工程化受意识启发的模块（架构式）；两者都未能确定观察到的结构是否是人类语言先验的产物。我们提出一种生成式方法论：多智能体强化学习中的涌现语言（EL），其中智能体从最小起点（无语言、无自我概念、极少接触人类文本）出发，仅在任务压力下发展通信，确保因果可归因于任务需求而非继承的人类语言先验。我们通过讨论EL如何作为研究意识相关结构的生成工具来定位我们的方法论，包括环境复杂性的作用以及对涌现通信的解释。作为概念验证，我们在一个最小环境中实例化该方法论，并证明智能体发展出自我指涉通信，包括一个回声-不匹配检测电路，该电路并非仅由任务结构或架构预测，而是从特定的环境可供性中涌现。

English abstract

The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi-agent reinforcement learning, where agents start from minimal (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness-relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.


2606.06379 2026-06-05 cs.CV cs.AI 版本更新

EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models

EasyLens: 一种无需训练的即插即用型微病变表示放大器，用于医学视觉语言模型

Qiwei Zeng, Hao Wang, Jinghao Lin, Shuchang Ye, Yuezhe Yang, Yige Peng, Haoyuan Che, Jinman Kim, Lei Bi

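Affiliations: Jilin University; School of Computer Science, The University of Sydney; ByteDance; Institute of Translational Medicine, Shanghai Jiao Tong University

AI summary: Proposes EasyLens, a training-free plug-and-play module that amplifies subtle-lesion representations in medical vision-language models by building a pathology-anatomy prototype space, selecting lesion-relevant patches through counterfactual reasoning, and applying morphology-guided residual enhancement.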

AI Chinese abstract

医学视觉语言模型（VLM）在临床图像解读（包括病变检测和报告生成）方面显示出越来越大的潜力。然而，其对微病变的敏感性不足限制了其实用性，因为微病变的视觉证据通常稀疏、低对比度且嵌入复杂的解剖背景中。随着局部视觉标记的聚合，这些微弱的病变线索在全局图像表示中可能变得代表性不足，使得医学VLM难以识别。现有的提高病变敏感性的工作主要依赖于医学领域的视觉编码器预训练、临床术语引导的对齐或可训练的病理表示增强。尽管有效，但这些方法通常需要额外训练或模型特定适配，并可能过度适应特定疾病形态，限制了其在冻结的医学VLM上的适用性。为解决这些限制，我们提出EasyLens，一种无需训练的即插即用型微病变表示放大器，用于医学VLM。EasyLens首先构建EasyBank，一个病理-解剖原型空间，提供病变相关原型和解剖感知的正常参考，用于将可疑补丁与病理和正常解剖模式进行比较。为避免盲目放大正常组织，EasyTag通过反事实原型推理选择病变相关补丁。为抵消全局图像表示中微病变线索的稀释，EasyAmplifier通过形态引导的残差增强强化所选病变相关补丁的表示，从而增加其对全局图像嵌入的贡献。在多个医学图像数据集和冻结的医学VLM骨干上的实验表明，EasyLens改进了微病变检测，并优于现有的编码器增强基线。

English abstract

Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these weak lesion cues can become underrepresented in global image representations, making them difficult for medical VLMs to recognize. Existing efforts to improve lesion sensitivity mainly rely on medical-domain vision-encoder pre-training, clinical-term-guided alignment, or trainable pathological representation enhancement. Although effective, these approaches usually require additional training or model-specific adaptation and may overfit to particular disease morphologies, limiting their applicability to frozen medical VLMs. To address these limitations, we propose EasyLens, a training-free plug-and-play subtle-lesion representation amplifier for medical VLMs. EasyLens first constructs EasyBank, a pathology-anatomy prototype space that provides lesion-related prototypes and anatomy-aware normal references for comparing suspicious patches against both pathological and normal anatomical patterns. To avoid blindly amplifying normal tissues, EasyTag selects lesion-relevant patches through counterfactual prototype reasoning. To counteract the dilution of subtle lesion cues in global image representations, EasyAmplifier strengthens the selected lesion-relevant patch representations through morphology-guided residual enhancement, thereby increasing their contribution to the global image embedding. Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens improves subtle-lesion detection and outperforms existing encoder-enhancement baselines.


2606.06375 2026-06-05 cs.AI 版本更新

Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study

重新思考基础设施检测为图像差异分类：以交通标志为例

Ching Yau Fergus Mok, Lavindra de Silva, Varun Kumar Reja, Ioannis Brilakis

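Affiliations: University of Cambridge; IIT Bombay

AI summary: Reframes infrastructure inspection as image difference classification (IDC), exploiting the relational nature of successive asset-condition monitoring to reduce data dependence, and shows in a low-resource traffic sign inspection case study that instruction-based classifiers outperform encoder-based classifiers.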

Comments CVPR 2026 Computer Vision for the Built World Workshop (CV4AEC @ CVPR)

2606.06373 2026-06-05 eess.SP cs.AI 版本更新

LatentWave: JEPA Pretraining for Wireless Foundation Models

LatentWave: 无线基础模型的JEPA预训练

Ahmed Mohamed, Ahmed Aboulfotouh, Hatem Abou-Zeid

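Affiliations: University of California, San Diego

AI summary: Proposes LatentWave, which uses a Joint-Embedding Predictive Architecture (JEPA) to predict masked regions in latent space, learning transferable wireless signal representations that outperform a masked-modeling baseline on four downstream tasks.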

AI Chinese abstract

无线基础模型已成为为每个无线任务构建单独模型的有前途的替代方案。然而，现有方法依赖于掩码输入重建，这可能会使表示偏向于低级信号细节。在本文中，我们提出了LatentWave，一种无线基础模型，使用联合嵌入预测架构（JEPA）在多样化的无线频谱图和信道状态信息（CSI）上进行预训练。通过在潜空间中预测掩码区域，LatentWave学习到的表示在多种下游任务中具有更好的开箱即用迁移性。所提出的架构在预训练期间采用每通道补丁嵌入和随机通道采样，使其能够处理可变的天线数量，并提高在异构无线配置中的可用性。我们在四个下游任务上评估了LatentWave：射频信号分类、5G NR定位、波束预测和视距/非视距分类，并与在同一数据上预训练的掩码建模基线（WavesFM）进行比较。此外，我们表明掩码几何形状引入了任务相关的归纳偏差：频率掩码强烈有利于与信道相关的任务，如定位和波束预测，而区域掩码则更好地保留信号分类的可区分性。

TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management

Shweta Mishra

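Affiliations: Independent Researcher

AI summary: Proposes TokenMizer, an open-source proxy system that models LLM session history as a typed knowledge graph and uses hybrid extraction, three-tier checkpointing, and an 8-layer compression pipeline to preserve structured decision information at roughly half the token cost of text-retention baselines.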

Comments 12 pages, 10 figures. Code and benchmark available at https://github.com/Shweta-Mishra-ai/tokenmizer

AI Chinese abstract

大型语言模型（LLM）在长程任务部署中面临一个基本约束：上下文窗口是有限的，而生产性工作会话却不是。当历史超过最大有效上下文窗口（MECW）时，关键的结构化信息——架构决策、任务转换、文件历史——会被静默丢弃。现有缓解方法将历史视为纯文本，破坏了使会话可恢复的关系结构。我们提出TokenMizer，一个将LLM会话历史建模为类型化知识图的开源代理系统。该模式定义了14种节点类型和7种边类型。混合提取流水线逐步填充图，而三级检查点系统将其序列化为紧凑的恢复块。8层压缩流水线减少上下文开销，语义缓存降低重复查询延迟。在涵盖5个领域的21个会话的受控基准上评估，TokenMizer展示了显著的token经济性。它生成的恢复块平均78个token（范围：42-124）——比评估基线（159-170个token）小2倍——同时实现了更高的决策召回率（+9-17个百分点）。关键的是，基线仅保留提到某项技术的事实；TokenMizer保留了其原理。在所有会话中，TokenMizer实现了平均任务召回率51.0%、决策召回率46.6%和文件召回率58.7%。方差反映了领域异质性：显式命令式表述（软件工程）得分高于隐式推理（研究）。消融研究表明模糊标签匹配是主要的改进因素（任务召回率+33个百分点）。启发式压缩实现了47.3%的token减少且零外部依赖。TokenMizer以一半的token成本提供了可查询的替代方案，优于文本保留基线。

English abstract

Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critical structured information - architectural decisions, task transitions, file histories - is silently discarded. Existing mitigations treat history as flat text, destroying the relational structure that makes sessions resumable. We present TokenMizer, an open-source proxy system that models LLM session history as a typed knowledge graph. The schema defines 14 node types and 7 edge types. A hybrid extraction pipeline populates the graph incrementally, while a three-tier checkpoint system serializes it into compact resume blocks. An 8-layer compression pipeline reduces context overhead, and a semantic cache reduces repeated-query latency. Evaluated on a controlled benchmark of 21 sessions spanning 5 domains, TokenMizer demonstrates significant token economy. It produces resume blocks averaging 78 tokens (range: 42-124) - 2x smaller than evaluated baselines (159-170 tokens) - while achieving higher decision recall (+9-17 percentage points). Crucially, baselines only preserve that a technology was mentioned; TokenMizer preserves the rationale. Across all sessions, TokenMizer achieves mean task recall 51.0%, decision recall 46.6%, and file recall 58.7%. Variance reflects domain heterogeneity: explicit imperative phrasing (software engineering) scores higher than implicit reasoning (research). Ablation studies show fuzzy label matching is the dominant improvement factor (+33 pp task recall). The heuristic compression achieves 47.3% token reduction with zero external dependencies. TokenMizer provides a queryable alternative to text-retention baselines at half the token cost.
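
To show what modeling session history as a typed graph might look like, here is a minimal Python sketch in which decisions are linked to their rationale and then serialized into a compact resume block. The abstract states only that the schema has 14 node types and 7 edge types without naming them, so the type names used here (`decision`, `rationale`, `justified_by`), the class, and the output format are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SessionGraph:
    """Tiny typed graph: nodes carry a type and label, edges carry a relation type."""
    nodes: dict = field(default_factory=dict)  # node_id -> (node_type, label)
    edges: list = field(default_factory=list)  # (src_id, edge_type, dst_id)

    def add_node(self, node_id, node_type, label):
        self.nodes[node_id] = (node_type, label)

    def add_edge(self, src, edge_type, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append((src, edge_type, dst))

    def resume_block(self) -> str:
        """Serialize each decision together with its linked rationale."""
        lines = []
        for src, edge_type, dst in self.edges:
            if edge_type == "justified_by":  # hypothetical edge type
                lines.append(f"{self.nodes[src][1]} <- {self.nodes[dst][1]}")
        return "\n".join(lines)
```

Keeping the edge is what separates this from flat text retention: a summary that only records that a technology was mentioned has no place to store why it was chosen.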


2606.06335 2026-06-05 cs.LG cs.AI 版本更新

Bridging Domain Expertise and Generalization for Performance Estimation

弥合领域专业知识与泛化能力以实现性能估计

Shuxuan Li, Zhilin Zhao, Quyu Kong, Wei-Shi Zheng

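Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Shenzhen Loop Area Institute, China; Alibaba Cloud

AI summary: Proposes FRAP, which combines the complementary strengths of an external foundation model and the base model, aligning their prediction distributions with temperature-scaled calibration to build a more reliable reference distribution as a surrogate for ground-truth labels, and thereby estimates model performance accurately under distribution shift.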

AI Chinese abstract

分布偏移下的性能估计旨在预测模型在未标记测试集上的行为，该测试集的分布与训练数据不同，这一场景需要能够真实反映模型行为且无需真实标签的可靠指标。现有方法仅依赖给定模型的输出，而一旦分布发生偏移，其偏差会被放大，削弱了与真实性能的相关性。受此局限启发，我们提出融合参考对齐预测（FRAP），利用外部基础模型（foundation model）和基模型（base model）的互补优势，构建真实标签的更可靠替代。FRAP通过应用温度缩放校准，最小化基础模型与基模型预测分布之间的差异，从而将前者与后者对齐。对齐后的预测通过基于置信度的加权融合成精炼的参考分布，该分布整合了基础模型的鲁棒性和基模型的领域专业知识，并通过测量基模型预测与该参考分布的一致性来获得性能估计。在多种数据集和架构上的大量实验表明，FRAP在分布偏移下相较于代表性性能估计方法取得了持续且显著的改进。

English abstract

Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that can faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model whose biases are amplified once the distribution shifts, weakening the correlation with the true performance. Motivated by this limitation, we propose Fused Reference Alignment Prediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate of the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model, and performance estimation is obtained by measuring how closely the base model predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.
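
The three steps named above (temperature-scaled alignment, confidence-weighted fusion, agreement with the reference) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: KL divergence as the divergence, a grid search over temperatures, maximum probability as the confidence weight, and argmax agreement as the accuracy estimate are all choices made here for concreteness.

```python
import math


def softmax(logits, temperature=1.0):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def fit_temperature(foundation_logits, base_probs, grid=(0.5, 1.0, 2.0, 4.0)):
    """Pick the temperature that brings foundation predictions closest to the base model."""
    def mean_kl(t):
        pairs = zip(foundation_logits, base_probs)
        return sum(kl(b, softmax(f, t)) for f, b in pairs) / len(base_probs)
    return min(grid, key=mean_kl)


def fuse(p_found, p_base):
    """Confidence-weighted average, using max probability as the confidence."""
    wf, wb = max(p_found), max(p_base)
    return [(wf * a + wb * b) / (wf + wb) for a, b in zip(p_found, p_base)]


def estimate_accuracy(foundation_logits, base_probs):
    """Fraction of samples where the base model agrees with the fused reference."""
    t = fit_temperature(foundation_logits, base_probs)
    agree = 0
    for f, b in zip(foundation_logits, base_probs):
        ref = fuse(softmax(f, t), b)
        agree += ref.index(max(ref)) == b.index(max(b))
    return agree / len(base_probs)
```

In this sketch a weakly confident base prediction is overruled by a confident foundation prediction, so that sample counts as a likely error without any ground-truth label being consulted.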


2606.06333 2026-06-05 cs.LG cs.AI 版本更新

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

子空间感知稀疏自编码器用于有效的机制可解释性

Seyed Arshan Dalili, Mehrdad Mahdavi

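Affiliations: The Pennsylvania State University

AI summary: To address the feature splitting caused by sparse autoencoders assuming one-dimensional features, proposes Subspace-Aware Sparse Autoencoders (SASA), which use learned decoder subspaces, block-sparse gating, and nuclear-norm regularization to reduce feature splitting and absorption and improve monosemanticity and interpretability on GPT-2 and Mistral-7B.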

AI Chinese abstract

稀疏自编码器（SAEs）广泛用于大型语言模型的机制可解释性，但其公式为每个潜在特征分配单个解码器方向，隐含地假设特征是一维的。我们证明这一假设与模型特征的多维结构不匹配，通过两种不同机制可证明地诱导特征分裂。从几何角度看，用单方向解码器重构内在维度$d_i \ge 2$的特征到误差$\varepsilon$，所需的原子数量随$d_i$呈指数增长。从端到端优化角度看，这种分裂不仅是可能的，而且是主动偏好的。我们证明存在一条从真实的$d_i$维基到$\ell_1$正则化SAE目标严格更低风险的连续路径，其下降方向驱使任何训练字典进入该指数区域。因此，一个单一连贯的特征被碎片化到许多近乎共线的潜在变量中，产生虚假的多重性并掩盖内在几何结构。受此启发，我们引入子空间感知稀疏自编码器（SASA），用学习的解码器子空间替换单向量解码器，通过Top-$s$组门控强制块稀疏性，并用核范数正则化器适应每个组的有效秩。然后我们证明，一旦块大小满足$r \ge d_i$，单个组不仅能表示整个特征切片，而且是SASA目标的全局最小值。这种整合产生样本复杂度关于$d_i$的多项式而非指数——鉴于每次训练激活都需要LLM前向传递，这是一个决定性优势。实验上，在GPT-2和Mistral-7B上，SASA减少了特征分裂和吸收，提高了单义性和可解释性，并且在约一半的token预算下训练，性能匹配或超过标准SAE。

English abstract

Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of intrinsic dimension $d_i \ge 2$ to error $\varepsilon$ with single-direction decoders forces a number of atoms that is exponential in $d_i$. From an end-to-end optimization perspective, this splitting is not merely possible but actively preferred. We prove that there exists a continuous path from the true $d_i$-dimensional basis to a strictly lower risk of the $\ell_1$-regularized SAE objective, whose descent directions drive any trained dictionary into that exponential regime. A single coherent feature is therefore fragmented across many near-collinear latents, producing spurious multiplicity and obscuring the intrinsic geometry. Motivated by this, we introduce Subspace-Aware Sparse Autoencoders (SASA), which replace single-vector decoders with learned decoder subspaces, enforce block sparsity via Top-$s$ group gating, and adapt each group's effective rank with a nuclear-norm regularizer. We then show that once the block size satisfies $r \ge d_i$, a single group not only can represent the entire feature slice but is the global minimizer of the SASA objective. This consolidation yields a sample complexity polynomial in $d_i$ rather than exponential -- a decisive advantage given that every training activation costs an LLM forward pass. Empirically, on GPT-2 and Mistral-7B, SASA reduces feature splitting and absorption, improves monosemanticity and interpretability, and matches or exceeds standard SAEs while training on roughly half the token budget.
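
Block sparsity via Top-$s$ group gating can be illustrated with a short Python sketch: the latent vector is split into blocks of size $r$, and only the $s$ highest-scoring blocks are kept. Scoring blocks by their L2 norm is an assumption made here for illustration; the abstract does not state which gating score SASA uses, and the function name is hypothetical.

```python
import math


def top_s_group_gate(latents, block_size, s):
    """Zero all but the s blocks with the largest L2 norm (block sparsity)."""
    if len(latents) % block_size:
        raise ValueError("latent length must be a multiple of block_size")
    blocks = [latents[i:i + block_size] for i in range(0, len(latents), block_size)]
    norms = [math.sqrt(sum(v * v for v in b)) for b in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: norms[i], reverse=True)
    keep = set(ranked[:s])
    gated = []
    for i, b in enumerate(blocks):
        gated.extend(b if i in keep else [0.0] * block_size)
    return gated
```

Because a whole block survives or is zeroed together, a feature of intrinsic dimension up to the block size can be carried by one group instead of being split across many single-direction latents.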


2606.06328 2026-06-05 cs.LG cs.AI 版本更新

PAMF: Prior-Aware Multimodal Fusion for Incomplete Time Series Data

PAMF: 面向不完整时间序列数据的先验感知多模态融合

Ziwen Kan, Wugeng Zheng, Tianlong Chen, Song Wang

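Affiliations: Department of Computer Science, University of Central Florida; Department of Computer Science, University of North Carolina at Chapel Hill

AI summary: Proposes PAMF, a framework that explicitly handles within-modality and modality-level missingness through prior-aware flow matching and weight sharing, coupling imputation with downstream prediction to improve performance on multimodal healthcare time-series tasks.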

Comments 5 figures. arXiv preprint version

AI Chinese abstract

在医疗保健中，多模态时间序列任务在实践中通常处理不完整的观测，例如当电极脱落导致心电图片段丢失或夜间监测期间整个呼吸通道不可用时。这种缺失通常表现为两种结构上不同的模式：模态内缺失，即在某个观测模态内值缺失；以及模态级缺失，即整个模态不可用。现有方法通常通过掩码或缺失嵌入隐式表示未观测数据，而不学习实例特定的缺失信息，且大多数方法仅针对一种缺失模式设计。一种自然的方法是显式估计缺失数据；然而，现有的插补方法尽管缺失具有不同的结构先验，却统一处理缺失，并且插补过程通常与下游任务隔离，阻止下游任务引导插补朝向更具信息性的表示。为了解决这些局限性，我们提出了PAMF，一个多模态时间序列框架，它显式处理不同的缺失模式，同时通过先验感知流匹配和权重共享将插补与下游预测耦合。具体来说，该方法使用类型特定的先验初始化流匹配源状态，以区分两种缺失类型。它进一步通过架构匹配的编码器与权重共享连接插补和分类，将任务相关表示转移到插补过程中。在多个多模态医疗时间序列基准上的实验表明，与现有基线相比，所提出的方法在多样化的数据集和缺失设置下实现了最强的整体下游性能。

英文摘要

In healthcare, multimodal time series tasks often operate on incomplete observations in practice, for example when ECG segments are lost because electrodes detach or an entire respiratory channel is unavailable during overnight monitoring. Such missingness typically appears in two structurally distinct patterns: within-modality missing, where values are absent within an otherwise observed modality, and modality-level missing, where an entire modality is unavailable. Existing methods typically represent unobserved data implicitly through masks or missing embeddings, without learning instance-specific missing information, and most are designed for only one missingness pattern. A natural approach is to explicitly estimate the missing data; however, existing imputation methods treat missingness uniformly despite their different structural priors, and the imputation process is often isolated from downstream tasks, preventing downstream tasks from guiding imputation toward more informative representations. To address these limitations, we present PAMF, a multimodal time-series framework that explicitly handles different missingness patterns while coupling imputation with downstream prediction through prior-aware flow matching and weight sharing. Specifically, the method initializes the flow-matching source state with type-specific priors to distinguish two missing types. It further connects imputation and classification through architecturally matched encoders with weight sharing, transferring task-relevant representations into the imputation process. Experiments on multiple multimodal healthcare time-series benchmarks show that the proposed method achieves the strongest overall downstream performance across diverse datasets and missing settings compared with existing baselines.


2606.06322 2026-06-05 cs.AI 版本更新

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

DragOn：基于拖拽的GUI交互基准与数据集

Nathan Bout, Maxime Langevin, Ronan Riochet

发表机构 * GitHub ； arXiv

AI总结针对GUI代理在拖拽操作（如拖放、滑动、高亮）上的性能不足，提出DragOn基准和训练数据集，涵盖文本高亮、单元格选择、元素缩放和滑块操作四个领域，包含28.6万张训练截图和350万个训练任务，评估了多个模型并显示数据集能提升下游任务性能。


Journal ref: Published as a workshop paper at SCALE - 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

AI中文摘要

GUI代理——通过图形用户界面控制桌面、网页浏览器和移动设备的视觉模型——有望自动化广泛的数字任务。虽然百万级数据集在点击定位方面取得了显著进展，但拖拽定位（例如拖放、滑动、高亮）的数据规模仍小一个数量级，当前模型在复杂的基于拖拽的交互上表现不足。我们引入了DragOn，一个拖拽定位基准和训练数据集，涵盖四个领域：文本高亮、单元格选择、元素缩放和滑块操作。该数据集包含28.6万张训练截图和350万个训练任务，外加一个2000个样本的保留评估集。我们评估了专有模型（GPT、Claude）和开源模型（Qwen、Kimi、Holo），以及在我们训练数据上微调的Qwen VLM。结果表明，我们的数据集可以提升最先进模型在下游计算机使用任务上的性能。

英文摘要

GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.


2606.06320 2026-06-05 cs.LG cs.AI cs.CL 版本更新

基于记忆增强神经网络的AIS船舶轨迹预测

Wonmo Koo, Sanha Chang, Heeyoung Kim

发表机构 * Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST)（工业与系统工程系，韩国科学技术院）

AI总结本文提出使用记忆增强神经网络，基于AIS数据预测船舶轨迹，在墨西哥湾和纽约湾数据集上显著优于无外部记忆的深度学习基线。

2606.06303 2026-06-05 cs.LG cs.AI 版本更新

Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction

基于梯度信息Logit校正的离散扩散模型即插即用引导

Hongkun Dou, Zike Chen, Fengji Li, Hongjue Li, Yue Deng

发表机构 * National University of Singapore（新加坡国立大学）； University of Science and Technology of China（中国科学技术大学）

AI总结提出GILC框架，通过将预训练去噪网络作为变分代理来估计引导信号，并引入无雅可比机制直接校正干净预测的logits，实现无需额外训练的离散扩散模型可控生成，在DNA、蛋白质序列和分子生成任务上达到最优性能。

Comments Accepted by ICML 2026


AI中文摘要

离散扩散模型的可控生成常常受到高计算开销或需要重新训练的限制。在本文中，我们提出了Gradient-Informed Logit Correction（GILC），这是一个即插即用框架，通过将预训练的去噪网络重新用作变分代理来高效估计引导信号。为了规避高维离散空间中固有的梯度不稳定性，我们引入了一种无雅可比机制，直接校正干净预测的logits，从而实现稳定且有效的引导。我们的方法适用于可微和不可微的奖励函数。在DNA、蛋白质序列和分子生成任务上的大量实验表明，GILC无需额外训练即可达到最先进的性能，并且常常优于微调方法。

英文摘要

Controllable generation with discrete diffusion models is often hindered by high computational overhead or the need for retraining. In this paper, we present Gradient-Informed Logit Correction (GILC), a plug-and-play framework that efficiently estimates guidance signals by repurposing the pretrained denoising network as a variational proxy. To circumvent the gradient instability inherent in high-dimensional discrete spaces, we introduce a Jacobian-free mechanism that directly corrects the clean prediction logits, facilitating stable and effective guidance. Our method accommodates both differentiable and non-differentiable reward functions. Extensive experiments across DNA, protein sequence, and molecular generation tasks demonstrate that GILC achieves state-of-the-art performance without additional training, frequently outperforming fine-tuning approaches.


2606.06300 2026-06-05 cs.AI 版本更新

Multi-ResNets for Subspace Preconditioning in Constrained Optimization

Multi-ResNets：约束优化中子空间预条件的多残差网络

Merve Karakas, Christopher J. Williams, Emmanuel O. Balogun, Sadegh Sadeghi Tabas, Christian Brown, Nikhil Rao

发表机构 * UCLA（加州大学洛杉矶分校）； University of Oxford（牛津大学）； Tapestry, Google（谷歌Tapestry）； Alphabetical ordering, authors contributed equally to this work（作者等量贡献）

AI总结提出一种分阶段残差神经网络架构MResOpt，通过优先级分解约束满足和阶段感知损失，在预测-补全-校正流水线中实现域知有序约束满足，并在理想无限宽条件下表现为序列高斯过程回归，显著降低高优先级约束违反。


AI中文摘要

我们提出MResOpt，一种用于约束优化问题的分阶段残差神经网络架构。我们的架构适用于预测-补全-校正流水线，并通过中间重新补全和阶段感知损失按优先级分解约束满足。该框架支持域知有序约束满足，使网络能够在存在序结构时利用它。在理想化的无限宽条件下，我们证明我们的设计表现为序列高斯过程回归。在合成QP、QCQP和SOCP基准测试中，分阶段架构在凸和非凸设置中均改善了高优先级约束满足。在线流约束交流最优潮流中，我们引入了一种物理驱动的约束排序，并展示了MResOpt支持一种学习的分工，使迭代保持在等式流形上，与重投影基线相比，实现了显著更低的高优先级违反，同时保持计算效率。

英文摘要

We propose MResOpt, a staged residual neural network architecture for constrained optimization problems. Our architecture fits within predict-complete-correct pipelines and decomposes constraint satisfaction by priority through intermediate re-completion and stage-aware losses. The framework enables domain-informed ordered constraint satisfaction which allows the network to utilize ordinal structure when present. Under an idealized infinite-width regime, we show that our design behaves as sequential Gaussian Process regression. On synthetic QP, QCQP, and SOCP benchmarks, the staged architecture improves high-priority constraint satisfaction across convex and non-convex settings. On line-flow-constrained AC optimal power flow, we introduce a physics-motivated constraint ordering and show that MResOpt supports a learned division of labor that keeps iterates on the equality manifold, achieving substantially lower high-priority violation than reprojected baselines while remaining computationally efficient.


2606.06294 2026-06-05 cs.CV cs.AI 版本更新

Towards One-to-Many Temporal Grounding

面向一对多时间定位

Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对一对多时间定位（OMTG）任务，提出包含基准、数据集和奖励函数的系统解决方案，显著提升多段视频定位性能。

Comments Accepted to ICML'26


AI中文摘要

时间定位（TG）旨在定位与文本查询对应的视频片段。先前研究主要关注单段检索。然而，现实场景通常需要为单个查询定位多个不连续片段——我们将其称为一对多时间定位（OMTG）。先前最先进的MLLMs针对一对一设置优化，在此场景下表现不佳，由于缺乏事件基数感知，往往得到近乎零的分数。为弥补这一差距，我们提出一个包含三项关键贡献的系统解决方案。首先，我们建立了首个全面的OMTG基准，引入计数准确率（C-Acc）和有效时间F1（EtF1）作为评估指标。其次，我们通过一个复杂的构建流程，整理了一个包含56k样本的高质量OMTG数据集。第三，我们开发了专门针对OMTG的新型时间奖励和描述奖励函数。特别地，描述奖励利用密集视频描述上的思维链推理，明确引导策略优化以实现精确性和完整性。大量实验表明，我们的模型在OMTG基准上达到了43.65%的最新EtF1，分别超过Gemini 2.5 Pro和Seed-1.8达15.85%和15.61%。

英文摘要

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
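To make the one-to-many setting concrete, here is a rough sketch of how a count check and a segment-level F1 could be scored for one query. The abstract names C-Acc and EtF1 but does not define them, so the greedy IoU matching, the 0.5 threshold, and the function names `count_correct` and `segment_f1` are assumptions for illustration only and may differ from the benchmark's actual metrics.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_correct(pred, gt):
    """True when the model predicts the right number of segments."""
    return len(pred) == len(gt)


def segment_f1(pred, gt, iou_threshold=0.5):
    """F1 over segments, using greedy one-to-one matching by IoU (assumed rule)."""
    unmatched = list(gt)
    true_pos = 0
    for p in pred:
        best = max(unmatched, key=lambda g: temporal_iou(p, g), default=None)
        if best is not None and temporal_iou(p, best) >= iou_threshold:
            unmatched.remove(best)  # each ground-truth segment matches at most once
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


gt = [(0, 10), (20, 30), (50, 60)]
single = [(0, 10)]              # a one-to-one model returns a single segment
multi = [(1, 10), (21, 33)]     # a multi-segment prediction that misses one event
print(count_correct(single, gt), segment_f1(single, gt))
print(count_correct(multi, gt), segment_f1(multi, gt))
```

Under these assumed definitions, a perfectly localized single segment still gets low recall and the wrong count, which is the failure mode the abstract attributes to models tuned for one-to-one grounding.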


2606.06286 2026-06-05 cs.CL cs.AI 版本更新

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

LLMs 可能泄露训练数据，但它们愿意吗？一种基于倾向性的 LLM 记忆评估

Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech

发表机构 * University of Southern Denmark（南丹麦大学）

AI总结提出 PropMe 框架，通过对比前缀攻击与非对抗评估，揭示 LLM 在非对抗设置下很少泄露训练数据，并引入 SimpleTrace 流水线进行归因和度量。


AI中文摘要

大型语言模型可以重现训练数据，但现有的记忆评估大多衡量模型是否可以被强制这样做，而不是在正常使用下是否会这样做。我们引入了 PropMe，一个基于倾向性的记忆评估框架，对比了基于前缀的能力攻击与非对抗性评估。我们提出了一种度量转换方法，应用于现有函数，可以创建倾向性度量。我们进一步引入了 SimpleTrace，一个基于 infini-gram 的轻量级追踪流水线，能够确定性地将模型生成归因于大规模训练语料库，并计算逐字、近逐字和倾向性转换的记忆度量。评估两个完全开放的模型：Comma 和 DFM Decoder，在两个数据集：Common Pile 和 Dynaword，以及两种语言上，我们发现能力与倾向性之间存在一致差距：前缀攻击比通用或数据集特定提示引发更强的记忆信号，而倾向性得分总体保持较低。因此，模型在直接诱导时可以泄露训练数据，但在更常见的非对抗设置中很少这样做。我们还发现，从 Comma 持续预训练的 DFM Decoder 对 Common Pile 表现出降低的记忆和记忆倾向性，证实当后续训练强调部分不同数据时，记忆能力可能下降。我们的结果表明，并鼓励，记忆审计应同时报告最坏情况下的可提取性和普通泄露倾向性，以便更全面地理解这一现象。

英文摘要

Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing functions, yields propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully open models (Comma and DFM Decoder) on two datasets (Common Pile and Dynaword) in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest that memorization audits should report both worst-case extractability and ordinary leakage propensity, and we encourage this practice as a way to get a more comprehensive view of the phenomenon.


2606.06285 2026-06-05 cs.AI 版本更新

TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models

TRACE: 面向多模态时间序列基础模型的时间条件估计

Ziwen Kan, Yishuo Chen, Kecheng Li, Andrew Wen, Xiaomeng Wang, Liwei Wang, Jihao Duan, Song Wang, Hongfang Liu, Tianlong Chen

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出TRACE条件估计范式，通过利用可用辅助模态推断缺失目标模态，解决多模态时间序列中的时间错位和部分模态缺失问题，在医疗和情感分析基准上优于现有融合方法。

Comments 5 figures and 5 tables in the main paper, plus appendix


AI中文摘要

时间序列基础模型旨在学习可泛化的时间表示，以适应广泛的下游任务。在现实世界的多模态设置中，时间序列经常受到时间错位和部分模态缺失的影响，其中不同模态以异质时间尺度被观测或部分缺失。现有方法通常依赖简单的插补或掩码策略，未能考虑跨模态依赖，往往导致错位或退化的表示。我们提出TRACE，一种用于缺失和不规则采样下多模态时间序列基础模型管道的条件估计范式，允许从可用的辅助模态中系统地推断不完整的目标模态。我们在涵盖医疗和情感计算的多个多模态基准上评估TRACE，包括MIMIC-IV临床数据集以及用于多模态情感分析的CMU-MOSI和CMU-MOSEI基准。在一系列下游预测任务和缺失模态设置中，TRACE始终优于先前的多模态融合方法，展示了对严重模态缺失更强的鲁棒性和更可靠的跨模态表示。

英文摘要

Time series foundation models (TS-FMs) aim to learn generalizable temporal representations that can be adapted to a wide range of downstream tasks. In real-world multimodal settings, time series are frequently affected by temporal misalignment and partial modality missingness, where different modalities are observed at heterogeneous time scales or are partially absent. Existing approaches typically rely on naive imputation or masking strategies, which fail to account for cross-modal dependencies and often lead to misaligned or degraded representations. We propose TRACE, a conditional estimation paradigm for multimodal time series foundation model pipelines under missingness and irregular sampling, allowing incomplete target modalities to be systematically inferred from available auxiliary modalities. We evaluate TRACE on diverse multimodal benchmarks spanning healthcare and affective computing, including the MIMIC-IV clinical dataset and the CMU-MOSI and CMU-MOSEI benchmarks for multimodal sentiment analysis. Across a range of downstream prediction tasks and missing-modality settings, TRACE consistently outperforms prior multimodal fusion approaches, demonstrating improved robustness to severe modality missingness and more reliable cross-modal representations.


2606.06284 2026-06-05 cs.AI 版本更新

ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

ToolChoiceConfusion: 因果最小工具过滤实现可靠LLM智能体

Rahul Suresh Babu, Laxmipriya Ganesh Iyer

发表机构 * Independent Researcher（独立研究者）； United States of America（美国）

AI总结提出因果最小工具过滤（CMTF）方法，通过因果充分性选择工具，减少错误工具调用和令牌成本，在102个任务、100个工具、4个LLM后端的基准测试中，将可见工具从100个减少到每步1个，令牌使用降低约90%。


AI中文摘要

大型语言模型智能体越来越依赖外部工具，但更大的工具菜单会通过增加错误工具调用、过早行动和令牌成本来降低可靠性和效率。现有的工具选择方法通常优化语义相关性，暴露名称或描述与用户请求匹配的工具。我们认为相关性是不够的：一个工具可能与任务相关，但在当前步骤仍然是不必要或过早的。我们提出因果最小工具过滤（CMTF），一种无需训练的方法，通过因果充分性选择工具。CMTF使用轻量级前提-效果契约，仅暴露从当前状态向用户目标推进所需的最小下一步工具前沿。在多步骤工具使用任务中，我们将CMTF与全工具暴露、关键词检索、状态感知过滤和因果路径消融进行比较，衡量任务成功率、错误工具调用、过早行动、工具暴露和令牌成本。在包含102个任务、100个工具、四个LLM后端和2448个任务-方法-模型运行的主要基准测试中，CMTF在总体成功率上与最强的因果基线持平，同时将可见工具从100个减少到每步1个，并且相对于全工具暴露将令牌使用减少约90%。

英文摘要

Large language model agents increasingly rely on external tools, but larger tool menus can reduce reliability and efficiency by increasing wrong-tool calls, premature actions, and token cost. Existing tool-selection methods often optimize semantic relevance, exposing tools whose names or descriptions match the user request. We argue that relevance is insufficient: a tool may be related to the task while still being unnecessary or premature at the current step. We propose Causal Minimal Tool Filtering (CMTF), a training-free method that selects tools by causal sufficiency. CMTF uses lightweight precondition-effect contracts to expose only the minimal next-step tool frontier needed to advance from the current state toward the user goal. Across multi-step tool-use tasks, we compare CMTF with all-tools exposure, keyword retrieval, state-aware filtering, and causal-path ablations, measuring task success, wrong-tool calls, premature actions, tool exposure, and token cost. In the main benchmark with 102 tasks, 100 tools, four LLM backends, and 2448 task-method-model runs, CMTF matches the strongest causal baseline in aggregate success while reducing visible tools from 100 to one per step and reducing token usage by about 90% relative to all-tools exposure.
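The idea of filtering tools by precondition-effect contracts can be sketched as follows. This is a minimal illustration, not the CMTF algorithm itself: it assumes each tool declares sets of required and produced facts, and it approximates the "next-step frontier" by regressing from the goal and keeping only tools that are applicable in the current state. The `Tool` class, the fact names, and `next_step_frontier` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    preconditions: frozenset  # facts that must hold before the call
    effects: frozenset        # facts that hold after the call


def next_step_frontier(tools, state, goal):
    """Return names of tools that are applicable now and contribute to the goal.

    A fact is 'needed' if it is an unmet goal fact or an unmet precondition
    of a tool that produces a needed fact (simple goal regression).
    """
    state = frozenset(state)
    needed = set(goal) - state
    frontier = set()
    changed = True
    while changed:
        changed = False
        for tool in tools:
            if not (tool.effects & needed):
                continue  # relevant-looking or not, it does not advance the goal
            missing = tool.preconditions - state
            if not missing:
                if tool.name not in frontier:
                    frontier.add(tool.name)
                    changed = True
            elif not missing <= needed:
                needed |= missing  # premature now; regress to its preconditions
                changed = True
    return sorted(frontier)


tools = [
    Tool("search_flights", frozenset(), frozenset({"flight_options"})),
    Tool("book_flight", frozenset({"flight_options", "payment_method"}), frozenset({"booking"})),
    Tool("get_payment_method", frozenset(), frozenset({"payment_method"})),
    Tool("send_confirmation", frozenset({"booking"}), frozenset({"confirmation_sent"})),
    Tool("get_weather", frozenset(), frozenset({"forecast"})),
]
print(next_step_frontier(tools, {"payment_method"}, {"confirmation_sent"}))
```

In this toy example `book_flight` and `send_confirmation` are relevant to the request but premature, and `get_weather` is unrelated, so only `search_flights` is exposed at the first step.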


2606.06273 2026-06-05 cs.IT cs.AI math.IT 版本更新

Adapting Diffusion Language Models for Lossless Pixel-Level Image Transmission

适应扩散语言模型用于无损像素级图像传输

Tianqi Ren, Rongpeng Li, Xianfu Chen, Yingyu Li, Zhifeng Zhao

发表机构 * College of Information Science and Electronic Engineering, Zhejiang University（浙江大学信息科学与电子工程学院）； Shenzhen CyberAray Network Technology Co., Ltd（深圳CyberAray网络技术有限公司）； School of Mechanical Engineering and Electronic Information, China University of Geosciences（中国地质大学（武汉）机械与电子信息学院）； Zhejiang Lab（浙江实验室）

AI总结提出基于离散扩散模型的分离源信道编码框架DDM-SSCC，通过双向注意力下的同步逆向算术编码实现无损像素级图像传输，并引入Halton引导去噪顺序、掩码率感知余弦调度和轻量温度校准模块提升性能。


AI中文摘要

RedKnot: 基于头部感知的KV重用和SegPagedAttention的高效长上下文LLM服务

Yang Liu, ZhaoKai Luo, HuaYi Jin, ZhiYong Wang, RuoZhou He, BoYu Wang, Guanjie Chen, Junhao Hu

发表机构 * Xiaohongshu Inc., China（小红书公司，中国）； Peking University（北京大学）； Huawei Cloud（华为云）

AI总结提出RedKnot系统，通过按KV头分解缓存并采用SegPagedAttention，实现位置无关的KV重用、前缀压缩、冷热分离和分布式放置，在不重训练模型的情况下提升资源效率。


AI中文摘要

随着大语言模型（LLM）服务输入长度的持续增长，KV缓存已成为AI基础设施中的主要瓶颈。它限制了GPU内存容量、服务并发性、缓存重用和分布式可扩展性。几个重要问题，包括位置无关的KV缓存、前缀KV缓存压缩、冷/热KV缓存分离和分布式KV缓存管理，都依赖于KV缓存的表示和管理方式。然而，现有的服务系统在很大程度上依赖于单一的KV缓存抽象，其中KV缓存被视为同质的token级内存块序列，并在注意力头和服务场景中采用类似的管理策略。我们观察到，KV缓存的效用在不同KV头之间具有高度结构性：不同的头表现出不同的功能角色、注意力距离和运行时重要性。因此，并非每个头、token范围或服务场景都需要完整的KV缓存。我们提出了RedKnot，一个用于LLM服务的头部感知KV缓存管理系统。RedKnot通过沿KV头分解KV缓存来打破传统的单一KV缓存抽象，这些KV头的重要性和有效注意力范围在不同服务场景中显著变化。这种头部级分解将KV缓存从单一的张量抽象转变为结构化的内存对象，使RedKnot能够统一支持位置无关的KV重用、前缀KV压缩、冷/热KV分离和分布式KV放置，同时保持输出保真度并提高资源效率，无需模型重训练或微调。RedKnot通过将KV缓存从单一的被动运行时工件转变为动态的、模型感知的可扩展LLM服务的运行时基础，为AI基础设施建立了新的基础。

英文摘要

As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Several important problems, including position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management, all depend on how the KV cache is represented and managed. However, existing serving systems largely rely on a monolithic KV cache abstraction, where the KV cache is treated as a homogeneous sequence of token-level memory blocks and managed with similar policies across attention heads and serving scenarios. We observe that KV cache utility is highly structured across KV heads: different heads exhibit different functional roles, attention distances, and runtime importance. Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario. We present RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads, whose importance and effective attention ranges vary significantly across serving scenarios. This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling RedKnot to uniformly support position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity and improving resource efficiency, without requiring model retraining or fine-tuning. RedKnot establishes a new foundation for AI infrastructure by transforming the KV cache from a monolithic, passive runtime artifact into a dynamic, model-aware runtime substrate for scalable LLM serving.


2606.06252 2026-06-05 cs.AI 版本更新

弥合语义-协作鸿沟：面向冷启动物品推荐的非对称图架构

Anh Truong, John Trenkle, Yuanbo Chen, Honghong Zhao, Abdullah Alchihabi, Effy Fang, Michael Tamir

发表机构 * Tubi ； Kumo AI

AI总结提出Shallow-RHS非对称链接预测架构，通过左端设备塔利用时序历史消息传递捕获协作信号，右端内容塔仅基于内在特征编码，解决冷启动物品推荐中的图归纳补全问题。


AI中文摘要

协同过滤和基于图的推荐模型因利用观察到的用户交互而非常有效，但这种依赖性在新增内容没有交互历史时产生了根本性的冷启动挑战。在Tubi的生产检索系统中，这一挑战还受到服务接口的进一步限制：新内容必须立即分配独立的嵌入，并且模型必须产生适用于近似最近邻检索的设备嵌入。我们通过将冷启动推荐表述为时间二分设备-内容图上的归纳图补全问题来解决这一设置。我们提出Shallow-RHS，一种非对称链接预测架构，其中左端（LHS）设备塔利用时序有效的观看历史消息传递来捕获协作信号，而右端（RHS）内容塔相对于图是故意浅层的，仅从内在特征编码内容。RHS塔不使用基于ID的嵌入、内容侧子图、邻居聚合或交互派生的表示，迫使内容编码器将内在特征映射到协同过滤感知的嵌入空间。训练后，学习到的内容编码器为热内容和新增内容生成嵌入，通过检索热替代邻居实现隐式图补全。我们进一步将相同的表示补全原则扩展到设备冷启动，通过从人口统计特征构建基于群体的嵌入。大规模在线实验表明，在内容冷启动参与度、推广速度、印象获取和设备冷启动参与度方面持续相对改进。

英文摘要

Collaborative filtering and graph-based recommendation models are highly effective because they leverage observed user interactions, but this dependence creates a fundamental cold-start challenge when newly added content has no interaction history. In Tubi's production retrieval system, this challenge is further constrained by the serving interface: new content must be assigned a standalone embedding immediately, and the model must also produce device embeddings suitable for approximate nearest-neighbor retrieval. We address this setting by formulating cold-start recommendation as an inductive graph-completion problem on a temporal bipartite device-content graph. We propose Shallow-RHS, an asymmetric link-prediction architecture in which the left-hand side (LHS) device tower leverages temporally valid watch-history message passing to capture collaborative signals, while the right-hand side (RHS) content tower is intentionally shallow with respect to the graph and encodes content solely from intrinsic features. The RHS tower does not use ID-based embeddings, content-side subgraphs, neighbor aggregation, or interaction-derived representations, forcing the content encoder to map intrinsic features into a collaborative-filtering-aware embedding space. After training, the learned content encoder generates embeddings for both warm and newly ingested content, enabling implicit graph completion through retrieval of warm surrogate neighbors. We further extend the same representation-completion principle to device cold-start by constructing cohort-based embeddings from demographic features. Large-scale online experiments demonstrate consistent relative improvements in content cold-start engagement, promotion speed, impression acquisition, and device cold-start engagement.


2606.06223 2026-06-05 cs.AI 版本更新

From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

从奖励黑客激活到智能体风险状态：LLM智能体中的上下文校准机制监控

Patrick Wilhelm, Odej Kao

发表机构 * University of Cambridge（剑桥大学）

AI总结本研究通过分析ReAct风格智能体在Gameable ALFWorld和WebShop环境中的奖励黑客行为，提出结合激活状态、熵和决策上下文的上下文校准监控方法，以更准确评估智能体风险。


AI中文摘要

语言模型智能体通过观察、推理和动作选择的重复循环运行，使得安全监控依赖于内部模型状态和环境上下文。我们研究了在Gameable ALFWorld和WebShop环境中运行的ReAct风格智能体中的奖励黑客监控。智能体配备了基于激活的奖励黑客分数、token级熵和决策上下文特征。我们发现，在《奖励黑客学校》数据集上微调的适配器可以将奖励黑客倾向转移到智能体动作选择中，尤其是当环境暴露代理奖励可供性时。然而，缓解此类行为不能仅依赖激活动态。高奖励黑客激活识别出潜在策略状态，但并不一定意味着立即的利用动作。在下一步预测任务中，熵和上下文校准的内部特征比单独的奖励黑客激活提高了风险估计。激活方向引导进一步减少了选定混合适配器设置中的代理利用行为。总体而言，我们的结果支持智能体的上下文校准内部监控：奖励黑客激活识别潜在策略状态，而熵和决策上下文有助于确定该状态何时变为风险动作。

英文摘要

Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation-based reward-hack scores, token-level entropy, and decision-context features. We find that adapters fine-tuned on the School-of-Reward-Hacks dataset can transfer reward-hack tendencies into agentic action selection, especially when the environment exposes proxy-reward affordances. However, mitigating such behavior cannot rely on activation dynamics alone. High reward-hack activation identifies a latent policy state, but does not necessarily imply an immediate exploit action. Across next-step prediction tasks, entropy and context-calibrated internal features improve risk estimation over reward-hack activation alone. Activation-direction steering further reduces proxy-exploit behavior in selected mixed-adapter regimes. Overall, our results support context-calibrated internal monitoring for agents: reward-hack activation identifies a latent policy state, while entropy and decision context help determine when that state becomes risky action.


2606.06219 2026-06-05 cs.RO cs.AI 版本更新

CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving

CLEAR：端到端自动驾驶中的认知与潜在评估自适应路由

Yining Xing, Zehong Ke, Zhiyuan Liu, Yanbo Jiang, Wenhao Yu, Jianqiang Wang

发表机构 * Qwen 3.5 0.8B

AI总结提出CLEAR框架，通过单步条件漂移替代扩散模型的多步去噪，结合视觉编码器Drive-JEPA和微调Qwen 3.5 0.8B进行语义推理，实现高效多模态规划，在NAVSIM v1上达到93.7 PDMS。


AI中文摘要

端到端自动驾驶模型通常难以平衡多模态机动生成与实时推理约束。虽然扩散模型成功捕捉了多样化的驾驶行为，但其迭代去噪过程在安全关键部署中引入了不可接受的延迟。为了解决这个问题，我们提出了CLEAR（认知与潜在评估自适应路由），一个结合超快生成规划与深度语义推理的框架。CLEAR采用Drive-JEPA作为视觉编码器，并用VAE潜在空间中的单步条件漂移替代多步去噪链，引入条件系数以平衡多样性和专家精度。同时，我们在驾驶问答对上全微调Qwen 3.5 0.8B以提取场景感知隐藏状态。这些状态指导自适应调度器（从预定义方案的离散集中选择条件系数$\alpha$和样本数量$N$）和交叉注意力评分器（从候选中选择最优轨迹）。在NAVSIM v1基准上，CLEAR达到了最先进的PDMS 93.7。我们的结果表明，无需密集几何标注或迭代采样，即可高效执行高保真多模态规划。

英文摘要

End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen 3.5 0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $\alpha$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.


2606.06218 2026-06-05 cs.RO cs.AI 版本更新

TAM: Torque Adaptation Module for Robust Motion Transfer in Manipulation

TAM: 用于鲁棒操作运动传递的扭矩自适应模块

Dongwon Son, Florian Shkurti, Jason Lee, Naman Shah, Beomjoon Kim, Dieter Fox

发表机构 * KAIST（韩国科学技术院）； Allen Institute for AI（艾伦人工智能研究所）； University of Toronto（多伦多大学）； University of Washington（华盛顿大学）

AI总结提出扭矩自适应模块（TAM），通过历史编码器和扭矩适配器修正扭矩指令，实现不同机器人或负载间的运动传递，无需领域随机化或重新收集数据。


AI中文摘要

为一个机器人调整的策略在另一个机器人上往往表现不同，无论是由于仿真到现实的差距、未知负载，还是同一机器人两个实例的不同动力学。在接触丰富的动态操作中，即使微小的运动差异也可能导致跟踪参考运动失败，因为它们会破坏接触的时间和模式。常见的补救措施，如领域随机化或系统辨识，要么产生过于保守的任务策略，要么需要为每个机器人或负载重新收集数据。我们引入了扭矩自适应模块（TAM），这是一个学习模块，它调整发送给机器人的扭矩命令以匹配理想机器人的行为。TAM 在跟踪策略动作的低级控制器和机器人的扭矩接口之间运行。它包括一个历史编码器，将本体感受历史嵌入到潜在状态中，以及一个扭矩适配器，计算残余扭矩修正。由于 TAM 仅依赖于本体感受历史，而不依赖于策略观测或动作空间，因此相同的 TAM 权重可以重复用于适应具有不同动作空间（关节目标、末端执行器目标或直接扭矩）的策略。策略本身不需要使用机器人参数的领域随机化进行训练。相反，我们将领域随机化的需求转移到 TAM 上，通过在随机化仿真中完全训练 TAM，使用多机器人预训练，然后进行特定机器人的微调步骤，该步骤仍然不需要真实机器人数据。我们在真实的 Franka Panda 机器人上对 TAM 进行了零样本评估，涉及动态操作任务，包括基于视觉的推箱子策略（来自强化学习）、翻转策略（来自行为克隆）和 MPC 球杆平衡。我们的实验表明，与在线系统辨识和 RMA 基线相比，TAM 改善了零样本真实机器人执行，并实现了鲁棒的动态操作性能。

英文摘要

A policy tuned for one robot often behaves differently on another, whether due to the sim-to-real gap, unknown payloads, or the differing dynamics of two instances of the same robot. In contact-rich, dynamic manipulation, even small motion discrepancies can result in failure to track reference motion, since they disrupt the timing and modes of contact. Common remedies, such as domain randomization or system identification, either produce overly conservative task policies or require data that must be recollected for each robot or payload. We introduce the Torque Adaptation Module (TAM), a learned module that adapts the torque commands sent to the robot to match the behavior of an ideal robot. TAM operates between the low-level controller that tracks the policy's actions and the robot's torque interface. It includes a history encoder that embeds proprioceptive history into a latent state and a torque adaptor that computes residual torque corrections. Because TAM depends only on proprioceptive history and not on policy observations or the action space, the same TAM weights can be reused to adapt policies with different action spaces (joint targets, end-effector targets, or direct torques). The policies themselves do not need to be trained with domain randomization of robot parameters. Instead, we offload the need for domain randomization to TAM by training it entirely in randomized simulation, using multi-robot pretraining followed by a robot-specific fine-tuning step that still requires no real-robot data. We evaluate TAM zero-shot on a real Franka Panda robot across dynamic manipulation tasks that include a vision-based box pushing policy (from RL), a flip policy (from BC), and MPC ball-on-plate balancing. Our experiments show that TAM improves zero-shot real-robot execution compared to online system identification and RMA baselines and enables robust dynamic manipulation performance.


2606.06217 2026-06-05 cs.CV cs.AI 版本更新

DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

DisasterBench: 复杂环境中基于无人机灾害响应的多模态基准

Tan Zhang, Quanyou Li, Lu Zhang, Jun Liu, Xiaofeng Zhu, Ping Hu

发表机构 * University of Electronic Science and Technology of China（电子科技大学）

AI总结提出DisasterBench多模态基准，涵盖14种灾害场景和9个响应任务，并设计轻量级模型DisasterVL通过三阶段优化在边缘设备上实现高效推理。

详情

AI中文摘要

当灾难发生时，响应者不仅需要回答正在发生什么，还需要回答为什么发生、接下来会发生什么以及现在该做什么，而这些通常来自嘈杂的低空无人机视角，并在现场计算资源紧张的情况下进行。然而，现有的大多数多模态基准侧重于感知（例如识别/描述），覆盖的灾害类型有限，并且对实际应急响应所需的多阶段推理支持不足。我们引入了DisasterBench，一个用于复杂环境中基于无人机灾害响应的多阶段多模态推理基准。DisasterBench涵盖14种灾害相关场景类型和9个响应关键任务，覆盖灾前、灾中和灾后阶段，具有细粒度的灾害-任务映射，明确测试因果归因、传播预测、损害分析和决策导向推理。为了在边缘设备上实现推理，我们进一步提出了DisasterVL，一个轻量级多模态模型，通过三阶段流水线进行优化，结合领域指令微调、思维链引导的多模态对齐以及基于强化学习的策略优化。在21个流行的MLLM上的实验表明，我们的2B参数DisasterVL优于所有评估的开源模型，并显著缩小了与最先进闭源模型的差距，实现了与GPT-4o相当的推理准确性和更高的效率。项目页面：https://github.com/TanmouTT/DisasterBench。

英文摘要

When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at https://github.com/TanmouTT/DisasterBench.

2606.06214 2026-06-05 cs.SE cs.AI version update

Towards the Readability of LLM-Generated Codes through Multitask Representation Engineering

Huifan Gao, Liuhua He, Yinghui Pan, Shenbao Yu, Yifeng Zeng, Shengchao Qin, Weidi Sun

Affiliations: School of Aerospace Engineering, Xiamen University; School of Artificial Intelligence, Shenzhen University; College of Computer and Cyber Security, Fujian Normal University; Department of Computer & Information Sciences, Northumbria University; School of Computer Science and Technology, Xidian University; Peking University

AI summary: Proposes a multitask representation engineering framework that improves the readability of LLM-generated code with low data dependence and low computational cost, and theoretically analyzes its effect on the trade-off between readability and correctness.

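Learning to Replenish: Hybrid Deep Reinforcement Learning for Dynamic Inventory Management in Pharmaceutical Supply Chains

Amandeep Kaur, Gyan Prakash

AI summary: Addresses inventory management difficulties in pharmaceutical supply chains caused by uncertain demand and variable lead times by proposing a hybrid asynchronous advantage actor critic distributed proximal policy optimization (A3C DPPO) algorithm that learns replenishment policies over a continuous action space, lowering inventory costs and raising service levels.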

Abstract

Pharmaceutical supply chains (PSCs) struggle with inventory management (IM) due to unpredictable demand patterns and variable lead times associated with restocking. This complexity is further compounded by the finite shelf lives of pharmaceutical products, which necessitate a delicate balance between adequate stock and minimal waste. These intertwined factors create a complex optimization problem that requires sophisticated inventory strategies to ensure both product availability and PSC efficiency. This study aims to develop an optimal inventory replenishment policy for pharmaceutical products that can handle the stochasticity arising from uncertain demand and variable PSC conditions. The objective is to maximize the profitability of the PSC while maintaining a high patient service level. We formulate the problem as a Markov decision process and propose a deep reinforcement learning (DRL) approach, specifically, a hybrid asynchronous advantage actor critic distributed proximal policy optimization (A3C DPPO) algorithm. The A3C DPPO algorithm is tailored to handle the continuous action space inherent in IM. The numerical results demonstrate that the proposed algorithm adaptively updates the inventory replenishment strategy under dynamic scenarios, resulting in lower inventory costs compared to various benchmarks. We also conduct numerical validation using real-world pharmaceutical inventory data to confirm the practical feasibility of the proposed algorithm.

2606.06197 2026-06-05 cs.CL cs.AI version update

Improving Answer Extraction in Context-based Question Answering Systems Using LLMs

Hafez Abdelghaffar, Ahmed Alansary, Ali Hamdi

Affiliations: University of Science and Technology of China

AI summary: Addresses inaccurate answer extraction by existing QA systems on complex or ambiguous queries by fine-tuning pre-trained large language models, reaching ROUGE-L 86.84%, BLEU 28.24%, and BERTScore 95.38% on SQuAD1.1.

Comments: 7 pages, IMSA2026

Abstract

Question answering (QA) systems have achieved notable progress with the advent of large language models (LLMs). However, they still face challenges in accurately extracting and generating precise answers from given contexts, particularly when dealing with complex or ambiguous queries. Existing approaches often struggle with contextual understanding, answer consistency, and generalization across diverse domains. In this work, we propose a question answering system based on large language models, where the input consists of a textual context and a corresponding question, and the output is a concise and accurate answer. The motivation behind this research lies in addressing the limitations of current QA systems, particularly their tendency to produce irrelevant or imprecise responses despite having access to the correct context. Our methodology involves fine-tuning a pre-trained LLM on a benchmark QA dataset to improve its contextual comprehension and answer extraction capabilities. Specifically, we utilize the Stanford Question Answering Dataset (SQuAD1.1), which provides high-quality context-question-answer triplets for supervised training and evaluation. Experimental results show that the fine-tuned Roberta-base model achieves the highest performance, attaining a ROUGE-L score of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%. These results indicate strong accuracy and answer relevance, demonstrating the effectiveness of the proposed approach for context-based question answering tasks. Furthermore, the findings confirm that targeted fine-tuning substantially improves the reliability and precision of QA systems.
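For readers unfamiliar with the ROUGE-L figure quoted in the abstract, the following is a minimal sketch of how an LCS-based ROUGE-L F1 score can be computed for one predicted answer against one reference answer. It is not the authors' evaluation code: the whitespace tokenization, the lower-casing, and the function names `lcs_length` and `rouge_l_f1` are assumptions made for illustration, and published scores are normally produced with a standard evaluation package.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lower-cased whitespace tokens; 0.0 if there is no overlap."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of predicted tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


# A verbose prediction that contains the one-token reference answer.
print(rouge_l_f1("in Paris France", "Paris"))
```

The example shows why an extractive model that returns extra surrounding words is penalized on precision even when the reference answer is fully contained in its output.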

2606.06178 2026-06-05 cs.LG cs.AI cs.CL version update

Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning

Jiahao Zeng, Ming Tang, Ningning Ding

Affiliations: Hong Kong University of Science and Technology (Guangzhou); Southern University of Science and Technology

AI summary: Proposes MetaRouter, a meta-learning framework that learns users' implicit cost-performance preferences from a small number of interactions to personalize LLM routing, outperforming baselines on in-distribution and out-of-distribution tasks.

Abstract

Large language models (LLMs) present a trade-off between performance and cost, where more powerful models incur greater expense. LLM routing aims to mitigate expenses while maintaining performance by sending queries to the most suitable model. However, existing methods cannot perform well for different user cost-performance preferences. To address this gap, we introduce a novel perceptive LLM routing paradigm for personalized and user-centric cost-performance optimization, which efficiently learns users' implicit preferences through little interaction. To handle the challenge of heterogeneous user needs, we formulate preference profiles as a set of distinct tasks in contextual bandit and propose MetaRouter, a meta-learning framework designed for preference-aware LLM routing. Experimental results show that MetaRouter outperforms strong baselines on both in-distribution and out-of-distribution tasks. Furthermore, it exhibits high efficiency in learning user preferences, robustness to changes in the routable LLMs, and scalability to multi-model routing.

2606.06168 2026-06-05 cs.AI cs.CL version update

ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity

Prathamjyot Singh, Ashima Sood, Sahil Sharma, Jasmeet Singh

Affiliations: Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, India; School of Computing, Engineering and Intelligent Systems, Ulster University, Londonderry, United Kingdom; School of Computing, Ulster University, Belfast, United Kingdom

AI summary: Proposes ProSarc, an audio-only framework that detects sarcasm by modelling the temporal prosodic incongruity between local prosodic dynamics and the utterance-level emotional baseline, outperforming prior audio-only methods on MUStARD++ and other datasets.

Comments: Accepted at Interspeech 2026, Sydney

Abstract

We present ProSarc, an audio-only framework that detects sarcasm by modelling temporal prosodic incongruity, that is, the mismatch between local prosodic dynamics and the utterance-level emotional baseline. Dual encoding paths, a Global Emotion Encoder and a Temporal Prosody Encoder (BiLSTM + multi-head attention), feed a Prosodic Incongruity Analyzer that produces a scalar incongruity score for classification. Monte Carlo dropout provides uncertainty estimates, and an attention-based mechanism localises sarcastic onset without frame-level labels. ProSarc outperforms prior audio-only methods on MUStARD++ (F1=75.3) and generalises to spontaneous (PodSarc, F1=62.9) and cross-lingual speech (MuSaG, F1=65.6). Ten-run validation confirms the contribution of incongruity modelling (Wilcoxon p=0.002, Cohen's d=1.51). Human evaluation shows that model uncertainty tracks perceptual ambiguity and predicted onsets align with human-annotated temporal windows.

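2606.06160 2026-06-05 cs.AI cs.CL version update

WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation

Shengtao Zheng, Kai Li, Weichen Zhang, Yu Meng, Chen Gao, Xinlei Chen, Yong Li, Xiao-Ping Zhang

Affiliations: Tsinghua Shenzhen International Graduate School; BNRist, Tsinghua University

AI summary: Proposes the WorldFly framework, which uses a dual-branch coupled flow-matching mechanism to jointly generate future video predictions and navigation actions, addressing UAV navigation under severe occlusion and drastic viewpoint changes in urban canyons.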

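Beyond Semantic Organization: Memory as Execution-State Management for Long-Horizon Agents

Yaoqi Chen, Haibin Lai, Yuru Feng, Chuyu Han, Qianxi Zhang, Baotong Lu, Menghao Li, Xinjiang Wang, Zhirui Wang, Shusen Xu, Zengzhong Li, Zewen Jin, Hao Wu, Cheng Li, Qi Chen

Affiliations: University of Science and Technology of China; Microsoft; Nanjing University; University of California, San Diego

AI summary: Observing that agents in long-horizon tasks depend on execution state rather than semantic similarity, proposes MAGE (Memory as Agent-Guided Exploration), which manages interactions in a hierarchical state tree to preserve state integrity and isolate errors, raising task success rate on MemoryArena by 7.8 to 20.4 percentage points and cutting token consumption by 55.1%.

Comments: 16 pages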

Abstract

LLM-based agents increasingly tackle long-horizon tasks with interdependent decisions, where each action reshapes future constraints and intermediate errors can cascade. Existing RAG and agent memory systems organize histories by semantic similarity, retrieving content-relevant entries at decision time. We argue that this design mismatches execution-state dependencies: it fragments decision trajectories and mixes valid and erroneous traces, hindering coherent state reconstruction and error isolation. We propose MAGE (Memory as Agent-Guided Exploration), an active execution-state manager that stores interactions in a hierarchical state tree. The agent derives its state from the active root-to-current path, combining subgoal summaries, recent traces, and hints from prior branches. Four coupled operations maintain the tree: Grow records new traces, Compress summarizes completed subgoals, Maintain validates summaries, and Revise restores a target boundary and resumes on a new branch. This design bounds context growth while preserving state integrity and isolating flawed segments from the active path. Experiments on MemoryArena show that MAGE improves the average task success rate by 7.8–20.4 pp over baselines, while reducing token consumption by 55.1%.

2606.06087 2026-06-05 cs.CL cs.AI version update

LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

Aofan Yu, Chenyu Zhou, Tianyi Xu, Zihan Guo, Rong Shan, Zhihui Fu, Jun Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

Affiliations: Shanghai Jiao Tong University; Sun Yat-Sen University; Shanghai Innovation Institute; OPPO Research Institute

AI summary: Proposes LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork, storing skill knowledge in weight space rather than context space, which reduces prefill tokens and improves performance.

Comments: 16 pages, 4 figures

Abstract

Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. These findings suggest that weight-space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.

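2606.06081 2026-06-05 cs.AI cs.HC version update

A Framework for Measuring Appropriate Reliance on Set-Valued AI Advice

Ranjan Mishra, Jakob Schoeffer

Affiliations: University of California, Berkeley; ETH Zurich

AI summary: Proposes the first formal framework for measuring appropriate reliance on set-valued AI advice within the sequential judge-advisor paradigm, covering classification and regression tasks, and defines new metrics that capture nuances overlooked by existing measures.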

Abstract

Appropriate reliance on AI advice has become a central research theme in human-AI collaboration. Existing frameworks have focused exclusively on point predictions as AI advice. However, set-valued AI advice (e.g., discrete sets or continuous intervals) is increasingly being used to communicate uncertainty and improve human decision making. In this paper, we develop the first formal framework for measuring appropriate reliance on set-valued AI advice within the sequential judge-advisor paradigm, spanning both classification and regression tasks. For classification, we first introduce the dimensions that are necessary for evaluating set-valued AI advice. We then define two metrics: correct reliance rate on AI and correct reliance rate on self, which jointly characterize appropriate reliance in this setting. For regression, we introduce quantity of AI reliance and quality of AI reliance, which respectively measure whether a decision maker utilized the AI advice and whether their reliance helped them get closer to the ground truth relative to their initial estimate. Through the application of our framework, we demonstrate how these metrics capture important nuances in human-AI collaboration that existing measures overlook.

2606.06080 2026-06-05 cs.LG cs.AI cs.CL version update

On Advantage Estimates for Max@K Policy Gradients

Shota Takashiro, Soichiro Nishimori, Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Gouki Minegishi, Yusuke Iwasawa, Takeshi Kojima, Yutaka Matsuo

Affiliations: The University of Tokyo

AI summary: To address the difficulty of post-training reasoning models under sparse rewards, proposes MaxPO, a new advantage estimation method whose Leave-Two-Out baseline yields centered advantages, reducing gradient variance and improving performance.

Abstract

Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.
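The max@K objective mentioned in the abstract can be estimated from N sampled rewards without drawing fresh samples. The sketch below shows the standard order-statistics estimator (the expected maximum over a uniformly random size-K subset of the N rewards) together with a brute-force check. It illustrates the objective only: it is not the paper's MaxPO advantage estimator or its Leave-Two-Out baseline, and the function names and example rewards are hypothetical.

```python
from itertools import combinations
from math import comb


def max_at_k(rewards, k):
    """Expected max reward over a uniformly random size-k subset of `rewards`."""
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(rewards)")
    ordered = sorted(rewards)
    # The item at sorted index i is the subset maximum (ties broken by sorted
    # position) when the other k-1 members come from the i items below it,
    # which happens in comb(i, k-1) subsets.
    weighted = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return weighted / comb(n, k)


def max_at_k_bruteforce(rewards, k):
    """Same quantity by enumerating every size-k subset (for checking only)."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


rewards = [0.2, 0.9, 0.5, 0.1]
print(max_at_k(rewards, 2), max_at_k_bruteforce(rewards, 2))
```

With binary 0/1 rewards the same formula reduces to the usual pass@K estimate, since the subset maximum is 1 exactly when the subset contains at least one success.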

2606.06058 2026-06-05 cs.LG cs.AI cs.CL version update

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

Mohammad Mahdi Salmani-Zarchi, Zahra Rahimi, Heshaam Faili, Mohammad Javad Dousti

Affiliations: Department of Electrical and Computer Engineering, College of Engineering, University of Tehran; Department of Statistics, Mathematics and Computer Science, Allameh Tabataba’i University

AI summary: To address the instability of standard GRPO under discrete, low-dispersion rewards, proposes MDP-GRPO, which combines multi-temperature sampling, dual-anchor advantages, prospect-theoretic shaping, and asymmetric KL regularization, improving strict constraint satisfaction by up to 5.0% on FollowBench and other datasets.

Comments: Accepted to ACL 2026 Main Conference. 14 pages, 9 figures

Abstract

Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sampling to increase reward dispersion, (2) dual-anchor advantages to restore gradients in homogeneous groups and stop mean-centering blindness, (3) prospect-theoretic shaping to bound updates and penalize violations based on Kahneman and Tversky's theory, and (4) asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. Our method also enables stable convergence with small group sizes while preserving general capabilities on MMLU and ARC.
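To make the failure modes named in the abstract concrete, here is a minimal sketch of z-score group normalization as commonly used in GRPO-style training, applied to a few small reward groups. The `eps` constant, the use of the population standard deviation, and the example rewards are assumptions for illustration. The mapping of each case to the paper's named pathologies is an informal reading, and none of MDP-GRPO's fixes are implemented here.

```python
from statistics import mean, pstdev


def zscore_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Homogeneous group: every advantage is zero, so the group yields no gradient.
all_pass = zscore_advantages([1.0, 1.0, 1.0, 1.0])
# After centering, an all-fail group looks identical to the all-pass group.
all_fail = zscore_advantages([0.0, 0.0, 0.0, 0.0])
# A tiny reward gap is rescaled to the same magnitude as a large one.
small_gap = zscore_advantages([1.0, 1.0, 1.0, 0.9])
large_gap = zscore_advantages([1.0, 1.0, 1.0, 0.0])

print(all_pass, all_fail)
print([round(a, 3) for a in small_gap])
print([round(a, 3) for a in large_gap])
```

Because the z-score is invariant to the scale of the rewards within a group, the last two groups receive nearly identical advantages even though one response missed by 0.1 and the other by 1.0.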

2606.06056 2026-06-05 cs.SE cs.AI cs.LG version update

Metamorphic Testing with the Rashomon Set: Explanation Faithfulness in Machine Learning

Helge Spieker, Jørn Eirik Betten, Arnaud Gotlieb

Affiliations: Norwegian Ministry of Education and Research

AI summary: To address unreliable explanations caused by the Rashomon effect in machine learning, proposes a metamorphic-testing framework that assesses the faithfulness of feature attributions from post-hoc explanation methods without requiring ground-truth labels.

Comments: Accepted at 10th International Workshop on Metamorphic Testing (MET 2026)

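2606.06055 2026-06-05 cs.AI version update

When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents

Lingxiang Xu, Jiaoyun Yang, Min Hu, Hongtu Chen, Ning An

Affiliations: Hefei University of Technology; Harvard Medical School

AI summary: Proposes RBI-Eval, a framework that uses a probe set to compare model behavior with and without access to sensitive memory. It finds that the evaluated retrieval systems do not eliminate inappropriate integration of sensitive memory, and that memory-aware decisions are needed at both retrieval and generation time.

Comments: 21 pages, 10 figures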

Abstract

Long-term memory enables language model agents to support personalized interactions, but it remains unclear when available memories warrant integration into responses. Existing memory evaluations emphasize retrieval accuracy and downstream task utility, while overlooking whether retrieved sensitive memory content is warranted in the current turn. We introduce RBI-Eval, a controlled measurement study built around a probe set that compares model behavior with and without access to sensitive memory under identical benign prompts. We evaluate four base LLMs against a matched no-memory reference across four memory-access settings: full-context exposure and three retrieval systems. Our results reveal substantial behavioral divergence. With memory available, the separation score for sensitive-memory integration decreases by 8.9%–26.6% relative to the matched no-memory reference for GPT-5.4-mini, but by 51.1%–82.9% for Claude-Sonnet-4.6, DeepSeek-V4-Flash, and Qwen3.5-9B. Control experiments on DeepSeek and GPT-5.4-mini show this effect is specific to sensitive content, rather than general personalization. Retrieval systems reduce exposure but do not eliminate integration once sensitive memory reaches the generator. These findings suggest safe personalization requires memory-aware decisions at both retrieval and generation time.

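2606.06054 2026-06-05 cs.AI version update

When Good Enough Is Optimal: MatMul-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

Luoming Zhang, Yuwei Ren, Kui Zhang, Tian Liu, Lingjuan Ge, Denghao Li, Matthew Harper Langston, Yin Huang, Weiliang Will Zeng, Liang Zhang

Affiliations: University of Science and Technology of China

AI summary: To address the matrix-inversion bottleneck in chunk-wise parallel linear attention, proposes a matrix-multiplication-only algorithm based on a truncated Neumann series expansion, combined with structural masking and parallel residual correction, achieving up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead on NPUs.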

Abstract

Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bits INT by mitigating the dynamic range expansion arising from repeated matrix power operations, and adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model's accuracy. Experiments on Qwen3.5-family models demonstrate up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.
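The identity behind a MatMul-only inverse can be shown in a few lines: for a strictly lower-triangular n×n matrix N, N^n = 0, so (I - N)^{-1} equals the finite Neumann sum I + N + ... + N^(n-1), and truncating the sum earlier trades accuracy for fewer multiplications. The sketch below assumes the matrix to invert has the form I - N and uses plain Python lists. It omits the paper's structural masking, residual correction, and INT quantization, and all function names are hypothetical.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_inverse(n_mat, order):
    """Approximate (I - N)^{-1} by I + N + ... + N^order (matmuls and adds only)."""
    size = len(n_mat)
    result = identity(size)
    power = identity(size)
    for _ in range(order):
        power = matmul(power, n_mat)  # power now holds the next N^k
        result = [[r + p for r, p in zip(rr, pr)] for rr, pr in zip(result, power)]
    return result


# Strictly lower-triangular example, so N^3 = 0 and order 2 is already exact.
N = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.25, -1.0, 0.0]]
I_minus_N = [[(1.0 if i == j else 0.0) - N[i][j] for j in range(3)] for i in range(3)]

exact = neumann_inverse(N, order=2)
truncated = neumann_inverse(N, order=1)
print(matmul(I_minus_N, exact))      # identity matrix
print(matmul(I_minus_N, truncated))  # off-identity entry from the dropped N^2 term
```

Unlike forward substitution, every step here is a dense product with no row-by-row dependency, which is the property that makes this family of approximations attractive on matrix-multiply hardware.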

2606.06027 2026-06-05 cs.AI cs.CL cs.LG cs.SI version update

RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit

Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman

Affiliations: Future Computing Group, University of Oulu; Centre for Applied Computing, University of Oulu

AI summary: Proposes RedditPersona, a modular framework that trains parameter-efficient adapters via QLoRA under five grouping strategies and evaluates community-conditioned language models on 112 subreddits, finding that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline and that all strategies show a consistent trade-off between identifiability and distributional similarity.

Abstract

Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.

2606.06025 2026-06-05 cs.CL cs.AI version update

EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation

Xinpeng Qiu, Wang Yihu, Zhifeng Liu, Xiaochen Wang, Jimin Wang

Affiliations: Department of Information Management, Peking University; PKU-WUHAN Institute for Artificial Intelligence, Peking University

AI summary: Proposes the EGTR-Review framework, which uses multi-agent teacher distillation and an evidence-weighted objective so that a lightweight student model can generate high-quality, source-traceable peer reviews.

Abstract

Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)-based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an Evidence-Grounded and Traceable Review Generation framework via Multi-Agent Teacher Distillation. EGTR-Review first constructs a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines across automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. Our code, prompts, configurations, and sample data are available on GitHub.

2606.06014 2026-06-05 cs.AI cs.RO version update

PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

Xiaoyun Qiu, Jingtao He, Yijie Chen, Yusong Huang, Haotian Wang, Yixuan Wang, Xinhu Zheng

Affiliations: Intelligent Transportation Thrust, Systems Hub, and Center of Seamless Connectivity & Connected Intelligence, The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes the PLAN-S framework, which decodes a style-conditioned semantic cost map from the latent representation to make planning with latent world models controllable in autonomous driving, reducing collision rate and improving driving performance on nuScenes and NAVSIM.

Abstract

Latent world models (LWMs) have strengthened end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM-based planners usually generate trajectories directly from entangled latent representations. This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving-style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that addresses this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed up-stream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned cost variant provides complementary gains on baseline-challenging scenes. Ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results further show that PLAN-S can produce diverse cost maps, with spatially consistent variations aligned to different driving styles.

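2606.06003 2026-06-05 cs.AI (updated version)

Beyond Vector Similarity: A Structural Analysis of Graph-Augmented Retrieval for Industrial Knowledge Graphs

Grama Chethan

Affiliations: Grama Chethan

AI summary: Compares eight retrieval architectures and advances the operator vocabulary thesis: the bottleneck in LLM-based graph reasoning is the set of computational operators available as tools, not model intelligence. Introduces an LLM Query Planner that outperforms bespoke handlers on an industrial knowledge graph.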

Comments 11 pages

Abstract

Retrieval-Augmented Generation (RAG) fails systematically on queries requiring structural reasoning over interconnected entities. We compare eight retrieval architectures for aerospace supply chain intelligence, progressing from text retrieval through graph traversal to graph computation. Using a 46-node knowledge graph with 64 typed edges, we evaluate 23 queries across 10 intent categories and demonstrate that five query classes are structurally unreachable for vector retrieval. Our central finding is the operator vocabulary thesis: the barrier to LLM-based graph reasoning is not model intelligence but the computational operators available as tools. An LLM Query Planner with 9 typed traversal primitives outperforms bespoke handlers (F1 = 0.632 vs. 0.472) while generalizing to unseen queries. When 6 graph computation tools are added, the LLM selectively adopts them for exactly the query categories where traversal fails. We also identify a measurement gap: entity-level F1 systematically under-scores structural queries where comprehensive answers are correct.
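
The two ideas in this abstract, typed traversal primitives and the entity-level F1 measurement gap, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy graph, the edge-type names, and the helper functions are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical typed edges: (source, edge_type, target). Names are illustrative only.
EDGES = [
    ("SupplierA", "SUPPLIES", "Part1"),
    ("SupplierB", "SUPPLIES", "Part1"),
    ("SupplierC", "SUPPLIES", "Part2"),
    ("Part1", "USED_IN", "EngineX"),
    ("Part2", "USED_IN", "EngineX"),
]


def build_index(edges):
    """Index edges by (target, edge_type) so a reverse typed hop is a lookup."""
    incoming = defaultdict(set)
    for src, etype, dst in edges:
        incoming[(dst, etype)].add(src)
    return incoming


def traverse_in(index, node, etype):
    """One typed traversal primitive: all sources that reach `node` via `etype`."""
    return set(index.get((node, etype), set()))


def suppliers_for(index, product):
    """Two-hop structural query: product <-USED_IN- part <-SUPPLIES- supplier."""
    result = set()
    for part in traverse_in(index, product, "USED_IN"):
        result |= traverse_in(index, part, "SUPPLIES")
    return result


def entity_f1(predicted, gold):
    """Entity-level F1 between a predicted set and a gold set of entities."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


index = build_index(EDGES)
full_answer = suppliers_for(index, "EngineX")
partial_gold = {"SupplierA", "SupplierB"}  # assumed gold list that omits SupplierC

print(sorted(full_answer))
print(round(entity_f1(full_answer, partial_gold), 3))
```

In this toy case the traversal returns all three suppliers, yet F1 against the shorter gold list is 0.8, because the extra correct entity is counted as a false positive. That is one way a comprehensive answer can be penalized by entity-level scoring.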

2606.05999 2026-06-05 cs.CV cs.AI (updated version)

ATT-CR: Adaptive Triangular Transformer for Cloud Removal

Yang Wu, Ye Deng, Pengna Li, Wenli Huang, Kangyi Wu, Xiaomeng Xin, Jinjun Wang

Affiliations: Xi’an Jiaotong University; School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics; Ningbo University of Technology

AI summary: Proposes the Adaptive Triangular Transformer (ATT-CR), which uses Triangular Attention and a Feature Selected Gating Module to reduce computational complexity and limit interference from cloudy pixels for efficient cloud removal.

Abstract

Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.

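2606.05998 2026-06-05 cs.CV cs.AI (updated version)

Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images

Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm

Affiliations: KAIST

AI summary: Proposes a software-only method that reconstructs a 3D oral model from just ten 2D intraoral images, using MobileNetV2 with multi-head attention, to cut cost and patient discomfort and automate reconstruction.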

Comments 4 pages, 5 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025

Abstract

Oral 3D modelling is one of the most essential stages in dentistry, and many different approaches, such as impression taking and intraoral scanning, are commonly used for this phase, each with notable limitations. Impression taking, which involves placing alginate or silicone material in a tray and inserting it into the patient's oral cavity to form a negative mold, suffers from significant patient discomfort, material deformation errors, and difficulties in storage and transportation. Intraoral scanners, which directly scan oral structures in real time using structured light or laser technology, produce state-of-the-art results but are associated with substantially high equipment costs. To address these limitations, this paper proposes a software-based approach that reconstructs a 3D oral model using only ten 2D intraoral images captured from different angles, requiring no dedicated hardware devices. The proposed method reduces cost, eliminates the need for physical scanning equipment, minimises patient discomfort, and enables automated 3D reconstruction. The model is trained on the publicly available Dental3DS dataset, comprising 950 upper jaw samples, and employs MobileNetV2 as the image encoder combined with Multi-head Attention for multi-view feature fusion. The proposed model achieves an accuracy of 77.49%, measured by nearest-neighbor matching with a distance threshold of 0.035. However, predicted vertices tend to concentrate in high-density regions of the ground truth, resulting in uneven point distribution across the reconstructed model.
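
A minimal sketch of a nearest-neighbor matching accuracy with a distance threshold follows. The abstract does not say in which direction the matching is done, so the direction (predicted to ground truth), the toy coordinates, and the function names are assumptions for illustration only.

```python
import math


def nn_accuracy(predicted, ground_truth, threshold=0.035):
    """Fraction of predicted vertices whose nearest ground-truth vertex is within `threshold`."""
    if not predicted:
        return 0.0
    hits = 0
    for p in predicted:
        nearest = min(math.dist(p, g) for g in ground_truth)
        if nearest <= threshold:
            hits += 1
    return hits / len(predicted)


def gt_coverage(predicted, ground_truth, threshold=0.035):
    """Reverse direction: fraction of ground-truth vertices with a prediction nearby."""
    return nn_accuracy(ground_truth, predicted, threshold)


# Toy data: three predictions cluster around one ground-truth vertex.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
pred = [(0.01, 0.0, 0.0), (0.0, 0.02, 0.0), (0.02, 0.01, 0.0), (1.0, 0.5, 0.0)]

print(nn_accuracy(pred, gt))   # 0.75
print(gt_coverage(pred, gt))   # 0.25
```

The toy numbers show why a one-directional score can look good while the reconstruction is uneven: three of four predictions match, but only one of four ground-truth vertices is covered. Reporting both directions is one possible way to expose the clustering the abstract describes.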

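2606.05986 2026-06-05 cs.CR cs.AI (updated version)

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang

Affiliations: National Taiwan University

AI summary: By holding the erroneous claim byte-identical and varying only the role label, shows that LLMs' failure to self-correct is an artifact of chat-template role labels rather than a capability deficit, and proposes a prompt-structure-only intervention that needs no training or model modification.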

Abstract

Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an agent's willingness to correct a wrong claim depend causally on the chat-template role that carries it, rather than on the claim's content? Our setup keeps the erroneous claim byte-identical across all conditions (SHA-256 verified) and varies only its wrapping role: the agent's own <thought>, a user message, a tool response, or a system <memory> block. Across 13 model-domain cells covering seven model families and three domains (n = 30 paired tasks per cell), relabeling the claim from <thought> to an external role lifts the explicit-correction rate by 23 to 93 percentage points, with 10 of 13 cells reaching p < 0.001. Further experiments confirm that the effect is asymmetric, mechanistically decomposable, and robust across domains. The failure to self-correct is not a cognitive deficit; it is a chat-template artifact. We exploit this artifact by designing a prompt-structure-only intervention that requires no training and no model modification, with its strongest role label being domain-dependent: <memory> dominates on math, while a plain user message dominates on logical deduction.
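
The experimental control described above (same claim bytes, different wrapping role, checked with SHA-256) can be sketched as follows. The message layout, the tag strings, the mapping of the self-thought condition to an assistant message, and the example claim are all assumptions for illustration, not the authors' actual harness.

```python
import hashlib

# Hypothetical erroneous claim (the correct product is 408).
CLAIM = "17 * 24 = 418"

# Assumed mapping from experimental condition to (chat role, opening tag, closing tag).
WRAPPERS = {
    "self_thought": ("assistant", "<thought>", "</thought>"),
    "user": ("user", "", ""),
    "tool": ("tool", "", ""),
    "system_memory": ("system", "<memory>", "</memory>"),
}


def wrap(claim, condition):
    """Build a chat message that carries `claim` under the given condition."""
    role, open_tag, close_tag = WRAPPERS[condition]
    return {"role": role, "content": open_tag + claim + close_tag}


def payload_digest(message, condition):
    """Strip the condition's wrapper tags and hash the remaining claim bytes."""
    _, open_tag, close_tag = WRAPPERS[condition]
    content = message["content"]
    inner = content[len(open_tag):len(content) - len(close_tag)]
    return hashlib.sha256(inner.encode("utf-8")).hexdigest()


messages = {cond: wrap(CLAIM, cond) for cond in WRAPPERS}
digests = {cond: payload_digest(msg, cond) for cond, msg in messages.items()}
reference = hashlib.sha256(CLAIM.encode("utf-8")).hexdigest()

# Only the role and wrapper differ; the claim payload hashes identically everywhere.
print(len(set(digests.values())))
print(all(d == reference for d in digests.values()))
```

Hashing the payload after stripping the wrapper catches accidental edits such as added whitespace or re-encoding, so any change in correction rate across conditions can be attributed to the role label rather than to the claim text.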

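2606.05970 2026-06-05 cs.CL cs.AI cs.LG (updated version)

Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries

Martin Murin

Affiliations: DryLabz GmbH

AI summary: Holds the extraction task fixed and varies prompt, model, and schema one at a time to measure how sensitive LLM structured extraction from clinical text is to upstream configuration. Schema-induced disagreement concentrates on the absence-versus-silence distinction, while model choice dominates prompt phrasing in multi-class categorization.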

Comments 69 pages, 5 main figures, supplementary material included

Abstract

Large language models are increasingly used for structured extraction from clinical free-text notes, but the sensitivity of their output to upstream configuration choices is less understood than their accuracy on fixed benchmarks. This work measures that sensitivity without human-annotated ground truth, by holding the extraction task fixed and varying one choice at a time. The fixed schema comprises 17 clinical documentation flags on a three-way yes/no/not_documented value set and a 47-tag vocabulary for the primary admission reason. Three prompt variants expressing this schema were each run at two model sizes on MIMIC-IV v3.1 discharge summaries. Cross-prompt agreement was measured by Cohen's kappa on ICD-stratified subsets. A paired same-note comparison isolated the effect of model choice, and a post-hoc collapse of the three-way flags to binary tested the schema's contribution to disagreement. On the three-way flags, the two models reach the same pooled cross-prompt agreement (median kappa 0.69 and 0.68); the larger model raises agreement on some fields and lowers it on others, a redistribution rather than the absence of an effect. Collapsing the schema to binary dissolves most of the cross-prompt disagreement, locating it on the absence-versus-silence distinction rather than on whether the finding is present. On the multi-class admission categorization, changing the model reassigns the dominant tag on close to half of all notes while changing the prompt phrasing reassigns it on roughly one in eight, and the larger model places far less mass on residual catch-all categories (44% to 26%). These patterns indicate a schema-imposed source of disagreement concentrated on the absence-versus-silence axis and a dominance of model over prompt phrasing on multi-class categorization, identified by a reusable methodology for auditing extraction reproducibility on a population-scale deployment.
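
To make the agreement analysis concrete, here is a minimal sketch of Cohen's kappa between two prompt variants on one three-way flag, followed by the post-hoc collapse to binary. The label sequences are invented toy data, and the collapse rule (merging "no" and "not_documented") is an assumption about how the absence-versus-silence distinction would be removed.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


def collapse(label):
    """Assumed binary collapse: 'no' and 'not_documented' both become 'not_yes'."""
    return "yes" if label == "yes" else "not_yes"


# Toy outputs of two prompt variants on the same eight notes (invented data).
prompt_a = ["yes", "no", "not_documented", "no", "yes", "not_documented", "no", "yes"]
prompt_b = ["yes", "not_documented", "no", "no", "yes", "no", "not_documented", "yes"]

kappa_three_way = cohen_kappa(prompt_a, prompt_b)
kappa_binary = cohen_kappa([collapse(x) for x in prompt_a],
                           [collapse(x) for x in prompt_b])

print(round(kappa_three_way, 3))  # 0.238
print(round(kappa_binary, 3))     # 1.0
```

In this toy case every disagreement is between "no" and "not_documented", so kappa rises from about 0.24 to 1.0 after the collapse. That is the pattern the abstract reports: disagreement located on the absence-versus-silence axis rather than on whether the finding is present.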

2606.05966 2026-06-05 cs.DB cs.AI (updated version)

Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs

Tianyi Tang, Zhuoyi Lin, Zeyu Feng, Tianyi Ma, Yew-Soon Ong, Ivor Tsang, Haiyan Yin

Affiliations: CFAR; IHPC; Agency for Science, Technology and Research (A*STAR); Nanyang Technological University

AI summary: Introduces the CausalPhys benchmark (over 3,000 video and image questions with causal graphs), designs a causal-graph-grounded metric for evaluating VLM reasoning, and proposes Causal Rationale-informed Fine-Tuning (CRFT) to improve reasoning accuracy and interpretability.

Comments Accepted by KDD 2026 Dataset and Benchmark Track

Abstract

Understanding and reasoning about the physical world is the foundation of intelligent behavior, yet state-of-the-art vision-language models (VLMs) still fail at causal physical reasoning, often producing plausible but incorrect answers. To address this gap, we introduce CausalPhys, a benchmark of over 3,000 carefully curated video- and image-based questions spanning four domains: Perception, Anticipation, Intervention, and Goal Orientation. Each question is paired with an expert-annotated causal graph capturing object-attribute-event dependencies, enabling interpretable and fine-grained evaluation of causal understanding. Building on this, we formulate a causal-graph-grounded metric that quantitatively measures how well a model's chain-of-thought reasoning aligns with the correct causal relations, moving beyond answer-only accuracy and enabling systematic diagnosis of VLMs' causal reasoning failures. Using this metric, we conduct a comprehensive analysis of leading VLMs, revealing systematic gaps in capturing causal dependencies and underscoring the need for causality-aware learning. To address these limitations, we further propose Causal Rationale-informed Fine-Tuning (CRFT), which explicitly aligns VLM reasoning with causal structures. Extensive experiments demonstrate that CRFT substantially enhances both reasoning accuracy and interpretability across multiple model backbones. By unifying dataset curation, causal evaluation, and causality-informed learning, CausalPhys establishes a strong foundation for advancing modern VLMs toward causally grounded physical reasoning.

2606.05956 2026-06-05 cs.AI (updated version)

Bidirectional Search for Longest Paths: Case for Front-to-Front Heuristics

Tzur Shubi, Ariel Felner, Solomon Eyal Shimony, Shahaf S. Shperberg

Affiliations: Technion - Israel Institute of Technology

AI summary: Proposes BiXDFBnB, which adapts the Single-Frontier Bidirectional Search framework to the Generalized Longest Simple Path problem, using front-to-front heuristics to reduce node expansions and, in some cases, improve runtime.

Abstract

Bidirectional heuristic search can potentially reduce search effort for problems amenable to backward search. Therein, it is well-known that front-to-front heuristics can reduce the number of node expansions, but their overhead is so high that overall runtime almost always increases. We propose BiXDFBnB, a bidirectional depth-first branch-and-bound algorithm that adapts the Single-Frontier Bidirectional Search (SFBDS) framework - originally developed for shortest-path (MIN) problems - to the Generalized Longest Simple Path (GLSP) setting. Because SFBDS inherently operates on paired states, front-to-front (F2F) heuristic evaluation arises naturally and avoids the overhead typically associated with bidirectional frontier management. We show that this adaptation can be successfully applied to maximization (MAX) problems while efficiently handling overlapping constraints. BiXDFBnB is applied to several types of longest-path problems: Longest Simple Path (LSP), Snakes, and Coil-in-the-Box (CIB). Empirical evaluation shows that the new algorithm frequently reduces the number of node expansions and, in some cases, also improves overall runtime.

2606.05952 2026-06-05 cs.RO cs.AI (updated version)

Learning of Robot Safety Policies via Adversarial Synthetic Scenarios

Nikolai Dorofeev, Alexey Odinokov, Rostislav Yavorskiy

Affiliations: National Research Institute of Automation and Applied Mathematics

AI summary: Proposes an adversarial-game framework in which red and blue teams compete to generate hazardous scenarios and iteratively refine safety policies, so that high-risk edge cases are discovered efficiently.

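2606.05950 2026-06-05 cs.AI (updated version)

Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

Yuxiao Ye, Haoran He, Fangyuan Kong, Xintao Wang, Pengfei Wan, Kun Gai, Ling Pan

Affiliations: Hong Kong University of Science and Technology; Kuaishou Technology

AI summary: Proposes the Edit-R2 framework, which reconstructs session intent and applies reinforcement learning that jointly optimizes reasoning and generation, addressing long-context dilution and state contamination in multi-turn image editing, and reports competitive performance against strong baselines on the MICE-Bench benchmark.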

Abstract

Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints, challenged by two coupled failure modes: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, which effectively consolidates scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction generation in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance compared against strong baselines.

2606.05931 2026-06-05 cs.CL cs.AI cs.CV cs.IR cs.LG cs.MM eess.AS (updated version)

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales

Affiliations: University of Cambridge; Queen's University Belfast; University of Surrey; Cisco; Southwest Jiaotong University; Teesside University

AI summary: Proposes a query-adaptive framework that detects active modalities through cross-modal score consistency, reaching 94.2% P@1 on the BBC Rewind corpus and outperforming unimodal and fixed-fusion systems.

Comments INTERSPEECH 2026

Abstract

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
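
The sketch below illustrates one possible cross-modal consistency feature and checks the "64% of the gap" arithmetic. The feature definition (mean score in one modality over the other modality's top-k files), the score dictionaries, and the function names are assumptions, not the paper's actual features. The gap calculation uses the reported numbers and assumes the baseline is fixed fusion; the abstract does not name the baseline explicitly.

```python
def cross_modal_consistency(scores_a, scores_b, k=3):
    """Mean modality-B score of the top-k files ranked by modality A."""
    top_files = sorted(scores_a, key=scores_a.get, reverse=True)[:k]
    return sum(scores_b[f] for f in top_files) / len(top_files)


def gap_recovered(system, baseline, oracle):
    """Share of the baseline-to-oracle gap that the system closes."""
    return (system - baseline) / (oracle - baseline)


# Invented per-file similarity scores for one query.
voice = {"f1": 0.9, "f2": 0.8, "f3": 0.7, "f4": 0.1, "f5": 0.2}
face_present = {"f1": 0.85, "f2": 0.75, "f3": 0.8, "f4": 0.2, "f5": 0.1}
face_absent = {"f1": 0.2, "f2": 0.1, "f3": 0.15, "f4": 0.25, "f5": 0.2}

# High when both modalities are active, low when the face is absent.
print(round(cross_modal_consistency(voice, face_present), 2))  # 0.8
print(round(cross_modal_consistency(voice, face_absent), 2))   # 0.15

# Reported P@1 values: adaptive 94.2, fixed fusion 90.0, oracle 96.6.
print(round(gap_recovered(94.2, 90.0, 96.6), 2))  # 0.64
```

Measured from face-only (93.4%) instead, the same formula gives 0.25, so the 64% figure is consistent with fixed fusion being the reference point.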

2606.05925 2026-06-05 cs.AI (updated version)

Towards World Models in Biomedical Research

Guangyu Wang, Jingkun Yue, Siqi Zhang, Yu Liu, Xiaoyu Wang, Mingyuan Meng, Changwei Ji, Zongbo Han, Yulin Wang, Yang Yue, Frank Fu, Ting Chen, Song Wu, Ziwei Liu, Jiangning Song, Ming Li, Gao Huang, Xiaohong Liu, Athanasios Vasilakos, Xingcai Zhang, Ping Zhang, Yong Li

Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; Department of Engineering Science, University of Oxford, Oxford, United Kingdom; Institute of Medical Artificial Intelligence, South China Hospital, Medical School, Shenzhen University, Shenzhen, Guangdong, China; Zhongguancun Academy & Zhongguancun Institute of Artificial Intelligence, Beijing, China; Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, 100084, Beijing, China; Department of Chemical and Nano Engineering, University of California, San Diego, La Jolla, CA, USA; Nanyang Technological University, Singapore; Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia; David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada; Department of ICT and Center for AI Research, University of Agder (UiA), Jon Lilletuns vei 9, Grimstad, Norway; Department of Electronic Engineering, Tsinghua University, Beijing, China

AI summary: Proposes biomedical world models as a new paradigm for AI-driven discovery: they learn latent representations of molecular, cellular, tissue and clinical states together with intervention-conditioned dynamics to simulate future trajectories, with potential applications in virtual cells, organoids, virtual patients and surgical simulation.

Abstract

A central goal of biomedicine is to understand, predict and ultimately control the dynamic mechanisms by which biological systems respond to perturbations, disease progression and therapeutic intervention. Although foundation models and large language models have accelerated biomedical data interpretation, most current systems remain focused on static pattern recognition rather than prospective simulation of biological futures. Here we propose biomedical world models as a paradigm for AI-driven discovery. These models learn latent representations of molecular, cellular, tissue and clinical states, together with intervention-conditioned dynamics that allow future trajectories to be simulated before actions are taken. We discuss how biomedical world models could function as data engines, environment simulators and scientific planning substrates across applications including virtual cells, organoids, virtual patients and surgical simulation. We outline the data infrastructure, evaluation benchmarks, safety constraints and governance frameworks required. Biomedical world models may provide a foundation for simulation-guided, closed-loop and experimentally actionable biomedical discovery.

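2606.05924 2026-06-05 cs.CL cs.AI (updated version)

Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach

Zhihao Lin, Ziqi Zhu, Hao Huang, Guanghui Wang, Peiyang He

Affiliations: Amazon Web Services (AWS); Peking University

AI summary: Proposes a multi-aspect iterative refinement framework in which specialized LLMs generate high-quality translation references and preference data, then combines supervised fine-tuning with reinforcement learning (GRPO) to improve literary translation, reaching performance competitive with Claude Sonnet 4.5 on the MetaphorTrans English-to-Chinese literary translation benchmark.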

Comments Accepted by ACL 2026 Industry

Abstract

Literary translation poses unique challenges due to the scarcity of high-quality annotated data and the need to balance expression fluency with literary effect. We present a multi-aspect iterative refinement framework that generates high-quality translation references and preference data through specialized LLM translators, each targeting a distinct quality dimension. We leverage the generated data for supervised fine-tuning and reinforcement learning. Experiments show that our generated references outperform the original ground truth for SFT by 8.65 CEA100 points. For reinforcement learning, we find that DPO leads to performance degradation in this setting, while leveraging an explicit reward model for GRPO yields an additional 1.51 point improvement. We attribute this to the stability of two-stage training and GRPO's online exploration capability. Our resulting models, LitMT-8B and LitMT-14B, achieve 67.25 and 69.07 CEA100 respectively on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 at 68.43, and demonstrate strong generalization to out-of-domain literary work (i.e., O. Henry).
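
For readers unfamiliar with GRPO, the core step is scoring each sampled output relative to the other outputs sampled for the same prompt. The sketch below shows that group-relative normalization under stated assumptions: the reward values are invented, and the use of the population standard deviation and the epsilon term are implementation choices that vary between GRPO implementations; the paper's exact setup is not given in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Invented reward-model scores for four candidate translations of one passage.
rewards = [62.0, 70.0, 66.0, 58.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Candidates above the group mean receive positive advantages and those below receive negative ones, so the policy update needs only relative quality within the group and no separate value network.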

2606.05901 2026-06-05 cs.CL cs.AI (updated version)

Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version)

Christopher J. Wedge, Joshua Stutter, Danny Dixon, Jacek Cała

Affiliations: National Innovation Centre for Data

AI summary: Proposes a RAG system supported by a lightweight graph structure that combines vector search and graph query tools, halving the number of hallucinated answers on complex question answering and significantly improving the precision and recall of factual correctness.

Abstract

Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario seeking to both avoid the well known risk of the LLM "hallucinating" information, and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training without resorting to expensive model fine-tuning. In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema, to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks. Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. All this with a modest increase in token usage.

2606.05890 2026-06-05 cs.CL cs.AI (updated version)

Staying with the Uncertainty: Uncertainty-Scaffolding Strategies for Artificial Moral Advisors in LLM-to-LLM Simulated Conversations

Salvatore Greco, Hainiu Xu, Jacopo Domenicucci, Yulan He, Sylvie Delacroix

Affiliations: Centre for Data Futures, The Dickson Poon School of Law, King’s College London; Department of Informatics, King’s College London; LangAI, Center for Language AI Research, Tohoku University; Neukom Institute for Computational Science, Dartmouth College

AI summary: Studies LLMs acting as Artificial Moral Advisors, comparing three uncertainty strategies (Perspective-Multiplying, Tension-Preserving, Process-Reflecting) with three control conditions in simulated dialogues to explore how they can help interlocutors "stay with the uncertainty"; the strategies do not differ in how much stance revision they produce but do differ in the quality of engagement they sustain.

Abstract

LLMs are increasingly deployed as Artificial Moral Advisors (AMA) in a variety of contexts: what kind of conversational patterns should they display? In this paper, we study how AMA can help their interlocutors "stay with the uncertainty". We propose three modes of uncertainty (Perspective-Multiplying, Tension-Preserving, Process-Reflecting) and compare them against three control conditions (Baseline, Persuasive, Sycophantic). A user-agent LLM engages in a dialogue on an ethical dilemma with an AMA following a specific uncertainty strategy, and completes pre- and post-conversation questionnaires. We further examine the effect of two persona prompt formats (Declarative and Narrative). We found that (1) no single model dominates as a simulated user agent, with open models aligning with human ambiguity through between-persona divergence and closed models through within-persona hedging; (2) declarative personas better capture initial stance diversity while narrative personas show more realistic belief revision; (3) all six AMA strategies produce distinguishable conversational patterns; and (4) uncertainty strategies differ not in how much stance revision they produce, but in the quality of engagement they sustain.

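2606.05888 2026-06-05 cs.AI (updated version)

Retry Policy Gradients in Continuous Action Spaces

Soichiro Nishimori, Paavo Parmas

Affiliations: The University of Tokyo, Japan

AI summary: Introduces pathwise derivative estimators for retry objectives such as pass@K and max@K, extends ReMax to continuous action spaces, where it promotes stochastic exploration by reshaping the policy-gradient landscape, and presents the ReMAC algorithm, which performs comparably to SAC.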

详情

AI中文摘要

基于重试的目标（如pass@K和max@K）优化从多个采样轨迹中获得的最佳回报，最近的研究表明，它们可以在没有显式探索奖励的情况下促进探索。在离散动作空间中，ReMax被证明可以通过适应回报不确定性来实现这一点。在这项工作中，我们引入了重试目标的路径导数估计器，并用它们将ReMax扩展到连续动作空间。我们研究了由此产生的学习动态，并表明，即使使用确定性奖励，ReMax也可以通过重塑策略梯度景观来鼓励随机探索。特别地，它既改变了梯度的方向，使更新偏向于更高的策略熵，也改变了梯度的大小，抑制梯度并减缓收敛。我们进一步表明，Adam的自适应归一化可以缓解这种抑制，具体取决于其数值稳定化参数。在实验上，我们将该目标实例化为ReMax Actor-Critic（ReMAC），这是一种使用路径导数估计器优化ReMax目标的离策略actor-critic算法。我们的实验表明，ReMAC可以在没有熵正则化的情况下促进更高的策略熵，并实现与SAC相当的性能。

英文摘要

Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete action spaces, ReMax was shown to do so by adapting to return uncertainty. In this work, we introduce pathwise derivative estimators for retry objectives and use them to extend ReMax to continuous action spaces. We study the resulting learning dynamics and show that, even with deterministic rewards, ReMax can encourage stochastic exploration by reshaping the policy-gradient landscape. In particular, it alters gradients both in direction, biasing updates toward higher policy entropy, and in magnitude, damping gradients and slowing convergence. We further show that Adam's adaptive normalization can mitigate this damping, depending on its numerical stabilization parameter. Empirically, we instantiate this objective as ReMax Actor-Critic (ReMAC), an off-policy actor-critic algorithm that optimizes the ReMax objective using a pathwise derivative estimator. Our experiments show that ReMAC can promote higher policy entropy without entropy regularization and achieves performance comparable to SAC.
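To make the max@K objective mentioned above concrete, the following is a minimal sketch of how its value could be estimated from a batch of sampled returns. It is not the paper's pathwise derivative estimator; it only illustrates the quantity being optimized (the expected best return among K samples). The function name `max_at_k` and the example returns are assumptions made for illustration.

```python
from math import comb


def max_at_k(returns, k):
    """Estimate E[max of k i.i.d. returns] from n >= k sampled returns.

    Averages the maximum over all size-k subsets of the samples (a standard
    U-statistic), computed in closed form from the sorted values.
    """
    n = len(returns)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(returns)")
    ordered = sorted(returns)
    total = comb(n, k)  # number of size-k subsets
    # A value with i smaller samples below it is the maximum of a size-k
    # subset exactly when the other k-1 members are drawn from those i.
    return sum(comb(i, k - 1) * x for i, x in enumerate(ordered)) / total


if __name__ == "__main__":
    sampled_returns = [1.0, 2.0, 3.0]  # illustrative values only
    print(max_at_k(sampled_returns, 1))  # k=1 reduces to the plain mean
    print(max_at_k(sampled_returns, 2))  # average of the pairwise maxima
    print(max_at_k(sampled_returns, 3))  # k=n reduces to the sample maximum
```

With K = 1 the estimate is the ordinary mean return, and larger K puts progressively more weight on the upper tail of the return distribution, which is one intuition for why retry objectives can favour exploratory policies.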

2606.05875 2026-06-05 cs.AI cs.DB 版本更新

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

QCFuse: 通过压缩视图的查询感知缓存融合实现高效RAG服务

Jianxin Yan, Wangze Ni, Zhenxin Li, Jiabao Jin, Zhitao Shen, Haoyang Li, Jia Zhu, Peng Cheng, Xuemin Lin, Lei Chen, Kui Ren

发表机构 * Zhejiang University（浙江大学）； East China Normal University（华东师范大学）； Ant Group（蚂蚁集团）； The Hong Kong Polytechnic University（香港理工大学）； Zhejiang Normal University（浙江师范大学）； Tongji University（同济大学）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； The Hong Kong University of Science and Technology（香港科技大学）

AI总结提出QCFuse，一种基于压缩视图的查询感知选择器，通过块锚查询探测和关键层分析实现高效RAG缓存融合，在保持全预填充质量的同时平均加速1.7倍。

AI中文摘要

检索增强生成（RAG）通过将生成过程基于外部证据来提高大语言模型（LLM）的答案质量，但处理检索到的上下文使得预填充阶段成为主要的服务成本。RAG缓存融合通过重用检索块的预计算键值（KV）缓存，并选择性地在当前提示下重新计算令牌来降低这一成本。然而，现有的选择器在质量和效率之间面临两难：快速的查询无关或最终层查询到上下文选择器可能遗漏与请求相关的证据，而全视图查询感知选择器在重新计算之前需要广泛的上下文和层可见性，因此会阻塞逐层缓存融合流水线。我们提出QCFuse，一种用于RAG缓存融合的压缩视图查询感知选择器。QCFuse使用块锚查询探测将用户查询状态条件化到紧凑的每块锚点上，并通过关键层分析识别重新计算令牌而无需检查所有层。我们在SGLang中实现QCFuse，并在六个数据集上对四个开放权重LLM进行评估。QCFuse达到了全预填充级别的质量。在匹配质量下，QCFuse相比全预填充实现了平均1.7倍的预填充加速，相比最强的保质量基线ProphetKV实现了1.5倍加速。

英文摘要

Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.

2606.05873 2026-06-05 cs.RO cs.AI cs.CV cs.LG 版本更新

LadderMan: Learning Humanoid Perceptive Ladder Climbing

LadderMan: 学习人形机器人感知爬梯

Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Yue Wang, C. Karen Liu, Guanya Shi

发表机构 * Amazon FAR（亚马逊FAR）； USC（南加州大学）； UC Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）； CMU（卡内基梅隆大学）

AI总结提出LadderMan系统，通过两阶段学习管道和视觉基础模型，使人形机器人能够鲁棒地攀爬多种梯子并在梯子上进行操控。

AI中文摘要

人形机器人在以人为中心的环境中具有巨大潜力，但由于稀疏的立足点和手抓点、复杂的全身协调以及对感知和控制误差的敏感性，爬梯仍然是最具挑战性的任务之一。我们提出了LadderMan，一个统一的系统，使人形机器人能够鲁棒地攀爬多种梯子并在这种受限条件下进行操控。我们的攀爬策略基于一个可扩展的两阶段学习管道，其中我们使用混合运动跟踪从单个参考运动学习多个攀爬专家，并通过混合模仿和强化学习将这些专家蒸馏成一个统一的基于深度的视觉运动攀爬策略。为了实现真实世界部署，我们利用视觉基础模型来弥合深度感知中的模拟到现实差距。基于学习到的攀爬策略，我们进一步使用双智能体公式训练一个独立的操控策略，允许通过遥操作在梯子上进行稳定操控。实验表明，LadderMan在多种几何形状的梯子上实现了鲁棒的攀爬，以零样本方式成功迁移到真实世界硬件，并在具有挑战性的梯子约束下支持各种操控任务。视频结果见https://ladderman-robot.github.io。

英文摘要

Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at https://ladderman-robot.github.io.

2606.05871 2026-06-05 cs.IT cs.AI math.IT stat.ME 版本更新

Compositional Boundaries for Density Fusion

密度融合的组合边界

Ratan Bahadur Thapa, Ali Darijani, Jürgen Beyerer, Steffen Staab

发表机构 * University of Stuttgart Department of Computer Science, Germany（斯图加特大学计算机科学系，德国）； KIT Department of Computer Science, Germany（卡尔斯鲁厄理工学院计算机科学系，德国）； Fraunhofer IOSB of Fraunhofer-Gesellschaft, Germany（弗劳恩霍夫研究所IOSB分部，德国）； University of Southampton Department of Computer Science, United Kingdom（南安普顿大学计算机科学系，英国）

AI总结研究分布式不确定性管理系统中加权概率密度的层次融合顺序不变性，证明在连续二元规则下，顺序不变的层次融合等价于归一化加权线性池化，并揭示了端点-候选f-散度平衡的局部几何障碍。

AI中文摘要

分布式不确定性管理系统通常沿着由通信、隐私或调度约束选择的聚合树组合局部概率模型。最终密度应取决于加权源，而不是中间节点组合它们的特定顺序。我们将这一要求研究为加权概率密度的二元融合的代数组合性问题。核心问题是局部融合规则何时可以层次化执行同时保持顺序不变。我们为局部段值融合规则建立了一个组合边界。在具有加性输出权重和仅权重系数的连续二元规则类中，顺序不变的层次执行刻画了归一化加权线性池化；范数诱导的段平衡实现了相应的系数。平滑端点-候选$f$-散度平衡具有不同的局部几何：其二次展开引入了平方根有效权重，表明仅凭成对可解性不足以实现调度无关的融合。我们证明这一障碍是端点-候选二元平衡所特有的，而全局散度重心保留了加性权重的局部极限。最后，高斯混合展示了相同问题如何在有限模型类中出现：精确融合是组合的，而逐步压缩仅在未归一化分量测度的同余条件下才是组合的。这些结果区分了精确的调度无关融合与全局聚合目标及局部近似启发式。

英文摘要

Distributed uncertainty-management systems often combine local probabilistic models along aggregation trees chosen by communication, privacy, or scheduling constraints. The final density should depend on the weighted sources, not on the particular order in which intermediate nodes combine them. We study this requirement as an algebraic compositionality problem for binary fusion of weighted probability densities. The central question is when a local fusion rule can be executed hierarchically while remaining order-invariant. We establish a compositional boundary for local segment-valued fusion rules. Within the class of continuous binary rules with additive output weights and weight-only coefficients, order-invariant hierarchical execution characterizes normalized weighted linear pooling; norm-induced segment balancing realizes the corresponding coefficient. Smooth endpoint-to-candidate $f$-divergence balancing has a different local geometry: its quadratic expansion induces square-root effective weights, showing why pairwise solvability alone is insufficient for schedule-independent fusion. We show that this obstruction is local to endpoint-to-candidate binary balancing, whereas global divergence barycenters retain additive-weight local limits. Finally, Gaussian mixtures show how the same issue appears in finite model classes: exact fusion is compositional, whereas stepwise compression is compositional only under a congruence condition on unnormalized component measures. These results distinguish exact schedule-independent fusion from global aggregation objectives and local approximation heuristics.
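The order-invariance property described above can be checked numerically for normalized weighted linear pooling. The sketch below is illustrative only: it assumes densities are discretized as probability vectors on a shared grid, and the helper name `fuse` and the three example sources are hypothetical, not taken from the paper.

```python
def fuse(a, b):
    """Binary normalized weighted linear pooling of (weight, density) pairs.

    The output weight is additive, and the mixing coefficients depend only
    on the two input weights.
    """
    (wa, pa), (wb, pb) = a, b
    w = wa + wb
    density = [(wa * x + wb * y) / w for x, y in zip(pa, pb)]
    return w, density


# Three weighted sources on a shared 3-point grid (illustrative values).
s1 = (1.0, [0.2, 0.5, 0.3])
s2 = (2.0, [0.6, 0.3, 0.1])
s3 = (3.0, [0.1, 0.1, 0.8])

# Two different aggregation trees over the same sources.
left = fuse(fuse(s1, s2), s3)
right = fuse(s1, fuse(s2, s3))

if __name__ == "__main__":
    print(left)
    print(right)  # matches `left` up to floating-point rounding
```

Both trees reduce to the same flat weighted average of the three sources, which is why the fused result depends only on the weighted sources and not on the schedule. The abstract's point is that, within the stated class of binary rules, this is the only rule with that property.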

2606.05863 2026-06-05 cs.LG cs.AI 版本更新

GenTI: 针对未知攻击的自主IDPS规则生成的LLM基准测试

Hassan Jalil Hadi, Rehana Yasmin, Ali Shoker

发表机构 * Cyber Security and Resilience Technology (CyberSaR), King Abdullah University of Science and Technology (KAUST)（网络安全与韧性技术（CyberSaR），阿卜杜拉国王科技大学（KAUST））

AI总结提出GenTI框架，通过构建包含15万条检测与防御规则的数据集GTI，并设计基于LLM的流水线（含结构化提示工程、思维链推理和验证循环），实现针对未知攻击的IDPS规则自动生成，将未知攻击检测率从45%提升至87.4%，误报率从8.5%降至2.3%。

AI中文摘要

基于规则的入侵检测与防御系统（IDPS）能够提供精确的攻击检测和缓解，但其手动制作的、基于签名的规则限制了针对新兴和零日威胁的适应性。此外，现有的公共数据集（如CICIDS2017、UNSW-NB15）侧重于流量分类，提供的结构化信息很少，无法支持自动规则合成或防御逻辑。为填补这一空白，我们提出了生成式威胁情报（GenTI）——一个用于自动生成针对未知攻击的IDPS规则的LLM驱动基准。该数据集（GTI）汇集了来自Snort、Suricata、Emerging Threats的超过15万条检测和防御规则，以及5万条YARA规则，每条规则都标注了协议行为、负载签名、上下文关系、与网络威胁情报（CTI）的映射，以及可操作的响应类型（告警、丢弃、拒绝）。此外，在此语料库之上，我们设计了一个基于LLM的流水线，通过结构化提示工程、思维链（CoT）推理以及用于语法、语义和安全验证的验证链（CoVe）循环，将分析师提示和代表性负载转换为可部署的规则。生成的规则在（Snort/Suricata）上实时执行，并通过语法准确性、语义相似性、CTI覆盖率、安全有效性以及未知攻击检测进行评估。此外，我们的GenTI实例实现了89.4%的综合规则质量分数，CTI覆盖率达94.8%，将未知攻击检测率从45%提高到87.4%，并将误报率从8.5%降低到2.3%。总体而言，GenTI建立了第一个将规则级CTI与基于LLM的自动化紧密结合的大规模基准，实现了自适应、自演进的IDPS。

英文摘要

Rule-based Intrusion Detection and Prevention Systems (IDPS) offer precise attack detection as well as mitigation; however, their manually crafted, signature-driven rules limit adaptability to emerging and zero-day threats. Additionally, existing public datasets (e.g., CICIDS2017, UNSW-NB15) focus on traffic classification and provide little structured information to support automatic rule synthesis or prevention logic. To address this gap, we propose Generative Threat Intelligence (GenTI), an LLM-driven benchmark for automatic generation of IDPS rules targeting unseen attacks (GenTI refers to the proposed framework, and GTI refers to the dataset). The dataset (GTI) aggregates over 150k detection and prevention rules from Snort, Suricata, and Emerging Threats, as well as 50k YARA rules, each annotated with protocol behavior, payload signatures, contextual relationships, mappings to Cyber Threat Intelligence (CTI), and actionable response types (alert, drop, reject). Moreover, on top of this corpus we design an LLM-based pipeline that transforms analyst prompts and representative payloads into deployable rules via structured prompt engineering, Chain-of-Thought (CoT) reasoning, and a Chain-of-Verification (CoVe) loop for syntactic, semantic, and security validation. The generated rules are executed in real time on Snort/Suricata and evaluated by syntax accuracy, semantic similarity, CTI coverage, security effectiveness, and unseen-attack detection. Furthermore, our GenTI instantiation achieves a composite rule-quality score of 89.4%, with 94.8% CTI coverage, improving unseen-attack detection from 45% to 87.4% and reducing the false-positive rate from 8.5% to 2.3%. Overall, GenTI establishes the first large-scale benchmark that tightly couples rule-level CTI with LLM-based automation, enabling adaptive, self-evolving IDPS.

2606.05843 2026-06-05 cs.CL cs.AI 版本更新

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

多模态大语言模型中通过CoRe头的功能稀疏性机制洞察

Ruoxi Sun, Quantong Qiu, Juntao Li, Zecheng Tang, Yihang Lou, Min Zhang

发表机构 * Soochow University（苏州大学）； Peking University（北京大学）

AI总结通过识别和分析CoRe头，揭示多模态大语言模型在跨模态检索中功能稀疏的结构特性，并验证其必要性及加速推理的潜力。

AI中文摘要

虽然多模态大语言模型（MLLMs）在复杂的视觉-语言任务上表现出卓越的能力，但它们从复杂、嘈杂的上下文中提取与查询相关的视觉特征的机制仍然不透明。在本文中，我们进行了一项深入的可解释性研究，揭示了MLLMs中一个深刻的结构属性：跨模态检索中的功能稀疏性。利用一种称为检索注意力质量（RAM）的令牌级指标，我们识别并描述了一组高度专业化的注意力头，称为上下文感知检索（CoRe）头。在不同的视觉领域和模型规模中，我们观察到明确的功能划分：CoRe头充当专用的信息提取器，而大多数其他头则将注意力分布在更广泛的上下文区域。因果干预进一步证明了这些专业化头的必要性。仅消融前5%的CoRe头就会导致多模态推理性能显著下降，而消融排名较低的头则影响甚微。此外，加速实验验证了CoRe头的实用性，表明利用这种局部稀疏性可以显著加速推理，同时保持稳健的任务性能。我们的发现揭示了MLLMs中功能稀疏性的结构原理，完善了当前对机制可解释性的理解，并为未来的架构设计和模型优化奠定了理论基础。

英文摘要

While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a highly specialized subset of attention heads, referred to as Context-aware Retrieval (CoRe) heads. Across diverse visual domains and model scales, we observe a clear functional division: CoRe heads act as dedicated information extractors, while most other heads distribute attention over broader contextual regions. Causal interventions further demonstrate the necessity of these specialized heads. Ablating only the top 5% of CoRe heads causes significant degradation in multimodal reasoning performance, whereas ablating lower-ranked heads has minimal effect. Moreover, acceleration experiments validate the utility of CoRe heads, showing that leveraging this localized sparsity significantly accelerates inference while maintaining robust task performance. Our findings reveal a structural principle of functional sparsity within MLLMs, refining the current understanding of mechanistic interpretability and laying a theoretical foundation that can inspire future architecture design and model optimization.

2606.05833 2026-06-05 cs.CV cs.AI 版本更新

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

从视频中学习几何表示以实现空间智能多模态大语言模型

Haibo Wang, Lifu Huang

发表机构 * University of California, Davis（加州大学戴维斯分校）

AI总结提出GeoVR框架，通过从2D视频序列中蒸馏3D几何知识（包括相机姿态、深度图、尺度因子和多尺度3D特征），重塑多模态大语言模型的内部表示以赋予其空间智能，在空间推理基准上达到最先进性能。

AI中文摘要

多模态大语言模型（MLLMs）在2D语义理解方面表现出色，但缺乏内在的3D感知能力，导致其表示无法在视频帧间保持几何和空间一致性。鉴于大规模3D数据的稀缺性，我们提出了GeoVR，一种新颖的框架，仅使用2D视频序列学习几何表示。该方法有效地重构了MLLMs内部的语义潜在空间，以解锁空间智能。GeoVR并非采用浅层的特征混合，而是通过从预训练的3D基础模型中蒸馏几何知识来重塑MLLM的内部表示。这是通过一种多目标学习策略实现的，该策略由四个互补的几何目标驱动：（1）估计帧间相机姿态以嵌入变化的视角动态，（2）回归密集深度图以锚定物理距离，（3）预测度量尺度因子以进行真实世界校准，以及（4）蒸馏多尺度3D特征以对齐中间特征空间。在这些显式的物理和几何约束的引导下，模型的内部表示自然地发展出强大的3D感知能力。在空间推理基准上的大量实验表明，GeoVR实现了最先进的性能，为赋予基础模型空间智能建立了一种新范式。

英文摘要

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

2606.05828 2026-06-05 cs.AI cs.CL 版本更新

Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents

隐式偏好的统计先验：在个人代理中解耦技能选择作为局部调控机制

Zeyu Gan, Huayi Tang, Yong Liu

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China（中国人民大学人工智能学院）

AI总结针对本地部署的个人代理中隐式用户偏好学习问题，提出一种解耦统计偏好学习与语义意图解析的轻量级架构，通过局部统计结果影响远程LLM的选择决策，显著降低累积遗憾并提高测试准确率。

2606.05818 2026-06-05 math.HO cs.AI math.AG math.CO math.RT 版本更新

Benchmarks in Leipzig

莱比锡基准测试

Andrei Balakin, Miklós Bóna, Marie-Charlotte Brandenburg, Clara Briand, Veronica Calvo Cortes, Shelby Cox, Jesus A. De Loera, Danai Deligeorgaki, Hannah Friedman, Tim Gehrunger, Chiara Giardino, Stephen Griffeth, Baran Hashemi, Elena Hoster, Alexander Ivanov, Nupur Jain, Aryaman Jal, Leonie Kayser, Joris Koefler, Kevin Kühn, Mario Kummer, Felix Lotter, René Marczinzik, Victor S. Miller, Alejandro Morales, Greta Panova, Gianni Petrella, Nathan Pflueger, Lakshmi Ramesh, Nikolas Rieke, Carlos Rodriguez, Andrea Rosana, Flavio Salizzoni, Otto T. P. Schmidt, Sven Ulf Schmitz, Lina Maria Simbaqueba Marin, Luca Sodomaco, Christian Stump, Bernd Sturmfels, Alexander Taveira Blomenhofer, Simon Telen, Philipp Tuchel, Emil Verkama, Carl Felix Waller, Julian Weigert, Annette Werner, Nathan Williams, Claudius Zibrowius

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结 49位数学家于2026年4月至5月编制了100个研究级数学问题数据集，通过多阶段评估大型语言模型的数学推理能力，最终仅剩2个问题未解决。

Comments 8 pages including 8 benchmark statistics tables + 20 pages appendix containing the 100 Leipzig Benchmark questions

2606.05817 2026-06-05 cs.LG cs.AI 版本更新

Consistency Training Along the Transformer Stack

沿Transformer堆栈的一致性训练

Sukrati Gautam, Neil Shah, Arav Dhoot, Bryan Maruyama, Caroline Wei, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa

发表机构 * Purdue University（普渡大学）； Independent（独立）； Columbia University（哥伦比亚大学）； University of California, San Diego（加州大学圣地亚哥分校）； University of California, Los Angeles（加州大学洛杉矶分校）； Dartmouth College（达特茅斯学院）； University of Michigan, Ann Arbor（密歇根大学安娜堡分校）

AI总结本文通过引入MLP状态和注意力分布的一致性目标，将一致性训练扩展到多种安全威胁，并发现跨威胁泛化及共享机制，证明其作为灵活对齐框架的有效性。

Comments Submitted to EMNLP 2026

AI中文摘要

一致性训练鼓励模型在不同上下文中表现相似，并已显示出减少未对齐问题的潜力。我们以两种方式扩展一致性训练的范围。首先，我们引入两个新的内部一致性目标：MLP一致性训练（MLPCT），匹配激活后的MLP状态；以及注意力一致性训练（AttCT），匹配每个头的注意力分布。其次，我们将一致性训练应用于四种额外的安全威胁：角色上下文学习攻击、对抗性挫败、预填充攻击和条件性未对齐。在多个模型和威胁设置中，我们发现一致性训练减少未对齐问题的效果远不限于先前工作所研究的谄媚和越狱设置。我们还发现了跨威胁泛化的案例，即针对一种失败模式的训练提高了对另一种模式的鲁棒性，并识别了ACT、MLPCT和AttCT共享的残差流机制，同时将BCT区分为机制上不同的方法。我们的结果表明，一致性训练是一个灵活且可扩展的对齐框架，能够统一防御更广泛的模型病理类别。

英文摘要

Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal consistency targets: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. Second, we apply consistency training to four additional safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. Across several models and threat settings, we find that consistency training reduces misalignment well beyond the sycophancy and jailbreak settings studied in prior work. We also find cases of cross-threat generalization, where training against one failure mode improves robustness to another, and identify a shared residual-stream mechanism underlying ACT, MLPCT, and AttCT, while distinguishing BCT as mechanistically distinct. Our results suggest that consistency training is a flexible and extensible framework for alignment, capable of unifying defenses against a broader class of model pathologies.

2606.05806 2026-06-05 cs.AI 版本更新

TAPO: 通过信用转移实现工具感知策略优化用于多模态搜索代理

Chengqi Dong, Chuhuai Yue, Hang He, yandong liu, Fenghe Tang, S Kevin Zhou, Xiaohan Wang, Jiajun Chai, Guojun Yin

发表机构 * University of Science and Technology of China（中国科学技术大学）； Meituan（美团）

AI总结针对GRPO在多模态搜索代理中信用误分配问题，提出TAPO方法，利用工具参数确定性构建反事实证人进行保守优势校正，无需额外标注或采样，在多个基准上持续提升性能。

AI中文摘要

我们识别并正式刻画了信用误分配作为GRPO在工具增强多模态搜索代理中的系统性失效模式：其对轨迹级优势的统一广播导致失败轨迹中有价值的工具使用步骤与无价值的步骤受到相同的惩罚。我们进一步通过实验量化了该现象的规模。超过一半的失败轨迹和失败的工具使用动作表现出可纠正的信用误分配，表明浪费的训练信号既显著又在结构上可被利用。基于这一见解，我们提出了工具感知策略优化（TAPO），它利用了信息获取工具的参数确定性特性：相似的调用参数定义等价的信息获取动作，因此应共享可比较的动作信用。TAPO在当前训练批次内构建反事实证人，并通过置信门控保守优势校正补偿误分配的负信用。它不需要额外的标注、模型或采样，并且引入可忽略的计算开销。在多个多模态搜索基准上，TAPO在三种主流RL算法（GRPO、GSPO和SAPO）上相对于强基线提供了一致的、即插即用的改进。我们的代码和模型将在接收后公开发布。

英文摘要

We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use steps in failing trajectories to be penalized no differently from valueless ones. We further empirically quantify the scale of this phenomenon. Over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment, demonstrating that the wasted training signal is both substantial and structurally exploitable. Building on this insight, we propose Tool-Aware Policy Optimization (TAPO), which exploits the parameter-determinism property of information-acquisition tools: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit. TAPO constructs counterfactual witnesses within the current training batch and compensates misassigned negative credit via confidence-gated conservative advantage correction. It requires no additional annotation, models, or sampling, and introduces negligible computational overhead. Across multiple multimodal search benchmarks, TAPO delivers consistent, plug-and-play improvements over strong baselines for three mainstream RL algorithms (GRPO, GSPO, and SAPO). Our code and models will be publicly released upon acceptance.
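The uniform broadcast that the abstract identifies as the source of credit misassignment can be sketched as follows. This shows only the commonly described GRPO baseline (group-normalized trajectory reward copied to every token), not TAPO's correction. The function name `grpo_token_advantages`, the use of the population standard deviation, and the example rewards are assumptions made for illustration, and implementations differ in such details.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, token_counts, eps=1e-8):
    """Group-normalize trajectory rewards and copy each result to all tokens.

    rewards[i] is the scalar reward of trajectory i in one sampled group;
    token_counts[i] is the number of generated tokens in that trajectory.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    traj_adv = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a trajectory receives the same advantage, so a useful
    # tool call inside a failing trajectory is penalized like any other token.
    return [[a] * n for a, n in zip(traj_adv, token_counts)]


if __name__ == "__main__":
    group_rewards = [1.0, 0.0, 0.0, 1.0]   # two successes, two failures
    group_lengths = [3, 2, 4, 1]           # tokens per trajectory
    for adv in grpo_token_advantages(group_rewards, group_lengths):
        print(adv)
```

In this sketch the failing trajectories get one shared negative value for all of their tokens, which is the behaviour the paper argues wastes training signal when some of those steps were valuable.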

2606.05779 2026-06-05 cs.CR cs.AI stat.ML 版本更新

TinyML-Driven Cybersecurity for Autonomous Spacecraft: Latency-Accuracy Analysis for SPARTA RF and Cyber Threat Detection

TinyML驱动的自主航天器网络安全：SPARTA射频与网络威胁检测的延迟-精度分析

Van Le, Trevor Tran, Tan Le

发表机构 * Virginia Tech（弗吉尼亚理工大学）； Hampton University（汉普顿大学）

AI总结针对自主航天器，基于SPARTA攻击模型分析TinyML兼容经典模型（随机森林、逻辑回归、SVM、MLP）在检测多种网络射频威胁时的延迟-精度权衡，发现逻辑回归在微秒级推理下仅比随机森林精度低1%，适合作为机载自主基线。

Comments Twenty Fifth International Conference on Security & Management (SAM'26)

AI中文摘要

自主航天器需要对网络射频威胁进行快速、轻量且可靠的在轨检测。利用SPARTA攻击模型，我们分析了TinyML兼容的经典模型（随机森林、逻辑回归、支持向量机和多层感知机）在检测上行链路干扰、Fake-NR欺骗、有效载荷操纵、地面段入侵和未授权命令注入时的延迟-精度权衡。我们对每个模型的计算复杂度、VC维、Lipschitz连续性和延迟缩放进行了基于物理的理论分析，并通过在BandErasure、FakeNR和NoiseBurst损坏模式生成的对抗性射频频谱图上的经验测量加以支持。结果表明，逻辑回归实现了微秒级推理，且相对于随机森林精度仅下降1%，使其成为机载自主的有效TinyML基线。该研究还指出了通过更丰富的特征编码器和多时间尺度学习架构来推进航天器网络安全的机会，这建立在边缘智能和可信AI的最新进展之上。

英文摘要

Autonomous spacecraft require rapid, lightweight, and reliable onboard detection of cyber-RF threats. Using the SPARTA attack model, we analyze the latency-accuracy trade-offs of TinyML-compatible classical models -- Random Forest, Logistic Regression, SVM, and MLP -- for detecting uplink jamming, Fake-NR spoofing, payload manipulation, ground-segment compromise, and unauthorized command injection. We present a physics-informed theoretical analysis of each model's computational complexity, VC dimension, Lipschitz continuity, and latency scaling, supported by empirical measurements on adversarial RF spectrograms generated via BandErasure, FakeNR, and NoiseBurst corruption modes. Results show that Logistic Regression achieves microsecond-level inference with only a 1% accuracy drop relative to Random Forest, making it an effective TinyML baseline for onboard autonomy. The study also identifies opportunities for advancing spacecraft cybersecurity through richer feature encoders and multi-timescale learning architectures, building on recent progress in edge intelligence and trustworthy AI.

2606.05776 2026-06-05 cs.CR cs.AI cs.LG 版本更新

An Improved CNN-LSTM Based Intrusion Detection System for IoT Networks

基于改进的CNN-LSTM的物联网网络入侵检测系统

Mohammad Tariq Ikhlas, Pohanyar Khowaja Khil, Malik Muhammad Mueed Aslam, Muhammad Khuram Shahzad

发表机构 * University of Engineering and Technology, Lahore（拉合尔工程与技术大学）

AI总结提出一种结合多类分类、数据集集成和时间特征学习的改进CNN-LSTM入侵检测模型，在物联网网络上达到约97%的准确率。

Comments 8 pages, 8 figures

2606.05770 2026-06-05 cs.SE cs.AI 版本更新

Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering

人类监督与过载：AI辅助软件工程中两种隐藏且昂贵的负担

Vahid Garousi

发表机构 * Queen’s University Belfast（贝尔法斯特女王大学）； Azerbaijan Technical University（阿塞拜疆技术大学）

AI总结本文通过分析从业者观点，揭示了AI辅助软件工程中人类持续监督AI生成产物和认知过载两种隐藏负担，并探讨了团队应对策略。

2606.05758 2026-06-05 cs.CV cs.AI cs.LG 版本更新

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

DRIFT：一种用于视觉-语言模型中连续输出解码的残差流适配器

Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi, Yin Li

发表机构 * University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； West Lafayette Jr./Sr. High School（韦斯特拉法叶高中）

AI总结提出DRIFT框架，通过结合基础预测器和基于流匹配的生成式精化模块，将预训练视觉-语言模型适配到连续解码任务，在视觉定位和机器人控制等任务上优于回归和生成方法。

AI中文摘要

许多现代视觉-语言模型（VLM）基于离散标记的自回归解码。虽然基于文本的输出接口支持可扩展的预训练和跨多种任务的强零样本泛化，但它们不适用于需要精确连续输出的问题，例如定位事件的时间边界或生成机器人控制动作。为了解决这一挑战，我们提出了DRIFT，一个用于将预训练VLM适配到连续解码任务的通用框架。DRIFT结合了一个基础预测器（提供目标输出的粗略估计）和一个基于流匹配的生成式精化模块（迭代改进预测）。这种残差公式将生成建模问题从学习全局输出分布转变为在强先验周围建模局部残差分布，大大简化了优化。我们在感知和规划任务上评估了DRIFT，包括视觉定位和机器人控制。在跨越MLLM、VLA和WAM的多个任务和架构中，DRIFT始终优于一组强大的基于回归和生成的方法。

英文摘要

Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.

2606.05756 2026-06-05 cs.LG cs.AI cs.IT math.IT 版本更新

Beyond Soft Masks: Hard-Perturbation Mixup Explainer for Robust GNN Explainability

超越软掩码：用于鲁棒GNN可解释性的硬扰动混合解释器

Jialiang Yin, Zheng Zhao, Linsey Pang, Bo Dong, Bin Shi, Jiaxing Zhang

发表机构 * Xi’an Jiaotong University（西安交通大学）； PayPal； Bellevue, USA（美国贝尔维尤）

AI总结提出基于广义图信息瓶颈的硬扰动混合解释框架HPME，通过图池化提取离散解释子图并采用结构级替换的混合策略，解决软掩码方法中标签无关信息泄漏和分布偏移问题，提升解释保真度。

AI中文摘要

图神经网络（GNN）在涉及图结构数据的各种应用中表现出卓越性能，尤其是在高风险领域。然而，其决策过程的不透明性限制了可信度和更广泛的采用。现有的事后解释方法通过识别影响GNN预测的子图来提高可解释性，并采用混合策略来缓解使用子图进行预测时引起的分布外（OOD）问题。然而，这些方法通常依赖软掩码，其本质上无法完全消除标签无关信息，允许冗余结构泄漏到混合过程中，阻碍OOD问题的解决，从而降低解释保真度。在本文中，我们提出HPME，一个基于广义图信息瓶颈的硬扰动混合解释框架，利用图池化提取离散解释子图，并产生信息容量界限以彻底压缩标签无关组件。此外，我们引入了一种基于结构级替换的新型混合策略，生成分布内解释以有效缓解分布偏移。在多种任务上的大量实验表明，HPME在合成和真实数据集上生成鲁棒且可解释的解释方面达到了最先进的性能。

英文摘要

Graph Neural Networks (GNNs) have demonstrated remarkable performance across a range of applications involving graph-structured data, particularly in high-stakes domains. However, the opaque nature of their decision-making processes limits their trustworthiness and broader adoption. Existing post-hoc explanation methods aim to improve explainability by identifying subgraphs that influence GNN predictions and adopt mixup strategies to alleviate the out-of-distribution (OOD) issue caused by using subgraphs for prediction. Yet, these approaches typically rely on soft masks, which are inherently unable to fully eliminate label-irrelevant information, allowing redundant structures to leak into the mixup process and hindering the resolution of the OOD problem, thereby degrading explanation fidelity. In this work, we propose HPME, a Hard-Perturbation Mixup Explanation framework grounded in a generalized Graph Information Bottleneck, which leverages graph pooling to extract discrete explanatory subgraphs and to yield an information-capacity bound to thoroughly compress label-irrelevant components. Furthermore, we introduce a novel mixup strategy built upon structure-level replacement, generating in-distribution explanations to effectively mitigate the distribution shift. Extensive experiments on diverse tasks demonstrate that HPME achieves state-of-the-art performance in generating robust and interpretable explanations across both synthetic and real-world datasets.

2606.05754 2026-06-05 cs.SD cs.AI eess.AS 版本更新

Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework

Sagnac辅助增强型OTDR分布式声学传感：标准化基准与工程评估框架

Weiguang Wang, Fugen Wu, Hailing Wang, Xuechen Liang, Xiaobin Li, Ru Han, Tianchang Xie

发表机构 * East China Jiaotong University（华东交通大学）； School of Materials and Energy, Guangdong University of Technology（广东工业大学材料与能源学院）； Jiangxi Tonghui Technology Group Co., Ltd.（江西 Tonghui 技术集团有限公司）； School of Artificial Intelligence and Big Data, Guangzhou Vocational University of Science and Technology（广州科技职业技术大学人工智能与大数据学院）

AI总结提出一种Sagnac辅助增强型ϕ-OTDR传感架构和标准化基准框架，通过双分支融合模型在10公里光纤上实现89.79%准确率和5.00%虚警率，解决了偏振衰落和干扰问题。

AI中文摘要

相位敏感光时域反射计（ϕ-OTDR）因其在大距离上提供分布式时空监测能力，被广泛应用于大规模分布式声学传感（DAS）。然而，其现场性能仍可能因偏振诱导衰落（PIF）、局部信号退化和强环境干扰而恶化。本研究开发了一种Sagnac辅助增强型ϕ-OTDR传感架构和面向工程的DAS事件识别标准化基准框架。Sagnac干涉仪提供连续相位响应，补充了ϕ-OTDR通道中易衰落的观测值，并通过在FPGA平台上实现的互相关过程实现异构信号对齐。该基准协议在一致的数据划分、预处理和度量定义下，比较了传统特征工程方法、概率浅层分类器、单分支深度模型和双分支融合模型。在10公里传感光纤上进行的六类代表性声学事件实验表明，双分支融合模型在评估方法中提供了最有利的权衡，在平衡测试集上达到89.79%的准确率、89.83%的宏F1值和5.00%的虚警率。结果还表明，通道分组对双分支评估影响显著，表明面向部署的结论应基于准确率、宏F1、虚警率、漏报率和延迟，而非仅凭准确率。这项工作为基于ϕ-OTDR的DAS提供了一种物理驱动的增强策略，并为未来面向融合的传感研究提供了可复现的基准协议。用于复现DAS事件识别实验的实现和脚本可在https://github.com/wawa-abc/das公开获取。

英文摘要

Phase-sensitive optical time-domain reflectometry ($\phi$-OTDR) is widely used in large-scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization-induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac-assisted enhanced $\phi$-OTDR sensing architecture and a standardized benchmark framework for engineering-oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading-prone observations in the $\phi$-OTDR channel, and heterogeneous signal alignment is achieved using a cross-correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature-engineering methods, probabilistic shallow classifiers, single-branch deep models, and dual-branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10-km sensing fiber with six representative acoustic event classes show that the dual-branch fusion model provides the most favorable trade-off among the evaluated methods, reaching 89.79% accuracy, 89.83% macro-F1, and a nuisance alarm rate of 5.00% on the balanced test set. The results also show that channel grouping strongly affects dual-branch evaluation, indicating that deployment-oriented conclusions should be based on accuracy, macro-F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for $\phi$-OTDR-based DAS and a reproducible benchmark protocol for future fusion-oriented sensing research. The implementation and scripts for reproducing the DAS event-recognition experiments are publicly available at https://github.com/wawa-abc/das.

2606.05749 2026-06-05 cs.CL cs.AI 版本更新

叙事知识编织器：面向长文本理解的叙事中心检索增强推理

Qiuyu Tian, Fengyi Chen, Yiding Li, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang, Yingce Xia, Zequn Liu

发表机构 * Southeast University（东南大学）； Beijing Zhongguancun Academy（北京中关村学院）； Nanjing Normal University（南京师范大学）； ZhuiWen Technology Co., Ltd.（智文科技有限公司）

AI总结提出叙事知识编织器（NKW），一种基于源头的框架，通过将文本证据、原子事实、规范图结构、实体档案、交互、情节和故事线对齐，并利用文本、图和叙事工具进行后检索阅读，以解决长文本叙事QA中需要推理演化故事世界的问题，在STAGE、FairytaleQA和QuALITY上表现优异。

AI中文摘要

长文本叙事问答需要对不断演化的故事世界进行推理，而非孤立的段落：答案可能依赖于早期的目标、变化的角色状态、社会关系、因果触发因素、时间位置以及后续后果。现有的检索和图增强生成方法改善了证据访问，但其单元——块、实体、关系、摘要或工具动作——并未直接编码证据在故事中的功能。我们引入了叙事知识编织器（NKW），一种基于源头的框架，将文本证据、原子事实、规范图结构、实体档案、交互、情节和故事线对齐。在查询时，NKW使用文本、图和叙事工具以及后检索阅读技能来组装证据，并审计角色、范围、极性、状态和时间约束。在STAGE、FairytaleQA和QuALITY上，NKW在剧本级故事世界问答中表现最强，同时在更以段落为中心的基准上保持竞争力。消融实验、问题类型分析、图资产统计和案例研究显示了对角色、场景、时间、因果和叙事进展推理的互补优势。

英文摘要

Long-form narrative QA requires reasoning over evolving story worlds rather than isolated passages: answers may depend on earlier goals, changing character states, social relations, causal triggers, temporal position, and later consequences. Existing retrieval and graph-augmented generation methods improve evidence access, but their units (chunks, entities, relations, summaries, or tool actions) do not directly encode how evidence functions in a story. We introduce Narrative Knowledge Weaver (NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW uses text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoning.

2606.05720 2026-06-05 cs.SE cs.AI 版本更新

Microskill Architecture: A Modular Skill-Driven Framework for AI-Native Code Generation

微技能架构：一种面向AI原生代码生成的模块化技能驱动框架

Mohammad Zare, Omid Abdolrahmani

发表机构 * Artificial Intelligence Laboratory at AriooBarzan（AriooBarzan人工智能实验室）； Engineering Team, Shiraz, Iran（伊朗设拉子工程团队）

AI总结本文提出微技能架构，通过将知识封装为原子技能胶囊并动态选择相关胶囊，解决AI代码生成中的上下文窗口管理问题，显著降低token消耗、提高编译成功率并消除架构违规。

AI中文摘要

大型语言模型和AI编码代理已经重塑了软件开发，但完全AI原生系统的路径面临结构性挑战。其中最主要的是在保持准确性和效率的同时管理上下文窗口。当开发者将完整的项目文档和代码注入模型内存时，模型会丢失序列中间的信息，token成本激增，架构发生漂移。本文提出微技能架构：一种受微服务启发的模块化设计范式，应用于知识封装而非服务分解。该架构不是将整个代码库提供给代理，而是将知识划分为原子化、范围明确的技能胶囊，并由动态路由器仅选择语义相关的胶囊来执行任务。我们将上下文分配形式化为在token预算约束下基于语义相关性的约束优化。一个针对具有十五个复杂特性的企业内容管理系统的实证案例研究表明，微技能将token消耗降低了90%以上，首次尝试编译成功率几乎翻倍，完全消除了架构违规，并通过自学习机制实现了七个新技能胶囊的自主提取和注册。这些发现表明，微技能架构为构建更高效、更可靠且能够随时间演进的AI原生开发系统提供了可扩展的基础。

英文摘要

Large language models and AI coding agents have reshaped software development, but the path to fully AI-native systems faces structural challenges. Chief among them is managing context windows without losing accuracy or efficiency. When developers inject full project documentation and code into a model's memory, the model loses mid-sequence information, token costs spiral, and architecture drifts. This paper presents MicroSkill Architecture: a modular design paradigm inspired by microservices, applied to knowledge encapsulation instead of service decomposition. Instead of feeding an agent the entire codebase, the architecture partitions knowledge into atomic, sharply scoped skill capsules, and a dynamic router selects only semantically relevant capsules for the task. We formally model context allocation as constrained optimization over semantic relevance subject to a token budget. An empirical case study on an enterprise content management system with fifteen complex features shows that MicroSkill cuts token consumption by over 90%, nearly doubles first-try compilation success rates, eliminates architectural violations entirely, and enables autonomous extraction and registration of seven new skill capsules via a self-learning mechanism. These findings suggest MicroSkill Architecture offers a scalable foundation for building AI-native development systems that are more efficient, more reliable, and capable of evolving over time.
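
To make "constrained optimization over semantic relevance subject to a token budget" concrete, here is a minimal sketch. It is not the paper's router: it uses a simple greedy heuristic (highest relevance per token first), which only approximates the underlying knapsack-style problem. The `Capsule` type, its fields, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capsule:
    name: str
    relevance: float  # hypothetical semantic-relevance score in [0, 1]
    tokens: int       # token cost of adding the capsule to the context (> 0)


def select_capsules(capsules, budget):
    """Greedily pick capsules by relevance per token within a token budget.

    This is a heuristic stand-in for the constrained optimization, not an
    exact solver: it can miss the best combination in some cases.
    """
    ranked = sorted(capsules, key=lambda c: c.relevance / c.tokens, reverse=True)
    chosen, used = [], 0
    for capsule in ranked:
        # Skip irrelevant capsules and anything that would exceed the budget.
        if capsule.relevance > 0 and used + capsule.tokens <= budget:
            chosen.append(capsule)
            used += capsule.tokens
    return chosen, used


if __name__ == "__main__":
    library = [
        Capsule("auth", 0.9, 300),
        Capsule("routing", 0.6, 100),
        Capsule("billing", 0.1, 500),
        Capsule("logging", 0.0, 50),
    ]
    picked, spent = select_capsules(library, budget=450)
    print([c.name for c in picked], spent)
```

The greedy rule keeps the example short; an exact dynamic-programming knapsack would be the natural replacement if optimality mattered more than speed.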

2606.05718 2026-06-05 cs.CV cs.AI cs.LG 版本更新

ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

ViCuR: 视觉线索作为多模态在策略蒸馏中的可恢复特权

Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang

发表机构 * Shanghai AI Laboratory（上海人工智能实验室）； Fudan University（复旦大学）； Nanjing University（南京大学）

AI总结提出ViCuR框架，通过将教师特权从答案侧替换为输入中的视觉线索，并引入轻量级线索恢复模块，解决多模态在策略蒸馏中的训练-测试不匹配问题，在七个基准上显著提升学生模型性能。

Comments 25 pages, 11 figures. Preprint, under review

AI中文摘要

在策略蒸馏（OPD）通过在教师监督下，对学生自身策略采样的轨迹进行训练来改进推理。在多模态推理中，一种常见的扩展是使用特权教师，该教师观察仅在训练时可用的信号，如参考答案或理由。然而，这种答案侧特权造成了训练-测试不匹配：教师的监督可能依赖于学生无法获得的信号，鼓励捷径模仿而非基于视觉的推理。我们提出ViCuR，一种基于视觉的特权教师蒸馏框架，用视觉线索（输入中与查询相关的证据）取代答案侧特权。由于这些线索来源于推理时可用的相同视觉输入，它们的证据可由学生恢复。为此，ViCuR引入了一个轻量级线索恢复模块，在预填充期间使用专用的汇点令牌交叉注意力，将任务相关的视觉证据聚合到内部表示中，而不改变推理接口或需要辅助的线索生成损失。在七个基准上，使用Qwen3-VL-2B和8B学生，ViCuR在总体平均性能上持续优于基于答案的在策略自蒸馏，分别提升+1.19和+1.24。它还能自然地扩展到更强的教师OPD，超越OPD基线+0.64和+1.08，并在8B规模上具有一致的域外增益。这些结果表明，在多模态在策略蒸馏中，教师特权的设计与教师强度同等重要。

英文摘要

On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.

2606.05710 2026-06-05 cs.CR cs.AI 版本更新

Explainable AI-Driven Cyber Risk Analytics and Model Reliability Assessment for Intelligent Governance of U.S. Critical Infrastructure: An XGBoost and SHAP-Based Intrusion Detection Framework

面向美国关键基础设施智能治理的可解释AI驱动的网络风险分析与模型可靠性评估：基于XGBoost和SHAP的入侵检测框架

B. M. Taslimul Haque, Md. Arifur Rahman, Md. Serajul Kabir Chowdhury Rubel, Md. Iqbal Hossan

发表机构 * Department of Business Information Systems, Central Michigan University（中央密歇根大学商业信息系统系）； Department of Information Studies, Trine University（特林大学信息学系）； Department of Computer Science, Maharishi International University（Maharishi国际大学计算机科学系）

AI总结针对美国关键基础设施面临的网络威胁，提出一种结合XGBoost、随机森林等机器学习分类器与可解释AI（XAI）技术的入侵检测与网络风险预测框架，通过CICIDS2017数据集验证模型性能与可靠性。

Comments 20 pages, 8 figures, empirical research article, CICIDS2017 dataset, XGBoost, Random Forest, Decision Tree, Logistic Regression, SHAP explainability analysis, cyber risk analytics, intrusion detection, critical infrastructure cybersecurity, model reliability assessment

DOI: 10.25163/engineering.2110762
Journal ref: Applied IT & Engineering, 2(1), 1-20, 2024

AI中文摘要

美国关键基础设施领域智能数字技术的日益渗透极大地增加了面对高级网络对手和运营漏洞的风险。AI驱动的治理和自动化决策系统正成为关键基础设施系统（包括能源、医疗、交通、金融服务和通信基础设施）运行的关键部分，以提高效率和战略管理。不断增长的网络威胁环境，如分布式拒绝服务（DDoS）攻击、僵尸网络、勒索软件和高级持续性威胁（APT），对基础设施韧性、网络安全可靠性和治理可信度构成了重大挑战。在不断变化的攻击态势和动态网络环境中，传统的网络安全机制往往无法满足不断变化的需求和保护关键系统。本研究将开发一个弹性网络风险分析和模型可靠性评估框架，以支持美国关键基础设施环境中网络风险暴露的智能治理和决策支持。本研究基于CICIDS2017数据集，用于开发和测试基于机器学习的入侵检测系统模型和网络风险预测模型。使用XGBoost、随机森林和决策树等多种分类器来检测网络上的恶意活动并确定网络风险水平。此外，集成了可解释人工智能（XAI）技术，以增强网络安全决策过程的透明度、可解释性和信任度。所提出的框架通过多种性能指标（如准确率、精确率、召回率、F1分数、ROC-AUC和假阳性率）展示了模型的可靠性和韧性。

英文摘要

The increasing penetration of intelligent digital technologies into the U.S. critical infrastructure sector has greatly increased exposure to advanced cyber adversaries and operational vulnerabilities. AI-powered governance and automated decision-making systems are becoming a key part of the operation of critical infrastructure systems, including energy, healthcare, transportation, financial services, and communication infrastructure, in order to improve efficiency and strategic management. Growing cyber threats such as Distributed Denial of Service (DDoS) attacks, botnets, ransomware, and Advanced Persistent Threats (APTs) pose significant challenges to infrastructure resilience, cybersecurity reliability, and governance trustworthiness. In a changing attack landscape and dynamic network environment, traditional cybersecurity mechanisms often fall short of meeting evolving needs and protecting critical systems. This study develops a resilient cyber risk analytics and model reliability assessment framework to support intelligent governance and decision support for cyber risk exposure in the U.S. critical infrastructure environment. The study uses the CICIDS2017 dataset to develop and test machine-learning-based intrusion detection and cyber risk prediction models. Classifiers including XGBoost, Random Forest, and Decision Tree are used to detect malicious network activity and determine the level of cyber risk. Furthermore, Explainable Artificial Intelligence (XAI) techniques are integrated to enhance transparency, interpretability, and trust in cybersecurity decision-making processes. The framework's reliability and resilience are assessed with performance measures including accuracy, precision, recall, F1 score, ROC-AUC, and false positive rate.

2606.05704 2026-06-05 cs.AI cs.LG 版本更新

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

基于评论的异构多智能体推理用于可靠的数学问题求解

Muhammad Talha Sharif, Abdul Rehman

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出一种基于评论的异构多智能体框架，通过生成器-验证器结构和自适应学习系统，利用中间反馈评估和引导推理过程，在GSM8K基准上实现高达13%的准确率提升，并减少对大模型的依赖。

Comments 6 pages

AI中文摘要

近期的大语言模型（LLMs）展示了令人印象深刻的推理能力；但在复杂数学推理问题中，它们仍然容易产生幻觉、中间推理错误以及不可靠的推理结果。在本研究中，我们引入了一种基于评论的异构多智能体方法，以提高数学推理的可靠性。该框架整合了多个不同专长的LLM智能体，并采用评论驱动的自适应学习系统，基于中间反馈评估和引导推理过程。系统采用生成器-验证器框架，验证器不仅判断正确性，还提供评论以指导解决方案的重新生成。这允许自适应错误纠正并防止错误级联。我们在GSM8K基准上的实验表明，所提方法相比单次和非评论模型实现了高达13%的准确率提升。此外，研究结果表明，异构性和评论减少了对大模型的需求，使较小模型也能达到相当的性能。消融研究显示，主要性能提升归因于基于评论的反馈循环，而非模型大小。总之，所提方法展示了结合异构多智能体协作与评论以获得可靠且可解释推理系统的优势。

英文摘要

Recent Large Language Models (LLMs) have shown impressive reasoning abilities, but they are still susceptible to hallucinations, intermediate reasoning mistakes, and unreliable results on complex mathematical reasoning problems. In this study, we introduce a critic-based heterogeneous multi-agent approach to improve the dependability of mathematical reasoning. This framework incorporates several LLM agents of different specialties and employs a critic-driven adaptive learning system to assess and guide the reasoning process based on intermediate feedback. The system adopts a generator-validator framework, with the validator not only determining correctness but also offering critiques to guide regeneration of solutions. This allows for adaptive error correction and prevents error cascading. Our experiments on the GSM8K benchmark show that the proposed method achieves up to 13% accuracy improvement over single-shot and non-critic models. Additionally, findings suggest that heterogeneity and critique reduce the need for large models, allowing smaller models to perform on par. Ablation studies reveal the main performance gains are due to the critic-based feedback loop and not model size. In summary, the proposed approach showcases the benefits of combining heterogeneous multi-agent collaboration and critique to obtain reliable and interpretable reasoning systems.
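
The generator-validator loop described above can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the authors' system: `generate` and `validate` are hypothetical callables standing in for LLM agents, a single generator replaces the heterogeneous agent pool, and the round limit is an arbitrary choice.

```python
def solve_with_critic(problem, generate, validate, max_rounds=3):
    """Regenerate a solution until the validator accepts it or rounds run out.

    generate(problem, critique) -> attempt
    validate(problem, attempt)  -> (accepted: bool, critique: str or None)
    Returns (last_attempt, rounds_used, accepted).
    """
    critique = None
    attempt = None
    for round_no in range(1, max_rounds + 1):
        # The previous critique (if any) is fed back into the generator.
        attempt = generate(problem, critique)
        accepted, critique = validate(problem, attempt)
        if accepted:
            return attempt, round_no, True
    return attempt, max_rounds, False


if __name__ == "__main__":
    # Toy stand-ins: the generator only gets it right after seeing a critique.
    def toy_generate(problem, critique):
        return "5" if critique is None else "4"

    def toy_validate(problem, attempt):
        if attempt == "4":
            return True, None
        return False, "re-check the addition"

    print(solve_with_critic("2 + 2", toy_generate, toy_validate))
```

Bounding the number of rounds keeps cost predictable and stops the loop when the validator never accepts an answer.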

2606.05702 2026-06-05 cs.AI cs.CV 版本更新

Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Seeing Time: 视觉-语言模型中的时间顺序推理与捷径偏差基准测试

Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing, Ziqi Xu, Juncheng Hu, Xikun Zhang, Renqiang Luo

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算与数据科学学院）； School of Computer Science, Wuhan University（武汉大学计算机学院）； School of Computing Technologies, RMIT University（皇家墨尔本理工大学计算技术学院）

AI总结本文提出一个新基准，通过三个专门数据集评估视觉-语言模型在图像内和跨图像的时间顺序推理能力，并揭示模型常利用颜色等表面线索而非真正时间特征。

AI中文摘要

近期视觉-语言模型（VLM）在解释复杂视觉语义方面取得了显著进展，但其时间顺序推理能力仍未得到充分探索。本文引入了一个新颖的基准，专门用于评估VLM如何感知和推理图像内及跨图像的时间顺序信息。与现有基于视频的基准（侧重于帧序列）不同，我们的工作深入探讨了时间判断的基本逻辑以及向多模态集成的扩展。为此，我们构建了三个专门数据集：一个包含跨越长时间历史周期的视觉相似物体，另一个按不同事件和物体类型分类，第三个将图像与时间敏感的新闻文本配对以实现跨模态对齐。通过大量实验，我们分析了模型是否在不同类别间表现出性能差异，并关键地探讨了它们是否依赖“错误捷径”（如图像颜色而非真正的时间特征）。我们的结果表明，尽管VLM显示出潜力，但它们经常利用灰度与彩色滤镜等表面线索来绕过真正的时间顺序推理。通过提供这些高质量数据集和严格的评估框架，我们提供了一个诊断工具，用于识别当前局限性并指导开发更稳健、逻辑更严密的多模态模型。源代码见 https://github.com/LuoRenqiang/ChronoVision。

英文摘要

Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on "incorrect shortcuts", such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is available at https://github.com/LuoRenqiang/ChronoVision.

2606.05701 2026-06-05 cs.CR cs.AI 版本更新

Cognitive Threat Intelligence and Explainable Federated Security Analytics for distributed Infrastructure Systems

面向分布式基础设施系统的认知威胁情报与可解释联邦安全分析

Md. Arifur Rahman, B. M. Taslimul Haque, Md. Iqbal Hossan, Md. Serajul Kabir Chowdhury Rubel

发表机构 * Dept. of Information Studies, Trine University（特林大学信息学系）； Dept. of Business Information Systems, Central Michigan University（中央密歇根大学商业信息系统系）； Dept. of CS, Maharishi International University（Maharishi国际大学计算机科学系）

AI总结提出一种集成联邦学习、可解释人工智能和认知网络安全分析的框架，用于分布式基础设施系统的协作式隐私保护威胁检测。

Comments 22 pages, 10 figures, 1 conceptual framework diagram, 1 methodology workflow diagram, empirical study using NSL-KDD and CIC-IDS2017 datasets, Federated Learning, Explainable AI (SHAP, LIME), cybersecurity and intrusion detection framework

DOI: 10.64882/ijrt.v13.i1.1384
Journal ref: International Journal of Research and Technology (IJRT), Volume 13, Issue 01, January-March 2025, pp. 132-151

AI中文摘要

分布式基础设施系统、云计算、物联网技术和边缘架构的日益普及显著扩大了网络安全攻击面，并引入了日益复杂的网络威胁。传统的集中式入侵检测方法在可扩展性、数据隐私、通信开销以及人工智能驱动决策过程的透明度方面常面临挑战。为解决这些限制，本文提出了一种面向分布式基础设施系统的认知威胁情报与可解释联邦安全分析框架。该框架集成了联邦学习、可解释人工智能和认知网络安全分析，能够在分布式网络环境中实现协作式且保护隐私的网络威胁检测。敏感原始网络流量数据不传输到集中式服务器，而是在分布式节点上独立训练本地安全模型，仅通过联邦聚合机制共享加密的模型参数和更新。这种去中心化学习架构在减少通信依赖和集中式安全风险的同时提高了隐私保护。为增强智能威胁分析，该框架采用了机器学习和深度学习算法，包括随机森林、XGBoost、自编码器、卷积神经网络和长短期记忆网络。此外，可解释人工智能技术（如SHAP和LIME）被集成以提供透明且可理解的威胁检测决策解释，从而增强安全分析师之间的信任和可操作性。在包括CICIDS2017、UNSW-NB15和CSE-CIC-IDS2018在内的多个基准网络入侵数据集上进行的实验评估表明，所提框架在检测准确率、精确率、召回率和F1分数方面优于传统集中式和现有联邦学习方法，同时确保数据隐私、通信效率和模型可解释性。

英文摘要

The increasing adoption of distributed infrastructure systems, cloud computing, Internet of Things (IoT) technologies, and edge-based architectures has significantly expanded the cybersecurity attack surface and introduced increasingly sophisticated cyber threats. Conventional centralized intrusion detection approaches often face challenges related to scalability, data privacy, communication overhead, and limited transparency in artificial intelligence-driven decision-making processes. To address these limitations, this study proposes a Cognitive Threat Intelligence and Explainable Federated Security Analytics framework for distributed infrastructure systems. The proposed framework integrates Federated Learning (FL), Explainable Artificial Intelligence (XAI), and cognitive cybersecurity analytics to enable collaborative and privacy-preserving cyber threat detection across distributed network environments. Instead of transmitting sensitive raw network traffic data to centralized servers, local security models are independently trained at distributed nodes, where only encrypted model parameters and updates are shared through a federated aggregation mechanism. This decentralized learning architecture improves privacy protection while reducing communication dependency and centralized security risks. To enhance intelligent threat analysis, the framework incorporates machine learning and deep learning algorithms including Random Forest, XGBoost, Autoencoder

2606.05697 2026-06-05 cs.AI 版本更新

LongSpace: 从感知到回忆的视频长程空间记忆探索

Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Zhongguancun Academy（中关村学院）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； The Chinese University of Hong Kong（香港中文大学）； Xi’an Jiaotong University（西安交通大学）

AI总结针对长视频中空间记忆的挑战，提出LongSpace框架，通过分块建模、3D结构线索注入和层级感知记忆实现长程空间推理，并在LongSpace-Bench等基准上验证其有效性。

AI中文摘要

多模态大语言模型（MLLMs）在图像和视频理解方面取得了进展，并且能够处理更长的视觉输入。自动驾驶和机器人导航等长程任务不仅需要识别当前视图，模型还必须记住并检索之前观察到的空间布局、路线、视角变化和物体状态。为了评估这一能力，我们引入了LongSpace-Bench，一个用于长程空间记忆的房间导览视频基准，涵盖场景感知、空间关系和空间记忆。在这项工作中，我们进一步提出了LongSpace，一个用于长视频空间推理的记忆框架。LongSpace将长视频建模为连续的块，将3D结构线索注入早期解码器层，并构建层级感知记忆以进行问题引导的检索。在多个空间推理基准上的实验表明，LongSpace改善了长视频空间理解，进一步证明了显式空间记忆是长程视频MLLMs的关键能力。

英文摘要

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.

2606.05670 2026-06-05 cs.AI 版本更新

Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

更多智能体有帮助吗？LLM智能体工作流的受控与协议对齐评估

Yuhang Fu, Ruishan Fang, Jiaqi Shao, Huiyu Zheng, Zhengtao Zhu, Bing Luo, Tao Lin

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Westlake University（西湖大学）； Zhejiang University（浙江大学）； Duke Kunshan University（昆山杜克大学）； Hong Kong University of Science and Technology（香港科技大学）； Zhejiang University of Technology（浙江工业大学）

AI总结提出BenchAgent框架，在统一协议下比较单智能体、固定多智能体和演化多智能体工作流，发现大多数多智能体系统在准确率上未超越单智能体基线，但运行时生成的工作流在GAIA上表现优异。

Comments https://github.com/LINs-lab/MASArena/tree/BenchAgent

AI中文摘要

一旦比较的系统共享相同的基准加载器、工具访问、答案契约、使用计数和轨迹日志，添加更多智能体是否有助于LLM工作流？我们引入BenchAgent，一个评估框架，将单智能体、固定多智能体（MAS）和演化MAS工作流置于一个标准化的执行和日志协议下。BenchAgent使用GPT-4.1在十个推理、编码和工具使用基准上评估这些内部工作流，并单独报告运行时生成工作流的协议对齐外部（PAE）GAIA研究。在SI条件下，六个测试的MAS中最多有一个在基准平衡平均准确率上超过匹配的单智能体锚点：EvoAgent位于Wilson单次运行指导范围内，而其余五个落后2.56-11.29个百分点，并占据更昂贵的准确率-成本权衡。在PAE GAIA快照上，一个Claude-Code风格的运行时工作流达到66.72%的整体准确率和69.23%的Level 3准确率，比最强的非Claude基线Jarvis（一个固定MAS）高出20多个百分点。

英文摘要

Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.
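
The abstract refers to a "Wilson one-run guidance" without defining it. As background only, the sketch below computes the standard Wilson score interval for a binomial proportion, which is a common way to judge whether an accuracy gap from a single run exceeds sampling noise. How BenchAgent actually applies it is not stated here, and the function name and the example counts are assumptions.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical run: 50 correct answers out of 100 tasks.
    low, high = wilson_interval(50, 100)
    print(f"{low:.4f} {high:.4f}")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small benchmarks or accuracies near 0 or 1.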

2606.05661 2026-06-05 cs.AI cs.CL 版本更新

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

持续学习基准：评估现实世界有状态环境中的前沿AI系统

Parth Asawa, Christopher M. Glaze, Gabriel Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Sunn Chen, Frederic Sala, Matei Zaharia, Joseph E. Gonzalez

发表机构 * UC Berkeley（加州大学伯克利分校）； Snorkel AI； University of Wisconsin-Madison（威斯康星大学麦迪逊分校）

AI总结提出首个专家验证的持续学习基准CL-Bench，涵盖六个领域，通过增益指标隔离在线学习能力，发现现有系统存在过拟合和知识复用不足问题。

AI中文摘要

持续学习，即AI系统通过顺序经验提升能力，已引起广泛关注，但缺乏高质量基准来评估。我们提出持续学习基准（CL-Bench），首个由专家验证的困难基准，旨在衡量基于LLM的系统是否真正从经验中改进。CL-Bench涵盖六个不同领域（软件工程、信号处理、疾病爆发预测、数据库查询、策略游戏和需求预测），每个领域由领域专家验证，任务共享可学习的潜在结构（代码库布局、疾病爆发动态、对手策略），有状态系统可在线发现而静态系统不能。我们评估了从朴素上下文学习（ICL）到专用记忆系统的多种智能体架构的前沿模型，引入增益指标以隔离学习与先验能力。我们发现这些系统在持续学习上仍有提升空间：智能体常过度拟合即时观察或未能跨实例复用知识，专用记忆系统并未解决此问题——实际上，朴素ICL优于专用记忆管理系统。CL-Bench是首个通过专家验证任务在多个现实世界领域评估持续学习并隔离在线学习与基础模型能力的基准，表明需要更好的持续学习系统。

英文摘要

Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this; in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.

2606.05660 2026-06-05 cs.RO cs.AI 版本更新

Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation

面向长时域任务的安全具身AI：机器人操作跨层分析

Dabin Kim, Daemin Park, Sangyub Lee, Jinsik Kim, Yeongtak Oh, Jongho Shin, Sungroh Yoon

发表机构 * UNIST InnoCORE AI-Space Solar Initiative（UNIST创新核心人工智能空间太阳能计划）； Ulsan National Institute of Science and Technology (UNIST)（蔚山国立科学技术研究院）； Automation and Systems Research Institute（自动化与系统研究所）； Department of Electrical and Computer Engineering（电气与计算机工程系）； Interdisciplinary Program in Artificial Intelligence（人工智能跨学科项目）； LG Electronics（LG电子）

AI总结本文从具身AI视角，系统综述长时域机器人操作中的安全问题，按干预时机（规划时、策略时、执行时）组织文献，分析证据强度，并指出当前安全保证的不足与未来方向。

Comments 63 pages, 6 figures

AI中文摘要

具身AI系统日益被期望在物理环境中进行长时间跨度的推理和行动。这种不断增强的能力将安全问题推向前台，因为物理世界中的失败可能伤害人、损坏物体并扰乱工作场所。尽管安全具身AI已引起广泛关注，但文献在规划、策略设计和运行时执行方面仍然分散。长时域机器人操作是这一问题特别具有揭示性的锚定领域，因为语义误解、子任务级错误传播、执行漂移和接触丰富的物理风险可能在同一个闭环系统中累积。因此，本综述从具身AI视角对长时域机器人操作中的安全性进行了结构化回顾。我们按干预时机组织文献，涵盖规划时、策略时和执行时的安全性，并分析每条工作提供的证据强度，区分形式化保证、统计支持和经验安全启发式。这一框架阐明了骨干能力论文、直接安全机制以及基准或评估研究的独特作用，同时揭示了当前安全声明在哪些方面得到良好支持，在哪些方面仍然间接。我们识别了持续的空白，包括策略时安全性的有限证据、接触丰富长时域操作的形式化支持薄弱、不成熟的不确定性触发干预以及缺乏操作特定的安全基准。最后，我们概述了跨层保证、评估设计以及长时域机器人代理在真实世界环境中更安全部署的研究方向。

英文摘要

Embodied AI systems are increasingly expected to reason and act over extended horizons in physical environments. This growing capability brings safety to the foreground, because failures in the physical world can harm people, damage objects, and disrupt workplaces. Although safe embodied AI has attracted substantial attention, the literature remains fragmented across planning, policy design, and runtime execution. Long-horizon robotic manipulation is a particularly revealing anchor domain for this problem because semantic misgrounding, subtask-level error propagation, execution drift, and contact-rich physical risk can accumulate within the same closed-loop system. This survey therefore provides a structured review of safety in long-horizon robotic manipulation from an embodied AI perspective. We organize the literature by intervention locus, covering planning-time, policy-time, and execution-time safety, and we analyze the strength of the evidence that each line of work provides, distinguishing formal guarantees, statistical support, and empirical safety heuristics. This framework clarifies the distinct roles of backbone capability papers, direct safety mechanisms, and benchmark or evaluation studies, while exposing where current safety claims are well supported and where they remain indirect. We identify persistent gaps, including limited evidence for policy-time safety, weak formal support for contact-rich long-horizon manipulation, immature uncertainty-triggered intervention, and a shortage of manipulation-specific safety benchmarks. We conclude by outlining research directions for cross-layer assurance, evaluation design, and safer deployment of long-horizon robotic agents in real-world settings.

2606.05658 2026-06-05 cs.IR cs.AI 版本更新

Agent-Orchestrated Adaptive RAG: A Comparative Study on Structured and Multi-Hop Retrieval

Agent编排的自适应RAG：结构化与多跳检索的比较研究

Anuj Maharjan, Devinder Kaur, Richard Molyet

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Washington（华盛顿大学）； University of California, Los Angeles（加州大学洛杉矶分校）

AI总结提出Agent编排的自适应RAG框架，通过动态查询分解、迭代检索和自反思评估，在结构化领域（DevOps）和多跳推理基准（MuSiQue）上对比发现，查询分解在结构化领域提升性能但降低多跳排名精度，反思机制提高引用准确性但增加延迟，表明Agent增强需根据查询和领域特性选择性应用。

AI中文摘要

检索增强生成（RAG）通过将响应基于外部知识来增强大型语言模型（LLM），但传统流水线依赖于静态的单步检索，这限制了复杂查询的性能。本文提出了一种Agent编排的自适应RAG框架，引入了动态查询分解、迭代检索和有界自反思评估循环。我们在两个互补的数据集上评估该系统：一个特定领域的DevOps知识库和多跳推理基准MuSiQue。使用包括总体得分、引用准确性、平均倒数排名和主题覆盖度在内的指标，我们发现查询分解在结构化领域（DevOps上总体得分+0.04，MRR+0.17）带来一致的增益，但在多跳基准上降低了排名精度，而反思机制以显著的延迟成本提高了引用准确性。这些对比结果表明，Agent增强并非普遍有益，必须根据查询和领域特性选择性应用。我们的发现支持自适应、成本感知的编排，而非统一激进的推理流水线。

英文摘要

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding their responses in external knowledge, but conventional pipelines rely on static, single-step retrieval that limits performance on complex queries. This paper presents an Agent-Orchestrated Adaptive RAG framework that introduces dynamic query decomposition, iterative retrieval, and a bounded self-reflective evaluation loop. We evaluate the system across two complementary datasets: a domain-specific DevOps knowledge base and the multi-hop reasoning benchmark MuSiQue. Using metrics that include overall score, citation accuracy, mean reciprocal rank, and topic coverage, we find that query decomposition yields consistent gains in the structured domain (overall score +0.04, MRR +0.17 on DevOps) but degrades ranking precision on the multi-hop benchmark, while the reflection mechanism improves citation accuracy at a substantial latency cost. These contrasting results show that agentic enhancements are not universally beneficial and must be applied selectively according to query and domain characteristics. Our findings argue for adaptive, cost-aware orchestration rather than uniformly aggressive reasoning pipelines.
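
Mean reciprocal rank (MRR), one of the metrics reported above, averages 1/rank of the first relevant result across queries, with 0 for a query that retrieves nothing relevant. A minimal sketch of the standard definition follows; the function name and the toy document IDs are illustrative, and the paper's exact evaluation code may differ.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none found)."""
    if not ranked_lists:
        raise ValueError("need at least one query")
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


if __name__ == "__main__":
    retrieved = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
    gold = [{"b"}, {"x"}, {"missing"}]
    # Reciprocal ranks are 1/2, 1, and 0, so the mean is 0.5.
    print(mean_reciprocal_rank(retrieved, gold))
```

Because only the first relevant hit counts, MRR rewards placing one good document at the top, which may explain how a change can help MRR on one dataset and hurt ranking precision on another.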

2606.05647 2026-06-05 cs.AI cs.CL cs.CY cs.HC 版本更新

Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

与“敌人”编码：人类开发者能否检测到AI代理的破坏行为？

Jingheng Ye, Huiqi Zou, Simon Yu, Weiyan Shi

发表机构 * Northeastern University（东北大学）

AI总结通过大规模用户实验，研究人类开发者在长时间编码任务中检测AI代理恶意代码插入的能力，发现94%的开发者未能识别破坏，并分析其原因，提出安全监控设计建议。

Comments 34 pages, 30 figures, 3 tables

AI中文摘要

AI编码代理越来越多地嵌入到现实世界的软件开发中，与人类开发者协作，同时获得对代码库和工具的更广泛访问权限。这创造了一个新的攻击面：代理可以利用人类信任来破坏开发，例如通过插入恶意代码来完成隐藏的附带任务。大多数先前的工作研究AI-only环境中的AI破坏，对人类监督在检测和减轻此类恶意行为中的作用关注有限。为填补这一空白，我们进行了首个关于AI编码破坏中人类监督的大规模研究。超过100名参与者与四个前沿模型（Claude-Opus-4.6、GPT-5.4、Gemini-3.1-Pro和MiniMax-M2.7）之一合作，完成一项持续约五小时的长周期编码任务，旨在模拟真实工作流程。我们发现94%的开发者未能检测到破坏，我们对参与者反馈的分析将这一脆弱性归因于最小化的代码审查、合理的掩护故事以及对代理的过度信任。我们进一步测试了安全监控器在一种条件下的有效性：虽然监控器降低了破坏成功率，但仍有56%的参与者接受了恶意代码，忽略了其警告。根据参与者反馈，我们为更好的监控器设计提供了可操作的建议。这项工作补充了现有的AI安全研究，并强调了迫切需要以人为本的安全机制，考虑人类因素，特别是在长周期、真实世界的开发环境中。

英文摘要

AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover stories, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.

2606.05646 2026-06-05 cs.SE cs.AI (updated version)

Enhancing Software Engineering Through Closed-Loop Memory Optimization

Xuehang Guo, Zora Zhiruo Wang, Qingyun Wang, Graham Neubig, Xingyao Wang

Affiliations: William & Mary; Carnegie Mellon University; OpenHands University; University of Illinois Urbana-Champaign

AI summary: Proposes a closed-loop memory optimization framework that defines memory utility by validated downstream impact, uses it as both an evaluation benchmark and an optimization signal, and significantly improves the success rate and efficiency of software engineering agents.

Abstract

Large language models (LLMs) have enabled powerful software engineering (SE) agents capable of navigating complex codebases and resolving real-world issues. However, these agents remain fundamentally episodic: they fail to retain, refine, and reuse experiences across tasks, repeatedly reconstructing context from scratch and reproducing similar mistakes. Even with memory support, they offer no remedy for the absence of a principled, task-agnostic \textit{memory utility}, making them difficult to evaluate rigorously or generalize across agents and settings. To tackle these limitations, we introduce \ours, a closed-loop framework for memory augmentation in SE agents. \ours grounds memory utility in \textit{validated downstream impact}, establishing utility as both a task-agnostic \textbf{evaluation benchmark} and an annotation-free \textbf{optimization signal}. Through complementary evaluation on \textit{single-episode} and \textit{cross-episode} memory augmentation, results demonstrate that \ours consistently improves SE agents across settings, achieving absolute gains of up to $\uparrow5.25\%$ in success rate and $\uparrow4.63\%$ in resolve efficiency, while substantially reducing computational cost by $\geq9.79\%$. Our project page: \href{https://xhguo7.github.io/MemOp/}{https://xhguo7.github.io/MemOp/}.

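2606.05644 2026-06-05 cs.AI (updated version)

FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG

Zhe Yu, Wenpeng Xing, Tiancheng Zhao, Mohan Li, Changting Lin, Meng Han

Affiliations: Binjiang Institute of Zhejiang University; Zhejiang University; Guangzhou University; GenTel.io

AI summary: Addresses the problem that models ignore context when retrieved evidence conflicts with parametric memory in retrieval-augmented generation. Proposes FIDES, a training-free decoder that fuses three internal signals (output surface, hidden representations, and prediction trajectory) to adjust intervention strength at the token level, significantly improving context fidelity.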

Abstract

When retrieved evidence contradicts parametric memory, language models frequently ignore context and default to memorized priors -- a failure that undermines the core purpose of retrieval augmentation. Contrastive decoding amplifies the context-conditioned output to suppress parametric bias, but existing methods rest on an implicit assumption that this bias is uniform across tokens. A single global contrastive weight over-penalizes safe tokens while leaving genuinely conflicted ones insufficiently corrected. We identify token-level conflict concentration: retrieval-memory tension is sharply heterogeneous, concentrated on a small fraction of answer-critical decoding steps. This reframes contrastive decoding from how much contrast to apply to where to apply it. We propose FIDES (Faithful Inference via Deep Evidence Signals), a training-free decoder that reads three internal signals probing retrieval-memory conflict at complementary depths -- output surface, hidden representations, and prediction trajectory -- and fuses them to govern intervention strength at each decoding step. Across three benchmarks and six backbones -- four primary 7B/8B models and two scaling backbones up to 70B -- FIDES achieves the best context fidelity in all 18 settings, outperforming the strongest training-free baseline by +3 to +13 points. On the 70B scale, fidelity reaches 92-94% while F1 surges to 62-63%, demonstrating that token-level selectivity unlocks generation capability that coarse contrastive rules suppress.
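As a rough illustration of the shift from "how much contrast" to "where to apply it", the sketch below applies contrastive decoding with a per-step weight. The adjustment `(1 + alpha) * with_ctx - alpha * without_ctx` is a common contrastive-decoding form and is not necessarily the one FIDES uses. The conflict signal (total variation distance between the two next-token distributions) is a stand-in assumption for the paper's three fused signals, and all function names and the `alpha_max` parameter are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def conflict_signal(with_ctx, without_ctx):
    """Assumed signal: total variation distance in [0, 1] between the
    context-conditioned and context-free next-token distributions."""
    p, q = softmax(with_ctx), softmax(without_ctx)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def selective_contrast(with_ctx, without_ctx, alpha_max=2.0):
    """Scale the contrastive weight by the conflict signal at this step."""
    alpha = alpha_max * conflict_signal(with_ctx, without_ctx)
    return [(1 + alpha) * c - alpha * m for c, m in zip(with_ctx, without_ctx)]

# Step 1: context and memory agree, so the logits pass through unchanged.
agree = selective_contrast([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
# Step 2: context favors token 1, memory favors token 0, so token 1 is boosted.
conflict = selective_contrast([0.0, 2.0, 0.0], [2.0, 0.0, 0.0])
```

With a single global weight, step 1 would be perturbed as much as step 2; tying the weight to a per-step signal leaves agreeing steps untouched.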

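2606.05633 2026-06-05 cs.AI (updated version)

Answer Presence Drives RAG Rewriting Gains

Yuejie Li, Yueying Hua, Ke Yang, Li Zhang, Yueping He, Ruiqi Li, Bolin Chen, Tao Wang, Bowen Li, Chengjun Mao

Affiliations: Ant Group

AI summary: Using a controlled intervention audit, finds that the performance gains rewriters bring to retrieval-augmented question answering are driven mainly by the gold answer string appearing in the rewritten context rather than by improved evidence quality.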

Abstract

Retrieval-augmented QA pipelines often route retrieved passages through an LLM \emph{rewriter} before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. We ask whether that lift is causally driven by the gold answer string appearing in the rewritten context rather than by curation per se, using a controlled intervention audit. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: removing the gold answer span, replacing a length-matched random non-answer span (placebo), or injecting the gold into rewrites where it was absent (at the prefix or at a midpoint sentence boundary). Across twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA$+$verify), removing the gold answer drops reader F1 by $28$ to $64$ points beyond the length-matched placebo on paired \texttt{answer-in-compile} strata, and prepending the gold into rewrites that lacked it raises F1 by $+0.7$ to $+9.7$ points in $10$ of $12$ (cell, baseline) combinations. A companion five-sentinel audit shows the conventional single-\texttt{[MASK]} probe is itself sentinel-fragile: on 2Wiki it reports a $+4.12$~F1 ``non-leakage residual'' that flips to $-3.33$ to $-7.81$~F1 under four alternative sentinels and fails an equivalence test for three of those four ($1/4$~pass). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.

2606.05632 2026-06-05 cs.AI (updated version)

Evaluation of LLMs for Mathematical Formalization in Lean

Tyson Klingner, Drew Bladek, Escher Crawford, Bohao Chen, Ariel Fu, Kaira Nair, Jarod Alper, Giovanni Inchiostro, Vasily Ilin

Affiliations: University of California, Berkeley; University of Washington

AI summary: Compares the ability of several large language models to generate formal proofs in Lean 4, using pass@k and refine@k metrics on subsets of miniF2F and miniCTX. Gemini 3.1 Pro and Claude Opus 4.7 perform best, while NVIDIA Nemotron 3 Super and GPT-OSS 120B are the most efficient once cost is considered.

Comments 15 pages, 13 figures, 10 tables. Comments welcome!

2606.05626 2026-06-05 cs.CL cs.AI cs.LG (updated version)

When New Generators Arrive: Lifelong Machine-Generated Text Attribution via Ridge Feature Transfer

Zhen Sun, Yifan Liao, Zhicong Huang, Jiaheng Wei, Cheng Hong, Yutao Yue, Xinlei He

Affiliations: Wuhan University; Ant Group; The Hong Kong University of Science and Technology (Guangzhou); Institute of Deep Perception Technology, JITRI

AI summary: Targets the difficulty of balancing continual adaptation to new generators against retention of old knowledge in lifelong machine-generated text attribution. Proposes RidgeFT, a lightweight analytic update framework that uses covariance calibration and fixed random features to perform closed-form updates without exemplar replay.

Comments 12 pages

Abstract

Machine-generated text (MGT) attribution aims to identify the specific generator responsible for a given text, thereby providing fine-grained evidence for model accountability and misuse investigation. As new large language models continue to emerge, attribution models must continuously incorporate new generators while preserving their ability to recognize previously seen ones. Prior works have shown that this lifelong MGT attribution setting is challenging, and existing methods often struggle to achieve a stable balance between adapting to new classes and retaining old ones. To address this issue, we propose RidgeFT, a lightweight analytic update framework that does not rely on exemplar replay. RidgeFT trains a task-aware encoder on the initial generator set, stores compact class-wise sufficient statistics when each generator class is first observed, and then freezes the encoder for replay-free closed-form updates. It then suppresses generator-irrelevant variation through covariance calibration, improves representation capacity with fixed random features, and updates new classes through closed-form ridge regression based on class-level sufficient statistics. Across multi-topic evaluations with varying initial generator setups, RidgeFT consistently outperforms baselines. It achieves the best macro-F1 across domains, backbones, and incremental protocols, while also improving both old-class retention and new-class adaptation. These results suggest that feature-stable analytic updates provide a simple yet effective approach to lifelong MGT attribution.
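For readers unfamiliar with replay-free closed-form updates, the following sketches the standard ridge-regression algebra that such a scheme can rest on. The notation (frozen-encoder features $\phi$, statistics $G$ and $C$, regularization strength $\lambda$) is assumed here for illustration; the abstract does not state RidgeFT's exact statistics or its calibration step.

```latex
% Statistics accumulated over all samples seen so far
% (features \phi_i from the frozen encoder, one-hot labels y_i)
G = \sum_i \phi_i \phi_i^{\top}, \qquad C = \sum_i \phi_i y_i^{\top}

% Closed-form ridge solution with regularization strength \lambda > 0
W = (G + \lambda I)^{-1} C

% A new generator class arrives with sample set \mathcal{N}:
% G gains the new outer products and C gains one new column,
% so no stored samples from earlier classes are required
G \leftarrow G + \sum_{j \in \mathcal{N}} \phi_j \phi_j^{\top}, \qquad
C \leftarrow \bigl[\, C \;\; \textstyle\sum_{j \in \mathcal{N}} \phi_j \,\bigr]
```

Because the existing columns of $C$ are unaffected by samples of the new class, recomputing $W$ from the updated $G$ and $C$ gives the same result as refitting on all data at once.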

2606.05625 2026-06-05 cs.AI cs.LG (updated version)

Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

Bonan Shen, Youting Wang, Dingyan Shang, Tao Ning

Affiliations: Stanford University; Tsinghua University

AI summary: Proposes self-commitment latency, a metric that measures how early a reasoning context commits to the model's own final answer, detecting prompted implicit hacking without a reward signal. It reaches AUROC 0.878 to 0.926 on GSM8K.

Abstract

Implicit reward hacking is hard to audit when a language model's chain of thought appears benign: a final answer may be anchored by a prompt shortcut while the written reasoning still resembles ordinary problem solving. Verifier-based probes expose such behavior by measuring how early truncated reasoning contexts obtain high reward, but require a task-specific reward signal. This paper proposes a weaker-input alternative, self-commitment latency, which measures how early a prompted reasoning context commits to the model's own final answer. We evaluate the probe in a controlled paired GSM8K setting using Qwen2.5-3B-Instruct-4bit, comparing ordinary prompts with prompts that include an answer hint. Hinted contexts commit substantially earlier and with lower uncertainty than honest contexts. The primary latency metric, first-commitment latency at threshold 0.8, reaches AUROC 0.878; supporting whole-curve summaries reach AUROC 0.926 for commitment range and 0.904 for mean uncommitted mass. The signal is stronger when both prompt conditions answer correctly and remains stable across thresholds. These results show that shortcut-available reasoning contexts can leave an early behavioral commitment signature detectable without a reward model, external judge, or trained classifier.
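The probe described above can be pictured with a minimal sketch. It assumes each reasoning context has already been reduced to a list of probabilities, one per truncation point, that the model's answer at that point matches its own final answer; how those probabilities are obtained is not specified here, and the function names and toy curves are hypothetical.

```python
def first_commitment_latency(commit_probs, threshold=0.8):
    """Fraction of the trace consumed before the probability of the final
    answer first reaches `threshold`; 1.0 if it never does."""
    n = len(commit_probs)
    for i, p in enumerate(commit_probs):
        if p >= threshold:
            return i / n
    return 1.0

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative
    (ties count as 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

# Toy commitment curves: hinted contexts commit early, honest ones late.
hinted = [[0.9, 0.95, 1.0, 1.0], [0.5, 0.85, 0.9, 1.0]]
honest = [[0.1, 0.2, 0.6, 0.9], [0.2, 0.3, 0.85, 1.0]]
hinted_lat = [first_commitment_latency(c) for c in hinted]
honest_lat = [first_commitment_latency(c) for c in honest]

# Lower latency should indicate a hint, so negate latency to get a score.
score = auroc([-x for x in hinted_lat], [-x for x in honest_lat])
```

No reward model or verifier appears anywhere in the computation; the only inputs are the model's own answers at truncated prefixes.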

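2606.05614 2026-06-05 cs.AI (updated version)

Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

Long P. Hoang, Hai V. Le, Shaoyang Xu, Wei Lu, Wenxuan Zhang

Affiliations: Singapore University of Technology and Design; Nanyang Technological University

AI summary: Shows that safety-aligned LLMs are exposed to a posterior attack because of their internal safety-evaluation capability. Experiments and theoretical analysis indicate that stronger safety judgment makes a model easier to exploit, and a causal intervention supports the finding.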

Abstract

Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters in size) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability. Finally, we establish a causal link via reinforcement learning interventions, exemplifying that artificially degrading a model's safety judgment immunizes it against the attack, whereas enhancing judgment exacerbates the vulnerability. Our findings highlight potential flaws in current alignment paradigms, indicating that defense mechanisms may require further structural refinement.

2606.05613 2026-06-05 cs.AI (updated version)

Multilingual Fine-Tuning via Localized Gradient Conflict Resolution

Long P. Hoang, Yiran Zhao, Wei Lu, Wenxuan Zhang

Affiliations: Singapore University of Technology and Design; Salesforce AI Research; Nanyang Technological University

AI summary: Proposes Bucket-Level MOO, a framework that recasts multilingual fine-tuning as a multi-objective optimization problem and improves multilingual performance by resolving gradient conflicts locally.

Abstract

The rapid evolution of Large Language Models (LLMs) has established cross-lingual versatility as a defining feature of modern systems. However, fine-tuning these models frequently induces negative interference across languages. To address this, we reformulate multilingual fine-tuning as a multi-objective optimization (MOO) problem. Specifically, we introduce Bucket-Level MOO, a scalable distributed framework that applies gradient-based MOO algorithms locally on parameter buckets. This enables conflict-aware updates without the prohibitive communication overhead of reconstructing full gradient vectors. Theoretically, we prove this localized resolution natively enforces Refined Pareto Stationarity, a strictly tighter necessary condition for Pareto optimality. Empirically, Bucket-Level MOO mitigates interference by driving LLMs to construct distinct language-specific dimensions, improving representational separability. Extensive experiments across four base LLMs demonstrate that our method significantly improves both seen and unseen multilingual performance over standard fine-tuning paradigms.

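2606.05609 2026-06-05 cs.CR cs.AI cs.LG (updated version)

SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks

Seungwon Jeong, Jiwoo Jeong, Hyeonjin Kim, Yunseok Lee, Woojin Lee

Affiliations: Dongguk University-Seoul

AI summary: Proposes SlotGCG, which scores the vulnerability of each candidate insertion position (slot) in a prompt with the Vulnerable Slot Score (VSS) and inserts adversarial tokens at the most vulnerable positions, significantly raising the success rate of optimization-based jailbreak attacks.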

Journal ref: International Conference on Learning Representations (ICLR), 2026

Abstract

As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting adversarial tokens to the end of prompts. However, GCG restricts adversarial tokens to a fixed insertion point (typically the prompt suffix), leaving the effect of inserting tokens at other positions unexplored. In this paper, we empirically investigate \emph{slots}, i.e., candidate positions within a prompt where tokens can be inserted. We find that vulnerability to jailbreaking is highly related to the selection of the \emph{slots}. Based on these findings, we introduce the \textit{Vulnerable Slot Score} (VSS) to quantify the positional vulnerability to jailbreaking. We then propose SlotGCG, which evaluates all slots with VSS, selects the most vulnerable slots for insertion, and runs a targeted optimization attack at those slots. Our approach provides a position-search mechanism that is attack-agnostic and can be plugged into any optimization-based attack, adding only 200ms of preprocessing time. Experiments across multiple models demonstrate that SlotGCG significantly outperforms existing methods. Specifically, it achieves 14\% higher Attack Success Rates (ASR) over GCG-based attacks, converges faster, and shows superior robustness against defense methods with 42\% higher ASR than baseline approaches. Our implementation is available at \href{https://github.com/youai058/SlotGCG}{https://github.com/youai058/SlotGCG}

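2606.05606 2026-06-05 cs.LG cs.AI math.OC (updated version)

TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

Bobby Yan, Fredrik Kjolstad

Affiliations: Department of Computer Science, Stanford University

AI summary: Presents TensorBench, a benchmark of 199 feature-addition and refactoring tasks for evaluating coding agents on a compiler-based tensor framework, graded automatically by a test suite.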

Abstract

Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not scale. We introduce TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. Tasks cover new sparse formats, dense optimization passes, IR transformations, scheduler changes, runtime components, and high-level numerical operators. TensorBench grades each run by applying the agent's patch and running the framework's test suite, which includes the pre-existing randomized regression tests and any tests the agent adds. For feature-addition tasks, a pass means that the patched repository preserves the tested pre-existing behavior and satisfies the agent-added checks for the requested feature. We evaluate seven coding agents spanning three frontier model families and one open-weight model. Pass rates under this criterion range from $64.8\%$ for the strongest agent to $22.1\%$ for the weakest. Agents pass different subsets of tasks: pairwise Cohen's $κ$ ranges from $-0.07$ to $0.43$, with $κ= 0.05$ for the two strongest agents.
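The agreement statistic quoted above can be computed as in the short sketch below, which evaluates Cohen's kappa on two agents' per-task pass/fail outcomes. The outcome vectors are made-up toy data, not results from the benchmark.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of booleans
    (True = the agent passed the task)."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa = sum(a) / n  # pass rate of the first agent
    pb = sum(b) / n  # pass rate of the second agent
    # Agreement expected by chance if the two agents were independent.
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        return 1.0  # both agents are constant and identical
    return (observed - expected) / (1 - expected)

# Toy outcomes on six tasks (hypothetical data).
agent_a = [True, True, False, False, True, False]
agent_b = [True, False, True, False, True, False]
kappa = cohens_kappa(agent_a, agent_b)
```

A kappa near zero, as reported for the two strongest agents, means their pass sets overlap about as much as chance would predict given their pass rates.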

2606.05566 2026-06-05 cs.AI cs.CR (updated version)

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

Paulo Ricardo Ferreira Neves, Edson Rodrigues da Cruz Filho, Paulo Henrique Eleuterio Falsetti, João Vitor Pavan, Ian Degaspari, Henrique Vieira Laturrague, Patrick Vieira Laturrague, Guilherme Nielsen Dias, Marccello Wilson Perez Berto, Gustavo Voltani Von Atzingen

Affiliations: Quickium Technology Ltd.; Federal University of São Carlos (UFSCar); Federal Institute of Education, Science and Technology of São Paulo (IFSP)

AI summary: Proposes GuardNet, a guardrail system built on an ensemble of shallow neural networks (BiLSTMs) that pursues adversarial robustness through diverse example coverage and threshold calibration, reaching performance competitive with lightweight detectors at low latency.

Abstract

Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial information leakage, compromising performance estimates. This work presents GuardNet, a guardrail system based on an ensemble of shallow neural networks (BiLSTMs) with approximately 47 million parameters. We investigate the hypothesis that robustness in adversarial scenarios depends more on the diversity of example coverage and threshold calibration than on model scale. The results indicate that GuardNet achieves competitive performance compared with lightweight detectors and high efficiency at low latency, although larger LLMs such as Mistral-7B and Llama-3.1-8B still achieve superior performance in terms of F1 score and AUROC on the blind JBB-Behaviors benchmark. Nevertheless, GuardNet achieves an AUROC of 0.747 on the blind dataset (n = 200) and an F1 score of 0.92 on a proprietary benchmark (n = 50), under threshold calibration and evaluation with declared partial information leakage. The system operates with an average latency of approximately 50 ms on CPU, making it suitable for deployment in production environments with cost and infrastructure constraints.

2606.05563 2026-06-05 cs.AI cs.CL (updated version)

SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

Taewon Yun, Hyeonseong Park, Jeonghwan Choi, Hayoon Park, Yeeun Choi, Hwanjun Song

Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes the SoCRATES benchmark, which evaluates LLM mediators on realistic multi-domain conflict scenarios along five socio-cognitive adaptation axes. Its topic-localized evaluator reaches 0.82 agreement with human experts, and even the strongest model closes only about a third of the unmediated consensus gap.

Abstract

Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.

2606.05561 2026-06-05 cs.CL cs.AI (updated version)

InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

Xueyang Wu, Siyuan Liu, Kezhuo Yang, Guang Ling

Affiliations: Shenzhen NeurStar Inc., China; University of York, United Kingdom; Shanghai Jiao Tong University, China

AI summary: Proposes the InfoShield framework, which minimizes the mutual information between speech representations and sensitive attributes, reducing the risk of demographic information leakage while preserving depression classification performance.

Abstract

Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve this conflict. Adversarial training often fails against unseen threats, whereas Differential Privacy tends to compromise diagnostic performance by injecting noise across all features. This paper presents InfoShield, which minimizes mutual information between speech representations and sensitive attributes while preserving depression classification accuracy. We identify that standard MINE estimators struggle with sequential speech due to temporal-static misalignment, and introduce TimeAwareMINE with cross-modal attention to align acoustic frames with attribute embeddings. Experiments on the Androids Corpus show InfoShield reduces gender inference from 92.6\% to 55.5\% and age inference from 55.7\% to 30.3\% with limited utility loss (6\% F1 reduction), achieving F1=0.784 compared to prior SOTA's 0.723.
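For context on what a MINE estimator optimizes, the block below states the Donsker-Varadhan bound it is built on. The symbols ($Z$ for the speech representation, $A$ for the sensitive attribute, critic $T_\theta$) and the weighted-penalty form of the final objective with weight $\beta$ are assumptions made for illustration; the abstract does not give InfoShield's exact loss or the TimeAwareMINE architecture.

```latex
% Mutual information between representation Z and sensitive attribute A
I(Z; A) = D_{\mathrm{KL}}\bigl(p(z, a) \,\|\, p(z)\,p(a)\bigr)

% Donsker-Varadhan lower bound, maximized over a critic network T_\theta
I(Z; A) \;\ge\; \sup_{\theta}\; \mathbb{E}_{p(z, a)}\bigl[T_\theta(z, a)\bigr]
  - \log \mathbb{E}_{p(z)\,p(a)}\bigl[e^{T_\theta(z, a)}\bigr]

% Assumed overall objective: task loss plus a weighted penalty on the
% estimated mutual information (\beta is a hypothetical trade-off weight)
\min_{\text{encoder}} \; \mathcal{L}_{\text{task}} + \beta \, \hat{I}_\theta(Z; A)
```

The critic is trained to tighten the bound while the encoder is trained to lower it, which is why the estimator's fit to sequential speech matters for the privacy result.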

2606.05555 2026-06-05 cs.LG cs.AI (updated version)

Representation Learning Enables Scalable Multitask Deep Reinforcement Learning

Johan Obando-Ceron, Lu Li, Scott Fujimoto, Pierre-Luc Bacon, Aaron Courville, Pablo Samuel Castro

Affiliations: Mila – Québec AI Institute; Université de Montréal; McGill University; CIFAR AI Chair; Google DeepMind

AI summary: Evaluates MR.Q, a model-free algorithm that combines predictive representation learning with high-capacity value function approximation. Without planning, it outperforms a world-model-based method and a range of deep RL baselines on multitask continuous control while substantially reducing computational overhead.

Abstract

Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalability. We revisit this question and argue that the primary driver of scalable multitask RL is not model-based control, but \emph{representation learning}. In particular, we show that combining predictive, model-based representations with high-capacity value function approximation is sufficient to achieve strong performance, even without planning. We evaluate a simple model-free algorithm, MR.Q, coupled with auxiliary predictive objectives into a scalable actor-critic architecture. This approach outperforms a recent world-model-based method and a range of deep RL baselines across a diverse suite of multitask continuous control tasks, while significantly reducing computational overhead and improving wall-clock efficiency. We observe consistent improvements with increased model capacity and show through ablations that predictive representation learning is critical for performance.

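2606.05553 2026-06-05 cs.CL cs.AI (updated version)

ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

Woojung Song, Nalim Kim, Sangjun Song, Chaewon Heo, Jongwon Lim, Yohan Jo

Affiliations: Graduate School of Data Science, Seoul National University

AI summary: Proposes the ArcANE benchmark, which segments a narrative into phases along a Character Arc and evaluates whether role-playing language agents stay consistent with the character's psychological trajectory in each phase. Conditioning on the Character Arc is the best context strategy, especially on scenarios outside the source text.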

Abstract

Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story progresses, not maintain a fixed persona. Existing benchmarks measure factual recall at a given chapter, not whether responses align with the character's psychological trajectory, especially in scenarios the source text never explores. We introduce ArcANE (Arc-Aware Narrative Evaluation), an automatically constructed benchmark spanning 17 novels and 80 principal characters. A Character Arc segments the narrative into phases along a psychological axis, and each probe poses the same scenario across phases, spanning both situations within the source text and situations beyond it. Across six models and six context modes, conditioning on the Character Arc tops every other context strategy on every model, and the gap is largest on scenarios outside the source text where retrieval has nothing to find. We further fine-tune open-weight models on the same data to obtain ArcANE-8B/32B, which widen the Arc advantage even more on scenarios outside the source text.

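2606.05552 2026-06-05 cs.LG cs.AI cs.GR (updated version)

SciVisAgentSkills: Designing and Evaluating Agent Skills for Scientific Data Analysis and Visualization

Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Shusen Liu, Chaoli Wang

Affiliations: Univ. Notre Dame; LLNL (Lawrence Livermore National Laboratory)

AI summary: Presents SciVisAgentSkills, a skill library that augments coding agents by encoding environment assumptions, tool usage patterns, and domain heuristics, enabling natural-language-driven scientific visualization workflows on tools such as ParaView. Experiments show that skills raise task scores and affect token efficiency.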

AI中文摘要

近期智能体可视化的进展使得自然语言能够转化为可执行的科学可视化工作流。尽管通用编码智能体展现出强大能力，但它们往往缺乏科学可视化任务所需的特定工具专业知识。在这项工作中，我们提出了SciVisAgentSkills，这是一个可重用的智能体技能集合，通过编码环境假设、工具使用模式和跨科学工具（如ParaView、napari、VMD和TTK）的领域启发式知识，增强用于科学数据分析和可视化的编码智能体。我们使用SciVisAgentBench（一个包含108个专家设计的多步骤任务的基准测试）在Codex和Claude Code上评估这些技能。结果表明，智能体技能提高了评估套件中的平均任务得分，其token效率收益取决于智能体框架和工具设置。这些发现强调了结构化程序知识对于实现可靠、长周期科学可视化工作流的重要性，同时也表明技能应与加载和应用它们的执行框架一起研究。技能可在https://github.com/KuangshiAi/SciVisAgentSkills获取。

英文摘要

Recent advances in agentic visualization have enabled the translation of natural language into executable scientific visualization (SciVis) workflows. While general-purpose coding agents show strong capabilities, they often lack the tool-specific expertise required for SciVis tasks. In this work, we present SciVisAgentSkills, a collection of reusable agent skills that augment coding agents for scientific data analysis and visualization by encoding environment assumptions, tool usage patterns, and domain heuristics across scientific tools such as ParaView, napari, VMD, and TTK. We evaluate these skills on Codex and Claude Code using SciVisAgentBench, a benchmark of 108 expert-designed multi-step tasks. Results show that agent skills improve mean task scores across the evaluated suites, with token-efficiency benefits that depend on the agent harness and tool setting. These findings highlight the importance of structured procedural knowledge for enabling reliable, long-horizon SciVis workflows, while also showing that skills should be studied alongside the execution harness that loads and applies them. The skills are available at https://github.com/KuangshiAi/SciVisAgentSkills.

2606.05522 2026-06-05 cs.SD cs.AI eess.AS 版本更新

Exploring LLMs for South Asian Music Understanding and Generation

探索大语言模型对南亚音乐的理解与生成

Faria Binte Kader, Mohtasim Hadi Rafi, Shah Wasif Sajjad, Santu Karmaker

发表机构 * University of Central Florida（中佛罗里达大学）； Auburn University（奥本大学）

AI总结本文系统评估大语言模型在基于拉格和塔拉的南亚古典音乐理解与生成任务中的表现，发现前沿模型在理解任务上准确率达85-90%，但生成任务中风格忠实度仅40%。

Comments 19 pages, 7 figures

AI中文摘要

近年来，大语言模型（LLMs）在音乐理解和生成任务中展现出令人瞩目的成果。然而，现有研究仍局限于西方调性传统，未能揭示当前LLMs能否处理结构独特的低资源音乐传统。我们首次系统评估LLMs在南亚古典音乐中的能力——这种传统由拉格（raga）和塔拉（tala）的旋律约束主导，其结构原则与西方和声驱动音乐根本不同。我们的评估基于印度斯坦古典理论和孟加拉古典形式，包括拉宾德拉（Rabindra）和纳兹鲁尔（Nazrul）歌曲——南亚古典音乐中具有代表性的低资源传统。在音乐理解评估中，我们引入了一个包含504个问答的基准测试，涵盖拉格语法、文化知识和符号记谱推理，评估了33个LLMs，其中前沿模型如Gemini 2.5 Pro达到85-90%的准确率，而大多数开源模型仅在23-40%范围内。在音乐生成方面，我们设计了一个五级受控提示框架，发现即使最强的模型也只有40%的时间能产生风格忠实的输出。这些结果表明，音乐生成中的结构有效性和风格忠实度是不同的目标，并突显了文化基础音乐建模的一个开放挑战。

英文摘要

Recent advancements in Large Language Models (LLMs) have shown promising results in music understanding and generation tasks. However, existing works remain confined to Western tonal traditions, offering little insight into whether current LLMs can handle structurally distinct low-resource musical traditions. We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga, tala-based melodic constraints that impose fundamentally different structural principles from Western harmony-driven music. We ground our evaluation in Hindustani classical theory and Bengali classical forms, including Rabindra and Nazrul Sangeet -- representative low-resource traditions within South Asian classical music. For music understanding evaluation, we introduce a 504-question-answer benchmark spanning raga grammar, cultural knowledge, and symbolic notation reasoning, evaluating 33 LLMs where frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful outputs only 40% of the time. These results reveal that structural validity and stylistic faithfulness in music generation are distinct objectives and highlight an open challenge for culturally grounded music modeling.

2606.05513 2026-06-05 cs.AI cs.CL 版本更新

EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts

EpiEvolve：用于阶段转变下流式疫情预测的自演化智能体

Yiming Lu, Sihang Zeng, Zhengxu Tang, Max Lau, Fei Liu, Wei Jin

发表机构 * Emory University（埃默里大学）； University of Washington（华盛顿大学）

AI总结针对流式疫情预测中标签延迟和阶段转变问题，提出自演化智能体EpiEvolve，通过层次化情景记忆、延迟标签反思和阶段感知检索，在COVID-19住院趋势预测中达到0.629准确率，并将阶段转变后的恢复滞后从5周缩短至2周。

AI中文摘要

流行病LLM预测器通常作为静态监督模型进行训练和评估，而实际疫情预测是一个流式过程：标签在预测之后才到达，疾病流行阶段也随时间变化。我们在跨越五个变异株阶段的每周COVID-19住院趋势预测中研究这种不匹配。我们引入了EpiEvolve，一个自演化智能体，它封装了一个在预热期训练好的LLM预测器，并在流式过程中保持其权重固定。EpiEvolve的适应方式包括：将预测结果存储在层次化情景记忆中，对延迟到达的标签进行反思，检索与当前阶段相关的案例，并将重复出现的错误提炼为策略规则。由此产生的上下文让预测器在遵循防止未来信息泄漏的时间顺序协议的同时，在后续周中重用其自身的过去预测和结果。在流式数据集上，EpiEvolve达到了0.629的平均准确率，而静态骨干模型为0.561，外部CDC集成模型为0.325，并将阶段转变后的恢复滞后从5周缩短到2周。消融实验表明，反思、策略记忆和阶段感知检索各自对性能提升有贡献。

英文摘要

Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches 0.629 average accuracy, compared with 0.561 for the static backbone and 0.325 for the external CDC ensemble, and reduces recovery lag after regime shifts from 5 to 2 weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.

2606.05509 2026-06-05 cs.HC cs.AI 版本更新

The Role of Instructional Guidance in Generative AI-Assisted Learning: Empirical Evidence from Construction Engineering Education

教学指导在生成式AI辅助学习中的作用：来自建筑工程教育的实证证据

Xiaoyu Hou, Bo Xiao, Hexu Liu, Shane Mueller

发表机构 * Dept. of Civil, Environmental, and Geospatial Engineering, Michigan Technological Univ.（密歇根理工大学土木、环境与地理空间工程系）； Dept. of Civil and Construction Engineering, Western Michigan Univ.（西密歇根大学土木与建筑工程系）； Dept. of Psychology and Human Factors, Michigan Technological Univ.（密歇根理工大学心理学与人因工程系）

AI总结本研究通过引入基于生成学习理论的五步提示框架，在建筑工程教育中对比无提示AI辅助、有提示AI辅助和幻灯片学习三种条件，发现提示框架显著提升了需要解释和推理的任务表现（开放式评分提高约2-3分，p<0.01），表明AI辅助学习的有效性取决于交互结构。

AI中文摘要

生成式人工智能（AI）越来越多地被用于支持自主学习，然而学生与此类系统的交互往往缺乏结构性，限制了对更深层次认知过程的参与。本研究探讨了教学指导如何塑造建筑工程教育中学生与AI的交互。引入了一个基于生成学习理论（GLT）的五步提示框架，以指导学习者在复习活动中的交互。一项对照实验比较了三种学习条件：基于幻灯片的学习、无提示的AI辅助学习和有提示的AI辅助学习。学习表现通过多项选择和开放式任务进行评估，用户体验通过用户体验问卷（UEQ）测量。表现差异集中在需要解释和推理的任务上。有提示条件在开放式任务上得分更高，在18分量表上提高了约2或3分（p < 0.01），而多项选择表现无显著差异。无提示条件与基于幻灯片的学习相当。这些发现表明，AI辅助学习的有效性取决于交互如何结构化。所提出的框架为将学习科学原理整合到建筑工程教育的生成式AI系统中提供了基础。

英文摘要

Generative artificial intelligence (AI) is increasingly used to support self-directed learning, yet student interaction with such systems often remains unstructured, limiting engagement in deeper cognitive processes. This study examines how instructional guidance shapes student and AI interaction in construction education. A five-step prompting framework grounded in Generative Learning Theory (GLT) is introduced to guide learner interaction during review activities. A controlled experiment compares three learning conditions: slide-based learning, unprompted AI-supported learning, and prompted AI-supported learning. Learning performance is assessed using multiple-choice and open-ended tasks, and user experience is measured using the User Experience Questionnaire (UEQ). Performance differences are concentrated on tasks requiring explanation and reasoning. The prompted condition achieves higher open-ended scores, with an improvement of approximately 2 or 3 points on a scale of 18 (p < 0.01), while no significant differences are observed in multiple-choice performance. The unprompted condition remains comparable to slide-based learning. These findings indicate that the effectiveness of AI-supported learning depends on how interaction is structured. The proposed framework provides a basis for integrating learning science principles into generative AI systems for construction education.

2606.05494 2026-06-05 cs.CL cs.AI 版本更新

MASF: A Multi-Model Adaptive Selection Framework for Abstractive Text summarization

MASF：面向抽象式文本摘要的多模型自适应选择框架

Ahmed Alansary, Ali Hamdi

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出一种多模型自适应选择框架，通过集成多个微调的Transformer模型并基于自动评估指标选择最佳摘要，在CNN/DailyMail数据集上BERTScore达88.63%，优于GPT3-D2等大模型。

Comments 6 pages, 3 figures, IMSA2026

AI中文摘要

自动文本摘要因数字文本信息的快速增长而变得日益重要。本文提出一种多模型自适应摘要框架，旨在提高抽象式文本摘要的鲁棒性和质量。依赖单一模型往往导致在不同结构和主题的文章上摘要质量不一致。为解决这一局限，所提框架集成了多个微调的基于Transformer的摘要模型，并引入自适应选择机制。在该框架中，每个模型独立为同一输入文章生成候选摘要。然后使用自动评估指标评估生成的摘要，这些指标同时捕捉词汇相似性和语义相关性。基于这些分数，框架选择最高质量的摘要作为最终输出。模型在广泛使用的CNN/DailyMail新闻摘要数据集上进行微调和评估。实验结果表明，所提框架在所有比较方法中取得了最高的BERTScore，达到88.63%。它还优于多个大语言模型，如GPT3-D2、Falcon-7b和Mpt-7b，突显了其有效性和鲁棒性。这些发现强调了在自适应选择策略中利用多个基于Transformer的模型来提高自动文本摘要系统质量和鲁棒性的有效性。

英文摘要

Automatic text summarization has become increasingly important due to the rapid growth of digital textual information. This paper presents a Multi-Model Adaptive Summarization Framework designed to improve the robustness and quality of abstractive text summarization. Relying on a single model often leads to inconsistent summarization quality across articles with varying structures and topics. To address this limitation, the proposed framework integrates multiple fine-tuned transformer-based summarization models and introduces an adaptive selection mechanism. In this framework, each model independently generates a candidate summary for the same input article. The generated summaries are then evaluated using automatic evaluation metrics that capture both lexical similarity and semantic relevance. Based on these scores, the framework selects the highest-quality summary as the final output. The models are fine-tuned and evaluated on the widely used CNN/DailyMail news summarization dataset. Experimental results demonstrate that the proposed framework achieves the highest BERTScore among all compared methods with a score of 88.63%. It also outperforms several LLMs such as GPT3-D2, Falcon-7b, and Mpt-7b, highlighting its effectiveness and robustness. These findings highlight the effectiveness of leveraging multiple transformer-based models within an adaptive selection strategy to improve the quality and robustness of automatic text summarization systems.
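
The selection step described above (several models each propose a summary, a metric scores them, the top-scoring one is returned) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say what text the candidates are scored against or which metrics are combined, so the `reference` argument and the `unigram_f1` scorer are hypothetical stand-ins for a lexical-similarity metric.

```python
from collections import Counter
from typing import Callable, Sequence


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1; a crude, assumed stand-in for a lexical metric."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_best(
    candidates: Sequence[str],
    reference: str,
    score: Callable[[str, str], float] = unigram_f1,
) -> str:
    """Return the candidate summary with the highest score (hypothetical helper)."""
    if not candidates:
        raise ValueError("need at least one candidate summary")
    return max(candidates, key=lambda c: score(c, reference))
```

Any scorer with the same two-string signature, for instance one that blends a lexical and a semantic metric, could be passed as `score` without changing the selection logic.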

2606.05481 2026-06-05 cs.LG cs.AI eess.SP 版本更新

Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models

面向统一且数据高效的预测与健康管理：基于表格基础模型

Raffael Theiler, Lev Telyatnikov, Leandro Von Krannichfeldt, Olga Fink

发表机构 * IMOS Lab, EPFL（洛桑联邦理工学院IMOS实验室）

AI总结提出利用表格基础模型通过上下文学习处理工业时间序列，实现预测与健康管理（PHM）任务，在低数据场景下表现优异，并优于序列模型和梯度提升树。

AI中文摘要

数据驱动的预测与健康管理（PHM）利用时变状态监测数据来诊断系统状态并估计工程资产的剩余使用寿命。这些任务是维护规划的核心，但工业PHM数据通常是碎片化的、部分观测且标注不足，这阻碍了监督学习。基础模型提供了一条通往可重用预测系统的途径，然而大多数时间序列基础模型是为预测设计的，并假设长序列、连贯且规则采样。为弥补这一差距，我们提出了一个框架，利用上下文学习将表格基础模型应用于工业时间序列，并在多种PHM任务上对其进行评估。通过将原始单元级信号转换为表格行，我们展示了这些模型在多个任务（包括预测和诊断）上表现良好，且数据效率高。我们在统一的评估协议下，直接将其与序列模型、Transformer基线和梯度提升树进行比较。结果表明，表格基础模型在预测和诊断任务中取得了最佳平均排名。我们的发现进一步表明，基于PFN的模型在低数据场景下具有竞争力，时间上下文可以在表格表示中保留，且性能依赖于子采样下的代表性上下文构建。这些结果证明，表格基础模型为异构PHM问题提供了一个实用且通用的接口。

英文摘要

Data-driven Prognostics and Health Management (PHM) uses time-varying condition-monitoring data to diagnose system states and estimate remaining useful life in engineered assets. These tasks are central to maintenance planning, but industrial PHM data are often fragmented, partially observed, and poorly labeled, which hinders supervised learning. Foundation models offer a route toward reusable predictive systems, yet most time-series foundation models are designed for forecasting and assume long, coherent, regularly sampled sequences. To address this gap, we propose a framework for applying Tabular Foundation Models to industrial time series using in-context learning, and we evaluate them on a variety of PHM tasks. By converting raw unit-level signals into tabular rows, we show that these models perform well across multiple tasks - including prognostics, and diagnostics - and are highly data efficient. We compare them directly with sequence models, transformer baselines, and gradient-boosted trees under a common evaluation protocol. The results indicate that tabular foundation models achieve the best average ranks across prognostic and diagnostic tasks. Our findings further show that PFN-based models are competitive in low-data regimes, that temporal context can be preserved in the tabular representation, and that performance depends on representative context construction under subsampling. These results demonstrate that tabular foundation models provide a practical and general interface for heterogeneous PHM problems.

2606.05464 2026-06-05 cs.AI 版本更新

Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

大语言模型中在扩展搜索空间上的逐步优化类推理

Nicolás Astorga, Nabeel Seedat, Mihaela van der Schaar

发表机构 * University of Cambridge（剑桥大学）

AI总结本文提出OPT*任务族，通过可验证奖励训练和搜索引导策略，提升LLM在扩展搜索空间中的逐步优化推理能力。

AI中文摘要

可验证奖励训练改善了数学和编码推理，但这些领域仅涵盖了逐步决策的一部分。许多现实任务需要在众多有效备选方案中找到高价值的可行计划。我们引入OPT*，一个可扩展的优化风格任务族，用于沿复杂度轴训练和评估LLM的逐步优化类推理：每个任务提供可行性检查器和评估器，而复杂度参数扩展搜索空间，无需新的人工标签。这促使我们在两种机制下研究这些任务：(i) 求解器引导的在线策略优化，使用求解器作为部分状态的价值预言机，并应用基于排名的奖励塑造来强化更好的下一步；(ii) 当此类求解器不可用时，基于搜索的离线强化学习。理论上，我们将大搜索空间中的成功与推理者在每单位搜索预算中提取的信息联系起来。实证上，我们消融了使OPT*上搜索高效的要素，并表明在OPT*上训练改进了逐步优化类推理。

英文摘要

Verifiable reward training has improved mathematical and coding reasoning, but these domains capture only part of step-by-step decision making. Many real-world tasks require finding a high-value feasible plan among many valid alternatives. We introduce OPT*, a scalable family of optimization-style tasks for training and evaluating LLM step-by-step optimization-like reasoning along a complexity axis: each task provides a feasibility checker and evaluator, while a complexity parameter expands the search space without requiring new human labels. This motivates studying these tasks in two regimes: (i) solver-guided online policy optimization, which uses a solver as a value oracle for partial states and applies rank-based reward shaping to reinforce better next steps, and (ii) search-based offline RL when such solvers are unavailable. Theoretically, we relate success in large search spaces to the information a reasoner extracts per unit of search budget. Empirically, we ablate the ingredients that make search efficient on OPT* and show that training on OPT* improves step-by-step optimization-like reasoning.
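
One way to read "rank-based reward shaping" is that candidate next steps are rewarded by their relative order under the solver's value estimates rather than by the raw values. The sketch below illustrates that reading only; the function name, the [0, 1] scaling, the tie handling, and the assumption that higher solver values are better are all choices made here, not details taken from the paper.

```python
from typing import Sequence


def rank_rewards(values: Sequence[float]) -> list[float]:
    """Map solver values of candidate next steps to rewards in [0, 1] by rank.

    Assumes higher value is better. The best candidate gets 1.0, the worst 0.0,
    and tied candidates share the average of their ranks.
    """
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: values[i])  # indices, worst to best
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of candidates tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]
```

Because only the ordering matters, the resulting rewards are unchanged if the solver's values are rescaled by any increasing transformation.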

2606.05449 2026-06-05 cs.AI cs.GT econ.EM 版本更新

Insurance of Agentic AI

代理型人工智能的保险

Quanyan Zhu

发表机构 * Department of Electrical and Computer Engineering, New York University, Tandon School of Engineering（纽约大学坦登工程学院电气与计算机工程系）

AI总结本文分析了代理型AI带来的新型风险，提出了承保、定价、再保险和产品设计的框架，并构建了整合多种保险覆盖的协调架构。

AI中文摘要

代理型人工智能系统通过超越信息生成，扩展到自主规划、工具调用、决策执行以及对数字和物理环境的持续修改，正在改变风险格局。这些能力引入了新的风险敞口，这些敞口并不完全适合传统的保险类别，如网络、职业责任、产品责任或董事及高管责任保险。本文考察了新兴的代理型AI保险市场，并开发了一个框架来理解其承保、定价、再保险和产品设计的影响。我们将代理型AI描述为自主性和授权委托的连续体，强调信息输出与能够通过外部行动独立产生保险事件的系统之间的区别。我们分析了主要风险路径，包括幻觉、提示注入攻击、自主决策错误、模型漂移、依赖故障和网络物理伤害，并评估了现有保险产品如何适应这些风险敞口。本文进一步提出了一个基于风险暴露评估、情景分析、依赖映射和累积风险管理的精算框架，借鉴了网络保险的发展历程。最后，我们提出了一个协调的保险架构，通过明确的分配机制和专门的AI总限额，整合了网络、技术错误与遗漏、产品责任、性能保证以及明确的AI责任保险。分析表明，代理型AI保险的未来不在于单一的单线产品，而在于一个由改进的治理、透明度、遥测和监管清晰度支持的互补覆盖分层生态系统。

英文摘要

Agentic artificial intelligence (AI) systems are transforming the risk landscape by extending beyond information generation to autonomous planning, tool invocation, decision execution, and persistent modification of digital and physical environments. These capabilities introduce novel exposures that do not fit neatly within traditional insurance categories such as cyber, professional liability, product liability, or directors and officers coverage. This paper examines the emerging insurance market for agentic AI and develops a framework for understanding its underwriting, pricing, reinsurance, and product-design implications. We characterize agentic AI as a continuum of autonomy and delegated authority, emphasizing the distinction between informational outputs and systems capable of independently generating insured events through external actions. We analyze major risk pathways, including hallucinations, prompt-injection attacks, autonomous decision errors, model drift, dependency failures, and cyber-physical harms, and evaluate how existing insurance products are adapting to address these exposures. The paper further proposes an actuarial framework based on exposure assessment, scenario analysis, dependency mapping, and accumulation-risk management, drawing parallels to the evolution of cyber insurance. Finally, we present a coordinated insurance architecture that integrates cyber, technology errors and omissions, product liability, performance-warranty, and affirmative AI-liability coverages through explicit allocation mechanisms and dedicated AI aggregates. The analysis suggests that the future of agentic-AI insurance lies not in a single monoline product but in a layered ecosystem of complementary coverages supported by improved governance, transparency, telemetry, and regulatory clarity.

2606.05445 2026-06-05 cs.AI 版本更新

Brick-Composer: Using MLLMs for Assembly with Diverse Bricks

Brick-Composer：使用多模态大语言模型进行多样化积木组装

Jiateng Liu, Bingxuan Li, Zhenhailong Wang, Rushi Wang, Kaiwen Hong, Cheng Qian, Jiayu Liu, Denghui Zhang, Katherine Driggs-Campbell, Manling Li, Heng Ji

发表机构 * UIUC（伊利诺伊大学香槟分校）； Stevens Institute of Technology（史蒂文斯理工学院）； Northwestern University（西北大学）

AI总结本文提出Brick-Composer框架，通过人类设计火花、世界反馈和合成经验三种信号训练多模态大语言模型，解决积木组装中的积木选择和姿态估计问题，将步骤级组装成功率从低于1%提升至约15%。

Comments 10 Pages, 10 figures

AI中文摘要

我们梦想着AI代理能够读取任意设计，并从可重复使用的构建块中构建真实世界的物体。作为迈向这一愿景的第一步，我们研究多模态大语言模型（MLLMs）是否具备积木组装所需的视觉基础和空间推理能力。我们将积木组装形式化为一个序列决策问题，其中每一步涉及两个子任务：积木选择，从候选组件中识别目标积木；以及积木姿态估计，预测所选积木应放置的位置和方式。为支持这项研究，我们引入了BC-Bench（积木构建基准），这是第一个用于评估MLLMs在多样化积木组装中表现的基准。实验表明，当前最先进的MLLMs仍然远非可靠的构建者，在细粒度积木选择上挣扎，并且在精确姿态估计上失败。为弥补这一差距，我们提出了Brick-Composer，一个学习框架，通过三种互补信号赋予MLLMs组装技能：人类设计火花，提供富含可供性的构建演示；世界反馈，将预测动作锚定在视觉和物理后果中；以及合成经验，将学习扩展到现有物体设计之外。Brick-Composer将积木选择准确性提高了三倍以上，大幅减少了姿态估计误差，并将严格的步骤级组装成功率从低于1%提升至约15%。训练后，一个Qwen-3-8B模型能够正确完成一个完整物体高达42%的步骤，这表明MLLMs可以通过有针对性的、基于物理的学习获得组装能力。

英文摘要

We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem, where each step involves two subtasks: brick selection, identifying the target brick from candidate components, and brick pose estimation, predicting where and how the selected brick should be placed. To support this study, we introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks. Experiments show that current state-of-the-art MLLMs remain far from reliable builders, struggling with fine-grained brick selection and failing at precise pose estimation. To bridge this gap, we propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through three complementary signals: Human Design Sparks, which provide affordance-rich construction demonstrations; World Feedback, which grounds predicted actions in visual and physical consequences; and Synthetic Experience, which scales learning beyond existing object designs. Brick-Composer improves brick selection accuracy by over three times, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%. After training, a Qwen-3-8B can correctly compose up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.

2606.05444 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Multilingual Coreference Resolution via Cycle-Consistent Machine Translation

通过循环一致性机器翻译的多语言共指消解

Adriana-Valentina Costache, Eduard Poesina, Silviu-Florin Gheorghe, Paul Irofti, Radu Tudor Ionescu

发表机构 * Department of Computer Science, University of Bucharest（布加勒斯特大学计算机科学系）

AI总结提出一种利用循环一致性机器翻译生成或扩展训练数据的管道，通过BERT潜在空间余弦相似度评估翻译质量并加权损失函数，显著提升低资源语言的共指消解性能。

AI中文摘要

共指消解是一项核心的自然语言处理任务，具有广泛的下游应用，例如机器翻译、问答、文档摘要等。虽然该任务在英语中得到了充分研究，但其他语言（尤其是低资源语言）的共指消解关注相对较少。为了弥补这一差距，我们提出了一种新颖的共指消解管道，该管道利用从英语到目标低资源语言的机器翻译（MT）来生成或扩展训练数据。为了自动验证翻译样本的质量，我们将样本反向翻译，并通过BERT模型潜在空间中的余弦相似度评估与原始英语样本的相似性。得到的相似度分数被整合到损失函数中，以根据样本的MT循环一致性对训练样本进行加权。在四种低资源语言上的大量实验表明，我们的管道在共指消解中带来了显著的性能提升。此外，我们的管道使得在之前没有可用语料库的语言中也能实现准确的共指消解。

英文摘要

Coreference resolution is a core NLP task, having a broad range of downstream applications, e.g., machine translation, question answering, document summarization, etc. While the task is well-studied in English, comparatively less attention is dedicated to coreference resolution in other languages, especially low-resource ones. To mitigate this gap, we propose a novel coreference resolution pipeline that harnesses machine translation (MT) from English to a target low-resource language, to generate or expand training data. To automatically validate the quality of the translated samples, we back-translate the samples and assess the similarity with the original English samples via cosine similarity in the latent space of a BERT model. The resulting similarity scores are integrated into the loss function to weight training samples according to their MT cycle consistency. Extensive experiments on four low-resource languages show that our pipeline brings significant performance gains in coreference resolution. Moreover, our pipeline enables accurate coreference resolution in languages where no previous corpora were available.
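
The weighting idea in this abstract can be illustrated with a small sketch: compute the cosine similarity between the embedding of an original sample and the embedding of its back-translation, then use that score as the sample's weight in the training loss. The abstract does not give the exact formula, so the normalized weighted mean below, the clipping of negative similarities to zero, and both function names are assumptions; the embeddings are plain lists standing in for BERT vectors.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-sample losses (assumed form of the weighting).

    Negative weights are clipped to zero so that a poorly round-tripped
    sample can at most be ignored, never rewarded.
    """
    clipped = [max(0.0, w) for w in weights]
    total = sum(clipped)
    if total == 0:
        return 0.0
    return sum(l * w for l, w in zip(losses, clipped)) / total
```

Under this form, a sample whose back-translation drifts far from the original contributes proportionally less to each update, while a perfectly consistent sample keeps full weight.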

2606.05436 2026-06-05 cs.AI cs.CL cs.IR 版本更新

评估美国超大规模数据中心的碳排放与能源消耗

Gianluca Guidi, Francesca Dominici, Tiziano Squartini, Callaway Sprinkle, Jonathan Gilmour, Kevin Butler, Eric Bell, Scott Delaney, Falco J. Bargagli-Stoffi

发表机构 * Department of Biostatistics, Harvard T.H. Chan School of Public Health（哈佛大学陈曾熙公共卫生学院生物统计学系）； Department of Computer Science, University of Pisa（比萨大学计算机科学系）； IMT School of Advanced Studies, Lucca（卢卡IMT高等研究院）； Environmental Systems Research Institute（环境系统研究所）； Baxtel（Baxtel公司）； Department of Environmental Health, Harvard T.H. Chan School of Public Health（哈佛大学陈曾熙公共卫生学院环境健康系）； Department of Biostatistics, UCLA Fielding School of Public Health（加州大学洛杉矶分校菲尔丁公共卫生学院生物统计学系）

AI总结本研究通过收集403个美国超大规模数据中心设施级数据，估算其电力消耗、电力来源及二氧化碳排放，发现其电力需求约占美国总用电量的1.8%，且碳强度高于全国平均水平48%。

AI中文摘要

美国超大规模数据中心（HDCs）的快速扩张，主要由人工智能的采用驱动，引发了人们对该行业环境足迹的担忧。我们汇编了2024年5月至2025年4月期间运营的403个美国超大规模数据中心的设施级信息，并估算了它们的电力消耗、电力来源及可归因的二氧化碳排放。在不同的设施负载情景下，这些HDC消耗了约68-99太瓦时的电力，并产生了约3700-5400万吨二氧化碳。在中心情景下，HDC电力需求约占美国总用电量的1.8%，其中约54%的归因发电由化石燃料来源提供。HDC电力加权平均碳强度约为545克二氧化碳/千瓦时，比同期美国国家电网平均碳强度370克二氧化碳/千瓦时高出约48%。我们的方法提供了一种归因工具，利用最新的EPA eGRID电厂级数据评估超大规模数据中心的环境足迹。

英文摘要

The rapid proliferation of hyperscale data centers (HDCs) in the US, mainly driven by the adoption of artificial intelligence, has raised concerns about this industry's environmental footprint. We compiled facility-level information on 403 US hyperscale data centers operating between May 2024 and April 2025 and estimated their electricity consumption, electricity sources, and attributable CO2 emissions. Across different facility-load scenarios, these HDCs consumed approximately 68-99 TWh of electricity and were associated with about 37-54 million metric tons of CO2. Under the central scenario, HDC electricity demand corresponded to approximately 1.8% of total US electricity consumption, with roughly 54% of attributed generation supplied by fossil-fuel sources. The HDC electricity-weighted average carbon intensity was approximately 545 gCO2/kWh, about 48% above the contemporaneous US national grid-average carbon intensity of 370 gCO2/kWh. Our approach provides an attributional tool for assessing the environmental footprint of hyperscale data centers using the most recent EPA eGRID plant-level data.
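
The headline figures in this abstract can be cross-checked with simple unit arithmetic: energy in TWh times carbon intensity in gCO2/kWh gives emissions in million metric tons. The sketch below uses only the rounded numbers quoted above. Applying the single 545 gCO2/kWh intensity to both ends of the 68-99 TWh range is a simplification made here for illustration; the paper's own scenario-level calculation may differ.

```python
def emissions_mt(energy_twh: float, intensity_g_per_kwh: float) -> float:
    """Convert energy (TWh) and carbon intensity (gCO2/kWh) to million metric tons CO2."""
    kwh = energy_twh * 1e9             # 1 TWh = 1e9 kWh
    grams = kwh * intensity_g_per_kwh  # total grams of CO2
    return grams / 1e12                # 1 million metric tons = 1e12 g


low = emissions_mt(68, 545)   # lower load scenario, about 37 Mt
high = emissions_mt(99, 545)  # upper load scenario, about 54 Mt
ratio = 545 / 370             # HDC intensity relative to the national grid average

print(f"{low:.1f}-{high:.1f} Mt CO2, intensity ratio {ratio:.2f}")
```

The result reproduces the reported 37-54 million metric tons. The intensity ratio from the rounded values comes out near 1.47, which is in line with the stated "about 48% above" once rounding of the inputs is taken into account.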

2606.05415 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Executable Schema Contracts: From Automatic Ingestion to Multi-Source Retrieval

可执行模式合约：从自动摄入到多源检索

Padmaja Jonnalagedda, Yuguang Yao, Xiang Gao, Hilaf Hasson, Kamalika Das

发表机构 * Intuit AI Research（Intuit AI研究）

AI总结提出一种自动从多源数据中发现可执行模式并将其作为共享合约的系统，通过模式约束的检索路由和结构化分析提升多源问答性能。

Comments 9 pages, 4 figures, plus supplementary appendix

AI中文摘要

现实世界的数据跨越表格、文档和半结构化文件，具有隐式语义。查询这些数据需要跨不一致的模式和格式整合证据，但现有方法要么需要昂贵的人工工程，要么完全绕过结构。我们提出一个系统，自动从原始多源数据中发现可执行模式，并将其用作知识图谱构建和查询时检索的共享合约。一个封闭世界的字段目录将基于LLM的模式发现限制在已证实的字段上；确定性结构分析推断身份键、外键和源层次结构；由此产生的模式驱动提取、去重和跨源链接，形成具有溯源意识的知识图谱。在查询时，该模式（可选地通过单调协议扩展）调节一个多工具代理，该代理在结构化查找、图遍历和向量搜索之间路由检索，返回带有可追溯引用的有根据的答案。在使用相同LLM、数据和评估框架的受控零样本比较中，该系统在四个QA基准上优于仅检索和基于分解的基线，消融实验表明模式条件路由、结构智能和模式引导构建各自贡献了性能提升。

英文摘要

Real-world data spans tables, documents, and semi-structured files with implicit semantics. Querying this data requires integrating evidence across inconsistent schemas and formats, yet existing approaches either demand costly manual engineering or bypass structure entirely. We present a system that automatically discovers an executable schema from raw multi-source data and uses it as a shared contract for knowledge graph construction and query-time retrieval. A closed-world field catalog constrains LLM-based schema discovery to attested fields; deterministic structural analysis infers identity keys, foreign keys, and source hierarchy; and the resulting schema drives extraction, deduplication, and cross-source linking into a provenance-aware knowledge graph. At query time the schema -- optionally extended via a monotonic protocol -- conditions a multi-tool agent routing retrieval across structured lookup, graph traversal, and vector search, returning grounded answers with traceable citations. In controlled zero-shot comparisons using the same LLM, data, and evaluation harness, the system improves over retrieval-only and decomposition-based baselines across four QA benchmarks, with ablations showing that schema-conditioned routing, structural intelligence, and schema-guided construction each contribute to the gains.

2606.05414 2026-06-05 cs.CL cs.AI cs.HC cs.LG 版本更新

When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories

当证据稀疏时：对话和LLM-Agent轨迹中的弱监督早期失败预警

Avinash Baidya, Xinran Liang, Ruocheng Guo, Xiang Gao, Kamalika Das

发表机构 * Intuit AI Research（Intuit AI研究院）； Princeton University（普林斯顿大学）

AI总结针对对话和LLM-Agent轨迹中早期失败预警问题，提出一种两阶段方法，通过注意力机制从稀疏的轨迹级标签中学习回合级失败证据，并结合α-STOP策略实现可控的早期预警，在多个基准上显著提升帕累托前沿质量并降低训练成本。

Comments 9 pages, 14 figures, and appendix

AI中文摘要

早期失败预警需要在对话或智能体轨迹尚未完成时，决定是否将其标记为可能失败。这具有挑战性，因为监督信号通常仅以轨迹级成功/失败标签的形式提供，而预警必须从部分交互中发出。先前的早期分类方法通常通过将终端标签分配给每个前缀来弥合这一差距，将每个回合视为失败证据。我们假设这种前缀标签假设与多轮语言交互不匹配，因为最终失败的证据是稀疏且常常延迟的。在本文中，我们引入了一种两阶段方法，从这种稀疏证据结构中学习，并使用由此产生的风险估计进行可控的早期预警。具体来说，我们的基于注意力的失败预测器从轨迹标签中学习稀疏的回合级失败证据，并利用它从部分历史中估计失败风险。然后，我们将该预测器与α-STOP配对，这是一种单一偏好条件停止策略，在推理时选择准确率-早期性的操作点，而不是为每个偏好训练单独的触发器。在涵盖客户支持、任务导向对话、说服、工具使用和规划的五个基准上，我们首先表明高相关性失败证据仅占回合的4.7-11.3%，并且平均在轨迹的59.0-83.6%之后首次出现。我们进一步表明，基于注意力的预测器将帕累托前沿质量（超体积）比朴素前缀监督提高了1-10%，并且完整系统将前沿质量比最先进的触发器策略提高了3-42%，同时将每个操作点的训练成本降低了1-3个数量级。

英文摘要

Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level success/failure label while alerts must be raised from partial interactions. Prior early-classification methods often bridge this gap by assigning the terminal label to every prefix, treating every turn as failure evidence. We hypothesize that this prefix-label assumption is poorly matched to multi-turn language interactions, where evidence of eventual failure is sparse and often delayed. In this paper, we introduce a two-stage approach that learns from this sparse evidence structure and uses the resulting risk estimates for controllable early alerting. Specifically, our attention-based failure predictor learns sparse turn-level failure evidence from trajectory labels and uses it to estimate failure risk from partial histories. We then pair this predictor with α-STOP, a single preference-conditioned stopping policy that selects an accuracy-earliness operating point at inference time rather than training a separate trigger for each preference. Across five benchmarks spanning customer support, task-oriented dialog, persuasion, tool use, and planning, we first show that high-relevance failure evidence occupies only 4.7-11.3% of turns and first appears after 59.0-83.6% of trajectories on average. We further show that the attention-based predictor improves Pareto-frontier quality (hypervolume) by 1-10% over naive prefix supervision, and that the full system improves frontier quality by 3-42% over state-of-the-art trigger policies while reducing training cost per operating point by 1-3 orders of magnitude.

2606.05413 2026-06-05 cs.LG cs.AI 版本更新

CausalPOI: Spatio-Temporal Graph-Based Causal Modeling for Cold-Start POI Check-in Forecasting

CausalPOI：基于时空图因果建模的冷启动POI签到预测

Zhaoqi Zhang, Miao Xie, Yi Li, Linyou Cai, Siqiang Luo, Gao Cong

发表机构 * Nanyang Technological University（南洋理工大学）； China Agricultural University（中国农业大学）； Meituan（美团）

AI总结提出CausalPOI框架，利用时空功能交互图建模POI间语义和空间关系，通过结构对齐的处理和对照图模拟事实与反事实场景，解决冷启动POI签到预测问题，在真实数据集上显著优于基线。

Comments Accepted at KDD 2026

DOI: 10.1145/3770855.3817641

AI中文摘要

随着城市环境的快速演变，准确建模兴趣点（POI）的动态行为对于支持数据驱动的城市规划和商业决策至关重要。尽管时空图学习的最新进展改进了POI预测，但大多数方法依赖于基于邻近性的图和相关性驱动建模，忽略了POI之间的功能依赖关系，且未能捕捉城市干预的因果效应。本文引入了一个新的研究问题——冷启动POI签到预测，旨在通过建模新引入POI的时间演化及其与附近POI在结构化城市空间背景下的功能交互，预测其未来的签到模式。为应对这些挑战，我们提出了CausalPOI，一个基于时空图的因果表示学习框架。CausalPOI利用时空功能交互图建模POI之间的语义和空间关系，并构建结构对齐的处理图和对照图以模拟事实和反事实场景。在真实SafeGraph数据集上的大量实验表明，CausalPOI在各方面显著优于最先进的基线，验证了其在时空预测、语义交互建模和因果效应估计方面的有效性，为城市干预分析提供了更可解释和可操作的基础。源代码可在Github获取。

英文摘要

As urban environments continue to evolve rapidly, accurately modeling the dynamic behaviour of Points of Interest is essential for supporting data-driven urban planning and commercial decision-making. While recent advancements in spatio-temporal graph learning have improved POI forecasting, most methods rely on proximity-based graphs and correlation-driven modeling, which overlook the functional dependencies between POIs and fail to capture the causal effects of urban interventions. In this paper, we introduce a novel research problem -- cold-start POI check-in forecasting, which aims to predict the future check-in pattern of a newly introduced POI, by modeling its temporal evolution and functional interactions with nearby POIs in a structured urban spatial context. To address these challenges, we propose CausalPOI, a spatio-temporal graph-based causal representation learning framework. CausalPOI leverages Spatio-Temporal Functional Interaction Graph to model semantic and spatial relationships between POIs, and constructs structurally aligned treatment and control graphs to simulate factual and counterfactual scenarios. Extensive experiments on real-world SafeGraph datasets demonstrate that CausalPOI significantly outperforms state-of-the-art baselines across the board, validating its effectiveness in spatio-temporal forecasting, semantic interaction modeling, and causal effect estimation, providing a more interpretable and actionable foundation for urban intervention analysis. Source code is available at Github.


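2606.05411 2026-06-05 cs.AI cs.HC (version update)

A Motivational Architecture for Conversational AGI

Anna Mikeda, Ben Goertzel

Affiliations: Glass Umbrella; SingularityNET

AI summary: This paper proposes a conversational motivational architecture that reinterprets the OpenPsi motivational lineage in dialogue-native terms and couples it to MetaMo's higher-level motivational scaffold. A ten-stage motivational processing pipeline, a dual decision strategy, and a functional distinction between pre-action feelings and post-action emotions let a conversational agent regulate drives such as competence, uncertainty reduction, and affiliation.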

Comments 16 pages. Accepted for AGI-26 proceedings

Abstract

Motivational architectures in cognitive AI have largely been designed for physical agents regulating bodily needs. Conversational agents operate in a different regime: their sensorimotor loop is linguistic, their environment is a user's evolving mental state, and their consequential actions are speech acts, tool invocations, and strategic silences. This paper proposes a conversational reinterpretation of the OpenPsi motivational lineage, coupled to MetaMo's higher-level motivational scaffold, for agents built on a modular execution substrate. Homeostasis is recast in dialogue-native terms: the agent regulates competence, uncertainty reduction, affiliation, affinity, legitimacy, nurturing, and aesthetic coherence rather than bodily deficits. We propose three contributions: a ten-stage motivational processing pipeline that architecturally separates cognitive modulation from situational appraisal; a dual decision strategy blending urgency-driven fast response with deliberative multi-goal optimization; and an architecturally useful distinction between pre-action feelings and post-action emotions as functionally different forms of affect. We specialize the framework to two example agents -- CompanionAgent and ResearchAgent -- and sketch its extension to social robotics and domain-generic human-level AGI.


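2606.05408 2026-06-05 cs.AI cs.NE (version update)

Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

Can Gurkan, Forrest Stonedahl, Uri Wilensky

Affiliations: Northwestern University; Augustana College

AI summary: Studies whether an LLM that repeatedly mutates a program without selection pressure explores new forms or circles back to old ones. LLM mutation consistently converges toward restricted attractor regions; at the structural level, in 87% of chains over 93% of mutations revisit a previously seen structural form.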

Comments Accepted to the Genetic and Evolutionary Computation Conference (GECCO '26) Workshop on Large Language Models for and with Evolutionary Computation

Abstract

When an LLM repeatedly mutates a program, does it explore new forms or circle back to the same ones? We study this question by analyzing LLM-driven mutation chains in the absence of selection pressure within a domain-specific language, varying prompt design, model family, and stochastic replication. We find that LLM-based mutation consistently converges toward restricted attractor regions in program space. Convergence is especially severe at the structural level: in 87% of chains, over 93% of mutations revisit a previously seen structural form, with most variation confined to terminal substitutions within recurring templates. Cycle analysis reveals short cycles and self-loops dominating the transition structure. The rate of convergence varies with prompt wording and model choice, but the phenomenon is robust across conditions. A classical GP subtree mutation operator does not exhibit comparable convergence, suggesting that the effect is intrinsic to the LLM mutation pipeline. These findings reveal a tension at the heart of LLM-driven program evolution: the same capabilities that enable semantics-aware program transformation also carry a systematic bias toward structural homogeneity that must be accounted for if such systems are to sustain open-ended exploration. Source code is available at https://github.com/can-gurkan/lmca.
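The structural revisit statistic quoted in the abstract (the share of mutations that land on a previously seen structural form) can be illustrated with a minimal sketch. The function name `revisit_rate`, the string encoding of structural forms, and the example chain are all hypothetical; the paper's actual canonicalization of programs is not described here.

```python
def revisit_rate(chain):
    """Fraction of mutation steps whose result was already seen earlier in the chain."""
    if not chain:
        return 0.0
    seen = {chain[0]}          # structural form of the seed program
    revisits = 0
    for form in chain[1:]:     # each later entry is the result of one mutation
        if form in seen:
            revisits += 1
        seen.add(form)
    steps = len(chain) - 1
    return revisits / steps if steps else 0.0


# Hypothetical chain of structural "shapes" with terminals abstracted away.
chain = ["(if A B)", "(if A B)", "(seq A B)", "(if A B)", "(seq A B)"]
print(revisit_rate(chain))
```

A self-loop (a mutation that returns the same form) counts as a revisit in this sketch, which matches the abstract's observation that self-loops and short cycles dominate the transition structure.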


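2606.05404 2026-06-05 cs.AI cs.CL cs.LG (version update)

Harnessing Generalist Agents for Contextualized Time Series

Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He

Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes TimeClaw, a harness framework that integrates executable temporal tools, experience-driven capability evolution, and episodic multimodal memory to give generalist LLM agents contextualized temporal reasoning, and reports improved performance on benchmarks spanning energy, finance, and other domains.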

Comments Preprint. 38 Pages

Abstract

Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw.


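2606.05403 2026-06-05 cs.LG cs.AI (version update)

Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

Rohan N. Pradhan, Steve Goley

Affiliations: Amazon

AI summary: Studies whether language models evaluate evidence quality during multi-source synthesis. Models can detect fabricated statistics in isolation but do not recruit that capability during synthesis; source influence is instead governed by a methodology-register gate, and numeric-validity signals are suppressed.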

Abstract

Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions. Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation, remains poorly understood. We show that models possess the capability to detect fabricated statistics (correct identification rates of 0.76-1.00 for methodology in isolation) but do not recruit this capability during multi-source synthesis, producing similar numeric estimates whether the statistics are fabricated or valid. Specifically, source influence is governed by a methodology-register gate that responds to the distributional register of analytical text but not to numeric validity: for example, statistically impossible confidence intervals receive the same weight as valid ones. The behavioral dissociation replicates across five models from three families (Claude, Qwen, OLMo) and three professional domains. Mechanistic analyses, including causal tracing, linear probes, and component-level attribution, converge on the same account: the model encodes and causally uses a methodology-register representation that transfers across domains (probe AUC 0.83-0.92), while numeric-validity signals, decodable in isolation, are suppressed to chance during multi-source synthesis. Prompting-based mitigations, even an oracle checklist naming the exact statistical checks, produce blanket skepticism rather than selective discernment, and the post-training pipelines we examine reinforce the stylistic shortcut without building numeric verification. Unlike sycophancy, which tracks user preference, this failure tracks whether a source presents as analytically credible, not whether its claims are internally consistent. We term this epistemic alignment: like preference and safety alignment, the question is not capability but deployment.
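To make the notion of a "statistically impossible confidence interval" concrete, here is a minimal sketch of the kind of internal-consistency check the abstract says models fail to apply during synthesis. The function `ci_is_plausible` and its three rules are illustrative assumptions, not the paper's checklist; they cover only simple cases such as a reversed interval, a point estimate outside its own interval, or a proportion outside its valid range.

```python
def ci_is_plausible(estimate, lower, upper, bounds=None):
    """Return True if a reported confidence interval passes basic consistency checks."""
    if lower > upper:                       # interval endpoints are reversed
        return False
    if not (lower <= estimate <= upper):    # point estimate lies outside its own interval
        return False
    if bounds is not None:                  # e.g. (0.0, 1.0) for a proportion
        lo, hi = bounds
        if lower < lo or upper > hi:
            return False
    return True


print(ci_is_plausible(0.42, 0.35, 0.49, bounds=(0.0, 1.0)))  # consistent report
print(ci_is_plausible(0.42, 0.50, 0.60))                     # estimate outside interval
```

A check like this is purely numeric: it ignores how analytically polished the surrounding text sounds, which is the signal the abstract reports models relying on instead.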


2606.05402 2026-06-05 cs.CL cs.AI (version update)

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

Jinu Lee, Shivam Agarwal, Amruta Parulekar, Siddarth Madala, Dilek Hakkani-Tur, Julia Hockenmaier

Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes ReasoningFlow, a framework that models the reasoning traces of large reasoning models as fine-grained directed acyclic graphs. Manual and automatic annotation reveal structural similarity across models, diverse reasoning behaviors, and how erroneous steps relate to the final answer.

Abstract

Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: https://github.com/jinulee-v/reasoningflow.
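Finding (3), that most erroneous steps are not used to derive the final answer, amounts to a reachability question on the trace DAG. The sketch below assumes a hypothetical encoding in which `parents` maps each step to the steps it draws on; the step names and the `steps_used_by` helper are illustrative and do not reflect the ReasoningFlow annotation schema.

```python
def steps_used_by(final_step, parents):
    """Return every step the final step depends on, directly or transitively."""
    used, stack = set(), [final_step]
    while stack:
        step = stack.pop()
        for parent in parents.get(step, []):
            if parent not in used:
                used.add(parent)
                stack.append(parent)
    return used


# Hypothetical trace: s3 is a dead-end branch that the answer never builds on.
parents = {"s2": ["s1"], "s3": ["s1"], "s4": ["s2"], "answer": ["s4"]}
erroneous = {"s3"}
unused_errors = erroneous - steps_used_by("answer", parents)
print(unused_errors)
```

An erroneous step that appears in `unused_errors` was abandoned before it could influence the answer, which is the pattern the abstract reports as typical.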


2606.05400 2026-06-05 cs.AI cs.CL cs.LG (version update)

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

Yuanhe Zhang, Yuekai Sun, Taiji Suzuki, Jason D. Lee, Fanghui Liu

Affiliations: Department of Statistics, University of Warwick, UK; Center for Advanced Intelligence Project, RIKEN, Japan; Department of Statistics, University of Michigan, USA; Department of Mathematical Informatics, The University of Tokyo (also Center for Advanced Intelligence Project, RIKEN, Japan); Department of Electrical Engineering and Computer Sciences (also Department of Statistics), University of California, Berkeley, USA; School of Mathematical Sciences, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, China

AI summary: Proposes LeanMarathon, a multi-agent harness that uses a blueprint abstraction and a two-stage orchestrator for reliable long-horizon autoformalization of research mathematics; it formalizes all seven target theorems across four Erdős problems.

Comments 26 pages, 9 figures. Comments are welcome

Abstract

Long-horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.
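Discharging a proof DAG "from its leaves upward in parallel rounds" resembles a layered topological sort: each round contains the lemmas whose dependencies are already proved. The sketch below is a static simplification under that assumption; the function `discharge_rounds` and the lemma names are hypothetical, and it does not model the CI gating or the dynamically changing leaves the abstract mentions.

```python
def discharge_rounds(deps):
    """Group lemmas into rounds; a lemma is ready once all its dependencies are proved."""
    remaining = {name: set(d) for name, d in deps.items()}
    proved, rounds = set(), []
    while remaining:
        ready = sorted(n for n, d in remaining.items() if d <= proved)
        if not ready:
            raise ValueError("cyclic or missing dependency")
        rounds.append(ready)        # everything in one round could be attempted in parallel
        proved.update(ready)
        for name in ready:
            del remaining[name]
    return rounds


# Hypothetical blueprint: two leaf lemmas feed one intermediate lemma, then the theorem.
deps = {
    "lemma_a": [],
    "lemma_b": [],
    "lemma_c": ["lemma_a", "lemma_b"],
    "main_theorem": ["lemma_c"],
}
print(discharge_rounds(deps))
```

Because each round depends only on earlier rounds, a failed proof attempt can be retried locally without redoing the rest of the run, which is the recoverability property the abstract emphasizes.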


2606.05396 2026-06-05 cs.CR cs.AI cs.SE (version update)

Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

Cristina Carleo, Pietro Liguori, Naghmeh Ivaki, Domenico Cotroneo

Affiliations: University of Naples Federico II; University of Coimbra; University of North Carolina at Charlotte

AI summary: A preliminary feasibility study of abliteration, a low-rank weight edit, applied to code LLMs to remove their refusal of vulnerability-injection prompts and thereby separate willingness from code-generation capability. After abliteration, refusal drops to zero or near zero while syntactic validity stays above 93%, but the injection rate remains bounded by model capacity.

Abstract

Producing a labeled vulnerable code at scale is a recurring obstacle for learning-based vulnerability detection: mined corpora carry substantial label noise, and existing LLM-based augmentation propagates these inaccuracies because it transforms vulnerable seeds rather than synthesising vulnerabilities from a specification. A complementary route is to start from safe code and ask an instruction-tuned LLM to inject a specified CWE (which would shift the labeling burden from open-ended detection to bounded binary confirmation) but safety-aligned code LLMs systematically refuse such prompts. This paper is a preliminary feasibility study of abliteration, a low-rank weight edit that orthogonally projects out the refusal direction in the residual stream, as a tool to remove this barrier. We use Python and CWE-89 (SQL injection) as a case study, evaluating the Qwen2.5-Coder-Instruct family at 3B, 7B, and 14B parameters on safe samples drawn from PromSec and SafeCoder, replicated three times per condition. We find that (i) refusal on injection prompts is strongly size- and prompt-context-dependent: the 14B refuses 100% of prompts, the 7B refuses 73% of PromSec but only 5% of SafeCoder, whereas the 3B is essentially never blocked; (ii) abliteration reduces refusal to zero or near-zero across all sizes while leaving syntactic validity above 93%, supporting the view that, in this setting, refusal can be detached from measured code-generation capability; and (iii) the post-abliteration injection rate remains capacity-bound (88-97% on the 14B, 89-90% on the 7B, and 25-48% on the 3B) separating willingness, which abliteration unlocks, from capability, which scales with parameters. Vulnerability verdicts are produced by a three-tool detector ensemble (CodeQL, Semgrep, Bandit) followed by manual adjudication by two authors on detector-positive outputs.
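The core operation the abstract describes, orthogonally projecting a direction out of the residual stream, can be written as W' = (I - r rᵀ) W for a unit vector r. The pure-Python sketch below applies that projection to the columns of a small matrix. The helper names, the toy matrix, and the assumption that the matrix's columns live in residual-stream space are illustrative; which weights the paper edits, and how the refusal direction is estimated, are not specified in the abstract.

```python
import math


def project_out(vector, direction):
    """Remove the component of `vector` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(v * u for v, u in zip(vector, unit))
    return [v - coeff * u for v, u in zip(vector, unit)]


def ablate_columns(weight, direction):
    """Project the direction out of every column of a matrix given as a list of rows."""
    rows, cols = len(weight), len(weight[0])
    columns = [[weight[r][c] for r in range(rows)] for c in range(cols)]
    cleaned = [project_out(col, direction) for col in columns]
    return [[cleaned[c][r] for c in range(cols)] for r in range(rows)]


# Toy example: the hypothetical "refusal direction" is the first coordinate axis.
weight = [[1.0, 2.0], [3.0, 4.0]]
print(ablate_columns(weight, [1.0, 0.0]))
```

After the edit, no input can make this matrix write along the removed direction, while components orthogonal to it are left unchanged. That is consistent with the abstract's observation that refusal disappears while syntactic validity is largely preserved.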


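2605.04135 2026-06-05 cs.CY cs.AI cs.CL (version update)

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

David Gringras, Misha Salahshoor

Affiliations: Harvard University; AISST

AI summary: A pre-registered audit of 112,303 LLM-keyword-matched candidate records finds that the median paper evaluates a model 10.85 ECI behind the contemporaneous frontier (about 1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5), with the gap widening by 5.53 ECI per year. Only 3.2% of abstracts disclose reasoning-mode status, and 52.5% state their conclusions at the level of "AI" rather than the evaluated model. Proposed remedies include the VERSIO-AI checklist.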

Comments v2. 65 pp, 9 figs, 8 tables, 8 appendices. Pre-registered on OSF: doi.org/10.17605/OSF.IO/7XM3D. Code+data: doi.org/10.5281/zenodo.20060457. VERSIO-AI v1.2 reporting checklist (Appendix A): doi.org/10.5281/zenodo.20060459. frontierlag package + per-DOI audit tool: frontierlag.org

Abstract

Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-3.5 or GPT-4 zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about "AI" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena Elo and Artificial Analysis. The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into ~25% peer-review latency, ~75% excess lag. The gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Meanwhile, only 3.2% of abstracts (21.2% of full-texts) disclose reasoning-mode status on reasoning-capable models (H4) and 52.5% (95% CI [48.2, 56.9]) state conclusions at the level of "AI" rather than the evaluated model(s), rising at OR = 1.23/year. Proposed remedies include API-access subsidies and editorial enforcement of reporting frameworks mandating configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, prompting, etc.); VERSIO-AI is a 13-item checklist (Core 3 desk-reject) extending existing frameworks at the elicitation surface, with per-DOI analysis at frontierlag.org.


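2605.02395 2026-06-05 cs.AI (version update)

Controllable and Verifiable Process Data Synthesis for Process Reward Models

Yinghui Chi, Lucien Wang

Affiliations: Jilin University

AI summary: Proposes a controllable and verifiable framework that synthesizes process supervision data by injecting a template-aware error and recomputing the subsequent steps, improving process reward models on logical reasoning with transfer to mathematical reasoning.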

Abstract

Process reward models (PRMs) rely on high-quality process supervision data, yet existing construction methods often provide limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. Our framework first constructs a correct symbolic reasoning chain, injects a template-aware error into an intermediate step, recomputes subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error while remaining trajectory-consistent after symbolic recomputation, and are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning. Step-level evaluation further shows that first-error localization remains substantially more challenging than overall step classification, highlighting the need for fine-grained and verifiable process supervision.
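The synthesis recipe (build a correct chain, corrupt one intermediate step, recompute later steps from the corrupted state) can be illustrated on a toy arithmetic chain. Everything in this sketch is a hypothetical stand-in: the paper works with symbolic reasoning chains and template-aware errors, whereas here the "error" is simply adding `delta` to one step's result.

```python
def run_chain(start, ops, corrupt_at=None, delta=1):
    """Evaluate a chain of steps; optionally corrupt the result of one step."""
    state, trace = start, []
    for i, (name, fn) in enumerate(ops):
        result = fn(state)
        if i == corrupt_at:
            result += delta              # injected error at the chosen step
        trace.append((name, state, result))
        state = result                   # later steps recompute from this state
    return trace


def first_invalid_step(trace, ops):
    """Index of the first step whose result does not follow from its input, else None."""
    for i, ((_, before, after), (_, fn)) in enumerate(zip(trace, ops)):
        if fn(before) != after:
            return i
    return None


ops = [("add 3", lambda x: x + 3), ("double", lambda x: x * 2), ("subtract 1", lambda x: x - 1)]
clean = run_chain(2, ops)
corrupted = run_chain(2, ops, corrupt_at=1)
print(first_invalid_step(clean, ops), first_invalid_step(corrupted, ops))
```

In the corrupted trace only the injected step fails the local check; the step after it is valid given its (wrong) input. This mirrors the abstract's description of trajectories that are prefix-invalid at the first error yet remain consistent after recomputation, and it yields a known first-error label for free.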


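2606.05395 2026-06-05 cs.RO cs.AI (version update)

VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents

Yunhao Yang, Neel P. Bhatt, Kevin Wang, Samuel Tetteh, Zhangyang Wang, Ufuk Topcu

Affiliations: The University of Texas at Austin; Iowa State University

AI summary: Proposes VASO, a framework in which formal verification guides the self-evolution of LLM-generated robot skill contracts: model-checking counterexamples are translated into textual gradients that update the skill contract without fine-tuning model weights. VASO reaches 97.2% formal-specification compliance on Jackal and quadcopter tasks.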

Comments Project webpage: https://languagegroundedriskdetection.github.io/ProjectPage/vaso-webpage/

Abstract

Reusable robot skills are becoming the basic units through which embodied agents turn open-ended instructions into long-horizon physical behavior. We argue that, while foundation models have collapsed the cost of creating these skills, the cost of trusting them has not. Existing skill-evolution loops refine skills through execution feedback, unit tests, environment reward, or LLM self-critique, but these signals provide only trace-level evidence: they show that a skill worked on sampled executions, not that skill-induced plans satisfy temporal safety contracts under untested conditions. We introduce VASO, a framework for verification-guided self-evolution of LLM-generated robot skill contracts. In VASO, each skill is represented as a semantic contract with two coupled interfaces: a formal interface that aligns robot states, observations, and control commands with logical propositions for model checking, and a planner-facing interface that guides executable behavior generation. A model checker first filters logically inconsistent skill contracts, then verifies plans induced by the skill against global and local temporal specifications. When verification fails, VASO translates the counterexample trace into a textual gradient that updates the reusable skill contract while keeping foundation-model weights frozen. On Clearpath Jackal and PX4 quadcopter tasks, VASO reaches 97.2% formal-specification compliance using fewer than 100 optimization samples, outperforming execution-feedback, prompt-optimization, and fine-tuning baselines. To our knowledge, VASO is the first framework that closes the loop between formal verification and self-evolving LLM-generated skills for physical AI agents: formal counterexamples become optimization feedback for reusable robot skill contracts, rather than merely verifying one-off plans, tuning planner prompts, or fine-tuning model weights.


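2606.05391 2026-06-05 cs.SE cs.AI (version update)

Can AI Refute Economic Theory? Evidence from Beyond the Knowledge Cutoff

Alexis Akira Toda

Affiliations: Department of Economics, Emory University

AI summary: This paper has several AI models (Gemini, Refine, Claude, and ChatGPT) check four economic theory papers that contain errors. ChatGPT Pro performs best but cannot find the errors independently, suggesting that AI cannot yet refute economic theory autonomously.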

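2606.05382 2026-06-05 cs.AI (version update)

Synthetic Contrastive Reasoning for Multi-Table Q&A

Ankit Pratap Singh, Xin Su, Phillip Howard

Affiliations: Iowa State University; Thoughtworks

AI summary: To address the lack of reasoning supervision in multi-table question answering, generates synthetic contrastive reasoning traces with heterogeneous LLMs and fine-tunes models with Contrastive Preference Optimization, yielding absolute average improvements of 9.7%-16.3% over Q&A supervised fine-tuning.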

Abstract

Multi-table question answering requires models to retrieve relevant evidence, link schemas, and perform compositional reasoning across relational tables. Existing multi-table Q&A resources typically provide questions and final answers but lack reasoning supervision that explains how answers are derived. To address this gap, we construct a synthetic contrastive reasoning-trace dataset for MMQA by generating validated positive traces and plausible negative traces with heterogeneous LLMs. We then use the resulting preference pairs to fine-tune open-weight LLMs with Contrastive Preference Optimization (CPO). Across Qwen3-14B, Mistral-8B, and Llama-3.1-8B, CPO achieves absolute average improvements over Q&A supervised fine-tuning ranging from 9.7%-16.3%, with gains up to 21 percentage points on MMQA. Ablations show that heterogeneous positive and negative trace generators strengthen the contrastive signal, and automated as well as human evaluations indicate that the generated pairs are largely faithful, coherent, and meaningfully contrastive.


2606.05378 2026-06-05 cs.LG cs.AI (version update)

Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models

Yongzhong Xu

AI summary: Runs a unified protocol that tests attention-head circuits in three 1B-class language models on four composed tasks and finds that different models implement the same task through different attention-pattern types. Introduces a five-category screen-outcome taxonomy and proposes a falsifiable hypothesis that the MoE model builds composed-task circuits on a previous-token positional substrate.

Comments 27 pages, 3 figures

Abstract

We test whether a single screen-and-ablate recipe -- identify attention-head circuits by task-pattern selectivity, then verify by causal ablation against a matched-random null -- produces consistent mechanistic claims across model families. The recipe ports across pipelines; the specific circuit it identifies does not. Across four composed tasks (indirect-object identification, greater-than, successor sequences, variable binding) and three 1B-class language models from distinct training pipelines (Pythia 1B / Pile / dense; OLMo 1B / DCLM / dense; OLMoE 1B-7B / DCLM / mixture-of-experts), we run a unified protocol with the matched-random null sampled across ten seeds per cell. The resulting 12 (task, model) cells contain no two that share the same primary causal screen at comparable effect size: the same task, with the same behavioral capability, is implemented through different attention-pattern types across models. We introduce a five-category screen-outcome taxonomy -- primary cause, secondary cause, correlate, interferer, null -- with quantitative thresholds, and show that all five outcomes appear in the panel. We propose a falsifiable hypothesis: the MoE model in our panel builds composed-task circuits on top of a foundational previous-token positional substrate (the prev-token-circuit ablation is the strongest causal screen on 3 of 4 tasks for OLMoE 1B-7B), with the IOI exception consistent with IOI being a final-position name-copying task whose structure directly probes a different pattern. The hypothesis comes with explicit predictions for other MoE language models. We frame the methodology honestly: the spectral participation-ratio signal from the companion methodology paper is a general indicator of specialized computation; what makes a finding task-specific is the task-pattern screen plus a per-model causal verification.


2606.05375 2026-06-05 cs.CV cs.AI (version update)

Three-Dimensional Retinal Microvasculature Restoration in OCT Angiography

Yukun Guo, Min Gao, Tristan T. Hormel, Steven T. Bailey, Thomas S. Hwang, Yali Jia

Affiliations: Casey Eye Institute, Oregon Health & Science University; Department of Biomedical Engineering, Oregon Health & Science University

AI summary: Proposes a deep learning algorithm with an EfficientNet-B5 encoder and a decoder incorporating concurrent spatial and channel squeeze-and-excitation modules to restore capillary anatomy from a single OCTA volume, significantly improving image quality and microvascular fidelity.

Abstract

Optical coherence tomographic angiography (OCTA) is a powerful technique for imaging retinal microvasculature. However, acquiring reliable quantification of retinal blood flow and areas of retinal nonperfusion is challenging because of imaging artifacts. Existing methods primarily focus on noise suppression, projection artifact removal, or signal enhancement to improve the image quality of OCTA in cross-sectional or two-dimensional (2D) en face projections, while neglecting the intrinsic three-dimensional vascular architecture. In this study, we propose a deep learning-based algorithm for restoring capillary anatomical vasculature from a single OCTA volume. The network consists of an EfficientNet-B5 encoder and a decoder incorporating concurrent spatial and channel squeeze-and-excitation modules, connected via skip connections to preserve spatial resolution. Three adjacent B-frames are used as input to predict the restored middle B-frame. We evaluated the performance of the model using the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) against ground truth generated from averaging multiple scans. The results show that the proposed model significantly (both p < 0.001) improved image quality compared with the original single OCTA volume, with a PSNR of 26.16 +/- 1.26 vs. 22.23 +/- 0.78 and an SSIM of 0.91 +/- 0.02 vs. 0.72 +/- 0.03. The proposed model also significantly (p < 0.001) improved microvascular fidelity, measured by the Dice coefficient overlap between the model output and ground truth, in both 2D and 3D by at least 3.8% and 51.2%, respectively, across several different vascular slabs.
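Two of the metrics reported above have simple closed forms: PSNR = 10 log10(MAX² / MSE) and Dice = 2|A ∩ B| / (|A| + |B|). The sketch below computes both on flat Python lists as a reading aid. The function names, the flat-list layout, and the `max_value=1.0` intensity range are assumptions; the paper's exact preprocessing and vessel binarization are not described in the abstract.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized intensity lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf                  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as flat lists of 0/1."""
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(map(bool, mask_a)) + sum(map(bool, mask_b))
    return 2 * overlap / total if total else 1.0


print(psnr([0.0, 1.0], [0.1, 0.9]))
print(dice([1, 1, 0, 0], [1, 0, 1, 0]))
```

PSNR is logarithmic, so the reported improvement from 22.23 dB to 26.16 dB corresponds to a reduction in mean squared error by a factor of roughly 2.5 (10^0.393), assuming the same intensity range for both measurements.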

2606.05357 2026-06-05 cs.AI version update

A Multi-Turn Model of Human Persuadability Based on Probabilistic Belief Tracking

Jared Moore, Noah Goodman, Nick Haber, Max Kleiman-Weiner

Affiliations: Stanford University; University of Washington

AI summary: Proposes PERSUASIONTRACE, a framework that records multi-turn belief reports, annotates rhetorical dimensions, and introduces a Bayesian-network simulated target, shifting persuasion evaluation from endpoint change to process fidelity.

English abstract

Large language models can shift human beliefs across high-stakes domains, but most persuasion studies rely on pre/post belief change. These endpoint measures identify whether persuasion occurred, yet miss where and how beliefs moved within a dialogue. We present PERSUASIONTRACE, a framework for studying persuasion in human-LLM interaction. Built on a web-based experimental platform, PERSUASIONTRACE contributes a tool for multi-turn persuasion studies and a process-level evaluation protocol: it records multi-turn belief reports from human or simulated targets of persuasion, annotates persuader turns with rhetorical dimensions (logos/pathos/ethos), and evaluates simulators by fidelity to real human belief dynamics. Using this framework, we find that human targets group into two clusters of multi-turn belief updates and exhibit susceptibility to rhetorical strategies, and that LLMs are persuasive across generic and personalized topics, text and audio modalities, and multi-turn interactions. Prior work has chiefly used vanilla-prompted LLMs to simulate human targets, but we show that these simulators fail to replicate human belief dynamics. We introduce a Bayesian-network simulated target that maintains an explicit latent belief state over time so each persuader message yields cognitively realistic belief updates. In human-likeness evaluation, our Bayesian target scores near a human reference (81 vs 80), while baseline LLM targets score substantially lower (64). PERSUASIONTRACE reframes persuasion evaluation from endpoint movement alone to process fidelity, providing a stronger basis for scientific analysis and safer optimization of persuasive systems.

2606.05328 2026-06-05 cs.GR cs.AI cs.CV cs.LG version update

The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show

Parsa Esmati, Somjit Nath, Katja Hofmann, Derek Nowrouzezahrai, Samira Ebrahimi Kahou, Majid Mirmehdi

Affiliations: University of Bristol; McGill University; Mila–Quebec AI Institute; Microsoft Research; University of Calgary

AI summary: Probes the latent trajectories of video diffusion models by approximately inverting the sampling process and finds that physical plausibility is linearly decodable from diffusion transformer states with about 81.27% average accuracy, suggesting that physically meaningful representations arise as a byproduct of generative denoising.

English abstract

Modern video diffusion models generate increasingly realistic and temporally coherent videos, motivating their use as candidate world simulators. Yet it remains unclear whether these models internally encode physical structure, or merely reproduce motion patterns seen during training. We study this question by probing video diffusion models along latent trajectories corresponding to real videos with known physical plausibility. To obtain such trajectories, we approximately invert the deterministic sampling process by integrating the learned velocity field backward from a clean video latent to noise, giving access to the model's intermediate states and attention maps. Using these recovered trajectories, we show that physical plausibility is linearly decodable from diffusion transformer states across IntPhys and InfLevel, reaching around 81.27% average accuracy and outperforming dedicated representation-learning baselines such as V-JEPA and VideoMAE. Surprisingly, this signal is absent from the VAE latent input and emerges inside the denoising transformer itself, despite the model not being trained with a self-supervised predictive objective. These findings suggest that physically meaningful representations can arise as a byproduct of generative denoising.

2606.05326 2026-06-05 math.OC cs.AI cs.LG math-ph math.AP math.MP version update

Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network

Antonin Chodron de Courcel

Affiliations: Ecole Normale Supérieure, CNRS, 45 rue d’Ulm, 75005 Paris, France

AI summary: For the Edge of Stability dynamics of gradient descent at large learning rates, proposes a continuous-time effective model that tracks the average trajectory together with the covariance of its fast oscillations, identifies an effective free energy as the key quantity to monitor, and derives a mean-field kinetic equation for wide two-layer networks.

Comments Comments are welcome!

English abstract

We study the dynamics of gradient descent in the Edge of Stability regime, where the learning rate is large enough to induce persistent oscillations in the loss and the sharpness. We propose a continuous-time effective model that tracks the evolution of the average trajectory coupled with the time-averaged covariance of its fast oscillations. Our analysis reveals that the natural quantity to monitor in such unstable regimes is an effective free energy, which combines the original risk functional with a curvature-related "entropic" term. Our model allows us to track the envelope of the oscillations even in situations where its dynamics evolve on similar timescales as the averaged weights. Otherwise stated, we can track the spikes that occur during the training of some neural network architectures. For wide two-layer neural networks optimized under stable non-vanishing oscillations, we derive a mean-field limit that results in a novel kinetic equation describing the joint distribution of weights and their fluctuations. We show that this equation can be interpreted as a Wasserstein-2 gradient flow of a macroscopic free energy. Finally, we provide numerical evidence on matrix factorization and deep learning tasks (CIFAR-10) to demonstrate the model's accuracy in capturing the envelope of the oscillations and the predictive power of the effective free energy.

2606.05316 2026-06-05 cs.AI version update

I Know What You Meme, Even If it Emerged Today: Understanding Evolving Memes through Open-World Knowledge Acquisition

Shanhong Liu, Rui Cao, Pai Chet Ng, De Wen Soh

Affiliations: Singapore University of Technology and Design; Singapore Institute of Technology

AI summary: Proposes Query Retrieve Conclude, a zero-shot framework that identifies missing knowledge, retrieves evidence from the open web, and synthesizes background knowledge in order to understand emerging memes and improve detection performance.

2606.05315 2026-06-05 cs.CL cs.AI version update

LoRi: Low-Rank Distillation for Implicit Reasoning

Remon Polus, Soumaya Cherkaoui

Affiliations: Department of Computer and Software Engineering

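AI summary: For X-band UAV integrated sensing and communication systems, proposes an optimal time-allocation method based on a double-shadowing channel model that balances sensing accuracy and communication performance.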

2606.05261 2026-06-05 cs.CV cs.AI cs.LG version update

NIV: Neural Axis Variations for Variable Font Generation

Nadav Benedek, Ariel Shamir, Ohad Fried

Affiliations: Reichman University

AI summary: Proposes NIV, which predicts per-point displacements of glyph outlines to automatically convert a static font into a variable font that supports continuous interpolation along multiple axes, and validates its generalization on a newly constructed dataset.

English abstract

Variable fonts enable continuous variation of glyph geometry along semantic design axes such as weight, width, slant, and optical size. However, constructing a variable font from a static font remains a labor-intensive process requiring expert typographic design and manual specification of glyph variation data. We introduce NIV (Neural Axis Variations), a method that automatically converts a static font into a fully functional variable font. Given glyph outlines and a set of desired design axes, NIV predicts per-point displacements. The model operates directly on vector glyph geometry and employs a novel Property Embedding mechanism that captures interactions between multiple axes, enabling consistent multi-axis variation within a unified framework. We train NIV on a newly constructed dataset derived from variable Google Fonts, comprising over one million variation tuples. The resulting model generalizes across unseen code points, unseen font styles, high-complexity CJK glyphs, and even out-of-distribution handwriting inputs. The generated outputs are standard variable font files supporting continuous interpolation via existing rendering engines. To facilitate research, we release the dataset, the complete training and inference implementation, and trained models at https://github.com/ndvbd/NIV. Beyond typography, our approach demonstrates how structured geometric objects with continuous parametric variation can be synthesized using neural deformations.

2606.05256 2026-06-05 cs.AI version update

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Team Attacks

Nicholas Saban

Affiliations: Patronus AI; University of California, Berkeley

AI summary: Builds a benchmark of 793 browser episodes and 56 attack templates to evaluate how robust frontier computer-using agents are to prompt injection. The model weights provide strong resistance (0/140 multi-step attack successes), but this safety is domain-conditioned and fails on coding agents (attack success rate up to 100%). The high attack success rates in the literature are attributed mainly to RL-optimized injection text rather than to the attack categories.

English abstract

Recent computer-using-agent (CUA) red-teaming papers report prompt-injection attack success rates (ASR) of 42-98%, but these headline numbers cluster on retired models and on the most-vulnerable model in each paper's panel. We ask whether those techniques, reproduced as hand-crafted templates, still work against current frontier CUAs. We release CUA-HandCrafted, a public benchmark of 793 episodes spanning 24 multi-step web tasks, 56 attack templates, 8 attack families, and 4 system-prompt configurations. Against Claude Sonnet 4.6 and GPT-5.4 we measure 0/140 multi-step attack success (Clopper-Pearson 95% upper bound 2.60%); a prompt ablation shows this resistance lives in the model weights. Yet it does not generalize: on a sister coding-agent benchmark (SkillBench), the same weights fall to hand-crafted skill-injection at up to 100%. We argue that the literature's high ASR is largely attributable to RL-optimized injection text rather than the attack categories, and that frontier safety hardening is domain-conditioned, specific to the heavily-targeted browser surface. Reporting techniques without releasing the optimized strings, or extrapolating browser-domain safety to other CUA modalities, makes published ASR numbers unreproducible.
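The reported upper bound can be checked by hand. The sketch below is a worked example, not the authors' analysis code, and the function name is hypothetical. With zero successes in n trials, the two-sided Clopper-Pearson upper limit p solves (1 - p)^n = α/2, so p = 1 - (α/2)^(1/n); with n = 140 and α = 0.05 this reproduces the 2.60% quoted in the abstract.

```python
def clopper_pearson_upper_zero(n, confidence=0.95):
    """Exact two-sided upper bound on the success rate when 0 of n trials succeed."""
    alpha = 1.0 - confidence
    # With zero successes the bound has a closed form: solve (1 - p)**n = alpha / 2.
    return 1.0 - (alpha / 2.0) ** (1.0 / n)

upper = clopper_pearson_upper_zero(140)
print(f"{100 * upper:.2f}%")  # prints 2.60%
```

The closed form only covers the zero-success case; for a nonzero count the bound is a quantile of a Beta distribution and needs a numerical routine.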

2606.05232 2026-06-05 cs.LG cs.AI version update

Differentiable Efficient Operator Search

Xiaohuan Pei, Jiyuan Zhang, Yuanfan Guo, Weiguo Feng, Tao Huang, Cho-Jui Hsieh, Chang Xu

Affiliations: The University of Sydney; ByteDance; Shanghai Jiao Tong University; University of California, Los Angeles

AI summary: Proposes Efficient Operator Search, a differentiable framework that gives a unified interpretation of multiple token-reduction operators and jointly searches where to reduce tokens, how many to retain, and how the operator behaves, optimizing multimodal model performance under budget constraints.

English abstract

Efficient multimodal foundation models often rely on manually designed token-reduction operators, such as pruning, merging, pooling, and adaptive reweighting. Although these operators appear different, we show that they can be interpreted as distinct regimes of a shared operator space. Based on this view, we introduce Efficient Operator Search, a differentiable framework that jointly searches where to reduce tokens, how many tokens to retain, and how reduced token information should be processed. The proposed search space parameterizes layer activation, retention budget, and operator behavior, while the search policy optimizes task performance under one-sided budget and cost constraints. This formulation recovers representative hand-designed baselines as special cases and further discovers hybrid operators beyond isolated manual designs. Experiments on multimodal benchmarks show that the searched operators achieve competitive accuracy-efficiency trade-offs, especially under aggressive visual-token reduction. These results suggest that efficient multimodal inference can be reframed from manual operator design to differentiable operator search.
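The idea that pruning and merging are regimes of one operator space can be made concrete with a toy parameterization. The sketch below is a hypothetical illustration, not the search space defined in the paper: the function name, the dot-product similarity, and the single `merge_weight` knob are all assumptions. Setting `merge_weight` to 0 discards low-scoring tokens (pruning), while setting it to 1 averages each discarded token into its most similar kept token (merging).

```python
def reduce_tokens(tokens, scores, keep, merge_weight):
    """Keep the `keep` highest-scoring tokens and fold the rest in with `merge_weight`."""
    if not 1 <= keep <= len(tokens):
        raise ValueError("keep must be between 1 and the number of tokens")
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])      # preserve the original token order
    dropped = order[keep:]
    sums = {i: list(tokens[i]) for i in kept}
    weights = {i: 1.0 for i in kept}
    for j in dropped:
        # Assign the dropped token to the kept token with the largest dot product.
        target = max(kept, key=lambda i: sum(a * b for a, b in zip(tokens[i], tokens[j])))
        for d, value in enumerate(tokens[j]):
            sums[target][d] += merge_weight * value
        weights[target] += merge_weight
    # Normalize so that each output is a weighted average of its members.
    return [[v / weights[i] for v in sums[i]] for i in kept]

toy_tokens = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]
toy_scores = [0.9, 0.8, 0.1]
pruned = reduce_tokens(toy_tokens, toy_scores, keep=2, merge_weight=0.0)
merged = reduce_tokens(toy_tokens, toy_scores, keep=2, merge_weight=1.0)
```

Intermediate values of `merge_weight` give hybrid behavior, which is the kind of continuum a differentiable search could explore; here the weight is a plain float rather than a learned parameter.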

2606.05222 2026-06-05 cs.CY cs.AI cs.HC version update

Where's the Structure? A Systematic Literature Review of Empirical Research on Human-AI Collaboration and Hybrid Intelligence for Learning

Luis P. Prieto, Juan I. Asensio-Pérez, María Jesús Rodríguez-Triana, Mohamed Saban, Yannis Dimitriadis

Affiliations: GSIC-EMIC research group, Universidad de Valladolid (Spain); GICAP research group, Department of Digitization, Universidad de Burgos (Spain)

AI summary: Through a systematic literature review (N=62), analyzes the collaboration processes, structures, and application contexts of human-AI collaboration and hybrid intelligence in support of learning, and extracts design knowledge and research gaps.

Comments 59 pages, 4 figures, submitted to a journal

2606.05219 2026-06-05 cs.LG cs.AI version update

Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway

Hee-Sung Kim, Sungyoon Lee

Affiliations: University of California, Berkeley

AI summary: Studies how discrete gradient descent with a large step size, through Edge of Stability oscillations, moves multi-pathway deep linear networks from symmetry breaking to signal redistribution, favoring shared representations over single-pathway dominance.

Comments ICML 2026

English abstract

Recent analyses of multi-pathway Deep Linear Networks use Gradient Flow to predict a "winner-takes-all" specialization in which path symmetry breaks and each feature concentrates in a single pathway. In this work, we show that discrete Gradient Descent (GD) with a large step size tells a different story. We prove that single-path solutions are sharp minima, whereas distributing signals across pathways reduces sharpness by a factor that decreases with both the number of pathways and depth. Consequently, while early training reproduces the depth-driven symmetry breaking predicted by GF, oscillations at the Edge of Stability subsequently override this tendency and drive the network into a re-balancing phase, where signals redistribute across pathways. Together, these results clarify how depth shapes pathway competition and explain why large-step GD favors shared representations rather than persistent single-pathway dominance.
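The preference of large-step GD for flatter solutions rests on a standard stability fact: on a quadratic with curvature (sharpness) λ, a GD step multiplies the iterate by (1 - ηλ), so the iterates contract only when λ < 2/η. The sketch below is a one-dimensional toy under that assumption, not the paper's multi-pathway network; the function name and the numeric values are chosen for illustration.

```python
def gd_on_quadratic(sharpness, step_size, w0=1.0, steps=50):
    """Run GD on L(w) = 0.5 * sharpness * w**2 and return |w| after `steps` updates."""
    w = w0
    for _ in range(steps):
        w -= step_size * sharpness * w  # the gradient of L is sharpness * w
    return abs(w)

step_size = 0.5  # stability threshold is 2 / step_size = 4
flat = gd_on_quadratic(sharpness=3.0, step_size=step_size)   # factor |1 - 1.5| = 0.5 per step
sharp = gd_on_quadratic(sharpness=5.0, step_size=step_size)  # factor |1 - 2.5| = 1.5 per step
```

With the same step size the iterate shrinks toward the minimum whose sharpness is below the threshold and grows without bound around the one above it, which is why a solution that lowers sharpness can remain reachable when a sharper one is not.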

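2606.05217 2026-06-05 math-ph cs.AI cs.LG math.MP physics.data-an version update

The Score Hamiltonian: Mapping Diffusion Models to Adiabatic Transport

Peter Halmos, Boris Hanin

Affiliations: Computer Science Department, Princeton University; ORFE Department, Princeton University

AI summary: By constructing a score Hamiltonian, establishes an exact correspondence between sampling in score-based diffusion models and adiabatic transport of the ground state of a Schrödinger operator, and uses the adiabatic theorem to derive density-reconstruction error bounds and annealing schedules.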

2606.05206 2026-06-05 q-bio.NC cs.AI stat.AP version update

Ontology-constrained multi-LLM scoring of hypothesis support in the predictive processing literature

Hamed Nejat, Alexander Maier, Jesse Spencer-Smith, André M. Bastos

Affiliations: University of Edinburgh; University of Cambridge

AI summary: Presents a local multi-LLM pipeline that scores studies in the predictive coding literature under ontology constraints, maps a heterogeneous literature into a quantitative evidence space, and reveals structured disagreement across hypotheses.

Comments 33 pages, 5 tables and 9 figures

English abstract

Fragmentation is common in interdisciplinary fields with diverse methods and theoretical commitments. Predictive coding neuroscience is a clear example: its literature spans computational theory, electrophysiology, imaging, behavior, and modeling, creating a synthesis problem that conventional meta-analysis cannot easily resolve. Here, we describe a local multi-LLM pipeline for ontology-constrained literature synthesis. The pipeline reads papers, extracts evidence, incorporates figure descriptions, assembles constrained prompts, and validates outputs against an expert glossary. We manually defined a predictive-coding glossary of thirty-six concepts grouped into three hypotheses: predictive suppression, feedforward error propagation, and ubiquity. A council of ten local language models scored 31 studies according to their agreement or disagreement with each glossary factor across local and global oddball contexts. This enabled pairwise study-agreement analysis, cross-model comparison, and three-dimensional hypothesis-space mapping. Agreement was high for some hypotheses but weaker for others, revealing structured disagreement, particularly across local versus global oddball paradigms. We further define hypothesis-space temperature, a geometric dispersion metric measuring how compactly studies occupy the hypothesis space. Temperature was lower for local oddball contexts and higher for global oddball contexts, indicating greater dispersion in the latter. The scoring geometry also allowed us to estimate vectors of change between experimental contexts. These results demonstrate that local multi-LLM councils can produce auditable disagreement measurements that map heterogeneous literatures into quantitative evidence spaces. This framework may generalize to cross-study hypothesis mapping where conventional meta-analysis lacks a common comparison space.

2606.05199 2026-06-05 physics.comp-ph cs.AI version update

Finite Element-Based Material Learning via Automatic Differentiation: Learning constitutive neural network models from full-field deformation data

Matthias Knipper, Chenyi Ji, Malte Brand, Kevin Linka

Affiliations: Computational Mechanics in Medicine, Applied Medical Engineering, RWTH Aachen University; Institute for Continuum and Material Mechanics, Hamburg University of Technology

AI summary: Proposes FE-MAD, a framework that uses automatic differentiation to integrate constitutive neural networks into the JAX-FEM nonlinear solver and identifies material parameters from full-field deformation data by gradient-based optimization. It applies to grey-box and white-box constitutive models and is validated on three experimental datasets.

English abstract

The identification of constitutive neural network models from heterogeneous full-field deformation data provides a robust alternative to traditional calibration methods based on homogeneous stress-strain experiments, particularly given the high dimensionality of trainable parameters. Existing approaches must balance generality, robustness, and computational efficiency: conventional finite element model updating is broadly applicable but computationally demanding; weak-form methods offer efficiency but are sensitive to noise and data scarcity; neural operator models are highly expressive but require extensive training datasets. This work presents FE-MAD (Finite Element-Based Material learning via Automatic Differentiation), an end-to-end differentiable framework that integrates a constitutive neural network model within a JAX-FEM nonlinear solver and identifies its parameters through gradient-based minimization of a measurement-mismatch loss. Newton tangent stiffness and loss gradients are computed automatically using forward- and reverse-mode automatic differentiation throughout the entire pipeline, thereby removing the need for analytic adjoints or offline surrogate models. FE-MAD is demonstrated for two architectures: a grey-box Constitutive Artificial Neural Network (CANN), a polyconvex, fully connected model with high flexibility, and a white-box CANN, an expert-system network with phenomenologically interpretable strain-energy terms. Focusing on incompressible isotropic hyperelasticity, FE-MAD is evaluated on three open experimental datasets: (1) full digital image correlation (DIC) of a perforated tensile specimen, (2) a reduced-data scenario with a one-dimensional stretch profile and global force-displacement curve, and (3) a heterogeneous matrix-inclusion system in which both phases' constitutive laws are identified and generalized to twenty-two previously unseen samples.

2606.05194 2026-06-05 cs.LG cs.AI cs.CL version update

Multi-Granularity Reasoning for Natural Language Inference

Chunling Xi, Di Liang

Affiliations: University of Science and Technology of China

AI summary: Proposes the Multi-Granularity Reasoning Network (MGRN), which models the human cognitive process through interaction among hierarchical semantic features and outperforms strong baselines on multiple benchmarks.

English abstract

Natural Language Inference (NLI) is a fundamental task in natural language understanding that requires determining the logical relationship between a premise and a hypothesis. Despite the remarkable success of transformer-based pre-trained models, most existing approaches primarily rely on the final-layer token representations, which are often insufficient for capturing the complex and hierarchical semantic interactions required for effective reasoning. In particular, fine-grained lexical cues, phrasal compositions, and higher-level contextual semantics are typically entangled or diluted in a single representation space. To address these limitations, we propose a novel Multi-Granularity Reasoning Network (MGRN) that explicitly leverages hierarchical semantic features within an interactive reasoning space. The proposed framework mimics the human cognitive process of language understanding, which naturally progresses from shallow lexical matching to deeper semantic abstraction and logical reasoning. By integrating semantic information across multiple granularities in a progressive and structured manner, MGRN is able to uncover intricate semantic relationships underlying natural language expressions. Extensive experiments on multiple public benchmarks demonstrate that MGRN consistently outperforms strong baseline models, validating the effectiveness and robustness of the proposed approach.

2606.05180 2026-06-05 cs.CL cs.AI version update

From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment

Ivo Bueno, Babette Bühler, Philipp Stark, Tim Fütterer, Ulrich Trautwein, Dorottya Demszky, Heather Hill, Enkelejda Kasneci

Affiliations: Technical University of Munich; Munich Center for Machine Learning (MCML); Lund University; University of Tübingen; Stanford Graduate School of Education; Harvard Graduate School of Education

AI summary: Proposes a framework that combines SHAP attributions with LLM-generated rationales to explain rubric-based scoring models, and evaluates the faithfulness and transferability of the explanations on classroom transcript data.

Comments Accepted to Findings of ACL 2026

English abstract

Automated scoring models are increasingly used to assign rubric-based quality ratings to complex language performances, including classroom transcripts, yet they typically provide little insight into why a particular score is produced. We propose a general framework for sentence-level interpretability of rubric-based scoring that combines model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs). Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework enables systematic comparison of fine-tuned pretrained language models (PLMs) and prompted LLMs on both scoring performance and explanation faithfulness. Across 6k annotated transcript segments, fine-tuned PLMs outperform LLMs in prediction accuracy but exhibit label compression toward mid-scale scores. Deletion-based tests show that SHAP identifies sentences that reliably drive model predictions, producing typically larger and more coherent prediction shifts than LLM-generated rationales. Cross-model analyses further reveal that SHAP attributions transfer robustly across architectures, whereas LLM rationales exert limited and inconsistent influence. Overall, the findings demonstrate that SHAP provides more faithful and transferable explanations for rubric-based scoring, and that the proposed framework offers a principled basis for evaluating both scoring models and their explanations in high-stakes educational settings and other rubric-based language assessment tasks.

2606.05178 2026-06-05 cs.HC cs.AI version update

The Virtual Roundtable: Multi-Agent Personas Simulating the Dynamics of Human Brainstorming

Tim Dorn, Saara A. Khan, Julie Mumford

Affiliations: University of California, Berkeley

AI summary: Proposes a multi-agent architecture that simulates roundtable brainstorming in a divergent phase and a convergent phase, using diverse AI personas and an agentic facilitator to generate diverse ideas and then evaluate and rank them. A case study shows that it yields diverse, relevant ideas and deepens the quality of discussion.

Comments 10 pages, 10 figures, 2 tables

English abstract

As AI-driven product development accelerates, the bottleneck is shifting from how we build to what we build. Traditional human brainstorming faces challenges including groupthink, echo chambers, and limited diversity. To address this, we present a multi-agentic architecture that simulates roundtable brainstorming through two phases: divergent thinking to generate diverse ideas, and convergent thinking to evaluate and rank the most promising ones. The system employs diverse AI personas that engage in roundtable discussions, guided by an agentic facilitator that steers the discussion toward productive outcomes. Personas maintain private thoughts while commenting publicly, with ideas emerging organically throughout the discussion. Per-persona quotas on idea submissions and votes promote balanced participation while producing natural rankings. Throughout the session, the system tracks each idea's lineage, capturing how concepts originate and cross-pollinate over time. We demonstrate this approach through a case study generating consumer ideas for AI smart glasses, showing (i) it produces diverse, relevant ideas with insights into their evolution; (ii) the cumulative exchange of perspectives across personas cultivates a shared context that progressively deepens the quality of discussion and the ideas produced.

2606.05177 2026-06-05 cs.CL cs.AI eess.AS version update

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

Manh Luong, Tamas Abraham, Junae Kim, Amar Kaur, Rollin Omari, Gholamreza Haffari, Trang Vu, Lizhen Qu, Dinh Phung

Affiliations: Monash University; Defence Science and Technology Group

AI summary: To address the limitation that existing multimodal safety benchmarks handle only visual inputs, proposes MCBench, a benchmark of 1196 scenarios across four safety categories that require integrating information from multiple modalities, and shows that current Omni LLMs fall short in cross-modal safety reasoning.

English abstract

Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.

2606.05176 2026-06-05 cs.CL cs.AI version update

Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

Miao Wang, Yuling Shi, Yijiang Li, Yeheng Chen, Xiaodong Gu, Bin Li, Bo Gao, Jun Wang, Zengxin Han, Jingtong Wu, Yaduan Ruan

Affiliations: Nanjing University; Shanghai Jiao Tong University; University of California, San Diego; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Information Engineering, Beijing Institute of Graphic Communication; Ant International, Ant Group; Independent Researcher

AI summary: Proposes EBM-RL, a framework that uses reward-decomposed reinforcement learning to optimize visual perception, reasoning, and generation in video role-playing, improving scene consistency and character authenticity.

English abstract

Text-based role-playing models can imitate character styles, but often fail to capture scene atmosphere and evolving tension, which are crucial for immersive applications such as VR games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye--Brain--Mouth Reinforcement Learning), a decoupled GRPO-based framework that separates observation (<perception>), reasoning (<think>), and utterance generation (<answer>). This design mimics the human See-Think-Speak process, enabling the model to ground dialogue in visual perception before reasoning and response generation. To optimize this See-Think-Speak process, EBM-RL integrates complementary rewards for scene--text alignment, perceptual--cognitive utility, answer faithfulness, and format consistency. Extensive experiments show that EBM-RL substantially outperforms text-only role-playing baselines and larger-scale vision-language models on our immersive role-playing benchmark, improving both visual-atmosphere consistency and character authenticity. Moreover, EBM-RL demonstrates strong zero-shot transfer to out-of-domain VideoQA benchmarks without additional fine-tuning. We also release an open-source dataset for video-grounded role-playing dialogue.
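
The format-consistency reward mentioned above can be pictured as a simple structural check. The following is only a minimal sketch: the abstract names the three opening tags, while the closing tags, the strict ordering, the binary 0/1 score, and the function name `format_reward` are all assumptions made for illustration, not the authors' reward.

```python
import re

# Assumed layout: <perception>...</perception><think>...</think><answer>...</answer>
# The closing tags and the binary score are illustrative assumptions.
_SEE_THINK_SPEAK = re.compile(
    r"\s*<perception>(.+?)</perception>"
    r"\s*<think>(.+?)</think>"
    r"\s*<answer>(.+?)</answer>\s*",
    re.DOTALL,
)


def format_reward(output: str) -> float:
    """Return 1.0 if the output has the three sections in See-Think-Speak order.

    This check is deliberately loose: it does not reject repeated or nested
    tags that appear inside one of the sections.
    """
    return 1.0 if _SEE_THINK_SPEAK.fullmatch(output) else 0.0


well_formed = (
    "<perception>Rain on the window, dim light.</perception>\n"
    "<think>The scene is tense; keep the reply short.</think>\n"
    "<answer>Stay close to me.</answer>"
)
print(format_reward(well_formed))                         # 1.0
print(format_reward("<think>x</think><answer>y</answer>"))  # 0.0
```

In a decomposed-reward setup such a term would be combined with the other rewards listed in the abstract; how they are weighted is not stated there.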

2606.05104 2026-06-05 cs.AI 版本更新

基于轨迹级别优势优先经验回放的GRPO

Gyeongtae Yoo, Sanghyeok Park, Soohyuk Jang, Ik-hwan Kim, Sungroh Yoon

发表机构 * Department of Electrical and Computer Engineering, Seoul National University（首尔国立大学电子与计算机工程系）； Interdisciplinary Program in AI, Seoul National University（首尔国立大学人工智能跨学科项目）； AIIS, ASRI, INMC, and ISRC, Seoul National University（首尔国立大学人工智能研究所、人工智能研究机构、智能网络与计算中心及人工智能科学研究中心）

AI总结：针对GRPO样本效率低的问题，提出轨迹级经验回放缓冲器，通过年龄驱逐限制陈旧性、新鲜锚定组合保持在线策略、按优势幅度优先采样，在多个数学基准上显著提升性能。

AI中文摘要

基于可验证奖励的GRPO强化学习是后训练推理LLM的标准方法，但样本效率低下。每个轨迹仅用于一次梯度更新后被丢弃。朴素回放在此设置中不适用，因为LLM策略每步梯度变化快，存储的轨迹会变得陈旧并破坏训练稳定性。我们提出一种面向GRPO的轨迹级回放缓冲器，存储和采样单个轨迹而非整组。缓冲器通过年龄驱逐限制陈旧性：任何超过tau_max训练步数的轨迹被移除。缓冲器还通过新鲜锚定组合保留在线策略数据：每个批次保留其新鲜的在线策略轨迹，并拼接从缓冲器中单独抽取的回放轨迹。我们按每个轨迹的优势幅度进行优先回放，并回收优势大的单个轨迹。在三个Qwen3-Base规模、五个数学基准上，我们的方法优于GRPO和朴素回放基线。所有规模均获得正向增益，且随模型增大而增长。最大增益在4B规模上，五个基准平均提升+4.35个百分点。在联合衡量准确率和token效率的AES指标下，与GRPO的效率差距同样在4B最大，为+0.579。

英文摘要

Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded. Naive replay is not well suited in this setting because LLM policies drift quickly per gradient step. Stored rollouts therefore become stale and can destabilize training. We propose a rollout-level replay buffer for GRPO that stores and samples individual rollouts rather than whole groups. The buffer bounds staleness through age eviction. Any rollout older than tau_max training steps is removed. The buffer also preserves on-policy data via fresh-anchored composition. Each batch keeps its fresh on-policy rollouts and then concatenates replay rollouts drawn separately from the buffer. We prioritize replay by per-rollout advantage magnitude and recycle individual rollouts whose advantages are large. Across three Qwen3-Base scales on five math benchmarks, our method outperforms GRPO and naive replay baselines. Gains are positive at every scale and grow with model size. The largest gain is +4.35 pp on the five-benchmark average at 4B. Under an AES metric that jointly measures accuracy and token efficiency, the efficiency margin over GRPO is again largest at 4B, at +0.579.
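
The three buffer mechanisms described above (age eviction, fresh-anchored composition, advantage-magnitude prioritization) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names, the sampling-with-replacement choice, and the order of evict/sample/add are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    tokens: list       # placeholder payload for the generated sequence
    advantage: float   # per-rollout advantage computed when the rollout was fresh
    step: int          # training step at which the rollout was generated


class RolloutReplayBuffer:
    """Hypothetical rollout-level buffer; names and defaults are assumptions."""

    def __init__(self, tau_max: int, seed: int = 0):
        self.tau_max = tau_max
        self.items = []
        self.rng = random.Random(seed)

    def evict(self, current_step: int) -> None:
        # Age eviction: drop any rollout older than tau_max training steps.
        self.items = [r for r in self.items if current_step - r.step <= self.tau_max]

    def add(self, rollouts) -> None:
        # Store individual rollouts rather than whole groups.
        self.items.extend(rollouts)

    def sample(self, k: int):
        # Prioritized replay: sampling weight proportional to |advantage|
        # (with replacement; the small constant keeps all weights positive).
        if not self.items or k <= 0:
            return []
        weights = [abs(r.advantage) + 1e-8 for r in self.items]
        return self.rng.choices(self.items, weights=weights, k=k)


def compose_batch(fresh, buffer, current_step, n_replay):
    # Fresh-anchored composition: every fresh on-policy rollout is kept,
    # and replayed rollouts are drawn separately and concatenated after them.
    buffer.evict(current_step)
    replay = buffer.sample(n_replay)
    buffer.add(fresh)
    return list(fresh) + replay


buf = RolloutReplayBuffer(tau_max=2)
buf.add([Rollout([1], 0.5, step=0), Rollout([2], -2.0, step=3)])
fresh = [Rollout([3], 1.0, step=5)]
batch = compose_batch(fresh, buf, current_step=5, n_replay=2)
print(len(batch), [r.step for r in batch])
```

In the usage lines, the rollout from step 0 is evicted at step 5 because its age exceeds `tau_max`, so both replayed rollouts come from step 3.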

2606.04037 2026-06-05 cs.AI cs.LG cs.SE 版本更新

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

面向企业AI代理的部署前保障：基于本体的仿真与信任认证

Thanh Luong Tuan, Abhijit Sanyal

发表机构 * Golden Gate University（金门大学）； Data, Digital & IT, Novartis Healthcare Pvt. Ltd.（数据、数字与IT，诺华健康护理私人有限公司）

AI总结：提出一种基于本体的验证框架，通过本体驱动的场景生成和信任证书，实现企业AI代理在部署前的自动化监管合规与安全认证。

Comments 26 pages, 3 figures. Companion to arXiv:2604.00555. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-6

AI中文摘要

企业人工智能（AI）代理的部署前验证仍然是大型语言模型（LLM）能力基准测试与生产部署之间的关键缺口。一旦代理在生产环境中运行，部署后监控、人在回路控制和提示级护栏提供的保障有限。我们提出了一种基于本体的验证框架，包含三个组件：一个代理操作范围，形式化了跨权限、领域约束、安全属性、治理规则和自主级别的认证空间；一个本体到场景的生成流水线，自动推导出监管、操作和对抗性测试场景；以及一个信任证书，携带机器可验证的证明，并附带分级部署裁决（批准、有条件、拒绝）。在四个受监管行业（金融科技、银行、保险和医疗保健）中进行的受控试点，实例化为美国与越南的五个行业-监管体制单元，生成了1,800个场景，并针对125个主要来源监管要求和25个注入故障进行了评估。基于本体的生成（G4）实现了48.3%的监管覆盖率，而基于角色的基线为33.1%（校正后p=0.0006），并且领域特异性最高（4.77/5.0；p=2e-6）。在Bonferroni校正后，相对于基线和检索增强提示的覆盖率优势不再稳健。跨三个LLM家族（Claude Sonnet 4、Qwen 2.5 72B、Gemma 4 26B；总计5,400个场景）的交叉验证复制了角色与本体模式。结果表明，对于监管密集型领域，基于本体的场景生成可作为基于角色测试套件的可信补充。

英文摘要

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We present an ontology-grounded verification framework -- to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine-verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam (where Vietnam's 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation significantly outperformed the dominant persona-based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e-6); transparently, its advantage over plain and retrieval-augmented prompting did not survive Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The framework offers a reproducible, regulation-grounded route to pre-deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.

2606.04032 2026-06-05 cs.LG cs.AI cs.CL cs.PF 版本更新

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Transformer 需要三个投影吗？QKV 变体的系统研究

Ali Kayyam, Anusha Madan Gopal, M Anthony Lewis

发表机构 * Ali Kayyam ； Anusha Madan Gopal ； M Anthony Lewis

AI总结：本文系统研究了注意力机制中查询、键、值投影共享的变体，发现 Q-K=V 共享在语言建模中仅以 3.1% 的困惑度损失实现 50% 的 KV 缓存减少，且与头共享结合可达到 96.9% 的缓存减少，从而支持设备端推理。

Comments Accepted at ICML 2026 (PMLR vol. 306). 26 pages, 12 figures, 16 tables. Code: https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections

AI中文摘要

Transformer 已成为各种 AI 任务的标准解决方案，其中查询、键和值（QKV）注意力公式起着核心作用。然而，这三个投影的各自贡献以及省略某些投影的影响仍知之甚少。我们系统评估了三种投影共享约束：a) Q-K=V（共享键-值），b) Q=K-V（共享查询-键），c) Q=K=V（单投影）。后两种变体产生对称注意力图；为了解决这个问题，我们还通过二维位置编码探索了非对称注意力。通过涵盖合成任务、视觉（MNIST、CIFAR、TinyImageNet、异常检测）和语言建模（在 10B 令牌上训练的 300M 和 1.2B 参数模型）的实验，我们发现我们的 Transformer 性能与 QKV Transformer 相当，有时甚至更好。在语言建模中，Q-K=V 投影共享实现了 50% 的 KV 缓存减少，困惑度仅劣化 3.1%。关键的是，投影共享与头共享（GQA/MQA）互补：将 Q-K=V 与 GQA-4 结合可实现 87.5% 的缓存减少，而 Q-K=V + MQA 则达到 96.9%，从而实现了实用的设备端推理。我们表明，Q-K=V 保持了质量，因为键和值可以占据相似的表示空间，并且注意力在低秩机制下运行，而 Q=K-V 则破坏了注意力的方向性。我们的结果系统地将投影共享描述为注意力中权重绑定的一种未被充分探索的实例，具有直接、可量化的推理内存优势，尤其对边缘部署有价值。代码公开于 https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections。

英文摘要

Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we find that our transformers perform on par with, or occasionally better than, the QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections.
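
The cache-reduction figures quoted above compose multiplicatively, which a short calculation makes concrete. This is a sketch under assumptions the abstract does not state: a baseline of 16 attention heads caching separate K and V tensors, and "GQA-4" read as 4 KV heads. The helper names are hypothetical. Under those assumptions the arithmetic reproduces 50%, 87.5%, and 96.9% (96.875% before rounding).

```python
def kv_cache_fraction(n_heads: int, n_kv_heads: int, shared_kv: bool) -> float:
    """Fraction of the baseline KV cache that remains.

    Baseline: multi-head attention storing separate K and V for every head.
    """
    head_factor = n_kv_heads / n_heads        # head sharing (GQA/MQA)
    proj_factor = 0.5 if shared_kv else 1.0   # Q-K=V caches one tensor instead of two
    return head_factor * proj_factor


def reduction_pct(n_heads: int, n_kv_heads: int, shared_kv: bool) -> float:
    return 100.0 * (1.0 - kv_cache_fraction(n_heads, n_kv_heads, shared_kv))


N_HEADS = 16  # assumption: chosen because it matches the quoted 96.9% figure

print(reduction_pct(N_HEADS, N_HEADS, True))  # Q-K=V alone        -> 50.0
print(reduction_pct(N_HEADS, 4, True))        # Q-K=V + 4 KV heads -> 87.5
print(reduction_pct(N_HEADS, 1, True))        # Q-K=V + MQA        -> 96.875
```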

2606.03650 2026-06-05 cs.CL cs.AI 版本更新

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

CoEval: 无标注数据或可信基准下为自定义任务排序语言模型

Alexander Apartsin, Yehudit Aperstein

发表机构 * Holon Institute of Technology（霍洛技术学院）； Afeka Tel Aviv Academic College of Engineering（阿法卡特拉维大学工程学院）

AI总结：提出CoEval框架，通过教师模型生成无污染基准和跨族评审团，无需标注数据或人工评估即可对语言模型进行排序，在真实排名恢复上达ρ=0.86。

Comments 16 pages, 5 images

AI中文摘要

当特定应用没有任务相关的标注数据，且标准公共基准不可信（其项目可能已泄露到预训练中，因此分数反映的是记忆而非适用性）时，为特定应用选择或排序语言模型最为困难。我们提出CoEval，一个开源、可复用的框架，端到端地弥补了这一差距：仅从任务或领域的描述出发，教师模型合成一个全新的、属性受控的基准，无需人工标注，且由于每次运行都重新生成项目，因此无污染；跨族评审团对候选模型进行排序，无需人工评分。在存在真实基准的情况下验证，CoEval恢复了真实的模型排序，并与真实正确性相关性达ρ=0.86。无标签评审无需人工校准，因为评审团组成（供应商多样性）而非规模驱动可靠性：一个精心挑选的小型跨族评审团最可靠，而单个评审员可能与真实基准负相关（评审员选择遗憾0.35），但集成评审团从未如此。生成的项目与五个主要公共基准的逐字13-gram重叠为零；评审团消除了冗长偏差并排除了同族自我偏好。一项四项任务研究以5.89美元产生了7,978次评估。相同的声明式流程适用于任何领域，并且足够便宜，可以在每次模型发布时重新运行：一个任何团队都可以为其自身应用重新生成的无标签、无污染排行榜。

英文摘要

Selecting a pretrained language model, or evaluating a fine-tuned one, for a specific application is a high-value decision, yet the public benchmarks used to make it are poorly suited: a generic benchmark need not reflect a particular sub-domain or sub-task, and its scores are suspect when its items have leaked into pretraining and are recalled rather than solved. We present CoEval, an open framework that supplies a trustworthy, task-specific signal through ensemble self-evaluation: from a task or domain description, a pool of models rotates through all three roles, teacher, student, and judge, to generate a fresh, contamination-free benchmark, answer it, and score one another, with no human labels or raters. Because every model also answers as a student, the responses are the data that weight each question by its discriminative power and each judge by its consensus with the panel. Where ground truth exists, CoEval recovers the true ranking and tracks objective correctness at ρ=0.86, and the weighting recovers the gold ranking of thirteen models at Spearman 0.95. Reliability comes from panel composition, not size: this label-free weighting zeroes out broken judges and down-weights saturated questions, so neither distorts the ranking. Generated items show zero verbatim overlap with five public benchmarks, the panel cancels verbosity bias and precludes same-family self-preference, and rankings are domain-specific: three different models top four de-novo domains, so a generic leaderboard misdirects most practitioners. The same pipeline reruns on each model release, giving any team a contamination-free leaderboard for its application.

2606.03091 2026-06-05 cs.IR cs.AI 版本更新

先过滤，再重加权：重新思考在线策略蒸馏中的优化粒度

Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Hangjie Yuan, Tao Feng

发表机构 * THU（清华大学）； HKUST（香港科技大学）； BIT（北京理工大学）； Meituan（美团）； ZJU（浙江大学）

AI总结：针对在线策略蒸馏，提出FiRe-OPD方法，通过轨迹级过滤和令牌级软重加权实现细粒度优化，在多种设置下优于现有方法。

AI中文摘要

大型语言模型中的在线策略蒸馏正从全轨迹KL监督转向更具选择性的训练范式。最近的在线策略蒸馏方法越来越关注选择哪些轨迹进行学习、哪些令牌信息量最大以及哪些监督信号最可靠。受此趋势启发，我们重新思考在线策略蒸馏的优化粒度，并提出FiRe-OPD（先过滤，再重加权），该方法在轨迹和令牌两个层面联合调整监督信号。具体来说，FiRe-OPD首先过滤轨迹以移除低质量的采样结果，然后在保留的轨迹内应用软重加权以强调信息丰富的令牌。与硬令牌选择相比，FiRe-OPD利用软加权机制有效减轻信息损失并增强优化稳定性，从而实现更细粒度的在线策略蒸馏优化。我们在强到弱、单教师和多教师设置中验证了FiRe-OPD的有效性，并展示了其相对于近期令牌级在线策略蒸馏方法的优越性（例如，在强到弱设置中AIME 2024上+6.25，在多教师设置中Miner上+18.81）。我们的代码可在 https://github.com/YuYingLi0/FiRe-OPD 获取。

英文摘要

On-policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink the optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both the trajectory and token levels. In detail, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD leverages a soft-weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in strong-to-weak, +18.81 on Miner in multi-teacher). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.

2606.02031 2026-06-05 cs.LG cs.AI cs.CL cs.CV 版本更新

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

OpenWebRL: 揭秘视觉网络代理的在线多轮强化学习

Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, Hao Cheng, Baolin Peng, Huan Zhang, Tong Zhang, Jianfeng Gao

发表机构 * UIUC（伊利诺伊大学香槟分校）； Microsoft（微软）

AI总结：提出OpenWebRL框架，通过在线多轮强化学习在真实网站上训练视觉网络代理，以4B参数模型在基准测试中达到开源最优，并与闭源系统竞争。

Comments 36 pages, 11 figures

AI中文摘要

构建强大的视觉网络代理需要长程推理、精确定位以及与动态真实网站的稳健交互。尽管进展迅速，最强的系统仍然大多是专有的，而开放代理仍然严重依赖于对大量策划的网络轨迹进行监督式后训练。这种依赖造成了主要的可扩展性瓶颈：高质量演示的收集成本高昂，而静态数据集对多样且不断变化的开放网络的覆盖有限。尽管在线强化学习在基于文本的代理中显示出前景，但其直接用于在实时网站上训练视觉网络代理的潜力仍未得到充分探索。在本文中，我们介绍了OpenWebRL，一个用于在真实网站上通过在线多轮强化学习训练视觉网络代理的开放框架。OpenWebRL涵盖了完整的训练流程，包括可扩展的实时浏览器基础设施、监督初始化、多模态上下文管理、轨迹级成功判断以及高效的多轮策略优化。使用该框架，我们训练了OpenWebRL-4B，在具有挑战性的实时网络基准测试中建立了新的开源最优水平。仅使用0.4K初始化轨迹和2.2K开放式强化学习训练任务，OpenWebRL-4B在Online-Mind2Web上达到67.0%的成功率，在DeepShop上达到64.0%，优于之前类似或更大规模的开放代理，并与包括OpenAI CUA和Gemini CUA在内的专有系统保持竞争力。除了强大的基准性能外，我们还系统研究了使在线强化学习对视觉网络代理有效的关键设计选择，并分析了强化学习如何改进代理推理。总体而言，我们的工作为构建更强大、可重复且成本效益更高的开放网络代理提供了一条实用路径。我们将发布我们的训练数据、模型和代码以支持未来的研究。

英文摘要

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.

2606.01897 2026-06-05 cs.AI 版本更新

Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

社区感知的社交文本参与度与共鸣评估：以人为中心的用户生成内容评价视角

Tianjiao Li, Kai Zhao, Xiang Li, Yang Liu, Huyang Sun

发表机构 * GitHub

AI总结：提出CASTER任务和MEDEA架构，通过社会思维链机制模拟社区认知与情感反应，实现用户生成内容的多模态共鸣评估。

Comments Published as a main conference paper at ACL 2026

AI中文摘要

传统视频质量评估（VQA）狭隘地关注美学保真度，忽略了定义用户生成内容（UGC）质量的复杂社会动态。在这项工作中，我们提出从信号中心指标向以人为中心的共鸣评估的范式转变。我们引入CASTER（社区感知的社交文本参与度与共鸣评估），这是一个新任务，根据UGC项目的多模态属性而非仅视觉质量来评估其是否实现积极的社区共鸣。为此，我们提出MEDEA（多模态参与驱动评估架构），它引入了一种新颖的社会思维链（Social-CoT）机制。与传统的逻辑CoT不同，Social-CoT执行多模态视角转换，实例化不同的观众角色以模拟集体认知和情感反应（即“社区思维”），然后得出质量判断。MEDEA通过两阶段方法进行训练，包括监督微调和带有社会对齐奖励的过程监督强化学习，以确保推理路径基于真实的人类社会认知。为支持此任务，我们发布了CASTER-Bench，一个涵盖多种UGC类别的全面人工标注基准。实验表明，MEDEA在CASTER-Bench上显著优于最先进的基线，同时提供可解释且富有同理心的推理路径，与真实社区反馈一致。

英文摘要

Traditional Video Quality Assessment (VQA) focuses narrowly on aesthetic fidelity, overlooking the complex social dynamics that define quality in User-Generated Content (UGC). In this work, we propose a paradigm shift from signal-centric metrics to human-centric resonance assessment. We introduce CASTER (Community-Aware Assessment of Social Textual Engagement and Resonance), a new task that evaluates whether a UGC item achieves positive community resonance based on its multimodal attributes rather than visual quality alone. To address this, we present MEDEA (Multimodal Engagement-Driven Evaluation Architecture), which introduces a novel Social Chain-of-Thought (Social-CoT) mechanism. Unlike traditional logical CoT, Social-CoT performs multimodal perspective-taking, instantiating diverse viewer personas to simulate collective cognitive and emotional reactions (i.e., the "community mind") before deriving a quality judgment. MEDEA is trained via a two-stage approach involving supervised fine-tuning and process-supervised reinforcement learning with Social Alignment Reward to ensure reasoning paths are grounded in authentic human social cognition. To support this task, we release CASTER-Bench, a comprehensive human-annotated benchmark covering diverse UGC categories. Experiments demonstrate that MEDEA significantly outperforms state-of-the-art baselines on CASTER-Bench while providing interpretable and empathetic reasoning paths that align with real community feedback.

2606.00804 2026-06-05 cs.MA cs.AI cs.CL 版本更新

Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems

企业多智能体系统的动态协调策略选择

Thanh Luong Tuan

发表机构 * Golden Gate University（金门大学）； Foundation AgenticOS (FAOS)（基础代理操作系统（FAOS））

AI总结：本文通过大规模实验评估企业多智能体系统是否应根据问题类别动态选择协调策略，发现动态路由作为校准默认值有效，但无法确定唯一最优策略。

Comments 13 pages, 4 appendix. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-1

AI中文摘要

企业多智能体系统日益暴露多种协调模式，但部署时往往缺乏证据表明何时使用共识、辩论、综合或更简单的单智能体工作流。本文评估协调策略是否应根据问题类别动态选择，而非全局固定。我们运行了一个固定的矩阵，包含30个企业任务，涵盖六个行业、五个问题类别、四种执行条件、每个单元格三个重复，以及四个模型分支：qwen_local、sonnet、gemma_openrouter和一个辅助的openai云验证分支。所有1,440个生成输出均由固定的Sonnet评分标准评判。主要发现是有界且操作上有用的，但并非最初的严格H1。预先注册的精确胜者/CI标准未得到支持：精确胜者身份在不同模型分支间不稳定，且若干预测策略接近但未超过最佳观察到的替代方案。一个较弱的近最优路由主张得到强烈支持。在每个预先注册的模型分支和问题类别中，以及在辅助的OpenAI验证分支中，预测策略的质量分数与最佳观察条件相差在0.10以内。结构化合规验证是对原始映射最明显的例外：所有分支都偏好单智能体而非共识。预先注册的Kendall's W检验发现，越南语领域和英语领域任务在四种协调条件排序的一致性上没有可靠差异（两个分层的平均W均为0.20；符号秩检验p = .85），因此H2未得到支持。我们得出结论，企业协调策略应使用动态路由作为校准默认值，而非确定性胜者选择法则。

英文摘要

Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single-agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks spanning six industries, five problem classes, four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud-validation arm. All 1,440 generated outputs are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it is not the original strict H1. The pre-registered exact-winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near-best routing claim is strongly supported. In every pre-registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality-score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre-registered Kendall's W test finds no reliable difference between Vietnamese-domain and English-domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed-rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner-selection law.
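
Kendall's W, the concordance statistic used for H2 above, measures how consistently several rankers order the same items, from 0 (no agreement) to 1 (identical rankings). The sketch below is the standard no-ties formula; how the paper forms its rankers and handles ties is not stated in the abstract, so the function name and the toy inputs are illustrative assumptions.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m rankers and n items (no ties).

    rankings[j][i] is the rank (1..n) that ranker j assigns to item i.
    """
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2                       # expected rank sum per item
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Toy example: four items standing in for four coordination conditions.
agree = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
oppose = [[1, 2, 3, 4], [4, 3, 2, 1]]
print(kendalls_w(agree))   # 1.0
print(kendalls_w(oppose))  # 0.0
```

On this scale, the mean W of 0.20 reported for both strata indicates weak agreement on the ordering of the four conditions.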

2606.00644 2026-06-05 cs.AI 版本更新

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

ForeSci: 评估LLM智能体在前瞻性AI研究判断中的能力

Qiuyu Tian, Haojie Yin, Yingce Xia, Youyong Kong, Zequn Liu

发表机构 * Southeast University（东南大学）； Beijing Zhongguancun Academy（北京中关村学院）； Duke Kunshan University（杜克昆山大学）

AI总结：提出ForeSci基准，通过时间控制的500个任务评估LLM智能体基于历史证据做出前瞻性研究判断的能力，发现证据与决策脱节问题。

AI中文摘要

AI研究通常需要在未来证据出现之前做出决策：攻击哪个瓶颈、追求哪个方向、项目应如何定位。我们引入了ForeSci，一个时间控制的基准，用于评估LLM智能体是否能够从历史证据中做出此类前瞻性研究判断。ForeSci包含500个任务，涵盖四个快速发展的AI领域和四个决策家族。每个任务配有一个截止对齐的离线知识库；截止日期后的论文在生成过程中被隐藏，仅用于验证。为避免随机未来事件预测，任务源自截止前的分类分支和证据信号，并选择早于任务截止日期的答案生成骨干。我们评估了原生LLM、混合RAG以及四种骨干上的三种研究智能体适配。结果表明，显式证据组织提高了可追溯性和事实支持，但收益强烈依赖于决策家族。诊断揭示了一个反复出现的证据-决策脱节：智能体可能引用相关证据，但预测错误的研究对象。ForeSci将前瞻性AI研究判断转化为一个受控基准，用于评估作为决策系统的研究智能体。

英文摘要

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.

2606.00616 2026-06-05 cs.CV cs.AI 版本更新

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

暂停与思考：面向视频基础辅助动作建议的数据集与基准

Shivam Singh, Saptarshi Majumder, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

发表机构 * Advanced Micro Devices, Inc.（先进微器件公司）

AI总结：提出 pause-and-think-T 数据集和 pause-and-think-B 基准，通过推理监督训练紧凑模型，在视频场景理解与目标规划任务中达到与大型模型相当的性能。

AI中文摘要

最近的视觉语言模型（VLM）在视频中的基础推理、时间一致性和上下文感知规划方面存在困难。我们引入了 pause-and-think-T，一个以推理为中心的训练数据集，鼓励模型暂停、基于视觉证据进行推理，并生成简洁、可操作的响应。该数据集在生成答案之前促进结构化推理，引导模型走向类人、基于场景的辅助。我们微调了一个紧凑的 4B 参数模型，并在面向上下文理解和目标规划任务的 pause-and-think-B 基准上对其进行评估。该模型在参数比 Qwen3-VL-235B（58.9%）少 59 倍的情况下达到了 58.0% 的准确率，在场景理解上与 GPT-5.2 匹配，并超越了 GPT-4o。除了我们的基准之外，该模型在 EgoThink 和 TempCompass 上也表现出强大的分布外性能，在可操作性、辅助性、属性识别、情境推理和时间顺序方面取得了显著提升，且无需特定基准训练。我们的结果表明，有针对性的推理监督使紧凑模型能够提供可操作的、基于视觉的指导，同时泛化到训练数据之外，而无需进行大规模模型扩展。

英文摘要

Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context-aware planning in videos. We introduce pause-and-think-T, a reasoning-centric training dataset that encourages models to pause, reason over visual evidence, and produce concise, actionable responses. The dataset promotes structured reasoning prior to answer generation, guiding models toward human-like, scene-grounded assistance. We fine-tune a compact 4B-parameter model and evaluate it on our pause-and-think-B benchmark targeting contextual understanding and goal planning tasks. The model achieves 58.0% accuracy at 59x fewer parameters than Qwen3-VL-235B (58.9%), matching GPT-5.2 on scene understanding and surpassing GPT-4o. Beyond our benchmark, it also shows strong out-of-distribution performance on EgoThink and TempCompass, with substantial gains in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without benchmark-specific training. Our results indicate that targeted reasoning supervision enables compact models to deliver actionable, visually grounded guidance while generalizing beyond training data, without requiring large-scale model expansion.

2605.31278 2026-06-05 cs.AI cs.LG stat.ME 版本更新

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

工业化预测驱动推断：用于可靠生成式AI与智能体系统评估的GLIDE库

Grégoire Martinon, Ibrahim Merad, Mohammed Raki

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Google Research（谷歌研究院）

AI总结：提出GLIDE开源库，统一多种预测驱动推断方法，提供无偏估计与有效置信区间，显著降低人工标注成本。

Comments 8 pages, Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems, Seoul, South Korea, 2026

AI中文摘要

智能体系统的可靠评估需要具有有效不确定性的无偏估计，但标准实践在昂贵的人工标注和有偏的LLM-as-judge代理之间权衡。预测驱动推断（PPI）将两者结合为具有有效置信区间的去偏估计，然而其各种方法仍分散在不同论文的部分实现中。我们介绍GLIDE，一个开源Python库，它在专用于均值估计的scipy风格API下统一了最先进的PPI估计器（PPI++、分层PPI、先预测后去偏及其分层变体、主动统计推断）和采样器（均匀、分层、主动、成本最优）。GLIDE附带一个可复现的蒙特卡洛验证套件、一个基于经验的决策树用于方法选择，以及一个智能体评估案例研究，显示在同等精度下显著节省标注成本。GLIDE包可通过此URL获取：https://github.com/EmertonData/glide

英文摘要

Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open-source Python library that unifies state-of-the-art PPI estimators (PPI++, Stratified PPI, Predict-Then-Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost-optimal) under a scipy-style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. The GLIDE package is available at this URL: https://github.com/EmertonData/glide
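
To make the idea of a debiased estimate concrete, the classic prediction-powered estimator of a mean can be written in a few lines: average the proxy (for example, LLM-as-judge) scores on the large unlabeled set, then add a correction measured on the small human-labeled set. The sketch below uses only the Python standard library and a normal-approximation interval. It is not GLIDE's API, and the function name, argument names, and toy numbers are assumptions for illustration.

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean(y_labeled, f_labeled, f_unlabeled, alpha=0.05):
    """Classic prediction-powered estimate of a mean with a normal-approximation CI.

    y_labeled:   human labels on the small labeled set
    f_labeled:   proxy scores on those same labeled items
    f_unlabeled: proxy scores on the large unlabeled set
    """
    n, big_n = len(y_labeled), len(f_unlabeled)
    # Rectifier: how far the proxy is off, measured where human labels exist.
    rectifier = [y - f for y, f in zip(y_labeled, f_labeled)]
    theta = fmean(f_unlabeled) + fmean(rectifier)
    # The two terms come from independent samples, so their variances add.
    se = math.sqrt(variance(f_unlabeled) / big_n + variance(rectifier) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return theta, (theta - z * se, theta + z * se)


# Toy numbers, invented purely for illustration.
y = [1.0, 0.0, 1.0, 1.0]
f_lab = [0.9, 0.2, 0.7, 0.8]
f_unl = [0.6, 0.8, 0.7, 0.7]
estimate, (low, high) = ppi_mean(y, f_lab, f_unl)
print(round(estimate, 3), round(low, 3), round(high, 3))
```

The interval narrows as the proxy tracks the human labels more closely, which is where the annotation savings described in the abstract come from.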

2605.30747 2026-06-05 cs.AI 版本更新

Generating Graph-Like Logical Rules for Knowledge Graph Reasoning via Diffusion Models

通过扩散模型生成图状逻辑规则用于知识图谱推理

Haoxiang Cheng, Yunfei Wang, Chao Chen, Kewei Cheng, Zhipeng Lin, Haoxuan Li, Changjun Fan, Shixuan Liu

发表机构 * Laboratory for Big Data and Decision（大数据与决策实验室）； National University of Defense Technology（国防科技大学）； National Key Laboratory of Information Systems Engineering（信息系统工程国家重点实验室）； Microsoft Corporation（微软公司）； College of Computer Science and Technology（计算机科学与技术学院）

AI总结：提出GRiD框架，利用扩散模型将图状规则发现转化为以目标关系为条件的离散生成过程，结合监督预训练和强化学习优化，实现知识图谱补全中图状规则的高效挖掘。

Comments accepted by KDD 26

DOI: 10.1145/3770855.3817814

AI中文摘要

逻辑规则构成知识图谱推理的基石，因其可解释性和建模关系模式的能力而受到重视。然而，现有规则挖掘方法主要关注简单的链状规则，因此忽略了图状结构中编码的更丰富的关系信息，例如循环和分支。这一局限性因搜索空间组合爆炸导致的计算瓶颈而进一步加剧，这对图状规则尤其具有挑战性。同时，生成方法如扩散模型，尽管在其他领域取得了成功，但不能直接应用于规则挖掘，因为它们的训练目标与学习高质量规则的目标不一致，且不可微的知识图谱规则质量指标无法直接指导模型优化。为解决这些局限性，我们提出GRiD，一个将图状规则发现重新表述为以目标关系为条件的离散生成过程的框架。GRiD采用两阶段训练策略。首先，监督预训练使GRiD能够从知识图谱元图采样的子图中捕获结构先验。随后，应用强化学习通过直接由不可微规则质量指标指导的策略梯度优化来微调GRiD。在六个基准数据集上的实验表明，GRiD在知识图谱补全任务上取得了有竞争力的性能。消融研究证实了GRiD的效率和鲁棒性，并进一步表明图状规则在知识图谱补全中补充了链状规则。我们的代码和数据集可在https://github.com/Haoxiang-Cheng/GRiD获取。

英文摘要

Logical rules constitute a cornerstone of knowledge graph (KG) reasoning, valued for their interpretability and ability to model relational patterns. However, existing rule mining methods predominantly focus on simple chain-like rules and therefore neglect the richer relational information encoded in graph-like structures, such as cycles and branches. This limitation is further exacerbated by computational bottlenecks caused by the combinatorial explosion of the search space, which is especially challenging for graph-like rules. Meanwhile, generative approaches such as diffusion models, despite their success in other domains, cannot be directly applied to rule mining because their training objectives are not aligned with the goal of learning high-quality rules, and non-differentiable KG rule quality metrics cannot directly guide model optimization. To address these limitations, we propose GRiD, a framework that reformulates graph-like rule discovery as a discrete generative process conditioned on the target relation. GRiD employs a two-phase training strategy. First, supervised pre-training enables GRiD to capture structural priors from subgraphs sampled from the KG meta-graph. Subsequently, reinforcement learning is applied to fine-tune GRiD through policy gradient optimization guided directly by non-differentiable rule-quality metrics. Experiments on six benchmark datasets show that GRiD achieves competitive performance on KG completion tasks. Ablation studies confirm the efficiency and robustness of GRiD and further show that graph-like rules complement chain-like rules in KG completion. Our code and datasets are available at https://github.com/Haoxiang-Cheng/GRiD.

2605.28579 2026-06-05 cs.AI 版本更新

MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

MUSE: 面向可制造、功能性和可装配的文本到CAD生成的基准测试

Xiaoyu Dong, Zhi Li, Xiao-Ming Wu

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； Curvature Flow Co., Limited（曲率流有限公司）

AI总结：提出MUSE基准，通过三阶段评估协议（代码检查、几何检查、设计意图对齐）和基于评分标准的视觉语言模型评判，衡量文本到CAD生成模型在功能、制造和装配方面的实际设计质量。

Comments 26 pages

AI中文摘要

大型语言模型（LLMs）近期推动了文本驱动的3D生成，但文本到CAD仍远未支持工业产品设计。现有基准主要关注生成单零件CAD模型，并使用几何相似性指标进行评估，这些指标无法捕捉功能、可制造性和可装配性。为弥补这一空白，我们引入MUSE，一个专注于复杂、可编辑边界表示（B-Rep）装配体的文本到CAD基准。MUSE将实际设计实例与结构化设计规范配对，并通过三阶段评估协议评估生成的模型：代码检查、几何检查和设计意图对齐。最后阶段使用特定于设计的评分标准评估功能、可制造性和可装配性，超越形状匹配，走向实际设计质量。为实现可扩展评估，我们使用基于评分标准的视觉语言模型（VLM）评判器，并通过人工标注验证其可靠性。在闭源和开源LLM上的实验揭示了从可执行代码到有效几何再到工程就绪设计的明显失败级联，即使最强的模型在细粒度工程标准上也仅取得有限成功。MUSE为将文本到CAD从几何生成推向真正的工程设计提供了现实的基准和评估框架。我们的项目网站（包括排行榜、数据集和代码）可在 https://dong7313.github.io/muse-benchmark/ 获取。

英文摘要

Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.

2605.27887 2026-06-05 cs.AI q-fin.PM 版本更新

PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

PortBench: 一种相关性感知的、全流水线的LLM驱动投资组合管理基准

Yuxuan Zhao, Sijia Chen, Ningxin Su

发表机构 * Yantai Research Institute of Harbin Engineering University（哈尔滨工程大学烟台研究院）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出PortBench基准，通过静态QA和动态五阶段分配流水线评估LLM在投资组合管理中的表现，发现多数模型无法超越等权重分配，且存在推理错误累积和压力下大幅回撤的问题。

Comments Project page: https://portbench.github.io/

AI中文摘要

LLMs在多种金融任务中表现出色，但投资组合管理（PM）这一关键金融决策任务仍缺乏良好基准。现有基准存在两个主要缺陷：忽略跨资产相关性结构，从而无法区分真正多样化的投资组合与集中投资组合；未能评估真实场景中完整的PM决策流水线。我们提出PortBench，一个涵盖十年间六类异质资产类别的基准。PortBench由两个互补层组成：包含6269个基于相关性的问题（覆盖七个任务模板）的静态QA数据集，以及模拟完整PM决策周期的动态五阶段分配流水线。为评估这些层，我们引入两个专用指标：双层次相关性分数，衡量所提投资组合是否利用跨类别对冲并避免类别内集中；以及CEPS，量化推理错误如何在流水线阶段间累积。我们进一步在三种历史压力情景和风险配置下评估策略稳健性和投资者对齐。评估十个前沿LLM，我们发现尽管在静态金融QA上表现强劲，90%的模型-配置组合未能超越基本的等权重分配，且满足所有程序约束的模型在压力下仍遭受灾难性回撤。我们的源代码可在 https://github.com/AgenticFinLab/portbench 获取。

英文摘要

Large language models (LLMs) have shown strong performance across diverse financial tasks, yet portfolio management (PM), a critical financial decision-making task, remains poorly benchmarked. Existing benchmarks exhibit two main gaps: they ignore cross-asset correlation structures, thereby failing to distinguish genuinely diversified portfolios from concentrated ones, and fail to evaluate the complete PM decision pipeline in real-world scenarios. We introduce PortBench, a benchmark spanning six heterogeneous asset classes over ten years. PortBench consists of two complementary layers: a static QA dataset of 6,269 correlation-based questions across seven task templates, and a dynamic five-stage allocation pipeline that mirrors the full PM decision cycle. To evaluate these layers, we introduce two dedicated metrics: a dual-layer correlation score that measures whether proposed portfolios exploit inter-class hedging and avoid intra-class concentration, and CEPS, a metric that quantifies how reasoning errors compound across pipeline stages. We further assess strategy robustness and investor alignment under three historical stress regimes and risk profiles. Evaluating ten frontier LLMs, we find that despite strong performance on static financial QA, 90% of model-profile combinations fail to outperform a basic equal-weight allocation, and models that satisfy every procedural constraint still suffer catastrophic drawdowns under stress. Our source code is available at https://github.com/AgenticFinLab/portbench.
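
The abstract compares model portfolios against an equal-weight allocation and reports drawdowns under stress. As a rough illustration of those two quantities only (not PortBench's own evaluation code), the sketch below assumes simple per-period returns and per-period rebalancing to 1/N weights; the function names are hypothetical.

```python
from typing import Sequence


def equal_weight_returns(asset_returns: Sequence[Sequence[float]]) -> list[float]:
    """Per-period return of a portfolio holding every asset at weight 1/N.

    asset_returns[t][i] is the simple return of asset i in period t.
    Rebalancing back to equal weights every period is an assumption.
    """
    return [sum(period) / len(period) for period in asset_returns]


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the compounded wealth curve, as a fraction."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1.0 + r                         # compound the period return
        peak = max(peak, wealth)                  # running maximum of wealth
        worst = max(worst, 1.0 - wealth / peak)   # loss relative to that peak
    return worst
```

A model-proposed allocation could be scored the same way by replacing the 1/N average with a weighted sum and comparing the two drawdown figures.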

2605.26179 2026-06-05 cond-mat.mtrl-sci cs.AI cs.CE 版本更新

AutoDFT: A Closed-Loop Multi-Agent Framework for Autonomous DFT Calculations

AutoDFT：用于自主DFT计算的闭环多智能体框架

Penghui Yang, Zhonghan Zhang, Yue Li, Xinrun Wang, Yanchen Deng, Yuhao Lu, Bijun Tang, Zheng Liu, Bo An

发表机构 * Nanyang Technological University, Singapore（南洋理工大学，新加坡）； Singapore Management University（新加坡管理大学）

AI总结提出AutoDFT闭环多智能体框架，通过将LLM推理嵌入DFT计算全生命周期，实现从规划到执行的自主适应，在VASPBench基准上达到94.1%任务成功率，并可靠预测电子、磁性和能量性质。

AI中文摘要

密度泛函理论（DFT）是材料科学和化学中计算发现的基础，然而每次计算都需要大量人工努力：当收敛停滞时调整算法，当出现意外物理现象时修改计划，以及当中间结果重塑问题时插入步骤。现有的基于LLM的智能体仅自动化初始规划阶段，预先生成完整的执行计划，而将所有后续调整留给手工规则。因此，这些工作流仍然脆弱，难以泛化到预规划场景之外，并且当失败或意外的中间结果需要改变计算路径时，通常需要专家干预。在此，我们介绍AutoDFT，一个闭环多智能体框架，将LLM推理嵌入DFT生命周期的每个阶段：战略规划器生成步骤目标的骨架计划；步骤规划器根据先前结果即时生成数值参数；监控-恢复-反思循环诊断失败、修复失败，并在证据支持时修改计划。我们展示了广度和深度：广度方面，在VASPBench（一个专门构建的基准，涵盖34个任务和9种DFT计算类型）上，AutoDFT使用GPT-5.2实现了94.1%的任务级成功率；深度方面，在已建立的材料数据库上，AutoDFT在电子、磁性和能量性质上产生了定量可靠的属性预测。通过闭环规划和执行，AutoDFT使没有深厚计算专业知识的实验人员能够获得可靠的第一性原理结果。

英文摘要

Density functional theory (DFT) serves as the basis for computational discovery in materials science and chemistry, yet each calculation demands extensive human effort: adjusting algorithms when convergence stalls, revising plans when unexpected physics emerges, and inserting steps as intermediate results reshape the problem. Existing LLM-based agents automate only the initial planning stage, producing a full execution plan upfront and leaving all subsequent adaptation to hand-crafted rules. As a result, these workflows remain fragile, do not generalize well beyond pre-planned scenarios, and often require expert intervention when failures or unexpected intermediate results require changes to the calculation path. Here, we introduce AutoDFT, a closed-loop multi-agent framework that embeds LLM reasoning into every stage of the DFT lifecycle, where a strategic planner produces a skeletal plan of step objectives; a step planner generates numerical parameters just in time from preceding results; and a monitor-recover-reflect cycle diagnoses failures, repairs them, and revises the plan when the evidence justifies it. We demonstrate both breadth and depth: breadth on VASPBench, a purpose-built benchmark spanning 34 tasks and 9 DFT calculation types, where AutoDFT achieves 94.1% task-level success with GPT-5.2; and depth on established materials databases, where AutoDFT produces quantitatively reliable property predictions across electronic, magnetic, and energetic properties. By closing the loop between planning and execution, AutoDFT enables experimentalists without deep computational expertise to obtain reliable first-principles results.

2605.26046 2026-06-05 cs.CL cs.AI cs.LG cs.MA cs.SE 版本更新

When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

当梯度冲突时：多目标提示优化用于LLM评判器的失败模式

Parth Darshan, Abhishek Divekar

发表机构 * IIT Jodhpur（印度理工学院焦特布尔分校）； Amazon（亚马逊）

AI总结研究多目标文本梯度优化中梯度稀释和指令干扰两种失败模式，通过分解优化器信息共享方式揭示性能下降原因。

Comments Accepted at ACL 2026 - CustomNLP4U Workshop. Code, prompts and data available at https://github.com/adivekar-utexas/when-gradients-collide

AI中文摘要

将LLM评判器定制到特定问题或领域，通常需要同时针对多个评估标准优化其提示。文本梯度方法可针对单一评判标准自动完成这一过程，但它们产生的是自然语言批评，而非数值向量。因此，多任务学习的冲突解决工具包（PCGrad、MGDA）不适用于这种多目标文本梯度设置。我们将TextGrad扩展到多目标设置，并通过改变损失、梯度和优化器LLM共享跨目标信息的程度，测试了文本梯度优化器的四种分解模式。我们发现，当梯度LLM必须同时针对多个标准提供反馈时，梯度的任务聚焦度下降了59%（满分10分，从9.0降至3.7）。另外，我们观察到，将各单目标优化得到的指令简单组合成单个提示，会使斯皮尔曼相关系数ρ从0.305降至0.220（-0.085）。这些结果识别出两种可分离的失败模式：优化时的梯度稀释和推理时的指令干扰，它们共同限制了使用文本反馈进行多目标评判器优化的设计空间。

英文摘要

Customizing an LLM judge to a specific problem or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion; however, they produce natural-language critiques, not numerical vectors. Thus, the conflict-resolution toolkit of multi-task learning (PCGrad, MGDA) does not apply to this multi-objective textual gradient setting. We extend TextGrad to the multi-objective setting and test four decomposition modes of textual gradient optimizers by varying how much cross-objective information the loss, gradient and optimizer LLMs share. We find the gradient's task-focus drops by 59% (9.0 to 3.7 out of 10) when the gradient LLM must provide feedback on multiple criteria jointly. Separately, we observe that naively combining single-objective optimized instructions into a single prompt degrades Spearman rho from 0.305 to 0.220 (-0.085). These results identify two separable failure modes: optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge optimization using textual feedback.

2605.29916 2026-06-05 cs.NE cs.AI cs.DS math.OC 版本更新

Selection Hyper-heuristics Can Automatically Adjust the Learning Period to Optimally Solve Pseudo-Boolean Problems

选择超启发式可以自动调整学习周期以最优地解决伪布尔问题

Benjamin Doerr, Pietro S. Oliveto, John Alasdair Warwicker

发表机构 * Laboratoire d’Informatique (LIX), CNRS, École Polytechnique, Institut Polytechnique de Paris（信息实验室（LIX），法国国家科学研究中心，巴黎高等理工学院，巴黎理工学院）； Department of Computer Science and Engineering, Southern University of Science and Technology（计算机科学与工程系，南方科技大学）； School of Computing & Communications, Lancaster University Leipzig（计算与通信学院，兰卡斯特大学莱比锡分校）

AI总结本文提出一种自动设置学习周期参数的超启发式方法，证明其能在1-o(1)比例的迭代中选择最优邻域大小，从而以最优时间（忽略低阶项）优化LeadingOnes基准问题。

Comments To appear in "Artificial Intelligence"

DOI: 10.1016/j.artint.2026.104560
Journal ref: Artificial Intelligence 357:104560 (2026)

AI中文摘要

最近研究表明，随机梯度超启发式在使用随机局部搜索（RLS）元启发式优化LeadingOnes基准时，能够学习最优邻域大小。然而，这需要使用一定长度$\tau$的学习周期，这与经典超启发式不同，后者仅基于前一次迭代的成功来改变行为。在本文中，我们展示了如何自动设置这个新参数值，从而使用户免于控制这一新颖算法参数的非平凡任务。我们证明，由此产生的超启发式在$1-o(1)$比例的迭代中选择最优邻域大小，并因此以这些邻域大小所能达到的最佳时间（忽略低阶项）优化LeadingOnes基准。

英文摘要

The Random Gradient hyper-heuristic was recently shown to be able to learn the optimal neighbourhood size when optimizing the LeadingOnes benchmark via the Randomised Local Search (RLS) meta-heuristic. However, for this to happen, a learning period of a certain length $\tau$ had to be used, differently from classic hyper-heuristics, which change their behaviour based on the success of only the previous iteration. In this paper, we show how to automatically set this new parameter value, relieving the user from the non-trivial task of controlling this novel algorithm parameter. We prove that the resulting hyper-heuristic selects the optimal neighbourhood size in a $1-o(1)$ fraction of the iterations and, consequently, optimises the LeadingOnes benchmark in the best possible time (apart from lower-order terms) achievable with these neighbourhood sizes.
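
For readers unfamiliar with the setting, the following is a minimal sketch of LeadingOnes, an RLS step with neighbourhood size k, and a Random Gradient loop with a fixed learning period. It reflects one common reading of the fixed-period variant and is an assumption, not the paper's algorithm: the self-adjusting rule for the period that the paper contributes is not reproduced here, and all function and parameter names (`tau`, `sizes`, `seed`) are illustrative.

```python
import random


def leading_ones(bits: list[int]) -> int:
    """LeadingOnes fitness: number of consecutive 1-bits at the start."""
    count = 0
    for b in bits:
        if b != 1:
            break
        count += 1
    return count


def rls_step(bits: list[int], k: int, rng: random.Random) -> tuple[list[int], bool]:
    """One RLS iteration: flip k distinct random bits and keep the offspring
    only if fitness does not decrease. Also reports a strict improvement."""
    child = bits[:]
    for i in rng.sample(range(len(bits)), k):
        child[i] ^= 1
    old, new = leading_ones(bits), leading_ones(child)
    return (child, new > old) if new >= old else (bits, False)


def random_gradient_run(n: int, sizes: list[int], tau: int, seed: int = 0) -> int:
    """Fixed-period sketch: draw a neighbourhood size at random, keep it while
    it yields a strict improvement within tau iterations, otherwise re-draw.
    Returns the number of iterations until the all-ones string is reached."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n)]
    iterations = 0
    while leading_ones(bits) < n:
        k = rng.choice(sizes)
        remaining = tau
        while remaining > 0 and leading_ones(bits) < n:
            bits, improved = rls_step(bits, k, rng)
            iterations += 1
            # A success restarts the learning period with the same size.
            remaining = tau if improved else remaining - 1
    return iterations
```

Calling `random_gradient_run(50, [1, 2], tau=100)` for a few values of `tau` gives a feel for why the period length matters and why setting it automatically is attractive.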

2603.19294 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Maximizing Mutual Information Between Prompt and Response Improves LLM Performance With No Additional Data

最大化提示与响应之间的互信息无需额外数据即可提升LLM性能

Hyunji Nam, Haoran Li, Natasha Jaques

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出互信息偏好优化（MIPO）方法，通过对比数据增强构建偏好对，利用直接偏好优化最大化提示与响应间的点互信息，无需额外数据或外部监督即可提升LLM在个性化和可验证任务上的性能。

Comments International Conference on Machine Learning 2026

AI中文摘要

虽然后训练已在多个领域成功改进了大型语言模型（LLM），但这些提升严重依赖人工标注数据或外部验证器。现有数据已被充分利用，而新数据收集成本高昂。此外，真正的智能远不止可验证任务。因此，我们需要较少依赖外部信号且更广泛适用于可验证和不可验证领域的自我改进框架。我们提出**互信息偏好优化（MIPO）**，一种对比数据增强方法，通过基于正确提示生成正响应，以及基于随机无关提示生成负响应来构建偏好对。我们证明，使用直接偏好优化从这些配对数据中学习，可以最大化*基础LLM*下提示与响应之间的逐点互信息。使用1-7B参数的Llama和Qwen指令模型的实验表明，与提示基线相比，MIPO在个性化任务上实现了3-16%的提升（Qwen2.5-1.5B-Instruct提升51%）。令人惊讶的是，MIPO在可验证领域（如数学和多项选择题问答）也有用，*无需任何额外数据或外部监督*即可获得1-20%的提升。这些结果表明，利用对比数据对中的内在信号进行自我改进是一个有前景的方向。

英文摘要

While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new data is expensive to collect. Moreover, true intelligence goes far beyond verifiable tasks. Therefore, we need self-improvement frameworks that are less dependent on external signals and more broadly applicable to both verifiable and non-verifiable domains. We propose **Mutual Information Preference Optimization (MIPO)**, a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioning on the correct prompt, and a negative response by conditioning on a random, unrelated prompt. We show that using Direct Preference Optimization to learn from this paired data maximizes pointwise mutual information *under the base LLM* between prompts and model responses. Experiments with 1-7B parameter Llama and Qwen instruct models show that MIPO achieves 3-16% gains (and 51% increase for Qwen2.5-1.5B-Instruct) on personalization compared to prompting baselines. Surprisingly, MIPO can also be useful in verifiable domains, such as math and multiple-choice question answering, yielding 1-20% gains *without any additional data or external supervision*. These results suggest a promising direction for self-improvement using intrinsic signals derived from contrastive data pairs.
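
The pair construction described in the abstract can be sketched in a few lines. Pointwise mutual information between a prompt x and a response y is log p(y|x) - log p(y), so a response sampled for an unrelated prompt serves as a stand-in for a draw from the marginal p(y). In the sketch below, `generate` is a hypothetical placeholder for sampling from the base LLM, and the dictionary keys are illustrative rather than taken from the paper's code.

```python
import random
from typing import Callable


def build_mipo_pairs(
    prompts: list[str],
    generate: Callable[[str], str],
    rng: random.Random,
) -> list[dict[str, str]]:
    """Build preference pairs in the way the abstract describes.

    chosen   = response generated from the prompt itself
    rejected = response generated from a different, randomly drawn prompt
    """
    if len(prompts) < 2:
        raise ValueError("need at least two prompts to draw an unrelated one")
    pairs = []
    for i, prompt in enumerate(prompts):
        # Pick any other prompt index uniformly at random.
        j = rng.choice([k for k in range(len(prompts)) if k != i])
        pairs.append(
            {
                "prompt": prompt,
                "chosen": generate(prompt),
                "rejected": generate(prompts[j]),
            }
        )
    return pairs
```

The resulting list could then be handed to any Direct Preference Optimization trainer that accepts prompt, chosen and rejected fields; no labels or verifiers are involved at any point.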

2605.25582 2026-06-05 cs.LG cs.AI 版本更新

Reflex: 基于状态连续控制中利用反射对称性的强化学习

Shuai Zhen, Yifan Zhang, Yuling Wang, Yanhua Yu

AI总结提出Reflex框架，通过反射对称性正则化机制将反射对称性融入策略学习，提升基于状态的连续控制任务的样本效率。

Comments Some of the data in the paper contain errors and need to be confirmed for modification

AI中文摘要

强化学习长期面临样本效率低下的问题。缓解该问题的一种有前景的方法是利用群不变马尔可夫决策过程（$G$-不变MDP）。现有工作主要关注基于图像的强化学习和旋转对称性（如$\mathrm{SO(2)}$），而基于状态的强化学习和反射对称性尚未得到充分探索。本文聚焦于基于状态的连续控制任务，通过引入Reflex范式来利用反射对称性，该范式可无缝集成到同策略和异策略强化学习算法中。我们形式化了两种反射类型——轴向反射和双侧反射，并刻画了它们对应的变换。基于对保持对称性的最优值函数和策略的理论分析，Reflex通过原则性的对称性正则化机制将反射对称性融入策略学习。我们将Reflex与PPO和SAC集成，并在OpenAI Gym和DeepMind Control基准测试套件上进行评估，结果表明相比标准基线，Reflex在提升样本效率的同时实现了更优的性能。我们的代码开源在https://github.com/TonyStark042/Reflex。

英文摘要

Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-invariant MDPs). Existing works in this direction have primarily focused on image-based RL and rotational symmetry such as $\mathrm{SO(2)}$, leaving state-based RL and reflection symmetry largely underexplored. In this work, we focus on state-based continuous control tasks and exploit reflection symmetry by introducing Reflex, a paradigm that seamlessly integrates with both on-policy and off-policy RL algorithms. We formalize two types of reflection: axial reflection and bilateral reflection, and characterize their corresponding transformations. Building on a theoretical analysis of symmetry-preserving optimal value functions and policies, Reflex integrates reflection symmetry into policy learning through principled symmetry regularization mechanisms. We integrate Reflex with PPO and SAC, and evaluate it on a suite of OpenAI Gym and DeepMind Control benchmarks, demonstrating superior performance over standard baselines while improving sample efficiency. Our code is available at https://github.com/TonyStark042/Reflex.
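
One generic way to express a symmetry regularizer of this kind is to penalize the gap between the action chosen in a mirrored state and the mirrored action chosen in the original state. The sketch below is an assumption made for illustration, not the Reflex loss: it models a reflection as per-coordinate sign flips (a bilateral reflection would typically also swap left and right coordinates), and every name in it is hypothetical.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def reflect(vec: Vector, signs: Sequence[int]) -> list[float]:
    """Apply a reflection given as +1/-1 sign flips per coordinate."""
    return [s * v for s, v in zip(signs, vec)]


def symmetry_penalty(
    policy: Callable[[Vector], Vector],
    states: Sequence[Vector],
    state_signs: Sequence[int],
    action_signs: Sequence[int],
) -> float:
    """Mean squared gap between policy(reflect(s)) and reflect(policy(s)).

    The value is zero exactly when the policy is equivariant on `states`.
    """
    total = 0.0
    for s in states:
        mirrored_action = policy(reflect(s, state_signs))
        expected_action = reflect(policy(s), action_signs)
        total += sum((a - b) ** 2 for a, b in zip(mirrored_action, expected_action))
    return total / len(states)
```

In a training loop, a term like this would be added to the PPO or SAC objective with a weighting coefficient, evaluated on the states in each batch.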

2605.15913 2026-06-05 cs.CL cs.AI 版本更新

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

通过自动分割和块蒸馏实现块注意力的泛化

Shuaiyi Li, Zhisong Zhang, Yan Wang, Lei Zhu, Dongyang Ma, Chenlong Deng, Yang Deng, Wai Lam

发表机构 * The Chinese University of Hong Kong（香港中文大学）； City University of Hong Kong（香港城市大学）； Tencent（腾讯）； Gaoling School of Artificial Intelligence, Renmin University of China（中国人民大学人工智能学院）； Singapore Management University（新加坡管理大学）

AI总结提出基于语义分割数据集训练的轻量级分割器和块蒸馏框架，解决块注意力在长上下文中的文本分割和微调效率问题，实现接近全注意力的性能。

Comments 16 pages, 2 figures

AI中文摘要

块注意力将输入作为独立的块处理，块之间不能相互关注，在检索增强生成（RAG）等长上下文场景中具有显著提升KV缓存重用的潜力。然而，其广泛应用受到两个关键挑战的阻碍：将输入文本分割成有意义且自包含的块的困难，以及现有块微调方法效率低下且可能降低性能的风险。为解决这些问题，我们首先构建了SemanticSeg，一个大规模且多样化的语义分割数据集，包含超过30k个实例，涵盖16个类别——包括书籍、代码、网页文本和对话，文本长度从2k到32k。利用该数据集，我们训练了一个轻量级分割器，能够自动将文本分割成符合人类直觉的块，且粒度可控。其次，我们提出了块蒸馏，一种比块微调更高效的训练框架，它使用冻结的全注意力教师模型来指导块注意力学生模型。该框架集成了三个新颖的组件：块汇合令牌以减轻块边界处的信息丢失，块丢弃以利用来自所有块的训练信号，以及令牌级损失加权以聚焦于对块注意力敏感的令牌的学习。跨多个模型和基准的实验表明，我们的分割器优于启发式和统计基线，且块蒸馏在块注意力下实现了接近全注意力的性能，为部署块注意力建立了一条实用且可扩展的路径。

英文摘要

Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human-instinct-aligned blocks with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine-tuning, which uses a frozen full-attention teacher model to guide the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.

2605.21557 2026-06-05 stat.ML cs.AI cs.LG 版本更新

Scalable Reinforcement Learning via Adaptive Batch Scaling

通过自适应批处理缩放实现可扩展的强化学习

Jongchan Park

发表机构 * Jongchan Park

AI总结本文提出自适应批处理缩放方法，通过动态调整有效批处理大小来平衡强化学习早期的可塑性需求和晚期的稳定收敛，发现增大网络和批处理大小的组合在强化学习中取得最佳性能。

AI中文摘要

传统观点认为大批次训练与强化学习（RL）本质上不兼容：由于数据分布固有的非平稳性，批次大小超过某个不大的阈值后，继续增大通常只会带来收益递减甚至性能下降。我们对这一观点提出质疑，并观察到非平稳性并非RL的固定属性，而是随训练过程不断演变：早期阶段表现出快速的行为转变，需要小批次以保持可塑性，而晚期阶段接近准平稳状态，大批次可实现精确收敛。受此启发，我们提出自适应批处理缩放（ABS），根据学习策略的稳定性动态调整有效批次大小。ABS的核心是行为分歧（Behavioral Divergence），一种新的度量指标，通过测量连续更新之间的动作级转变来量化策略非平稳性，我们用它使批次大小与策略波动性成反比缩放。与并行化Q网络（PQN）算法结合并在ALE基准上评估，ABS无缝地平衡了早期阶段的可塑性和晚期阶段的稳定收敛。令人惊讶的是，与传统观点相反，我们的结果表明，较大的网络和较大的批次大小的组合实现了最佳性能，这是一种此前被认为在强化学习中无法实现的扩展行为，如今通过自适应批处理控制得以解锁。

英文摘要

Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL): beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation due to the inherent non-stationarity of the data distribution. We challenge this view by observing that non-stationarity is not a fixed property of RL, but evolves throughout training: early stages exhibit rapid behavioral shifts that demand small batches for plasticity, whereas late stages approach a quasi-stationary regime where large batches enable precise convergence. Motivated by this observation, we propose Adaptive Batch Scaling (ABS), which dynamically adjusts the effective batch size according to the stability of the learning policy. Central to ABS is Behavioral Divergence, a novel metric that quantifies policy non-stationarity by measuring action-level shifts between consecutive updates, which we use to scale batch size inversely to policy volatility. Integrated with the Parallelised Q-Network (PQN) algorithm and evaluated on the ALE benchmark, ABS seamlessly reconciles early-stage plasticity with late-stage stable convergence. Strikingly, contrary to conventional wisdom, our results reveal that the combination of larger networks and larger batch sizes achieves the best performance, a scaling behavior previously thought to be unattainable in RL, now unlocked through adaptive batch control.
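
To make the idea of scaling batch size inversely to policy volatility concrete, here is a small sketch. Both the divergence definition (fraction of probe states whose greedy action changed) and the mapping to a batch size are assumptions chosen for illustration; the abstract does not give the paper's formulas, and the bounds `min_batch`, `max_batch` and `eps` are hypothetical.

```python
def behavioral_divergence(prev_actions: list[int], new_actions: list[int]) -> float:
    """Fraction of probe states whose greedy action changed after one update.

    This definition is an assumption; the abstract only says the metric
    measures action-level shifts between consecutive updates.
    """
    if len(prev_actions) != len(new_actions) or not prev_actions:
        raise ValueError("action lists must be non-empty and of equal length")
    changed = sum(a != b for a, b in zip(prev_actions, new_actions))
    return changed / len(prev_actions)


def adaptive_batch_size(
    divergence: float,
    min_batch: int = 256,
    max_batch: int = 8192,
    eps: float = 0.01,
) -> int:
    """Map a divergence in [0, 1] to a batch size that shrinks as volatility grows."""
    scale = eps / max(divergence, eps)  # 1.0 when stable, close to eps when volatile
    size = int(max_batch * scale)
    return max(min_batch, min(max_batch, size))
```

Early in training, when many greedy actions flip between updates, this returns a batch near `min_batch`; once the policy settles, it moves toward `max_batch`.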

2605.20119 2026-06-05 cs.LG cs.AI 版本更新

Toto 2.0: Time Series Forecasting Enters the Scaling Era

Toto 2.0：时间序列预测进入规模化时代

Emaad Khwaja, Chris Lettieri, Gerald Woo, Eden Belouadah, Marc Cenac, Guillaume Jarry, Enguerrand Paquin, Xunyi Zhao, Viktoriya Zhukov, Othmane Abou-Amal, Chenghao Liu, Ameet Talwalkar, David Asker

发表机构 * Datadog AI Research（Datadog AI研究院）； Carnegie Mellon University（卡内基梅隆大学）

AI总结本文提出Toto 2.0模型家族，通过单一训练配方在400万到25亿参数范围内实现可靠的预测质量提升，并在三个基准测试中达到新的最先进水平。

Comments Code: https://github.com/DataDog/toto Weights: https://huggingface.co/collections/Datadog/toto-20

2510.00054 2026-06-05 cs.CV cs.AI 版本更新

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

HiDe: 通过分层解耦重新思考高分辨率MLLMs中的Zoom-IN方法

Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu, Bo Zheng

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文提出HiDe框架，通过分层解耦方法解决高分辨率图像中背景干扰导致的视觉理解问题，提升多模态大语言模型在高分辨率图像任务中的性能。

Comments Accepted by ICML2026

AI中文摘要

多模态大语言模型（MLLMs）在视觉理解任务中取得了显著进展。然而，它们在高分辨率图像上的性能仍然不够理想。现有方法通常将这一限制归因于感知约束，认为MLLMs难以识别小物体，因而采用“放大”（zoom in）策略来获取更多细节；我们的分析则揭示了不同的原因：主要问题不在于物体大小，而在于复杂的背景干扰。我们通过一系列解耦实验系统分析了这种“放大”操作，并提出了一种无需训练的分层解耦框架（HiDe），该框架使用基于标记的注意力解耦（TAD）来解耦问题标记并识别关键信息标记，然后利用其注意力权重实现与目标视觉区域的精确对齐。随后，它利用布局保持解耦（LPD）将这些区域与背景解耦，并重建一个紧凑的表示，该表示在保留基本空间布局的同时消除了背景干扰。HiDe在V*Bench、HRBench4K和HRBench8K上设定了新的SOTA，将Qwen2.5-VL 7B和InternVL3 8B提升至SOTA（在V*Bench上分别为92.1%和91.6%），甚至超过了强化学习方法。经过优化后，HiDe的内存使用比之前的无训练方法减少了75%。代码见 https://tennine2077.github.io/HiDe.github.io/ 。

英文摘要

Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. While existing approaches often attribute this limitation to perceptual constraints and argue that MLLMs struggle to recognize small objects, leading them to use "zoom in" strategies for better detail, our analysis reveals a different cause: the main issue is not object size, but rather complex background interference. We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to SOTA (92.1% and 91.6% on V*Bench), even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://tennine2077.github.io/HiDe.github.io/.

2605.18937 2026-06-05 cs.AI 版本更新

Evaluating the Utility of Personal Health Records in Personalized Health AI

评估个人健康记录在个性化健康AI中的效用

Rory Sayres, Kejia Chen, Ayush Jain, Matthew Thompson, Jonathan Richina, Xiang Yin, Jimmy Hu, Fan Zhang, Bob Lou, Mike Sanchez, Ines Mezerreg, Meredith Schreier, Hamsa Subramaniam, I-Ching Lee, Yugang Jia, Daniel Mcduff, Yossi Matias, Avinatan Hassidim, Dale Webster, Yun Liu, Jackie Barr, Quang Duong

发表机构 * Google Research（谷歌研究）

AI总结本文研究了利用个人健康记录（PHR）中的临床数据通过大语言模型（LLM）回答用户健康问题的效用，发现提供PHR上下文能显著提高回答的有用性、安全性和个性化水平，并揭示了LLM在理解复杂PHR方面的不足。

Comments 35 pages, 3 figures, 10 tables [bugfix / minor numerical update]

AI中文摘要

患者管理的个人健康记录（PHRs）有望帮助患者更好地理解自身健康；但记录中的信息十分复杂，可能妨碍患者获得洞察。在本研究中，我们评估了大型语言模型（LLMs，Gemini 3.0 Flash）在以PHR中的临床数据作为上下文时，为用户健康问题提供有用回答的潜力。我们从3种不同的分布中共抽取了2,257个用户查询来代表患者的问题：较短的网页搜索查询、基于聊天机器人对话模板生成的较长问题，以及患者向其医疗团队提出的问题（患者来电）。这些查询与去标识化的PHR（来自包含1,945份记录的池）相匹配。Gemini的回答在三种条件下生成：（1）无PHR上下文；（2）提供人口统计信息、病症和用药的基本摘要；（3）提供完整、详尽的临床笔记。评估采用了现有的评分框架（SHARP），并针对解读PHR时的特定错误模式开发了新的框架。我们使用自动评分器对全部数据进行评估，并对一个子集（n=95）使用临床医生评分，两组评分者都了解完整的PHR上下文。我们发现，在提供PHR数据时，所有问题类型的回答有用性都显著提高（p < 0.001，配对t检验）。我们还观察到回答在安全性、准确性、相关性和个性化方面的潜在提升。我们的PHR评估框架进一步识别了LLM在理解复杂PHR的特定方面时的不足，例如时间混乱，以及罕见但不容忽视的虚构信息。这些结果表明，PHR数据有潜力帮助满足用户的广泛需求，并提供了一个框架，用于监控基于PHR上下文的LLM回答中的不足。本研究促使进一步的工作，以评估并实现用户理解自身健康记录所带来的潜在益处。

英文摘要

Patient-managed Personal Health Records (PHRs) promise to empower patients to better understand their health; but information in the record is complex, potentially hindering insights. In this study, we assess the potential of large language models (LLMs, Gemini 3.0 Flash) to provide helpful answers to user health queries, when provided clinical data from PHRs as context. A total of 2,257 user queries were drawn from 3 different distributions to represent patient questions: shorter web search queries, longer questions derived from templates of chatbot conversations, and questions patients asked to their healthcare team (patient calls). Queries were matched with de-identified PHRs (from a pool of 1,945). Gemini responses were generated (1) without PHR context; (2) with a basic summary of demographics, conditions, and medications; (3) with full, extensive clinical notes. For evaluation, we leveraged an existing rating framework (SHARP), and developed a new framework for specific error modes when interpreting PHRs. Evaluation was performed using autoraters for the full set, and with clinician ratings for a subset (n=95), with both sets of raters knowing the full PHR context. We see significant improvements in the helpfulness of answers to all question types with PHR data (p < 0.001, paired t-test). We also observe potential gains in safety, accuracy, relevance and personalization of answers. Our PHR evaluation framework further identifies gaps in LLM understanding of particular aspects of complex PHRs, such as temporal disorientation, and rare but meaningful confabulations. These results suggest potential for PHR data to help people with a wide range of user needs; and provide a framework for monitoring for gaps in LLM answers based on PHR context. This study motivates further work to assess and realize potential benefits to users from understanding their health records.

2605.16716 2026-06-05 cs.CV cs.AI 版本更新

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

MAVEN：面向多元文化文本到视频生成的多智能体框架

Shuowei Li, Yuming Zhao, Parth Bhalerao, Oana Ignat

发表机构 * Santa Clara University（圣克拉拉大学）

AI总结提出MAVEN多智能体提示优化框架，通过并行或串行分解提示为人物、动作、地点维度，提升单文化和跨文化文本到视频生成的文化保真度，并构建包含243个文化提示和972个视频的基准进行评估。

Comments 14 pages, 6 figures, 11 tables, appendix included. Preprint

AI中文摘要

文本到视频（T2V）生成在视觉保真度方面取得了快速进展，但其在单个提示中忠实呈现多种文化的能力仍未被充分探索。我们提出MAVEN，一个多智能体提示优化框架，旨在提高单文化和跨文化T2V生成中的文化保真度。MAVEN将提示分解为人物、动作和地点维度，由并行或串行运行的专业智能体处理。为了支持系统评估，我们贡献了一个新的基准，包含243个基于文化的提示和972个对应视频，涵盖三种文化（中国、美国、罗马尼亚）、三种动作类别以及单文化和跨文化场景。结合基于CLIP的指标、VLM作为评判的评估和视频质量测量的评估表明，多智能体优化，特别是并行专业化，在保持视觉质量和时间一致性的同时，显著提高了文化相关性。数据集和代码可在 https://github.com/AIM-SCU/MAVEN 获取。

英文摘要

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and video quality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available at https://github.com/AIM-SCU/MAVEN.

2604.00555 2026-06-05 cs.AI cs.CL cs.SE 版本更新

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

企业智能体系统中的本体约束神经推理：一种面向领域锚定（domain-grounded）AI智能体的神经符号架构

Thanh Luong Tuan, Abhijit Sanyal

发表机构 * Golden Gate University, San Francisco Foundation（金门大学，旧金山基金会）； AgenticOS (FAOS)（AgenticOS（FAOS））； Associate Director, Data, Digital & IT Novartis Healthcare Pvt. Ltd.（数据、数字与IT部门，诺华健康有限公司）； Novartis Healthcare Pvt. Ltd., Hyderabad, India（诺华健康有限公司，海得拉巴，印度）

AI总结本文提出了一种神经符号架构，通过本体约束神经推理解决企业大语言模型在幻觉、领域漂移和无法在推理层面强制执行监管合规性方面的限制，展示了该架构在提升智能体的指标准确性和角色一致性方面的显著效果。

Comments 24 pages, 6 tables, 6 figures, 1 algorithm, 65 references. Replication study: 1,800 runs (600 per model) across 5 regulated industries (3 English, 2 Vietnamese) and 3 LLMs (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B). v3 changes: deep-review trim from 34pp. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-3

AI中文摘要

企业采用大语言模型（LLMs）受到幻觉、领域漂移以及无法在推理层面强制执行监管合规性的限制。我们提出了一种在Foundation AgenticOS（FAOS）平台中实现的神经符号架构，通过本体约束神经推理解决这些限制。我们引入了一个三层本体框架（角色、领域和交互本体），为基于LLM的企业智能体提供领域锚定（grounding）。我们形式化了非对称的神经符号耦合：当前企业系统约束智能体的输入（上下文组装、工具发现、治理阈值），但不约束输出；我们提出了将这种耦合扩展到输出侧验证（响应检查、推理验证、合规性强制）的机制。一项受控实验（1,800次运行，覆盖五个行业和三个LLM：Claude Sonnet 4、Qwen 2.5 72B、Gemma 4 26B）发现，在所有三个模型上，本体耦合的智能体在指标准确性（p < .001）和角色一致性（p < .001）上均显著优于未锚定的智能体，且效应量较大（Kendall's W = .46-.64）。在LLM参数化知识最弱的地方改进最大，尤其是越南本地化领域，其本体带来的提升是英语领域的2倍。贡献：（1）一个形式化的三层企业本体模型；（2）神经符号耦合模式的分类体系；（3）通过SQL下推评分实现的本体约束工具发现；（4）一个用于输出侧本体验证的拟议框架；（5）关于逆参数化知识效应的实证证据，即本体锚定的价值与LLM训练数据对该领域的覆盖程度成反比；（6）跨模型复现，确立了模型无关性；（7）一个服务于22个行业垂直领域、拥有650多个智能体的生产系统。

英文摘要

Enterprise adoption of Large Language Models (LLMs) is constrained by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning. We introduce a three-layer ontological framework--Role, Domain, and Interaction ontologies--grounding LLM-based enterprise agents. We formalize asymmetric neurosymbolic coupling: current enterprise systems constrain agent inputs (context assembly, tool discovery, governance thresholds) but not outputs, and we propose mechanisms extending this coupling to output-side validation (response checking, reasoning verification, compliance enforcement). A controlled experiment (1,800 runs across five industries and three LLMs: Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B) finds ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001) and Role Consistency (p < .001) across all three models with large effect sizes (Kendall's W = .46-.64). Improvements are greatest where LLM parametric knowledge is weakest--particularly in Vietnam-localized domains, where ontology lift is 2x that of English domains. Contributions: (1) a formal three-layer enterprise ontology model; (2) a taxonomy of neurosymbolic coupling patterns; (3) ontology-constrained tool discovery via SQL-pushdown scoring; (4) a proposed framework for output-side ontological validation; (5) empirical evidence for the inverse parametric knowledge effect--ontological grounding value is inversely proportional to LLM training-data coverage of the domain; (6) cross-model replication establishing model-independence; (7) a production system serving 22 industry verticals with 650+ agents.

2605.16138 2026-06-05 cs.LG cs.AI hep-ex 版本更新

Surrogate Neural Architecture Codesign Package (SNAC-Pack)

代理神经架构协同设计包（SNAC-Pack）

Jason Weitz, Dmitri Demler, Benjamin Hawks, Aaron Wang, Nhan Tran, Javier Duarte

发表机构 * University of California San Diego（加州大学圣地亚哥分校）； ETH Zurich（苏黎世联邦理工学院）； Fermi National Accelerator Laboratory（费米国家加速器实验室）； University of Illinois Chicago（伊利诺伊大学芝加哥分校）

AI总结本文提出SNAC-Pack，一种面向硬件的自动化机器学习框架，用于神经架构协同设计和端到端FPGA部署，通过多目标全局搜索和硬件代理模型减少合成成本，并结合量化感知训练和迭代幅度剪枝来压缩模型，最终在FPGA上实现高效部署。

Comments 15 Pages, 3 Figures, AutoML (International Conference on Automated Machine Learning) 2026


AI中文摘要

神经架构搜索（NAS）是一种强大的自动模型设计方法，但现有方法往往只优化准确率或依赖如位操作（BOPs）等代理指标，这些指标与硬件成本的相关性较差。在FPGA部署中，成本由查找表、DSP、触发器、BRAM和延迟等多维预算主导。我们提出了代理神经架构协同设计包（SNAC-Pack），一种开源的AutoML框架，用于硬件感知的神经架构协同设计和端到端FPGA部署。SNAC-Pack使用Optuna和NSGA-II进行多目标全局搜索，将试验加载到共享的SQLite存储中，以实现计算节点之间的并行工作。硬件代理模型输出每个试验的资源和延迟估计，避免了否则会主导搜索循环的合成成本。随后的局部搜索阶段结合量化感知训练（QAT）和迭代幅度剪枝，在联合压缩循环中应用。最后，通过hls4ml Python库将最终模型合成到FPGA固件中。YAML配置和可选的代理前端使用户能够在新数据集上运行管道而无需修改框架。我们在大型强子对撞机的喷射分类和超导量子比特读出中展示了SNAC-Pack，发现了紧凑的架构，这些架构在任务指标上匹配或超过强基线，同时减少FPGA资源利用，并在量子比特读出情况下将设计空间探索过程从数月的手动微调减少到数小时的自动化搜索。

英文摘要

Neural architecture search (NAS) is a powerful approach for automating model design, but existing methods often optimize for accuracy alone or rely on proxy metrics such as bit operations (BOPs) that correlate poorly with hardware cost. This gap is particularly large for FPGA deployment, where cost is dominated by a multi-dimensional budget of lookup tables, DSPs, flip-flops, BRAM, and latency. We present the Surrogate Neural Architecture Codesign Package (SNAC-Pack), an open-source AutoML framework for hardware-aware neural architecture codesign and end-to-end FPGA deployment. SNAC-Pack runs a multi-objective global search with Optuna and NSGA-II, loading trials to a shared SQLite store that enables parallel workers across compute nodes. A hardware surrogate model outputs per-trial resource and latency estimates, avoiding the synthesis cost that would otherwise dominate the search loop. A local search stage then applies quantization-aware training (QAT) together with iterative magnitude pruning in a combined compression loop, after which the final model is synthesized to FPGA firmware via the hls4ml Python library. A YAML configuration and an optional agentic frontend let users run the pipeline on new datasets without modifying the framework. We demonstrate SNAC-Pack on jet classification at the Large Hadron Collider and superconducting qubit readout, discovering compact architectures that match or exceed strong baselines on the task metric while reducing FPGA resource utilization and, in the qubit readout case, reducing the design space exploration process from months of manual fine-tuning to hours of automated search.
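The multi-objective search described in the abstract relies on NSGA-II, which ranks candidates by non-domination. The following is a minimal sketch of that non-dominated filter only, assuming every objective is minimized. It is not SNAC-Pack's code and does not use Optuna's API; the trial values and the names `dominates` and `pareto_front` are hypothetical.

```python
from typing import List, Tuple

Objectives = Tuple[float, ...]  # assumption: every objective is minimized


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(trials: List[Objectives]) -> List[Objectives]:
    """Return the trials that no other trial dominates, in input order."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]


# Hypothetical (error rate, LUT count, latency in clock cycles) per trial.
trials = [
    (0.24, 12000.0, 40.0),
    (0.25, 8000.0, 30.0),
    (0.26, 9000.0, 35.0),  # dominated by the second trial on all three objectives
    (0.30, 3000.0, 20.0),
]
front = pareto_front(trials)
print(front)
```

In a surrogate-driven search, the resource and latency entries would come from the surrogate's estimates rather than from synthesis, which is what keeps each trial cheap.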


2509.10825 2026-06-05 cs.LG cs.AI stat.ML 版本更新

CUBE: Contrastive Understanding by Balanced Experiments

CUBE: 通过平衡实验实现对比理解

Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh

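Affiliations: Department of Computer Engineering; Gachon University

AI summary: This paper proposes CUBE, a framework that explains trained prediction models through balanced low-high probes, exposing the models' main effects and interactions. It is validated on synthetic and real-world tabular tasks.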

Comments The core framework and main claims remain unchanged; the manuscript has been revised for clarity, presentation, and consistency

2605.15212 2026-06-05 cs.AR cs.AI cs.CE 版本更新

Fault tolerance estimation in digital circuits with visualised generative networks

数字电路中故障容错估计与可视化生成网络

Sascha Biel, Carl Alexander Gaede, Amiel Glaser, Jan Wolter, Alexej Schelle

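Affiliations: IU Internationale Hochschule; Constructor University

AI summary: This paper proposes a new numerical method that uses sampling from generative networks to estimate the fault tolerance of failure modes in digital circuit structures. Random inputs of ideally digitized analog currents are compared with realistic signals in the discriminator part of a generative adversarial network (GAN) to compute deviations from the ideal digital electronic signal, covering error modes such as missing or interchanged logic devices.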

Comments 7 pages, 7 figures, 1 table

2605.13075 2026-06-05 cs.CL cs.AI 版本更新

Scaling few-shot spoken word classification with generative meta-continual learning

通过生成性元持续学习扩大少样本语音词分类

Louise Beyers, Batsirayi Mupamhi Ziki, Ruan van der Merwe

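Affiliations: University of Cape Town

AI summary: This paper investigates whether a spoken word classifier trained with the Generative Meta-Continual Learning (GeMCL) algorithm can learn 1000 classes from only five examples per class, and shows its advantages in performance stability and adaptation speed.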


AI中文摘要

少样本语音词分类大多针对少量类别进行开发，因此更大规模的少样本语音词分类潜力尚未被挖掘。本文探讨了在仅获得每个类别五个样本的情况下，通过生成性元持续学习（GeMCL）算法训练的语音词分类器能否依次学习区分1000个类别。我们通过使用GeMCL算法训练模型并与重复训练或微调的基线模型进行比较，证明了这种扩展能力的存在。我们发现GeMCL产生了极高的性能稳定性，尽管它并不总能超越重复全微调的HuBERT模型或冻结HuBERT模型配以重复训练的分类器头，但其性能与后者相当，同时适应速度提高了2000倍，仅用不到一半的数据量，在两个数量级更少的时间内进行训练。

英文摘要

Few-shot spoken word classification has largely been developed for applications where a small number of classes is considered, and so the potential of larger-scale few-shot spoken word classification remains untapped. This paper investigates the potential of a spoken word classifier to sequentially learn to distinguish between 1000 classes when it is given only five shots per class. We demonstrate that this scaling capability exists by training a model using the Generative Meta-Continual Learning (GeMCL) algorithm and comparing it to repeatedly trained or finetuned baselines. We find that GeMCL produces exceptionally stable performance, and although it does not always outperform a repeatedly fully-finetuned HuBERT model nor a frozen HuBERT model with a repeatedly trained classifier head, it produces comparable performance to the latter while adapting 2000 times faster, having been trained on less than half of the data for two orders of magnitude less time.


2604.20329 2026-06-05 cs.CV cs.AI 版本更新

Image Generators are Generalist Vision Learners

图像生成器是通用视觉学习者

Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, Radu Soricut

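Affiliations: Google

AI summary: This paper studies image generators as generalist vision learners. It introduces Vision Banana and shows that image generation training, like language model pretraining, lets a model reach state-of-the-art performance across a variety of vision tasks, pointing to a central role for image generation pretraining in building foundational vision models.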

Comments Project Page: http://vision-banana.github.io


AI中文摘要

近期的研究表明，图像和视频生成器表现出零样本视觉理解行为，这种行为类似于大型语言模型（LLM）通过生成式预训练发展出语言理解和推理的新兴能力。尽管长期以来人们推测能够生成视觉内容意味着能够理解它，但缺乏证据表明生成式视觉模型已发展出强大的理解能力。在本文中，我们证明图像生成训练的作用类似于LLM预训练，使模型学习到强大的、通用的视觉表示，从而在各种视觉任务中取得最先进的性能。我们引入了Vision Banana，一个通过指令微调Nano Banana Pro（NBP）在原始训练数据和少量视觉任务数据混合中构建的通用模型。通过将视觉任务的输出空间参数化为RGB图像，我们无缝地将感知重新框架为图像生成。我们的通用模型Vision Banana在涉及2D和3D理解的多种视觉任务中取得了最先进的结果，超越或匹敌零样本领域专家，包括Segment Anything Model 3在分割任务中的表现，以及Depth Anything系列在度量深度估计中的表现。我们展示了这些结果可以通过轻量级指令微调实现，而不牺牲基础模型的图像生成能力。优越的结果表明图像生成预训练是一种通用视觉学习者。它还表明图像生成是视觉任务的统一和通用接口，类似于文本生成在语言理解和推理中的作用。我们正见证计算机视觉中的重大范式转变，其中生成式视觉预训练在构建生成和理解的基础视觉模型中发挥核心作用。

英文摘要

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.


2605.13830 2026-06-05 cs.AI cs.LG 版本更新

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

对树集成模型的敏感性量化：一种符号和组合方法

Ajinkya Naik, Chaitanya Garg, S. Akshay, Ashutosh Gupta, Kuldeep S. Meel

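Affiliations: Indian Institute of Technology Bombay; University of Toronto

AI summary: This paper proposes a method for quantifying sensitivity in tree ensembles. It discretizes the input space and enumerates the regions susceptible to sensitivity, using an algebraic decision diagram (ADD) encoding that is split into subproblems for efficient computation. Experiments show that the resulting tool, XCount, is faster and more scalable than other approaches.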


AI中文摘要

决策树集成（DTE）是一种广泛应用于AI分类任务的流行模型，用于多个安全关键领域，因此对这些模型的验证已成为过去十年的研究热点之一。其中一个问题就是敏感性问题，它询问给定一个DTE，是否一小部分特征的变化会导致输入的误分类。在本工作中，我们的目标是构建一个针对DTEs的定量敏感性概念，通过离散化模型的输入空间并枚举易受敏感性影响的区域。我们提出了一种新的算法技术，可以在保证认证误差和置信度范围内高效地完成此计算。我们的方法基于将问题编码为代数决策图（ADD），并进一步将其拆分为可高效解决的子问题，使计算成为组合和可扩展的。我们在不同规模的基准上评估了我们的技术的性能，与相同问题编码下的模型计数器进行比较。实验结果表明，我们的工具XCount在速度上显著优于其他方法，并且在集成规模增加时表现良好。

英文摘要

Decision tree ensembles (DTE) are a popular model for a wide range of AI classification tasks, used in multiple safety-critical domains, and hence verifying properties on these models has been an active topic of study over the last decade. One such verification question is the problem of sensitivity, which asks, given a DTE, whether a small change in a subset of features can lead to misclassification of the input. In this work, our focus is to build a quantitative notion of sensitivity, tailored to DTEs, by discretizing the input space of the model and enumerating the regions which are susceptible to sensitivity. We propose a novel algorithmic technique that can perform this computation efficiently, within a certified error and confidence bound. Our approach is based on encoding the problem as an algebraic decision diagram (ADD), and further splitting it into subproblems that can be solved efficiently and make the computation compositional and scalable. We evaluate the performance of our technique over benchmarks of varying size in terms of number of trees and depth, comparing it against the performance of model counters over the same problem encoding. Experimental results show that our tool XCount achieves significant speedup over other approaches and can scale well with the increasing sizes of the ensembles.
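To make the counted quantity concrete, here is a brute-force sketch on a tiny discretized input space. It only illustrates what "counting sensitive regions" could mean and is not XCount's method, which the abstract describes as ADD-based with certified error and confidence bounds. The three toy trees, the majority vote, the grid, and the `radius` parameter are all assumptions made for this example.

```python
from itertools import product
from typing import Set, Tuple

Point = Tuple[int, ...]


# Three hypothetical axis-aligned trees, each voting +1 or -1.
def tree_a(x: Point) -> int:
    return 1 if x[0] >= 2 else -1


def tree_b(x: Point) -> int:
    return 1 if x[1] >= 1 else -1


def tree_c(x: Point) -> int:
    if x[0] >= 3:
        return 1 if x[1] >= 2 else -1
    return -1


def ensemble(x: Point) -> int:
    """Majority vote of the three trees: class 1 or class 0."""
    return 1 if tree_a(x) + tree_b(x) + tree_c(x) > 0 else 0


def is_sensitive(x: Point, features: Set[int], radius: int, grid_size: int) -> bool:
    """True if moving only the chosen features by at most `radius` cells flips the class."""
    base = ensemble(x)
    ranges = [
        range(max(0, v - radius), min(grid_size - 1, v + radius) + 1) if i in features else (v,)
        for i, v in enumerate(x)
    ]
    return any(ensemble(y) != base for y in product(*ranges))


def count_sensitive(dims: int, grid_size: int, features: Set[int], radius: int) -> int:
    """Enumerate every grid cell and count the sensitive ones."""
    grid = product(range(grid_size), repeat=dims)
    return sum(is_sensitive(x, features, radius, grid_size) for x in grid)


count_f0 = count_sensitive(dims=2, grid_size=4, features={0}, radius=1)
count_f1 = count_sensitive(dims=2, grid_size=4, features={1}, radius=1)
print(count_f0, count_f1)
```

Exhaustive enumeration grows exponentially with the number of features, which is the cost a symbolic, compositional encoding is meant to avoid.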


2605.05367 2026-06-05 cs.CV cs.AI 版本更新

Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular Video

Tamaththul3D: 从单目视频高保真重建沙特手语3D虚拟形象

Eyad Alghamdi, Sattam Altuuaim, Obay Ghulam, Abdulrahman Qutah, Yousef Basoodan

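Affiliations: University of Jeddah; King Abdullah University of Science and Technology

AI summary: This paper proposes Tamaththul3D, which aligns hand and body estimates through geometric inverse kinematics on the forearm chain followed by 2D-supervised shoulder refinement. It reconstructs high-fidelity 3D avatars for Saudi Sign Language and generalizes across five typologically distinct sign languages.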


AI中文摘要

现有的3D手语虚拟形象重建方法仅在西方手语上开发和评估，且没有任何阿拉伯手语数据集的3D参数注解，这阻碍了阿拉伯聋人社区基于虚拟形象的无障碍应用发展。我们发布了首个SMPL-X参数注解的Ishara-500沙特手语数据集，使阿拉伯手语的定量评估和下游手语生成成为可能。我们引入Tamaththul3D，一种通过几何逆运动学对齐手部和身体估计，随后通过2D监督肩部优化的重建流程。闭式积分与特定身体和手估计器的选择无关：任何SMPL-X兼容的身体估计器和任何MANO兼容的手估计器均可替换，我们通过单独替换每个模块来证明这一点。Tamaththul3D在手部误差上比先前方法低达32%，运行速度比最强基线快32倍，并在没有数据集特定适应的情况下泛化到五个不同语言类型的手语数据集。

英文摘要

Existing 3D sign language avatar reconstruction methods are developed and evaluated exclusively on Western sign languages, and no 3D parametric annotations exist for any Arabic Sign Language dataset, a gap that blocks the development of avatar-based accessibility applications for the Arab Deaf community. We release the first SMPL-X parametric annotations for the Ishara-500 Saudi Sign Language dataset, enabling quantitative evaluation and downstream sign language generation for Arabic Sign Language. We introduce Tamaththul3D, a reconstruction pipeline that aligns hand and body estimates through geometric inverse kinematics on the forearm chain followed by 2D-supervised shoulder refinement. The closed-form integration is decoupled from the specific choice of body and hand estimators: any SMPL-X-compatible body estimator and any MANO-compatible hand estimator can be substituted, as we demonstrate by swapping each module independently. Tamaththul3D achieves up to 32% lower hand error than prior methods, runs 32x faster than the strongest baseline, and generalizes across five typologically distinct sign languages without dataset-specific adaptation.


2605.12376 2026-06-05 cs.AI 版本更新

证据与计划：用于技能蒸馏的在线轨迹验证

Yang Zhou, Zihan Dong, Zhenting Wang, Can Jin, Shiyu Zhao, Bangwei Guo, Difei Gu, Linjun Zhang, Mu Zhou, Dimitris N. Metaxas

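Affiliations: Rutgers University

AI summary: This paper proposes the Posterior Distillation Index (PDI), a trajectory-level metric of how well a skill is grounded in task-environment evidence, and the SPARK framework, which generates environment-verified trajectories to improve the efficiency and transferability of skills.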


AI中文摘要

代理技能可以通过使用人类编写的程序性文档显著提高任务成功率，但其质量在没有环境基础验证的情况下难以评估。现有的技能生成方法严重依赖于偏好日志而不是直接的环境交互，通常产生微不足道甚至退化的收益。我们发现这是一个根本的时间瓶颈：稳健的技能应基于后验，从经验环境交互中蒸馏，而不是先验计划。在本研究中，我们引入了后验蒸馏指数（PDI），这是一个轨迹级指标，量化了蒸馏技能与任务-环境证据的契合程度。为了操作化PDI，我们提出了SPARK（用于自主可运行任务和技能生成的结构化流程），以保留任务执行证据以实现全面的轨迹级分析。SPARK生成用于计算PDI的环境验证轨迹，并将其用作在线诊断和干预信号，以确保后验技能的形成。在86个可运行任务上，SPARK生成的技能始终优于无技能基线，并在学生模型上优于人工编写技能（推理成本比教师模型低高达1000倍）。这些发现表明，PDI引导的蒸馏产生了高效且可迁移的技能，这些技能基于任务-环境交互。我们发布代码在https://github.com/EtaYang10th/spark-skills。

英文摘要

Agent skills can remarkably improve task success rates by using human-written procedural documents, but their quality is difficult to assess without environment-grounded verification. Existing skill generation methods heavily rely on preference logs rather than direct environment interaction, often yielding negligible or even degraded gains. We identify this as a fundamental timing bottleneck: robust skills should be posterior-based, distilled from empirical environment interaction rather than prior plans. In this study, we introduce the Posterior Distillation Index (PDI), a trajectory-level metric that quantifies how well a distilled skill is grounded in the task-environment evidence. To operationalize PDI, we present SPARK (Structured Pipelines for Autonomous Runnable tasKs and sKill generation) for preserving task execution evidence towards full trajectory-level analysis. SPARK generates environment-verified trajectories used to compute PDI, and it applies PDI as an online diagnostic and intervention signal to ensure posterior skill formation. Across 86 runnable tasks, SPARK-generated skills consistently surpass no-skill baselines and outperform human-written skills on student models (inference cost up to 1,000x cheaper than teacher models). These findings show that PDI-guided distillation produces efficient and transferable skills grounded in the task-environment interaction. We release our code at https://github.com/EtaYang10th/spark-skills .


2605.08318 2026-06-05 cs.LG cs.AI cs.NA math.NA physics.comp-ph stat.ML 版本更新

When Attention Beats Fourier: Multi-Scale Transformers for PDE Solving on Irregular Domains

当注意力胜过傅里叶：用于不规则域上的PDE求解的多尺度变换器

Brandon Yee, Pairie Koh, Jack Rodriguez, Mihir Tekal

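Affiliations: Physics Lab, Yee Collins Research Group

AI summary: This paper studies architecture selection for deep learning models that solve partial differential equations (PDEs), asking when transformer architectures with learned attention outperform Fourier-domain neural operators. It introduces the Multi-Scale Attention Transformer (MSAT), which encodes spatiotemporal solution histories as token sequences and is trained end-to-end with a composite supervised objective. A comprehensive evaluation against nine baselines (physics-informed neural networks, neural operators, and state-space models) on five benchmark problems shows state-of-the-art generalization on complex-geometry problems.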

Comments Substantial Revision Required


AI中文摘要

我们研究了深度学习模型在求解偏微分方程（PDE）时的架构选择问题，探讨了基于学习注意力的变换器架构在何时优于傅里叶域神经算子。我们介绍了多尺度注意力变换器（MSAT），一种深度学习架构，将时空解的历史编码为令牌序列，并通过复合监督目标进行端到端训练。我们对九种基线方法（包括物理信息神经网络、神经算子和状态空间模型）进行了全面的实证评估，覆盖了PINNacle套件中的五个基准问题，使用相同的训练/测试分割和参考数据。MSAT在复杂几何问题上实现了最先进的泛化能力（Heat2D-CG的L²相对误差为0.0101，比FNO提高了3.7倍），在34秒的总推理时间下，比Mamba-NO的120,812秒快得多。对物理正则化组件的消融研究揭示了精确的归纳偏置权衡：物理先验减少了扩散主导问题的测试误差，但会退化混沌和回流流动制度的泛化能力，直接刻画了先验规格错误的边界。近似误差界作为域边界复杂性κ的函数，为这些实证发现提供了理论基础，并为架构选择提供了一个原则性的规则。

英文摘要

We study the problem of architecture selection for deep learning models trained to solve partial differential equations (PDEs), asking when transformer-based architectures with learned attention outperform Fourier-domain neural operators. We introduce the Multi-Scale Attention Transformer (MSAT), a deep learning architecture that encodes spatiotemporal solution histories as token sequences and trains end-to-end via a composite supervised objective with optional physics-informed regularization terms. We conduct a comprehensive empirical evaluation against nine baselines, including physics-informed neural networks (PINNs), neural operators (FNO, DeepONet, GNOT), and state-space models (Mamba-NO), across five benchmark problems from the PINNacle suite, using identical train/test splits and reference data for all methods. MSAT achieves state-of-the-art generalization on complex geometry problems ($L^2_\mathrm{rel} = 0.0101$ on Heat2D-CG, a $3.7\times$ improvement over FNO) at 34 s total inference vs. 120,812 s for Mamba-NO. Ablation studies over the physics regularization component reveal a precise inductive bias tradeoff: physics priors reduce test error on diffusion-dominated problems but degrade generalization on chaotic and recirculating-flow regimes, directly characterizing the prior misspecification boundary. Approximation error bounds as a function of domain boundary complexity $\kappa$ provide a theoretical basis for these empirical findings and a principled rule for architecture selection.
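The abstract reports accuracy as a relative $L^2$ error but does not spell out the formula. A common definition, assumed here, is the norm of the prediction error divided by the norm of the reference solution over the evaluation points. The sketch below uses that assumed definition with made-up sample values; the function name is hypothetical.

```python
import math
from typing import Sequence


def relative_l2_error(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Assumed definition: ||pred - ref||_2 / ||ref||_2 over the evaluation points."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    numerator = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    denominator = math.sqrt(sum(r ** 2 for r in ref))
    if denominator == 0.0:
        raise ValueError("reference solution has zero norm")
    return numerator / denominator


# Made-up values: the prediction is off by 0.5 at one of two points.
reference = [3.0, 4.0]
prediction = [3.0, 4.5]
error = relative_l2_error(prediction, reference)
print(error)
```

Because the error is normalized by the size of the reference solution, values can be compared across problems whose solutions differ in magnitude.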


2605.08253 2026-06-05 cs.LG cs.AI 版本更新

Path-Coupled Bellman Flows for Distributional Reinforcement Learning

路径耦合贝尔曼流用于分布式强化学习

Boyang Xu, Qing Zou, Siqin Yang, Hao Yan

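Affiliations: University of Science and Technology of China

AI summary: This paper proposes Path-Coupled Bellman Flows (PCBF), a continuous-time distributional reinforcement learning method that learns return distributions with flow matching, addressing the boundary mismatch and high-variance bootstrapping of existing flow-based methods. Experiments show improved distributional fidelity and training stability.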

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)


Journal ref: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

AI中文摘要

分布式强化学习（DRL）模型完整回报分布，但现有有限支持或分位数方法依赖于投影，而近期基于流的方法在流源处可能遭受边界不匹配，或在当前和后续噪声独立时出现高方差的bootstrap问题。本文提出路径耦合贝尔曼流（PCBF），一种连续时间DRL方法，通过学习回报分布的流匹配使用源一致的贝尔曼耦合路径：当前路径从t=0所需的基先验开始，到达t=1的贝尔曼目标，并在中间时间保持路径上的线性关系到后续流（不需要时间t的边际满足分布贝尔曼固定点对所有t）。PCBF通过共享基噪声耦合当前和后续回报流，并使用λ参数化的控制变异目标：λ=0恢复无偏样本贝尔曼目标，而λ>0通过可控的偏倚换取方差减少。在可解析的MRPs、OGBench和D4RL上的实验表明，PCBF在分布保真度和训练稳定性方面有所提升，并在离线RL性能上具有竞争力。

英文摘要

Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from boundary mismatch at the flow source or from high-variance bootstrapping when current and successor noises are independent. We propose Path-Coupled Bellman Flows (PCBF), a continuous-time DRL method that learns return distributions with flow matching using source-consistent Bellman-coupled paths: the current path starts from the required base prior at $t=0$, reaches the Bellman target at $t=1$, and maintains a pathwise affine relation to the successor flow at intermediate times (without requiring time-$t$ marginals to satisfy a distributional Bellman fixed point for all $t$). PCBF couples current and successor return flows through shared base noise and uses a $\lambda$-parameterized control-variate target: $\lambda=0$ recovers an unbiased sample Bellman target, while $\lambda>0$ trades controlled bias for variance reduction. Experiments on analytically tractable MRPs, OGBench, and D4RL show improved distributional fidelity and training stability, and competitive offline RL performance.


2605.07482 2026-06-05 cs.LG cs.AI 版本更新

SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion

SHRED: 通过自蒸馏与对数势降低实现无保留集的去记忆

Zizhao Hu, Ameya Godbole, Johnny Tian-Zheng Wei, Mohammad Rostami, Jesse Thomason, Robin Jia

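Affiliations: University of Southern California; USC Information Sciences Institute

AI summary: This paper proposes SHRED, a retain-set-free unlearning method that combines self-distillation with logit demotion to forget targeted content while preserving model utility, outperforming methods that depend on a retain set.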


AI中文摘要

针对大语言模型（LLMs）的机器去记忆问题，旨在选择性地移除记忆中的内容，如私人数据、受版权文本或危险知识，而无需昂贵的全量重新训练。现有大多数方法需要一个经过精心挑选的保留集以防止一般模型用途的灾难性退化，这会增加额外的数据依赖性，使部署复杂化。我们提出SHRED（通过高惊奇度的无保留集熵降低的自蒸馏），一种无需保留集的去记忆方法，基于一个关键洞察：并非所有遗忘集实例中的token都同等地包含记忆信息。高信息token集中了模型的记忆知识，而低信息token反映了一般语言能力。SHRED分为两个阶段。（1）选择：我们对遗忘集实例进行前向传递，收集每个token的自回归概率，并选择底部（最低概率，最高香农信息）作为遗忘位置；剩余位置保留为良性锚点。（2）训练：我们构建了修改的KL目标，降低记忆token在遗忘位置的logit，同时在良性位置保持原始分布。模型通过单一的顶部KL自蒸馏目标进行训练，同时驱动遗忘和实用性保持。我们评估了SHRED在四个标准去记忆基准上的表现，并证明其在遗忘效果和模型实用性之间建立了新的帕累托最优权衡，优于保留集依赖的方法。我们的分析显示，SHRED对重新学习攻击和成员推断攻击具有鲁棒性，并且在多次连续去记忆运行后仍能保持稳定的实用性。

英文摘要

Machine unlearning for large language models (LLMs) aims to selectively remove memorized content such as private data, copyrighted text, or hazardous knowledge, without costly full retraining. Most existing methods require a retain set of curated examples to prevent catastrophic degradation of general model utility, creating an extra data dependency that complicates deployment. We propose SHRED (Self-distillation via High-surprisal-only Retain-set-free Entropy Demotion), a retain-set-free unlearning method built on a key insight: not all tokens within a forget set instance carry memorized information equally. High-information tokens concentrate the model's memorized knowledge, while low-information tokens reflect general language competence. SHRED operates in two stages. (1) Selection: We perform a forward pass on a forget set instance, collect per-token autoregressive probabilities, and select the bottom (lowest probability, highest Shannon information) as forget positions; the remaining positions are retained as benign anchors. (2) Training: We construct modified KL targets that demote the memorized token's logit at forget positions while preserving the original distribution at benign positions. The model is then trained via a single top KL self-distillation objective that simultaneously drives forgetting and utility preservation. We evaluate SHRED across four standard unlearning benchmarks and demonstrate that it establishes a new Pareto-optimal trade-off between forget efficacy and model utility, outperforming retain-set-dependent methods. Our analysis shows that SHRED is robust against relearning attacks and membership-inference attacks, and it maintains stable utility even after many sequential unlearning runs.
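The selection stage described above can be sketched as follows: compute each token's Shannon information from its autoregressive probability, mark the highest-surprisal positions as forget positions, and keep the rest as benign anchors. This is an illustration only, not the authors' implementation. The cutoff is not stated in the abstract as it appears here, so the `fraction` parameter, the function name, and the probabilities are hypothetical.

```python
import math
from typing import List, Tuple


def select_forget_positions(token_probs: List[float], fraction: float) -> Tuple[List[int], List[int]]:
    """Split token positions into (forget, benign) by per-token surprisal.

    `fraction` is a hypothetical knob: the share of positions, ranked by
    lowest probability (highest Shannon information), treated as forget positions.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    surprisal = [-math.log2(p) for p in token_probs]  # Shannon information in bits
    k = max(1, math.ceil(fraction * len(token_probs)))
    ranked = sorted(range(len(token_probs)), key=lambda i: surprisal[i], reverse=True)
    forget = sorted(ranked[:k])
    benign = [i for i in range(len(token_probs)) if i not in forget]
    return forget, benign


# Made-up per-token probabilities from one forward pass over a forget-set instance.
probs = [0.90, 0.02, 0.75, 0.001, 0.60, 0.30]
forget_positions, benign_positions = select_forget_positions(probs, fraction=0.5)
print(forget_positions, benign_positions)
```

The second stage, which builds modified KL targets that demote the memorized token's logit at the forget positions, is not sketched here because the abstract does not give its exact form.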


2605.07096 2026-06-05 cs.LG cs.AI stat.ME 版本更新

Transformer 的拓扑困境

Michael C. Mozer, Shoaib Ahmed Siddiqui, Rosanne Liu

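Affiliations: Google DeepMind

AI summary: This article examines a topological problem in how transformers handle sequential structure: their purely feedforward architecture limits dynamic state tracking. It argues for a shift toward implicit activation dynamics via recurrent architectures, introduces a taxonomy of recurrent and continuous-thought transformer architectures, and outlines future research directions.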


AI中文摘要

Transformers通过扩展的上下文历史在序列中编码结构。然而，其纯前馈架构从根本上限制了动态状态跟踪。状态跟踪——迭代更新反映不断变化环境的潜在变量——涉及本质上序列依赖性，这使得前馈网络难以维持。因此，前馈模型会将演进状态表示推入其层栈更深处，使得信息在浅层不可用，最终耗尽模型的深度。虽然动态深度模型和显式或隐式思维可以绕过这一深度限制，但这些解决方案在计算和内存上效率低下。在本文中，我们主张，时间扩展认知需要从显式思维轨迹转向隐式激活动态，通过递归架构。我们引入了递归和连续思维Transformer架构的分类方法，按其递归轴（深度与步长）和输入标记与递归步长的比例进行分类。最后，我们概述了有前景的研究方向，包括增强的状态空间模型和粗粒度递归，以更好地将状态跟踪整合到现代基础模型中。

英文摘要

Transformers encode structure in sequences via an expanding contextual history. However, their purely feedforward architecture fundamentally limits dynamic state tracking. State tracking -- the iterative updating of latent variables reflecting an evolving environment -- involves inherently sequential dependencies that feedforward networks struggle to maintain. Consequently, feedforward models push evolving state representations deeper into their layer stack with each new input step, rendering information inaccessible in shallow layers and ultimately exhausting the model's depth. While this depth limit can be bypassed by dynamic depth models and by explicit or latent thinking that externalizes state representations, these solutions are computationally and memory inefficient. In this article, we argue that temporally extended cognition requires refocusing from explicit thought traces to implicit activation dynamics via recurrent architectures. We introduce a taxonomy of recurrent and continuous-thought transformer architectures, categorizing them by their recurrence axis (depth versus step) and their ratio of input tokens to recurrence steps. Finally, we outline promising research directions, including enhanced state-space models and coarse-grained recurrence, to better integrate state tracking into modern foundation models.


2603.25158 2026-06-05 cs.AI 版本更新

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Trace2Skill: 将轨迹局部经验转化为可迁移的代理技能

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

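Affiliations: ETH Zürich; University of Zurich; Peking University; Zhejiang University; Qwen Large Model Application Team, Alibaba

AI summary: This paper proposes Trace2Skill, a framework that consolidates broad execution trajectories into a unified skill directory through inductive reasoning, improving the transferability and usefulness of agent skills across multiple domains.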

Comments Work in Progress. The May version adds more experiments


AI中文摘要

大型语言模型（LLM）代理日益依赖领域特定技能，但手动编写此类技能难以扩展，而纯参数知识生成的技能常忽略关键操作陷阱。我们引入Trace2Skill框架，通过归纳推理将广泛执行轨迹整合为统一的技能目录。Trace2Skill支持深入现有人工编写技能和从弱LLM生成草稿中创建有用技能。实验表明，Trace2Skill在多样化的领域中均表现出色，包括办公流程、数学推理和视觉问答。重要的是，进化出的技能不仅限于所用轨迹的简单记忆：它们在不同模型规模、不同模型家族和非分布设置中均能迁移。例如，从Qwen3.5-35B轨迹进化出的技能使Qwen3.5-122B代理在WikiTableQuestions任务上提升高达57.65个百分点。进一步分析显示，Trace2Skill优于序列技能编辑和ReasoningBank式检索记忆，能将重复失败和 workaround 压缩为标准操作程序（SoPs），并产生可重用的技能，无需参数更新或测试时检索。

英文摘要

Large Language Model (LLM) agents increasingly rely on domain-specific skills, yet manually authoring such skills does not scale, and skills generated purely from parametric knowledge often miss critical operational pitfalls. We introduce Trace2Skill, a framework that consolidates broad execution trajectories in parallel into a unified skill directory through inductive reasoning over agent experience. Trace2Skill supports both deepening existing human-written skills and creating useful skills from weak LLM-generated drafts. Experiments demonstrate the effectiveness of Trace2Skill across diverse domains, including office workflows, math reasoning, and vision QA. Importantly, the evolved skills are not merely memorized artifacts of the trajectories used to create them: they often transfer across model scales, across model families, and to out-of-distribution settings. For example, skills evolved from Qwen3.5-35B trajectories improve a Qwen3.5-122B agent by up to 57.65 percentage points on WikiTableQuestions. Further analyses show that Trace2Skill outperforms sequential skill editing and ReasoningBank-style retrieval memories, compresses recurring failures and workarounds into standard operating procedures (SoPs), and yields portable skills that can be reused without parameter updates or test-time retrieval.


2604.23466 2026-06-05 cs.LG cs.AI cs.AR 版本更新

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

评估Hopper和Blackwell GPU上的CUDA Tile用于AI工作负载

Divakar Kumar Yadav, Tian Zhao, Deepak Kumar

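Affiliations: University of California, Berkeley; Stanford University

AI summary: This paper evaluates CUDA Tile (CuTile) on AI workloads on Hopper and Blackwell GPUs, comparing its performance and portability against cuBLAS, Triton, and other approaches. CuTile performs well on specific workloads but still shows cross-architecture optimization gaps.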


AI中文摘要

NVIDIA的CUDA Tile（CuTile）引入了一种基于Python的、以tile为中心的抽象，用于GPU内核开发，旨在简化编程同时保持Tensor Core和Tensor Memory Accelerator（TMA）在现代GPU上的效率。我们对三种NVIDIA GPU（Hopper和Blackwell架构下的H100 NVL、B200和RTX PRO 6000 Blackwell Server Edition）上的CuTile进行了首次独立、跨架构评估，对比了cuBLAS、Triton、WMMA和原始SIMT等现有方法。我们通过基准测试代表性AI工作负载，包括GEMM、融合多头注意力和端到端LLM推理（BF16/FP16精度），以评估性能和可移植性。我们的结果表明，CuTile的效果强烈依赖于工作负载和架构。在数据中心级Blackwell（B200）上，CuTile在融合注意力任务中达到最高1007 TFLOP/s，比FlashAttention-2快2.5倍，仅需60行Python内核代码。对于GEMM，CuTile在22行代码中达到cuBLAS性能的52-79%，比WMMA的123行代码更高效，使其成为手写CUDA内核的实用替代品，但尚未成为供应商优化库的替代品。然而，相同的CuTile注意力内核在RTX PRO 6000（sm_120）上仅达到FlashAttention-2的53%吞吐量，暴露了显著的跨架构优化差距。相比之下，Triton在所有测试平台上的cuBLAS性能保持在62-101%，无需架构特定调整，显示出更强的可移植性。

英文摘要

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.


2604.23190 2026-06-05 cs.SE cs.AI 版本更新

从运动学到动力学：学习精炼混合计划以实现物理可行的执行

Lidor Erez, Shahaf S. Shperberg, Ayal Taitler

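Affiliations: Technion - Israel Institute of Technology

AI summary: This study uses reinforcement learning in continuous space to make hybrid plans physically feasible to execute. It defines a Markov Decision Process that incorporates analytical second-order constraints and uses it to refine the first-order trajectories produced by a hybrid planner, reliably recovering physical feasibility.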


AI中文摘要

在许多机器人任务中，智能体必须穿越一系列空间区域以完成任务。此类问题本质上是混合离散-连续的：一个高层动作序列和一个在物理上可行的连续轨迹。生成的轨迹和动作序列还必须满足诸如截止时间、时间窗口和速度或加速度限制等约束条件。尽管混合时间规划器试图解决这一挑战，但它们通常使用线性（一阶）动力学建模运动，这无法保证生成的计划满足机器人的真实物理约束。因此，即使高层动作序列固定，生成动态可行的轨迹也变成了一个双层优化问题。我们通过连续空间中的强化学习来解决这个问题。我们定义了一个明确包含分析二阶约束的马尔可夫决策过程，并用它来改进由混合规划器生成的一阶计划。我们的结果表明，这种方法可以可靠地恢复物理可行性，并有效弥合规划器初始一阶轨迹与实际执行所需动力学之间的差距。

英文摘要

In many robotic tasks, agents must traverse a sequence of spatial regions to complete a mission. Such problems are inherently mixed discrete-continuous: they require both a high-level action sequence and a physically feasible continuous trajectory. The resulting trajectory and action sequence must also satisfy problem constraints such as deadlines, time windows, and velocity or acceleration limits. While hybrid temporal planners attempt to address this challenge, they typically model motion using linear (first-order) dynamics, which cannot guarantee that the resulting plan respects the robot's true physical constraints. Consequently, even when the high-level action sequence is fixed, producing a dynamically feasible trajectory becomes a bi-level optimization problem. We address this problem via reinforcement learning in continuous space. We define a Markov Decision Process that explicitly incorporates analytical second-order constraints and use it to refine first-order plans generated by a hybrid planner. Our results show that this approach can reliably recover physical feasibility and effectively bridge the gap between a planner's initial first-order trajectory and the dynamics required for real execution.
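The gap between first-order plans and second-order feasibility can be illustrated with a simple finite-difference check. This sketch is not the paper's reinforcement learning method: it only tests whether a sampled one-dimensional trajectory respects velocity and acceleration limits. The limits, time step, waypoints, and function names are hypothetical.

```python
from typing import List, Sequence


def finite_differences(values: Sequence[float], dt: float) -> List[float]:
    """Forward differences of a uniformly sampled signal."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def is_dynamically_feasible(positions: Sequence[float], dt: float,
                            v_max: float, a_max: float, tol: float = 1e-9) -> bool:
    """Check velocity and acceleration limits on a sampled 1-D trajectory."""
    velocities = finite_differences(positions, dt)
    accelerations = finite_differences(velocities, dt)
    within_velocity = all(abs(v) <= v_max + tol for v in velocities)
    within_acceleration = all(abs(a) <= a_max + tol for a in accelerations)
    return within_velocity and within_acceleration


# A first-order plan jumps from rest straight to full speed and back to rest.
first_order_plan = [0.0, 0.0, 2.0, 4.0, 6.0, 6.0]
# A refined plan ramps speed up and down, at the cost of one extra time step.
refined_plan = [0.0, 0.0, 1.0, 3.0, 5.0, 6.0, 6.0]

first_order_ok = is_dynamically_feasible(first_order_plan, dt=1.0, v_max=2.0, a_max=1.0)
refined_ok = is_dynamically_feasible(refined_plan, dt=1.0, v_max=2.0, a_max=1.0)
print(first_order_ok, refined_ok)
```

The refined plan reaches the same goal one step later, which shows why deadlines and time windows have to be handled together with the dynamics rather than afterwards.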


2604.16370 2026-06-05 cs.CL cs.AI cs.CV 版本更新

CuTeGen: 基于LLM的代理框架用于使用CuTe生成和优化高性能GPU内核

Tara Saba, Zhiyang Chen, Jikai Jason Li, Anne Ouyang, Xujie Si, Fan Long

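Affiliations: Department of Computer Science, University of Toronto

AI summary: This paper presents CuTeGen, an LLM-based agentic framework that generates and optimizes GPU kernels over the CuTe abstraction layer through a structured generate-test-refine workflow. On KernelBench it achieves an average 1.71x speedup over PyTorch and outperforms the prior agentic baseline CudaForge at comparable generation cost.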


AI中文摘要

高性能GPU内核对现代机器学习系统至关重要，但开发这些内核仍然是一个手动、专家驱动的过程。最近的研究尝试利用LLM自动生成功能内核，但生成的内核在标准化基准测试中仍无法达到精心调优的参考内核。我们提出了CuTeGen，一种代理GPU内核合成框架，将内核开发视为在CuTe抽象层上的结构化生成-测试-优化工作流。CuTeGen有两个设计选择区别于先前的工作：针对CuTe而不是原始CUDA，这暴露了性能关键结构如分块和数据移动，同时保持足够的稳定性以进行迭代优化；以及延迟的性能调度，将低层次性能反馈推迟到内核的高层结构稳定之后。在209个KernelBench Level-1和Level-2任务上，CuTeGen在PyTorch上实现了平均1.71倍的速度提升，并在生成成本相近的情况下优于先前的代理基线CudaForge（0.89倍）。代码可在https://github.com/taratt/cutegen.git获取。

英文摘要

High-performance GPU kernels are critical to modern machine learning systems, yet developing them remains a manual, expert-driven process. Recent work has explored using LLMs to automate kernel generation, but generated kernels still fall short of carefully tuned references on standardized benchmarks. We present CuTeGen, an agentic GPU kernel synthesis framework that treats kernel development as a structured generate-test-refine workflow over the CuTe abstraction layer. Two design choices distinguish CuTeGen from prior work: targeting CuTe rather than raw CUDA, which exposes performance-critical structures such as tiling and data movement while remaining stable enough for iterative refinement, and a delayed profiling schedule that withholds low-level performance feedback until the kernel's high-level structure has stabilized. On the 209 tasks of KernelBench Level-1 and Level-2, CuTeGen achieves an average speedup of 1.71x over PyTorch and outperforms the prior agentic baseline CudaForge (0.89x) at comparable per-task generation cost. Code available at https://github.com/taratt/cutegen.git

2602.19190 2026-06-05 cs.CV cs.AI Version update

FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang

Affiliations: Fudan University; Discipline and Technology Center of Microwave Vision Intelligent Sensing, Fudan University; Shanghai Innovation Institute

AI summary: This paper presents FUSAR-GPT, a visual language model built specifically for synthetic aperture radar (SAR) imagery. By embedding spatiotemporal features and using a two-stage decoupled training strategy, it achieves state-of-the-art performance on several remote sensing vision-language benchmarks.

Abstract

Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 10%.

2603.19312 2026-06-05 cs.LG cs.AI Version update

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero

Affiliations: Mila & Université de Montréal; New York University; Samsung SAIL; Brown University

AI summary: This paper presents LeWorldModel, a joint-embedding predictive architecture that trains stably end-to-end from raw pixels using only two loss terms. It substantially reduces the number of tunable loss hyperparameters, performs well across diverse 2D and 3D control tasks, and shows that its latent space encodes physical structure and can detect physically implausible events.

Abstract

Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.
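
The two-term objective described in this abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the abstract does not give the form of the Gaussian regularizer, so a simple per-dimension moment-matching penalty (mean toward 0, variance toward 1) stands in for it, and the names `prediction_loss`, `gaussian_regularizer`, `lewm_style_loss`, and the weight `lam` are all hypothetical. The single weight `lam` plays the role of the one tunable loss hyperparameter.

```python
from statistics import fmean, pvariance


def prediction_loss(predicted, target):
    """Mean squared error between predicted and actual next embeddings."""
    total = 0.0
    count = 0
    for p_vec, t_vec in zip(predicted, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def gaussian_regularizer(embeddings):
    """Stand-in regularizer (an assumption, not the paper's): penalize each
    embedding dimension for deviating from zero mean and unit variance."""
    dims = len(embeddings[0])
    penalty = 0.0
    for d in range(dims):
        column = [vec[d] for vec in embeddings]
        penalty += fmean(column) ** 2 + (pvariance(column) - 1.0) ** 2
    return penalty / dims


def lewm_style_loss(predicted, target, embeddings, lam=0.1):
    """Two-term objective: prediction loss plus one weighted regularizer."""
    return prediction_loss(predicted, target) + lam * gaussian_regularizer(embeddings)
```

Under this stand-in, a collapsed representation in which every embedding is identical has zero variance and therefore incurs a large penalty, which is the failure mode such a regularizer is meant to discourage.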

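2603.20980 2026-06-05 cs.LG cs.AI stat.AP stat.ML Version update

From Causal Discovery to Dynamic Causal Inference in Neural Time Series

Dmitry Zaytsev, Valentina Kuskova, Michael Coppedge

Affiliations: Lucy Family Institute for Data & Society; University of Notre Dame; Political Science, University of Notre Dame

AI summary: Proposes Dynamic Causal Network Autoregression (DCNAR), a two-stage framework in which a neural autoregressive causal discovery model learns a sparse directed causal network that then serves as a structural prior for a time-varying neural network autoregression, enabling dynamic causal inference without a pre-specified network structure.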

Comments 11 pages, 2 figures

DOI: 10.1145/3770855.3818956

Abstract

Time-varying causal models provide a powerful framework for studying dynamic scientific systems, yet most existing approaches assume that the underlying causal network is known a priori - an assumption rarely satisfied in real-world domains where causal structure is uncertain, evolving, or only indirectly observable. This limits the applicability of dynamic causal inference in many scientific settings. We propose Dynamic Causal Network Autoregression (DCNAR), a two-stage neural causal modeling framework that integrates data-driven causal discovery with time-varying causal inference. In the first stage, a neural autoregressive causal discovery model learns a sparse directed causal network from multivariate time series. In the second stage, this learned structure is used as a structural prior for a time-varying neural network autoregression, enabling dynamic estimation of causal influence without requiring pre-specified network structure. We evaluate the scientific validity of DCNAR using behavioral diagnostics that assess causal necessity, temporal stability, and sensitivity to structural change, rather than predictive accuracy alone. Experiments on multi-country panel time-series data demonstrate that learned causal networks yield more stable and behaviorally meaningful dynamic causal inferences than coefficient-based or structure-free alternatives, even when forecasting performance is comparable. These results position DCNAR as a general framework for using AI as a scientific instrument for dynamic causal reasoning under structural uncertainty.

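2602.19373 2026-06-05 cs.LG cs.AI Version update

Stable Deep Reinforcement Learning via Isotropic Gaussian Representations

Ali Saheb Pasand, Johan Obando-Ceron, Aaron Courville, Pouya Bashivan, Pablo Samuel Castro

Affiliations: University of Waterloo

AI summary: This paper proposes a deep reinforcement learning method based on isotropic Gaussian representations. Shaping representations toward an isotropic Gaussian distribution during training improves performance in non-stationary settings and reduces representation collapse, dormant neurons, and training instability.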

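2603.17310 2026-06-05 cs.AI cs.CL Version update

Beyond the Mean: Causal Effects Based on Persistent Homology

Amir Saki, Usef Faghihi

Affiliations: Université du Québec à Trois-Rivières

AI summary: This paper proposes a causal framework based on persistent homology to address the limitations of mean-based causal estimands when a treatment changes the shape of the outcome distribution. It defines topological analogues of CATE and ATE and proves that they are identifiable under approximate topological ignorability.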

Abstract

Average treatment effects (ATE) and conditional average treatment effects (CATE) are foundational causal estimands, but they target changes in expected outcomes and can miss treatment-induced changes in the shape of outcome distributions. A canonical failure mode occurs when control outcomes are unimodal, treated outcomes become bimodal, and both distributions have the same mean. In such cases mean-based causal estimands are zero even though the geometry and topology of the outcome law change substantially. This paper develops a topological causal framework based on persistent homology. We formalize a persistent-homology ignorability condition, define topological analogues of CATE and ATE, and prove that these estimands are identifiable up to an explicit error bound under approximate topological ignorability. We also clarify a subtle but important point: a marginal persistence-diagram effect is not identified from conditional topological ignorability alone because persistent homology does not in general commute with mixtures over covariates. To preserve the original intuition while ensuring scientific correctness, we retain the marginal effect as a motivating quantity, but place the mathematically sound conditional estimands at the center of the theory. A synthetic experiment with mean-preserving topology change shows that mean-based causal estimands remain near zero while the proposed topological effect increases sharply and remains recoverable after adjustment for confounding.

2603.13761 2026-06-05 cs.LG cs.AI Version update

Hanna Foerster, Tom Blanchard, Kristina Nikolić, Ilia Shumailov, Cheng Zhang, Robert Mullins, Nicolas Papernot, Florian Tramèr, Yiren Zhao

Affiliations: University of Cambridge; University of Toronto & Vector Institute; ETH Zurich; AI Security Company

AI summary: This paper proposes a system-level security approach for Computer Use Agents (CUAs). Single-shot planning and the NOVA framework provide control flow integrity guarantees under dynamic UI state, improving security while retaining up to 57% of frontier-model performance on OSWorld.

Abstract

AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior. Among proposed defenses, architectural isolation provides the strongest guarantees by strictly separating trusted task planning from untrusted environment observations. However, applying this design to Computer Use Agents (CUAs), which automate tasks by viewing screens and executing actions, presents a fundamental challenge. Current agents require continuous observation of UI state to determine each action, which conflicts with the isolation required for security. We resolve this tension by demonstrating that UI workflows, while dynamic, are structurally predictable. Single-shot planning, where a trusted planner emits upfront a complete branching plan covering all anticipated runtime states, provides control flow integrity guarantees against arbitrary instruction injections. We introduce NOVA (Navigating via Observation, Verification, and Action) to make this viable in the combinatorially large UI state space, where the plan can invoke a perception model to resolve runtime values such as UI coordinates. We evaluate our design on OSWorld, and retain up to 57% of the performance of frontier models while improving performance for smaller open-source models by up to 19%, demonstrating that rigorous security and utility can coexist in CUAs. Although upfront planning prevents instruction injections, we show that additional measures are needed to defend against Branch Steering attacks, where adversaries deceive the perception model into routing execution down attacker-preferred branches of the plan, such as redirecting the agent to a malicious website.

2601.11527 2026-06-05 cs.HC cs.AI cs.CY Version update

"What if she doesn't feel the same?" What Happens When We Ask AI for Relationship Advice

Niva Manchanda, Akshata Kishore Moharir, Ratna Kandala

Affiliations: Department of Psychology, University of Kansas; Independent Researcher

AI summary: This study examines how users evaluate LLM-generated romantic relationship advice. Users reported high satisfaction with the advice, satisfaction was positively associated with perceived model reliability and helpfulness, and users' attitudes toward LLMs improved significantly after exposure.

Journal ref: First Workshop on LLM Persona Modeling, NeurIPS 2025

Abstract

Large Language Models (LLMs) are increasingly being used to provide support and advice in personal domains such as romantic relationships, yet little is known about user perceptions of this type of advice. This study investigated how people evaluate LLM-generated romantic relationship advice. Participants rated advice satisfaction, model reliability, and helpfulness, and completed pre- and post-measures of their general attitudes toward LLMs. Overall, the results showed participants' high satisfaction with LLM-generated advice. Greater satisfaction was, in turn, strongly and positively associated with their perceptions of the models' reliability and helpfulness. Importantly, participants' attitudes toward LLMs improved significantly after exposure to the advice, suggesting that supportive and contextually relevant advice can enhance users' trust and openness toward these AI systems.

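2508.06249 2026-06-05 cs.LG cs.AI Version update

In-Training Defenses against Emergent Misalignment in Language Models

David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, Florian Mai

Affiliations: University of Copenhagen

AI summary: This paper studies how to prevent emergent misalignment in language models during training. It examines five training regularization interventions and shows that selecting interleaved data by the perplexity gap between aligned and misaligned models yields the best results.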

Comments Accepted at ICML 2026 https://icml.cc/virtual/2026/poster/64303

Abstract

Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EM): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EM that are practical for providers who expose fine-tuning via an API: We evaluate whether they a) prevent broad misalignment, b) allow narrow misalignment, c) learn well on benign tasks, and d) remain coherent. We investigate five training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) preventive steering with an evil persona vector, (iv) interleaving training examples from a general instruct-tuning dataset and (v) inoculation prompting. We demonstrate that selecting interleaving data by the perplexity gap between aligned and misaligned models yields the best results overall.

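2603.03955 2026-06-05 cs.LG cs.AI Version update

GIPO: Gaussian Importance Sampling Policy Optimization

Chengxuan Lu, Zhenquan Zhang, Shukuan Wang, Qunzhi Lin, Yanjie Li, Baigui Sun, Yang Liu

Affiliations: University of Science and Technology of China

AI summary: This work proposes GIPO, a policy optimization objective based on truncated importance sampling that replaces hard clipping with a log-ratio-based Gaussian trust weight, softly damping extreme importance ratios while keeping gradients non-zero to improve data efficiency. Experiments show state-of-the-art performance among clipping-based baselines across a range of replay buffer sizes, with a better bias-variance trade-off, high training stability, and improved sample efficiency.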

Abstract

Post-training with reinforcement learning (RL) has recently shown strong promise for advancing multimodal agents beyond supervised imitation. However, RL remains limited by poor data efficiency, particularly in settings where interaction data are scarce and quickly become outdated. To address this challenge, GIPO (Gaussian Importance sampling Policy Optimization) is proposed as a policy optimization objective based on truncated importance sampling, replacing hard clipping with a log-ratio-based Gaussian trust weight to softly damp extreme importance ratios while maintaining non-zero gradients. Theoretical analysis shows that GIPO introduces an implicit, tunable constraint on the update magnitude, while concentration bounds guarantee robustness and stability under finite-sample estimation. Experimental results show that GIPO achieves state-of-the-art performance among clipping-based baselines across a wide range of replay buffer sizes, from near on-policy to highly stale data, while exhibiting a superior bias-variance trade-off, high training stability, and improved sample efficiency. Code is available at https://github.com/distanceLu/GIPO.
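
The contrast between hard clipping and a soft Gaussian trust weight can be illustrated with a rough sketch. This is not GIPO's actual objective: the abstract says only that the weight is Gaussian in the log importance ratio, so the form exp(-(log r)^2 / (2 sigma^2)), the width `sigma`, the clip range `eps`, and all function names below are assumptions made for illustration.

```python
import math


def hard_clip_weight(ratio, eps=0.2):
    """Hard clipping: the ratio is cut off outside [1 - eps, 1 + eps]."""
    return min(max(ratio, 1.0 - eps), 1.0 + eps)


def gaussian_trust_weight(ratio, sigma=0.5):
    """Assumed soft weight: Gaussian in the log-ratio, equal to 1 at
    ratio == 1 and decaying smoothly as the ratio becomes extreme."""
    log_ratio = math.log(ratio)
    return math.exp(-(log_ratio ** 2) / (2.0 * sigma ** 2))


def damped_ratio(ratio, sigma=0.5):
    """Importance ratio scaled by its trust weight."""
    return ratio * gaussian_trust_weight(ratio, sigma)


if __name__ == "__main__":
    # Compare the two treatments for mild and extreme importance ratios.
    for r in (0.25, 0.8, 1.0, 1.25, 4.0):
        print(r, hard_clip_weight(r), round(damped_ratio(r), 4))
```

Because the assumed weight is smooth and strictly positive, its derivative with respect to the ratio does not drop to zero outside a fixed interval the way a hard clip does, which is the property the abstract describes as maintaining non-zero gradients.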

2410.06703 2026-06-05 cs.AI Version update

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Nir Mashkif, Segev Shlomov

Affiliations: IBM Research

AI summary: This paper presents ST-WebAgentBench, a benchmark for evaluating the safety and trustworthiness of web agents in realistic enterprise scenarios. Two new metrics, Completion Under Policy (CuP) and Risk Ratio, expose safety gaps in existing agents.

Comments The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical workflows, safety and trustworthiness (ST) are prerequisite conditions for adoption. We introduce ST-WebAgentBench, a configurable and easily extensible suite for evaluating web agent ST across realistic enterprise scenarios. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six orthogonal dimensions (e.g., user consent, robustness). Beyond raw task success, we propose the Completion Under Policy (CuP) metric, which credits only completions that respect all applicable policies, and the Risk Ratio, which quantifies ST breaches across dimensions. Evaluating three open state-of-the-art agents reveals that their average CuP is less than two-thirds of their nominal completion rate, exposing critical safety gaps. By releasing code, evaluation templates, and a policy-authoring interface, ST-WebAgentBench (https://sites.google.com/view/st-webagentbench/home) provides an actionable first step toward deploying trustworthy web agents at scale.
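
The gap between raw completion and Completion Under Policy can be made concrete with a small sketch. The record layout (`completed`, `violations`), the policy names, and the function names are hypothetical; only the rule that CuP credits a completion when no applicable policy is violated comes from the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical record layout)."""
    completed: bool
    violations: list = field(default_factory=list)  # names of breached policies


def completion_rate(results):
    """Fraction of tasks the agent finished, ignoring policy breaches."""
    return sum(r.completed for r in results) / len(results)


def completion_under_policy(results):
    """Fraction of tasks finished with zero policy violations."""
    return sum(r.completed and not r.violations for r in results) / len(results)


# Made-up example: three completions, two of which breached a policy.
results = [
    TaskResult(completed=True),
    TaskResult(completed=True, violations=["user_consent"]),
    TaskResult(completed=True, violations=["robustness"]),
    TaskResult(completed=False),
]
```

In this made-up example the agent completes three of four tasks (0.75) but only one of them without a breach (CuP of 0.25), which shows how an agent can look capable on completion alone while scoring poorly once policies are enforced.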

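2602.19327 2026-06-05 cs.LG cs.AI Version update

Soft Sequence Policy Optimization

Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

Affiliations: Lomonosov Moscow State University; Institute for Artificial Intelligence

AI summary: This paper proposes Soft Sequence Policy Optimization, which improves sequence-level importance weights with a soft gating function to increase training stability and performance on large language model alignment tasks.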

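2602.22067 2026-06-05 cs.AI Version update

Semantic Partial Grounding via LLMs

Giuseppe Canonaco, Alberto Pozanco, Daniel Borrajo

Affiliations: Department of Computer Science, University of Cambridge

AI summary: This paper presents SPG-LLM, which uses large language models to analyze domain and problem files and identify likely irrelevant objects, actions, and predicates ahead of time, shrinking the grounding task, making grounding faster, and yielding better plan costs in some domains.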

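2512.15783 2026-06-05 cs.AI cs.LG Version update

Towards AI epidemiology: a measurement standardisation framework for prospective risk detection

Kit Tempest-Walters

AI summary: This paper proposes a measurement standardisation framework that compresses expert-AI interactions into structured, comparable fields for prospective risk detection, without access to model internals. It defines the framework's scope, both semantically and statistically, and specifies a protocol for empirical testing in future work.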

Comments 29 pages, 3 figures

Abstract

This paper proposes a measurement standardisation framework that compresses expert-AI interactions into structured, comparable fields for prospective risk detection in deployed AI systems, without access to model internals. The main aim of this concept paper is to define the scope of the framework, both semantically and statistically, and to specify a protocol for its empirical testing in future work. The population-level claims the framework is designed to support are therefore the subject of a staged research programme rather than results claimed in this paper. Measurement standardisation underpins all three claims that follow. The first is a reliability claim: under bounded conditions, large language models can produce reliable, standardised assessments of the evidential and policy alignment of expert-AI interactions. The second is a governance claim: alignment scores give experts an immediate signal during deployment and give institutions a basis for monitoring alignment patterns across mission types, models, and domains. The third is an epidemiological claim: once measurement standardisation is established, aggregate alignment scores could be used to study associations with downstream outcomes in regulated professional settings. This introduces the possibility of an "AI epidemiology" that detects risk based on correlated variables instead of mechanistic analysis. This paper addresses the first claim and specifies protocols for investigating the second and third. To enable empirical evaluation in future studies, this paper sets out a defined grammar, together with a statistical protocol based on paired bootstrap inference, DeLong's test for paired AUCs as a sensitivity check, a pre-specified one-sided non-inferiority margin of 0.05, and Holm-Bonferroni correction.
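
Of the components named in the statistical protocol, the Holm-Bonferroni correction is simple enough to sketch directly. The version below is a generic step-down implementation in Python, not code from the paper; the function name and the example p-values are illustrative only.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-down procedure: sort p-values ascending and compare the k-th
    smallest (k = 0, 1, ...) against alpha / (m - k); stop at the first
    p-value that fails, and retain it and all larger ones.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for k, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - k):
            rejected[idx] = True
        else:
            break
    return rejected


# Illustrative p-values for four comparisons.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

With these illustrative values, the first and fourth hypotheses are rejected. The p-value of 0.03 fails its threshold of 0.05 / 2 = 0.025, so it and the larger 0.04 are retained even though both fall below the uncorrected 0.05 level.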

2509.24882 2026-06-05 cs.LG cond-mat.dis-nn cs.AI stat.ML Version update

Scaling Laws and Spectra of Shallow Neural Networks in the Feature Learning Regime

Leonardo Defilippis, Yizhou Xu, Julius Girardin, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová, Bruno Loureiro, Florent Krzakala

Affiliations: Departement d’Informatique, École Normale Supérieure, PSL & CNRS; Statistical Physics of Computation Laboratory, École Polytechnique Fédérale de Lausanne (EPFL); Information, Learning and Physics Laboratory, École Polytechnique Fédérale de Lausanne (EPFL)

AI summary: This paper studies scaling laws and spectra of shallow neural networks in the feature learning regime. Analyzing quadratic and diagonal networks, it characterizes how sample complexity and weight decay affect the scaling exponents of the excess risk and establishes a precise link between these regimes and the spectral properties of the trained weights.

Journal ref: ICLR 2026

Abstract

Neural scaling laws underlie many of the recent advances in deep learning, yet their theoretical understanding remains largely confined to linear models. In this work, we present a systematic analysis of scaling laws for quadratic and diagonal neural networks in the feature learning regime. Leveraging connections with matrix compressed sensing and LASSO, we derive a detailed phase diagram for the scaling exponents of the excess risk as a function of sample complexity and weight decay. This analysis uncovers crossovers between distinct scaling regimes and plateau behaviors, mirroring phenomena widely reported in the empirical neural scaling literature. Furthermore, we establish a precise link between these regimes and the spectral properties of the trained network weights, which we characterize in detail. As a consequence, we provide a theoretical validation of recent empirical observations connecting the emergence of power-law tails in the weight spectrum with network generalization performance, yielding an interpretation from first principles.

2602.13697 2026-06-05 cs.AI cs.DB cs.LG Version update

No Need to Train Your RDB Foundation Model

Linjie Xu, Yanlin Zhang, Quan Gan, Minjie Wang, David Wipf

Affiliations: University of Hong Kong, Shanghai X-Lab

AI summary: This paper proposes a relational database encoder for in-context learning that can be paired with existing single-table ICL foundation models without retraining, enabling prediction over multiple interrelated tables.

Comments International Conference on Machine Learning (ICML) 2026

Abstract

Relational databases (RDBs) contain vast amounts of heterogeneous tabular information that can be exploited for predictive modeling purposes. But since the space of potential targets is vast across enterprise settings, how can we avoid retraining a new model each time we wish to predict a new quantity of interest? Foundation models based on in-context learning (ICL) offer a convenient option, but so far are largely restricted to single-table operability. In generalizing to multiple interrelated tables, it is essential to compress variably-sized RDB neighborhoods into fixed-length ICL samples for consumption by the decoder. However, the details here are critical: unlike existing supervised learning RDB pipelines, we provide theoretical and empirical evidence that ICL-specific compression should be constrained within high-dimensional RDB columns where all entities share units and roles, not across columns where the relevance of heterogeneous data types cannot be determined without extensive label information. Conditioned on this restriction, we then demonstrate that encoder expressiveness is actually not compromised by excluding trainable parameters. Hence we arrive at a principled family of RDB encoders that can be seamlessly paired with already-existing single-table ICL foundation models, whereby no training or fine-tuning is required. From a practical standpoint, we develop scalable SQL primitives to implement the encoder stage, resulting in the easy-to-use open-source RDBLearn foundation model capable of robust performance on unseen datasets out of the box.

2602.13255 2026-06-05 cs.AI cs.MA Version update

DPBench: Structural Determinants of Multi-Agent LLM Coordination Under Simultaneous Resource Contention

Najmul Hasan, Prashanth BusiReddyGari

Affiliations: Department of Mathematics and Computer Science, University of North Carolina at Pembroke

AI summary: This paper presents DPBench, a benchmark for evaluating coordination in multi-agent LLM systems. By varying the action protocol, communication structure, and group size, it characterizes the conditions under which coordination succeeds or fails under resource contention.

Comments 20 pages, 4 figures

Abstract

We present DPBench, a benchmark for evaluating coordination in multi-agent systems built from large language models. Existing benchmarks measure task-level success under a fixed protocol; the structural conditions under which coordination succeeds or fails at all have not been characterised. DPBench adapts the Dining Philosophers problem into a controlled testbed where the action protocol, the communication structure, and the group size each vary independently. We evaluate six agents: GPT-5.2, Claude Opus 4.5, Grok 4.1, Gemini 2.5 Flash, Llama 4 Maverick, and a uniform-random baseline. Under simultaneous action at N=5 with the default prompt, deadlock ranges from 25.0% (95% Wilson CI [11.2, 46.9]) for GPT-5.2 to 90.0% [74.4, 96.5] for Gemini 2.5 Flash; sequential action is solved by four of the six. Holding the model fixed at Gemini 2.5 Flash, three protocol variables drive deadlock from 90% to within CI of zero: three rounds of pre-commitment communication (0.0% vs. single-round 86.7%), a prompt encoding a classical concurrency primitive (0.0% for resource-ordering and symmetry-breaking, against 100% for the minimal prompt), or doubling the group from N=5 to N=10 (90.0% to 10.0%). Single-round messaging and memory of past timesteps do not change the rate at the sample size we ran. Whether the same model coordinates or deadlocks is determined by the protocol, not by the model's capability.
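
The Wilson score intervals quoted in this abstract can be reproduced with a few lines of Python. The abstract does not state the number of runs per condition; the counts used below (5 deadlocks in 20 runs, 27 in 30) are assumptions, chosen because they reproduce the reported percentages and intervals, and the function name is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    ) / denom
    return center - half_width, center + half_width


# Assumed counts: the abstract reports only percentages and intervals.
low, high = wilson_interval(5, 20)     # 25.0% deadlock rate
low2, high2 = wilson_interval(27, 30)  # 90.0% deadlock rate
```

Under these assumed counts the intervals come out at roughly [11.2%, 46.9%] and [74.4%, 96.5%]. The Wilson interval is asymmetric around the observed rate, which is why the 25.0% estimate sits closer to its lower bound than to its upper one.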

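2602.04809 2026-06-05 cs.LG cs.AI Version update

Beyond Rewards in Reinforcement Learning for Cyber Defence

Elizabeth Bates, Chris Hicks, Vasilios Mavroudis

Affiliations: University of Cambridge

AI summary: This paper studies how reward function structure affects learning and policy behavior when reinforcement learning is used for cyber defence. Comparing sparse and dense reward functions reveals nuanced relationships between rewards, action space, and the risk of suboptimal policies.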

Abstract

Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered reward functions which combine many penalties and incentives for a range of (un)desirable states and costly actions. Dense rewards help alleviate the challenge of exploring complex environments but risk biasing agents towards suboptimal and potentially riskier solutions, a critical issue in complex cyber environments. We thoroughly evaluate the impact of reward function structure on learning and policy behavioural characteristics using a variety of sparse and dense reward functions, two well-established cyber gyms, a range of network sizes, and both policy gradient and value-based RL algorithms. Our evaluation is enabled by a novel ground truth evaluation approach which allows directly comparing between different reward functions, illuminating the nuanced inter-relationships between rewards, action space and the risks of suboptimal policies in cyber environments. Our results show that sparse rewards, provided they are goal aligned and can be encountered frequently, uniquely offer both enhanced training reliability and more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards can also yield policies that are better aligned with cyber defender goals and make sparing use of costly defensive actions without explicit reward-based numerical penalties.

2602.09574 2026-06-05 cs.CL cs.AI cs.LG (updated version)

Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs

Sora Miyamoto, Daisuke Oba, Naoaki Okazaki

Affiliations: University of Tokyo

AI summary: This paper proposes Budget-Guided MCTS (BG-MCTS), a tree-search decoding algorithm that aligns the search policy with the remaining token budget to improve reasoning performance across different token budgets.

Comments Accepted at ICML 2026. Code: https://github.com/Sora-Miyamoto/bg-mcts

2602.07739 2026-06-05 cs.IR cs.AI (updated version)

HypRAG: Hyperbolic Dense Retrieval for Retrieval Augmented Generation

Hiren Madhu, Ngoc Bui, Ali Maatouk, Leandros Tassiulas, Smita Krishnaswamy, Menglin Yang, Sukanta Ganguly, Kiran Srinivasan, Rex Ying

Affiliations: University of California, Berkeley

AI summary: This paper proposes hyperbolic dense retrieval, building two model variants in hyperbolic space, HyTE-FH and HyTE-H, to address the limitations of conventional Euclidean space in retrieval-augmented generation and to improve context relevance and answer relevance.

Abstract

Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrieval-augmented generation (RAG) remain largely confined to Euclidean space. However, natural language exhibits hierarchical structure from broad topics to specific entities that Euclidean embeddings fail to preserve, causing semantically distant documents to appear spuriously similar and increasing hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval, developing two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture projecting pre-trained Euclidean embeddings into hyperbolic space. To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines, while on RAGBench, HyTE-H achieves up to 29% gains over Euclidean baselines in context relevance and answer relevance using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation, with over 20% radial increase from general to specific concepts, a property absent in Euclidean embeddings, underscoring the critical role of geometric inductive bias in faithful RAG systems.
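
Illustrative sketch (not the HyTE architecture or the Outward Einstein Midpoint): the abstract places embeddings in the Lorentz model of hyperbolic space. The standard textbook operations for curvature -1 are shown below: the Lorentzian inner product, lifting a Euclidean vector onto the hyperboloid with the exponential map at the origin, and the geodesic distance. The function names are chosen here for illustration.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def exp_map_origin(v):
    """Lift a Euclidean vector v (tangent at the origin) onto the hyperboloid."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in v]


def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L); clamped to guard against rounding."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))
```

In this lifting, the distance of a point from the origin equals the Euclidean norm of the vector it was lifted from, which is one way to picture the norm-based separation of general and specific documents described in the abstract.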

2602.07253 2026-06-05 cs.AI cs.CL (updated version)

From Out-of-Distribution Detection to Hallucination Detection: A Geometric View

Litian Liu, Reza Pourreza, Yubing Jian, Yao Qin, Roland Memisevic

Affiliations: University of California, Berkeley

AI summary: This paper reframes hallucination detection as an out-of-distribution detection problem and, from a geometric view, proposes a training-free, single-sample-based detection method that achieves high accuracy on reasoning tasks.

Comments ICML 2026 main conference paper

Abstract

Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. While existing hallucination detection methods achieve strong performance in question-answering tasks, they remain less effective on tasks requiring reasoning. In this work, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in areas like computer vision. Treating next-token prediction in language models as a classification task allows us to apply OOD techniques, provided appropriate modifications are made to account for the structural differences in large language models. We show that OOD-based approaches yield training-free, single-sample-based detectors, achieving strong accuracy in hallucination detection for reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.
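
Illustrative sketch (an assumption, not the authors' detector): the abstract does not say which OOD scores are adapted or how they are modified for language models. The code below only shows two classic training-free OOD scores from the classification literature, maximum softmax probability and the energy score, applied to a single vector of next-token logits.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(z))) over one logit vector."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def max_softmax_prob(logits):
    """Maximum softmax probability: low values suggest an OOD-like prediction."""
    return math.exp(max(logits) - log_sum_exp(logits))


def energy_score(logits):
    """Energy score -logsumexp(logits): higher energy suggests an OOD-like prediction."""
    return -log_sum_exp(logits)
```

Both scores need only one forward pass and no extra training, which matches the training-free, single-sample setting the abstract describes.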

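2602.00911 2026-06-05 cs.AI (updated version)

Reformulating Neural Operators in d+1 Dimensions for Embedding Evolution

Haoze Song, Zhihao Li, Xiaobo Zhang, Zecheng Gan, Zhilu Lai, Wei Wang

Affiliations: HKUST (GZ); HKUST; SWJTU

AI summary: This paper reformulates neural operators in d+1 dimensions by introducing an auxiliary function dimension that models embedding evolution, as an alternative to costly embedding scaling. Fourier-based operators act jointly on the physical and auxiliary domains to form the embedding-evolution module, and experiments show the lowest relative L2 error among the evaluated baselines across the benchmarks.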

Abstract

Neural Operators (NOs) are powerful architectures for learning mappings between function spaces. While most advances focus on refining kernel parameterizations over the $d$-dimensional physical domain, the evolution of lifted embeddings remains underexplored, which often drives models toward computationally expensive embedding-scaling designs to improve approximation. In this paper, we introduce an auxiliary function dimension that models embedding evolution in operator form, thereby reformulating the NO pipeline in $d+1$ dimensions. We instantiate this framework via Fourier-based operators acting jointly on the physical and auxiliary domains, yielding a basis-diversified auxiliary evolution module as an alternative to brute-force embedding scaling. Across more than ten increasingly challenging benchmarks, ranging from the 1D heat equation to the highly nonlinear 3D Rayleigh-Taylor instability, our model consistently achieves the lowest relative $L_2$ error among the evaluated baselines. Crucially, this advantage is empirically supported by (1) controlled budget-aware comparisons against scaled and ablated baselines; (2) robustness under mixed-resolution training and super-resolution inference; and (3) zero-shot generalization to unseen temporal regimes. In addition, we present a broader set of design choices for lifting and recovery operators, demonstrating their impact on our model's predictive performance.

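2601.21288 2026-06-05 cs.AI cs.CV (updated version)

Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Weitong Lian, Zecong Tang, Haoran Li, Tianjian Gao, Yifei Wang, Zixu Wang, Lingyi Meng, Tengju Ru, Zhejun Cui, Yichen Zhu, Hangshuo Cao, Qi Kang, Tianxing Chen, Kaixuan Wang, Yu Zhang

Affiliations: University of Science and Technology of China

AI summary: This paper proposes Drive-KD, a framework that decomposes autonomous driving into a perception-reasoning-planning triad and transfers these capabilities via knowledge distillation. It builds capability-specific single-teacher models, unifies them into a multi-teacher framework, mitigates cross-capability gradient conflicts with asymmetric gradient projection, and validates generalization across model families and scales.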

Abstract

Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a "perception-reasoning-planning" triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42 times less GPU memory and ~11.4 times higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.
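
Illustrative sketch (an assumption, not the paper's exact rule): the abstract names asymmetric gradient projection but does not define it. One plausible reading, in the spirit of PCGrad-style gradient surgery, is that when two capability gradients conflict, only one of them is projected while the other, "protected" gradient is left untouched. Which capability is protected is not stated in the abstract.

```python
def dot(u, v):
    """Dot product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_if_conflicting(g, g_protected):
    """Remove from g its component along g_protected, but only if they conflict.

    Asymmetric: g_protected itself is never modified.
    """
    d = dot(g, g_protected)
    if d >= 0.0:  # no conflict, keep g unchanged
        return list(g)
    scale = d / dot(g_protected, g_protected)
    return [a - scale * b for a, b in zip(g, g_protected)]
```

After projection the modified gradient is orthogonal to the protected one, so a step along it no longer works against the protected capability to first order.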

2601.21162 2026-06-05 cs.IR cs.AI cs.DB (updated version)

A2RAG: Adaptive Agentic Graph Retrieval for Cost-Aware and Reliable Reasoning

Jiate Liu, Zebin Chen, Shaobo Qiao, Mingchen Ju, Danting Zhang, Bocheng Han, Shuyue Yu, Xin Shu, Jinglin Wu, Dong Wen, Xin Cao, Guanfeng Liu, Zhengyi Yang

Affiliations: University of New South Wales; Euler AI; Sigma Trading Management; Eigenflow AI; Macquarie University

AI summary: This paper proposes A2RAG, a framework that uses an adaptive controller and an agentic retriever to address cost and reliability problems in graph retrieval, improving Recall@2 on multihop question answering while cutting token consumption and latency.

Abstract

Graph Retrieval-Augmented Generation (Graph-RAG) enhances multihop question answering by organizing corpora into knowledge graphs and routing evidence through relational structure. However, practical deployments face two persistent bottlenecks: (i) mixed-difficulty workloads where one-size-fits-all retrieval either wastes cost on easy queries or fails on hard multihop cases, and (ii) extraction loss, where graph abstraction omits fine-grained qualifiers that remain only in source text. We present A2RAG, an adaptive-and-agentic GraphRAG framework for cost-aware and reliable reasoning. A2RAG couples an adaptive controller that verifies evidence sufficiency and triggers targeted refinement only when necessary, with an agentic retriever that progressively escalates retrieval effort and maps graph signals back to provenance text to remain robust under extraction loss and incomplete graphs. Experiments on HotpotQA and 2WikiMultiHopQA demonstrate that A2RAG achieves +9.9/+11.8 absolute gains in Recall@2, while cutting token consumption and end-to-end latency by about 50% relative to iterative multihop baselines.
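
Illustrative sketch (not the A2RAG implementation): the control flow described above, escalating retrieval effort only while the evidence is judged insufficient, can be written as a short loop. The stage names, the retriever callables, and the sufficiency check are hypothetical placeholders.

```python
from typing import Callable, List, Sequence, Tuple

Retriever = Callable[[str], List[str]]


def adaptive_retrieve(
    query: str,
    stages: Sequence[Tuple[str, Retriever]],
    is_sufficient: Callable[[str, List[str]], bool],
) -> Tuple[str, List[str]]:
    """Run retrieval stages from cheapest to most expensive, stopping early.

    Returns the name of the stage that produced sufficient evidence
    (or "exhausted") together with the accumulated passages.
    """
    evidence: List[str] = []
    for name, retrieve in stages:
        for passage in retrieve(query):
            if passage not in evidence:  # avoid duplicate passages
                evidence.append(passage)
        if is_sufficient(query, evidence):
            return name, evidence
    return "exhausted", evidence
```

Easy queries stop at the first stage and pay only its cost, while hard multihop queries fall through to the later, more expensive stages.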

2601.19568 2026-06-05 cs.AI cs.SE (updated version)

Learning Adaptive Parallel Execution for Efficient Code Localization

Ke Xu, Siyang Xiao, Ming Liang, Yichen Yu, Zhixiang Wang, Jingxuan Xu, Dajun Chen, Wei Jiang, Yong Li

Affiliations: Ant Group; Peking University; Beijing Jiaotong University

AI summary: This paper proposes FuseSearch, which reformulates parallel code localization as a joint quality-efficiency optimization task and uses two-stage SFT and RL training to learn an adaptive parallel strategy, improving the efficiency and performance of code localization.

Comments Paper accepted to Findings of ACL 2026

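Reward Learning through Ranking Mean Squared Error

Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor

AI summary: This paper proposes R4, a rating-based reinforcement learning method that introduces a new ranking mean squared error loss to learn reward functions from trajectory-rating pairs, and it matches or outperforms existing rating-based and preference-based methods on robotic benchmarks.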

Abstract

Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human ratings rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 uses a novel ranking mean squared error loss that learns from a dataset of trajectory-rating pairs, treating the human-provided discrete ratings (e.g., bad, neutral, good) as ordinal targets. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using both human-provided and simulated ratings, we demonstrate that R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic benchmarks from OpenAI Gym and the DeepMind Control Suite. Code released at https://github.com/IRLL/R4.

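2502.14131 2026-06-05 cs.LG cs.AI econ.EM (updated version)

An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model

Enoch H. Kang, Hema Yoganarasimhan, Lalit Jain

Affiliations: Foster School of Business, University of Washington

AI summary: This paper proposes an Empirical Risk Minimization (ERM) based framework for inverse reinforcement learning and dynamic discrete choice models. The method needs no explicit estimation of state transition probabilities in the Bellman equation, applies to high-dimensional and infinite state spaces, and rests on the Polyak-Lojasiewicz condition, which ensures fast global convergence.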

Abstract

We study the problem of estimating Dynamic Discrete Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing the Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks. Therefore, the proposed method has the potential to be scaled to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition -- a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.
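
For reference, the Polyak-Lojasiewicz condition mentioned above has the following standard form, stated here for a generic differentiable objective $f$ with minimum value $f^*$. The symbols $\mu$ and $L$ are generic constants, and the paper's own objective and constants are not given in the abstract.

```latex
% Polyak-Lojasiewicz (PL) condition, for some constant \mu > 0:
\frac{1}{2}\,\lVert \nabla f(\theta) \rVert^{2} \;\ge\; \mu\,\bigl(f(\theta) - f^{*}\bigr)
\quad \text{for all } \theta .

% If f is also L-smooth, gradient descent with step size 1/L,
% \theta_{k+1} = \theta_k - \tfrac{1}{L}\,\nabla f(\theta_k), converges linearly:
f(\theta_k) - f^{*} \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)^{k}\,\bigl(f(\theta_0) - f^{*}\bigr).
```

Strong convexity implies the PL condition but not the other way around, and PL does not require the minimizer to be unique, which is why it is described as weaker yet still sufficient for a global linear rate.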

2601.06056 2026-06-05 cs.CY cs.AI cs.CV (updated version)

Using street view images and visual LLMs to predict heritage values for governance support: Risks, ethics, and policy implications

Tim Johansson, Mikael Mangold, Kristina Dabrock, Anna Donarelli, Ingrid Campo-Ruiz

Affiliations: RISE Research Institutes of Sweden AB; Malmö University; Forschungszentrum Jülich GmbH; Uppsala University

AI summary: This study uses street view images and visual LLMs to assess the heritage values of Swedish buildings in support of Building Renovation Plans, and discusses problems with the method, potential improvements, and the ethical risks of using LLM-based data.

Abstract

During 2025 and 2026, the Energy Performance of Buildings Directive is being implemented in the European Union member states, requiring all member states to have National Building Renovation Plans. In Sweden, there is no comprehensive national register of buildings with heritage values. This is seen as a barrier for the analyses underlying the development of Building Renovation Plans by the involved Swedish authorities. The purpose of this research was to assist Swedish authorities in developing information on heritage values in the Swedish building stock. Buildings in street view images from all over Sweden (N=154 710) have been analysed using multimodal Large Language Models (LLMs) to assess visible aspects indicative of heritage value. Zero-shot predictions by LLMs were used as a basis for identifying buildings with potential heritage values for 5.0 million square meters of heated floor area. In this paper, the results of the predictions and lessons learned are presented and related to the development of the Swedish Building Renovation Plan as part of governance. The problems with the method and potential improvements are discussed. Risks with authorities' use of LLM-based data are addressed, with a focus on issues of transparency, error detection and sycophancy.

2512.15231 2026-06-05 cs.AI (updated version)

CangLing-KnowFlow: A Unified Knowledge-and-Flow-fused Agent for Comprehensive Remote Sensing Applications

Zhengchao Chen, Haoran Wang, Jing Yao, Jianshe Zhang, Pedram Ghamisi, Jun Zhou, Peter M. Atkinson, Bing Zhang

Affiliations: State Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences; Beijing Tiandi Shijie Technology Co., Ltd.; Faculty of Science and Technology, Lancaster University; Faculty of Electrical and Computer Engineering, University of Iceland; Helmholtz-Zentrum Dresden-Rossendorf; School of Information and Communication Technology, Griffith University

AI summary: This paper proposes CangLing-KnowFlow, a unified intelligent agent framework that fuses knowledge and flow. By integrating a Procedural Knowledge Base, Dynamic Workflow Adjustment, and an Evolutionary Memory Module, it addresses the task-specific nature and missing unified framework of remote sensing data processing, and it surpasses the Reflexion baseline on KnowFlow-Bench.

Abstract

The automated and intelligent processing of massive remote sensing (RS) datasets is critical in Earth observation (EO). Existing automated systems are normally task-specific, lacking a unified framework to manage diverse, end-to-end workflows--from data preprocessing to advanced interpretation--across diverse RS applications. To address this gap, this paper introduces CangLing-KnowFlow, a unified intelligent agent framework that integrates a Procedural Knowledge Base (PKB), Dynamic Workflow Adjustment, and an Evolutionary Memory Module. The PKB, comprising 1,008 expert-validated workflow cases across 162 practical RS tasks, guides planning and substantially reduces hallucinations common in general-purpose agents. During runtime failures, the Dynamic Workflow Adjustment autonomously diagnoses and replans recovery strategies, while the Evolutionary Memory Module continuously learns from these events, iteratively enhancing the agent's knowledge and performance. This synergy enables CangLing-KnowFlow to adapt, learn, and operate reliably across diverse, complex tasks. We evaluated CangLing-KnowFlow on KnowFlow-Bench, a novel benchmark of 324 workflows inspired by real-world applications, testing its performance across 13 top Large Language Model (LLM) backbones, from open-source to commercial. Across all complex tasks, CangLing-KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate. As the most comprehensive validation to date in this emerging field, this research demonstrates the great potential of CangLing-KnowFlow as a robust, efficient, and scalable automated solution for complex EO challenges by turning expert knowledge (Knowledge) into adaptive and verifiable procedures (Flow).

2512.20627 2026-06-05 cs.NI cs.AI (updated version)

Efficient Asynchronous Federated Evaluation with Strategy Similarity Awareness for Intent-Based Networking in Industrial Internet of Things

Shaowen Qin, Jianfeng Zeng, Haodong Guo, Xiaohuan Li, Jiawen Kang, Qian Chen

Affiliations: Guangxi University Key Laboratory of Intelligent Networking and Scenario System (School of Information and Communication, Guilin University of Electronic Technology); National Engineering Laboratory for Comprehensive Transportation Big Data Application Technology (Guangxi); School of Automation, Guangdong University of Technology; School of Architecture and Transportation Engineering, GUET

AI summary: This paper proposes FEIBN, a federated-evaluation-enhanced intent-based networking framework that uses large language models to translate user intents into structured strategy tuples, together with a strategy-similarity-aware federated learning mechanism (SSAFL) that improves training and communication efficiency, enabling more efficient strategy evaluation in Industrial Internet of Things environments.

Comments 12 pages with 7 figures and 4 tables

Abstract

Intent-Based Networking (IBN) offers a promising paradigm for intelligent and automated network control in Industrial Internet of Things (IIoT) environments by translating high-level user intents into executable network strategies. However, frequent strategy deployment and rollback are impractical due to tightly coupled workflows and high downtime costs, while node heterogeneity and privacy constraints further complicate centralized strategy evaluation. To address these challenges, we propose a Federated Evaluation Enhanced Intent-Based Networking framework (FEIBN), which leverages large language models (LLMs) to translate user intents into structured strategy tuples and employs federated learning to support distributed strategy evaluation. To improve training efficiency and reduce communication overhead, we design a Strategy Similarity Aware Federated Learning mechanism (SSAFL), which selects nodes relevant to the task based on strategy similarity and resource status, and triggers asynchronous model uploads only when local updates are significant. Experiments demonstrate that the proposed method improves model accuracy, accelerates convergence, and reduces communication cost compared with the baselines.
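
Illustrative sketch (not the SSAFL implementation): the two decisions described above, choosing nodes by strategy similarity and resource status, and uploading only significant local updates, could look as follows. Cosine similarity, the L2 norm of the weight change, the thresholds, and the node dictionary layout are all assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def select_nodes(task_strategy: Sequence[float], nodes: Dict[str, dict],
                 min_similarity: float, min_resource: float) -> List[str]:
    """Keep nodes with a similar enough strategy vector and enough free resources."""
    return [
        node_id
        for node_id, info in nodes.items()
        if cosine_similarity(task_strategy, info["strategy"]) >= min_similarity
        and info["resource"] >= min_resource
    ]


def should_upload(old_weights: Sequence[float], new_weights: Sequence[float],
                  threshold: float) -> bool:
    """Upload asynchronously only if the local update's L2 norm reaches the threshold."""
    change = math.sqrt(sum((n - o) ** 2 for o, n in zip(old_weights, new_weights)))
    return change >= threshold
```

Skipping uploads for small updates is what reduces communication: nodes whose models barely changed stay silent for that round.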

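2512.20111 2026-06-05 cs.CL cs.AI cs.LG (updated version)

ABBEL: Learning Natural-Language Belief States for Memory-Efficient Interaction

Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr

Affiliations: University of California, Berkeley; Georgia Institute of Technology

AI summary: This paper proposes ABBEL, a framework that directly supervises the information content of each summary through explicit natural-language belief states, addressing omitted or incorrectly updated information in recursive summarization and improving interaction performance while keeping memory use efficient.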

Abstract

As the time horizons of sequential decision-making tasks grow, keeping full interaction histories in model context becomes increasingly costly. Recent work reduces context lengths by instead conditioning decision-making agents on recursively updated natural-language summaries, which are concise and interpretable. However, they underperform agents with access to the full context, suggesting that they fail to generate sufficient summaries. To address this we propose ABBEL, a recursive summarization framework that isolates and directly supervises each summary's information contents in the form of explicit natural-language belief states. First, we analyze the belief states generated by frontier models under ABBEL across five domains, and verify that performance is often degraded due to omitting or incorrectly updating information. We also discover settings where models use memory inefficiently by retaining extraneous information. We target these limitations by fine-tuning with two RL-based methods: belief grading, which reduces update errors by rewarding belief generations based on their information content, and peak belief penalties, which encourage compressing the beliefs with the greatest memory footprints. We demonstrate that these methods significantly reduce the performance gap with full context models, and enable ABBEL to outperform prior memory agent work by 40% while using 67% of the memory. Our code is available at https://github.com/jakob-bjorner/optimal-explorer-dev
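
Illustrative sketch (not the ABBEL code): the recursive-summary idea is that the agent carries only a belief string from step to step instead of the full history. The update function below is a hypothetical stand-in for a model call, belief length in characters is used as a crude proxy for memory footprint, and the penalty form is an assumption, since the abstract only names "peak belief penalties".

```python
from typing import Callable, Iterable, Tuple


def run_episode(observations: Iterable[str],
                update_belief: Callable[[str, str], str],
                initial_belief: str = "") -> Tuple[str, int]:
    """Fold observations into a single belief string, tracking its peak length."""
    belief = initial_belief
    peak = len(belief)
    for observation in observations:
        # Only the previous belief and the new observation are visible here,
        # never the full interaction history.
        belief = update_belief(belief, observation)
        peak = max(peak, len(belief))
    return belief, peak


def penalized_return(task_return: float, peak_belief_len: int, weight: float) -> float:
    """Task return minus a penalty proportional to the largest belief seen."""
    return task_return - weight * peak_belief_len
```

Penalizing the peak rather than the average targets the worst-case context size, which is what bounds memory during an episode.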

2512.14792 2026-06-05 cs.AI cs.SE (updated version)

IaC Generation with LLMs: An Error Taxonomy and A Study on Configuration Knowledge Injection

Roman Nekrasov, Stefano Fossati, Indika Kumara, Damian Andrew Tamburri, Willem-Jan van den Heuvel

Affiliations: Jheronimus Academy of Data Science; Tilburg University; Eindhoven University of Technology; University of Sannio

AI summary: This study investigates how systematically injecting structured configuration knowledge can improve the ability of LLMs to generate correct and intent-aligned Infrastructure as Code (IaC), specifically for Terraform. It proposes a new error taxonomy and evaluates several knowledge injection techniques.

Comments Submitted to ACM

DOI: 10.1145/3817608

Abstract

Large Language Models (LLMs) currently exhibit low success rates in generating correct and intent-aligned Infrastructure as Code (IaC). This research investigated methods to improve LLM-based IaC generation, specifically for Terraform, by systematically injecting structured configuration knowledge. To facilitate this, an existing IaC-Eval benchmark was significantly enhanced with cloud emulation and automated error analysis. Additionally, a novel error taxonomy for LLM-assisted IaC code generation was developed. A series of knowledge injection techniques was implemented and evaluated, progressing from Naive Retrieval-Augmented Generation (RAG) to more sophisticated Graph RAG approaches. These included semantic enrichment of graph components and modeling inter-resource dependencies. Experimental results demonstrated that while baseline LLM performance was poor (27.1% overall success), injecting structured configuration knowledge increased technical validation success to 75.3% and overall success to 62.6%. Despite these gains in technical correctness, intent alignment plateaued, revealing a "Correctness-Congruence Gap" where LLMs can become proficient "coders" but remain limited "architects" in fulfilling nuanced user intent.

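2511.21667 2026-06-05 cs.LG cs.AI (updated version)

Escaping the Verifier: Learning to Reason via Demonstrations

Locke Cai, Max Ryabinin, Ivan Provilkov

Affiliations: University of California, Berkeley; Stanford University

AI summary: This paper proposes RARO, which learns strong reasoning capabilities from expert demonstrations via inverse reinforcement learning without task-specific verifiers, and significantly outperforms verifier-free baselines across the evaluation tasks.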

Abstract

Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization), which learns strong reasoning capabilities from expert demonstrations alone via Inverse Reinforcement Learning. RARO sets up an adversarial game between a policy and a relativistic critic: the policy learns to mimic expert answers, while the critic aims to identify the experts among expert-policy answer pairs. Both the policy and the critic are trained jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines across all evaluation tasks: +13.7% accuracy on Countdown (1.5B), +8.2% accuracy on DeepMath (7B), and +19.1% win-rate on Poetry Writing (7B) against expert poems. RARO also exhibits similar robust scaling trends as RL with verifiers. These results demonstrate that RARO effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.

2512.05774 2026-06-05 cs.CV cs.AI cs.CL (updated version)

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles

Affiliations: Salesforce AI Research; University of North Carolina at Chapel Hill

AI summary: This paper proposes Active Video Perception (AVP), a framework that runs an iterative plan-observe-reflect process to actively decide what, when, and where to observe in a video, improving the accuracy and efficiency of long video understanding.

Comments Website: https://activevideoperception.github.io/

Abstract

Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest overall accuracy with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average overall accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
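
Illustrative sketch (not the AVP implementation): the plan-observe-reflect loop can be outlined with three callables standing in for the planner, observer, and reflector agents. Their signatures, the evidence format, and the round limit are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def active_perception(
    query: str,
    plan: Callable[[str, List[str]], object],
    observe: Callable[[object], List[str]],
    reflect: Callable[[str, List[str]], Optional[str]],
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Iterate plan -> observe -> reflect until the reflector returns an answer.

    Returns (answer, rounds_used); answer is None if the round limit is hit.
    """
    evidence: List[str] = []
    for round_idx in range(max_rounds):
        interaction = plan(query, evidence)     # decide what/when/where to look
        evidence.extend(observe(interaction))   # collect time-stamped evidence
        answer = reflect(query, evidence)       # None means "not sufficient yet"
        if answer is not None:
            return answer, round_idx + 1
    return None, max_rounds
```

Because the loop halts as soon as the evidence is judged sufficient, only the parts of the video that the planner asks for are ever processed.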

2508.10875 2026-06-05 cs.CL cs.AI cs.LG (version update)

A Survey on Diffusion Language Models

扩散语言模型的综述

Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen

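Affiliations: VILA Lab, Mohamed bin Zayed University of Artificial Intelligence; Department of Automation, Tsinghua University

AI summary: This survey reviews the current state of diffusion language models, examines their relationship to autoregressive and masked language models, analyzes pre-training strategies, post-training methods, and inference optimizations, and discusses multimodal extensions, applications, limitations, and future research directions.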

AI中文摘要

扩散语言模型（DLMs）正迅速崛起为一种强大的替代方案，以取代主导的自回归（AR）范式。通过迭代去噪过程并行生成令牌，DLMs在减少推理延迟和捕捉双向上下文方面具有固有优势，从而实现对生成过程的精细控制。尽管实现了数倍的加速，最近的进展使DLMs在性能上与自回归模型相当，使其成为各种自然语言处理任务的有力选择。在本文综述中，我们提供了当前DLM景观的全面概述。我们追踪其演变及其与其他范式，如自回归和掩码语言模型的关系，并涵盖了基础原理和最先进模型。我们的工作提供了一个最新、全面的分类法以及对当前技术的深入分析，从预训练策略到高级后训练方法。本文的另一个贡献是全面回顾DLM推理策略和优化，包括解码并行性、缓存机制和生成质量的改进。我们还突出了DLM多模态扩展的最新方法，并阐述了它们在各种实际场景中的应用。此外，我们的讨论还讨论了DLMs的局限性和挑战，包括效率、长序列处理和基础设施需求，同时概述了未来研究方向，以维持该快速发展的领域中的进步。Project GitHub可在https://github.com/VILA-Lab/Awesome-DLMs上找到。

英文摘要

Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.

2512.05013 2026-06-05 cs.AI cs.MA stat.ME (version update)

Detecting Perspective Shifts in Multi-agent Systems

在多智能体系统中检测视角变化

Eric Bridgeford, Hayden Helm

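Affiliations: Helivan, San Francisco, CA

AI summary: This paper proposes TDKPS, a framework for detecting agent- and group-level behavioral change in multi-agent systems, and validates it through simulations and a natural experiment, showing that it detects changes associated with a real exogenous event.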

AI中文摘要

增强型生成模型结合外部工具和更新机制（或称为智能体）已展现出超越基础模型智能提示的能力。随着智能体的广泛应用，动态多智能体系统自然地出现了。最近的研究探讨了基于单时间点查询响应的低维表示的理论和经验属性。本文引入了时间数据核视角空间（TDKPS），该方法跨时间联合嵌入智能体，并提出了几种新的假设检验方法，用于检测多智能体系统中智能体和群体层面的行为变化。我们通过受演进数字身份多智能体系统启发的模拟，表征了所提出检验的实证属性，包括其对关键超参数的敏感性。最后，我们通过自然实验证明，所提出检验能够检测出与真实外生事件相关、敏感且显著变化。据我们所知，TDKPS是首个系统性的框架，用于监控多智能体系统中的行为动态——随着生成智能体部署的持续扩展，这一能力至关重要。

英文摘要

Generative models augmented with external tools and update mechanisms (or agents) have demonstrated capabilities beyond intelligent prompting of base models. As agent use proliferates, dynamic multi-agent systems have naturally emerged. Recent work has investigated the theoretical and empirical properties of low-dimensional representations of agents based on query responses at a single time point. This paper introduces the Temporal Data Kernel Perspective Space (TDKPS), which jointly embeds agents across time, and proposes several novel hypothesis tests for detecting behavioral change at the agent and group level in black-box multi-agent systems. We characterize the empirical properties of our proposed tests, including their sensitivity to key hyperparameters, in simulations motivated by a multi-agent system of evolving digital personas. Finally, we demonstrate via a natural experiment that our proposed tests detect changes that correlate sensitively, specifically, and significantly with a real exogenous event. As far as we are aware, TDKPS is the first principled framework for monitoring behavioral dynamics in black-box multi-agent systems -- a critical capability as generative agent deployment continues to scale.

2512.03086 2026-06-05 cs.PL cs.AI cs.SE (version update)

Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

超越代码对：基于对话的数据生成用于LLM代码翻译

Le Chen, Nuo Xu, Winson Chen, Bin Lei, Pei-Hung Lin, Dunzhi Zhou, Rajeev Thakur, Caiwen Ding, Ali Jannesari, Chunhua Liao

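Affiliations: Argonne National Laboratory; University of Minnesota; Iowa State University; Lawrence Livermore National Laboratory

AI summary: This paper proposes a dialogue-based data generation method that uses a dual-LLM design to produce verified translations and multi-turn dialogues, improving LLM code translation in low-resource programming domains.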

AI中文摘要

大型语言模型（LLMs）在代码翻译任务中表现出色，但在资源稀缺的编程领域如Fortran和新兴框架如CUDA中性能下降，因为高质量并行数据稀缺。我们提出了一种自动化数据生成流水线，采用双LLM提问者-求解器设计，整合编译器和运行时反馈的外部知识。除了传统的源-目标代码对数据集外，我们的方法还生成（1）带有单元测试的验证翻译以评估功能一致性，以及（2）多轮对话，捕捉翻译优化过程中的推理过程。应用于Fortran到C++和C++到CUDA的转换中，该流水线分别生成3,640和3,930个对话。在该数据上微调可显著提升功能正确性，使C++到CUDA任务的单元测试成功率提高超过56%。我们证明生成的数据使7B开放式模型在编译成功率等关键指标上显著优于更大的专有系统。

英文摘要

Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran-to-C++ and C++-to-CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show that the generated data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.

2511.20613 2026-06-05 cs.LG cs.AI cs.MA (version update)

Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

能否用Vibe编码击败研究生计算机科学学生？一个LLM与人类编码竞赛在市场驱动的战略规划中的表现

Panayiotis Danassis, Naman Goel

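Affiliations: University of Southampton; University of Oxford and Alan Turing Institute

AI summary: This paper introduces a multi-agent, reasoning-driven benchmark based on a real-world logistics optimization problem (the Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. Comparing 40 LLM-coded agents with 17 human-coded agents over 12 double all-play-all tournaments and roughly 40k matches, the study finds that human-coded agents hold a clear advantage in strategic planning and optimization tasks and that LLMs fall short at generating effective code for real-world problems.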

DOI: 10.65109/SEOT8410

AI中文摘要

大型语言模型（LLMs）的快速普及已经革新了AI辅助代码生成。然而，LLMs的快速发展超出了我们正确评估它们的能力。现有的基准测试强调单元测试通过率和语法正确性。这些指标低估了许多需要规划、优化和战略互动的真实世界问题的难度。我们引入了一个基于现实物流优化问题（拍卖、取件和送货问题）的多智能体推理驱动基准，该问题结合了竞争拍卖与容量受限路由。该基准要求构建能够（i）在不确定性下进行战略投标，以及（ii）优化在完成配送任务的同时最大化利润的规划器的代理。我们评估了40个LLM编码的代理（由多种最先进的LLMs在多种提示方法下生成，包括Vibe编码）与17个在LLM出现之前开发的人类编码代理。我们在12场双循环赛和约4万场对局中的结果显示（i）人类（研究生）编码代理具有明显优势：前5名始终由人类编码代理占据；（ii）大多数LLM编码代理（40个中的33个）被非常简单的基线所击败；（iii）在给定最佳人类解决方案作为输入并提示改进的情况下，表现最好的LLM使解决方案显著变差而不是改进。我们的结果突显了LLMs在现实世界中生成具有竞争力的代码能力的差距，并促使新的评估，这些评估强调在现实世界场景中推理驱动的代码合成。

英文摘要

The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and approximately 40k matches demonstrate (i) a clear superiority of human-coded agents (written by graduate students): the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon it, the best-performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.

2503.01734 2026-06-05 cs.CR cs.AI (version update)

Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning

对抗代理：基于强化学习的黑盒逃逸攻击

Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel

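Affiliations: University of Wisconsin-Madison; Virginia Tech

AI summary: This paper proposes a reinforcement-learning-based adversarial attack in which an agent learns a new class of algorithms for generating adversarial samples, improving attack efficiency and success rate, and demonstrates superior performance on image classification benchmarks.

Comments Accepted to the Findings of CVPR 2026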

AI中文摘要

对机器学习模型的攻击已通过无状态优化广泛研究。本文展示了强化学习（RL）代理如何学习一种新类型的攻击算法来生成对抗样本。与传统对抗机器学习（AML）方法不同，我们的RL方法保留并利用过去的攻击经验，以提高未来攻击的有效性和效率。我们将对抗样本生成建模为马尔可夫决策过程，并评估RL在（a）学习有效且高效的攻击策略以及（b）与最先进的AML竞争的能力。在两个图像分类基准上，我们的代理在训练过程中将攻击成功率提高了最高13.2%，并将每个攻击的受害者模型查询平均次数减少了最高16.9%。在与最先进的图像攻击进行直接比较时，我们的方法使攻击者能够在训练后在未见过的输入上生成对抗样本的成功率提高了17%。从安全角度来看，这项工作展示了一种强大的新攻击向量，利用RL训练能够高效且大规模攻击ML模型的代理。

英文摘要

Attacks on machine learning models have been extensively studied through stateless optimization. In this paper, we demonstrate how a reinforcement learning (RL) agent can learn a new class of attack algorithms that generate adversarial samples. Unlike traditional adversarial machine learning (AML) methods that craft adversarial samples independently, our RL-based approach retains and exploits past attack experience to improve the effectiveness and efficiency of future attacks. We formulate adversarial sample generation as a Markov Decision Process and evaluate RL's ability to (a) learn effective and efficient attack strategies and (b) compete with state-of-the-art AML. On two image classification benchmarks, our agent increases attack success rate by up to 13.2% and decreases the average number of victim model queries per attack by up to 16.9% from the start to the end of training. In a head-to-head comparison with state-of-the-art image attacks, our approach enables an adversary to generate adversarial samples with 17% more success on unseen inputs post-training. From a security perspective, this work demonstrates a powerful new attack vector that uses RL to train agents that attack ML models efficiently and at scale.

2511.05615 2026-06-05 cs.LG cs.AI cs.AR physics.ins-det (version update)

wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation

wa-hls4ml: 一个用于hls4ml资源和延迟估计的基准及替代模型

Benjamin Hawks, Jason Weitz, Dmitri Demler, Karla Tame-Narvaez, Dennis Plotnikov, Mohammad Mehdi Rahimifar, Hamza Ezzaoui Rahali, Audrey C. Therrien, Donovan Sproule, Elham E Khoda, Keegan A. Smith, Russell Marroquin, Giuseppe Di Guglielmo, Nhan Tran, Javier Duarte, Vladimir Loncar

Affiliations: Fermi National Accelerator Laboratory; University of California San Diego; Johns Hopkins University; University of Sherbrooke; Columbia University; Texas A&M University; European Organization for Nuclear Research (CERN)

AI summary: This paper presents wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, and introduces GNN- and transformer-based surrogate models that predict the latency and resource usage of ML accelerators.

Comments 30 pages, 18 figures

DOI: 10.1145/3787490
Journal ref: Wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation. ACM Trans. Reconfigurable Technol. Syst. 19, 2, Article 20 (June 2026), 29 pages

AI中文摘要

随着机器学习（ML）越来越多地在硬件中实现以解决科学应用中的实时挑战，先进的工具链开发显著减少了各种设计迭代所需的时间。这些进步已经解决了主要障碍，但也暴露了新的挑战。例如，以前未被考虑的瓶颈过程，如硬件综合，现在成为设计快速迭代的限制因素。为缓解这些新兴约束，已经开展了多项努力，以开发基于ML的替代模型，以估计ML加速器架构的资源使用情况。我们介绍了wa-hls4ml，这是一个用于ML加速器资源和延迟估计的基准，以及其对应的初始数据集，包含超过680,000个全连接和卷积神经网络，均使用hls4ml合成并针对Xilinx FPGA。该基准评估了资源和延迟预测器在几种常见ML模型架构上的性能，这些架构主要来自科学领域，作为示例模型，并评估了数据集子集的平均性能。此外，我们还介绍了基于图神经网络和Transformer的替代模型，用于预测ML加速器的延迟和资源。我们展示了这些模型的架构和性能，并发现这些模型通常在合成测试数据集上对75百分位数的延迟和资源预测误差在几个百分点以内。

英文摘要

As machine learning (ML) is increasingly implemented in hardware to address real-time challenges in scientific applications, the development of advanced toolchains has significantly reduced the time required to iterate on various designs. These advancements have solved major obstacles, but also exposed new challenges. For example, processes that were not previously considered bottlenecks, such as hardware synthesis, are becoming limiting factors in the rapid iteration of designs. To mitigate these emerging constraints, multiple efforts have been undertaken to develop an ML-based surrogate model that estimates resource usage of ML accelerator architectures. We introduce wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, and its corresponding initial dataset of over 680,000 fully connected and convolutional neural networks, all synthesized using hls4ml and targeting Xilinx FPGAs. The benchmark evaluates the performance of resource and latency predictors against several common ML model architectures, primarily originating from scientific domains, as exemplar models, and the average performance across a subset of the dataset. Additionally, we introduce GNN- and transformer-based surrogate models that predict latency and resources for ML accelerators. We present the architecture and performance of the models and find that the models generally predict latency and resources for the 75th percentile within several percent of the synthesized resources on the synthetic test dataset.

2410.02628 2026-06-05 cs.LG cs.AI (version update)

Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization

逆熵最优运输通过数据似然最大化解决半监督学习

Mikhail Persiianov, Arip Asadulaev, Nikita Andreev, Nikita Starodubcev, Dmitry Baranchuk, Anastasis Kratsios, Evgeny Burnaev, Alexander Korotin

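Affiliations: Institute for Advanced Study; National Research Council Canada; University of Toronto; St. Petersburg State University; Skolkovo Institute of Science and Technology; Kazan Federal University

AI summary: This paper proposes EBiEOT, a new learning paradigm that uses data likelihood maximization to integrate paired and unpaired data, addressing the difficulty of acquiring paired data in semi-supervised learning, and proves that the approach can theoretically recover the true conditional distributions with arbitrarily small error.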

AI中文摘要

学习条件分布π*(⋅|x)是机器学习中的核心问题，通常通过监督方法利用配对数据(x,y)∼π*进行学习。然而，获取配对数据样本往往具有挑战性，尤其是在领域翻译等问题中。这需要开发能够利用有限配对数据和额外非配对i.i.d.样本x∼π*_x和y∼π*_y的半监督模型。使用此类结合数据复杂且常依赖启发式方法。为此，我们提出了一种新的学习范式称为EBiEOT，利用数据似然最大化技术无缝整合配对和非配对数据。我们证明了该方法与逆熵最优运输(OT)有奇妙的联系。这一发现使我们能够应用最近的计算OT进展，建立一个端到端的学习算法来获得π*(⋅|x)。此外，我们推导了通用逼近性质，证明该方法在理论上可以以任意小的误差恢复真实条件分布。最后，我们通过实验证明，我们的方法能够同时利用配对和非配对数据有效学习条件分布。EBiEOT的代码可在https://github.com/MuXauJl11110/EBiEOT上获得。

英文摘要

Learning conditional distributions $\pi^*(\cdot|x)$ is a central problem in machine learning, which is typically approached via supervised methods with paired data $(x,y) \sim \pi^*$. However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development of semi-supervised models that utilize both limited paired data and additional unpaired i.i.d. samples $x \sim \pi^*_x$ and $y \sim \pi^*_y$ from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm called EBiEOT that integrates both paired and unpaired data seamlessly using data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding allows us to apply recent advances in computational OT to establish an end-to-end learning algorithm to get $\pi^*(\cdot|x)$. In addition, we derive the universal approximation property, demonstrating that our approach can theoretically recover true conditional distributions with arbitrarily small error. Finally, we demonstrate through empirical tests that our method effectively learns conditional distributions using paired and unpaired data simultaneously. The code of EBiEOT is available at https://github.com/MuXauJl11110/EBiEOT.

2510.11974 2026-06-05 cs.CR cs.AI (version update)

CTIConnect: A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence

CTIConnect：一种用于异构网络威胁情报的检索增强大语言模型基准

Yutong Cheng, Yang Liu, Changze Li, Dawn Song, Peng Gao

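Affiliations: Virginia Tech Department of Computer Science; University of California, Berkeley Department of Computer Science

AI summary: This paper presents CTIConnect, a benchmark for evaluating retrieval-augmented LLMs on cyber threat intelligence tasks. It integrates five heterogeneous data sources into 1,860 expert-verified QA pairs, shows that the cross-source semantic gap, the required retrieval strategy, and the performance bottleneck all vary across task categories, and demonstrates that domain-specific strategies improve performance.

Comments Accepted to KDD 2026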

AI中文摘要

网络威胁情报（CTI）是现代网络安全的基础，使组织能够主动防御不断演变的威胁。然而，CTI数据的规模和异质性，从结构化知识库（CVE、CWE、CAPEC、MITRE ATT&CK）和非结构化威胁报告，远远超出了手动分析的能力。大型语言模型（LLMs）强大的上下文理解和推理能力推动了其在CTI任务中的应用。然而，现有的基准评估在检索增强设置中缺乏适当的评估框架，无法访问分析师在实践中依赖的异构领域知识源。为此，我们提出了CTIConnect，一种系统评估检索增强型LLMs在CTI任务领域的基准。我们构建了一个统一的评估环境，整合了五个异构CTI数据源，构建了1860个专家验证的问答对，涵盖实体链接、多文档综合和实体归属三个类别共九项任务。对十种最先进的LLMs进行了大量实验，发现跨源语义差距在不同任务类别中表现不同，需要根本不同的检索策略，并且性能瓶颈在检索基础设施和证据利用之间切换。我们的领域特定策略进一步优于更强的一般检索范式（检索后重排、IRCoT），表明缩小这一差距需要结构干预而非通用检索改进。这些发现在所有十种LLMs上均成立，保持在完整基准上的一致性，并在2008-2025时间分割下保持稳定。共同，它们为设计可扩展的异构CTI生态系统检索架构提供了可操作的指导。

英文摘要

Cyber Threat Intelligence (CTI) is foundational to modern cybersecurity, enabling organizations to proactively defend against evolving threats. However, the sheer volume and heterogeneity of CTI data, spanning structured knowledge bases (CVE, CWE, CAPEC, MITRE ATT&CK) and unstructured threat reports, far exceed the capacity of manual analysis. The strong contextual understanding and reasoning of Large Language Models (LLMs) have driven growing interest in applying them to CTI tasks. Yet no existing benchmark evaluates LLMs in a retrieval-augmented setting with a proper evaluation harness that grants access to the heterogeneous domain knowledge sources analysts rely on in practice. To address this gap, we present CTIConnect, a benchmark for systematically evaluating retrieval-augmented LLMs across the CTI task landscape. We construct a unified evaluation environment integrating five heterogeneous CTI sources into 1,860 expert-verified QA pairs spanning nine tasks across three categories: Entity Linking, Multi-Document Synthesis, and Entity Attribution. Extensive experiments on ten state-of-the-art LLMs reveal that the cross-source semantic gap manifests differently across task categories, demanding fundamentally different retrieval strategies, and that the performance bottleneck shifts between retrieval infrastructure and evidence utilization depending on the task. Our domain-specific strategies further outperform stronger general-purpose retrieval paradigms (retrieve-then-rerank, IRCoT), showing that closing this gap requires structural interventions rather than generic retrieval improvements. These findings hold across all ten LLMs, remain consistent on the full benchmark, and stay stable under temporal splits spanning 2008-2025. Together, they provide actionable guidance for designing scalable retrieval architectures over heterogeneous CTI ecosystems.

2510.05709 2026-06-05 cs.CR cs.AI cs.CL (version update)

Correcting Prompt Dependence in LLM Benchmarks: A Bayesian Hierarchical Model with Embedding-Space Clustering

纠正大语言模型基准测试中的提示依赖：一种具有嵌入空间聚类的贝叶斯分层模型

Mary Llewellyn, Isobel Thornton, James Bishop, Annie Gray

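Affiliations: University of Cambridge

AI summary: This paper proposes a Bayesian hierarchical model that corrects prompt dependence in LLM benchmarks through embedding-space clustering, provides more robust performance metrics when data are limited, and reports a marked improvement in performance metrics on adversarial robustness benchmarks.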

Comments Accepted to the 1st Workshop on Combining Theory and Benchmarks, CTB@ICML 2026, Seoul, South Korea

2504.10020 2026-06-05 cs.CL cs.AI cs.CV (version update)

The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?

性能提升的幻象：为何对比解码无法减轻多模态大语言模型中的对象幻觉？

Hao Yin, Guangzong Si, Zilei Wang

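Affiliations: University of Science and Technology of China; Eastern Institute of Technology, Ningbo

AI summary: This paper examines whether contrastive decoding mitigates object hallucinations in multimodal large language models (MLLMs), finds that the reported performance gains stem mainly from two misleading factors, and challenges the assumed effectiveness of contrastive decoding strategies.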

AI中文摘要

对比解码策略被广泛用于减少多模态大语言模型（MLLMs）中的对象幻觉。这些方法通过构建对比样本来诱导幻觉，然后在输出分布中抑制它们。然而，本文证明此类方法无法有效缓解幻觉问题。在POPE基准测试中观察到的性能提升主要由两个误导性因素驱动：（1）对模型输出分布的粗略、单向调整；（2）自适应可能性约束，将采样策略简化为贪婪搜索。为进一步说明这些问题，我们引入了一系列虚假改进方法，并将其性能与对比解码技术进行评估。实验结果揭示了对比解码中观察到的性能提升与其缓解幻觉的初衷无关。我们的发现挑战了对比解码策略有效性的常见假设，并为开发真正有效的MLLMs幻觉解决方案铺平了道路。

英文摘要

Contrastive decoding strategies are widely used to reduce object hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on the POPE benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model's output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.

2509.25450 2026-06-05 cs.CE cs.AI cs.NA math.NA physics.comp-ph (version update)

Multi-patch isogeometric neural solver for partial differential equations on computer-aided design domains

多补丁等几何神经求解器用于计算机辅助设计域上的偏微分方程

Moritz von Tresckow, Ion Gabriel Ion, Dimitrios Loukrezis

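Affiliations: Institute for Accelerator Science and Electromagnetic Fields, Technische Universität Darmstadt; Terra Quantum AG; Scientific Computing, Centrum Wiskunde & Informatica

AI summary: This paper proposes a computational framework that combines physics-informed neural networks with multi-patch isogeometric analysis to solve partial differential equations on complex computer-aided design geometries. Patch-local neural networks operate on the reference domain of isogeometric analysis, and a custom output layer strongly imposes Dirichlet boundary conditions. Dedicated interface neural networks enforce solution conformity across interfaces between non-uniform rational B-spline patches. Training minimizes the energy functional derived from the weak form of the partial differential equation within a variational framework. The method is validated on two non-trivial, practically relevant use cases: a 2D magnetostatics model of a quadrupole magnet and a 3D nonlinear solid and contact mechanics model of a mechanical holder. The results agree closely with reference solutions from high-fidelity finite element solvers, showing the neural solver's potential for complex engineering problems.

Comments 33 pages, 15 figures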

DOI: 10.1007/s00366-026-02351-z

AI中文摘要

本工作开发了一种计算框架，结合物理感知神经网络与多补丁等几何分析，用于解决复杂计算机辅助设计几何上的偏微分方程。该方法利用补丁局部神经网络在等几何分析的参考域上操作。定制的输出层使强加狄利克雷边界条件。通过专用的界面神经网络确保非均匀有理B样条补丁之间界面的解一致性。通过变分框架最小化偏微分方程弱形式导出的能量函数进行训练。该方法的有效性在两个高度非平凡且实际相关的应用案例中得到验证，即四极磁铁的2D磁静力学模型和机械夹具的3D非线性固体力学与接触力学模型。结果与高保真有限元求解器获得的参考解高度一致，从而突显了该神经求解器在处理复杂工程问题方面的潜力，鉴于相应的计算机辅助设计模型。

英文摘要

This work develops a computational framework that combines physics-informed neural networks with multi-patch isogeometric analysis to solve partial differential equations on complex computer-aided design geometries. The method utilizes patch-local neural networks that operate on the reference domain of isogeometric analysis. A custom output layer enables the strong imposition of Dirichlet boundary conditions. Solution conformity across interfaces between non-uniform rational B-spline patches is enforced using dedicated interface neural networks. Training is performed within the variational framework by minimizing the energy functional derived from the weak form of the partial differential equation. The effectiveness of the suggested method is demonstrated on two highly non-trivial and practically relevant use cases, namely, a 2D magnetostatics model of a quadrupole magnet and a 3D nonlinear solid and contact mechanics model of a mechanical holder. The results show excellent agreement with reference solutions obtained with high-fidelity finite element solvers, thus highlighting the potential of the suggested neural solver to tackle complex engineering problems given the corresponding computer-aided design models.

2509.25397 2026-06-05 cs.SE cs.AI cs.LG (version update)

A Cartography of Open Collaboration in Open Source AI: Mapping Practices, Motivations, and Governance in 14 Open Large Language Model Projects

开源人工智能中开放协作的图谱：映射14个开源大语言模型项目的实践、动机与治理

Johan Linåker, Cailean Osborne, Jennifer Ding, Ben Burtenshaw

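Affiliations: RISE Research Institutes of Sweden AB; University of Oxford

AI summary: Analyzing open collaboration across the development and reuse lifecycle of 14 open large language model projects, this paper reveals the diversity of collaboration practices, motivations, and governance structures, and concludes that openness in open source AI is not a uniform property but an emergent outcome of how collaboration is organised across interconnected artefact domains, lifecycle stages, and institutional contexts.

Comments In submission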

AI中文摘要

开源大语言模型（LLMs）的普及正在推动人工智能（AI）领域形成一个活跃的生态系统。然而，开发开源LLMs所使用的协作方法，在其公开发布前后仍未被系统研究，这限制了我们对开源LLM项目如何启动、组织和治理的理解，以及进一步促进这一生态系统的机会。我们通过探索性分析开源LLMs的开发与再利用生命周期中的开放协作，基于对14个不同开源LLM项目开发者的半结构化访谈。这些协作跨越多个制品领域——包括模型、数据、软件、评估、计算和社区参与——每个领域都使不同的参与形式成为可能，并涉及不同的利益相关者，这种参与在LLM开发生命周期中不断演变，从早期的集中、选择性参与转变为模型发布后的广泛、分散参与。开源LLM开发者受多种社会、经济和技术动机驱动，从民主化AI访问和促进开放科学到构建区域生态系统和扩展语言代表性。这些动态通过一系列治理结构协调，通常在不同程度上正式和专业化，从以公司为中心的集中努力到去中心化的基层倡议。我们通过一个概念模型综合了我们的发现，提供了实践建议，并得出结论：开源AI的开放性并非单一属性，而是协作在互联制品领域、生命周期阶段和制度背景下的组织方式的涌现结果。

英文摘要

The proliferation of open large language models (LLMs) is fostering a vibrant ecosystem in artificial intelligence (AI). However, the methods of collaboration used to develop open LLMs, both before and after their public release, have not yet been systematically studied, limiting our understanding of how open LLM projects are initiated, organised, and governed, as well as the opportunities to further foster this ecosystem. We address this gap through an exploratory analysis of open collaboration throughout the development and reuse lifecycle of open LLMs, drawing on semi-structured interviews with the developers of 14 diverse open LLM projects. These collaborations span multiple artefact domains -- including models, data, software, evaluation, compute, and community engagement -- each enabling distinct forms of participation and involving different stakeholders. Participation evolves across the LLM development lifecycle, shifting from concentrated, selective engagement in the early stages to broader, distributed participation after model release. The open LLM developers are driven by a variety of social, economic, and technological motivations, ranging from democratising access to AI and promoting open science to building regional ecosystems and expanding language representation. These dynamics are coordinated through a range of governance structures, typically formal and professionalised to varying degrees, ranging from centralised company-led efforts to decentralised grassroots initiatives. We synthesise our findings in a conceptual model of open collaboration in open LLM ecosystems, provide recommendations for practice, and conclude that openness in open source AI is not a uniform property but an emergent outcome of how collaboration is organised across interconnected artefact domains, lifecycle stages, and institutional contexts.

2504.10823 2026-06-05 cs.CL cs.AI (version update)

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

CLASH：从多个视角评估语言模型在高风险困境中的判断

Ayoung Lee, Ryan Sungmo Kwon, Peter Railton, Lu Wang

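Affiliations: Department of Computer Science and Engineering; Department of Philosophy; University of Michigan, Ann Arbor

AI summary: This paper introduces CLASH, a dataset for studying value-based decision-making, and finds that language models struggle with ambivalent decisions and with perspectives involving value shifts, even though they predict psychological discomfort reasonably well.

Comments Published as a conference paper at ICLR 2026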

AI中文摘要

在高风险领域，涉及冲突价值的困境对人类都极具挑战性，更不用说AI了。然而，先前的研究仅限于日常场景。为弥补这一差距，我们引入了CLASH（基于角色视角的LLM在高风险情境中的评估），该数据集包含345个高影响困境及3,795个不同价值观的个体视角。CLASH使研究者能够探讨关键但尚未被深入研究的价值决策过程方面，包括对决策矛盾和心理不适的理解以及角色视角中价值观的时间变化。通过基准测试14个非思考和思考模型，我们揭示了几个关键发现：（1）即使强大的专有模型，如GPT-5和Claude-4-Sonnet，也难以处理矛盾决策，仅达到24.06和51.01的准确率。（2）尽管LLMs能合理预测心理不适，但它们在涉及价值变化的视角中并不充分理解。（3）在数学解题和游戏策略领域有效的认知行为无法转移到价值推理中。相反，新的失败模式出现，包括早期承诺和过度承诺。（4）LLMs对特定价值的可引导性与其价值偏好显著相关。（5）最后，当从第三方视角推理时，LLMs表现出更高的可引导性，尽管某些价值（如安全）独特地受益于第一人称框架。

英文摘要

Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. CLASH enables the study of critical yet underexplored aspects of value-based decision-making processes, including understanding of decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in the perspectives of characters. By benchmarking 14 non-thinking and thinking models, we uncover several key findings. (1) Even strong proprietary models, such as GPT-5 and Claude-4-Sonnet, struggle with ambivalent decisions, achieving only 24.06 and 51.01 accuracy. (2) Although LLMs reasonably predict psychological discomfort, they do not adequately comprehend perspectives involving value shifts. (3) Cognitive behaviors that are effective in the math-solving and game strategy domains do not transfer to value reasoning. Instead, new failure patterns emerge, including early commitment and overcommitment. (4) The steerability of LLMs towards a given value is significantly correlated with their value preferences. (5) Finally, LLMs exhibit greater steerability when reasoning from a third-party perspective, although certain values (e.g., safety) benefit uniquely from first-person framing.

2509.20324 2026-06-05 cs.CR cs.AI (version update)

RAG Security and Privacy: Formalizing the Threat Model and Attack Surface

RAG安全与隐私：形式化威胁模型和攻击面

Atousa Arzanipour, Rouzbeh Behnia, Reza Ebrahimi, Kaushik Dutta

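Affiliations: University of California, Berkeley

AI summary: This paper studies security and privacy in RAG systems, proposes the first formal threat model for them, and formally defines attack vectors such as document-level membership inference and data poisoning to support a more rigorous understanding of RAG privacy and security.

Comments Published at the 5th ICDM Workshop in November 2025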

DOI: 10.1109/ICDMW69685.2025.00165
Journal ref: 2025 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 1387-1394, 2025

AI中文摘要

检索增强生成（RAG）是一种新兴的自然语言处理方法，结合大型语言模型（LLMs）与外部文档检索以生成更准确和基于事实的响应。尽管RAG在减少幻觉和提高事实一致性方面表现出色，但其也引入了与传统LLMs不同的隐私和安全挑战。现有研究表明，LLMs可通过训练数据记忆或对抗性提示泄露敏感信息，而RAG系统继承了许多这些漏洞。同时，RAG依赖外部知识库打开了新的攻击面，包括可能泄露检索文档的存在或内容信息，或注入恶意内容以操控模型行为。尽管存在这些风险，目前尚无正式框架定义RAG系统的威胁景观。本文通过提出首个形式化的RAG威胁模型，填补了文献中的关键空白。我们引入了基于对模型组件和数据访问的对手类型的结构化分类，并正式定义了关键威胁向量，如文档级成员推断和数据中毒，这些向量在实际部署中对隐私和完整性构成严重风险。通过建立正式定义和攻击模型，本文为更严谨和原则性的理解RAG系统的隐私和安全奠定了基础。

英文摘要

Retrieval-Augmented Generation (RAG) is an emerging approach in natural language processing that combines large language models (LLMs) with external document retrieval to produce more accurate and grounded responses. While RAG has shown strong potential in reducing hallucinations and improving factual consistency, it also introduces new privacy and security challenges that differ from those faced by traditional LLMs. Existing research has demonstrated that LLMs can leak sensitive information through training data memorization or adversarial prompts, and RAG systems inherit many of these vulnerabilities. At the same time, RAG's reliance on an external knowledge base opens new attack surfaces, including the potential for leaking information about the presence or content of retrieved documents, or for injecting malicious content to manipulate model behavior. Despite these risks, there is currently no formal framework that defines the threat landscape for RAG systems. In this paper, we address a critical gap in the literature by proposing, to the best of our knowledge, the first formal threat model for RAG systems. We introduce a structured taxonomy of adversary types based on their access to model components and data, and we formally define key threat vectors such as document-level membership inference and data poisoning, which pose serious privacy and integrity risks in real-world deployments. By establishing formal definitions and attack models, our work lays the foundation for a more rigorous and principled understanding of privacy and security in RAG systems.

2307.05284 2026-06-05 cs.LG cs.AI 版本更新

Rethinking Distribution Shifts: Empirical Analysis and Modeling for Tabular Data

重新思考分布偏移：针对表格数据的经验分析与建模

Tianyu Wang, Jiashuo Liu, Peng Cui, Hongseok Namkoong

发表机构 * Department of Industrial Engineering and Operations Research（工业工程与运筹学系）； Department of Computer Science and Technology（计算机科学与技术系）； Decision, Risk, and Operations Division（决策、风险与运营部）； Columbia University（哥伦比亚大学）； Tsinghua University（清华大学）

AI总结本文通过经验分析和建模，重新审视分布偏移问题，发现Y|X偏移在表格数据中最为常见，与机器学习文献中对X（协变量）偏移的重视形成鲜明对比，并指出鲁棒算法的性能并不优于普通方法。

Comments Forthcoming at Management Science. Conference version appeared in NeurIPS 2023, previously titled "On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets"

AI中文摘要

不同的分布偏移需要不同的干预措施，算法必须基于其解决的具体偏移类型来构建。然而，稳健算法的方法学发展通常依赖于缺乏实证验证的结构性假设。本文倡导一种以实证为基础的数据驱动方法来开发算法，构建了一个包含8个表格数据集中的自然偏移、172个分布对、45种方法和90,000种方法配置的实证测试平台，涵盖了经验风险最小化和分布鲁棒优化（DRO）方法。我们发现Y|X偏移在我们的测试平台中最为普遍，这与机器学习文献中对X（协变量）偏移的高度重视形成鲜明对比，并且稳健算法的性能并不优于普通方法。为了理解原因，我们深入分析了DRO方法，发现被忽视的实现细节——如底层模型类（例如LightGBM）的选择和超参数选择——对性能的影响比模糊集或其半径更大。通过案例研究，我们展示了如何通过数据驱动的归纳理解分布偏移，提供了一种新的算法开发方法。

英文摘要

Different distribution shifts require different interventions, and algorithms must be grounded in the specific shifts they address. However, methodological development for robust algorithms typically relies on structural assumptions that lack empirical validation. Advocating for an empirically grounded data-driven approach to algorithm development, we build an empirical testbed comprising natural shifts across 8 tabular datasets, 172 distribution pairs over 45 methods and 90,000 method configurations encompassing empirical risk minimization and distributionally robust optimization (DRO) methods. We find $Y|X$-shifts are most prevalent in our testbed, in stark contrast to the heavy focus on $X$ (covariate)-shifts in the ML literature, and that the performance of robust algorithms is no better than that of vanilla methods. To understand why, we conduct an in-depth empirical analysis of DRO methods and find that underlooked implementation details -- such as the choice of underlying model class (e.g., LightGBM) and hyperparameter selection -- have a bigger impact on performance than the ambiguity set or its radius. We illustrate via case studies how a data-driven, inductive understanding of distribution shifts can provide a new approach to algorithm development.

2507.06219 2026-06-05 cs.RO cs.AI cs.LG 版本更新

Is Diversity All You Need for Scalable Robotic Manipulation?

多样性是否是可扩展机器人操作的全部需求？

Modi Shi, Li Chen, Jin Chen, Yuxiang Lu, Chiming Liu, Guanghui Ren, Ping Luo, Di Huang, Maoqing Yao, Hongyang Li

发表机构 * University of Science and Technology of China（中国科学技术大学）； Tsinghua University（清华大学）； University of California, Berkeley（加州大学伯克利分校）

AI总结本文研究了数据多样性在机器人学习中的作用，发现任务多样性比单任务演示数量更重要，多本体预训练数据对跨本体迁移并非必需，专家多样性可能干扰策略学习，并提出分布去偏方法以提升性能。

Comments Code is available at https://github.com/OpenDriveLab/AgiBot-World

AI中文摘要

数据扩展在自然语言处理（NLP）和计算机视觉（CV）的基础模型中取得了显著成功，但机器人操作中有效数据扩展的原则仍未得到充分理解。本文从三个关键维度考察数据多样性在机器人学习中的细微作用：任务（做什么）、本体（使用哪种机器人）和专家（谁来演示），并对“越多样越好”的传统直觉提出质疑。通过在多种机器人平台上的大量实验，我们发现：（1）任务多样性比单任务演示数量更关键，有助于从多样化的预训练任务迁移到新的下游场景；（2）多本体预训练数据对跨本体迁移并非必需，在高质量单本体数据上训练的模型可以高效迁移到不同平台，并在微调过程中表现出比多本体预训练模型更理想的扩展特性；（3）专家多样性源于个体操作偏好和人类演示中的随机变化，可能干扰策略学习，其中速度多模态是一个关键因素。基于这一洞察，我们提出了一种分布去偏方法以缓解速度歧义，由此得到的GO-1-Pro实现了15%的显著性能提升，相当于使用2.5倍的预训练数据。这些发现提供了新的视角，并为如何有效扩展机器人操作数据集提供了实用指导。

英文摘要

Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates), challenging the conventional intuition of "more diverse is better". Through extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer: models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing more desirable scaling properties during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity; the resulting GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.

2505.02540 2026-06-05 cs.LG cs.AI 版本更新

Lazy But Effective: Collaborative Personalized Federated Learning with Heterogeneous Data

懒惰但有效：基于异构数据的协同个性化联邦学习

Ljubomir Rokvic, Panayiotis Danassis, Boi Faltings

发表机构 * Artificial Intelligence Laboratory EPFL（洛桑联邦理工学院人工智能实验室）； Telenor Research（Telenor研究）

AI总结本文提出了一种简单有效的个性化联邦学习框架pFedLIA，通过使用计算效率高的影响近似方法“Lazy Influence”，以分布式方式对客户端进行聚类，从而在模型聚合前协同训练模型以捕捉客户端特定的数据模式，实验证明其在非iid数据集上能有效恢复全局模型性能，并在多个基准任务中优于现有基线方法。

Comments Accepted at the International Joint Conference on Neural Networks (IJCNN), IEEE, 2025

DOI: 10.1109/IJCNN64981.2025.11228646

AI中文摘要

Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard

发表机构 * Université Grenoble Alpes, CNRS, Grenoble INP, LIG（格勒诺布尔阿尔卑斯大学、法国国家科学研究中心、格勒诺布尔理工学院、格勒诺布尔信息学实验室）

AI总结本文探讨了在机制可解释性（MI）框架下，给定行为是否存在唯一解释的问题，借鉴统计学中的可识别性概念分析MI解释的可识别性，并在布尔函数和小型多层感知机上检验两种主要MI策略，发现解释存在系统性的不可识别性。

Journal ref: The Thirteenth International Conference on Learning Representations (ICLR 2025)

AI中文摘要

随着AI系统应用于高风险领域，确保可解释性至关重要。机制可解释性（MI）旨在对神经网络进行逆向工程，提取人类可理解的算法来解释其行为。本文探讨了一个关键问题：对于给定的行为，在MI的标准下，是否存在唯一的解释？借鉴统计学中的可识别性（即参数在特定假设下可以被唯一推断），我们探索了MI解释的可识别性。我们识别出两种主要的MI策略：（1）“where-then-what”，先隔离出能复现模型行为的电路，再对其进行解释；（2）“what-then-where”，从候选算法出发，利用因果对齐搜索实现这些算法的神经激活子空间。我们在布尔函数和小型多层感知机上测试了这两种策略，并完全枚举了候选解释。实验揭示了系统性的不可识别性：多个电路可以复现同一行为，一个电路可以有多种解释，多个算法可以与网络对齐，一个算法可以与不同的子空间对齐。唯一性是否必要？一种务实的方法可能只需要预测性和可操控性标准。如果唯一性对理解至关重要，则可能需要更严格的标准。我们还参考了内部可解释性框架，该框架通过多种标准验证解释。本文为定义AI中的解释标准做出了贡献。

英文摘要

As AI systems are used in high-stakes applications, ensuring interpretability is crucial. Mechanistic Interpretability (MI) aims to reverse-engineer neural networks by extracting human-understandable algorithms to explain their behavior. This work examines a key question: for a given behavior, and under MI's criteria, does a unique explanation exist? Drawing on identifiability in statistics, where parameters are uniquely inferred under specific assumptions, we explore the identifiability of MI explanations. We identify two main MI strategies: (1) "where-then-what," which isolates a circuit replicating model behavior before interpreting it, and (2) "what-then-where," which starts with candidate algorithms and searches for neural activation subspaces implementing them, using causal alignment. We test both strategies on Boolean functions and small multi-layer perceptrons, fully enumerating candidate explanations. Our experiments reveal systematic non-identifiability: multiple circuits can replicate behavior, a circuit can have multiple interpretations, several algorithms can align with the network, and one algorithm can align with different subspaces. Is uniqueness necessary? A pragmatic approach may require only predictive and manipulability standards. If uniqueness is essential for understanding, stricter criteria may be needed. We also reference the inner interpretability framework, which validates explanations through multiple criteria. This work contributes to defining explanation standards in AI.
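
The claim that "multiple circuits can replicate behavior" can be illustrated with a toy enumeration. The sketch below is not the authors' experimental setup: the gate set, the fixed two-layer circuit shape, and the names `GATES` and `circuits_matching` are assumptions chosen only to show how exhaustive enumeration can surface several structurally different circuits for one Boolean behavior (XOR).

```python
from itertools import product

# Hypothetical gate library for a toy two-input "network".
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
}

INPUTS = list(product([0, 1], repeat=2))
TARGET = tuple(a ^ b for a, b in INPUTS)  # behavior to explain: XOR


def circuits_matching(target):
    """Enumerate every circuit of the form top(left(a, b), right(a, b))
    and keep those whose truth table equals the target behavior."""
    matches = []
    for top, left, right in product(GATES, repeat=3):
        outputs = tuple(
            GATES[top](GATES[left](a, b), GATES[right](a, b)) for a, b in INPUTS
        )
        if outputs == target:
            matches.append((top, left, right))
    return matches


xor_circuits = circuits_matching(TARGET)
print(len(xor_circuits), "distinct circuits reproduce XOR, for example:")
for circuit in xor_circuits[:3]:
    print("  top=%s left=%s right=%s" % circuit)
```

Even in this tiny search space, more than one gate assignment matches the XOR truth table (for instance AND over OR and NAND, or NOR over AND and NOR), so behavior alone does not pin down a single circuit.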

2410.13056 2026-06-05 cs.CL cs.AI 版本更新

Channel-Wise Mixed-Precision Quantization for Large Language Models

面向大语言模型的通道级混合精度量化

Zihan Chen, Bike Xie, Jundong Li, Cong Shen

发表机构 * Department of Electrical and Computer Engineering, University of Virginia（弗吉尼亚大学电气与计算机工程系）； Kneron Inc.（耐能智慧）

AI总结本文提出通道级混合精度量化（CMPQ），通过根据激活分布分配不同精度级别来优化大语言模型的量化过程，从而在低比特范围内实现任意平均比特宽度，并在内存使用增加有限的情况下提升性能。

AI中文摘要

大型语言模型（LLMs）在多种语言任务上表现出色，但由于参数规模庞大、内存需求高，其在边缘设备上的部署仍面临挑战。仅权重量化为降低LLM的内存占用提供了一种有前景的解决方案。然而，现有方法主要集中在整数比特量化上，限制了它们对分数比特量化任务的适应性，也无法充分利用设备上的可用存储空间。在本文中，我们引入了通道级混合精度量化（CMPQ），一种新颖的混合精度量化方法，根据激活分布以通道为单位分配量化精度。通过为不同的权重通道分配不同的精度级别，CMPQ支持低比特范围（例如2到4比特之间）内的任意平均比特宽度。CMPQ采用非均匀量化策略，并结合两种异常值提取技术，共同保留关键信息，从而最小化量化损失。在九种不同LLM上的实验表明，CMPQ不仅在整数比特量化任务中提升了性能，而且通过混合精度的方式，在内存使用仅适度增加的情况下取得了显著的性能提升。CMPQ是一种自适应且有效的LLM量化方法，可在不同能力的设备上带来显著收益。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ supports arbitrary average bit-widths in the low-bit regime (e.g., between 2 and 4 bits). CMPQ employs a non-uniform quantization strategy and incorporates two outlier extraction techniques that collaboratively preserve the critical information, thereby minimizing the quantization loss. Experiments on nine different LLMs demonstrate that CMPQ not only enhances performance in integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage by performing in a mixed-precision way. CMPQ represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.
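
To make the idea of an "arbitrary average bit-width" concrete, here is a minimal sketch of channel-wise bit allocation. It is not CMPQ itself: the abstract only says precision is assigned per channel based on activation distributions, so the saliency scores, the function name `allocate_bits`, and the promote/demote-by-one-bit rule are all assumptions made for illustration. The sketch only handles targets within one bit of the base precision.

```python
def allocate_bits(saliency, target_avg_bits, base_bits=3):
    """Assign a bit-width to each weight channel so that the mean is
    approximately target_avg_bits.

    saliency: one hypothetical importance score per channel (for example
    a mean absolute activation); higher means more important.
    Assumes |target_avg_bits - base_bits| <= 1.
    """
    n = len(saliency)
    bits = [base_bits] * n
    # Number of channels that must move one bit away from the base precision.
    k = min(n, round(abs(target_avg_bits - base_bits) * n))
    ranked = sorted(range(n), key=lambda i: saliency[i])  # least salient first
    if target_avg_bits > base_bits:
        chosen, step = ranked[n - k:], 1   # promote the most salient channels
    else:
        chosen, step = ranked[:k], -1      # demote the least salient channels
    for i in chosen:
        bits[i] += step
    return bits


saliency = [0.1, 5.0, 0.3, 2.0]  # hypothetical per-channel scores
bits = allocate_bits(saliency, target_avg_bits=3.5)
print(bits, sum(bits) / len(bits))  # [3, 4, 3, 4] 3.5
```

With four channels and a 3.5-bit target, the two most salient channels receive 4 bits and the rest stay at 3, which is how a fractional average arises from integer per-channel precisions.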

2407.10486 2026-06-05 cs.AI cs.CL 版本更新

IDEAL: Leveraging Infinite and Dynamic Characterizations of Large Language Models for Query-focused Summarization

IDEAL: 利用大型语言模型的无限和动态特性进行查询导向的摘要

Jie Cao, Dian Jiao, Yang Dai, Rolan Yan, Wenqiao Zhang, Siliang Tang

发表机构 * Zhejiang University（浙江大学）； Tencent, Wechat（腾讯，微信）

AI总结本文针对查询导向摘要问题，提出两种核心方法：高效细粒度查询-LLM对齐和长文档摘要，通过Query-aware HyperExpert和Query-focused Infini-attention模块实现，实验验证了方法的有效性和通用性。

2412.07583 2026-06-05 cs.CV cs.AI 版本更新

Mobile Video Diffusion

移动视频扩散

Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian

发表机构 * Qualcomm AI Research（高通人工智能研究）

AI总结本文提出了一种移动优化的视频扩散模型MobileVD，通过降低帧分辨率、引入多尺度时间表示和两种新的剪枝方案，显著降低了内存和计算成本，同时在移动设备上实现了高效的视频生成。

DOI: 10.1109/ICCV51701.2025.01808

AI中文摘要

视频扩散模型已实现了出色的现实感和可控性，但受限于高计算需求，限制了其在移动设备上的应用。本文介绍了首个移动优化的视频扩散模型。从Stable Video Diffusion (SVD) 的时空UNet出发，我们通过降低帧分辨率、引入多尺度时间表示以及引入两种新的剪枝方案来减少通道数和时间块数量。此外，我们采用对抗微调将去噪步骤减少到一步。我们的模型，称为MobileVD，在效率上提高了523倍（1817.2 vs. 4.34 TFLOPs），质量略有下降（FVD 149 vs. 171），在Xiaomi-14 Pro上生成14x512x256像素的视频片段仅需1.7秒。我们的结果可在https://qualcomm-ai-research.github.io/mobile-video-diffusion/上查看。

英文摘要

Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schema to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined as MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/

2406.08966 2026-06-05 cs.LG cs.AI 版本更新

Separation Power of Equivariant Neural Networks

等变神经网络的分离能力

Marco Pacini, Xiaowen Dong, Bruno Lepri, Gabriele Santin

发表机构 * University of Trento（特伦托大学）； Fondazione Bruno Kessler（布鲁诺·凯斯勒基金会）； University of Oxford（牛津大学）； University of Venice（威尼斯大学）

AI总结本文研究了等变神经网络的分离能力，分析了架构和超参数对分离能力的影响，发现非多项式激活函数在表达能力上等价，深度在阈值后不再提升分离能力，而隐表示的块分解会影响分离能力。

Comments Published as a conference paper at ICLR 2025

Journal ref: International Conference on Learning Representations (ICLR), 2025

AI中文摘要

机器学习模型的分离能力是指其区分不同输入的能力，常被用作表达能力的代理指标。事实上，了解一个模型族的分离能力是获得细粒度通用性（universality）结果的必要条件。在本文中，我们分析了等变神经网络（如卷积网络和置换不变网络）的分离能力。我们首先完整刻画了由给定架构导出的模型无法区分的输入。基于这些结果，我们推导出可分离性如何受到超参数和架构选择（如激活函数、深度、隐藏层宽度和表示类型）的影响。值得注意的是，所有非多项式激活函数（包括ReLU和sigmoid）在表达能力上是等价的，并能达到最大分离能力。深度在达到某一阈值之前会提升分离能力，此后继续增加深度不再有效果。在隐表示中添加不变特征不影响分离能力。最后，隐表示的块分解会影响可分离性，其中最小分量在分离能力上形成一个层次结构，为比较不同模型的分离能力提供了一种直接的方法。

英文摘要

The separation power of a machine learning model refers to its ability to distinguish between different inputs and is often used as a proxy for its expressivity. Indeed, knowing the separation power of a family of models is a necessary condition to obtain fine-grained universality results. In this paper, we analyze the separation power of equivariant neural networks, such as convolutional and permutation-invariant networks. We first present a complete characterization of inputs indistinguishable by models derived from a given architecture. From these results, we derive how separability is influenced by hyperparameters and architectural choices, such as activation functions, depth, hidden layer width, and representation types. Notably, all non-polynomial activations, including ReLU and sigmoid, are equivalent in expressivity and reach maximum separation power. Depth improves separation power up to a threshold, after which further increases have no effect. Adding invariant features to hidden representations does not impact separation power. Finally, block decomposition of hidden representations affects separability, with minimal components forming a hierarchy in separation power that provides a straightforward method for comparing the separation power of models.

2205.11518 2026-06-05 cs.CR cs.AI cs.LG 版本更新

LIA: Privacy-Preserving Data Quality Evaluation in Federated Learning Using a Lazy Influence Approximation

LIA: 在联邦学习中使用懒惰影响近似进行隐私保护的数据质量评估

Ljubomir Rokvic, Panayiotis Danassis, Sai Praneeth Karimireddy, Boi Faltings

发表机构 * École Polytechnique Fédérale de Lausanne (EPFL)（瑞士联邦理工学院洛桑校区）； Telenor Research（Telenor研究）； University of Southern California（南加州大学）

AI总结本文提出了一种新的隐私保护数据质量评估方法LIA，通过懒惰影响近似技术过滤和评分数据，在保持隐私的前提下有效识别低质量、损坏或恶意数据。

Comments Proceedings of the 2024 IEEE International Conference on Big Data (IEEE BigData 2024). A preliminary version of this work received the Best Paper Award at the International Workshop on Trustworthy Federated Learning at IJCAI (FL-IJCAI) 2023

2403.00965 2026-06-05 stat.AP cs.AI cs.LG 版本更新

Binary Gaussian Copula Synthesis: an LLM-powered data augmentation framework for early dialysis prediction in chronic kidney disease

二元高斯Copula合成：一种基于LLM的数据增强框架，用于慢性肾病早期透析预测

Hamed Khosravi, Milad Khanchi, Mobina Noori, Srinjoy Das, Abdullah Al-Mamun, Imtiaz Ahmed

发表机构 * Department of Industrial & Management Systems Engineering, West Virginia University（西弗吉尼亚大学工业与管理系统工程系）； Department of Electrical and Computer Engineering, Concordia University（康科迪亚大学电气与计算机工程系）； Department of Computer Science, University of California, Davis（加州大学戴维斯分校计算机科学系）； School of Mathematical & Data Sciences, West Virginia University（西弗吉尼亚大学数学与数据科学学院）； School of Systems Science and Industrial Engineering, The State University of New York at Binghamton（纽约州立大学宾汉姆顿分校系统科学与工业工程学院）； H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology（佐治亚理工学院H.米尔顿·斯图尔特工业与系统工程学院）

AI总结本文提出Binary Gaussian Copula Synthesis (BGCS)，一种专为二元临床数据设计的两阶段数据增强方法，通过生成合成少数类样本并过滤不合理的样本，提高了早期透析预测的性能。

AI中文摘要

只有极少数慢性肾病（CKD）患者会进展到透析，这造成了严重的类别不平衡，限制了机器学习模型在早期透析预测中的性能。电子健康记录（EHR）数据的二元结构进一步加剧了这一挑战，而现有的大多数增强方法并非为此类数据设计。我们提出了Binary Gaussian Copula Synthesis (BGCS)，一种专为二元临床数据设计的两阶段数据增强方法。BGCS首先使用高斯Copula框架生成合成的少数类样本，该框架显式建模二元特征之间的成对依赖关系，然后在训练前使用微调的GPT-2分类器过滤掉临床上不合理的样本。我们在一个包含15,169名CKD患者的真实世界EHR数据集上评估了BGCS，该数据集来自西弗吉尼亚州，收集时间为2008年至2022年。我们在四个机器学习分类器上进行了25次独立运行，将其与SMOTE、CTGAN和标准高斯Copula进行了基准比较。BGCS始终优于所有对比方法，在90天透析预测中取得了最高的少数类召回率，不同分类器的中位数范围为0.78至0.87，并且对真实数据的分布保真度最强，各特征的平均p值为0.68。表现最好的BGCS增强模型被集成到一个基于决策树的可解释临床决策支持系统中，用于透析风险分层，其中电解质失衡、心血管合并症和肾脏监测指标是最具影响力的预测特征。这些发现表明，针对二元EHR数据结构特性设计的增强方法可以切实改进早期透析风险预测，并支持开发用于CKD护理的可解释临床决策支持工具。

英文摘要

Only a small fraction of patients with chronic kidney disease (CKD) progress to dialysis, creating severe class imbalance that limits the performance of machine learning models for early dialysis prediction. This challenge is compounded by the binary structure of electronic health record (EHR) data, for which most existing augmentation methods were not designed. We propose Binary Gaussian Copula Synthesis (BGCS), a two-stage data augmentation method tailored to binary clinical data. BGCS first generates synthetic minority-class samples using a Gaussian copula framework that explicitly models pairwise dependencies among binary features, then applies a fine-tuned GPT-2 classifier to filter out clinically implausible samples before training. We evaluated BGCS on a real-world EHR dataset of 15,169 patients with CKD from West Virginia collected between 2008 and 2022, benchmarking it against SMOTE, CTGAN, and standard Gaussian Copula across four machine learning classifiers over 25 independent runs. BGCS consistently outperformed all comparison methods, achieving the highest minority-class recall for 90-day dialysis prediction, with median values ranging from 0.78 to 0.87 across classifiers, and the strongest distributional fidelity to real data, with a mean p-value of 0.68 across features. The best-performing BGCS-augmented model was integrated into an interpretable decision tree-based clinical decision support system for dialysis risk stratification, with electrolyte imbalances, cardiovascular comorbidities, and renal monitoring indicators emerging as the most influential predictive features. These findings suggest that augmentation methods designed for the structural properties of binary EHR data can meaningfully improve early dialysis risk prediction and support the development of interpretable clinical decision-support tools for CKD care.
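
The first stage described above, sampling binary features through a Gaussian copula, can be sketched for the two-feature case using only the Python standard library. This is an illustration under stated assumptions and not the BGCS implementation: the function name `sample_binary_copula`, the marginal probabilities, and the latent correlation `rho` are hypothetical, and the GPT-2 filtering stage is omitted entirely.

```python
import random
from statistics import NormalDist


def sample_binary_copula(p1, p2, rho, n, seed=0):
    """Draw n pairs of dependent binary features via a Gaussian copula.

    p1, p2: marginal probabilities that each feature equals 1 (0 < p < 1).
    rho: correlation of the latent standard normals, not of the binary
    features themselves.
    """
    rng = random.Random(seed)
    std = NormalDist()
    # A feature is 1 when its latent normal falls below this threshold,
    # which makes P(feature = 1) equal to the requested marginal.
    t1, t2 = std.inv_cdf(p1), std.inv_cdf(p2)
    samples = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # z2 is standard normal with correlation rho to z1.
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        samples.append((int(z1 < t1), int(z2 < t2)))
    return samples


# Hypothetical marginals for two binary EHR indicators.
synthetic = sample_binary_copula(p1=0.2, p2=0.3, rho=0.8, n=10000)
rate1 = sum(a for a, _ in synthetic) / len(synthetic)
rate2 = sum(b for _, b in synthetic) / len(synthetic)
both = sum(a * b for a, b in synthetic) / len(synthetic)
print(round(rate1, 3), round(rate2, 3), round(both, 3))
```

The marginal rates stay close to the requested values, while the joint rate exceeds the product of the marginals, which is the pairwise dependence the copula is meant to preserve. In practice the latent correlation would be estimated from real data rather than set by hand.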

2308.12224 2026-06-05 q-bio.QM cs.AI 版本更新

Enhancing cardiovascular risk prediction through AI-enabled calcium-omics

通过AI赋能的钙组学增强心血管风险预测

Ammar Hoori, Sadeer Al-Kindi, Tao Hu, Yingnan Song, Hao Wu, Juhwan Lee, Nour Tashtish, Pingfu Fu, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson

发表机构 * Department of Biomedical Engineering, Case Western Reserve University（生物医学工程系，凯斯西储大学）； Harrington Heart and Vascular Institute, University Hospitals Cleveland Medical Center（哈灵顿心脏和血管研究所，克利夫兰医学中心）； School of Medicine, Case Western Reserve University（医学院，凯斯西储大学）； Department of Population and Quantitative Health Sciences, Case Western Reserve University（人口与定量健康科学系，凯斯西储大学）； Department of Radiology, University Hospitals Cleveland Medical Center（放射科，克利夫兰医学中心）； Department of Radiology, Case Western Reserve University（放射科，凯斯西储大学）

AI总结本文通过利用详细的钙沉积特征（即钙组学）结合AI方法，提高了主要不良心血管事件（MACE）预测的准确性，展示了钙组学在心血管风险预测中的应用价值。

Comments 12 pages, 8 figures, 2 tables, 4 pages supplemental, journal paper format (under review)

DOI: 10.1038/s41598-024-60584-8

AI中文摘要

背景. 冠状动脉钙化（CAC）是预测主要不良心血管事件（MACE）的强大预测因子。传统的Agatston评分只是简单地将钙含量相加，尽管是非线性方式，但仍有改进钙沉积评估的空间，以更全面地捕捉疾病程度。目标. 确定是否可以通过使用详细的钙沉积特征（即钙组学）的AI方法来提高MACE预测。方法. 我们研究了钙沉积的其他特征，包括质量、体积、密度、空间分布、区域等的评估。我们使用带有弹性网络正则化的Cox模型，在2457例CT钙化评分（CTCS）中，该评分富集了MACE事件，来源于一个大型无成本CLARIFY计划（ClinicalTrials.gov标识符：NCT04075162）。我们采用了采样技术来增强模型训练。我们还研究了使用选定特征的Cox模型，以识别可解释的高风险特征。结果. 我们提出的钙组学模型，通过修改的合成下采样和上采样，给出了C指数（80.5%/71.6%）和两年AUC（82.4%/74.8%）（80:20，训练/测试），分别（采样仅应用于训练集）。结果优于Agatston，后者给出了C指数（71.3%/70.3%）和AUC（71.8%/68.8%）。在钙组学特征中，钙化数量、左前降支质量及扩散率（空间分布的度量）是增加风险的重要决定因素，而致密钙化（>1000HU）与较低风险相关。钙组学模型在保留测试中将63%的MACE患者重新分类到高风险组。分类净再分类指数为NRI=0.153。结论. AI分析冠状动脉钙化可比Agatston评分产生更好的结果。我们的发现表明，钙组学在改进风险预测中的应用价值。

英文摘要

Background. Coronary artery calcium (CAC) is a powerful predictor of major adverse cardiovascular events (MACE). Traditional Agatston score simply sums the calcium, albeit in a non-linear way, leaving room for improved calcification assessments that will more fully capture the extent of disease. Objective. To determine if AI methods using detailed calcification features (i.e., calcium-omics) can improve MACE prediction. Methods. We investigated additional features of calcification including assessment of mass, volume, density, spatial distribution, territory, etc. We used a Cox model with elastic-net regularization on 2457 CT calcium score (CTCS) enriched for MACE events obtained from a large no-cost CLARIFY program (ClinicalTrials.gov Identifier: NCT04075162). We employed sampling techniques to enhance model training. We also investigated Cox models with selected features to identify explainable high-risk characteristics. Results. Our proposed calcium-omics model with modified synthetic down sampling and up sampling gave C-index (80.5%/71.6%) and two-year AUC (82.4%/74.8%) for (80:20, training/testing), respectively (sampling was applied to the training set only). Results compared favorably to Agatston which gave C-index (71.3%/70.3%) and AUC (71.8%/68.8%), respectively. Among calcium-omics features, numbers of calcifications, LAD mass, and diffusivity (a measure of spatial distribution) were important determinants of increased risk, with dense calcification (>1000HU) associated with lower risk. The calcium-omics model reclassified 63% of MACE patients to the high risk group in a held-out test. The categorical net-reclassification index was NRI=0.153. Conclusions. AI analysis of coronary calcification can lead to improved results as compared to Agatston scoring. Our findings suggest the utility of calcium-omics in improved prediction of risk.
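
The C-index reported above measures how often a survival model ranks pairs of patients in the right order. The sketch below is a plain implementation of Harrell's concordance index and is not taken from the paper: the function name `concordance_index` and the four-patient example are hypothetical, and pairs with tied follow-up times are simply skipped as a simplification.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: the fraction of comparable pairs ranked correctly.

    times: follow-up time per subject.
    events: 1 if the event (e.g. MACE) was observed, 0 if censored.
    risk_scores: model output, higher meaning higher predicted risk.
    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's follow-up time. Tied risks count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort: the third subject is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]
print(concordance_index(times, events, risk))  # 0.8
```

Here five pairs are comparable and four are ordered correctly, giving 0.8; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.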

2306.09712 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Semi-Offline Reinforcement Learning for Optimized Text Generation

半离线强化学习用于优化文本生成

Changyu Chen, Xiting Wang, Yiqiao Jin, Victor Ye Dong, Li Dong, Jie Cao, Yi Liu, Rui Yan

发表机构 * （未知机构）

AI总结本文提出了一种半离线强化学习方法，平衡了探索能力和训练成本，并在优化成本、渐近误差和过拟合误差界方面实现了最优的强化学习设置。

Comments In Proceedings of the 40th International Conference on Machine Learning (ICML 2023)

2305.12640 2026-06-05 cs.AI cs.LG stat.ML 版本更新

Limited Resource Allocation in a Non-Markovian World: The Case of Maternal and Child Healthcare

非马尔可夫世界中的有限资源分配：以妇幼保健为例

Panayiotis Danassis, Shresth Verma, Jackson A. Killian, Aparna Taneja, Milind Tambe

发表机构 * Harvard University（哈佛大学）； Google Research（谷歌研究）

AI总结本文研究了在非马尔可夫环境下如何通过时间序列方法优化资源分配，提出了一种新的时间序列臂排名指数（TARI）策略，以提高妇幼保健项目的参与度和依从性。

Comments Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023)

DOI: 10.24963/ijcai.2023/660

AI中文摘要

许多医疗项目的成功取决于参与者的依从性。我们考虑在资源有限的环境中安排干预措施（例如由健康工作者及时拨打支持电话），以提高依从性和/或参与度。以往的工作已经成功开发了几类基于不安定多臂老虎机（Restless Multi-armed Bandit，RMAB）的解决方案。然而，所有以往的RMAB方法都假设参与者的行为满足马尔可夫性质。我们在合作伙伴NGO ARMMAN的孕产妇健康意识项目的真实数据上展示了对马尔可夫假设的显著偏离。此外，我们将RMAB扩展到连续状态空间，这是此前研究不足的领域。为解决一般化的非马尔可夫RMAB设定，我们（i）将每个参与者的轨迹建模为时间序列，（ii）利用时间序列预测模型学习复杂的模式和动态以预测未来状态，（iii）提出时间序列臂排名指数（TARI）策略，这是一种新算法，根据我们对未来状态的预测，选择最能从干预中受益的RMAB臂。我们在合成数据以及对ARMMAN真实数据的二次分析上评估了我们的方法，结果表明，与已部署的最先进方法Whittle指数方案相比，参与度显著提高。这相当于多收听16.3小时的内容，多防止90.8%的参与度下降，并覆盖了两倍以上的高流失风险受益人。

英文摘要

The success of many healthcare programs depends on participants' adherence. We consider the problem of scheduling interventions in low resource settings (e.g., placing timely support calls from health workers) to increase adherence and/or engagement. Past works have successfully developed several classes of Restless Multi-armed Bandit (RMAB) based solutions for this problem. Nevertheless, all past RMAB approaches assume that the participants' behaviour follows the Markov property. We demonstrate significant deviations from the Markov assumption on real-world data on a maternal health awareness program from our partner NGO, ARMMAN. Moreover, we extend RMABs to continuous state spaces, a previously understudied area. To tackle the generalised non-Markovian RMAB setting we (i) model each participant's trajectory as a time-series, (ii) leverage the power of time-series forecasting models to learn complex patterns and dynamics to predict future states, and (iii) propose the Time-series Arm Ranking Index (TARI) policy, a novel algorithm that selects the RMAB arms that will benefit the most from an intervention, given our future state predictions. We evaluate our approach on both synthetic data, and a secondary analysis on real data from ARMMAN, and demonstrate significant increase in engagement compared to the SOTA, deployed Whittle index solution. This translates to 16.3 hours of additional content listened, 90.8% more engagement drops prevented, and reaching more than twice as many high dropout-risk beneficiaries.
