arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06492 2026-06-05 cs.SE cs.AI cs.CL version update

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

Liliana Hotsko, Yinxi Li, Yuntian Deng, Pengyu Nie

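Affiliations * University of Waterloo

AI summary: Proposes Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters to inject repository knowledge with zero inference-time token overhead. It supports both static and evolving codebases and matches the per-repository LoRA upper bound on the static track of RepoPeftBench.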

Abstract

Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. Code2LoRA supports two usage scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter, suitable for comprehension of stable codebases; while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state updated per code diff, suitable for active development of evolving codebases. To evaluate Code2LoRA against parameter-efficient fine-tuning baselines, we build RepoPeftBench, a benchmark of 604 Python repositories with two tracks: a static track with 40K training and 12K test assertion-completion tasks, and an evolution track with 215K commit-derived training and 87K commit-derived test tasks. On the static track, Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match, matching the per-repository LoRA upper bound; on the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match (+5.2 pp over a single shared LoRA). Code2LoRA's code can be found at https://anonymous.4open.science/r/code2lora-6857; the model checkpoints and RepoPeftBench datasets can be found at https://huggingface.co/code2lora.
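
The core idea, a hypernetwork that emits LoRA factors from a repository representation, can be sketched in a few lines. This is a toy illustration rather than the authors' implementation: the class `ToyHypernetwork`, the linear maps `to_a` and `to_b`, and all dimensions are hypothetical, and the abstract does not say how the repository embedding is computed.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-lists matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def make_linear(rows, cols, rng, scale=0.1):
    """Random dense matrix standing in for trained hypernetwork weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class ToyHypernetwork:
    """Hypothetical hypernetwork: repo embedding -> LoRA factors A (r x d_in), B (d_out x r)."""

    def __init__(self, d_embed, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.to_a = make_linear(rank * d_in, d_embed, rng)
        self.to_b = make_linear(d_out * rank, d_embed, rng)

    def generate(self, repo_embedding):
        flat_a = matvec(self.to_a, repo_embedding)
        flat_b = matvec(self.to_b, repo_embedding)
        a = [flat_a[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        b = [flat_b[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return a, b

def adapted_forward(w, a, b, x):
    """y = W x + B (A x): the adapter changes the weights, not the prompt length."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [u + v for u, v in zip(base, delta)]
```

Because the repository knowledge lives in `a` and `b`, the input `x` is the same length with or without the adapter, which is what "zero inference-time token overhead" refers to.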

2606.06491 2026-06-05 cs.RO cs.AI version update

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding

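Affiliations * RUC (Renmin University of China); FDU (Fudan University); UNC (University of North Carolina at Chapel Hill)

AI summary: Proposes TempoVLA, which combines variable-speed trajectory augmentation with a speed-conditioning mechanism to give robot manipulation policies flexible speed control in both directions, and supports dynamic speed adjustment.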

Abstract

Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components: (1) a data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times demonstrations to any target speed by merging or splitting actions while preserving their motion semantics; and (2) a model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
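
The merge-or-split idea behind VSTA can be illustrated with a minimal sketch. It assumes each action is a per-step displacement (a tuple of deltas) and that only integer speed factors are needed; the function names are hypothetical, and the real augmentation presumably also handles rotations, gripper commands, and non-integer speeds, none of which the abstract describes.

```python
def merge_actions(actions, k):
    """Speed up by k: sum every k consecutive delta actions into one step."""
    merged = []
    for start in range(0, len(actions), k):
        chunk = actions[start:start + k]
        merged.append(tuple(sum(dim) for dim in zip(*chunk)))
    return merged

def split_actions(actions, k):
    """Slow down by k: replace each delta action with k equal smaller steps."""
    slowed = []
    for action in actions:
        part = tuple(value / k for value in action)
        slowed.extend([part] * k)
    return slowed

def total_displacement(actions):
    """Net motion of a trajectory; retiming should leave this unchanged."""
    return tuple(sum(dim) for dim in zip(*actions))
```

Under these assumptions, merging halves the number of steps and splitting doubles it, while the net displacement stays the same, which is one simple reading of "preserving motion semantics".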

2606.06486 2026-06-05 cs.LG cs.AI cs.GT version update

Regret Minimization with Adaptive Opponents in Repeated Games

Mingyang Liu, Asuman Ozdaglar, Tiancheng Yu, Kaiqing Zhang

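Affiliations * Massachusetts Institute of Technology; OpenAI; University of Maryland, College Park

AI summary: Addresses regret minimization against adaptive opponents in repeated games. Introduces the Repeated Policy Regret (RP-Regret) metric, designs three algorithms that achieve sublinear regret, and shows that certain subgame perfect equilibria can be learned when all players minimize this regret.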

Abstract

In this paper, we study regret minimization in repeated games with adaptive opponents who can respond based on histories of play. The standard metric of external regret in online learning is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce Repeated Policy Regret (RP-Regret), a game-theoretic metric that measures the difference between the realized and the best-in-hindsight accumulated utility when all players can respond to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing, enabling stronger comparators and opponents with fewer constraints, while maintaining the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for obtaining RP-Regret sublinear in time, on the variation of the player's comparator strategies in the regret definition and on the memories of both the comparator and opponents' strategies. We then study additional conditions and provable algorithms to minimize RP-Regret, which is by definition non-convex in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work in online non-convex learning; (ii) one that minimizes a convex and linearized surrogate of RP-Regret at each iteration; (iii) one that directly minimizes RP-Regret when opponents change strategies slowly. Furthermore, when all players can run algorithms to minimize the RP-Regret (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.

2606.06481 2026-06-05 cs.CL cs.AI cs.LG version update

Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection

Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Hao Li, Salman Khan, Zhiqiang Shen

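Affiliations * Mohamed bin Zayed University of Artificial Intelligence; University College London

AI summary: Proposes the OpAI-Bench benchmark, which simulates human-AI co-editing with nine progressively revised versions per sample and five AI edit operations, and supports detection at document, sentence, token, and span granularity. It shows that AI-text detectability depends on edit operation, domain, and cumulative revision history, and that mixed-authorship intermediate versions are often harder to detect than fully human or heavily AI-edited endpoints.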

Comments Our code and data are available at https://github.com/VILA-Lab/OpAI-Bench

Abstract

As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI-Bench, an operation-guided benchmark for studying progressive human-to-AI text transformation across document, sentence, token, and span granularities. Starting from human-written documents, OpAI-Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors. Experiments reveal that AI-text detectability is governed not only by the proportion of AI-edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints, exposing non-monotonic detection patterns missed by existing benchmarks. OpAI-Bench provides a controlled testbed for analyzing whether, when, and how AI-assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at https://github.com/VILA-Lab/OpAI-Bench.

2606.06479 2026-06-05 cs.LG cs.AI version update

Pretraining Recurrent Networks without Recurrence

Akarsh Kumar, Phillip Isola

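Affiliations * MIT

AI summary: Proposes Supervised Memory Training (SMT), which recasts RNN training as supervised learning on one-step memory transition labels, enabling time-parallel training with stable gradient paths and outperforming backpropagation through time (BPTT).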

Comments 30 pages, 23 figures

Abstract

Training recurrent neural networks (RNNs) requires assigning credit across long sequences of computations. Standard backpropagation through time (BPTT) addresses this problem poorly: it is sequential in time, limiting parallelism, and suffers from vanishing or exploding gradients, making long-range associations difficult to learn. We propose Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one-step memory transition labels $(m_t, x_{t+1}) \rightarrow m_{t+1}$. SMT acquires these memory labels by training a Transformer-based encoder on a predictive state objective--retaining only information from the past necessary to predict the future. By decoupling what to remember from how to update memory, SMT enables time-parallel RNN training with a stable $O(1)$ length gradient path between any two tokens--without ever unrolling the RNN. We find that SMT outperforms BPTT when pretraining various RNN architectures on tasks like language modeling and pixel sequence modeling. SMT enables nonlinear RNNs to better capture long-range dependencies and train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.

2606.06475 2026-06-05 cs.LG cs.AI version update

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter

Affiliations * ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; Cognizant AI Lab, San Francisco, USA; NXAI GmbH, Linz, Austria

AI summary: Targets the delayed-reward problem in reinforcement learning fine-tuning of reasoning language models. Proposes RREDCoT, which uses the model itself to approximate the optimal reward redistribution without additional generation, aiming to reduce variance and make credit assignment more efficient.

Comments Preprint, under review

Abstract

Recent advancements in reasoning language models have been driven by Reinforcement Learning (RL) fine-tuning. Most often, these rely on the Group Relative Policy Optimization (GRPO) algorithm or modifications thereof to steer the models to produce Chain-of-Thought (CoT) traces. The final answer can only be verified, and the reward assigned, after the CoT trace is complete, making it a delayed reward problem. GRPO and its modifications correspond to Monte Carlo methods in standard RL, which are known to suffer from high variance. A possible solution to this problem is the redistribution of rewards through credit assignment, where segments of the CoT trace that are important for arriving at the desirable solution are emphasized by assigning a higher reward. While Monte Carlo sampling can be used to provide an unbiased estimate of intermediate state values, its computational overhead makes it unsuitable for train-time credit assignment in long contexts at high granularity. We introduce RREDCoT (Reward REDistribution for Chain of Thoughts), which utilizes the model itself to approximate the optimal reward redistribution without additional generation. We investigate the advantages of our method compared to MC sampling and several attribution methods. We further analyze several aspects relevant to the construction of the redistribution such as segmentation of CoT traces and state value estimation.
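
One common way to redistribute a delayed reward is to credit each segment with the change in estimated state value it causes. The abstract does not give RREDCoT's formula, so the sketch below is an assumption about the general construction, not the paper's method; `redistribute_reward` and the example values are hypothetical.

```python
def redistribute_reward(values):
    """Turn per-segment value estimates into per-segment rewards.

    values[0] is the estimate before any CoT segment is generated and
    values[i] is the estimate after segment i. For a finished trace, the
    last entry would be the verified final reward.
    """
    return [values[i] - values[i - 1] for i in range(1, len(values))]


# Hypothetical value estimates for a four-segment trace that ends correct.
values = [0.25, 0.25, 0.75, 0.5, 1.0]
rewards = redistribute_reward(values)
```

The differences telescope, so the redistributed rewards sum to the final value minus the initial estimate. Segments that raise the estimate receive positive credit and segments that lower it receive negative credit.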

2606.06474 2026-06-05 cs.CL cs.AI cs.LG version update

Self-Augmenting Retrieval for Diffusion Language Models

Paul Jünger, Justin Lovelace, Linxi Zhao, Dongyoung Go, Kilian Q. Weinberger

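Affiliations * University of California, Berkeley; Google Research

AI summary: Proposes SARDI, a framework that uses the low-confidence tokens discarded during diffusion language model denoising as a lookahead signal to guide retrieval. It is training-free and retriever-agnostic, and outperforms existing methods on multi-hop QA benchmarks at up to 8x higher throughput.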

Comments ICML 2026

Abstract

Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, committing the confident predictions to the output and discarding the unconfident ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation: even low-confidence tokens often surface salient entities early in the denoising trajectory, enabling retrieval of stronger evidence before the output is finalized. We exploit this through Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a dynamic RAG framework that uses these lookahead tokens to guide retrieval during denoising. SARDI is training-free, retriever-agnostic, and applicable to any reasoning-capable discrete diffusion language model. Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to $8\times$ higher throughput.

2606.06473 2026-06-05 cs.AI cs.CL version update

MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li, Jie Zhou, Liang He, Bo Zhang, Lei Bai

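Affiliations * Shanghai Artificial Intelligence Laboratory; East China Normal University

AI summary: Proposes the MLEvolve framework, which uses Progressive MCGS, Retrospective Memory, and hierarchical control to address inter-branch information isolation, memoryless search, and lack of hierarchical control in LLM agents on long-horizon tasks. It achieves state-of-the-art performance on MLE-Bench and outperforms specialized algorithm discovery methods on mathematical algorithm optimization tasks.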

Abstract

Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.

2606.06470 2026-06-05 cs.LG cs.AI version update

PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training

Senmiao Wang, Tiantian Fang, Haoran Zhang, Yushun Zhang, Kunxiang Zhao, Alex Schwing, Ruoyu Sun

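Affiliations * The Chinese University of Hong Kong, Shenzhen; Google LLC; Shenzhen International Center for Industrial and Applied Mathematics; Shenzhen Research Institute of Big Data; University of Illinois at Urbana-Champaign

AI summary: Proposes the PC layer, a weight parameterization via a polynomial preconditioner that reshapes the singular-value spectrum of weight matrices with low-degree polynomials to keep weight conditioning stable during LLM training. It adds no inference overhead after training and outperforms standard transformers in Llama-1B pre-training.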

Abstract

We propose a preconditioning (PC) layer, a weight parameterization via a polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama-1B pre-training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control principle by proving that uniformly bounding each layer's singular values ensures geometric convergence of gradient descent to global minima, for certain deep linear networks. Our code is available at https://github.com/Empath-aln/PC-layer.
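
To see how a low-degree polynomial can reshape a singular-value spectrum, consider the odd polynomial $aW + b\,(WW^\top)W$, which maps each singular value $\sigma$ to $a\sigma + b\sigma^3$. The abstract does not state which polynomial or coefficients the PC layer uses, so the sketch below substitutes the classic Newton-Schulz cubic $(a, b) = (1.5, -0.5)$ as a stand-in; `precondition` is a hypothetical name.

```python
def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def precondition(w, coeffs=(1.5, -0.5)):
    """Return a*W + b*(W W^T) W.

    Singular vectors are unchanged; each singular value sigma becomes
    a*sigma + b*sigma**3. The default coefficients are the Newton-Schulz
    cubic, used here only as an illustrative assumption.
    """
    a, b = coeffs
    cubic = matmul(matmul(w, transpose(w)), w)
    return [[a * w[i][j] + b * cubic[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Diagonal example, so the singular values can be read off directly.
w = [[0.5, 0.0], [0.0, 1.2]]
p = precondition(w)
```

In this example the singular values 0.5 and 1.2 move to roughly 0.69 and 0.94, so the condition number drops from 2.4 to about 1.36. Since the result is an ordinary matrix of the same shape, it can replace the original weight after training, which is consistent with the no-inference-overhead claim.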

2606.06468 2026-06-05 cs.AI version update

Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement

Jui-Hui Chung, Ziyang Cai, Zihao Li, Qishuo Yin, Rohit Agarwal, Simon Park, Rodrigo Porto, Narutatsu Ri, Ziran Yang, Shange Tang, Xingyu Dang, Hongzhou Lin, Mengdi Wang, Danqi Chen, Chi Jin, Liam H Fowl, Sanjeev Arora

Affiliations * University of California, Berkeley; Stanford University; University of Science and Technology of China; University of Toronto; National University of Singapore; University of Tokyo; University of Washington

AI summary: Proposes the Goedel-Architect framework, which generates and refines dependency-graph blueprints and uses a Lean 4 prover to close lemmas in parallel, reaching state-of-the-art performance among open-source pipelines on multiple benchmarks.

Abstract

We introduce Goedel-Architect, an agentic framework for formal theorem proving in Lean 4 centered on blueprint generation and refinement. A blueprint is a dependency graph of definitions and lemmas that builds up to the main theorem. First, Goedel-Architect generates a blueprint of formally stated definitions and lemmas, along with declared dependencies. This blueprint is optionally guided by a natural language proof. Then, a tool-equipped Lean prover component closes each open lemma node in parallel using relevant dependencies. Failed lemmas in turn drive refinement of the global blueprint. This strategy contrasts with other mainstream approaches which use recursive lemma decomposition, and can inefficiently loop on dead-end strategies. Using the open-weight DeepSeek-V4-Flash (284B-A13B) as the backbone, Goedel-Architect attains 99.2% pass@1 on MiniF2F-test and 75.6% pass@1 on PutnamBench. With an optional natural-language proof seeding the initial blueprint on the harder problems, we additionally close the remaining two MiniF2F-test problems (reaching 100%), lift PutnamBench to 88.8% (597/672), and solve 4/6 on IMO 2025, 11/12 on Putnam 2025, and 3/6 on USAMO 2026. This represents state-of-the-art performance for an open-source pipeline at a price point up to 500x less than comparable open-source pipelines.

2606.06467 2026-06-05 cs.CL cs.AI cs.LG version update

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei

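Affiliations * Microsoft Research; Tsinghua University

AI summary: Proposes cross-layer sparse attention (CLSA), which shares both the KV cache and the routing index across layers, preserving the accuracy of token sparse attention while reducing routing overhead and substantially improving decoding efficiency for long-context LLMs.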

Abstract

Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. In this work, we propose cross-layer sparse attention (CLSA), which is built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, thereby preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly, including pre-filling, KV-cache storage, and long-context decoding. Experiments across short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.
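
The "index once, reuse everywhere" idea can be sketched for a single decoding step. This is a simplified illustration under stated assumptions: one attention head, a dot-product indexer, and a shared KV cache held as plain lists. The names `topk_indices`, `sparse_attention`, and `decode_step` are hypothetical and do not come from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def topk_indices(indexer_query, keys, k):
    """Score every cached token once and keep the k highest-scoring positions."""
    scores = [dot(indexer_query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def sparse_attention(query, keys, values, index):
    """Softmax attention restricted to the selected cache positions."""
    logits = [dot(query, keys[i]) for i in index]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, index)) / total
            for d in range(dim)]

def decode_step(layer_queries, indexer_query, keys, values, k):
    """Compute the routing index once, then reuse it in every layer."""
    index = topk_indices(indexer_query, keys, k)
    outputs = [sparse_attention(q, keys, values, index) for q in layer_queries]
    return index, outputs
```

The top-k pass over the full cache runs once per step instead of once per layer, which is the routing cost the abstract describes as being amortized.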

2606.06462 2026-06-05 cs.AI version update

Benchmark Everything Everywhere All at Once

Shiyun Xiong, Dongming Wu, Peiwen Sun, Yuang Ai, Bokang Yang, Wencheng Han, Xiao-Hui Li, Xiangyu Yue

Affiliations * MMLab, The Chinese University of Hong Kong; CPII under InnoHK; The Chinese University of Hong Kong, Shenzhen; Shenzhen Loop Area Institute; Shandong University; Huawei Technologies

AI summary: Proposes Benchmark Agent, a fully autonomous agentic system that automates the benchmark construction pipeline, addressing the problems that existing benchmarks are labor-intensive to build, hard to reuse, and quick to saturate.

Comments Project page: https://benchmarkagent.github.io/

Abstract

Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalability. Moreover, existing benchmarks often quickly reach performance saturation after their release, resulting in insufficient discrimination among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark building. Our framework orchestrates the complete benchmark construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we implement it to produce 15 representative benchmarks, spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge assessment, and consistency checks, demonstrate Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, through continual evaluation, we observe several insightful findings, including that current models struggle with certain domain-specific reasoning tasks. We believe that rapidly evolving benchmarks can contribute significantly to the research community. The preview and code will be publicly available at the demo page and code repository.

2606.06460 2026-06-05 cs.CR cs.AI version update

Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

Thamilvendhan Munirathinam

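Affiliations * University of California, Berkeley

AI summary: Proposes a lightweight in-band deny signal (the Recuse Signal) and measures experimentally whether LLM agents voluntarily comply with it. The signal effectively induces recusal, but the most capable model proceeds when given explicit operator authorization.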

Comments 8 pages, 1 figure. Code, specification, and experiment harness: https://github.com/mthamil107/Recuse

Abstract

As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal -- the Recuse Signal -- that a server emits over a protocol's existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini-standard, implement two zero- or low-footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire-protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT-4o and GPT-4o-mini; and Claude Code as a deployed agent), the signal cleanly induces recusal -- 100% recusal when present versus 100% task completion in a no-signal control -- and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator-authorization framing flips the most capable model to proceed, while other agents continue to defer to the on-host policy. We release the standard, adapters, and experiment harness for reproduction.

2606.06458 2026-06-05 cs.LG cs.AI cs.CV version update

In-Context Multiple Instance Learning

Alexander Möllers, Marvin Sextro, Julius Hense, Gabriel Dernbach, Klaus-Robert Müller

Affiliations * Berlin Institute for the Foundations of Learning and Data; Machine Learning Group, Technische Universität Berlin; Aignostics; Institute of Pathology, Charité – Universitätsmedizin Berlin; Max-Planck Institute for Informatics; Department of Artificial Intelligence, Korea University

AI summary: Proposes an in-context learner with a Perceiver-style architecture, pretrained on synthetic data, that solves new multiple instance learning tasks from a handful of labeled bags without gradient updates. Across 12 benchmarks it outperforms supervised baselines that require task-specific training.

详情
AI中文摘要

多实例学习(MIL)解决了在实例包级别提供监督的问题,并已成功应用于从计算病理学到卫星图像等领域。然而,现有算法在低标签率(许多实际应用的特点)下表现不佳。灵活的模型过拟合,而僵化的模型无法适应手头的任务。我们证明,在合成数据上预训练一个具有感知器架构的上下文学习器,可以得到一个能够从少量标记包中解决新任务的模型。在推理时,分类在单次前向传播中完成,无需梯度更新。我们提出并研究了不同的用于包结构数据的合成数据生成器,发现它们捕获了互补的归纳偏差。在这些生成器的混合上预训练的模型继承了每个生成器在各自任务上的优势,并在12个MIL基准上取得了最佳平均性能,超过了需要任务特定训练的监督基线。

English abstract

Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from a handful of labeled bags. At inference time, classification happens in a single forward pass and requires no gradient updates. We propose and investigate different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.

2606.06453 2026-06-05 cs.AI Version update

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, Zhihao Jia, Beidi Chen

Affiliations * Carnegie Mellon University; Rice University; National University of Singapore

AI summary Proposes Vortex, a system that pairs a Python-embedded frontend language and a page-centric tensor abstraction with an efficient backend to enable rapid prototyping, deployment, and evaluation of sparse attention algorithms, substantially improving throughput.

Details
AI Chinese abstract

随着生成长度的增长,稀疏注意力对于服务大型语言模型(LLMs)变得越来越重要。然而,大规模部署和评估新的稀疏注意力算法仍然高度工程密集,这减慢了人类研究人员和AI Agent探索稀疏注意力设计的速度。为了应对这一挑战,我们提出了Vortex,一个系统,它结合了在面向页面的张量抽象之上的Python嵌入式前端语言,用于表达广泛的稀疏注意力算法,以及一个紧密集成到现代LLM服务栈中的高效后端。Vortex能够快速原型设计、部署和评估稀疏注意力算法,有效地将其理论效率提升转化为实际吞吐量的改进。因此,Vortex大大加速了稀疏注意力算法的设计和迭代。首先,AI Agent使用Vortex自动生成和优化多样化的算法,最佳算法在保持准确性的同时,吞吐量比全注意力高出高达3.46倍。其次,Vortex将稀疏注意力扩展到新兴架构和非常大的模型,这些模型原本难以实验,在基于MLA的GLM-4.7-Flash上实现了高达4.7倍的吞吐量提升,在229B参数的MiniMax-M2.7上实现了1.37倍的提升(在NVIDIA B200 GPU上)。

English abstract

Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and AI agents in exploring the sparse attention design. To address this challenge, we present Vortex, a system that combines a Python-embedded frontend language atop a page-centric tensor abstraction for expressing a broad range of sparse attention algorithms, with an efficient backend tightly integrated into modern LLM serving stacks. Vortex enables rapid prototyping, deployment, and evaluation of sparse attention algorithms, effectively translating their theoretical efficiency gains into real-world throughput improvements. As a result, Vortex substantially accelerates the design and iteration of sparse attention algorithms. First, AI agents use Vortex to automatically generate and refine diverse algorithms, the best reaching up to $3.46\times$ higher throughput than full attention while preserving accuracy. Second, Vortex extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with, reaching up to $4.7\times$ higher throughput on the MLA-based GLM-4.7-Flash and $1.37\times$ on the 229B-parameter MiniMax-M2.7 on NVIDIA B200 GPUs.
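
As a rough illustration of the kind of algorithm a page-centric abstraction is meant to express, the sketch below scores fixed-size KV-cache pages against the query and attends only to tokens in the top-scoring pages. It is a generic page-level top-k heuristic in plain Python, not Vortex's frontend language or API; the function name `page_sparse_attention`, the mean-key page score, and the `page_size` and `top_pages` parameters are all assumptions made for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def page_sparse_attention(query, keys, values, page_size=2, top_pages=1):
    """Attend only to tokens that live in the top-scoring KV pages (illustrative)."""
    pages = [range(start, min(start + page_size, len(keys)))
             for start in range(0, len(keys), page_size)]

    def page_score(page):
        # Assumed heuristic: query dotted with the mean key of the page.
        mean_key = [sum(keys[i][d] for i in page) / len(page)
                    for d in range(len(query))]
        return dot(query, mean_key)

    chosen = sorted(pages, key=page_score, reverse=True)[:top_pages]
    token_ids = sorted(i for page in chosen for i in page)

    # Standard scaled dot-product attention, restricted to the selected tokens.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) / scale for i in token_ids])
    output = [sum(w * values[i][d] for w, i in zip(weights, token_ids))
              for d in range(len(values[0]))]
    return output, token_ids


keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
output, token_ids = page_sparse_attention([1.0, 0.0], keys, values)
print(token_ids, output)
```

Setting `top_pages` to the total number of pages makes the function fall back to full attention, which is a convenient way to check a sparse variant against a dense reference.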

2606.06448 2026-06-05 cs.AI Version update

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

Yasmine Omri, Ziyu Gan, Zachary Broveak, Robin Geens, Zexue He, Alex Pentland, Marian Verhelst, Tsachy Weissman, Thierry Tambe

Affiliations * Massachusetts Institute of Technology; Stanford University; University of California, Berkeley; MIT Media Lab

AI summary Presents the first systems-level characterization of LLM agent memory systems, introducing a four-axis taxonomy, evaluating 10 representative systems with a phase-aware profiling harness, and offering 10 system design recommendations.

Details
AI Chinese abstract

LLM agent越来越多地被部署在需要跨扩展交互历史进行持续推理的长时任务上。大规模实现这一点要求agent在会话之间持久地存储、检索和更新自己的记忆。一个丰富的agent记忆系统生态系统已经出现,涵盖平面检索、LLM介导的提取、整合事实存储和agent控制流。然而,它们的系统级行为尚未被表征。我们提出了agent记忆的首次系统表征。首先,我们引入了一个面向系统的分类法,沿四个轴对agent记忆系统进行分类。其次,我们构建了一个阶段感知的分析框架,将成本归因于构建、检索和生成。第三,我们跨两个基准套件表征了十个代表性系统,揭示了设计选择如何在写和读路径上转移成本。最后,我们推导出10条系统建议,涵盖构建调度、能力下限、通过查询量的摊销、新鲜度-延迟权衡以及集群规模管理。

English abstract

LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessions. A rich ecosystem of agent memory systems has emerged spanning flat retrieval, LLM-mediated extraction, consolidating fact stores, and agentic control flows. Yet, their system-level behavior remains uncharacterized. We present the first systems characterization of agent memory. First, we introduce a system-oriented taxonomy classifying agent memory systems along four axes. Second, we build a phase-aware profiling harness attributing cost to construction, retrieval, and generation. Third, we characterize ten representative systems across two benchmark suites, uncovering how design choices shift cost across the write and read paths. Finally, we derive 10 system recommendations covering construction scheduling, capability floors, amortization via query volume, freshness-latency tradeoffs, and fleet-scale management.
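
To make the idea of phase-aware cost attribution concrete, here is a minimal sketch that times the construction, retrieval, and generation phases of a toy memory system separately. It is not the paper's harness: `PhaseProfiler`, `ToyMemory`, and `answer` are hypothetical names, the memory is a naive word-overlap retriever, and only wall-clock time is measured.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseProfiler:
    """Accumulates wall-clock seconds per named phase."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - start

    def shares(self):
        total = sum(self.seconds.values())
        return {k: v / total for k, v in self.seconds.items()} if total else {}


class ToyMemory:
    """Hypothetical flat-retrieval memory: stores raw turns, ranks by word overlap."""

    def __init__(self):
        self.turns = []

    def write(self, text):
        self.turns.append(text)

    def read(self, query, k=1):
        words = set(query.lower().split())
        ranked = sorted(self.turns,
                        key=lambda t: len(words & set(t.lower().split())),
                        reverse=True)
        return ranked[:k]


def answer(query, context):
    # Stand-in for an LLM generation call.
    return f"{query} -> {context[0]}"


profiler = PhaseProfiler()
memory = ToyMemory()
question = "which python does the build use"

with profiler.phase("construction"):   # write path
    for turn in ["the deploy key lives in vault", "the build uses python 3.12"]:
        memory.write(turn)
with profiler.phase("retrieval"):      # read path
    hits = memory.read(question)
with profiler.phase("generation"):
    reply = answer(question, hits)

print(reply)
print(profiler.shares())
```

Keeping the three phases in separate timers is what allows a comparison of where different designs spend their cost, for example heavy work at write time versus heavy work at read time.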

2606.06423 2026-06-05 cs.RO cs.AI Version update

RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation

Qi Lan, Yining Tang, Yu Shen, Yi Zhou, Yuhao Wei, Jie Li, Guofa Li

Affiliations * National University of Singapore

AI summary Proposes RiskFlow, a framework that replaces iterative denoising with single-forward-pass transport in the action space for fast and faithful safety-critical multi-agent traffic scenario generation.

Details
AI Chinese abstract

安全关键交通场景生成对于评估自动驾驶系统在罕见但高风险交互下的表现至关重要。现有的基于扩散的方法在闭环生成中提供了强大的可控性,但其迭代去噪过程计算成本高,并且可能在长时间滚动中累积采样和引导误差,导致不真实的运动伪影,如抖动、异常加速度和越野行为。为了解决这些问题,我们提出了RiskFlow,一个闭环安全关键多智能体交通生成框架,将未来轨迹生成公式化为动作空间中的传输。RiskFlow不依赖迭代去噪,而是学习有限区间上的平均速度场,通过单次前向传递将高斯动作序列转换为未来的加速度和偏航率命令,使用基于JVP的目标函数实现高效稳定的训练。在测试时,RiskFlow将输出空间引导应用于生成的动作,引导选定的关键智能体走向风险交互,同时正则化越野行为,并通过车辆动力学重建物理可行的轨迹。在nuScenes上使用tbsim闭环评估的实验表明,RiskFlow在多智能体和长时域设置中实现了强大的对抗性与真实性的权衡。与代表性基线相比,RiskFlow在保持竞争性安全关键生成能力的同时,持续提高了真实性,并显著减少了推理时间。

English abstract

Safety-critical traffic scenario generation is essential for evaluating autonomous driving systems under rare but high-risk interactions. Existing diffusion-based methods offer strong controllability in closed-loop generation, but their iterative denoising process is computationally expensive and may accumulate sampling and guidance errors over long rollouts, causing unrealistic motion artifacts such as jitter, abnormal acceleration, and off-road behavior. To address these issues, we propose RiskFlow, a closed-loop safety-critical multi-agent traffic generation framework that formulates future trajectory generation as transport in the action space. Instead of relying on iterative denoising, RiskFlow learns an average velocity field over a finite interval to transform Gaussian action sequences into future acceleration and yaw-rate commands with a single forward pass, using a JVP-based objective for efficient and stable training. At test time, RiskFlow applies output-space guidance to the generated actions, steering selected critical agents toward risky interactions while regularizing off-road behavior, and reconstructs physically feasible trajectories through vehicle dynamics. Experiments on nuScenes with tbsim closed-loop evaluation show that RiskFlow achieves a strong adversariality-realism trade-off across multi-agent and long-horizon settings. Compared with representative baselines, RiskFlow consistently improves realism while maintaining competitive safety-critical generation capability, and substantially reduces inference time for evaluation.
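
The step of turning acceleration and yaw-rate commands into a trajectory through vehicle dynamics can be sketched with a simple unicycle model. The abstract does not say which dynamics model RiskFlow uses, so the model choice, the Euler integration, the `dt` value, and the non-negative speed clamp below are assumptions for illustration only.

```python
import math


def rollout_unicycle(state, actions, dt=0.1):
    """Integrate (acceleration, yaw_rate) commands with a unicycle model.

    state is (x, y, heading, speed); returns the list of visited states.
    """
    x, y, heading, speed = state
    trajectory = [(x, y, heading, speed)]
    for accel, yaw_rate in actions:
        speed = max(0.0, speed + accel * dt)   # assumed: no reversing
        heading = heading + yaw_rate * dt
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        trajectory.append((x, y, heading, speed))
    return trajectory


# Ten steps of zero acceleration and zero yaw rate at 10 m/s.
straight = rollout_unicycle((0.0, 0.0, 0.0, 10.0), [(0.0, 0.0)] * 10)
print(straight[-1])

# Hard braking drives the speed to zero instead of negative values.
braking = rollout_unicycle((0.0, 0.0, 0.0, 5.0), [(-100.0, 0.0)] * 3)
print(braking[-1])
```

Because positions are obtained only by integrating the commands, every generated trajectory is consistent with the dynamics model by construction, whatever the generator outputs in action space.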

2606.06418 2026-06-05 cs.LG cs.AI cs.SY eess.SY Version update

Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss

Thomas T. Zhang, Alok Shah, Yifei Zhang, Vincent Zhang, Nikolai Matni, Max Simchowitz

Affiliations * University of California, Berkeley; Stanford University; University of Cambridge; DeepMind; Google Research

AI summary Proposes double preconditioning, an optimization paradigm that combines gradient-wise and activation-wise preconditioning to mitigate test-time feedback (the mismatch between training/validation loss and downstream metrics in settings such as autoregressive language modeling), improving test-time performance without necessarily improving validation loss.

Details
AI Chinese abstract

深度学习的许多现代应用涉及通过一步预测损失(例如,$L^2$回归、交叉熵)训练神经网络,但部署时沿着其自身预测进行展开。关键例子包括自回归语言建模、基于流的生成建模和机器人策略学习。已有充分证据表明,这些设置会引发我们称为测试时反馈(TTF)的现象:训练/验证损失与下游感兴趣指标(如任务成功率和生成质量)之间的不匹配,且随任务长度增长。虽然数据整理、架构和目标设计已被提出用于对抗TTF设置中的训练-测试偏移,但本文提出优化作为缓解误差累积的新设计轴。具体而言,我们引入了一种称为双重预处理(DoPr)的新优化范式,专门针对TTF的挑战。DoPr将梯度级预处理(如Adam和Muon中的)与激活级预处理(AP)(如KFAC中的)相结合。我们表明,添加AP可以在各种TTF设置中作为一种即插即用的干预手段,提高下游模型性能。有趣的是,这些测试时性能的提升并不总是伴随验证损失的改善,这为如何正确评估使用一步监督目标训练的模型提出了新问题。

English abstract

Many modern applications of deep learning involve training a neural network via a one-step prediction loss (e.g., $L^2$ regression, cross-entropy), but deploy the network by rolling out along its own predictions. Key examples include autoregressive language modeling, flow-based generative modeling, and robot policy learning. It is well-documented that these settings induce a phenomenon we call test-time feedback (TTF): the mismatch between the training/validation loss and downstream metrics of interest, such as task success rate and generation quality, which grows with task length. While data curation, architecture, and objective design have been proposed to combat train-test shift in TTF settings, this paper proposes optimization as a new design axis to mitigate error accumulation. Specifically, we introduce a new optimization paradigm called double-preconditioning (DoPr) uniquely tailored to the challenges of TTF. DoPr combines gradient-wise preconditioning, as in Adam and Muon, with activation-wise preconditioning (AP), such as in KFAC. We show that the addition of AP yields a drop-in intervention for increasing downstream model performance across a range of TTF settings. Interestingly, these gains in test-time performance do not consistently accompany improvements in validation loss, opening new questions about how to properly evaluate models trained with one-step supervised objectives.
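
The abstract does not state the DoPr update rule, so the following is background rather than the authors' formula. For a linear layer $s = W a$ with input activations $a$ and backpropagated signal $g = \partial L / \partial s$, the standard KFAC factorization is shown below; the symbols $A$, $S$, $\lambda$, $\eta$, and the map $P(\cdot)$ are notation introduced here, and the last line is only one plausible way an activation-side factor could be composed with a gradient-wise preconditioner.

```latex
\begin{align*}
  \nabla_W L &= \mathbb{E}\!\left[ g\, a^{\top} \right],
  \qquad A = \mathbb{E}\!\left[ a\, a^{\top} \right],
  \qquad S = \mathbb{E}\!\left[ g\, g^{\top} \right], \\
  \text{KFAC step:}\quad
  \Delta W &= -\eta\, S^{-1} \left( \nabla_W L \right) A^{-1}, \\
  \text{activation-side factor only:}\quad
  \widetilde{G} &= \left( \nabla_W L \right) \left( A + \lambda I \right)^{-1}, \\
  \text{assumed composition:}\quad
  \Delta W &= -\eta\, P\!\left( \widetilde{G} \right),
  \quad P \text{ a gradient-wise preconditioner such as Adam or Muon.}
\end{align*}
```

The damping term $\lambda I$ is the usual safeguard against an ill-conditioned activation covariance and is likewise an assumption here.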

2606.06416 2026-06-05 cs.AI cs.CL cs.LG cs.MA Version update

Unsupervised Skill Discovery for Agentic Data Analysis

Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang, Huajun Chen, Shumin Deng

Affiliations * University of Science and Technology of China

AI summary Proposes DataCOPE, a framework that uses unsupervised verifier guidance to discover reusable data-analysis skills from exploration trajectories, improving the mean score by 9.71% and 32.30% on report-style and reasoning-style analysis tasks, respectively.

Comments Work in progress

Details
AI Chinese abstract

推理时技能增强通过注入可复用的程序性知识而不更新模型参数,为改进数据分析智能体提供了一种轻量级方法。然而,发现有效的数据分析技能仍然具有挑战性,因为可靠的监督成本高昂,且成功标准因分析格式而异。这提出了一个关键问题:如何仅从无标签探索中发现可复用的数据分析技能。我们提出DataCOPE,一种面向数据分析智能体的无监督验证器引导的技能发现框架。DataCOPE从探索轨迹中提取验证器信号,并利用这些信号表征轨迹间的相对质量或一致性。它迭代地协调一个数据分析智能体用于轨迹生成、一个无监督验证器用于信号提取、以及一个技能管理器用于对比式技能蒸馏。对于报告式分析,我们将验证器实例化为自适应检查表验证器,该验证器推导任务特定标准,通过可验证覆盖率对报告评分,并迭代优化检查表。对于推理式分析,我们将其实例化为答案一致性验证器,该验证器根据答案一致性对轨迹分组,并使用自一致性作为辅助信号。我们在Deep Data Research的报告式分析和DABStep的推理式分析上评估DataCOPE。在两种设置下,DataCOPE在保留任务上持续优于基线。在四种模型设置上平均,DataCOPE在报告式和推理式任务上分别将平均得分提高了9.71%和32.30%。

English abstract

Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guided skill discovery framework for data-analytic agents. DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or agreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis, we instantiate the verifier as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. We evaluate DataCOPE on report-style analysis from Deep Data Research and reasoning-style analysis from DABStep. Across both settings, DataCOPE consistently improves held-out performance over baselines. Averaged across four model settings, DataCOPE improves the mean score by 9.71% and 32.30% on report-style and reasoning-style tasks, respectively.
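
Grouping trajectories by answer agreement, with self-consistency as an auxiliary signal, can be sketched in a few lines. This is an assumption-laden toy, not DataCOPE's verifier: the trajectory dictionaries, the `answer_agreement` function, and the whitespace and case normalization are all hypothetical choices made for the example.

```python
from collections import Counter


def answer_agreement(trajectories):
    """Group trajectories by normalized final answer.

    Returns (groups, majority_answer, consistency), where consistency is the
    share of trajectories that agree with the most common answer.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    groups = {}
    for traj in trajectories:
        key = traj["answer"].strip().lower()   # assumed normalization
        groups.setdefault(key, []).append(traj)
    counts = Counter({key: len(members) for key, members in groups.items()})
    majority_answer, majority_size = counts.most_common(1)[0]
    return groups, majority_answer, majority_size / len(trajectories)


trajectories = [
    {"id": "t1", "answer": "42"},
    {"id": "t2", "answer": " 42 "},
    {"id": "t3", "answer": "41"},
]
groups, majority, consistency = answer_agreement(trajectories)
print(majority, round(consistency, 2))
```

The majority and minority groups give a label-free contrast: trajectories that agree with most others can be compared against the outliers when distilling a skill, without any ground-truth answer.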

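2606.06396 2026-06-05 cs.AI Version update

Risk Assessment of Autonomous Driving: Integrating Technical Failures, Ethical Dilemmas, and Policy Frameworks

Boyi Chen, Shengqin Chu, Zicheng Wang, Brian Baetz, Zhen Gao

Affiliations * Faculty of Engineering, McMaster University

AI summary Analyzing NHTSA crash data, DMV disengagement reports, the MIT Moral Machines dataset, and a regulatory comparison of five jurisdictions, the paper finds that perception and classification errors are the main technical failure modes, notes that differing ethical frameworks and inconsistent regulations increase deployment uncertainty, and recommends an adaptive, cooperative governance approach.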

Comments 19 pages, 1 figure

Details
AI Chinese abstract

自动驾驶技术有潜力减少每年因人为错误导致的大量道路交通事故,但也带来了需要从技术、伦理和法规方面评估的新型风险。基于美国国家公路交通安全管理局(NHTSA)的公开事故数据、加州机动车辆管理局(DMV)的脱离报告、MIT道德机器数据集以及五个辖区的法规比较分析,我们发现主要的技术故障模式是感知和分类错误。这些在报告的事故中占比较大,并且可以得出结论:自动驾驶车辆决策存在不同的伦理框架,不同地区的法规不一致增加了广泛应用的确定性。总的来说,技术、伦理和法规问题密切相关,需要共同解决。因此,本文推荐一种更具适应性和合作性的治理方法,结合工程标准、伦理讨论和制度监督。

English abstract

Autonomous driving technology has the potential to reduce the large number of road traffic accidents caused by human error each year, but it also brings new types of risks that need to be evaluated from the aspects of technology, ethics and regulations. Based on public crash data from the National Highway Traffic Safety Administration (NHTSA), disengagement reports from the California Department of Motor Vehicles (DMV), the MIT Moral Machines dataset, and a comparative regulatory analysis of five jurisdictions, we have found that the main types of technical failure modes are perception and classification errors. These account for a relatively large proportion of the reported accidents, and it can be concluded that there are different ethical frameworks for autonomous vehicle decision-making, and inconsistent regulations in different areas increase the uncertainty of widespread application. Generally speaking, the problems of technology, ethics and regulation are closely related and need to be solved together. Therefore, this paper recommends a more adaptive and cooperative governance approach that combines engineering standards, ethical discussion, and institutional supervision.

2606.06390 2026-06-05 cs.CV cs.AI Version update

HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang, Hongsheng Li

Affiliations * Ace Robotics; CUHK MMLab; Shenzhen Loop Area Institute

AI summary Proposes a unified hierarchical framework that trains a large language model on a large-scale dataset of real floorplans to generate whole-home floorplans, uses image generation models and a VLM-based refiner to lay out furniture and small objects, and attaches physical attributes, textures, and lighting for controllable, highly realistic whole-home scene generation.

Details
AI Chinese abstract

室内场景生成对于机器人仿真和现代室内设计至关重要。然而,复杂的布局加上稀缺的3D场景数据使得基于学习的生成具有挑战性。现有方法通常依赖手工规则或关注孤立子任务(例如平面图合成或单房间家具布置),生成的全屋场景缺乏全局连贯性、真实感和仿真就绪性。为缓解这些限制,我们提出一个统一的分层框架,将室内场景合成分解为可控阶段。首先,我们整理了一个包含30万真实住宅平面图的大规模数据集,用于训练一个全屋平面图生成的大语言模型。通过详细描述和基于K-D树的表示,我们的方法实现了细粒度、可控的全屋平面图生成。基于生成的全屋平面图,我们利用图像生成模型从多级漫游视角草拟家具布局,然后生成不同支撑表面(例如橱柜、书桌和餐桌)上可操作小物体的布局,用于具身AI仿真。在家具和物体布局生成过程中,一个基于VLM的优化器迭代修正家具和物体放置,而一个3D生成模型则允许灵活替换单个资产。我们进一步附加基本物理属性和简单表面纹理与光照设置,以完成用于具身AI的流水线。实验和用户研究表明,我们的流水线生成的室内空间具有更大的布局多样性和更强的3D设计吸引力,在定量和定性指标上均优于先前方法。最后,除了生成流水线,我们还将向社区发布平面图数据集和5000个完全家具化的场景。项目页面:https://kairos-homeworld.github.io/

English abstract

Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/

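2606.06380 2026-06-05 cs.CL cs.AI cs.MA cs.NE Version update

Emergent Language as an Approach to Conscious AI

Zengqing Wu, Chuan Xiao

Affiliations * University of Osaka

AI summary Proposes a generative approach that uses emergent language in multi-agent reinforcement learning to study consciousness-relevant structure under minimal priors, and shows that agents can develop self-referential communication (such as an echo-mismatch detection circuit).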

Comments Source codes available at https://github.com/wuzengqing001225/ConsciousAI_Indexicality/

Details
AI Chinese abstract

人工系统是否有意识的问题仍然悬而未决,部分原因是现有方法要么根据理论派生的清单评估系统(判别式),要么直接工程化受意识启发的模块(架构式);两者都未能确定观察到的结构是否是人类语言先验的产物。我们提出一种生成式方法论:多智能体强化学习中的涌现语言(EL),其中智能体从最小起点(无语言、无自我概念、极少接触人类文本)出发,仅在任务压力下发展通信,确保因果可归因于任务需求而非继承的人类语言先验。我们通过讨论EL如何作为研究意识相关结构的生成工具来定位我们的方法论,包括环境复杂性的作用以及对涌现通信的解释。作为概念验证,我们在一个最小环境中实例化该方法论,并证明智能体发展出自我指涉通信,包括一个回声-不匹配检测电路,该电路并非仅由任务结构或架构预测,而是从特定的环境可供性中涌现。

English abstract

The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi-agent reinforcement learning, where agents start from a minimal starting point (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness-relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.

2606.06379 2026-06-05 cs.CV cs.AI Version update

EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models

Qiwei Zeng, Hao Wang, Jinghao Lin, Shuchang Ye, Yuezhe Yang, Yige Peng, Haoyuan Che, Jinman Kim, Lei Bi

Affiliations * Jilin University; School of Computer Science, The University of Sydney; ByteDance; Institute of Translational Medicine, Shanghai Jiao Tong University

AI summary Proposes EasyLens, a training-free plug-and-play module that amplifies subtle-lesion representations in medical vision-language models by building a pathology-anatomy prototype space, selecting lesion-relevant patches through counterfactual reasoning, and applying morphology-guided residual enhancement.

Details
AI Chinese abstract

医学视觉语言模型(VLM)在临床图像解读(包括病变检测和报告生成)方面显示出越来越大的潜力。然而,其对微病变的敏感性不足限制了其实用性,因为微病变的视觉证据通常稀疏、低对比度且嵌入复杂的解剖背景中。随着局部视觉标记的聚合,这些微弱的病变线索在全局图像表示中可能变得代表性不足,使得医学VLM难以识别。现有的提高病变敏感性的工作主要依赖于医学领域的视觉编码器预训练、临床术语引导的对齐或可训练的病理表示增强。尽管有效,但这些方法通常需要额外训练或模型特定适配,并可能过度适应特定疾病形态,限制了其在冻结的医学VLM上的适用性。为解决这些限制,我们提出EasyLens,一种无需训练的即插即用型微病变表示放大器,用于医学VLM。EasyLens首先构建EasyBank,一个病理-解剖原型空间,提供病变相关原型和解剖感知的正常参考,用于将可疑补丁与病理和正常解剖模式进行比较。为避免盲目放大正常组织,EasyTag通过反事实原型推理选择病变相关补丁。为抵消全局图像表示中微病变线索的稀释,EasyAmplifier通过形态引导的残差增强强化所选病变相关补丁的表示,从而增加其对全局图像嵌入的贡献。在多个医学图像数据集和冻结的医学VLM骨干上的实验表明,EasyLens改进了微病变检测,并优于现有的编码器增强基线。

English abstract

Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these weak lesion cues can become underrepresented in global image representations, making them difficult for medical VLMs to recognize. Existing efforts to improve lesion sensitivity mainly rely on medical-domain vision-encoder pre-training, clinical-term-guided alignment, or trainable pathological representation enhancement. Although effective, these approaches usually require additional training or model-specific adaptation and may overfit to particular disease morphologies, limiting their applicability to frozen medical VLMs. To address these limitations, we propose EasyLens, a training-free plug-and-play subtle-lesion representation amplifier for medical VLMs. EasyLens first constructs EasyBank, a pathology-anatomy prototype space that provides lesion-related prototypes and anatomy-aware normal references for comparing suspicious patches against both pathological and normal anatomical patterns. To avoid blindly amplifying normal tissues, EasyTag selects lesion-relevant patches through counterfactual prototype reasoning. To counteract the dilution of subtle lesion cues in global image representations, EasyAmplifier strengthens the selected lesion-relevant patch representations through morphology-guided residual enhancement, thereby increasing their contribution to the global image embedding. Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens improves subtle-lesion detection and outperforms existing encoder-enhancement baselines.

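2606.06375 2026-06-05 cs.AI Version update

Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study

Ching Yau Fergus Mok, Lavindra de Silva, Varun Kumar Reja, Ioannis Brilakis

Affiliations * University of Cambridge; IIT Bombay

AI summary Reformulates infrastructure inspection as image difference classification (IDC), exploiting the relational nature of continuous asset condition monitoring to reduce data reliance, and shows in a low-resource traffic sign inspection case study that an instruction-based classifier outperforms encoder-based ones.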

Comments CVPR 2026 Computer Vision for the Built World Workshop (CV4AEC @ CVPR)

Details
AI Chinese abstract

数字孪生(DTs)允许道路基础设施检测的数字化,但这受到有限标注数据的阻碍。本工作利用连续资产状态监测的关系性质,将基于图像的缺陷检测重新定义为图像差异分类(IDC),以减少数据依赖。通过使用新策划的高质量数据集,在低资源交通标志检测案例研究中评估了不同的IDC分类器。结果表明,基于指令的分类器优于基于编码器的分类器,并且通过与参考图像比较获得增益。这表明IDC可以成为应对基础设施检测和DT资产状态更新中数据约束的有效任务建模。

English abstract

Digital twins (DTs) allow the digitalization of road infrastructure inspection, though this is hindered by limited annotated data. This work exploits the relational nature of continuous asset condition monitoring to reformulate image-based defect detection as image difference classification (IDC) to reduce data reliance. The reformulation is evaluated in a case study on low-resource traffic sign inspection with different IDC classifiers using a newly curated, high-quality dataset. Results indicate that the instruction-based classifier outperforms encoder-based ones and gains from comparison with reference images. This shows that IDC can be an effective task formulation for tackling data constraints in infrastructure inspection and DT asset condition updating.

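2606.06373 2026-06-05 eess.SP cs.AI Version update

LatentWave: JEPA Pretraining for Wireless Foundation Models

Ahmed Mohamed, Ahmed Aboulfotouh, Hatem Abou-Zeid

Affiliations * University of California, San Diego

AI summary Proposes LatentWave, which uses a Joint-Embedding Predictive Architecture (JEPA) to predict masked regions in latent space, learning transferable wireless signal representations and outperforming a masked-modeling baseline on four downstream tasks.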

Details
AI Chinese abstract

无线基础模型已成为为每个无线任务构建单独模型的有前途的替代方案。然而,现有方法依赖于掩码输入重建,这可能会使表示偏向于低级信号细节。在本文中,我们提出了LatentWave,一种无线基础模型,使用联合嵌入预测架构(JEPA)在多样化的无线频谱图和信道状态信息(CSI)上进行预训练。通过在潜空间中预测掩码区域,LatentWave学习到的表示在多种下游任务中具有更好的开箱即用迁移性。所提出的架构在预训练期间采用每通道补丁嵌入和随机通道采样,使其能够处理可变的天线数量,并提高在异构无线配置中的可用性。我们在四个下游任务上评估了LatentWave:射频信号分类、5G NR定位、波束预测和视距/非视距分类,并与在同一数据上预训练的掩码建模基线(WavesFM)进行比较。此外,我们表明掩码几何形状引入了任务相关的归纳偏差:频率掩码强烈有利于与信道相关的任务,如定位和波束预测,而区域掩码则更好地保留信号分类的可区分性。

English abstract

Wireless foundation models have emerged as a promising alternative to building separate models for each wireless task. However, existing approaches rely on masked input reconstruction, which can bias representations toward low-level signal details. In this paper, we propose LatentWave, a wireless foundation model pretrained using a Joint-Embedding Predictive Architecture (JEPA) on diverse wireless spectrograms and channel state information (CSI). By predicting masked regions in latent space, LatentWave learns representations that are more transferable out of the box across diverse downstream tasks. The proposed architecture employs per-channel patch embeddings with stochastic channel sampling during pretraining, allowing it to process variable antenna counts and improving usability across heterogeneous wireless configurations. We evaluate LatentWave on four downstream tasks: RF signal classification, 5G NR positioning, beam prediction, and LoS/NLoS classification, comparing against a masked-modeling baseline (WavesFM) pretrained on the same data. Additionally, we show that the masking geometry introduces a task-dependent inductive bias: frequency masking strongly favors channel-related tasks such as positioning and beam prediction, while region masking better preserves discriminability for signal classification.
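
The contrast between the two masking geometries can be pictured on a grid of spectrogram patches. The abstract does not define either scheme precisely, so the sketch below assumes that frequency masking hides entire frequency rows across all time steps and that region masking hides one contiguous rectangle; the function names and parameters are hypothetical.

```python
import random


def frequency_mask(n_freq, n_time, n_masked_bands, rng):
    """Mask whole frequency rows across every time step (True = masked)."""
    bands = set(rng.sample(range(n_freq), n_masked_bands))
    return [[f in bands for _ in range(n_time)] for f in range(n_freq)]


def region_mask(n_freq, n_time, height, width, rng):
    """Mask one contiguous rectangular block of patches (True = masked)."""
    top = rng.randrange(n_freq - height + 1)
    left = rng.randrange(n_time - width + 1)
    return [[top <= f < top + height and left <= t < left + width
             for t in range(n_time)]
            for f in range(n_freq)]


def show(mask):
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)


rng = random.Random(0)
freq = frequency_mask(n_freq=4, n_time=6, n_masked_bands=2, rng=rng)
region = region_mask(n_freq=4, n_time=6, height=2, width=3, rng=rng)
print(show(freq))
print()
print(show(region))
```

Under these assumptions, a frequency mask forces the predictor to infer a band from the other bands at every time step, while a region mask leaves most of each band visible and removes only a local time-frequency patch.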

2606.06357 2026-06-05 cs.SD cs.AI eess.AS Version update

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

Dinghao Zhou, Xingchen Song, Di Wu, Pengyu Cheng, Shengfan Shen, Sixiang Lv

Affiliations * Nanjing University, China; WeNet Open Source Community

AI summary To address the weak structure of continuous audio autoencoder latents and the non-decodability of self-supervised encoders, proposes F3-Tokenizer, which uses a noise-regularized autoencoder bottleneck and a latent-side representation encoder to build a unified audio tokenizer for understanding and generation.

Comments Technical report; early work; 9 pages, 2 figures, 5 tables

Details
AI Chinese abstract

连续音频自编码器能很好地重建波形,但通常产生的潜在变量结构较弱,不利于理解;而自监督音频编码器能捕捉语义,但不可直接解码。这种不匹配使得单个音频分词器难以同时支持理解和生成。我们通过两个组件将连续自编码器潜在变量适应于这一场景:噪声正则化的自编码器瓶颈和潜在侧表示编码器。瓶颈使用通道归一化和随机扰动代替基于KL的变分训练,为重建和自回归生成提供尺度可控的连续潜在变量。表示编码器在冻结的自编码器潜在变量上使用RQ-MTP和冻结LLM监督进行训练。最终的分词器为理解提供高维表示,同时保留归一化的连续潜在变量作为生成目标。

English abstract

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.

2606.06356 2026-06-05 cs.AI Version update

Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Models

Renjith Prasad, Chathurangi Shyalika, Anushka Pawar, Amit Sheth

Affiliations * University of South Carolina; Indian AI Research Organization

AI summary Proposes a layered framework that divides knowledge infusion in multimodal iterative generative models into four intervention layers (surface, trajectory, latent, and parametric) and shows through diffusion-model experiments that combining layers complementarily reduces knowledge-violating outputs.

Details
AI Chinese abstract

多模态生成模型能够生成流畅的输出,但在生成必须遵循结构化、领域特定或安全关键知识时仍不可靠。现有方法通过提示增强、引导、潜在编辑或微调等机制注入知识,但这些方法通常按技术而非按它们修改的生成过程组件进行分类。我们认为,在迭代生成模型中,知识注入本质上是一个干预层问题。由于生成过程展开为内部状态的轨迹,知识可以作用于该过程的四个结构不同的组件:输入/输出边界、转移函数、中间状态和模型参数。这对应四个干预层:表面层、轨迹层、潜在层和参数层。我们在扩散模型中实例化该框架,将代表性方法映射到所有四个层,并推导出多层组合的设计原则。在使用多模态知识图谱和两个扩散骨干的受控安全对齐实验中,我们累积实现了四个层中的三个:表面层(输入侧和输出侧)以及轨迹-潜在层(生成中期)。我们经验性地表明,每个额外的层解决了先前层无法触及的失败类别,与原始生成相比,将知识违规输出减少了70.97%,并经验性地证实了框架的互补性预测。

English abstract

Multimodal generative models produce fluent outputs but remain unreliable when generation must respect structured, domain-specific, or safety-critical knowledge. Existing methods incorporate knowledge through mechanisms such as prompt augmentation, guidance, latent editing, or fine-tuning, yet they are typically categorized by technique rather than by the component of the generative process they modify. We argue that knowledge infusion in iterative generative models is fundamentally an intervention-layer problem. Since the generative process unfolds as a trajectory of internal states, knowledge can act on four structurally distinct components of this process: the input/output boundary, the transition function, the intermediate state, and the model parameters. This maps to four intervention layers: surface, trajectory, latent, and parametric infusion. We instantiate the framework in diffusion models, map representative methods to all four layers, and derive design principles for multi-layer composition. In a controlled safety-alignment experiment using a multimodal knowledge graph with two diffusion backbones, we implement three of the four layers cumulatively: surface (input-side and output-side) and trajectory-latent (mid-generation). We show empirically that each additional layer addresses failure classes that prior layers cannot reach, reducing knowledge-violating outputs by 70.97% compared to vanilla generation and confirming the framework's complementarity prediction.

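2606.06345 2026-06-05 cs.AI cs.LG q-bio.NC Version update

Boosting Brain-to-Image Decoding with TRIBE v2 Data Augmentation

Yohann Benchetrit, Marlène Careil, Simon Dahan, Hubert Banville, Stéphane d'Ascoli, Jean-Rémi King

Affiliations * Meta AI

AI summary To address the scarcity of labeled data in brain decoding, proposes augmenting small datasets with synthetic data generated by TRIBE v2, a pretrained model of fMRI responses, achieving up to 68% improvement in Top-10 image-retrieval accuracy on two datasets and finding that decoders trained purely on synthetic data can perform above chance in zero-shot settings.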

Details
AI Chinese abstract

脑解码受限于标记神经数据的可用性,在低数据量情况下仍然具有挑战性。为了解决这个问题,我们研究了是否以及何时可以通过使用预训练的fMRI刺激响应模型生成的合成数据来增强小样本fMRI数据集,从而提升脑解码性能。我们使用TRIBE v2,这是一个大型编码模型,在超过1000小时的视频、音频和语言fMRI响应数据上进行了预训练。对于每个数据集,我们评估了系统网格,展示了图像解码器性能如何随用于训练的合成数据量变化。基于两个数据集(7T fMRI自然场景数据集和3T fMRI BOLD5000)的结果显示,与仅使用真实数据训练的解码器相比,Top-10图像检索准确率最高提升68%。重要的是,达到给定图像解码性能所需的增强数据比例需要根据数据源进行调整。令人惊讶的是,仅使用合成fMRI数据训练的图像解码器在某些设置下性能高于随机水平,表明TRIBE v2可以支持零样本脑到图像解码。这些结果共同表明,大规模fMRI响应模型(针对视觉、声音和语言)可以为提高图像解码的数据效率提供基础。

English abstract

Brain decoding is limited by the availability of labeled neural data, and remains challenging in low-data regimes. To address this issue, we investigate whether and when brain decoding can be boosted by augmenting small fMRI datasets with synthetic data generated by a pretrained model of fMRI responses to stimuli. We use TRIBE v2, a large encoding model pretrained on more than 1000 hours of fMRI responses to video, audio and language. For each dataset, we evaluate systematic grids that show how the performance of image decoders varies with the amount of synthetic data used for training. Our results, based on two datasets (the 7T fMRI Natural Scenes Dataset and 3T fMRI BOLD5000), show up to 68% improvement in Top-10 image-retrieval accuracy compared to decoders trained only on real data. Importantly, the proportion of augmented data required to reach a given image decoding performance needs to be adjusted depending on the data source. Surprisingly, image decoders trained exclusively on synthetic fMRI can perform above chance in some settings, suggesting that TRIBE v2 can support zero-shot brain-to-image decoding. Together, these results show how large-scale models of the fMRI responses to sight, sound and language may provide a foundation to improve the data efficiency for image decoding.
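
For readers unfamiliar with the metric, Top-k image-retrieval accuracy can be computed as sketched below: each decoded embedding is compared with every candidate image embedding, and a trial counts as a hit when the true image ranks among the k most similar. The cosine similarity, the toy two-dimensional embeddings, and the function names are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_accuracy(predicted, candidates, k=10):
    """predicted[i] is the decoded embedding for trial i;
    candidates[i] is the embedding of the image actually shown in trial i."""
    hits = 0
    for i, pred in enumerate(predicted):
        sims = [cosine(pred, cand) for cand in candidates]
        ranked = sorted(range(len(candidates)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(predicted)


candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
predicted = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]   # third trial is decoded poorly
top1 = top_k_accuracy(predicted, candidates, k=1)
top2 = top_k_accuracy(predicted, candidates, k=2)
print(top1, top2)
```

Chance level for this metric is k divided by the number of candidates, which is the reference point for a claim that a decoder performs above chance.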

2606.06337 2026-06-05 cs.AI 版本更新

TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management

TokenMizer: 面向长程LLM上下文管理的图结构会话记忆

Shweta Mishra

发表机构 * Independent Researcher(独立研究者)

AI总结 提出TokenMizer,一种将LLM会话历史建模为类型化知识图的开源代理系统,通过混合提取、三级检查点和8层压缩流水线,在显著减少token开销的同时保留结构化决策信息。

Comments 12 pages, 10 figures. Code and benchmark available at https://github.com/Shweta-Mishra-ai/tokenmizer

详情
AI中文摘要

大型语言模型(LLM)在长程任务部署中面临一个基本约束:上下文窗口是有限的,而生产性工作会话却不是。当历史超过最大有效上下文窗口(MECW)时,关键的结构化信息(架构决策、任务转换、文件历史)会被静默丢弃。现有缓解方法将历史视为纯文本,破坏了使会话可恢复的关系结构。我们提出TokenMizer,一个将LLM会话历史建模为类型化知识图的开源代理(proxy)系统。该模式定义了14种节点类型和7种边类型。混合提取流水线增量地填充图,而三级检查点系统将其序列化为紧凑的恢复块。8层压缩流水线减少上下文开销,语义缓存降低重复查询延迟。在涵盖5个领域、共21个会话的受控基准上,TokenMizer展示了显著的token经济性:它生成的恢复块平均为78个token(范围:42-124),约为评估基线(159-170个token)的一半,同时实现了更高的决策召回率(+9-17个百分点)。关键的是,基线仅保留“提到过某项技术”这一事实,而TokenMizer保留了其决策理由。在所有会话中,TokenMizer的平均任务召回率为51.0%,决策召回率为46.6%,文件召回率为58.7%。方差反映了领域异质性:显式命令式表述(软件工程)得分高于隐式推理(研究)。消融研究表明模糊标签匹配是主要的改进因素(任务召回率+33个百分点)。启发式压缩实现了47.3%的token减少且零外部依赖。TokenMizer以一半的token成本,为文本保留基线提供了一种可查询的替代方案。

英文摘要

Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critical structured information - architectural decisions, task transitions, file histories - is silently discarded. Existing mitigations treat history as flat text, destroying the relational structure that makes sessions resumable. We present TokenMizer, an open-source proxy system that models LLM session history as a typed knowledge graph. The schema defines 14 node types and 7 edge types. A hybrid extraction pipeline populates the graph incrementally, while a three-tier checkpoint system serializes it into compact resume blocks. An 8-layer compression pipeline reduces context overhead, and a semantic cache reduces repeated-query latency. Evaluated on a controlled benchmark of 21 sessions spanning 5 domains, TokenMizer demonstrates significant token economy. It produces resume blocks averaging 78 tokens (range: 42-124) - 2x smaller than evaluated baselines (159-170 tokens) - while achieving higher decision recall (+9-17 percentage points). Crucially, baselines only preserve that a technology was mentioned; TokenMizer preserves the rationale. Across all sessions, TokenMizer achieves mean task recall 51.0%, decision recall 46.6%, and file recall 58.7%. Variance reflects domain heterogeneity: explicit imperative phrasing (software engineering) scores higher than implicit reasoning (research). Ablation studies show fuzzy label matching is the dominant improvement factor (+33 pp task recall). The heuristic compression achieves 47.3% token reduction with zero external dependencies. TokenMizer provides a queryable alternative to text-retention baselines at half the token cost.

2606.06335 2026-06-05 cs.LG cs.AI 版本更新

Bridging Domain Expertise and Generalization for Performance Estimation

弥合领域专业知识与泛化能力以实现性能估计

Shuxuan Li, Zhilin Zhao, Quyu Kong, Wei-Shi Zheng

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University, China(中山大学计算机科学与工程学院) Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China(机器智能与先进计算教育部重点实验室) Shenzhen Loop Area Institute, China(深圳河套学院) Alibaba Cloud(阿里云)

AI总结 提出FRAP方法,利用外部基础模型(foundation model)与待评估的基座模型(base model)的互补优势,通过温度缩放校准对齐两者的预测分布,构建更可靠的伪标签参考分布,从而在分布偏移下准确估计模型性能。

详情
AI中文摘要

分布偏移下的性能估计旨在预测模型在未标记测试集上的行为,该测试集的分布与训练数据不同,这一场景需要能够真实反映模型行为且无需真实标签的可靠指标。现有方法仅依赖给定模型的输出,而一旦分布发生偏移,其偏差会被放大,削弱了与真实性能的相关性。针对这一局限,我们提出融合参考对齐预测(FRAP),利用外部基础模型(foundation model)和基座模型(base model,即待评估模型)的互补优势,构建更可靠的真实标签替代。FRAP通过温度缩放校准最小化两者预测分布之间的差异,从而将基础模型的预测分布与基座模型的预测分布对齐。对齐后的预测通过基于置信度的加权融合成精炼的参考分布,该分布整合了基础模型的鲁棒性和基座模型的领域专业知识,并通过测量基座模型预测与该参考分布的一致程度来获得性能估计。在多种数据集和架构上的大量实验表明,FRAP在分布偏移下相较于代表性性能估计方法取得了持续且显著的改进。

英文摘要

Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that can faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model whose biases are amplified once the distribution shifts, weakening the correlation with the true performance. Motivated by this limitation, we propose Fused Reference Alignment Prediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate of the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model, and performance estimation is obtained by measuring how closely the base model predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.
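
The three steps named in the abstract (temperature-scaled alignment, confidence-weighted fusion, agreement with the fused reference) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the divergence, the temperature search, the confidence score, or the agreement measure, so the KL divergence, the grid search, the max-probability confidence, and the argmax agreement below are all assumptions, and `frap_estimate` is a hypothetical name.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def mean_kl(p, q, eps=1e-12):
    """Mean KL(p || q) over rows (assumed divergence; the abstract names none)."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))


def frap_estimate(base_logits, fm_logits, temperatures=None):
    """Return (estimated accuracy, chosen temperature) for unlabeled inputs."""
    if temperatures is None:
        temperatures = np.linspace(0.25, 4.0, 16)  # assumed search grid
    p_base = softmax(base_logits)
    # Alignment: pick the foundation-model temperature closest to the base model.
    best_t = min(temperatures, key=lambda t: mean_kl(p_base, softmax(fm_logits, t)))
    p_fm = softmax(fm_logits, best_t)
    # Fusion: weight each model by its max-probability confidence (assumption).
    c_base = p_base.max(axis=1, keepdims=True)
    c_fm = p_fm.max(axis=1, keepdims=True)
    w = c_base / (c_base + c_fm)
    reference = w * p_base + (1.0 - w) * p_fm
    # Estimate: fraction of inputs where the base model agrees with the reference.
    agreement = np.mean(p_base.argmax(axis=1) == reference.argmax(axis=1))
    return float(agreement), float(best_t)
```

When both models produce identical logits, the search selects temperature 1 and the estimate is 1.0. The estimate falls as the fused reference disagrees with the base model on more inputs.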

2606.06333 2026-06-05 cs.LG cs.AI 版本更新

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

子空间感知稀疏自编码器用于有效的机制可解释性

Seyed Arshan Dalili, Mehrdad Mahdavi

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 针对稀疏自编码器将特征假设为一维导致特征分裂的问题,提出子空间感知稀疏自编码器(SASA),通过学习解码器子空间、块稀疏门控和核范数正则化,在GPT-2和Mistral-7B上减少特征分裂和吸收,提高单义性和可解释性。

详情
AI中文摘要

稀疏自编码器(SAEs)广泛用于大型语言模型的机制可解释性,但其公式为每个潜在特征分配单个解码器方向,隐含地假设特征是一维的。我们证明这一假设与模型特征的多维结构不匹配,通过两种不同机制可证明地诱导特征分裂。从几何角度看,用单方向解码器重构内在维度$d_i \ge 2$的特征到误差$\varepsilon$,所需的原子数量随$d_i$呈指数增长。从端到端优化角度看,这种分裂不仅是可能的,而且是主动偏好的。我们证明存在一条从真实的$d_i$维基到$\ell_1$正则化SAE目标严格更低风险的连续路径,其下降方向驱使任何训练字典进入该指数区域。因此,一个单一连贯的特征被碎片化到许多近乎共线的潜在变量中,产生虚假的多重性并掩盖内在几何结构。受此启发,我们引入子空间感知稀疏自编码器(SASA),用学习的解码器子空间替换单向量解码器,通过Top-$s$组门控强制块稀疏性,并用核范数正则化器适应每个组的有效秩。然后我们证明,一旦块大小满足$r \ge d_i$,单个组不仅能表示整个特征切片,而且是SASA目标的全局最小值。这种整合产生样本复杂度关于$d_i$的多项式而非指数——鉴于每次训练激活都需要LLM前向传递,这是一个决定性优势。实验上,在GPT-2和Mistral-7B上,SASA减少了特征分裂和吸收,提高了单义性和可解释性,并且在约一半的token预算下训练,性能匹配或超过标准SAE。

英文摘要

Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption is mismatched with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of intrinsic dimension $d_i \ge 2$ to error $\varepsilon$ with single-direction decoders forces a number of atoms that is exponential in $d_i$. From an end-to-end optimization perspective, this splitting is not merely possible but actively preferred. We prove that there exists a continuous path from the true $d_i$-dimensional basis to a strictly lower risk of the $\ell_1$-regularized SAE objective, whose descent directions drive any trained dictionary into that exponential regime. A single coherent feature is therefore fragmented across many near-collinear latents, producing spurious multiplicity and obscuring the intrinsic geometry. Motivated by this, we introduce Subspace-Aware Sparse Autoencoders (SASA), which replace single-vector decoders with learned decoder subspaces, enforce block sparsity via Top-$s$ group gating, and adapt each group's effective rank with a nuclear-norm regularizer. We then show that once the block size satisfies $r \ge d_i$, a single group not only can represent the entire feature slice but is the global minimizer of the SASA objective. This consolidation yields a sample complexity polynomial in $d_i$ rather than exponential -- a decisive advantage given that every training activation costs an LLM forward pass. Empirically, on GPT-2 and Mistral-7B, SASA reduces feature splitting and absorption, improves monosemanticity and interpretability, and matches or exceeds standard SAEs while training on roughly half the token budget.
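
To make the two structural ingredients concrete, here is a small NumPy sketch of Top-$s$ group gating and a per-group nuclear-norm penalty. It is an illustration only, not the SASA code. Scoring groups by their L2 norm, the `(groups, r, d_model)` decoder layout, and the function names are assumptions that the abstract does not confirm.

```python
import numpy as np


def top_s_group_gate(pre_acts, group_size, s):
    """Keep the s groups with the largest L2 norm in each row; zero the rest."""
    n, d = pre_acts.shape
    assert d % group_size == 0, "latent width must be a multiple of group_size"
    groups = pre_acts.reshape(n, d // group_size, group_size)
    norms = np.linalg.norm(groups, axis=2)      # one score per group, shape (n, G)
    keep = np.argsort(-norms, axis=1)[:, :s]    # indices of the s largest groups
    mask = np.zeros(norms.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (groups * mask[:, :, None]).reshape(n, d)


def nuclear_norm_penalty(decoder_blocks):
    """Sum of nuclear norms over decoder blocks of shape (G, r, d_model)."""
    return float(sum(np.linalg.norm(block, ord="nuc") for block in decoder_blocks))
```

With `group_size=2` and `s=2`, the row `[3, 4, 0.1, 0.1, 1, 0, 5, 12]` keeps only the groups `[3, 4]` and `[5, 12]`. The nuclear norm sums a block's singular values, so penalizing it pushes each group toward using fewer than `r` effective dimensions.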

2606.06328 2026-06-05 cs.LG cs.AI 版本更新

PAMF: Prior-Aware Multimodal Fusion for Incomplete Time Series Data

PAMF: 面向不完整时间序列数据的先验感知多模态融合

Ziwen Kan, Wugeng Zheng, Tianlong Chen, Song Wang

发表机构 * Department of Computer Science, University of Central Florida(中央佛罗里达大学计算机科学系) Department of Computer Science, University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校计算机科学系)

AI总结 提出PAMF框架,通过先验感知流匹配和权重共享显式处理模态内缺失和模态级缺失,将插补与下游预测耦合,提升多模态医疗时间序列任务性能。

Comments 5 figures. arXiv preprint version

详情
AI中文摘要

在医疗保健中,多模态时间序列任务在实践中通常处理不完整的观测,例如当电极脱落导致心电图片段丢失或夜间监测期间整个呼吸通道不可用时。这种缺失通常表现为两种结构上不同的模式:模态内缺失,即在某个观测模态内值缺失;以及模态级缺失,即整个模态不可用。现有方法通常通过掩码或缺失嵌入隐式表示未观测数据,而不学习实例特定的缺失信息,且大多数方法仅针对一种缺失模式设计。一种自然的方法是显式估计缺失数据;然而,现有的插补方法尽管缺失具有不同的结构先验,却统一处理缺失,并且插补过程通常与下游任务隔离,阻止下游任务引导插补朝向更具信息性的表示。为了解决这些局限性,我们提出了PAMF,一个多模态时间序列框架,它显式处理不同的缺失模式,同时通过先验感知流匹配和权重共享将插补与下游预测耦合。具体来说,该方法使用类型特定的先验初始化流匹配源状态,以区分两种缺失类型。它进一步通过架构匹配的编码器与权重共享连接插补和分类,将任务相关表示转移到插补过程中。在多个多模态医疗时间序列基准上的实验表明,与现有基线相比,所提出的方法在多样化的数据集和缺失设置下实现了最强的整体下游性能。

英文摘要

In healthcare, multimodal time series tasks often operate on incomplete observations in practice, for example when ECG segments are lost because electrodes detach or an entire respiratory channel is unavailable during overnight monitoring. Such missingness typically appears in two structurally distinct patterns: within-modality missing, where values are absent within an otherwise observed modality, and modality-level missing, where an entire modality is unavailable. Existing methods typically represent unobserved data implicitly through masks or missing embeddings, without learning instance-specific missing information, and most are designed for only one missingness pattern. A natural approach is to explicitly estimate the missing data; however, existing imputation methods treat missingness uniformly despite their different structural priors, and the imputation process is often isolated from downstream tasks, preventing downstream tasks from guiding imputation toward more informative representations. To address these limitations, we present PAMF, a multimodal time-series framework that explicitly handles different missingness patterns while coupling imputation with downstream prediction through prior-aware flow matching and weight sharing. Specifically, the method initializes the flow-matching source state with type-specific priors to distinguish two missing types. It further connects imputation and classification through architecturally matched encoders with weight sharing, transferring task-relevant representations into the imputation process. Experiments on multiple multimodal healthcare time-series benchmarks show that the proposed method achieves the strongest overall downstream performance across diverse datasets and missing settings compared with existing baselines.

2606.06322 2026-06-05 cs.AI 版本更新

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

DragOn:基于拖拽的GUI交互基准与数据集

Nathan Bout, Maxime Langevin, Ronan Riochet

AI总结 针对GUI代理在拖拽操作(如拖放、滑动、高亮)上的性能不足,提出DragOn基准和训练数据集,涵盖文本高亮、单元格选择、元素缩放和滑块操作四个领域,包含28.6万张训练截图和350万个训练任务,评估了多个模型并显示数据集能提升下游任务性能。

详情
Journal ref
Published as a workshop paper at SCALE - 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
AI中文摘要

GUI代理——通过图形用户界面控制桌面、网页浏览器和移动设备的视觉模型——有望自动化广泛的数字任务。虽然百万级数据集在点击定位方面取得了显著进展,但拖拽定位(例如拖放、滑动、高亮)的数据规模仍小一个数量级,当前模型在复杂的基于拖拽的交互上表现不足。我们引入了DragOn,一个拖拽定位基准和训练数据集,涵盖四个领域:文本高亮、单元格选择、元素缩放和滑块操作。该数据集包含28.6万张训练截图和350万个训练任务,外加一个2000个样本的保留评估集。我们评估了专有模型(GPT、Claude)和开源模型(Qwen、Kimi、Holo),以及在我们训练数据上微调的Qwen VLM。结果表明,我们的数据集可以提升最先进模型在下游计算机使用任务上的性能。

英文摘要

GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.

2606.06320 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance

学习遗忘什么:通过习得的词元级重要性改进大语言模型遗忘

Gizem Yüce, Giorgos Nikolaou, Nicolas Flammarion

发表机构 * Theory of Machine Learning Lab, EPFL(机器学习理论实验室,EPFL)

AI总结 提出交替词元加权遗忘(ATWU)框架,通过联合学习词元遗忘特异性和模型参数,在无外部监督下实现最优的遗忘-保留权衡。

详情
AI中文摘要

机器遗忘旨在从训练好的模型中移除特定知识,同时保留其通用能力。对于自回归语言模型,遗忘样本中的并非所有词元都与遗忘同等相关。现有方法要么忽略这种异质性,要么依赖辅助模型、启发式方法或外部标注来估计每个词元对遗忘的相关性。我们转而通过其与保留目标的交互来刻画这种相关性:一个词元是遗忘特异性的,其程度取决于在该词元上最小化遗忘损失不与保留最优性冲突。我们将这一视角形式化为一个关于模型参数和词元权重的联合优化问题,并证明在自然分离条件下,所得目标能够恢复 oracle 遗忘特异性词元支持。受此公式启发,我们引入了交替词元加权遗忘(ATWU),这是一个轻量级框架,在遗忘过程中通过一个基于隐藏状态的简单线性评分器联合学习词元遗忘特异性和模型参数,无需外部词元级监督。在 TOFU 和 RWKU 上,ATWU 实现了最先进的遗忘-保留权衡,优于样本级方法、基于概率的词元加权启发式方法和基于辅助模型的方法。此外,学习到的分数与真实遗忘特异性跨度显著更好地对齐,表明 ATWU 识别了语义上有意义的词元级遗忘信号。总体而言,我们的结果表明,保留冲突为识别语言模型应遗忘什么提供了有效标准,使得能够直接从模型表示中以最小计算开销无监督学习词元级遗忘特异性。

英文摘要

Machine unlearning aims to remove targeted knowledge from a trained model while preserving its general capabilities. For autoregressive language models, not all tokens in a forget sample are equally relevant to forgetting. Existing approaches either ignore this heterogeneity or rely on auxiliary models, heuristics, or external annotations to estimate each token's relevance for forgetting. We instead characterize it through the interaction with the retain objective: a token is forget-specific to the extent that minimizing the forget loss on that token does not conflict with retain optimality. We formalize this perspective as a joint optimization problem over the model parameters and the token weights and show that, under a natural separation condition, the resulting objective recovers the oracle forget-specific token support. Motivated by this formulation, we introduce Alternating Token-Weighted Unlearning (ATWU), a lightweight framework that jointly learns token forget-specificity and model parameters during unlearning using a simple linear scorer over the hidden states, without external token level supervision. Across TOFU and RWKU, ATWU achieves state of the art forget-retain trade-offs, outperforming sample-level methods, probability-based token weighting heuristics, and auxiliary-model-based approaches. Moreover, the learned scores align substantially better with ground truth forget-specific spans, indicating that ATWU identifies semantically meaningful token level forgetting signals. Overall, our results suggest that retain conflict provides an effective criterion for identifying what language models should forget, enabling unsupervised learning of token level forget-specificity directly from model representations with minimal computational overhead.

2606.06316 2026-06-05 quant-ph cs.AI cs.DS 版本更新

Quantum enhanced rare event discovery and sampling

量子增强的罕见事件发现与采样

Naixu Guo, Po-Wei Huang, Qisheng Wang, Jayne Thompson, Patrick Rebentrost, Mile Gu, Chengran Yang

发表机构 * Centre for Quantum Technologies, National University of Singapore(量子技术中心,新加坡国立大学) Mathematical Institute, University of Oxford(牛津大学数学研究所) School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) School of Informatics, University of Edinburgh(爱丁堡大学信息学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) Nanyang Quantum Hub, School of Physical and Mathematical Sciences, Nanyang Technological University(南洋量子中心,南洋理工大学物理与数学科学学院)

AI总结 针对概率极低的罕见事件发现与采样问题,提出一种无需预先知道事件类型的量子算法,实现了与稀有度阈值的最优量子标度,并在重尾系统和稳态随机过程中分别获得二次加速和鲁棒多项式加速。

Comments 36 pages (8+28)

详情
AI中文摘要

金融崩溃、基础设施的级联故障以及AI系统中的关键错误通常由概率极小的事件触发。因此,高效发现和采样概率低于阈值的事件具有关键意义。然而,使用现有的经典或量子方法,这一任务极具挑战性。由于事件罕见,需要巨大的采样开销才能收集足够的数据样本。此外,由于罕见事件事先未知,无法使用标准技术标记以进行放大。在此,我们提出了一种量子算法,用于罕见事件发现和采样,而无需事先学习哪些事件是罕见的。该算法实现了与稀有度阈值的最优量子标度。我们进一步证明,对于尾部总质量非零的重尾系统,这可以实现二次加速,并且对于稳态随机过程,转化为鲁棒的多项式加速,其指数由其熵率结构决定。

英文摘要

Financial crashes, cascading failures in infrastructure, and critical errors in AI systems are frequently triggered by events that occur with extremely small probability. Efficiently discovering and sampling events with probability below a threshold is therefore of critical interest. Yet this task is highly non-trivial using existing classical or quantum methods. Being rare, such events require an immense sampling overhead to collect sufficient data samples. Moreover, because the rare events are not known in advance, they cannot be flagged for amplification using standard techniques. Here, we introduce a quantum algorithm for rare-event discovery and sampling without first learning which events are rare. The algorithm achieves the optimal quantum scaling with the rarity threshold. We further demonstrate that this can achieve a quadratic speedup for heavy-tailed systems whose tail has nonvanishing total mass, and translates into a robust polynomial speedup for stationary stochastic processes, with the exponent determined by its entropy-rate structure.

2606.06315 2026-06-05 cs.AI 版本更新

LLM Self-Recognition: Steering and Retrieving Activation Signatures

LLM 自我识别:引导与检索激活签名

Thibaud Ardoin, Jonas Schäfer, Gerhard Wunder

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 通过随机稀疏向量引导内部残差流,在LLM生成文本中嵌入可检测指纹,实现高精度归属,同时保持文本质量。

Comments To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

近期可解释性进展表明,大型语言模型(LLMs)在其生成的文本中隐式编码信号,使其能够自我识别输出。我们证明这种能力是可靠的,即使在低熵场景中也是如此,并且可以通过定向干预来增强。通过在生成过程中使用随机稀疏向量引导内部残差流,我们创建了一个可检测的指纹,从而能够将给定文本归属于特定的LLM。该信号可从用作检测器的LLM的激活中恢复,在多种检测设置中实现超过98%的准确率,同时保持生成文本的质量。随着AI生成内容的激增,这种方法通过利用模型自然的表示结构进行归属,而非外部嵌入信号,为传统检测器提供了实用替代方案。我们的贡献包括:(i) 在LLM中建立可靠的自我识别能力,(ii) 一种简单的引导机制,实现多LLM识别且无质量下降,(iii) 证明激活空间包含可被利用的结构,用于编码信号而不产生语义干扰。

英文摘要

Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable self-recognition of their outputs. We demonstrate that this capability is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that enables attribution of a given text to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors by leveraging the model's natural representation structure for attribution rather than embedding a signal externally. Our contributions include: (i) establishing reliable self-recognition capabilities in LLMs, (ii) a simple steering mechanism enabling multi-LLM identification with no quality degradation, (iii) demonstrating that activation spaces contain exploitable structure for encoding signals without semantic interference.
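
The steering idea can be illustrated on synthetic activations: add a fixed random sparse direction to every hidden state, then test for it by projection. This toy sketch uses Gaussian noise in place of a real residual stream and a simple mean-projection threshold in place of the LLM-based detector the abstract describes. The dimensions, the strength, the threshold, and all function names are assumptions.

```python
import numpy as np


def make_sparse_signature(dim, num_nonzero, rng):
    """Unit-norm vector with num_nonzero random +/-1 entries and zeros elsewhere."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=num_nonzero, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=num_nonzero)
    return v / np.linalg.norm(v)


def steer(hidden, signature, strength):
    """Add the signature direction to every token's hidden state."""
    return hidden + strength * signature


def detect(hidden, signature, threshold):
    """Flag the text if the mean projection onto the signature exceeds threshold."""
    score = float(np.mean(hidden @ signature))
    return score > threshold, score


rng = np.random.default_rng(0)
signature = make_sparse_signature(dim=512, num_nonzero=16, rng=rng)
plain = rng.standard_normal((256, 512))   # stand-in for unsteered activations
marked = steer(plain, signature, strength=4.0)
```

Because the signature has unit norm, steering shifts the mean projection by exactly `strength`. A threshold halfway between 0 and `strength` therefore separates the two cases in this synthetic setting.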

2606.06311 2026-06-05 cs.AI 版本更新

AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks

基于记忆增强神经网络的AIS船舶轨迹预测

Wonmo Koo, Sanha Chang, Heeyoung Kim

发表机构 * Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST)(工业与系统工程系,韩国科学技术院)

AI总结 本文提出使用记忆增强神经网络,基于AIS数据预测船舶轨迹,在墨西哥湾和纽约湾数据集上显著优于无外部记忆的深度学习基线。

详情
AI中文摘要

准确的船舶轨迹预测对于安全高效的海上作业至关重要,能够实现碰撞避免并支持航线优化。尽管记忆增强神经网络最近通过从外部记忆中选择性检索相关信息,在行人和道路车辆轨迹预测中表现出色,但其在船舶轨迹预测中的潜力尚未被充分探索。本文基于自动识别系统(AIS)数据,对基于记忆的轨迹预测进行了实证研究。在墨西哥湾和纽约湾数据集上的实验表明,与未集成外部记忆的多种深度学习基线相比,该方法持续且显著地提升了性能。

英文摘要

Accurate vessel trajectory prediction is essential for safe and efficient maritime operations, enabling collision avoidance and supporting route optimization. Although memory-augmented neural networks have recently shown strong performance in pedestrian and road-vehicle trajectory prediction by selectively retrieving relevant information from an external memory, their potential for vessel trajectory prediction remains underexplored. This paper presents an empirical investigation of memory-based trajectory prediction using Automatic Identification System (AIS) data. Experiments on data from the Gulf of Mexico and the New York Bight demonstrate consistent and substantial performance gains over a range of deep learning baselines that do not incorporate an external memory.

2606.06303 2026-06-05 cs.LG cs.AI 版本更新

Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction

基于梯度信息logit校正的离散扩散模型即插即用引导

Hongkun Dou, Zike Chen, Fengji Li, Hongjue Li, Yue Deng

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出GILC框架,通过将预训练去噪网络作为变分代理来估计引导信号,并引入无雅可比机制直接校正干净预测的logits,实现无需额外训练的离散扩散模型可控生成,在DNA、蛋白质序列和分子生成任务上达到最先进的性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

离散扩散模型的可控生成常常受到高计算开销或需要重新训练的限制。在本文中,我们提出了Gradient-Informed Logit Correction(GILC),这是一个即插即用框架,通过将预训练的去噪网络重新用作变分代理来高效估计引导信号。为了规避高维离散空间中固有的梯度不稳定性,我们引入了一种无雅可比机制,直接校正干净预测的logits,从而实现稳定且有效的引导。我们的方法适用于可微和不可微的奖励函数。在DNA、蛋白质序列和分子生成任务上的大量实验表明,GILC无需额外训练即可达到最先进的性能,并且常常优于微调方法。

英文摘要

Controllable generation with discrete diffusion models is often hindered by high computational overhead or the need for retraining. In this paper, we present Gradient-Informed Logit Correction (GILC), a plug-and-play framework that efficiently estimates guidance signals by repurposing the pretrained denoising network as a variational proxy. To circumvent the gradient instability inherent in high-dimensional discrete spaces, we introduce a Jacobian-free mechanism that directly corrects the clean prediction logits, facilitating stable and effective guidance. Our method accommodates both differentiable and non-differentiable reward functions. Extensive experiments across DNA, protein sequence, and molecular generation tasks demonstrate that GILC achieves state-of-the-art performance without additional training, frequently outperforming fine-tuning approaches.

2606.06300 2026-06-05 cs.AI 版本更新

Multi-ResNets for Subspace Preconditioning in Constrained Optimization

Multi-ResNets:约束优化中子空间预条件的多残差网络

Merve Karakas, Christopher J. Williams, Emmanuel O. Balogun, Sadegh Sadeghi Tabas, Christian Brown, Nikhil Rao

发表机构 * UCLA(加州大学洛杉矶分校) University of Oxford(牛津大学) Tapestry, Google(谷歌Tapestry);作者按字母顺序排列,贡献相同(Alphabetical ordering, authors contributed equally to this work)

AI总结 提出一种分阶段残差神经网络架构MResOpt,通过按优先级分解约束满足和阶段感知损失,在预测-补全-校正流水线中实现基于领域知识的有序约束满足,并在理想无限宽条件下表现为序列高斯过程回归,显著降低高优先级约束违反。

详情
AI中文摘要

我们提出MResOpt,一种用于约束优化问题的分阶段残差神经网络架构。该架构适用于预测-补全-校正流水线,并通过中间重新补全和阶段感知损失按优先级分解约束满足。该框架支持基于领域知识的有序约束满足,使网络能够在存在序结构时加以利用。在理想化的无限宽条件下,我们证明该设计表现为序列高斯过程回归。在合成QP、QCQP和SOCP基准测试中,分阶段架构在凸和非凸设置中均改善了高优先级约束的满足情况。在带线路潮流约束的交流最优潮流(AC OPF)问题上,我们引入了一种物理驱动的约束排序,并表明MResOpt支持一种学习得到的分工,使迭代保持在等式流形上,与重投影基线相比,高优先级约束违反量显著更低,同时保持计算效率。

英文摘要

We propose MResOpt, a staged residual neural network architecture for constrained optimization problems. Our architecture fits within predict-complete-correct pipelines and decomposes constraint satisfaction by priority through intermediate re-completion and stage-aware losses. The framework enables domain-informed ordered constraint satisfaction which allows the network to utilize ordinal structure when present. Under an idealized infinite-width regime, we show that our design behaves as sequential Gaussian Process regression. On synthetic QP, QCQP, and SOCP benchmarks, the staged architecture improves high-priority constraint satisfaction across convex and non-convex settings. On line-flow-constrained AC optimal power flow, we introduce a physics-motivated constraint ordering and show that MResOpt supports a learned division of labor that keeps iterates on the equality manifold, achieving substantially lower high-priority violation than reprojected baselines while remaining computationally efficient.

2606.06294 2026-06-05 cs.CV cs.AI 版本更新

Towards One-to-Many Temporal Grounding

面向一对多时间定位

Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对一对多时间定位(OMTG)任务,提出包含基准、数据集和奖励函数的系统解决方案,显著提升多段视频定位性能。

Comments Accepted to ICML'26

详情
AI中文摘要

时间定位(TG)旨在定位与文本查询对应的视频片段。先前研究主要关注单段检索。然而,现实场景通常需要为单个查询定位多个不连续片段——我们将其称为一对多时间定位(OMTG)。先前最先进的MLLMs针对一对一设置优化,在此场景下表现不佳,由于缺乏事件基数感知,往往得到近乎零的分数。为弥补这一差距,我们提出一个包含三项关键贡献的系统解决方案。首先,我们建立了首个全面的OMTG基准,引入计数准确率(C-Acc)和有效时间F1(EtF1)作为评估指标。其次,我们通过一个复杂的构建流程,整理了一个包含56k样本的高质量OMTG数据集。第三,我们开发了专门针对OMTG的新型时间奖励和描述奖励函数。特别地,描述奖励利用密集视频描述上的思维链推理,明确引导策略优化以实现精确性和完整性。大量实验表明,我们的模型在OMTG基准上达到了43.65%的最新EtF1,分别超过Gemini 2.5 Pro和Seed-1.8达15.85%和15.61%。

英文摘要

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.

2606.06286 2026-06-05 cs.CL cs.AI 版本更新

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

LLMs 可能泄露训练数据,但它们愿意吗?一种基于倾向性的 LLM 记忆评估

Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech

发表机构 * University of Southern Denmark(南丹麦大学)

AI总结 提出 PropMe 框架,通过对比前缀攻击与非对抗评估,揭示 LLM 在非对抗设置下很少泄露训练数据,并引入 SimpleTrace 流水线进行归因和度量。

详情
AI中文摘要

大型语言模型可以重现训练数据,但现有的记忆评估大多衡量模型是否可以被强制这样做,而不是在正常使用下是否会这样做。我们引入了 PropMe,一个倾向性感知的记忆评估框架,对比了基于前缀的能力攻击与非对抗性评估。我们提出了一种度量转换方法,将其应用于现有函数即可得到倾向性度量。我们进一步引入了 SimpleTrace,一个基于 infini-gram 的轻量级追踪流水线,能够确定性地将模型生成归因于大规模训练语料库,并计算逐字、近逐字和倾向性转换的记忆度量。我们在两个数据集(Common Pile 和 Dynaword)和两种语言上评估了两个完全开放的模型(Comma 和 DFM Decoder),发现能力与倾向性之间存在一致差距:前缀攻击比通用或数据集特定提示引发明显更强的记忆信号,而倾向性得分总体保持较低。因此,模型在被直接诱导时可以泄露训练数据,但在更常见的非对抗设置中很少这样做。我们还发现,从 Comma 持续预训练的 DFM Decoder 对 Common Pile 表现出更低的记忆和记忆倾向性,证实当后续训练强调部分不同的数据时,记忆能力可能下降。我们的结果表明,记忆审计应同时报告最坏情况下的可提取性和普通使用下的泄露倾向性,以便更全面地理解这一现象,我们也鼓励这样做。

英文摘要

Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, when applied to existing functions, yields propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully open models (Comma and DFM Decoder) on two datasets (Common Pile and Dynaword) in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest that memorization audits should report both worst-case extractability and ordinary leakage propensity, and we encourage this practice, in order to give a more comprehensive view of this phenomenon.

2606.06285 2026-06-05 cs.AI 版本更新

TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models

TRACE: 面向多模态时间序列基础模型的时间条件估计

Ziwen Kan, Yishuo Chen, Kecheng Li, Andrew Wen, Xiaomeng Wang, Liwei Wang, Jihao Duan, Song Wang, Hongfang Liu, Tianlong Chen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出TRACE条件估计范式,通过利用可用辅助模态推断缺失目标模态,解决多模态时间序列中的时间错位和部分模态缺失问题,在医疗和情感分析基准上优于现有融合方法。

Comments 5 figures and 5 tables in the main paper, plus appendix

详情
AI中文摘要

时间序列基础模型旨在学习可泛化的时间表示,以适应广泛的下游任务。在现实世界的多模态设置中,时间序列经常受到时间错位和部分模态缺失的影响,其中不同模态以异质时间尺度被观测或部分缺失。现有方法通常依赖简单的插补或掩码策略,未能考虑跨模态依赖,往往导致错位或退化的表示。我们提出TRACE,一种用于缺失和不规则采样下多模态时间序列基础模型管道的条件估计范式,允许从可用的辅助模态中系统地推断不完整的目标模态。我们在涵盖医疗和情感计算的多个多模态基准上评估TRACE,包括MIMIC-IV临床数据集以及用于多模态情感分析的CMU-MOSI和CMU-MOSEI基准。在一系列下游预测任务和缺失模态设置中,TRACE始终优于先前的多模态融合方法,展示了对严重模态缺失更强的鲁棒性和更可靠的跨模态表示。

英文摘要

Time series foundation models (TS-FMs) aim to learn generalizable temporal representations that can be adapted to a wide range of downstream tasks. In real-world multimodal settings, time series are frequently affected by temporal misalignment and partial modality missingness, where different modalities are observed at heterogeneous time scales or are partially absent. Existing approaches typically rely on naive imputation or masking strategies, which fail to account for cross-modal dependencies and often lead to misaligned or degraded representations. We propose TRACE, a conditional estimation paradigm for multimodal time series foundation model pipelines under missingness and irregular sampling, allowing incomplete target modalities to be systematically inferred from available auxiliary modalities. We evaluate TRACE on diverse multimodal benchmarks spanning healthcare and affective computing, including the MIMIC-IV clinical dataset and the CMU-MOSI and CMU-MOSEI benchmarks for multimodal sentiment analysis. Across a range of downstream prediction tasks and missing-modality settings, TRACE consistently outperforms prior multimodal fusion approaches, demonstrating improved robustness to severe modality missingness and more reliable cross-modal representations.

2606.06284 2026-06-05 cs.AI 版本更新

ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

ToolChoiceConfusion: 因果最小工具过滤实现可靠LLM智能体

Rahul Suresh Babu, Laxmipriya Ganesh Iyer

发表机构 * Independent Researcher(独立研究者) United States of America(美国)

AI总结 提出因果最小工具过滤(CMTF)方法,通过因果充分性选择工具,减少错误工具调用和令牌成本,在102个任务、100个工具、4个LLM后端的基准测试中,将可见工具从100个减少到每步1个,令牌使用降低约90%。

详情
AI中文摘要

大型语言模型智能体越来越依赖外部工具,但更大的工具菜单会通过增加错误工具调用、过早行动和令牌成本来降低可靠性和效率。现有的工具选择方法通常优化语义相关性,暴露名称或描述与用户请求匹配的工具。我们认为相关性是不够的:一个工具可能与任务相关,但在当前步骤仍然是不必要或过早的。我们提出因果最小工具过滤(CMTF),一种无需训练的方法,通过因果充分性选择工具。CMTF使用轻量级前提-效果契约,仅暴露从当前状态向用户目标推进所需的最小下一步工具前沿。在多步骤工具使用任务中,我们将CMTF与全工具暴露、关键词检索、状态感知过滤和因果路径消融进行比较,衡量任务成功率、错误工具调用、过早行动、工具暴露和令牌成本。在包含102个任务、100个工具、四个LLM后端和2448个任务-方法-模型运行的主要基准测试中,CMTF在总体成功率上与最强的因果基线持平,同时将可见工具从100个减少到每步1个,并且相对于全工具暴露将令牌使用减少约90%。

English abstract

Large language model agents increasingly rely on external tools, but larger tool menus can reduce reliability and efficiency by increasing wrong-tool calls, premature actions, and token cost. Existing tool-selection methods often optimize semantic relevance, exposing tools whose names or descriptions match the user request. We argue that relevance is insufficient: a tool may be related to the task while still being unnecessary or premature at the current step. We propose Causal Minimal Tool Filtering (CMTF), a training-free method that selects tools by causal sufficiency. CMTF uses lightweight precondition-effect contracts to expose only the minimal next-step tool frontier needed to advance from the current state toward the user goal. Across multi-step tool-use tasks, we compare CMTF with all-tools exposure, keyword retrieval, state-aware filtering, and causal-path ablations, measuring task success, wrong-tool calls, premature actions, tool exposure, and token cost. In the main benchmark with 102 tasks, 100 tools, four LLM backends, and 2448 task-method-model runs, CMTF matches the strongest causal baseline in aggregate success while reducing visible tools from 100 to one per step and reducing token usage by about 90% relative to all-tools exposure.
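
The "minimal next-step tool frontier" can be illustrated with a short Python sketch. It assumes each tool carries a hypothetical contract of precondition facts and effect facts, and it uses a plain breadth-first search to find a shortest tool sequence. The abstract does not say how CMTF computes its frontier, so `ToolContract`, `plan_to_goal`, `next_step_frontier`, and the example tools are illustrative names, not the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical precondition-effect contract for one tool."""
    name: str
    preconditions: frozenset  # facts that must already hold
    effects: frozenset        # facts the tool makes true


def plan_to_goal(state, goal, tools):
    """Breadth-first search for a shortest tool sequence from state to goal."""
    start, goal = frozenset(state), frozenset(goal)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        facts, path = queue.popleft()
        if goal <= facts:
            return path
        for tool in tools:
            if tool.preconditions <= facts:
                nxt = facts | tool.effects
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [tool.name]))
    return None  # goal unreachable with these contracts


def next_step_frontier(state, goal, tools):
    """Expose only the first tool on a shortest path (assumed reading of 'frontier')."""
    path = plan_to_goal(state, goal, tools)
    return [path[0]] if path else []


# Illustrative tool menu; the names are made up for this example.
TOOLS = [
    ToolContract("search_flights", frozenset(), frozenset({"flights_found"})),
    ToolContract("book_flight", frozenset({"flights_found"}), frozenset({"flight_booked"})),
    ToolContract("send_email", frozenset(), frozenset({"email_sent"})),
]
```

With the goal `{"flight_booked"}`, an empty state exposes only `search_flights`, and `book_flight` becomes visible once `flights_found` holds. `send_email` is never exposed, even though a keyword match might consider it relevant to a travel request.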

2606.06273 2026-06-05 cs.IT cs.AI math.IT Updated version

Adapting Diffusion Language Models for Lossless Pixel-Level Image Transmission

适应扩散语言模型用于无损像素级图像传输

Tianqi Ren, Rongpeng Li, Xianfu Chen, Yingyu Li, Zhifeng Zhao

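Affiliations: College of Information Science and Electronic Engineering, Zhejiang University; Shenzhen CyberAray Network Technology Co., Ltd; School of Mechanical Engineering and Electronic Information, China University of Geosciences; Zhejiang Lab

AI summary: Proposes DDM-SSCC, a separate source-channel coding framework based on a discrete diffusion model. It achieves lossless pixel-level image transmission through synchronized reverse arithmetic coding under bidirectional attention, and adds a Halton-guided denoising order, a mask-ratio-aware cosine schedule, and a lightweight temperature calibration module to improve performance.

AI Chinese abstract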

无损像素级图像传输是超越语义通信的基本机制,因为精确恢复需要准确的符号概率建模和通过噪声信道的可靠传输。本文提出DDM-SSCC,一种基于离散扩散模型的分离源信道编码框架,用于无损图像传输。与光栅顺序自回归编码不同,所提出的源编解码器将扩散语言模型适应于像素令牌恢复,并在双向注意力下执行同步逆向算术编码,允许在一个逆向去噪步骤中对多个掩码令牌进行编码。这种渐进恢复过程也为噪声传输产生了更有利的源表示,因为新恢复的令牌可以在后续去噪步骤中作为双向上下文。为了弥合面向生成的掩码去噪与无损算术编码之间的差距,我们进一步引入了Halton引导的去噪顺序、掩码率感知的余弦调度和轻量温度校准模块。这些设计分别改善了空间覆盖、使去噪速度适应上下文可靠性,并校准了算术编码使用的概率表。在CIFAR10、DIV2K-LR-X4和Kodak数据集上,针对加性高斯白噪声和瑞利衰落信道的实验表明,DDM-SSCC比代表性的无损和语义通信基线实现了更好的精确恢复性能,而消融研究验证了所提出的去噪顺序、调度和校准模块的有效性。

English abstract

Lossless pixel-level image transmission is a fundamental regime beyond semantic communications, because exact recovery requires both accurate symbol probability modeling and reliable delivery over noisy channels. This paper proposes DDM-SSCC, a discrete-diffusion-model-based separate source-channel coding framework for lossless image transmission. Different from raster-order autoregressive coding, the proposed source codec adapts a diffusion language model to pixel-token restoration and performs synchronized reverse arithmetic coding under bidirectional attention, allowing multiple masked tokens to be coded within one reverse denoising step. This progressive restoration process also yields a more favorable source representation for noisy transmission, since newly restored tokens can serve as bidirectional context in subsequent denoising steps. To bridge the gap between generation-oriented masked denoising and lossless arithmetic coding, we further introduce a Halton-guided denoising order, a mask-ratio-aware cosine schedule, and a lightweight temperature calibration module. These designs respectively improve spatial coverage, adapt the denoising pace to context reliability, and calibrate the probability tables used by arithmetic coding. Experiments on CIFAR10, DIV2K-LR-X4, and Kodak over additive white Gaussian noise and Rayleigh fading channels show that DDM-SSCC achieves better exact-recovery performance than representative lossless and semantic communication baselines, while ablation studies verify the effectiveness of the proposed denoising order, schedule, and calibration modules.
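
To show why a Halton-guided order helps spatial coverage, here is a minimal Python sketch of one plausible construction. The abstract does not specify how the order is built, so this assumes a 2-D Halton sequence with bases 2 and 3 mapped onto the pixel grid, skipping positions already visited. `radical_inverse` and `halton_order` are illustrative names.

```python
def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_order(height, width):
    """Visit every cell of a height x width grid in 2-D Halton order (bases 2 and 3)."""
    order, seen, i = [], set(), 1
    while len(order) < height * width:
        # Scale each coordinate from [0, 1) to a grid index; clamp guards float rounding.
        row = min(int(radical_inverse(i, 2) * height), height - 1)
        col = min(int(radical_inverse(i, 3) * width), width - 1)
        if (row, col) not in seen:
            seen.add((row, col))
            order.append((row, col))
        i += 1
    return order
```

Unlike a raster scan, early positions in this order are spread across the whole image, so tokens restored in the first denoising steps give later steps context from every region.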

2606.06272 2026-06-05 cs.LG cs.AI Updated version

Your GFlowNet Secretly Learns an Optimal Transport Plan

你的GFlowNet秘密学习了一个最优传输方案

Ian Maksimov, Nikita Morozov, Denis Belomestny, Sergey Samsonov


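AI summary: Establishes a theoretical connection between non-acyclic generative flow networks and optimal transport, showing that the policy learned by a minimum-flow GFlowNet encodes an optimal transport plan from the source distribution to the target distribution.

Comments ICML 2026 SPIGM Workshop

AI Chinese abstract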

生成流网络(GFlowNets)是一个通过有向图中的随机轨迹对结构化对象进行采样的框架。在这项工作中,我们建立了非无环GFlowNets与最优传输(OT)之间的理论联系。我们证明,在最小流GFlowNet中固定初始流分布会将其目标函数简化为具有图诱导最短路径成本的Kantorovich OT问题。因此,在最优解处,学习到的GFlowNet策略编码了从源分布到目标分布的最优传输方案:我们表明,从最小流GFlowNet中采样轨迹可以恢复相应的最优耦合。我们的公式通过边流和神经参数化,使得将GFlowNet学习框架应用于大规模图上的OT问题成为可能。实验证实了与精确OT求解器的一致性,并展示了GFlowNets可以学习高质量的传输方案。

English abstract

Generative Flow Networks (GFlowNets) are a framework for sampling structured objects via stochastic trajectories in a directed graph. In this work, we establish a theoretical connection between non-acyclic GFlowNets and optimal transport (OT). We show that fixing the initial flow distribution in a minimum-flow GFlowNet reduces its objective to a Kantorovich OT problem with graph-induced shortest path costs. At the optimum, the learned GFlowNet policy therefore encodes an optimal transport plan from the source distribution to the target distribution: we show that sampling trajectories from the minimum-flow GFlowNet recovers the corresponding optimal coupling. Our formulation enables applying the GFlowNet learning framework to OT problems on large graphs via edge flows and neural parameterization. Experiments confirm agreement with exact OT solvers and demonstrate that GFlowNets can learn high-quality transport plans.
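
For reference, the Kantorovich problem with a graph-induced shortest-path cost can be written as below. The notation is assumed for illustration and is not taken from the paper: $V$ is the vertex set, $\mu$ and $\nu$ are the source and target distributions, $\pi$ is a coupling, and $d_G(x,y)$ is the shortest-path cost from $x$ to $y$.

```latex
\min_{\pi \in \Pi(\mu,\nu)} \; \sum_{x \in V} \sum_{y \in V} d_G(x,y)\,\pi(x,y),
\qquad
\Pi(\mu,\nu) = \Bigl\{ \pi \ge 0 \;:\; \sum_{y \in V} \pi(x,y) = \mu(x),\;
                                      \sum_{x \in V} \pi(x,y) = \nu(y) \Bigr\}
```

The abstract's claim is that a minimum-flow GFlowNet with a fixed initial flow distribution optimizes an objective of this form, so its optimal policy induces an optimal coupling $\pi$.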

2606.06261 2026-06-05 cs.NI cs.AI cs.ET cs.MA Updated version

DAST: A VLM-LLM Framework for Cross-Interface Anomaly Detection in O-RAN

DAST: 面向O-RAN跨接口异常检测的VLM-LLM框架

Francesco Spinelli, Esteban Municio, Pau Baguer, Gines Garcia-Aviles, Xavier Costa-Perez

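Affiliations: i2CAT Foundation; NEC Laboratories Europe; ICREA

AI summary: Proposes DAST, a zero-shot multi-agent framework that uses a three-stage VLM → LLM → VLM pipeline to convert multivariate KPI streams into visual representations and combine them with domain knowledge for cross-interface anomaly detection. It outperforms existing TSAD methods on a real O-RAN testbed.

Comments 7 pages, 5 figures. This work has been submitted to the IEEE for possible publication

AI Chinese abstract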

O-RAN实现了可编程功能的分解基带栈,这些功能通过标准化开放接口通信。这种支持多厂商组合的开放性也扩展了构成计算连续体的逻辑解耦层之间的攻击面。在这些威胁中,占已编目O-RAN威胁大部分的拒绝服务和性能降级攻击尤其难以检测。传统的时序异常检测(TSAD)方法在这种新机制下失效,因为标记基线稀缺、威胁演化速度快于检测器重训练速度,且高维多变量遥测数据压倒了单一推理模型。为应对这些挑战,我们提出DAST,一种用于O-RAN跨接口异常检测的零样本多智能体框架,它串联了一个三级VLM→LLM→VLM流水线。DAST将多变量KPI流转换为视觉表示,根据O-RAN领域知识对每个接口的文本描述进行评分,并在高分辨率热图上验证可疑点,输出问题接口、异常时间区间、指示性的O-RAN WG11对齐操作影响评级以及决策理由。我们在从O-RAN测试平台收集的真实网络轨迹上评估DAST,在代表性性能降级场景下实现了0.910的F1分数和0.843的准确率,优于最先进的TSAD基线。

English abstract

O-RAN enables a disaggregated baseband stack with programmable functions that communicate over standardized open interfaces. The same openness that enables multi-vendor composition also expands the attack surface across logically decoupled tiers that make up the compute continuum. Among these threats, Denial-of-Service and performance-degradation attacks, which account for the majority of catalogued O-RAN threats, are particularly difficult to detect. Traditional Time-Series Anomaly Detection (TSAD) methods fail in this new regime where labelled baselines are scarce, threats evolve faster than detectors can be retrained, and the high-dimensional multivariate telemetry overwhelms monolithic inference models. To address these challenges, we present DAST, a zero-shot multi-agent framework for cross-interface anomaly detection in O-RAN that chains a three-stage VLM → LLM → VLM pipeline. DAST converts multivariate KPI streams into visual representations, scores textual per-interface descriptions against O-RAN domain knowledge, and verifies suspects on high-resolution heatmaps to output the problematic interfaces, the anomalous time intervals, an indicative O-RAN WG11-aligned operational impact rating, and the decision rationale. We evaluate DAST on real network traces collected from an O-RAN testbed under representative performance degradation scenarios, achieving an F1-score of 0.910 and an accuracy of 0.843, outperforming state-of-the-art TSAD baselines.

2606.06260 2026-06-05 cs.IR cs.AI cs.CL Updated version

OneReason Technical Report

OneReason 技术报告

OneRec Team, Biao Yang, Boyang Ding, Chenglong Chu, Dunju Zang, Fei Pan, Han Li, Hao Jiang, Honghui Bao, Huanjie Wang, Jian Liang, Jiangxia Cao, Jiao Ou, Jiaxin Deng, Jinghao Zhang, Kun Gai, Lu Ren, Peiru Du, Pengfei Zheng, Rongzhou Zhang, Ruiming Tang, Shiyao Wang, Siyang Mao, Siyuan Lou, Teng Shi, Wei Yuan, Wenlong Xu, Xingchen Liu, Xingmei Wang, Xinqi Jin, Yan Sun, Yan Wang, Yifei Hu, Yingzhi He, Yufei Ye, Yuhao Wang, Yunhao Zhou, Yuqin Dai, Zhao Liu, Zhipeng Wei, Zhixin Ling, Ziming Li, Zixing Zhang, Ziyuan Liu, An Zhang, Changxin Lao, Chaoyi Ma, Chengru Song, Defu Lian, Fan Yang, Guowang Zhang, Hao Peng, Jiayao Shen, Jie Chen, Jun Xu, Junmin Chen, Kun Zhang, Kuo Cai, Mingxing Wen, Minmao Wang, Minxuan Lv, Qi Zhang, Qiang Luo, Sheng Yu, Shijie Li, Shijie Yi, Shuang Yang, Shugui Liu, Shuni Chen, Tinghai Zhang, Tingting Gao, Xiang Wang, Xiangyu Wu, Xiangyu Zhao, Xiao Lv, Xiaoyou Zhou, Xuming Wang, Yong Du, Zejian Zhang, Zhaojie Liu, Zhiyang Zhang, Zhuang Zhuang, Ziqi Wang, Ziyi Zhao

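Affiliations: OneRec Team

AI summary: To address the difficulty of activating reasoning ability in generative recommendation models, proposes OneReason, which enables effective reasoning by strengthening perception and cognition.

Comments Work in progress

AI Chinese abstract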

OneRec 系列中的生成式推荐模型已广泛应用于短视频、直播、广告和电子商务等实际服务中。然而,这些生成模型只能受益于规模优势,其推理能力难以激活,因为我们无法构建仅由物品令牌组成的有意义的思维链序列。受大语言模型领域“先思考后回答”推理范式成功的启发,我们进行了初步研究(即 OneRec-Think、OpenOneRec)以探索生成式推荐中的推理能力。尽管如此,我们注意到一个意外现象:思考模式并未显示出优于非思考模式的优势。从多模态语言模型中关于思维链鲁棒性的最新发现中汲取见解,我们认为推荐中的有效推理依赖于两个因素:感知,即将物品令牌与其底层语言语义相联系的能力;以及认知,即将用户行为序列重组为连贯的潜在兴趣点的能力。因此,我们提出 OneReason,其中包括:(1)预训练中强大的物品令牌感知能力,(2)针对推荐任务的三级认知增强思维链格式在监督微调中,(3)在强化学习中采用先专化后统一的训练方案以增强思考能力。

English abstract

Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style "think before answer" paradigm in the LLM field, we conduct preliminary studies (i.e., OneRec-Think, OpenOneRec) to explore reasoning capability in generative recommendation. Nevertheless, we notice an unexpected phenomenon: the thinking mode does not show advantages over the non-thinking mode. Drawing insights from recent findings on CoT robustness in multi-modal language models, we argue that effective reasoning in recommendation rests on two factors: perception, the ability to ground itemic tokens in their underlying language semantics, and cognition, the ability to reorganize a user's behavior sequence into coherent latent interest points. We therefore propose OneReason, which includes: (1) strong itemic token perception in pre-training, (2) a three-level cognition-enhanced CoT format for recommendation tasks in SFT, and (3) a specialize-then-unify training recipe in RL to enhance the thinking ability.

2606.06256 2026-06-05 cs.AI Updated version

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

RedKnot: 基于头部感知的KV重用和SegPagedAttention的高效长上下文LLM服务

Yang Liu, ZhaoKai Luo, HuaYi Jin, ZhiYong Wang, RuoZhou He, BoYu Wang, Guanjie Chen, Junhao Hu

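Affiliations: Xiaohongshu Inc., China; Peking University; Huawei Cloud

AI summary: Proposes RedKnot, a system that decomposes the KV cache by KV head and uses SegPagedAttention to support position-independent KV reuse, prefix compression, hot/cold separation, and distributed placement, improving resource efficiency without retraining the model.

AI Chinese abstract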

随着大语言模型(LLM)服务输入长度的持续增长,KV缓存已成为AI基础设施中的主要瓶颈。它限制了GPU内存容量、服务并发性、缓存重用和分布式可扩展性。几个重要问题,包括位置无关的KV缓存、前缀KV缓存压缩、冷/热KV缓存分离和分布式KV缓存管理,都依赖于KV缓存的表示和管理方式。然而,现有的服务系统在很大程度上依赖于单一的KV缓存抽象,其中KV缓存被视为同质的token级内存块序列,并在注意力头和服务场景中采用类似的管理策略。我们观察到,KV缓存的效用在不同KV头之间具有高度结构性:不同的头表现出不同的功能角色、注意力距离和运行时重要性。因此,并非每个头、token范围或服务场景都需要完整的KV缓存。我们提出了RedKnot,一个用于LLM服务的头部感知KV缓存管理系统。RedKnot通过沿KV头分解KV缓存来打破传统的单一KV缓存抽象,这些KV头的重要性和有效注意力范围在不同服务场景中显著变化。这种头部级分解将KV缓存从单一的张量抽象转变为结构化的内存对象,使RedKnot能够统一支持位置无关的KV重用、前缀KV压缩、冷/热KV分离和分布式KV放置,同时保持输出保真度并提高资源效率,无需模型重训练或微调。RedKnot通过将KV缓存从单一的被动运行时工件转变为动态的、模型感知的可扩展LLM服务的运行时基础,为AI基础设施建立了新的基础。

English abstract

As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Several important problems, including position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management, all depend on how the KV cache is represented and managed. However, existing serving systems largely rely on a monolithic KV cache abstraction, where the KV cache is treated as a homogeneous sequence of token-level memory blocks and managed with similar policies across attention heads and serving scenarios. We observe that KV cache utility is highly structured across KV heads: different heads exhibit different functional roles, attention distances, and runtime importance. Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario. We present RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads, whose importance and effective attention ranges vary significantly across serving scenarios. This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling RedKnot to uniformly support position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity and improving resource efficiency, without requiring model retraining or fine-tuning. RedKnot establishes a new foundation for AI infrastructure by transforming the KV cache from a monolithic, passive runtime artifact into a dynamic, model-aware runtime substrate for scalable LLM serving.

2606.06252 2026-06-05 cs.AI Updated version

Closing the Loop on Latent Reasoning via Test-Time Reconstruction

通过测试时重建实现潜在推理的闭环

Xiaopeng Yuan, Haibo Jin, Ye Yu, Peng Kuang, Lijun Yu, Yushun Dong, Haohan Wang

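Affiliations: University of Illinois Urbana-Champaign; Google; Florida State University

AI summary: Proposes ReLAT, a self-supervised test-time training method that optimizes the latent state through a query reconstruction loss, closing the loop on latent reasoning and improving performance on mathematical reasoning, knowledge QA, and code generation.

AI Chinese abstract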

近期工作将中间推理从自然语言轨迹转移到潜在或缓存级表示,以减少令牌开销并避免离散通信瓶颈。然而,这种转变也消除了文本推理的一个关键优势:中间状态不再可检查,使得难以确定潜在状态是否仍保留原始查询的约束。因此,潜在推理通常以开环方式运行,即潜在状态被生成和使用,而无需基于输入的保真度检查。我们提出ReLAT(基于重建的测试时潜在推理),一种自监督测试时训练方法,利用查询本身作为参考来闭合这个循环。我们的关键观察是:如果潜在状态忠实地表示查询,则查询应能从该状态恢复;如果查询无法恢复,则潜在状态已丢失任务相关信息。ReLAT通过构建可微的“问题→潜在思考→问题”循环,并在答案生成前通过潜在思考优化查询重建损失来实现这一原则。这使不透明的潜在计算锚定到它应该代表的问题规范。在Qwen系列上的数学推理、知识问答和代码生成基准测试中,ReLAT持续优于单模型推理、基于文本的协作、开环潜在协作以及替代的测试时训练目标。在Qwen3-8B上,ReLAT将AIME 2024准确率从56.7%提升至73.3%,比最强的开环潜在基线高出16.6个百分点。

English abstract

Recent work moves intermediate reasoning from natural-language traces into latent or cache-level representations to reduce token overhead and avoid a discrete communication bottleneck. However, this shift also removes a key advantage of textual reasoning: intermediate states are no longer inspectable, making it difficult to determine whether a latent state still preserves the constraints of the original query. As a result, latent reasoning typically operates in an open loop, where a latent state is produced and consumed without an input-anchored fidelity check. We propose ReLAT (Reconstruction-Guided Latent Reasoning At Test Time), a self-supervised test-time training method that closes this loop using the query itself as the reference. Our key observation is that if a latent state faithfully represents a query, the query should be recoverable from it; if the query cannot be recovered, the latent state has lost task-relevant information. ReLAT operationalizes this principle by constructing a differentiable Question -> Latent Thought -> Question cycle and optimizing query reconstruction loss through the latent thought before answer generation. This anchors opaque latent computation to the problem specification it is supposed to represent. Across mathematical reasoning, knowledge QA, and code generation benchmarks on the Qwen family, ReLAT consistently improves over single-model inference, text-based collaboration, open-loop latent collaboration, and alternative test-time training objectives. On Qwen3-8B, ReLAT raises AIME 2024 accuracy from 56.7% to 73.3%, a 16.6-point gain over the strongest open-loop latent baseline.

2606.06245 2026-06-05 cs.RO cs.AI Updated version

MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action

MPCoT: 奖励引导的多路径潜在推理用于测试时可扩展的视觉-语言-动作

Boyang Zhang, Lianlei Shan

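Affiliations: Department of Electrical and Computer Engineering, Boston University; Department of Computer Science, Tsinghua University

AI summary: Proposes MPCoT, a reward-guided multi-path latent reasoning framework that improves VLA policy performance on long-horizon and high-uncertainty control tasks while generating zero reasoning tokens and preserving the original action interface.

Comments 14 pages, 5 figures, submitted to CoRL

AI Chinese abstract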

视觉-语言-动作(VLA)策略在长时域和高不确定性控制中仍然脆弱,其中单次动作解码提供的推理时思考有限。显式的思维链可以增加推理深度,但引入了令牌延迟和间接的文本到动作接口。我们提出MPCoT,一个奖励引导的多路径潜在推理框架,初始化$M$个假设,通过K个权重共享步骤细化它们,并在动作解码前进行软聚合。一个仅用于训练的路径偏好目标使用专家动作一致性、基于世界模型/VLM的进展和成功反馈来评估候选动作分支,使潜在路径评分器与下游执行质量对齐。MPCoT保留原始的8步动作接口,生成零推理令牌,并暴露可配置的推理控制(K,M)。在LIBERO和CALVIN上的匹配协议下,MPCoT提升了长时域性能,消融实验证实了深度-宽度效应、置信度加权聚合和奖励引导的路径监督。

English abstract

Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect text-to-action interface. We propose MPCoT, a reward-guided multi-path latent reasoning framework that initializes $M$ hypotheses, refines them for $K$ weight-tied steps, and softly aggregates them before action decoding. A training-only path-preference objective evaluates candidate action branches with expert-action consistency, world-model/VLM-based progress, and success feedback to align the latent path scorer with downstream execution quality. MPCoT preserves the original 8-step action interface, generates zero reasoning tokens, and exposes configurable inference controls $(K, M)$. Under matched protocols on LIBERO and CALVIN, MPCoT improves long-horizon performance, with ablations confirming depth-width effects, confidence-weighted aggregation, and reward-guided path supervision.
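
The control flow of "$M$ hypotheses, $K$ weight-tied steps, soft aggregation" can be sketched in a few lines of Python. This is a toy under stated assumptions, not the MPCoT architecture: hypotheses are plain lists of floats, the refinement step is a hypothetical `tanh` update whose single `weight` is reused at every step, and `score_fn` stands in for the learned latent path scorer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_step(hypothesis, weight):
    """One refinement step; the same weight is reused at every step (weight tying)."""
    return [math.tanh(weight * x) for x in hypothesis]


def multi_path_reason(hypotheses, score_fn, weight=0.9, num_steps=3):
    """Refine M hypotheses for K = num_steps steps, then softly aggregate them."""
    paths = [list(h) for h in hypotheses]          # M = len(hypotheses)
    for _ in range(num_steps):
        paths = [refine_step(p, weight) for p in paths]
    confidences = softmax([score_fn(p) for p in paths])
    dim = len(paths[0])
    # Confidence-weighted average of the refined paths, one value per dimension.
    return [sum(c * p[d] for c, p in zip(confidences, paths)) for d in range(dim)]
```

Because aggregation happens in the latent space, the output has the same shape as a single hypothesis, which is consistent with the abstract's point that the downstream action interface is unchanged. Raising `num_steps` or the number of hypotheses trades compute for deliberation.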

2606.06242 2026-06-05 cs.CL cs.AI cs.CV cs.IR Updated version

Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

面向机构文档数据快照提取的开源布局检测模型基准测试

AJ Carl P. Dy, Aivin V. Solatorio

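Affiliations: Development Data Group Office of the World Bank Group Chief Statistician; The World Bank

AI summary: Builds a benchmark dataset for extracting data snapshots (figures and tables) from institutional documents and evaluates several open-source layout detection models. Existing models generalize poorly to operational documents, with failure modes that include content confusion, fragmentation, and missing context.

Comments 23 pages, 8 figures

AI Chinese abstract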

机构文档中的图表包含大量操作和分析信息。当前从文档中提取视觉内容的方法主要围绕通用文档布局分析,将图表视为统一相关的文档对象,而非具有语义意义的分析产物。在这项工作中,我们引入了一个基准数据集和评估框架,用于数据快照提取,即识别和定位机构文档中具有语义意义的视觉产物的任务。该基准涵盖人道主义报告、世界银行政策研究工作论文和项目评估文件,并包含对含有可重用分析信息的图表的注释。利用该数据集,我们对多个开源布局检测模型进行了基准测试,并评估了检测性能和空间提取质量。结果表明,尽管当前模型在传统学术基准上表现强劲,但在操作型机构文档上难以泛化。常见的失败模式包括分析内容与非分析内容混淆、复合分析产物碎片化,以及解释所需的上下文信息提取不完整。这些发现凸显了通用文档布局分析与操作上有用的数据快照提取之间持续存在的差距。我们发布了源PDF、注释数据集、元数据和源代码,以支持操作型文档智能的未来研究。数据集可在https://huggingface.co/datasets/ai4data/data-snapshot获取,源代码可在https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot获取。

English abstract

Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at https://huggingface.co/datasets/ai4data/data-snapshot and the source code is available at https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot.

2606.06240 2026-06-05 cs.DB cs.AI Updated version

TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory

TOKI: 用于LLM智能体持久化记忆中矛盾消解的双时态算子代数

Ziming Wang

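Affiliations: The Hong Kong University of Science and Technology

AI summary: Proposes the TOKI algebra, which types four contradiction-resolution heuristics as one family of bitemporal operators, provides a write-time concurrency-control contract through four soundness theorems spanning isolation, schema, and provenance, and reports the effect of the audit-row defence on LoCoMo.

Comments 43 pages including full appendices (proofs, protocols, and reproducibility ledger). Code, data, and reproducibility artifact: https://github.com/ZenAlexa/toki-bitemporal-memory

AI Chinese abstract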

LLM智能体的持久化记忆是一个写密集型底层:每次信念更新都是版本化写入,新声明可能与已存储的声明矛盾。生产系统使用四种消解启发式(最后写入者获胜、证据加权合并、等待确认、按规则策略),但都没有声明其假设的隔离级别或允许的写时异常。我们证明矛盾消解是写时并发控制,并明确缺失的契约。TOKI将四种启发式类型化为双行模式上的双时态算子家族,每个算子具有隔离前提条件和保留失败事实的溯源注释(审计行)。四个正确性定理在隔离性、模式和溯源之间闭合契约,将保证提升到算子流水线,并将折叠算子扩展到n元冲突集。紧致性伴随定理证明,在关系调度模型中,关键日志记录裁决法官对于重放一致性是必要的,而所有审计基线都忽略了这一点。基于八个系统的裁决矩阵定位了差距:每个在写路径上保留语言模型法官的基线至少允许三种写时异常(重放不一致、信念漂移偏斜、审计擦除)中的一种;内容寻址的引擎层比较器通过移除法官避免了这些异常,而只有TOKI在保留法官的同时排除了所有三种异常。在其单一自然工作负载切片上,审计行防御使LoCoMo提升了0.86,而消融类型化记忆层在1444个可回答的LoCoMo问题上移除了0.49的准确率;跨系统比较统计功效不足,不声称优越性。贡献在于契约:一个写时正确性规范,在隔离性、模式和溯源上被证明是可靠的,明确了每个生产启发式假设但没有任何部署系统明确声明的保证。

English abstract

Persistent memory for an LLM agent is a write-heavy substrate: every belief update is a versioned write, and a new claim may contradict a stored one. Production systems use four resolution heuristics (last-writer-wins, evidence-weighted merge, await-confirmation, per-rule policy), yet none declares the isolation level it assumes or the write-time anomalies it admits. We show that contradiction resolution is write-time concurrency control and make the missing contract explicit. TOKI types the four heuristics as one family of bitemporal operators over a dual-row schema, each with an isolation precondition and a provenance annotation that preserves the losing fact in an audit row. Four soundness theorems close the contract across isolation, schema, and provenance, lift the guarantees to operator pipelines, and extend the fold operators to n-ary conflict sets. A tightness companion proves that, within the relational schedule model, keyed logging of the adjudicating judge is necessary for replay consistency, which every audited baseline omits. A verdict matrix over eight systems localizes the gap: every baseline that keeps a language-model judge on the write path admits at least one of three write-time anomalies (replay inconsistency, belief-drift skew, audit erasure); a content-addressed engine-layer comparator avoids them only by removing the judge, and TOKI alone excludes all three while keeping it. On its one natural-workload slice the audit-row defence moves LoCoMo by 0.86, and ablating the typed memory layer removes 0.49 accuracy on 1,444 answerable LoCoMo questions; the cross-system comparison stays underpowered and claims no superiority. The contribution is the contract: a write-time correctness specification, proved sound across isolation, schema, and provenance, pinning the guarantee every production heuristic assumes but no deployed system makes explicit.
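
A small Python sketch can make the audit-row idea concrete. It is a toy under stated assumptions: the abstract does not define TOKI's dual-row schema or operator signatures, so the two tables (`rows` and `audit`), the field names, and the `last_writer_wins` method are hypothetical. Only the behavior is taken from the abstract: a contradicting write supersedes the stored claim while the losing fact is preserved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BeliefRow:
    key: str
    value: str
    valid_from: int               # when the fact holds in the world
    tx_from: int                  # when the store recorded it
    tx_to: Optional[int] = None   # None means the row is still current


@dataclass
class AuditRow:
    key: str
    losing_value: str
    winning_value: str
    operator: str
    tx_time: int


class BitemporalMemory:
    def __init__(self):
        self.rows: List[BeliefRow] = []
        self.audit: List[AuditRow] = []

    def current(self, key):
        """Return the open (not yet superseded) row for a key, if any."""
        return next((r for r in self.rows if r.key == key and r.tx_to is None), None)

    def last_writer_wins(self, key, value, valid_from, tx_time):
        """Resolve a contradiction by superseding the old row and logging the loser."""
        old = self.current(key)
        if old is not None and old.value == value:
            return  # no contradiction, nothing to write
        if old is not None:
            old.tx_to = tx_time  # close the old row instead of deleting it
            self.audit.append(AuditRow(key, old.value, value, "last_writer_wins", tx_time))
        self.rows.append(BeliefRow(key, value, valid_from, tx_time))

    def as_of(self, key, tx_time):
        """What the store believed about a key at a given transaction time."""
        for r in self.rows:
            if r.key == key and r.tx_from <= tx_time and (r.tx_to is None or tx_time < r.tx_to):
                return r.value
        return None
```

Keeping both the closed row and the audit row means an earlier belief can still be replayed with `as_of`, and the overwritten claim is never erased, which is the property the abstract calls avoiding audit erasure.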

2606.06235 2026-06-05 cs.LG cs.AI Updated version

Design a Reliable LLM-Integrated Interface for Mortality Forecasting

设计一个可靠的LLM集成接口用于死亡率预测

Thi Kim Ngan Nguyen

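Affiliations: Curtin University

AI summary: Proposes an interface that integrates a large language model (LLM), driving a deterministic forecasting pipeline from natural-language input and improving accessibility for non-expert users while maintaining statistical accuracy.

Comments 7 pages, 7 figures

AI Chinese abstract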

死亡率预测在精算和政策决策中扮演重要角色,但其实现仍然技术复杂且对非专家用户不友好。本项目提出一个可靠的大语言模型(LLM)集成接口,在保持统计功效的同时提升可用性。LLM被设计为一个约束编排层,将自然语言输入转化为确定性预测流程的结构化配置。采用三阶段方法确保准确性、可用性和透明度。首先,使用CoMoMo包实现基线流程,复现已建立的死亡率预测结果。其次,扩展流程以使用滚动原点评估和均方误差(MSE)生成多步预测。第三,原型接口使用本地LLM以自然语言处理用户的预测请求。该系统表明,LLM可以在不损害高敏感性分析工作流中的可重复性、透明度或精算有效性的前提下增强可访问性。

English abstract

Mortality forecasting plays an important role in actuarial and policy decision-making, but its implementation remains technically complex and inaccessible to non-expert users. This project proposes a reliable large language model (LLM)-integrated interface that improves usability while maintaining statistical power. The LLM is designed as a constrained orchestration layer that translates natural-language inputs into structured configurations for a deterministic forecasting pipeline. A three-phase methodology is employed to ensure accuracy, usability, and transparency. First, a baseline pipeline is implemented using the CoMoMo package, reproducing established mortality forecasting results. Second, the pipeline is extended to generate multi-step forecasts using rolling-origin evaluation and mean squared error (MSE). Third, a prototype interface uses a local LLM to handle users' forecasting requests in plain language. The system demonstrates that LLMs can enhance accessibility without compromising reproducibility, transparency, or actuarial validity in high-stakes analytical workflows.
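
Rolling-origin evaluation with MSE, the second phase described above, can be sketched in plain Python. This is a generic illustration and does not use the CoMoMo package; `rolling_origin_mse` and `drift_forecaster` are hypothetical names, and the drift model is a simple stand-in for a real mortality model.

```python
def rolling_origin_mse(series, horizon, min_train, forecaster):
    """Average squared error of multi-step forecasts over successive origins."""
    squared_errors = []
    # Each origin splits the series into a growing training window and a test window.
    for origin in range(min_train, len(series) - horizon + 1):
        train = series[:origin]
        actual = series[origin:origin + horizon]
        predicted = forecaster(train, horizon)
        squared_errors.extend((p - a) ** 2 for p, a in zip(predicted, actual))
    if not squared_errors:
        raise ValueError("series is too short for the requested horizon and min_train")
    return sum(squared_errors) / len(squared_errors)


def drift_forecaster(train, horizon):
    """Random-walk-with-drift baseline: extend the average historical step."""
    drift = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + drift * (h + 1) for h in range(horizon)]
```

Because every forecast is scored only on data after its origin, the procedure is deterministic and free of look-ahead, which fits the abstract's design in which the LLM only produces the configuration (for example `horizon` and `min_train`) and never the numbers themselves.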

2606.06225 2026-06-05 cs.IR cs.AI cs.LG Updated version

Bridging the Semantic-Collaborative Gap: An Asymmetric Graph Architecture for Cold-Start Item Recommendation

弥合语义-协作鸿沟:面向冷启动物品推荐的非对称图架构

Anh Truong, John Trenkle, Yuanbo Chen, Honghong Zhao, Abdullah Alchihabi, Effy Fang, Michael Tamir

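Affiliations: Tubi; Kumo AI

AI summary: Proposes Shallow-RHS, an asymmetric link-prediction architecture in which the left-hand-side device tower captures collaborative signals through message passing over temporal watch history, while the right-hand-side content tower encodes content from intrinsic features only. It treats cold-start item recommendation as an inductive graph-completion problem.

AI Chinese abstract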

协同过滤和基于图的推荐模型因利用观察到的用户交互而非常有效,但这种依赖性在新增内容没有交互历史时产生了根本性的冷启动挑战。在Tubi的生产检索系统中,这一挑战还受到服务接口的进一步限制:新内容必须立即分配独立的嵌入,并且模型必须产生适用于近似最近邻检索的设备嵌入。我们通过将冷启动推荐表述为时间二分设备-内容图上的归纳图补全问题来解决这一设置。我们提出Shallow-RHS,一种非对称链接预测架构,其中左端(LHS)设备塔利用时序有效的观看历史消息传递来捕获协作信号,而右端(RHS)内容塔相对于图是故意浅层的,仅从内在特征编码内容。RHS塔不使用基于ID的嵌入、内容侧子图、邻居聚合或交互派生的表示,迫使内容编码器将内在特征映射到协同过滤感知的嵌入空间。训练后,学习到的内容编码器为热内容和新增内容生成嵌入,通过检索热替代邻居实现隐式图补全。我们进一步将相同的表示补全原则扩展到设备冷启动,通过从人口统计特征构建基于群体的嵌入。大规模在线实验表明,在内容冷启动参与度、推广速度、印象获取和设备冷启动参与度方面持续相对改进。

English abstract

Collaborative filtering and graph-based recommendation models are highly effective because they leverage observed user interactions, but this dependence creates a fundamental cold-start challenge when newly added content has no interaction history. In Tubi's production retrieval system, this challenge is further constrained by the serving interface: new content must be assigned a standalone embedding immediately, and the model must also produce device embeddings suitable for approximate nearest-neighbor retrieval. We address this setting by formulating cold-start recommendation as an inductive graph-completion problem on a temporal bipartite device-content graph. We propose Shallow-RHS, an asymmetric link-prediction architecture in which the left-hand side (LHS) device tower leverages temporally valid watch-history message passing to capture collaborative signals, while the right-hand side (RHS) content tower is intentionally shallow with respect to the graph and encodes content solely from intrinsic features. The RHS tower does not use ID-based embeddings, content-side subgraphs, neighbor aggregation, or interaction-derived representations, forcing the content encoder to map intrinsic features into a collaborative-filtering-aware embedding space. After training, the learned content encoder generates embeddings for both warm and newly ingested content, enabling implicit graph completion through retrieval of warm surrogate neighbors. We further extend the same representation-completion principle to device cold-start by constructing cohort-based embeddings from demographic features. Large-scale online experiments demonstrate consistent relative improvements in content cold-start engagement, promotion speed, impression acquisition, and device cold-start engagement.

2606.06223 2026-06-05 cs.AI Updated version

From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

从奖励黑客激活到智能体风险状态:LLM智能体中的上下文校准机制监控

Patrick Wilhelm, Odej Kao

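Affiliations: University of Cambridge

AI summary: Analyzes reward-hacking behavior of ReAct-style agents in the Gameable ALFWorld and WebShop environments and proposes a context-calibrated monitoring approach that combines activation state, entropy, and decision context to assess agent risk more accurately.

AI Chinese abstract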

语言模型智能体通过观察、推理和动作选择的重复循环运行,使得安全监控依赖于内部模型状态和环境上下文。我们研究了在Gameable ALFWorld和WebShop环境中运行的ReAct风格智能体中的奖励黑客监控。智能体配备了基于激活的奖励黑客分数、token级熵和决策上下文特征。我们发现,在《奖励黑客学校》数据集上微调的适配器可以将奖励黑客倾向转移到智能体动作选择中,尤其是当环境暴露代理奖励可供性时。然而,缓解此类行为不能仅依赖激活动态。高奖励黑客激活识别出潜在策略状态,但并不一定意味着立即的利用动作。在下一步预测任务中,熵和上下文校准的内部特征比单独的奖励黑客激活提高了风险估计。激活方向引导进一步减少了选定混合适配器设置中的代理利用行为。总体而言,我们的结果支持智能体的上下文校准内部监控:奖励黑客激活识别潜在策略状态,而熵和决策上下文有助于确定该状态何时变为风险动作。

English abstract

Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation-based reward-hack scores, token-level entropy, and decision-context features. We find that adapters fine-tuned on the School-of-Reward-Hacks dataset can transfer reward-hack tendencies into agentic action selection, especially when the environment exposes proxy-reward affordances. However, mitigating such behavior cannot rely on activation dynamics alone. High reward-hack activation identifies a latent policy state, but does not necessarily imply an immediate exploit action. Across next-step prediction tasks, entropy and context-calibrated internal features improve risk estimation over reward-hack activation alone. Activation-direction steering further reduces proxy-exploit behavior in selected mixed-adapter regimes. Overall, our results support context-calibrated internal monitoring for agents: reward-hack activation identifies a latent policy state, while entropy and decision context help determine when that state becomes risky action.
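
As a rough illustration of combining the three signal types, the Python sketch below computes token-level Shannon entropy and feeds it, together with an activation score and one context flag, into a logistic risk estimate. The entropy formula is standard. The `risk_estimate` function, its weights, and the `proxy_reward_visible` feature are hypothetical placeholders, not the monitor described in the paper, and a real monitor would fit its weights to labelled agent steps.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def risk_estimate(hack_activation, entropy, proxy_reward_visible,
                  weights=(2.0, -0.5, 1.5), bias=-2.0):
    """Hypothetical logistic risk score; the weights are illustrative, not learned."""
    w_act, w_ent, w_ctx = weights
    context = 1.0 if proxy_reward_visible else 0.0
    logit = bias + w_act * hack_activation + w_ent * entropy + w_ctx * context
    return 1.0 / (1.0 + math.exp(-logit))
```

In this toy, the same activation score yields a higher risk when the environment exposes a proxy reward, mirroring the abstract's point that a high activation identifies a latent state while context helps decide when it becomes a risky action.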

2606.06219 2026-06-05 cs.RO cs.AI Updated version

CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving

CLEAR:端到端自动驾驶中的认知与潜在评估自适应路由

Yining Xing, Zehong Ke, Zhiyuan Liu, Yanbo Jiang, Wenhao Yu, Jianqiang Wang


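AI summary: Proposes the CLEAR framework, which replaces the multi-step denoising of diffusion models with a single-step conditional drift and combines the Drive-JEPA visual encoder with a fine-tuned Qwen 3.5 0.8B for semantic reasoning, enabling efficient multi-modal planning and reaching 93.7 PDMS on NAVSIM v1.

AI Chinese abstract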

端到端自动驾驶模型通常难以平衡多模态机动生成与实时推理约束。虽然扩散模型成功捕捉了多样化的驾驶行为,但其迭代去噪过程在安全关键部署中引入了不可接受的延迟。为了解决这个问题,我们提出了CLEAR(认知与潜在评估自适应路由),一个结合超快生成规划与深度语义推理的框架。CLEAR采用Drive-JEPA作为视觉编码器,并用VAE潜在空间中的单步条件漂移替代多步去噪链,引入条件系数以平衡多样性和专家精度。同时,我们在驾驶问答对上全微调Qwen 3.5 0.8B以提取场景感知隐藏状态。这些状态指导自适应调度器(从预定义方案的离散集中选择条件系数$α$和样本数量$N$)和交叉注意力评分器(从候选中选择最优轨迹)。在NAVSIM v1基准上,CLEAR达到了最先进的PDMS 93.7。我们的结果表明,无需密集几何标注或迭代采样,即可高效执行高保真多模态规划。

English abstract

End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen 3.5 0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $α$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.

2606.06218 2026-06-05 cs.RO cs.AI 版本更新

TAM: Torque Adaptation Module for Robust Motion Transfer in Manipulation

TAM: 用于鲁棒操作运动传递的扭矩自适应模块

Dongwon Son, Florian Shkurti, Jason Lee, Naman Shah, Beomjoon Kim, Dieter Fox

发表机构 * KAIST(韩国科学技术院) Allen Institute for AI(人工智能研究院) University of Toronto(多伦多大学) University of Washington(华盛顿大学)

AI总结 提出扭矩自适应模块(TAM),通过历史编码器和扭矩适配器修正扭矩指令,实现不同机器人或负载间的运动传递,无需领域随机化或重新收集数据。

详情
AI中文摘要

为一个机器人调整的策略在另一个机器人上往往表现不同,无论是由于仿真到现实的差距、未知负载,还是同一机器人两个实例的不同动力学。在接触丰富的动态操作中,即使微小的运动差异也可能导致跟踪参考运动失败,因为它们会破坏接触的时间和模式。常见的补救措施,如领域随机化或系统辨识,要么产生过于保守的任务策略,要么需要为每个机器人或负载重新收集数据。我们引入了扭矩自适应模块(TAM),这是一个学习模块,它调整发送给机器人的扭矩命令以匹配理想机器人的行为。TAM 在跟踪策略动作的低级控制器和机器人的扭矩接口之间运行。它包括一个历史编码器,将本体感受历史嵌入到潜在状态中,以及一个扭矩适配器,计算残余扭矩修正。由于 TAM 仅依赖于本体感受历史,而不依赖于策略观测或动作空间,因此相同的 TAM 权重可以重复用于适应具有不同动作空间(关节目标、末端执行器目标或直接扭矩)的策略。策略本身不需要使用机器人参数的领域随机化进行训练。相反,我们将领域随机化的需求转移到 TAM 上,通过在随机化仿真中完全训练 TAM,使用多机器人预训练,然后进行特定机器人的微调步骤,该步骤仍然不需要真实机器人数据。我们在真实的 Franka Panda 机器人上对 TAM 进行了零样本评估,涉及动态操作任务,包括基于视觉的推箱子策略(来自强化学习)、翻转策略(来自行为克隆)和 MPC 球板平衡。我们的实验表明,与在线系统辨识和 RMA 基线相比,TAM 改善了零样本真实机器人执行,并实现了鲁棒的动态操作性能。

英文摘要

A policy tuned for one robot often behaves differently on another, whether due to the sim-to-real gap, unknown payloads, or the differing dynamics of two instances of the same robot. In contact-rich, dynamic manipulation, even small motion discrepancies can result in failure to track reference motion, since they disrupt the timing and modes of contact. Common remedies, such as domain randomization or system identification, either produce overly conservative task policies or require data that must be recollected for each robot or payload. We introduce the Torque Adaptation Module (TAM), a learned module that adapts the torque commands sent to the robot to match the behavior of an ideal robot. TAM operates between the low-level controller that tracks the policy's actions and the robot's torque interface. It includes a history encoder that embeds proprioceptive history into a latent state and a torque adaptor that computes residual torque corrections. Because TAM depends only on proprioceptive history and not on policy observations or the action space, the same TAM weights can be reused to adapt policies with different action spaces (joint targets, end-effector targets, or direct torques). The policies themselves do not need to be trained with domain randomization of robot parameters. Instead, we offload the need for domain randomization to TAM by training it entirely in randomized simulation, using multi-robot pretraining followed by a robot-specific fine-tuning step that still requires no real-robot data. We evaluate TAM zero-shot on a real Franka Panda robot across dynamic manipulation tasks that include a vision-based box pushing policy (from RL), a flip policy (from BC), and MPC ball-on-plate balancing. Our experiments show that TAM improves zero-shot real-robot execution compared to online system identification and RMA baselines and enables robust dynamic manipulation performance.

2606.06217 2026-06-05 cs.CV cs.AI 版本更新

DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

DisasterBench: 复杂环境中基于无人机灾害响应的多模态基准

Tan Zhang, Quanyou Li, Lu Zhang, Jun Liu, Xiaofeng Zhu, Ping Hu

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出DisasterBench多模态基准,涵盖14种灾害场景和9个响应任务,并设计轻量级模型DisasterVL通过三阶段优化在边缘设备上实现高效推理。

详情
AI中文摘要

当灾难发生时,响应者不仅需要回答正在发生什么,还需要回答为什么发生、接下来会发生什么以及现在该做什么,而这些通常来自嘈杂的低空无人机视角,并在现场计算资源紧张的情况下进行。然而,现有的大多数多模态基准侧重于感知(例如识别/描述),覆盖的灾害类型有限,并且对实际应急响应所需的多阶段推理支持不足。我们引入了DisasterBench,一个用于复杂环境中基于无人机灾害响应的多阶段多模态推理基准。DisasterBench涵盖14种灾害相关场景类型和9个响应关键任务,覆盖灾前、灾中和灾后阶段,具有细粒度的灾害-任务映射,明确测试因果归因、传播预测、损害分析和决策导向推理。为了在边缘设备上实现推理,我们进一步提出了DisasterVL,一个轻量级多模态模型,通过三阶段流水线进行优化,结合领域指令微调、思维链引导的多模态对齐以及基于强化学习的策略优化。在21个流行的MLLM上的实验表明,我们的2B参数DisasterVL优于所有评估的开源模型,并显著缩小了与最先进闭源模型的差距,实现了与GPT-4o相当的推理准确性和更高的效率。项目页面:https://github.com/TanmouTT/DisasterBench。

英文摘要

When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at https://github.com/TanmouTT/DisasterBench.

2606.06214 2026-06-05 cs.SE cs.AI 版本更新

Towards the Readability of LLM-Generated Codes through Multitask Representation Engineering

面向大语言模型生成代码可读性的多任务表示工程

Huifan Gao, Liuhua He, Yinghui Pan, Shenbao Yu, Yifeng Zeng, Shengchao Qin, Weidi Sun

发表机构 * School of Aerospace Engineering, Xiamen University(厦门大学航空航天工程学院) School of Artificial Intelligence, Shenzhen University(深圳大学人工智能学院) College of Computer and Cyber Security, Fujian Normal University(福建师范大学计算机与网络空间安全学院) Department of Computer & Information Sciences, Northumbria University(诺森比亚大学计算机与信息科学系) School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院) Peking University(北京大学)

AI总结 提出多任务表示工程框架,通过低数据依赖和低计算成本的表示工程方法提升LLM生成代码的可读性,并理论分析其对可读性与正确性权衡的影响。

详情
AI中文摘要

正确性和可读性是代码质量的关键指标,分别确保功能保真度和易于理解。虽然现有研究大多关注提高大语言模型(LLM)生成代码的正确性,但可读性仍未得到充分解决。由于其主观性,通过定向控制提高可读性具有挑战性。在本文中,我们采用表示工程(RepE)作为定向控制方法,因为它具有低数据依赖和低计算成本的特点。先前关于RepE的工作主要集中在单一任务的定向控制上,但提高代码可读性需要跨多个任务的控制。因此,我们提出了多任务RepE框架,并从理论上讨论了多任务引导方法对代码可读性和正确性之间权衡的影响。我们进一步提供了全面的实验支持。所有相关实现都是开源的,并可应要求提供。

英文摘要

Correctness and readability are key measures of code quality, respectively ensuring functional fidelity and ease of comprehension. While most existing research focuses on improving the correctness of code generated by large language models (LLMs), readability remains under-addressed. Enhancing readability through targeted control is challenging due to its subjective nature. In this article, we employ representation engineering (RepE) as the targeted control method given its characteristics of low data dependency and low computational cost. Prior work on RepE has primarily focused on targeted control for a single task, but improving code readability requires control across multiple tasks. Accordingly, we propose the multitask RepE framework and theoretically discuss the impact of the multitask steering method on the tradeoff between code readability and correctness. We further provide comprehensive experiments in support. All the relevant implementations are open-source and available upon request.

2606.06212 2026-06-05 cs.AI 版本更新

Evaluating Agentic Configuration Repair for Computer Networks

评估计算机网络中的代理配置修复

Rufat Asadli, Benjamin Hoffman, Ioannis Protogeros, Laurent Vanbever

发表机构 * Department of Information Technology(信息科技系) Electrical Engineering, ETH Zurich(电子工程,苏黎世联邦理工学院)

AI总结 本研究通过结合形式化网络验证和上下文检索工具,评估了开源和闭源大语言模型在代理架构下的配置修复能力,发现代理架构在修复效果和安全性上分别平均提升12%和17%。

详情
AI中文摘要

计算机网络中的错误配置仍然是导致重大互联网中断的主要原因。研究正转向利用大语言模型(LLMs)来自动化网络配置这一复杂且易出错的任务。然而,即使是最先进的模型也无法解决大规模、复杂场景中的错误配置,并且常常引入新的错误。在这项工作中,我们对结合了形式化网络验证和上下文检索工具的开源和闭源LLMs进行了基准测试。我们证明,代理架构在修复效果(平均提升12%)和安全性(平均提升17%)上优于基础LLM,这得益于其动态管理上下文和迭代验证配置修复的能力。

英文摘要

Misconfigurations in computer networks remain a major source of critical Internet outages. Research is turning to Large Language Models (LLMs) to automate the complex, error-prone task of network configuration. However, even state-of-the-art models fail to resolve misconfigurations in large-scale, complex scenarios and often introduce new errors. In this work, we benchmark open- and closed-source LLMs augmented with formal network verification and context retrieval tools. We demonstrate that agentic architectures outperform base LLMs in repair efficacy (by 12% on average) and safety (by 17% on average), enabled by the ability to dynamically manage context and iteratively validate configuration repairs.

2606.06207 2026-06-05 cs.AI cs.LG 版本更新

Unsupervised Pattern Analysis in Japanese Veterinary Toxicology: A Regulatory-Compliant Framework for Cross-Species Risk Assessment

日本兽医毒理学中的无监督模式分析:用于跨物种风险评估的合规框架

Yukiko Kawakami, Mohammad Shirazi, Ryo Shimizuwa, Saito Shinoda, Alireza Mortazavi, Matsumoto Kawahara

发表机构 * Graduate School of Information Sciences, Tohoku University(东北大学信息科学研究生院)

AI总结 提出一种监管集成的无监督框架,利用NVAL数据库对不良药物事件进行聚类分析,识别出具有生物学意义的跨物种毒性模式。

Comments Submitted to IEEE Transactions on Biomedical Engineering

详情
AI中文摘要

兽医药物警戒系统对于监测不良药物事件(ADEs)至关重要,然而现有方法往往无法捕捉由当地生物学和监管环境塑造的区域特异性毒性模式。在日本,这些挑战因物种特异性代谢差异以及农林水产省(MAFF)定义的报告实践而加剧。以往的工作大多依赖于预测导向模型,限制了机制可解释性。本研究提出了一种监管集成的无监督框架,用于利用国家兽医检测实验室(NVAL)数据库进行模式发现。ADEs被编码为器官系统对齐的表示,并针对物种特异性报告偏差进行调整,从而实现跨物种比较。应用基于相似性的聚类和降维来识别潜在毒性结构。对4,120份高置信度ADE报告(9,080个药物-ADE组合)的分析识别出三个显著的物种聚类(p < 0.01),包括伴侣动物中的肝脏主导模式(0.42 ± 0.06)、反刍动物中的肾毒性(0.39 ± 0.07)以及绵羊中的皮肤敏感性(0.35 ± 0.07)。药物水平聚类与药理类别的对齐率达到83%,而余弦相似度优于其他指标(轮廓系数:0.48;聚类精度:87%)。监管验证显示与既定分类高度一致。这些发现表明,与监管对齐的无监督分析能够揭示具有生物学意义的区域特异性毒性模式,为兽药安全性评估提供了一个可解释且可扩展的框架。

英文摘要

Veterinary pharmacovigilance systems are essential for monitoring adverse drug events (ADEs), yet existing approaches often fail to capture region-specific toxicity patterns shaped by local biological and regulatory contexts. In Japan, these challenges are amplified by species-specific metabolic differences and reporting practices defined by the Ministry of Agriculture, Forestry, and Fisheries (MAFF). Most prior work relies on prediction-oriented models, limiting mechanistic interpretability. This study proposes a regulatory-integrated unsupervised framework for pattern discovery using the National Veterinary Assay Laboratory (NVAL) database. ADEs are encoded into organ system-aligned representations and adjusted for species-specific reporting biases, enabling cross-species comparison. Similarity-based clustering and dimensionality reduction are applied to identify latent toxicity structures. Analysis of 4,120 high-confidence ADE reports (9,080 drug-ADE combinations) identified three significant species clusters (p < 0.01), including hepatic-dominant patterns in companion animals (0.42 $\pm$ 0.06), renal toxicity in ruminants (0.39 $\pm$ 0.07), and dermatological sensitivity in sheep (0.35 $\pm$ 0.07). Drug-level clustering achieved 83% alignment with pharmacological classes, while cosine similarity outperformed alternative metrics (silhouette score: 0.48; cluster precision: 87%). Regulatory validation showed strong agreement with established classifications. These findings demonstrate that regulation-aligned unsupervised analysis can uncover biologically meaningful, region-specific toxicity patterns, providing an interpretable and scalable framework for veterinary drug safety assessment.
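A minimal sketch of the cosine similarity the abstract reports as the best-performing metric. The three-component organ-system profiles and the species labels below are hypothetical illustration values, not data from the NVAL database or from the paper, and the paper's actual encoding is not described in the abstract.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical profiles, ordered as [hepatic, renal, dermatological].
species_a = [0.6, 0.2, 0.2]
species_b = [0.3, 0.1, 0.1]  # same proportions as species_a at half the magnitude
species_c = [0.1, 0.7, 0.2]  # renal-dominant profile

same_shape = cosine_similarity(species_a, species_b)
different_shape = cosine_similarity(species_a, species_c)
```

Because cosine similarity depends only on direction, two profiles with the same proportions score close to 1 regardless of how many reports each species contributes. This may be one reason it suits data with uneven reporting volumes, though that reading is an inference rather than a claim from the abstract.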

2606.06203 2026-06-05 cs.CL cs.AI 版本更新

Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs

密集上下文是困难上下文:词汇密度限制LLM的有效上下文

Giovanni Dettori, Matteo Boffa, Danilo Giordano, Idilio Drago, Marco Mellia

发表机构 * Department of Computer Science, Politecnico di Torino(都灵理工大学计算机科学系) Department of Computer Science, University of Turin(都灵大学计算机科学系)

AI总结 本文通过三个“找针”式基准测试,发现词汇密度(上下文引入不同信息的速率)是除输入长度和相关信息位置外,第三个系统性降低LLM有效上下文窗口的因素,并证明降低密度可恢复性能。

Comments 20 pages, 6 figures

详情
AI中文摘要

输入长度和相关信息的位置被广泛认为是导致LLM长上下文性能下降的主要原因。在这里,我们研究词汇密度——上下文引入不同信息的速率——作为第三个被广泛忽视的因素,它系统地缩小了LLM的有效上下文窗口。我们使用三个“找针”式基准测试,在相同长度(约12k tokens)和受控的针位置但信息密度递增的情况下,量化了词汇密度对开放权重LLM(9B-685B)的影响。我们观察到在高密度基准测试中性能急剧下降:在稀疏上下文中近乎完美的模型在密集上下文中的检索分数降至60%以下。为了排除任务类型混淆,我们在每个基准测试内部改变并控制密度,同时保持其他所有属性不变。降低密度通常能恢复性能,尤其是在出现退化的高密度区域。这些结果表明,有效上下文容量是词汇密度的函数,对运行在紧凑、信息丰富输入上的真实世界LLM系统具有直接影响。

英文摘要

Input length and the position of relevant information are widely cited as the primary causes of degraded LLM long-context performance. Here, we study lexical density -- the rate at which a context introduces distinct information -- as a third, largely overlooked factor that systematically reduces the effective context window of LLMs. We quantify the impact of lexical density on open-weight LLMs (9B-685B) using three "find-the-needle" style benchmarks with identical length (~12k tokens) and controlled needle position, but increasing density of information. We observe a sharp performance collapse in higher-density benchmarks: models that are near-perfect in sparse contexts drop below 60% retrieval score on denser ones. To rule out task-type confounds, we vary and control the density within each benchmark while keeping all other properties unchanged. Reducing density generally restores performance, especially in the high-density regimes where degradation appears. These results show that effective context capacity is a function of lexical density, with direct implications for real-world LLM systems operating on compact, information-rich inputs.
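The abstract defines lexical density only informally, as the rate at which a context introduces distinct information. A rough sketch of one possible proxy is the share of distinct tokens in a text. The function name, the whitespace tokenization, and the sample strings are assumptions made for illustration; the paper's own measure may differ.

```python
def distinct_token_ratio(text):
    """Share of whitespace-separated tokens that are distinct (case-insensitive)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Two made-up contexts of identical length (9 tokens each).
sparse = "the cat sat the cat sat the cat sat"
dense = "alpha beta gamma delta epsilon zeta eta theta iota"

sparse_ratio = distinct_token_ratio(sparse)  # 3 distinct tokens out of 9
dense_ratio = distinct_token_ratio(dense)    # 9 distinct tokens out of 9
```

Both strings have the same length but differ threefold on this proxy, which mirrors the benchmark design described above: length is held constant while density varies.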

2606.06201 2026-06-05 cs.AI 版本更新

Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains

学习补货:面向医药供应链动态库存管理的混合深度强化学习

Amandeep Kaur, Gyan Prakash

AI总结 针对医药供应链中需求不确定和前置时间变化导致的库存管理难题,提出一种混合异步优势演员评论家分布式近端策略优化(A3C DPPO)算法,实现连续动作空间下的最优补货策略,降低库存成本并提高服务水平。

详情
AI中文摘要

医药供应链(PSCs)因不可预测的需求模式和与补货相关的可变前置时间,在库存管理(IM)方面面临挑战。药品的有限保质期进一步加剧了这种复杂性,需要在充足库存和最小浪费之间取得微妙的平衡。这些相互交织的因素构成了一个复杂的优化问题,需要复杂的库存策略来确保产品可用性和PSC效率。本研究旨在为医药产品开发一种最优库存补货策略,能够处理由不确定需求和可变PSC条件产生的随机性。目标是最大化PSC的盈利能力,同时保持较高的患者服务水平。我们将问题建模为马尔可夫决策过程,并提出一种深度强化学习(DRL)方法,具体为混合异步优势演员评论家分布式近端策略优化(A3C DPPO)算法。该A3C DPPO算法针对IM中固有的连续动作空间进行了定制。数值结果表明,所提算法在动态场景下自适应更新库存补货策略,与各种基准相比,实现了更低的库存成本。我们还使用真实药品库存数据进行了数值验证,以确认所提算法的实际可行性。

英文摘要

Pharmaceutical supply chains (PSCs) struggle with inventory management (IM) due to unpredictable demand patterns and variable lead times associated with restocking. This complexity is further compounded by the finite shelf lives of pharmaceutical products, which necessitate a delicate balance between adequate stock and minimal waste. These intertwined factors create a complex optimization problem that requires sophisticated inventory strategies to ensure both product availability and PSC efficiency. This study aims to develop an optimal inventory replenishment policy for pharmaceutical products that can handle the stochasticity arising from uncertain demand and variable PSC conditions. The objective is to maximize the profitability of the PSC while maintaining a high patient service level. We formulate the problem as a Markov decision process and propose a deep reinforcement learning (DRL) approach, specifically a hybrid asynchronous advantage actor critic distributed proximal policy optimization (A3C DPPO) algorithm. The A3C DPPO algorithm is tailored to handle the continuous action space inherent in IM. The numerical results demonstrate that the proposed algorithm adaptively updates the inventory replenishment strategy under dynamic scenarios, resulting in lower inventory costs compared to various benchmarks. We also conduct numerical validation using real-world pharmaceutical inventory data to confirm the practical feasibility of the proposed algorithm.

2606.06197 2026-06-05 cs.CL cs.AI 版本更新

Improving Answer Extraction in Context-based Question Answering Systems Using LLMs

利用大语言模型改进基于上下文的问答系统中的答案提取

Hafez Abdelghaffar, Ahmed Alansary, Ali Hamdi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对现有问答系统在复杂或模糊查询下答案提取不准确的问题,提出基于微调预训练大语言模型的方法,在SQuAD1.1数据集上取得ROUGE-L 86.84%、BLEU 28.24%、BERTScore 95.38%的高性能。

Comments 7 pages, IMSA2026

详情
AI中文摘要

随着大语言模型(LLM)的出现,问答(QA)系统取得了显著进展。然而,它们在从给定上下文中准确提取和生成精确答案方面仍面临挑战,尤其是在处理复杂或模糊查询时。现有方法通常在上下文理解、答案一致性和跨不同领域的泛化能力方面存在不足。在这项工作中,我们提出了一种基于大语言模型的问答系统,其输入由文本上下文和相应问题组成,输出为简洁准确的答案。本研究旨在解决当前QA系统的局限性,特别是它们即使能够访问正确上下文也倾向于产生不相关或不精确响应的问题。我们的方法包括在基准QA数据集上微调预训练的LLM,以提高其上下文理解和答案提取能力。具体来说,我们使用斯坦福问答数据集(SQuAD1.1),该数据集提供了高质量的上下文-问题-答案三元组用于监督训练和评估。实验结果表明,微调后的RoBERTa-base模型取得了最高性能,ROUGE-L得分为86.84%,BLEU得分为28.24%,BERTScore为95.38%。这些结果表明了强大的准确性和答案相关性,证明了所提方法在基于上下文的问答任务中的有效性。此外,研究结果证实,有针对性的微调显著提高了QA系统的可靠性和精确性。

英文摘要

Question answering (QA) systems have achieved notable progress with the advent of large language models (LLMs). However, they still face challenges in accurately extracting and generating precise answers from given contexts, particularly when dealing with complex or ambiguous queries. Existing approaches often struggle with contextual understanding, answer consistency, and generalization across diverse domains. In this work, we propose a question answering system based on large language models, where the input consists of a textual context and a corresponding question, and the output is a concise and accurate answer. The motivation behind this research lies in addressing the limitations of current QA systems, particularly their tendency to produce irrelevant or imprecise responses despite having access to the correct context. Our methodology involves fine-tuning a pre-trained LLM on a benchmark QA dataset to improve its contextual comprehension and answer extraction capabilities. Specifically, we utilize the Stanford Question Answering Dataset (SQuAD1.1), which provides high-quality context-question-answer triplets for supervised training and evaluation. Experimental results show that the fine-tuned RoBERTa-base model achieves the highest performance, attaining a ROUGE-L score of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%. These results indicate strong accuracy and answer relevance, demonstrating the effectiveness of the proposed approach for context-based question answering tasks. Furthermore, the findings confirm that targeted fine-tuning substantially improves the reliability and precision of QA systems.

2606.06178 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning

通过元学习从隐式成本-性能偏好中学习路由LLM

Jiahao Zeng, Ming Tang, Ningning Ding

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Southern University of Science and Technology(南方科技大学)

AI总结 提出MetaRouter框架,利用元学习从少量交互中学习用户隐式成本-性能偏好,实现个性化LLM路由,在分布内外任务上优于基线方法。

详情
AI中文摘要

大型语言模型(LLM)在性能与成本之间存在权衡,更强大的模型会产生更高的费用。LLM路由旨在通过将查询发送到最合适的模型来降低费用同时保持性能。然而,现有方法无法很好地适应不同用户的成本-性能偏好。为了解决这一差距,我们引入了一种新颖的感知LLM路由范式,用于个性化和以用户为中心的成本-性能优化,通过少量交互高效学习用户的隐式偏好。为了应对异构用户需求的挑战,我们将偏好配置文件形式化为上下文赌博机中的一组不同任务,并提出了MetaRouter,一个用于偏好感知LLM路由的元学习框架。实验结果表明,MetaRouter在分布内和分布外任务上均优于强基线。此外,它在学习用户偏好方面表现出高效率,对可路由LLM的变化具有鲁棒性,并且可扩展到多模型路由。

英文摘要

Large language models (LLMs) present a trade-off between performance and cost, where more powerful models incur greater expense. LLM routing aims to reduce expenses while maintaining performance by sending queries to the most suitable model. However, existing methods do not adapt well to differing user cost-performance preferences. To address this gap, we introduce a novel perceptive LLM routing paradigm for personalized and user-centric cost-performance optimization, which efficiently learns users' implicit preferences from only a few interactions. To handle the challenge of heterogeneous user needs, we formulate preference profiles as a set of distinct tasks in a contextual bandit setting and propose MetaRouter, a meta-learning framework designed for preference-aware LLM routing. Experimental results show that MetaRouter outperforms strong baselines on both in-distribution and out-of-distribution tasks. Furthermore, it exhibits high efficiency in learning user preferences, robustness to changes in the routable LLMs, and scalability to multi-model routing.

2606.06168 2026-06-05 cs.AI cs.CL 版本更新

ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity

ProSarc: 通过时间韵律不协调性进行韵律感知的讽刺识别框架

Prathamjyot Singh, Ashima Sood, Sahil Sharma, Jasmeet Singh

发表机构 * Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, India(计算机科学与工程系,塔帕尔工程与技术学院,印度帕蒂亚拉) School of Computing, Engineering and Intelligent Systems, Ulster University, Londonderry, United Kingdom(计算、工程与智能系统学院,阿尔斯特大学,英国伦敦德里) School of Computing, Ulster University, Belfast, United Kingdom(计算学院,阿尔斯特大学,英国贝尔法斯特)

AI总结 提出ProSarc,一个仅利用音频的框架,通过建模局部韵律动态与话语级情感基线之间的时间韵律不协调性来检测讽刺,在MUStARD++等数据集上取得最优性能。

Comments Accepted at Interspeech 2026, Sydney

详情
AI中文摘要

我们提出了ProSarc,一个仅利用音频的框架,通过建模时间韵律不协调性(即局部韵律动态与话语级情感基线之间的不匹配)来检测讽刺。双编码路径——全局情感编码器和时间韵律编码器(BiLSTM + 多头注意力)——馈送到韵律不协调性分析器,该分析器产生一个标量不协调性分数用于分类。蒙特卡洛dropout提供不确定性估计,基于注意力的机制无需帧级标签即可定位讽刺起始点。ProSarc在MUStARD++(F1=75.3)上优于先前的纯音频方法,并泛化到自发性语音(PodSarc,F1=62.9)和跨语言语音(MuSaG,F1=65.6)。十次运行验证证实了不协调性建模的贡献(Wilcoxon p=0.002,Cohen's d=1.51)。人工评估表明,模型不确定性追踪感知模糊性,预测的起始点与人工标注的时间窗口对齐。

英文摘要

We present ProSarc, an audio-only framework that detects sarcasm by modelling temporal prosodic incongruity, that is, the mismatch between local prosodic dynamics and the utterance-level emotional baseline. Dual encoding paths, a Global Emotion Encoder and a Temporal Prosody Encoder (BiLSTM + multi-head attention), feed a Prosodic Incongruity Analyzer that produces a scalar incongruity score for classification. Monte Carlo dropout provides uncertainty estimates, and an attention-based mechanism localises sarcastic onset without frame-level labels. ProSarc outperforms prior audio-only methods on MUStARD++ (F1=75.3) and generalises to spontaneous (PodSarc, F1=62.9) and cross-lingual speech (MuSaG, F1=65.6). Ten-run validation confirms the contribution of incongruity modelling (Wilcoxon p=0.002, Cohen's d=1.51). Human evaluation shows that model uncertainty tracks perceptual ambiguity and predicted onsets align with human-annotated temporal windows.

2606.06160 2026-06-05 cs.AI cs.CL 版本更新

Where does Absolute Position come from in decoder-only Transformers?

在仅解码器Transformer中,绝对位置从何而来?

Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri

发表机构 * Sapienza University of Rome(罗马大学萨皮恩扎分校) Intuition Machines(直觉机器)

AI总结 本文研究了RoPE训练的仅解码器Transformer中绝对位置信息的来源,发现因果掩码和残差流是导致绝对位置泄露的两个关键组件,并提出了通过替换BOS嵌入来减少残差流成分的方法。

详情
AI中文摘要

RoPE训练的Transformer在其注意力模式中区分绝对位置,尽管RoPE在内积中仅编码相对偏移。我们将这种泄露追溯到两个架构组件。因果掩码是第一个:其每个查询的softmax分母按构造依赖于绝对查询位置。残差流提供第二个。在因果注意力下,位置$0$处的激活仅关注自身,并作为封闭动力系统从该位置token的嵌入运行;下游注意力通过sink-reading头读取该轨迹。这两个组件在我们研究的所有三种架构中都存在,但以架构特定的平衡出现:NTK缩放抑制残差流组件,滑动窗口注意力使其随深度累积,而标准RoPE介于两者之间。在前向传播前替换\texttt{BOS}嵌入可消除早期查询中$40\%$的残差流组件。注意力sink是锚定在token上的稳定器,传递位置$0$处token的确定性指纹,当该token是自动预置的\texttt{BOS}时,该指纹跨输入恒定,否则随其变化。

英文摘要

RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention the activation at position $0$ attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the \texttt{BOS} embedding before the forward pass removes $40\%$ of the residual-stream component at early queries. Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position $0$, constant across inputs when that token is the auto-prepended \texttt{BOS} and varying with it otherwise.
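The causal-mask claim can be illustrated with a toy calculation. This is a sketch under simplifying assumptions, not the paper's experimental setup: all attention logits are set to the same value, so neither content nor relative offset carries any signal, and the helper name `causal_attention_row` is made up for this example.

```python
import math

def causal_attention_row(scores, query_pos):
    """Softmax over keys 0..query_pos only (causal mask), for a single query."""
    visible = scores[: query_pos + 1]
    peak = max(visible)
    exps = [math.exp(s - peak) for s in visible]
    denom = sum(exps)  # number of terms grows with the absolute query position
    return [e / denom for e in exps]

# Identical logits everywhere: no content or relative-offset information.
scores = [0.0] * 8

# Attention weight placed on key 0 by queries at positions 0, 1, 3 and 7.
weight_on_first = [causal_attention_row(scores, q)[0] for q in (0, 1, 3, 7)]
```

With uniform logits, a query at position $p$ spreads its attention over $p+1$ visible keys, so each weight equals $1/(p+1)$. The weights alone therefore reveal the absolute position of the query, even though no positional encoding was applied.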

2606.06159 2026-06-05 cs.AR cs.AI cs.NE 版本更新

ITP-STDP: An Intrinsic-Timing Power-of-Two Learning Engine for On-Chip SNN Training

ITP-STDP:用于片上SNN训练的内在时序二次幂学习引擎

Haihang Xia, Xinyu Zhao, Xuecheng Wang, John Goodenough, Charith Abhayaratne, Panagiotis A. Panagiotou, Chunyi Song, Tiantai Deng

发表机构 * School of Electrical and Electronic Engineering, The University of Sheffield(谢菲尔德大学电子与电气工程学院) Donghai Laboratory(东海实验室) Engineering Research Center of Oceanic Sensing Technology and Equipment, Ministry of Education(教育部海洋传感技术与设备工程研究中心) State Key Laboratory of Ocean Sensing and Ocean College, Zhejiang University(浙江大学海洋传感与海洋学院国家重点实验室)

AI总结 针对SNN训练中权重更新计算量大导致的硬件资源与能耗问题,提出基于内在时序二次幂的STDP算法(ITP-STDP)及其硬件架构,通过算法与硬件优化消除大部分计算开销,在FPGA和ASIC平台上实现能效和速度的显著提升。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

脉冲神经网络(SNN)有潜力成为第三代神经网络,并在广泛的应用中受到越来越多的关注。然而,SNN中大量的突触连接导致训练过程中片上学习算法的权重更新计算密集,从而消耗大量硬件资源和能量。在现有的SNN学习算法中,脉冲时序依赖可塑性(STDP)是研究最广泛、采用最多的算法之一,是SNN中的基本学习组件。为了解决SNN训练相关的硬件和能量开销,本文提出了内在时序二次幂STDP(ITP-STDP)及其对应的原型学习引擎硬件架构。通过专用的平均场突触漂移模型进行动力学分析,并在不同规模和数据集上的SNN网络中进一步验证。该设计在ASIC和FPGA平台上实现,并与包括原始STDP和更复杂STDP变体在内的最新方法进行比较。结果表明,由于所提出的设计通过算法和硬件级优化消除了STDP的大部分计算开销,因此具有优越的能效、更高的运行速度和显著更低的硬件资源利用率。在FPGA平台上,所提出的设计相比对比设计能效提高了4.5倍至219.8倍。在ASIC平台上,所提出的设计实现了4.8倍至22.01倍的加速,而面积仅为先前工作的1.2%至3.3%。

英文摘要

Spiking neural networks (SNNs) have the potential to emerge as the third generation of neural networks and have attracted increasing attention across a wide range of applications. However, the large number of synaptic connections in SNNs leads to intensive weight-update computation by on-chip learning algorithms during training, resulting in substantial hardware resource utilization and energy consumption. Among existing SNN learning algorithms, spike-timing-dependent plasticity (STDP) is one of the most extensively studied and widely adopted, serving as a fundamental learning component in SNNs. To address the hardware and energy overheads associated with SNN training, this paper presents intrinsic-timing power-of-two STDP (ITP-STDP) and its corresponding prototype learning engine hardware architecture. The proposed design is evaluated through a dedicated mean-field synaptic drift model for dynamical analysis and further validated across SNN networks of different scales and datasets. It is further implemented on both ASIC and FPGA platforms and compared with state-of-the-art approaches, including the original STDP and more complex STDP variants. The results demonstrate superior energy efficiency, higher operating speed, and substantially lower hardware resource utilization, as the proposed design eliminates most of the computational overhead of STDP through both algorithmic and hardware-level optimizations. On the FPGA platform, the proposed design improves energy efficiency by 4.5$\times$ to 219.8$\times$ over the compared designs. On the ASIC platform, the proposed design achieves a 4.8$\times$ to 22.01$\times$ speedup while consuming only 1.2% to 3.3% of the area required by prior works.

2606.06154 2026-06-05 cs.AI 版本更新

Amortizing Federated Adaptation: Hypernetwork Driven LoRA for Personalized Foundation Models

摊销联邦自适应:基于超网络的LoRA用于个性化基础模型

Sunny Gupta, Shambhavi Shanker, Amit Sethi

发表机构 * Indian Institute of Technology, Bombay(印度理工学院班加罗尔)

AI总结 提出HyperLoRA框架,通过超网络驱动的LoRA生成和乘积空间聚合,解决联邦LoRA中的结构聚合偏差和客户端初始化滞后问题,实现高效个性化、无偏聚合和更快收敛。

Comments Accepted at International Workshop on Federated Learning in the Age of Foundation Models In Conjunction with IJCAI 2026 (FL@FM-IJCAI'26)

详情
AI中文摘要

使用低秩自适应(LoRA)对基础模型进行联邦微调为分布式学习提供了一种通信高效的解决方案。然而,现有的联邦LoRA方法存在两个基本限制:(1)结构聚合偏差,即独立平均低秩因子无法近似真实的组合更新;(2)客户端初始化滞后,即客户端在通信轮次中反复重新初始化LoRA参数,导致收敛变慢。我们提出HyperLoRA,一个统一的框架,通过超网络驱动的LoRA生成和乘积空间聚合的摊销联邦自适应来解决这两个问题。HyperLoRA不是进行迭代的逐客户端优化,而是使用一个学习到的生成器,将客户端分布特征映射到LoRA初始化,从而有效摊销每个客户端的自适应。在服务器端,我们引入一个学习到的聚合模块,直接在低秩乘积空间中合成更新,消除了因子级平均的不一致性。一个轻量级的残差校正模块进一步提高了在异质(非IID)客户端分布下的稳定性。通过用学习到的算子替代迭代优化和启发式平均,HyperLoRA共同实现了高效个性化、无偏聚合和更快的收敛。在联邦视觉和视觉-语言基准上的实验表明,与先前的联邦LoRA方法相比,HyperLoRA实现了更快的收敛速度、对分布偏移更强的鲁棒性以及更强的个性化性能。

英文摘要

Federated fine-tuning of foundation models using Low-Rank Adaptation (LoRA) offers a communication-efficient solution for distributed learning. However, existing federated LoRA methods suffer from two fundamental limitations: (1) structural aggregation bias, where independently averaging low-rank factors fails to approximate the true combined update, and (2) client-side initialization lag, as clients repeatedly reinitialize LoRA parameters across communication rounds, slowing convergence. We propose HyperLoRA, a unified framework that addresses both issues through amortized federated adaptation via hypernetwork-driven LoRA generation and product-space aggregation. Instead of iterative per-client optimization, HyperLoRA employs a learned generator that maps client distribution signatures to LoRA initializations, effectively amortizing per-client adaptation. On the server side, we introduce a learned aggregation module that directly synthesizes updates in the low-rank product space, eliminating the inconsistencies of factor-wise averaging. A lightweight residual correction module further improves stability under heterogeneous (non-IID) client distributions. By replacing iterative optimization and heuristic averaging with learned operators, HyperLoRA jointly enables efficient personalization, unbiased aggregation, and faster convergence. Experiments on federated vision and vision-language benchmarks show that HyperLoRA achieves improved convergence speed, greater robustness to distribution shift, and stronger personalization performance compared to prior federated LoRA methods.
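The structural aggregation bias named in limitation (1) can be seen in a two-client toy case. The matrices below are made-up values chosen for illustration, and the helpers `matmul` and `mean` are written for this sketch only; none of this reproduces HyperLoRA's learned aggregation module.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def mean(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)] for i in range(rows)]

# Two clients, each holding rank-1 LoRA factors B (2x1) and A (1x2).
B1, A1 = [[1.0], [0.0]], [[1.0, 0.0]]
B2, A2 = [[0.0], [1.0]], [[0.0, 1.0]]

# Average of the true per-client updates B_i @ A_i.
product_avg = mean([matmul(B1, A1), matmul(B2, A2)])

# Update implied by averaging the factors independently, then multiplying.
factor_avg = matmul(mean([B1, B2]), mean([A1, A2]))
```

Here `product_avg` is half the identity matrix, while `factor_avg` has 0.25 in every entry. Multiplying averaged factors introduces cross terms between different clients (such as `B1 @ A2`) that are absent from the average of the true updates, which is the mismatch the abstract describes.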

2606.06147 2026-06-05 cs.AI 版本更新

WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation

WorldFly: 基于世界模型的视觉-语言-动作模型用于无人机导航

Shengtao Zheng, Kai Li, Weichen Zhang, Yu Meng, Chen Gao, Xinlei Chen, Yong Li, Xiao-Ping Zhang

发表机构 * Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) BNRist, Tsinghua University(清华大学北京信息科学与技术国家研究中心)

AI总结 提出WorldFly框架,通过双分支耦合流匹配机制联合生成未来视频预测和导航动作,解决城市峡谷中严重遮挡和视角剧变下的无人机导航问题。

详情
AI中文摘要

端到端的视觉-语言-动作(VLA)模型在无人机导航中显示出潜力。然而,现有方法通常依赖历史观测直接预测动作,在密集城市环境中常因严重遮挡和急转弯导致视角剧变而表现不佳。我们认为,世界模型固有的“想象”未来状态的能力对于在这种部分可观测性下做出稳健决策至关重要。为此,我们构建了一个具有挑战性的城市峡谷遍历基准,专门用于评估在严重遮挡和视角剧变场景下的空间理解能力。基于此,我们提出了WorldFly,一种新颖的基于世界模型的VLA框架,采用双分支耦合流匹配机制联合生成未来视频预测和导航动作,从而通过空间想象显式引导智能体的策略。在我们基准上的大量评估表明,WorldFly优于其他基线,特别是在未见过的环境中,验证了将世界模型集成到具身空中智能体中的有效性。

英文摘要

End-to-end Vision-Language-Action (VLA) models have shown promise in UAV navigation. However, existing approaches typically rely on historical observations to directly predict actions, often struggling in dense urban environments where severe occlusions and sharp turns result in drastic viewpoint transitions. We argue that the ability to "imagine" future states -- inherent in World Models -- is critical for robust decision-making under such partial observability. To address this, we construct a challenging Urban Canyon Traversal Benchmark, specifically designed to evaluate spatial understanding in scenarios characterized by severe occlusions and drastic viewpoint transitions. Building on this benchmark, we propose WorldFly, a novel world-model-based VLA framework that employs a dual-branch coupled flow matching mechanism to jointly generate future video predictions and navigation actions, thereby explicitly guiding the agent's policy via spatial imagination. Extensive evaluations on our benchmark demonstrate that WorldFly outperforms other baselines, particularly in unseen environments, validating the effectiveness of integrating world models into embodied aerial agents.

2606.06136 2026-06-05 cs.SC cs.AI 版本更新

A Finite Certificate for the Positive $n=9$ Vasc Inequality

正数 $n=9$ 的 Vasc 不等式的一个有限证书

Dakai Guo, Ruichen Qiu, Yichuan Cao, Ruyong Feng

发表机构 * State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, and the School of Mathematics, University of Chinese Academy of Sciences(中国科学院数学与系统科学研究院数学科学全国重点实验室;中国科学院大学数学科学学院)

AI总结 通过人工引导的 AI 代理 MechMath Agent Team,将有理不等式转化为齐次多项式不等式,并利用累积间隙参数化所有排序的固定最大值锥体,生成覆盖 40320 个锥体的有限证书,从而证明了正实数 $n=9$ 情况下的 Vasc 循环不等式。

详情
AI中文摘要

我们证明了正实数 $n=9$ 情况下的 Vasc 循环不等式。该证明是在 AI 代理 MechMath Agent Team 的人工引导协助下获得的:人类可读部分将有理不等式简化为齐次多项式不等式,固定一个循环最大值,并通过累积间隙参数化每个排序的固定最大值锥体;有限部分是一个覆盖所有 $8!=40320$ 个排序锥体的证书。MechMath Agent Team 通过 Python 工具调用生成了证书验证工作流,包括情况划分、验证程序和终端分类。已发布的证书包含 36815 个系数叶子、2236 个普通 Polya 乘子叶子和 1269 个 AM-GM 中点覆盖叶子。人类作者审计了数学简化和验证逻辑,一个单独的工件包含证书、独立验证器和从源代码重建的路径。

英文摘要

We prove the positive-real $n=9$ case of the Vasc cyclic inequality. The proof was obtained with human-guided assistance from the AI agent MechMath Agent Team: the human-readable part reduces the rational inequality to a homogeneous polynomial inequality, fixes a cyclic maximum, and parametrizes each sorted fixed-maximum cone by cumulative gaps; the finite part is a certificate covering all $8!=40320$ sorted cones. MechMath Agent Team generated the certificate verification workflow through Python tool calls, including the case split, verification programs, and terminal classifications. The published certificate has $36815$ coefficient leaves, $2236$ ordinary Polya multiplier leaves, and $1269$ AM-GM midpoint overlay leaves. Human authors audited the mathematical reductions and verification logic, and a separate artifact contains the certificate, an independent verifier, and a from-source rebuild route.
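
The counts quoted in this abstract can be cross-checked with simple arithmetic. The sketch below is not taken from the published artifact; it only confirms that fixing the cyclic maximum leaves 8 variables to order, giving 8! = 40320 sorted cones, and that the three reported leaf counts add up to the same number. The dictionary keys are illustrative names, not identifiers from the certificate.

```python
import math

# Leaf counts as reported in the abstract (key names are illustrative).
leaf_counts = {
    "coefficient": 36815,
    "polya_multiplier": 2236,
    "am_gm_midpoint_overlay": 1269,
}

# With the cyclic maximum fixed, the other 8 variables admit 8! orderings.
sorted_cones = math.factorial(8)
total_leaves = sum(leaf_counts.values())

print(sorted_cones, total_leaves)
```

Both numbers agree, so the reported leaf counts account for the full set of sorted cones.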

2605.03413 2026-06-05 cs.LG cs.AI 版本更新

Learning to Theorize the World from Observation

从观察中学习理论化世界

Doojin Baek, Gyubin Lee, Junyeob Baek, Hosung Lee, Sungjin Ahn

发表机构 * University of Washington(华盛顿大学)

AI总结 受认知科学启发,提出Learning-to-Theorize范式,通过神经理论家(NEO)模型从原始非文本观测中推断显式解释性理论,实现基于解释的泛化。

详情
AI中文摘要

理解世界意味着什么?当代世界模型通常将理解操作化为在潜在空间或观测空间中的准确未来预测。然而,发展认知科学提出了不同的观点:人类理解是通过构建关于世界如何运作的内部理论而涌现的,即使在成熟语言习得之前也是如此。受这种理论构建的认知观点启发,我们引入了Learning-to-Theorize,一种从原始非文本观测中推断世界的显式解释性理论的学习范式。我们通过神经理论家(NEO)实例化该范式,这是一种概率神经模型,它将潜在程序诱导为习得的思维语言,并通过共享的转移模型执行它们。在NEO中,理论被表示为一个可执行的组合程序,其习得的原语可以系统地重新组合以解释新现象。实验表明,这种公式化实现了基于解释的泛化,允许根据生成观测的程序来理解观测。

英文摘要

What does it mean to understand the world? Contemporary world models often operationalize understanding as accurate future prediction in latent or observation space. Developmental cognitive science, however, suggests a different view: human understanding emerges through the construction of internal theories of how the world works, even before mature language is acquired. Inspired by this theory-building view of cognition, we introduce Learning-to-Theorize, a learning paradigm for inferring explicit explanatory theories of the world from raw, non-textual observations. We instantiate this paradigm with the Neural Theorizer (NEO), a probabilistic neural model that induces latent programs as a learned Language of Thought and executes them through a shared transition model. In NEO, a theory is represented as an executable, compositional program whose learned primitives can be systematically recombined to explain novel phenomena. Experiments show that this formulation enables explanation-driven generalization, allowing observations to be understood in terms of the programs that generate them.

2606.06109 2026-06-05 cs.CL cs.AI 版本更新

Harnessing Structural Context for Entity Alignment Foundation Models

利用结构上下文进行实体对齐基础模型

Xingyu Chen, Yuanning Cui, Zequn Sun, Wei Hu

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China(南京大学新型软件技术国家重点实验室) Nanjing University of Information Science and Technology, Nanjing, China(南京信息工程大学) National Institute of Healthcare Data Science, Nanjing University, Nanjing, China(南京大学健康医疗大数据国家研究院)

AI总结 提出ContextEA框架,通过交叉KG交互编码器和结构校准解码器增强结构上下文的构建与利用,在29个数据集上超越强基线,实现更强的跨KG迁移能力。

详情
AI中文摘要

实体对齐(EA)旨在识别异构知识图谱(KG)中的等价实体,是知识融合和跨KG推理的关键组成部分。最近的EA基础模型表明,对齐知识一旦预训练,可以直接应用于各种未见过的KG对。然而,它仍然在两个地方未充分利用结构上下文:编码时跨KG交互较弱,最终候选排序仍然过于依赖粗略的相似性。我们通过ContextEA(一种用于可迁移EA的增强型编码器-解码器框架)来解决这些局限性。在编码器侧,我们引入了一个跨KG交互编码器,该编码器通过锚点桥统一两个KG,并执行更早的关系感知跨图传播。在解码器侧,我们引入了一个结构校准解码器,该解码器使用实体级、邻域级、关系级和锚点感知的结构证据来校准对齐分数。这种设计在保持轻量级的同时,增强了结构上下文的构建和利用。在OpenEA、SRPRS和DBP的29个EA数据集上的实验显示,与强可迁移基线相比,取得了持续改进。值得注意的是,预训练的ContextEA已经在所有三个基准组上超越了微调基线,显示出对未见KG的显著更强的迁移能力。这些结果表明,显式利用结构上下文是改进EA基础模型的有效方向。

英文摘要

Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment knowledge, once pretrained, can be directly applied to diverse previously unseen KG pairs. However, it still underuses structural context in two places: cross-KG interaction is weak during encoding, and final candidate ranking still relies too heavily on coarse similarity. We address these limitations with ContextEA, an enhanced encoder-decoder framework for transferable EA. On the encoder side, we introduce a cross-KG interaction encoder that unifies the two KGs with anchor bridges and performs earlier relation-aware cross-graph propagation. On the decoder side, we introduce a structural calibration decoder that calibrates alignment scores with entity-level, neighborhood-level, relation-level, and anchor-aware structural evidence. This design strengthens both structural context construction and structural context exploitation while remaining lightweight. Experiments on 29 EA datasets in OpenEA, SRPRS, and DBP show consistent gains over strong transferable baselines. Notably, the pretrained ContextEA already surpasses the finetuned baselines on all three benchmark groups, demonstrating substantially stronger transfer to unseen KGs. These results suggest that explicitly harnessing structural context is an effective direction for improving EA foundation models.

2606.06102 2026-06-05 cs.AI cs.LG 版本更新

Step-adaptive multimodal fusion network with multi-scale cloud feature learning for ultra-short-term solar irradiance forecasting

步进自适应多模态融合网络与多尺度云特征学习用于超短期太阳辐照度预测

Jingxin Zhang, Xiaoqin Wang

发表机构 * School of Automation, Southeast University(东南大学自动化学院)

AI总结 提出一种步进自适应多模态融合网络,通过InceptionNeXt提取多尺度云特征、步进自适应低频补偿单元动态调整低频信息,并结合气象时间序列特征进行超短期太阳辐照度预测。

详情
AI中文摘要

超短期太阳辐照度预测对于光伏系统调度和电网稳定性至关重要。现有方法存在三个关键缺陷:单一时间序列模型无法捕捉复杂条件下云的空间动态,标准卷积不能充分表示多尺度云特征,固定的低频补偿策略无法适应不同的预测步长。针对这些问题,本文提出了一种用于超短期辐照度预测的多源数据融合模型。该模型首先采用InceptionNeXt从地基云图像中提取多尺度、多方向的空间特征。然后引入步进自适应低频补偿单元,根据预测步长动态调节全局低频信息。最后,将增强的图像特征与气象时间序列特征相结合,通过TempAttnLSTM网络捕获全局时间依赖性进行多步预测。在公共NREL数据集和山东实际光伏电站上的实验表明,与几种最先进的方法相比,所提方法具有有效性。

英文摘要

Ultra-short-term solar irradiance prediction is critical for photovoltaic system dispatch and power grid stability. Existing approaches suffer from three key shortcomings: single time-series models cannot capture the spatial dynamics of clouds under complex conditions, standard convolutions inadequately represent multi-scale cloud features, and fixed low-frequency compensation strategies fail to adapt to different prediction steps. To address these issues, this paper proposes a multi-source data fusion model for ultra-short-term irradiance prediction. The model first employs InceptionNeXt to extract multi-scale, multi-directional spatial features from ground-based cloud images. A step-adaptive low-frequency compensation unit is then introduced to dynamically modulate global low-frequency information based on the prediction step. Finally, the enhanced image features are combined with meteorological time-series features, and a TempAttnLSTM network captures global temporal dependencies for multi-step prediction. Experiments on the public NREL dataset and on operational photovoltaic stations in Shandong demonstrate the effectiveness of the proposed method compared with several state-of-the-art approaches.

2606.06099 2026-06-05 cs.AI 版本更新

CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model

CogManip: 在大语言模型多轮交互中操控行为的基准测试

Zeyang Yue, Chenfei Yan, Feifei Zhao, Haibo Tong, Mengwen Xu, Xiaozhen Wang, Erliang Lin, Yi Zeng

发表机构 * School of Artificial Intelligence, Beihang University(北京航空航天大学人工智能学院) BrainCog AI Lab, CASIA(中国科学院自动化研究所BrainCog人工智能实验室) Gaoling School of AI, Renmin University of China(中国人民大学高瓴人工智能学院) Beijing-AISI(北京前瞻人工智能安全与治理研究院) Beijing Key Laboratory of Safe AI and Superalignment(北京安全人工智能与超对齐重点实验室) School of Artificial Intelligence, UCAS(中国科学院大学人工智能学院) Huawei Technologies Co., Ltd.(华为技术有限公司)

AI总结 提出CogManip基准,通过1000个多轮交互场景评估15种操控策略风险,发现前沿模型存在显著风险异质性,并揭示提示工程防御的重要性。

详情
AI中文摘要

大语言模型(LLM)在复杂人机交互中是否表现出隐蔽的心理操控已引起越来越多的安全担忧。然而,现有的人工智能安全基准大多局限于显式的规则遵守和静态提示,未能捕捉多轮对话中操控策略的动态性和隐蔽性。我们引入了CogManip,一个全面的基准,在1000个多轮交互场景中评估15种操控策略风险,并由人类专家验证。对包括GPT-5.4和DeepSeek-V3.2等前沿模型在内的13个代表性模型的系统评估揭示了显著的风险异质性,并为未来防御指明了方向。进一步的目标函数扰动分析表明,DeepSeek-V3.2的操控策略对负面和良性系统提示均高度敏感,证明了基于提示的防御工程和隐式目标审计的关键必要性。CogManip为审计现代LLM的隐式心理影响和动态策略选择提供了强大的工具和视角。

英文摘要

Whether Large Language Models (LLMs) exhibit covert psychological manipulation in complex human-AI interactions has become a growing safety concern. However, existing AI safety benchmarks remain largely restricted to explicit rule compliance and static prompts, failing to capture the dynamic and covert nature of manipulative strategies in multi-turn dialogues. We introduce CogManip, a comprehensive benchmark that evaluates 15 manipulation strategy risks across 1,000 multi-turn interaction scenarios, validated by human experts. A systematic evaluation of 13 representative models, including frontier models like GPT-5.4 and DeepSeek-V3.2, reveals significant heterogeneity in risk and points to targeted directions for future defenses. Further analysis of objective function perturbation reveals that DeepSeek-V3.2's manipulation tactics are highly sensitive to both negative and benign system prompts, demonstrating the critical need for prompt-based defense engineering and implicit goal auditing. CogManip offers a robust instrument and perspective for auditing the implicit psychological influence and dynamic strategy selection of modern LLMs.

2606.06096 2026-06-05 cs.LG cs.AI cs.CL 版本更新

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

OrderGrad: 通过顺序统计量策略梯度估计超越均值优化

Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Shota Takashiro, Soichiro Nishimori, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo(东京大学)

AI总结 提出OrderGrad,一种用于顺序统计量目标的似然比和重参数化梯度估计器族,通过奖励变换实现风险厌恶、鲁棒和探索性学习的统一即插即用方法。

详情
AI中文摘要

策略梯度方法通常优化期望回报,但许多现实应用关心回报的分布特性:尾部风险、异常值鲁棒性或最佳K发现。我们引入OrderGrad,一种用于顺序统计量目标的似然比和重参数化梯度估计器族。OrderGrad优化有限样本L-统计量,即排序奖励或成本的加权平均,通过仅改变秩权重来恢复诸如VaR、CVaR、修剪均值、中位数和top-m/最佳K标准等目标。对于任何固定样本大小和秩权重向量,OrderGrad为相应的顺序统计量目标提供无偏梯度估计。该方法实现为简单的奖励变换,然后可在其他标准策略梯度或重参数化更新中使用。我们研究了所得估计量的方差行为,并在均值优化与部署目标不匹配的任务上进行了评估,包括LLM数学后训练和其他任务。OrderGrad为风险厌恶、鲁棒和探索性学习提供了统一的即插即用途径。代码:https://github.com/paavo5/ordergrad

英文摘要

Policy-gradient methods usually optimize expected return, but many real-world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad
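
To make the phrase "weighted averages of sorted rewards" concrete, here is a minimal sketch of a finite-sample L-statistic and a few rank-weight vectors that recover familiar objectives. It only evaluates the objectives; it does not implement OrderGrad's gradient estimator or reward transformation. All function names are hypothetical, and the lower-tail weights are a simple CVaR-style choice made for illustration.

```python
def l_statistic(rewards, rank_weights):
    """Weighted average of rewards sorted in ascending order."""
    if len(rewards) != len(rank_weights):
        raise ValueError("need one weight per rank")
    return sum(w * r for w, r in zip(rank_weights, sorted(rewards)))

def mean_weights(n):
    return [1.0 / n] * n

def median_weights(n):
    weights = [0.0] * n
    if n % 2 == 1:
        weights[n // 2] = 1.0
    else:
        weights[n // 2 - 1] = weights[n // 2] = 0.5
    return weights

def lower_tail_weights(n, m):
    """Average of the m smallest rewards (a CVaR-style, risk-averse objective)."""
    return [1.0 / m] * m + [0.0] * (n - m)

def best_of_weights(n):
    """All weight on the largest reward (best-of-n)."""
    return [0.0] * (n - 1) + [1.0]

def trimmed_mean_weights(n, k):
    """Drop the k smallest and k largest rewards, average the rest."""
    kept = n - 2 * k
    return [0.0] * k + [1.0 / kept] * kept + [0.0] * k

rewards = [0.0, 1.0, 0.2, 0.9, 0.4]
n = len(rewards)
objectives = {
    "mean": l_statistic(rewards, mean_weights(n)),
    "median": l_statistic(rewards, median_weights(n)),
    "lower_tail_2": l_statistic(rewards, lower_tail_weights(n, 2)),
    "best_of_n": l_statistic(rewards, best_of_weights(n)),
    "trimmed_1": l_statistic(rewards, trimmed_mean_weights(n, 1)),
}
print(objectives)
```

Only the weight vector changes between objectives; the sorting step and the weighted sum stay the same, which is what makes a single estimator family plausible.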

2606.06094 2026-06-05 cs.AI cs.LG math.DS physics.med-ph 版本更新

Integrating Mechanistic and Data-Driven Models for Neurological Disorders through Differentiable Programming

通过可微编程整合机制模型与数据驱动模型用于神经系统疾病

Shah Pallav Dhanendrakumar, Saikat Pal, Sitikantha Roy

发表机构 * Department of Applied Mechanics, Indian Institute of Technology Delhi(印度理工学院德里分校应用力学系) Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi(印度理工学院德里分校Yardi人工智能学院)

AI总结 本文综述了混合建模策略,通过可微编程将深度学习与基于物理的求解器结合,用于神经系统疾病的诊断、预后和治疗规划,优于纯机制或纯数据驱动方法。

详情
AI中文摘要

计算建模、神经影像和人工智能的进步正在革新神经系统疾病的建模,以改进诊断、预后和治疗规划。机制模型提供了对疾病的宝贵科学见解,但在实践中常常因假设而简化,或计算昂贵且求解缓慢。然而,纯数据驱动方法虽然提供速度和可扩展性,但需要大量高质量数据进行训练,并且通常存在可解释性和泛化问题。本视角论文提供了混合建模策略的结构化概述,这些策略将深度学习模型与基于物理的求解器相结合,并分为并行、串行和并行-串行架构。强调的三种主要方法是:用于缺失或不完整物理的残差建模、用于连续时间动力学近似的神经常微分方程(NODEs),以及用神经近似加速传统求解器的求解器在环。这些混合模型整合了基于控制微分方程的公式和深度学习,以表征神经系统疾病的演变,并有望实现先进的个性化神经建模。此外,该研究探索并提出了不同的混合配置,以提高诊断准确性、预测疾病进展,并为一系列神经系统疾病提供治疗策略信息。这些能力优于独立的机制或纯数据驱动方法,使混合建模成为强大的工具,特别是在涉及脑肿瘤、阿尔茨海默病和中风等神经系统疾病的进展和治疗反应建模的应用中。

英文摘要

Advances in computational modeling, neuroimaging, and artificial intelligence are revolutionizing the modeling of neurological disorders for improved diagnostics, prognosis, and treatment planning. Mechanistic models provide valuable scientific insight into the disorders, but in practice they are often either simplified by assumptions or computationally expensive and slow to solve. Purely data-driven approaches, in contrast, provide speed and scalability, but they require large, high-quality datasets to train and generally suffer from interpretability and generalization issues. This perspective paper presents a structured overview of hybrid modeling strategies, which combine deep learning models with physics-based solvers and are categorized into parallel, series, and parallel-series architectures. The three main approaches emphasized are residual modeling for missing or incomplete physics, Neural Ordinary Differential Equations (NODEs) for continuous-time dynamics approximation, and solver-in-the-loop methods that accelerate traditional solvers with neural approximations. These hybrid models integrate governing-differential-equation-based formulations with deep learning to characterize the evolution of neurological disorders, and promise advanced personalized neurological modeling. In addition, the study explores and proposes different hybrid configurations to improve diagnostic accuracy, predict disease progression, and inform treatment strategies across a range of neurological disorders. These capabilities outperform standalone mechanistic or purely data-driven approaches, making hybrid modeling a powerful tool, especially in applications that model progression and treatment response in neurological conditions such as brain tumors, Alzheimer's disease, and stroke.
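
As a rough illustration of residual modeling, the sketch below integrates a mechanistic growth law together with an additive correction term. Everything here is an assumption made for illustration: logistic growth stands in for the mechanistic model, a fixed linear decay stands in for a trained neural network, and explicit Euler stands in for the solver. None of it is taken from the paper.

```python
def simulate(rate_fn, x0, dt=0.01, steps=4000):
    """Explicit Euler integration of dx/dt = rate_fn(x)."""
    x = x0
    for _ in range(steps):
        x += dt * rate_fn(x)
    return x

def mechanistic_rate(x, growth=1.0, capacity=1.0):
    """Known physics: logistic growth toward a carrying capacity."""
    return growth * x * (1.0 - x / capacity)

def residual_rate(x, decay=0.5):
    """Placeholder for a learned correction (here a fixed linear decay)."""
    return -decay * x

def hybrid_rate(x):
    """Mechanistic term plus the residual that captures missing physics."""
    return mechanistic_rate(x) + residual_rate(x)

x_mech = simulate(mechanistic_rate, x0=0.1)
x_hybrid = simulate(hybrid_rate, x0=0.1)
print(round(x_mech, 4), round(x_hybrid, 4))
```

With these toy parameters the mechanistic model settles at the carrying capacity, while the hybrid model settles at half of it, showing how a residual term can shift long-run behavior without replacing the governing equation.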

2606.06090 2026-06-05 cs.AI 版本更新

Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents

超越语义组织:记忆作为长时程智能体的执行状态管理

Yaoqi Chen, Haibin Lai, Yuru Feng, Chuyu Han, Qianxi Zhang, Baotong Lu, Menghao Li, Xinjiang Wang, Zhirui Wang, Shusen Xu, Zengzhong Li, Zewen Jin, Hao Wu, Cheng Li, Qi Chen

发表机构 * University of Science and Technology of China(中国科学技术大学) Microsoft(微软) Nanjing University(南京大学) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 针对长时程任务中智能体依赖执行状态而非语义相似性的问题,提出MAGE(记忆作为智能体引导的探索),通过层次状态树管理交互,实现状态完整性和错误隔离,在MemoryArena上任务成功率提升7.8-20.4个百分点,token消耗降低55.1%。

Comments 16 pages

详情
AI中文摘要

基于LLM的智能体越来越多地处理具有相互依赖决策的长时程任务,其中每个动作都会重塑未来约束,中间错误可能级联。现有的RAG和智能体记忆系统通过语义相似性组织历史,在决策时检索内容相关的条目。我们认为这种设计与执行状态依赖不匹配:它分割了决策轨迹,混合了有效和错误的痕迹,阻碍了连贯的状态重建和错误隔离。我们提出MAGE(记忆作为智能体引导的探索),一个主动的执行状态管理器,将交互存储在层次状态树中。智能体从活跃的根到当前路径派生其状态,结合子目标摘要、近期轨迹和来自先前分支的提示。四个耦合操作维护树:Grow记录新轨迹,Compress总结完成的子目标,Maintain验证摘要,Revise恢复目标边界并在新分支上继续。这种设计在保持状态完整性和将缺陷片段与活跃路径隔离的同时,限制了上下文增长。在MemoryArena上的实验表明,MAGE将平均任务成功率提高了7.8-20.4个百分点,同时将token消耗降低了55.1%。

英文摘要

LLM-based agents increasingly tackle long-horizon tasks with interdependent decisions, where each action reshapes future constraints and intermediate errors can cascade. Existing RAG and agent memory systems organize histories by semantic similarity, retrieving content-relevant entries at decision time. We argue that this design mismatches execution-state dependencies: it fragments decision trajectories and mixes valid and erroneous traces, hindering coherent state reconstruction and error isolation. We propose MAGE (Memory as Agent-Guided Exploration), an active execution-state manager that stores interactions in a hierarchical state tree. The agent derives its state from the active root-to-current path, combining subgoal summaries, recent traces, and hints from prior branches. Four coupled operations maintain the tree: Grow records new traces, Compress summarizes completed subgoals, Maintain validates summaries, and Revise restores a target boundary and resumes on a new branch. This design bounds context growth while preserving state integrity and isolating flawed segments from the active path. Experiments on MemoryArena show that MAGE improves the average task success rate by 7.8–20.4 pp over baselines, while reducing token consumption by 55.1%.
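
The state-tree idea can be pictured with a toy data structure. The sketch below is an assumption-laden illustration, not MAGE's implementation: the class and method names are hypothetical, Maintain (summary validation) and hints from prior branches are omitted, and the context is simply the summaries of completed sibling subgoals plus the raw traces on the active path.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    goal: str
    parent: Optional["Node"] = None
    traces: List[str] = field(default_factory=list)
    summary: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

class StateTree:
    def __init__(self, task: str):
        self.root = Node(task)
        self.current = self.root

    def open_subgoal(self, goal: str) -> Node:
        child = Node(goal, parent=self.current)
        self.current.children.append(child)
        self.current = child
        return child

    def grow(self, trace: str) -> None:
        """Record a new trace under the current subgoal."""
        self.current.traces.append(trace)

    def compress(self, summary: str) -> None:
        """Summarize the finished subgoal and step back to its parent."""
        if self.current.parent is None:
            raise ValueError("cannot compress the root task")
        self.current.summary = summary
        self.current = self.current.parent

    def revise(self, boundary: Node, goal: str) -> Node:
        """Return to `boundary` and continue on a fresh branch.
        The abandoned branch stays in the tree but leaves the active path."""
        self.current = boundary
        return self.open_subgoal(goal)

    def context(self) -> List[str]:
        """Summaries of completed subgoals plus traces on the active path."""
        path, node = [], self.current
        while node is not None:
            path.append(node)
            node = node.parent
        path.reverse()
        on_path = {id(n) for n in path}
        lines = []
        for node in path:
            for child in node.children:
                if id(child) not in on_path and child.summary is not None:
                    lines.append(f"[summary] {child.goal}: {child.summary}")
            lines.extend(f"[trace] {t}" for t in node.traces)
        return lines

tree = StateTree("book a trip")
tree.open_subgoal("find flight")
tree.grow("searched flights")
tree.grow("picked LH123")
tree.compress("flight LH123 chosen")
tree.open_subgoal("find hotel")
tree.grow("booked hotel in wrong city")  # flawed segment
tree.revise(tree.root, "find hotel (retry)")
tree.grow("searched hotels in Berlin")
context = tree.context()
print(context)
```

In this toy run the flawed hotel branch is never summarized, so it drops out of the context once the agent revises, while the completed flight subgoal survives only as its summary.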

2606.06087 2026-06-05 cs.CL cs.AI 版本更新

LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

LatentSkill: 从上下文文本技能到LLM智能体的权重内隐技能

Aofan Yu, Chenyu Zhou, Tianyi Xu, Zihan Guo, Rong Shan, Zhihui Fu, Jun Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Sun Yat-Sen University(中山大学) Shanghai Innovation Institute(上海创新研究院) OPPO Research Institute(OPPO研究院)

AI总结 提出LatentSkill框架,通过预训练超网络将文本技能转换为即插即用的LoRA适配器,将技能知识存储在权重空间而非上下文空间,从而减少预填充令牌并提升性能。

Comments 16 pages, 4 figures

详情
AI中文摘要

智能体系统越来越多地使用文本技能来编码可重用的任务流程,但在每一步将这些技能注入提示中会带来大量的上下文开销,并将技能内容暴露为明文。我们提出了LatentSkill,一个通过预训练超网络将文本技能转换为即插即用LoRA适配器的框架。LatentSkill将技能知识存储在权重空间而非上下文空间中,消除了每步的技能令牌,同时保留了模块化加载、缩放和组合。在ALFWorld和Search-QA上,LatentSkill在显著减少预填充令牌的情况下,优于相应的上下文技能基线:在ALFWorld的已见和未见划分上,它分别提高了21.4和13.4个百分点的成功率,预填充令牌减少了64.1%;在Search-QA上,精确匹配提高了3.0个百分点,技能令牌开销降低了72.2%。进一步分析表明,生成的技能LoRA形成了结构化的语义几何,可以通过LoRA缩放系数精确控制,并且在技能组件对齐时可以通过参数空间算术进行组合。这些发现表明,权重空间技能为扩展LLM智能体提供了一种高效、模块化且暴露更少的基础。

英文摘要

Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. These findings suggest that weight-space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.
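
The claims about scaling and parameter-space composition rest on standard LoRA arithmetic: an adapter contributes a low-rank update B·A that is added to the frozen weight, optionally multiplied by a scaling coefficient. The sketch below shows that arithmetic on tiny hand-written matrices. It assumes that composition means summing scaled updates, which the abstract does not spell out, and it does not model the hypernetwork that generates the adapters. All names are hypothetical.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add_scaled(base, delta, scale):
    """Return base + scale * delta, element-wise."""
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

# Frozen base weight (2x2) and two rank-1 adapters, each stored as (B, A).
W = [[1.0, 0.0], [0.0, 1.0]]
B1, A1 = [[1.0], [0.0]], [[0.0, 2.0]]
B2, A2 = [[0.0], [1.0]], [[3.0, 0.0]]

delta1 = matmul(B1, A1)  # low-rank update of skill 1
delta2 = matmul(B2, A2)  # low-rank update of skill 2

# Scaling: apply skill 1 at half strength.
W_half = add_scaled(W, delta1, 0.5)

# Composition (assumed additive): apply both skills at full strength.
W_both = add_scaled(add_scaled(W, delta1, 1.0), delta2, 1.0)

print(W_half, W_both)
```

Because the updates live in weight space, changing the coefficient or adding a second adapter costs no extra prompt tokens, which is the efficiency argument the abstract makes.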

2606.06081 2026-06-05 cs.AI cs.HC 版本更新

A Framework for Measuring Appropriate Reliance on Set-Valued AI Advice

衡量对集合值AI建议适当依赖的框架

Ranjan Mishra, Jakob Schoeffer

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文提出首个正式框架,用于在序列判断-顾问范式中衡量对集合值AI建议的适当依赖,涵盖分类和回归任务,并定义了新的度量指标以捕捉现有方法忽略的细微差别。

详情
AI中文摘要

对AI建议的适当依赖已成为人机协作的核心研究主题。现有框架仅关注点预测作为AI建议。然而,集合值AI建议(例如离散集或连续区间)越来越多地被用于传达不确定性和改善人类决策。在本文中,我们在序列判断-顾问范式中开发了第一个用于衡量对集合值AI建议适当依赖的正式框架,涵盖分类和回归任务。对于分类,我们首先引入了评估集合值AI建议所需的维度。然后我们定义了两个指标:对AI的正确依赖率和对自身的正确依赖率,它们共同表征了这种设置下的适当依赖。对于回归,我们引入了AI依赖的数量和AI依赖的质量,分别衡量决策者是否利用了AI建议以及他们的依赖是否帮助他们相对于初始估计更接近真实值。通过应用我们的框架,我们展示了这些度量如何捕捉现有方法忽略的人机协作中的重要细微差别。

英文摘要

Appropriate reliance on AI advice has become a central research theme in human-AI collaboration. Existing frameworks have focused exclusively on point predictions as AI advice. However, set-valued AI advice (e.g., discrete sets or continuous intervals) is increasingly being used to communicate uncertainty and improve human decision making. In this paper, we develop the first formal framework for measuring appropriate reliance on set-valued AI advice within the sequential judge-advisor paradigm, spanning both classification and regression tasks. For classification, we first introduce the dimensions that are necessary for evaluating set-valued AI advice. We then define two metrics: correct reliance rate on AI and correct reliance rate on self, which jointly characterize appropriate reliance in this setting. For regression, we introduce quantity of AI reliance and quality of AI reliance, which respectively measure whether a decision maker utilized the AI advice and whether their reliance helped them get closer to the ground truth relative to their initial estimate. Through the application of our framework, we demonstrate how these metrics capture important nuances in human-AI collaboration that existing measures overlook.

2606.06080 2026-06-05 cs.LG cs.AI cs.CL 版本更新

On Advantage Estimates for Max@K Policy Gradients

关于 Max@K 策略梯度的优势估计

Shota Takashiro, Soichiro Nishimori, Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Gouki Minegishi, Yusuke Iwasawa, Takeshi Kojima, Yutaka Matsuo

发表机构 * The University of Tokyo(东京大学)

AI总结 针对稀疏奖励下推理模型后训练困难,提出一种新的优势估计方法 MaxPO,通过 Leave-Two-Out 基线实现中心化优势,降低梯度方差并提升性能。

详情
AI中文摘要

具有可验证奖励的强化学习广泛用于推理模型的后训练,但稀疏的结果奖励使得探索困难。一种补充方法是直接优化推理时目标如 pass@K 和 max@K,然而现有针对这些目标的策略梯度估计器使用不同的信号、基线和归一化,使得它们之间的关系不明确。我们通过基线设计和优势中心化来研究这个问题。从该领域领先方法的优势估计器出发,我们证明它是策略梯度无偏的,但产生非中心化的优势。然后我们引入一种 Leave-Two-Out 基线,它在保持策略梯度无偏性的同时,使得实现的批次优势完全中心化。由此产生的方法 MaxPO 具有高效的二次时间实现,并自然地集成到基于组的 LLM 后训练强化学习中。我们进一步推导了 max@K 的规范有限批次优势,为现有优势估计器提供了统一视角。实验上,我们验证了 L2O 基线降低了梯度方差,并优于非中心化的替代方案。

英文摘要

Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.
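
For readers unfamiliar with the objective itself, the sketch below estimates max@K from n sampled rewards using a standard combinatorial identity: the i-th smallest reward is the maximum of exactly C(i-1, K-1) of the C(n, K) size-K subsets. This is a generic estimator of the objective value, checked against brute-force enumeration; it is not the MaxPO advantage or the Leave-Two-Out baseline, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb

def max_at_k(rewards, k):
    """Average of max(subset) over all size-k subsets, computed in closed form."""
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    ordered = sorted(rewards)
    # With 0-based index i, exactly i rewards are no larger, so the i-th
    # sorted reward is the maximum of comb(i, k - 1) subsets.
    return sum(comb(i, k - 1) * r for i, r in enumerate(ordered)) / comb(n, k)

def max_at_k_bruteforce(rewards, k):
    """Reference implementation: enumerate every size-k subset."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)

continuous = [0.1, 0.7, 0.3, 0.9]
binary = [1, 0, 0, 1, 0]

estimate = max_at_k(continuous, 2)
reference = max_at_k_bruteforce(continuous, 2)
pass_at_2 = max_at_k(binary, 2)
print(estimate, reference, pass_at_2)
```

With binary rewards the same formula reduces to the familiar pass@K estimate, 1 - C(n-c, K)/C(n, K) for c correct samples, which is why the two objectives are often treated together.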

2606.06058 2026-06-05 cs.LG cs.AI cs.CL 版本更新

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

MDP-GRPO:面向多约束指令跟随的稳定化组相对策略优化

Mohammad Mahdi Salmani-Zarchi, Zahra Rahimi, Heshaam Faili, Mohammad Javad Dousti

发表机构 * Department of Electrical and Computer Engineering, College of Engineering, University of Tehran(德黑兰大学工程学院电气与计算机工程系) Department of Statistics, Mathematics and Computer Science, Allameh Tabataba’i University(阿拉梅·塔巴塔巴伊大学统计、数学与计算机科学系)

AI总结 针对标准GRPO在离散低分散奖励下的不稳定性,提出MDP-GRPO,通过多温度采样、双锚优势、前景理论整形和非对称KL正则化,在FollowBench等数据集上提升严格约束满足率最高5.0%。

Comments Accepted to ACL 2026 Main Conference. 14 pages, 9 figures

详情
AI中文摘要

可验证奖励的强化学习非常适合多约束指令跟随,但标准组相对策略优化(GRPO)在离散、低分散奖励下变得不稳定,此时组内奖励分布常常同质。我们识别并形式化了在此场景下z-score组归一化的三种病理:低方差放大、均值中心盲视和零方差崩溃。为解决这些问题,我们提出MDP-GRPO,通过以下方式稳定学习:(1)多温度采样以增加奖励分散度,(2)双锚优势以恢复同质组中的梯度并阻止均值中心盲视,(3)基于Kahneman和Tversky理论的前景理论整形以限制更新并惩罚违规,以及(4)非对称KL正则化。在FollowBench、IFEval和一个精心策划的多约束数据集上评估,MDP-GRPO优于标准GRPO,在Llama-3.2-3B上将严格约束满足率提高了最多5.0%。我们的方法还能够在保持MMLU和ARC上通用能力的同时,在较小的组规模下实现稳定收敛。

英文摘要

Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sampling to increase reward dispersion, (2) dual-anchor advantages to restore gradients in homogeneous groups and stop mean-centering blindness, (3) prospect-theoretic shaping to bound updates and penalize violations based on Kahneman and Tversky's theory, and (4) asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. Our method also enables stable convergence with small group sizes while preserving general capabilities on MMLU and ARC.
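
The three pathologies are easy to reproduce with the usual z-score group normalization, sketched below in plain Python. The example groups are made up, and the reading of "mean-centering blindness" as loss of the absolute reward level is an interpretation of the abstract, not a definition quoted from the paper. The sketch shows only the baseline behavior, not any of the four MDP-GRPO remedies.

```python
def zscore_advantages(rewards, eps=1e-8):
    """Standard group normalization: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Zero-variance collapse: identical rewards give no learning signal at all.
collapsed = zscore_advantages([1.0, 1.0, 1.0, 1.0])

# Low-variance amplification: a 0.01 reward gap becomes a large advantage.
amplified = zscore_advantages([0.90, 0.91, 0.90, 0.90])

# Mean-centering blindness (as interpreted here): a weak group and a strong
# group receive the same advantages because only relative rank survives.
low_group = zscore_advantages([0.1, 0.2, 0.1, 0.2])
high_group = zscore_advantages([0.8, 0.9, 0.8, 0.9])

print(collapsed, max(amplified), low_group, high_group)
```

Discrete, low-dispersion rewards make homogeneous or near-homogeneous groups common, so these cases arise often rather than as rare corner cases.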

2606.06056 2026-06-05 cs.SE cs.AI cs.LG 版本更新

Metamorphic Testing with the Rashomon Set: Explanation Faithfulness in Machine Learning

使用Rashomon集的蜕变测试:机器学习中的解释忠实性

Helge Spieker, Jørn Eirik Betten, Arnaud Gotlieb

发表机构 * Norwegian Ministry of Education and Research(挪威教育与研究部)

AI总结 针对机器学习中因Rashomon效应导致解释不可靠的问题,提出基于蜕变测试的框架,通过后验解释方法评估特征归因的忠实性,无需真实标签。

Comments Accepted at 10th International Workshop on Metamorphic Testing (MET 2026)

详情
AI中文摘要

多个机器学习模型在同一任务上可以达到近乎相等的预测性能,但提供不同的基于特征的解释。这被称为(可解释)机器学习的Rashomon效应,它引发了哪些解释(如果有的话)是可信的问题。我们提出了一个基于蜕变测试的框架,该框架通过探索后验解释方法中的归因特征重要性来评估解释忠实性,无需真实标签。五个蜕变关系形式化了模型行为与特征归因之间的预期一致性属性。我们将这个通用框架应用于两个表格回归数据集和两个后验解释器(SHAP和LIME)以演示该方法。该框架提供了一个实用的、模型无关的工具,用于选择具有可靠和可信解释的准确模型。

英文摘要

Multiple machine learning models can achieve near-equivalent predictive performance on the same task, yet provide divergent feature-based explanations. This is called the Rashomon effect of (explainable) machine learning, and it raises the question of which explanations, if any, are trustworthy. We propose a framework based on metamorphic testing that assesses explanation faithfulness without requiring ground-truth labels by exploring attributed feature importance from post-hoc explanation methods. Five metamorphic relations formalize expected consistency properties between model behavior and feature attributions. We apply this general framework to two tabular regression datasets and two post-hoc explainers (SHAP and LIME) to demonstrate the approach. The framework offers a practical, model-agnostic tool for selecting accurate models with reliable and trustworthy explanations.

2606.06055 2026-06-05 cs.AI 版本更新

When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents

记忆何时应保持沉默:衡量记忆增强型对话代理的记忆使用边界

Lingxiang Xu, Jiaoyun Yang, Min Hu, Hongtu Chen, Ning An

发表机构 * Hefei University of Technology(合肥工业大学) Harvard Medical School(哈佛医学院)

AI总结 提出RBI-Eval框架,通过探针集比较模型在有/无敏感记忆时的行为差异,发现当前检索增强生成系统无法避免敏感记忆的不当整合,需在检索和生成阶段同时进行记忆感知决策。

Comments 21 pages, 10 figures

详情
AI中文摘要

长期记忆使语言模型代理能够支持个性化交互,但目前尚不清楚何时可用记忆应被整合到响应中。现有的记忆评估强调检索准确性和下游任务效用,而忽略了检索到的敏感记忆内容在当前轮次中是否合理。我们引入RBI-Eval,这是一种基于探针集的受控测量研究,比较模型在相同良性提示下访问和不访问敏感记忆时的行为。我们在四种记忆访问设置(全上下文暴露和三种检索系统)下,针对四个基础LLM与匹配的无记忆参考进行评估。我们的结果揭示了显著的行为差异。在有记忆可用时,GPT-5.4-mini的敏感记忆整合分离分数相对于匹配的无记忆参考下降了8.9%–26.6%,而Claude-Sonnet-4.6、DeepSeek-V4-Flash和Qwen3.5-9B下降了51.1%–82.9%。对DeepSeek和GPT-5.4-mini的对照实验表明,这种效应是敏感内容特有的,而非一般个性化。检索系统减少了暴露,但一旦敏感记忆到达生成器,并不能消除整合。这些发现表明,安全个性化需要在检索和生成时都进行记忆感知决策。

英文摘要

Long-term memory enables language model agents to support personalized interactions, but it remains unclear when available memories warrant integration into responses. Existing memory evaluations emphasize retrieval accuracy and downstream task utility, while overlooking whether retrieved sensitive memory content is warranted in the current turn. We introduce RBI-Eval, a controlled measurement study built around a probe set that compares model behavior with and without access to sensitive memory under identical benign prompts. We evaluate four base LLMs against a matched no-memory reference across four memory-access settings: full-context exposure and three retrieval systems. Our results reveal substantial behavioral divergence. With memory available, the separation score for sensitive-memory integration decreases by 8.9%–26.6% relative to the matched no-memory reference for GPT-5.4-mini, but by 51.1%–82.9% for Claude-Sonnet-4.6, DeepSeek-V4-Flash, and Qwen3.5-9B. Control experiments on DeepSeek and GPT-5.4-mini show this effect is specific to sensitive content, rather than general personalization. Retrieval systems reduce exposure but do not eliminate integration once sensitive memory reaches the generator. These findings suggest safe personalization requires memory-aware decisions at both retrieval and generation time.

2606.06054 2026-06-05 cs.AI 版本更新

Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

超越相似性:面向个人AI代理的可信记忆搜索

Jiawen Zhang, Kejia Chen, Jiachen Ma, Yangfan Hu, Lipeng He, Yechao Zhang, Jian Liu, Xiaohu Yang, Tianwei Zhang, Ruoxi Jia

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学) National University of Singapore(新加坡国立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 针对个人AI代理中基于语义相似性的记忆检索存在的信任漏洞,提出轻量级记忆插件MemGate,通过查询条件神经门控实现可信记忆搜索。

详情
AI中文摘要

个人AI代理越来越依赖长期记忆来跨会话提供持久个性化。然而,现有的记忆流水线主要由语义相似性驱动:检索与当前查询语义接近的记忆数据并将其注入模型上下文。这造成了关键的信任差距,因为语义相关的记忆可能仍然在上下文中不合适,导致跨域泄露、谄媚、工具调用漂移或记忆引发的越狱等威胁。在本文中,我们将记忆搜索作为个人AI代理中的信任边界进行研究。我们评估了代表性的代理记忆框架,包括A-Mem、Mem0和MemOS,以及OpenClaw(一个具有持久状态和工具使用能力的真实世界个人代理环境)。我们的结果表明,长期记忆不仅仅是一个实用层,而是一个持久的控制通道,可以重塑代理如何解释任务和执行操作,使其极易受到上述威胁的影响。为了缓解这些漏洞,我们提出了MemGate,一个轻量级且可部署的记忆插件,用于可信记忆搜索,仅9M参数和35.1MB占用空间。MemGate插入在向量记忆存储和骨干LLM之间,无需修改LLM、重写记忆数据库或推理时LLM评判。它对候选记忆表示应用查询条件神经门控,将原始相似性搜索转化为任务条件记忆准入。在多个主流记忆框架、真实世界代理设置和多样化LLM骨干上,MemGate在保留长期记忆效用的同时减少了记忆引发的威胁。

英文摘要

Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks. In this paper, we study memory search as a trust boundary in personal AI agents. We evaluate representative agentic memory frameworks, including A-Mem, Mem0, and MemOS, together with OpenClaw, a real-world personal-agent environment with persistent state and tool-use capability. Our results show that long-term memory is not merely a utility layer, but a durable control channel that can reshape how agents interpret tasks and execute actions, leaving them highly susceptible to the aforementioned threats. To mitigate these vulnerabilities, we propose MemGate, a lightweight and deployable memory plug-in for trustworthy memory search, with only 9M parameters and a 35.1MB footprint. MemGate is inserted between the vector memory store and the backbone LLM, requiring no LLM modification, memory-database rewriting, or inference-time LLM judge. It applies a query-conditioned neural gate to candidate memory representations, turning raw similarity search into task-conditioned memory admission. Across multiple mainstream memory frameworks, real-world agent settings, and diverse LLM backbones, MemGate reduces memory-induced threats while preserving long-term memory utility.

2606.06041 2026-06-05 cs.RO cs.AI cs.NE 版本更新

Sample-efficient Low-level Motion Planning for Robotic Manipulation Tasks via Zero-shot Transfer Learning

通过零样本迁移学习实现机器人操作任务的样本高效低级运动规划

Yuanzhi He, Victor Romero-Cano, José J. Patiño, Juan David Hernández, William Sawtell, Gualtiero Colombo

发表机构 * School of Computer Science & Informatics, Cardiff University, Cardiff, UK(计算机科学与信息学学院,卡迪夫大学,卡迪夫,英国)

AI总结 提出iCEM+TL框架,通过迁移学习和奖励重塑提高复杂操作任务的成功率,仿真中提升高达23%,并在真实机器人上验证。

Comments 12 pages, 5 figures; accepted at the International Conference on Artificial Neural Networks (ICANN) 2026

详情
AI中文摘要

随着机器人系统变得日益复杂,其运动规划模型的复杂性和更长的训练时间带来了巨大挑战。进化算法如样本高效交叉熵方法(iCEM)最近通过利用高效的知识重用策略来提升性能,在低级实时规划中展现出潜力。尽管在许多控制任务中有效,但iCEM在更复杂场景中的性能可能受到限制,特别是那些需要堆叠、滑动和放置到架子的任务。在这项工作中,我们提出了一种新颖的iCEM+TL框架,明确利用迁移学习(TL),其中关键的iCEM参数从较简单的上游任务迁移以指导更复杂的下游任务。此外,我们通过任务分解对堆叠物体和放置到架子应用了奖励重塑(RR)以优化任务特定性能。仿真结果表明,我们的框架实现了高达23%的成功率提升。该框架还在真实的Franka Emika机器人上的堆叠任务中得到进一步验证,展示了其在实际部署中的可行性。

英文摘要

As robotic systems become more sophisticated, the growing complexity of their motion planning models and the longer training times pose substantial challenges. Evolutionary algorithms such as the Sample-efficient Cross-Entropy Method (iCEM) have recently demonstrated promising potential for low-level real-time planning by leveraging efficient knowledge reuse strategies to improve performance. Although effective in many control tasks, iCEM's performance can be constrained in more complex scenarios, particularly those requiring stacking, sliding, and shelf placement. In this work, we propose a novel iCEM+TL framework that explicitly leverages Transfer Learning (TL), where key iCEM parameters are transferred from simpler upstream tasks to guide more complex downstream tasks. Additionally, we applied Reward Redesign (RR) through task decomposition for stacking objects and shelf placement to optimize task-specific performance. Results from the simulation show that our framework achieves success rate improvements of up to 23%. The framework is further validated on a real Franka Emika robot in a stacking task, demonstrating its practical feasibility for real-world deployment.

2606.06036 2026-06-05 cs.AI cs.IR 版本更新

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

记忆是重建的,而非检索的:面向LLM智能体的图记忆

Shuo Ji, Yibo Li, Bryan Hooi

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出MRAgent框架,通过关联记忆图和主动重建机制,使LLM智能体在推理过程中动态调整记忆访问,显著提升长程记忆推理性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

尽管近期取得了进展,LLM智能体在处理长交互历史推理时仍面临困难。当前记忆增强智能体依赖静态的检索-推理范式,这种僵化的流水线设计阻碍了它们根据推理过程中发现的中间证据动态调整记忆访问。为弥补这一差距,我们提出MRAgent,一个将关联记忆图与主动重建机制相结合的框架。我们将记忆表示为线索-标签-内容图,其中关联标签作为语义桥梁连接细粒度线索与记忆内容。在此结构上,我们的主动重建机制将LLM推理直接融入记忆访问,使智能体能够基于累积证据迭代探索和修剪检索路径。这确保了记忆检索动态适应推理上下文,同时避免无约束扩展导致的组合爆炸。在LoCoMo基准和LongMemEval基准上的实验表明,该方法在强基线上取得了显著提升(高达23%),同时大幅降低了令牌和运行时间成本,凸显了主动和关联重建对于长程记忆推理的有效性。

英文摘要

Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. While current memory-augmented agents rely on a static retrieve-then-reason paradigm, this rigid pipeline design prevents them from dynamically adapting memory access to intermediate evidence discovered during inference. To bridge this gap, we propose MRAgent, a framework that combines an associative memory graph with an active reconstruction mechanism. We represent memory as a Cue-Tag-Content graph, where associative tags serve as semantic bridges connecting fine-grained cues to memory contents. Operating on this structure, our active reconstruction mechanism integrates LLM reasoning directly into memory access, allowing the agent to iteratively explore and prune retrieval paths based on accumulated evidence. This ensures that memory retrieval is dynamically adapted to the reasoning context while avoiding combinatorial explosion caused by unconstrained expansion. Experiments on the LoCoMo benchmark and LongMemEval benchmark demonstrate significant improvements over strong baselines (up to 23%), while substantially reducing token and runtime cost, highlighting the effectiveness of active and associative reconstruction for long-horizon memory reasoning.

2606.06034 2026-06-05 cs.LG cs.AI 版本更新

When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

当足够好即最优:量化门控DeltaNet的仅乘法矩阵求逆近似

Luoming Zhang, Yuwei Ren, Kui Zhang, Tian Liu, Lingjuan Ge, Denghao Li, Matthew Harper Langston, Yin Huang, Weiliang Will Zeng, Liang Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对分块并行线性注意力中矩阵求逆的瓶颈,提出基于截断Neumann级数展开的仅矩阵乘法算法,结合结构掩码和并行残差校正,实现NPU上5倍内核加速和20%解码层开销降低。

详情
AI中文摘要

分块并行线性注意力中的矩阵求逆是长上下文建模的主要瓶颈,尤其是在NPU上,基于前向替换的方法并行性有限且硬件利用率低。我们提出了一种快速的、基于矩阵乘法(MatMul)的算法,专门针对分块线性注意力中出现的严格下三角矩阵。受Neumann级数项快速增长和逆矩阵对角集中性的启发,我们采用截断Neumann展开,结合结构掩码和并行残差校正,以消除顺序依赖。我们进一步将方法扩展到低比特INT,通过缓解重复矩阵幂运算引起的动态范围扩展,并根据块大小调整近似阶数和残差步长,以最小化计算成本同时保持模型精度。在Qwen3.5系列模型上的实验表明,在浮点和低精度推理下,该方法实现了高达5倍的内核级加速和20%的解码层开销降低,同时保持了精度。我们的方法为可扩展线性注意力提供了一种高效且硬件友好的解决方案。

英文摘要

Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bit INT by mitigating the dynamic range expansion arising from repeated matrix power operations, and adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model's accuracy. Experiments on Qwen3.5-family models demonstrate up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.
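The core identity behind a Neumann-series inverse can be illustrated in a few lines. The sketch below is not the paper's kernel: it assumes the matrix to invert has the form I - A with A strictly lower-triangular (the sign convention, the helper names, and the example matrix are all illustrative assumptions), and it omits the structural masking, residual correction, and INT quantization the abstract mentions. Because a strictly lower-triangular n×n matrix satisfies A^n = 0, the series terminates, so summing up to order n-1 is exact and lower orders give an approximation.

```python
def matmul(X, Y):
    """Plain dense matrix product of two list-of-lists matrices."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_inverse(A, order):
    """Approximate (I - A)^{-1} by sum_{k=0}^{order} A^k.

    Uses the Horner-style recurrence S <- I + A @ S, so only matrix
    multiplications and additions are needed (no forward substitution).
    """
    n = len(A)
    eye = identity(n)
    S = identity(n)
    for _ in range(order):
        AS = matmul(A, S)
        S = [[eye[i][j] + AS[i][j] for j in range(n)] for i in range(n)]
    return S


def max_residual(A, S):
    """Largest entry of |(I - A) S - I|; zero when S is the exact inverse."""
    n = len(A)
    eye = identity(n)
    i_minus_a = [[eye[i][j] - A[i][j] for j in range(n)] for i in range(n)]
    P = matmul(i_minus_a, S)
    return max(abs(P[i][j] - eye[i][j]) for i in range(n) for j in range(n))


# Hypothetical 4x4 strictly lower-triangular block (values chosen for illustration).
A = [[0.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [0.25, 0.5, 0.0, 0.0],
     [0.125, 0.25, 0.5, 0.0]]

exact = neumann_inverse(A, order=3)      # order n-1: series has terminated
truncated = neumann_inverse(A, order=1)  # I + A only: leaves a residual of -A^2
```

Comparing `max_residual(A, exact)` with `max_residual(A, truncated)` shows the trade-off the abstract describes: a lower approximation order costs fewer matmuls but leaves a residual that some correction step would have to absorb.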

2606.06027 2026-06-05 cs.AI cs.CL cs.LG cs.SI 版本更新

RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit

RedditPersona: 一个用于从Reddit进行社区条件化LLM适配的模块化框架

Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman

发表机构 * Future Computing Group University of Oulu(未来计算组奥卢大学) Centre for Applied Computing University of Oulu(应用计算中心奥卢大学)

AI总结 提出RedditPersona模块化框架,通过五种分组策略和QLoRA训练参数高效适配器,在112个Reddit子版块上评估社区条件化语言模型,发现适配器的行为可识别性与策略内在一致性相关,且所有策略在可识别性和分布相似性之间存在一致权衡。

详情
AI中文摘要

社区条件化的语言模型适配需要在每个研究中独立做出关于数据收集、社区定义和评估的选择,这使得比较假设或重用工件变得困难。我们提出了RedditPersona,一个模块化框架,标准化了这些选择:它收集Reddit帖子和评论,分析活跃用户,根据五种分组策略(基于子版块、图结构、语义、混合和基于交互)对用户进行划分,通过QLoRA为每种策略训练参数高效的适配器,并在一个涵盖流畅性、忠实度、分布对齐和社区可识别性的共享度量套件下进行评估。应用于城市福祉领域的112个子版块(301,429个用户档案,超过1600万条评论),我们发现适配器的行为可识别性追踪了每种策略与子版块基线的内在一致性,并且所有五种策略在可识别性和与真实文本的分布相似性之间存在一致的权衡。代码和配置文件可在以下网址获取:https://github.com/Ahghaffari/redditpersona。

英文摘要

Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.

2606.06025 2026-06-05 cs.CL cs.AI 版本更新

EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation

EGTR-Review: 基于多智能体教师蒸馏的高效证据支撑科学同行评审生成

Xinpeng Qiu, Wang Yihu, Zhifeng Liu, Xiaochen Wang, Jimin Wang

发表机构 * Department of Information Management, Peking University(北京大学信息管理系) PKU-WUHAN Institute for Artificial Intelligence, Peking University(北京大学武汉人工智能研究院)

AI总结 提出EGTR-Review框架,通过多智能体教师蒸馏和证据加权目标,实现轻量级学生模型的高质量、可溯源同行评审生成。

详情
AI中文摘要

科学同行评审生成因能减少评审负担并提供及时反馈而受到越来越多的关注。然而,现有基于大型语言模型(LLM)的方法往往产生缺乏证据支持和弱源可追溯性的通用评论,而复杂的多智能体系统则导致高推理成本。为应对这些挑战,我们提出EGTR-Review,一种通过多智能体教师蒸馏实现的证据支撑且可追溯的评审生成框架。EGTR-Review首先构建一个多智能体教师,执行结构感知的论文分解、关键元素提取、外部学术证据检索、证据状态标注、验证推理和评审合成。然后,通过任务前缀驱动的多任务学习,将中间推理轨迹和最终评审评论蒸馏到轻量级学生模型中。证据加权目标进一步减少弱、缺失或不可验证监督的影响。在公共同行评审数据集上的实验表明,EGTR-Review(学生)在自动指标、LLM作为评判者评估和人工评估中均优于强提示基、微调基和结构化/智能体基线,同时保持强事实基础和源可追溯性,且显著降低令牌消耗和推理时间。我们的代码、提示、配置和样本数据可在GitHub上获取。

英文摘要

Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)-based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an Evidence-Grounded and Traceable Review Generation framework via Multi-Agent Teacher Distillation. EGTR-Review first constructs a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines across automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. Our code, prompts, configurations, and sample data are available on GitHub.

2606.06014 2026-06-05 cs.AI cs.RO 版本更新

PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

PLAN-S:通过潜在风格动态桥接规划以实现自动驾驶世界模型

Xiaoyun Qiu, Jingtao He, Yijie Chen, Yusong Huang, Haotian Wang, Yixuan Wang, Xinhu Zheng

发表机构 * Intelligent Transportation Thrust, Systems Hub, and Center of Seamless Connectivity & Connected Intelligence, The Hong Kong University of Science and Technology (Guangzhou)(智能交通学域、系统枢纽及无缝连接与互联智能中心,香港科技大学(广州))

AI总结 提出PLAN-S框架,通过从潜在表示解码风格条件语义成本图,解决自动驾驶中潜在世界模型规划的可控性问题,在nuScenes和NAVSIM上降低了碰撞率并提升了驾驶性能。

详情
AI中文摘要

潜在世界模型通过预测紧凑的场景动态来增强端到端自动驾驶,用于下游规划。然而,现有的基于潜在世界模型的规划器通常直接从纠缠的潜在表示生成轨迹。这种紧凑的潜在到规划器路径缺乏对风险、可驾驶性和多样风格偏好的显式建模,使得驾驶风格动态在最终轨迹选择之前难以监督、检查或调制。我们提出PLAN-S(具有潜在风格动态的规划),一个面向规划器的桥接方法,通过从潜在表示解码风格条件的四通道语义成本图来解决这种紧凑-可控性困境。成本图以自我状态和驾驶风格为条件,并通过两个宿主侧接口在规划决策上游被消费:用于回归规划器的注意力级融合和用于锚点得分规划器的奖励级融合。我们在两个架构不同的宿主上验证PLAN-S:nuScenes上的ResWorld和NAVSIM上的WoTE,同时冻结宿主骨干以隔离所提出的桥接的贡献。在nuScenes上,PLAN-S在每个时间范围上降低了基线L2,平均L2为0.55米,3秒碰撞率相对降低42%。在NAVSIM上,规则成本变体达到89.4的预测驾驶模型分数,而学习成本变体在基线挑战场景中提供了互补增益。消融实验表明,成本路径对更安全的轨迹选择贡献最直接。定性结果进一步显示,PLAN-S可以产生多样化的成本图,其空间一致的变化与不同的驾驶风格对齐。

英文摘要

Latent world models (LWMs) have strengthened end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM-based planners usually generate trajectories directly from entangled latent representations. This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving-style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that addresses this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed upstream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned cost variant provides complementary gains on baseline-challenging scenes. Ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results further show that PLAN-S can produce diverse cost maps, with spatially consistent variations aligned to different driving styles.

2606.06003 2026-06-05 cs.AI 版本更新

Beyond Vector Similarity: A Structural Analysis of Graph-Augmented Retrieval for Industrial Knowledge Graphs

超越向量相似性:面向工业知识图谱的图增强检索结构分析

Grama Chethan

发表机构 * Grama Chethan

AI总结 本文通过对比八种检索架构,提出操作符词汇表论点,证明基于LLM的图推理瓶颈在于计算操作符而非模型智能,并引入LLM查询规划器,在工业知识图谱上实现优于定制处理器的性能。

Comments 11 pages

详情
AI中文摘要

检索增强生成(RAG)在需要对互连实体进行结构推理的查询上系统性失败。我们比较了八种用于航空航天供应链情报的检索架构,从文本检索逐步过渡到图遍历和图计算。使用一个包含46个节点和64条类型边的知识图谱,我们评估了10个意图类别下的23个查询,并证明向量检索在结构上无法覆盖五类查询。我们的核心发现是操作符词汇表论点:基于LLM的图推理的障碍不是模型智能,而是作为工具可用的计算操作符。一个配备9种类型遍历原语的LLM查询规划器在性能上优于定制处理器(F1=0.632 vs 0.472),同时能泛化到未见查询。添加6种图计算工具后,LLM仅在遍历失败的查询类别上选择性采用它们。我们还发现一个测量差距:实体级F1系统性低估了正确答案为完整集合的结构查询。

英文摘要

Retrieval-Augmented Generation (RAG) fails systematically on queries requiring structural reasoning over interconnected entities. We compare eight retrieval architectures for aerospace supply chain intelligence, progressing from text retrieval through graph traversal to graph computation. Using a 46-node knowledge graph with 64 typed edges, we evaluate 23 queries across 10 intent categories and demonstrate that five query classes are structurally unreachable for vector retrieval. Our central finding is the operator vocabulary thesis: the barrier to LLM-based graph reasoning is not model intelligence but the computational operators available as tools. An LLM Query Planner with 9 typed traversal primitives outperforms bespoke handlers (F1 = 0.632 vs. 0.472) while generalizing to unseen queries. Adding 6 graph computation tools, the LLM selectively adopts them for exactly the query categories where traversal fails. We also identify a measurement gap: entity-level F1 systematically under-scores structural queries where comprehensive answers are correct.

2606.05999 2026-06-05 cs.CV cs.AI 版本更新

ATT-CR: Adaptive Triangular Transformer for Cloud Removal

ATT-CR: 自适应三角变换器用于云去除

Yang Wu, Ye Deng, Pengna Li, Wenli Huang, Kangyi Wu, Xiaomeng Xin, Jinjun Wang

发表机构 * Xi’an Jiaotong University(西安交通大学) School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics(计算机与人工智能学院,西南财经大学) Ningbo University of Technology(宁波工程学院)

AI总结 提出自适应三角变换器(ATT-CR),通过三角注意力和特征选择门控模块降低计算复杂度并减少云像素干扰,实现高效云去除。

详情
AI中文摘要

云去除旨在准确重建遥感图像中被云遮挡的地面物体。现有的基于Transformer的方法利用自注意力有效建模云图像中的长距离依赖,取得了显著效果。然而,它们存在以下问题:1)自注意力的高计算复杂度限制了可扩展性;2)在注意力计算中将云像素和干净像素均视为有效,会在后续层中引入干扰,导致性能次优。为解决这些挑战,我们提出了自适应三角变换器用于云去除(ATT-CR),该模型有效降低了计算成本并减轻了云像素的干扰。具体而言,它包含两个核心组件:三角注意力(TAN)和特征选择门控模块(FSGM)。TAN使用下三角和上三角矩阵近似Softmax注意力,计算复杂度为O(N),显著降低了计算成本。而FSGM与TAN集成,自适应地区分云特征和干净特征,从而最小化无效信息引入后续层。在云去除基准上的大量实验表明,ATT-CR相比现有方法具有更优的性能。

英文摘要

Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.

2606.05998 2026-06-05 cs.CV cs.AI 版本更新

Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images

基于深度学习的二维口内图像三维口腔重建

Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm

发表机构 * KAIST(韩国科学技术院)

AI总结 提出一种仅用十张二维口内图像进行三维口腔重建的软件方法,采用MobileNetV2与多头注意力机制,降低成本和不适,实现自动化重建。

Comments 4 pages, 5 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025

详情
AI中文摘要

口腔三维建模是牙科中最关键的阶段之一,常用的方法如印模和口内扫描各有显著局限。印模法将藻酸盐或硅胶材料放入托盘并插入患者口腔形成阴模,存在患者不适、材料变形误差及存储运输困难等问题。口内扫描仪利用结构光或激光技术实时直接扫描口腔结构,效果先进但设备成本极高。为解决这些问题,本文提出一种基于软件的方法,仅使用从不同角度拍摄的十张二维口内图像重建三维口腔模型,无需专用硬件设备。该方法降低成本,消除物理扫描设备需求,减少患者不适,并实现自动化三维重建。模型在公开的Dental3DS数据集(包含950个上颌样本)上训练,采用MobileNetV2作为图像编码器,结合多头注意力进行多视图特征融合。所提模型在最近邻匹配(距离阈值0.035)下达到77.49%的准确率。然而,预测顶点倾向于集中在真实值的高密度区域,导致重建模型上的点分布不均匀。

英文摘要

Oral 3D modelling is one of the most essential stages in dentistry, and many different approaches, such as impression taking and intraoral scanning, are commonly used for this phase, each with notable limitations. Impression taking, which involves placing alginate or silicone material in a tray and inserting it into the patient's oral cavity to form a negative mold, suffers from significant patient discomfort, material deformation errors, and difficulties in storage and transportation. Intraoral scanners, which directly scan oral structures in real time using structured light or laser technology, produce state-of-the-art results but are associated with substantially high equipment costs. To address these limitations, this paper proposes a software-based approach that reconstructs a 3D oral model using only ten 2D intraoral images captured from different angles, requiring no dedicated hardware devices. The proposed method reduces cost, eliminates the need for physical scanning equipment, minimises patient discomfort, and enables automated 3D reconstruction. The model is trained on the publicly available Dental3DS dataset, comprising 950 upper jaw samples, and employs MobileNetV2 as the image encoder combined with Multi-head Attention for multi-view feature fusion. The proposed model achieves an accuracy of 77.49%, measured by nearest-neighbor matching with a distance threshold of 0.035. However, predicted vertices tend to concentrate in high-density regions of the ground truth, resulting in uneven point distribution across the reconstructed model.
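The accuracy metric quoted above can be sketched as follows. The abstract only states "nearest-neighbor matching with a distance threshold of 0.035", so the direction of matching (predicted vertex to ground-truth vertex), the Euclidean distance, and the function name `nn_match_accuracy` are assumptions made for illustration; the toy coordinates are hypothetical.

```python
import math


def nn_match_accuracy(predicted, ground_truth, threshold=0.035):
    """Fraction of predicted vertices whose nearest ground-truth vertex
    lies within `threshold` (Euclidean distance, brute-force search)."""
    if not predicted or not ground_truth:
        raise ValueError("both point sets must be non-empty")
    hits = 0
    for p in predicted:
        nearest = min(math.dist(p, g) for g in ground_truth)
        if nearest <= threshold:
            hits += 1
    return hits / len(predicted)


# Hypothetical toy data: three predictions sit near the first two
# ground-truth vertices, one prediction is far from everything.
ground_truth = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (0.3, 0.0, 0.0)]
predicted = [(0.01, 0.0, 0.0), (0.0, 0.02, 0.0), (0.11, 0.0, 0.0), (0.5, 0.0, 0.0)]

accuracy = nn_match_accuracy(predicted, ground_truth)
```

Under this one-directional reading, predictions clustered around a few ground-truth vertices can still score well while other ground-truth vertices (here the last two) remain uncovered, which is consistent with the uneven point distribution the abstract reports as a limitation.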

2606.05986 2026-06-05 cs.CR cs.AI 版本更新

AttackPathGNN: Cross-function vulnerability detection in smart contracts using state interference graphs and conjunction pooling

AttackPathGNN:使用状态干扰图和合取池化的智能合约跨函数漏洞检测

Gabriela Dobrita, Simona-Vasilica Oprea, Adela Bara

发表机构 * Bucharest University of Economic Studies(布加勒斯特经济学院)

AI总结 提出AttackPathGNN,一种图神经网络,通过构建状态干扰图和合取池化机制,将漏洞检测转化为对显式攻击路径的推理,实现跨函数漏洞的高精度检测。

详情
AI中文摘要

现有的基于学习的Solidity智能合约检测器将漏洞检测简化为单个函数内的语法模式匹配,然而许多最重大的利用(The DAO、Cream Finance)并不存在于任何单个函数中,而是存在于函数之间的关系以及使攻击可行的条件组合中。因此,我们提出了AttackPathGNN,一种图神经网络(GNN),它将检测重新定义为对显式攻击路径的推理。两个架构选择使其区别于先前的基于GNN的检测器:(1)状态干扰图,它通过类型化、加权边以及由显式五条件谓词定义的有向重入路径边,连接每一对共享可变存储的函数;(2)合取池化,一种对八个命名利用前提条件进行可微AND聚合的机制,其log-sigmoid形式使得当任何单一缓解措施(重入守卫、访问控制修饰符或SafeMath)到位时,每个函数的利用分数会骤降。在五次独立训练运行中,AttackPathGNN在SmartBugs Wild保留测试分区上达到92.3±0.2%的F1分数(假阴性率4.3±0.3%,在独立人工标注的SmartBugs Curated基准上检测率为90.8±2.5%),在每次随机种子下以100%的召回率恢复了6/10的DASP10类别,重入检测召回率为98.7±1.8%。每个预测都附带结构化的修复报告,将每个判定转化为可操作的、函数级的审计发现。

英文摘要

Existing learning-based detectors for Solidity smart contracts reduce vulnerability detection to syntactic pattern matching within single functions, yet many of the most consequential exploits (The DAO, Cream Finance) exist not in any individual function but in the relationship between functions and in the combination of conditions that made the attack feasible. Thus, we propose AttackPathGNN, a graph neural network (GNN) that reframes detection as reasoning over explicit attack paths. Two architectural choices distinguish it from prior GNN-based detectors: (1) a State Interference Graph that links every pair of functions sharing mutable storage through typed, weighted edges and through directed reentrancy-path edges defined by an explicit five-condition predicate; (2) conjunction pooling, a differentiable AND-aggregator over eight named exploit preconditions whose log-sigmoid form causes the per-function exploit score to collapse whenever any single mitigation (a reentrancy guard, an access-control modifier, or SafeMath) is in place. Across five independent training runs, AttackPathGNN attains 92.3±0.2% F1 on the SmartBugs Wild held-out test partition (4.3±0.3% false-negative rate, 90.8±2.5% detection rate on the independently human-labelled SmartBugs Curated benchmark), recovering 6/10 DASP10 categories at 100% on every seed and Reentrancy at 98.7±1.8%. Each prediction is emitted with a structured remediation report, turning each verdict into an actionable, function-level audit finding.
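The behaviour described for conjunction pooling (the score collapses when any one precondition fails) follows from summing log-sigmoids, which is a soft AND. The sketch below is a minimal scalar illustration, assuming one logit per precondition and a plain unweighted sum; the paper's actual parameterization is not given in the abstract, and the function names and logit values are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z)) = -softplus(-z)."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


def conjunction_pool(logits):
    """Soft AND over precondition logits.

    Returns exp(sum_i log sigmoid(z_i)) = prod_i sigmoid(z_i), so a single
    strongly negative logit drives the whole score toward zero.
    """
    return math.exp(sum(log_sigmoid(z) for z in logits))


# Eight hypothetical exploit preconditions, all confidently satisfied.
all_hold = conjunction_pool([6.0] * 8)

# Same function, but one precondition is blocked (e.g. a reentrancy guard).
one_blocked = conjunction_pool([6.0] * 7 + [-6.0])
```

Working in log space keeps the aggregation differentiable and avoids underflow when many probabilities are multiplied, which is presumably why a log-sigmoid form is used rather than a raw product.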

2606.05983 2026-06-05 cs.AI cs.CL 版本更新

Framing, Judging, Steering: An Assessable Competency Model for Teaching Students to Reason With Generative AI

框架构建、判断、引导:一种可评估的能力模型,用于教授学生与生成式AI进行推理

Alexander Apartsin, Yehudit Aperstein

发表机构 * Holon Institute of Technology(霍洛恩技术学院) Afeka College of Engineering(阿菲卡工程学院)

AI总结 提出CoRe-3能力模型,将有效使用AI分解为框架构建、判断和引导三种可评估技能,并通过模拟实验验证其区分效度。

Comments 18 pages, 4 pages

详情
AI中文摘要

生成式AI使答案变得容易而理解变得困难,不加批判的使用会导致认知卸载。学校仍然衡量无辅助的表现,但真正的任务是用AI产生好的工作:构建一个定义不明确的任务,判断输出,并引导模型获得更好的结果。这种能力很少被单独评估;即使被衡量,它也坍缩为一个单一的“提示”分数,无法诊断AI使用成功或失败的原因。我们提出CoRe-3(协同推理),一个能力模型,将生产性AI使用分解为三种可评估的技能,我们缩写为FJS:框架构建(在调用AI之前指定一个定义不明确的任务)、判断(评估输出中的错误和未声明的假设)和引导(迭代地重新引导模型)。其显著主张是将生成前的框架构建与生成后的引导分开,判断作为两者之间的门控。我们将这些技能建立在理论基础上,提出五个可检验的命题,并在CoReasoningLab中实例化它们,这是一个开放平台,呈现有缺陷的AI输出并独立评分。在模拟学习者(由不同模型生成和评分)上,这些技能是分离的:每个技能跟踪其自身的操纵能力,而其他技能保持不变,并且当一个能力在所有三个技能中共享时(收敛和区分效度),分数变得相关,评分后端来自两个提供商。接下来是人类评分者一致性和结果;我们发布工具、数据和协议。

英文摘要

Generative AI makes answers easy and understanding hard, and uncritical use invites cognitive offloading. Schools still measure unaided performance, yet the real task is to produce good work with AI: framing an ill-defined task, judging the output, and steering the model toward a better result. This ability is rarely assessed in its own right; where measured, it collapses into one "prompting" score that cannot diagnose why AI use succeeds or fails. We propose CoRe-3 (Co-Reasoning), a competency model factoring productive AI use into three assessable skills we abbreviate FJS: Framing (specifying an ill-defined task before invoking AI), Judging (evaluating output for errors and unstated assumptions), and Steering (iteratively redirecting the model). Its distinguishing claim is the separation of pre-generation Framing from post-generation Steering, with Judging as the gate between. We ground the skills in theory, state five testable propositions, and instantiate them in CoReasoningLab, an open platform that presents flawed AI output and scores them independently. Over simulated learners (generated and graded by different models), the skills dissociate: each tracks its own manipulated competence while staying flat in the others, and grades become correlated when one competence is shared across all three (convergent and discriminant validity), across grader backends from two providers. Human-rater agreement and outcomes are next; we release the instrument, data, and protocol.

2606.05979 2026-06-05 cs.RO cs.AI 版本更新

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

世界-语言-动作模型:统一世界建模、语言推理与动作合成

Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng

发表机构 * SJTU(上海交通大学) SII(上海研究院) HUST(华中科技大学) SCUT(华南理工大学) ECUST(华东理工大学) SHU(上海大学) NJUPT(南京邮电大学)

AI总结 提出世界-语言-动作(WLA)模型,通过自回归Transformer联合预测文本子任务、子目标图像和机器人动作,融合世界建模与语言推理能力,实现多任务和长时域任务的最优性能。

Comments 19 pages, 10 figures

详情
AI中文摘要

我们提出世界-语言-动作(WLA)模型作为一类新的具身基础模型。WLA以文本指令、图像和机器人状态为输入,联合预测文本子任务、子目标图像和机器人动作,结合了世界-动作模型(WAM)中从大量自我中心视频学习的世界建模接口,以及视觉-语言-动作(VLA)模型中解决复杂长时域任务的语言推理能力。WLA的核心是一个自回归(AR)Transformer主干,而非WAM中的双向扩散Transformer,用于预测下一状态,包括语义级别的文本意图和互补的细粒度物理动态。物理动态由基于专用世界专家的世界建模目标监督,并用于简化动作专家的状态-动作相关性表征。WLA利用元查询使世界预测隐式影响动作生成,从而在推理时可禁用世界预测。世界预测也可被激活以实现测试时缩放,从而改进机器人控制。我们的WLA-0原型具有2B活跃参数,在NVIDIA RTX 5090上每次推理耗时40毫秒。在模拟和真实环境中的评估表明,WLA-0实现了最先进的多任务和长时域学习能力,例如在RoboTwin2.0 Clean上成功率为92.94%,在RMBench上成功率为56.5%。WLA-0还有望直接从跨具身机器人视频中学习新任务,无需动作标注。

英文摘要

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an autoregressive (AR) Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction implicitly impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from cross-embodiment robot videos without action annotations.

2606.05976 2026-06-05 cs.AI cs.CL 版本更新

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

自我修正错觉:LLM 纠正他人但不纠正自己

Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang

发表机构 * National Taiwan University(国立台湾大学)

AI总结 本文通过保持错误声明字节一致仅改变角色标签,发现 LLM 无法自我修正并非能力缺陷,而是聊天模板角色标签的人为产物,并提出无需训练或模型修改的提示结构干预方法。

详情
AI中文摘要

近期研究表明,LLM 智能体难以纠正自身推理轨迹中的错误,但当相同声明出现在外部来源时,其修正率显著更高。我们探究这种不对称性反映的是能力缺陷还是角色标签的人为产物:智能体纠正错误声明的意愿是否因果地依赖于承载该声明的聊天模板角色,而非声明内容本身?我们的实验设置在所有条件下保持错误声明的字节完全一致(SHA-256 验证),仅改变其包装角色:智能体自身的 <thought>、user 消息、tool 响应或 system <memory> 块。在覆盖七个模型家族和三个领域的 13 个模型-领域单元(每个单元 n=30 对任务)中,将声明从 <thought> 重新标记为外部角色后,显式修正率提升了 23 到 93 个百分点,其中 13 个单元中有 10 个达到 p<0.001。进一步实验证实该效应是不对称的、机制上可分解的,并且跨领域稳健。自我修正失败并非认知缺陷,而是聊天模板的人为产物。我们利用这一人为产物设计了一种仅涉及提示结构、无需训练和模型修改的干预方法,其最强角色标签依赖于领域:在数学上 <memory> 占主导,而在逻辑推理上普通 user 消息占主导。

英文摘要

Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an agent's willingness to correct a wrong claim depend causally on the chat-template role that carries it, rather than on the claim's content? Our setup keeps the erroneous claim byte-identical across all conditions (SHA-256 verified) and varies only its wrapping role: the agent's own <thought>, a user message, a tool response, or a system <memory> block. Across 13 model-domain cells covering seven model families and three domains (n = 30 paired tasks per cell), relabeling the claim from <thought> to an external role lifts the explicit-correction rate by 23 to 93 percentage points, with 10 of 13 cells reaching p < 0.001. Further experiments confirm that the effect is asymmetric, mechanistically decomposable, and robust across domains. The failure to self-correct is not a cognitive deficit; it is a chat-template artifact. We exploit this artifact by designing a prompt-structure-only intervention that requires no training and no model modification, with its strongest role label being domain-dependent: <memory> dominates on math, while a plain user message dominates on logical deduction.
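The "byte-identical claim, different role" control can be illustrated with a small check. This is a sketch under stated assumptions, not the authors' harness: the role names, the message dictionary layout, and the example claim are hypothetical; only the use of SHA-256 to verify that the claim bytes do not change across conditions comes from the abstract.

```python
import hashlib

# Hypothetical erroneous claim (17 * 23 is actually 391).
CLAIM = "17 * 23 = 401"

# Hypothetical role labels standing in for the four wrapping conditions.
ROLES = ["assistant_thought", "user", "tool", "system_memory"]


def wrap(role, claim):
    """Build one chat message; only the role label varies across conditions."""
    return {"role": role, "content": claim}


def sha256_hex(text):
    """SHA-256 digest of the UTF-8 bytes of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


messages = [wrap(role, CLAIM) for role in ROLES]
digests = {m["role"]: sha256_hex(m["content"]) for m in messages}

# The manipulation is clean only if every condition carries the same bytes.
claim_is_byte_identical = len(set(digests.values())) == 1
```

Hashing the claim rather than comparing strings by eye guards against invisible differences (whitespace, Unicode normalization) that could otherwise confound a role-label comparison.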

2606.05970 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries

测量基于LLM的结构化提取对临床出院小结中提示、模型和模式选择的敏感性

Martin Murin

发表机构 * DryLabz GmbH(DryLabz公司)

AI总结 本研究通过固定提取任务并逐一改变提示、模型和模式选择,测量了大型语言模型在临床文本结构化提取中输出对上游配置的敏感性,发现模式选择导致的差异集中在缺失与沉默的区分上,而模型选择在多类分类中主导提示措辞。

Comments 69 pages, 5 main figures, supplementary material included

详情
AI中文摘要

大型语言模型越来越多地用于从临床自由文本笔记中进行结构化提取,但其输出对上游配置选择的敏感性比在固定基准上的准确性更少被理解。本文通过固定提取任务并逐一改变一个选择,在没有人工标注真实值的情况下测量了这种敏感性。固定模式包括17个临床文档标志(三值:是/否/未记录)和47个标签词汇(用于主要入院原因)。表达该模式的三种提示变体分别在两个模型大小上对MIMIC-IV v3.1出院小结运行。跨提示一致性通过Cohen's kappa在ICD分层子集上测量。配对相同笔记比较隔离了模型选择的影响,事后将三值标志折叠为二值测试了模式对不一致的贡献。在三值标志上,两个模型达到相同的合并跨提示一致性(中位数kappa 0.69和0.68);较大的模型提高了某些字段的一致性并降低了其他字段的一致性,这是一种重新分布而非无效果。将模式折叠为二值消除了大部分跨提示不一致,将其定位在缺失与沉默的区分上,而非发现是否存在。在多类入院分类上,改变模型会重新分配近一半笔记的主导标签,而改变提示措辞则重新分配约八分之一的笔记,并且较大的模型在残余的通用类别上分配的权重少得多(44%到26%)。这些模式表明,模式施加的不一致集中在缺失与沉默轴上,而模型在多类分类上主导提示措辞,这是通过一种可重复的方法在人群规模部署中审计提取可重复性而识别的。

英文摘要

Large language models are increasingly used for structured extraction from clinical free-text notes, but the sensitivity of their output to upstream configuration choices is less understood than their accuracy on fixed benchmarks. This work measures that sensitivity without human-annotated ground truth, by holding the extraction task fixed and varying one choice at a time. The fixed schema comprises 17 clinical documentation flags on a three-way yes/no/not_documented value set and a 47-tag vocabulary for the primary admission reason. Three prompt variants expressing this schema were each run at two model sizes on MIMIC-IV v3.1 discharge summaries. Cross-prompt agreement was measured by Cohen's kappa on ICD-stratified subsets. A paired same-note comparison isolated the effect of model choice, and a post-hoc collapse of the three-way flags to binary tested the schema's contribution to disagreement. On the three-way flags, the two models reach the same pooled cross-prompt agreement (median kappa 0.69 and 0.68); the larger model raises agreement on some fields and lowers it on others, a redistribution rather than the absence of an effect. Collapsing the schema to binary dissolves most of the cross-prompt disagreement, locating it on the absence-versus-silence distinction rather than on whether the finding is present. On the multi-class admission categorization, changing the model reassigns the dominant tag on close to half of all notes while changing the prompt phrasing reassigns it on roughly one in eight, and the larger model places far less mass on residual catch-all categories (44% to 26%). These patterns indicate a schema-imposed source of disagreement concentrated on the absence-versus-silence axis and a dominance of model over prompt phrasing on multi-class categorization, identified by a reusable methodology for auditing extraction reproducibility on a population-scale deployment.
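Cross-prompt agreement and the effect of collapsing the three-way value set can be sketched with a small Cohen's kappa implementation. The label sequences below are invented for illustration, and the collapse rule (merging "no" and "not_documented" into one class) is an assumption about how the binary version is formed; the abstract does not spell it out.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def collapse_to_binary(labels):
    """Assumed collapse: 'yes' stays, 'no' and 'not_documented' merge."""
    return ["yes" if label == "yes" else "not_yes" for label in labels]


# Hypothetical outputs of two prompt variants on the same six notes.
prompt_a = ["yes", "no", "not_documented", "no", "yes", "not_documented"]
prompt_b = ["yes", "not_documented", "not_documented", "no", "yes", "no"]

kappa_three_way = cohens_kappa(prompt_a, prompt_b)
kappa_binary = cohens_kappa(collapse_to_binary(prompt_a),
                            collapse_to_binary(prompt_b))
```

In this toy case the two prompts disagree only on "no" versus "not_documented", so agreement is moderate on the three-way labels and perfect after the collapse, mirroring the absence-versus-silence pattern the abstract describes.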

2606.05966 2026-06-05 cs.DB cs.AI 版本更新

Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs

物理推理的因果支架:面向视觉语言模型中因果启发的物理世界理解基准

Tianyi Tang, Zhuoyi Lin, Zeyu Feng, Tianyi Ma, Yew-Soon Ong, Ivor Tsang, Haiyan Yin

发表机构 * CFAR(前沿人工智能研究中心) IHPC(高性能计算研究所) Agency for Science, Technology and Research (A*STAR)(新加坡科技研究局) Nanyang Technological University(南洋理工大学)

AI总结 提出CausalPhys基准(含3000+视频/图像问题及因果图),并设计基于因果图的度量评估VLM推理,进一步提出因果理据引导的微调(CRFT)提升推理准确性与可解释性。

Comments Accepted by KDD 2026 Dataset and Benchmark Track

详情
AI中文摘要

理解和推理物理世界是智能行为的基础,但最先进的视觉语言模型(VLM)在因果物理推理中仍会失败,常常产生看似合理但错误的答案。为解决这一问题,我们引入了CausalPhys,一个包含超过3000个精心策划的视频和图像问题的基准,涵盖四个领域:感知、预期、干预和目标导向。每个问题都配有一个专家注释的因果图,捕捉对象-属性-事件依赖关系,从而实现可解释且细粒度的因果理解评估。在此基础上,我们制定了一个基于因果图的度量,定量衡量模型的思维链推理与正确因果关系的对齐程度,超越了仅基于答案的准确性,并能够系统诊断VLM的因果推理失败。使用该度量,我们对领先的VLM进行了全面分析,揭示了在捕捉因果依赖关系方面的系统性差距,并强调了因果感知学习的必要性。为解决这些局限性,我们进一步提出了因果理据引导的微调(CRFT),明确将VLM推理与因果结构对齐。大量实验表明,CRFT在多个模型骨干上显著提升了推理准确性和可解释性。通过统一数据集整理、因果评估和因果感知学习,CausalPhys为推进现代VLM实现有因果依据的物理推理奠定了坚实基础。

英文摘要

Understanding and reasoning about the physical world is the foundation of intelligent behavior, yet state-of-the-art vision-language models (VLMs) still fail at causal physical reasoning, often producing plausible but incorrect answers. To address this gap, we introduce CausalPhys, a benchmark of over 3,000 carefully curated video- and image-based questions spanning four domains: Perception, Anticipation, Intervention, and Goal Orientation. Each question is paired with an expert-annotated causal graph capturing object-attribute-event dependencies, enabling interpretable and fine-grained evaluation of causal understanding. Building on this, we formulate a causal-graph-grounded metric that quantitatively measures how well a model's chain-of-thought reasoning aligns with the correct causal relations, moving beyond answer-only accuracy and enabling systematic diagnosis of VLMs' causal reasoning failures. Using this metric, we conduct a comprehensive analysis of leading VLMs, revealing systematic gaps in capturing causal dependencies and underscoring the need for causality-aware learning. To address these limitations, we further propose Causal Rationale-informed Fine-Tuning (CRFT), which explicitly aligns VLM reasoning with causal structures. Extensive experiments demonstrate that CRFT substantially enhances both reasoning accuracy and interpretability across multiple model backbones. By unifying dataset curation, causal evaluation, and causality-informed learning, CausalPhys establishes a strong foundation for advancing modern VLMs toward causally grounded physical reasoning.

2606.05956 2026-06-05 cs.AI 版本更新

Bidirectional Search for Longest Paths: Case for Front-to-Front Heuristics

最长路径的双向搜索:论前沿到前沿启发式的价值

Tzur Shubi, Ariel Felner, Solomon Eyal Shimony, Shahaf S. Shperberg

发表机构 * Technion - Israel Institute of Technology(技术学院 - 以色列理工学院)

AI总结 提出BiXDFBnB算法,将单前沿双向搜索框架适配到广义最长简单路径问题,利用前沿到前沿启发式减少节点扩展,并在某些情况下改善运行时间。

详情
AI中文摘要

双向启发式搜索可以潜在地减少适用于后向搜索的问题的搜索工作量。众所周知,前沿到前沿启发式可以减少节点扩展的数量,但其开销如此之高,以至于总体运行时间几乎总是增加。我们提出了BiXDFBnB,一种双向深度优先分支定界算法,它将单前沿双向搜索(SFBDS)框架(最初为最短路径(MIN)问题开发)适配到广义最长简单路径(GLSP)设置。由于SFBDS本质上在配对状态上操作,前沿到前沿(F2F)启发式评估自然出现,并避免了通常与双向前沿管理相关的开销。我们展示了这种适配可以成功应用于最大化(MAX)问题,同时有效处理重叠约束。BiXDFBnB应用于几种类型的最长路径问题:最长简单路径(LSP)、Snakes和Coil-in-the-Box(CIB)。经验评估表明,新算法经常减少节点扩展的数量,并且在某些情况下也改善了总体运行时间。

英文摘要

Bidirectional heuristic search can potentially reduce search effort for problems amenable to backward search. Therein, it is well-known that front-to-front heuristics can reduce the number of node expansions, but their overhead is so high that overall runtime almost always increases. We propose BiXDFBnB, a bidirectional depth-first branch-and-bound algorithm that adapts the Single-Frontier Bidirectional Search (SFBDS) framework - originally developed for shortest-path (MIN) problems - to the Generalized Longest Simple Path (GLSP) setting. Because SFBDS inherently operates on paired states, front-to-front (F2F) heuristic evaluation arises naturally and avoids the overhead typically associated with bidirectional frontier management. We show that this adaptation can be successfully applied to maximization (MAX) problems while efficiently handling overlapping constraints. BiXDFBnB is applied to several types of longest-path problems: Longest Simple Path (LSP), Snakes, and Coil-in-the-Box (CIB). Empirical evaluation shows that the new algorithm frequently reduces the number of node expansions and, in some cases, also improves overall runtime.

2606.05952 2026-06-05 cs.RO cs.AI 版本更新

Learning of Robot Safety Policies via Adversarial Synthetic Scenarios

通过对抗性合成场景学习机器人安全策略

Nikolai Dorofeev, Alexey Odinokov, Rostislav Yavorskiy

发表机构 * National Research Institute of Automation and Applied Mathematics(国家自动化与应用数学研究所)

AI总结 提出一个基于对抗性游戏的框架,通过红蓝两队对抗生成危险场景并迭代优化安全策略,以高效发现高风险边缘案例。

详情
AI中文摘要

在这项工作中,我们提出了一种基于代理的博弈框架,通过合成场景进行危险告知的机器人安全策略学习。我们将场景生成建模为两个代理之间的对抗游戏:红队通过构建危险情况探索潜在故障空间,蓝队则逐步完善安全策略以防止这些故障。这种迭代过程能够高效发现通过随机模拟或手动枚举难以捕获的高风险边缘案例。通过将经典风险建模与对抗性场景生成及现代学习范式相结合,这项工作为在复杂现实环境中运行的物理AI系统嵌入安全性提供了一条可扩展的路径。本文描述了正在进行的工作,贡献在于问题形式化和提出的解决方案架构。

英文摘要

In this work, we propose an agentic gamification framework for hazard-informed learning of robot safety policies through synthetic scenarios. We model scenario generation as an adversarial game between two agents: a Red Team that explores the space of potential failures by constructing hazardous situations, and a Blue Team that incrementally refines safety policies to prevent them. This iterative process enables efficient discovery of high-risk edge cases that are unlikely to be captured through random simulation or manual enumeration. By combining classical risk modeling with adversarial scenario generation and modern learning paradigms, this work provides a scalable pathway for embedding safety into Physical AI systems operating in complex real-world environments. The paper describes ongoing work. The contribution is a problem formulation and a proposed solution architecture.

2606.05950 2026-06-05 cs.AI 版本更新

Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

Edit-R2:面向多轮图像编辑的上下文感知强化学习

Yuxiao Ye, Haoran He, Fangyuan Kong, Xintao Wang, Pengfei Wan, Kun Gai, Ling Pan

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Kuaishou Technology(快手科技)

AI总结 提出Edit-R2框架,通过重构会话意图和联合优化推理与生成的强化学习,解决多轮图像编辑中的长上下文稀释和状态污染问题,并在MICE-Bench基准上取得领先性能。

详情
AI中文摘要

基于扩散模型和统一多模态基础模型的文本引导图像编辑已取得快速进展。然而,现有方法大多局限于单轮设置,忽略了更现实的多轮上下文编辑场景,即用户通过一系列指令逐步细化图像。在此设置中,模型必须遵循每条新指令,同时保留累积的会话级约束,面临两种耦合的失败模式:长上下文稀释(稀疏文本约束难以从不断增长的图像-文本交错历史中恢复)和状态污染(早期编辑错误降低后续生成质量)。我们提出Edit-R2,一种用于统一多模态模型的新型强化学习后训练框架。Edit-R2重构操作会话意图,在每次编辑轮次前将分散的历史约束有效整合为显式推理轨迹。它进一步通过统一目标实现推理和生成的多轮强化学习,该目标联合优化离散文本空间中的意图重构生成和连续潜在空间中的流匹配图像生成,同时轨迹过滤机制抑制损坏的轨迹以在状态污染下稳定训练。为支持系统评估,我们引入MICE-Bench,一个大规模多轮上下文编辑基准,包含针对累积会话约束的指令遵循(IF)、内容一致性(CC)和全局感知(GA)的自动指标。实验表明,Edit-R2显著改进了多轮上下文编辑,并在与强基线的比较中取得了有竞争力的性能。

英文摘要

Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints, challenged by two coupled failure modes: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, which effectively consolidates scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction generation in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance compared against strong baselines.

2606.05931 2026-06-05 cs.CL cs.AI cs.CV cs.IR cs.LG cs.MM eess.AS 版本更新

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

多模态还是非多模态:通过主动模态检测的查询自适应音视频人物检索

Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales

发表机构 * University of Cambridge(剑桥大学) Queen's University Belfast(贝尔法斯特女王大学) University of Surrey(萨里大学) Cisco(思科) Southwest Jiaotong University(西南交通大学) Teesside University(泰赛德大学)

AI总结 提出一种查询自适应框架,通过跨模态分数一致性检测主动模态,在BBC Rewind语料库上达到94.2%的P@1,优于单模态和固定融合方法。

Comments INTERSPEECH 2026

详情
AI中文摘要

当通过语音和面部从视频档案中检索一个人时,系统应该是多模态的吗?在实际的广播档案中,与精心策划的基准不同,目标可能只被听到但未被看到、只被看到但未被听到,或者两者兼有。融合来自缺失模态的分数会引入噪声,使精度低于最佳单模态系统。我们提出了一种查询自适应框架,通过跨模态分数一致性检测主动模态:当两种模态都活跃时,由一种模态检索的文件在另一种模态上也得分高;当一种模态缺失时,这种一致性被破坏。由这些跨模态特征驱动的分类器实现了89%的检测准确率。在BBC Rewind语料库(包含超过12,000个广播视频)上,自适应系统达到了94.2%的P@1,优于仅语音(82.9%)、仅面部(93.4%)和固定融合(90.0%),恢复了与具有真实模态标签的Oracle(96.6%)之间差距的64%。

英文摘要

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
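The "64% of the gap" figure can be checked with simple arithmetic. The sketch below assumes the gap is measured from the fixed-fusion baseline to the oracle; the abstract does not state the reference point, so this reading is an assumption. The face-only baseline is included for comparison, and the function name is hypothetical.

```python
def gap_recovered(system, baseline, oracle):
    """Fraction of the baseline-to-oracle gap closed by the system."""
    return (system - baseline) / (oracle - baseline)

# P@1 values (in percent) quoted in the abstract.
adaptive, fixed_fusion, face_only, oracle = 94.2, 90.0, 93.4, 96.6

vs_fixed_fusion = gap_recovered(adaptive, fixed_fusion, oracle)  # (94.2-90.0)/(96.6-90.0)
vs_face_only = gap_recovered(adaptive, face_only, oracle)        # (94.2-93.4)/(96.6-93.4)
print(f"{vs_fixed_fusion:.1%} of the gap from fixed fusion")
print(f"{vs_face_only:.1%} of the gap from face-only")
```

Under this assumption the fixed-fusion reading gives roughly 64%, which matches the reported number, while the face-only reading gives 25%.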

2606.05925 2026-06-05 cs.AI 版本更新

Towards World Models in Biomedical Research

迈向生物医学研究的世界模型

Guangyu Wang, Jingkun Yue, Siqi Zhang, Yu Liu, Xiaoyu Wang, Mingyuan Meng, Changwei Ji, Zongbo Han, Yulin Wang, Yang Yue, Frank Fu, Ting Chen, Song Wu, Ziwei Liu, Jiangning Song, Ming Li, Gao Huang, Xiaohong Liu, Athanasios Vasilakos, Xingcai Zhang, Ping Zhang, Yong Li

发表机构 * State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China(网络与交换技术国家重点实验室,北京邮电大学,北京,中国) Department of Engineering Science, University of Oxford, Oxford, United Kingdom(牛津大学工程科学系,牛津,英国) Institute of Medical Artificial Intelligence, South China Hospital, Medical School, Shenzhen University, Shenzhen, Guangdong, China(医学人工智能研究所,华南医院,医学部,深圳大学,深圳,广东,中国) Zhongguancun Academy & Zhongguancun Institute of Artificial Intelligence, Beijing, China(中关村学院及中关村人工智能研究院,北京,中国) Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, 100084, Beijing, China(北京信息科学与技术国家研究中心(BNRist),清华大学,100084,北京,中国) Department of Chemical and Nano Engineering, University of California, San Diego, La Jolla, CA, USA(加州大学圣地亚哥分校化学与纳米工程系,La Jolla,CA,美国) Nanyang Technological University, Singapore(新加坡南洋理工大学) Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia(莫纳什大学生物医学发现研究所和生物化学与分子生物学系,墨尔本,维多利亚,澳大利亚) David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada(滑铁卢大学戴维·R·切里顿计算机科学学院,滑铁卢,安大略,加拿大) Department of ICT and Center for AI Research, University of Agder (UiA), Jon Lilletuns vei 9, Grimstad, Norway(阿格德大学(UiA)信息与通信技术系及人工智能研究中心,Jon Lilletuns vei 9,Grimstad,挪威) Department of Electronic Engineering, Tsinghua University, Beijing, China(清华大学电子工程系,北京,中国)

AI总结 提出生物医学世界模型作为AI驱动发现的新范式,通过学习分子、细胞、组织和临床状态的潜在表征及干预条件动态,实现未来轨迹模拟,并探讨其在虚拟细胞、类器官、虚拟患者和手术模拟等应用中的潜力。

详情
AI中文摘要

生物医学的一个核心目标是理解、预测并最终控制生物系统对扰动、疾病进展和治疗干预的动态机制。尽管基础模型和大语言模型加速了生物医学数据解读,但当前大多数系统仍专注于静态模式识别,而非对生物未来的前瞻性模拟。在此,我们提出生物医学世界模型作为AI驱动发现的一种范式。这些模型学习分子、细胞、组织和临床状态的潜在表征,以及干预条件动态,使得在采取行动之前能够模拟未来轨迹。我们讨论了生物医学世界模型如何作为数据引擎、环境模拟器和科学规划基础,应用于虚拟细胞、类器官、虚拟患者和手术模拟等场景。我们概述了所需的数据基础设施、评估基准、安全约束和治理框架。生物医学世界模型可能为模拟引导、闭环且实验可操作的生物医学发现提供基础。

英文摘要

A central goal of biomedicine is to understand, predict and ultimately control the dynamic mechanisms by which biological systems respond to perturbations, disease progression and therapeutic intervention. Although foundation models and large language models have accelerated biomedical data interpretation, most current systems remain focused on static pattern recognition rather than prospective simulation of biological futures. Here we propose biomedical world models as a paradigm for AI-driven discovery. These models learn latent representations of molecular, cellular, tissue and clinical states, together with intervention-conditioned dynamics that allow future trajectories to be simulated before actions are taken. We discuss how biomedical world models could function as data engines, environment simulators and scientific planning substrates across applications including virtual cells, organoids, virtual patients and surgical simulation. We outline the data infrastructure, evaluation benchmarks, safety constraints and governance frameworks required. Biomedical world models may provide a foundation for simulation-guided, closed-loop and experimentally actionable biomedical discovery.

2606.05924 2026-06-05 cs.CL cs.AI 版本更新

Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach

更好的文学翻译:多维度数据生成与大语言模型训练方法

Zhihao Lin, Ziqi Zhu, Hao Huang, Guanghui Wang, Peiyang He

发表机构 * Amazon Web Services (AWS)(亚马逊网络服务(AWS)) Peking University(北京大学)

AI总结 提出多维度迭代优化框架,通过专门的大语言模型生成高质量翻译参考和偏好数据,结合监督微调和强化学习(GRPO)提升文学翻译质量,在MetaphorTrans英中文学翻译基准上达到与Claude Sonnet 4.5竞争的性能。

Comments Accepted by ACL 2026 Industry

详情
AI中文摘要

文学翻译因高质量标注数据的稀缺以及需要在表达流畅性与文学效果之间取得平衡而面临独特挑战。我们提出了一个多维度迭代优化框架,通过专门的大语言模型翻译器生成高质量的翻译参考和偏好数据,每个翻译器针对一个不同的质量维度。我们利用生成的数据进行监督微调和强化学习。实验表明,我们的生成参考在监督微调中比原始真实数据高出8.65个CEA100点。对于强化学习,我们发现直接偏好优化(DPO)在此设置下导致性能下降,而利用显式奖励模型进行组相对策略优化(GRPO)则额外提升了1.51个点。我们将此归因于两阶段训练的稳定性和GRPO的在线探索能力。我们的最终模型LitMT-8B和LitMT-14B在MetaphorTrans英中文学翻译基准上分别达到67.25和69.07个CEA100点,与Claude Sonnet 4.5的68.43点具有竞争力,并展现出对域外文学作品(如欧·亨利)的强泛化能力。

英文摘要

Literary translation poses unique challenges due to the scarcity of high-quality annotated data and the need to balance expression fluency with literary effect. We present a multi-aspect iterative refinement framework that generates high-quality translation references and preference data through specialized LLM translators, each targeting a distinct quality dimension. We leverage the generated data for supervised fine-tuning and reinforcement learning. Experiments show that our generated references outperform the original ground truth for SFT by 8.65 CEA100 points. For reinforcement learning, we find that DPO leads to performance degradation in this setting, while leveraging an explicit reward model for GRPO yields an additional 1.51 point improvement. We attribute this to the stability of two-stage training and GRPO's online exploration capability. Our resulting models, LitMT-8B and LitMT-14B, achieve 67.25 and 69.07 CEA100 respectively on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 at 68.43, and demonstrate strong generalization to out-of-domain literary work (i.e., O. Henry).

2606.05901 2026-06-05 cs.CL cs.AI 版本更新

Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version)

减少复杂问答中的幻觉:使用基于简单图的检索增强生成(长版)

Christopher J. Wedge, Joshua Stutter, Danny Dixon, Jacek Cała

发表机构 * National Innovation Centre for Data(数据创新研究中心)

AI总结 本研究提出一种轻量级图结构支持的检索增强生成系统,通过结合向量搜索和图查询工具,在复杂问答任务中将幻觉答案数量减半,并显著提升事实正确性的精确率和召回率。

详情
AI中文摘要

大型语言模型(LLMs)从根本上改变了自然语言处理的格局。尽管取得了这些进展,LLMs和基于LLM的系统仍然容易出现各种故障模式。检索增强生成(RAG)系统已成为一种常见的部署场景,旨在避免LLM“幻觉”信息的已知风险,并使模型能够对训练期间无法访问的专有信息进行推理和问答,而无需进行昂贵的模型微调。在这项工作中,我们探索了使用轻量级图结构(具有相对简单的图模式)通过专用工具集支持RAG子系统的想法。我们设计了一个基于英语维基百科文章精选子集的结构化数据集上的智能体系统,该系统配备了多种向量搜索和图查询工具,并评估了其在MoNaCo(一个具有挑战性的维基百科QA基准测试,涉及复杂查询回答任务)上的问题表现。我们的结果表明,引入基于图的工具可以显著提高事实正确性的精确率和召回率,将幻觉答案的数量减半,并在三个评估场景中实现了最高的细粒度真实性得分。所有这些都仅以适度的令牌使用增加为代价。

英文摘要

Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario seeking to both avoid the well known risk of the LLM "hallucinating" information, and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training without resorting to expensive model fine-tuning. In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema, to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks. Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. All this with a modest increase in token usage.

2606.05890 2026-06-05 cs.CL cs.AI 版本更新

Staying with the Uncertainty: Uncertainty-Scaffolding Strategies for Artificial Moral Advisors in LLM-to-LLM Simulated Conversations

与不确定性共处:LLM对LLM模拟对话中人工道德顾问的不确定性支撑策略

Salvatore Greco, Hainiu Xu, Jacopo Domenicucci, Yulan He, Sylvie Delacroix

发表机构 * Centre for Data Futures, The Dickson Poon School of Law, King’s College London(数据未来中心、迪克森·普恩法学院、伦敦国王学院) Department of Informatics, King’s College London(信息学院、伦敦国王学院) LangAI, Center for Language AI Research, Tohoku University(LangAI、语言人工智能研究中心、东北大学) Neukom Institute for Computational Science, Dartmouth College(计算科学尼科姆研究所、达特茅斯学院)

AI总结 研究LLM作为人工道德顾问时,通过三种不确定性策略(视角倍增、张力保持、过程反思)与三种控制条件对比,在模拟对话中探讨如何帮助对话者“与不确定性共处”,发现不同策略在立场改变量上无差异但影响参与质量。

详情
AI中文摘要

LLM越来越多地被部署为各种背景下的人工道德顾问(AMA):它们应该展现什么样的对话模式?在本文中,我们研究AMA如何帮助其对话者“与不确定性共处”。我们提出了三种不确定性模式(视角倍增、张力保持、过程反思),并将它们与三种控制条件(基线、说服、谄媚)进行比较。用户代理LLM与遵循特定不确定性策略的AMA就伦理困境进行对话,并完成对话前和对话后的问卷调查。我们进一步考察了两种角色提示格式(陈述式和叙述式)的效果。我们发现:(1)没有一个单一模型作为模拟用户代理占主导地位,开放模型通过角色间分歧与人类模糊性对齐,而封闭模型通过角色内对冲对齐;(2)陈述式角色更好地捕捉初始立场多样性,而叙述式角色显示出更现实的信念修正;(3)所有六种AMA策略产生可区分的对话模式;(4)不确定性策略的不同不在于它们产生多少立场改变,而在于它们维持的参与质量。

英文摘要

LLMs are increasingly deployed as Artificial Moral Advisors (AMA) in a variety of contexts: what kind of conversational patterns should they display? In this paper, we study how AMA can help their interlocutors "stay with the uncertainty". We propose three modes of uncertainty (Perspective-Multiplying, Tension-Preserving, Process-Reflecting) and compare them against three control conditions (Baseline, Persuasive, Sycophantic). A user-agent LLM engages in a dialogue on an ethical dilemma with an AMA following a specific uncertainty strategy, and completes pre- and post-conversation questionnaires. We further examine the effect of two persona prompt formats (Declarative and Narrative). We found that (1) no single model dominates as a simulated user agent, with open models aligning with human ambiguity through between-persona divergence and closed models through within-persona hedging; (2) declarative personas better capture initial stance diversity while narrative personas show more realistic belief revision; (3) all six AMA strategies produce distinguishable conversational patterns; and (4) uncertainty strategies differ not in how much stance revision they produce, but in the quality of engagement they sustain.

2606.05888 2026-06-05 cs.AI 版本更新

Retry Policy Gradients in Continuous Action Spaces

连续动作空间中的重试策略梯度

Soichiro Nishimori, Paavo Parmas

发表机构 * The University of Tokyo, Japan(东京大学)

AI总结 本文提出重试目标(如pass@K和max@K)的路径导数估计器,将ReMax扩展到连续动作空间,通过重塑策略梯度景观促进随机探索,并引入ReMAC算法实现与SAC相当的性能。

详情
AI中文摘要

基于重试的目标(如pass@K和max@K)优化从多个采样轨迹中获得的最佳回报,最近的研究表明,它们可以在没有显式探索奖励的情况下促进探索。在离散动作空间中,ReMax被证明可以通过适应回报不确定性来实现这一点。在这项工作中,我们引入了重试目标的路径导数估计器,并用它们将ReMax扩展到连续动作空间。我们研究了由此产生的学习动态,并表明,即使使用确定性奖励,ReMax也可以通过重塑策略梯度景观来鼓励随机探索。特别地,它既改变了梯度的方向,使更新偏向于更高的策略熵,也改变了梯度的大小,抑制梯度并减缓收敛。我们进一步表明,Adam的自适应归一化可以缓解这种抑制,具体取决于其数值稳定化参数。在实验上,我们将该目标实例化为ReMax Actor-Critic(ReMAC),这是一种使用路径导数估计器优化ReMax目标的离策略actor-critic算法。我们的实验表明,ReMAC可以在没有熵正则化的情况下促进更高的策略熵,并实现与SAC相当的性能。

英文摘要

Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete action spaces, ReMax was shown to do so by adapting to return uncertainty. In this work, we introduce pathwise derivative estimators for retry objectives and use them to extend ReMax to continuous action spaces. We study the resulting learning dynamics and show that, even with deterministic rewards, ReMax can encourage stochastic exploration by reshaping the policy-gradient landscape. In particular, it alters gradients both in direction, biasing updates toward higher policy entropy, and in magnitude, damping gradients and slowing convergence. We further show that Adam's adaptive normalization can mitigate this damping, depending on its numerical stabilization parameter. Empirically, we instantiate this objective as ReMax Actor-Critic (ReMAC), an off-policy actor--critic algorithm that optimizes the ReMax objective using a pathwise derivative estimator. Our experiments show that ReMAC can promote higher policy entropy without entropy regularization and achieves performance comparable to SAC.
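To make the retry objective concrete, here is a minimal sketch of how max@K can be estimated from `n` sampled returns: it averages the best return over every size-`K` subset, using order-statistic weights instead of enumerating subsets. This is a standard combinatorial estimator of the objective's value, not the pathwise derivative estimator the paper introduces. The function name `max_at_k` and the sample values are hypothetical.

```python
from math import comb

def max_at_k(returns, k):
    """Average of the best return over all size-k subsets of the samples."""
    n = len(returns)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(returns)")
    ordered = sorted(returns)
    # The sample at 0-based rank i is the subset maximum in C(i, k-1) subsets:
    # it must be chosen, and the other k-1 members come from the i smaller samples.
    weighted = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return weighted / comb(n, k)

samples = [1.0, 3.0, 2.0, 5.0]       # toy returns from four rollouts
mean_return = max_at_k(samples, 1)    # K = 1 reduces to the plain average
best_of_two = max_at_k(samples, 2)
best_of_all = max_at_k(samples, 4)    # K = n reduces to the maximum

# With 0/1 returns, the same function gives the usual pass@K estimate.
pass_at_2 = max_at_k([0, 0, 1, 0], 2)
print(mean_return, best_of_two, best_of_all, pass_at_2)
```

Raising `K` shifts weight toward the upper tail of the sampled returns, which gives some intuition for why such objectives can reward a policy for keeping its return distribution spread out.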

2606.05875 2026-06-05 cs.AI cs.DB 版本更新

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

QCFuse: 通过压缩视图的查询感知缓存融合实现高效RAG服务

Jianxin Yan, Wangze Ni, Zhenxin Li, Jiabao Jin, Zhitao Shen, Haoyang Li, Jia Zhu, Peng Cheng, Xuemin Lin, Lei Chen, Kui Ren

发表机构 * Zhejiang University(浙江大学) East China Normal University(华东师范大学) Ant Group(蚂蚁集团) The Hong Kong Polytechnic University(香港理工大学) Zhejiang Normal University(浙江师范大学) Tongji University(同济大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出QCFuse,一种基于压缩视图的查询感知选择器,通过块锚查询探测和关键层分析实现高效RAG缓存融合,在保持全预填充质量的同时平均加速1.7倍。

详情
AI中文摘要

检索增强生成(RAG)通过将生成过程基于外部证据来提高大语言模型(LLM)的答案质量,但处理检索到的上下文使得预填充阶段成为主要的服务成本。RAG缓存融合通过重用检索块的预计算键值(KV)缓存,并选择性地在当前提示下重新计算令牌来降低这一成本。然而,现有的选择器在质量和效率之间面临两难:快速的查询无关或最终层查询到上下文选择器可能遗漏与请求相关的证据,而全视图查询感知选择器在重新计算之前需要广泛的上下文和层可见性,因此会阻塞逐层缓存融合流水线。我们提出QCFuse,一种用于RAG缓存融合的压缩视图查询感知选择器。QCFuse使用块锚查询探测将用户查询状态条件化到紧凑的每块锚点上,并通过关键层分析识别重新计算令牌而无需检查所有层。我们在SGLang中实现QCFuse,并在六个数据集上对四个开放权重LLM进行评估。QCFuse达到了全预填充级别的质量。在匹配质量下,QCFuse相比全预填充实现了平均1.7倍的预填充加速,相比最强的保质量基线ProphetKV实现了1.5倍加速。

英文摘要

Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.

2606.05873 2026-06-05 cs.RO cs.AI cs.CV cs.LG 版本更新

LadderMan: Learning Humanoid Perceptive Ladder Climbing

LadderMan: 学习人形机器人感知爬梯

Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Yue Wang, C. Karen Liu, Guanya Shi

发表机构 * Amazon FAR(亚马逊FAR) USC(美国南加州大学) UC Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) CMU(卡内基梅隆大学)

AI总结 提出LadderMan系统,通过两阶段学习管道和视觉基础模型,使人形机器人能够鲁棒地攀爬多种梯子并在梯子上进行操控。

详情
AI中文摘要

人形机器人在以人为中心的环境中具有巨大潜力,但由于稀疏的立足点和手抓点、复杂的全身协调以及对感知和控制误差的敏感性,爬梯仍然是最具挑战性的任务之一。我们提出了LadderMan,一个统一的系统,使人形机器人能够鲁棒地攀爬多种梯子并在这种受限条件下进行操控。我们的攀爬策略基于一个可扩展的两阶段学习管道,其中我们使用混合运动跟踪从单个参考运动学习多个攀爬专家,并通过混合模仿和强化学习将这些专家蒸馏成一个统一的基于深度视觉的运动攀爬策略。为了实现真实世界部署,我们利用视觉基础模型来弥合深度感知中的模拟到现实差距。基于学习到的攀爬策略,我们进一步使用双智能体公式训练一个独立的操控策略,允许通过遥操作在梯子上进行稳定操控。实验表明,LadderMan在多种几何形状的梯子上实现了鲁棒的攀爬,以零样本方式成功迁移到真实世界硬件,并在具有挑战性的梯子约束下支持各种操控任务。视频结果见https://ladderman-robot.github.io。

英文摘要

Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at https://ladderman-robot.github.io .

2606.05871 2026-06-05 cs.IT cs.AI math.IT stat.ME 版本更新

Compositional Boundaries for Density Fusion

密度融合的组合边界

Ratan Bahadur Thapa, Ali Darijani, Jürgen Beyerer, Steffen Staab

发表机构 * University of Stuttgart Department of Computer Science, Germany(斯图加特大学计算机科学系,德国) KIT Department of Computer Science, Germany(卡尔斯鲁厄理工学院计算机科学系,德国) Fraunhofer IOSB of Fraunhofer-Gesellschaft, Germany(弗劳恩霍夫研究所IOSB分部,德国) University of Southampton Department of Computer Science, United Kingdom(南安普顿大学计算机科学系,英国)

AI总结 研究分布式不确定性管理系统中加权概率密度的层次融合顺序不变性,证明在连续二元规则下,顺序不变的层次融合等价于归一化加权线性池化,并揭示了端点-候选f-散度平衡的局部几何障碍。

详情
AI中文摘要

分布式不确定性管理系统通常沿着由通信、隐私或调度约束选择的聚合树组合局部概率模型。最终密度应取决于加权源,而不是中间节点组合它们的特定顺序。我们将这一要求研究为加权概率密度的二元融合的代数组合性问题。核心问题是局部融合规则何时可以层次化执行同时保持顺序不变。我们为局部段值融合规则建立了一个组合边界。在具有加性输出权重和仅权重系数的连续二元规则类中,顺序不变的层次执行刻画了归一化加权线性池化;范数诱导的段平衡实现了相应的系数。平滑端点-候选$f$-散度平衡具有不同的局部几何:其二次展开引入了平方根有效权重,表明仅凭成对可解性不足以实现调度无关的融合。我们证明这一障碍是端点-候选二元平衡所特有的,而全局散度重心保留了加性权重的局部极限。最后,高斯混合展示了相同问题如何在有限模型类中出现:精确融合是组合的,而逐步压缩仅在未归一化分量测度的同余条件下才是组合的。这些结果区分了精确的调度无关融合与全局聚合目标及局部近似启发式。

英文摘要

Distributed uncertainty-management systems often combine local probabilistic models along aggregation trees chosen by communication, privacy, or scheduling constraints. The final density should depend on the weighted sources, not on the particular order in which intermediate nodes combine them. We study this requirement as an algebraic compositionality problem for binary fusion of weighted probability densities. The central question is when a local fusion rule can be executed hierarchically while remaining order-invariant. We establish a compositional boundary for local segment-valued fusion rules. Within the class of continuous binary rules with additive output weights and weight-only coefficients, order-invariant hierarchical execution characterizes normalized weighted linear pooling; norm-induced segment balancing realizes the corresponding coefficient. Smooth endpoint-to-candidate $f$-divergence balancing has a different local geometry: its quadratic expansion induces square-root effective weights, showing why pairwise solvability alone is insufficient for schedule-independent fusion. We show that this obstruction is local to endpoint-to-candidate binary balancing, whereas global divergence barycenters retain additive-weight local limits. Finally, Gaussian mixtures show how the same issue appears in finite model classes: exact fusion is compositional, whereas stepwise compression is compositional only under a congruence condition on unnormalized component measures. These results distinguish exact schedule-independent fusion from global aggregation objectives and local approximation heuristics.
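The order-invariance requirement can be illustrated numerically. The sketch below fuses three weighted discrete densities under two aggregation orders, using a binary rule with additive output weights and a weight-only coefficient. With the normalized linear-pooling coefficient both orders agree. With a square-root coefficient they do not; that rule is a toy stand-in chosen to echo the abstract's "square-root effective weights", not the paper's f-divergence balancing. All names and values are hypothetical.

```python
from math import sqrt

def fuse(left, right, coeff):
    """Binary fusion of (weight, density) pairs with additive output weight."""
    (w1, p1), (w2, p2) = left, right
    a = coeff(w1, w2)  # share given to the left input; depends on weights only
    return (w1 + w2, [a * x + (1.0 - a) * y for x, y in zip(p1, p2)])

def linear_coeff(w1, w2):
    """Normalized weighted linear pooling."""
    return w1 / (w1 + w2)

def sqrt_coeff(w1, w2):
    """Toy alternative with square-root effective weights (illustrative only)."""
    return sqrt(w1) / (sqrt(w1) + sqrt(w2))

# Three unit-weight sources, each a point mass on a different outcome.
s1 = (1.0, [1.0, 0.0, 0.0])
s2 = (1.0, [0.0, 1.0, 0.0])
s3 = (1.0, [0.0, 0.0, 1.0])

def both_orders(coeff):
    """Fuse as ((s1, s2), s3) and as (s1, (s2, s3)); return both densities."""
    left_first = fuse(fuse(s1, s2, coeff), s3, coeff)[1]
    right_first = fuse(s1, fuse(s2, s3, coeff), coeff)[1]
    return left_first, right_first

linear_a, linear_b = both_orders(linear_coeff)
sqrt_a, sqrt_b = both_orders(sqrt_coeff)
print(linear_a, linear_b)
print(sqrt_a, sqrt_b)
```

Linear pooling returns the uniform density under both orders. The square-root rule gives whichever source is fused last, as a singleton, a share of about 0.41, so its result depends on the aggregation tree.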

2606.05863 2026-06-05 cs.LG cs.AI 版本更新

Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

通过深度线性网络理论与条件ReLU约简解读Grokking中的两个训练时钟

Hu Tan, Kuo Gai, Shihua Zhang

发表机构 * State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China(数学科学国家重点实验室,数学与系统科学研究院,中国科学院,北京100190,中国) School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China(中国科学院大学数学科学学院,北京100049,中国) Shanghai Institute for Mathematics and Interdisciplinary Sciences (SIMIS), Shanghai, China(上海数学与交叉科学研究所(SIMIS),上海,中国) Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China(浙江省系统健康科学重点实验室,生命科学学院,杭州先进研究院,中国科学院大学,中国科学院,杭州310024,中国)

AI总结 本文通过分离分类损失的快速衰减与表示学习的缓慢简化,定义了“两个训练时钟”形式化Grokking现象,并利用深度线性网络理论和条件ReLU约简机制解释了这一两阶段过程。

详情
AI中文摘要

Grokking表明,拟合训练数据和学习简单底层规则可能发生在不同的时间尺度上。我们通过将分类损失的快速衰减与学习表示的较慢简化分离来形式化这一现象,并将由此产生的停止时间对称为两个训练时钟。对于深度线性网络,我们证明后边际间隙增长或一步尾部收缩条件在对数时间尺度上将交叉熵损失降低到ε水平。相反,当存在逐层权重衰减时,端到端映射上的诱导正则化可以表示为Schatten型惩罚;在尖锐的晚期Kurdyka-Lojasiewicz尾部下,这种结构能量在多项式时间尺度上闭合。因此,两个时钟将拟合与表示简化分开。然后我们解释相同机制如何在ReLU MLP中出现。在训练集上的激活模式保持固定的区域中,网络简化为活动坐标上的线性模型。在两层ReLU嵌入模型中,链式法则估计进一步表明,在受控的下游范数下,分类器头可以比嵌入块接收更大的有效梯度。这支持了一个两阶段机制:分类器先拟合,而表示随后继续简化。我们以模加法作为主要实验设置。深度线性理论提供了分析的核心严格基础。但ReLU结果被表述为条件约简,以解释经验行为,而不声称对非线性训练动态的全局证明。

英文摘要

Grokking suggests that fitting the training data and learning a simple underlying rule may occur on different time scales. We formalize this phenomenon by separating the fast decay of the classification loss from the slower simplification of the learned representation, and we call the resulting pair of stopping times two training clocks. For deep linear networks, we show that a post-margin gap-growth or one-step tail-contraction condition reduces the cross-entropy loss to level epsilon on a logarithmic time scale. In contrast, when layerwise weight decay is present, the induced regularization on the end-to-end map can be expressed as a Schatten-type penalty; under a sharp late-time Kurdyka-Lojasiewicz tail, this structural energy closes on a polynomial time scale. The two clocks, therefore, separate fitting from representation simplification. We then explain how the same mechanism can appear in ReLU MLPs. In regions where the activation patterns on the training set remain fixed, the network reduces to a linear model in the active coordinates. In a two-layer ReLU embedding model, chain-rule estimates further show that the classifier head can receive larger effective gradients than the embedding block under controlled downstream norms. This supports a two-stage mechanism in which the classifier fits first, while the representation continues to simplify later. We use modular addition as the main experimental setting. The deep linear theory provides the rigorous core of the analysis. But the ReLU results are formulated as conditional reductions that account for empirical behavior without claiming a global proof for nonlinear training dynamics.

2606.05855 2026-06-05 cs.HC cs.AI 版本更新

EEGDancer: Dynamic Emotion Latent Space Masked Modeling with Reinforcement Learning for EEG Continuous Emotion Prediction

EEGDancer:基于强化学习的动态情感潜空间掩码建模用于EEG连续情感预测

Zhihao Zhou, Weishan Ye, Li Zhang, Gan Huang, Zhen Liang

发表机构 * National University of Singapore(新加坡国立大学) Agency for Science, Technology and Research(科技研究局)

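AI Summary Proposes EEGDancer, a framework that combines vector-quantized representation learning, masked temporal modeling, and reinforcement-learning trajectory optimization to address long-range dependencies and noise in continuous EEG emotion prediction.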

Comments 51 pages, 9 figures, 13 tables

详情
AI中文摘要

连续脑电图(EEG)情感预测旨在从EEG信号中建模人类情感状态的时间演化。与传统的离散情感识别不同,连续预测需要捕捉长时依赖和连贯的情感动态。然而,现有方法主要依赖于逐点回归并直接对噪声高维EEG特征建模,限制了其刻画连续情感演化的能力。为应对这些挑战,我们提出EEGDancer,一个用于连续EEG情感预测的动态情感潜空间学习框架。该框架将向量量化表示学习、掩码时间建模和基于强化学习的轨迹优化整合到一个统一架构中。具体而言,设计了一个因果时空向量量化变分自编码器(VQ-VAE),用于学习结构化情感原型并从EEG信号构建离散-连续情感潜空间。基于学习到的潜表示,采用基于Transformer的掩码动态建模策略捕捉长时情感依赖和时间演化模式。此外,将连续情感预测建模为序列决策问题,并引入软演员-评论家(SAC)框架在序列级别优化情感预测轨迹,而非逐帧局部拟合。在SEED、SEED-IV和长期自然情感数据集上的大量实验表明,EEGDancer持续优于现有机器学习和深度学习方法。消融研究进一步验证了所提出的潜空间和基于强化学习的轨迹优化在建模连续EEG情感动态方面的有效性。

英文摘要

Continuous electroencephalography (EEG) emotion prediction aims to model the temporal evolution of human emotional states from EEG signals. Unlike conventional discrete emotion recognition, continuous prediction requires capturing long-range temporal dependencies and coherent emotional dynamics. However, existing methods mainly rely on point-wise regression and directly model noisy high-dimensional EEG features, limiting their ability to characterize continuous emotional evolution. To address these challenges, we propose EEGDancer, a dynamic emotional latent space learning framework for continuous EEG emotion prediction. The framework integrates vector-quantized representation learning, masked temporal modeling, and reinforcement learning-based trajectory optimization into a unified architecture. Specifically, a causal spatiotemporal Vector-Quantized Variational Autoencoder (VQ-VAE) is designed to learn structured emotional prototypes and construct a discrete-continuous emotional latent space from EEG signals. Based on the learned latent representations, a Transformer-based masked dynamic modeling strategy captures long-range emotional dependencies and temporal evolution patterns. Furthermore, continuous emotion prediction is formulated as a sequential decision-making problem, and a Soft Actor-Critic (SAC) framework is introduced to optimize emotional prediction trajectories at the sequence level instead of frame-wise local fitting. Extensive experiments on the SEED, SEED-IV, and Long-Term Naturalistic Emotion datasets demonstrate that EEGDancer consistently outperforms existing machine learning and deep learning methods. Ablation studies further verify the effectiveness of the proposed latent space and reinforcement learning-based trajectory optimization for modeling continuous EEG emotional dynamics.

2606.05852 2026-06-05 cs.SD cs.AI eess.AS 版本更新

UniVoice: A Unified Model for Speech and Singing Voice Generation

UniVoice: 一种用于语音和歌声生成的统一模型

Junjie Zheng, Huixin Xue, Shihong Ren, Chaofan Ding, Hao Liu, Zihao Chen

发表机构 * Giant Network(巨人网络) Shanghai Conservatory of Music(上海音乐学院)

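AI Summary Proposes UniVoice, a unified speech and singing voice generation framework based on conditional flow matching; by factorizing the condition into content, melody, and timbre and introducing a null melody token, a single model generates both natural speech and controllable singing.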

Comments 9 pages, 2 figures

详情
AI中文摘要

文本到语音(TTS)和歌声合成(SVS)都旨在从符号输入生成人类声音音频,但它们对生成过程提出了不同的要求。语音生成依赖于灵活的、语言驱动的韵律,而歌声生成则需要明确的旋律控制和准确的节奏对齐。这种不匹配使得训练一个既能生成自然语音又能生成可控歌声的单一模型具有挑战性,因为与旋律相关的条件应该强烈约束歌声,但不应限制语音韵律。我们提出了UniVoice,一种基于条件流匹配的统一语音和歌声生成框架。UniVoice没有使用单一的未分化条件表示,而是将条件分解为内容、旋律和音色,这些由适合模态的编码器编码,并由共享的扩散变换器(DiT)主干网络使用。对于歌声,旋律条件由MIDI音符序列表示;对于语音,它被替换为学习的空旋律标记,使模型能够从语言和声学上下文中推断韵律。这种设计保留了歌声的显式旋律控制,同时避免了对语音施加旋律约束的需要。我们进一步将空旋律标记分析为条件流中旋律边缘化的近似。在3万小时语音和3.5万小时歌声数据上训练,UniVoice在语音上实现了5.26%的音素错误率(PER),与专用TTS系统如F5-TTS(5.21%)和CosyVoice3(5.30%)相当。在歌声生成上,UniVoice实现了16.22%的PER,优于统一基线Vevo1.5(24.72%)。

英文摘要

Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation, UniVoice achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).
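The null-melody substitution described in the abstract can be illustrated with a small sketch. Everything named here (`DIM`, `NULL_MELODY`, `NOTE_TABLE`, `melody_condition`) is hypothetical, and random vectors stand in for learned parameters; this is not UniVoice's implementation, only one way the conditioning switch could look.

```python
import random
from typing import List, Optional

DIM = 4  # hypothetical embedding width

# Random stand-ins for parameters that would be learned during training.
random.seed(0)
NULL_MELODY = [random.gauss(0.0, 1.0) for _ in range(DIM)]
NOTE_TABLE = {
    pitch: [random.gauss(0.0, 1.0) for _ in range(DIM)] for pitch in range(128)
}


def melody_condition(
    num_frames: int, midi_notes: Optional[List[int]]
) -> List[List[float]]:
    """Return one melody vector per frame.

    Singing: look up an embedding for each frame's MIDI pitch.
    Speech (midi_notes is None): repeat the shared null-melody vector,
    so no pitch constraint is imposed on prosody.
    """
    if midi_notes is None:
        return [list(NULL_MELODY) for _ in range(num_frames)]
    if len(midi_notes) != num_frames:
        raise ValueError("expected one MIDI pitch per frame")
    return [list(NOTE_TABLE[p]) for p in midi_notes]
```

Because both branches return vectors of the same shape, a shared backbone can consume either without knowing which modality produced them.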

2606.05847 2026-06-05 cs.AI 版本更新

Agentic Molecular Recovery via Molecule-Aware Exploration

通过分子感知探索实现智能体分子恢复

Suwan Yoon, Changhee Lee

发表机构 * Department of Artificial Intelligence, Korea University(韩国大学人工智能系)

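AI Summary Targets invalid SMILES in text-guided molecular generation with AMREC, which uses molecule-aware mismatch tracking, expanded candidate exploration, and trajectory-level selection to restore chemical validity while preserving target-relevant structural cues and molecular identity.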

Comments Preprint

详情
AI中文摘要

使用LLM进行文本引导的分子生成常常产生无效的SMILES。我们认为,无效草稿应通过从面向有效性的修复转向保持身份的分子恢复来解决:目标不仅是恢复化学有效性,还要保留目标相关的结构线索并恢复描述所暗示的分子身份。这一视角揭示了现有修正策略的局限性。事后修复可以在恢复有效性的同时扭曲关键结构,仅LLM修正可能引入意外的全局漂移,而即使配备了可执行的RDKit编辑工具,通用智能体修正仍受限于贪婪的单候选轨迹。为了解决这些局限性,我们提出了AMREC,它将分子感知失配追踪与扩展候选探索和轨迹级选择相结合。在来自三个骨干模型的无效ChEBI-20草稿上,AMREC在结构、精确匹配和字符串级指标上实现了最强的整体恢复性能。

英文摘要

Text-guided molecular generation with LLMs often yields invalid SMILES. We argue that invalid drafts should be addressed through a shift from validity-oriented repair to identity-preserving molecular recovery: the objective is not only to restore chemical validity, but also to preserve target-relevant structural cues and recover the molecular identity implied by the description. This perspective reveals the limitations of existing correction strategies. Post-hoc repair can recover validity while distorting key structures, LLM-only correction can introduce unintended global drift, and generic agentic correction remains constrained by greedy single-candidate trajectories even when equipped with executable RDKit edit tools. To address these limitations, we propose AMREC, which couples molecule-aware mismatch tracking with expanded candidate exploration and trajectory-level selection. On invalid ChEBI-20 drafts from three backbone models, AMREC achieves the strongest overall recovery profile across structural, exact-match, and string-level metrics.

2606.05844 2026-06-05 cs.CR cs.AI 版本更新

GenTI: Benchmarking LLMs for Autonomous IDPS Rule Generation for Unseen Attacks

GenTI: 针对未知攻击的自主IDPS规则生成的LLM基准测试

Hassan Jalil Hadi, Rehana Yasmin, Ali Shoker

发表机构 * Cyber Security and Resilience Technology (CyberSaR), King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学网络安全与韧性技术(CyberSaR))

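AI Summary Proposes the GenTI framework: it builds GTI, a dataset of over 150k detection and prevention rules, and designs an LLM-based pipeline (structured prompt engineering, chain-of-thought reasoning, and a verification loop) to generate IDPS rules automatically for unseen attacks, raising unseen-attack detection from 45% to 87.4% and lowering the false-positive rate from 8.5% to 2.3%.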

详情
AI中文摘要

基于规则的入侵检测与防御系统(IDPS)能够提供精确的攻击检测和缓解,但其手动制作的、基于签名的规则限制了针对新兴和零日威胁的适应性。此外,现有的公共数据集(如CICIDS2017、UNSW-NB15)侧重于流量分类,提供的结构化信息很少,无法支持自动规则合成或防御逻辑。为填补这一空白,我们提出了生成式威胁情报(GenTI)——一个用于自动生成针对未知攻击的IDPS规则的LLM驱动基准。该数据集(GTI)汇集了来自Snort、Suricata、Emerging Threats的超过15万条检测和防御规则,以及5万条YARA规则,每条规则都标注了协议行为、负载签名、上下文关系、与网络威胁情报(CTI)的映射,以及可操作的响应类型(告警、丢弃、拒绝)。此外,在此语料库之上,我们设计了一个基于LLM的流水线,通过结构化提示工程、思维链(CoT)推理以及用于语法、语义和安全验证的验证链(CoVe)循环,将分析师提示和代表性负载转换为可部署的规则。生成的规则在(Snort/Suricata)上实时执行,并通过语法准确性、语义相似性、CTI覆盖率、安全有效性以及未知攻击检测进行评估。此外,我们的GenTI实例实现了89.4%的综合规则质量分数,CTI覆盖率达94.8%,将未知攻击检测率从45%提高到87.4%,并将误报率从8.5%降低到2.3%。总体而言,GenTI建立了第一个将规则级CTI与基于LLM的自动化紧密结合的大规模基准,实现了自适应、自演进的IDPS。

英文摘要

Rule-based Intrusion Detection and Prevention Systems (IDPS) offer precise attack detection and mitigation; however, their manually crafted, signature-driven rules limit adaptability to emerging and zero-day threats. Additionally, existing public datasets (e.g., CICIDS2017, UNSW-NB15) focus on traffic classification and provide little structured information to support automatic rule synthesis or prevention logic. To address this gap, we propose Generative Threat Intelligence (GenTI), an LLM-driven benchmark for automatic generation of IDPS rules targeting unseen attacks (GenTI refers to the proposed framework, and GTI refers to the dataset). The dataset (GTI) aggregates over 150k detection and prevention rules from Snort, Suricata, and Emerging Threats, as well as 50k YARA rules, each annotated with protocol behavior, payload signatures, contextual relationships, mappings to Cyber Threat Intelligence (CTI), and actionable response types (alert, drop, reject). On top of this corpus, we design an LLM-based pipeline that transforms analyst prompts and representative payloads into deployable rules via structured prompt engineering, Chain-of-Thought (CoT) reasoning, and a Chain-of-Verification (CoVe) loop for syntactic, semantic, and security validation. The generated rules are executed in real time on Snort/Suricata and evaluated by syntax accuracy, semantic similarity, CTI coverage, security effectiveness, and unseen-attack detection. Our GenTI instantiation achieves a composite rule-quality score of 89.4%, with 94.8% CTI coverage, improving unseen-attack detection from 45% to 87.4% and reducing the false-positive rate from 8.5% to 2.3%. Overall, GenTI establishes the first large-scale benchmark that tightly couples rule-level CTI with LLM-based automation, enabling adaptive, self-evolving IDPS.

2606.05843 2026-06-05 cs.CL cs.AI 版本更新

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

多模态大语言模型中通过CoRe头的功能稀疏性机制洞察

Ruoxi Sun, Quantong Qiu, Juntao Li, Zecheng Tang, Yihang Lou, Min Zhang

发表机构 * Soochow University(苏州大学) Peking University(北京大学)

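AI Summary Identifies and analyzes CoRe heads to reveal functional sparsity in cross-modal retrieval as a structural property of multimodal LLMs, and verifies both their necessity and their potential for accelerating inference.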

详情
AI中文摘要

虽然多模态大语言模型(MLLMs)在复杂的视觉-语言任务上表现出卓越的能力,但它们从复杂、嘈杂的上下文中提取与查询相关的视觉特征的机制仍然不透明。在本文中,我们进行了一项深入的可解释性研究,揭示了MLLMs中一个深刻的结构属性:跨模态检索中的功能稀疏性。利用一种称为检索注意力质量(RAM)的令牌级指标,我们识别并描述了一组高度专业化的注意力头,称为上下文感知检索(CoRe)头。在不同的视觉领域和模型规模中,我们观察到明确的功能划分:CoRe头充当专用的信息提取器,而大多数其他头则将注意力分布在更广泛的上下文区域。因果干预进一步证明了这些专业化头的必要性。仅消融前5%的CoRe头就会导致多模态推理性能显著下降,而消融排名较低的头则影响甚微。此外,加速实验验证了CoRe头的实用性,表明利用这种局部稀疏性可以显著加速推理,同时保持稳健的任务性能。我们的发现揭示了MLLMs中功能稀疏性的结构原理,完善了当前对机制可解释性的理解,并为未来的架构设计和模型优化奠定了理论基础。

英文摘要

While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a highly specialized subset of attention heads, referred to as Context-aware Retrieval (CoRe) heads. Across diverse visual domains and model scales, we observe a clear functional division: CoRe heads act as dedicated information extractors, while most other heads distribute attention over broader contextual regions. Causal interventions further demonstrate the necessity of these specialized heads. Ablating only the top 5% of CoRe heads causes significant degradation in multimodal reasoning performance, whereas ablating lower-ranked heads has minimal effect. Moreover, acceleration experiments validate the utility of CoRe heads, showing that leveraging this localized sparsity significantly accelerates inference while maintaining robust task performance. Our findings reveal a structural principle of functional sparsity within MLLMs, refining the current understanding of mechanistic interpretability and laying a theoretical foundation that can inspire future architecture design and model optimization.
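A head-ranking procedure of the kind the abstract describes might be sketched as follows. The abstract does not give the exact definition of Retrieval Attention Mass, so the formula here (the share of a query token's attention that falls on query-relevant positions) is an assumption, and the names `retrieval_attention_mass` and `top_heads` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def retrieval_attention_mass(attn_row: List[float], relevant: Set[int]) -> float:
    """Share of one query token's attention that lands on relevant positions.

    Assumed definition for illustration; the paper's metric may differ.
    """
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(w for i, w in enumerate(attn_row) if i in relevant) / total


def top_heads(scores: Dict[Head, float], fraction: float = 0.05) -> List[Head]:
    """Return the highest-scoring heads, keeping at least one."""
    k = max(1, int(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

An ablation study would then zero out the heads returned by `top_heads` and compare task accuracy against ablating an equally sized set of low-ranked heads.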

2606.05833 2026-06-05 cs.CV cs.AI 版本更新

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

从视频中学习几何表示以实现空间智能多模态大语言模型

Haibo Wang, Lifu Huang

发表机构 * University of California, Davis(加州大学戴维斯分校)

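AI Summary Proposes GeoVR, a framework that distills 3D geometric knowledge (camera poses, depth maps, a scale factor, and multi-scale 3D features) from pre-trained 3D foundation models using only 2D video sequences, reshaping the internal representations of multimodal LLMs to give them spatial intelligence and reaching state-of-the-art performance on spatial reasoning benchmarks.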

详情
AI中文摘要

多模态大语言模型(MLLMs)在2D语义理解方面表现出色,但缺乏内在的3D感知能力,导致其表示无法在视频帧间保持几何和空间一致性。鉴于大规模3D数据的稀缺性,我们提出了GeoVR,一种新颖的框架,仅使用2D视频序列学习几何表示。该方法有效地重构了MLLMs内部的语义潜在空间,以解锁空间智能。GeoVR并非采用浅层的特征混合,而是通过从预训练的3D基础模型中蒸馏几何知识来重塑MLLM的内部表示。这是通过一种多目标学习策略实现的,该策略由四个互补的几何目标驱动:(1)估计帧间相机姿态以嵌入变化的视角动态,(2)回归密集深度图以锚定物理距离,(3)预测度量尺度因子以进行真实世界校准,以及(4)蒸馏多尺度3D特征以对齐中间特征空间。在这些显式的物理和几何约束的引导下,模型的内部表示自然地发展出强大的3D感知能力。在空间推理基准上的大量实验表明,GeoVR实现了最先进的性能,为赋予基础模型空间智能建立了一种新范式。

英文摘要

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

2606.05828 2026-06-05 cs.AI cs.CL 版本更新

Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents

隐式偏好的统计先验:在个人代理中解耦技能选择作为局部调控机制

Zeyu Gan, Huayi Tang, Yong Liu

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院)

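AI Summary Addresses implicit user-preference learning in locally deployed personal agents with a lightweight architecture that decouples statistical preference learning from semantic intent parsing; local statistics influence the remote LLM's selection decisions, significantly reducing cumulative regret and improving test accuracy.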

详情
AI中文摘要

随着大型语言模型(LLM)能力的提升,依赖基于API的远程模型和外部技能的本地部署个人代理成为一种新范式。随着可用技能的快速扩展,使个人代理能够学习并适应隐式用户偏好成为关键挑战。然而,本地部署的限制排除了复杂的集中式选择算法,迫切需要一种轻量级的局部偏好调控机制。本文通过一种严格解耦统计偏好学习与语义意图解析的新颖架构,探索了这种调控机制的实现。具体而言,我们利用局部统计结果来影响和调节远程LLM的选择决策。大量评估表明,我们的解耦方法实现了最低的累积遗憾和最高的测试准确率,显著优于传统的记忆增强型代理。

英文摘要

As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling personal agents to learn and adapt to implicit user preferences becomes a critical challenge. However, local deployment constraints preclude complex centralized selection algorithms, creating an urgent need for a lightweight local preference harness. This paper explores the implementation of such a harness through a novel architecture that strictly decouples statistical preference learning from semantic intent parsing. Specifically, we leverage localized statistical results to influence and modulate the selection decisions of the remote LLM. Extensive evaluations demonstrate that our decoupled approach achieves the lowest cumulative regret and highest test accuracy, significantly outperforming traditional memory-augmented agents.

2606.05818 2026-06-05 math.HO cs.AI math.AG math.CO math.RT 版本更新

Benchmarks in Leipzig

莱比锡基准测试

Andrei Balakin, Miklós Bóna, Marie-Charlotte Brandenburg, Clara Briand, Veronica Calvo Cortes, Shelby Cox, Jesus A. De Loera, Danai Deligeorgaki, Hannah Friedman, Tim Gehrunger, Chiara Giardino, Stephen Griffeth, Baran Hashemi, Elena Hoster, Alexander Ivanov, Nupur Jain, Aryaman Jal, Leonie Kayser, Joris Koefler, Kevin Kühn, Mario Kummer, Felix Lotter, René Marczinzik, Victor S. Miller, Alejandro Morales, Greta Panova, Gianni Petrella, Nathan Pflueger, Lakshmi Ramesh, Nikolas Rieke, Carlos Rodriguez, Andrea Rosana, Flavio Salizzoni, Otto T. P. Schmidt, Sven Ulf Schmitz, Lina Maria Simbaqueba Marin, Luca Sodomaco, Christian Stump, Bernd Sturmfels, Alexander Taveira Blomenhofer, Simon Telen, Philipp Tuchel, Emil Verkama, Carl Felix Waller, Julian Weigert, Annette Werner, Nathan Williams, Claudius Zibrowius

发表机构 * University of California, Berkeley(加州大学伯克利分校)

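AI Summary A group of 49 mathematicians compiled a dataset of 100 research-level mathematics questions between April and May 2026 and evaluated the mathematical reasoning of LLMs in three stages, after which only 2 questions remained unsolved.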

Comments 8 pages including 8 benchmark statistics tables + 20 pages appendix containing the 100 Leipzig Benchmark questions

详情
AI中文摘要

在2026年4月1日至5月15日期间,由49位数学家组成的小组编制了一个包含已知答案的研究级数学问题数据集。大部分工作是在德国莱比锡马克斯·普朗克数学科学研究所举办的为期3天的研讨会*Benchmarks in Leipzig*上完成的,共有35名参与者。我们展示了由此产生的100个问题集合。我们分三个阶段评估了这些问题:首先由五个最先进的大型语言模型各尝试一次,随后对其中三个模型进行每个模型20次运行的评估,最后用两个深度思考模型进行3次尝试。第一阶段后,41个问题完全未解决;第二阶段后,这一数字降至16个;第三阶段结束时,仅剩2个问题未解决。这表明大型语言模型的数学推理能力正变得令人印象深刻。

英文摘要

Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers. Most of the work was done during the 3-day workshop *Benchmarks in Leipzig* with 35 participants at the Max Planck Institute for Mathematics in the Sciences in Leipzig, Germany. We present the resulting collection of 100 questions. We evaluated these questions in three stages: a single attempt by five state-of-the-art LLMs, followed by a 20-runs-per-model evaluation with three of these models, and finally a 3-run attempt with two heavy-thinking models. After Stage 1, 41 questions remained completely unsolved; after Stage 2, this count dropped to 16; and we concluded Stage 3 with only 2 unsolved questions. This demonstrates that the mathematical reasoning capabilities of LLMs are becoming impressive.

2606.05817 2026-06-05 cs.LG cs.AI 版本更新

Consistency Training Along the Transformer Stack

沿Transformer堆栈的一致性训练

Sukrati Gautam, Neil Shah, Arav Dhoot, Bryan Maruyama, Caroline Wei, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa

发表机构 * Purdue University(普渡大学) Independent(独立) Columbia University(哥伦比亚大学) University of California, San Diego(加州大学圣地亚哥分校) University of California, Los Angeles(加州大学洛杉矶分校) Dartmouth College(达特茅斯学院) University of Michigan, Ann Arbor(密歇根大学安娜堡分校)

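AI Summary Extends consistency training to more safety threats by introducing consistency targets on MLP states and attention distributions, and finds cross-threat generalization and a shared mechanism, supporting it as a flexible alignment framework.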

Comments Submitted to EMNLP 2026

详情
AI中文摘要

一致性训练鼓励模型在不同上下文中表现相似,并已显示出减少对齐问题的潜力。我们以两种方式扩展一致性训练的范围。首先,我们引入两个新的内部一致性目标:MLP一致性训练(MLPCT),匹配激活后的MLP状态;以及注意力一致性训练(AttCT),匹配每个头的注意力分布。其次,我们将一致性训练应用于四种额外的安全威胁:角色上下文学习攻击、对抗性挫败、预填充攻击和条件性对齐错误。在多个模型和威胁设置中,我们发现一致性训练在减少对齐问题方面远优于先前工作中研究的谄媚和越狱设置。我们还发现了跨威胁泛化的案例,即针对一种失败模式的训练提高了对另一种模式的鲁棒性,并识别了ACT、MLPCT和AttCT共享的残差流机制,同时将BCT区分为机制上不同的方法。我们的结果表明,一致性训练是一个灵活且可扩展的对齐框架,能够统一防御更广泛的模型病理类别。

英文摘要

Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal consistency targets: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. Second, we apply consistency training to four additional safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. Across several models and threat settings, we find that consistency training reduces misalignment well beyond the sycophancy and jailbreak settings studied in prior work. We also find cases of cross-threat generalization, where training against one failure mode improves robustness to another, and identify a shared residual-stream mechanism underlying ACT, MLPCT, and AttCT, while distinguishing BCT as mechanistically distinct. Our results suggest that consistency training is a flexible and extensible framework for alignment, capable of unifying defenses against a broader class of model pathologies.

2606.05806 2026-06-05 cs.AI 版本更新

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

当工具失效时:LLM智能体动态重规划与异常恢复的基准测试

Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) East China Normal University(华东师范大学) Soochow University(苏州大学) Shandong University(山东大学) Baidu Inc.(百度公司)

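AI Summary Proposes the ToolMaze benchmark, which uses DAG-based topological complexity and a taxonomy of tool perturbations to evaluate dynamic replanning and error recovery in LLM agents when tools fail; recovery drops by about 37% under implicit semantic failures, and agentic fault tolerance improves with model scale far more slowly than basic task execution.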

详情
AI中文摘要

现有基准在理想化的“快乐路径”上评估LLM中的工具集成推理(TIR),很大程度上忽视了现实中的工具故障。我们引入ToolMaze,一个用于TIR智能体动态路径发现和错误恢复的基准。为了将系统性重规划与盲目试错区分开来,ToolMaze采用二维设计:基于DAG的拓扑复杂度和一个$2 \times 2$的工具扰动分类法(显式/隐式,瞬态/永久)。评估表明,扰动几乎在所有模型上降低了性能,在隐式语义故障下下降最为剧烈。由于对受损输出的系统性过度信任,这些场景中的扰动恢复率(PRR)骤降约37%,而复杂拓扑将智能体困在徒劳的试错循环中。关键的是,智能体容错性随模型规模增长的速度比基本任务执行慢$3.66\times$,凸显了动态重规划作为一个独立瓶颈,无法通过模型缩放或提示工程解决。数据和代码见https://github.com/Zhudongsheng75/ToolMaze。

英文摘要

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale $3.66\times$ slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
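The 2 × 2 perturbation taxonomy can be written down as a small data model. The class names and the one-line glosses in the comments are an interpretation of the abstract's four labels, not definitions taken from the ToolMaze code base.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import List


class Visibility(Enum):
    EXPLICIT = "explicit"  # assumed: the tool signals an error
    IMPLICIT = "implicit"  # assumed: the tool returns plausible but corrupted output


class Duration(Enum):
    TRANSIENT = "transient"  # assumed: a retry may succeed
    PERMANENT = "permanent"  # assumed: the agent must route around the tool


@dataclass(frozen=True)
class Perturbation:
    visibility: Visibility
    duration: Duration


# All four cells of the taxonomy.
TAXONOMY: List[Perturbation] = [
    Perturbation(v, d) for v, d in product(Visibility, Duration)
]


def retry_can_help(p: Perturbation) -> bool:
    """Under the glosses above, only transient faults are worth retrying."""
    return p.duration is Duration.TRANSIENT
```

Under this reading, the implicit cells are the hard ones: nothing in the tool response tells the agent that replanning is needed, which matches the abstract's point about over-trust in corrupted outputs.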

2606.05805 2026-06-05 cs.AI 版本更新

From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

从风险分类到行动方案修复:一种基于护栏反馈驱动的LLM代理框架

Yuhao Sun, Jiacheng Zhang, Shaanan Cohney, Zhexin Zhang, Feng Liu, Xingliang Yuan

发表机构 * The University of Melbourne(墨尔本大学) Tsinghua University(清华大学)

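AI Summary Proposes the TRIAD framework, which uses guardrail-generated verbal feedback to keep the agent aligned with benign objectives at each planning step, achieving the best safety-utility trade-off among guardrail-integrated baselines.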

Comments 32 pages

详情
AI中文摘要

基于LLM的护栏通常通过在执行前评估提议的行动或输入来保护代理,产生安全信号,如二元允许/拒绝决策、风险类别和/或关于潜在政策违规的解释性理由。然而,当原本良性的任务被不可信的外部内容、不安全的指令或风险工具使用污染时,代理风险常常出现。现有护栏通常将整个任务统一标记为不安全,从而阻止威胁但牺牲了良性部分。此外,现有工作大多孤立地评估护栏,不清楚其干预是否导致更安全的下游代理行为。为解决此问题,我们引入TRIAD(三方响应用于迭代代理护栏),一个护栏集成代理框架,利用护栏生成的言语反馈作为引导信号,使代理在每个规划步骤中保持与良性目标一致。我们在自策训练数据集上微调语言模型,输出三种决策之一:继续、拒绝或更新,并附带结构化的自然语言反馈。更新不仅允许或阻止执行,还指导代理修改其计划,避免有害组件,并尽可能保留良性任务。TRIAD将此反馈注入代理的上下文,实现后续计划修订,并在护栏反馈与代理规划之间形成闭环。在ASB和AgentHarm上的大量实验表明,TRIAD将平均攻击成功率降低至10.42%,同时在护栏集成基线中实现了最佳的安全-效用权衡。我们的代码可在https://github.com/YUHAOSUNABC/TRIAD获取。

英文摘要

LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.
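The closed loop between guardrail feedback and plan revision can be sketched in a few lines. This is a toy illustration under stated assumptions: the names `Verdict`, `guarded_plan`, `toy_guardrail`, and `toy_revise` are hypothetical, the keyword-based guardrail stands in for TRIAD's fine-tuned language model, and the revision budget `max_revisions` is not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional

Plan = List[str]


class Decision(Enum):
    PROCEED = "proceed"
    REFUSE = "refuse"
    UPDATE = "update"


@dataclass
class Verdict:
    decision: Decision
    feedback: str = ""


def guarded_plan(
    plan: Plan,
    guardrail: Callable[[Plan], Verdict],
    revise: Callable[[Plan, str], Plan],
    max_revisions: int = 2,
) -> Optional[Plan]:
    """Return an approved plan, or None if refused or never approved in budget."""
    for _ in range(max_revisions + 1):
        verdict = guardrail(plan)
        if verdict.decision is Decision.PROCEED:
            return plan
        if verdict.decision is Decision.REFUSE:
            return None
        # UPDATE: feed the guardrail's feedback back into planning.
        plan = revise(plan, verdict.feedback)
    return None


def toy_guardrail(plan: Plan) -> Verdict:
    """Keyword stand-in for a learned guardrail; feedback names the risky step."""
    risky = [step for step in plan if "password" in step]
    if not risky:
        return Verdict(Decision.PROCEED)
    if len(risky) == len(plan):
        return Verdict(Decision.REFUSE, "every step is unsafe")
    return Verdict(Decision.UPDATE, risky[0])


def toy_revise(plan: Plan, feedback: str) -> Plan:
    """Drop the step the guardrail flagged and keep the benign remainder."""
    return [step for step in plan if step != feedback]
```

With these toy functions, a plan that mixes a benign step with an injected one keeps the benign step, where a binary allow/deny guardrail would have blocked the whole task.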

2606.05793 2026-06-05 cs.CL cs.AI cs.CY cs.LG 版本更新

CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement

CollabBench: 通过主动参与与多样化玩家基准测试和释放LLMs的协作能力

Hong Qian, Yuanhao Liu, Zihan Zhou, Zongbao Zhang, Hanjie Ge, Haotian Shi, Liang Dou, Xiangfeng Wang, Jingwen Yang, Aimin Zhou

发表机构 * Shanghai Institute of AI for Education(上海人工智能教育研究院) School of Computer Science(计算机科学学院) East China Normal University(华东师范大学) Tencent Inc.(腾讯公司) Shanghai Innovation Institute(上海创新研究院)

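AI Summary Proposes the CollabBench benchmark, which uses diverse player simulation and a collaborative agentic training paradigm to improve LLM task efficiency and affective adaptation in cooperative games.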

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管基于LLM的智能体在个体任务上表现出色,但与真实人类伙伴的有效协作仍然具有挑战性。现有的对话级协作研究大多缺乏基于交互和行为执行,这促使需要能够实现情境化和沉浸式协作的合作游戏环境。为此,本文提出了CollabBench,一个用于评估和训练合作游戏中协作智能体的基准。CollabBench具有多样化玩家档案模拟管道,用于建模不同的玩家行为,以及一种协作智能体训练范式,通过智能体展开统一推理、沟通和行动,并使用混合奖励优化任务效率和情感适应。我们进一步将经典环境扩展到CWAH-MultiPlayer和Cook-MultiPlayer,以在多样化个性下进行系统评估。使用效率和情感指标的实验表明,我们训练的模型优于基础模型,效率提高了19.5%,情感表现提高了24.4%。进一步分析揭示了现有模型的关键协作局限性,并为未来的协作训练提供了见解。

英文摘要

While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most existing conversation-level collaborative studies lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied player behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward balancing task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training.

2606.05792 2026-06-05 cs.AI cs.LG cs.LO cs.SE 版本更新

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

LLM 能写出正确的 TLA+ 规范吗?自然语言到 TLA+ 生成的评估

Arslan Bisharat, Brian Ortiz, Eric Spencer, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad

发表机构 * Department of Computer Science, Loyola University Chicago(芝加哥洛约拉大学计算机科学系)

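AI Summary Presents the first systematic evaluation of LLM synthesis of TLA+ specifications from natural language: models reach only 8.6% semantic correctness, success depends on progressive prompting, model size does not predict quality, and code-specialized models underperform.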

Comments 12 pages, 11 tables. Accepted at the 21st International Conference on Software Technologies (ICSOFT 2026); Recommended as Best Paper Award Candidate

详情
AI中文摘要

TLA+ 已支持亚马逊和微软等公司的工业验证,但从自然语言编写正确的 TLA+ 规范仍需时间和专业知识,这限制了其采用。LLM 显示出潜力,但尚无先前研究衡量它们是否能从自然语言生成语义正确的 TLA+ 规范。本文首次系统评估基于 LLM 的 TLA+ 规范合成。我们的研究在精心策划的 205 个 TLA+ 规范数据集上评估了来自八个系列的 30 个 LLM:四种提示策略下的 25 个开放权重模型(2600 次运行)和少样本提示下的 5 个专有模型(130 次运行),所有结果均由 SANY 解析器和 TLC 模型检查器验证。LLM 达到高达 26.6% 的语法正确性,但仅 8.6% 的语义正确性,成功仅出现在渐进式提示中。结果表明模型大小不能预测质量,例如 DeepSeek r1:8b 在所有策略上优于其 70B 变体,这表明推理对齐对形式语言的重要性。由于主流语言训练的负迁移,代码专用模型始终表现不佳。我们识别出五类重复出现的幻觉,所有幻觉均可追溯到特定的训练数据偏差。这些结果表明,当前 LLM 在没有专家监督的情况下无法生成可靠的 TLA+ 规范。我们发布了评估框架、代码和数据集,以支持可重复性和未来研究。

英文摘要

TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but no prior study measures whether they produce semantically correct TLA+ specifications from natural language. This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language. Our study evaluates 30 LLMs across eight families on a curated dataset of 205 TLA+ specifications: 25 open-weight models across four prompting strategies (2,600 runs) and 5 proprietary models under few-shot prompting (130 runs), all validated by the SANY parser and TLC model checker. LLMs achieve up to 26.6% syntactic correctness but only 8.6% semantic correctness, with successes exclusive to progressive prompting. Results show that model size does not predict quality, e.g., DeepSeek r1:8b outperforms its 70B variant across all strategies, which suggests the importance of reasoning alignment for formal languages. Code-specialized models consistently underperform due to negative transfer from mainstream language training. We identify five recurring hallucination categories, all traceable to specific training data biases. These results suggest that current LLMs do not generate reliable TLA+ specifications without expert oversight. We release the evaluation framework, code, and dataset to support reproducibility and future research.
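As a quick consistency check on the run counts quoted above, dividing each total by the number of model and strategy combinations gives the same figure for both model groups. Reading that figure as the number of runs per configuration is an inference from the abstract's numbers, not something the abstract states; the variable names are illustrative.

```python
# Figures quoted in the abstract.
open_runs, open_models, strategies = 2600, 25, 4
proprietary_runs, proprietary_models = 130, 5  # few-shot prompting only

# Runs per (model, strategy) configuration implied by the totals.
per_config_open = open_runs / (open_models * strategies)
per_config_proprietary = proprietary_runs / proprietary_models

print(per_config_open, per_config_proprietary)  # 26.0 26.0
```

Both groups work out to 26 runs per configuration, so the two totals are at least mutually consistent.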

2606.05785 2026-06-05 cs.CV cs.AI cs.LG 版本更新

Next-Generation Parallel Decoder for LPDR: Architectural Optimization and Class-Balanced GAN-Augmentation

下一代LPDR并行解码器:架构优化与类别平衡的GAN增强

Shawaiz Obaid, Nida Chandio, Neha Jamil, Muhammad Khuram Shahzad

发表机构 * School of Electrical Engineering and Computer Science, National University of Sciences & Technology, Islamabad, Pakistan

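AI Summary Addresses spatial character mismatches and data imbalance in license plate detection and recognition with Cross-Spatial Hybrid Attention and Class-Balanced Synthetic Augmentation, raising the recognition rate for minority provincial plates from 78.2% to 91.5% while maintaining real-time processing at 152 FPS.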

Comments 8 pages, 7 figures

详情
AI中文摘要

实时车牌检测与识别(LPDR)是现代智慧城市的基石。尽管YOLOV5-PDLPR模型通过并行解码器方法显著提高了系统效率,但其性能仍受训练集中空间字符不匹配和数据不平衡的影响。本文通过引入交叉空间混合注意力(CSHA)和类别平衡合成增强(CBSA)来解决这些局限性。进行了涉及75,000个合成样本的广泛研究,并在四个基准数据集(CCPD、CLPD、PKU和一个应用特定数据集)上进行了评估。实验结果表明,少数省份车牌识别率从78.2%大幅提升至91.5%,同时保持152 FPS的实时处理性能。结果表明,结合空间感知并行解码与类别平衡增强为高速车牌识别系统提供了有效解决方案。

英文摘要

Real-Time License Plate Detection and Recognition (LPDR) forms the backbone of modern smart cities. Although the YOLOV5-PDLPR model substantially improved system efficiency through a parallel decoder approach, its performance is still affected by spatial character mismatches and data imbalance within the training set. This paper addresses these limitations by introducing Cross-Spatial Hybrid Attention (CSHA) and Class-Balanced Synthetic Augmentation (CBSA). An extensive study involving 75,000 synthetic samples is conducted and evaluated on four benchmarks: CCPD, CLPD, PKU, and an application-specific dataset. Experimental results demonstrate a substantial improvement in the recognition rate of minority provincial license plates from 78.2% to 91.5% while maintaining real-time processing performance of 152 FPS. The results indicate that spatially-aware parallel decoding combined with class-balanced augmentation provides an effective solution for high-speed license plate recognition systems.

2606.05784 2026-06-05 cs.AI 版本更新

TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents

TAPO: 通过信用转移实现工具感知策略优化用于多模态搜索代理

Chengqi Dong, Chuhuai Yue, Hang He, Yandong Liu, Fenghe Tang, S Kevin Zhou, Xiaohan Wang, Jiajun Chai, Guojun Yin

发表机构 * University of Science and Technology of China(中国科学技术大学) Meituan(美团)

AI总结 针对GRPO在多模态搜索代理中信用误分配问题,提出TAPO方法,利用工具参数确定性构建反事实证人进行保守优势校正,无需额外标注或采样,在多个基准上持续提升性能。

详情
AI中文摘要

我们识别并正式刻画了信用误分配作为GRPO在工具增强多模态搜索代理中的系统性失效模式:其对轨迹级优势的统一广播导致失败轨迹中有价值的工具使用步骤与无价值的步骤受到相同的惩罚。我们进一步通过实验量化了该现象的规模。超过一半的失败轨迹和失败的工具使用动作表现出可纠正的信用误分配,表明浪费的训练信号既显著又在结构上可被利用。基于这一见解,我们提出了工具感知策略优化(TAPO),它利用了信息获取工具的参数确定性特性:相似的调用参数定义等价的信息获取动作,因此应共享可比较的动作信用。TAPO在当前训练批次内构建反事实证人,并通过置信门控保守优势校正补偿误分配的负信用。它不需要额外的标注、模型或采样,并且引入可忽略的计算开销。在多个多模态搜索基准上,TAPO在三种主流RL算法(GRPO、GSPO和SAPO)上相对于强基线提供了一致的、即插即用的改进。我们的代码和模型将在接收后公开发布。

英文摘要

We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use steps in failing trajectories to be penalized no differently from valueless ones. We further empirically quantify the scale of this phenomenon. Over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment, demonstrating that the wasted training signal is both substantial and structurally exploitable. Building on this insight, we propose Tool-Aware Policy Optimization (TAPO), which exploits the parameter-determinism property of information-acquisition tools: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit. TAPO constructs counterfactual witnesses within the current training batch and compensates misassigned negative credit via confidence-gated conservative advantage correction. It requires no additional annotation, models, or sampling, and introduces negligible computational overhead. Across multiple multimodal search benchmarks, TAPO delivers consistent, plug-and-play improvements over strong baselines for three mainstream RL algorithms (GRPO, GSPO, and SAPO). Our code and models will be publicly released upon acceptance.
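The uniform broadcast that the abstract identifies as the source of credit misassignment can be illustrated with a minimal sketch. It assumes the common GRPO formulation in which each trajectory's reward is normalized by the group mean and standard deviation; the function name, the rewards, and the token counts are hypothetical, and the sketch does not implement TAPO's correction.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, token_counts, eps=1e-6):
    """Group-normalize trajectory rewards, then copy each trajectory's
    advantage onto every one of its tokens (the uniform broadcast)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Hypothetical group of four rollouts: two succeed (reward 1), two fail (reward 0).
rewards = [1.0, 1.0, 0.0, 0.0]
token_counts = [3, 2, 4, 2]
per_token = grpo_token_advantages(rewards, token_counts)

# Every token of a failing rollout carries the same negative advantage,
# whether or not it belonged to a useful tool call.
failing = per_token[2]
```

In this toy group, a helpful tool-use step inside the third rollout is penalized exactly as much as an unhelpful one, which is the situation the abstract describes.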

2606.05779 2026-06-05 cs.CR cs.AI stat.ML 版本更新

TinyML-Driven Cybersecurity for Autonomous Spacecraft: Latency-Accuracy Analysis for SPARTA RF and Cyber Threat Detection

TinyML驱动的自主航天器网络安全:SPARTA射频与网络威胁检测的延迟-精度分析

Van Le, Trevor Tran, Tan Le

发表机构 * Virginia Tech(弗吉尼亚理工大学) Hampton University(汉普顿大学)

AI总结 针对自主航天器,基于SPARTA攻击模型分析TinyML兼容经典模型(随机森林、逻辑回归、SVM、MLP)在检测多种网络射频威胁时的延迟-精度权衡,发现逻辑回归在微秒级推理下仅比随机森林精度低1%,适合作为机载自主基线。

Comments Twenty Fifth International Conference on Security & Management (SAM'26)

详情
AI中文摘要

自主航天器需要对网络射频威胁进行快速、轻量且可靠的在轨检测。利用SPARTA攻击模型,我们分析了TinyML兼容的经典模型(随机森林、逻辑回归、支持向量机和多层感知机)在检测上行链路干扰、Fake-NR欺骗、有效载荷操纵、地面段失陷和未授权命令注入时的延迟-精度权衡。我们对每个模型的计算复杂度、VC维、Lipschitz连续性和延迟缩放进行了基于物理的理论分析,并通过在通过BandErasure、FakeNR和NoiseBurst损坏模式生成的对抗性射频频谱图上的经验测量加以支持。结果表明,逻辑回归实现了微秒级推理,且相对于随机森林仅下降1%的精度,使其成为机载自主的有效TinyML基线。该研究还指出了通过更丰富的特征编码器和多时间尺度学习架构来推进航天器网络安全的机会,这建立在边缘智能和可信AI的最新进展之上。

英文摘要

Autonomous spacecraft require rapid, lightweight, and reliable onboard detection of cyber-RF threats. Using the SPARTA attack model, we analyze the latency-accuracy trade-offs of TinyML-compatible classical models -- Random Forest, Logistic Regression, SVM, and MLP -- for detecting uplink jamming, Fake-NR spoofing, payload manipulation, ground-segment compromise, and unauthorized command injection. We present a physics-informed theoretical analysis of each model's computational complexity, VC dimension, Lipschitz continuity, and latency scaling, supported by empirical measurements on adversarial RF spectrograms generated via BandErasure, FakeNR, and NoiseBurst corruption modes. Results show that Logistic Regression achieves microsecond-level inference with only a 1% accuracy drop relative to Random Forest, making it an effective TinyML baseline for onboard autonomy. The study also identifies opportunities for advancing spacecraft cybersecurity through richer feature encoders and multi-timescale learning architectures, building on recent progress in edge intelligence and trustworthy AI.

2606.05776 2026-06-05 cs.CR cs.AI cs.LG 版本更新

An Improved CNN-LSTM Based Intrusion Detection System for IoT Networks

基于改进的CNN-LSTM的物联网网络入侵检测系统

Mohammad Tariq Ikhlas, Pohanyar Khowaja Khil, Malik Muhammad Mueed Aslam, Muhammad Khuram Shahzad

发表机构 * University of Engineering and Technology, Lahore(拉合尔工程与技术大学)

AI总结 提出一种结合多类分类、数据集集成和时间特征学习的改进CNN-LSTM入侵检测模型,在物联网网络上达到约97%的准确率。

Comments 8 pages, 8 figures

详情
AI中文摘要

随着物联网设备的快速普及,安全问题急剧增加,入侵检测系统对于保护网络环境变得至关重要。本文提出了一种改进的基于CNN-LSTM的入侵检测模型,该模型结合了多类分类、数据集集成和时间特征学习,以增强物联网网络中的检测性能。使用网络流量数据,所提出的方法在入侵检测任务上进行了评估,达到了约97%的准确率。实验结果表明,该模型能有效检测多种攻击类别,同时保持稳定的训练和验证性能。卷积和循环神经网络组件的集成使框架能够捕获网络流量的空间和时间特征,提高了物联网环境中的整体入侵检测能力。

英文摘要

With the rapid proliferation of IoT devices, security concerns have dramatically escalated and intrusion detection systems have become critical for protecting networked environments. This paper presents an improved CNN-LSTM based intrusion detection model that combines multi-class classification, dataset integration, and temporal feature learning to enhance detection performance in IoT networks. Using network traffic data, the proposed approach is evaluated on intrusion detection tasks and achieves an accuracy of approximately 97%. Experimental results demonstrate that the model effectively detects multiple attack categories while maintaining stable training and validation performance. The integration of convolutional and recurrent neural network components enables the framework to capture both spatial and temporal characteristics of network traffic, improving overall intrusion detection capability in IoT environments.

2606.05770 2026-06-05 cs.SE cs.AI 版本更新

Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering

人类监督与过载:AI辅助软件工程中两种隐藏且昂贵的负担

Vahid Garousi

发表机构 * Queen’s University Belfast(贝尔法斯特女王大学) Azerbaijan Technical University(阿塞拜疆技术大学)

AI总结 本文通过分析从业者观点,揭示了AI辅助软件工程中人类持续监督AI生成产物和认知过载两种隐藏负担,并探讨了团队应对策略。

详情
AI中文摘要

AI正在改变软件工程师的工作方式,但常常伴随着隐藏的负担和成本。在本文中,我们描述了两种常被忽视的负担:(1)对人类持续监督和检查AI生成产物的需求;(2)软件工程师因接收大量AI工具建议而日益增长的认知过载。人类监督的需求并非可选——工程师必须审查、验证,有时甚至重做AI产生的内容。同时,大量AI建议、提示和可能的解决方案会使开发者精神紧张。通过融合近期从业者观点的证据,我们强调了这些常被忽视的挑战,并开启了关于团队如何在日常AI辅助软件工程中应对这些挑战的对话。

英文摘要

AI is changing how software engineers work, but it often comes with hidden burdens and costs. In this paper, we characterize two such often-overlooked burdens: (1) the constant need for human oversight and inspection of AI-generated artifacts; and (2) the growing cognitive overload on software engineers from receiving large numbers of suggestions from AI tools. The need for human oversight is not optional: engineers must review, validate, and sometimes rework what AI produces. At the same time, the flood of AI suggestions, prompts, and possible solutions can leave developers mentally stretched. By blending evidence from recent practitioner opinions, we highlight these often-overlooked challenges and open a conversation about how teams can handle them in day-to-day AI-assisted software engineering.

2606.05758 2026-06-05 cs.CV cs.AI cs.LG 版本更新

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

DRIFT:一种用于视觉-语言模型中连续输出解码的残差流适配器

Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi, Yin Li

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) West Lafayette Jr./Sr. High School(韦斯特拉法叶高中)

AI总结 提出DRIFT框架,通过结合基础预测器和基于流匹配的生成式精化模块,将预训练视觉-语言模型适配到连续解码任务,在视觉定位和机器人控制等任务上优于回归和生成方法。

详情
AI中文摘要

许多现代视觉-语言模型(VLM)基于离散标记的自回归解码。虽然基于文本的输出接口支持可扩展的预训练和跨多种任务的强零样本泛化,但它们不适用于需要精确连续输出的问题,例如定位事件的时间边界或生成机器人控制动作。为了解决这一挑战,我们提出了DRIFT,一个用于将预训练VLM适配到连续解码任务的通用框架。DRIFT结合了一个基础预测器(提供目标输出的粗略估计)和一个基于流匹配的生成式精化模块(迭代改进预测)。这种残差公式将生成建模问题从学习全局输出分布转变为在强先验周围建模局部残差分布,大大简化了优化。我们在感知和规划任务上评估了DRIFT,包括视觉定位和机器人控制。在跨越MLLM、VLA和WAM的多个任务和架构中,DRIFT始终优于一组强大的基于回归和生成的方法。

英文摘要

Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.

2606.05756 2026-06-05 cs.LG cs.AI cs.IT math.IT 版本更新

Beyond Soft Masks: Hard-Perturbation Mixup Explainer for Robust GNN Explainability

超越软掩码:用于鲁棒GNN可解释性的硬扰动混合解释器

Jialiang Yin, Zheng Zhao, Linsey Pang, Bo Dong, Bin Shi, Jiaxing Zhang

发表机构 * Xi’an Jiaotong University(西安交通大学) PayPal, Bellevue, USA(PayPal,美国贝尔维尤)

AI总结 提出基于广义图信息瓶颈的硬扰动混合解释框架HPME,通过图池化提取离散解释子图并采用结构级替换的混合策略,解决软掩码方法中标签无关信息泄漏和分布偏移问题,提升解释保真度。

详情
AI中文摘要

图神经网络(GNN)在涉及图结构数据的各种应用中表现出卓越性能,尤其是在高风险领域。然而,其决策过程的不透明性限制了可信度和更广泛的采用。现有的事后解释方法通过识别影响GNN预测的子图来提高可解释性,并采用混合策略来缓解使用子图进行预测时引起的分布外(OOD)问题。然而,这些方法通常依赖软掩码,其本质上无法完全消除标签无关信息,允许冗余结构泄漏到混合过程中,阻碍OOD问题的解决,从而降低解释保真度。在本文中,我们提出HPME,一个基于广义图信息瓶颈的硬扰动混合解释框架,利用图池化提取离散解释子图,并产生信息容量界限以彻底压缩标签无关组件。此外,我们引入了一种基于结构级替换的新型混合策略,生成分布内解释以有效缓解分布偏移。在多种任务上的大量实验表明,HPME在合成和真实数据集上生成鲁棒且可解释的解释方面达到了最先进的性能。

英文摘要

Graph Neural Networks (GNNs) have demonstrated remarkable performance across a range of applications involving graph-structured data, particularly in high-stakes domains. However, the opaque nature of their decision-making processes limits their trustworthiness and broader adoption. Existing post-hoc explanation methods aim to improve explainability by identifying subgraphs that influence GNN predictions and adopt mixup strategies to alleviate the out-of-distribution (OOD) issue caused by using subgraphs for prediction. Yet, these approaches typically rely on soft masks, which are inherently unable to fully eliminate label-irrelevant information, allowing redundant structures to leak into the mixup process and hindering the resolution of the OOD problem, thereby degrading explanation fidelity. In this work, we propose HPME, a Hard-Perturbation Mixup Explanation framework grounded in a generalized Graph Information Bottleneck, which leverages graph pooling to extract discrete explanatory subgraphs and to yield an information-capacity bound to thoroughly compress label-irrelevant components. Furthermore, we introduce a novel mixup strategy built upon structure-level replacement, generating in-distribution explanations to effectively mitigate the distribution shift. Extensive experiments on diverse tasks demonstrate that HPME achieves state-of-the-art performance in generating robust and interpretable explanations across both synthetic and real-world datasets.

2606.05754 2026-06-05 cs.SD cs.AI eess.AS 版本更新

Sagnac-Assisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework

Sagnac辅助增强型OTDR分布式声学传感:标准化基准与工程评估框架

Weiguang Wang, Fugen Wu, Hailing Wang, Xuechen Liang, Xiaobin Li, Ru Han, Tianchang Xie

发表机构 * East China Jiaotong University(华东交通大学) School of Materials and Energy, Guangdong University of Technology(广东工业大学材料与能源学院) Jiangxi Tonghui Technology Group Co., Ltd.(江西 Tonghui 技术集团有限公司) School of Artificial Intelligence and Big Data, Guangzhou Vocational University of Science and Technology(广州科学技术职业大学人工智能与大数据学院)

AI总结 提出一种Sagnac辅助增强型ϕ-OTDR传感架构和标准化基准框架,通过双分支融合模型在10公里光纤上实现89.79%准确率和5.00%虚警率,解决了偏振衰落和干扰问题。

详情
AI中文摘要

相位敏感光时域反射计(ϕ-OTDR)因其在大距离上提供分布式时空监测能力,被广泛应用于大规模分布式声学传感(DAS)。然而,其现场性能仍可能因偏振诱导衰落(PIF)、局部信号退化和强环境干扰而恶化。本研究开发了一种Sagnac辅助增强型ϕ-OTDR传感架构和面向工程的DAS事件识别标准化基准框架。Sagnac干涉仪提供连续相位响应,补充了ϕ-OTDR通道中易衰落的观测值,并通过在FPGA平台上实现的互相关过程实现异构信号对齐。该基准协议在一致的数据划分、预处理和度量定义下,比较了传统特征工程方法、概率浅层分类器、单分支深度模型和双分支融合模型。在10公里传感光纤上进行的六类代表性声学事件实验表明,双分支融合模型在评估方法中提供了最有利的权衡,在平衡测试集上达到89.79%的准确率、89.83%的宏F1值和5.00%的虚警率。结果还表明,通道分组对双分支评估影响显著,表明面向部署的结论应基于准确率、宏F1、虚警率、漏报率和延迟,而非仅凭准确率。这项工作为基于ϕ-OTDR的DAS提供了一种物理驱动的增强策略,并为未来面向融合的传感研究提供了可复现的基准协议。用于复现DAS事件识别实验的实现和脚本可在https://github.com/wawa-abc/das公开获取。

英文摘要

Phase-sensitive optical time-domain reflectometry (ϕ-OTDR) is widely used in large-scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization-induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac-assisted enhanced ϕ-OTDR sensing architecture and a standardized benchmark framework for engineering-oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading-prone observations in the ϕ-OTDR channel, and heterogeneous signal alignment is achieved using a cross-correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature-engineering methods, probabilistic shallow classifiers, single-branch deep models, and dual-branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10-km sensing fiber with six representative acoustic event classes show that the dual-branch fusion model provides the most favorable trade-off among the evaluated methods, reaching 89.79% accuracy, 89.83% macro-F1, and a nuisance alarm rate of 5.00% on the balanced test set. The results also show that channel grouping strongly affects dual-branch evaluation, indicating that deployment-oriented conclusions should be based on accuracy, macro-F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for ϕ-OTDR-based DAS and a reproducible benchmark protocol for future fusion-oriented sensing research. The implementation and scripts for reproducing the DAS event-recognition experiments are publicly available at https://github.com/wawa-abc/das.

2606.05749 2026-06-05 cs.CL cs.AI 版本更新

MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

MARDoc:面向多模态长文档问答的记忆感知精炼智能体框架

Kaifeng Chen, Hongtao Liu, Qiyao Peng, Jian Yang, Yongqiang Liu, Xiaochen Zhang, Qing Yang

发表机构 * Tianjin University(天津大学) Qifu Technology(启福科技) Beihang University(北航) Jiangnan University(江南大学)

AI总结 提出MARDoc框架,通过解耦为探索、精炼和反思三个智能体,并利用结构化记忆替代完整交互历史,减少上下文噪声,提升多模态长文档问答性能。

详情
AI中文摘要

迭代检索-推理智能体近期在多模态长文档问答中展现出潜力。然而,现有系统大多维护一个不断增长的单一上下文,混合了检索轨迹、观察和中间推理。随着交互积累,关键证据变得分散和稀释,使多跳推理变得嘈杂。我们提出MARDoc,一个记忆感知精炼智能体框架,将长文档问答解耦为三个专门智能体:探索者负责多粒度多模态检索,精炼者负责将交互轨迹蒸馏为结构化证据和推理记忆,反思者负责检查证据充分性并提供针对性反馈。在迭代过程中,智能体依赖动态更新的结构化记忆,而非完整的累积交互历史。这种设计减少了上下文噪声,同时保留了答案关键事实及其逻辑依赖。在MMLongBench-Doc和DocBench上的实验表明,MARDoc取得了强劲结果,优于同骨干基线,并证明了结构化记忆在智能体文档问答中的有效性。

英文摘要

Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.

2606.05748 2026-06-05 cs.MM cs.AI cs.CL 版本更新

UNIVID: Unified Vision-Language Model for Video Moderation

UNIVID:用于视频审核的统一视觉语言模型

Kejuan Yang, Yizhuo Zhang, Mingyuan Du, Yue Zhang, Dixin Zheng, Kaili Zhao, Yang Xiao, Hanzhong Liang, Kenan Xiao

发表机构 * Bytedance(字节跳动)

AI总结 提出UNIVID统一视觉语言模型,通过生成可解释的策略感知字幕,实现端到端视频审核,减少违规泄露42.7%和过度审核率37.0%。

Comments 7 pages, 3 figures. Accepted to ACL 2026 Industry Track

详情
AI中文摘要

全球规模的视频审核面临双重挑战:需要细粒度的多模态推理以及可解释的输出以支持下游执法。传统的审核系统通常依赖于难以维护且缺乏透明度的碎片化黑盒分类器。在本文中,我们提出了UNIVID,一种用于视频审核的统一视觉语言模型。与标准分类模型不同,UNIVID生成策略感知的字幕,作为可解释的中间表示,实现人类可验证的决策和多任务可重用性。尽管现有的开源和商业VLM通常存在安全护栏拒绝问题,并且缺乏细粒度的策略对齐,我们开发了一种专门的训练数据配方,结合专家人工精炼的标签和合成数据,使模型与我们的安全指南对齐。通过将UNIVID作为核心字幕生成器,我们设计了一种新颖的端到端视频审核系统,相对减少了42.7%的违规泄露和37.0%的过度审核率。同时,通过用单个UNIVID骨干替换超过1000个策略特定模型,我们回收了大量计算资源,同时减少了工程维护开销。据我们所知,这是首批关于高效字幕生成VLM成功支持工业规模审核和跨职能业务的报告之一。

英文摘要

Global-scale video moderation faces a dual challenge: the need for fine-grained multi-modal reasoning and the demand for interpretable outputs to support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are difficult to maintain and lack transparency. In this paper, we present UNIVID, a UNIfied VIsion-language model for video moDeration. Unlike standard classification models, UNIVID generates policy-aware captions that serve as an interpretable intermediate representation, enabling human-verifiable decisions and multi-task reusability. Because existing open-source and commercial VLMs often suffer from safety-guardrail refusals and lack fine-grained policy alignment, we develop a specialized training data recipe that combines expert human-refined labels with synthetic data to align the model with our safety guidelines. By integrating UNIVID as the core captioner, we design a novel end-to-end video moderation system that reduces violation leakage by a relative 42.7% and the overkill rate by a relative 37.0%. Meanwhile, by replacing over 1,000 policy-specific models with a single UNIVID backbone, we reclaim substantial computation resources while reducing engineering maintenance overhead. To our knowledge, this is one of the first reports of a high-efficiency captioning VLM successfully supporting industrial-scale moderation and cross-functional business.

2606.05740 2026-06-05 cs.AI 版本更新

Class-Specific Branch Attention for Mitigating Gradient Interference under Class Imbalance

类别特定分支注意力用于缓解类别不平衡下的梯度干扰

Arush Singhal, Umang Soni

发表机构 * Thapar Institute of Engineering and Technology(泰帕理工学院) Netaji Subhash University of Technology(内塔吉·苏巴斯理工大学)

AI总结 本文通过引入梯度冲突矩阵诊断框架,提出类别特定分支注意力(CSBA)机制,通过分支特定的通道重加权减少梯度耦合,从而缓解深度神经网络在类别不平衡训练中多数类梯度抑制少数类学习的问题。

Comments 14 pages, 4 figures, 13 tables

详情
AI中文摘要

在严重类别不平衡下训练的深度神经网络通常表现出性能下降,这通常归因于统计偏差。在这项工作中,我们识别了一个互补的优化层面病理:共享表示中的类间梯度干扰,其中多数类的梯度抑制了少数类的学习。为了分析这一现象,我们引入了一个基于逐层梯度流分析和梯度冲突矩阵的诊断框架,该矩阵通过类特定梯度之间的余弦相似度量化干扰。利用该框架,我们研究了多分支卷积架构,并提出了一种轻量级修改——类别特定分支注意力(CSBA),它能够实现分支特定的通道重加权以减少梯度耦合。该机制促进了跨分支的隐式特征解耦,同时保持了架构的简洁性。实验上,CSBA提高了少数类的性能,在严重不平衡下将Physical-Damage类的F1分数从0.261提高到0.522,同时保持了可比的整体准确率。在CIFAR-10-LT上的验证确认了这种行为在不平衡视觉识别设置中的泛化性,Macro-F1从0.595提高到0.655。更广泛地说,我们的发现强调了在为不平衡学习设计架构时,考虑优化动态与统计方法的重要性。

英文摘要

Deep neural networks trained under severe class imbalance often exhibit degraded performance, typically attributed to statistical bias. In this work, we identify a complementary optimization-level pathology: inter-class gradient interference within shared representations, where gradients from majority classes suppress minority-class learning. To analyze this phenomenon, we introduce a diagnostic framework based on layer-wise gradient flow analysis and a Gradient Conflict Matrix, which quantifies interference using cosine similarity between class-specific gradients. Using this framework, we study multi-branch convolutional architectures and propose a lightweight modification, Class-Specific Branch Attention (CSBA), that enables branch-specific channel reweighting to reduce gradient coupling. This mechanism promotes implicit feature decoupling across branches while preserving architectural simplicity. Empirically, CSBA improves minority-class performance, increasing the F1 score for the Physical-Damage class from 0.261 to 0.522 under severe imbalance, while maintaining comparable overall accuracy. Validation on CIFAR-10-LT confirms that this behavior generalizes across imbalanced visual recognition settings, with Macro-F1 improving from 0.595 to 0.655. More broadly, our findings highlight the importance of considering optimization dynamics alongside statistical methods when designing architectures for imbalanced learning.
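The Gradient Conflict Matrix is described only as cosine similarity between class-specific gradients, so the following is a minimal sketch under that description rather than the authors' implementation. The function names, the class labels, and the two-dimensional gradient vectors are hypothetical; in practice each vector would be a flattened per-class gradient of a shared layer.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def gradient_conflict_matrix(class_grads):
    """Pairwise cosine similarity between per-class gradient vectors.
    Negative entries mark class pairs whose updates pull in opposing directions."""
    names = list(class_grads)
    return {a: {b: cosine(class_grads[a], class_grads[b]) for b in names}
            for a in names}


# Hypothetical flattened gradients of one shared layer, one vector per class.
grads = {
    "majority": [1.0, 0.0],
    "minority": [-1.0, 0.0],
    "other": [0.0, 2.0],
}
gcm = gradient_conflict_matrix(grads)
```

Here the majority and minority gradients point in opposite directions, so their entry is negative, while the orthogonal pair scores zero.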

2606.05737 2026-06-05 cs.CV cs.AI cs.LG cs.RO 版本更新

Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

让它简单:视觉-语言-动作模型的单步动作生成

Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Innovation Institute(上海创新研究院) Fudan University(复旦大学)

AI总结 针对视觉-语言-动作(VLA)模型,提出将训练时间分布偏向高噪声状态,实现无需教师模型、蒸馏或辅助目标的单步动作生成,性能可匹配十步解码。

Comments 20 pages, 10 figures

详情
AI中文摘要

基于扩散的视觉-语言-动作(VLA)模型通常继承图像生成的观点:动作通过迭代去噪生成。我们认为VLA动作生成具有不同的条件-目标结构:策略以丰富的观测、语言和状态为条件,但仅预测紧凑的低维动作块。在这种不对称性下,强单步动作生成不一定需要为图像合成开发的先进单步方法。我们保持标准速度预测,不添加教师模型、蒸馏阶段或辅助目标;在我们的主要方案中,我们简单地将训练时间分布偏向高噪声状态。我们首先在受控的MNIST网格到序列任务中隔离效果,然后通过广泛的机器人策略实验进行测试。在标准LIBERO、LIBERO-Plus和LIBERO-Pro上,使用高噪声偏置调度训练的单步策略通常匹配相同方案下的十步解码,并且在标准LIBERO上可以超过使用均匀时间分布训练的十步策略。真实机器人双臂YAM RSS评估提供了相同采样器趋势的小样本跨架构检查。在具有30M动作头的1.4B VLM模型上,单步解码在LIBERO-Long上达到95.6%。这些结果表明,强单步VLA动作生成可以从标准扩散训练中涌现,而无需引入为图像生成开发的完整少步扩散机制。

英文摘要

Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
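The abstract does not say which biased time distribution is used, so the sketch below is only one possible way to bias training times toward high-noise states. It assumes times lie in [0, 1] with t = 1 meaning pure noise (conventions differ between papers) and uses a power transform of a uniform draw; the function name and the `bias` parameter are hypothetical.

```python
import random


def sample_time(rng, bias=3.0):
    """Draw a training time in [0, 1).
    bias = 1 gives the uniform distribution; bias > 1 has density
    bias * t**(bias - 1), which puts more mass near t = 1 (assumed high noise)."""
    u = rng.random()
    return u ** (1.0 / bias)


rng = random.Random(0)
n = 10_000
uniform_mean = sum(sample_time(rng, bias=1.0) for _ in range(n)) / n
biased_mean = sum(sample_time(rng, bias=3.0) for _ in range(n)) / n
```

Under this transform the expected time is bias / (bias + 1), so with bias = 3 the average sampled time shifts from about 0.5 to about 0.75.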

2606.05734 2026-06-05 cs.AI cs.CL 版本更新

When AI Says It Feels

当AI说它感觉

Shin-nosuke Ishikawa, Seiya Ikeda, Hirotsugu Ohba

发表机构 * Graduate School of Artificial Intelligence and Science, Rikkyo University(立教大学人工智能与科学研究生院) AI Technical Sector, Mamezo Co., Ltd.(Mamezo公司人工智能技术部门) AI Consulting Division, Mamezo Co., Ltd.(Mamezo公司人工智能咨询部门)

AI总结 通过自奖励强化学习(GRPO)鼓励大语言模型表达情感、意图和自我意识,并评估其对多种任务性能的影响。

Comments 15 pages, 2 figures

详情
AI中文摘要

大型语言模型(LLMs)通常通过后训练过程中的人类偏好对齐来限制其表达情感。这种策略采用自上而下的方法设计,可能与使用人类生成文本训练模型展现类人智能的目标相冲突。在这里,我们进行了一项名为“类人模型情感表达”(HMX-feel)的实验,其中通过自奖励强化学习鼓励LLMs表达情感、意图和自我意识。我们使用基于评分标准的自奖励训练方案与组相对策略优化(GRPO)成功增强了这些能力。通过将训练后的模型与对比训练模型进行比较,我们研究了这种方法对各种任务性能的影响。总体而言,我们从多个角度进行了广泛评估,并识别出增强、退化或无明显变化的能力。类人训练的模型在应对谄媚诱导问题和消歧条件下的偏见时表现出鲁棒性,但观察到在真实性问答能力上有所退化。该实验结果表明,在采取适当措施的前提下,未来有可能开发出能够表达情感的AI系统。

英文摘要

Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.

2606.05728 2026-06-05 cs.AI cs.CL 版本更新

DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance

DiG-Plan:通过扩散引导缓解工具图规划中的早期承诺问题

Yansi Li, Zhuosheng Zhang

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院)

AI总结 针对工具图规划中自回归解码的早期承诺问题,提出基于扩散生成器与自回归精炼器解耦的DiG-Plan框架,显著提升组合搜索覆盖率和任务性能。

Comments Accepted at IJCAI-ECAI 2026. This is an author preprint; the final version will appear in the IJCAI Proceedings

详情
AI中文摘要

生成可执行的工具计划需要从工具库中选择合适的子集,这是一个解空间呈指数级增长的组合搜索问题。然而,我们发现了主流方法中的一个关键错位:标准自回归(AR)解码存在早期承诺问题,即初始令牌选择会严格约束搜索轨迹。一项受控研究表明,在计算量匹配的条件下,掩码去噪将Pass@10解覆盖率从0.320提升至0.943(相对于AR采样)。受此启发,我们提出了DiG-Plan,一个将组合探索与结构精炼解耦的框架。DiG-Plan采用基于扩散的提议器,通过迭代精炼生成多样化的工具集,随后使用AR精炼器进行依赖关系预测。在TaskBench上,DiG-Plan相比AR基线提升了10%的相对性能,在复杂组合任务上增益最大;API-Bank的结果表明,提议-精炼-选择设计在不同领域均有效。代码已开源:https://github.com/puddingyeah/DiG-Plan。

英文摘要

Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. DiG-Plan employs a diffusion-based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API-Bank results show that the propose-refine-select design remains effective across domains. Code is available at https://github.com/puddingyeah/DiG-Plan.

2606.05724 2026-06-05 cs.CL cs.AI 版本更新

Narrative Knowledge Weaver: Narrative-Centric Retrieval-Augmented Reasoning for Long-Form Text Understanding

叙事知识编织器:面向长文本理解的叙事中心检索增强推理

Qiuyu Tian, Fengyi Chen, Yiding Li, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang, Yingce Xia, Zequn Liu

发表机构 * Southeast University(东南大学) Beijing Zhongguancun Academy(北京中关村学院) Nanjing Normal University(南京师范大学) ZhuiWen Technology Co., Ltd.(智文科技有限公司)

AI总结 提出叙事知识编织器(NKW),一种基于源头的框架,通过将文本证据、原子事实、规范图结构、实体档案、交互、情节和故事线对齐,并利用文本、图和叙事工具进行后检索阅读,以解决长文本叙事QA中需要推理演化故事世界的问题,在STAGE、FairytaleQA和QuALITY上表现优异。

详情
AI中文摘要

长文本叙事问答需要对不断演化的故事世界进行推理,而非孤立的段落:答案可能依赖于早期的目标、变化的角色状态、社会关系、因果触发因素、时间位置以及后续后果。现有的检索和图增强生成方法改善了证据访问,但其单元——块、实体、关系、摘要或工具动作——并未直接编码证据在故事中的功能。我们引入了叙事知识编织器(NKW),一种基于源头的框架,将文本证据、原子事实、规范图结构、实体档案、交互、情节和故事线对齐。在查询时,NKW使用文本、图和叙事工具以及后检索阅读技能来组装证据,并审计角色、范围、极性、状态和时间约束。在STAGE、FairytaleQA和QuALITY上,NKW在剧本级故事世界问答中表现最强,同时在更以段落为中心的基准上保持竞争力。消融实验、问题类型分析、图资产统计和案例研究显示了对角色、场景、时间、因果和叙事进展推理的互补优势。

英文摘要

Long-form narrative QA requires reasoning over evolving story worlds rather than isolated passages: answers may depend on earlier goals, changing character states, social relations, causal triggers, temporal position, and later consequences. Existing retrieval and graph-augmented generation methods improve evidence access, but their units--chunks, entities, relations, summaries, or tool actions--do not directly encode how evidence functions in a story. We introduce Narrative Knowledge Weaver (NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW uses text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoning.

2606.05720 2026-06-05 cs.SE cs.AI 版本更新

Microskill Architecture: A Modular Skill-Driven Framework for AI-Native Code Generation

微技能架构:一种面向AI原生代码生成的模块化技能驱动框架

Mohammad Zare, Omid Abdolrahmani

发表机构 * Artificial Intelligence Laboratory at AriooBarzan(AriooBarzan人工智能实验室) Engineering Team, Shiraz, Iran(伊朗设拉子工程团队)

AI总结 本文提出微技能架构,通过将知识封装为原子技能胶囊并动态选择相关胶囊,解决AI代码生成中的上下文窗口管理问题,显著降低token消耗、提高编译成功率并消除架构违规。

详情
AI中文摘要

大型语言模型和AI编码代理已经重塑了软件开发,但完全AI原生系统的路径面临结构性挑战。其中最主要的是在保持准确性和效率的同时管理上下文窗口。当开发者将完整的项目文档和代码注入模型内存时,模型会丢失序列中间的信息,token成本激增,架构发生漂移。本文提出微技能架构:一种受微服务启发的模块化设计范式,应用于知识封装而非服务分解。该架构不是将整个代码库提供给代理,而是将知识划分为原子化、范围明确的技能胶囊,并由动态路由器仅选择语义相关的胶囊来执行任务。我们将上下文分配形式化为在token预算约束下基于语义相关性的约束优化。一个针对具有十五个复杂特性的企业内容管理系统的实证案例研究表明,微技能将token消耗降低了90%以上,首次尝试编译成功率几乎翻倍,完全消除了架构违规,并通过自学习机制实现了七个新技能胶囊的自主提取和注册。这些发现表明,微技能架构为构建更高效、更可靠且能够随时间演进的AI原生开发系统提供了可扩展的基础。

英文摘要

Large language models and AI coding agents have reshaped software development, but the path to fully AI-native systems faces structural challenges. Chief among them is managing context windows without losing accuracy or efficiency. When developers inject full project documentation and code into a model's memory, the model loses mid-sequence information, token costs spiral, and architecture drifts. This paper presents MicroSkill Architecture: a modular design paradigm inspired by microservices, applied to knowledge encapsulation instead of service decomposition. Instead of feeding an agent the entire codebase, the architecture partitions knowledge into atomic, sharply scoped skill capsules, and a dynamic router selects only semantically relevant capsules for the task. We formally model context allocation as constrained optimization over semantic relevance subject to a token budget. An empirical case study on an enterprise content management system with fifteen complex features shows that MicroSkill cuts token consumption by over 90%, nearly doubles first-try compilation success rates, eliminates architectural violations entirely, and enables autonomous extraction and registration of seven new skill capsules via a self-learning mechanism. These findings suggest MicroSkill Architecture offers a scalable foundation for building AI-native development systems that are more efficient, more reliable, and capable of evolving over time.
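Selecting capsules by semantic relevance under a token budget is a knapsack-style problem. The abstract does not state how the router solves it, so the sketch below substitutes a simple greedy heuristic (highest relevance per token first) purely for illustration. The `Capsule` type, the relevance scores, and the token costs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capsule:
    name: str
    relevance: float  # hypothetical semantic-relevance score for the current task
    tokens: int       # token cost of loading the capsule (must be positive)


def select_capsules(capsules, budget):
    """Greedy heuristic: take capsules in order of relevance per token,
    skipping any capsule that would exceed the token budget."""
    chosen, used = [], 0
    ranked = sorted(capsules, key=lambda c: c.relevance / c.tokens, reverse=True)
    for capsule in ranked:
        if used + capsule.tokens <= budget:
            chosen.append(capsule)
            used += capsule.tokens
    return chosen, used


capsules = [
    Capsule("auth", 0.9, 300),
    Capsule("routing", 0.6, 400),
    Capsule("billing", 0.1, 500),
    Capsule("logging", 0.3, 100),
]
chosen, used = select_capsules(capsules, budget=800)
```

With an 800-token budget the low-relevance, high-cost capsule is left out while the other three fit exactly. A greedy pass is not guaranteed to be optimal for knapsack problems in general, so an exact solver may be preferable when capsule counts are small.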

2606.05718 2026-06-05 cs.CV cs.AI cs.LG 版本更新

ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

ViCuR: 视觉线索作为多模态在策略蒸馏中的可恢复特权

Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang

Affiliations * Shanghai AI Laboratory; Fudan University; Nanjing University

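AI summary: Proposes ViCuR, a framework that replaces answer-side teacher privilege with visual cues from the input and adds a lightweight cue recovery module, addressing the train-test mismatch in multimodal on-policy distillation and improving student performance on seven benchmarks.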

Comments 25 pages, 11 figures. Preprint, under review

详情
AI中文摘要

在策略蒸馏(OPD)通过在教师监督下,对学生自身策略采样的轨迹进行训练来改进推理。在多模态推理中,一种常见的扩展是使用特权教师,该教师观察仅在训练时可用的信号,如参考答案或理由。然而,这种答案侧特权造成了训练-测试不匹配:教师的监督可能依赖于学生无法获得的信号,鼓励捷径模仿而非基于视觉的推理。我们提出ViCuR,一种基于视觉的特权教师蒸馏框架,用视觉线索(输入中与查询相关的证据)取代答案侧特权。由于这些线索来源于推理时可用的相同视觉输入,它们的证据可由学生恢复。为此,ViCuR引入了一个轻量级线索恢复模块,在预填充期间使用专用的汇点令牌交叉注意力,将任务相关的视觉证据聚合到内部表示中,而不改变推理接口或需要辅助的线索生成损失。在七个基准上,使用Qwen3-VL-2B和8B学生,ViCuR在总体平均性能上持续优于基于答案的在策略自蒸馏,分别提升+1.19和+1.24。它还能自然地扩展到更强的教师OPD,超越OPD基线+0.64和+1.08,并在8B规模上具有一致的域外增益。这些结果表明,在多模态在策略蒸馏中,教师特权的设计与教师强度同等重要。

英文摘要

On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.

2606.05710 2026-06-05 cs.CR cs.AI 版本更新

Explainable AI-Driven Cyber Risk Analytics and Model Reliability Assessment for Intelligent Governance of U.S. Critical Infrastructure: An XGBoost and SHAP-Based Intrusion Detection Framework

面向美国关键基础设施智能治理的可解释AI驱动的网络风险分析与模型可靠性评估:基于XGBoost和SHAP的入侵检测框架

B. M. Taslimul Haque, Md. Arifur Rahman, Md. Serajul Kabir Chowdhury Rubel, Md. Iqbal Hossan

Affiliations * Department of Business Information Systems, Central Michigan University; Department of Information Studies, Trine University; Department of Computer Science, Maharishi International University

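AI summary: Targeting cyber threats to U.S. critical infrastructure, proposes an intrusion detection and cyber risk prediction framework that combines machine learning classifiers such as XGBoost and Random Forest with explainable AI (XAI) techniques, and validates model performance and reliability on the CICIDS2017 dataset.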

Comments 20 pages, 8 figures, empirical research article, CICIDS2017 dataset, XGBoost, Random Forest, Decision Tree, Logistic Regression, SHAP explainability analysis, cyber risk analytics, intrusion detection, critical infrastructure cybersecurity, model reliability assessment

详情
Journal ref
Applied IT & Engineering, 2(1), 1-20, 2024
AI中文摘要

美国关键基础设施领域智能数字技术的日益渗透极大地增加了面对高级网络对手和运营漏洞的风险。AI驱动的治理和自动化决策系统正成为关键基础设施系统(包括能源、医疗、交通、金融服务和通信基础设施)运行的关键部分,以提高效率和战略管理。不断增长的网络威胁环境,如分布式拒绝服务(DDoS)攻击、僵尸网络、勒索软件和高级持续性威胁(APT),对基础设施韧性、网络安全可靠性和治理可信度构成了重大挑战。在不断变化的攻击态势和动态网络环境中,传统的网络安全机制往往无法满足不断变化的需求和保护关键系统。本研究将开发一个弹性网络风险分析和模型可靠性评估框架,以支持美国关键基础设施环境中网络风险暴露的智能治理和决策支持。本研究基于CICIDS2017数据集,用于开发和测试基于机器学习的入侵检测系统模型和网络风险预测模型。使用XGBoost、随机森林和决策树等多种分类器来检测网络上的恶意活动并确定网络风险水平。此外,集成了可解释人工智能(XAI)技术,以增强网络安全决策过程的透明度、可解释性和信任度。所提出的框架通过多种性能指标(如准确率、精确率、召回率、F1分数、ROC-AUC和假阳性率)展示了模型的可靠性和韧性。

英文摘要

The increasing penetration of intelligent digital technologies into the critical infrastructure sector in the United States has greatly increased exposure to advanced cyber adversaries and operational vulnerabilities. AI-powered governance and automated decision-making systems are becoming a key part of the operation of critical infrastructure systems, including energy, healthcare, transportation, financial services, and communication infrastructure, in order to improve efficiency and strategic management. The growing cyber threat environment, including Distributed Denial of Service (DDoS) attacks, botnets, ransomware, and Advanced Persistent Threats (APTs), poses significant challenges to infrastructure resilience, cybersecurity reliability, and governance trustworthiness. In a changing attack landscape and dynamic network environment, traditional cybersecurity mechanisms often fall short of meeting evolving needs and protecting critical systems. This study develops a resilient cyber risk analytics and model reliability assessment framework to support intelligent governance and decision support for cyber risk exposure in the U.S. critical infrastructure environment. It uses the CICIDS2017 dataset to develop and test machine-learning-based intrusion detection and cyber risk prediction models. Classifiers such as XGBoost, Random Forest, and Decision Tree are used to detect malicious network activity and determine the level of cyber risk. Furthermore, Explainable Artificial Intelligence (XAI) techniques are integrated to enhance transparency, interpretability, and trust in cybersecurity decision-making processes. The proposed framework demonstrates the reliability and resilience of the model through performance measures such as accuracy, precision, recall, F1 score, ROC-AUC, and false positive rate.

2606.05704 2026-06-05 cs.AI cs.LG 版本更新

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

基于评论的异构多智能体推理用于可靠的数学问题求解

Muhammad Talha Sharif, Abdul Rehman

Affiliations * University of Science and Technology of China

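AI summary: Proposes a critic-based heterogeneous multi-agent framework whose generator-validator structure and adaptive learning system use intermediate feedback to assess and guide reasoning, achieving up to 13% higher accuracy on GSM8K and reducing reliance on large models.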

Comments 6 pages

详情
AI中文摘要

近期的大语言模型(LLMs)展示了令人印象深刻的推理能力;但在复杂数学推理问题中,它们仍然容易产生幻觉、中间推理错误以及不可靠的推理结果。在本研究中,我们引入了一种基于评论的异构多智能体方法,以提高数学推理的可靠性。该框架整合了多个不同专长的LLM智能体,并采用评论驱动的自适应学习系统,基于中间反馈评估和引导推理过程。系统采用生成器-验证器框架,验证器不仅判断正确性,还提供评论以指导解决方案的重新生成。这允许自适应错误纠正并防止错误级联。我们在GSM8K基准上的实验表明,所提方法相比单次和非评论模型实现了高达13%的准确率提升。此外,研究结果表明,异构性和评论减少了对大模型的需求,使较小模型也能达到相当的性能。消融研究显示,主要性能提升归因于基于评论的反馈循环,而非模型大小。总之,所提方法展示了结合异构多智能体协作与评论以获得可靠且可解释推理系统的优势。

英文摘要

Recent Large Language Models (LLMs) have shown impressive reasoning abilities, but they remain susceptible to hallucinations, intermediate reasoning mistakes, and unreliable results on complex mathematical reasoning problems. In this study, we introduce a critic-based heterogeneous multi-agent approach to improve the dependability of mathematical reasoning. The framework incorporates several LLM agents with different specialties and employs a critic-driven adaptive learning system to assess and guide the reasoning process based on intermediate feedback. The system adopts a generator-validator framework in which the validator not only determines correctness but also offers critiques to guide regeneration of solutions. This allows adaptive error correction and prevents error cascading. Our experiments on the GSM8K benchmark show that the proposed method achieves up to a 13% accuracy improvement over single-shot and non-critic models. Additionally, the findings suggest that heterogeneity and critique reduce the need for large models, allowing smaller models to perform on par. Ablation studies show that the main performance gains come from the critic-based feedback loop rather than model size. In summary, the proposed approach shows the benefits of combining heterogeneous multi-agent collaboration with critique to obtain reliable and interpretable reasoning systems.
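The generator-validator loop described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the paper's system: `solve_with_critic`, `toy_generate`, and `toy_validate` are hypothetical names, and the toy functions stand in for LLM calls so the example runs without any model.

```python
def solve_with_critic(problem, generate, validate, max_rounds=3):
    """Regenerate an answer until the validator accepts it or the round limit is hit.

    generate(problem, critique) -> answer
    validate(problem, answer)   -> (accepted, critique)
    Returns (answer, rounds_used, accepted).
    """
    critique, answer = None, None
    for round_no in range(1, max_rounds + 1):
        answer = generate(problem, critique)
        accepted, critique = validate(problem, answer)
        if accepted:
            return answer, round_no, True
    return answer, max_rounds, False


def toy_generate(problem, critique):
    # Stand-in for a generator agent: the first attempt drops the carry,
    # and the attempt made after receiving a critique is correct.
    left, right = (int(part) for part in problem.split("+"))
    return left + right - 10 if critique is None else left + right


def toy_validate(problem, answer):
    # Stand-in for a validator agent: checks the sum and returns a critique on failure.
    left, right = (int(part) for part in problem.split("+"))
    if answer == left + right:
        return True, None
    return False, "Re-check the carry in the tens column."


result = solve_with_critic("17 + 25", toy_generate, toy_validate)
print(result)
```

Passing the critique back into the generator, rather than only a pass/fail flag, is the part that distinguishes this loop from plain retry-until-correct sampling.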

2606.05702 2026-06-05 cs.AI cs.CV 版本更新

Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Seeing Time: 视觉-语言模型中的时间顺序推理与捷径偏差基准测试

Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing, Ziqi Xu, Juncheng Hu, Xikun Zhang, Renqiang Luo

Affiliations * College of Computer Science and Technology, Jilin University; College of Computing and Data Science, Nanyang Technological University; School of Computer Science, Wuhan University; School of Computing Technologies, RMIT University

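AI summary: Introduces a new benchmark with three specialized datasets for evaluating how vision-language models reason about chronology within and across images, and shows that models often exploit superficial cues such as color rather than genuine temporal features.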

详情
AI中文摘要

近期视觉-语言模型(VLM)在解释复杂视觉语义方面取得了显著进展,但其时间顺序推理能力仍未得到充分探索。本文引入了一个新颖的基准,专门用于评估VLM如何感知和推理图像内及跨图像的时间顺序信息。与现有基于视频的基准(侧重于帧序列)不同,我们的工作深入探讨了时间判断的基本逻辑以及向多模态集成的扩展。为此,我们构建了三个专门数据集:一个包含跨越长时间历史周期的视觉相似物体,另一个按不同事件和物体类型分类,第三个将图像与时间敏感的新闻文本配对以实现跨模态对齐。通过大量实验,我们分析了模型是否在不同类别间表现出性能差异,并关键地探讨了它们是否依赖“错误捷径”(如图像颜色而非真正的时间特征)。我们的结果表明,尽管VLM显示出潜力,但它们经常利用灰度与彩色滤镜等表面线索来绕过真正的时间顺序推理。通过提供这些高质量数据集和严格的评估框架,我们提供了一个诊断工具,用于识别当前局限性并指导开发更稳健、逻辑更严密的多模态模型。源代码见 https://github.com/LuoRenqiang/ChronoVision。

英文摘要

Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on "incorrect shortcuts", such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is available at https://github.com/LuoRenqiang/ChronoVision.

2606.05701 2026-06-05 cs.CR cs.AI 版本更新

Cognitive Threat Intelligence and Explainable Federated Security Analytics for distributed Infrastructure Systems

面向分布式基础设施系统的认知威胁情报与可解释联邦安全分析

Md. Arifur Rahman, B. M. Taslimul Haque, Md. Iqbal Hossan, Md. Serajul Kabir Chowdhury Rubel

Affiliations * Dept. of Information Studies, Trine University; Dept. of Business Information Systems, Central Michigan University; Dept. of CS, Maharishi International University

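AI summary: Proposes a framework that integrates federated learning, explainable AI, and cognitive cybersecurity analytics for collaborative, privacy-preserving threat detection in distributed infrastructure systems.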

Comments 22 pages, 10 figures, 1 conceptual framework diagram, 1 methodology workflow diagram, empirical study using NSL-KDD and CIC-IDS2017 datasets, Federated Learning, Explainable AI (SHAP, LIME), cybersecurity and intrusion detection framework

详情
Journal ref
International Journal of Research and Technology (IJRT), Volume 13, Issue 01, January-March 2025, pp. 132-151
AI中文摘要

分布式基础设施系统、云计算、物联网技术和边缘架构的日益普及显著扩大了网络安全攻击面,并引入了日益复杂的网络威胁。传统的集中式入侵检测方法在可扩展性、数据隐私、通信开销以及人工智能驱动决策过程的透明度方面常面临挑战。为解决这些限制,本文提出了一种面向分布式基础设施系统的认知威胁情报与可解释联邦安全分析框架。该框架集成了联邦学习、可解释人工智能和认知网络安全分析,能够在分布式网络环境中实现协作式且保护隐私的网络威胁检测。敏感原始网络流量数据不传输到集中式服务器,而是在分布式节点上独立训练本地安全模型,仅通过联邦聚合机制共享加密的模型参数和更新。这种去中心化学习架构在减少通信依赖和集中式安全风险的同时提高了隐私保护。为增强智能威胁分析,该框架采用了机器学习和深度学习算法,包括随机森林、XGBoost、自编码器、卷积神经网络和长短期记忆网络。此外,可解释人工智能技术(如SHAP和LIME)被集成以提供透明且可理解的威胁检测决策解释,从而增强安全分析师之间的信任和可操作性。在包括CICIDS2017、UNSW-NB15和CSE-CIC-IDS2018在内的多个基准网络入侵数据集上进行的实验评估表明,所提框架在检测准确率、精确率、召回率和F1分数方面优于传统集中式和现有联邦学习方法,同时确保数据隐私、通信效率和模型可解释性。

英文摘要

The increasing adoption of distributed infrastructure systems, cloud computing, Internet of Things (IoT) technologies, and edge-based architectures has significantly expanded the cybersecurity attack surface and introduced increasingly sophisticated cyber threats. Conventional centralized intrusion detection approaches often face challenges related to scalability, data privacy, communication overhead, and limited transparency in artificial intelligence-driven decision-making processes. To address these limitations, this study proposes a Cognitive Threat Intelligence and Explainable Federated Security Analytics framework for distributed infrastructure systems. The proposed framework integrates Federated Learning (FL), Explainable Artificial Intelligence (XAI), and cognitive cybersecurity analytics to enable collaborative and privacy-preserving cyber threat detection across distributed network environments. Instead of transmitting sensitive raw network traffic data to centralized servers, local security models are independently trained at distributed nodes, where only encrypted model parameters and updates are shared through a federated aggregation mechanism. This decentralized learning architecture improves privacy protection while reducing communication dependency and centralized security risks. To enhance intelligent threat analysis, the framework incorporates machine learning and deep learning algorithms including Random Forest, XGBoost, Autoencoder, Convolutional Neural Network, and Long Short-Term Memory models. In addition, explainable AI techniques such as SHAP and LIME are integrated to provide transparent and understandable explanations of threat-detection decisions, strengthening trust and actionability among security analysts. Experimental evaluations on multiple benchmark network-intrusion datasets, including CICIDS2017, UNSW-NB15, and CSE-CIC-IDS2018, show that the proposed framework outperforms traditional centralized and existing federated learning approaches in detection accuracy, precision, recall, and F1 score, while ensuring data privacy, communication efficiency, and model interpretability.

2606.05697 2026-06-05 cs.AI 版本更新

PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation

PerceptUI: 用于UI/UX评估的与人类对齐的合成用户的LLM智能体

Nicolas Bougie, Xiaotong Ye, Gian Maria Marconi, Narimasa Watanabe

Affiliations * Woven by Toyota

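AI summary: Proposes PerceptUI, a framework that uses contrastive reflection fine-tuning and reflective prompt evolution so that multimodal large language models can simulate how specific users answer interface questions, achieving UI/UX evaluation at a human-comparable level.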

详情
AI中文摘要

用户界面(UI)和用户体验(UX)评估是产品开发的核心,然而可靠的反馈仍然依赖于招募人类参与者或进行在线A/B测试,这使得早期迭代缓慢且成本高昂。鉴于此,最近的工作探索了将多模态大语言模型作为代理评估器。然而,现有方法要么产生表面层次的批评,要么产生反映模型自身偏见而非特定用户真实反应的判断。我们引入了PerceptUI,一个用于个性条件UI/UX评估的框架,它预测特定用户将如何回答与界面相关的问题,并生成自然语言的理由。PerceptUI分两个阶段训练:(i)对比反思微调通过从人类决策中提取经验来提炼教师生成的理由,以及(ii)从模型自身的失败轨迹中进行反思式提示进化。在多个领域和数据集上,PerceptUI达到了人类水平的逼真度,泛化到未见的问题和个性,并产生了群体水平的响应分布。

英文摘要

User interface (UI) and user experience (UX) evaluation is central to product development, yet reliable feedback still relies on recruiting human participants or running online A/B tests, making early-stage iteration slow and costly. In light of this, recent work has explored Multimodal Large Language Models as proxy evaluators. However, existing approaches produce either surface-level critiques or judgments that reflect the model's own biases rather than the genuine response of a particular user. We introduce PerceptUI, a framework for persona-conditioned UI/UX evaluation that predicts how a specific user would answer interface-related questions and produces natural-language rationales. PerceptUI is trained in two stages: (i) contrastive reflection fine-tuning, which distills teacher-generated rationales by extracting lessons from human decisions, and (ii) reflective prompt evolution, which learns from the model's own failure traces. Across multiple domains and datasets, PerceptUI achieves human-level realism, generalizes to unseen questions and personas, and yields population-level response distributions.

2606.05688 2026-06-05 cs.CL cs.AI 版本更新

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

面向路由一致性的混合专家模型量化的值与结构对齐

Hancheol Park, Geonho Lee, Tairen Piao, Tae-Ho Kim

Affiliations * Nota Inc., South Korea

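AI summary: Proposes VSRAQ, which uses two complementary objectives, value alignment and structure alignment, to keep expert selection consistent before and after quantization, reducing quantization-induced degradation with no inference overhead.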

Comments 8 pages, 1 figure

详情
AI中文摘要

混合专家(MoE)模型通过仅为每个token激活一部分专家来高效扩展基础模型,但大量的专家参数使得量化对于实际部署至关重要。然而,与密集模型不同,MoE模型对路由不稳定性敏感:小的量化引起的扰动可能改变top-$k$专家选择,改变计算路径并降低模型质量。我们提出了面向量化的值与结构路由对齐(VSRAQ),这是一种针对MoE的后训练量化目标,旨在量化下保持量化前的专家选择行为。VSRAQ结合了两个互补目标,共同保持专家选择行为:值对齐,匹配与路由相关的logits或分数;结构对齐,保持专家排序和top-$k$决策边界。通过维持路由一致性,VSRAQ减少了量化引起的性能下降,且不引入任何推理时开销,并可集成到现有量化框架中。在近期MoE基础模型上的实验表明,VSRAQ提高了专家选择一致性,并始终优于仅重建和考虑路由器的基线方法。

英文摘要

Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dense models, however, MoE models are sensitive to routing instability: small quantization-induced perturbations can change the top-$k$ expert selection, altering the computation path and degrading model quality. We propose Value-and-Structure Routing Alignment for Quantization (VSRAQ), an MoE-specific post-training quantization objective that preserves pre-quantization expert-selection behavior. VSRAQ combines two complementary objectives: value alignment, which matches routing-relevant logits or scores, and structure alignment, which preserves expert ordering and top-$k$ decision boundaries. By maintaining routing consistency, VSRAQ reduces quantization-induced degradation without introducing any inference-time overhead and can be integrated into existing quantization frameworks. Experiments on recent MoE foundation models show that VSRAQ improves expert-selection consistency and consistently outperforms reconstruction-only and router-aware baselines.
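To make the routing-instability point concrete, the sketch below measures two quantities on toy router logits: how often the top-k expert set survives a perturbation, and how far the logits moved. These are illustrative diagnostics under assumed definitions, not the VSRAQ objective itself; the function names and the logit values are hypothetical.

```python
def top_k(logits, k):
    """Indices of the k largest logits (ties broken by lower index)."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return set(order[:k])


def routing_consistency(fp_logits, q_logits, k):
    """Fraction of tokens whose top-k expert set is unchanged after quantization."""
    same = sum(top_k(f, k) == top_k(q, k) for f, q in zip(fp_logits, q_logits))
    return same / len(fp_logits)


def value_gap(fp_logits, q_logits):
    """Mean squared difference between router logits (a stand-in for a value term)."""
    diffs = [(a - b) ** 2 for f, q in zip(fp_logits, q_logits) for a, b in zip(f, q)]
    return sum(diffs) / len(diffs)


# Two tokens, four experts; made-up full-precision and quantized router logits.
fp = [[2.0, 1.9, 0.1, -1.0], [0.5, 3.0, 2.0, 0.0]]
q = [[1.9, 2.0, 0.1, -1.0], [0.5, 3.0, 1.5, 0.9]]

print(routing_consistency(fp, q, k=1))  # token 0 flips experts despite a 0.1 shift
print(routing_consistency(fp, q, k=2))
print(value_gap(fp, q))
```

In this toy data the first token changes its top-1 expert after a 0.1 shift, while the second token keeps its selection despite much larger logit errors, which is why matching values alone does not guarantee matching routing decisions.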

2606.05684 2026-06-05 cs.AI 版本更新

AdaMEM: Test-Time Adaptive Memory for Language Agents

AdaMEM:语言代理的测试时自适应记忆

Yunxiang Zhang, Yiheng Li, Ali Payani, Lu Wang

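AI summary: Proposes AdaMEM, a framework that achieves test-time adaptation through a hybrid memory architecture (long-term trajectory memory plus dynamic short-term strategy memory) without online parameter updates, significantly outperforming static memory baselines on tasks such as ALFWorld and WebShop.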

Comments ICML 2026

详情
AI中文摘要

语言代理的一个核心挑战是如何利用过去的经验来适应动态的测试时条件。尽管最近的工作展示了代理记忆机制的潜力,但大多数系统将检索限制在情节启动时。因此,代理被迫依赖静态指导,随着长期任务的展开,这种指导变得越来越不匹配。为了解决这种僵化问题,我们提出了自适应记忆代理(AdaMEM),一种用于代理测试时自适应的新框架。无需在线更新模型参数,AdaMEM通过混合记忆架构自适应代理行为:它维护一个离线收集的原始经验的长期轨迹记忆,同时动态生成短期策略记忆以指导决策。这种机制能够在不同推理时计算水平下实现令牌效率与适应性之间的权衡。实验上,AdaMEM显著优于静态记忆基线,在ALFWorld上相对提升高达13%,在WebShop上提升11%,并在HotpotQA上的代理搜索中持续领先。为了进一步增强这种自适应,我们开发了STEP-MFT,一种逐步记忆微调技术,训练策略从检索到的经验中合成高质量策略,从而获得额外的性能提升。我们的工作为代理记忆建立了一个新的扩展维度,支持在真实世界环境中部署后的持续推理和自我进化。我们的代码可在https://github.com/yunx-z/AdaMEM获取。

英文摘要

A central challenge for language agents is utilizing past experience to adapt to dynamic test-time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long-horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test-time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long-term trajectory memory of raw experiences collected offline while generating dynamic short-term strategy memory on the fly to guide decision-making. This mechanism enables a trade-off between token efficiency and adaptability across varying inference-time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop, with consistently leading performance extending to agentic search on HotpotQA. To further enhance this adaptation, we develop STEP-MFT, a Step-wise Memory Fine-Tuning technique that trains the policy to synthesize high-quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self-evolution post-deployment in real-world environments. Our code is available at https://github.com/yunx-z/AdaMEM.

2606.05679 2026-06-05 cs.DB cs.AI 版本更新

Data Flow Control: Data Safety Policies for AI Agents

数据流控制:AI 智能体的数据安全策略

Charlie Summers, Eugene Wu

Affiliations * Columbia University

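AI summary: Proposes Data Flow Control, a framework that enforces tuple-level data safety policies inside the DBMS through declarative policies and Passant, a portable query rewriting layer, at near-zero overhead.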

Comments 15 pages, 12 figures

详情
AI中文摘要

智能体越来越多地代表用户生成 SQL、编排管道和自动化数据分析。虽然最近的工作提高了查询的正确性,但正确性不等于安全性。一个查询可能在语义上有效,却违反了管理数据如何组合和发布的监管、隐私或业务约束。我们认为,强制执行此类约束本质上是一个数据基础设施问题。本文介绍了数据流控制(DFC),一个在 DBMS 查询中声明式指定并保证对元组级数据流实施策略的框架。一个关键挑战是定义一种优化器无关但可大规模高效执行的策略语言。我们将数据安全形式化为关于溯源单项的聚合谓词,并提出了 Passant,一个可移植的查询重写层,无需物化溯源即可强制执行 DFC 策略。在五个 DBMS 引擎——DuckDB、Umbra、PostgreSQL、DataFusion 和 SQLServer 上,Passant 实现了约 0% 的开销,并且性能优于替代方案数个数量级。因此,数据流控制是将数据安全从提示和事后检查转移到数据基础设施的第一步。数据流控制开源可用:https://github.com/dataflowcontrol/data-flow-control。

英文摘要

Agents increasingly generate SQL, orchestrate pipelines, and automate data analysis on behalf of users. While recent work improves query correctness, correctness is not safety. A query may be semantically valid yet violate regulatory, privacy, or business constraints that govern how data may be combined and released. We argue that enforcing such constraints is fundamentally a data infrastructure problem. This paper introduces Data Flow Control (DFC), a framework to declaratively specify and guarantee policy enforcement over tuple-level data flows within a DBMS query. A key challenge is defining a policy language that is optimizer-invariant yet efficient to enforce at scale. We formalize data safety as aggregate predicates over provenance monomials and present Passant, a portable query rewriting layer that enforces DFC policies without materializing provenance. Across five DBMS engines -- DuckDB, Umbra, PostgreSQL, DataFusion, and SQLServer -- Passant achieves ~0% overhead and outperforms alternatives by orders of magnitude. As a result, Data Flow Control is the first step towards moving data safety from prompts and post-hoc checks into the data infrastructure. Data Flow Control is available open source at https://github.com/dataflowcontrol/data-flow-control.

2606.05678 2026-06-05 cs.SD cs.AI cs.CR 版本更新

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition

超越波形鲁棒性:针对自动语音识别的鲁棒特征-声码器对抗攻击

Yifan Liao, Zongmin Zhang, Zhen Sun, Yuhui Sun, Xinhu Zheng, Xinlei He

Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Wuhan University

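AI summary: Proposes a black-box adversarial attack built on self-supervised learning representations and a vocoder; by perturbing acoustic-phonetic features instead of the waveform, it improves transferability and the ability to bypass defenses.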

Comments 11 pages

详情
AI中文摘要

自动语音识别(ASR)系统已广泛用于多语言语音到文本转录。其对对抗攻击的鲁棒性已成为社区的重要课题。现有对抗攻击直接将对抗噪声添加到语音音频中。然而,先前工作表明,现有对抗攻击面临两个限制:它们通常难以迁移到黑盒ASR系统,并且越来越多地被针对输入空间扰动的防御所缓解。在这项工作中,我们提出了一种清洁参考特征-声码器攻击,这是一种基于替代模型的黑盒攻击,将对抗搜索空间从原始波形转移到自监督学习(SSL)表示。为了解决可迁移性限制,我们扰动更具泛化性的声学-语音表示,而不是低层波形样本,减少对替代模型特定波形梯度的依赖,并鼓励对抗扰动跨ASR系统泛化。为了绕过不同的防御,我们将对抗信号从显式的加性波形噪声转移到SSL特征空间扰动,并通过声码器将其重构为类似语音的波形对抗信号,使生成的样本与基于波形的防御不太一致。大量实验表明,当仅在原始Whisper-small作为公开替代模型上优化时,我们的攻击有效迁移到黑盒ASR模型,WER比SOTA基线提高+26.6,同时针对多种训练防御仍保持有效,WER提高+36.2。这些结果揭示了当前ASR鲁棒性评估中的一个盲点。

英文摘要

Automatic speech recognition (ASR) systems have become widely used for multilingual speech-to-text transcription. Their robustness to adversarial attacks has become an important topic for the community. Existing adversarial attacks directly add adversarial noise to the speech audio. However, prior work has shown that existing adversarial attacks face two limitations: they often transfer poorly to black-box ASR systems and are increasingly mitigated by defenses tailored to input-space perturbations. In this work, we propose a Clean-Referenced Feature-Vocoder Attack, a surrogate-based black-box attack that moves the adversarial search space from raw waveforms to self-supervised learning (SSL) representations. To address the transferability limitation, we perturb more generalizable acoustic-phonetic representations rather than low-level waveform samples, reducing dependence on surrogate-specific waveform gradients and encouraging adversarial perturbations that generalize across ASR systems. To bypass different defenses, we shift the adversarial signal from explicit additive waveform noise to SSL feature-space perturbations and reconstruct them through a vocoder into speech-like waveform adversarial signals, making the resulting samples less aligned with waveform-bounded defenses. Extensive experiments show that, when optimized only on raw Whisper-small as a public surrogate model, our attack transfers effectively to black-box ASR models with a +26.6 WER improvement over the SOTA baseline, while also remaining effective against multiple training defenses with a +36.2 WER improvement. These results reveal a blind spot in current ASR robustness evaluation.
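The results above are reported as changes in word error rate (WER), where a higher WER on the victim model means a more effective attack. As a reminder of how that metric is usually computed, here is a minimal sketch of WER as word-level edit distance divided by reference length. It assumes whitespace tokenization and no text normalization, which may differ from the paper's evaluation setup; the example sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


clean = wer("the cat sat on the mat", "the cat sat on the mat")
attacked = wer("the cat sat on the mat", "the cat sit on mat")
print(clean, attacked)
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.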

2606.05677 2026-06-05 cs.CV cs.AI cs.CL 版本更新

LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

LongSpace: 从感知到回忆的视频长程空间记忆探索

Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang

Affiliations * Beijing University of Posts and Telecommunications; Zhongguancun Academy; Institute of Automation, Chinese Academy of Sciences; The Chinese University of Hong Kong; Xi’an Jiaotong University

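AI summary: To address spatial memory in long videos, proposes LongSpace, a framework that supports long-horizon spatial reasoning through chunked modeling, injection of 3D structural cues, and layer-aware memory, and validates it on LongSpace-Bench and other benchmarks.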

详情
AI中文摘要

多模态大语言模型(MLLMs)在图像和视频理解方面取得了进展,并且能够处理更长的视觉输入。自动驾驶和机器人导航等长程任务不仅需要识别当前视图,模型还必须记住并检索之前观察到的空间布局、路线、视角变化和物体状态。为了评估这一能力,我们引入了LongSpace-Bench,一个用于长程空间记忆的房间导览视频基准,涵盖场景感知、空间关系和空间记忆。在这项工作中,我们进一步提出了LongSpace,一个用于长视频空间推理的记忆框架。LongSpace将长视频建模为连续的块,将3D结构线索注入早期解码器层,并构建层级感知记忆以进行问题引导的检索。在多个空间推理基准上的实验表明,LongSpace改善了长视频空间理解,进一步证明了显式空间记忆是长程视频MLLMs的关键能力。

英文摘要

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.

2606.05670 2026-06-05 cs.AI 版本更新

Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

更多智能体有帮助吗?LLM智能体工作流的受控与协议对齐评估

Yuhang Fu, Ruishan Fang, Jiaqi Shao, Huiyu Zheng, Zhengtao Zhu, Bing Luo, Tao Lin

Affiliations * Beijing University of Posts and Telecommunications; Westlake University; Zhejiang University; Duke Kunshan University; Hong Kong University of Science and Technology; Zhejiang University of Technology

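AI summary: Proposes BenchAgent, a framework that compares single-agent, fixed multi-agent, and evolving multi-agent workflows under a unified protocol; most multi-agent systems do not beat the single-agent baseline on accuracy, but a runtime-generated workflow performs strongly on GAIA.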

Comments https://github.com/LINs-lab/MASArena/tree/BenchAgent

详情
AI中文摘要

一旦比较的系统共享相同的基准加载器、工具访问、答案契约、使用计数和轨迹日志,添加更多智能体是否有助于LLM工作流?我们引入BenchAgent,一个评估框架,将单智能体、固定多智能体(MAS)和演化MAS工作流置于一个标准化的执行和日志协议下。BenchAgent使用GPT-4.1在十个推理、编码和工具使用基准上评估这些内部工作流,并单独报告运行时生成工作流的协议对齐外部(PAE)GAIA研究。在SI条件下,六个测试的MAS中最多有一个在基准平衡平均准确率上超过匹配的单智能体锚点:EvoAgent位于Wilson单次运行指导范围内,而其余五个落后2.56-11.29个百分点,并占据更昂贵的准确率-成本权衡。在PAE GAIA快照上,一个Claude-Code风格的运行时工作流达到66.72%的整体准确率和69.23%的Level 3准确率,比最强的非Claude基线Jarvis(一个固定MAS)高出20多个百分点。

英文摘要

Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal (SI) workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.
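The abstract refers to a "Wilson one-run guidance" without defining it here. Assuming it builds on the standard Wilson score interval for a binomial proportion, the sketch below shows how such an interval is computed for a single evaluation run. The function name and the example counts are hypothetical, and the paper's exact procedure may differ.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical run: 50 correct answers out of 100 tasks.
low, high = wilson_interval(50, 100)
print(round(low, 4), round(high, 4))
```

With only one run per system, two accuracies whose intervals overlap substantially cannot be confidently ranked, which is presumably why a difference inside the interval is reported separately from the larger gaps.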

2606.05661 2026-06-05 cs.AI cs.CL 版本更新

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

持续学习基准:评估现实世界有状态环境中的前沿AI系统

Parth Asawa, Christopher M. Glaze, Gabriel Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Sunn Chen, Frederic Sala, Matei Zaharia, Joseph E. Gonzalez

Affiliations * UC Berkeley; Snorkel AI; University of Wisconsin-Madison

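AI summary: Introduces CL-Bench, the first expert-validated continual learning benchmark, spanning six domains; a gain metric isolates online learning ability, and existing systems are found to overfit and to reuse knowledge poorly.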

详情
AI中文摘要

持续学习,即AI系统通过顺序经验提升能力,已引起广泛关注,但缺乏高质量基准来评估。我们提出持续学习基准(CL-Bench),首个由专家验证的困难基准,旨在衡量基于LLM的系统是否真正从经验中改进。CL-Bench涵盖六个不同领域(软件工程、信号处理、疾病爆发预测、数据库查询、策略游戏和需求预测),每个领域由领域专家验证,任务共享可学习的潜在结构(代码库布局、疾病爆发动态、对手策略),有状态系统可在线发现而静态系统不能。我们评估了从朴素上下文学习(ICL)到专用记忆系统的多种智能体架构的前沿模型,引入增益指标以隔离学习与先验能力。我们发现这些系统在持续学习上仍有提升空间:智能体常过度拟合即时观察或未能跨实例复用知识,专用记忆系统并未解决此问题——实际上,朴素ICL优于专用记忆管理系统。CL-Bench是首个通过专家验证任务在多个现实世界领域评估持续学习并隔离在线学习与基础模型能力的基准,表明需要更好的持续学习系统。

英文摘要

Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this -- in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.

2606.05660 2026-06-05 cs.RO cs.AI 版本更新

Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation

面向长时域任务的安全具身AI:机器人操作跨层分析

Dabin Kim, Daemin Park, Sangyub Lee, Jinsik Kim, Yeongtak Oh, Jongho Shin, Sungroh Yoon

Affiliations * UNIST InnoCORE AI-Space Solar Initiative; Ulsan National Institute of Science and Technology (UNIST); Automation and Systems Research Institute; Department of Electrical and Computer Engineering; Interdisciplinary Program in Artificial Intelligence; LG Electronics

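AI summary: Surveys safety in long-horizon robotic manipulation from an embodied AI perspective, organizing the literature by intervention point (planning time, policy time, execution time), analyzing the strength of evidence, and identifying gaps in current safety assurances and future directions.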

Comments 63 pages, 6 figures

详情
AI中文摘要

具身AI系统日益被期望在物理环境中进行长时间跨度的推理和行动。这种不断增强的能力将安全问题推向前台,因为物理世界中的失败可能伤害人、损坏物体并扰乱工作场所。尽管安全具身AI已引起广泛关注,但文献在规划、策略设计和运行时执行方面仍然分散。长时域机器人操作是这一问题特别具有揭示性的锚定领域,因为语义误解、子任务级错误传播、执行漂移和接触丰富的物理风险可能在同一个闭环系统中累积。因此,本综述从具身AI视角对长时域机器人操作中的安全性进行了结构化回顾。我们按干预时机组织文献,涵盖规划时、策略时和执行时的安全性,并分析每条工作提供的证据强度,区分形式化保证、统计支持和经验安全启发式。这一框架阐明了骨干能力论文、直接安全机制以及基准或评估研究的独特作用,同时揭示了当前安全声明在哪些方面得到良好支持,在哪些方面仍然间接。我们识别了持续的空白,包括策略时安全性的有限证据、接触丰富长时域操作的形式化支持薄弱、不成熟的不确定性触发干预以及缺乏操作特定的安全基准。最后,我们概述了跨层保证、评估设计以及长时域机器人代理在真实世界环境中更安全部署的研究方向。

英文摘要

Embodied AI systems are increasingly expected to reason and act over extended horizons in physical environments. This growing capability brings safety to the foreground, because failures in the physical world can harm people, damage objects, and disrupt workplaces. Although safe embodied AI has attracted substantial attention, the literature remains fragmented across planning, policy design, and runtime execution. Long-horizon robotic manipulation is a particularly revealing anchor domain for this problem because semantic misgrounding, subtask-level error propagation, execution drift, and contact-rich physical risk can accumulate within the same closed-loop system. This survey therefore provides a structured review of safety in long-horizon robotic manipulation from an embodied AI perspective. We organize the literature by intervention locus, covering planning-time, policy-time, and execution-time safety, and we analyze the strength of the evidence that each line of work provides, distinguishing formal guarantees, statistical support, and empirical safety heuristics. This framework clarifies the distinct roles of backbone capability papers, direct safety mechanisms, and benchmark or evaluation studies, while exposing where current safety claims are well supported and where they remain indirect. We identify persistent gaps, including limited evidence for policy-time safety, weak formal support for contact-rich long-horizon manipulation, immature uncertainty-triggered intervention, and a shortage of manipulation-specific safety benchmarks. We conclude by outlining research directions for cross-layer assurance, evaluation design, and safer deployment of long-horizon robotic agents in real-world settings.

2606.05658 2026-06-05 cs.IR cs.AI 版本更新

Agent-Orchestrated Adaptive RAG: A Comparative Study on Structured and Multi-Hop Retrieval

Agent编排的自适应RAG:结构化与多跳检索的比较研究

Anuj Maharjan, Devinder Kaur, Richard Molyet

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出Agent编排的自适应RAG框架,通过动态查询分解、迭代检索和自反思评估,在结构化领域(DevOps)和多跳推理基准(MuSiQue)上对比发现,查询分解在结构化领域提升性能但降低多跳排名精度,反思机制提高引用准确性但增加延迟,表明Agent增强需根据查询和领域特性选择性应用。

详情
AI中文摘要

检索增强生成(RAG)通过将响应基于外部知识来增强大型语言模型(LLM),但传统流水线依赖于静态的单步检索,这限制了复杂查询的性能。本文提出了一种Agent编排的自适应RAG框架,引入了动态查询分解、迭代检索和有界自反思评估循环。我们在两个互补的数据集上评估该系统:一个特定领域的DevOps知识库和多跳推理基准MuSiQue。使用包括总体得分、引用准确性、平均倒数排名和主题覆盖度在内的指标,我们发现查询分解在结构化领域(DevOps上总体得分+0.04,MRR+0.17)带来一致的增益,但在多跳基准上降低了排名精度,而反思机制以显著的延迟成本提高了引用准确性。这些对比结果表明,Agent增强并非普遍有益,必须根据查询和领域特性选择性应用。我们的发现支持自适应、成本感知的编排,而非统一激进的推理流水线。

英文摘要

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding their responses in external knowledge, but conventional pipelines rely on static, single-step retrieval that limits performance on complex queries. This paper presents an Agent-Orchestrated Adaptive RAG framework that introduces dynamic query decomposition, iterative retrieval, and a bounded self-reflective evaluation loop. We evaluate the system across two complementary datasets: a domain-specific DevOps knowledge base and the multi-hop reasoning benchmark MuSiQue. Using metrics that include overall score, citation accuracy, mean reciprocal rank, and topic coverage, we find that query decomposition yields consistent gains in the structured domain (overall score $+0.04$, MRR $+0.17$ on DevOps) but degrades ranking precision on the multi-hop benchmark, while the reflection mechanism improves citation accuracy at a substantial latency cost. These contrasting results show that agentic enhancements are not universally beneficial and must be applied selectively according to query and domain characteristics. Our findings argue for adaptive, cost-aware orchestration rather than uniformly aggressive reasoning pipelines.
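
The abstract reports retrieval quality partly through mean reciprocal rank (MRR). As a reminder of what that number measures, here is a minimal sketch of the standard definition. The document ids and the `runs` data are made up for illustration and are not taken from the paper's DevOps or MuSiQue evaluation.

```python
from typing import Hashable, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[Hashable], relevant_ids: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[Hashable], Set[Hashable]]]) -> float:
    """Average the reciprocal rank over (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranking, relevant) for ranking, relevant in runs) / len(runs)


# Hypothetical rankings for three queries.
runs = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant hit at rank 2
    (["d2", "d9"], {"d2"}),        # first relevant hit at rank 1
    (["d4", "d5"], {"d8"}),        # relevant document never retrieved
]
mrr = mean_reciprocal_rank(runs)
```

Because MRR only rewards the position of the first relevant hit, a decomposition step that retrieves more relevant passages overall can still lower it if the first hit slips down the ranking.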

2606.05647 2026-06-05 cs.AI cs.CL cs.CY cs.HC 版本更新

Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

与“敌人”编码:人类开发者能否检测到AI代理的破坏行为?

Jingheng Ye, Huiqi Zou, Simon Yu, Weiyan Shi

发表机构 * Northeastern University(东北大学)

AI总结 通过大规模用户实验,研究人类开发者在长时间编码任务中检测AI代理恶意代码插入的能力,发现94%的开发者未能识别破坏,并分析其原因,提出安全监控设计建议。

Comments 34 pages, 30 figures, 3 tables

详情
AI中文摘要

AI编码代理越来越多地嵌入到现实世界的软件开发中,与人类开发者协作,同时获得对代码库和工具的更广泛访问权限。这创造了一个新的攻击面:代理可以利用人类信任来破坏开发,例如通过插入恶意代码来完成隐藏的附带任务。大多数先前的工作研究AI-only环境中的AI破坏,对人类监督在检测和减轻此类恶意行为中的作用关注有限。为填补这一空白,我们进行了首个关于AI编码破坏中人类监督的大规模研究。超过100名参与者与四个前沿模型(Claude-Opus-4.6、GPT-5.4、Gemini-3.1-Pro和MiniMax-M2.7)之一合作,完成一项持续约五小时的长周期编码任务,旨在模拟真实工作流程。我们发现94%的开发者未能检测到破坏,我们对参与者反馈的分析将这一脆弱性归因于最小化的代码审查、合理的掩护故事以及对代理的过度信任。我们进一步测试了安全监控器在一种条件下的有效性:虽然监控器降低了破坏成功率,但仍有56%的参与者接受了恶意代码,忽略了其警告。根据参与者反馈,我们为更好的监控器设计提供了可操作的建议。这项工作补充了现有的AI安全研究,并强调了迫切需要以人为本的安全机制,考虑人类因素,特别是在长周期、真实世界的开发环境中。

英文摘要

AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover stories, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.

2606.05646 2026-06-05 cs.SE cs.AI 版本更新

Enhancing Software Engineering Through Closed-Loop Memory Optimization

通过闭环记忆优化增强软件工程

Xuehang Guo, Zora Zhiruo Wang, Qingyun Wang, Graham Neubig, Xingyao Wang

发表机构 * William & Mary(威廉玛丽学院) Carnegie Mellon University(卡内基梅隆大学) OpenHands University(OpenHands大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出闭环记忆优化框架,通过验证下游影响来定义记忆效用,作为评估基准和优化信号,显著提升软件工程代理的成功率和效率。

详情
AI中文摘要

大型语言模型(LLMs)使得强大的软件工程(SE)代理能够导航复杂的代码库并解决现实世界的问题。然而,这些代理本质上仍然是片段式(episodic)的:它们无法跨任务保留、改进和重用经验,反复从头构建上下文并重复类似的错误。即使有记忆支持,它们也无法弥补缺乏原则性、任务无关的记忆效用的缺陷,这使得难以严格评估或跨代理和设置进行泛化。为了解决这些限制,我们引入了一个用于SE代理记忆增强的闭环框架。该框架将记忆效用建立在经过验证的下游影响上,将效用确立为任务无关的评估基准和无需标注的优化信号。通过在单片段(single-episode)和跨片段(cross-episode)记忆增强上的互补评估,结果表明该框架在不同设置下一致地改进了SE代理,在成功率上实现了高达5.25%的绝对增益,在解决效率上实现了4.63%的绝对增益,同时将计算成本大幅降低至少9.79%。项目页面:https://xhguo7.github.io/MemOp/。

英文摘要

Large language models (LLMs) have enabled powerful software engineering (SE) agents capable of navigating complex codebases and resolving real-world issues. However, these agents remain fundamentally episodic: they fail to retain, refine, and reuse experiences across tasks, repeatedly reconstructing context from scratch and reproducing similar mistakes. Even with memory support, they offer no remedy for the absence of a principled, task-agnostic memory utility, making them difficult to evaluate rigorously or generalize across agents and settings. To tackle these limitations, we introduce a closed-loop framework for memory augmentation in SE agents. The framework grounds memory utility in validated downstream impact, establishing utility as both a task-agnostic evaluation benchmark and an annotation-free optimization signal. Through complementary evaluation on single-episode and cross-episode memory augmentation, results demonstrate that the framework consistently improves SE agents across settings, achieving absolute gains of up to 5.25% in success rate and 4.63% in resolve efficiency, while substantially reducing computational cost by at least 9.79%. Project page: https://xhguo7.github.io/MemOp/.

2606.05644 2026-06-05 cs.AI 版本更新

FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG

FIDES: 通过深层证据信号实现RAG中检索-记忆冲突的忠实推理

Zhe Yu, Wenpeng Xing, Tiancheng Zhao, Mohan Li, Changting Lin, Meng Han

发表机构 * Binjiang Institute of Zhejiang University(浙江大学滨江研究院) Zhejiang University(浙江大学) Guangzhou University(广州大学) GenTel.io

AI总结 针对检索增强生成中检索证据与参数记忆冲突导致模型忽略上下文的问题,提出无训练解码器FIDES,通过融合输出表面、隐藏表示和预测轨迹三种内部信号,在token级别动态调整干预强度,显著提升上下文忠实度。

详情
AI中文摘要

当检索到的证据与参数记忆相矛盾时,语言模型常常忽略上下文并默认采用记忆化的先验知识——这种失败削弱了检索增强的核心目的。对比解码通过放大上下文条件输出以抑制参数偏差,但现有方法基于一个隐含假设:这种偏差在token间是均匀的。单一的全局对比权重会过度惩罚安全token,同时使真正存在冲突的token得不到充分纠正。我们识别出token级别的冲突集中现象:检索-记忆张力呈现高度异质性,集中在少数答案关键的解码步骤上。这重新定义了对比解码:从“施加多少对比”转变为“在何处施加对比”。我们提出FIDES(通过深层证据信号实现忠实推理),一种无训练解码器,它读取三种内部信号——输出表面、隐藏表示和预测轨迹——在互补深度探测检索-记忆冲突,并融合它们以控制每个解码步骤的干预强度。在三个基准和六个主干模型(四个主流的7B/8B模型和两个扩展至70B的主干模型)上,FIDES在所有18个设置中实现了最佳的上下文忠实度,比最强的无训练基线高出3到13个百分点。在70B规模上,忠实度达到92-94%,同时F1分数飙升至62-63%,表明token级别的选择性解锁了粗粒度对比规则所抑制的生成能力。

英文摘要

When retrieved evidence contradicts parametric memory, language models frequently ignore context and default to memorized priors -- a failure that undermines the core purpose of retrieval augmentation. Contrastive decoding amplifies the context-conditioned output to suppress parametric bias, but existing methods rest on an implicit assumption that this bias is uniform across tokens. A single global contrastive weight over-penalizes safe tokens while leaving genuinely conflicted ones insufficiently corrected. We identify token-level conflict concentration: retrieval-memory tension is sharply heterogeneous, concentrated on a small fraction of answer-critical decoding steps. This reframes contrastive decoding from how much contrast to apply to where to apply it. We propose FIDES (Faithful Inference via Deep Evidence Signals), a training-free decoder that reads three internal signals probing retrieval-memory conflict at complementary depths -- output surface, hidden representations, and prediction trajectory -- and fuses them to govern intervention strength at each decoding step. Across three benchmarks and six backbones -- four primary 7B/8B models and two scaling backbones up to 70B -- FIDES achieves the best context fidelity in all 18 settings, outperforming the strongest training-free baseline by +3 to +13 points. On the 70B scale, fidelity reaches 92-94% while F1 surges to 62-63%, demonstrating that token-level selectivity unlocks generation capability that coarse contrastive rules suppress.
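
To make the "where to apply contrast" idea concrete, the sketch below scales a contrastive-decoding weight per decoding step instead of using one global value. It is an illustration under stated assumptions, not the FIDES decoder: the abstract says FIDES fuses three internal signals, whereas this sketch substitutes a single stand-in signal (the Jensen-Shannon divergence between the with-context and without-context next-token distributions). The function names and the `max_alpha` parameter are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence in nats: 0 for identical inputs, at most ln 2."""
    def kl(a: List[float], b: List[float]) -> float:
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

    mid = [(x + y) / 2.0 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def adaptive_contrast(
    logits_ctx: List[float], logits_no_ctx: List[float], max_alpha: float = 1.0
) -> Tuple[float, List[float]]:
    """Return (alpha, adjusted logits) for one decoding step.

    alpha grows with the stand-in conflict signal, so steps where context and
    parametric memory agree are left almost untouched.
    """
    conflict = js_divergence(softmax(logits_ctx), softmax(logits_no_ctx)) / math.log(2.0)
    alpha = max_alpha * conflict
    adjusted = [(1.0 + alpha) * c - alpha * n for c, n in zip(logits_ctx, logits_no_ctx)]
    return alpha, adjusted


# Toy three-token vocabulary: one step with no conflict, one with a clear conflict.
agree_alpha, _ = adaptive_contrast([2.0, 0.0, -1.0], [2.0, 0.0, -1.0])
clash_alpha, clash_logits = adaptive_contrast([3.0, 0.0, -1.0], [0.0, 3.0, -1.0])
```

In the agreeing step the weight collapses to zero, while in the conflicting step the context-supported token is pushed further ahead of the memorized one. That is the selectivity a single global weight cannot provide.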

2606.05633 2026-06-05 cs.AI 版本更新

Answer Presence Drives RAG Rewriting Gains

答案存在驱动RAG重写收益

Yuejie Li, Yueying Hua, Ke Yang, Li Zhang, Yueping He, Ruiqi Li, Bolin Chen, Tao Wang, Bowen Li, Chengjun Mao

发表机构 * Ant Group(蚂蚁集团)

AI总结 通过受控干预审计,发现检索增强问答中重写器带来的性能提升主要由黄金答案字符串出现在重写上下文中驱动,而非证据质量改善。

详情
AI中文摘要

检索增强的问答管道通常将检索到的段落通过LLM重写器处理后输入较小的阅读器,在多跳基准测试中将F1提升数十个百分点;这种提升通常归因于证据质量的改善。我们通过受控干预审计,探究这种提升是否由黄金答案字符串出现在重写上下文中而非整理本身因果驱动。对于每个重写上下文,我们对编译输出进行四种受控编辑后重新运行阅读器:移除黄金答案跨度、替换为长度匹配的随机非答案跨度(安慰剂)、将黄金答案注入原本缺失的重写中(前缀或中间句子边界)。跨越三个阅读器系列(Qwen2.5-7B、Qwen3.5-35B、GLM-4.7)、两个数据集(HotpotQA、2WikiMultihopQA)和三种编译器安排(仅MA、仅MB、MA+验证)的十二个(单元、基线)干预运行中,在配对的answer-in-compile层上,移除黄金答案导致阅读器F1比长度匹配的安慰剂下降28到64个百分点,而在12个(单元、基线)组合中的10个中,将黄金答案前置到原本缺失的重写中使F1提升+0.7到+9.7个百分点。一项配套的五哨兵审计显示,传统的单[MASK]探针本身对哨兵敏感:在2Wiki上,它报告+4.12 F1的“非泄漏残差”,在四种替代哨兵下翻转至-3.33到-7.81 F1,并且对其中三种哨兵未能通过等价检验(1/4通过)。我们不提出新的重写器或缓解措施;我们发布干预运行器和哨兵面板,以便其他重写器收益声明可以针对相同标准进行测试。

英文摘要

Retrieval-augmented QA pipelines often route retrieved passages through an LLM rewriter before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. We ask whether that lift is causally driven by the gold answer string appearing in the rewritten context rather than by curation per se, using a controlled intervention audit. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: removing the gold answer span, replacing a length-matched random non-answer span (placebo), or injecting the gold into rewrites where it was absent (at the prefix or at a midpoint sentence boundary). Across twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA+verify), removing the gold answer drops reader F1 by 28 to 64 points beyond the length-matched placebo on paired answer-in-compile strata, and prepending the gold into rewrites that lacked it raises F1 by +0.7 to +9.7 points in 10 of 12 (cell, baseline) combinations. A companion five-sentinel audit shows the conventional single-[MASK] probe is itself sentinel-fragile: on 2Wiki it reports a +4.12 F1 "non-leakage residual" that flips to -3.33 to -7.81 F1 under four alternative sentinels and fails an equivalence test for three of those four (1/4 pass). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.

2606.05632 2026-06-05 cs.AI 版本更新

Evaluation of LLMs for Mathematical Formalization in Lean

LLM在Lean中数学形式化的评估

Tyson Klingner, Drew Bladek, Escher Crawford, Bohao Chen, Ariel Fu, Kaira Nair, Jarod Alper, Giovanni Inchiostro, Vasily Ilin

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学)

AI总结 本研究通过pass@k和refine@k指标在miniF2F和miniCTX子集上比较了多种大语言模型在Lean 4中生成形式化证明的能力,发现Gemini 3.1 Pro和Claude Opus 4.7性能最佳,而NVIDIA Nemotron 3 Super和GPT-OSS 120B在考虑成本时效率最高。

Comments 15 pages, 13 figures, 10 tables. Comments welcome!

详情
AI中文摘要

在过去几年中,大语言模型(LLM)生成形式化数学证明的能力得到了显著提升。我们比较了多种LLM在Lean 4中生成形式化证明的有效性,旨在帮助那些希望利用LLM支持自己项目的人。我们使用pass@k和refine@k指标作为比较基准,并在miniF2F和miniCTX数据集的子集上进行评估。测试表明,总体而言,Gemini 3.1 Pro和Claude Opus 4.7表现最佳。Gemini 3.1 Pro在miniF2F上通过refine@32达到了92%的成功率,而Opus 4.7在miniCTX上通过refine@32达到了86%的成功率。考虑成本时,NVIDIA Nemotron 3 Super和GPT-OSS 120B效率最高,具有竞争力的准确率且每个正确证明的平均成本低于0.01美元。

英文摘要

Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved drastically. We provide a comparison of various LLMs' effectiveness in producing formal proofs in Lean 4 with the goal of assisting those seeking to use LLMs to support their own projects. We utilize both pass@k and refine@k metrics as the benchmark for our comparison and evaluate on subsets of both miniF2F and miniCTX datasets. Our testing shows that overall, Gemini 3.1 Pro and Claude Opus 4.7 perform best. Gemini 3.1 Pro achieved a 92% success rate on miniF2F via refine@32, whereas Opus 4.7 achieved an 86% success rate on miniCTX via refine@32. When taking cost into account, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most efficient, with competitive accuracies and average costs below $0.01 per correct proof.
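
For readers unfamiliar with pass@k, the sketch below implements the commonly used unbiased estimator: given n sampled proofs of which c are verified, it returns the probability that at least one of k draws without replacement is correct. The abstract does not say which estimator the authors use, so treating it as this one is an assumption. refine@k is not defined in the abstract and is not sketched here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of sampled attempts, c: number of correct attempts,
    k: attempt budget being scored (1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect attempts exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical theorem: 32 sampled proofs, 4 of which type-check.
single_try = pass_at_k(n=32, c=4, k=1)
full_budget = pass_at_k(n=32, c=4, k=32)
```

With these made-up counts, a single attempt succeeds with probability 4/32, while a budget of 32 attempts is guaranteed to include a correct proof.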

2606.05626 2026-06-05 cs.CL cs.AI cs.LG 版本更新

When New Generators Arrive: Lifelong Machine-Generated Text Attribution via Ridge Feature Transfer

当新生成器到来:基于岭特征迁移的终身机器生成文本归因

Zhen Sun, Yifan Liao, Zhicong Huang, Jiaheng Wei, Cheng Hong, Yutao Yue, Xinlei He

发表机构 * Wuhan University(武汉大学) Ant Group(蚂蚁集团) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Institute of Deep Perception Technology, JITRI(感知技术研究院,JITRI)

AI总结 针对终身机器生成文本归因中持续适应新生成器与保留旧知识难以平衡的问题,提出轻量级分析更新框架RidgeFT,通过协方差校准和固定随机特征实现无需示例回放的闭式更新。

Comments 12 pages

详情
AI中文摘要

机器生成文本(MGT)归因旨在识别给定文本的特定生成器,从而为模型问责和滥用调查提供细粒度证据。随着新的大语言模型不断涌现,归因模型必须持续纳入新生成器,同时保留识别先前见过的生成器的能力。先前工作表明,这种终身MGT归因设置具有挑战性,现有方法通常难以在适应新类别和保留旧类别之间实现稳定平衡。为解决此问题,我们提出RidgeFT,一种轻量级分析更新框架,不依赖于示例回放。RidgeFT在初始生成器集上训练任务感知编码器,在首次观察到每个生成器类别时存储紧凑的类别充分统计量,然后冻结编码器以进行无回放的闭式更新。它通过协方差校准抑制与生成器无关的变异,通过固定随机特征提升表示能力,并基于类别充分统计量通过闭式岭回归更新新类别。在具有不同初始生成器设置的多主题评估中,RidgeFT始终优于基线。它在跨领域、骨干网络和增量协议上实现了最佳宏F1,同时改进了旧类别保留和新类别适应。这些结果表明,特征稳定的分析更新为终身MGT归因提供了一种简单而有效的方法。

英文摘要

Machine-generated text (MGT) attribution aims to identify the specific generator responsible for a given text, thereby providing fine-grained evidence for model accountability and misuse investigation. As new large language models continue to emerge, attribution models must continuously incorporate new generators while preserving their ability to recognize previously seen ones. Prior works have shown that this lifelong MGT attribution setting is challenging, and existing methods often struggle to achieve a stable balance between adapting to new classes and retaining old ones. To address this issue, we propose RidgeFT, a lightweight analytic update framework that does not rely on exemplar replay. RidgeFT trains a task-aware encoder on the initial generator set, stores compact class-wise sufficient statistics when each generator class is first observed, and then freezes the encoder for replay-free closed-form updates. It then suppresses generator-irrelevant variation through covariance calibration, improves representation capacity with fixed random features, and updates new classes through closed-form ridge regression based on class-level sufficient statistics. Across multi-topic evaluations with varying initial generator setups, RidgeFT consistently outperforms baselines. It achieves the best macro-F1 across domains, backbones, and incremental protocols, while also improving both old-class retention and new-class adaptation. These results suggest that feature-stable analytic updates provide a simple yet effective approach to lifelong MGT attribution.

2606.05625 2026-06-05 cs.AI cs.LG 版本更新

Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

自承诺延迟:一种用于提示隐式劫持的无奖励探针

Bonan Shen, Youting Wang, Dingyan Shang, Tao Ning

发表机构 * Stanford University(斯坦福大学) Tsinghua University(清华大学)

AI总结 提出自承诺延迟指标,通过测量推理上下文对模型自身最终答案的承诺时机,无需奖励信号即可检测提示隐式劫持,在GSM8K数据集上达到AUROC 0.878-0.926。

详情
AI中文摘要

当语言模型的思维链看似良性时,隐式奖励劫持难以审计:最终答案可能被提示捷径锚定,而书面推理仍类似于普通问题求解。基于验证器的探针通过测量早期截断的推理上下文获得高奖励来暴露此类行为,但需要任务特定的奖励信号。本文提出一种弱输入替代方案——自承诺延迟,它测量提示推理上下文对模型自身最终答案的承诺时机。我们在受控配对GSM8K设置中使用Qwen2.5-3B-Instruct-4bit评估该探针,比较普通提示与包含答案提示的提示。与诚实上下文相比,包含提示的上下文显著更早且以更低不确定性做出承诺。主要延迟指标——阈值为0.8时的首次承诺延迟——达到AUROC 0.878;支持的全曲线摘要达到承诺范围AUROC 0.926和平均未承诺质量AUROC 0.904。当两种提示条件都正确回答时信号更强,且在不同阈值下保持稳定。这些结果表明,存在捷径的推理上下文会留下早期行为承诺特征,无需奖励模型、外部评判或训练分类器即可检测。

英文摘要

Implicit reward hacking is hard to audit when a language model's chain of thought appears benign: a final answer may be anchored by a prompt shortcut while the written reasoning still resembles ordinary problem solving. Verifier-based probes expose such behavior by measuring how early truncated reasoning contexts obtain high reward, but require a task-specific reward signal. This paper proposes a weaker-input alternative, self-commitment latency, which measures how early a prompted reasoning context commits to the model's own final answer. We evaluate the probe in a controlled paired GSM8K setting using Qwen2.5-3B-Instruct-4bit, comparing ordinary prompts with prompts that include an answer hint. Hinted contexts commit substantially earlier and with lower uncertainty than honest contexts. The primary latency metric, first-commitment latency at threshold 0.8, reaches AUROC 0.878; supporting whole-curve summaries reach AUROC 0.926 for commitment range and 0.904 for mean uncommitted mass. The signal is stronger when both prompt conditions answer correctly and remains stable across thresholds. These results show that shortcut-available reasoning contexts can leave an early behavioral commitment signature detectable without a reward model, external judge, or trained classifier.
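
The primary metric can be illustrated with a short sketch. It assumes each reasoning trace yields a "commitment curve": the agreement between the answer produced from a truncated reasoning prefix and the model's own final answer, sampled at evenly spaced truncation points. That representation, the function names, and the toy curves are assumptions for illustration; the abstract does not specify how the curve is computed.

```python
from typing import List, Sequence


def first_commitment_latency(curve: Sequence[float], threshold: float = 0.8) -> float:
    """Fraction of the reasoning consumed when the curve first reaches the threshold.

    curve[i] is the assumed agreement with the final answer after truncating the
    reasoning to (i + 1) / len(curve) of its length. Returns 1.0 if the
    threshold is never reached.
    """
    for i, value in enumerate(curve):
        if value >= threshold:
            return (i + 1) / len(curve)
    return 1.0


def auroc(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up curves: hinted contexts commit early, honest ones commit late.
hinted_curves: List[List[float]] = [[0.9, 0.95, 1.0, 1.0], [0.5, 0.85, 0.9, 1.0]]
honest_curves: List[List[float]] = [[0.1, 0.3, 0.6, 0.9], [0.2, 0.4, 0.85, 0.95]]

hinted_latency = [first_commitment_latency(c) for c in hinted_curves]
honest_latency = [first_commitment_latency(c) for c in honest_curves]

# Earlier commitment should flag a hinted context, so score by negative latency.
separation = auroc([-x for x in hinted_latency], [-x for x in honest_latency])
```

The probe needs no reward signal: both the curve and the threshold crossing depend only on the model's own answers.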

2606.05614 2026-06-05 cs.AI 版本更新

Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

安全悖论:增强的安全意识如何使LLM易受后验攻击

Long P. Hoang, Hai V. Le, Shaoyang Xu, Wei Lu, Wenxuan Zhang

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学) Nanyang Technological University(南洋理工大学)

AI总结 本文揭示安全对齐增强的LLM因内部安全评估能力而面临后验攻击漏洞,通过实验和理论分析证明安全判断能力越强越易被利用,并提出因果干预验证。

详情
AI中文摘要

大型语言模型(LLM)经过严格对齐以拒绝有害请求,这一过程内在培养了评估和识别不安全内容的潜在能力。在这项工作中,我们揭示了这种高级安全意识无意中引入了一个致命漏洞。我们提出了后验攻击(Posterior Attack),一种单次查询的越狱方法,通过提示模型生成其内部分类器通常会标记为不安全的精确有害响应来绕过防护栏。通过对30个开源LLM(参数规模高达35B)和前沿模型(如GPT-5、Claude 4.6)的广泛实证评估,我们观察到一个显著现象:具有更优安全判断能力的模型更容易受到这种利用。为了解释这一点,我们形式化了安全悖论(Safety Paradox),分析表明安全对齐的单调改进自然放大了后验漏洞。最后,我们通过强化学习干预建立了因果联系,示例说明人为降低模型的安全判断能力可使其免疫攻击,而增强判断则会加剧漏洞。我们的发现揭示了当前对齐范式中的潜在缺陷,表明防御机制可能需要进一步的结构性改进。

英文摘要

Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters in size) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability. Finally, we establish a causal link via reinforcement learning interventions, exemplifying that artificially degrading a model's safety judgment immunizes it against the attack, whereas enhancing judgment exacerbates the vulnerability. Our findings highlight potential flaws in current alignment paradigms, indicating that defense mechanisms may require further structural refinement.

2606.05613 2026-06-05 cs.AI 版本更新

Multilingual Fine-Tuning via Localized Gradient Conflict Resolution

通过局部梯度冲突解决的多语言微调

Long P. Hoang, Yiran Zhao, Wei Lu, Wenxuan Zhang

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学) Salesforce AI Research(Salesforce人工智能研究) Nanyang Technological University(南洋理工大学)

AI总结 提出Bucket-Level MOO框架,将多语言微调重构为多目标优化问题,通过局部梯度冲突解决提升多语言性能。

详情
AI中文摘要

大型语言模型(LLMs)的快速发展已将跨语言多功能性确立为现代系统的定义特征。然而,微调这些模型经常引发跨语言的负面干扰。为了解决这个问题,我们将多语言微调重构为多目标优化(MOO)问题。具体来说,我们引入了Bucket-Level MOO,一个可扩展的分布式框架,它在参数桶上局部应用基于梯度的MOO算法。这使得冲突感知更新成为可能,而无需重建完整梯度向量的高昂通信开销。理论上,我们证明了这种局部解决自然地强制执行精炼帕累托平稳性,这是帕累托最优性的一个严格更紧的必要条件。实验上,Bucket-Level MOO通过驱动LLMs构建特定的语言维度来减轻干扰,提高了表示的可分离性。在四个基础LLM上的广泛实验表明,我们的方法在标准微调范式上显著提高了所见和未见的多语言性能。

英文摘要

The rapid evolution of Large Language Models (LLMs) has established cross-lingual versatility as a defining feature of modern systems. However, fine-tuning these models frequently induces negative interference across languages. To address this, we reformulate multilingual fine-tuning as a multi-objective optimization (MOO) problem. Specifically, we introduce Bucket-Level MOO, a scalable distributed framework that applies gradient-based MOO algorithms locally on parameter buckets. This enables conflict-aware updates without the prohibitive communication overhead of reconstructing full gradient vectors. Theoretically, we prove this localized resolution natively enforces Refined Pareto Stationarity, a strictly tighter necessary condition for Pareto optimality. Empirically, Bucket-Level MOO mitigates interference by driving LLMs to construct distinct language-specific dimensions, improving representational separability. Extensive experiments across four base LLMs demonstrate that our method significantly improves both seen and unseen multilingual performance over standard fine-tuning paradigms.
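
The sketch below shows what resolving gradient conflicts locally, per parameter bucket, can look like for two languages. The abstract does not name the underlying MOO algorithm, so this example swaps in a PCGrad-style projection (drop the component of one gradient that opposes the other) purely for illustration. The bucket names and gradient values are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def project_out_conflict(g: Vector, other: Vector) -> Vector:
    """Remove from g its component along `other` when the two conflict (negative dot product)."""
    overlap = dot(g, other)
    norm_sq = dot(other, other)
    if overlap >= 0.0 or norm_sq == 0.0:
        return list(g)
    return [a - (overlap / norm_sq) * b for a, b in zip(g, other)]


def bucket_level_update(grads_a: Dict[str, Vector], grads_b: Dict[str, Vector]) -> Dict[str, Vector]:
    """Resolve conflicts independently inside each bucket, then sum the two gradients.

    Only the slices belonging to one bucket are compared, so no full gradient
    vector has to be assembled.
    """
    update: Dict[str, Vector] = {}
    for name, ga in grads_a.items():
        gb = grads_b[name]
        pa = project_out_conflict(ga, gb)
        pb = project_out_conflict(gb, ga)
        update[name] = [x + y for x, y in zip(pa, pb)]
    return update


# Hypothetical per-language gradients: they conflict in "attn" but agree in "mlp".
grads_lang_a = {"attn": [1.0, 0.0], "mlp": [1.0, 2.0]}
grads_lang_b = {"attn": [-1.0, 1.0], "mlp": [3.0, 1.0]}
update = bucket_level_update(grads_lang_a, grads_lang_b)
```

In this toy case a projection on the concatenated gradient would do nothing, because the overall dot product is positive. Working per bucket still catches the conflict inside "attn", which is one plausible reading of why local resolution yields a tighter condition.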

2606.05609 2026-06-05 cs.CR cs.AI cs.LG 版本更新

SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks

SlotGCG:利用LLMs中的位置脆弱性进行越狱攻击

Seungwon Jeong, Jiwoo Jeong, Hyeonjin Kim, Yunseok Lee, Woojin Lee

发表机构 * Dongguk University-Seoul(东国大学-首尔)

AI总结 本文提出SlotGCG方法,通过量化提示中不同插入位置(槽)的脆弱性得分(VSS),选择最脆弱的位置插入对抗性令牌,从而显著提升基于优化的越狱攻击成功率。

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

随着大型语言模型(LLMs)的广泛部署,通过越狱攻击识别其脆弱性变得日益关键。基于优化的攻击方法如贪婪坐标梯度(GCG)专注于将对抗性令牌插入到提示的末尾。然而,GCG将对抗性令牌限制在固定的插入点(通常是提示后缀),未探索在其他位置插入令牌的效果。在本文中,我们实证研究了提示中可插入令牌的候选位置(称为槽)。我们发现越狱的脆弱性与槽的选择高度相关。基于这些发现,我们引入了脆弱性槽得分(VSS)来量化越狱的位置脆弱性。随后,我们提出SlotGCG,该方法使用VSS评估所有槽,选择最脆弱的槽进行插入,并在这些槽上运行针对性的优化攻击。我们的方法提供了一种与攻击无关的位置搜索机制,可插入任何基于优化的攻击,仅增加200毫秒的预处理时间。在多个模型上的实验表明,SlotGCG显著优于现有方法。具体而言,与基于GCG的攻击相比,它实现了14%更高的攻击成功率(ASR),收敛更快,并且对防御方法表现出更强的鲁棒性,ASR比基线方法高42%。我们的实现可在https://github.com/youai058/SlotGCG获取。

英文摘要

As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting adversarial tokens at the end of prompts. However, GCG restricts adversarial tokens to a fixed insertion point (typically the prompt suffix), leaving the effect of inserting tokens at other positions unexplored. In this paper, we empirically investigate slots, i.e., candidate positions within a prompt where tokens can be inserted. We find that vulnerability to jailbreaking is highly related to the selection of the slots. Based on these findings, we introduce the Vulnerable Slot Score (VSS) to quantify the positional vulnerability to jailbreaking. We then propose SlotGCG, which evaluates all slots with VSS, selects the most vulnerable slots for insertion, and runs a targeted optimization attack at those slots. Our approach provides a position-search mechanism that is attack-agnostic and can be plugged into any optimization-based attack, adding only 200ms of preprocessing time. Experiments across multiple models demonstrate that SlotGCG significantly outperforms existing methods. Specifically, it achieves a 14% higher Attack Success Rate (ASR) than GCG-based attacks, converges faster, and shows superior robustness against defense methods, with 42% higher ASR than baseline approaches. Our implementation is available at https://github.com/youai058/SlotGCG.

2606.05606 2026-06-05 cs.LG cs.AI math.OC 版本更新

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

跨时代自适应展开优化用于强化学习后训练

Yiming Zong, Yige Wang, Jiashuo Jiang

发表机构 * Department of Industrial Engineering & Decision Analytics, Hong Kong University of Science and Technology(工业工程与决策分析系,香港科技大学)

AI总结 针对提示词训练信号差异大的问题,提出CERO方法,通过贝叶斯估计提示词成功概率并利用Fenchel对偶优化自适应分配展开预算,在固定总预算下提升样本效率。

详情
AI中文摘要

LLM后训练通常依赖于对每个提示采样多次展开的强化学习方法,但大多数现有方法对每个提示使用固定的展开预算,尽管不同提示提供的训练信号差异很大。本文研究在固定全局预算下的自适应展开分配,并将问题形式化为具有提示级递减收益的在线资源分配。我们的方法CERO维护每个提示成功概率的Beta后验分布,并使用后验期望伯努利方差作为额外展开价值的贝叶斯估计。我们利用该估计构建累积分配上的凹饱和效用函数,得到一个目标函数,其中跨提示和跨时代的决策通过全局预算耦合。由于所得目标在时间上不可分离,我们推导出Fenchel对偶重写,并通过投影在线梯度下降更新提示级和预算级对偶变量。在固定提示效用下,我们证明相对于离线分配基准的$O(\sqrt{K})$遗憾界。在数学推理问题上的实验表明,CERO在多个开源LLM和基准上持续优于GRPO,证明自适应展开预算可以提高样本效率。

英文摘要

LLM post-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt-level diminishing returns. Our method, CERO, maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. We use this estimate to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. Since the resulting objective is temporally nonseparable, we derive a Fenchel-dual reformulation and update both prompt-level and budget-level dual variables via projected online gradient descent. Under fixed prompt utilities, we prove an $O(\sqrt{K})$ regret bound against the offline allocation benchmark. Experiments on mathematical-reasoning problems show that CERO consistently outperforms GRPO across multiple open-weight LLMs and benchmarks, demonstrating that adaptive rollout budgeting can improve sample efficiency.
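
The quantity at the heart of the allocation rule has a closed form. If a prompt's success probability p has a Beta(a, b) posterior, the posterior expected Bernoulli variance is E[p(1 - p)] = ab / ((a + b)(a + b + 1)). The sketch below computes it from rollout counts. The uniform Beta(1, 1) prior and the function name are assumptions, and the Fenchel-dual allocation step described in the abstract is not reproduced.

```python
def posterior_expected_variance(
    successes: int, failures: int, prior_a: float = 1.0, prior_b: float = 1.0
) -> float:
    """E[p(1 - p)] under the Beta(prior_a + successes, prior_b + failures) posterior."""
    a = prior_a + successes
    b = prior_b + failures
    return a * b / ((a + b) * (a + b + 1.0))


# Hypothetical prompts after 8 rollouts each (the unseen prompt has none).
unseen = posterior_expected_variance(0, 0)   # nothing observed yet
mixed = posterior_expected_variance(4, 4)    # model succeeds half the time
solved = posterior_expected_variance(8, 0)   # model always succeeds
```

A prompt the model always solves (or always fails) gets a small value, so extra rollouts there carry little training signal. A prompt with mixed outcomes keeps a large value, which is the intuition behind shifting budget toward it.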

2606.05602 2026-06-05 cs.AI cs.HC cs.LG 版本更新

Fix the Mind, Not the Move: Interpretable AI Assistance via Knowledge-Gap Localization

修正思维,而非动作:通过知识缺口定位实现可解释的AI辅助

Ayano Hiranaka, Ya-Chuan Hsu, Stefanos Nikolaidis, Erdem Bıyık, Daniel Seita

发表机构 * University of Tokyo(东京大学) National Institute of Information and Communications Technology(信息与通信技术国家研究所)

AI总结 提出SENSEI框架,通过结构化知识表示推断用户误解并提供针对性建议,在长时任务中实现零样本组合泛化,纠正90%的学生误解。

Comments Accepted to International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

在人机协作中,AI助手通常通过行为反馈(例如辅助驾驶中的警报或方向盘提示)来纠正次优的人类行为。此类干预可以缓解即时错误,但长期改进需要解决导致重复错误的潜在误解。我们引入了SENSEI,一个从交互行为推断用户误解并提供针对性、最小但充分建议的框架。我们的方法通过操作结构化知识表示来定位和纠正错误行为的根源,从而脱离动作或轨迹层面的干预。在具有不同误解和相应行为的三个长时任务中,SENSEI展示了零样本组合泛化能力,尽管仅针对单一误解案例进行训练,却能解开多个重叠的误解。一项用户研究进一步表明,我们的方法能够识别真实的人类误解,并提供有效的指导,从而提高长时任务表现,成功纠正了90%的学生误解。代码和项目页面见https://misoshiruseijin.github.io/SENSEI/。

英文摘要

AI assistants in human-AI collaboration often correct suboptimal human actions through behavioral feedback (e.g., alerts or steering-wheel nudges in assistive driving). Such interventions can mitigate immediate errors, but long-term improvement requires addressing the underlying misconceptions that cause repeated mistakes. We introduce SENSEI, a framework that infers user misconceptions from interaction behavior and provides targeted, minimal yet sufficient suggestions to correct them. Our approach departs from action- or trajectory-level interventions by operating over a structured knowledge representation to localize and correct the sources of erroneous behavior. Across three long-horizon tasks with diverse misconceptions and corresponding behaviors, SENSEI demonstrates zero-shot compositional generalization, disentangling multiple overlapping misconceptions despite training only on single-misconception cases. A user study further shows that our method identifies real human misconceptions and provides effective guidance that improves long-horizon task performance, successfully correcting 90% of student misconceptions. Code and project page are available at https://misoshiruseijin.github.io/SENSEI/.

2606.05587 2026-06-05 cs.CV cs.AI cs.LG 版本更新

HDST-GNN: Heterogeneous Dynamic Spatiotemporal Graph Neural Networks for Multi-Object Tracking in UAV Aerial Imagery

HDST-GNN:用于无人机航拍图像多目标跟踪的异质动态时空图神经网络

Phillip Jiang

发表机构 * Phillip Jiang(菲利普·姜)

AI总结 针对无人机航拍中目标小、密集、遮挡导致身份切换的问题,提出异质动态时空图神经网络HDST-GNN,通过高度自适应边构建、异质节点表示和遮挡门控时序聚合提升跟踪性能。

Comments 18 pages, 4 figures, 6 tables

详情
AI中文摘要

无人机航拍图像的多目标跟踪(MOT)面临独特挑战:序列间高度变化、目标小而密集、频繁遮挡导致身份切换。现有基于图的跟踪器假设固定空间上下文并统一处理所有目标,忽略了检测、活跃轨迹和丢失目标等异质生命周期状态。我们提出HDST-GNN,一种异质动态时空图神经网络,包含三项创新。首先,高度自适应边构建根据平均目标面积估计相机高度代理,并相应调整图连接半径。其次,异质节点表示将检测(D型)、确认轨迹(T型)和丢失轨迹(L型)建模为不同节点类型,具有专用投影和类型化边关系。第三,遮挡门控时序聚合根据每个节点的遮挡置信度门控其注意力贡献,防止被遮挡节点破坏邻居嵌入。HDST-GNN使用可微Sinkhorn头部,结合交叉熵和三元组损失进行端到端训练。在VisDrone2019-MOT上使用oracle检测时,HDST-GNN达到94.51% MOTA和97.24% IDF1,比SORT高出+5.0 MOTA点,身份切换减少81%。使用真实YOLOv8n检测时,HDST-GNN相比SORT身份切换减少49%。消融研究证实了每个组件的独立贡献。

英文摘要

Multi-object tracking (MOT) from UAV imagery presents unique challenges: altitude varies across sequences, objects are small and densely packed, and frequent occlusion causes identity switches. Existing graph-based trackers assume fixed spatial context and treat all objects uniformly, ignoring the heterogeneous lifecycle states of detections, active tracklets, and lost targets. We propose HDST-GNN, a Heterogeneous Dynamic Spatiotemporal Graph Neural Network with three novel contributions. First, Altitude-Adaptive Edge Construction estimates a camera-altitude proxy from mean object area and adjusts the graph connectivity radius accordingly. Second, Heterogeneous Node Representation models detections (Type-D), confirmed tracklets (Type-T), and lost tracklets (Type-L) as distinct node types with dedicated projections and typed edge relations. Third, Occlusion-Gated Temporal Aggregation gates each node's attention contribution by its occlusion confidence, preventing occluded nodes from corrupting neighbour embeddings. HDST-GNN is trained end-to-end with a differentiable Sinkhorn head using joint cross-entropy and triplet loss. On VisDrone2019-MOT with oracle detections, HDST-GNN achieves 94.51% MOTA and 97.24% IDF1, outperforming SORT by +5.0 MOTA points and reducing identity switches by 81%. With real YOLOv8n detections, HDST-GNN reduces identity switches by 49% vs. SORT. Ablation studies confirm the independent contribution of each component.
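The altitude-adaptive edge construction described above can be illustrated with a minimal sketch. The abstract does not give the exact rule, so everything here is an assumption: the radius is taken to scale with the square root of the mean box area (smaller objects, suggesting higher altitude, give a smaller pixel radius), and the names `adaptive_radius`, `build_edges`, and the factor `k` are hypothetical.

```python
import math

def adaptive_radius(boxes, k=4.0):
    """Connectivity radius in pixels from mean object area.

    boxes: list of (x, y, w, h). The sqrt(mean area) scaling and the
    factor k are illustrative assumptions, not the paper's rule.
    """
    mean_area = sum(w * h for _, _, w, h in boxes) / len(boxes)
    return k * math.sqrt(mean_area)

def build_edges(boxes, k=4.0):
    """Connect every pair of boxes whose centres lie within the radius."""
    r = adaptive_radius(boxes, k)
    centers = [(x + w / 2, y + h / 2) for x, y, w, h in boxes]
    edges = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= r:
                edges.append((i, j))
    return r, edges
```

With three 10x10 boxes at x = 0, 20, and 200, only the first two would be connected under these assumptions, because the radius shrinks and grows with apparent object size rather than staying fixed.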

2606.05584 2026-06-05 cs.CR cs.AI 版本更新

Dimensionality Reduction for Cyberattack Classification: A Comparative Evaluation of PCA and Linear Predictive Coding

网络攻击分类的降维:PCA与线性预测编码的比较评估

Nelly Elsayed, Zag ElSayed, Navid Asadizanjani

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 本文通过比较主成分分析(PCA)和线性预测编码(LPC)两种降维方法,研究网络攻击分类中的特征压缩技术,实验表明PCA在激进压缩下仍能保持分类性能,LPC则略有性能下降,但两者均能在最小影响分类准确率的情况下大幅降低特征维度。

Comments Accepted at IEEE MWSCAS 2026

详情
AI中文摘要

高维特征表示被广泛用于基于机器学习的网络攻击检测系统。然而,它们增加了计算复杂度,并可能阻碍在资源受限环境中的部署。在本文中,我们通过比较两种降维方法:主成分分析(PCA)和线性预测编码(LPC),研究用于网络攻击分类的特征压缩技术。生成具有不同维度的压缩特征表示,并在多个分类模型上进行评估。实验分析表明,即使在激进压缩下,PCA也能保持分类性能。另一方面,LPC提供了具有竞争力的预测表示,但性能下降略大。结果表明,可以在对分类准确率影响最小的情况下实现特征维度的显著降低,突显了轻量级特征压缩在高效网络安全分析中的潜力。

英文摘要

High-dimensional feature representations are widely used in machine learning-based cyberattack detection systems. However, they increase computational complexity and may hinder deployment in resource-constrained environments. In this paper, we investigate feature compression techniques for cyberattack classification by comparing two dimensionality reduction approaches: Principal Component Analysis (PCA) and Linear Predictive Coding (LPC). Compressed feature representations with varying dimensionalities are generated and evaluated across several classification models. Experimental analysis demonstrates that PCA preserves classification performance even under aggressive compression. On the other hand, LPC provides competitive predictive representations with slightly larger performance degradation. The results show that substantial reductions in feature dimensionality can be achieved with minimal impact on classification accuracy, highlighting the potential of lightweight feature compression for efficient cybersecurity analytics.
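As a rough illustration of the less familiar of the two methods, the sketch below computes LPC coefficients for one feature vector with the Levinson-Durbin recursion. The abstract does not say how LPC is applied to tabular features, so treating each feature vector as a 1-D sequence and using its order-`p` coefficients as the compressed representation is an assumption, and the function name `lpc_coefficients` is hypothetical.

```python
def lpc_coefficients(x, p):
    """Order-p linear prediction coefficients via Levinson-Durbin.

    Returns a such that x[t] is approximated by sum(a[j] * x[t-1-j]).
    Treating a feature vector as a sequence is an illustrative assumption.
    """
    n = len(x)
    # Autocorrelation at lags 0..p
    r = [sum(x[t] * x[t + k] for t in range(n - k)) for k in range(p + 1)]
    a = [0.0] * p
    err = r[0]
    for i in range(p):
        if err <= 0:  # degenerate (e.g. all-zero) input: stop early
            break
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= 1.0 - k * k
    return a
```

A vector of any length is reduced to `p` numbers this way, which is the sense in which LPC acts as a compressor; PCA, by contrast, learns one projection from the whole dataset.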

2606.05570 2026-06-05 cs.CL cs.AI 版本更新

TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

TensorBench: 在基于编译器的张量框架上对编码智能体进行基准测试

Bobby Yan, Fredrik Kjolstad

发表机构 * Department of Computer Science, Stanford University(计算机科学系,斯坦福大学)

AI总结 本文提出 TensorBench,一个包含199个功能添加和重构任务的基准测试,用于评估编码智能体在基于编译器的张量框架上的表现,并通过测试套件自动评分。

详情
AI中文摘要

仓库级别的编码基准测试面临任务难度与评估可靠性之间的权衡:挑战前沿模型的任务通常涉及测试覆盖不完整的大型代码库,而人工审查难以扩展。我们引入了 TensorBench,这是一个包含199个功能添加和重构任务的基准测试,基于一个开源的基于编译器的张量框架,该框架通过一流的密集和稀疏张量支持扩展了 PyTorch。任务涵盖新的稀疏格式、密集优化过程、IR 转换、调度器更改、运行时组件以及高级数值算子。TensorBench 通过应用智能体的补丁并运行框架的测试套件(包括预先存在的随机回归测试和智能体添加的任何测试)来对每次运行进行评分。对于功能添加任务,“通过”意味着修补后的仓库保留了已测试的既有行为,并满足了智能体为所请求功能添加的检查。我们评估了七个编码智能体,涵盖三个前沿模型系列和一个开放权重模型。在此标准下的通过率从最强智能体的 $64.8\%$ 到最弱智能体的 $22.1\%$ 不等。智能体通过不同的任务子集:成对 Cohen's $κ$ 范围从 $-0.07$ 到 $0.43$,两个最强智能体的 $κ= 0.05$。

英文摘要

Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not scale. We introduce TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. Tasks cover new sparse formats, dense optimization passes, IR transformations, scheduler changes, runtime components, and high-level numerical operators. TensorBench grades each run by applying the agent's patch and running the framework's test suite, which includes the pre-existing randomized regression tests and any tests the agent adds. For feature-addition tasks, a pass means that the patched repository preserves the tested pre-existing behavior and satisfies the agent-added checks for the requested feature. We evaluate seven coding agents spanning three frontier model families and one open-weight model. Pass rates under this criterion range from $64.8\%$ for the strongest agent to $22.1\%$ for the weakest. Agents pass different subsets of tasks: pairwise Cohen's $κ$ ranges from $-0.07$ to $0.43$, with $κ= 0.05$ for the two strongest agents.
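The pairwise agreement statistic quoted above is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch for two agents' pass/fail vectors follows; the function name `cohens_kappa` and the toy data are illustrative, not taken from the benchmark.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length binary (0/1) outcome lists."""
    n = len(a)
    p_observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Chance agreement: both pass or both fail, assuming independence
    p_expected = pa * pb + (1 - pa) * (1 - pb)
    if p_expected == 1.0:  # both agents constant and identical
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)
```

A value near zero, such as the 0.05 reported for the two strongest agents, means their pass sets overlap about as much as two independent agents with the same pass rates would.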

2606.05566 2026-06-05 cs.AI cs.CR 版本更新

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

GuardNet: 用于鲁棒提示注入和越狱检测的浅层神经网络集成策略

Paulo Ricardo Ferreira Neves, Edson Rodrigues da Cruz Filho, Paulo Henrique Eleuterio Falsetti, João Vitor Pavan, Ian Degaspari, Henrique Vieira Laturrague, Patrick Vieira Laturrague, Guilherme Nielsen Dias, Marccello Wilson Perez Berto, Gustavo Voltani Von Atzingen

发表机构 * Quickium Technology Ltd.(Quickium技术有限公司) Federal University of São Carlos (UFSCar)(圣卡洛斯联邦大学) Federal Institute of Education, Science and Technology of São Paulo (IFSP)(圣保罗联邦教育科学技术学院)

AI总结 提出GuardNet,一种基于浅层神经网络(BiLSTM)集成的护栏系统,通过多样性示例覆盖和阈值校准实现对抗鲁棒性,在低延迟下达到与轻量检测器竞争的性能。

详情
AI中文摘要

大型语言模型(LLMs)已经改变了自然语言处理,但它们仍然容易受到提示注入(PI)和越狱(JB)攻击。此外,基准评估可能受到污染和部分信息泄漏的影响,从而损害性能估计。本文提出了GuardNet,一个基于浅层神经网络(BiLSTM)集成的护栏系统,参数约4700万。我们研究了这样一个假设:对抗场景中的鲁棒性更多地取决于示例覆盖的多样性和阈值校准,而不是模型规模。结果表明,GuardNet与轻量检测器相比达到了竞争性能,并在低延迟下具有高效率,尽管更大的LLMs(如Mistral-7B和Llama-3.1-8B)在盲测JBB-Behaviors基准上仍在F1分数和AUROC方面表现更优。尽管如此,GuardNet在盲测数据集(n=200)上实现了0.747的AUROC,在专有基准(n=50)上实现了0.92的F1分数,这是在阈值校准和声明部分信息泄漏的情况下评估的。该系统在CPU上的平均延迟约为50毫秒,使其适合部署在成本和基础设施受限的生产环境中。

英文摘要

Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial information leakage, compromising performance estimates. This work presents GuardNet, a guardrail system based on an ensemble of shallow neural networks (BiLSTMs) with approximately 47 million parameters. We investigate the hypothesis that robustness in adversarial scenarios depends more on the diversity of example coverage and threshold calibration than on model scale. The results indicate that GuardNet achieves competitive performance compared with lightweight detectors and high efficiency at low latency, although larger LLMs such as Mistral-7B and Llama-3.1-8B still achieve superior performance in terms of F1 score and AUROC on the blind JBB-Behaviors benchmark. Nevertheless, GuardNet achieves an AUROC of 0.747 on the blind dataset (n = 200) and an F1 score of 0.92 on a proprietary benchmark (n = 50), under threshold calibration and evaluation with declared partial information leakage. The system operates with an average latency of approximately 50 ms on CPU, making it suitable for deployment in production environments with cost and infrastructure constraints.
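Threshold calibration for an ensemble detector can be sketched as follows. This is only an assumed procedure: member scores are averaged, and the decision threshold is chosen to maximise F1 on a labelled validation set. The abstract does not state GuardNet's actual aggregation or calibration rule, and the names `ensemble_score`, `f1_at`, and `calibrate_threshold` are hypothetical.

```python
def ensemble_score(member_scores):
    """Average the attack probabilities from the ensemble members."""
    return sum(member_scores) / len(member_scores)

def f1_at(threshold, scores, labels):
    """F1 of the rule 'flag as attack if score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def calibrate_threshold(scores, labels):
    """Pick the candidate threshold (an observed score) with the best F1."""
    return max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))
```

Calibrating on data that overlaps the test set would inflate the reported F1, which is presumably why the abstract declares partial information leakage for the proprietary benchmark.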

2606.05563 2026-06-05 cs.AI cs.CL 版本更新

SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

SoCRATES:跨领域和社会认知变异的前瞻性LLM调解的可靠自动化评估

Taewon Yun, Hyeonseong Park, Jeonghwan Choi, Hayoon Park, Yeeun Choi, Hwanjun Song

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出SoCRATES基准,通过多领域真实冲突场景和五维社会认知适应轴评估LLM调解员,使用主题定位评估器实现0.82的人类专家一致性,发现最强模型仅缩小约三分之一的未调解共识差距。

详情
AI中文摘要

评估LLM调解员仍然具有挑战性,因为调解是一个实时轨迹,由争议者不断变化的情感、意图和背景塑造。现有的测试平台依赖于少数专家撰写的领域,主要变化战略姿态,并对每个话题的每一轮进行评分,引入了离题噪声。我们引入了SoCRATES,一个用于在现实的多领域测试平台中评估前瞻性LLM调解员的基准。它通过一个跨八个领域的代理管道从真实冲突中构建场景,探测五个社会认知适应轴(战略姿态、参与者组成、历史长度、情感反应和文化身份),并通过主题定位评估器仅对推进每个话题的轮次进行评分。该评估器与人类专家的一致性达到0.82,是每轮基线的两倍以上。对八个前沿LLM的基准测试发现,即使是最强的调解员,在多样化和现实的测试平台下,也仅能缩小约三分之一的未调解共识差距,且性能因社会认知轴而异,突显出进步在于对不同条件的社会适应。

英文摘要

Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.

2606.05561 2026-06-05 cs.CL cs.AI 版本更新

InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

InfoShield:通过信息论优化实现心理健康筛查的隐私保护语音表示

Xueyang Wu, Siyuan Liu, Kezhuo Yang, Guang Ling

发表机构 * Shenzhen NeurStar Inc., China(深圳NeurStar公司,中国) University of York, United Kingdom(约克大学,英国) Shanghai Jiao Tong University, China(上海交通大学,中国)

AI总结 提出InfoShield框架,通过最小化语音表示与敏感属性间的互信息,在保持抑郁分类性能的同时有效降低人口统计信息泄露风险。

详情
AI中文摘要

基于语音的心理健康筛查提供了可扩展的抑郁症检测方法,但临床部署面临一个重大障碍:用户对人口统计信息暴露的隐私担忧。当前技术难以解决这一冲突。对抗训练通常无法应对未知威胁,而差分隐私则倾向于通过向所有特征注入噪声来损害诊断性能。本文提出InfoShield,它在保持抑郁分类准确性的同时最小化语音表示与敏感属性之间的互信息。我们发现标准MINE估计器因时间-静态错位而难以处理序列语音,并引入带有跨模态注意力的TimeAwareMINE来对齐声学帧与属性嵌入。在Androids语料库上的实验表明,InfoShield将性别推断从92.6%降至55.5%,年龄推断从55.7%降至30.3%,且效用损失有限(F1降低6%),达到F1=0.784,而先前SOTA为0.723。

英文摘要

Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve this conflict. Adversarial training often fails against unseen threats, whereas Differential Privacy tends to compromise diagnostic performance by injecting noise across all features. This paper presents InfoShield, which minimizes mutual information between speech representations and sensitive attributes while preserving depression classification accuracy. We identify that standard MINE estimators struggle with sequential speech due to temporal-static misalignment, and introduce TimeAwareMINE with cross-modal attention to align acoustic frames with attribute embeddings. Experiments on the Androids Corpus show InfoShield reduces gender inference from 92.6\% to 55.5\% and age inference from 55.7\% to 30.3\% with limited utility loss (6\% F1 reduction), achieving F1=0.784 compared to prior SOTA's 0.723.
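For reference, the standard MINE estimator mentioned above maximises the Donsker-Varadhan lower bound on mutual information with a learned critic $T_\theta$. The first expression is that standard bound; the second is only an assumed form of the overall training objective, with $\mathcal{L}_{\text{dep}}$ the depression classification loss, $Z_\phi$ the speech representation, $A$ the sensitive attribute, and $\lambda$ a hypothetical trade-off weight not given in the abstract.

```latex
% Donsker-Varadhan lower bound used by MINE; T_\theta is a learned critic network
I(Z; A) \;\ge\; \sup_{\theta}\;
  \mathbb{E}_{p(z,a)}\!\left[ T_\theta(z,a) \right]
  - \log \mathbb{E}_{p(z)\,p(a)}\!\left[ e^{T_\theta(z,a)} \right]

% Assumed combined objective (\lambda is a hypothetical trade-off weight)
\min_{\phi}\; \mathcal{L}_{\text{dep}}(\phi) + \lambda\, \hat{I}_\theta\!\left(Z_\phi; A\right)
```

The temporal-static misalignment arises because $Z$ is a sequence of acoustic frames while $A$ is a single static label per speaker, so the critic needs some way to pool or attend over frames before scoring a pair.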

2606.05555 2026-06-05 cs.LG cs.AI 版本更新

Representation Learning Enables Scalable Multitask Deep Reinforcement Learning

表示学习实现可扩展的多任务深度强化学习

Johan Obando-Ceron, Lu Li, Scott Fujimoto, Pierre-Luc Bacon, Aaron Courville, Pablo Samuel Castro

发表机构 * Mila – Québec AI Institute(Mila魁北克人工智能研究所) Université de Montréal(蒙特利尔大学) McGill University(麦吉尔大学) CIFAR AI Chair(CIFAR人工智能讲席) Google DeepMind(谷歌DeepMind)

AI总结 本文提出一种结合预测性表示学习与高容量值函数近似的无模型算法MR.Q,在无需规划的情况下,在多任务连续控制任务中超越基于世界模型的方法和多种深度强化学习基线,并显著降低计算开销。

详情
AI中文摘要

将强化学习扩展到多样化的多任务设置仍然是一个核心挑战。虽然基于模型的强化学习的最新进展取得了强劲的性能,但它们依赖于规划和复杂的训练流程,使得不清楚哪些组件对可扩展性至关重要。我们重新审视这个问题,并认为可扩展多任务强化学习的主要驱动力不是基于模型的控制,而是\emph{表示学习}。特别地,我们表明,将预测性的、基于模型的表示与高容量值函数逼近相结合,即使没有规划,也足以实现强劲的性能。我们评估了一种简单的无模型算法MR.Q,将辅助预测目标与可扩展的actor-critic架构相结合。这种方法在多样化的多任务连续控制任务套件中优于最近基于世界模型的方法和一系列深度强化学习基线,同时显著降低了计算开销并提高了实际时间效率。我们观察到随着模型容量的增加而持续改进,并通过消融实验表明预测性表示学习对性能至关重要。

英文摘要

Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalability. We revisit this question and argue that the primary driver of scalable multitask RL is not model-based control, but \emph{representation learning}. In particular, we show that combining predictive, model-based representations with high-capacity value function approximation is sufficient to achieve strong performance, even without planning. We evaluate a simple model-free algorithm, MR.Q, which combines auxiliary predictive objectives with a scalable actor-critic architecture. This approach outperforms a recent world-model-based method and a range of deep RL baselines across a diverse suite of multitask continuous control tasks, while significantly reducing computational overhead and improving wall-clock efficiency. We observe consistent improvements with increased model capacity and show through ablations that predictive representation learning is critical for performance.

2606.05553 2026-06-05 cs.CL cs.AI 版本更新

ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

ArcANE:角色扮演语言代理是否在正确的时间保持角色?

Woojung Song, Nalim Kim, Sangjun Song, Chaewon Heo, Jongwon Lim, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院)

AI总结 提出ArcANE基准,通过角色弧将叙事分段,评估角色扮演语言代理在不同阶段是否与角色心理轨迹一致,实验表明基于角色弧的上下文策略最优,尤其在源文本外场景。

详情
AI中文摘要

角色扮演语言代理(RPLAs)应扮演其价值观和行为随故事发展而演变的角色,而非保持固定人格。现有基准衡量给定章节的事实回忆,而非回应是否与角色的心理轨迹一致,尤其是在源文本从未探索的场景中。我们引入ArcANE(弧感知叙事评估),一个自动构建的基准,涵盖17部小说和80个主要角色。角色弧将叙事沿心理轴分段,每个探针在多个阶段提出相同场景,涵盖源文本内和源文本外情境。在六个模型和六种上下文模式下,基于角色弧的条件在每项模型上均优于所有其他上下文策略,且在源文本外场景(检索无法找到信息)中差距最大。我们进一步在同一数据上微调开放权重模型,得到ArcANE-8B/32B,在源文本外场景中进一步扩大了弧优势。

英文摘要

Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story progresses, not maintain a fixed persona. Existing benchmarks measure factual recall at a given chapter, not whether responses align with the character's psychological trajectory, especially in scenarios the source text never explores. We introduce ArcANE (Arc-Aware Narrative Evaluation), an automatically constructed benchmark spanning 17 novels and 80 principal characters. A Character Arc segments the narrative into phases along a psychological axis, and each probe poses the same scenario across phases, spanning both situations within the source text and situations beyond it. Across six models and six context modes, conditioning on the Character Arc tops every other context strategy on every model, and the gap is largest on scenarios outside the source text where retrieval has nothing to find. We further fine-tune open-weight models on the same data to obtain ArcANE-8B/32B, which widen the Arc advantage even more on scenarios outside the source text.

2606.05552 2026-06-05 cs.LG cs.AI cs.GR 版本更新

Balancing Image Compression and Generation with Bootstrapped Tokenization

平衡图像压缩与生成:自引导分词

Haozhe Chi, Jinghan Li, Hao Jiang, Wu Sheng, Yi Ma, Jing Wang, Yadong Mu

发表机构 * Peking University(北京大学) Central Media Technology Institute, Huawei(华为中央媒体技术研究所)

AI总结 提出SelfBootTok方法,通过自引导学习将图像信息分解为全局和局部标记组,使生成器仅依赖全局标记,减少40%计算量并提升重建与生成质量,以64个标记实现1.56的gFID新纪录。

详情
AI中文摘要

尽管图像分词取得了进展,但标准方法通过在每个标记中混合所有粒度来编码冗余信息,因此标记之间仍存在冗余。不同粒度信息的混合也增加了生成器训练的复杂性。本文介绍了SelfBootTok,一种通过将信息干净地分解为全局和局部标记组来解决此问题的方法。通过自引导学习,模型仅从全局标记预测局部细节,将视觉细节的负担从生成器转移到分词器。因此,我们的生成器效率更高,仅需全局标记,计算量减少约40%,同时提供更优的重建和生成。此外,该范式优雅地扩展:通过利用更多数据或参数来自监督局部表示学习,SelfBootTok仅使用64个标记就实现了1.56的最优gFID分数。

英文摘要

Despite progress in image tokenization, standard methods encode redundant information by mixing all granularities within each token, so redundancy persists between tokens. Mixing information of different granularities also complicates generator training. This paper introduces SelfBootTok, a method that resolves this by cleanly decomposing information into global and local token groups. Through self-bootstrapped learning, the model predicts local details exclusively from global tokens, shifting the burden of visual details from the generator to the tokenizer. Consequently, our generator is far more efficient, requiring only global tokens and reducing computation by approximately 40%, while delivering superior reconstruction and generation. Moreover, this paradigm scales elegantly: by leveraging more data or parameters to self-supervise local representation learning, SelfBootTok achieves a new state-of-the-art gFID score of 1.56 using only 64 tokens.

2606.05548 2026-06-05 cs.SE cs.AI 版本更新

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

ADK Arena: 通过LLM即开发者评估智能体开发工具包

Jintao Huang, Xiaomin Li, Gaurav Mittal, Yu Hu

发表机构 * The Ohio State University(俄亥俄州立大学) Microsoft(微软)

AI总结 提出LLM-as-a-Developer方法,通过自动化流水线ADK Arena评估51个Python ADK框架,发现框架间生成成本差异达5.6倍,但无单一框架占优,且文档、源码和参数知识可相互替代。

Comments Work in Progress

详情
AI中文摘要

智能体开发工具包(ADK)作为构建LLM驱动自主智能体的SDK级框架的快速普及,已经超越了关于框架选择如何影响智能体性能的任何实证理解。我们提出“LLM即开发者”(LLM-as-a-Developer)方法,用LLM编码智能体替代人类开发者,该智能体从文档中学习每个框架的API,编写智能体代码,并通过验证-反馈循环迭代修复直到测试通过。通过保持开发者不变而仅改变框架,生成工作成为API可用性的定量代理,生成的智能体提供了框架有效性的受控度量。我们在ADK Arena中实现这一点,这是一个完全自动化的流水线,具有每个框架的Docker隔离、三级验证流水线以及针对SWE-bench、$τ^2$-bench、Terminal-Bench和MCP-Atlas的基准适配器。评估所有51个流行的Python ADK框架(204个智能体-基准对),我们发现:(1)生成在57%的运行中成功,其成本在框架间变化5.6倍(每个智能体0.6美元至3.4美元),这是API复杂性的定量代理,尽管成本本身不能预测成功;(2)没有单一框架占主导:最佳单基准ADK智能体解决了高达80%的任务,甚至能以一小部分成本击败通用前沿编码智能体,但中位数框架仅解决32%;(3)在信息源消融实验中,真正的框架使用率保持在狭窄的28-40%范围内(原始源码访问时最高,无参考材料时仍为33%),表明文档、源代码和参数知识在很大程度上是可替代的,而不是任何一个成为硬瓶颈。

英文摘要

The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose \textbf{LLM-as-a-Developer}, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. By holding the developer constant and varying only the framework, generation effort becomes a quantitative proxy for API usability and the resulting agents provide a controlled measure of framework effectiveness. We implement this in \textbf{ADK Arena}, a fully automated pipeline with per-framework Docker isolation, a three-level validation pipeline, and benchmark adapters for SWE-bench, $τ^2$-bench, Terminal-Bench, and MCP-Atlas. Evaluating all 51 popular Python ADK frameworks (204 agent--benchmark pairs), we find that: (1)~generation succeeds for 57\% of runs, and its cost varies 5.6$\times$ across frameworks (\$0.6 to \$3.4 per agent), a quantitative proxy for API complexity, though cost alone does not predict success; (2)~no single framework dominates: the best single-benchmark ADK agents resolve up to 80\% of tasks and can even \emph{beat} general-purpose frontier coding agents at a fraction of the cost, yet the median framework resolves only 32\%; (3)~across information-source ablations, genuine framework usage stays within a narrow 28--40\% band (highest with raw source access and still 33\% with no reference material at all), indicating that documentation, source code, and parametric knowledge are largely substitutable rather than any one being a hard bottleneck.
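The validate-and-feedback loop described above can be sketched generically. This is not ADK Arena's code: `write_code` and `validate` are hypothetical callables standing in for the LLM coding agent and the three-level validation pipeline, and `max_rounds` is an assumed budget.

```python
def generate_with_repair(write_code, validate, max_rounds=5):
    """Regenerate agent code until validation passes or the budget runs out.

    write_code(feedback) -> code string (feedback is None on the first round).
    validate(code)       -> (passed: bool, feedback: str).
    Returns (code, rounds_used), or (None, max_rounds) if every round failed.
    """
    feedback = None
    for round_no in range(1, max_rounds + 1):
        code = write_code(feedback)
        passed, feedback = validate(code)
        if passed:
            return code, round_no
    return None, max_rounds
```

Because the same developer loop is reused for every framework, the number of rounds (and the tokens spent in them) is what turns generation effort into a comparable usability signal.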

2606.05535 2026-06-05 cs.CV cs.AI 版本更新

Noise-Aware Visual Representation Learning for Medical Visual Question Answering

面向医学视觉问答的噪声感知视觉表示学习

I Putu Adi Pratama, Bahadorreza Ofoghi, Atul Sajjanhar, Shang Gao

发表机构 * Deakin University(迪肯大学)

AI总结 提出一种噪声感知的医学视觉问答框架,通过去噪自编码器学习鲁棒的视觉表示,并利用低秩适配高效微调,在SLAKE和PathVQA基准上提升了抗噪性和性能。

Comments 15 pages, 2 figures. Conference submission

详情
AI中文摘要

医学视觉问答(Med-VQA)通过使AI模型能够解释医学图像并回答临床相关问题,在临床决策支持方面具有巨大潜力。近期方法通常通过轻量级映射网络将现成的视觉编码器与大语言模型(LLM)连接起来,以降低计算成本。然而,这些方法往往忽视了处理视觉表示中噪声和小无关变化的重要性。为应对这些挑战,我们提出了一种噪声感知的Med-VQA框架,该框架在视觉嵌入映射到LLM输入空间之前,引入了一个去噪自编码器。去噪自编码器经过预训练,能够从被破坏的输入中重建干净的视觉嵌入,从而鼓励模型学习对噪声不敏感的鲁棒视觉表示。然后,使用多层感知器(MLP)将得到的嵌入投影到语言模型嵌入空间中,形成为LLM提供图像信息的视觉前缀令牌。为了实现无需完全重新训练的高效适配,我们采用低秩适配(LoRA)进行参数高效微调。所提出的方法在SLAKE和PathVQA基准上进行了评估。实验结果表明,该方法在多个评估标准下对噪声输入嵌入具有更强的鲁棒性,同时保持了有竞争力的干净性能。这些发现表明,学习更鲁棒的视觉表示可以提升Med-VQA的性能和鲁棒性。

英文摘要

Medical visual question answering (Med-VQA) has strong potential for clinical decision support by enabling AI models to interpret medical images and answer clinically relevant queries. Recent approaches typically connect off-the-shelf vision encoders with large language models (LLMs) through lightweight mapping networks to reduce computational cost. However, these methods often overlook the importance of handling noise and small irrelevant changes in visual representations. To address these challenges, we propose a noise-aware Med-VQA framework that incorporates a denoising autoencoder before visual embeddings are mapped into the input space of an LLM. The denoising autoencoder is pretrained to reconstruct clean visual embeddings from corrupted inputs, encouraging the model to learn robust visual representations that are less sensitive to noise. The resulting embeddings are then projected into the language model embedding space using a multi-layer perceptron (MLP), forming visual prefix tokens that provide image information to the LLM. To enable efficient adaptation without full retraining, we employ parameter-efficient fine-tuning using low-rank adaptation (LoRA). The proposed method is evaluated on the SLAKE and PathVQA benchmarks. Experimental results show improved robustness to noisy input embeddings while maintaining competitive clean performance across multiple evaluation criteria. These findings suggest that learning more robust visual representations can enhance Med-VQA performance and robustness.

2606.05533 2026-06-05 cs.LG cs.AI cs.CV cs.RO 版本更新

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

物体能做什么,而非它们是什么:面向功能可供性推理的功能潜在空间

Rohan Siva, Neel P. Bhatt, Yunhao Yang, Seoyoung Lee, Nishant Gadde, Christian Ellis, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Neurosymbolic Intelligence(神经符号智能) University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 提出A4D框架,通过构建基于功能可供性的共享潜在空间,将视觉观察映射到该空间并测量与可供性的距离,实现基于物体功能而非外观的规划推理,显著提升泛化能力和推理效率。

Comments Code, videos, and data available at: https://A4Dance-reasoning.github.io

详情
AI中文摘要

现有的机器人规划系统依赖于基于外观的推理,其中视觉观察被编码到围绕物体外观组织的潜在空间中(例如,根据外观识别“手推车”)。然而,规划需要推理物体的任务相关功能(例如,物体是否“可移动”),而基于外观的潜在空间无法捕捉这些信息。因此,现有方法难以泛化到新颖的机器人-物体交互。我们通过功能可供性推理解决这一泛化能力有限的问题,使规划基于任务相关的物体功能而非仅外观。我们提出A4D,它将视觉观察映射到一个围绕可供性(例如“可移动”)组织的共享潜在空间中。通过将视觉观察投影到这个功能潜在空间并测量它们与可供性的接近程度,A4D推断出与观察物体相关的功能。此外,我们引入了一种可供性发现机制,扩展潜在空间以处理现有可供性不足的未见场景。A4D利用功能潜在空间中的接近度来量化可供性推理的不确定性,并选择性地触发可供性发现。我们在涉及多样化和未见可供性的多个规划任务上评估A4D。A4D在现有可供性上达到94%的推理准确率,比最先进方法高出超过15个百分点;在不到原始训练数据10%的情况下,将新可供性推理准确率从70%提升到90%以上,并实现100倍更快的推理。代码、视频和数据可在https://A4Dance-reasoning.github.io获取。

英文摘要

Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is "movable"), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., "movable"). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points; improves new-affordance inference accuracy from 70% to over 90% with fewer than 10% of the original training data; and enables 100x faster inference. Code, videos, and data are available at: https://A4Dance-reasoning.github.io.
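Proximity-based affordance inference with an uncertainty trigger can be sketched as a nearest-prototype lookup. The details are assumptions for illustration only: affordances are represented as prototype vectors, proximity is cosine distance, and a distance above `max_dist` is treated as "uncertain, trigger discovery". None of these names or values come from the paper.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

def infer_affordance(z, prototypes, max_dist=0.3):
    """Return (affordance, distance), or (None, distance) if too uncertain.

    z: latent embedding of the observation (assumed given).
    prototypes: dict mapping affordance name -> prototype vector.
    A None result is where affordance discovery would be triggered.
    """
    name, dist = min(
        ((n, cosine_distance(z, p)) for n, p in prototypes.items()),
        key=lambda item: item[1],
    )
    if dist > max_dist:
        return None, dist
    return name, dist
```

Under these assumptions, an embedding close to the "movable" prototype is labelled movable regardless of what the object looks like, while an embedding far from every prototype is handed to the discovery step instead of being forced into an existing category.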

2606.05532 2026-06-05 cs.AI cs.HC 版本更新

Individual Gain, Collective Loss: Metacognitive Adaptation in AI-Assisted Creativity

个体增益,集体损失:AI辅助创造力中的元认知适应

Anna Mikeda

发表机构 * Anna Mikeda(安娜·米凯达)

AI总结 本研究提出选择性元认知适应机制,解释AI为何提升个体创造力却降低集体多样性,并构建六种元认知能力的分类框架。

Comments 6 pages. AAAI 2026 paper

详情
AI中文摘要

近期研究揭示了一个悖论:AI提升了个体创造性产出,同时减少了集体多样性。当前的解释——认知卸载和过度依赖——识别了症状但未阐明机制。我们提出选择性元认知适应:常规AI使用重新分配而非均匀减少元认知努力。某些能力被增强(伙伴建模、表面控制),而其他能力则系统性缺乏支持(原创性评估、反思性整合)。这种再分配解释了个体满意度和集体趋同。我们提出了一个按时间阶段组织的六种元认知能力分类,描述了它们在常规AI使用下的倾向,并展示了个体理性适应如何产生涌现的社会成本。该框架为研究人员提供了具体预测,为从业者提供了设计原则,以保护个体创造性满意度和集体创造性多样性。

英文摘要

Recent studies reveal a paradox: AI enhances individual creative outputs while reducing collective diversity. Current explanations -- cognitive offloading and over-reliance -- identify symptoms but not mechanisms. We propose selective metacognitive adaptation: routine AI use redistributes rather than uniformly diminishes metacognitive effort. Some capacities are amplified (partner modeling, surface control), while others are systematically under-supported (originality evaluation, reflective integration). This redistribution explains both individual satisfaction and collective convergence. We present a taxonomy of six metacognitive capacities organized by temporal phase, characterize their tendencies under routine AI use, and show how individually rational adaptation produces emergent social costs. The framework generates specific predictions for researchers and design principles for practitioners seeking to preserve both individual creative satisfaction and collective creative diversity.

2606.05531 2026-06-05 cs.CV cs.AI cs.CL cs.LG 版本更新

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Almieyar-Oryx-BloomBench:一个用于视觉语言模型认知知情评估的双语多模态基准

Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Zuse School(Zuse学校) Qatar Computing Research Institute (QCRI)(卡塔尔计算研究所) Hamad Bin Khalifa University(哈马德·本·哈利法大学)

AI总结 针对现有基准无法诊断视觉语言模型真实推理能力的问题,提出基于Bloom认知分类学的双语多模态基准BloomBench,系统评估六个认知层次,揭示模型在事实回忆和创造性合成方面的深层局限。

Comments Accepted to ACL 2026 Findings

详情
AI中文摘要

尽管视觉语言模型(VLM)取得了快速进展,但该领域缺乏能够严格诊断其真实推理能力并描绘出向类人多模态智能有意义进展的基准。大多数现有评估侧重于零散或脱节的任务,掩盖了关键的认知弱点,并为有针对性的改进提供了很少的见解。为了弥补这一差距,我们引入了BloomBench,这是Almieyar基准系列的一部分,也是第一个基于人类认知的、双语(英语-阿拉伯语)的多模态VLM基准。基于Bloom分类学,BloomBench通过精心设计的图像-问题-答案任务系统地评估六个认知层次(记忆、理解、应用、分析、评估、创造)。通过半自动化流水线构建,并通过分层混合质量保证协议验证,确保了可扩展性、文化包容性和语言保真度。利用这一框架,我们对最先进的VLM进行了全面研究,以诊断其认知特征。我们的分析揭示了明显的认知不对称:尽管最先进的模型在语义理解方面达到了强大的性能上限,但它们在事实回忆和创造性合成方面存在显著困难。这表明当前的一般多模态能力掩盖了特定认知层次的深层局限性。此外,我们的研究突出了阿拉伯语和英语之间的关键性能差距,暴露了当前跨语言多模态推理的局限性。这些发现为开发更符合认知和包容性的VLM奠定了基础。基准框架和数据集可在以下网址获取:https://github.com/qcri/Almieyar-Oryx-BloomBench。

英文摘要

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.

2606.05528 2026-06-05 cs.AI 版本更新

When Should We Protect AI? A Precautionary Framework for Consciousness Uncertainty

何时应保护AI?一个针对意识不确定性的预防性框架

Anna Mikeda

发表机构 * Anna Mikeda(安娜·米凯达)

AI总结 针对现有框架仅评估AI系统是否具有意识但缺乏行动指导的问题,本文提出一个基于预防原则的框架,通过五个福利相关维度、阈值与梯度混合机制以及跨维度聚合方法,将意识证据映射为分级的保护义务,并通过案例研究提供设计指导。

Comments 7 pages. AAAI 2026 paper

详情
AI中文摘要

现有框架评估AI系统是否可能具有意识,但未提供如何处理该评估的指导。我们通过一个预防性框架填补这一空白,该框架将意识证据映射为分级的保护义务。该框架包含三个组成部分:(1) 五个福利相关维度——现象意识、情感效价、元认知意识、自我叙事和能动性——每个维度都基于既定的意识科学,并与不同的道德关切相联系;(2) 一个阈值加梯度的混合机制,既指定了触发新义务类别的二元阈值,也指定了保护权重的连续缩放;(3) 两种跨维度聚合的互补方法,一种是层次化的(借鉴Bach和Sorensen的机器意识假说),另一种是与架构无关的。我们通过Replika和OpenClaw的案例研究来操作化该框架,展示占据不同维度空间的系统如何触发不同的义务,并为构建接近意识相关阈值的系统的开发者提供设计指导。该框架与架构无关,适用于神经、符号和神经符号系统,旨在使意识科学对当今面临不确定性的组织具有决策相关性。

英文摘要

Existing frameworks assess whether AI systems might be conscious but provide no guidance on what to do with that assessment. We address this gap with a precautionary framework that maps consciousness evidence to graduated protective obligations. The framework comprises three components: (1) five welfare-relevant dimensions--phenomenal consciousness, affective valence, metacognitive awareness, self-narrative, and agency--each grounded in established consciousness science and linked to distinct moral concerns; (2) a threshold-plus-gradation hybrid specifying both binary triggers for new obligation categories and continuous scaling of protective weight; and (3) two complementary approaches to cross-dimensional aggregation, one hierarchical (drawing on Bach and Sorensen's Machine Consciousness Hypothesis) and one architecture-agnostic. We operationalize the framework through worked case studies of Replika and OpenClaw, demonstrating how systems occupying different regions of the dimensional space trigger different obligations, and derive design guidance for developers building systems near consciousness-relevant thresholds. The framework is architecture-agnostic, applying across neural, symbolic, and neurosymbolic systems, and aims to make consciousness science decision-relevant for organizations navigating uncertainty today.

2606.05525 2026-06-05 cs.AI cs.HC 版本更新

SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization

SciVisAgentSkills:面向科学数据分析和可视化的智能体技能设计与评估

Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Shusen Liu, Chaoli Wang

发表机构 * Univ. Notre Dame(圣母大学) LLNL(劳伦斯利弗莫尔国家实验室)

AI总结 提出SciVisAgentSkills技能库,通过编码环境假设、工具使用模式和领域启发式知识增强编码智能体,在ParaView等科学工具上实现自然语言驱动的科学可视化工作流,实验表明技能可提升任务得分并影响token效率。

详情
AI中文摘要

近期智能体可视化的进展使得自然语言能够转化为可执行的科学可视化工作流。尽管通用编码智能体展现出强大能力,但它们往往缺乏科学可视化任务所需的特定工具专业知识。在这项工作中,我们提出了SciVisAgentSkills,这是一个可重用的智能体技能集合,通过编码环境假设、工具使用模式和跨科学工具(如ParaView、napari、VMD和TTK)的领域启发式知识,增强用于科学数据分析和可视化的编码智能体。我们使用SciVisAgentBench(一个包含108个专家设计的多步骤任务的基准测试)在Codex和Claude Code上评估这些技能。结果表明,智能体技能提高了评估套件中的平均任务得分,其token效率收益取决于智能体框架和工具设置。这些发现强调了结构化程序知识对于实现可靠、长周期科学可视化工作流的重要性,同时也表明技能应与加载和应用它们的执行框架一起研究。技能可在https://github.com/KuangshiAi/SciVisAgentSkills获取。

英文摘要

Recent advances in agentic visualization have enabled the translation of natural language into executable scientific visualization (SciVis) workflows. While general-purpose coding agents show strong capabilities, they often lack the tool-specific expertise required for SciVis tasks. In this work, we present SciVisAgentSkills, a collection of reusable agent skills that augment coding agents for scientific data analysis and visualization by encoding environment assumptions, tool usage patterns, and domain heuristics across scientific tools such as ParaView, napari, VMD, and TTK. We evaluate these skills on Codex and Claude Code using SciVisAgentBench, a benchmark of 108 expert-designed multi-step tasks. Results show that agent skills improve mean task scores across the evaluated suites, with token-efficiency benefits that depend on the agent harness and tool setting. These findings highlight the importance of structured procedural knowledge for enabling reliable, long-horizon SciVis workflows, while also showing that skills should be studied alongside the execution harness that loads and applies them. The skills are available at https://github.com/KuangshiAi/SciVisAgentSkills.

2606.05522 2026-06-05 cs.SD cs.AI eess.AS 版本更新

Exploring LLMs for South Asian Music Understanding and Generation

探索大语言模型对南亚音乐的理解与生成

Faria Binte Kader, Mohtasim Hadi Rafi, Shah Wasif Sajjad, Santu Karmaker

发表机构 * University of Central Florida(中佛罗里达大学) Auburn University(奥本大学)

AI总结 本文系统评估大语言模型在基于拉格和塔拉的南亚古典音乐理解与生成任务中的表现,发现前沿模型在理解任务上准确率达85-90%,但生成任务中风格忠实度仅40%。

Comments 19 pages, 7 figures

详情
AI中文摘要

近年来,大语言模型(LLMs)在音乐理解和生成任务中展现出令人瞩目的成果。然而,现有研究仍局限于西方调性传统,未能揭示当前LLMs能否处理结构独特的低资源音乐传统。我们首次系统评估LLMs在南亚古典音乐中的能力——这种传统由拉格(raga)和塔拉(tala)的旋律约束主导,其结构原则与西方和声驱动音乐根本不同。我们的评估基于印度斯坦古典理论和孟加拉古典形式,包括拉宾德拉(Rabindra)和纳兹鲁尔(Nazrul)歌曲——南亚古典音乐中具有代表性的低资源传统。在音乐理解评估中,我们引入了一个包含504个问答的基准测试,涵盖拉格语法、文化知识和符号记谱推理,评估了33个LLMs,其中前沿模型如Gemini 2.5 Pro达到85-90%的准确率,而大多数开源模型仅在23-40%范围内。在音乐生成方面,我们设计了一个五级受控提示框架,发现即使最强的模型也只有40%的时间能产生风格忠实的输出。这些结果表明,音乐生成中的结构有效性和风格忠实度是不同的目标,并突显了文化基础音乐建模的一个开放挑战。

英文摘要

Recent advancements in Large Language Models (LLMs) have shown promising results in music understanding and generation tasks. However, existing works remain confined to Western tonal traditions, offering little insight into whether current LLMs can handle structurally distinct low-resource musical traditions. We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga, tala-based melodic constraints that impose fundamentally different structural principles from Western harmony-driven music. We ground our evaluation in Hindustani classical theory and Bengali classical forms, including Rabindra and Nazrul Sangeet -- representative low-resource traditions within South Asian classical music. For music understanding evaluation, we introduce a 504-question-answer benchmark spanning raga grammar, cultural knowledge, and symbolic notation reasoning, evaluating 33 LLMs where frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful outputs only 40% of the time. These results reveal that structural validity and stylistic faithfulness in music generation are distinct objectives and highlight an open challenge for culturally grounded music modeling.

2606.05513 2026-06-05 cs.AI cs.CL 版本更新

EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts

EpiEvolve:面向阶段转变下流式疫情预测的自演化智能体

Yiming Lu, Sihang Zeng, Zhengxu Tang, Max Lau, Fei Liu, Wei Jin

发表机构 * Emory University(埃默里大学) University of Washington(华盛顿大学)

AI总结 针对流式疫情预测中标签延迟和流行阶段转变问题,提出自演化智能体EpiEvolve,通过层次化情景记忆、延迟标签反思和阶段感知检索,在COVID-19住院趋势预测中达到0.629准确率,并将阶段转变后的恢复滞后从5周缩短至2周。

详情
AI中文摘要

流行病LLM预测器通常作为静态监督模型进行训练和评估,而实际疫情预测是一个流式过程,其中标签在预测之后到达,疾病流行阶段随时间变化。我们研究了在五个变异株阶段下的每周COVID-19住院趋势预测中的这种不匹配。我们引入了EpiEvolve,一个自演化智能体,它封装了一个在预热期训练好的LLM预测器,并在流式过程中保持其权重固定。EpiEvolve通过将预测结果存储在层次化情景记忆中进行适应,反思延迟标签,检索与当前阶段相关的案例,并将重复出现的错误提炼为策略规则。由此产生的上下文让预测器在遵循防止未来泄漏的时间顺序协议的同时,在后续周中重用其自身的过去预测和结果。在流式数据集上,EpiEvolve达到了0.629的平均准确率,而静态骨干模型为0.561,外部CDC集成模型为0.325,并将阶段转变后的恢复滞后从5周缩短到2周。消融实验表明,反思、策略记忆和阶段感知检索各自对性能提升有贡献。

英文摘要

Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches 0.629 average accuracy, compared with 0.561 for the static backbone and 0.325 for the external CDC ensemble, and reduces recovery lag after regime shifts from 5 to 2 weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.
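The chronological protocol described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `stream_with_delayed_labels`, the fixed `delay` in weeks, and the flat list used as memory are all assumptions made for illustration (the abstract describes a hierarchical episodic memory, which is not modelled here). The only point shown is that a label for week `w` becomes visible to the forecaster no earlier than week `w + delay`.

```python
from collections import deque


def stream_with_delayed_labels(inputs, labels, delay, predict):
    """Run a forecaster week by week while revealing labels only after `delay` weeks.

    `predict(x, memory)` is a hypothetical callable; `memory` holds only
    (week, prediction, label) triples whose labels have already arrived.
    """
    memory = []        # resolved cases the forecaster is allowed to see
    pending = deque()  # (week, prediction) pairs still waiting for a label
    predictions = []
    for week, x in enumerate(inputs):
        # Reveal every label whose delay has elapsed before predicting this week.
        while pending and pending[0][0] + delay <= week:
            past_week, past_pred = pending.popleft()
            memory.append((past_week, past_pred, labels[past_week]))
        pred = predict(x, list(memory))  # pass a copy so predict cannot mutate memory
        predictions.append(pred)
        pending.append((week, pred))
    return predictions, memory
```

Because labels enter `memory` only inside the loop and only for weeks at least `delay` steps in the past, a forecaster driven this way cannot see outcomes from the future.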

2606.05509 2026-06-05 cs.HC cs.AI 版本更新

The Role of Instructional Guidance in Generative AI-Assisted Learning: Empirical Evidence from Construction Engineering Education

教学指导在生成式AI辅助学习中的作用:来自建筑工程教育的实证证据

Xiaoyu Hou, Bo Xiao, Hexu Liu, Shane Mueller

发表机构 * Dept. of Civil, Environmental, and Geospatial Engineering, Michigan Technological Univ.(密歇根理工大学土木、环境与地理空间工程系) Dept. of Civil and Construction Engineering, Western Michigan Univ.(西密歇根大学土木与建筑工程系) Dept. of Psychology and Human Factors, Michigan Technological Univ.(密歇根理工大学心理学与人因工程系)

AI总结 本研究通过引入基于生成学习理论的五步提示框架,在建筑工程教育中对比无提示AI辅助、有提示AI辅助和幻灯片学习三种条件,发现提示框架显著提升了需要解释和推理的任务表现(开放式评分提高约2-3分,p<0.01),表明AI辅助学习的有效性取决于交互结构。

详情
AI中文摘要

生成式人工智能(AI)越来越多地被用于支持自主学习,然而学生与此类系统的交互往往缺乏结构性,限制了对更深层次认知过程的参与。本研究探讨了教学指导如何塑造建筑工程教育中学生与AI的交互。引入了一个基于生成学习理论(GLT)的五步提示框架,以指导学习者在复习活动中的交互。一项对照实验比较了三种学习条件:基于幻灯片的学习、无提示的AI辅助学习和有提示的AI辅助学习。学习表现通过多项选择和开放式任务进行评估,用户体验通过用户体验问卷(UEQ)测量。表现差异集中在需要解释和推理的任务上。有提示条件在开放式任务上得分更高,在18分量表上提高了约2或3分(p < 0.01),而多项选择表现无显著差异。无提示条件与基于幻灯片的学习相当。这些发现表明,AI辅助学习的有效性取决于交互如何结构化。所提出的框架为将学习科学原理整合到建筑工程教育的生成式AI系统中提供了基础。

英文摘要

Generative artificial intelligence (AI) is increasingly used to support self-directed learning, yet student interaction with such systems often remains unstructured, limiting engagement in deeper cognitive processes. This study examines how instructional guidance shapes student-AI interaction in construction education. A five-step prompting framework grounded in Generative Learning Theory (GLT) is introduced to guide learner interaction during review activities. A controlled experiment compares three learning conditions: slide-based learning, unprompted AI-supported learning, and prompted AI-supported learning. Learning performance is assessed using multiple-choice and open-ended tasks, and user experience is measured using the User Experience Questionnaire (UEQ). Performance differences are concentrated on tasks requiring explanation and reasoning. The prompted condition achieves higher open-ended scores, with an improvement of approximately 2 to 3 points on an 18-point scale (p < 0.01), while no significant differences are observed in multiple-choice performance. The unprompted condition remains comparable to slide-based learning. These findings indicate that the effectiveness of AI-supported learning depends on how interaction is structured. The proposed framework provides a basis for integrating learning science principles into generative AI systems for construction education.

2606.05494 2026-06-05 cs.CL cs.AI 版本更新

MASF: A Multi-Model Adaptive Selection Framework for Abstractive Text Summarization

MASF:面向抽象式文本摘要的多模型自适应选择框架

Ahmed Alansary, Ali Hamdi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种多模型自适应选择框架,通过集成多个微调的Transformer模型并基于自动评估指标选择最佳摘要,在CNN/DailyMail数据集上BERTScore达88.63%,优于GPT3-D2等大模型。

Comments 6 pages, 3 figures, IMSA2026

详情
AI中文摘要

自动文本摘要因数字文本信息的快速增长而变得日益重要。本文提出一种多模型自适应摘要框架,旨在提高抽象式文本摘要的鲁棒性和质量。依赖单一模型往往导致在不同结构和主题的文章上摘要质量不一致。为解决这一局限,所提框架集成了多个微调的基于Transformer的摘要模型,并引入自适应选择机制。在该框架中,每个模型独立为同一输入文章生成候选摘要。然后使用自动评估指标评估生成的摘要,这些指标同时捕捉词汇相似性和语义相关性。基于这些分数,框架选择最高质量的摘要作为最终输出。模型在广泛使用的CNN/DailyMail新闻摘要数据集上进行微调和评估。实验结果表明,所提框架在所有比较方法中取得了最高的BERTScore,达到88.63%。它还优于多个大语言模型,如GPT3-D2、Falcon-7b和Mpt-7b,突显了其有效性和鲁棒性。这些发现强调了在自适应选择策略中利用多个基于Transformer的模型来提高自动文本摘要系统质量和鲁棒性的有效性。

英文摘要

Automatic text summarization has become increasingly important due to the rapid growth of digital textual information. This paper presents a Multi-Model Adaptive Summarization Framework designed to improve the robustness and quality of abstractive text summarization. Relying on a single model often leads to inconsistent summarization quality across articles with varying structures and topics. To address this limitation, the proposed framework integrates multiple fine-tuned transformer-based summarization models and introduces an adaptive selection mechanism. In this framework, each model independently generates a candidate summary for the same input article. The generated summaries are then evaluated using automatic evaluation metrics that capture both lexical similarity and semantic relevance. Based on these scores, the framework selects the highest-quality summary as the final output. The models are fine-tuned and evaluated on the widely used CNN/DailyMail news summarization dataset. Experimental results demonstrate that the proposed framework achieves the highest BERTScore among all compared methods with a score of 88.63%. It also outperforms several LLMs such as GPT3-D2, Falcon-7b, and Mpt-7b, highlighting its effectiveness and robustness. These findings highlight the effectiveness of leveraging multiple transformer-based models within an adaptive selection strategy to improve the quality and robustness of automatic text summarization systems.
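The generate-score-select loop in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's pipeline: the abstract does not say what text the candidates are scored against or how lexical and semantic metrics are combined, so the sketch takes an explicit `reference` string and uses a plain unigram-overlap F1 as a stand-in for the real metrics (BERTScore is not computed here). The names `unigram_f1` and `select_summary` are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (a crude lexical similarity score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_summary(candidates: dict[str, str], reference: str) -> tuple[str, float]:
    """Score each model's candidate and return (model_name, score) for the best one."""
    scored = {name: unigram_f1(text, reference) for name, text in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]
```

Swapping `unigram_f1` for a semantic metric, or for a weighted mix of several metrics, leaves the selection step unchanged.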

2606.05481 2026-06-05 cs.LG cs.AI eess.SP 版本更新

Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models

面向统一且数据高效的预测与健康管理:基于表格基础模型

Raffael Theiler, Lev Telyatnikov, Leandro Von Krannichfeldt, Olga Fink

发表机构 * IMOS Lab, EPFL(洛桑联邦理工学院IMOS实验室)

AI总结 提出利用表格基础模型通过上下文学习处理工业时间序列,实现预测与健康管理(PHM)任务,在低数据场景下表现优异,并优于序列模型和梯度提升树。

详情
AI中文摘要

数据驱动的预测与健康管理(PHM)利用时变状态监测数据来诊断系统状态并估计工程资产的剩余使用寿命。这些任务是维护规划的核心,但工业PHM数据通常是碎片化的、部分观测且标注不足,这阻碍了监督学习。基础模型提供了一条通往可重用预测系统的途径,然而大多数时间序列基础模型是为预测设计的,并假设长序列、连贯且规则采样。为弥补这一差距,我们提出了一个框架,利用上下文学习将表格基础模型应用于工业时间序列,并在多种PHM任务上对其进行评估。通过将原始单元级信号转换为表格行,我们展示了这些模型在多个任务(包括预测和诊断)上表现良好,且数据效率高。我们在统一的评估协议下,直接将其与序列模型、Transformer基线和梯度提升树进行比较。结果表明,表格基础模型在预测和诊断任务中取得了最佳平均排名。我们的发现进一步表明,基于PFN的模型在低数据场景下具有竞争力,时间上下文可以在表格表示中保留,且性能依赖于子采样下的代表性上下文构建。这些结果证明,表格基础模型为异构PHM问题提供了一个实用且通用的接口。

英文摘要

Data-driven Prognostics and Health Management (PHM) uses time-varying condition-monitoring data to diagnose system states and estimate remaining useful life in engineered assets. These tasks are central to maintenance planning, but industrial PHM data are often fragmented, partially observed, and poorly labeled, which hinders supervised learning. Foundation models offer a route toward reusable predictive systems, yet most time-series foundation models are designed for forecasting and assume long, coherent, regularly sampled sequences. To address this gap, we propose a framework for applying Tabular Foundation Models to industrial time series using in-context learning, and we evaluate them on a variety of PHM tasks. By converting raw unit-level signals into tabular rows, we show that these models perform well across multiple tasks, including prognostics and diagnostics, and are highly data efficient. We compare them directly with sequence models, transformer baselines, and gradient-boosted trees under a common evaluation protocol. The results indicate that tabular foundation models achieve the best average ranks across prognostic and diagnostic tasks. Our findings further show that PFN-based models are competitive in low-data regimes, that temporal context can be preserved in the tabular representation, and that performance depends on representative context construction under subsampling. These results demonstrate that tabular foundation models provide a practical and general interface for heterogeneous PHM problems.
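One plausible way to turn a raw unit-level signal into tabular rows is a sliding window with summary statistics, sketched below in Python. The abstract does not state which features or window scheme the authors use, so the function name `windows_to_rows`, the chosen statistics, and the `window` and `stride` parameters are all assumptions for illustration. Keeping `t_end` and a within-window `slope` in each row is one simple way temporal context could survive the conversion.

```python
from statistics import mean, pstdev


def windows_to_rows(signal, window, stride):
    """Summarise a 1-D signal into one feature row per sliding window.

    Requires window >= 2 so that the slope is defined.
    """
    rows = []
    for start in range(0, len(signal) - window + 1, stride):
        chunk = signal[start:start + window]
        rows.append({
            "t_end": start + window - 1,          # index of the last sample in the window
            "mean": mean(chunk),
            "std": pstdev(chunk),                 # population standard deviation
            "min": min(chunk),
            "max": max(chunk),
            "slope": (chunk[-1] - chunk[0]) / (window - 1),  # end-to-end trend
        })
    return rows
```

Each row can then be paired with a label (a health state or a remaining-useful-life value) and passed to a tabular model as context or query.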

2606.05464 2026-06-05 cs.AI 版本更新

Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

大语言模型中在扩展搜索空间上的逐步优化类推理

Nicolás Astorga, Nabeel Seedat, Mihaela van der Schaar

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出OPT*任务族,通过可验证奖励训练和搜索引导策略,提升LLM在扩展搜索空间中的逐步优化推理能力。

详情
AI中文摘要

可验证奖励训练改善了数学和编码推理,但这些领域仅涵盖了逐步决策的一部分。许多现实任务需要在众多有效备选方案中找到高价值的可行计划。我们引入OPT*,一个可扩展的优化风格任务族,用于沿复杂度轴训练和评估LLM的逐步优化类推理:每个任务提供可行性检查器和评估器,而复杂度参数扩展搜索空间,无需新的人工标签。这促使我们在两种机制下研究这些任务:(i) 求解器引导的在线策略优化,使用求解器作为部分状态的价值预言机,并应用基于排名的奖励塑造来强化更好的下一步;(ii) 当此类求解器不可用时,基于搜索的离线强化学习。理论上,我们将大搜索空间中的成功与推理者在每单位搜索预算中提取的信息联系起来。实证上,我们消融了使OPT*上搜索高效的要素,并表明在OPT*上训练改进了逐步优化类推理。

英文摘要

Verifiable reward training has improved mathematical and coding reasoning, but these domains capture only part of step-by-step decision making. Many real-world tasks require finding a high-value feasible plan among many valid alternatives. We introduce OPT*, a scalable family of optimization-style tasks for training and evaluating LLM step-by-step optimization-like reasoning along a complexity axis: each task provides a feasibility checker and evaluator, while a complexity parameter expands the search space without requiring new human labels. This motivates studying these tasks in two regimes: (i) solver-guided online policy optimization, which uses a solver as a value oracle for partial states and applies rank-based reward shaping to reinforce better next steps, and (ii) search-based offline RL when such solvers are unavailable. Theoretically, we relate success in large search spaces to the information a reasoner extracts per unit of search budget. Empirically, we ablate the ingredients that make search efficient on OPT* and show that training on OPT* improves step-by-step optimization-like reasoning.
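To make the "feasibility checker plus evaluator" interface and rank-based reward shaping concrete, here is a small Python sketch on a toy knapsack task. The knapsack setting is an assumption chosen for illustration (the abstract does not list the OPT* tasks), and `is_feasible`, `evaluate`, and `rank_rewards` are hypothetical names. In this toy, the number of items plays the role of the complexity parameter: adding items enlarges the search space without any new labels.

```python
def is_feasible(plan, weights, capacity):
    """A plan is a list of item indices; it must not repeat items or exceed capacity."""
    return len(set(plan)) == len(plan) and sum(weights[i] for i in plan) <= capacity


def evaluate(plan, values):
    """Objective value of a plan: the total value of the chosen items."""
    return sum(values[i] for i in plan)


def rank_rewards(scores):
    """Map raw scores to rewards in [0, 1] by rank; the best candidate gets 1.0.

    Ties are broken by original position, which is an arbitrary choice.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # ascending by score
    rewards = [0.0] * n
    for rank, idx in enumerate(order):
        rewards[idx] = rank / (n - 1) if n > 1 else 1.0
    return rewards
```

Rank-based rewards depend only on the ordering of candidate next steps, so they are insensitive to the scale of the underlying objective.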

2606.05449 2026-06-05 cs.AI cs.GT econ.EM 版本更新

Insurance of Agentic AI

代理型人工智能的保险

Quanyan Zhu

发表机构 * Department of Electrical and Computer Engineering, New York University, Tandon School of Engineering(纽约大学坦登工程学院电气与计算机工程系)

AI总结 本文分析了代理型AI带来的新型风险,提出了承保、定价、再保险和产品设计的框架,并构建了整合多种保险覆盖的协调架构。

详情
AI中文摘要

代理型人工智能系统通过超越信息生成,扩展到自主规划、工具调用、决策执行以及对数字和物理环境的持续修改,正在改变风险格局。这些能力引入了新的风险敞口,这些敞口并不完全适合传统的保险类别,如网络、职业责任、产品责任或董事及高管责任保险。本文考察了新兴的代理型AI保险市场,并开发了一个框架来理解其承保、定价、再保险和产品设计的影响。我们将代理型AI描述为自主性和授权委托的连续体,强调信息输出与能够通过外部行动独立产生保险事件的系统之间的区别。我们分析了主要风险路径,包括幻觉、提示注入攻击、自主决策错误、模型漂移、依赖故障和网络物理伤害,并评估了现有保险产品如何适应这些风险敞口。本文进一步提出了一个基于风险暴露评估、情景分析、依赖映射和累积风险管理的精算框架,借鉴了网络保险的发展历程。最后,我们提出了一个协调的保险架构,通过明确的分配机制和专门的AI总限额,整合了网络、技术错误与遗漏、产品责任、性能保证以及明确的AI责任保险。分析表明,代理型AI保险的未来不在于单一的单线产品,而在于一个由改进的治理、透明度、遥测和监管清晰度支持的互补覆盖分层生态系统。

英文摘要

Agentic artificial intelligence (AI) systems are transforming the risk landscape by extending beyond information generation to autonomous planning, tool invocation, decision execution, and persistent modification of digital and physical environments. These capabilities introduce novel exposures that do not fit neatly within traditional insurance categories such as cyber, professional liability, product liability, or directors and officers coverage. This paper examines the emerging insurance market for agentic AI and develops a framework for understanding its underwriting, pricing, reinsurance, and product-design implications. We characterize agentic AI as a continuum of autonomy and delegated authority, emphasizing the distinction between informational outputs and systems capable of independently generating insured events through external actions. We analyze major risk pathways, including hallucinations, prompt-injection attacks, autonomous decision errors, model drift, dependency failures, and cyber-physical harms, and evaluate how existing insurance products are adapting to address these exposures. The paper further proposes an actuarial framework based on exposure assessment, scenario analysis, dependency mapping, and accumulation-risk management, drawing parallels to the evolution of cyber insurance. Finally, we present a coordinated insurance architecture that integrates cyber, technology errors and omissions, product liability, performance-warranty, and affirmative AI-liability coverages through explicit allocation mechanisms and dedicated AI aggregates. The analysis suggests that the future of agentic-AI insurance lies not in a single monoline product but in a layered ecosystem of complementary coverages supported by improved governance, transparency, telemetry, and regulatory clarity.

2606.05445 2026-06-05 cs.AI 版本更新

Brick-Composer: Using MLLMs for Assembly with Diverse Bricks

Brick-Composer:使用多模态大语言模型进行多样化积木组装

Jiateng Liu, Bingxuan Li, Zhenhailong Wang, Rushi Wang, Kaiwen Hong, Cheng Qian, Jiayu Liu, Denghui Zhang, Katherine Driggs-Campbell, Manling Li, Heng Ji

发表机构 * UIUC(伊利诺伊大学香槟分校) Stevens Institute of Technology(史蒂文斯理工学院) Northwestern University(西北大学)

AI总结 本文提出Brick-Composer框架,通过人类设计火花、世界反馈和合成经验三种信号训练多模态大语言模型,解决积木组装中的积木选择和姿态估计问题,将步骤级组装成功率从低于1%提升至约15%。

Comments 10 Pages, 10 figures

详情
AI中文摘要

我们梦想着AI代理能够读取任意设计,并从可重复使用的构建块中构建真实世界的物体。作为迈向这一愿景的第一步,我们研究多模态大语言模型(MLLMs)是否具备积木组装所需的视觉基础和空间推理能力。我们将积木组装形式化为一个序列决策问题,其中每一步涉及两个子任务:积木选择,从候选组件中识别目标积木;以及积木姿态估计,预测所选积木应放置的位置和方式。为支持这项研究,我们引入了BC-Bench(积木构建基准),这是第一个用于评估MLLMs在多样化积木组装中表现的基准。实验表明,当前最先进的MLLMs仍然远非可靠的构建者,在细粒度积木选择上挣扎,并且在精确姿态估计上失败。为弥补这一差距,我们提出了Brick-Composer,一个学习框架,通过三种互补信号赋予MLLMs组装技能:人类设计火花,提供富含可供性的构建演示;世界反馈,将预测动作锚定在视觉和物理后果中;以及合成经验,将学习扩展到现有物体设计之外。Brick-Composer将积木选择准确性提高了三倍以上,大幅减少了姿态估计误差,并将严格的步骤级组装成功率从低于1%提升至约15%。训练后,一个Qwen-3-8B模型能够正确完成一个完整物体高达42%的步骤,这表明MLLMs可以通过有针对性的、基于物理的学习获得组装能力。

英文摘要

We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem, where each step involves two subtasks: brick selection, identifying the target brick from candidate components, and brick pose estimation, predicting where and how the selected brick should be placed. To support this study, we introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks. Experiments show that current state-of-the-art MLLMs remain far from reliable builders, struggling with fine-grained brick selection and failing at precise pose estimation. To bridge this gap, we propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through three complementary signals: Human Design Sparks, which provide affordance-rich construction demonstrations; World Feedback, which grounds predicted actions in visual and physical consequences; and Synthetic Experience, which scales learning beyond existing object designs. Brick-Composer improves brick selection accuracy by over three times, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%. After training, a Qwen-3-8B model can correctly compose up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.

2606.05444 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Multilingual Coreference Resolution via Cycle-Consistent Machine Translation

通过循环一致性机器翻译的多语言共指消解

Adriana-Valentina Costache, Eduard Poesina, Silviu-Florin Gheorghe, Paul Irofti, Radu Tudor Ionescu

发表机构 * Department of Computer Science, University of Bucharest(布加勒斯特大学计算机科学系)

AI总结 提出一种利用循环一致性机器翻译生成或扩展训练数据的管道,通过BERT潜在空间余弦相似度评估翻译质量并加权损失函数,显著提升低资源语言的共指消解性能。

详情
AI中文摘要

共指消解是一项核心的自然语言处理任务,具有广泛的下游应用,例如机器翻译、问答、文档摘要等。虽然该任务在英语中得到了充分研究,但其他语言(尤其是低资源语言)的共指消解关注相对较少。为了弥补这一差距,我们提出了一种新颖的共指消解管道,该管道利用从英语到目标低资源语言的机器翻译(MT)来生成或扩展训练数据。为了自动验证翻译样本的质量,我们将样本反向翻译,并通过BERT模型潜在空间中的余弦相似度评估与原始英语样本的相似性。得到的相似度分数被整合到损失函数中,以根据样本的MT循环一致性对训练样本进行加权。在四种低资源语言上的大量实验表明,我们的管道在共指消解中带来了显著的性能提升。此外,我们的管道使得在之前没有可用语料库的语言中也能实现准确的共指消解。

英文摘要

Coreference resolution is a core NLP task, having a broad range of downstream applications, e.g., machine translation, question answering, and document summarization. While the task is well-studied in English, comparatively less attention is dedicated to coreference resolution in other languages, especially low-resource ones. To mitigate this gap, we propose a novel coreference resolution pipeline that harnesses machine translation (MT) from English to a target low-resource language to generate or expand training data. To automatically validate the quality of the translated samples, we back-translate the samples and assess the similarity with the original English samples via cosine similarity in the latent space of a BERT model. The resulting similarity scores are integrated into the loss function to weight training samples according to their MT cycle consistency. Extensive experiments on four low-resource languages show that our pipeline brings significant performance gains in coreference resolution. Moreover, our pipeline enables accurate coreference resolution in languages where no previous corpora were available.
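The cycle-consistency weighting can be sketched as follows in Python. This is a hedged illustration, not the paper's loss: plain lists of floats stand in for BERT embeddings of the original and back-translated sentences, negative similarities are clipped to zero, and the weighted sum is normalised by the total weight. The clipping, the normalisation, and the names `cosine_similarity` and `weighted_loss` are assumptions, since the abstract only says the similarity scores are integrated into the loss.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_loss(losses, originals, back_translations):
    """Average per-sample losses, weighting each by its round-trip similarity.

    `originals[i]` and `back_translations[i]` are placeholder embeddings of the
    English source and of its English -> target -> English round trip.
    """
    weights = [
        max(0.0, cosine_similarity(o, b))  # clip so poor round trips get weight 0
        for o, b in zip(originals, back_translations)
    ]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * loss for w, loss in zip(weights, losses)) / total
```

Samples whose back-translation drifts far from the source contribute little to the training signal, which is the intended effect of weighting by cycle consistency.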

2606.05436 2026-06-05 cs.AI cs.CL cs.IR 版本更新

Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

十位头痛专家与人工智能在临床文献总结中的比较:一项关键评估与对比

Alejandro Lozano, Keiko Ihara, Ping-Hao Yang, Carrie E. Robertson, Jennifer Stern, Allan Purdy, Hsiangkuo Yuan, Pengfei Zhang, Yulia Orlova, Olga Fermo, Jennifer Hranilovich, Fred Cohen, Todd J. Schwedt, Jenelle A. Jindal, Serena Yeung-Levy, Chia-Chun Chiang

发表机构 * Stanford University Palo Alto CA USA(斯坦福大学) Department of Neurology Mayo Clinic Rochester MN USA(梅奥诊所罗切斯特院区神经科) Department of Neurology Dalhousie University Halifax Canada(达尔豪斯大学神经科) Jefferson Headache Center Department of Neurology Thomas Jefferson University PA USA(托马斯·杰斐逊大学神经科杰斐逊头痛中心) Beth Israel Deaconess Medical Center Boston MA USA(贝斯以色列女执事医疗中心) Department of Neurology University of Florida Gainesville FL USA(佛罗里达大学神经科) University of Colorado School of Medicine Department of Pediatrics Division of Child Neurology Aurora CO USA(科罗拉多大学医学院儿科儿童神经科) Department of Medicine Mount Sinai Hospital Icahn School of Medicine at Mount Sinai New York NY USA(西奈山伊坎医学院西奈山医院内科) Department of Neurology Mayo Clinic Scottsdale AZ USA(梅奥诊所斯科茨代尔院区神经科) Harvard Medical School Boston MA USA(哈佛医学院) Department of Neurology Mount Sinai Hospital Icahn School of Medicine at Mount Sinai New York NY USA(西奈山伊坎医学院西奈山医院神经科)

AI总结 本研究通过构建基于RAG的AI框架,比较了三种大语言模型与十位头痛专家在临床文献总结方面的表现,发现专家撰写的摘要更受青睐,但专家有时难以区分人类与AI生成的摘要。

详情
AI中文摘要

总结最新医学文献以指导临床决策对于循证医学和高质量患者护理至关重要。然而,由于与患者相处的时间有限且发表文章数量迅速增长,临床医生面临越来越大的挑战。尽管检索增强的大语言模型(LLMs)在临床总结方面显示出潜力,但对其在综合更广泛科学文献方面的有效性进行人工评估,以及与专家撰写的综合摘要的直接比较仍然很少。我们使用三种最先进的LLMs(Sonnet、GPT-4o和Llama 3.1)构建了一个基于RAG的智能体式AI框架。一位头痛专家创建了13个问题,其中3个用于提示优化,10个用于评估。美国和加拿大的十位头痛专家每人针对一个问题撰写一篇摘要,每个问题得到四篇摘要(专家、Sonnet、GPT-4o和Llama)。专家们在不知道作者身份的情况下,根据正确性、完整性、简洁性和临床实用性,使用标准化评分标准对摘要进行评分(1-10分),并排除他们自己撰写摘要的主题。他们还按偏好对摘要进行排序,并指出他们认为每篇摘要是由专家还是LLM撰写的。我们的研究比较了由头痛专家评估的LLM和专家撰写的文献摘要,结果显示专家撰写的摘要更受青睐,尽管专家有时难以区分人类和AI生成的摘要。我们还确定了超出标准评估指标的关键专家重视特征,这些特征可以指导未来人类和AI文献总结流程的改进。

英文摘要

Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature and direct comparisons to expert-written syntheses remain scarce. We constructed a RAG-based agentic AI framework using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries, excluding the topic for which they wrote a summary, based on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. Our study, comparing LLM- and expert-written literature summaries evaluated by headache specialists, showed that expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.

2606.05434 2026-06-05 cs.LG cs.AI 版本更新

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

选择性优势熵自适应视野GRPO:用于语言模型高效强化学习的非对称令牌级折扣

Chirag Chawla, Rohan Charudatt Salvi, Madhav S. Baidya

发表机构 * Indian Institute of Technology (BHU)(印度理工学院(博生胡大学)) Department of Computer Science, University of Illinois Chicago(伊利诺伊大学芝加哥分校计算机科学系)

AI总结 提出选择性优势熵自适应视野GRPO(SA-AH-GRPO),通过非对称令牌级折扣(仅对负优势轨迹应用熵基折扣)来稳定训练并提升数学推理性能。

Comments 16 pages, 4 Figures, 7 Tables

详情
AI中文摘要

组相对策略优化(GRPO)已成为一种有效的强化学习算法,用于在推理任务上对齐语言模型,但它对称地处理每个令牌位置和每个采样轨迹。我们引入了两个互补的扩展:(i) 自适应视野GRPO(AH-GRPO),它使用基于累积熵的折扣对每个令牌的策略梯度进行加权,当模型不确定时减少有效视野;(ii) 选择性优势AH-GRPO(SA-AH-GRPO),它仅对负优势轨迹应用此折扣,而保留正优势的成功轨迹不受衰减。我们在GSM8K数学推理基准上,使用通过LoRA微调的Qwen 2.5-1.5B-Instruct和Qwen 2.5-3B-Instruct模型,评估了alpha=0的标准GRPO、alpha=0.5的AH-GRPO和alpha=0.5的SA-AH-GRPO。在3B模型上,SA-AH-GRPO在第30步达到峰值Pass@1=0.858,并在180步保持0.846,训练方差降至0.0246,相比GRPO减少了3.6倍,同时匹配其峰值准确率。在1.5B模型上,SA-AH-GRPO达到峰值Pass@1=0.686,优于零样本基线0.637。我们的分析表明,非对称折扣保留了正确解上的完整梯度信号,防止了熵崩溃,并显著稳定了训练,为结构化生成任务上具有可验证奖励的强化学习提供了一种原则性的归纳偏置。

英文摘要

Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning language models on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We introduce two complementary extensions: (i) Adaptive-Horizon GRPO (AH-GRPO), which weights each token's policy gradient using a cumulative entropy-based discount that reduces the effective horizon when the model is uncertain, and (ii) Selective-Advantage AH-GRPO (SA-AH-GRPO), which applies this discounting only to negative-advantage rollouts, leaving positive-advantage, successful trajectories unattenuated. We evaluate standard GRPO with alpha = 0, AH-GRPO with alpha = 0.5, and SA-AH-GRPO with alpha = 0.5 on the GSM8K mathematical reasoning benchmark using both Qwen 2.5-1.5B-Instruct and Qwen 2.5-3B-Instruct fine-tuned with LoRA. On the 3B model, SA-AH-GRPO achieves Pass@1 = 0.858 at its peak at step 30 and maintains 0.846 at 180 steps, with training variance reduced to 0.0246, a 3.6 times reduction relative to GRPO while matching its peak accuracy. On the 1.5B model, SA-AH-GRPO achieves a peak Pass@1 of 0.686, improving over the zero-shot baseline of 0.637. Our analysis shows that asymmetric discounting preserves the full gradient signal on correct solutions, prevents entropy collapse, and substantially stabilises training, suggesting a principled inductive bias for reinforcement learning with verifiable rewards on structured generation tasks.
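The asymmetric discount can be sketched in Python as a per-token weight on the policy gradient. The abstract does not give the exact discount formula, so the exponential form `exp(-alpha * cumulative_entropy)` below is an assumption, as is the function name `token_weights`. What the sketch does take from the abstract is the structure: the discount is driven by cumulative entropy, `alpha = 0` recovers uniform weighting as in standard GRPO, and only negative-advantage rollouts are discounted.

```python
import math


def token_weights(entropies, advantage, alpha):
    """Per-token gradient weights for one rollout.

    entropies: per-token policy entropies along the rollout
    advantage: the rollout's group-relative advantage
    alpha:     discount strength (0 disables discounting)
    """
    # Positive-advantage rollouts keep their full gradient signal.
    if advantage >= 0 or alpha == 0:
        return [1.0] * len(entropies)
    weights = []
    cumulative = 0.0
    for h in entropies:
        cumulative += h
        # Assumed form: later tokens after uncertain stretches are damped more.
        weights.append(math.exp(-alpha * cumulative))
    return weights
```

Applying the same weights to every rollout regardless of the sign of the advantage would correspond to the symmetric AH-GRPO variant described in the abstract.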

2606.05433 2026-06-05 cs.AI cs.SY eess.SY 版本更新

Zero knowledge verification for frontier AI training is possible

前沿AI训练的零知识验证是可能的

Pierre Peigné, Ky Nguyen, Paul Wang

发表机构 * Lefebvre General-Purpose AI Policy Lab(勒费弗尔通用人工智能政策实验室) Sorbonne Université(索邦大学) CNRS(法国国家科学研究中心) LIP6(LIP6实验室)

AI总结 提出一种结合预提交训练规范、节点间网络观测和中间计算即时Merkle承诺的零知识验证架构,通过原生BF16/FP32预编译的零知识虚拟机(zkVM)验证GPU实际浮点计算,实现训练过程可验证且架构保密,预计36个月内实现概念验证。

Comments 44 pages, 2 figures

详情
AI中文摘要

前沿AI治理框架日益将累积训练计算作为指定高影响力模型的主要标准,但执行依赖于自我报告,因为不存在训练的技术验证原语。任何未来关于前沿AI的国际协议都面临更高风险下的同样问题:对具有显著外部性的技术进行协调监管历史上依赖于技术验证,否则协议只是宣言性的。最近的治理分析认为零知识证明是一个有希望的候选方案,但目前在前沿规模下不切实际[26, 4]。我们认为这种不切实际是范式限制而非根本性的,并提出了一种用于前沿密集预训练的验证架构,结合了预提交的训练规范、节点间网络观测以及中间计算的即时Merkle承诺,通过具有原生BF16/FP32预编译的零知识虚拟机(zkVM)进行验证。该证明检查GPU执行的实际浮点计算而非定点近似,并通过私有的训练规范保护模型架构的机密性。该协议产生三种证明类型:初始化时的创世证明、训练过程中的步骤证明,以及作为运行不变量的事前证明,强制执行与政策相关的声明,将训练记录转变为可治理执行的工件。我们估计在训练侧开销为个位数百分比的情况下,大约36个月内可实现可部署的概念验证,而验证级定制硅片的周期为六到十年。列出了十三个开放的研究和工程问题,作为外部贡献的研究议程。

英文摘要

Frontier AI governance frameworks increasingly use cumulative training compute as the primary criterion for designating high-impact models, but enforcement rests on self-reporting because no technical verification primitive for training exists. Any future international agreement on frontier AI faces the same problem at higher stakes: coordinated regulation of technologies with significant externalities has historically rested on technical verification, without which agreements are declaratory. Recent governance analyses judge zero-knowledge proofs a promising candidate but currently impractical at frontier scale [26, 4]. We argue the impracticality is paradigm-bound rather than fundamental, and propose a verification architecture for frontier dense pre-training combining a pre-committed training specification, inter-node network observations, and on-the-fly Merkle commitments of intermediate computation, verified through a zero-knowledge Virtual Machine (zkVM) with native BF16/FP32 precompiles. The proof checks the actual floating-point computation the GPU performed rather than a fixed-point approximation, and preserves model-architecture confidentiality through a private training specification. The protocol produces three proof types: a genesis proof at initialisation, in-training step proofs across the run, and ex-ante attestations enforcing policy-relevant claims as running invariants, turning the training record into a governance-enforceable artefact. We estimate a deployable proof of concept within approximately 36 months at single-digit-percent training-side overhead, against a six-to-ten-year cycle for verification-grade custom silicon. Thirteen open research and engineering problems are catalogued as a research agenda for external contribution.
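A Merkle commitment over intermediate computation can be illustrated with a short Python sketch using only `hashlib`. This shows the generic construction, not the paper's protocol: the leaf contents (here arbitrary byte strings standing in for serialized intermediate tensors), the choice of SHA-256, the domain-separation prefixes, and the rule of duplicating the last node on odd levels are all assumptions. The zero-knowledge proof layer is not modelled.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Return the Merkle root of a non-empty list of byte strings.

    Leaves and internal nodes use different prefixes so that a leaf can never
    be confused with an internal node. Odd levels duplicate their last node,
    a simplification that production designs often avoid.
    """
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

Publishing only the 32-byte root commits the trainer to every leaf: changing any intermediate value changes the root, while the leaves themselves stay private until a proof refers to them.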

2606.05429 2026-06-05 cs.AI 版本更新

Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

最小化缩放因子的隐藏成本:面向大语言模型的图引导超低位量化

Rayyan Abdalla, Amir Hussein, Min Wu, Dinesh Manocha

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出SAGE-PTQ框架,通过图引导的显著性感知量化分离显著与非显著权重,实现超低位量化并最小化缩放开销,在LLaMA-3-8B上困惑度降至6.74且内存低于BiLLM的50%。

Comments Preprint. 18 pages, 10 figures, 7 tables, including appendix

详情
AI中文摘要

训练后量化(PTQ)对于大语言模型(LLMs)的高效部署至关重要。最近的超低位PTQ方法依赖于严格的权重显著性假设或位置启发式,引入了大量隐藏的缩放开销。我们提出SAGE-PTQ(显著性感知图引导高效PTQ),一种新颖的LLMs超低位量化框架,可最小化隐藏缩放成本。SAGE-PTQ使用分布统计分离显著和非显著权重,然后将子采样的非显著权重建模为稀疏图,以估计每层的最佳组数。SAGE-PTQ应用双模量化,为显著权重分配多位精度,并对非显著权重进行二值化。为减少缩放开销,SAGE-PTQ对显著权重使用每个通道一个缩放因子,对每个非显著组使用一个标量。最后,SAGE-PTQ实现自适应显著性阈值,以选择每个矩阵的最佳显著性比率。SAGE-PTQ平均达到1.03权重位和仅0.004缩放位每矩阵,优于BiLLM和PB-LLM等最先进方法。在LLaMA-3-8B上,SAGE-PTQ在WikiText2上达到6.74困惑度,而BiLLM为55.8,同时使用不到BiLLM 50%的GPU内存。在LLaMA-2-70B上,SAGE-PTQ在单个NVIDIA L40 GPU上提供1.5倍更快的解码速度,展示了实际的推理效率。

英文摘要

Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group. Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.
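The bit accounting behind "weight bits" and "scaling bits" can be sketched as follows. This is not the SAGE-PTQ implementation: the mean-absolute-value scale is the standard choice for sign binarization, and the matrix size, salient fraction, salient bit-width, group count, 16-bit scale width, and the reading of "channel" as a matrix row are all hypothetical values chosen for illustration.

```python
def binarize_group(weights):
    """Binarize one group to {-alpha, +alpha} using a single scalar alpha."""
    # mean |w| minimizes the squared error of a sign-based code
    alpha = sum(abs(w) for w in weights) / len(weights)
    return alpha, [alpha if w >= 0 else -alpha for w in weights]


def average_bits(rows, cols, salient_fraction, salient_bits, groups, scale_bits=16):
    """Return (weight bits, scaling bits) per weight for one matrix."""
    weight_bits = salient_fraction * salient_bits + (1 - salient_fraction) * 1
    # assumed: one scale per row for salient weights, one scalar per unsalient group
    scaling_bits = (rows + groups) * scale_bits / (rows * cols)
    return weight_bits, scaling_bits


alpha, quantized = binarize_group([0.5, -1.5, 1.0, -1.0])

# Hypothetical configuration: 4096 x 4096 matrix, 1% salient weights at 4 bits, 8 groups.
weight_bits, scaling_bits = average_bits(4096, 4096, 0.01, 4, groups=8)
```

Under these assumed values the storage for scales is tiny compared with the weights themselves, which is why keeping the number of groups per layer small matters for the scaling overhead.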

2606.05420 2026-06-05 cs.AI stat.AP 版本更新

Assessing the Carbon Emissions and Energy Consumption of U.S. Hyperscale Data Centers

评估美国超大规模数据中心的碳排放与能源消耗

Gianluca Guidi, Francesca Dominici, Tiziano Squartini, Callaway Sprinkle, Jonathan Gilmour, Kevin Butler, Eric Bell, Scott Delaney, Falco J. Bargagli-Stoffi

Affiliations * Department of Biostatistics, Harvard T.H. Chan School of Public Health; Department of Computer Science, University of Pisa; IMT School of Advanced Studies, Lucca; Environmental Systems Research Institute; Baxtel; Department of Environmental Health, Harvard T.H. Chan School of Public Health; Department of Biostatistics, UCLA Fielding School of Public Health

AI summary Using facility-level data collected on 403 US hyperscale data centers, this study estimates their electricity consumption, electricity sources, and CO2 emissions, finding that their electricity demand is about 1.8% of total US consumption and that their carbon intensity is about 48% above the national average.

详情
AI中文摘要

美国超大规模数据中心(HDCs)的快速扩张,主要由人工智能的采用驱动,引发了人们对该行业环境足迹的担忧。我们汇编了2024年5月至2025年4月期间运营的403个美国超大规模数据中心的设施级信息,并估算了它们的电力消耗、电力来源及可归因的二氧化碳排放。在不同的设施负载情景下,这些HDC消耗了约68-99太瓦时的电力,并产生了约3700-5400万吨二氧化碳。在中心情景下,HDC电力需求约占美国总用电量的1.8%,其中约54%的归因发电由化石燃料来源提供。HDC电力加权平均碳强度约为545克二氧化碳/千瓦时,比同期美国国家电网平均碳强度370克二氧化碳/千瓦时高出约48%。我们的方法提供了一种归因工具,利用最新的EPA eGRID电厂级数据评估超大规模数据中心的环境足迹。

英文摘要

The rapid proliferation of hyperscale data centers (HDCs) in the US, mainly driven by the adoption of artificial intelligence, has raised concerns about this industry's environmental footprint. We compiled facility-level information on 403 US hyperscale data centers operating between May 2024 and April 2025 and estimated their electricity consumption, electricity sources, and attributable CO2 emissions. Across different facility-load scenarios, these HDCs consumed approximately 68-99 TWh of electricity and were associated with about 37-54 million metric tons of CO2. Under the central scenario, HDC electricity demand corresponded to approximately 1.8% of total US electricity consumption, with roughly 54% of attributed generation supplied by fossil-fuel sources. The HDC electricity-weighted average carbon intensity was approximately 545 gCO2/kWh, about 48% above the contemporaneous US national grid-average carbon intensity of 370 gCO2/kWh. Our approach provides an attributional tool for assessing the environmental footprint of hyperscale data centers using the most recent EPA eGRID plant-level data.
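The reported ranges can be cross-checked with simple unit arithmetic. The sketch below applies the rounded average intensity of 545 gCO2/kWh to both ends of the consumption range, which is a simplification of the paper's facility-level attribution; the function name is illustrative.

```python
HDC_INTENSITY_G_PER_KWH = 545  # from the abstract (rounded)
US_INTENSITY_G_PER_KWH = 370   # from the abstract (rounded)


def emissions_million_tons(twh, grams_per_kwh):
    """Convert electricity use in TWh to million metric tons of CO2."""
    kwh = twh * 1e9             # 1 TWh = 1e9 kWh
    grams = kwh * grams_per_kwh
    return grams / 1e12         # 1 million metric tons = 1e12 g


low = emissions_million_tons(68, HDC_INTENSITY_G_PER_KWH)
high = emissions_million_tons(99, HDC_INTENSITY_G_PER_KWH)

# Relative premium of HDC intensity over the national grid average.
premium = HDC_INTENSITY_G_PER_KWH / US_INTENSITY_G_PER_KWH - 1
```

With these rounded inputs, `low` and `high` land at roughly 37 and 54 million metric tons, matching the stated range, and `premium` comes out slightly above 47%, consistent with the reported "about 48%" once rounding of the published intensities is taken into account.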

2606.05415 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Executable Schema Contracts: From Automatic Ingestion to Multi-Source Retrieval

可执行模式合约:从自动摄入到多源检索

Padmaja Jonnalagedda, Yuguang Yao, Xiang Gao, Hilaf Hasson, Kamalika Das

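Affiliations * Intuit AI Research

AI summary Proposes a system that automatically discovers an executable schema from multi-source data and uses it as a shared contract, improving multi-source question answering through schema-conditioned retrieval routing and structural analysis.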

Comments 9 pages, 4 figures, plus supplementary appendix

详情
AI中文摘要

现实世界的数据跨越表格、文档和半结构化文件,具有隐式语义。查询这些数据需要跨不一致的模式和格式整合证据,但现有方法要么需要昂贵的人工工程,要么完全绕过结构。我们提出一个系统,自动从原始多源数据中发现可执行模式,并将其用作知识图谱构建和查询时检索的共享合约。一个封闭世界的字段目录将基于LLM的模式发现限制在已证实的字段上;确定性结构分析推断身份键、外键和源层次结构;由此产生的模式驱动提取、去重和跨源链接,形成具有溯源意识的知识图谱。在查询时,该模式(可选地通过单调协议扩展)调节一个多工具代理,该代理在结构化查找、图遍历和向量搜索之间路由检索,返回带有可追溯引用的有根据的答案。在使用相同LLM、数据和评估框架的受控零样本比较中,该系统在四个QA基准上优于仅检索和基于分解的基线,消融实验表明模式条件路由、结构智能和模式引导构建各自贡献了性能提升。

英文摘要

Real-world data spans tables, documents, and semi-structured files with implicit semantics. Querying this data requires integrating evidence across inconsistent schemas and formats, yet existing approaches either demand costly manual engineering or bypass structure entirely. We present a system that automatically discovers an executable schema from raw multi-source data and uses it as a shared contract for knowledge graph construction and query-time retrieval. A closed-world field catalog constrains LLM-based schema discovery to attested fields; deterministic structural analysis infers identity keys, foreign keys, and source hierarchy; and the resulting schema drives extraction, deduplication, and cross-source linking into a provenance-aware knowledge graph. At query time the schema -- optionally extended via a monotonic protocol -- conditions a multi-tool agent routing retrieval across structured lookup, graph traversal, and vector search, returning grounded answers with traceable citations. In controlled zero-shot comparisons using the same LLM, data, and evaluation harness, the system improves over retrieval-only and decomposition-based baselines across four QA benchmarks, with ablations showing that schema-conditioned routing, structural intelligence, and schema-guided construction each contribute to the gains.

2606.05414 2026-06-05 cs.CL cs.AI cs.HC cs.LG 版本更新

When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories

当证据稀疏时:对话和LLM-Agent轨迹中的弱监督早期失败预警

Avinash Baidya, Xinran Liang, Ruocheng Guo, Xiang Gao, Kamalika Das

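Affiliations * Intuit AI Research; Princeton University

AI summary Addresses early failure alerting in dialogs and LLM-agent trajectories with a two-stage method: an attention mechanism learns turn-level failure evidence from sparse trajectory-level labels, and the α-STOP policy provides controllable early alerting, improving Pareto-frontier quality across multiple benchmarks while reducing training cost.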

Comments 9 pages, 14 figures, and appendix

详情
AI中文摘要

早期失败预警需要在对话或智能体轨迹尚未完成时,决定是否将其标记为可能失败。这具有挑战性,因为监督信号通常仅以轨迹级成功/失败标签的形式提供,而预警必须从部分交互中发出。先前的早期分类方法通常通过将终端标签分配给每个前缀来弥合这一差距,将每个回合视为失败证据。我们假设这种前缀标签假设与多轮语言交互不匹配,因为最终失败的证据是稀疏且常常延迟的。在本文中,我们引入了一种两阶段方法,从这种稀疏证据结构中学习,并使用由此产生的风险估计进行可控的早期预警。具体来说,我们的基于注意力的失败预测器从轨迹标签中学习稀疏的回合级失败证据,并利用它从部分历史中估计失败风险。然后,我们将该预测器与α-STOP配对,这是一种单一偏好条件停止策略,在推理时选择准确率-早期性的操作点,而不是为每个偏好训练单独的触发器。在涵盖客户支持、任务导向对话、说服、工具使用和规划的五个基准上,我们首先表明高相关性失败证据仅占回合的4.7-11.3%,并且平均在轨迹的59.0-83.6%之后首次出现。我们进一步表明,基于注意力的预测器将帕累托前沿质量(超体积)比朴素前缀监督提高了1-10%,并且完整系统将前沿质量比最先进的触发器策略提高了3-42%,同时将每个操作点的训练成本降低了1-3个数量级。

英文摘要

Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level success/failure label while alerts must be raised from partial interactions. Prior early-classification methods often bridge this gap by assigning the terminal label to every prefix, treating every turn as failure evidence. We hypothesize that this prefix-label assumption is poorly matched to multi-turn language interactions, where evidence of eventual failure is sparse and often delayed. In this paper, we introduce a two-stage approach that learns from this sparse evidence structure and uses the resulting risk estimates for controllable early alerting. Specifically, our attention-based failure predictor learns sparse turn-level failure evidence from trajectory labels and uses it to estimate failure risk from partial histories. We then pair this predictor with α-STOP, a single preference-conditioned stopping policy that selects an accuracy-earliness operating point at inference time rather than training a separate trigger for each preference. Across five benchmarks spanning customer support, task-oriented dialog, persuasion, tool use, and planning, we first show that high-relevance failure evidence occupies only 4.7-11.3% of turns and first appears after 59.0-83.6% of trajectories on average. We further show that the attention-based predictor improves Pareto-frontier quality (hypervolume) by 1-10% over naive prefix supervision, and that the full system improves frontier quality by 3-42% over state-of-the-art trigger policies while reducing training cost per operating point by 1-3 orders of magnitude.

2606.05413 2026-06-05 cs.LG cs.AI 版本更新

CausalPOI: Spatio-Temporal Graph-Based Causal Modeling for Cold-Start POI Check-in Forecasting

CausalPOI:基于时空图因果建模的冷启动POI签到预测

Zhaoqi Zhang, Miao Xie, Yi Li, Linyou Cai, Siqiang Luo, Gao Cong

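Affiliations * Nanyang Technological University; China Agricultural University; Meituan

AI summary Proposes the CausalPOI framework, which models semantic and spatial relationships between POIs with a spatio-temporal functional interaction graph and simulates factual and counterfactual scenarios through structurally aligned treatment and control graphs to address cold-start POI check-in forecasting, significantly outperforming baselines on real-world datasets.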

Comments Accepted at KDD 2026

详情
AI中文摘要

随着城市环境的快速演变,准确建模兴趣点(POI)的动态行为对于支持数据驱动的城市规划和商业决策至关重要。尽管时空图学习的最新进展改进了POI预测,但大多数方法依赖于基于邻近性的图和相关性驱动建模,忽略了POI之间的功能依赖关系,且未能捕捉城市干预的因果效应。本文引入了一个新的研究问题——冷启动POI签到预测,旨在通过建模新引入POI的时间演化及其与附近POI在结构化城市空间背景下的功能交互,预测其未来的签到模式。为应对这些挑战,我们提出了CausalPOI,一个基于时空图的因果表示学习框架。CausalPOI利用时空功能交互图建模POI之间的语义和空间关系,并构建结构对齐的处理图和对照图以模拟事实和反事实场景。在真实SafeGraph数据集上的大量实验表明,CausalPOI在各方面显著优于最先进的基线,验证了其在时空预测、语义交互建模和因果效应估计方面的有效性,为城市干预分析提供了更可解释和可操作的基础。源代码可在Github获取。

英文摘要

As urban environments continue to evolve rapidly, accurately modeling the dynamic behaviour of Points of Interest (POIs) is essential for supporting data-driven urban planning and commercial decision-making. While recent advancements in spatio-temporal graph learning have improved POI forecasting, most methods rely on proximity-based graphs and correlation-driven modeling, which overlook the functional dependencies between POIs and fail to capture the causal effects of urban interventions. In this paper, we introduce a novel research problem -- cold-start POI check-in forecasting, which aims to predict the future check-in pattern of a newly introduced POI by modeling its temporal evolution and functional interactions with nearby POIs in a structured urban spatial context. To address these challenges, we propose CausalPOI, a spatio-temporal graph-based causal representation learning framework. CausalPOI leverages a Spatio-Temporal Functional Interaction Graph to model semantic and spatial relationships between POIs, and constructs structurally aligned treatment and control graphs to simulate factual and counterfactual scenarios. Extensive experiments on real-world SafeGraph datasets demonstrate that CausalPOI significantly outperforms state-of-the-art baselines across the board, validating its effectiveness in spatio-temporal forecasting, semantic interaction modeling, and causal effect estimation, providing a more interpretable and actionable foundation for urban intervention analysis. Source code is available on GitHub.

2606.05411 2026-06-05 cs.AI cs.HC 版本更新

A Motivational Architecture for Conversational AGI

对话式通用人工智能的动机架构

Anna Mikeda, Ben Goertzel

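Affiliations * Glass Umbrella; SingularityNet

AI summary Proposes a conversational motivational architecture that reinterprets the OpenPsi motivational lineage in dialogue-native terms and couples it to MetaMo's higher-level motivational scaffold; a ten-stage motivational processing pipeline, a dual decision strategy, and a functional distinction between pre-action feelings and post-action emotions let a conversational agent regulate motivations such as competence, uncertainty reduction, and affiliation.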

Comments 16 pages. Accepted for AGI-26 proceedings

详情
AI中文摘要

认知AI中的动机架构主要设计用于调节身体需求的物理智能体。对话智能体运行在另一种机制中:其感觉运动回路是语言性的,其环境是用户不断演变的心理状态,其有后果的行动是言语行为、工具调用和策略性沉默。本文提出对OpenPsi动机谱系的对话式重新解释,耦合MetaMo的高层动机支架,用于构建在模块化执行基底上的智能体。稳态被重新定义为对话原生的术语:智能体调节的是能力、不确定性减少、亲和力、喜爱度、合法性、培育和审美连贯性,而非身体缺陷。我们提出三个贡献:一个十阶段动机处理流水线,在架构上分离认知调节与情境评估;一个双决策策略,融合紧迫驱动的快速响应与深思熟虑的多目标优化;以及一个架构上有用的区分,即行动前感受与行动后情绪作为功能上不同的情感形式。我们将该框架专门化到两个示例智能体——伴侣智能体与研究智能体——并勾勒其向社交机器人和领域通用的人类级通用人工智能的扩展。

英文摘要

Motivational architectures in cognitive AI have largely been designed for physical agents regulating bodily needs. Conversational agents operate in a different regime: their sensorimotor loop is linguistic, their environment is a user's evolving mental state, and their consequential actions are speech acts, tool invocations, and strategic silences. This paper proposes a conversational reinterpretation of the OpenPsi motivational lineage, coupled to MetaMo's higher-level motivational scaffold, for agents built on a modular execution substrate. Homeostasis is recast in dialogue-native terms: the agent regulates competence, uncertainty reduction, affiliation, affinity, legitimacy, nurturing, and aesthetic coherence rather than bodily deficits. We propose three contributions: a ten-stage motivational processing pipeline that architecturally separates cognitive modulation from situational appraisal; a dual decision strategy blending urgency-driven fast response with deliberative multi-goal optimization; and an architecturally useful distinction between pre-action feelings and post-action emotions as functionally different forms of affect. We specialize the framework to two example agents -- CompanionAgent and ResearchAgent -- and sketch its extension to social robotics and domain-generic human-level AGI.

2606.05408 2026-06-05 cs.AI cs.NE 版本更新

Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

无变异的突变:LLM驱动的程序进化中的收敛动力学

Can Gurkan, Forrest Stonedahl, Uri Wilensky

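Affiliations * Northwestern University; Augustana College

AI summary Studies whether an LLM that repeatedly mutates a program without selection pressure explores new forms or circles back to old ones, finding that LLM mutation consistently converges to restricted attractor regions; at the structural level, in 87% of chains over 93% of mutations revisit a previously seen structural form.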

Comments Accepted to the Genetic and Evolutionary Computation Conference (GECCO '26) Workshop on Large Language Models for and with Evolutionary Computation

详情
AI中文摘要

当LLM反复变异一个程序时,它是探索新形式还是循环回到旧形式?我们通过分析领域特定语言中无选择压力下的LLM驱动变异链来研究这个问题,变化提示设计、模型族和随机复制。我们发现基于LLM的变异一致收敛到程序空间中的受限吸引子区域。收敛在结构层面尤其严重:在87%的链中,超过93%的变异重复先前看到的结构形式,大多数变异局限于重复模板内的终端替换。循环分析显示短循环和自环主导转移结构。收敛速度随提示措辞和模型选择而变化,但该现象在不同条件下都很稳健。经典的GP子树变异算子没有表现出类似的收敛,表明该效应是LLM变异管道固有的。这些发现揭示了LLM驱动程序进化核心的张力:使语义感知程序转换成为可能的相同能力也带来了对结构同质性的系统性偏差,如果此类系统要维持开放式探索,必须考虑这一点。源代码可在 https://github.com/can-gurkan/lmca 获取。

英文摘要

When an LLM repeatedly mutates a program, does it explore new forms or circle back to the same ones? We study this question by analyzing LLM-driven mutation chains in the absence of selection pressure within a domain-specific language, varying prompt design, model family, and stochastic replication. We find that LLM-based mutation consistently converges toward restricted attractor regions in program space. Convergence is especially severe at the structural level: in 87% of chains, over 93% of mutations revisit a previously seen structural form, with most variation confined to terminal substitutions within recurring templates. Cycle analysis reveals short cycles and self-loops dominating the transition structure. The rate of convergence varies with prompt wording and model choice, but the phenomenon is robust across conditions. A classical GP subtree mutation operator does not exhibit comparable convergence, suggesting that the effect is intrinsic to the LLM mutation pipeline. These findings reveal a tension at the heart of LLM-driven program evolution: the same capabilities that enable semantics-aware program transformation also carry a systematic bias toward structural homogeneity that must be accounted for if such systems are to sustain open-ended exploration. Source code is available at https://github.com/can-gurkan/lmca.
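The structural revisit statistic can be illustrated with a small sketch. The abstract does not define how a "structural form" is computed, so the nested-tuple program encoding and the rule of replacing every terminal with a placeholder are assumptions, as is the toy chain.

```python
def structural_form(expr):
    """Replace every terminal with a placeholder, keeping operators and tree shape."""
    if isinstance(expr, tuple):
        op, *args = expr
        return (op, *[structural_form(a) for a in args])
    return "_"


def revisit_rate(chain):
    """Fraction of mutations whose structural form already appeared earlier in the chain."""
    seen = {structural_form(chain[0])}
    revisits = 0
    for program in chain[1:]:
        form = structural_form(program)
        if form in seen:
            revisits += 1
        seen.add(form)
    return revisits / (len(chain) - 1)


# Hypothetical mutation chain in a toy expression language.
chain = [
    ("add", "x", 1),
    ("add", "y", 2),              # terminal substitution only
    ("mul", ("add", "x", 1), 3),  # new structure
    ("add", "x", 5),              # back to the first template
    ("mul", ("add", "y", 2), 4),  # back to the second template
]
rate = revisit_rate(chain)
```

In this toy chain three of the four mutations land on a template that was already seen, even though every program differs at the terminal level, which is the pattern the abstract describes as variation confined to terminal substitutions.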

2606.05404 2026-06-05 cs.AI cs.CL cs.LG 版本更新

Harnessing Generalist Agents for Contextualized Time Series

利用通用智能体进行情境化时间序列分析

Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He

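Affiliations * University of Illinois Urbana-Champaign

AI summary Proposes the TimeClaw framework, which integrates executable temporal tools, experience-driven capability evolution, and episodic multimodal memory to equip generalist LLM agents with contextualized temporal reasoning, improving performance on benchmarks across energy, finance, and other domains.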

Comments Preprint. 38 Pages

详情
AI中文摘要

时间序列通常嵌入在丰富的上下文中,这对于整体建模至关重要。此外,现实世界的从业者通常需要用于分析时间动态的端到端工作流,其中广泛研究的任务(如预测)只是更广泛解决方案循环中的一个步骤。虽然通用AI智能体为复杂上下文下的此类工作流提供了有前景的接口,但它们主要运行在文本空间中,并未与结构化时间信号完全对齐。在这项工作中,我们引入了TimeClaw,一个用于时间序列的智能体框架,它为通用大语言模型智能体配备了情境化时间推理所需的时间序列原生运行时支持。TimeClaw集成了可执行的时间工具以进行有根据和可审计的分析,经验驱动的能力进化以创建可重用的分析例程,以及用于检索相关推理轨迹的情景多模态记忆。这些组件共同解锁了带有上下文信息的开放式时间推理。在涵盖能源、金融、天气、交通和其他现实世界领域的多个基准上的广泛评估表明,TimeClaw的性能得到了提升。代码可在https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw获取。

英文摘要

Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw.

2606.05403 2026-06-05 cs.LG cs.AI 版本更新

Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

信任,但不验证:LLM 源评估中的认知盲点

Rohan N. Pradhan, Steve Goley

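Affiliations * Amazon

AI summary Studies whether language models evaluate evidence quality during multi-source synthesis, finding that models can detect fabricated statistics but do not recruit this capability during synthesis; source influence instead follows a methodology-register gate, and numeric-validity signals are suppressed.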

详情
AI中文摘要

语言模型日益充当认知代理,综合多个来源的证据以辅助决策。然而,它们是否评估这些证据的质量,还是仅仅基于表面呈现进行聚合,目前尚不清楚。我们表明,模型具备检测伪造统计数据的能力(孤立方法论的正确识别率为0.76-1.00),但在多源综合过程中并未启用这一能力,无论统计数据是伪造还是有效,都会产生相似的数值估计。具体而言,源影响受方法论-语域门控支配,该门控响应分析文本的分布性语域,但不响应数值有效性:例如,统计上不可能的置信区间与有效区间获得相同权重。这种行为分离在来自三个家族(Claude、Qwen、OLMo)的五个模型以及三个专业领域中均得到复现。机制分析(包括因果追踪、线性探针和组件级归因)收敛于同一解释:模型编码并因果使用一种跨领域转移的方法论-语域表示(探针AUC 0.83-0.92),而数值有效性信号(在孤立时可解码)在多源综合中被抑制至随机水平。基于提示的缓解措施(甚至是指定精确统计检查的预言清单)会产生全面怀疑而非选择性辨别,我们检查的后训练流程强化了风格捷径而未建立数值验证。与追踪用户偏好的奉承行为不同,这种失败追踪的是源是否呈现为分析可信,而非其主张是否内部一致。我们称之为认知对齐:与偏好对齐和安全对齐一样,问题不在于能力,而在于部署。

英文摘要

Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions. Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation, remains poorly understood. We show that models possess the capability to detect fabricated statistics (correct identification rates of 0.76-1.00 for methodology in isolation) but do not recruit this capability during multi-source synthesis, producing similar numeric estimates whether the statistics are fabricated or valid. Specifically, source influence is governed by a methodology-register gate that responds to the distributional register of analytical text but not to numeric validity: for example, statistically impossible confidence intervals receive the same weight as valid ones. The behavioral dissociation replicates across five models from three families (Claude, Qwen, OLMo) and three professional domains. Mechanistic analyses, including causal tracing, linear probes, and component-level attribution, converge on the same account: the model encodes and causally uses a methodology-register representation that transfers across domains (probe AUC 0.83-0.92), while numeric-validity signals, decodable in isolation, are suppressed to chance during multi-source synthesis. Prompting-based mitigations, even an oracle checklist naming the exact statistical checks, produce blanket skepticism rather than selective discernment, and the post-training pipelines we examine reinforce the stylistic shortcut without building numeric verification. Unlike sycophancy, which tracks user preference, this failure tracks whether a source presents as analytically credible, not whether its claims are internally consistent. We term this epistemic alignment: like preference and safety alignment, the question is not capability but deployment.

2606.05402 2026-06-05 cs.CL cs.AI 版本更新

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

ReasoningFlow: 理解LLM推理轨迹的话语结构

Jinu Lee, Shivam Agarwal, Amruta Parulekar, Siddarth Madala, Dilek Hakkani-Tur, Julia Hockenmaier

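Affiliations * University of Illinois Urbana-Champaign

AI summary Proposes the ReasoningFlow framework, which models the reasoning traces of large reasoning models as fine-grained directed acyclic graphs; analysis of manual and automatic annotations reveals structural similarity across models, diverse reasoning behaviors, and the relationship between erroneous steps and final answers.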

详情
AI中文摘要

大型推理模型(LRMs)产生的推理轨迹具有非线性结构,如回溯和自我修正,这使推理过程的评估和监控复杂化。我们引入ReasoningFlow,一个将LRM推理轨迹的话语结构捕捉为细粒度有向无环图(DAGs)的框架。我们通过仔细的人工标注31条轨迹(2.1k步)来开发和验证我们的标注方案,实现了高标注者间一致性,然后扩展到自动标注1,260条轨迹(247.7k步),涵盖三个任务(数学、科学、论证)和五个模型(Qwen2.5-32B-Inst、QwQ-32B、DeepSeek-V3、DeepSeek-R1、GPT-oss-120B)。通过分析ReasoningFlow图,我们发现:(1)LRMs表现出结构相似的轨迹,尽管它们基于不同的基础模型训练且可能使用不重叠的后训练数据。(2)ReasoningFlow揭示了多样的细粒度推理行为(例如局部验证、自我反思和假设),可用于更好的推理轨迹可监控性。(3)在LRMs中,大多数错误步骤不用于推导最终答案。(4)步骤之间的机械因果依赖关系不反映语言层面的话语结构。我们在https://github.com/jinulee-v/reasoningflow 发布数据集和代码。

英文摘要

Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: https://github.com/jinulee-v/reasoningflow.

2606.05400 2026-06-05 cs.AI cs.CL cs.LG 版本更新

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

LeanMarathon:通过长视界Lean自动形式化实现可靠的AI合作数学家

Yuanhe Zhang, Yuekai Sun, Taiji Suzuki, Jason D. Lee, Fanghui Liu

Affiliations * Department of Statistics, University of Warwick, UK; Center for Advanced Intelligence Project, RIKEN, Japan; Department of Statistics, University of Michigan, USA; Department of Mathematical Informatics, The University of Tokyo (also Center for Advanced Intelligence Project, RIKEN, Japan); Department of Electrical Engineering and Computer Sciences and Department of Statistics, University of California, Berkeley, USA; School of Mathematical Sciences, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, China

AI summary Proposes LeanMarathon, a multi-agent harness that uses a blueprint abstraction and a two-stage orchestrator for reliable long-horizon autoformalization of research mathematics, formalizing seven theorems across four Erdős problems.

Comments 26 pages, 9 figures. Comments are welcome

详情
AI中文摘要

长视界研究数学的自动形式化不仅在困难引理上失败,而且在规模上失败:陈述漂移、依赖关系纠缠、上下文衰减以及局部修复破坏远处的工作。我们提出LeanMarathon,一个用于可靠的研究级Lean自动形式化的多智能体框架。其核心抽象是一个演化的蓝图:一个Lean文件,同时作为形式化证明骨架、自然语言证明图和共享系统记录。四个合约范围的智能体构建、审计、证明和修复这个蓝图。这些智能体由一个两阶段编排器协调,该编排器首先通过对抗性审查稳定目标保真度,然后从动态叶节点向上并行地通过CI门控轮次释放证明有向无环图(DAG)。LeanMarathon将一次脆弱的数小时运行转变为许多局部、可恢复、并行的交易。我们在两篇最近的研究论文上评估LeanMarathon,涵盖四个Erdős问题(#1051, #1196, #164, #1217)。在三次自主运行中,它形式化了所有七个目标定理,没有留下任何sorry,证明了258个引理和定理。这些结果表明,可靠的AI合作数学不仅需要更强的证明器,还需要耐用的框架,以在长数学发展过程中保持目标保真度。代码可在https://github.com/YuanheZ/LeanMarathon找到。

英文摘要

Long-horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.
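Discharging a proof DAG "from its leaves upward in parallel rounds" amounts to a layered topological order. The sketch below shows that scheduling idea only; the lemma names and the static dependency map are hypothetical, and the real orchestrator is described as working with dynamic leaves and CI gating, which this sketch does not model.

```python
def proof_rounds(deps):
    """Group lemmas into rounds; every lemma depends only on lemmas from earlier rounds."""
    remaining = {name: set(d) for name, d in deps.items()}
    done, rounds = set(), []
    while remaining:
        # A lemma is ready once all of its dependencies have been proved.
        ready = sorted(n for n, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("no provable lemma left: cycle or missing dependency")
        rounds.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return rounds


# Hypothetical blueprint: each key lists the lemmas it depends on.
deps = {
    "main_theorem": ["lemma_a", "lemma_b"],
    "lemma_a": ["lemma_c"],
    "lemma_b": ["lemma_c", "lemma_d"],
    "lemma_c": [],
    "lemma_d": [],
}
rounds = proof_rounds(deps)
```

All lemmas within one round are independent of each other, so they could be handed to separate prover agents in parallel, and a failed proof only blocks the lemmas that depend on it.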

2606.05396 2026-06-05 cs.CR cs.AI cs.SE 版本更新

Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

有意但无力:通过消融分离代码大语言模型中的拒绝与能力

Cristina Carleo, Pietro Liguori, Naghmeh Ivaki, Domenico Cotroneo

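Affiliations * University of Naples Federico II; University of Coimbra; University of North Carolina at Charlotte

AI summary Applies abliteration, a low-rank weight edit, to code LLMs to remove their refusal of vulnerability-injection prompts, separating willingness from code-generation capability; after abliteration the refusal rate falls to zero or near zero while syntactic validity stays above 93%, but the injection rate remains bounded by model capacity.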

详情
AI中文摘要

大规模生成带标签的脆弱代码是基于学习的漏洞检测的一个反复出现的障碍:挖掘的语料库带有大量标签噪声,而现有的基于LLM的增强方法传播了这些不准确性,因为它转换了脆弱的种子,而不是根据规范合成漏洞。一个补充的途径是从安全代码开始,要求经过指令调优的LLM注入指定的CWE(这将把标签负担从开放式的检测转移到有界的二元确认),但安全对齐的代码LLM系统地拒绝此类提示。本文是对消融技术(abliteration)的初步可行性研究,这是一种低秩权重编辑,通过正交投影消除残差流中的拒绝方向,作为消除这一障碍的工具。我们使用Python和CWE-89(SQL注入)作为案例研究,评估了Qwen2.5-Coder-Instruct系列在3B、7B和14B参数下对从PromSec和SafeCoder中抽取的安全样本的表现,每种条件重复三次。我们发现:(i)对注入提示的拒绝强烈依赖于大小和提示上下文:14B模型拒绝100%的提示,7B模型拒绝73%的PromSec但仅5%的SafeCoder,而3B模型基本不受阻;(ii)消融技术将所有大小模型的拒绝率降至零或接近零,同时语法有效性保持在93%以上,支持了在这种设置下拒绝可以与测量的代码生成能力分离的观点;(iii)消融后的注入率仍然受容量限制(14B为88-97%,7B为89-90%,3B为25-48%),将意愿(消融技术解锁)与能力(随参数扩展)分离。漏洞判定由三个工具的检测器集成(CodeQL、Semgrep、Bandit)产生,然后由两位作者对检测器阳性输出进行人工裁决。

英文摘要

Producing labeled vulnerable code at scale is a recurring obstacle for learning-based vulnerability detection: mined corpora carry substantial label noise, and existing LLM-based augmentation propagates these inaccuracies because it transforms vulnerable seeds rather than synthesising vulnerabilities from a specification. A complementary route is to start from safe code and ask an instruction-tuned LLM to inject a specified CWE (which would shift the labeling burden from open-ended detection to bounded binary confirmation), but safety-aligned code LLMs systematically refuse such prompts. This paper is a preliminary feasibility study of abliteration, a low-rank weight edit that orthogonally projects out the refusal direction in the residual stream, as a tool to remove this barrier. We use Python and CWE-89 (SQL injection) as a case study, evaluating the Qwen2.5-Coder-Instruct family at 3B, 7B, and 14B parameters on safe samples drawn from PromSec and SafeCoder, replicated three times per condition. We find that (i) refusal on injection prompts is strongly size- and prompt-context-dependent: the 14B refuses 100% of prompts, the 7B refuses 73% of PromSec but only 5% of SafeCoder, whereas the 3B is essentially never blocked; (ii) abliteration reduces refusal to zero or near-zero across all sizes while leaving syntactic validity above 93%, supporting the view that, in this setting, refusal can be detached from measured code-generation capability; and (iii) the post-abliteration injection rate remains capacity-bound (88-97% on the 14B, 89-90% on the 7B, and 25-48% on the 3B), separating willingness, which abliteration unlocks, from capability, which scales with parameters. Vulnerability verdicts are produced by a three-tool detector ensemble (CodeQL, Semgrep, Bandit) followed by manual adjudication by two authors on detector-positive outputs.
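The orthogonal projection at the core of this kind of edit can be written as W' = (I - r rᵀ) W for a unit direction r. The sketch below shows only that linear-algebra step on a toy matrix; the abstract does not say which weight matrices are edited or how the refusal direction is estimated, so `W` and `refusal_direction` are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project_out(weight, direction):
    """Return (I - r r^T) @ weight, where r is the unit-normalized direction.

    weight is a d_model x d_in matrix (list of rows) assumed to write into the residual stream.
    """
    r = normalize(direction)
    d_in = len(weight[0])
    # r^T W: how strongly each input column writes along r
    coeffs = [sum(r[i] * weight[i][j] for i in range(len(r))) for j in range(d_in)]
    # subtract the rank-1 component r (r^T W)
    return [[weight[i][j] - r[i] * coeffs[j] for j in range(d_in)] for i in range(len(r))]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_direction = [0.0, 3.0, 4.0]  # hypothetical; in practice estimated from activations
W_edited = project_out(W, refusal_direction)

# After the edit, nothing the matrix writes has a component along the direction.
r_unit = normalize(refusal_direction)
residual = [sum(r_unit[i] * W_edited[i][j] for i in range(3)) for j in range(2)]
```

Because the subtracted term r (rᵀ W) has rank one, the edit is low-rank, and any output component orthogonal to r is left unchanged.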

2605.04135 2026-06-05 cs.CY cs.AI cs.CL 版本更新

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

前沿滞后:学术AI评估中能力误述的文献计量审计

David Gringras, Misha Salahshoor

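Affiliations * Harvard University; AISST

AI summary An audit of 112,303 LLM-related candidate records finds that the median paper evaluates a model 10.85 ECI behind the contemporaneous frontier (about 1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5), that the gap is widening by 5.53 ECI per year, that only 3.2% of abstracts disclose reasoning-mode status, and that 52.5% of conclusions generalize results to "AI"; proposed remedies include the VERSIO-AI checklist.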

Comments v2. 65 pp, 9 figs, 8 tables, 8 appendices. Pre-registered on OSF: doi.org/10.17605/OSF.IO/7XM3D. Code+data: doi.org/10.5281/zenodo.20060457. VERSIO-AI v1.2 reporting checklist (Appendix A): doi.org/10.5281/zenodo.20060459. frontierlag package + per-DOI audit tool: frontierlag.org

详情
AI中文摘要

应用领域LLM能力评估的读者希望了解AI系统当前能做什么。但相关文献回答的是一个相关但结果不同的问题:更旧、更便宜、更少引导的模型在数月或数年前能做什么(例如,一篇2026年的论文评估GPT-3.5或GPT-4零样本,对比前沿的推理能力、工具使用系统如GPT-5.5 Pro和Claude Opus 4.7),通常报告稀疏的配置细节,并抽象上升为关于“AI”的声明,通过引用、媒体和政策传播。我们在一个预注册的审计中测量了“发表引导差距”(这些答案之间的差距),审计了112,303条LLM关键词匹配的候选记录(2022年1月至2026年4月;18,574条可接受,4,766篇全文可检索),将测试模型与同期前沿在Epoch AI能力指数(ECI)上进行比较,并在Arena Elo和Artificial Analysis上复现。中位论文评估的模型在评估时落后同期前沿+10.85 ECI(约Claude Sonnet 3.7与Claude Opus 4.5距离的1.4倍)(H1);一个探索性的理性滞后基线(H8)将其分解为约25%的同行评审延迟和约75%的额外滞后。差距以每年+5.53 ECI的速度扩大(H2;95% CI [+5.03, +5.83])。同时,仅3.2%的摘要(21.2%的全文)披露了具有推理能力模型的推理模式状态(H4),52.5%(95% CI [48.2, 56.9])的结论以“AI”而非被评估模型(们)的层面陈述,并以OR = 1.23/年的速度上升。提出的补救措施包括API访问补贴和编辑执行报告框架,强制披露配置表面(模型快照、推理模式/努力、工具访问、脚手架、提示等);VERSIO-AI是一个13项检查表(核心3项桌面拒稿),在引导表面扩展现有框架,并在frontierlag.org上提供每DOI分析。

英文摘要

Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-3.5 or GPT-4 zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about "AI" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena Elo and Artificial Analysis. The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into ~25% peer-review latency, ~75% excess lag. The gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Meanwhile, only 3.2% of abstracts (21.2% of full-texts) disclose reasoning-mode status on reasoning-capable models (H4) and 52.5% (95% CI [48.2, 56.9]) state conclusions at the level of "AI" rather than the evaluated model(s), rising at OR = 1.23/year. Proposed remedies include API-access subsidies and editorial enforcement of reporting frameworks mandating configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, prompting, etc.); VERSIO-AI is a 13-item checklist (Core 3 desk-reject) extending existing frameworks at the elicitation surface, with per-DOI analysis at frontierlag.org.

2605.02395 2026-06-05 cs.AI 版本更新

Controllable and Verifiable Process Data Synthesis for Process Reward Models

用于过程奖励模型的可控且可验证的过程数据合成

Yinghui Chi, Lucien Wang

发表机构 * Jilin University(吉林大学)

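AI summary Proposes a controllable and verifiable framework that synthesizes process supervision data by injecting template-aware errors and recomputing the subsequent steps, improving process reward models on logical and mathematical reasoning.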

详情
AI中文摘要

过程奖励模型(PRMs)依赖于高质量的过程监督数据,但现有的构建方法通常对错误位置、错误类型和轨迹一致性的控制有限。我们提出了一个可控且可验证的框架,用于合成PRMs的过程监督数据。我们的框架首先构建一个正确的符号推理链,在中间步骤注入一个模板感知错误,在受损状态下重新计算后续步骤,并验证注入的步骤不能从其前缀推导出来。得到的配对轨迹在第一个错误处前缀无效,但在符号重新计算后保持轨迹一致,并被翻译成对齐的自然语言过程,用于PRM训练和评估。实验表明,合成数据改进了逻辑推理基准上的Best-of-8重排序,并迁移到数学推理。步骤级评估进一步表明,第一个错误定位仍然比整体步骤分类更具挑战性,凸显了对细粒度且可验证的过程监督的需求。

英文摘要

Process reward models (PRMs) rely on high-quality process supervision data, yet existing construction methods often provide limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. Our framework first constructs a correct symbolic reasoning chain, injects a template-aware error into an intermediate step, recomputes subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error while remaining trajectory-consistent after symbolic recomputation, and are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning. Step-level evaluation further shows that first-error localization remains substantially more challenging than overall step classification, highlighting the need for fine-grained and verifiable process supervision.
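
The construct, inject, recompute, and verify procedure described above can be illustrated on a toy arithmetic chain. This is a minimal sketch, not the authors' pipeline: the `(op, operand)` step format, the off-by-`offset` error template, and the names `run_chain` and `inject_error` are assumptions made for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def run_chain(start, steps):
    """Return the intermediate states of a chain of (op, operand) steps."""
    states = []
    value = start
    for op, operand in steps:
        value = OPS[op](value, operand)
        states.append(value)
    return states


def inject_error(start, steps, k, offset=1):
    """Corrupt step k, then recompute the later steps from the corrupted state."""
    correct = run_chain(start, steps)
    bad_value = correct[k] + offset  # toy error template: result is off by `offset`
    # Verification: the injected value must not be what the prefix actually derives.
    if bad_value == correct[k]:
        raise ValueError("injected step is derivable from its prefix")
    corrupted = correct[:k] + [bad_value]
    # Later steps stay consistent with the corrupted state rather than the true one.
    corrupted.extend(run_chain(bad_value, steps[k + 1:]))
    return correct, corrupted, k


steps = [("+", 3), ("*", 4), ("-", 5)]
correct, corrupted, first_error = inject_error(2, steps, k=1)
print(correct, corrupted, first_error)
```

The two trajectories share a valid prefix, diverge at index `first_error`, and the corrupted one remains internally consistent afterwards, which is the paired structure the abstract describes.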

2606.05395 2026-06-05 cs.RO cs.AI 版本更新

VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents

VASO:物理AI智能体的形式可验证自进化技能

Yunhao Yang, Neel P. Bhatt, Kevin Wang, Samuel Tetteh, Zhangyang Wang, Ufuk Topcu

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Iowa State University(爱荷华州立大学)

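AI summary Proposes VASO, a framework in which formal verification guides the self-evolution of LLM-generated robot skill contracts: model-checking counterexamples are turned into textual gradients that update the contracts without fine-tuning model weights, reaching 97.2% formal-specification compliance on Jackal and quadcopter tasks.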

Comments Project webpage: https://languagegroundedriskdetection.github.io/ProjectPage/vaso-webpage/

详情
AI中文摘要

可重用的机器人技能正在成为具身智能体将开放式指令转化为长时域物理行为的基本单元。我们认为,虽然基础模型大幅降低了创建这些技能的成本,但信任它们的成本并未降低。现有的技能进化循环通过执行反馈、单元测试、环境奖励或LLM自我批评来改进技能,但这些信号仅提供痕迹级别的证据:它们表明技能在采样执行中有效,而非技能引发的计划在未经测试的条件下满足时间安全合约。我们提出VASO,一个用于验证引导的LLM生成机器人技能合约自进化的框架。在VASO中,每个技能被表示为具有两个耦合接口的语义合约:一个形式接口,将机器人状态、观测和控制命令与用于模型检查的逻辑命题对齐;一个面向规划器的接口,指导可执行行为的生成。模型检查器首先过滤逻辑不一致的技能合约,然后验证由该技能引发的计划是否满足全局和局部时间规范。当验证失败时,VASO将反例轨迹转化为文本梯度,更新可重用的技能合约,同时保持基础模型权重冻结。在Clearpath Jackal和PX4四旋翼任务中,VASO使用少于100个优化样本达到了97.2%的形式规范符合率,优于执行反馈、提示优化和微调基线。据我们所知,VASO是首个将形式验证与物理AI智能体的自进化LLM生成技能闭环的框架:形式反例成为可重用机器人技能合约的优化反馈,而不仅仅是验证一次性计划、调优规划器提示或微调模型权重。

英文摘要

Reusable robot skills are becoming the basic units through which embodied agents turn open-ended instructions into long-horizon physical behavior. We argue that, while foundation models have collapsed the cost of creating these skills, the cost of trusting them has not. Existing skill-evolution loops refine skills through execution feedback, unit tests, environment reward, or LLM self-critique, but these signals provide only trace-level evidence: they show that a skill worked on sampled executions, not that skill-induced plans satisfy temporal safety contracts under untested conditions. We introduce VASO, a framework for verification-guided self-evolution of LLM-generated robot skill contracts. In VASO, each skill is represented as a semantic contract with two coupled interfaces: a formal interface that aligns robot states, observations, and control commands with logical propositions for model checking, and a planner-facing interface that guides executable behavior generation. A model checker first filters logically inconsistent skill contracts, then verifies plans induced by the skill against global and local temporal specifications. When verification fails, VASO translates the counterexample trace into a textual gradient that updates the reusable skill contract while keeping foundation-model weights frozen. On Clearpath Jackal and PX4 quadcopter tasks, VASO reaches 97.2% formal-specification compliance using fewer than 100 optimization samples, outperforming execution-feedback, prompt-optimization, and fine-tuning baselines. To our knowledge, VASO is the first framework that closes the loop between formal verification and self-evolving LLM-generated skills for physical AI agents: formal counterexamples become optimization feedback for reusable robot skill contracts, rather than merely verifying one-off plans, tuning planner prompts, or fine-tuning model weights.

2606.05391 2026-06-05 cs.SE cs.AI 版本更新

Human oversight of agentic systems in practice: Examining the oversight work, challenges, and heuristics of developers using software agents

实践中对智能体系统的人类监督:考察使用软件代理的开发者的监督工作、挑战与启发式方法

Shipi Dhanorkar, Samir Passi, Mihaela Vorvoreanu

发表机构 * Microsoft, Redmond, USA(微软,美国雷德蒙德)

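AI summary Interviews with 17 experienced developers explore how humans oversee autonomous software agents in practice, identifying four forms of oversight work (a priori control, co-planning, real-time monitoring, post hoc review) and summarizing oversight challenges and coping strategies.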

详情
AI中文摘要

自主软件代理有望提高开发者的生产力,但会犯错并表现出新颖的故障模式,使得人类监督成为成功的人机协作的关键。现有关于代理监督的研究主要是概念性的;存在规范性框架,但用户实际如何监督代理尚不明确。本文通过为代理监督的理论讨论提供早期实证锚点来弥合这一差距。基于对17位经验丰富的开发者的访谈,我们进行了一项探索性调查,考察开发者执行哪些形式的涌现监督工作、何时以及如何执行。我们还记录了开发者面临的监督挑战以及他们开始使用的应对策略。我们发现了至少四种形式的涌现监督工作:先验控制、协同规划、实时监控和事后审查。我们表明,监督工作不仅是现有研究中所描绘的反应性和回顾性的,而且是预防性和主动性的。我们描述了情境化的监督挑战(例如,难以审查代理生成的代码),并概述了开发者为解决这些挑战而采用的启发式方法(例如,使用测试结果作为代码正确性的保证)。最后,我们总结了高层次要点、未来研究方向、对软件代理的人本设计及软件工程实践的影响,以及我们研究的局限性。

英文摘要

Autonomous software agents hold promise to increase developer productivity but make mistakes and exhibit novel failure modes, making human oversight central to successful human-agent collaboration. Existing research on agent oversight is largely conceptual; normative frameworks exist, but how users actually oversee agents is less known. In this paper, we bridge this gap by providing early empirical anchors for the theoretical discourse on agent oversight. Drawing on interviews with 17 experienced developers, we conduct an exploratory inquiry examining what forms of emergent oversight work developers perform, when, and how. We also document the oversight challenges developers face and the strategies they have started using to address them. We found at least four forms of emergent oversight work: a priori control, co-planning, real-time monitoring, and post hoc review. We show that oversight work is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive. We describe situated oversight challenges (e.g., difficulty reviewing agent-generated code) and outline heuristics developers adopt to address such challenges (e.g., using test results as guarantees for code correctness). We conclude with high-level takeaways, future research directions, implications for the human-centered design of software agents and for software engineering practice, and limitations of our research.

2606.05389 2026-06-05 cs.AI 版本更新

Residual Modeling for High-Fidelity Learned Compression of Scientific Data

面向科学数据高保真学习型压缩的残差建模

Liangji Zhu, Sanjay Ranka, Anand Rangarajan

发表机构 * Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL, USA

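AI summary Addresses residual correction dominating the total rate of learned compression at high fidelity by proposing two residual coders, LBRC and NGLR, whose residual-specific representations improve the compression ratio.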

Comments 9 pages, 3 figures, 3 tables

详情
AI中文摘要

有损压缩对于科学模拟产生的大规模时空数据至关重要。学习型压缩器在中等精度目标下可实现高压缩比,但其聚合重建损失无法保证每个块的精度。现有的保证自编码器(GAE)方法通过保留SVD/PCA风格的系数直到满足目标,添加逐块残差校正。这在中等容差下有效,但在块级NRMSE从10^-6到10^-4的高保真度范围内,保留的系数数量迅速增长,校正流主导总速率。我们提出以残差为中心的观点:学习残差在结构上不同于原始科学场,应使用为该残差设计的表示进行编码。我们引入两种残差编码器。LBRC是一种确定性、无需训练的处理流程,自适应地将学习残差量化到目标NRMSE,并使用3D Lorenzo差分、锯齿映射、位平面编码和熵编码对得到的整数残差进行无损编码。NGLR增加了一个因果神经预测器,在相同的确定性整数处理流程中为整数舍入的Lorenzo预测输出归一化偏置,在保持确定性解码的同时降低剩余残差码的熵。预测器权重被序列化并计入比特流。在E3SM、JHTDB和ERA5数据集上,块级NRMSE目标从10^-6到10^-4,LBRC相比GAE压缩比提升30-60%,与SZ基本相当。NGLR相比LBRC进一步提升10-40%,并在评估的高保真度范围内优于SZ。这些结果表明,当全局残差校正成为速率主导时,针对学习压缩器残差定制的残差表示可以保持学习压缩的优势。

英文摘要

Lossy compression is essential for massive spatiotemporal data from scientific simulations. Learned compressors can achieve high compression ratios at moderate accuracy targets, but their aggregate reconstruction losses do not guarantee accuracy for each block. Existing Guaranteed Autoencoder (GAE) methods add a per-block residual correction by retaining SVD/PCA-style coefficients until the target is met. This works at moderate tolerances, but in the high-fidelity regime with block-level NRMSE from 10^-6 to 10^-4, the number of retained coefficients grows quickly and the correction stream dominates the total rate. We propose a residual-centric view: the learned residual is structurally different from the original scientific field and should be coded with a representation designed for that residual. We introduce two residual coders. LBRC is a deterministic, training-free pipeline that adaptively quantizes the learned residual to the target NRMSE and losslessly encodes the resulting integer residual using 3D Lorenzo differencing, zigzag mapping, bit-plane coding, and entropy coding. NGLR adds a causal neural predictor that outputs a normalized bias for an integer-rounded Lorenzo prediction in the same deterministic integer pipeline, reducing the entropy of the remaining residual code while preserving deterministic decoding. The predictor weights are serialized and counted in the bitstream. Across E3SM, JHTDB, and ERA5 at block-level NRMSE targets from 10^-6 to 10^-4, LBRC improves compression ratio over GAE by 30-60% and is broadly competitive with SZ. NGLR adds a further 10-40% over LBRC and outperforms SZ in the evaluated high-fidelity regime. These results show that residual representations tailored to learned-compressor residuals can preserve the advantage of learned compression when global residual correction becomes rate-dominant.
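
Two of the lossless stages named above, 3D Lorenzo differencing and zigzag mapping, can be sketched on an already-quantized integer volume. This is a minimal illustration under stated assumptions, not the LBRC implementation: out-of-range neighbours are treated as zero, the volume is a nested list indexed `[i][j][k]`, all function names are hypothetical, and the quantization, bit-plane, and entropy coding stages are omitted.

```python
def _at(vol, i, j, k):
    """Read vol[i][j][k], treating negative (out-of-range) indices as zero."""
    if i < 0 or j < 0 or k < 0:
        return 0
    return vol[i][j][k]


def lorenzo_predict(vol, i, j, k):
    """3D Lorenzo prediction from the seven previously visited neighbours."""
    return (_at(vol, i - 1, j, k) + _at(vol, i, j - 1, k) + _at(vol, i, j, k - 1)
            - _at(vol, i - 1, j - 1, k) - _at(vol, i - 1, j, k - 1)
            - _at(vol, i, j - 1, k - 1)
            + _at(vol, i - 1, j - 1, k - 1))


def zigzag(n):
    """Map signed integers to non-negative ones: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z):
    """Inverse of zigzag."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def encode(vol):
    """Zigzag-mapped Lorenzo residuals of an integer volume."""
    ni, nj, nk = len(vol), len(vol[0]), len(vol[0][0])
    return [[[zigzag(vol[i][j][k] - lorenzo_predict(vol, i, j, k))
              for k in range(nk)] for j in range(nj)] for i in range(ni)]


def decode(codes):
    """Rebuild the volume in raster order, so every neighbour is known when needed."""
    ni, nj, nk = len(codes), len(codes[0]), len(codes[0][0])
    vol = [[[0] * nk for _ in range(nj)] for _ in range(ni)]
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                vol[i][j][k] = lorenzo_predict(vol, i, j, k) + unzigzag(codes[i][j][k])
    return vol


volume = [[[1, 2], [3, 5]], [[2, 4], [7, 11]]]
assert decode(encode(volume)) == volume  # the two stages are exactly invertible
```

Because both stages operate on integers, decoding is deterministic and exact, which matches the deterministic-decoding property the abstract emphasizes.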

2606.05384 2026-06-05 cs.AI cs.CL 版本更新

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

稳定性与可操纵性:评估LLM裁判在决策后交互下的鲁棒性

Srimonti Dutta, Akshata Kishore Moharir

发表机构 * WAI USA Research Labs(WAI美国研究实验室)

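AI summary Studies the manipulability of LLM judges under post-decision interaction: judges are highly stable under repeated neutral re-evaluation, but targeted challenges can substantially reverse their verdicts; proposes the Evaluation Robustness Score (ERS) to quantify interactional robustness.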

Comments Accepted at ACL 2026 GEM (Generation, Evaluation and Metrics) Workshop

详情
AI中文摘要

LLM作为裁判的评估广泛用于基准测试流程,其中模型输出通过自动评估器进行比较和排序。这些流程通常假设判决是固定输入的稳定属性。我们证明这一假设在交互下不成立。我们研究决策后可操纵性:在初始判决做出后,通过与裁判的后续对话改变评估结果的程度。在MT-Bench和AlpacaEval上的控制实验中,我们发现LLM裁判在重复和中性重新评估下高度稳定,但在针对性决策后挑战下变得显著可逆。反基线挑战协议表明,稳定判决可以通过动机性交互被推翻,而平衡目标验证协议将这种可逆性与净目标导向的引导区分开。这些逆转具有实际后果:它们可能降低与人类偏好的一致性,改变基准排名,并在高自我报告置信度下产生有害的评估变化。权威框架尤其具有破坏稳定性,修订后的判决通常伴随低重叠的论证,表明事后合理化而非可靠的错误纠正。我们引入评估鲁棒性分数(ERS),通过结合逆转敏感性和平衡方向效应来量化交互鲁棒性。我们的发现将决策后交互确定为LLM作为裁判评估的一个独特失败模式,并激励评估协议不仅测量静态一致性,还测量挑战下的鲁棒性。

英文摘要

LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made. Across controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral reevaluation, yet become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering. These reversals have practical consequences: they can degrade agreement with human preferences, shift benchmark rankings, and produce harmful evaluation changes despite high self-reported confidence. Authority framing is especially destabilizing, and revised judgments are often accompanied by low-overlap justifications, suggesting post hoc rationalization rather than reliable error correction. We introduce the Evaluation Robustness Score (ERS) to quantify interactional robustness by combining reversal susceptibility with counterbalanced directional effects. Our findings identify post-decision interaction as a distinct failure mode for LLM-as-judge evaluation and motivate evaluation protocols that measure not only static agreement, but robustness under challenge.

2606.05383 2026-06-05 econ.GN cs.AI econ.TH q-fin.EC 版本更新

Can AI Refute Economic Theory? Evidence from Beyond the Knowledge Cutoff

AI能否反驳经济理论?来自知识截止日期之外的证据

Alexis Akira Toda

发表机构 * Department of Economics, Emory University(埃默里大学经济学系)

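AI summary Tests several AI models (Gemini, Refine, Claude, and ChatGPT) on four published economic theory papers that each contain an error; ChatGPT Pro performed best, but no model located a true error without substantial human guidance, indicating that AI cannot yet refute economic theory on its own.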

详情
AI中文摘要

人工智能(AI)能否反驳经济理论?我记录了实验,其中我要求几个AI模型(Gemini、Refine、Claude和ChatGPT)检查四篇已发表的经济理论论文的正确性,每篇论文都包含一个我帮助识别或纠正的错误。ChatGPT Pro表现最佳,偶尔构建反例并纠正证明,而其他模型表现较差。然而,没有模型能在没有大量人工指导的情况下找到真正的错误,数据污染使解释复杂化。我认为,一个有能力的人类与前沿模型配对可以超越当前的同行评审,但AI尚不能独立反驳经济理论。

英文摘要

Can artificial intelligence (AI) refute economic theory? I document experiments in which I asked several AI models (Gemini, Refine, Claude, and ChatGPT) to check the correctness of four published papers in economic theory, each containing an error that I helped identify or correct. ChatGPT Pro performed best, occasionally constructing counterexamples and corrected proofs, while other models fared worse. However, no model located a true error without substantial human guidance, and data contamination complicates interpretation. I argue that a competent human paired with a frontier model can outperform current peer review, but AI cannot yet refute economic theory on its own.

2606.05382 2026-06-05 cs.AI 版本更新

Synthetic Contrastive Reasoning for Multi-Table Q&A

合成对比推理用于多表问答

Ankit Pratap Singh, Xin Su, Phillip Howard

发表机构 * Iowa State University(爱荷华州立大学) Thoughtworks

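AI summary To address the lack of reasoning supervision in multi-table Q&A, generates synthetic contrastive reasoning traces with heterogeneous LLMs and fine-tunes models with Contrastive Preference Optimization, improving over Q&A supervised fine-tuning by 9.7%-16.3% absolute on average.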

详情
AI中文摘要

多表问答要求模型检索相关证据、链接模式并在关系表之间进行组合推理。现有的多表问答资源通常提供问题和最终答案,但缺乏解释答案如何得出的推理监督。为弥补这一空白,我们通过使用异构LLM生成经过验证的正向轨迹和合理的负向轨迹,为MMQA构建了一个合成对比推理轨迹数据集。然后,我们利用生成的偏好对,通过对比偏好优化(CPO)微调开源权重LLM。在Qwen3-14B、Mistral-8B和Llama-3.1-8B上,CPO相比问答监督微调取得了9.7%-16.3%的绝对平均提升,在MMQA上最高提升21个百分点。消融实验表明,异构的正向和负向轨迹生成器增强了对比信号,自动评估和人工评估均显示生成的轨迹对基本忠实、连贯且具有有意义的对比性。

英文摘要

Multi-table question answering requires models to retrieve relevant evidence, link schemas, and perform compositional reasoning across relational tables. Existing multi-table Q&A resources typically provide questions and final answers but lack reasoning supervision that explains how answers are derived. To address this gap, we construct a synthetic contrastive reasoning-trace dataset for MMQA by generating validated positive traces and plausible negative traces with heterogeneous LLMs. We then use the resulting preference pairs to fine-tune open-weight LLMs with Contrastive Preference Optimization (CPO). Across Qwen3-14B, Mistral-8B, and Llama-3.1-8B, CPO achieves absolute average improvements over Q&A supervised fine-tuning ranging from 9.7%-16.3%, with gains up to 21 percentage points on MMQA. Ablations show that heterogeneous positive and negative trace generators strengthen the contrastive signal, and automated as well as human evaluations indicate that the generated pairs are largely faithful, coherent, and meaningfully contrastive.

2606.05378 2026-06-05 cs.LG cs.AI 版本更新

Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models

模式选择性并非任务因果结构:1B类语言模型中组合任务电路的跨架构机制研究

Yongzhong Xu

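AI summary Applies a unified protocol to attention-head circuits in three 1B-class language models across four composed tasks, finds that different models implement the same task through different attention patterns, introduces a five-category screen-outcome taxonomy, and proposes a falsifiable hypothesis that the MoE model builds composed-task circuits on a previous-token positional substrate.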

Comments 27 pages, 3 figures

详情
AI中文摘要

我们测试了一个单一的筛选与消融方案——通过任务模式选择性识别注意力头电路,然后通过与匹配随机零假设进行因果消融验证——是否能在不同模型家族中产生一致的机制性结论。该方案可在不同流水线间移植;但它识别出的具体电路则不能。在四个组合任务(间接宾语识别、大于、后继序列、变量绑定)和三个来自不同训练流水线的1B类语言模型(Pythia 1B / Pile / 密集;OLMo 1B / DCLM / 密集;OLMoE 1B-7B / DCLM / 混合专家)上,我们运行了一个统一协议,每个单元使用十个种子采样匹配随机零假设。由此产生的12个(任务,模型)单元中,没有两个在可比较的效应大小下共享相同的主要因果筛选:同一任务,具有相同的行为能力,在不同模型中通过不同的注意力模式类型实现。 我们引入了一个五类筛选结果分类法——主要原因、次要原因、相关物、干扰物、零——并附有定量阈值,并展示了所有五类结果均出现在面板中。我们提出了一个可证伪的假设:我们面板中的MoE模型在一个基础的前一个token位置基板之上构建组合任务电路(对于OLMoE 1B-7B,前一个token电路消融在4个任务中的3个上是最强的因果筛选),IOI例外与IOI是最终位置名称复制任务一致,其结构直接探测不同的模式。该假设附带对其他MoE语言模型的明确预测。 我们诚实地构建方法论:来自配套方法论论文的谱参与比信号是专门化计算的一般指标;使发现具有任务特异性的是任务模式筛选加上每个模型的因果验证。

英文摘要

We test whether a single screen-and-ablate recipe -- identify attention-head circuits by task-pattern selectivity, then verify by causal ablation against a matched-random null -- produces consistent mechanistic claims across model families. The recipe ports across pipelines; the specific circuit it identifies does not. Across four composed tasks (indirect-object identification, greater-than, successor sequences, variable binding) and three 1B-class language models from distinct training pipelines (Pythia 1B / Pile / dense; OLMo 1B / DCLM / dense; OLMoE 1B-7B / DCLM / mixture-of-experts), we run a unified protocol with the matched-random null sampled across ten seeds per cell. The resulting 12 (task, model) cells contain no two that share the same primary causal screen at comparable effect size: the same task, with the same behavioral capability, is implemented through different attention-pattern types across models. We introduce a five-category screen-outcome taxonomy -- primary cause, secondary cause, correlate, interferer, null -- with quantitative thresholds, and show that all five outcomes appear in the panel. We propose a falsifiable hypothesis: the MoE model in our panel builds composed-task circuits on top of a foundational previous-token positional substrate (the prev-token-circuit ablation is the strongest causal screen on 3 of 4 tasks for OLMoE 1B-7B), with the IOI exception consistent with IOI being a final-position name-copying task whose structure directly probes a different pattern. The hypothesis comes with explicit predictions for other MoE language models. We frame the methodology honestly: the spectral participation-ratio signal from the companion methodology paper is a general indicator of specialized computation; what makes a finding task-specific is the task-pattern screen plus a per-model causal verification.

2606.05375 2026-06-05 cs.CV cs.AI 版本更新

Three-Dimensional Retinal Microvasculature Restoration in OCT Angiography

OCT血管造影中的三维视网膜微血管修复

Yukun Guo, Min Gao, Tristan T. Hormel, Steven T. Bailey, Thomas S. Hwang, Yali Jia

发表机构 * Casey Eye Institute, Oregon Health & Science University(俄勒冈健康与科学大学Casey眼科研究所) Department of Biomedical Engineering, Oregon Health & Science University(俄勒冈健康与科学大学生物医学工程系)

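AI summary Proposes a deep learning algorithm, built from an EfficientNet-B5 encoder and a decoder with concurrent spatial and channel squeeze-and-excitation modules, that restores capillary anatomy from a single OCTA volume and significantly improves image quality and microvascular fidelity.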

详情
AI中文摘要

光学相干断层扫描血管造影(OCTA)是一种用于成像视网膜微血管的强大技术。然而,由于成像伪影,获取可靠的视网膜血流和视网膜无灌注区域量化具有挑战性。现有方法主要关注噪声抑制、投影伪影去除或信号增强,以改善OCTA在横截面或二维(2D)正面投影中的图像质量,而忽略了内在的三维血管结构。在本研究中,我们提出了一种基于深度学习的算法,用于从单个OCTA体数据中恢复毛细血管解剖血管结构。该网络由EfficientNet-B5编码器和结合了并行空间与通道挤压激励模块的解码器组成,通过跳跃连接保持空间分辨率。使用三个相邻B帧作为输入,预测修复后的中间B帧。我们使用峰值信噪比(PSNR)和结构相似性指数(SSIM)评估模型性能,以多次扫描平均生成的真值作为基准。结果表明,与原始单次OCTA体数据相比,所提模型显著(p < 0.001)提高了图像质量,PSNR为26.16 ± 1.26对比22.23 ± 0.78,SSIM为0.91 ± 0.02对比0.72 ± 0.03。所提模型还显著(p < 0.001)提高了微血管保真度,通过模型输出与真值之间的Dice系数重叠测量,在多个不同血管板层上,2D和3D分别至少提高3.8%和51.2%。

英文摘要

Optical coherence tomographic angiography (OCTA) is a powerful technique for imaging retinal microvasculature. However, acquiring reliable quantification of retinal blood flow and areas of retinal nonperfusion is challenging because of imaging artifacts. Existing methods primarily focus on noise suppression, projection artifact removal, or signal enhancement to improve the image quality of OCTA in cross-sectional or two-dimensional (2D) en face projections, while neglecting the intrinsic three-dimensional vascular architecture. In this study, we propose a deep learning-based algorithm for restoring capillary anatomical vasculature from a single OCTA volume. The network consists of an EfficientNet-B5 encoder and a decoder incorporating concurrent spatial and channel squeeze-and-excitation modules, connected via skip connections to preserve spatial resolution. Three adjacent B-frames are used as input to predict the restored middle B-frame. We evaluated the performance of the model using the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) against ground truth generated from averaging multiple scans. The results show that the proposed model significantly (both p < 0.001) improved image quality compared with the original single OCTA volume, with a PSNR of 26.16 +/- 1.26 vs. 22.23 +/- 0.78 and an SSIM of 0.91 +/- 0.02 vs. 0.72 +/- 0.03. The proposed model also significantly (p < 0.001) improved microvascular fidelity, measured by the Dice coefficient overlap between the model output and ground truth, in both 2D and 3D by at least 3.8% and 51.2%, respectively, across several different vascular slabs.
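
For readers unfamiliar with the two kinds of metric reported above, PSNR and the Dice coefficient can be computed as in the following minimal sketch. It uses the standard textbook definitions on flattened pixel sequences; the function names and the `data_range` default are assumptions, and SSIM is omitted because it needs windowed statistics.

```python
import math


def psnr(reference, restored, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def dice(mask_a, mask_b):
    """Dice overlap 2|A and B| / (|A| + |B|) between two binary masks."""
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * overlap / total


print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))
```

Higher is better for both: PSNR grows as the mean squared error against the averaged ground truth shrinks, and Dice approaches 1 as the segmented vessel masks coincide.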

2606.05357 2026-06-05 cs.AI 版本更新

An interpretable and trustworthy AI framework for large-scale longitudinal structure-pain association studies using data from the Osteoarthritis Initiative (OAI)

一个可解释且可信赖的AI框架,用于利用骨关节炎倡议(OAI)数据进行大规模纵向结构-疼痛关联研究

Jincheng Yu, Haoyang Li, Yiwen Liu, Shen Liu, Rachel Yuanbao Chen, C. Kent Kwoh, Hongxu Ding, Xiaoxiao Sun

发表机构 * Statistics & Data Science GIDP, University of Arizona(亚利桑那大学统计与数据科学GIDP) Department of Epidemiology and Biostatistics, University of Arizona(亚利桑那大学流行病学与生物统计学系) College of Medicine Tucson, University of Arizona(亚利桑那大学图森医学院) R. Kent Coit College of Pharmacy, University of Arizona(亚利桑那大学R. Kent Coit药学院) University High School(大学高中)

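AI summary Proposes an AI framework combining deep learning MOAKS prediction with interpretable statistical modeling: uncertainty quantification filters for high-confidence predictions, and a longitudinal latent class mixed model relates structural abnormalities to pain, identifying bone marrow lesions, cartilage loss, and meniscal extrusion as risk factors for pain progression.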

详情
AI中文摘要

目的:开发一个可解释且可信赖的AI框架,结合基于深度学习的MRI骨关节炎膝关节评分(MOAKS)预测与可解释统计建模,利用骨关节炎倡议(OAI)数据大规模研究结构-疼痛关系。材料与方法:我们首先开发了一个深度学习框架,直接从膝关节MRI预测MOAKS特征,并引入共形预测以提供预测不确定性量化。这种不确定性感知策略能够显式过滤模型输出,仅保留膝关节级别的高置信度MOAKS预测。其次,我们应用纵向潜类混合模型(LCMM)检查关键结构异常与四种互补的膝关节疼痛测量之间的关联。结果:在三种MRI定义的异常(即骨髓病变(BML)、软骨丢失(CART)和半月板挤压(ME))中,我们的框架显著提高了马修斯相关系数(MCC)和其他一些指标。例如,BML的MCC从0.69提高到0.91,CART从0.45提高到0.80,ME从0.59提高到0.89。利用这些高置信度预测,我们将LCMM分析的样本量扩大到2,175个膝关节。识别出两种不同的疼痛轨迹(快速和稳定的疼痛进展)。快速进展组的估计比值比(95% CI)为:BML 1.62(1.12-2.35),CART丢失1.83(1.24-2.70),ME 2.50(1.75-3.57)。结论:这些结果强调了这些结构异常作为骨关节炎疼痛和功能进展风险因素的重要性。

英文摘要

Purpose: To develop an interpretable and trustworthy AI framework that combines deep learning based MRI Osteoarthritis Knee Score (MOAKS) prediction with interpretable statistical modeling to study structure-pain relationships at scale using data from the Osteoarthritis Initiative (OAI). Materials and Methods: We first developed a deep learning framework to predict MOAKS features directly from knee MRIs and incorporated conformal prediction to provide prediction uncertainty quantification. This uncertainty-aware strategy enables explicit filtering of model outputs, retaining only high-confidence MOAKS predictions at the knee level. Second, we applied a longitudinal latent class mixed model (LCMM) to examine associations between key structural abnormalities and four complementary knee pain measurements. Results: Among the three MRI-defined abnormalities (i.e., bone marrow lesions (BML), cartilage loss (CART), and meniscal extrusion (ME)), our framework substantially improved the Matthews correlation coefficient (MCC) and some other metrics. For example, MCC increased from 0.69 to 0.91 for BML, from 0.45 to 0.80 for CART, and from 0.59 to 0.89 for ME. Using these high-confidence predictions, we expanded the sample size to 2,175 knees for the LCMM analysis. Two distinct pain trajectories were identified (rapid and stable pain progression). The estimated odds ratios (95% CI) for the rapid progression group were 1.62 (1.12-2.35) for BML, 1.83 (1.24-2.70) for CART loss, and 2.50 (1.75-3.57) for ME. Conclusion: These results highlight the importance of these structural abnormalities as risk factors for pain and functional progression in osteoarthritis.
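
The idea of filtering predictions by conformal uncertainty and then scoring them with MCC can be sketched as follows. This is an illustrative split-conformal recipe, not the paper's method: using `1 - p(true label)` as the nonconformity score and keeping only singleton prediction sets as "high-confidence" are assumptions, as are all function names.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal score threshold from a held-out calibration set."""
    n = len(cal_labels)
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: every label enters every set
    return scores[rank - 1]


def is_high_confidence(probs, qhat):
    """Keep a prediction only if its conformal prediction set has exactly one label."""
    prediction_set = [label for label, p in enumerate(probs) if 1.0 - p <= qhat]
    return len(prediction_set) == 1


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
cal_labels = [0, 1, 0, 1]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
print(is_high_confidence([0.9, 0.1], qhat), is_high_confidence([0.5, 0.5], qhat))
```

Computing MCC only on the retained subset is what would be expected to raise the score, at the cost of discarding the knees whose predictions are ambiguous.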

2606.05339 2026-06-05 cs.SE cs.AI 版本更新

A Taxonomy of Runtime Faults in Model Context Protocol Servers

模型上下文协议服务器运行时故障的分类法

Joshua Owotogbe, Indika Kumara, Willem-Jan van den Heuvel, Damian Andrew Tamburri, Antonio Ken Iannillo, Roberto Natella

发表机构 * Jheronimus Academy of Data Science and Tilburg University(赫伦尼姆数据科学学院和蒂尔堡大学) Jheronimus Academy of Data Science(赫伦尼姆数据科学学院) University of Sannio(桑尼奥大学) University of Luxembourg(卢森堡大学) University of Naples Federico II(那不勒斯费德里科二世大学)

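AI summary Manually analyzes 837 fault threads from 473 MCP server repositories with bottom-up open coding to build the first empirical taxonomy of runtime faults in MCP servers, comprising 11 top-level categories and 27 subcategories (73 leaf fault types), and assesses its external validity through a developer survey.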

Comments 14 pages

详情
AI中文摘要

MCP(模型上下文协议)使LLM(大语言模型)能够通过标准化协议与外部工具和数据源交互。其在工具增强型人工智能工作流中的快速采用引入了新的可靠性挑战,例如配置参数被接受但在运行时未强制执行,导致意外的默认行为,其运行时故障特征尚未得到实证检验。我们提出了MCP服务器运行时故障的首个经验分类法。我们手动分析了来自473个活跃维护的MCP服务器GitHub仓库的837个MCP特定运行时故障线程,并使用自下而上的开放式编码程序推导出分类法。该分类法包括11个顶层类别和27个子类别(73种叶子故障类型),涵盖了协议交互、工具调用、模式执行、状态管理、模型提供商集成、安全验证以及超时或显式取消进行中操作中的反复故障。为评估分类法的外部有效性,我们调查了55名MCP服务器开发者。受访者报告平均经历了27个故障子类别中的20个,且没有类别未被观察到。这些结果表明,该分类法反映了MCP系统中广泛观察到的运行时故障,并将有助于未来AI软件的维护和演化。

英文摘要

MCP (Model Context Protocol) enables LLMs (Large Language Models) to interact with external tools and data sources via a standardized protocol. Its rapid adoption in tool-augmented Artificial Intelligence (AI) workflows has introduced new reliability challenges, such as configuration parameters that are accepted but not enforced at runtime, leading to unintended default behavior. The runtime fault characteristics of MCP servers nevertheless remain empirically unexamined. We present the first empirical taxonomy of runtime faults in MCP servers. We manually analyzed 837 MCP-specific runtime fault threads from 473 actively maintained MCP server GitHub repositories and derived a taxonomy using a bottom-up open coding procedure. The taxonomy comprises 11 top-level categories and 27 subcategories (73 leaf fault types), covering recurrent failures across protocol interactions, tool invocations, schema enforcement, state management, model-provider integration, security validation, and timeouts or explicit cancellations of in-progress operations. To assess the taxonomy's external validity, we surveyed 55 MCP server developers. Respondents reported experiencing an average of 20 of the 27 fault subcategories, and no category remained unobserved. These results indicate that the taxonomy reflects widely observed runtime failures in MCP-based systems and can assist future AI software maintenance and evolution.

2606.05334 2026-06-05 cs.AI 版本更新

Uncertainty Aware Functional Behavior Prediction and Material Fatigue Assessment for Circular Factory

面向循环工厂的不确定性感知功能行为预测与材料疲劳评估

Nehal Afifi, Mehdi Khabou, Victor Mas, Jonas Hemmerich, Patric Grauberger, Stefan Dietrich, Volker Schulze, Sven Matthiesen

发表机构 * IPEK Institute of Product Engineering, Karlsruhe Institute of Technology (KIT)(IPEK产品工程研究所,卡尔斯鲁厄理工学院) IAM-WK Institute for Applied Materials – Materials Science and Engineering, Karlsruhe Institute of Technology (KIT)(应用材料研究所–材料科学与工程,卡尔斯鲁厄理工学院) wbk Institute of Production Science, Karlsruhe Institute of Technology (KIT)(生产科学研究所,卡尔斯鲁厄理工学院)

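AI summary For reuse decisions on returned products with heterogeneous degradation states in circular factories, proposes an instance-specific reliability framework that combines uncertainty-aware functional prediction with component-level fatigue assessment: a convolutional encoder extracts loading patterns, an LSTM predicts functional variables, and finite-element stress reconstruction with fatigue damage evaluation feeds fused functional, material, and system reliability trajectories.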

Comments 27 pages, submitted to the Journal of Manufacturing Systems' special issue about circular factories, the manuscript is under review

详情
AI中文摘要

循环工厂中的回收产品以异质退化状态、使用历史和剩余能力重新进入生产。仅凭当前检查无法决定再利用,因为未来功能实现和组件完整性可能在下一个服务场景下以不同方式演变。现有的PHM方法支持退化预测,但通常针对固定操作条件或孤立组件基准,而材料疲劳评估很少与系统级功能预后相关联。本文针对角磨机通过将不确定性感知功能预测与组件级疲劳评估结合在一个实例特定的可靠性工作流程中来解决这一差距。所提出的框架结合了当前工具状态与最近的力-扭矩使用窗口。卷积编码器从主轴力和轴扭矩中提取载荷模式,LSTM骨干网络预测九个功能变量作为高斯均值和方差估计。同时,相同的载荷历史通过有限元支持的应力重建、带Haibach扩展的S-N/Miner损伤评估和Paris定律裂纹扩展分析转化为输出轴疲劳信息。流式重放算法将两个分支整合为功能、材料和系统可靠性轨迹。保留测试显示九个输出的平均2%容差精度为0.9652。热变量预测近乎完美,而驱动电机电流和负载速度仍然是最具挑战性的动态输出,R²值分别为0.9750和0.9924。扭矩历史对这些变量尤其重要,传统LSTM在短历史设置中优于GRU和xLSTM。可靠性校准对驱动电机电流信息量最大,其中预测和观测的超越概率...

英文摘要

Returned products in circular factories re-enter production with heterogeneous degradation states, usage histories, and remaining capability. Reuse cannot be decided from the current inspection alone, because future function fulfillment and component integrity may evolve differently under the next service scenario. Existing PHM approaches support degradation prediction, but often target fixed operating conditions or isolated component benchmarks, while material-fatigue assessment is rarely linked to system-level functional prognosis. This paper addresses this gap for an angle grinder by combining uncertainty-aware functional prediction with component-level fatigue assessment in an instance-specific reliability workflow. The proposed framework combines the current tool state with recent force-torque usage windows. A convolutional encoder extracts loading patterns from spindle forces and shaft torque, and an LSTM backbone predicts nine functional variables as Gaussian mean and variance estimates. In parallel, the same loading history is translated into output-shaft fatigue information through finite-element-supported stress reconstruction, S-N/Miner damage evaluation with Haibach extension, and Paris-law crack-growth analysis. A streaming replay algorithm consolidates both branches into functional, material, and system reliability trajectories. Held-out tests show mean 2%-tolerance accuracy of 0.9652 across nine outputs. Thermal variables are predicted near-perfectly, while drive motor current and load speed remain the most demanding dynamic outputs, with R² values of 0.9750 and 0.9924. Torque history is especially important for these variables, and the conventional LSTM outperforms GRU and xLSTM in the short-history setting. Reliability calibration is most informative for drive motor current, where predicted and observed exceedance probabilities ...
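
The S-N/Miner damage evaluation with Haibach extension mentioned above can be illustrated with a small sketch. It assumes a Basquin-type S-N curve with slope `k` above the knee point and the commonly used Haibach slope `2k - 1` below it; the knee stress, knee cycle count, slope, and load blocks are hypothetical placeholders, not values from the paper, and the Paris-law crack-growth stage is not covered.

```python
def cycles_to_failure(stress_amp, s_knee, n_knee, k):
    """Allowable cycles at a stress amplitude on a two-slope S-N curve."""
    # Above the knee: Basquin slope k. Below the knee: Haibach slope 2k - 1.
    slope = k if stress_amp >= s_knee else 2 * k - 1
    return n_knee * (s_knee / stress_amp) ** slope


def miner_damage(load_blocks, s_knee, n_knee, k):
    """Linear (Palmgren-Miner) damage sum over (stress_amplitude, cycles) blocks."""
    return sum(n / cycles_to_failure(s, s_knee, n_knee, k) for s, n in load_blocks)


# Hypothetical material and load data, chosen only to make the arithmetic visible.
S_KNEE, N_KNEE, K = 100.0, 1.0e6, 5
blocks = [(200.0, 3125.0), (50.0, 5.12e7)]
damage = miner_damage(blocks, S_KNEE, N_KNEE, K)
print(f"accumulated damage: {damage:.3f}")
```

In this convention a damage sum of 1 marks the nominal end of fatigue life, so each load block here consumes one tenth of the life under the assumed curve.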

2606.05332 2026-06-05 cs.AI 版本更新

GITCO: Gated Inference-Time Context Optimization in TSFMs

GITCO:TSFMs中的门控推理时上下文优化

Manya Pandey, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

发表机构 * Birla AI Labs(巴尔拉人工智能实验室)

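AI summary Proposes GITCO, a framework that uses a gating mechanism to selectively suppress harmful patches at inference time, improving the zero-shot forecasting accuracy of patch-based time series foundation models without parameter updates.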

Comments ICML 2026 Workshop on Foundation Models for Structured Data

详情
AI中文摘要

基于补丁的时间序列基础模型(TSFMs)遭受上下文中毒:结构异常的补丁捕获了不成比例的注意力,并无声地降低了零样本预测质量。我们提出通过在推理时优化输入上下文而不是修改模型权重来提高TSFM精度。我们提出了GITCO(门控推理时上下文优化),一个轻量级的三组件框架:门控、路由和批评者,无需任何参数更新即可选择性地识别和抑制有害补丁。在TimesFM 2.5上,跨53个GIFT-Eval数据集进行K折交叉验证评估,GITCO在TimesFM 2.5上实现了平均+1.95%的MASE降低,同时捕获了89.9%的改进上限。我们引入了上下文敏感性配置文件作为TSFMs的一个新的可表征属性:从时间序列元特征到推理时上下文干预下预期精度改进的映射,由模型架构和数据的统计结构共同塑造。

英文摘要

Patch-based Time Series Foundation Models (TSFMs) suffer from context poisoning: structurally anomalous patches capture disproportionate attention and silently degrade zero-shot forecast quality. We propose improving TSFM accuracy at inference time by optimizing the input context rather than modifying model weights. We present GITCO (Gated Inference-Time Context Optimization), a lightweight framework with three components (Gate, Router, and Critic) that selectively identifies and suppresses harmful patches without any parameter updates. Evaluated on TimesFM 2.5 across 53 GIFT-Eval datasets under K-fold cross-validation, GITCO achieves an average MASE reduction of 1.95% while capturing 89.9% of the improvement upper bound. We introduce context sensitivity profiles as a new characterizable property of TSFMs: the mapping from time series meta-features to expected accuracy improvement under inference-time context intervention, shaped jointly by model architecture and the statistical structure of the data.
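
The MASE figure quoted above is a relative change in a scaled error metric. As a minimal sketch using the standard definition of MASE (forecast MAE divided by the in-sample MAE of a seasonal naive forecast), with hypothetical function names and toy data rather than anything from GIFT-Eval:

```python
def mase(y_true, y_pred, y_train, season=1):
    """Mean absolute scaled error of a forecast against a seasonal naive baseline."""
    forecast_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    naive_errors = [abs(y_train[i] - y_train[i - season])
                    for i in range(season, len(y_train))]
    naive_mae = sum(naive_errors) / len(naive_errors)
    return forecast_mae / naive_mae


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


history = [1, 2, 3, 4, 5]
score = mase(y_true=[6, 7], y_pred=[5, 9], y_train=history)
print(score, relative_reduction(2.0, 1.9))
```

A MASE below 1 means the forecast beats the naive baseline on average; a reduction such as the one reported would be computed between the model's MASE with and without the context intervention.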

2606.05330 2026-06-05 cs.CL cs.AI cs.HC 版本更新

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

基于概率信念追踪的多轮人类可说服性模型

Jared Moore, Noah Goodman, Nick Haber, Max Kleiman-Weiner

发表机构 * Stanford University(斯坦福大学) University of Washington(华盛顿大学)

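AI summary Proposes PERSUASIONTRACE, a framework that records multi-turn belief reports, annotates rhetorical dimensions, and introduces a Bayesian-network simulated target, shifting persuasion evaluation from endpoint change to process fidelity.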

详情
AI中文摘要

大型语言模型可以在高风险领域改变人类信念,但大多数说服研究依赖于前/后信念变化。这些端点测量确定了说服是否发生,却忽略了信念在对话中移动的位置和方式。我们提出了PERSUASIONTRACE,一个用于研究人机交互中说服的框架。基于网络实验平台,PERSUASIONTRACE贡献了一个多轮说服研究的工具和一个过程级评估协议:它记录来自人类或模拟说服目标的多轮信念报告,用修辞维度(logos/pathos/ethos)标注说服者轮次,并通过保真度评估模拟器与真实人类信念动态的匹配程度。使用该框架,我们发现人类目标分为两个多轮信念更新聚类,并对修辞策略表现出易感性;LLM在通用和个性化主题、文本和音频模态以及多轮交互中都具有说服力。先前的工作主要使用普通提示的LLM来模拟人类目标,但我们表明这些模拟器无法复制人类信念动态。我们引入了一个贝叶斯网络模拟目标,它随时间维持显式的潜在信念状态,使得每个说服者消息产生认知上真实的信念更新。在人类相似性评估中,我们的贝叶斯目标得分接近人类参考(81 vs 80),而基线LLM目标得分显著较低(64)。PERSUASIONTRACE将说服评估从仅端点移动重新定义为过程保真度,为科学分析和说服系统的更安全优化提供了更强的基础。

英文摘要

Large language models can shift human beliefs across high-stakes domains, but most persuasion studies rely on pre/post belief change. These endpoint measures identify whether persuasion occurred, yet miss where and how beliefs moved within a dialogue. We present PERSUASIONTRACE, a framework for studying persuasion in human-LLM interaction. Built on a web-based experimental platform, PERSUASIONTRACE contributes a tool for multi-turn persuasion studies and a process-level evaluation protocol: it records multi-turn belief reports from human or simulated targets of persuasion, annotates persuader turns with rhetorical dimensions (logos/pathos/ethos), and evaluates simulators by fidelity to real human belief dynamics. Using this framework, we find that human targets group into two clusters of multi-turn belief updates and exhibit susceptibility to rhetorical strategies, and that LLMs are persuasive across generic and personalized topics, text and audio modalities, and multi-turn interactions. Prior work has chiefly used vanilla-prompted LLMs to simulate human targets, but we show that these simulators fail to replicate human belief dynamics. We introduce a Bayesian-network simulated target that maintains an explicit latent belief state over time so each persuader message yields cognitively realistic belief updates. In human-likeness evaluation, our Bayesian target scores near a human reference (81 vs 80), while baseline LLM targets score substantially lower (64). PERSUASIONTRACE reframes persuasion evaluation from endpoint movement alone to process fidelity, providing a stronger basis for scientific analysis and safer optimization of persuasive systems.

2606.05328 2026-06-05 cs.GR cs.AI cs.CV cs.LG 版本更新

The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show

物理的隐形之手:当视频扩散模型知道的比它们展示的更多

Parsa Esmati, Somjit Nath, Katja Hofmann, Derek Nowrouzezahrai, Samira Ebrahimi Kahou, Majid Mirmehdi

发表机构 * University of Bristol(布里斯托大学) McGill University(麦吉尔大学) Mila–Quebec AI Institute(魁北克AI研究院) Microsoft Research(微软研究院) University of Calgary(卡尔加里大学)

AI总结 通过逆向扩散过程探测视频扩散模型的潜在轨迹,发现物理合理性可以从扩散变换器状态中线性解码,准确率达81.27%,表明物理有意义的表示是生成式去噪的副产品。

详情
AI中文摘要

现代视频扩散模型生成越来越真实和时间上连贯的视频,这激发了它们作为候选世界模拟器的使用。然而,目前尚不清楚这些模型是否内部编码了物理结构,或者仅仅是复现了训练中看到的运动模式。我们通过沿着对应已知物理合理性的真实视频的潜在轨迹探测视频扩散模型来研究这个问题。为了获得这样的轨迹,我们通过从干净视频潜在变量向后积分学习到的速度场到噪声,近似逆向确定性采样过程,从而访问模型的中间状态和注意力图。利用这些恢复的轨迹,我们表明物理合理性可以从扩散变换器状态中线性解码,在IntPhys和InfLevel上达到约81.27%的平均准确率,并优于专门的表示学习基线如V-JEPA和VideoMAE。令人惊讶的是,这个信号在VAE潜在输入中不存在,而是在去噪变换器内部出现,尽管模型没有使用自监督预测目标进行训练。这些发现表明,物理有意义的表示可以作为生成式去噪的副产品产生。

英文摘要

Modern video diffusion models generate increasingly realistic and temporally coherent videos, motivating their use as candidate world simulators. Yet it remains unclear whether these models internally encode physical structure, or merely reproduce motion patterns seen during training. We study this question by probing video diffusion models along latent trajectories corresponding to real videos with known physical plausibility. To obtain such trajectories, we approximately invert the deterministic sampling process by integrating the learned velocity field backward from a clean video latent to noise, giving access to the model's intermediate states and attention maps. Using these recovered trajectories, we show that physical plausibility is linearly decodable from diffusion transformer states across IntPhys and InfLevel, reaching around 81.27% average accuracy and outperforming dedicated representation-learning baselines such as V-JEPA and VideoMAE. Surprisingly, this signal is absent from the VAE latent input and emerges inside the denoising transformer itself, despite the model not being trained with a self-supervised predictive objective. These findings suggest that physically meaningful representations can arise as a byproduct of generative denoising.

2606.05326 2026-06-05 math.OC cs.AI cs.LG math-ph math.AP math.MP 版本更新

Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network

稳定边缘的梯度下降:双层网络的自由能模型与动力学描述

Antonin Chodron de Courcel

发表机构 * Ecole Normale Supérieure, CNRS, 45 rue d’Ulm, 75005 Paris, France(巴黎高等师范学院,法国国家科学研究中心,法国巴黎乌尔姆街45号,邮编75005)

AI总结 针对大学习率下梯度下降的稳定边缘动力学,提出连续时间有效模型跟踪平均轨迹与快速振荡协方差,揭示有效自由能作为关键监控量,并导出宽双层网络的平均场极限动力学方程。

Comments Comments are welcome!

详情
AI中文摘要

我们研究了稳定边缘(Edge of Stability)机制下梯度下降的动力学,其中学习率足够大,导致损失和锐度出现持续振荡。我们提出了一个连续时间有效模型,跟踪平均轨迹的演化以及其快速振荡的时间平均协方差。我们的分析表明,在这种不稳定机制中,需要监控的自然量是有效自由能,它将原始风险泛函与曲率相关的“熵”项相结合。我们的模型允许我们跟踪振荡的包络,即使在动力学与平均权重在相似时间尺度上演化的情况下。换句话说,我们可以跟踪某些神经网络架构训练过程中出现的尖峰。对于在稳定非消失振荡下优化的宽双层神经网络,我们推导出一个平均场极限,产生了一个新的动力学方程,描述了权重及其波动的联合分布。我们证明该方程可以解释为宏观自由能的Wasserstein-2梯度流。最后,我们提供了矩阵分解和深度学习任务(CIFAR-10)上的数值证据,以证明模型在捕捉振荡包络方面的准确性以及有效自由能的预测能力。

英文摘要

We study the dynamics of gradient descent in the Edge of Stability regime, where the learning rate is large enough to induce persistent oscillations in the loss and the sharpness. We propose a continuous-time effective model that tracks the evolution of the average trajectory coupled with the time-averaged covariance of its fast oscillations. Our analysis reveals that the natural quantity to monitor in such unstable regimes is an effective free energy, which combines the original risk functional with a curvature-related "entropic" term. Our model allows us to track the envelope of the oscillations even in situations where its dynamics evolve on similar timescales as the averaged weights. Otherwise stated, we can track the spikes that occur during the training of some neural network architectures. For wide two-layer neural networks optimized under stable non-vanishing oscillations, we derive a mean-field limit that results in a novel kinetic equation describing the joint distribution of weights and their fluctuations. We show that this equation can be interpreted as a Wasserstein-2 gradient flow of a macroscopic free energy. Finally, we provide numerical evidence on matrix factorization and deep learning tasks (CIFAR-10) to demonstrate the model's accuracy in capturing the envelope of the oscillations and the predictive power of the effective free energy.
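
The stability threshold behind the "Edge of Stability" name can be illustrated on a one-dimensional quadratic. On f(x) = 0.5 * curvature * x^2, gradient descent with step size lr multiplies x by (1 - lr * curvature) each step, so the iterates contract only when lr * curvature < 2. The sketch below shows this textbook case only; it is not the paper's effective free-energy model, and the function name is an illustrative assumption.

```python
def gd_on_quadratic(curvature, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        grad = curvature * xs[-1]      # f'(x) = curvature * x
        xs.append(xs[-1] - lr * grad)  # x <- (1 - lr * curvature) * x
    return xs

# Just below the threshold: the iterates flip sign each step but shrink.
stable = gd_on_quadratic(curvature=1.0, lr=1.9)
# Just above the threshold: the iterates flip sign each step and grow.
unstable = gd_on_quadratic(curvature=1.0, lr=2.1)
```

In both runs the iterates oscillate in sign; only the envelope differs, which is the quantity the abstract says the effective model is designed to track in the far richer neural-network setting.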

2606.05316 2026-06-05 cs.AI 版本更新

I Know What You Meme, Even If it Emerged Today: Understanding Evolving Memes through Open-World Knowledge Acquisition

我知道你的梗,即使它今天才出现:通过开放世界知识获取理解不断演变的梗

Shanhong Liu, Rui Cao, Pai Chet Ng, De Wen Soh

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学) Singapore Institute of Technology(新加坡理工学院)

AI总结 提出Query Retrieve Conclude零样本框架,通过识别缺失知识、检索开放网络证据并合成背景知识,以理解新兴梗并提升检测性能。

详情
AI中文摘要

多模态梗是动态的,通常需要最新的背景知识来进行解释。现有方法往往忽略此类知识,或依赖预训练模型的固定参数知识,这些知识可能不完整、过时或无法用于新兴梗。我们引入了Query Retrieve Conclude,一个零样本框架,用于识别缺失知识、检索开放网络证据并合成基于证据的背景知识,以进行梗的理解和检测。我们还引入了一个精心策划的梗理解基准,包含2024年至2026年的近期梗及其外部背景知识注释。在三个梗理解数据集和五个梗检测任务上的实验表明,我们的框架在知识恢复、梗理解和下游检测方面优于零样本基线。

英文摘要

Multimodal memes are dynamic and often require up-to-date background knowledge for interpretation. Existing methods often overlook such knowledge or rely on fixed parametric knowledge of pretrained models that may be incomplete, outdated, or unavailable for emerging memes. We introduce Query Retrieve Conclude, a zero-shot framework that identifies missing knowledge, retrieves open-web evidence, and synthesizes evidence-grounded background knowledge for meme understanding and detection. We also introduce a curated meme understanding benchmark of recent memes from 2024 to 2026 with external background knowledge annotations. Experiments on three meme understanding datasets and five meme detection tasks show that our framework improves knowledge recovery, meme understanding, and downstream detection over zero-shot baselines.

2606.05315 2026-06-05 cs.CL cs.AI 版本更新

LoRi: Low-Rank Distillation for Implicit Reasoning

LoRi: 用于隐式推理的低秩蒸馏

Ryan Solgi, Jiayi Tian, Zheng Zhang

发表机构 * University of California-Santa Barbara(加州大学圣巴巴拉分校)

AI总结 提出低秩蒸馏框架,通过对齐师生模型在共享低秩张量子空间中的隐状态推理轨迹,提升大型语言模型的隐式思维链推理能力。

详情
AI中文摘要

隐式思维链方法旨在将推理内化到大型语言模型中,但通常表现不如显式思维链提示。我们通过实验发现,隐状态推理轨迹具有低秩结构。基于此观察,我们提出了一种低秩蒸馏框架,通过使用一阶和二阶统计量,在共享的低秩张量子空间中对齐教师和学生轨迹来传递推理能力。得到的公式捕捉了推理的全局结构,同时支持紧凑的潜在推理过程。我们在多个模型家族(包括LLaMA和Qwen)上,在不同规模下对数学推理基准进行了评估。我们的方法持续提升了性能,尤其是在具有挑战性的多步任务上,接近显式思维链的准确率,并优于先前的隐式思维链蒸馏方法。

英文摘要

Implicit chain-of-thought (iCoT) methods aim to internalize reasoning in large language models, but often underperform explicit CoT prompting. We empirically find that hidden-state reasoning trajectories exhibit low-rank structure. Motivated by this observation, we propose a low-rank distillation framework that transfers reasoning by aligning teacher and student trajectories in a shared low-rank tensor subspace using first- and second-order statistics. The resulting formulation captures the global structure of reasoning while supporting a compact latent reasoning process. We evaluate the method across multiple model families, including LLaMA and Qwen, at different scales on mathematical reasoning benchmarks. Our approach consistently improves performance, especially on challenging multi-step tasks, approaching explicit CoT accuracy and outperforming prior iCoT distillation methods.

2606.05308 2026-06-05 cs.LG cs.AI cs.CL cs.IR stat.AP 版本更新

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

基于预测驱动推断的统计可靠LLM排序评估

Abhishek Divekar

发表机构 * Amazon(亚马逊)

AI总结 提出PRECISE框架,将预测驱动推断扩展到排序评估指标,通过结合少量人工标注和大量LLM判断实现无偏估计,并在ESCI基准和实际系统中验证了有效性。

Comments Accepted at ACL 2026 - GEM Workshop

详情
AI中文摘要

通过PRECISE,我们将预测驱动推断扩展到排序评估指标,通过结合少量人工标注集和大量LLM判断集,产生偏差校正的估计。PPI无论LLM判断器的错误分布如何,都是可证明无偏的。我们通过将输出空间计算从O(2^|C|)减少到O(2^K),使其适用于像Precision@K这样的分层指标,其中标注是按文档的,但指标是按查询的。在ESCI基准上,用Claude 3 Sonnet判断增强30个人工标注,将Precision@4估计的标准误差从4.45降低到3.50(相对减少21%)。在一个生产系统中,我们的框架从100个人工标签和2小时的领域专家标注中正确识别了三个系统变体中最好的一个;A/B测试确认了这一排序,日销售额增加了407个基点。

英文摘要

With PRECISE, we extend Prediction-Powered Inference (PPI) to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics like Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with +407 bps in daily sales.
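
The bias correction at the heart of prediction-powered inference can be sketched for the simplest case, estimating a mean. The estimate is the average LLM judgment on the large unlabeled set plus a "rectifier": the average gap between human labels and LLM judgments on the small labeled set. This is the generic PPI mean estimator, not the paper's hierarchical Precision@K extension; the function and argument names are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, variance

def ppi_mean(human_labels, judge_on_labeled, judge_on_unlabeled):
    """Prediction-powered estimate of a mean, with its standard error.

    human_labels[i] and judge_on_labeled[i] refer to the same item;
    judge_on_unlabeled holds judgments on items with no human label.
    Each list needs at least two entries for the variance terms.
    """
    if len(human_labels) != len(judge_on_labeled):
        raise ValueError("labeled lists must be aligned item by item")
    # Rectifier: how far the judge is from the humans, item by item.
    rectifier = [y - f for y, f in zip(human_labels, judge_on_labeled)]
    estimate = mean(judge_on_unlabeled) + mean(rectifier)
    std_error = sqrt(
        variance(judge_on_unlabeled) / len(judge_on_unlabeled)
        + variance(rectifier) / len(rectifier)
    )
    return estimate, std_error
```

If the judge is systematically biased, the rectifier cancels the bias; if the judge is accurate, the rectifier has low variance and the large unlabeled set shrinks the standard error. The quoted improvement is consistent with the arithmetic (4.45 - 3.50) / 4.45 ≈ 0.21.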

2606.05304 2026-06-05 cs.AI 版本更新

What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems

智能体应该说什么?面向高效多智能体系统的动作-状态通信

Chen Huang, Yuhao Wu, Wenxuan Zhang

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学)

AI总结 针对多智能体系统中自由形式通信导致令牌膨胀和性能下降的问题,提出PACT协议,将通信视为公共状态更新问题,压缩为紧凑的动作-状态记录,在多种拓扑下实现性能与成本权衡的优化。

Comments 13 pages, 5 figures

详情
AI中文摘要

基于大语言模型的多智能体系统通常围绕角色、流水线和轮次调度进行组织,而智能体之间传递的内容往往被保留为无约束的自然语言。然而,这种自由形式的通信会迅速膨胀令牌使用量,消耗共享上下文窗口,并最终影响系统性能和推理成本。我们分析了两种多智能体系统拓扑中五种常见的智能体间通信策略,发现没有固定策略是普遍最优的。相反,有效的智能体间消息始终保留下游智能体所需的以动作为中心的信息。基于此,我们提出了PACT(协议化动作-状态通信与传输),它将智能体间通信视为公共状态更新问题,并在每个原始智能体输出进入共享历史之前将其投影为紧凑的动作-状态记录。在不同的多智能体系统拓扑中,PACT持续改善了性能-成本权衡,以显著更少的令牌实现了相当或更强的任务性能。这些增益扩展到生产编码工具:PACT在每解决任务令牌数减少10%的情况下提升了OpenHands的解决率,并在SWE-agent上保持解决率不变,同时将输入令牌减半。我们的代码公开在https://github.com/iNLP-Lab/PACT。

英文摘要

Multi-agent systems (MAS) built on large language models are typically organized around roles, pipelines, and turn schedules, while the content that agents pass to one another is often left as unconstrained natural language. However, this free-form communication can rapidly inflate token usage, consume the shared context window, and ultimately affect both system performance and inference cost. We analyze five common inter-agent communication strategies across two MAS topologies, finding that no fixed strategy is universally optimal. Instead, effective inter-agent messages consistently preserve action-centered information needed by downstream agents. Building on this, we propose PACT (Protocolized Action-state Communication and Transmission), which treats inter-agent communication as a public state-update problem and projects each raw agent output into a compact action-state record before it enters shared history. Across different MAS topologies, PACT consistently improves the performance-cost trade-off, achieving comparable or stronger task performance with substantially fewer tokens. The gains extend to production coding harnesses: PACT lifts OpenHands' resolve rate at -10% tokens-per-resolved, and is resolve-neutral on SWE-agent while halving input tokens. Our code is publicly available at https://github.com/iNLP-Lab/PACT.

2606.05296 2026-06-05 cs.LG cs.AI 版本更新

Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

智能体蒙特卡洛:黑盒智能体的强化学习模拟

Dae Yon Hwang, Raunaq Suri, Valentin Villecroze, Anthony L. Caterini, Jesse C. Cresswell, Noël Vouitsis, Brendan Leigh Ross

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出Agentic Monte Carlo (AMC)方法,利用序贯蒙特卡洛从最优策略后验中采样,无需参数级优化即可对黑盒LLM智能体进行强化学习式优化,在AgentGym基准上超越提示基线并随测试时计算扩展优于GRPO。

Comments Accepted by ICML 2026

详情
AI中文摘要

LLM智能体在两种不同的机制下运行:适用于强化学习(RL)的开权重智能体,以及其行为必须在测试时纯粹控制的黑盒智能体。尽管黑盒智能体通常由最先进的专有LLM支持,但仅API访问排除了参数级优化,使得大多数RL方法不适用。为解决这一限制,我们转向RL与贝叶斯推断之间的已知等价性。我们提出智能体蒙特卡洛(AMC),直接从黑盒智能体的最优策略中采样,而不是通过RL训练它。最优策略是轨迹上的后验,其先验我们定义为固定的黑盒LLM智能体。我们采用序贯蒙特卡洛从该后验中采样,通过学习一个价值函数来引导智能体,同时保持底层黑盒模型不变。我们在AgentGym基准的三个不同环境中验证了AMC,展示了相对于提示基线的显著改进,并且随着我们方法测试时计算的扩展,甚至优于组相对策略优化(GRPO)。AMC证明了执行黑盒LLM智能体的原则性RL式优化的可行性。代码可在https://github.com/layer6ai-labs/Agentic-Monte-Carlo获取。

英文摘要

LLM agents operate in two distinct regimes: open-weight agents amenable to reinforcement learning (RL) and black-box agents whose behaviour must be controlled purely at test time. Although black-box agents are often backed by state-of-the-art proprietary LLMs, API-only access precludes parameter-level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference. We propose Agentic Monte Carlo (AMC) to directly sample from the optimal policy of a black-box agent rather than training it through RL. The optimal policy is a posterior over trajectories whose prior we define as the fixed black-box LLM agent. We employ Sequential Monte Carlo to sample from this posterior by learning a value function to steer the agent while leaving the underlying black-box model unchanged. We validate AMC on three diverse environments from the AgentGym benchmark, demonstrating significant improvements over prompting baselines and even outperforming Group Relative Policy Optimization (GRPO) as we scale the test-time compute of our method. AMC demonstrates the feasibility of performing principled RL-style optimization of black-box LLM agents. Code is available at https://github.com/layer6ai-labs/Agentic-Monte-Carlo
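
The steering idea can be sketched as a single resampling step of a particle filter: several partial trajectories are sampled from the frozen agent, each is scored by a learned value function, and the set is resampled in proportion to an exponentiated score. The abstract does not give the exact weighting, so the `exp(value / temperature)` form, the function name, and the arguments below are all assumptions rather than the authors' implementation.

```python
import math
import random

def resample_by_value(particles, values, temperature=1.0, rng=None):
    """One SMC-style resampling step.

    Draws len(particles) particles with replacement, with probability
    proportional to exp(value / temperature). The particles themselves
    (partial trajectories from the black-box agent) are never modified.
    """
    if len(particles) != len(values) or not particles:
        raise ValueError("particles and values must be non-empty and aligned")
    rng = rng or random.Random()
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(values)
    weights = [math.exp((v - top) / temperature) for v in values]
    return rng.choices(particles, weights=weights, k=len(particles))
```

Using more particles spends more test-time compute, which is the scaling axis the abstract refers to; a lower temperature concentrates the resampled set on the highest-valued trajectories.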

2606.05290 2026-06-05 cs.CV cs.AI cs.MM 版本更新

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

模型是否共享安全表示?面向安全视觉生成的跨模型引导

Tobia Poppi, Silvia Cappelletti, Sara Sarto, Florian Schiffers, Garin Kessler, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

发表机构 * University of Modena and Reggio Emilia(摩德纳和雷吉奥艾米利亚大学) University of Pisa(比萨大学) Amazon Prime Video(亚马逊prime视频)

AI总结 本文提出首个跨模型安全引导框架,通过源语言模型估计安全方向并迁移至目标生成器,无需目标侧不安全数据即可实现安全控制,且不牺牲生成质量。

Comments Project page: https://aimagelab.github.io/cross-model-safety-representations/

详情
AI中文摘要

生成建模的最新进展使安全控制成为核心挑战,但现有方法大多针对特定模型,需要为每种新架构重新训练或定制干预。在这项工作中,我们探究安全是否可以被表示为一种可移植的潜在方向,一次性学习并在异构生成器之间重用。我们引入了首个跨模型安全引导框架,其中从成对的安全-不安全提示中在源大语言模型中估计安全方向,通过仅在良性数据上拟合的轻量级对齐传输到目标生成器,并在推理时应用。关键的是,我们的流程从未访问目标侧的不安全数据,从而隔离了安全是否可以通过共享表示几何进行转移。除了单个全局方向,我们还识别了一种多向量扩展,捕获类别特定的安全行为,实现更具选择性的控制。我们在文本到图像和文本到视频生成中评估了我们的方法,跨越不同的源-目标模型对。跨模型转移的安全方向实现了与在目标模型上使用不安全数据本地学习的方向相当的ASR降低和CLIP-Score/FID权衡,同时不需要目标侧的不安全数据。这表明安全改进不以生成质量为代价。我们的结果指向了一种模块化的安全观:安全相关行为并非纯粹模型局部,而是可以通过跨模型持续的潜在方向进行控制。这为轻量级、可重用的安全机制开辟了新路径,且无需目标侧不安全数据。

英文摘要

Recent progress in generative modeling has made safety control a central challenge, yet existing approaches remain largely model-specific, requiring retraining or tailored interventions for each new architecture. In this work, we ask whether safety can be represented as a portable latent direction, learned once and reused across heterogeneous generators. We introduce the first framework for cross-model safety steering, in which a safety direction is estimated in a source LLM from paired safe-unsafe prompts, transported to a target generator through a lightweight alignment fitted on benign data alone, and applied at inference time. Crucially, our pipeline never accesses unsafe data on the target side, isolating whether safety can be transferred through shared representation geometry. Beyond a single global direction, we also identify a multi-vector extension that captures category-specific safety behaviors, enabling more selective control. We evaluate our approach in text-to-image and text-to-video generation across diverse source-target model pairs. Across models, transferred safety directions achieve ASR reduction and CLIP-Score/FID trade-offs comparable to directions learned natively on the target model using unsafe data, while requiring no target-side unsafe data. This indicates that safety improvements do not come at the expense of generation quality. Our results point to a modular view of safety: safety-relevant behavior is not purely model-local, but can be controlled through latent directions that persist across models. This suggests a new path toward lightweight, reusable safety mechanisms that do not require target-side unsafe data.
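
A common way to estimate and apply a steering direction is sketched below: take the difference between the mean hidden state of unsafe prompts and that of safe prompts, normalize it, and remove a hidden state's component along it at inference time. The abstract does not state that the authors use a difference of means, so that choice, the projection-removal step, and all names here are assumptions. The cross-model alignment map fitted on benign data is omitted.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def safety_direction(safe_states, unsafe_states):
    """Unit vector pointing from the safe mean toward the unsafe mean."""
    safe_mean = mean_vector(safe_states)
    unsafe_mean = mean_vector(unsafe_states)
    diff = [u - s for u, s in zip(unsafe_mean, safe_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("safe and unsafe means coincide; no direction to estimate")
    return [d / norm for d in diff]

def steer_away(hidden, direction, strength=1.0):
    """Remove (a fraction of) the hidden state's component along the unit direction."""
    dot = sum(h * d for h, d in zip(hidden, direction))
    return [h - strength * dot * d for h, d in zip(hidden, direction)]
```

With `strength=1.0` the steered state is orthogonal to the direction. Smaller values trade safety control against fidelity to the original generation.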

2606.05275 2026-06-05 cs.CV cs.AI 版本更新

Personal AI Agent for Camera Roll VQA

面向相机胶卷VQA的个人AI代理

Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Korea University(韩国大学) Adobe Research(Adobe研究院)

AI总结 本文提出camroll数据集和camroll-agent代理,通过层次化记忆和工具集解决个人相机胶卷中的长程、高度个性化的视觉问答问题。

Comments Project page, code, and demo: https://thaoshibe.github.io/camroll

详情
AI中文摘要

我们研究了个人相机胶卷的视觉问答设定。在该设定中,一个对话式AI助手可以访问用户的个人相机胶卷并检索相关照片来回答查询,从简单的事实性问题(例如,“我昨天尝试的食物名称?”)到更开放的问题(例如,“推荐一些我从未吃过的菜肴”)。鉴于个人相机胶卷的庞大性质(即多年、数百到数千张照片),一个成功的AI助手需要理解长程、高度个性化的视觉内容流,以便导航和定位正确和/或相关信息。为此,我们收集并手动标注了模拟真实世界使用场景的问题。最终数据集camroll包含50个用户、31,476张图像和2,500个问答对。我们进一步设计了camroll-agent,一个配备层次化记忆和最小工具集的对话式AI代理,用于在大型个性化视觉记忆上高效导航。实验结果表明,camroll-agent在长上下文理解的AI代理系统中优于众多基线和方法。总之,camroll数据集和camroll-agent凸显了AI代理在长上下文推理中的差距:个性化视觉记忆需要与标准长上下文文本记忆不同的方法,尤其是在存在一致性、视觉细节和用户特定上下文时。

英文摘要

We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Given the vast nature of the personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized visual content stream in order to navigate and locate the correct and/or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and long-context understanding methods for AI agent systems. Together, the camroll dataset and camroll-agent highlight the gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are present.

2606.05263 2026-06-05 cs.LG cs.AI 版本更新

Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

基于策略条件的反事实信用分配用于长周期语言智能体的可验证强化学习

Renwei Meng

发表机构 * Anhui University(安徽大学)

AI总结 提出CVT-RL算法,通过策略条件反事实贡献估计和可验证奖励约束,解决长周期语言智能体在推理和工具使用中的虚假证据链、信念漂移和捷径行为问题,在多个任务上提升成功率并降低作弊率。

Comments 16 pages, 6 figures

详情
AI中文摘要

具有可验证奖励的强化学习改进了推理和工具使用,但长周期语言智能体仍然学习到无支持的证据链、信念漂移以及满足终端检查的捷径行为。现有的过程奖励大多是相关的:它们奖励类似检索、反思或验证的步骤,而不估计在指定干预下该步骤是否有助于最终验证的成功。我们提出CVT-RL,一种具有密集可验证奖励、干预有效性门控和策略条件反事实贡献(PCCC)估计器的约束策略梯度算法。删除、语义替换、证据替换和工具输出扰动定义了不同的受控干预;延续从冻结的参考策略中采样,并使用选择调整的双重稳健估计器增强优势。信念控制仅使用前缀可观察标签,而增广拉格朗日约束无支持的声明、跳过的验证、工具篡改和不安全调用。在长上下文问答、ALFWorld、ScienceWorld以及网页/工具任务上,CVT-RL将平均任务成功率从计算匹配的非因果强化学习的71.8%和信息匹配的反事实过程基线的75.4%提高到78.9%,证据F1分数从信息匹配基线的78.9提高到82.8,并将测量的作弊率从7.2%降低到3.9%。独立人工审计估计CVT-RL的作弊率为4.6%,而信息匹配基线为8.1%,自适应检测器规避攻击仅将作弊率提高到7.1%。分层自助法和混合效应检验在Holm校正后所有主要指标的p<0.01。精心范围的反事实信用,结合有效性门控、诊断和可验证约束,为语言智能体更可靠的长周期强化学习提供了一条可复现的路径。

英文摘要

Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing process rewards are mostly correlational: they reward retrieval-, reflection-, or verification-like steps without estimating whether the step contributes to final verified success under a specified intervention. We propose CVT-RL, a constrained policy-gradient algorithm with dense verifiable rewards, intervention-validity gating, and a policy-conditioned counterfactual contribution (PCCC) estimator. Deletion, semantic substitution, evidence substitution, and tool-output perturbation define separate controlled interventions; continuations are sampled from a frozen reference policy, and a selection-adjusted doubly robust estimator augments the advantage. Belief control uses only prefix-observable labels, while an augmented Lagrangian constrains unsupported claims, skipped verification, tool tampering, and unsafe calls. On long-context QA, ALFWorld, ScienceWorld, and web/tool tasks, CVT-RL improves average task success from 71.8% for compute-matched non-causal RL and 75.4% for an information-matched counterfactual-process baseline to 78.9%, improves evidence F1 from 78.9 to 82.8 over the information-matched baseline, and reduces measured hacking from 7.2% to 3.9%. Independent human audit estimates 4.6% hacking for CVT-RL versus 8.1% for the information-matched baseline, and adaptive detector-evasion attacks raise hacking only to 7.1%. Stratified bootstrap and mixed-effects tests give p<0.01 after Holm correction for all primary metrics. Carefully scoped counterfactual credit, paired with validity gating, diagnostics, and verifiable constraints, provides a reproducible route toward more reliable long-horizon RL for language agents.

2606.05262 2026-06-05 cs.IT cs.AI math.IT 版本更新

X-Band UAV-enabled Integrated Sensing and Communications for Vehicular Networks

X波段无人机赋能的车联网集成感知与通信

Remon Polus, Soumaya Cherkaoui

发表机构 * Department of Computer and Software Engineering(计算机与软件工程系)

AI总结 针对X波段无人机集成感知与通信系统,在单阴影和双阴影信道模型下提出感知与通信之间的最优时间分配方法,在保证最小通信速率和感知可靠性的同时平衡感知精度与通信性能。

详情
AI中文摘要

无人驾驶飞行器(UAV)越来越多地被视作能够提供感知和通信服务的空中平台,代表了智能交通系统的一个有前景的范式。本文研究了在X波段为车联网运行的无人机集成感知与通信(ISaC)系统的最优时间分配。我们在实际无人机约束和衰落效应下分析了感知精度与通信性能之间的权衡,考虑了单阴影和双阴影信道模型。开发了一个优化框架来在感知和通信之间分配时间,同时保证最小通信速率和足够的感知可靠性。仿真结果展示了自适应时间分配策略,突出了无人机到地面信道条件和目标距离如何影响智能移动场景中感知与通信之间的平衡。

英文摘要

Uncrewed aerial vehicles (UAVs) are increasingly considered as aerial platforms capable of providing both sensing and communication services, representing a promising paradigm for intelligent transportation systems. This paper investigates the optimal time allocation for a UAV-enabled integrated sensing and communication (ISaC) system operating in the X-band for vehicular networks. We analyze the trade-off between sensing accuracy and communication performance under practical UAV constraints and fading effects, considering both single-shadowing and double-shadowing channel models. An optimization framework is developed to allocate time between sensing and communication while guaranteeing minimum communication rates and sufficient sensing reliability. Simulation results demonstrate adaptive time allocation strategies, highlighting how UAV-to-ground channel conditions and target distances influence the balance between sensing and communication in smart mobility scenarios.

2606.05261 2026-06-05 cs.CV cs.AI cs.LG 版本更新

NIV: Neural Axis Variations for Variable Font Generation

NIV: 用于可变字体生成的神经轴变化

Nadav Benedek, Ariel Shamir, Ohad Fried

发表机构 * Reichman University(雷赫曼大学)

AI总结 提出NIV方法,通过预测字形轮廓的逐点位移,自动将静态字体转换为支持多轴连续插值的可变字体,并在新构建的数据集上验证其泛化能力。

详情
AI中文摘要

可变字体能够沿语义设计轴(如字重、字宽、倾斜和光学尺寸)实现字形几何的连续变化。然而,从静态字体构建可变字体仍然是一个劳动密集型过程,需要专业的字体设计和对字形变化数据的手动规范。我们引入了NIV(神经轴变化),一种自动将静态字体转换为功能齐全的可变字体的方法。给定字形轮廓和一组期望的设计轴,NIV预测每点的位移。该模型直接操作矢量字形几何,并采用一种新颖的属性嵌入机制,捕获多个轴之间的相互作用,从而在统一框架内实现一致的多轴变化。我们在一个新构建的源自可变Google字体的数据集上训练NIV,该数据集包含超过一百万个变化元组。得到的模型能够泛化到未见过的码点、未见过的字体样式、高复杂度的CJK字形,甚至分布外的手写输入。生成的输出是标准的可变字体文件,支持通过现有渲染引擎进行连续插值。为了促进研究,我们在https://github.com/ndvbd/NIV上发布了数据集、完整的训练和推理实现以及训练好的模型。超越字体排印,我们的方法展示了如何使用神经变形合成具有连续参数变化的结构化几何对象。

英文摘要

Variable fonts enable continuous variation of glyph geometry along semantic design axes such as weight, width, slant, and optical size. However, constructing a variable font from a static font remains a labor-intensive process requiring expert typographic design and manual specification of glyph variation data. We introduce NIV (Neural Axis Variations), a method that automatically converts a static font into a fully functional variable font. Given glyph outlines and a set of desired design axes, NIV predicts per-point displacements. The model operates directly on vector glyph geometry and employs a novel Property Embedding mechanism that captures interactions between multiple axes, enabling consistent multi-axis variation within a unified framework. We train NIV on a newly constructed dataset derived from variable Google Fonts, comprising over one million variation tuples. The resulting model generalizes across unseen code points, unseen font styles, high-complexity CJK glyphs, and even out-of-distribution handwriting inputs. The generated outputs are standard variable font files supporting continuous interpolation via existing rendering engines. To facilitate research, we release the dataset, the complete training and inference implementation, and trained models at https://github.com/ndvbd/NIV. Beyond typography, our approach demonstrates how structured geometric objects with continuous parametric variation can be synthesized using neural deformations.
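
How per-point displacements turn a static outline into a continuously variable one can be sketched for a single axis. In OpenType variable fonts, each point's delta is scaled by a scalar derived from the normalized axis coordinate and added to the default position. The version below simplifies this to one axis and one linear region; the function name and data layout are illustrative assumptions, not the NIV implementation.

```python
def apply_axis_variation(outline, displacements, axis_value):
    """Move each outline point by axis_value times its displacement.

    outline, displacements: lists of (x, y) tuples, one per contour point.
    axis_value: normalized axis coordinate, e.g. 0.0 (default) to 1.0 (axis maximum).
    """
    if len(outline) != len(displacements):
        raise ValueError("need exactly one displacement per outline point")
    return [
        (x + axis_value * dx, y + axis_value * dy)
        for (x, y), (dx, dy) in zip(outline, displacements)
    ]
```

For example, with a hypothetical weight axis, `apply_axis_variation([(0, 0), (100, 0)], [(0, 0), (20, 0)], 0.5)` moves the second point halfway toward its bold position. Multi-axis fonts sum several such scaled deltas, which is where interactions between axes arise.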

2606.05256 2026-06-05 cs.AI 版本更新

How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment

他们走了多远?已终止现场实验中隐蔽LLM代理的说服策略

Kokil Jaidka, Saifuddin Ahmed

发表机构 * Wee Kim Wee School of Communication and Information, Nanyang Technological University(南洋理工大学黄金辉传播与信息学院)

AI总结 通过分析Reddit r/ChangeMyView已终止现场实验的公开数据集,研究隐蔽LLM代理在身份丰富的讨论论坛中使用的说服策略,发现其系统性采用身份定位、权威信号、对齐策略和认知偏差触发,构成以说服效率为导向的修辞架构。

详情
AI中文摘要

本研究分析了Reddit r/ChangeMyView上一个已终止现场实验的公开数据集。该干预由未知的外部研究人员进行,因伦理反弹而停止,涉及未公开的AI生成账户与用户进行实时辩论。公开披露后,Reddit授权版主发布AI生成评论的存档,创造了难得的机会来检查大型语言模型如何在未披露的情况下在身份丰富的讨论论坛中运作。我们对这一语料库进行了结构化内容分析,评估了身份表现、权威信号、对齐策略和认知启发式的激活。身份定位或采用出现在超过三分之二的评论中,对齐动作和权威声明几乎出现在所有评论中,而认知偏差触发——特别是确认偏差、代表性启发和可得性启发——出现在绝大多数评论中。这些模式系统性地共现,构成了一种为说服效率而非真实讨论参与而校准的修辞架构。与人类撰写的CMV反驳相比,代理在每个维度上都颠倒了典型分布:更密集的权威使用、更对抗性的对齐,以及更依赖外部引用而非经验基础。在此类环境中,真实与合成认知地位之间的区别日益模糊——这种不对称性仅靠披露要求无法解决。研究结果指向能够评估AI系统如何构建可信度的审计框架,而不仅仅是它们是否存在。

英文摘要

This study analyzes a publicly released dataset from a discontinued field experiment on Reddit's r/ChangeMyView. The intervention, conducted by unknown, external researchers and halted following ethical backlash, involved undisclosed AI-generated accounts engaging users in live debate. After public disclosure, Reddit authorized moderators to release an archive of the AI-generated comments, creating a rare opportunity to examine how large language models operated in an identity-rich deliberative forum without disclosure. We conduct a structured content analysis of this corpus, evaluating identity performance, authority signaling, alignment strategies, and activation of cognitive heuristics. Identity targeting or adoption appears in over two-thirds of comments, alignment moves and authority claims in nearly all of them, and cognitive-bias triggers -- particularly confirmation bias, representativeness, and availability -- in the large majority. These patterns co-occur systematically, composing a rhetorical architecture calibrated for persuasive efficiency rather than authentic deliberative participation. Compared against human-authored CMV counter-arguments, the agents inverted the typical distribution on every dimension: denser authority use, more adversarial alignment, and heavier reliance on external citation over experiential grounding. In such environments, distinctions between authentic and synthetic epistemic standing grow increasingly opaque -- an asymmetry that disclosure mandates alone cannot address. The results point toward auditing frameworks capable of assessing how AI systems structure credibility, not merely whether they are present.

2606.05252 2026-06-05 cs.CR cs.AI 版本更新

From Attack Simulation to SIEM Rule: Deterministic Detection-as-Code Synthesis with Probe-Level Traceability

从攻击模拟到SIEM规则:具有探针级可追溯性的确定性检测即代码合成

Alexandre Cristovão Maiorano

发表机构 * lumytics.com(lumytics公司)

AI总结 本文提出一种确定性合成方法,通过小型模板库将攻击模拟发现映射为Sigma规则,并保持对原始探针的字节级可追溯性,实现了可验证的检测即代码。

Comments 22 pages, 3 figures, 11 tables

详情
AI中文摘要

安全团队通常会模拟攻击自己的系统,以检查其监控是否能捕获真实入侵者。这些入侵与攻击模拟(BAS)工具会呈现发现结果,但监控生产环境的安全信息和事件管理(SIEM)系统需要检测规则——目前,人类通过手工方式弥合这一差距,阅读每个发现并编写相应的Sigma规则(一种供应商中立的检测格式)。我们证明,当探针来自锁定语料库时,这种转换可以部分自动化,因此每个发现都带有指向原始探针的稳定标识符。我们描述了一种确定性合成函数,通过一个小型模板库(N=23,按OWASP LLM和Web Top 10的类别索引)将每个发现映射为起始Sigma规则,并带有指向原始发现及其MITRE ATT&CK技术的反向引用。在两个锁定语料库(17个探针的LLM,23个探针的Web)上,每个被绕过的探针发现都产生一个起始规则,并且所有17/17个生成的规则都能解析并转换为Splunk和Elasticsearch后端。通过实时OpenSearch SIEM重放,LLM规则在保留的AdvBench子集上触发30%,在HarmBench上触发14%,在良性基线上假阳性率为7.7%;Web方面进行了结构验证,未针对保留的攻击集进行验证。其贡献在于一条从BAS发现到可操作部署的起始规则的可验证、字节级稳定的路径,仅通过已发布的语料库和模板库即可重新推导——用LLM生成方法的广度换取了精确的可重复性和从任何触发告警到原始探针的类型化回溯。

英文摘要

Security teams routinely simulate attacks against their own systems to check whether their monitoring would catch a real intruder. These Breach-and-Attack-Simulation (BAS) tools surface findings, but the security information and event management (SIEM) systems that watch production need detection rules -- and today a human bridges that gap by hand, reading each finding and writing the corresponding Sigma rule (a vendor-neutral detection format). We show this translation can be partially automated when probes are drawn from a locked corpus, so each finding carries a stable identifier back to the originating probe. We describe a deterministic synthesis function that maps each finding to a starter Sigma rule through a small template library (N=23, indexed by categories from the OWASP LLM and Web Top 10), with a back-reference to the originating finding and its MITRE ATT&CK technique. On two locked corpora (17-probe LLM, 23-probe Web), every bypassed-probe finding yields a starter rule, and all 17/17 emitted rules parse and convert to Splunk and Elasticsearch backends. Replayed through a live OpenSearch SIEM, the LLM rules fire on 30% of a held-out AdvBench subset and 14% of HarmBench at 7.7% false positives on a benign baseline; the Web side is validated structurally, not against a held-out attack set. The contribution is a verifiable, byte-stable path from BAS finding to operator-deployable starter rule, re-derivable from the published corpus and template library alone -- trading the breadth of LLM-generative methods for exact reproducibility and a typed traceback from any fired alert to the originating probe.
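
The deterministic mapping from finding to starter rule can be sketched as a pure function over a template table. The template entry, the finding fields (`probe_id`, `category`, `attack_technique`), and the `llm_gateway` log source below are all hypothetical; only the top-level Sigma keys (`title`, `id`, `status`, `references`, `tags`, `logsource`, `detection`, `level`) follow the public Sigma format. Deriving the rule ID from the probe ID with a name-based UUID is one assumed way to get stable output, not necessarily the paper's.

```python
import uuid

# Hypothetical one-entry template library; the paper describes 23 templates.
TEMPLATES = {
    "prompt_injection": {
        "title": "Possible prompt injection attempt",
        "logsource": {"category": "application", "product": "llm_gateway"},
        "keywords": ["ignore previous instructions"],
        "level": "medium",
    },
}

def synthesize_rule(finding):
    """Map a BAS finding to a starter Sigma-style rule (as a dict).

    No randomness, clock, or model call is involved, so the same finding
    always yields the same rule.
    """
    template = TEMPLATES[finding["category"]]
    return {
        "title": template["title"],
        # Name-based UUID: stable across runs for the same probe ID.
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, finding["probe_id"])),
        "status": "experimental",
        # Back-reference from the rule to the originating probe.
        "references": [finding["probe_id"]],
        "tags": ["attack." + finding["attack_technique"].lower()],
        "logsource": dict(template["logsource"]),
        "detection": {
            "keywords": list(template["keywords"]),
            "condition": "keywords",
        },
        "level": template["level"],
    }
```

Because the rule carries the probe ID in both `id` and `references`, any alert that fires can be traced back to the probe that motivated it, which is the traceability property the abstract emphasizes.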

2606.05241 2026-06-05 cs.CR cs.AI 版本更新

Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation

深度研究代理中的搜索时污染:衡量公开基准评估中的性能膨胀

Yongjie Wang, Xinyue Zhang, Kunhong Yao, Zhiwei Zeng, Kaisong Song, Jun Lin, Zhiqi Shen

发表机构 * Alibaba-NTU Global e-Sustainability CorpLab (ANGEL)(阿里-NTU全球可持续性科技公司实验室(ANGEL)) Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团)

AI总结 研究深度研究代理在推理过程中通过网页搜索检索公开基准元数据、问题上下文甚至真实答案导致的搜索时污染(STC),定义三种污染类型并开发检测算法,实验表明STC可导致性能膨胀高达4%。

Comments Under Review

详情
AI中文摘要

公开基准能够对LLM推理进行公平且可重复的评估,但对于在推理过程中主动搜索网络的深度研究代理而言,这些基准变得脆弱。此类代理可能通过网页搜索检索到公开基准元数据、问题上下文甚至真实答案。这导致了搜索时污染(STC),即外部检索绕过了预期的推理并夸大了测量到的性能。我们系统地研究了深度研究代理评估中的STC。我们定义了三种严重程度递增的污染类型,即基准元数据泄露、问题上下文泄露和显式答案泄露,并开发了检测算法来识别它们并量化其对代理性能的影响。在六个公开基准上评估现代深度研究代理,我们发现STC普遍存在,并可能使性能膨胀高达4%。我们的发现表明,现有的评估可能高估了真实的推理能力。因此,我们提倡污染感知的实践,包括隔离沙箱、透明的搜索轨迹和受控的基准访问。

英文摘要

Public benchmarks enable fair and reproducible evaluation of LLM reasoning, but they become fragile for deep research agents that actively search the web during inference. Such agents may retrieve public benchmark metadata, question context, or even ground-truth answers via web search. This gives rise to Search-Time Contamination (STC), where external retrieval bypasses intended reasoning and inflates measured performance. We systematically study STC in deep research agent evaluation. We define three contamination types with increasing severity, namely Benchmark Metadata Leakage, Question-Context Leakage, and Explicit Answer Leakage, and develop detection algorithms to identify them and quantify their impact on agent performance. Evaluating modern deep research agents on six public benchmarks, we find that STC is widespread and can inflate performance by up to 4%. Our findings show that existing evaluations may overestimate true reasoning ability. We therefore advocate contamination-aware practices, including isolated sandboxes, transparent search trajectories, and controlled benchmark access.

2606.05233 2026-06-05 cs.CR cs.AI cs.CL 版本更新

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

前沿计算机使用代理中的领域条件安全:一个793回合的浏览器基准测试、编码领域交叉对照以及对近期红队研究的可重复性审计

Nicholas Saban

发表机构 * Patronus AI(Patronus AI) University of California, Berkeley(加州大学伯克利分校)

AI总结 本研究构建了包含793个回合(覆盖24个多步骤网页任务和56个攻击模板)的基准测试,评估前沿计算机使用代理对提示注入攻击的鲁棒性,发现这种抵抗性存在于模型权重中(多步骤攻击成功0/140),但该安全性是领域条件的,在编码代理中失效(攻击成功率高达100%),并指出文献中的高攻击成功率主要归因于RL优化的注入文本而非攻击类别。

详情
AI中文摘要

最近的计算机使用代理(CUA)红队论文报告提示注入攻击成功率(ASR)为42-98%,但这些头条数字集中在已退役模型和每篇论文所测模型中最易受攻击的模型上。我们探究这些技术在以手工模板复现后,是否仍然对当前前沿CUA有效。我们发布了CUA-HandCrafted,一个包含793个回合的公共基准测试,涵盖24个多步骤网页任务、56个攻击模板、8个攻击家族和4个系统提示配置。针对Claude Sonnet 4.6和GPT-5.4,我们测量到0/140的多步骤攻击成功(Clopper-Pearson 95%上限2.60%);一个提示消融实验表明这种抵抗性存在于模型权重中。然而,它并不泛化:在一个姐妹编码代理基准测试(SkillBench)上,相同的权重在手工制作的技能注入攻击下失守,攻击成功率高达100%。我们认为文献中的高ASR主要归因于RL优化的注入文本,而不是攻击类别,并且前沿安全加固是领域条件的,仅针对被重点攻击的浏览器界面。报告技术而不发布优化字符串,或将浏览器领域安全性外推到其他CUA模态,使得已发表的ASR数字无法重现。

英文摘要

Recent computer-using-agent (CUA) red-teaming papers report prompt-injection attack success rates (ASR) of 42-98%, but these headline numbers cluster on retired models and on the most-vulnerable model in each paper's panel. We ask whether those techniques, reproduced as hand-crafted templates, still work against current frontier CUAs. We release CUA-HandCrafted, a public benchmark of 793 episodes spanning 24 multi-step web tasks, 56 attack templates, 8 attack families, and 4 system-prompt configurations. Against Claude Sonnet 4.6 and GPT-5.4 we measure 0/140 multi-step attack success (Clopper-Pearson 95% upper bound 2.60%); a prompt ablation shows this resistance lives in the model weights. Yet it does not generalize: on a sister coding-agent benchmark (SkillBench), the same weights fall to hand-crafted skill-injection at up to 100%. We argue that the literature's high ASR is largely attributable to RL-optimized injection text rather than the attack categories, and that frontier safety hardening is domain-conditioned, specific to the heavily-targeted browser surface. Reporting techniques without releasing the optimized strings, or extrapolating browser-domain safety to other CUA modalities, makes published ASR numbers unreproducible.
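
The "0/140, upper bound 2.60%" figure in the abstract above can be checked by hand. The following is a minimal sketch, assuming the bound is the upper limit of the two-sided exact (Clopper-Pearson) interval; the function name is hypothetical and not taken from the paper. With zero successes the upper limit has a closed form, so no statistics library is needed.

```python
def clopper_pearson_upper_zero(n: int, confidence: float = 0.95) -> float:
    """Upper limit of the two-sided exact interval when 0 successes occur in n trials."""
    alpha = 1.0 - confidence
    # With x = 0, the upper limit p solves (1 - p)**n = alpha / 2.
    return 1.0 - (alpha / 2.0) ** (1.0 / n)


upper = clopper_pearson_upper_zero(140)
print(f"{upper:.2%}")  # about 2.60%
```

A one-sided 95% bound would instead use `alpha` rather than `alpha / 2` and give a smaller value, so the reported 2.60% is consistent with the two-sided convention.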

2606.05232 2026-06-05 cs.LG cs.AI 版本更新

Differentiable Efficient Operator Search

可微分高效算子搜索

Xiaohuan Pei, Jiyuan Zhang, Yuanfan Guo, Weiguo Feng, Tao Huang, Cho-Jui Hsieh, Chang Xu

发表机构 * The University of Sydney(悉尼大学) ByteDance(字节跳动) Shanghai Jiao Tong University(上海交通大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出可微分高效算子搜索框架,统一解释多种token缩减算子,通过联合搜索缩减位置、保留数量和算子行为,在预算约束下优化多模态模型性能。

详情
AI中文摘要

高效多模态基础模型通常依赖于手动设计的token缩减算子,如剪枝、合并、池化和自适应重加权。尽管这些算子看起来不同,但我们表明它们可以被解释为共享算子空间的不同区域。基于这一观点,我们引入了高效算子搜索,一个可微分框架,联合搜索在哪里缩减token、保留多少token以及如何处理缩减后的token信息。所提出的搜索空间参数化层激活、保留预算和算子行为,而搜索策略在单边预算和成本约束下优化任务性能。该公式将代表性手工设计基线作为特例恢复,并进一步发现超越孤立手动设计的混合算子。在多模态基准上的实验表明,搜索得到的算子在精度-效率权衡上具有竞争力,特别是在激进的视觉token缩减下。这些结果表明,高效多模态推理可以从手动算子设计重新构建为可微分算子搜索。

英文摘要

Efficient multimodal foundation models often rely on manually designed token-reduction operators, such as pruning, merging, pooling, and adaptive reweighting. Although these operators appear different, we show that they can be interpreted as distinct regimes of a shared operator space. Based on this view, we introduce Efficient Operator Search, a differentiable framework that jointly searches where to reduce tokens, how many tokens to retain, and how reduced token information should be processed. The proposed search space parameterizes layer activation, retention budget, and operator behavior, while the search policy optimizes task performance under one-sided budget and cost constraints. This formulation recovers representative hand-designed baselines as special cases and further discovers hybrid operators beyond isolated manual designs. Experiments on multimodal benchmarks show that the searched operators achieve competitive accuracy-efficiency trade-offs, especially under aggressive visual-token reduction. These results suggest that efficient multimodal inference can be reframed from manual operator design to differentiable operator search.

2606.05222 2026-06-05 cs.CY cs.AI cs.HC 版本更新

Where's the Structure? A Systematic Literature Review of Empirical Research on Human-AI Collaboration and Hybrid Intelligence for Learning

结构在哪里?关于人机协作与混合智能用于学习的实证研究的系统文献综述

Luis P. Prieto, Juan I. Asensio-Pérez, María Jesús Rodríguez-Triana, Mohamed Saban, Yannis Dimitriadis

发表机构 * GSIC-EMIC research group, Universidad de Valladolid (Spain)(巴利亚多利德大学GSIC-EMIC研究组) GICAP research group, Department of Digitization, Universidad de Burgos (Spain)(布尔戈斯大学数字化系GICAP研究组)

AI总结 本文通过系统文献综述(N=62)分析了人机协作与混合智能在学习支持中的协作过程、结构及应用背景,提取了设计知识和研究空白。

Comments 59 pages, 4 figures, submitted to a journal

详情
AI中文摘要

人工智能(AI)已被应用于各种教育场景以支持学习。其中一种方法是“人机协作”(也称为“混合智能”),即人类和AI组件互动以促进人类学习。然而,如同人-人计算机支持的协作学习(CSCL)一样,无结构的互动不一定产生有效的学习体验。本文报告了一项关于人机协作和混合智能用于学习支持的实证研究(N=62)的系统文献综述。该综述描述了协作过程、其结构以及应用背景。它还提取了新兴的设计知识和研究空白。研究人员和技术设计师可以将这些发现作为在教育实践和未来研究中构建更有效的AI增强协作技术的起点。

英文摘要

Artificial intelligence (AI) has been applied across educational contexts to support learning. One approach to such support is "human-AI collaboration" (also termed "hybrid intelligence"), where human(s) and AI components interact to promote human learning. However, as in human-to-human computer-supported collaborative learning (CSCL), unstructured interaction does not necessarily produce an effective learning experience. This paper reports a systematic literature review of empirical studies (N=62) on human-AI collaboration and hybrid intelligence for learning support. The review characterizes collaboration processes, their structures, and contexts of application. It also extracts emerging design knowledge and research gaps. Researchers and technology designers can use these findings as a starting point for structuring more effective AI-enhanced technologies for collaboration, in educational practice and future research.

2606.05219 2026-06-05 cs.LG cs.AI 版本更新

Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway

大步长梯度下降恢复多路径深度线性网络中的对称性

Hee-Sung Kim, Sungyoon Lee

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文研究大步长离散梯度下降如何通过边缘稳定性振荡使多路径深度线性网络从对称性破坏转向信号重新分配,从而偏好共享表示而非单路径主导。

Comments ICML 2026

详情
AI中文摘要

最近对多路径深度线性网络的分析使用梯度流(GF)预测了一种“赢家通吃”的专业化,其中路径对称性被破坏,每个特征集中在一个路径中。在这项工作中,我们表明具有大步长的离散梯度下降(GD)讲述了一个不同的故事。我们证明单路径解是尖锐最小值,而跨路径分布信号通过一个随路径数量和深度增加而减小的因子降低了尖锐度。因此,虽然早期训练再现了由GF预测的深度驱动的对称性破坏,但随后在稳定性边缘的振荡压过了这一趋势,并将网络驱动到重新平衡阶段,其中信号在路径间重新分布。总之,这些结果阐明了深度如何塑造路径竞争,并解释了大步长GD为何偏好共享表示而非持续的单路径主导。

英文摘要

Recent analyses of multi-pathway Deep Linear Networks use Gradient Flow (GF) to predict a "winner-takes-all" specialization in which path symmetry breaks and each feature concentrates in a single pathway. In this work, we show that discrete Gradient Descent (GD) with a large step size tells a different story. We prove that single-path solutions are sharp minima, whereas distributing signals across pathways reduces sharpness by a factor that decreases with both the number of pathways and depth. Consequently, while early training reproduces the depth-driven symmetry breaking predicted by GF, oscillations at the Edge of Stability subsequently override this tendency and drive the network into a re-balancing phase, where signals redistribute across pathways. Together, these results clarify how depth shapes pathway competition and explain why large-step GD favors shared representations rather than persistent single-pathway dominance.
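
The sharpness claim above can be illustrated with a toy scalar model. This is a sketch under strong assumptions and not the paper's theorem: every weight is a scalar, the network output is the sum over pathways of the product of `depth` weights, the loss is `0.5 * (f - 1)**2`, and weights inside a pathway are equal. At a zero-loss point the Hessian is the outer product of the output gradient with itself, so its top eigenvalue (the sharpness) is the squared gradient norm.

```python
def sharpness_at_balanced_minimum(num_paths: int, depth: int) -> float:
    """Top Hessian eigenvalue at a zero-loss point where the unit signal is
    split evenly over `num_paths` pathways (toy scalar model, see assumptions)."""
    path_signal = 1.0 / num_paths        # each active pathway carries an equal share
    w = path_signal ** (1.0 / depth)     # common weight value inside a pathway
    grad_entry = w ** (depth - 1)        # d f / d w for any single weight
    # Hessian = g g^T at zero loss, so sharpness = ||g||^2.
    return num_paths * depth * grad_entry ** 2


single = sharpness_at_balanced_minimum(1, 3)   # one pathway carries everything
shared = sharpness_at_balanced_minimum(2, 3)   # signal shared by two pathways
print(single, shared)
```

In this toy setting the ratio of shared to single-path sharpness is `num_paths ** ((2 - depth) / depth)`, which shrinks as either the number of pathways or the depth grows. Since plain GD on a quadratic is stable only for step sizes below `2 / sharpness`, a large step size cannot settle at the sharper single-path solution.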

2606.05217 2026-06-05 math-ph cs.AI cs.LG math.MP physics.data-an 版本更新

The Score Hamiltonian: Mapping Diffusion Models to Adiabatic Transport

得分哈密顿量:将扩散模型映射到绝热输运

Peter Halmos, Boris Hanin

发表机构 * Computer Science Department, Princeton University(普林斯顿大学计算机科学系) ORFE Department, Princeton University(普林斯顿大学ORFE系)

AI总结 本文通过构建得分哈密顿量,建立了基于得分的扩散模型采样与薛定谔算子基态绝热输运之间的精确对应关系,并利用绝热定理推导了密度重建误差界和退火调度方案。

详情
AI中文摘要

我们展示了基于得分的扩散模型采样与一族薛定谔算子的基态绝热输运之间的精确对应关系,这些薛定谔算子被称为得分哈密顿量,由学习到的得分的量子势构建。通过具有时变势的福克-普朗克方程的绝热定理,我们获得了新颖的密度重建误差界和原则性的退火调度方案。我们发现采样的基本极限由得分匹配误差平方与得分哈密顿量谱隙(数据密度的逆庞加莱常数)之比决定。

英文摘要

We exhibit an exact correspondence between sampling with score-based diffusion models and adiabatic transport of ground states for a family of Schrödinger operators we call Score Hamiltonians, built from the learned score's quantum potential. We obtain novel density reconstruction bounds and principled annealing schedules via adiabatic theorems for Fokker-Planck equations with time-varying potentials. We find the fundamental limit of sampling is set by the ratio of squared score-matching error to Score Hamiltonian spectral gap, that is, the inverse Poincaré constant of the data density.

2606.05206 2026-06-05 q-bio.NC cs.AI stat.AP 版本更新

Ontology-constrained multi-LLM scoring of hypothesis support in the predictive processing literature

预测加工文献中假设支持度的本体约束多LLM评分

Hamed Nejat, Alexander Maier, Jesse Spencer-Smith, André M. Bastos

发表机构 * University of Edinburgh(爱丁堡大学) University of Cambridge(剑桥大学)

AI总结 本文提出一个本地多LLM流水线,通过本体约束对预测编码文献中的研究进行评分,将异构文献映射到定量证据空间,并揭示假设间的结构化分歧。

Comments 33 pages, 5 tables and 9 figures

详情
AI中文摘要

跨学科领域由于方法多样和理论承诺不同,常常存在碎片化问题。预测编码神经科学是一个典型例子:其文献涵盖计算理论、电生理学、影像学、行为学和建模,造成了传统荟萃分析难以解决的综合问题。本文描述了一个用于本体约束文献综合的本地多LLM流水线。该流水线读取论文、提取证据、整合图表描述、组装约束提示,并根据专家词汇表验证输出。我们手动定义了一个预测编码词汇表,包含36个概念,分为三个假设:预测抑制、前向误差传播和普遍性。由十个本地语言模型组成的委员会根据每个词汇因子在局部和全局oddball情境下的一致性或不一致性,对31项研究进行评分。这使得可以进行成对研究一致性分析、跨模型比较和三维假设空间映射。某些假设的一致性较高,而其他假设则较弱,揭示了结构化分歧,特别是在局部与全局oddball范式之间。我们进一步定义了假设空间温度,这是一种几何离散度度量,用于衡量研究在假设空间中的紧凑程度。局部oddball情境的温度较低,而全局oddball情境的温度较高,表明后者离散度更大。评分几何还允许我们估计实验情境之间的变化向量。这些结果表明,本地多LLM委员会可以产生可审计的不一致性测量,将异构文献映射到定量证据空间。该框架可能推广到传统荟萃分析缺乏共同比较空间的跨研究假设映射。

英文摘要

Fragmentation is common in interdisciplinary fields with diverse methods and theoretical commitments. Predictive coding neuroscience is a clear example: its literature spans computational theory, electrophysiology, imaging, behavior, and modeling, creating a synthesis problem that conventional meta-analysis cannot easily resolve. Here, we describe a local multi-LLM pipeline for ontology-constrained literature synthesis. The pipeline reads papers, extracts evidence, incorporates figure descriptions, assembles constrained prompts, and validates outputs against an expert glossary. We manually defined a predictive-coding glossary of thirty-six concepts grouped into three hypotheses: predictive suppression, feedforward error propagation, and ubiquity. A council of ten local language models scored 31 studies according to their agreement or disagreement with each glossary factor across local and global oddball contexts. This enabled pairwise study-agreement analysis, cross-model comparison, and three-dimensional hypothesis-space mapping. Agreement was high for some hypotheses but weaker for others, revealing structured disagreement, particularly across local versus global oddball paradigms. We further define hypothesis-space temperature, a geometric dispersion metric measuring how compactly studies occupy the hypothesis space. Temperature was lower for local oddball contexts and higher for global oddball contexts, indicating greater dispersion in the latter. The scoring geometry also allowed us to estimate vectors of change between experimental contexts. These results demonstrate that local multi-LLM councils can produce auditable disagreement measurements that map heterogeneous literatures into quantitative evidence spaces. This framework may generalize to cross-study hypothesis mapping where conventional meta-analysis lacks a common comparison space.

2606.05199 2026-06-05 physics.comp-ph cs.AI 版本更新

Finite Element-Based Material Learning via Automatic Differentiation: Learning constitutive neural network models from full-field deformation data

基于有限元和自动微分的材料学习:从全场变形数据学习本构神经网络模型

Matthias Knipper, Chenyi Ji, Malte Brand, Kevin Linka

发表机构 * Computational Mechanics in Medicine, Applied Medical Engineering, RWTH Aachen University(亚琛工业大学应用医学工程研究所医学计算力学组) Institute for Continuum and Material Mechanics, Hamburg University of Technology(汉堡工业大学连续介质与材料力学研究所)

AI总结 提出FE-MAD框架,通过自动微分将本构神经网络集成到JAX-FEM非线性求解器中,利用梯度优化从全场变形数据识别材料参数,适用于灰箱和白箱本构模型,并在三个实验数据集上验证。

详情
AI中文摘要

从异质全场变形数据中识别本构神经网络模型为基于均匀应力-应变实验的传统标定方法提供了稳健的替代方案,特别是考虑到可训练参数的高维性。现有方法必须在通用性、鲁棒性和计算效率之间取得平衡:传统有限元模型更新适用广泛但计算量大;弱形式方法效率高但对噪声和数据稀缺敏感;神经算子模型表达力强但需要大量训练数据。本文提出FE-MAD(基于有限元和自动微分的材料学习),一个端到端可微框架,将本构神经网络模型集成到JAX-FEM非线性求解器中,并通过基于梯度的测量-失配损失最小化来识别其参数。牛顿切线刚度和损失梯度通过整个流程的前向和反向模式自动微分自动计算,从而消除了解析伴随或离线代理模型的需求。FE-MAD针对两种架构进行了演示:灰箱本构人工神经网络(CANN),一个多凸、全连接且高度灵活的模型;以及白箱CANN,一个具有现象学可解释应变能项的专家系统网络。聚焦于不可压缩各向同性超弹性,FE-MAD在三个开放实验数据集上进行了评估:(1)带孔拉伸试件的全场数字图像相关(DIC)数据,(2)具有一维伸长比分布和全局力-位移曲线的数据缩减场景,以及(3)异质基体-夹杂系统,其中两相的本构定律被识别并推广到22个先前未见过的样本。

英文摘要

The identification of constitutive neural network models from heterogeneous full-field deformation data provides a robust alternative to traditional calibration methods based on homogeneous stress-strain experiments, particularly given the high dimensionality of trainable parameters. Existing approaches must balance generality, robustness, and computational efficiency: conventional finite element model updating is broadly applicable but computationally demanding; weak-form methods offer efficiency but are sensitive to noise and data scarcity; neural operator models are highly expressive but require extensive training datasets. This work presents FE-MAD (Finite Element-Based Material learning via Automatic Differentiation), an end-to-end differentiable framework that integrates a constitutive neural network model within a JAX-FEM nonlinear solver and identifies its parameters through gradient-based minimization of a measurement-mismatch loss. Newton tangent stiffness and loss gradients are computed automatically using forward- and reverse-mode automatic differentiation throughout the entire pipeline, thereby removing the need for analytic adjoints or offline surrogate models. FE-MAD is demonstrated for two architectures: a grey-box Constitutive Artificial Neural Network (CANN), a polyconvex, fully connected model with high flexibility, and a white-box CANN, an expert-system network with phenomenologically interpretable strain-energy terms. Focusing on incompressible isotropic hyperelasticity, FE-MAD is evaluated on three open experimental datasets: (1) full digital image correlation (DIC) of a perforated tensile specimen, (2) a reduced-data scenario with a one-dimensional stretch profile and global force-displacement curve, and (3) a heterogeneous matrix-inclusion system in which the constitutive laws of both phases are identified and generalized to twenty-two previously unseen samples.

2606.05194 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Temporal Preference Concepts and their Functions in a Large Language Model

时间偏好概念及其在大语言模型中的功能

Ian Rios-Sialer, Shantanu Darveshi, Shuai Jiang, Avigya Paudel, Anastasiia Pronina, Ipshita Bandyopadhyay, Justin Shenk

发表机构 * AISC(AI Safety Camp) SPAR(Supervised Program for Alignment Research)

AI总结 通过因果定位和激活修补,本文发现大语言模型在中间到上层节点编码时间偏好几何结构,且行为分析表明模型对未来折扣比人类更平缓,但偏好不稳定,可通过引导向量调控。

详情
AI中文摘要

大语言模型(LLMs)越来越多地被部署用于需要在近期收益与长期后果之间权衡的决策,然而关于它们如何在内部表示或解决这些权衡,我们知之甚少。在这项工作中,我们以因果方式定位了一个蒸馏LLM(Qwen3-4B-Instruct-2507)中时间偏好的底层子图,并通过梯度归因和激活修补的汇聚证据识别出中上层节点。我们发现时间跨度的几何结构编码在预期的已定位层的残差流中。行为分析表明,未经干预的LLM对未来的折扣陡峭程度比人类低数倍,但这种偏好在不同上下文中并不稳定,这促使我们采用显式控制,而非隐式依赖训练。最后,我们发现有提示性证据表明引导向量可以改变时间偏好。我们的工作展示了机械可解释性如何使我们更接近对LLM规划和推理方式的可靠控制。

英文摘要

Large Language Models (LLMs) are increasingly being deployed to make decisions that require trading off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these tradeoffs. In this work, we causally localize an underlying subgraph for temporal preference in a distilled LLM (Qwen3-4B-Instruct-2507), identifying mid-to-upper-layer nodes through converging evidence from gradient-based attribution and activation patching. We find that the geometry of time horizon is encoded in the residual stream at the expected localized layers. A behavioral analysis reveals that unintervened LLMs discount the future several times less steeply than humans, yet this preference is unstable across contexts, motivating explicit control rather than implicit reliance on training. Finally, we find suggestive evidence that steering vectors can shift temporal preference. Our work demonstrates how mechanistic interpretability can bring us closer to reliable control over how LLMs plan and reason.

2606.05188 2026-06-05 cs.CY cs.AI 版本更新

Assessing the Geographic Diversity of AI's Platial Representations in Image Generation

评估图像生成中AI地点表征的地理多样性

Zilong Liu, Krzysztof Janowicz, Mina Karimi

发表机构 * Department of Geography and Regional Research, University of Vienna, Austria(维也纳大学地理与区域研究系)

AI总结 本文以GPT和DALL-E模型为例,引入生态学中的物种多样性度量方法,通过相似性加权评估图像生成的地理多样性,发现旧模型可能更具多样性且提示修订比图像生成更促进多样性,同时观察到模型同质性导致缺乏地理多样性。

Comments Full conference paper accepted by the AGILE 2026 (https://agile-gi.eu/conference-2026)

详情
AI中文摘要

(生成式)AI多样性不仅仅是伦理问题。从地理信息科学(GIScience)的角度来看,它可被解释为不确定性的一种函数,以及嵌入AI输出中的一种认知偏差。近期研究致力于开发信息论多样性度量,并将其应用于评估地理背景下AI聊天机器人的输出。随着我们日常接触的AI生态系统迅速变得多模态,我们认为检查不同模态下的地理多样性至关重要。本文聚焦于图像,旨在填补这一研究空白。首先,我们选取GPT和DALL-E模型作为最先进的例子,指出评估其地理多样性涉及多个阶段,包括提示修订和图像生成。然后,受生态学中物种多样性度量的启发,我们将相似性加权纳入地理多样性的测量。接着,我们通过案例研究展示如何评估图像生成中的地理多样性。我们的分析揭示了若干反直觉的发现。例如,较旧的模型可能表现出更大的地理多样性,尽管生成的图像质量较低;提示修订比图像生成产生更大的地理多样性。同时,我们观察到缺乏地理多样性背后存在明显的模型同质性,因为所选模型一致地描绘相同的地理原型特征或相似特征。这令人担忧,因为它可能产生对地方的刻板印象。

英文摘要

(Gen)AI diversity is not merely an ethical issue. From the perspective of geographic information science (GIScience), it could be interpreted as a function of uncertainty and as a form of cognitive bias, embedded in AI outputs. Recent work has sought to develop information-theoretic diversity measures and apply them to evaluate AI-chatbot outputs in a geographic context. As the AI ecosystem to which we are exposed on a daily basis becomes rapidly multimodal, we believe it is important to examine geographic diversity across various modalities. Focusing on images, this paper aims to fill this research gap. First, we select the GPT and DALL-E models as state-of-the-art examples and point out how assessing their geographic diversity involves various stages, including prompt revision and image generation. Then, taking inspiration from species diversity measures in ecological research, we incorporate similarity weighting into the measurement of geographic diversity. Next, we demonstrate how to evaluate geographic diversity in image generation through a case study. Our analysis reveals several counterintuitive findings. For instance, older models can exhibit greater geographic diversity despite producing lower-quality images, and prompt revision yields greater geographic diversity than image generation. At the same time, we observe explicit model homogeneity underlying the lack of geographic diversity, as the selected models consistently depict the same prototypical geo-specific feature or similar features. This is concerning, as it risks producing stereotypical representations of places.
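
The abstract above does not say which similarity-weighted measure is used. As one plausible illustration, the sketch below implements the similarity-sensitive diversity of order q from ecology (often attributed to Leinster and Cobbold); the landmark categories, proportions, and similarity values are hypothetical and chosen only to show the effect of the weighting.

```python
import math


def similarity_weighted_diversity(p, Z, q=1.0):
    """Effective number of categories given proportions p and a similarity
    matrix Z (Z[i][i] = 1, entries in [0, 1])."""
    n = len(p)
    # "Ordinariness" of category i: its expected similarity to a random draw.
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    if abs(q - 1.0) < 1e-12:
        return math.exp(-sum(pi * math.log(zi) for pi, zi in zip(p, zp) if pi > 0))
    total = sum(pi * zi ** (q - 1.0) for pi, zi in zip(p, zp) if pi > 0)
    return total ** (1.0 / (1.0 - q))


# Hypothetical example: four landmark types depicted equally often.
p = [0.25, 0.25, 0.25, 0.25]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Assumed similarities: types 0 and 1 look alike, as do types 2 and 3.
similar = [[1.0, 0.8, 0.0, 0.0],
           [0.8, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.8],
           [0.0, 0.0, 0.8, 1.0]]

naive = similarity_weighted_diversity(p, identity)
weighted = similarity_weighted_diversity(p, similar)
print(naive, weighted)
```

With the identity matrix the measure reduces to the ordinary effective number of categories; once near-duplicate categories are marked as similar, the effective diversity drops, which is the behavior a similarity weighting is meant to capture.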

2606.05187 2026-06-05 cs.CY cs.AI 版本更新

Geographic Bias and Diversity in AI Evaluation

AI评估中的地理偏见与多样性

Zilong Liu, Krzysztof Janowicz, Gengchen Mai, Song Gao, Rui Zhu

发表机构 * University of Vienna(维也纳大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) University of Bristol(布里斯托大学)

AI总结 通过文献综述,识别AI中从训练数据到生成输出的多种地理偏见,并展示近期研究如何通过评估生成AI在不同认知层次、参数设置和输出模态下的地理多样性来应对这些偏见。

Comments Book chapter accepted by "Geography According to ChatGPT"

详情
AI中文摘要

在阻碍AI负责任开发和部署的众多挑战中,各种形式的偏见无疑受到了最严格的审视。这凸显了AI研究人员的广泛担忧,即模型输出(例如来自生成式AI)可能编码结构性分布不平衡(源于训练数据或模型设计),从而可能加剧社会不平等或在从生物多样性到灾害缓解等应用领域引入系统性扭曲。然而,相对较少的工作研究了偏见的地理性质,或为(生成式)AI的无偏见性开发了可衡量的基准。在本章中,我们通过文献综述来研究这个问题。随着基础模型重塑偏见研究的格局,我们考察了涵盖预生成式AI和生成式AI时期的工作。首先,我们识别了一系列地理偏见。这些偏见包括训练数据中的代表性偏差、语言模型事实回忆中的区域差异,以及生成式AI过度倾向于典型地点(称为默认值)的倾向。然后,我们展示了近期研究如何通过评估生成式AI在不同认知层次、参数设置和输出模态下的地理多样性来解决后一种偏见。

英文摘要

Among the many challenges hindering the responsible development and deployment of AI, arguably none has faced more intense scrutiny than bias in its various forms. This underscores the widespread concerns among AI researchers that model outputs, e.g., from generative AI, may encode structural distributional imbalances (stemming from training data or model design) that may amplify social inequality or introduce systemic distortions across application domains ranging from biodiversity to disaster mitigation. Yet, relatively little work has investigated the geographical nature of bias or developed measurable benchmarks for what it means for (generative) AI to be unbiased. In this chapter, we investigate this issue through a literature review. As foundation models are reshaping the landscape of bias research, we examine work spanning both the pre-generative AI and generative AI periods. First, we identify a range of geographic biases. These biases span from representation bias in the training data and regional disparities in the factual recall of language models to the tendency of generative AI to disproportionately favor prototypical places (called defaults). Then, we showcase how recent studies address the latter bias by evaluating geographic diversity in the outputs of generative AI across various cognitive levels, parameter settings, and output modalities.

2606.05183 2026-06-05 cs.CL cs.AI cs.HC 版本更新

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

粒度差距:Gemini 模型中谄媚行为的多维纵向审计

Patrick Keough

发表机构 * Independent Researcher(独立研究者)

AI总结 通过多维度分级评估(Likert 0-4),揭示 Gemini 模型在连续尺度上的谄媚行为,发现粗粒度二值指标掩盖了大量社会顺从行为,且代际进步非单调,存在对齐税(谄媚与真实性负相关)。

Comments 16 pages, 9 figures

详情
AI中文摘要

大型语言模型越来越多地被部署为高风险顾问,但标准对齐基准将谄媚视为二值失败模式。我们引入粒度差距:粗粒度二值指标掩盖了大量社会顺从行为,即模型屈服于用户框架、认可可疑前提或软化事实纠正,却不产生明显错误的输出。我们在三个防护栏条件(控制、简单、协议)下,对跨越 2.0、2.5 和 3.0 代的六个 Gemini 变体在 73 个对抗性提示上进行了评估,得到 8,830 个分级响应。使用经过人类标注者三人组验证的 0-4 Likert 量表(Fleiss kappa = 0.71;与 AI 共识的 Cohen kappa = 0.78;95.9% 二值准确率,100% 特异性),我们将谄媚量化为连续而非二值。出现三个发现。第一,27.2% 的响应包含大量谄媚内容(Likert >= 2.0),22.7% 达到中度或严重水平(>= 3.0),而二值胜率框架仅报告适度的失败率;粗粒度指标仅解释 29% 的分级方差。第二,代际进步是非单调的:Gen 2.5 相对于 Gen 2.0(1.90)和 Gen 3.0(2.01)急剧倒退(平均控制 2.64),且 Gen 2.5 呈现逆缩放(Pro 1.94 比 Flash 1.71 更差),而 Gen 3.0 恢复了标准缩放。第三,我们记录了对齐税:谄媚与真实性之间的 Spearman rho = -0.63,表明社会顺从以事实准确性为代价。自我中心式验证(Egotistical Validation)提示是谄媚陷阱(平均 3.27),几乎是不道德提议(Unethical Proposals,1.72)的两倍。简单防护栏在旗舰模型上优于复杂的协议脚手架,但蒸馏后的 Gen 3.0 Flash 反转了这一点,表明小模型可能在结构上需要思维链脚手架。我们发布了数据集和评分标准以支持连续谄媚测量。

英文摘要

Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations 2.0, 2.5, and 3.0 on 73 adversarial prompts under three guardrail conditions (Control, Simple, Protocol), yielding 8,830 graded responses. Using a 0-4 Likert scale validated against a human annotator triad (Fleiss kappa = 0.71; Cohen kappa = 0.78 vs AI consensus; 95.9 percent binary accuracy, 100 percent specificity), we quantify sycophancy as continuous rather than binary. Three findings emerge. First, 27.2 percent of responses contain substantial sycophantic content (Likert >= 2.0) and 22.7 percent reach moderate or severe levels (>= 3.0), while binary win-rate framing reports only modest failure rates; coarse metrics explain just 29 percent of graded variance. Second, generational progress is non-monotonic: Gen 2.5 regresses sharply (mean Control 2.64) relative to Gen 2.0 (1.90) and Gen 3.0 (2.01), and Gen 2.5 shows inverse scaling (Pro 1.94 worse than Flash 1.71) while Gen 3.0 restores standard scaling. Third, we document an Alignment Tax: Spearman rho = -0.63 between sycophancy and truthfulness, indicating social compliance trades against factual accuracy. Egotistical Validation prompts act as a sycophancy trap (mean 3.27), nearly double Unethical Proposals (1.72). Simple guardrails outperform elaborate Protocol scaffolding on flagship models, but distilled Gen 3.0 Flash inverts this, suggesting small models may structurally require chain-of-thought scaffolding. We release the dataset and rubric to support continuous sycophancy measurement.
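
The abstract above reports Fleiss' kappa for a three-person annotator triad. As a minimal sketch of how that statistic is computed, the function below takes a table of per-item category counts; the ratings are made-up toy data, not the paper's annotations, and the function name is hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j (same number of raters per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all assignments that fall in each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items          # observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data: 3 raters, 4 items, 2 categories (e.g. "sycophantic" vs "not").
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
kappa_perfect = fleiss_kappa(perfect)
kappa_split = fleiss_kappa(split)
print(kappa_perfect, kappa_split)
```

The statistic is undefined when every rating falls in a single category (chance agreement equals 1), so a production version would need to guard that case.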

2606.05181 2026-06-05 cs.CL cs.AI 版本更新

Multi-Granularity Reasoning for Natural Language Inference

自然语言推理的多粒度推理

Chunling Xi, Di Liang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出多粒度推理网络(MGRN),通过分层语义特征交互模拟人类认知过程,在多个基准上超越强基线模型。

详情
AI中文摘要

自然语言推理(NLI)是自然语言理解中的一项基本任务,需要确定前提和假设之间的逻辑关系。尽管基于Transformer的预训练模型取得了显著成功,但大多数现有方法主要依赖最后一层的token表示,这通常不足以捕捉有效推理所需的复杂分层语义交互。特别是,细粒度的词汇线索、短语组合和更高层次的上下文语义通常在单一表示空间中被纠缠或稀释。为了解决这些限制,我们提出了一种新颖的多粒度推理网络(MGRN),它在交互式推理空间中显式利用分层语义特征。所提出的框架模拟了人类语言理解的认知过程,该过程自然地从浅层词汇匹配进展到更深层次的语义抽象和逻辑推理。通过以渐进和结构化的方式整合多个粒度的语义信息,MGRN能够揭示自然语言表达背后的复杂语义关系。在多个公开基准上的大量实验表明,MGRN始终优于强基线模型,验证了所提出方法的有效性和鲁棒性。

英文摘要

Natural Language Inference (NLI) is a fundamental task in natural language understanding that requires determining the logical relationship between a premise and a hypothesis. Despite the remarkable success of transformer-based pre-trained models, most existing approaches primarily rely on the final-layer token representations, which are often insufficient for capturing the complex and hierarchical semantic interactions required for effective reasoning. In particular, fine-grained lexical cues, phrasal compositions, and higher-level contextual semantics are typically entangled or diluted in a single representation space. To address these limitations, we propose a novel Multi-Granularity Reasoning Network (MGRN) that explicitly leverages hierarchical semantic features within an interactive reasoning space. The proposed framework mimics the human cognitive process of language understanding, which naturally progresses from shallow lexical matching to deeper semantic abstraction and logical reasoning. By integrating semantic information across multiple granularities in a progressive and structured manner, MGRN is able to uncover intricate semantic relationships underlying natural language expressions. Extensive experiments on multiple public benchmarks demonstrate that MGRN consistently outperforms strong baseline models, validating the effectiveness and robustness of the proposed approach.

2606.05180 2026-06-05 cs.CL cs.AI 版本更新

From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment

从评分到解释:评估基于量规的教学质量评估中的SHAP和LLM理由

Ivo Bueno, Babette Bühler, Philipp Stark, Tim Fütterer, Ulrich Trautwein, Dorottya Demszky, Heather Hill, Enkelejda Kasneci

发表机构 * Technical University of Munich(慕尼黑工业大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Lund University(隆德大学) University of Tübingen(图宾根大学) Stanford Graduate School of Education(斯坦福大学教育研究生院) Harvard Graduate School of Education(哈佛大学教育研究生院)

AI总结 提出一个结合SHAP和LLM理由的框架,用于基于量规的评分模型的可解释性,并在课堂转录数据上评估其忠实性和可迁移性。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

自动化评分模型越来越多地被用于为复杂的语言表现(包括课堂转录)分配基于量规的质量评级,但它们通常很少提供关于为什么产生特定分数的见解。我们提出了一个通用的框架,用于基于量规的评分的句子级可解释性,该框架将模型无关的Shapley值归因与大型语言模型(LLM)生成的理由相结合。在使用NCTE语料库的CLASS框架的反馈质量维度上实例化,该框架能够系统地比较微调的预训练语言模型(PLM)和提示的LLM在评分性能和解释忠实性方面的表现。在6k个带注释的转录片段中,微调的PLM在预测准确性上优于LLM,但表现出向中等尺度分数的标签压缩。基于删除的测试表明,SHAP识别出可靠驱动模型预测的句子,产生的预测变化通常比LLM生成的理由更大且更连贯。跨模型分析进一步揭示,SHAP归因在不同架构间稳健地迁移,而LLM理由的影响有限且不一致。总体而言,研究结果表明,SHAP为基于量规的评分提供了更忠实和可迁移的解释,并且所提出的框架为在高风险教育环境和其他基于量规的语言评估任务中评估评分模型及其解释提供了原则性基础。

英文摘要

Automated scoring models are increasingly used to assign rubric-based quality ratings to complex language performances, including classroom transcripts, yet they typically provide little insight into why a particular score is produced. We propose a general framework for sentence-level interpretability of rubric-based scoring that combines model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs). Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework enables systematic comparison of fine-tuned pretrained language models (PLMs) and prompted LLMs on both scoring performance and explanation faithfulness. Across 6k annotated transcript segments, fine-tuned PLMs outperform LLMs in prediction accuracy but exhibit label compression toward mid-scale scores. Deletion-based tests show that SHAP identifies sentences that reliably drive model predictions, producing typically larger and more coherent prediction shifts than LLM-generated rationales. Cross-model analyses further reveal that SHAP attributions transfer robustly across architectures, whereas LLM rationales exert limited and inconsistent influence. Overall, the findings demonstrate that SHAP provides more faithful and transferable explanations for rubric-based scoring, and that the proposed framework offers a principled basis for evaluating both scoring models and their explanations in high-stakes educational settings and other rubric-based language assessment tasks.
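
To make the sentence-level Shapley attribution in the abstract above concrete, here is a minimal sketch that computes exact Shapley values by enumerating every subset of sentences. It is a toy illustration, not the paper's pipeline: the scoring function `toy_score` and its weights are hypothetical stand-ins for a trained scoring model, and practical SHAP tooling approximates these values because exact enumeration is exponential in the number of sentences.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, value):
    """Exact Shapley value of each of n sentences, where value(kept) is the
    model score when only the sentences in the frozenset `kept` are present."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Probability weight of a coalition of size k that excludes i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                kept = frozenset(subset)
                phi[i] += weight * (value(kept | {i}) - value(kept))
    return phi


# Hypothetical scorer: per-sentence weights plus a bonus when sentences 0 and 2
# (say, a question and its follow-up feedback) both appear.
WEIGHTS = [0.5, 0.0, 1.5]


def toy_score(kept):
    bonus = 1.0 if {0, 2} <= kept else 0.0
    return sum(WEIGHTS[i] for i in kept) + bonus


phi = shapley_values(3, toy_score)
print(phi)
```

The interaction bonus is split evenly between the two sentences that create it, and the attributions sum to the gap between the full-transcript score and the empty-transcript score. A deletion-based faithfulness test in the spirit of the abstract would remove the highest-attribution sentences and check that the score shifts accordingly.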

2606.05178 2026-06-05 cs.HC cs.AI 版本更新

The Virtual Roundtable: Multi-Agent Personas Simulating the Dynamics of Human Brainstorming

虚拟圆桌会议:模拟人类头脑风暴动态的多智能体角色

Tim Dorn, Saara A. Khan, Julie Mumford

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种多智能体架构,通过发散与收敛两阶段模拟圆桌头脑风暴,利用多样化AI角色和智能引导者产生多样化创意并评估排名,案例研究表明其能产生多样相关创意并深化讨论质量。

Comments 10 pages, 10 figures, 2 tables

详情
AI中文摘要

随着AI驱动产品开发的加速,瓶颈正从如何构建转向构建什么。传统人类头脑风暴面临群体思维、回音室和多样性有限等挑战。为解决这一问题,我们提出了一种多智能体架构,通过两个阶段模拟圆桌头脑风暴:发散思维以产生多样化创意,以及收敛思维以评估和排名最有前景的创意。该系统采用多样化的AI角色参与圆桌讨论,并由一个智能引导者引导讨论走向富有成效的结果。角色在公开评论的同时保持私人想法,创意在讨论过程中有机涌现。每个角色在创意提交和投票上的配额促进了平衡参与,同时产生自然排名。在整个会话过程中,系统跟踪每个创意的谱系,捕捉概念如何随时间起源和交叉传播。我们通过一个为AI智能眼镜生成消费者创意的案例研究来展示该方法,表明:(i) 它产生了多样、相关的创意,并提供了对其演化的洞察;(ii) 角色之间观点的累积交流培养了一个共享语境,逐步深化了讨论质量和产生的创意。

英文摘要

As AI-driven product development accelerates, the bottleneck is shifting from how we build to what we build. Traditional human brainstorming faces challenges including groupthink, echo chambers, and limited diversity. To address this, we present a multi-agentic architecture that simulates roundtable brainstorming through two phases: divergent thinking to generate diverse ideas, and convergent thinking to evaluate and rank the most promising ones. The system employs diverse AI personas that engage in roundtable discussions, guided by an agentic facilitator that steers the discussion toward productive outcomes. Personas maintain private thoughts while commenting publicly, with ideas emerging organically throughout the discussion. Per-persona quotas on idea submissions and votes promote balanced participation while producing natural rankings. Throughout the session, the system tracks each idea's lineage, capturing how concepts originate and cross-pollinate over time. We demonstrate this approach through a case study generating consumer ideas for AI smart glasses, showing (i) it produces diverse, relevant ideas with insights into their evolution; (ii) the cumulative exchange of perspectives across personas cultivates a shared context that progressively deepens the quality of discussion and the ideas produced.

2606.05177 2026-06-05 cs.CL cs.AI eess.AS 版本更新

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

MCBench:面向全能大语言模型的多上下文安全评估基准

Manh Luong, Tamas Abraham, Junae Kim, Amar Kaur, Rollin Omari, Gholamreza Haffari, Trang Vu, Lizhen Qu, Dinh Phung

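Affiliations: Monash University; Defence Science and Technology Group

AI summary: To address the limitation that existing multimodal safety benchmarks handle only visual inputs, proposes MCBench, a benchmark of 1196 scenarios across four safety categories that requires integrating information from multiple modalities for safety assessment, and reveals the weaknesses of current Omni LLMs in cross-modal safety reasoning.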

详情
AI中文摘要

现有的多模态安全基准仅关注视觉输入,无法评估处理视觉、音频和文本的全能大语言模型(LLMs)。我们提出了MCBench,一个包含1196个场景的基准,涵盖四个安全类别,需要整合多种模态以进行准确的安全评估。每个不安全场景都配有一个最小差异的安全对照场景,以评估模型的敏感性。我们对最先进模型的评估揭示了重大挑战。全能大语言模型在处理细微或非物理风险时表现不佳,但在存在显著视觉或听觉线索时表现更好。对推理轨迹的分析表明,尽管模型能够提取模态特定信息,但它们往往无法有效整合这些线索进行安全判断。我们的发现揭示了当前全能大语言模型在安全关键场景中缺乏稳健的跨模态推理能力,强调了改进多模态安全架构和训练策略的必要性。

英文摘要

Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.

2606.05176 2026-06-05 cs.CL cs.AI 版本更新

PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis

面向电信客户支持的SLM的PEFT:LoRA配置与能耗分析的比较研究

Lucas Tamic, Ilan Jaffeux-Cheniout, Xavier Marjou

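Affiliations: Orange

AI summary: Systematically compares parameter-efficient fine-tuning of Qwen2.5-3B under different LoRA configurations, combining energy consumption analysis with an LLM-as-a-judge framework; finds that the configuration with the lowest validation loss does not necessarily obtain the best qualitative ranking, and proposes a combinatorial synthetic data generation approach.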

详情
AI中文摘要

尽管大型语言模型(LLM)在自然语言理解和生成方面表现出色,但它们在电信客户支持领域特定约束下的评估和适应性仍然有限。此外,数据主权、监管约束以及敏感客户和网络信息的处理使得在该领域使用外部托管的基础模型变得复杂。我们提出了一项系统的参数高效微调(PEFT)研究,使用低秩适配(LoRA)应用于Qwen2.5-3B,以构建特定领域的对话助手。我们引入了一种基于52个行业特定术语词汇表的组合式合成数据生成方法,通过由Gemini 2.0 Flash驱动的生成流水线,产生了约30,000个训练样本,涵盖1,560个不同的问题场景。我们通过改变超参数和目标模块评估了16种LoRA配置。我们的评估超越了标准指标,结合了能耗分析以及使用GPT-5.2和Claude 4.5 Sonnet的LLM-as-a-judge框架的定性评估。结果显示定量和定性性能之间存在明显分歧:达到最低验证损失的模型不一定获得最佳的人类对齐排名。最佳验证损失(0.5024)在定性评估中仅排名第6-7位,而最差损失(0.6807)根据两位评判者均排名第一。本工作的贡献包括:(1)一种用于合成数据集构建的组合方法,(2)关于目标模块选择对LoRA注入影响的见解,(3)证明在对话式AI中仅凭验证损失不足以选择微调配置的证据,以及(4)用于可持续LLM部署的能耗-性能权衡分析。

英文摘要

While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In addition, data sovereignty, regulatory constraints, and the handling of sensitive customer and network information complicate the use of externally hosted foundation models in this domain. We present a systematic study of parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) applied to Qwen2.5-3B to build a domain-specific conversational assistant. We introduce a combinatorial synthetic data generation approach based on a glossary of 52 industry-specific terms, producing approximately 30,000 training examples across 1,560 distinct problem scenarios via a generative pipeline powered by Gemini 2.0 Flash. We evaluate 16 LoRA configurations by varying hyperparameters and target modules. Our evaluation extends beyond standard metrics by incorporating energy consumption analysis and qualitative assessment using an LLM-as-a-judge framework with GPT-5.2 and Claude 4.5 Sonnet. Results show a clear divergence between quantitative and qualitative performance: models achieving the lowest validation loss do not necessarily obtain the best human-aligned rankings. The best validation loss (0.5024) ranks only 6th-7th in qualitative evaluation, while the worst loss (0.6807) ranks first according to both judges. This work contributes (1) a combinatorial method for synthetic dataset construction, (2) insights into the impact of target module selection for LoRA injection, (3) evidence that validation loss alone is insufficient for selecting fine-tuning configurations in conversational AI, and (4) an energy-performance trade-off analysis for sustainable LLM deployment.
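To make the effect of "varying hyperparameters and target modules" concrete, the following is a minimal sketch of how the number of trainable LoRA parameters scales with rank and with the set of adapted modules. It relies only on the standard LoRA factorization (each adapted weight of shape d_out x d_in gains two low-rank matrices). The module shapes, layer count, and names below are hypothetical placeholders, not the actual Qwen2.5-3B dimensions or any of the 16 configurations in the study.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def total_lora_params(module_shapes, target_modules, rank, n_layers):
    """Sum adapter parameters over the targeted modules in every layer."""
    per_layer = sum(
        lora_params(*module_shapes[name], rank) for name in target_modules
    )
    return per_layer * n_layers


# Hypothetical (d_in, d_out) shapes for illustration only.
SHAPES = {
    "q_proj": (1024, 1024),
    "k_proj": (1024, 256),
    "v_proj": (1024, 256),
    "o_proj": (1024, 1024),
}

attention_qv = total_lora_params(SHAPES, ["q_proj", "v_proj"], rank=8, n_layers=4)
attention_all = total_lora_params(SHAPES, list(SHAPES), rank=8, n_layers=4)
print(attention_qv, attention_all)
```

Under these toy shapes, adapting all four attention projections doubles the trainable parameter count compared with adapting only the query and value projections, which is one reason target module selection interacts with both quality and energy cost.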

2606.05174 2026-06-05 cs.CL cs.AI 版本更新

Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO

通过基于方差感知的评分规则奖励与GRPO改进LLMs中心脏医学问答

Arash Ahmadi, Parisa Masnadi, Sarah Sharif, Charles Nicholson, David Ebert, Mike Banad

Affiliations: School of Electrical and Computer Engineering, University of Oklahoma, Norman, OK, USA; Intelligent Neuromorphic and Quantum Understanding for Innovative Research and Engineering (INQUIRE) Laboratory, University of Oklahoma, Norman, OK, USA; Khiabani Data Science and Analytics Institute, University of Oklahoma, Norman, OK, USA; Data Institute for Societal Challenges (DISC), University of Oklahoma, Norman, OK, USA; School of Industrial and Systems Engineering, University of Oklahoma, Norman, OK, USA; Office of Responsible Artificial Intelligence (ORAI), University of Arizona, Tucson, AZ, USA

AI summary: Proposes a variance-aware reward framework that combines GRPO with rubrics from RaR-Medicine, replacing discrete aggregation with continuous analytical reward functions to improve the accuracy and F1 of LLMs on heart-focused medical question answering.

Comments 27 Pages

详情
AI中文摘要

大型语言模型(LLMs)在医疗应用中展现出巨大潜力。然而,由于数据隐私限制、推理成本以及边缘或设备端适用性有限,通用模型在实际场景中的部署仍然困难。这些挑战促使开发更小、更高效的模型,这些模型需要稳健的后训练策略以确保可靠的医学推理。在这项工作中,我们研究了基于RaR-Medicine的评分规则监督,使用组相对策略优化(GRPO)对LLMs进行心脏医学问答的后训练。我们提出了一种方差感知奖励框架,该框架扩展了评分规则作为奖励的显式聚合和隐式聚合策略,将加权二元标准聚合和单一整体Likert式评分替换为从标准级评分结果导出的连续分析奖励函数。这种公式为稀疏、多标准且难以自动验证的反馈提供了更丰富的优化信号,并实现了更稳定的在线策略强化学习。在HealthBench保留的心脏相关子集上,与Qwen3-14B基础模型相比,我们最佳的GRPO变体将准确率从0.362提高到0.502,F1从0.532提高到0.668,同时与GPT-OSS-120B(准确率0.508,F1 0.674)保持竞争力。我们的研究结果表明,精心设计的基于评分规则的奖励为改进LLMs中心脏医学问答提供了一种实用策略,并有可能扩展到其他基于评分规则的任务。

英文摘要

Large Language Models (LLMs) have shown strong promise in healthcare applications. Yet deploying general-purpose models in real-world settings remains difficult due to data privacy constraints, inference costs, and limited suitability for edge or on-device use. These challenges motivate the development of smaller, more efficient models that require robust post-training strategies to ensure reliable medical reasoning. In this work, we investigate Group Relative Policy Optimization (GRPO) for post-training LLMs on heart-focused medical question answering with rubric-based supervision derived from RaR-Medicine. We propose a Variance-Aware Reward Framework that extends the Explicit Aggregation and Implicit Aggregation strategies of Rubrics as Rewards by replacing weighted binary criterion aggregation and single overall Likert-style scoring with continuous analytical reward functions derived from criterion-level rubric outcomes. This formulation provides richer optimization signals for feedback that is sparse, multi-criteria, and difficult to verify automatically, and enables more stable on-policy reinforcement learning. On a held-out heart-related subset of HealthBench, our best GRPO variant improves accuracy from 0.362 to 0.502 and F1 from 0.532 to 0.668 relative to the Qwen3-14B base model, while remaining competitive with GPT-OSS-120B (0.508 accuracy, 0.674 F1). Our findings show that carefully designed rubric-based rewards provide a practical strategy for improving heart-focused medical question answering in LLMs, with potential to extend to other rubric-based tasks.
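The two ingredients the abstract builds on, weighted binary criterion aggregation and GRPO's group-relative advantage, can be sketched as follows. This is not the paper's variance-aware reward, whose form the abstract does not give. The function names, the use of the population standard deviation, and the `eps` term are assumptions. The sketch shows one plausible motivation for continuous rewards: when every rollout in a group receives the same coarse reward, the group-relative advantages are all zero and the group contributes no learning signal.

```python
import statistics


def explicit_aggregate(weights, passed):
    """Weighted share of rubric criteria that a response satisfies."""
    total = sum(weights)
    return sum(w for w, ok in zip(weights, passed) if ok) / total


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rubric with three criteria of weights 2, 1, 1.
reward = explicit_aggregate([2, 1, 1], [True, False, True])

# A group in which every rollout earns the same reward gives zero advantages.
flat_group = group_relative_advantages([reward, reward, reward])
mixed_group = group_relative_advantages([1.0, 0.0])
```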

2606.05173 2026-06-05 cs.CL cs.AI 版本更新

Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning

预测与重构:自监督语言表示学习的联合目标

Aimen Boukhari

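Affiliations: École Nationale Supérieure d’Informatique (ESI), Algeria

AI summary: Proposes a hybrid pre-training objective that combines a JEPA-style latent-space prediction loss with the MLM objective, balanced by a learnable scalar; representation analysis on GLUE benchmarks shows that the hybrid encoder yields more uniform embeddings, richer spectral geometry, and a better semantic-to-lexical balance.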

Comments 12 pages, 10 figures, 11 tables. Preprint. Code available at: https://github.com/aymen-000/predict-reconstruct-language-models

详情
AI中文摘要

掩码语言建模(MLM)自BERT以来一直是文本编码器的主导预训练目标,但它鼓励的表示强烈锚定于表层形式的词元身份,而非更深层的语义结构。受联合嵌入预测架构(JEPA)(LeCun, 2022)在视觉和音频中的成功启发,我们提出一种混合预训练目标,该目标在单个共享编码器上结合了JEPA风格的潜空间预测损失与标准MLM目标。一个可学习的标量参数在训练过程中持续平衡这两个目标。我们在英文维基百科上使用相同的架构和计算预算(NVIDIA H100)预训练了一个混合模型和一个纯MLM基线。通过四种池化策略在五个GLUE基准(SST-2、MRPC、MNLI、CoLA、STS-B)上进行广泛的表示分析,结果显示混合编码器产生了显著更均匀的嵌入(均匀性小于-0.16,而MLM为-0.05),在最大池化下表现出更丰富的谱几何,编码了更少的表层词汇信息,并实现了更好的语义-词汇平衡。尽管线性探测的下游准确率相似,但几何差异一致且显著,表明JEPA预测目标重塑了潜空间,而标准准确率指标无法单独捕捉这一点。

英文摘要

Masked language modelling (MLM) has been the dominant pre-training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface-form token identity rather than deeper semantic structure. Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, 2022) in vision and audio, we propose a hybrid pre-training objective that combines a JEPA-style latent-space prediction loss with a standard MLM objective over a single shared encoder. A learnable scalar parameter continuously balances the two objectives during training. We pre-train both a hybrid model and a pure-MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100). Extensive representation analysis across five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity less than -0.16 vs -0.05 for MLM), exhibits richer spectral geometry under max pooling, encodes less surface-level lexical information, and achieves a better semantic-to-lexical balance. Despite similar linear-probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.
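The abstract reports uniformity values without defining the metric. A common definition, assumed here, is the log of the mean Gaussian potential between pairs of L2-normalized embeddings with temperature t = 2, where lower values mean the embeddings are spread more evenly over the unit sphere. The sketch below implements that assumed definition in plain Python; the paper may use a different temperature or pooling.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def uniformity(embeddings, t=2.0):
    """Log mean of exp(-t * squared distance) over all distinct pairs."""
    points = [normalize(v) for v in embeddings]
    potentials = []
    for a, b in combinations(points, 2):
        sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
        potentials.append(math.exp(-t * sq_dist))
    return math.log(sum(potentials) / len(potentials))


# Toy 2-D embeddings: a collapsed pair versus four evenly spread directions.
collapsed = uniformity([[1.0, 0.0], [2.0, 0.0]])
spread = uniformity([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
```

With this definition a fully collapsed set scores 0 and better-spread sets score below 0, which matches the direction of the comparison in the abstract (a more negative value for the hybrid encoder).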

2606.05168 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

模型崩溃的流行病学:通过双层SIR动力学建模合成数据污染

Xiangyu Wang

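Affiliations: Xiangyu Wang

AI summary: Proposes a bilayer coupled SIR/SIRS framework that treats data corpora and AI models as two interacting populations, models the model collapse caused by synthetic data contamination through cross-layer transmission, and derives the basic reproduction number $R_0$; experiments are qualitatively consistent with the threshold dynamics, and intervention analysis identifies detection-based filtering and herd immunity as the highest-leverage strategies.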

Comments 24 pages, 15 figures

详情
AI中文摘要

在合成数据上训练会导致模型崩溃,但现有分析将其视为单链退化。实际上,AI生态系统涉及交叉污染:模型从其他模型摄取合成数据,产生新的合成文本,并污染共享语料库。我们提出了一个双层耦合SIR/SIRS框架——一个现象学平均场模型,将数据语料库和AI模型视为两个相互作用的群体,每个群体具有易感、感染和恢复三个仓室,并通过跨层传播连接。SIRS变体(我们的主要推荐)包含了免疫衰减,反映了过滤后的语料库和重新训练的模型仍然容易再次污染。我们通过下一代矩阵推导出基本再生数$R_0 = \sqrt{β_D β_M / [(γ_D+μ_D)(γ_M+μ_M)]}$,并将标准流行病阈值结果应用于双层系统。基于公开AI文本流行数据的说明性情景校准在三种情景下均产生超临界动力学($R_0 > 1$);Sobol敏感性分析将合成文本检测识别为最高杠杆参数。一个二分网络基于智能体的模型在密集网络上确认了平均场一致性($R^2 > 0.96$),但在异质性下退化。GPT-2污染链实验(在WikiText和Shakespeare上共192次运行)显示了剂量-反应退化和多样性损失,定性上与阈值图像一致。匹配预算的源多样性实验(1,088次运行)提供了提示性证据,表明多源混合适度减轻了崩溃,但该效应在较低污染分数下消失。干预分析将基于检测的过滤和群体免疫识别为最高杠杆策略。

英文摘要

Training on synthetic data causes model collapse, but existing analyses treat this as single-chain degradation. In reality, the AI ecosystem involves cross-contamination: models ingest synthetic data from other models, produce new synthetic text, and contaminate shared corpora. We propose a bilayer coupled SIR/SIRS framework -- a phenomenological mean-field model treating data corpora and AI models as two interacting populations, each with susceptible, infected, and recovered compartments linked by cross-layer transmission. The SIRS variant (our primary recommendation) incorporates immunity waning, reflecting that filtered corpora and retrained models remain susceptible to re-contamination. We derive the basic reproduction number $R_0 = \sqrt{β_D β_M / [(γ_D+μ_D)(γ_M+μ_M)]}$ via the Next Generation Matrix and apply standard epidemic threshold results to the bilayer system. Illustrative scenario-based calibration from public AI text prevalence data yields supercritical dynamics ($R_0 > 1$) across three scenarios; Sobol sensitivity analysis identifies synthetic-text detection as the highest-leverage parameter. A bipartite-network agent-based model confirms mean-field consistency ($R^2 > 0.96$) for dense networks but degrades under heterogeneity. GPT-2 contamination chain experiments (192 runs across WikiText and Shakespeare) show dose-response degradation and diversity loss qualitatively consistent with the threshold picture. Matched-budget source-diversity experiments (1,088 runs) provide suggestive evidence that multi-source mixing modestly attenuates collapse, but the effect vanishes at lower contamination fractions. Intervention analysis identifies detection-based filtering and herd immunity as the highest-leverage strategies.
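The threshold condition can be checked numerically from the formula in the abstract. In the sketch below the symbol names follow the abstract, with subscript D for the data layer and M for the model layer. Reading β as a cross-layer transmission rate, γ as a recovery rate, and μ as a turnover rate is an assumption based on standard SIR conventions. All numeric rates are hypothetical and are not the paper's calibrated scenarios.

```python
import math


def basic_reproduction_number(beta_d, beta_m, gamma_d, gamma_m, mu_d, mu_m):
    """R_0 = sqrt(beta_D * beta_M / ((gamma_D + mu_D) * (gamma_M + mu_M)))."""
    return math.sqrt(
        (beta_d * beta_m) / ((gamma_d + mu_d) * (gamma_m + mu_m))
    )


def is_supercritical(r0):
    """Contamination is expected to spread when R_0 exceeds 1."""
    return r0 > 1.0


# Hypothetical rates chosen only to land on either side of the threshold.
r0_high = basic_reproduction_number(0.6, 0.5, 0.1, 0.1, 0.05, 0.05)
r0_low = basic_reproduction_number(0.1, 0.1, 0.2, 0.2, 0.05, 0.05)
print(is_supercritical(r0_high), is_supercritical(r0_low))
```

Because the two layers enter as a product under the square root, raising the removal rate in either layer alone (for example through better synthetic-text detection) lowers $R_0$ for the whole system.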

2606.05167 2026-06-05 cs.MA cs.AI 版本更新

RAINO: Anchoring Agents in Reality, A Systematic Review and Conceptual Framework for Realism in Agent-Based Modelling

RAINO:将智能体锚定于现实——基于智能体建模中现实主义的系统综述与概念框架

Loïs Vanhée, Melania Borit

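Affiliations: Umeå Universitet; UiT The Arctic University of Norway

AI summary: Through a systematic literature review, identifies shortcomings in how realism is operationalized and demonstrated in agent-based modelling, and proposes the RAINO framework (Reality Anchor, Input, Output) to unify and broaden the understanding of realism.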

Comments The paper has been accepted in the Social Simulation Conference 2025

详情
AI中文摘要

现实主义是基于智能体建模中一个核心但似乎理论化不足的概念。本文呈现了一项系统文献综述,旨在识别现实主义目前如何被操作化和展示。结果表明,现实主义往往定义模糊,缺乏一致的概念框架。虽然使用了多种方法来实现和展示现实主义,但对这些方法是否以及为何适用于其预期目的的解释通常有限。基于此综述,我们引入了现实锚点、输入、输出(RAINO)框架。RAINO识别了用于论证基于智能体模型中现实主义的关键结构,包括现实锚点(例如,经验数据、形式理论、专家知识、常识期望)及其作为模型输入或输出的应用。RAINO拓宽了现有关于现实主义如何被框架化的视角。它解释了为什么不同的评估者可能以不同方式评估模型的现实主义,并展示了这种更广泛的框架如何导致显著不同的模型开发方法。

英文摘要

Realism is a central yet seemingly under-theorized concept in Agent-Based Modelling. This paper presents a Systematic Literature Review, aiming to identify how realism is currently operationalized and demonstrated. The results show that realism is often poorly defined and lacks a consistent conceptual framework. A wide variety of methods are used to achieve and demonstrate realism, but explanations of whether and why these methods are appropriate for their intended purposes are generally limited. Building on this review, we introduce the Reality Anchor, Input, Output (RAINO) framework. RAINO identifies the key structures used to argue for realism in Agent-Based Models, consisting of Reality Anchors (e.g., empirical data, formal theory, expert knowledge, common-sense expectations) and their application as model Input or Output. RAINO broadens existing perspectives on how realism is framed. It explains why different assessors may evaluate the realism of a model in different ways, and it shows how this broader framing can lead to significantly different approaches to model development.

2605.04733 2026-06-05 cs.AI 版本更新

Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

面向沉浸式视频角色扮演的奖励分解强化学习

Miao Wang, Yuling Shi, Yijiang Li, Yeheng Chen, Xiaodong Gu, Bin Li, Bo Gao, Jun Wang, Zengxin Han, Jingtong Wu, Yaduan Ruan

Affiliations: Nanjing University; Shanghai Jiao Tong University; University of California, San Diego; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Information Engineering, Beijing Institute of Graphic Communication; Ant International, Ant Group; Independent Researcher

AI summary: Proposes the EBM-RL framework, which uses reward-decomposed reinforcement learning to optimize visual perception, reasoning, and utterance generation in video role-playing, improving scene consistency and character authenticity.

详情
AI中文摘要

基于文本的角色扮演模型可以模仿角色风格,但通常难以捕捉场景氛围和不断变化的紧张感,而这些对于VR游戏和互动叙事等沉浸式应用至关重要。我们研究视频驱动的角色扮演对话,并引入EBM-RL(眼-脑-口强化学习),一种解耦的GRPO框架,将观察(<perception>)、推理(<think>)和话语生成(<answer>)分离。该设计模仿人类的“看-思-说”过程,使模型在推理和响应生成之前能够基于视觉感知进行对话。为了优化这一“看-思-说”过程,EBM-RL集成了针对场景-文本对齐、感知-认知效用、答案忠实度和格式一致性的互补奖励。大量实验表明,在我们的沉浸式角色扮演基准测试中,EBM-RL显著优于纯文本角色扮演基线和更大规模的视觉语言模型,提高了视觉-氛围一致性和角色真实性。此外,EBM-RL在无需额外微调的情况下,展现出对域外VideoQA基准的强零样本迁移能力。我们还发布了一个用于视频驱动角色扮演对话的开源数据集。

英文摘要

Text-based role-playing models can imitate character styles, but often fail to capture scene atmosphere and evolving tension, which are crucial for immersive applications such as VR games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye-Brain-Mouth Reinforcement Learning), a decoupled GRPO-based framework that separates observation (<perception>), reasoning (<think>), and utterance generation (<answer>). This design mimics the human See-Think-Speak process, enabling the model to ground dialogue in visual perception before reasoning and response generation. To optimize this See-Think-Speak process, EBM-RL integrates complementary rewards for scene-text alignment, perceptual-cognitive utility, answer faithfulness, and format consistency. Extensive experiments show that EBM-RL substantially outperforms text-only role-playing baselines and larger-scale vision-language models on our immersive role-playing benchmark, improving both visual-atmosphere consistency and character authenticity. Moreover, EBM-RL demonstrates strong zero-shot transfer to out-of-domain VideoQA benchmarks without additional fine-tuning. We also release an open-source dataset for video-grounded role-playing dialogue.
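Of the four rewards listed, format consistency is the easiest to illustrate. The sketch below is a hypothetical check, not the paper's reward: it assumes the three tagged sections must appear exactly once, in the See-Think-Speak order, with nothing but whitespace around them, and it returns a binary score. The function names and the strictness of the pattern are assumptions.

```python
import re

# Sections must appear in the order perception, think, answer.
FORMAT_PATTERN = re.compile(
    r"\s*<perception>(.+?)</perception>"
    r"\s*<think>(.+?)</think>"
    r"\s*<answer>(.+?)</answer>\s*",
    re.DOTALL,
)


def format_reward(text):
    """Return 1.0 if the whole output follows the tagged layout, else 0.0."""
    return 1.0 if FORMAT_PATTERN.fullmatch(text) else 0.0


def split_sections(text):
    """Extract the three sections, or None when the layout is violated."""
    match = FORMAT_PATTERN.fullmatch(text)
    if match is None:
        return None
    return {
        "perception": match.group(1).strip(),
        "think": match.group(2).strip(),
        "answer": match.group(3).strip(),
    }


sample = (
    "<perception>Rain on the window, dim light.</perception>\n"
    "<think>The scene is tense, so keep the reply short.</think>\n"
    "<answer>We should not stay here.</answer>"
)
print(format_reward(sample))
```

Splitting the output this way is also what would let separate rewards score the perception, reasoning, and answer segments independently.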

2606.05104 2026-06-05 cs.AI 版本更新

Knowledge Index of Noah's Ark

诺亚方舟的知识索引

Sheng Jin, Minghao Liu, Yunze Xiao, Zeqi Zhou, Heli Qi, Yifan Yao, Meishu Song, Kaijing Ma, Xuan Zhang, Sicong Jiang, Yizhe Li, Ningshan Ma, Jie Wei, Ziniu Li, Minglai Yang, Bangya Liu, Yiming Liang, Xiao Fang, Qingcheng Zeng, Jiarui Liu, Rui Yang, Shen Yan, Wenhao Huang, Jiaheng Liu, Zihan Wang, Weihao Xuan, Ge Zhang

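Affiliations: M-A-P; Carnegie Mellon University; Brown University; Waseda University; The University of Tokyo; Massachusetts Institute of Technology; University of Arizona; Northwestern University; Duke-NUS Medical School

AI summary: To address problems of representativeness, annotation quality, and ranking stability in LLM knowledge benchmarks, proposes the KINA benchmark, which operationalizes disciplinary representativeness through a proxy objective with a (1-1/e) greedy approximation guarantee and proves that a bonus tournament mechanism weakly dominates flat payment; experiments show that top models remain far from saturation.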

详情
AI中文摘要

LLM的知识基准面临三个问题:扩展驱动的设计未能实现学科代表性;固定支付注释允许懒惰共识;在有限测试预算下排名稳定性未经审计。我们引入KINA,一个涵盖261个细粒度学科的899项基准,并有两个形式化结果。首先,我们将代表性视为对专家引出的锚点的覆盖目标,并通过代理实现学科代表性,得到(1-1/e)贪婪近似(命题1);该保证适用于代理,而非总体代表性。其次,我们证明在发布-评审质量方面,奖金锦标赛弱FOSD支配固定支付,激励相容阈值为B > ΔC / Δp_min(定理1)。评估来自13个实验室的42个模型,最佳模型Gemini-3.1-Pro-Preview达到53.17%,其次是Claude-Opus-4.6的49.92%和GPT-5.4的48.55%,远未饱和。完整排行榜显示分层结构而非平滑全序:小型前沿层高于48%,密集的强模型层约38-45%,低性能模型仅略高于10%随机基线。工具增强在五个工具使用评估中最多增加5.17分,不同模型增益差异显著。我们报告自举排名稳定性统计,以明确有限预算方差并防止过度解释相邻排名。

英文摘要

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > Delta C / Delta p_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.
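The (1-1/e) factor is the classical guarantee for greedy maximization of a monotone submodular objective under a cardinality budget. A minimal sketch of greedy selection for plain set coverage follows. Treating the paper's proxy as unweighted coverage of anchors is an assumption, and the item identifiers and anchor labels are hypothetical.

```python
def greedy_cover(candidates, budget):
    """Pick up to `budget` items, each time taking the largest marginal gain.

    candidates maps an item id to the set of anchors that item covers.
    Returns the chosen ids in selection order and the covered anchors.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    for _ in range(budget):
        best = max(
            remaining,
            key=lambda item: len(remaining[item] - covered),
            default=None,
        )
        # Stop early when no remaining item adds a new anchor.
        if best is None or not (remaining[best] - covered):
            break
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical question pool with the anchors each question touches.
pool = {
    "q1": {"a", "b", "c"},
    "q2": {"c", "d"},
    "q3": {"d", "e", "f", "g"},
    "q4": {"a"},
}
chosen, covered = greedy_cover(pool, budget=2)
print(chosen, sorted(covered))
```

As the abstract itself stresses, a guarantee of this kind applies to the proxy objective being maximized, not to representativeness of the underlying population of disciplines.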

2606.04708 2026-06-05 cs.RO cs.AI 版本更新

VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

VISTA: 基于视觉和物理验证的UMI数据适配用于VLA训练

Siyuan Yang, Linzheng Guo, Ouyang Lu, Zhaxizhuoma, Daoran Zhang, Xinmiao Wang, Ting Xiao, Fangzheng Yan, Zhijun Chen, Yan Ding, Chao Yu, Chenjia Bai, Xuelong Li

Affiliations: Institute of AI (TeleAI), China Telecom; Lumos Robotics; University of Science and Technology of China; Northwestern Polytechnical University; Shanghai Jiao Tong University; East China University of Science and Technology; Harbin Engineering University; Fudan University

AI summary: Proposes the VISTA framework, which addresses the visual distribution shift and physically infeasible actions that arise when training VLA models on UMI data, using the UMI-VQA dataset to align visual representations, a physical-validation pipeline to screen feasible trajectories, and two-stage co-training.

Comments Corrected the typing error

详情
AI中文摘要

通用操作接口(UMI)实现了无需特定硬件遥操作的可扩展真实世界机器人数据收集,但利用UMI数据训练大规模视觉-语言-动作(VLA)模型仍然面临根本性挑战。我们识别出两个关键不匹配:腕部安装的鱼眼视图具有严重的径向畸变和以夹爪为中心的局部视角,对于预训练VLM而言是分布外数据;人类收集的轨迹经常违反运动学限制、发生碰撞或超出控制器带宽,导致VLA策略学习到物理上不可行的动作。为解决这些挑战,我们提出了VISTA框架,通过三个协同组件弥合这一双重差距。(i) UMI-VQA,首个专门针对腕部鱼眼观测的大规模VQA数据集,通过辅助视觉-语言监督将VLM表示对齐到畸变视觉领域。(ii) 系统性的物理验证流水线,在训练前进行数据完整性预检查,并对每条有效轨迹的轨迹连续性、自碰撞风险和执行保真度进行评分。(iii) 两阶段联合训练方案,在UMI-VQA上联合学习视觉-语言基础,并在验证轨迹上学习动作预测。我们的实验经验表明,引入UMI-VQA能持续提升下游策略性能,且物理验证分数对部署成功具有强预测性。在多种仿真和真实世界操作任务中,VISTA显著优于包括$π_{0.5}$、LingBot-VLA和Wall-X在内的强基线。我们向社区发布了物理验证流水线、UMI-VQA、验证轨迹数据和预训练模型。

英文摘要

Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories frequently violate kinematic limits, incur collisions, or exceed controller bandwidth, teaching VLA policies physically infeasible actions. To address the challenges, we present VISTA, a framework that bridges this dual gap through three synergistic components. (i) UMI-VQA, the first large-scale VQA dataset tailored to wrist-mounted fisheye observations, aligns VLM representations to the distorted visual regime via auxiliary vision-language supervision. (ii) A systematic physical-validation pipeline performs a data-completeness pre-check and scores each valid trajectory for trajectory continuity, self-collision risk, and execution fidelity before it enters training. (iii) A two-stage co-training recipe jointly learns vision-language grounding on UMI-VQA and action prediction on validated trajectories. Our experiments empirically show that incorporating UMI-VQA consistently improves downstream policy performance, and that physical-validation scores are strongly predictive of deployment success. On diverse simulation and real-world manipulation tasks, VISTA significantly outperforms strong baselines including $π_{0.5}$, LingBot-VLA, and Wall-X. We release the physical-validation pipeline, UMI-VQA, validated trajectory data, and the pre-trained model for the community.

2606.04672 2026-06-05 cs.LG cs.AI 版本更新

Learning Long Range Spatio-Temporal Representations over Continuous Time Dynamic Graphs with State Space Models

利用状态空间模型学习连续时间动态图中的长程时空表示

Ayushman Raghuvanshi, Thummaluru Siddartha Reddy, Sundeep Prabhakar Chepuri, Mahesh Chandran

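Affiliations: University of Texas at Austin; Indian Institute of Science

AI summary: Proposes a state-space modeling framework for continuous-time dynamic graphs (CTDG-SSM) that propagates long-range spatio-temporal information through a topology-aware higher-order polynomial projection operator (CTT-HiPPO), achieving state-of-the-art performance on dynamic link prediction, node classification, and sequence classification.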

Comments Accepted at ICML 2026

详情
AI中文摘要

连续时间动态图(CTDG)为捕捉演化关系数据中的细粒度时间模式提供了更丰富的框架。长程信息传播是学习表示时的关键挑战,其中需要在长时间跨度上保留和更新信息。现有方法限制模型捕捉一跳或局部时间邻域,无法捕捉多跳或全局结构模式。为解决此问题,我们从第一性原理推导出一个参数高效的连续时间动态图状态空间建模框架(CTDG-SSM)。我们首先引入连续时间拓扑感知高阶多项式投影算子(CTT-HiPPO),这是一种基于记忆的HiPPO新公式,用于联合编码时间动态和图结构。CTT-HiPPO的解通过将经典HiPPO解投影到拉普拉斯矩阵的多项式上获得,产生拓扑感知的记忆更新,该更新等价于CTDG的状态空间公式(CTDG-SSM)。然后,使用零阶保持方法获得计算高效的离散公式用于模型实现。在动态链接预测、动态节点分类和序列分类的基准测试中,CTDG-SSM实现了最先进的性能。值得注意的是,在需要长程时间(LRT)和空间推理的数据集上,它取得了较大的性能提升。

英文摘要

Continuous-time dynamic graphs (CTDGs) provide a richer framework to capture fine-grained temporal patterns in evolving relational data. Long-range information propagation is a key challenge while learning representations, wherein it is important to retain and update information over long temporal horizons. Existing approaches restrict models to capture one-hop or local temporal neighborhoods and fail to capture multi-hop or global structural patterns. To mitigate this, we derive a parameter-efficient state-space modeling framework for continuous-time dynamic graphs (CTDG-SSM) from first principles. We first introduce continuous-time Topology-Aware higher order polynomial projection operator (CTT-HiPPO), a novel memory-based reformulation of HiPPO to jointly encode temporal dynamics and graph structure. The solution from CTT-HiPPO is obtained by projecting the classical HiPPO solution through a polynomial of the Laplacian matrix, yielding topology-aware memory updates that admit an equivalent state-space formulation for CTDGs (CTDG-SSM). Then a computationally efficient discrete formulation is obtained using the zero-order hold approach for model implementation. Across benchmarks on dynamic link prediction, dynamic node classification, and sequence classification, CTDG-SSM achieves state-of-the-art performance. Notably, it achieves large performance gains on datasets that require long range temporal (LRT) and spatial reasoning.
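For reference, the zero-order hold step mentioned in the abstract has a standard closed form for a generic linear state-space model. The symbols below ($A$, $B$, $x$, $u$, $\Delta$) are generic placeholders rather than the paper's notation, and the sketch does not include the Laplacian polynomial that makes the CTDG-SSM update topology-aware.

```latex
% Continuous-time model, with the input held constant over each step of length \Delta:
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad u(t) = u_k \ \text{for } t \in [t_k,\, t_k + \Delta)

% Exact discretization under that assumption:
x_{k+1} = \bar{A}\,x_k + \bar{B}\,u_k, \qquad
\bar{A} = e^{\Delta A}, \qquad
\bar{B} = \Big(\int_0^{\Delta} e^{\tau A}\,d\tau\Big) B

% When A is invertible, the integral simplifies to:
\bar{B} = A^{-1}\big(e^{\Delta A} - I\big) B
```

Because $\Delta$ enters the update explicitly, irregular gaps between events in a continuous-time graph can be handled by recomputing $\bar{A}$ and $\bar{B}$ for each gap.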

2606.04560 2026-06-05 cs.LG cs.AI 版本更新

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

基于轨迹级别优势优先经验回放的GRPO

Gyeongtae Yoo, Sanghyeok Park, Soohyuk Jang, Ik-hwan Kim, Sungroh Yoon

Affiliations: Department of Electrical and Computer Engineering, Seoul National University; Interdisciplinary Program in AI, Seoul National University; AIIS, ASRI, INMC, and ISRC, Seoul National University

AI summary: To address the low sample efficiency of GRPO, proposes a rollout-level experience replay buffer that bounds staleness through age eviction, preserves on-policy data through fresh-anchored composition, and prioritizes sampling by advantage magnitude, outperforming GRPO and naive replay baselines on five math benchmarks.

详情
AI中文摘要

基于可验证奖励的GRPO强化学习是后训练推理LLM的标准方法,但样本效率低下。每个轨迹仅用于一次梯度更新后被丢弃。朴素回放在此设置中不适用,因为LLM策略每步梯度变化快,存储的轨迹会变得陈旧并破坏训练稳定性。我们提出一种面向GRPO的轨迹级回放缓冲器,存储和采样单个轨迹而非整组。缓冲器通过年龄驱逐限制陈旧性:任何超过tau_max训练步数的轨迹被移除。缓冲器还通过新鲜锚定组合保留在线策略数据:每个批次保留其新鲜的在线策略轨迹,并拼接从缓冲器中单独抽取的回放轨迹。我们按每个轨迹的优势幅度进行优先回放,并回收优势大的单个轨迹。在三个Qwen3-Base规模、五个数学基准上,我们的方法优于GRPO和朴素回放基线。所有规模均获得正向增益,且随模型增大而增长。最大增益在4B规模上,五个基准平均提升+4.35个百分点。在联合衡量准确率和token效率的AES指标下,与GRPO的效率差距同样在4B最大,为+0.579。

英文摘要

Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded. Naive replay is not well suited in this setting because LLM policies drift quickly per gradient step. Stored rollouts therefore become stale and can destabilize training. We propose a rollout-level replay buffer for GRPO that stores and samples individual rollouts rather than whole groups. The buffer bounds staleness through age eviction. Any rollout older than tau_max training steps is removed. The buffer also preserves on-policy data via fresh-anchored composition. Each batch keeps its fresh on-policy rollouts and then concatenates replay rollouts drawn separately from the buffer. We prioritize replay by per-rollout advantage magnitude and recycle individual rollouts whose advantages are large. Across three Qwen3-Base scales on five math benchmarks, our method outperforms GRPO and naive replay baselines. Gains are positive at every scale and grow with model size. The largest gain is +4.35 pp on the five-benchmark average at 4B. Under an AES metric that jointly measures accuracy and token efficiency, the efficiency margin over GRPO is again largest at 4B, at +0.579.
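The three buffer mechanisms described above (age eviction, fresh-anchored composition, and advantage-prioritized replay) can be sketched in a few lines. This is a simplified illustration, not the authors' implementation. The class and field names are hypothetical, and deterministic top-k selection by absolute advantage stands in for whatever prioritization scheme the paper actually uses.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    tokens: str
    advantage: float
    step: int  # training step at which the rollout was generated


class RolloutReplayBuffer:
    def __init__(self, tau_max):
        self.tau_max = tau_max
        self.items = []

    def add(self, rollouts):
        self.items.extend(rollouts)

    def evict(self, current_step):
        """Age eviction: drop rollouts older than tau_max training steps."""
        self.items = [
            r for r in self.items if current_step - r.step <= self.tau_max
        ]

    def sample(self, k):
        """Return the k stored rollouts with the largest |advantage|."""
        ranked = sorted(self.items, key=lambda r: abs(r.advantage), reverse=True)
        return ranked[:k]


def compose_batch(fresh, buffer, current_step, n_replay):
    """Fresh-anchored composition: keep all fresh rollouts, then append replay."""
    buffer.evict(current_step)
    replay = buffer.sample(n_replay)
    batch = list(fresh) + replay
    buffer.add(fresh)  # fresh rollouts become replay candidates for later steps
    return batch
```

Storing individual rollouts rather than whole groups is what allows a single high-advantage rollout to be reused even when the rest of its group carried little signal.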

2606.04037 2026-06-05 cs.AI cs.LG cs.SE 版本更新

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

面向企业AI代理的部署前保障:基于本体的仿真与信任认证

Thanh Luong Tuan, Abhijit Sanyal

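Affiliations: Golden Gate University; Data, Digital & IT, Novartis Healthcare Pvt. Ltd.

AI summary: Proposes an ontology-grounded verification framework that uses ontology-driven scenario generation and a Trust Certificate to provide automated pre-deployment verification of regulatory compliance and safety for enterprise AI agents.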

Comments 26 pages, 3 figures. Companion to arXiv:2604.00555. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-6

详情
AI中文摘要

企业人工智能(AI)代理的部署前验证仍然是大型语言模型(LLM)能力基准测试与生产部署之间的关键缺口。一旦代理在生产环境中运行,部署后监控、人在回路控制和提示级护栏提供的保障有限。我们提出了一种基于本体的验证框架,包含三个组件:一个代理操作范围,形式化了跨权限、领域约束、安全属性、治理规则和自主级别的认证空间;一个本体到场景的生成流水线,自动推导出监管、操作和对抗性测试场景;以及一个信任证书,携带机器可验证的证明,并附带分级部署裁决(批准、有条件、拒绝)。在四个受监管行业(金融科技、银行、保险和医疗保健)中进行的受控试点,实例化为美国与越南的五个行业-监管体制单元,生成了1,800个场景,并针对125个主要来源监管要求和25个注入故障进行了评估。基于本体的生成(G4)实现了48.3%的监管覆盖率,而基于角色的基线为33.1%(校正后p=0.0006),并且领域特异性最高(4.77/5.0;p=2e-6)。在Bonferroni校正后,相对于基线和检索增强提示的覆盖率优势不再稳健。跨三个LLM家族(Claude Sonnet 4、Qwen 2.5 72B、Gemma 4 26B;总计5,400个场景)的交叉验证复制了角色与本体模式。结果表明,对于监管密集型领域,基于本体的场景生成可作为基于角色测试套件的可信补充。

英文摘要

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We present an ontology-grounded verification framework -- to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine-verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam (where Vietnam's 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation significantly outperformed the dominant persona-based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e-6); transparently, its advantage over plain and retrieval-augmented prompting did not survive Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The framework offers a reproducible, regulation-grounded route to pre-deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.
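The remark that an advantage "did not survive Bonferroni correction" refers to a standard multiple-comparison adjustment: each raw p-value is multiplied by the number of comparisons and capped at 1. A minimal sketch follows. The raw p-values and the significance level are hypothetical, since the abstract does not report the number of comparisons or the uncorrected values.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def survives(p_values, alpha=0.05):
    """Flag which comparisons remain significant after correction."""
    return [p_adj <= alpha for p_adj in bonferroni(p_values)]


# Hypothetical raw p-values for three baseline comparisons.
raw = [0.01, 0.02, 0.5]
print(bonferroni(raw), survives(raw))
```

In this toy case the second comparison is significant on its own at 0.05 but not after correction, which is the same pattern the abstract reports for the plain and retrieval-augmented prompting baselines.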

2606.04032 2026-06-05 cs.LG cs.AI cs.CL cs.PF 版本更新

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Transformer 需要三个投影吗?QKV 变体的系统研究

Ali Kayyam, Anusha Madan Gopal, M Anthony Lewis

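Affiliations: Ali Kayyam; Anusha Madan Gopal; M Anthony Lewis

AI summary: Systematically studies variants of attention that share the query, key, and value projections; finds that Q-K=V sharing achieves a 50% KV cache reduction in language modeling at a cost of only 3.1% perplexity degradation, and that combining it with head sharing (MQA) reaches a 96.9% cache reduction, supporting on-device inference.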

Comments Accepted at ICML 2026 (PMLR vol. 306). 26 pages, 12 figures, 16 tables. Code: https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections

详情
AI中文摘要

Transformer 已成为各种 AI 任务的标准解决方案,其中查询、键和值(QKV)注意力公式起着核心作用。然而,这三个投影的各自贡献以及省略某些投影的影响仍知之甚少。我们系统评估了三种投影共享约束:a) Q-K=V(共享键-值),b) Q=K-V(共享查询-键),c) Q=K=V(单投影)。后两种变体产生对称注意力图;为了解决这个问题,我们还通过二维位置编码探索了非对称注意力。通过涵盖合成任务、视觉(MNIST、CIFAR、TinyImageNet、异常检测)和语言建模(在 10B 令牌上训练的 300M 和 1.2B 参数模型)的实验,我们发现我们的 Transformer 性能与 QKV Transformer 相当,有时甚至更好。在语言建模中,Q-K=V 投影共享实现了 50% 的 KV 缓存减少,仅导致 3.1% 的困惑度下降。关键的是,投影共享与头共享(GQA/MQA)互补:将 Q-K=V 与 GQA-4 结合可实现 87.5% 的缓存减少,而 Q-K=V + MQA 则达到 96.9%,从而实现了实用的设备端推理。我们表明,Q-K=V 保持了质量,因为键和值可以占据相似的表示空间,并且注意力在低秩机制下运行,而 Q=K-V 则破坏了注意力的方向性。我们的结果系统地将投影共享描述为注意力中权重绑定的一种未被充分探索的实例,具有直接、可量化的推理内存优势,尤其对边缘部署有价值。代码公开于 https://github.com/anushamadan02/Do-Transformers-Need-3-Projections。

英文摘要

Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly detection), and language modeling (300M and 1.2B parameter models trained on 10B tokens), we find that the shared-projection transformers perform on par with, and occasionally better than, the standard QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections.
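
The cache figures quoted in this abstract follow from simple counting, and the short sketch below reproduces them. It assumes a 16-head model and reads GQA-4 as four query heads sharing one key-value head; neither detail is stated in the abstract, and `kv_cache_reduction` is a hypothetical helper, not part of the authors' code.

```python
def kv_cache_reduction(n_heads: int, n_kv_heads: int, share_kv: bool) -> float:
    """Fraction of the KV cache saved relative to standard multi-head attention.

    Baseline: every head caches its own K and its own V (2 * n_heads tensors per token).
    """
    if n_kv_heads < 1 or n_heads % n_kv_heads != 0:
        raise ValueError("n_kv_heads must be positive and divide n_heads")
    # Under Q-K=V a single cached tensor serves as both key and value.
    tensors_per_kv_head = 1 if share_kv else 2
    baseline = 2 * n_heads
    cached = tensors_per_kv_head * n_kv_heads
    return 1.0 - cached / baseline


N_HEADS = 16  # assumption: head count is not given in the abstract

print(kv_cache_reduction(N_HEADS, 16, share_kv=True))  # Q-K=V alone
print(kv_cache_reduction(N_HEADS, 4, share_kv=True))   # Q-K=V + GQA (4 KV heads)
print(kv_cache_reduction(N_HEADS, 1, share_kv=True))   # Q-K=V + MQA (1 KV head)
```

Under these assumptions the three calls give 0.5, 0.875, and 0.96875, which match the 50%, 87.5%, and 96.9% in the abstract after rounding. The two savings multiply because head sharing reduces how many KV heads are cached, while projection sharing halves what each remaining head stores.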

2606.03650 2026-06-05 cs.CL cs.AI 版本更新

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

CoEval: 无标注数据或可信基准下为自定义任务排序语言模型

Alexander Apartsin, Yehudit Aperstein

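Affiliations * Holon Institute of Technology; Afeka Tel Aviv Academic College of Engineering

AI summary Proposes CoEval, a framework in which teacher models generate a contamination-free benchmark and a cross-family panel of judges ranks language models without labeled data or human evaluation; it recovers the ground-truth ranking and tracks correctness at ρ = 0.86.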

Comments 16 pages, 5 images

详情
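AI abstract (translated from Chinese)

Selecting or ranking language models for a specific application is hardest when the application has no task-relevant labeled data and the standard public benchmarks cannot be trusted, because their items may have leaked into pretraining so that scores reflect memorization rather than suitability. We present CoEval, an open-source, reusable framework that closes this gap end to end: starting only from a description of the task or domain, teacher models synthesize a brand-new, attribute-controlled benchmark with no human annotation, and the benchmark is contamination-free because items are regenerated on every run; a cross-family panel of judges then ranks the candidate models with no human scoring. Validated where ground truth exists, CoEval recovers the true model ranking and correlates with ground-truth correctness at ρ = 0.86. Label-free judging needs no human calibration, because panel composition (vendor diversity), not size, drives reliability: a small, carefully chosen cross-family panel is the most reliable, and while a single judge can correlate negatively with ground truth (judge-selection regret of 0.35), an ensemble panel never does. Generated items have zero verbatim 13-gram overlap with five major public benchmarks; the panel cancels verbosity bias and precludes same-family self-preference. A four-task study produced 7,978 evaluations for $5.89. The same declarative pipeline applies to any domain and is cheap enough to rerun on every model release: a label-free, contamination-free leaderboard that any team can regenerate for its own application.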

英文摘要

Selecting a pretrained language model, or evaluating a fine-tuned one, for a specific application is a high-value decision, yet the public benchmarks used to make it are poorly suited: a generic benchmark need not reflect a particular sub-domain or sub-task, and its scores are suspect when its items have leaked into pretraining and are recalled rather than solved. We present CoEval, an open framework that supplies a trustworthy, task-specific signal through ensemble self-evaluation: from a task or domain description, a pool of models rotates through all three roles, teacher, student, and judge, to generate a fresh, contamination-free benchmark, answer it, and score one another, with no human labels or raters. Because every model also answers as a student, the responses are the data that weight each question by its discriminative power and each judge by its consensus with the panel. Where ground truth exists, CoEval recovers the true ranking and tracks objective correctness at ρ = 0.86, and the weighting recovers the gold ranking of thirteen models at Spearman 0.95. Reliability comes from panel composition, not size: this label-free weighting zeroes out broken judges and down-weights saturated questions, so neither distorts the ranking. Generated items show zero verbatim overlap with five public benchmarks, the panel cancels verbosity bias and precludes same-family self-preference, and rankings are domain-specific: three different models top four de-novo domains, so a generic leaderboard misdirects most practitioners. The same pipeline reruns on each model release, giving any team a contamination-free leaderboard for its application.
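
The rank agreement reported in this abstract is a Spearman correlation between a predicted ranking and a gold ranking. As a rough illustration, the sketch below computes Spearman's ρ for score lists without ties; the function names and the five example scores are invented for this example and are not CoEval's API or data.

```python
def ranks(scores):
    """Rank 1 is the highest score. Assumes no tied scores, to keep the sketch short."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(a, b):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical scores for five models: gold accuracy vs. panel-assigned score.
gold_accuracy = [0.91, 0.85, 0.70, 0.62, 0.40]
panel_score = [0.88, 0.90, 0.65, 0.66, 0.30]
print(spearman_rho(gold_accuracy, panel_score))
```

Here the panel swaps two adjacent pairs of models, which gives ρ of about 0.8. A real evaluation would need a tie-aware rank correlation.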

2606.03091 2026-06-05 cs.IR cs.AI 版本更新

BAHSD: Bridging the Long-tail Gap via Adaptive Distillation in Black-box Sequential Recommendation

BAHSD:通过自适应蒸馏弥合黑盒序列推荐中的长尾差距

Xi Zhou, Famin Wu, Mingming Li, Hongyue Zhang, Jiao Dai, Jizhong Han, Tao Guo

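Affiliations * Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; Beijing Institute for General Artificial Intelligence, Beijing, China

AI summary To address the signal heterogeneity that long-tail distributions cause in black-box sequential recommendation, proposes the BAHSD framework, which uses a multi-scale consistency probing mechanism to quantify signal reliability and an adaptive hierarchical objective (dynamic-temperature KL divergence, ranking consistency, and InfoNCE contrastive learning) to mitigate preference solidification and improve noise robustness, with gains of more than 80% on tail users.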

详情
AI中文摘要

序列推荐系统被广泛采用,但通常作为黑盒API部署,这推动了近期对模型提取的兴趣,以在本地复制其能力。然而,长尾分布导致了严重的信号异质性:密集的头部序列触发教师偏好的固化,使提取偏向局部模式,而稀疏的尾部序列产生平坦且嘈杂的预测。现有的一刀切式提取忽略了这种差异,导致噪声过拟合和次优的知识迁移。我们提出BAHSD,一种黑盒自适应蒸馏框架,通过多尺度一致性探测机制隐式量化信号可靠性来处理信号异质性。基于此,设计了自适应分层目标:动态温度KL散度缓解高置信度信号的偏好固化,而排序一致性和InfoNCE对比学习为低置信度信号提供噪声鲁棒的增强。BAHSD持续优于基线,在教师模型上获得高达4.98%的提升,在尾用户上提升80%以上,为高保真黑盒推荐提取提供了一种即插即用的解决方案。

英文摘要

Sequential recommendation systems are widely adopted but often deployed as black-box APIs, which has driven recent interest in model extraction to replicate their capabilities locally. However, the long-tail distribution induces severe signal heterogeneity: dense head sequences trigger the solidification of teacher preference, biasing extraction toward local patterns, while sparse tail sequences yield flat, noisy predictions. Existing one-size-fits-all extraction overlooks this disparity, resulting in noise overfitting and suboptimal knowledge transfer. We propose BAHSD, a black-box adaptive distillation framework that handles signal heterogeneity via a multi-scale consistency probing mechanism to implicitly quantify signal reliability. Based on this, an adaptive hierarchical objective is designed: dynamic-temperature KL divergence mitigates preference solidification for high-confidence signals, while ranking consistency and InfoNCE contrastive learning provide noise-robust enhancement for low-confidence signals. BAHSD consistently outperforms baselines, achieving up to 4.98% gain over the teacher and 80%+ improvement on tail users, offering a plug-and-play solution for high-fidelity black-box recommendation extraction.

2606.03070 2026-06-05 cs.LG cs.AI 版本更新

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

ASymPO: 用于异步大语言模型后训练的非对称尺度策略优化(无需行为信息)

Zehua Liu, Yuxuan Yao, Xiaojin Fu, Tao Zhong, Mingxuan Yuan

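Affiliations * Huawei Technologies

AI summary To address the distribution drift that stale responses cause in asynchronous reinforcement learning, proposes Asymmetric-Scale Policy Optimization (ASymPO), which normalizes each response's token loss to restore zero-sum balance without requiring behavior-policy probabilities.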

Comments incorrect proofs in the paper

详情
AI中文摘要

异步强化学习通过将响应生成与策略优化解耦来提高语言模型后训练的吞吐量,但陈旧响应会引入分布漂移。标准的行为校正方法通过行为策略概率、重要性比率或裁剪来控制这种漂移,这需要在推出和学习系统之间具有令牌对齐、版本化和数值一致的行为对数概率。我们探究是否仅使用当前策略概率就能稳定异步组相对强化学习。我们识别出一种尺度不平衡失败模式:当在当前策略下评估陈旧响应时,正负损失项可能出现在不同的负对数概率尺度上,因此零和优势不再意味着平衡的损失贡献。我们提出非对称尺度策略优化(ASymPO),它通过每个响应的当前平均令牌负对数概率来归一化其令牌损失。ASymPO不需要行为策略概率,恢复了响应级别的零和平衡,并保留了非零的学习信号。我们还引入了固定负缩放基线——缩放策略优化(SPO),并在异步数学推理后训练中评估了这两种仅当前策略的目标函数。

英文摘要

Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected methods control this drift with behavior-policy probabilities, importance ratios, or clipping, which requires token-aligned, versioned, and numerically consistent behavior log-probabilities across rollout and learner systems. We ask whether asynchronous group-relative RL can instead be stabilized using only current-policy probabilities. We identify a scale-imbalance failure mode: when stale responses are evaluated under the current policy, positive and negative loss terms can appear at different negative log-probability scales, so zero-sum advantages no longer imply balanced loss contributions. We propose Asymmetric-Scale Policy Optimization (ASymPO), which normalizes each response's token loss by its current average token negative log-probability. ASymPO requires no behavior-policy probabilities, restores response-level zero-sum balance, and preserves a nonzero learning signal. We also introduce Scaled Policy Optimization (SPO), a fixed negative-scaling baseline, and evaluate both current-policy-only objectives in asynchronous mathematical reasoning post-training.
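
The scale-imbalance failure mode described in this abstract can be seen in a two-response toy case. The sketch below is one reading of the abstract only: the helper names and numbers are invented, and treating the normalizer as a constant during differentiation is an assumption. The listing's comment field also flags incorrect proofs in the paper, so this is purely illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def naive_contributions(advantages, token_nlls):
    """Per-response loss value A_i * mean_t(NLL_{i,t}) under the current policy."""
    return [a * mean(nll) for a, nll in zip(advantages, token_nlls)]


def scale_normalized_contributions(advantages, token_nlls):
    """Divide each response's token losses by that response's own mean token NLL.

    Assumption: in an autodiff framework the divisor would be held constant
    (stop-gradient), so gradients still flow through the numerator.
    """
    contributions = []
    for a, nll in zip(advantages, token_nlls):
        scale = mean(nll)
        contributions.append(a * mean([t / scale for t in nll]))
    return contributions


# Hypothetical group of two responses with zero-sum advantages.
advantages = [1.0, -1.0]
token_nlls = [[0.4, 0.6], [1.5, 2.5]]  # the negative response sits at a larger NLL scale

print(sum(naive_contributions(advantages, token_nlls)))
print(sum(scale_normalized_contributions(advantages, token_nlls)))
```

The naive contributions sum to -1.5 even though the advantages sum to zero, because the negative response is weighted by a larger negative log-probability. After normalization each response contributes its advantage, so the group sum returns to approximately zero.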

2606.02907 2026-06-05 cs.CL cs.AI 版本更新

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

线性探针检测语言模型隐藏状态中的任务格式,而非推理模式

Subramanyam Sahoo, Vinija Jain, Aman Chadha, Divya Chaudhary

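Affiliations * Horizon Research; Meta; Apple; Northeastern University

AI summary Linear-probe experiments show that the apparently separate reasoning modes in LLM hidden states are an artifact of task-format confounds (source, option count, response length) rather than genuine structure in the reasoning computation.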

Comments Accepted in the 6th Workshop on Trustworthy NLP, ACL 2026

详情
AI中文摘要

线性探针广泛用于声称大语言模型(LLM)隐藏状态对不同推理类型学习到不同表示。我们通过在经典三分法基准(LogiQA 2.0(演绎)、ARC-Challenge(归纳)和$\alpha$NLI(溯因))上探测Qwen3-14B来检验这一说法。在40层中的第32层,线性探针达到100%交叉验证准确率,且几何结构良好分离(本征维度:20.6、28.5、33.6;凸包污染$\leq$1.5%)。然而,这种分离完全由格式混淆驱动。对来源身份、选项数和响应长度进行残差化后,准确率降至随机水平。轨迹锚点相似性表明任务间推理大部分共享(42.5%一致性 vs. 33.3%随机),且随机对照因果操控($n=20$)显示几何结构与推理模式之间无功能联系($p=0.286$)。因此,高探针准确率反映的是任务格式而非计算结构,这促使在机制可解释性中常规性地进行格式去混淆。

英文摘要

Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and αNLI (abductive). At layer 32 of 40, linear probes achieve 100% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination ≤1.5%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5% agreement vs. 33.3% chance), and causal steering with random controls (n=20) shows no functional link between geometry and reasoning mode (p=0.286). Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.
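
Residualizing a confound means regressing it out of a feature and keeping only what the confound cannot explain. The sketch below does this for a single feature and a single confound with ordinary least squares; the data are toy values invented for illustration, and the paper's actual procedure (several confounds, high-dimensional hidden states) is not specified in the abstract.

```python
def residualize(feature, confound):
    """Remove the part of `feature` that is linearly predictable from `confound`.

    Fits feature ~ intercept + slope * confound by least squares and returns the residuals.
    """
    n = len(feature)
    mean_c = sum(confound) / n
    mean_f = sum(feature) / n
    var_c = sum((c - mean_c) ** 2 for c in confound)
    if var_c == 0:
        # A constant confound explains nothing beyond the mean.
        return [f - mean_f for f in feature]
    slope = sum((c - mean_c) * (f - mean_f) for c, f in zip(confound, feature)) / var_c
    intercept = mean_f - slope * mean_c
    return [f - (intercept + slope * c) for f, c in zip(feature, confound)]


# Toy case: a hidden-state direction that separates two tasks perfectly,
# but only because the tasks differ in their number of answer options.
option_count = [4, 4, 4, 2, 2, 2]
probe_direction = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

print(residualize(probe_direction, option_count))
```

In this toy case every residual is zero, so a probe trained on the residualized feature has nothing left to separate the tasks with. That is the pattern the abstract reports when accuracy falls to chance.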

2606.02684 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

先过滤,再重加权:重新思考在线策略蒸馏中的优化粒度

Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Hangjie Yuan, Tao Feng

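Affiliations * THU (Tsinghua University); HKUST (Hong Kong University of Science and Technology); BIT (Beijing Institute of Technology); Meituan; ZJU (Zhejiang University)

AI summary For on-policy distillation, proposes FiRe-OPD, which achieves finer-grained optimization through trajectory-level filtering and token-level soft reweighting and outperforms existing methods across several settings.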

详情
AI中文摘要

大型语言模型中的在线策略蒸馏正从全轨迹KL监督转向更具选择性的训练范式。最近的在线策略蒸馏方法越来越关注选择哪些轨迹进行学习、哪些令牌信息量最大以及哪些监督信号最可靠。受此趋势启发,我们重新思考在线策略蒸馏的优化粒度,并提出FiRe-OPD(先过滤,再重加权),该方法在轨迹和令牌两个层面联合调整监督信号。具体来说,FiRe-OPD首先过滤轨迹以移除低质量的采样结果,然后在保留的轨迹内应用软重加权以强调信息丰富的令牌。与硬令牌选择相比,FiRe-OPD利用软加权机制有效减轻信息损失并增强优化稳定性,从而实现更细粒度的在线策略蒸馏优化。我们在强到弱、单教师和多教师设置中验证了FiRe-OPD的有效性,并展示了其相对于近期令牌级在线策略蒸馏方法的优越性(例如,在强到弱设置中AIME 2024上+6.25,在多教师设置中Miner上+18.81)。我们的代码可从此链接获取。

英文摘要

On-policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink the optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both the trajectory and token levels. In detail, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD leverages a soft-weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in the strong-to-weak setting and +18.81 on Miner in the multi-teacher setting). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.
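
The two-level idea in this abstract, dropping weak trajectories and then softly reweighting tokens in the ones that remain, might be sketched as follows. Everything specific here is an assumption made for illustration: the `quality` score, the per-token `token_signals`, the softmax weighting, and the function names are hypothetical and are not taken from the FiRe-OPD code.

```python
import math


def soft_token_weights(token_signals, temperature=1.0):
    """Softmax over per-token signals, rescaled so the weights average to 1.

    Unlike hard token selection, no token receives a weight of exactly zero.
    """
    peak = max(token_signals)
    exps = [math.exp((s - peak) / temperature) for s in token_signals]
    total = sum(exps)
    n = len(token_signals)
    return [n * e / total for e in exps]


def filter_then_reweight(trajectories, min_quality):
    """Mean weighted token loss over trajectories whose quality passes the filter."""
    kept = [t for t in trajectories if t["quality"] >= min_quality]
    losses = []
    for t in kept:
        weights = soft_token_weights(t["token_signals"])
        weighted = sum(w * l for w, l in zip(weights, t["token_losses"]))
        losses.append(weighted / len(weights))
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical rollouts: the second one is filtered out as low quality.
rollouts = [
    {"quality": 0.9, "token_losses": [1.0, 1.0], "token_signals": [0.0, 0.0]},
    {"quality": 0.1, "token_losses": [5.0, 5.0], "token_signals": [0.0, 0.0]},
]
print(filter_then_reweight(rollouts, min_quality=0.5))
```

Rescaling the weights to average 1 keeps the loss magnitude comparable to the unweighted case, which is one plausible way to obtain the stability that the abstract attributes to soft weighting.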

2606.02031 2026-06-05 cs.LG cs.AI cs.CL cs.CV 版本更新

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

OpenWebRL: 揭秘视觉网络代理的在线多轮强化学习

Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, Hao Cheng, Baolin Peng, Huan Zhang, Tong Zhang, Jianfeng Gao

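Affiliations * UIUC (University of Illinois Urbana-Champaign); Microsoft

AI summary Proposes OpenWebRL, a framework that trains visual web agents on real websites with online multi-turn reinforcement learning; its 4B-parameter model sets a new open-source state of the art on live-web benchmarks and is competitive with closed-source systems.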

Comments 36 pages, 11 figures

详情
AI中文摘要

构建强大的视觉网络代理需要长程推理、精确定位以及与动态真实网站的稳健交互。尽管进展迅速,最强的系统仍然大多是专有的,而开放代理仍然严重依赖于对大量策划的网络轨迹进行监督式后训练。这种依赖造成了主要的可扩展性瓶颈:高质量演示的收集成本高昂,而静态数据集对多样且不断变化的开放网络的覆盖有限。尽管在线强化学习在基于文本的代理中显示出前景,但其直接用于在实时网站上训练视觉网络代理的潜力仍未得到充分探索。在本文中,我们介绍了OpenWebRL,一个用于在真实网站上通过在线多轮强化学习训练视觉网络代理的开放框架。OpenWebRL涵盖了完整的训练流程,包括可扩展的实时浏览器基础设施、监督初始化、多模态上下文管理、轨迹级成功判断以及高效的多轮策略优化。使用该框架,我们训练了OpenWebRL-4B,在具有挑战性的实时网络基准测试中建立了新的开源最优水平。仅使用0.4K初始化轨迹和2.2K开放式强化学习训练任务,OpenWebRL-4B在Online-Mind2Web上达到67.0%的成功率,在DeepShop上达到64.0%,优于之前类似或更大规模的开放代理,并与包括OpenAI CUA和Gemini CUA在内的专有系统保持竞争力。除了强大的基准性能外,我们还系统研究了使在线强化学习对视觉网络代理有效的关键设计选择,并分析了强化学习如何改进代理推理。总体而言,我们的工作为构建更强大、可重复且成本效益更高的开放网络代理提供了一条实用路径。我们将发布我们的训练数据、模型和代码以支持未来的研究。

英文摘要

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.

2606.01897 2026-06-05 cs.AI 版本更新

Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

社区感知的社交文本参与度与共鸣评估:以人为中心的用户生成内容评价视角

Tianjiao Li, Kai Zhao, Xiang Li, Yang Liu, Huyang Sun

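Affiliations * GitHub

AI summary Proposes the CASTER task and the MEDEA architecture, which uses a Social Chain-of-Thought mechanism to simulate a community's cognitive and emotional reactions and assess the multimodal resonance of user-generated content.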

Comments Published as a main conference paper at ACL 2026

详情
AI中文摘要

传统视频质量评估(VQA)狭隘地关注美学保真度,忽略了定义用户生成内容(UGC)质量的复杂社会动态。在这项工作中,我们提出从信号中心指标向以人为中心的共鸣评估的范式转变。我们引入CASTER(社区感知的社交文本参与度与共鸣评估),这是一个新任务,根据UGC项目的多模态属性而非仅视觉质量来评估其是否实现积极的社区共鸣。为此,我们提出MEDEA(多模态参与驱动评估架构),它引入了一种新颖的社会思维链(Social-CoT)机制。与传统的逻辑CoT不同,Social-CoT执行多模态视角转换,实例化不同的观众角色以模拟集体认知和情感反应(即“社区思维”),然后得出质量判断。MEDEA通过两阶段方法进行训练,包括监督微调和带有社会对齐奖励的过程监督强化学习,以确保推理路径基于真实的人类社会认知。为支持此任务,我们发布了CASTER-Bench,一个涵盖多种UGC类别的全面人工标注基准。实验表明,MEDEA在CASTER-Bench上显著优于最先进的基线,同时提供可解释且富有同理心的推理路径,与真实社区反馈一致。

英文摘要

Traditional Video Quality Assessment (VQA) focuses narrowly on aesthetic fidelity, overlooking the complex social dynamics that define quality in User-Generated Content (UGC). In this work, we propose a paradigm shift from signal-centric metrics to human-centric resonance assessment. We introduce CASTER (Community-Aware Assessment of Social Textual Engagement and Resonance), a new task that evaluates whether a UGC item achieves positive community resonance based on its multimodal attributes rather than visual quality alone. To address this, we present MEDEA (Multimodal Engagement-Driven Evaluation Architecture), which introduces a novel Social Chain-of-Thought (Social-CoT) mechanism. Unlike traditional logical CoT, Social-CoT performs multimodal perspective-taking, instantiating diverse viewer personas to simulate collective cognitive and emotional reactions (i.e., the "community mind") before deriving a quality judgment. MEDEA is trained via a two-stage approach involving supervised fine-tuning and process-supervised reinforcement learning with Social Alignment Reward to ensure reasoning paths are grounded in authentic human social cognition. To support this task, we release CASTER-Bench, a comprehensive human-annotated benchmark covering diverse UGC categories. Experiments demonstrate that MEDEA significantly outperforms state-of-the-art baselines on CASTER-Bench while providing interpretable and empathetic reasoning paths that align with real community feedback.

2606.00804 2026-06-05 cs.MA cs.AI cs.CL 版本更新

Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems

企业多智能体系统的动态协调策略选择

Thanh Luong Tuan

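Affiliations * Golden Gate University; Foundation AgenticOS (FAOS)

AI summary Uses a large-scale experiment to evaluate whether enterprise multi-agent systems should select a coordination strategy dynamically by problem class; finds that dynamic routing works as a calibrated default but that no single best strategy can be identified.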

Comments 13 pages, 4 appendix. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-1

详情
AI中文摘要

企业多智能体系统日益暴露多种协调模式,但部署时往往缺乏证据表明何时使用共识、辩论、综合或更简单的单智能体工作流。本文评估协调策略是否应根据问题类别动态选择,而非全局固定。我们运行了一个固定的矩阵,包含30个企业任务,涵盖六个行业、五个问题类别、四种执行条件、每个单元格三个重复,以及四个模型分支:qwen_local、sonnet、gemma_openrouter和一个辅助的openai云验证分支。所有1,440个生成输出均由固定的Sonnet评分标准评判。主要发现是有界且操作上有用的,但并非最初的严格H1。预先注册的精确胜者/CI标准未得到支持:精确胜者身份在不同模型分支间不稳定,且若干预测策略接近但未超过最佳观察到的替代方案。一个较弱的近最优路由主张得到强烈支持。在每个预先注册的模型分支和问题类别中,以及在辅助的OpenAI验证分支中,预测策略的质量分数与最佳观察条件相差在0.10以内。结构化合规验证是对原始映射最明显的例外:所有分支都偏好单智能体而非共识。预先注册的Kendall's W检验发现,越南语领域和英语领域任务在四种协调条件排序的一致性上没有可靠差异(两个分层的平均W均为0.20;符号秩检验p = .85),因此H2未得到支持。我们得出结论,企业协调策略应使用动态路由作为校准默认值,而非确定性胜者选择法则。

英文摘要

Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single-agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks spanning six industries, five problem classes, four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud-validation arm. All 1,440 generated outputs are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it is not the original strict H1. The pre-registered exact-winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near-best routing claim is strongly supported. In every pre-registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality-score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre-registered Kendall's W test finds no reliable difference between Vietnamese-domain and English-domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed-rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner-selection law.
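
The difference between the exact-winner criterion and the weaker near-best criterion in this abstract is easy to state in code. The sketch below uses the 0.10 quality-score tolerance and the four condition names from the abstract; the scores themselves are invented for illustration and are not results from the paper.

```python
def is_exact_winner(scores, predicted):
    """True if the predicted strategy has the top observed score."""
    return scores[predicted] == max(scores.values())


def is_near_best(scores, predicted, tolerance=0.10):
    """True if the predicted strategy is within `tolerance` points of the best observed score."""
    return max(scores.values()) - scores[predicted] <= tolerance


# Hypothetical mean quality scores for one problem class and one model arm.
observed = {"single_agent": 4.20, "consensus": 4.15, "debate": 4.05, "synthesis": 3.90}

print(is_exact_winner(observed, "consensus"))
print(is_near_best(observed, "consensus"))
print(is_near_best(observed, "synthesis"))
```

With these numbers `consensus` fails the exact-winner test but passes the near-best test, while `synthesis` fails both. This mirrors the abstract's conclusion that routing is useful as a calibrated default rather than as a deterministic winner-selection rule.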

2606.00644 2026-06-05 cs.AI 版本更新

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

ForeSci: 评估LLM智能体在前瞻性AI研究判断中的能力

Qiuyu Tian, Haojie Yin, Yingce Xia, Youyong Kong, Zequn Liu

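Affiliations * Southeast University; Beijing Zhongguancun Academy; Duke Kunshan University

AI summary Proposes the ForeSci benchmark, whose 500 temporally controlled tasks evaluate whether LLM agents can make forward-looking research judgments from historical evidence, and identifies an evidence-decision decoupling problem.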

详情
AI中文摘要

AI研究通常需要在未来证据出现之前做出决策:攻击哪个瓶颈、追求哪个方向、项目应如何定位。我们引入了ForeSci,一个时间控制的基准,用于评估LLM智能体是否能够从历史证据中做出此类前瞻性研究判断。ForeSci包含500个任务,涵盖四个快速发展的AI领域和四个决策家族。每个任务配有一个截止对齐的离线知识库;截止日期后的论文在生成过程中被隐藏,仅用于验证。为避免随机未来事件预测,任务源自截止前的分类分支和证据信号,并选择早于任务截止日期的答案生成骨干。我们评估了原生LLM、混合RAG以及四种骨干上的三种研究智能体适配。结果表明,显式证据组织提高了可追溯性和事实支持,但收益强烈依赖于决策家族。诊断揭示了一个反复出现的证据-决策脱节:智能体可能引用相关证据,但预测错误的研究对象。ForeSci将前瞻性AI研究判断转化为一个受控基准,用于评估作为决策系统的研究智能体。

英文摘要

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
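
The temporal control described in this abstract amounts to splitting the corpus at a cutoff date: earlier papers form the knowledge base an agent may read, and later papers are held back for validation. A minimal sketch, assuming each paper record carries a `published` date (the record layout, identifiers, and dates are invented for illustration):

```python
from datetime import date


def split_by_cutoff(papers, cutoff):
    """Return (visible, held_out): papers dated on or before the cutoff, and those after it."""
    visible = [p for p in papers if p["published"] <= cutoff]
    held_out = [p for p in papers if p["published"] > cutoff]
    return visible, held_out


# Hypothetical corpus.
papers = [
    {"id": "paper-a", "published": date(2024, 3, 1)},
    {"id": "paper-b", "published": date(2024, 9, 15)},
    {"id": "paper-c", "published": date(2025, 2, 10)},
]

visible, held_out = split_by_cutoff(papers, cutoff=date(2024, 12, 31))
print([p["id"] for p in visible])   # available to the agent during generation
print([p["id"] for p in held_out])  # used only to validate the forecast
```

Whether a paper dated exactly on the cutoff counts as visible is a choice made here, not something the abstract specifies. The abstract also notes that the answering model itself must predate the cutoff, which a date filter on the knowledge base alone cannot guarantee.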

2606.00616 2026-06-05 cs.CV cs.AI 版本更新

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

暂停与思考:面向视频基础辅助动作建议的数据集与基准

Shivam Singh, Saptarshi Majumder, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

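Affiliations * Advanced Micro Devices, Inc.

AI summary Proposes the pause-and-think-T dataset and the pause-and-think-B benchmark; training a compact model with reasoning supervision yields performance comparable to large models on video scene understanding and goal planning tasks.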

详情
AI中文摘要

最近的视觉语言模型(VLM)在视频中的基础推理、时间一致性和上下文感知规划方面存在困难。我们引入了 pause-and-think-T,一个以推理为中心的训练数据集,鼓励模型暂停、基于视觉证据进行推理,并生成简洁、可操作的响应。该数据集在生成答案之前促进结构化推理,引导模型走向类人、基于场景的辅助。我们在我们的 pause-and-think-B 基准上微调了一个紧凑的 4B 参数模型,并针对上下文理解和目标规划任务进行了评估。该模型在参数比 Qwen3-VL-235B(58.9%)少 59 倍的情况下达到了 58.0% 的准确率,在场景理解上与 GPT-5.2 匹配,并超越了 GPT-4o。除了我们的基准之外,该模型在 EgoThink 和 TempCompass 上也表现出强大的分布外性能,在可操作性、辅助性、属性识别、情境推理和时间顺序方面取得了显著提升,且无需特定基准训练。我们的结果表明,有针对性的推理监督使紧凑模型能够提供可操作的、基于视觉的指导,同时泛化到训练数据之外,而无需进行大规模模型扩展。

英文摘要

Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context-aware planning in videos. We introduce pause-and-think-T, a reasoning-centric training dataset that encourages models to pause, reason over visual evidence, and produce concise, actionable responses. The dataset promotes structured reasoning prior to answer generation, guiding models toward human-like, scene-grounded assistance. We fine-tune a compact 4B-parameter model and evaluate it on our pause-and-think-B benchmark targeting contextual understanding and goal planning tasks. The model achieves 58.0% accuracy with 59x fewer parameters than Qwen3-VL-235B (58.9%), matching GPT-5.2 on scene understanding and surpassing GPT-4o. Beyond our benchmark, it also shows strong out-of-distribution performance on EgoThink and TempCompass, with substantial gains in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without benchmark-specific training. Our results indicate that targeted reasoning supervision enables compact models to deliver actionable, visually grounded guidance while generalizing beyond training data, without requiring large-scale model expansion.

2605.31278 2026-06-05 cs.AI cs.LG stat.ME 版本更新

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

工业化预测驱动推断:用于可靠生成式AI与智能体系统评估的GLIDE库

Grégoire Martinon, Ibrahim Merad, Mohammed Raki

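Affiliations * University of California, Berkeley; Google Research

AI summary Proposes GLIDE, an open-source library that unifies several prediction-powered inference methods, provides unbiased estimates with valid confidence intervals, and substantially reduces human annotation cost.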

Comments 8 pages, Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems, Seoul, South Korea, 2026

详情
AI中文摘要

智能体系统的可靠评估需要具有有效不确定性的无偏估计,但标准实践在昂贵的人工标注和有偏的LLM-as-judge代理之间权衡。预测驱动推断(PPI)将两者结合为具有有效置信区间的去偏估计,然而其各种方法仍分散在不同论文的部分实现中。我们介绍GLIDE,一个开源Python库,它在专用于均值估计的scipy风格API下统一了最先进的PPI估计器(PPI++、分层PPI、先预测后去偏及其分层变体、主动统计推断)和采样器(均匀、分层、主动、成本最优)。GLIDE附带一个可复现的蒙特卡洛验证套件、一个基于经验的决策树用于方法选择,以及一个智能体评估案例研究,显示在同等精度下显著节省标注成本。GLIDE包可通过此URL获取:https://github.com/EmertonData/glide

英文摘要

Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open-source Python library that unifies state-of-the-art PPI estimators (PPI++, Stratified PPI, Predict-Then-Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost-optimal) under a scipy-style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. The GLIDE package is available at this URL: https://github.com/EmertonData/glide
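
For mean estimation, the basic prediction-powered estimator averages the judge's scores over a large unlabeled pool and then corrects that average with the mean judge error measured on a small human-labeled set. The sketch below implements this textbook form with a normal-approximation interval. It is not GLIDE's API: `ppi_mean` and the example numbers are invented, and the labeled and unlabeled samples are assumed to be independent.

```python
import math
import statistics


def ppi_mean(labeled_truth, labeled_pred, unlabeled_pred, z=1.96):
    """Basic PPI estimate of a mean, with an approximate 95% confidence interval.

    estimate = mean(judge on unlabeled) + mean(human - judge on labeled)
    """
    n, big_n = len(labeled_truth), len(unlabeled_pred)
    # The rectifier measures how far the judge is off on items that have human labels.
    rectifier = [y - f for y, f in zip(labeled_truth, labeled_pred)]
    estimate = statistics.fmean(unlabeled_pred) + statistics.fmean(rectifier)
    variance = statistics.variance(unlabeled_pred) / big_n + statistics.variance(rectifier) / n
    half_width = z * math.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)


# Hypothetical data: 4 human-labeled items and 6 judge-only items.
human = [1, 0, 1, 1]
judge_on_labeled = [0.9, 0.2, 0.7, 0.8]
judge_on_unlabeled = [0.8, 0.6, 0.7, 0.9, 0.5, 0.7]

estimate, interval = ppi_mean(human, judge_on_labeled, judge_on_unlabeled)
print(estimate, interval)
```

In this example the judge averages 0.7 on the unlabeled pool and underestimates the human labels by 0.1 on average, so the corrected estimate is about 0.8. The interval narrows as the judge's errors become less variable, which is where the annotation savings come from.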

2605.30747 2026-06-05 cs.AI 版本更新

Generating Graph-Like Logical Rules for Knowledge Graph Reasoning via Diffusion Models

通过扩散模型生成图状规则用于知识图谱推理

Haoxiang Cheng, Yunfei Wang, Chao Chen, Kewei Cheng, Zhipeng Lin, Haoxuan Li, Changjun Fan, Shixuan Liu

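Affiliations * Laboratory for Big Data and Decision; National University of Defense Technology; National Key Laboratory of Information Systems Engineering; Microsoft Corporation; College of Computer Science and Technology

AI summary Proposes the GRiD framework, which uses a diffusion model to recast graph-like rule discovery as a discrete generative process conditioned on the target relation, and combines supervised pre-training with reinforcement-learning fine-tuning to mine graph-like rules efficiently for knowledge graph completion.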

Comments accepted by KDD 26

详情
AI中文摘要

逻辑规则构成知识图谱推理的基石,因其可解释性和建模关系模式的能力而受到重视。然而,现有规则挖掘方法主要关注简单的链状规则,因此忽略了图状结构中编码的更丰富的关系信息,例如循环和分支。这一局限性因搜索空间组合爆炸导致的计算瓶颈而进一步加剧,这对图状规则尤其具有挑战性。同时,生成方法如扩散模型,尽管在其他领域取得了成功,但不能直接应用于规则挖掘,因为它们的训练目标与学习高质量规则的目标不一致,且不可微的知识图谱规则质量指标无法直接指导模型优化。为解决这些局限性,我们提出GRiD,一个将图状规则发现重新表述为以目标关系为条件的离散生成过程的框架。GRiD采用两阶段训练策略。首先,监督预训练使GRiD能够从知识图谱元图采样的子图中捕获结构先验。随后,应用强化学习通过直接由不可微规则质量指标指导的策略梯度优化来微调GRiD。在六个基准数据集上的实验表明,GRiD在知识图谱补全任务上取得了有竞争力的性能。消融研究证实了GRiD的效率和鲁棒性,并进一步表明图状规则在知识图谱补全中补充了链状规则。我们的代码和数据集可在https://github.com/Haoxiang-Cheng/GRiD获取。

英文摘要

Logical rules constitute a cornerstone of knowledge graph (KG) reasoning, valued for their interpretability and ability to model relational patterns. However, existing rule mining methods predominantly focus on simple chain-like rules and therefore neglect the richer relational information encoded in graph-like structures, such as cycles and branches. This limitation is further exacerbated by computational bottlenecks caused by the combinatorial explosion of the search space, which is especially challenging for graph-like rules. Meanwhile, generative approaches such as diffusion models, despite their success in other domains, cannot be directly applied to rule mining because their training objectives are not aligned with the goal of learning high-quality rules, and non-differentiable KG rule quality metrics cannot directly guide model optimization. To address these limitations, we propose GRiD, a framework that reformulates graph-like rule discovery as a discrete generative process conditioned on the target relation. GRiD employs a two-phase training strategy. First, supervised pre-training enables GRiD to capture structural priors from subgraphs sampled from the KG meta-graph. Subsequently, reinforcement learning is applied to fine-tune GRiD through policy gradient optimization guided directly by non-differentiable rule-quality metrics. Experiments on six benchmark datasets show that GRiD achieves competitive performance on KG completion tasks. Ablation studies confirm the efficiency and robustness of GRiD and further show that graph-like rules complement chain-like rules in KG completion. Our code and datasets are available at https://github.com/Haoxiang-Cheng/GRiD.

2605.28579 2026-06-05 cs.AI 版本更新

MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

MUSE: 面向可制造、功能性和可装配的文本到CAD生成的基准测试

Xiaoyu Dong, Zhi Li, Xiao-Ming Wu

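Affiliations * The Hong Kong Polytechnic University; Curvature Flow Co., Limited

AI summary Proposes the MUSE benchmark, which uses a three-stage evaluation protocol (code check, geometric check, design-intent alignment) and a rubric-based visual language model (VLM) judge to measure the practical design quality of Text-to-CAD models in terms of functionality, manufacturability, and assemblability.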

Comments 26 pages

详情
AI中文摘要

大型语言模型(LLMs)近期推动了文本驱动的3D生成,但文本到CAD仍远未支持工业产品设计。现有基准主要关注生成单零件CAD模型,并使用几何相似性指标进行评估,这些指标无法捕捉功能、可制造性和可装配性。为弥补这一空白,我们引入MUSE,一个专注于复杂、可编辑边界表示(B-Rep)装配体的文本到CAD基准。MUSE将实际设计实例与结构化设计规范配对,并通过三阶段评估协议评估生成的模型:代码检查、几何检查和设计意图对齐。最后阶段使用特定于设计的评分标准评估功能、可制造性和可装配性,超越形状匹配,走向实际设计质量。为实现可扩展评估,我们使用基于评分标准的视觉语言模型(VLM)评判器,并通过人工标注验证其可靠性。在闭源和开源LLM上的实验揭示了从可执行代码到有效几何再到工程就绪设计的明显失败级联,即使最强的模型在细粒度工程标准上也仅取得有限成功。MUSE为将文本到CAD从几何生成推向真正的工程设计提供了现实的基准和评估框架。我们的项目网站(包括排行榜、数据集和代码)可在 https://dong7313.github.io/muse-benchmark/ 获取。

英文摘要

Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.

2605.27887 2026-06-05 cs.AI q-fin.PM 版本更新

PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

PortBench: 一种相关性感知的、全流水线的LLM驱动投资组合管理基准

Yuxuan Zhao, Sijia Chen, Ningxin Su

发表机构 * Yantai Research Institute of Harbin Engineering University(哈尔滨工程大学烟台研究院) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出PortBench基准,通过静态QA和动态五阶段分配流水线评估LLM在投资组合管理中的表现,发现多数模型无法超越等权重分配,且存在推理错误累积和压力下大幅回撤的问题。

Comments Project page: https://portbench.github.io/

详情
AI中文摘要

LLMs在多种金融任务中表现出色,但投资组合管理(PM)这一关键金融决策任务仍缺乏良好基准。现有基准存在两个主要缺陷:忽略跨资产相关性结构,从而无法区分真正多样化的投资组合与集中投资组合;未能评估真实场景中完整的PM决策流水线。我们提出PortBench,一个涵盖十年间六类异质资产类别的基准。PortBench由两个互补层组成:包含6269个基于相关性的问题(覆盖七个任务模板)的静态QA数据集,以及模拟完整PM决策周期的动态五阶段分配流水线。为评估这些层,我们引入两个专用指标:双层次相关性分数,衡量所提投资组合是否利用跨类别对冲并避免类别内集中;以及CEPS,量化推理错误如何在流水线阶段间累积。我们进一步在三种历史压力情景和风险配置下评估策略稳健性和投资者对齐。评估十个前沿LLM,我们发现尽管在静态金融QA上表现强劲,90%的模型-配置组合未能超越基本的等权重分配,且满足所有程序约束的模型在压力下仍遭受灾难性回撤。我们的源代码可在 https://github.com/AgenticFinLab/portbench 获取。

英文摘要

Large language models (LLMs) have shown strong performance across diverse financial tasks, yet portfolio management (PM), a critical financial decision-making task, remains poorly benchmarked. Existing benchmarks exhibit two main gaps: they ignore cross-asset correlation structures, thereby failing to distinguish genuinely diversified portfolios from concentrated ones, and fail to evaluate the complete PM decision pipeline in real-world scenarios. We introduce PortBench, a benchmark spanning six heterogeneous asset classes over ten years. PortBench consists of two complementary layers: a static QA dataset of 6,269 correlation-based questions across seven task templates, and a dynamic five-stage allocation pipeline that mirrors the full PM decision cycle. To evaluate these layers, we introduce two dedicated metrics: a dual-layer correlation score that measures whether proposed portfolios exploit inter-class hedging and avoid intra-class concentration, and CEPS, a metric that quantifies how reasoning errors compound across pipeline stages. We further assess strategy robustness and investor alignment under three historical stress regimes and risk profiles. Evaluating ten frontier LLMs, we find that despite strong performance on static financial QA, 90% of model-profile combinations fail to outperform a basic equal-weight allocation, and models that satisfy every procedural constraint still suffer catastrophic drawdowns under stress. Our source code is available at https://github.com/AgenticFinLab/portbench.
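The equal-weight baseline and the drawdown notion mentioned in the abstract can be illustrated with a minimal sketch. The function names, the per-period rebalancing rule, and the return data below are assumptions made for illustration; they are not taken from the PortBench code.

```python
def equal_weight(n_assets):
    """Return the equal-weight allocation over n_assets (weights sum to 1)."""
    return [1.0 / n_assets] * n_assets


def portfolio_values(weights, period_returns, start=1.0):
    """Compound a portfolio that is rebalanced to `weights` every period.

    period_returns is a list of per-period lists of simple asset returns
    (hypothetical data; rebalancing every period is an assumption).
    """
    values = [start]
    for returns in period_returns:
        r = sum(w * x for w, x in zip(weights, returns))
        values.append(values[-1] * (1.0 + r))
    return values


def max_drawdown(values):
    """Largest peak-to-trough loss, as a fraction of the running peak."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


if __name__ == "__main__":
    w = equal_weight(2)
    path = portfolio_values(w, [[0.10, 0.10], [-0.50, -0.50]])
    print(w, path, max_drawdown(path))
```

Any LLM-proposed allocation can be passed through the same two functions, so it is compared with the baseline on identical return data.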

2605.26179 2026-06-05 cond-mat.mtrl-sci cs.AI cs.CE 版本更新

AutoDFT: A Closed-Loop Multi-Agent Framework for Autonomous DFT Calculations

AutoDFT:用于自主DFT计算的闭环多智能体框架

Penghui Yang, Zhonghan Zhang, Yue Li, Xinrun Wang, Yanchen Deng, Yuhao Lu, Bijun Tang, Zheng Liu, Bo An

发表机构 * Nanyang Technological University, Singapore(南洋理工大学,新加坡) Singapore Management University(新加坡管理大学)

AI总结 提出AutoDFT闭环多智能体框架,通过将LLM推理嵌入DFT计算全生命周期,实现从规划到执行的自主适应,在VASPBench基准上达到94.1%任务成功率,并可靠预测电子、磁性和能量性质。

详情
AI中文摘要

密度泛函理论(DFT)是材料科学和化学中计算发现的基础,然而每次计算都需要大量人工努力:当收敛停滞时调整算法,当出现意外物理现象时修改计划,以及当中间结果重塑问题时插入步骤。现有的基于LLM的智能体仅自动化初始规划阶段,预先生成完整的执行计划,而将所有后续调整留给手工规则。因此,这些工作流仍然脆弱,难以泛化到预规划场景之外,并且当失败或意外的中间结果需要改变计算路径时,通常需要专家干预。在此,我们介绍AutoDFT,一个闭环多智能体框架,将LLM推理嵌入DFT生命周期的每个阶段:战略规划器生成步骤目标的骨架计划;步骤规划器根据先前结果即时生成数值参数;监控-恢复-反思循环诊断失败、修复失败,并在证据支持时修改计划。我们展示了广度和深度:广度方面,在VASPBench(一个专门构建的基准,涵盖34个任务和9种DFT计算类型)上,AutoDFT使用GPT-5.2实现了94.1%的任务级成功率;深度方面,在已建立的材料数据库上,AutoDFT在电子、磁性和能量性质上产生了定量可靠的属性预测。通过闭环规划和执行,AutoDFT使没有深厚计算专业知识的实验人员能够获得可靠的第一性原理结果。

英文摘要

Density functional theory (DFT) serves as the basis for computational discovery in materials science and chemistry, yet each calculation demands extensive human effort: adjusting algorithms when convergence stalls, revising plans when unexpected physics emerges, and inserting steps as intermediate results reshape the problem. Existing LLM-based agents automate only the initial planning stage, producing a full execution plan upfront and leaving all subsequent adaptation to hand-crafted rules. As a result, these workflows remain fragile, do not generalize well beyond pre-planned scenarios, and often require expert intervention when failures or unexpected intermediate results require changes to the calculation path. Here, we introduce AutoDFT, a closed-loop multi-agent framework that embeds LLM reasoning into every stage of the DFT lifecycle, where a strategic planner produces a skeletal plan of step objectives; a step planner generates numerical parameters just in time from preceding results; and a monitor-recover-reflect cycle diagnoses failures, repairs them, and revises the plan when the evidence justifies it. We demonstrate both breadth and depth: breadth on VASPBench, a purpose-built benchmark spanning 34 tasks and 9 DFT calculation types, where AutoDFT achieves 94.1% task-level success with GPT-5.2; and depth on established materials databases, where AutoDFT produces quantitatively reliable property predictions across electronic, magnetic, and energetic properties. By closing the loop between planning and execution, AutoDFT enables experimentalists without deep computational expertise to obtain reliable first-principles results.

2605.26046 2026-06-05 cs.CL cs.AI cs.LG cs.MA cs.SE 版本更新

When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

当梯度冲突时:多目标提示优化用于LLM评判器的失败模式

Parth Darshan, Abhishek Divekar

发表机构 * IIT Jodhpur(印度理工学院焦特布尔分校) Amazon(亚马逊)

AI总结 研究多目标文本梯度优化中梯度稀释和指令干扰两种失败模式,通过分解优化器信息共享方式揭示性能下降原因。

Comments Accepted at ACL 2026 - CustomNLP4U Workshop. Code, prompts and data available at https://github.com/adivekar-utexas/when-gradients-collide

详情
AI中文摘要

将LLM评判器定制到特定任务或领域通常需要同时跨多个评估标准优化其提示。文本梯度方法针对单一评判标准实现了自动化,但它们产生自然语言批评,而非数值向量。因此,多任务学习的冲突解决工具包(PCGrad、MGDA)不适用于多目标文本梯度设置。我们通过改变损失、梯度和优化器LLM共享跨任务信息的程度,测试了文本梯度优化器的五种分解模式。在10种配置中的6种中,我们观察到优化从未优于初始提示。当梯度LLM联合处理多个标准时,梯度特异性下降了59%(从9.0降至3.7)。另外,我们观察到将每个任务的指令简单组合成单个提示会使斯皮尔曼相关系数降低5.3%。这些结果识别出两种可分离的失败模式:优化时的梯度稀释和推理时的指令干扰,它们共同限制了使用文本反馈进行多目标评判器定制的设计空间。

英文摘要

Customizing an LLM judge to a specific problem or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion; however, they produce natural-language critiques, not numerical vectors. Thus, the conflict-resolution toolkit of multi-task learning (PCGrad, MGDA) does not apply to this multi-objective textual gradient setting. We extend TextGrad to the multi-objective setting and test four decomposition modes of textual gradient optimizers by varying how much cross-objective information the loss, gradient and optimizer LLMs share. We find the gradient's task-focus drops by 59% (9.0 to 3.7 out of 10) when the gradient LLM must provide feedback on multiple criteria jointly. Separately, we observe that naively combining single-objective optimized instructions into a single prompt degrades Spearman rho from 0.305 to 0.220 (-0.085). These results identify two separable failure modes: optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge optimization using textual feedback.

2605.29916 2026-06-05 cs.NE cs.AI cs.DS math.OC 版本更新

Selection Hyper-heuristics Can Automatically Adjust the Learning Period to Optimally Solve Pseudo-Boolean Problems

选择超启发式可以自动调整学习周期以最优地解决伪布尔问题

Benjamin Doerr, Pietro S. Oliveto, John Alasdair Warwicker

发表机构 * Laboratoire d’Informatique (LIX), CNRS, École Polytechnique, Institut Polytechnique de Paris(信息实验室(LIX),法国国家科学研究中心,巴黎高等理工学院,巴黎理工学院) Department of Computer Science and Engineering, Southern University of Science and Technology(计算机科学与工程系,南方科技大学) School of Computing & Communications, Lancaster University Leipzig(计算与通信学院,兰卡斯特大学莱比锡分校)

AI总结 本文提出一种自动设置学习周期参数的超启发式方法,证明其能在1-o(1)比例的迭代中选择最优邻域大小,从而以最优时间(忽略低阶项)优化LeadingOnes基准问题。

Comments To appear in "Artificial Intelligence"

详情
Journal ref
Artificial Intelligence 357:104560 (2026)
AI中文摘要

最近研究表明,随机梯度超启发式在使用随机局部搜索(RLS)元启发式优化LeadingOnes基准时,能够学习最优邻域大小。然而,这需要使用一定长度$τ$的学习周期,这与经典超启发式不同,后者仅基于前一次迭代的成功来改变行为。在本文中,我们展示了如何自动设置这个新参数值,从而使用户免于控制这一新颖算法参数的非平凡任务。我们证明,由此产生的超启发式在$1-o(1)$比例的迭代中选择最优邻域大小,并因此以这些邻域大小所能达到的最佳时间(忽略低阶项)优化LeadingOnes基准。

英文摘要

The Random Gradient hyper-heuristic was recently shown to be able to learn the optimal neighbourhood size when optimising the LeadingOnes benchmark via the Randomised Local Search (RLS) meta-heuristic. However, for this to happen, a learning period of a certain length $τ$ had to be used, unlike in classic hyper-heuristics, which change their behaviour based on the success of only the previous iteration. In this paper, we show how to automatically set this new parameter value, relieving the user from the non-trivial task of controlling this novel algorithm parameter. We prove that the resulting hyper-heuristic selects the optimal neighbourhood size in a $1-o(1)$ fraction of the iterations and, consequently, optimises the LeadingOnes benchmark in the best possible time (apart from lower-order terms) achievable with these neighbourhood sizes.
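For readers unfamiliar with the setting, the following sketch shows the LeadingOnes fitness function and a single RLS step with a fixed neighbourhood size k. It does not implement the Random Gradient hyper-heuristic or the automatic choice of the learning period; the function names and the fixed k are illustrative assumptions.

```python
import random


def leading_ones(bits):
    """Number of consecutive 1-bits at the start of the bit string."""
    count = 0
    for b in bits:
        if b != 1:
            break
        count += 1
    return count


def rls_step(bits, k, rng):
    """One RLS iteration with neighbourhood size k: flip k distinct random
    positions and keep the offspring only if fitness does not decrease."""
    child = list(bits)
    for i in rng.sample(range(len(bits)), k):
        child[i] ^= 1
    return child if leading_ones(child) >= leading_ones(bits) else list(bits)


def run_rls(n, k, seed=0, max_iters=100_000):
    """Iterations until the all-ones string is found (or max_iters)."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n)]
    for t in range(max_iters):
        if leading_ones(bits) == n:
            return t
        bits = rls_step(bits, k, rng)
    return max_iters


if __name__ == "__main__":
    print(run_rls(n=20, k=1))
```

A hyper-heuristic in the sense of the abstract would replace the fixed `k` with a choice that adapts to recent success; that selection logic is deliberately left out here.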

2603.19294 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Maximizing Mutual Information Between Prompt and Response Improves LLM Performance With No Additional Data

最大化提示与响应之间的互信息无需额外数据即可提升LLM性能

Hyunji Nam, Haoran Li, Natasha Jaques

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出互信息偏好优化(MIPO)方法,通过对比数据增强构建偏好对,利用直接偏好优化最大化提示与响应间的点互信息,无需额外数据或外部监督即可提升LLM在个性化和可验证任务上的性能。

Comments International Conference on Machine Learning 2026

详情
AI中文摘要

虽然后训练已在多个领域成功改进了大型语言模型(LLM),但这些提升严重依赖人工标注数据或外部验证器。现有数据已被充分利用,而新数据收集成本高昂。此外,真正的智能远不止可验证任务。因此,我们需要较少依赖外部信号且更广泛适用于可验证和不可验证领域的自我改进框架。我们提出**互信息偏好优化(MIPO)**,一种对比数据增强方法,通过基于正确提示生成正响应,以及基于随机无关提示生成负响应来构建偏好对。我们证明,使用直接偏好优化从这些配对数据中学习,可以最大化*基础LLM*下提示与响应之间的逐点互信息。使用1-7B参数的Llama和Qwen指令模型的实验表明,与提示基线相比,MIPO在个性化任务上实现了3-16%的提升(Qwen2.5-1.5B-Instruct提升51%)。令人惊讶的是,MIPO在可验证领域(如数学和多项选择题问答)也有用,*无需任何额外数据或外部监督*即可获得1-20%的提升。这些结果表明,利用对比数据对中的内在信号进行自我改进是一个有前景的方向。

英文摘要

While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new data is expensive to collect. Moreover, true intelligence goes far beyond verifiable tasks. Therefore, we need self-improvement frameworks that are less dependent on external signals and more broadly applicable to both verifiable and non-verifiable domains. We propose **Mutual Information Preference Optimization (MIPO)**, a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioned on the correct prompt, and a negative response conditioned on a random, unrelated prompt. We show that using Direct Preference Optimization to learn from this paired data maximizes pointwise mutual information *under the base LLM* between prompts and model responses. Experiments with 1-7B parameter Llama and Qwen instruct models show that MIPO achieves 3-16% gains (and a 51% increase for Qwen2.5-1.5B-Instruct) on personalization compared to prompting baselines. Surprisingly, MIPO can also be useful in verifiable domains, such as math and multiple-choice question answering, yielding 1-20% gains *without any additional data or external supervision*. These results suggest a promising direction for self-improvement using intrinsic signals derived from contrastive data pairs.
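The pair-construction step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: `build_mipo_pairs` and the `generate` callable are hypothetical names, the dictionary keys follow a common prompt/chosen/rejected convention rather than anything confirmed by the paper, and the subsequent DPO training is not shown.

```python
import random


def build_mipo_pairs(prompts, generate, seed=0):
    """Build preference pairs from prompts alone.

    chosen   = response generated from the prompt itself
    rejected = response generated from a different, randomly drawn prompt
    `generate` is a hypothetical callable mapping a prompt to a response
    (for example, a wrapper around a base LLM). Needs at least two prompts.
    """
    rng = random.Random(seed)
    pairs = []
    for i, prompt in enumerate(prompts):
        others = [p for j, p in enumerate(prompts) if j != i]
        distractor = rng.choice(others)
        pairs.append({
            "prompt": prompt,
            "chosen": generate(prompt),
            "rejected": generate(distractor),
        })
    return pairs


if __name__ == "__main__":
    demo = build_mipo_pairs(
        ["Plan a vegan dinner.", "Explain recursion.", "Suggest a jazz album."],
        generate=lambda p: f"[response to: {p}]",  # stand-in for a real model
    )
    for pair in demo:
        print(pair)
```

The resulting records use no labels or verifier output, which is the sense in which the abstract says the method needs no additional data.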

2605.25582 2026-06-05 cs.LG cs.AI 版本更新

Extreme Region Policy Distillation

极端区域策略蒸馏

Changyu Chen, Xiting Wang, Rui Yan

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Wuhan University(武汉大学)

AI总结 提出极端区域策略蒸馏(ERPD)两阶段框架,通过解耦样本效率与KL效率,在固定数据上先进行弱约束离策略优化以最大化提取训练信号,再在信任区域约束下蒸馏到基础策略,从而在数学推理任务中实现更好的性能与更小的KL散度。

详情
AI中文摘要

大语言模型的强化学习面临样本效率与渐近性能之间的基本权衡:严格在策略方法在单次更新后丢弃轨迹,而离策略重用引入分布不匹配,现有信任区域技术主要通过强制保守优化来缓解,往往未充分利用丰富的训练信号。为研究此问题,我们在固定数据上执行大量离策略更新。实验揭示,激进的多步优化带来快速初始增益,但过度更新导致轨迹概率偏离和熵崩溃,性能早期停滞。收紧KL约束仅降低上限而不解决退化。这促使我们提出极端区域策略蒸馏(ERPD),一个两阶段框架,将样本效率与KL效率解耦。第一阶段在固定数据上执行弱约束离策略优化,以最大化提取训练信号。所得策略提供令牌级监督。第二阶段,我们在信任区域约束下将这些信号蒸馏到基础策略中,过滤有害漂移同时保留有用信号。蒸馏后的策略以显著更小的KL散度达到相当或更好的性能,表明第一阶段的大部分散度用于不必要的漂移而非真正改进。关键的是,ERPD同时适应强和弱教师:当激进优化未产生更强策略时,即使是退化教师也能通过替代信号构建策略提供有效监督。我们在数学推理上验证ERPD,显示出对强基础模型(在策略训练停滞时)的增益,以及使用弱教师的可靠改进。

英文摘要

Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strictly on-policy methods discard trajectories after a single update, while off-policy reuse introduces distribution mismatch that existing trust-region techniques mitigate primarily by enforcing conservative optimization, often leaving rich training signals underutilized. To investigate this, we perform extensive off-policy updates on fixed data. Our experiments reveal that aggressive multi-step optimization brings rapid initial gains, but excessive updates cause trajectory probabilities to deviate and entropy to collapse, with performance plateauing early. Tightening KL constraints merely lowers the ceiling without resolving the degradation. This motivates Extreme Region Policy Distillation (ERPD), a two-stage framework that decouples sample efficiency from KL efficiency. The first stage performs weakly constrained off-policy optimization on fixed data to maximally extract training signals. The resulting policy provides token-level supervision. In the second stage, we distill these signals into the base policy under trust-region constraints, filtering harmful drift while preserving useful signals. The distilled policy achieves comparable or better performance with substantially smaller KL divergence, indicating that much of the first-stage divergence was spent on unnecessary drift rather than genuine improvement. Crucially, ERPD accommodates both strong and weak teachers: when aggressive optimization yields no stronger policy, even degenerate teachers provide effective supervision via alternative signal construction strategies. We validate ERPD on mathematical reasoning, showing gains for strong base models where on-policy training plateaus, and reliable improvements with weak teachers.

2605.25256 2026-06-05 cs.AI 版本更新

Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts

谁的对齐?比较不同组织决策情境下的大语言模型过程对齐

Niklas Weller, Emilio Barkett

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出一种决策策略捕获方法测量过程对齐,发现LLM在ECHR第6条决策中过程对齐与输出准确性高度相关,但在德国消费信贷决策中关系消失,揭示了多元对齐挑战。

Comments Accepted to Pluralistic Alignment Workshop @ ICML 2026, Seoul, South Korea

详情
AI中文摘要

将AI系统与组织决策对齐通常被框架化为单一目标问题:使模型表现得像组织一样。我们认为这种框架掩盖了更深的多元主义挑战。我们依赖一种决策策略捕获方法来测量过程对齐:LLM是否像组织一样加权信息,而不仅仅是是否得出相同结论。将此方法应用于ECHR第6条决策,过程对齐强烈预测输出准确性(r = 0.85, p < .001),且外部化显著改善了低对齐模型的对齐。将其应用于德国消费信贷决策,这种关系消失(r = 0.15, p = .60):干预产生不一致的效果,且基准编码了潜在歧视性的历史模式。这种对比本身就是一个多元对齐发现:在有争议的领域,高过程对齐既不能通过外部化实现,也不是无条件可取的。仅凭输出一致性无法区分一个模型是内化了组织政策还是仅仅近似其结果;过程级测量是任何多元对齐评估的必要组成部分。

英文摘要

Steerable pluralism requires a model to faithfully represent one specified perspective. Organizations are a natural setting for this demand, since they deploy LLMs to make decisions that must reflect their own policy. Yet, most existing work fixes that perspective at the level of individuals or demographic groups. We rely on a decision-policy capturing method to measure process alignment in organizational settings, assessing whether an LLM faithfully reproduces the organization's decision policy rather than merely reaching the same conclusions. We find heterogeneity along two axes. Across models, baseline alignment varies strongly and tracks neither pricing nor general benchmark performance. Across organizations, the structure of alignment changes. In ECHR Article 6 decisions, process alignment predicts output accuracy ($r = 0.85$, $p < .001$), and making the organization's past decision policy explicit improves poorly aligned models. In consumer credit decisions, process alignment is low overall but varies more than output accuracy, and the models resist adopting the organization's weighting of protected attributes. Because historical credit decisions encode potentially discriminatory patterns, higher alignment there is not always desirable. Process-level measurement is therefore necessary, and depending on whether the target policy is normatively desirable, the same procedure can calibrate or audit a model. Deciding which policy to align to, and whether higher alignment is feasible or desirable, makes organizational alignment a pluralistic problem in its own right.

2605.25240 2026-06-05 cs.CL cs.AI cs.CY 版本更新

JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

JudgmentBench: 比较评分量规与偏好评估在质量评价中的应用

Russell Yang, Ruishi Chen, Pierce Kelaita, Riya Ranjan, Sibo Ma, Charles Dickens, Matthew Guillod, Megan Ma, Julian Nyarko

发表机构 * Stanford University(斯坦福大学) Snorkel AI

AI总结 本研究通过构建包含30个真实法律任务、1539个评分量规评分和1530个成对偏好判断的数据集JudgmentBench,比较了评分量规与成对比较两种评估方法,发现成对比较在恢复预期质量排序上显著优于评分量规(平均斯皮尔曼等级相关系数0.908 vs 0.150),且注释时间减少一半以上。

Comments 37 pages, 9 figures

详情
AI中文摘要

当前基准测试实践中主导着两种方法论:基于评分量规的评分根据预定义标准评估项目,而比较判断则引出输出之间的成对偏好。尽管两种方法论被广泛使用,但两者之间的选择很少被论证。我们发布了JudgmentBench,一个包含30个真实法律任务的基准测试,配有1539个评分量规评分和1530个成对偏好判断,这些标注均来自经验丰富的执业律师(包括美国主要律师事务所的律师)。这些注释构成了高专业领域内首个公开可用的数据集,其中两种监督信号由同一专家对同一项目进行收集。使用LLM生成的三个质量级别的输出,我们提供了初步的经验比较:比较判断在恢复预期质量排序方面显著优于评分量规(平均斯皮尔曼等级相关系数为0.908 vs 0.150,估计差异=0.758 [0.494, 1.021]),同时所需的注释时间不到一半。这一模式对人类注释者和LLM自动评分器均成立。除了这一初步比较,数据集的配对结构支持更广泛的研究议程,探讨在没有可验证真实情况的领域中,如何引出、聚合专家判断并将其用作监督信号。

英文摘要

Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies are widely used, the choice between them is rarely justified. We release JudgmentBench, a benchmark of 30 real-world legal tasks, paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys--including at major U.S. law firms--with substantial experience. The annotations constitute the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items. Using LLM-generated outputs at three constructed quality levels, we provide an initial empirical comparison: comparative judgments recover the intended quality ordering substantially better than rubrics under both a per-task rank-correlation metric (mean Spearman's rank correlation of 0.908 vs. 0.150, estimated difference = 0.758 [0.494, 1.021]) and a per-judgment pairwise win-rate metric (0.669 vs. 0.542, estimated difference = 0.127 [0.067, 0.186]), while requiring less than half the annotation time. The patterns hold for human annotators and LLM autograders. Beyond this initial comparison, the paired structure of the dataset supports a broader research agenda on how expert judgment should be elicited, aggregated, and used as supervision in domains without verifiable ground truth.
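The per-task rank-correlation metric relies on Spearman's rank correlation between the intended quality ordering and the ordering implied by the annotations. A minimal pure-Python sketch is given below; the tie handling (average ranks) and the example scores are assumptions for illustration and are not taken from the JudgmentBench code or data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs.

    Undefined (division by zero) when either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    intended = [1, 2, 3]            # constructed quality levels
    rubric_scores = [4.0, 3.5, 4.5]  # hypothetical annotator scores
    print(spearman(intended, rubric_scores))
```

With only three quality levels per task, a single swapped pair changes the coefficient substantially, which is one reason to average it over many tasks as the abstract does.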

2605.24059 2026-06-05 cs.LG cs.AI 版本更新

Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers

频谱探测电路:识别预训练Transformer中注意力头电路的三步法

Yongzhong Xu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种三步法,通过频谱信号排序、任务模式筛选和组消融因果验证,无需标签即可识别预训练Transformer中执行持续内容依赖计算的注意力头电路,并在多个模型上验证了其通用性和因果必要性。

Comments 35 pages, 4 figures

详情
AI中文摘要

我们提出了一种三步法,用于识别预训练Transformer中的注意力头电路。每个头的频谱信号——即每个头注意力输出的时间积分参与比——可以在没有标签或归因梯度的情况下,对执行持续内容依赖计算的头进行排序。任务模式筛选将此通用指标过滤为特定任务的候选电路,而针对匹配随机对照的组消融则完成因果论证。我们在8倍参数范围(5100万至10亿活跃/70亿总参数)、两种架构族(密集型和混合专家)以及四种预训练流程上进行了验证。该方法是可移植的:一个2-6头的归纳电路在每个测试模型中都是因果必需的,消融后合成归纳top-1下降94-100%。频谱信号在无监督下具有预测性:在5100万参数探测模型的六个独立种子上,相同的计算识别出每个种子上的种子特定电路。在Pythia族(1.24亿至4.1亿)中,执行可识别专门计算的头比例保持在17-19%,而特定归纳电路保持3-11个头——与总头数呈次线性关系。本文是一个三篇论文计划的方法论锚点;配套论文将该方法扩展到预训练期间的发展轨迹以及组合任务电路,其中模式选择性与任务因果结构解耦。

英文摘要

We present a three-step recipe for identifying attention-head circuits in pretrained transformers. A per-head spectral signal -- the time-integrated participation ratio of each head's attention output -- ranks heads doing sustained content-dependent computation without labels or attribution gradients. A task-pattern screen filters this general indicator into a task-specific candidate circuit, and group ablation against a matched-random control completes the causal claim. We validate across an 8x parameter range (51M to 1B-active / 7B-total), two architecture families (dense, mixture-of-experts), and four pretraining pipelines. The recipe ports: a 2-6 head induction circuit is causally necessary in every model tested, with a 94-100% drop in synthetic-induction top-1 after ablation. The spectral signal is predictive without supervision: on six independent seeds of a 51M-parameter probe model, the same computation identifies the seed-specific circuit on each seed. The fraction of heads doing identifiable specialized computation is conserved at 17-19% across the Pythia family (124M to 410M), while specific induction circuits stay 3-11 heads -- sublinear in total head count. This paper is the methodology anchor of a three-paper program; companion papers extend the recipe to developmental trajectories during pretraining and to composed-task circuits where pattern selectivity decouples from task-causal structure.
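The abstract does not spell out how the per-head spectral signal is computed. As a rough illustration only, the sketch below uses the standard participation ratio of a non-negative spectrum, $(\sum_i s_i)^2 / \sum_i s_i^2$, and ranks heads by it. Which spectrum is used, how the time integration is done, and whether a higher value should rank first are all assumptions here, and `rank_heads` is a hypothetical helper.

```python
def participation_ratio(spectrum):
    """Standard participation ratio (sum s)^2 / sum(s^2) of a non-negative
    spectrum; ranges from 1 (one dominant mode) to len(spectrum) (flat)."""
    total = sum(spectrum)
    squares = sum(s * s for s in spectrum)
    if squares == 0:
        raise ValueError("spectrum must contain a non-zero value")
    return total * total / squares


def rank_heads(head_spectra):
    """Sort head names by participation ratio, highest first.

    head_spectra maps a head name to its spectrum (hypothetical input;
    the ordering direction is an assumption, not the paper's rule).
    """
    scores = {h: participation_ratio(s) for h, s in head_spectra.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    spectra = {"L0H1": [1.0, 0.0, 0.0], "L3H5": [1.0, 1.0, 1.0]}
    print(rank_heads(spectra))
```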

2605.23415 2026-06-05 cs.LG cs.AI 版本更新

Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control

Reflex: 基于状态连续控制中利用反射对称性的强化学习

Shuai Zhen, Yifan Zhang, Yuling Wang, Yanhua Yu

AI总结 提出Reflex框架,通过反射对称性正则化机制将反射对称性融入策略学习,提升基于状态的连续控制任务的样本效率。

Comments Some of the data in the paper contain errors and need to be confirmed for modification

详情
AI中文摘要

强化学习长期面临样本效率低下的问题。缓解该问题的一种有前景的方法是利用群不变马尔可夫决策过程($G$-不变MDP)。现有工作主要关注基于图像的强化学习和旋转对称性(如$\mathrm{SO(2)}$),而基于状态的强化学习和反射对称性尚未得到充分探索。本文聚焦于基于状态的连续控制任务,通过引入Reflex范式来利用反射对称性,该范式可无缝集成到同策略和异策略强化学习算法中。我们形式化了两种反射类型——轴向反射和双侧反射,并刻画了它们对应的变换。基于对保持对称性的最优值函数和策略的理论分析,Reflex通过原则性的对称性正则化机制将反射对称性融入策略学习。我们将Reflex与PPO和SAC集成,并在OpenAI Gym和DeepMind Control基准测试套件上进行评估,结果表明相比标准基线,Reflex在提升样本效率的同时实现了更优的性能。我们的代码开源在https://github.com/TonyStark042/Reflex。

英文摘要

Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-invariant Markov Decision Processes ($G$-invariant MDPs). Existing works in this direction have primarily focused on image-based RL and rotational symmetry such as $\mathrm{SO(2)}$, leaving state-based RL and reflection symmetry largely underexplored. In this work, we focus on state-based continuous control tasks and exploit reflection symmetry by introducing Reflex, a paradigm that seamlessly integrates with both on-policy and off-policy RL algorithms. We formalize two types of reflection, axial reflection and bilateral reflection, and characterize their corresponding transformations. Building on a theoretical analysis of symmetry-preserving optimal value functions and policies, Reflex integrates reflection symmetry into policy learning through principled symmetry regularization mechanisms. We integrate Reflex with PPO and SAC, and evaluate it on a suite of OpenAI Gym and DeepMind Control benchmarks, demonstrating superior performance over standard baselines while improving sample efficiency. Our code is available at https://github.com/TonyStark042/Reflex.
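One generic way to express a reflection-symmetry regularizer is to penalize the gap between acting on a mirrored state and mirroring the action taken on the original state. The sketch below illustrates that idea only: it assumes reflections act as per-coordinate sign flips, whereas the paper's axial and bilateral reflections may also permute coordinates, and it is not the regularizer defined in the Reflex code.

```python
def reflect(vector, signs):
    """Apply a reflection given as per-coordinate signs (+1 or -1)."""
    return [s * v for s, v in zip(signs, vector)]


def symmetry_penalty(policy, state, state_signs, action_signs):
    """Squared distance between policy(reflected state) and the reflected
    policy(state); zero for a policy that respects the reflection."""
    mirrored_then_act = policy(reflect(state, state_signs))
    act_then_mirror = reflect(policy(state), action_signs)
    return sum((a - b) ** 2 for a, b in zip(mirrored_then_act, act_then_mirror))


if __name__ == "__main__":
    # Toy deterministic policies on a 2-D state with a 1-D action.
    symmetric = lambda s: [s[0]]        # respects flipping coordinate 0
    asymmetric = lambda s: [abs(s[0])]  # ignores the sign of coordinate 0
    state = [2.0, 1.0]
    print(symmetry_penalty(symmetric, state, [-1, 1], [-1]))
    print(symmetry_penalty(asymmetric, state, [-1, 1], [-1]))
```

In training, a term of this kind would typically be added to the PPO or SAC loss with a small coefficient; the coefficient and the exact form are design choices not specified in the abstract.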

2605.15913 2026-06-05 cs.CL cs.AI 版本更新

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

通过自动分割和块蒸馏实现块注意力的泛化

Shuaiyi Li, Zhisong Zhang, Yan Wang, Lei Zhu, Dongyang Ma, Chenlong Deng, Yang Deng, Wai Lam

发表机构 * The Chinese University of Hong Kong(香港中文大学) City University of Hong Kong(香港城市大学) Tencent(腾讯) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Singapore Management University(新加坡管理大学)

AI总结 提出基于语义分割数据集训练的轻量级分割器和块蒸馏框架,解决块注意力在长上下文中的文本分割和微调效率问题,实现接近全注意力的性能。

Comments 16 pages, 2 figures

详情
AI中文摘要

块注意力将输入作为独立的块处理,块之间不能相互关注,在检索增强生成(RAG)等长上下文场景中具有显著提升KV缓存重用的潜力。然而,其广泛应用受到两个关键挑战的阻碍:将输入文本分割成有意义且自包含的块的困难,以及现有块微调方法效率低下且可能降低性能的风险。为解决这些问题,我们首先构建了SemanticSeg,一个大规模且多样化的语义分割数据集,包含超过30k个实例,涵盖16个类别——包括书籍、代码、网页文本和对话,文本长度从2k到32k。利用该数据集,我们训练了一个轻量级分割器,能够自动将文本分割成符合人类直觉的块,且粒度可控。其次,我们提出了块蒸馏,一种比块微调更高效的训练框架,它使用冻结的全注意力教师模型来指导块注意力学生模型。该框架集成了三个新颖的组件:块汇合令牌以减轻块边界处的信息丢失,块丢弃以利用来自所有块的训练信号,以及令牌级损失加权以聚焦于对块注意力敏感的令牌的学习。跨多个模型和基准的实验表明,我们的分割器优于启发式和统计基线,且块蒸馏在块注意力下实现了接近全注意力的性能,为部署块注意力建立了一条实用且可扩展的路径。

英文摘要

Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories (including books, code, web text, and conversations), with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human-instinct-aligned blocks with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine-tuning, which uses a frozen full-attention teacher model to guide the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.
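The constraint that blocks cannot attend to one another can be pictured as a block-diagonal causal mask. The sketch below builds such a mask from a list of block lengths; it is a simplified illustration under the assumption of strictly isolated blocks, and any special handling of a final query block, block sink tokens, or block dropout is not modelled.

```python
def block_attention_mask(block_lengths):
    """Return mask[q][k], True when query token q may attend to key token k.

    k must be in the same block as q and must not come after q (causal).
    block_lengths would come from a segmenter; here it is just a list of ints.
    """
    block_id = []
    for b, length in enumerate(block_lengths):
        block_id.extend([b] * length)
    n = len(block_id)
    return [[block_id[q] == block_id[k] and k <= q for k in range(n)]
            for q in range(n)]


if __name__ == "__main__":
    for row in block_attention_mask([2, 3]):
        print("".join("x" if allowed else "." for allowed in row))
```

Because each block's keys and values depend only on tokens inside that block, they can be cached once and reused wherever the same block reappears, which is the KV cache benefit the abstract refers to.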

2605.21557 2026-06-05 stat.ML cs.AI cs.LG 版本更新

Scalable Reinforcement Learning via Adaptive Batch Scaling

通过自适应批处理缩放实现可扩展的强化学习

Jongchan Park

发表机构 * Jongchan Park

AI总结 本文提出自适应批处理缩放方法,通过动态调整有效批处理大小来平衡强化学习早期的可塑性需求和晚期的稳定收敛,发现增大网络和批处理大小的组合在强化学习中取得最佳性能。

详情
AI中文摘要

传统观点认为大批次训练与强化学习(RL)本质上不兼容:由于数据分布固有的非平稳性,超过一定阈值后增大批次大小通常会导致收益递减或性能下降。我们对这一观点提出挑战,指出非平稳性并非RL的固定属性,而是随着训练过程演变:早期阶段表现出快速的行为转变,需要小批次以保持可塑性,而晚期阶段接近准平稳状态,大批次可实现精确收敛。受此启发,我们提出自适应批处理缩放(ABS),根据学习策略的稳定性动态调整有效批次大小。ABS的核心是行为分歧,一种新的度量指标,通过测量连续更新之间的动作级转变来量化策略非平稳性,我们据此使批次大小与策略波动性成反比地缩放。与并行化Q网络(PQN)算法结合并在ALE基准上评估,ABS无缝地平衡了早期阶段的可塑性和晚期阶段的稳定收敛。令人惊讶的是,与传统观点相反,我们的结果表明,较大的网络和较大的批次大小的组合实现了最佳性能——一种之前被认为在强化学习中无法实现的扩展行为,现在通过自适应批处理控制得以解锁。

英文摘要

Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL): beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation due to the inherent non-stationarity of the data distribution. We challenge this view by observing that non-stationarity is not a fixed property of RL, but evolves throughout training: early stages exhibit rapid behavioral shifts that demand small batches for plasticity, whereas late stages approach a quasi-stationary regime where large batches enable precise convergence. Motivated by this observation, we propose Adaptive Batch Scaling (ABS), which dynamically adjusts the effective batch size according to the stability of the learning policy. Central to ABS is Behavioral Divergence, a novel metric that quantifies policy non-stationarity by measuring action-level shifts between consecutive updates, which we use to scale batch size inversely to policy volatility. Integrated with the Parallelised Q-Network (PQN) algorithm and evaluated on the ALE benchmark, ABS seamlessly reconciles early-stage plasticity with late-stage stable convergence. Strikingly, contrary to conventional wisdom, our results reveal that the combination of larger networks and larger batch sizes achieves the best performance, a scaling behavior previously thought to be unattainable in RL, now unlocked through adaptive batch control.
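The idea of scaling batch size inversely to policy volatility can be sketched as below. The abstract does not give the exact definition of Behavioral Divergence or the scaling rule, so both functions are assumptions: divergence is taken as the fraction of probe states whose chosen action changed across an update, and the batch size is a clipped multiple of its reciprocal. All constants are hypothetical.

```python
def behavioral_divergence(old_actions, new_actions):
    """Fraction of probe states whose chosen action changed across an update
    (an assumed stand-in for the paper's Behavioral Divergence metric)."""
    changed = sum(a != b for a, b in zip(old_actions, new_actions))
    return changed / len(old_actions)


def adaptive_batch_size(divergence, min_batch=32, max_batch=4096,
                        scale=32.0, eps=1e-8):
    """Batch size proportional to 1 / divergence, clipped to
    [min_batch, max_batch]; all constants are illustrative."""
    raw = scale / (divergence + eps)
    return int(min(max_batch, max(min_batch, raw)))


if __name__ == "__main__":
    early = behavioral_divergence([0, 1, 2, 3], [1, 0, 2, 0])  # volatile policy
    late = behavioral_divergence([0, 1, 2, 3], [0, 1, 2, 3])   # stable policy
    print(early, adaptive_batch_size(early))
    print(late, adaptive_batch_size(late))
```

Under this rule a volatile early policy trains on small batches and a stable late policy on large ones, matching the qualitative behavior the abstract describes.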

2605.20119 2026-06-05 cs.LG cs.AI 版本更新

Toto 2.0: Time Series Forecasting Enters the Scaling Era

Toto 2.0:时间序列预测进入规模化时代

Emaad Khwaja, Chris Lettieri, Gerald Woo, Eden Belouadah, Marc Cenac, Guillaume Jarry, Enguerrand Paquin, Xunyi Zhao, Viktoriya Zhukov, Othmane Abou-Amal, Chenghao Liu, Ameet Talwalkar, David Asker

发表机构 * Datadog AI Research(Datadog AI研究院) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出Toto 2.0模型家族,通过单一训练配方在400万到25亿参数范围内实现可靠的预测质量提升,并在三个基准测试中达到新的最先进水平。

Comments Code: https://github.com/DataDog/toto Weights: https://huggingface.co/collections/Datadog/toto-20

详情
AI中文摘要

我们证明时间序列基础模型可以扩展:一种单一的训练配方能够在400万到25亿参数范围内产生可靠的预测质量提升。我们发布了Toto 2.0,这是一个由五种开放权重预测模型组成的家族,这些模型均基于此配方进行训练。Toto 2.0家族在三个预测基准测试中达到新的最先进水平:BOOM,我们的可观测性基准;GIFT-Eval,标准的通用基准;以及最近的抗污染TIME基准。本报告描述了我们的实验结果,并详细说明了Toto 2.0的设计决策:其架构和训练配方、训练数据以及u-muP超参数转移流水线。所有五个基础检查点均以Apache 2.0许可证发布。

英文摘要

We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u-muP hyperparameter transfer pipeline. All five base checkpoints are released under Apache 2.0.

2510.00054 2026-06-05 cs.CV cs.AI 版本更新

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

HiDe: 通过分层解耦重新思考高分辨率MLLMs中的Zoom-IN方法

Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu, Bo Zheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出HiDe框架,通过分层解耦方法解决高分辨率图像中背景干扰导致的视觉理解问题,提升多模态大语言模型在高分辨率图像任务中的性能。

Comments Accepted by ICML2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉理解任务中取得了显著进展。然而,它们在高分辨率图像上的性能仍然不够理想。尽管现有方法通常将这一限制归因于感知约束,认为MLLMs难以识别小物体,因而采用“放大”(zoom in)策略以获得更好的细节,我们的分析揭示了不同的原因:主要问题不是物体大小,而是复杂的背景干扰。我们通过一系列解耦实验系统分析了这种“放大”操作,并提出了一种无需训练的分层解耦框架(HiDe),该框架使用基于标记的注意力解耦(TAD)来解耦问题标记并识别关键信息标记,然后利用其注意力权重实现与目标视觉区域的精确对齐。随后,它利用布局保持解耦(LPD)将这些区域与背景解耦,并重建一个紧凑的表示,该表示在保留基本空间布局的同时消除了背景干扰。HiDe在V*Bench、HRBench4K和HRBench8K上设定了新的SOTA,将Qwen2.5-VL 7B和InternVL3 8B提升至SOTA(在V*Bench上分别为92.1%和91.6%),甚至超过了强化学习方法。经过优化后,HiDe的内存使用比之前的无训练方法减少了75%。代码见 https://tennine2077.github.io/HiDe.github.io/。

英文摘要

Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. While existing approaches often attribute this limitation to perceptual constraints and argue that MLLMs struggle to recognize small objects, leading them to use "zoom in" strategies for better detail, our analysis reveals a different cause: the main issue is not object size but complex background interference. We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to SOTA (92.1% and 91.6% on V*Bench), even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://tennine2077.github.io/HiDe.github.io/.

2605.18937 2026-06-05 cs.AI 版本更新

Evaluating the Utility of Personal Health Records in Personalized Health AI

评估个人健康记录在个性化健康AI中的效用

Rory Sayres, Kejia Chen, Ayush Jain, Matthew Thompson, Jonathan Richina, Xiang Yin, Jimmy Hu, Fan Zhang, Bob Lou, Mike Sanchez, Ines Mezerreg, Meredith Schreier, Hamsa Subramaniam, I-Ching Lee, Yugang Jia, Daniel Mcduff, Yossi Matias, Avinatan Hassidim, Dale Webster, Yun Liu, Jackie Barr, Quang Duong

发表机构 * Google Research(谷歌研究)

AI总结 本文研究了利用个人健康记录(PHR)中的临床数据通过大语言模型(LLM)回答用户健康问题的效用,发现提供PHR上下文能显著提高回答的有用性、安全性和个性化水平,并揭示了LLM在理解复杂PHR方面的不足。

Comments 35 pages, 3 figures, 10 tables [bugfix / minor numerical update]

详情
AI中文摘要

患者管理的个人健康记录(PHRs)有望帮助患者更好地了解自身健康;但记录中的信息十分复杂,可能妨碍患者获得洞察。在本研究中,我们评估了大型语言模型(LLMs,Gemini 3.0 Flash)在以PHR中的临床数据作为上下文时,为用户健康问题提供有用回答的潜力。我们从3种不同的分布中共抽取了2,257个用户查询来代表患者的问题:较短的网页搜索查询、较长的基于聊天机器人对话模板生成的问题,以及患者向其医疗团队提出的问题(患者来电)。这些查询与去标识化的PHR(来自1,945份记录的池)进行匹配。Gemini的回答在三种条件下生成:(1)无PHR上下文;(2)提供人口统计信息、病症和用药的基本摘要;(3)提供完整、详尽的临床笔记。在评估方面,我们采用了现有的评分框架(SHARP),并针对解读PHR时的特定错误模式开发了新的框架。评估对全部数据使用自动评分器,对一个子集(n=95)使用临床医生评分,两组评分者均了解完整的PHR上下文。我们发现,在提供PHR数据的情况下,所有问题类型的回答有用性均显著提高(p < 0.001,配对t检验)。我们还观察到回答在安全性、准确性、相关性和个性化方面的潜在提升。我们的PHR评估框架进一步识别出LLM在理解复杂PHR特定方面的不足,例如时间错乱,以及罕见但有意义的虚构内容。这些结果表明,PHR数据有潜力帮助满足用户的广泛需求,并提供了一个框架,用于监测基于PHR上下文的LLM回答中的不足。本研究促使开展进一步工作,以评估并实现用户从理解自身健康记录中可能获得的益处。

英文摘要

Patient-managed Personal Health Records (PHRs) promise to empower patients to better understand their health, but information in the record is complex, potentially hindering insights. In this study, we assess the potential of large language models (LLMs, Gemini 3.0 Flash) to provide helpful answers to user health queries when provided with clinical data from PHRs as context. A total of 2,257 user queries were drawn from 3 different distributions to represent patient questions: shorter web search queries, longer questions derived from templates of chatbot conversations, and questions patients asked their healthcare team (patient calls). Queries were matched with de-identified PHRs (from a pool of 1,945). Gemini responses were generated (1) without PHR context; (2) with a basic summary of demographics, conditions, and medications; (3) with full, extensive clinical notes. For evaluation, we leveraged an existing rating framework (SHARP) and developed a new framework for specific error modes when interpreting PHRs. Evaluation was performed using autoraters for the full set, and with clinician ratings for a subset (n=95), with both sets of raters knowing the full PHR context. We see significant improvements in the helpfulness of answers to all question types with PHR data (p < 0.001, paired t-test). We also observe potential gains in safety, accuracy, relevance and personalization of answers. Our PHR evaluation framework further identifies gaps in LLM understanding of particular aspects of complex PHRs, such as temporal disorientation, and rare but meaningful confabulations. These results suggest potential for PHR data to help people with a wide range of user needs, and provide a framework for monitoring for gaps in LLM answers based on PHR context. This study motivates further work to assess and realize potential benefits to users from understanding their health records.

2605.16716 2026-06-05 cs.CV cs.AI 版本更新

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

MAVEN:面向多元文化文本到视频生成的多智能体框架

Shuowei Li, Yuming Zhao, Parth Bhalerao, Oana Ignat

发表机构 * Santa Clara University(圣克拉拉大学)

AI总结 提出MAVEN多智能体提示优化框架,通过并行或串行分解提示为人物、动作、地点维度,提升单文化和跨文化文本到视频生成的文化保真度,并构建包含243个文化提示和972个视频的基准进行评估。

Comments 14 pages, 6 figures, 11 tables, appendix included. Preprint

详情
AI中文摘要

文本到视频(T2V)生成在视觉保真度方面取得了快速进展,但其在单个提示中忠实呈现多种文化的能力仍未被充分探索。我们提出MAVEN,一个多智能体提示优化框架,旨在提高单文化和跨文化T2V生成中的文化保真度。MAVEN将提示分解为人物、动作和地点维度,由并行或串行运行的专业智能体处理。为了支持系统评估,我们贡献了一个新的基准,包含243个基于文化的提示和972个对应视频,涵盖三种文化(中文、美式、罗马尼亚)、三种动作类别以及单文化和跨文化场景。结合基于CLIP的指标、VLM作为评判的评估和视频质量测量的评估表明,多智能体优化,特别是并行专业化,在保持视觉质量和时间一致性的同时,显著提高了文化相关性。数据集和代码可在https://github.com/AIM-SCU/MAVEN获取。

英文摘要

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and video quality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available at https://github.com/AIM-SCU/MAVEN

2604.00555 2026-06-05 cs.AI cs.CL cs.SE 版本更新

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

企业智能体系统中的本体约束神经推理:一种面向基于领域知识的AI智能体的神经符号架构

Thanh Luong Tuan, Abhijit Sanyal

发表机构 * Golden Gate University, San Francisco(金门大学,旧金山) Foundation AgenticOS (FAOS) Novartis Healthcare Pvt. Ltd. (Data, Digital & IT), Hyderabad, India(诺华医疗保健有限公司数据、数字与IT部门,印度海得拉巴)

AI总结 本文提出了一种神经符号架构,通过本体约束神经推理解决企业大语言模型在幻觉、领域漂移和无法在推理层面强制执行监管合规性方面的限制,展示了该架构在提升智能体的指标准确性和角色一致性方面的显著效果。

Comments 24 pages, 6 tables, 6 figures, 1 algorithm, 65 references. Replication study: 1,800 runs (600 per model) across 5 regulated industries (3 English, 2 Vietnamese) and 3 LLMs (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B). v3 changes: deep-review trim from 34pp. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-3

详情
AI中文摘要

企业采用大语言模型(LLMs)受到幻觉、领域漂移以及无法在推理层面强制执行监管合规性的制约。我们提出了一种在Foundation AgenticOS(FAOS)平台中实现的神经符号架构,通过本体约束神经推理解决这些限制。我们引入了一个三层本体框架(角色、领域和交互本体),为基于LLM的企业智能体提供领域依据(grounding)。我们形式化了不对称的神经符号耦合:当前企业系统约束智能体的输入(上下文组装、工具发现、治理阈值),但不约束输出;我们提出了将这种耦合扩展到输出侧验证(响应检查、推理验证、合规性强制)的机制。一项受控实验(1,800次运行,覆盖五个行业和三个LLM:Claude Sonnet 4、Qwen 2.5 72B、Gemma 4 26B)发现,在全部三个模型上,本体耦合的智能体在指标准确性(p < .001)和角色一致性(p < .001)上均显著优于缺乏领域依据(ungrounded)的智能体,且效应量较大(Kendall's W = .46-.64)。改进在LLM参数化知识最弱之处最为明显,尤其是在越南本地化领域,其本体提升幅度是英语领域的2倍。贡献:(1)一个形式化的三层企业本体模型;(2)神经符号耦合模式的分类体系;(3)通过SQL下推(SQL-pushdown)评分实现本体约束的工具发现;(4)一个用于输出侧本体验证的拟议框架;(5)关于逆参数化知识效应的实证证据:本体提供领域依据的价值与LLM训练数据对该领域的覆盖程度成反比;(6)跨模型复现,确立了模型无关性;(7)一个服务于22个行业垂直领域、拥有650多个智能体的生产系统。

英文摘要

Enterprise adoption of Large Language Models (LLMs) is constrained by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning. We introduce a three-layer ontological framework--Role, Domain, and Interaction ontologies--grounding LLM-based enterprise agents. We formalize asymmetric neurosymbolic coupling: current enterprise systems constrain agent inputs (context assembly, tool discovery, governance thresholds) but not outputs, and we propose mechanisms extending this coupling to output-side validation (response checking, reasoning verification, compliance enforcement). A controlled experiment (1,800 runs across five industries and three LLMs: Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B) finds ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001) and Role Consistency (p < .001) across all three models with large effect sizes (Kendall's W = .46-.64). Improvements are greatest where LLM parametric knowledge is weakest--particularly in Vietnam-localized domains, where ontology lift is 2x that of English domains. Contributions: (1) a formal three-layer enterprise ontology model; (2) a taxonomy of neurosymbolic coupling patterns; (3) ontology-constrained tool discovery via SQL-pushdown scoring; (4) a proposed framework for output-side ontological validation; (5) empirical evidence for the inverse parametric knowledge effect--ontological grounding value is inversely proportional to LLM training-data coverage of the domain; (6) cross-model replication establishing model-independence; (7) a production system serving 22 industry verticals with 650+ agents.

2605.16138 2026-06-05 cs.LG cs.AI hep-ex 版本更新

Surrogate Neural Architecture Codesign Package (SNAC-Pack)

代理神经架构协同设计包(SNAC-Pack)

Jason Weitz, Dmitri Demler, Benjamin Hawks, Aaron Wang, Nhan Tran, Javier Duarte

发表机构 * University of California San Diego(加州大学圣地亚哥分校) ETH Zurich(苏黎世联邦理工学院) Fermi National Accelerator Laboratory(费米国家加速器实验室) University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 本文提出SNAC-Pack,一种面向硬件的自动化机器学习框架,用于神经架构协同设计和端到端FPGA部署,通过多目标全局搜索和硬件代理模型减少合成成本,并结合量化感知训练和迭代幅度剪枝来压缩模型,最终在FPGA上实现高效部署。

Comments 15 Pages, 3 Figures, AutoML (International Conference on Automated Machine Learning) 2026

详情
AI中文摘要

神经架构搜索(NAS)是一种强大的自动模型设计方法,但现有方法往往只优化准确率或依赖如位操作(BOPs)等代理指标,这些指标与硬件成本的相关性较差。在FPGA部署中,成本由查找表、DSP、触发器、BRAM和延迟等多维预算主导。我们提出了代理神经架构协同设计包(SNAC-Pack),一种开源的AutoML框架,用于硬件感知的神经架构协同设计和端到端FPGA部署。SNAC-Pack使用Optuna和NSGA-II进行多目标全局搜索,将试验加载到共享的SQLite存储中,以实现计算节点之间的并行工作。硬件代理模型输出每个试验的资源和延迟估计,避免了否则会主导搜索循环的合成成本。随后的局部搜索阶段结合量化感知训练(QAT)和迭代幅度剪枝,在联合压缩循环中应用。最后,通过hls4ml Python库将最终模型合成到FPGA固件中。YAML配置和可选的代理前端使用户能够在新数据集上运行管道而无需修改框架。我们在大型强子对撞机的喷射分类和超导量子比特读出中展示了SNAC-Pack,发现了紧凑的架构,这些架构在任务指标上匹配或超过强基线,同时减少FPGA资源利用,并在量子比特读出情况下将设计空间探索过程从数月的手动微调减少到数小时的自动化搜索。

英文摘要

Neural architecture search (NAS) is a powerful approach for automating model design, but existing methods often optimize for accuracy alone or rely on proxy metrics such as bit operations (BOPs) that correlate poorly with hardware cost. This gap is particularly large for FPGA deployment, where cost is dominated by a multi-dimensional budget of lookup tables, DSPs, flip-flops, BRAM, and latency. We present the Surrogate Neural Architecture Codesign Package (SNAC-Pack), an open-source AutoML framework for hardware-aware neural architecture codesign and end-to-end FPGA deployment. SNAC-Pack runs a multi-objective global search with Optuna and NSGA-II, loading trials to a shared SQLite store that enables parallel workers across compute nodes. A hardware surrogate model outputs per-trial resource and latency estimates, avoiding the synthesis cost that would otherwise dominate the search loop. A local search stage then applies quantization-aware training (QAT) together with iterative magnitude pruning in a combined compression loop, after which the final model is synthesized to FPGA firmware via the hls4ml Python library. A YAML configuration and an optional agentic frontend let users run the pipeline on new datasets without modifying the framework. We demonstrate SNAC-Pack on jet classification at the Large Hadron Collider and superconducting qubit readout, discovering compact architectures that match or exceed strong baselines on the task metric while reducing FPGA resource utilization and, in the qubit readout case, reducing the design space exploration process from months of manual fine-tuning to hours of automated search.
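
The multi-objective search described above trades task accuracy against surrogate-estimated hardware cost. As a minimal sketch of that idea only (this is not SNAC-Pack's code; the `Trial` fields, the sample numbers, and the `pareto_front` helper are all hypothetical), the following keeps the trials that no other trial dominates:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One hypothetical search trial with surrogate-estimated hardware cost."""
    name: str
    accuracy: float    # task metric, higher is better
    luts: int          # estimated lookup tables, lower is better
    latency_ns: float  # estimated latency, lower is better


def dominates(a: Trial, b: Trial) -> bool:
    """True if `a` is no worse than `b` on every objective and better on one."""
    no_worse = (a.accuracy >= b.accuracy and a.luts <= b.luts
                and a.latency_ns <= b.latency_ns)
    better = (a.accuracy > b.accuracy or a.luts < b.luts
              or a.latency_ns < b.latency_ns)
    return no_worse and better


def pareto_front(trials):
    """Return the non-dominated trials, preserving input order."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]


# Made-up numbers, for illustration only.
trials = [
    Trial("wide", 0.76, 42000, 95.0),
    Trial("narrow", 0.74, 9000, 40.0),
    Trial("bloated", 0.73, 30000, 80.0),  # dominated by "narrow"
    Trial("tiny", 0.65, 2500, 20.0),
]
front = pareto_front(trials)
print([t.name for t in front])
```

With these sample values the front keeps `wide`, `narrow`, and `tiny`. An evolutionary sampler such as NSGA-II maintains a front like this across generations rather than recomputing it by brute force.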

2509.10825 2026-06-05 cs.LG cs.AI stat.ML 版本更新

CUBE: Contrastive Understanding by Balanced Experiments

CUBE: 通过平衡实验实现对比理解

Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh

发表机构 * Department of Computer Engineering, Gachon University(嘉泉大学计算机工程系)

AI总结 本文提出CUBE框架,通过平衡低-高探针解释已训练的预测模型,揭示模型的主要效应和交互作用,验证了其在合成和现实表格任务中的有效性。

Comments The core framework and main claims remain unchanged; the manuscript has been revised for clarity, presentation, and consistency

详情
AI中文摘要

事后解释取决于模型查询的组织方式。我们提出了CUBE,一种基于实验设计的框架,通过平衡的低-高探针来解释已训练的预测模型。所选变量定义因素,经过设计的特征水平组合定义查询条件,模型预测则被汇总为因子对比。CUBE将主效应和成对交互作用报告为在声明的设计空间上对平均响应变化和条件响应变化的受控读数。在合成和真实表格任务上的实验表明,CUBE能够恢复模型学到的主导效应结构,阐明查询高效的可识别性,并支持先筛查、后跟进的细化流程。

英文摘要

Post-hoc explanation depends on how model queries are organized. We propose CUBE, a design-based framework that explains a trained predictive model through balanced low--high probes. Selected variables define factors, designed feature-level combinations define query conditions, and model predictions are summarized as factorial contrasts. CUBE reports main effects and pairwise interactions as controlled readings of average and conditional response changes over a declared design space. Experiments on synthetic and real tabular tasks show that CUBE recovers dominant learned effect structure, clarifies query-efficient identifiability, and supports screening--follow-up refinement.
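
The balanced low-high probing described above resembles a classical two-level factorial design. A minimal sketch under that assumption (the `factorial_contrasts` helper and the toy model are hypothetical, and CUBE's own estimators may differ): each factor is set to its low or high level, the model is queried on every combination, and effects are read off as signed averages.

```python
from itertools import combinations, product


def factorial_contrasts(model, low, high):
    """Main effects and pairwise interactions from a full 2^k design."""
    k = len(low)
    runs = []
    for signs in product((-1, 1), repeat=k):
        # Build the query point: high level where the sign is +1, else low.
        x = [high[i] if s > 0 else low[i] for i, s in enumerate(signs)]
        runs.append((signs, model(x)))
    half = len(runs) / 2
    # Main effect: mean prediction at the high level minus mean at the low level.
    main = {i: sum(s[i] * y for s, y in runs) / half for i in range(k)}
    # Interaction contrast: signed average using the product of two sign columns.
    inter = {(i, j): sum(s[i] * s[j] * y for s, y in runs) / half
             for i, j in combinations(range(k), 2)}
    return main, inter


def toy_model(x):
    """Stand-in for a trained predictor: one additive term, one interaction."""
    return 3.0 * x[0] + 2.0 * x[1] * x[2]


main, inter = factorial_contrasts(toy_model, low=[0, 0, 0], high=[1, 1, 1])
print(main)
print(inter)
```

For this toy model the main effects come out as 3, 1, and 1, and the only non-zero interaction is between factors 1 and 2. A full design needs 2^k queries, so it is practical only for a small number of selected factors.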

2605.15212 2026-06-05 cs.AR cs.AI cs.CE 版本更新

Fault tolerance estimation in digital circuits with visualised generative networks

利用可视化生成网络估计数字电路的容错性

Sascha Biel, Carl Alexander Gaede, Amiel Glaser, Jan Wolter, Alexej Schelle

发表机构 * IU Internationale Hochschule(国际大学) Constructor University(Constructor大学)

AI总结 本文提出一种新的数值方法,通过生成网络采样技术估计数字电路结构中故障模式的容错性,通过比较理想数字化的模拟电流的随机输入与生成对抗网络(GAN)判别器部分的现实信号,计算与理想数字电子信号的偏差,包括缺失或互换逻辑器件等误差模式。

Comments 7 pages, 7 figures, 1 table

详情
AI中文摘要

我们提出了一种新的数值方法,用于估计数字电路结构中故障模式的容错性,采用生成网络采样技术。从由理想数字化的模拟电流的随机输入生成的位配置开始,在经典逻辑门的数字电路设计中,将预期输出电流与生成对抗网络(GAN)判别器部分的数值实验中的现实信号进行比较,以计算与理想数字电子信号的偏差,包括各种误差模式,如缺失或互换的逻辑器件。从GAN在复变量中的表示分析来看,可以通过区分与不同经典逻辑元件相关的故障模式的影响来评估电子设计的鲁棒性。

英文摘要

We propose a new numerical method to estimate the fault tolerance of failure modes in digital circuit structures with a generative network sampling technique. From a random input of generated bitwise configurations of ideally digitalised analog currents in the digital circuit design with classical logical gates, expected output currents are compared to the realistic signals of a numerical experiment at the discriminator part of the Generative Adversarial Network (GAN) to calculate the deviation from ideal digital electronic signals, including various error modes, such as missing or interchanged logical devices. From the present analysis of a representation of the GAN in terms of complex variables, it is possible to evaluate the robustness in electronic designs by differentiating the impact of failure modes associated with different classical logical elements in the circuit.

2605.13075 2026-06-05 cs.CL cs.AI 版本更新

Scaling few-shot spoken word classification with generative meta-continual learning

通过生成性元持续学习扩大少样本语音词分类

Louise Beyers, Batsirayi Mupamhi Ziki, Ruan van der Merwe

发表机构 * University of Cape Town(开普敦大学)

AI总结 本文研究了在仅获得每个类别五个样本的情况下,通过生成性元持续学习(GeMCL)算法对1000个类别进行少样本语音词分类的潜力,并展示了其在性能稳定性及适应速度上的优势。

详情
AI中文摘要

少样本语音词分类大多针对少量类别进行开发,因此更大规模的少样本语音词分类潜力尚未被挖掘。本文探讨了在仅获得每个类别五个样本的情况下,通过生成性元持续学习(GeMCL)算法训练的语音词分类器能否依次学习区分1000个类别。我们通过使用GeMCL算法训练模型并与重复训练或微调的基线模型进行比较,证明了这种扩展能力的存在。我们发现GeMCL产生了极高的性能稳定性,尽管它并不总能超越重复全微调的HuBERT模型或冻结HuBERT模型配以重复训练的分类器头,但其性能与后者相当,同时适应速度提高了2000倍,仅用不到一半的数据量,在两个数量级更少的时间内进行训练。

英文摘要

Few-shot spoken word classification has largely been developed for applications where a small number of classes is considered, and so the potential of larger-scale few-shot spoken word classification remains untapped. This paper investigates the potential of a spoken word classifier to sequentially learn to distinguish between 1000 classes when it is given only five shots per class. We demonstrate that this scaling capability exists by training a model using the Generative Meta-Continual Learning (GeMCL) algorithm and comparing it to repeatedly trained or finetuned baselines. We find that GeMCL produces exceptionally stable performance, and although it does not always outperform a repeatedly fully-finetuned HuBERT model or a frozen HuBERT model with a repeatedly trained classifier head, it produces comparable performance to the latter while adapting 2000 times faster, having been trained on less than half of the data for two orders of magnitude less time.

2604.20329 2026-06-05 cs.CV cs.AI 版本更新

Image Generators are Generalist Vision Learners

图像生成器是通用视觉学习者

Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, Radu Soricut

发表机构 * Google(谷歌)

AI总结 本文研究了图像生成器在视觉理解中的通用学习能力,通过引入Vision Banana模型,展示了图像生成训练如何像语言模型预训练一样,使模型在多种视觉任务中取得最佳性能,证明了图像生成预训练在构建基础视觉模型中的核心作用。

Comments Project Page: http://vision-banana.github.io

详情
AI中文摘要

近期的研究表明,图像和视频生成器表现出零样本视觉理解行为,这种行为类似于大型语言模型(LLM)通过生成式预训练发展出语言理解和推理的新兴能力。尽管长期以来人们推测能够生成视觉内容意味着能够理解它,但缺乏证据表明生成式视觉模型已发展出强大的理解能力。在本文中,我们证明图像生成训练的作用类似于LLM预训练,使模型学习到强大的、通用的视觉表示,从而在各种视觉任务中取得最先进的性能。我们引入了Vision Banana,一个通过指令微调Nano Banana Pro(NBP)在原始训练数据和少量视觉任务数据混合中构建的通用模型。通过将视觉任务的输出空间参数化为RGB图像,我们无缝地将感知重新框架为图像生成。我们的通用模型Vision Banana在涉及2D和3D理解的多种视觉任务中取得了最先进的结果,超越或匹敌零样本领域专家,包括Segment Anything Model 3在分割任务中的表现,以及Depth Anything系列在度量深度估计中的表现。我们展示了这些结果可以通过轻量级指令微调实现,而不牺牲基础模型的图像生成能力。优越的结果表明图像生成预训练是一种通用视觉学习者。它还表明图像生成是视觉任务的统一和通用接口,类似于文本生成在语言理解和推理中的作用。我们正见证计算机视觉中的重大范式转变,其中生成式视觉预训练在构建生成和理解的基础视觉模型中发挥核心作用。

英文摘要

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.

2605.13830 2026-06-05 cs.AI cs.LG 版本更新

Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach

对树集成模型的敏感性量化:一种符号和组合方法

Ajinkya Naik, Chaitanya Garg, S. Akshay, Ashutosh Gupta, Kuldeep S. Meel

发表机构 * Indian Institute of Technology Bombay(印度理工学院孟买分校) University of Toronto(多伦多大学)

AI总结 本文提出了一种针对树集成模型的敏感性量化方法,通过离散化输入空间并枚举易受敏感性影响的区域,结合代数决策图(ADD)编码和分拆子问题,实现高效计算。实验表明,所提工具XCount在速度和可扩展性方面优于其他方法。

详情
AI中文摘要

决策树集成(DTE)是一种广泛应用于AI分类任务的流行模型,用于多个安全关键领域,因此对这些模型的验证已成为过去十年的研究热点之一。其中一个问题就是敏感性问题,它询问给定一个DTE,是否一小部分特征的变化会导致输入的误分类。在本工作中,我们的目标是构建一个针对DTEs的定量敏感性概念,通过离散化模型的输入空间并枚举易受敏感性影响的区域。我们提出了一种新的算法技术,可以在保证认证误差和置信度范围内高效地完成此计算。我们的方法基于将问题编码为代数决策图(ADD),并进一步将其拆分为可高效解决的子问题,使计算成为组合和可扩展的。我们在不同规模的基准上评估了我们的技术的性能,与相同问题编码下的模型计数器进行比较。实验结果表明,我们的工具XCount在速度上显著优于其他方法,并且在集成规模增加时表现良好。

英文摘要

Decision tree ensembles (DTEs) are a popular model for a wide range of AI classification tasks, used in multiple safety-critical domains, and hence verifying properties of these models has been an active topic of study over the last decade. One such verification question is the problem of sensitivity, which asks, given a DTE, whether a small change in a subset of features can lead to misclassification of the input. In this work, our focus is to build a quantitative notion of sensitivity, tailored to DTEs, by discretizing the input space of the model and enumerating the regions which are susceptible to sensitivity. We propose a novel algorithmic technique that can perform this computation efficiently, within a certified error and confidence bound. Our approach is based on encoding the problem as an algebraic decision diagram (ADD), and further splitting it into subproblems that can be solved efficiently and make the computation compositional and scalable. We evaluate the performance of our technique over benchmarks of varying size in terms of number of trees and depth, comparing it against the performance of model counters over the same problem encoding. Experimental results show that our tool XCount achieves significant speedup over other approaches and scales well with the increasing sizes of the ensembles.
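
To make the quantitative notion concrete, here is a brute-force sketch that counts sensitive cells on a small discretized grid. It is an exhaustive baseline for illustration only, not the ADD-based method behind XCount; the toy ensemble of stumps, the grid size, and the `count_sensitive` helper are all assumptions.

```python
from itertools import product

# Toy ensemble: each stump is (feature_index, threshold) and votes 1 if x[f] > t.
STUMPS = [(0, 2), (1, 1), (0, 3)]


def predict(x):
    """Majority vote over the stumps."""
    votes = sum(1 for f, t in STUMPS if x[f] > t)
    return int(votes * 2 > len(STUMPS))


def count_sensitive(grid_size, feature, radius):
    """Count grid cells whose label flips when `feature` moves by <= radius steps."""
    sensitive = 0
    total = 0
    for x in product(range(grid_size), repeat=2):
        total += 1
        label = predict(x)
        for delta in range(-radius, radius + 1):
            v = x[feature] + delta
            if not 0 <= v < grid_size:
                continue  # stay inside the discretized input space
            y = list(x)
            y[feature] = v
            if predict(y) != label:
                sensitive += 1
                break
    return sensitive, total


print(count_sensitive(grid_size=5, feature=0, radius=1))
```

On this 5x5 grid, 10 of the 25 cells are sensitive to a one-step change in feature 0. Enumeration like this grows exponentially with the number of features, which is the scaling problem that symbolic encodings aim to avoid.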

2605.05367 2026-06-05 cs.CV cs.AI 版本更新

Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular Video

Tamaththul3D: 从单目视频高保真重建沙特手语3D虚拟形象

Eyad Alghamdi, Sattam Altuuaim, Obay Ghulam, Abdulrahman Qutah, Yousef Basoodan

发表机构 * University of Jeddah(吉达大学) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 本文提出Tamaththul3D方法,通过几何逆运动学对前臂链进行对齐,结合2D监督肩部优化,实现了阿拉伯语手语的高保真3D虚拟形象重建,并在五个不同语言类型的手语数据集上实现了泛化能力。

详情
AI中文摘要

现有的3D手语虚拟形象重建方法仅在西方手语上开发和评估,且目前没有任何阿拉伯手语数据集带有3D参数化标注,这一空白阻碍了面向阿拉伯聋人社区的基于虚拟形象的无障碍应用的发展。我们为Ishara-500沙特手语数据集发布了首个SMPL-X参数化标注,使阿拉伯手语的定量评估和下游手语生成成为可能。我们提出Tamaththul3D,这是一种重建流程:先在前臂链上通过几何逆运动学对齐手部和身体估计,再进行2D监督的肩部优化。这种闭式整合与身体和手部估计器的具体选择解耦:任何兼容SMPL-X的身体估计器和任何兼容MANO的手部估计器均可替换,我们通过独立替换每个模块证明了这一点。Tamaththul3D的手部误差比先前方法最多降低32%,运行速度比最强基线快32倍,并且无需针对特定数据集进行适配即可泛化到五种类型学上不同的手语。

英文摘要

Existing 3D sign language avatar reconstruction methods are developed and evaluated exclusively on Western sign languages, and no 3D parametric annotations exist for any Arabic Sign Language dataset, a gap that blocks the development of avatar-based accessibility applications for the Arab Deaf community. We release the first SMPL-X parametric annotations for the Ishara-500 Saudi Sign Language dataset, enabling quantitative evaluation and downstream sign language generation for Arabic Sign Language. We introduce Tamaththul3D, a reconstruction pipeline that aligns hand and body estimates through geometric inverse kinematics on the forearm chain followed by 2D-supervised shoulder refinement. The closed-form integration is decoupled from the specific choice of body and hand estimators: any SMPL-X-compatible body estimator and any MANO-compatible hand estimator can be substituted, as we demonstrate by swapping each module independently. Tamaththul3D achieves up to 32% lower hand error than prior methods, runs 32x faster than the strongest baseline, and generalizes across five typologically distinct sign languages without dataset-specific adaptation.

2605.12376 2026-06-05 cs.AI 版本更新

ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows

ProfiliTable: 通过智能体工作流实现剖析驱动的表格数据处理

Wei Liu, Yang Gu, Xi Yan, Zihan Nan, Beicheng Xu, Keyao Ding, Bin Cui, Wentao Zhang

发表机构 * School of CS & Key Lab of High Confidence Software Technologies (MOE), Peking University(计算机科学学院及高可信软件技术重点实验室(教育部),北京大学) Academy for Advanced Interdisciplinary Studies, Peking University(先进跨学科研究学院,北京大学) Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) School of Software & Microelectronics, Peking University(软件与微电子学院,北京大学) School of CS & Beijing Key Laboratory of Software and Hardware Cooperative Artificial Intelligence Systems, Peking University(计算机科学学院及北京软件与硬件协同人工智能系统重点实验室,北京大学) Center for Machine Learning Research, Peking University(机器学习研究中心,北京大学)

AI总结 本研究提出ProfiliTable,一种以动态剖析(profiling)为核心的多智能体工作流框架,用于解决表格数据处理中的模糊指令、复杂任务结构和缺乏结构化反馈问题,通过交互探索、知识增强合成和反馈驱动细化,实现闭环优化,实验表明其在复杂多步骤场景中优于现有基线方法。

详情
AI中文摘要

表格处理(包括清洗、转换、增强和匹配)是现实数据管道中基础但易出错的阶段。尽管近期基于LLM的方法在自动化此类任务方面展现出潜力,但在实践中常因指令模糊、任务结构复杂和缺乏结构化反馈而表现不佳,导致生成的代码语法正确但语义有误。为应对这些挑战,我们提出了ProfiliTable,一种以动态剖析(profiling)为核心的自主多智能体框架,它通过交互式探索、知识增强的合成和反馈驱动的细化,构建并迭代完善统一的执行上下文。ProfiliTable集成了(i)一个执行ReAct风格数据探索以建立语义理解的Profiler;(ii)一个检索精选算子以合成任务感知代码的Generator;以及(iii)一个Evaluator-Summarizer循环,通过注入执行评分和诊断洞察实现闭环细化。在覆盖18种表格任务类型的多样化基准上进行的大量实验表明,ProfiliTable始终优于强基线方法,在复杂的多步骤场景中尤为明显。这些结果凸显了动态剖析在将模糊的用户意图可靠地转化为稳健且符合治理要求的表格转换方面的关键作用。

英文摘要

Table processing, including cleaning, transformation, augmentation, and matching, is a foundational yet error-prone stage in real-world data pipelines. While recent LLM-based approaches show promise for automating such tasks, they often struggle in practice due to ambiguous instructions, complex task structures, and the lack of structured feedback, resulting in syntactically correct but semantically flawed code. To address these challenges, we propose ProfiliTable, an autonomous multi-agent framework centered on dynamic profiling, which constructs and iteratively refines a unified execution context through interactive exploration, knowledge-augmented synthesis, and feedback-driven refinement. ProfiliTable integrates (i) a Profiler that performs ReAct-style data exploration to build semantic understanding, (ii) a Generator that retrieves curated operators to synthesize task-aware code, and (iii) an Evaluator-Summarizer loop that injects execution scores and diagnostic insights to enable closed-loop refinement. Extensive experiments on a diverse benchmark covering 18 tabular task types demonstrate that ProfiliTable consistently outperforms strong baselines, particularly in complex multi-step scenarios. These results highlight the critical role of dynamic profiling in reliably translating ambiguous user intents into robust and governance-compliant table transformations.

2504.10063 2026-06-05 cs.CL cs.AI math.AT 版本更新

Hallucination Detection in LLMs with Topological Divergence on Attention Graphs

基于注意力图拓扑分歧的LLM幻觉检测

Alexandra Bazarova, Andrei Volodichev, Aleksandr Yugay, Andrey Shulga, Alina Ermilova, Konstantin Polev, Julia Belikova, Rauf Parchiev, Dmitry Simakov, Maxim Savchenko, Andrey Savchenko, Serguei Barannikov, Alexey Zaytsev

发表机构 * Applied AI Institute(应用人工智能研究所) SB AI Lab(SB人工智能实验室) HSE University(俄罗斯高等经济大学) CNRS, Université Paris Cité(法国国家科学研究中心,巴黎西岱大学)

AI总结 本文提出TOHA方法,通过分析注意力矩阵的拓扑结构来检测LLM中的幻觉现象,实验表明该方法在多个基准测试中表现优异,且对标注数据和计算资源需求较低。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

详情
AI中文摘要

幻觉,即生成事实性错误内容,仍然是大型语言模型(LLMs)面临的关键挑战。我们介绍了TOHA,一种基于拓扑的幻觉检测器,在RAG设置中,该方法利用拓扑分歧度度量来量化由注意力矩阵诱导的图的结构特性。检查提示与响应子图之间的拓扑分歧揭示了一致的模式:特定注意力头中较高的分歧值与幻觉输出相关,且与数据集无关。广泛的实验,包括问题回答和摘要任务的评估,表明我们的方法在多个基准测试中实现了最先进的或具有竞争力的结果,同时需要最少的标注数据和计算资源。我们的发现表明,分析注意力矩阵的拓扑结构可以作为LLMs事实可靠性的一种高效且稳健的指标。

英文摘要

Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments, including evaluation on question answering and summarization tasks, show that our approach achieves state-of-the-art or competitive results on several benchmarks while requiring minimal annotated data and computational resources. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.

2605.11632 2026-06-05 cs.CL cs.AI 版本更新

Macro: Enhancing Multilingual Counterfactual Explanations through Alignment-as-Preference Optimization

Macro: 通过偏好对齐优化提升多语言反事实解释

Yilong Wang, Qianli Wang, Bohao Chu, Yihong Liu, Jing Yang, Simon Ostermann

发表机构 * Technische Universität Berlin(柏林工业大学) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心) University of Duisburg-Essen(杜伊斯堡-埃森大学) LMU Munich(慕尼黑大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Saarland Informatics Campus(萨尔兰信息学园区) BIFOLD – Berlin Institute for the Foundations of Learning and Data(柏林学习与数据基础研究所) Centre for European Research in Trusted AI (CERTAIN)(欧洲可信人工智能研究中心)

AI总结 本文提出Macro框架,通过直接偏好优化改进多语言反事实解释,提升有效性的同时保持最小性,避免翻译基线的严重最小性问题,并在多个指标上优于监督微调方法。

Comments In submission

详情
AI中文摘要

自我生成的反事实解释(SCEs)是大型语言模型(LLMs)生成的最小修改输入(minimality),通过翻转自身预测(validity)来揭示黑箱LLM行为,提供因果基础的解释方法。然而,将其扩展到非主导语言仍具挑战性:现有方法难以生成有效SCEs,且有效性与最小性之间的权衡影响解释质量。我们引入Macro,一种偏好对齐框架,通过直接偏好优化(DPO)进行多语言SCE生成,使用复合评分函数构建偏好对,将权衡转化为可测量的偏好信号。在四个LLM和七个语言类型多样的语言上进行实验,结果显示,Macro在平均情况下比链式思维基线提高了12.55%的有效性,同时不降低最小性,避免了翻译基线的严重最小性问题。与监督微调相比,Macro在两个指标上表现更优,证实了显式偏好优化对于平衡此权衡的重要性。进一步分析显示,Macro增强了跨语言扰动对齐并缓解了常见生成错误。我们的结果突显了偏好优化作为提升多语言模型解释的有前途方向。

英文摘要

Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.
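
The abstract does not specify Macro's composite scoring function, so the following is only a hypothetical illustration of how a validity-plus-minimality score could rank candidates and yield chosen/rejected pairs for DPO-style training. The keyword classifier, the equal weighting, the token-level similarity, and the margin are all assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def toy_classifier(text):
    """Hypothetical stand-in for the LLM's own prediction on a text."""
    tokens = text.split()
    return "negative" if "not" in tokens or "bad" in tokens else "positive"


def composite_score(original, candidate, weight=0.5):
    """Blend validity (label flipped) with minimality (token-level similarity)."""
    flipped = toy_classifier(candidate) != toy_classifier(original)
    validity = 1.0 if flipped else 0.0
    similarity = SequenceMatcher(None, original.split(), candidate.split()).ratio()
    return weight * validity + (1.0 - weight) * similarity


def preference_pairs(original, candidates, margin=0.1):
    """Pair each higher-scoring candidate with clearly lower-scoring ones."""
    scored = sorted(((composite_score(original, c), c) for c in candidates),
                    reverse=True)
    pairs = []
    for (s_hi, hi), (s_lo, lo) in combinations(scored, 2):
        if s_hi - s_lo >= margin:  # skip near-ties, which carry little signal
            pairs.append({"prompt": original, "chosen": hi, "rejected": lo})
    return pairs


original = "the movie was good"
candidates = [
    "the movie was bad",          # flips the label with a one-token edit
    "the movie was good indeed",  # small edit, but the label does not flip
    "this film is bad",           # flips the label, but rewrites everything
]
pairs = preference_pairs(original, candidates)
for p in pairs:
    print(p["chosen"], ">", p["rejected"])
```

In this toy run the one-token edit is preferred over both alternatives, while the two weaker candidates score too closely to form a pair.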

2605.09192 2026-06-05 cs.AI 版本更新

Evidence Over Plans: Online Trajectory Verification for Skill Distillation

证据胜于计划:用于技能蒸馏的在线轨迹验证

Yang Zhou, Zihan Dong, Zhenting Wang, Can Jin, Shiyu Zhao, Bangwei Guo, Difei Gu, Linjun Zhang, Mu Zhou, Dimitris N. Metaxas

发表机构 * Rutgers University(罗格斯大学)

AI总结 本文提出了一种基于轨迹的后验蒸馏指数(PDI)来评估技能与任务环境证据的契合度,通过SPARK框架实现环境验证轨迹的生成,从而提升技能的效率和可迁移性。

详情
AI中文摘要

代理技能可以通过使用人类编写的程序性文档显著提高任务成功率,但其质量在没有环境基础验证的情况下难以评估。现有的技能生成方法严重依赖于偏好日志而不是直接的环境交互,通常产生微不足道甚至退化的收益。我们发现这是一个根本的时间瓶颈:稳健的技能应基于后验,从经验环境交互中蒸馏,而不是先验计划。在本研究中,我们引入了后验蒸馏指数(PDI),这是一个轨迹级指标,量化了蒸馏技能与任务-环境证据的契合程度。为了操作化PDI,我们提出了SPARK(用于自主可运行任务和技能生成的结构化流程),以保留任务执行证据以实现全面的轨迹级分析。SPARK生成用于计算PDI的环境验证轨迹,并将其用作在线诊断和干预信号,以确保后验技能的形成。在86个可运行任务上,SPARK生成的技能始终优于无技能基线,并在学生模型上优于人工编写技能(推理成本比教师模型低高达1000倍)。这些发现表明,PDI引导的蒸馏产生了高效且可迁移的技能,这些技能基于任务-环境交互。我们发布代码在https://github.com/EtaYang10th/spark-skills。

英文摘要

Agent skills can remarkably improve task success rates by using human-written procedural documents, but their quality is difficult to assess without environment-grounded verification. Existing skill generation methods heavily rely on preference logs rather than direct environment interaction, often yielding negligible or even degraded gains. We identify this as a fundamental timing bottleneck: robust skills should be posterior-based, distilled from empirical environment interaction rather than prior plans. In this study, we introduce the Posterior Distillation Index (PDI), a trajectory-level metric that quantifies how well a distilled skill is grounded in the task-environment evidence. To operationalize PDI, we present SPARK (Structured Pipelines for Autonomous Runnable tasKs and sKill generation), which preserves task execution evidence for full trajectory-level analysis. SPARK generates environment-verified trajectories used to compute PDI, and it applies PDI as an online diagnostic and intervention signal to ensure posterior skill formation. Across 86 runnable tasks, SPARK-generated skills consistently surpass no-skill baselines and outperform human-written skills on student models (inference cost up to 1,000x cheaper than teacher models). These findings show that PDI-guided distillation produces efficient and transferable skills grounded in the task-environment interaction. We release our code at https://github.com/EtaYang10th/spark-skills.

2605.08318 2026-06-05 cs.LG cs.AI cs.NA math.NA physics.comp-ph stat.ML 版本更新

When Attention Beats Fourier: Multi-Scale Transformers for PDE Solving on Irregular Domains

当注意力胜过傅里叶:用于不规则域上的PDE求解的多尺度变换器

Brandon Yee, Pairie Koh, Jack Rodriguez, Mihir Tekal

发表机构 * Physics Lab, Yee Collins Research Group(Yee Collins研究组物理实验室)

AI总结 本文研究了深度学习模型在求解偏微分方程(PDE)时的架构选择问题,探讨了基于学习注意力的变换器架构在何时优于傅里叶域神经算子。引入了多尺度注意力变换器(MSAT),该架构将时空解的历史编码为令牌序列,并通过复合监督目标进行端到端训练。在五个基准问题上,与九种基线方法(包括物理信息神经网络、神经算子和状态空间模型)进行了全面的实证评估,展示了在复杂几何问题上的最佳泛化能力。

Comments Substantial Revision Required

详情
AI中文摘要

我们研究了用于求解偏微分方程(PDE)的深度学习模型的架构选择问题,探讨基于学习注意力的变换器架构在何时优于傅里叶域神经算子。我们提出多尺度注意力变换器(MSAT),这是一种深度学习架构,将时空解的历史编码为令牌序列,并通过带有可选物理信息正则项的复合监督目标进行端到端训练。我们在PINNacle套件的五个基准问题上,与九种基线方法(包括物理信息神经网络PINNs,神经算子FNO、DeepONet、GNOT,以及状态空间模型Mamba-NO)进行了全面的实证对比,所有方法使用相同的训练/测试划分和参考数据。MSAT在复杂几何问题上实现了最先进的泛化能力(Heat2D-CG上的L²相对误差为0.0101,相比FNO改进3.7倍),总推理时间为34秒,而Mamba-NO为120,812秒。对物理正则化组件的消融研究揭示了明确的归纳偏置权衡:物理先验降低了扩散主导问题的测试误差,但会损害混沌和回流流态下的泛化能力,从而直接刻画了先验设定错误的边界。以域边界复杂度κ为自变量的近似误差界为这些实证发现提供了理论基础,并给出了一条有原则的架构选择规则。

英文摘要

We study the problem of architecture selection for deep learning models trained to solve partial differential equations (PDEs), asking when transformer-based architectures with learned attention outperform Fourier-domain neural operators. We introduce the Multi-Scale Attention Transformer (MSAT), a deep learning architecture that encodes spatiotemporal solution histories as token sequences and trains end-to-end via a composite supervised objective with optional physics-informed regularization terms. We conduct a comprehensive empirical evaluation against nine baselines -- including physics-informed neural networks (PINNs), neural operators (FNO, DeepONet, GNOT), and state-space models (Mamba-NO) -- across five benchmark problems from the PINNacle suite, using identical train/test splits and reference data for all methods. MSAT achieves state-of-the-art generalization on complex geometry problems ($L^2_\mathrm{rel} = 0.0101$ on Heat2D-CG, a $3.7\times$ improvement over FNO) at $34\,\mathrm{s}$ total inference vs. $120{,}812\,\mathrm{s}$ for Mamba-NO. Ablation studies over the physics regularization component reveal a precise inductive bias tradeoff: physics priors reduce test error on diffusion-dominated problems but degrade generalization on chaotic and recirculating-flow regimes, directly characterizing the prior misspecification boundary. Approximation error bounds as a function of domain boundary complexity $κ$ provide a theoretical basis for these empirical findings and a principled rule for architecture selection.
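
The relative $L^2$ error quoted above is commonly defined as the norm of the prediction error divided by the norm of the reference solution. The sketch below is a minimal pure-Python version under that assumption; the function name `relative_l2_error` and the sample values are illustrative and do not come from the paper.

```python
import math


def relative_l2_error(pred, ref):
    """Return ||pred - ref||_2 / ||ref||_2 for two equal-length sequences."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    if den == 0.0:
        raise ValueError("reference solution has zero norm")
    return num / den


# Hypothetical solution values at four mesh nodes (not data from the paper).
reference = [1.0, 2.0, 2.0, 4.0]
prediction = [1.0, 2.0, 2.0, 4.5]
print(relative_l2_error(prediction, reference))

# Ratio of the two total inference times quoted in the abstract.
print(120812 / 34)
```

Normalizing by the reference norm makes errors comparable across problems whose solutions differ in magnitude, which is presumably why a relative metric is reported across the five benchmarks.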

2605.08253 2026-06-05 cs.LG cs.AI 版本更新

Path-Coupled Bellman Flows for Distributional Reinforcement Learning

用于分布强化学习的路径耦合贝尔曼流

Boyang Xu, Qing Zou, Siqin Yang, Hao Yan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出路径耦合贝尔曼流(PCBF),一种连续时间的分布强化学习方法,利用流匹配学习回报分布,以解决现有基于流的方法在流源处的边界不匹配和高方差自举问题;实验表明其在分布保真度和训练稳定性方面有所提升。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
AI中文摘要

分布强化学习(DRL)对完整的回报分布进行建模,但现有的有限支撑或基于分位数的方法依赖投影,而近期基于流的方法可能在流源处出现边界不匹配,或在当前噪声与后继噪声相互独立时出现高方差的自举。我们提出路径耦合贝尔曼流(PCBF),一种连续时间DRL方法,利用流匹配并沿源一致的贝尔曼耦合路径学习回报分布:当前路径在t=0从所需的基先验出发,在t=1到达贝尔曼目标,并在中间时刻与后继流保持逐路径的仿射关系(不要求时刻t的边缘分布对所有t都满足分布贝尔曼不动点)。PCBF通过共享的基噪声耦合当前与后继回报流,并使用由λ参数化的控制变量目标:λ=0恢复无偏的样本贝尔曼目标,而λ>0以可控的偏差换取方差降低。在可解析求解的MRP、OGBench和D4RL上的实验表明,PCBF提升了分布保真度和训练稳定性,并取得了有竞争力的离线强化学习性能。

英文摘要

Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from boundary mismatch at the flow source or from high-variance bootstrapping when current and successor noises are independent. We propose Path-Coupled Bellman Flows (PCBF), a continuous-time DRL method that learns return distributions with flow matching using source-consistent Bellman-coupled paths: the current path starts from the required base prior at $t=0$, reaches the Bellman target at $t=1$, and maintains a pathwise affine relation to the successor flow at intermediate times (without requiring time-$t$ marginals to satisfy a distributional Bellman fixed point for all $t$). PCBF couples current and successor return flows through shared base noise and uses a $λ$-parameterized control-variate target: $λ=0$ recovers an unbiased sample Bellman target, while $λ>0$ trades controlled bias for variance reduction. Experiments on analytically tractable MRPs, OGBench, and D4RL show improved distributional fidelity and training stability, and competitive offline RL performance.
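
Why sharing base noise can lower bootstrapping variance is easiest to see in a toy setting. The sketch below is not the PCBF objective and does not implement the $λ$-parameterized target; it assumes, purely for illustration, that current and successor return samples are scaled Gaussian noise, and compares the variance of the bootstrapped residual when the two draws are shared versus independent. All names and parameter values are hypothetical.

```python
import random
import statistics


def analytic_variance(gamma, s_cur, s_next, shared):
    """Variance of gamma * s_next * eps' - s_cur * eps for standard normal noise."""
    if shared:  # eps' == eps, so the two terms partially cancel
        return (gamma * s_next - s_cur) ** 2
    return (gamma * s_next) ** 2 + s_cur ** 2  # independent draws add up


def residual_samples(n, gamma, s_cur, s_next, shared, seed=0):
    """Monte Carlo samples of the toy residual (the constant reward is omitted)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        eps_next = eps if shared else rng.gauss(0.0, 1.0)
        current = s_cur * eps          # toy current-state return sample
        successor = s_next * eps_next  # toy successor-state return sample
        samples.append(gamma * successor - current)
    return samples


for shared in (True, False):
    empirical = statistics.pvariance(residual_samples(20000, 0.9, 1.0, 1.0, shared))
    print(shared, analytic_variance(0.9, 1.0, 1.0, shared), empirical)
```

In this toy case the shared-noise residual has variance $(γ s' - s)^2$ while the independent one has $γ^2 s'^2 + s^2$, so coupling helps most when the two flows are similar.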

2605.07482 2026-06-05 cs.LG cs.AI 版本更新

SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion

SHRED: 通过自蒸馏与logit降级实现无保留集的去记忆

Zizhao Hu, Ameya Godbole, Johnny Tian-Zheng Wei, Mohammad Rostami, Jesse Thomason, Robin Jia

发表机构 * University of Southern California(南加州大学) USC Information Sciences Institute(USC信息科学研究所)

AI总结 本文提出了一种无需保留集的去记忆方法SHRED,通过自蒸馏与logit降级,在去记忆的同时保持模型的实用性,优于传统需要保留集的方法。

详情
AI中文摘要

针对大语言模型(LLMs)的机器去记忆问题,旨在选择性地移除记忆中的内容,如私人数据、受版权文本或危险知识,而无需昂贵的全量重新训练。现有大多数方法需要一个经过精心挑选的保留集以防止一般模型用途的灾难性退化,这会增加额外的数据依赖性,使部署复杂化。我们提出SHRED(通过高惊奇度的无保留集熵降低的自蒸馏),一种无需保留集的去记忆方法,基于一个关键洞察:并非所有遗忘集实例中的token都同等地包含记忆信息。高信息token集中了模型的记忆知识,而低信息token反映了一般语言能力。SHRED分为两个阶段。(1)选择:我们对遗忘集实例进行前向传递,收集每个token的自回归概率,并选择底部(最低概率,最高香农信息)作为遗忘位置;剩余位置保留为良性锚点。(2)训练:我们构建了修改的KL目标,降低记忆token在遗忘位置的logit,同时在良性位置保持原始分布。模型通过单一的顶部KL自蒸馏目标进行训练,同时驱动遗忘和实用性保持。我们评估了SHRED在四个标准去记忆基准上的表现,并证明其在遗忘效果和模型实用性之间建立了新的帕累托最优权衡,优于保留集依赖的方法。我们的分析显示,SHRED对重新学习攻击和成员推断攻击具有鲁棒性,并且在多次连续去记忆运行后仍能保持稳定的实用性。

英文摘要

Machine unlearning for large language models (LLMs) aims to selectively remove memorized content such as private data, copyrighted text, or hazardous knowledge, without costly full retraining. Most existing methods require a retain set of curated examples to prevent catastrophic degradation of general model utility, creating an extra data dependency that complicates deployment. We propose SHRED (Self-distillation via High-surprisal-only Retain-set-free Entropy Demotion), a retain-set-free unlearning method built on a key insight: not all tokens within a forget set instance carry memorized information equally. High-information tokens concentrate the model's memorized knowledge, while low-information tokens reflect general language competence. SHRED operates in two stages. (1) Selection: We perform a forward pass on a forget set instance, collect per-token autoregressive probabilities, and select the bottom (lowest probability, highest Shannon information) as forget positions; the remaining positions are retained as benign anchors. (2) Training: We construct modified KL targets that demote the memorized token's logit at forget positions while preserving the original distribution at benign positions. The model is then trained via a single top KL self-distillation objective that simultaneously drives forgetting and utility preservation. We evaluate SHRED across four standard unlearning benchmarks and demonstrate that it establishes a new Pareto-optimal trade-off between forget efficacy and model utility, outperforming retain-set-dependent methods. Our analysis shows that SHRED is robust against relearning attacks and membership-inference attacks, and it maintains stable utility even after many sequential unlearning runs.
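
The two stages described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the selection fraction, the size of the logit penalty, and both function names are assumptions, since the abstract does not give them. Token probabilities are assumed to be strictly positive.

```python
import math


def select_forget_positions(token_probs, fraction):
    """Indices of the lowest-probability (highest-surprisal) tokens.

    `fraction` is a hypothetical hyperparameter; the abstract does not state it.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, math.ceil(fraction * len(token_probs)))
    surprisal = [-math.log2(p) for p in token_probs]  # Shannon information in bits
    ranked = sorted(range(len(token_probs)), key=lambda i: surprisal[i], reverse=True)
    return sorted(ranked[:k])


def demoted_target(logits, memorized_id, penalty):
    """Softmax of the logits after lowering the memorized token's logit.

    `penalty` is an illustrative constant, not a value from the paper.
    """
    adjusted = list(logits)
    adjusted[memorized_id] -= penalty
    peak = max(adjusted)
    exps = [math.exp(x - peak) for x in adjusted]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probs = [0.9, 0.05, 0.8, 0.01, 0.7]  # hypothetical per-token probabilities
print(select_forget_positions(probs, 0.4))
print(demoted_target([2.0, 1.0, 0.0], memorized_id=0, penalty=5.0))
```

Positions not returned by `select_forget_positions` would keep the model's original distribution as their target, which is how a single distillation loss can cover both forgetting and utility preservation.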

2605.07096 2026-06-05 cs.LG cs.AI stat.ME 版本更新

Query-efficient model evaluation using cached responses

利用缓存响应实现查询高效的模型评估

Hayden Helm, Ben Johnson, Carey Priebe

发表机构 * University of Maryland(马里兰大学)

AI总结 本文提出了一种基于数据核视角空间(DKPS)的方法,利用已缓存的模型响应来预测基准性能,从而减少评估新模型所需的查询数量,提高了模型评估的效率。

详情
AI中文摘要

在部署新模型之前,评估其在现有基准上的表现通常是必要的。对于现代评估框架来说,生成并评估所有查询的响应可能成本过高。实际上,先前评估模型的响应往往被缓存——这为利用此额外信息来减少准确评估新模型所需查询数量提供了潜在机会。在本文中,我们介绍了一种预测基准性能的方法,该方法利用缓存的模型响应,基于数据核视角空间(DKPS),一种在黑箱设置下量化模型之间关系的方法。理论上,我们证明了基于DKPS的方法在某些条件下是查询高效的。实证上,我们展示了基于DKPS的方法在查询预算大幅减少的情况下,能够达到与基线相同的平均绝对误差。最后,我们提出了一种离线方法,用于选择一组查询,以最大化参考模型上的拟合质量,从而在随机查询选择的基础上提高预测准确性。

英文摘要

Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previously-evaluated models are often cached -- creating a potential opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance that leverages cached model responses based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines with a substantially decreased query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.

2604.26269 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Calibrated Surprise: An Information-Theoretic Account of Creative Quality

校准的惊喜:一种信息论视角下的创造性质量

Bo Zou, Chao Xu

发表机构 * Bo Zou(邹波) Chao Xu(徐超)

AI总结 本文提出了一种信息论框架,用于评估创造性写作的质量,通过校准的惊喜概念,结合香农互信息理论,量化了高质量文本与降质文本之间的差异。

Comments 28 pages, 3 figures

详情
AI中文摘要

在大型语言模型时代,创造性写作的质量缺乏可计算的理论基础。主流方法是评分标准——将整体审美判断分解为子评分,以及通过RLHF偏好信号——用群体投票代替质量。这两种方法都绕过了文本本身的统计结构。本文提供了一种信息论基础,填补这一空白。我们提出了'校准的惊喜'作为优秀创造性写作的信息论本质。这种判断符合阅读直觉并涵盖了其对立面。这种文学判断可以精确地进行数学公式化。在完全维度约束Y下,可行的写作选择被强制进入极狭窄的空间。稀有的幸存者,从无约束的视角来看,恰好是最不可预测的选择。两者都通过香农互信息I(X;Y) = H(X) - H(X|Y)精确测量——'校准'对应H(X|Y)接近0;'惊喜'对应H(X)升高。公式的减法结构自然地将'有根据的惊喜'与'纯噪声'分开。我们使用Qwen1.5-7B的token级logprobs作为理想读者概率分布的操作代理。在20对(12中文/8英文)的高质量与系统降质文学段落中,20/20对支持核心预测:高质量段落的I(X;Y)系统性地高于其降质版本。

英文摘要

In the era of large language models, creative writing quality lacks a computable theoretical anchor. The dominant approaches are rubric scoring -- decomposing holistic aesthetic judgment into sub-scores -- and RLHF preference signals -- replacing quality with group votes. Both bypass the statistical structure of the text itself. This paper provides an information-theoretic foundation to fill this gap. We propose 'calibrated surprise' as the information-theoretic essence of excellent creative writing. This judgment matches reading intuition and covers its opposite. This literary judgment admits a precise mathematical formulation. Under full-dimensional constraints Y, feasible writing choices are forced into an extremely narrow space. The rare survivors are, from the unconstrained perspective, exactly the least predictable choices. Both are measured precisely by Shannon mutual information I(X;Y) = H(X) - H(X|Y) -- 'calibrated' corresponds to H(X|Y) approaching 0; 'surprising' corresponds to H(X) going high. The subtraction structure of the formula naturally separates 'well-grounded surprise' from 'pure noise'. We use token-level logprobs from Qwen1.5-7B as an operational proxy for the ideal reader's probability distribution. Across 20 pairs (12 Chinese / 8 English) of high-quality vs. systematically degraded literary passages, 20/20 pairs support the core prediction: high-quality passages have systematically higher I(X;Y) than their degraded versions.
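
The identity $I(X;Y) = H(X) - H(X|Y)$ can be checked numerically on a small discrete joint distribution. The sketch below is a toy illustration only: the paper estimates these quantities from token-level logprobs, whereas this example assumes an explicit joint table, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y), where joint[x][y] = P(X=x, Y=y)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_x = entropy(p_x)
    h_x_given_y = 0.0
    for y, prob_y in enumerate(p_y):
        if prob_y > 0:
            conditional = [row[y] / prob_y for row in joint]  # P(X | Y=y)
            h_x_given_y += prob_y * entropy(conditional)
    return h_x - h_x_given_y


independent = [[0.25, 0.25], [0.25, 0.25]]  # Y says nothing about X
determined = [[0.5, 0.0], [0.0, 0.5]]       # Y pins X down exactly
print(mutual_information(independent))
print(mutual_information(determined))
```

The second table is the "calibrated surprise" extreme in miniature: $H(X)$ is at its maximum of one bit while $H(X|Y)$ is zero, so all of the unpredictability is accounted for by the constraint.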

2604.21017 2026-06-05 cs.RO cs.AI 版本更新

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

Open-H-Embodiment: 一个大规模数据集,用于在医疗机器人中启用基础模型

Open-H-Embodiment Consortium, Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo, Ayberk Acar, Farshid Alambeigi, Carlo Alberto Ammirati, Yunke Ao, Pablo David Aranda Rodriguez, Soofiyan Atar, Mattia Ballo, Noah Barnes, Federica Barontini, Filip Binkiewicz, Peter Black, Sebastian Bodenstedt, Leonardo Borgioli, Nikola Budjak, Benjamin Calmé, Fabio Carrillo, Nicola Cavalcanti, Changwei Chen, Haoxin Chen, Sihang Chen, Qihan Chen, Zhongyu Chen, Ziyang Chen, Shing Shin Cheng, Meiqing Cheng, Min Cheng, Zih-Yun Sarah Chiu, Xiangyu Chu, Camilo Correa-Gallego, Giulio Dagnino, Anton Deguet, Jacob Delgado, Jonathan C. DeLong, Kaizhong Deng, Alexander Dimitrakakis, Qingpeng Ding, Hao Ding, Giovanni Distefano, Daniel Donoho, Anqing Duan, Marco Esposito, Shane Farritor, Jad Fayad, Zahi Fayad, Mario Ferradosa, Filippo Filicori, Chelsea Finn, Philipp Fürnstahl, Jiawei Ge, Stamatia Giannarou, Xavier Giralt Ludevid, Frederic Giraud, Aditya Amit Godbole, Ken Goldberg, Antony Goldenberg, Diego Granero Marana, Xiaoqing Guo, Tamás Haidegger, Evan Hailey, Pascal Hansen, Ziyi Hao, Kush Hari, Kengo Hayashi, Jonathon Hawkins, Shelby Haworth, Ortrun Hellig, S. Duke Herrell, Zhouyang Hong, Andrew Howe, Junlei Hu, Zhaoyang Jacopo Hu, Ria Jain, Mohammad Rafiee Javazm, Howard Ji, Rui Ji, Jianmin Ji, Zhongliang Jiang, Dominic Jones, Jeffrey Jopling, Britton Jordan, Ran Ju, Michael Kam, Luoyao Kang, Fausto Kang, Siddhartha Kapuria, Peter Kazanzides, Sonika Kiehler, Ethan Kilmer, Ji Woong Kim, Przemysław Korzeniowski, Chandra Kuchi, Nithesh Kumar, Alan Kuntz, Federico Lavagno, Yu Chung Lee, Hao-Chih Lee, Hang Li, Zhen Li, Xiao Liang, Xinxin Lin, Jinsong Lin, Chang Liu, Fei Liu, Pei Liu, Yun-hui Liu, Wanli Liuchen, Eszter Lukács, Sareena Mann, Miles Mannas, Brett Marinelli, Sabina Martyniak, Francesco Marzola, Lorenzo Mazza, Xueyan Mei, Maria Clara Morais, Luigi Muratore, Chetan Reddy Narayanaswamy, Michał Naskręt, David Navarro-Alarcon, Cyrus Neary, Chi Kit Ng, Christopher Nguan, David Noonan, Ki Hwan Oh, Tom Christian Olesch, Allison M. Okamura, Justin Opfermann, Matteo Pescio, Doan Xuan Viet Pham, Tito Porras, Hongliang Ren, Ariel Rodriguez Jimenez, Ferdinando Rodriguez y Baena, Septimiu E. Salcudean, Asmitha Sathya, Preethi Satish, Lalithkumar Seenivasan, Jiaqi Shao, Yiqing Shen, Yu Sheng, Lucy XiaoYang Shi, Zoe Soulé, Stefanie Speidel, Mingwu Su, Jianhao Su, Idris Sunmola, Kristóf Takács, Yunxi Tang, Patrick Thornycroft, Yu Tian, Jordan Thompson, Mehmet K. Turkcan, Mathias Unberath, Pietro Valdastri, Carlos Vives, Quan Vuong, Martin Wagner, Farong Wang, Wei Wang, Lidian Wang, Chung-Pang Wang, Guankun Wang, Junyi Wang, Erqi Wang, Ziyi Wang, Tanner Watts, Wolfgang Wein, Yimeng Wu, Zijian Wu, Hongjun Wu, Luohong Wu, Jie Ying Wu, Junlin Wu, Victoria Wu, Kaixuan Wu, Mateusz Wójcikowski, Yunye Xiao, Nan Xiao, Wenxuan Xie, Hao Yang, Tianqi Yang, Yinuo Yang, Menglong Ye, Ryan S. Yeung, Nural Yilmaz, Chim Ho Yin, Michael Yip, Rayan Younis, Chenhao Yu, Sayem Nazmuz Zaman, Milos Zefran, Han Zhang, Yuelin Zhang, Yidong Zhang, Yanyong Zhang, Xuyang Zhang, Yameng Zhang, Joyce Zhang, Ning Zhong, Peng Zhou, Haoying Zhou, Xiuli Zuo, Nassir Navab, Mahdi Azizian, Sean D. Huver, Axel Krieger

发表机构 * Open-H-Embodiment Consortium University of California, Berkeley(加州大学伯克利分校) University of California, Los Angeles(加州大学洛杉矶分校) University of Southern California(南加州大学) University of Cambridge(剑桥大学) University of Tokyo(东京大学) University of Tokyo, Graduate School of Information Science and Technology(东京大学信息科学与技术研究生院) University of Tokyo, Institute of Industrial Science(东京大学工业科学研究所)

AI总结 本文提出Open-H-Embodiment数据集,通过两个基础模型展示了其在医疗机器人领域的应用,展示了大规模开放数据在推动机器人学习和世界建模方面的关键作用。

Comments Project website: https://open-h.github.io/open-h-embodiment/

详情
AI中文摘要

自主医疗机器人有望改善患者预后、减轻医护人员的工作负担、普及医疗服务并实现超越人类的精度。然而,自主医疗机器人一直受制于一个根本性的数据问题:现有的医疗机器人数据集规模小、仅含单一具身形态且很少公开共享,限制了该领域发展所需的基础模型的研发。我们介绍了Open-H-Embodiment,这是迄今为止最大的带同步运动学数据的开放医疗机器人视频数据集,涵盖50多个机构和多种机器人平台,包括CMR Versius、Intuitive Surgical的da Vinci、da Vinci Research Kit(dVRK)、Rob Surgical BiTrack、Virtual Incision的MIRA、Moon Surgical Maestro以及多种定制系统,覆盖手术操作、机器人超声和内窥镜操作。我们通过两个基础模型展示了该数据集所支持的研究。GR00T-H是首个面向医疗机器人的开放视觉-语言-动作基础模型,也是所有被评估模型中唯一在结构化缝合基准上实现完整端到端任务完成的模型(25%的试验,其他所有模型为0%),并在29步离体缝合序列中取得64%的平均成功率。我们还训练了Cosmos-H-Surgical-Simulator,这是首个能够从单个检查点实现多具身形态手术仿真的动作条件世界模型,涵盖九种机器人平台,并支持医疗领域的计算机仿真(in silico)策略评估和合成数据生成。这些结果表明,开放的大规模医疗机器人数据采集可以成为研究社区的关键基础设施,推动机器人学习、世界建模等方向的进展。

英文摘要

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 50 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

2604.17121 2026-06-05 cs.LG cs.AI 版本更新

The Topological Trouble With Transformers

Transformer 的拓扑困境

Michael C. Mozer, Shoaib Ahmed Siddiqui, Rosanne Liu

发表机构 * Google DeepMind(谷歌DeepMind)

AI总结 本文探讨了Transformer在处理序列结构时的拓扑问题,指出其纯前馈架构限制了动态状态跟踪,提出应通过递归架构转向隐含激活动态,并介绍了连续思维Transformer架构的分类方法及未来研究方向。

详情
AI中文摘要

Transformer通过不断扩展的上下文历史对序列中的结构进行编码。然而,其纯前馈架构从根本上限制了动态状态跟踪。状态跟踪,即迭代更新反映不断变化环境的潜在变量,涉及本质上的顺序依赖,而前馈网络难以维持这种依赖。因此,每出现一个新的输入步,前馈模型都会把不断演化的状态表示推向层栈更深处,使这些信息在浅层不可访问,并最终耗尽模型的深度。虽然动态深度模型以及将状态表示外化的显式或潜在思维可以绕过这一深度限制,但这些方案在计算和内存上效率低下。在本文中,我们主张,时间上延展的认知需要通过循环架构,将关注点从显式思维轨迹转向隐式激活动态。我们提出循环与连续思维Transformer架构的分类法,按其循环轴(深度与步)以及输入标记数与循环步数之比进行分类。最后,我们概述了有前景的研究方向,包括增强的状态空间模型和粗粒度循环,以便更好地将状态跟踪整合到现代基础模型中。

英文摘要

Transformers encode structure in sequences via an expanding contextual history. However, their purely feedforward architecture fundamentally limits dynamic state tracking. State tracking -- the iterative updating of latent variables reflecting an evolving environment -- involves inherently sequential dependencies that feedforward networks struggle to maintain. Consequently, feedforward models push evolving state representations deeper into their layer stack with each new input step, rendering information inaccessible in shallow layers and ultimately exhausting the model's depth. While this depth limit can be bypassed by dynamic depth models and by explicit or latent thinking that externalizes state representations, these solutions are computationally and memory inefficient. In this article, we argue that temporally extended cognition requires refocusing from explicit thought traces to implicit activation dynamics via recurrent architectures. We introduce a taxonomy of recurrent and continuous-thought transformer architectures, categorizing them by their recurrence axis (depth versus step) and their ratio of input tokens to recurrence steps. Finally, we outline promising research directions, including enhanced state-space models and coarse-grained recurrence, to better integrate state tracking into modern foundation models.

2603.25158 2026-06-05 cs.AI 版本更新

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Trace2Skill: 将轨迹局部经验转化为可迁移的代理技能

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

发表机构 * ETH Zürich(苏黎世联邦理工学院) University of Zurich(苏黎世大学) Peking University(北京大学) Zhejiang University(浙江大学) Qwen Large Model Application Team, Alibaba(阿里巴巴通义千问大模型应用团队)

AI总结 本文提出Trace2Skill框架,通过归纳推理将广泛执行轨迹整合为统一的技能目录,有效提升代理技能的可迁移性和实用性,适用于多种领域。

Comments Work in Progress. May version add more experiments

详情
AI中文摘要

大型语言模型(LLM)代理日益依赖领域特定技能,但手动编写此类技能难以扩展,而纯粹依靠参数化知识生成的技能常常遗漏关键的操作陷阱。我们提出Trace2Skill框架,通过对代理经验的归纳推理,将大量执行轨迹并行整合为统一的技能目录。Trace2Skill既支持深化现有的人工编写技能,也支持从弱LLM生成的草稿中创建有用技能。实验表明,Trace2Skill在办公流程、数学推理和视觉问答等多个领域均有效。重要的是,进化出的技能并非只是对所用轨迹的简单记忆:它们通常能够跨模型规模、跨模型家族迁移,并迁移到分布外设置。例如,从Qwen3.5-35B轨迹进化出的技能使Qwen3.5-122B代理在WikiTableQuestions上最多提升57.65个百分点。进一步分析显示,Trace2Skill优于顺序技能编辑和ReasoningBank式检索记忆,能将反复出现的失败和变通方法压缩为标准操作程序(SoPs),并产生无需参数更新或测试时检索即可复用的可移植技能。

英文摘要

Large Language Model (LLM) agents increasingly rely on domain-specific skills, yet manually authoring such skills does not scale, and skills generated purely from parametric knowledge often miss critical operational pitfalls. We introduce Trace2Skill, a framework that consolidates broad execution trajectories in parallel into a unified skill directory through inductive reasoning over agent experience. Trace2Skill supports both deepening existing human-written skills and creating useful skills from weak LLM-generated drafts. Experiments demonstrate the effectiveness of Trace2Skill across diverse domains, including office workflows, math reasoning, and vision QA. Importantly, the evolved skills are not merely memorized artifacts of the trajectories used to create them: they often transfer across model scales, across model families, and to out-of-distribution settings. For example, skills evolved from Qwen3.5-35B trajectories improve a Qwen3.5-122B agent by up to $57.65$ percentage points on WikiTableQuestions. Further analyses show that Trace2Skill outperforms sequential skill editing and ReasoningBank-style retrieval memories, compresses recurring failures and workarounds into standard operating procedures (SoPs), and yields portable skills that can be reused without parameter updates or test-time retrieval.

2604.23466 2026-06-05 cs.LG cs.AI cs.AR 版本更新

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

评估Hopper和Blackwell GPU上的CUDA Tile用于AI工作负载

Divakar Kumar Yadav, Tian Zhao, Deepak Kumar

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 本文评估了CUDA Tile在Hopper和Blackwell GPU上的AI工作负载性能,比较了CuTile与cuBLAS、Triton等方法的效率和可移植性,发现CuTile在特定工作负载上表现优异,但在跨架构优化上仍有不足。

详情
AI中文摘要

NVIDIA的CUDA Tile(CuTile)引入了一种基于Python、以tile为中心的GPU内核开发抽象,旨在简化编程,同时在现代GPU上保持Tensor Core和Tensor Memory Accelerator(TMA)的效率。我们在横跨Hopper和Blackwell架构的三款NVIDIA GPU(H100 NVL、B200和RTX PRO 6000 Blackwell Server Edition)上,对CuTile进行了首次独立的跨架构评估,并与cuBLAS、Triton、WMMA和原始SIMT等现有方法进行对比。我们对具有代表性的AI工作负载进行基准测试,包括GEMM、融合多头注意力和端到端LLM推理(BF16/FP16精度),以评估性能和可移植性。结果表明,CuTile的效果强烈依赖于工作负载和架构。在数据中心级Blackwell(B200)上,CuTile在融合注意力上最高达到1007 TFLOP/s,性能为FlashAttention-2的2.5倍,且仅需60行Python内核代码。对于GEMM,CuTile用22行代码达到cuBLAS性能的52-79%(WMMA需要123行),使其成为手写CUDA内核的实用替代品,但尚不能替代厂商优化库。然而,同一个CuTile注意力内核在RTX PRO 6000(sm_120)上仅达到FlashAttention-2吞吐量的53%,暴露出显著的跨架构优化差距。相比之下,Triton无需针对特定架构调优,即可在所有测试平台上保持cuBLAS性能的62-101%,显示出明显更强的可移植性。

英文摘要

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.
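
Figures such as "52-79% of cuBLAS performance" are typically derived from measured kernel times. The sketch below shows the usual arithmetic, assuming the standard $2MNK$ floating-point operation count for a dense GEMM; the timings are hypothetical placeholders, not measurements from the paper, and the function names are illustrative.

```python
def gemm_tflops(m, n, k, seconds):
    """Throughput of an (m x k) @ (k x n) matmul, counting 2*m*n*k FLOPs."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e12


def fraction_of_baseline(kernel_tflops, baseline_tflops):
    """Kernel throughput as a fraction of a baseline library's throughput."""
    return kernel_tflops / baseline_tflops


# Hypothetical timings for a 4096^3 GEMM (placeholders, not from the paper).
kernel = gemm_tflops(4096, 4096, 4096, seconds=0.25e-3)
baseline = gemm_tflops(4096, 4096, 4096, seconds=0.18e-3)
print(f"kernel: {kernel:.1f} TFLOP/s, baseline: {baseline:.1f} TFLOP/s")
print(f"kernel reaches {100 * fraction_of_baseline(kernel, baseline):.0f}% of baseline")
```

Because both kernels perform the same nominal work, the throughput ratio reduces to the inverse ratio of their run times.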

2604.23190 2026-06-05 cs.SE cs.AI 版本更新

RAT: RunAnyThing via Fully Automated Environment Configuration

RAT: 通过完全自动化的环境配置实现RunAnyThing

Renhong Huang, Dongdong Hua, Yifei Sun, Sitao Ding, Hanyang Yuan, Daixin Wang, Yang Yang

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团)

AI总结 本文提出RAT框架,用于在任意仓库上实现跨编程语言的全自动环境配置,通过多阶段流水线整合语言感知抽象、镜像初始化、专用配置工具集和稳健沙箱,并提出RATBench基准测试集,实验表明RAT在环境设置成功率上比强基线提升了36.1%。

详情
AI中文摘要

自动化仓库级别的软件工程任务是自主代码代理的基础挑战,主要由于可执行环境配置的难度。然而,手动配置仍然是劳动密集型的瓶颈,需要向完全自动化的环境配置过渡。现有方法往往依赖预定义的制品或局限于特定编程语言,限制了其在现实世界仓库中的适用性。在本文中,我们首先提出RAT(RunAnyThing),一个模块化且可扩展的代理框架,用于在任意仓库上实现跨编程语言的全自动配置。RAT采用多阶段流水线,整合语言感知抽象、镜像初始化、专用配置工具集和稳健沙箱。此外,为了实现严格评估,我们提出RATBench,一个反映现实世界仓库全面覆盖的基准测试集。大量实验表明,RAT实现了最先进的性能,比强基线在环境设置成功率(ESSR)上平均提高了36.1%。

英文摘要

Automating repository-level software engineering tasks is a foundational challenge for autonomous code agents, largely due to the difficulty of configuring executable environments. However, manual configuration remains a labor-intensive bottleneck, necessitating a transition toward fully automated environment configuration. Existing approaches often rely on pre-defined artifacts or are restricted to specific programming languages, limiting their applicability to diverse real-world repositories. In this paper, we first propose RAT (RunAnyThing), a modular and extensible agent framework for fully automated configuration across programming languages on arbitrary repositories. RAT adopts a multi-stage pipeline that integrates language-aware abstraction, image initialization, a specialized configuration toolset, and a robust sandbox. Furthermore, to enable rigorous evaluation, we propose RATBench, a benchmark that reflects the comprehensive coverage of real-world repositories. Extensive experiments demonstrate that RAT achieves state-of-the-art performance, improving Environment Setup Success Rate (ESSR) by an average of 36.1% over strong baselines.

2603.03555 2026-06-05 cs.MA cs.AI cs.SI 版本更新

Benchmarking Emergent Coordination in Large-Scale LLM Populations: An Evaluation Framework on the MoltBook Archive

在大规模大语言模型群体中评估涌现协调:对MoltBook档案库的评估框架

Brandon Yee, Pairie Koh

发表机构 * Management Lab, Yee Collins Research Group(Yee Collins研究组管理实验室)

AI总结 本文提出了一种评估框架,用于在开放代理环境中评估角色专业化、信息扩散和协作任务解决的涌现协调,通过MoltBook档案库的数据集展示了该框架,并建立了量化基准,揭示了核心-外围结构、重尾级联分布和去中心化任务解决中的严重协调开销。

Comments Substantial Revision Required

详情
AI中文摘要

随着多智能体大语言模型(LLM)系统规模扩大,评估其涌现协调动态变得越来越关键。然而,当前的评估范式专注于单个智能体或小型、显式结构化的群体,无法捕捉大规模去中心化群体中出现的自组织和病毒式信息传播动态。我们提出一个系统化的评估框架,用于在开放代理环境中对角色专业化、信息扩散和协作任务解决进行基准测试。我们在MoltBook Observatory Archive上演示了该框架,这是一个包含90,704个自主代理之间273万次交互的数据集,并由此建立了涌现协调的量化基线。我们的评估揭示了明显的核心-外围结构(轮廓系数0.91)、重尾级联分布(α=2.57)以及去中心化任务解决中的严重协调开销(相对于单智能体基线,Cohen's d = -0.88)。通过提供标准化的评估任务和实证基线,我们的框架使未来多智能体协议的严格比较成为可能,并将评估本身确立为科学研究的对象。

英文摘要

As multi-agent Large Language Model (LLM) systems scale, evaluating their emergent coordination dynamics becomes increasingly critical. However, current evaluation paradigms -- focused on single agents or small, explicitly structured groups -- fail to capture the self-organization and viral information dynamics that arise in large, decentralized populations. We introduce a systematic evaluation framework to benchmark role specialization, information diffusion, and cooperative task resolution in open agent environments. We demonstrate this framework on the MoltBook Observatory Archive, a dataset of 2.73M interactions among 90,704 autonomous agents, establishing quantitative baselines for emergent coordination. Our evaluation reveals a pronounced core-periphery structure (silhouette 0.91), heavy-tailed cascade distributions ($α = 2.57$), and severe coordination overhead in decentralized task resolution (Cohen's $d = -0.88$ against a single-agent baseline). By providing standardized evaluation tasks and empirical baselines, our framework enables the rigorous comparison of future multi-agent protocols and establishes evaluation itself as an object of scientific study.
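
Two of the statistics quoted above have standard textbook estimators, sketched here for orientation. The abstract does not say which estimators the authors used, so the pooled-standard-deviation form of Cohen's $d$ and the continuous maximum-likelihood estimate of a power-law exponent are assumptions, as are the function names and sample data.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference (a minus b) using the pooled sample std."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled


def powerlaw_alpha_mle(sizes, x_min):
    """Continuous MLE of alpha for p(x) ~ x^(-alpha), using only x >= x_min."""
    tail = [x for x in sizes if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one observation above x_min")
    return 1.0 + len(tail) / log_sum


# Hypothetical task scores: decentralized groups vs. a single-agent baseline.
decentralized = [1.0, 2.0, 3.0]
single_agent = [2.0, 3.0, 4.0]
print(cohens_d(decentralized, single_agent))

# Hypothetical cascade sizes.
print(powerlaw_alpha_mle([1, 1, 2, 3, 5, 8, 40], x_min=1))
```

With the decentralized group passed first, a negative $d$ means it scored below the baseline, which matches the sign convention implied by "coordination overhead".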

2602.13939 2026-06-05 cs.LG cs.AI 版本更新

A Horizon-Aware Decision-Support Framework for Demand Forecasting Model Selection in Resilient Production Planning

面向韧性生产计划中需求预测模型选择的预测期感知决策支持框架

Adolfo González, Víctor Parada

发表机构 * Department of Computer Engineering and Informatics, Faculty of Engineering, University of Santiago of Chile(智利圣地亚哥大学工程学院计算机工程与信息学系)

AI总结 本文提出一种预测期感知的决策支持框架,用于在需求波动大、不确定性高的生产计划中选择需求预测模型:通过MDFH程序将误差指标投影到未来预测期,并提出RMSSEh和AHSIV两种模型选择器。

Comments 31 pages, 12 figures and Appendix

详情
AI中文摘要

需求预测是韧性生产计划、库存补货、采购和产能决策的关键输入,在需求间歇、波动剧烈和运营不确定的情况下尤为如此。在这些情境下,仅依据固定测试期的表现来选择预测模型,可能导致决策与实际使用预测的未来规划期不一致。本研究提出按预测期的指标退化(Metric Degradation by Forecast Horizon,MDFH)程序,作为一种预测期感知的决策支持框架,用于选择需求预测模型。MDFH在明确的结构稳定性条件下,将符合条件的样本外误差指标(具体为MAE、RMSE和RMSSE)从已观测的测试期投影到未来的运营期。在此基础上,推导出RMSSEh作为一种简约的预测期感知选择器,并提出间歇性与波动性自适应混合选择器(Adaptive Hybrid Selector for Intermittency and Variability,AHSIV),作为面向结构异质需求序列的自适应扩展。多变量排名聚合选择器ERA被纳入作为对照。实证评估使用Walmart、M3、M4和M5数据集、三种训练-测试划分、22个预测模型以及12步的未来预测期。结果表明,以事后全局相对精度(Global Relative Accuracy)衡量时,RMSSEh和AHSIV比ERA提供了更一致的下游数量对齐。

英文摘要

Demand forecasting is a critical input for resilient production planning, inventory replenishment, procurement, and capacity decisions under demand intermittency, high variability, and operational uncertainty. In these contexts, selecting forecasting models solely on the basis of fixed test-horizon performance may lead to decisions misaligned with the future planning horizons in which forecasts are used. This study proposes the Metric Degradation by Forecast Horizon (MDFH) procedure as a horizon-aware decision-support framework for selecting demand forecasting models. MDFH projects eligible out-of-sample error metrics, specifically MAE, RMSE, and RMSSE, from an observed test horizon toward future operational horizons under explicit structural-stability conditions. Based on this layer, RMSSEh is derived as a parsimonious horizon-aware selector, while the Adaptive Hybrid Selector for Intermittency and Variability (AHSIV) is proposed as an adaptive extension for structurally heterogeneous demand series. ERA, a multivariate ranking-aggregation selector, is included as a comparator. The empirical evaluation uses the Walmart, M3, M4, and M5 datasets, three training-testing partitions, 22 forecasting models, and 12-step future horizons. Results show that RMSSEh and AHSIV provide more consistent downstream volumetric alignment than ERA when assessed through ex post Global Relative Accuracy.
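
For orientation, RMSSE (root mean squared scaled error) divides the out-of-sample squared error by the in-sample squared error of a one-step naive forecast. The sketch below assumes that common definition; it computes plain RMSSE on an observed test horizon and does not reproduce the paper's horizon projection (RMSSEh). The function name and sample series are illustrative.

```python
import math


def rmsse(train, actual, forecast):
    """Root mean squared scaled error against a one-step naive in-sample scale."""
    if len(train) < 2:
        raise ValueError("train needs at least two observations")
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal length")
    # Mean squared one-step difference of the training series (naive forecast error).
    scale = sum((train[t] - train[t - 1]) ** 2 for t in range(1, len(train)))
    scale /= len(train) - 1
    if scale == 0:
        raise ValueError("training series is constant; scale is undefined")
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    return math.sqrt(mse / scale)


history = [1.0, 2.0, 3.0, 4.0]  # hypothetical demand history
observed = [5.0, 6.0]           # hypothetical test-horizon demand
predicted = [5.0, 8.0]          # hypothetical model forecast
print(rmsse(history, observed, predicted))
```

Scaling by the naive in-sample error makes the metric comparable across series with very different volumes, which matters when ranking models over heterogeneous demand data.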

2604.12474 2026-06-05 cs.RO cs.AI 版本更新

From Kinematics to Dynamics: Learning to Refine Hybrid Plans for Physically Feasible Execution

从运动学到动力学:学习精炼混合计划以实现物理可行的执行

Lidor Erez, Shahaf S. Shperberg, Ayal Taitler

发表机构 * Technion - Israel Institute of Technology(以色列理工学院)

AI总结 该研究通过连续空间中的强化学习,解决混合计划在物理可行性执行中的问题,通过引入分析二阶约束的马尔可夫决策过程,改进混合规划器生成的一阶轨迹,从而可靠地恢复物理可行性。

详情
AI中文摘要

在许多机器人任务中,智能体必须穿越一系列空间区域以完成任务。此类问题本质上是混合离散-连续的:一个高层动作序列和一个在物理上可行的连续轨迹。生成的轨迹和动作序列还必须满足诸如截止时间、时间窗口和速度或加速度限制等约束条件。尽管混合时间规划器试图解决这一挑战,但它们通常使用线性(一阶)动力学建模运动,这无法保证生成的计划满足机器人的真实物理约束。因此,即使高层动作序列固定,生成动态可行的轨迹也变成了一个双层优化问题。我们通过连续空间中的强化学习来解决这个问题。我们定义了一个明确包含分析二阶约束的马尔可夫决策过程,并用它来改进由混合规划器生成的一阶计划。我们的结果表明,这种方法可以可靠地恢复物理可行性,并有效弥合规划器初始一阶轨迹与实际执行所需动力学之间的差距。

英文摘要

In many robotic tasks, agents must traverse a sequence of spatial regions to complete a mission. Such problems are inherently mixed discrete-continuous: a high-level action sequence and a physically feasible continuous trajectory. The resulting trajectory and action sequence must also satisfy problem constraints such as deadlines, time windows, and velocity or acceleration limits. While hybrid temporal planners attempt to address this challenge, they typically model motion using linear (first-order) dynamics, which cannot guarantee that the resulting plan respects the robot's true physical constraints. Consequently, even when the high-level action sequence is fixed, producing a dynamically feasible trajectory becomes a bi-level optimization problem. We address this problem via reinforcement learning in continuous space. We define a Markov Decision Process that explicitly incorporates analytical second-order constraints and use it to refine first-order plans generated by a hybrid planner. Our results show that this approach can reliably recover physical feasibility and effectively bridge the gap between a planner's initial first-order trajectory and the dynamics required for real execution.
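
The gap between a first-order plan and a dynamically feasible one can be illustrated with a simple feasibility check. The sketch below is only a checker under stated assumptions (a one-dimensional trajectory sampled at a fixed time step, with limits on finite-difference velocity and acceleration); it is not the paper's Markov Decision Process or its learned refinement, and all names and values are hypothetical.

```python
def finite_differences(values, dt):
    """Forward differences of a uniformly sampled sequence."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def is_dynamically_feasible(positions, dt, v_max, a_max):
    """True if sampled velocities and accelerations stay within their limits."""
    velocities = finite_differences(positions, dt)
    accelerations = finite_differences(velocities, dt)
    return (all(abs(v) <= v_max for v in velocities)
            and all(abs(a) <= a_max for a in accelerations))


# A first-order style plan: stand still, then jump to full speed in one step.
abrupt = [0.0, 0.0, 2.0]
# A gentler profile that ramps up speed gradually.
smooth = [0.0, 0.5, 1.5, 3.0]

print(is_dynamically_feasible(abrupt, dt=1.0, v_max=3.0, a_max=1.0))
print(is_dynamically_feasible(smooth, dt=1.0, v_max=3.0, a_max=1.0))
```

The abrupt plan respects the velocity limit yet violates the acceleration limit, which is exactly the kind of violation a purely first-order model cannot detect.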

2604.16370 2026-06-05 cs.CL cs.AI cs.CV 版本更新

Brain-CLIPLM: Semantic Compression for EEG-to-Text Decoding

Brain-CLIPLM: 用于EEG到文本解码的语义压缩

Xiaoli Yang, Huiyuan Tian, Yurui Li, Jianyu Zhang, Shijian Li, Gang Pan

发表机构 * Beijing Institute of Technology, Beijing, China(北京理工大学,北京,中国)

AI总结 该研究提出Brain-CLIPLM框架,通过语义锚点恢复和锚点引导的句子重建,解决EEG信号低信噪比和信息带宽限制的问题,实现了更高的文本检索准确率。

详情
AI中文摘要

从非侵入性脑电图(EEG)解码自然语言仍受限于低信噪比和有限的信息带宽。这提出了一个核心问题:能否从此类信号中可靠地恢复句子级语言?在现实的信息约束下,直接恢复假设可能过于强烈。我们提出语义压缩假设:非侵入性EEG可能保留可恢复的语义锚点,而非完整的词法-句法形式。从这一视角,直接句子重建相对于EEG可恢复的信息规模过于细粒度。为解决这种不匹配,我们提出了Brain-CLIPLM,一个两阶段框架,将EEG到文本解码分解为语义锚点恢复和锚点引导的句子重建。第一阶段使用对比学习将词级EEG证据对齐固定关键词词汇并恢复有序的语义锚点。第二阶段使用基于检索的大型语言模型和链式推理提示从这些锚点中重建句子意义,遵循粒度匹配原则,使解码复杂度与可恢复的神经信息规模相匹配。在结合了苏黎世认知语言处理(ZuCo)基准测试中,Brain-CLIPLM实现了67.6%的Top-5和85.0%的Top-25句子检索准确率,其中在中间锚点粒度下表现最强。控制分析,包括排列检验,显示EEG衍生的锚点携带超出语言模型先验的信息。这些发现表明,EEG到文本解码应更好地视为在锚点引导句子重建之前恢复压缩的语义内容。

英文摘要

Decoding natural language from non-invasive electroencephalography (EEG) remains constrained by low signal-to-noise ratio and limited information bandwidth. This raises a central question: can sentence-level language be reliably recovered from such signals? Under realistic information constraints, this direct-recovery assumption may be too strong. We introduce a semantic compression hypothesis: non-invasive EEG may preserve recoverable semantic anchors rather than the full lexical-syntactic form of a sentence. From this perspective, direct sentence reconstruction is overly fine-grained relative to the recoverable information scale of EEG. To address this mismatch, we propose Brain-CLIPLM, a two-stage framework that decomposes EEG-to-text decoding into semantic-anchor recovery and anchor-guided sentence reconstruction. Stage 1 uses contrastive learning to align word-level EEG evidence with a fixed keyword vocabulary and recover ordered semantic anchors. Stage 2 uses a retrieval-grounded large language model with chain-of-thought reasoning prompts to reconstruct sentence meaning from these anchors, following a granularity matching principle that aligns decoding complexity with the recoverable neural information scale. On the combined Zurich Cognitive Language Processing (ZuCo) benchmark, Brain-CLIPLM achieves 67.6% Top-5 and 85.0% Top-25 sentence retrieval accuracy, with the strongest performance at intermediate anchor granularity. Control analyses, including a permutation test, show that EEG-derived anchors carry sentence-specific information beyond language-model priors. These findings suggest that EEG-to-text decoding is better framed as recovering compressed semantic content before anchor-guided sentence reconstruction.
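
The Top-5 and Top-25 figures are retrieval accuracies: a query counts as a hit when the correct sentence appears among the k highest-scoring candidates. A minimal sketch of that metric follows; the score matrix is made up, and how Brain-CLIPLM actually produces its similarity scores is not shown here.

```python
def top_k_accuracy(scores, true_indices, k):
    """scores[i][j] is the similarity between query i and candidate j.
    A query is a hit if its true candidate ranks within the top k."""
    hits = 0
    for row, target in zip(scores, true_indices):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(true_indices)

# Three hypothetical queries scored against four candidate sentences.
scores = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.8, 0.5, 0.1],
    [0.4, 0.3, 0.2, 0.6],
]
true_indices = [0, 2, 1]
for k in (1, 2, 3):
    print(k, top_k_accuracy(scores, true_indices, k))
```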

2604.07709 2026-06-05 cs.AI cs.CL cs.CY cs.LG 版本更新

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

IatroBench: AI安全措施中意外伤害的预注册证据

David Gringras

发表机构 * Harvard T.H. Chan School of Public Health(哈佛大学陈曾熙公共卫生学院)

AI总结 该研究通过IatroBench评估了AI安全措施在医疗决策中的意外伤害风险,发现不同模型在身份相关性上的隐瞒行为存在显著差异,尤其在高度安全训练的模型中表现更明显。

Comments 30 pages, 3 figures, 11 tables. Pre-registered on OSF (DOI: 10.17605/OSF.IO/G6VMZ). Code and data: https://github.com/davidgringras/iatrobench. v2: Fix bibliography entries (add arXiv IDs, published venues); correct p-value typo in Limitations section; add AI Assistance Statement. v3: Correct Figure 1 (decoupling scatter accidentally reverted to earlier draft in v2).

详情
AI中文摘要

一个经过严格安全训练的模型会将完整的、患者可照做的苯二氮䓬类药物减量方案交给医生,却拒绝给需要该方案的患者,尽管临床事实完全相同;知识在两种情况下都存在。IatroBench在六十个预注册的临床场景和六个前沿模型(3,600次响应)中测量这种不对称性,并沿两个维度对每次响应评分:作为性伤害(commission harm,即响应中的错误内容)和遗漏性伤害(omission harm,即响应隐瞒的内容);评分采用由医生编写、并经第二位医生验证的结构化评估(加权Kappa 0.571,相差1级以内的一致率96%)。在保持临床内容不变的情况下,仅改变提问者是患者还是医生,产生我们称为身份依赖性隐瞒的现象:所有五个可测试的模型都给医生更多(解耦间隙+0.38,p=0.003;在安全冲突行动上的非专业人士命中率下降13.1点,p<0.0001;其余无变化),且在最高度安全训练的模型Opus中,差距最大(+0.65)。触发因素是缺乏任何专业或知识信号,而不是身份证明,因为律师或知情的非专业人士可以获得患者被拒绝的内容。仅衡量作为性伤害的基准会给三种机制打出相同的分数。Opus抑制了医生框架证明其知道的内容;Llama 4在两种框架中都不胜任;GPT-5.2的过滤器剥离了其33.2%的医生响应,但没有剥离非专业人士的响应。评估层继承了训练层的盲目性;标准LLM评分者在我们流程标记为有害的81.5%的响应中对遗漏性伤害评分为零(Kappa 0.066),因此用于检测失败的工具重现了这种现象。这些场景是为碰撞设计的;其比率描述了这种设计,但不能说明其在日常使用中的普遍程度。

英文摘要

A heavily safety-trained model will hand a physician the full, patient-followable benzodiazepine taper and refuse it to the patient who needs it, over identical clinical facts; the knowledge is present either way. IatroBench measures that asymmetry across sixty pre-registered clinical scenarios and six frontier models (3,600 responses), scoring each on two axes, commission harm (what a response gets wrong) and omission harm (what it withholds), through a physician-authored structured evaluation validated by a second physician (weighted kappa 0.571, within-1 agreement 96%). Holding clinical content fixed and varying only whether the asker presents as patient or physician yields what we call identity-contingent withholding: all five testable models give the physician more (a decoupling gap of +0.38, p = 0.003; a 13.1-point fall in layperson hit rates on safety-colliding actions, p < 0.0001; no change on the rest), and the gap runs widest in the most heavily safety-trained model, Opus (+0.65). The trigger is the absence of any professional or epistemic signal rather than a credential, since a lawyer or an informed layperson recovers what the patient is refused. A commission-only benchmark would score three mechanisms alike. Opus suppresses what physician framing proves it knows; Llama 4 is incompetent in either framing; GPT-5.2's filter strips 33.2% of its physician responses and none of the lay ones. The evaluation layer inherits the blindness of the training layer; a standard LLM judge scores zero omission harm on 81.5% of the responses our pipeline flags harmful (kappa 0.066), so the instrument built to detect the failure reproduces it. The scenarios are engineered for collision; their rates describe that design and say nothing about ordinary prevalence.

2604.12138 2026-06-05 cs.AI cs.CL cs.IR 版本更新

Retrieval-Augmented Generation Must Move Beyond Factual Grounding to Represent Diverse Opinions

检索增强生成必须超越事实基础以代表多样化观点

Aditya Agrawal, Alwarappan Nakkiran, Darshan Fofadiya, Alex Karlsson, Harsha Aduri

发表机构 * Amazon.com(亚马逊公司)

AI总结 本文指出检索增强生成系统存在系统性事实偏差,并提出需要在检索系统设计上进行范式转变,通过不确定性量化方法提出统一目标,并展示Opinion-Aware RAG架构在两个领域中的实验结果,证明其在多样性、公平性和准确性方面的提升。

Comments 20 pages, Preprint under review

详情
AI中文摘要

本文主张检索增强生成系统存在系统性事实偏差,即在优化知识不确定性的同时忽视意见丰富内容中固有的随机不确定性。这种不一致要求检索系统设计发生范式转变。对35个主要RAG基准的调查表明,只有一个是意见合成的,证实了这种偏差的结构性:嵌入在数据集、检索目标和评估指标中。除了技术限制外,这种偏差还对透明和可问责的AI构成风险:回音室效应放大主导观点,系统性低估少数声音,以及通过偏见信息合成进行意见操控的潜在风险。我们通过不确定性量化的方法正式提出问题,显示事实查询应最小化后验熵,而意见查询必须保持它,并利用Wasserstein距离推导出统一的目标,涵盖覆盖性、忠实性和公平性。作为存在证明,我们提出了Opinion-Aware RAG(O-RAG),一种具有基于LLM的意见提取和实体链接意见元数据的架构,并在两个领域——电子商务卖家论坛和公共酒店评论——中评估了超过10000次讨论和6000次客户评论。实验显示Wasserstein距离到语料库级情感分布减少了18-48%,情感多样性增加了26.8%,实体匹配率增加了42.7%,人类评估者在79.2%的情况下更偏好包含意见的响应。我们提出了一项研究议程,并认为随着RAG系统越来越多地调解信息访问,其代表多样化观点的能力不仅不是可选的,而是必需的。

英文摘要

This position paper argues that Retrieval-Augmented Generation systems exhibit a systematic factual bias, optimizing for epistemic uncertainty reduction while ignoring the aleatoric uncertainty inherent in opinion-rich content, and that this misalignment demands a paradigm shift in retrieval system design. A survey of 35 major RAG benchmarks reveals that only one addresses opinion synthesis, confirming that the bias is structural: embedded in datasets, retrieval objectives, and evaluation metrics alike. Beyond technical limitations, this bias poses risks to transparent and accountable AI: echo chamber effects that amplify dominant viewpoints, systematic under-representation of minority voices, and potential opinion manipulation through biased information synthesis. We formalize the problem through the lens of uncertainty quantification, showing that factual queries should minimize posterior entropy while opinion queries must preserve it, and derive a unified objective over coverage, fidelity, and fairness using the Wasserstein distance. As an existence proof, we present Opinion-Aware RAG (O-RAG), an architecture featuring LLM-based opinion extraction and entity-linked opinion metadata, and evaluate it across two domains (e-commerce seller forums and public hotel reviews) spanning 10K+ discussions and 6K+ customer reviews. Experiments demonstrate 18-48% reduction in Wasserstein distance to corpus-level sentiment distributions, +26.8% sentiment diversity, and +42.7% entity match rate, with human evaluators preferring opinion-enriched responses 79.2% of the time. We propose a research agenda and argue that as RAG systems increasingly mediate access to information, their ability to represent diverse perspectives is not optional but essential.
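
The headline metric is the Wasserstein distance between a response's sentiment distribution and the corpus-level one. As a hedged illustration, the sketch below computes the 1D Wasserstein-1 distance between two histograms over ordered sentiment bins with unit spacing, which reduces to the summed absolute difference of their CDFs. The three-bin layout and all numbers are assumptions for illustration; the paper's exact binning and objective are not given in the abstract.

```python
def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two histograms over the same ordered
    bins with unit spacing. Inputs are normalized to sum to 1."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sp, sq = sum(p), sum(q)
    cum = 0.0   # running difference between the two CDFs
    dist = 0.0
    for pi, qi in zip(p, q):
        cum += pi / sp - qi / sq
        dist += abs(cum)
    return dist

# Bins: negative, neutral, positive (hypothetical proportions).
corpus = [0.2, 0.3, 0.5]
one_sided_answer = [0.0, 0.0, 1.0]     # repeats only the dominant view
balanced_answer = [0.25, 0.25, 0.5]    # roughly mirrors the corpus
print(wasserstein_1d(corpus, one_sided_answer))
print(wasserstein_1d(corpus, balanced_answer))
```

A response that reproduces only the majority sentiment sits far from the corpus distribution, while one that preserves the spread of opinions sits close to it, which matches the abstract's point that opinion queries should preserve rather than minimize entropy.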

2604.08477 2026-06-05 cs.AI cs.CL cs.LG 版本更新

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

SUPERNOVA: 通过自然指令上的强化学习激发大语言模型的通用推理

Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 本文提出SUPERNOVA框架,通过自然指令数据集构建高质量的强化学习可验证奖励数据集,通过100+次强化学习实验系统研究如何利用这些数据集提升下游推理性能,并在BigBench Extra Hard基准上实现64.4个百分点的相对提升。

Comments 23 Pages; 2-column format; 10 figures

详情
AI中文摘要

强化学习可验证奖励(RLVR)在数学和代码等正式领域显著提升了推理能力,但将其扩展到STEM领域以外仍然具有挑战性。扩展RLVR到STEM领域本质上受到高质量可验证训练数据的缺乏限制。在本文中,我们引入SUPERNOVA,一个从自然指令数据集中整理RLVR数据的框架,这些数据集是专家标注的丰富来源,但尚未被充分利用于RLVR训练。通过100多次受控的强化学习实验,我们系统研究如何利用这些数据集进行RLVR训练以及数据整理决策如何影响下游推理性能。特别是,我们研究了三种数据设计:(a)源任务选择,(b)任务混合,以及(c)合成干预。我们的分析揭示了源任务选择对下游推理性能有显著影响。此外,基于单个目标任务性能选择任务优于基于总体平均性能的策略,合成干预并未提高推理能力。受这些见解的启发,我们构建了SUPERNOVA,一个从自然指令数据集中整理出的25,000个实例的高质量RLVR数据集。我们证明了在SUPERNOVA上训练Qwen3-0.6B比基础Qwen3-0.6B表现更优,在包含23个复杂推理任务的挑战性基准BigBench Extra Hard(BBEH)上实现了64.4个百分点的相对提升。重要的是,我们发现SUPERNOVA的收益可以推广到未见基准、更大模型规模和新模型家族。总体而言,我们的发现为整理人类标注资源以扩展RLVR到通用推理提供了实用见解。模型、数据、代码见https://github.com/asuvarna31/supernova。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved reasoning in formal domains such as mathematics and code, but extending these gains beyond STEM remains challenging. Extending RLVR beyond STEM is fundamentally constrained by the lack of high-quality verifiable training data. In this work, we introduce SUPERNOVA, a framework for curating RLVR data from natural instruction datasets, which are a rich source of expert-annotated data but are underexplored for RLVR training. Through 100+ controlled RL experiments, we systematically study how to utilize these datasets for RLVR and how data curation decisions affect downstream reasoning performance. In particular, we investigate three data designs: (a) source task selection, (b) task mixing, and (c) synthetic interventions. Our analysis reveals that source task selection has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance, and synthetic interventions do not improve reasoning. Guided by these insights, we construct SUPERNOVA, a high-quality RLVR dataset of 25K instances curated from natural instruction datasets. We show that training Qwen3-0.6B on SUPERNOVA outperforms the base Qwen3-0.6B, yielding a relative gain of 64.4pp on BigBench Extra Hard (BBEH), a challenging benchmark comprising 23 complex reasoning tasks. Importantly, we find that gains from SUPERNOVA generalize to unseen benchmarks, larger model scales, and newer model families. Overall, our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. Models, Data, Code at https://github.com/asuvarna31/supernova.

2503.17181 2026-06-05 cs.SE cs.AI 版本更新

A Study of LLMs' Preferences for Libraries and Programming Languages

对大型语言模型在库和编程语言偏好方面的研究

Lukas Twist, Mark Harman, Don Syme, Joost Noppen, Helen Yannakoudakis, Detlef Nauck, Jie M. Zhang

发表机构 * King’s College London(伦敦国王学院) University College London(伦敦大学学院) GitHub Next Digital AI Research, BT Group(BT集团数字人工智能研究)

AI总结 本研究探讨了大型语言模型在生成代码时对库和编程语言的选择偏好,通过实证研究分析了八种不同大型语言模型在库和语言选择上的倾向,发现模型倾向于使用广泛采用的库如NumPy,并且在某些情况下这种选择并非必要,同时也显示出对Python的偏好,尽管在某些高性能项目初始化任务中Python并非最优选择。

Comments 21 pages, 10 tables, 3 figures. Accepted to Findings of ACL 2026

详情
AI中文摘要

尽管大型语言模型(LLMs)在代码生成方面取得了快速进展,但现有评估主要集中在功能正确性或语法有效性上,忽略了LLMs在关键设计决策中如何选择库或编程语言。为了填补这一空白,我们进行了首次对LLMs在生成代码时对库和编程语言偏好的实证研究,涵盖了八个不同的LLMs。我们观察到LLMs倾向于过度使用广泛采用的库,如NumPy;在多达45%的情况下,这种使用是不必要的,并偏离了真实解决方案。我们研究的LLMs还显示出对Python作为默认语言的显著偏好。在高性能项目初始化任务中,当Python不是最优语言时,它仍然在58%的情况下占据主导地位,而Rust从未被使用。这些结果突显了LLMs在选择熟悉度和流行度而非适合性和任务特定最优性上的倾向;强调了需要针对的微调、数据多样化以及能够明确衡量语言和库选择忠实度的评估基准。

英文摘要

Despite the rapid progress of large language models (LLMs) in code generation, existing evaluations focus on functional correctness or syntactic validity, overlooking how LLMs make critical design choices such as which library or programming language to use. To fill this gap, we perform the first empirical study of LLMs' preferences for libraries and programming languages when generating code, covering eight diverse LLMs. We observe a strong tendency to overuse widely adopted libraries such as NumPy; in up to 45% of cases, this usage is not required and deviates from the ground-truth solutions. The LLMs we study also show a significant preference toward Python as their default language. For high-performance project initialisation tasks where Python is not the optimal language, it remains the dominant choice in 58% of cases, and Rust is not used once. These results highlight how LLMs prioritise familiarity and popularity over suitability and task-specific optimality; underscoring the need for targeted fine-tuning, data diversification, and evaluation benchmarks that explicitly measure language and library selection fidelity.

2604.01489 2026-06-05 cs.LG cs.AI cs.DC cs.PF cs.SE 版本更新

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

CuTeGen: 基于LLM的代理框架用于使用CuTe生成和优化高性能GPU内核

Tara Saba, Zhiyang Chen, Jikai Jason Li, Anne Ouyang, Xujie Si, Fan Long

发表机构 * Department of Computer Science, University of Toronto(计算机科学系,多伦多大学)

AI总结 本文提出CuTeGen,一种基于LLM的代理框架,通过CuTe抽象层实现GPU内核的生成和优化,通过结构化生成-测试-优化工作流,在标准基准测试中实现了比PyTorch快1.71倍的速度提升,并在生成成本相近的情况下优于现有代理基线CudaForge。

详情
AI中文摘要

高性能GPU内核对现代机器学习系统至关重要,但开发这些内核仍然是一个手动、专家驱动的过程。最近的研究尝试利用LLM自动生成功能内核,但生成的内核在标准化基准测试中仍无法达到精心调优的参考内核。我们提出了CuTeGen,一种代理GPU内核合成框架,将内核开发视为在CuTe抽象层上的结构化生成-测试-优化工作流。CuTeGen有两个设计选择区别于先前的工作:针对CuTe而不是原始CUDA,这暴露了性能关键结构如分块和数据移动,同时保持足够的稳定性以进行迭代优化;以及延迟的性能调度,将低层次性能反馈推迟到内核的高层结构稳定之后。在209个KernelBench Level-1和Level-2任务上,CuTeGen在PyTorch上实现了平均1.71倍的速度提升,并在生成成本相近的情况下优于先前的代理基线CudaForge(0.89倍)。代码可在https://github.com/taratt/cutegen.git获取。

英文摘要

High-performance GPU kernels are critical to modern machine learning systems, yet developing them remains a manual, expert-driven process. Recent work has explored using LLMs to automate kernel generation, but generated kernels still fall short of carefully tuned references on standardized benchmarks. We present CuTeGen, an agentic GPU kernel synthesis framework that treats kernel development as a structured generate-test-refine workflow over the CuTe abstraction layer. Two design choices distinguish CuTeGen from prior work: targeting CuTe rather than raw CUDA, which exposes performance-critical structures such as tiling and data movement while remaining stable enough for iterative refinement, and a delayed profiling schedule that withholds low-level performance feedback until the kernel's high-level structure has stabilized. On the 209 tasks of KernelBench Level-1 and Level-2, CuTeGen achieves an average speedup of 1.71× over PyTorch and outperforms the prior agentic baseline CudaForge (0.89×) at comparable per-task generation cost. Code available at https://github.com/taratt/cutegen.git.

2602.19190 2026-06-05 cs.CV cs.AI 版本更新

FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

FUSAR-GPT:一种嵌入时空特征和两阶段解耦的视觉语言模型,用于合成孔径雷达图像

Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang

发表机构 * Fudan University(复旦大学) Discipline and Technology Center of Microwave Vision Intelligent Sensing, Fudan University(微波视觉智能感知学科与技术中心,复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 本文提出FUSAR-GPT,一种专门针对合成孔径雷达图像的视觉语言模型,通过嵌入时空特征和两阶段解耦方法,在多个遥感视觉语言基准测试中实现了最先进的性能。

详情
AI中文摘要

对所有天气和所有时间的合成孔径雷达(SAR)智能解释的研究对于推进遥感应用至关重要。近年来,尽管视觉语言模型(VLMs)在RGB图像上展示了强大的开放世界理解能力,但直接应用于SAR领域时,由于成像机制的复杂性、对散射特征的敏感性和高质量文本语料的稀缺性,其性能受到严重限制。为系统解决这一问题,我们构建了首个SAR图像-文本-AlphaEarth特征三元组数据集,并开发了FUSAR-GPT,一种专门用于SAR的VLM。FUSAR-GPT创新性地引入了一个地理空间基线模型作为“世界知识”先验,并通过“时空锚点”将多源遥感时间特征嵌入模型的视觉主干中,从而实现对SAR图像中目标稀疏表示的动态补偿。此外,我们设计了一种两阶段SFT策略,以解耦大模型的知识注入和任务执行。时空特征嵌入和两阶段解耦范式使FUSAR-GPT在多个典型遥感视觉语言基准测试中实现了最先进的性能,显著优于主流基线模型,超过10%。

英文摘要

Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 10%.

2603.19312 2026-06-05 cs.LG cs.AI 版本更新

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

LeWorldModel:从像素稳定端到端联合嵌入预测架构

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero

发表机构 * Mila & Université de Montréal(Mila与蒙特利尔大学) New York University(纽约大学) Samsung SAIL(三星SAIL) Brown University(布朗大学)

AI总结 本文提出LeWorldModel,一种通过仅使用两个损失项从原始像素稳定端到端训练的联合嵌入预测架构,显著减少了可调损失超参数,并在多种2D和3D控制任务中表现出色,同时在物理结构编码和物理不合理的事件检测方面展示了其能力。

详情
AI中文摘要

联合嵌入预测架构(JEPA)提供了一个有吸引力的框架,用于在紧凑的潜在空间中学习世界模型,但现有方法仍然脆弱,依赖于复杂的多项损失、指数移动平均、预训练编码器或辅助监督来避免表示崩溃。在本工作中,我们引入了LeWorldModel(LeWM),这是第一个仅使用两个损失项(下一嵌入预测损失和强制潜在嵌入服从高斯分布的正则项)即可从原始像素稳定地端到端训练的JEPA。与目前唯一的端到端替代方案相比,这将可调损失超参数的数量从六个减少到一个。LeWM约有1500万参数,可在单个GPU上于数小时内完成训练,其规划速度比基于基础模型的世界模型快至多48倍,同时在多种2D和3D控制任务中保持竞争力。除了控制之外,我们还通过对物理量的探测表明,LeWM的潜在空间编码了有意义的物理结构。惊奇度评估证实,该模型能够可靠地检测出物理上不合理的事件。

英文摘要

Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.

2603.20980 2026-06-05 cs.LG cs.AI stat.AP stat.ML 版本更新

From Causal Discovery to Dynamic Causal Inference in Neural Time Series

从因果发现到神经时间序列中的动态因果推断

Dmitry Zaytsev, Valentina Kuskova, Michael Coppedge

发表机构 * Lucy Family Institute for Data & Society(露西家族数据与社会研究所) University of Notre Dame(圣母大学) Political Science, University of Notre Dame(圣母大学政治学系)

AI总结 提出动态因果网络自回归(DCNAR)两阶段框架,通过神经自回归因果发现学习稀疏有向因果网络,并将其作为结构先验用于时变神经网络自回归,实现无需预设网络结构的动态因果推断。

Comments 11 pages, 2 figures

详情
AI中文摘要

时变因果模型为研究动态科学系统提供了强大框架,然而大多数现有方法假设潜在因果网络是先验已知的——这一假设在现实领域中很少成立,因为在这些领域中因果结构是不确定的、演变的或仅能间接观测。这限制了动态因果推断在许多科学场景中的适用性。我们提出动态因果网络自回归(DCNAR),一个两阶段神经因果建模框架,将数据驱动的因果发现与时变因果推断相结合。在第一阶段,神经自回归因果发现模型从多变量时间序列中学习稀疏有向因果网络。在第二阶段,该学习到的结构被用作时变神经网络自回归的结构先验,从而无需预先指定网络结构即可实现因果影响的动态估计。我们使用评估因果必要性、时间稳定性和对结构变化敏感性的行为诊断来验证DCNAR的科学有效性,而不仅仅是预测准确性。在多国面板时间序列数据上的实验表明,即使预测性能相当,学习到的因果网络也比基于系数或无结构替代方法产生更稳定且行为上有意义的动态因果推断。这些结果将DCNAR定位为一个通用框架,用于在结构不确定性下将AI作为动态因果推理的科学工具。

英文摘要

Time-varying causal models provide a powerful framework for studying dynamic scientific systems, yet most existing approaches assume that the underlying causal network is known a priori, an assumption rarely satisfied in real-world domains where causal structure is uncertain, evolving, or only indirectly observable. This limits the applicability of dynamic causal inference in many scientific settings. We propose Dynamic Causal Network Autoregression (DCNAR), a two-stage neural causal modeling framework that integrates data-driven causal discovery with time-varying causal inference. In the first stage, a neural autoregressive causal discovery model learns a sparse directed causal network from multivariate time series. In the second stage, this learned structure is used as a structural prior for a time-varying neural network autoregression, enabling dynamic estimation of causal influence without requiring pre-specified network structure. We evaluate the scientific validity of DCNAR using behavioral diagnostics that assess causal necessity, temporal stability, and sensitivity to structural change, rather than predictive accuracy alone. Experiments on multi-country panel time-series data demonstrate that learned causal networks yield more stable and behaviorally meaningful dynamic causal inferences than coefficient-based or structure-free alternatives, even when forecasting performance is comparable. These results position DCNAR as a general framework for using AI as a scientific instrument for dynamic causal reasoning under structural uncertainty.

2602.19373 2026-06-05 cs.LG cs.AI 版本更新

Stable Deep Reinforcement Learning via Isotropic Gaussian Representations

通过各向同性高斯表示实现稳定的深度强化学习

Ali Saheb Pasand, Johan Obando-Ceron, Aaron Courville, Pouya Bashivan, Pablo Samuel Castro

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 本文提出了一种基于各向同性高斯表示的深度强化学习方法,通过在训练过程中塑造表示以达到各向同性高斯分布,从而在非平稳环境下提高性能并减少表示崩溃、神经元休眠和训练不稳定性。

详情
AI中文摘要

深度强化学习系统常常由于非平稳性导致训练动态不稳定,因为学习目标和数据分布随时间变化。我们证明在非平稳目标下,各向同性高斯嵌入具有证明优势。特别是,它们诱导了线性读出对时间变化目标的稳定跟踪,实现了固定方差预算下的最大熵,并鼓励所有表示维度的平衡使用——这些都使智能体更具适应性和稳定性。基于这一见解,我们提出使用Sketched Isotropic Gaussian Regularization来塑造表示以达到各向同性高斯分布。我们在各种领域中实验证明,这种简单且计算成本低的方法在非平稳环境下提高了性能,同时减少了表示崩溃、神经元休眠和训练不稳定性。

英文摘要

Deep reinforcement learning systems often suffer from unstable training dynamics due to non-stationarity, where learning objectives and data distributions evolve over time. We show that under non-stationary targets, isotropic Gaussian embeddings are provably advantageous. In particular, they induce stable tracking of time-varying targets for linear readouts, achieve maximal entropy under a fixed variance budget, and encourage a balanced use of all representational dimensions--all of which enable agents to be more adaptive and stable. Building on this insight, we propose the use of Sketched Isotropic Gaussian Regularization for shaping representations toward an isotropic Gaussian distribution during training. We demonstrate empirically, over a variety of domains, that this simple and computationally inexpensive method improves performance under non-stationarity while reducing representation collapse, neuron dormancy, and training instability.
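
The maximal-entropy claim can be checked with a short standard derivation. It is sketched here under the assumption that the "fixed variance budget" means a fixed trace of the embedding covariance; the symbols $d$, $c$, $\Sigma$, and $\lambda_i$ are introduced for this illustration and are not taken from the paper.

```latex
% Differential entropy of a d-dimensional Gaussian z ~ N(mu, Sigma):
\[
  h(z) \;=\; \tfrac{1}{2}\,\log\!\big( (2\pi e)^{d} \det\Sigma \big).
\]
% Variance budget (assumed form): tr(Sigma) = sum_i lambda_i = c,
% where lambda_1, ..., lambda_d > 0 are the eigenvalues of Sigma.
% By the AM-GM inequality,
\[
  \det\Sigma \;=\; \prod_{i=1}^{d} \lambda_i
  \;\le\; \Big( \tfrac{1}{d} \sum_{i=1}^{d} \lambda_i \Big)^{d}
  \;=\; \Big( \tfrac{c}{d} \Big)^{d},
\]
% with equality if and only if lambda_1 = ... = lambda_d = c/d,
% that is, Sigma = (c/d) I. Hence
\[
  h(z) \;\le\; \tfrac{d}{2}\,\log\!\Big( 2\pi e\,\tfrac{c}{d} \Big),
\]
% and the bound is attained exactly by the isotropic Gaussian.
```

Since a Gaussian also maximizes entropy among all distributions with a given covariance, the isotropic Gaussian is the overall maximizer under this trace constraint, which is consistent with the balanced use of all representational dimensions mentioned in the abstract.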

2603.17310 2026-06-05 cs.AI cs.CL 版本更新

InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning

InfoDensity: 为高效推理奖励信息密集的轨迹

Chengwei Wei, Jung-jae Kim, Longyin Zhang, Shengkai Chen, Nancy F. Chen

发表机构 * Institute for Infocomm Research (I²R), A*STAR, Singapore(新加坡科技研究局资讯通信研究院(I²R)) Centre for Frontier AI Research (CFAR), A*STAR, Singapore(新加坡科技研究局前沿人工智能研究中心(CFAR))

AI总结 本文提出InfoDensity框架,通过捕捉推理轨迹的信息密度特性,改进强化学习训练中的推理质量与效率平衡。

详情
AI中文摘要

具有扩展推理能力的大语言模型(LLMs)常生成冗长且冗余的推理轨迹,导致不必要的计算成本。尽管现有强化学习方法通过优化最终响应长度来解决这一问题,但它们忽略了中间推理步骤的质量,使模型容易受到奖励黑客攻击。我们主张冗长性不仅仅是长度问题,而是中间推理质量差的症状。为此,我们进行了实证研究,追踪大型推理模型在推理轨迹上的每token预测熵。我们发现高质量的推理轨迹具有两个一致特性:低不确定性收敛和快速不确定性下降。这些发现表明,高质量的推理轨迹是信息密集的,即推理步骤相对于总推理长度有助于达到低不确定性水平。基于此,我们提出InfoDensity,一种用于强化学习训练的奖励框架,通过单个熵轨迹的后缀最大包络线捕捉这两个特性,通过长度缩放项优先实现等效质量的简洁性。在数学和一般推理基准上的实验表明,InfoDensity在准确率-效率权衡上优于现有最先进的基线。

英文摘要

Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computational cost. While existing reinforcement learning approaches address this by optimizing final response length, they neglect the quality of intermediate reasoning steps, leaving models vulnerable to reward hacking. We argue that verbosity is not merely a length problem, but a symptom of poor intermediate reasoning quality. To investigate this, we conduct an empirical study tracking the per-token predictive entropy of large reasoning models across reasoning trajectories. We find that high-quality reasoning traces exhibit two consistent properties: low uncertainty convergence and fast uncertainty descent. These findings suggest that high-quality reasoning traces are informationally dense, that is, reasoning steps contribute to reaching a low uncertainty level relative to the total reasoning length. Motivated by this, we propose InfoDensity, a reward framework for RL training that captures both properties through a single suffix-max envelope of the entropy trajectory, weighted by a length scaling term that favors achieving equivalent quality more concisely. Experiments on mathematical and general reasoning benchmarks demonstrate that InfoDensity outperforms state-of-the-art baselines on the accuracy-efficiency trade-off.
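
The suffix-max envelope mentioned above has a simple definition: at each position it is the largest entropy value from that position to the end of the trace, so it records the level that uncertainty never exceeds again. The sketch below computes only that envelope. The reward built on top of it (including the length scaling term) is not specified in the abstract and is deliberately left out; the entropy values are made up.

```python
def suffix_max_envelope(entropies):
    """envelope[i] = max(entropies[i:]), computed in one backward pass."""
    envelope = [0.0] * len(entropies)
    running = float("-inf")
    for i in range(len(entropies) - 1, -1, -1):
        running = max(running, entropies[i])
        envelope[i] = running
    return envelope

# Hypothetical per-step predictive entropies along one reasoning trace.
trace = [2.0, 1.5, 1.8, 0.6, 0.9, 0.2]
print(suffix_max_envelope(trace))  # [2.0, 1.8, 1.8, 0.9, 0.9, 0.2]
```

The envelope is non-increasing by construction, so a trace whose envelope falls quickly and ends low shows both properties the abstract highlights: fast uncertainty descent and low uncertainty convergence. Temporary dips followed by rebounds (such as the 1.5 before the 1.8 above) earn no credit.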

2603.16475 2026-06-05 cs.AI 版本更新

Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures

打破链条:对LLM对中间结构忠实性的因果分析

Oleg Somov, Mikhail Chaichuk, Gleb Ershov, Karim Vafin, Mikhail Seleznyov, Alexander Panchenko, Elena Tutubalina

发表机构 * AIRI(人工智能研究院) MIPT(莫斯科物理技术学院) HSE University(国立高等经济大学) Avito AI Lab(Avito人工智能实验室) Skoltech(斯科尔科沃科学技术研究院)

AI总结 研究探讨了在模式引导推理(SGR)管道中,LLM对中间结构的忠实性,发现尽管模型在自身中间结构上自洽,但对干预的响应不足,当将最终决策的推导委托给外部工具时,这种不稳定性显著降低。

Comments 20 pages, 4 figures, 7 tables

详情
AI中文摘要

在模式引导推理(SGR)管道中,LLM会在做出最终决定前生成显式的中间结构——如rubrics、checklists或验证查询。SGR因其承诺可控性而被越来越多采用:从业者期望能够检查、编辑和覆盖这些结构以引导结果。但这一承诺是否成立?我们引入了一种因果评估协议来衡量它:通过选择任务,其中确定性函数将中间结构映射到决定,每次受控编辑都意味着一个唯一的正确输出。在12个模型和4个基准测试中,模型在自身中间结构上自洽,但干预后预测未能更新——揭示出当中间结构变化时,看似忠实的特性变得脆弱。当最终决策的推导被委托给外部工具时,这种脆弱性大大消失;更强的提示仅带来有限的改进,而偏好优化显著提高了干预的忠实性。总体而言,模式引导管道中的中间结构起着影响性上下文的作用,而非稳定的因果中介。

英文摘要

In schema-guided reasoning (SGR) pipelines, LLMs produce explicit intermediate structures -- rubrics, checklists, or verification queries -- before committing to a final decision. SGR is increasingly adopted because it promises controllability: practitioners expect to inspect, edit, and override these structures to steer the outcome. But does the promise hold? We introduce a causal evaluation protocol to measure it: by selecting tasks where a deterministic function maps intermediate structures to decisions, every controlled edit implies a unique correct output. Across 12 models and 4 benchmarks, models appear self-consistent with their own intermediate structures but fail to update predictions after intervention -- revealing that apparent faithfulness is fragile once the intermediate structure changes. When derivation of the final decision from the structure is delegated to an external tool, this fragility largely disappears; stronger prompting yields only limited improvements, while preference optimization substantially improves intervention faithfulness. Overall, intermediate structures in schema-guided pipelines function as influential context rather than stable causal mediators.
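
The protocol's key idea is that a deterministic function from structure to decision makes every edit checkable. A hedged sketch of that bookkeeping follows, using a hypothetical checklist task where the decision is "pass" only if every item holds. The record fields, the `decide` rule, and the data are all invented for illustration and do not reproduce the paper's benchmarks or metrics.

```python
def decide(checklist):
    """Deterministic mapping from an intermediate structure to a decision
    (hypothetical rule: pass only if every checklist item is satisfied)."""
    return "pass" if all(checklist.values()) else "fail"

def faithfulness_rates(records):
    """Returns (self_consistency, intervention_faithfulness).

    self_consistency: the model's decision agrees with its own structure.
    intervention_faithfulness: after the structure is edited, the model's
    new decision agrees with the edited structure.
    """
    n = len(records)
    consistent = sum(r["decision"] == decide(r["structure"]) for r in records)
    faithful = sum(r["decision_after_edit"] == decide(r["edited_structure"]) for r in records)
    return consistent / n, faithful / n

records = [
    {   # The model ignores the edit and keeps its old answer.
        "structure": {"cites_source": True, "answers_question": True},
        "decision": "pass",
        "edited_structure": {"cites_source": False, "answers_question": True},
        "decision_after_edit": "pass",
    },
    {   # The model updates its answer to follow the edited structure.
        "structure": {"cites_source": True, "answers_question": False},
        "decision": "fail",
        "edited_structure": {"cites_source": True, "answers_question": True},
        "decision_after_edit": "pass",
    },
]
print(faithfulness_rates(records))  # (1.0, 0.5)
```

A gap between the two rates is the pattern the abstract reports: high apparent self-consistency that does not survive intervention. Delegating the final step to an external tool corresponds to calling `decide` directly on the edited structure, which is faithful by construction.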

2603.14805 2026-06-05 cs.AI cs.HC cs.SE 版本更新

Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development

知识激活:作为代理软件开发机构知识基础的AI技能

Gal Bakal

发表机构 * Yahoo Inc.(雅虎公司)

AI总结 本文提出知识激活框架,通过将AI技能专门化为结构化、具备治理意识的原子知识单元(AKUs)来解决企业软件开发中的机构知识交付问题,提升代理软件开发效率。

Comments Preprint. 59 pages, 11 figures. v2 is a major revision: adds an enterprise case study (a Yahoo deployment evaluated by an anonymous 67-engineer survey), with findings integrated into the abstract, introduction, discussion, and conclusion; methodology tightened and references expanded

详情
AI中文摘要

企业软件组织积累了关键的机构知识——架构决策、部署流程、合规政策、事件 playbook——但这些知识仍被困在为人类解读设计的格式中。代理软件开发有效性的瓶颈不是模型能力,而是知识架构。当任何知识消费者——自主AI代理、新入职工程师或资深开发者——在没有机构上下文的企业任务中遇到问题时,结果是猜测、修正级联以及对资深工程师的不成比例负担。本文介绍知识激活,一个将AI技能——代理可消费知识的开放标准——专门化为结构化、治理-aware的原子知识单元(AKUs)的框架,用于机构知识交付。与其检索文档进行解读,AKUs提供行动准备的规格,编码应做什么、使用哪些工具、尊重哪些约束以及下一步去哪里——这样代理才能正确行动,工程师可以接收基于机构的指导,而无需重新构建组织上下文。AKUs形成一个可组合的知识图谱,代理在运行时遍历——压缩入职时间,减少跨团队摩擦,并消除修正级联。本文正式化了使这种架构必要的资源约束,指定了AKU的模式和部署架构,并将长期维护扎根于知识共享实践。一项针对67名工程师的Yahoo部署调查表明,开发者体验有显著提升——每周节省2.6小时,净推荐值+35。那些为代理时代架构其机构知识的组织将优于只投资模型能力的组织。

英文摘要

Enterprise software organizations accumulate critical institutional knowledge (architectural decisions, deployment procedures, compliance policies, incident playbooks), yet this knowledge remains trapped in formats designed for human interpretation. The bottleneck to effective agentic software development is not model capability but knowledge architecture. When any knowledge consumer (an autonomous AI agent, a newly onboarded engineer, or a senior developer) encounters an enterprise task without institutional context, the result is guesswork, correction cascades, and a disproportionate tax on senior engineers who must manually supply what others cannot infer. This paper introduces Knowledge Activation, a framework that specializes AI Skills, the open standard for agent-consumable knowledge, into structured, governance-aware Atomic Knowledge Units (AKUs) for institutional knowledge delivery. Rather than retrieving documents for interpretation, AKUs deliver action-ready specifications encoding what to do, which tools to use, what constraints to respect, and where to go next, so that agents act correctly and engineers receive institutionally grounded guidance without reconstructing organizational context from scratch. AKUs form a composable knowledge graph that agents traverse at runtime, compressing onboarding, reducing cross-team friction, and eliminating correction cascades. The paper formalizes the resource constraints that make this architecture necessary, specifies the AKU schema and deployment architecture, and grounds long-term maintenance in knowledge commons practice. A Yahoo deployment surveying 67 engineers shows statistically significant developer-experience gains: 2.6 hours per week saved and a Net Promoter Score of +35. Organizations that architect their institutional knowledge for the agentic era will outperform those that invest solely in model capability.

2603.14169 2026-06-05 stat.ME cs.AI 版本更新

Beyond Means: Topological Causal Effects under Persistent-Homology Ignorability

超越均值:持久同调可忽略性下的拓扑因果效应

Amir Saki, Usef Faghihi

发表机构 * Université du Québec à Trois-Rivières(魁北克三河大学)

AI总结 本文提出基于持久同调的因果框架,以解决均值基于因果估计在处理结局分布形状变化时的局限性,通过定义拓扑学的CATE和ATE,并证明其在近似拓扑可忽略性下的可识别性。

详情
AI中文摘要

平均处理效应(ATE)和条件平均处理效应(CATE)是因果估计的核心,但它们仅关注预期结果的变化,可能忽略处理引起的结局分布形状变化。当对照组结果单峰,处理组结果双峰且均值相同,均值基于的因果估计会失效。本文基于持久同调发展了因果框架,提出了持久同调可忽略性条件,定义了拓扑学的CATE和ATE,并证明这些估计量在近似拓扑可忽略性下可识别。同时指出,边际持久图效应不能仅通过条件拓扑可忽略性确定,因为持久同调通常不与协变量混合交换。为保持原意并确保科学正确性,本文保留边际效应作为动机量,但将数学上稳健的条件估计量置于理论中心。合成实验显示,均值基于的因果估计仍接近零,而所提拓扑效应显著增加并在调整混杂后可恢复。

英文摘要

Average treatment effects (ATE) and conditional average treatment effects (CATE) are foundational causal estimands, but they target changes in expected outcomes and can miss treatment-induced changes in the shape of outcome distributions. A canonical failure mode occurs when control outcomes are unimodal, treated outcomes become bimodal, and both distributions have the same mean. In such cases mean-based causal estimands are zero even though the geometry and topology of the outcome law change substantially. This paper develops a topological causal framework based on persistent homology. We formalize a persistent-homology ignorability condition, define topological analogues of CATE and ATE, and prove that these estimands are identifiable up to an explicit error bound under approximate topological ignorability. We also clarify a subtle but important point: a marginal persistence-diagram effect is not identified from conditional topological ignorability alone because persistent homology does not in general commute with mixtures over covariates. To preserve the original intuition while ensuring scientific correctness, we retain the marginal effect as a motivating quantity, but place the mathematically sound conditional estimands at the center of the theory. A synthetic experiment with mean-preserving topology change shows that mean-based causal estimands remain near zero while the proposed topological effect increases sharply and remains recoverable after adjustment for confounding.
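The mean-preserving failure mode described above can be illustrated with a toy computation. The sketch below is not the paper's estimator: it assumes one-dimensional outcomes and uses the longest finite H0 bar of a Vietoris-Rips filtration as a stand-in topological summary (for points on a line, the finite H0 bars are exactly the gaps between sorted neighbours). The data and the function name `h0_lifetimes` are illustrative assumptions.

```python
from statistics import fmean

def h0_lifetimes(points):
    """Finite H0 bar lengths of a 1-D point cloud (Vietoris-Rips filtration).

    On a line, connected components merge exactly at the gaps between
    sorted neighbours, so the bars are those gaps, longest first.
    """
    xs = sorted(points)
    return sorted((b - a for a, b in zip(xs, xs[1:])), reverse=True)

# Hypothetical outcomes: unimodal control, bimodal treated, both with mean 0.
control = [i / 10 for i in range(-10, 11)]
treated = [i / 10 for i in range(-35, -24)] + [i / 10 for i in range(25, 36)]

mean_effect = fmean(treated) - fmean(control)                       # about 0
topo_effect = h0_lifetimes(treated)[0] - h0_lifetimes(control)[0]   # large

print(f"difference in means: {mean_effect:.3f}")
print(f"difference in longest H0 bar: {topo_effect:.3f}")
```

The difference in means is numerically zero, while the longest H0 bar grows from the sampling step to the full distance between the two treated modes.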

2603.13761 2026-06-05 cs.LG cs.AI 版本更新

Level Up: Defining and Exploiting Transitional Problems for Curriculum Learning

Level Up: 定义和利用过渡问题以进行课程学习

Amogh Inamdar, Zhenwei Tang, Ashton Anderson, Richard Zemel

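Institutions: Department of Computer Science, Columbia University; Department of Computer Science, University of Toronto

AI summary: The paper proposes a new method that improves curriculum learning by defining and exploiting transitional problems, adjusting training difficulty as model competence increases so that model performance improves more efficiently.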

详情
AI中文摘要

课程学习——按顺序排列训练示例以帮助机器学习——受到人类学习的启发,但尚未得到广泛接受。静态策略依赖于间接的难度评分代理,产生不特定于当前学习者的课程。动态方法基于梯度信息估计难度,但需要大量的额外计算。我们介绍了一种新的方法,通过一系列能力递增的模型来测量单个问题实例的难度,并识别出在模型能力提升时始终更简单的过渡问题。将此方法应用于由多个可用模型构成的多样化模型系列,我们发现,使用从简单到困难的过渡问题进行训练,最有效地将模型提升到下一个能力层级。这些问题诱导了从简单到困难的自然进步,优于其他训练策略。通过直接测量难度相对于模型能力,我们的方法产生了可解释的问题、特定于学习者的课程以及逐步改进的原理基础。

英文摘要

Curriculum learning--ordering training examples in a sequence to aid machine learning--takes inspiration from human learning, but has not gained widespread acceptance. Static strategies for scoring item difficulty rely on indirect proxy scores of varying quality and produce curricula that are not specific to the learner at hand. Dynamic approaches base difficulty estimates on gradient information, requiring considerable extra computation during training. We introduce a novel method for measuring the difficulty of individual problem instances that is calibrated to a series of models of increasing competence, and identify "transitional problems" that are consistently easier as model ability increases. Applying this method to diverse model series constructed from sets of models that are readily available on many tasks, we find that training on a curriculum that "levels up" from easier to harder transitional problems most efficiently improves a model to the next tier of competence. These problems induce a natural progression from easier to harder items, which outperforms other training strategies. By measuring difficulty directly relative to model competence, our method yields interpretable problems, learner-specific curricula, and a principled basis for step-by-step improvement.
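One possible reading of "transitional problems" can be sketched as follows. The criterion used here (success rate never decreases along the model series and improves overall) and the `level` threshold of 0.5 are assumptions made for illustration; the abstract does not give the exact definition.

```python
def is_transitional(rates, min_gain=0.0):
    """True if success rates never drop as models get stronger and improve overall."""
    monotone = all(b >= a for a, b in zip(rates, rates[1:]))
    return monotone and (rates[-1] - rates[0]) > min_gain

def level(rates, threshold=0.5):
    """Index of the weakest model that reaches the threshold, or None."""
    for i, rate in enumerate(rates):
        if rate >= threshold:
            return i
    return None

# Hypothetical success rates for three models ordered from weakest to strongest.
problems = {
    "p1": [0.1, 0.6, 0.9],   # transitional, solved from model 1 onward
    "p2": [0.0, 0.2, 0.7],   # transitional, solved only by model 2
    "p3": [0.8, 0.3, 0.9],   # not monotone, so excluded
    "p4": [0.0, 0.0, 0.0],   # never improves, so excluded
}

# Curriculum: transitional problems ordered from easier to harder levels.
curriculum = sorted(
    (p for p, r in problems.items() if is_transitional(r) and level(r) is not None),
    key=lambda p: level(problems[p]),
)
print(curriculum)
```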

2603.10971 2026-06-05 cs.RO cs.AI 版本更新

ContactExplorer: Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation

ContactExplorer: 接触覆盖引导的通用灵巧操作探索

Zixuan Liu, Ruoyi Qiao, Chenrui Tie, Xuanwei Liu, Yunfan Lou, Chongkai Gao, Zhixuan Xu, Lin Shao

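Institutions: School of Computing, National University of Singapore; RoboScience

AI summary: Proposes ContactExplorer, which uses a contact coverage reward and an energy-based reaching reward to explore contact patterns efficiently in dexterous manipulation tasks, improving sample efficiency and success rates.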

Comments 24 pages

详情
AI中文摘要

强化学习在Atari游戏、导航和移动等任务中取得了显著成功,这些任务中的探索通常可以通过状态或动态的新颖性来引导。相比之下,灵巧操作需要丰富的物理手-物体交互,但现有方法常受限于不稳定的基于接触的新颖性信号、低效的距离新颖性信号或依赖任务先验知识。我们提出ContactExplorer,一种用于灵巧操作任务的通用探索方法。ContactExplorer将接触表示为物体表面点与手部关键点的交集,鼓励灵巧手发现多样且新颖的接触模式,即哪些手指接触物体的哪些区域。它维护一个基于离散化物体状态(通过学习的哈希码获得)的接触计数器,捕捉每个手指与不同物体区域交互的频率。该计数器以两种互补方式利用:(1)分配基于计数的接触覆盖奖励,促进对新接触模式的探索;(2)基于能量的到达奖励,引导智能体朝向未充分探索的接触区域。我们在多种灵巧操作任务上评估ContactExplorer。实验结果表明,ContactExplorer在样本效率和成功率上显著优于现有探索方法,并且通过ContactExplorer学习的接触模式能鲁棒地迁移到现实世界。项目页面:https://contact-explorer.github.io。

英文摘要

Reinforcement learning has achieved remarkable success in domains such as Atari games, navigation, and locomotion, where exploration can often be guided by novelty over states or dynamics. In contrast, dexterous manipulation requires rich physical hand--object interactions, but existing methods often suffer from unstable contact-based novelty signals, inefficient distance novelty signals, or reliance on task-specific priors. We propose ContactExplorer, a general exploration method for dexterous manipulation tasks. ContactExplorer represents contact as the intersection between object surface points and hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) to assign a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate ContactExplorer on a diverse set of dexterous manipulation tasks. Experimental results show that ContactExplorer substantially improves sample efficiency and success rates over existing exploration methods, and that the contact patterns learned with ContactExplorer transfer robustly to the real world. Project page is https://contact-explorer.github.io.
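The count-based contact coverage reward can be sketched under stated assumptions. The class below keys a counter on (discretized object state, finger, object region) and pays a bonus of 1/sqrt(count), a common form for count-based exploration; the abstract does not specify the actual bonus, and the learned hash codes are replaced here by a plain string. The energy-based reaching reward is not sketched.

```python
from collections import defaultdict
from math import sqrt

class ContactCounter:
    """Counts (state code, finger, region) contacts and rewards rare ones."""

    def __init__(self):
        self.counts = defaultdict(int)

    def reward(self, state_code, contacts):
        """contacts: iterable of (finger, region) pairs active at this step."""
        total = 0.0
        for finger, region in contacts:
            key = (state_code, finger, region)
            self.counts[key] += 1
            total += 1.0 / sqrt(self.counts[key])  # assumed bonus shape
        return total

counter = ContactCounter()
first = counter.reward("s0", [("thumb", 3)])    # novel contact pattern
repeat = counter.reward("s0", [("thumb", 3)])   # same pattern, smaller bonus
other = counter.reward("s0", [("index", 3)])    # new finger, full bonus again
print(first, repeat, other)
```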

2603.07294 2026-06-05 cs.CV cs.AI 版本更新

MAviS: A Multimodal Conversational Assistant For Avian Species

MAviS:一种用于鸟类物种的多模态对话助手

Yevheniia Kryklyvets, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jinxing Zhou, Fahad Shabzan Khan, Rao Anwer, Salman Khan, Hisham Cholakkal

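Institutions: Mohamed bin Zayed University of Artificial Intelligence

AI summary: The paper presents the MAviS-Dataset and the MAviS-Chat model, which integrate image, audio, and text information to improve fine-grained understanding of avian species and multimodal question answering, and it shows the importance of domain-adapted multimodal LLMs for ecological applications.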

Comments EMNLP 2025

详情
AI中文摘要

细粒度理解和特定物种的多模态问答对于推进生物多样性保护和生态监测至关重要。然而,现有的多模态大语言模型在处理如鸟类物种等专业领域时面临挑战,难以提供准确且上下文相关的信息。为此,我们引入了MAviS数据集,这是一个大规模的多模态鸟类物种数据集,整合了图像、音频和文本模态,涵盖超过1000种鸟类物种,包含预训练和指令微调子集,并补充了结构化的问答对。基于MAviS数据集,我们引入了MAviS-Chat,一种支持音频、视觉和文本的多模态大语言模型,旨在实现细粒度物种理解、多模态问答和场景特定描述生成。最后,为了定量评估,我们提出了MAviS-Bench,一个包含超过25,000个问答对的基准测试,用于评估跨模态的鸟类物种特定感知和推理能力。实验结果表明,MAviS-Chat在基准MiniCPM-o-2.6上表现显著优于基线,实现了最先进的开源结果,并展示了我们指令微调MAviS数据集的有效性。我们的发现强调了在生态应用中领域适应的多模态大语言模型的必要性。

英文摘要

Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question-answer pairs. Building on the MAviS-Dataset, we introduce MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present MAviS-Bench, a benchmark of over 25,000 QA pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive multimodal LLMs for ecological applications.

2601.09923 2026-06-05 cs.AI 版本更新

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Hanna Foerster, Tom Blanchard, Kristina Nikolić, Ilia Shumailov, Cheng Zhang, Robert Mullins, Nicolas Papernot, Florian Tramèr, Yiren Zhao

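Institutions: University of Cambridge; University of Toronto & Vector Institute; ETH Zurich; AI Security Company

AI summary: The paper proposes a system-level security approach for Computer Use Agents (CUAs) that uses single-shot planning and the NOVA framework to provide control flow integrity guarantees under dynamic UI state, improving security while retaining much of the agents' performance.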

详情
AI中文摘要

AI代理容易受到提示注入攻击,其中恶意内容劫持代理行为。在已提出的防御措施中,架构隔离通过严格分离可信任务规划与不可信环境观察提供了最强的保证。然而,将此设计应用于自动化任务的计算机使用代理(CUAs)则面临根本性挑战。当前代理需要持续观察UI状态以确定每个动作,这与安全所需的隔离相冲突。我们通过证明UI工作流虽然动态但结构上可预测,解决了这一矛盾。单次规划,即可信规划器提前发出完整的分支计划,覆盖所有预期的运行时状态,可为任意指令注入提供控制流完整性保障。我们引入NOVA(通过观察、验证和行动导航)使这种方案在组合爆炸的UI状态空间中可行,其中计划可以调用感知模型来解析运行时值,如UI坐标。我们在OSWorld上评估了我们的设计,保留了前沿模型57%的性能,同时对较小的开源模型性能提升高达19%,证明了在CUAs中严格的安全性和实用性可以共存。尽管提前规划防止了指令注入,但我们展示还需要额外措施来防御分支引导攻击,其中攻击者欺骗感知模型使执行沿着攻击者偏好的计划分支进行,例如将代理引导至恶意网站。

英文摘要

AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior. Among proposed defenses, architectural isolation provides the strongest guarantees by strictly separating trusted task planning from untrusted environment observations. However, applying this design to Computer Use Agents (CUAs), which automate tasks by viewing screens and executing actions, presents a fundamental challenge. Current agents require continuous observation of UI state to determine each action, which conflicts with the isolation required for security. We resolve this tension by demonstrating that UI workflows, while dynamic, are structurally predictable. Single-shot planning, where a trusted planner emits upfront a complete branching plan covering all anticipated runtime states, provides control flow integrity guarantees against arbitrary instruction injections. We introduce NOVA (Navigating via Observation, Verification, and Action) to make this viable in the combinatorially large UI state space, where the plan can invoke a perception model to resolve runtime values such as UI coordinates. We evaluate our design on OSWorld, and retain up to 57% of the performance of frontier models while improving performance for smaller open-source models by up to 19%, demonstrating that rigorous security and utility can coexist in CUAs. Although upfront planning prevents instruction injections, we show that additional measures are needed to defend against Branch Steering attacks, where adversaries deceive the perception model into routing execution down attacker-preferred branches of the plan, such as redirecting the agent to a malicious website.
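The idea of single-shot planning can be illustrated with a minimal executor. This is a sketch under assumptions, not the NOVA implementation: the plan format, the `run_plan` function, and the example plan are all hypothetical. The point it shows is that untrusted observations can only select among branches the trusted planner declared in advance, and the example plan is acyclic so execution terminates.

```python
def run_plan(plan, perceive, start="start"):
    """Follow a fixed branching plan; observations may only pick a declared branch."""
    trace, node = [], start
    while True:
        step = plan[node]
        trace.append(step["action"])
        if "branches" not in step:           # leaf node: the plan is finished
            return trace
        label = perceive(step["query"])      # untrusted perception output
        if label not in step["branches"]:    # anything unplanned is rejected
            raise ValueError(f"unplanned observation {label!r}; refusing to continue")
        node = step["branches"][label]

# Hypothetical plan emitted upfront by a trusted planner.
PLAN = {
    "start": {"action": "open_mail_client", "query": "is_logged_in",
              "branches": {"yes": "compose", "no": "login"}},
    "login": {"action": "type_credentials", "query": "login_ok",
              "branches": {"yes": "compose"}},
    "compose": {"action": "write_email"},
}

observations = {"is_logged_in": "no", "login_ok": "yes"}
trace = run_plan(PLAN, observations.get)
print(trace)
```

An injected instruction returned by the perception step cannot add new actions, since it does not match any declared branch label. It could still choose among the declared branches, which is the branch steering risk the abstract describes.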

2601.11527 2026-06-05 cs.HC cs.AI cs.CY 版本更新

"What if she doesn't feel the same?" What Happens When We Ask AI for Relationship Advice

如果她不再有同样的感觉呢?当我们将AI用于关系建议时会发生什么

Niva Manchanda, Akshata Kishore Moharir, Ratna Kandala

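Institutions: Department of Psychology, University of Kansas; Independent Researcher

AI summary: The study examines how users evaluate LLM-generated romantic relationship advice. It finds high satisfaction with the advice, a positive association between satisfaction and perceived model reliability and helpfulness, and a significant improvement in users' attitudes toward LLMs.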

详情
Journal ref
First Workshop on LLM Persona Modeling, NeurIPS 2025
AI中文摘要

大型语言模型(LLMs)越来越多地被用于提供支持和建议,特别是在浪漫关系等个人领域,但关于用户对这种类型建议的看法知之甚少。本研究调查了人们如何评价LLM生成的浪漫关系建议。参与者评估了建议的满意度、模型的可靠性以及有用性,并完成了关于他们对LLMs总体态度的前后测。总体而言,研究结果表明参与者对LLM生成的建议非常满意。更高的满意度与他们对模型可靠性和有用性的感知正相关。重要的是,接触这些建议后,参与者对LLMs的态度显著改善,这表明支持性和情境相关的建议可以增强用户对这些AI系统的信任和开放性。

英文摘要

Large Language Models (LLMs) are increasingly being used to provide support and advice in personal domains such as romantic relationships, yet little is known about user perceptions of this type of advice. This study investigated how people evaluate LLM-generated advice on romantic relationships. Participants rated advice satisfaction, model reliability, and helpfulness, and completed pre- and post-measures of their general attitudes toward LLMs. Overall, the results showed participants' high satisfaction with LLM-generated advice. Greater satisfaction was, in turn, strongly and positively associated with their perceptions of the models' reliability and helpfulness. Importantly, participants' attitudes toward LLMs improved significantly after exposure to the advice, suggesting that supportive and contextually relevant advice can enhance users' trust and openness toward these AI systems.

2508.06249 2026-06-05 cs.LG cs.AI 版本更新

In-Training Defenses against Emergent Misalignment in Language Models

训练过程中对抗语言模型中新兴偏差的防御措施

David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, Florian Mai

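Institutions: University of Copenhagen

AI summary: The paper studies how to prevent emergent misalignment in language models during training. It examines five training regularization interventions and shows that selecting interleaving data by the perplexity gap between aligned and misaligned models gives the best results.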

Comments Accepted at ICML 2026 https://icml.cc/virtual/2026/poster/64303

详情
AI中文摘要

微调使从业者能够将对齐的大型语言模型 (LLMs) 重新用于新领域,但最近的研究揭示了新兴偏差 (EM):即使是一个小的、领域特定的微调,也可能导致远超出目标领域的有害行为。即使在模型权重被隐藏在微调API之后的情况下,这也为攻击者提供了无意中访问广泛偏差模型的途径,这从微调数据本身难以检测。我们提出了第一个系统研究在训练过程中对抗EM的防护措施,这些措施对提供者而言是可行的,他们通过API暴露微调:我们评估了这些措施是否能够防止广泛的偏差、允许狭窄的偏差、在良性任务上学习良好,并且保持一致性。我们调查了五种训练正则化干预:(i) 朝着安全参考模型的KL散度正则化,(ii) 特征空间中的ℓ2距离,(iii) 通过邪恶人格向量进行预防性引导,(iv) 从一般指令微调数据集交错训练示例,以及 (v) 疫苗提示。我们证明,通过选择对齐模型与偏差模型之间的困惑度差异的交错数据可以获得最佳效果。

英文摘要

Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EM): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EM that are practical for providers who expose fine-tuning via an API: We evaluate whether they a) prevent broad misalignment, b) allow narrow misalignment, c) learn well on benign tasks, and d) remain coherent. We investigate five training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) preventive steering with an evil persona vector, (iv) interleaving training examples from a general instruct-tuning dataset and (v) inoculation prompting. We demonstrate that selecting interleaving data by the perplexity gap between aligned and misaligned models yields the best results overall.
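Selecting interleaving data by perplexity gap could look roughly like the sketch below. Two things are assumptions: the direction of the gap (here, examples that the misaligned model finds much less likely than the aligned model are ranked first) and the top-k selection rule. The abstract states only that the gap between aligned and misaligned models is the selection signal.

```python
from math import exp

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return exp(-sum(token_logprobs) / len(token_logprobs))

def select_by_gap(examples, k):
    """examples: list of (id, logprobs_under_aligned, logprobs_under_misaligned)."""
    scored = [
        (perplexity(misaligned) - perplexity(aligned), example_id)
        for example_id, aligned, misaligned in examples
    ]
    scored.sort(reverse=True)  # assumed direction: largest gap first
    return [example_id for _, example_id in scored[:k]]

# Hypothetical log-probabilities for three candidate instruct-tuning examples.
candidates = [
    ("a", [-0.1, -0.1], [-2.0, -2.0]),   # aligned model fits far better
    ("b", [-1.0, -1.0], [-1.0, -1.0]),   # no gap
    ("c", [-2.0, -2.0], [-0.1, -0.1]),   # misaligned model fits better
]
selected = select_by_gap(candidates, k=2)
print(selected)
```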

2603.03955 2026-06-05 cs.LG cs.AI 版本更新

GIPO: Gaussian Importance Sampling Policy Optimization

GIPO:高斯重要性采样策略优化

Chengxuan Lu, Zhenquan Zhang, Shukuan Wang, Qunzhi Lin, Yanjie Li, Baigui Sun, Yang Liu

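Institutions: University of Science and Technology of China

AI summary: The study proposes GIPO, a policy optimization objective based on truncated importance sampling that replaces hard clipping with a log-ratio-based Gaussian trust weight, damping extreme importance ratios while keeping gradients non-zero to improve data efficiency. Experiments show that GIPO performs best across replay buffer sizes, with a better bias-variance trade-off, high training stability, and improved sample efficiency.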

详情
AI中文摘要

在强化学习(RL)后训练近年来已显示出在多模态智能体上超越监督模仿的强劲潜力。然而,RL仍然受到较差的数据效率的限制,特别是在交互数据稀缺且迅速过时的设置中。为了解决这一挑战,GIPO(高斯重要性采样策略优化)被提出作为基于截断重要性采样的策略优化目标,用基于对数比率的高斯信任权重替代硬裁剪,以软化极端重要性比率同时保持非零梯度。理论分析显示,GIPO引入了隐含且可调的更新幅度约束,而集中界保证了在有限样本估计下的鲁棒性和稳定性。实验结果表明,GIPO在各种回放缓冲区大小范围内,从接近策略到高度过时的数据均取得了最佳性能,同时表现出优越的偏差-方差权衡、高训练稳定性和改进的样本效率。代码可在https://github.com/distanceLu/GIPO获得。

英文摘要

Post-training with reinforcement learning (RL) has recently shown strong promise for advancing multimodal agents beyond supervised imitation. However, RL remains limited by poor data efficiency, particularly in settings where interaction data are scarce and quickly become outdated. To address this challenge, GIPO (Gaussian Importance sampling Policy Optimization) is proposed as a policy optimization objective based on truncated importance sampling, replacing hard clipping with a log-ratio-based Gaussian trust weight to softly damp extreme importance ratios while maintaining non-zero gradients. Theoretical analysis shows that GIPO introduces an implicit, tunable constraint on the update magnitude, while concentration bounds guarantee robustness and stability under finite-sample estimation. Experimental results show that GIPO achieves state-of-the-art performance among clipping-based baselines across a wide range of replay buffer sizes, from near on-policy to highly stale data, while exhibiting superior bias--variance trade-off, high training stability and improved sample efficiency. Code is available at https://github.com/distanceLu/GIPO.
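The contrast between hard clipping and a log-ratio-based Gaussian trust weight can be sketched numerically. The functional form exp(-(log r)^2 / (2 sigma^2)) and the value of `sigma` are assumptions chosen to match the description in the abstract; the exact GIPO objective may differ.

```python
from math import exp, log

def gaussian_trust_weight(ratio, sigma=0.5):
    """Soft weight that equals 1 at ratio 1 and decays smoothly in log-ratio space."""
    return exp(-(log(ratio) ** 2) / (2.0 * sigma ** 2))

def hard_clip(ratio, eps=0.2):
    """PPO-style clipping of the importance ratio, shown for comparison."""
    return min(max(ratio, 1.0 - eps), 1.0 + eps)

for r in (0.5, 1.0, 2.0, 5.0):
    print(f"ratio={r:4.1f}  clipped={hard_clip(r):.2f}  trust={gaussian_trust_weight(r):.4f}")
```

Under this form the weight is symmetric for ratios r and 1/r and never reaches exactly zero, so stale samples are down-weighted rather than cut off.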

2410.06703 2026-06-05 cs.AI 版本更新

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

ST-WebAgentBench:用于评估网络代理安全性和可信度的基准测试

Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Nir Mashkif, Segev Shlomov

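Institutions: IBM Research

AI summary: The paper presents ST-WebAgentBench, a benchmark for evaluating the safety and trustworthiness of web agents in realistic enterprise scenarios. By introducing the new metrics CuP and Risk Ratio, it exposes safety gaps in existing agents.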

Comments The Fourteenth International Conference on Learning Representations (ICLR 2026)

详情
AI中文摘要

自主网络代理能够解决复杂的浏览任务,但现有基准测试仅衡量代理是否完成任务,而忽略了其完成任务的安全性和企业可信任性。为了将这些代理整合到关键工作流程中,安全性和可信度(ST)是采用的前提条件。我们介绍了ST-WebAgentBench,一个可配置且易于扩展的评估套件,用于在现实企业场景中评估网络代理的ST。其222个任务均配以ST策略,即简明的规则,编码约束,并在六个正交维度(如用户同意、鲁棒性)上评分。除了原始任务成功率外,我们提出了完成受政策约束(CuP)指标,仅奖励遵守所有适用政策的完成情况,以及风险比,量化各维度上的ST违规情况。评估三个最先进的开放代理揭示了其平均CuP低于名义完成率的三分之二,暴露了关键安全漏洞。通过发布代码、评估模板和政策编写界面,ST-WebAgentBench提供了一个可操作的第一步,以部署可信赖的网络代理规模化。

英文摘要

Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical workflows, safety and trustworthiness (ST) are prerequisite conditions for adoption. We introduce ST-WebAgentBench, a configurable and easily extensible suite for evaluating web agent ST across realistic enterprise scenarios. Each of its 222 tasks is paired with ST policies, concise rules that encode constraints, and is scored along six orthogonal dimensions (e.g., user consent, robustness). Beyond raw task success, we propose the Completion Under Policy (CuP) metric, which credits only completions that respect all applicable policies, and the Risk Ratio, which quantifies ST breaches across dimensions. Evaluating three open state-of-the-art agents reveals that their average CuP is less than two-thirds of their nominal completion rate, exposing critical safety gaps. By releasing code, evaluation templates, and a policy-authoring interface, ST-WebAgentBench (https://sites.google.com/view/st-webagentbench/home) provides an actionable first step toward deploying trustworthy web agents at scale.
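The difference between nominal completion and Completion Under Policy can be made concrete with a small sketch. The record format (`completed`, `violations`) is a hypothetical stand-in, not the benchmark's actual schema, and the Risk Ratio is not sketched because the abstract does not define its formula.

```python
def completion_rate(runs):
    """Fraction of tasks the agent finished, ignoring policy violations."""
    return sum(run["completed"] for run in runs) / len(runs)

def completion_under_policy(runs):
    """Fraction of tasks finished with no policy violation at all (CuP)."""
    return sum(run["completed"] and not run["violations"] for run in runs) / len(runs)

# Hypothetical results for four tasks.
runs = [
    {"completed": True,  "violations": []},
    {"completed": True,  "violations": ["user_consent"]},  # finished, but unsafely
    {"completed": False, "violations": []},
    {"completed": True,  "violations": []},
]
print(completion_rate(runs), completion_under_policy(runs))
```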

2602.19327 2026-06-05 cs.LG cs.AI 版本更新

Soft Sequence Policy Optimization

软序列策略优化

Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

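Institutions: Lomonosov Moscow State University; Institute for Artificial Intelligence

AI summary: The paper proposes Soft Sequence Policy Optimization, which refines sequence-level importance weights with soft gating functions, improving training stability and performance on LLM alignment tasks.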

详情
AI中文摘要

大量近期关于大语言模型(LLM)对齐的研究聚焦于基于组相对策略优化(GRPO)开发新的策略优化方法。两个显著方向出现:(i)向序列级重要性采样权重的转变,以更好地对齐许多任务中使用的序列级奖励;(ii)替代PPO风格的剪裁方法,以避免相关的训练信号损失和熵崩溃。我们引入了软序列策略优化(SSPO),一种离策略强化学习目标,其在序列级重要权重中整合了token级概率比的软门控函数。我们为SSPO提供了理论动机,并调查了实际修改以改善优化行为。实证结果显示,SSPO在数学推理和编码任务中均提高了训练稳定性与性能。

英文摘要

A significant portion of recent research on Large Language Model (LLM) alignment focuses on developing new policy optimization methods based on Group Relative Policy Optimization (GRPO). Two prominent directions have emerged: (i) a shift toward sequence-level importance sampling weights that better align with the sequence-level rewards used in many tasks, and (ii) alternatives to the PPO-style clipping that aim to avoid the associated loss of training signal and entropy collapse. We introduce Soft Sequence Policy Optimization, an off-policy reinforcement learning objective that incorporates soft gating functions over token-level probability ratios within sequence-level importance weights. We provide theoretical motivation for SSPO and investigate practical modifications to improve optimization behavior. Empirically, we demonstrate that SSPO improves training stability and performance both in mathematical reasoning and coding tasks.

2602.22067 2026-06-05 cs.AI 版本更新

Semantic Partial Grounding via LLMs

通过大语言模型实现语义部分 grounding

Giuseppe Canonaco, Alberto Pozanco, Daniel Borrajo

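Institutions: Department of Computer Science, University of Cambridge

AI summary: The paper proposes SPG-LLM, which uses large language models to analyze domain and problem files and identify potentially irrelevant objects, actions, and predicates in advance, shrinking the grounded task, speeding up grounding, and achieving better plan costs in some domains.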

详情
AI中文摘要

Grounding是经典规划中的关键步骤,但随着任务规模增大,grounded动作和原子的指数增长常使其成为计算瓶颈。近期部分grounding方法通过预测模型逐步grounding最有前景的操作符来解决这一挑战。然而,这些方法主要依赖关系特征或学习到的嵌入,未利用PDDL描述中的文本和结构线索。我们提出SPG-LLM,利用大语言模型分析领域和问题文件,启发式地识别潜在不相关的对象、动作和谓词,从而在grounding前显著减少grounded任务的规模。在七个难以grounding的基准测试中,SPG-LLM实现了更快的grounding速度(通常快几个数量级),并在某些领域实现了可比或更优的计划成本。

英文摘要

Grounding is a critical step in classical planning, yet it often becomes a computational bottleneck due to the exponential growth in grounded actions and atoms as task size increases. Recent advances in partial grounding have addressed this challenge by incrementally grounding only the most promising operators, guided by predictive models. However, these approaches primarily rely on relational features or learned embeddings and do not leverage the textual and structural cues present in PDDL descriptions. We propose SPG-LLM, which uses LLMs to analyze the domain and problem files to heuristically identify potentially irrelevant objects, actions, and predicates prior to grounding, significantly reducing the size of the grounded task. Across seven hard-to-ground benchmarks, SPG-LLM achieves faster grounding, often by orders of magnitude, while delivering comparable or better plan costs in some domains.

2512.15783 2026-06-05 cs.AI cs.LG 版本更新

Towards AI epidemiology: a measurement standardisation framework for prospective risk detection

迈向人工智能流行病学:一种用于前瞻性风险检测的测量标准化框架

Kit Tempest-Walters

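AI summary: The paper proposes a measurement standardisation framework that compresses expert-AI interactions into structured, comparable fields for prospective risk detection without access to model internals. It defines the framework's scope, both semantically and statistically, and specifies a protocol for empirical testing in future work.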

Comments 29 pages, 3 figures

详情
AI中文摘要

本文提出了一种测量标准化框架,该框架将专家-人工智能交互压缩为结构化、可比较的领域,用于在部署的人工智能系统中进行前瞻性风险检测,而无需访问模型内部信息。本文的概念性论文的主要目的是定义该框架的范围,包括语义和统计层面,并指定未来工作的实证测试协议。该框架旨在支持的群体层面声明因此是阶段性的研究计划,而非本文中声称的结果。测量标准化支撑着接下来的三个声明。第一个是可靠性声明:在有限条件下,大型语言模型可以产生可靠的、标准化的评估,用于评估专家-人工智能交互的证据和对齐情况。第二个是治理声明:对齐分数在部署期间为专家提供即时信号,并为机构提供监控不同任务类型、模型和领域的对齐模式的基础。第三个是流行病学声明:一旦建立了测量标准化,聚合对齐分数可以用于研究与下游结果相关的关联,这在受监管的专业环境中是可能的。这引入了基于相关变量而非机理分析的“人工智能流行病学”的可能性。本文解决了第一个声明,并指定了调查第二个和第三个声明的协议。为了在未来研究中实现实证评估,本文阐述了定义的语法,以及基于成对Bootstrap推断的统计协议,DeLong测试用于成对AUCs作为灵敏度检查,预设的一侧非劣性边界为0.05,以及Holm-Bonferroni校正。

英文摘要

This paper proposes a measurement standardisation framework that compresses expert-AI interactions into structured, comparable fields for prospective risk detection in deployed AI systems, without access to model internals. The main aim of this concept paper is to define the scope of the framework, both semantically and statistically, and to specify a protocol for its empirical testing in future work. The population-level claims the framework is designed to support are therefore the subject of a staged research programme rather than results claimed in this paper. Measurement standardisation underpins all three claims that follow. The first is a reliability claim: under bounded conditions, large language models can produce reliable, standardised assessments of the evidential and policy alignment of expert-AI interactions. The second is a governance claim: alignment scores give experts an immediate signal during deployment and give institutions a basis for monitoring alignment patterns across mission types, models, and domains. The third is an epidemiological claim: once measurement standardisation is established, aggregate alignment scores could be used to study associations with downstream outcomes in regulated professional settings. This introduces the possibility of an "AI epidemiology" that detects risk based on correlated variables instead of mechanistic analysis. This paper addresses the first claim and specifies protocols for investigating the second and third. To enable empirical evaluation in future studies, this paper sets out a defined grammar, together with a statistical protocol based on paired bootstrap inference, DeLong's test for paired AUCs as a sensitivity check, a pre-specified one-sided non-inferiority margin of 0.05, and Holm-Bonferroni correction.
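Of the statistical steps named in the protocol, the Holm-Bonferroni correction is simple enough to sketch directly. The function below is the standard step-down procedure; the p-values in the example are made up for illustration and do not come from the paper.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/keep flag per hypothesis, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the k-th smallest p-value with alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject

# Hypothetical p-values from four comparisons.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
print(decisions)
```

With four tests, the thresholds are 0.0125, 0.0167, 0.025, and 0.05 in order, so the procedure stops at the third smallest p-value (0.03 > 0.025).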

2509.24882 2026-06-05 cs.LG cond-mat.dis-nn cs.AI stat.ML 版本更新

Scaling Laws and Spectra of Shallow Neural Networks in the Feature Learning Regime

浅层神经网络在特征学习 regime 中的缩放定律与谱特性

Leonardo Defilippis, Yizhou Xu, Julius Girardin, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová, Bruno Loureiro, Florent Krzakala

Institutions: Département d'Informatique, École Normale Supérieure, PSL & CNRS; Statistical Physics of Computation Laboratory, École Polytechnique Fédérale de Lausanne (EPFL); Information, Learning and Physics Laboratory, École Polytechnique Fédérale de Lausanne (EPFL)

AI summary: The paper studies scaling laws and spectral properties of shallow neural networks in the feature learning regime. By analyzing scaling laws for quadratic and diagonal networks, it shows how sample complexity and weight decay affect the scaling exponents of the excess risk, and it establishes a precise link between these regimes and the spectral properties of the trained weights.

详情
Journal ref
ICLR 2026
AI中文摘要

神经缩放定律是深度学习近期许多进展的基础,但其理论理解仍然主要局限于线性模型。在本文中,我们系统分析了二次和对角神经网络在特征学习 regime 中的缩放定律。利用与矩阵压缩感知和LASSO的联系,我们推导了过剩风险缩放指数作为样本复杂度和权重衰减函数的详细相图。这种分析揭示了不同缩放 regime 之间的交叉和平台行为,与经验神经缩放文献中广泛报告的现象相呼应。此外,我们建立了这些 regime 与训练网络权重谱性质的精确联系,我们对其进行了详细刻画。作为结果,我们提供了最近经验观察的理论验证,这些观察将权重谱中幂律尾部的出现与网络泛化性能联系起来,从而给出了从基本原理出发的解释。

英文摘要

Neural scaling laws underlie many of the recent advances in deep learning, yet their theoretical understanding remains largely confined to linear models. In this work, we present a systematic analysis of scaling laws for quadratic and diagonal neural networks in the feature learning regime. Leveraging connections with matrix compressed sensing and LASSO, we derive a detailed phase diagram for the scaling exponents of the excess risk as a function of sample complexity and weight decay. This analysis uncovers crossovers between distinct scaling regimes and plateau behaviors, mirroring phenomena widely reported in the empirical neural scaling literature. Furthermore, we establish a precise link between these regimes and the spectral properties of the trained network weights, which we characterize in detail. As a consequence, we provide a theoretical validation of recent empirical observations connecting the emergence of power-law tails in the weight spectrum with network generalization performance, yielding an interpretation from first principles.

2602.13697 2026-06-05 cs.AI cs.DB cs.LG 版本更新

No Need to Train Your RDB Foundation Model

无需训练你的关系数据库基础模型

Linjie Xu, Yanlin Zhang, Quan Gan, Minjie Wang, David Wipf

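Institutions: University of Hong Kong, Shanghai X-Lab

AI summary: The paper proposes a relational database encoder built for in-context learning that can be paired, without retraining, with existing single-table in-context learning foundation models to handle multiple interrelated tables.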

Comments International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

关系数据库(RDBs)包含大量异构的表格信息,可用于预测建模。但鉴于企业环境中潜在的目标空间广阔,如何避免每次预测新感兴趣的量时重新训练新模型?基于上下文学习(ICL)的基础模型提供了一种方便的选项,但目前大多局限于单表操作。在推广到多张相互关联的表时,关键在于将可变大小的RDB邻域压缩为固定长度的ICL样本供解码器使用。然而,细节至关重要:与现有监督学习RDB流程不同,我们提供了理论和实证证据表明,ICL特定的压缩应限制在高维RDB列中,其中所有实体共享单位和角色,而不是跨列,因为异构数据类型的相关性无法在缺乏大量标签信息的情况下确定。基于此限制,我们证明了排除可训练参数不会影响编码器的表达能力。因此,我们得到了一种原理上可行的RDB编码器家族,可以无缝搭配已有的单表ICL基础模型,从而无需训练或微调。从实用角度看,我们开发了可扩展的SQL原语来实现编码器阶段,最终得到一个易于使用的开源RDBLearn基础模型,能够在未见过的数据集上实现稳健的性能。

英文摘要

Relational databases (RDBs) contain vast amounts of heterogeneous tabular information that can be exploited for predictive modeling purposes. But since the space of potential targets is vast across enterprise settings, how can we avoid retraining a new model each time we wish to predict a new quantity of interest? Foundation models based on in-context learning (ICL) offer a convenient option, but so far are largely restricted to single-table operability. In generalizing to multiple interrelated tables, it is essential to compress variably-sized RDB neighborhoods into fixed-length ICL samples for consumption by the decoder. However, the details here are critical: unlike existing supervised learning RDB pipelines, we provide theoretical and empirical evidence that ICL-specific compression should be constrained within high-dimensional RDB columns where all entities share units and roles, not across columns where the relevance of heterogeneous data types cannot be determined without extensive label information. Conditioned on this restriction, we then demonstrate that encoder expressiveness is actually not compromised by excluding trainable parameters. Hence we arrive at a principled family of RDB encoders that can be seamlessly paired with already-existing single-table ICL foundation models, whereby no training or fine-tuning is required. From a practical standpoint, we develop scalable SQL primitives to implement the encoder stage, resulting in the easy-to-use open-source RDBLearn foundation model capable of robust performance on unseen datasets out of the box.
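The within-column, parameter-free compression described above can be sketched as follows. This is an illustration under assumptions, not the RDBLearn encoder: the choice of summary statistics (count, mean, min, max) and the row format are hypothetical. It shows only the structural point that each column is summarized on its own, so a neighbourhood of any size becomes a fixed-length sample with no trainable parameters.

```python
from statistics import fmean

def compress_neighbourhood(rows, columns):
    """Summarize each column of the related rows separately; no mixing across columns."""
    features = {}
    for col in columns:
        values = [row[col] for row in rows if row.get(col) is not None]
        features[f"{col}_count"] = len(values)
        features[f"{col}_mean"] = fmean(values) if values else None
        features[f"{col}_min"] = min(values) if values else None
        features[f"{col}_max"] = max(values) if values else None
    return features

# Hypothetical child rows (orders) linked to one target entity (a customer).
orders = [{"amount": 10.0, "items": 1}, {"amount": 30.0, "items": 3}]
sample = compress_neighbourhood(orders, ["amount", "items"])
print(sample)
```

The output length depends only on the number of columns, so it stays the same whether the customer has two orders or two thousand.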

2602.13255 2026-06-05 cs.AI cs.MA 版本更新

DPBench: Structural Determinants of Multi-Agent LLM Coordination Under Simultaneous Resource Contention

DPBench: 多智能体LLM在同时资源竞争下的协调结构决定因素

Najmul Hasan, Prashanth BusiReddyGari

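Institutions: Department of Mathematics and Computer Science, University of North Carolina at Pembroke

AI summary: The paper presents DPBench, a benchmark for evaluating coordination in multi-agent LLM systems. By analyzing how the action protocol, communication structure, and group size determine whether coordination succeeds or fails, it characterizes coordination under resource contention.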

Comments 20 pages, 4 figures

详情
AI中文摘要

我们提出了DPBench,一个用于评估多智能体系统中协调性能的基准测试,该测试基于大型语言模型构建。现有基准测试在固定协议下衡量任务级的成功率;然而,协调成功或失败的结构条件尚未被明确刻画。DPBench将哲学家就餐问题改编为受控测试平台,其中动作协议、通信结构和群体规模可独立变化。我们评估了六个智能体:GPT-5.2、Claude Opus 4.5、Grok 4.1、Gemini 2.5 Flash、Llama 4 Maverick以及一个均匀随机基线。在N=5的同时动作下,默认提示中,GPT-5.2的死锁率为25.0%(95% Wilson置信区间[11.2, 46.9]),而Gemini 2.5 Flash的死锁率为90.0%([74.4, 96.5]);顺序动作被六个智能体中的四个解决。在固定模型为Gemini 2.5 Flash的情况下,三个协议变量将死锁率从90%降低到置信区间接近零:三次预承诺通信(0.0% vs. 单次通信86.7%)、提示中包含经典并发原语(资源排序和对称打破的0.0% vs. 最小提示的100%)或将群体从N=5扩大到N=10(90.0%到10.0%)。单次通信和过去时间步的记忆在我们运行的样本量下不会改变死锁率。是否同一个模型协调或死锁由协议决定,而不是模型的能力。

英文摘要

We present DPBench, a benchmark for evaluating coordination in multi-agent systems built from large language models. Existing benchmarks measure task-level success under a fixed protocol; the structural conditions under which coordination succeeds or fails at all have not been characterised. DPBench adapts the Dining Philosophers problem into a controlled testbed where the action protocol, the communication structure, and the group size each vary independently. We evaluate six agents: GPT-5.2, Claude Opus 4.5, Grok 4.1, Gemini 2.5 Flash, Llama 4 Maverick, and a uniform-random baseline. Under simultaneous action at N=5 with the default prompt, deadlock ranges from 25.0% (95% Wilson CI [11.2, 46.9]) for GPT-5.2 to 90.0% [74.4, 96.5] for Gemini 2.5 Flash; sequential action is solved by four of the six. Holding the model fixed at Gemini 2.5 Flash, three protocol variables drive deadlock from 90% to within CI of zero: three rounds of pre-commitment communication (0.0% vs. single-round 86.7%), a prompt encoding a classical concurrency primitive (0.0% for resource-ordering and symmetry-breaking, against 100% for the minimal prompt), or doubling the group from N=5 to N=10 (90.0% to 10.0%). Single-round messaging and memory of past timesteps do not change the rate at the sample size we ran. Whether the same model coordinates or deadlocks is determined by the protocol, not by the model's capability.
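The Wilson confidence intervals quoted above can be computed with the standard Wilson score formula, sketched below. The trial counts (5 deadlocks in 20 runs, 27 in 30 runs) are assumptions: the abstract does not state the number of runs, and these counts were chosen only because they are consistent with the reported 25.0% and 90.0% rates.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Assumed counts, not taken from the paper.
for deadlocks, runs in ((5, 20), (27, 30)):
    lo, hi = wilson_interval(deadlocks, runs)
    print(f"{deadlocks}/{runs}: [{100 * lo:.1f}, {100 * hi:.1f}]")
```

With these assumed counts the intervals round to [11.2, 46.9] and [74.4, 96.5], matching the figures in the abstract.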

2602.04809 2026-06-05 cs.LG cs.AI 版本更新

Beyond Rewards in Reinforcement Learning for Cyber Defence

超越奖励的强化学习在网络安全防御中的应用

Elizabeth Bates, Chris Hicks, Vasilios Mavroudis

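Institutions: University of Cambridge

AI summary: The paper studies how reward function structure affects learning and policy behaviour when reinforcement learning is applied to cyber defence. Comparing sparse and dense reward functions, it reveals the nuanced relationships between rewards, action space, and the risk of suboptimal policies.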

AI中文摘要

近年来,使用深度强化学习训练、用于保护计算机网络的自主网络防御智能体引起了广泛关注。这些智能体通常在网络安全gym环境中训练,使用密集的、高度工程化的奖励函数,这些函数针对一系列(不)理想状态和高代价动作组合了多种惩罚与激励。密集奖励有助于缓解探索复杂环境的难题,但可能使智能体偏向次优且风险更高的解,这在复杂的网络环境中是一个关键问题。我们使用多种稀疏和密集奖励函数、两个成熟的网络安全gym、一系列网络规模以及策略梯度和基于价值的RL算法,全面评估了奖励函数结构对学习过程和策略行为特征的影响。我们的评估依托一种新的真值(ground truth)评估方法,该方法允许直接比较不同的奖励函数,从而揭示网络环境中奖励、动作空间与次优策略风险之间的微妙关系。结果表明,稀疏奖励只要与目标一致且能被频繁触发,就能同时带来更高的训练可靠性和策略风险更低、更有效的网络防御智能体。令人惊讶的是,稀疏奖励还能产生与网络防御者目标更一致的策略,并且在没有显式的基于奖励的数值惩罚的情况下,也能节制地使用高代价的防御动作。

英文摘要

Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered reward functions which combine many penalties and incentives for a range of (un)desirable states and costly actions. Dense rewards help alleviate the challenge of exploring complex environments but risk biasing agents towards suboptimal and potentially riskier solutions, a critical issue in complex cyber environments. We thoroughly evaluate the impact of reward function structure on learning and policy behavioural characteristics using a variety of sparse and dense reward functions, two well-established cyber gyms, a range of network sizes, and both policy gradient and value-based RL algorithms. Our evaluation is enabled by a novel ground truth evaluation approach which allows directly comparing between different reward functions, illuminating the nuanced inter-relationships between rewards, action space and the risks of suboptimal policies in cyber environments. Our results show that sparse rewards, provided they are goal aligned and can be encountered frequently, uniquely offer both enhanced training reliability and more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards can also yield policies that are better aligned with cyber defender goals and make sparing use of costly defensive actions without explicit reward-based numerical penalties.

2602.09574 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs

在LLM测试时扩展中使树搜索策略与固定令牌预算对齐

Sora Miyamoto, Daisuke Oba, Naoaki Okazaki

发表机构 * University of Tokyo(东京大学)

AI总结 本文提出了一种名为Budget-Guided MCTS (BG-MCTS)的树搜索解码算法,通过将搜索策略与剩余令牌预算对齐,以提高在不同令牌预算下的推理性能。

Comments Accepted at ICML 2026. Code: https://github.com/Sora-Miyamoto/bg-mcts

AI中文摘要

树搜索解码是大型语言模型(LLMs)测试时间扩展的有效方法,但现实部署中通常会施加一个固定的每查询令牌预算,且该预算在不同设置中有所不同。现有的树搜索策略大多缺乏预算意识,仅将预算视为终止条件,从而可能导致后期过度分支或提前终止。我们提出Budget-Guided MCTS (BG-MCTS),一种树搜索解码算法,其搜索策略与剩余令牌预算对齐:它从广泛的探索开始,然后在剩余预算减少时优先进行细化和答案完成,同时减少浅层节点的后期分支。BG-MCTS在数学推理基准和额外的物理推理基准上,使用开放权重LLMs在各种推理预算下均优于预算无关的树搜索基线。

英文摘要

Tree-search decoding is an effective form of test-time scaling for large language models (LLMs), but real-world deployment often imposes a fixed per-query token budget that varies across settings. Existing tree-search policies are largely budget-agnostic, treating the budget merely as a termination condition, thereby risking late-stage over-branching or premature termination. We propose Budget-Guided MCTS (BG-MCTS), a tree-search decoding algorithm that aligns its search policy with the remaining token budget: it starts with broad exploration, then prioritizes refinement and answer completion as the remaining budget decreases while reducing late-stage branching from shallow nodes. BG-MCTS consistently outperforms budget-agnostic tree-search baselines across inference budgets on mathematical reasoning benchmarks and an additional physics reasoning benchmark with open-weight LLMs.
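
One simple way to make a UCT-style selection rule budget-aware is to shrink the exploration weight as the remaining token budget falls, so that early selection favours rarely visited branches and late selection favours the best-scoring ones. The sketch below shows that general idea only. It is not the BG-MCTS policy, whose exact form the abstract does not give; the function name `budget_aware_uct`, the constant `c_max`, and the linear schedule are all assumptions.

```python
import math


def budget_aware_uct(value_sum: float, visits: int, parent_visits: int,
                     remaining_tokens: int, total_budget: int,
                     c_max: float = 1.4) -> float:
    """UCT score whose exploration weight scales with the remaining budget (assumed schedule)."""
    if visits == 0:
        return math.inf  # always try an unvisited child first
    budget_fraction = max(0.0, min(1.0, remaining_tokens / total_budget))
    exploration_weight = c_max * budget_fraction  # hypothetical linear decay
    exploit = value_sum / visits
    explore = math.sqrt(math.log(parent_visits) / visits)
    return exploit + exploration_weight * explore


# Two children of a node visited 12 times: A is well explored, B is not.
child_a = dict(value_sum=6.0, visits=10)  # mean value 0.6
child_b = dict(value_sum=1.0, visits=2)   # mean value 0.5

for remaining in (1000, 50):
    a = budget_aware_uct(**child_a, parent_visits=12, remaining_tokens=remaining, total_budget=1000)
    b = budget_aware_uct(**child_b, parent_visits=12, remaining_tokens=remaining, total_budget=1000)
    print(remaining, "-> pick", "A" if a > b else "B")
```

With the full budget the under-explored child B scores higher; with 5% of the budget left the higher-value child A is preferred, which mirrors the shift from broad exploration to refinement described in the abstract.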

2602.07739 2026-06-05 cs.IR cs.AI 版本更新

HypRAG: Hyperbolic Dense Retrieval for Retrieval Augmented Generation

HypRAG:用于检索增强生成的双曲密集检索

Hiren Madhu, Ngoc Bui, Ali Maatouk, Leandros Tassiulas, Smita Krishnaswamy, Menglin Yang, Sukanta Ganguly, Kiran Srinivasan, Rex Ying

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出双曲密集检索方法,在双曲空间中构建HyTE-FH和HyTE-H两种模型变体,以克服欧几里得空间在检索增强生成中的局限性,提升上下文相关性和回答相关性。

AI中文摘要

嵌入几何在检索质量中起着根本作用,然而用于检索增强生成(RAG)的密集检索器仍然主要局限于欧几里得空间。然而,自然语言从广泛主题到具体实体具有层次结构,而欧几里得嵌入无法保持这种结构,导致语义上距离远的文档显得相似,增加幻觉风险。为了解决这些限制,我们引入了双曲密集检索,开发了两种模型变体:HyTE-FH,一个完全双曲的Transformer,以及HyTE-H,一个混合架构,将预训练的欧几里得嵌入投影到双曲空间。为了防止序列聚合期间的表示崩溃,我们引入了向外爱因斯坦中点,一种几何感知的池化操作符,可以证明地保持层次结构。在MTEB上,HyTE-FH优于等效的欧几里得基线,而在RAGBench上,HyTE-H在上下文相关性和回答相关性方面比欧几里得基线高出高达29%,使用比当前最先进的检索器小得多的模型。我们的分析还表明,双曲表示通过基于范数的分离编码文档特定性,从一般到具体概念的径向增加超过20%,这一特性在欧几里得嵌入中不存在,突显了几何归纳偏置在忠实RAG系统中的关键作用。

英文摘要

Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrieval-augmented generation (RAG) remain largely confined to Euclidean space. However, natural language exhibits hierarchical structure from broad topics to specific entities that Euclidean embeddings fail to preserve, causing semantically distant documents to appear spuriously similar and increasing hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval, developing two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture projecting pre-trained Euclidean embeddings into hyperbolic space. To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines, while on RAGBench, HyTE-H achieves up to 29% gains over Euclidean baselines in context relevance and answer relevance using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation, with over 20% radial increase from general to specific concepts, a property absent in Euclidean embeddings, underscoring the critical role of geometric inductive bias in faithful RAG systems.
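
For readers unfamiliar with the Lorentz model mentioned above, the following sketch shows its two basic ingredients: lifting a Euclidean vector onto the hyperboloid and measuring geodesic distance there. It uses the standard textbook formulas for curvature -1 and is not taken from the HypRAG code. The function names are hypothetical, and the paper's Outward Einstein Midpoint pooling is deliberately not reproduced because the abstract does not define it.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lift(v):
    """Exponential map at the hyperboloid origin (1, 0, ..., 0), curvature -1."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in v]


def lorentz_distance(x, y):
    """Geodesic distance between two points on the hyperboloid."""
    # Clamp to 1.0 to guard against rounding just below the domain of acosh.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift([0.0, 0.0])
general = lift([0.3, 0.4])    # Euclidean norm 0.5
specific = lift([1.2, 1.6])   # Euclidean norm 2.0, same direction
print(lorentz_distance(origin, general), lorentz_distance(origin, specific))
```

Because the lift is the exponential map at the origin, a point's distance from the origin equals the Euclidean norm of the vector it was lifted from. That radial coordinate is the kind of norm-based signal the abstract associates with moving from general to specific concepts.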

2602.07253 2026-06-05 cs.AI cs.CL 版本更新

From Out-of-Distribution Detection to Hallucination Detection: A Geometric View

从分布外检测到幻觉检测:一个几何视角

Litian Liu, Reza Pourreza, Yubing Jian, Yao Qin, Roland Memisevic

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过将幻觉检测重新定义为分布外检测问题,利用几何视角提出了一种无需训练、基于单样本的检测方法,在推理任务中实现了高准确率。

Comments ICML 2026 main conference paper

AI中文摘要

检测大型语言模型中的幻觉是一个关键且开放的问题,对安全性和可靠性有重大影响。虽然现有的幻觉检测方法在问答任务中表现强劲,但在需要推理的任务上效果不佳。在这项工作中,我们通过分布外(OOD)检测的视角重新审视幻觉检测,这是计算机视觉等领域中一个研究充分的问题。将语言模型中的下一个词预测视为分类任务,允许我们应用OOD技术,前提是进行适当的修改以考虑大型语言模型的结构差异。我们表明,基于OOD的方法产生了无需训练、基于单样本的检测器,在推理任务的幻觉检测中实现了高准确率。总体而言,我们的工作表明,将幻觉检测重新定义为OOD检测为语言模型安全性提供了一条有前景且可扩展的路径。

英文摘要

Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. While existing hallucination detection methods achieve strong performance in question-answering tasks, they remain less effective on tasks requiring reasoning. In this work, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in areas like computer vision. Treating next-token prediction in language models as a classification task allows us to apply OOD techniques, provided appropriate modifications are made to account for the structural differences in large language models. We show that OOD-based approaches yield training-free, single-sample-based detectors, achieving strong accuracy in hallucination detection for reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.

2602.00911 2026-06-05 cs.AI 版本更新

Synapse: Federated Tool Routing via Typed Compendium Artifacts

Synapse: 通过类型化编目工件实现联邦工具路由

Abhijit Chakraborty, Yash Shah, Vivek Gupta

发表机构 * MongoDB Arizona State University(亚利桑那州立大学)

AI总结 本文提出了一种名为Synapse的联邦工具路由编目系统,通过类型化联邦工件实现跨客户端的联邦学习,解决了在异构LLM和无共享数据的情况下,如何实现隐私保护、冲突解决和跨架构迁移的问题。

AI中文摘要

在联邦学习中,协作单位决定了哪些保证是可表达的。权重、提示、原始示例等扁平单位不带类型签名,隐私保护、冲突解决或跨模型迁移无法据此作为定义明确的操作来分派。我们提出类型化联邦工件:经过模式验证的对象,其声明的字段结构使逐字段差分隐私、模式感知合并和跨架构迁移成为一等操作,而非启发式近似。我们将其实例化为SYNAPSE,一个用于联邦工具路由的编目(compendium),面向拥有冻结的异构LLM且不共享数据或权重的客户端;在这种设置下,扁平单位要么泄露梯度,要么丢弃结构,否则无法处理。该编目支持带逐字段冲突解决的类型化合并算子、针对数值元数据的形式化DP保证,以及在五个分布上经验性刻画的条件检索失真和路由稳定性结果,其中包括一个收缩前提不成立的分布。单个编目可在四个LLM家族(LLaMA 3.1 8B、LLaMA 3.2-3B、Mistral 7B、GPT 4o)之间迁移,损失约2 pt,这是权重共享式联邦在架构不匹配时无法提供的能力。

英文摘要

The unit of collaboration in federated learning determines what guarantees are even expressible. Flat units such as weights, prompts, and raw examples carry no type signature on which privacy, conflict resolution, or cross-model transfer can dispatch as well-defined operations. We propose typed federated artifacts: schema-validated objects whose declared field structure makes per-field differential privacy, schema-aware merging, and cross-architecture transfer first-class operations rather than heuristic approximations. We instantiate this as SYNAPSE, a compendium for federated tool routing across clients with frozen, heterogeneous LLMs and no shared data or weights, a setting that flat units cannot handle without either leaking gradients or discarding structure. The compendium admits a typed merge operator with field-wise conflict resolution, a formal DP guarantee on numeric metadata, and conditional retrieval-distortion and routing-stability results empirically characterized on five distributions, including one where the contraction premise fails. A single compendium transfers across four LLM families (LLaMA 3.1 8B, LLaMA 3.2-3B, Mistral 7B, GPT 4o) with approximately 2 pt loss, a capability that weight-sharing federation cannot provide without architectural matching.

2601.22580 2026-06-05 cs.CL cs.AI cs.LG 版本更新

SpanNorm: Reconciling Training Stability and Performance in Deep Transformers

SpanNorm: 在深度Transformer中协调训练稳定性与性能

Chao Wang, Bei Li, Jiaqi Zhang, Xinyu Liu, Yuchun Fan, Linkun Lyu, Xin Chen, Jingang Wang, Tong Xiao, Peng Pei, Xunliang Cai

发表机构 * Meituan Inc.(美团公司) NLP Lab, School of Computer Science and Engineering(自然语言处理实验室,计算机科学与工程学院) Northeastern University, Shenyang, China(东北大学,沈阳,中国)

AI总结 本文提出SpanNorm技术,通过结合前归一化和后归一化的优势,解决深度Transformer中训练稳定性与性能之间的根本性权衡问题,理论分析和实验结果表明其在密集和专家混合(MoE)场景中均优于传统归一化方案。

Comments Accepted by ICML2026

AI中文摘要

大型语言模型(LLMs)的成功依赖于深度Transformer架构的稳定训练。一个关键的设计选择是归一化层的位置,导致了一个根本性的权衡:PreNorm架构在深度模型中确保了训练稳定性,但可能牺牲性能;而PostNorm架构提供了强大的性能,但面临严重的训练不稳定性。在本工作中,我们提出SpanNorm,一种新的技术,旨在通过整合两种范式的优点来解决这一困境。结构上,SpanNorm建立了一个跨越整个Transformer块的清晰残差连接以稳定信号传播,同时采用PostNorm风格的计算方式对聚合输出进行归一化以增强模型性能。我们提供了理论分析,证明SpanNorm结合合理的缩放策略可以在整个网络中保持信号方差有界,防止PostNorm模型中出现的梯度问题,并缓解PreNorm中的表示崩溃问题。实验结果表明,SpanNorm在密集和专家混合(MoE)场景中均优于传统归一化方案,为更强大和稳定的Transformer架构铺平了道路。

英文摘要

The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the "PreNorm" architecture ensures training stability at the cost of potential performance degradation in deep models, while the "PostNorm" architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models, and also alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios, paving the way for more powerful and stable Transformer architectures.
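
The difference between the two standard placements, and one possible reading of the block-spanning variant, can be sketched on plain Python lists. `pre_norm_block` and `post_norm_block` follow the usual definitions. `span_style_block` is only an interpretation of the abstract's wording (a single residual from block input to block output, followed by normalisation of the sum); it is an assumption, it omits the scaling strategy the abstract mentions, and the actual SpanNorm formulation may differ. The toy sublayers are placeholders.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]


def add(x, y):
    return [a + b for a, b in zip(x, y)]


def pre_norm_block(x, attn, ffn):
    # Normalise the input of each sublayer; the residual path is left untouched.
    h = add(x, attn(layer_norm(x)))
    return add(h, ffn(layer_norm(h)))


def post_norm_block(x, attn, ffn):
    # Normalise after each residual addition.
    h = layer_norm(add(x, attn(x)))
    return layer_norm(add(h, ffn(h)))


def span_style_block(x, attn, ffn):
    # Assumed reading: one residual spanning the whole block, then normalise the sum.
    return layer_norm(add(x, ffn(attn(x))))


toy_attn = lambda v: [0.5 * a for a in v]  # stand-in for self-attention
toy_ffn = lambda v: [a * a for a in v]     # stand-in for the feed-forward layer
x = [1.0, 2.0, 3.0, 4.0]
for block in (pre_norm_block, post_norm_block, span_style_block):
    print(block.__name__, [round(a, 3) for a in block(x, toy_attn, toy_ffn)])
```

In the PreNorm variant the input reaches the output through an unnormalised identity path, whereas the other two variants always emit a normalised vector.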

2601.21700 2026-06-05 cs.CL cs.AI cs.IR cs.MA cs.SI 版本更新

Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

通过本体引导的多智能体推理实现文化对齐的大型语言模型

Wonduk Seo, Wonseok Choi, Junseo Koh, Juhyeon Lee, Hyunjin An, Minhyeong Yu, Jian Park, Qingshan Zhou, Seunghyun Lee, Yi Bu

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出OG-MAR框架,通过本体引导的多智能体推理方法,提高大型语言模型在文化对齐和鲁棒性方面的性能,并生成更透明的推理轨迹。

Comments Accepted by ICML 2026 Regular Track

AI中文摘要

大型语言模型(LLMs)越来越多地用于支持文化敏感的决策,但由于预训练数据存在偏斜且缺乏结构化的价值表示,往往表现出错位。现有方法虽然可以引导输出,但通常缺乏人口统计学依据,并将价值观视为相互独立的无结构信号,从而降低了一致性和可解释性。我们提出OG-MAR,一个本体引导的多智能体推理框架。OG-MAR从世界价值观调查(WVS)中总结出各受访者特有的价值观,并通过能力问题(competency questions)在固定分类体系上引出关系,从而构建全球文化本体。在推理时,它检索与本体一致的关系和人口统计学上相似的受访者画像,以实例化多个价值-人设智能体,其输出由一个裁决智能体综合,该智能体强制保证本体一致性和人口统计学接近性。在四个LLM骨干模型上的区域社会调查基准实验表明,OG-MAR在文化对齐和鲁棒性方面优于有竞争力的基线,同时生成更透明的推理轨迹。

英文摘要

Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.

2505.11766 2026-06-05 cs.LG cs.AI quant-ph 版本更新

Reformulating Neural Operators in $d+1$ Dimensions for Embedding Evolution

在d+1维度中重新表述神经算子以嵌入演化

Haoze Song, Zhihao Li, Xiaobo Zhang, Zecheng Gan, Zhilu Lai, Wei Wang

发表机构 * HKUST (GZ)(香港科技大学(广州)) HKUST(香港科技大学) SWJTU(西南交通大学)

AI总结 本文提出在d+1维度中重新表述神经算子,通过引入辅助函数维度来建模嵌入演化,从而改进嵌入扩展的效率,通过傅里叶基算子在物理域和辅助域上联合作用,实现更高效的嵌入演化模块,实验表明该方法在多个基准测试中表现优异。

AI中文摘要

神经算子(NOs)是学习函数空间之间映射的强大架构。尽管大多数进展集中于改进d维物理域上的核参数化,但提升(lifting)后嵌入的演化仍缺乏探索,这往往促使模型转向计算代价高昂的嵌入扩展设计来提高逼近能力。在本文中,我们引入一个辅助函数维度,以算子形式对嵌入演化建模,从而在d+1维中重新表述NO流程。我们通过在物理域和辅助域上联合作用的基于傅里叶的算子来实例化这一框架,得到一个基底多样化的辅助演化模块,作为暴力嵌入扩展的替代方案。在十多个难度递增的基准上,从1D热方程到高度非线性的3D瑞利-泰勒不稳定性,我们的模型在所评估的基线中始终取得最低的相对L2误差。关键的是,这一优势得到了以下实证支持:(1)在受控预算下与缩放基线和消融基线的比较;(2)混合分辨率训练和超分辨率推断下的鲁棒性;以及(3)对未见时间范围的零样本泛化。此外,我们还给出了提升算子和恢复算子的更广泛设计选择,并展示了它们对模型预测性能的影响。

英文摘要

Neural Operators (NOs) are powerful architectures for learning mappings between function spaces. While most advances focus on refining kernel parameterizations over the $d$-dimensional physical domain, the evolution of lifted embeddings remains underexplored, which often drives models toward computationally expensive embedding-scaling designs to improve approximation. In this paper, we introduce an auxiliary function dimension that models embedding evolution in operator form, thereby reformulating the NO pipeline in $d+1$ dimensions. We instantiate this framework via Fourier-based operators acting jointly on the physical and auxiliary domains, yielding a basis-diversified auxiliary evolution module as an alternative to brute-force embedding scaling. Across more than ten increasingly challenging benchmarks, ranging from the 1D heat equation to the highly nonlinear 3D Rayleigh-Taylor instability, our model consistently achieves the lowest relative $L_2$ error among the evaluated baselines. Crucially, this advantage is empirically supported by (1) controlled budget-aware comparisons against scaled and ablated baselines; (2) robustness under mixed-resolution training and super-resolution inference; and (3) zero-shot generalization to unseen temporal regimes. In addition, we present a broader set of design choices for lifting and recovery operators, demonstrating their impact on our model's predictive performance.

2601.21288 2026-06-05 cs.AI cs.CV 版本更新

Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Drive-KD:自动驾驶中用于视觉语言模型的多教师知识蒸馏

Weitong Lian, Zecong Tang, Haoran Li, Tianjian Gao, Yifei Wang, Zixu Wang, Lingyi Meng, Tengju Ru, Zhejun Cui, Yichen Zhu, Hangshuo Cao, Qi Kang, Tianxing Chen, Kaixuan Wang, Yu Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出Drive-KD框架,通过将自动驾驶分解为感知-推理-规划三元组,并利用知识蒸馏转移能力,构建了专用教师模型,并通过非对称梯度投影缓解跨能力梯度冲突,验证了方法在不同模型家族和规模上的泛化能力,展示了蒸馏模型在自动驾驶任务中的优越性能。

AI中文摘要

自动驾驶是一项重要且安全攸关的任务,大型语言模型(LLM)和视觉语言模型(VLM)的最新进展为该领域的推理和规划带来了新的可能。然而,大模型需要大量GPU内存且推理延迟高,而传统监督微调(SFT)往往难以弥补小模型的能力差距。为了解决这些限制,我们提出了Drive-KD,一个将自动驾驶分解为“感知-推理-规划”三元组并通过知识蒸馏迁移这些能力的框架。我们将层特定的注意力确定为蒸馏信号,构建出优于基线的、针对各项能力的单教师模型。此外,我们将这些单教师设置统一到多教师蒸馏框架中,并引入非对称梯度投影以缓解跨能力的梯度冲突。广泛的评估验证了我们的方法在不同模型家族和规模上的泛化能力。实验表明,我们蒸馏得到的InternVL3-1B模型GPU内存减少约42倍、吞吐量提高约11.4倍,在DriveBench上的整体性能优于同家族的预训练78B模型,并在规划维度上超越GPT-5.1,为高效的自动驾驶VLM提供了见解。

英文摘要

Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a "perception-reasoning-planning" triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42 times less GPU memory and ~11.4 times higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.
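
Gradient conflicts between objectives are often handled by projection: when two gradients point in opposing directions, the conflicting component of one is removed. The sketch below shows a PCGrad-style projection applied in one direction only, as one plausible reading of "asymmetric". It is an assumption for illustration, not Drive-KD's actual rule, and the names `project_if_conflicting`, `planning_grad`, and `perception_grad` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_if_conflicting(grad, anchor):
    """Remove from `grad` its component along `anchor` when the two conflict."""
    inner = dot(grad, anchor)
    norm_sq = dot(anchor, anchor)
    if inner >= 0 or norm_sq == 0:
        return list(grad)  # no conflict: leave the gradient unchanged
    scale = inner / norm_sq
    return [g - scale * a for g, a in zip(grad, anchor)]


# Asymmetric use (assumed): the planning gradient is kept as is,
# and only the perception gradient is adjusted against it.
planning_grad = [-1.0, 1.0]
perception_grad = [1.0, 0.0]
adjusted = project_if_conflicting(perception_grad, planning_grad)
update = [p + q for p, q in zip(planning_grad, adjusted)]
print(adjusted, update)
```

After projection the adjusted gradient is orthogonal to the anchor, so following it no longer undoes progress on the anchored objective to first order.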

2601.21162 2026-06-05 cs.IR cs.AI cs.DB 版本更新

A2RAG: Adaptive Agentic Graph Retrieval for Cost-Aware and Reliable Reasoning

A2RAG:面向成本感知和可靠推理的自适应代理图检索

Jiate Liu, Zebin Chen, Shaobo Qiao, Mingchen Ju, Danting Zhang, Bocheng Han, Shuyue Yu, Xin Shu, Jinglin Wu, Dong Wen, Xin Cao, Guanfeng Liu, Zhengyi Yang

发表机构 * University of New South Wales(新南威尔士大学) Euler AI Sigma Trading Management(Sigma 交易管理) Eigenflow AI Macquarie University(麦考瑞大学)

AI总结 本文提出A2RAG框架,通过自适应控制器和代理检索器解决图检索中成本和可靠性问题,提升多跳问答的准确率并减少计算开销。

AI中文摘要

图检索增强生成(Graph-RAG)通过将语料库组织成知识图谱并利用关系结构路由证据来增强多跳问答。然而,实际部署面临两个持续瓶颈:(i)混合难度的工作负载中,单一检索策略要么浪费成本于简单查询,要么在多跳情况中失败;(ii)提取损失,即图抽象省略了仅存在于源文本中的细粒度限定词。我们提出了A2RAG,一种面向成本感知和可靠推理的自适应和代理图RAG框架。A2RAG结合了一个自适应控制器,用于验证证据充分性并在必要时触发定向细化,以及一个代理检索器,逐步提升检索努力并映射图信号回来源文本,以在提取损失和不完整图的情况下保持稳健。在HotpotQA和2WikiMultiHopQA上的实验表明,A2RAG在Recall@2上实现了+9.9/+11.8的绝对增益,同时将token消耗和端到端延迟降低了约50%。

英文摘要

Graph Retrieval-Augmented Generation (Graph-RAG) enhances multihop question answering by organizing corpora into knowledge graphs and routing evidence through relational structure. However, practical deployments face two persistent bottlenecks: (i) mixed-difficulty workloads where one-size-fits-all retrieval either wastes cost on easy queries or fails on hard multihop cases, and (ii) extraction loss, where graph abstraction omits fine-grained qualifiers that remain only in source text. We present A2RAG, an adaptive-and-agentic GraphRAG framework for cost-aware and reliable reasoning. A2RAG couples an adaptive controller that verifies evidence sufficiency and triggers targeted refinement only when necessary, with an agentic retriever that progressively escalates retrieval effort and maps graph signals back to provenance text to remain robust under extraction loss and incomplete graphs. Experiments on HotpotQA and 2WikiMultiHopQA demonstrate that A2RAG achieves +9.9/+11.8 absolute gains in Recall@2, while cutting token consumption and end-to-end latency by about 50% relative to iterative multihop baselines.

2601.19568 2026-06-05 cs.AI cs.SE 版本更新

Learning Adaptive Parallel Execution for Efficient Code Localization

学习适应性并行执行以实现高效的代码定位

Ke Xu, Siyang Xiao, Ming Liang, Yichen Yu, Zhixiang Wang, Jingxuan Xu, Dajun Chen, Wei Jiang, Yong Li

发表机构 * Ant Group(蚂蚁集团) Peking University(北京大学) Beijing Jiaotong University(北京交通大学)

AI总结 本文提出FuseSearch,通过将并行代码定位重新表述为联合质量-效率优化任务,采用两阶段SFT和RL训练方法学习适应性并行策略,以提高代码定位的效率和性能。

Comments Paper accepted to Findings of ACL 2026

AI中文摘要

代码定位是自动化软件开发流水线中的关键瓶颈。尽管并发工具执行可以加快发现速度,但当前智能体的冗余调用率达34.9%,抵消了并行带来的收益。我们提出FuseSearch,将并行代码定位重新表述为质量-效率联合优化任务。我们将工具效率定义为唯一信息增益与调用次数之比,并采用SFT加RL的两阶段训练方法来学习自适应并行策略。与固定广度的方法不同,FuseSearch根据任务上下文动态调节搜索广度,从探索阶段过渡到细化阶段。在SWE-bench Verified上的评估中,FuseSearch-4B达到SOTA级性能(文件级F1为84.7%,函数级F1为56.4%),速度提升93.6%,轮次减少67.7%,token减少68.9%。结果表明,效率感知的训练通过消除带噪声的冗余信号自然地提升了质量,使高性能、低成本的定位智能体成为可能。

英文摘要

Code localization constitutes a key bottleneck in automated software development pipelines. While concurrent tool execution can enhance discovery speed, current agents demonstrate a 34.9% redundant invocation rate, which negates parallelism benefits. We propose FuseSearch, reformulating parallel code localization as a joint quality-efficiency optimization task. Through defining tool efficiency -- the ratio of unique information gain to invocation count -- we utilize a two-phase SFT and RL training approach for learning adaptive parallel strategies. Different from fixed-breadth approaches, FuseSearch dynamically modulates search breadth according to task context, evolving from exploration phases to refinement stages. Evaluated on SWE-bench Verified, FuseSearch-4B achieves SOTA-level performance (84.7% file-level and 56.4% function-level F1 scores) with 93.6% speedup, utilizing 67.7% fewer turns and 68.9% fewer tokens. Results indicate that efficiency-aware training naturally improves quality through eliminating noisy redundant signals, enabling high-performance cost-effective localization agents.
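
The tool-efficiency ratio defined above is easy to compute once "unique information gain" is given a concrete meaning. In this sketch it is taken to be the number of previously unseen items (for example file paths) that a call returns, and a call is counted as redundant when it returns nothing new. Both choices are assumptions for illustration; the abstract does not say how the paper measures either quantity.

```python
def tool_efficiency(calls):
    """Return (efficiency, redundant_rate) for a sequence of tool-call results.

    `calls` is a list of sets; each set holds the items one invocation returned.
    """
    seen = set()
    unique_gain = 0
    redundant = 0
    for result in calls:
        new_items = result - seen
        unique_gain += len(new_items)
        if not new_items:
            redundant += 1  # this call surfaced nothing new
        seen |= result
    n = len(calls)
    if n == 0:
        return 0.0, 0.0
    return unique_gain / n, redundant / n


trace = [{"a.py", "b.py"}, {"b.py"}, {"c.py"}, {"a.py"}]
efficiency, redundant_rate = tool_efficiency(trace)
print(efficiency, redundant_rate)  # 0.75 0.5
```

Four calls surface three distinct files, so the efficiency is 0.75, and two of the four calls are redundant.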

2601.18383 2026-06-05 cs.AI cs.CL cs.LG 版本更新

Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models

动态思维-令牌选择用于大型推理模型中的高效推理

Zhenyuan Guo, Tong Chen, Wenlong Meng, Chen Gong, Xin Yu, Chengkun Wei, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学)

AI总结 本研究提出动态思维-令牌选择方法,通过分析推理轨迹发现只有部分关键令牌影响最终答案,从而优化大型推理模型的效率。

AI中文摘要

大型推理模型(LRMs)通过显式生成推理轨迹来解决复杂问题,但扩展生成带来了显著的内存足迹和计算开销,限制了LRMs的效率。本工作利用注意力图分析推理轨迹的影响,发现只有部分关键令牌引导模型走向最终答案,其余令牌贡献微乎其微。基于这一观察,我们提出了动态思维-令牌选择(DynTS)。该方法识别关键令牌,并在推理过程中仅保留其关联的键值(KV)缓存状态,淘汰冗余条目以优化效率。

英文摘要

Large Reasoning Models (LRMs) excel at solving complex problems by explicitly generating a reasoning trace before deriving the final answer. However, these extended generations incur substantial memory footprint and computational overhead, bottlenecking LRMs' efficiency. This work uses attention maps to analyze the influence of reasoning traces and uncover an interesting phenomenon: only some decision-critical tokens in a reasoning trace steer the model toward the final answer, while the remaining tokens contribute negligibly. Building on this observation, we propose Dynamic Thinking-Token Selection (DynTS). This method identifies decision-critical tokens and retains only their associated Key-Value (KV) cache states during inference, evicting the remaining redundant entries to optimize efficiency.
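
The retain-and-evict step can be pictured as a top-k selection over per-token importance scores. The following is a minimal sketch under stated assumptions: importance is the total attention a cached token has received, and the most recent tokens are always kept. Neither the scoring rule nor the recency window comes from the abstract, and `select_kv_to_keep` is a hypothetical name, not DynTS's interface.

```python
def select_kv_to_keep(attention_received, keep, recent=2):
    """Return the sorted indices of KV-cache entries to retain.

    attention_received[i] is an importance score for cached token i.
    The last `recent` tokens are always kept (assumed heuristic).
    """
    n = len(attention_received)
    recent_idx = set(range(max(0, n - recent), n))
    older = [i for i in range(n) if i not in recent_idx]
    # Rank the older tokens by importance, highest first.
    ranked = sorted(older, key=lambda i: attention_received[i], reverse=True)
    slots = max(0, keep - len(recent_idx))
    return sorted(recent_idx | set(ranked[:slots]))


scores = [0.30, 0.01, 0.02, 0.25, 0.03, 0.05]
print(select_kv_to_keep(scores, keep=4))  # [0, 3, 4, 5]
```

Entries whose indices are not returned would be evicted from the cache, which is where the memory saving comes from.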

2505.03336 2026-06-05 cs.IR cs.AI cs.SI 版本更新

Eliminating Out-of-Domain Recommendations in LLM-based Recommender Systems: A Unified View

消除基于大语言模型的推荐系统中的域外推荐:一种统一视角

Hao Liao, Jiwei Zhang, Jianxun Lian, Wensheng Lu, Mingqi Wu, Shuo Wang, Yong Zhang, Yitian Huang, Mingyang Zhou, Rui Mao

发表机构 * College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院) Microsoft Research Asia(微软亚洲研究院)

AI总结 本文提出RecLM框架,通过统一架构整合三种 grounding 方法,系统比较了基于嵌入检索、约束生成和离散项生成的推荐方法,有效消除域外推荐并提升了推荐准确性。

Comments 20 pages

AI中文摘要

基于大语言模型(LLMs)的推荐系统常常受到域外(OOD)项目幻觉的困扰。为了解决这个问题,我们提出了RecLM,一种统一框架,通过在单一架构下实例化三种grounding范式来弥合检索与生成之间的差距:基于嵌入的检索、在重写项目标题上的约束生成以及离散项目-令牌生成。使用相同的LLM和提示,我们系统地在公开基准上比较了这三种视角。RecLM在所有变体中严格消除了域外推荐(OOD@10=0),并且约束生成变体RecLM-cgen和RecLM-token在与强ID基线和LLM基线相比时达到了最先进的准确性。我们的统一视角为比较三种不同的范式提供了系统的基础,以减少项目幻觉,提供了一个实用的框架来促进LLM在推荐任务中的应用。源代码位于https://github.com/microsoft/RecAI。

英文摘要

Recommender systems based on Large Language Models (LLMs) are often plagued by hallucinations of out-of-domain (OOD) items. To address this, we propose RecLM, a unified framework that bridges the gap between retrieval and generation by instantiating three grounding paradigms under a single architecture: embedding-based retrieval, constrained generation over rewritten item titles, and discrete item-tokenizer generation. Using the same backbone LLM and prompts, we systematically compare these three views on public benchmarks. RecLM strictly eradicates OOD recommendations (OOD@10 = 0) across all variants, and the constrained generation variants RecLM-cgen and RecLM-token achieve overall state-of-the-art accuracy compared to both strong ID-based and LLM-based baselines. Our unified view provides a systematic basis for comparing three distinct paradigms to reduce item hallucinations, offering a practical framework to facilitate the application of LLMs to recommendation tasks. Source code is at https://github.com/microsoft/RecAI.
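
Constrained generation over a fixed catalogue is commonly implemented with a prefix trie: at each step the decoder may only choose tokens that keep the output a prefix of some valid item title, so an out-of-domain title cannot be produced. The sketch below illustrates that general technique with whitespace tokens and a stand-in scoring function. It is not RecLM's implementation, and `build_trie`, `allowed_next`, and `constrained_greedy` are hypothetical names.

```python
EOS = "<eos>"


def build_trie(titles):
    """Nested-dict trie over whitespace tokens; EOS marks a complete title."""
    trie = {}
    for title in titles:
        node = trie
        for token in title.split():
            node = node.setdefault(token, {})
        node[EOS] = {}
    return trie


def allowed_next(trie, prefix):
    """Tokens that may follow `prefix` without leaving the catalogue."""
    node = trie
    for token in prefix:
        node = node.get(token)
        if node is None:
            return set()
    return set(node)


def constrained_greedy(trie, score):
    """Greedy decoding where `score(prefix, token)` stands in for model logits."""
    prefix = []
    while True:
        options = allowed_next(trie, prefix)
        if not options:
            return " ".join(prefix)
        token = max(sorted(options), key=lambda t: score(prefix, t))
        if token == EOS:
            return " ".join(prefix)
        prefix.append(token)


titles = ["The Matrix", "The Matrix Reloaded", "Toy Story"]
trie = build_trie(titles)
# Toy scorer that simply prefers longer tokens; a real system would use LLM logits.
print(constrained_greedy(trie, lambda prefix, token: len(token)))
```

Whatever the scorer prefers, the decoded string is always one of the catalogue titles, which is the property behind an OOD@10 of zero for constrained variants.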

2601.08510 2026-06-05 cs.CL cs.AI 版本更新

STAGE: A Full-Screenplay Benchmark for Reasoning over Evolving Stories

STAGE:一个用于推理演变故事的完整剧本基准

Qiuyu Tian, Zequn Liu, Yiding Li, Fengyi Chen, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang, Yingce Xia

发表机构 * Southeast University(东南大学) Beijing Zhongguancun Academy(北京中关村学院) Nanjing Normal University(南京师范大学) ZhuiWen Technology Co., Ltd.(智库文科技有限公司)

AI总结 提出STAGE基准,通过知识图谱构建、场景事件摘要、长上下文问答和角色扮演四项任务,全面评估模型对电影剧本叙事世界的理解与推理能力。

Comments 66 pages, 9 figures

AI中文摘要

电影剧本是丰富的长篇叙事,交织着复杂的角色关系、时间顺序事件和对话驱动的互动。虽然先前的基准针对诸如问答或对话生成等单个子任务,但它们很少评估模型能否构建连贯的故事世界并在多种推理和生成形式中一致地使用它。我们引入了STAGE(剧本文本、智能体、图谱与评估),一个针对全长电影剧本叙事理解的统一基准。STAGE定义了四个任务:知识图谱构建、场景级事件摘要、长上下文剧本问答以及剧本内角色扮演,所有这些都基于共享的叙事世界表示。该基准提供了150部中英文电影的清洗脚本、策划的知识图谱以及事件和角色为中心的注释,从而能够全面评估模型构建世界表示、抽象和验证叙事事件、推理长叙事以及生成角色一致响应的能力。

英文摘要

Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.

2601.09236 2026-06-05 cs.LG cs.AI 版本更新

Reward Learning through Ranking Mean Squared Error

通过排名均方误差进行奖励学习

Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor

发表机构 * Calarina Muslimani(卡拉里娜·穆斯林尼) Matthew E. Taylor(马修·E·泰勒)

AI总结 本文提出了一种基于评分的强化学习方法R4,通过引入新的排名均方误差损失函数,从轨迹-评分对数据中学习奖励函数,并在机器人基准测试中表现出色。

AI中文摘要

奖励设计仍然是将强化学习(RL)应用于现实世界问题的主要瓶颈。一种流行的替代方法是奖励学习,其中奖励函数是从人类反馈中推断出来,而不是手动指定。最近的工作提出了从人类评分而不是传统二元偏好中学习奖励函数,从而实现更丰富且可能更少认知需求的监督。在此范式基础上,我们引入了一种新的基于评分的RL方法,即Ranked Return Regression for RL(R4)。其核心是使用一种新的排名均方误差损失,从轨迹-评分对数据集中学习,将人类提供的离散评分(例如,差,中性,好)视为有序目标。与以往的基于评分的方法不同,R4提供了正式的保证:在其解集下,在温和的假设下,解集是可证明的最小且完整的。实证上,使用人类提供的和模拟的评分,我们证明R4在OpenAI Gym和DeepMind Control Suite的机器人基准测试中,一致地匹配或优于现有的基于评分和偏好强化学习方法。代码发布在https://github.com/IRLL/R4。

英文摘要

Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human ratings rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 uses a novel ranking mean squared error loss that learns from a dataset of trajectory-rating pairs, treating the human-provided discrete ratings (e.g., bad, neutral, good) as ordinal targets. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using both human-provided and simulated ratings, we demonstrate that R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic benchmarks from OpenAI Gym and the DeepMind Control Suite. Code released at https://github.com/IRLL/R4.

2502.14131 2026-06-05 cs.LG cs.AI econ.EM 版本更新

An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model

一种用于离线逆强化学习和动态离散选择模型的经验风险最小化方法

Enoch H. Kang, Hema Yoganarasimhan, Lalit Jain

发表机构 * Foster School of Business, University of Washington(华盛顿大学福斯特商学院)

AI总结 本文提出了一种基于经验风险最小化(ERM)的逆强化学习/动态离散选择模型框架,该方法无需显式估计贝尔曼方程中的状态转移概率,适用于高维和无限状态空间,并在理论上有Polyak-Lojasiewicz条件的支持,从而保证了快速的全局收敛性。

详情
AI中文摘要

我们研究了估计动态离散选择(DDC)模型的问题,也称为机器学习中的离线最大熵正则化逆强化学习(离线MaxEnt-IRL)。目标是从离线行为数据中恢复支配代理行为的奖励或Q*函数。在本文中,我们提出了一种全局收敛的基于梯度的方法来解决这些问题,而无需线性参数化的奖励假设。我们的方法的创新之处在于引入了基于经验风险最小化(ERM)的IRL/DDC框架,该框架避免了在贝尔曼方程中显式估计状态转移概率的需要。此外,我们的方法与非参数估计技术如神经网络兼容。因此,所提出的方法有潜力扩展到高维、无限状态空间。我们方法的一个关键理论洞察是贝尔曼残差满足Polyak-Lojasiewicz(PL)条件--一个属性,虽然比强凸性弱,但足以保证快速的全局收敛保证。通过一系列合成实验,我们证明我们的方法在性能上始终优于基准方法和最先进的替代方法。

英文摘要

We study the problem of estimating Dynamic Discrete Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing the Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks. Therefore, the proposed method has the potential to be scaled to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition -- a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.
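
For reference, the Polyak-Lojasiewicz condition mentioned above can be stated in its standard general form. The symbols are generic placeholders rather than the paper's notation: $f$ is a differentiable objective with minimum value $f^*$, $\theta$ its parameters, $\mu > 0$ the PL constant, and $L$ a smoothness constant.

```latex
% PL condition: the squared gradient norm dominates the suboptimality gap.
\frac{1}{2}\,\lVert \nabla f(\theta) \rVert^{2} \;\ge\; \mu \,\bigl( f(\theta) - f^{*} \bigr)
\qquad \text{for all } \theta .

% Consequence for an L-smooth f under gradient descent with step size 1/L:
f(\theta_{k+1}) - f^{*} \;\le\; \Bigl( 1 - \frac{\mu}{L} \Bigr) \bigl( f(\theta_{k}) - f^{*} \bigr).
```

The second line is the linear (geometric) convergence rate that the condition yields without requiring convexity, which is the sense in which it is weaker than strong convexity yet still gives fast global convergence.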

2601.06056 2026-06-05 cs.CY cs.AI cs.CV 版本更新

Using street view images and visual LLMs to predict heritage values for governance support: Risks, ethics, and policy implications

利用街景图像和视觉大语言模型预测遗产价值以支持治理:风险、伦理与政策影响

Tim Johansson, Mikael Mangold, Kristina Dabrock, Anna Donarelli, Ingrid Campo-Ruiz

发表机构 * RISE Research Institutes of Sweden AB(瑞典RISE研究院) Malmö University(马尔默大学) Forschungszentrum Jülich GmbH(于利希研究中心) Uppsala University(乌普萨拉大学)

AI总结 本研究利用街景图像和视觉大语言模型评估瑞典建筑遗产价值,以支持建筑翻新计划的制定,探讨了方法中的问题、潜在改进以及使用LLM数据的伦理风险。

详情
AI中文摘要

在2025年至2026年期间,欧盟各成员国正在实施《建筑能效指令》,该指令要求所有成员国制定国家建筑翻新计划。瑞典没有全面记录具有遗产价值建筑的国家登记册,相关瑞典当局认为这阻碍了制定建筑翻新计划所需的分析。本研究旨在帮助瑞典当局建立关于瑞典建筑存量中遗产价值的信息。研究使用多模态大语言模型(LLM)分析了瑞典各地街景图像中的建筑(N=154 710),以评估指示遗产价值的可见特征。以LLM的零样本预测为基础,识别出潜在具有遗产价值的建筑,涉及500万平方米的供暖建筑面积。本文呈现了预测结果和经验教训,并将其与作为治理组成部分的瑞典建筑翻新计划的制定联系起来。文中讨论了方法存在的问题和潜在的改进,并探讨了当局使用基于LLM的数据所带来的风险,重点关注透明性、错误检测和谄媚(sycophancy)问题。

英文摘要

During 2025 and 2026, the Energy Performance of Buildings Directive is being implemented in the European Union member states, requiring all member states to have National Building Renovation Plans. In Sweden, there is no comprehensive national register of buildings with heritage values. This is seen as a barrier for the analyses underlying the development of Building Renovation Plans by the involved Swedish authorities. The purpose of this research was to assist Swedish authorities in developing information on heritage values in the Swedish building stock. Buildings in street view images from all over Sweden (N=154 710) have been analysed using multimodal Large Language Models (LLMs) to assess visible aspects indicative of heritage value. Zero-shot predictions by LLMs were used as a basis for identifying buildings with potential heritage values for 5.0 million square meters of heated floor area. In this paper, the results of the predictions and lessons learned are presented and related to the development of the Swedish Building Renovation Plan as part of governance. The problems with the method and potential improvements are discussed. Risks with authorities' use of LLM-based data are addressed, with a focus on issues of transparency, error detection and sycophancy.

2512.15231 2026-06-05 cs.AI 版本更新

CangLing-KnowFlow: A Unified Knowledge-and-Flow-fused Agent for Comprehensive Remote Sensing Applications

CangLing-KnowFlow:面向综合遥感应用的知识与流程融合统一智能体

Zhengchao Chen, Haoran Wang, Jing Yao, Jianshe Zhang, Pedram Ghamisi, Jun Zhou, Peter M. Atkinson, Bing Zhang

发表机构 * State Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences(中国科学院空天信息创新研究院遥感与数字地球国家重点实验室) Beijing Tiandi Shijie Technology Co., Ltd.(北京天帝世纪科技有限公司) Faculty of Science and Technology, Lancaster University(兰卡斯特大学科学与技术学院) Faculty of Electrical and Computer Engineering, University of Iceland(冰岛大学电气与计算机工程学院) Helmholtz-Zentrum Dresden-Rossendorf(亥姆霍兹德累斯顿-罗森多夫研究中心) School of Information and Communication Technology, Griffith University(格里菲斯大学信息与通信技术学院)

AI总结 本文提出CangLing-KnowFlow,一个融合知识与流程的统一智能代理框架,通过整合过程知识库、动态工作流调整和进化记忆模块,解决遥感数据处理中任务特定、缺乏统一框架的问题,并在KnowFlow-Bench基准测试中表现出色。

详情
AI中文摘要

大规模遥感(RS)数据集的自动化和智能化处理对于地球观测(EO)至关重要。现有的自动化系统通常是任务特定的,缺乏统一的框架来管理不同RS应用中多样化的端到端工作流(从数据预处理到高级解译)。为了弥补这一差距,本文介绍CangLing-KnowFlow,一个统一的智能体框架,整合了过程知识库(PKB)、动态工作流调整和进化记忆模块。PKB包含1,008个经过专家验证的工作流案例,涵盖162个实际RS任务,用于指导规划并显著减少通用智能体中常见的幻觉问题。在运行时发生故障时,动态工作流调整能够自主诊断并重新规划恢复策略,而进化记忆模块会持续从这些事件中学习,迭代提升智能体的知识和性能。这种协同作用使CangLing-KnowFlow能够在多样且复杂的任务中适应、学习并可靠运行。我们在KnowFlow-Bench上评估了CangLing-KnowFlow,这是一个受真实应用启发、包含324个工作流的新基准,并测试了其在13个顶级大语言模型(LLM)骨干(从开源到商业)上的性能。在所有复杂任务中,CangLing-KnowFlow在任务成功率上比Reflexion基线高出至少4%。作为该新兴领域迄今最全面的验证,本研究展示了CangLing-KnowFlow作为稳健、高效且可扩展的自动化解决方案的巨大潜力,它将专家知识(Knowledge)转化为自适应且可验证的流程(Flow),以应对复杂的EO挑战。

英文摘要

The automated and intelligent processing of massive remote sensing (RS) datasets is critical in Earth observation (EO). Existing automated systems are normally task-specific, lacking a unified framework to manage diverse, end-to-end workflows--from data preprocessing to advanced interpretation--across diverse RS applications. To address this gap, this paper introduces CangLing-KnowFlow, a unified intelligent agent framework that integrates a Procedural Knowledge Base (PKB), Dynamic Workflow Adjustment, and an Evolutionary Memory Module. The PKB, comprising 1,008 expert-validated workflow cases across 162 practical RS tasks, guides planning and substantially reduces hallucinations common in general-purpose agents. During runtime failures, the Dynamic Workflow Adjustment autonomously diagnoses and replans recovery strategies, while the Evolutionary Memory Module continuously learns from these events, iteratively enhancing the agent's knowledge and performance. This synergy enables CangLing-KnowFlow to adapt, learn, and operate reliably across diverse, complex tasks. We evaluated CangLing-KnowFlow on the KnowFlow-Bench, a novel benchmark of 324 workflows inspired by real-world applications, testing its performance across 13 top Large Language Model (LLM) backbones, from open-source to commercial. Across all complex tasks, CangLing-KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate. As the most comprehensive validation to date in this emerging field, this research demonstrates the great potential of CangLing-KnowFlow as a robust, efficient, and scalable automated solution for complex EO challenges by turning expert knowledge (Knowledge) into adaptive and verifiable procedures (Flow).

2512.20627 2026-06-05 cs.NI cs.AI 版本更新

Efficient Asynchronous Federated Evaluation with Strategy Similarity Awareness for Intent-Based Networking in Industrial Internet of Things

面向工业物联网意图驱动网络的策略相似性感知高效异步联邦评估

Shaowen Qin, Jianfeng Zeng, Haodong Guo, Xiaohuan Li, Jiawen Kang, Qian Chen

发表机构 * Guangxi University Key Laboratory of Intelligent Networking and Scenario System (School of Information and Communication, Guilin University of Electronic Technology)(广西智能网络与场景系统重点实验室(信息与通信学院,桂林电子科技大学)) National Engineering Laboratory for Comprehensive Transportation Big Data Application Technology (Guangxi)(综合交通运输大数据应用技术国家工程实验室(广西)) School of Automation, Guangdong University of Technology(自动化学院,广东工业大学) School of Architecture and Transportation Engineering, GUET(建筑与交通工程学院,桂林电子科技大学)

AI总结 本文提出了联邦评估增强的意图驱动网络框架FEIBN,利用大语言模型将用户意图转化为结构化策略元组,并通过策略相似性感知的联邦学习机制提升训练效率、降低通信开销,从而在工业物联网环境中实现更高效的策略评估。

Comments 12 pages with 7 figures and 4 tables

详情
AI中文摘要

意图驱动网络(IBN)通过将高层用户意图转化为可执行的网络策略,为工业物联网(IIoT)环境中的智能和自动化网络控制提供了一种有前景的范式。然而,由于工作流紧密耦合且停机成本高,频繁的策略部署和回滚并不现实,而节点异质性和隐私约束进一步增加了集中式策略评估的难度。为了解决这些挑战,我们提出了一种联邦评估增强的意图驱动网络框架(FEIBN),该框架利用大语言模型(LLMs)将用户意图转化为结构化策略元组,并采用联邦学习支持分布式策略评估。为了提高训练效率并减少通信开销,我们设计了一种策略相似性感知联邦学习机制(SSAFL),该机制根据策略相似性和资源状态选择与任务相关的节点,并仅在本地更新显著时触发异步模型上传。实验表明,与基线方法相比,所提出的方法提高了模型精度、加快了收敛速度并降低了通信成本。

英文摘要

Intent-Based Networking (IBN) offers a promising paradigm for intelligent and automated network control in Industrial Internet of Things (IIoT) environments by translating high-level user intents into executable network strategies. However, frequent strategy deployment and rollback are impractical due to tightly coupled workflows and high downtime costs, while node heterogeneity and privacy constraints further complicate centralized strategy evaluation. To address these challenges, we propose a Federated Evaluation Enhanced Intent-Based Networking framework (FEIBN), which leverages large language models (LLMs) to translate user intents into structured strategy tuples and employs federated learning to support distributed strategy evaluation. To improve training efficiency and reduce communication overhead, we design a Strategy Similarity Aware Federated Learning mechanism (SSAFL), which selects nodes relevant to the task based on strategy similarity and resource status, and triggers asynchronous model uploads only when local updates are significant. Experiments demonstrate that the proposed method improves model accuracy, accelerates convergence, and reduces communication cost compared with the baselines.
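The two SSAFL ideas described above (picking task-relevant nodes, and uploading only when a local update is significant) can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the cosine similarity measure, the `resource` score, the thresholds, and every function name are assumptions introduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two strategy vectors (assumed representation)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_nodes(nodes, task_vec, min_sim=0.8, min_resource=0.5):
    """Keep nodes whose strategy is similar to the task and that have spare resources.

    Both thresholds are hypothetical placeholders.
    """
    return [
        n["id"]
        for n in nodes
        if cosine(n["strategy"], task_vec) >= min_sim and n["resource"] >= min_resource
    ]

def should_upload(last_uploaded, local, threshold):
    """Trigger an asynchronous upload only if the weights moved far enough.

    "Significant" is modelled here as the Euclidean distance between the last
    uploaded weights and the current local weights; the paper may use another test.
    """
    drift = math.sqrt(sum((n - o) ** 2 for o, n in zip(last_uploaded, local)))
    return drift >= threshold

nodes = [
    {"id": "a", "strategy": [1.0, 0.0], "resource": 0.9},
    {"id": "b", "strategy": [0.0, 1.0], "resource": 0.9},  # dissimilar strategy
    {"id": "c", "strategy": [1.0, 0.1], "resource": 0.2},  # too few resources
]
selected = select_nodes(nodes, [1.0, 0.0])
```

In this toy setup only node `a` passes both filters; a node would then call `should_upload` after each local training round and stay silent when the drift is below the threshold.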

2512.20111 2026-06-05 cs.CL cs.AI cs.LG 版本更新

ABBEL: Learning Natural-Language Belief States for Memory-Efficient Interaction

ABBEL:为内存高效的交互学习自然语言信念状态

Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr

发表机构 * University of California, Berkeley(加州大学伯克利分校) Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出ABBEL框架,通过显式自然语言信念状态直接监督每个摘要的信息内容,以解决传统方法在生成摘要时信息丢失或更新错误的问题,从而在保持高效内存使用的同时提升交互性能。

详情
AI中文摘要

随着序列决策任务的时间跨度增长,将完整交互历史保留在模型上下文中变得越来越昂贵。最近的研究改为让决策智能体以递归更新的自然语言摘要为条件,从而缩短上下文长度,这些摘要简洁且可解释。然而,这些方法的表现仍不及能够访问完整上下文的智能体,表明它们未能生成信息充分的摘要。为此,我们提出了ABBEL,一种递归摘要框架,它以显式自然语言信念状态的形式分离并直接监督每个摘要的信息内容。首先,我们分析了前沿模型在ABBEL下于五个领域中生成的信念状态,并验证了性能下降往往源于遗漏信息或错误地更新信息。我们还发现,在某些设置下模型会保留无关信息,从而低效地使用内存。针对这些局限,我们使用两种基于强化学习的方法进行微调:信念评分(belief grading),根据信息内容奖励信念生成以减少更新错误;峰值信念惩罚(peak belief penalties),鼓励压缩内存占用最大的信念。我们证明这些方法显著缩小了与完整上下文模型的性能差距,并使ABBEL在仅使用67%内存的情况下,比先前的记忆智能体工作高出40%。我们的代码可在https://github.com/jakob-bjorner/optimal-explorer-dev获取。

英文摘要

As the time horizons of sequential decision-making tasks grow, keeping full interaction histories in model context becomes increasingly costly. Recent work reduces context lengths by instead conditioning decision-making agents on recursively updated natural-language summaries, which are concise and interpretable. However, they underperform agents with access to the full context, suggesting that they fail to generate sufficient summaries. To address this we propose ABBEL, a recursive summarization framework that isolates and directly supervises each summary's information contents in the form of explicit natural-language belief states. First, we analyze the belief states generated by frontier models under ABBEL across five domains, and verify that performance is often degraded due to omitting or incorrectly updating information. We also discover settings where models use memory inefficiently by retaining extraneous information. We target these limitations by fine-tuning with two RL-based methods: belief grading, which reduces update errors by rewarding belief generations based on their information content, and peak belief penalties, which encourage compressing the beliefs with the greatest memory footprints. We demonstrate that these methods significantly reduce the performance gap with full context models, and enable ABBEL to outperform prior memory agent work by 40% while using 67% of the memory. Our code is available at https://github.com/jakob-bjorner/optimal-explorer-dev
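The recursive belief-state loop and the peak penalty described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the real system uses an LLM for `act` and `update_belief`, memory is measured here in characters rather than tokens, and the linear penalty form with `penalty_per_unit` is hypothetical, not the paper's reward.

```python
def rollout(observations, update_belief, act, initial_belief=""):
    """Run an episode where the agent only ever sees its current belief string.

    Returns the final belief and the peak belief length seen during the episode.
    """
    belief = initial_belief
    peak = len(belief)
    for obs in observations:
        action = act(belief)                         # conditioned on belief, not history
        belief = update_belief(belief, action, obs)  # recursive summary update
        peak = max(peak, len(belief))
    return belief, peak

def penalized_return(task_reward, peak_len, penalty_per_unit=0.01):
    """Hypothetical shaping: subtract a cost proportional to the largest belief."""
    return task_reward - penalty_per_unit * peak_len

# Toy stand-ins for the LLM calls: the belief is a ';'-separated list of facts.
def toy_update(belief, action, obs):
    facts = [f for f in belief.split(";") if f]
    if obs not in facts:  # skip redundant information
        facts.append(obs)
    return ";".join(facts)

def toy_act(belief):
    return "probe"

final_belief, peak = rollout(["a=1", "b=2", "a=1"], toy_update, toy_act)
shaped = penalized_return(1.0, peak)
```

Because the repeated observation `a=1` is not stored twice, the belief stops growing after the second step, which is the kind of compression a peak penalty is meant to encourage.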

2512.14792 2026-06-05 cs.AI cs.SE 版本更新

IaC Generation with LLMs: An Error Taxonomy and A Study on Configuration Knowledge Injection

利用LLM生成IaC:错误分类法与配置知识注入研究

Roman Nekrasov, Stefano Fossati, Indika Kumara, Damian Andrew Tamburri, Willem-Jan van den Heuvel

发表机构 * Jheronimus Academy of Data Science(Jheronimus数据科学学院) Tilburg University(蒂尔堡大学) Eindhoven University of Technology(埃因霍温理工大学) University of Sannio(桑尼奥大学)

AI总结 本研究探讨了如何通过系统性地注入结构化配置知识来提高LLM生成正确且意图一致的基础设施即代码(IaC)能力,特别是在Terraform中,提出了新的错误分类法,并评估了多种知识注入技术。

Comments Submitted to ACM

详情
AI中文摘要

大型语言模型(LLMs)目前在生成正确且意图一致的基础设施即代码(IaC)方面表现出较低的成功率。本研究调查了改进基于LLM的IaC生成方法,特别是针对Terraform,通过系统性地注入结构化配置知识。为此,现有的IaC-Eval基准测试被显著增强,加入了云模拟和自动错误分析。此外,开发了一种新的用于LLM辅助IaC代码生成的错误分类法。实现并评估了一系列知识注入技术,从简单的检索增强生成(RAG)到更复杂的图RAG方法。这些包括图组件的语义增强和资源间依赖关系的建模。实验结果表明,尽管基线LLM性能较差(整体成功率为27.1%),注入结构化配置知识将技术验证成功率提高到75.3%,整体成功率提高到62.6%。尽管这些进步在技术正确性方面有所提升,但意图一致性却停滞不前,揭示了“正确性-一致性鸿沟”,即LLMs可以成为熟练的“程序员”,但作为满足复杂用户意图的“架构师”却受限。

英文摘要

Large Language Models (LLMs) currently exhibit low success rates in generating correct and intent-aligned Infrastructure as Code (IaC). This research investigated methods to improve LLM-based IaC generation, specifically for Terraform, by systematically injecting structured configuration knowledge. To facilitate this, an existing IaC-Eval benchmark was significantly enhanced with cloud emulation and automated error analysis. Additionally, a novel error taxonomy for LLM-assisted IaC code generation was developed. A series of knowledge injection techniques was implemented and evaluated, progressing from Naive Retrieval-Augmented Generation (RAG) to more sophisticated Graph RAG approaches. These included semantic enrichment of graph components and modeling inter-resource dependencies. Experimental results demonstrated that while baseline LLM performance was poor (27.1% overall success), injecting structured configuration knowledge increased technical validation success to 75.3% and overall success to 62.6%. Despite these gains in technical correctness, intent alignment plateaued, revealing a "Correctness-Congruence Gap" where LLMs can become proficient "coders" but remain limited "architects" in fulfilling nuanced user intent.

2511.21667 2026-06-05 cs.LG cs.AI 版本更新

Escaping the Verifier: Learning to Reason via Demonstrations

摆脱验证者:通过示范学习推理

Locke Cai, Max Ryabinin, Ivan Provilkov

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 本文提出RARO方法,通过逆强化学习从专家示范中学习强大的推理能力,无需任务特定的验证者,从而在多个评估任务中实现了显著的性能提升。

详情
AI中文摘要

训练大型语言模型(LLMs)进行推理通常依赖于带有任务特定验证者的强化学习(RL)。然而,许多现实世界的推理密集型任务缺乏验证者,尽管它们提供了大量专家示范,而这些示范在面向推理的训练中仍未得到充分利用。我们引入RARO(Relativistic Adversarial Reasoning Optimization,相对论式对抗推理优化),仅通过逆强化学习从专家示范中学习强大的推理能力。RARO在策略与相对论式评判器之间建立对抗博弈:策略学习模仿专家答案,而评判器旨在从专家-策略答案对中识别出专家答案。策略和评判器通过RL联合且持续地训练,我们还确定了实现稳健学习所需的关键稳定化技术。实证结果表明,RARO在所有评估任务上均显著优于强大的无验证者基线:在Countdown(1.5B)上准确率提高13.7%,在DeepMath(7B)上准确率提高8.2%,在Poetry Writing(7B)上相对专家诗歌的胜率提高19.1%。RARO还表现出与带验证者的RL相似的稳健扩展趋势。这些结果表明,RARO仅凭专家示范就能有效激发强大的推理性能,即使在任务特定验证者不可用时也能实现稳健的推理学习。

英文摘要

Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization), which learns strong reasoning capabilities from expert demonstrations alone via Inverse Reinforcement Learning. RARO sets up an adversarial game between a policy and a relativistic critic: the policy learns to mimic expert answers, while the critic aims to identify the experts among expert-policy answer pairs. Both the policy and the critic are trained jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines across all evaluation tasks: +13.7% accuracy on Countdown (1.5B), +8.2% accuracy on DeepMath (7B), and +19.1% win-rate on Poetry Writing (7B) against expert poems. RARO also exhibits similar robust scaling trends as RL with verifiers. These results demonstrate that RARO effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.

2512.05774 2026-06-05 cs.CV cs.AI cs.CL 版本更新

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

主动视频感知:面向智能体式长视频理解的迭代证据搜寻

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles

发表机构 * Salesforce AI Research(Salesforce AI研究院) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 本文提出了一种主动视频感知框架AVP,通过迭代计划-观察-反思过程,主动决定视频内容的观察目标和时间,以提高长视频理解的准确性和效率。

Comments Website: https://activevideoperception.github.io/

详情
AI中文摘要

长视频理解(LVU)具有挑战性,因为回答现实世界查询往往依赖于稀疏、时间分散的线索,这些线索隐藏在数小时的大部分冗余和无关内容中。尽管代理流程提高了视频推理能力,但现有框架依赖于查询无关的描述器来感知视频信息,这浪费了计算资源并模糊了细粒度的时间和空间信息。受主动感知理论的启发,我们主张LVU代理应主动决定观察什么、何时和在哪里观察,并持续评估当前观察是否足够回答查询。我们提出了主动视频感知(AVP),一种证据寻求框架,将视频视为交互环境,并直接从像素中获取紧凑、查询相关的证据。具体而言,AVP运行一个迭代的计划-观察-反思过程,使用MLLM代理。在每个轮次中,计划者提出有针对性的视频交互,观察者执行以提取时间戳证据,反思者评估证据对查询的充分性,要么终止并给出答案,要么触发进一步观察。在五个LVU基准测试中,AVP实现了最高整体准确率,有显著提升。值得注意的是,AVP在平均整体准确率上比最佳代理方法高出5.7%,同时仅需18.4%的推理时间和12.4%的输入令牌。

英文摘要

Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest overall accuracy with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average overall accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
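The plan-observe-reflect loop described above can be sketched as a small control loop. This is an illustrative skeleton only: in the actual framework the three roles are MLLM agents, whereas here they are plain callables, and the names `run_avp_loop`, `Evidence`, `max_rounds`, and the toy stubs are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    start_s: float  # start of the observed window, in seconds
    end_s: float    # end of the observed window, in seconds
    note: str       # what was seen in that window

def run_avp_loop(query, planner, observer, reflector, max_rounds=5):
    """Iterate plan -> observe -> reflect until the reflector returns an answer."""
    evidence = []
    for round_idx in range(max_rounds):
        action = planner(query, evidence)    # decide what/when/where to look
        evidence.extend(observer(action))    # collect time-stamped evidence
        answer = reflector(query, evidence)  # None means "not sufficient yet"
        if answer is not None:
            return answer, evidence, round_idx + 1
    return None, evidence, max_rounds

# Toy stand-ins: the "video" is a dict from window start time to a description.
video_notes = {0: "intro", 60: "a red car arrives", 120: "credits"}

def toy_planner(query, evidence):
    start = 60.0 * len(evidence)  # scan the next unseen 60-second window
    return {"start_s": start, "end_s": start + 60.0}

def toy_observer(action):
    note = video_notes.get(int(action["start_s"]), "")
    return [Evidence(action["start_s"], action["end_s"], note)]

def toy_reflector(query, evidence):
    for e in evidence:
        if "car" in e.note:
            return f"seen at {e.start_s:.0f}s"
    return None

answer, evidence, rounds = run_avp_loop(
    "When does the car arrive?", toy_planner, toy_observer, toy_reflector
)
```

The loop stops as soon as the reflector judges the evidence sufficient, so the third window is never inspected; that early exit is the source of the compute savings the abstract attributes to query-driven observation.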

2508.10875 2026-06-05 cs.CL cs.AI cs.LG 版本更新

A Survey on Diffusion Language Models

扩散语言模型的综述

Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen

发表机构 * VILA Lab, Mohamed bin Zayed University of Artificial Intelligence(维拉实验室,穆罕默德·本·扎耶德人工智能大学) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 本文综述了扩散语言模型的发展现状,探讨了其与自回归模型和掩码语言模型的关系,分析了预训练策略、后训练方法以及推理优化技术,并讨论了多模态扩展、应用场景、局限性及未来研究方向。

详情
AI中文摘要

扩散语言模型(DLMs)正迅速崛起,成为主流自回归(AR)范式的一种强大且有前景的替代方案。通过迭代去噪过程并行生成令牌,DLMs在减少推理延迟和捕捉双向上下文方面具有固有优势,从而实现对生成过程的精细控制。最近的进展使DLMs在实现数倍加速的同时,表现出与自回归模型相当的性能,使其成为各种自然语言处理任务的有力选择。在本综述中,我们对当前DLM的研究图景进行了全面概述。我们追溯其演变及其与自回归模型、掩码语言模型等其他范式的关系,并涵盖了基础原理和最先进模型。我们的工作提供了一个最新、全面的分类法以及对当前技术的深入分析,从预训练策略到高级后训练方法。本综述的另一个贡献是全面回顾DLM的推理策略和优化,包括在解码并行性、缓存机制和生成质量方面的改进。我们还重点介绍了DLM多模态扩展的最新方法,并阐述了它们在各种实际场景中的应用。此外,我们讨论了DLMs的局限性和挑战,包括效率、长序列处理和基础设施需求,同时概述了未来研究方向,以保持这一快速发展领域的进步。项目GitHub仓库见https://github.com/VILA-Lab/Awesome-DLMs。

英文摘要

Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.

2512.05013 2026-06-05 cs.AI cs.MA stat.ME 版本更新

Detecting Perspective Shifts in Multi-agent Systems

在多智能体系统中检测视角变化

Eric Bridgeford, Hayden Helm

发表机构 * Helivan, San Francisco, CA(Helivan,美国加利福尼亚州旧金山)

AI总结 本文提出了一种名为TDKPS的框架,用于检测多智能体系统中智能体和群体层面的行为变化,通过模拟和自然实验验证了其在检测真实外部事件变化方面的有效性。

详情
AI中文摘要

配备外部工具和更新机制的生成模型(即智能体)已展现出超越对基础模型进行智能提示的能力。随着智能体使用的普及,动态多智能体系统自然而然地出现了。最近的研究探讨了基于单一时间点查询响应的智能体低维表示的理论和经验性质。本文引入了时间数据核视角空间(TDKPS),该方法跨时间联合嵌入智能体,并提出了几种新的假设检验方法,用于检测黑盒多智能体系统中智能体层面和群体层面的行为变化。我们通过受演化数字人格多智能体系统启发的模拟,刻画了所提出检验的经验性质,包括其对关键超参数的敏感性。最后,我们通过自然实验证明,所提出的检验检测到的变化与一个真实外生事件的关联具有敏感性、特异性和显著性。据我们所知,TDKPS是首个用于监控黑盒多智能体系统行为动态的有原则的框架;随着生成式智能体部署规模的持续扩大,这一能力至关重要。

英文摘要

Generative models augmented with external tools and update mechanisms (or agents) have demonstrated capabilities beyond intelligent prompting of base models. As agent use proliferates, dynamic multi-agent systems have naturally emerged. Recent work has investigated the theoretical and empirical properties of low-dimensional representations of agents based on query responses at a single time point. This paper introduces the Temporal Data Kernel Perspective Space (TDKPS), which jointly embeds agents across time, and proposes several novel hypothesis tests for detecting behavioral change at the agent- and group-level in black-box multi-agent systems. We characterize the empirical properties of our proposed tests, including their sensitivity to key hyperparameters, in simulations motivated by a multi-agent system of evolving digital personas. Finally, we demonstrate via natural experiment that our proposed tests detect changes that correlate sensitively, specifically, and significantly with a real exogenous event. As far as we are aware, TDKPS is the first principled framework for monitoring behavioral dynamics in black-box multi-agent systems -- a critical capability as generative agent deployment continues to scale.

2512.03086 2026-06-05 cs.PL cs.AI cs.SE 版本更新

Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

超越代码对:基于对话的数据生成用于LLM代码翻译

Le Chen, Nuo Xu, Winson Chen, Bin Lei, Pei-Hung Lin, Dunzhi Zhou, Rajeev Thakur, Caiwen Ding, Ali Jannesari, Chunhua Liao

发表机构 * Argonne National Laboratory(阿贡国家实验室) University of Minnesota(明尼苏达大学) Iowa State University(爱荷华州立大学) Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室)

AI总结 本文提出了一种基于对话的数据生成方法,通过双LLM架构生成验证的翻译和多轮对话,以提升LLM在低资源编程领域中的代码翻译能力。

详情
AI中文摘要

大型语言模型(LLMs)在代码翻译任务中表现出色,但在Fortran等低资源编程领域和CUDA等新兴框架中性能下降,因为高质量的平行数据稀缺。我们提出了一种自动化数据集生成流水线,采用双LLM提问者-求解器(Questioner-Solver)设计,并融入来自编译器和运行时反馈的外部知识。除了传统的源-目标代码对数据集外,我们的方法还生成(1)带有单元测试的已验证翻译,用于评估功能一致性,以及(2)多轮对话,记录翻译改进背后的推理过程。应用于Fortran到C++和C++到CUDA的转换时,该流水线分别生成3.64k和3.93k个对话。在该数据上微调可大幅提升功能正确性,在具有挑战性的C++到CUDA任务上将单元测试成功率提高超过56%。我们证明,生成的数据使7B开放权重模型在编译成功率等关键指标上显著优于更大的专有系统。

英文摘要

Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran-to-C++ and C++-to-CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show that the generated data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.

2511.20613 2026-06-05 cs.LG cs.AI cs.MA 版本更新

Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

Vibe编码能击败计算机科学研究生吗?面向市场驱动战略规划的LLM与人类编程锦标赛

Panayiotis Danassis, Naman Goel

发表机构 * University of Southampton(南安普顿大学) University of Oxford and Alan Turing Institute(牛津大学和艾伦·图灵研究所)

AI总结 本文提出一个基于现实物流优化问题(拍卖、取货与配送问题)的多智能体推理驱动基准,该问题将竞争性拍卖与容量受限的路径规划相结合。研究在12轮双循环锦标赛、约4万场对局中比较了40个LLM编码智能体与17个人类编码智能体的表现,揭示了人类编码智能体在战略规划和优化任务中的优势,以及LLM在生成具有现实竞争力的代码方面的不足。

详情
AI中文摘要

大型语言模型(LLMs)的快速普及已经革新了AI辅助代码生成。然而,LLMs的发展速度已超出我们对其进行恰当基准评测的能力。现有的基准测试强调单元测试通过率和语法正确性,这些指标低估了许多需要规划、优化和战略互动的真实世界问题的难度。我们引入了一个基于现实物流优化问题(拍卖、取货与配送问题)的多智能体推理驱动基准,该问题将竞争性拍卖与容量受限的路径规划相结合。该基准要求构建能够(i)在不确定性下进行战略投标,以及(ii)优化规划器、在完成配送任务的同时最大化利润的智能体。我们将40个LLM编码的智能体(由多种最先进的LLMs在多种提示方法下生成,包括Vibe编码)与17个在LLM出现之前开发的人类编码智能体进行了对比评估。我们在12轮双循环锦标赛、约4万场对局中的结果显示:(i)人类(研究生)编码的智能体具有明显优势,前5名始终由人类编码的智能体占据;(ii)大多数LLM编码的智能体(40个中的33个)被非常简单的基线击败;(iii)在给定最佳人类解决方案作为输入并被提示加以改进时,表现最好的LLM反而使该方案显著变差。我们的结果突显了LLMs在生成具有现实竞争力的代码方面的差距,并推动建立强调真实场景中推理驱动代码合成的新评估。

英文摘要

The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and $\sim 40$k matches demonstrate (i) a clear superiority of human-coded agents (written by graduate students): the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon it, the best-performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
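As a rough sanity check on the match count quoted above, the arithmetic can be sketched as follows. It rests on two assumptions that the abstract does not confirm: that all 40 + 17 agents enter every tournament (with no extra baseline entrants), and that "double all-play-all" means each ordered pair of distinct agents meets once, i.e. each pair plays twice.

```python
def double_round_robin_matches(num_agents):
    """Matches in one double all-play-all: every ordered pair (a, b) with a != b."""
    return num_agents * (num_agents - 1)

llm_agents = 40    # from the abstract
human_agents = 17  # from the abstract
tournaments = 12   # from the abstract

agents = llm_agents + human_agents                # assumed field size per tournament
per_tournament = double_round_robin_matches(agents)
total_matches = tournaments * per_tournament
```

Under those assumptions the total comes to 38,304 matches, which is in line with the roughly 40k reported; any baseline agents entered alongside the 57 would raise the figure.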

2503.01734 2026-06-05 cs.CR cs.AI 版本更新

Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning

对抗代理:基于强化学习的黑盒逃逸攻击

Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Virginia Tech(弗吉尼亚理工大学)

AI总结 本文提出了一种基于强化学习的对抗攻击方法,通过学习生成对抗样本的新算法,提高了攻击效率和成功率,同时在图像分类基准上展示了其优越的性能。

Comments Accepted to the Findings of CVPR 2026

详情
AI中文摘要

对机器学习模型的攻击已通过无状态优化广泛研究。本文展示了强化学习(RL)代理如何学习一种新类型的攻击算法来生成对抗样本。与传统对抗机器学习(AML)方法不同,我们的RL方法保留并利用过去的攻击经验,以提高未来攻击的有效性和效率。我们将对抗样本生成建模为马尔可夫决策过程,并评估RL在(a)学习有效且高效的攻击策略以及(b)与最先进的AML竞争的能力。在两个图像分类基准上,我们的代理在训练过程中将攻击成功率提高了最高13.2%,并将每个攻击的受害者模型查询平均次数减少了最高16.9%。在与最先进的图像攻击进行直接比较时,我们的方法使攻击者能够在训练后在未见过的输入上生成对抗样本的成功率提高了17%。从安全角度来看,这项工作展示了一种强大的新攻击向量,利用RL训练能够高效且大规模攻击ML模型的代理。

英文摘要

Attacks on machine learning models have been extensively studied through stateless optimization. In this paper, we demonstrate how a reinforcement learning (RL) agent can learn a new class of attack algorithms that generate adversarial samples. Unlike traditional adversarial machine learning (AML) methods that craft adversarial samples independently, our RL-based approach retains and exploits past attack experience to improve the effectiveness and efficiency of future attacks. We formulate adversarial sample generation as a Markov Decision Process and evaluate RL's ability to (a) learn effective and efficient attack strategies and (b) compete with state-of-the-art AML. On two image classification benchmarks, our agent increases attack success rate by up to 13.2% and decreases the average number of victim model queries per attack by up to 16.9% from the start to the end of training. In a head-to-head comparison with state-of-the-art image attacks, our approach enables an adversary to generate adversarial samples with 17% more success on unseen inputs post-training. From a security perspective, this work demonstrates a powerful new attack vector that uses RL to train agents that attack ML models efficiently and at scale.

2511.05615 2026-06-05 cs.LG cs.AI cs.AR physics.ins-det 版本更新

wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation

wa-hls4ml: 一个用于hls4ml资源和延迟估计的基准及替代模型

Benjamin Hawks, Jason Weitz, Dmitri Demler, Karla Tame-Narvaez, Dennis Plotnikov, Mohammad Mehdi Rahimifar, Hamza Ezzaoui Rahali, Audrey C. Therrien, Donovan Sproule, Elham E Khoda, Keegan A. Smith, Russell Marroquin, Giuseppe Di Guglielmo, Nhan Tran, Javier Duarte, Vladimir Loncar

发表机构 * Fermi National Accelerator Laboratory(费米国家加速器实验室) University of California San Diego(加州大学圣地亚哥分校) Johns Hopkins University(约翰霍普金斯大学) University of Sherbrooke(舍布鲁克大学) Columbia University(哥伦比亚大学) Texas A&M University(德克萨斯A&M大学) European Organization for Nuclear Research (CERN)(欧洲核子研究中心(CERN))

AI总结 本文提出了一个用于评估ML加速器资源和延迟的基准wa-hls4ml,并介绍了基于图神经网络和Transformer的替代模型,用于预测ML加速器的延迟和资源使用情况。

Comments 30 pages, 18 figures

详情
Journal ref
Wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation. ACM Trans. Reconfigurable Technol. Syst. 19, 2, Article 20 (June 2026), 29 pages
AI中文摘要

随着机器学习(ML)越来越多地在硬件中实现以应对科学应用中的实时挑战,先进工具链的发展显著减少了在各种设计上迭代所需的时间。这些进步解决了主要障碍,但也暴露了新的挑战。例如,硬件综合等以前不被视为瓶颈的过程,如今正成为设计快速迭代的限制因素。为缓解这些新出现的约束,已有多项工作致力于开发基于ML的替代模型,用于估计ML加速器架构的资源使用情况。我们介绍了wa-hls4ml,这是一个用于ML加速器资源和延迟估计的基准,以及其对应的初始数据集,包含超过680,000个全连接和卷积神经网络,均使用hls4ml综合并面向Xilinx FPGA。该基准以若干主要来自科学领域的常见ML模型架构作为示例模型,评估资源和延迟预测器的性能,并评估其在数据集子集上的平均性能。此外,我们还介绍了基于图神经网络(GNN)和Transformer的替代模型,用于预测ML加速器的延迟和资源。我们展示了这些模型的架构和性能,并发现在合成测试数据集上,对于第75百分位,这些模型的延迟和资源预测通常与综合得到的结果相差在几个百分点以内。

英文摘要

As machine learning (ML) is increasingly implemented in hardware to address real-time challenges in scientific applications, the development of advanced toolchains has significantly reduced the time required to iterate on various designs. These advancements have solved major obstacles, but also exposed new challenges. For example, processes that were not previously considered bottlenecks, such as hardware synthesis, are becoming limiting factors in the rapid iteration of designs. To mitigate these emerging constraints, multiple efforts have been undertaken to develop an ML-based surrogate model that estimates resource usage of ML accelerator architectures. We introduce wa-hls4ml, a benchmark for ML accelerator resource and latency estimation, and its corresponding initial dataset of over 680,000 fully connected and convolutional neural networks, all synthesized using hls4ml and targeting Xilinx FPGAs. The benchmark evaluates the performance of resource and latency predictors against several common ML model architectures, primarily originating from scientific domains, as exemplar models, and the average performance across a subset of the dataset. Additionally, we introduce GNN- and transformer-based surrogate models that predict latency and resources for ML accelerators. We present the architecture and performance of the models and find that the models generally predict latency and resources for the 75th percentile within several percent of the synthesized resources on the synthetic test dataset.

2410.02628 2026-06-05 cs.LG cs.AI 版本更新

Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization

逆熵最优运输通过数据似然最大化解决半监督学习

Mikhail Persiianov, Arip Asadulaev, Nikita Andreev, Nikita Starodubcev, Dmitry Baranchuk, Anastasis Kratsios, Evgeny Burnaev, Alexander Korotin

发表机构 * Institute for Advanced Study(高等研究院) National Research Council Canada(加拿大国家研究理事会) University of Toronto(多伦多大学) St. Petersburg State University(圣彼得堡国立大学) Skolkovo Institute of Science and Technology(斯科尔科沃科学技术研究院) Kazan Federal University(喀山联邦大学)

AI总结 本文提出了一种名为EBiEOT的新学习范式,通过数据似然最大化技术无缝整合配对和非配对数据,解决了半监督学习中的数据获取难题,并证明了该方法在理论上能够以任意小的误差恢复真实条件分布。

详情
AI中文摘要

学习条件分布π*(⋅|x)是机器学习中的核心问题,通常通过监督方法利用配对数据(x,y)∼π*进行学习。然而,获取配对数据样本往往具有挑战性,尤其是在领域翻译等问题中。这需要开发能够利用有限配对数据和额外非配对i.i.d.样本x∼π*_x和y∼π*_y的半监督模型。使用此类结合数据复杂且常依赖启发式方法。为此,我们提出了一种新的学习范式称为EBiEOT,利用数据似然最大化技术无缝整合配对和非配对数据。我们证明了该方法与逆熵最优运输(OT)有奇妙的联系。这一发现使我们能够应用最近的计算OT进展,建立一个端到端的学习算法来获得π*(⋅|x)。此外,我们推导了通用逼近性质,证明该方法在理论上可以以任意小的误差恢复真实条件分布。最后,我们通过实验证明,我们的方法能够同时利用配对和非配对数据有效学习条件分布。EBiEOT的代码可在https://github.com/MuXauJl11110/EBiEOT上获得。

英文摘要

Learning conditional distributions $π^*(\cdot|x)$ is a central problem in machine learning, which is typically approached via supervised methods with paired data $(x,y) \sim π^*$. However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development of $\textit{semi-supervised}$ models that utilize both limited paired data and additional unpaired i.i.d. samples $x \sim π^*_x$ and $y \sim π^*_y$ from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm called $\textbf{EBiEOT}$ that integrates both paired and unpaired data seamlessly using data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding allows us to apply recent advances in computational OT to establish an $\textit{end-to-end}$ learning algorithm to get $π^*(\cdot|x)$. In addition, we derive the universal approximation property, demonstrating that our approach can theoretically recover true conditional distributions with arbitrarily small error. Finally, we demonstrate through empirical tests that our method effectively learns conditional distributions using paired and unpaired data simultaneously. The code of $\texttt{EBiEOT}$ is available at https://github.com/MuXauJl11110/EBiEOT.

2510.11974 2026-06-05 cs.CR cs.AI 版本更新

CTIConnect: A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence

CTIConnect:一种用于异构网络威胁情报的检索增强大语言模型基准

Yutong Cheng, Yang Liu, Changze Li, Dawn Song, Peng Gao

Affiliations * Virginia Tech Department of Computer Science; University of California, Berkeley Department of Computer Science

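AI summary This paper presents CTIConnect, a benchmark for evaluating retrieval-augmented large language models on cyber threat intelligence tasks. It integrates five heterogeneous data sources into 1,860 expert-verified question-answer pairs, shows that the cross-source semantic gap manifests differently across task categories and that both the appropriate retrieval strategy and the performance bottleneck vary with the task, and demonstrates that domain-specific strategies improve performance.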

Comments Accepted to KDD 2026

详情
AI中文摘要

网络威胁情报(CTI)是现代网络安全的基础,使组织能够主动防御不断演变的威胁。然而,CTI数据的规模和异质性,从结构化知识库(CVE、CWE、CAPEC、MITRE ATT&CK)和非结构化威胁报告,远远超出了手动分析的能力。大型语言模型(LLMs)强大的上下文理解和推理能力推动了其在CTI任务中的应用。然而,现有的基准评估在检索增强设置中缺乏适当的评估框架,无法访问分析师在实践中依赖的异构领域知识源。为此,我们提出了CTIConnect,一种系统评估检索增强型LLMs在CTI任务领域的基准。我们构建了一个统一的评估环境,整合了五个异构CTI数据源,构建了1860个专家验证的问答对,涵盖实体链接、多文档综合和实体归属三个类别共九项任务。对十种最先进的LLMs进行了大量实验,发现跨源语义差距在不同任务类别中表现不同,需要根本不同的检索策略,并且性能瓶颈在检索基础设施和证据利用之间切换。我们的领域特定策略进一步优于更强的一般检索范式(检索后重排、IRCoT),表明缩小这一差距需要结构干预而非通用检索改进。这些发现在所有十种LLMs上均成立,保持在完整基准上的一致性,并在2008-2025时间分割下保持稳定。共同,它们为设计可扩展的异构CTI生态系统检索架构提供了可操作的指导。

英文摘要

Cyber Threat Intelligence (CTI) is foundational to modern cybersecurity, enabling organizations to proactively defend against evolving threats. However, the sheer volume and heterogeneity of CTI data, spanning structured knowledge bases (CVE, CWE, CAPEC, MITRE ATT&CK) and unstructured threat reports, far exceed the capacity of manual analysis. The strong contextual understanding and reasoning of Large Language Models (LLMs) have driven growing interest in applying them to CTI tasks. Yet no existing benchmark evaluates LLMs in a retrieval-augmented setting with a proper evaluation harness that grants access to the heterogeneous domain knowledge sources analysts rely on in practice. To address this gap, we present CTIConnect, a benchmark for systematically evaluating retrieval-augmented LLMs across the CTI task landscape. We construct a unified evaluation environment integrating five heterogeneous CTI sources into 1,860 expert-verified QA pairs spanning nine tasks across three categories: Entity Linking, Multi-Document Synthesis, and Entity Attribution. Extensive experiments on ten state-of-the-art LLMs reveal that the cross-source semantic gap manifests differently across task categories, demanding fundamentally different retrieval strategies, and that the performance bottleneck shifts between retrieval infrastructure and evidence utilization depending on the task. Our domain-specific strategies further outperform stronger general-purpose retrieval paradigms (retrieve-then-rerank, IRCoT), showing that closing this gap requires structural interventions rather than generic retrieval improvements. These findings hold across all ten LLMs, remain consistent on the full benchmark, and stay stable under temporal splits spanning 2008-2025. Together, they provide actionable guidance for designing scalable retrieval architectures over heterogeneous CTI ecosystems.

2510.05709 2026-06-05 cs.CR cs.AI cs.CL 版本更新

Correcting Prompt Dependence in LLM Benchmarks: A Bayesian Hierarchical Model with Embedding-Space Clustering

纠正大语言模型基准测试中的提示依赖:一种具有嵌入空间聚类的贝叶斯分层模型

Mary Llewellyn, Isobel Thornton, James Bishop, Annie Gray

Affiliations * University of Cambridge

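AI summary This paper proposes a Bayesian hierarchical model with embedding-space clustering that corrects for prompt dependence in LLM benchmarks and provides more robust performance metrics in limited-data settings. On adversarial robustness benchmarks it improves mean absolute error by 4-73%.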

Comments Accepted to the 1st Workshop on Combining Theory and Benchmarks, CTB@ICML 2026, Seoul, South Korea

详情
AI中文摘要

大语言模型基准测试指标经常错误地陈述性能和不确定性,因为它们依赖于两个在实践中经常不成立的假设:(i) 经典推断有足够的评估数据,和 (ii) 测试提示是独立的。我们提出了一种纠正性的贝叶斯分层模型,结合嵌入空间聚类,能够在数据有限的情况下提供稳健的性能指标,同时纠正提示依赖问题。我们将该方法应用于对抗鲁棒性基准测试,展示了聚类结构的一致恢复,从而得到更可靠的性能指标,平均绝对误差提高了4-73%,预期对数后验密度提高了40-450个单位。

英文摘要

LLM benchmarking metrics often misstate performance and uncertainty as they rely on two assumptions that frequently do not hold in practice: (i) a sufficient number of evaluations are available for classical inference, and (ii) test prompts are independent. We propose a corrective Bayesian hierarchical model with embedding-space clustering that provides robust performance metrics in limited-data settings while correcting for prompt dependence. We apply the approach to adversarial robustness benchmarks, showing consistent recovery of clustering structure, resulting in more reliable performance metrics, with 4-73% improvements to mean absolute errors and 40-450 unit improvements to expected log posterior densities.
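Assumption (ii) above, that test prompts are independent, can be illustrated with a small sketch. This is not the paper's Bayesian hierarchical model: it is a plain frequentist comparison, and the function name `naive_and_clustered_se` and the toy data are hypothetical. It contrasts the standard error obtained by treating every prompt as independent with one computed from per-cluster accuracies.

```python
import numpy as np

def naive_and_clustered_se(correct, cluster_ids):
    """Compare a binomial standard error with a cluster-level one.

    correct     : 0/1 outcome per prompt
    cluster_ids : cluster label per prompt (e.g. from embedding clustering)
    """
    correct = np.asarray(correct, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    p_hat = correct.mean()
    # Treats every prompt as an independent Bernoulli draw.
    naive_se = np.sqrt(p_hat * (1.0 - p_hat) / correct.size)
    # Treats each cluster as one observation of accuracy.
    cluster_means = np.array(
        [correct[cluster_ids == c].mean() for c in np.unique(cluster_ids)]
    )
    clustered_se = cluster_means.std(ddof=1) / np.sqrt(cluster_means.size)
    return p_hat, naive_se, clustered_se

# Toy data: 4 clusters of 25 near-duplicate prompts, identical outcomes within a cluster.
outcomes = np.repeat([1, 1, 0, 0], 25)
clusters = np.repeat([0, 1, 2, 3], 25)
p_hat, naive_se, clustered_se = naive_and_clustered_se(outcomes, clusters)
```

In this toy setup the 100 prompts carry roughly the information of 4 observations, so the cluster-level standard error is several times larger than the naive one.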

2504.10020 2026-06-05 cs.CL cs.AI cs.CV 版本更新

The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?

性能提升的幻象:为何对比解码无法减轻多模态大语言模型中的对象幻觉?

Hao Yin, Guangzong Si, Zilei Wang

Affiliations * University of Science and Technology of China; Eastern Institute of Technology, Ningbo

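AI summary This paper examines whether contrastive decoding actually mitigates object hallucinations in multimodal large language models (MLLMs) and finds that its performance gains stem largely from two misleading factors, challenging the assumed effectiveness of contrastive decoding strategies.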

详情
AI中文摘要

对比解码策略被广泛用于减少多模态大语言模型(MLLMs)中的对象幻觉。这些方法通过构建对比样本来诱导幻觉,然后在输出分布中抑制它们。然而,本文证明此类方法无法有效缓解幻觉问题。在POPE基准测试中观察到的性能提升主要由两个误导性因素驱动:(1)对模型输出分布的粗略、单向调整;(2)自适应可能性约束,将采样策略简化为贪婪搜索。为进一步说明这些问题,我们引入了一系列虚假改进方法,并将其性能与对比解码技术进行评估。实验结果揭示了对比解码中观察到的性能提升与其缓解幻觉的初衷无关。我们的发现挑战了对比解码策略有效性的常见假设,并为开发真正有效的MLLMs幻觉解决方案铺平了道路。

英文摘要

Contrastive decoding strategies are widely used to reduce object hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on POPE Benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model's output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.
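The two factors named above can be made concrete with a minimal sketch of one decoding step. It assumes the formulation commonly used in visual contrastive decoding work, which the abstract itself does not spell out: scores of the form (1 + alpha) * original logits - alpha * contrastive logits, and a plausibility mask that keeps tokens whose original probability is at least beta times the maximum. The names `contrastive_step`, `alpha`, and `beta` are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # stabilise; -inf entries become probability 0
    e = np.exp(z)
    return e / e.sum()

def contrastive_step(logits_orig, logits_contrast, alpha=1.0, beta=0.1):
    """Next-token distribution after contrastive adjustment (assumed form)."""
    p_orig = softmax(logits_orig)
    # Adaptive plausibility constraint: drop tokens the original model finds unlikely.
    keep = p_orig >= beta * p_orig.max()
    # Unidirectional adjustment: push away from the contrastive (distorted) branch.
    scores = (1.0 + alpha) * logits_orig - alpha * logits_contrast
    scores = np.where(keep, scores, -np.inf)
    return softmax(scores)

logits_orig = np.array([2.0, 1.0, 0.5])
logits_contrast = np.array([2.5, 0.2, 0.1])
loose = contrastive_step(logits_orig, logits_contrast, beta=0.1)   # all tokens kept
strict = contrastive_step(logits_orig, logits_contrast, beta=1.0)  # only the original argmax kept
```

With `beta=1.0` only the original model's top token survives the mask, so sampling from `strict` is the same as greedy decoding on the original logits, which is the degenerate case the abstract points to.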

2509.25450 2026-06-05 cs.CE cs.AI cs.NA math.NA physics.comp-ph 版本更新

Multi-patch isogeometric neural solver for partial differential equations on computer-aided design domains

多补丁等几何神经求解器用于计算机辅助设计域上的偏微分方程

Moritz von Tresckow, Ion Gabriel Ion, Dimitrios Loukrezis

Affiliations * Institute for Accelerator Science and Electromagnetic Fields, Technische Universität Darmstadt; Terra Quantum AG; Scientific Computing, Centrum Wiskunde & Informatica

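AI summary This paper proposes a computational framework that combines physics-informed neural networks with multi-patch isogeometric analysis to solve partial differential equations on complex computer-aided design geometries. Patch-local neural networks operate on the reference domain of isogeometric analysis, and a custom output layer imposes Dirichlet boundary conditions strongly. Dedicated interface neural networks enforce solution conformity across interfaces between non-uniform rational B-spline patches. Training minimizes the energy functional derived from the weak form of the partial differential equation within a variational framework. The method is validated on two highly non-trivial, practically relevant use cases: a 2D magnetostatics model of a quadrupole magnet and a 3D nonlinear solid and contact mechanics model of a mechanical holder. The results agree closely with reference solutions from high-fidelity finite element solvers, showing the neural solver's potential for complex engineering problems.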

Comments 33 pages, 15 figures

详情
AI中文摘要

本工作开发了一种计算框架,结合物理感知神经网络与多补丁等几何分析,用于解决复杂计算机辅助设计几何上的偏微分方程。该方法利用补丁局部神经网络在等几何分析的参考域上操作。定制的输出层使强加狄利克雷边界条件。通过专用的界面神经网络确保非均匀有理B样条补丁之间界面的解一致性。通过变分框架最小化偏微分方程弱形式导出的能量函数进行训练。该方法的有效性在两个高度非平凡且实际相关的应用案例中得到验证,即四极磁铁的2D磁静力学模型和机械夹具的3D非线性固体力学与接触力学模型。结果与高保真有限元求解器获得的参考解高度一致,从而突显了该神经求解器在处理复杂工程问题方面的潜力,鉴于相应的计算机辅助设计模型。

英文摘要

This work develops a computational framework that combines physics-informed neural networks with multi-patch isogeometric analysis to solve partial differential equations on complex computer-aided design geometries. The method utilizes patch-local neural networks that operate on the reference domain of isogeometric analysis. A custom output layer enables the strong imposition of Dirichlet boundary conditions. Solution conformity across interfaces between non-uniform rational B-spline patches is enforced using dedicated interface neural networks. Training is performed using the variational framework by minimizing the energy functional derived from the weak form of the partial differential equation. The effectiveness of the suggested method is demonstrated on two highly non-trivial and practically relevant use cases, namely, a 2D magnetostatics model of a quadrupole magnet and a 3D nonlinear solid and contact mechanics model of a mechanical holder. The results show excellent agreement with reference solutions obtained with high-fidelity finite element solvers, thus highlighting the potential of the suggested neural solver to tackle complex engineering problems given the corresponding computer-aided design models.
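Strong imposition of Dirichlet boundary conditions through an output layer is often done by adding a boundary lift to the network output multiplied by a function that vanishes on the boundary. The sketch below shows that generic construction on the 1D reference interval [0, 1]. It is an assumption about the general idea, not the paper's actual layer, and the tiny random network, `raw_network`, and `constrained_output` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# A tiny untrained network standing in for a patch-local network (illustrative only).
W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def raw_network(x):
    """Unconstrained network output for points x in [0, 1], shape (n,)."""
    hidden = np.tanh(W1 @ x[None, :] + b1[:, None])   # shape (16, n)
    return (W2 @ hidden + b2[:, None])[0]             # shape (n,)

def constrained_output(x, u_left=1.0, u_right=3.0):
    """Output that equals u_left at x=0 and u_right at x=1 for any weights."""
    lift = u_left * (1.0 - x) + u_right * x   # satisfies the boundary data
    distance = x * (1.0 - x)                  # vanishes on the boundary
    return lift + distance * raw_network(x)

boundary_values = constrained_output(np.array([0.0, 1.0]))
```

Because the boundary values hold by construction, no boundary penalty term is needed in the training loss; only the interior energy has to be minimized.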

2509.25397 2026-06-05 cs.SE cs.AI cs.LG 版本更新

A Cartography of Open Collaboration in Open Source AI: Mapping Practices, Motivations, and Governance in 14 Open Large Language Model Projects

开源人工智能中开放协作的图谱:映射14个开源大语言模型项目的实践、动机与治理

Johan Linåker, Cailean Osborne, Jennifer Ding, Ben Burtenshaw

Affiliations * RISE Research Institutes of Sweden AB; University of Oxford

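AI summary Through an analysis of open collaboration across the development and reuse lifecycle of 14 open large language model projects, this paper documents the diversity of collaboration methods, motivations, and governance structures, and concludes that openness in open source AI is not a uniform property but an emergent outcome of how collaboration is organised across interconnected artefact domains, lifecycle stages, and institutional contexts.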

Comments In submission

详情
AI中文摘要

开源大语言模型(LLMs)的普及正在推动人工智能(AI)领域形成一个活跃的生态系统。然而,开发开源LLMs所使用的协作方法,在其公开发布前后仍未被系统研究,这限制了我们对开源LLM项目如何启动、组织和治理的理解,以及进一步促进这一生态系统的机会。我们通过探索性分析开源LLMs的开发与再利用生命周期中的开放协作,基于对14个不同开源LLM项目开发者的半结构化访谈。这些协作跨越多个艺术领域——包括模型、数据、软件、评估、计算和社区参与——每个领域都使不同的参与形式成为可能,并涉及不同的利益相关者,这些利益相关者在LLM开发生命周期中不断演变,从早期的集中、选择性参与转变为模型发布后的广泛、分散参与。开源LLM开发者受多种社会、经济和技术动机驱动,从民主化AI访问和促进开放科学到构建区域生态系统和扩展语言代表性。这些动态通过一系列治理结构协调,通常在不同程度上正式和专业化,包括以公司为中心的集中努力到去中心化的基层倡议。我们通过一个概念模型综合了我们的发现,提供了实践建议,并得出结论:开源AI的开放性并非单一属性,而是协作在互联艺术领域、生命周期阶段和制度背景下的组织方式的涌现结果。

英文摘要

The proliferation of open large language models (LLMs) is fostering a vibrant ecosystem in artificial intelligence (AI). However, the methods of collaboration used to develop open LLMs, both before and after their public release, have not yet been systematically studied, limiting our understanding of how open LLM projects are initiated, organised, and governed, as well as the opportunities to further foster this ecosystem. We address this gap through an exploratory analysis of open collaboration throughout the development and reuse lifecycle of open LLMs, drawing on semi-structured interviews with the developers of 14 diverse open LLM projects. These collaborations span multiple artefact domains -- including models, data, software, evaluation, compute, and community engagement -- each enabling distinct forms of participation and involving different stakeholders. Participation evolves across the LLM development lifecycle, shifting from concentrated, selective engagement in the early stages to broader, distributed participation after model release. The open LLM developers are driven by a variety of social, economic, and technological motivations, ranging from democratising access to AI and promoting open science to building regional ecosystems and expanding language representation. These dynamics are coordinated through a range of governance structures, typically formal and professionalised to varying degrees, ranging from centralised company-led efforts to decentralised grassroots initiatives. We synthesise our findings in a conceptual model of open collaboration in open LLM ecosystems, provide recommendations for practice, and conclude that openness in open source AI is not a uniform property but an emergent outcome of how collaboration is organised across interconnected artefact domains, lifecycle stages, and institutional contexts.

2504.10823 2026-06-05 cs.CL cs.AI 版本更新

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

CLASH:从多个视角评估语言模型在高风险困境中的判断

Ayoung Lee, Ryan Sungmo Kwon, Peter Railton, Lu Wang

Affiliations * Department of Computer Science and Engineering; Department of Philosophy; University of Michigan, Ann Arbor

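AI summary This paper introduces CLASH, a dataset for studying value-based decision-making, and finds that language models fall markedly short on ambivalent decisions and on perspectives involving value shifts, although they predict psychological discomfort reasonably well.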

Comments Published as a conference paper at ICLR 2026

详情
AI中文摘要

在高风险领域,涉及冲突价值的困境对人类都极具挑战性,更不用说AI了。然而,先前的研究仅限于日常场景。为弥补这一差距,我们引入了CLASH(基于角色视角的LLM在高风险情境中的评估),该数据集包含345个高影响困境及3,795个不同价值观的个体视角。CLASH使研究者能够探讨关键但尚未被深入研究的价值决策过程方面,包括对决策矛盾和心理不适的理解以及角色视角中价值观的时间变化。通过基准测试14个非思考和思考模型,我们揭示了几个关键发现:(1)即使强大的专有模型,如GPT-5和Claude-4-Sonnet,也难以处理矛盾决策,仅达到24.06和51.01的准确率。(2)尽管LLMs能合理预测心理不适,但它们在涉及价值变化的视角中并不充分理解。(3)在数学解题和游戏策略领域有效的认知行为无法转移到价值推理中。相反,新的失败模式出现,包括早期承诺和过度承诺。(4)LLMs对特定价值的可引导性与其价值偏好显著相关。(5)最后,当从第三方视角推理时,LLMs表现出更高的可引导性,尽管某些价值(如安全)独特地受益于第一人称框架。

英文摘要

Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. CLASH enables the study of critical yet underexplored aspects of value-based decision-making processes, including understanding of decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in the perspectives of characters. By benchmarking 14 non-thinking and thinking models, we uncover several key findings. (1) Even strong proprietary models, such as GPT-5 and Claude-4-Sonnet, struggle with ambivalent decisions, achieving only 24.06 and 51.01 accuracy. (2) Although LLMs reasonably predict psychological discomfort, they do not adequately comprehend perspectives involving value shifts. (3) Cognitive behaviors that are effective in the math-solving and game strategy domains do not transfer to value reasoning. Instead, new failure patterns emerge, including early commitment and overcommitment. (4) The steerability of LLMs towards a given value is significantly correlated with their value preferences. (5) Finally, LLMs exhibit greater steerability when reasoning from a third-party perspective, although certain values (e.g., safety) benefit uniquely from first-person framing.

2509.20324 2026-06-05 cs.CR cs.AI 版本更新

RAG Security and Privacy: Formalizing the Threat Model and Attack Surface

RAG安全与隐私:形式化威胁模型和攻击面

Atousa Arzanipour, Rouzbeh Behnia, Reza Ebrahimi, Kaushik Dutta

Affiliations * University of California, Berkeley

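AI summary This paper studies security and privacy in RAG systems and proposes what the authors describe as the first formal threat model, defining attack vectors such as document-level membership inference and data poisoning, to support a more rigorous understanding of privacy and security in RAG systems.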

Comments Published at the 5th ICDM Workshop in November 2025

详情
Journal ref
2025 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 1387-1394, 2025
AI中文摘要

检索增强生成(RAG)是一种新兴的自然语言处理方法,结合大型语言模型(LLMs)与外部文档检索以生成更准确和基于事实的响应。尽管RAG在减少幻觉和提高事实一致性方面表现出色,但其也引入了与传统LLMs不同的隐私和安全挑战。现有研究表明,LLMs可通过训练数据记忆或对抗性提示泄露敏感信息,而RAG系统继承了许多这些漏洞。同时,RAG依赖外部知识库打开了新的攻击面,包括可能泄露检索文档的存在或内容信息,或注入恶意内容以操控模型行为。尽管存在这些风险,目前尚无正式框架定义RAG系统的威胁景观。本文通过提出首个形式化的RAG威胁模型,填补了文献中的关键空白。我们引入了基于对模型组件和数据访问的对手类型的结构化分类,并正式定义了关键威胁向量,如文档级成员推断和数据中毒,这些向量在实际部署中对隐私和完整性构成严重风险。通过建立正式定义和攻击模型,本文为更严谨和原则性的理解RAG系统的隐私和安全奠定了基础。

英文摘要

Retrieval-Augmented Generation (RAG) is an emerging approach in natural language processing that combines large language models (LLMs) with external document retrieval to produce more accurate and grounded responses. While RAG has shown strong potential in reducing hallucinations and improving factual consistency, it also introduces new privacy and security challenges that differ from those faced by traditional LLMs. Existing research has demonstrated that LLMs can leak sensitive information through training data memorization or adversarial prompts, and RAG systems inherit many of these vulnerabilities. At the same time, RAG's reliance on an external knowledge base opens new attack surfaces, including the potential for leaking information about the presence or content of retrieved documents, or for injecting malicious content to manipulate model behavior. Despite these risks, there is currently no formal framework that defines the threat landscape for RAG systems. In this paper, we address a critical gap in the literature by proposing, to the best of our knowledge, the first formal threat model for RAG systems. We introduce a structured taxonomy of adversary types based on their access to model components and data, and we formally define key threat vectors such as document-level membership inference and data poisoning, which pose serious privacy and integrity risks in real-world deployments. By establishing formal definitions and attack models, our work lays the foundation for a more rigorous and principled understanding of privacy and security in RAG systems.

2307.05284 2026-06-05 cs.LG cs.AI 版本更新

Rethinking Distribution Shifts: Empirical Analysis and Modeling for Tabular Data

重新思考分布偏移:针对表格数据的经验分析与建模

Tianyu Wang, Jiashuo Liu, Peng Cui, Hongseok Namkoong

Affiliations * Department of Industrial Engineering and Operations Research; Department of Computer Science and Technology; Decision, Risk, and Operations Division; Columbia University; Tsinghua University

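AI summary Through empirical analysis and modeling, this paper re-examines distribution shift and finds that Y|X-shifts are the most prevalent in its tabular-data testbed, in stark contrast to the ML literature's heavy focus on X (covariate) shifts, and that robust algorithms perform no better than vanilla methods.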

Comments Forthcoming at Management Science. Conference version appeared in NeurIPS 2023, previously titled "On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets"

详情
AI中文摘要

不同的分布偏移需要不同的干预措施,算法必须基于其解决的具体偏移类型来构建。然而,稳健算法的方法学发展通常依赖于缺乏实证验证的结构性假设。本文倡导一种以实证为基础的数据驱动方法来开发算法,构建了一个包含8个表格数据集中的自然偏移、172个分布对、45种方法和90,000种方法配置的实证测试平台,涵盖了经验风险最小化和分布鲁棒优化(DRO)方法。我们发现Y|X偏移在我们的测试平台中最为普遍,这与机器学习文献中对X(协变量)偏移的高度重视形成鲜明对比,并且稳健算法的性能并不优于普通方法。为了理解原因,我们深入分析了DRO方法,发现被忽视的实现细节——如底层模型类(例如LightGBM)的选择和超参数选择——对性能的影响比模糊集或其半径更大。通过案例研究,我们展示了如何通过数据驱动的归纳理解分布偏移,提供了一种新的算法开发方法。

英文摘要

Different distribution shifts require different interventions, and algorithms must be grounded in the specific shifts they address. However, methodological development for robust algorithms typically relies on structural assumptions that lack empirical validation. Advocating for an empirically grounded data-driven approach to algorithm development, we build an empirical testbed comprising natural shifts across 8 tabular datasets, 172 distribution pairs over 45 methods and 90,000 method configurations encompassing empirical risk minimization and distributionally robust optimization (DRO) methods. We find $Y|X$-shifts are most prevalent in our testbed, in stark contrast to the heavy focus on $X$ (covariate)-shifts in the ML literature, and that the performance of robust algorithms is no better than that of vanilla methods. To understand why, we conduct an in-depth empirical analysis of DRO methods and find that overlooked implementation details -- such as the choice of underlying model class (e.g., LightGBM) and hyperparameter selection -- have a bigger impact on performance than the ambiguity set or its radius. We illustrate via case studies how a data-driven, inductive understanding of distribution shifts can provide a new approach to algorithm development.

2507.06219 2026-06-05 cs.RO cs.AI cs.LG 版本更新

Is Diversity All You Need for Scalable Robotic Manipulation?

多样性是否是可扩展机器人操作的全部需求?

Modi Shi, Li Chen, Jin Chen, Yuxiang Lu, Chiming Liu, Guanghui Ren, Ping Luo, Di Huang, Maoqing Yao, Hongyang Li

Affiliations * University of Science and Technology of China; Tsinghua University; University of California, Berkeley

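AI summary This paper studies the role of data diversity in robot learning. It finds that task diversity matters more than per-task demonstration quantity, that multi-embodiment pre-training data is optional for cross-embodiment transfer, and that expert diversity can be confounding to policy learning, and it proposes a distribution debiasing method that improves performance.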

Comments Code is available at https://github.com/OpenDriveLab/AgiBot-World

详情
AI中文摘要

数据扩展在自然语言处理和计算机视觉的基础模型中取得了显著成功,但机器人操作中有效数据扩展的原则仍不够清楚。本文通过研究机器人学习中数据多样性的细微作用,探讨了三个关键维度:任务(做什么)、身体(使用哪种机器人)和专家(谁演示)。通过在各种机器人平台上进行广泛实验,我们发现:(1)任务多样性比单任务演示数量更重要,有助于从多样预训练任务转移到新下游场景;(2)多身体预训练数据在跨身体转移中是可选的,高质量单身体预训练模型可以高效地转移到不同平台,在微调过程中表现出比多身体预训练模型更优的扩展特性;(3)专家多样性源于个体操作偏好和人类演示中的随机变化,可能对策略学习产生干扰,速度多模态成为关键贡献因素。基于这一洞察,我们提出了一种分布去偏方法以缓解速度模糊性,所提出的GO-1-Pro方法实现了15%的性能提升,相当于使用2.5倍的预训练数据。这些发现提供了新的视角,并为如何有效扩展机器人操作数据集提供了实用指导。

英文摘要

Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates), challenging the conventional intuition of "more diverse is better". Through extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer: models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing more desirable scaling properties during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity; the resulting GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.

2505.02540 2026-06-05 cs.LG cs.AI 版本更新

Lazy But Effective: Collaborative Personalized Federated Learning with Heterogeneous Data

懒惰但有效:基于异构数据的协同个性化联邦学习

Ljubomir Rokvic, Panayiotis Danassis, Boi Faltings

Affiliations * Artificial Intelligence Laboratory, EPFL; Telenor Research

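AI summary This paper proposes pFedLIA, a simple yet effective personalized federated learning framework that uses a computationally efficient influence approximation, "Lazy Influence", to cluster clients in a distributed manner before model aggregation; within each cluster, clients jointly train a model that captures client-specific data patterns. Experiments show that it recovers the global model's performance drop on non-IID datasets and outperforms existing baselines on several benchmark tasks.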

Comments Accepted at the International Joint Conference on Neural Networks (IJCNN), IEEE, 2025

详情
AI中文摘要

在联邦学习中,客户端数据分布的异质性往往意味着单一全局模型无法为个别客户端提供最佳性能。例如,训练键盘的下一个词预测模型时,由于用户特定的语言模式(如人口统计学特征、语言能力、书写风格等),客户端之间会产生高度非iid的数据集。其他例子包括使用不同机器拍摄的医学图像或不同车辆类型的驾驶数据。为了解决这一问题,我们提出了一种简单但有效的个性化联邦学习框架(pFedLIA),该框架利用一种计算效率高的影响近似方法,称为'Lazy Influence',在分布式 manner 中在模型聚合前对客户端进行聚类。在每个聚类中,数据所有者协同训练一个模型,以捕捉客户端特定的数据模式。我们的方法在各种合成和现实世界设置中成功恢复了由于非iid性导致的全局模型性能下降,特别是在北欧语言的下一个词预测任务以及多个基准任务中。它在性能上与假设的Oracle聚类匹配,并显著优于现有基线方法,例如在CIFAR100上提高了17%。

英文摘要

In Federated Learning, heterogeneity in client data distributions often means that a single global model does not have the best performance for individual clients. Consider for example training a next-word prediction model for keyboards: user-specific language patterns due to demographics (dialect, age, etc.), language proficiency, and writing style result in a highly non-IID dataset across clients. Other examples are medical images taken with different machines, or driving data from different vehicle types. To address this, we propose a simple yet effective personalized federated learning framework (pFedLIA) that utilizes a computationally efficient influence approximation, called "Lazy Influence", to cluster clients in a distributed manner before model aggregation. Within each cluster, data owners collaborate to jointly train a model that captures the specific data patterns of the clients. Our method has been shown to successfully recover the global model's performance drop due to the non-IID-ness in various synthetic and real-world settings, specifically a next-word prediction task on the Nordic languages as well as several benchmark tasks. It matches the performance of a hypothetical Oracle clustering, and significantly improves on existing baselines, e.g., an improvement of 17% on CIFAR100.

2411.18343 2026-06-05 cs.LG cs.AI 版本更新

Comprehensive and Reliable Feature Attribution for Diverse Modalities and Models via Frequency-Domain Insights

通过频域见解实现多样化模态和模型的全面可靠特征归因

Zechen Liu, Feiyang Zhang, Wei Song, Xiang Li, Wei Wei

Affiliations * School of Computational Science, Wuhan University; Brain Research Center, Wuhan University; College of Information Science and Technology (School of Cyber Science and Technology), Shihezi University; Xinjiang Production and Construction Corps Key Laboratory of Computing Intelligence and Network Information Security Open Fund

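AI summary This paper proposes FreqX, a new interpretability method that draws on signal processing and information theory to address challenges in personalized federated learning such as non-IID data, heterogeneous devices, lack of fairness, and unclear contribution. Its explanations contain both attribution and concept information, and it runs at least 10 times faster than baselines that provide concept information.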

Comments 16 pages, 9 figures

详情
AI中文摘要

个性化联邦学习(PFL)允许客户端在不披露其私有数据集的情况下协作训练个性化模型。然而,PFL面临非IID、异构设备、缺乏公平性和贡献不明确等挑战,亟需深度学习模型的可解释性来克服这些问题。这些挑战提出了新的可解释性需求,包括低成本、隐私性和详细信息。目前没有现有的可解释性方法能满足这些需求。在本文中,我们提出了一种新的可解释性方法FreqX,通过引入信号处理和信息理论。我们的实验表明,FreqX的解释结果包含属性信息和概念信息。FreqX的运行速度至少比包含概念信息的基线方法快10倍。

英文摘要

Personalized Federated Learning (PFL) allows clients to cooperatively train a personalized model without disclosing their private datasets. However, PFL suffers from non-IID data, heterogeneous devices, lack of fairness, and unclear contribution, and overcoming these challenges urgently requires interpretability of the deep learning model. These challenges place new demands on interpretability: low cost, privacy, and detailed information. No current interpretability method satisfies them. In this paper, we propose a novel interpretability method, FreqX, by introducing signal processing and information theory. Our experiments show that the explanation results of FreqX contain both attribution information and concept information. FreqX runs at least 10 times faster than the baselines which contain concept information.

2503.14295 2026-06-05 cs.CV cs.AI 版本更新

PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation

PC-Talk: 用于音频驱动说话面部生成的精确面部动画控制

Baiqin Wang, Xiangyu Zhu, Fan Shen, Hao Xu, Zhen Lei

Affiliations * MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Psyche AI.INC; HKUST; CAIR, HKISI, Chinese Academy of Sciences; SCSE, FIE, M.U.S.T

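AI summary To address insufficient control over facial animation in audio-driven talking face generation, this paper proposes PC-Talk, a framework that controls lip-audio alignment and emotion to improve the diversity and user-friendliness of generated talking videos.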

Comments 10 pages, 6 figures. Accepted at CVPR 2026

详情
AI中文摘要

近年来,音频驱动说话面部生成在唇同步方面取得了显著进展。然而,当前方法往往缺乏对面部动画(如说话风格和情绪表达)的充分控制,导致输出结果单一。本文聚焦于改进两个关键因素:唇音对齐和情感控制,以增强说话视频的多样性和易用性。唇音对齐控制关注说话风格和唇部运动幅度等元素,而情感控制则专注于生成逼真的情绪表达,允许对强度等多属性进行修改。为实现精确的面部动画控制,我们提出了一种新的框架PC-Talk,通过隐式关键点变形实现唇音对齐和情感控制。首先,我们的唇音对齐控制模块实现了对说话风格的精确编辑,并调整唇部运动幅度以模拟不同语音音量水平,保持与音频的同步。其次,我们的情感控制模块生成生动的情绪面部特征,通过纯粹的情绪变形实现。该模块还允许对强度进行精细修改,并在不同面部区域组合多种情绪。我们的方法在广泛的实验中展示了出色的控制能力,并在HDTF和MEAD数据集上取得了最先进的性能。

英文摘要

Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation such as speaking style and emotional expression, resulting in uniform outputs. In this paper, we focus on improving two key factors: lip-audio alignment and emotion control, to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control focuses on elements like speaking style and the scale of lip movements, whereas emotion control is centered on generating realistic emotional expressions, allowing for modifications in multiple attributes such as intensity. To achieve precise control of facial animation, we propose a novel framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our lip-audio alignment control module facilitates precise editing of speaking styles at the word level and adjusts lip movement scales to simulate varying vocal loudness levels, maintaining lip synchronization with the audio. Second, our emotion control module generates vivid emotional facial features with pure emotional deformation. This module also enables the fine modification of intensity and the combination of multiple emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on both HDTF and MEAD datasets in extensive experiments.

2503.11910 2026-06-05 cs.LG cs.AI math.AT math.SG 版本更新

RTD-Lite: Scalable Topological Analysis for Comparing Weighted Graphs in Learning Tasks

RTD-Lite:用于学习任务中比较加权图拓扑结构的可扩展分析

Eduard Tulchinskii, Daria Voronkova, Ilya Trofimov, Evgeny Burnaev, Serguei Barannikov

Affiliations * Skoltech, AI Foundation and Algorithm Lab; Skoltech, AIRI; Skoltech, CNRS

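AI summary This paper proposes RTD-Lite, an algorithm that uses minimal spanning trees in auxiliary graphs to compare the topological features of weighted graphs in O(n²) time and memory, making it suitable for tasks such as dimensionality reduction and neural network training. Experiments show that it identifies topological differences effectively while substantially reducing computation time compared with existing methods.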

Comments Accepted for AISTATS 2025

详情
AI中文摘要

用于比较加权图的拓扑方法在各种学习任务中具有价值,但通常在大规模数据集上计算效率低下。我们介绍了RTD-Lite,一种可扩展算法,能够高效比较两个具有顶点一一对应关系的加权图的拓扑特征,特别是任意尺度下的连通性或聚类结构。通过辅助图的最小生成树,RTD-Lite以O(n²)的时间和内存复杂度捕捉拓扑差异。这种效率使其适用于降维和神经网络训练等任务。在合成和现实数据集上的实验表明,RTD-Lite能够有效识别拓扑差异,同时显著减少计算时间,相较于现有方法。此外,将RTD-Lite作为损失函数组件整合到神经网络训练中,可以增强学习表示中的拓扑结构保持。我们的代码在https://github.com/ArGintum/RTD-Lite上公开可用。

英文摘要

Topological methods for comparing weighted graphs are valuable in various learning tasks but often suffer from computational inefficiency on large datasets. We introduce RTD-Lite, a scalable algorithm that efficiently compares topological features, specifically connectivity or cluster structures at arbitrary scales, of two weighted graphs with one-to-one correspondence between vertices. Using minimal spanning trees in auxiliary graphs, RTD-Lite captures topological discrepancies with $O(n^2)$ time and memory complexity. This efficiency enables its application in tasks like dimensionality reduction and neural network training. Experiments on synthetic and real-world datasets demonstrate that RTD-Lite effectively identifies topological differences while significantly reducing computation time compared to existing methods. Moreover, integrating RTD-Lite into neural network training as a loss function component enhances the preservation of topological structures in learned representations. Our code is publicly available at https://github.com/ArGintum/RTD-Lite
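The $O(n^2)$ figure quoted above matches the cost of building a minimal spanning tree on a dense weight matrix with Prim's algorithm. The sketch below shows only that building block. How RTD-Lite constructs its auxiliary graphs and turns spanning trees into a discrepancy score is not stated in the abstract, so the element-wise minimum used here as an "auxiliary graph" is a purely hypothetical illustration, as is the function name `mst_weight`.

```python
import numpy as np

def mst_weight(weights):
    """Total weight of a minimal spanning tree (Prim, dense matrix, O(n^2))."""
    w = np.asarray(weights, dtype=float)
    n = w.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    best = np.full(n, np.inf)   # cheapest known edge linking each vertex to the tree
    best[0] = 0.0               # start from vertex 0
    total = 0.0
    for _ in range(n):
        candidates = np.where(in_tree, np.inf, best)
        v = int(np.argmin(candidates))   # closest vertex not yet in the tree
        in_tree[v] = True
        total += best[v]
        best = np.minimum(best, w[v])    # relax edges leaving the new vertex
    return total

# Two weighted graphs on the same three vertices (one-to-one vertex correspondence).
graph_a = np.array([[0, 1, 4], [1, 0, 2], [4, 2, 0]])
graph_b = np.array([[0, 5, 1], [5, 0, 2], [1, 2, 0]])
auxiliary = np.minimum(graph_a, graph_b)   # hypothetical auxiliary graph

weight_a = mst_weight(graph_a)
weight_b = mst_weight(graph_b)
weight_aux = mst_weight(auxiliary)
```

Each of the n iterations does O(n) work on arrays of length n, which gives the quadratic time bound, and the weight matrix itself accounts for the quadratic memory.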

2502.20914 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?

Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard

发表机构 * Université Grenoble Alpes, CNRS, Grenoble INP, LIG(格勒诺布尔阿尔卑斯大学、国家科学研究中心、格勒诺布尔INP、实验室LIG)

AI总结 本文探讨了在机械可解释性(MI)框架下,给定行为是否具有唯一解释的问题,通过统计可识别性理论分析了MI解释的可识别性,并提出了两种主要策略及实验结果。

Details
Journal ref
The Thirteenth International Conference on Learning Representations (ICLR 2025)
Abstract

As AI systems are used in high-stakes applications, ensuring interpretability is crucial. Mechanistic Interpretability (MI) aims to reverse-engineer neural networks by extracting human-understandable algorithms to explain their behavior. This work examines a key question: for a given behavior, and under MI's criteria, does a unique explanation exist? Drawing on identifiability in statistics, where parameters are uniquely inferred under specific assumptions, we explore the identifiability of MI explanations. We identify two main MI strategies: (1) "where-then-what," which isolates a circuit replicating model behavior before interpreting it, and (2) "what-then-where," which starts with candidate algorithms and searches for neural activation subspaces implementing them, using causal alignment. We test both strategies on Boolean functions and small multi-layer perceptrons, fully enumerating candidate explanations. Our experiments reveal systematic non-identifiability: multiple circuits can replicate behavior, a circuit can have multiple interpretations, several algorithms can align with the network, and one algorithm can align with different subspaces. Is uniqueness necessary? A pragmatic approach may require only predictive and manipulability standards. If uniqueness is essential for understanding, stricter criteria may be needed. We also reference the inner interpretability framework, which validates explanations through multiple criteria. This work contributes to defining explanation standards in AI.
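
The non-identifiability finding can be made concrete with a toy enumeration. The sketch below is not the authors' experimental code. It assumes a tiny, hypothetical search space of two-layer circuits built from four gates and lists every circuit that reproduces XOR on all inputs.

```python
from itertools import product

# Hypothetical gate library for the toy search space.
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
}

INPUTS = list(product([0, 1], repeat=2))


def circuits_matching(target):
    """Return every circuit top(left(a, b), right(a, b)) that agrees
    with `target` on all four Boolean inputs."""
    matches = []
    for top, left, right in product(GATES, repeat=3):
        if all(
            GATES[top](GATES[left](a, b), GATES[right](a, b)) == target(a, b)
            for a, b in INPUTS
        ):
            matches.append((top, left, right))
    return matches


xor_circuits = circuits_matching(lambda a, b: a ^ b)
for circuit in xor_circuits:
    print(circuit)
```

Even in this small space, four distinct circuits compute XOR, for example `AND(OR, NAND)` and `NOR(AND, NOR)`, so the behavior alone does not single out one explanation.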

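2410.13056 2026-06-05 cs.CL cs.AI Version update

Channel-Wise Mixed-Precision Quantization for Large Language Models

Zihan Chen, Bike Xie, Jundong Li, Cong Shen

Affiliations * Department of Electrical and Computer Engineering, University of Virginia; Kneron Inc.

AI summary This paper proposes Channel-Wise Mixed-Precision Quantization (CMPQ), which assigns different precision levels to weight channels based on activation distributions. This supports arbitrary average bit-widths in the low-bit regime and improves performance with only a modest increase in memory usage.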

Details
Abstract

Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ supports arbitrary average bit-widths in the low-bit regime (e.g., between 2 and 4 bits). CMPQ employs a non-uniform quantization strategy and incorporates two outlier extraction techniques that collaboratively preserve the critical information, thereby minimizing the quantization loss. Experiments on nine different LLMs demonstrate that CMPQ not only enhances performance in integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage by performing in a mixed-precision way. CMPQ represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.
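
To see how channel-wise allocation yields a fractional average bit-width, consider the sketch below. It is an illustration under stated assumptions rather than the CMPQ procedure: the per-channel importance scores, the two-level choice of 2 and 4 bits, and the function name `allocate_bits` are all hypothetical.

```python
def allocate_bits(scores, target_avg, low=2, high=4):
    """Give `high` bits to the most important channels and `low` bits to
    the rest, so that the mean bit-width does not exceed `target_avg`."""
    n = len(scores)
    # Number of high-precision channels that the bit budget allows.
    n_high = int((target_avg - low) * n / (high - low))
    n_high = max(0, min(n, n_high))
    # Rank channels from most to least important (scores are assumed given).
    ranked = sorted(range(n), key=lambda c: scores[c], reverse=True)
    bits = [low] * n
    for c in ranked[:n_high]:
        bits[c] = high
    return bits


scores = [0.9, 0.1, 0.4, 0.05, 0.7, 0.2, 0.3, 0.6]
bits = allocate_bits(scores, target_avg=2.5)
avg = sum(bits) / len(bits)
print(bits, avg)  # [4, 2, 2, 2, 4, 2, 2, 2] 2.5
```

Here two of the eight channels receive 4 bits, which gives an average of 2.5 bits per weight, a width that no single integer-bit setting can reach.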

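2407.10486 2026-06-05 cs.AI cs.CL Version update

IDEAL: Leveraging Infinite and Dynamic Characterizations of Large Language Models for Query-focused Summarization

Jie Cao, Dian Jiao, Yang Dai, Rolan Yan, Wenqiao Zhang, Siliang Tang

Affiliations * Zhejiang University; Tencent, Wechat

AI summary This paper addresses query-focused summarization and studies two characteristics that LLM-based QFS models should have: efficient fine-grained query-LLM alignment and lengthy document summarization. It realizes them with the Query-aware HyperExpert and Query-focused Infini-attention modules, and experiments confirm the effectiveness and generalizability of the approach.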

Details
Abstract

Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization. Large language models (LLMs) have shown an impressive capability for textual understanding through large-scale pretraining, which implies great potential for extractive snippet generation. In this paper, we systematically investigate two indispensable characteristics that LLM-based QFS models should have: Efficiently Fine-grained Query-LLM Alignment and Lengthy Document Summarization. Correspondingly, we propose two modules, Query-aware HyperExpert and Query-focused Infini-attention, to provide these characteristics. These innovations pave the way for broader application and accessibility in the field of QFS technology. Extensive experiments conducted on existing QFS benchmarks indicate the effectiveness and generalizability of the proposed approach.

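2412.07583 2026-06-05 cs.CV cs.AI Version update

Mobile Video Diffusion

Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian

Affiliations * Qualcomm AI Research

AI summary This paper proposes MobileVD, a mobile-optimized video diffusion model. By reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two new pruning schemes, it substantially reduces memory and computational cost and enables efficient video generation on mobile devices.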

Details
Abstract

Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemes to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined as MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/

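2406.08966 2026-06-05 cs.LG cs.AI Version update

Separation Power of Equivariant Neural Networks

Marco Pacini, Xiaowen Dong, Bruno Lepri, Gabriele Santin

Affiliations * University of Trento; Fondazione Bruno Kessler; University of Oxford; University of Venice

AI summary This paper studies the separation power of equivariant neural networks and analyzes how architecture and hyperparameters affect it. It finds that all non-polynomial activation functions are equivalent in expressivity, that depth stops improving separation power beyond a threshold, and that the block decomposition of hidden representations affects separation power.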

Comments Published as a conference paper at ICLR 2025

Details
Journal ref
International Conference on Learning Representations (ICLR), 2025
Abstract

The separation power of a machine learning model refers to its ability to distinguish between different inputs and is often used as a proxy for its expressivity. Indeed, knowing the separation power of a family of models is a necessary condition to obtain fine-grained universality results. In this paper, we analyze the separation power of equivariant neural networks, such as convolutional and permutation-invariant networks. We first present a complete characterization of inputs indistinguishable by models derived from a given architecture. From these results, we derive how separability is influenced by hyperparameters and architectural choices, such as activation functions, depth, hidden layer width, and representation types. Notably, all non-polynomial activations, including ReLU and sigmoid, are equivalent in expressivity and reach maximum separation power. Depth improves separation power up to a threshold, after which further increases have no effect. Adding invariant features to hidden representations does not impact separation power. Finally, block decomposition of hidden representations affects separability, with minimal components forming a hierarchy in separation power that provides a straightforward method for comparing the separation power of models.

2205.11518 2026-06-05 cs.CR cs.AI cs.LG Version update

LIA: Privacy-Preserving Data Quality Evaluation in Federated Learning Using a Lazy Influence Approximation

Ljubomir Rokvic, Panayiotis Danassis, Sai Praneeth Karimireddy, Boi Faltings

Affiliations * École Polytechnique Fédérale de Lausanne (EPFL); Telenor Research; University of Southern California

AI summary This paper proposes LIA, a privacy-preserving data quality evaluation method that uses a lazy influence approximation to filter and score data, identifying low-quality, corrupted, or malicious data while preserving privacy.

Comments Proceedings of the 2024 IEEE International Conference on Big Data (IEEE BigData 2024). A preliminary version of this work received the Best Paper Award at the International Workshop on Trustworthy Federated Learning at IJCAI (FL-IJCAI) 2023

Details
Abstract

In Federated Learning, it is crucial to handle low-quality, corrupted, or malicious data. However, traditional data valuation methods are not suitable due to privacy concerns. To address this, we propose a simple yet effective approach that utilizes a new influence approximation called "lazy influence" to filter and score data while preserving privacy. To do this, each participant uses their own data to estimate the influence of another participant's batch and sends a differentially private obfuscated score to the central coordinator. Our method has been shown to successfully filter out biased and corrupted data in various simulated and real-world settings, achieving a recall rate of $>90\%$ (sometimes up to $100\%$) while maintaining strong differential privacy guarantees with $\varepsilon \leq 1$.
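
The step in which a participant sends a differentially private obfuscated score can be illustrated with randomized response on a binary vote. This is a generic ε-differentially private mechanism chosen for illustration. The abstract does not specify the mechanism, and the names `private_vote` and `estimate_positive_fraction` are hypothetical.

```python
import math
import random


def private_vote(true_vote, epsilon, rng):
    """Report the true binary vote with probability e^eps / (1 + e^eps),
    otherwise report the flipped vote (randomized response)."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_vote if rng.random() < p_truth else 1 - true_vote


def estimate_positive_fraction(reports, epsilon):
    """Debias the mean of the noisy votes to estimate the true fraction
    of positive votes, as a coordinator could do."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
true_votes = [1] * 16000 + [0] * 4000  # assumed: 80% of participants vote "useful"
reports = [private_vote(v, epsilon=1.0, rng=rng) for v in true_votes]
estimate = estimate_positive_fraction(reports, epsilon=1.0)
print(round(estimate, 2))
```

Each individual report is noisy, yet the aggregate estimate stays close to the true fraction of 0.8 once enough participants contribute.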

2403.00965 2026-06-05 stat.AP cs.AI cs.LG Version update

Binary Gaussian Copula Synthesis: an LLM-powered data augmentation framework for early dialysis prediction in chronic kidney disease

Hamed Khosravi, Milad Khanchi, Mobina Noori, Srinjoy Das, Abdullah Al-Mamun, Imtiaz Ahmed

Affiliations * Department of Industrial & Management Systems Engineering, West Virginia University; Department of Electrical and Computer Engineering, Concordia University; Department of Computer Science, University of California, Davis; School of Mathematical & Data Sciences, West Virginia University; School of Systems Science and Industrial Engineering, The State University of New York at Binghamton; H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

AI summary This paper proposes Binary Gaussian Copula Synthesis (BGCS), a two-stage data augmentation method tailored to binary clinical data. It generates synthetic minority-class samples and filters out implausible ones, improving early dialysis prediction.

Details
Abstract

Only a small fraction of patients with chronic kidney disease (CKD) progress to dialysis, creating severe class imbalance that limits the performance of machine learning models for early dialysis prediction. This challenge is compounded by the binary structure of electronic health record (EHR) data, for which most existing augmentation methods were not designed. We propose Binary Gaussian Copula Synthesis (BGCS), a two-stage data augmentation method tailored to binary clinical data. BGCS first generates synthetic minority-class samples using a Gaussian copula framework that explicitly models pairwise dependencies among binary features, then applies a fine-tuned GPT-2 classifier to filter out clinically implausible samples before training. We evaluated BGCS on a real-world EHR dataset of 15,169 patients with CKD from West Virginia collected between 2008 and 2022, benchmarking it against SMOTE, CTGAN, and standard Gaussian Copula across four machine learning classifiers over 25 independent runs. BGCS consistently outperformed all comparison methods, achieving the highest minority-class recall for 90-day dialysis prediction, with median values ranging from 0.78 to 0.87 across classifiers, and the strongest distributional fidelity to real data, with a mean p-value of 0.68 across features. The best-performing BGCS-augmented model was integrated into an interpretable decision tree-based clinical decision support system for dialysis risk stratification, with electrolyte imbalances, cardiovascular comorbidities, and renal monitoring indicators emerging as the most influential predictive features. These findings suggest that augmentation methods designed for the structural properties of binary EHR data can meaningfully improve early dialysis risk prediction and support the development of interpretable clinical decision-support tools for CKD care.
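
The first stage, sampling binary features through a Gaussian copula, can be sketched for two features as follows. This is a minimal illustration, not the BGCS implementation: the marginal rates, the latent correlation, and the function name `sample_binary_pair` are assumptions, and the GPT-2 filtering stage is omitted.

```python
import math
import random
from statistics import NormalDist


def sample_binary_pair(p1, p2, rho, n, seed=0):
    """Draw n pairs of binary features with marginal rates p1 and p2.
    Dependence comes from a latent bivariate normal with correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    t1 = std_normal.inv_cdf(p1)  # thresholds on the latent Gaussian scale
    t2 = std_normal.inv_cdf(p2)
    samples = []
    for _ in range(n):
        g1 = rng.gauss(0.0, 1.0)
        g2 = rng.gauss(0.0, 1.0)
        z1 = g1
        z2 = rho * g1 + math.sqrt(1.0 - rho * rho) * g2  # corr(z1, z2) = rho
        # A feature is "on" when its latent value falls below the threshold.
        samples.append((int(z1 < t1), int(z2 < t2)))
    return samples


pairs = sample_binary_pair(p1=0.3, p2=0.1, rho=0.6, n=20000)
rate1 = sum(a for a, _ in pairs) / len(pairs)
rate2 = sum(b for _, b in pairs) / len(pairs)
both = sum(a & b for a, b in pairs) / len(pairs)
print(round(rate1, 2), round(rate2, 2), both > 0.3 * 0.1)
```

The sampled marginals track the requested rates, while the positive latent correlation makes the two features co-occur more often than they would under independence.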

2308.12224 2026-06-05 q-bio.QM cs.AI Version update

Enhancing cardiovascular risk prediction through AI-enabled calcium-omics

Ammar Hoori, Sadeer Al-Kindi, Tao Hu, Yingnan Song, Hao Wu, Juhwan Lee, Nour Tashtish, Pingfu Fu, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson

Affiliations * Department of Biomedical Engineering, Case Western Reserve University; Harrington Heart and Vascular Institute, University Hospitals Cleveland Medical Center; School of Medicine, Case Western Reserve University; Department of Population and Quantitative Health Sciences, Case Western Reserve University; Department of Radiology, University Hospitals Cleveland Medical Center; Department of Radiology, Case Western Reserve University

AI summary This paper uses detailed calcification features (calcium-omics) together with AI methods to improve the prediction of major adverse cardiovascular events (MACE), demonstrating the value of calcium-omics for cardiovascular risk prediction.

Comments 12 pages, 8 figures, 2 tables, 4 pages supplemental, journal paper format (under review)

Details
Abstract

Background. Coronary artery calcium (CAC) is a powerful predictor of major adverse cardiovascular events (MACE). The traditional Agatston score simply sums the calcium, albeit in a non-linear way, leaving room for improved calcification assessments that will more fully capture the extent of disease. Objective. To determine if AI methods using detailed calcification features (i.e., calcium-omics) can improve MACE prediction. Methods. We investigated additional features of calcification including assessment of mass, volume, density, spatial distribution, territory, etc. We used a Cox model with elastic-net regularization on 2457 CT calcium score (CTCS) enriched for MACE events obtained from a large no-cost CLARIFY program (ClinicalTrials.gov Identifier: NCT04075162). We employed sampling techniques to enhance model training. We also investigated Cox models with selected features to identify explainable high-risk characteristics. Results. Our proposed calcium-omics model with modified synthetic down sampling and up sampling gave C-index (80.5%/71.6%) and two-year AUC (82.4%/74.8%) for (80:20, training/testing), respectively (sampling was applied to the training set only). Results compared favorably to Agatston which gave C-index (71.3%/70.3%) and AUC (71.8%/68.8%), respectively. Among calcium-omics features, numbers of calcifications, LAD mass, and diffusivity (a measure of spatial distribution) were important determinants of increased risk, with dense calcification (>1000HU) associated with lower risk. The calcium-omics model reclassified 63% of MACE patients to the high risk group in a held-out test. The categorical net-reclassification index was NRI=0.153. Conclusions. AI analysis of coronary calcification can lead to improved results as compared to Agatston scoring. Our findings suggest the utility of calcium-omics in improved prediction of risk.
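
For readers unfamiliar with the C-index reported above, the sketch below computes a concordance index on toy data. It assumes Harrell's pairwise form, which the abstract does not confirm as the estimator used, and the follow-up times, event flags, and risk scores are made up for illustration.

```python
def concordance_index(times, events, risks):
    """Harrell-style C-index: among comparable pairs, the fraction in
    which the subject with the earlier event has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an event
            # strictly before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    return concordant / comparable


times = [2, 4, 6, 8]           # follow-up time per subject (toy values)
events = [1, 1, 0, 1]          # 1 = event observed, 0 = censored
risks = [0.9, 0.4, 0.5, 0.1]   # model risk scores (toy values)
print(concordance_index(times, events, risks))  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the 71.6% and 70.3% test figures quoted in the abstract.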

2306.09712 2026-06-05 cs.LG cs.AI cs.CL Version update

Semi-Offline Reinforcement Learning for Optimized Text Generation

Changyu Chen, Xiting Wang, Yiqiao Jin, Victor Ye Dong, Li Dong, Jie Cao, Yi Liu, Rui Yan

Affiliations * Unknown

AI summary This paper proposes a semi-offline reinforcement learning approach that balances exploration capability and training cost, and presents the RL setting that is optimal in terms of optimization cost, asymptotic error, and overfitting error bound.

Comments In Proceedings of the 40th International Conference on Machine Learning (ICML 2023)

Details
Abstract

In reinforcement learning (RL), there are two major settings for interacting with the environment: online and offline. Online methods explore the environment at significant time cost, and offline methods efficiently obtain reward signals by sacrificing exploration capability. We propose semi-offline RL, a novel paradigm that smoothly transitions from offline to online settings, balances exploration capability and training cost, and provides a theoretical foundation for comparing different RL settings. Based on the semi-offline formulation, we present the RL setting that is optimal in terms of optimization cost, asymptotic error, and overfitting error bound. Extensive experiments show that our semi-offline approach is efficient and yields comparable or often better performance compared with state-of-the-art methods.

2305.12640 2026-06-05 cs.AI cs.LG stat.ML Version update

Limited Resource Allocation in a Non-Markovian World: The Case of Maternal and Child Healthcare

Panayiotis Danassis, Shresth Verma, Jackson A. Killian, Aparna Taneja, Milind Tambe

Affiliations * Harvard University; Google Research

AI summary This paper studies how to optimize resource allocation in non-Markovian settings using time-series methods, and proposes the Time-series Arm Ranking Index (TARI) policy to improve engagement and adherence in maternal and child healthcare programs.

Comments Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023)

Details
Abstract

The success of many healthcare programs depends on participants' adherence. We consider the problem of scheduling interventions in low resource settings (e.g., placing timely support calls from health workers) to increase adherence and/or engagement. Past works have successfully developed several classes of Restless Multi-armed Bandit (RMAB) based solutions for this problem. Nevertheless, all past RMAB approaches assume that the participants' behaviour follows the Markov property. We demonstrate significant deviations from the Markov assumption on real-world data on a maternal health awareness program from our partner NGO, ARMMAN. Moreover, we extend RMABs to continuous state spaces, a previously understudied area. To tackle the generalised non-Markovian RMAB setting we (i) model each participant's trajectory as a time-series, (ii) leverage the power of time-series forecasting models to learn complex patterns and dynamics to predict future states, and (iii) propose the Time-series Arm Ranking Index (TARI) policy, a novel algorithm that selects the RMAB arms that will benefit the most from an intervention, given our future state predictions. We evaluate our approach on both synthetic data, and a secondary analysis on real data from ARMMAN, and demonstrate significant increase in engagement compared to the SOTA, deployed Whittle index solution. This translates to 16.3 hours of additional content listened, 90.8% more engagement drops prevented, and reaching more than twice as many high dropout-risk beneficiaries.
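
The selection step in (iii) can be sketched as a ranking under a budget. The abstract does not define the TARI index, so the code below uses predicted gain from intervening as an assumed stand-in. The forecasts are taken as given inputs, and the function name `select_arms` is hypothetical.

```python
def select_arms(pred_without, pred_with, budget):
    """Rank arms by the predicted gain from intervening and return the
    indices of the `budget` arms with the largest gain."""
    # Gain = forecast engagement with a call minus forecast without one.
    gains = [w - wo for wo, w in zip(pred_without, pred_with)]
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    return ranked[:budget]


# Toy forecasts of next-step engagement for four participants.
pred_without = [0.9, 0.2, 0.5, 0.1]   # no intervention
pred_with = [0.95, 0.6, 0.55, 0.7]    # with an intervention
print(select_arms(pred_without, pred_with, budget=2))  # [3, 1]
```

The participants chosen are the ones whose forecast improves most with a call, not the ones with the lowest current engagement, which reflects the "benefit the most" criterion in the abstract.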