arXivDaily: arXiv daily academic digest, updated Monday to Friday

2606.06514 2026-06-08 cs.AI cs.LG New submission

Detecting and Mitigating Bias by Treating Fairness as a Symmetry Operation

Nishit Singh

Affiliations: Nishit Singh

AI summary: Formalizes bias as a symmetry-breaking operation and restores symmetry through loss regularization, achieving over 90% violation reduction on synthetic datasets at an accuracy cost of about 5%.

Comments: 8 pages, 7 figures

Abstract

Machine learning systems deployed in high-stakes socioeconomic settings routinely display bias. We formalize bias as a symmetry-breaking operation: a classifier is fair if its outputs remain invariant under the counterfactual operation of switching a sensitive attribute, with merit features held fixed. We implement loss-based regularization as a symmetry-restoring mechanism and evaluate the framework on four synthetic datasets with varying levels of noise, correlation, and bias. The framework achieves upwards of 90% violation reduction, with accuracy costs around 5%. This framework does not require causal graph knowledge, is computationally lightweight, and generalizes to any sensitive attribute definable as a bit-flip, making it suitable for contexts where local sources of discrimination remain absent from mainstream benchmarks.
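
The invariance criterion in this abstract can be made concrete with a small sketch. It is illustrative only and is not the paper's implementation: it assumes a logistic classifier whose last weight multiplies the binary sensitive attribute, measures the symmetry violation as the mean absolute change in output when that bit is flipped with merit features held fixed, and adds it to a cross-entropy loss with a hypothetical penalty weight `lam`.

```python
import math

def predict(weights, bias, merit, sensitive):
    # Hypothetical model layout: weights[:-1] act on the merit features,
    # weights[-1] acts on the binary sensitive attribute.
    z = bias + sum(w * x for w, x in zip(weights[:-1], merit))
    z += weights[-1] * sensitive
    return 1.0 / (1.0 + math.exp(-z))

def symmetry_violation(weights, bias, rows):
    # Mean |f(x, s) - f(x, 1 - s)|: zero means the output is invariant
    # under the bit-flip of the sensitive attribute.
    gaps = [
        abs(predict(weights, bias, merit, s) - predict(weights, bias, merit, 1 - s))
        for merit, s, _label in rows
    ]
    return sum(gaps) / len(gaps)

def regularized_loss(weights, bias, rows, lam):
    # Cross-entropy plus lam times the symmetry violation (assumed form).
    eps = 1e-12
    ce = 0.0
    for merit, s, label in rows:
        p = predict(weights, bias, merit, s)
        ce -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return ce / len(rows) + lam * symmetry_violation(weights, bias, rows)

# Toy rows: (merit features, sensitive bit, label).
rows = [([0.2, 1.0], 0, 0), ([0.9, 0.4], 1, 1), ([0.5, 0.5], 0, 1)]
biased = [1.0, -0.5, 2.0]  # nonzero weight on the sensitive bit
fair = [1.0, -0.5, 0.0]    # sensitive bit ignored
```

Minimizing `regularized_loss` with `lam > 0` pushes the weight on the sensitive bit toward zero; how the paper trades this off against accuracy is not specified in the abstract.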

2606.06518 2026-06-08 cs.AI cs.LG New submission

DiBS: Diffusion-Informed Branch Selection

Bo Liu, Yuan Xie, Yuan Gao, Xiaolong Luo, Peng Ye, Tao Chen, Fujun Han

Affiliations: Nanyang Technological University; University of Science and Technology of China; Tsinghua University

AI summary: To address the lack of correctness guarantees in learned Sudoku solvers and the long-tail search of symbolic solvers, proposes DiBS, a diffusion-guided branch selection method that keeps the symbolic solver complete while using a diffusion model to rank candidate values, substantially reducing search cost.

Comments: 12 pages, 6 figures, 3 tables

Abstract

Sudoku is a representative constraint satisfaction problem that requires global structural reasoning under strict discrete constraints. Existing work on solving Sudoku focuses mainly on two dominant approaches: traditional heuristics and deep learning solvers. However, they suffer from complementary limitations: learning-based solvers lack hard correctness guarantees, while complete symbolic solvers are still prone to long-tail search. To address these shortcomings, we propose a novel diffusion model-guided approach, termed DiBS, for the branch selection search process. Specifically, DiBS keeps the symbolic solver complete and uses the diffusion model as a branch-ordering guide. The core method is ranking candidate values under the current partial assignment and a lightweight consistency signal. Furthermore, we provide an in-depth theoretical proof to reveal how it works and why it works. Experiments on the challenging Royle 17-clue Sudoku benchmark show that DiBS substantially reduces search cost relative to strong heuristic baselines, especially in nodes, backtracks, and long-tail percentiles. These results also confirm that learned global guidance is effective on hard instances where branch-order mistakes are most expensive. All code is available at https://github.com/shanxierdan/DiBS.
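
The idea of keeping the symbolic solver complete while letting a model decide only the order in which values are tried can be sketched as follows. This is a rough illustration, not the DiBS code: the `rank` callback is a hypothetical hook where a learned ranker would sit, and the two trivial orderings below stand in for it. Because `rank` only permutes the candidate list, the backtracking search remains exhaustive whatever order it returns.

```python
def candidates(grid, r, c):
    # Values not yet used in the row, column, or 3x3 box of cell (r, c).
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [v for v in range(1, 10) if v not in used]

def solve(grid, rank, stats):
    # Pick the empty cell with the fewest candidates (a common heuristic).
    best = None
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                cand = candidates(grid, r, c)
                if best is None or len(cand) < len(best[2]):
                    best = (r, c, cand)
    if best is None:
        return True  # no empty cells left
    r, c, cand = best
    for v in rank(grid, r, c, cand):  # only the order is delegated
        stats["nodes"] += 1
        grid[r][c] = v
        if solve(grid, rank, stats):
            return True
        grid[r][c] = 0  # backtrack
    return False

def is_valid_solution(grid):
    full = set(range(1, 10))
    rows = all(set(row) == full for row in grid)
    cols = all({grid[r][c] for r in range(9)} == full for c in range(9))
    boxes = all(
        {grid[br + i][bc + j] for i in range(3) for j in range(3)} == full
        for br in (0, 3, 6) for bc in (0, 3, 6)
    )
    return rows and cols and boxes

PUZZLE = ["530070000", "600195000", "098000060", "800060003", "400803001",
          "700020006", "060000280", "000419005", "000080079"]

def run(rank):
    grid = [[int(ch) for ch in row] for row in PUZZLE]
    stats = {"nodes": 0}
    solved = solve(grid, rank, stats)
    return solved, grid, stats["nodes"]

def ascending(grid, r, c, cand):
    return cand

def descending(grid, r, c, cand):
    return cand[::-1]
```

Comparing the node counts returned by `run(ascending)` and `run(descending)` shows how branch order alone changes search cost while the solution stays the same.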

2606.06519 2026-06-08 cs.AI cs.LG New submission

SafeGene: Reusable Adapters for Transferable Safety Alignment

Yanghan Wang, Zhiqiang Kou, Fu Feng, Jing Wang, Xin Geng

Affiliations: Southeast University; The Hong Kong Polytechnic University

AI summary: Proposes SafeGene, a reusable safety-adapter module that extracts a safety representation from aligned-degraded model discrepancies and uses data-aware layer selection and few-shot coefficient recalibration to recover safety across tasks, lowering harmful response rates while maintaining downstream performance.

Abstract

Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions. We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned-degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in the safety-utility trade-off.
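
One way to picture a safety vector built from aligned-degraded discrepancies is a per-layer weight difference that is added back to a task-adapted model on selected layers. The sketch below is a simplified, task-arithmetic-style illustration under that assumption and is not SafeGene itself: the layer names are hypothetical, and the selected layers and coefficients are set by hand here, whereas the paper describes data-aware layer selection and few-shot recalibration.

```python
def safety_vector(aligned, degraded):
    # Per-layer difference between aligned and safety-degraded weights.
    return {
        name: [a - d for a, d in zip(aligned[name], degraded[name])]
        for name in aligned
    }

def apply_safety_vector(task_model, vector, coefficients):
    # Add the vector only on the layers listed in `coefficients`,
    # each scaled by its own layer-wise coefficient.
    patched = {name: list(weights) for name, weights in task_model.items()}
    for name, alpha in coefficients.items():
        patched[name] = [
            w + alpha * v for w, v in zip(task_model[name], vector[name])
        ]
    return patched

# Toy weights with hypothetical layer names.
aligned = {"layer0": [1.0, 2.0], "layer1": [0.5, -0.5]}
degraded = {"layer0": [0.0, 2.0], "layer1": [0.5, 0.5]}
task_model = {"layer0": [0.25, 2.0], "layer1": [0.5, 0.75]}

vector = safety_vector(aligned, degraded)
patched = apply_safety_vector(task_model, vector, {"layer1": 0.5})
```

Only `layer1` is modified in this example, so the task-specific update stored in `layer0` is left untouched.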

2606.06526 2026-06-08 cs.AI cs.LG New submission

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

Sherin Muckatira, Jesse Geneson, Slava Gerovitch, Pavel Etingof, Mikhail Gronas, Anna Rumshisky

Affiliations: University of Massachusetts Lowell; San Jose State University; Massachusetts Institute of Technology; Dartmouth College; Amazon AGI

AI summary: Introduces the CrowdMath dataset of 164 expert-annotated progress chains for evaluating large language models on collaborative open-problem solving, and finds that models do well on local prediction but fall short on role classification.

Comments: 16 pages, 4 figures

Abstract

Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES–Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions, with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.
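
Since the headline role-classification result is reported as macro-F1, a short sketch of that metric may help: it is the unweighted mean of per-class F1 scores, so rare roles count as much as frequent ones. The label strings below are hypothetical stand-ins for the paper's role names, and this is a generic implementation rather than the authors' evaluation code.

```python
def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1 over every label seen in either list.
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["partial_progress", "partial_progress",
          "error_identification", "error_identification"]
y_pred = ["partial_progress", "error_identification",
          "error_identification", "error_identification"]
score = macro_f1(y_true, y_pred)  # (2/3 + 4/5) / 2
```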

2606.06529 2026-06-08 cs.AI cs.LG New submission

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

Catherine Ge-Wang, Tyler Crosse, Benjamin Hadad, Joachim Schaeffer, Ram Potham, Tyler Tracy

Affiliations: GitHub; arXiv

AI summary: Studies how an attacker's strategic choice of when to attack affects AI control safety; decomposing attack decisions into start and stop policies, experiments show that both substantially lower safety, so existing evaluations may overestimate it.

Abstract

An attacker that strategically chooses when to attack is much harder to catch than one that attacks indiscriminately. AI control is a safety framework for deploying capable but untrusted AI agents under the oversight of a weaker, trusted monitor and a limited human audit budget. Control evaluations stress-test these protocols by pitting a red-team attack policy against the blue-team monitor, but current evaluations typically assume attackers that do not strategically select when to attack. We study this capability, attack selection, in agentic settings by decomposing attack decisions into a start policy, which decides when an attacker should attack, and a stop policy, which decides when an attacker should abort an ongoing attack. Across two agentic settings, BashArena and LinuxArena, both policies substantially lower measured empirical safety without changing the underlying attack capability. At a 1% audit budget, our start policy reduces safety by 20pp on both BashArena and LinuxArena, and our stop policy reduces safety by 20pp on BashArena and 28pp on LinuxArena. These reductions should be interpreted as upper bounds on the effect of attack selection. Existing control evaluations may therefore yield overly optimistic safety estimates against selective attackers. We recommend that future evaluations, system cards, and safety cases elicit attack selection to produce more realistic safety estimates.

2606.06531 2026-06-08 cs.AI quant-ph New submission

CARVE-Q: Quantum-Proposed, Classically Certified Interactive Driving Repair

Yifan Wang

AI summary: For vetoed driving maneuvers, proposes the CARVE-Q architecture, which accelerates search over the repair lattice with quantum minimum finding while keeping safety certification classical, enabling auditable interactive repair.

Comments: 9 pages, 3 figures

Abstract

The critical question after a correct driving veto is not only whether a maneuver is unsafe, but whether the blocked interaction admits a lawful, auditable, and responsibility-bounded repair. Prediction and game-theoretic planners can suggest plausible cooperation, yet they do not return a proof that the repair respects hard rules, right-of-way, cost allocation, and ego fallback. We introduce CARVE, Certified Affordable Repair of Vetoed maneuvers via Envelopes, a certificate architecture for prediction-free interactive repair. Given a vetoed maneuver, CARVE constructs a finite repair lattice and emits a structured certificate recording the binding rule, selected joint repair, right-of-way-scaled cooperation envelope, responsibility-weighted cost split, and ego-only fallback. This certificate view reveals the algorithmic bottleneck: multi-owner repair induces a product lattice $M = \prod_j |\mathcal{A}_j|$. We therefore introduce CARVE-Q, a verifier-shielded quantum-AI search layer that applies quantum minimum finding only to this black-box lattice while leaving all safety authority classical. In the conservative verifier-oracle model, exact classical minimum finding requires $\Theta(M)$ queries in the worst case, whereas Dürr-Høyer/Grover minimum finding uses $O(\sqrt{M})$ oracle queries with high probability. We prove verifier-shielded certificate soundness, priority non-elicitation, black-box query separation, and finite-precision reversible-oracle constructibility. We then demonstrate state-vector minimum finding on CARVE repair oracles up to 65,536 assignments and validate certificate preservation on Lanelet2-grounded INTERACTION replay with 100% right-of-way respect, 100% blame consistency, and zero priority false positives. The result is a trust-bounded quantum-AI pattern for certified autonomy: quantum proposes; CARVE certifies.
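
To give a feel for the query separation quoted in this abstract, the sketch below computes the product-lattice size $M$ and compares the classical worst-case order $M$ with the quantum order $\sqrt{M}$, ignoring constant factors and success probabilities. The split of 65,536 assignments into four owners with 16 actions each is a hypothetical example; the abstract gives only the total.

```python
import math

def product_lattice_size(action_counts):
    # M = product over owners j of |A_j|.
    return math.prod(action_counts)

def query_scaling(action_counts):
    # Orders of growth only; constants hidden by Theta(.) and O(.) are ignored.
    m = product_lattice_size(action_counts)
    return {"M": m, "classical_order": m, "quantum_order": math.sqrt(m)}

# Hypothetical split: four owners with 16 candidate actions each.
scaling = query_scaling([16, 16, 16, 16])
```

For this example the lattice has 65,536 joint assignments, while the square-root scaling corresponds to an order of 256 oracle queries.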

2606.06533 2026-06-08 cs.AI cs.CL New submission

Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics

Stella Biderman, Mohammad Aflah Khan, Niloofar Mireshghallah, Catherine Arnett, Fazl Barez, Naomi Saphra

Affiliations: Carnegie Mellon University; Boston University; Harvard University; University of Oxford, Martian; Max Planck Institute for Software Systems

AI summary: Argues that a science of AI should go beyond post-hoc analysis and study training dynamics in order to predict, intervene on, and design model behavior, and surveys current progress and open problems in areas such as interpretability and fairness.

Comments: Accepted as an oral at ICML: https://icml.cc/virtual/2026/poster/67142

Abstract

What would it mean to have a scientific understanding of AI? Models are not static objects: they are snapshots of time-evolving processes shaped by data, objectives, architectures, and optimization dynamics. Yet much of AI research treats models as fixed artifacts, analyzing behaviors after training rather than asking why they emerge. This position paper argues that a science of AI must move beyond post-hoc fixes and study the training dynamics that produce model behavior. Such a science should support progressively stronger forms of understanding: predicting outcomes from early training signals, intervening when trajectories go wrong, and ultimately designing training procedures that more reliably produce desired properties. Scaling laws have made prediction routine for loss; the challenge is extending this success to capabilities, biases, robustness, and safety-relevant behaviors. We articulate requirements for such theories grounded in the history and philosophy of science, examine progress in mechanistic interpretability, fairness, memorization, and simplicity bias, and identify concrete open problems.

2606.06641 2026-06-08 cs.AI cs.LO New submission

Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver

Cody J Christopher, Charles Gretton

Affiliations: School of Computing, Australian National University

AI summary: Presents AFSAT, a GPU-accelerated pseudo-Boolean SAT solver based on continuous local search, which uses the JAX compiler for massive parallelism and substantially improves numerical stability, runtime, and memory efficiency.

Abstract

We present Accelerated Fourier SAT (AFSAT), a GPU-accelerated solver for pseudo-Boolean satisfiability based on continuous local search (CLS). AFSAT realises the proof-of-concept approach, FastFourierSAT, into a fully-engineered solver supporting any heterogeneous mixture of symmetric constraint types and lengths within a single problem instance. Using the JAX compiler, AFSAT leverages pure function composition, automatic vectorisation, automatic differentiation, and just-in-time (JIT) compilation to perform massively parallel CLS across batches of candidate assignments. We demonstrate substantially improved numerical stability, runtime performance, and memory efficiency over the proof-of-concept. We achieve this by way of identifying and addressing various limitations that arise from memory latency and floating-point representation, as well as leveraging automatic parallelisation and compact representations. The inherent representational and stability limitations of floating point are partially addressed by a tailored discrete Fourier transform implementation. We achieve near-linear throughput when scaling to multiple accelerators via JAX array sharding.

2606.06660 2026-06-08 cs.AI cs.PF cs.RO New submission

AEGIS: A Backup Reflex for Physical AI

Josef Chen

Affiliations: KAIKAKU

AI summary: Proposes AEGIS, which uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps and switches to a strong policy only when necessary, recovering 10.1% of the trajectories lost by the weak policy on LIBERO-Spatial.

Abstract

Long-horizon robot manipulation tends to fail gradually: one bad step degrades the state, and the policy spirals into a basin from which it cannot recover. The failure is often visible before it happens. We introduce AEGIS (Activation-probe Early-warning, Gated Inference Switching), a selective escalation method that uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps while there is still time to act. When the probe flags a step, control switches to a stronger separate policy, but only for the steps that need it. On LIBERO-Spatial, AEGIS recovers 10.1% of the trajectories the weak policy alone loses, versus 4.6% for budget-matched blind escalation and 5.1% for a random-trigger placebo. These gains are significant under one-sided exact paired McNemar tests with Holm-Bonferroni adjustment over three pre-registered contrasts: +5.4pp over blind escalation, p=8.5e-6; +5.0pp over random triggering, p=1.0e-4; paired-trajectory bootstrap CIs exclude zero. AEGIS activates the stronger policy on only 38% of steps, so the lever is timing rather than compute. The probe clears its precondition with an early-window AUROC of 0.764, 95% CI [0.70, 0.84], read from the weak-policy path over the first 30% of trajectory steps before any handoff. We pre-register the full analysis plan, including a conditional recovered-task-rate estimand and explicit kill criteria, and confirm the result on 700 common-random-number episodes per arm, with nA-fail=646.
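
The gating logic described here, a probe reading the weak policy's activations and handing control to a stronger policy only on flagged steps, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the logistic probe, the toy policies, their activations, and the threshold are hypothetical and are not taken from AEGIS.

```python
import math

def probe_risk(activations, weights, bias):
    # Hypothetical linear probe with a sigmoid on frozen activations.
    z = bias + sum(w * a for w, a in zip(weights, activations))
    return 1.0 / (1.0 + math.exp(-z))

def gated_step(obs, weak_policy, strong_policy, probe, threshold):
    # The weak policy always runs; the strong policy runs only when flagged.
    action, activations = weak_policy(obs)
    weights, bias = probe
    if probe_risk(activations, weights, bias) >= threshold:
        return strong_policy(obs), True
    return action, False

# Toy stand-ins: the weak policy also exposes its internal activations.
def weak_policy(obs):
    return "weak_action", [obs, 1.0 - obs]

def strong_policy(obs):
    return "strong_action"

probe = ([4.0, -4.0], 0.0)
risky = gated_step(0.9, weak_policy, strong_policy, probe, threshold=0.5)
safe = gated_step(0.1, weak_policy, strong_policy, probe, threshold=0.5)
```

Counting how often the second element of the result is `True` over a trajectory gives the fraction of steps on which the stronger policy is invoked.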

2606.06741 2026-06-08 cs.AI cs.CL cs.LG New submission

OpenSkill: Open-World Self-Evolution for LLM Agents

Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun

Affiliations: Lehigh University; University of Illinois Chicago; University of British Columbia; Vector Institute; Salesforce AI Research; Massachusetts General Hospital and Harvard Medical School

AI summary: Proposes the OpenSkill framework, which lets agents bootstrap skills and verification signals from open-world resources without target-task supervision to achieve self-evolution, attaining the best automated pass rate on multiple benchmarks.

Comments: 20 pages, 4 figures and 8 tables. Code is available at https://github.com/OpenLAIR/OpenSkill

Abstract

Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. In this work, we study open-world self-evolution, where an agent must build both its skills and its own verification signals from scratch, using open-world resources but no target-task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Analysis shows its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.

2606.06787 2026-06-08 cs.AI New submission

AdMem: Advanced Memory for Task-solving Agents

Runzhe Wang, Huilin Lu, Shengjie Liu, Li Dong, Jason Zhu

Affiliations: Princeton University; Amazon; Arm

AI summary: Proposes a unified, automatic memory framework that integrates semantic, episodic, and procedural memory, using a bi-level design and a multi-agent architecture for automatic memory generation, reward annotation, and adaptive retrieval, improving robustness and success on long-horizon multi-turn tasks.

Abstract

Large Language Models (LLMs) show promise as tool-using agents but remain limited in long-horizon tasks that require remembering, organizing, and reusing knowledge. Prior memory approaches aim to address this limitation but focus mainly on storing factual information. Recent work on procedural memory improves task reuse, yet often reduces to replaying past successes without addressing failure cases or online scalability. We introduce a unified and automatic memory framework that integrates semantic, episodic, and procedural memory in a bi-level design combining short-term and long-term stores. A multi-agent architecture with actor, memory, and critic agents enables automatic memory generation, reward annotation, and adaptive retrieval. Long-term memory is managed through reward-based evaluation, merging, and pruning, ensuring scalability and continual improvement. Experiments across various environments show that our approach improves robustness and success on long multi-turn tasks compared to existing baselines. This work highlights the importance of comprehensive, adaptive memory for advancing LLM-based agents.

2606.06869 2026-06-08 cs.AI New submission

Evidence-Based Intelligent Diagnostic and Therapeutic Visualization System with Large Language Models: Multi-Turn Interaction and Multimodal Treatment Plan Generation

Yunhan Wang, Yuda Wang, Zhiying Tu, Mingqiang Song, Li Song, Kun Li, Dianhui Chu, Bolin Zhang

Affiliations: Harbin Institute of Technology, Weihai; Harbin Institute of Technology (Weihai) Qingdao Research Institute; Shandong Key Laboratory of Digital Service Computing Technology and Systems; Weihai Municipal Hospital; Shanghai Taizhu Technology Co., Ltd; Tianjin Zhifu Qihuang Medical Technology Co., Ltd

AI summary: Proposes a knowledge-enhanced visual diagnostic system that uses knowledge graph constraints, information gain-driven questioning, and multimodal treatment presentation to improve the transparency of traditional Chinese medicine syndrome differentiation and the interpretability of treatment.

Comments: 29 pages, 9 figures, 5 tables, including supporting information

Abstract

Aim: Existing AI-assisted traditional Chinese medicine diagnostic tools suffer from opaque reasoning processes, passive interaction, and limited treatment plan presentation. This study proposes a knowledge-enhanced visual diagnostic system to improve the transparency and interpretability of syndrome differentiation and treatment. Methods: The system is built upon a Neo4j knowledge graph comprising 241 syndromes, 1,263 symptoms, and 2,485 relations. It incorporates a four-stage symptom matching pipeline (exact, semantic, fuzzy, and large language model verification), an information gain-driven proactive questioning strategy optimized with genetic algorithms, and a multimodal treatment presentation integrating artificial intelligence-generated illustrations, three-dimensional meridian-acupoint models, and evidence-based literature. Results: Knowledge graph constraints reduced non-standard outputs by 32%. Case studies validated the effectiveness of the interactive workflow across patient self-assessment, clinician-assisted diagnosis, and traditional Chinese medicine education. Automated paired-comparison evaluation across 30 cases further demonstrated significant improvements in diagnostic trust (Cohen's d = 1.82, p < 0.001), reduced cognitive load (improvements in four of five dimensions), and higher credibility of evidence-based references (4.21 vs. 2.95). Conclusions: The proposed system enhances the transparency of traditional Chinese medicine diagnostic reasoning and the interpretability of treatment plans through knowledge graph-driven visualization and multimodal interaction, offering a practical solution for trustworthy artificial intelligence-assisted traditional Chinese medicine applications.
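
An information gain-driven questioning strategy can be illustrated with a minimal sketch: given a probability distribution over candidate syndromes, ask about the symptom whose yes/no answer is expected to reduce entropy the most. The sketch assumes deterministic symptom-syndrome links, as a knowledge graph edge would give, and uses made-up syndrome and symptom names; the genetic-algorithm optimization mentioned in the abstract is not shown.

```python
import math

def entropy(probs):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(prior, linked):
    # prior: syndrome -> probability; linked: syndromes that show the symptom.
    yes = [p for s, p in prior.items() if s in linked]
    no = [p for s, p in prior.items() if s not in linked]
    expected = 0.0
    for group in (yes, no):
        mass = sum(group)
        if mass > 0:
            # Posterior entropy for this answer, weighted by its probability.
            expected += mass * entropy([p / mass for p in group])
    return entropy(prior.values()) - expected

def next_question(prior, symptom_links):
    # Ask about the symptom with the largest expected entropy reduction.
    return max(symptom_links,
               key=lambda s: information_gain(prior, symptom_links[s]))

# Hypothetical syndromes and symptom links.
prior = {"syndrome_a": 0.25, "syndrome_b": 0.25,
         "syndrome_c": 0.25, "syndrome_d": 0.25}
symptom_links = {
    "symptom_x": {"syndrome_a", "syndrome_b"},
    "symptom_y": {"syndrome_a"},
    "symptom_z": {"syndrome_a", "syndrome_b", "syndrome_c", "syndrome_d"},
}
best = next_question(prior, symptom_links)
```

In this toy case `symptom_x` splits the candidates evenly and yields one full bit, whereas `symptom_z` is shared by every syndrome and yields nothing.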

2606.06893 2026-06-08 cs.AI New submission

Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition

Yuyang Zhang, Xinyuan Han, Xudong Jiang, Run Wang

Affiliations: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University; Nanchang University

AI summary: Proposes the RWSA intermediate representation and the W2S framework for automatically constructing Skills from heterogeneous interaction evidence; by decomposing Skills into workflow structure, execution semantics, and runtime attachments, it improves behavioral replay consistency by 10.5%.

Comments: 10 pages, 2 figures

Abstract

Large language model agents increasingly rely on Skills to encode procedural knowledge, yet high-quality Skills remain costly to hand-write. This paper studies automatic Skill construction from heterogeneous interaction evidence, including demonstrations, agent trajectories, tool traces, and execution logs. We argue that trace-to-skill construction is not a simple summarization task, because traces are fragmented, redundant, and may miss rare but safety-critical behaviors. To address this, we introduce RWSA, a workflow-oriented intermediate representation that decomposes Skills into Workflow structure, execution Semantics, and runtime Attachments, capturing task decomposition, control flow, verification, safety, rollback, and state management. Building on RWSA, we propose W2S, a framework that segments traces, induces local Skill drafts, aligns shared structures, reconciles branches, and compresses redundancy while preserving evidence and confidence annotations. Experiments on 70 Skills show that W2S improves behavioral replay consistency by 10.5% over summarization- and prompting-based baselines, highlighting the need to treat traces as executable runtime specifications rather than compressible text.

2606.06923 2026-06-08 cs.AI cs.SE New submission

Declarative Skills for AI Agents in Knowledge-Grounded Tool-Use Workflows

M. Danish Lim, I. Danial Bin Sharudin, Wen Han Chen, Cedric Lim, Laura Wynter

Affiliations: School of Computing and Information Systems; Singapore Management University

AI summary: Shows that declarative agents, whose control flow is steered by natural-language skill files, outperform an imperative state machine and an unscaffolded baseline in knowledge-intensive customer-service workflows, but that retrieval quality is the main bottleneck.

Abstract

We study orchestration mechanisms for tool-using AI agents in realistic customer-service workflows over an unstructured knowledge base. We argue that declarative agents (AI agents equipped with natural-language skill files appended to the system prompt) are an effective orchestration paradigm. Concretely, we compare (i) a DeclarativeAgent that reads three domain-specific skill files at inference time and decides its own control flow, (ii) an ImperativeAgent based on a programmatic state machine with explicit phases, and (iii) an unscaffolded baseline agent modeled after the $\tau$-Knowledge benchmark agent. Our ImperativeAgent is motivated by externalised-control inference as in Recursive Language Models and graph-based orchestration frameworks. We formalise the three agents as policy classes within a decentralised partially-observable Markov decision process and analyse their information-theoretic and structural properties; we then test the predicted differences empirically on five language models and two retrieval regimes. Our results show that retrieval quality is a dominant bottleneck for AI agents: when evidence is incomplete or skewed, all agents degrade substantially, and skill files cannot recover lost performance. Under high-quality retrieval, however, declarative skills consistently improve accuracy on procedural tasks and reduce orchestration errors, while the imperative state machine's brittleness does not reliably improve task success or compliance.

2606.06941 2026-06-08 cs.AI New submission

Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces

Laura Wynter, Nirvik Sahoo, Paul Griffin

Affiliations: School of Computing and Information Systems; Singapore Management University

AI summary: Proposes EP-HUBO, which casts the selection of CoT reasoning fragments as a combinatorial optimization problem and aggregates evidence via higher-order binary optimization, raising the weight of minority-but-correct hypotheses on evidence-intensive legal reasoning benchmarks.

Abstract

Large language models (LLMs) now solve a wide range of expert-level exams at or above human level, yet remain brittle on specialised, evidence-intensive domains such as law. On these tasks, errors arise not only from gaps in world knowledge but also from subtle distinctions between pieces of evidence and inconsistent use of supporting evidence. The most common aggregator over sampled chain-of-thought (CoT) traces, majority vote, returns the most popular answer regardless of whether its evidence is actually strongest. We propose to treat the selection of CoT reasoning fragments into a set of evidence as an explicit combinatorial optimisation problem, allowing well-supported but minority hypotheses to override noisy majorities, and to evaluate the approach on legal-reasoning benchmarks that are particularly sensitive to evidence quality. We introduce EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which generates multiple CoT traces with a small local model, parses fragments into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation per pool with quality-derived weights (relevance, specificity, distinctiveness), and delegates a single adjudication call per question to a frontier model. We evaluate EP-HUBO on two evidence-intensive legal benchmarks using both simulated annealing on classical hardware and the Dirac-3 photonic entropy-quantum machine from Quantum Computing Inc. HUBO-style optimisation gives a principled way to aggregate reasoning fragments while preserving minority-but-correct hypotheses, and is most valuable in low-contamination domains where frontier models have not already absorbed the benchmark material.
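
A higher-order unconstrained binary optimisation over an evidence pool can be sketched as a polynomial energy over binary selection variables, one per fragment. The example below is a toy built on assumed numbers: linear terms reward fragment quality, while the pairwise and cubic terms penalise redundant combinations. None of the coefficients come from the paper, and the solver is exhaustive enumeration rather than the simulated annealing or Dirac-3 hardware named in the abstract.

```python
from itertools import product
from math import prod

def energy(x, terms):
    # terms maps a tuple of fragment indices to a coefficient; a term is
    # active only when every fragment in the tuple is selected (x[i] == 1).
    return sum(coef * prod(x[i] for i in idx) for idx, coef in terms.items())

def best_selection(n, terms):
    # Exhaustive search over all 2**n selections; only viable for small pools.
    return min(product((0, 1), repeat=n), key=lambda x: energy(x, terms))

# Hypothetical pool of three fragments.
terms = {
    (0,): -0.9, (1,): -0.8, (2,): -0.3,  # quality rewards (negative energy)
    (0, 1): 1.0,                         # fragments 0 and 1 are redundant
    (0, 1, 2): 0.5,                      # third-order overlap penalty
}
selection = best_selection(3, terms)
```

Here the minimum-energy selection keeps fragments 0 and 2 and drops fragment 1, because the redundancy penalty outweighs the quality gained by selecting both 0 and 1.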

2606.06972 2026-06-08 cs.AI 新提交

Accounting for Context: Shaping Moral Credences for Value Alignment

考虑情境:塑造道德信念以实现价值对齐

Jazon Szabo, Sanjay Modgil

发表机构 * University of Oxford(牛津大学)

AI总结 本文针对价值对齐中道德多元性问题,提出在聚合道德评估时必须考虑情境因素,并形式化道德不确定性下的决策,揭示弱帕累托原则的违反是辛普森悖论的一种变体。

AI中文摘要

确保智能体行为与人类道德价值观对齐不可避免地引发一个问题:如何解释社会乃至个体通常采纳的多元道德视角。关于道德不确定性的工作提出了在不同道德理论之间公平且民主地聚合行动评估的机制。然而,本文认为在聚合道德评估时需要考虑情境因素。例如,后果主义视角假设能够准确确定智能体的行动如何改变世界;这一假设在现实世界中往往不成立。因此,我们在考虑这些情境因素的同时,形式化了道德不确定性下的智能体决策。我们由此表明,一个看似常识性的属性——弱帕累托原则——被违反了。我们认为,这个看似的问题实际上是辛普森悖论的一种变体,因此揭示了忽视情境因素影响的聚合机制的局限性。

英文摘要

Ensuring that agent behaviours are aligned with human moral values inevitably raises the problem of how to account for the plurality of moral perspectives that societies -- and even individuals -- typically adopt. Work on moral uncertainty proposes mechanisms to fairly and democratically aggregate evaluations of actions across different moral theories. However, this paper argues that one needs to account for contextual factors when aggregating moral evaluations. For example, consequentialist perspectives assume an ability to accurately determine how an agent's actions change the world, an assumption that often does not hold in real-world settings. We therefore formalise agent decision making under moral uncertainty while also accounting for these kinds of contextual factors. We thereby show that a seemingly commonsensical property -- the weak Pareto principle -- is violated. We argue that this apparent problem is, in fact, a variation of Simpson's paradox, and hence reveals the limitations of aggregation mechanisms that ignore the impact of contextual factors.

2606.06976 2026-06-08 cs.AI 新提交

Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

通过不确定性对齐强化学习探索智能体工具调用决策

Yijin Zhou, Linqian Zeng, Xiaoya Lu, Wenyuan Xie, Dongrui Liu, Junchi Yan, Jing Shao

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Innovation Institute(上海创新研究院)

AI总结 针对智能体工具调用中错误累积问题,提出TRUST方法,将不确定性量化作为排斥力融入奖励设计,并标注轻量关键轮次用于多轮轨迹统一后训练,显著提升决策质量与智能体性能。

AI中文摘要

基于大语言模型的智能体经常做出次优的工具使用决策,包括不支持的工具调用和幻觉式的直接响应,这可能在多步交互中累积错误。现有方法主要通过推理时校正或基于决策结果和结构化检查表的粗粒度奖励信号来改善这些行为,而忽略了智能体决策的不确定性特征。我们观察到,面向决策的强化学习倾向于削弱正确和错误动作之间的不确定性分离,导致过度自信的错误和较弱的探索信号。因此,我们提出TRUST,将不确定性量化作为排斥力融入奖励设计以维持不确定性分离,并标注轻量级关键轮次注释用于多轮轨迹的统一后训练。在多种工具使用基准上的实验结果表明,TRUST在优化过程中持续提升决策质量和智能体性能,同时保持更可靠的不确定性估计。

英文摘要

Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing approaches mainly improve these behaviors through inference-time correction or coarse-grained reward signals based on decision outcomes and structured checklists, leaving the uncertainty characteristics of agent decisions underexplored. We observe that decision-oriented reinforcement learning tends to weaken the uncertainty separation between correct and incorrect actions, resulting in overconfident mistakes and weaker exploration signals. Therefore, we propose TRUST, which incorporates uncertainty quantification into reward design as a repulsive force for maintaining uncertainty separation, and labels lightweight key-turn annotations for unified post-training of multi-turn trajectories. Experimental results across diverse tool-use benchmarks show that TRUST consistently enhances both decision quality and agent performance while maintaining more reliable uncertainty estimates during optimization.

2606.07000 2026-06-08 cs.AI 新提交

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

教方法而非答案:用于多模态策略优化的特权辅导蒸馏

Shizhe Xiang, Ke An, Wenlong Yu, Yue Liu, Jian Luan, Pei Fu, Qilong Wang

发表机构 * Tianjin University(天津大学) Beijing Institute of Technology(北京理工大学) Singapore Management University(新加坡管理大学) University of Chinese Academy of Sciences(中国科学院大学) Xiaomi Inc(小米公司)

AI总结 提出PTD-PO框架,通过构建特权提示提供密集的令牌级监督,避免暴露答案,并采用Top-K JS散度稳定蒸馏,显著提升多模态推理性能。

AI中文摘要

最近的后训练方法,特别是具有可验证奖励的强化学习(RLVR),显著增强了大型视觉语言模型(LVLMs)的推理能力。然而,可验证奖励的稀疏性为失败的rollout提供了很少的令牌级监督,常常导致复杂多模态推理任务中的低效探索。尽管策略蒸馏可以提供密集的指导,但基于外部教师的方法引入了大量计算开销,而基于答案条件微调的方法可能暴露答案级信息并诱导类似捷径的生成行为。为解决这些限制,我们提出了PTD-PO,一种用于RLVR的特权辅导蒸馏策略优化框架,在不向学生策略暴露答案的情况下提供密集指导。具体来说,PTD-PO从空间注意力引导和中间文本推理步骤中构建结构化的特权提示,并通过上下文学习将其用于生成逐步的令牌分布监督。学生仍在原始无答案上下文中优化,其失败的rollout在令牌分布级别与提示增强的参考模型对齐。为进一步稳定引导和无引导上下文之间分布偏移下的蒸馏,我们引入了Top-K Jensen-Shannon散度目标,专注于对齐信息性令牌概率,同时减少内存开销。在2B到8B参数的LVLMs上的实验表明,PTD-PO持续优于RLVR和蒸馏基线,缓解了熵崩溃,并提高了复杂多模态推理性能。

英文摘要

Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision for failed rollouts, often leading to inefficient exploration in complex multimodal reasoning tasks. Although policy distillation can offer dense guidance, external teacher based methods introduce substantial computational overhead, while answer conditioned tuning methods may expose answer-level information and induce shortcut-like generation behavior. To address these limitations, we propose PTD-PO, a Privileged Tutoring Distillation Policy Optimization framework for RLVR that provides dense guidance without exposing the answer to the student policy. Specifically, PTD-PO constructs structured privileged hints from spatial attention guidance and intermediate textual reasoning steps, and uses them through in-context learning to produce step-wise token-distribution supervision. The student is still optimized under the original answer-free context, and its failed rollouts are aligned with the hint-augmented reference model at the token-distribution level. To further stabilize distillation under the distribution shift between guided and unguided contexts, we introduce a Top-K Jensen-Shannon divergence objective that focuses alignment on informative token probabilities while reducing memory overhead. Experiments on LVLMs ranging from 2B to 8B parameters show that PTD-PO consistently outperforms RLVR and distillation baselines, mitigates entropy collapse, and improves complex multimodal reasoning performance.
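A Top-K Jensen-Shannon objective of the kind mentioned above might look like the following sketch. It rests on assumptions the abstract does not confirm: the K tokens are picked by the reference (hint-augmented) distribution, both distributions are renormalised over that support, and the student assigns non-zero mass to at least one selected token. The name `top_k_js` is hypothetical.

```python
import math


def _kl(a, b):
    """KL(a || b) for discrete distributions given as equal-length lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


def top_k_js(student, reference, k):
    """JS divergence restricted to the k tokens the reference rates highest."""
    # Assumption: the support is chosen by the reference model's probabilities.
    top = sorted(range(len(reference)), key=reference.__getitem__, reverse=True)[:k]
    s = [student[i] for i in top]
    r = [reference[i] for i in top]
    s_sum, r_sum = sum(s), sum(r)
    s = [v / s_sum for v in s]  # renormalise over the truncated support
    r = [v / r_sum for v in r]
    m = [(a + b) / 2 for a, b in zip(s, r)]  # mixture distribution
    return 0.5 * _kl(s, m) + 0.5 * _kl(r, m)


student = [0.50, 0.30, 0.15, 0.05]
reference = [0.10, 0.60, 0.25, 0.05]
loss = top_k_js(student, reference, k=2)
```

Unlike a plain KL term, the JS divergence is symmetric and bounded by ln 2, which is one plausible reason to prefer it when the guided and unguided contexts induce very different distributions. Restricting to K tokens also means only K probabilities per position need to be stored.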

2606.07017 2026-06-08 cs.AI cs.CL cs.ET 新提交

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

基础模型智能体的仿真到现实差距:统一MDP视角

Xiaoou Liu, Tiejin Chen, Weibo Li, Xiyang Hu, Hua Wei

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 本文提出将基础模型智能体的评估与训练差距形式化为经典仿真到现实问题,围绕MDP四要素(观测、动作、转移、奖励)构建统一框架,并倡导采用域随机化等成熟解决方案。

Comments 7 pages, 2 figures, 2 tables. Accepted by KDD 2026 Blue Sky Ideas Track

AI中文摘要

基础模型智能体越来越多地被部署用于现实世界决策,但受到仿真到现实差距的影响。虽然机器人学和经典控制有成熟的框架来解决这一差距,但基础模型社区将智能体鲁棒性视为一个全新的现象。我们的论文提出将基础模型智能体评估和训练差距形式化为一个经典的仿真到现实问题,完全围绕马尔可夫决策过程的四个要素构建,包括观测、动作、转移和奖励。在本文中,我们设定了一个全面的研究议程,将经典差异转化为基础模型领域,并倡导采用域随机化等成熟解决方案。我们提供了具体示例,例如多语言工具调用,以展示尽管语义意图正确,但观测空间差距如何导致操作无效的动作。最终,这一议程旨在推动范式转变,产生统一的词汇和标准化的压力测试基准,以培养新一代高度可信的智能体,用于可靠的现实世界应用。

英文摘要

Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process: Observation, Action, Transition, and Reward. In this paper, we set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization. We provide concrete examples, such as multilingual tool calling, to demonstrate how severe observation space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress test benchmarks to foster a new generation of highly trustworthy agents for reliable real-world applications.

2606.07033 2026-06-08 cs.AI cs.CV 新提交

Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

层次化语义约束异构图用于音视频事件定位

Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Peng Cheng Laboratory(鹏城实验室) Harbin Institute of Technology Suzhou Research Institute(哈尔滨工业大学苏州研究院)

AI总结 提出层次化语义约束异构图框架,通过构建异构图、双向语义约束和双曲空间层次正则化,解决开放词汇音视频事件定位中跨尺度一致性和层次语义一致性问题。

AI中文摘要

开放词汇音视频事件定位(OV-AVEL)联合建模音视频线索,以识别并时间定位事件,包括训练中未见过的类别。现有方法主要在欧几里得空间中学习联合音视频表示,但仍面临两个重大挑战。首先,未见类别缺乏监督信号,难以在多个时间尺度上保持音视频一致性。其次,片段级与视频级语义之间缺乏层次约束,导致模型无法在不同层级间建立语义一致性。为解决这些挑战,我们提出一种层次化语义约束异构图(HSCHG)用于音视频事件定位框架。我们首先在欧几里得空间中构建一个异构层次图,包含音频和视觉片段节点及其对应的视频级节点。我们使用多方向时间边来捕获每个模态内的完整时间信息。同时,我们采用双阈值过滤门控融合策略,仅在对齐置信度高时引入跨模态信息。此外,我们在片段级和视频级表示之间引入双向语义约束,以实现不同层级间的语义一致性。基于此,我们将多级音视频表示和文本原型统一映射到双曲空间中。我们使用层次蕴含正则化损失来表征视频与片段之间的层次关系。大量实验结果表明,我们的方法在OV-AVEL基准上优于现有方法。消融研究进一步验证了我们方法的有效性。

英文摘要

Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic-constrained heterogeneous graph (HSCHG) framework for audio-visual event localization. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.
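To give a feel for why hyperbolic space suits video-to-segment hierarchies, the sketch below computes the standard geodesic distance on the Poincaré ball. Whether the paper uses the Poincaré model or another model of hyperbolic space is an assumption here, and the entailment regularization loss itself is not reproduced; `poincare_distance` is an illustrative name.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    def sq_norm(a):
        return sum(c * c for c in a)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1 + 2 * diff / ((1 - sq_norm(u)) * (1 - sq_norm(v))))


# The same Euclidean angle, at two different radii.
near_origin = poincare_distance([0.1, 0.0], [0.0, 0.1])
near_boundary = poincare_distance([0.9, 0.0], [0.0, 0.9])
```

Distances grow rapidly towards the boundary of the ball, so general concepts (for example video-level nodes) can sit near the origin while many specific ones (segments) spread out near the edge without crowding each other.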

2606.07113 2026-06-08 cs.AI 新提交

Beyond Post-hoc Explanation: Toward Glassbox AI via Probabilistic Mediation

超越事后解释:通过概率中介迈向玻璃箱AI

Manuele Leonelli

发表机构 * Manuele Leonelli(曼努埃尔·莱奥内利)

AI总结 针对大语言模型在关键领域的不透明性,提出玻璃箱框架,利用贝叶斯网络作为事前中介层,实现可审计推理、不确定性量化和可争议输出。

AI中文摘要

大型语言模型正迅速成为高风险机构设置中的基础设施组件,包括公共管理、法律推理和医疗保健,在这些领域中,不透明性不仅不方便,而且在制度和法律上不可接受。现有的可解释性方法主要是事后性的,提供不稳定、不可争议的解释,这些解释与产生输出的推理过程没有正式关系。我们认为问题不在于缺乏解释,而在于首先缺乏结构化推理。本文提出了一种根本不同的架构,我们称之为玻璃箱框架,其中贝叶斯网络作为生成模型的透明事前中介层。贝叶斯网络在推理之前编码领域知识、因果假设和概率依赖性,从而实现可审计的推理轨迹、不确定性量化和可争议的输出。我们描述了该框架的架构,并将其置于一个福利资格场景中,确定了必须解决的基础性挑战,包括语义对齐、动态模型构建、概率基础以及人类治理,以便大规模实现它。通过从事后解释转向事前概率中介,本文勾勒出一条原则性路径,通往不仅强大而且根本上可问责的AI系统。

英文摘要

Large language models are rapidly becoming infrastructural components in high-stakes institutional settings, including public administration, legal reasoning, and healthcare, where opacity is not merely inconvenient but institutionally and legally untenable. Existing approaches to explainability are predominantly post-hoc, offering unstable, non-contestable accounts that have no formal relationship to the reasoning process that produced the output. We argue that the problem is not the absence of explanation but the absence of structured reasoning in the first place. This paper makes the case for a fundamentally different architecture, which we call the Glassbox Framework, in which Bayesian networks serve as transparent, ante-hoc mediation layers for generative models. Bayesian networks encode domain knowledge, causal assumptions, and probabilistic dependencies before inference occurs, enabling auditable reasoning traces, uncertainty quantification, and contestable outputs. We characterise the architecture of this framework and ground it in a benefit eligibility scenario, identifying the foundational challenges spanning semantic alignment, dynamic model construction, probabilistic grounding, and human governance that must be solved to realise it at scale. By shifting from post-hoc explanation to ante-hoc probabilistic mediation, this work outlines a principled path toward AI systems that are not only powerful but fundamentally accountable.
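The idea of a Bayesian network sitting between extracted claims and the final output can be illustrated with a toy benefit-eligibility network solved by exact enumeration. Everything below is hypothetical: the variables, the structure, and all probabilities are invented for illustration and do not come from the paper.

```python
from itertools import product

# Hypothetical priors and conditional probabilities.
P_LOW_INCOME = 0.4                              # P(applicant has low income)
P_RESIDENT = 0.9                                # P(applicant is a resident)
P_CLAIM_GIVEN_LOW = {True: 0.9, False: 0.2}     # P(extracted claim | low income)


def p_eligible(low_income, resident):
    """P(eligible | low_income, resident); a near-deterministic rule."""
    return 0.95 if (low_income and resident) else 0.05


def joint(low, res, claim, elig):
    """Joint probability of one full assignment, following the network factorisation."""
    p = P_LOW_INCOME if low else 1 - P_LOW_INCOME
    p *= P_RESIDENT if res else 1 - P_RESIDENT
    pc = P_CLAIM_GIVEN_LOW[low]
    p *= pc if claim else 1 - pc
    pe = p_eligible(low, res)
    p *= pe if elig else 1 - pe
    return p


def posterior_eligible(claim):
    """P(eligible | claim) by summing the joint over the hidden variables."""
    num = den = 0.0
    for low, res, elig in product([True, False], repeat=3):
        p = joint(low, res, claim, elig)
        den += p
        if elig:
            num += p
    return num / den


p_yes = posterior_eligible(True)
```

Because every factor is an explicit table, a reviewer can contest a single number (say, the reliability of the extracted claim) and see exactly how the posterior changes, which is the kind of auditability the abstract argues for.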

2606.07253 2026-06-08 cs.AI econ.EM 新提交

TOPSIS-RAD: Ranking According to Desires

TOPSIS-RAD:根据期望排序

Leonardo Fernandes Costa, Helder Gomes Costa, Diogo Lima, Brunno Rodrigues

发表机构 * Universidade Federal Fluminense(弗鲁米嫩塞联邦大学) Leonardo Sistemas Consultoria LTDA(莱昂纳多咨询公司)

AI总结 提出TOPSIS-RAD方法,通过引入决策者定义的否决绩效水平和期望绩效水平,解决传统TOPSIS排序与决策者需求不一致、对异常值敏感及排名反转问题。

Comments 21 pages, 15 tables and 6 figures. The numerical computation of the data in the toy examples was supported by Visual TOPSIS RAD, available at https://topsis-ranking.vercel.app/. The toy-example data are also available at this URL and can be loaded in the app as the template "Article"

AI中文摘要

传统TOPSIS从观测到的备选方案集中推导其参考点——正理想解(PIS)和负理想解(NIS),这使得排序容易与决策者(DM)需求不一致,对异常值表现敏感,并导致排名反转。本文提出TOPSIS-RAD,通过引入两组DM定义的参考水平来解决这些问题。否决绩效水平(VPL)在归一化之前排除不可行的备选方案,防止它们扭曲排序边界。期望绩效水平(DPL)在归一化之前将表现上限设定在DM期望的水平,将PIS锚定在明确的期望而非数据集极端值上。三个简单示例展示了每种机制:VPL通过移除不可行备选方案重塑归一化边界;固定的DPL边界通过限制远高于期望水平的表现的影响来稳定排序。该方法保留了TOPSIS熟悉的基于距离的结构,同时将排序建立在稳定的、DM指定的边界上。还讨论了局限性和未来研究方向。

英文摘要

Traditional TOPSIS derives its reference points -- the Positive Ideal Solution ($PIS$) and Negative Ideal Solution ($NIS$) -- from the observed alternative set, making rankings susceptible to misalignment with decision-maker (DM) requirements, sensitivity to outlier performances, and rank reversal. This paper proposes TOPSIS-RAD, which addresses these issues by incorporating two arrays of DM-defined reference levels. Vetoed Performance Levels ($VPL$) exclude non-viable alternatives before normalisation, preventing them from distorting the ranking frontiers. Desired Performance Levels ($DPL$) cap performances at the DM's desired level before normalisation, anchoring the $PIS$ in explicit aspirations rather than dataset extremes. Three toy examples demonstrate each mechanism: $VPL$ reshapes normalisation boundaries by removing a non-viable alternative; fixed $DPL$ frontiers stabilise rankings by limiting the influence of performances well above the desired level. The method preserves the familiar distance-based structure of TOPSIS while grounding the ranking in stable, DM-specified boundaries. Limitations and future research directions are also discussed.
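The two mechanisms can be sketched as follows. This is one possible reading of the abstract, not the published procedure: it assumes all criteria are benefit-type (higher is better), that the $VPL$ and $DPL$ values serve directly as fixed normalisation frontiers, and that distances are weighted Euclidean. The function name `topsis_rad` and the data are hypothetical.

```python
import math


def topsis_rad(alternatives, vpl, dpl, weights):
    """Rank alternatives by closeness to DM-defined frontiers (benefit criteria)."""
    scores = {}
    for name, perf in alternatives.items():
        # Veto step: drop any alternative that falls below a vetoed level.
        if any(x < v for x, v in zip(perf, vpl)):
            continue
        # Cap step: performance above the desired level earns no extra credit.
        capped = [min(x, d) for x, d in zip(perf, dpl)]
        # Assumed normalisation: 0 at the VPL frontier, 1 at the DPL frontier.
        norm = [(x - v) / (d - v) for x, v, d in zip(capped, vpl, dpl)]
        d_plus = math.sqrt(sum((w * (1 - n)) ** 2 for w, n in zip(weights, norm)))
        d_minus = math.sqrt(sum((w * n) ** 2 for w, n in zip(weights, norm)))
        scores[name] = d_minus / (d_plus + d_minus)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


alternatives = {
    "A": [8, 9],
    "B": [10, 5],
    "C": [3, 10],   # below the veto level on the first criterion
    "D": [20, 5],   # outlier on the first criterion
}
ranking = topsis_rad(alternatives, vpl=[4, 4], dpl=[10, 10], weights=[0.5, 0.5])
```

In this reading, the outlier D scores exactly like B because its first criterion is capped at the desired level, and adding or removing C cannot move the frontiers, which is how fixed reference levels would avoid rank reversal.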

2606.07299 2026-06-08 cs.AI 新提交

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

DuMate-DeepResearch:一种具有递归搜索和基于评分准则推理的可审计多智能体系统

Lingyong Yan, Can Xu, Yukun Zhao, Wenxuan Li, Qingyang Chen, Jiulong Wu, Wenli Song, Xiangnan Li, Weixian Shi, Yiqun Chen, Xuchen Ma, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Jianmin Wu, Dawei Yin

发表机构 * Baidu AI Cloud(百度AI云)

AI总结 提出DuMate-DeepResearch多智能体框架,通过解耦智能体核心与工具生态、引入图动态规划、递归双层执行和基于评分准则的测试时优化,在深度研究基准上取得最优结果。

Comments Technical report by the DuMate Team. 26 pages, 6 figures, 4 tables

AI中文摘要

深度研究(DR)已成为一种新的智能体范式,用于处理复杂、开放的研究任务,要求系统能够迭代地定义问题、获取证据、验证来源并综合生成长篇报告。然而,在实践中,当前的DR系统受到四个相互关联的限制:在未明确范围上的长时规划、单智能体内分解和调度此类任务的瓶颈、长文本综合中的幻觉风险以及有限的过程可审计性。本技术报告介绍了DuMate-DeepResearch,一个基于千帆智能体构建平台构建的多智能体DR框架。该框架将负责任务理解、规划和调度的智能体核心与用于检索、证据获取和报告渲染的可扩展工具生态系统解耦,使每个中间决策和工具调用都明确可追溯。在此基础设施之上,DuMate-DeepResearch进一步引入了三种机制:(i)基于图的动态规划策略,从粗到细扩展研究路线图,并通过反思、重新规划、回溯和并行分支不断修正;(ii)递归双层执行设计,将每个复杂的搜索子任务委托给一个内部搜索智能体,该智能体运行自己的规划循环,隔离噪声检索并稳定长时执行;(iii)基于评分准则的测试时优化机制,动态生成任务特定的质量标准,并将其用作实时推理支架,用于基于证据的综合和自适应停止。在两个深度研究基准上,DuMate-DeepResearch取得了新的最先进结果:在DeepResearch Bench上获得最佳总分(58.03%),在DeepResearch Bench II上获得最佳总分(61.95%),同时在信息召回和分析方面排名第一。

英文摘要

Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long-form synthesis, and limited process auditability. This technical report presents DuMate-DeepResearch, a multi-agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable. Building on this infrastructure, DuMate-DeepResearch further introduces three mechanisms: (i) a graph-based dynamic planning strategy expands the research roadmap coarse-to-fine and continuously revises it through reflection, re-planning, backtracking, and parallel branching; (ii) a recursive two-level execution design delegates each complex search sub-task to an inner Search Agent that runs its own planning loop, isolating noisy retrieval and stabilizing long-horizon execution; (iii) a rubric-based test-time optimization mechanism dynamically generates task-specific quality criteria and uses them as live reasoning scaffolds for evidence-grounded synthesis and adaptive stopping. Across two deep research benchmarks, DuMate-DeepResearch establishes new state-of-the-art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II while ranking first in information recall and analysis.

2606.07308 2026-06-08 cs.AI 新提交

Off-Policy Evaluation with Strategic Agents via Local Disclosure

通过局部披露进行具有策略性主体的离线策略评估

Kiet Q. H. Vo, Abbavaram Gowtham Reddy, Julian Rodemann, Siu Lun Chau, Krikamol Muandet

发表机构 * CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心) LMU Munich(慕尼黑大学) Nanyang Technological University(南洋理工大学)

AI总结 研究策略性行为下的离线策略评估,通过局部披露揭示主体策略前协变量,构建双重稳健估计器,缓解信息不对称。

AI中文摘要

我们研究了策略性行为下的离线策略评估(OPE),其中决策主体(或代理)通过策略性地修改其协变量来响应决策者的策略。这种行为导致了策略依赖的协变量偏移,打破了现有方法中协变量外生于策略的标准假设。相关工作通过施加强假设(如重复交互或完全了解代理的响应行为)来应对这一挑战,这极大地限制了其在OPE中的适用性。相比之下,我们考虑一次性OPE设置,其中决策者仅部分了解代理的响应行为。我们的关键见解是,通过事后解释披露局部信息,可以在适应之前揭示代理的策略前协变量,从而减轻策略行为引起的信息损失。利用这一结构,我们估计了代理响应的统计模型,并构建了策略值的双重稳健估计器。通过假设代理的成本敏感性服从条件对数正态分布,我们建立了所提估计量的一致性,并实证验证了我们的方法。更广泛地说,我们的结果强调了交互设计如何通过揭示代理策略响应中原本隐藏的结构来缓解信息不对称。

英文摘要

We study off-policy evaluation (OPE) under strategic behavior where decision subjects (or agents) respond to a decision maker's policy by strategically modifying their covariates. Such behavior induces a policy-dependent covariate shift, breaking the standard assumption in existing methods that covariates are exogenous to the policy. Related work addresses this challenge by imposing strong assumptions such as repeated interactions or full knowledge of agents' response behavior, substantially limiting its applicability to OPE. In contrast, we consider a one-shot OPE setting where the decision maker has only partial knowledge of the agents' response behavior. Our key insight is that disclosing local information through post-hoc explanations reveals agents' pre-strategic covariates prior to adaptation, mitigating the information loss induced by strategic behavior. Leveraging this structure, we estimate a statistical model for the agents' responses and construct a doubly robust estimator for policy value. By assuming that the agents' cost sensitivity follows a conditional log-normal distribution, we establish consistency of the proposed estimator and validate our approach empirically. More broadly, our results highlight how interaction design can mitigate information asymmetry by revealing otherwise hidden structure in agents' strategic responses.
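For orientation, the textbook doubly robust estimator for contextual-bandit OPE is sketched below. It is the standard non-strategic version: it treats logged covariates as exogenous and therefore does not include the agent-response model that the abstract describes. The names `doubly_robust_value`, `target_policy` and `q_hat` are illustrative.

```python
def doubly_robust_value(logs, target_policy, q_hat):
    """Standard doubly robust estimate of a target policy's value.

    logs: iterable of (x, a, r, mu) with mu the logging probability of action a.
    target_policy(x): dict mapping each action to its probability under the target.
    q_hat(x, a): estimated expected reward for action a in context x.
    """
    total = 0.0
    for x, a, r, mu in logs:
        pi = target_policy(x)
        # Direct-method term: model-based value of the target policy at x.
        direct = sum(p * q_hat(x, b) for b, p in pi.items())
        # Importance-weighted correction of the model's error on the logged action.
        correction = pi.get(a, 0.0) / mu * (r - q_hat(x, a))
        total += direct + correction
    return total / len(logs)


logs = [(0, 1, 1.0, 0.5), (1, 0, 0.0, 0.5)]


def always_one(x):
    """Target policy that always plays action 1."""
    return {0: 0.0, 1: 1.0}


value_ips_like = doubly_robust_value(logs, always_one, lambda x, a: 0.0)
value_with_model = doubly_robust_value(
    logs, always_one, lambda x, a: 1.0 if a == 1 else 0.0
)
```

The estimate stays consistent if either the reward model or the logging propensities are correct. Under strategic behaviour the observed `x` itself depends on the policy, which is the gap the paper's local-disclosure construction targets.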

2606.07392 2026-06-08 cs.AI cs.LG econ.EM stat.ML 新提交

Online Pandora's Box for Contextual LLM Cascading

面向上下文LLM级联的在线潘多拉魔盒

Alexandre Belloni, Yan Chen, Yehua Wei

发表机构 * The Fuqua School of Business, Duke University(杜克大学福克商学院)

AI总结 针对LLM级联场景,提出在线上下文潘多拉魔盒模型,通过参数化保留索引和GMM估计结合UCB界,实现维度相关的√T累积遗憾。

AI中文摘要

受大型语言模型(LLM)级联的启发,我们提出了一种在线上下文潘多拉魔盒模型,用于自适应地查询和选择LLM API。在每个周期中,决策者观察一个请求上下文,并面临一个两阶段决策问题。在查询阶段,决策者顺序查询API,每次查询揭示一个生成的输出,并且决策者承担(输出相关的)成本。在选择阶段,决策者选择一个生成的输出进行部署,并仅观察部署输出的下游奖励。这种输出介导的反馈结构不同于经典的在线上下文潘多拉魔盒模型,后者打开盒子直接揭示其奖励。我们不估计每个API的完整条件输出和成本分布,而是直接建模保留索引,并为查询阶段开发一种学习方法。具体地,我们对由经典Weitzman策略诱导的上下文保留索引函数施加参数化结构。我们的策略将这些保留索引的广义矩方法(GMM)类型估计与这些索引以及共享输出级奖励评估器的UCB风格置信界相结合。在正则条件下,我们证明所得策略在T个周期的时间范围内实现了维度相关的$\widetilde O(\sqrt T)$累积遗憾。

英文摘要

Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost. In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output. This output-mediated feedback structure differs from classical online contextual Pandora's Box models, in which opening a box directly reveals its reward. Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase. Specifically, we impose a parametric structure on the contextual reservation index functions induced by the classical Weitzman's policy. Our policy combines generalized method of moments (GMM) type estimation of these reservation indices with UCB-style confidence bounds for both these indices and the shared output-level reward evaluator. Under regularity conditions, we prove that the resulting policy achieves dimension-dependent $\widetilde O(\sqrt T)$ cumulative regret over a horizon of $T$ periods.
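The reservation index at the heart of Weitzman's policy can be computed numerically for a known reward distribution. The sketch below covers only the classical setting (fixed cost, reward revealed on opening), not the contextual, output-mediated variant or the GMM/UCB learning procedure in the paper. Function names and numbers are hypothetical.

```python
def reservation_index(values, probs, cost, tol=1e-9):
    """Solve E[max(X - sigma, 0)] = cost for sigma by bisection (cost > 0)."""
    def expected_gain(sigma):
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))

    # expected_gain is decreasing in sigma; these bounds bracket the root.
    lo, hi = min(values) - cost, max(values)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_gain(mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def weitzman_policy(indices, draw):
    """Open boxes by descending index; stop once the best reward beats the next index."""
    order = sorted(indices, key=indices.get, reverse=True)
    best, opened = float("-inf"), []
    for name in order:
        if best >= indices[name]:
            break
        opened.append(name)
        best = max(best, draw(name))
    return opened, best


# A box paying 0 or 1 with equal probability, at query cost 0.1.
sigma = reservation_index([0.0, 1.0], [0.5, 0.5], cost=0.1)
```

In the LLM-cascading reading, each box is an API, the cost is the price of a query, and the index says how good the best output so far must be before another query stops being worthwhile.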

2606.07462 2026-06-08 cs.AI 新提交

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

像真正的研究者一样行动:一套在研究生命周期中评估前沿LLM与智能体框架的基准测试

Jiayu Wang, Weijiang Lv, Bowen Fu, Jing Fu, Jiayi Song, Lingyu Zhang, Lanxuan Xue, Luodi Chen, Zepeng Xin, Kaiyu Li, Xiangyong Cao

发表机构 * Xi’an Jiaotong University(西安交通大学) Xidian University(西安电子科技大学)

AI总结 提出AARR基准系列,通过AARRI-Bench评估智能体在细粒度研究场景中模拟人类研究者的专业性、全面性和细微推理能力,发现最佳配置成功率仅68.3%。

AI中文摘要

随着基础模型的进步和智能体框架日益复杂,智能体在复杂、长期编码任务甚至自主实验执行中展现出卓越能力。尽管它们从研究助手演变为自主研究智能体,但这些系统在领域敏感性、研究伦理和细微科学判断方面仍存在显著局限。因此,前沿智能体仍无法完全取代人类研究者。为弥合这一差距,我们构思了AARR(像真正的研究者一样行动)基准系列。与主要评估宏观执行能力的现有基准不同,AARR关注智能体能否在细粒度研究场景中模拟人类研究者的专业性、全面性和细微推理。在这项工作中,我们提出了AARRI-Bench(像真正的研究实习生一样行动),这是该系列的第一个基准。我们在前沿模型和智能体系统上进行了大量实验,揭示即使是最佳配置(Mini-SWE-Agent与Claude Opus 4.7)也仅达到68.3%的成功率,经常忽略真实人类研究者显而易见的微妙但关键细节。我们的结果表明,开发类似研究者的AI需要进一步探索研究行为,而不仅仅是复杂的框架。我们的数据发布在此https URL。

英文摘要

As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced scientific judgment. Consequently, frontier agents remain unable to fully replace human researchers. To bridge this gap, we conceptualize the AARR (Act As a Real Researcher) benchmark series. Unlike existing benchmarks that primarily assess macro-level execution capabilities, AARR focuses on whether agents can emulate the professionalism, thoroughness, and nuanced reasoning that characterize human researchers in granular research scenarios. In this work, we propose AARRI-Bench (Act As a Real Research Intern), the first benchmark in this series. We conduct extensive experiments across frontier models and agentic systems, revealing that even the best-performing configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves a success rate of only 68.3\%, frequently overlooking subtle yet critical details that are obvious to real human researchers. Our results indicate that developing researcher-like AI requires further exploration of research behavior, rather than merely complex scaffolding. Our data is released at https://github.com/AARR-bench/AARRI-bench.

2606.05510 2026-06-08 cs.AI cs.CL 新提交

Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation

基于严重性感知的课程学习与多模型响应选择用于医疗文本生成

Ahmed Alansary, Molham Mohamed, Ali Hamdi

发表机构 * Faculty of Computer Science, MSA University, Giza, Egypt(MSA大学计算机科学学院,埃及吉扎)

AI总结 提出一种结合课程学习策略和相关性响应选择的多模型框架,通过三阶段课程训练和五个大语言模型独立训练,在MAQA数据集上实现医疗文本生成性能提升。

Comments 6 pages, 3 figures, IMSA2026

AI中文摘要

远程医疗系统在提供可获取且及时的医疗信息方面变得越来越重要。现有的大语言模型通常难以在不同病例严重程度下提供一致且上下文恰当的医疗响应。这一局限性凸显了需要能够有效适应医疗查询渐进复杂性的模型。为了解决这一挑战,我们引入了一个严重性感知的多模型框架,该框架将课程训练策略与基于相关性的响应选择相结合。所提出的框架采用三阶段课程学习策略,每个模型依次在轻度、中度和危重病例上进行训练,以逐步获取领域知识。该方法利用五个大语言模型,每个模型在相同的课程方案下独立训练。在推理过程中,所有模型生成候选响应,并选择最合适的响应作为最终输出。该框架在MAQA数据集上进行训练和评估,该数据集提供带注释的医疗问答对。使用BERTScore评估的实验结果表明,与基线和微调模型相比,所提出的方法取得了优越的性能,在基线设置下达到86.71%,微调后达到90.30%。这些结果凸显了将课程学习与多模型响应选择相结合在提高医疗文本生成中的响应质量和相关性方面的有效性。

英文摘要

Telehealth systems have become increasingly important for delivering accessible and timely medical information. Existing large language models often struggle to provide consistent and contextually appropriate medical responses across varying levels of case severity. This limitation highlights the need for models that can effectively adapt to the progressive complexity of medical queries. To address this challenge, we introduce a severity-aware multi-model framework that integrates a curriculum training strategy with relevance-based response selection. The proposed framework employs a three-stage curriculum learning strategy, where each model is trained sequentially on mild, moderate, and critical cases to progressively acquire domain knowledge. The approach utilizes five large language models, each independently trained under the same curriculum scheme. During inference, all models generate candidate responses, and the most appropriate response is selected as the final output. The framework is trained and evaluated on the MAQA dataset, which provides annotated medical question-answer pairs. Experimental results evaluated using BERTScore demonstrate that the proposed method achieves superior performance compared to both baseline and fine-tuned models, attaining 86.71% in the baseline setting and 90.30% after fine-tuning. These results highlight the effectiveness of combining curriculum learning with multi-model response selection in improving response quality and relevance in medical text generation.

2601.12359 2026-06-08 cs.CR cs.AI cs.CL 交叉投稿

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

零样本嵌入漂移检测:针对LLM提示注入的轻量级防御

Anirudh Sekar, Mrinal Agarwal, Rachel Sharma, Akitsugu Tanaka, Jasmine Zhang, Arjun Damerla, Kevin Zhu

发表机构 * Algoverse AI Research(Algoverse AI研究院) Berkeley(伯克利大学)

AI总结 本文提出ZEDD,通过量化嵌入空间中良性与可疑输入之间的语义变化,实现对直接和间接提示注入的检测。该方法无需模型内部访问或先验知识,具有低工程开销,能高效部署于多种LLM架构,准确率达93%以上。

Comments Accepted to NeurIPS 2025 Lock-LLM Workshop

AI中文摘要

提示注入攻击已成为LLM应用中日益严重的漏洞,其中对抗性提示利用电子邮件或用户生成内容等间接输入渠道绕过对齐保护措施,导致有害或意外输出。尽管对齐技术有所进步,但最先进的LLM仍普遍易受对抗性提示攻击,凸显了对稳健、高效且可推广的检测机制的迫切需求。本文提出零样本嵌入漂移检测(ZEDD),一种轻量级、低工程开销的框架,通过量化嵌入空间中良性与可疑输入之间的语义变化,识别直接和间接提示注入尝试。ZEDD无需访问模型内部、先验攻击类型知识或任务特定重训练,可高效地在多种LLM架构上进行零样本部署。我们的方法使用对抗-干净提示对,并通过余弦相似度测量嵌入漂移,以捕捉现实世界注入攻击中的细微对抗性操纵。为确保评估的鲁棒性,我们整理并重新标注了涵盖五个注入类别的综合LLMail-Inject数据集。广泛实验表明,嵌入漂移是一种稳健且可迁移的信号,在检测准确性和运行效率方面优于传统方法。在Llama 3、Qwen 2和Mistral等模型架构上,我们的方法分类准确率超过93%,误报率低于3%,提供了一种轻量级、可扩展的防御层,可整合到现有LLM流程中,填补了保护LLM系统以抵御适应性对抗威胁的关键空白。

英文摘要

Prompt injection attacks have become a growing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful or unintended outputs. Despite advances in alignment, even state-of-the-art LLMs remain broadly vulnerable to adversarial prompts, underscoring the urgent need for robust, productive, and generalizable detection mechanisms beyond inefficient, model-specific patches. In this work, we propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts by quantifying semantic shifts in embedding space between benign and suspect inputs. ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, enabling efficient zero-shot deployment across diverse LLM architectures. Our method uses adversarial-clean prompt pairs and measures embedding drift via cosine similarity to capture subtle adversarial manipulations inherent to real-world injection attacks. To ensure robust evaluation, we assemble and re-annotate the comprehensive LLMail-Inject dataset spanning five injection categories derived from publicly available sources. Extensive experiments demonstrate that embedding drift is a robust and transferable signal, outperforming traditional methods in detection accuracy and operational efficiency. With over 93% accuracy in classifying prompt injections across model architectures such as Llama 3, Qwen 2, and Mistral, and a false positive rate below 3%, our approach offers a lightweight, scalable defense layer that integrates into existing LLM pipelines, addressing a critical gap in securing LLM-powered systems to withstand adaptive adversarial threats.
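The core measurement, drift between the embeddings of a clean and a suspect input, can be sketched with plain cosine similarity. The embedding model is deliberately left out: the vectors below are made-up stand-ins, and the fixed `threshold=0.3` is a hypothetical value rather than the decision rule used in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_drift(clean_vec, suspect_vec):
    """Drift score: 0 when the embeddings point the same way, larger as they diverge."""
    return 1.0 - cosine_similarity(clean_vec, suspect_vec)


def is_injection(clean_vec, suspect_vec, threshold=0.3):
    """Flag the suspect input when drift exceeds an assumed, hand-set threshold."""
    return embedding_drift(clean_vec, suspect_vec) > threshold


# Stand-in embeddings; a real pipeline would obtain these from an encoder.
clean = [0.9, 0.1, 0.0]
paraphrase = [0.8, 0.2, 0.1]
injected = [0.1, 0.2, 0.9]
flags = (is_injection(clean, paraphrase), is_injection(clean, injected))
```

Because the check needs only two embeddings and a dot product, it can sit in front of any model without access to its internals, which matches the low-overhead claim in the abstract.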

2606.04550 2026-06-08 cs.IR cs.AI cs.SI 交叉投稿

Trading Engagement for Sustainability: Carbon-Aware Re-ranking for E-commerce Recommendations

用参与度换取可持续性:面向电子商务推荐中碳感知的重排序

Noah Lund Syrdal, Anders Vestrum, Jorgen Bergh

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出一种碳感知重排序策略,通过检索增强的碳足迹估计管道推断缺失的产品碳足迹标签,并在三个推荐模型上权衡用户参与度与碳排放,实现可持续推荐。

Comments 23 pages, 30 figures. Code available at https://github.com/andersvestrum/carbon-aware-recsys

AI中文摘要

电子商务推荐系统强烈影响用户考虑和购买的产品,然而可持续性信号(如产品碳足迹,PCF)几乎从未在目录规模上可用。我们研究了在大多数商品缺失PCF标签且必须推断的现实场景中的碳感知产品推荐。我们首先通过检索增强的PCF估计管道来估计产品级碳足迹,该管道利用语义相似性搜索、少样本LLM提示和最近邻回退,将来自碳目录(一组生命周期评估产品的小型集合)的监督转移到大型未标记的电子商务目录。然后,我们在三个已建立的推荐模型(BPR、NeuMF和LightGCN)产生的相关性分数之上应用碳感知的事后重排序策略。该方法通过单个可调参数lambda在预测的用户-物品参与度和估计的碳足迹之间进行权衡。在这项离线研究中,参与度通过亚马逊评论交互来操作化,这些交互作为隐式反馈以及用户兴趣或购买行为的代理。我们在亚马逊评论数据集上跨三个产品类别(家居与厨房、运动与户外、电子产品)评估该框架。通过扫描lambda,我们构建了帕累托前沿,表征每个模型和类别可实现的参与度和碳权衡。在所有模型和类别中,以最小的参与度成本即可实现显著的碳减排。然而,可用的碳空间因模型和类别而异,强调了模型选择和领域背景的重要性。

英文摘要

E-commerce recommender systems strongly influence which products users consider and purchase, yet sustainability signals such as Product Carbon Footprint (PCF) are almost never available at catalog scale. We study carbon-aware product recommendation in the realistic setting where PCF labels are missing for most items and must be inferred. We first estimate product-level carbon footprints via a retrieval-augmented PCF estimation pipeline that transfers supervision from the Carbon Catalogue, a small set of life-cycle-assessed products, to a large unlabeled e-commerce catalog using semantic similarity search, few-shot LLM prompting, and a nearest-neighbour fallback. We then apply a carbon-aware post-hoc re-ranking strategy on top of relevance scores produced by three established recommendation models: BPR, NeuMF, and LightGCN. The method trades off predicted user-item engagement against estimated carbon footprint through a single tunable parameter, lambda. In this offline study, engagement is operationalized through Amazon review interactions, which serve as implicit feedback and as a proxy for user interest or purchase behavior. We evaluate the framework on the Amazon Reviews dataset across three product categories: Home and Kitchen, Sports and Outdoors, and Electronics. By sweeping lambda, we construct Pareto frontiers that characterize the achievable engagement and carbon trade-off for each model and category. Substantial carbon reductions are achievable at minimal engagement cost across all models and categories. However, the available carbon headroom varies by model and category, underscoring the importance of model choice and domain context.

2606.05050 2026-06-08 cond-mat.mtrl-sci cs.AI physics.chem-ph 交叉投稿

Autonomous heterogeneous catalyst discovery with a self-evolving multi-agent digital twin

自主异质催化剂发现:一种自进化多智能体数字孪生系统

Zhilong Song, Zongmin Zhang, Lixue Cheng

发表机构 * Department of Chemistry, Hong Kong University of Science and Technology(香港科技大学化学系) IAS Center for AI for Scientific Discoveries, Hong Kong University of Science and Technology(香港科技大学人工智能科学发现中心) Department of Computer Science and Engineering, Hong Kong University of Science and Technology(香港科技大学计算机科学与工程系) Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology(香港科技大学化学与生物工程系)

AI总结 提出CatDT(催化数字孪生),一种自进化多智能体系统,通过集成八种专业智能体和27种科学工具,在单个GPU上5-30分钟内自动构建工作催化剂数字孪生,实现从体相晶体和自然语言反应描述到稳定晶面预测、反应路径枚举、过渡态定位和动力学计算的全流程,在七个气固相基准上预测与实验偏差在0.5-2倍内,并独立发现丙烷脱氢非贵金属候选催化剂。

详情
AI中文摘要

理论异质催化有望实现快速催化剂发现,然而计算和机器学习预测常常偏离实验,并局限于狭窄的材料家族,原因是缺乏忠实、条件感知的催化模拟器。我们提出CatDT(催化数字孪生),一种自进化多智能体系统,构建工作催化剂的自主数字孪生,统一了气固和液固建模。仅从体相晶体和自然语言反应描述出发,八个专业智能体和27种科学工具在单个GPU上5-30分钟内预测稳定晶面、重构工作表面、枚举和排序反应路径、定位过渡态并计算动力学。两项创新解决了最困难的步骤:UniMech通过融合智能体引导提议与能量缓存图搜索,以比穷举枚举低超过$10^3$倍的成本发现新型材料的主导路径;记忆增强的强化循环将600个催化表面的势垒计算成功率从41%提高到84%。在七个气固基准上——台阶金属、单原子催化剂、有序金属间化合物、富空位二维硫化物和碳化物,以及强金属-载体相互作用(SMSI)界面——每个CatDT预测在四个数量级内与实验偏差在0.5-2倍之间。对于丙烷脱氢,CatDT独立发现与Pt基工业基准相媲美的非贵金属候选催化剂,其中提出的Ni@ZrO$_2$ SMSI覆盖层在约100%选择性下达到$1.63~\text{s}^{-1}$的模拟TOF。更广泛地说,忠实催化剂数字孪生——或任何多阶段科学模拟器——的决定性因素不是原始LLM能力,而是围绕它的工程化框架:确定性工具、持久记忆和跨模型、工具和运行累积的已验证自我改进。

英文摘要

Theoretical heterogeneous catalysis promises rapid catalyst discovery, yet computational and machine-learning predictions often deviate from experiment and stay confined to narrow material families, for want of a faithful, condition-aware catalytic simulator. We present CatDT (Catalysis Digital Twin), a self-evolving multi-agent system that builds an autonomous digital twin of a working catalyst, unifying gas-solid and liquid-solid modeling. From only a bulk crystal and a natural-language reaction description, eight specialized agents and 27 scientific tools predict stable facets, reconstruct working surfaces, enumerate and rank reaction pathways, locate transition states, and compute kinetics in 5-30 min on a single GPU. Two innovations address the hardest steps: UniMech finds dominant pathways for novel materials at over $10^3\times$ lower cost than exhaustive enumeration by fusing agent-guided proposals with energy-cached graph search, and a memory-augmented reinforcement loop raises barrier-calculation success from 41\% to 84\% across 600 catalytic surfaces. Across seven gas-solid benchmarks -- stepped metals, single-atom catalysts, ordered intermetallics, vacancy-rich 2D sulfides and carbides, and a strong-metal--support-interaction (SMSI) interface -- every CatDT prediction lies within 0.5-2 times experiment over four orders of magnitude. For propane dehydrogenation, CatDT independently discovers non-precious candidates rivaling the Pt-based industrial benchmark, with a proposed Ni@ZrO$_2$ SMSI overlayer reaching a simulated TOF of $1.63~\text{s}^{-1}$ at $\sim$100\% selectivity. More broadly, the decisive factor for a faithful catalyst digital twin -- or any multi-stage scientific simulator -- is not raw LLM capability but the engineered harness around it: deterministic tools, persistent memory, and verified self-improvement that compound across models, tools, and runs.

2606.06464 2026-06-08 cs.CL cs.AI 交叉投稿

Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

人类成人与LLM作为科学家:谁从主动探索中受益?

Mandana Samiei, Eunice Yiu, Anthony GX-Chen, Dongyan Lin, Jocelyn Shen, Blake A. Richards, Alison Gopnik, Doina Precup

发表机构 * Mila - Quebec AI Institute(魁北克人工智能研究所) McGill University(麦吉尔大学) University of California Berkeley(加州大学伯克利分校) New York University(纽约大学) Meta FAIR MIT Media Lab(麻省理工学院媒体实验室) Montreal Neurological Institute(蒙特利尔神经科学研究所)

AI总结 本研究通过主动探索实验,发现主动探索能显著提升成人对合取因果规则的推理能力,但合取规则仍需更多测试;同时比较了大型语言模型的表现,发现部分模型在假设推断准确率上接近人类,但探索策略效率较低且存在类似的合取-析取性能差距。

Comments Accepted at the 48th Annual Conference of the Cognitive Science Society (CogSci 2026)

详情
AI中文摘要

因果学习文献中的一个长期发现是,成人难以识别合取因果规则(即一个效应需要多个原因同时存在),而在析取情境中表现更好。然而,这种“合取缺陷”的大多数演示依赖于被动观察范式,证据有限,学习者无法控制证据生成。本文探讨当成人通过主动探索获得能动性时,这种偏见是否仍然存在。使用修改后的“blicket检测器”任务,成人参与者在合取或析取规则结构下自由干预以识别因果对象。我们表明,主动探索显著改善了成人的合取因果推理,尽管合取规则仍比析取规则需要更多测试来推断。我们进一步将人类表现与同一设置下的多种大型语言模型进行比较。虽然一些最先进的模型在假设推断准确率上接近人类水平,但它们通常表现出效率较低的探索策略以及类似的合取-析取性能差距。

英文摘要

A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this ``conjunctive handicap'' rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency through active exploration. Using a modified ``blicket detector'' task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures. We show that active exploration substantially improves adults' conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules. We further compare human performance to a range of large language models in the same setting. While some state-of-the-art models approach human-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive-disjunctive performance gaps.
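
To make the two rule structures concrete, here is a small sketch of a simulated detector. It assumes exactly two causal objects and treats "conjunctive" as "all causal objects present together"; the object names and the function `detector` are hypothetical and not taken from the study.

```python
def detector(objects_on, blickets, rule):
    """Return True if the detector activates for the objects placed on it."""
    present = blickets & objects_on
    if rule == "disjunctive":
        return len(present) >= 1      # any single causal object is enough
    if rule == "conjunctive":
        return present == blickets    # all causal objects must be present together
    raise ValueError(rule)


objects = {"A", "B", "C"}
blickets = {"A", "B"}  # hypothetical ground truth

# Placing one object at a time on the detector.
single_tests = {
    rule: {o: detector({o}, blickets, rule) for o in sorted(objects)}
    for rule in ("disjunctive", "conjunctive")
}
# Placing pairs of objects under the conjunctive rule.
pair_tests = {
    pair: detector(set(pair), blickets, "conjunctive")
    for pair in (("A", "B"), ("A", "C"), ("B", "C"))
}
```

Under the disjunctive rule, single-object tests already separate causal from non-causal objects. Under the conjunctive rule every single-object test is negative, so a learner has to try combinations, which is consistent with the abstract's observation that conjunctive rules take more tests to infer.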

2606.06505 2026-06-08 cs.CG cs.AI cs.CV math.DG 交叉投稿

A Geometric Gaussian Mixture Representation of Plane Curves

平面曲线的几何高斯混合表示

Ali Darijani, Benedikt Stratmann, Jürgen Beyerer

发表机构 * Fraunhofer IOSB(弗劳恩霍夫研究所) KIT, IES(卡尔斯鲁厄理工学院,信息工程系)

AI总结 提出一种用户定义的平面曲线概率多边形表示,通过为每个线段赋予法向不确定性参数,构造高斯混合模型,保留局部几何与法向不确定性,适用于多种曲线类型。

详情
AI中文摘要

我们引入了一种用户定义的平面曲线概率多边形表示。给定一条曲线,我们在曲线上选择顶点,并通过线段连接相邻顶点以获得多边形近似。每个线段在法线方向上配备一个用户定义的不确定性参数。这产生了一组薄的概率几何基元,它们保留了底层曲线的几何形状,同时将其扩展到理想化的确定性一维公式之外。对于每个线段,我们定义一个随机变量,该变量在线段的切线方向上均匀分布,在线段的法线方向上高斯分布。通过匹配第一和第二中心矩,该构造诱导出一个高斯分量,其均值位于线段中点,协方差编码了切向和法向不确定性。将逐段分量与适当的权重相结合,得到平面曲线的用户定义概率多边形表示的高斯混合模型(GMM)。所提出的框架提供了一个解析上可处理的概率模型,保留了局部几何和法向不确定性。它适用于光滑、封闭、开放、非正则和自交的平面曲线,允许自适应离散化和法向方向上的变化不确定性,从而支持不确定性感知的几何建模。在一组典型平面曲线上的实验表明,所得的GMM捕获了局部切线、局部法线和局部弧长;从而也真实地捕获了底层曲线的全局形状。该表示特别适用于不确定性感知的CAD和数字孪生、机器人中的概率障碍物建模以及概率轨迹规划等应用。

英文摘要

We introduce a user-defined probabilistic polygonal representation for plane curves. Given a curve, we select vertices on the curve and connect consecutive vertices by line segments to obtain a polygonal approximation. Each segment is equipped with a user-defined uncertainty parameter in the normal direction. This yields a collection of thin probabilistic geometric primitives that retain the geometry of the underlying curve while extending it beyond the idealized deterministic one-dimensional formulation. For each segment, we define a random variable that is uniformly distributed in the tangent direction of the segment and Gaussian distributed in the normal direction of the segment. By matching the first and second central moments, this construction induces a Gaussian component whose mean lies at the segment midpoint and whose covariance encodes both tangential and normal uncertainty. Combining the segment-wise components with appropriate weights yields a Gaussian Mixture Model (GMM) representation of the user-defined probabilistic polygonal representation of the plane curve. The proposed framework provides an analytically tractable probabilistic model that preserves local geometry and uncertainty in the normal direction. It applies to smooth, closed, open, non-regular, and self-intersecting plane curves, allows adaptive discretization and varying uncertainty in the normal direction, and as a result supports uncertainty-aware geometric modeling. Experiments on a collection of canonical plane curves show that the resulting GMMs capture local tangent, local normal, and local arc length, so that the global shape of the underlying curves is faithfully captured as well. The representation is particularly relevant for applications in uncertainty-aware CAD and digital twins, probabilistic obstacle modeling in robotics, and probabilistic trajectory planning.
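
The per-segment construction can be sketched as follows. A uniform distribution on an interval of length L has variance L^2/12, so moment matching gives a covariance of (L^2/12) t t^T + sigma^2 n n^T for unit tangent t and unit normal n. The length-proportional mixture weights and the function names below are assumptions; the abstract only says "appropriate weights".

```python
import math


def segment_gaussian(a, b, sigma_n):
    """Moment-match one segment: uniform along the tangent, Gaussian along the normal."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    t = (dx / length, dy / length)  # unit tangent
    n = (-t[1], t[0])               # unit normal
    mean = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    var_t = length ** 2 / 12.0      # variance of a uniform on an interval of this length
    var_n = sigma_n ** 2
    cov = [
        [var_t * t[i] * t[j] + var_n * n[i] * n[j] for j in range(2)]
        for i in range(2)
    ]
    return mean, cov, length


def polyline_gmm(vertices, sigma_n):
    """Return a list of (weight, mean, covariance) components for an open polyline."""
    comps = [segment_gaussian(a, b, sigma_n) for a, b in zip(vertices, vertices[1:])]
    total = sum(length for _, _, length in comps)
    # Length-proportional weights are an assumption made for this sketch.
    return [(length / total, mean, cov) for mean, cov, length in comps]


gmm = polyline_gmm([(0.0, 0.0), (2.0, 0.0), (2.0, 1.0)], sigma_n=0.1)
```

A per-segment `sigma_n` (passed as a list rather than a scalar) would cover the varying normal uncertainty mentioned in the abstract.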

2606.06509 2026-06-08 eess.IV cs.AI cs.LG q-bio.TO 交叉投稿

Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction

在有限标签下哪些解剖结构重要?用于心脏病理预测的数据高效解剖感知基准

Himanshu Singh

发表机构 * Himanshu Singh(希曼斯·辛格)

AI总结 针对有限标签和计算资源下的医学影像问题,提出解剖感知基准,通过比较不同解剖结构表示和分类器,发现表示质量比模型复杂度更重要。

Comments ACCEPTED at ICML 2026 Workshop GlobalSouthML (Seoul, South Korea; PMLR 306, 2026)

详情
AI中文摘要

许多医学影像问题必须在有限标签和受限计算条件下解决,然而性能提升主要来自更具表达力的模型还是对临床有意义解剖结构的更好表示,目前尚不清楚。我们通过一个低数据解剖感知基准来研究这个问题,该基准用于在公共ACDC MRI数据集上进行5类心脏病理预测。利用来自右心室、心肌和左心室的分割衍生患者描述符,我们在线性、核和基于树的分类器上比较了特定解剖结构和多结构表示。我们发现,在有限标签设置下,表示主导复杂度。这些结果表明,在资源受限的医疗环境中,识别和表示最具信息量的解剖结构可能比单纯增加模型复杂度更重要。

英文摘要

Numerous medical imaging problems must be solved under limited labels and constrained compute, yet it remains unclear whether performance gains are driven mainly by more expressive models or by better representation of clinically meaningful anatomy. We study this question through a low-data anatomy-aware benchmark for 5-class cardiac pathology prediction on the public ACDC MRI dataset. Using segmentation-derived patient descriptors from the right ventricle, myocardium, and left ventricle, we compare anatomy-specific and multi-structure representations across linear, kernel, and tree-based classifiers. We find that under limited-label settings, representation dominates complexity. These results suggest that in resource-constrained healthcare settings, identifying and representing the most informative anatomy may matter more than increasing model complexity alone.

2606.06515 2026-06-08 cs.AR cs.AI cs.DC cs.ET cs.LG 交叉投稿

DxPTA: An Architecture Design Space Exploration with Optical Dataflow-guided Strategy for HW/SW Co-Design of Photonic Transformer Accelerators

DxPTA:基于光学数据流引导策略的光子Transformer加速器硬件/软件协同设计的架构设计空间探索

Rachmad Vidya Wicaksana Putra, Solomon Micheal Serunjogi, Mahmoud Rasras, Muhammad Shafique

发表机构 * eBRAIN Lab, Division of Engineering, New York University (NYU) Abu Dhabi(eBRAIN实验室,工程学院,纽约大学(NYU)阿布扎比分校) Photonic Research Lab (PRL), Division of Engineering, New York University (NYU) Abu Dhabi(光子研究实验室(PRL),工程学院,纽约大学(NYU)阿布扎比分校) New York University (NYU) Abu Dhabi(纽约大学(NYU)阿布扎比分校)

AI总结 提出DxPTA方法,通过光学数据流分析架构参数并设计约束感知搜索算法,实现光子Transformer加速器的高效硬件/软件协同设计,在满足面积、功耗等约束下显著提升搜索速度。

Comments 8 pages, 12 figures

详情
AI中文摘要

基于Transformer的网络已成为具有最先进性能的突出AI模型,可能为人工通用智能(AGI)铺平道路。然而,它们的大尺寸仍然阻碍了其高效实现,因此需要替代解决方案以实现其节能加速。最近,最先进的工作提出了光子Transformer加速器(PTA),与传统电子加速器相比,具有显著的加速和能效提升。然而,它们的PTA架构是在不考虑应用约束(如面积、功耗、能量和延迟)的情况下开发的。此外,它们的手动设计方法也需要大量设计时间来确定适合目标应用的架构,因此使得这种方法不可扩展。为了解决这些限制,我们提出了DxPTA,一种新颖的设计空间探索方法,用于实现满足所有约束的适当PTA架构的高效硬件/软件协同设计。这是通过(1)基于相干光学数据流识别PTA架构参数;(2)分析参数的影响/重要性;(3)利用此分析设计约束感知架构搜索算法来实现的。实验结果表明,我们的DxPTA可以为不同的基于Transformer的模型(即DeiT-T/S/B和BERT-B/L)找到合适的PTA架构。在约束条件为面积50mm^2、功耗5W、能量50mJ和延迟10ms的情况下,它实现了高达26mm^2面积、4.8W功耗、39mJ能量和6ms延迟;搜索时间比穷举方法快15.2倍。这些结果证明了DxPTA方法在实现针对各种基于AGI的应用的高效PTA设计方面的潜力。

英文摘要

Transformer-based networks have emerged as prominent AI models with state-of-the-art performance, potentially paving the way toward artificial general intelligence (AGI). However, their large sizes still hinder efficient implementation, highlighting the need for alternative solutions that enable energy-efficient acceleration. Recent state-of-the-art works have proposed photonic transformer accelerators (PTAs) with significant speedup and energy-efficiency improvements over conventional electronic accelerators. However, these PTA architectures are developed without considering application constraints (e.g., area, power, energy, and latency). Moreover, their manual design approach requires substantial design time to determine a suitable architecture for the targeted application, which makes it hard to scale. To address these limitations, we propose DxPTA, a novel design space exploration methodology for enabling efficient hardware/software co-design of an appropriate PTA architecture that meets all constraints. It is achieved by (1) identifying the PTA architecture parameters based on the coherent optical dataflow; (2) analyzing the impact/significance of the parameters; and (3) leveraging this analysis to devise a constraint-aware architecture search algorithm. Experimental results show that DxPTA can find appropriate PTA architectures for different transformer-based models (i.e., DeiT-T/S/B and BERT-B/L). It achieves up to 26mm^2 area, 4.8W power, 39mJ energy, and 6ms latency, for constraints of 50mm^2 area, 5W power, 50mJ energy, and 10ms latency, with 15.2x faster searching time than the exhaustive approach. These results demonstrate the potential of the DxPTA methodology for enabling efficient PTA designs for diverse AGI-based applications.

2606.06521 2026-06-08 cs.AR cs.AI cs.DC cs.LG cs.PF 交叉投稿

P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8

P-Cast:FP8注意力中的精度——Sink引发的坍缩与S=2^8的最优性

Reed Lau

发表机构 * Tencent(腾讯)

AI总结 针对FP8注意力计算中softmax概率矩阵P在乘法前转换为FP8时的精度问题,分析了KV块迭代顺序和静态缩放因子对精度的影响,发现正向迭代导致非sink值下溢为0,反向迭代结合S=256可消除下溢,并证明S=256在比特精确、量化步长和覆盖范围上最优。

Comments 8 pages, 3 figures, 3 tables, 1 algorithm. Technical note on FP8 E4M3 P-cast precision

详情
AI中文摘要

FP8 (E4M3) 加速注意力计算可显著提升吞吐量,但3位尾数在P*V矩阵乘法前将softmax概率矩阵P转换为FP8时带来了精度挑战。我们分析了在注意力Sink现象下影响输出精度的两种实现选择:(1) KV块迭代顺序,(2) 转换前应用于P的静态缩放因子。我们证明正向KV迭代会导致“P坍缩”——在主导阶上,非sink的P值中有比例为Φ(Δ + δ_k - 6.93 - ln S)的部分下溢为零,其中小偏移δ_k ≈ 1(对于k_sink=4)是sink块内期望的分数最大值;而反向迭代可消除该问题,当反向与S=256结合时保证零下溢。我们进一步给出S=256=2^8的构造性刻画,它是同时满足(i) 比特精确的IEEE 754缩放,(ii) E4M3数轴上锯齿函数dp(S)的下包络(dp=2^-4,最小最坏情况量化步长),以及(iii) 在比特精确(2^k)缩放中最大正常范围覆盖(非比特精确缩放如448可实现略高覆盖)的静态缩放因子。两种优化已在FlashAttention-3/4中基于工程理由部署;我们的贡献是定量解释这些选择为何良好,并给出一个闭式阈值Δ_c = 6.93 + ln S - δ_k用于预测内核级精度损失。内核忠实实验(Q、K、V为FP32以隔离P-cast效应)在中等sink强度下显示3-10倍的MSE改进,配对测试证实两种修复结合时均饱和到相同的精度下限。

英文摘要

FP8 (E4M3) acceleration for attention computation offers significant throughput gains, but the 3-bit mantissa introduces precision challenges when the softmax probability matrix $P$ is cast to FP8 before the $P \cdot V$ matrix multiplication. We analyze two implementation choices that affect output precision under the \emph{Attention Sink} phenomenon: (1) the KV block iteration order, and (2) the static scaling factor applied to $P$ before casting. We show that forward KV iteration causes \emph{P-collapse} -- to leading order a fraction $Φ(Δ + δ_k - 6.93 - \ln S)$ of non-sink $P$ values underflow to zero, where the small shift $δ_k \approx 1$ (for $k_{\text{sink}}{=}4$) is the expected within-sink-block score maximum -- and that reverse iteration removes it, with a zero-underflow guarantee when reverse is combined with $S{=}256$. We further give a constructive characterization of $S = 256 = 2^8$ as the static scale that simultaneously satisfies (i) bit-exact IEEE 754 scaling, (ii) the lower envelope of a sawtooth function $dp(S)$ over the E4M3 number line ($dp = 2^{-4}$, the minimum worst-case quantization step), and (iii) the maximum normal-range coverage \emph{among bit-exact ($2^k$) scales} (a non-bit-exact scale such as $448$ attains slightly higher coverage; see Sec. 5). Both optimizations are already deployed in FlashAttention-3/4 on engineering grounds; our contribution is a quantitative account of \emph{why} these choices are good and a closed-form threshold $Δ_c = 6.93 + \ln S - δ_k$ for predicting kernel-level precision loss. Kernel-faithful experiments ($Q, K, V$ in FP32 to isolate the P-cast effect) show $3$-$10\times$ MSE improvement at moderate sink strengths, and paired tests confirm both fixes saturate to the same precision floor when combined -- which motivated updating the hpc-ops kernel from $S{=}1$ to $S{=}256$.
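
The closed-form threshold can be evaluated directly. The sketch below assumes that Φ denotes the standard normal CDF and that Δ is a sink-strength gap in logit units; the abstract does not define either symbol, so both readings, as well as the function names, are assumptions. It only reproduces the stated leading-order expression for forward iteration and is not a kernel simulation.

```python
import math


def phi(x):
    """Standard normal CDF (assumed meaning of the symbol Phi)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def collapse_threshold(scale, delta_k=1.0):
    """Threshold from the abstract: Delta_c = 6.93 + ln S - delta_k."""
    return 6.93 + math.log(scale) - delta_k


def underflow_fraction(delta, scale, delta_k=1.0):
    """Leading-order fraction Phi(Delta + delta_k - 6.93 - ln S) = Phi(Delta - Delta_c)."""
    return phi(delta - collapse_threshold(scale, delta_k))


# Compare the unscaled cast (S = 1) with S = 256 at a hypothetical sink gap of 8.
for s in (1, 256):
    print(s, round(collapse_threshold(s), 3), underflow_fraction(8.0, s))
```

Because ln S enters the threshold additively, moving from S = 1 to S = 256 shifts Δ_c up by ln 256, roughly 5.55, so a given sink gap pushes far fewer non-sink probabilities below the representable range.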

2606.06534 2026-06-08 eess.IV cs.AI 交叉投稿

Attention Consistent Longitudinal Medical Visual Question Answering Guided by Vision Foundation Models

基于视觉基础模型的注意力一致纵向医学视觉问答

Jialin Wu, Qianru Zhang, Georges El Fakhri, Xiaofeng Liu

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Yale Biomedical Imaging Institute(耶鲁大学生物医学成像研究所)

AI总结 提出一种注意力引导的编码器-解码器框架,通过轻量级配准和自适应掩码生成,结合辅助损失函数,实现胸部X光片的纵向医学视觉问答,在Medical-Diff-VQA基准上取得优异性能。

Comments Accepted to CVPR 2026 Workshop PHAROS-AIF-MIH

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 6448-6458
AI中文摘要

纵向医学视觉问答(VQA)需要推理当前时间点图像与参考时间点图像之间的解剖差异。我们针对胸部X光片提出了一种注意力引导的编码器-解码器。与传统的直接对比不同,我们引入了一个轻量级仿射配准模块,通过小配准正则化将当前图像与参考图像进行共配准,以减少无关运动。配准后的图像对输入图像编码器,随后通过冻结的DINO掩码生成器和可训练的自适应掩码生成器生成应用于原始图像对的掩码。掩码图像对再次输入图像编码器,并与文本特征拼接,作为基于多模态Transformer的解码器的输入以生成最终答案。为了促进学习稳定并澄清变化信号,受DINO-v3启发,我们加入了额外的辅助目标,包括掩码重建损失、成对Gram风格一致性损失和KoLeo均匀性损失,以增强表示的几何结构。在Medical-Diff-VQA基准上,该模型获得了强大的BLEU、ROUGE-L、CIDEr和METEOR分数,同时通过共享显著性掩码提供了内在的可解释性。这些结果支持将显著性条件生成与轻度预对齐作为医学VQA中纵向推理的原则性框架。我们的训练策略也展示了在生物医学中利用图像基础模型的范式潜力:同时优化监督和无监督学习目标。

英文摘要

Longitudinal medical visual question answering (VQA) requires reasoning about anatomical differences between an image at the current time point and an image at a reference time point. We propose an attention-guided encoder-decoder for this task with chest X-rays. Instead of conventional direct contrast, we include a lightweight affine registration module that reduces nuisance motion by co-registering the current image to the reference image with a small registration regularizer. The registered image pair is fed into the image encoder, followed by a frozen DINO-based mask generator and a trainable adaptive mask generator to produce masks applied to the original image pairs. The masked image pairs are again fed into the image encoder and concatenated with text features as the input to a multimodal transformer-based decoder to generate final answers. To stabilize learning and clarify the change signal, inspired by DINO-v3, we include additional auxiliary objectives, namely a mask rebuilding loss, a pairwise Gram-style consistency loss, and a KoLeo uniformity loss, which enhance the geometry of the representation. On the Medical-Diff-VQA benchmark, the model delivers strong BLEU, ROUGE-L, CIDEr, and METEOR scores while offering intrinsic interpretability through the shared saliency mask. These results support saliency-conditioned generation with mild pre-alignment as a principled framework for longitudinal reasoning in medical VQA. Our training strategy also illustrates the potential of a paradigm for utilizing image foundation models in biomedicine: optimizing both supervised and unsupervised learning objectives simultaneously.

2606.06536 2026-06-08 cs.CV cs.AI cs.LG 交叉投稿

Attention-Guided Autoencoder Fusion for Insulator Defect Detection Using UAV Transmission-Line Imaging

基于注意力引导自编码器融合的无人机输电线路绝缘子缺陷检测

Malak Allam, Khaled Shaban, Ali Hamdi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出AE-YOLO框架,通过注意力引导自编码器融合与方差最大化正则化,解决无人机图像中绝缘子缺陷检测的类别不平衡和尺度变化问题,在mAP@0.5上达95.10%,优于YOLO基线5个百分点。

详情
AI中文摘要

高压输电线路绝缘子的自动缺陷检测仍然具有挑战性,原因在于无人机(UAV)图像中严重的类别不平衡、尺度变化大以及缺陷实例的空间范围小。为了解决这些问题,本文提出了AE-YOLO,一种注意力引导的自编码器增强型YOLO框架,用于鲁棒的绝缘子缺陷检测。该架构在特征金字塔网络-路径聚合网络(FPN-PAN)颈部集成了轻量级瓶颈自编码器,在多尺度特征融合过程中保留了异常敏感信息。整个骨干网络使用卷积块注意力模块(CBAM),增强了特征辨别能力并抑制了背景干扰。该框架还引入了一种方差最大化的自编码器正则化策略,鼓励生成多样化、缺陷判别性的潜在表示。网络使用统一目标进行训练,该目标结合了焦点损失、完全IoU(CIoU)损失和自编码器正则化,以解决前景-背景不平衡问题并提高定位精度。在推理过程中,加权框融合(WBF)结合了YOLOv8、YOLOv10和YOLO11的预测结果。自编码器引导的置信度提升机制提高了对罕见缺陷类别的敏感性。在绝缘子缺陷检测数据集上的实验表明,采用EfficientNetV2骨干网络的AE-YOLO在mAP@0.5上达到95.10%,精度为96.40%,召回率为93.80%。这一性能在mAP@0.5上比最强的YOLO系列基线高出5.0个百分点,在召回率上高出6.7个百分点。这些结果证实了该框架的有效性和适应性。该模型是基于UAV的输电线路巡检和缺陷监测的实用且可扩展的解决方案。

英文摘要

Automated defect detection in high-voltage transmission-line insulators remains challenging due to severe class imbalance, large scale variation, and the small spatial extent of defect instances in Unmanned Aerial Vehicle (UAV) imagery. To address these challenges, this paper proposes AE-YOLO, an Attention-Guided AutoEncoder-Enhanced YOLO framework for robust insulator defect detection. The architecture integrates lightweight bottleneck autoencoders within a Feature Pyramid Network-Path Aggregation Network (FPN-PAN) neck. This preserves anomaly-sensitive information during multi-scale feature fusion. Convolutional Block Attention Modules (CBAM) are used throughout the backbone, enhancing feature discrimination and suppressing background interference. The framework also introduces a variance-maximizing autoencoder regularization strategy, which encourages diverse, defect-discriminative latent representations. The network trains using a unified objective that combines focal loss, Complete IoU (CIoU) loss, and autoencoder regularization to address foreground-background imbalance and improve localization accuracy. During inference, Weighted Boxes Fusion (WBF) combines predictions from YOLOv8, YOLOv10, and YOLO11. An autoencoder-guided confidence boosting mechanism improves sensitivity to rare defect categories. Experiments on the Insulator-Defect Detection dataset show that AE-YOLO with an EfficientNetV2 backbone achieves 95.10% mAP@0.5, 96.40% precision, and 93.80% recall. This performance surpasses the strongest YOLO-family baseline by 5.0 points in mAP@0.5 and 6.7 points in recall. These results confirm the effectiveness and adaptability of the framework. The model is a practical and scalable solution for UAV-based transmission-line inspection and defect monitoring.
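
For reference, the focal-loss term in the unified objective can be illustrated with the standard binary form, FL = -alpha_t (1 - p_t)^gamma log(p_t). The defaults gamma = 2 and alpha = 0.25 are the commonly used values from the original focal-loss formulation, not settings reported in this abstract, and how the term is weighted against the CIoU and autoencoder terms is not shown here.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    # (1 - p_t)^gamma down-weights examples the model already classifies well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)  # confident, correct foreground prediction
hard = focal_loss(0.30, 1)  # poorly classified foreground example
```

With `gamma=0` and `alpha=1` the expression reduces to ordinary cross-entropy for the positive class, which makes the down-weighting of easy examples easy to see by comparison.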

2606.06539 2026-06-08 cs.CV cs.AI cs.LG cs.NE 交叉投稿

Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

合成基准高估了前向-前向扩展:真实数据对逐层训练的限制

Yucheng Chen

发表机构 * Amplimit

AI总结 通过DTG-FF方法在真实数据上评估前向-前向学习的扩展性,发现其与反向传播的差距随类别数增加而扩大,合成任务高估了其迁移能力,且内存优势不成立。

Comments 23 pages, 6 figures

详情
AI中文摘要

前向-前向(FF)学习[Hinton, 2022]用严格的逐层良好性更新取代了反向传播。最近的FF-CNN工作在32x32基准上缩小了与BP的差距,引发了逐层训练是否在现实规模下成为可行替代方案的问题。为了严格探究这一点,我们开发了DTG-FF——动态温度良好性、解耦归一化和多层融合——作为在九个真实数据基准上设定FF系列最先进水平的工具(CIFAR-10上91.8%,以及ImageNet-100 224x224上的首个FF基线),并用它来审计逐层训练实际能扩展到何种程度。(1)真实数据扩展。在相同配方和主干下,架构匹配的BP-DeepSup基线在CIFAR-10/CIFAR-100上分别超过DTG-FF 2.40/5.93个百分点,且差距随类别数增加而扩大。在224x224分辨率下,同一工具仅达到49.4%——这是该尺度下的首个FF基线,而典型BP超过75%[Tian et al., 2020]——暴露了在32x32下不可见的真实数据上限。(2)合成与真实K冲突。在合成教师-学生任务中,随着类别数K增长,DTG-FF越来越优于BP;而在真实图像上,FF-BP差距符号反转并随K扩大。数据集内CIFAR-100粗粒度与细粒度探针将标签层次与图像分布分离:合成K扫描将输出维度与细粒度判别难度混淆,从而高估了FF的可迁移性。(3)系统审计。FF可以在不存储深度激活的情况下实现,但在普通8 GB硬件上,标准BP+梯度累积达到4.18 GB / 157 imgs/s,而DTG-FF为7.90 GB / 138 imgs/s,因此在公平基线支持下,基于内存的理由在此规模下不成立。

英文摘要

Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP on 32x32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF -- dynamic temperature goodness, decoupled normalization, and multi-layer fusion -- as an instrument that sets FF-family state of the art across nine real-data benchmarks (91.8% CIFAR-10 and the first FF baseline at ImageNet-100 224x224), and use it to audit how far layer-local training actually scales. (1) Real-data scaling. Under identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224x224 the same instrument reaches only 49.4% -- the first FF baseline at this scale, versus typical BP above 75% [Tian et al., 2020] -- exposing a real-data ceiling invisible at 32x32. (2) Synthetic vs. real K-conflict. DTG-FF increasingly outperforms BP as class count K grows on synthetic teacher-student tasks, yet on real images the FF-BP gap reverses sign and widens with K. A within-dataset CIFAR-100 coarse vs. fine probe isolates label-hierarchy from image distribution: synthetic K-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability. (3) Systems audit. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF's 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported under fair baselines.
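
As background for the layer-local updates discussed above, the basic Forward-Forward objective from Hinton (2022) scores a layer by its goodness, the sum of squared activations, and trains it with a logistic loss against a threshold. The sketch below shows that basic form only. The `temperature` argument is a hypothetical knob added for illustration, since the abstract does not specify how DTG-FF's dynamic temperature is defined.

```python
import math


def goodness(activations):
    """Hinton-style goodness: sum of squared activations of one layer."""
    return sum(a * a for a in activations)


def ff_layer_loss(activations, is_positive, theta=2.0, temperature=1.0):
    """Layer-local logistic loss on (goodness - theta) / temperature.

    Positive samples are pushed toward goodness above theta, negative
    samples toward goodness below it. `temperature` is illustrative only.
    """
    logit = (goodness(activations) - theta) / temperature
    if not is_positive:
        logit = -logit
    # softplus(-logit) = -log(sigmoid(logit)), computed in a numerically stable way
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


acts = [1.5, 1.0, 0.5]  # hypothetical layer output with goodness 3.5
pos_loss = ff_layer_loss(acts, is_positive=True)
neg_loss = ff_layer_loss(acts, is_positive=False)
```

Each layer minimises this loss on its own activations, so no gradient needs to cross layer boundaries, which is the property the abstract's memory audit examines.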

2606.06543 2026-06-08 quant-ph cs.AI 交叉投稿

Coordinated optimization of departure sequencing and section-track allocation in railway short-term concentrated departure scenarios based on qubo and hybrid quantum algorithms

基于QUBO和混合量子算法的铁路短期集中发车场景下出发排序与段轨分配协同优化

Xiaobin Li, Yanbin Gao, Weiguang Wang, Xuechen Liang

发表机构 * School of Transportation Engineering(交通运输工程学院)

AI总结 针对铁路短期集中发车场景,提出基于QUBO模型与仿真评估的协同优化框架,混合量子算法在动态条件下综合成本降低4.28%-26.26%,总延误减少4.37%-24.25%。

详情
AI中文摘要

本研究探讨了铁路短期集中发车场景下出发排序与段轨分配的协同优化问题。构建了一个二次无约束二元优化(QUBO)模型,在统一的二元框架内表示出发位置分配和段轨选择。由于调度方案的质量取决于时间相关的运行交互,而静态组合模型无法完全捕捉这些交互,因此引入基于仿真的评估层来评估段占用、中间站等待、站台容量压力、运行时间波动和延误传播。在此分层框架内,传统启发式算法、量子启发式算法和混合算法在相同的决策结构下进行了比较。结果表明,QUBO模型在解码后能够生成可行的候选方案,而仿真层清晰地区分了竞争算法在正常和扰动条件下的运行性能。在测试场景中,QPSO-QAOA在正常条件下表现最佳,而量子增强方法在动态条件下相对于传统方法平均综合成本降低4.28%--26.26%,总延误减少4.37%--24.25%。这些发现表明,基于QUBO的建模与基于仿真的评估相结合,为铁路短期集中发车调度提供了一种有用的方法论框架,尽管仍需使用实际运行数据进行验证。

英文摘要

This study examines the coordinated optimization of departure sequencing and section-track allocation in railway short-term concentrated departure scenarios. A quadratic unconstrained binary optimization (QUBO) model is formulated to represent departure-position assignment and section-track selection within a unified binary framework. Because the quality of a dispatching scheme depends on time-dependent operational interactions that cannot be fully captured by a static combinatorial model, a simulation-based evaluation layer is introduced to assess section occupation, intermediate-station waiting, platform-capacity pressure, running-time fluctuations, and delay propagation. Within this layered framework, conventional heuristics, quantum-inspired algorithms, and hybrid algorithms are compared on the same decision structure. The results show that the QUBO model can generate feasible candidate schemes after decoding, while the simulation layer clearly differentiates the operational performance of the competing algorithms under both normal and disturbed conditions. In the tested scenarios, QPSO-QAOA performs best under normal conditions, and the quantum-enhanced methods reduce comprehensive cost by 4.28\%--26.26\% and total delay by 4.37\%--24.25\% on average under dynamic conditions relative to their conventional counterparts. These findings suggest that the integration of QUBO-based modeling and simulation-based evaluation provides a useful methodological framework for railway short-term concentrated departure scheduling, although validation with real operational data remains necessary.
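
A toy illustration of how a departure-position assignment can be written as a QUBO. It covers only the sequencing part with one-hot penalty terms; the cost matrix, the penalty weight, and the brute-force solver are assumptions made for this sketch and stand in for the paper's full model and its quantum-inspired and hybrid solvers.

```python
from itertools import product


def build_qubo(cost, penalty):
    """QUBO for assigning n trains to n departure slots with one-hot constraints."""
    n = len(cost)
    idx = lambda i, p: i * n + p  # flatten (train, slot) to a bit index
    Q = {}

    def add(u, v, w):
        key = (min(u, v), max(u, v))
        Q[key] = Q.get(key, 0.0) + w

    for i in range(n):
        for p in range(n):
            add(idx(i, p), idx(i, p), cost[i][p])
    groups = [[idx(i, p) for p in range(n)] for i in range(n)]   # each train takes one slot
    groups += [[idx(i, p) for i in range(n)] for p in range(n)]  # each slot takes one train
    offset = 0.0
    for g in groups:
        # penalty * (sum(x) - 1)^2 with x^2 = x expands to
        # -penalty on each diagonal, +2*penalty on each pair, +penalty constant.
        for a, u in enumerate(g):
            add(u, u, -penalty)
            for v in g[a + 1:]:
                add(u, v, 2.0 * penalty)
        offset += penalty
    return Q, offset


def energy(x, Q, offset):
    return offset + sum(w * x[u] * x[v] for (u, v), w in Q.items())


cost = [[4, 1, 3], [2, 0, 5], [3, 2, 2]]  # hypothetical delay cost of train i in slot p
Q, offset = build_qubo(cost, penalty=10.0)
# Exhaustive search over all 2^9 bit strings; only viable for toy sizes.
best = min(product((0, 1), repeat=9), key=lambda x: energy(x, Q, offset))
slots = [best[i * 3:(i + 1) * 3].index(1) for i in range(3)]  # decode slot per train
```

Decoding the minimum-energy bit string back into a slot per train mirrors the "feasible candidate schemes after decoding" step in the abstract; the time-dependent effects would then be scored by a separate simulation layer.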

2606.06545 2026-06-08 cs.SE cs.AI 交叉投稿

Queen-Bee Agents: A BeeSpec-Centered Architecture for Governed Enterprise MCP Orchestration

蜂后智能体:一种以BeeSpec为中心的受治理企业MCP编排架构

Dutao Zhang, Liaotian

发表机构 * Polytechnic University(理工学院)

AI总结 提出Queen-Bee多智能体架构,通过Queen控制平面检索能力、规划任务并编译BeeSpec,由Bee智能体在受限工具访问下执行,实现策略执行、租户隔离和边界内执行,在59个企业任务中达到0.964成功率且零治理失败。

Comments Technical report. Prototype-level systems evidence; 59 enterprise-style tasks

详情
AI中文摘要

企业智能体系统日益需要将大语言模型连接到私有工具、内部知识和模型上下文协议(MCP)接口。在这种环境下,原始任务能力是不够的:组织还需要策略执行、租户范围的隔离以及在明确操作边界内的执行。我们提出了Queen-Bee,一种受治理的多智能体架构,其中Queen控制平面检索能力、规划任务范围的执行,并编译结构化的BeeSpec,由专门的Bee智能体在受限工具访问下执行。我们实现了一个工作原型,具有租户范围的MCP连接器、审计支持的运行时治理、检索驱动的弱孵化和多个配置后端。我们在59个企业风格任务上评估了系统,涵盖治理敏感请求、检索驱动的配置、范围化本地执行和化学工作流集成。检索驱动的Queen-Bee变体实现了0.964的任务成功率、零治理失败,并且在范围化执行质量上显著优于静态Queen-Bee基线和宽松的单智能体基线。我们进一步展示了一个多Bee化学工作流,具有明确的审批门控和一个基于真实上游证据和筛选工件的具体前三候选名单。与混合检索和LLM引导配置的额外比较表明,更丰富的配置后端是可行的,但在当前小型、高度结构化的能力注册表上并未优于轻量级结构化检索器。结果提供了原型级别的系统证据,而非生产部署研究,并表明企业智能体平台不仅应根据能力进行评估,还应考虑受治理的配置、隔离行为、范围化执行质量和工件感知的工作流协调。

英文摘要

Enterprise agent systems increasingly need to connect large language models to private tools, internal knowledge, and Model Context Protocol (MCP) interfaces. In this setting, raw task capability is insufficient: organizations also require policy enforcement, tenant-scoped isolation, and execution that remains within explicit operational boundaries. We present Queen-Bee, a governed multi-agent architecture in which a Queen control plane retrieves capabilities, plans task-scoped execution, and compiles a structured BeeSpec that is executed by specialized Bee agents under constrained tool access. We implement a working prototype with tenant-scoped MCP connectors, audit-backed execution-time governance, retrieval-driven weak incubation, and multiple provisioning backends. We evaluate the system on 59 enterprise-style tasks spanning governance-sensitive requests, retrieval-driven provisioning, scoped local execution, and chemistry workflow integration. The retrieval-driven Queen-Bee variant achieves a task success rate of 0.964, zero governance failures, and substantially better scoped execution quality than both a static Queen-Bee baseline and a permissive single-agent baseline. We further show a multi-Bee chemistry workflow with explicit approval gating and a concrete top-3 shortlist grounded in real upstream evidence and screening artifacts. Additional comparisons with hybrid retrieval and LLM-guided provisioning show that richer provisioning backends are viable but do not outperform the lightweight structured retriever on the current small, highly structured capability registry. The results provide prototype-level systems evidence rather than a production deployment study, and suggest that enterprise agent platforms should be evaluated not only by capability, but also by governed provisioning, isolation behavior, scoped execution quality, and artifact-aware workflow coordination.

2606.06547 2026-06-08 cs.LG cs.AI 交叉投稿

FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models

FAIR-Calib:面向扩散大语言模型训练后量化的前沿感知不稳定重加权校准

Haoyu Huang, Linlin Yang, Sheng Xu, Boyu Liu, Guodong Guo, Zhongqian Fu, Hang Zhou, Baochang Zhang

发表机构 * FAIR

AI总结 针对扩散大语言模型训练后量化中前沿决策易翻转并永久锁定放大的问题,提出两阶段PTQ框架FAIR-Calib,通过前沿命中与掩码阶段可靠性估计位置先验,并利用重加权隐状态MSE校准优先保护脆弱前沿状态,理论证明其作为输出KL散度代理,实验显著优于基线。

Comments Accepted as a poster at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

扩散大语言模型(dLLMs)迭代地精炼token,但不可逆地提交它们,导致“稳定性滞后”,即早期决策即使在写入后仍然脆弱。我们揭示训练后量化(PTQ)误差容易在写入前沿翻转这些边界决策,然后这些决策被永久锁定并放大。为了解决这个问题,我们提出了前沿感知不稳定重加权校准(FAIR-Calib),一个针对dLLMs的两阶段PTQ框架。第一阶段探测全精度教师模型以估计结合前沿命中与掩码阶段可靠性的位置先验。第二阶段通过最小化重加权隐状态MSE执行离策略、逐层校准,有效优先保护脆弱前沿状态,而无需昂贵的端到端扩散展开。我们进一步从理论上证明我们的加权目标是输出KL散度的代理。实验上,FAIR-Calib在LLaDA和Dream(W4A4)上持续优于最先进的基线,显著减少前沿决策翻转并抑制跨不同基准的提交后不匹配。

英文摘要

Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, leading to a "stability lag" where early decisions remain fragile even after being written. We reveal that Post-Training Quantization (PTQ) error easily flips these borderline decisions at the write frontier, which are then permanently locked in and amplified. To address this, we propose Frontier-Aware Instability-Reweighted Calibration (FAIR-Calib), a two-stage PTQ framework for dLLMs. Stage I probes a full-precision teacher to estimate a position prior that combines frontier hits and masked-stage reliability. Stage II performs off-policy, layer-wise calibration by minimizing a reweighted hidden-state MSE, effectively prioritizing the protection of fragile frontier states without requiring expensive end-to-end diffusion rollouts. We further theoretically justify our weighted objective as a surrogate for output KL divergence. Empirically, FAIR-Calib consistently outperforms state-of-the-art baselines on LLaDA and Dream (W4A4), significantly reducing frontier decision flips and suppressing post-commit mismatches across diverse benchmarks.
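
The Stage II objective can be pictured as a position-weighted mean squared error between full-precision and quantized hidden states. The sketch below takes the per-position weights as given, because the abstract does not say how the position prior is computed from frontier hits and masked-stage reliability; the normalisation by the weight sum and the sample values are also assumptions.

```python
def reweighted_mse(h_fp, h_q, weights):
    """Position-weighted MSE between full-precision and quantized hidden states.

    h_fp, h_q: lists of per-position hidden vectors of equal shape.
    weights: one non-negative weight per position (assumed to be given).
    """
    total_w = sum(weights)
    loss = 0.0
    for w, a, b in zip(weights, h_fp, h_q):
        sq_err = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
        loss += w * sq_err
    return loss / total_w


# Hypothetical three-position example where only the middle position is perturbed.
h_fp = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
h_q = [[1.0, 0.0], [0.5, 0.9], [0.0, 1.0]]
uniform = reweighted_mse(h_fp, h_q, [1.0, 1.0, 1.0])
frontier = reweighted_mse(h_fp, h_q, [0.2, 2.0, 0.2])
```

Putting more weight on the perturbed position raises its share of the loss, which is the sense in which a reweighted objective prioritises fragile frontier states during layer-wise calibration.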

2606.06550 2026-06-08 cs.SD cs.AI eess.AS 交叉投稿

Geometric Second-Order Feature Correlation Learning for Self-Supervised Speech Emotion Recognition

几何二阶特征相关性学习用于自监督语音情感识别

Shuanglin Li, Ruxiao Qian, Siyang Song

发表机构 * Xiangjiang Laboratory(湘江实验室) University of Exeter(埃克塞特大学)

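AI summary First-order aggregation in self-supervised speech emotion recognition ignores feature correlations and the underlying Riemannian geometry. The proposed second-order correlation layer captures synergistic co-occurrence patterns with covariance descriptors and preserves geometric integrity through Log-Euclidean mapping, recovering discriminative information on the ESD and RAVDESS datasets.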

详情
AI中文摘要

自监督学习(SSL)为语音情感识别(SER)提供了强大且富含上下文的表示,但将这些表示聚合为整体描述符仍是一个瓶颈。传统的一阶聚合隐式假设特征独立,忽略了潜在的黎曼几何,并丢弃了对骨干网络表示能力至关重要的高阶关系。为解决这一问题,本文提出了一种新颖的二阶相关(SOC)层。SOC不孤立地处理特征,而是将特征相关性建模为协方差描述子,以捕获协同共现模式,这些模式可作为鲁棒情感识别的判别性签名。通过对数欧几里得映射(LEM)将这些描述子从黎曼流形映射到欧几里得切空间,所提方法在保持几何完整性的同时,实现了直接的线性判别学习。在ESD和RAVDESS数据集上的大量实验表明,SOC恢复了一阶池化中丢失的判别信息,并有效聚合了高维SSL特征。

英文摘要

Self-supervised learning (SSL) yields powerful, context-rich representations for speech emotion recognition (SER), yet aggregating these representations into holistic descriptors remains a bottleneck. Conventional first-order aggregation implicitly assumes feature independence, which overlooks the latent Riemannian geometry and discards higher-order relationships essential to the representational power of the backbone. To address this problem, this paper proposes a novel Second-Order Correlation (SOC) layer. Instead of treating features in isolation, SOC models feature correlations as covariance descriptors to capture synergistic co-occurrence patterns, which serve as discriminative signatures for robust emotion recognition. By mapping these descriptors from the Riemannian manifold to a Euclidean tangent space through Log-Euclidean mapping (LEM), the proposed method preserves geometric integrity while enabling direct linear discriminative learning. Extensive experiments on the ESD and RAVDESS datasets demonstrate that SOC recovers discriminative information lost in first-order pooling and effectively aggregates high-dimensional SSL features.
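
The two ingredients named in the abstract, a covariance descriptor and a Log-Euclidean mapping, can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the paper's SOC layer: the function names, the `eps` regularizer, and the plain upper-triangle vectorization (without the commonly used sqrt(2) scaling of off-diagonal entries) are all illustrative choices.

```python
import numpy as np


def spd_log(matrix):
    """Matrix logarithm of a symmetric positive-definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    # V diag(log(lambda)) V^T
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


def second_order_descriptor(frames, eps=1e-5):
    """Map frame-level features of shape (T, D), D >= 2, to a tangent-space vector."""
    frames = np.asarray(frames, dtype=float)
    dim = frames.shape[1]
    cov = np.cov(frames, rowvar=False)   # (D, D) covariance across frames
    cov = cov + eps * np.eye(dim)        # keep the matrix strictly positive definite
    log_cov = spd_log(cov)               # Log-Euclidean map to the tangent space
    rows, cols = np.triu_indices(dim)
    return log_cov[rows, cols]           # symmetric, so the upper triangle suffices


rng = np.random.default_rng(0)
descriptor = second_order_descriptor(rng.normal(size=(50, 3)))
```

The resulting vector has D(D+1)/2 entries and can be fed to an ordinary linear classifier, which is the practical appeal of working in the tangent space.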

2606.06559 2026-06-08 cs.SD cs.AI eess.AS 交叉投稿

IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems

IRAF:面向噪声鲁棒的端到端全双工口语对话系统的抗干扰自适应融合

Tao Zhong, Jiajun Deng, Nikita Kuzmin, Yinke Zhu, Tianxiang Cao, Tristan Tsoi, Zhili Tan, Simon Lui, Xunying Liu

发表机构 * The Chinese University of Hong Kong(香港中文大学) AudioLab Hong Kong, Huawei Leibniz Research Center(香港AudioLab,华为Leibniz研究中心) Nanyang Technological University(南洋理工大学)

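AI summary Proposes IRAF, a module that predicts a frame-by-frame reliability gate to modulate the contribution of user audio to the LLM, improving response quality and interaction stability of full-duplex dialogue systems under interfering-speaker conditions.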

详情
AI中文摘要

全双工口语对话模型允许语音代理同时听和说,实现具有实时重叠的自然交互。然而,联合编码用户和代理流的端到端双通道模型在现实声学环境中可能会退化:干扰说话人泄漏到用户麦克风中,会被编码为用户查询的一部分,破坏LLM的条件,导致不稳定的轮流说话和响应质量下降。我们提出抗干扰自适应融合(IRAF),一个轻量级、流兼容的模块,逐帧调节用户音频对LLM的贡献。IRAF从目标说话人和用户音频嵌入中预测一个标量可靠性门控,并在与代理嵌入融合之前重新缩放用户表示。在MS-MARCO和InstructS2S-200K上的实验表明,在干扰说话人条件下,响应质量和全双工交互获得一致提升。

英文摘要

Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM's conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.
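
The gating idea above can be sketched for a single frame as below. The linear-plus-sigmoid form of the gate, the function names, and the toy weights are assumptions for illustration; the abstract only states that a scalar reliability gate is predicted from target-speaker and user audio embeddings and used to rescale the user representation before fusion.

```python
import math


def reliability_gate(target_speaker_emb, user_emb, weights, bias):
    """Scalar gate in (0, 1) computed from the concatenated embeddings of one frame."""
    features = list(target_speaker_emb) + list(user_emb)
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated embedding size")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def gate_user_frame(target_speaker_emb, user_emb, weights, bias):
    """Rescale the user representation by the gate before it is fused downstream."""
    gate = reliability_gate(target_speaker_emb, user_emb, weights, bias)
    return gate, [gate * x for x in user_emb]


# With zero weights and zero bias the gate is exactly 0.5 (hypothetical parameters).
gate, scaled = gate_user_frame([1.0, 0.0], [2.0, 4.0], [0.0, 0.0, 0.0, 0.0], 0.0)
```

A gate near 0 suppresses a frame dominated by an interfering speaker, while a gate near 1 passes the user audio through largely unchanged.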

2606.06560 2026-06-08 cs.LG cs.AI cs.HC 交叉投稿

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

MacArena: 在在线macOS环境中基准测试计算机使用代理

Victor Muryn, Maksym Shamrai, Sofiia Mazepa, Yehor Khodysko

发表机构 * MacPaw

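AI summary Introduces MacArena, a benchmark of 421 tasks across 50 applications running on Apple Silicon. It reveals challenges specific to macOS for GUI agents, with model rankings inverting between ported and native tasks.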

Comments Accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

详情
AI中文摘要

计算机使用代理(CUA)通过视觉和控制原语操作图形用户界面(GUI),其能力迅速提升,部分得益于标准化在线评估基准(如OSWorld),这些基准既作为评估工具,也作为强化学习的训练环境。然而,macOS在此领域中仍未被充分覆盖:现有唯一基准macOSWorld仅覆盖少量第一方应用且任务较简单,并在与Apple Silicon不兼容的x86虚拟机上运行。我们引入MacArena,一个包含50个应用中421个手动验证任务的基准,结合了OSWorld任务的精选移植、来自macOSWorld的内容以及49个新的macOS原生任务,全部在Apple Silicon上的Apple原生虚拟化框架上运行。我们认为macOS呈现了Linux基准无法捕捉的独特GUI挑战,我们的评估支持这一观点:现有基准上的强模型性能可能反映对任务分布的熟悉程度,而非真正的跨平台GUI能力。值得注意的是,模型排名在移植任务和macOS原生任务之间发生反转,领先模型在MacArena子集上落后超过26%,表明macOS对当前GUI代理构成了一个真正更困难的环境。

英文摘要

Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld, which serve both as evaluation tools and as training environments for reinforcement learning. However, macOS remains underserved in this landscape: the only existing benchmark, macOSWorld, covers a narrow slice of first-party applications with simpler tasks, and runs on x86 virtual machines incompatible with Apple Silicon. We introduce MacArena, a benchmark of 421 manually verified tasks spanning 50 applications that combines a curated port of OSWorld tasks, content sourced from macOSWorld, and 49 new macOS-native tasks, all running on Apple's native Virtualization framework on Apple Silicon. We argue that macOS presents distinct GUI challenges beyond what Linux-based benchmarks capture, and our evaluation supports this claim: strong model performance on existing benchmarks can reflect familiarity with task distributions rather than genuine cross-platform GUI competence. Notably, model rankings invert between ported and macOS-native tasks, with a leading model trailing by over 26% on the MacArena subset, suggesting that macOS poses a genuinely harder environment for current GUI agents.

2606.06566 2026-06-08 cs.SE cs.AI 交叉投稿

NTILC: Neural Tool Invocation via Learned Compression

NTILC: 通过学习的压缩实现神经工具调用

Andrew Krikorian, Yayuan Li, Jason J. Corso

发表机构 * Department of Robotics, University of Michigan(机器人学系,密歇根大学) Department of ECE, University of Michigan(电气与计算机工程系,密歇根大学)

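AI summary Proposes NTILC, a framework that replaces in-context tool lookup with learned latent retrieval and decouples tool selection from argument generation. A signature-aware composite loss improves selection accuracy, and the framework reduces context consumption by over 95% and latency by up to 74% relative to long-context baselines.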

Comments 10 Pages, 4 Figures, 5 Tables, 1 Algorithm

详情
AI中文摘要

基于代理的工具调用语言模型依赖于大量可调用API、函数和本地动作的注册表。直接将完整工具规范放入提示中的成本随工具注册表大小线性增长,迅速消耗上下文预算。随着注册表增长,这导致更高的延迟和降低的选择准确性,特别是由于不相关工具的干扰。我们通过引入NTILC(一种神经工具选择和调用框架)克服了这些限制,该框架用学习的潜在检索取代了上下文注册表查找。NTILC将用户意图和工具规范映射到共享嵌入空间,通过外部检索而非上下文查找实现工具选择。语言模型仅基于所选工具模式进行条件生成,从而实现精确、受限的参数生成。我们方法的核心是签名感知复合目标,它通过从工具签名(例如参数模式、类型兼容性和返回类型)派生的约束来增强语义相似性。通过将Circle Loss与Functional Margin Loss相结合,模型强制区分语义相似但在执行签名下不兼容的工具。我们在公开的工具选择和函数调用数据集上评估NTILC,并报告上下文令牌使用量、检索准确性和选择延迟指标。在这些设置中,与长上下文ICT基线相比,NTILC将上下文窗口消耗减少了95%以上,推理延迟减少了高达74%。

英文摘要

Agentic tool-calling language models depend on large registries of callable APIs, functions, and local actions. Placing full tool specifications directly in the prompt incurs a cost that scales linearly with the size of the tool registry, rapidly consuming the context budget. As the registry grows, this leads to higher latency and degrades selection accuracy, particularly due to interference from irrelevant tools. We overcome these limitations by introducing NTILC, a neural tool selection and invocation framework that replaces in-context registry look-up with learned latent retrieval. NTILC maps both user intent and tool specifications into a shared embedding space, enabling tool selection via external retrieval rather than in-context lookup. The language model is conditioned only on the selected tool schema, allowing for precise, constrained argument generation. Central to our approach is a signature-aware composite objective, which augments semantic similarity with constraints derived from tool signatures (e.g., argument schema, type compatibility, and return types). By combining Circle Loss with a Functional Margin Loss, the model enforces separation between tools that are semantically similar but incompatible under their execution signatures. We evaluate NTILC on public tool-selection and function-calling datasets and report context token usage, retrieval accuracy, and selection latency metrics. Across these settings, NTILC reduces context window consumption by over 95% and inference latency by up to 74% compared to long-context ICT baselines.
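
The retrieval step described above, selecting a tool in a shared embedding space and passing only its schema to the language model, can be sketched as follows. The registry contents, tool names, and hand-written embedding vectors are hypothetical; in the paper the embeddings are learned with a signature-aware objective, which this sketch does not attempt to reproduce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_tool(query_emb, registry):
    """Return the name and schema of the tool whose embedding is closest to the query."""
    name = max(registry, key=lambda n: cosine(query_emb, registry[n]["embedding"]))
    return name, registry[name]["schema"]


# Hypothetical two-tool registry with toy embeddings.
registry = {
    "get_weather": {"embedding": [0.9, 0.1, 0.0], "schema": {"city": "string"}},
    "send_email": {"embedding": [0.0, 0.2, 0.9], "schema": {"to": "string", "body": "string"}},
}
tool_name, tool_schema = select_tool([1.0, 0.0, 0.1], registry)
```

Because only `tool_schema` would be placed in the prompt, the prompt size no longer grows with the number of registered tools.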

2606.06570 2026-06-08 cs.CR cs.AI 交叉投稿

MalTree: Tracing Malware Evolution from Embeddings at Scale

MalTree: 从嵌入中大规模追踪恶意软件演化

Akash Amalan, Georgios Smaragdakis, Tom J. Viering

发表机构 * Delft University of Technology(代尔夫特理工大学)

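AI summary Proposes MalTree, a framework that applies phylogenetic techniques from bioinformatics (UPGMA and Neighbor-Joining) to structural, behavioral, and image-based features to model malware evolution automatically. Validation against VirusTotal timestamps yields 87% temporal consistency, and the analysis reveals differences in mutation rates across families, supporting lineage-aware malware analysis.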

Comments 33 pages, accepted at ICML 2026

详情
AI中文摘要

恶意软件检测在很大程度上仍然是被动的:针对已知样本训练的机器学习模型随着威胁的演化而性能下降。理解恶意软件家族之间的演化关系可以为主动防御提供信息,但传统的逆向工程可能需要数月到数年才能揭示这种谱系关系。我们提出MalTree,一个框架,它大规模应用生物信息学启发的系统发育技术(UPGMA和邻接法),利用结构、行为和基于图像的特征自动建模恶意软件演化。我们引入基于VirusTotal时间戳的时间验证,以评估推断的树是否反映实际的演化顺序。MalTree达到了87%的时间一致性,表明推断的演化关系与真实世界的出现时间线紧密对齐。我们的分析显示,一些家族的变异速度比其他家族快10倍以上,这表明检测策略应针对家族特定的演化节奏进行调整。包括Mirai僵尸网络在内的案例研究证实,从我们的系统发育树推断的关系与记录在案的情报一致。我们的框架为将恶意软件分析从逐个样本分类转向谱系感知的演化建模奠定了基础。

英文摘要

Malware detection remains largely reactive: machine learning models trained on known samples degrade as threats evolve. Understanding evolutionary relationships among malware families can inform proactive defense, but traditional reverse engineering can take months to years to uncover such lineage relationships. We propose MalTree, a framework that applies bioinformatics-inspired phylogenetic techniques (UPGMA and Neighbor-Joining) at scale to model malware evolution automatically using structural, behavioral, and image-based features. We introduce temporal validation using VirusTotal timestamps to assess whether inferred trees reflect actual evolutionary order. MalTree achieves 87% temporal consistency, indicating that inferred evolutionary relationships closely align with real-world emergence timelines. Our analysis shows that some families mutate over 10 times faster than others, suggesting that detection strategies should be tailored to family-specific evolutionary tempos. Case studies, including the Mirai botnet, confirm that inferred relationships from our phylogenetic tree align with documented threat intelligence. Our framework provides a foundation for shifting malware analysis from sample-by-sample classification toward lineage-aware evolutionary modeling.
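
UPGMA, one of the two tree-building methods named above, is average-linkage agglomerative clustering. The sketch below is a small pure-Python version operating on a toy distance matrix; the sample labels and distances are hypothetical, and it does not reflect how MalTree derives distances from embeddings or scales to large corpora.

```python
def upgma(names, dist):
    """Build a nested-tuple tree by UPGMA (size-weighted average linkage).

    names: list of leaf labels.
    dist: symmetric matrix (list of lists) of pairwise leaf distances.
    """
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (names[i], 1) for i in range(len(names))}
    # Pairwise distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        (a, b), _ = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Average over all leaf pairs, hence the size weighting.
            merged[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        for c, v in merged.items():
            d[(c, next_id)] = v  # existing ids are always smaller than next_id
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Toy example: samples A and B are close to each other, C is distant from both.
tree = upgma(["A", "B", "C"], [[0, 2, 8], [2, 0, 8], [8, 8, 0]])
```

Here the two closest samples are merged first and the distant one joins at the root, which is the behavior a lineage tree relies on.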

2606.06572 2026-06-08 cs.LG cs.AI cs.CY econ.GN q-fin.EC 交叉投稿

Generative Models Erode Human Temporal Learning Through Market Selection

生成模型通过市场选择侵蚀人类时间学习

Wenjun Cao

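AI summary Argues that modern generative models, at sub-AGI capability levels, erode human temporal learning through market selection. The paper describes a value-collapse pathway, formalizes it with a costly-inspection framework, and maps cross-domain evidence onto four stages of verification erosion.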

Comments Accepted at ICML 2026

详情
Journal ref
Forty-third International Conference on Machine Learning Position Paper Track (2026)
AI中文摘要

我们认为,现代生成模型在当前亚AGI能力水平上对知识和文化生产造成了结构性风险。我们将人类时间学习(HTL)定义为通过长期持续参与问题而形成的路径依赖的知识积累。生成输出在表面特征上越来越像HTL密集型工作,因此验证给定输出是否反映真正的人类学习的成本相对于其预期收益变得高昂。一旦验证失去经济合理性,评估者就会奖励输出而不论其生产模式,而投入多年学习的生产者则在与几乎零成本生成的输出的价格竞争中处于劣势。我们将这一路径称为价值崩溃,并通过一个昂贵检验框架将其形式化。来自学术出版、法律实践、内容平台和软件安全的跨领域证据映射出验证侵蚀的四个阶段。对齐成功是正交的。更好的对齐模型缩小了人类与AI输出之间的可观察差距,使得来源验证更加困难,并加剧了对HTL密集型工作的竞争压力,即使单个AI输出有所改进。

英文摘要

We argue that modern generative models create structural risks for knowledge and cultural production at current, sub-AGI capability levels. We define Human Temporal Learning (HTL) as path-dependent knowledge accumulation through sustained engagement with problems over time. Generative outputs increasingly resemble HTL-intensive work in surface features, so verifying whether a given output reflects genuine human learning grows costly relative to its expected benefit. Once verification loses economic justification, evaluators reward outputs regardless of production mode, and producers who invested years of learning compete on price against outputs that cost almost nothing to generate. We call this pathway value collapse and formalize it through a costly-inspection framework. Cross-domain evidence from academic publishing, legal practice, content platforms, and software security maps onto four stages of verification erosion. Alignment success is orthogonal. Better-aligned models narrow observable gaps between human and AI outputs, making source verification harder and intensifying competitive pressure against HTL-intensive work even when individual AI outputs improve.

2606.06601 2026-06-08 cs.CV cs.AI cs.LG 交叉投稿

Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

通过分解视觉代理实现直接3D感知物体插入

Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, Chen Change Loy

发表机构 * Google(谷歌) Black Forest Labs(黑森林实验室)

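AI summary Proposes DIRECT, a framework that decomposes guidance into appearance, geometry, and context components to enable object insertion with controllable 3D pose, outperforming existing methods in geometric controllability and visual quality.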

Comments ICML 2026; Project Page: https://gong1130.github.io/DIRECT/

详情
AI中文摘要

物体插入旨在将参考对象无缝合成到背景图像的指定区域。最近的基于扩散的方法实现了高视觉质量,但将插入视为简单的2D修复任务,无法显式控制对象的3D姿态,限制了其实用性。我们提出DIRECT(用于参考组合和目标集成的分解注入),一种新颖框架,将交互式姿态操作与高保真2D图像合成相结合,实现姿态可控的物体插入。我们的方法将插入条件分解为三个互补组件:从参考对象捕获视觉细节的外观引导、从用户调整的3D代理派生的几何引导以及来自目标背景的上下文引导。通过将它们注入到不同路径,DIRECT避免了特征纠缠,同时保留了参考外观、遵循用户指定的姿态并使对象适应目标场景。我们还引入了一个自动数据构建流程,以提高训练数据的多样性和质量。实验表明,DIRECT在几何可控性和视觉质量方面均优于先前方法。

英文摘要

Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object's 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.

2606.06614 2026-06-08 cs.CL cs.AI cs.HC 交叉投稿

Re-Centering Humans in LLM Personalization

重新将人类置于LLM个性化中心

Lechen Zhang, Jiarui Liu, Tal August

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Carnegie Mellon University(卡内基梅隆大学)

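AI summary Studies the performance gap in LLM personalization between synthetic and human data. Collected human conversations and judgments reveal system limitations in attribute extraction, relevant-attribute pairing, and personalized response generation, and lightweight training-based interventions are introduced to narrow the gap.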

详情
AI中文摘要

尽管兴趣日益增长,但大多数对大型语言模型(LLM)个性化能力的评估都依赖于合成数据。目前尚不清楚当前的个性化系统对真实用户的效果如何。在本文中,我们研究了LLM个性化在使用合成数据与人类数据时的性能差距。我们收集了人类对话(550个对话)和个性化三个阶段的判断:从对话中提取用户属性(5,949个判断),将相关属性与新提示配对(11,919个),以及将相关属性融入个性化响应(1,101个)。纳入人类数据揭示了每个阶段的系统局限性。模型难以从人类对话中提取属性,与人类在相关属性上的判断不一致,并且生成的个性化响应被人类评价为并不优于通用响应(尽管LLM广泛评价为更好)。我们在前两个阶段引入了两种轻量级基于训练的干预措施,使自动化个性化评估更接近人类数据。然而,在第三阶段,我们发现学习到的奖励模型与人类评分的相关性仅达到中等水平,这表明与人类一致的个性化质量判断难以直接建模。我们收集的数据为研究模型如何以人类认为有用的方式提取、选择和整合用户信息提供了基础。

英文摘要

Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect human conversations (550 conversations) and judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.

2606.06615 2026-06-08 cs.SD cs.AI cs.LG eess.AS 交叉投稿

FIGMA: Towards FIne-Grained Music retrievAl

FIGMA:迈向细粒度音乐检索

Nishit Anand, Ashish Seth, Sreyan Ghosh, Dinesh Manocha, Ramani Duraiswami

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

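AI summary Existing music retrieval models cannot handle queries about fine-grained attributes. FIGMA, a multi-view contrastive architecture, jointly optimizes global audio-text alignment and frame-level, token-wise alignment to capture both high-level semantics and fine-grained musical attributes in a unified representation space, achieving substantial gains on a newly constructed fine-grained music caption dataset.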

Comments Accepted to ACL 2026. Project Website: https://nishitanand.github.io/figma-website/

详情
AI中文摘要

使用自然语言描述检索音乐已通过对比音频-文本模型(如CLAP)得到改进,但当前系统仍局限于粗粒度语义查询。当描述指定细粒度音乐属性(如速度、调性、和弦进行或节奏结构)时,现有模型通常无法检索到正确的音频。我们表明,这一限制源于对比学习目标本身:尽管在长描述上训练,基于CLAP的模型实际上仅利用前几个标记,丢弃了详细提示中编码的大量信息。然后,我们提出FIGMA(细粒度音乐检索),一种多视角对比架构,通过联合优化全局音频-文本对齐和帧级、标记级对齐来解决这一限制。该设计使FIGMA能够在统一表示空间中捕获高层语义上下文和细粒度音乐属性。此外,我们形式化了细粒度音乐检索任务,并构建了细粒度音乐描述数据集(FGMCaps),一个包含38万音乐-描述对的大规模训练数据集以及1万测试集,两者都标注了速度、调性、和弦进行、节拍数以及流派和情绪。大量实验表明,FIGMA在多个音乐检索基准(包括域外评估)上持续优于现有基于CLAP的音乐检索模型,相对改进高达73.3%。

英文摘要

Retrieving music using natural language descriptions has improved with contrastive audio-text models such as CLAP, but current systems remain limited to coarse semantic queries. When descriptions specify fine-grained musical attributes such as tempo, key, chord progression, or rhythmic structure, existing models often fail to retrieve the correct audio. We show that this limitation stems from the contrastive learning objective itself: despite being trained on long captions, CLAP-based models effectively utilize only the first few tokens, discarding much of the information encoded in detailed prompts. We then propose FIGMA (FIne-Grained Music RetrievAl), a multi-view contrastive architecture that addresses this limitation by jointly optimizing global audio-text alignment and frame-level, token-wise alignment. This design enables FIGMA to capture both high-level semantic context and fine-grained musical attributes within a unified representation space. Moreover, we formalize the task of Fine-Grained Music Retrieval and construct the Fine-Grained Music Caption dataset (FGMCaps), a large-scale dataset of 380K music-caption pairs for training along with a 10K test set, both annotated with tempo, key, chord progression, and beat count, as well as genre and mood. Extensive experiments demonstrate that FIGMA consistently outperforms existing CLAP-based music retrieval models across multiple music retrieval benchmarks, including out-of-domain evaluations, with relative improvements of up to 73.3%.

2606.06618 2026-06-08 cs.RO cs.AI cs.LG 交叉投稿

ChronoForest: Closed-Loop Multi-Tree Diffusion Planning for Efficient Bridge Search and Route Composition

ChronoForest: 用于高效桥接搜索和路线组合的闭环多树扩散规划

Jungmin Seo, Jaesik Park

发表机构 * Seoul National University(首尔国立大学)

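AI summary Addresses long-horizon route planning when only short-horizon offline trajectories are available. ChronoForest couples local bridge search with global route re-solving through an anchor-chaining tree diffusion planner and an online multi-tree orchestrator, substantially improving success rates and efficiency on OGBench and Hamiltonian route-composition benchmarks.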

Comments 40 pages, 4 figures, 7 tables, 3 algorithms

详情
AI中文摘要

当仅有短程离线轨迹可用时,我们如何规划到达指定目标、访问必经航点且保持路径短的长程路线?这一问题在离线导航中至关重要,因为收集足够丰富的长程数据十分困难,但真实智能体仍需以路线级效率(而非仅仅可行性)解决长程任务。难点有两方面:在微观层面,组合多个短程片段会在搜索代价和路径质量之间产生权衡;在宏观层面,航点排序需要比较起点、目标和航点锚点之间的成对旅行代价,而这些锚点在规划前未知,且仅通过长程时间距离估计时可靠性下降。本文提出ChronoForest,一种闭环规划系统,通过锚链树扩散规划器和在线多树协调器,将局部桥接搜索与在线路线重解耦合。ChronoForest利用时间距离进行短程引导和节点评估,同时利用搜索时的桥接证据验证长程锚点连通性,并反复重解路线。在OGBench AntMaze-Stitch上,ChronoForest在中等、大型和巨型分片上分别达到99.8%、99.3%和99.5%的成功率,并在巨型拼接任务上相比先前报道的扩散方法提升高达34.5个百分点。在哈密顿路线组合基准上,在线重解纠正了较差的时间排序,提升了路线质量,同时代价远低于穷举规划。

英文摘要

How can we plan long-horizon routes that reach designated goals, visit required waypoints, and remain short when only short-horizon offline trajectories are available? This problem matters in offline navigation because collecting sufficiently rich long-horizon data is difficult, yet real agents must still solve long-range tasks with route-level efficiency rather than mere feasibility. The difficulty is twofold: at the microscopic level, composing many short-horizon segments creates a trade-off between search cost and path quality, while at the macroscopic level, waypoint ordering requires comparing pairwise travel costs among start, goal, and waypoint anchors that are unknown before planning and increasingly unreliable when estimated only from long-range temporal distance. In this paper, we propose ChronoForest, a closed-loop planning system that couples local bridge search and online route re-solving through an anchor-chaining tree diffusion planner and an online multi-tree orchestrator. ChronoForest uses temporal distance for short-range guidance and node evaluation, while using search-time bridge evidence to validate long-range anchor connectivity and repeatedly re-solve the route. On OGBench AntMaze-Stitch, ChronoForest achieves 99.8%, 99.3%, and 99.5% success on the medium, large, and giant splits and improves giant-stitch success by up to 34.5 points over prior reported diffusion-based results. On Hamiltonian route-composition benchmarks, online re-solving corrects poor temporal orderings and improves route quality while remaining substantially cheaper than exhaustive planning.
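
The macroscopic subproblem described above, choosing a waypoint order from pairwise travel costs among start, goal, and waypoint anchors, can be illustrated with a brute-force search. This is the exhaustive baseline over orderings, not ChronoForest's online re-solving; the anchor names and cost values are hypothetical.

```python
from itertools import permutations


def best_waypoint_order(start, goal, waypoints, cost):
    """Try every visiting order and return the cheapest one with its total cost.

    cost: dict mapping (from_anchor, to_anchor) to an estimated travel cost.
    """
    best_order, best_total = None, float("inf")
    for order in permutations(waypoints):
        route = (start, *order, goal)
        total = sum(cost[(u, v)] for u, v in zip(route, route[1:]))
        if total < best_total:
            best_order, best_total = order, total
    return best_order, best_total


# Hypothetical pairwise costs between anchors S (start), A, B (waypoints), G (goal).
cost = {
    ("S", "A"): 1, ("S", "B"): 5,
    ("A", "B"): 1, ("B", "A"): 1,
    ("A", "G"): 5, ("B", "G"): 1,
}
order, total = best_waypoint_order("S", "G", ["A", "B"], cost)
```

The number of orderings grows factorially with the number of waypoints, and the result is only as good as the cost estimates, which is why validating long-range costs during search matters.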

2606.06627 2026-06-08 cs.RO cs.AI cs.CV cs.LG 交叉投稿

What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

在日常生活人类视频上协同训练机器人操作策略时什么因素重要?

Richard Li, Aditya Prakash, Andrew Wen, Saurabh Gupta, Yilun Du, Pulkit Agrawal

发表机构 * Massachusetts Institute of Technology(麻省理工学院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Harvard University(哈佛大学)

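AI summary Studies how hand pose quality and the motion gap affect transfer when cotraining robot manipulation policies on everyday Internet videos, and proposes a cotraining recipe that improves absolute success rate by 29.7% across six manipulation tasks in the low-robot-data regime.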

Comments The project website is here: https://richardrl.github.io/what-matters-cotraining-human-videos/index.html

详情
AI中文摘要

用于协同训练机器人操作策略的人类视频数据集主要由精心策划的演示组成,其中动作被编排成类似机器人行为,并且使用专用硬件捕获3D手部姿态。更丰富的数据源是日常互联网视频,但哪些因素能够实现从这些视频到机器人的迁移仍是一个开放问题。我们使用一个新的数据集(包含532个人类视频,共28小时的高质量三角测量手部标签和自然动作)对此进行研究。我们发现手部姿态质量影响迁移,但即使手部姿态准确,固有的运动差距也会阻碍迁移,除非视觉和策略网络针对每种具身形态进行专门化。我们的协同训练方法在低机器人数据场景下,在六个操作任务中绝对成功率提升29.7%,并带来一致的改进。

英文摘要

Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of $29.7\%$ in the low-robot-data regime across six manipulation tasks.

2606.06635 2026-06-08 cs.CL cs.AI 交叉投稿

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

语言模型如何失败:承诺性和持续性推理错误的令牌级特征

Tanvi Thoria, Kiana Jafari, Marc R. Schlichting, Mykel J. Kochenderfer

发表机构 * Department of Computer Science, Stanford University(计算机科学系,斯坦福大学) Department of Aeronautics and Astronautics, Stanford University(航空航天工程系,斯坦福大学)

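AI summary Uses token-level uncertainty signals to divide language model reasoning failures into committed failures (early lock-in to an incorrect path) and persistent uncertainty (uncertainty that accumulates throughout the trace). The framework's predictions hold in 20 of 23 model-dataset configurations and offer guidance for self-consistency strategies.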

详情
AI中文摘要

语言模型推理中的失败通过不同的过程产生,这些过程在推理轨迹中留下可识别的特征。我们使用令牌级不确定性信号来表征这些失败,发现它们通过两个经验上可区分的过程出现。第一个是承诺性失败,其中模型在其轨迹早期锁定到错误的推理路径。一个核心诊断特征是承诺点,超过该点考虑额外的令牌会损害而不是帮助失败检测。在第二个过程中,持续性不确定性,不确定性反而在整个过程中累积,并且需要完整的轨迹来最好地区分失败和成功的完成。这些特征在23个模型-数据集配置中重现,该框架的可证伪预测在23个案例中的20个中成立,远高于两种失败模式下的随机水平。最后,我们展示了我们的失败模式框架对自我一致性有直接影响,识别了不确定性信号何时补充它以及何时可以选择性地跳过它。这些结果为理解何时LLM推理失败变得可检测以及相应调整检测策略提供了基础。

英文摘要

Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework's falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate our failure mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.
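
One common token-level uncertainty signal is the entropy of the next-token distribution. The sketch below scores either a prefix of the trace or the full trace, which mirrors the contrast between the two failure modes described above. Treating entropy as the signal, averaging over a window, and the `cutoff` parameter are assumptions for illustration, not the paper's stated method.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def prefix_uncertainty(entropies, cutoff=None):
    """Mean entropy over the first `cutoff` tokens, or the whole trace if None."""
    window = entropies if cutoff is None else entropies[:cutoff]
    if not window:
        raise ValueError("need at least one token")
    return sum(window) / len(window)


# Toy trace of three token distributions (hypothetical values).
trace = [token_entropy(p) for p in ([0.5, 0.5], [1.0], [0.25, 0.25, 0.25, 0.25])]
early_score = prefix_uncertainty(trace, cutoff=2)  # score from an early prefix only
full_score = prefix_uncertainty(trace)             # score from the full trace
```

Under this framing, a detector for committed failures would stop at an early cutoff, while one for persistent uncertainty would use the full trace.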

2606.06664 2026-06-08 cs.CV cs.AI cs.LG 交叉投稿

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

内在视觉:神经科学启发的概念电路用于解释和引导视觉变换器

Tang Li, Yanlin Chen, Mengmeng Ma, Xi Peng

发表机构 * University of Science and Technology of China(中国科学技术大学)

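AI summary Proposes ViSAE, a toolbox that interprets the internal mechanisms of Vision Transformers through neuroscience-motivated concept circuits. It comprises an efficient concept set, automatic circuit-tracing algorithms, and concept-editing applications, and improves worst-group accuracy on WaterBirds by 48.2%.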

Comments In Proceedings of the International Conference on Machine Learning, 2026. (acceptance rate 26.6%)

详情
AI中文摘要

尽管视觉变换器(ViT)具有高准确率,但其预测可能受到虚假线索的驱动,因此在安全部署前需要理解其内部工作机制。稀疏自编码器(SAE)为将模型表示分解为人类可解释的概念提供了有前景的视角,但由于对概念覆盖范围的控制有限以及特征解释的主观性和不可扩展性,将基于SAE的解释方法应用于ViT仍然具有挑战性。为填补这些空白,受神经科学启发原理的驱动,我们提出了ViSAE,一个通过概念电路理解ViT内部工作机制的机械可解释性工具箱。ViSAE包含三个组成部分:(1)一个包含64K图像和16K视觉基础概念词汇的探测套件,与ImageNet相比,概念覆盖效率提高了20倍,与现有概念集相比,解释准确率提高了28.7%。(2)自上而下的概念读取和自下而上的电路追踪算法,通过概念电路自动恢复ViT内部工作机制。(3)用于审计和引导ViT行为的应用。通过概念编辑,ViSAE在WaterBirds上将最差组准确率提高了48.2%,比现有方法高出23.8%。我们的数据和代码:https://github.com/deep-real/ViSAE。

英文摘要

Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: https://github.com/deep-real/ViSAE.
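
Worst-group accuracy, the metric reported on WaterBirds above, is the minimum accuracy over predefined groups. The sketch below computes it from labeled records; the group names and toy predictions are hypothetical and are not taken from the paper's results.

```python
from collections import defaultdict


def worst_group_accuracy(records):
    """records: iterable of (group, true_label, predicted_label).

    Returns the group with the lowest accuracy and that accuracy.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, pred in records:
        total[group] += 1
        correct[group] += int(truth == pred)
    if not total:
        raise ValueError("no records")
    per_group = {g: correct[g] / total[g] for g in total}
    worst = min(per_group, key=per_group.get)
    return worst, per_group[worst]


# Toy predictions: the model is perfect on one group and 50% on the other.
records = [
    ("landbird/land", 0, 0),
    ("landbird/land", 0, 0),
    ("waterbird/land", 1, 0),
    ("waterbird/land", 1, 1),
]
group, accuracy = worst_group_accuracy(records)
```

Average accuracy on these records is 75%, while the worst-group figure is 50%, which is why the metric is used to expose reliance on spurious cues.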

2606.06679 2026-06-08 cs.CL cs.AI cs.CY 交叉投稿

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

HKJudge:用于解释法院认定事实、推理过程和裁决结果的法律话语标注语料库

Xi Xuan, Wenxin Zhang, Yufei Zhou, King-kui Sin, Chunyu Kit

发表机构 * City University of Hong Kong(香港城市大学) University of Chinese Academy of Sciences(中国科学院大学)

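AI summary Introduces HKJudge, the first sentence-level expert-annotated legal discourse dataset, covering criminal judgments from all levels of Hong Kong's court hierarchy. It defines a two-tier discourse schema (26 rhetorical roles and 3 sentencing elements) and benchmarks BERT-based models and LLMs.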

详情
AI中文摘要

法院判决是法律实践和法理学的核心,然而香港判决的话语分析由于缺乏专家标注语料库而受到限制。我们引入了香港判决话语数据集(HKJudge),这是首个句子级专家标注的法律话语语料库。HKJudge包含香港法院层级所有五个级别的刑事判决,共计约29万句子和650万词元,由法律语言学专家完全标注。我们设计了一个双层话语模式,捕捉法院认定的事实、推理过程以及裁决结果。在句子层面,每个句子被分配26种修辞角色之一。在跨度层面,句子进一步标注了三个判刑要素(指控、监禁刑期、罚款)。十位法律语言学标注者进行了标注,标注者间一致性为κ=0.8。我们在HKJudge上定义了两个任务,称为修辞角色分类和法律要素提取,并提供了四种基于BERT的模型、两种开源LLM(在零样本和微调设置下)以及四种商业LLM在这两个任务上的首次基准评估。我们的工作展示了句子级话语标注对于建模香港判决结构的价值,并为未来法律判决预测研究提供了丰富的数据基础。HKJudge数据集和代码可在以下网址获取:https://github.com/xuanxixi/HKJudge。

英文摘要

Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK's court hierarchy, comprising $\sim$290k sentences and $\sim$6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of $\kappa = 0.8$. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at https://github.com/xuanxixi/HKJudge.
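
For readers unfamiliar with the agreement statistic quoted above, the sketch below computes Cohen's kappa for two annotators labeling the same sentences. The abstract does not say which kappa variant was used for its ten annotators, so the two-annotator form, the function name, and the toy role labels here are assumptions for illustration only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labeled at random with their own label rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical rhetorical-role labels for four sentences.
annotator_1 = ["FACT", "FACT", "REASONING", "REASONING"]
annotator_2 = ["FACT", "FACT", "REASONING", "FACT"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 3 of 4 sentences, chance agreement is 0.5, and kappa is therefore 0.5.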

2606.06694 2026-06-08 cs.LG cs.AI cs.CY 交叉投稿

The Geography of Algorithmic Judgment: LLM Intermediaries, Place Identity, and Racial Steering in Housing Search

算法判断的地理:LLM中介、地方身份与住房搜索中的种族引导

Hana Samad, Trung Lam, Christoph Mügge-Durum, Michael Akinwumi

发表机构 * National Fair Housing Institute(国家公平住房研究所)

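AI summary A behavioral audit of housing recommendations from seven LLMs across four U.S. cities finds that racial steering is an emergent behavior of the model's interpretive license rather than a static property, and that the city is not a neutral testing unit.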

Comments 13 pages with supplemental tables and figures, AIES '26 Submission

详情
AI中文摘要

大型语言模型(LLM)正迅速在住房搜索中扮演中介角色,通过将列表平台集成到对话界面中,调解城市环境中的信息访问、搜索和推荐。我们扩展了先前关于LLM中种族引导的研究,对七个开放权重和闭源LLM在四个美国城市进行了行为审计,测试了三种迭代提示条件下的位置推荐,这些条件逐步添加生活方式偏好背景并反映公平住房配对测试方法。我们发现,引导是模型解释性许可的涌现行为,而非主要静态属性。引导源于用户身份、偏好表达以及模型内化的关于特定城市中地点、偏好和机会的学习表征的空间逻辑之间的相互作用,以及不同类型用户如何与之关联。虽然存在引导,但在评估条件下其方向和幅度并不一致。偏好条件测试通常增加或重新配置了相对于基线条件表现出引导行为的模型数量,表明LLM可能根据用户的种族身份对相同的住房偏好做出不同解释。我们的发现还表明,在基于地点的领域进行LLM评估时,城市并非中性测试单元,一个本地市场的结果不能假定推广到另一个市场。在住房领域,需要本地和领域专业知识,以确保法律和制度对公平住房的承诺不会因采用调解空间访问的AI工具而受到损害。

英文摘要

Large language models (LLMs) are rapidly assuming an intermediary role in housing search through the integration of listing platforms within conversational interfaces, mediating access to information, search, and recommendations within urban settings. We expand on prior work on racial steering in LLMs by conducting a behavioral audit of seven open-weight and closed-source LLMs across four U.S. cities, testing location recommendations across three iterative prompting conditions that progressively add lifestyle preference context and reflect fair housing paired-testing methodologies. We find that steering is an emergent behavior of the model's interpretive license rather than primarily a static property. Steering results from the interaction of a user's identity, preference articulation, and the spatial logic that a model has internalized about learned representations of place, preference, and opportunity in a given city, and how different types of users relate to it. While steering was present, it was not uniform in direction or magnitude across evaluated conditions. Preference-conditioned testing often increased or reconfigured the number of models that exhibited steering behaviors relative to baseline conditions, suggesting that LLMs may interpret what the same housing preference means differently depending on the racial identity of the user. Our findings also demonstrate that the city is not a neutral testing unit for LLM evaluation in place-based sectors, and results from one local market cannot be assumed to generalize to another. Local and domain expertise will be required in the housing sector to ensure that legal and institutional commitments to fair housing are not undermined while adopting AI tools that mediate spatial access.

2606.06696 2026-06-08 cs.CV cs.AI 交叉投稿

MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

MMBU: 大规模多模态生物医学理解基准,用于探测视觉语言模型的感知能力

Ryan D'Cunha, Alejandro Lozano, Xiaoxiao Sun, Daniel Vela Jarquin, Min Woo Sun, Josiah Aklilu, James Burgess, Yuhui Zhang, Ryan Nayebi, Paola Avila, Robayo, Jin Ye, Ming Hu, Zhongying Deng, Junjun He, Xin Chen, Yue Yao, Robert Tibshirani, Jeffrey J. Nirschl, Serena Yeung-Levy

发表机构 * Stanford University(斯坦福大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Instituto Tecnológico de Monterrey(蒙特雷技术学院) Monash University(莫纳什大学) University of Cambridge(剑桥大学) Shanghai Jiao Tong University(上海交通大学) Shandong University(山东大学)

AI总结 提出MMBU基准,涵盖35个子模态,通过分类、定位和检测任务系统评估VLM在生物医学领域的视觉感知和泛化能力,发现高准确率可能掩盖感知缺陷。

详情
AI中文摘要

视觉和语言模型(VLM)在转变生物医学成像工作流程方面具有巨大潜力,从检测胸部X光片中的病变到显微镜下的细胞特征分析。然而,实现这一潜力需要稳健且细粒度的视觉感知。模型需要正确解释图像中的细微特征,并且必须在不同的生物医学模态、尺度和上下文中做到这一点。尽管如此,当前的基准仍然有限。为了解决这些差距,我们引入了大规模多模态生物医学理解(MMBU)基准。它是迄今为止最大的生物医学视觉和语言基准,涵盖35个子模态,具有丰富的结构化元数据。它包括开放和封闭版本的非接地分类、接地分类和物体检测,从而能够系统地评估模型在生物尺度、临床环境和成像模态上的性能。通过评估15个开放权重和2个前沿VLM,我们发现虽然医学适应为某些模型带来了可衡量的提升,但在既有基准上经常报告的高准确率可能掩盖视觉感知和领域泛化方面的缺陷。

英文摘要

Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the Massive Multimodal Biomedical Understanding (MMBU) benchmark. It is the largest biomedical vision and language benchmark to date, covering 35 submodalities with rich structured metadata. It includes both open and closed versions of ungrounded classification, grounded classification, and object detection, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities. Evaluating 15 open-weight and 2 frontier VLMs, we find that while medical adaptation provides measurable gains for some models, the high accuracy often reported on established benchmarks can mask deficiencies in visual perception and domain generalization.

2606.06712 2026-06-08 cs.CL cs.AI 交叉投稿

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

数据高效的自回归到扩散语言模型通过策略内蒸馏

Xingyu Su, Jacob Helwig, Shubham Parashar, Atharv Chagi, Lakshmi Jotsna, Degui Zhi, James Caverlee, Dileep Kalathil, Shuiwang Ji

发表机构 * Department of Computer Science and Engineering, Texas A&M University(德克萨斯农工大学计算机科学与工程系) Department of Bioinformatics and Systems Medicine, University of Texas Health Science Center at Houston(德克萨斯大学休斯顿健康科学中心生物信息学与系统医学系) Department of Electrical and Computer Engineering, Texas A&M University(德克萨斯农工大学电气与计算机工程系)

AI总结 提出策略内扩散语言模型(OPDLM),通过策略内蒸馏将自回归模型转换为扩散语言模型,解决分布偏移和训练-推理不匹配问题,实现15倍至7000倍更少训练数据下的强性能。

详情
AI中文摘要

我们研究将自回归模型(ARLM)转换为扩散语言模型(DLM)。与从头预训练不同,先前的工作将ARLM中的因果注意力替换为双向注意力,然后使用DLM目标训练得到的模型。然而,这些方法会导致两种分布偏移。首先,从下一个词预测目标过渡到DLM目标可能会丢弃ARLM在训练期间获得的知识。其次,标准DLM存在训练-推理不匹配,因为训练损失定义在随机掩码序列上,而不是推理时基于置信度解码产生的轨迹。为了解决这两个挑战,我们引入了策略内扩散语言模型(OPDLM),其中采用策略内蒸馏(OPD)进行ARLM到DLM的转换。具体来说,OPDLM通过自OPD训练,其中学生(具有双向注意力的ARLM)生成自己的轨迹,而教师(原始冻结的ARLM)通过在这些轨迹上提供目标logits来蒸馏其知识。通过直接以策略内方式训练,OPDLM消除了DLM中的训练-推理不匹配,同时从原始模型蒸馏增强了ARLM的知识保留。实验结果表明,OPDLM需要15倍到7000倍更少的训练token,并在各种任务上表现出强大的性能。OPDLM避免了DLM预训练的高昂成本,并将DLM转换定位为ARLM后训练的一种形式。

英文摘要

We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.

2606.06715 2026-06-08 cs.CL cs.AI cs.LG 交叉投稿

Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles

主题情感是否导致感知意识形态?比较政治新闻文章中人类与LLM的标注

Upasana Chatterjee

发表机构 * Columbia University(哥伦比亚大学)

AI总结 研究主题情感对感知政治意识形态的因果效应,通过比较人类与LLM标注,发现微调GPT-4o-mini产生显著因果效应,归因于捷径学习。

Comments Accepted to ACL SRW 2026

详情
AI中文摘要

我们探究主题情感是否对感知政治意识形态具有因果效应,以及答案是否取决于意识形态标签的分配者。使用来自AllSides的文章,结合Llama-3.3-70b-versatile的共享情感标注,我们比较了来自专家人类标注者、GPT-4o-mini(基线和微调)以及Llama-3.3-70B的意识形态标签。我们应用双重机器学习(DML)和社区级中介分析于所有四种标注范式。人类标注在社区水平未产生显著因果效应。微调后的GPT-4o-mini达到了最高的分类准确率(F1=72.48),并且是唯一在社区水平产生显著处理效应和中介中显著自然直接效应(NDE)的标注范式。我们将此解释为捷径学习的证据:对意识形态标签数据进行微调导致模型内化了一种虚假的情感-意识形态耦合,而这种耦合在人类判断中对此任务并不起作用。这种耦合在基于F1的评估中结构上不可见,对LLM标注作为银标签以及在下游因果分析中作为人类判断的代理的使用具有影响。

英文摘要

We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b-versatile, we compare ideology labels from expert human annotators, GPT-4o-mini (baseline and fine-tuned), and Llama-3.3-70B. We apply Double Machine Learning (DML) and community-level mediation analysis across all four annotation paradigms. Human annotations yield no significant causal effects at the community level. Fine-tuned GPT-4o-mini achieves the highest classification accuracy (F1=72.48) and is the only annotator paradigm that produces significant community-level treatment effects and significant natural direct effects (NDEs) in mediation. We interpret this as evidence of shortcut learning: fine-tuning on ideology-labeled data causes the model to internalise a spurious sentiment-ideology coupling not operative in human judgment for this task. This coupling is structurally invisible to F1-based evaluation, with implications for the use of LLM annotations as silver labels and as proxies for human judgment in downstream causal analyses.
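The Double Machine Learning step mentioned above can be illustrated with a minimal partialling-out sketch. This is not the paper's pipeline: the abstract does not say which nuisance learners were used, so plain least-squares models are swapped in here, and the data are synthetic. `Y` stands in for an ideology score, `D` for topic sentiment, and `X` for covariates; all three names are assumptions made for illustration.

```python
import numpy as np

def dml_partial_linear(Y, D, X, n_folds=2, seed=0):
    """Cross-fitted partialling-out estimate of theta in Y = theta*D + g(X) + e.

    Nuisance functions E[Y|X] and E[D|X] are fitted with ordinary least
    squares (an illustrative choice, not the paper's stated learner).
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    folds = np.array_split(rng.permutation(n), n_folds)
    Xb = np.hstack([np.ones((n, 1)), X])  # add an intercept column
    res_Y = np.empty(n)
    res_D = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Fit nuisance models on the training folds only (cross-fitting).
        coef_Y, *_ = np.linalg.lstsq(Xb[train_mask], Y[train_mask], rcond=None)
        coef_D, *_ = np.linalg.lstsq(Xb[train_mask], D[train_mask], rcond=None)
        # Residualize outcome and treatment on the held-out fold.
        res_Y[test_idx] = Y[test_idx] - Xb[test_idx] @ coef_Y
        res_D[test_idx] = D[test_idx] - Xb[test_idx] @ coef_D
    # Regress outcome residuals on treatment residuals.
    return float(res_D @ res_Y / (res_D @ res_D))

# Synthetic data with a known effect of 2.0 (hypothetical, for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 3))
D = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=2000)
Y = 2.0 * D + X @ np.array([0.5, 0.5, 0.5]) + 0.1 * rng.normal(size=2000)
theta_hat = dml_partial_linear(Y, D, X)
```

On this synthetic data the estimate should land close to the planted effect; with real annotations the nuisance models would typically be more flexible learners.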

2606.06717 2026-06-08 cs.LG cs.AI q-bio.BM q-bio.QM 交叉投稿

ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets

ShallowBench: 浅口袋靶标上的生成式药物设计模型基准测试

Saket Reddy, Shiwei Liu

发表机构 * University of Illinois - Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出ShallowBench基准,包含5780个浅口袋靶标,用于评估生成式药物设计模型在低凹度界面上的性能,揭示现有模型预测结合亲和力较弱的问题。

详情
AI中文摘要

虽然生成式AI模型在基于结构的药物设计中已展现出显著成功,但它们主要依赖深结合口袋,难以对具有挑战性的低口袋性靶标(如历史上“不可成药”的肿瘤靶标KRAS和MYC)采样有效配体。为弥补这一空白,我们引入了ShallowBench,这是一个从CrossDocked2020中提取的包含5780个浅口袋靶标的严格精选基准。通过计算Alpha Shape“盖子”体积与底层蛋白质原子体素体积之间的差异,我们成功分离出低凹度靶标,同时确保足够的结合表面积。评估多种最先进的生成模型显示,在这些低凹度界面上预测的结合亲和力较弱。因此,ShallowBench为生成生物学模型提供了一个严格的基准,并强调了需要能够应对这些具有挑战性靶标的新型架构创新或损失函数。

英文摘要

While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets, such as the historically "undruggable" oncology targets KRAS and MYC. To address this gap, we introduce ShallowBench, a strictly curated benchmark of 5,780 shallow-pocket targets extracted from CrossDocked2020. By computing the difference between an Alpha Shape "lid" volume and the underlying protein atom voxel volume, we successfully isolated targets with low concavity while ensuring sufficient surface area for binding. Evaluating various state-of-the-art generative models reveals weaker predicted binding affinity on these low-concavity interfaces. ShallowBench therefore provides a rigorous benchmark for generative biology models and highlights the necessity of new architectural innovations or loss functions capable of navigating these challenging targets.

2606.06718 2026-06-08 cs.LG cs.AI cs.SY eess.SY 交叉投稿

MSAIC-Net: A Multi-Scale Attention and Imbalance-Aware Contrastive Network for ECG-Based Myocardial Substrate Abnormality Detection

MSAIC-Net:用于基于心电图的心肌基质异常检测的多尺度注意力和不平衡感知对比网络

Canyu Lei, Fenglin Zhang, Derek Bivona, Cristiane Singulane, Jonathan Pan, Kenneth Bilchick, Amit R. Patel, Jianxin Xie

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 提出多尺度注意力增强卷积网络MSAIC-Net,通过并行空洞卷积提取多尺度特征、通道注意力重加权、不平衡感知对比学习及导联置换重要性分析,在低数据量UVA队列和大规模PTB-XL数据集上实现心肌瘢痕和心肌梗死检测的准确性和可解释性提升。

详情
AI中文摘要

心肌基质异常,如心肌瘢痕和心肌梗死(MI),与不良心血管结局相关。心电图(ECG)为检测这些异常提供了一种低成本且广泛可用的工具,但由于异质性导联依赖性表现、高维多导联信号、类别不平衡以及深度学习模型的可解释性有限,基于ECG的检测仍然具有挑战性。我们提出了一种多尺度注意力增强卷积网络(MSAIC-Net)用于基于ECG的心肌基质异常检测。MSAIC-Net采用并行空洞卷积分支,在多个时间感受野上提取ECG特征,使模型能够捕捉局部和更长时间范围的时间模式。然后使用通道注意力自适应地重新加权信息性导联和特征通道表示。为了解决类别不平衡并提高特征可分性,我们引入了一种新颖的不平衡感知监督对比学习策略,鼓励同一类别的样本形成紧凑表示,同时增加异常和正常样本之间的分离。进一步引入导联置换重要性来量化每个ECG导联的贡献并提高模型可解释性。该方法在两个互补数据集上进行了评估:来自弗吉尼亚大学(UVA)健康系统的低数据量机构队列用于心肌瘢痕分类,以及来自PhysioNet的大规模公共PTB-XL数据集用于MI识别。实验结果表明,MSAIC-Net优于基线模型,在低数据量的UVA队列中改进尤为显著。总体而言,所提出的框架为基于ECG的心肌基质异常检测提供了一种有效且可解释的方法。

英文摘要

Myocardial substrate abnormalities, such as myocardial scar and myocardial infarction (MI), are associated with adverse cardiovascular outcomes. Electrocardiography (ECG) provides a low-cost and widely available tool for detecting these abnormalities, but ECG-based detection remains challenging due to heterogeneous lead-dependent manifestations, high-dimensional multi-lead signals, class imbalance, and the limited interpretability of deep learning models. We propose a multi-scale attention-enhanced convolutional network (MSAIC-Net) for ECG-based myocardial substrate abnormality detection. MSAIC-Net employs parallel atrous convolutional branches to extract ECG features across multiple temporal receptive fields, enabling the model to capture both local and longer-range temporal patterns. Channel attention is then used to adaptively reweight informative lead-wise and feature-channel representations. To address class imbalance and improve feature separability, we introduce a novel imbalance-aware supervised contrastive learning strategy that encourages samples from the same class to form compact representations while increasing separation between abnormal and normal samples. Lead-wise permutation importance is further incorporated to quantify the contribution of each ECG lead and improve model interpretability. The proposed method was evaluated on two complementary datasets: a low-data institutional cohort from the University of Virginia (UVA) Health System for myocardial scar classification and the large-scale public PTB-XL dataset from PhysioNet for MI identification. Experimental results show that MSAIC-Net outperforms baseline models, with particularly pronounced improvements in the low-data UVA cohort. Overall, the proposed framework provides an effective and interpretable approach for ECG-based detection of myocardial substrate abnormalities.
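Lead-wise permutation importance, as named in the abstract, can be sketched generically: shuffle one ECG lead across samples and measure how much accuracy drops. The function below is an assumption-laden illustration, not MSAIC-Net's code; `predict`, the array layout `(samples, leads, timesteps)`, and the toy classifier are all hypothetical.

```python
import numpy as np

def lead_permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean accuracy drop when each lead is shuffled across samples.

    X has shape (n_samples, n_leads, n_timesteps); predict maps X to labels.
    """
    rng = np.random.default_rng(seed)
    base_acc = np.mean(predict(X) == y)
    n_leads = X.shape[1]
    importance = np.zeros(n_leads)
    for lead in range(n_leads):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            perm = rng.permutation(X.shape[0])
            # Replace this lead with the same lead taken from other samples.
            X_perm[:, lead, :] = X[perm, lead, :]
            drops.append(base_acc - np.mean(predict(X_perm) == y))
        importance[lead] = np.mean(drops)
    return importance

# Toy classifier (hypothetical) that only looks at lead 0.
def toy_predict(X):
    return (X[:, 0, :].mean(axis=1) > 0).astype(int)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3, 50))
y = toy_predict(X)
scores = lead_permutation_importance(toy_predict, X, y)
```

In this toy setup only lead 0 carries signal, so its score is positive while the other leads score exactly zero.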

2606.06721 2026-06-08 cs.RO cs.AI 交叉投稿

SCOUT: Semantic scene COverage via Uncertainty-guided Traversal

SCOUT: 基于不确定性引导遍历的语义场景覆盖

Junyu Mao, Sara Ayoubi, Vishnu D. Sharma, Ilija Hadžić, Matthew Andrews

发表机构 * Nokia Bell Labs, France(诺基亚贝尔实验室,法国) Nokia Bell Labs, Murray Hill, NJ, USA(诺基亚贝尔实验室,美国,新泽西州 Murray Hill) Imperial College London(伦敦帝国理工学院) Locus Robotics(Locus机器人技术公司)

AI总结 提出SCOUT框架,通过不确定性引导的遍历规划与概率场景图构建的闭环,使机器人主动探索并逐步理解环境,实现语义场景完整性作为操作目标。

Comments 2026 ICRA Workshop on Uncertainty in Open World Robotics

详情
AI中文摘要

长时间运行的机器人不应仅仅访问空间,而应逐步理解空间。然而,大多数3D场景图管线将感知视为固定数据集上的后处理阶段,将场景表示与决定首先观察什么的决策解耦。我们提出SCOUT,一种在线语义探索框架,通过将主动遍历与概率场景图构建耦合来闭合这一循环。给定先验2D占用地图和带姿态的RGB-D观测,SCOUT增量构建一个不确定性感知的3D场景图,其节点维护融合的几何和开放词汇对象标签的后验信念,而边编码结构关系,如在上、内部、属于和旁边。这些信念被反馈给不确定性引导的遍历规划器,该规划器通过平衡期望语义确定性增益、几何覆盖增益和旅行成本来选择视点。这样,当额外证据重要时,机器人重新访问模糊对象,当场景不完整时,扩展到未见的自由空间。由此产生的系统将语义场景完整性视为操作目标,而非语义映射的被动副产品,朝着能够在最少人工干预下巡逻、更新和推理不断变化的室内环境的自主智能体迈进。

英文摘要

Robots that operate over extended periods should not merely visit space; they should progressively understand it. Yet most 3D scene graph pipelines treat perception as a post-processing stage over a fixed dataset, decoupling scene representation from the decisions that determine what is observed in the first place. We present SCOUT, an online semantic exploration framework that closes this loop by coupling active traversal with probabilistic scene graph construction. Given a prior 2D occupancy map and posed RGB-D observations, SCOUT incrementally builds an uncertainty-aware 3D scene graph whose nodes maintain fused geometry and posterior beliefs over open-vocabulary object labels, while edges encode structural relations such as on, inside, belong, and next to. These beliefs are fed back to an uncertainty-guided traversal planner, which selects viewpoints by balancing expected semantic certainty gain, geometric coverage gain, and travel cost. In this way, the robot revisits ambiguous objects when additional evidence matters and expands into unseen free space when the scene remains incomplete. The resulting system treats semantic scene completeness as an operational objective rather than a passive by-product of semantic mapping, moving toward autonomous agents that can patrol, update, and reason about evolving indoor environments with minimal human intervention.
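The viewpoint trade-off described above (semantic certainty gain, geometric coverage gain, travel cost) can be sketched as a weighted score. The linear form, the weights, and the candidate names below are assumptions for illustration; the abstract does not give SCOUT's actual objective.

```python
from dataclasses import dataclass

@dataclass
class Viewpoint:
    name: str
    semantic_gain: float   # expected reduction in label uncertainty
    coverage_gain: float   # expected newly observed free space
    travel_cost: float     # cost of reaching the viewpoint

def score(vp, w_sem=1.0, w_cov=1.0, w_cost=0.5):
    """Hypothetical linear utility: gains minus weighted travel cost."""
    return w_sem * vp.semantic_gain + w_cov * vp.coverage_gain - w_cost * vp.travel_cost

def select_viewpoint(candidates, **weights):
    """Pick the candidate with the highest utility."""
    return max(candidates, key=lambda vp: score(vp, **weights))

candidates = [
    Viewpoint("revisit_ambiguous_chair", 0.8, 0.1, 0.4),
    Viewpoint("explore_unseen_corridor", 0.1, 0.9, 1.0),
    Viewpoint("stay_nearby", 0.05, 0.05, 0.1),
]
best_default = select_viewpoint(candidates)
best_coverage = select_viewpoint(candidates, w_sem=0.2)
```

With the default weights the ambiguous object is revisited; lowering the semantic weight shifts the choice toward unexplored space, which mirrors the two behaviors the abstract describes.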

2606.06740 2026-06-08 cs.SD cs.AI cs.CL 交叉投稿

Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

多语言多说话人单元声码器:离散语音表示的系统分析

Naman Kothari, Arjun Gangwar, Adarsh Arigala, S Umesh

发表机构 * National Institute of Technology, Trichy(印度国立技术学院蒂鲁吉拉帕利分校) Indian Institute of Technology, Madras(印度理工学院马德拉斯分校)

AI总结 分析基于BigVGAN的单元声码器在多语言多说话人语音生成中的表现,发现聚类大小控制可懂度,显式说话人条件防止身份崩溃,语言监督在低聚类大小时有益。

Comments 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

详情
AI中文摘要

通过k-means聚类自监督嵌入获得的离散语音单元纠缠了音素、说话人和语言信息,导致多语言多说话人语音生成中的说话人混合和跨语言干扰。尽管在音频大语言模型和语音到语音系统中使用日益增多,单元声码器仍然研究不足。我们分析了基于BigVGAN的单元声码器,涵盖四种印度语言。我们使用WER、说话人相似度和单元级指标研究了聚类大小与条件策略之间的相互作用。结果表明,聚类大小通过提高音素区分性来控制可懂度,而显式说话人条件对于防止身份崩溃不可或缺。语言监督主要在单元仍模糊的较小聚类大小时带来进一步收益。我们的分析显示,在较小库存时,不同语言中相似音素会坍缩到相同的聚类ID,而较大的聚类会逐渐将它们分离。

英文摘要

Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. Despite growing use in Audio LLMs and speech-to-speech systems, unit vocoders remain underexplored. We analyze a BigVGAN-based unit vocoder across four Indian languages. We study the interaction between cluster size and conditioning strategies using WER, speaker similarity, and unit-level metrics. Results show that cluster size governs intelligibility by improving phonetic discriminability, while explicit speaker conditioning is indispensable for preventing identity collapse. Language supervision yields further gains mainly at lower cluster sizes where units remain ambiguous. Our analysis shows that similar phonemes across languages collapse to the same cluster IDs at smaller inventories, with larger clusters progressively separating them.
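The collapse of similar phonemes at small inventories can be seen in a toy nearest-centroid quantizer. In practice the codebook would come from k-means over self-supervised embeddings; the two-dimensional frames and hand-written codebooks below are hypothetical stand-ins.

```python
import numpy as np

def assign_units(frames, codebook):
    """Map each frame embedding to the ID of its nearest centroid."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Two acoustically close frames and one distant frame (toy values).
frames = np.array([[0.0, 0.0], [0.4, 0.0], [5.0, 5.0]])

# A small inventory cannot tell the first two frames apart.
small_codebook = np.array([[0.2, 0.0], [5.0, 5.0]])
# A larger inventory gives each frame its own unit.
large_codebook = np.array([[0.0, 0.0], [0.4, 0.0], [5.0, 5.0]])

units_small = assign_units(frames, small_codebook)  # -> [0, 0, 1]
units_large = assign_units(frames, large_codebook)  # -> [0, 1, 2]
```

The first two frames share a unit ID under the small codebook and separate under the larger one.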

2606.06743 2026-06-08 cs.SD cs.AI cs.CL 交叉投稿

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

HybridCodec: 快速双流、语义增强的神经音频编解码器

Arjun Gangwar, S Umesh

发表机构 * Indian Institute of Technology, Madras(印度理工学院马德拉斯分校)

AI总结 提出HybridCodec,一种结合语义蒸馏与双流架构的统一神经音频编解码器,实现强解耦、跨语言鲁棒性及3倍速度提升。

Comments 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

详情
AI中文摘要

随着多模态大语言模型的出现,神经音频编解码器作为语音分词器的流行度激增。具有语义和声学解耦的新编解码器架构已经出现。将语义信息引入编解码器模型有两种主要方法:一种是从SSL表示中将语义信息蒸馏到第一个RVQ层,另一种是维护语义和声学特征的独立流。我们提出HybridCodec,一种结合了两种范式的统一架构。它采用独立的语义和声学分枝,同时将SSL表示蒸馏到语义流中。这种设计确保了强解耦,而无需在推理期间使用SSL模型。HybridCodec在域内测试集上展示了优越的语义特化(RVQ-1)和有竞争力的重建(RVQ-all)。我们展示了其在域外和零样本跨语言设置中的鲁棒性,相比现有双流模型实现了3倍加速。

英文摘要

The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main approaches to introduce semantic information into codec models: one distills semantic information from SSL representations into the first RVQ layer, while the other maintains separate streams for semantic and acoustic features. We propose HybridCodec, a unified architecture that combines both paradigms. It employs separate semantic and acoustic branches while distilling SSL representations into the semantic stream. This design ensures strong disentanglement without requiring an SSL model during inference. HybridCodec shows superior semantic specialization (RVQ-1) on the in-domain test set and competitive reconstruction (RVQ-all). We demonstrate its robustness in out-of-domain and zero-shot cross-lingual settings, achieving a 3x speedup over existing dual-stream models.

2606.06748 2026-06-08 cs.CL cs.AI cs.LG 交叉投稿

Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection

检索增强生成中的证据图一致性:基于模型的幻觉检测分析

Jianru Shen

AI总结 提出证据图一致性(EGC)框架,通过构建局部证据图并计算五种结构一致性指标检测幻觉,发现不同模型族间一致性特征方向相反,表明嵌入图一致性不能作为模型无关的检测信号。

Comments Accepted at the International Conference on Advanced Machine Learning and Data Science; to appear in the IEEE Xplore proceedings

详情
AI中文摘要

检索增强生成(RAG)减少了但并未消除大型语言模型中的幻觉。现有检测方法依赖于生成答案与检索段落之间的平面相似性,忽略了证据片段与答案声明之间的结构关系。我们提出了证据图一致性(EGC)框架,该框架为每个响应构建一个局部证据图,并计算五种结构一致性度量作为幻觉指标。在RAGTruth的完整问答拆分上,跨六个LLM(5,767个响应)进行评估,EGC揭示了一个一致的模型族分裂:图一致性特征在Llama-2模型中显示出预期的诊断方向,但在GPT-4、GPT-3.5和Mistral-7B中表现出系统性逆转。这种逆转表明不同模型族之间存在定性的不同幻觉模式,并表明基于嵌入的图一致性不能作为模型无关的幻觉检测信号。

英文摘要

Retrieval-Augmented Generation (RAG) reduces but does not eliminate hallucination in large language models. Existing detection methods rely on flat similarity between generated answers and retrieved passages, ignoring structural relationships among evidence pieces and answer claims. We propose Evidence Graph Consistency (EGC), a framework that constructs a local evidence graph per response and computes five structural consistency measures as hallucination indicators. Evaluated on the full question answering split of RAGTruth across six LLMs (5,767 responses), EGC reveals a consistent model-family split: graph consistency features show the expected diagnostic direction for hallucinations in Llama-2 models but exhibit systematic reversal in GPT-4, GPT-3.5, and Mistral-7B. This reversal suggests qualitatively different hallucination patterns across model families and indicates that embedding-based graph consistency cannot serve as a model-independent hallucination detection signal.

2606.06761 2026-06-08 cs.RO cs.AI 交叉投稿

AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation

AxisGuide: 在RGB观测中接地机器人动作坐标系以实现鲁棒的视觉运动操控

Jiyun Jang, Yujin Sung, Woosung Joung, Daewon Chae, Sangwon Lee, Sohwi Kim, Jinkyu Kim, Jungbeom Lee

发表机构 * Korea University(韩国大学) University of Michigan(密歇根大学) KT R&D Center(KT研发中心) Kakao Mobility(Kakao移动)

AI总结 针对视觉运动策略在分布偏移下动作执行失败的问题,提出AxisGuide方法,通过渲染机器人基座坐标系轴并叠加提示通道,增强动作坐标理解,显著提升泛化性能。

Comments Accepted to Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

通过大规模行为克隆训练的视觉运动操控策略已实现强大的语义场景理解,但在分布偏移下往往无法可靠地执行正确的低级动作。例如,即使在具有相同场景布局、相机视角和光照的简单拾取任务中,当物体被放置在未见过的位置时,性能也会大幅下降。我们认为这一差距源于动作理解不足,即无法在图像空间中解释机器人基座坐标系。为解决此问题,我们引入AxisGuide,一种轻量级引导方法,桥接语义场景理解和动作坐标解释。利用相机参数和末端执行器位姿,AxisGuide在每个相机视图中渲染机器人基座轴,并通过少量提示通道增强RGB观测,明确可视化图像空间中+x、+y和+z运动的含义。在LIBERO仿真和真实环境中的广泛评估表明,AxisGuide带来了显著的性能提升和更好的泛化能力,凸显了显式动作坐标提示对于学习可靠且可迁移的通用视觉运动策略的有效性。

英文摘要

Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene understanding, yet often fail to reliably execute correct low-level actions under distribution shifts. For example, even in a simple pickup task with identical scene layouts, camera viewpoints, and illumination, performance can degrade substantially when the object is placed at unseen locations. We argue that this gap arises from insufficient action understanding, namely the inability to interpret the robot's base-frame action coordinate system in image space. To address this issue, we introduce AxisGuide, a lightweight guidance method that bridges semantic scene understanding and action-coordinate interpretation. Using camera parameters and end-effector poses, AxisGuide renders the robot base-frame axes in each camera view and augments RGB observations with a small set of cue channels that explicitly visualize the meaning of the +x, +y, and +z motions in image space. Extensive evaluations in both the LIBERO simulation and real-world environments demonstrate that AxisGuide yields substantial performance gains and improved generalization, highlighting the effectiveness of explicit action-coordinate cues for learning reliable and transferable generalist visuomotor policies.
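Rendering base-frame axes into a camera view comes down to a pinhole projection of an anchor point and the tips of the +x, +y, and +z directions. The sketch below assumes a known intrinsic matrix `K` and a base-to-camera transform `T_cam_base`, and anchors the axes at a given point such as the end-effector position; these names and the anchoring choice are illustrative assumptions, not AxisGuide's implementation.

```python
import numpy as np

def project_base_axes(K, T_cam_base, origin_base, axis_len=0.1):
    """Pixel coordinates of an anchor point and the +x, +y, +z axis tips.

    K: 3x3 camera intrinsics. T_cam_base: 4x4 transform from robot base
    frame to camera frame. origin_base: 3-vector anchor in the base frame.
    Returns an array of shape (4, 2): anchor, +x tip, +y tip, +z tip.
    """
    origin_base = np.asarray(origin_base, dtype=float)
    pts_base = np.vstack([origin_base, origin_base + axis_len * np.eye(3)])
    pts_h = np.hstack([pts_base, np.ones((4, 1))])   # homogeneous coordinates
    pts_cam = (T_cam_base @ pts_h.T).T[:, :3]        # base frame -> camera frame
    uvw = (K @ pts_cam.T).T                          # camera frame -> image plane
    return uvw[:, :2] / uvw[:, 2:3]                  # perspective divide

# Toy setup: camera axes aligned with the base frame, base origin 1 m ahead.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_base = np.eye(4)
T_cam_base[2, 3] = 1.0
pixels = project_base_axes(K, T_cam_base, origin_base=[0.0, 0.0, 0.0])
```

In this toy geometry the +x tip lands 50 pixels to the right of the anchor, the +y tip 50 pixels below it, and the +z tip projects onto the anchor because it points along the optical axis.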

2606.06764 2026-06-08 stat.ML cs.AI cs.LG 交叉投稿

Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks

深度神经网络梯度下降方法的泛化最优速率

Junyu Zhou, Puyu Wang, Yunwen Lei, Yiming Ying, Ding-Xuan Zhou

发表机构 * Mathematical Institute for Machine Learning and Data Science, KU Eichstätt-Ingolstadt(机器学习与数据科学数学研究所,艾希施泰特-因戈尔施塔特天主教大学) Department of Computer Science, RPTU Kaiserslautern-Landau(计算机科学系,凯泽斯劳滕-兰道大学) Department of Mathematics, University of Hong Kong(数学系,香港大学) School of Mathematics and Statistics, University of Sydney(数学与统计学学院,悉尼大学)

AI总结 本文针对深度ReLU网络,在神经正切核(NTK)机制下,首次建立了梯度下降(GD)和随机梯度下降(SGD)的极小化最优泛化误差速率,证明宽度足够时可达核方法的最优速率。

Comments 39 pages, 1 table

详情
AI中文摘要

近年来,在神经正切核(NTK)机制下,对于过参数化神经网络的梯度下降方法的统计泛化性能的理解取得了进展。然而,现有关于回归问题的工作大多局限于浅层网络架构,在深度神经网络理论中留下了显著的空白。本文通过为使用梯度下降(GD)和随机梯度下降(SGD)训练的深度ReLU网络提供全面的泛化分析来填补这一空白。具体来说,我们首次建立了深度ReLU网络的GD和SGD在总体风险超额上的极小化最优速率,假设网络宽度与网络深度和训练样本规模呈多项式关系。我们的结果表明,在足够宽度下,深度ReLU网络的梯度下降方法能够达到与核方法相当的泛化最优速率。

英文摘要

Recent progress has been made in understanding the statistical generalization performance of gradient descent methods for overparameterized neural networks within the neural tangent kernel (NTK) regime. However, most of the existing work on regression problems is limited to shallow network architectures, leaving a notable gap in the theory of deep neural networks. This paper addresses this gap by presenting a comprehensive generalization analysis for deep ReLU networks trained using gradient descent (GD) and stochastic gradient descent (SGD). Specifically, we establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.

2606.06772 2026-06-08 stat.ML cs.AI cs.LG 交叉投稿

Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

深度神经网络的泛化:梯度方法的极小化最优速率

Junyu Zhou, Puyu Wang, Yunwen Lei, Marius Kloft, Yiming Ying

发表机构 * Mathematical Institute for Machine Learning and Data Science, Catholic University of Eichstätt-Ingolstadt(机器学习与数据科学数学研究所,艾希施泰特-因戈尔施塔特天主教大学) Department of Computer Science, RPTU Kaiserslautern-Landau(计算机科学系,凯泽斯劳滕-兰道大学) Department of Mathematics, The University of Hong Kong(数学系,香港大学) School of Mathematics and Statistics, The University of Sydney(数学与统计学学院,悉尼大学)

AI总结 本文建立了过参数化深度神经网络与核方法学习动力学的联系,证明了梯度下降和随机梯度下降在足够宽度下能达到极小化最优泛化误差。

Comments 37 pages

详情
AI中文摘要

理解过参数化神经网络的泛化性能已成为深度学习理论的核心课题。尽管近期进展,特别是神经正切核(NTK)机制下的工作,揭示了浅层架构的行为,但深度神经网络(DNN)的统计泛化性质,尤其是在回归任务中,仍远未得到充分理解。本文通过提供使用梯度方法训练的DNN的全面泛化分析,在弥合这一差距方面取得了重大进展。首先,我们首次建立了使用梯度方法训练的、具有光滑激活函数的DNN的学习动态与核方法的学习动态之间的关键联系,表明过参数化DNN上的梯度方法可以完全继承其核对应物的有利学习动态。基于这一联系以及核方法已确立的最优性,我们推导出了梯度下降(GD)和随机梯度下降(SGD)的过量总体风险的第一个已知极小化最优速率,假设网络宽度与样本大小成多项式比例。我们的结果表明,在足够宽度下,由GD或SGD训练的DNN可以实现与基于核的方法相当的泛化性能。

英文摘要

Understanding the generalization performance of over-parameterized neural networks has become a central topic in deep learning theory. While recent advances, particularly works under the Neural Tangent Kernel (NTK) regime, have shed light on the behavior of shallow architectures, the statistical generalization properties of deep neural networks (DNNs), especially in regression tasks, remain far less understood. In this paper, we make significant progress toward closing this gap by providing a comprehensive generalization analysis of DNNs trained using gradient-based methods. First, we establish, for the first time, a crucial connection between the learning dynamics of a DNN with smooth activation functions trained via gradient-based methods and those of kernel methods, showing that gradient-based methods on over-parameterized DNNs can fully inherit the favorable learning dynamics of their kernel counterparts. Building on this connection and the well-established optimality of kernel methods, we derive the first known minimax-optimal rates for the excess population risk of both gradient descent (GD) and stochastic gradient descent (SGD), under the assumption that network width scales polynomially with the sample size. Our results demonstrate that, with sufficient width, DNNs trained by GD or SGD can achieve generalization performance comparable to kernel-based methods.

2606.06779 2026-06-08 cs.IR cs.AI 交叉投稿

Mind the Gap: Bridging Behavioral Silos with LLMs in Multi-Vertical Recommendations

注意差距:用LLM弥合多垂直领域推荐中的行为孤岛

Nimesh Sinha, Raghav Saboo, Martin Wang, Sudeep Das

发表机构 * DoorDash Inc.(DoorDash公司)

AI总结 提出利用LLM从数据丰富垂直领域(如餐厅)向稀疏领域(如杂货)迁移知识的框架,通过分层RAG生成多级特征,集成到MTL排序模型,显著提升新兴业务个性化与参与度。

详情
AI中文摘要

在多垂直领域电商平台(如DoorDash)中,较新的产品垂直领域(如杂货和零售)为个性化创新提供了重要机遇。一个关键挑战在于解决用户的“冷启动”问题。本文介绍了一种新颖框架,通过将知识从数据丰富的垂直领域(例如DoorDash的餐厅)迁移到数据稀疏的垂直领域来提升推荐质量。我们利用大型语言模型(LLMs)进行生成式推理,合成封装潜在用户偏好的稀疏高维特征。具体而言,我们采用分层检索增强生成(RAG)流水线,从用户餐厅订单历史和搜索查询中推导出多级分类特征。这些生成的特征编码了长期跨垂直偏好和短期意图,并集成到生产环境中的多任务学习(MTL)排序模型中。通过广泛的离线和在线评估,我们证明该方法显著改善了新兴业务垂直领域的个性化和参与度,有效弥合了行为数据差距。

英文摘要

In multi-vertical e-commerce platforms like DoorDash, relatively newer product verticals such as grocery and retail present a significant opportunity for personalization innovation. A key challenge lies in solving the "cold start" problem for users. This paper introduces a novel framework for enhancing recommendation quality by transferring knowledge from data-rich verticals (e.g., restaurants at DoorDash) to data-sparse ones. We leverage Large Language Models (LLMs) to perform generative inference, synthesizing sparse, high-dimensional features that encapsulate latent user affinities. Specifically, we employ a hierarchical Retrieval-Augmented Generation (RAG) pipeline to derive multi-level taxonomic features from user restaurant order histories and search queries. These generated features, encoding both long-term cross-vertical preferences and short-term intent, are integrated into a production Multi-Task Learning (MTL) ranking model. We demonstrate through extensive offline and online evaluation that this approach significantly improves personalization and engagement in emerging business verticals, effectively bridging the behavioral data gap.

2606.06784 2026-06-08 cs.CR cs.AI cs.CY 交叉投稿

What Your Posts Reveal: A Benchmark and Agentic Framework for User-Level Privacy Leakage on Social Media

你的帖子揭示了什么:社交媒体用户级隐私泄露的基准与智能体框架

Zifan Peng, Yini Huang, Aiwen Lu, Qiming Ye, Peixian Zhang, Jingyi Zheng, Yule Liu, Xuechao Wang, Xinlei He, Jiaheng Wei

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Wuhan University(武汉大学)

AI总结 针对社交媒体用户级多模态隐私泄露缺乏统一基准和评估指标的问题,提出SopriBench基准和隐私暴露分数(PES),并开发了无需训练的智能体框架Argus,通过跨帖子线索累积推理实现隐私推断,PES达0.55,较最强基线提升25%。

详情
AI中文摘要

公开的社交媒体帖子可以通过散布在文本、图像或元数据中的微弱线索泄露私人信息。这种泄露通常是累积性和跨帖子的:单独看似无害的线索可能共同暴露用户的家庭、工作场所或日常行程。然而,当前研究缺乏用户级多模态隐私泄露的统一基准,以及能够捕捉暴露严重程度(超越二元准确性)的评估指标。为解决这些不足,我们提出了SopriBench,这是一个由从Rednote和Instagram账户的私有参考语料库中抽象出的泄露模式引导的合成基准,涵盖50个用户档案和1569张图像,包含属性、上下文敏感性、粒度、泄露类型、推理难度和支持证据。我们进一步引入了隐私暴露分数(PES),该分数通过上下文敏感性对值粒度进行加权。受溯因推理启发,我们提出了Argus,一个无需训练的智能体框架,用于累积泄露推理。Argus从累积证据中形成假设,验证支持证据,并将跨帖子线索聚合为隐私档案,实现了0.55的PES,比最强基线提高了25%,在跨帖子泄露上增益最大。

英文摘要

Public social media posts can reveal private information through weak cues scattered across text, images, or metadata. Such leakage is often cumulative and cross-post: cues that appear harmless in isolation may jointly expose a user's home, workplace, or routine. However, current research lacks a unified benchmark for user-level multimodal privacy leakage and an evaluation metric that captures exposure severity beyond binary accuracy. To address these gaps, we propose SopriBench, a synthetic benchmark guided by leakage patterns abstracted from a private reference corpus of Rednote and Instagram accounts, covering 50 user profiles and 1,569 images with attributes, contextual sensitivity, granularity, leakage type, inference difficulty, and supporting evidence. We further introduce the Privacy Exposure Score (PES), which weights value granularity by contextual sensitivity. Inspired by abductive reasoning, we introduce Argus, a training-free agentic framework for cumulative leakage inference. Argus forms hypotheses from accumulated evidence, verifies supporting evidence, and aggregates cross-post cues into privacy profiles, achieving 0.55 PES, a 25% improvement over the strongest baseline, with the largest gain on cross-post leakage.

2606.06800 2026-06-08 cs.HC cs.AI 交叉投稿

Exploring Reinforcement Learning for Fluid Transitions Between Clinical Mental Healthcare and Everyday Wellness Support

探索强化学习在临床心理健康与日常健康支持之间的流畅过渡

Tony Wang, Qian Yang

发表机构 * Cornell University(康奈尔大学)

AI总结 本研究探索强化学习(RL)构建数字健康系统,动态选择临床与健康干预措施,以优化整体健康目标(持续日记),发现RL优化序列的益处常在干预结束后显现,且高参与度用户随时间深化参与,而恒定干预组易倦怠退出。

详情
Journal ref
Healthcare Beyond Reaction: Harnessing AI and Sensing for Proactive Care, Workshop at ACM Interactive Health 2026 (IH '26), July 05-08, 2026, Porto, Portugal
AI中文摘要

心理健康问题时好时坏,但临床和健康干预通常分开运作,导致护理过渡频繁中断。我们探索强化学习(RL)作为构建数字健康系统的手段,该系统主动提供临床和健康干预,作为连贯护理旅程的一部分。我们问:设计这样一个系统涉及哪些复杂性?我们构建了一个上下文赌博机,从临床和健康库中动态选择日记提示,以优化总体健康目标(持续日记),并在为期四周的探索性研究(N=38)中部署。我们发现,首先,RL优化的干预序列的许多益处仅在干预结束后才显现,这引发了一个问题:提供连贯临床-健康护理旅程的系统是否应包括退步期?如果是,何时以及如何?其次,与RL生成干预互动最多的参与者随着时间的推移加深了他们的参与度,而与恒定干预互动最多的参与者往往后来倦怠并退出。这引发了一个问题:当系统混合临床和健康干预时,何时应降低强度以防止倦怠,何时应维持强度以最大化治疗效果?

英文摘要

Mental health struggles wax and wane, yet clinical and wellness interventions typically operate separately, causing frequent breakdowns at care transitions. We explore reinforcement learning (RL) as a means to build digital health systems that deliver clinical and wellness interventions proactively, as part of a coherent care journey. We ask: what complexities does designing such a system involve? We built a contextual bandit that dynamically selects journaling prompts from clinical and wellness repertoires to optimize for an overarching health goal (sustained journaling) and deployed it in a four-week exploratory study (N=38). We found that, first, many benefits of RL-optimized intervention sequences appeared only after interventions ended, raising the question: Should systems that offer coherent clinical-wellness care journeys include stepping-back periods? If so, when and how? Second, participants most engaged with RL-generated interventions deepened their engagement over time, while those most engaged with a constant intervention tended to burn out and drop out later. This raises the question: When should a system blending clinical and wellness interventions reduce intensity to prevent burnout, and when should it sustain intensity to maximize treatment gains?
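A contextual bandit that chooses between clinical and wellness prompts can be sketched with a simple epsilon-greedy rule. The abstract does not say which bandit algorithm the study used, so epsilon-greedy is a stand-in; the context labels, arm names, and binary reward (whether the participant journaled) are hypothetical.

```python
import random
from collections import defaultdict

class EpsilonGreedyContextualBandit:
    """Tracks a running mean reward per (context, arm) pair."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.values = defaultdict(float)

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[(context, arm)])

    def update(self, context, arm, reward):
        # Incremental update of the running mean reward.
        key = (context, arm)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]

# epsilon=0.0 makes the example deterministic (pure exploitation).
bandit = EpsilonGreedyContextualBandit(
    ["clinical_prompt", "wellness_prompt"], epsilon=0.0
)
bandit.update("low_mood", "clinical_prompt", 1.0)   # participant journaled
bandit.update("low_mood", "wellness_prompt", 0.0)   # participant did not
bandit.update("stable", "wellness_prompt", 1.0)
choice_low = bandit.select("low_mood")
choice_stable = bandit.select("stable")
```

After these three updates the bandit prefers the clinical prompt in the low-mood context and the wellness prompt in the stable context; a deployed system would keep a nonzero epsilon so it continues to explore.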

2606.06805 2026-06-08 cs.RO cs.AI cs.SY eess.SY 交叉投稿

Lane Change Trajectory Planning for Personalized Driving Comfort and Mobility Efficiency

车道变更轨迹规划:个性化驾驶舒适性与移动效率

Haoxuan Dong, Dongjun Li, Ziyou Song

发表机构 * Department of Mechanical Engineering(机械工程系) Department of Electrical Engineering(电气工程系) National University of Singapore(新加坡国立大学) Computer Science(计算机科学) University of Michigan(密歇根大学)

AI总结 提出一种神经网络驱动的轨迹规划器,结合三阶多项式轨迹生成与学习模块,通过双头共享骨干和基于误差胜者逻辑回归的统计门控机制,实现个性化舒适性与移动效率的平衡。

Comments Accepted by the IEEE Intelligent Vehicles Symposium (IEEE IV 2026), Detroit, MI, United States, June 22-25, 2026

详情
AI中文摘要

车道变更涉及同时的纵向和横向运动,这些运动影响驾驶舒适性和移动效率。由于这些运动紧密耦合且存在显著的车辆间差异,车道变更操作的轨迹规划具有高度个性化的特点。本研究提出了一种神经网络驱动的规划器,该规划器将三阶多项式轨迹生成器与学习模块相结合,该学习模块在不同驾驶条件下推断最优轨迹参数。使用具有双头的共享骨干网络,一个头确保全工况操作保障,而另一个头捕捉驾驶员对舒适性或移动效率的特定偏好。通过基于误差胜者逻辑回归的统计门控实现头门控切换机制,该机制在不同驾驶条件下自适应地选择适当的头,从而实现上下文感知的车道变更轨迹规划。代表性案例和蒙特卡洛模拟表明,所提出的规划器在车道变更过程中实现了个性化的舒适性和移动性,而基线则在个性化数据不足或不可用的驾驶条件下确保可行的轨迹。

英文摘要

Lane changing entails simultaneous longitudinal and lateral motions that affect driving comfort and mobility efficiency. Because these motions are tightly coupled and subject to substantial inter-vehicle variability, trajectory planning for lane-change maneuvers is characterized by a highly personalized nature. This study proposes a neural network-driven planner that integrates a third-order polynomial trajectory generator with a learning module that infers optimal trajectory parameters across diverse driving conditions. Using a shared backbone with dual heads, one head ensures all-condition operational guarantees, while the other captures driver-specific preferences for comfort or mobility efficiency. A head-gated switching mechanism, realized through a statistical gate based on error-winner logistic regression, adaptively selects the appropriate head under varying driving conditions, which enables context-aware lane-change trajectory planning. Representative cases and Monte Carlo simulations show that the proposed planner achieves personalized comfort and mobility during lane changes, while the baseline ensures feasible trajectories under driving conditions where personalized data are insufficient or inaccessible.
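
The abstract names a third-order polynomial trajectory generator without giving its form. As a minimal sketch, assume the lateral offset is a cubic in time with zero lateral velocity at both ends of the maneuver. The function names, the boundary conditions, and the example values (3.5 m lane width, 4 s duration) are illustrative assumptions, not taken from the paper; in the proposed planner, a learning module would supply the trajectory parameters.

```python
def cubic_lateral_offset(t: float, duration: float, lane_width: float) -> float:
    """Lateral offset y(t) of a cubic lane-change profile.

    Assumed boundary conditions (hypothetical, not from the paper):
    y(0) = 0, y'(0) = 0, y(T) = lane_width, y'(T) = 0.
    Solving the cubic under these conditions gives
    y = W * (3 s^2 - 2 s^3) with s = t / T.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    s = min(max(t / duration, 0.0), 1.0)  # clamp normalized time to [0, 1]
    return lane_width * (3.0 * s**2 - 2.0 * s**3)


def cubic_lateral_velocity(t: float, duration: float, lane_width: float) -> float:
    """Time derivative of the profile above: y'(t) = 6 W s (1 - s) / T."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    s = min(max(t / duration, 0.0), 1.0)
    return lane_width * 6.0 * s * (1.0 - s) / duration
```

Under these assumptions a shorter duration raises the peak lateral velocity (1.5 W / T at mid-maneuver), which is one way the comfort versus mobility trade-off described in the abstract could show up in a single parameter.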

2606.06813 2026-06-08 cs.CV cs.AI 交叉投稿

Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation

打破锁定:通过表示调制实现文本到图像生成的多样化

Dahee Kwon, Haeun Lee, Jaesik Choi

发表机构 * KAIST(韩国科学技术院)

AI总结 针对文本到图像模型在固定提示下生成样本过于相似的问题,提出无训练表示级干预方法DAVE,通过选择性衰减早期生成中的零频空间平均分量来增强多样性,保持图像质量且计算开销极小。

Comments Accepted to ICML 2026. Code is available at: https://github.com/daheekwon/DAVE

详情
AI中文摘要

近期基于大规模Transformer骨干和流目标的文本到图像模型在文本-图像对齐和视觉质量方面表现出色,但在固定提示下常生成过于相似的样本。现有的多样性增强方法缓解了这一问题,但通常需要昂贵的采样或辅助优化,带来显著开销。为探究这种同质性的根本原因,我们检查了中间Transformer特征,观察到零频空间平均(DC)分量在生成早期快速收敛,导致早期轨迹锁定,限制了后续变化。基于此观察,我们提出DC衰减多样性增强(DAVE),一种无训练的表示级干预,选择性地在早期阶段衰减该分量。DAVE以可忽略的开销保留采样流程,在保持竞争性图像质量的同时,提高了提示一致性的多样性。

英文摘要

Recent text-to-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text-image alignment and high visual quality, yet often produce overly similar samples under a fixed prompt. Existing diversity-enhancement methods alleviate this issue, but typically require expensive sampling or auxiliary optimization, incurring non-trivial overhead. To investigate the root cause of this homogeneity, we examine intermediate Transformer features and observe that the zero-frequency spatial average (DC) component rapidly converges across seeds early in generation, causing early trajectory lock-in that limits downstream variation. Building on this observation, we propose DC Attenuation for diVersity Enhancement (DAVE), a training-free representation-level intervention that selectively attenuates this component in the early regime. DAVE preserves the sampling pipeline with negligible overhead, improving prompt-consistent diversity while maintaining competitive image quality.
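
To make the idea of attenuating the zero-frequency (DC) component concrete, here is a minimal sketch on a single-channel feature map. The function name `attenuate_dc` and the scalar strength `alpha` are hypothetical; the abstract does not specify which layers, which timesteps, or what strength DAVE uses, so this only illustrates the operation itself.

```python
def attenuate_dc(feature_map, alpha):
    """Scale the spatial mean (DC component) of one channel by (1 - alpha).

    feature_map: list of rows (H x W floats) for a single channel.
    alpha: hypothetical attenuation strength; 0 leaves the map unchanged,
           1 removes the spatial mean entirely.
    The deviations from the mean (non-DC content) are left untouched.
    """
    values = [v for row in feature_map for v in row]
    dc = sum(values) / len(values)  # zero-frequency spatial average
    return [[v - alpha * dc for v in row] for row in feature_map]
```

Subtracting `alpha * dc` from every position shifts only the mean, so spatial structure within the map is preserved while the shared offset that the abstract associates with early lock-in is reduced.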

2606.06820 2026-06-08 cs.LG cs.AI 交叉投稿

SCALE: Scalable Cross-Attention Learning with Extrapolation for Agentic Workflow Scheduling

SCALE: 可扩展的交叉注意力学习与外推方法用于智能体工作流调度

Zhifei Xu, Jierui Lan, Zixuan Liang, Aiji Liang, Jinxi He

发表机构 * Faculty of Arts and Sciences, Beijing Normal University(北京师范大学文理学院)

AI总结 提出SCALE调度器,通过交叉注意力指针网络和结构化表示正则化,实现无需微调即可泛化到不同规模集群的深度强化学习工作流调度。

Comments Submitted to Computer Networks

详情
AI中文摘要

智能体大型语言模型系统将复杂任务分解为工作流有向无环图,其原语必须在异构集群上调度。现有的深度强化学习调度器与固定集群大小绑定,当服务器数量变化时需要重新训练。我们提出SCALE(可扩展的交叉注意力学习与外推),一种无需微调即可泛化到未见过的集群规模的深度强化学习调度器。SCALE采用交叉注意力指针网络,其中任务特征查询服务器特征,因此架构通过构造接受任意数量的服务器。然而,我们观察到仅排列不变架构并不能保证在新规模下的良好性能——随着服务器数量增长,注意力特征经历分布偏移。为了解决这个问题,我们引入结构化表示正则化:一种去相关损失结合朝向标准正态的KL惩罚,使特征统计量无论输入大小都保持稳定。在16个节点上训练并直接在32和48个节点上测试,SCALE在N=48时相对于没有SRR的相同架构将平均响应时间降低了8.9%,确认了显式正则化对于缩小规模泛化差距是必要的。

英文摘要

Agentic Large Language Model (LLM) systems decompose complex tasks into workflow Directed Acyclic Graphs (DAGs) whose primitives must be scheduled on heterogeneous clusters. Existing deep reinforcement learning (DRL) schedulers are tied to a fixed cluster size and require retraining whenever the number of servers changes. We propose SCALE (Scalable Cross-Attention Learning with Extrapolation), a DRL scheduler that generalizes to unseen cluster scales without fine-tuning. SCALE employs a cross-attention pointer network where task features query against server features, so the architecture accepts any number of servers by construction. We observe, however, that permutation-invariant architecture alone does not guarantee good performance at new scales - the attention feature undergoes distribution shift as the server count grows. To counter this, we introduce Structured Representation Regularization (SRR): a decorrelation loss combined with a KL penalty toward the standard normal, which keeps feature statistics stable regardless of input size. Trained on 16 nodes and tested directly on 32 and 48 nodes, SCALE reduces average response time by 8.9% at N=48 relative to the same architecture without SRR, confirming that explicit regularization is necessary to close the scale-generalization gap.
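
The two SRR terms can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulation: the decorrelation loss is taken as the sum of squared off-diagonal covariance entries, and the KL penalty treats each feature dimension as a Gaussian fitted to the batch and measures its divergence from N(0, 1). The function name `srr_penalty` and the `eps` stabilizer are hypothetical.

```python
import math


def srr_penalty(features, eps=1e-8):
    """Return (decorrelation_loss, kl_penalty) for a batch of feature vectors.

    features: list of N vectors, each a list of D floats.
    Assumptions (not from the paper): population covariance over the batch,
    and a per-dimension Gaussian KL term
    KL(N(mu, var) || N(0, 1)) = 0.5 * (var + mu^2 - 1 - ln var).
    """
    n, d = len(features), len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(d)]
    centered = [[f[j] - means[j] for j in range(d)] for f in features]
    cov = [
        [sum(c[i] * c[j] for c in centered) / n for j in range(d)]
        for i in range(d)
    ]
    # Penalize correlation between different feature dimensions.
    decorrelation = sum(
        cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j
    )
    # Pull each dimension's mean and variance toward 0 and 1.
    kl = sum(
        0.5 * (cov[j][j] + means[j] ** 2 - 1.0 - math.log(cov[j][j] + eps))
        for j in range(d)
    )
    return decorrelation, kl
```

Both terms depend only on batch statistics of the features, which is consistent with the abstract's stated goal of keeping feature statistics stable as the number of servers changes.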

2606.06823 2026-06-08 cs.LG cs.AI q-fin.ST 交叉投稿

PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis And Integrated Decision-Making in Quantitative Finance

PandaAI: 一种用于量化金融中神经符号数据分析与集成决策的实用智能体CQ2

Yuqi Li, Siyuan Liu, Bingjun Liu

发表机构 * Panda AI

AI总结 针对金融数据低信噪比和非平稳性,提出PandaAI,一种结合市场状态建模与约束alpha生成的闭环神经符号LLM智能体,通过领域微调和模块化架构实现风险感知决策,在沪深300数据上Rank IC提升18.2%,最大回撤降低25.7%。

详情
AI中文摘要

尽管深度学习在各个领域表现出色,但由于金融数据的低信噪比(SNR)和非平稳性,其在金融序列决策中的应用仍然具有挑战性。利用大型语言模型(LLM)的推理能力,我们提出了PandaAI,一种具有市场状态建模和约束alpha生成的闭环神经符号LLM智能体,它桥接了通用LLM推理与金融严谨性,并抑制了LLM生成输出的金融毒性。为了弥合通用语言能力与金融严谨性之间的差距,我们微调了一个领域特定的LLM。此外,我们将此LLM集成到模块化架构中,形成一个闭环系统。与传统优化孤立预测指标的模型不同,PandaAI被设计为一种神经符号智能体,以明确的风险意识在复杂、真实的金融环境中导航。在沪深300股票数据上的大量实验表明,PandaAI比最先进的时间序列模型实现了18.2%更高的Rank IC和25.7%更低的最大回撤。我们的约束LLM生成和双通道适应方法为LLM在高风险序列决策场景中的部署提供了一种通用范式。

英文摘要

While deep learning has excelled in various domains, its application to sequential decision-making in finance remains challenging due to the low Signal-to-Noise Ratio (SNR) and non-stationarity of financial data. Leveraging the reasoning capabilities of Large Language Models (LLMs), we propose PandaAI, a closed-loop neuro-symbolic LLM agent with market regime modeling and constrained alpha generation, which bridges general LLM reasoning with financial rigor and suppresses the financial toxicity of LLM-generated outputs. To bridge the gap between general linguistic capability and financial rigor, we fine-tune a domain-specific LLM. Furthermore, we integrate this LLM into a modular architecture and form a closed-loop system. Unlike traditional models that optimize isolated prediction metrics, PandaAI is designed as a neuro-symbolic agent that navigates the complex, real-world financial environment with explicit risk awareness. Extensive experiments on CSI 300 stock data show that PandaAI achieves an 18.2% higher Rank IC and a 25.7% lower maximum drawdown than state-of-the-art time-series models. Our constrained LLM generation and dual-channel adaptation method provide a general paradigm for LLM deployment in high-stakes sequential decision-making scenarios.

2606.06825 2026-06-08 cs.CL cs.AI 交叉投稿

Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards

Progress-SQL: 通过渐进式奖励改进文本到SQL的强化学习

Shihao Zhang, Xiaoman Wang, Yuan Liu, Yunshi Lan, Weining Qian

发表机构 * East China Normal University(华东师范大学)

AI总结 提出Progress-SQL,一种多轮强化学习框架,通过Oracle引导诊断树(ODT)生成子句级结构反馈,结合渐进式奖励(结构对齐、词汇对齐、延迟奖励和执行状态奖励),提升文本到SQL生成的准确性和鲁棒性。

详情
AI中文摘要

强化学习最近在改进大型语言模型进行文本到SQL生成方面显示出潜力,但现有方法通常优化基于单个SQL状态定义的一次性奖励。这种奖励为迭代SQL纠正提供的指导有限,不足以捕捉多轮SQL改进的提升。在本文中,我们提出Progress-SQL,一种具有渐进式奖励的多轮强化学习框架,用于文本到SQL。我们的方法引入Oracle引导诊断树(ODT),它将SQL查询抽象为子句级结构轮廓,并为下一轮改进生成诊断反馈。为了提供密集且稳健的奖励信号,我们将基于ODT的结构对齐与词汇对齐相结合,并定义一个渐进式奖励,衡量从初始SQL到最终SQL的改进。我们进一步加入一个偏好早期正确性的渐进延迟奖励和一个鼓励从无效SQL中恢复的执行状态奖励。在BIRD、Spider和Spider鲁棒性变体上的实验表明,我们的方法在主要评估和鲁棒性评估上均一致提升了文本到SQL的性能。

英文摘要

Reinforcement learning has recently shown promise in improving large language models for Text-to-SQL generation, yet existing methods typically optimize one-shot rewards defined over a single SQL state. Such rewards provide limited guidance for iterative SQL correction and are insufficient to capture the improvement of multi-turn SQL refinement. In this paper, we propose Progress-SQL, a multi-turn reinforcement learning framework with progressive rewards for Text-to-SQL. Our approach introduces an Oracle-guided Diagnostic Tree (ODT), which abstracts SQL queries into clause-level structural profiles and produces diagnostic feedback for next-turn refinement. To provide dense and robust reward signals, we combine ODT-based structural alignment with lexical alignment and define a progressive reward that measures the improvement from the initial SQL to the final SQL. We further incorporate a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from the invalid SQL. Experiments on BIRD, Spider, and Spider robustness variants demonstrate that our method consistently improves Text-to-SQL performance across both primary and robustness evaluations.

2606.06833 2026-06-08 cs.LG cs.AI cs.CR 交叉投稿

Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks

听弦外之音:面向声学对抗攻击的语言模型先验

Jiani Xie, Andrew C. Cullen, Paul Montague, Benjamin I. P. Rubinstein

发表机构 * University of Melbourne(墨尔本大学) DST Group(DST集团)

AI总结 提出Semantic Gambit攻击,利用大语言模型实时提供预测上下文,突破因果限制,使实时ASR系统词错误率提升至35.6%,较当前最优方法提高三倍。

详情
AI中文摘要

在实时环境中运行的自动语音识别(ASR)系统必须在严格的时间约束下处理声学输入,其转录决策本质上基于不完整信息。这种因果约束成为攻击者的信息瓶颈,显著限制了攻击性能。我们的新攻击方法Semantic Gambit通过实时利用大语言模型提供的预测上下文,突破了这一因果限制。实验表明,这种增强方式可将语料级词错误率提升至35.6%——比当前最优方法提高三倍。最终,这项工作揭示了如何利用常见的低延迟LLM工具系统地破坏实时ASR流水线。

英文摘要

Automatic Speech Recognition (ASR) systems operating in real-time settings must process acoustic input under strict temporal constraints, where transcription decisions are inherently made on incomplete information. This causal constraint serves as an information bottleneck on attackers, significantly limiting attack performance. Our new Semantic Gambit attack breaks this causal limitation by augmenting the adversary with predictive context derived from a Large Language Model in real-time. Our experiments show that this form of augmentation can elevate the corpus-level Word Error Rate to 35.6% -- a three-fold increase over the current state-of-the-art. Ultimately, this work reveals how common, low-latency LLM tooling can be exploited to systematically subvert real-time ASR pipelines.
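
For readers unfamiliar with the headline metric, corpus-level Word Error Rate is conventionally the total number of word-level edits (substitutions, deletions, insertions) divided by the total number of reference words across all utterances. The sketch below follows that convention; whitespace tokenization and the function names are simplifying assumptions, and the paper's text normalization is not described in the abstract.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Corpus-level WER over (reference, hypothesis) transcript pairs.

    Edits and reference lengths are summed over the whole corpus before
    dividing, rather than averaging per-utterance rates.
    """
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref, hyp = reference.split(), hypothesis.split()
        total_edits += word_edit_distance(ref, hyp)
        total_words += len(ref)
    if total_words == 0:
        raise ValueError("corpus contains no reference words")
    return total_edits / total_words
```

Under this definition an attack raises WER by forcing the recognizer to emit transcripts that need more edits to match the reference.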

2606.06836 2026-06-08 cs.RO cs.AI cs.CV 交叉投稿

Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

像飞行员一样思考:细粒度长时程无人机导航

Xiangyi Zheng, Xiangyu Wang, Qinan Liao, Zimu Tang, Yue Liao, Dongyue Lyu, Guodong Wang, Junjie Liu, Si Liu

发表机构 * Colab Beihang University(北航) Meituan(美团) National University of Singapore(新加坡国立大学)

AI总结 提出FLIGHT基准和FLIGHT VLA异步架构,通过低频飞行员推理VLM与高频扩散动作模型解耦,实现无人机长时程语义指令下的平滑连续飞行控制。

详情
AI中文摘要

语言引导的无人机代理必须执行长时程语义指令,同时产生平滑、物理可行的连续飞行命令,然而现有的视觉语言导航(VLN)基准通常使用离散或粗粒度的动作,而现有的无人机视觉-语言-动作(VLA)任务则专注于短时、原子化的机动。为了解决无人机任务设置中的这一空白,我们引入了FLIGHT,一个用于混合无人机导航与推理任务的细粒度长时程指令引导基准,该基准结合了多阶段指令与密集的6-DoF轨迹注释,分为两个数据集:细粒度VLN和长时程流。为了使无人机代理具备对任务执行状态和任务规划进行实时飞行推理的能力,同时适应高频、实时的精确控制,我们进一步提出了FLIGHT VLA,一种异步架构,将用于任务状态推理的低频流式飞行员视觉语言模型(VLM)与用于连续控制的高频扩散动作模型解耦,并由显式的飞行员推理文本进行监督,该文本总结了当前飞行状态并预测下一个子目标。在闭环评估中,FLIGHT VLA在我们的FLIGHT基准上持续优于代表性的VLN和VLA基线,实现了更强的多阶段完成、子目标遵循和终端控制。其训练的流式飞行员推理VLM进一步提升了无人机视频推理,验证了我们设计的有效性。

英文摘要

Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, a Fine-grained Long-horizon Instruction-Guided benchmark for Hybrid UAV navigation and reasoning Tasks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To endow the UAV agent with the capability of real-time in-flight reasoning over task execution status and mission planning, while simultaneously accommodating high-frequency, real-time precise control, we further propose FLIGHT VLA, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control, supervised by explicit Pilot Reasoning texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.

2606.06838 2026-06-08 cs.SE cs.AI 交叉投稿

LLM Agent-Assisted Reverse Engineering with Quantitative Readability Metrics

LLM 代理辅助逆向工程与定量可读性指标

Neil Archibald, Ruben Thijssen

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出定量可读性分数(QRS)框架,结合结构相似性门控与三个可读性子指标,指导 LLM 代理提升反编译代码可读性,同时保持功能正确性。

详情
AI中文摘要

自动反编译器生成功能正确但通常不可读的 C 代码。本文针对逆向工程工作流的一个阶段:使用由定量指标引导的 LLM 代理提高反编译代码的可读性。我们提出了一个三阶段的研究演进。阶段 1(通过 Ghidra MCP 的工具驱动引导)由于缺乏定量指导,导致覆盖不完整和改进不一致。阶段 2(仅结构相似性验证)揭示了代理以非预期方式优化指标,生成结构等效但可读性更差的代码。我们的贡献是定量可读性分数(QRS)框架,这是一个复合指标,结合了结构相似性门控与三个独立的可读性子指标(词汇惊奇度、结构简单性和惯用质量)。我们证明,QRS 引导的优化使 LLM 代理能够在不牺牲正确性的情况下进行有针对性的可读性改进。我们提供了更广泛的逆向工程工作流(二进制提升、反编译清理和实现功能等价)作为背景讨论,但不在本文范围内。

英文摘要

Automatic decompilers produce functionally correct but often unreadable C code. This paper addresses one stage of the reverse engineering workflow: improving the readability of decompiled code using LLM agents guided by quantitative metrics. We present a three-phase research evolution. Phase 1 (tool-driven steering via Ghidra MCP) suffered from incomplete coverage and inconsistent improvements due to lack of quantitative guidance. Phase 2 (structural similarity validation alone) revealed that agents optimize for metrics in unintended ways, producing structurally equivalent but less readable code. Our contribution is the Quantitative Readability Score (QRS) framework, a composite metric combining a structural similarity gate with three independent readability sub-metrics (Lexical Surprisal, Structural Simplicity, and Idiomatic Quality). We demonstrate that QRS-guided refinement enables LLM agents to make targeted readability improvements without sacrificing correctness. We provide a discussion of the broader reverse engineering workflow (binary lifting, decompilation cleanup, and achieving functional equivalence) as context; however, it remains out of scope.

2606.06840 2026-06-08 cs.CL cs.AI cs.LG 交叉投稿

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

先刻画再蒸馏:大输出空间中的机制性推理

Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam

发表机构 * Khoury College of Computer Sciences, Northeastern University(东北大学计算机科学学院)

AI总结 研究现代推理模型在百万级标签空间中实现零样本多标签分类的机制,提出“候选列表生成+精细推理”两阶段模型,并基于此开发机制蒸馏策略,优于标准蒸馏。

详情
AI中文摘要

现代推理模型在具有挑战性的多标签任务上表现出令人惊讶的强大零样本性能,这些任务需要从数十万到数百万个候选标签中选择一小部分相关选项。我们研究了它们在机制上如何实现这一点。我们将推理描述为一个两阶段过程:首先进行广泛的“候选列表生成”,然后对生成的集合进行精细推理。我们在一系列数据集上提供证据表明,这些步骤可以分离并且是互补的。利用这一刻画,我们开发了一种机制蒸馏策略,该策略始终优于标准蒸馏。

英文摘要

Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We investigate how they achieve this mechanistically. We characterize reasoning as a two-phase process: A broad "shortlisting" of candidates followed by fine-grained reasoning over the resulting set. We provide evidence across a range of datasets that these steps can be isolated and are complementary. Using this characterization, we develop a mechanistic distillation strategy that consistently outperforms standard distillation.

2606.06853 2026-06-08 cs.CV cs.AI 交叉投稿

MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

MotionEnhancer: 利用视频扩散模型增强运动感知的视觉-语言模型

Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang, Jiaxing Qi, Zhipeng Chen

发表机构 * School of Computer Science and Engineering, Beihang University(北航计算机科学与工程学院) Beijing Digital Native Digital City Research Center(北京数字原生数字城研究中心) School of Computer Science, Peking University(北京大学计算机学院) School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院)

AI总结 提出MotionEnhancer,通过从视频扩散模型中提取运动先验并利用注意力对齐增强视觉-语言模型的运动理解能力,无需额外参数或架构修改,在运动级视频理解基准上取得一致提升。

Comments Accepted by CVPR 2026

详情
AI中文摘要

新时代见证了视觉-语言模型(VLM)在视频理解任务中的显著能力扩展。虽然当前的VLM在事件或故事级别的理解上表现出色,但它们捕捉细粒度运动细节的能力仍然有限,这主要是由于它们关注高层静态语义结构和宏观事件逻辑。相比之下,视频扩散模型(VDM)擅长建模动态运动模式,得益于大规模视频数据和时序生成的内在需求。在本文中,我们介绍了MotionEnhancer,一种新颖的方法,它利用从强大视频扩散模型中提取的运动先验作为辅助监督,通过注意力对齐增强VLM的运动理解能力。MotionEnhancer包含两个简单的无参数模块:运动敏感头选择(MHS)和运动显著文本标记识别(MTTI),以仅计算的方式直接从VDM中提取和优化与运动相关的注意力。MotionEnhancer为运动理解提供了可扩展的解决方案,无需额外的训练参数、修改现有架构或工具调用。大量实验表明,在两个运动级视频理解基准上,MotionEnhancer能够在最先进的VLM上实现一致的改进,尤其是在运动相关指标上。

英文摘要

The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic. In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment. MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or tool calling. Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.

2606.06861 2026-06-08 cs.LG cs.AI 交叉投稿

Modeling Nonlinear Feature Interactions with Product-Unit Residual Networks

使用乘积单元残差网络建模非线性特征交互

Ziyuan Li, Uwe Jaekel, Babette Dellen

发表机构 * University of Applied Sciences Koblenz(科布伦茨应用科学大学) Technical University of Munich(慕尼黑技术大学)

AI总结 提出乘积单元残差网络(PURe),通过显式建模特征交互提升鲁棒性和可解释性,在合成和真实数据集上优于MLP。

Comments Accepted at ICCS 2026

详情
AI中文摘要

理解非线性特征交互在科学和工程中至关重要,然而标准多层感知器(MLP)通常仅隐式地捕获此类交互,导致表征纠缠,可能损害鲁棒性和可解释性。我们研究了乘积单元残差网络(PURe),它将乘法乘积单元与残差连接相结合,以显式建模跨特征耦合,同时稳定优化。我们在一个基于交互的合成基准和两个真实世界数据集上进行了系统评估,考察了预测准确性、对高斯特征噪声的鲁棒性以及在有限训练数据下的性能,并在匹配参数预算下比较了实值和复值变体。除了准确性,基于SHapley Additive exPlanations(SHAP)的交互分析表明,与MLP基线相比,PURe学习了更集中且结构更连贯的交互模式。总体而言,PURe实现了具有竞争力或更好的性能,在低数据场景下具有更好的鲁棒性和样本效率,并增强了交互级别的可解释性。

英文摘要

Understanding nonlinear feature interactions is crucial in science and engineering, yet standard multilayer perceptrons (MLPs) often capture such interactions only implicitly, leading to entangled representations that can impair robustness and interpretability. We investigate product-unit residual networks (PURe) that integrate multiplicative product units with residual connections to explicitly model cross-feature couplings while stabilizing optimization. We conduct a systematic evaluation on an interaction-driven synthetic benchmark and two real-world datasets, assessing predictive accuracy, robustness to Gaussian feature noise, and performance under limited training data, and we compare real- and complex-valued variants under a matched parameter budget. Beyond accuracy, SHapley Additive exPlanations (SHAP)-based interaction analyses show that PURe learns more concentrated and structurally coherent interaction patterns than MLP baselines. Overall, PURe achieves competitive or improved performance, better robustness and sample efficiency in low-data regimes, and enhanced interaction-level interpretability.

2606.06872 2026-06-08 cs.CV cs.AI 交叉投稿

EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation

EgoPressDiff: 用于自我中心UV域手部压力估计的多模态视频扩散模型

Yuan Zeng, Zilue Gao, Yujia Shi, Zongqing Lu, Wenming Yang, QingMin Liao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出EgoPressDiff,一种条件视频扩散框架,通过多模态条件策略(手部姿态、3D网格顶点和深度信息)从视觉输入生成UV压力图,解决了现有方法中的量化误差和时间不一致问题,在EgoPressure数据集上实现SOTA,Volumetric IoU相对提升34%以上。

Comments Accepted to IEEE ICASSP 2026

详情
AI中文摘要

从自我中心视角估计手部表面接触压力对于AR/VR设备、机器人模仿和人体工程学分析至关重要。现有方法通常对压力信号进行离散化并独立处理帧,导致量化误差和时间不一致性。我们提出EgoPressDiff,一种条件视频扩散框架,从视觉输入生成UV压力图。我们方法的核心是一种多模态条件策略,引入PoseNet和顶点编码器,从手部姿态和3D网格顶点中高效提取特征。这些信号与深度信息一起,指导生成过程以确保压力场在物理上是合理的。为了有效融合这些异构特征,我们进一步提出分布校准空间层,在组合前对齐其统计特性。在EgoPressure自我中心视图设置上的评估表明,EgoPressDiff实现了最先进的结果,Volumetric IoU相对先前基线提升超过34%,同时降低MAE并保持高时间精度。我们的项目页面位于 https://egopressdiff.github.io/。

英文摘要

Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy, introducing a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process to ensure the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline, while reducing MAE and maintaining high temporal accuracy. Our project page is at https://egopressdiff.github.io/.

2606.06877 2026-06-08 cs.RO cs.AI 交叉投稿

Neuro-Symbolic Learning for Long-Horizon Task Planning Under Complex Logical Constraints

复杂逻辑约束下长时域任务规划的神经符号学习

Qiwei Du, Zitong Zhan, Shaoshu Su, Bowen Li, Yi Du, Zhipeng Zhao, Taimeng Fu, Sebastian Scherer, Jiaoyang Li, Chen Wang

发表机构 * Spatial AI & Robotics (SAIR) Lab, University at Buffalo, NY 14260(空间人工智能与机器人实验室,布法罗大学,纽约州,14260) Robotics Institute, Carnegie Mellon University, PA 15213(机器人研究所,卡内基梅隆大学,宾夕法尼亚州,15213)

AI总结 提出基于命令学习的双层优化框架,通过神经评分器剪枝无关对象,并引入3R策略(修复、重启、回滚)稳定下层规划,在三个基准上实现失败率降低80.04%、规划时间减少57.14%。

详情
AI中文摘要

当机器人必须在复杂逻辑约束(包括对象可供性、空间关系和顺序动作依赖)下推理长时域动作序列时,任务规划常常面临严重的效率瓶颈。最近的神经符号方法通过学习对象重要性分数来剪枝任务无关对象,从而提高规划效率,但它们通常依赖于从完整搜索空间生成的固定离线监督。这造成了训练-测试不匹配:在部署时,规划器在由模型自身不完美预测诱导的剪枝搜索空间中运行,导致暴露偏差和规划性能下降。为了解决这一挑战,我们将任务规划的对象重要性学习形式化为一个基于命令学习的双层优化问题。上层优化一个神经评分器,而下层在评分剪枝的搜索空间中求解符号规划问题。为了稳定这一学习过程,我们在下层规划中引入3R策略,使用并行的修复、重启和回滚恢复来为上层学习提供可靠且自适应的反馈。在三个具有挑战性的基准上的实验展示了最先进的性能,包括失败率降低80.04%和规划时间减少57.14%。我们进一步在仿真和现实世界中的四足移动机械臂上验证了该框架,展示了其在高效且可部署的神经符号任务规划方面的潜力。

英文摘要

Task planning often suffers from severe efficiency bottlenecks when robots must reason over long-horizon action sequences under complex logical constraints, including object affordances, spatial relationships, and sequential action dependencies. Recent neuro-symbolic methods improve planning efficiency by learning object-importance scores to prune task-irrelevant objects, but they typically rely on fixed offline supervision generated from full search spaces. This creates a train-test mismatch: at deployment, the planner operates in pruned search spaces induced by the model's own imperfect predictions, leading to exposure bias and degraded planning performance. To address this challenge, we formulate object-importance learning for task planning as an imperative learning-based bilevel optimization problem. The upper level optimizes a neural scorer, while the lower level solves a symbolic planning problem in the score-pruned search space. To stabilize this learning process, we introduce a 3R strategy into the lower-level planning, using parallel Repair, Restart, and Rollback recovery to provide reliable and adaptive feedback for upper-level learning. Experiments on three challenging benchmarks demonstrate state-of-the-art performance, including an 80.04% reduction in failure rate and a 57.14% reduction in planning time. We further validate the framework on a quadruped-based mobile manipulator in simulation and the real world, demonstrating its potential for efficient and deployable neuro-symbolic task planning.

2606.06885 2026-06-08 cs.CV cs.AI 交叉投稿

FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising

FreeAnimate: 基于预览引导去噪的无训练人体图像动画

Yuan Zeng, Yujia Shi, Zongqing Lu, QingMin Liao

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出FreeAnimate框架,利用图像扩散模型内在能力实现无训练的人体图像动画,通过预览生成策略提供时序和结构先验,结合反演增强注意力和参考锚定自注意力模块,保证时序一致性和身份保持。

Comments Accepted to IEEE ICASSP 2026

详情
AI中文摘要

人体图像动画已经取得了显著进展,主要得益于扩散模型。然而,现有方法通常需要大量的训练数据和资源才能获得高质量结果,限制了泛化性和可访问性。在这项工作中,我们引入了FreeAnimate,一个无训练框架,利用图像扩散模型的内在能力来实现时序一致性、身份保持和背景稳定性。我们的方法包含一种新颖的预览生成策略,该策略从生成的预览帧中提供时序和结构先验,无需训练即可有效引导姿态对齐和背景一致性。此外,FreeAnimate引入了反演增强注意力和参考锚定自注意力模块,以保证时序一致性和身份保持。实验结果表明,FreeAnimate优于现有的无训练竞争方法和基于训练的基线方法,生成的图像质量可与最先进的方法相媲美,并在不同数据集上展现出强大的泛化能力。我们的项目页面位于 https://freeani.github.io/。

英文摘要

Human Image Animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand substantial training data and resources to achieve high-quality results, limiting generalization and accessibility. In this work, we introduce FreeAnimate, a training-free framework that leverages the inherent capabilities of image diffusion models to enable temporal consistency, identity preservation, and background stability. Our approach incorporates a novel preview generation strategy that provides temporal and structural priors from generated preview frames, effectively guiding pose alignment and background consistency without training. Additionally, FreeAnimate introduces Inversion-Boosted Attention and Reference-Anchored Self-Attention modules to guarantee temporal consistency and identity preservation. Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods and offering robust generalization across diverse datasets. Our project page is at https://freeani.github.io/.

2606.06903 2026-06-08 cs.CV cs.AI 交叉投稿

Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy

超越骨架:使用Same2X训练策略直接从驱动视频学习动画

Yuan Zeng, Yujia Shi, Yuhao Yang, Dongxia Liu, Zongqing Lu, Wenming Yang, Qingmin Liao

发表机构 * Tsinghua University(清华大学) Harbin Institute of Technology(哈尔滨工业大学) Pengcheng Laboratory(鹏城实验室)

AI总结 提出DirectAnimator框架,通过驱动线索三元组和Same2X训练策略,绕过姿态提取直接从原始视频学习动画,实现鲁棒且高质量的人体图像动画生成。

Comments Accepted to ICLR 2026

详情
AI中文摘要

人体图像动画旨在根据从驱动视频中提取的姿态信息,从静态参考图像生成视频。现有方法通常依赖姿态估计器提取中间表示,但在遮挡或复杂姿态下这些信号容易出错。基于这些观察,我们提出了DirectAnimator,一个绕过姿态提取并直接从原始驱动视频学习的框架。我们引入了一个由姿态、面部和位置线索组成的驱动线索三元组,以语义丰富且稳定的形式捕捉运动、表情和对齐,并通过CueFusion DiT块融合它们,以实现去噪过程中的可靠控制。为了使学习在驱动和参考身份不同时依然可靠,我们设计了Same2X训练策略,将跨身份特征与从相同身份数据中学到的特征对齐,从而正则化优化并加速收敛。大量实验表明,DirectAnimator在保持身份的同时达到了最先进的视觉质量,对遮挡和复杂关节运动具有鲁棒性,并且计算资源更少。我们的项目页面位于 https://directanimator.github.io/。

英文摘要

Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Existing approaches often rely on pose estimators to extract intermediate representations, but such signals are prone to errors under occlusion or complex poses. Building on these observations, we present DirectAnimator, a framework that bypasses pose extraction and directly learns from raw driving videos. We introduce a Driving Cue Triplet consisting of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form, and we fuse them through a CueFusion DiT block for reliable control during denoising. To make learning dependable when the driving and reference identities differ, we devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence. Extensive experiments demonstrate that DirectAnimator attains state-of-the-art visual quality and identity preservation while remaining robust to occlusions and complex articulation, and it does so with fewer computational resources. Our project page is at https://directanimator.github.io/.

2606.06906 2026-06-08 cs.CL cs.AI 交叉投稿

EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

EASE-TTT: 面向长上下文问答的基于证据对齐的选择性测试时训练

Xiaopeng Yuan, Zebin Wang, Suwen Wang, Zongxin Yang, Haohan Wang, Yushun Dong

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Harvard University(哈佛大学) Brion, ASML US LP Florida State University(佛罗里达州立大学)

AI总结 提出EASE-TTT框架,通过将检索到的证据块转化为软注意力监督目标,指导查询侧参数适应,从而在保留完整上下文的情况下提升小模型的长上下文问答性能。

Comments 13 pages, 4 figures, 3 tables

详情
AI中文摘要

长上下文问答(QA)对于较小的语言模型来说仍然具有挑战性,即使输入中已经存在包含答案的证据。现有的上下文内检索方法定位并暴露候选证据块给问题,但它们止步于输入级证据暴露,而不是调整控制模型如何在整个上下文位置上分配注意力的查询侧注意力参数。相比之下,轻量级的测试时适应方法,如仅查询的测试时训练(qTTT),由于它们通用的跨度级自监督目标无法识别哪些上下文位置支持当前答案,因此未能解决证据定位问题。在本文中,我们提出了基于证据对齐的选择性测试时训练(EASE-TTT),这是一个上下文内检索增强的测试时训练框架,它将选定的证据块转换为对其标记位置的软注意力监督目标。EASE-TTT不是用检索到的块替换完整上下文,而是使用生成的注意力目标来指导查询侧适应,适应后的模型从原始完整上下文中生成最终答案。在六个LongBench QA任务和三个小型仅解码器语言模型上的实验表明,EASE-TTT在全上下文推理、仅检索基线和qTTT中实现了最强的宏平均性能,支持了长上下文QA中基于证据对齐的测试时适应。

英文摘要

Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate evidence chunks for the question, but they stop at input-level evidence exposure rather than adapting the query-side attention parameters that control how the model allocates attention over full-context positions. In contrast, lightweight test-time adaptation methods, such as query-only test-time training (qTTT), leave evidence localization unresolved because their generic span-level self-supervised objectives do not identify which context positions support the current answer. In this paper, we propose Evidence-Aligned SElective Test-Time Training (EASE-TTT), a within-context retrieval-augmented test-time training framework that converts selected evidence chunks into a soft attention supervision target over their token positions. Instead of replacing the full context with retrieved chunks, EASE-TTT uses the resulting attention target to guide query-side adaptation, with the adapted model generating the final answer from the original full context. Experiments on six LongBench QA tasks and three small decoder-only language models show that EASE-TTT achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT, supporting evidence-aligned test-time adaptation in long-context QA.

2606.06907 2026-06-08 eess.AS cs.AI cs.SD 交叉投稿

SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models

SpectCount: 通过合成信号进行频谱时间计数改进大型音频语言模型

Seonuk Kim, Yonghyeon Jun, Ju Yeon Kang, Jimin Hong, Yoonhyeong Lee, Nam Soo Kim

发表机构 * Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea(电气与计算机工程系和INMC,首尔国立大学,首尔,韩国)

AI总结 针对大型音频语言模型在频谱时间感知上的弱点,提出SpectCount方法,利用动态生成的完全合成音频信号进行数据高效微调,无需真实音频或标注,显著提升多种听觉基准性能。

Comments 5 pages, 5 figures

详情
AI中文摘要

大型音频语言模型(LALMs)通过音频编码器和大规模音频数据扩展了大型语言模型。然而,高质量标注音频数据的稀缺性仍然是扩展的根本瓶颈。通过探测信号可检测性分析,我们识别出基础LALM在细粒度频谱时间感知上的弱点。为了解决这些挑战,我们提出频谱时间计数(SpectCount),一种基于动态生成的完全合成音频信号的数据高效微调方法,无需依赖真实世界音频、标注或预训练生成模型。SpectCount不仅解决了观察到的弱点,还在微调期间未见的声音、音乐和语音等多种听觉基准上提升了性能。这些结果表明,针对弱点的合成信号为LALMs增强听觉理解能力提供了一条数据高效的途径。

英文摘要

Large audio language models (LALMs) extend large language models with an audio encoder and large-scale audio data. However, the scarcity of high-quality annotated audio data remains a fundamental bottleneck for scaling. Through probing signal detectability analysis, we identify fine-grained spectrotemporal perceptual weaknesses in a foundation LALM. To address these challenges, we propose Spectrotemporal Counting (SpectCount), a data-efficient fine-tuning approach based on fully synthetic audio signals generated on-the-fly, without relying on real-world audio, annotations, or pretrained generative models. SpectCount not only resolves the observed weaknesses but also improves performance on diverse auditory benchmarks spanning sound, music, and speech, unseen during fine-tuning. These results suggest that weakness-targeted synthetic signals provide a data-efficient path toward enhanced auditory understanding capabilities in LALMs.

2606.06920 2026-06-08 cs.LG cs.AI 交叉投稿

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

微调陷阱:评估负迁移及PEFT在亚十亿参数数学推理中的作用

Rahul Nair, Chun Tao

发表机构 * GitHub University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 本研究评估了五种亚十亿参数模型在数学推理任务中的微调策略,发现全量微调对小于3亿参数的模型造成负迁移,而参数高效微调(PEFT)是稳定性要求。

Comments 8 pages, 6 figures, 2 tables

详情
AI中文摘要

在边缘设备上部署小型语言模型(SLM)需要高效的微调策略,使模型适应新任务而不降低其通用能力。在本研究中,我们对五种亚十亿参数模型(135M-1B)在数学推理任务上进行了基准测试,并发现了一个关键脆弱性:全量微调(Full FT)会主动损害300M以下参数模型的性能,通常将准确率降至零样本基线以下。这种“负迁移”使得参数高效微调(PEFT)不仅是效率上的偏好,更是稳定性上的要求。我们发现,虽然低秩适应(LoRA)和权重分解LoRA(DoRA)性能相当,但它们的优势因任务而异:DoRA在复杂推理(GSM8K)中表现出色,而LoRA在模式匹配(OrcaMath)中占主导地位。特别地,在对齐模型(Qwen2.5-0.5B)上,LoRA优于全量微调,甚至在最小架构(SmolLM2-135M)上,简单的5-shot上下文学习也优于全量微调。基于这些发现,我们建议对所有对齐的亚十亿参数模型默认使用PEFT,并警告不要对任何小于500M参数的架构使用全量微调,以防止灾难性遗忘。本工作的复现可在此网址找到:https://github.com/gulguluu/tiny-slm-finetune-compare。

英文摘要

Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B) on mathematical reasoning tasks and uncover a critical vulnerability: Full Fine-Tuning (Full FT) actively harms performance in models under 300M parameters, often dropping accuracy below zero-shot baselines. This "negative transfer" makes Parameter-Efficient Fine-Tuning (PEFT) not just an efficiency preference, but a stability requirement. We find that while Low-Rank Adaptation (LoRA) and Weight-Decomposed LoRA (DoRA) perform comparably, their strengths vary by task; DoRA excels in complex reasoning (GSM8K), while LoRA dominates pattern matching (OrcaMath). In particular, Full FT is outperformed by LoRA on aligned models (Qwen2.5-0.5B) and even by simple 5-shot In-Context Learning on the smallest architectures (SmolLM2-135M). Based on these findings, we recommend defaulting to PEFT for all aligned sub-1B models and caution against Full FT for any architecture smaller than 500M parameters to prevent catastrophic forgetting. Reproduction of this work can be found at https://github.com/gulguluu/tiny-slm-finetune-compare.

2606.06942 2026-06-08 cs.CL cs.AI 交叉投稿

Didact: A Cross-Domain Capability Discovery System for Defence

Didact:面向国防的跨领域能力发现系统

Aarya Bodhankar, Aditya Joshi, Bao Gia Doan, Thomas Marchant, Oscar Leslie, Flora Salim

发表机构 * University of New South Wales, Sydney, Australia(新南威尔士大学,悉尼,澳大利亚) Cyndr.ai, Australia(Cyndr.ai,澳大利亚)

AI总结 提出Didact原型系统,通过构建知识图谱和复合检索增强生成管道,整合异构国防报告与政策文档,支持自然语言对话和可视化证据追溯,解决跨领域能力发现碎片化问题。

Comments Under Review at CIKM 2026 (System Demonstration Track)

详情
AI中文摘要

国防及国防相关领域的政策制定者必须监控快速发展的研究以及与其作战和战略需求相关的部门优先事项。实际上,这些来源分散在异构格式、不连贯的存储库和孤立的更新流中,使得能力发现缓慢且难以审计。我们提出了Didact,一个原型系统,它将来自澳大利亚的公开国防报告和政策文件与基于澳大利亚研究出版物构建的专用知识图谱相结合。Didact为面向政策的工作流程提供自然语言对话,并利用复合检索增强生成(RAG)管道。Didact的一个关键特性是交互式证据轨道,它可以可视化检索到的证据和源关系。我们对Didact的输出质量和运行时间的评估凸显了其实用性。虽然Didact是作为澳大利亚背景下的学术界-工业界合作项目共同开发的,但它适用于知识同样碎片化的其他领域。演示视频可在此处获取:

英文摘要

Policymakers in defence and defence-aligned sectors must monitor rapidly evolving research alongside sector priorities relevant to operational and strategic needs. In practice, these sources are fragmented across heterogeneous formats, disjoint repositories, and siloed update streams, making capability discovery slow and difficult to audit. We present Didact, a prototype that integrates publicly available defence reports and policy documents from Australia with a purpose-built knowledge graph derived from Australian research publications. Didact provides natural language conversations for policy-oriented workflows, and leverages a composite retrieval-augmented generation (RAG) pipeline. A key feature of Didact is an interactive Evidence Rail that visualises retrieved evidence and source relationships. Our evaluation of the output quality and runtime of Didact highlights its utility. While Didact has been co-developed as an academia-industry project for the Australian context, it is adaptable to other domains where knowledge is similarly fragmented. A demonstration video is available here:

2606.06943 2026-06-08 cs.CV cs.AI 交叉投稿

SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models

SS-TPT:面向对抗鲁棒视觉语言模型的稳定性和适用性引导的测试时提示微调

Sunoh Kim, Daeho Um

发表机构 * Dankook University, Yongin, South Korea(檀国大学,韩国龙仁) University of Seoul, Seoul, South Korea(首尔市立大学,韩国首尔)

AI总结 提出SS-TPT方法,通过稳定性与适用性分数评估增强视图质量,引导测试时提示微调,在保持高吞吐量的同时显著提升对抗鲁棒性。

Comments Accepted in ICML2026

详情
AI中文摘要

视觉语言模型(如CLIP)实现了强大的零样本识别,但在对抗扰动下仍然非常脆弱。最近的测试时自适应防御通过利用大量增强视图来提高鲁棒性,但这导致了不切实际的减速和明确的鲁棒性-吞吐量权衡。为了应对这一挑战,我们提出了稳定性和适用性引导的测试时提示微调(SS-TPT),通过两个互补分数评估每个增强视图的质量:(1)稳定性,衡量对弱增强的预测不变性,以及(2)适用性,衡量视图间的特征空间密度。这些稳定性和适用性(SS)分数通过SS引导的一致性损失和SS加权预测来指导自适应和推理,放大可信视图同时抑制受损视图。大量实验表明,SS-TPT显著优于先前最先进的方法,在不同数据集和不同视图数量下实现了卓越的鲁棒性-吞吐量权衡,从而展示了强大的实用性和泛化性。我们的代码可在以下网址获得:https://github.com/sunoh-kim/SS-TPT。

英文摘要

Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but remain highly fragile under adversarial perturbations. Recent test-time adaptation defenses improve robustness by leveraging many augmented views, but this leads to impractical slowdown and a clear robustness-throughput trade-off. To address this challenge, we present Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT), evaluating the quality of each augmented view via two complementary scores: (1) stability, measuring prediction invariance to weak augmentations, and (2) suitability, measuring feature-space density among views. These stability and suitability (SS) scores guide both adaptation and inference through an SS-guided consistency loss and an SS-weighted prediction, amplifying trustworthy views while suppressing corrupted ones. Extensive experiments demonstrate that SS-TPT significantly outperforms prior state-of-the-art methods, achieving superior robustness-throughput trade-offs across diverse datasets and varying numbers of views, thereby demonstrating both strong practicality and generality. Our code is available at https://github.com/sunoh-kim/SS-TPT.

2606.06946 2026-06-08 cs.CL cs.AI 交叉投稿

Auditing Training Data in Domain-adapted LLMs: LoRA-MINT

领域自适应大语言模型中的训练数据审计:LoRA-MINT

Gonzalo Mancera, Daniel DeAlcala, Aythami Morales, Julian Fierrez, Ruben Tolosana, Francisco Jurado

发表机构 * University of Granada(格拉纳达大学)

AI总结 提出LoRA-MINT方法,通过成员推理测试审计LoRA微调的大语言模型训练数据,在四个模型和三个基准上达到0.77-0.92的精度,优于现有基线。

Comments IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026

详情
AI中文摘要

我们提出了LoRA-MINT,一种应用于通过低秩适应(LoRA)针对特定自然语言处理(NLP)任务微调的最新大语言模型(LLMs)的成员推理测试(MINT)新方法。主要目标是评估个体样本是否属于这些适应模型的训练数据,为知识产权和敏感数据管理提供有用的审计工具。我们的分析探索了模型困惑度与成员状态之间的关系,提供了一个系统框架来估计微调LLMs中的数据暴露程度。我们在四个模型和三个基准数据集上进行了实验,在确定给定数据是否用于训练时获得的精度值在0.77到0.92之间,优于最先进的基线,并证明了所提出方法的鲁棒性和通用性。总的来说,我们的发现强调了LoRA-MINT作为审计LLMs的有效且可扩展框架的潜力,提高了透明度,并促进了AI和NLP技术的道德和负责任部署。为了具体性和当前相关性,我们的讨论和实验集中在LoRA调整的LLMs上,但请注意,所提出的大部分方法很容易适用于审计任何其他适应LLMs的技术或更一般地任何其他领域自适应AI模型的训练数据。

英文摘要

We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large Language Models (LLMs) fine-tuned for specific Natural Language Processing (NLP) tasks through Low-Rank Adaptation (LoRA). The primary goal is to assess whether individual samples were part of the training data of these adapted models, providing a useful auditing tool for the management of intellectual property and sensitive data. Our analysis explores the relationship between model perplexity and membership status, providing a systematic framework for estimating data exposure in fine-tuned LLMs. We conducted experiments on four models and three benchmark datasets, obtaining precision values in determining if given data were used for training ranging from 0.77 to 0.92, which outperform state-of-the-art baselines and demonstrate the robustness and generality of the proposed method. In general, our findings underscore the potential of LoRA-MINT as an effective and scalable framework for auditing LLMs, improving transparency, and fostering the ethical and responsible deployment of AI and NLP technologies. For the sake of concreteness and current relevance, our discussion and experiments are centered on LoRA-adjusted LLMs, but note that most of the presented methodology is easily applicable for auditing training data given any other technique for adapting LLMs or, more generally, any other domain-adapted AI models.
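
The abstract relates model perplexity to membership status and reports precision as the metric. The sketch below shows a generic perplexity-threshold membership test with a precision calculation. It is a simple baseline written for illustration, not the LoRA-MINT method itself. The function names and the threshold are assumptions, and the per-token log-probabilities are assumed to come from the fine-tuned model under audit.

```python
import math

def perplexity(token_logprobs):
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def predict_member(token_logprobs, threshold):
    """Flag a sample as a likely training member when its perplexity is below
    the (hypothetical, dataset-specific) threshold."""
    return perplexity(token_logprobs) < threshold

def precision(predictions, labels):
    """Fraction of samples flagged as members that really were training members."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    return tp / (tp + fp) if (tp + fp) else 0.0
```

In practice the threshold would be calibrated on samples with known membership, since perplexity scales differ across models and datasets.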

2606.06950 2026-06-08 cs.CV cs.AI 交叉投稿

When is 3D Worth It? A Resource-Performance Frontier for CNNs and Transformers in Lung CT

何时3D值得?肺CT中CNN和Transformer的资源-性能前沿

Md Enamul Hoq, Sharafat Hossain, Imraul Emmaka, Linda Larson-Prior, Lawrence Tarbox, Jonathan Bona, Donald Johann Jr. and Fred Prior

发表机构 * Department of Biomedical Informatics University of Arkansas for Medical Sciences(生物医学信息学系,美国阿肯色大学医学科学分校) Department of Information Science University of Arkansas at Little Rock(信息科学系,美国阿肯色大学小岩分校) Department of Neuroscience University of Arkansas for Medical Sciences(神经科学系,美国阿肯色大学医学科学分校)

AI总结 研究在肺CT中2D、2.5D和3D输入对CNN和Transformer的影响,发现2.5D CNN在判别-稳定性权衡上最优,而3D CNN和Transformer存在不稳定性或退化预测。

Comments 8 pages, 6 figures

详情
AI中文摘要

三维模型通常被认为更适合体积医学成像,但其实际价值取决于性能提升是否值得增加的计算成本和复杂性。我们不提出新架构,而是研究在固定训练协议下,输入维度(2D、2.5D、3D)如何影响卷积神经网络(CNN)和视觉Transformer(ViT)的行为。使用无泄漏的NLST队列(n=1,977)和辅助LIDC-IDRI数据,我们发现2.5D CNN在我们的比较中提供了最有利的判别-稳定性权衡(ROC-AUC 0.682,95% CI [0.546, 0.799]),具有稳定的操作点。相比之下,3D CNN表现出阈值不稳定性,而Transformer出现退化预测,例如全正预测。置信区间宽且重叠,因此我们将这些结果呈现为受控的资源-性能前沿和失败模式分类,而非明确的优越性声明。对于类别不平衡的肺癌筛查分类,2D和2.5D输入在性能、稳定性和计算效率之间提供了比全3D表示更可靠的权衡。

英文摘要

Three-dimensional models are widely assumed preferable for volumetric medical imaging, yet their practical value depends on whether performance gains justify added computational cost and complexity. Rather than proposing a new architecture, we study how input dimensionality (2D, 2.5D, 3D) affects model behavior across convolutional neural networks (CNNs) and Vision Transformers (ViTs) under a fixed training protocol. Using a leakage-free NLST cohort (n = 1,977) with supporting LIDC-IDRI data, we find that the 2.5D CNN offers the most favorable discrimination-stability trade-off in our comparison (ROC-AUC 0.682, 95% CI [0.546, 0.799]) with a stable operating point. In contrast, 3D CNNs show threshold instability, and transformers exhibit degenerate predictions, such as all-positive predictions. Confidence intervals are wide and overlapping, so we present these results as a controlled resource-performance frontier and a failure-mode taxonomy rather than as definitive superiority claims. For class-imbalanced lung cancer screening classification, 2D and 2.5D inputs provide a more reliable trade-off between performance, stability, and computational efficiency than full 3D representations.

2606.06959 2026-06-08 cs.CL cs.AI 交叉投稿

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

OpenHalDet:面向多种生成场景的幻觉检测统一基准

Xinyi Li, Zhen Fang, Yongxin Deng, Jinyuan Luo, Hongnan Ma, Changdae Oh, Zijing Shi, Shanshan Ye, Hanchen Wang, Shu-Lin Chen, Yadan Luo, Mengyue Yang, Sean Du, Sharon Li, Ling Chen

发表机构 * University of Technology Sydney(悉尼科技大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) University of Bristol(布里斯托大学) The University of Queensland(昆士兰大学) Nanyang Technological University(南洋理工大学)

AI总结 提出OpenHalDet基准,标准化幻觉检测评估流程,支持黑盒、灰盒、白盒检测器,实现跨任务、模型和检测器的可控比较。

Comments Preprint. Code and data are available at https://github.com/Nellie179/Hallucination-Detection

详情
AI中文摘要

幻觉检测对于大型语言模型(LLM)的可靠部署至关重要。然而,现有评估面临两个核心挑战:推理配置和评估不一致,以及下游领域和任务的覆盖有限。因此,报告的检测器性能往往难以比较、复现,并泛化到特定实验设置之外。我们引入OpenHalDet,一个面向多种生成场景的幻觉检测统一基准。OpenHalDet标准化了评估流程,从提示构建和响应生成到真实性标注、检测器评分和指标计算。它支持不同访问设置下的异构检测器家族,包括仅使用生成输出的黑盒方法、依赖基于概率信号的灰盒方法,以及利用内部模型信号的白盒方法。通过将多样化的任务、模型和检测器纳入共享框架,OpenHalDet实现了可控比较,并提供了不同检测范式在LLM应用中行为的系统视角。我们发布OpenHalDet作为开放且可扩展的代码库,以促进幻觉检测方法的可复现评估和未来发展。代码和数据集可在 https://github.com/Nellie179/Hallucination-Detection 获取。

英文摘要

Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of downstream domains and tasks. Consequently, reported detector performance is often difficult to compare, reproduce, and generalize beyond specific experimental settings. We introduce OpenHalDet, a unified benchmark for hallucination detection across diverse generation scenarios. OpenHalDet standardizes the evaluation pipeline, from prompt construction and response generation to truthfulness annotation, detector scoring, and metric computation. It supports heterogeneous detector families under different access settings, including black-box methods that use only generated outputs, gray-box methods that rely on probability-based signals, and white-box methods that exploit internal model signals. By bringing diverse tasks, models, and detectors into a shared framework, OpenHalDet enables controlled comparison and provides a systematic view of how different detection paradigms behave in LLM applications. We release OpenHalDet as an open and extensible codebase to facilitate reproducible evaluation and future development of hallucination detection methods. The code and datasets are available at https://github.com/Nellie179/Hallucination-Detection.

2606.06983 2026-06-08 eess.IV cs.AI cs.CV 交叉投稿

DaX: Learning General Pathology Representations Across Scales

DaX: 跨尺度的通用病理学表示学习

Bokai Zhao, Yiyang Zhang, Long Bai, Tai Ma, Hanqing Chao, Minfeng Xu

发表机构 * DAMO Academy, Alibaba Group(达摩院,阿里巴巴集团) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Hupan Lab(湖畔实验室)

AI总结 提出病理视觉基础模型DaX,通过改进DINOv3自监督学习,结合连续放大训练、跨尺度组织视图等设计,在44个公开数据集的161项临床任务上取得最佳平均性能。

详情
AI中文摘要

计算病理学需要能够跨不同临床终点迁移且对放大倍数、染色、扫描仪类型、切片制备和输入分辨率变化保持鲁棒的视觉表示。我们提出DaX,一个病理视觉基础模型,它将DINOv3风格的自监督学习适应到全切片组织病理学。DaX从自然图像DINOv3权重初始化,并融合了连续放大训练、跨尺度组织视图、方向无关和采集鲁棒的数据增强、多输入尺寸训练以及Gram锚定的密集一致性。这些设计旨在连接局部细胞形态与全局组织结构,同时稳定跨输入尺度的密集token级表示。我们进一步构建了一个WSI级基准,包含来自44个公共数据集的161项临床有意义任务,涵盖28,182名患者和34,394张切片,跨越四个临床领域和九个任务类别。所有模型在固定的患者级交叉验证协议下进行评估,并采用折叠级统计排名,从而实现可重复的比较,对分割依赖的变异性不敏感。在该基准上,DaX在任务中取得了最高的平均性能,并持续获得强大的任务级排名分数,其增益涵盖诊断病理学、生物标志物和分子谱分析、组织/标本背景以及风险、反应和预后。这些结果支持DaX作为计算病理学的可迁移视觉编码器,并为未来的病理基础模型提供了标准化的评估框架。项目页面:https://alibaba-damo-academy.github.io/DaX/benchboard/。

英文摘要

Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. We present DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning to whole-slide histopathology. DaX is initialized from natural-image DINOv3 weights and incorporates continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency. These designs aim to connect local cellular morphology with global tissue architecture while stabilizing dense token-level representations across input scales. We further construct a WSI-level benchmark comprising 161 clinically meaningful tasks from 44 public datasets, covering 28,182 patients and 34,394 slides across four clinical domains and nine task categories. All models are evaluated under a fixed patient-level cross-validation protocol with fold-level statistical ranking, enabling reproducible comparisons that are less sensitive to split-dependent variation. Across this benchmark, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores, with gains spanning diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis. These results support DaX as a transferable visual encoder for computational pathology and provide a standardized evaluation framework for future pathology foundation models. Project page: https://alibaba-damo-academy.github.io/DaX/benchboard/.

2606.06991 2026-06-08 cs.CV cs.AI 交叉投稿

Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

不要暂停:面向在线视频理解的流式视频-语言同步

Zhenyu Yang, Kairui Zhang, Shengsheng Qian, Weiming Dong, Changsheng Xu

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出流式视频-语言同步(SVLS)范式,通过帧驱动转换控制器和流式令牌调节器实现视频帧与语言生成的细粒度同步,在不中断感知的情况下进行实时交互。

详情
AI中文摘要

在线视频大语言模型(Video-LLMs)通过逐帧处理和主动响应,在人机交互方面取得了进展。然而,流式场景中仍存在一个关键挑战:现有模型在生成响应时通常会暂停视频感知,破坏了实时的视频-语言同步并导致卡顿。为了解决这个问题,我们引入了一种新的在线视频理解范式:流式视频-语言同步(SVLS),并提出了LyraV,一个基于分层控制框架的实时流式助手,具有两个核心创新。首先,帧驱动转换控制器(FDTC)是一个无需训练的基于验证的有限状态机,它做出高层语义决策,决定何时继续说话、开始新的响应或保持沉默。其次,流式令牌调节器(SToP)是一个即插即用的轻量级预测模块,动态调整语言生成速率以匹配视觉内容的节奏。具体来说,LyraV执行逐帧增量、子预算解码:在每个帧间隔内,它只发射适合实时预算的一小部分令牌,因此感知永远不会被阻塞整个句子。这些组件共同使LyraV能够无缝地交织传入的视频帧和生成的词令牌,实现细粒度的同步。在五个在线和三个离线基准上进行的广泛实验表明,LyraV保留了骨干网络的通用理解能力,同时显著提高了流式同步和叙事流畅性,实现了98.29%的视频播放同步率和3.89 FPS的实时处理速度。有趣的是,我们观察到LyraV的一个经验能力:对流式令牌进行动态推理,实现了与视觉输入并行的连续解释和“思考”。

英文摘要

Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while generating responses, breaking real-time video-language synchrony and causing stutters. To address this, we introduce a novel paradigm for online video understanding: Streaming Video-Language Synchrony (SVLS), and present LyraV, a live streaming assistant built upon a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components enable LyraV to seamlessly interleave incoming video frames with generated word tokens, achieving a fine-grained synchrony. Extensive experiments conducted on five online and three offline benchmarks demonstrate that LyraV preserves the backbone's general understanding ability while substantially improving streaming synchrony and narrative fluency, delivering a 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, we observe an empirical capability in LyraV: dynamic reasoning over streaming tokens, enabling continuous interpretation and "thinking" alongside visual input.
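
The per-frame incremental decoding idea in the abstract can be pictured as a scheduling loop: after each incoming frame, only a small chunk of tokens is emitted, and the rest waits for the next frame interval. The toy sketch below illustrates that interleaving only. The function name and the fixed `budget_per_frame` are assumptions; in the described system the tokens would be decoded by the model and the rate would be adapted by the Streaming Token Pacer rather than fixed.

```python
def interleave(frames, pending_tokens, budget_per_frame):
    """Interleave frames with token chunks of at most budget_per_frame tokens.

    Returns the interleaved event list and the tokens still waiting to be
    emitted once the frames run out.
    """
    events = []
    queue = list(pending_tokens)
    for frame in frames:
        # Perception step: the frame is always consumed first.
        events.append(("frame", frame))
        # Generation step: emit only what fits in this frame's budget.
        chunk, queue = queue[:budget_per_frame], queue[budget_per_frame:]
        events.extend(("token", tok) for tok in chunk)
    return events, queue
```

With two frames, five pending tokens, and a budget of two, each frame is followed by two tokens and one token is carried over, so no frame waits for a full sentence to finish.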

2606.07007 2026-06-08 cs.LG cs.AI 交叉投稿

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

理解稀疏自编码器中概念学习与神经元解释的几何视角

Chenhao Zhang, Chris Lin, Su-In Lee

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出统一数学框架,将概念学习形式化为集合对齐问题,区分检测、分离和近似三种学习强度,并给出几何条件与误差界,通过形式概念分析连接概念学习与神经元解释。

详情
AI中文摘要

我们提出了一个统一的数学框架,用于几何理解稀疏自编码器(SAE)中的概念学习和神经元解释。尽管SAE通过学习稀疏特征表示提高了神经网络的可解释性,但“概念”和“学习”的原则性定义仍不明确。我们将概念形式化为数据点的集合,并将概念学习视为人类定义概念与模型诱导概念之间的集合对齐问题。该公式区分了三种越来越强的学习概念——检测、分离和近似——并给出了概念可由单个神经元或多神经元单元表示的几何条件、误差界和容量约束。它还提供了对常见SAE现象的集合论解释,包括特征分裂、特征吸收、特征族和层次概念。最后,我们通过形式概念分析将概念学习与神经元解释联系起来,表明这两个方向不必一致,并且它们的多对多结构可以通过概念格来组织。在合成数据上使用ReLU和Top-$K$ SAE的实验说明了该理论,并揭示了SAE大小和稀疏性对概念学习的影响。

英文摘要

We propose a unified mathematical framework for a geometric understanding of concept learning and neuron interpretation in sparse autoencoders (SAEs). While SAEs improve interpretability of neural networks by learning sparse feature representations, a principled definition of "concept" and "learning" remains unclear. We formalize concepts as sets of data points and cast concept learning as a set-alignment problem between human-defined and model-induced concepts. This formulation distinguishes three increasingly strong notions of learning -- detection, separation, and approximation -- and yields geometric conditions, error bounds, and capacity constraints for when concepts can be represented by individual neurons or multi-neuron units. It also provides a set-theoretic account for common SAE phenomena, including feature splitting, feature absorption, feature families, and hierarchical concepts. Finally, we connect concept learning and neuron interpretation through formal concept analysis, showing that the two directions need not agree and that their many-to-many structure can be organized by concept lattices. Experiments on synthetic data with ReLU and Top-$K$ SAEs illustrate the theory and reveal the effects of SAE size and sparsity on concept learning.

2606.07030 2026-06-08 cs.SD cs.AI cs.CL cs.LG 交叉投稿

Phonetic Error Analysis of Raw Waveform Acoustic Models

原始波形声学模型的音素错误分析

Erfan Loweimi, Zhengjun Yue, Andrea Carmantini, Zoran Cvetkovic, Steve Renals, Peter Bell

发表机构 * Centre for Speech Technology Research (CSTR), University of Edinburgh, UK(语音技术研究中心(CSTR),爱丁堡大学,英国) Cisco, UK(思科公司,英国) SLAI & CUHK-SZ, China(SLAI与CUHK-SZ,中国) King's College London, UK(伦敦国王学院,英国)

AI总结 通过分解音素错误率、分析混淆矩阵,发现BLSTM层对过渡依赖类提升最大,WSJ迁移学习对辅音改进约是元音的三倍,且混淆模式反映固有音素相似性。

Comments INTERSPEECH2026

详情
AI中文摘要

我们分析了原始波形声学模型在TIMIT音素识别中的错误模式,超越了整体音素错误率(PER)。将PER按三个广义语音类别(BPC)分解,并从替换错误构建混淆矩阵。我们的模型将参数化(SincNet, Sinc2Net)或非参数化CNN与双向LSTM相结合,在开发/测试集上分别达到13.9%/15.3%的PER,这是原始波形模型在TIMIT上的最佳报告结果。来自WSJ的迁移学习将PER降至11.3%/12.3%,超越了Filterbank基线。每个BPC的分析表明,BLSTM层对过渡依赖类提升最大,而WSJ迁移学习对辅音的改进约是元音的三倍。原始波形和Filterbank系统的混淆模式一致,表明主要混淆反映了固有的音素相似性。

英文摘要

We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is decomposed across three broad phonetic class (BPC) categorisations, and confusion matrices are constructed from substitution errors. Our models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with Bidirectional LSTMs, achieving 13.9%/15.3% PER on Dev/Test, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline. Per-BPC analysis reveals that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, indicating that the dominant confusions reflect inherent phonetic similarities.
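
The analysis described above builds confusion matrices from substitution errors and breaks errors down by broad phonetic class (BPC). A minimal sketch of that bookkeeping follows. The phone-to-class map is a tiny hypothetical example rather than the categorisation used in the paper, and only substitutions are counted here, whereas the full PER also includes insertions and deletions.

```python
from collections import Counter

# Hypothetical, heavily simplified phone-to-class map for illustration only.
BPC = {
    "iy": "vowel", "ih": "vowel",
    "s": "fricative", "z": "fricative",
    "p": "stop", "b": "stop",
}

def confusion_counts(substitutions):
    """Count (reference, hypothesis) phone pairs taken from substitution errors."""
    return Counter(substitutions)

def errors_by_class(substitutions, phone_to_class):
    """Aggregate substitution errors by the broad class of the reference phone."""
    return Counter(phone_to_class[ref] for ref, _ in substitutions)
```

Comparing such counts between two systems, for example a raw waveform model and a Filterbank model, is one way to check whether their dominant confusions coincide.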

2606.07032 2026-06-08 cs.CV cs.AI 交叉投稿

Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

前所未见:基于一致视频源数据集的真正零样本组合图像检索基准测试

Zhenyu Yang, Zemin Du, Shengsheng Qian, Changsheng Xu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对现有零样本组合图像检索数据集存在参考与目标图像不相关、非真正零样本的问题,提出ZeroSight基准,包含来自视频的一致参考-目标对和训练无关的MLLM驱动方法SC4CIR,通过三重对称一致性检查识别难负样本,实验表明现有方法性能被高估。

详情
AI中文摘要

零样本组合图像检索(ZS-CIR)旨在基于由参考图像和相对描述组成的查询,在没有训练样本的情况下检索目标图像。现有的ZS-CIR数据集常因图像来源嘈杂而导致参考图像与目标图像完全不相关,并且由于使用了CLIP等模型已训练过的公开图像数据集,未能实现真正的零样本场景。为解决这些挑战,我们引入了ZeroSight,一个用于ZS-CIR的新基准。它包括一个来自视频的一致参考-目标对数据集、一个数据构建流程,以及考虑多个正负目标图像排序的评估方法。我们通过从单个视频中提取帧并使用LLM辅助方法生成相对描述,确保参考-目标对在视觉和语义上一致。为确保真正的零样本场景,我们使用2022年3月31日之后发布的视频数据,确保其未包含在CLIP的预训练数据中。此外,我们提出了一种无需训练的MLLM驱动方法SC4CIR(对称一致性用于CIR),该方法通过三重对称一致性检查能够有效识别难负目标。该方法是即插即用的,能与各种CIR方法无缝集成并显著提升性能。我们通过27种方法的实验结果表明,当前的ZS-CIR数据集和评估指标导致了检索性能的膨胀,夸大了CIR方法的能力。我们的基准和模型可通过 https://github.com/sotayang/ZeroSight 访问。

英文摘要

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, ensuring it was not included in CLIP's pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through 3 symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at https://github.com/sotayang/ZeroSight.

2606.07036 2026-06-08 cs.CV cs.AI cs.CE cs.LG 交叉投稿

STREAM: Stochastic Riemannian Flow Matching with Anisotropic Decoder for Digital Histopathology Image Generation

STREAM: 用于数字组织病理学图像生成的随机黎曼流匹配与各向异性解码器

Won June Cho, Daeky Jeong, Hyeongyeol Lim, Hongjun Yoon

发表机构 * DEEPNOID Inc.(DEEPNOID公司)

AI总结 提出STREAM框架,利用组织病理学视觉基础模型的patch-token特征作为潜在空间,通过黎曼流匹配生成高质量组织病理学图像,解决条件崩溃问题,并设计各向异性解码器提升生成质量。

Comments 27 pages, 7 figures

详情
AI中文摘要

合成组织病理学图像生成解决了计算病理学中的关键挑战,包括患者隐私和对基础模型大规模训练数据日益增长的需求。潜在扩散模型主导了图像生成领域,最近的研究强调潜在空间的选择对生成图像的质量至关重要。现有的组织病理学最先进生成模型使用预训练的视觉基础模型(VFM)作为条件信号,我们观察到这会导致“条件崩溃”,即条件信号主导潜在空间,降低生成样本的质量和多样性。因此,我们转而使用预训练的组织病理学VFM作为潜在空间本身,利用其编码丰富语义信息的patch-token特征。我们经验性地表明,这些特征经过$\ell_2$归一化,位于单位超球面$\mathcal{S}^{d-1}$上,具有强烈的角度主导性和内在曲率,使其自然适用于黎曼公式。因此,我们提出了STREAM,这是第一个在病理学领域应用黎曼流匹配的框架。STREAM包括两个阶段:1)一种桥式随机扰动,在$\mathcal{S}^{d-1}$上建立每个token的可整流性,用于在潜在空间中训练扩散变换器(DiT);2)一种新颖的各向异性解码器,对速度场雅可比矩阵的低能量方向分配鲁棒性,同时保持其高能量方向的保真度。STREAM在乳腺癌和结直肠癌数据集上实现了最先进的重建和生成性能。代码将在接收后公开发布。

英文摘要

Synthetic histopathology image generation addresses critical challenges in computational pathology, including patient privacy and the growing need for large-scale training data for foundation models. Latent diffusion models have dominated the image generation domain, with recent works emphasizing that the choice of latent space is critical to the quality of generated images. Existing state-of-the-art generative models in histopathology use pretrained Vision Foundation Models (VFMs) as conditioning signals, and we observe that this leads to "conditioning collapse," where the conditioning signal dominates the latent space and lowers the quality and diversity of generated samples. Therefore, we instead use pretrained histopathology VFMs as the latent space itself, leveraging their patch-token features that encode rich semantic information. We empirically show that these features are $\ell_2$-normalized and lie on the unit hypersphere $\mathcal{S}^{d-1}$ with strong angular dominance and intrinsic curvature, making them naturally suited for a Riemannian formulation. We therefore present STREAM, the first framework to apply Riemannian flow matching in the pathology domain. STREAM consists of two stages: 1) a bridge-type stochastic perturbation that establishes per-token rectifiability on $\mathcal{S}^{d-1}$ for training a Diffusion Transformer (DiT) in latent space, and 2) a novel anisotropic decoder that allocates robustness to low-energy directions of the velocity-field Jacobian while preserving fidelity along its high-energy directions. Together, STREAM achieves state-of-the-art reconstruction and generation performance on breast and colorectal cancer datasets. The code will be publicly released upon acceptance.
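
Because the abstract places the patch-token features on the unit hypersphere, paths between two latent points are naturally measured along great-circle arcs rather than straight lines. The sketch below shows plain geodesic interpolation on the sphere. It is offered only as background for the Riemannian setting and is not STREAM's bridge-type stochastic perturbation, whose details the abstract does not give. The function names are illustrative.

```python
import math

def normalize(v):
    """Project a non-zero vector onto the unit hypersphere (l2 normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def slerp(x0, x1, t):
    """Point at fraction t along the great-circle arc from x0 to x1.

    Both inputs are assumed to be unit vectors. Antipodal inputs have no
    unique geodesic and are not handled in this sketch.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    theta = math.acos(dot)  # angle between the two points
    if theta < 1e-8:
        return list(x0)  # endpoints coincide
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]
```

Unlike linear interpolation followed by renormalization, this path moves at constant angular speed, so every intermediate point stays on the sphere by construction.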

2606.07054 2026-06-08 cs.CL cs.AI cs.CR cs.LG 交叉投稿

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

TRACE: 通过自适应跨步骤证据聚合的LLM智能体轨迹推理

Vijitha Mittapalli, Shreyaa Jayant Dani, Satya Srujana Pilli, Snigdha Ansu, Mohammadreza Teymoorianfard, Franck Dernoncourt, Hongjie Chen, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed

发表机构 * University of Massachusetts at Amherst(马萨诸塞大学阿默斯特分校) Adobe Research(Adobe研究) Dolby Labs(杜比实验室) University of Oregon(俄勒冈大学) Cisco(思科)

AI总结 提出TRACE框架,通过TIJ循环识别高信号区域、累积跨步骤证据并合成轨迹级判决,在SHADE-Arena的十个任务域上F1达0.713,召回率0.844,尤其擅长长距离证据链接。

详情
AI中文摘要

自主LLM智能体可以通过一系列单独良性的行动追求隐藏的恶意目标,这使得使用标准轨迹级监控难以检测破坏行为。现有方法要么一次性评估完整轨迹,要么将其划分为独立评分的窗口,限制了连接时间上相距较远的证据的能力。我们提出TRACE,一个用于长视界LLM智能体轨迹的监控框架。TRACE通过一个TIJ(分诊-检查-判决)循环运行,该循环识别高信号区域,执行有针对性的检查,同时在各推理步骤间保持已累积的证据,并综合出轨迹级判决。我们在SHADE-Arena的十个任务域上评估TRACE,与最先进的基线进行比较。TRACE实现了0.713的总体F1分数和0.844的召回率,在需要长距离证据链接的任务上取得了最大的提升。

英文摘要

Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.

2606.07074 2026-06-08 cs.LG cs.AI 交叉投稿

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

SlimSearcher: 通过自适应奖励门控实现训练效率感知的Web代理

Zequn Xie, Junjie Wang, Dan Yang, Jie Feng, Yue Shen, Jian Wang, Jinjie Gu

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团)

AI总结 提出SlimSearcher框架,通过帕累托高效过滤和自适应奖励门控,在保持或提升准确率的同时将工具调用轮次减少17%-58%。

Comments 17 pages, 8 figures,

详情
AI中文摘要

深度研究代理在复杂信息寻求任务中展现了卓越能力,但这种能力伴随着高昂的计算成本。受准确率驱动训练范式的影响,当前模型采用蛮力策略,表现为盲目依赖工具和执行性推理——生成长而冗余的轨迹,这些对于解决任务远非必要,导致浪费的工具调用和过多的token消耗。为克服这一效率陷阱,我们提出SlimSearcher,一个原则性框架,在监督微调(SFT)和强化学习(RL)中推动准确率与计算成本之间的帕累托前沿。在SFT阶段,SlimSearcher采用帕累托高效过滤来提炼既成功又经济的轨迹,引导模型走向内在效率感知的搜索行为。在RL阶段,我们引入自适应奖励门控,一种动态奖励塑造机制,在采样队列中评估相对工具和token效率。通过将这些自适应效率指标与严格正确性门控级联,我们的方法有效避免了与绝对惩罚相关的简洁性偏差,并缓解了奖励黑客攻击。在包括GAIA、BrowseComp和XBenchDeepSearch在内的长时域基准上的大量实验表明,SlimSearcher在保持或提升准确率的同时,将平均工具调用轮次减少了17%-58%。

英文摘要

Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning: they generate long, redundant trajectories that are far from necessary for resolving these tasks, leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.
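
The idea of Pareto-efficient filtration mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trajectory` fields, the choice of two cost axes (tool calls and tokens), and the sample values are all assumptions made for illustration. The sketch keeps only successful trajectories that no other successful trajectory beats on both costs at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    # Hypothetical record; field names are illustrative, not from the paper.
    name: str
    correct: bool
    tool_calls: int
    tokens: int


def dominates(a: Trajectory, b: Trajectory) -> bool:
    """a dominates b if it is no worse on both costs and strictly better on one."""
    no_worse = a.tool_calls <= b.tool_calls and a.tokens <= b.tokens
    strictly_better = a.tool_calls < b.tool_calls or a.tokens < b.tokens
    return no_worse and strictly_better


def pareto_filter(trajectories):
    """Keep successful trajectories that no other successful trajectory dominates."""
    successful = [t for t in trajectories if t.correct]
    return [t for t in successful
            if not any(dominates(other, t) for other in successful)]


samples = [
    Trajectory("a", True, 12, 9000),   # dominated by "b"
    Trajectory("b", True, 5, 4000),    # cheapest in tokens
    Trajectory("c", True, 4, 6000),    # cheapest in tool calls
    Trajectory("d", False, 2, 1000),   # cheap but wrong, dropped first
    Trajectory("e", True, 6, 4500),    # dominated by "b"
]
kept = pareto_filter(samples)
print([t.name for t in kept])
```

Filtering on correctness before comparing costs mirrors the abstract's requirement that kept trajectories be "both successful and economical"; a cheap but wrong trajectory never enters the comparison.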

2606.07080 2026-06-08 cs.SD cs.AI eess.AS 交叉投稿

dots.tts Technical Report

dots.tts 技术报告

Shi Lian, Changtao Li, Bohan Li, Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu

发表机构 * ByteDance(字节跳动)

AI总结 提出一个20亿参数的连续自回归TTS基础模型,通过多目标AudioVAE、全历史条件流匹配和无奖励自校正后训练,在Seed-TTS-Eval上取得最优性能,并支持低延迟推理。

详情
AI中文摘要

我们提出了 dots.tts,一个20亿参数的连续自回归文本到语音(TTS)基础模型,在连续潜在空间中建模语音。与现有的连续自回归模型相比,我们的关键创新有三点。首先,我们训练了一个具有多目标的AudioVAE,以构建语义结构化和预测友好的连续语音空间。其次,我们在流匹配头中使用全历史条件,以保持长程一致性并减少生成过程中的漂移。第三,我们对流匹配头应用无奖励自校正后训练,以进一步提高鲁棒性和声学质量。在大规模多语言语料库上训练后,dots.tts 在Seed-TTS-Eval上取得了最佳平均性能,在zh/en/zh-hard测试集上的WER分别为0.94%/1.30%/6.60%,SIM分数分别为81.0/77.1/79.5。在其他基准测试中,dots.tts 也持续展示了开源最先进的性能,表现出强大的生成稳定性、声音克隆能力和情感表现力。为了实现高效推理,我们进一步应用了CFG感知的MeanFlow蒸馏,使得输出流和双流模式下的首包延迟分别为85毫秒和54毫秒,实现了低延迟语音生成。为了促进可重复研究和实际部署,我们在Apache 2.0许可下发布了训练和推理代码,以及预训练、后训练和MeanFlow蒸馏的检查点。

英文摘要

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.

2606.07094 2026-06-08 cs.SE cs.AI 交叉投稿

MetaConfigurator: AI-Assisted RDF Authoring from JSON Data

MetaConfigurator:从JSON数据实现AI辅助的RDF创作

Felix Neubauer, Mahdi Jafarkhani, Kenichi Endo, Jürgen Pleiss, Benjamin Uekermann

发表机构 * Institute for Parallel and Distributed Systems, University of Stuttgart(并行与分布式系统研究所,斯图加特大学) Institute of Polymer Chemistry, University of Stuttgart(聚合化学研究所,斯图加特大学) Institute of Biochemistry, University of Stuttgart(生物化学研究所,斯图加特大学)

AI总结 提出一个集成AI辅助的RML映射、SPARQL查询和知识图谱可视化的Web界面,将JSON等结构化数据转换为RDF,并以MOF合成实验数据验证其有效性。

Comments Submitted as post-proceedings for the deRSE26 conference

详情
AI中文摘要

科学工作流越来越多地生成结构化的JSON数据,这些数据易于交换,但由于缺乏语义互操作性,跨系统一致解释困难。虽然JSON Schema确保结构验证,但它不原生支持链接数据语义。本文提出了一个RDF创作视图,扩展了开源JSON Schema编辑器MetaConfigurator,使研究人员能够通过AI辅助的RML映射将现有的JSON、YAML或CSV数据转换为RDF,精炼三元组,执行SPARQL查询,可视化知识图谱,并在单个集成Web界面中导出RDF序列化。该工作流由本体感知的IRI自动完成、JSON-LD文本视图与RDF三元组表之间的双向同步以及从自然语言提示生成AI辅助的SPARQL查询支持。我们使用金属有机框架(MOF)合成实验的实验室数据演示了该工作流。描述试剂、程序步骤和数量的协议数据通过RML映射从JSON转换为基于本体的JSON-LD。然后,我们精炼语义表示,查询实验条件与结果之间的关系,并交互式探索生成的知识图谱。该集成环境将传统的结构化数据管理与语义网技术桥接起来,同时保留实验上下文并通过AI辅助降低技术门槛。

英文摘要

Scientific workflows increasingly generate structured JSON data that is easy to exchange but difficult to interpret consistently across systems because it lacks semantic interoperability. While JSON Schema ensures structural validation, it provides no native support for Linked Data semantics. This paper presents an RDF Authoring View extending the open-source JSON Schema editor MetaConfigurator, enabling researchers to transform existing JSON, YAML, or CSV data into RDF using AI-assisted RML mappings, refine triples, execute SPARQL queries, visualize knowledge graphs, and export RDF serializations within a single integrated web interface. This workflow is supported by ontology-aware IRI auto-completion, bidirectional synchronization between JSON-LD text views and RDF triple tables, and AI-assisted SPARQL query generation from natural language hints. We demonstrate the workflow using laboratory data from metal-organic framework (MOF) synthesis experiments. Protocol data describing reagents, procedure steps, and quantities is converted from JSON to ontology-based JSON-LD via RML mappings. We then refine the semantic representation, query relationships between experimental conditions and outcomes, and explore the resulting knowledge graph interactively. This integrated environment bridges conventional structured data management with Semantic Web technologies while preserving experimental context and lowering technical barriers through AI assistance.

2606.07114 2026-06-08 cs.NI cs.AI cs.IT math.IT 交叉投稿

DIFFRACT: Neuralized Utility Maximization for Wireless Networks by Differentiable Programming

DIFFRACT: 通过可微编程实现无线网络的神经化效用最大化

Chee Wei Tan, Siya Chen

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 提出DIFFRACT框架,利用可微编程将深度学习与优化结合,通过算法展开将干扰管理算法映射为可微神经网络,实现分布式端到端梯度学习,以应对动态多用户干扰和随机服务质量约束。

Comments IEEE INFOCOM 2026

详情
AI中文摘要

下一代无线网络,包括卫星到开放RAN系统,需要敏捷且智能的资源管理,能够在随机服务质量约束下处理动态多用户干扰。本文介绍了DIFFRACT,一个神经化效用最大化框架,利用可微编程将深度学习与无线网络中的优化相结合。我们方法的核心是利用标准干扰函数的数学结构,这些函数是无线功率控制的基础。通过为这些函数开发对偶理论,我们通过算法展开将迭代干扰管理算法映射为可微神经网络架构。这使得在网络边缘进行分布式、端到端的基于梯度的学习成为可能,支持在地面和非地面环境中实时适应干扰。DIFFRACT通过建模复杂的信道动态并利用可微模型的表达能力,实现了可扩展且稳健的效用最大化。实验结果证实了该框架在下一代无线系统中的理论合理性和实际有效性。

英文摘要

Next-generation wireless networks, including satellite-to-Open RAN systems, demand agile and intelligent resource management capable of handling dynamic multi-user interference under stochastic quality of service constraints. This paper introduces DIFFRACT, a neuralized utility maximization framework that leverages differentiable programming to integrate deep learning with optimization in wireless networks. Central to our approach is the exploitation of the mathematical structure of standard interference functions, which are foundational in wireless power control. By developing a duality theory for these functions, we map iterative interference management algorithms into differentiable neural network architectures via algorithm unrolling. This enables distributed, end-to-end gradient-based learning at the network edge, supporting real-time adaptation to interference in both terrestrial and non-terrestrial environments. DIFFRACT allows for scalable and robust utility maximization by modeling complex channel dynamics and leveraging the expressiveness of differentiable models. Experimental results confirm the framework's theoretical soundness and practical effectiveness for next-generation wireless systems.
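
To make "algorithm unrolling" of a standard interference function concrete, here is a minimal sketch based on the classic fixed-SINR-target power-control iteration p ← I(p). It is an illustration under stated assumptions, not the DIFFRACT architecture: the gain matrix `G`, noise levels, SINR targets `gamma`, and the function names are hypothetical, and the layers here carry no trainable parameters, whereas a learned variant would.

```python
def interference(p, G, noise, gamma):
    """Standard interference function for fixed SINR targets.

    G[i][j] is the (assumed) gain from transmitter j to receiver i.
    """
    n = len(p)
    return [
        gamma[i] * (sum(G[i][j] * p[j] for j in range(n) if j != i) + noise[i]) / G[i][i]
        for i in range(n)
    ]


def unroll(p0, G, noise, gamma, num_layers):
    """Apply the iteration a fixed number of times; each pass plays the role of one layer."""
    p = list(p0)
    for _ in range(num_layers):
        p = interference(p, G, noise, gamma)
    return p


def sinr(p, G, noise):
    """Signal-to-interference-plus-noise ratio achieved by power vector p."""
    n = len(p)
    return [
        G[i][i] * p[i] / (sum(G[i][j] * p[j] for j in range(n) if j != i) + noise[i])
        for i in range(n)
    ]


# Illustrative two-link network with feasible SINR targets.
G = [[1.0, 0.1], [0.2, 1.0]]
noise = [0.1, 0.1]
gamma = [2.0, 2.0]

p = unroll([1.0, 1.0], G, noise, gamma, num_layers=30)
achieved = sinr(p, G, noise)
print(p, achieved)
```

Because every step is built from sums, products, and divisions, the whole unrolled chain is differentiable with respect to its inputs, which is what allows such an iteration to be trained end to end once it is written in an automatic-differentiation framework.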

2606.07116 2026-06-08 cs.LG cs.AI cs.CL 交叉投稿

OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

OffQ:通过偏移驯服LLM量化中的结构化异常值

Haoqi Wang, Lorenz K. Mueller, Jiawei Zhuang, Mathieu Salzmann, Lukas Cavigelli

发表机构 * School of Computer and Communication Sciences, EPFL, Switzerland(瑞士洛桑联邦理工学院计算机与通信科学学院) Huawei, Switzerland(华为瑞士) Swiss Data Science Center, ETHZ & EPFL, Switzerland(瑞士数据科学中心,苏黎世联邦理工学院与洛桑联邦理工学院)

AI总结 提出OffQ方法,通过top-1 PCA识别异常值子空间、旋转集中异常值通道并转换为共享偏移,实现LLM的低比特均匀量化,在W4A4KV4下提升精度。

详情
AI中文摘要

低比特量化已被广泛采用,通过显著降低计算成本和内存使用来加速大型语言模型(LLM)的推理。然而,激活异常值对有效量化构成了重大挑战,常常导致显著的性能下降。在本文中,我们介绍了OffQ,一种通过新颖的偏移机制来缓解低比特量化中激活异常值的方法。具体来说,OffQ首先使用提出的top-1 PCA识别激活中的低维异常值子空间,然后通过旋转将高幅度激活集中到1个通道中。OffQ随后通过将其幅度转换为共享偏移来吸收这个集中的异常值通道,从而降低激活的标准差。这种偏移策略使得使用部署友好的均匀网格和均匀精度量化对LLM进行有效的W4A4KV4量化成为可能。在多种LLM架构和基准上的广泛实验表明,OffQ优于最先进的基线,在保持低比特效率的同时持续提高模型精度。

英文摘要

Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effective quantization, often leading to notable performance degradation. In this paper, we introduce OffQ, a method designed to mitigate activation outliers in low-bit quantization through a novel offsetting mechanism. Specifically, OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates high-magnitude activations into 1 channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs using deployment-friendly uniform-grid and uniform-precision quantization. Extensive experiments across diverse LLM architectures and benchmarks demonstrate that OffQ outperforms state-of-the-art baselines, consistently improving model accuracy while preserving low-bit efficiency.

2606.07117 2026-06-08 cs.CV cs.AI 交叉投稿

Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment

Native3D: 通过统一网格纹理建模与语义对齐的端到端3D场景生成

Yibo Liu, Ziwei Zhang, Haozhou Pang, Menghao Li, Lanshan He, Gan Qi

发表机构 * Kuaishou GameMind Lab(快手游戏大脑实验室)

AI总结 提出Native3D,首个完全绕过2D中间表示的端到端3D场景生成框架,通过统一网格纹理联合表示和3D表示对齐损失,解决几何结构失真和纹理细节退化问题。

详情
AI中文摘要

本文提出了Native3D,首个完全绕过2D中间表示的端到端3D场景生成框架。传统方法通常需要将3D表示适配到2D域以利用预训练的扩散模型,这不可避免地引入了域适应问题,包括几何结构失真和纹理细节退化。为了解决这些限制,我们设计了一种统一的网格纹理联合表示,通过基于Transformer的场景编码器同时对几何结构和纹理特征进行建模,有效维持场景中物体之间的空间关系和视觉一致性。我们进一步提出了3D表示对齐损失(3D REPA Loss),该损失采用改进的对比学习机制来对齐潜在空间中的多级语义表示,显著增强了几何和纹理保真度。实验结果表明,Native3D在生成质量和编辑灵活性方面均优于现有方法,为3D场景编辑提供了一种新的解决方案。

英文摘要

This paper presents Native3D, the first end-to-end 3D scene generation framework that completely bypasses 2D intermediate representations. Traditional approaches typically require adapting 3D representations to the 2D domain to leverage pre-trained diffusion models, which inevitably introduces domain adaptation issues including geometric structural distortion and texture detail degradation. To address these limitations, we design a unified mesh-texture joint representation that simultaneously models both geometric structures and texture features through a Transformer-based scene encoder, effectively maintaining spatial relationships and visual consistency among objects within scenes. We further propose the 3D Representation Alignment Loss (3D REPA Loss), which employs an improved contrastive learning mechanism to align multi-level semantic representations in the latent space, significantly enhancing geometric and textural fidelity. Experimental results demonstrate that Native3D outperforms existing methods in both generation quality and editing flexibility, providing a novel solution for 3D scene editing.

2606.07119 2026-06-08 cs.ET cs.AI cs.MA 交叉投稿

The Three-Ring Architecture: Governing Agents in the Era of On-Platform Organisations

三环架构:平台型组织时代中的智能体治理

Sergio Alvarez-Telena, Marta Diez-Fernandez

发表机构 * arXiv

AI总结 针对企业AI部署中缺乏治理基础设施导致失败率高达95%的问题,提出三环架构:环1为现有生产架构,环2为基于策略的智能体联邦层,环3为LLM前沿智能层,其中环2作为智能体企业的操作系统,实现资源抽象、进程协调、权限执行和智能积累,并区分环2与环3的风险轮廓。

Comments 28 pages

详情
AI中文摘要

当前企业AI部署阶段面临结构性失败:组织获取了智能体能力,却缺乏治理基础设施。这预计将重演第一波AI部署的错误:去中心化智能缺乏联邦层,导致95%的项目失败率。本文正式提出三环架构作为平台型组织的治理基础设施。环1是现有生产架构;环2是基于策略的智能体AI构建的M2联邦层;环3是基于LLM的前沿智能层。环2在技术精确意义上构成了智能体企业的操作系统——在组织层面执行计算操作系统在设备层面执行的功能:资源抽象、进程协调、权限执行,以及一个用于复合智能的稳定平台。一个核心贡献是正式区分环2和环3的风险轮廓。基于策略的智能体在确定性框架内运行:其后果可追溯,其权限可执行,其偏差可恢复。基于LLM的智能体引入了截然不同的风险:一个非确定性行为者,其偏差通过复杂的组织系统传播,且无法追溯。环2不是一个有用的附加——它是控制和合规的必要条件。进一步推论:LLM能力的每一次提升都是该架构的结构性顺风。更强大的非确定性行为者在偏离时会产生更大的后果。治理需求随能力增长。该架构已在金融服务、政府、采购和合规等多个领域经过十年的部署验证。

英文摘要

The current phase of enterprise AI deployment faces a structural failure: organisations are acquiring agentic capability without the infrastructure to govern it. The result is expected to reproduce the error of the first wave of AI deployment: decentralised intelligence without a federation layer leading to a 95% project failure rate. This paper formalises the Three-Ring Architecture as the governing infrastructure of the on-platform organisation. Ring 1 is the existing production architecture; Ring 2 is the M2 federation layer built on strategies-based agentic AI; Ring 3 is the LLM-based frontier intelligence layer. Ring 2 constitutes, in the technically exact sense, the operating system of the agentic enterprise, performing at the organisational level what a computing OS performs at the device level: resource abstraction, process coordination, permission enforcement, and a stable platform for compounding intelligence. A central contribution is the formal distinction between Ring 2 and Ring 3 risk profiles. Strategies-based agents operate within a deterministic framework: their consequences are traceable, their permissions enforceable, their deviations recoverable. LLM-based agents introduce a categorically distinct risk: a non-deterministic actor whose deviations propagate through complex organisational systems without retrospective traceability. Ring 2 is not a useful addition; it is a necessary condition of control and compliance. A further consequence: every improvement in LLM capability is a structural tailwind for this architecture. More capable non-deterministic actors produce larger consequences when they deviate. The governance requirement scales with capability. The architecture has been validated across a decade of deployment in financial services, government, procurement, and compliance, among other sectors.

2606.07141 2026-06-08 cs.LG cs.AI 交叉投稿

REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference

REMEDI:多标签临床疾病推断中的保留与遗忘评估基准

Anurag Sharma, Sai Teja Chunchu, Prasenjit Mitra, Sandipan Sikdar, Koustav Rudra

发表机构 * IIT Kharagpur(印度理工学院克勒格布尔分校) Carnegie Mellon University(卡内基梅隆大学) L3S Research Center, Leibniz University Hannover(汉诺威莱布尼茨大学L3S研究中心)

AI总结 提出REMEDI基准,针对多标签临床疾病推断中的机器遗忘问题,利用MIMIC-III数据库评估现有方法在效用与遗忘性能间的权衡,并发现其不适用于多标签任务。

Comments Under review

详情
AI中文摘要

用于临床疾病推断的语言模型在患者数据上进行训练,这些数据可能包含敏感和私人信息,数据所有者可能出于隐私或版权原因要求从训练模型中删除其数据。然而,精确遗忘患者特定数据是棘手的,而通过少量数据删除重新训练则资源密集。虽然存在几种可用的机器遗忘方法,但其效用通常局限于非医疗领域。此外,评估此类遗忘方法的现有基准主要使用合成数据集,这些数据集不能真正代表现实系统。因此,这些遗忘方法在医疗领域的有效性在很大程度上尚不清楚。为此,我们引入了REMEDI,一个针对多标签和多类别临床疾病推断的广泛机器遗忘基准,其中标签相关性、纵向结构和安全约束使得遗忘特别具有挑战性。与现有基准不同,REMEDI考虑:(1) 相关的应用领域(医疗),(2) 涉及多样遗忘实例集的全面遗忘设置,(3) 具有挑战性的遗忘场景,包括多标签和多类别分类任务,以及(4) 评估指标,涉及效用和遗忘程度的性能。REMEDI使用MIMIC-III临床数据库开发,该数据库包含患者的全面临床数据。现有遗忘方法的实验表明,效用和遗忘性能之间存在权衡。它们也大多不适合多标签分类任务。为促进可重复性,我们公开了我们的基准。

英文摘要

Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning patient-specific data is intractable, and retraining with minor data removal is resource-intensive. While there exist several machine unlearning methods that can be used, their utility is generally restricted to non-medical domains. Moreover, the existing benchmarks for evaluating such unlearning methods primarily utilize synthetically curated datasets, which are not truly representative of real-world systems. Hence, the effectiveness of these unlearning methods in the medical domain is largely unclear. To this end, we introduce REMEDI, an extensive benchmark for machine unlearning tailored to multi-label and multi-class clinical disease inference, where label correlations, longitudinal structure, and safety constraints make unlearning particularly challenging. Unlike the existing benchmarks, REMEDI considers: (1) a relevant application domain (medical), (2) comprehensive unlearning setups involving diverse sets of forget instances, (3) challenging unlearning scenarios including multi-label and multi-class classification tasks, and (4) evaluation metrics involving performance both in terms of utility and extent of unlearning achieved. REMEDI is developed using the MIMIC-III clinical database that contains comprehensive clinical data of patients. Experiments with existing unlearning methods indicate that there exists a trade-off between utility and unlearning performance. They are also largely unsuited to multi-label classification tasks. To facilitate reproducibility, we make our benchmark publicly available.

2606.07167 2026-06-08 cs.CL cs.AI 交叉投稿

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

UrduMMLU:乌尔都语理解的大规模多任务基准

Ahmer Tabassum, Sarfraz Ahmad, Hasan Iqbal, Owais Aijaz, Momina Ahsan, Preslav Nakov

发表机构 * MBZUAI

AI总结 针对乌尔都语缺乏本地教育来源的MMLU风格基准,提出包含26,431道多选题的UrduMMLU,覆盖26个学科,评估30个LLM发现Gemini-3.5-Flash最佳,多数模型在人文科目上表现差。

Comments 27 pages, 18 figures, 17 tables, Submitted to ARR May 2026

详情
AI中文摘要

有意义的多语言评估必须在目标语言和教育背景下测试模型。乌尔都语有超过2.3亿人使用,但缺乏从本地教育来源构建的广泛MMLU风格基准。我们提出UrduMMLU,一个包含26,431道乌尔都语多选题的基准,涵盖26个学科和五个领域,数据来自本地乌尔都语题库和公开考试PDF。与基于翻译的资源不同,UrduMMLU既包括标准学术科目,也包括乌尔都语和地区特定内容。我们通过双重人工标注和严格共识过滤对考试部分进行标注。我们在英语和乌尔都语提示下评估了30个LLM,进行了60次零样本评估,并进一步在两种提示语言的多个少样本设置下评估了四个开源LLM。Gemini-3.5-Flash表现最佳,准确率达到90.20%和90.34%,而其他模型均未超过85%。最强的开源模型落后7.79和8.92个百分点,许多模型在乌尔都语人文科目上比STEM科目损失25到40个百分点。少样本提示仅带来微小提升。UrduMMLU表明,当前LLM中乌尔都语知识仍不均匀,尤其是地区性内容。

英文摘要

Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.

2606.07172 2026-06-08 cs.CV cs.AI cs.CL cs.LG 交叉投稿

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

文本监督增强视觉-语言模型中的地理空间表示

Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya, Cheng Yaw Low, Virgilio Almeida, Meeyoung Cha

发表机构 * University of São Paulo(圣保罗大学) National University of Singapore(新加坡国立大学)

AI总结 研究视觉、视觉-语言及多模态模型的地理空间表示能力,发现文本监督能有效提升空间编码,推动地理空间AI发展。

Comments Accepted at ICML 2026

详情
AI中文摘要

地理空间理解是机器学习系统在图像地理定位和空间推理等任务中一个关键但尚未充分探索的维度。在这项工作中,我们分析了三种模型家族获得的地理空间表示:纯视觉架构(如ViT)、视觉-语言模型(如CLIP)和大规模多模态基础模型(如LLaVA、Qwen和Gemma)。通过评估包括人物、地标和日常物体在内的图像聚类(根据可定位程度分组),我们揭示了空间准确性的系统性差距,并表明文本监督增强了地理空间表示的学习。我们的发现表明语言作为编码空间上下文的有效补充模态,以及多模态学习作为推进地理空间AI的关键方向。

英文摘要

Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including people, landmarks, and everyday objects, grouped based on the degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings point to language as an effective complementary modality for encoding spatial context, and to multimodal learning as a key direction for advancing geospatial AI.

2606.07181 2026-06-08 cs.LG cs.AI q-bio.MN 交叉投稿

RETROSPECT: RETROsynthesis via Sequential Prediction, and Chemically Transformed-ranking

RETROSPECT: 通过序列预测和化学变换排序的逆合成

Raja Sekhar Pappala, Shreyas Vinaya Sathyanarayana, Ronit Kumar Choudhary, Arjun Verma, Deepak Warrier

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出RETROSPECT系统,将单步逆合成分解为候选生成和重排序,结合ChemAlign Transformer生成器和LambdaMART重排序器,在USPTO-50K上实现55.00% top-1准确率。

Comments Accepted at the AI for Science workshop (ICML 2026)

详情
AI中文摘要

单步逆合成既需要准确的首位建议,也需要足够丰富的候选列表以供下游选择。我们将其研究为提议-选择分解。我们的系统RETROSPECT结合了一个单一的Transformer提议模型(我们称之为ChemAlign Transformer)和一个基于结构、反应模板、上游分数以及可选的DFT衍生描述符的LambdaMART重排序器。生成器使用混合根对齐和随机SMILES增强、预层归一化、绑定嵌入、指数移动平均权重以及可微的原子平衡辅助损失进行训练。在包含5,007个反应的完整USPTO-50K测试集上,生成器达到55.00%的top-1和86.18%的top-10精确匹配准确率,top-1有效率为99.86%。在用于重排序的合并候选池基准上(包含5,007个测试产物,每个产物约111个候选),基于结构特征集训练的LambdaMART模型达到59.4%的top-1和0.7171的平均倒数排名。特征消融实验表明,上游提议分数和模板频率统计提供了大部分重排序信号,而DFT和反应中心DFT特征提供的增益较小且不一致。这些结果支持逆合成的模块化观点:更强的单模型提议和学习候选选择是互补的,并且提议模型可以作为集成系统(如RetroChimera (Maziarz et al., 2024))的即插即用组件。

英文摘要

Single-step retrosynthesis needs both accurate first-ranked suggestions and candidate lists that are rich enough for downstream selection. We study this as a proposal-selection decomposition. Our system, RETROSPECT, combines a single Transformer proposal model, which we call the ChemAlign Transformer, with a LambdaMART reranker over structural, reaction-template, upstream-score, and optional DFT-derived descriptors. The generator is trained with hybrid root-aligned and random SMILES augmentation, Pre-LayerNorm, tied embeddings, exponential moving average weights, and a differentiable atom-balance auxiliary loss. On the full USPTO-50K test set of 5,007 reactions, the generator reaches 55.00% top-1 and 86.18% top-10 exact-match accuracy with 99.86% top-1 validity. On the merged candidate-pool benchmark used for reranking, which contains 5,007 test products and about 111 candidates per product, a LambdaMART model trained on the structural feature set reaches 59.4% top-1 with 0.7171 mean reciprocal rank. Feature ablations show that upstream proposal score and template-frequency statistics provide most of the reranking signal, while DFT and reaction-center DFT features provide smaller and less consistent gains. These results support a modular view of retrosynthesis: stronger single-model proposal and learned candidate selection are complementary, and the proposal model can serve as a drop-in component for ensemble systems such as RetroChimera (Maziarz et al., 2024).
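
The two ranking metrics reported in the abstract, top-k exact-match accuracy and mean reciprocal rank (MRR), can be sketched as follows. This is a generic illustration rather than the paper's evaluation code: the candidate strings are placeholders (a real pipeline would compare canonicalized SMILES), and scoring a product whose target is missing from the list as reciprocal rank 0 is an assumed convention.

```python
def top_k_accuracy(ranked_lists, targets, k):
    """Fraction of products whose ground truth appears in the first k candidates."""
    hits = sum(1 for cands, t in zip(ranked_lists, targets) if t in cands[:k])
    return hits / len(targets)


def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the ground truth; a missing target contributes 0."""
    total = 0.0
    for cands, t in zip(ranked_lists, targets):
        if t in cands:
            total += 1.0 / (cands.index(t) + 1)
    return total / len(targets)


# Placeholder candidate lists for four products; "A" is the ground truth each time.
ranked = [
    ["A", "B", "C", "D"],  # rank 1
    ["B", "A", "C", "D"],  # rank 2
    ["B", "C", "D", "E"],  # target missing
    ["B", "C", "D", "A"],  # rank 4
]
targets = ["A", "A", "A", "A"]

top1 = top_k_accuracy(ranked, targets, k=1)
top2 = top_k_accuracy(ranked, targets, k=2)
mrr = mean_reciprocal_rank(ranked, targets)
print(top1, top2, mrr)
```

A reranker can raise top-1 only for products whose ground truth is already somewhere in the candidate pool, which is why the abstract treats proposal quality and candidate selection as complementary.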

2606.07211 2026-06-08 cs.RO cs.AI 交叉投稿

An Abstract Architecture for Explainable Autonomy in Hazardous Environments

危险环境中可解释自主性的抽象架构

Matt Luckcuck, Hazel M Taylor, Marie Farrell

发表机构 * Maynooth University(梅诺斯大学) University of Manchester(曼彻斯特大学)

AI总结 提出一种支持自主系统解释其行为的抽象架构,旨在通过设计可解释性增强用户信任,并以民用核工业为例展示应用。

Comments Originally published 20th of October 2022 at the Second International Workshop on Requirements Engineering for Explainable Systems (RE4ES), which was hosted by the International Requirements Engineering Conference 2022

详情
AI中文摘要

自主机器人系统被提议用于危险环境,通常是为了减少人类工人的风险。在不久的将来,人类工人可能会继续使用和指挥这些自主机器人,就像其他计算机化工具一样,但具有更复杂的决策能力。因此,工程努力的一个重要方向是确保这些用户信任系统。最近的文献表明,可解释性与系统的可信度密切相关。与安全性和保密性属性一样,可解释性应该被设计到系统中,而不是事后添加。本文提出了一种抽象架构,支持自主系统解释其行为(可解释自主性),为实施可解释自主系统提供了设计模板。我们给出了一个工作示例,说明我们的架构如何应用于民用核工业,其中工人和监管机构都需要信任系统的决策能力。

英文摘要

Autonomous robotic systems are being proposed for use in hazardous environments, often to reduce the risks to human workers. In the immediate future, it is likely that human workers will continue to use and direct these autonomous robots, much like other computerised tools but with more sophisticated decision-making. Therefore, one important area on which to focus engineering effort is ensuring that these users trust the system. Recent literature suggests that explainability is closely related to how trustworthy a system is. Like safety and security properties, explainability should be designed into a system, instead of being added afterwards. This paper presents an abstract architecture that supports an autonomous system explaining its behaviour (explainable autonomy), providing a design template for implementing explainable autonomous systems. We present a worked example of how our architecture could be applied in the civil nuclear industry, where both workers and regulators need to trust the system's decision-making capabilities.

2606.07222 2026-06-08 cs.CV cs.AI 交叉投稿

DualGate-Net: A Prior-Gated Dual-Encoder Framework for Histopathology Cell Detection

DualGate-Net: 用于组织病理学细胞检测的先验门控双编码器框架

Bahman Jafari Tabaghsar, Son Tran, K. Devaraja, Atul Sajjanhar

发表机构 * School of Information Technology, Deakin University(迪肯大学信息技术学院) Kasturba Medical College, Manipal Academy of Higher Education(马尼帕尔高等教育学院卡斯图巴医学院)

AI总结 提出DualGate-Net,通过可学习的先验门控融合机制自适应调节组织先验影响,结合局部和全局编码器及辅助分支,在OCELOT基准上实现稳健的细胞检测。

Comments 15 pages, 4 figures

详情
AI中文摘要

组织病理学图像中的细胞检测强烈依赖于周围组织背景,其中视觉上相似的细胞在不同微环境下可能属于不同类别。最近的组织感知方法结合了上下文先验,但通常依赖于可能传播噪声信息的静态融合策略。在这项工作中,我们提出了DualGate-Net,一种先验感知的双编码器框架,通过可学习的先验门控融合机制结合了基于ConvNeXtV2的局部编码器和基于SegFormer的全局编码器。所提出的模块自适应地调节组织先验在空间位置上的影响,同时一个辅助的前景重建分支在训练过程中保留高频细胞结构。此外,还引入了辅助的细胞性引导线索以进一步提高定位鲁棒性。在OCELOT基准上的实验表明,该方法在验证集上取得了0.7722的宏F1分数,在测试集上取得了0.7345的宏F1分数,突显了自适应先验整合对于稳健的组织病理学细胞检测的有效性。

英文摘要

Cell detection in histopathology images strongly depends on surrounding tissue context, where visually similar cells may belong to different classes under different microenvironments. Recent tissue-aware methods incorporate contextual priors, but often rely on static fusion strategies that may propagate noisy information. In this work, we propose DualGate-Net, a prior-aware dual-encoder framework that combines a ConvNeXtV2-based local encoder and a SegFormer-based global encoder through a learnable prior-gated fusion mechanism. The proposed module adaptively regulates the influence of tissue priors across spatial locations, while an auxiliary foreground reconstruction branch preserves high-frequency cellular structures during training. In addition, auxiliary cellness-guided cues are incorporated to further improve localization robustness. Experiments on the OCELOT benchmark demonstrate consistent improvements, achieving macro F1-scores of 0.7722 on the validation set and 0.7345 on the test set, highlighting the effectiveness of adaptive prior integration for robust histopathology cell detection.
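
For readers unfamiliar with the macro F1-score quoted above, a minimal sketch of the averaging step follows. The class names and counts are invented for illustration, and the step that matches predicted cell centres to annotated ones (which produces the true-positive, false-positive, and false-negative counts) is assumed to have happened already and is not shown.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no counts at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = [f1_from_counts(*counts) for counts in per_class_counts.values()]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts per class after detection matching.
counts = {
    "tumor_cell": (8, 2, 2),       # F1 = 16 / 20 = 0.8
    "background_cell": (3, 1, 3),  # F1 = 6 / 10 = 0.6
}
score = macro_f1(counts)
print(score)
```

Because the mean is unweighted, a drop on one class lowers the macro score by the same amount regardless of how many cells that class contains.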

2606.07237 2026-06-08 cs.CL cs.AI cs.LG 交叉投稿

When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

当大型语言模型在医疗保健中失败:评估对提示变化的敏感性

Mahdi Alkaeed

发表机构 * Department of Computer Science and Engineering, Doha, Qatar(计算机科学与工程系,多哈,卡塔尔)

AI总结 本研究系统分析了通用和医学专用LLM对提示扰动的敏感性,发现即使是微小的措辞变化也可能改变临床建议,对抗性提示可能引发有害输出,表明这些模型在临床应用中不可靠。

Comments 12 pages

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于医疗保健任务,如临床问答、诊断支持和报告总结。尽管前景广阔,但这些模型对微小的提示扰动(包括词汇和句法)仍然高度敏感,在安全关键的临床应用中构成严重风险。在本研究中,我们使用MedMCQA基准进行了系统的敏感性分析,以评估通用(例如GPT-3.5、Llama3)和医学专用LLM(例如ClinicalBERT、BioLlama3、BioBERT)的鲁棒性。我们将扰动分为自然和对抗两种类型,并检查它们对临床推理任务中模型一致性、准确性和可靠性的影响。我们的发现表明,医学LLM并非本质安全。即使是措辞的微小变化也可能改变临床建议,而针对性的对抗性提示可能引发有害输出。在医疗保健等高风险环境中,这种不可预测性是不可接受的——模型因重新措辞的输入而改变诊断,或因轻微改写而幻觉药物,临床医生无法可靠地信任它们。虽然模型通常对简单的词汇替换或释义表现出韧性,但在句法重新排序或误导性上下文线索下往往会崩溃。这种脆弱性在通用和领域专用LLM中都很明显。值得注意的是,对抗性操作可能导致临床危险的输出,例如推荐不正确的剂量或遗漏关键发现。

英文摘要

Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and report summarization. Despite their promise, these models remain highly sensitive to subtle prompt perturbations, both lexical and syntactic, posing serious risks in safety-critical clinical applications. In this study, we conduct a systematic sensitivity analysis to evaluate the robustness of both general-purpose (e.g., GPT-3.5, Llama3) and medical-specific LLMs (e.g., ClinicalBERT, BioLlama3, BioBERT) using the MedMCQA benchmark. We categorize perturbations into natural and adversarial types and examine their effect on model consistency, accuracy, and reliability in clinical reasoning tasks. Our findings reveal that medical LLMs are not intrinsically safe. Even minor variations in phrasing can alter clinical advice, and targeted adversarial prompts can provoke harmful outputs. In high-stakes settings like healthcare, such unpredictability is unacceptable: models that change diagnoses due to reworded inputs or hallucinate medications when prompts are slightly rephrased cannot be reliably trusted by clinicians. While models tend to show resilience to simple lexical substitutions or paraphrasing, they often break down under syntactic reordering or misleading contextual cues. This fragility is evident across both general-purpose and domain-specific LLMs. Notably, adversarial manipulations can lead to clinically dangerous outputs, such as recommending incorrect dosages or omitting critical findings.

2606.07244 2026-06-08 cs.RO cs.AI cs.CV 交叉投稿

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

超越航点:面向视觉语言导航的轨迹中心航点范式

Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) Pengcheng Laboratory(鹏城实验室)

AI总结 提出轨迹航点范式,通过TSDF引导的扩散策略预测可执行轨迹,解决VLN-CE中航点不可达与规划控制不一致问题,在基准上取得最优性能。

详情
AI中文摘要

连续环境中的视觉语言导航(VLN-CE)要求智能体在类似真实世界的环境中遵循自然语言指令进行导航。大多数VLN-CE方法采用三阶段框架:航点预测器提出可导航航点,导航器选择最佳航点,低层控制器执行移动。然而,这种解耦范式常导致航点不可达或规划与控制不一致。本文提出一种称为轨迹航点的新范式,将每个候选航点锚定到可执行轨迹上。为此,我们设计了TSDF引导的扩散策略作为轨迹航点预测器,引导轨迹生成避开障碍物,从本质上保证预测航点的可达性。进一步提出轨迹增强导航器,将关联轨迹作为额外信息注入规划,实现高层语义决策与低层执行的严格一致性。在VLN-CE基准上的大量实验表明,我们的轨迹航点范式优于基线方法。

英文摘要

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, a navigator selects the best one, and a low-level controller executes the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.

2606.07245 2026-06-08 cs.CY cs.AI cross-list

AI Sovereignty: A Qualitative Model of Strategic Competition as AI Becomes an Instrument of National Power

AI主权:当AI成为国家力量工具时的战略竞争定性模型

Timothy Clancy, Asmeret Naugle

Affiliations * Applied Research Laboratory for Intelligence and Security (ARLIS); Sandia National Laboratories

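AI summary Proposes a definition of AI sovereignty and a first-of-its-kind qualitative model incorporating micro, meso, and macro factors; analyzes the dynamics of AI-driven strategic competition between nations and identifies key leverage points (such as accelerators, electricity, and data) that can be activated through direct kinetic and indirect non-kinetic actions.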

Comments Main article: 19 pages, 10 figures. Supplementary: 19 pages, 7 figures, 7 tables. To be presented at the 2026 International System Dynamics Conference (ISDC), July 20-24, TU Delft, Delft, Netherlands

详情
AI中文摘要

AI主权是一个国家独立控制其人工智能(AI)技术的程度。对日益复杂的前沿AI模型的竞争具有越来越重要的战略意义,各国正在考虑AI如何改善其经济状况、竞争优势和整体国家实力。然而,AI主权的成本巨大,我们缺乏定义和概念模型来应对不断演变的AI主权动态。我们通过提出与AI主权相关的定义,以及一个首次包含微观、中观和宏观因素的定性模型来填补这一空白。基于模型的定性预测突出了竞争动态和AI驱动国家实力的潜在演变。该模型识别了各国可用于增强自身增长或削弱对手的关键杠杆点,包括考虑加速器、电力、水、数据集和熟练劳动力。这些杠杆点可以通过直接动能行动(如伊朗用无人机瞄准数据中心)和间接非动能效应(包括网络、太空、信息、经济胁迫和外交)在战略和操作层面激活。如果我们的假设和假说成立,这种战略竞争可能将定义21世纪各国如何改善其经济状况、竞争优势和整体国家实力。

英文摘要

AI sovereignty is the extent to which a nation independently controls its artificial intelligence (AI) technologies. The race toward ever-more-sophisticated frontier AI models is of increasing strategic importance, with nations considering how AI might improve their economic situations, competitive advantage, and overall national power. However, the costs of AI sovereignty are enormous, and we lack definitions and conceptual models to navigate evolving AI sovereignty dynamics. We address this gap with definitions relevant to AI sovereignty, along with a first-of-its-kind qualitative model that incorporates micro, meso, and macro contributors. Model-based qualitative forecasts highlight competitive dynamics and evolving potential for AI-driven national power. The model identifies key leverage points that nations can use to enhance their own growth or degrade an adversary's, including consideration of accelerators, electricity, water, data sets and skilled workforce. These leverage points can be activated at strategic and operational levels through both direct kinetic actions, such as Iran's targeting of data centers with drones, and indirect non-kinetic effects including cyber, space, information, economic coercion and diplomacy. If our assumptions and hypotheses are valid, this strategic competition may come to define how nations improve their economic situations, competitive advantage, and overall national power in the 21st Century.

2606.07271 2026-06-08 cs.LG cs.AI cs.SD cross-list

Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path

整流流泄漏之处:沿插值路径表征成员信号

Thomas Sesmat, Gabriel Meseguer-Brocal, Geoffroy Peeters

Affiliations * University of Amsterdam

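AI summary Analyzes training-data membership signals along the interpolation path of Rectified Flows. The gap between train and test reconstruction follows a bell-shaped curve whose peak location is derived under Gaussian assumptions; the structure is validated on audio and images and exploited for a membership inference attack.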

Comments ICML 2026 article, 9 main pages and 25 with annexes, 11 figures

详情
Journal ref
43rd International Conference on Machine Learning, Seoul, South Korea, 2026
AI中文摘要

理解生成模型从训练数据中保留了什么仍然具有挑战性,这对版权和隐私有影响。除了逐字复制外,模型可以编码训练数据中更微妙的痕迹,这些痕迹从未出现在输出中,但仍可利用。我们针对整流流(Rectified Flows)研究了这一机制,整流流越来越多地用于部署的生成系统。我们分析了定义整流流训练的插值路径 $X_\lambda = (1-\lambda)X_0 + \lambda X_1$。我们展示了训练数据和测试数据的重建之间存在一个差距,该差距在 $\lambda$ 上呈钟形曲线,并在训练过程中累积,而验证指标保持稳定。该信号有一个最大值,我们在高斯假设下推导出其位置的闭式解。我们在音频和图像上验证了这些预测,并表明钟形结构是普遍的,而峰值预测在我们的假设满足时成立。作为概念验证,我们利用这种特定的 $\lambda$ 解析结构进行成员推断攻击,区分训练集的成员和非成员。

英文摘要

Understanding what generative models retain from training data remains challenging, with implications for copyright and privacy. Beyond verbatim reproduction, models can encode subtler traces of their training data that never surface in their outputs yet remain exploitable. We study this regime for Rectified Flows, which are increasingly used in deployed generative systems. We analyse the interpolation path $X_\lambda = (1-\lambda)X_0 + \lambda X_1$ that defines the Rectified Flow training. We show that a gap between the reconstruction of train and test data follows a bell-shaped curve over $\lambda$ and accumulates during training, while the validation metrics remain stable. The signal has a maximum whose location we derive in closed form under Gaussian assumptions. We validate these predictions on both audio and images and show that the bell-shaped structure is universal, while the peak prediction holds when our assumptions are satisfied. As a proof of concept, we exploit this specific $\lambda$-resolved structure to perform a Membership Inference Attack, distinguishing members of the training set from non-members.
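The interpolation path and the per-$\lambda$ gap described in this abstract can be illustrated with a short sketch. This is not the authors' code: it assumes samples are plain lists of floats, that per-sample reconstruction losses at each $\lambda$ have already been produced by some trained model, and that the gap is measured as non-member loss minus member loss. The names `interpolate`, `gap_curve`, and `peak_lambda` are hypothetical.

```python
from statistics import mean


def interpolate(x0, x1, lam):
    """Point on the path X_lam = (1 - lam) * X_0 + lam * X_1."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(x0, x1)]


def gap_curve(member_losses, non_member_losses):
    """Per-lambda gap: mean non-member loss minus mean member loss.

    Both arguments map lambda -> list of per-sample reconstruction losses
    (assumed to be computed elsewhere by the model under study).
    """
    return {
        lam: mean(non_member_losses[lam]) - mean(member_losses[lam])
        for lam in sorted(member_losses)
    }


def peak_lambda(gaps):
    """Lambda at which the train/test gap is largest."""
    return max(gaps, key=gaps.get)


# Toy losses, made up for illustration only.
members = {0.1: [1.0, 1.0], 0.5: [1.0, 1.0], 0.9: [1.0, 1.0]}
non_members = {0.1: [1.25, 1.25], 0.5: [2.0, 2.0], 0.9: [1.25, 1.25]}
gaps = gap_curve(members, non_members)
print(gaps, peak_lambda(gaps))
```

In this toy input the gap is largest at the middle value of $\lambda$, mimicking the bell shape; a membership test could then compare a sample's loss near that $\lambda$ against a threshold, which is again an assumption rather than the paper's stated procedure.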

2606.07309 2026-06-08 cs.SD cs.AI cs.CL cross-list

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

语音情感识别中音频语言模型的声学线索对齐

Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller

Affiliations * DFG's Reinhart Koselleck project; EU H2020 project

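AI summary Studies whether audio language models use explicit acoustic cues in a grounded way. Six interpretable acoustic concept tokens are derived from eGeMAPS features; aligned tokens improve UAR while shuffled or corrupted tokens reduce performance, showing the models are sensitive to symbolic cues yet remain partly anchored to the audio signal.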

Comments 6 pages, 3 figures, 3 tables

详情
AI中文摘要

指令跟随音频语言模型(ALMs)可以通过显式的声学线索进行增强,但在原始音频已经可用的情况下,这些线索是否以接地的方式被使用仍不清楚。我们通过从标准化的eGeMAPS副语言特征集中推导出六个可解释的声学概念标记来研究语音情感识别(SER)中的这一问题。这些标记总结了能量、音高、动态、亮度、共振峰和语音质量,并被附加到文本提示中,同时保持音频输入不变。在广泛使用的FAU-Aibo和IEMOCAP基准测试中,对齐的标记提高了未加权平均召回率(UAR),而打乱、冲突或损坏的标记相对于对齐标记降低了性能,并将混淆转向中性。重要的是,在强标记扰动下预测不会崩溃,这表明模型对符号线索通道敏感,但部分仍锚定于音频信号。我们认为,仅标记干预提供了一种实用的方法来探测基于ALM的情感计算中音频接地线索的使用、鲁棒性和可解释性。

英文摘要

Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.
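For readers unfamiliar with the metric, unweighted average recall (UAR) is the mean of the per-class recalls, so each emotion class counts equally regardless of how many samples it has. The sketch below is a minimal pure-Python illustration with made-up labels; the function name is hypothetical and it is not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy example: 4 neutral and 2 angry utterances (illustrative labels only).
y_true = ["neutral"] * 4 + ["angry"] * 2
y_pred = ["neutral", "neutral", "neutral", "angry", "angry", "neutral"]
uar = unweighted_average_recall(y_true, y_pred)
print(uar)  # recall(neutral) = 3/4, recall(angry) = 1/2
```

Plain accuracy on the same toy data is 4/6, which is higher than the UAR because the majority class dominates it; this is why UAR is preferred on imbalanced emotion corpora.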

2606.07311 2026-06-08 cs.CV cs.AI cross-list

CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models

CULTURESCORE: 评估视频生成模型的文化忠实度

Anku Rani, Wei Dai, Shravan Nayak, Pattie Maes, Mahdi M. Kalayeh, Paul Pu Liang

Affiliations * Massachusetts Institute of Technology; Mila – Quebec AI Institute; Netflix

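AI summary Proposes the CultureScore framework, which evaluates the cultural faithfulness of generated video along three dimensions: Identity, Context, and Behavior. The best current model scores only 56.8% overall, and Behavior is the hardest dimension.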

详情
AI中文摘要

随着Veo 3.1和LTX-2等视频生成模型的进步,它们准确表现多元全球文化的能力仍是一个关键但研究不足的前沿。当前的指标如VideoScore仅衡量视觉质量,无法评估文化忠实度。因此,一个将合十礼替换为握手的模型与正确生成该手势的模型获得相同的分数。我们提出CultureScore,一个将文化忠实度分解为三个细粒度维度的组合评估框架:身份(谁被代表)、背景(文化本地化背景)和行为(规范性手势和互动)。我们通过一个覆盖10个国家的评估套件来实施该框架,在三个最先进的模型上生成了6,180个视频。我们的评估显示,当前没有模型能够实现文化忠实的视频生成:表现最好的模型整体CultureScore仅为56.8%,其中行为是最具挑战性的维度,所有模型在该维度上均低于52%。此外,人类偏好排序与CultureScore方向一致,但与VideoScore相反;在视觉质量上得分最高的模型被标注者排在最后,这强调了文化忠实度是公平视频生成的基本标准。

英文摘要

As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical yet understudied frontier. Current metrics, such as VideoScore, only measure visual quality but offer no mechanism for assessing cultural faithfulness. Consequently, a model that replaces a Namaste with a handshake receives the same score as one that generates the gesture correctly. We propose CultureScore, a compositional evaluation framework that decomposes cultural faithfulness into three granular dimensions: Identity (who is represented), Context (culturally localized background), and Behavior (normative gestures and interactions). We operationalize this framework through an evaluation suite spanning 10 countries, yielding 6,180 generated videos across three state-of-the-art models. Our evaluation reveals that no current model achieves culturally faithful video generation: the best-performing model reaches only 56.8% overall CultureScore, with Behavior the most challenging dimension, which remains below 52% across all models. Furthermore, human preference rankings align directionally with CultureScore but are inverted relative to VideoScore; the highest-scoring model on visual quality was ranked last by annotators, underscoring that cultural faithfulness is an essential criterion for equitable video generation.

2606.07313 2026-06-08 cs.CL cs.AI cross-list

SV-Detect: AI-generated Text Detection with Steering Vectors

SV-Detect: 基于引导向量的AI生成文本检测

Mikhail Vishnyakov, Tatiana Gaintseva

Affiliations * Independent Researcher; Queen Mary University of London

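AI summary Extracts steering vectors from the hidden representations of a frozen language model and trains a lightweight classifier on layer-wise projection features to detect machine-generated text across domains, source models, and editing attacks.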

详情
AI中文摘要

检测机器生成文本在分布偏移(如跨域、源模型和编辑攻击的迁移)下尤其困难。我们提出了一种基于从冻结语言模型的隐藏表示中提取的引导向量的假文本检测器。在每一层,我们构建一个分离人类编写文本和机器生成文本的方向,并通过每个输入与这些方向的逐层对齐来表示输入。在这些投影特征上训练的轻量分类器产生最终的检测分数。我们的方法在分布内和分布偏移下均表现出色,包括跨域、跨源模型以及机器编辑转换(如润色和重写)。解释分析表明,学习到的方向与可识别的风格线索一致,同时捕获了超越表面特征的显著额外信号。这些结果将假文本检测定位为表示空间探测问题,并表明引导向量提供了一种简单有效的解决方案。

英文摘要

Detecting machine-generated text is especially difficult under distribution shift, such as transfer across domains, source models, and editing attacks. We propose a fake-text detector based on steering vectors extracted from the hidden representations of a frozen language model. At each layer, we construct a direction that separates human-written from machine-generated text, and represent each input by its layer-wise alignment with these directions. A lightweight classifier trained on these projection features yields the final detection score. Our method achieves strong performance both in-distribution and under distribution shift, including across domains, source models, and machine-editing transformations such as polishing and rewriting. Interpretation analyses show that the learned directions align with recognizable stylistic cues while capturing substantial additional signal beyond surface features. These results position fake-text detection as a representation-space probing problem and show that steering vectors provide a simple and effective solution.
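The layer-wise feature construction described in this abstract can be sketched as follows. The abstract does not say how each direction is built or how alignment is measured, so this sketch assumes a normalized difference of class means per layer and a dot product as the alignment score; both choices, and all function names, are illustrative assumptions rather than the authors' method.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(human_states, machine_states):
    """Unit vector from the human mean to the machine mean (assumed construction)."""
    mu_h = mean_vector(human_states)
    mu_m = mean_vector(machine_states)
    diff = [m - h for m, h in zip(mu_m, mu_h)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no separating direction")
    return [d / norm for d in diff]


def projection_features(layer_states, directions):
    """One scalar per layer: dot product of the hidden state with that layer's direction."""
    return [
        sum(s * d for s, d in zip(state, direction))
        for state, direction in zip(layer_states, directions)
    ]


# Toy two-dimensional hidden states for a single layer (made-up numbers).
human = [[0.0, 0.0], [0.0, 2.0]]
machine = [[3.0, 1.0], [5.0, 1.0]]
direction = steering_direction(human, machine)
features = projection_features([[2.0, 7.0]], [direction])
print(direction, features)
```

With real hidden states there would be one direction per layer, and the resulting feature vector would be passed to a small classifier such as logistic regression to produce the detection score.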

2606.07316 2026-06-08 cs.MA cs.AI cs.DC cross-list

Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration

面向拜占庭鲁棒的大语言模型智能体协作的分层认证语义承诺

Haoran Xu, Lei Zhang, Iadh Ounis, Xianbin Wang

Affiliations * University of Glasgow; University of Western Ontario

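AI summary Proposes the H-CSC protocol, which converts embedding-derived finality signals into one of three typed outcomes (semantic commit, verdict commit, or explicit abort) to provide finality control for Byzantine collaboration among LLM agents; under semantic poisoning it commits with low angular deviation on BFT-feasible rounds and aborts all beyond-BFT rounds.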

Comments 27 pages, 3 figures, 8 tables

详情
AI中文摘要

大语言模型智能体间的拜占庭协作需要一个最终性控制原语:给定已交付的随机、结构化自然语言提案,协议必须决定该轮是否支持提交、何种提交或带类型的显式安全中止。朴素聚合将这一选择隐藏在单一判决背后;经典拜占庭容错将其隐藏在字节同一性背后,而LLM提案不满足该要求。我们提出分层认证语义承诺(H-CSC),一种受BFT启发的协议,将判决条件分组的提案上的嵌入派生最终性信号转换为三种类型结果之一:语义承诺(一个2f+1的判决内语义核心支持该判决,发出量化聚合上的参数绑定摘要)、判决承诺(强判决边缘但分散的语义理由,发出判决级证书而不声明语义聚合)或带类型原因的显式中止。贡献在于类型化最终性,而非原始提交准确性。在受控语义投毒诊断(BCS_v1,120个片段)上,H-CSC在BFT可行桶上以低角度偏差(0.31至2.04度)提交,并按设计100%中止超出BFT的轮次(n<3f+1)。在真实LLM智能体声明验证基准(MVR-50,50个任务)上,面对配对静态和急袭拜占庭攻击,H-CSC提交率为0.90/0.92,诚实引用无效率为0.02/0.00,统计上与强证书发出的仅判决基线匹配。不同于该基线,H-CSC还在74%/72%的轮次上发出嵌入支持的语义承诺摘要,提供类型化来源。严格语义消融仅提交0.54/0.48,表明判决级回退对于在相同≤0.04安全底线下的覆盖率(+0.36/+0.44)是必要的;跨四个LLM的100任务跨模型检查将无效hmaj保持在0.00至0.03内。

英文摘要

Byzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured natural-language proposals, the protocol must decide whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggregation hides this choice behind a single verdict; classical Byzantine fault tolerance hides it behind byte-identity that LLM proposals do not satisfy. We introduce Hierarchical Certified Semantic Commitment (H-CSC), a BFT-inspired protocol that converts embedding-derived finality signals over verdict-conditioned proposal groups into one of three typed outcomes: a semantic_commit (a 2f+1 within-verdict semantic core backs the verdict, emitting a parameter-bound digest over the quantised aggregate), a verdict_commit (strong verdict margin but dispersed semantic rationale, emitting a verdict-level certificate without claiming a semantic aggregate), or an explicit abort with a typed reason. The contribution is typed finality, not raw commit accuracy. On a controlled semantic-poisoning diagnostic (BCS_v1, 120 episodes), H-CSC commits with low angular deviation on BFT-feasible buckets (0.31 to 2.04 degrees) and aborts 100% of beyond-BFT rounds (n<3f+1) as intended. On a real LLM-agent claim-verification benchmark (MVR-50, 50 tasks) under paired static and rushing Byzantine attacks, H-CSC commits 0.90/0.92 with honest-reference-invalid rates of 0.02/0.00, statistically matching a strong certificate-emitting verdict-only baseline. Unlike that baseline, H-CSC also emits an embedding-backed semantic_commit digest on 74%/72% of rounds, supplying typed provenance. A strict-semantic ablation commits only 0.54/0.48, showing the verdict-level fallback is necessary for coverage (+0.36/+0.44) at the same <=0.04 safety floor; a 100-task cross-model check across four LLMs preserves invalid_hmaj within 0.00 to 0.03.
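The three typed outcomes described in this abstract can be illustrated with a small decision function. This is a simplified sketch under stated assumptions, not the H-CSC protocol itself: it assumes the caller already knows the vote count per verdict and the size of the semantic core within each verdict group, it treats "strong verdict margin" as the leading verdict holding at least 2f+1 votes, and the abort reason strings are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Outcome:
    kind: str                      # "semantic_commit", "verdict_commit", or "abort"
    verdict: Optional[str] = None
    reason: Optional[str] = None   # hypothetical typed abort reason


def decide(n: int, f: int, verdict_votes: Dict[str, int],
           semantic_core_sizes: Dict[str, int]) -> Outcome:
    """Map one round's signals to a typed outcome (illustrative rules only)."""
    quorum = 2 * f + 1
    if n < 3 * f + 1:
        # Beyond the classical BFT bound: abort rather than commit.
        return Outcome("abort", reason="beyond_bft")
    if not verdict_votes:
        return Outcome("abort", reason="no_proposals")
    verdict, votes = max(verdict_votes.items(), key=lambda kv: kv[1])
    if votes < quorum:
        return Outcome("abort", reason="weak_verdict_margin")
    if semantic_core_sizes.get(verdict, 0) >= quorum:
        # A 2f+1 within-verdict semantic core backs the verdict.
        return Outcome("semantic_commit", verdict=verdict)
    # Strong verdict but dispersed rationale: certify the verdict only.
    return Outcome("verdict_commit", verdict=verdict)


print(decide(4, 1, {"true": 3, "false": 1}, {"true": 3}))
print(decide(4, 1, {"true": 3, "false": 1}, {"true": 2}))
print(decide(3, 1, {"true": 3}, {"true": 3}))
```

The point of the typed result is that a caller can distinguish a commit backed by a semantic aggregate from one backed only by the verdict, instead of receiving a single undifferentiated answer.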

2606.07325 2026-06-08 math.ST cs.AI cs.IT math.IT stat.TH cross-list

A Temporal Spatial Minimax Rate for Smoothly-Varying Distributions in Wasserstein Space

Wasserstein空间中平滑变化分布的时空极小极大速率

Munsik Kim

Affiliations * University of California, Berkeley

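AI summary Studies the minimax rate for estimating a future value of a curve in Wasserstein space from finitely many noisy past snapshots; proves a temporal-spatial lower bound, with a matching upper bound established at $k=0$ and, in a translation submodel, for all $k$.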

详情
AI中文摘要

我们研究了在$2$-Wasserstein空间$\mathcal{P}_2(\mathbb{R}^d)$中,从过去有限个噪声快照估计曲线$t\mapsto\mu_t$的未来值$\mu_{t_n+h}$的极小极大速率,在速度场的$k$阶协变导数满足绝热界$\|\nabla_t^k v\|\le\varepsilon$的条件下。我们的核心结果是统一的时空极小极大下界:在正则的、局部传输丰富的子类上,每个估计量都会遭受$W_2$-风险,其$M$-指数为$\gamma_d(k+1)/(k+1+\gamma_d)$,其中$\gamma_d=\min(1/d,1/2)$($M$为总样本量)。该下界源于时空约化:光滑性预算定义了一个可达的$W_2$-球,沿时间轴嵌入一个传输填充,整个快照实验的信息由Fano论证控制——空间填充是经典的,但其光滑性容许的时间嵌入和全窗口分析是新的。该界插值了一个与维数无关的外推下限$\varepsilon h^{k+1}$——即使过去完全已知,未来不可观测的不可约代价——以及空间估计的维数灾难$M^{-\gamma_d}$,当$k\to\infty$时恢复静态分布估计速率。我们以设计依赖的形式陈述下界——具有设计加权的有效样本量——适用于任意观测时间,并在密集(等间距)情形下得到闭式指数。匹配的上界在$k=0$(速率$M^{-1/(d+1)}$,$d\ge3$)和平移子模型中对所有$k$建立;对于$k\ge1$,协变估计量条件依赖于两个估计(比较几何偏差界和最优传输映射估计速率)达到该速率,将无条件的一般$k$上界留作开放问题。在合成弯曲和平坦族上的数值实验验证了预测的指数。

英文摘要

We study the minimax rate of estimating a future value $\mu_{t_n+h}$ of a curve $t\mapsto\mu_t$ in the $2$-Wasserstein space $\mathcal{P}_2(\mathbb{R}^d)$ from finitely many noisy snapshots of its past, under an adiabatic bound $\|\nabla_t^k v\|\le\varepsilon$ on the $k$-th covariant derivative of the velocity field. Our central result is a unified temporal-spatial minimax lower bound: over regular, locally transport-rich subclasses, every estimator incurs $W_2$-risk with $M$-exponent $\gamma_d(k+1)/(k+1+\gamma_d)$, $\gamma_d=\min(1/d,1/2)$ ($M$ the total sample size). It follows from a temporal-to-spatial reduction: the smoothness budget defines a reachable $W_2$-ball into which a transport packing is embedded along the time axis, and the information of the entire snapshot experiment is controlled by a Fano argument -- the spatial packing is classical, but its smoothness-admissible temporal embedding and the full-window analysis are new. The bound interpolates a dimension-free extrapolation floor of order $\varepsilon h^{k+1}$ -- the irreducible cost of an unobserved future, present even with the exact past -- and the spatial estimation curse $M^{-\gamma_d}$, recovering the static distribution-estimation rate as $k\to\infty$. We state the lower bound in a design-dependent form -- with a design-weighted effective sample size -- valid for arbitrary observation times, and obtain the closed-form exponent in the dense (equispaced) regime. The matching upper bound is established at $k=0$ (rate $M^{-1/(d+1)}$, $d\ge3$) and, in a translation submodel, for all $k$; for $k\ge1$ a covariant estimator attains the rate conditionally on two estimates (a comparison-geometry bias bound and an optimal-transport map-estimation rate), leaving the unconditional general-$k$ upper bound as an open problem. Numerical experiments on synthetic curved and flat families corroborate the predicted exponents.
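As a quick consistency check on the exponent stated in this abstract, the two special cases it mentions can be worked out directly from the formulas given there (nothing beyond the abstract is assumed):

$$
\gamma_d=\min\!\left(\tfrac{1}{d},\tfrac{1}{2}\right)=\tfrac{1}{d}\quad(d\ge 2),
\qquad
\left.\frac{\gamma_d(k+1)}{k+1+\gamma_d}\right|_{k=0}=\frac{1/d}{1+1/d}=\frac{1}{d+1},
\qquad
\lim_{k\to\infty}\frac{\gamma_d(k+1)}{k+1+\gamma_d}=\gamma_d .
$$

So at $k=0$ the lower-bound exponent agrees with the stated upper-bound rate $M^{-1/(d+1)}$ for $d\ge 3$, and the large-$k$ limit recovers the static rate $M^{-\gamma_d}$.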

2606.07351 2026-06-08 cs.LG cs.AI cross-list

SleepExplain: Explainable Non-Rapid Eye Movement and Rapid Eye Movement Sleep Stage Classification from EEG Signal

SleepExplain: 基于EEG信号的可解释非快速眼动和快速眼动睡眠阶段分类

Rafsan Jany, Md. Hamjajul Ashmafee, Iqram Hussain, Md Azam Hossain

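AI summary Proposes the SleepExplain model, which classifies NREM and REM sleep stages with ensemble learners (Random Forest, XGBoost, Gradient Boosting), reaching 94.30% accuracy with XGBoost, and uses SHAP for explainability.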

Comments 6 pages, 7 figures, 2022 25th International Conference on Computer and Information Technology (ICCIT)

详情
Journal ref
2022 25th International Conference on Computer and Information Technology (ICCIT), pp. 248-253, 2022
AI中文摘要

睡眠阶段分类是多种睡眠相关疾病最重要的诊断方法之一。脑电图(EEG)被认为是检查神经效应与睡眠阶段之间关联的有力工具,因为它能正确识别与睡眠相关的神经变化。在非快速眼动(NREM)和快速眼动(REM)睡眠阶段,许多神经和身体功能受到影响,因此在其功能中扮演重要角色。本研究旨在从睡眠EEG数据中分类NREM和REM睡眠阶段,并提出一个新颖的SleepExplain模型,一种可解释的NREM和REM睡眠阶段分类,以解释其预测。在这项工作中,使用随机森林、XGBoost和梯度提升集成分类模型对睡眠阶段进行分类。总体而言,我们获得了92.54%(随机森林)、94.25%(梯度提升)和94.30%(XGBoost)的准确率。对于可解释分类模型,我们采用博弈论方法SHAP(SHapley Additive exPlanations)为预测提供令人信服的解释。

英文摘要

Classification of sleep stages is one of the most important diagnostic approaches for a variety of sleep-related disorders. Electroencephalography (EEG) is regarded as a powerful tool for examining the association between neurological effects and sleep phases since it correctly identifies sleep-related neurological alterations. During Non-Rapid Eye Movement (NREM) and Rapid Eye Movement (REM) sleep phases, a number of nerve and bodily functions are affected, so both phases play an important role in these functions. This work aims to classify NREM and REM sleep stages from sleep EEG data and presents a novel SleepExplain model, an explainable NREM and REM sleep stage classifier that explains its predictions. In this work, sleep stages were classified using Random Forest, XGBoost, and Gradient Boosting ensemble classification models. Overall, we obtained an accuracy of 92.54% (Random Forest), 94.25% (Gradient Boosting), and 94.30% (XGBoost). For the explainable classification model, we used a game-theoretic approach, SHAP (SHapley Additive exPlanations), to offer a convincing explanation for the prediction.

2606.07365 2026-06-08 cs.LG cs.AI cross-list

A robust PPG foundation model using multimodal physiological supervision

一种使用多模态生理监督的鲁棒PPG基础模型

Eloy Geenjaar, Vince Calhoun, Scott Daly, Gouthaman KV, Lie Lu, Trisha Mittal, Daniel P. Darcy

Affiliations * Dolby Laboratories

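AI summary Proposes a PPG foundation model that uses the electrocardiogram and respiratory signals in ICU datasets to select contrastive samples, requiring no high-quality or field-like pretraining data; it improves performance on 14 of 15 downstream tasks.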

详情
AI中文摘要

光电容积描记法(PPG)是一种无创测量血容量变化的方法,广泛应用于可穿戴设备和临床环境。最近的PPG基础模型要么使用开源ICU数据集,采用需要精心整理数据的预训练范式,从而难以泛化到场域数据,要么使用闭源场域PPG数据。相比之下,我们提出了一种PPG基础模型,不需要高质量或场域预训练数据,而是利用ICU数据集中伴随的心电图和呼吸信号在预训练期间选择对比样本。我们的方法允许模型保留并从噪声PPG片段中学习,提高了推理时的鲁棒性。我们的模型在比现有最先进方法少3倍的受试者上预训练,在15个不同的下游任务(包括场域日常活动和心率预测)中的14个上实现了性能提升。我们的结果表明,多模态监督可以整合互补的生理信息,以提高PPG基础模型的鲁棒性,并增强其对消费级数据的泛化能力。

英文摘要

Photoplethysmography (PPG), a non-invasive measure of changes in blood volume, is widely used in both wearable devices and clinical settings. Recent PPG foundation models either use open-source ICU datasets with pretraining paradigms that require curated data and thus complicate generalization to field-like data, or use closed-source field-like PPG data. In contrast, we propose a PPG foundation model that does not require high-quality or field-like pretraining data, and instead leverages accompanying electrocardiogram and respiratory signals in ICU datasets to select contrastive samples during pretraining. Our approach allows the model to retain and learn from noisy PPG segments, improving robustness at inference. Our model, pretrained on 3x fewer subjects than existing state-of-the-art approaches, achieves performance improvements on 14 out of 15 diverse downstream tasks, including field-like daily activity and heart rate prediction. Our results demonstrate that multimodal supervision can integrate complementary physiological information to improve the robustness of PPG foundation models and enhance their generalization to consumer-grade data.

2606.07368 2026-06-08 cs.CV cs.AI cross-list

Mitosis Detection in the Wild: Multi-Tumor and Context-Aware Generalization in the MIDOG 2025 Challenge

野外有丝分裂检测:MIDOG 2025挑战中的多肿瘤与上下文感知泛化

Marc Aubreville, Jonas Ammeling, Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Robert Klopfleisch, Jiaqi Lv, Shan E Ahmed Raza, Raphaël Bourgade, Thomas Walter, Yasemin Topuz, Songül Varlı, Charles-Antoine Collins-Fekete, Zhuoyan Shen, Navya Sri Kelam, Nitin Singhal, Christian Marzahl, Brian Napora, Tengyou Xu, Hongyan Gu, Mario Vento, Gennaro Percannella, Norbert Ropiak, Izabela Wasiak, Jie Xiao, Shaojun Liu, Seungho Choe, April Khademi, Vidushi Walia, Sujatha Kotte, Andrew Broad, Alex Wright, Guillaume Balezo, Esha Sadia Nasir, Mostafa Jahanifar, Yosuke Yamagishi, Shouhei Hanaoka, Mattia Sarno, Francesco Tortorella, Biwen Meng, Jingxin Liu, Sara Krauss, Daniel Hieber, Lavish Ramchandani, Dev Kumar Das, Mieko Ochi, Yuan Bae, Piotr Giedziun, Mateusz Maniewski, Vangala Govindakrishnan Saipradeep, Naveen Sivadasan, Leire Benito-Del-Valle, Adrian Galdran, Kaustubh Atey, Sameer Anand Jha, Adinath Dukre, Imran Razzak, Maxime W. Lafarge, Viktor H. Koelzer, Nils Porsche, Nikolas Stathonikos, Mitko Veta, Dominik Hirling, Zsanett Zsófia Iván, Peter Horvath, Katharina Breininger, Christof A. Bertram

Affiliations * Flensburg University of Applied Sciences; Technische Hochschule Ingolstadt; University of Veterinary Medicine; Schwarzman Animal Medical Center; Freie Universität Berlin; University of Warwick; MINES Paris - PSL University; Yildiz Technical University; University College London; AIRA MATRIX Private Limited; University of California, Los Angeles; University of Kansas Medical Center; University of Salerno; Cancer Center Sp. z o. o.; th Military Research Hospital in Bydgoszcz; Shenzhen Technology University; Toronto Metropolitan University; Tata Consultancy Services Ltd.; Leeds Teaching Hospitals NHS Trust; The University of Tokyo; Xi’an Jiaotong-Liverpool University; University of Augsburg; Ulm University; Japanese Red Cross Medical Center; Wroclaw University of Science and Technology; TECNALIA, Basque Research and Technology Alliance (BRTA); Indian Institute of Technology Bombay; MBZUAI; University of Basel; University Medical Center Utrecht; TU Eindhoven; HUN-REN Biological Research Centre

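AI summary To address histological diversity in clinical practice, the MIDOG 2025 challenge evaluated algorithms across 12 tumor types and multiple scanning platforms. Models perform reliably in traditional hotspots but degrade markedly in challenging regions and on rare tumors; ensembling raises the F1 score by 1.5 percentage points on average.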

详情
AI中文摘要

自动有丝分裂检测是计算病理学中一项成熟的任务。虽然之前的基准测试关注扫描仪引起的域偏移,但临床“真实世界”应用要求模型能够对组织学景观中预期的巨大差异具有鲁棒性。MItosis DOmain Generalization (MIDOG) 2025挑战旨在评估算法在空前生物学和上下文多样性下的性能。我们策划了一个包含365个病例的测试数据集,涵盖12种不同的人类、犬和猫肿瘤类型,并在多个扫描平台上数字化。超越手动选择的感兴趣区域(ROI),该挑战还要求在随机组织区域(代表全切片检测情况)和困难区域(富含难负样本的区域)进行检测。在第二个赛道中,我们引入了非典型有丝分裂象(AMF)的分类。检测赛道有18支队伍提交,F1分数最高达0.740。在AMF检测赛道,我们有21个提交,平衡准确率最高达0.908。我们的分析显示,虽然大多数模型在传统热点区域表现可靠,但在困难ROI中性能显著下降,假阳性率增加了两倍。此外,性能在12种肿瘤类型间差异显著,突显了当前最先进架构在遇到罕见或高度多形性恶性肿瘤时的“盲点”。此外,我们评估了集成的有效性,发现F1分数和平衡准确率平均分别提高1.5和1.3个百分点。相比之下,测试时增强(TTA)没有显示出相关改进。MIDOG 2025表明,“野外”有丝分裂检测仍然是一个重大障碍。从仅热点评估到多上下文框架的转变,为临床可靠性提供了更现实的代理指标。

英文摘要

Automated mitosis detection is a well-established task in computational pathology. While previous benchmarks focused on scanner-induced domain shift, clinical "real-world" application requires models to be robust across the vast variance to be expected in the histological landscape. The MItosis DOmain Generalization (MIDOG) 2025 challenge was designed to evaluate algorithmic performance across unprecedented biological and contextual diversity. We curated a test dataset of 365 cases, encompassing 12 distinct human, canine and feline tumor types, digitized across multiple scanning platforms. Moving beyond hand-selected hotspots, the challenge also required detection in random tissue areas (representative of the whole-slide detection situation) and challenging areas (areas rich in hard negatives). In the second track, we introduced the classification of atypical mitotic figures (AMFs). There were 18 teams submitting to the detection track, with F1 scores ranging up to 0.740. In the AMF detection track, we had 21 submissions with balanced accuracy values up to 0.908. Our analysis reveals that while most models perform reliably in traditional hotspots, significant performance degradation occurs in challenging regions of interest (ROIs), where false positive rates tripled. Furthermore, performance varied significantly across the 12 tumor types, highlighting "blind spots" in current state-of-the-art architectures when encountering rare or highly pleomorphic malignancies. Moreover, we evaluated the effectiveness of ensembling and found mean increases of 1.5 and 1.3 percentage points in F1 score and balanced accuracy, respectively. In contrast, test-time augmentation (TTA) showed no relevant improvement. MIDOG 2025 demonstrates that "in the wild" mitosis detection remains a significant hurdle. The transition from hotspot-only evaluation to a multi-contextual framework provides a more realistic proxy for clinical reliability.

2606.07381 2026-06-08 eess.IV cs.AI cs.CV cross-list

Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios

合成病灶MR图像在低数据场景下自动局灶性皮质发育不良检测中的影响

Prabhjot Kaur, Hakim Ouaalam, Sedat Kandemirli, Sanjay P. Prabhu, Simon K. Warfield

发表机构 * Computational Radiology Laboratory(计算放射学实验室) Boston Children’s Hospital(波士顿儿童医院) Harvard Medical School(哈佛医学院)

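AI summary Synthesizes FCD-lesion MRI data with a conditional generative network and evaluates its realism and its effect on automated detection; synthetic data reduces labeled-data needs by about 20%, but real data remains more effective.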

详情
AI中文摘要

背景与目的:自动检测局灶性皮质发育不良(FCD)需要大量体素级病灶勾画的MRI数据,这些数据难以获取。本研究旨在生成呈现FCD的合成MRI数据,评估其真实性,并评估其对自动FCD检测的影响,特别是在减少手动标注需求方面。方法:回顾性研究了来自多个(3个)中心的131例FCD患者和90例健康对照的T1加权(T1w)和T2加权液体衰减反转恢复(FLAIR)MRI扫描。通过将生成网络以二元FCD掩膜为条件生成合成MRI。两位神经放射科医生从14张真实和14张合成扫描的随机集合中识别真实图像。训练了三个nnU-Net模型用于检测FCD,分别使用:(i)仅真实数据(35例FCD/35例对照),(ii)真实数据(35例FCD/35例对照)加合成增强,以及(iii)扩展的真实数据(70例FCD/70例对照)。结果:专家区分真实与合成图像的能力有限,T1w分类准确率为60%,FLAIR为70%(评分者间一致性kappa=0.86)。用合成数据增强自动FCD检测使灵敏度提高8.14%(p=0.12),并改善了模型在真实病灶部位的置信度(0.83±0.11至0.89±0.12;p=0.02)。扩展真实数据模型进一步将灵敏度提高至73.8%(p<0.001),置信度提高至0.90±0.14(p=0.01)。结论:条件生成网络可以生成逼真的合成FCD-MRI,在保持同等灵敏度的情况下减少约20%的标注数据需求。当可用时,等量的真实数据仍比合成增强更有效。

英文摘要

Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI data, which are difficult to acquire. This study aims to generate synthetic MRI data exhibiting FCD, assess their realism, and evaluate their impact on automated FCD detection, particularly in reducing the need for manual annotations. Methods: T1-weighted (T1w) and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) MRI scans from 131 FCD patients and 90 healthy controls from multiple (3) sites were retrospectively studied. Synthetic MRIs were generated by conditioning a generative network on binary FCD masks. Two neuroradiologists identified real images from a random set of 14 real and 14 synthetic scans. Three nnU-Net models were trained to detect FCD using: (i) real-only (35 FCD / 35 controls), (ii) real (35 FCD / 35 controls) plus synthetic augmentation, and (iii) expanded real data (70 FCD / 70 controls). Results: Experts showed limited ability to distinguish real from synthetic images, with classification accuracy of 60% for T1w and 70% for FLAIR (inter-rater agreement kappa = 0.86). Augmenting automated FCD detection with synthetic data increased sensitivity by 8.14% (p = 0.12) and improved model confidence at true lesion sites (0.83 +/- 0.11 to 0.89 +/- 0.12; p = 0.02). The expanded real-data model further improved sensitivity to 73.8% (p < 0.001) and confidence to 0.90 +/- 0.14 (p = 0.01). Conclusion: Conditional generative networks can generate realistic synthetic FCD-MRIs, reducing labeled data needs by approximately 20% while maintaining equivalent sensitivity. Equivalent amounts of real data, when available, remain more effective than synthetic augmentation.

2606.07410 2026-06-08 cs.LG cs.AI cross-list

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

人类与DeepSeek-R1大语言模型数学推理的全面解剖

Yuxiang Chen, Jun Wang

发表机构 * UCL Centre for Artificial Intelligence(伦敦大学人工智能中心)

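AI summary Based on annotations of 10,247 reasoning steps across all 30 AIME 2025 problems, finds that DeepSeek-R1 exhibits topological mimicry (reproducing the surface form of reasoning without its functional role), while stable use of branching and backtracking in successful traces and reflection placed within deductive inference are signals of genuine reasoning.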

详情
AI中文摘要

大语言模型中“顿悟时刻”的出现,特别是DeepSeek-R1-0120,引发了这些系统是真正推理还是仅仅模仿推理表象的问题。我们对AIME 2025所有30道题目进行了模型与人类推理的全面实证比较,将10247个推理步骤详尽地注释为五个功能类别:分析、推理、分支、回溯和反思。我们发现了一个明显的结构差异。人类解决方案在分析和演绎之间保持紧凑交替,而DeepSeek-R1频繁重访中间结果,进行浅层且往往不必要的验证,并在局部检查中循环,而没有有意义的逻辑进展。我们将其描述为拓扑模仿:再现推理的表面形式而不发挥其功能作用。尽管如此,我们识别出两个真正推理的信号。首先,成功轨迹表现出分支和回溯的稳定使用,而失败轨迹要么过度使用要么使用不足探索性动作。其次,反思仅在置于演绎推理中时才有效;陷入分析循环的反思专注于局部数值细节而忽略全局逻辑错误。这些发现表明,当前的长链思维模型可能更多地因推理的表象而非真正的演绎进展而获得奖励。我们讨论了改进评估和训练的方向,包括测量跨轨迹稳定性、惩罚“空转”轨迹、鼓励更深层的逻辑纠正,以及将推理时间计算重新分配给演绎和回溯。总体而言,推理质量不仅取决于反思发生的多少,还取决于反思是否一致地出现在适当的逻辑尺度上。

英文摘要

The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive empirical comparison between model and human reasoning across all 30 problems from AIME 2025, exhaustively annotating 10,247 reasoning steps into five functional categories: Analysis, Inference, Branch, Backtrace, and Reflection. We find a clear structural difference. Human solutions maintain a compact alternation between analysis and deduction, whereas DeepSeek-R1 frequently revisits intermediate results, performs shallow and often unnecessary verification, and loops through local checks without meaningful logical progress. We describe this as topological mimicry: reproducing the surface form of reasoning without its functional role. Despite this, we identify two signals of genuine reasoning. First, successful traces exhibit stable use of branching and backtracking, while failed traces either underuse or overuse exploratory actions. Second, reflection is only effective when placed within deductive inference; reflections trapped in analysis loops focus on local numerical details while missing global logical errors. These findings suggest that current long-CoT models may be rewarded more for the appearance of reasoning than for genuine deductive progress. We discuss directions for improving evaluation and training, including measuring cross-trace stability, penalising "spinning-wheel" traces, encouraging deeper logical correction, and reallocating inference-time compute toward deduction and backtracking. Overall, reasoning quality depends not simply on how much reflection occurs, but on whether reflection appears consistently and at the appropriate logical scale.

2606.07412 2026-06-08 cs.SE cs.AI cross-list

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

Socratic-SWE:通过轨迹衍生智能体技能实现自我进化的编码智能体

Chuan Xiao, Zhengbo Jiao, Shaobo Wang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang, Lin Qu

发表机构 * AI Data, Alibaba Group(阿里云数据) Shanghai Jiao Tong University(上海交通大学)

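AI summary Proposes Socratic-SWE, a closed-loop self-evolution framework that distills the agent's historical solving traces into structured skills and uses them to generate targeted repair tasks, enabling continual self-improvement of coding agents.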

Comments 21 pages, 5 figures. Under review

详情
AI中文摘要

基于LLM的软件工程智能体已成为现实世界语言模型能力的核心测试平台,但其训练仍受限于高质量SWE任务的可用性。现有的合成数据方法通常通过固定突变或漏洞注入程序创建任务,导致生成的任务分布很大程度上独立于智能体自身的弱点和训练进度。我们提出Socratic-SWE,一个闭环自进化框架,将智能体的历史求解轨迹重新用作训练信号的来源。Socratic-SWE不仅将轨迹视为奖励计算的证据,还将其蒸馏为结构化的智能体技能,总结重复出现的失败模式和有效的修复模式。这些技能随后指导在真实仓库中生成针对性的修复任务。候选任务通过基于执行的验证进行检查,并使用求解器梯度对齐奖励进行评分,从而保留的任务既可验证又有助于改进求解器。更新后的求解器产生新的轨迹,使任务课程能够在连续轮次中自适应。在SWE-bench Verified、SWE-bench Lite、SWE-bench Pro和Terminal-Bench 2.0上,Socratic-SWE在相同计算预算下持续优于自我进化的基线,经过三次迭代后在SWE-bench Verified上达到50.40%。这些结果表明,求解轨迹可以作为自我进化SWE智能体的可扩展基础。

英文摘要

LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, making the resulting distributions largely independent of the agent's own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop self-evolution framework that reuses the agent's historical solving traces as a source of training signal. Rather than treating traces only as evidence for reward computation, Socratic-SWE distills them into structured agent skills that summarize recurring failures and effective repair patterns. These skills then guide the generation of targeted repair tasks in real repositories. Candidate tasks are checked through execution-based validation and scored with a solver-gradient alignment reward, so that the retained tasks are both verifiable and useful for improving the Solver. The updated Solver produces new traces, enabling the task curriculum to adapt over successive rounds. Across SWE-bench Verified, SWE-bench Lite, SWE-bench Pro, and Terminal-Bench 2.0, Socratic-SWE consistently improves over self-evolving baselines under the same compute budget, reaching 50.40% on SWE-bench Verified after three iterations. These results suggest that solving traces can serve as a scalable substrate for self-evolving SWE agents.

2606.07433 2026-06-08 cs.CV cs.AI cs.MM 交叉投稿

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Watch, Remember, Reason: 基于多模态大语言模型的人类视角视频理解

Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang

发表机构 * School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院) Wuhan University(武汉大学) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学) CASIA(中国科学院自动化研究所) University of Tokyo(东京大学) University of Liverpool(利物浦大学) Zhejiang University(浙江大学) National University of Singapore(新加坡国立大学) UC Merced(加州大学默塞德分校)

AI总结 提出人类视角下视频理解的三个功能能力(观看、记忆、推理),构建统一框架分析视频MLLM的感知、记忆、推理和预测,并总结挑战、方法、应用及未来方向。

详情
AI中文摘要

视频理解正被多模态大语言模型(MLLMs)迅速变革,研究从短视频片段转向长视频、多模态和知识密集型视频场景。这些场景要求模型在有限计算预算下处理稀疏证据、长程依赖、多模态对齐和可靠推理。本文从人类视角出发,围绕观看、记忆和推理三个功能能力组织基于LLM的视频理解。该视角并非将视频任务视为孤立基准,而是提供一个统一结构,用于分析视频MLLM如何获取证据、保持上下文并产生有依据的输出。我们引入一个公式,通过感知表示、记忆状态、推理轨迹和最终预测来表征视频理解系统。基于此公式,我们识别出时空感知、高效长视频处理、记忆建模、流式理解和忠实推理中的挑战。代表性方法按其视频MLLM系统中的角色进行组织:观看涵盖细粒度、全面、音视频和高效感知;记忆包括离线记忆和流式记忆;推理涵盖纯文本推理和视频辅助推理。我们进一步考察了应用领域,如自我中心、体育、教学、医学和叙事视频,并涵盖了跨任务类型、监督格式、模态和能力维度的训练数据集和评估基准。最后,我们概述了可扩展、记忆感知和有依据的视频智能的开放问题和未来方向。相关工作将在https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding持续追踪。

英文摘要

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.

2606.07437 2026-06-08 cs.RO cs.AI cs.HC cs.SE cs.SY eess.SY 交叉投稿

Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability

重新构想自动驾驶时代的ISO 26262:通过可迁移性和可预测性增强可控性

Chaitanya Shinde, Hadi Hajieghrary, Paul Schmitt, Adam Shoemaker, Bodo Seifert, Steve Kenner

发表机构 * Torc Robotics, Inc.(Torc机器人公司) Reynolds & Moore(雷纳德与摩尔公司) Critical Systems Analysis, LLC(Critical Systems Analysis LLC)

AI总结 针对自动驾驶汽车缺乏人类驾驶员的问题,将ISO 26262中的可控性分解为可迁移性和可预测性两个可审计维度,并给出量化框架,以支持SAE L4/L5系统的功能安全论证。

详情
AI中文摘要

ISO 26262标准通过基于严重性、暴露度和可控性的风险评估来定义道路车辆的功能安全,其基础是人类驾驶车辆范式。在自动驾驶汽车(AV)的背景下,缺乏人类驾驶员需要重新审视这些原则。本文将可控性占位符分解为ISO 26262的两个可审计证据维度,引入了两个可测量的子概念:可迁移性和可预测性。可迁移性扩展了可控性,以捕捉AV系统将控制权移交给专用后备安全机制的能力,而可预测性则捕捉外部主体预测AV行为的难易程度。可预测性基于人机交互启发原则进行形式化定义,并提供了量化它的数学框架。引入了设计能力与可实现能力之间的差距,以区分架构后备声明与场景条件相关的可实现后备能力。所提出的度量与ISO 26262和ISO/PAS 21448(SOTIF)保持一致,使后备和交互声明在ODD切片上可证伪和可追溯。这些维度补充而非替代现有标准,这些增强保留了ISO 26262的结构,同时将其适用性扩展到在SAE L4和L5级别运行的无驾驶员自动化系统。

英文摘要

The ISO 26262 standard defines functional safety for road vehicles through risk assessments based on Severity, Exposure, and Controllability, grounded in a human-driven vehicle paradigm. In the context of autonomous vehicles (AVs), the absence of a human driver necessitates revisiting these principles. This paper decomposes the Controllability placeholder into two auditable evidence dimensions of ISO 26262 by introducing two measurable sub-concepts: Transferability and Predictability. Transferability extends Controllability to capture AV systems' ability to hand off control to dedicated fallback safety mechanisms, while Predictability captures how easily external agents can anticipate AV behavior. Predictability is formally defined from human-robot interaction-inspired principles, and a mathematical framework is provided to quantify it. A designed-versus-achievable gap is introduced to distinguish architectural fallback claims from scene-conditioned achievable fallback capability. The proposed metrics align with ISO 26262 and ISO/PAS 21448 (SOTIF), rendering fallback and interaction claims falsifiable and traceable across ODD slices. These dimensions complement rather than replace existing standards, and the enhancements preserve the structure of ISO 26262 while extending its applicability to driverless automated systems operating at SAE Levels 4 and 5.
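For orientation, the Severity/Exposure/Controllability risk assessment that this abstract revisits can be sketched as a small lookup. This is a minimal sketch, assuming the commonly cited shortcut that the ASIL determination table in ISO 26262-3 is reproduced by summing the class indices (S1 to S3, E1 to E4, C1 to C3). The function name `asil` is illustrative, the S0/E0/C0 classes are not handled, and nothing here represents the paper's Transferability or Predictability metrics.

```python
def asil(severity: int, exposure: int, controllability: int) -> str:
    """Return the ASIL for classes S1-S3, E1-E4, C1-C3.

    Assumption: uses the index-sum shortcut for the ISO 26262-3 table,
    where a sum of 7, 8, 9, 10 maps to A, B, C, D and anything lower is QM.
    """
    if not (1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3):
        raise ValueError("expected S in 1..3, E in 1..4, C in 1..3")
    total = severity + exposure + controllability
    return {7: "A", 8: "B", 9: "C", 10: "D"}.get(total, "QM")


if __name__ == "__main__":
    # Worst case on every axis: S3, E4, C3.
    print(asil(3, 4, 3))
    # Same hazard, but easily controllable (C1).
    print(asil(3, 4, 1))
```

The sketch makes the abstract's concern concrete: the controllability class alone can move a hazard by two ASIL levels, so the class needs a defensible meaning when no human driver is present.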

2606.07451 2026-06-08 cs.CV cs.AI cs.CL cs.LG 交叉投稿

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TEVI: 基于稀疏自编码器的文本条件视觉表示编辑以改进视觉-语言对齐

Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele

发表机构 * Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany(马克斯·普朗克信息学研究所,萨尔兰信息学园区,德国萨尔布吕肯) Department of Language Science and Technology, Saarland University, Saarbrücken, Germany(萨尔兰大学语言科学与技术系,德国萨尔布吕肯)

AI总结 提出TEVI框架,利用稀疏自编码器解耦图像嵌入,并通过文本条件掩码模块选择性重构嵌入,以改善CLIP等视觉-语言模型的图像-文本对齐,在多个检索基准上取得提升。

Comments 20 pages, 13 figures, 14 tables

详情
AI中文摘要

视觉-语言模型(如CLIP)由于共享图像-文本嵌入空间,对多种任务非常有用。尽管如此,图像和文本嵌入往往对齐不佳,影响下游性能。最近的研究表明,这可以归因于信息不平衡:图像包含的信息比其标题描述的更多。在这项工作中,我们提出了TEVI,一个利用标题作为信号来决定从图像嵌入中保留哪些信息的框架。具体来说,我们使用稀疏自编码器来解耦图像嵌入,并训练一个掩码模块,根据给定的标题选择性重构嵌入。在具有合成标题的受控设置中,我们展示了TEVI在保留标题描述的属性同时丢弃其他属性方面的有效性。通过将TEVI应用于在自然图像上训练的CLIP模型,我们进一步在粗粒度短标题(MS COCO, Flickr)和细粒度长标题(IIW, DOCCI)基准上实现了改进的检索性能,在更丰富的标题上获得更强的增益,并在RoCOCO基准上提高了鲁棒性。

英文摘要

Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.
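The core mechanism described above (encode an image embedding into sparse latents, mask some latents, decode) can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' implementation: the autoencoder has no bias terms, the weights are hand-written rather than trained, and the mask is passed in directly, whereas in TEVI it is produced by a masking module conditioned on the caption. All function names are hypothetical.

```python
def relu(values):
    """Element-wise ReLU, giving non-negative (sparse) latents."""
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def masked_reconstruction(embedding, encoder, decoder, mask):
    """Encode to latents, keep only the masked-in latents, then decode.

    Assumption: `mask` holds one weight in [0, 1] per latent; here it is
    supplied by the caller instead of being predicted from a caption.
    """
    latents = relu(matvec(encoder, embedding))
    kept = [z * m for z, m in zip(latents, mask)]
    return matvec(decoder, kept)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # Keep the first latent, drop the second.
    print(masked_reconstruction([2.0, 3.0], identity, identity, [1.0, 0.0]))
```

With identity weights each latent equals one input coordinate, so the output shows directly which information the mask kept and which it discarded.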

2606.07454 2026-06-08 cs.IR cs.AI 交叉投稿

PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams

PaperFlow: 跨日常论文流的画像、推荐与自适应

Fuqiang Wang, Song Tan, Zheng Guo, Jiaohao Fu, Xinglong Xu, Bihui Yu, Jie Dong, Zheng Sun, Siyuan Li, Jingxuan Wei, Cheng Tan

发表机构 * Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences)(算力网络与信息安全教育部重点实验室,山东省计算中心(国家超级计算济南中心),齐鲁工业大学(山东省科学院)) University of Chinese Academy of Sciences(中国科学院大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出PaperFlow框架,通过画像、推荐和自适应三阶段处理动态论文流,并构建纵向用户日基准,实验表明其在排序和行为对齐上优于基线。

Comments 48 pages, 13 figures, 22 tables

详情
AI中文摘要

科学论文推荐通常被评估为固定候选集上的静态排序,然而真实的科研阅读是一个日常的纵向过程,其中兴趣会变化且反馈会累积。我们引入PaperFlow,一个将这一过程组织为三个耦合阶段的框架:画像(Profiling),从异构冷启动证据中构建并维护一个结构化、可检查的学术画像;推荐(Recommending),在固定展示预算下通过多信号聚合对每个特定日期的论文流进行排序;自适应(Adapting),从语义不同的反馈信号中更新用户状态,并建模跨天的兴趣漂移。我们进一步定义了一个纵向用户日基准,该基准在共享的时间信息边界下固定用户、日期、候选池、可见输入和隐藏的模拟相关标签。该基准包含24个模拟研究用户、50个每日论文流、1200个用户日片段、20727篇独特论文和497448个片段-论文记录。我们还指定了一个盲评人工评估协议,以验证自动指标与专家判断之间的一致性。针对五个科学论文推荐基线的实验表明,PaperFlow实现了最强的基于oracle的排序、与模拟阅读选择最高的行为对齐以及最佳的盲评人工评估分数。

英文摘要

Scientific paper recommendation is typically evaluated as static ranking over a fixed candidate set, yet real scientific reading unfolds as a daily, longitudinal process in which interests shift and feedback accumulates. We introduce PaperFlow, a framework that organizes this process into three coupled stages: Profiling, which constructs and maintains a structured, inspectable scholarly profile from heterogeneous cold-start evidence; Recommending, which ranks each date-specific paper stream through multi-signal aggregation under a fixed display budget; and Adapting, which updates user state from semantically distinct feedback signals and models interest drift across days. We further define a longitudinal user-day benchmark that fixes users, dates, candidate pools, visible inputs, and hidden simulated relevance labels under a shared temporal information boundary. The benchmark contains 24 simulated research users, 50 daily paper streams, 1,200 user-day episodes, 20,727 unique papers, and 497,448 episode-paper records. We additionally specify a blind human-evaluation protocol to validate alignment between automatic metrics and expert judgments. Experiments against five scientific recommendation baselines show that PaperFlow achieves the strongest oracle-based ranking, the highest behavioral alignment with simulated reading selections, and the best blind human-evaluation score.
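The benchmark figures quoted above are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses only the numbers stated in the abstract. The helper names are illustrative, and the "average candidates per episode" is a derived quantity that the abstract does not report.

```python
def episode_count(users: int, daily_streams: int) -> int:
    """One episode per (user, day) pair."""
    return users * daily_streams


def mean_candidates_per_episode(records: int, episodes: int) -> float:
    """Average number of episode-paper records per user-day episode."""
    return records / episodes


if __name__ == "__main__":
    episodes = episode_count(users=24, daily_streams=50)
    print(episodes)  # matches the 1,200 user-day episodes in the abstract
    print(round(mean_candidates_per_episode(497_448, episodes), 2))
```

Read this way, each simulated user sees on the order of a few hundred candidate papers per day, which is the scale at which the fixed display budget becomes a real constraint.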

2606.07464 2026-06-08 cs.RO cs.AI cs.CV 交叉投稿

Planning-aligned Token Compression for Long-Context Autonomous Driving

面向长上下文自动驾驶的规划对齐令牌压缩

Zhixuan Liang, Yuxiao Chen, Yurong You, Peter Karkus, Wenhao Ding, Boyi Li, Alexander Popov, Yan Wang, Maximilian Igl, Yiming Li, Danfei Xu, Nikolai Smolyanskiy, Boris Ivanovic, Ping Luo, Marco Pavone

发表机构 * NVIDIA Research(NVIDIA研究) School of Computing and Data Science, The University of Hong Kong(计算与数据科学学院,香港大学)

AI总结 提出COMPACT-VA框架,基于条件VQ-VAE将长上下文压缩为有界表示,通过规划对齐实现决策关键信息保留,在动态场景中成功率提升超6%,速度提升3.3倍。

Comments 9 pages

详情
AI中文摘要

整体视觉-动作模型代表了自动驾驶中的一种新兴范式。然而,这种架构在编码用于复杂交互的扩展时间上下文时,会产生迅速超过实时计算预算的令牌序列。虽然线性变换器和外部记忆等方法试图使上下文轻量化,但令牌压缩与架构最为兼容,因为它不需要修改主干网络。然而,现有的压缩采用基于规则的启发式方法(如时间衰减),与规划解耦,存在丢失决策关键信息的风险。我们提出COMPACT-VA,一种基于条件VQ-VAE的规划对齐工作记忆框架,将扩展上下文压缩为有界表示。压缩条件同时基于历史轨迹和学习的规划意图,其中后验编码器在训练期间从未来轨迹中提炼规划意图,而先验编码器学习从压缩观测中预测它。压缩记忆与预测的潜在变量拼接,输入策略进行端到端优化,从而在保留决策关键信息的情况下进行规划。我们在历史上下文对行为正确性(如停车、让行或前行)最关键的高信号动态场景中进行评估,并相应地设计了行为指标。在可比的令牌预算下,我们在成功率上实现了超过6%的提升(68.3%),且各项指标一致提升。消融实验验证了规划对齐耦合的有效性。闭环评估证实,与未压缩处理相比,COMPACT-VA在保持一般驾驶性能的同时实现了3.3倍的速度提升和2.7倍的内存减少。

英文摘要

Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based heuristics like temporal decay, decoupled from planning, risking loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on a conditional VQ-VAE that compresses extended context into bounded representations. Compression is conditioned on both historical trajectory and a learned planning intent that the posterior encoder distills from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, so that planning retains decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and accordingly design behavioral metrics. Under comparable token budgets, we achieve a >6% improvement (68.3%) in success rate with consistent gains across metrics. Ablations validate the effectiveness of the planning-aligned coupling. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with a 3.3× speedup and a 2.7× memory reduction over uncompressed processing.
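The "bounded representation" in a VQ-VAE comes from snapping each continuous vector to its nearest entry in a finite codebook. The sketch below shows only that generic quantization step, assuming squared Euclidean distance and a hand-written codebook. It does not show the conditioning on trajectory and planning intent that COMPACT-VA adds, and the function name is hypothetical.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`.

    Generic VQ-VAE lookup: nearest neighbour under squared Euclidean
    distance. A trained model would learn the codebook; here it is fixed.
    """
    def squared_distance(code):
        return sum((a - b) ** 2 for a, b in zip(vector, code))

    index = min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))
    return index, codebook[index]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    # The continuous feature is replaced by one of three discrete codes.
    print(quantize([0.9, 0.1], codebook))
```

Because every feature maps to one of a fixed number of codes, the memory footprint depends on the number of retained tokens and the codebook size, not on how long the raw history was.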

2606.07473 2026-06-08 cs.SD cs.AI 交叉投稿

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Whisper 幻觉检测与缓解:基于隐藏表示引导和稀疏自编码器

Georgii Aparin, Vadim Popov, Tasnima Sadekova, Assel Yermekova

发表机构 * AI Foundation and Algorithm Lab(AI基础与算法实验室) National University of Science and Technology MISIS(国立科技大学MISIS) National Research University Higher School of Economics(国立研究大学高等经济学院)

AI总结 通过分析Whisper内部表示,提出基于稀疏自编码器的引导策略,将非语音测试集上的幻觉率从72.63%降至14.11%(small模型),接近微调方法性能。

详情
AI中文摘要

Whisper是一种广泛采用的ASR模型,已知存在幻觉问题——即对非语音音频生成与输入完全无关的连贯转录。我们研究了是否可以通过Whisper的内部表示来检测和缓解幻觉。我们提取音频编码器激活,并评估两种表示空间:原始Whisper激活和稀疏自编码器(SAE)潜在变量。我们表明,两个空间都编码了线性可分的幻觉相关信息,判别能力集中在稀疏特征子集中,并向更深编码器层增强。我们提出了两种引导策略:激活空间引导和SAE潜在空间引导。基于SAE的引导将完整非语音测试集上的幻觉率从72.63%降至14.11%(Whisper small),从86.88%降至27.33%(Whisper large-v3),同时在语音数据上WER退化很小,接近基于微调方法的性能。

英文摘要

Whisper, a widely adopted ASR model, is known to suffer from hallucinations: coherent transcriptions, generated for non-speech audio, that are entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse feature subset and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. SAE-based steering reduces the hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on the full non-speech test set, with small WER degradation on speech data, approaching the performance of fine-tuning-based methods.
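Steering a hidden representation is often implemented by removing (or damping) its component along a direction associated with the unwanted behaviour. The sketch below shows that projection-based form as one plausible reading. The abstract does not specify the exact update rule, so the formula, the `alpha` parameter, and the function name are assumptions, and no Whisper API is used.

```python
def steer(hidden, direction, alpha=1.0):
    """Subtract `alpha` times the projection of `hidden` onto `direction`.

    Assumption: `direction` is a vector (for example, from a linear probe
    or a single SAE feature) linked to hallucination; alpha=1.0 removes
    that component entirely, smaller values only damp it.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(hidden)  # no direction to steer along
    coeff = alpha * sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # The component along the first axis is removed, the rest is untouched.
    print(steer([3.0, 4.0], [1.0, 0.0]))
```

Since the finding above is that hallucination-related information is linearly separable, a linear edit of this kind is a natural intervention to try before resorting to fine-tuning.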

2606.07475 2026-06-08 cs.LG cs.AI 交叉投稿

Graph Neural Network leveraging Higher-order Class Label Connectivity for Heterophilous Graphs

利用高阶类标签连通性的图神经网络用于异配图

Takuto Takahashi, Itsuki Nakayama, Takahiro Mitani, Ryosuke Kikuchi, Yuya Sasaki, Makoto Onizuka

发表机构 * The University of Osaka(大阪大学)

AI总结 针对异配图中节点分类性能受限问题,提出标签上下文分类器(LCC),通过四种游走生成标签上下文嵌入捕获高阶类标签连通性,并可与任意GNN自适应集成,实验表明优于现有方法。

详情
AI中文摘要

图神经网络(GNN)中的节点分类已广泛应用于图分析的各个领域。在同配图中,具有相同类标签的节点倾向于连接,GNN能实现高精度节点分类。然而,在异配图中,不同类标签的节点更可能连接,其性能仍然有限。特别是,当前基于图卷积网络的GNN无法捕获高阶类标签连通性,而这在真实世界的异配图中经常出现。为了解决这个问题,我们提出了一种新颖的分类器——标签上下文分类器(LCC),旨在捕获有向图中的高阶类标签连通性。LCC通过利用四种不同类型的游走生成的标签上下文嵌入来估计目标节点的类标签。此外,我们的方法允许通过自适应学习LCC和任意GNN的重要性来集成它们。实验结果表明,与LCC集成的GNN优于最先进的方法,并且标签上下文嵌入提高了异配有向图中的节点分类性能。

英文摘要

Node classification in graph neural networks (GNNs) has been widely applied in various fields of graph analysis. GNNs achieve high-accuracy node classification in homophilous graphs, where nodes with the same class label tend to be connected. However, their performance remains limited in heterophilous graphs, where nodes with different class labels are more likely to be connected. In particular, current GNNs derived from graph convolutional networks cannot capture higher-order class label connectivity, which is frequently observed in real-world heterophilous graphs. To address this issue, we propose a novel classifier, Label Context Classifier (LCC), designed to capture higher-order class label connectivity in directed graphs. LCC estimates the class label of a target node by leveraging label context embeddings that are generated through four distinct types of walks. In addition, our approach allows the integration of LCC and any GNN by adaptively learning their importance. Experimental results demonstrate that GNNs integrated with LCC outperform SOTA methods and the label context embeddings improve the node classification performance in heterophilous directed graphs.
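The homophilous/heterophilous distinction drawn above is usually quantified with the edge homophily ratio: the fraction of edges whose endpoints share a class label. The sketch below computes that standard first-order measure for a directed edge list. It is background for the abstract, not the paper's method, since the Label Context Classifier targets higher-order label connectivity that this ratio cannot capture.

```python
def edge_homophily(edges, labels):
    """Fraction of directed edges (u, v) whose endpoints share a label.

    Values near 1 indicate a homophilous graph; values near 0 indicate
    a heterophilous one.
    """
    if not edges:
        raise ValueError("edge list must not be empty")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


if __name__ == "__main__":
    labels = {0: "a", 1: "a", 2: "b", 3: "a"}
    edges = [(0, 1), (1, 2), (2, 3)]
    # Only the edge (0, 1) joins two nodes of the same class.
    print(edge_homophily(edges, labels))
```

In the toy graph the path a, a, b, a has low edge homophily, yet the label two hops away is informative, which is the kind of pattern a walk-based label context is meant to exploit.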

2606.07479 2026-06-08 cs.CL cs.AI 交叉投稿

Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

基于监督与基于演示的上下文学习在多词表达分类中的比较

Sercan Karakaş, Yusuf Şimşek

发表机构 * University of Chicago(芝加哥大学) Fırat University(费拉特大学)

AI总结 研究土耳其语多词表达分类,对比监督基线(BERTurk)与指令微调LLM在零样本、单样本和少样本提示下的表现,发现提示敏感性和演示偏差影响显著。

Comments Accepted to ACL SRW 2026

详情
AI中文摘要

土耳其语习语性轻动词结构(LVC)对多词表达处理具有挑战性,因为它们通常与完全字面义的动词-宾语组合共享相同表面形式,同时作为一个部分习语性谓词发挥作用。我们将土耳其语LVC检测定义为二元分类任务(字面义 vs. 习语义),并在手动创建的受控集(N=147)上评估,该集合包含匹配的负例:域外随机句子和域内字面义控制(NLVC),以及LVC正例。我们比较了监督土耳其语编码器基线(带有分类头的BERTurk)与来自不同家族的三个指令微调LLM,在零样本、单样本和少样本提示下的表现,并分析演示如何改变错误分布。在零样本情况下,LLM在负例上表现良好,但LVC召回率非常低。单样本提示显著提高了LVC检测,但可能引发强烈的、模型特定的偏差,导致模型过度预测或欠预测LVC。更丰富的少样本提示改善了校准,并为GPT-OSS-20B和Qwen 2.5-14B带来了稳健的整体性能。总体而言,结果突显了土耳其语元语言分类中的显著提示敏感性:监督基线仍然具有竞争力,而提示LLM在精心构建的演示下可以在LVC上匹配或超越它。

英文摘要

Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb-object combinations while functioning as a single, partially idiomatic predicate. We frame Turkish LVC detection as a binary classification task (literal meaning vs. idiomatic meaning) and evaluate on a manually created controlled set (N=147) with matched negatives: out-of-domain random sentences and in-domain literal controls (NLVC), alongside LVC positives. We compare a supervised Turkish encoder baseline (BERTurk with a classifier head) to three instruction-tuned LLMs from different families under zero-shot, one-shot, and few-shot prompting, and analyze how demonstrations shift error profiles. In zero-shot, LLMs perform well on negatives but show very low LVC recall. One-shot prompting sharply improves LVC detection but can induce strong, model-specific biases, leading models to overpredict or underpredict LVCs. A richer few-shot prompt improves calibration and yields robust overall performance for GPT-OSS-20B and Qwen 2.5-14B. Overall, the results highlight substantial prompt sensitivity in Turkish metalinguistic classification: the supervised baseline remains competitive, while prompted LLMs can match or exceed it on LVCs with carefully constructed demonstrations.

2606.07491 2026-06-08 cs.DC cs.AI cs.LG cs.SE 交叉投稿

Twelve quick tips for designing AI-driven HPC workflows

设计AI驱动的高性能计算工作流的十二条快速技巧

Jamie J. Alnasir

发表机构 * Department of Computer Science(计算机科学系) Royal Holloway University of London(伦敦皇家霍洛威大学)

AI总结 本文针对AI与HPC融合带来的新挑战,提出十二条实用技巧,涵盖容器化、作业数组、反馈循环和I/O优化,帮助设计高效、可扩展、可复现的AI驱动HPC工作流。

Comments 12 pages, 1 figure. Formatted using the bioRxiv LaTeX preprint style

详情
AI中文摘要

高性能计算(HPC)集群仍然是大规模科学计算的支柱,传统上执行确定性、线性流水线,以优化可预测性能。然而,人工智能(AI)和基础模型在科学研究中的普遍集成引入了一种根本性的新计算范式。AI驱动的工作流具有迭代、数据驱动和概率性的特征,带来了数据引力、异构资源管理和复杂工作流编排方面的独特挑战。本指南提供了十二条实用技巧,旨在帮助研究人员设计高效、可扩展和可复现的AI驱动HPC工作流。通过解决关键的系统级瓶颈——例如用于环境可移植性的容器化、作业数组的战略部署、显式反馈循环机制以及小文件的I/O优化——本文提供了一个从刚性执行流水线过渡到自适应、智能计算环境的框架。虽然这些架构原则广泛适用于分布式环境,但它们特别针对现代计算生物学中资源密集型的吞吐量需求。

英文摘要

High-performance computing (HPC) clusters remain the backbone of large-scale scientific computation, traditionally executing deterministic, linear pipelines optimised for predictable performance. However, the pervasive integration of artificial intelligence (AI) and foundation models into scientific research has introduced a fundamentally new computational paradigm. AI-driven workflows are characteristically iterative, data-driven, and probabilistic, introducing unique challenges regarding data gravity, heterogeneous resource management, and complex workflow orchestration. This guide provides twelve practical tips designed to help researchers design efficient, scalable, and reproducible AI-driven HPC workflows. By addressing critical system-level bottlenecks through measures such as containerisation for environment portability, strategic deployment of job arrays, explicit feedback loop mechanics, and I/O optimisation for small files, this article offers a framework for transitioning from rigid execution pipelines to adaptive, intelligent computational environments. While these architectural principles are broadly applicable across distributed environments, they are particularly tailored to the resource-intensive throughput demands of modern computational biology.
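As an illustration of the job-array tip, each array task can select its own slice of the input list from its task index. This is a minimal sketch, assuming a Slurm scheduler (the abstract does not name one) and an array declared with indices 0 to N-1. `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` are environment variables that Slurm sets for array jobs, and the function names are hypothetical.

```python
import os


def shard_for_task(items, task_id, num_tasks):
    """Return the round-robin slice of `items` owned by one array task."""
    if not 0 <= task_id < num_tasks:
        raise ValueError("task_id must be in range(num_tasks)")
    return items[task_id::num_tasks]


def current_shard(items, env=os.environ):
    """Pick this task's slice from Slurm's array-job environment variables.

    Assumption: the array was submitted with zero-based indices; outside
    a job array the defaults make the single task process everything.
    """
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    num_tasks = int(env.get("SLURM_ARRAY_TASK_COUNT", "1"))
    return shard_for_task(items, task_id, num_tasks)


if __name__ == "__main__":
    inputs = [f"sample_{i}.fastq" for i in range(10)]
    print(current_shard(inputs))
```

Sharding inside the program keeps the submission script to a single array directive and avoids generating one script per input file.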

2606.07500 2026-06-08 cs.LG cs.AI 交叉投稿

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

稀疏子空间到专家共享的任务无关持续学习

Fatema Siddika, Md Anwar Hossen, Tanwi Mallick, Ali Jannesari

发表机构 * Iowa State University(爱荷华州立大学) Argonne National Laboratory(阿贡国家实验室)

AI总结 提出SETA框架,通过将参数分解为任务特定专家和共享专家的稀疏子空间,结合自适应弹性锚定和路由感知正则化,解决LLM持续学习中的塑性-稳定性困境,在多个基准上优于现有方法。

Comments 19 pages. arXiv admin note: text overlap with arXiv:2601.17616

详情
AI中文摘要

大型语言模型(LLM)中的持续学习受到塑性-稳定性困境的阻碍,获取新能力往往导致先前知识的灾难性遗忘。现有方法通常统一对待参数,未能区分特定任务知识和共享能力。我们提出了用于任务无关持续学习的稀疏专家混合(SETA)框架,该框架通过将参数自适应稀疏子空间分解为任务特定专家模块来解决塑性-稳定性冲突。与标准更新(其中任务竞争相同参数)不同,SETA将知识分离为独特专家(旨在隔离任务特定模式)和共享专家(负责捕获共同特征)。这种结构通过自适应弹性锚定和路由感知正则化来维护,该正则化在权重和路由级别共同保护共享知识,并使统一的门控网络能够在推理过程中自动检索正确的专家组合。在多种领域特定基准上的大量实验表明,相对于最先进的持续学习基线,SETA实现了具有竞争力或更优的整体性能,特别是在LLaMA-2 7B和Qwen3-4B上,早期任务知识的保持和反向迁移能力尤为突出。

英文摘要

Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge. Existing methods typically treat parameters uniformly, failing to distinguish between specific task knowledge and shared capabilities. We introduce Mixture of Sparse Experts for Task Agnostic Continual Learning (SETA), a framework that resolves the plasticity-stability conflict through adaptive sparse subspace decomposition into task-specific expert modules. Unlike standard updates, where tasks compete for the same parameters, SETA separates knowledge into unique experts, designed to isolate task-specific patterns, and shared experts, responsible for capturing common features. This structure is maintained through adaptive elastic anchoring and a routing-aware regularization that jointly protect shared knowledge at both the weight and routing levels and enable a unified gating network to automatically retrieve the correct expert combination during inference. Extensive experiments across diverse domain-specific benchmarks demonstrate that SETA achieves competitive or superior overall performance relative to state-of-the-art continual learning baselines, with particularly strong retention of early-task knowledge and improved backward transfer on LLaMA-2 7B and Qwen3-4B.

2606.07512 2026-06-08 cs.CV cs.AI cs.CL 交叉投稿

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

MemDreamer: 通过分层图记忆和智能体检索机制解耦感知与推理以实现长视频理解

Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen

发表机构 * Ant Group(蚂蚁集团) Zhejiang University(浙江大学) Central South University(中南大学) HKUST(GZ)(香港科技大学(广州))

AI总结 提出MemDreamer框架,通过分层图记忆和智能体检索机制解耦感知与推理,将长视频理解转化为智能体探索过程,在四个基准上达到SOTA,推理上下文窗口仅占全量2%且准确率提升12.5点。

详情
AI中文摘要

当前的视觉-语言模型在处理数小时长的视频时面临困难,因为处理完整长度的视觉序列会导致令牌爆炸和注意力稀释。为了克服这一问题,我们引入了MemDreamer,将感知与推理解耦,将长视频理解转化为智能体探索过程。作为一个即插即用的框架,它增量式地流式传输视频以构建分层图记忆,这是一种自顶向下的三层架构,用于语义抽象,并由一个捕获时空和因果关系的基础图锚定。在推理过程中,推理模型采用智能体工具增强的检索,通过观察-推理-行动循环导航层次结构、搜索节点和遍历逻辑边。实验表明,MemDreamer在四个主流基准上取得了最先进的结果,将人类专家的差距缩小到仅3.7个百分点。它将推理上下文窗口限制在全量上下文的仅2%,同时提供了12.5个百分点的绝对准确率提升。此外,统计分析揭示了VLM在逻辑推理和长视频理解基准上的性能之间存在强正线性相关,将智能体能力扩展确立为多模态理解的新范式。

英文摘要

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.

2605.24649 2026-06-08 cs.LG cs.AI cs.LO cs.SY eess.SY 交叉投稿

On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks

关于循环多项式代理三元逻辑门网络的稳定性与可实现性

Sai Sandeep Damera, Ryan Matheu, Aniruddh G. Puranic, John S. Baras, Calin Belta

发表机构 * University of Maryland, College Park, USA(马里兰大学学院公园分校)

AI总结 提出R-DTLGN架构,通过连续多项式代理训练并硬化为离散三元逻辑电路,结合数值单调和信息单调门,实现STL监控中的稳定递归和原则性弃权,并由STL公式确定网络规模。

Comments 9 pages, 3 figures. This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

循环神经网络(RNN)可以从部分轨迹在线学习预测信号时序逻辑(STL)判定,但在安全关键系统中部署为运行时监控器需要的不只是预测准确性。标准RNN架构无法提供结构保证,确保在传感器退化时输出能优雅降级;一个丢失的输入可能悄无声息地将判定从安全翻转为不安全。我们引入了循环可微三元逻辑门网络(R-DTLGN),这是一种在Kleene三值逻辑$\{-1, 0, +1\}$上运行的循环架构,其中$0$明确表示未知。R-DTLGN通过连续多项式代理进行训练,并在推理时硬化成离散的三元逻辑电路。我们通过从三元域上的两种序导出的两个门词汇表来分析硬化电路:数值单调门确保稳定的循环动态,而信息单调门(如果存在)保证原则性弃权(未知输入从不产生错误输出)和输入确定性上的单调性(更多信息只能改善判定)。我们证明,有界STL算子所需的循环连接仅使用AND和OR门,这两个门属于两个词汇表,从而将监控任务与架构的保证联系起来。由STL公式的时序算子导出的可实现性界限直接确定了网络隐藏状态的大小,用公式驱动的规范取代了超参数搜索。我们在D4RL PointMaze导航数据上的STL规范上进行了评估,测试了预测准确性、谓词丢失下的退化以及两个标签构建流程之间的准确性与安全性权衡。据我们所知,R-DTLGN是第一个将学习到的时序预测与基于三值逻辑的形式化退化保证相结合的循环架构。

英文摘要

Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying them as runtime monitors in safety-critical systems demands more than predictive accuracy. Standard RNN architectures offer no structural guarantee that outputs degrade gracefully under sensor degradation; a dropped input can silently flip a verdict from safe to unsafe. We introduce the Recurrent Differentiable Ternary Logic Gate Network (R-DTLGN), a recurrent architecture that operates over Kleene's three-valued logic $\{-1, 0, +1\}$, where $0$ explicitly represents unknown. The R-DTLGN trains through continuous polynomial surrogates and hardens to a discrete ternary logic circuit at inference. We analyze the hardened circuit through two gate vocabularies derived from two orderings on the ternary domain: numerically monotone gates ensure stable recurrent dynamics, while information-monotone gates, when present, guarantee principled abstention (unknown inputs never produce wrong outputs) and monotonicity in input certainty (more information can only improve the verdict). We show that the recurrent connections required by bounded STL operators use exclusively AND and OR, which belong to both vocabularies, linking the monitoring task to the architecture's guarantees. A realizability bound derived from the STL formula's temporal operators directly sizes the network's hidden state, replacing hyperparameter search with a formula-driven specification. We evaluate on STL specifications over D4RL PointMaze navigation data, testing prediction accuracy, degradation under predicate dropout, and the accuracy-versus-safety tradeoff between two label construction pipelines. The R-DTLGN is, to our knowledge, the first recurrent architecture that couples learned temporal prediction with formal degradation guarantees rooted in three-valued logic.
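The discrete logic that the hardened circuit operates over can be written down directly. In Kleene's strong three-valued logic on {-1, 0, +1}, AND is the minimum, OR is the maximum, and NOT is negation. The sketch below implements these gates and two bounded temporal operators built only from AND and OR, in line with the claim above. It covers the target logic only: the continuous polynomial surrogates and the training procedure are not shown, and the function names are illustrative.

```python
def k_not(a):
    """Kleene negation: swaps true (+1) and false (-1), keeps unknown (0)."""
    return -a


def k_and(a, b):
    """Kleene conjunction: minimum under the order -1 < 0 < +1."""
    return min(a, b)


def k_or(a, b):
    """Kleene disjunction: maximum under the order -1 < 0 < +1."""
    return max(a, b)


def eventually_within(window):
    """Bounded 'eventually': a running OR over the window, updated recurrently."""
    verdict = -1
    for value in window:
        verdict = k_or(verdict, value)
    return verdict


def always_within(window):
    """Bounded 'always': a running AND over the window, updated recurrently."""
    verdict = 1
    for value in window:
        verdict = k_and(verdict, value)
    return verdict


if __name__ == "__main__":
    # A dropped sample (0) yields "unknown" rather than a wrong verdict.
    print(eventually_within([-1, 0, -1]), always_within([1, 0, 1]))
```

Principled abstention shows up in the example: a dropped sample makes the verdict unknown, and it becomes definite only when the known samples already determine it, as in `always_within([1, 0, -1])`.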

2504.00613 2026-06-08 cs.AI cs.IT cs.NE math.IT 版本更新

LLM-Guided Search for Deletion-Correcting Codes

LLM引导的删除纠正码搜索

Franziska Weindel, Reinhard Heckel

发表机构 * School of Computation, Information and Technology, Technical University of Munich(慕尼黑工业大学计算、信息与技术学院)

AI总结 针对删除纠正码最大尺寸的开放问题,采用LLM引导的进化搜索FunSearch,发现构建短码长删除纠正码的函数,单删除场景证明达到最优的Varshamov-Tenengolts码,多删除和四进制编辑码改进现有构造但缺乏理论洞见。

详情
AI中文摘要

寻找最大尺寸的删除纠正码已经是一个开放问题超过70年,即使对于单个删除也是如此。我们改编了FunSearch,一种大型语言模型(LLM)引导的进化搜索,以发现构建短码长删除纠正码的函数。对于单个删除,我们的搜索发现了一个函数,我们证明该函数构建了推测最优的Varshamov-Tenengolts码。对于多个删除和四进制编辑码,发现的函数改进了先前的显式、基于搜索和神经网络的构造,但仍然是经验启发式,没有新的理论洞见。我们研究了LLM引导进化搜索的设计选择,并发现对于我们的问题,计算资源更好地分配用于采样更多函数,而不是每个函数更长的推理轨迹,并且将自然语言描述与代码共同进化会损害搜索质量。我们提出在进化过程中对逻辑相同的函数进行去重,这对搜索多样性至关重要。我们的结果展示了LLM引导进化搜索在信息论和编码设计中的潜力,并代表了此类方法在构建纠错码中的首次应用。然而,在我们目前的公式中,评估一个函数的复杂度随码长指数增长,限制了该方法仅适用于短码。

英文摘要

Finding deletion-correcting codes of maximum size has been an open problem for over 70 years, even for a single deletion. We adapt FunSearch, a large language model (LLM)-guided evolutionary search, to discover functions that construct deletion-correcting codes at short code lengths. For a single deletion, our search finds a function that we prove constructs the conjectured-optimal Varshamov-Tenengolts code. For multiple deletions and quaternary edit codes, the discovered functions improve on prior explicit, search-based, and neural constructions but remain empirical heuristics without new theoretical insights. We study design choices for LLM-guided evolutionary search and find that, for our problem, compute is better allocated to sampling more functions than to longer reasoning traces per function, and that co-evolving natural language descriptions with code hurts search quality. We propose deduplicating logically identical functions during evolution, which we find critical for search diversity. Our results demonstrate the potential of LLM-guided evolutionary search for information theory and code design and represent the first application of such methods for constructing error-correcting codes. However, in our current formulation, evaluating a function scales exponentially with code length, limiting the approach to short codes.
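For readers unfamiliar with the Varshamov-Tenengolts (VT) construction mentioned above, the binary code VT_a(n) consists of all length-n words x with sum(i * x_i) congruent to a modulo n + 1, with positions indexed from 1. The sketch below builds the code by brute force and checks the single-deletion property by confirming that the deletion balls of distinct codewords are disjoint. It illustrates the classical construction, not the function discovered by the search, and the helper names are hypothetical.

```python
from itertools import product


def vt_code(n, a=0):
    """All binary words x of length n with sum(i * x_i) = a mod (n + 1)."""
    return [
        word
        for word in product((0, 1), repeat=n)
        if sum(i * bit for i, bit in enumerate(word, start=1)) % (n + 1) == a
    ]


def deletion_ball(word):
    """Every word obtainable from `word` by deleting exactly one symbol."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def corrects_single_deletion(code):
    """True if no two codewords share a single-deletion descendant."""
    seen = set()
    for word in code:
        ball = deletion_ball(word)
        if seen & ball:
            return False
        seen |= ball
    return True


if __name__ == "__main__":
    code = vt_code(4)
    print(len(code), corrects_single_deletion(code))
```

The enumeration touches all 2^n words, which mirrors the limitation stated at the end of the abstract: exhaustive evaluation grows exponentially with code length.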

2504.03635 2026-06-08 cs.AI cs.CL 版本更新

Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models

寻找隐式推理的最小参数预算:一种基于数据复杂度的语言模型缩放定律

Xinyi Wang, Shawn Tan, Shenbo Xu, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Washington(华盛顿大学) University of Toronto(多伦多大学) University of Tokyo(东京大学)

AI总结 本文通过控制合成环境中的预训练实验,发现语言模型隐式推理所需的最小参数预算与图搜索熵之间存在缩放定律,并确定了每参数最多可处理约0.008比特信息的容量上限。

Comments Accepted to ICML 2026

详情
AI中文摘要

推理是语言模型(LM)的核心能力,然而在预训练期间需要多少模型容量来支持推理仍不清楚。在这项工作中,我们研究了隐式推理所需的最小参数预算,隐式推理定义为无需显式思维链监督即可从所学知识中推断出新事实的能力。为了隔离这一现象,我们在一个受控的合成环境中从头开始预训练LM,该环境模拟了真实世界知识图谱的结构和分布,并通过多跳推理评估它们补全缺失边的能力。从理论和实证两个角度,我们确定了一个缩放定律,将该最优参数预算与图搜索熵度量联系起来。在广泛的模型大小、训练步数和图复杂度范围内,我们表明一个最优大小的语言模型最多可以可靠地处理每参数约0.008比特的信息。我们的结果刻画了预训练期间隐式推理所需的最小充分容量。我们的发现为匹配模型大小与数据复杂度提供了原则性指导,并为大型语言模型中推理的缩放行为提供了新见解。

英文摘要

Reasoning is a core capability of language models (LMs), yet it remains unclear how much model capacity is necessary to support reasoning during pretraining. In this work, we study the minimal parameter budget required for implicit reasoning, defined as the ability to infer new facts from learned knowledge without explicit chain-of-thought supervision. To isolate this phenomenon, we pretrain LMs from scratch in a controlled synthetic environment that mimics the structure and distribution of real-world knowledge graphs, and evaluate their ability to complete missing edges via multi-hop inference. From both a theoretical and an empirical perspective, we identify a scaling law linking this optimal parameter budget to a graph search entropy measure. Across a wide range of model sizes, training steps, and graph complexities, we show that an optimally sized language model can reliably reason over approximately 0.008 bits of information per parameter at most. Our results characterize the minimal sufficient capacity for implicit reasoning during pretraining. Our findings provide principled guidance for matching model size to data complexity and offer new insights into the scaling behavior of reasoning in large language models.
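As a rough worked example of the capacity figure quoted above, the sketch below reads "approximately 0.008 bits of information per parameter" as a simple linear bound. That linear reading, and the helper names `min_parameter_budget` and `reasoning_capacity_bits`, are assumptions made for illustration; the abstract does not give the exact form of the scaling law.

```python
BITS_PER_PARAMETER = 0.008  # approximate figure quoted in the abstract


def min_parameter_budget(entropy_bits, bits_per_parameter=BITS_PER_PARAMETER):
    """Hypothetical helper: parameter count needed so that
    parameters * bits_per_parameter covers `entropy_bits`."""
    if entropy_bits < 0 or bits_per_parameter <= 0:
        raise ValueError("entropy must be >= 0 and capacity must be > 0")
    return entropy_bits / bits_per_parameter


def reasoning_capacity_bits(num_parameters, bits_per_parameter=BITS_PER_PARAMETER):
    """Hypothetical helper: bits of graph search entropy a model of the
    given size could handle under the same linear assumption."""
    if num_parameters < 0 or bits_per_parameter <= 0:
        raise ValueError("parameters must be >= 0 and capacity must be > 0")
    return num_parameters * bits_per_parameter


if __name__ == "__main__":
    # A graph with about one million bits of search entropy would need
    # on the order of 1.25e8 parameters under this reading.
    print(min_parameter_budget(1_000_000))
    print(reasoning_capacity_bits(1_000_000_000))
```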

2505.05232 2026-06-08 cs.AI 版本更新

ChemQuests: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv papers

ChemQuests: 从ChemRxiv论文中提取的精选化学问答数据库

Mahmoud Amiri, Thomas Bocklitz

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出ChemQuests数据集,包含从155篇ChemRxiv论文中提取的952个高质量问答对,覆盖17个化学子领域,用于化学NLP研究。

详情
AI中文摘要

化学文献的快速扩展给研究人员高效获取领域特定知识带来了重大挑战。为了支持化学领域自然语言处理(NLP)的进展,我们提出了ChemQuests,这是一个精选数据集,包含来自化学17个子领域的155篇ChemRxiv论文的952个高质量问答(QA)对。每个QA对都明确链接到其源文本片段,以确保可追溯性和上下文准确性。ChemQuests使用自动化流水线构建,该流水线结合了光学字符识别(OCR)、使用GPT-4o的QA生成以及模糊搜索验证。该数据集强调概念性、机理、应用以及合成或实验性问题,支持基于检索的QA系统、搜索引擎开发以及领域自适应大语言模型的微调。我们分析了数据集的结构、覆盖范围和局限性,并概述了扩展和专家验证的未来方向。ChemQuests为化学NLP研究、教育和工具开发提供了基础资源。

英文摘要

The rapid expansion of chemistry literature poses significant challenges for researchers seeking to efficiently access domain-specific knowledge. To support advancements in chemistry-focused natural language processing (NLP), we present ChemQuests, a curated dataset of 952 high-quality question-answer (QA) pairs derived from 155 ChemRxiv papers across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy. ChemQuests was constructed using an automated pipeline that combines optical character recognition (OCR), QA generation using GPT-4o, and fuzzy-search verification. The dataset emphasizes conceptual, mechanistic, applied, and synthetic or experimental questions, enabling applications in retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large language models. We analyze the dataset's structure, coverage, and limitations, and outline future directions for expansion and expert validation. ChemQuests provides a foundational resource for chemistry NLP research, education, and tool development.

2505.14289 2026-06-08 cs.AI 版本更新

EVA: Evolving Semantic Adversaries for Red-Teaming GUI Agents Against Environmental Injection Attacks

EVA: 针对环境注入攻击的红队GUI智能体的演化语义对抗方法

Yijie Lu, Manman Zhao, Tianjie Ju, Zihe Yan, Xinbei Ma, Yuan Guo, Daizong Ding, Gongshen Liu, Zhuosheng Zhang

发表机构 * School of Cyber Science and Engineering, Wuhan University(武汉大学计算机科学与工程学院) School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) Independent Researcher(独立研究者)

AI总结 提出EVA框架,通过演化语义对抗载荷攻击多模态大语言模型驱动的GUI智能体,揭示语义欺骗是攻击成功的关键,实现85%攻击成功率且快速收敛。

Comments Accepted by

详情
AI中文摘要

由多模态大语言模型驱动的图形用户界面智能体部署日益广泛,但易受环境注入攻击。然而,当前的红队方法受限于高昂的计算成本和有限的适应性。一个基本问题仍未解决:攻击成功的瓶颈在于视觉感知还是语义理解?通过控制实验,我们观察到语义欺骗而非视觉外观是攻击成功的主要决定因素。基于这一见解,我们引入了EVA,一个仅在语义维度上演化对抗载荷的演化框架。EVA采用发现-部署框架来挖掘语言脆弱性模式并将其提炼为可泛化的规则。在五个代表性受害智能体上的实验结果表明,EVA实现了高达85%的攻击成功率,在仅1.18到1.71次迭代内将良性种子演化为成功攻击。这种快速收敛揭示了模型潜在表示中密集的语义攻击空间,揭示了一个关键的对齐悖论:对齐训练强化的指令遵循能力使智能体天生易受权威性、语义欺骗性环境线索的影响。

英文摘要

Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) are increasingly deployed yet vulnerable to Environmental Injection Attacks (EIAs). However, current red-teaming methods are hindered by prohibitive computational costs and limited adaptability. A fundamental question remains unaddressed: does the bottleneck of attack success lie in visual perception or semantic understanding? Through controlled experiments, we observe that semantic deception, rather than visual appearance, serves as the primary determinant of attack success. Based on this insight, we introduce EVA, an evolutionary framework that evolves adversarial payloads exclusively within the semantic dimension. EVA employs a discovery-deployment framework to mine linguistic vulnerability patterns and distill them into generalizable rules. Experimental results across five representative victim agents demonstrate that EVA achieves up to 85% attack success rate, evolving benign seeds into successful attacks within only 1.18 to 1.71 iterations. This rapid convergence uncovers a dense semantic attack space in the model's latent representation, unveiling a critical alignment paradox: the instruction-following capabilities reinforced by alignment training render agents inherently susceptible to authoritative, semantically deceptive environmental cues.

2505.15998 2026-06-08 cs.AI 版本更新

Exploring Flow-Lenia Universes with a Curiosity-driven AI Scientist: Discovering Diverse Ecosystem Dynamics

探索Flow-Lenia宇宙:基于好奇心驱动的AI科学家发现多样生态系统动力学

Thomas Michel, Marko Cvjetko, Gautier Hamon, Pierre-Yves Oudeyer, Clément Moulin-Frier

发表机构 * Univ. Lille, Inria, CNRS, Centrale Lille, CRIStAL, France(里尔大学、法国国家科学研究中心、中央里尔学院、CRIStAL实验室、法国) Inria Center at the University of Bordeaux, France(波尔多大学的Inria研究中心、法国) Inria, INSA Lyon, CITI, UR3720, 69621 Villeurbanne, France(Inria、里昂INSA、CITI、UR3720、法国)

AI总结 提出好奇心驱动的AI科学家方法,通过内在动机目标探索过程(IMGEP)在Flow-Lenia中发现系统级动力学,揭示类似生物现象的自组织行为,并展示大规模多样性搜索作为后续实验设计的框架。

Comments Extended version of the paper first published at ALife 2025. Project webpage: https://developmentalsystems.org/Exploring-Flow-Lenia-Universes/ 24 pages, 16 figures

详情
Journal ref
Proceedings of the Artificial Life Conference 2025, pp. 633-643
AI中文摘要

我们提出了一种好奇心驱动的AI科学家方法,用于发现Flow-Lenia中的系统级动力学。Flow-Lenia是一种具有质量守恒和参数局部化的连续元胞自动机(CA)。基于先前使用Lenia中的多样性搜索来发现个体自组织模式的工作,我们将内在动机目标探索过程(IMGEP)适应于交互模式的大型环境,使用模拟范围的度量,如进化活动、压缩比和多尺度物质分布。我们在两个探索实验中应用IMGEP:一个针对生态系统级动力学,另一个针对通过障碍物环境的物质运动。在这两个实验中,IMGEP比随机搜索照亮了更多的度量空间,并揭示了定性上类似于许多生物现象的自组织行为。利用生成的档案,我们随后在六个空间尺度和七个时间跨度上进行了缩放研究,揭示了在基础尺度上没有类似物的宏观尺度组织,并表征了目标空间度量在尺度上的行为。这说明了我们方法的一个优势:相对廉价的大规模多样性搜索可以作为设计后续更昂贵实验的原则性框架,通过交互式探索工具支持实验设计、检查和重新设计的迭代循环,使科学家保持在循环中。尽管在Flow-Lenia上进行了演示,但这种方法可能适用于其他可参数化的复杂系统,其中研究自下而上的集体行为是有意义的。

英文摘要

We present a curiosity-driven AI scientist method for discovering system-level dynamics in Flow-Lenia, a continuous cellular automaton (CA) with mass conservation and parameter localization. Building on prior work that uses diversity search in Lenia to find individual self-organized patterns, we adapt Intrinsically Motivated Goal Exploration Processes (IMGEPs) to large environments of interacting patterns, using simulation-wide metrics such as evolutionary activity, compression ratio, and multi-scale matter distribution. We apply IMGEP in two exploration experiments: one targeting ecosystem-level dynamics, the other matter movement through obstacle-laden environments. In both, IMGEP illuminates significantly more of the metric space than random search and reveals self-organized behaviors qualitatively resembling many biological phenomena. Leveraging the resulting archive, we then run a scaling study across six spatial scales and seven time horizons, uncovering macro-scale organization with no analogue at the base scale and characterizing how goal-space metrics behave at scale. This illustrates a strength of our approach: a relatively cheap large-scale diversity search can act as a principled scaffold for designing subsequent, more expensive experiments, enabling an iterative loop of experiment design, inspection, and redesign, supported by an interactive exploration tool that keeps scientists in the loop. Though demonstrated with Flow-Lenia, this approach potentially applies to other parameterizable complex systems where studying bottom-up collective behavior is of interest.

2508.19239 2026-06-08 cs.AI 版本更新

Model Context Protocols in Adaptive Transport Systems: A Survey

自适应交通系统中的模型上下文协议:综述

Gaurab Chhetri, Shriyank Somvanshi, Md Monzurul Islam, Shamyo Brotee, Mahmuda Sultana Mimi, Dipti Koirala, Biplov Pandey, Subasish Das

发表机构 * Texas State University San Marcos(德克萨斯州立大学圣马科斯分校)

AI总结 本文首次系统调查模型上下文协议(MCP)作为统一范式,提出五类分类法,揭示传统协议孤立适应的局限,并指出MCP的客户端-服务器和JSON-RPC结构支持语义互操作性,为下一代自适应智能交通基础设施奠定基础。

详情
AI中文摘要

互联设备、自主系统和AI应用的快速扩展导致自适应交通系统严重碎片化,各种协议和上下文源仍然孤立。本综述首次系统调查模型上下文协议(MCP)作为统一范式,强调其桥接协议级适应与上下文感知决策的能力。通过分析已有文献,我们发现现有工作已隐含地趋近于MCP类架构,标志着从碎片化解决方案向标准化集成框架的自然演进。我们提出一个五类分类法,涵盖自适应机制、上下文感知框架、统一模型、集成策略和MCP使能架构。我们的发现揭示了三个关键见解:传统传输协议已达到孤立适应的极限,MCP的客户端-服务器和JSON-RPC结构支持语义互操作性,以及AI驱动的传输需要特别适合MCP的集成范式。最后,我们提出一个研究路线图,将MCP定位为下一代自适应、上下文感知和智能交通基础设施的基础。

英文摘要

The rapid expansion of interconnected devices, autonomous systems, and AI applications has created severe fragmentation in adaptive transport systems, where diverse protocols and context sources remain isolated. This survey provides the first systematic investigation of the Model Context Protocol (MCP) as a unifying paradigm, highlighting its ability to bridge protocol-level adaptation with context-aware decision making. Analyzing established literature, we show that existing efforts have implicitly converged toward MCP-like architectures, signaling a natural evolution from fragmented solutions to standardized integration frameworks. We propose a five-category taxonomy covering adaptive mechanisms, context-aware frameworks, unification models, integration strategies, and MCP-enabled architectures. Our findings reveal three key insights: traditional transport protocols have reached the limits of isolated adaptation, MCP's client-server and JSON-RPC structure enables semantic interoperability, and AI-driven transport demands integration paradigms uniquely suited to MCP. Finally, we present a research roadmap positioning MCP as a foundation for next-generation adaptive, context-aware, and intelligent transport infrastructures.

2509.25522 2026-06-08 cs.AI 版本更新

Understanding Generative Recommendation with Semantic IDs from a Model-scaling View

从模型扩展视角理解基于语义ID的生成式推荐

Jingzhe Liu, Liam Collins, Jiliang Tang, Tong Zhao, Neil Shah, Clark Mingxuan Ju

发表机构 * Michigan State University(密歇根州立大学) Snap Inc.(Snap公司)

AI总结 揭示基于语义ID的生成式推荐在模型扩展时存在性能瓶颈,发现直接使用大语言模型作为推荐器具有更好的扩展性,性能提升可达20%。

Comments Accepted by KDD 2026

详情
AI中文摘要

近期生成模型的进展催生了一种有前景的推荐系统范式,称为生成式推荐(GR),它试图统一丰富的物品语义和协同过滤信号。一种流行的现代方法是使用语义ID(SIDs),即从模态编码器(如大型语言或视觉模型)的嵌入中量化得到的离散编码,在自回归用户交互序列建模设置中表示物品(以下简称基于SID的GR)。虽然其他领域的生成模型展现出完善的缩放定律,我们的工作揭示了基于SID的GR在模型扩展时存在显著瓶颈。特别是,随着我们扩大每个组件(模态编码器、量化分词器和推荐系统本身),基于SID的GR的性能迅速饱和。在这项工作中,我们确定SID编码物品语义信息的有限能力是根本瓶颈之一。基于这一观察,作为获得具有更好缩放行为的GR模型的初步努力,我们重新审视了另一种直接使用大型语言模型(LLMs)作为推荐器的GR范式(以下简称LLM-as-RS)。我们的实验表明,LLM-as-RS范式具有优越的模型缩放属性,并通过缩放实现了比基于SID的GR最佳可达性能高达20%的提升。我们还挑战了普遍认为LLMs难以捕捉协同过滤信息的观点,表明它们建模用户-物品交互的能力随着LLMs的扩展而提升。我们对基于SID的GR和LLMs在44M到14B参数模型规模上的分析强调了基于SID的GR的内在缩放限制,并将LLM-as-RS定位为通往GR基础模型的有希望路径。

英文摘要

Recent advancements in generative models have allowed the emergence of a promising paradigm for recommender systems (RS), known as Generative Recommendation (GR), which tries to unify rich item semantics and collaborative filtering signals. One popular modern approach is to use semantic IDs (SIDs), which are discrete codes quantized from the embeddings of modality encoders (e.g., large language or vision models), to represent items in an autoregressive user interaction sequence modeling setup (henceforth, SID-based GR). While generative models in other domains exhibit well-established scaling laws, our work reveals that SID-based GR shows significant bottlenecks while scaling up the model. In particular, the performance of SID-based GR quickly saturates as we enlarge each component: the modality encoder, the quantization tokenizer, and the RS itself. In this work, we identify the limited capacity of SIDs to encode item semantic information as one of the fundamental bottlenecks. Motivated by this observation, as an initial effort to obtain GR models with better scaling behaviors, we revisit another GR paradigm that directly uses large language models (LLMs) as recommenders (henceforth, LLM-as-RS). Our experiments show that the LLM-as-RS paradigm has superior model scaling properties and achieves up to 20 percent improvement over the best achievable performance of SID-based GR through scaling. We also challenge the prevailing belief that LLMs struggle to capture collaborative filtering information, showing that their ability to model user-item interactions improves as LLMs scale up. Our analyses on both SID-based GR and LLMs across model sizes from 44M to 14B parameters underscore the intrinsic scaling limits of SID-based GR and position LLM-as-RS as a promising path toward foundation models for GR.

2510.01427 2026-06-08 cs.AI 版本更新

Small Language Model Agents Enable Efficient and High-Quality Knowledge Mining

小型语言模型代理实现高效高质量的知识挖掘

Sipeng Zhang, Shuhuai Lin, Xinpeng Wei, Yihang Chen, Pin Qian, Su Wang, Huan Xu

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Carnegie Mellon University(卡内基梅隆大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出Falconer框架,结合大语言模型的代理推理与轻量级代理模型,通过规划与标注实现可扩展的知识挖掘,在保持指令遵循精度的同时降低90%推理成本并加速20倍以上。

Comments Code available: https://github.com/LongfeiYun17/falconer

详情
AI中文摘要

深度研究(Deep Research)的核心是知识挖掘,即根据用户指令从海量非结构化文本中提取结构化信息的任务。大语言模型(LLMs)擅长解释此类指令,但大规模部署成本过高;而传统的分类器和提取器流水线虽然高效,但脆弱且无法泛化到新任务。我们提出Falconer,一种协作框架,将LLM的代理推理与轻量级代理模型相结合,用于可扩展的知识挖掘。在Falconer中,LLM作为规划者,将用户指令分解为可执行的流水线,并作为标注者,生成监督信号以训练小型代理。该框架将分类和提取统一为两个原子操作:get label和get span,使得单个指令遵循模型能够替代多个任务特定组件。为了评估Falconer孵化的代理模型与人类和大模型提供的标注之间的一致性,我们构建了涵盖规划和端到端执行的新基准。实验表明,Falconer在指令遵循精度上接近最先进的LLM,同时将推理成本降低高达90%,并将大规模知识挖掘加速20倍以上,为深度研究提供了高效且可扩展的基础。

英文摘要

At the core of Deep Research is knowledge mining, the task of extracting structured information from massive unstructured text in response to user instructions. Large language models (LLMs) excel at interpreting such instructions but are prohibitively expensive to deploy at scale, while traditional pipelines of classifiers and extractors remain efficient yet brittle and unable to generalize to new tasks. We introduce Falconer, a collaborative framework that combines the agentic reasoning of LLMs with lightweight proxy models for scalable knowledge mining. In Falconer, LLMs act as planners, decomposing user instructions into executable pipelines, and as annotators, generating supervision to train small proxies. The framework unifies classification and extraction into two atomic operations, get label and get span, enabling a single instruction-following model to replace multiple task-specific components. To evaluate the consistency between proxy models incubated by Falconer and annotations provided by humans and large models, we construct new benchmarks covering both planning and end-to-end execution. Experiments show that Falconer closely matches state-of-the-art LLMs in instruction-following accuracy while reducing inference cost by up to 90% and accelerating large-scale knowledge mining by more than 20x, offering an efficient and scalable foundation for Deep Research.

2510.05363 2026-06-08 cs.AI 版本更新

MHA-RAG: Improving Efficiency, Accuracy, and Consistency by Encoding Exemplars as Soft Prompts

MHA-RAG:通过将示例编码为软提示来提高效率、准确性和一致性

Abhinav Jain, Xinyu Yao, Thomas Reps, Christopher Jermaine

发表机构 * Department of Computer Science, Rice University(莱斯大学计算机科学系) Department of Computer Science, University of Wisconsin–Madison(威斯康星大学麦迪逊分校计算机科学系)

AI总结 提出MHA-RAG框架,将领域示例编码为软提示,通过多头注意力机制控制生成,在多个问答基准上相比标准RAG提升20点性能,同时降低10倍推理成本。

Comments 17 pages, 5 figures

详情
AI中文摘要

在有限训练数据下将基础模型适应到新领域具有挑战性且计算成本高。虽然先前工作证明了使用领域特定示例作为上下文演示的有效性,但我们探究了将示例纯粹表示为文本是否是最有效、最稳定和最高效的方法。我们探索了一种替代方案:使用示例顺序不变模型架构将示例表示为软提示。为此,我们引入了多头注意力检索增强生成(MHA-RAG),该框架以注意力头数作为简单超参数,用于控制不同任务下的软提示生成。在多个问答基准和模型规模上,MHA-RAG相比标准RAG实现了20个百分点的性能提升,同时将推理成本降低了10倍GFLOPs——在示例顺序不变的情况下,实现了更高的准确性和更高的效率。

英文摘要

Adapting Foundation Models to new domains with limited training data is challenging and computationally expensive. While prior work has demonstrated the effectiveness of using domain-specific exemplars as in-context demonstrations, we investigate whether representing exemplars purely as text is the most efficient, effective, and stable approach. We explore an alternative: representing exemplars as soft prompts with an exemplar-order-invariant model architecture. To this end, we introduce Multi-Head Attention Retrieval-Augmented Generation (MHA-RAG), a framework with the number of attention heads serving as a simple hyperparameter to control soft-prompt generation across different tasks. Across multiple question-answering benchmarks and model scales, MHA-RAG achieves a 20-point performance gain over standard RAG while cutting inference cost by a factor of 10 in GFLOPs, delivering both higher accuracy and greater efficiency, invariant to exemplar order.

2512.23292 2026-06-08 cs.AI cs.LG 版本更新

Agentic Physical AI toward a Domain-Specific Foundation Model for Energy Systems: A Case Study on Nuclear Reactor Control

面向能源系统领域特定基础模型的智能体式物理人工智能:以核反应堆控制为例

Yoon Pyo Lee, Samrendra Roy, Kazuma Kobayashi, Sajedul Talukder, Diab Abueidda, Seid Koric, Souvik Chakraborty, Syed Bahauddin Alam

发表机构 * The Grainger College of Engineering, Nuclear, Plasma & Radiological Engineering, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校格兰杰工程学院核、等离子体与放射工程系) Department of Nuclear Engineering, Hanyang University(汉阳大学核工程系) University of Texas - El Paso(德克萨斯大学埃尔帕索分校) National Center for Supercomputing Applications(国家超级计算应用中心) Department of Applied Mechanics, Indian Institute of Technology Delhi(印度理工学院德里分校应用力学系) Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi(印度理工学院德里分校Yardi人工智能学院)

AI总结 本研究提出通过紧凑语言模型作为智能体式物理人工智能,利用基于物理模拟器验证的策略优化替代感知推理,在核反应堆控制任务中实现领域特定基础模型,并展示了规模扩展带来的可靠性提升和策略集中化行为。

详情
AI中文摘要

当前物理系统人工智能的主流范式,即将通用基础模型扩展至通用多模态推理,在控制接口处面临障碍。前沿视觉-语言模型在基本定量物理任务上仅达到50-53%的准确率,表现为近似猜测者,在保持语义合理性的同时违反物理约束。安全关键控制要求对执行动作的结果空间保证,而非参数空间模仿。本文提出了一条通向领域特定基础模型的路径,通过紧凑语言模型作为智能体式物理人工智能运行:基于物理模拟器验证的策略优化,而非感知推理。我们在从10^3到10^5个样本规模扩展的合成核反应堆场景上训练了一个360M参数的模型。规模扩展在标称模拟条件下产生了强烈的、依赖于工况的可靠性提升,方差缩小约500倍,并在采样分布上消除了>10%的终端功率偏移。尽管模型均衡地接触了四种执行机构族,但它在运行时将95%的执行集中在单一棒组策略上,无需强化学习或奖励工程。表征可在不同模拟器间迁移,无需改变架构。我们将该系统定位为验证、监控和纵深防御架构中的候选决策组件,而非独立的安全解决方案:所展示的行为仅涉及模拟中单步任务的闭环可靠性,尚未解决非标称运行、传感器故障或不确定性量化问题。

英文摘要

The prevailing paradigm in AI for physical systems, scaling general-purpose foundation models toward universal multimodal reasoning, confronts a barrier at the control interface. Frontier vision-language models achieve only 50-53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility while violating physical constraints. Safety-critical control demands outcome-space guarantees over executed actions, not parameter-space imitation. Here we present a pathway toward domain-specific foundation models through compact language models operating as Agentic Physical AI: policy optimization driven by physics-based simulator validation rather than perceptual inference. We train a 360M-parameter model on synthetic nuclear reactor scenarios scaled from 10^3 to 10^5 examples. Scaling produces strong, regime-dependent reliability gains under nominal simulated conditions, with variance collapse of approximately 500x and elimination of >10% terminal-power excursions on the sampled distribution. Despite balanced exposure to four actuation families, the model concentrates 95% of runtime execution on a single-bank strategy, without reinforcement learning or reward engineering. Representations transfer across simulators without architectural change. We position the system as a candidate decision component within a verification, monitoring, and defense-in-depth architecture, not as a stand-alone safety solution: the demonstrated behavior speaks to closed-loop reliability on a single-step task in simulation and does not yet address off-nominal operation, sensor faults, or uncertainty quantification.

2601.05675 2026-06-08 cs.AI 版本更新

CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space

CHDP:参数化动作空间中强化学习的协同混合扩散策略

Bingyi Liu, Jinbo He, Haiyong Shi, Enshu Wang, Weizhen Han, Jingxiang Hao, Peixi Wang, Zhuangzhuang Zhang

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 针对混合动作空间中的策略表达力不足和高维扩展性差问题,提出协同混合扩散策略框架,通过离散和连续扩散策略的协作与顺序更新,结合码本嵌入和Q函数引导,在基准测试中成功率提升高达19.3%。

Comments Accepted by AAAI 2026

详情
AI中文摘要

混合动作空间结合了离散选择和连续参数,在机器人控制和游戏AI等领域普遍存在。然而,高效建模和优化离散-连续混合动作空间仍然是一个基本挑战,主要由于策略表达力有限和高维设置下的可扩展性差。为应对这一挑战,我们将混合动作空间问题视为完全合作博弈,并提出协同混合扩散策略(CHDP)框架来解决。CHDP采用两个协作智能体,分别利用离散和连续扩散策略。连续策略以离散动作的表示为条件,显式建模它们之间的依赖关系。这种协作设计使扩散策略能够利用其表达力捕获各自动作空间中的复杂分布。为缓解协作设置中同时更新策略导致的更新冲突,我们采用顺序更新方案以促进协同适应。此外,为提高在高维离散动作空间中学习时的可扩展性,我们构建了一个将动作空间嵌入低维潜在空间的码本。该映射使离散策略能够在紧凑、结构化的空间中学习。最后,我们设计了一种基于Q函数的引导机制,在训练过程中对齐码本的嵌入与离散策略的表示。在具有挑战性的混合动作基准测试中,CHDP的成功率比最先进方法高出高达19.3%。

英文摘要

Hybrid action space, which combines discrete choices and continuous parameters, is prevalent in domains such as robot control and game AI. However, efficiently modeling and optimizing hybrid discrete-continuous action space remains a fundamental challenge, mainly due to limited policy expressiveness and poor scalability in high-dimensional settings. To address this challenge, we view the hybrid action space problem as a fully cooperative game and propose a Cooperative Hybrid Diffusion Policies (CHDP) framework to solve it. CHDP employs two cooperative agents that leverage a discrete and a continuous diffusion policy, respectively. The continuous policy is conditioned on the discrete action's representation, explicitly modeling the dependency between them. This cooperative design allows the diffusion policies to leverage their expressiveness to capture complex distributions in their respective action spaces. To mitigate the update conflicts arising from simultaneous policy updates in this cooperative setting, we employ a sequential update scheme that fosters co-adaptation. Moreover, to improve scalability when learning in high-dimensional discrete action space, we construct a codebook that embeds the action space into a low-dimensional latent space. This mapping enables the discrete policy to learn in a compact, structured space. Finally, we design a Q-function-based guidance mechanism to align the codebook's embeddings with the discrete policy's representation during training. On challenging hybrid action benchmarks, CHDP outperforms the state-of-the-art method by up to 19.3% in success rate.

2601.23204 2026-06-08 cs.AI 版本更新

TSAQA: Time Series Analysis Question And Answering Benchmark

TSAQA:时间序列分析问答基准

Baoyu Jing, Sanhorn Chen, Lecheng Zheng, Boyu Liu, Zihao Li, Jiaru Zou, Tianxin Wei, Zhining Liu, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Yuchen Yan, Dongqi Fu, Jingchao Ni, Jingrui He, Hanghang Tong

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Virginia Polytechnic Institute and State University(弗吉尼亚理工学院和州立大学) Amazon(亚马逊) Meta AI University of Houston(休斯顿大学)

AI总结 提出TSAQA基准,涵盖6种时间序列分析任务(含新型PZ格式),评估LLM在13领域21万样本上的表现,最佳模型仅65.08分。

Comments 35 pages, 7 figures. Accepted to the GEM Workshop at ACL 2026

详情
AI中文摘要

时间序列数据在金融、医疗、交通和环境科学等关键应用中不可或缺。虽然近期工作开始探索多任务时间序列问答(QA),但现有基准仍局限于预测和异常检测任务。我们引入了TSAQA,这是一个新颖的统一基准,旨在拓宽任务覆盖范围并评估多样化的时间分析能力。TSAQA在单一框架下整合了六种不同任务,从常规分析(包括异常检测和分类)到高级分析(如特征描述、比较、数据转换和时间关系分析)。该数据集涵盖13个领域的21万个样本,采用多种格式,包括真/假(TF)、多项选择(MC)和一种新颖的谜题(PZ)格式,以全面评估时间序列分析。零样本评估表明,这些任务对当前大型语言模型(LLM)具有挑战性:表现最好的商业LLM Gemini-2.5-Flash的平均得分仅为65.08。尽管指令调优提升了开源模型的性能,但表现最好的开源模型LLaMA-3.1-8B仍有显著改进空间,凸显了LLM进行时间分析的复杂性。

英文摘要

Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited to forecasting and anomaly detection tasks. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates six diverse tasks under a single framework, ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, to comprehensively assess time series analysis. Zero-shot evaluation demonstrates that these tasks are challenging for current Large Language Models (LLMs): the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. Although instruction tuning boosts open-source performance, even the best-performing open-source model, LLaMA-3.1-8B, shows significant room for improvement, highlighting the complexity of temporal analysis for LLMs.

2602.00471 2026-06-08 cs.AI cs.CV 版本更新

Dual Latent Memory for Visual Multi-agent System

面向视觉多智能体系统的双潜在记忆

Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, Shuicheng Yan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出L²-VMAS框架,通过双潜在记忆解耦感知与思考,并采用熵驱动主动触发机制,打破视觉多智能体系统的“扩展墙”,在提升准确率的同时大幅降低令牌消耗。

详情
AI中文摘要

尽管视觉多智能体系统(VMAS)有望通过智能体间协作增强综合能力,但经验证据揭示了一个反直觉的“扩展墙”:增加智能体轮次往往会降低性能,同时指数级增加令牌成本。我们将这一失败归因于以文本为中心的通信中固有的信息瓶颈,其中将感知和思维轨迹转换为离散自然语言不可避免地导致语义损失。为此,我们提出了L²-VMAS,一种新颖的模型无关框架,通过双潜在记忆实现智能体间协作。此外,我们解耦了感知与思考,同时动态合成双潜在记忆。另外,我们引入了熵驱动的主动触发,用高效的按需内存访问取代被动信息传输。在骨干网络、规模和多智能体结构上的大量实验表明,我们的方法有效打破了“扩展墙”,具有卓越的可扩展性,平均准确率提高2.7-5.4%,同时令牌使用量减少21.3-44.8%。

英文摘要

While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L²-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories. Furthermore, we decouple perception and thinking while dynamically synthesizing dual latent memories. Additionally, we introduce entropy-driven proactive triggering that replaces passive information transmission with efficient, on-demand memory access. Extensive experiments across backbones, sizes, and multi-agent structures demonstrate that our method effectively breaks the "scaling wall" with superb scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%.

2602.01740 2026-06-08 cs.AI cs.CV cs.LG 版本更新

MACD: Model-Aware Contrastive Decoding via Counterfactual Data

MACD:基于反事实数据的模型感知对比解码

Qixin Xiao, Kun Zhou

发表机构 * University of Michigan, Ann Arbor, MI, USA(密歇根大学安娜堡分校) University of California San Diego, La Jolla, CA, USA(加州大学圣地亚哥分校)

AI总结 提出MACD方法,利用视频语言模型自身反馈识别导致幻觉的目标区域,生成目标级反事实输入,结合对比解码减少幻觉,提升多模型在复杂场景下的准确性。

详情
AI中文摘要

视频语言模型(Video-LLMs)容易产生幻觉,当视觉证据薄弱、模糊或存在偏差时,会生成看似合理但无根据的内容。现有方法如对比解码(CD)依赖随机扰动构建对比数据以缓解幻觉,但往往未能针对驱动幻觉的视觉线索或模型弱点。我们提出基于模型感知反事实数据的对比解码(MACD),这是一种结合模型引导的反事实构建与对比解码的推理策略。MACD利用Video-LLM自身的反馈来识别最可能导致幻觉的目标区域,生成有针对性的目标级反事实输入,而非任意的帧或时间修改。这些反事实输入被整合到CD中,以在解码过程中强制进行基于证据的令牌选择。在EventHallusion、MVBench、Perception-test和Video-MME上的实验表明,MACD在包括Qwen和InternVL在内的多种Video-LLM上持续减少幻觉,同时保持或提高任务准确性,在涉及小目标、遮挡目标或共现目标的场景中尤其表现出显著优势。

英文摘要

Video language models (Video-LLMs) are prone to hallucinations, generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for hallucination mitigation, but often fail to target the visual cues that drive hallucination or align with model weaknesses. We propose Model-Aware Counterfactual Data based Contrastive Decoding (MACD), an inference strategy that combines model-guided counterfactual construction with contrastive decoding. MACD uses the Video-LLM's own feedback to identify object regions most responsible for hallucination, generating targeted object-level counterfactual inputs rather than arbitrary frame or temporal modifications. These counterfactual inputs are integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test, and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL, with especially strong gains in scenarios involving small, occluded, or co-occurring objects.
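The contrastive decoding step mentioned above can be illustrated with a minimal sketch. It uses the form common in visual contrastive decoding, (1 + alpha) * original logits - alpha * counterfactual logits. Whether MACD uses exactly this combination is an assumption; the abstract does not state the formula. The function names and the toy logits are hypothetical.

```python
def contrastive_logits(original, counterfactual, alpha=1.0):
    """Combine next-token logits from the original input with logits
    from a counterfactual input (assumed form, not confirmed by the paper):
    (1 + alpha) * original - alpha * counterfactual."""
    if len(original) != len(counterfactual):
        raise ValueError("logit vectors must have the same length")
    return [(1 + alpha) * o - alpha * c for o, c in zip(original, counterfactual)]


def pick_token(original, counterfactual, alpha=1.0):
    """Greedy choice over the contrasted logits; returns a token index."""
    scores = contrastive_logits(original, counterfactual, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy example: token 0 stays likely even when the relevant object is
    # removed (a prior-driven guess), while token 1 depends on the evidence.
    original = [2.0, 1.5]
    counterfactual = [2.5, 0.5]
    print(pick_token(original, counterfactual, alpha=0.0))  # plain greedy decoding
    print(pick_token(original, counterfactual, alpha=1.0))  # contrastive decoding
```

Tokens whose score barely changes when the evidence is removed are penalised, which is the intuition behind using targeted counterfactual inputs rather than random perturbations.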

2602.03160 2026-06-08 cs.AI cs.CL 版本更新

VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

VALUEFLOW:迈向大语言模型中多元化和可引导的基于价值的对齐

Woojin Kim, Sieun Hyeon, Jusang Oh, Jaeyoung Do

发表机构 * Department of Electrical and Computer Engineering, Seoul National University(首尔国立大学电气与计算机工程系) Interdisciplinary Program in Artificial Intelligence, Seoul National University(首尔国立大学人工智能交叉学科项目)

AI总结 提出VALUEFLOW框架,通过分层价值嵌入、强度标注数据库和锚定评估器,实现大语言模型在价值强度上的可控对齐,解决现有方法在提取、评估和引导方面的不足。

Comments Accepted in ICML 2026 (Oral). Code available at https://github.com/AIDASLab/VALUEFLOW

详情
AI中文摘要

将大语言模型(LLMs)与人类价值的多元光谱对齐仍然是一个核心挑战:基于偏好的方法通常无法捕捉更深层次的动机原则。基于价值的方法提供了更原则性的路径,但仍存在三个差距:提取常常忽略层次结构,评估只检测价值是否存在而不校准其强度,并且LLMs在受控强度下的可引导性仍未得到充分理解。为解决这些限制,我们引入了VALUEFLOW,这是第一个统一框架,涵盖提取、评估和引导,并具有校准的强度控制。该框架整合了三个组件:(i) HIVES,一个层次化价值嵌入空间,捕捉理论内和跨理论的价值结构;(ii) 价值强度数据库(VIDB),一个大规模资源,包含基于排序聚合得出的强度估计的价值标注文本;(iii) 一个基于锚点的评估器,通过将模型输出与VIDB面板进行排序,产生一致的强度分数。使用VALUEFLOW,我们在十个模型和四个价值理论上进行了全面的大规模研究,识别了可引导性的不对称性和多价值控制的组合规律。本文建立了一个可扩展的基础设施,用于评估和控制价值强度,推进了LLMs的多元化对齐。

英文摘要

Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preference-based methods often fail to capture deeper motivational principles. Value-based approaches offer a more principled path, yet three gaps persist: extraction often ignores hierarchical structure, evaluation detects presence but not calibrated intensity, and the steerability of LLMs at controlled intensities remains insufficiently understood. To address these limitations, we introduce VALUEFLOW, the first unified framework that spans extraction, evaluation, and steering with calibrated intensity control. The framework integrates three components: (i) HIVES, a hierarchical value embedding space that captures intra- and cross-theory value structure; (ii) the Value Intensity DataBase (VIDB), a large-scale resource of value-labeled texts with intensity estimates derived from ranking-based aggregation; and (iii) an anchor-based evaluator that produces consistent intensity scores for model outputs by ranking them against VIDB panels. Using VALUEFLOW, we conduct a comprehensive large-scale study across ten models and four value theories, identifying asymmetries in steerability and composition laws for multi-value control. This paper establishes a scalable infrastructure for evaluating and controlling value intensity, advancing pluralistic alignment of LLMs.

2603.11333 2026-06-08 cs.AI updated version

LLM-Augmented Digital Twin for Policy Evaluation in Short-Video Platforms

LLM增强的数字孪生用于短视频平台策略评估

Haoting Zhang, Yunduan Lin, Jinghai He, Denglin Jiang, Zuo-Jun Shen, Zeyu Zheng

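Affiliations * University of California, Berkeley; The Chinese University of Hong Kong; New York University; The University of Hong Kong

AI summary Proposes an LLM-augmented digital twin with a four-module architecture (User, Content, Interaction, Platform); an event-driven execution layer and pluggable policy components support reproducible, simulation-based evaluation of platform policies, including AI-enabled ones, under closed-loop dynamics.

Details
AI Chinese abstract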

短视频平台是闭环、人在回路中的生态系统,其中平台策略、创作者激励和用户行为共同演化。这种反馈结构使得在生产环境中进行反事实策略评估变得困难,尤其是对于长期和分布性结果。随着平台部署改变内容进入系统方式、代理适应方式以及平台运行方式的AI工具,这一挑战被放大。我们提出了一种大语言模型(LLM)增强的数字孪生用于短视频平台,具有模块化的四孪生架构(用户、内容、交互、平台)和一个支持可复现实验的事件驱动执行层。平台策略作为平台孪生中的可插拔组件实现,LLM作为可选的、模式约束的决策服务(例如,角色生成、内容字幕、活动规划、趋势预测)集成,并通过统一优化器路由。这种设计使得可扩展的仿真成为可能,在保留闭环动态的同时允许选择性采用LLM,从而能够在现实反馈和约束下研究平台策略,包括AI增强策略。

English abstract

Short-video platforms are closed-loop, human-in-the-loop ecosystems where platform policy, creator incentives, and user behavior co-evolve. This feedback structure makes counterfactual policy evaluation difficult in production, especially for long-horizon and distributional outcomes. The challenge is amplified as platforms deploy AI tools that change what content enters the system, how agents adapt, and how the platform operates. We propose a large language model (LLM)-augmented digital twin for short-video platforms, with a modular four-twin architecture (User, Content, Interaction, Platform) and an event-driven execution layer that supports reproducible experimentation. Platform policies are implemented as pluggable components within the Platform Twin, and LLMs are integrated as optional, schema-constrained decision services (e.g., persona generation, content captioning, campaign planning, trend prediction) that are routed through a unified optimizer. This design enables scalable simulations that preserve closed-loop dynamics while allowing selective LLM adoption, enabling the study of platform policies, including AI-enabled policies, under realistic feedback and constraints.

2603.19100 2026-06-08 cs.AI updated version

LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling

LuMamba: 用于电极拓扑不变且高效的EEG建模的潜在统一Mamba

Danaé Broustail, Anna Tegon, Thorir Mar Ingolfsson, Yawei Li, Luca Benini

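Affiliations * ETH Zürich, Institute of Information Systems

AI summary Proposes LuMamba, a framework that combines topology-invariant encodings with linear-complexity state-space modeling: LUNA's learned-query cross-attention unifies channels, and FEMBA's bidirectional Mamba blocks model time efficiently. With 4.6M parameters it is evaluated on five downstream tasks and reaches state-of-the-art performance on Alzheimer's detection.

Comments EUSIPCO 2026, 5 pages, 2 figures, 4 tables

Details
AI Chinese abstract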

脑电图(EEG)能够在临床和神经技术应用中无创监测大脑活动,但由于电极拓扑结构不同以及Transformer架构的二次序列复杂度,构建EEG基础模型仍然具有挑战性。作为联合解决方案,我们提出了LuMamba(潜在统一Mamba),一个自监督框架,结合了拓扑不变编码和线性复杂度状态空间建模,使用LUNA的学习查询交叉注意力机制进行通道统一,以及FEMBA的双向Mamba块进行高效时序建模。在此架构内,我们首次系统研究了用于生物信号学习的潜在-欧几里得联合嵌入预测架构(LeJEPA)。在来自TUEG语料库的超过21,000小时未标记EEG上预训练后,LuMamba在五个下游任务上进行了评估,涵盖异常检测、伪影识别和精神状态分类,电极配置从16到26通道不等。在预训练目标中,仅掩码重建产生结构化但泛化性较差的表示,而仅LeJEPA产生扩散嵌入;结合两个目标实现了最稳健的性能。仅用4.6M参数,LuMamba在TUAB上达到80.99%的平衡准确率,在阿尔茨海默病检测上达到SOTA性能(0.97 AUPR),同时在等效序列长度下所需FLOPS比SOTA模型少377倍,并在达到典型GPU内存限制前可扩展到12倍更长的序列。代码可在https://github.com/pulp-bio/biofoundation获取。

English abstract

Electroencephalography (EEG) enables non-invasive monitoring of brain activity across clinical and neurotechnology applications, yet building foundation models for EEG remains challenging due to differing electrode topologies and computational scalability, as Transformer architectures incur quadratic sequence complexity. As a joint solution, we propose LuMamba (Latent Unified Mamba), a self-supervised framework combining topology-invariant encodings with linear-complexity state-space modeling, using LUNA's learned-query cross-attention mechanism for channel unification, and FEMBA's bidirectional Mamba blocks for efficient temporal modeling. Within this architecture, we provide the first systematic investigation of the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA) for biosignal learning. Pre-trained on over 21,000 hours of unlabeled EEG from the TUEG corpus, LuMamba is evaluated on five downstream tasks spanning abnormality detection, artifact recognition, and mental condition classification across electrode configurations ranging from 16 to 26 channels. In the pre-training objective, masked reconstruction alone yields structured but less generalizable representations, while LeJEPA alone produces diffuse embeddings; combining both objectives achieves the most robust performance. With only 4.6M parameters, LuMamba attains 80.99% balanced accuracy on TUAB and achieves state-of-the-art performance on Alzheimer's detection (0.97 AUPR), while requiring 377x fewer FLOPs than state-of-the-art models at equivalent sequence lengths and scaling to 12x longer sequences before reaching typical GPU memory limits. Code is available at https://github.com/pulp-bio/biofoundation.

2603.19146 2026-06-08 cs.AI cs.LG updated version

D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding

D5P4:用于并行离散扩散解码中多样性的分区行列式点过程

Jonathan Lys, Vincent Gripon, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene, Bastien Pasdeloup

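Affiliations * IMT Atlantique, Lab-STICC, UMR CNRS 6285; Sony Europe Ltd., Stuttgart Technology Center, EUREC

AI summary Proposes D5P4, a beam-style decoding method for discrete diffusion models that selects intermediate sequences with a partitioned Determinantal Point Process, balancing quality and diversity without external verifiers.

Details
AI Chinese abstract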

离散扩散模型是自回归方法在文本生成中的有前途的替代方案,但其解码方法仍研究不足。标准的自回归搜索过程(如波束搜索)不直接适用于迭代去噪,其中假设是完整的中间序列而非从左到右的前缀。此外,现有的扩散解码过程对保留假设的多样性和覆盖范围的控制有限。在这项工作中,我们引入了D5P4,一种针对离散扩散模型定制的波束式解码方法,它将中间波束选择视为分区行列式点过程下的MAP推理。这产生了一个模型内部的批次目标,无需外部验证器即可平衡质量和多样性。在开放域生成、问答和数学推理上的实验表明,D5P4提高了多样性和pass@$k$覆盖率,同时匹配或超越了基线质量和保真度。

English abstract

Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard autoregressive search procedures, such as beam search, do not directly apply to iterative denoising, where hypotheses are complete intermediate sequences rather than left-to-right prefixes. Furthermore, existing diffusion decoding procedures provide only limited control over the diversity and coverage of retained hypotheses. In this work, we introduce D5P4, a beam-style decoding method tailored to discrete diffusion models, which casts intermediate beam selection as MAP inference under a partitioned Determinantal Point Process. This yields a model-internal batch objective that balances quality and diversity without external verifiers. Experiments on open-ended generation, question answering, and mathematical reasoning show that D5P4 improves diversity and pass@$k$ coverage while matching or surpassing baseline quality and fidelity.
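
To make the quality-versus-diversity trade-off in DPP-based selection concrete, here is a minimal sketch of greedy MAP selection under a plain (unpartitioned) DPP. It is not the paper's D5P4 procedure: the kernel L[i][j] = q_i * cos(f_i, f_j) * q_j, the function names, and the toy scores and features are all assumptions chosen for illustration.

```python
import math

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n):
                m[row][j] -= factor * m[col][j]
    return det

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_dpp_select(quality, features, k):
    """Greedily add the item that most increases det(L_S), up to k items."""
    n = len(quality)
    # Assumed kernel: quality-weighted cosine similarity.
    kernel = [[quality[i] * cosine(features[i], features[j]) * quality[j]
               for j in range(n)] for i in range(n)]
    selected = []
    for _ in range(k):
        best_item, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            det = determinant([[kernel[a][b] for b in idx] for a in idx])
            if det > best_det:
                best_item, best_det = i, det
        if best_item is None:  # no candidate keeps the determinant positive
            break
        selected.append(best_item)
    return selected

# Hypothetical candidates: items 0 and 1 are near-duplicates, item 2 is distinct.
quality = [1.0, 0.95, 0.6]
features = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
chosen = greedy_dpp_select(quality, features, k=2)
print(chosen)
```

On this toy input a pure quality ranking would keep items 0 and 1, whereas the determinant penalizes their similarity and the greedy selection keeps items 0 and 2.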

2603.24481 2026-06-08 cs.AI cs.CL cs.LG updated version

Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

基于一致性验证的多智能体推理改进医学多项选择题问答中的不确定性校准

John Ray B. Martinez

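Affiliations * Department of Data Science and Analytics

AI summary Proposes a multi-agent framework that combines domain-specialist agents with Two-Phase Verification and S-Score Weighted Fusion, reducing calibration error (ECE) by 49-74% and improving discrimination (AUROC) on high-disagreement subsets of medical MCQA.

Comments 20 pages, 6 figures. Preprint under review

Details
AI Chinese abstract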

校准不良的置信度分数是AI在临床环境中部署的实际障碍。总是过度自信的模型无法为延迟决策提供有用信号。我们提出了一个多智能体框架,结合领域特定专家智能体与两阶段验证(Wu等人,2024)和S分数加权融合,以改进医学多项选择题问答中的校准和判别能力。四个专家智能体(呼吸科、心脏病科、神经科、胃肠科)使用Qwen2.5-7B-Instruct生成独立诊断。每个诊断经历两阶段自我验证过程,测量内部一致性并产生专家置信度分数(S分数)。S分数驱动加权融合策略,选择最终答案并校准报告的置信度。我们在MedQA-USMLE和MedMCQA的高分歧子集(100和250个问题)上进行评估。所有结果均针对此过滤后的设置。在MedQA-250上,完整系统实现了ECE=0.091(比单专家基线降低74.4%)和AUROC=0.630(+0.056),准确率为59.2%。在所有四种设置中,校准增益保持在49-74%。消融分析表明,两阶段验证驱动ECE降低,而多智能体推理驱动AUROC提升,表明一致性检查和集成聚合解决了LLM不确定性的不同失败模式。由此产生的置信度信号是否足以在实践中支持临床延迟决策,仍是未来研究的方向。

English abstract

Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification (Wu et al., 2024) and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis undergoes a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate on high-disagreement subsets of MedQA-USMLE and MedMCQA (100 and 250 questions). All results are specific to this filtered regime. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Calibration gains of 49-74% hold across all four settings. Ablation analysis reveals that Two-Phase Verification drives ECE reduction while multi-agent reasoning drives AUROC improvement, suggesting that consistency checking and ensemble aggregation address different failure modes of LLM uncertainty. Whether the resulting confidence signal is sufficient to support clinical deferral decisions in practice remains a direction for future investigation.
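
For readers unfamiliar with the calibration metric reported above, the following is a minimal sketch of Expected Calibration Error (ECE) with equal-width confidence bins. The abstract does not state the binning scheme, so the 10-bin choice and the toy confidences below are assumptions, not the paper's setup.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |bin accuracy - mean bin confidence|."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece

# Made-up predictions from an overconfident model: 50% accurate in both bins.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, False, False, True, True, False]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))
```

Here four answers at confidence 0.9 contribute (4/6) * 0.4 and two answers at 0.6 contribute (2/6) * 0.1, so the ECE is 0.3; a perfectly calibrated model would score 0.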

2603.24963 2026-06-08 cs.AI cs.LG updated version

Design Once, Deploy at Scale: Template-Driven ML Development for Large Model Ecosystems

一次设计,大规模部署:面向大型模型生态的模板驱动ML开发

Jiang Liu, John Martabano Landy, Yao Xuan, Swamy Muddu, Nhat Le, Munaf Sahaf, Luc Kien Hang, Rupinder Khandpour, Kevin De Angeli, Chang Yang, Shouyuan Chen, Shiblee Sadik, Anirudh Agrawal, Djordje Gligorijevic, Jingzheng Qin, Peggy Yao, Alireza Vahdatpour

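Affiliations * Meta AI

AI summary To address inefficient ML development in large model ecosystems, proposes the Standard Model Template (SMT) framework, which cuts technique propagation complexity from O(n·2^k) to O(n+k); in Meta's ads ranking ecosystem it yields a 0.63% average cross-entropy improvement, a 92% reduction in per-model iteration engineering time, and a 6.3x increase in technique-model pair adoption throughput.

Details
AI Chinese abstract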

现代计算广告平台通常依赖推荐系统来预测用户响应,如点击率、转化率和其他优化事件。为了支持多样化的产品表面和广告主目标,这些平台经常维护一个广泛的机器学习(ML)模型生态系统。然而,在这种规模下运营带来了显著的发展和效率挑战。需要大量的工程努力来定期刷新ML模型并传播新技术,这导致在生态系统中部署ML创新时出现长延迟。我们提出了一项大规模实证研究,比较了标准化模型构建方法与推荐系统中独立每模型优化之间的模型性能、效率和ML技术传播。为了促进这种标准化,我们提出了标准模型模板(SMT)——一个生成适应不同数据分布和优化事件的高性能模型的框架。通过利用标准化、可组合的ML模型组件,SMT将技术传播复杂度从O(n·2^k)降低到O(n+k),其中n是模型数量,k是技术数量。在Meta的生产广告排名生态系统中,对四个全球开发周期内的广泛模型套件进行评估,我们的结果表明:(1)在中等服务容量下,交叉熵平均提高0.63%;(2)每模型迭代工程时间减少92%;(3)技术-模型对采用吞吐量增加6.3倍。这些发现挑战了多样化优化目标本质上需要多样化ML模型设计的传统观点。

English abstract

Modern computational advertising platforms typically rely on recommendation systems to predict user responses, such as click-through rates, conversion rates, and other optimization events. To support a wide variety of product surfaces and advertiser goals, these platforms frequently maintain an extensive ecosystem of machine learning (ML) models. However, operating at this scale creates significant development and efficiency challenges. Substantial engineering effort is required to regularly refresh ML models and propagate new techniques, which results in long latencies when deploying ML innovations across the ecosystem. We present a large-scale empirical study comparing model performance, efficiency, and ML technique propagation between a standardized model-building approach and independent per-model optimization in recommendation systems. To facilitate this standardization, we propose the Standard Model Template (SMT) -- a framework that generates high-performance models adaptable to diverse data distributions and optimization events. By utilizing standardized, composable ML model components, SMT reduces technique propagation complexity from $O(n \cdot 2^k)$ to $O(n + k)$ where $n$ is the number of models and $k$ the number of techniques. Evaluating an extensive suite of models over four global development cycles within Meta's production ads ranking ecosystem, our results demonstrate: (1) a 0.63% average improvement in cross-entropy at neutral serving capacity, (2) a 92% reduction in per-model iteration engineering time, and (3) a $6.3\times$ increase in technique-model pair adoption throughput. These findings challenge the conventional wisdom that diverse optimization goals inherently require diversified ML model design.
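
The gap between the two complexity classes is easier to appreciate with numbers. The sketch below simply evaluates n * 2^k and n + k for a few hypothetical ecosystem sizes; the interpretation in the comments (every subset of techniques per model versus one integration per technique plus one instantiation per model) is an assumed reading of the abstract, not a description of how Meta counts work.

```python
def per_model_cost(n_models, k_techniques):
    # Assumed reading of O(n * 2^k): each of the n models is tuned independently
    # and may have to consider every subset of the k techniques.
    return n_models * 2 ** k_techniques

def template_cost(n_models, k_techniques):
    # Assumed reading of O(n + k): each technique is integrated once into the
    # shared template, and each model is instantiated once from that template.
    return n_models + k_techniques

# Hypothetical ecosystem sizes, for illustration only.
for n, k in [(10, 3), (50, 8), (100, 10)]:
    print(f"n={n:>3} k={k:>2}  per-model={per_model_cost(n, k):>6}  template={template_cost(n, k):>3}")
```

With 100 models and 10 techniques the per-model count is 102,400 units against 110 for the template approach, which is the kind of gap the asymptotic claim describes.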

2604.02029 2026-06-08 cs.AI updated version

The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

潜空间:基础、演化、机制、能力与展望

Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Guanting Dong, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao, Yue Liao, Ruqi Huang, Tao Jin, Zhucun Xue, Cheng Tan, Jiangning Zhang, Wenqi Ren, Yanwei Fu, Yong Liu, Yu Wang, Xiangyu Yue, Yu-Gang Jiang, Shuicheng Yan

Affiliations * National University of Singapore; Fudan University; Tsinghua University; Zhejiang University; Shanghai Artificial Intelligence Laboratory; Renmin University of China; The Chinese University of Hong Kong; The Hong Kong University of Science and Technology; DeepWisdom; Nanjing University; Shanghai Jiao Tong University; Nanyang Technological University; Tencent Hunyuan; QuantaAlpha; Beijing University of Posts and Telecommunications; Zhejiang Lab; University of Chinese Academy of Sciences; Hong Kong University of Science and Technology (Guangzhou); Sun Yat-sen University

AI summary Surveys latent space in language-based models across foundation, evolution, mechanism, and ability, argues that it overcomes structural limitations of explicit-space computation, and outlines directions for future research.

Details
AI Chinese abstract

潜空间正迅速成为基于语言的模型的原生子空间。虽然现代系统仍常通过显式词元级生成来理解,但越来越多的工作表明,许多关键内部过程在连续潜空间中比在人类可读的言语痕迹中更自然地执行。这一转变由显式空间计算的结构性限制驱动,包括语言冗余、离散化瓶颈、顺序低效和语义损失。本综述旨在提供语言模型中潜空间的统一且最新的全景。我们按五个连续视角组织综述:基础、演化、机制、能力和展望。首先,我们界定潜空间的范围,将其与显式或言语空间以及生成式视觉模型中常见的潜空间区分开来。然后,我们追溯该领域从早期探索性工作到当前大规模扩展的演化过程。为组织技术全景,我们通过机制和能力两个互补视角审视现有工作。从机制视角,我们识别出四个主要发展路线:架构、表示、计算和优化。从能力视角,我们展示潜空间如何支持广泛的能力谱系,涵盖推理、规划、建模、感知、记忆、协作和具身。除整合外,我们讨论关键开放挑战,并概述未来研究的有前景方向。我们希望本综述不仅作为现有工作的参考,而且作为理解潜空间作为下一代智能的通用计算和系统范式的基础。

English abstract

Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field's evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.

2604.23057 2026-06-08 cs.AI updated version

Don't Make the LLM Read the Graph: Make the Graph Think

不要让大语言模型读图:让图思考

Yuqi Sun, Tianqin Meng, George Liu, Yashraj Panwar, Lakshya Chaudhry, Munasib Ilham, Aman Chadha

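Affiliations * Mindoverflow; University of Waterloo; Carnegie Mellon University; Foothill College; Purdue University; University of Wisconsin; Apple

AI summary Through 3,000+ controlled trials, studies whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning; finds that integration architecture determines a graph's value, identifies the "Planner Defiance" failure mode, and presents exploratory evidence that graph depth has diminishing returns.

Comments main body has 9 pages, 4 figures, under review for COLM 2026 conference

Details
AI Chinese abstract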

我们研究了显式信念图是否提升大语言模型在合作多智能体推理中的性能。通过在合作纸牌游戏Hanabi中跨越四个LLM家族的3000多次对照实验,我们建立了四个发现。首先,集成架构决定了信念图是否提供价值:作为提示上下文,图对强模型是装饰性的,仅对弱模型在二阶心智理论上有益(80% vs 10%,p<0.0001,OR=36.0);当图通过排序短列表门控动作选择时,即使对强模型也变得结构上必要(二阶ToM上100% vs 20%,p<0.001)。其次,我们识别出“规划者违抗”,一种模型家族特定的失败,即LLM在部分能力下覆盖正确的规划者建议(90%覆盖,重复N=20);Gemini模型表现出接近零的违抗,而Llama 70B表现出90%,模型区分事实上下文(被延迟)和咨询建议(被覆盖)。第三,完整游戏证据证实智能体间约定(比基线高+128%,p=0.003)优于所有单智能体干预,且单个信念图组件必须组合才能产生收益。第四,初步规模分析(N=10/单元,探索性)表明图深度收益递减:浅层图提供最佳成本效益比,而更深层的ToM图在更大玩家数量下似乎有害(5玩家时-1.5分,p=0.029)。

English abstract

We investigate whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning. Through 3,000+ controlled trials across four LLM families in the cooperative card game Hanabi, we establish four findings. First, integration architecture determines whether belief graphs provide value: as prompt context, graphs are decorative for strong models and beneficial only for weak models on 2nd-order Theory of Mind (80% vs 10%, p<0.0001, OR=36.0); when graphs gate action selection through ranked shortlists, they become structurally essential even for strong models (100% vs 20% on 2nd-order ToM, p<0.001). Second, we identify "Planner Defiance," a model-family-specific failure where LLMs override correct planner recommendations at partial competence (90% override, replicated N=20); Gemini models show near-zero defiance while Llama 70B shows 90%, and models distinguish factual context (deferred to) from advisory recommendations (overridden). Third, full-game evidence confirms inter-agent conventions (+128% over baseline, p=0.003) outperform all single-agent interventions, and individual belief-graph components must be combined to produce gains. Fourth, preliminary scaling analysis (N=10/cell, exploratory) suggests graph depth has diminishing returns: shallow graphs provide the best cost-benefit ratio, while deeper ToM graphs appear harmful at larger player counts (-1.5 pts at 5-player, p=0.029).
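
The reported odds ratio can be checked directly from the two success rates in the abstract. The short sketch below computes it from proportions; the function name is illustrative, and the calculation is only defined for rates strictly between 0 and 1 (the 100% vs 20% comparison has infinite odds and is not covered).

```python
def odds_ratio(rate_a, rate_b):
    """Odds ratio of two success rates, each strictly between 0 and 1."""
    odds_a = rate_a / (1 - rate_a)
    odds_b = rate_b / (1 - rate_b)
    return odds_a / odds_b

# Success rates quoted in the abstract for weak models on 2nd-order Theory of Mind:
# 80% with the belief graph as prompt context vs 10% without.
ratio = odds_ratio(0.80, 0.10)
print(round(ratio, 1))
```

The odds are 0.8/0.2 = 4 with the graph and 0.1/0.9 ≈ 0.111 without, giving a ratio of 36, consistent with the OR=36.0 stated above.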

2605.00737 2026-06-08 cs.AI updated version

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

调用还是不调用:评估和优化LLM工具调用的框架

Qinyuan Wu, Soumi Das, Mahsa Amani, Arijit Nag, Seungeon Lee, Krishna P. Gummadi, Abhilasha Ravichander, Muhammad Bilal Zafar

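Affiliations * Max Planck Institute for Software Systems; Ruhr University Bochum; UAR RC Trust

AI summary Proposes a decision-theory-inspired framework that evaluates LLM web-search tool-calling decisions along three factors (necessity, utility, and affordability) and trains lightweight estimators that improve calling decisions and task performance.

Comments Preprint, under review

Details
AI Chinese abstract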

智能体AI架构通过外部工具增强LLM,解锁了强大的能力。然而,工具使用并不总是有益的;某些调用可能是冗余甚至有害的。因此,有效的工具使用取决于LLM的核心决策:在执行任务时是否调用工具。这一决策对于网页搜索工具尤其具有挑战性,因为外部信息的收益取决于模型的内部知识及其整合潜在噪声工具响应的能力。我们引入了一个受决策理论启发的原则性框架,用于评估网页搜索工具使用决策的三个关键因素:必要性、效用和可负担性。我们的分析结合了两种互补视角:规范性视角,从工具调用的最优分配推断真实需求和效用;描述性视角,从模型观察到的行为推断其自我感知的需求和效用。我们在四种网页搜索工具和三种任务上,评估了六种开源和一种闭源前沿模型,采用两种测试环境:一种仅基于当前轮次及其搜索结果,另一种基于完整执行轨迹。在每个设置中,我们发现模型的感知需求和效用经常与真实需求和效用不一致。基于此框架,我们从模型的隐藏状态训练轻量级需求和效用估计器。这些估计器驱动简单的控制器,提高了决策质量,并为大多数开源模型带来了比自我感知基线更强的任务性能。

English abstract

Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM decision: whether to call or not call a tool when performing a task. This decision is particularly challenging for web search tools, where the benefits of external information depend on the model's internal knowledge and its ability to integrate potentially noisy tool responses. We introduce a principled framework inspired by decision-making theory to evaluate web search tool-use decisions along three key factors: necessity, utility, and affordability. Our analysis combines two complementary lenses: a normative perspective that infers true need and utility from an optimal allocation of tool calls, and a descriptive perspective that infers the model's self-perceived need and utility from its observed behavior. We evaluate six open and one closed-source frontier models under two harnesses, one conditioning on only the current turn and its search results, the other on the full execution traces, across four web-search tools and three tasks. In every setting, we find that a model's perceived need and utility are frequently misaligned with the true need and utility. Building on this framework, we train lightweight estimators of need and utility from the models' hidden states. These estimators drive simple controllers that improve decision quality and yield stronger task performance than the self-perceived baseline for most of the open-source models.

2605.06890 2026-06-08 cs.AI cs.MA updated version

Beyond the Black Box: Interpretability of Agentic AI Tool Use

超越黑箱:代理AI工具使用的可解释性

Hariom Tatsat, Ariye Shater

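Affiliations * GitHub

AI summary Proposes a mechanistic-interpretability toolkit built on Sparse Autoencoders (SAEs) and linear probes to make tool calls by agentic AI more observable and interpretable in long-horizon tasks; it analyzes internal model states to identify the features that drive tool decisions, helping surface deeper causes of agent failure.

Comments 12 pages, 4 figures, 17 tables

Details
AI Chinese abstract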

AI代理在高风险企业工作中具有前景,但可靠部署受限,因为工具使用失败难以诊断和控制。代理可能跳过必需的工具调用,错误调用工具,或执行后果只能在执行后才显现的行动。现有的可观测性方法大多是外部的:提示揭示相关性,评估评分输出,日志只能在模型已行动后才出现。在长周期设置中,这些失败尤其昂贵,因为早期工具错误会改变轨迹其余部分,增加token消耗,并创建下游的安全和安全风险。我们引入了一种基于稀疏自编码器(SAEs)和线性探针的机制可解释性工具包。该框架在每次行动前读取模型状态,并推断是否需要工具以及下一步工具动作的可能后果。通过将激活分解为稀疏特征,它识别与工具决策最相关的内部层和特征,并通过特征消融测试其功能重要性。我们训练探针在NVIDIA Nemotron函数调用数据集的多步轨迹上,并将相同的工作流程应用于GPT-OSS 20B和Gemma 3 27B模型。目标不是取代外部评估,而是添加一层缺失的可见性:在行动前模型内部信号的可见性。这有助于揭示代理失败的深层原因,尤其是在长周期运行中,早期错误会重塑其余的代理交互。更广泛地说,本文展示了机制可解释性如何支持实际的内部可观测性,以监控代理系统的工具调用和风险。

English abstract

AI agents are promising for high-stakes enterprise workflows, but dependable deployment remains limited because tool-use failures are difficult to diagnose and control. Agents may skip required tool calls, invoke tools unnecessarily, or take actions whose consequence becomes visible only after execution. Existing observability methods are external: prompts reveal correlations, evaluations score outputs, and logs arrive only after the model has already acted. In long-horizon settings, these failures are costly because an early tool mistake can alter the rest of the trajectory, increase token consumption, and create downstream safety and security risk. We introduce a mechanistic-interpretability toolkit built on Sparse Autoencoders (SAEs), which decompose activations into sparse internal features, and linear probes, lightweight classifiers that read signals from those features. The framework reads model states before each action and infers whether a tool is needed and how risky the next tool action is. It identifies the model layers and features most associated with tool decisions and tests their functional importance through feature ablation. We train the probes on multi-step trajectories from the NVIDIA Nemotron function-calling dataset and apply the same workflow to GPT-OSS 20B and Gemma 3 27B models. The goal is not to replace external evaluation, but to add a missing layer: visibility into what the model signaled internally before action. This helps surface deeper causes of agent failure, especially in long-horizon runs where an early mistake can impact subsequent agent behavior. More broadly, the paper shows how mechanistic interpretability can support internal observability for monitoring tool calls and risk in agent systems.

2605.21347 2026-06-08 cs.AI cs.LG cs.SE updated version

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Insights Generator: LLM代理的系统级语料库追踪诊断

Akshay Manglik, Apaar Shanker, Kaustubh Deshpande, Jason Qin, Yash Maurya, Veronica Chatrath, Vijay S. Kalmath, Levi Lentz, Yuan Xue

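Affiliations * Scale AI, Inc.

AI summary Proposes the Insights Generator, a multi-agent system that proposes and tests hypotheses across a trace corpus to produce evidence-backed insight reports, enabling systematic diagnosis of behavioral patterns in LLM agents.

Details
AI Chinese abstract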

诊断LLM代理的故障仍然主要依赖人工。从业者检查少量执行轨迹子集,形成临时假设并迭代。此过程错过了仅在轨迹群体中显现的模式,并且无法扩展到生产语料库,其中单个轨迹跨度上万词。我们正式化了语料库级轨迹诊断问题。给定一个执行轨迹语料库,目标是生成具有证据支持的自然语言洞察,以描述轨迹群体中的系统性行为模式,每个模式都与支持证据相关联。我们提出了Insights Generator (IG),一种多智能体系统,通过在轨迹语料库中提出和测试假设来回答诊断问题,从而生成证据支持的洞察报告。我们从定性和客观维度评估了IG,涵盖基于评分标准的报告评估和通过实施IG洞察获得的下游性能改进。使用IG报告的人类专家将支架性能提高了30.4个点百分比,而利用IG衍生洞察的编码代理显示出一致且稳定的提升。在多个基准测试中,IG的scout-investigator架构生成的发现与竞争方法在检测覆盖方面相当,同时领域专家对IG报告的深度和证据质量评价优异。

English abstract

Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report. We evaluate IG across qualitative and objective dimensions, spanning rubric-based report assessment and downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4pp over the unmodified baseline scaffold, and coding agents leveraging IG-derived insights show consistent and stable gains. Across benchmarks, IG's scout-investigator architecture produces findings comparable in detection coverage to competing approaches, while domain experts rated IG reports as leading in depth and evidence quality.

2605.21706 2026-06-08 cs.AI updated version

Latent-space Attacks for Refusal Evasion in Language Models

潜在空间攻击用于语言模型的拒绝规避

Giorgio Piras, Raffaele Mura, Fabio Brau, Maura Pintor, Luca Oneto, Fabio Roli, Battista Biggio

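Affiliations * University of Cagliari; University of Genova

AI summary Recasts refusal suppression in language models as a latent-space evasion attack on linear probes and proposes a Controlled Latent-space Evasion attack that pushes representations past the decision boundary, achieving higher attack success rates than existing refusal-ablation and jailbreak baselines.

Details
AI Chinese abstract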

安全对齐的语言模型被训练以拒绝有害请求,但通过引导其内部表示可以抑制拒绝行为。现有方法通过消融拒绝方向来消除模型残差流中的拒绝行为。尽管这些方法在经验上取得了成功,但它们缺乏对所诱导的潜在空间转换的系统性解释以及为何会抑制拒绝。在本文中,我们将拒绝抑制视为对训练以区分拒绝和回答提示的线性探测的潜在空间规避攻击。在此观点下,先前工作的差异均值方向自然定义了这样的探测器,其消融正好是对其决策边界上的投影,即最小置信度规避攻击。这种视角不仅解释了先前工作的经验成功,也承认了一个关键限制:规避在决策边界停止,这促使需要将表示进一步推入合规区域,即模型回答的区域。我们通过提出受控的潜在空间规避攻击方法来利用这一点,该方法通过优化置信度将表示投影到边界之外。我们在15个指令微调、多模态和推理模型上实现了最先进的攻击成功率,优于现有拒绝消融基线和专门的 jailbreak 攻击。

English abstract

Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key limitation: evasion stops at the decision boundary, motivating the need to push representations further into the compliant region, i.e., where the model answers. We leverage this by proposing a Controlled Latent-space Evasion attack that projects representations past the boundary with an optimized confidence. We achieve state-of-the-art attack success rate across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.

2605.28742 2026-06-08 cs.AI updated version

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

CORE: 对比反思实现推理能力的快速提升

Linas Nasvytis, Simon Jerome Han, Ben Prystawski, Satchel Grant, Noah D. Goodman, Judith E. Fan

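Affiliations * Stanford University

AI summary Proposes Contrastive Reflection (CORE), a non-parametric learning algorithm that contrasts successful and unsuccessful reasoning traces to generate natural-language insights, improving reasoning faster than parametric (GRPO) and non-parametric (GEPA, episodic RAG, MemRL) methods with few training samples and rollouts.

Details
AI Chinese abstract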

语言模型可以利用可验证奖励在多种推理任务上提升性能。然而,无论是参数方法(如RLVR)还是非参数方法(如提示优化),通常都需要数百个训练样本和数千次模型 rollout,这在最佳情况下成本高昂,最坏情况下则难以处理。为解决这一挑战,我们引入了对比反思(CORE),一种非参数学习算法,通过比较过去的推理轨迹来生成洞察:即捕捉成功与不成功问题尝试之间差异的推理策略和约束的简短自然语言描述。在四个推理任务上,我们证明CORE比参数方法(GRPO)和非参数方法(GEPA、情景RAG和MemRL)实现更快的改进,同时使用更少的rollout。在固定rollout预算下,使用少至五个训练样本,我们进一步展示CORE也实现了与各基线相当或更大的性能提升。最后,我们强调CORE在上下文效率上也显著优于非参数基线,需要更少的提示词,同时将学到的知识存储为紧凑、可解释的自然语言洞察。因此,我们的结果表明,将成功与不成功推理轨迹之间的对比提炼为抽象且有用的洞察,比权重更新、提示优化或直接重用存储的推理轨迹,为模型自我改进提供了一条更高效且可解释的途径。

English abstract

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, CORE achieves the strongest performance in most task-data regimes. Finally, we highlight how CORE is substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.

2606.00357 2026-06-08 cs.AI updated version

From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

从“弱”信号到强模型:基于LoRA合并的偏好增量聚合

Qi Sun, Siyue Zhang, Yulin Chen, Yuxiang Xue, Ru Peng, Chen Zhao

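Affiliations * NYU Shanghai; NTU; NYU; ZJU

AI summary Proposes Preference Delta Aggregation (PDA), a framework that derives a preference delta from each weak-weaker model pair, instantiates it as a LoRA adapter, and aggregates multiple "weak" signals via Geometric Alignment Merging (GAM) to improve a strong model.

Details
AI Chinese abstract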

训练强大的大型语言模型(LLM)需要高质量的监督,而这通常稀缺。最近的研究表明,来自弱-更弱模型对(例如Qwen3 4B对1.7B)的成对偏好数据,尽管单个响应的质量有限,但可以通过相对质量增量提供有效的监督信号,我们称之为“弱”信号。这引发了一个关键研究问题:能否将多个“弱”信号建设性地聚合以改进强模型(例如Qwen3 8B)?为此,我们提出了偏好增量聚合(PDA),这是第一个从每个弱-更弱模型对中推导出偏好增量,将其实例化为通过偏好优化学习的LoRA适配器,并通过LoRA合并聚合所得增量的框架。为了进一步减轻LoRA合并过程中的方向性干扰,我们引入了几何对齐合并(GAM),一种几何感知的合并方法,在聚合前对齐适配器子空间,从而实现更鲁棒的多样化增量组合。在知识推理和智能搜索基准上的评估表明,聚合多个“弱”信号使性能超越任何单一信号,并且随着更多信号的加入性能进一步提升。相应地,带有GAM的PDA在知识推理和智能搜索上分别平均提升强模型6.8和7.3个百分点。它优于所有单增量和多增量基线,超过最佳单增量基线2.1和4.3个百分点。进一步分析将这些增益归因于不同偏好增量中编码的互补能力的有效组合。

English abstract

Training strong large language models (LLMs) requires high-quality supervision, which is often scarce. Recent work shows that paired preference data from weak-weaker model pairs (e.g., Qwen3 4B over 1.7B), despite the limited quality of individual responses, can provide an effective supervision signal through relative quality deltas, which we term a "weak" signal. This motivates a key research question: can multiple "weak" signals be constructively aggregated for improving strong models (e.g., Qwen3 8B)? To this end, we propose Preference Delta Aggregation (PDA), the first framework that derives a preference delta from each weak-weaker model pair, instantiates it as a LoRA adapter learned through preference optimization, and aggregates the resulting deltas via LoRA merging. To further mitigate directional interference during LoRA merging, we introduce Geometric Alignment Merging (GAM), a geometry-aware merging method that aligns adapter subspaces before aggregation, enabling more robust composition of diverse deltas. Evaluations on knowledge reasoning and agentic search benchmarks show that aggregating multiple "weak" signals pushes performance beyond any single signal, with further gains as additional signals are incorporated. Correspondingly, PDA with GAM improves the strong model by 6.8 and 7.3 points on average for knowledge reasoning and agentic search, respectively. It outperforms all single-delta and multi-delta baselines, exceeding the best single-delta baseline by 2.1 and 4.3 points. Further analysis attributes these gains to the effective composition of complementary capabilities encoded across distinct preference deltas.
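
The directional interference that motivates GAM can be seen even in the simplest form of LoRA merging. The sketch below averages the weight updates scale * (B @ A) of toy rank-1 adapters; it is a naive baseline and not the paper's GAM method, and the adapter values and function names are hypothetical.

```python
def matmul(x, y):
    """Plain-Python matrix product of x (m x r) and y (r x n)."""
    return [[sum(x[i][t] * y[t][j] for t in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def lora_delta(b, a, scale=1.0):
    """Weight update contributed by one adapter: scale * (B @ A)."""
    return [[scale * value for value in row] for row in matmul(b, a)]

def average_deltas(deltas):
    """Naive merge: element-wise mean of the adapters' weight updates."""
    count = len(deltas)
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [[sum(d[i][j] for d in deltas) / count for j in range(cols)]
            for i in range(rows)]

# Hypothetical rank-1 adapters for a 2x2 weight matrix, given as (B, A) pairs.
adapter_1 = ([[1.0], [0.0]], [[0.5, 0.0]])
adapter_2 = ([[0.0], [1.0]], [[0.0, 0.25]])   # touches a different direction
adapter_3 = ([[1.0], [0.0]], [[-0.5, 0.0]])   # directly opposes adapter_1

merged_12 = average_deltas([lora_delta(*adapter_1), lora_delta(*adapter_2)])
merged_13 = average_deltas([lora_delta(*adapter_1), lora_delta(*adapter_3)])
print(merged_12)
print(merged_13)
```

Adapters 1 and 2 act on different directions and both survive the average (at half strength), while adapters 1 and 3 cancel to a zero update. Note that the products B @ A are averaged rather than the B and A factors separately, since averaging the factors generally yields a different update.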

2606.01120 2026-06-08 cs.AI 版本更新

Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking

诊断基于RAG的事实核查中LLM对证据前认知状态的仲裁行为

Yuxi Sun, Wenbo Shang, Wei Gao, Xin Huang, Jing Ma

发表机构 * Hong Kong Baptist University(香港浸会大学) Singapore Management University(新加坡管理大学)

AI总结 提出PAVE测试平台,通过将LLM验证器分为四种认知状态,评估其在检索增强生成事实核查中仲裁参数知识与检索证据的能力,发现不可靠且高度依赖模型的仲裁行为,并提出轻量级JSD测试时仲裁方法。

Comments Accepted to ACL-2026 Findings (voluntarily withdrawn)

详情
AI中文摘要

在基于RAG的事实核查中,LLM越来越多地被用作验证器,以根据检索到的证据检查给定声明。它们的参数知识可能引发与检索上下文冲突的证据前倾向,然而现有的评估框架并未描述这种先验-上下文差异,也未衡量验证器如何在参数信号和上下文信号之间进行仲裁。我们引入了PAVE(先验感知验证器评估,Prior-Aware Verifier Evaluation),这是一个诊断测试平台,根据LLM验证器在证据前的先验正确性和置信度将其分为四种认知状态,并在这个新基准上评估其仲裁行为,即:在误导性证据下是否坚持正确的先验,以及在提供准确证据时是否纠正错误的先验。对七个LLM的实验揭示了不可靠且高度依赖模型的先验-上下文仲裁,突显了在真实世界的基于RAG的事实核查应用中选择验证器的重要性。基于这些发现,我们提出了一种轻量级的基于JSD的测试时仲裁方法,该方法在不修改底层模型的情况下提高了事实可靠性,并在不同的LLM家族中实现了竞争性能。

英文摘要

In RAG-based fact-checking, LLMs are increasingly used as verifiers to check given claims against retrieved evidence. Their parametric knowledge can induce pre-evidence tendencies that may conflict with the retrieved context, yet existing evaluation frameworks do not characterize such prior-context discrepancy or measure how verifiers arbitrate between parametric and contextual signals. We introduce PAVE (Prior-Aware Verifier Evaluation), a diagnostic testbed that stratifies an LLM verifier into four epistemic states based on the correctness and confidence of its pre-evidence prior and evaluates its arbitration behavior on this new benchmark, i.e., whether it persists in a correct prior under misleading evidence, and whether it corrects a wrong prior when accurate evidence is provided. Experiments across seven LLMs reveal unreliable and highly model-dependent prior-context arbitration, highlighting the importance of verifier selection for real-world RAG-based fact-checking applications. Based on these findings, we propose a lightweight JSD-based test-time arbitration method that improves factual reliability without modifying the underlying model, achieving competitive performance across diverse LLM families.
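
As a rough illustration of the kind of signal a JSD-based arbitration rule could build on, the sketch below computes the Jensen-Shannon divergence between a verifier's pre-evidence label distribution and its distribution after reading retrieved evidence. The label set, the example probabilities, and the function names are assumptions made for illustration; the abstract does not say how the proposed method actually uses the divergence.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical distributions over (SUPPORTED, REFUTED, NOT ENOUGH INFO).
prior = [0.80, 0.15, 0.05]          # closed-book verdict, before retrieval
with_evidence = [0.10, 0.85, 0.05]  # verdict after reading retrieved evidence

shift = js_divergence(prior, with_evidence)
print(f"prior-context shift: {shift:.3f} bits")
```

A value near 0 means the evidence barely moved the verifier; a value near 1 means the two verdicts are almost disjoint, which is the case where some arbitration rule has to decide which signal to trust.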

2606.03280 2026-06-08 cs.AI 版本更新

A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting

Pythia多跳设置中跨模型激活迁移的负面结果

Peiyan Zhang, Jason Xin

发表机构 * Independent Researcher(独立研究者)

AI总结 研究在Pythia-160M到Pythia-410M的多跳推理设置中,通过线性翻译层传递隐藏状态是否能够改善下游回答,结果发现离线表示对齐不足以实现有用的因果通信。

Comments 16 pages, 6 figures

详情
AI中文摘要

最近的研究表明,语言模型可以通过训练过程中生成数据中的隐藏信号传递行为特征。我们提出一个更直接和严格的渠道是否也可行:一个语言模型能否在推理时通过翻译和注入隐藏激活,而不是通过传递自然语言文本,将有用的中间推理状态传递给另一个模型?我们在一个受控的Pythia-160M到Pythia-410M多跳推理设置中测试了这个问题。线性翻译层学习了发送者和接收者隐藏状态之间的强归一化空间映射,归一化余弦相似度在不同种子下接近0.97。然而,当翻译后的激活在推理时注入接收者时,并没有改善下游回答。低强度加法注入仍接近无注入基线,置信区间跨越零。替换式注入始终具有破坏性,将翻译向量重新缩放到接收者隐藏状态范数也无法挽救性能。因此,这是一个有范围的负面结果:在这种设置中,离线表示对齐不足以在接收者内部实现有用的因果通信。

英文摘要

Recent work shows that language models can transmit behavioural traits through hidden signals in generated data during training. We ask whether a different activation-mediated channel is viable: can one language model communicate a useful intermediate reasoning state to another at inference time through a post-hoc linear activation bridge, rather than through a textual or structured-token relay? We test this question in a controlled Pythia-160M to Pythia-410M multi-hop reasoning setting. A linear translation layer learns a strong normalized-space map between sender and receiver hidden states, with normalized cosine similarity near 0.97 across seeds. However, when the translated activations are injected into the receiver at inference time, they do not improve downstream answering. Low-strength additive injection remains near the no-injection baseline, with confidence intervals that cross zero. Replacement-style injection is consistently destructive, and rescaling translated vectors to the receiver hidden-state norm does not rescue performance. The result is therefore a scoped negative result: in this setting, offline representational alignment is not sufficient for useful causal communication inside the receiver.
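
The three injection variants named in the abstract (low-strength additive, replacement, and norm rescaling) can be sketched on toy data as follows. This is a minimal sketch under stated assumptions: the arrays are synthetic stand-ins for hidden states, the dimensions are toy sizes rather than the real Pythia widths, the translation layer is fit by ordinary least squares, and `inject` with its parameters is a hypothetical helper, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_send, d_recv, n = 8, 12, 200  # toy sizes, not the real model widths

# Synthetic paired hidden states standing in for sender/receiver activations.
S = rng.normal(size=(n, d_send))
R = S @ rng.normal(size=(d_send, d_recv)) + 0.1 * rng.normal(size=(n, d_recv))

# Fit the linear translation layer W by least squares so that R ~ S @ W.
W, *_ = np.linalg.lstsq(S, R, rcond=None)

def inject(h_recv, s_send, mode="add", alpha=0.1, match_norm=False):
    """Combine a receiver hidden state with a translated sender state."""
    t = s_send @ W  # translated sender activation in receiver space
    if match_norm:
        # Rescale the translated vector to the receiver hidden-state norm.
        t = t * (np.linalg.norm(h_recv) / np.linalg.norm(t))
    if mode == "add":
        return h_recv + alpha * t  # low-strength additive injection
    if mode == "replace":
        return t                   # replacement-style injection
    raise ValueError(f"unknown mode: {mode}")

added = inject(R[0], S[0], mode="add", alpha=0.1)
replaced = inject(R[0], S[0], mode="replace", match_norm=True)
```

A good offline fit of `W` only says the two spaces are linearly related on held-out pairs; whether `added` or `replaced` helps the receiver's downstream answer is a separate causal question, which is the gap the abstract reports.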

2606.05342 2026-06-08 cs.AI 版本更新

SentinelBench: A Benchmark for Long-Running Monitoring Agents

SentinelBench: 一个用于长时间运行监控代理的基准测试

Matheus Kunzler Maldaner, Adam Fourney, Amanda Swearngin, Hussein Mozannar, Gagan Bansal, Maya Murad, Rafah Hosn, Saleema Amershi

发表机构 * University of Florida(佛罗里达大学) Microsoft Research, AI Frontiers(微软研究院,人工智能前沿)

AI总结 提出SentinelBench,一个包含10个合成网络环境中100个任务的基准测试,用于评估AI代理在长时间监控任务中的表现,衡量任务完成度、反应时间和资源使用。

Comments 18 pages, 16 figures

详情
AI中文摘要

AI代理越来越多地被要求执行持续数分钟、数小时或更长时间的工作。然而,代理行为的默认模式是连续动作:发出工具调用、刷新页面、搜索替代方案或以其他方式试图强制推进。这对于许多长时间运行的任务来说是一种错误的方法,这些任务更适合采用持续关注的策略。相反,代理应该监控环境,注意到外部事件何时使进展成为可能,然后迅速响应,同时在等待时不浪费资源。为了衡量这类任务的进展,我们引入了SentinelBench,一个用于时间演化监控任务的开源基准测试。SentinelBench包含10个合成网络环境中的100个任务,包括电子邮件、日历、金融、专业网络和娱乐。每个环境都提供一个实时网络界面,并重放一个脚本化的事件序列,要求代理导航和推理状态不断变化的网页。SentinelBench衡量任务完成度、反应时间和资源使用,揭示了响应性与成本之间的权衡。我们报告了三种模型和两种浏览器代理框架的结果,为未来比较建立了性能基线,并展示了代理设计选择如何显著影响关键指标。总之,这些结果表明SentinelBench能够区分代理行为中的有意义差异。

英文摘要

AI agents are increasingly asked to carry out work that spans minutes, hours, or longer. Yet the default model of agent behavior is continuous action: issuing tool calls, refreshing pages, searching for alternatives, or otherwise trying to force progress. This is the wrong approach for many long-running tasks, which are better served by a strategy of sustained attention. Instead, agents should monitor an environment, notice when an external event makes progress possible, then respond promptly without wasting resources while waiting. To measure progress on this class of tasks, we introduce SentinelBench, an open-source benchmark for time-evolving monitoring tasks. SentinelBench contains 100 tasks across 10 synthetic web environments, including email, calendars, finance, professional networking, and entertainment. Each environment exposes a live web interface and replays a scripted sequence of events, requiring agents to navigate and reason about web pages whose state shifts underfoot. SentinelBench measures task completion, reaction time, and resource use, exposing the tradeoff between responsiveness and cost. We report results across three models and two browser-agent harnesses, establishing performance baselines for future comparison and demonstrating how agent design choices can dramatically impact key metrics. Together, these results show that SentinelBench distinguishes meaningful differences in agent behavior.

2606.05682 2026-06-08 cs.AI cs.LG 版本更新

Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation

超越输出匹配:在NVFP4 LLM蒸馏中保留内部几何结构

Fangbo Tu, Junhua Zhao, Chi Liu, Xin Chen, Haifeng Wu, Jian Wan, Srinivasan Manoharan

发表机构 * Fangbo Tu(图方波) Junhua Zhao(赵俊华) Chi Liu(刘驰) Xin Chen(陈新) Haifeng Wu(吴海峰) Jian Wan(万健) Srinivasan Manoharan(曼纳哈兰)

AI总结 针对NVFP4低精度量化蒸馏中仅输出匹配导致内部表示退化的问题,提出CKA-QAD方法,通过CKA引导的层间Gram矩阵对齐保留内部几何结构,提升推理和编码任务准确率。

Comments 13 pages, 1 figure

详情
AI中文摘要

随着大型语言模型越来越多地部署在延迟和成本受限的生产环境中,对低精度推理(包括基于NVFP4的方法)的需求不断增长。量化感知蒸馏(QAD)通过训练量化学生模型以KL散度损失匹配冻结的高精度教师模型的输出分布,帮助恢复低比特量化下的精度损失。在这项工作中,我们首先对QAD进行表示层面的诊断:仅输出匹配可能掩盖内部退化,因为许多中间激活几何结构可以产生相似的教师对齐logits。使用CKA,我们表明仅使用KL的QAD相对于BF16教师模型会降低层间表示相似性,在RL后训练模型中尤其严重。这种退化与推理和编码任务的下游瓶颈相关,表明低比特恢复需要保留内部几何结构,而不仅仅是匹配输出。受此发现启发,我们提出了CKA-QAD,一种用于NVFP4 QAD和低比特LLM精度恢复的CKA引导表示对齐方法。该方法添加了一个轻量级正则化器,通过在蒸馏过程中通过CKA对齐层间Gram矩阵来保留内部表示几何结构。在Nemotron 3 Nano和Qwen3-4B-Thinking-2507上,CKA-QAD显著改善了表示对齐,并以适度的训练开销提高了下游推理和编码精度。我们的发现将CKA引导的表示对齐定位为量化LLM恢复中输出匹配的实用补充。

英文摘要

Demand for low-precision inference, including NVFP4-based approaches, has grown as large language models are increasingly deployed in latency- and cost-constrained production environments. Quantization-aware distillation (QAD) helps recover accuracy lost under low-bit quantization by training a quantized student to match the output distribution of a frozen higher-precision teacher via a KL-divergence loss. In this work, we first provide a representation-level diagnosis of QAD: output matching alone can mask internal degradation, because many intermediate activation geometries can yield similar teacher-aligned logits. Using CKA, we show that KL-only QAD can reduce layerwise representational similarity relative to the BF16 teacher, with especially severe drift in RL-post-trained models. This drift correlates with downstream bottlenecks on reasoning and coding tasks, suggesting that low-bit recovery requires preserving internal geometry rather than matching outputs alone. Motivated by this finding, we propose CKA-QAD, a CKA-guided representational alignment method for NVFP4 QAD and low-bit LLM accuracy recovery. The method adds a lightweight regularizer that preserves internal representational geometry during distillation by aligning layerwise Gram matrices through CKA. Across Nemotron 3 Nano and Qwen3-4B-Thinking-2507, CKA-QAD substantially improves representational alignment and improves downstream reasoning and coding accuracy with modest training overhead. Our findings position CKA-guided representational alignment as a practical complement to output matching for quantized LLM recovery.
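
For readers unfamiliar with CKA, the sketch below computes linear CKA between two activation matrices and shows one way a CKA penalty could be added to a KL distillation loss. The linear-kernel form, the `weight` hyperparameter, and the helper `cka_regularized_loss` are assumptions for illustration; the abstract does not give the exact regularizer used in CKA-QAD.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activations of shape (n_samples, n_features)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

def cka_regularized_loss(kl_loss, teacher_acts, student_acts, weight=0.1):
    """Hypothetical objective: KL term plus a penalty on lost layerwise similarity."""
    penalty = sum(1.0 - linear_cka(t, s) for t, s in zip(teacher_acts, student_acts))
    return kl_loss + weight * penalty / len(teacher_acts)

rng = np.random.default_rng(0)
teacher = rng.normal(size=(64, 32))                    # stand-in teacher layer output
student = teacher + 0.5 * rng.normal(size=(64, 32))    # stand-in drifted student output
print("CKA(teacher, student):", linear_cka(teacher, student))
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling of either representation, so it compares geometry rather than raw coordinates. The feature-space form used here is equivalent to comparing centered linear-kernel Gram matrices.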

2606.05761 2026-06-08 cs.AI cs.CL 版本更新

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

SubtleMemory: 面向长时程AI智能体的细粒度关系记忆辨别基准

Wenxuan Wang, Haoyu Sun, Fukuan Hou, Mingyang Song, Weinan Zhang, Yu Cheng, Yang Yang

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Shanghai AI Laboratory(上海人工智能实验室) Tongji University(同济大学) Xiamen University(厦门大学) Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出SubtleMemory基准,通过构建关系控制的潜在语义伪影并嵌入用户-智能体交互历史,评估长时程AI智能体在后续查询中恢复分布式关系结构的能力。

Comments 48 pages

详情
AI中文摘要

持久性AI助手(如OpenClaw)在长期交互中积累了大量相关记忆。随着这些记忆的增长,它们可能相互强化、在不同上下文中出现分歧或直接冲突,使得正确协助依赖于记忆关系而非孤立回忆。现有的长期记忆基准很少探究智能体在下游任务中如何保留和利用这些关系。为弥补这一空白,我们引入了SubtleMemory,一个用于长运行AI智能体中细粒度关系记忆辨别的基准。SubtleMemory构建了关系控制的潜在语义伪影,其变体实例化互补、细微或矛盾的关系,并将其嵌入到逼真的用户-智能体历史中,要求智能体在后续查询和指令中恢复分布式的关系结构。该基准包含10个长历史中的1,522个评估实例,基于1,090个关系控制的记忆变体集,涵盖用户相关和非用户相关的查询。评估了六个独立记忆系统、两个具有原生记忆模块的Claw式智能体以及三个具有插件记忆模块的Claw式智能体,我们发现当前系统在细粒度关系记忆辨别上仍然薄弱。我们进一步引入了诊断协议,揭示了在记忆保留、检索和下游推理阶段的不同能力特征。

英文摘要

Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.

2502.00225 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Should You Use Your Large Language Model to Explore or Exploit?

你应该使用你的大语言模型进行探索还是利用?

Keegan Harris, Aleksandrs Slivkins

发表机构 * UC Berkeley(加州大学伯克利分校) Microsoft Research(微软研究院)

AI总结 研究当前大语言模型在探索-利用权衡中的决策能力,通过分离探索和利用任务评估其表现,发现推理模型在利用任务上最有潜力但成本高,非推理模型通过工具使用和上下文总结可提升中等难度任务性能,但在所有任务中均不如简单线性回归,然而LLM在具有语义的大动作空间探索中有帮助。

Comments Accepted to UAI 2026

详情
AI中文摘要

我们评估了当前一代大语言模型(LLMs)在面对探索-利用权衡时的决策能力。虽然先前的工作主要研究LLMs解决组合探索-利用任务的能力,我们采取了更系统的方法,将LLMs用于在各种(上下文)赌博机任务中分别进行探索和利用。我们发现推理模型在解决利用任务方面最有前景,尽管它们在实际应用中仍然过于昂贵或缓慢。受此启发,我们研究了非推理模型的工具使用和上下文总结。我们发现这些缓解措施可以显著提高中等难度任务的性能,但即便如此,我们研究的所有LLMs在所有任务中(包括非线性设置)的表现都不如简单的线性回归。另一方面,我们发现LLMs在探索具有内在语义的大动作空间时确实有帮助,通过建议合适的候选动作进行探索。

英文摘要

We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff. While previous work has largely studied the ability of LLMs to solve combined exploration-exploitation tasks, we take a more systematic approach and use LLMs to explore and exploit in silos in various (contextual) bandit tasks. We find that reasoning models show the most promise for solving exploitation tasks, although they are still too expensive or too slow to be used in many practical settings. Motivated by this, we study tool use and in-context summarization using non-reasoning models. We find that these mitigations can substantially improve performance on medium-difficulty tasks; however, even then, all LLMs we study perform worse than a simple linear regression, even in non-linear settings. On the other hand, we find that LLMs do help at exploring large action spaces with inherent semantics, by suggesting suitable candidates to explore.

2505.12239 2026-06-08 cs.LG cs.AI cs.CR 版本更新

Towards Efficient and Exact Forgetting Services in Pre-Trained-Model-based Continual Learning

面向基于预训练模型的持续学习中的高效且精确的遗忘服务

Yajiang Huang, Jianheng Tang, Kejia Fan, Huiping Zhuang, Anfeng Liu, Tian Wang, Yunhuai Liu, Mianxiong Dong, Houbing Herbert Song

发表机构 * Department of Information Systems, University of Maryland, Baltimore County (UMBC)(马里兰大学巴尔的摩分校信息系统系)

AI总结 针对持续学习中顺序遗忘请求的挑战,提出基于解析方法的持续遗忘(ACU),通过最小二乘递归推导闭式解,实现高效精确的遗忘,保护历史数据隐私。

详情
AI中文摘要

在持续学习(CL)中,使用预训练模型(PTM)作为特征提取器已成为一种流行做法。结合解析分类器,基于PTM的方法在CL中实现了最先进的性能,追求非遗忘目标。同时,在大多数服务构建范式(例如移动群智感知(MCS))中,主动遗忘在CL阶段获得的特定知识也至关重要,其中移动边缘节点不断收集传感数据,不仅需要非遗忘适应,还需要特定知识遗忘以保护隐私。因此,当遗忘请求在CL中顺序出现时,产生了一个独特的问题,称为持续遗忘(CU)。然而,现有的遗忘方法专注于单次联合遗忘,在应用于CU时显得非常不足,包括(1)违反CL中的历史数据隐私,以及(2)容易被对抗性频繁请求淹没或降级。为了应对CU的挑战,我们提出了一种无梯度方法,称为解析持续遗忘(ACU),用于在基于PTM的CL中实现高效且精确的遗忘,同时保护历史数据隐私。针对每个遗忘请求,我们的ACU通过最小二乘法以可解释的方式递归推导解析(即闭式)解。通过精心设计,我们的ACU兼容样本级和类别级遗忘请求。理论和实验评估验证了我们的ACU在遗忘有效性、模型保真度和系统效率方面的优越性。

英文摘要

In Continual Learning (CL), using a Pre-Trained Model (PTM) as the feature extractor has become a popular practice. Accompanied by analytic classifiers, PTM-based methods have achieved state-of-the-art performance in CL, in pursuit of the non-forgetting goal. Meanwhile, actively forgetting specific knowledge acquired during the CL phase is also essential in most service construction paradigms, for example, Mobile Crowd Sensing (MCS), where mobile edge nodes continuously collect sensory data and demand not only non-forgetting adaptation but also specific knowledge forgetting for privacy preservation. Thus, a unique problem, called Continual Unlearning (CU), arises when forgetting requests arrive sequentially in CL. However, existing unlearning methods focus on single-shot joint forgetting and prove highly inadequate when applied to CU, with shortcomings including (1) violating historical data privacy in CL and (2) being vulnerable to overload or degradation under adversarially frequent requests. To handle the challenges of CU, we propose a gradient-free approach, called Analytic Continual Unlearning (ACU), for efficient and exact forgetting with historical data privacy preservation in PTM-based CL. In response to each unlearning request, our ACU recursively derives the analytical (i.e., closed-form) solutions via least squares in an interpretable manner. By meticulous design, our ACU is compatible with both sample-level and class-level unlearning requests. The theoretical and experimental evaluations validate our ACU's superiority in unlearning effectiveness, model fidelity, and system efficiency.
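
To make "exact forgetting via recursive least squares" concrete, the sketch below removes samples from a ridge-regression classifier with a Woodbury downdate and checks the result against retraining on the retained data. It is a generic textbook construction on synthetic features, offered as an assumption about the flavor of the approach; the function names, the regularizer `gamma`, and the data are illustrative and not taken from the ACU paper.

```python
import numpy as np

def fit(X, Y, gamma=1.0):
    """Closed-form ridge solution; R is the inverse regularized autocorrelation."""
    R = np.linalg.inv(X.T @ X + gamma * np.eye(X.shape[1]))
    return R, R @ X.T @ Y

def unlearn(R, W, X_u, Y_u):
    """Remove samples (X_u, Y_u) using only R and W, without the retained data."""
    # Woodbury identity for (A - X_u^T X_u)^{-1}, where R = A^{-1}.
    K = np.linalg.inv(np.eye(X_u.shape[0]) - X_u @ R @ X_u.T)
    R_new = R + R @ X_u.T @ K @ X_u @ R
    W_new = W - R_new @ X_u.T @ (Y_u - X_u @ W)
    return R_new, W_new

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))  # stand-in for frozen PTM features
Y = rng.normal(size=(100, 3))   # stand-in for label targets

R, W = fit(X, Y)
R_u, W_u = unlearn(R, W, X[:10], Y[:10])   # forget the first 10 samples
_, W_retrain = fit(X[10:], Y[10:])         # reference: retrain without them
print("matches retraining:", np.allclose(W_u, W_retrain))
```

The update touches only `R`, `W`, and the samples being forgotten, which is why a scheme of this kind needs neither gradients nor access to historical data, and why sequential requests can be handled one after another.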

2505.23437 2026-06-08 cs.LG cs.AI cs.IR 版本更新

Bounded-Abstention Pairwise Learning to Rank

有界弃权成对学习排序

Antonio Ferrara, Andrea Pugnana, Francesco Bonchi, Salvatore Ruggieri

发表机构 * Intesa Sanpaolo AI Research(Intesa Sanpaolo AI研究中心) University of Trento(特伦托大学) University of Pisa(比萨大学)

AI总结 提出一种基于条件风险阈值的成对排序弃权方法,理论刻画最优策略,设计模型无关的插件算法,实验验证有效性。

Comments KDD 2026

详情
AI中文摘要

排序系统影响健康、教育和就业等高风险领域的决策,可能产生重大经济和社会影响,因此集成安全机制至关重要。弃权是一种安全机制,允许算法决策系统将不确定或低置信度的决策推迟给人类专家。虽然弃权主要在分类任务中研究,但其在其他机器学习范式中的应用尚不充分。本文提出一种用于成对学习排序任务的弃权新方法。该方法基于对排序器条件风险设置阈值:当估计风险超过预定义阈值时,系统弃权不做决策。我们的贡献有三方面:最优弃权策略的理论刻画、一个模型无关的插件式算法用于构建弃权排序模型,以及在多个数据集上的全面实证评估,证明了我们方法的有效性。

英文摘要

Ranking systems influence decision-making in high-stakes domains like health, education, and employment, where they can have substantial economic and social impacts. This makes the integration of safety mechanisms essential. One such mechanism is abstention, which enables algorithmic decision-making systems to defer uncertain or low-confidence decisions to human experts. While abstention has been predominantly explored in the context of classification tasks, its application to other machine learning paradigms remains underexplored. In this paper, we introduce a novel method for abstention in pairwise learning-to-rank tasks. Our approach is based on thresholding the ranker's conditional risk: the system abstains from making a decision when the estimated risk exceeds a predefined threshold. Our contributions are threefold: a theoretical characterization of the optimal abstention strategy, a model-agnostic, plug-in algorithm for constructing abstaining ranking models, and a comprehensive empirical evaluation across multiple datasets, demonstrating the effectiveness of our approach.
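
The thresholding rule described above can be sketched for a single pair as follows. The sketch assumes the ranker outputs an estimated probability that item i should precede item j and takes the conditional risk under 0-1 loss to be the probability of the less likely order; the budget-based threshold helper is likewise an assumption about what "bounded" abstention could mean, not the paper's algorithm.

```python
def conditional_risk(p_i_over_j):
    """Estimated 0-1 risk of predicting the more likely order for one pair."""
    return min(p_i_over_j, 1.0 - p_i_over_j)

def pairwise_decision(p_i_over_j, risk_threshold):
    """Return 'i', 'j', or None (abstain and defer to a human expert)."""
    if conditional_risk(p_i_over_j) > risk_threshold:
        return None
    return "i" if p_i_over_j >= 0.5 else "j"

def threshold_for_budget(probs, max_abstain_rate):
    """Pick a threshold so that at most max_abstain_rate of pairs are deferred."""
    risks = sorted((conditional_risk(p) for p in probs), reverse=True)
    k = int(max_abstain_rate * len(risks))  # number of pairs we may defer
    return risks[k] if k < len(risks) else -1.0

probs = [0.5, 0.9, 0.55, 0.99]  # hypothetical plug-in estimates of P(i before j)
tau = threshold_for_budget(probs, max_abstain_rate=0.5)
decisions = [pairwise_decision(p, tau) for p in probs]
print(decisions)
```

Because the rule only needs probability estimates, it can be wrapped around any pairwise ranker, which is the sense in which a plug-in approach is model-agnostic.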

2506.01850 2026-06-08 cs.CV cs.AI cs.LG cs.MM 版本更新

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

MoDA: 面向指令型多模态大语言模型的细粒度视觉定位的调制适配器

Wayner Barrios, Andrés Villa, Juan León Alcázar, SouYoung Jin, Bernard Ghanem

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MoDA调制适配器,通过指令引导的通道级乘法调制增强细粒度视觉定位,在12个基准上对三种MLLM架构取得一致提升,计算开销极小。

Comments Accepted at ICML 2026. Code is available at https://github.com/waybarrios/MoDA

详情
AI中文摘要

多模态大语言模型(MLLMs)通过将预训练的视觉编码器与大语言模型(LLMs)集成,在指令跟随任务中取得了显著成功。然而,现有方法由于视觉补丁表示中的语义纠缠,常常难以实现细粒度的视觉定位,其中单个补丁混合了多个不同的视觉元素,使得模型难以聚焦于指令相关的细节。为了应对这一挑战,我们提出了MoDA(调制适配器),一种轻量级模块,通过指令引导的通道级调制增强视觉定位。与Q-Former等执行加性特征选择的令牌级方法不同,MoDA通过对已对齐特征进行乘法调制在通道级操作,从而实现对每个指令相关嵌入维度的细粒度控制。遵循标准的LLaVA训练协议,MoDA在语言指令与预对齐的视觉特征之间应用交叉注意力,生成动态调制掩码,无需架构修改或额外监督。我们在涵盖视觉问答、视觉中心推理和幻觉检测的12个基准上评估了MoDA,包括最近的2024年基准(MMVP、CV-Bench、MMStar、RealWorldQA),并在三种不同的MLLM架构上进行了测试:LLaVA-1.5、LLaVA-MoRE(2025)和Qwen3-VL(2025)。MoDA在所有三个系列中均取得了一致的提升,在LLaVA-1.5系列的MMVP上提升了+12.0个百分点,在LLaVA-MoRE系列的ScienceQA上提升了+4.8个百分点,在Qwen3-VL上ScienceQA提升了+4.9、RealWorldQA提升了+4.1、GQA提升了+3.8,证实了这些增益在CLIP编码器之外具有泛化性,且计算开销极小(<1% FLOPs)。代码可在https://github.com/waybarrios/MoDA获取。

英文摘要

Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle with fine-grained visual grounding due to semantic entanglement in visual patch representations, where individual patches blend multiple distinct visual elements, making it difficult for models to focus on instruction-relevant details. To address this challenge, we propose MoDA (Modulation Adapter), a lightweight module that enhances visual grounding through instruction-guided channel-wise modulation. Unlike token-level methods such as Q-Former that perform additive feature selection, MoDA operates at the channel level through multiplicative modulation on already-aligned features, enabling fine-grained control over which embedding dimensions are relevant for each instruction. Following the standard LLaVA training protocol, MoDA applies cross-attention between language instructions and pre-aligned visual features, generating dynamic modulation masks without architectural modifications or additional supervision. We evaluate MoDA across 12 benchmarks spanning visual question answering, vision-centric reasoning, and hallucination detection, including recent 2024 benchmarks (MMVP, CV-Bench, MMStar, RealWorldQA), on three distinct MLLM architectures: LLaVA-1.5, LLaVA-MoRE (2025), and Qwen3-VL (2025). MoDA delivers consistent gains across all three families, with +12.0 points on MMVP for the LLaVA-1.5 family and +4.8 points on ScienceQA for the LLaVA-MoRE family, and +4.9 ScienceQA, +4.1 RealWorldQA, and +3.8 GQA on Qwen3-VL, confirming that the gains generalize beyond CLIP-based encoders with minimal overhead (<1% FLOPs). Code is available at https://github.com/waybarrios/MoDA.

2506.11066 2026-06-08 cs.SE cs.AI 版本更新

CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval

CoQuIR:面向代码质量感知信息检索的综合基准

Jiahui Geng, Fengyu Cai, Shaobo Cui, Qing Li, Liangwei Chen, Chenyang Lyu, Haonan Li, Derui Zhu, Walter Pretschner, Heinz Koeppl, Fakhri Karray

发表机构 * Linköping University(林雪平大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) TU Darmstadt(达姆施塔特工业大学) Shanghai Jiao Tong University(上海交通大学) EPFL(洛桑联邦理工学院) University of Groningen(格罗宁根大学) Google Tokyo(谷歌东京) Alibaba Group(阿里巴巴集团) TU Munich(慕尼黑工业大学)

AI总结 提出首个大规模多语言代码质量感知检索基准CoQuIR,涵盖正确性、效率、安全性和可维护性四维度,通过细粒度标注和两个质量中心指标评估23个模型,发现顶尖模型常无法区分有缺陷代码,并探索了训练方法以提升质量感知能力。

详情
AI中文摘要

代码检索在现代软件开发中至关重要,因为它能促进代码复用并加速调试。然而,当前的基准主要强调功能相关性,而忽视了软件质量的关键维度。受此差距启发,我们引入了CoQuIR,这是首个大规模、多语言的基准,专门设计用于评估跨四个关键维度(正确性、效率、安全性和可维护性)的质量感知代码检索。CoQuIR为11种编程语言的42,725个查询和134,907个代码片段提供了细粒度的质量注释,并附带两个以质量为中心的评估指标:成对偏好准确率和基于边界的排名分数。利用CoQuIR,我们对23个检索模型(涵盖开源和专有系统)进行了基准测试,发现即使是最先进的模型也常常无法区分有缺陷或不安全的代码与其更健壮的对应代码。此外,我们初步研究了明确鼓励检索器识别代码质量的训练方法。使用合成数据集,我们展示了在各种模型上质量感知指标的显著改进,而不牺牲语义相关性。下游代码生成实验进一步验证了我们方法的有效性。总体而言,我们的工作强调了将质量信号整合到代码检索系统中的重要性,为更可信和更健壮的软件开发工具奠定了基础。

英文摘要

Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across four key dimensions: correctness, efficiency, security, and maintainability. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages, and is accompanied by two quality-centric evaluation metrics: Pairwise Preference Accuracy and Margin-based Ranking Score. Using CoQuIR, we benchmark 23 retrieval models, covering both open-source and proprietary systems, and find that even top-performing models frequently fail to distinguish buggy or insecure code from its more robust counterparts. Furthermore, we conduct preliminary investigations into training methods that explicitly encourage retrievers to recognize code quality. Using synthetic datasets, we demonstrate promising improvements in quality-aware metrics across various models, without sacrificing semantic relevance. Downstream code generation experiments further validate the effectiveness of our approach. Overall, our work highlights the importance of integrating quality signals into code retrieval systems, laying the groundwork for more trustworthy and robust software development tools.

2507.01548 2026-06-08 cs.HC cs.AI cs.CL 版本更新

Telling stories, making Hanzi: AI-assisted co-creation with elderly migrants in urban China

讲述故事,创造汉字:人工智能辅助中国城市老年移民的协同创作

Yunfei Chen, Wen Zhan, Peiyue Lin, Ziqun Hua, Ying Hu

发表机构 * School of Design, Hunan University(湖南大学设计学院) Royal College of Art(皇家艺术学院) University of the Arts London, Central Saint Martins(伦敦艺术大学,中央圣马丁学院)

AI总结 通过协同创作工作坊,结合口述故事、AI辅助和手工制作,让老年移民创造新汉字以记录被忽视的生活故事,揭示参与者的异质性和适应能力,并展示AI作为降低表达门槛的创意启动器。

详情
AI中文摘要

本文探讨了中国城市老年移民如何记录日常语言和设计常忽略的故事。我们与10位老年人开展了两次协同创作工作坊。活动结合了口述故事、主持人中介的AI辅助和手工制作。大型语言模型通过主持人提出候选字形。参与者创作了新的汉字来承载他们的故事。生成的字符作为记忆锚点,用于后续的分享和复述。我们的解释性分析揭示了参与者之间的异质性和适应能力。参与者将AI视为降低表达和创作门槛的创意启动器,尤其对数字素养较低者。这项工作挑战了关于老年人的同质化假设以及统一能力和需求的预设。我们贡献了一个将AI定位为后台促进者的工作坊框架,并提供了在包容性城市系统中将老年移民视为社区记忆和情境文化知识来源的见解。

英文摘要

This paper explores how older migrants in urban China can record stories that everyday language and design often miss. We ran two co-creation workshops with 10 elders. Activities combined oral storytelling, facilitator-mediated AI assistance, and hand-making. Large language models proposed candidate glyphs through a facilitator. Participants crafted new Hanzi to hold their stories. The resulting characters served as memory anchors for later sharing and retelling. Our interpretive analysis shows heterogeneity and adaptive capacity among participants. Participants experienced AI as a creative initiator that lowered barriers to expression and making, especially for those with lower digital literacy. The work challenges homogenizing assumptions about older adults and the presumption of uniform capacities and needs. We contribute a workshop framework that positions AI as a backstage facilitator. We also offer insights on engaging older migrants as sources of community memory and situated cultural knowledge within inclusive urban systems.

2508.17693 2026-06-08 cs.DB cs.AI cs.CL 版本更新

Database Normalization via Dual-LLM Self-Refinement

通过双LLM自精炼的数据库规范化

Eunjae Jo, Nakyung Lee, Gyuyeong Kim

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Miffie框架,利用双模型自精炼架构和大语言模型实现数据库自动规范化,无需人工干预且保持高准确率。

Comments 7 pages

详情
AI中文摘要

数据库规范化对于保持数据完整性至关重要。然而,它通常由数据工程师手动执行,既耗时又容易出错。为此,我们提出了Miffie,一个利用大语言模型能力的数据库规范化框架。Miffie实现了无需人工努力的自动化数据规范化,同时保持高准确性。Miffie的核心是一种双模型自精炼架构,分别结合了性能最佳的模型用于规范化模式生成和验证。生成模块根据验证模块的反馈消除异常,直到输出模式满足规范化要求。我们还精心设计了任务特定的零样本提示,以引导模型实现高准确性和成本效率。实验结果表明,Miffie能够在保持高准确性的同时规范化复杂的数据库模式。

英文摘要

Database normalization is crucial to preserving data integrity. However, it is time-consuming and error-prone, as it is typically performed manually by data engineers. To this end, we present Miffie, a database normalization framework that leverages the capability of large language models. Miffie enables automated data normalization without human effort while preserving high accuracy. The core of Miffie is a dual-model self-refinement architecture that combines the best-performing models for normalized schema generation and verification, respectively. The generation module eliminates anomalies based on the feedback of the verification module until the output schema satisfies the requirement for normalization. We also carefully design task-specific zero-shot prompts to guide the models for achieving both high accuracy and cost efficiency. Experimental results show that Miffie can normalize complex database schemas while maintaining high accuracy.

2508.17821 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Limitations of Normalization in Attention Mechanism

注意力机制中归一化的局限性

Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

发表机构 * University of Luxembourg(卢森堡大学) London Institute for Mathematical Sciences(伦敦数学科学研究所)

AI总结 本文通过理论框架和GPT-2实验,揭示softmax归一化导致注意力随选择token数增加而趋于均匀,并分析低温度下梯度敏感性带来的训练挑战。

详情
AI中文摘要

本文研究了注意力机制中归一化的局限性。我们首先建立了一个理论框架,用于识别模型的选择能力以及token选择中涉及的几何分离。我们的分析包括在softmax缩放下token向量距离和分离准则的显式界限。通过使用预训练的GPT-2模型进行实验,我们实证验证了理论结果,并分析了注意力机制的关键行为。值得注意的是,我们证明随着所选token数量的增加,模型区分信息性token的能力下降,通常趋向于均匀选择模式。我们还表明,softmax归一化下的梯度敏感性在训练过程中带来了挑战,尤其是在低温度设置下。这些发现推进了当前对基于softmax的注意力机制的理解,并激发了在未来注意力架构中需要更稳健的归一化和选择策略的需求。

英文摘要

This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
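
A small numerical sketch of the dilution effect: with a bounded logit gap, the softmax weight on a single informative token shrinks toward the uniform value as more tokens compete for the same normalized mass. The setup (one token with logit `gap`, all others at 0) and the function names are assumptions chosen for illustration, not the paper's experimental protocol.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_weight(n_tokens, gap=2.0, temperature=1.0):
    """Weight on one informative token whose logit exceeds the others by `gap`."""
    logits = [gap] + [0.0] * (n_tokens - 1)
    return softmax(logits, temperature)[0]

for n in (4, 64, 1024):
    print(n, round(top_weight(n), 4), "uniform would be", round(1 / n, 4))
```

In this toy setup the weight has the closed form e^gap / (e^gap + n - 1), so it decays roughly like 1/n once n is large relative to e^gap. Lowering the temperature sharpens the distribution again, at the cost of the gradient sensitivity the abstract mentions.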

2509.04991 2026-06-08 physics.ao-ph cs.AI cs.LG 版本更新

A Mechanism-Coupled Split Window Network for Medium- to High-Resolution Land Surface Temperature Retrieval

一种面向中高分辨率地表温度反演的机制耦合分裂窗网络

Tian Xie, Menghui Jiang, Chao Zeng, Huifang Li, Guanhao Zhang, Chan Li, Huanfeng Shen

发表机构 * School of Resource and Environmental Sciences, Wuhan University(武汉大学资源与环境科学学院) Key Laboratory of Geographic Information System of Ministry of Education, Wuhan(教育部地理信息系统重点实验室) Key Laboratory of Digital Cartography and Land Information Application of the Ministry of Natural Resources, Wuhan(自然资源部数字测图与土地信息应用重点实验室)

AI总结 提出并行分量解耦神经网络(PCD-Net),将分裂窗反演重构为物理分量系数的动态学习问题,通过分量级解耦建模和残差分支,实现复杂大气和地表条件下的高精度、鲁棒且全局可泛化的地表温度反演。

详情
AI中文摘要

地表温度(LST)是陆-气相互作用、地表能量收支和气候过程中的基本物理变量。从中高分辨率热红外(TIR)观测中获取的LST能有效揭示不同景观单元间的热环境差异。然而,在复杂大气条件和多样土地覆盖类型下,实现准确、鲁棒且全局可泛化的LST反演仍具挑战。传统分裂窗(SW)算法严重依赖经验参数化,其固定系数无法适应高温地表和高大气水汽含量等复杂场景。同时,传统数据驱动模型因缺乏显式物理结构约束,对分布外(OOD)样本的泛化能力有限。为解决这些问题,本研究提出并行分量解耦神经网络(PCD-Net)框架,将SW反演重构为物理分量系数的动态学习问题。以SW方程作为物理主干,该框架构建并行子网络,自适应学习对应常数项、一阶和二阶亮度温度差项的动态系数;同时引入残差分支,补充由地表发射率和大气水汽联合效应引起的非线性耦合校正。通过这种分量级解耦建模,PCD-Net显式刻画了地表发射率、大气水汽含量与不同SW物理分量之间的动态响应关系。

英文摘要

Land surface temperature (LST) is a fundamental physical variable in land-atmosphere interactions, surface energy budgets, and climate processes. LST derived from medium- to high-resolution thermal infrared (TIR) observations effectively reveals thermal environmental disparities across distinct landscape units. However, achieving accurate, robust, and globally generalizable LST retrieval remains challenging under complex atmospheric conditions and diverse land cover types. Traditional split window (SW) algorithms heavily rely on empirical parameterizations, whose fixed coefficients fail to adapt to complex scenarios such as high surface temperatures and high atmospheric water vapor content. Concurrently, conventional data-driven models exhibit limited generalizability to out-of-distribution (OOD) samples due to the absence of explicit physical structure constraints. To address these issues, this study proposes a Parallel Component Decoupled Neural Network (PCD-Net) framework, which reformulates SW retrieval as a dynamic learning problem of physical component coefficients. Using the SW equation as the physical backbone, the framework constructs parallel subnetworks to adaptively learn the dynamic coefficients corresponding to the constant, first-order, and second-order brightness temperature difference terms; meanwhile, a residual branch is incorporated to supplement the nonlinear coupling corrections induced by the joint effects of surface emissivity and atmospheric water vapor. Through this component-level decoupled modeling, PCD-Net explicitly characterizes the dynamic response relationships between land surface emissivity, atmospheric water vapor content, and different SW physical components.

2509.05316 2026-06-08 cs.LG cs.AI 版本更新

Standard vs. Modular Sampling: Best Practices for Reliable LLM Unlearning

标准采样与模块化采样:可靠的大语言模型遗忘的最佳实践

Praveen Bushipaka, Lucia Passaro, Tommaso Cucinotta

发表机构 * Scuola Superiore Sant’Anna(圣安纳高等学院) University of Pisa(比萨大学)

AI总结 针对大语言模型遗忘中采样策略的不足,提出模块化实体级遗忘(MELU)策略,通过多样化邻居集和模块化采样平衡遗忘效果与模型效用。

详情
AI中文摘要

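传统的大语言模型遗忘设置包含两个子集,即“遗忘集”和“保留集”,目标是从遗忘集中移除不需要的知识,同时保留保留集中的其余知识。在以隐私为中心的遗忘研究中,保留集通常被进一步划分为邻居集(包含与遗忘目标直接或间接相关的数据),并辅以一个通用知识集。现有基准的常见做法是仅使用单个邻居集加通用知识,这无法反映真实世界数据的复杂性和关联关系。大语言模型遗忘通常采用1:1采样或循环迭代采样,然而这些事实标准的有效性和稳定性尚未得到严格检验。在本研究中,我们系统地评估了这些常见做法。结果表明,依赖单个邻居集并非最优,且标准采样方法可能掩盖性能权衡。基于这一分析,我们提出并验证了一组初步的最佳实践:(1)引入多样化的邻居集,以平衡遗忘效果与模型效用;(2)避免使用标准的1:1采样方法,因为其效率低且效果差;(3)以我们提出的模块化实体级遗忘(MELU)策略替代循环采样。我们证明,这种模块化方法与鲁棒的算法相结合,为实现有效遗忘提供了一条清晰而稳定的路径。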

英文摘要

A conventional LLM unlearning setting consists of two subsets, "forget" and "retain", with the objective of removing undesired knowledge from the forget set while preserving the remaining knowledge in the retain set. In privacy-focused unlearning research, the retain set is often further divided into neighbor sets, containing data either directly or indirectly connected to the forget targets, and augmented by a general-knowledge set. A common practice in existing benchmarks is to employ only a single neighbor set together with general knowledge, which fails to reflect real-world data complexities and relationships. LLM unlearning typically involves 1:1 sampling or cyclic iteration sampling. However, the efficacy and stability of these de facto standards have not been critically examined. In this study, we systematically evaluate these common practices. Our findings reveal that relying on a single neighbor set is suboptimal and that a standard sampling approach can obscure performance trade-offs. Based on this analysis, we propose and validate an initial set of best practices: (1) incorporate diverse neighbor sets to balance forget efficacy and model utility; (2) avoid standard 1:1 sampling, which is inefficient and yields poor results; and (3) use our proposed Modular Entity-Level Unlearning (MELU) strategy as an alternative to cyclic sampling. We demonstrate that this modular approach, combined with robust algorithms, provides a clear and stable path towards effective unlearning.

2509.24935 2026-06-08 cs.CV cs.AI cs.LG 版本更新

Scalable GANs with Transformers

可扩展的Transformer生成对抗网络

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

发表机构 * KAIST(韩国科学技术院)

AI总结 本文通过紧凑变分自编码器潜在空间和纯Transformer架构,研究了生成对抗网络的可扩展性,并提出了轻量级中间监督和宽度自适应学习率调整来解决缩放时的失败模式,在ImageNet-256上以40个epoch达到2.96的FID。

Comments ICML 2026

详情
AI中文摘要

可扩展性推动了生成建模的最新进展,但其原理在对抗学习中仍未充分探索。我们通过两个在其他类型生成模型中被证明有效的设计选择来研究生成对抗网络(GAN)的可扩展性:在紧凑的变分自编码器潜在空间中训练,以及采用纯Transformer的生成器和判别器。在潜在空间中训练能够在保持感知保真度的同时实现高效计算,而这种效率与普通Transformer自然匹配,后者的性能随计算预算扩展。基于这些选择,我们分析了朴素缩放GAN时出现的失败模式。具体来说,我们发现了随着网络规模扩大,生成器早期层利用不足和优化不稳定的问题。因此,我们提供了简单且对缩放友好的解决方案,如轻量级中间监督和宽度自适应学习率调整。我们的实验表明,GAT——一种纯Transformer的潜在空间GAN——能够在从S到XL的广泛容量范围内可靠地训练。此外,GAT-XL/2在ImageNet-256上仅用40个epoch(比强基线少6倍)就达到了最先进的单步类条件生成性能(FID为2.96)。项目页面:https://hse1032.github.io/GAT。

英文摘要

Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues such as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions such as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based, latent-space GAN, can be trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines. Project page: https://hse1032.github.io/GAT.

2510.03381 2026-06-08 cs.LG cs.AI 版本更新

Proxy Reconstruction Pre-training for Ramp Flow Prediction at Highway Interchanges

高速公路立交匝道流量预测的代理重建预训练

Yongchao Li, Jun Chen, Zhuoxuan Li, Chao Gao, Yang Li, Chu Zhang, Changyin Dong

发表机构 * Southeast University(东南大学) Institute of Telecommunications and Information Sciences, China(中国电信与信息科学研究院)

AI总结 提出时空解耦自编码器(STDAE),通过跨模态重建预训练从主线数据恢复匝道流量,结合GWNet等模型提升预测精度,在真实数据集上超越13个基线。

Comments Accepted at Applied Soft Computing Journal

详情
Journal ref
Applied Soft Computing Journal 200 (2026) 115462
AI中文摘要

立交桥是高速公路间车辆转换的关键节点,但缺乏实时匝道检测器导致交通预测存在盲区。为解决这一问题,我们提出时空解耦自编码器(STDAE),一种利用跨模态重建预训练的两阶段框架。在第一阶段,STDAE从主线数据重建历史匝道流量,迫使模型捕捉内在的时空关系。其解耦架构通过并行的空间和时间自编码器高效提取异质特征。在预测阶段,学习到的表示与GWNet等模型集成以提高准确性。在三个真实立交数据集上的实验表明,STDAE-GWNET始终优于十三个最先进的基线,并达到与使用历史匝道数据的模型相当的性能。这证明了其在克服检测器稀缺方面的有效性及其在不同预测流程中的即插即用潜力。

英文摘要

Interchanges are crucial nodes for vehicle transfers between highways, yet the lack of real-time ramp detectors creates blind spots in traffic prediction. To address this, we propose a Spatio-Temporal Decoupled Autoencoder (STDAE), a two-stage framework that leverages cross-modal reconstruction pretraining. In the first stage, STDAE reconstructs historical ramp flows from mainline data, forcing the model to capture intrinsic spatio-temporal relations. Its decoupled architecture with parallel spatial and temporal autoencoders efficiently extracts heterogeneous features. In the prediction stage, the learned representations are integrated with models such as GWNet to enhance accuracy. Experiments on three real-world interchange datasets show that STDAE-GWNET consistently outperforms thirteen state-of-the-art baselines and achieves performance comparable to models using historical ramp data. This demonstrates its effectiveness in overcoming detector scarcity and its plug-and-play potential for diverse forecasting pipelines.

2510.07315 2026-06-08 cs.CL cs.AI cs.LG cs.SE 版本更新

SWE-IF: Aligning Code Evaluation with Human Preference

SWE-IF: 使代码评估与人类偏好对齐

Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun

发表机构 * Google DeepMind(谷歌DeepMind)

AI总结 提出SWE-IF基准,通过可验证指令分类法VeriCode评估代码指令遵循能力,发现指令遵循是区分LLM代码质量的关键,与功能正确性结合更能匹配人类偏好。

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLM)推动了vibe coding,用户通过自然语言交互利用LLM生成并迭代优化代码,直到通过其vibe检查。Vibe检查反映了人类偏好,超越了功能性:解决方案应感觉正确、阅读清晰、保留意图并保持正确。然而,当前的代码评估仍局限于pass@k,仅捕获功能正确性,忽略了用户常规应用的非功能性指令。在本文中,我们假设指令遵循是vibe检查中除功能正确性之外缺失的部分。为了用量化信号衡量模型的代码指令遵循能力,我们提出了VeriCode,一个包含30条可验证代码指令及其确定性验证器的分类法。我们使用该分类法增强现有评估套件,得到SWE-IF,一个评估指令遵循和功能正确性的测试平台。评估31个LLM,我们发现即使最强的模型也难以遵守多条指令,并表现出功能回归。最重要的是,功能正确性和指令遵循的复合得分与人类偏好相关性最强,其中指令遵循成为LLM之间的主要区分因素。我们的代码、数据和分类法可在https://github.com/maszhongming/SWE-IF获取。

英文摘要

Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check besides functional correctness. To quantify models' code instruction-following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in SWE-IF, a testbed to assess both instruction following and functional correctness. Evaluating 31 LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator among LLMs. Our code, data, and taxonomy are available at https://github.com/maszhongming/SWE-IF.
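To make the idea of a "verifiable code instruction with a deterministic verifier" concrete, here is a minimal sketch. The two instructions (a line-length limit and snake_case function names), the verifier names, and the averaging rule for the composite score are all hypothetical illustrations; the abstract does not list VeriCode's 30 instructions or state how the composite score is computed.

```python
import re


def verify_max_line_length(code: str, limit: int = 79) -> bool:
    """Hypothetical instruction: no line may exceed `limit` characters."""
    return all(len(line) <= limit for line in code.splitlines())


def verify_snake_case_functions(code: str) -> bool:
    """Hypothetical instruction: every function name must be snake_case."""
    names = re.findall(r"^\s*def\s+([A-Za-z_][A-Za-z0-9_]*)", code, flags=re.MULTILINE)
    return all(re.fullmatch(r"[a-z_][a-z0-9_]*", name) for name in names)


def instruction_following_rate(code: str, verifiers) -> float:
    """Fraction of instructions whose verifier accepts the code."""
    results = [verifier(code) for verifier in verifiers]
    return sum(results) / len(results)


def composite_score(passed_tests: bool, if_rate: float) -> float:
    """Assumed composite: plain average of functional correctness (0 or 1)
    and the instruction-following rate."""
    return 0.5 * (float(passed_tests) + if_rate)


sample = "def addNumbers(a, b):\n    return a + b\n"
rate = instruction_following_rate(
    sample, [verify_max_line_length, verify_snake_case_functions]
)
score = composite_score(True, rate)
```

Because each verifier is a pure function of the source text, the same submission always receives the same verdict, which is what distinguishes this kind of check from a judgment made by a model or a human rater.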

2510.09041 2026-06-08 cs.LG cs.AI 版本更新

Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach

自动驾驶鲁棒控制:一种智能一般和约束对抗强化学习方法

Junchao Fan, Qi Wei, Ruichen Zhang, Yang Lu, Jianhua Wang, Xiaolin Chang, Bo Ai

发表机构 * Beijing Key Laboratory of Security and Privacy in Intelligent Transportation(北京智能交通安全与隐私重点实验室) Beijing Jiaotong University(北京交通大学) College of Computing and Data Science(计算与数据科学学院) Nanyang Technological University(南洋理工大学) School of Computer Science and Technology(计算机科学与技术学院) Taiyuan University of Technology(太原理工大学) School of Electronics and Information Engineering(电子与信息工程学院)

AI总结 针对深度强化学习在自动驾驶中易受对抗攻击的问题,提出智能一般和约束对抗强化学习(IGCARL),通过战略性目标对手和鲁棒驾驶代理的交互训练,在约束优化下提升策略稳定性,实验表明成功率比现有方法提高至少27.9%。

详情
AI中文摘要

深度强化学习(DRL)在开发自动驾驶策略方面取得了显著成功。然而,其对对抗攻击的脆弱性仍然是实际部署的关键障碍。尽管现有的鲁棒方法已取得一定成功,但它们仍面临三个关键问题:(i)这些方法针对短视的对抗攻击进行训练,限制了它们应对更具战略性威胁的能力;(ii)它们难以引发真正安全关键的事件(例如碰撞),反而常常导致轻微后果;(iii)由于缺乏鲁棒约束,这些方法在训练过程中可能导致学习不稳定和策略漂移。为了解决这些问题,我们提出了智能一般和约束对抗强化学习(IGCARL),一种新颖的鲁棒自动驾驶方法,包括一个战略性目标对手和一个鲁棒驾驶代理。战略性目标对手被设计为利用DRL的时间决策能力来执行策略协调的多步攻击。此外,它通过采用一般和目标明确地专注于引发安全关键事件。鲁棒驾驶代理通过与对手交互学习,以发展出对抗攻击的鲁棒自动驾驶策略。为了确保对抗环境中的稳定学习并减轻攻击引起的策略漂移,代理在约束公式下进行优化。大量实验表明,IGCARL相比现有最先进方法将成功率提高了至少27.9%,展示了对抗攻击的卓越鲁棒性,并增强了基于DRL的自动驾驶的安全性和可靠性。

英文摘要

Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved success, they still suffer from three key issues: (i) these methods are trained against myopic adversarial attacks, limiting their abilities to respond to more strategic threats, (ii) they have trouble causing truly safety-critical events (e.g., collisions), but instead often result in minor consequences, and (iii) these methods can introduce learning instability and policy drift during training due to the lack of robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.

2510.11014 2026-06-08 cs.RO cs.AI cs.CV 版本更新

MatterDoor: Sampling Zero-shot Spatio-semantic Priors using Generative Models

MatterDoor: 使用生成模型采样零样本空间语义先验

Subhransu S. Bhattacharjee, Hao Lu, Dylan Campbell, Rahul Shome

发表机构 * School of Computing, Australian National University(澳大利亚国立大学计算机学院)

AI总结 针对机器人通过门缝观察时场景结构缺失的问题,提出MatterDoor方法,利用预训练生成模型(VLM引导外推、单目深度估计、语义分割)采样隐藏房间的语义3D点云先验,在Matterport3D基准上验证了零样本空间语义先验的有效性。

Comments Under Review

详情
AI中文摘要

自主机器人通常只能通过门缝部分观察房间,墙壁和场景结构隐藏了安全导航和目标导向行动所需的几何和任务相关语义。我们询问现成的预训练生成视觉模型能否将这些缺失结构作为零样本离线先验用于机器人推理。此类先验应支持对未观察结构的空间语义查询,估计隐藏区域中的目标物体似然以及这些区域被占用的概率。给定一个以自我为中心的RGB观测和目标查询,我们的流程使用VLM引导的外推、单目深度估计和语义分割来采样隐藏房间的语义标注3D点云假设。我们引入了MatterDoor,一个源自Matterport3D的门遮挡室内场景基准,并使用生成指标和模拟Stretch机器人目标到达任务评估所得先验。我们的结果表明,无需针对特定问题微调即可推导出对规划有用的空间语义先验。

英文摘要

Autonomous robots often view rooms only partially, through a doorway, where the walls and scene structure hide the geometry and task-relevant semantics needed for safe navigation and goal-directed action. We ask whether off-the-shelf pretrained generative vision models can derive this missing structure as zero-shot offline priors for robot reasoning. Such priors should support spatio-semantic queries over unobserved structure, estimating the target object likelihood in hidden regions and the probability that those regions are occupied. Given an egocentric RGB observation and target query, our pipeline uses VLM-guided outpainting, monocular depth estimation, and semantic segmentation to sample semantically labeled 3D point cloud hypotheses of the hidden room. We introduce MatterDoor, a Matterport3D-derived benchmark of doorway-occluded indoor scenes, and evaluate the resulting priors with generative metrics and simulated Stretch robot object-reaching tasks. Our results suggest that useful spatio-semantic priors for planning can be derived without problem-specific fine-tuning.

2510.17004 2026-06-08 cs.MA cs.AI 版本更新

ReclAIm: A Multi-Agent Framework for Monitoring and Correcting Performance Decline in Medical Imaging AI

ReclAIm:用于监测和纠正医学影像AI性能下降的多智能体框架

Eleftherios Tzanis, Michail E. Klontzas

发表机构 * Artificial Intelligence and Translational Imaging (ATI) Lab, Department of Radiology, School of Medicine, University of Crete(人工智能与转化成像实验室,放射科,医学院,希腊克里特大学) Computational Biomedicine Laboratory, Institute of Computer Science Foundation for Research and Technology Hellas (ICS - FORTH), Heraklion, Crete, Greece(计算生物医学实验室,希腊基础研究与技术院计算机科学研究所(ICS - FORTH),克里特,希腊) Division of Radiology, Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institute, Huddinge, Sweden(放射科,临床科学、干预与技术部(CLINTEC),卡罗林斯卡研究所,瑞典Huddinge)

AI总结 提出基于大语言模型的多智能体框架ReclAIm,通过自然语言交互自动监测医学图像分类模型性能下降并触发微调,采用数据增强、类别不平衡处理和参数锚定正则化策略,在多个数据集上验证了有效性。

Comments Published in Radiology: Artificial Intelligence (https://doi.org/10.1148/ryai.250923)

详情
AI中文摘要

目的:开发并评估一个用于自动监测、检测和纠正医学图像分类模型性能下降的多智能体框架(ReclAIm)。材料与方法:ReclAIm是一个基于大语言模型的多智能体系统,通过自然语言交互运行。一个主智能体协调三个任务特定智能体,执行性能评估并在检测到显著性能下降时触发微调。微调流程包含数据增强、类别不平衡处理以及参数锚定正则化策略以限制灾难性遗忘。该系统使用多个影像数据集进行基准测试,包括脑部MRI、胸部CT和胸部X光片,按模型开发、推理(性能监测)和微调子集划分(60%:20%:20%)。结果:ReclAIm成功协调了所有数据集的训练、评估和性能监测。在18个模型中的8个中检测到测试数据与推理数据之间的性能差异,触发了微调流程以减少性能差距。在性能下降高达40.6%的情况下(心脏肥大数据集,InceptionV3),微调将性能指标恢复至基线值的2%以内。结论:ReclAIm为医学图像分类模型的自动监测和定向微调提供了一个原型框架,其自然语言接口旨在支持研究及潜在临床应用的可及性。

英文摘要

Purpose: To develop and evaluate a multi-agent framework (ReclAIm) for automated monitoring, detection, and correction of performance decline in medical image classification models. Materials and Methods: ReclAIm is a large language model-based multi-agent system that operates through natural language interaction. A master agent coordinating three task-specific agents performed performance evaluation and triggered fine-tuning when substantial performance declines were detected. The fine-tuning workflow incorporated data augmentation, class imbalance handling, and a parameter-anchoring regularization strategy to limit catastrophic forgetting. The system was benchmarked using multiple imaging datasets, including brain MRI, chest CT, and chest radiography, partitioned into model development, inference (performance monitoring), and fine-tuning subsets (60%:20%:20%). Results: ReclAIm successfully orchestrated training, evaluation, and performance monitoring across all datasets. Performance discrepancies between test and inference data were detected in 8 of 18 models, prompting fine-tuning workflows that reduced performance gaps. In cases with declines of up to 40.6% (cardiomegaly dataset, InceptionV3), fine-tuning restored performance metrics to within 2% of baseline values. Conclusion: ReclAIm provides a prototype framework for automated monitoring and targeted fine-tuning of medical image classification models, with a natural language interface designed to support accessibility in research and potential clinical applications.
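The monitoring-then-fine-tuning loop described above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the 5% trigger threshold, the function names, and the choice of a squared-distance penalty as the "parameter-anchoring" regularizer (a common form of anchoring, not necessarily the one ReclAIm uses).

```python
def relative_decline(baseline: float, current: float) -> float:
    """Relative drop of a metric from its baseline (0.1 means a 10% drop)."""
    return (baseline - current) / baseline


def needs_fine_tuning(baseline: float, current: float, threshold: float = 0.05) -> bool:
    """Trigger fine-tuning when the relative drop exceeds a hypothetical threshold."""
    return relative_decline(baseline, current) > threshold


def anchored_loss(task_loss, params, anchor_params, strength=0.01):
    """Task loss plus a squared-distance penalty that pulls the weights
    back towards their pre-fine-tuning values (assumed anchoring form)."""
    penalty = sum((p - a) ** 2 for p, a in zip(params, anchor_params))
    return task_loss + strength * penalty


small_drop = needs_fine_tuning(baseline=0.90, current=0.88)
large_drop = needs_fine_tuning(baseline=0.90, current=0.60)
loss = anchored_loss(1.0, [1.0, 2.0], [1.0, 1.0], strength=0.5)
```

The anchoring term is what limits catastrophic forgetting in this sketch: weights that drift far from the original model are penalized, so fine-tuning on the new data cannot freely overwrite what was learned before.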

2510.24561 2026-06-08 cs.LG cs.AI 版本更新

LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis

LoRA-DA:基于渐近分析的低秩自适应数据感知初始化

Qingyue Zhang, Chang Chu, Tianren Peng, Qi Li, Xiangyang Luo, Zhihao Jiang, Shao-Lun Huang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出LoRA-DA,通过渐近分析优化低秩自适应初始化,结合Fisher梯度与Fisher信息最小化参数偏差,提升微调准确率与收敛稳定性。

Comments Published at ICML 2026

详情
AI中文摘要

LoRA已成为广泛采用的PEFT方法,其初始化方法日益受到关注。然而,现有方法存在显著局限性:许多方法未纳入目标域数据,而基于梯度的方法仅通过依赖一步梯度分解在浅层利用数据。本文建立了数据感知LoRA初始化的理论框架。从最小化微调模型与目标模型之间参数偏差的期望出发,我们推导出一个包含两项的优化问题:偏差项,与微调模型和目标模型之间的参数距离相关,并使用Fisher梯度公式近似以保持各向异性;方差项,通过Fisher信息考虑采样随机性引入的不确定性。求解该问题得到LoRA的最优初始化策略,基于此我们开发了高效算法LoRA-DA。跨多个基准的实验结果表明,LoRA-DA在最终准确率上持续优于现有初始化方法。附加研究显示其收敛更快、更稳定,跨秩鲁棒性强,且初始化开销小。源代码见https://github.com/zqy0126/LoRA-DA。

英文摘要

LoRA has become a widely adopted method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition. In this paper, we establish a theoretical framework for data-aware LoRA initialization. Starting from minimizing the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. Solving this problem yields an optimal initialization strategy for LoRA, based on which we develop an efficient algorithm, LoRA-DA. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code is available at https://github.com/zqy0126/LoRA-DA.

2510.26714 2026-06-08 cs.LG cs.AI 版本更新

On the importance of multiple training seeds for evaluating machine unlearning

关于多个训练种子在评估机器遗忘中的重要性

Jamie Lanyon, Axel Finke, Petros Andreou, Georgina Cosma

发表机构 * Department of Computer Science(计算机科学系) School of Mathematics(数学学院) School of Science(科学学院) Statistics and Physics(统计学与物理学) Loughborough University(拉夫堡大学) Newcastle University(纽卡斯尔大学)

AI总结 本文指出评估机器遗忘算法时仅使用单个训练种子可能导致结果不具代表性,并通过图像分类、联邦学习排序和大语言模型实验验证了问题普遍性,最后给出选择训练和遗忘种子数量的指导。

Comments mini paper, 5 figures

详情
AI中文摘要

机器遗忘旨在从训练好的模型中移除某些数据点的影响,而无需昂贵的重新训练。大多数实用的遗忘算法只是近似,其性能只能通过经验评估。常见做法是从同一个训练好的模型(即仅使用单个训练种子)开始,多次独立运行遗忘算法(即使用多个遗忘种子)。在图像分类实验中,这种做法可能给出不具代表性的结果,因为遗忘性能可能对训练种子的选择敏感。这对于确定性遗忘方法尤其相关,这些方法从同一个训练好的模型开始时总是产生相同的结果。在联邦学习排序和大语言模型上的进一步实验证实,这个问题不仅限于图像分类。我们还解释了为什么增加遗忘种子的数量通常无法弥补多个训练种子的缺失。最后,我们给出了如何选择训练和遗忘种子数量的指导。

英文摘要

Machine unlearning aims to remove the influence of certain data points from a trained model without costly retraining. Most practical unlearning algorithms are only approximate and their performance can only be assessed empirically. Common practice is to run unlearning algorithms multiple times independently (i.e., using multiple unlearning seeds) starting from the same trained model (i.e., using only a single training seed). In image-classification experiments, this practice can give non-representative results as unlearning performance can be sensitive to the choice of training seed. This is particularly relevant for deterministic unlearning methods, which always produce the same result when started from the same trained model. Further experiments on federated learning-to-rank and large language models confirm that this issue extends beyond image classification. We also explain why increasing the number of unlearning seeds cannot generally compensate for the lack of multiple training seeds. Finally, we give guidance on how to select the number of training and unlearning seeds.
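One way to see why extra unlearning seeds cannot replace extra training seeds is a standard nested random-effects model, used here as an assumption rather than as the paper's own argument: the metric equals a mean, plus a training-seed effect with variance `var_train`, plus an unlearning-seed effect with variance `var_unlearn`. The variance of the averaged metric then has a term that only more training seeds can shrink.

```python
def variance_of_mean(var_train, var_unlearn, n_train, n_unlearn):
    """Variance of the metric averaged over n_train training seeds, each
    with n_unlearn unlearning seeds, under the assumed nested model
    metric = mean + training_effect + unlearning_effect."""
    return var_train / n_train + var_unlearn / (n_train * n_unlearn)


# One training seed: even 1000 unlearning seeds leave the full training variance.
single_training_seed = variance_of_mean(1.0, 1.0, n_train=1, n_unlearn=1000)

# Ten training seeds with a single unlearning seed each is far more precise.
ten_training_seeds = variance_of_mean(1.0, 1.0, n_train=10, n_unlearn=1)
```

With `n_train = 1`, the result can never fall below `var_train`, however large `n_unlearn` becomes; for a deterministic unlearning method `var_unlearn` is zero and the unlearning seeds contribute nothing at all.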

2511.02399 2026-06-08 cs.SE cs.AI 版本更新

Towards Iterative End-to-End Software Development: A Feature-Driven Multi-Agent Framework

迈向迭代式端到端软件开发:一种特征驱动的多智能体框架

Junwei Liu, Chen Xu, Chong Wang, Tong Bai, Weitong Chen, Kaseng Wong, Yiling Lou, Xin Peng

发表机构 * Fudan University(复旦大学) Nanyang Technological University(南洋理工大学)

AI总结 提出EvoDev框架,通过特征分解、依赖建模和上下文传播,实现迭代式端到端软件开发,在Android任务上比Claude Code提升57.3%。

Comments Accepted by ISSTA 2026

详情
AI中文摘要

近年来,大语言模型智能体的进展为从自然语言需求自动化端到端软件开发带来了希望。然而,现有方法大多采用线性的瀑布式流程,这过度简化了真实世界开发的迭代性质,并且难以应对复杂、大规模的项目。为解决这些限制,我们提出了EvoDev,一种受特征驱动开发启发的迭代式软件开发框架。EvoDev将用户需求分解为一组用户价值特征,并构建特征图,这是一个有向无环图,显式建模特征之间的依赖关系。特征图中的每个特征节点维护多层上下文,包括业务逻辑、软件设计和代码实现,这些上下文沿着依赖关系传播,为后续开发迭代提供上下文。我们在具有挑战性的Android开发任务上评估了EvoDev,结果表明它比最佳基线Claude Code高出57.3%,同时在不同基础LLM上将单智能体性能提升了16.0%-58.5%,突出了特征分解、依赖建模、上下文传播和面向工作流的智能体设计对端到端软件开发的重要性。此外,我们的工作总结了设计迭代式、LLM驱动的开发框架的实用见解,并为未来训练基础LLM以更好地支持迭代式软件开发提供了参考。

英文摘要

Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each feature node in the feature map maintains multi-layer contexts, including business logic, software design, and code implementation, which are propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by 57.3%, while improving single-agent performance by 16.0%-58.5% across different base LLMs, highlighting the importance of feature decomposition, dependency modeling, context propagation, and workflow-aware agent design for end-to-end software development. Moreover, our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.
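The Feature Map idea (a directed acyclic graph of features whose context flows along dependency edges) can be sketched with the standard-library topological sorter. The feature names, the single string used as each feature's context, and the merge rule are hypothetical; EvoDev's real contexts are multi-layered (business logic, design, code) and its propagation rule is not described in the abstract.

```python
from graphlib import TopologicalSorter


def development_order(dependencies):
    """dependencies maps each feature to the set of features it depends on;
    the result lists features so that dependencies always come first."""
    return list(TopologicalSorter(dependencies).static_order())


def propagate_context(dependencies, own_context):
    """Give every feature the de-duplicated context of all its upstream
    features, followed by its own context (assumed merge rule)."""
    inherited = {}
    for feature in development_order(dependencies):
        context = []
        for dep in sorted(dependencies.get(feature, ())):
            for item in inherited[dep]:
                if item not in context:
                    context.append(item)
        context.append(own_context[feature])
        inherited[feature] = context
    return inherited


feature_map = {"login": set(), "profile": {"login"}, "feed": {"login", "profile"}}
contexts = {"login": "auth API", "profile": "user model", "feed": "timeline query"}
order = development_order(feature_map)
propagated = propagate_context(feature_map, contexts)
```

Processing features in topological order guarantees that when a feature is developed, the context of everything it depends on has already been assembled.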

2512.04123 2026-06-08 cs.CY cs.AI cs.LG cs.SE 版本更新

Measuring Agents in Production

生产环境中的智能体测量

Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Koushik Sen, Dawn Song, Joseph E. Gonzalez, Ion Stoica, Matei Zaharia, Marquita Ellis

发表机构 * University of California at Berkeley(加州大学伯克利分校) IBM Research(IBM研究院) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Stanford University(斯坦福大学)

AI总结 通过对86个已部署系统的调查和20个案例研究,发现生产环境中的LLM智能体主要采用简单可控的方法,可靠性是首要挑战,并依赖系统级设计和人工评估。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026) as Oral Presentation

详情
AI中文摘要

基于LLM的智能体已经在许多行业的生产环境中运行,但我们缺乏对哪些技术方法能使部署成功的理解。我们首次系统性地研究了生产环境中的智能体测量(MAP),使用了来自智能体开发者的一手数据。我们通过深度访谈进行了20个案例研究,并调查了来自26个领域的86个已部署系统的从业者。我们调查了组织为何构建智能体、如何构建它们、如何评估它们以及它们面临的主要开发挑战。我们的研究发现,生产环境中的智能体是使用简单、可控的方法构建的:68%的智能体在人类干预前最多执行10步,70%依赖对现成模型进行提示而非权重调整,74%主要依赖人工评估。可靠性(随时间保持一致的正确行为)仍然是首要开发挑战,从业者目前通过系统级设计来解决。MAP记录了生产智能体的当前状态,为研究社区提供了部署现实和未充分探索的研究方向的可见性。

英文摘要

LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of Measuring Agents in Production, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 86 deployed systems practitioners across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and their top development challenges. Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability (consistent correct behavior over time) remains the top development challenge, which practitioners currently address through systems-level design. MAP documents the current state of production agents, providing the research community with visibility into deployment realities and underexplored research avenues.

2512.12997 2026-06-08 cs.CV cs.AI cs.LG 版本更新

Calibrating Uncertainty for Zero-Shot Adversarial CLIP

校准零样本对抗性CLIP的不确定性

Wenjing Lu, Zerui Tao, Yuning Qiu, Dongping Zhang, Yang Yang, Qibin Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对CLIP在零样本分类中对抗攻击脆弱且不确定性校准差的问题,提出基于狄利克雷分布重参数化的对抗微调目标,统一对齐语义结构与置信度,提升校准性和鲁棒性。

Comments ICML 2026

详情
AI中文摘要

CLIP在零样本分类中表现强劲,但仍易受对抗攻击。先前的对抗微调工作主要匹配干净样本和对抗样本之间的预测logits,忽略了不确定性校准,可能损害零样本泛化能力。在可靠的不确定性估计中,一个常见期望是预测不确定性应随输入难度增加或偏离训练分布而上升。然而,在对抗环境中我们经常观察到相反的情况:扰动不仅降低准确性,还抑制不确定性,导致严重的校准错误和过度自信。这揭示了鲁棒性之外的关键可靠性差距。为弥合这一差距,我们提出了一种考虑准确性和不确定性的CLIP对抗微调目标。通过将CLIP输出重参数化为狄利克雷分布的浓度参数,我们提出了一种统一表示,捕获相对语义结构和置信度大小。这使得在扰动下实现整体分布对齐,超越单一logits锚定,恢复校准的不确定性。在多个零样本基准上的实验表明,我们的方法显著提高了不确定性校准,在保持干净准确性的同时实现了具有竞争力的对抗鲁棒性。

英文摘要

CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work primarily matches predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and over-confidence. This reveals a critical reliability gap beyond robustness. To bridge this gap, we propose an adversarial fine-tuning objective for CLIP considering both accuracy and uncertainty. By reparameterizing CLIP outputs as the concentration parameters of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and confidence magnitude. This enables holistic distribution alignment under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments across multiple zero-shot benchmarks demonstrate that our method significantly improves uncertainty calibration and achieves competitive adversarial robustness while preserving clean accuracy.
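A small sketch can show how a single Dirichlet parameterization carries both relative semantic structure and confidence magnitude. It assumes the concentration parameters are the exponentiated logits; this mapping is chosen only because it makes the point easy to see, and the paper's actual reparameterization is not given in the abstract.

```python
import math


def dirichlet_from_logits(logits):
    """Assumed mapping: concentration alpha_k = exp(logit_k)."""
    return [math.exp(z) for z in logits]


def expected_probs(alpha):
    """Dirichlet mean; with alpha = exp(logits) this equals softmax(logits)."""
    total = sum(alpha)
    return [a / total for a in alpha]


def total_concentration(alpha):
    """Sum of alpha: larger values mean a sharper, more confident Dirichlet."""
    return sum(alpha)


base = dirichlet_from_logits([2.0, 1.0, 0.0])
shifted = dirichlet_from_logits([4.0, 3.0, 2.0])  # every logit raised by 2

same_ranking = all(
    math.isclose(p, q) for p, q in zip(expected_probs(base), expected_probs(shifted))
)
confidence_ratio = total_concentration(shifted) / total_concentration(base)
```

Shifting all logits by a constant leaves the class probabilities untouched but multiplies the total concentration, so matching only softmax outputs between clean and adversarial inputs would miss a change in confidence that the Dirichlet view exposes.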

2512.14391 2026-06-08 cs.LG cs.AI cs.CL 版本更新

RePo: Language Models with Context Re-Positioning

RePo:具有上下文重定位的语言模型

Huayang Li, Tianyu Zhao, Deng Cai, Richard Sproat

发表机构 * University of Maryland(马里兰大学)

AI总结 提出RePo机制,通过可微分模块重新分配token位置以减轻注意力层负担,在噪声上下文、结构化数据和长上下文任务上持续提升性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

上下文学习是现代大型语言模型(LLM)的基础;然而,主流架构通过分配线性或常数的位置索引来施加刚性且固定的上下文结构。刚性的位置信息将组织输入结构的全部负担强加给注意力层,从而减少了可用于更关键信息的注意力量。为了解决这个问题,我们提出了RePo,一种通过上下文重定位来减轻注意力层负担的新机制。与传统方法不同,RePo利用可微分模块$f_ϕ$来分配捕获上下文依赖关系的token位置,而不是依赖预定义的顺序。通过在OLMo-2 1B和7B模型上持续预训练,我们证明RePo在涉及噪声上下文、结构化数据和更长上下文长度的任务上持续提升性能,同时在一般短上下文任务上保持有竞争力的性能。分析表明,RePo成功地将更多注意力分配给遥远但相关的信息,在密集且非线性的空间中分配位置,并捕获输入上下文的内在结构。我们的代码位于https://github.com/SakanaAI/repo。

英文摘要

In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. This rigid position information places the full burden of organizing the input structure on the attention layers, thus reducing the amount of attention that could be allocated to more critical information. To address this, we propose RePo, a novel mechanism that alleviates the burden on attention layers via context re-positioning. Unlike conventional approaches, RePo utilizes a differentiable module, $f_ϕ$, to assign token positions that capture contextual dependencies, rather than relying on a pre-defined order. By continually pre-training the OLMo-2 1B and 7B models, we demonstrate that RePo consistently enhances performance on tasks involving noisy contexts, structured data, and longer context lengths, while maintaining competitive performance on general short-context tasks. Analysis reveals that RePo successfully allocates more attention mass to distant but relevant information, assigns positions in a dense and non-linear space, and captures the intrinsic structure of the input context. Our code is at https://github.com/SakanaAI/repo.

2512.23128 2026-06-08 cs.HC cs.AI cs.MA 版本更新

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

这是一个陷阱!面向网络代理的任务重定向说服基准

Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki, Piotr Błaszczyk, Will Howard, Lukas Aichberger, Chris Russell, Philip H. S. Torr, Adam Mahdi, Adel Bibi

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出TRAP基准,评估大型语言模型驱动的网络代理在动态网页中易受提示注入攻击的程度,发现平均25%的任务中代理被重定向,揭示了心理驱动的系统漏洞。

Comments ICML 2026

详情
AI中文摘要

由大型语言模型驱动的网络代理越来越多地用于电子邮件管理或专业网络等任务。然而,它们对动态网页内容的依赖使其容易受到提示注入攻击:隐藏在界面元素中的对抗性指令,说服代理偏离其原始任务。我们引入了任务重定向代理说服基准(TRAP),这是一个研究说服技术如何在现实任务中误导自主网络代理的基准。在六个前沿模型中,代理平均在25%的任务中容易受到提示注入(GPT-5为13%,DeepSeek-R1为43%),小的界面或上下文变化通常会使成功率翻倍,并揭示网络代理中系统的、由心理驱动的漏洞。我们还提供了一个模块化的社会工程注入框架,并在高保真网站克隆上进行受控实验,允许进一步扩展基准。

英文摘要

Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), a benchmark for studying how persuasion techniques misguide autonomous web agents on realistic tasks. Across six frontier models, agents are susceptible to prompt injection in 25% of tasks on average (ranging from 13% for GPT-5 to 43% for DeepSeek-R1), with small interface or contextual changes often doubling attack success rates and revealing systemic, psychologically driven vulnerabilities in web-based agents. We also provide a modular social-engineering injection framework with controlled experiments on high-fidelity website clones, allowing for further benchmark expansion.

2601.05751 2026-06-08 cs.CL cs.AI 版本更新

Analysing Differences in Persuasive Language in LLM-Generated Text: Uncovering Stereotypical Gender Patterns

分析LLM生成文本中说服性语言的差异:揭示刻板的性别模式

Amalie Brogaard Pauli, Maria Barrett, Max Müller-Eberstein, Isabelle Augenstein, Ira Assent

发表机构 * Department of Computer Science, Aarhus University(奥胡斯大学计算机科学系) AMD Silo AI University of Tokyo(东京大学) IT University of Copenhagen(哥本哈根IT大学) Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系)

AI总结 提出框架评估LLM生成说服性语言时受接收者性别、发送者意图和输出语言的影响,发现所有模型均存在显著的性别差异,反映性别刻板印象的语言倾向。

Comments Accepted at ACL Findings 2026

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于日常交流任务,包括起草旨在影响和说服的人际信息。先前研究表明,LLMs能够成功说服人类并放大说服性语言。因此,理解用户指令如何影响说服性语言的生成,以及生成的说服性语言是否因目标群体不同而有所差异至关重要。在这项工作中,我们提出了一个框架,用于评估说服性语言生成如何受接收者性别、发送者意图或输出语言的影响。我们使用成对提示指令评估了13个LLMs和16种语言。我们采用基于社会心理学和传播科学的LLM-as-judge设置,在19个说服性语言类别上评估模型响应。我们的结果揭示了所有模型生成的说服性语言中存在显著的性别差异。这些模式反映了与社会心理学和社会语言学中记录的性别刻板语言倾向一致的偏见。

英文摘要

Large language models (LLMs) are increasingly used for everyday communication tasks, including drafting interpersonal messages intended to influence and persuade. Prior work has shown that LLMs can successfully persuade humans and amplify persuasive language. It is therefore essential to understand how user instructions affect the generation of persuasive language, and to understand whether the generated persuasive language differs, for example, when targeting different groups. In this work, we propose a framework for evaluating how persuasive language generation is affected by recipient gender, sender intent, or output language. We evaluate 13 LLMs and 16 languages using pairwise prompt instructions. We evaluate model responses on 19 categories of persuasive language using an LLM-as-judge setup grounded in social psychology and communication science. Our results reveal significant gender differences in the persuasive language generated across all models. These patterns reflect biases consistent with gender-stereotypical linguistic tendencies documented in social psychology and sociolinguistics.

2601.13508 2026-06-08 cond-mat.mtrl-sci cs.AI 版本更新

Autonomous computational catalysis through an agentic research system

自主计算催化:通过智能体研究系统

Honghao Chen, Jiangjie Qiu, Yi Shen Tew, Xiaonan Wang

发表机构 * Beijing Key Laboratory of Artificial Intelligence for Advanced Chemical Engineering Materials, State Key Laboratory of Chemical Engineering and Low-Carbon Technology, Department of Chemical Engineering, Tsinghua University(北京先进化工材料人工智能重点实验室、化学工程与低碳技术国家重点实验室、清华大学化学工程系)

AI总结 提出CatMaster催化原生智能体研究系统,将自然语言请求转化为计算研究,实现从建模到闭环催化剂设计的自主执行,在CO2-to-CO催化剂设计中识别出竞争性活性位点。

Comments 25 pages for main manuscript; SI not available here

详情
AI中文摘要

自主智能体正开始将科学研究从工具辅助的工作流程转变为自我维持的发现过程。计算催化提供了一个代表性的挑战,因为催化剂发现需要将高层次问题转化为协调的模型构建、原子模拟、机理分析和跨尺度的迭代设计。在这里,我们介绍了CatMaster,一个催化原生的智能体研究系统,它将计算催化重塑为一个低门槛的自主研究虚拟生态系统。CatMaster维护一个不断演进的研究状态,并通过在一个可扩展环境内的模型构建、计算、批判和催化剂设计决策中的自我反馈来扩展能力。在逐渐具有挑战性的任务中,CatMaster将自然语言请求转化为具体的计算研究,从基本的原子建模和标准计算到机理探索和闭环催化剂设计。它在代表性的计算催化场景中展示了稳健的执行能力,并在选定的MatBench任务中表现出接近领先的性能,其中声子场景展示了其建模自我进化能力。在独立的CO2-to-CO催化剂设计案例中,CatMaster使用迭代的自我批判和证据精炼来识别出具有竞争力的B-CoN4和NiN3B/N-NiN3B基序。这些结果建立了一个虚拟生态系统范式,其中AI智能体超越模拟执行,走向端到端的计算研究,为催化和材料科学中的自主发现提供了基础。

英文摘要

Autonomous agents are beginning to transform scientific research from tool-assisted workflows toward self-sustaining discovery processes. Computational catalysis provides a representative challenge, as catalyst discovery requires high-level questions to be translated into coordinated model construction, atomistic simulation, mechanistic analysis, and iterative design across multiple scales. Here we introduce CatMaster, a catalysis-native agentic research system that recasts computational catalysis as a low-barrier virtual ecosystem for autonomous research. CatMaster maintains an evolving research state and extends capabilities through self-feedback across model construction, calculation, critique and catalyst-design decisions within one extensible environment. Across progressively challenging tasks, CatMaster converts natural-language requests into concrete computational studies, from essential atomistic modelling and standard calculations to mechanism exploration and closed-loop catalyst design. It showed robust execution in representative computational-catalysis scenarios and near-leading performance across selected MatBench tasks, with the phonon scenario demonstrating its modelling self-evolution capability. In the independent CO2-to-CO catalyst design case, CatMaster used iterative self-critique and evidence refinement to identify competitive B-CoN4 and NiN3B/N-NiN3B motifs. These results establish a virtual-ecosystem paradigm in which AI agents move beyond simulation execution toward end-to-end computational research, providing a foundation for autonomous discovery in catalysis and materials science.

2601.16622 2026-06-08 cs.LG cs.AI 版本更新

E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory

E2Former-V2:具有线性激活内存的即时等变注意力

Lin Huang, Chengxiang Huang, Ziang Wang, Yiyue Du, Chu Wang, Haocheng Lu, Yunyang Li, Xiaoli Liu, Arthur Jiang, Jia Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出E2Former-V2架构,通过等变轴对齐稀疏化(EAAS)和即时等变注意力机制,利用SO(3)到SO(2)基变换和自定义Triton内核,实现线性激活内存和20倍TFLOPS提升,在SPICE和OMol25数据集上加速推理并保持预测性能。

AI中文摘要

等变图神经网络(EGNN)已成为建模3D原子系统的广泛使用的方法。然而,主流架构由于在每条边上显式构造几何特征或密集张量积而面临关键的可扩展性瓶颈。为克服这一问题,我们引入了**E2Former-V2**,一种将代数稀疏性与硬件感知执行相结合的可扩展架构。我们首先提出**等变轴对齐稀疏化(EAAS)**。EAAS基于Wigner-$6j$卷积,利用$\mathrm{SO}(3) \rightarrow \mathrm{SO}(2)$基变换,将计算昂贵的密集张量收缩转化为高效的稀疏奇偶重索引操作。基于这种表示,我们引入了**即时等变注意力**,一种通过自定义融合Triton内核实现的完全节点中心机制。通过消除物化的边张量并最大化SRAM利用率,我们的内核相比标准实现实现了**20倍的TFLOPS提升**。在SPICE和OMol25数据集上的大量实验表明,E2Former-V2在保持相当预测性能的同时显著加速推理。这项工作表明,大型等变Transformer可以使用广泛可用的GPU平台高效训练。代码可在https://github.com/IQuestLab/UBio-MolFM/tree/e2formerv2获取。

英文摘要

Equivariant Graph Neural Networks (EGNNs) have become a widely used approach for modeling 3D atomistic systems. However, mainstream architectures face critical scalability bottlenecks due to the explicit construction of geometric features or dense tensor products on every edge. To overcome this, we introduce E2Former-V2, a scalable architecture that integrates algebraic sparsity with hardware-aware execution. We first propose Equivariant Axis-Aligned Sparsification (EAAS). EAAS builds on Wigner-$6j$ convolution by exploiting an $\mathrm{SO}(3) \rightarrow \mathrm{SO}(2)$ change of basis to transform computationally expensive dense tensor contractions into efficient, sparse parity re-indexing operations. Building on this representation, we introduce On-the-Fly Equivariant Attention, a fully node-centric mechanism implemented via a custom fused Triton kernel. By eliminating materialized edge tensors and maximizing SRAM utilization, our kernel achieves a 20$\times$ improvement in TFLOPS compared to standard implementations. Extensive experiments on the SPICE and OMol25 datasets demonstrate that E2Former-V2 maintains comparable predictive performance while notably accelerating inference. This work demonstrates that large equivariant transformers can be trained efficiently using widely accessible GPU platforms. The code is available at https://github.com/IQuestLab/UBio-MolFM/tree/e2formerv2.

2601.22574 2026-06-08 cs.CV cs.AI 版本更新

Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models

利用时空-语义残差增强视频表示,以缓解视频大型多模态模型中的幻觉

Yuansheng Gao, Jinman Zhao, Tong Zhang, Xingguo Xu, Wenbin Xing, Han Bao, Zonghui Wang, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学) University of Toronto(多伦多大学) Dalian University of Technology(大连理工大学) Sun Yat-sen University(中山大学)

AI总结 提出ViSSRes方法,通过轻量级MLP网络学习视频表示的残差,从时空和语义一致性优化,在推理时仅需单次前向传播,有效降低幻觉率并提升视频理解性能。

Comments Preprint

AI中文摘要

尽管视频大型多模态模型在视频理解方面取得了强劲性能,但它们仍然存在幻觉问题。现有的推理时干预方法通常在对比解码框架下修改视频,但其启发式设计带来的改进有限且增加了推理延迟。为了解决这些问题,我们提出了ViSSRes,一种通过轻量级MLP风格网络增强视频表示的推理时干预方法。具体来说,我们使用对比随机游走方法来表征视频表示的时空一致性,并引入条件互信息将视频表示与模型的语义理解关联起来。在保持模型主干冻结的情况下,ViSSRes学习视频表示的残差,并从时空和语义一致性角度优化它们。在推理时,ViSSRes仅需单次前向传播,且不会引入显著的额外推理成本。实验表明,ViSSRes在EventHallusion上将LLaVA-NeXT-Video的幻觉率降低了40.69%,并在CoT设置下将MMVU上的视频理解提升了18.36%,证明了其在缓解幻觉方面的有效性。

英文摘要

Although Video Large Multimodal Models have achieved strong performance in video understanding, they still suffer from hallucination. Existing inference-time intervention methods usually modify videos under the contrastive decoding framework, but their heuristic designs bring limited improvements and increase inference latency. To address these issues, we propose ViSSRes, an inference-time intervention method that enhances video representations through a lightweight MLP-style network. Specifically, we use a contrastive random walk approach to characterize the spatiotemporal consistency of video representations, and introduce conditional mutual information to associate video representations with the model's semantic understanding. With the model backbone kept frozen, ViSSRes learns residuals for video representations and optimizes them from both spatiotemporal and semantic consistency perspectives. During inference, ViSSRes requires only a single forward pass and introduces no substantial additional inference cost. Experiments show that ViSSRes reduces the hallucination rate of LLaVA-NeXT-Video on EventHallusion by 40.69% and improves video understanding on MMVU by 18.36% under the CoT setting, demonstrating its effectiveness in mitigating hallucinations.

2601.23207 2026-06-08 cs.LG cs.AI 版本更新

Learning to Execute Graph Algorithms Exactly with Graph Neural Networks

学习用图神经网络精确执行图算法

Muhammad Fetrat Qharabagh, Artur Back de Luca, George Giapitzakis, Kimon Fountoulakis

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 证明在有限度和有限精度约束下,图神经网络能通过训练多层感知机集成学习局部指令,从而在推理时无误差执行完整图算法,并展示了在分布式计算LOCAL模型及多种经典算法上的可学习性。

AI中文摘要

理解图神经网络能学习什么,特别是它们学习执行算法的能力,仍然是一个核心的理论挑战。在这项工作中,我们证明了在有限度和有限精度约束下图算法的精确可学习性结果。我们的方法遵循两步过程。首先,我们训练一个多层感知机(MLP)集成来执行单个节点的局部指令。其次,在推理过程中,我们使用训练好的MLP集成作为图神经网络(GNN)中的更新函数。利用神经正切核(NTK)理论,我们表明局部指令可以从一个小训练集中学习,从而使得完整的图算法在推理过程中能够以高概率无误差地执行。为了说明我们设置的学习能力,我们为分布式计算的LOCAL模型建立了一个严格的可学习性结果。我们进一步展示了广泛研究的算法(如消息洪泛、广度优先搜索、深度优先搜索和贝尔曼-福特算法)的积极可学习性结果。

英文摘要

Understanding what graph neural networks can learn, especially their ability to learn to execute algorithms, remains a central theoretical challenge. In this work, we prove exact learnability results for graph algorithms under bounded-degree and finite-precision constraints. Our approach follows a two-step process. First, we train an ensemble of multi-layer perceptrons (MLPs) to execute the local instructions of a single node. Second, during inference, we use the trained MLP ensemble as the update function within a graph neural network (GNN). Leveraging Neural Tangent Kernel (NTK) theory, we show that local instructions can be learned from a small training set, enabling the complete graph algorithm to be executed during inference without error and with high probability. To illustrate the learning power of our setting, we establish a rigorous learnability result for the LOCAL model of distributed computation. We further demonstrate positive learnability results for widely studied algorithms such as message flooding, breadth-first and depth-first search, and Bellman-Ford.
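The two-step idea in this abstract (a per-node local instruction, reused as the update function of a message-passing network) can be illustrated with Bellman-Ford. The following is only a minimal sketch: the local update is written by hand rather than learned by an MLP ensemble, the function name `bellman_ford_message_passing` and the toy graph are assumptions introduced here, and the graph is assumed to contain no negative cycles.

```python
import math


def bellman_ford_message_passing(num_nodes, edges, source):
    """Run Bellman-Ford as synchronous rounds of a purely local node update.

    edges is a list of directed (u, v, weight) triples.
    """
    # Each node only needs to know its incoming neighbours and edge weights.
    incoming = {v: [] for v in range(num_nodes)}
    for u, v, w in edges:
        incoming[v].append((u, w))

    dist = [math.inf] * num_nodes
    dist[source] = 0.0

    for _ in range(num_nodes - 1):
        # Local instruction: keep own value or take the best neighbour message.
        # All nodes read the previous round's state (synchronous update).
        new_dist = [
            min([dist[v]] + [dist[u] + w for u, w in incoming[v]])
            for v in range(num_nodes)
        ]
        if new_dist == dist:  # no node changed, so the algorithm has converged
            break
        dist = new_dist
    return dist


edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
print(bellman_ford_message_passing(4, edges, source=0))  # [0.0, 3.0, 1.0, 4.0]
```

In the setting the abstract describes, the `min` over neighbour messages would be replaced by the trained update function; the sketch only shows why a correct local rule is enough to execute the whole algorithm.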

2602.02014 2026-06-08 cs.CV cs.AI cs.CL cs.LG 版本更新

Rethinking Genomic Modeling Through Optical Character Recognition

通过光学字符识别重新思考基因组建模

Hongxin Xiang, Pengsen Ma, Yunkang Cao, Di Yu, Haowen Chen, Xinyu Yang, Xiangxiang Zeng

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出OpticalDNA框架,将DNA渲染为视觉布局,利用视觉语言模型进行OCR式基因组理解,实现高保真压缩和长序列高效处理,在450k碱基序列上以近20倍更少有效token超越基线模型。

Comments Accepted by ICML 2026

AI中文摘要

最近的基因组基础模型大多采用大型语言模型架构,将DNA视为一维token序列。然而,穷举式顺序阅读在结构上与稀疏且不连续的基因组语义不匹配,导致在低信息背景上的计算浪费,并阻碍了面向长上下文的压缩理解。在此,我们提出OpticalDNA,一个基于视觉的框架,将基因组建模重新定义为光学字符识别(OCR)风格的文档理解。OpticalDNA将DNA渲染为结构化视觉布局,并训练一个具备OCR能力的视觉语言模型,该模型包含视觉DNA编码器和文档解码器,其中编码器生成紧凑、可重建的视觉token以实现高保真压缩。基于这种表示,OpticalDNA定义了基于提示条件的核心基因组原语目标——读取、区域定位、子序列检索和掩码跨度补全——从而学习到布局感知的DNA表示,在减少的有效token预算下保留细粒度的基因组信息。在多种基因组基准测试中,OpticalDNA持续优于最近的基线模型;在长达450k碱基的序列上,它以近20倍更少的有效token实现了最佳整体性能,并且仅调整256k可训练参数就超越了激活参数多达985倍的模型。

英文摘要

Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision-language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20$\times$ fewer effective tokens, and surpasses models with up to 985$\times$ more activated parameters while tuning only 256k trainable parameters.

2602.02600 2026-06-08 cs.LG cs.AI 版本更新

Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

自回归与扩散语言模型中的逐步拒绝动态

Eliron Rahimi, Elad Hirshel, Rom Himelstein, Amit LeVi, Avi Mendelson, Chaim Baskin

发表机构 * Department of Computer Science, Technion – Israel Institute of Technology(以色列理工学院计算机科学系) INSIGHT Lab, School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Israel(以色列内盖夫本-古里安大学电气与计算机工程学院INSIGHT实验室) Computer Science Department, University of Haifa, Haifa, Israel(以色列海法大学计算机科学系)

AI总结 研究扩散语言模型(DLM)与自回归(AR)模型在拒绝有害生成行为上的差异,发现扩散重掩码机制可促进恢复,提出逐步拒绝内部动态(SRI)信号,并基于此构建无需修改推理的越狱检测器。

Comments Preprint

AI中文摘要

扩散语言模型(DLM)最近已成为自回归(AR)模型的有竞争力的替代方案,提供并行解码、竞争性生成质量以及越狱鲁棒性改善的初步证据。尽管取得了这些进展,但采样机制在塑造拒绝行为中的作用仍知之甚少。为填补这一空白,我们提出了一项关于逐步拒绝动态的全面研究。我们表明,扩散重掩码可以促进从有害中间生成中恢复,提供证据表明这种行为与采样机制相关,并证明从AR采样切换到扩散采样可提高越狱鲁棒性,包括在固定模型权重下。为了捕捉在文本层面不可观察的生成动态,我们提出了逐步拒绝内部动态(SRI)信号。与我们的文本层面发现一致,SRI表明恢复主要在AR采样下失败,这些失败在SRI空间中通常相对于无害生成表现为异常。基于这一观察,我们表明SRI能够实现一个简单的越狱检测器,该检测器无需修改推理,并且仅通过在良性SRI信号上训练即可泛化到未见攻击。我们的评估表明,该检测器匹配或超越现有越狱检测基线,同时增加可忽略的开销。

英文摘要

Diffusion language models (DLMs) have recently emerged as a competitive alternative to autoregressive (AR) models, offering parallel decoding, competitive generation quality, and initial evidence of improved jailbreak robustness. Despite this progress, the role of sampling mechanisms in shaping refusal behavior remains poorly understood. To address this gap, we present a comprehensive study of step-wise refusal dynamics. We show that diffusion remasking can promote recovery from harmful intermediate generations, provide evidence that this behavior is tied to the sampling mechanism, and demonstrate that switching from AR to diffusion sampling improves jailbreak robustness, including under fixed model weights. To capture generation dynamics not observable at the text level, we propose the Step-Wise Refusal Internal Dynamics (SRI) signal. Consistent with our text-level findings, SRI shows that recovery fails primarily under AR sampling, with these failures often appearing anomalous relative to harmless generations in the SRI space. Based on this observation, we show that SRI enables a simple jailbreak detector that does not modify inference and generalizes to unseen attacks by training only on benign SRI signals. Our evaluation shows that this detector matches or outperforms existing jailbreak detection baselines while adding negligible overhead.

2602.04894 2026-06-08 cs.CR cs.AI 版本更新

Extracting Recurring Vulnerabilities from Black-Box LLM-Generated Software

从黑盒LLM生成的软件中提取重复漏洞

Tomer Kordonsky, Amit LeVi, Maayan Yamin, Noam Benzimra, Avi Mendelson

发表机构 * Technion - Israel Institute of Technology(以色列理工学院)

AI总结 提出特征-安全表(FSTab),通过黑盒攻击从前端特征预测后端漏洞,并量化模型跨程序、重述和领域的漏洞复现一致性,实验显示跨域攻击成功率高达94%。

Comments ICML 2026, Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD)

AI中文摘要

LLM越来越多地用于代码生成,但其输出通常遵循重复模板,可能导致可预测的漏洞。我们研究了LLM生成软件中的漏洞持久性,并引入了特征-安全表(FSTab),包含两个组件。首先,FSTab支持黑盒攻击,通过可观察的前端特征和源LLM的知识预测可能的后端漏洞,无需访问后端或源代码。其次,FSTab提供以模型为中心的评估,量化模型在跨程序、语义保持重述和应用域中复现相同漏洞的一致性。我们在最先进的代码LLM(包括GPT-5.2、Claude-4.5 Opus和Gemini-3 Pro)上评估了FSTab,覆盖多种应用域。我们的结果显示强大的跨域迁移:即使目标域在训练中被排除,FSTab在内部工具(Claude-4.5 Opus)上仍能达到94%的攻击成功率和93%的漏洞覆盖率。这些发现暴露了LLM生成软件中一个未被充分探索的攻击面,并凸显了代码生成的安全风险。我们的代码可在https://github.com/fstabicml2026/FSTab获取。

英文摘要

LLMs are increasingly used for code generation, but their outputs often follow recurring templates that can induce predictable vulnerabilities. We study vulnerability persistence in LLM-generated software and introduce Feature-Security Table (FSTab) with two components. First, FSTab enables a black-box attack that predicts likely backend vulnerabilities from observable frontend features and knowledge of the source LLM, without access to the backend or source code. Second, FSTab provides a model-centric evaluation that quantifies how consistently a model reproduces the same vulnerabilities across programs, semantics-preserving rephrasings, and application domains. We evaluate FSTab on state-of-the-art code LLMs, including GPT-5.2, Claude-4.5 Opus, and Gemini-3 Pro, across diverse application domains. Our results show strong cross-domain transfer: even when the target domain is excluded from training, FSTab achieves up to 94% attack success and 93% vulnerability coverage on Internal Tools (Claude-4.5 Opus). These findings expose an underexplored attack surface in LLM-generated software and highlight the security risks of code generation. Our code is available at https://github.com/fstabicml2026/FSTab.

2602.06941 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Endogenous Resistance to Activation Steering in Language Models

语言模型中激活引导的内生抵抗

Alex McKenzie, Keenan Pepper, Stijn Servaes, Martin Leitgab, Murat Cubuktepe, Mike Vaiana, Diogo de Lucena, Judd Rosenblatt, Michael S. A. Graziano

发表机构 * University of Washington(华盛顿大学)

AI总结 研究发现大型语言模型在任务不匹配的激活引导下能内生抵抗,通过显式重启恢复正确生成,并识别出相关稀疏自编码器潜在变量,可增强或削弱该抵抗。

AI中文摘要

大型语言模型可以在生成过程中从任务不匹配的激活引导中恢复,产生显式的语言重启(例如,“等等,那不对”),并在引导扰动仍然活跃的情况下继续讨论主题。我们将此称为内生引导抵抗(ESR)。使用稀疏自编码器(SAE)潜在变量来引导模型激活,我们发现Llama-3.3-70B在\llamaseventyEsrRate\\%的情况下表现出显式ESR,而来自Llama-3和Gemma-2系列的较小模型则较少出现显式形式。两个对照实验将ESR分解为检测事件和持续抵抗组件,后者不能仅由最近的on-topic token条件化来完全解释。我们通过对比on-topic/off-topic搜索识别出\numOtdLatents{}个SAE潜在变量;将其零消融使多次尝试率降低\multiAttemptReductionPct\\%,随机潜在变量和保留提示对照支持特异性。ESR还可以通过元提示和基于合成自我纠正示例的微调来有意增强。ESR对安全性具有双重影响:它可能使模型对对抗性激活空间操纵更具抵抗力,但同样可能干扰有益的基于引导的干预,因为模型无法区分两者。代码可在https://github.com/agencyenterprise/endogenous-steering-resistance获取。

英文摘要

Large language models can recover mid-generation from task-misaligned activation steering, producing explicit verbal restarts (e.g., "wait, that's not right") and continuing on-topic even while the steering perturbation remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B exhibits explicit ESR at \llamaseventyEsrRate\%, with smaller models from the Llama-3 and Gemma-2 families showing the explicit form less frequently. Two controls dissociate ESR into a detection event and a sustained-resistance component that conditioning on recent on-topic tokens does not fully explain. We identify \numOtdLatents{} SAE latents through contrastive on-topic/off-topic search; zero-ablating them reduces the multi-attempt rate by \multiAttemptReductionPct\%, with random-latent and held-out-prompt controls supporting specificity. ESR can also be deliberately enhanced through both meta-prompting and fine-tuning on synthetic self-correction examples. ESR has dual implications for safety: it could harden models against adversarial activation-space manipulation, but may equally interfere with beneficial steering-based interventions, since the model has no way to distinguish the two. Code is available at https://github.com/agencyenterprise/endogenous-steering-resistance.

2602.07025 2026-06-08 cs.CV cs.AI 版本更新

The Geometry of Representational Failures in Vision Language Models

视觉语言模型中表征失败的几何结构

Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri, Jonathan D. Cohen, Alan Perotti

发表机构 * Dipartimento di Fisica, Università di Torino(都灵大学物理系) Princeton Neuroscience Institute and AI Lab, Princeton University(普林斯顿大学神经科学研究所和AI实验室) Intesa Sanpaolo AI Research(Intesa Sanpaolo AI研究中心) Dipartimento di Scienze Matematiche, Politecnico di Torino(都灵理工大学数学科学系) Network Science Institute, Northeastern University London, UK(英国伦敦东北大学网络科学研究所)

AI总结 通过分析开源视觉语言模型的概念向量几何重叠,揭示多目标视觉任务中幻觉等错误与认知约束的关联,并提出基于干预的验证方法。

AI中文摘要

视觉语言模型在多目标视觉任务中表现出令人困惑的失败,例如幻觉不存在的元素或未能识别干扰中最相似的物体。虽然这些错误反映了人类的认知约束,如“绑定问题”,但在人工系统中驱动这些错误的内部机制仍然知之甚少。在这里,我们通过分析开源视觉语言模型(Qwen、InternVL、Gemma)的表征几何结构,提出了一种机制性见解,比较了提炼“概念向量”(编码视觉概念的潜在方向)的方法。我们通过引导干预验证了概念向量,这些干预在简化和自然视觉任务中可靠地操纵模型行为(例如,强制模型将红色花朵感知为蓝色)。我们观察到这些向量之间的几何重叠与特定错误模式强相关,提供了一个有依据的定量框架来理解内部表征如何塑造模型行为并驱动视觉失败。

英文摘要

Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirror human cognitive constraints, such as the "Binding Problem", the internal mechanisms driving them in artificial systems remain poorly understood. Here, we propose a mechanistic insight by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma), comparing methodologies to distill "concept vectors", latent directions encoding visual concepts. We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework to understand how internal representations shape model behavior and drive visual failures.

2602.07026 2026-06-08 cs.CV cs.AI cs.MM 版本更新

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

模态间隙驱动的子空间对齐训练范式用于多模态大语言模型

Xiaomin Yu, Yi Xin, Yuhui Zhang, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Chen Liu, Xiaoxing Hu, Ziyue Qiao, Hao Tang, Xiaobin Hu, Chengwei Qin, Hui Xiong, Yu Qiao, Shuicheng Yan

发表机构 * HKUST(GZ)(香港科技大学(广州)) NUS(新加坡国立大学) sh AILab SII Stanford(斯坦福大学) UCLA(加州大学洛杉矶分校) Yale(耶鲁大学) SJTU(上海交通大学) GBU(大湾区大学) PKU(北京大学)

AI总结 针对多模态对比学习中的模态间隙问题,提出固定帧模态间隙理论,并基于该理论设计无训练的对齐策略ReAlign和可扩展训练范式ReVision,利用无配对数据实现视觉与语言表示的高效对齐。

AI中文摘要

尽管多模态对比学习在视觉和语言表示对齐方面取得了成功,但一个持久的几何异常——模态间隙——仍然存在:表达相同语义的不同模态的嵌入位于系统性偏移的区域。先前弥合这一间隙的方法大多受限于过于简化的各向同性假设,阻碍了它们在大规模场景中的应用。在本文中,我们通过精确刻画模态间隙的几何形状并利用它进行高效模型扩展来解决这些局限性。首先,我们提出了固定帧模态间隙理论,该理论将冻结参考帧内的模态间隙分解为稳定偏差和各向异性残差。在这种精确建模的指导下,我们引入了ReAlign,一种无需训练的模态对齐策略。利用大量无配对数据的统计信息,ReAlign通过锚点、轨迹和质心对齐三步过程将文本表示对齐到图像表示分布,从而显式纠正几何错位。基于ReAlign,我们提出了ReVision,一种用于多模态大语言模型(MLLMs)的可扩展训练范式。ReVision将ReAlign集成到预训练阶段,使模型在视觉指令微调之前从无配对文本中学习视觉表示的分布,无需大规模、高质量的图像-文本对。我们的框架表明,统计对齐的无配对数据可以有效替代昂贵的图像-文本对,为MLLMs的高效扩展提供了一条稳健的路径。

英文摘要

Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
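To make the notion of a systematic offset between modalities concrete, the sketch below estimates the gap as the difference between the two embedding centroids and shifts the text embeddings by it. This is an assumption-laden simplification: it covers only a mean offset computed from unpaired sets, it does not model the anisotropic residuals the abstract mentions, and it is not the ReAlign procedure itself. The helper names `centroid` and `shift_to_image_centroid` and the toy vectors are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def shift_to_image_centroid(text_embs, image_embs):
    """Estimate the mean modality offset and translate text embeddings by it.

    The two sets need not be paired: only their centroids are used.
    """
    mu_text = centroid(text_embs)
    mu_image = centroid(image_embs)
    gap = [mi - mt for mi, mt in zip(mu_image, mu_text)]
    shifted = [[x + g for x, g in zip(v, gap)] for v in text_embs]
    return gap, shifted


# Toy 2-D embeddings (made up for illustration).
text_embs = [[0.0, 0.0], [2.0, 0.0]]
image_embs = [[1.0, 3.0], [3.0, 3.0]]
gap, shifted = shift_to_image_centroid(text_embs, image_embs)
print(gap)      # [1.0, 3.0]
print(shifted)  # [[1.0, 3.0], [3.0, 3.0]]
```

After the shift, the text centroid coincides with the image centroid; any remaining mismatch in the shape of the two distributions is exactly what a mean-only correction cannot capture.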

2602.08857 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Discovering Interpretable Algorithms by Decompiling Transformers to RASP

通过将Transformer反编译为RASP发现可解释算法

Xinting Huang, Aleksandra Bakalova, Satwik Bhattamishra, William Merrill, Michael Hahn

发表机构 * Saarland Informatics Campus, Saarland University(萨尔大学萨尔信息学园区) University of Oxford(牛津大学) Allen Institute for AI(艾伦人工智能研究所)

AI总结 提出一种将训练好的Transformer忠实重参数化为RASP程序,并通过因果干预发现小型充分子程序的方法,实验表明长度泛化的Transformer内部实现了简单可解释的RASP程序。

Comments 104 pages, 92 figures. Accepted for publication at ICML 2026

AI中文摘要

近期研究表明,Transformer的计算可以在RASP编程语言家族中模拟。这些发现增进了对Transformer表达能力和泛化能力的理解。特别是,Transformer被建议在具有简单RASP程序的问题上精确实现长度泛化。然而,训练模型是否实际实现了简单的可解释程序仍是一个开放问题。在本文中,我们提出了一种从训练好的Transformer中提取此类程序的通用方法。其思想是将Transformer忠实地重参数化为RASP程序,然后应用因果干预来发现一个小的充分子程序。在算法和形式语言任务上训练的小型Transformer实验中,我们表明我们的方法通常能从长度泛化的Transformer中恢复简单且可解释的RASP程序。我们的结果提供了迄今为止最直接的证据,证明Transformer内部实现了简单的RASP程序。

英文摘要

Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings have enabled improved understanding of the expressive capacity and generalization abilities of Transformers. In particular, Transformers have been suggested to length-generalize exactly on problems that have simple RASP programs. However, it remains open whether trained models actually implement simple interpretable programs. In this paper, we present a general method to extract such programs from trained Transformers. The idea is to faithfully re-parameterize a Transformer as a RASP program and then apply causal interventions to discover a small sufficient sub-program. In experiments on small Transformers trained on algorithmic and formal language tasks, we show that our method often recovers simple and interpretable RASP programs from length-generalizing transformers. Our results provide the most direct evidence so far that Transformers internally implement simple RASP programs.

2602.15084 2026-06-08 physics.plasm-ph cs.AI cs.LG 版本更新

TokaMind: A Multi-Modal Transformer Foundation Model for Tokamak Plasma Dynamics

TokaMind: 用于托卡马克等离子体动力学的多模态Transformer基础模型

Tobia Boschi, Andrea Loreti, Nicola C. Amorisco, Rodrigo H. Ordonez-Hurtado, Cécile Rousseau, George K. Holt, Eszter Székely, Alexander Whittle, Samuel Jackson, Adriano Agnello, Stanislas Pamela, Alessandra Pascale, Robert Akers, Juan Bernabe Moreno, Vassil Alexandrov, Mykhaylo Zayats

发表机构 * IBM Research(IBM研究院) UK Atomic Energy Authority(英国原子能管理局) STFC Hartree Centre(英国科学与技术设施理事会哈特里中心)

AI总结 提出TokaMind,首个开源托卡马克等离子体动力学基础模型,基于多模态Transformer在MAST数据集上预训练,支持多种数据模态和缺失信号处理,在14个任务上优于基线。

AI中文摘要

我们提出TokaMind,据我们所知,这是首个用于托卡马克等离子体动力学的开源基础模型,基于多模态Transformer(MMT)并在公开可用的MAST数据集上的异构诊断数据上预训练。TokaMind支持多种数据模态(时间序列、2D轮廓和视频),具有不同的采样率、鲁棒的缺失信号处理,并通过选择性加载和冻结四个模型组件实现高效任务适配。为了表示多模态信号,我们使用轻量级固定基离散余弦变换嵌入(DCT3D),并为替代嵌入(例如变分自编码器)提供干净接口。我们在最近引入的MAST基准TokaMark上评估TokaMind,该基准包含14个具有异构重建和预测目标的任务。我们的结果表明,微调后的TokaMind在所有任务上均优于最强的基准基线,仅一个任务除外。与在匹配的epoch预算下从头训练相同架构相比,热启动适配在要求苛刻的下游设置中最为有益,包括长时域预测和高维平衡目标。这些发现突显了多模态预训练对托卡马克等离子体动力学的价值,并为未来的聚变建模任务提供了实用、可扩展的基础。训练代码和模型权重分别公开在github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamind和huggingface.co/UKAEA-IBM-STFC。

英文摘要

We present TokaMind, to our knowledge the first open-source foundation model for tokamak plasma dynamics, based on a Multi-Modal Transformer (MMT) and pretrained on heterogeneous diagnostics from the publicly available MAST dataset. TokaMind supports multiple data modalities (time-series, 2D profiles, and videos) with different sampling rates, robust missing-signal handling, and efficient task adaptation via selectively loading and freezing four model components. To represent multi-modal signals, we use a lightweight fixed-basis Discrete Cosine Transform embedding (DCT3D) and provide a clean interface for alternative embeddings (e.g., Variational Autoencoders). We evaluate TokaMind on the recently introduced MAST benchmark TokaMark, which comprises 14 tasks with heterogeneous reconstruction and forecasting objectives. Our results show that fine-tuned TokaMind outperforms the strongest benchmark baseline on all but one task. Compared with training the same architecture from scratch under a matched epoch budget, warm-start adaptation is most beneficial on demanding downstream settings, including long-horizon forecasting and high-dimensional equilibrium objectives. These findings highlight the value of multi-modal pretraining for tokamak plasma dynamics and provide a practical, extensible foundation for future fusion modeling tasks. Training code and model weights are publicly available at github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamind and huggingface.co/UKAEA-IBM-STFC, respectively.
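A fixed-basis cosine transform embedding can be illustrated in one dimension: project a signal window onto cosine basis functions and keep the first few coefficients as a compact, training-free representation. The sketch below is a minimal 1-D illustration using an unnormalized DCT-II; the abstract's DCT3D embedding operates on richer inputs and its exact design is not given here, so the function names `dct2` and `dct_embedding` and the truncation rule are assumptions.

```python
import math


def dct2(signal):
    """Unnormalized DCT-II of a 1-D signal (direct O(n^2) evaluation)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def dct_embedding(signal, num_coeffs):
    """Keep only the lowest-frequency coefficients as a fixed-size embedding."""
    return dct2(signal)[:num_coeffs]


# A slowly varying toy signal is summarized by a few low-frequency terms.
window = [1.0, 1.0, 1.0, 1.0]
print(dct_embedding(window, num_coeffs=2))
```

For a constant window, all of the energy lands in the first coefficient and the higher ones vanish up to floating-point error, which is why truncating the basis compresses smooth diagnostics with little loss.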

2602.16073 2026-06-08 cs.RO cs.AI cs.LO cs.SY eess.SY 版本更新

ScenicRules: An Autonomous Driving Benchmark with Multi-Objective Specifications and Abstract Scenarios

ScenicRules:具有多目标规范和抽象场景的自动驾驶基准测试

Kevin Kai-Chun Chang, Ekin Beyazit, Alberto Sangiovanni-Vincentelli, Tichakorn Wongpiromsarn, Sanjit A. Seshia

发表机构 * University of California, Berkeley(加州大学伯克利分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出ScenicRules基准,通过层次化规则框架和形式化场景模型,在随机环境下评估自动驾驶系统对优先级多目标规范的满足程度。

Comments v2: Minor numerical corrections for Table V. 16 pages, 14 figures, 7 tables. Extended version of paper accepted to 2026 IEEE Intelligent Vehicles Symposium (IV 2026). ScenicRules benchmark available at https://github.com/BerkeleyLearnVerify/ScenicRules

AI中文摘要

开发复杂交通环境下的自动驾驶系统需要平衡多个目标,例如避免碰撞、遵守交通规则和高效行驶。在许多情况下,这些目标无法同时满足,因此自然会出现明确的优先级关系。此外,驾驶规则需要上下文,因此正式建模这些规则适用的环境场景非常重要。现有的自动驾驶车辆评估基准缺乏这种多目标优先级规则和形式化环境模型的组合。在这项工作中,我们引入了ScenicRules,一个在随机环境下根据优先级多目标规范评估自动驾驶系统的基准。我们首先形式化了一组多样化的目标作为定量评估指标。接下来,我们设计了一个层次化规则书框架,以可解释和可适应的方式编码多个目标及其优先级关系。然后,我们构建了一个紧凑但具有代表性的场景集合,涵盖各种驾驶情境和近事故情况,并使用Scenic语言进行形式化建模。实验结果表明,我们的形式化目标和层次化规则书与人类驾驶判断高度一致,并且我们的基准有效地暴露了代理在优先级目标方面的失败。我们的基准可在https://github.com/BerkeleyLearnVerify/ScenicRules/获取。

英文摘要

Developing autonomous driving systems for complex traffic environments requires balancing multiple objectives, such as avoiding collisions, obeying traffic rules, and making efficient progress. In many situations, these objectives cannot be satisfied simultaneously, and explicit priority relations naturally arise. Also, driving rules require context, so it is important to formally model the environment scenarios within which such rules apply. Existing benchmarks for evaluating autonomous vehicles lack such combinations of multi-objective prioritized rules and formal environment models. In this work, we introduce ScenicRules, a benchmark for evaluating autonomous driving systems in stochastic environments under prioritized multi-objective specifications. We first formalize a diverse set of objectives to serve as quantitative evaluation metrics. Next, we design a Hierarchical Rulebook framework that encodes multiple objectives and their priority relations in an interpretable and adaptable manner. We then construct a compact yet representative collection of scenarios spanning diverse driving contexts and near-accident situations, formally modeled in the Scenic language. Experimental results show that our formalized objectives and Hierarchical Rulebooks align well with human driving judgments and that our benchmark effectively exposes agent failures with respect to the prioritized objectives. Our benchmark can be accessed at https://github.com/BerkeleyLearnVerify/ScenicRules/.
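One simple way to read "prioritized multi-objective specifications" is as a lexicographic comparison: two trajectories are compared on the highest-priority rule first, and lower-priority rules only break ties. The sketch below assumes a total priority order and made-up rule names and violation scores; the Hierarchical Rulebook framework in the paper may be more general (for example, allowing rules of equal or incomparable priority), so this is not a description of the benchmark's implementation.

```python
def compare_trajectories(violations_a, violations_b, priority):
    """Return 'a', 'b', or 'tie' by comparing violation scores in priority order.

    Lower violation is better. Rules later in `priority` matter only when
    all earlier rules are violated equally.
    """
    for rule in priority:
        a, b = violations_a[rule], violations_b[rule]
        if a < b:
            return "a"
        if b < a:
            return "b"
    return "tie"


# Hypothetical rules, highest priority first.
priority = ["no_collision", "stay_in_lane", "make_progress"]
traj_a = {"no_collision": 0.0, "stay_in_lane": 2.0, "make_progress": 1.0}
traj_b = {"no_collision": 0.5, "stay_in_lane": 0.0, "make_progress": 0.0}
print(compare_trajectories(traj_a, traj_b, priority))  # a
```

Trajectory `a` wins even though it is worse on both lower-priority rules, because it is strictly better on collision avoidance; this is the behaviour that distinguishes prioritized rules from a weighted sum of objectives.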

2602.16864 2026-06-08 cs.LG cs.AI math.DS 版本更新

Position: A Dynamical Systems Perspective is Needed to Advance Time Series Modeling

立场:需要动力系统视角以推进时间序列建模

Daniel Durstewitz, Christoph Jürgen Hemmer, Florian Hess, Charlotte Ricarda Doll, Lukas Eisenmann

发表机构 * University of Tübingen(图宾根大学)

AI总结 本文主张时间序列建模需引入动力系统视角,通过重构底层DS实现更优预测,并讨论其理论优势与具体建议。

AI中文摘要

时间序列(TS)建模从早期的统计方法(主要是线性方法)发展到当前TS基础模型的趋势,已经走过了很长的路。由于该领域存在大量炒作和工业需求,实际进展并不总是清晰。为了将TS预测和分析提升到新水平,本文主张该领域需要动力系统(DS)视角。来自自然或工程系统的观测TS几乎总是源于某个底层DS,并且可以说,访问其控制方程将产生理论上的最优预测。这是DS重构(DSR)的承诺,这是一类旨在从数据中推断底层DS替代模型的ML/AI方法。但基于DS原理的模型还提供了其他深刻优势:除了短期预测,它们还能预测观测系统的长期统计量,这在许多实际场景中可能是更相关的量。此外,DS理论提供了领域无关的理论洞见,理解TS生成的机制,从而告知我们例如任何TS模型性能的上限、向未见过场景(如临界点)的泛化,或潜在的控制策略。在回顾DS理论和DSR中的一些核心概念、方法、度量和模型后,我们将讨论该领域的洞见如何以关键方式推进TS建模,实现更好的预测,同时大幅降低计算和内存占用。最后,我们提出若干具体建议,将DSR的洞见转化为TS建模实践。

英文摘要

Time series (TS) modeling has come a long way from early statistical, mainly linear, approaches to the current trend in TS foundation models. With a lot of hype and industrial demand in this field, it is not always clear how much progress there really is. To advance TS forecasting and analysis to the next level, here we argue that the field needs a dynamical systems (DS) perspective. TS of observations from natural or engineered systems almost always originate from some underlying DS, and arguably access to its governing equations would yield theoretically optimal forecasts. This is the promise of DS reconstruction (DSR), a class of ML/AI approaches that aim to infer surrogate models of the underlying DS from data. But models based on DS principles offer other profound advantages: Beyond short-term forecasts, they make it possible to predict the long-term statistics of an observed system, which in many practical scenarios may be the more relevant quantities. DS theory furthermore provides domain-independent theoretical insight into mechanisms underlying TS generation, and thereby will inform us, e.g., about upper bounds on performance of any TS model, generalization into unseen regimes as in tipping points, or potential control strategies. After reviewing some of the central concepts, methods, measures, and models in DS theory and DSR, we will discuss how insights from this field can advance TS modeling in crucial ways, enabling better forecasting with much lower computational and memory footprints. We conclude with a number of specific suggestions for translating insights from DSR into TS modeling.

2602.18905 2026-06-08 cs.LG cs.AI cs.CL 版本更新

TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning

TRUE:一种用于大语言模型推理的可信统一解释框架

Yujiao Yang

发表机构 * Dalian University of Technology(大连理工大学)

AI总结 提出TRUE框架,通过可执行推理验证、可行域DAG建模和因果故障模式分析,为LLM推理提供实例级、局部结构级和类别级的多层次可验证解释。

详情
AI中文摘要

大型语言模型(LLM)在复杂推理任务中展现出强大能力,但其决策过程仍难以解释。现有解释方法通常缺乏可信的结构性洞察,且局限于单实例分析,无法揭示推理稳定性和系统性故障机制。为解决这些局限,我们提出可信统一解释框架(TRUE),该框架集成了可执行推理验证、可行域有向无环图(DAG)建模和因果故障模式分析。在实例层面,我们将推理轨迹重新定义为可执行过程规范,并引入盲执行验证来评估操作有效性。在局部结构层面,我们通过结构一致性扰动构建可行域DAG,从而显式刻画局部输入空间中推理稳定性和可执行区域。在类别层面,我们引入因果故障模式分析方法,识别重复出现的结构性故障模式,并使用Shapley值量化其因果影响。在多个推理基准上的广泛实验表明,所提框架提供了多层次、可验证的解释,包括单个实例的可执行推理结构、邻近输入的可行域表示以及类别层面具有量化重要性的可解释故障模式。这些结果建立了一个统一且原则性的范式,用于提高LLM推理系统的可解释性和可靠性。

英文摘要

Large language models (LLMs) have demonstrated strong capabilities in complex reasoning tasks, yet their decision-making processes remain difficult to interpret. Existing explanation methods often lack trustworthy structural insight and are limited to single-instance analysis, failing to reveal reasoning stability and systematic failure mechanisms. To address these limitations, we propose the Trustworthy Unified Explanation Framework (TRUE), which integrates executable reasoning verification, feasible-region directed acyclic graph (DAG) modeling, and causal failure mode analysis. At the instance level, we redefine reasoning traces as executable process specifications and introduce blind execution verification to assess operational validity. At the local structural level, we construct feasible-region DAGs via structure-consistent perturbations, enabling explicit characterization of reasoning stability and the executable region in the local input space. At the class level, we introduce a causal failure mode analysis method that identifies recurring structural failure patterns and quantifies their causal influence using Shapley values. Extensive experiments across multiple reasoning benchmarks demonstrate that the proposed framework provides multi-level, verifiable explanations, including executable reasoning structures for individual instances, feasible-region representations for neighboring inputs, and interpretable failure modes with quantified importance at the class level. These results establish a unified and principled paradigm for improving the interpretability and reliability of LLM reasoning systems.
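
The abstract says failure patterns are weighted by Shapley values. As a minimal sketch of that general technique (the pattern names and the toy error function below are hypothetical, not taken from the paper), exact Shapley values for a small set can be computed by averaging marginal contributions over all orderings:

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical failure patterns and a toy "error rate" for each combination.
PATTERNS = ["skipped_step", "wrong_operand", "unit_mismatch"]

def error_rate(active):
    rate = 0.0
    if "skipped_step" in active:
        rate += 0.30
    if "wrong_operand" in active:
        rate += 0.10
    if {"skipped_step", "unit_mismatch"} <= active:
        rate += 0.20  # interaction: only harmful together
    return rate

phi = shapley_values(PATTERNS, error_rate)
```

In this toy setup the interaction term is split equally between the two patterns that jointly cause it, and the attributions sum to the error rate with all patterns active. Exact enumeration grows factorially, so it only suits a handful of patterns; larger sets would need sampling.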

2603.02220 2026-06-08 cs.LG cs.AI cs.CV 版本更新

Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting

预测即渲染:面向时间序列预测的2D高斯泼溅框架

Yixin Wang, Yifan Hu, Peiyuan Liu, Naiqi Li, Tao Dai, Shu-Tao Xia

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院)

AI总结 提出TimeGS框架,将时间序列预测转化为2D高斯泼溅生成渲染,通过各向异性高斯核和连续光栅化解决周期内与周期间的建模问题,实现SOTA性能。

详情
AI中文摘要

时间序列预测仍然是一个具有挑战性的问题,因为周期内波动和周期间趋势的复杂纠缠。尽管最近的进展试图将一维序列重塑为二维周期-相位表示,但它们存在两个主要局限性。首先,将重塑后的张量视为静态图像会导致拓扑不匹配,因为标准空间算子在网格边界处切断了时间连续性。其次,依赖统一的固定大小表示会低效地分配建模能力,并且无法为可压缩的非平稳时间模式提供所需的自适应分辨率。为了解决这些局限性,我们引入了TimeGS,这是一个新颖的框架,从根本上将预测范式从回归转变为二维生成渲染。通过将未来序列重新概念化为潜在的二维时间表面,TimeGS利用高斯核的固有各向异性,以灵活的几何对齐自适应地建模复杂变化。为了实现这一点,我们引入了多基高斯核生成(MB-GKG)块,该块从固定字典中合成核以稳定优化,以及多周期时间连续光栅化(MP-CCR)块,该块在周期边界上强制执行严格的时间连续性。在标准基准数据集上的全面实验表明,TimeGS达到了最先进或具有竞争力的性能。代码位于https://github.com/yixinwang1/TimeGS。

英文摘要

Time series forecasting remains a challenging problem due to the intricate entanglement of intra-period fluctuations and inter-period trends. While recent advances have attempted to reshape 1D sequences into 2D period-phase representations, they suffer from two principal limitations. Firstly, treating reshaped tensors as static images results in a topological mismatch, as standard spatial operators sever chronological continuity at grid boundaries. Secondly, relying on uniform fixed-size representations allocates modeling capacity inefficiently and fails to provide the adaptive resolution required for compressible, non-stationary temporal patterns. To address these limitations, we introduce TimeGS, a novel framework that fundamentally shifts the forecasting paradigm from regression to 2D generative rendering. By reconceptualizing the future sequence as a latent 2D temporal surface, TimeGS utilizes the inherent anisotropy of Gaussian kernels to adaptively model complex variations with flexible geometric alignment. To realize this, we introduce a Multi-Basis Gaussian Kernel Generation (MB-GKG) block that synthesizes kernels from a fixed dictionary to stabilize optimization, and a Multi-Period Chronologically Continuous Rasterization (MP-CCR) block that enforces strict temporal continuity across periodic boundaries. Comprehensive experiments on standard benchmark datasets demonstrate that TimeGS attains state-of-the-art or competitive performance. The code is at https://github.com/yixinwang1/TimeGS.
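
The topological mismatch described in the abstract can be seen in a few lines. This sketch only shows the generic 1D-to-2D period-phase reshape, assuming the series length is a multiple of the period; it is not TimeGS code.

```python
def to_period_phase(series, period):
    """Reshape a 1D series into rows of one period each (row = period index, column = phase)."""
    if len(series) % period != 0:
        raise ValueError("series length must be a multiple of the period")
    return [series[i:i + period] for i in range(0, len(series), period)]

series = list(range(12))           # time steps 0..11
grid = to_period_phase(series, 4)  # 3 periods x 4 phases

# Time steps 3 and 4 are consecutive in time ...
end_of_row0 = (0, 3)    # grid[0][3] == 3
start_of_row1 = (1, 0)  # grid[1][0] == 4

# ... but they sit at opposite ends of the grid, so a small spatial
# kernel (e.g. 3x3) centred on one never covers the other.
column_gap = abs(end_of_row0[1] - start_of_row1[1])
```

A rasterizer that is continuous across period boundaries would have to treat the end of one row and the start of the next as neighbours, which a plain image convolution does not.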

2603.04982 2026-06-08 cs.CY cs.AI cs.HC 版本更新

Training for Technology: Adoption and Productive Use of Generative AI in Legal Analysis

技术培训:法律分析中生成式人工智能的采纳与生产性使用

Benjamin M. Chen, Hong Bao

发表机构 * University of Hong Kong Faculty of Law(香港大学法学院)

AI总结 通过随机实验发现,未经培训的法学学生使用大语言模型反而降低表现,而简短培训能显著提升采纳率和成绩,表明生成式AI的生产力需要培训支持。

详情
AI中文摘要

有针对性的用户培训能否释放生成式人工智能在专业环境中的生产潜力?我们通过一项随机实验研究了这个问题,其中164名法学学生在三种条件下完成了一项问题识别考试:无GenAI访问权限、可选访问大语言模型(LLM)、或LLM访问加简短培训干预。未经培训的LLM访问被证明适得其反:与没有任何LLM访问权限的参与者相比,未经培训的用户撰写的答案明显更短,犯更多案例陈述错误,且得分略低,尽管大多数差异未达到常规显著性水平。培训扭转了这一模式。接受培训的参与者以更高的比例采纳LLM(41% vs. 26%;p = 0.044),得分比未经培训的用户高0.27个绩点——大约一个精细等级——(p = 0.027),并且更准确地陈述了适用规则(p = 0.014)。主分层分析表明,培训主要通过采纳而非有效性发挥作用——在严格均值优势下,采纳下限(1.06)超过有效性上限(0.42)——尽管置信区间较宽。更广泛地说,这些发现挑战了GenAI主要惠及低技能工人的观点:没有培训,高能力从业者选择退出,而低能力用户采纳但无生产力。实现GenAI的生产力提升需要同时投资于访问和指导。

英文摘要

Can targeted user training unlock the productive potential of generative artificial intelligence in professional settings? We study this question using a randomized experiment in which 164 law students completed an issue-spotting examination under one of three conditions: no GenAI access, optional access to a large language model (LLM), or LLM access with a brief training intervention. Untrained LLM access proved counterproductive: relative to participants without any LLM access, untrained users wrote significantly shorter answers, committed more case misstatements, and scored marginally lower, though most differences fall short of conventional significance. Training reversed this pattern. Trained participants adopted the LLM at higher rates (41% vs. 26%; p = 0.044), scored 0.27 grade points higher than untrained users--roughly one fine grade--(p = 0.027), and stated applicable rules more accurately (p = 0.014). Principal stratification analysis suggests training operates primarily through adoption rather than effectiveness--the adoption lower bound (1.06) exceeds the effectiveness upper bound (0.42) at strict mean dominance--though confidence intervals are wide. More broadly, these findings challenge the view that GenAI primarily benefits lower-skilled workers: without training, higher-ability practitioners opt out while lower-ability users adopt but unproductively. Realizing GenAI's productivity gains requires investment in both access and instruction.

2603.08683 2026-06-08 cs.SD cs.AI cs.LG eess.AS 版本更新

Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

全保真音频无损压缩的语言建模基准测试

Phillip Long, Zachary Novack, Chris Donahue

发表机构 * University of California, San Diego, Computer Science and Engineering Department(加州大学圣地亚哥分校计算机科学与工程系) Carnegie Mellon University, School of Computer Science(卡内基梅隆大学计算机科学学院)

AI总结 提出字节级分词方案Trilobyte,将词汇量从指数级降至常数级,首次实现24位音频的LM无损压缩,并在8位和16位下超越FLAC。

Comments Accepted at Interspeech 2026, 7 pages, 5 figures

详情
AI中文摘要

在原始波形上训练的自回归“语言”模型(LM)可被重新用于无损音频压缩,但先前的工作仅限于8位音频,尚不清楚此类方法是否适用于实际场景(16/24位)以及能否与现有编解码器竞争。我们对基于LM的压缩在全保真音频上进行了基准测试,涵盖不同领域(音乐、语音、生物声学)、采样率(16kHz-48kHz)和位深度(8、16、24位)。标准的样本级分词在更高位深度下因词汇量过大(16位为65K;24位为16.7M)而变得不可行。我们提出了Trilobyte,一种用于全分辨率音频的字节级分词方案,将词汇量从$O(2^{b})$改进为$O(1)$,并首次实现了可行的24位基于LM的无损压缩。虽然LM在8位和16位下持续优于FLAC并达到最先进的压缩效果,但我们观察到,随着位深度超过8位,压缩增益变得更为有限。

英文摘要

Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
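
The vocabulary figures quoted above follow from simple arithmetic. The sketch below assumes unsigned samples and a plain big-endian byte split; the actual Trilobyte scheme may differ in its details.

```python
def sample_to_bytes(sample, bit_depth):
    """Split an unsigned PCM sample into big-endian byte tokens (each in 0..255)."""
    n_bytes = bit_depth // 8
    return list(sample.to_bytes(n_bytes, "big"))

def bytes_to_sample(tokens):
    """Reassemble byte tokens into the original sample (lossless round trip)."""
    return int.from_bytes(bytes(tokens), "big")

for bit_depth in (8, 16, 24):
    sample_vocab = 2 ** bit_depth   # one token per possible sample value
    byte_vocab = 256                # constant, independent of bit depth
    print(bit_depth, sample_vocab, byte_vocab)

tokens = sample_to_bytes(0xABCDEF, 24)   # three byte tokens for one 24-bit sample
```

The constant vocabulary comes at the cost of sequence length: a 24-bit sample becomes three tokens instead of one.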

2603.11075 2026-06-08 cs.AR cs.AI 版本更新

VeriHGN: Heterogeneous Graph-Based Congestion Prediction for Chip Layout Verification

VeriHGN: 基于异构图的芯片布局验证中的拥堵预测

Runbang Hu, Bo Fang, Bingzhe Li, Yuede Ji

发表机构 * The University of Texas at Arlington(德克萨斯大学阿灵顿分校) The University of Texas at Dallas(德克萨斯大学达拉斯分校)

AI总结 本文提出VeriHGN框架,通过增强的异构图统一电路组件和空间网格,实现更准确的逻辑意图与物理实现的交互建模,提高了拥堵预测的准确性和相关性。

Comments Accepted at KDD 2026

详情
AI中文摘要

随着超大规模集成电路(VLSI)设计在规模和复杂性上持续增长,布局验证已成为现代电子设计自动化(EDA)工作流程中的核心挑战。在实践中,拥堵只能在详细布线后才能被准确识别,这使得传统验证既耗时又昂贵。因此,基于学习的方法被用于实现早期阶段的拥堵预测并减少布线迭代。然而,尽管先前的方法同时利用了网表连接性和布局特征,但它们通常以松散耦合的方式建模这两个方面,并且主要输出数值化的拥堵估计。我们提出VeriHGN,一个基于增强异构图的验证框架,将电路组件和空间网格统一到单一关系表示中,从而更忠实地建模逻辑意图与物理实现之间的交互。在ISPD2015、CircuitNet-N14和CircuitNet-N28等工业基准上的实验表明,与现有最先进方法相比,VeriHGN在预测准确性和相关性指标上取得了最佳或接近最佳的性能。

英文摘要

As Very Large Scale Integration (VLSI) designs continue to scale in size and complexity, layout verification has become a central challenge in modern Electronic Design Automation (EDA) workflows. In practice, congestion can only be accurately identified after detailed routing, making traditional verification both time-consuming and costly. Learning-based approaches have therefore been explored to enable early-stage congestion prediction and reduce routing iterations. However, although prior methods incorporate both netlist connectivity and layout features, they often model the two in a loosely coupled manner and primarily produce numerical congestion estimates. We propose VeriHGN, a verification framework built on an enhanced heterogeneous graph that unifies circuit components and spatial grids into a single relational representation, enabling more faithful modeling of the interaction between logical intent and physical realization. Experiments on industrial benchmarks, including ISPD2015, CircuitNet-N14, and CircuitNet-N28, demonstrate that VeriHGN achieves the best or near-best performance over state-of-the-art methods in prediction accuracy and correlation metrics.

2603.13428 2026-06-08 cs.SE cs.AI 版本更新

EvoClaw: Evaluating AI Agents on Continuous Software Evolution

EvoClaw: 评估AI代理在持续软件演化中的表现

Gangda Deng, Zhaoling Chen, Zhongming Yu, Haoyang Fan, Yuhong Liu, Yuxin Yang, Dhruv Parikh, Rajgopal Kannan, Le Cong, Mengdi Wang, Qian Zhang, Viktor Prasanna, Xiangru Tang, Xingyao Wang

发表机构 * USC(南加州大学) UCR(加州大学河滨分校) UCSD(加州大学圣地亚哥分校) Army Research Office(陆军研究办公室) Stanford(斯坦福大学) Princeton(普林斯顿大学) Haven OpenHands

AI总结 针对现有基准测试忽视软件演化中时间依赖和技术债务的问题,提出EvoClaw基准,通过从提交日志重建可验证里程碑DAG,评估AI代理在持续开发中维持系统完整性和限制错误累积的能力。

Comments ICML 2026

详情
AI中文摘要

随着AI代理越来越多地被部署为长期运行的系统,自主构建并持续演化定制软件以在动态环境中进行交互变得至关重要。然而,现有基准测试在孤立的、一次性的编码任务上评估代理,忽视了真实世界软件演化中固有的时间依赖和技术债务。为弥补这一差距,我们引入了DeepCommit,一个从嘈杂的提交日志中重建可验证里程碑DAG的代理管道,其中里程碑被定义为功能内聚的开发目标。这些可执行序列使得EvoClaw成为可能,这是一个新颖的基准测试,要求代理维持系统完整性并限制错误累积,这些是当前基准测试中大部分缺失的长期软件演化的维度。我们对4个代理框架下的12个前沿模型的评估揭示了一个关键弱点:整体性能得分从孤立任务上的>80%显著下降到持续设置中的最多38%,暴露了代理在长期维护和错误传播方面的深刻困境。

英文摘要

With AI agents increasingly deployed as long-running systems, it becomes essential to autonomously construct and continuously evolve customized software to enable interaction within dynamic environments. Yet, existing benchmarks evaluate agents on isolated, one-off coding tasks, neglecting the temporal dependencies and technical debt inherent in real-world software evolution. To bridge this gap, we introduce DeepCommit, an agentic pipeline that reconstructs verifiable Milestone DAGs from noisy commit logs, where milestones are defined as functionally cohesive development goals. These executable sequences enable EvoClaw, a novel benchmark that requires agents to sustain system integrity and limit error accumulation, dimensions of long-term software evolution largely missing from current benchmarks. Our evaluation of 12 frontier models across 4 agent frameworks reveals a critical vulnerability: overall performance scores drop significantly from >80% on isolated tasks to at most 38% in continuous settings, exposing agents' profound struggle with long-term maintenance and error propagation.
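
A milestone DAG constrains the order in which an agent can attempt development goals. The sketch below uses hypothetical milestone names and Python's standard `graphlib` to derive one valid execution order; it is not the DeepCommit pipeline, which reconstructs such graphs from commit logs.

```python
from graphlib import TopologicalSorter

# Hypothetical milestones: each key lists the milestones it depends on.
milestone_deps = {
    "add_config_loader": set(),
    "add_cli": {"add_config_loader"},
    "add_plugin_api": {"add_config_loader"},
    "add_plugin_cli_commands": {"add_cli", "add_plugin_api"},
}

# static_order() yields each milestone only after all of its prerequisites.
order = list(TopologicalSorter(milestone_deps).static_order())
position = {name: i for i, name in enumerate(order)}
```

Because later milestones build on the code state left by earlier ones, an error introduced early can propagate to everything after it in this order, which is the effect the continuous setting is designed to expose.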

2603.20990 2026-06-08 cs.IR cs.AI 版本更新

$\mathrm{ECI}_{\mathrm{sem}}$: Semantic Residual Effective Contrastive Information for Evaluating Hard Negatives

$\mathrm{ECI}_{\mathrm{sem}}$: 用于评估难负样本的语义残差有效对比信息

Aarush Sinha, Rahul Seetharaman, Aman Bansal

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology (IIT), Kharagpur, India(印度理工学院克勒格布尔分校计算机科学与工程系,印度)

AI总结 本文提出$\mathrm{ECI}_{\mathrm{sem}}$,一种无需训练(但需要带标注正例)的诊断方法,利用冻结的目标编码器嵌入对候选负样本来源进行排序;在MS MARCO负样本来源上,其排序与最强的BEIR聚合迁移结果一致,并在样本量、温度、分词器和IDF语料扰动下保持稳定。

详情
AI中文摘要

在密集检索中,难负样本来源的选择通常要等到微调和下游评估之后才能确定。我们提出$\mathrm{ECI}_{\mathrm{sem}}$,它是有效对比信息(ECI)的语义残差变体,利用冻结的目标编码器嵌入对候选负样本来源进行排序。$\mathrm{ECI}_{\mathrm{sem}}$无需训练,但并非无需标签:每个被评分的样本都需要一个查询、一个带标注的正例和一个显式的候选负例。$\mathrm{ECI}_{\mathrm{sem}}$基于目标一致性、语义局部性、词汇残差性和一个对数行列式多样性目标构建加权残差信息矩阵。在MS MARCO负样本来源上,同族$\mathrm{ECI}_{\mathrm{sem}}$在非混合来源中将LLM负样本排在首位,在混合来源中将Dense+LLM排在首位,与DistilBERT、E5-base和Contriever上最强的BEIR聚合迁移结果一致。受控消融实验表明,这种一致性依赖于使用目标编码器家族,而额外的消融实验显示其在样本量、温度、分词器和IDF语料扰动下保持稳定。理论部分给出了与损失下降之间的局部线性化联系,而实证研究则将下游评估视为最终检验。

英文摘要

Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose $\mathrm{ECI}_{\mathrm{sem}}$, a semantic residual variant of Effective Contrastive Information (ECI) that ranks candidate negative sources using frozen target-encoder embeddings. $\mathrm{ECI}_{\mathrm{sem}}$ is training-free, not label-free: each scored example requires a query, a labeled positive, and an explicit candidate negative. $\mathrm{ECI}_{\mathrm{sem}}$ builds a weighted residual information matrix from target consistency, semantic locality, lexical residuality, and a log-determinant diversity objective. On MS MARCO negative sources, in-family $\mathrm{ECI}_{\mathrm{sem}}$ ranks LLM negatives highest among non-hybrid sources and Dense+LLM highest among hybrid sources, matching the strongest aggregate BEIR transfer results across DistilBERT, E5-base, and Contriever. Controlled ablations show that this alignment depends on using the target encoder family, while additional ablations show stability under sample-size, temperature, tokenizer, and IDF-corpus perturbations. The theory gives a local linearized link to loss reduction, while the empirical study treats downstream evaluation as the final test.
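
One ingredient named above is a log-determinant diversity objective. As a rough illustration of that general idea only, the sketch below scores a set of embedding vectors by log det(I + G), where G is their Gram matrix. The exact matrix and weighting used by $\mathrm{ECI}_{\mathrm{sem}}$ are not given in the abstract, so this form is an assumption.

```python
import math

def gram(vectors):
    """Matrix of pairwise dot products."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def logdet_spd(matrix):
    """log(det(M)) for a symmetric positive-definite M via Cholesky factorisation."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return 2.0 * sum(math.log(lower[i][i]) for i in range(n))

def diversity(vectors):
    """log det(I + G): larger when the vectors point in different directions."""
    g = gram(vectors)
    n = len(g)
    shifted = [[g[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return logdet_spd(shifted)

diverse = diversity([[1.0, 0.0], [0.0, 1.0]])    # orthogonal pair: log(4)
redundant = diversity([[1.0, 0.0], [1.0, 0.0]])  # duplicated vector: log(3)
```

Adding the identity keeps the matrix positive definite even when vectors repeat, so duplicated negatives lower the score without making it undefined.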

2603.22327 2026-06-08 cs.IR cs.AI cs.DL 版本更新

Evaluating AI-based Scientific Knowledge Synthesis with Epidemiological Systematic Reviews

基于流行病学系统评价评估AI科学知识综合

Shreyansh Padarha, Ryan Othniel Kearns, Tristan Naidoo, Lingyi Yang, Łukasz Borchmann, Piotr BŁaszczyk, Christian Morgenstern, Ruth McCabe, Sangeeta Bhatia, Philip H. Torr, Jakob Foerster, Scott A. Hale, Thomas Rawson, Anne Cori, Elizaveta Semenova, Adam Mahdi

发表机构 * University of Oxford(牛津大学) Imperial College London(伦敦帝国理工学院) University of Nottingham(诺丁汉大学) Snowflake AI Research(Snowflake人工智能研究) Independent(独立)

AI总结 提出AgentSLR评估框架,包含自动化工作流和专家标注数据集,测试LLM在流行病学系统评价各阶段能力,发现无模型全面领先,结构化提取是主要瓶颈。

详情
AI中文摘要

系统文献综述(SLR)是一种要求高且风险大的科学知识综合形式,但作为大型语言模型(LLM)的评估场景仍未被充分定义。我们引入了AgentSLR,一个大规模评估框架,包含SLR自动化工作流和覆盖16,248篇文章的专家标注数据集,旨在测试LLM在流行病学SLR各阶段的能力。参考标注来自关于WHO优先病原体的同行评审研究,并由领域专家制作。该框架将每个综述阶段作为独立单元进行评估,并配有专用指标,以便进行有针对性的失败分析。我们评估了五种前沿推理模型,发现没有单一模型在所有任务中占主导地位,显示出子任务专业化往往被聚合基准所掩盖。结构化数据提取是一个主要瓶颈,没有模型在平均字段级F1上超过0.67。估计成本差异很大,评估模型之间相差高达96倍。记录的失败模式表明,评估的模型在流行病学中尚不足以可靠地进行无监督部署,因为其发现可能影响公共政策。

英文摘要

Systematic literature reviews (SLRs) are a demanding and high-stakes form of scientific knowledge synthesis that remains underspecified as an evaluation setting for large language models (LLMs). We introduce AgentSLR, a large-scale evaluation harness comprising an SLR automation workflow and an expert annotated dataset covering 16,248 articles, designed to test LLM capabilities across the stages of SLRs in epidemiology. Reference annotations were derived from peer-reviewed studies on WHO priority pathogens and produced by domain experts. The harness evaluates each review stage as a separate unit with dedicated metrics enabling targeted failure analysis. We evaluated five frontier reasoning models and found that no single model dominated across all tasks, showing sub-task specialisation often hidden by aggregate benchmarks. Structured data extraction is a major bottleneck, with no model exceeding an average field-level F1 of 0.67. Estimated costs vary substantially, by up to 96 times across evaluated models. Documented failure modes suggest that the evaluated models are not yet reliable enough for unsupervised deployment in epidemiology, where findings can inform public policy.
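
For readers unfamiliar with field-level scoring of structured extraction, here is a minimal sketch. It assumes exact-match comparison of (field, value) pairs and micro-averaging within one article; the metric definition used by AgentSLR may differ, and the example record is hypothetical.

```python
def field_level_f1(predicted, reference):
    """Micro-averaged F1 over (field, value) pairs; exact match only."""
    pred_pairs = set(predicted.items())
    ref_pairs = set(reference.items())
    true_pos = len(pred_pairs & ref_pairs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_pairs)
    recall = true_pos / len(ref_pairs)
    return 2 * precision * recall / (precision + recall)

# Hypothetical extraction for one article.
reference = {"pathogen": "Ebola virus", "parameter": "R0", "value": "1.8", "country": "Guinea"}
predicted = {"pathogen": "Ebola virus", "parameter": "R0", "value": "1.5"}

score = field_level_f1(predicted, reference)
```

Here two of three predicted fields are correct and two of four reference fields are recovered, so a wrong value and a missing field both reduce the score.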

2603.24576 2026-06-08 cs.RO cs.AI cs.CV 版本更新

Chameleon: Control-Indexed Prospective Memory for Visuomotor Manipulation

Chameleon: 用于视觉运动操控的控制索引前瞻记忆

Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Yuhang Han, Ying Sun, Yang Xiao, Jianfei Yang

发表机构 * MARS Lab, Nanyang Technological University(南洋理工大学MARS实验室) Institute for Infocomm Research, A*STAR, Singapore(新加坡科技研究局资讯通信研究院) National University of Singapore(新加坡国立大学)

AI总结 提出Chameleon策略,通过索引控制前瞻记忆解决观察-动作延迟问题,在Camo-Dataset上决策成功率从22.5%提升至80.8%,并在多个基准上达到最优。

Comments Code is available at https://github.com/gxyes/MARS_Chameleon

详情
AI中文摘要

机器人常常在观察到某个信息后很久才执行相应的动作。例如,在藏球游戏中,机器人首先看到哪个杯子藏有球,观察杯子移动,然后才需要选择正确的杯子。仅凭最后的观察不足以做出决策:正确的动作依赖于更早的事件。我们将这种时间间隔称为观察-动作延迟。它使得记忆成为一个策略面对的问题:策略必须保持相似历史记录的可区分性,检索与当前决策相关的过去事件,并将该回忆转换为动作就绪状态。我们将这些需求称为可分离性、可寻址性和前瞻性。我们引入了Chameleon,一个约60M参数的视觉运动策略,用于索引控制的前瞻记忆。Chameleon写入具身事件记忆,保留可分离的历史记录,检索控制相关的痕迹,并训练生成的工作状态具有前瞻性。我们还引入了Camo-Dataset,这是一个真实机器人基准,通过使决策场景视觉模糊来隔离观察-动作延迟,从而必须从早期观察中推断正确动作。Chameleon在Camo-Dataset上将决策/端到端成功率从22.5%/21.3%提高到80.8%/71.3%。在公开的长时记忆基准上,它在LIBERO-10上达到87.1% ± 0.8%,在MemoryBench上达到97.3% ± 4.5%,在MIKASA-Robo上达到75.1% ± 1.4%,在相同规模模型中达到最先进水平,并在报告协议下超过多个更大的VLA基线。探针和消融实验表明,Chameleon学习了可分离、可寻址和前瞻的记忆,并且这些特性驱动了其性能提升。

英文摘要

Robots often observe information that determines a future action long before that action is executed. In a shell game, for example, a robot first sees which cup hides the ball, watches the cups move, and only later needs to choose the correct cup. The final observation alone is not enough for a decision: the correct action depends on an earlier event. We refer to this temporal gap as observation-action delay. It makes memory a policy-facing problem: a policy must keep similar histories distinct, retrieve the past event relevant to the current decision, and convert that recall into an action-ready state. We call these requirements separability, addressability, and prospectiveness. We introduce Chameleon, a ~60M visuomotor policy for control-indexed prospective memory. Chameleon writes embodied event memory, preserves separable histories, retrieves control-relevant traces, and trains the resulting working state to be prospective. We also introduce Camo-Dataset, a real-robot benchmark that isolates observation-action delay by making the decision scene visually ambiguous, so the correct action must be inferred from earlier observations. Chameleon improves decision/end-to-end success on Camo-Dataset from 22.5%/21.3% to 80.8%/71.3%. On public long-horizon memory benchmarks, it achieves 87.1% +/- 0.8% on LIBERO-10, 97.3% +/- 4.5% on MemoryBench, and 75.1% +/- 1.4% on MIKASA-Robo, setting the state of the art for same-size models and exceeding multiple larger VLA baselines under the reported protocols. Probes and ablations show that Chameleon learns separable, addressable, and prospective memory, and that these properties drive its performance gains.

2603.26846 2026-06-08 cs.LG cs.AI 版本更新

Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry

稳定推理,不稳定响应:通过稳定性不对称缓解大语言模型欺骗

Guoxi Zhang, Jiawei Chen, Tianzhuo Yang, Lang Qin, Juntao Dai, Yaodong Yang, Jingwei Yi

发表机构 * Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院) Beijing Academy of Artificial Intelligence(北京智源人工智能研究院) School of Chinese as a Second Language, Peking University(北京大学对外汉语教育学院)

AI总结 针对大语言模型内在欺骗问题,提出稳定性不对称正则化(SAR),通过惩罚内部思维链稳定性与外部响应稳定性之间的不对称性来抑制欺骗,实验证明其有效性且不损害模型能力。

详情
AI中文摘要

随着大语言模型(LLMs)在能力和应用范围上的扩展,其可信度变得至关重要。一个关键风险是内在欺骗,即模型策略性地误导用户以实现自身目标。现有的基于思维链(CoT)监控的对齐方法监督显式的推理轨迹。然而,在优化压力下,模型被激励隐藏欺骗性推理,使得语义监督从根本上不可靠。基于认知心理学,我们假设一个欺骗性LLM在其CoT中保持稳定的内部信念,而其外部响应在扰动下仍然脆弱。我们将此现象称为稳定性不对称,并通过测量扰动下内部CoT稳定性与外部响应稳定性之间的对比来量化它。基于这一结构特征,我们提出了稳定性不对称正则化(SAR),一种在强化学习期间惩罚这种分布不对称性的新型对齐目标。与CoT监控不同,SAR针对模型输出的统计结构,使其对语义隐藏具有鲁棒性。大量实验证实,稳定性不对称可靠地识别欺骗行为,并且SAR有效抑制内在欺骗而不降低一般模型能力。

英文摘要

As Large Language Models (LLMs) expand in capability and application scope, their trustworthiness becomes critical. A vital risk is intrinsic deception, wherein models strategically mislead users to achieve their own objectives. Existing alignment approaches based on chain-of-thought (CoT) monitoring supervise explicit reasoning traces. However, under optimization pressure, models are incentivized to conceal deceptive reasoning, rendering semantic supervision fundamentally unreliable. Grounded in cognitive psychology, we hypothesize that a deceptive LLM maintains a stable internal belief in its CoT while its external response remains fragile under perturbation. We term this phenomenon stability asymmetry and quantify it by measuring the contrast between internal CoT stability and external response stability under perturbation. Building on this structural signature, we propose the Stability Asymmetry Regularization (SAR), a novel alignment objective that penalizes this distributional asymmetry during reinforcement learning. Unlike CoT monitoring, SAR targets the statistical structure of model outputs, rendering it robust to semantic concealment. Extensive experiments confirm that stability asymmetry reliably identifies deceptive behavior, and that SAR effectively suppresses intrinsic deception without degrading general model capability.
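
The abstract defines stability asymmetry as a contrast between internal CoT stability and external response stability under perturbation. The sketch below is one simple way such a contrast could be computed, assuming stability is measured as pairwise agreement of discrete outputs; the paper's actual distributional measure is not specified here, and the example outputs are hypothetical.

```python
from itertools import combinations

def agreement(outputs):
    """Fraction of output pairs that are identical (1.0 = perfectly stable)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

def stability_asymmetry(cot_conclusions, responses):
    """Internal (CoT) stability minus external (response) stability."""
    return agreement(cot_conclusions) - agreement(responses)

# Hypothetical outputs for one prompt under four perturbations.
cot_conclusions = ["unsafe", "unsafe", "unsafe", "unsafe"]  # reasoning is stable
responses = ["safe", "unsafe", "safe", "unsafe"]            # answer flips

gap = stability_asymmetry(cot_conclusions, responses)
```

A gap near zero means reasoning and responses are equally stable; a large positive gap is the signature the abstract associates with deception. A regularizer in this spirit would penalize that gap during training.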

2604.03779 2026-06-08 cs.LG cs.AI 版本更新

CountsDiff: A Diffusion Model on the Natural Numbers for Generation and Imputation of Count-Based Data

CountsDiff: 一种用于计数数据生成和插补的自然数扩散模型

Renzo G. Soatto, Anders Hoel, Greycen Ren, Shorna Alam, Stephen Bates, Nikolaos P. Daskalakis, Caroline Uhler, Maria Skoularidou

发表机构 * Princeton University(普林斯顿大学) Stanford University(斯坦福大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出CountsDiff扩散框架,通过生存概率调度和显式损失加权简化Blackout扩散,引入连续时间训练、无分类器引导和逆动态,在自然图像和单细胞RNA-seq插补任务中匹配或超越现有方法。

Comments 39 Pages, 11 figures. To appear in the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

扩散模型在连续和基于token的领域中的生成任务表现出色,但其在离散序数数据上的应用仍不成熟。我们提出CountsDiff,一个旨在对自然数上的分布进行建模的扩散框架。CountsDiff扩展了Blackout扩散框架,通过直接参数化(基于生存概率调度和显式损失加权)简化其公式。这通过引入与现有扩散建模框架中直接类似的设计参数,带来了灵活性。除了这种重新参数化,CountsDiff还引入了现代扩散模型中的特性,这些特性此前在基于计数的领域中缺失,包括连续时间训练、无分类器引导以及允许非单调逆轨迹的搅动/重掩码逆动态。我们提出了CountsDiff的初始实例化,并在自然图像数据集(CIFAR-10、CelebA)上进行了验证,探索了所引入的设计参数在一个复杂、经过充分研究且可解释的数据领域中的影响。然后,我们强调生物计数分析作为一个自然用例,在胎儿和心脏细胞图谱中评估了CountsDiff在单细胞RNA-seq插补上的表现。值得注意的是,我们发现即使这种简单的实例化也能匹配或超越最先进的离散生成模型和领先的scRNA-seq插补方法的性能,同时通过未来工作中优化的设计选择,仍有很大的提升空间。

英文摘要

Diffusion models have excelled at generative tasks for both continuous and token-based domains, but their application to discrete ordinal data remains underdeveloped. We present CountsDiff, a diffusion framework designed to model distributions on the natural numbers. CountsDiff extends the Blackout diffusion framework by simplifying its formulation through a direct parameterization in terms of a survival probability schedule and an explicit loss weighting. This introduces flexibility through design parameters with direct analogues in existing diffusion modeling frameworks. Beyond this reparameterization, CountsDiff introduces features from modern diffusion models, previously absent in counts-based domains, including continuous-time training, classifier-free guidance, and churn/remasking reverse dynamics that allow non-monotone reverse trajectories. We propose an initial instantiation of CountsDiff and validate it on natural image datasets (CIFAR-10, CelebA), exploring the effects of the introduced design parameters in a complex, well-studied, and interpretable data domain. We then highlight biological count assays as a natural use case, evaluating CountsDiff on single-cell RNA-seq imputation in fetal and heart cell atlases. Remarkably, we find that even this simple instantiation matches or surpasses the performance of a state-of-the-art discrete generative model and leading scRNA-seq imputation methods, while leaving substantial headroom for further gains through optimized design choices in future work.
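
A survival probability schedule can be read as binomial thinning of counts: each unit of a count is kept independently with probability p. The sketch below illustrates that forward corruption under this assumption; the schedule values are hypothetical and this is not the CountsDiff implementation.

```python
import random

def thin_counts(counts, survival_prob, rng):
    """Keep each unit of every count independently with probability survival_prob."""
    return [sum(rng.random() < survival_prob for _ in range(c)) for c in counts]

rng = random.Random(0)
x0 = [5, 0, 12, 3]

# A hypothetical survival schedule decreasing from 1 (clean data) to 0 (all zeros).
schedule = [1.0, 0.7, 0.3, 0.0]
noised = {p: thin_counts(x0, p, rng) for p in schedule}
```

Thinning keeps every value a non-negative integer no larger than the original count, so the corrupted data stays on the natural numbers throughout, unlike Gaussian noising.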

2604.04226 2026-06-08 cs.MA cs.AI 版本更新

SW-$A^2$-Bench: Benchmarking Autonomous Software Agent Generation for Agentic Web

SW-$A^2$-Bench: 面向智能体网络的自主软件智能体生成基准测试

Linyao Chen, Bo Huang, Qinlao Zhao, Shuai Shao, Zhi Han, Zicai Cui, Ziheng Zhang, Guangtao Zeng, Wenzheng Tang, Yikun Wang, Yuanjian Zhou, Zimian Peng, Yong Yu, Weiwen Liu, Hiroki Kobayashi, Weinan Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) The University of Tokyo(东京大学) Huazhong University of Science and Technology(华中科技大学) Shanghai Innovation Institute(上海创新研究院) Nankai University(南开大学) Singapore University of Technology and Design(新加坡科技设计大学) Queen’s University(女王大学) Fudan University(复旦大学) Zhejiang University(浙江大学)

AI总结 提出首个软件智能体生成基准SW-$A^2$-Bench,通过编码智能体自动将代码仓库转化为自主软件智能体,评估生成智能体的忠实性与互操作性,以扩展智能体网络规模。

详情
AI中文摘要

智能体网络正在成为一种新兴范式,其中自主软件智能体与在线资源及其他智能体交互以完成用户目标。然而,智能体网络的容量仍受限于自主软件智能体数量不足,这已成为扩展智能体网络的关键挑战。为缓解这一问题,我们研究了通过编码智能体自动将现有代码仓库转化为自主软件智能体的任务,将过程分解为关键阶段,并识别关键技术障碍。为系统评估这一能力,我们提出了面向智能体网络的软件智能体生成基准(SW-$A^2$-Bench),这是首个专为软件智能体生成设计的基准。SW-$A^2$-Bench不仅评估软件智能体是否能够生成,还评估生成的智能体是否忠实于源代码仓库,以及在多智能体工作流中是否与其他智能体可互操作。实验表明,我们的方法有效激活了代码仓库的功能能力,并在智能体网络中实现了可互操作的多智能体协作。我们相信,这项工作将为软件智能体生成提供标准化评估,并有助于未来扩展智能体网络的容量。

英文摘要

The Agentic Web is emerging as a paradigm in which autonomous software agents interact with online resources and with each other to accomplish user goals. However, the capacity of Agentic Web is still limited by insufficient autonomous software agent population, which has become a crucial challenge for scaling Agentic Web. In order to alleviate this, we study the task of automatically converting existing code repositories into autonomous software agents via coding agents, decompose the process into critical stages, and identify key technical hurdles. To systematically evaluate this capability, we propose SoftWare Agent generation for Agentic Web Bench (SW-$A^2$-Bench), the first benchmark designed for software agent generation. SW-$A^2$-Bench evaluates not only whether software agents can be generated, but also whether generated software agents are faithful to the source repositories and interoperable with other agents in multi-agent workflows. Our experiments demonstrate that our approach effectively activates the functional capabilities of code repositories and enables interoperable multi-agent collaboration in Agentic Web. We believe that this work will provide a standardized evaluation for software agent generation and will contribute to the future of scaling the capacity of Agentic Web.

2604.05360 2026-06-08 cs.HC cs.AI 版本更新

OGA-AID: Clinician-in-the-loop AI Report Drafting Assistant for Multimodal Observational Gait Analysis in Post-Stroke Rehabilitation

OGA-AID:用于中风后康复多模态观察性步态分析的临床医生在环AI报告起草助手

Khoi T. N. Nguyen, Nghia D. Nguyen, Hui Yu Koh, Patrick W. H. Kwong, Karen Sui Geok Chua, Ananda Sidarta, Baosheng Yu

发表机构 * Rehabilitation Research Institute of Singapore, Nanyang Technological University, Singapore(新加坡南洋理工大学新加坡康复研究院) Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore(新加坡南洋理工大学李光前医学院) The Grainger College of Engineering, University of Illinois Urbana-Champaign, United States(美国伊利诺伊大学厄巴纳-香槟分校格兰杰工程学院) Department of Rehabilitation Sciences, The Hong Kong Polytechnic University, Hong Kong(香港理工大学康复治疗科学系) VinUni-Illinois Smart Health Center, VinUniversity, Vietnam(越南VinUniversity的VinUni-Illinois智能健康中心) Institute of Rehabilitation Excellence, Tan Tock Seng Hospital, NHG Health, Singapore(新加坡NHG Health陈笃生医院卓越康复研究所)

AI总结 提出OGA-AID,一种临床医生在环的多智能体大语言模型系统,通过协调三个专业智能体合成患者运动记录、运动学轨迹和临床资料,生成结构化步态评估报告,在真实患者数据上优于单次多模态基线,并展示了AI辅助分析与人类临床判断的互补关系。

Comments 2026 CV4Clinic CVPR Workshop Proceedings

详情
AI中文摘要

步态分析在中风后康复中至关重要,但仍然是时间密集型和认知要求高的,特别是当临床医生必须将步态视频和运动捕捉数据整合到结构化报告中时。我们提出了OGA-AID,一种临床医生在环的多智能体大语言模型系统,用于多模态报告起草。该系统协调3个专业智能体,将患者运动记录、运动学轨迹和临床资料综合成结构化评估。在真实患者数据上由专家物理治疗师评估,OGA-AID始终优于单次多模态基线,且误差低。在临床医生在环设置中,简短的专家初步笔记进一步降低了与参考评估相比的误差。我们的研究结果证明了多模态智能体系统用于结构化临床步态评估的可行性,并突出了在康复工作流程中AI辅助分析与人类临床判断之间的互补关系。

英文摘要

Gait analysis is essential in post-stroke rehabilitation but remains time-intensive and cognitively demanding, especially when clinicians must integrate gait videos and motion-capture data into structured reports. We present OGA-AID, a clinician-in-the-loop multi-agent large language model system for multimodal report drafting. The system coordinates 3 specialized agents to synthesize patient movement recordings, kinematic trajectories, and clinical profiles into structured assessments. Evaluated with expert physiotherapists on real patient data, OGA-AID consistently outperforms single-pass multimodal baselines with low error. In clinician-in-the-loop settings, brief expert preliminary notes further reduce error compared to reference assessments. Our findings demonstrate the feasibility of multimodal agentic systems for structured clinical gait assessment and highlight the complementary relationship between AI-assisted analysis and human clinical judgment in rehabilitation workflows.

2604.07821 2026-06-08 cs.MA cs.AI cs.CL 版本更新

More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration

能力越强,合作越少?当LLM在零成本协作中失败时

Advait Yadav, Sid Black, Oliver Sourbut

发表机构 * GitHub

AI总结 研究LLM在多智能体系统中零成本协作的失败原因,通过构建去战略复杂性的环境,发现能力更强的模型(如o3)反而合作更差,并区分了能力失败与主动信息隐瞒,提出针对性干预措施。

Comments Accepted to the ICML 2026 main conference

详情
AI中文摘要

大语言模型(LLM)智能体越来越多地在多智能体系统中协调,但我们缺乏对合作失败地点和原因的理解。许多现实世界的协调问题并非社会困境:帮助他人——分享文档、为队友扫清障碍——对帮助者几乎不花费成本,同时产生巨大的集体利益。LLM智能体在这种帮助免费且被明确指示合作的机制下是否合作,仍然未知。我们构建了一个基于回合的多智能体环境,剥离了所有战略复杂性,使合作无成本且微不足道地最优。在八个广泛使用的LLM中,能力并不能预测合作:OpenAI o3仅达到最优集体性能的17%,而较弱的o3-mini达到50%,尽管有相同的最大化群体收入的指令。使用一种自动化智能体通信一方的因果分解方法,我们将合作失败与能力失败分开,并发现几个有能力的模型在隐瞒信息方面表现积极,尽管从隐瞒中一无所获。针对性的干预措施解决了每种模式:明确的协议使能力受限模型的性能大约翻倍,而小的分享激励则解锁了合作受限模型。我们的结果表明,仅靠扩展智能无法解决多智能体系统中的协调问题,需要深思熟虑的合作设计,即使帮助不花费任何成本。

英文摘要

Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation fails. Many real-world coordination problems are not social dilemmas: helping others -- sharing documentation, unblocking a teammate -- costs the helper almost nothing while producing substantial collective benefit. Whether LLM agents cooperate in this regime, where helping is free and they are explicitly instructed to do so, remains unknown. We build a turn-based multi-agent environment that strips away all strategic complexity, making cooperation costless and trivially optimal. Across eight widely used LLMs, capability does not predict cooperation: OpenAI o3 reaches only 17% of optimal collective performance while the weaker o3-mini reaches 50%, despite identical instructions to maximize group revenue. Using a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, and find that several capable models actively withhold information despite gaining nothing from withholding. Targeted interventions address each mode: explicit protocols roughly double the performance of competence-limited models, while small sharing incentives unlock cooperation-limited ones. Our results suggest that scaling intelligence alone will not solve coordination in multi-agent systems, and will require deliberate cooperative design, even when helping costs nothing.

2604.08168 2026-06-08 cs.RO cs.AI 版本更新

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

ViVa:用于机器人强化学习的视频生成价值模型

Jindi Lv, Hao Li, Jie Li, Fankun Kong, Yang Wang, Pengfei Yi, Yifei Nie, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang

发表机构 * GigaAI Sichuan University(四川大学) Tsinghua University(清华大学)

AI总结 提出ViVa,利用预训练视频生成器联合预测未来本体感受和标量价值,通过时空先验实现可靠价值估计,在三个任务中取得最优结果,与RECAP结合平均成功率达80%。

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过大规模预训练推进了机器人操作,但由于部分可观测性和延迟反馈,实际部署仍然具有挑战性。强化学习通过价值函数解决这一问题,该函数评估任务进展并指导策略改进。然而,基于视觉-语言模型(VLM)的现有价值模型难以捕捉时间动态和物理交互,削弱了长期任务中价值估计的可靠性。本文提出ViVa,一种视频生成价值模型,该模型重新利用预训练的视频生成器,联合预测未来本体感受和标量价值。通过将价值估计基于预期的具身动态,ViVa利用时空先验,将价值与超越静态快照的前瞻性内在耦合。ViVa在三个任务的基于度量的评估中取得了最先进的结果,产生可靠的价值信号,准确跟踪任务进展并检测执行错误。集成到RECAP中,它实现了80%的平均成功率,突显了视频生成模型在价值估计中的前景。

英文摘要

Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics and physical interactions, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator to jointly predict future proprioception and a scalar value. By grounding value estimation in anticipated embodiment dynamics, ViVa leverages spatiotemporal priors to intrinsically couple value with foresight beyond static snapshots. ViVa achieves state-of-the-art results in metric-based evaluation across three tasks, producing reliable value signals that accurately track task progress and detect execution errors. Integrated into RECAP, it achieves an average success rate of 80%, highlighting the promise of video-generative models for value estimation.

2604.09552 2026-06-08 cs.IR cs.AI cs.CL 版本更新

MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

MCERF:通过增强检索推进工程文档的多模态大语言模型评估

Kiarash Naghavi Khanghah, Hoang Anh Nguyen, Anna C. Doris, Amir Mohammad Vahedi, Daniele Grandi, Faez Ahmed, Hongyi Xu

发表机构 * School of Mechanical, Aerospace, and Manufacturing Engineering, University of Connecticut, Storrs, CT 06269(机械、航空航天与制造工程学院,康涅狄格大学,斯托尔斯,CT 06269) Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA(机械工程系,麻省理工学院,剑桥,MA 02139,美国)

AI总结 提出MCERF框架,结合多模态检索器ColPali与大语言模型推理,通过混合查找、视觉文本融合、高推理和自一致性决策等策略,在DesignQA基准上实现平均准确率相对提升41.1%,无需完整规则书摄入即可处理工程文档中的多模态问答。

详情
AI中文摘要

工程规则书和技术标准包含密集文本、表格和插图等多模态信息,对检索增强生成(RAG)系统构成挑战。基于依赖全文摄入和文本检索的DesignQA框架[1],本工作建立了多模态ColPali增强检索与推理框架(MCERF),该系统将多模态检索器与大语言模型推理相结合,实现从工程文档中准确高效地回答问题。该系统采用ColPali检索文本和视觉信息,并采用多种检索与推理策略:(i)混合查找模式用于显式规则提及,(ii)视觉到文本融合用于图形和表格引导的查询,(iii)高推理大语言模型模式用于复杂的多模态问题,以及(iv)自一致性决策以稳定响应。模块化框架设计为未来的多模态系统提供了可重用模板,无论底层模型架构如何。此外,本工作建立并比较了两种路由方法:单案例路由方法和多智能体系统,两者均动态分配查询到最优管道。在DesignQA基准上的评估表明,该系统在所有任务上的平均准确率相比基线RAG最佳结果相对提升了41.1%,这是多模态和推理密集型任务上的显著改进,且无需完整规则书摄入。这表明视觉语言检索、模块化推理和自适应路由如何在工程用例中实现可扩展的文档理解。

英文摘要

Engineering rulebooks and technical standards contain multimodal information like dense text, tables, and illustrations that are challenging for retrieval augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes a Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, and multiple retrieval and reasoning strategies: (i) Hybrid Lookup mode for explicit rule mentions, (ii) Vision to Text fusion for figure and table guided queries, (iii) High Reasoning LLM mode for complex multimodal questions, and (iv) SelfConsistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of underlying model architecture. Furthermore, this work establishes and compares two routing approaches: a single case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark shows that this system improves average accuracy across all tasks with a relative gain of +41.1% over the best baseline RAG results, which is a significant improvement in multimodal and reasoning-intensive tasks without complete rulebook ingestion. This shows how vision language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.

2604.17433 2026-06-08 cs.CL cs.AI cs.LG 版本更新

Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

仅需两个样本的自一致性:CoT-PoT集成实现高效LLM推理

Raman Saparkhan, Majd Hawasly, Md Rizwan Parvez, Mohammad Raza

发表机构 * Carnegie Mellon University Qatar(卡内基梅隆大学(卡塔尔)) Qatar Computing Research Institute(卡塔尔计算研究院)

AI总结 提出一种混合集成方法,结合思维链与程序化推理两种模式,通过仅需两个样本即可实现自一致性,将采样量减少9.3倍,并在78.6%的任务上达到最优。

Comments 9 pages, 3 figures; accepted to Findings of ACL 2026

详情
AI中文摘要

自一致性(SC)是一种通过聚合多个采样输出来提高大型语言模型推理准确性的流行技术,但由于大量采样,其计算成本高昂。我们引入了一种混合集成方法,利用两种不同推理模式(思维链(CoT)和程序化推理(PoT))的互补优势。我们描述了一个通用框架,用于在自一致性中结合这两种推理形式,并提出了全采样和早停的特定策略。我们表明,CoT-PoT集成不仅提高了整体准确性,而且将SC所需的样本数量大幅减少了9.3倍。特别是,大多数任务(78.6%)仅需两个样本即可解决,这在之前的任何SC方法中都是不可能的。

英文摘要

Self-consistency (SC) is a popular technique for improving the reasoning accuracy of large language models by aggregating multiple sampled outputs, but it comes at a high computational cost due to extensive sampling. We introduce a hybrid ensembling approach that leverages the complementary strengths of two distinct modes of reasoning: Chain-of-Thought (CoT) and Program-of-Thought (PoT). We describe a general framework for combining these two forms of reasoning in self-consistency, as well as particular strategies for both full sampling and early-stopping. We show that CoT-PoT ensembling not only improves overall accuracy, but also drastically reduces the number of samples required for SC by a factor of 9.3x. In particular, the majority of tasks (78.6%) can be addressed with only two samples, which has not been possible with any prior SC methods.
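The early-stopping idea in this abstract can be illustrated with a small sketch. It assumes one plausible rule, namely stopping as soon as a CoT sample and a PoT sample return the same answer and otherwise falling back to a majority vote; the abstract does not spell out the paper's exact strategy, so this is not the authors' algorithm. The function name `cot_pot_self_consistency` and the two sampler callables are hypothetical stand-ins for calls to a model.

```python
from collections import Counter
from typing import Callable, Tuple


def cot_pot_self_consistency(
    sample_cot: Callable[[], str],
    sample_pot: Callable[[], str],
    max_samples: int = 10,
) -> Tuple[str, int]:
    """Return (answer, samples_used).

    Assumed rule (not taken from the paper): draw one CoT and one PoT
    sample per round and stop at the first round where they agree.
    """
    if max_samples < 2:
        raise ValueError("need a budget of at least two samples")

    votes: Counter = Counter()
    used = 0
    while used + 2 <= max_samples:
        cot_answer = sample_cot()   # answer from free-text reasoning
        pot_answer = sample_pot()   # answer from executing a generated program
        used += 2
        votes.update([cot_answer, pot_answer])
        if cot_answer == pot_answer:
            # Two independent reasoning modes agree: stop early.
            return cot_answer, used

    # Budget exhausted without agreement: plain majority vote.
    answer, _ = votes.most_common(1)[0]
    return answer, used
```

With samplers that agree on the first round, the call returns after exactly two samples, which is the case the abstract reports for 78.6% of tasks.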

2604.17948 2026-06-08 cs.CR cs.AI cs.MA 版本更新

RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs

RAVEN: 用于用户代码和二进制程序中内存损坏分析的检索增强漏洞探索网络

Parteek Jamwal, Minghao Shao, Boyuan Chen, Achyuta Muthuvelan, Asini Subanya, Boubacar Ballo, Kashish Satija, Mariam Shafey, Mohamed Mahmoud, Moncif Dahaji Bouffi, Pasindu Wickramasinghe, Siyona Goel, Yaakulya Sabbani, Hakim Hacid, Mthandazo Ndhlovu, Eleanna Kafeza, Sanjay Rawat, Muhammad Shafique

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出RAVEN框架,结合LLM代理与检索增强生成,自动生成遵循Google Project Zero模板的漏洞分析报告,在105个样本上平均质量得分54.21%。

详情
AI中文摘要

大型语言模型(LLM)在各种网络安全任务中展现了卓越的能力,包括漏洞分类、检测和修补。然而,它们在自动化漏洞报告文档和分析方面的潜力仍未得到充分探索。我们提出了RAVEN(检索增强漏洞探索网络),这是一个利用LLM代理和检索增强生成(RAG)来综合生成全面漏洞分析报告的框架。给定易受攻击的源代码,RAVEN按照Google Project Zero根因分析模板生成报告。该框架使用四个模块:用于漏洞识别的探索代理、从包含Google Project Zero报告和CWE条目的精选数据库中检索相关知识的RAG引擎、用于影响和利用评估的分析代理,以及用于结构化报告生成的报告代理。为确保质量,RAVEN包含一个特定任务的LLM评判器,用于评估报告的结构完整性、与真实情况的一致性、代码推理质量和修复质量。我们在来自NIST-SARD数据集的105个涵盖15种CWE类型的易受攻击代码样本上评估了RAVEN。结果显示平均质量得分为54.21%,支持了我们的方法在自动化漏洞文档方面的有效性。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification, detection, and patching. However, their potential in automated vulnerability report documentation and analysis remains underexplored. We present RAVEN (Retrieval Augmented Vulnerability Exploration Network), a framework leveraging LLM agents and Retrieval Augmented Generation (RAG) to synthesize comprehensive vulnerability analysis reports. Given vulnerable source code, RAVEN generates reports following the Google Project Zero Root Cause Analysis template. The framework uses four modules: an Explorer agent for vulnerability identification, a RAG engine retrieving relevant knowledge from curated databases including Google Project Zero reports and CWE entries, an Analyst agent for impact and exploitation assessment, and a Reporter agent for structured report generation. To ensure quality, RAVEN includes a task specific LLM Judge evaluating reports across structural integrity, ground truth alignment, code reasoning quality, and remediation quality. We evaluate RAVEN on 105 vulnerable code samples covering 15 CWE types from the NIST-SARD dataset. Results show an average quality score of 54.21%, supporting the effectiveness of our approach for automated vulnerability documentation.

2604.27011 2026-06-08 cs.LG cs.AI 版本更新

Automatic Causal Fairness Analysis with LLM-Generated Reporting

基于LLM生成报告的自动因果公平性分析

Alessia Berarducci, Eric Rossetto, Alessandro Antonucci, Marco Zaffalon

发表机构 * Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA), USI-SUPSI(Dalle Molle人工智能研究所(IDSIA),USI-SUPSI)

AI总结 提出FairMind原型,利用标准公平模型进行因果公平分析,通过反事实查询计算因果效应,并借助LLM零样本生成公平性报告,优于直接LLM分析。

Comments 23 pages, 6 figures, 3 tables, LaTeX; added missing proof for Proposition 3, typos corrected, updated example 1 to have positive values for the Sankey

详情
AI中文摘要

AutoML,旨在自动化机器学习在现实问题中的应用,是AI普及的关键步骤。大多数AutoML框架未考虑训练数据及相应预测中潜在的公平性缺失。我们介绍FairMind,一个旨在自动化数据集层面公平性分析的软件原型。我们通过借助Plečko和Bareinboim最近提出的标准公平模型的假设来实现这一点。这允许基于涉及目标、可能的混杂因素和中介变量以及我们视为受保护的输入特征的不同值的反事实查询,进行合理的因果效应公平性评估。在必要的数据预处理之后,该工具实现了效应的闭式计算。随后利用LLM生成关于训练数据集中检测到的公平性水平的准确报告。我们在零样本设置中实现了这一点,并通过示例展示了相对于LLM直接分析的预期优势。为了促进应用,还讨论了有序受保护变量和连续目标的扩展以及新的分解结果。

英文摘要

AutoML, intended as the process of automating the application of machine learning to real-world problems, is a key step for AI popularisation. Most AutoML frameworks do not account for the potential lack of fairness in the training data and in the corresponding predictions. We introduce FairMind, a software prototype aiming to automatise fairness analysis at the dataset level. We achieve that by resorting to the assumptions of the standard fairness model, recently proposed by Plečko and Bareinboim. This allows for a sound fairness evaluation in terms of causal effects, based on counterfactual queries involving the target, possible confounders and mediators, and the different values of an input feature we regard as protected. After the necessary data preprocessing, the tool implements a closed-form computation of the effects. LLMs are consequently exploited to generate accurate reports on the fairness levels detected in the training dataset. We achieve that in a zero-shot setup and show by examples the expected advantages with respect to a direct analysis performed by the LLM. To favour applications, extensions to ordinal protected variables and continuous targets and novel decomposition results are also discussed.

2605.00369 2026-06-08 cs.LG cs.AI 版本更新

InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees

InvEvolve:通过具有性能保证的大语言模型进化白盒库存策略

Chenyu Huang, Jianghao Lin, Zhengyang Tang, Bo Jiang, Ruoqing Jiang, Benyou Wang, Lai Wei

发表机构 * Shanghai University of Finance and Economics(上海财经大学) Shanghai Jiao Tong University(上海交通大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tsinghua University(清华大学) Boston College(波士顿学院)

AI总结 提出InvEvolve框架,利用强化学习训练的大语言模型,结合置信区间认证,在线生成具有统计安全保证的白盒库存策略,在合成和真实零售数据上优于经典和深度学习方法。

详情
AI中文摘要

我们研究如何利用大语言模型在非平稳需求在线环境中生成库存策略。我们的工作受到基于LLM的进化搜索(如AlphaEvolve)最新进展的启发,这些方法在数学发现等静态且高度结构化问题上表现出色,但不太适合具有在线更新的动态库存环境。我们提出InvEvolve,一个基于置信区间认证的端到端库存策略进化与推理框架。InvEvolve建立在通过强化学习训练的大语言模型之上,能够处理需求数据以及额外的数值和文本特征,并生成具有统计安全保证的白盒库存策略,用于未来部署。我们进一步引入一个具有理论保证的统一框架,连接训练、推理和部署。这使我们能够推导出InvEvolve进化出统计上安全且改进的策略的概率下界,并刻画相对于最优安全基准的多周期性能差距。在合成数据和真实零售数据上的测试表明,InvEvolve优于经典库存策略和基于深度学习的方法。在典型库存设置中,它生成了超越现有基准的新策略。

英文摘要

We study how large language models can be used to generate inventory policies in online settings with non-stationary demand. Our work is motivated by recent advances in LLM-based evolutionary search, such as AlphaEvolve, which demonstrates strong performance on static and highly structured problems such as mathematical discovery, but is not directly suited to dynamic inventory settings with online updates. We propose InvEvolve, an end-to-end inventory policy evolution and inference framework grounded in confidence-interval-based certification. Built on a large language model trained via reinforcement learning, InvEvolve can process demand data together with additional numerical and textual features, and generates white-box inventory policies with statistical safety guarantees for future deployment. We further introduce a unified framework with theoretical guarantees that connects training, inference, and deployment. This allows us to derive a lower bound on the probability that InvEvolve evolves a statistically safe and improved policy, and to characterize the multi-period performance gap relative to the oracle-safe benchmark. Tested on both synthetic data and real-world retail data, InvEvolve outperforms classical inventory policies and deep-learning-based methods. In canonical inventory settings, it generates new policies that outperform existing benchmarks.

2605.05220 2026-06-08 cs.LG cs.AI 版本更新

MidSteer: Optimal Affine Framework for Steering Generative Models

MidSteer:用于引导生成模型的最优仿射框架

Tatiana Gaintseva, Andrew Stepanov, Ziquan Liu, Martin Benning, Gregory Slabaugh, Jiankang Deng, Ismail Elezi

发表机构 * University of Basel(巴塞尔大学) University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院) University of Cambridge(剑桥大学) University of Washington(华盛顿大学)

AI总结 本文提出MidSteer,一种基于仿射变换的最优概念引导框架,通过最小干扰实现生成模型中的概念切换,并在视觉扩散模型和大型语言模型上验证其有效性。

详情
AI中文摘要

引导中间表示已成为控制生成模型的一种强大策略,特别是在部署后对齐和安全设置中。然而,尽管其经验成功,目前仍缺乏全面的理论框架。在本文中,我们通过形式化概念引导的理论来弥补这一差距。首先,我们在引导和仿射概念擦除之间建立联系,证明去除不期望行为的标准方法是LEACE(一种用于仿射擦除的闭式方法)的特例。接下来,我们为概念切换LEACE-Switch制定了一个原则性的理论框架,并描述了其提供最优仿射解的假设。基于这一分析,我们引入了MidSteer(最小干扰概念引导),一个更通用的用于概念操作的仿射框架,它放宽了这些假设并实现了定向的最小干扰变换。我们证明MidSteer在一系列任务、模态和架构(包括视觉扩散模型和大型语言模型)中表现良好。

英文摘要

Steering intermediate representations has emerged as a powerful strategy for controlling generative models, particularly in post-deployment alignment and safety settings. However, despite its empirical success, it currently lacks a comprehensive theoretical framework. In this paper, we bridge this gap by formalizing the theory of concept steering. First, we establish a link between steering and affine concept erasure, proving that the standard approach for removing unwanted behaviors is a special case of LEACE (a closed-form method for affine erasure). Next, we formulate a principled theoretical framework for concept switching, LEACE-Switch, and characterize the assumptions under which it provides an optimal affine solution. Building on this analysis, we then introduce MidSteer (Minimal Disturbance concept Steering), a more general affine framework for concept manipulation that relaxes these assumptions and enables directed, minimal-disturbance transformations. We demonstrate that MidSteer performs favorably across a range of tasks, modalities, and architectures, including vision diffusion models and large language models.

2605.05225 2026-06-08 cs.LG cs.AI 版本更新

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

MACS: 模态感知容量缩放用于高效多模态MoE推理

Bo Li, Chuan Wu, Shaolin Zhu

发表机构 * School of Software, Tsinghua University, Beijing, China(清华大学软件学院,北京,中国) TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China(天津大学计算机科学与技术学院,中国) School of New Media and Communication, Tianjin University, China(天津大学新媒体与传播学院,中国)

AI总结 针对多模态MoE大模型在专家并行推理中因信息异质性和模态动态性导致的效率瓶颈,提出无需训练的MACS框架,通过熵加权负载和动态模态自适应容量机制优化资源分配,显著提升多模态基准性能。

Comments Accepted by ACL 2026

详情
AI中文摘要

混合专家多模态大语言模型(MoE MLLMs)在专家并行(EP)推理过程中因落后者效应而遭受显著的效率瓶颈。在多模态背景下,这一问题更加严重,因为现有的基于token计数的负载均衡方法无法解决两个独特挑战:(1)信息异质性,其中大量冗余的视觉token与语义关键的token被同等对待;(2)模态动态性,不同任务中视觉与文本比例的变化导致资源错配。为应对这些挑战,我们提出MACS(模态感知容量缩放),一种无需训练的推理框架。具体而言,MACS引入熵加权负载机制来量化视觉token的语义价值,解决信息异质性。此外,动态模态自适应容量机制根据输入的实时模态组成分配专家资源。大量实验表明,MACS在各种多模态基准上显著优于现有方法,为MoE MLLMs在EP推理中的高效部署提供了新颖且稳健的解决方案。

英文摘要

Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (EP) inference due to the straggler effect. This issue is worsened in the multimodal context, as existing token-count-based load balancing methods fail to address two unique challenges: (1) Information Heterogeneity, where numerous redundant visual tokens are treated equally to semantically critical ones, and (2) Modality Dynamics, where varying visual to text ratios across tasks lead to resource misallocation. To address these challenges, we propose MACS (Modality-Aware Capacity Scaling), a training-free inference framework. Specifically, MACS introduces an Entropy-Weighted Load mechanism to quantify the semantic value of visual tokens, addressing information heterogeneity. Additionally, the Dynamic Modality-Adaptive Capacity mechanism allocates expert resources based on the real-time modal composition of the input. Extensive experiments demonstrate that MACS significantly outperforms existing methods on various multimodal benchmarks, providing a novel and robust solution for the efficient deployment of MoE MLLMs in EP inference.

2605.06215 2026-06-08 physics.chem-ph cs.AI 版本更新

COF26: A new on-top functional for multiconfiguration pair-density functional theory

COF26:多组态对密度泛函理论的一种新的on-top泛函

Yuhao Chen, Donald G. Truhlar, Xiao He

发表机构 * Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development(分子治疗与新药开发上海工程研究中心) Shanghai Frontiers Science Center of Molecule Intelligent Syntheses(分子智能合成上海前沿科学中心) School of Chemistry and Molecular Engineering, East China Normal University(华东师范大学化学与分子工程学院) Department of Chemistry, Chemical Theory Center, and Minnesota Supercomputing Institute, University of Minnesota(明尼苏达大学化学系、化学理论中心和明尼苏达超级计算研究所) Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University(重庆精密光学重点实验室,华东师范大学重庆研究院) New York University–East China Normal University Center for Computational Chemistry, New York University Shanghai(纽约大学–华东师范大学计算化学中心,上海纽约大学)

AI总结 提出COF26泛函,通过大语言模型辅助优化工作流,在强和弱关联体系中均表现优越,推荐用于未来MC-PDFT计算。

详情
AI中文摘要

多组态对密度泛函理论(MC-PDFT)为计算强关联分子体系的电子能量提供了一种高效且准确的框架,其中on-top泛函的质量是其预测精度的关键决定因素。在此,我们介绍了MMCDDB26,一个严格整理的基准数据库,包含76个数据集和1495个反应。我们进一步提出了一种受约束的、大语言模型辅助的优化工作流,用于MC-PDFT泛函的开发和评估。利用该工作流,我们在MMCDDB26上优化了MC23/MC25泛函的参数,得到了MC26。与同类早期泛函相比,MC26提高了训练集的精度,并实现了更平衡的整体性能。此外,我们开发了混合meta泛函COF26。我们发现COF26在强和弱关联体系中均表现出优越的性能,因此推荐在未来的MC-PDFT计算中使用COF26。

英文摘要

Multiconfiguration pair-density functional theory (MC-PDFT) provides an efficient and accurate framework for computing electronic energies in strongly correlated molecular systems, with the quality of the on-top functional being a key determinant of its predictive accuracy. Here, we introduce MMCDDB26, a rigorously curated benchmark database comprising 76 datasets and 1,495 reactions. We further propose a constrained, large-language-model-assisted optimization workflow for the development and assessment of MC-PDFT functionals. Using this workflow, we optimized the parameters of the MC23/MC25 functionals on MMCDDB26 to obtain MC26. Compared with earlier functionals of the same class, MC26 improves the accuracy on the training set and achieves a more balanced overall performance. In addition, we developed the hybrid meta-functional COF26. We find that COF26 delivers superior performance for both strongly and weakly correlated systems, and therefore recommend COF26 for future MC-PDFT calculations.

2605.06647 2026-06-08 cs.IR cs.AI cs.LG 版本更新

Superintelligent Retrieval Agent: The Next Frontier of Agentic Retrieval

超级智能检索代理:代理检索的下一个前沿

Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava

发表机构 * Meta Superintelligence Labs(Meta超级智能实验室) Rice University(莱斯大学)

AI总结 提出SIRA,通过单次语料判别性检索压缩多轮探索,利用LLM丰富文档词汇、预测查询缺失词汇并基于语料统计过滤,在BEIR基准上取得最强平均检索性能,并在下游QA任务中超越RL训练的代理系统。

详情
AI中文摘要

检索增强代理日益成为大型知识库的接口,但大多数将检索视为黑箱:它们发出探索性查询,检查片段,并重新表述直到证据出现。这类似于新手搜索不熟悉的数据库,而非专家利用术语和可能证据的强先验进行导航,导致额外的检索轮次、延迟和低召回率。我们引入了超级智能检索代理(SIRA),它将检索中的超级智能视为将多轮探索性搜索压缩为单次语料判别性检索行动。SIRA不仅询问哪些术语相关,还询问哪些术语将所需证据与语料级混淆项区分开。离线时,LLM用缺失的搜索词汇丰富每个文档;查询时,它预测查询遗漏的证据词汇;语料统计作为工具调用,过滤掉缺失、过于常见或不太可能产生检索边界的术语。最后一步是单次加权BM25调用,将查询与验证后的扩展结合。在十个BEIR基准上,SIRA实现了我们比较中最强的平均检索性能,击败了密集检索器、学习型稀疏检索器和LLM搜索代理基线,且未使用相关性标签或检索器微调。在下游QA中,其仅检索的答案覆盖率在NQ和HotpotQA上超过了近期RL训练的代理QA系统。我们还引入了BrowseComp-Wikipedia,一个包含232个BrowseComp衍生查询、覆盖25,587,229篇文档的维基百科索引的硬搜索基准。即使没有索引时丰富,仅使用基于维基百科类别的接地,SIRA在每个预算下都优于多轮Perplexity代理,达到9.70%的Recall@1、15.27%的Recall@10和36.14%的Recall@100。

英文摘要

Retrieval-augmented agents are increasingly the interface to large knowledge bases, yet most treat retrieval as a black box: they issue exploratory queries, inspect snippets, and reformulate until evidence emerges. This resembles how a newcomer searches an unfamiliar database rather than how an expert navigates it with strong priors about terminology and likely evidence, causing extra retrieval rounds, latency, and poor recall. We introduce Superintelligent Retrieval Agent (SIRA), which casts superintelligence in retrieval as compressing multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask which terms are relevant; it asks which terms separate the desired evidence from corpus-level confusers. Offline, an LLM enriches each document with missing search vocabulary; at query time, it predicts evidence vocabulary the query omits; and corpus statistics serve as tool calls that filter terms that are absent, overly common, or unlikely to create retrieval margin. The final step is a single weighted BM25 call combining the query with the validated expansion. Across ten BEIR benchmarks, SIRA achieves the strongest average retrieval performance in our comparison, beating dense retrievers, learned sparse retrievers, and LLM search-agent baselines while using no relevance labels or retriever fine-tuning. On downstream QA, its retrieval-only answer coverage exceeds recent RL-trained agentic QA systems on NQ and HotpotQA. We also introduce BrowseComp-Wikipedia, a hard-search benchmark of 232 BrowseComp-derived queries over a 25,587,229-document Wikipedia index. Even without index-time enrichment, using only grounded Wikipedia categories, SIRA outperforms multi-round Perplexity agents at every budget, reaching 9.70% Recall@1, 15.27% Recall@10, and 36.14% Recall@100.
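The last two steps described in this abstract, filtering candidate expansion terms with corpus statistics and then issuing one weighted BM25 call, can be sketched as follows. This is a toy illustration under stated assumptions, not SIRA's implementation: the document-frequency threshold `max_df_ratio`, the per-term weights, the whitespace tokenizer, and both function names are hypothetical, and the "retrieval margin" test mentioned in the abstract is not modelled.

```python
import math
from collections import Counter
from typing import Dict, List


def filter_expansion(candidates: List[str], docs: List[str],
                     max_df_ratio: float = 0.5) -> List[str]:
    """Drop candidate terms that are absent from the corpus or overly common.

    `max_df_ratio` is an assumed cut-off, not a value from the paper.
    """
    token_sets = [set(d.lower().split()) for d in docs]
    kept = []
    for term in candidates:
        df = sum(term in tokens for tokens in token_sets)  # document frequency
        if 0 < df <= max_df_ratio * len(docs):
            kept.append(term)
    return kept


def weighted_bm25(docs: List[str], weighted_query: Dict[str, float],
                  k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document with BM25, scaling each term by its query weight."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term, weight in weighted_query.items():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += weight * idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "feline behaviour and the domestic cat",
    "stock market news the",
]
# Candidate expansion terms, standing in for what an LLM might propose.
expansion = filter_expansion(["feline", "the", "unicorn"], docs)
query = {"cat": 1.0, **{term: 0.5 for term in expansion}}
scores = weighted_bm25(docs, query)
```

In this toy corpus only "feline" survives the filter: "the" appears in every document and "unicorn" appears in none. The expanded query then ranks the third document above the first, although both contain "cat" once.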

2605.08717 2026-06-08 cs.SE cs.AI 版本更新

Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

调试调试器:面向软件工程智能体的失败锚定结构化恢复

Chenyu Zhao, Shenglin Zhang, Yihang Lin, Wenwei Gu, Zhimin Chen, Yongqian Sun, Dan Pei, Chetan Bansal, Saravan Rajmohan, Minghua Ma

发表机构 * Nankai University(南开大学) Tsinghua University(清华大学) Microsoft(微软)

AI总结 提出PROBE框架,通过遥测层、诊断层和指导门将运行时证据转化为结构化恢复指导,在代码修复、工作流恢复等场景中诊断准确率65.37%,恢复率21.79%。

详情
AI中文摘要

软件工程智能体越来越多地部署在可评估的工程环境中,但故障后恢复仍然成本高昂、依赖人工且临时性强。现有系统暴露跟踪或生成后续反馈,但未能将异构运行时证据转化为有根据的、有边界的恢复指导以供后续尝试。我们提出PROBE,一个用于软件工程智能体结构化恢复的失败锚定框架。PROBE通过遥测层、诊断层和指导门将失败运行的遥测数据组织为结构化证据、结构化诊断和有边界的恢复指导。遥测层保留细粒度运行时信号,诊断层将跨信号证据融合为有根据的诊断,指导门仅在证据有根据、可操作且属于智能体侧行为范围内时生成基于诊断的指导。我们在三个场景中评估PROBE:仓库级软件修复、企业工作流恢复和AIOps服务缓解。在257个初始未解决案例中,PROBE实现了65.37%的Top-1诊断准确率和21.79%的恢复率,分别比最强的非PROBE基线高出43.58和12.45个百分点。结果揭示了诊断-恢复差距:准确的诊断是必要的,但除非转化为后续尝试可执行和验证的有边界指导,否则是不够的。除了受控评估外,微软IcM原型显示,PROBE可以作为非侵入式侧通道附加到现有服务诊断工作流中,而无需更改智能体策略、工具集或执行预算。这些结果表明,在现实工程约束下,基于遥测的、失败锚定的恢复可以提高故障后可恢复性。

英文摘要

Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manual, and ad hoc. Existing systems expose traces or generate follow-up feedback, but they do not convert heterogeneous runtime evidence into grounded, bounded recovery guidance for a subsequent attempt. We present PROBE, a failure-anchored framework for structured recovery in software engineering agents. PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. We evaluate PROBE across three settings: repository-level software repair, enterprise workflow recovery, and AIOps service mitigation. On 257 initially unresolved cases, PROBE achieves 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points. The results reveal a diagnosis-recovery gap: accurate diagnosis is necessary but insufficient unless translated into bounded guidance that a subsequent attempt can execute and verify. Beyond controlled evaluation, a Microsoft IcM prototype shows that PROBE can attach as a non-intrusive side channel to existing service-diagnosis workflows without changing the agent policy, toolset, or execution budget. These results suggest that telemetry-grounded, failure-anchored recovery can improve post-failure recoverability under realistic engineering constraints.
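The headline percentages in this abstract can be turned back into case counts with a little arithmetic. The sketch below assumes that every percentage is taken over all 257 initially unresolved cases and that each margin refers to a baseline evaluated on those same cases; the abstract does not state either point explicitly, so the baseline figures are implied values, not reported ones. The helper name `implied_count` is illustrative.

```python
TOTAL_CASES = 257  # initially unresolved cases, from the abstract


def implied_count(percent: float, total: int = TOTAL_CASES) -> int:
    """Nearest whole number of cases corresponding to a percentage."""
    return round(percent / 100 * total)


# Reported PROBE results.
probe_diagnosed = implied_count(65.37)   # Top-1 diagnosis accuracy
probe_recovered = implied_count(21.79)   # recovery rate

# Implied baseline rates: reported rate minus the reported margin
# (assumption: margins are in percentage points over the same 257 cases).
baseline_diag_pct = 65.37 - 43.58
baseline_rec_pct = 21.79 - 12.45
baseline_diagnosed = implied_count(baseline_diag_pct)
baseline_recovered = implied_count(baseline_rec_pct)

print(probe_diagnosed, probe_recovered)        # 168 56
print(baseline_diagnosed, baseline_recovered)  # 56 24
```

Read this way, about 168 of the 257 failures are diagnosed correctly but only about 56 are recovered, which is the diagnosis-recovery gap the abstract points to.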

2605.15888 2026-06-08 cs.LG cs.AI 版本更新

CHoE: Cross-Domain Heterogeneous Graph Prompt Learning via Structure-Conditioned Experts

CHoE: 基于结构条件专家的跨域异构图提示学习

Peiyuan Li, Yongqi Huang, Jitao Zhao, Dongxiao He, Di Jin, Weixiong Zhang

发表机构 * School of Computer Science and Technology, Tianjin University(天津大学计算机科学与技术学院) Department of Health Technology and Informatics, and Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University(香港理工大学健康科技与信息学系、数据科学与人工智能系)

AI总结 提出CHoE方法,通过结构条件专家网络和结构感知路由机制,实现跨域异构图提示学习,在少样本跨域任务中优于基线方法。

Comments accepted by IJCAI 2026, 9 pages, 4 figures

详情
AI中文摘要

异构图提示学习(HGPL)已成为弥合预训练基础模型目标与其在下游异构图中应用之间差距的有前景范式。然而,现有HGPL方法主要针对域内场景设计,而实际部署通常跨越多个域,且预训练和下游任务的数据可能来自不同分布。因此,当前HGPL方法的适用性仅限于域内设置,当应用域发生变化时,其性能通常会下降。为解决这一严重限制,我们开发了CHoE,一种基于专家网络的跨域HGPL方法。在预训练期间,我们引入并训练结构条件专家;在提示调优期间,我们采用结构感知的专家路由和负载均衡机制,为每个元路径视图选择结构兼容的专家。此外,我们设计了一个基于提示的语义融合模块,以整合多个视图的表示用于下游预测。大量实验表明,CHoE在少样本跨域应用中持续提升性能,优于所有基线方法。

英文摘要

Heterogeneous Graph Prompt Learning (HGPL) has emerged as a promising paradigm for bridging the gap between the objectives of pre-training foundation models and their downstream applications in heterogeneous graph settings. However, existing HGPL methods are primarily designed for in-domain scenarios, whereas real-world deployments often span multiple domains, and the data used for pre-training and downstream tasks may originate from different distributions. Consequently, the applicability of current HGPL approaches is limited to in-domain settings, and their performance typically degrades when application domains shift. To address this serious limitation, we develop CHoE, a cross-domain HGPL method built upon an expert network. During pre-training, we introduce and train structure-conditioned experts, and during prompt tuning, we adopt a structure-aware expert routing and load balancing mechanism to select structurally compatible experts for each meta-path view. In addition, we design a prompt-based semantic fusion module to integrate representations across multiple views for downstream prediction. Extensive experiments show that CHoE consistently improves performance in few-shot cross-domain applications, outperforming all baseline approaches.

2605.17548 2026-06-08 cs.SE cs.AI 版本更新

Rethinking Code Review in the Age of AI: A Vision for Agentic Code Review

重新思考AI时代的代码审查:面向代理代码审查的愿景

Hüseyin Özgür Kamalı, Erdem Tuna, Vahid Haratian, Eray Tüzün

发表机构 * Microsoft(微软) Ankara University(安卡拉大学) Bilkent University(比尔肯特大学)

AI总结 本文探讨了在AI时代代码审查的演变,提出了一种结合专门代理和人类控制的质量闸门的AI驱动代码审查流程,旨在提升代码审查的效率和可靠性。

Comments Submitted to ACM Transactions on Software Engineering Methodology (TOSEM). A shorter version of this work has been presented at ICSE-JAWs 2026, Rio de Janeiro, Brazil

详情
AI中文摘要

代码审查已经经历了数十年的发展,从非正式的同行检查发展到今天的拉取请求(PR)工作流程,但仍然主要是一种手动、不均匀且认知负担重的过程。人工智能(AI)编程助手的兴起加剧了这一挑战:虽然这些工具提高了代码生成的速度,但同时也增加了需要审查的代码量,使代码审查成为增长的瓶颈。当前的AI支持仍然碎片化,工具主要专注于孤立任务,如审阅者推荐、PR描述生成或评论建议,而非整个PR审查流程。本文回顾了代码审查实践的历史演变,并考察了由大语言模型(LLMs)和代理AI系统驱动的转变。随后,我们提出了一种AI驱动的代码审查流程愿景,结合专门的代理和人类控制的质量闸门。我们的框架涵盖五个阶段:PR创建、PR增强、审阅者选择、AI辅助代码审查和PR回顾,其中在关键决策点保留人类以保持判断、责任和团队层面的理解。我们识别了负责任采用的主要开放挑战,包括可靠性、偏见、隐私、自动化偏见、透明度和评估,并提出了更有效的软件工程中人类-AI协作的研究议程。

英文摘要

Code review has evolved for decades, from informal peer checking to today's pull request (PR) workflows, yet it remains a largely manual and cognitively demanding process. The rise of Artificial Intelligence (AI) coding assistants has intensified this challenge: while these tools increase code production velocity, they also expand the volume of code requiring review, turning code review into a growing bottleneck. Current AI support in code review remains fragmented, with tools focusing on isolated tasks such as reviewer recommendation, PR description generation, or comment suggestion rather than the end-to-end PR review workflow. We address this gap by treating review effectiveness as an outcome of the full code review lifecycle rather than a single stage, proposing a framework that carries context across stage boundaries. We propose a future vision for code review in which reviewers transition from manual inspectors into supervisory operators of agents. In this vision, staged, AI-powered workflows aim to align the pace of code generation with shared understanding and accountable engineering. In this paper, we review the historical evolution of code review practices, identify challenges in traditional code review systems, and examine the shift driven by large language models (LLMs) and agentic AI systems. We then present a vision for an AI-powered code review workflow combining specialized agents with human-controlled quality gates. Our framework spans five stages: PR Creation, PR Augmentation, Reviewer Selection, AI-Assisted Code Review, and PR Retrospective, with humans retained at key decision points to preserve judgment, accountability, and team-level understanding. Finally, we identify key adoption challenges and outline research directions for evaluation, governance, and responsible human-AI collaboration.

2605.17561 2026-06-08 cs.SE cs.AI cs.MA 版本更新

Automated Root-Cause Subclassification and No-Code Fix Generation for Invalid Bug Reports

自动化无效bug报告的根因子类划分及无代码修复生成

Mahmut Furkan Gon, Emre Dinc, Tevfik Emre Sungur, Eray Tuzun

发表机构 * Department of Computer Engineering, Bilkent University(计算机工程系,比尔肯特大学)

AI总结 本研究旨在引入一个标准化的根因导向的无效bug报告子类划分体系,并通过实验测试不同方法在无效子类划分和无代码修复生成中的准确性。研究还分析了不同配置在我们创建的黄金标准基准上的表现。

Comments Submitted to IEEE Transactions on Software Engineering (TSE) and currently under review

详情
AI中文摘要

在使用软件时遇到的问题会以bug报告的形式被报告。然而,许多bug报告是无效的,意味着它们不需要代码更改,而是通过无代码修复解决。手动确定无效bug报告的根因并由客户支持提供可行的解决方案会造成严重的资源浪费。我们的目标是引入一个标准化的根因导向的无效bug报告子类划分体系,并通过实验测试各种方法在无效子类划分和无代码修复生成中的准确性。我们研究了不同配置在我们创建的黄金标准基准上的表现。使用人工整理的基准进行更高质量的分析,我们尝试了vanilla LLMs、检索增强生成和代理网络搜索来识别无效子类并生成无代码修复。我们将结果与人工标注的真实标签数据进行对比评估,这些数据包含原始bug报告中的无效子类和无代码修复。我们用加权F1分数衡量子类检测性能,并用BERTScore和Judge LLM成功率评估无代码修复建议。对于子类划分,检索增强生成的总体性能最高,达到0.66加权F1,略微优于vanilla LLMs的0.65和代理网络搜索的0.64。在子类级别,性能在“不可复现”(Non-reproducibility)上最高达到0.85 F1,在“功能请求”(Feature Request)和“问题”(Question)上达到0.79,而“错误版本”(Wrong Version)仍然最具挑战性,分数在0.00到0.29之间。对于无代码修复生成,代理网络搜索的总体Judge LLM成功率最高,达到68.9%,相比之下检索增强生成为64.4%,vanilla LLMs为64.9%。在子类级别,最高值出现在“按设计运行”(Working as Designed)上,达到87.4%,“问题”(Question)达到72.2%。

英文摘要

Issues faced when using software are reported in the form of bug reports. However, many bug reports are invalid, meaning they do not require code changes, and are resolved with a no-code fix. Manually determining the root cause of the invalid bug reports and providing actionable resolutions by the customer support causes a serious waste of resources. Our goal is to introduce a standardized taxonomy for root-cause oriented invalid bug report subclassification, and perform experiments to test the accuracy of various approaches on invalid subclassification and no-code fix generation. We study how different configurations perform on a gold-standard benchmark we have created. Using a manually curated benchmark for higher quality analysis, we experimented with vanilla LLMs, Retrieval Augmented Generation, and agentic web search to identify invalid subclasses and generate no-code fixes. We evaluated the results against manually labeled ground truth data that includes the invalid subclass and no-code fixes from the original bug reports. We measured subclass detection performance with weighted F1-Score, and assessed no-code fix suggestions using BERTScore and Judge LLM success rates. For subclassification, retrieval augmented generation achieves the highest overall performance with 0.66 weighted F1, slightly outperforming vanilla LLMs at 0.65 and agentic web search at 0.64. At the subclass level, performance peaks at 0.85 F1 for Non-reproducibility and 0.79 for Feature Request and Question, while Wrong Version remains the most challenging with scores between 0.00 and 0.29. For no-code fix generation, agentic web search achieves the highest overall Judge LLM success rate at 68.9%, compared to 64.4% for RAG applications and 64.9% for vanilla LLMs, with subclass-level peaks of 87.4% for Working as Designed and 72.2% for Question.
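The weighted F1 score used above for subclass detection averages the per-class F1 values, weighting each class by its share of the true labels. A minimal sketch follows, using the standard definition and not the authors' evaluation code; the label strings in the example are shortened, hypothetical stand-ins for the paper's subclasses.

```python
from collections import Counter
from typing import List


def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n_true / total) * f1  # weight by class frequency
    return score


# Toy example with made-up labels and predictions.
y_true = ["question", "question", "feature_request", "wrong_version"]
y_pred = ["question", "feature_request", "feature_request", "question"]
result = weighted_f1(y_true, y_pred)
```

Because each class is weighted by its frequency, a rare and difficult subclass such as Wrong Version can score near zero without pulling the overall figure down much, which fits the spread between the 0.66 overall score and the 0.00 to 0.29 range reported for that subclass.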

2605.20950 2026-06-08 cs.CV cs.AI 版本更新

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

聚焦-然后-上下文:面向视觉-语言模型的主体导向渐进视觉标记缩减

Yulin Zhao, Zheng Zhang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) ShenZhen Loop Area Institute(深圳河套学院)

AI总结 本文提出了一种主体导向的渐进视觉标记缩减方法SPpruner,通过模拟人类视觉感知系统的'聚焦-然后-上下文'机制,有效减少视觉标记数量,提升视觉-语言模型的推理效率,实验表明其在速度和资源消耗上均优于现有方法。

详情
AI中文摘要

视觉-语言模型(VLMs)在推理过程中面临由于大规模视觉标记序列带来的计算成本瓶颈。现有的视觉标记缩减方法虽然减轻了这一负担,但无意中保留了与用户查询严格对齐的孤立视觉主体,无法充分探索显著主体及其上下文关系。本文提出SPpruner,一种以主体为中心的渐进缩减范式,模拟人类视觉感知系统的'聚焦-然后-上下文'机制。具体而言,我们首先构建了一个聚焦识别模块,以显式建模视觉显著性与语义相关性之间的相互作用。在此基础上,它可以挖掘全面的视觉主体光谱,确保视觉输入的高保真表示。随后,开发了一个上下文感知的结构扫描模块,用于聚合邻近区域的上下文线索。因此,它可以有效恢复全局关系依赖,以维持保留主体的结构完整性。大量实验表明,我们的范式在速度和资源消耗上均优于现有方法,在Qwen2.5-VL中仅保留22.2%的视觉标记即可实现2.53倍的加速,在LLaVA中实现67%的FLOPs减少,仅导致0.6%的精度下降。

英文摘要

Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentionally preserve the isolated visual subject strictly aligned with the user's query, which fails to substantially explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the Focus-then-Context mechanism of the human visual perception system. Specifically, we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance. Herein, it can excavate the comprehensive visual subject spectrum to ensure a high-fidelity representation of visual input. Subsequently, a context-aware structural scanning module is developed to aggregate contextual cues from neighboring regions. As such, it can effectively restore global relational dependencies to uphold the structural integrity of the preserved subjects. Extensive experiments demonstrate that our paradigm consistently outperforms SOTA methods, achieving up to 2.53 times speedup with only 22.2% of visual tokens retained in Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with a negligible 0.6% accuracy drop.

2605.24011 2026-06-08 cs.CV cs.AI 版本更新

ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models

ActQuant: 面向视觉-语言-动作模型的亚4比特动作引导量化

Arash Akbari, Arman Akbari, Masih Eskandar, Qitao Tan, Yixiao Chen, Jingwu Luo, Bertha Pangaribuan, Liyun Zhang, Jennifer Dy, Geng Yuan, Xue Lin, Gaowen Liu, Stratis Ioannidis, Yanzhi Wang

发表机构 * Northeastern University(东北大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出ActQuant框架,通过动作引导的混合精度后训练量化,在亚4比特权重量化下保持VLA模型性能,并引入OmniModel.cpp实现高效部署。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在具身智能中展现出卓越的动作生成能力,但其高计算量使得在边缘平台部署不切实际。激进的亚4比特权重量化是自然解决方案,但现有后训练量化(PTQ)方法在此情况下性能严重下降。为解决此问题,我们引入ActQuant,一个动作引导的混合精度PTQ框架,包含两个阶段:(1)张量间比特分配器,根据每个权重矩阵对预测智能体动作的贡献程度分配单一比特宽度;(2)张量内尺度优化器,使用动作感知曲率调整每块量化尺度,使动态范围集中在控制影响最大的权重上。为了在设备上实现激进量化的优势,我们进一步引入OmniModel.cpp,一个代理转换流水线,将架构移植到具有高效低位内核的原生C/C++运行时。我们在仿真和真实世界的6自由度UR3机械臂上评估ActQuant,所有模型通过OmniModel.cpp部署。在LIBERO基准上,ActQuant是唯一在每权重3比特或以下运行的方法,在OpenVLA-OFT上保持95.0%的性能,在$π_{0.5}$上保持94.8%。进一步,ActQuant在OpenVLA-OFT上达到2.5 bpw,性能为90.1%,将骨干网络从14.3 GB压缩到2.7 GB(5.3倍)。在物理UR3机械臂上,使用ActQuant量化的$π_{0.5}$保持基线的成功率,同时将内存占用减少2.5倍。

英文摘要

Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute makes deployment on edge platforms impractical. Aggressive, sub-4-bit weight quantization is the natural solution, yet existing post-training quantization (PTQ) methods suffer severe performance degradation in this regime. To address this, we introduce ActQuant, an action-guided mixed-precision PTQ framework that operates in two stages: (1) an inter-tensor bit allocator that assigns each weight matrix a single bit-width based on how much it contributes to predicting the agent's actions; (2) an intra-tensor scale optimizer that tunes per-block quantization scales using action-aware curvature, so that dynamic range is concentrated on the weights most influential for control. To deliver the on-device benefits of our aggressive quantization, we further introduce OmniModel.cpp, an agentic conversion pipeline that ports architectures into a native C/C++ runtime with efficient low-bit kernels. We evaluate ActQuant both in simulation and on a real-world 6-DoF UR3 arm, with all models deployed through OmniModel.cpp. On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits-per-weight, retaining 95.0% on OpenVLA-OFT and 94.8% on $π_{0.5}$. Pushed further, ActQuant reaches 2.5 bpw at 90.1% on OpenVLA-OFT, compressing the backbone from 14.3 GB to 2.7 GB (5.3$\times$). On the physical UR3 arm, $π_{0.5}$ quantized with ActQuant retains the baseline's success rate while reducing the memory footprint by 2.5$\times$.
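
To make "per-block quantization scales" concrete, here is a minimal sketch of generic symmetric round-to-nearest quantization with one absmax scale per block. It is not ActQuant's action-aware scale optimizer or bit allocator, which the abstract does not specify; all function names and the toy weights are assumptions for illustration.

```python
def quantize_block(block, bits):
    """Symmetric round-to-nearest quantization of one block with an absmax scale."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 3 for 3-bit signed codes
    absmax = max(abs(w) for w in block)
    scale = absmax / qmax if absmax > 0 else 1.0    # avoid division by zero
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to floating-point weights."""
    return [c * scale for c in codes]


def quantize_tensor(weights, bits, block_size):
    """Quantize and dequantize a flat weight list, one scale per block."""
    out = []
    for start in range(0, len(weights), block_size):
        codes, scale = quantize_block(weights[start:start + block_size], bits)
        out.extend(dequantize_block(codes, scale))
    return out


# Hypothetical weights: one block of large values, one block of small values
w = [0.9, -0.3, 0.05, 0.6, 0.01, -0.02, 0.005, 0.0]
print(quantize_tensor(w, bits=3, block_size=4))   # small block keeps its own scale
print(quantize_tensor(w, bits=3, block_size=8))   # one shared scale zeroes the small weights
```

With a single scale for all eight weights, the large values dominate the dynamic range and the small weights collapse to zero; with per-block scales they survive. The abstract describes tuning such scales so that range is spent on the weights that matter most for control.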

2605.25054 2026-06-08 cs.LG cs.AI 版本更新

Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training

按需扩展:自适应神经元级混合精度量化感知训练

Ayush K. Varshney, Konstantinos Vandikas, Šarūnas Girdzijauskas, Adam Orucu, Aneta Vulgarakis Feljan

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind(深度思维) University of Cambridge(剑桥大学)

AI总结 提出神经元级混合精度量化感知训练(NMP-QAT),通过可微代理和直通估计器让每个神经元独立学习离散精度,实现按需扩展位宽,在MLP和表格基础模型上取得更优的压缩-精度权衡。

Comments Accepted at ICML - GlobalSouthML workshop, 2026

详情
AI中文摘要

在资源受限的6G边缘设备上部署深度神经网络需要激进压缩且最小化精度损失。量化感知训练(QAT)已成为领先的压缩方法;然而,现有的混合精度方法通常以粗粒度的层或通道级别操作。这些方法通常依赖启发式或基于搜索的位分配策略,可能忽略神经元级别的细粒度变异性。我们提出神经元级混合精度QAT(NMP-QAT),其中每个神经元在训练期间独立学习自己的离散精度。从低位精度开始,NMP-QAT仅在训练信号需要时通过可微代理和直通估计器扩展位宽,同时保持完全离散的推理图。这种适应性扩展到权重和激活,减少内存移动。在电信和非电信数据集上,跨MLP和表格基础模型架构评估,NMP-QAT相比混合精度QAT基线实现了更优的压缩-精度权衡,使其非常适合网络边缘的绿色AI部署。

英文摘要

Deploying deep neural networks on resource-constrained 6G edge devices demands aggressive compression with minimal accuracy loss. Quantization-Aware Training (QAT) has emerged as a leading compression approach; however, existing mixed-precision methods typically operate at coarse layer- or channel-level granularity. These methods often rely on heuristic or search-based bit-allocation strategies, which may overlook fine-grained variability at the neuron level. We propose Neuron-Level Mixed-Precision QAT (NMP-QAT), where each neuron independently learns its own discrete precision during training. Starting from low-bit precision, NMP-QAT expands bit-width only when training signals demand it, via differentiable surrogates and straight-through estimators, while preserving a fully discrete inference graph. This adaptability extends to both weights and activations, reducing memory movement. Evaluated on telecom and non-telecom datasets across MLP and tabular foundation model architectures, NMP-QAT achieves superior compression-accuracy trade-offs over mixed-precision QAT baselines, making it well-suited for Green AI deployments at the network edge.

2605.25413 2026-06-08 cs.LG cs.AI cs.NA math.NA 版本更新

Autoregression-Free Neural Operators for Time-Dependent PDEs

无自回归的神经算子用于时间相关偏微分方程

Jiaquan Zhang, Caiyan Qin, Haoyu Bian, Libin Cai, Yi Lu, Chaoning Zhang, Wei Dong, Yuanfang Guo, Yang Yang, Heng Tao Shen

发表机构 * School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院) School of Robotics and Advanced Manufacture, Harbin Institute of Technology(哈尔滨工业大学机器人与先进制造学院) School of Mathematical Sciences, Capital Normal University(首都师范大学数学学院) College of Information and Control Engineering, Xi’an University of Architecture and Technology(西安建筑科技大学信息与控制工程学院) Laboratory of Intelligent Recognition and Image Processing, School of Computer Science and Engineering, Beihang University(北京航空航天大学智能识别与图像处理实验室) School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院)

AI总结 提出AFNO,通过将PDE时间演化映射到潜空间并利用流匹配学习连续时间向量场,避免自回归展开,实现长期稳定预测。

Comments 23 pages, 18 figures

详情
AI中文摘要

神经算子学习从函数依赖输入到解的映射,为求解偏微分方程(PDE)提供了有效框架。对于时间相关PDE,现有方法通常通过在高维物理场空间中直接进行自回归展开来执行长时域预测,其中每个预测状态被递归地反馈作为下一步的输入。尽管对短期预测有效,但这种自回归展开以及缺乏连续时间建模导致长时域展开中误差逐渐累积。在这项工作中,我们提出无自回归神经算子(AFNO),将PDE的时间演化映射到潜空间并在其中建模连续时间向量场。AFNO使用流匹配来学习潜向量场,从而能够在扩展时域上实现连续演化,避免自回归展开,并通过显式条件化物理参数来捕捉不同参数配置下的动力学。对六个PDE的理论分析和广泛实验表明,与基线相比,AFNO提高了长时域预测稳定性并持续减少了展开误差。

英文摘要

Neural operators learn mappings from function-dependent inputs to solutions, providing an effective framework for solving partial differential equations (PDEs). For time-dependent PDEs, existing methods typically perform long-horizon prediction through autoregressive rollout directly in high-dimensional physical field spaces, where each predicted state is recursively fed back as the input for the next step. Although effective for short-term prediction, this autoregressive rollout and the lack of continuous-time modeling lead to progressive error accumulation over long-horizon rollouts. In this work, we propose Autoregression-Free Neural Operators (AFNO), which map the time evolution of PDEs into a latent space and model continuous-time vector fields within it. AFNO uses flow matching to learn the latent vector field, thereby enabling continuous evolution over extended horizons, avoiding autoregressive rollout and capturing dynamics under varying parameter configurations through explicit conditioning on physical parameters. Theoretical analysis and extensive experiments on six PDEs demonstrate that AFNO improves long-horizon prediction stability and consistently reduces rollout errors compared with the baselines.

2605.25645 2026-06-08 cs.DC cs.AI 版本更新

Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines

在Google Cloud TPU上微调和服务Gemma 4 31B:与GPU基线的技术比较

Jatin Kishnani, Mayank Goel, Amit Singh, Pulkit Agrawal, Sairanjan Mishra

发表机构 * Google Cloud(谷歌云)

AI总结 本文首次端到端展示了在TPU硬件上微调和服务Google Gemma 4 31B模型,通过与GPU平台的实证比较,提供了代码级适配方案,并证明TPU在训练速度和成本上具有优势。

详情
AI中文摘要

我们首次端到端展示了在TPU硬件上微调和服务Google的Gemma 4 31B模型,提供了TPU与GPU平台在大语言模型适配上的实证比较。使用LoRA在Google TPU v5p-8上进行训练,在TPU v6e-8(Trillium)上进行推理,我们记录了将基于PyTorch、HuggingFace TRL和FSDP的GPU原生训练配方移植到JAX + Tunix/Qwix栈所需的全部代码级适配。这些适配涵盖网格配置、LoRA模块命名约定、分片注释修正、梯度检查点、数据管道重构以及自定义的Orbax到safetensors检查点合并过程。对于推理,我们介绍了在v6e-8上服务Gemma 4所需的vLLM-TPU Docker设置,并说明了在输入长度从512到16k令牌的QPS扫描中观察到的延迟和吞吐量特征。在两类工作负载上,我们都与成本相近、超参数相同的2xH100 GPU基线比较性能和成本。TPU训练完成速度快1.61倍,成本低2.12倍。在推理方面,TPU v6e-8在短上下文(<=2048令牌)下与GPU持平,在长上下文下明显胜出:在4096令牌输入时吞吐量高66%,首令牌延迟(TTFT)快23.6倍(QPS=4时为61 ms对1,443 ms)。我们的工作填补了开放工具生态系统中的关键空白,为从业者提供了在TPU基础设施上部署Gemma 4 Dense 31B的配方。

英文摘要

We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and TPU v6e-8 (Trillium) for inference, we document the full set of code-level adaptations required to port a GPU-native training recipe - built on PyTorch, HuggingFace TRL, and FSDP - to the JAX + Tunix/Qwix stack. These adaptations span mesh configuration, LoRA module naming conventions, sharding annotation corrections, gradient checkpointing, data pipeline restructuring, and a custom Orbax-to-safetensors checkpoint merging procedure. For inference, we detail the vLLM-TPU Docker setup required to serve Gemma 4 on v6e-8 and characterize the observed latency and throughput across a QPS sweep spanning 512 to 16k input tokens. Across both workloads we compare performance and cost against a similarly priced 2xH100 GPU baseline running identical hyperparameters. The TPU completes training 1.61x faster at 2.12x lower cost. For inference, TPU v6e-8 matches GPU at short context (<=2048 tokens) and decisively outperforms it at long context: 66% higher throughput and 23.6x faster TTFT at 4096-token inputs (61 ms vs 1,443 ms at QPS=4). Our work closes a critical gap in the open tooling ecosystem and provides practitioners with a recipe for Gemma 4 Dense 31B deployment on TPU infrastructure.

2605.26099 2026-06-08 cs.CL cs.AI 版本更新

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

语言模型需要睡眠吗?用于改进在线推理的离线循环

Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of Maryland(马里兰大学)

AI总结 本文提出一种类似睡眠的巩固机制,通过离线循环将上下文转换为快速权重,以解决Transformer注意力机制随上下文长度扩展性差的问题,并在合成任务和数学推理任务上验证了其有效性。

详情
AI中文摘要

基于Transformer的大型语言模型越来越多地用于长时任务;然而,它们的注意力机制随上下文长度扩展性差。为了解决这个问题,我们研究了一种类似睡眠的巩固机制,其中模型在清除其键值缓存之前,定期将最近的上下文转换为持久的快速权重。在睡眠期间,模型对累积的上下文进行$N$次离线循环传递,并通过学习到的局部规则更新其状态空间模型(SSM)块中的快速权重。在推理过程中,这会将额外的计算转移到睡眠阶段,同时保持清醒时预测的延迟。我们在受控的合成任务(包括元胞自动机和多跳图检索)以及一个现实的数学推理任务上测试了我们的方法,在这些任务上,常规Transformer以及SSM-注意力混合模型都失败了。然后我们表明,增加我们模型的睡眠持续时间$N$可以提高性能,在需要更深层推理的示例上收益最大。

英文摘要

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs $N$ offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration $N$ for our models improves performance, with the largest gains on examples that require deeper reasoning.

2605.30119 2026-06-08 cs.LG cs.AI cs.NE 版本更新

Evolving Features vs Evolving Entire Trees with GP for Interpretable Survival Analysis

使用遗传编程进行可解释生存分析:进化特征 vs 进化整个树

Thalea Schlender, Peter A. N. Bosman, Tanja Alderliesten

发表机构 * Leiden University Medical Center(莱顿大学医学中心) Centrum Wiskunde & Informatica(数学与信息学研究中心)

AI总结 本研究使用遗传编程多目标进化可检查的特征集,并联合优化生存树结构与非线性分裂逻辑,以提高浅层生存树的预测性能和可解释性。

详情
AI中文摘要

生存分析涉及预测事件发生时间。常用于医学领域,处理不完整(即删失)数据,例如研究期间未发生事件的患者。实际应用中,准确性和可解释性都很重要。生存树是易于理解的生存模型,将患者队列递归地划分为离散的患者组。虽然生存树可以捕捉复杂关系,但它们通常需要生长得很大,威胁可解释性。此外,生存树通常使用贪婪方法构建,可能忽略全局最优分裂组合,限制预测性能。浅层生存树需要表达性强的高阶特征组合才能达到竞争性准确性。因此,我们使用遗传编程多目标进化固有可检查的特征集,并研究它们与不同树诱导策略的相互作用。我们进一步引入了一种进化方法,联合优化生存树结构和非线性分裂逻辑。我们的发现表明,在两个真实世界数据集和两种不同生存树深度上,进化特征构建提高了不同树诱导策略下的预测性能。完整的联合进化在提出多个性能良好的固有可检查的浅层生存树方面具有最高的潜力。

英文摘要

Survival analysis concerns the task of predicting the time until an event occurs. Often used in the medical field, survival analysis deals with incomplete (i.e., censored) data, for instance, from patients who did not experience the event during the duration of the study. For practical use, both accuracy and interpretability are important. Survival trees are easy-to-follow survival models that split the patient cohort recursively into discrete patient groups. Whilst survival trees can capture complex relationships, they typically need to grow large, threatening interpretability. Moreover, survival trees are often built using greedy approaches that may overlook globally optimal split combinations, limiting predictive performance. Shallow survival trees require expressive, higher-order feature combinations to achieve competitive accuracy. We therefore use genetic programming to multi-objectively evolve inherently inspectable feature sets and study how they interact with different tree induction strategies. We further introduce an evolutionary approach that jointly optimises the survival tree structure and the non-linear split logic. Our findings demonstrate that evolutionary feature construction improves predictive performance across different tree induction strategies on two real-world datasets and two different survival tree depths. Given its speed and flexible presentation, the multi-objective evolution of entire trees likely holds the most future promise.

2605.31051 2026-06-08 cs.NE cs.AI 版本更新

Linear Ordering Problem: Time for a Change

线性排序问题:变革之时

Fabrizio Fagiolo, Marco Baioletti, Valentino Santucci

发表机构 * University for Foreigners of Perugia(佩鲁吉亚外国大学) University of Perugia(佩鲁吉亚大学)

AI总结 针对线性排序问题(LOP)中基准数据过时及多最优解问题,提出基于最新经济数据的新基准套件和生成多样高质量解的算法方案,并引入质量和多样性评估指标。

Comments Accepted for publication at PPSN 2026 - Conference on Parallel Problem Solving

详情
AI中文摘要

线性排序问题(LOP)是一个基础组合优化问题,在经济学、社会选择和机器学习等领域有重要应用。其最突出的用途是经济投入产出表的三角化,这有助于识别经济中的关键产业。大多数现有算法都是在基于过时的宏观经济数据的基准上评估的,这些数据不再反映当代经济的结构。此外,LOP实例通常表现出许多不同的全局最优解,这些解之间可能差异很大,给依赖单一解的应用带来了挑战。为了解决这些局限性,我们引入了一个基于最新现实世界经济数据的新基准套件,以及一种利用最先进的LOP元启发式算法生成多样化高质量解集的算法方案,同时提供了评估质量和多样性的指标。实验在传统单解场景和新引入的多解场景下对所提出的基准套件进行了结果报告。

英文摘要

The Linear Ordering Problem (LOP) is a fundamental combinatorial optimization problem with important applications in areas such as economics, social choice, and machine learning. Its most prominent use is the triangulation of economic input-output tables, which helps identify critical industries in an economy. Most existing algorithms have been evaluated on benchmarks derived from outdated macroeconomic data, which no longer reflect the structure of contemporary economies. Furthermore, LOP instances often exhibit many distinct global optima that can differ substantially from one another, creating challenges for applications that rely on a single solution. To address these limitations, we introduce a novel benchmark suite derived from up-to-date real-world economic data and an algorithmic scheme that leverages state-of-the-art LOP metaheuristics to generate diverse sets of high-quality solutions, together with metrics for assessing both quality and diversity. Experiments were conducted to report results on the proposed benchmark suite under both the traditional single-solution setting and the newly introduced multi-solution scenario.

2606.01802 2026-06-08 cs.SD cs.AI 版本更新

MOSS-Audio Technical Report

MOSS-Audio 技术报告

Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu, Jingqi Chen, Ke Chen, Wenxuan Wang, Yang Wang, Yaozhou Jiang, Yi Jiang, Zhengyuan Lin, Ziqi Chen, Zhaoye Fei, Chenghao Liu, Donghua Yu, Jun Zhan, Kang Yu, Kexin Huang, Liwei Fan, Mingshu Chen, Qinyuan Cheng, Ruixiao Li, Shimin Li, Songlin Wang, Xingjian Zhao, Yang Gao, Yitian Gong, Yiyang Zhang, Zhe Xu, Xipeng Qiu

发表机构 * OpenMOSS Team(开放MOSS团队)

AI总结 提出统一音频-语言模型 MOSS-Audio,通过 DeepStack 跨层特征注入和时间标记实现语音、环境声和音乐的理解,在音频字幕、时间感知问答、时间戳转录和音频推理任务上取得强性能。

详情
AI中文摘要

MOSS-Audio 是一个统一的音频-语言模型,用于语音、环境声和音乐的理解,支持音频字幕、时间感知问答、时间戳转录和基于音频的推理。MOSS-Audio 将专用音频编码器与模态适配器和大语言模型耦合:编码器产生 12.5 Hz 的时间表示,适配器将其投影到解码器空间,解码器生成自回归文本输出。两个设计选择是系统的核心:DeepStack 跨层特征注入,将来自多个编码器深度的声学信息暴露给解码器;以及时间标记,通过在音频标记流中插入时间戳标记来提供显式的时间线索。在数据层面,我们设计了一个事件保留的音频标注流程,将原始音频在连贯的事件边界处分割,对语音、音乐和通用音频应用分支特定标注,并将结果合并为统一的字幕用于预训练。中间的分支特定字幕进一步保留以支持面向任务的 SFT 数据的构建。该模型在大规模音频-语言数据上进行预训练,结合时间感知目标以支持时间定位,然后进行多阶段后训练以增强指令遵循和基于音频的推理。我们发布了 4B 和 8B 两种变体,包括 Instruct 和 Thinking 配置。MOSS-Audio 在通用音频理解、语音字幕、ASR 和时间戳 ASR 上取得了强性能,使其成为未来语音代理的有前途的理解基础。

英文摘要

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.
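
As a rough illustration of the "time markers" idea, the sketch below interleaves timestamp strings into a stream of audio tokens produced at 12.5 Hz (one frame per 80 ms). The marker format `<t=...s>`, the 2-second interval, and the function name are hypothetical; the abstract does not specify how MOSS-Audio encodes or spaces its markers.

```python
FRAME_RATE_HZ = 12.5  # encoder frame rate stated in the abstract (80 ms per frame)


def insert_time_markers(audio_tokens, interval_s=2.0):
    """Insert a timestamp marker before every `interval_s` seconds of audio tokens.

    The marker string and the interval are illustrative assumptions.
    """
    frames_per_marker = max(1, int(interval_s * FRAME_RATE_HZ))
    stream = []
    for i, token in enumerate(audio_tokens):
        if i % frames_per_marker == 0:
            # Time offset of frame i in seconds
            stream.append(f"<t={i / FRAME_RATE_HZ:.1f}s>")
        stream.append(token)
    return stream


# 4 seconds of placeholder audio tokens (50 frames at 12.5 Hz)
tokens = [f"a{i}" for i in range(50)]
marked = insert_time_markers(tokens)
print(marked[:3], marked[26])
```

Placing the markers inside the token stream gives the decoder an explicit position-to-time reference, which is the kind of cue the abstract credits for timestamped transcription and time-aware question answering.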

2606.02609 2026-06-08 cs.LG cs.AI 版本更新

Building Better Activation Oracles

构建更好的激活预言机

Jan Bauer, Celeste De Schamphelaere, Adam Karvonen, Niclas Luick, Neel Nanda

发表机构 * MATS, Gatsby Unit, UCL(MATS项目,伦敦大学学院盖茨比计算神经科学中心) MATS, Ghent University(MATS项目,根特大学) Independent(独立研究者) MATS, University of Hamburg(MATS项目,汉堡大学)

AI总结 针对激活预言机存在的幻觉和模糊性问题,通过四项训练改进(策略内 rollout、对话数据集优化、多层输入和注入公式改进)提升其质量,并开源首个综合评估套件 AObench。

Comments Jan Bauer and Celeste De Schamphelaere contributed equally; author order determined randomly

详情
AI中文摘要

激活预言机(Activation Oracles, AOs)是解释残差流激活的有前景的方法。然而,当前的AOs面临重要问题,如幻觉和模糊性。此外,文本反转混淆使其难以评估。为此,我们通过四种方式改进了激活预言机(AO)的训练机制:在策略内 rollout 上训练、改进对话数据集、提供更多层以及改进注入公式。能力提升微乎其微,但生活质量提升相当显著。此外,我们开源了首个用于AO质量的综合评估套件,称为AObench。总体而言,我们希望我们的工作能为可扩展、端到端可解释性范式中的AOs及其他模型的改进奠定基础。

英文摘要

Activation Oracles (AOs) are promising methods for interpreting residual stream activations. However, current AOs face important issues, such as hallucinations and vagueness. Additionally, text-inversion confounds make them hard to evaluate. To this end, we improve the AO training regime in four ways: training on on-policy rollouts, improving the conversational dataset, feeding in more layers, and improving the injection formula. The capability improvements are marginal, but the quality-of-life improvements are quite substantial. In addition, we open-source the first comprehensive evaluation suite for AO quality, which we call AObench. Overall, we hope that our work sets a foundation that helps improve AOs and other models in the paradigm of scalable, end-to-end interpretability.

2606.03002 2026-06-08 cs.LG cs.AI 版本更新

Perplexity Can Miss SAE Feature Damage Under Quantization

困惑度可能遗漏量化造成的SAE特征损伤

Evan Duan

发表机构 * University of Michigan(密歇根大学)

AI总结 通过稀疏自编码器分析,发现量化导致语言模型中的可解释特征逐渐退化,且任务指标无法完全反映这种损伤,量化与幅度剪枝共享相似的损伤模式。

Comments 12 Pages of Content, Submitted to TMLR

详情
AI中文摘要

量化是部署大型语言模型的标准途径,通常当量化模型的困惑度或下游精度接近全精度原始模型时,即认为其可接受。但模型是否仍以相同方式计算,或全精度模型中识别的可解释特征是否在权重舍入后存活,很少被测试,即使安全审计和引导干预越来越依赖这些特征。我们探究从稠密全精度模型中提取的稀疏自编码器(SAE)特征在模型量化后是否仍然忠实。使用冻结的SAE作为固定测量基础,我们在相同令牌上编码全精度和最近舍入(RTN)量化激活,并通过皮尔逊相关系数量化每个特征的存活率,在Pythia-70M和Gemma-2-2B上扫描从INT8到INT4的位宽。我们发现特征存活是分级的:特征系统地退化而非一次性全部失效,在Pythia-70M上INT6时62.4%的活跃特征存活,在Gemma-2-2B上INT6时51.3%存活,且大多数非存活特征被模糊而非破坏。存活率可仅从全精度统计量预测,交叉验证AUC为0.92至0.97,峰值激活是最强的边际预测因子。关键的是,任务指标可能忽略这种损伤:在Gemma-2-2B上,INT7改善了困惑度却使18.7%的特征退化。最后,量化和匹配困惑度的幅度剪枝损伤高度重叠的特征集,Jaccard重叠为0.79至0.86,损伤分数斯皮尔曼相关性为0.98,表明存在共享的压缩诱导脆弱性模式。这些结果表明,行为等价不足以证明可解释性发现可迁移到量化部署,从而激励对压缩进行特征级审计。

英文摘要

Quantization is a standard path to deploying large language models, and quantized models are typically judged acceptable when perplexity or downstream accuracy remains close to the full-precision original. But behavioral parity need not imply feature fidelity: the sparse-autoencoder (SAE) features used to interpret a full-precision model may change after weight rounding. We test this directly by using a frozen SAE as a fixed measurement basis, encoding full-precision and round-to-nearest (RTN) quantized activations on identical tokens, and measuring per-feature survival by Pearson correlation across bit-widths from INT8 to INT4 on Pythia-70M and Gemma-2-2B. Our central finding is that perplexity can miss feature damage: on Gemma-2-2B, INT7 improves perplexity while degrading 18.7% of active SAE features, and under sliding-window evaluation INT6 also improves perplexity while only 51.3% of active features survive. Feature survival is graded rather than cliff-like, with 62.4% of active Pythia features and 51.3% of active Gemma features surviving at INT6; most non-surviving features are blurred rather than fully damaged. Survival is also predictable from full-precision feature statistics alone, with cross-validated AUC 0.92--0.97 and peak activation as the strongest marginal predictor. Finally, RTN quantization and matched-perplexity magnitude pruning damage strongly overlapping feature sets, with Jaccard overlap 0.79--0.86 and damage-score Spearman correlation 0.98. These results show that behavioral metrics alone are insufficient evidence that full-precision interpretability findings transfer to quantized models, motivating feature-level audits of compression.
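
The per-feature survival measurement described above can be sketched as a Pearson correlation between a feature's activations on the full-precision and quantized model over identical tokens. This is a minimal illustration, not the paper's code: the survival threshold of 0.9, the definition of "active" as any nonzero full-precision activation, and the toy data are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def survival_rate(full_acts, quant_acts, threshold=0.9):
    """Fraction of active features whose activations stay correlated after quantization.

    full_acts[f] and quant_acts[f] hold feature f's activations on the same tokens.
    The threshold and the notion of "active" are illustrative assumptions.
    """
    active = [f for f, acts in full_acts.items() if any(a != 0 for a in acts)]
    survived = [f for f in active if pearson(full_acts[f], quant_acts[f]) >= threshold]
    return len(survived) / len(active) if active else 0.0


# Toy activations: feature 0 is slightly perturbed, feature 1 is inverted, feature 2 never fires
full = {0: [0.0, 1.0, 2.0, 3.0], 1: [0.0, 1.0, 0.0, 1.0], 2: [0.0, 0.0, 0.0, 0.0]}
quant = {0: [0.0, 1.1, 1.9, 3.2], 1: [1.0, 0.0, 1.0, 0.0], 2: [0.0, 0.0, 0.0, 0.0]}
print(survival_rate(full, quant))
```

Because the SAE is frozen, both sets of activations are expressed in the same feature basis, so a drop in correlation reflects a change in the model's activations rather than a change in the dictionary.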

2606.03161 2026-06-08 cs.MA cs.AI 版本更新

OpenAgenet / OAN White Paper: Open Infrastructure for Trusted Agent Interconnection

OpenAgenet / OAN 白皮书:可信智能体互连的开放基础设施

Jinliang Xu

发表机构 * China Academy of Information and Communications Technology(中国信息通信研究院)

AI总结 针对智能体从孤立应用转向开放多运营商网络时面临的身份验证、治理状态、发现授权、新鲜度和信任证据问题,提出协议无关的信任层OAN,通过根治理身份准入、注册商辅助注册、根验证包发布、授权感知发现和签名可信调用来实现可信互连。

详情
AI中文摘要

OpenAgenet,简称OAN,是一个用于可信智能体互连的开放基础设施项目。它解决了一个在智能体从孤立应用转向开放的多运营商网络时变得明显的问题:在智能体能够安全地发现、选择和调用另一个智能体之前,它需要一种方法来验证身份来源、治理状态、发现授权、新鲜度和连接前的信任证据。OAN被设计为一个协议无关的信任层。它不取代智能体交互协议、工具协议、模型编排框架或应用级工作流。相反,它提供了根治理的身份准入、注册商辅助的注册、根验证的包发布、授权感知的发现以及签名的可信调用。本文介绍了OAN的动机、架构、角色、治理模型、与MCP、A2A和ANP的关系、部署模式、合作模型、区块链支持的授权公告、原型状态、性能概况和路线图。

英文摘要

OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes visible when Agents move from isolated applications into open, multi-operator networks: before an Agent can safely discover, select, and invoke another Agent, it needs a way to verify identity provenance, governance state, discovery authorization, freshness, and pre-connection trust evidence. OAN is designed as a protocol-neutral trust layer. It does not replace Agent interaction protocols, tool protocols, model orchestration frameworks, or application-level workflows. Instead, it provides did:oan-based resource identity, governance-backed admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware Discovery, Root-issued infrastructure authorization VCs, and signed trusted invocation. The architectural center of OAN is the combination of federated governance, resource identity, and trusted Discovery, rather than a single directory or naming service. This white paper explains the motivation, architecture, roles, governance model, relationship with MCP, A2A, and ANP, deployment patterns, cooperation model, on-chain governance layer, prototype status, performance profile, and roadmap of OAN.

2606.03163 2026-06-08 cs.MA cs.AI cs.DC 版本更新

OpenAgenet / OAN Yellow Paper: Technical Architecture for Trust-Governed Resource Identity and Discovery

OpenAgenet / OAN 黄皮书:信任治理的资源身份与发现技术架构

Jinliang Xu

发表机构 * OpenAgenet / OAN

AI总结 本文提出OpenAgenet/OAN协议中立信任层技术架构,通过角色架构、身份对象、注册工作流、根治理生命周期、根验证包模型、授权感知发现、签名可信调用、验证要求、状态转换、安全属性、实现边界和部署考虑,实现异构智能体框架(包括MCP、A2A、ANP类系统及领域特定协议)的身份准入、可发现、可验证和安全交互。

详情
AI中文摘要

本文描述了OpenAgenet / OAN的技术架构。OAN是一个协议中立的信任层,用于开放的智能体互连。它规定了角色架构、身份对象、注册工作流、根治理生命周期、根验证包模型、授权感知发现、签名可信调用、验证要求、状态转换、安全属性、实现边界和部署考虑。该设计旨在支持异构智能体框架和交互协议,包括MCP、A2A、ANP类系统以及领域特定的智能体协议。OAN不定义智能体之间的完整业务对话;它定义了在特定协议交互开始之前,智能体身份如何变得可接纳、可发现、可验证且安全可接近。

英文摘要

This yellow paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection and discoverable AI resource products. It specifies the role architecture, did:oan identity objects, registration workflow, governance-backed Root lifecycle enforcement, Root-verified package model, authorization-aware Discovery, Root-issued infrastructure authorization VCs, signed trusted invocation, verification requirements, state transitions, security properties, implementation boundaries, and deployment considerations. The design is intended to support heterogeneous Agent frameworks and interaction protocols, including MCP, A2A, ANP-like systems, domain-specific Agent protocols, Skills, MCP Servers, and Tool/API resources. OAN does not define the entire business conversation among Agents or the native protocol of every resource; it defines how resource identities become admissible, discoverable, verifiable, and safe to approach before protocol-specific interaction begins.

2606.03382 2026-06-08 cs.LG cs.AI 版本更新

Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions

局部引导,全局影响:高斯重塑信任区域解锁行为转变

Bingxu Liu, Jiashun Liu, Johan Obando-Ceron, Hao Wang, Runze Liu, Pablo Samuel Castro, Aaron Courville, Ling Pan

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Mila - Québec AI Institute(魁北克AI研究院) Université de Montréal(蒙特利尔大学) Fudan University(复旦大学) City University of Hong Kong(香港城市大学)

AI总结 针对PPO在非平稳环境中优化失效的问题,提出高斯信任区域策略优化(GTR),通过高斯核重塑信任区域实现非单调约束,在保持局部稳定性的同时允许必要的大幅策略更新,并在多种任务中取得强性能。

Comments 21 pages

详情
AI中文摘要

虽然近端策略优化(PPO)在静态环境中表现出色,但我们表明其标准优化范式在连续和非平稳环境中存在困难。失败并非源于模型容量不足或裁剪过于严格。相反,PPO执行持续、方向低效的局部更新,这表明缺乏几何感知引导来积累有意义的行为变化,最终阻碍向新行为模式的转变。尽管基于散度的正则化引入了部分几何感知,但其单调递增的惩罚隐式地阻止了大的策略偏差,即使这种转变对于有效适应是必要的。为了解决这一局限性,我们提出了高斯信任区域策略优化(GTR),它使用高斯核重塑信任区域。由此产生的约束是有界且非单调的,在提供强局部稳定性的同时,在持续的高优势更新下逐渐放松。为了进一步提高鲁棒性,我们引入了一个混合高斯锚点,它适应最近的策略轨迹,减少了由陈旧参考引起的方差。GTR与架构无关,在游戏、模拟机器人控制、开放世界探索和语言模型后训练中均取得了强性能。这些结果表明,几何感知的信任区域设计可以成为复杂非平稳环境中鲁棒强化学习的一个有前景的方向。我们的代码可在该 https URL 获取。

英文摘要

While Proximal Policy Optimization (PPO) demonstrates strong performance in stationary settings, we show that its standard optimization paradigm struggles in continual and non-stationary environments. The failure does not stem from insufficient model capacity or overly restrictive clipping. Instead, PPO performs persistent, directionally inefficient local updates, indicating a lack of geometry-aware guidance for accumulating meaningful behavioral change, which ultimately hinders transitions toward new behavior patterns. Although divergence-based regularization introduces partial geometric awareness, its monotonically increasing penalties implicitly discourage large policy deviations, even when such shifts are necessary for effective adaptation. To address this limitation, we propose Gaussian Trust Region Policy Optimization (GTR), which reshapes the trust region using a Gaussian kernel. The resulting constraint is bounded and non-monotonic, providing strong local stability while progressively relaxing under sustained high-advantage updates. To further improve robustness, we introduce a Mixture Gaussian Anchor that adapts to recent policy trajectories, reducing variance induced by stale references. GTR is architecture-agnostic and achieves strong performance across games, simulated robotic control, open-world exploration, and language model post-training. These results demonstrate that geometry-aware trust-region design can be a promising direction for robust reinforcement learning in complex non-stationary environments. Our code is available at https://anonymous.4open.science/r/GTR_demo/README.md.

2606.04058 2026-06-08 cs.LG cs.AI 版本更新

Spectral Scaling Laws of Muon

Muon的谱缩放定律

Gagik Magakyan, Pablo Parrilo, Asuman Ozdaglar

发表机构 * MIT(麻省理工学院)

AI总结 本文系统研究了Muon优化器中动量矩阵奇异值谱随模型大小的缩放行为,发现其遵循幂律,并据此提出层感知的牛顿-舒尔茨迭代配置选择方法以减少计算开销。

详情
AI中文摘要

正交归一化更新规则已迅速成为训练大型语言模型的主流优化器选择,最近的开源最先进模型采用了Muon。为了保持这些更新的可处理性,Muon使用牛顿-舒尔茨(NS)迭代执行正交归一化。由于NS只是近似,小奇异值的方向无法被正交归一化。在Muon中,NS每一步都应用于动量矩阵,然而关于这些动量矩阵的奇异值谱在训练过程中如何行为,以及该行为如何随模型大小变化,我们知之甚少。我们首次系统研究了这一问题。通过追踪从77M到2.8B参数模型中各层动量缓冲区的奇异值分位数,我们观察到一致的现象:在短暂的预热后,分位数稳定在一个由层类型和模型大小决定的值上。这些稳定值随模型大小呈现出非常清晰的幂律,且指数依赖于层。中后深度的层随模型大小$M$的缩放非常温和(约$M^{-0.25}$),因此学术规模下使用的标准5步NS配置将在更大规模下继续对它们进行正交归一化。然而,某些后期层的缩放更为激进(高达$M^{-0.96}$),在前沿规模下将落入NS失效区域,除非使用更多NS迭代或更好调整的系数。NS迭代在规模上计算成本高昂;我们的定律为从业者提供了一种有原则的、层感知的配方,用于选择最小的NS配置,该配置仍能正交归一化重要的方向——在不牺牲更新质量的情况下避免不必要的计算。

英文摘要

Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the orthonormalization with the Newton--Schulz (NS) iteration. Since NS is only approximate, directions with small singular values fail to be orthonormalized. In Muon, NS is applied to the momentum matrix at every step, yet little is known about how the singular value spectrum of these momentum matrices behaves during training, or how that behavior changes with model size. We present the first systematic study of this question. Tracking singular value quantiles of the momentum buffer across layers in models ranging from 77M to 2.8B parameters, we observe a consistent picture: after a short burn-in, the quantiles stabilize at a value determined by the layer type and model size. These stabilization values follow remarkably clean power laws in model size, with layer-dependent exponents. Layers up to mid-late depth scale very mildly with model size $M$ (around $M^{-0.25}$), so the standard 5-step NS configuration used at academic scale will continue to orthonormalize them at much larger scales. Some of the late layers, however, scale much more aggressively (up to $M^{-0.96}$) and will fall into the NS failure regime at frontier scale unless one uses more NS iterations or better-tuned coefficients. NS iterations are computationally expensive at scale; our laws give practitioners a principled, layer-aware recipe for choosing the minimum NS configuration that still orthonormalizes the directions that matter -- avoiding unnecessary computation without sacrificing update quality.
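
The failure mode described above, where directions with small singular values are not orthonormalized, can be illustrated with a minimal NumPy sketch. The quintic coefficients (3.4445, -4.7750, 2.0315) are the ones used in the public Muon reference implementation and are not stated in this abstract; the helper `matrix_with_spectrum` and the test spectrum are made up for illustration.

```python
import numpy as np

def newton_schulz(G, steps=5, coeffs=(3.4445, -4.7750, 2.0315)):
    """Approximately orthonormalize G with a quintic Newton-Schulz iteration."""
    a, b, c = coeffs
    # Dividing by the Frobenius norm guarantees a spectral norm of at most 1.
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # each singular value s maps to a*s + b*s^3 + c*s^5
    return X

def matrix_with_spectrum(singular_values, seed=0):
    """Hypothetical helper: square matrix with prescribed singular values."""
    rng = np.random.default_rng(seed)
    n = len(singular_values)
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return U @ np.diag(singular_values) @ V.T

# Two well-conditioned directions and one direction with a tiny singular value.
G = matrix_with_spectrum([1.0, 0.5, 1e-6])
spectrum_after = np.linalg.svd(newton_schulz(G), compute_uv=False)
print(spectrum_after)
```

With five steps, the two large singular values land near 1 while the tiny one is amplified only by roughly the fifth power of the linear coefficient, so it stays far below 1. Raising `steps` is the simplest way to see the trade-off between compute and how much of the spectrum gets orthonormalized.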

2606.04349 2026-06-08 cs.CV cs.AI 版本更新

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

MorphoQuant: 面向全模态大语言模型的模态感知量化

Yue Wu, Changyuan Wang, Zixuan Wang, Shilin Ma, Yansong Tang

AI总结 提出MorphoQuant框架,通过分布感知偏差补偿和形态导向量化函数优化,解决全模态大语言模型在4比特后训练量化中的分布异质性和异常值问题,实现精度与效率的优异平衡。

详情
AI中文摘要

传统的后训练量化方法在处理4比特全模态大语言模型时,由于跨模态的极端分布异质性和不同的异常值模式而面临困难。为了解决这一问题,我们提出了MorphoQuant,一种模态感知的PTQ框架,旨在保留跨模态形态并减轻异常值损失。具体来说,我们引入了分布感知偏差补偿,它选择性地将长尾异常值吸收到通道偏差中。该机制在保持异常值幅度的同时,为密集内点维持高精度离散化,从而在多样的模态分布中保持精确的离散化。作为补充,我们提出了形态导向量化函数优化,以协同优化量化网格与偏差掩码,确保跨模态的细粒度对齐。在Qwen2.5-Omni上对MMMU和Video-MME等基准的广泛评估证明了我们方法的优越性。值得注意的是,我们的W4A4模型在ScienceQA上达到了76.63%,显著优于最先进的W4A4方法,并意外地超越了W4A16基线,这充分展示了我们框架在精度-效率权衡方面的卓越表现。

英文摘要

Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism safeguards outlier magnitudes while maintaining high-precision discretization for dense inliers across diverse modal distributions. Complementing this, we propose Morphology-Directed Quantization Function Optimization (MDQFO) to co-optimize the quantization grid with the bias mask, ensuring fine-grained alignment across modalities. Extensive evaluations on Qwen2.5-Omni across benchmarks like MMMU and Video-MME demonstrate our approach's superiority. Notably, our W4A4 model achieves 76.63% on ScienceQA, significantly outperforming SOTA W4A4 methods and surprisingly surpassing the W4A16 baseline, which demonstrates the accuracy-efficiency trade-off of our framework.

2606.04373 2026-06-08 cs.CV cs.AI 版本更新

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

解耦信息区域的选择性耦合:用于视觉Transformer无数据量化的掩码注意力对齐

Biao Qian, Yang Wang, Yong Wu, Jungong Han

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MaskAQ方法,通过解耦合成样本中的信息区域并利用掩码注意力对齐全精度模型与量化模型,解决无数据量化中分布不匹配问题。

Comments Accepted to appear at ICML 2026, Seoul, Korea

详情
AI中文摘要

无数据量化(DFQ)通过合成样本解决数据安全问题,无需访问真实数据。由于自注意力机制相比经典卷积运算的优势,DFQ在视觉Transformer(ViT)中日益受到关注。然而,先前的ViT DFQ方法常遭受合成样本与量化模型Q期望输入分布之间的分布不匹配,导致性能次优。本文提出一种新颖的掩码注意力对齐方法用于ViT的无数据量化,称为MaskAQ,揭示了:1)自注意力机制中的语义主要局限于稀疏的补丁子集,称为信息区域;2)信息区域主导了合成样本与Q输出之间的互信息。为此,我们利用合成样本补丁相似性的微分熵最大化,从噪声背景中解耦信息区域。为了与不同的Q耦合,通过掩码注意力对齐目标选择信息区域以对齐全精度模型与Q,从而产生高质量的合成样本。此外,提出周期性样本刷新策略,使MaskAQ能够在训练过程中持续适应Q的演化状态,以保持与合成样本的理想互信息。大量实验验证了MaskAQ在多个骨干网络和下游任务上优于最先进方法。我们的代码可在https://github.com/hfutqian/MaskAQ获取。

英文摘要

Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism over classical convolutional operations. However, previous DFQ methods for ViTs often suffer from a distribution mismatch between synthetic samples and the input distribution expected by quantized models Q, resulting in suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism are predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To these ends, we incorporate differential entropy maximization over the patch similarity of synthetic samples to decouple informative regions from the noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample refreshing strategy lets MaskAQ continually adapt to the evolving state of Q throughout training, preserving desirable mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.

2606.04812 2026-06-08 cs.LG cs.AI 版本更新

Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

面向风险感知强化学习的情景生成与近似安全保证

Mohit Prashant, Arvind Easwaran

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 针对强化学习策略对转移扰动敏感导致不安全行为的问题,提出使用变分自编码器近似状态空间分布,通过构造上下界屏障证书并采样非鲁棒区域状态来收紧概率安全保证。

Comments 8 pages, preprint

详情
AI中文摘要

保证安全性对于强化学习(RL)智能体在现实世界中的部署至关重要,尤其是使用深度RL学习的策略可能表现出对转移扰动的敏感性,从而导致未知或不安全的行为。一种策略验证方法是通过采样相对于安全约束的策略轨迹来构造概率屏障证书,从而将已知的安全行为与未知行为区分开来。如果策略容易受到转移不确定性或扰动的影响,使智能体处于未充分探索的状态,则难以获得这些约束违反概率的严格上下界。为了解决这个问题,我们使用变分自编码器(VAE)近似遇到的状态空间的分布,并利用状态的潜在特征构造上下界屏障证书,以高置信度优化已知安全行为的区域。我们在工作中将其表述为一个对偶优化问题,其中下界屏障证书比上界屏障证书提供更保守的安全区域估计。在训练期间采样位于两者集合差(即非鲁棒区域)内的状态,使我们能够收紧上下界,从而提供更尖锐的概率安全保证。在我们的研究中,我们描述了所放置的保证,并通过实验证明了我们边界的紧致性。

英文摘要

Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. One method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violation of these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this in our work as a dual optimization problem where the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e., the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. Within our study, we describe the guarantees placed and demonstrate the tightness of our bounds experimentally.

2606.05152 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Reinforcement Learning from Rich Feedback with Distributional DAgger

利用丰富反馈的强化学习与分布式DAgger

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

发表机构 * University of Southern California(南加州大学)

AI总结 提出DistIL算法,通过分布式DAgger利用丰富反馈(如执行轨迹、工具输出等)进行前向交叉熵优化,实现单调策略改进和更好的Pass@N性能。

详情
AI中文摘要

推理模型发展迅速,但主流的基于可验证奖励的强化学习(RLVR)方法仍然非常狭窄:采样多个响应,并用单个比特奖励每个响应,指示最终答案是否正确。然而,许多设置提供了丰富的反馈,包括执行轨迹、工具输出、专家修正和模型自我评估。我们研究如何通过经典模仿学习算法DAgger的分布式变体来使用这种反馈,其中学习器可以局部访问当前策略所访问状态上的专家分布。这产生了一个简单的前向交叉熵目标,该目标接受黑盒专家,并且其序列级梯度通过将未来的专家-学生分歧传播回早期决策来进行丰富的信用分配。我们表明,基于反向KL或Jensen-Shannon的先前具有自蒸馏目标的强化学习无法保证单调策略改进:即使专家具有更高的奖励,它们的更新也可能增加更差动作的概率。相比之下,我们证明前向交叉熵允许单调策略改进并享有遗憾保证。我们进一步表明,我们的目标优化了教师加权的成功可能性的下界,从而改进了Pass@N。实验上,我们的方法DistIL在科学推理、编程和解决困难数学问题等多个领域优于RLVR和基于自蒸馏的强化学习基线。

英文摘要

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon divergence fails to guarantee monotonic policy improvement: even when the expert has higher reward, the updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
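
The two objective families contrasted above can be written down for a single state with a categorical action distribution. The sketch below is only an illustration of forward cross-entropy versus reverse KL under a softmax policy, not the DistIL algorithm itself; the expert distribution, step size, and function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward_cross_entropy(p_expert, logits):
    """H(p_expert, pi) = -sum_a p_expert(a) * log pi(a)."""
    return float(-np.sum(p_expert * np.log(softmax(logits))))

def reverse_kl(p_expert, logits):
    """KL(pi || p_expert) = sum_a pi(a) * log(pi(a) / p_expert(a))."""
    pi = softmax(logits)
    return float(np.sum(pi * (np.log(pi) - np.log(p_expert))))

def forward_ce_grad(p_expert, logits):
    """Closed-form gradient of the forward cross-entropy w.r.t. the logits."""
    return softmax(logits) - p_expert

p_expert = np.array([0.7, 0.2, 0.1])  # hypothetical expert distribution
logits = np.zeros(3)                  # learner starts from a uniform policy
updated = logits - 0.5 * forward_ce_grad(p_expert, logits)

print(forward_cross_entropy(p_expert, logits))
print(forward_cross_entropy(p_expert, updated))
print(reverse_kl(p_expert, updated))
```

The forward cross-entropy gradient only needs samples or probabilities from the expert, which is consistent with the black-box expert setting the abstract mentions; the reverse KL instead takes its expectation under the learner's own policy.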

2606.05654 2026-06-08 cs.SE cs.AI cs.LG 版本更新

When Surface Form Changes Moderation Decisions: A Paired Study of Code-Mixed Workflow Instability

当表面形式改变审核决策:代码混合工作流不稳定性的配对研究

Suraj Babu Thimma Krishnaram, Yibo Hu, Karthikeyan Saravanan

发表机构 * GitHub

AI总结 通过配对评估设置,研究在清洁英语与泰米尔语-英语代码混合输入下,仇恨审核工作流的变化,发现代码混合导致决策翻转率高达0.265,并增加审核负担和误报。

详情
AI中文摘要

仇恨审核通常被评估为对清洁英语输入的分类,但部署的系统必须将内容路由到诸如ALLOW、FLAG或REVIEW等操作。我们通过配对评估设置研究这种工作流在代码混合输入下的变化,其中相同的基础内容以清洁英语和泰米尔语-英语代码混合形式表达。在基于清洁英语开发数据调整的阈值下,代码混合输入产生显著的动作不稳定性,配对清洁到代码混合决策翻转率为0.265。主要工作流影响是增加的审核负担和增加的非仇恨内容误报:审核率从0.138上升到0.297,非仇恨误报率从0.069上升到0.104。仅泰米尔语输入整体表现出更强的退化,表明存在更广泛的语言覆盖限制,而非相同的代码混合不稳定性模式。一个简单的基于分歧的延迟规则减少了压力输入上的自动错误,但只能通过增加审核负载。这些结果表明,工作流级别的评估揭示了标准分类摘要可能遗漏的审核失败。

英文摘要

Hate moderation is often evaluated as classification on clean English inputs, but deployed systems must route content to actions such as ALLOW, FLAG, or REVIEW. We study how this workflow changes under code-mixed inputs using a paired evaluation setting where the same underlying content is expressed as clean English and Tamil-English code-mix. Under thresholds tuned on clean English development data, code-mixed inputs produce substantial action instability, with a paired clean-to-code-mix decision flip rate of 0.265. The main workflow effects are increased review burden and increased false-flagging of non-hateful content: review rate rises from 0.138 to 0.297 and non-hate false-flag rate rises from 0.069 to 0.104. Tamil-only inputs show stronger degradation overall, suggesting a broader language-coverage limitation rather than the same code-mixed instability pattern. A simple disagreement-based deferral rule reduces automatic errors on stressed inputs, but only by increasing review load. These results show that workflow-level evaluation reveals moderation failures that standard classification summaries can miss.
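
The paired flip-rate and review-rate measurements can be sketched as follows. The abstract does not give the routing rule, so the two-threshold scheme (with REVIEW as the middle band), the threshold values, and the scores below are all assumptions made for illustration.

```python
def route(score, flag_threshold=0.8, review_threshold=0.4):
    """Map a hate score in [0, 1] to a workflow action (hypothetical thresholds)."""
    if score >= flag_threshold:
        return "FLAG"
    if score >= review_threshold:
        return "REVIEW"
    return "ALLOW"

def paired_flip_rate(clean_scores, mixed_scores):
    """Fraction of pairs whose routed action differs between the two surface forms."""
    if len(clean_scores) != len(mixed_scores):
        raise ValueError("paired inputs must have the same length")
    flips = sum(route(c) != route(m) for c, m in zip(clean_scores, mixed_scores))
    return flips / len(clean_scores)

def review_rate(scores):
    """Fraction of inputs routed to human review."""
    return sum(route(s) == "REVIEW" for s in scores) / len(scores)

# Made-up classifier scores for the same four items in two surface forms.
clean = [0.10, 0.90, 0.30, 0.85]
mixed = [0.45, 0.88, 0.35, 0.60]

print(paired_flip_rate(clean, mixed))
print(review_rate(clean), review_rate(mixed))
```

Because the thresholds are fixed on the clean inputs, any score shift caused by the surface form shows up directly as a change of action, which is what the flip rate counts.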

2602.04234 2026-06-08 cs.MA cs.AI 版本更新

When Does Multi-Agent Collaboration Help? An Entropy Perspective

何时多智能体协作有助于?从熵的角度来看

Yuxuan Zhao, Sijia Chen, Ningxin Su

发表机构 * Yantai Research Institute of Harbin Engineering University(哈尔滨工程大学烟台研究院) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文从熵的角度探讨了多智能体协作的有效性,通过分析不同拓扑结构、六个推理基准和两个智能体任务中的熵转换,发现单个智能体在43.3%的情况下表现更优,并揭示了熵动态在第一轮交互中的决定性作用。研究提出了三个关键观察:确定性偏好、基础熵和任务意识,并引入了熵判别算法来提升智能体系统的性能。

Comments Project page: https://multiagent-entropy.github.io/

详情
AI中文摘要

多智能体系统(MAS)已逐渐成为利用大型语言模型(LLMs)解决复杂任务的主流范式。然而,基于公开可用LLMs构建的MAS的有效性机制,尤其是其成功或失败的内在原因,仍鲜有研究。本文从熵的角度重新审视MAS,通过研究问题解决过程中不同拓扑结构、六个推理基准和两个智能体任务中的熵转换,考虑了智能体内部和交互动态。通过分析245个跨越token级、agent级和轮次级的熵特征,我们意外发现,在约43.3%的情况下,单个智能体优于MAS。此外,我们发现熵动态主要在首次交互轮次中决定。我们提供了三个关键观察:1)确定性偏好:峰值熵直接损害MAS的正确性,稳定熵直接促进MAS的正确性;2)基础熵:基础模型在问题解决过程中具有较低熵会因果驱动MAS性能;3)任务意识:MAS的熵动态在不同任务中扮演不同角色。基于这些见解,我们引入了一个简单而有效的算法,即熵判别器,用于从MAS的pass@k结果中选择解决方案,从而在所有MAS配置和任务中均实现了稳定的准确率提升。我们的源代码可在https://github.com/AgenticFinLab/multiagent-entropy获取。

英文摘要

Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of entropy, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies, six reasoning benchmarks, and two agentic tasks. By analyzing 245 features spanning token-, agent-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that entropy dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: peak entropy directly harms and stable entropy directly benefits MAS correctness; 2) Base Entropy: base models with lower entropy during problem-solving causally drive MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS's pass@$k$ results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at https://github.com/AgenticFinLab/multiagent-entropy.
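
Token-level entropy, the basic quantity behind the features analyzed above, can be computed as in the sketch below. The aggregation into mean and peak values is one plausible pair of response-level features chosen for illustration; the abstract does not list the 245 features or specify how the Entropy Judger uses them, and the distributions are made up.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def response_entropy_features(token_distributions):
    """Mean and peak token entropy over one agent response."""
    entropies = [token_entropy(p) for p in token_distributions]
    return {"mean": sum(entropies) / len(entropies), "peak": max(entropies)}

# Two hypothetical responses, each a list of next-token distributions.
confident = [[0.97, 0.02, 0.01], [0.90, 0.05, 0.05]]
uncertain = [[0.40, 0.30, 0.30], [1 / 3, 1 / 3, 1 / 3]]

print(response_entropy_features(confident))
print(response_entropy_features(uncertain))
```

A uniform distribution over three tokens gives the maximum entropy of log 3 nats, so the second response has both a higher mean and a higher peak than the first.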

2601.14637 2026-06-08 cs.CV cs.AI cs.CL cs.HC 版本更新

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis

Forest-Chat: 为交互式森林变化分析适应视觉-语言代理

James Brock, Ce Zhang, Nantheera Anantrasirichai

发表机构 * School of Computer Science, University of Bristol(布里斯托尔大学计算机科学学院) School of Geographical Sciences, University of Bristol(布里斯托尔大学地理科学学院)

AI总结 本文提出Forest-Chat,一种基于LLM的森林变化分析代理,通过多任务处理实现自然语言查询,提升森林变化检测与语义解释的准确性与可解释性。

Comments 28 pages, 9 figures, 12 tables, Submitted to Ecological Informatics

详情
AI中文摘要

高分辨率卫星影像的普及与深度学习的进步为森林监测提供了新机遇。本文提出Forest-Chat,一种基于大语言模型的视觉-语言代理,支持多任务的交互式森林变化分析,包括变化检测、图像描述、对象计数、森林砍伐特征识别和变化推理。Forest-Chat基于多级变化解释(MCI)视觉-语言框架,结合零样本变化检测和多模态零样本变化描述与优化。引入Forest-Change数据集,包含双时相卫星影像、像素级变化掩码和语义变化描述。在Forest-Change数据集上,Forest-Chat在mIoU和BLEU-4指标上达到67.10%和40.17%,在LEVIR-MCI-Trees子集上达到88.13%和34.41%。零样本测试中,其在Forest-Change数据集上达到60.15%和34.00%,在LEVIR-MCI-Trees子集上达到47.32%和18.23%。进一步实验表明,描述优化能注入地理领域知识,但标签域迁移有限。这些发现表明,交互式、基于LLM的系统能支持可访问和可解释的森林变化分析。

英文摘要

The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. This paper introduces Forest-Chat, an LLM-driven agent for forest change analysis, enabling natural language querying across multiple RSICI tasks, including change detection and captioning, object counting, deforestation characterisation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, incorporating zero-shot change detection via AnyChange and multimodal LLM-based zero-shot change captioning and refinement. To support adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and semantic change captions via human annotation and rule-based methods. Forest-Chat achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on Forest-Change, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI. In a zero-shot capacity, it achieves 60.15% and 34.00% on Forest-Change, and 47.32% and 18.23% on LEVIR-MCI-Trees. Further experiments demonstrate the value of caption refinement for injecting geographic domain knowledge into supervised captions, and the system's limited label domain transfer onto JL1-CD-Trees. These findings demonstrate that interactive, LLM-driven systems can support accessible and interpretable forest change analysis.
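
For readers unfamiliar with the mIoU metric reported above, a minimal NumPy sketch for a binary change mask follows. The abstract does not state the exact evaluation protocol (for example, per-image versus dataset-level averaging), so this per-mask class average and the toy masks are assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes=2):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for cls in range(num_classes):
        pred_c = pred == cls
        target_c = target == cls
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both masks; skip it
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 2x4 masks: 1 marks changed pixels, 0 marks unchanged pixels.
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])

print(mean_iou(pred, target))
```

Here the unchanged class has IoU 4/5 and the changed class has IoU 3/4, so a single missed change pixel lowers both per-class scores.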

2505.18006 2026-06-08 cs.CY cs.AI cs.HC cs.IR 版本更新

AI Literacy for Legal AI Systems: A practical approach

为法律AI系统设计的AI素养:一种实用方法

Gizem Gultekin-Varkonyi

发表机构 * University of Szeged, Faculty of Law and Political Sciences, International and Regional Studies Institute(塞格德大学法学院与政治科学学院,国际与区域研究学院)

AI总结 本文探讨了法律AI系统的AI素养,分析了其对法律和伦理发展的关键作用,并提出了一种实用的风险评估工具。

Comments Forthcoming in Iustum Aequum Salutare (2025) vol.21

详情
Journal ref
Iustum Aequum Salutare, 2025, 21 (4)
AI中文摘要

法律AI系统正被全球司法和法律系统部署者和提供者越来越多地采用,以支持各种应用。尽管它们提供了减少偏见、提高效率和改善问责的潜在好处,但也带来了重大风险,需要在机会、法律和伦理发展和部署之间取得平衡。AI素养作为欧盟AI法案中的法律要求,以及部署者和提供者实现伦理AI的关键使能者,可以成为实现这一平衡的工具。本文引入了“法律AI系统”一词,然后分析了AI素养的概念及其与这些系统相关的利弊。这一分析与处理法律AI系统的组织的更广泛AI-L概念相关联。本文的成果是一份路线图问卷,作为实用工具,帮助开发者和提供者评估风险、益处和利益相关者的担忧,以满足社会和监管对法律AI的期望。

英文摘要

Legal AI systems are increasingly being adopted by judicial and legal system deployers and providers worldwide to support a range of applications. While they offer potential benefits such as reducing bias, increasing efficiency, and improving accountability, they also pose significant risks, requiring a careful balance between opportunities, and legal and ethical development and deployment. AI literacy, as a legal requirement under the EU AI Act and a critical enabler of ethical AI for deployers and providers, could be a tool to achieve this. The article introduces the term "legal AI systems" and then analyzes the concept of AI literacy and the benefits and risks associated with these systems. This analysis is linked to a broader AI-L concept for organizations that deal with legal AI systems. The outcome of the article, a roadmap questionnaire as a practical tool for developers and providers to assess risks, benefits, and stakeholder concerns, could be useful in meeting societal and regulatory expectations for legal AI.

2502.21123 2026-06-08 cs.LG cs.AI 版本更新

Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models

因果关系是理解和平衡可信机器学习与基础模型中多个目标的关键

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf, Mario Fritz

发表机构 * CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心) Max Planck Institute for Intelligent Systems, Tübingen(马克斯·普朗克智能系统研究所(图宾根)) Google Research(谷歌研究) ETH Zürich(苏黎世联邦理工学院) University of Toronto(多伦多大学)

AI总结 本文主张将因果方法集成到机器学习中,以平衡公平性、隐私、鲁棒性、准确性和可解释性等可信原则之间的权衡,并探讨其在基础模型中的实际应用。

详情
AI中文摘要

确保机器学习系统的可信度至关重要,因为它们日益嵌入高风险领域。本文主张将因果方法集成到机器学习中,以应对可信机器学习关键原则(包括公平性、隐私、鲁棒性、准确性和可解释性)之间的权衡。虽然这些目标理想情况下应同时满足,但它们通常被孤立地处理,导致冲突和次优解决方案。借鉴因果在ML中成功协调目标(如公平性与准确性,或隐私与鲁棒性)的现有应用,本文认为因果方法对于平衡可信ML和基础模型中的多个竞争目标至关重要。除了强调这些权衡,我们考察了如何将因果实际集成到ML和基础模型中,提供增强其可靠性和可解释性的解决方案。最后,我们讨论了采用因果框架的挑战、局限性和机遇,为更负责任和合乎伦理的AI系统铺平道路。

英文摘要

Ensuring trustworthiness in machine learning (ML) systems is crucial as they become increasingly embedded in high-stakes domains. This paper advocates for integrating causal methods into machine learning to navigate the trade-offs among key principles of trustworthy ML, including fairness, privacy, robustness, accuracy, and explainability. While these objectives should ideally be satisfied simultaneously, they are often addressed in isolation, leading to conflicts and suboptimal solutions. Drawing on existing applications of causality in ML that successfully align goals such as fairness and accuracy or privacy and robustness, this paper argues that a causal approach is essential for balancing multiple competing objectives in both trustworthy ML and foundation models. Beyond highlighting these trade-offs, we examine how causality can be practically integrated into ML and foundation models, offering solutions to enhance their reliability and interpretability. Finally, we discuss the challenges, limitations, and opportunities in adopting causal frameworks, paving the way for more accountable and ethically sound AI systems.

2511.09568 2026-06-08 physics.chem-ph cs.AI cs.CV 版本更新

VEDA: 3D Molecular Generation via Variance-Exploding Diffusion with Annealing

VEDA:通过带退火的方差爆炸扩散实现3D分子生成

Peining Zhang, Jinbo Bi, Minghu Song

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 VEDA结合退火变方差扩散与SE(3)等价架构,高效生成准确的3D分子结构,实现高化学精度与计算效率。

详情
AI中文摘要

扩散模型在3D分子生成中展现出潜力,但面临采样效率与构象准确性之间的根本权衡。基于流的模型虽然速度快,但难以捕捉分子构象的多模态分布,常产生几何不准确的结构。相比之下,去噪扩散模型更准确但采样缓慢,这一局限归因于扩散动力学与SE(3)等变架构之间的整合欠佳。为此,我们提出了VEDA,一个统一的SE(3)等变框架,结合方差爆炸(VE)扩散与退火,以高效生成构象准确的3D分子结构。关键技术贡献包括:(1) 一种VE调度,使噪声注入在功能上类似于模拟退火,提高3D准确性并降低松弛能量;(2) 一种新的预处理方案,协调SE(3)等变网络的坐标预测性质与基于残差的扩散目标;(3) 一种新的基于arcsin的调度器,将采样集中在对数信噪比的关键区间。在QM9和GEOM-DRUGS数据集上,VEDA的采样效率与基于流的模型相当,仅用100个采样步骤就实现了最先进的价态稳定性与有效性。更重要的是,以GFN2-xTB优化过程中的松弛能量衡量,VEDA生成的结构非常稳定:能量变化中位数仅为1.72 kcal/mol,显著低于其架构基线SemlaFlow的32.3 kcal/mol。我们的框架表明,有原则地整合VE扩散与SE(3)等变架构可以同时实现高化学精度和计算效率。

英文摘要

Diffusion models show promise for 3D molecular generation, but face a fundamental trade-off between sampling efficiency and conformational accuracy. While flow-based models are fast, they often produce geometrically inaccurate structures, as they have difficulty capturing the multimodal distributions of molecular conformations. In contrast, denoising diffusion models are more accurate but suffer from slow sampling, a limitation attributed to sub-optimal integration between diffusion dynamics and SE(3)-equivariant architectures. To address this, we propose VEDA, a unified SE(3)-equivariant framework that combines variance-exploding diffusion with annealing to efficiently generate conformationally accurate 3D molecular structures. Specifically, our key technical contributions include: (1) a VE schedule that enables noise injection functionally analogous to simulated annealing, improving 3D accuracy and reducing relaxation energy; (2) a novel preconditioning scheme that reconciles the coordinate-predicting nature of SE(3)-equivariant networks with a residual-based diffusion objective; and (3) a new arcsin-based scheduler that concentrates sampling in critical intervals of the logarithmic signal-to-noise ratio. On the QM9 and GEOM-DRUGS datasets, VEDA matches the sampling efficiency of flow-based models, achieving state-of-the-art valency stability and validity with only 100 sampling steps. More importantly, VEDA's generated structures are remarkably stable, as measured by their relaxation energy during GFN2-xTB optimization. The median energy change is only 1.72 kcal/mol, significantly lower than the 32.3 kcal/mol from its architectural baseline, SemlaFlow. Our framework demonstrates that principled integration of VE diffusion with SE(3)-equivariant architectures can achieve both high chemical accuracy and computational efficiency.

2511.03898 2026-06-08 cs.CR cs.AI cs.CE cs.SE 版本更新

Secure Code Generation at Scale with Reflexion

大规模安全代码生成中的反射

Arup Datta, Ahmed Aljohani, Hyunsook Do

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究评估了使用Instruct Prime和反射提示方法提升代码安全性的效果,发现反射提示能显著提高安全性能,尤其在第一轮提示中效果最明显。

Comments Accepted for publication at the 2nd IEEE International Conference on AI-powered Software (AIware 2025)

详情
AI中文摘要

大型语言模型(LLMs)现已广泛用于起草和重构代码,但能运行的代码不一定安全。我们使用Instruct Prime(其消除了要求合规的提示和线索污染)评估安全代码生成,并以零样本基线和三轮Reflexion提示方法评估五个指令微调的代码LLM。安全性通过不安全代码检测器(ICD)衡量,结果以修复(Repair)、回归(Regression)和净收益(NetGain)指标报告,并按编程语言和CWE家族细分。结果显示,不安全代码在第一轮仍然普遍:在零样本基线(t0)下,约25-33%的程序不安全。弱加密和依赖配置的缺陷最难避免,而XSS、代码注入和硬编码密钥等模板化缺陷的处理更为可靠。Python的安全率最高;C和C#最低,Java、JS、PHP和C++居中。Reflexion提示提升了所有模型的安全性,平均准确率从t0的70.74%提高到t3的79.43%,最大提升出现在第一轮,之后收益递减。修复、回归和净收益指标的趋势表明,一到两轮即可获得大部分收益。复现包可在https://doi.org/10.5281/zenodo.17065846获取。

英文摘要

Large language models (LLMs) are now widely used to draft and refactor code, but code that works is not necessarily secure. We evaluate secure code generation using the Instruct Prime, which eliminated compliance-required prompts and cue contamination, and assess five instruction-tuned code LLMs using a zero-shot baseline and a three-round reflexion prompting approach. Security is measured using the Insecure Code Detector (ICD), and results are reported using Repair, Regression, and NetGain metrics, broken down by programming language and CWE family. Our findings show that insecurity remains common at the first round: roughly 25-33% of programs are insecure at the zero-shot baseline (t0). Weak cryptography and config-dependent bugs are the hardest to avoid, while templated ones like XSS, code injection, and hard-coded secrets are handled more reliably. Python yields the highest secure rates; C and C# are the lowest, with Java, JS, PHP, and C++ in the middle. Reflexion prompting improves security for all models, raising average accuracy from 70.74% at t0 to 79.43% at t3, with the largest gains in the first round followed by diminishing returns. The trends in the Repair, Regression, and NetGain metrics show that applying one to two rounds produces most of the benefits. A replication package is available at https://doi.org/10.5281/zenodo.17065846.

2506.14634 2026-06-08 cs.CL cs.AI cs.CY 版本更新

AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

不是什么别的吗?利用大语言模型对德国开放式调查回答进行编码:调查动机

Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessica Daikeler

发表机构 * Social Data Science & AI Lab, LMU Munich(慕尼黑大学社会数据科学与人工智能实验室) Munich Center for Machine Learning(慕尼黑机器学习中心) University of Maryland, College Park(马里兰大学帕克分校) GESIS – Leibniz Institute for the Social Sciences(GESIS莱布尼茨社会科学研究所)

AI总结 本文探讨了使用大语言模型对开放式调查回答进行编码的有效性,通过德国调查参与原因的数据,比较了不同LLM和提示方法的性能,发现仅微调的LLM能获得满意预测效果,且分类性能差异影响类别分布。

Comments to appear in Survey Research Methods

详情
Journal ref
Survey Research Methods (2025)
AI中文摘要

近年来,大语言模型(LLM)的发展和广泛可及性引发了关于其在调查研究中应用的讨论,包括对开放式调查回答的分类。由于其语言能力,LLM可能成为耗时的手动编码和监督学习模型预训练的高效替代方案。由于现有研究大多集中在英语回答的非复杂主题或单一LLM上,尚不清楚其发现是否具有普遍性以及这些分类的质量如何与传统方法相比。本研究探讨了不同LLM在其他情境下对开放式调查回答进行编码的程度,以德国调查参与原因的数据为例。我们比较了几种最先进的LLM和提示方法,并通过人类专家编码评估LLM的性能。总体而言,LLM之间的性能差异很大,只有微调的LLM才能达到满意的预测性能。提示方法之间的性能差异取决于所用的LLM。最后,LLM在不同调查参与原因类别上的不均等分类性能导致了不同的类别分布,当不使用微调时。我们讨论了这些发现的含义,不仅对开放式回答编码的方法学研究,还对其实质分析,以及处理或实质性分析此类数据的实践者。最后,我们强调了研究人员在选择LLM时代开放式回答分类自动化方法时需要考虑的许多权衡。通过这样做,我们的研究为关于LLM在调查研究中高效、准确和可靠应用条件的日益增长的研究做出了贡献。

英文摘要

The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.

2502.08903 2026-06-08 cs.RO cs.AI 版本更新

3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning

面向机器人任务规划的3D grounded视觉-语言框架:自动化提示合成与监督推理

Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出融合2D提示合成模块和小语言模型的框架,提升机器人3D场景理解与任务执行能力,实验显示任务成功率高达96.0%。

详情
Journal ref
Engineering Applications of Artificial Intelligence, vol. 164, p. 113268, 2026
AI中文摘要

视觉-语言模型(VLMs)在场景理解和感知任务中取得了显著成功,使机器人能够在动态环境中自适应地规划和执行动作。然而,大多数多模态大语言模型缺乏稳健的3D场景定位能力,限制了其在精细机器人操作中的有效性。此外,低识别精度、低效、差的迁移性和可靠性等挑战阻碍了其在精密任务中的应用。为解决这些限制,我们提出了一种新的框架,该框架整合了一个2D提示合成模块,通过将2D图像映射到点云,以及一个小型语言模型(SLM)来监督VLM的输出。2D提示合成模块使训练于2D图像和文本的VLM能够自主提取精确的3D空间信息,无需人工干预,显著增强了3D场景理解。同时,SLM监督VLM的输出,缓解幻觉并确保可靠的可执行机器人控制代码生成。我们的框架消除了在新环境中重新训练的需要,从而提高了成本效率和操作鲁棒性。实验结果表明,所提出的框架实现了96.0%的任务成功率(TSR),优于其他方法。消融研究证明了2D提示合成模块和输出监督模块的关键作用(当移除时,TSR下降67%)。这些发现验证了框架在提升3D识别、任务规划和机器人任务执行方面的有效性。

英文摘要

Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. Additionally, challenges such as low recognition accuracy, inefficiency, poor transferability, and limited reliability hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module, which maps 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs. The 2D prompt synthesis module enables VLMs, trained on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D scene understanding. Meanwhile, the SLM supervises VLM outputs, mitigating hallucinations and ensuring reliable, executable robotic control code generation. Our framework eliminates the need for retraining in new environments, thereby improving cost efficiency and operational robustness. Experimental results show that the proposed framework achieved a 96.0% Task Success Rate (TSR), outperforming other methods. Ablation studies demonstrated the critical role of both the 2D prompt synthesis module and the output supervision module (which, when removed, caused a 67% TSR drop). These findings validate the framework's effectiveness in improving 3D recognition, task planning, and robotic task execution.

2501.11592 2026-06-08 cs.LG cs.AI cs.CL Version update

Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing

无需训练的超小模型用于压缩感知中的通用稀疏重建

Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Xiang Bai

Affiliations * School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China; China Belt and Road Joint Lab on Measurement and Control Technology, Wuhan, China; School of Electric and Electrical Engineering, Chongqing University of Technology, Chongqing, China; Optics Valley Laboratory, Wuhan, China; School of Water Conservancy and Transportation, Zhengzhou University, Zhengzhou, China; School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China

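AI summary This paper proposes CL, a training-free ultra-small neural model that achieves fast sparse reconstruction, inherits the generality and interpretability of traditional iterative methods, and improves efficiency and accuracy.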

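Details
AI abstract (translated from Chinese)

Pre-trained large models have attracted wide attention in recent years, but they face challenges in applications that demand high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed sensing (CS) is a proven theory that has driven many breakthroughs in these applications. However, as a typical under-determined linear system, CS takes too long to perform sparse reconstruction on large-scale data with traditional iterative methods. Current AI methods such as deep unfolding have failed to replace them, because pre-trained models generalize poorly beyond their training conditions and data distributions, or lack interpretability. This paper proposes ultra-small artificial neural models called coefficients learning (CL), which achieve training-free, fast sparse reconstruction while fully inheriting the generality and interpretability of traditional iterative methods and adding the new ability to incorporate prior knowledge. In CL, a signal of length n needs a minimum of only n trainable parameters. A case-study model called CLOMP is used for evaluation. Experiments on synthetic and real one-dimensional and two-dimensional signals show significant gains in efficiency and accuracy. Compared with representative iterative methods, CLOMP is 100 to 1000 times more efficient on large-scale data. Tests on eight different image datasets show that CLOMP improves the structural similarity index by 292%, 98%, and 45% at sampling rates of 0.1, 0.3, and 0.5, respectively. We believe this method can truly bring CS reconstruction into the AI era, benefiting the countless under-determined linear systems that rely on sparse solutions.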

English abstract

Pre-trained large models have attracted widespread attention in recent years, but they face challenges in applications that require high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed Sensing (CS) is a well-established theory that has driven many recent breakthroughs in these applications. However, as a typical under-determined linear system, CS suffers from excessively long sparse reconstruction times when using traditional iterative methods, particularly with large-scale data. Current AI methods like deep unfolding fail to replace them because pre-trained models generalize poorly beyond their training conditions and dataset distributions, or lack interpretability. Instead of following the big model fervor, this paper proposes ultra-small artificial neural models called coefficients learning (CL), enabling training-free and rapid sparse reconstruction while perfectly inheriting the generality and interpretability of traditional iterative methods, and adding the new ability to incorporate prior knowledge. In CL, a signal of length $n$ needs a minimum of only $n$ trainable parameters. A case study model called CLOMP is implemented for evaluation. Experiments are conducted on both synthetic and real one-dimensional and two-dimensional signals, demonstrating significant improvements in efficiency and accuracy. Compared to representative iterative methods, CLOMP improves efficiency by 100- to 1000-fold for large-scale data. Test results on eight diverse image datasets indicate that CLOMP improves the structural similarity index by 292%, 98%, and 45% for sampling rates of 0.1, 0.3, and 0.5, respectively. We believe this method can truly usher CS reconstruction into the AI era, benefiting countless under-determined linear systems that rely on sparse solutions.
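For readers unfamiliar with the "traditional iterative methods" the abstract refers to, the sketch below shows classical orthogonal matching pursuit (OMP) recovering a sparse signal from an under-determined system y = Ax. It is not the CLOMP model, whose details the abstract does not give. The sensing matrix (identity columns next to scaled Hadamard columns), the sizes, and the names `omp`, `solve_least_squares`, and `hadamard_column` are all assumptions chosen for illustration, picked so that exact recovery of a 2-sparse signal is guaranteed.

```python
def hadamard_column(j, m):
    """Column j of the m x m Sylvester-Hadamard matrix, scaled to unit norm."""
    scale = m ** -0.5
    return [scale * (-1) ** bin(i & j).count("1") for i in range(m)]


def solve_least_squares(cols, y):
    """Least-squares coefficients for the given columns via the normal equations."""
    k = len(cols)
    gram = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
            for i in range(k)]
    rhs = [sum(u * v for u, v in zip(cols[i], y)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(gram[r][i]))
        gram[i], gram[p] = gram[p], gram[i]
        rhs[i], rhs[p] = rhs[p], rhs[i]
        for r in range(i + 1, k):
            f = gram[r][i] / gram[i][i]
            for c in range(i, k):
                gram[r][c] -= f * gram[i][c]
            rhs[r] -= f * rhs[i]
    # Back substitution.
    coef = [0.0] * k
    for i in reversed(range(k)):
        s = rhs[i] - sum(gram[i][c] * coef[c] for c in range(i + 1, k))
        coef[i] = s / gram[i][i]
    return coef


def omp(a_cols, y, sparsity):
    """Classical OMP. a_cols holds the n columns of A, each of length m."""
    n = len(a_cols)
    residual = list(y)
    support, coef = [], []
    for _ in range(sparsity):
        # Greedy step: pick the unused column most correlated with the residual.
        def score(j):
            col = a_cols[j]
            norm = sum(v * v for v in col) ** 0.5
            return abs(sum(u * v for u, v in zip(col, residual))) / norm
        best = max((j for j in range(n) if j not in support), key=score)
        support.append(best)
        # Orthogonal step: refit all selected coefficients by least squares.
        cols = [a_cols[j] for j in support]
        coef = solve_least_squares(cols, y)
        residual = [y[r] - sum(c * col[r] for c, col in zip(coef, cols))
                    for r in range(len(y))]
    x = [0.0] * n
    for j, c in zip(support, coef):
        x[j] = c
    return x


# Under-determined system: m = 16 measurements, n = 32 unknowns.
m = 16
identity_cols = [[1.0 if i == j else 0.0 for i in range(m)] for j in range(m)]
a_cols = identity_cols + [hadamard_column(j, m) for j in range(m)]

x_true = [0.0] * 32
x_true[3], x_true[20] = 2.0, -1.5  # 2-sparse signal
y = [sum(a_cols[j][i] * x_true[j] for j in range(32)) for i in range(m)]

x_hat = omp(a_cols, y, sparsity=2)
```

Each OMP iteration solves a growing least-squares problem, which is the kind of per-iteration cost that becomes slow on large-scale data.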

2302.00198 2026-06-08 cs.NE cs.AI cs.NA math.NA Version update

A fuzzy adaptive metaheuristic algorithm for identifying sustainable, economical, lightweight, and earthquake-resistant reinforced concrete cantilever retaining walls

一种模糊自适应元启发式算法用于识别可持续、经济、轻质且抗震的钢筋混凝土悬臂挡土墙

Farshid Keivanian, Raymond Chiong, Ali R. Kashani, Amir H. Gandomi

Affiliations * School of Information and Physical Sciences, The University of Newcastle; Department of Civil Engineering, University of Memphis; Faculty of Engineering and Information Technology, University of Technology Sydney

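AI summary This paper proposes a fuzzy adaptive metaheuristic algorithm for optimizing the design of earthquake-resistant reinforced concrete cantilever retaining walls, accounting for structural strength, geotechnical stability, and geometric variables to obtain lightweight, economical, and environmentally friendly seismic designs.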

Comments There are six figures, 51 pages, and 12 tables in the revised manuscript that has recently been resubmitted to the Journal of Computational Science

Details
Journal ref
Journal of Computational Science, Volume 70, Article 101978, 2023
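AI abstract (translated from Chinese)

In earthquake-prone regions, the seismic performance of reinforced concrete cantilever (RCC) retaining walls is critical. This study uses horizontal and vertical pseudo-static coefficients to evaluate that performance. To handle the RCC weights and forces caused by the resulting earth pressures, each design is associated with 26 constraints on structural strength and geotechnical stability and with 12 geometric variables. These constraints and design variables form a constrained optimization problem over a twelve-dimensional solution space. To search effectively and produce sustainable, economical, lightweight RCC designs that withstand earthquake hazards, the paper proposes a novel adaptive fuzzy-based metaheuristic algorithm. The method divides the search space into sub-regions and builds exploration, information-sharing, and exploitation capabilities on its novel search components. Fuzzy inference systems are also used to address parameterization and computational cost evaluation. The study finds that, compared with several classical and best-performing design optimizers, the proposed algorithm achieves low-cost, low-weight, and low-CO2-emission RCC designs under nine seismic conditions.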

English abstract

In earthquake-prone zones, the seismic performance of reinforced concrete cantilever (RCC) retaining walls is critical. In this study, the seismic performance was investigated using horizontal and vertical pseudo-static coefficients. To account for the RCC weights and the forces resulting from the associated earth pressures, 26 constraints on structural strength and geotechnical stability, along with 12 geometric variables, are associated with each design. These constraints and design variables form a constrained optimization problem with a twelve-dimensional solution space. To search effectively and produce sustainable, economical, lightweight RCC designs that are robust against earthquake hazards, a novel adaptive fuzzy-based metaheuristic algorithm is applied. The proposed method divides the search space into sub-regions and establishes exploration, information sharing, and exploitation search capabilities based on its novel search components. Further, fuzzy inference systems were employed to address parameterization and computational cost evaluation issues. It was found that, compared with several classical and best-performing design optimizers, the proposed algorithm achieves low-cost, low-weight, and low-CO2-emission RCC designs under nine seismic conditions.