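arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.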

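2606.06514 2026-06-08 cs.AI cs.LG New submission

Detecting and Mitigating Bias by Treating Fairness as a Symmetry Operation

Nishit Singh

Affiliations: Nishit Singh

AI summary: Formalizes bias as a symmetry-breaking operation and restores symmetry through loss regularization, reducing violations by more than 90% on synthetic datasets at an accuracy cost of about 5%.

Comments: 8 pages, 7 figures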

2606.06518 2026-06-08 cs.AI cs.LG New submission

DiBS: Diffusion-Informed Branch Selection

Bo Liu, Yuan Xie, Yuan Gao, Xiaolong Luo, Peng Ye, Tao Chen, Fujun Han

Affiliations: Nanyang Technological University; University of Science and Technology of China; Tsinghua University

AI summary: Learned Sudoku solvers lack correctness guarantees, while symbolic solvers suffer from long-tail search. DiBS, a diffusion-informed branch selection method, keeps the symbolic solver complete and uses a diffusion model to rank candidate values, substantially reducing search cost.

Comments: 12 pages, 6 figures, 3 tables

Abstract

Sudoku is a representative constraint satisfaction problem that requires global structural reasoning under strict discrete constraints. Existing work on solving Sudoku mainly follows two dominant approaches: traditional heuristics and deep learning solvers. However, they suffer from complementary limitations: learning-based solvers lack hard correctness guarantees, while complete symbolic solvers are still prone to long-tail search. To address these shortcomings, we propose a novel diffusion model-guided approach, termed DiBS, for the branch selection search process. Specifically, DiBS keeps the symbolic solver complete and uses the diffusion model as a branch-ordering guide. The core method is ranking candidate values under the current partial assignment and a lightweight consistency signal. Furthermore, we provide an in-depth theoretical proof to reveal how it works and why it works. Experiments on the challenging Royle 17-clue Sudoku benchmark show that our DiBS substantially reduces search cost relative to strong heuristic baselines, especially in nodes, backtracks, and long-tail percentiles. Besides, these results confirm that learned global guidance is effective on hard instances where branch-order mistakes are most expensive. All code is available at https://github.com/shanxierdan/DiBS.
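The division of labour described in the abstract (a complete symbolic search that consults a learned model only to order candidate values) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `candidates`, `solve`, and `rank_values` are hypothetical names, the most-constrained-cell rule is an assumed heuristic, and any `rank_values` callable stands in for the diffusion-model scores. As long as `rank_values` returns a permutation of the candidates it is given, the search remains complete.

```python
from typing import Callable, List, Optional

Grid = List[List[int]]  # 9x9 grid, 0 marks an empty cell
Ranker = Callable[[Grid, int, int, List[int]], List[int]]


def candidates(grid: Grid, r: int, c: int) -> List[int]:
    """Values not already used in the row, column, or 3x3 box of (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [v for v in range(1, 10) if v not in used]


def solve(grid: Grid, rank_values: Ranker) -> Optional[Grid]:
    """Complete backtracking search; rank_values only reorders the branches."""
    empties = [(r, c) for r in range(9) for c in range(9) if grid[r][c] == 0]
    if not empties:
        return grid
    # Pick the most constrained cell first (assumed heuristic, not from the paper).
    r, c = min(empties, key=lambda rc: len(candidates(grid, *rc)))
    for v in rank_values(grid, r, c, candidates(grid, r, c)):
        grid[r][c] = v
        if solve(grid, rank_values) is not None:
            return grid
        grid[r][c] = 0  # undo the assignment and try the next branch
    return None


def ascending(grid: Grid, r: int, c: int, values: List[int]) -> List[int]:
    """Placeholder ranker: keeps the default order instead of model scores."""
    return values
```

Because the ranker never removes a candidate, a poor ranking can only cost extra backtracks; it cannot cause a solvable puzzle to be reported as unsolvable.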

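2606.06519 2026-06-08 cs.AI cs.LG New submission

SafeGene: Reusable Adapters for Transferable Safety Alignment

Yanghan Wang, Zhiqiang Kou, Fu Feng, Jing Wang, Xin Geng

Affiliations: Southeast University; The Hong Kong Polytechnic University

AI summary: Proposes SafeGene, a reusable safety-adapter module that extracts a safety representation from aligned-degraded model discrepancies and applies data-aware layer selection and few-shot coefficient recalibration to recover safety across tasks, lowering harmful response rates while maintaining downstream performance.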

Abstract

Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions. We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates. This representation is obtained from aligned-degraded model discrepancies, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot layer-wise coefficient recalibration. Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, outperforming representative safe adaptation methods in safety-utility trade-off.
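One way to read the pipeline in the abstract is as per-layer weight arithmetic: take the difference between an aligned model and its degraded counterpart, then add that difference to a task-adapted model with a separate coefficient per layer. The sketch below illustrates only this reading. The function names, the flattened-list weight format, and the coefficient values are assumptions; the abstract does not specify how layers are selected or how the coefficients are recalibrated.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # layer name -> flattened parameters


def safety_vectors(aligned: Weights, degraded: Weights) -> Weights:
    """Per-layer difference between an aligned model and its degraded copy."""
    return {
        name: [a - d for a, d in zip(aligned[name], degraded[name])]
        for name in aligned
    }


def apply_safety(task_model: Weights, vectors: Weights,
                 coeffs: Dict[str, float]) -> Weights:
    """Add each layer's safety vector, scaled by that layer's coefficient."""
    patched: Weights = {}
    for name, params in task_model.items():
        alpha = coeffs.get(name, 0.0)  # layers without a coefficient stay unchanged
        delta = vectors.get(name, [0.0] * len(params))
        patched[name] = [p + alpha * d for p, d in zip(params, delta)]
    return patched


# Toy weights (invented) for a two-layer model.
aligned = {"layer0": [1.0, 2.0], "layer1": [0.5, 0.5]}
degraded = {"layer0": [0.5, 2.0], "layer1": [0.5, 0.0]}
task_model = {"layer0": [0.0, 0.0], "layer1": [1.0, 1.0]}

vectors = safety_vectors(aligned, degraded)
patched = apply_safety(task_model, vectors, {"layer1": 2.0})
```

Leaving a layer out of `coeffs` plays the role of layer selection here: only `layer1` is modified in the toy example.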

2606.06526 2026-06-08 cs.AI cs.LG New submission

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

Sherin Muckatira, Jesse Geneson, Slava Gerovitch, Pavel Etingof, Mikhail Gronas, Anna Rumshisky

Affiliations: University of Massachusetts Lowell; San Jose State University; Massachusetts Institute of Technology; Dartmouth College; Amazon AGI

AI summary: Introduces the CrowdMath dataset of 164 expert-annotated progress chains for evaluating large language models on collaborative open-problem solving; models do well on local prediction but fall short on role classification.

Comments: 16 pages, 4 figures

Abstract

Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES-Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions, with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.
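The two headline numbers use different metrics, and macro-F1 is the stricter one: it averages per-class F1 with equal weight, so rarely used roles count as much as common ones. The sketch below is a plain-Python illustration on invented toy labels. The role names come from the abstract, while `macro_f1` and the data are assumptions and not the paper's evaluation code.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (invented): a classifier that always predicts the majority role.
y_true = ["partial progress"] * 8 + ["error identification"] * 2
y_pred = ["partial progress"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.8 while macro-F1 is about 0.44, because the minority role is never predicted and contributes an F1 of zero.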

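2606.06529 2026-06-08 cs.AI cs.LG New submission

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

Catherine Ge-Wang, Tyler Crosse, Benjamin Hadad, Joachim Schaeffer, Ram Potham, Tyler Tracy

Affiliations: GitHub; arXiv

AI summary: Studies how an attacker's strategic choice of when to attack affects AI control safety. Attack decisions are decomposed into a start policy and a stop policy; experiments show that both substantially reduce safety, so existing evaluations may overestimate it.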

Abstract

An attacker that strategically chooses when to attack is much harder to catch than one that attacks indiscriminately. AI control is a safety framework for deploying capable but untrusted AI agents under the oversight of a weaker, trusted monitor and a limited human audit budget. Control evaluations stress-test these protocols by pitting a red-team attack policy against the blue-team monitor, but current evaluations typically assume attackers that do not strategically select when to attack. We study this capability, attack selection, in agentic settings by decomposing attack decisions into a start policy, which decides when an attacker should attack, and a stop policy, which decides when an attacker should abort an ongoing attack. Across two agentic settings, BashArena and LinuxArena, both policies substantially lower measured empirical safety without changing the underlying attack capability. At a 1% audit budget, our start policy reduces safety by 20pp on both BashArena and LinuxArena, and our stop policy reduces safety by 20pp on BashArena and 28pp on LinuxArena. These reductions should be interpreted as upper bounds on the effect of attack selection. Existing control evaluations may therefore yield overly optimistic safety estimates against selective attackers. We recommend that future evaluations, system cards, and safety cases elicit attack selection to produce more realistic safety estimates.

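2606.06531 2026-06-08 cs.AI quant-ph New submission

CARVE-Q: Quantum-Proposed, Classically Certified Interactive Driving Repair

Yifan Wang

AI summary: For vetoed driving maneuvers, proposes the CARVE-Q architecture, which accelerates search over the repair lattice with quantum minimum finding while keeping safety certification classical, enabling auditable interactive repair.

Comments: 9 pages, 3 figures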

Abstract

The critical question after a correct driving veto is not only whether a maneuver is unsafe, but whether the blocked interaction admits a lawful, auditable, and responsibility-bounded repair. Prediction and game-theoretic planners can suggest plausible cooperation, yet they do not return a proof that the repair respects hard rules, right-of-way, cost allocation, and ego fallback. We introduce CARVE, Certified Affordable Repair of Vetoed maneuvers via Envelopes, a certificate architecture for prediction-free interactive repair. Given a vetoed maneuver, CARVE constructs a finite repair lattice and emits a structured certificate recording the binding rule, selected joint repair, right-of-way-scaled cooperation envelope, responsibility-weighted cost split, and ego-only fallback. This certificate view reveals the algorithmic bottleneck: multi-owner repair induces a product lattice $M = \prod_j |\mathcal{A}_j|$. We therefore introduce CARVE-Q, a verifier-shielded quantum-AI search layer that applies quantum minimum finding only to this black-box lattice while leaving all safety authority classical. In the conservative verifier-oracle model, exact classical minimum finding requires $\Theta(M)$ queries in the worst case, whereas Dürr-Høyer/Grover minimum finding uses $O(\sqrt{M})$ oracle queries with high probability. We prove verifier-shielded certificate soundness, priority non-elicitation, black-box query separation, and finite-precision reversible-oracle constructibility. We then demonstrate state-vector minimum finding on CARVE repair oracles up to 65,536 assignments and validate certificate preservation on Lanelet2-grounded INTERACTION replay with 100% right-of-way respect, 100% blame consistency, and zero priority false positives. The result is a trust-bounded quantum-AI pattern for certified autonomy: quantum proposes; CARVE certifies.
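The size of the product lattice and the classical query count are easy to make concrete. The sketch below counts oracle queries for an exhaustive classical search over $M = \prod_j |\mathcal{A}_j|$ joint assignments. The helper names and the toy cost function are hypothetical, and the factorisation of 65,536 into four owners with 16 actions each is an assumption chosen only to match the largest size quoted in the abstract.

```python
import math
from itertools import product
from typing import Callable, Sequence, Tuple


def lattice_size(action_counts: Sequence[int]) -> int:
    """M = product of the per-owner action-set sizes."""
    return math.prod(action_counts)


def classical_min(action_sets: Sequence[Sequence[int]],
                  cost: Callable[[Tuple[int, ...]], float]):
    """Exhaustive minimum: exactly one oracle query per joint assignment."""
    queries = 0
    best, best_cost = None, math.inf
    for assignment in product(*action_sets):
        queries += 1
        c = cost(assignment)
        if c < best_cost:
            best, best_cost = assignment, c
    return best, best_cost, queries


# Assumed factorisation: four owners, 16 candidate actions each.
M = lattice_size([16, 16, 16, 16])   # 65536 joint assignments
sqrt_M = math.isqrt(M)               # 256, the scale of the O(sqrt(M)) bound
```

The square root is only an order-of-magnitude figure: the $O(\sqrt{M})$ bound hides constant factors and holds with high probability rather than deterministically.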

2606.06533 2026-06-08 cs.AI cs.CL New submission

Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics

Stella Biderman, Mohammad Aflah Khan, Niloofar Mireshghallah, Catherine Arnett, Fazl Barez, Naomi Saphra

Affiliations: Carnegie Mellon University; Boston University; Harvard University; University of Oxford, Martian; Max Planck Institute for Software Systems

AI summary: Argues that a science of AI should go beyond post-hoc analysis and study training dynamics in order to predict, intervene in, and design model behavior, and reviews current progress and open problems in areas such as interpretability and fairness.

Comments: Accepted as an oral at ICML: https://icml.cc/virtual/2026/poster/67142

Abstract

What would it mean to have a scientific understanding of AI? Models are not static objects: they are snapshots of time-evolving processes shaped by data, objectives, architectures, and optimization dynamics. Yet much of AI research treats models as fixed artifacts, analyzing behaviors after training rather than asking why they emerge. This position paper argues that a science of AI must move beyond post-hoc fixes and study the training dynamics that produce model behavior. Such a science should support progressively stronger forms of understanding: predicting outcomes from early training signals, intervening when trajectories go wrong, and ultimately designing training procedures that more reliably produce desired properties. Scaling laws have made prediction routine for loss; the challenge is extending this success to capabilities, biases, robustness, and safety-relevant behaviors. We articulate requirements for such theories grounded in the history and philosophy of science, examine progress in mechanistic interpretability, fairness, memorization, and simplicity bias, and identify concrete open problems.

2606.06641 2026-06-08 cs.AI cs.LO New submission

Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver

Cody J Christopher, Charles Gretton

Affiliations: School of Computing, Australian National University

AI summary: Presents AFSAT, a GPU-accelerated pseudo-Boolean SAT solver based on continuous local search that uses the JAX compiler for massive parallelism and substantially improves numerical stability, runtime, and memory efficiency.

Abstract

We present Accelerated Fourier SAT (AFSAT), a GPU-accelerated solver for pseudo-Boolean satisfiability based on continuous local search (CLS). AFSAT realises the proof-of-concept approach, FastFourierSAT, into a fully-engineered solver supporting any heterogeneous mixture of symmetric constraint types and lengths within a single problem instance. Using the JAX compiler, AFSAT leverages pure function composition, automatic vectorisation, automatic differentiation, and just-in-time (JIT) compilation to perform massively parallel CLS across batches of candidate assignments. We demonstrate substantially improved numerical stability, runtime performance, and memory efficiency over the proof-of-concept. We achieve this by way of identifying and addressing various limitations that arise from memory latency and floating-point representation, as well as leveraging automatic parallelisation and compact representations. The inherent representational and stability limitations of floating point are partially addressed by a tailored discrete Fourier transform implementation. We achieve near-linear throughput when scaling to multiple accelerators via JAX array sharding.

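2606.06660 2026-06-08 cs.AI cs.PF cs.RO New submission

AEGIS: A Backup Reflex for Physical AI

Josef Chen

Affiliations: KAIKAKU

AI summary: Proposes AEGIS, which uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps and switches to a stronger policy only when needed; on LIBERO-Spatial it recovers 10.1% of the trajectories the weak policy loses.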

Abstract

Long-horizon robot manipulation tends to fail gradually: one bad step degrades the state, and the policy spirals into a basin from which it cannot recover. The failure is often visible before it happens. We introduce AEGIS (Activation-probe Early-warning, Gated Inference Switching), a selective escalation method that uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps while there is still time to act. When the probe flags a step, control switches to a stronger separate policy, but only for the steps that need it. On LIBERO-Spatial, AEGIS recovers 10.1% of the trajectories the weak policy alone loses, versus 4.6% for budget-matched blind escalation and 5.1% for a random-trigger placebo. These gains are significant under one-sided exact paired McNemar tests with Holm-Bonferroni adjustment over three pre-registered contrasts: +5.4pp over blind escalation, p=8.5e-6; +5.0pp over random triggering, p=1.0e-4; paired-trajectory bootstrap CIs exclude zero. AEGIS activates the stronger policy on only 38% of steps, so the lever is timing rather than compute. The probe clears its precondition with an early-window AUROC of 0.764, 95% CI [0.70, 0.84], read from the weak-policy path over the first 30% of trajectory steps before any handoff. We pre-register the full analysis plan, including a conditional recovered-task-rate estimand and explicit kill criteria, and confirm the result on 700 common-random-number episodes per arm, with nA-fail=646.
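The selective-escalation loop described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: `gated_rollout`, the callables it takes, and the fixed threshold are hypothetical, and the weak policy is assumed to return its activations alongside each action so the probe can score them.

```python
from typing import Any, Callable, List, Sequence, Tuple


def gated_rollout(
    observations: Sequence[Any],
    weak_policy: Callable[[Any], Tuple[Any, Any]],   # obs -> (action, activations)
    strong_policy: Callable[[Any], Any],             # obs -> action
    probe: Callable[[Any], float],                   # activations -> risk score
    threshold: float = 0.5,
) -> Tuple[List[Any], float]:
    """Run the weak policy and hand over single steps the probe flags as risky."""
    actions: List[Any] = []
    escalated = 0
    for obs in observations:
        action, activations = weak_policy(obs)
        if probe(activations) >= threshold:
            # Escalate this step only; the next step starts from the weak policy again.
            action = strong_policy(obs)
            escalated += 1
        actions.append(action)
    rate = escalated / len(observations) if observations else 0.0
    return actions, rate
```

The returned escalation rate corresponds to the share of steps handled by the stronger policy, the quantity the abstract reports as 38%.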

2606.06741 2026-06-08 cs.AI cs.CL cs.LG New submission

OpenSkill: Open-World Self-Evolution for LLM Agents

Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun

Affiliations: Lehigh University; University of Illinois Chicago; University of British Columbia; Vector Institute; Salesforce AI Research; Massachusetts General Hospital and Harvard Medical School

AI summary: Proposes OpenSkill, a framework in which an agent, without target-task supervision, bootstraps its skills and verification signals from open-world resources in order to self-evolve, achieving the best automated pass rate on several benchmarks.

Comments: 20 pages, 4 figures and 8 tables. Code is available at https://github.com/OpenLAIR/OpenSkill

Abstract

Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. In this work, we study open-world self-evolution, where an agent must build both its skills and its own verification signals from scratch, using open-world resources but no target-task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Analysis shows its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.

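2606.06787 2026-06-08 cs.AI New submission

Declarative Skills for AI Agents in Knowledge-Driven Tool-Use Workflows

M. Danish Lim, I. Danial Bin Sharudin, Wen Han Chen, Cedric Lim, Laura Wynter

Affiliations: School of Computing and Information Systems, Singapore Management University

AI summary: Proposes declarative agents, whose control flow is governed by natural-language skill files; they outperform an imperative state machine and an unscaffolded baseline in knowledge-intensive customer-service workflows, but retrieval quality is the main bottleneck.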

Abstract

We study orchestration mechanisms for tool-using AI agents in realistic customer-service workflows over an unstructured knowledge base. We argue that declarative agents (AI agents equipped with natural-language skill files appended to the system prompt) are an effective orchestration paradigm. Concretely, we compare (i) a DeclarativeAgent that reads three domain-specific skill files at inference time and decides its own control flow, (ii) an ImperativeAgent based on a programmatic state machine with explicit phases, and (iii) an unscaffolded baseline agent modeled after the $\tau$-Knowledge benchmark agent. Our ImperativeAgent is motivated by externalised-control inference as in Recursive Language Models and graph-based orchestration frameworks. We formalise the three agents as policy classes within a decentralised partially-observable Markov decision process and analyse their information-theoretic and structural properties; we then test the predicted differences empirically on five language models and two retrieval regimes. Our results show that retrieval quality is a dominant bottleneck for AI agents: when evidence is incomplete or skewed, all agents degrade substantially, and skill files cannot recover lost performance. Under high-quality retrieval, however, declarative skills consistently improve accuracy on procedural tasks and reduce orchestration errors, while the imperative state machine's brittleness does not reliably improve task success or compliance.

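2606.06941 2026-06-08 cs.AI New submission

Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces

Laura Wynter, Nirvik Sahoo, Paul Griffin

Affiliations: School of Computing and Information Systems, Singapore Management University

AI summary: Proposes EP-HUBO, which casts the selection of CoT reasoning fragments as a combinatorial optimization problem and aggregates evidence through higher-order binary optimization, giving more weight to minority-but-correct hypotheses on evidence-intensive legal reasoning benchmarks.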

Abstract

Large language models (LLMs) now solve a wide range of expert-level exams at or above human level, yet remain brittle on specialised, evidence-intensive domains such as law. On these tasks, errors arise not only from gaps in world knowledge but also from subtle distinctions between pieces of evidence and inconsistent use of supporting evidence. The most common aggregator over sampled chain-of-thought (CoT) traces, majority vote, returns the most popular answer regardless of whether its evidence is actually strongest. We propose to treat the selection of CoT reasoning fragments into a set of evidence as an explicit combinatorial optimisation problem, allowing well-supported but minority hypotheses to override noisy majorities, and to evaluate the approach on legal-reasoning benchmarks that are particularly sensitive to evidence quality. We introduce EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which generates multiple CoT traces with a small local model, parses fragments into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation per pool with quality-derived weights (relevance, specificity, distinctiveness), and delegates a single adjudication call per question to a frontier model. We evaluate EP-HUBO on two evidence-intensive legal benchmarks using both simulated annealing on classical hardware and the Dirac-3 photonic entropy-quantum machine from Quantum Computing Inc. HUBO-style optimisation gives a principled way to aggregate reasoning fragments while preserving minority-but-correct hypotheses, and is most valuable in low-contamination domains where frontier models have not already absorbed the benchmark material.
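A higher-order unconstrained binary optimisation assigns an energy to each 0/1 selection vector through weighted products of variables, where a term may involve more than two variables. The sketch below shows the general form with an exhaustive solver and a basic simulated-annealing loop. All weights are invented for illustration, the function names are hypothetical, and nothing here reproduces the paper's quality-derived weights or its hardware runs.

```python
import math
import random
from itertools import product
from typing import Dict, List, Tuple

Terms = Dict[Tuple[int, ...], float]  # tuple of variable indices -> weight


def energy(x, terms: Terms) -> float:
    """Sum of weights of the terms whose variables are all switched on."""
    return sum(w for idx, w in terms.items() if all(x[i] for i in idx))


def brute_force(n: int, terms: Terms) -> Tuple[int, ...]:
    """Exact minimiser by enumeration; only viable for small pools."""
    return min(product((0, 1), repeat=n), key=lambda x: energy(x, terms))


def anneal(n: int, terms: Terms, steps: int = 2000, t0: float = 1.0, seed: int = 0):
    """Single-bit-flip simulated annealing with a linear cooling schedule."""
    rng = random.Random(seed)
    x: List[int] = [rng.randint(0, 1) for _ in range(n)]
    e = energy(x, terms)
    best, best_e = tuple(x), e
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-9
        i = rng.randrange(n)
        x[i] ^= 1  # propose flipping one fragment in or out
        e_new = energy(x, terms)
        if e_new <= e or rng.random() < math.exp((e - e_new) / t):
            e = e_new
            if e < best_e:
                best, best_e = tuple(x), e
        else:
            x[i] ^= 1  # reject the move and restore the bit
    return best, best_e


# Invented weights: negative terms reward a fragment, positive terms penalise overlap.
terms: Terms = {(0,): -1.0, (1,): -0.8, (2,): -0.6,
                (0, 1): 0.9, (1, 2): -0.1, (0, 1, 2): 0.5}
exact = brute_force(3, terms)
```

With these toy weights the exact minimiser keeps fragments 0 and 2 and drops fragment 1, because the pairwise penalty on fragments 0 and 1 outweighs what fragment 1 adds.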

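2606.06972 2026-06-08 cs.AI New submission

Accounting for Context: Shaping Moral Credences for Value Alignment

Jazon Szabo, Sanjay Modgil

Affiliations: University of Oxford

AI summary: Addressing moral pluralism in value alignment, argues that contextual factors must be taken into account when aggregating moral evaluations, formalizes decision making under moral uncertainty, and shows that the resulting violation of the weak Pareto principle is a variant of Simpson's paradox.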

Abstract

Ensuring that agent behaviours are aligned with human moral values inevitably raises the problem of how to account for the plurality of moral perspectives that societies (and even individuals) typically adopt. Work on moral uncertainty proposes mechanisms to fairly and democratically aggregate evaluations of actions across different moral theories. However, this paper argues that one needs to account for contextual factors when aggregating moral evaluations. For example, consequentialist perspectives assume an ability to accurately determine how an agent's actions change the world, an assumption that often does not hold in real-world settings. We therefore formalise agent decision making under moral uncertainty, while also accounting for these kinds of contextual factors. We thereby show that a seemingly commonsensical property, the weak Pareto principle, is violated. We argue that this apparent problem is, in fact, a variation of Simpson's paradox, and hence reveals the limitations of aggregation mechanisms that ignore the impact of contextual factors.
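For readers unfamiliar with Simpson's paradox, the sketch below reproduces the standard statistical example (the widely cited kidney-stone treatment figures), in which one option is better within every stratum yet worse once the strata are pooled. It illustrates the generic phenomenon only. It is not the paper's formalisation, and reading the strata as "contexts" is an analogy assumed here.

```python
from fractions import Fraction

# (successes, trials) per option and context; classic kidney-stone figures.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(option: str, context: str) -> Fraction:
    """Success rate of one option within one context."""
    successes, trials = data[option][context]
    return Fraction(successes, trials)


def pooled(option: str) -> Fraction:
    """Success rate of one option after merging all contexts."""
    successes = sum(s for s, _ in data[option].values())
    trials = sum(n for _, n in data[option].values())
    return Fraction(successes, trials)


# A is preferred in every single context...
a_wins_each_context = all(rate("A", c) > rate("B", c) for c in ("small", "large"))
# ...yet B is preferred once context is ignored.
b_wins_pooled = pooled("B") > pooled("A")
```

The reversal arises because the two options are spread unevenly across contexts, which is the kind of contextual factor an aggregation rule discards when it pools evaluations.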

2606.06976 2026-06-08 cs.AI New submission

Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

Yijin Zhou, Linqian Zeng, Xiaoya Lu, Wenyuan Xie, Dongrui Liu, Junchi Yan, Jing Shao

Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute

AI summary: To address error accumulation in agentic tool calling, proposes TRUST, which incorporates uncertainty quantification into reward design as a repulsive force and uses lightweight key-turn annotations for unified post-training on multi-turn trajectories, substantially improving decision quality and agent performance.

Abstract

Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing approaches mainly improve these behaviors through inference-time correction or coarse-grained reward signals based on decision outcomes and structured checklists, leaving the uncertainty characteristics of agent decisions underexplored. We observe that decision-oriented reinforcement learning tends to weaken the uncertainty separation between correct and incorrect actions, resulting in overconfident mistakes and weaker exploration signals. Therefore, we propose TRUST, which incorporates uncertainty quantification into reward design as a repulsive force for maintaining uncertainty separation, and labels lightweight key-turn annotations for unified post-training of multi-turn trajectories. Experimental results across diverse tool-use benchmarks show that TRUST consistently enhances both decision quality and agent performance while maintaining more reliable uncertainty estimates during optimization.

2606.07000 2026-06-08 cs.AI New submission

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

Shizhe Xiang, Ke An, Wenlong Yu, Yue Liu, Jian Luan, Pei Fu, Qilong Wang

Affiliations: Tianjin University; Beijing Institute of Technology; Singapore Management University; University of Chinese Academy of Sciences; Xiaomi Inc

AI summary: Proposes the PTD-PO framework, which builds privileged hints to provide dense token-level supervision without exposing answers and stabilizes distillation with a Top-K JS divergence, substantially improving multimodal reasoning performance.

Abstract

Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision for failed rollouts, often leading to inefficient exploration in complex multimodal reasoning tasks. Although policy distillation can offer dense guidance, external-teacher-based methods introduce substantial computational overhead, while answer-conditioned tuning methods may expose answer-level information and induce shortcut-like generation behavior. To address these limitations, we propose PTD-PO, a Privileged Tutoring Distillation Policy Optimization framework for RLVR that provides dense guidance without exposing the answer to the student policy. Specifically, PTD-PO constructs structured privileged hints from spatial attention guidance and intermediate textual reasoning steps, and uses them through in-context learning to produce step-wise token-distribution supervision. The student is still optimized under the original answer-free context, and its failed rollouts are aligned with the hint-augmented reference model at the token-distribution level. To further stabilize distillation under the distribution shift between guided and unguided contexts, we introduce a Top-K Jensen-Shannon divergence objective that focuses alignment on informative token probabilities while reducing memory overhead. Experiments on LVLMs ranging from 2B to 8B parameters show that PTD-PO consistently outperforms RLVR and distillation baselines, mitigates entropy collapse, and improves complex multimodal reasoning performance.
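The abstract does not define the Top-K Jensen-Shannon objective, so the sketch below shows one plausible reading for a single token position: keep the K tokens the reference model rates most probable, renormalise both distributions over that subset, and compute the JS divergence there. The selection rule, the renormalisation, and the name `top_k_js` are all assumptions made for illustration.

```python
import math
from typing import Sequence


def top_k_js(student: Sequence[float], reference: Sequence[float], k: int) -> float:
    """JS divergence restricted to the reference's k most probable tokens."""
    top = sorted(range(len(reference)), key=lambda i: reference[i], reverse=True)[:k]
    p = [student[i] for i in top]
    q = [reference[i] for i in top]
    # Renormalise over the kept tokens (assumes both put some mass there).
    zp, zq = sum(p), sum(q)
    p = [v / zp for v in p]
    q = [v / zq for v in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with zero probability contribute nothing to the sum.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy next-token distributions over a four-token vocabulary (invented).
reference = [0.6, 0.1, 0.2, 0.1]
student = [0.1, 0.6, 0.2, 0.1]
divergence = top_k_js(student, reference, k=2)
```

Restricting the sum to K entries is what keeps memory use low, since only K probabilities per position need to be stored. The JS divergence is also bounded by ln 2, which is one reason it is often preferred over plain KL when the two contexts differ.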

2606.07017 2026-06-08 cs.AI cs.CL cs.ET 新提交

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

基础模型智能体的仿真到现实差距：统一MDP视角

Xiaoou Liu, Tiejin Chen, Weibo Li, Xiyang Hu, Hua Wei

发表机构 * Arizona State University（亚利桑那州立大学）

AI总结本文提出将基础模型智能体的评估与训练差距形式化为经典仿真到现实问题，围绕MDP四要素（观测、动作、转移、奖励）构建统一框架，并倡导采用域随机化等成熟解决方案。

Comments 7 pages, 2 figures, 2 tables. Accepted by KDD 2026 Blue Sky Ideas Track

DOI: 10.1145/3770855.3818660

AI中文摘要

基础模型智能体越来越多地被部署用于现实世界决策，但受到仿真到现实差距的影响。虽然机器人学和经典控制有成熟的框架来解决这一差距，但基础模型社区将智能体鲁棒性视为一个全新的现象。我们的论文提出将基础模型智能体评估和训练差距形式化为一个经典的仿真到现实问题，完全围绕马尔可夫决策过程的四个要素构建，包括观测、动作、转移和奖励。在本文中，我们设定了一个全面的研究议程，将经典差异转化为基础模型领域，并倡导采用域随机化等成熟解决方案。我们提供了具体示例，例如多语言工具调用，以展示尽管语义意图正确，但观测空间差距如何导致操作无效的动作。最终，这一议程旨在推动范式转变，产生统一的词汇和标准化的压力测试基准，以培养新一代高度可信的智能体，用于可靠的现实世界应用。

英文摘要

Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process: Observation, Action, Transition, and Reward. In this paper, we set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization. We provide concrete examples, such as multilingual tool calling, to demonstrate how severe observation-space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress-test benchmarks to foster a new generation of highly trustworthy agents for reliable real-world applications.

2606.07033 2026-06-08 cs.AI cs.CV 新提交

Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

层次化语义约束异构图用于音视频事件定位

Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Peng Cheng Laboratory（鹏城实验室）； Harbin Institute of Technology Suzhou Research Institute（哈尔滨工业大学苏州研究院）

AI总结提出层次化语义约束异构图框架，通过构建异构图、双向语义约束和双曲空间层次正则化，解决开放词汇音视频事件定位中跨尺度一致性和层次语义一致性问题。

AI中文摘要

开放词汇音视频事件定位（OV-AVEL）联合建模音视频线索，以识别并时间定位事件，包括训练中未见过的类别。现有方法主要在欧几里得空间中学习联合音视频表示，但仍面临两个重大挑战。首先，未见类别缺乏监督信号，难以在多个时间尺度上保持音视频一致性。其次，片段级与视频级语义之间缺乏层次约束，导致模型无法在不同层级间建立语义一致性。为解决这些挑战，我们提出一种层次化语义约束异构图（HSCHG）用于音视频事件定位框架。我们首先在欧几里得空间中构建一个异构层次图，包含音频和视觉片段节点及其对应的视频级节点。我们使用多方向时间边来捕获每个模态内的完整时间信息。同时，我们采用双阈值过滤门控融合策略，仅在对齐置信度高时引入跨模态信息。此外，我们在片段级和视频级表示之间引入双向语义约束，以实现不同层级间的语义一致性。基于此，我们将多级音视频表示和文本原型统一映射到双曲空间中。我们使用层次蕴含正则化损失来表征视频与片段之间的层次关系。大量实验结果表明，我们的方法在OV-AVEL基准上优于现有方法。消融研究进一步验证了我们方法的有效性。

英文摘要

Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic-constrained heterogeneous graph (HSCHG) framework for audio-visual event localization. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.

2606.07113 2026-06-08 cs.AI 新提交

通过局部披露进行具有策略性主体的离线策略评估

Kiet Q. H. Vo, Abbavaram Gowtham Reddy, Julian Rodemann, Siu Lun Chau, Krikamol Muandet

发表机构 * CISPA Helmholtz Center for Information Security（CISPA亥姆霍兹信息安全中心）； LMU Munich（慕尼黑大学）； Nanyang Technological University（南洋理工大学）

AI总结研究策略性行为下的离线策略评估，通过局部披露揭示主体策略前协变量，构建双重稳健估计器，缓解信息不对称。

AI中文摘要

我们研究了策略性行为下的离线策略评估（OPE），其中决策主体（或代理）通过策略性地修改其协变量来响应决策者的策略。这种行为导致了策略依赖的协变量偏移，打破了现有方法中协变量外生于策略的标准假设。相关工作通过施加强假设（如重复交互或完全了解代理的响应行为）来应对这一挑战，这极大地限制了其在OPE中的适用性。相比之下，我们考虑一次性OPE设置，其中决策者仅部分了解代理的响应行为。我们的关键见解是，通过事后解释披露局部信息，可以在适应之前揭示代理的策略前协变量，从而减轻策略行为引起的信息损失。利用这一结构，我们估计了代理响应的统计模型，并构建了策略值的双重稳健估计器。通过假设代理的成本敏感性服从条件对数正态分布，我们建立了所提估计量的一致性，并实证验证了我们的方法。更广泛地说，我们的结果强调了交互设计如何通过揭示代理策略响应中原本隐藏的结构来缓解信息不对称。

英文摘要

We study off-policy evaluation (OPE) under strategic behavior where decision subjects (or agents) respond to a decision maker's policy by strategically modifying their covariates. Such behavior induces a policy-dependent covariate shift, breaking the standard assumption in existing methods that covariates are exogenous to the policy. Related work addresses this challenge by imposing strong assumptions such as repeated interactions or full knowledge of agents' response behavior, substantially limiting its applicability to OPE. In contrast, we consider a one-shot OPE setting where the decision maker has only partial knowledge of the agents' response behavior. Our key insight is that disclosing local information through post-hoc explanations reveals agents' pre-strategic covariates prior to adaptation, mitigating the information loss induced by strategic behavior. Leveraging this structure, we estimate a statistical model for the agents' responses and construct a doubly robust estimator for policy value. By assuming that the agents' cost sensitivity follows a conditional log-normal distribution, we establish consistency of the proposed estimator and validate our approach empirically. More broadly, our results highlight how interaction design can mitigate information asymmetry by revealing otherwise hidden structure in agents' strategic responses.

2606.07392 2026-06-08 cs.AI cs.LG econ.EM stat.ML 新提交

Online Pandora's Box for Contextual LLM Cascading

面向上下文LLM级联的在线潘多拉魔盒

Alexandre Belloni, Yan Chen, Yehua Wei

发表机构 * The Fuqua School of Business, Duke University（杜克大学福克商学院）

AI总结针对LLM级联场景，提出在线上下文潘多拉魔盒模型，通过参数化保留索引和GMM估计结合UCB界，实现维度相关的√T累积遗憾。

AI中文摘要

受大型语言模型（LLM）级联的启发，我们提出了一种在线上下文潘多拉魔盒模型，用于自适应地查询和选择LLM API。在每个周期中，决策者观察一个请求上下文，并面临一个两阶段决策问题。在查询阶段，决策者顺序查询API，每次查询揭示一个生成的输出，并且决策者承担（输出相关的）成本。在选择阶段，决策者选择一个生成的输出进行部署，并仅观察部署输出的下游奖励。这种输出介导的反馈结构不同于经典的在线上下文潘多拉魔盒模型，后者打开盒子直接揭示其奖励。我们不估计每个API的完整条件输出和成本分布，而是直接建模保留索引，并为查询阶段开发一种学习方法。具体地，我们对由经典Weitzman策略诱导的上下文保留索引函数施加参数化结构。我们的策略将这些保留索引的广义矩方法（GMM）类型估计与这些索引以及共享输出级奖励评估器的UCB风格置信界相结合。在正则条件下，我们证明所得策略在T个周期的时间范围内实现了维度相关的$\widetilde O(\sqrt T)$累积遗憾。

英文摘要

Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost. In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output. This output-mediated feedback structure differs from classical online contextual Pandora's Box models, in which opening a box directly reveals its reward. Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase. Specifically, we impose a parametric structure on the contextual reservation index functions induced by the classical Weitzman's policy. Our policy combines generalized method of moments (GMM) type estimation of these reservation indices with UCB-style confidence bounds for both these indices and the shared output-level reward evaluator. Under regularity conditions, we prove that the resulting policy achieves dimension-dependent $\widetilde O(\sqrt T)$ cumulative regret over a horizon of $T$ periods.
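For orientation, the classical (non-contextual) Weitzman reservation index that the abstract builds on can be computed as below. For a box with random reward R and opening cost c, the index σ solves E[(R − σ)⁺] = c. This sketch assumes a known discrete reward distribution and uses bisection; the paper's parametric, context-dependent indices and GMM estimation are not reproduced here, and the function names are illustrative only.

```python
def expected_excess(values, probs, sigma):
    """E[(R - sigma)^+] for a discrete reward distribution."""
    return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))


def reservation_index(values, probs, cost, tol=1e-9):
    """Solve E[(R - sigma)^+] = cost by bisection (requires cost > 0)."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    # At lo the expected excess is at least cost; at hi it is 0.
    lo, hi = min(values) - cost, max(values)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_excess(values, probs, mid) > cost:
            lo = mid  # index is larger than mid
        else:
            hi = mid  # index is at most mid
    return (lo + hi) / 2
```

Under Weitzman's policy, boxes are opened in decreasing order of this index, and search stops once the best reward seen so far exceeds the largest index among unopened boxes.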

2606.07462 2026-06-08 cs.AI 新提交

自主异质催化剂发现：一种自进化多智能体数字孪生系统

Zhilong Song, Zongmin Zhang, Lixue Cheng

发表机构 * Department of Chemistry, Hong Kong University of Science and Technology（香港科技大学化学系）； IAS Center for AI for Scientific Discoveries, Hong Kong University of Science and Technology（香港科技大学人工智能科学发现中心）； Department of Computer Science and Engineering, Hong Kong University of Science and Technology（香港科技大学计算机科学与工程系）； Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology（香港科技大学化学与生物工程系）

AI总结提出CatDT（催化数字孪生），一种自进化多智能体系统，通过集成八种专业智能体和27种科学工具，在单个GPU上5-30分钟内自动构建工作催化剂数字孪生，实现从体相晶体和自然语言反应描述到稳定晶面预测、反应路径枚举、过渡态定位和动力学计算的全流程，在七个气固相基准上预测与实验偏差在0.5-2倍内，并独立发现丙烷脱氢非贵金属候选催化剂。

AI中文摘要

理论异质催化有望实现快速催化剂发现，然而计算和机器学习预测常常偏离实验，并局限于狭窄的材料家族，原因是缺乏忠实、条件感知的催化模拟器。我们提出CatDT（催化数字孪生），一种自进化多智能体系统，构建工作催化剂的自主数字孪生，统一了气固和液固建模。仅从体相晶体和自然语言反应描述出发，八个专业智能体和27种科学工具在单个GPU上5-30分钟内预测稳定晶面、重构工作表面、枚举和排序反应路径、定位过渡态并计算动力学。两项创新解决了最困难的步骤：UniMech通过融合智能体引导提议与能量缓存图搜索，以比穷举枚举低超过$10^3$倍的成本发现新型材料的主导路径；记忆增强的强化循环将600个催化表面的势垒计算成功率从41%提高到84%。在七个气固基准上——台阶金属、单原子催化剂、有序金属间化合物、富空位二维硫化物和碳化物，以及强金属-载体相互作用（SMSI）界面——每个CatDT预测在四个数量级内与实验偏差在0.5-2倍之间。对于丙烷脱氢，CatDT独立发现与Pt基工业基准相媲美的非贵金属候选催化剂，其中提出的Ni@ZrO$_2$ SMSI覆盖层在约100%选择性下达到$1.63~\text{s}^{-1}$的模拟TOF。更广泛地说，忠实催化剂数字孪生——或任何多阶段科学模拟器——的决定性因素不是原始LLM能力，而是围绕它的工程化框架：确定性工具、持久记忆和跨模型、工具和运行累积的已验证自我改进。

英文摘要

Theoretical heterogeneous catalysis promises rapid catalyst discovery, yet computational and machine-learning predictions often deviate from experiment and stay confined to narrow material families, for want of a faithful, condition-aware catalytic simulator. We present CatDT (Catalysis Digital Twin), a self-evolving multi-agent system that builds an autonomous digital twin of a working catalyst, unifying gas-solid and liquid-solid modeling. From only a bulk crystal and a natural-language reaction description, eight specialized agents and 27 scientific tools predict stable facets, reconstruct working surfaces, enumerate and rank reaction pathways, locate transition states, and compute kinetics in 5-30 min on a single GPU. Two innovations address the hardest steps: UniMech finds dominant pathways for novel materials at over $10^3\times$ lower cost than exhaustive enumeration by fusing agent-guided proposals with energy-cached graph search, and a memory-augmented reinforcement loop raises barrier-calculation success from 41\% to 84\% across 600 catalytic surfaces. Across seven gas-solid benchmarks -- stepped metals, single-atom catalysts, ordered intermetallics, vacancy-rich 2D sulfides and carbides, and a strong-metal--support-interaction (SMSI) interface -- every CatDT prediction lies within 0.5-2 times experiment over four orders of magnitude. For propane dehydrogenation, CatDT independently discovers non-precious candidates rivaling the Pt-based industrial benchmark, with a proposed Ni@ZrO$_2$ SMSI overlayer reaching a simulated TOF of $1.63~\text{s}^{-1}$ at $\sim$100\% selectivity. More broadly, the decisive factor for a faithful catalyst digital twin -- or any multi-stage scientific simulator -- is not raw LLM capability but the engineered harness around it: deterministic tools, persistent memory, and verified self-improvement that compound across models, tools, and runs.

2606.06464 2026-06-08 cs.CL cs.AI 交叉投稿

Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

人类成人与LLM作为科学家：谁从主动探索中受益？

Mandana Samiei, Eunice Yiu, Anthony GX-Chen, Dongyan Lin, Jocelyn Shen, Blake A. Richards, Alison Gopnik, Doina Precup

发表机构 * Mila - Quebec AI Institute（魁北克人工智能研究所）； McGill University（麦吉尔大学）； University of California Berkeley（加州大学伯克利分校）； New York University（纽约大学）； Meta FAIR ； MIT Media Lab（麻省理工学院媒体实验室）； Montreal Neurological Institute（蒙特利尔神经科学研究所）

AI总结本研究通过主动探索实验，发现主动探索能显著提升成人对合取因果规则的推理能力，但合取规则仍需更多测试；同时比较了大型语言模型的表现，发现部分模型在假设推断准确率上接近人类，但探索策略效率较低且存在类似的合取-析取性能差距。

Comments Accepted at the 48th Annual Conference of the Cognitive Science Society (CogSci 2026)

AI中文摘要

因果学习文献中的一个长期发现是，成人难以识别合取因果规则（即一个效应需要多个原因同时存在），而在析取情境中表现更好。然而，这种“合取缺陷”的大多数演示依赖于被动观察范式，证据有限，学习者无法控制证据生成。本文探讨当成人通过主动探索获得能动性时，这种偏见是否仍然存在。使用修改后的“blicket检测器”任务，成人参与者在合取或析取规则结构下自由干预以识别因果对象。我们表明，主动探索显著改善了成人的合取因果推理，尽管合取规则仍比析取规则需要更多测试来推断。我们进一步将人类表现与同一设置下的多种大型语言模型进行比较。虽然一些最先进的模型在假设推断准确率上接近人类水平，但它们通常表现出效率较低的探索策略以及类似的合取-析取性能差距。

英文摘要

A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this ``conjunctive handicap'' rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency through active exploration. Using a modified ``blicket detector'' task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures. We show that active exploration substantially improves adults' conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules. We further compare human performance to a range of large language models in the same setting. While some state-of-the-art models approach human-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive-disjunctive performance gaps.

2606.06505 2026-06-08 cs.CG cs.AI cs.CV math.DG 交叉投稿

A Geometric Gaussian Mixture Representation of Plane Curves

平面曲线的几何高斯混合表示

Ali Darijani, Benedikt Stratmann, Jürgen Beyerer

发表机构 * Fraunhofer IOSB（弗劳恩霍夫研究所）； KIT, IES（卡尔斯鲁厄理工学院，信息工程系）

AI总结提出一种用户定义的平面曲线概率多边形表示，通过为每个线段赋予法向不确定性参数，构造高斯混合模型，保留局部几何与法向不确定性，适用于多种曲线类型。

AI中文摘要

基于视觉基础模型的注意力一致纵向医学视觉问答

Jialin Wu, Qianru Zhang, Georges El Fakhri, Xiaofeng Liu

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）； Yale Biomedical Imaging Institute（耶鲁大学生物医学成像研究所）

AI总结提出一种注意力引导的编码器-解码器框架，通过轻量级配准和自适应掩码生成，结合辅助损失函数，实现胸部X光片的纵向医学视觉问答，在Medical-Diff-VQA基准上取得优异性能。

Comments Accepted to CVPR 2026 Workshop PHAROS-AIF-MIH

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 6448-6458

AI中文摘要

纵向医学视觉问答（VQA）需要推理当前时间点图像与参考时间点图像之间的解剖差异。我们针对胸部X光片提出了一种注意力引导的编码器-解码器。与传统的直接对比不同，我们引入了一个轻量级仿射配准模块，通过小配准正则化将当前图像与参考图像进行共配准，以减少无关运动。配准后的图像对输入图像编码器，随后通过冻结的DINO掩码生成器和可训练的自适应掩码生成器生成应用于原始图像对的掩码。掩码图像对再次输入图像编码器，并与文本特征拼接，作为基于多模态Transformer的解码器的输入以生成最终答案。为了促进学习稳定并澄清变化信号，受DINO-v3启发，我们加入了额外的辅助目标，包括掩码重建损失、成对Gram风格一致性损失和KoLeo均匀性损失，以增强表示的几何结构。在Medical-Diff-VQA基准上，该模型获得了强大的BLEU、ROUGE-L、CIDEr和METEOR分数，同时通过共享显著性掩码提供了内在的可解释性。这些结果支持将显著性条件生成与轻度预对齐作为医学VQA中纵向推理的原则性框架。我们的训练策略也展示了在生物医学中利用图像基础模型的范式潜力：同时优化监督和无监督学习目标。

英文摘要

Longitudinal medical visual question answering (VQA) requires reasoning about anatomical differences between an image of a current time point and an image of a referred time point. We propose an attention-guided encoder-decoder for this task with chest X-rays. Instead of conventional direct contrast, we propose to include a lightweight affine registration module to reduce nuisance motion by co-registering the current image to the reference image with a small registration regularizer. The registered image pair is fed into the image encoder, followed by a frozen DINO-based mask generator and a trainable adaptive mask generator to produce masks applied to the original image pairs. The masked image pairs are again fed into the image encoder and concatenated with text features as the input to a multimodal transformer-based decoder to generate final answers. To facilitate learning stabilization and clarify the change signal, inspired by DINO-v3, we include additional auxiliary objectives, including a mask rebuilding loss, a pairwise Gram-style consistency loss, and a KoLeo uniformity loss, which enhances the geometry of the representation. On the Medical-Diff-VQA benchmark, the model delivers strong BLEU, ROUGE-L, CIDEr, and METEOR scores while offering intrinsic interpretability through the shared saliency mask. These results support saliency-conditioned generation with mild pre-alignment as a principled framework for longitudinal reasoning in medical VQA. Our training strategy also illustrates the potential of a paradigm in utilizing image foundation models in biomedicine: optimizing both supervised and unsupervised learning objectives simultaneously.

2606.06536 2026-06-08 cs.CV cs.AI cs.LG 交叉投稿

Attention-Guided Autoencoder Fusion for Insulator Defect Detection Using UAV Transmission-Line Imaging

基于注意力引导自编码器融合的无人机输电线路绝缘子缺陷检测

Malak Allam, Khaled Shaban, Ali Hamdi

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出AE-YOLO框架，通过注意力引导自编码器融合与方差最大化正则化，解决无人机图像中绝缘子缺陷检测的类别不平衡和尺度变化问题，在mAP@0.5上达95.10%，优于YOLO基线5个百分点。

AI中文摘要

高压输电线路绝缘子的自动缺陷检测仍然具有挑战性，原因在于无人机（UAV）图像中严重的类别不平衡、尺度变化大以及缺陷实例的空间范围小。为了解决这些问题，本文提出了AE-YOLO，一种注意力引导的自编码器增强型YOLO框架，用于鲁棒的绝缘子缺陷检测。该架构在特征金字塔网络-路径聚合网络（FPN-PAN）颈部集成了轻量级瓶颈自编码器，在多尺度特征融合过程中保留了异常敏感信息。整个骨干网络使用卷积块注意力模块（CBAM），增强了特征辨别能力并抑制了背景干扰。该框架还引入了一种方差最大化的自编码器正则化策略，鼓励生成多样化、缺陷判别性的潜在表示。网络使用统一目标进行训练，该目标结合了焦点损失、完全IoU（CIoU）损失和自编码器正则化，以解决前景-背景不平衡问题并提高定位精度。在推理过程中，加权框融合（WBF）结合了YOLOv8、YOLOv10和YOLO11的预测结果。自编码器引导的置信度提升机制提高了对罕见缺陷类别的敏感性。在绝缘子缺陷检测数据集上的实验表明，采用EfficientNetV2骨干网络的AE-YOLO在mAP@0.5上达到95.10%，精度为96.40%，召回率为93.80%。这一性能在mAP@0.5上比最强的YOLO系列基线高出5.0个百分点，在召回率上高出6.7个百分点。这些结果证实了该框架的有效性和适应性。该模型是基于UAV的输电线路巡检和缺陷监测的实用且可扩展的解决方案。

英文摘要

Automated defect detection in high-voltage transmission-line insulators remains challenging due to severe class imbalance, large scale variation, and the small spatial extent of defect instances in Unmanned Aerial Vehicle (UAV) imagery. To address these challenges, this paper proposes AE-YOLO, an Attention-Guided AutoEncoder-Enhanced YOLO framework for robust insulator defect detection. The architecture integrates lightweight bottleneck autoencoders within a Feature Pyramid Network-Path Aggregation Network (FPN-PAN) neck. This preserves anomaly-sensitive information during multi-scale feature fusion. Convolutional Block Attention Modules (CBAM) are used throughout the backbone, enhancing feature discrimination and suppressing background interference. The framework also introduces a variance-maximizing autoencoder regularization strategy, which encourages diverse, defect-discriminative latent representations. The network trains using a unified objective that combines focal loss, Complete IoU (CIoU) loss, and autoencoder regularization to address foreground-background imbalance and improve localization accuracy. During inference, Weighted Boxes Fusion (WBF) combines predictions from YOLOv8, YOLOv10, and YOLO11. An autoencoder-guided confidence boosting mechanism improves sensitivity to rare defect categories. Experiments on the Insulator-Defect Detection dataset show that AE-YOLO with an EfficientNetV2 backbone achieves 95.10 percent mAP at 0.5, 96.40 percent precision, and 93.80 percent recall. This performance surpasses the strongest YOLO-family baseline by 5.0 points in mAP at 0.5 and 6.7 points in recall. These results confirm the effectiveness and adaptability of the framework. The model is a practical and scalable solution for UAV-based transmission-line inspection and defect monitoring.
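The focal-loss term in the unified objective addresses foreground-background imbalance by down-weighting easy examples. A minimal per-example sketch of the standard binary focal loss is shown below; the defaults gamma = 2.0 and alpha = 0.25 are the commonly used values from the original focal-loss formulation and are an assumption here, since the abstract does not state the settings used for AE-YOLO.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, y: label in {0, 1}.
    gamma and alpha are illustrative defaults, not values from the paper.
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 and alpha = 0.5 this reduces to half the usual cross-entropy, which is a convenient sanity check; larger gamma shrinks the loss of well-classified examples so that rare, hard defect instances dominate the gradient.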

2606.06539 2026-06-08 cs.CV cs.AI cs.LG cs.NE 交叉投稿

Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

合成基准高估了前向-前向扩展：真实数据对逐层训练的限制

Yucheng Chen

发表机构 * Amplimit

AI总结通过DTG-FF方法在真实数据上评估前向-前向学习的扩展性，发现其与反向传播的差距随类别数增加而扩大，合成任务高估了其迁移能力，且内存优势不成立。

Comments 23 pages, 6 figures

AI中文摘要

前向-前向（FF）学习[Hinton, 2022]用严格的逐层良好性更新取代了反向传播。最近的FF-CNN工作在32x32基准上缩小了与BP的差距，引发了逐层训练是否在现实规模下成为可行替代方案的问题。为了严格探究这一点，我们开发了DTG-FF——动态温度良好性、解耦归一化和多层融合——作为在九个真实数据基准上设定FF系列最先进水平的工具（CIFAR-10上91.8%，以及ImageNet-100 224x224上的首个FF基线），并用它来审计逐层训练实际能扩展到何种程度。（1）真实数据扩展。在相同配方和主干下，架构匹配的BP-DeepSup基线在CIFAR-10/CIFAR-100上分别超过DTG-FF 2.40/5.93个百分点，且差距随类别数增加而扩大。在224x224分辨率下，同一工具仅达到49.4%——这是该尺度下的首个FF基线，而典型BP超过75%[Tian et al., 2020]——暴露了在32x32下不可见的真实数据上限。（2）合成与真实K冲突。在合成教师-学生任务中，随着类别数K增长，DTG-FF越来越优于BP；而在真实图像上，FF-BP差距符号反转并随K扩大。数据集内CIFAR-100粗粒度与细粒度探针将标签层次与图像分布分离：合成K扫描将输出维度与细粒度判别难度混淆，从而高估了FF的可迁移性。（3）系统审计。FF可以在不存储深度激活的情况下实现，但在普通8 GB硬件上，标准BP+梯度累积达到4.18 GB / 157 imgs/s，而DTG-FF为7.90 GB / 138 imgs/s，因此在公平基线支持下，基于内存的理由在此规模下不成立。

英文摘要

Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP on 32x32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF -- dynamic temperature goodness, decoupled normalization, and multi-layer fusion -- as an instrument that sets FF-family state of the art across nine real-data benchmarks (91.8% CIFAR-10 and the first FF baseline at ImageNet-100 224x224), and use it to audit how far layer-local training actually scales. (1) Real-data scaling. Under identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224x224 the same instrument reaches only 49.4% -- the first FF baseline at this scale, versus typical BP above 75% [Tian et al., 2020] -- exposing a real-data ceiling invisible at 32x32. (2) Synthetic vs. real K-conflict. DTG-FF increasingly outperforms BP as class count K grows on synthetic teacher-student tasks, yet on real images the FF-BP gap reverses sign and widens with K. A within-dataset CIFAR-100 coarse vs. fine probe isolates label-hierarchy from image distribution: synthetic K-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability. (3) Systems audit. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF's 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported under fair baselines.
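As background for the layer-local goodness updates discussed above, the basic Forward-Forward objective from Hinton (2022) can be sketched for a single layer: goodness is the sum of squared activations, and the layer is trained so that goodness exceeds a threshold on positive data and falls below it on negative data. This is the plain FF loss only; the dynamic-temperature goodness, decoupled normalization, and multi-layer fusion of DTG-FF are not described in enough detail in the abstract to reproduce, and the names below are illustrative.

```python
import math


def goodness(activations):
    """Layer goodness: sum of squared activations."""
    return sum(a * a for a in activations)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ff_layer_loss(activations, is_positive, theta):
    """-log sigmoid(g - theta) for positive samples, -log sigmoid(theta - g) for negative ones."""
    g = goodness(activations)
    margin = g - theta if is_positive else theta - g
    return softplus(-margin)
```

Because the loss depends only on the activations of one layer, its gradient can be computed without propagating errors through the rest of the network, which is what makes the update layer-local.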

2606.06543 2026-06-08 quant-ph cs.AI 交叉投稿

几何二阶特征相关性学习用于自监督语音情感识别

Shuanglin Li, Ruxiao Qian, Siyang Song

发表机构 * Xiangjiang Laboratory（湘江实验室）； University of Exeter（埃克塞特大学）

AI总结针对自监督语音情感识别中一阶聚合忽略特征相关性和黎曼几何的问题，提出二阶相关层，通过协方差描述子捕获协同共现模式，并利用对数欧几里得映射保持几何完整性，在ESD和RAVDESS数据集上有效恢复判别信息。

生成模型通过市场选择侵蚀人类时间学习

Wenjun Cao

AI总结本文论证现代生成模型在亚AGI能力水平上通过市场选择机制侵蚀人类时间学习，提出价值崩溃路径并用昂贵检验框架形式化，跨领域证据显示验证侵蚀四阶段。

Comments Accepted at ICML 2026

Journal ref: Forty-third International Conference on Machine Learning Position Paper Track (2026)

AI中文摘要

我们认为，现代生成模型在当前亚AGI能力水平上对知识和文化生产造成了结构性风险。我们将人类时间学习（HTL）定义为通过长期持续参与问题而形成的路径依赖的知识积累。生成输出在表面特征上越来越像HTL密集型工作，因此验证给定输出是否反映真正的人类学习的成本相对于其预期收益变得高昂。一旦验证失去经济合理性，评估者就会奖励输出而不论其生产模式，而投入多年学习的生产者则在与几乎零成本生成的输出的价格竞争中处于劣势。我们将这一路径称为价值崩溃，并通过一个昂贵检验框架将其形式化。来自学术出版、法律实践、内容平台和软件安全的跨领域证据映射出验证侵蚀的四个阶段。对齐成功是正交的。更好的对齐模型缩小了人类与AI输出之间的可观察差距，使得来源验证更加困难，并加剧了对HTL密集型工作的竞争压力，即使单个AI输出有所改进。

英文摘要

We argue that modern generative models create structural risks for knowledge and cultural production at current, sub-AGI capability levels. We define Human Temporal Learning (HTL) as path-dependent knowledge accumulation through sustained engagement with problems over time. Generative outputs increasingly resemble HTL-intensive work in surface features, so verifying whether a given output reflects genuine human learning grows costly relative to its expected benefit. Once verification loses economic justification, evaluators reward outputs regardless of production mode, and producers who invested years of learning compete on price against outputs that cost almost nothing to generate. We call this pathway value collapse and formalize it through a costly-inspection framework. Cross-domain evidence from academic publishing, legal practice, content platforms, and software security maps onto four stages of verification erosion. Alignment success is orthogonal. Better-aligned models narrow observable gaps between human and AI outputs, making source verification harder and intensifying competitive pressure against HTL-intensive work even when individual AI outputs improve.

2606.06601 2026-06-08 cs.CV cs.AI cs.LG 交叉投稿

Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

通过分解视觉代理实现直接3D感知物体插入

Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, Chen Change Loy

发表机构 * Google（谷歌）； Black Forest Labs（黑森林实验室）

AI总结提出DIRECT框架，通过分解外观、几何和上下文引导，实现可控制3D姿态的物体插入，在几何可控性和视觉质量上优于现有方法。

Comments ICML 2026; Project Page: https://gong1130.github.io/DIRECT/

AI中文摘要

物体插入旨在将参考对象无缝合成到背景图像的指定区域。最近的基于扩散的方法实现了高视觉质量，但将插入视为简单的2D修复任务，无法显式控制对象的3D姿态，限制了其实用性。我们提出DIRECT（用于参考组合和目标集成的分解注入），一种新颖框架，将交互式姿态操作与高保真2D图像合成相结合，实现姿态可控的物体插入。我们的方法将插入条件分解为三个互补组件：从参考对象捕获视觉细节的外观引导、从用户调整的3D代理派生的几何引导以及来自目标背景的上下文引导。通过将它们注入到不同路径，DIRECT避免了特征纠缠，同时保留了参考外观、遵循用户指定的姿态并使对象适应目标场景。我们还引入了一个自动数据构建流程，以提高训练数据的多样性和质量。实验表明，DIRECT在几何可控性和视觉质量方面均优于先前方法。

英文摘要

Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object's 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.

2606.06614 2026-06-08 cs.CL cs.AI cs.HC 交叉投稿

在日常生活人类视频上协同训练机器人操作策略时什么因素重要？

Richard Li, Aditya Prakash, Andrew Wen, Saurabh Gupta, Yilun Du, Pulkit Agrawal

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Harvard University（哈佛大学）

AI总结研究利用日常互联网视频协同训练机器人操作策略时，手部姿态质量和运动差距对迁移的影响，提出一种协同训练方法，在低机器人数据场景下六个操作任务中绝对成功率提升29.7%。

Comments The project website is here: https://richardrl.github.io/what-matters-cotraining-human-videos/index.html

2606.06635 2026-06-08 cs.CL cs.AI 交叉投稿

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

语言模型如何失败：承诺性和持续性推理错误的令牌级特征

Tanvi Thoria, Kiana Jafari, Marc R. Schlichting, Mykel J. Kochenderfer

发表机构 * Department of Computer Science, Stanford University（计算机科学系，斯坦福大学）； Department of Aeronautics and Astronautics, Stanford University（航空航天工程系，斯坦福大学）

AI总结通过令牌级不确定性信号，将语言模型推理失败分为承诺性失败（早期锁定错误路径）和持续性不确定性（不确定性持续累积），并在23个模型-数据集配置中验证了可预测性，为自我一致性策略提供了指导。

AI中文摘要

语言模型推理中的失败通过不同的过程产生，这些过程在推理轨迹中留下可识别的特征。我们使用令牌级不确定性信号来表征这些失败，发现它们通过两个经验上可区分的过程出现。第一个是承诺性失败，其中模型在其轨迹早期锁定到错误的推理路径。一个核心诊断特征是承诺点，超过该点考虑额外的令牌会损害而不是帮助失败检测。在第二个过程中，持续性不确定性，不确定性反而在整个过程中累积，并且需要完整的轨迹来最好地区分失败和成功的完成。这些特征在23个模型-数据集配置中重现，该框架的可证伪预测在23个案例中的20个中成立，远高于两种失败模式下的随机水平。最后，我们展示了我们的失败模式框架对自我一致性有直接影响，识别了不确定性信号何时补充它以及何时可以选择性地跳过它。这些结果为理解何时LLM推理失败变得可检测以及相应调整检测策略提供了基础。

英文摘要

Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework's falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate that our failure-mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.
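The abstract does not specify which token-level uncertainty signals are used. As a purely illustrative sketch, one common choice is the entropy of each next-token distribution, summarized over growing prefixes of the trace so that a detector can be evaluated on the first t tokens only. Both the choice of entropy and the prefix-mean summary are assumptions, not the authors' method.

```python
import math


def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def prefix_mean_entropy(dists):
    """Mean token entropy over each prefix of the trace (hypothetical summary signal)."""
    out, total = [], 0.0
    for t, dist in enumerate(dists, start=1):
        total += token_entropy(dist)
        out.append(total / t)
    return out
```

Comparing how well such a prefix signal separates failed from successful completions at different cut-offs is one way the notion of a commitment point could be probed: for committed failures the best cut-off would come early, while for persistent uncertainty it would be the full trace.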

2606.06664 2026-06-08 cs.CV cs.AI cs.LG 交叉投稿

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

内在视觉：神经科学启发的概念电路用于解释和引导视觉变换器

Tang Li, Yanlin Chen, Mengmeng Ma, Xi Peng

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出ViSAE工具箱，通过神经科学启发的概念电路解释视觉变换器内部机制，包含高效概念集、自动电路追踪算法和概念编辑应用，在WaterBirds上最差组准确率提升48.2%。

Comments In Proceedings of the International Conference on Machine Learning, 2026. (acceptance rate 26.6%)

AI中文摘要

尽管视觉变换器（ViT）具有高准确率，但其预测可能受到虚假线索的驱动，因此在安全部署前需要理解其内部工作机制。稀疏自编码器（SAE）为将模型表示分解为人类可解释的概念提供了有前景的视角，但由于对概念覆盖范围的控制有限以及特征解释的主观性和不可扩展性，将基于SAE的解释方法应用于ViT仍然具有挑战性。为填补这些空白，受神经科学启发原理的驱动，我们提出了ViSAE，一个通过概念电路理解ViT内部工作机制的机制可解释性工具箱。ViSAE包含三个组成部分：（1）一个包含64K图像和16K视觉基础概念词汇的探测套件，与ImageNet相比，概念覆盖效率提高了20倍，与现有概念集相比，解释准确率提高了28.7%。（2）自上而下的概念读取和自下而上的电路追踪算法，通过概念电路自动恢复ViT内部工作机制。（3）用于审计和引导ViT行为的应用。通过概念编辑，ViSAE在WaterBirds上将最差组准确率提高了48.2%，比现有方法高出23.8%。我们的数据和代码：https://github.com/deep-real/ViSAE。

英文摘要

Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: https://github.com/deep-real/ViSAE.

2606.06679 2026-06-08 cs.CL cs.AI cs.CY 交叉投稿

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

HKJudge：用于解释法院认定事实、推理过程和裁决结果的法律话语标注语料库

Xi Xuan, Wenxin Zhang, Yufei Zhou, King-kui Sin, Chunyu Kit

发表机构 * City University of Hong Kong（香港城市大学）； University of Chinese Academy of Sciences（中国科学院大学）

AI总结提出首个句子级专家标注的法律话语数据集HKJudge，包含香港各级法院刑事判决，设计双层话语模式（26种修辞角色和3种判刑要素），并基于BERT和LLM进行基准评估。

AI中文摘要

法院判决是法律实践和法理学的核心，然而香港判决的话语分析由于缺乏专家标注语料库而受到限制。我们引入了香港判决话语数据集（HKJudge），这是首个句子级专家标注的法律话语语料库。HKJudge包含香港法院层级所有五个级别的刑事判决，共计约29万句子和650万词元，由法律语言学专家完全标注。我们设计了一个双层话语模式，捕捉法院认定的事实、推理过程以及裁决结果。在句子层面，每个句子被分配26种修辞角色之一。在跨度层面，句子进一步标注了三个判刑要素（指控、监禁刑期、罚款）。十位法律语言学标注者进行了标注，标注者间一致性为κ=0.8。我们在HKJudge上定义了两个任务，称为修辞角色分类和法律要素提取，并提供了四种基于BERT的模型、两种开源LLM（在零样本和微调设置下）以及四种商业LLM在这两个任务上的首次基准评估。我们的工作展示了句子级话语标注对于建模香港判决结构的价值，并为未来法律判决预测研究提供了丰富的数据基础。HKJudge数据集和代码可在以下网址获取：https://github.com/xuanxixi/HKJudge。

英文摘要

Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK's court hierarchy, comprising $\sim$290k sentences and $\sim$6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of $κ= 0.8$. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at https://github.com/xuanxixi/HKJudge.
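The abstract reports an inter-annotator agreement of κ = 0.8 without stating which kappa variant was used. As a hedged illustration of what such a figure measures, the sketch below computes Cohen's kappa for two annotators; with ten annotators a multi-rater statistic such as Fleiss' kappa may have been used instead, so treat the choice of Cohen's kappa and the function name as assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few of the 26 rhetorical roles dominate the corpus.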

2606.06694 2026-06-08 cs.LG cs.AI cs.CY 交叉投稿

The Geography of Algorithmic Judgment: LLM Intermediaries, Place Identity, and Racial Steering in Housing Search

算法判断的地理：LLM中介、地方身份与住房搜索中的种族引导

Hana Samad, Trung Lam, Christoph Mügge-Durum, Michael Akinwumi

发表机构 * National Fair Housing Institute（国家公平住房研究所）

AI总结通过行为审计七种LLM在四个美国城市的住房推荐，发现种族引导是模型解释性许可的涌现行为，而非静态属性，且城市并非中性测试单元。

Comments 13 pages with supplemental tables and figures, AIES '26 Submission

AI中文摘要

大型语言模型（LLM）正迅速在住房搜索中扮演中介角色，通过将列表平台集成到对话界面中，调解城市环境中的信息访问、搜索和推荐。我们扩展了先前关于LLM中种族引导的研究，对七个开放权重和闭源LLM在四个美国城市进行了行为审计，测试了三种迭代提示条件下的位置推荐，这些条件逐步添加生活方式偏好背景并反映公平住房配对测试方法。我们发现，引导是模型解释性许可的涌现行为，而非主要静态属性。引导源于用户身份、偏好表达以及模型内化的关于特定城市中地点、偏好和机会的学习表征的空间逻辑之间的相互作用，以及不同类型用户如何与之关联。虽然存在引导，但在评估条件下其方向和幅度并不一致。偏好条件测试通常增加或重新配置了相对于基线条件表现出引导行为的模型数量，表明LLM可能根据用户的种族身份对相同的住房偏好做出不同解释。我们的发现还表明，在基于地点的领域进行LLM评估时，城市并非中性测试单元，一个本地市场的结果不能假定推广到另一个市场。在住房领域，需要本地和领域专业知识，以确保法律和制度对公平住房的承诺不会因采用调解空间访问的AI工具而受到损害。

英文摘要

Large language models (LLMs) are rapidly assuming an intermediary role in housing search through the integration of listing platforms within conversational interfaces, mediating access to information, search, and recommendations within urban settings. We expand on prior work on racial steering in LLMs by conducting a behavioral audit of seven open-weight and closed-source LLMs across four U.S. cities, testing location recommendations across three iterative prompting conditions that progressively add lifestyle preference context and reflect fair housing paired-testing methodologies. We find that steering is an emergent behavior of the model's interpretive license rather than primarily a static property. Steering results from the interaction of a user's identity, preference articulation, and the spatial logic that a model has internalized about learned representations of place, preference, and opportunity in a given city, and how different types of users relate to it. While steering was present, it was not uniform in direction or magnitude across evaluated conditions. Preference-conditioned testing often increased or reconfigured the number of models that exhibited steering behaviors relative to baseline conditions, suggesting that LLMs may interpret what the same housing preference means differently depending on the racial identity of the user. Our findings also demonstrate that the city is not a neutral testing unit for LLM evaluation in place-based sectors, and results from one local market cannot be assumed to generalize to another. Local and domain expertise will be required in the housing sector to ensure that legal and institutional commitments to fair housing are not undermined while adopting AI tools that mediate spatial access.

2606.06696 2026-06-08 cs.CV cs.AI 交叉投稿

MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

MMBU: 大规模多模态生物医学理解基准，用于探测视觉语言模型的感知能力

Ryan D'Cunha, Alejandro Lozano, Xiaoxiao Sun, Daniel Vela Jarquin, Min Woo Sun, Josiah Aklilu, James Burgess, Yuhui Zhang, Ryan Nayebi, Paola Avila Robayo, Jin Ye, Ming Hu, Zhongying Deng, Junjun He, Xin Chen, Yue Yao, Robert Tibshirani, Jeffrey J. Nirschl, Serena Yeung-Levy

发表机构 * Stanford University（斯坦福大学）； University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； Instituto Tecnológico de Monterrey（蒙特雷技术学院）； Monash University（莫纳什大学）； University of Cambridge（剑桥大学）； Shanghai Jiao Tong University（上海交通大学）； Shandong University（山东大学）

AI总结提出MMBU基准，涵盖35个子模态，通过分类、定位和检测任务系统评估VLM在生物医学领域的视觉感知和泛化能力，发现高准确率可能掩盖感知缺陷。

AI中文摘要

视觉和语言模型（VLM）在转变生物医学成像工作流程方面具有巨大潜力，从检测胸部X光片中的病变到显微镜下的细胞特征分析。然而，实现这一潜力需要稳健且细粒度的视觉感知。模型需要正确解释图像中的细微特征，并且必须在不同的生物医学模态、尺度和上下文中做到这一点。尽管如此，当前的基准仍然有限。为了解决这些差距，我们引入了大规模多模态生物医学理解（MMBU）基准。它是迄今为止最大的生物医学视觉和语言基准，涵盖35个子模态，具有丰富的结构化元数据。它包括开放和封闭版本的无视觉定位分类、有视觉定位分类和物体检测，从而能够系统地评估模型在生物尺度、临床环境和成像模态上的性能。通过评估15个开放权重VLM和2个前沿VLM，我们发现，虽然医学适应为某些模型带来了可衡量的提升，但在既有基准上经常报告的高准确率可能掩盖视觉感知和领域泛化方面的缺陷。

英文摘要

Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the Massive Multimodal Biomedical Understanding (MMBU) benchmark. It is the largest biomedical vision and language benchmark to date, covering 35 submodalities with rich structured metadata. It includes both open and closed versions of ungrounded classification, grounded classification, and object detection, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities. Evaluating 15 open-weight and 2 frontier VLMs, we find that while medical adaptation provides measurable gains for some models, the high accuracy often reported on established benchmarks can mask deficiencies in visual perception and domain generalization.

2606.06712 2026-06-08 cs.CL cs.AI 交叉投稿

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

通过策略内蒸馏实现数据高效的自回归到扩散语言模型转换

Xingyu Su, Jacob Helwig, Shubham Parashar, Atharv Chagi, Lakshmi Jotsna, Degui Zhi, James Caverlee, Dileep Kalathil, Shuiwang Ji

发表机构 * Department of Computer Science and Engineering, Texas A&M University（德克萨斯农工大学计算机科学与工程系）； Department of Bioinformatics and Systems Medicine, University of Texas Health Science Center at Houston（德克萨斯大学休斯顿健康科学中心生物信息学与系统医学系）； Department of Electrical and Computer Engineering, Texas A&M University（德克萨斯农工大学电气与计算机工程系）

AI总结提出策略内扩散语言模型（OPDLM），通过策略内蒸馏将自回归模型转换为扩散语言模型，解决分布偏移和训练-推理不匹配问题，实现15倍至7000倍更少训练数据下的强性能。

AI中文摘要

我们研究将自回归模型（ARLM）转换为扩散语言模型（DLM）。与从头预训练不同，先前的工作将ARLM中的因果注意力替换为双向注意力，然后使用DLM目标训练得到的模型。然而，这些方法会导致两种分布偏移。首先，从下一个词预测目标过渡到DLM目标可能会丢弃ARLM在训练期间获得的知识。其次，标准DLM存在训练-推理不匹配，因为训练损失定义在随机掩码序列上，而不是推理时基于置信度解码产生的轨迹。为了解决这两个挑战，我们引入了策略内扩散语言模型（OPDLM），其中采用策略内蒸馏（OPD）进行ARLM到DLM的转换。具体来说，OPDLM通过自OPD训练，其中学生（具有双向注意力的ARLM）生成自己的轨迹，而教师（原始冻结的ARLM）通过在这些轨迹上提供目标logits来蒸馏其知识。通过直接以策略内方式训练，OPDLM消除了DLM中的训练-推理不匹配，同时从原始模型蒸馏增强了ARLM的知识保留。实验结果表明，OPDLM需要15倍到7000倍更少的训练token，并在各种任务上表现出强大的性能。OPDLM避免了DLM预训练的高昂成本，并将DLM转换定位为ARLM后训练的一种形式。

英文摘要

We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.
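The distillation step described above, in which the frozen teacher supplies target logits on a trajectory the student generated, can be sketched as a per-position divergence between the two output distributions. The abstract does not state the divergence or which positions are scored, so the forward KL used here, the averaging over positions, and the function names are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over the positions of one trajectory.

    Both arguments are lists of per-position logit lists computed on the
    same student-generated sequence (the on-policy part of the setup).
    """
    if len(teacher_logits) != len(student_logits) or not teacher_logits:
        raise ValueError("need matching, non-empty logit sequences")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)
        q = softmax(s_row)
        total += sum(
            pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0
        )
    return total / len(teacher_logits)
```

The loss is zero exactly when the student reproduces the teacher's distribution at every scored position, which is the sense in which distillation preserves the ARLM's knowledge.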

2606.06715 2026-06-08 cs.CL cs.AI cs.LG 交叉投稿

Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles

主题情感是否导致感知意识形态？比较政治新闻文章中人类与LLM的标注

Upasana Chatterjee

发表机构 * Columbia University（哥伦比亚大学）

AI总结研究主题情感对感知政治意识形态的因果效应，通过比较人类与LLM标注，发现微调GPT-4o-mini产生显著因果效应，归因于捷径学习。

Comments Accepted to ACL SRW 2026

AI中文摘要

我们探究主题情感是否对感知政治意识形态具有因果效应，以及答案是否取决于意识形态标签的分配者。使用来自AllSides的文章，结合Llama-3.3-70b-versatile的共享情感标注，我们比较了来自专家人类标注者、GPT-4o-mini（基线和微调）以及Llama-3.3-70B的意识形态标签。我们应用双重机器学习（DML）和社区级中介分析于所有四种标注范式。人类标注在社区水平未产生显著因果效应。微调后的GPT-4o-mini达到了最高的分类准确率（F1=72.48），并且是唯一在社区水平产生显著处理效应和中介中显著自然直接效应（NDE）的标注范式。我们将此解释为捷径学习的证据：对意识形态标签数据进行微调导致模型内化了一种虚假的情感-意识形态耦合，而这种耦合在人类判断中对此任务并不起作用。这种耦合在基于F1的评估中结构上不可见，对LLM标注作为银标签以及在下游因果分析中作为人类判断的代理的使用具有影响。

英文摘要

We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b-versatile, we compare ideology labels from expert human annotators, GPT-4o-mini (baseline and fine-tuned), and Llama-3.3-70B. We apply Double Machine Learning (DML) and community-level mediation analysis across all four annotation paradigms. Human annotations yield no significant causal effects at the community level. Fine-tuned GPT-4o-mini achieves the highest classification accuracy (F1=72.48) and is the only annotation paradigm that produces significant community-level treatment effects and significant natural direct effects (NDEs) in mediation. We interpret this as evidence of shortcut learning: fine-tuning on ideology-labeled data causes the model to internalise a spurious sentiment-ideology coupling not operative in human judgment for this task. This coupling is structurally invisible to F1-based evaluation, with implications for the use of LLM annotations as silver labels and as proxies for human judgment in downstream causal analyses.
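Double Machine Learning estimates a treatment effect by residualizing both the treatment and the outcome on the covariates with cross-fitted nuisance models, then regressing one residual on the other. The sketch below shows that partialling-out recipe with a single covariate and plain least-squares nuisance models; the study's actual learners, covariates, and fold scheme are not given in the abstract, so everything here is an illustrative assumption.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dml_effect(x, t, y, n_folds=2):
    """Cross-fitted partialling-out estimate of the effect of t on y."""
    n = len(x)
    res_t, res_y = [], []
    for k in range(n_folds):
        train = [i for i in range(n) if i % n_folds != k]
        held_out = [i for i in range(n) if i % n_folds == k]
        # Nuisance models are fitted only on the other folds.
        st, it = fit_line([x[i] for i in train], [t[i] for i in train])
        sy, iy = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            res_t.append(t[i] - (st * x[i] + it))
            res_y.append(y[i] - (sy * x[i] + iy))
    # Final stage: regress outcome residuals on treatment residuals.
    return sum(a * b for a, b in zip(res_t, res_y)) / sum(a * a for a in res_t)
```

Cross-fitting keeps each residual independent of the model that produced it, which is what lets flexible learners be used for the nuisance functions without biasing the final estimate.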

2606.06717 2026-06-08 cs.LG cs.AI q-bio.BM q-bio.QM 交叉投稿

ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets

ShallowBench: 浅口袋靶标上的生成式药物设计模型基准测试

Saket Reddy, Shiwei Liu

发表机构 * University of Illinois - Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结提出ShallowBench基准，包含5780个浅口袋靶标，用于评估生成式药物设计模型在低凹度界面上的性能，揭示现有模型预测结合亲和力较弱的问题。

AI中文摘要

虽然生成式AI模型在基于结构的药物设计中已展现出显著成功，但它们主要依赖深结合口袋，难以对具有挑战性的低口袋性靶标（如历史上“不可成药”的肿瘤靶标KRAS和MYC）采样有效配体。为弥补这一空白，我们引入了ShallowBench，这是一个从CrossDocked2020中提取的包含5780个浅口袋靶标的严格精选基准。通过计算Alpha Shape“盖子”体积与底层蛋白质原子体素体积之间的差异，我们成功分离出低凹度靶标，同时确保足够的结合表面积。评估多种最先进的生成模型显示，在这些低凹度界面上预测的结合亲和力较弱。因此，ShallowBench为生成生物学模型提供了一个严格的基准，并强调了需要能够应对这些具有挑战性靶标的新型架构创新或损失函数。

英文摘要

While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets, such as the historically "undruggable" oncology targets KRAS and MYC. To address this gap, we introduce ShallowBench, a strictly curated benchmark of 5,780 shallow-pocket targets extracted from CrossDocked2020. By computing the difference between an Alpha Shape "lid" volume and the underlying protein atom voxel volume, we successfully isolated targets with low concavity while ensuring sufficient surface area for binding. Evaluating various state-of-the-art generative models reveals weaker predicted binding affinity on these low-concavity interfaces. ShallowBench therefore provides a rigorous benchmark for generative biology models and highlights the necessity of new architectural innovations or loss functions capable of navigating these challenging targets.

2606.06718 2026-06-08 cs.LG cs.AI cs.SY eess.SY 交叉投稿

检索增强生成中的证据图一致性：基于模型的幻觉检测分析

Jianru Shen

AI总结提出证据图一致性（EGC）框架，通过构建局部证据图并计算五种结构一致性指标检测幻觉，发现不同模型族间一致性特征方向相反，表明嵌入图一致性不能作为模型无关的检测信号。

Comments Accepted at the International Conference on Advanced Machine Learning and Data Science; to appear in the IEEE Xplore proceedings

AI中文摘要

检索增强生成（RAG）减少了但并未消除大型语言模型中的幻觉。现有检测方法依赖于生成答案与检索段落之间的平面相似性，忽略了证据片段与答案声明之间的结构关系。我们提出了证据图一致性（EGC）框架，该框架为每个响应构建一个局部证据图，并计算五种结构一致性度量作为幻觉指标。在RAGTruth的完整问答拆分上，跨六个LLM（5,767个响应）进行评估，EGC揭示了一个一致的模型族分裂：图一致性特征在Llama-2模型中显示出预期的诊断方向，但在GPT-4、GPT-3.5和Mistral-7B中表现出系统性逆转。这种逆转表明不同模型族之间存在定性的不同幻觉模式，并表明基于嵌入的图一致性不能作为模型无关的幻觉检测信号。

英文摘要

Retrieval-Augmented Generation (RAG) reduces but does not eliminate hallucination in large language models. Existing detection methods rely on flat similarity between generated answers and retrieved passages, ignoring structural relationships among evidence pieces and answer claims. We propose Evidence Graph Consistency (EGC), a framework that constructs a local evidence graph per response and computes five structural consistency measures as hallucination indicators. Evaluated on the full question answering split of RAGTruth across six LLMs (5,767 responses), EGC reveals a consistent model-family split: graph consistency features show the expected diagnostic direction for hallucinations in Llama-2 models but exhibit systematic reversal in GPT-4, GPT-3.5, and Mistral-7B. This reversal suggests qualitatively different hallucination patterns across model families and indicates that embedding-based graph consistency cannot serve as a model-independent hallucination detection signal.

2606.06761 2026-06-08 cs.RO cs.AI 交叉投稿

AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation

AxisGuide: 将机器人动作坐标系锚定于RGB观测以实现鲁棒的视觉运动操控

Jiyun Jang, Yujin Sung, Woosung Joung, Daewon Chae, Sangwon Lee, Sohwi Kim, Jinkyu Kim, Jungbeom Lee

发表机构 * Korea University（高丽大学）； University of Michigan（密歇根大学）； KT R&D Center（KT研发中心）； Kakao Mobility（Kakao移动）

AI总结针对视觉运动策略在分布偏移下动作执行失败的问题，提出AxisGuide方法，通过渲染机器人基座坐标系轴并叠加提示通道，增强动作坐标理解，显著提升泛化性能。

Comments Accepted to Robotics: Science and Systems (RSS) 2026

AI中文摘要

通过大规模行为克隆训练的视觉运动操控策略已实现强大的语义场景理解，但在分布偏移下往往无法可靠地执行正确的低级动作。例如，即使在具有相同场景布局、相机视角和光照的简单拾取任务中，当物体被放置在未见过的位置时，性能也会大幅下降。我们认为这一差距源于动作理解不足，即无法在图像空间中解释机器人基座坐标系。为解决此问题，我们引入AxisGuide，一种轻量级引导方法，桥接语义场景理解和动作坐标解释。利用相机参数和末端执行器位姿，AxisGuide在每个相机视图中渲染机器人基座轴，并通过少量提示通道增强RGB观测，明确可视化图像空间中+x、+y和+z运动的含义。在LIBERO仿真和真实环境中的广泛评估表明，AxisGuide带来了显著的性能提升和更好的泛化能力，凸显了显式动作坐标提示对于学习可靠且可迁移的通用视觉运动策略的有效性。

英文摘要

Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene understanding, yet often fail to reliably execute correct low-level actions under distribution shifts. For example, even in a simple pickup task with identical scene layouts, camera viewpoints, and illumination, performance can degrade substantially when the object is placed at unseen locations. We argue that this gap arises from insufficient action understanding, namely the inability to interpret the robot's base-frame action coordinate system in image space. To address this issue, we introduce AxisGuide, a lightweight guidance method that bridges semantic scene understanding and action-coordinate interpretation. Using camera parameters and end-effector poses, AxisGuide renders the robot base-frame axes in each camera view and augments RGB observations with a small set of cue channels that explicitly visualize the meaning of the +x, +y, and +z motions in image space. Extensive evaluations in both the LIBERO simulation and real-world environments demonstrate that AxisGuide yields substantial performance gains and improved generalization, highlighting the effectiveness of explicit action-coordinate cues for learning reliable and transferable generalist visuomotor policies.
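Rendering the base-frame axes in a camera view amounts to projecting short 3D segments along +x, +y, and +z into pixel coordinates. The sketch below does this with a standard pinhole model; anchoring the segments at the end-effector position, the segment length, and all function names are assumptions, since the abstract does not describe the rendering in detail.

```python
def project_point(point_base, rotation, translation, fx, fy, cx, cy):
    """Project a 3D point given in the robot base frame to pixel coordinates."""
    # Base frame -> camera frame: p_cam = R @ p_base + t.
    p_cam = [
        sum(rotation[r][c] * point_base[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    return (fx * p_cam[0] / p_cam[2] + cx, fy * p_cam[1] / p_cam[2] + cy)


def axis_cues(ee_position, rotation, translation, fx, fy, cx, cy, length=0.1):
    """Return pixel segments showing where +x, +y, +z motion points in the image."""
    origin = project_point(ee_position, rotation, translation, fx, fy, cx, cy)
    cues = {}
    for name, direction in (("+x", (1, 0, 0)), ("+y", (0, 1, 0)), ("+z", (0, 0, 1))):
        tip = [ee_position[i] + length * direction[i] for i in range(3)]
        cues[name] = (origin, project_point(tip, rotation, translation, fx, fy, cx, cy))
    return cues
```

Each returned segment could then be rasterized into an extra cue channel alongside the RGB image, which is one plausible reading of the "small set of cue channels" mentioned above.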

2606.06764 2026-06-08 stat.ML cs.AI cs.LG 交叉投稿

Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks

深度神经网络梯度下降方法的泛化最优速率

Junyu Zhou, Puyu Wang, Yunwen Lei, Yiming Ying, Ding-Xuan Zhou

发表机构 * Mathematical Institute for Machine Learning and Data Science, KU Eichstätt-Ingolstadt（机器学习与数据科学数学研究所，艾希施泰特-英戈尔施塔特天主教大学）； Department of Computer Science, RPTU Kaiserslautern-Landau（计算机科学系，凯撒斯劳滕-兰道大学）； Department of Mathematics, University of Hong Kong（数学系，香港大学）； School of Mathematics and Statistics, University of Sydney（数学与统计学学院，悉尼大学）

AI总结本文针对深度ReLU网络，在神经正切核（NTK）机制下，首次建立了梯度下降（GD）和随机梯度下降（SGD）的极小极大最优泛化误差速率，证明宽度足够时可达核方法的最优速率。

Comments 39 pages, 1 table

AI中文摘要

近年来，在神经正切核（NTK）机制下，对于过参数化神经网络的梯度下降方法的统计泛化性能的理解取得了进展。然而，现有关于回归问题的工作大多局限于浅层网络架构，在深度神经网络理论中留下了显著的空白。本文通过为使用梯度下降（GD）和随机梯度下降（SGD）训练的深度ReLU网络提供全面的泛化分析来填补这一空白。具体来说，我们首次建立了深度ReLU网络的GD和SGD在超额总体风险上的极小极大最优速率，假设网络宽度随网络深度和训练样本量呈多项式增长。我们的结果表明，在足够宽度下，深度ReLU网络的梯度下降方法能够达到与核方法相当的最优泛化速率。

英文摘要

Recent progress has been made in understanding the statistical generalization performance of gradient descent methods for overparameterized neural networks within the neural tangent kernel (NTK) regime. However, most of the existing work on regression problems is limited to shallow network architectures, leaving a notable gap in the theory of deep neural networks. This paper addresses this gap by presenting a comprehensive generalization analysis for deep ReLU networks trained using gradient descent (GD) and stochastic gradient descent (SGD). Specifically, we establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.

2606.06772 2026-06-08 stat.ML cs.AI cs.LG 交叉投稿

Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

深度神经网络的泛化：梯度方法的极小极大速率

Junyu Zhou, Puyu Wang, Yunwen Lei, Marius Kloft, Yiming Ying

发表机构 * Mathematical Institute for Machine Learning and Data Science, Catholic University of Eichstätt-Ingolstadt（机器学习与数据科学数学研究所，艾希施泰特-英戈尔施塔特天主教大学）； Department of Computer Science, RPTU Kaiserslautern-Landau（计算机科学系，凯撒斯劳滕-兰道大学）； Department of Mathematics, The University of Hong Kong（数学系，香港大学）； School of Mathematics and Statistics, The University of Sydney（数学与统计学学院，悉尼大学）

AI总结本文建立了过参数化深度神经网络与核方法学习动力学的联系，证明了梯度下降和随机梯度下降在足够宽度下能达到极小极大最优泛化误差。

Comments 37 pages

AI中文摘要

理解过参数化神经网络的泛化性能已成为深度学习理论的核心课题。尽管近期进展，特别是神经正切核（NTK）机制下的工作，揭示了浅层架构的行为，但深度神经网络（DNN）的统计泛化性质，尤其是在回归任务中，仍远未得到充分理解。本文通过提供使用梯度方法训练的DNN的全面泛化分析，在弥合这一差距方面取得了重大进展。首先，我们首次建立了使用梯度方法训练的、具有光滑激活函数的DNN的学习动态与核方法的学习动态之间的关键联系，表明过参数化DNN上的梯度方法可以完全继承其核对应物的有利学习动态。基于这一联系以及核方法已确立的最优性，我们推导出了梯度下降（GD）和随机梯度下降（SGD）的超额总体风险的首个已知极小极大最优速率，假设网络宽度随样本量呈多项式增长。我们的结果表明，在足够宽度下，由GD或SGD训练的DNN可以实现与基于核的方法相当的泛化性能。

英文摘要

Understanding the generalization performance of over-parameterized neural networks has become a central topic in deep learning theory. While recent advances, particularly works under the Neural Tangent Kernel (NTK) regime, have shed light on the behavior of shallow architectures, the statistical generalization properties of deep neural networks (DNNs), especially in regression tasks, remain far less understood. In this paper, we make significant progress toward closing this gap by providing a comprehensive generalization analysis of DNNs trained using gradient-based methods. First, we establish, for the first time, a crucial connection between the learning dynamics of a DNN with smooth activation functions trained via gradient-based methods and those of kernel methods, showing that gradient-based methods on over-parameterized DNNs can fully inherit the favorable learning dynamics of their kernel counterparts. Building on this connection and the well-established optimality of kernel methods, we derive the first known minimax-optimal rates for the excess population risk of both gradient descent (GD) and stochastic gradient descent (SGD), under the assumption that network width scales polynomially with the sample size. Our results demonstrate that, with sufficient width, DNNs trained by GD or SGD can achieve generalization performance comparable to kernel-based methods.

2606.06779 2026-06-08 cs.IR cs.AI 交叉投稿

Mind the Gap: Bridging Behavioral Silos with LLMs in Multi-Vertical Recommendations

注意差距：用LLM弥合多垂直领域推荐中的行为孤岛

Nimesh Sinha, Raghav Saboo, Martin Wang, Sudeep Das

发表机构 * DoorDash Inc.（DoorDash公司）

AI总结提出利用LLM从数据丰富垂直领域（如餐厅）向稀疏领域（如杂货）迁移知识的框架，通过分层RAG生成多级特征，集成到MTL排序模型，显著提升新兴业务个性化与参与度。

AI中文摘要

在多垂直领域电商平台（如DoorDash）中，较新的产品垂直领域（如杂货和零售）为个性化创新提供了重要机遇。一个关键挑战在于解决用户的“冷启动”问题。本文介绍了一种新颖框架，通过将知识从数据丰富的垂直领域（例如DoorDash的餐厅）迁移到数据稀疏的垂直领域来提升推荐质量。我们利用大型语言模型（LLMs）进行生成式推理，合成封装潜在用户偏好的稀疏高维特征。具体而言，我们采用分层检索增强生成（RAG）流水线，从用户餐厅订单历史和搜索查询中推导出多级分类特征。这些生成的特征编码了长期跨垂直偏好和短期意图，并集成到生产环境中的多任务学习（MTL）排序模型中。通过广泛的离线和在线评估，我们证明该方法显著改善了新兴业务垂直领域的个性化和参与度，有效弥合了行为数据差距。

英文摘要

In multi-vertical e-commerce platforms like DoorDash, relatively newer product verticals such as grocery and retail present a significant opportunity for personalization innovation. A key challenge lies in solving the "cold start" problem for users. This paper introduces a novel framework for enhancing recommendation quality by transferring knowledge from data-rich verticals (e.g., restaurants at DoorDash) to data-sparse ones. We leverage Large Language Models (LLMs) to perform generative inference, synthesizing sparse, high-dimensional features that encapsulate latent user affinities. Specifically, we employ a hierarchical Retrieval-Augmented Generation (RAG) pipeline to derive multi-level taxonomic features from user restaurant order histories and search queries. These generated features, encoding both long-term cross-vertical preferences and short-term intent, are integrated into a production Multi-Task Learning (MTL) ranking model. We demonstrate through extensive offline and online evaluation that this approach significantly improves personalization and engagement in emerging business verticals, effectively bridging the behavioral data gap.

2606.06784 2026-06-08 cs.CR cs.AI cs.CY 交叉投稿

What Your Posts Reveal: A Benchmark and Agentic Framework for User-Level Privacy Leakage on Social Media

你的帖子揭示了什么：社交媒体用户级隐私泄露的基准与智能体框架

Zifan Peng, Yini Huang, Aiwen Lu, Qiming Ye, Peixian Zhang, Jingyi Zheng, Yule Liu, Xuechao Wang, Xinlei He, Jiaheng Wei

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Wuhan University（武汉大学）

AI总结针对社交媒体用户级多模态隐私泄露缺乏统一基准和评估指标的问题，提出SopriBench基准和隐私暴露分数（PES），并开发了无需训练的智能体框架Argus，通过跨帖子线索累积推理实现隐私推断，PES达0.55，较最强基线提升25%。

AI中文摘要

公开的社交媒体帖子可以通过散布在文本、图像或元数据中的微弱线索泄露私人信息。这种泄露通常是累积性和跨帖子的：单独看似无害的线索可能共同暴露用户的家庭、工作场所或日常行程。然而，当前研究缺乏用户级多模态隐私泄露的统一基准，以及能够捕捉暴露严重程度（超越二元准确性）的评估指标。为解决这些不足，我们提出了SopriBench，这是一个由从Rednote和Instagram账户的私有参考语料库中抽象出的泄露模式引导的合成基准，涵盖50个用户档案和1569张图像，包含属性、上下文敏感性、粒度、泄露类型、推理难度和支持证据。我们进一步引入了隐私暴露分数（PES），该分数通过上下文敏感性对值粒度进行加权。受溯因推理启发，我们提出了Argus，一个无需训练的智能体框架，用于累积泄露推理。Argus从累积证据中形成假设，验证支持证据，并将跨帖子线索聚合为隐私档案，实现了0.55的PES，比最强基线提高了25%，在跨帖子泄露上增益最大。

英文摘要

Public social media posts can reveal private information through weak cues scattered across text, images, or metadata. Such leakage is often cumulative and cross-post: cues that appear harmless in isolation may jointly expose a user's home, workplace, or routine. However, current research lacks a unified benchmark for user-level multimodal privacy leakage and an evaluation metric that captures exposure severity beyond binary accuracy. To address these gaps, we propose SopriBench, a synthetic benchmark guided by leakage patterns abstracted from a private reference corpus of Rednote and Instagram accounts, covering 50 user profiles and 1,569 images with attributes, contextual sensitivity, granularity, leakage type, inference difficulty, and supporting evidence. We further introduce the Privacy Exposure Score (PES), which weights value granularity by contextual sensitivity. Inspired by abductive reasoning, we introduce Argus, a training-free agentic framework for cumulative leakage inference. Argus forms hypotheses from accumulated evidence, verifies supporting evidence, and aggregates cross-post cues into privacy profiles, achieving 0.55 PES, a 25% improvement over the strongest baseline, with the largest gain on cross-post leakage.

2606.06800 2026-06-08 cs.HC cs.AI 交叉投稿

Exploring Reinforcement Learning for Fluid Transitions Between Clinical Mental Healthcare and Everyday Wellness Support

探索强化学习在临床心理健康与日常健康支持之间的流畅过渡

Tony Wang, Qian Yang

发表机构 * Cornell University（康奈尔大学）

AI总结本研究探索强化学习（RL）构建数字健康系统，动态选择临床与健康干预措施，以优化整体健康目标（持续日记），发现RL优化序列的益处常在干预结束后显现，且高参与度用户随时间深化参与，而恒定干预组易倦怠退出。

Journal ref: Healthcare Beyond Reaction: Harnessing AI and Sensing for Proactive Care, Workshop at ACM Interactive Health 2026 (IH '26), July 5-8, 2026, Porto, Portugal

AI中文摘要

心理健康问题时好时坏，但临床和健康干预通常分开运作，导致护理过渡频繁中断。我们探索强化学习（RL）作为构建数字健康系统的手段，该系统主动提供临床和健康干预，作为连贯护理旅程的一部分。我们问：设计这样一个系统涉及哪些复杂性？我们构建了一个上下文赌博机，从临床和健康库中动态选择日记提示，以优化总体健康目标（持续日记），并在为期四周的探索性研究（N=38）中部署。我们发现，首先，RL优化的干预序列的许多益处仅在干预结束后才显现，这引发了一个问题：提供连贯临床-健康护理旅程的系统是否应包括退步期？如果是，何时以及如何？其次，与RL生成干预互动最多的参与者随着时间的推移加深了他们的参与度，而与恒定干预互动最多的参与者往往后来倦怠并退出。这引发了一个问题：当系统混合临床和健康干预时，何时应降低强度以防止倦怠，何时应维持强度以最大化治疗效果？

英文摘要

Mental health struggles wax and wane, yet clinical and wellness interventions typically operate separately, causing frequent breakdowns at care transitions. We explore reinforcement learning (RL) as a means to build digital health systems that deliver clinical and wellness interventions proactively, as part of a coherent care journey. We ask: what complexities does designing such a system involve? We built a contextual bandit that dynamically selects journaling prompts from clinical and wellness repertoires to optimize for an overarching health goal (sustained journaling) and deployed it in a four-week exploratory study (N=38). We found that, first, many benefits of RL-optimized intervention sequences appeared only after interventions ended, raising the question: Should systems that offer coherent clinical-wellness care journeys include stepping-back periods? If so, when and how? Second, participants most engaged with RL-generated interventions deepened their engagement over time, while those most engaged with a constant intervention tended to burn out and drop out later. This raises the question: When should a system blending clinical and wellness interventions reduce intensity to prevent burnout, and when should it sustain intensity to maximize treatment gains?

2606.06805 2026-06-08 cs.RO cs.AI cs.SY eess.SY 交叉投稿

Lane Change Trajectory Planning for Personalized Driving Comfort and Mobility Efficiency

车道变更轨迹规划：个性化驾驶舒适性与移动效率

Haoxuan Dong, Dongjun Li, Ziyou Song

发表机构 * Department of Mechanical Engineering（机械工程系）； Department of Electrical Engineering（电气工程系）； National University of Singapore（新加坡国立大学）； Computer Science（计算机科学）； University of Michigan（密歇根大学）

AI总结提出一种神经网络驱动的轨迹规划器，结合三阶多项式轨迹生成与学习模块，通过双头共享骨干和基于误差胜者逻辑回归的统计门控机制，实现个性化舒适性与移动效率的平衡。

Comments Accepted by the IEEE Intelligent Vehicles Symposium (IEEE IV 2026), Detroit, MI, United States, June 22-25, 2026

AI中文摘要

车道变更涉及同时的纵向和横向运动，这些运动影响驾驶舒适性和移动效率。由于这些运动紧密耦合且存在显著的车辆间差异，车道变更操作的轨迹规划具有高度个性化的特点。本研究提出了一种神经网络驱动的规划器，该规划器将三阶多项式轨迹生成器与学习模块相结合，该学习模块在不同驾驶条件下推断最优轨迹参数。使用具有双头的共享骨干网络，一个头确保全工况操作保障，而另一个头捕捉驾驶员对舒适性或移动效率的特定偏好。通过基于误差胜者逻辑回归的统计门控实现头门控切换机制，该机制在不同驾驶条件下自适应地选择适当的头，从而实现上下文感知的车道变更轨迹规划。代表性案例和蒙特卡洛模拟表明，所提出的规划器在车道变更过程中实现了个性化的舒适性和移动性，而基线则在个性化数据不足或不可用的驾驶条件下确保可行的轨迹。

英文摘要

Lane changing entails simultaneous longitudinal and lateral motions that affect driving comfort and mobility efficiency. Because these motions are tightly coupled and subject to substantial inter-vehicle variability, trajectory planning for lane-change maneuvers is characterized by a highly personalized nature. This study proposes a neural network-driven planner that integrates a third-order polynomial trajectory generator with a learning module that infers optimal trajectory parameters across diverse driving conditions. Using a shared backbone with dual heads, one head ensures all-condition operational guarantees, while the other captures driver-specific preferences for comfort or mobility efficiency. A head-gated switching mechanism, realized through a statistical gate based on error-winner logistic regression, adaptively selects the appropriate head under varying driving conditions, which enables context-aware lane-change trajectory planning. Representative cases and Monte Carlo simulations show that the proposed planner achieves personalized comfort and mobility during lane changes, while the baseline ensures feasible trajectories under driving conditions where personalized data are insufficient or inaccessible.
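To make the third-order polynomial generator concrete, the sketch below uses the textbook cubic lateral profile that starts and ends with zero lateral velocity. The abstract does not give the paper's boundary conditions or which parameters the network infers, so this parameterization (maneuver duration and lane width only) and the function names are assumptions.

```python
def lane_change_lateral(t, duration, lane_width):
    """Lateral offset y(t) of a cubic lane change with zero end velocities.

    With s = t / duration, y(s) = lane_width * (3 s^2 - 2 s^3), which
    satisfies y(0) = 0, y'(0) = 0, y(1) = lane_width, y'(1) = 0.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    s = min(max(t / duration, 0.0), 1.0)
    return lane_width * (3.0 * s ** 2 - 2.0 * s ** 3)


def peak_lateral_acceleration(duration, lane_width):
    """Largest |y''(t)| of the profile above, reached at t = 0 and t = duration."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return 6.0 * lane_width / duration ** 2
```

Under this profile the comfort and efficiency trade-off is visible directly: halving the duration completes the maneuver sooner but quadruples the peak lateral acceleration.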

2606.06813 2026-06-08 cs.CV cs.AI 交叉投稿

Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation

打破锁定：通过表示调制实现文本到图像生成的多样化

Dahee Kwon, Haeun Lee, Jaesik Choi

发表机构 * KAIST（韩国科学技术院）

AI总结针对文本到图像模型在固定提示下生成样本过于相似的问题，提出无训练表示级干预方法DAVE，通过选择性衰减早期生成中的零频空间平均分量来增强多样性，保持图像质量且计算开销极小。

Comments Accepted to ICML 2026. Code is available at: https://github.com/daheekwon/DAVE

AI中文摘要

近期基于大规模Transformer骨干和流目标的文本到图像模型在文本-图像对齐和视觉质量方面表现出色，但在固定提示下常生成过于相似的样本。现有的多样性增强方法缓解了这一问题，但通常需要昂贵的采样或辅助优化，带来显著开销。为探究这种同质性的根本原因，我们检查了中间Transformer特征，观察到零频空间平均（DC）分量在生成早期快速收敛，导致早期轨迹锁定，限制了后续变化。基于此观察，我们提出DC衰减多样性增强（DAVE），一种无训练的表示级干预，选择性地在早期阶段衰减该分量。DAVE以可忽略的开销保留采样流程，在保持竞争性图像质量的同时，提高了提示一致性的多样性。

英文摘要

Recent text-to-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text-image alignment and high visual quality, yet often produce overly similar samples under a fixed prompt. Existing diversity-enhancement methods alleviate this issue, but typically require expensive sampling or auxiliary optimization, incurring non-trivial overhead. To investigate the root cause of this homogeneity, we examine intermediate Transformer features and observe that the zero-frequency spatial average (DC) component rapidly converges across seeds early in generation, causing early trajectory lock-in that limits downstream variation. Building on this observation, we propose DC Attenuation for diVersity Enhancement (DAVE), a training-free representation-level intervention that selectively attenuates this component in the early regime. DAVE preserves the sampling pipeline with negligible overhead, improving prompt-consistent diversity while maintaining competitive image quality.
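The DC component of a token feature map is its per-channel average over spatial positions, so attenuating it means shrinking that shared mean while leaving the position-specific variation untouched. The sketch below shows one way to do this; the scaling rule, the single `alpha` parameter, and the function name are assumptions, and the abstract does not say which layers or timesteps the real method (see https://github.com/daheekwon/DAVE) modifies.

```python
def attenuate_dc(tokens, alpha):
    """Scale the per-channel spatial mean of a feature map by alpha.

    tokens: list of per-position feature vectors (lists of floats).
    alpha = 1.0 leaves the features unchanged; alpha = 0.0 removes the
    zero-frequency component entirely.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    n = len(tokens)
    channels = len(tokens[0])
    # DC component: mean over spatial positions, one value per channel.
    dc = [sum(tok[j] for tok in tokens) / n for j in range(channels)]
    return [
        [tok[j] - (1.0 - alpha) * dc[j] for j in range(channels)]
        for tok in tokens
    ]
```

Because only the mean is rescaled, differences between positions are preserved exactly, which is consistent with the claim that image structure and quality are largely unaffected.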

2606.06820 2026-06-08 cs.LG cs.AI 交叉投稿

SCALE: Scalable Cross-Attention Learning with Extrapolation for Agentic Workflow Scheduling

SCALE: 可扩展的交叉注意力学习与外推方法用于智能体工作流调度

Zhifei Xu, Jierui Lan, Zixuan Liang, Aiji Liang, Jinxi He

发表机构 * Faculty of Arts and Sciences, Beijing Normal University（北京师范大学文理学院）

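AI summary: Proposes SCALE, a deep reinforcement learning workflow scheduler that uses a cross-attention pointer network and Structured Representation Regularization to generalize to clusters of different sizes without fine-tuning.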

Comments Submitted to Computer Networks

AI中文摘要

智能体大型语言模型系统将复杂任务分解为工作流有向无环图，其原语必须在异构集群上调度。现有的深度强化学习调度器与固定集群大小绑定，当服务器数量变化时需要重新训练。我们提出SCALE（可扩展的交叉注意力学习与外推），一种无需微调即可泛化到未见过的集群规模的深度强化学习调度器。SCALE采用交叉注意力指针网络，其中任务特征查询服务器特征，因此架构通过构造接受任意数量的服务器。然而，我们观察到仅排列不变架构并不能保证在新规模下的良好性能——随着服务器数量增长，注意力特征经历分布偏移。为了解决这个问题，我们引入结构化表示正则化：一种去相关损失结合朝向标准正态的KL惩罚，使特征统计量无论输入大小都保持稳定。在16个节点上训练并直接在32和48个节点上测试，SCALE在N=48时相对于没有SRR的相同架构将平均响应时间降低了8.9%，确认了显式正则化对于缩小规模泛化差距是必要的。

英文摘要

Agentic Large Language Model (LLM) systems decompose complex tasks into workflow Directed Acyclic Graphs (DAGs) whose primitives must be scheduled on heterogeneous clusters. Existing deep reinforcement learning (DRL) schedulers are tied to a fixed cluster size and require retraining whenever the number of servers changes. We propose SCALE (Scalable Cross-Attention Learning with Extrapolation), a DRL scheduler that generalizes to unseen cluster scales without fine-tuning. SCALE employs a cross-attention pointer network where task features query against server features, so the architecture accepts any number of servers by construction. We observe, however, that permutation-invariant architecture alone does not guarantee good performance at new scales - the attention feature undergoes distribution shift as the server count grows. To counter this, we introduce Structured Representation Regularization (SRR): a decorrelation loss combined with a KL penalty toward the standard normal, which keeps feature statistics stable regardless of input size. Trained on 16 nodes and tested directly on 32 and 48 nodes, SCALE reduces average response time by 8.9% at N=48 relative to the same architecture without SRR, confirming that explicit regularization is necessary to close the scale-generalization gap.
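The two SRR terms named above (a decorrelation loss and a KL penalty toward the standard normal) might look roughly like the sketch below. The exact losses are not given in the abstract, so this assumes a squared off-diagonal covariance penalty and a per-dimension Gaussian KL computed from batch statistics; `srr_penalty` is a hypothetical name.

```python
import math


def srr_penalty(features, eps=1e-8):
    """Return (decorrelation, kl) for a batch of feature vectors.

    decorrelation: sum of squared off-diagonal covariance entries.
    kl: sum over dimensions of KL(N(mean, var) || N(0, 1)), an assumed form.
    """
    n, d = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in features]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    decor = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    kl = sum(0.5 * (cov[j][j] + mean[j] ** 2 - 1.0 - math.log(cov[j][j] + eps))
             for j in range(d))
    return decor, kl
```

Both terms are zero for zero-mean, unit-variance, uncorrelated features, so minimizing them pushes the feature statistics toward a fixed target regardless of how many servers are attended over.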

2606.06823 2026-06-08 cs.LG cs.AI q-fin.ST 交叉投稿

PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis And Integrated Decision-Making in Quantitative Finance

PandaAI: 一种用于量化金融中神经符号数据分析与集成决策的实用智能体CQ2

Yuqi Li, Siyuan Liu, Bingjun Liu

发表机构 * Panda AI

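AI summary: To address the low signal-to-noise ratio and non-stationarity of financial data, proposes PandaAI, a closed-loop neuro-symbolic LLM agent that combines market regime modeling with constrained alpha generation and achieves risk-aware decision-making through domain fine-tuning and a modular architecture; on CSI 300 data it raises Rank IC by 18.2% and lowers maximum drawdown by 25.7%.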

AI中文摘要

尽管深度学习在各个领域表现出色，但由于金融数据的低信噪比（SNR）和非平稳性，其在金融序列决策中的应用仍然具有挑战性。利用大型语言模型（LLM）的推理能力，我们提出了PandaAI，一种具有市场状态建模和约束alpha生成的闭环神经符号LLM智能体，它桥接了通用LLM推理与金融严谨性，并抑制了LLM生成输出的金融毒性。为了弥合通用语言能力与金融严谨性之间的差距，我们微调了一个领域特定的LLM。此外，我们将此LLM集成到模块化架构中，形成一个闭环系统。与传统优化孤立预测指标的模型不同，PandaAI被设计为一种神经符号智能体，以明确的风险意识在复杂、真实的金融环境中导航。在沪深300股票数据上的大量实验表明，PandaAI比最先进的时间序列模型实现了18.2%更高的Rank IC和25.7%更低的最大回撤。我们的约束LLM生成和双通道适应方法为LLM在高风险序列决策场景中的部署提供了一种通用范式。

英文摘要

While deep learning has excelled in various domains, its application to sequential decision-making in finance remains challenging due to the low Signal-to-Noise Ratio (SNR) and non-stationarity of financial data. Leveraging the reasoning capabilities of Large Language Models (LLMs), we propose PandaAI, a closed-loop neuro-symbolic LLM agent with market regime modeling and constrained alpha generation, which bridges general LLM reasoning with financial rigor and suppresses the financial toxicity of LLM-generated outputs. To bridge the gap between general linguistic capability and financial rigor, we fine-tune a domain-specific LLM. Furthermore, we integrate this LLM into a modular architecture and form a closed-loop system. Unlike traditional models that optimize isolated prediction metrics, PandaAI is designed as a neuro-symbolic agent that navigates the complex, real-world financial environment with explicit risk awareness. Extensive experiments on CSI 300 stock data show that PandaAI achieves an 18.2% higher Rank IC and a 25.7% lower maximum drawdown than state-of-the-art time-series models. Our constrained LLM generation and dual-channel adaptation method provide a general paradigm for LLM deployment in high-stakes sequential decision-making scenarios.

2606.06825 2026-06-08 cs.CL cs.AI 交叉投稿

Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards

Progress-SQL: 通过渐进式奖励改进文本到SQL的强化学习

Shihao Zhang, Xiaoman Wang, Yuan Liu, Yunshi Lan, Weining Qian

发表机构 * East China Normal University（华东师范大学）

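AI summary: Proposes Progress-SQL, a multi-turn reinforcement learning framework that generates clause-level structural feedback through an Oracle-guided Diagnostic Tree (ODT) and combines it with progressive rewards (structural alignment, lexical alignment, delayed reward, and execution-status reward) to improve the accuracy and robustness of text-to-SQL generation.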

AI中文摘要

先刻画再蒸馏：大输出空间中的机械推理

Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam

发表机构 * Khoury College of Computer Sciences, Northeastern University（东北大学计算机科学学院）

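AI summary: Studies the mechanism by which modern reasoning models perform zero-shot multi-label classification over label spaces with millions of labels, proposes a two-stage model of candidate-list generation followed by fine-grained reasoning, and builds on it a mechanistic distillation strategy that outperforms standard distillation.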

2606.06853 2026-06-08 cs.CV cs.AI 交叉投稿

MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

MotionEnhancer: 利用视频扩散模型增强运动感知的视觉-语言模型

Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang, Jiaxing Qi, Zhipeng Chen

发表机构 * School of Computer Science and Engineering, Beihang University（北航计算机科学与工程学院）； Beijing Digital Native Digital City Research Center（北京数字原生数字城研究中心）； School of Computer Science, Peking University（北京大学计算机学院）； School of Artificial Intelligence, Beijing University of Posts and Telecommunications（北京邮电大学人工智能学院）

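AI summary: Proposes MotionEnhancer, which distills motion priors from a video diffusion model and uses attention alignment to strengthen the motion understanding of vision-language models, with no additional parameters or architectural changes, yielding consistent gains on motion-level video understanding benchmarks.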

Comments Accepted by CVPR 2026

AI中文摘要

新时代见证了视觉-语言模型（VLM）在视频理解任务中的显著能力扩展。虽然当前的VLM在事件或故事级别的理解上表现出色，但它们捕捉细粒度运动细节的能力仍然有限，这主要是由于它们关注高层静态语义结构和宏观事件逻辑。相比之下，视频扩散模型（VDM）擅长建模动态运动模式，得益于大规模视频数据和时序生成的内在需求。在本文中，我们介绍了MotionEnhancer，一种新颖的方法，它利用从强大视频扩散模型中提取的运动先验作为辅助监督，通过注意力对齐增强VLM的运动理解能力。MotionEnhancer包含两个简单的无参数模块：运动敏感头选择（MHS）和运动显著文本标记识别（MTTI），以仅计算的方式直接从VDM中提取和优化与运动相关的注意力。MotionEnhancer为运动理解提供了可扩展的解决方案，无需额外的训练参数、修改现有架构或工具调用。大量实验表明，在两个运动级视频理解基准上，MotionEnhancer能够在最先进的VLM上实现一致的改进，尤其是在运动相关指标上。

英文摘要

The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic. In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment. MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or tool calling. Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.

2606.06861 2026-06-08 cs.LG cs.AI 交叉投稿

Modeling Nonlinear Feature Interactions with Product-Unit Residual Networks

使用乘积单元残差网络建模非线性特征交互

Ziyuan Li, Uwe Jaekel, Babette Dellen

发表机构 * University of Applied Sciences Koblenz（科布伦茨应用科学大学）； Technical University of Munich（慕尼黑工业大学）

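AI summary: Investigates product-unit residual networks (PURe), which model feature interactions explicitly to improve robustness and interpretability, and achieve competitive or improved performance relative to MLP baselines on synthetic and real-world datasets.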

Comments Accepted at ICCS 2026

AI中文摘要

理解非线性特征交互在科学和工程中至关重要，然而标准多层感知器（MLP）通常仅隐式地捕获此类交互，导致表征纠缠，可能损害鲁棒性和可解释性。我们研究了乘积单元残差网络（PURe），它将乘法乘积单元与残差连接相结合，以显式建模跨特征耦合，同时稳定优化。我们在一个基于交互的合成基准和两个真实世界数据集上进行了系统评估，考察了预测准确性、对高斯特征噪声的鲁棒性以及在有限训练数据下的性能，并在匹配参数预算下比较了实值和复值变体。除了准确性，基于SHapley Additive exPlanations（SHAP）的交互分析表明，与MLP基线相比，PURe学习了更集中且结构更连贯的交互模式。总体而言，PURe实现了具有竞争力或更好的性能，在低数据场景下具有更好的鲁棒性和样本效率，并增强了交互级别的可解释性。

英文摘要

Understanding nonlinear feature interactions is crucial in science and engineering, yet standard multilayer perceptrons (MLPs) often capture such interactions only implicitly, leading to entangled representations that can impair robustness and interpretability. We investigate product-unit residual networks (PURe) that integrate multiplicative product units with residual connections to explicitly model cross-feature couplings while stabilizing optimization. We conduct a systematic evaluation on an interaction-driven synthetic benchmark and two real-world datasets, assessing predictive accuracy, robustness to Gaussian feature noise, and performance under limited training data, and we compare real- and complex-valued variants under a matched parameter budget. Beyond accuracy, SHapley Additive exPlanations (SHAP)-based interaction analyses show that PURe learns more concentrated and structurally coherent interaction patterns than MLP baselines. Overall, PURe achieves competitive or improved performance, better robustness and sample efficiency in low-data regimes, and enhanced interaction-level interpretability.
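A multiplicative product unit with a residual connection can be sketched as below. This is an illustrative sketch only: it uses the classical product-unit form (a weighted product of inputs computed in log space) and takes absolute values so the logarithm stays real, which is an assumption rather than the paper's real- or complex-valued formulation. The names `product_unit` and `residual_product_layer` are hypothetical.

```python
import math


def product_unit(x, w, eps=1e-6):
    """prod_i |x_i| ** w_i, computed as exp(sum_i w_i * log(|x_i| + eps))."""
    # Taking |x_i| discards sign information; this is a simplifying assumption.
    return math.exp(sum(wi * math.log(abs(xi) + eps) for xi, wi in zip(x, w)))


def residual_product_layer(x, weights):
    """x + [product_unit(x, w_k) for each weight row w_k].

    Requires len(weights) == len(x) so the skip connection lines up.
    """
    return [xi + product_unit(x, wk) for xi, wk in zip(x, weights)]
```

Each exponent `w_i` directly states how strongly input `i` enters the product, which is one reason such units are easier to inspect for feature interactions than stacked additive layers.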

2606.06872 2026-06-08 cs.CV cs.AI 交叉投稿

EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation

EgoPressDiff: 用于自我中心UV域手部压力估计的多模态视频扩散模型

Yuan Zeng, Zilue Gao, Yujia Shi, Zongqing Lu, Wenming Yang, QingMin Liao

发表机构 * University of Science and Technology of China（中国科学技术大学）

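AI summary: Proposes EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input using a multimodal conditioning strategy (hand pose, 3D mesh vertices, and depth), addressing the quantization errors and temporal inconsistencies of existing methods; it achieves state-of-the-art results on the EgoPressure dataset, with a relative Volumetric IoU improvement of over 34%.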

Comments Accepted to IEEE ICASSP 2026

DOI: 10.1109/ICASSP55912.2026.11463813

AI中文摘要

从自我中心视角估计手部表面接触压力对于AR/VR设备、机器人模仿和人体工程学分析至关重要。现有方法通常对压力信号进行离散化并独立处理帧，导致量化误差和时间不一致性。我们提出EgoPressDiff，一种条件视频扩散框架，从视觉输入生成UV压力图。我们方法的核心是一种多模态条件策略，引入PoseNet和顶点编码器，从手部姿态和3D网格顶点中高效提取特征。这些信号与深度信息一起，指导生成过程以确保压力场在物理上是合理的。为了有效融合这些异构特征，我们进一步提出分布校准空间层，在组合前对齐其统计特性。在EgoPressure自我中心视图设置上的评估表明，EgoPressDiff实现了最先进的结果，Volumetric IoU相对先前基线提升超过34%，同时降低MAE并保持高时间精度。我们的项目页面位于此https URL。

英文摘要

Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy, introducing a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process to ensure the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to prior baseline, while reducing MAE and maintaining high temporal accuracy. Our project page is at https://egopressdiff.github.io/.

2606.06877 2026-06-08 cs.RO cs.AI 交叉投稿

Neuro-Symbolic Learning for Long-Horizon Task Planning Under Complex Logical Constraints

复杂逻辑约束下长时域任务规划的神经符号学习

Qiwei Du, Zitong Zhan, Shaoshu Su, Bowen Li, Yi Du, Zhipeng Zhao, Taimeng Fu, Sebastian Scherer, Jiaoyang Li, Chen Wang

发表机构 * Spatial AI & Robotics (SAIR) Lab, University at Buffalo, NY 14260（空间人工智能与机器人实验室，布法罗大学，纽约州，14260）； Robotics Institute, Carnegie Mellon University, PA 15213（机器人研究所，卡内基梅隆大学，宾夕法尼亚州，15213）

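AI summary: Proposes an imperative learning-based bilevel optimization framework in which a neural scorer prunes task-irrelevant objects and a 3R strategy (Repair, Restart, Rollback) stabilizes lower-level planning, reducing failure rate by 80.04% and planning time by 57.14% across three benchmarks.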

AI中文摘要

当机器人必须在复杂逻辑约束（包括对象可供性、空间关系和顺序动作依赖）下推理长时域动作序列时，任务规划常常面临严重的效率瓶颈。最近的神经符号方法通过学习对象重要性分数来剪枝任务无关对象，从而提高规划效率，但它们通常依赖于从完整搜索空间生成的固定离线监督。这造成了训练-测试不匹配：在部署时，规划器在由模型自身不完美预测诱导的剪枝搜索空间中运行，导致暴露偏差和规划性能下降。为了解决这一挑战，我们将任务规划的对象重要性学习形式化为一个基于命令学习的双层优化问题。上层优化一个神经评分器，而下层在评分剪枝的搜索空间中求解符号规划问题。为了稳定这一学习过程，我们在下层规划中引入3R策略，使用并行的修复、重启和回滚恢复来为上层学习提供可靠且自适应的反馈。在三个具有挑战性的基准上的实验展示了最先进的性能，包括失败率降低80.04%和规划时间减少57.14%。我们进一步在仿真和现实世界中的四足移动机械臂上验证了该框架，展示了其在高效且可部署的神经符号任务规划方面的潜力。

英文摘要

Task planning often suffers from severe efficiency bottlenecks when robots must reason over long-horizon action sequences under complex logical constraints, including object affordances, spatial relationships, and sequential action dependencies. Recent neuro-symbolic methods improve planning efficiency by learning object-importance scores to prune task-irrelevant objects, but they typically rely on fixed offline supervision generated from full search spaces. This creates a train-test mismatch: at deployment, the planner operates in pruned search spaces induced by the model's own imperfect predictions, leading to exposure bias and degraded planning performance. To address this challenge, we formulate object-importance learning for task planning as an imperative learning-based bilevel optimization problem. The upper level optimizes a neural scorer, while the lower level solves a symbolic planning problem in the score-pruned search space. To stabilize this learning process, we introduce a 3R strategy into the lower-level planning, using parallel Repair, Restart, and Rollback recovery to provide reliable and adaptive feedback for upper-level learning. Experiments on three challenging benchmarks demonstrate state-of-the-art performance, including an 80.04% reduction in failure rate and a 57.14% reduction in planning time. We further validate the framework on a quadruped-based mobile manipulator in simulation and the real world, demonstrating its potential for efficient and deployable neuro-symbolic task planning.

2606.06885 2026-06-08 cs.CV cs.AI 交叉投稿

FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising

FreeAnimate: 基于预览引导去噪的无训练人体图像动画

Yuan Zeng, Yujia Shi, Zongqing Lu, QingMin Liao

发表机构 * National University of Singapore（新加坡国立大学）； University of Science and Technology of China（中国科学技术大学）

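AI summary: Proposes FreeAnimate, a training-free framework for human image animation that exploits the inherent capabilities of image diffusion models; a preview generation strategy supplies temporal and structural priors, and Inversion-Boosted Attention and Reference-Anchored Self-Attention modules maintain temporal consistency and identity preservation.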

Comments Accepted to IEEE ICASSP 2026

DOI: 10.1109/ICASSP55912.2026.11462600

AI中文摘要

人体图像动画已经取得了显著进展，主要得益于扩散模型。然而，现有方法通常需要大量的训练数据和资源才能获得高质量结果，限制了泛化性和可访问性。在这项工作中，我们引入了FreeAnimate，一个无训练框架，利用图像扩散模型的内在能力来实现时序一致性、身份保持和背景稳定性。我们的方法包含一种新颖的预览生成策略，该策略从生成的预览帧中提供时序和结构先验，无需训练即可有效引导姿态对齐和背景一致性。此外，FreeAnimate引入了反演增强注意力和参考锚定自注意力模块，以保证时序一致性和身份保持。实验结果表明，FreeAnimate优于现有的无训练竞争方法和基于训练的基线方法，生成的图像质量可与最先进的方法相媲美，并在不同数据集上展现出强大的泛化能力。我们的项目页面位于此https URL。

英文摘要

Human Image Animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand substantial training data and resources to achieve high-quality results, limiting generalization and accessibility. In this work, we introduce FreeAnimate, a training-free framework that leverages the inherent capabilities of image diffusion models to enable temporal consistency, identity preservation, and background stability. Our approach incorporates a novel preview generation strategy that provides temporal and structural priors from generated preview frames, effectively guiding pose alignment and background consistency without training. Additionally, FreeAnimate introduces Inversion-Boosted Attention and Reference-Anchored Self-Attention modules to guarantee temporal consistency and identity preservation. Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods and offering robust generalization across diverse datasets. Our project page is at https://freeani.github.io/.

2606.06903 2026-06-08 cs.CV cs.AI 交叉投稿

Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy

超越骨架：使用Same2X训练策略直接从驱动视频学习动画

Yuan Zeng, Yujia Shi, Yuhao Yang, Dongxia Liu, Zongqing Lu, Wenming Yang, Qingmin Liao

发表机构 * Tsinghua University（清华大学）； Harbin Institute of Technology（哈尔滨工业大学）； Pengcheng Laboratory（鹏城实验室）

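AI summary: Proposes DirectAnimator, a framework that uses a Driving Cue Triplet and the Same2X training strategy to bypass pose extraction and learn animation directly from raw videos, producing robust, high-quality human image animation.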

Comments Accepted to ICLR 2026

AI中文摘要

人体图像动画旨在根据从驱动视频中提取的姿态信息，从静态参考图像生成视频。现有方法通常依赖姿态估计器提取中间表示，但在遮挡或复杂姿态下这些信号容易出错。基于这些观察，我们提出了DirectAnimator，一个绕过姿态提取并直接从原始驱动视频学习的框架。我们引入了一个由姿态、面部和位置线索组成的驱动线索三元组，以语义丰富且稳定的形式捕捉运动、表情和对齐，并通过CueFusion DiT块融合它们，以实现去噪过程中的可靠控制。为了使学习在驱动和参考身份不同时依然可靠，我们设计了Same2X训练策略，将跨身份特征与从相同身份数据中学到的特征对齐，从而正则化优化并加速收敛。大量实验表明，DirectAnimator在保持身份的同时达到了最先进的视觉质量，对遮挡和复杂关节运动具有鲁棒性，并且计算资源更少。我们的项目页面位于此https URL。

英文摘要

Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Existing approaches often rely on pose estimators to extract intermediate representations, but such signals are prone to errors under occlusion or complex poses. Building on these observations, we present DirectAnimator, a framework that bypasses pose extraction and directly learns from raw driving videos. We introduce a Driving Cue Triplet consisting of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form, and we fuse them through a CueFusion DiT block for reliable control during denoising. To make learning dependable when the driving and reference identities differ, we devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence. Extensive experiments demonstrate that DirectAnimator attains state-of-the-art visual quality and identity preservation while remaining robust to occlusions and complex articulation, and it does so with fewer computational resources. Our project page is at https://directanimator.github.io/.

2606.06906 2026-06-08 cs.CL cs.AI 交叉投稿

EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

EASE-TTT: 面向长上下文问答的基于证据对齐的选择性测试时训练

Xiaopeng Yuan, Zebin Wang, Suwen Wang, Zongxin Yang, Haohan Wang, Yushun Dong

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Harvard University（哈佛大学）； Brion, ASML US LP ； Florida State University（佛罗里达州立大学）

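AI summary: Proposes EASE-TTT, a framework that converts retrieved evidence chunks into soft attention supervision targets to guide query-side parameter adaptation, improving the long-context question answering performance of small models while keeping the full context.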

Comments 13 pages, 4 figures, 3 tables

AI中文摘要

DaX: 跨尺度的通用病理学表示学习

Bokai Zhao, Yiyang Zhang, Long Bai, Tai Ma, Hanqing Chao, Minfeng Xu

发表机构 * DAMO Academy, Alibaba Group（达摩院，阿里巴巴集团）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； Hupan Lab（湖畔实验室）

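AI summary: Proposes DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning with designs such as continuous magnification training and cross-scale tissue views, achieving the best mean performance on 161 clinical tasks from 44 public datasets.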

AI中文摘要

计算病理学需要能够跨不同临床终点迁移且对放大倍数、染色、扫描仪类型、切片制备和输入分辨率变化保持鲁棒的视觉表示。我们提出DaX，一个病理视觉基础模型，它将DINOv3风格的自监督学习适应到全切片组织病理学。DaX从自然图像DINOv3权重初始化，并融合了连续放大训练、跨尺度组织视图、方向无关和采集鲁棒的数据增强、多输入尺寸训练以及Gram锚定的密集一致性。这些设计旨在连接局部细胞形态与全局组织结构，同时稳定跨输入尺度的密集token级表示。我们进一步构建了一个WSI级基准，包含来自44个公共数据集的161项临床有意义任务，涵盖28,182名患者和34,394张切片，跨越四个临床领域和九个任务类别。所有模型在固定的患者级交叉验证协议下进行评估，并采用折叠级统计排名，从而实现可重复的比较，对分割依赖的变异性不敏感。在该基准上，DaX在任务中取得了最高的平均性能，并持续获得强大的任务级排名分数，其增益涵盖诊断病理学、生物标志物和分子谱分析、组织/标本背景以及风险、反应和预后。这些结果支持DaX作为计算病理学的可迁移视觉编码器，并为未来的病理基础模型提供了标准化的评估框架。项目页面：此https URL。

英文摘要

Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. We present DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning to whole-slide histopathology. DaX is initialized from natural-image DINOv3 weights and incorporates continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency. These designs aim to connect local cellular morphology with global tissue architecture while stabilizing dense token-level representations across input scales. We further construct a WSI-level benchmark comprising 161 clinically meaningful tasks from 44 public datasets, covering 28,182 patients and 34,394 slides across four clinical domains and nine task categories. All models are evaluated under a fixed patient-level cross-validation protocol with fold-level statistical ranking, enabling reproducible comparisons that are less sensitive to split-dependent variation. Across this benchmark, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores, with gains spanning diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis. These results support DaX as a transferable visual encoder for computational pathology and provide a standardized evaluation framework for future pathology foundation models. Project page: https://alibaba-damo-academy.github.io/DaX/benchboard/.

2606.06991 2026-06-08 cs.CV cs.AI 交叉投稿

Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

不要暂停：面向在线视频理解的流式视频-语言同步

Zhenyu Yang, Kairui Zhang, Shengsheng Qian, Weiming Dong, Changsheng Xu

发表机构 * National University of Singapore（新加坡国立大学）； University of Science and Technology of China（中国科学技术大学）

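AI summary: Proposes the Streaming Video-Language Synchrony (SVLS) paradigm, in which a Frame-Driven Transition Controller and a Streaming Token Pacer synchronize video frames with language generation at a fine granularity, enabling real-time interaction without interrupting perception.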

AI中文摘要

在线视频大语言模型（Video-LLMs）通过逐帧处理和主动响应，在人机交互方面取得了进展。然而，流式场景中仍存在一个关键挑战：现有模型在生成响应时通常会暂停视频感知，破坏了实时的视频-语言同步并导致卡顿。为了解决这个问题，我们引入了一种新的在线视频理解范式：流式视频-语言同步（SVLS），并提出了LyraV，一个基于分层控制框架的实时流式助手，具有两个核心创新。首先，帧驱动转换控制器（FDTC）是一个无需训练的基于验证的有限状态机，它做出高层语义决策，决定何时继续说话、开始新的响应或保持沉默。其次，流式令牌调节器（SToP）是一个即插即用的轻量级预测模块，动态调整语言生成速率以匹配视觉内容的节奏。具体来说，LyraV执行逐帧增量、子预算解码：在每个帧间隔内，它只发射适合实时预算的一小部分令牌，因此感知永远不会被阻塞整个句子。这些组件共同使LyraV能够无缝地交织传入的视频帧和生成的词令牌，实现细粒度的同步。在五个在线和三个离线基准上进行的广泛实验表明，LyraV保留了骨干网络的通用理解能力，同时显著提高了流式同步和叙事流畅性，实现了98.29%的视频播放同步率和3.89 FPS的实时处理速度。有趣的是，我们观察到LyraV的一个经验能力：对流式令牌进行动态推理，实现了与视觉输入并行的连续解释和“思考”。

英文摘要

Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while generating responses, breaking real-time video-language synchrony and causing stutters. To address this, we introduce a novel paradigm for online video understanding: Streaming Video-Language Synchrony (SVLS), and present LyraV, a live streaming assistant built upon a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components enable LyraV to seamlessly interleave incoming video frames with generated word tokens, achieving a fine-grained synchrony. Extensive experiments conducted on five online and three offline benchmarks demonstrate that LyraV preserves the backbone's general understanding ability while substantially improving streaming synchrony and narrative fluency, delivering a 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, we observe an empirical capability in LyraV: dynamic reasoning over streaming tokens, enabling continuous interpretation and "thinking" alongside visual input.
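The per-frame, sub-budget decoding described above can be pictured with a small scheduling sketch. It assumes a fixed token budget per frame interval and a pre-generated token queue purely for illustration; according to the abstract the budget is adapted dynamically by SToP, and the name `interleave_stream` is hypothetical.

```python
def interleave_stream(frames, pending_tokens, tokens_per_frame):
    """Emit at most `tokens_per_frame` tokens after each frame.

    Returns the interleaved timeline and any tokens still waiting.
    """
    timeline = []
    queue = list(pending_tokens)
    for frame in frames:
        timeline.append(("frame", frame))  # perception always runs first
        chunk, queue = queue[:tokens_per_frame], queue[tokens_per_frame:]
        timeline.extend(("token", tok) for tok in chunk)
    return timeline, queue
```

Because no frame waits for more than one small chunk, a long sentence is spread across several frame intervals instead of stalling perception until it is finished.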

2606.07007 2026-06-08 cs.LG cs.AI 交叉投稿

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

理解稀疏自编码器中概念学习与神经元解释的几何视角

Chenhao Zhang, Chris Lin, Su-In Lee

发表机构 * University of California, Berkeley（加州大学伯克利分校）

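AI summary: Proposes a unified mathematical framework that formalizes concept learning as a set-alignment problem, distinguishes three strengths of learning (detection, separation, and approximation), gives geometric conditions and error bounds, and connects concept learning to neuron interpretation through formal concept analysis.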

AI中文摘要

我们提出了一个统一的数学框架，用于几何理解稀疏自编码器（SAE）中的概念学习和神经元解释。尽管SAE通过学习稀疏特征表示提高了神经网络的可解释性，但“概念”和“学习”的原则性定义仍不明确。我们将概念形式化为数据点的集合，并将概念学习视为人类定义概念与模型诱导概念之间的集合对齐问题。该公式区分了三种越来越强的学习概念——检测、分离和近似——并给出了概念可由单个神经元或多神经元单元表示的几何条件、误差界和容量约束。它还提供了对常见SAE现象的集合论解释，包括特征分裂、特征吸收、特征族和层次概念。最后，我们通过形式概念分析将概念学习与神经元解释联系起来，表明这两个方向不必一致，并且它们的多对多结构可以通过概念格来组织。在合成数据上使用ReLU和Top-$K$ SAE的实验说明了该理论，并揭示了SAE大小和稀疏性对概念学习的影响。

英文摘要

We propose a unified mathematical framework for a geometric understanding of concept learning and neuron interpretation in sparse autoencoders (SAEs). While SAEs improve interpretability of neural networks by learning sparse feature representations, a principled definition of "concept" and "learning" remains unclear. We formalize concepts as sets of data points and cast concept learning as a set-alignment problem between human-defined and model-induced concepts. This formulation distinguishes three increasingly strong notions of learning (detection, separation, and approximation) and yields geometric conditions, error bounds, and capacity constraints for when concepts can be represented by individual neurons or multi-neuron units. It also provides a set-theoretic account for common SAE phenomena, including feature splitting, feature absorption, feature families, and hierarchical concepts. Finally, we connect concept learning and neuron interpretation through formal concept analysis, showing that the two directions need not agree and that their many-to-many structure can be organized by concept lattices. Experiments on synthetic data with ReLU and Top-$K$ SAEs illustrate the theory and reveal the effects of SAE size and sparsity on concept learning.

2606.07030 2026-06-08 cs.SD cs.AI cs.CL cs.LG 交叉投稿

Phonetic Error Analysis of Raw Waveform Acoustic Models

原始波形声学模型的音素错误分析

Erfan Loweimi, Zhengjun Yue, Andrea Carmantini, Zoran Cvetkovic, Steve Renals, Peter Bell

发表机构 * Centre for Speech Technology Research (CSTR), University of Edinburgh, UK（语音技术研究中心（CSTR），爱丁堡大学，英国）； Cisco, UK（思科公司，英国）； SLAI & CUHK-SZ, China（SLAI与CUHK-SZ，中国）； King's College London, UK（伦敦国王学院，英国）

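AI summary: By decomposing phone error rate and analyzing confusion matrices, finds that BLSTM layers yield the largest gains for transition-dependent classes, that WSJ transfer learning improves consonants about three times as much as vowels, and that confusion patterns reflect inherent phonetic similarity.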

Comments INTERSPEECH2026

2606.07032 2026-06-08 cs.CV cs.AI 交叉投稿

Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

前所未见：基于一致视频源数据集的真正零样本组合图像检索基准测试

Zhenyu Yang, Zemin Du, Shengsheng Qian, Changsheng Xu

发表机构 * University of Science and Technology of China（中国科学技术大学）

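AI summary: To address existing zero-shot composed image retrieval datasets having unrelated reference and target images and not being truly zero-shot, proposes the ZeroSight benchmark, which includes consistent reference-target pairs sourced from videos and a training-free MLLM-driven method, SC4CIR, that identifies hard negatives through three symmetric consistency checks; experiments show that the performance of existing methods is overestimated.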

AI中文摘要

零样本组合图像检索（ZS-CIR）旨在基于由参考图像和相对描述组成的查询，在没有训练样本的情况下检索目标图像。现有的ZS-CIR数据集常因图像来源嘈杂而导致参考图像与目标图像完全不相关，并且由于使用了CLIP等模型已训练过的公开图像数据集，未能实现真正的零样本场景。为解决这些挑战，我们引入了ZeroSight，一个用于ZS-CIR的新基准。它包括一个来自视频的一致参考-目标对数据集、一个数据构建流程，以及考虑多个正负目标图像排序的评估方法。我们通过从单个视频中提取帧并使用LLM辅助方法生成相对描述，确保参考-目标对在视觉和语义上一致。为确保真正的零样本场景，我们使用2022年3月31日之后发布的视频数据，确保其未包含在CLIP的预训练数据中。此外，我们提出了一种无需训练的MLLM驱动方法SC4CIR（对称一致性用于CIR），该方法通过三重对称一致性检查能够有效识别难负目标。该方法是即插即用的，能与各种CIR方法无缝集成并显著提升性能。我们通过27种方法的实验结果表明，当前的ZS-CIR数据集和评估指标导致了检索性能的膨胀，夸大了CIR方法的能力。我们的基准和模型可通过此https URL访问。

英文摘要

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, ensuring it was not included in CLIP's pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through 3 symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at https://github.com/sotayang/ZeroSight.

2606.07036 2026-06-08 cs.CV cs.AI cs.CE cs.LG 交叉投稿

STREAM: Stochastic Riemannian Flow Matching with Anisotropic Decoder for Digital Histopathology Image Generation

STREAM: 用于数字组织病理学图像生成的随机黎曼流匹配与各向异性解码器

Won June Cho, Daeky Jeong, Hyeongyeol Lim, Hongjun Yoon

发表机构 * DEEPNOID Inc.（DEEPNOID公司）

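AI summary: Proposes STREAM, a framework that uses the patch-token features of a histopathology vision foundation model as the latent space and generates high-quality histopathology images via Riemannian flow matching, avoiding conditioning collapse, with an anisotropic decoder designed to improve generation quality.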

Comments 27 pages, 7 figures

AI中文摘要

合成组织病理学图像生成解决了计算病理学中的关键挑战，包括患者隐私和对基础模型大规模训练数据日益增长的需求。潜在扩散模型主导了图像生成领域，最近的研究强调潜在空间的选择对生成图像的质量至关重要。现有的组织病理学最先进生成模型使用预训练的视觉基础模型（VFM）作为条件信号，我们观察到这会导致“条件崩溃”，即条件信号主导潜在空间，降低生成样本的质量和多样性。因此，我们转而使用预训练的组织病理学VFM作为潜在空间本身，利用其编码丰富语义信息的patch-token特征。我们经验性地表明，这些特征经过$\ell_2$归一化，位于单位超球面$\mathcal{S}^{d-1}$上，具有强烈的角度主导性和内在曲率，使其自然适用于黎曼公式。因此，我们提出了STREAM，这是第一个在病理学领域应用黎曼流匹配的框架。STREAM包括两个阶段：1）一种桥式随机扰动，在$\mathcal{S}^{d-1}$上建立每个token的可整流性，用于在潜在空间中训练扩散变换器（DiT）；2）一种新颖的各向异性解码器，对速度场雅可比矩阵的低能量方向分配鲁棒性，同时保持其高能量方向的保真度。STREAM在乳腺癌和结直肠癌数据集上实现了最先进的重建和生成性能。代码将在接收后公开发布。

英文摘要

Synthetic histopathology image generation addresses critical challenges in computational pathology, including patient privacy and the growing need for large-scale training data for foundation models. Latent diffusion models have dominated the image generation domain, with recent works emphasizing that the choice of latent space is critical to the quality of generated images. Existing state-of-the-art generative models in histopathology use pretrained Vision Foundation Models (VFMs) as conditioning signals, and we observe that this leads to "conditioning collapse," where the conditioning signal dominates the latent space and lowers the quality and diversity of generated samples. Therefore, we instead use pretrained histopathology VFMs as the latent space itself, leveraging their patch-token features that encode rich semantic information. We empirically show that these features are $\ell_2$-normalized and lie on the unit hypersphere $\mathcal{S}^{d-1}$ with strong angular dominance and intrinsic curvature, making them naturally suited for a Riemannian formulation. We therefore present STREAM, the first framework to apply Riemannian flow matching in the pathology domain. STREAM consists of two stages: 1) a bridge-type stochastic perturbation that establishes per-token rectifiability on $\mathcal{S}^{d-1}$ for training a Diffusion Transformer (DiT) in latent space, and 2) a novel anisotropic decoder that allocates robustness to low-energy directions of the velocity-field Jacobian while preserving fidelity along its high-energy directions. Together, STREAM achieves state-of-the-art reconstruction and generation performance on breast and colorectal cancer datasets. The code will be publicly released upon acceptance.
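Because the features are said to lie on the unit hypersphere, interpolation paths for flow matching would naturally follow great-circle arcs rather than straight lines. The sketch below shows only that standard geodesic interpolation (spherical linear interpolation); it is not STREAM's bridge-type stochastic perturbation, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere (l2 normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def geodesic_point(x0, x1, t):
    """Point at fraction t along the great-circle arc from x0 to x1.

    Both inputs must be unit vectors; the arc is undefined for antipodal points.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    theta = math.acos(dot)  # angle between the endpoints
    if theta < 1e-8:  # endpoints coincide; nothing to interpolate
        return list(x0)
    s = math.sin(theta)
    a, b = math.sin((1 - t) * theta) / s, math.sin(t * theta) / s
    return [a * p + b * q for p, q in zip(x0, x1)]
```

Every intermediate point stays on the sphere, whereas a straight-line interpolation between two unit vectors would pass through its interior and shrink the norm.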


2606.07054 2026-06-08 cs.CL cs.AI cs.CR cs.LG cross-list

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

TRACE: 通过自适应跨步骤证据聚合的LLM智能体轨迹推理

Vijitha Mittapalli, Shreyaa Jayant Dani, Satya Srujana Pilli, Snigdha Ansu, Mohammadreza Teymoorianfard, Franck Dernoncourt, Hongjie Chen, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed

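Affiliations: University of Massachusetts at Amherst; Adobe Research; Dolby Labs; University of Oregon; Cisco

AI summary: Proposes TRACE, a framework whose TIJ loop identifies high-signal regions, accumulates evidence across steps, and synthesizes a trajectory-level verdict. On ten SHADE-Arena task domains it reaches an F1 of 0.713 and a recall of 0.844, with particular strength in long-range evidence linking.

Chinese abstract (AI translation)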

自主LLM智能体可以通过一系列单独良性的行动追求隐藏的恶意目标，这使得使用标准轨迹级监控难以检测破坏行为。现有方法要么一次性评估完整轨迹，要么将其划分为独立评分的窗口，限制了连接时间上相距较远的证据的能力。我们提出TRACE，一个用于长视界LLM智能体轨迹的监控框架。TRACE通过一个TIJ（分类-检查-判决）循环运行，该循环识别高信号区域，执行有针对性的检查，同时在各推理步骤间保持已累积的证据，并综合出轨迹级判决。我们在SHADE-Arena的十个任务域上评估TRACE，与最先进的基线进行比较。TRACE实现了0.713的总体F1分数和0.844的召回率，在需要长距离证据链接的任务上取得了最大的提升。

English abstract

Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.


2606.07074 2026-06-08 cs.LG cs.AI cross-list

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

SlimSearcher: 通过自适应奖励门控实现训练效率感知的Web代理

Zequn Xie, Junjie Wang, Dan Yang, Jie Feng, Yue Shen, Jian Wang, Jinjie Gu

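Affiliations: Zhejiang University; Ant Group

AI summary: Proposes SlimSearcher, a framework that uses Pareto-efficient filtration and Adaptive Reward Gating to cut tool-call rounds by 17%-58% while maintaining or improving accuracy.

Comments 17 pages, 8 figures

Chinese abstract (AI translation)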

深度研究代理在复杂信息寻求任务中展现了卓越能力，但这种能力伴随着高昂的计算成本。受准确率驱动训练范式的影响，当前模型采用蛮力策略，表现为盲目依赖工具和执行性推理——生成长而冗余的轨迹，这些对于解决任务远非必要，导致浪费的工具调用和过多的token消耗。为克服这一效率陷阱，我们提出SlimSearcher，一个原则性框架，在监督微调（SFT）和强化学习（RL）中推动准确率与计算成本之间的帕累托前沿。在SFT阶段，SlimSearcher采用帕累托高效过滤来提炼既成功又经济的轨迹，引导模型走向内在效率感知的搜索行为。在RL阶段，我们引入自适应奖励门控，一种动态奖励塑造机制，在采样队列中评估相对工具和token效率。通过将这些自适应效率指标与严格正确性门控级联，我们的方法有效避免了与绝对惩罚相关的简洁性偏差，并缓解了奖励黑客攻击。在包括GAIA、BrowseComp和XBenchDeepSearch在内的长时域基准上的大量实验表明，SlimSearcher在保持或提升准确率的同时，将平均工具调用轮次减少了17%-58%。

English abstract

Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that are far from necessary for resolving these tasks and leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.
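Illustrative sketch (not the authors' implementation): one way to read "relative efficiency within a sampled cohort, cascaded with a strict correctness gate" is shown below. The field names, the equal weighting of tool calls and tokens, and the `bonus_weight` value are all assumptions made for this example; the abstract does not specify the actual reward formula.

```python
from statistics import mean

def gated_rewards(cohort, bonus_weight=0.2):
    """Score rollouts sampled for the same task.

    cohort: list of dicts with 'correct' (bool), 'tool_calls' (int), 'tokens' (int).
    Incorrect rollouts get 0 regardless of cost (correctness gate); correct ones
    get 1 plus a bounded bonus or penalty relative to the other correct rollouts.
    """
    correct = [r for r in cohort if r["correct"]]
    if not correct:
        return [0.0] * len(cohort)
    avg_calls = mean(r["tool_calls"] for r in correct)
    avg_tokens = mean(r["tokens"] for r in correct)
    rewards = []
    for r in cohort:
        if not r["correct"]:
            rewards.append(0.0)  # short but wrong answers earn nothing
            continue
        call_gain = (avg_calls - r["tool_calls"]) / max(avg_calls, 1e-9)
        token_gain = (avg_tokens - r["tokens"]) / max(avg_tokens, 1e-9)
        efficiency = max(-1.0, min(1.0, 0.5 * (call_gain + token_gain)))
        rewards.append(1.0 + bonus_weight * efficiency)
    return rewards
```

Because efficiency is measured against the cohort rather than against a fixed budget, a hard task that needs many tool calls is not penalized in absolute terms, and the gate keeps a cheap wrong answer from outscoring an expensive correct one.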


2606.07080 2026-06-08 cs.SD cs.AI eess.AS cross-list

OffQ: Taming Structured Outliers in LLM Quantization via Offsetting

Haoqi Wang, Lorenz K. Mueller, Jiawei Zhuang, Mathieu Salzmann, Lukas Cavigelli

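Affiliations: School of Computer and Communication Sciences, EPFL, Switzerland; Huawei, Switzerland; Swiss Data Science Center, ETHZ & EPFL, Switzerland

AI summary: Proposes OffQ, which identifies the outlier subspace with top-1 PCA, concentrates the outliers into one channel by rotation, and converts that channel into a shared offset. This enables low-bit uniform quantization of LLMs and improves accuracy under W4A4KV4.

Chinese abstract (AI translation)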

低比特量化已被广泛采用，通过显著降低计算成本和内存使用来加速大型语言模型（LLM）的推理。然而，激活异常值对有效量化构成了重大挑战，常常导致显著的性能下降。在本文中，我们介绍了OffQ，一种通过新颖的偏移机制来缓解低比特量化中激活异常值的方法。具体来说，OffQ首先使用提出的top-1 PCA识别激活中的低维异常值子空间，然后通过旋转将高幅度激活集中到1个通道中。OffQ随后通过将其幅度转换为共享偏移来吸收这个集中的异常值通道，从而降低激活的标准差。这种偏移策略使得使用部署友好的均匀网格和均匀精度量化对LLM进行有效的W4A4KV4量化成为可能。在多种LLM架构和基准上的广泛实验表明，OffQ优于最先进的基线，在保持低比特效率的同时持续提高模型精度。

English abstract

Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effective quantization, often leading to notable performance degradation. In this paper, we introduce OffQ, a method designed to mitigate activation outliers in low-bit quantization through a novel offsetting mechanism. Specifically, OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates high-magnitude activations into 1 channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs using deployment-friendly uniform-grid and uniform-precision quantization. Extensive experiments across diverse LLM architectures and benchmarks demonstrate that OffQ outperforms state-of-the-art baselines, consistently improving model accuracy while preserving low-bit efficiency.


2606.07117 2026-06-08 cs.CV cs.AI cross-list

Text Supervision Enhances Geospatial Representations in Vision-Language Models

Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya, Cheng Yaw Low, Virgilio Almeida, Meeyoung Cha

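Affiliations: University of São Paulo; National University of Singapore

AI summary: Studies the geospatial representation ability of vision, vision-language, and multimodal models, and finds that text supervision effectively improves spatial encoding, advancing geospatial AI.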

Comments Accepted at ICML 2026

2606.07181 2026-06-08 cs.LG cs.AI q-bio.MN cross-list

RETROSPECT: RETROsynthesis via Sequential Prediction, and Chemically Transformed-ranking

RETROSPECT: 通过序列预测和化学变换排序的逆合成

Raja Sekhar Pappala, Shreyas Vinaya Sathyanarayana, Ronit Kumar Choudhary, Arjun Verma, Deepak Warrier

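Affiliations: University of California, Berkeley

AI summary: Proposes RETROSPECT, which decomposes single-step retrosynthesis into candidate generation and reranking, combining the ChemAlign Transformer generator with a LambdaMART reranker. The generator reaches 55.00% top-1 accuracy on USPTO-50K.

Comments Accepted at the AI for Science workshop (ICML 2026)

Chinese abstract (AI translation)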

单步逆合成既需要准确的首位建议，也需要足够丰富的候选列表以供下游选择。我们将其研究为提议-选择分解。我们的系统RETROSPECT结合了一个单一的Transformer提议模型（我们称之为ChemAlign Transformer）和一个基于结构、反应模板、上游分数以及可选的DFT衍生描述符的LambdaMART重排序器。生成器使用混合根对齐和随机SMILES增强、预层归一化、绑定嵌入、指数移动平均权重以及可微的原子平衡辅助损失进行训练。在包含5,007个反应的完整USPTO-50K测试集上，生成器达到55.00%的top-1和86.18%的top-10精确匹配准确率，top-1有效率为99.86%。在用于重排序的合并候选池基准上（包含5,007个测试产物，每个产物约111个候选），基于结构特征集训练的LambdaMART模型达到59.4%的top-1和0.7171的平均倒数排名。特征消融实验表明，上游提议分数和模板频率统计提供了大部分重排序信号，而DFT和反应中心DFT特征提供的增益较小且不一致。这些结果支持逆合成的模块化观点：更强的单模型提议和学习候选选择是互补的，并且提议模型可以作为集成系统（如RetroChimera (Maziarz et al., 2024)）的即插即用组件。

English abstract

Single-step retrosynthesis needs both accurate first-ranked suggestions and candidate lists that are rich enough for downstream selection. We study this as a proposal-selection decomposition. Our system, RETROSPECT, combines a single Transformer proposal model, which we call the ChemAlign Transformer, with a LambdaMART reranker over structural, reaction-template, upstream-score, and optional DFT-derived descriptors. The generator is trained with hybrid root-aligned and random SMILES augmentation, Pre-LayerNorm, tied embeddings, exponential moving average weights, and a differentiable atom-balance auxiliary loss. On the full USPTO-50K test set of 5,007 reactions, the generator reaches 55.00% top-1 and 86.18% top-10 exact-match accuracy with 99.86% top-1 validity. On the merged candidate-pool benchmark used for reranking, which contains 5,007 test products and about 111 candidates per product, a LambdaMART model trained on the structural feature set reaches 59.4% top-1 with 0.7171 mean reciprocal rank. Feature ablations show that upstream proposal score and template-frequency statistics provide most of the reranking signal, while DFT and reaction-center DFT features provide smaller and less consistent gains. These results support a modular view of retrosynthesis: stronger single-model proposal and learned candidate selection are complementary, and the proposal model can serve as a drop-in component for ensemble systems such as RetroChimera (Maziarz et al., 2024).
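Illustrative sketch: the two headline metrics in this abstract, top-k exact-match accuracy and mean reciprocal rank (MRR), can be computed from ranked candidate lists as below. The toy strings stand in for canonicalized reactant SMILES; the canonicalization step and the function names are assumptions of this example and are not taken from the paper.

```python
def top_k_accuracy(ranked_lists, targets, k):
    """Fraction of products whose ground truth appears among the first k candidates."""
    hits = sum(1 for cands, t in zip(ranked_lists, targets) if t in cands[:k])
    return hits / len(targets)

def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1 / rank of the ground truth; a miss contributes 0."""
    total = 0.0
    for cands, t in zip(ranked_lists, targets):
        if t in cands:
            total += 1.0 / (cands.index(t) + 1)
    return total / len(targets)
```

A reranker can raise top-1 accuracy without changing top-k for large k, because it only reorders a pool that already contains the correct answer. This is why the abstract reports the generator and the reranker on separate benchmarks.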


2606.07211 2026-06-08 cs.RO cs.AI cross-list

An Abstract Architecture for Explainable Autonomy in Hazardous Environments

危险环境中可解释自主性的抽象架构

Matt Luckcuck, Hazel M Taylor, Marie Farrell

Affiliations: Maynooth University; University of Manchester

AI summary: Presents an abstract architecture that supports an autonomous system explaining its behaviour, aiming to strengthen user trust by designing explainability in, with a worked example from the civil nuclear industry.

Comments Originally published 20th of October 2022 at the Second International Workshop on Requirements Engineering for Explainable Systems (RE4ES), which was hosted by the International Requirements Engineering Conference 2022

DOI: 10.1109/REW56159.2022.00027

Chinese abstract (AI translation)

自主机器人系统被提议用于危险环境，通常是为了减少人类工人的风险。在不久的将来，人类工人可能会继续使用和指挥这些自主机器人，就像其他计算机化工具一样，但具有更复杂的决策能力。因此，工程努力的一个重要方向是确保这些用户信任系统。最近的文献表明，可解释性与系统的可信度密切相关。与安全性和保密性属性一样，可解释性应该被设计到系统中，而不是事后添加。本文提出了一种抽象架构，支持自主系统解释其行为（可解释自主性），为实施可解释自主系统提供了设计模板。我们给出了一个工作示例，说明我们的架构如何应用于民用核工业，其中工人和监管机构都需要信任系统的决策能力。

English abstract

Autonomous robotic systems are being proposed for use in hazardous environments, often to reduce the risks to human workers. In the immediate future, it is likely that human workers will continue to use and direct these autonomous robots, much like other computerised tools but with more sophisticated decision-making. Therefore, one important area on which to focus engineering effort is ensuring that these users trust the system. Recent literature suggests that explainability is closely related to how trustworthy a system is. Like safety and security properties, explainability should be designed into a system, instead of being added afterwards. This paper presents an abstract architecture that supports an autonomous system explaining its behaviour (explainable autonomy), providing a design template for implementing explainable autonomous systems. We present a worked example of how our architecture could be applied in the civil nuclear industry, where both workers and regulators need to trust the system's decision-making capabilities.


2606.07222 2026-06-08 cs.CV cs.AI cross-list

DualGate-Net: A Prior-Gated Dual-Encoder Framework for Histopathology Cell Detection

DualGate-Net: 用于组织病理学细胞检测的先验门控双编码器框架

Bahman Jafari Tabaghsar, Son Tran, K. Devaraja, Atul Sajjanhar

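Affiliations: School of Information Technology, Deakin University; Kasturba Medical College, Manipal Academy of Higher Education

AI summary: Proposes DualGate-Net, which adaptively regulates the influence of tissue priors through a learnable prior-gated fusion mechanism and combines local and global encoders with auxiliary branches, achieving robust cell detection on the OCELOT benchmark.

Comments 15 pages, 4 figures

Chinese abstract (AI translation)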

组织病理学图像中的细胞检测强烈依赖于周围组织背景，其中视觉上相似的细胞在不同微环境下可能属于不同类别。最近的感知组织方法结合了上下文先验，但通常依赖于可能传播噪声信息的静态融合策略。在这项工作中，我们提出了DualGate-Net，一种先验感知的双编码器框架，通过可学习的先验门控融合机制结合了基于ConvNeXtV2的局部编码器和基于SegFormer的全局编码器。所提出的模块自适应地调节组织先验在空间位置上的影响，同时一个辅助的前景重建分支在训练过程中保留高频细胞结构。此外，还引入了辅助的细胞性引导线索以进一步提高定位鲁棒性。在OCELOT基准上的实验表明，该方法在验证集上取得了0.7722的宏F1分数，在测试集上取得了0.7345的宏F1分数，突显了自适应先验整合对于稳健的组织病理学细胞检测的有效性。

English abstract

Cell detection in histopathology images strongly depends on surrounding tissue context, where visually similar cells may belong to different classes under different microenvironments. Recent tissue-aware methods incorporate contextual priors, but often rely on static fusion strategies that may propagate noisy information. In this work, we propose DualGate-Net, a prior-aware dual-encoder framework that combines a ConvNeXtV2-based local encoder and a SegFormer-based global encoder through a learnable prior-gated fusion mechanism. The proposed module adaptively regulates the influence of tissue priors across spatial locations, while an auxiliary foreground reconstruction branch preserves high-frequency cellular structures during training. In addition, auxiliary cellness-guided cues are incorporated to further improve localization robustness. Experiments on the OCELOT benchmark demonstrate consistent improvements, achieving macro F1-scores of 0.7722 on the validation set and 0.7345 on the test set, highlighting the effectiveness of adaptive prior integration for robust histopathology cell detection.


2606.07237 2026-06-08 cs.CL cs.AI cs.LG cross-list

When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

当大型语言模型在医疗保健中失败：评估对提示变化的敏感性

Mahdi Alkaeed

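Affiliations: Department of Computer Science and Engineering, Doha, Qatar

AI summary: Systematically analyzes how sensitive general-purpose and medical-specific LLMs are to prompt perturbations. Even minor rewording can change clinical advice and adversarial prompts can provoke harmful outputs, indicating that these models are not reliable for clinical use.

Comments 12 pages

Chinese abstract (AI translation)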

大型语言模型（LLM）越来越多地用于医疗保健任务，如临床问答、诊断支持和报告总结。尽管前景广阔，但这些模型对微小的提示扰动（包括词汇和句法）仍然高度敏感，在安全关键的临床应用中构成严重风险。在本研究中，我们使用MedMCQA基准进行了系统的敏感性分析，以评估通用（例如GPT-3.5、Llama3）和医学专用LLM（例如ClinicalBERT、BioLlama3、BioBERT）的鲁棒性。我们将扰动分为自然和对抗两种类型，并检查它们对临床推理任务中模型一致性、准确性和可靠性的影响。我们的发现表明，医学LLM并非本质安全。即使是措辞的微小变化也可能改变临床建议，而针对性的对抗性提示可能引发有害输出。在医疗保健等高风险环境中，这种不可预测性是不可接受的——模型因重新措辞的输入而改变诊断，或因轻微改写而幻觉药物，临床医生无法可靠地信任它们。虽然模型通常对简单的词汇替换或释义表现出韧性，但在句法重新排序或误导性上下文线索下往往会崩溃。这种脆弱性在通用和领域专用LLM中都很明显。值得注意的是，对抗性操作可能导致临床危险的输出，例如推荐不正确的剂量或遗漏关键发现。

English abstract

Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and report summarization. Despite their promise, these models remain highly sensitive to subtle prompt perturbations, both lexical and syntactic, posing serious risks in safety-critical clinical applications. In this study, we conduct a systematic sensitivity analysis to evaluate the robustness of both general-purpose (e.g., GPT-3.5, Llama3) and medical-specific LLMs (e.g., ClinicalBERT, BioLlama3, BioBERT) using the MedMCQA benchmark. We categorize perturbations into natural and adversarial types and examine their effect on model consistency, accuracy, and reliability in clinical reasoning tasks. Our findings reveal that medical LLMs are not intrinsically safe. Even minor variations in phrasing can alter clinical advice, and targeted adversarial prompts can provoke harmful outputs. In high-stakes settings like healthcare, such unpredictability is unacceptable: models that change diagnoses due to reworded inputs or hallucinate medications when slightly rephrased cannot be reliably trusted by clinicians. While models tend to show resilience to simple lexical substitutions or paraphrasing, they often break down under syntactic reordering or misleading contextual cues. This fragility is evident across both general-purpose and domain-specific LLMs. Notably, adversarial manipulations can lead to clinically dangerous outputs, such as recommending incorrect dosages or omitting critical findings.


2606.07244 2026-06-08 cs.RO cs.AI cs.CV cross-list

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

超越航点：面向视觉语言导航的轨迹中心航点范式

Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie

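Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory

AI summary: Proposes the Trajectory Waypoint paradigm, which predicts executable trajectories with a TSDF-guided diffusion policy to address unreachable waypoints and planning-control inconsistency in VLN-CE, and outperforms the baselines on the benchmark.

Chinese abstract (AI translation)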

连续环境中的视觉语言导航（VLN-CE）要求智能体在类似真实世界的环境中遵循自然语言指令进行导航。大多数VLN-CE方法采用三阶段框架：航点预测器提出可导航航点，导航器选择最佳航点，低层控制器执行移动。然而，这种解耦范式常导致航点不可达或规划与控制不一致。本文提出一种称为轨迹航点的新范式，将每个候选航点锚定到可执行轨迹上。为此，我们设计了TSDF引导的扩散策略作为轨迹航点预测器，引导轨迹生成避开障碍物，从本质上保证预测航点的可达性。进一步提出轨迹增强导航器，将关联轨迹作为额外信息注入规划，实现高层语义决策与低层执行的严格一致性。在VLN-CE基准上的大量实验表明，我们的轨迹航点范式优于基线方法。

English abstract

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, a navigator selects the best waypoint, and a low-level controller executes the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.


2606.07245 2026-06-08 cs.CY cs.AI cross-list

AI Sovereignty: A Qualitative Model of Strategic Competition as AI Becomes an Instrument of National Power

AI主权：当AI成为国家力量工具时的战略竞争定性模型

Timothy Clancy, Asmeret Naugle

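Affiliations: Applied Research Laboratory for Intelligence and Security (ARLIS); Sandia National Laboratories

AI summary: Offers definitions of AI sovereignty and a first-of-its-kind qualitative model with micro, meso, and macro contributors. It analyzes the dynamics of AI-driven strategic competition between nations and identifies key leverage points (such as accelerators, electricity, and data) and how they can be activated through direct kinetic and indirect non-kinetic actions.

Comments Main article: 19 pages, 10 figures. Supplementary: 19 pages, 7 figures, 7 tables. To be presented at the 2026 International System Dynamics Conference (ISDC), July 20-24, TU Delft, Delft, Netherlands

Chinese abstract (AI translation)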

AI主权是一个国家独立控制其人工智能（AI）技术的程度。对日益复杂的前沿AI模型的竞争具有越来越重要的战略意义，各国正在考虑AI如何改善其经济状况、竞争优势和整体国家实力。然而，AI主权的成本巨大，我们缺乏定义和概念模型来应对不断演变的AI主权动态。我们通过提出与AI主权相关的定义，以及一个首次包含微观、中观和宏观因素的定性模型来填补这一空白。基于模型的定性预测突出了竞争动态和AI驱动国家实力的潜在演变。该模型识别了各国可用于增强自身增长或削弱对手的关键杠杆点，包括考虑加速器、电力、水、数据集和熟练劳动力。这些杠杆点可以通过直接动能行动（如伊朗用无人机瞄准数据中心）和间接非动能效应（包括网络、太空、信息、经济胁迫和外交）在战略和操作层面激活。如果我们的假设和假说成立，这种战略竞争可能将定义21世纪各国如何改善其经济状况、竞争优势和整体国家实力。

English abstract

AI sovereignty is the extent to which a nation independently controls its artificial intelligence (AI) technologies. The race toward ever-more-sophisticated frontier AI models is of increasing strategic importance, with nations considering how AI might improve their economic situations, competitive advantage, and overall national power. However, the costs of AI sovereignty are enormous, and we lack definitions and conceptual models to navigate evolving AI sovereignty dynamics. We address this gap with definitions relevant to AI sovereignty, along with a first-of-its-kind qualitative model that incorporates micro, meso, and macro contributors. Model-based qualitative forecasts highlight competitive dynamics and evolving potential for AI-driven national power. The model identifies key leverage points that nations can use to enhance their own growth or degrade an adversary's, including consideration of accelerators, electricity, water, data sets and skilled workforce. These leverage points can be activated at strategic and operational levels through both direct kinetic actions, such as Iran's targeting of data centers with drones, and indirect non-kinetic effects including cyber, space, information, economic coercion and diplomacy. If our assumptions and hypotheses are valid, this strategic competition may come to define how nations improve their economic situations, competitive advantage, and overall national power in the 21st Century.


2606.07271 2026-06-08 cs.LG cs.AI cs.SD cross-list

Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path

整流流泄漏之处：沿插值路径表征成员信号

Thomas Sesmat, Gabriel Meseguer-Brocal, Geoffroy Peeters

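Affiliations: University of Amsterdam

AI summary: Analyzes membership signals of training data along the interpolation path of Rectified Flows. The reconstruction gap between train and test data follows a bell-shaped curve whose peak location is derived under Gaussian assumptions; the structure is validated as universal and exploited for a membership inference attack.

Comments ICML 2026 article, 9 main pages and 25 with annexes, 11 figures

Journal ref: 43rd International Conference on Machine Learning, Seoul, South Korea, 2026

Chinese abstract (AI translation)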

理解生成模型从训练数据中保留了什么仍然具有挑战性，这对版权和隐私有影响。除了逐字复制外，模型可以编码训练数据中更微妙的痕迹，这些痕迹从未出现在输出中，但仍可利用。我们针对整流流（Rectified Flows）研究了这一机制，整流流越来越多地用于部署的生成系统。我们分析了定义整流流训练的插值路径 $X_\lambda = (1-\lambda)X_0 + \lambda X_1$。我们展示了训练数据和测试数据的重建之间存在一个差距，该差距在 $\lambda$ 上呈钟形曲线，并在训练过程中累积，而验证指标保持稳定。该信号有一个最大值，我们在高斯假设下推导出其位置的闭式解。我们在音频和图像上验证了这些预测，并表明钟形结构是普遍的，而峰值预测在我们的假设满足时成立。作为概念验证，我们利用这种特定的 $\lambda$ 解析结构进行成员推断攻击，区分训练集的成员和非成员。

English abstract

Understanding what generative models retain from training data remains challenging, with implications for copyright and privacy. Beyond verbatim reproduction, models can encode subtler traces of their training data that never surface in their outputs yet remain exploitable. We study this regime for Rectified Flows, which are increasingly used in deployed generative systems. We analyse the interpolation path $X_\lambda = (1-\lambda)X_0 + \lambda X_1$ that defines the Rectified Flow training. We show that a gap exists between the reconstruction of train and test data that follows a bell-shaped curve over $\lambda$, which accumulates during training, while the validation metrics remain stable. The signal has a maximum whose location we derive in closed form under Gaussian assumptions. We validate these predictions on both audio and images and show that the bell-shaped structure is universal, while the peak prediction holds when our assumptions are satisfied. As a proof of concept, we exploit this specific $\lambda$-resolved structure to perform a Membership Inference Attack, distinguishing members of the training set from non-members.
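Illustrative sketch (not the authors' code): the interpolation path $X_\lambda = (1-\lambda)X_0 + \lambda X_1$ and a $\lambda$-resolved train/test gap can be written in a few lines. The sketch assumes $X_0$ is Gaussian noise and $X_1$ is a data point, and it uses the squared velocity-regression error as a stand-in for the paper's reconstruction measure, which the abstract does not specify. `model` is a hypothetical callable `(x, lam) -> predicted velocity`.

```python
import random
from statistics import mean

def interpolate(x0, x1, lam):
    """X_lambda = (1 - lambda) * X_0 + lambda * X_1, element-wise."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Derivative of the straight path with respect to lambda: X_1 - X_0."""
    return [b - a for a, b in zip(x0, x1)]

def mean_loss(model, dataset, lam, n_noise=16, seed=0):
    """Average squared velocity error at a fixed lambda over a dataset."""
    rng = random.Random(seed)  # identical noise draws for every dataset
    losses = []
    for x1 in dataset:
        for _ in range(n_noise):
            x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # assumption: X_0 is noise
            pred = model(interpolate(x0, x1, lam), lam)
            target = velocity_target(x0, x1)
            losses.append(mean((p - t) ** 2 for p, t in zip(pred, target)))
    return mean(losses)

def train_test_gap(model, train_set, test_set, lam):
    """Positive values mean training points are fitted better at this lambda."""
    return mean_loss(model, test_set, lam) - mean_loss(model, train_set, lam)
```

Sweeping `lam` over a grid in (0, 1) and plotting `train_test_gap` is the kind of curve the abstract describes as bell-shaped; a membership score for a single sample could then be its loss near the peak.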


2606.07309 2026-06-08 cs.SD cs.AI cs.CL cross-list

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

语音情感识别中音频语言模型的声学线索对齐

Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller

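Affiliations: DFG's Reinhart Koselleck project; EU H2020 project

AI summary: Studies how audio language models use explicit acoustic cues by deriving six interpretable acoustic concept tokens from eGeMAPS features. Aligned tokens improve UAR while perturbed tokens reduce performance; the models are sensitive to the symbolic cues but remain partly anchored to the audio signal.

Comments 6 pages, 3 figures, 3 tables

Chinese abstract (AI translation)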

指令跟随音频语言模型（ALMs）可以通过显式的声学线索进行增强，但在原始音频已经可用的情况下，这些线索是否以接地的方式被使用仍不清楚。我们通过从标准化的eGeMAPS副语言特征集中推导出六个可解释的声学概念标记来研究语音情感识别（SER）中的这一问题。这些标记总结了能量、音高、动态、亮度、共振峰和语音质量，并被附加到文本提示中，同时保持音频输入不变。在广泛使用的FAU-Aibo和IEMOCAP基准测试中，对齐的标记提高了未加权平均召回率（UAR），而打乱、冲突或损坏的标记相对于对齐标记降低了性能，并将混淆转向中性。重要的是，在强标记扰动下预测不会崩溃，这表明模型对符号线索通道敏感，但部分仍锚定于音频信号。我们认为，仅标记干预提供了一种实用的方法来探测基于ALM的情感计算中音频接地线索的使用、鲁棒性和可解释性。

English abstract

Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.


2606.07311 2026-06-08 cs.CV cs.AI cross-list

CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models

CULTURESCORE: 评估视频生成模型的文化忠实度

Anku Rani, Wei Dai, Shravan Nayak, Pattie Maes, Mahdi M. Kalayeh, Paul Pu Liang

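Affiliations: Massachusetts Institute of Technology; Mila – Quebec AI Institute; Netflix

AI summary: Proposes CultureScore, a framework that evaluates the cultural faithfulness of generated video along three dimensions (Identity, Context, and Behavior). The best current model scores only 56.8%, and Behavior is the hardest dimension.

Chinese abstract (AI translation)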

随着Veo 3.1和LTX-2等视频生成模型的进步，它们准确表现多元全球文化的能力仍是一个关键但研究不足的前沿。当前的指标如VideoScore仅衡量视觉质量，无法评估文化忠实度。因此，一个将合十礼替换为握手的模型与正确生成该手势的模型获得相同的分数。我们提出CultureScore，一个将文化忠实度分解为三个细粒度维度的组合评估框架：身份（谁被代表）、背景（文化本地化背景）和行为（规范性手势和互动）。我们通过一个覆盖10个国家的评估套件来实施该框架，在三个最先进的模型上生成了6,180个视频。我们的评估显示，当前没有模型能够实现文化忠实的视频生成：表现最好的模型整体CultureScore仅为56.8%，其中行为是最具挑战性的维度，所有模型在该维度上均低于52%。此外，人类偏好排序与CultureScore方向一致，但与VideoScore相反；在视觉质量上得分最高的模型被标注者排在最后，这强调了文化忠实度是公平视频生成的基本标准。

English abstract

As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical yet understudied frontier. Current metrics, such as VideoScore, only measure visual quality but offer no mechanism for assessing cultural faithfulness. Consequently, a model that replaces a Namaste with a handshake receives the same score as one that generates the gesture correctly. We propose CultureScore, a compositional evaluation framework that decomposes cultural faithfulness into three granular dimensions: Identity (who is represented), Context (culturally localized background), and Behavior (normative gestures and interactions). We operationalize this framework through an evaluation suite spanning 10 countries, yielding 6,180 generated videos across three state-of-the-art models. Our evaluation reveals that no current model achieves culturally faithful video generation: the best-performing model reaches only 56.8% overall CultureScore, with Behavior the most challenging dimension, which remains below 52% across all models. Furthermore, human preference rankings align directionally with CultureScore but are inverted relative to VideoScore; the highest-scoring model on visual quality was ranked last by annotators, underscoring that cultural faithfulness is an essential criterion for equitable video generation.


2606.07313 2026-06-08 cs.CL cs.AI cross-list

SV-Detect: AI-generated Text Detection with Steering Vectors

SV-Detect: 基于引导向量的AI生成文本检测

Mikhail Vishnyakov, Tatiana Gaintseva

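Affiliations: Independent Researcher; Queen Mary University of London

AI summary: Extracts steering vectors from the hidden representations of a frozen language model and trains a lightweight classifier on layer-wise projection features, detecting machine-generated text across domains, source models, and editing attacks.

Chinese abstract (AI translation)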

检测机器生成文本在分布偏移（如跨域、源模型和编辑攻击的迁移）下尤其困难。我们提出了一种基于从冻结语言模型的隐藏表示中提取的引导向量的假文本检测器。在每一层，我们构建一个分离人类编写文本和机器生成文本的方向，并通过每个输入与这些方向的逐层对齐来表示输入。在这些投影特征上训练的轻量分类器产生最终的检测分数。我们的方法在分布内和分布偏移下均表现出色，包括跨域、跨源模型以及机器编辑转换（如润色和重写）。解释分析表明，学习到的方向与可识别的风格线索一致，同时捕获了超越表面特征的显著额外信号。这些结果将假文本检测定位为表示空间探测问题，并表明引导向量提供了一种简单有效的解决方案。

English abstract

Detecting machine-generated text is especially difficult under distribution shift, such as transfer across domains, source models, and editing attacks. We propose a fake-text detector based on steering vectors extracted from the hidden representations of a frozen language model. At each layer, we construct a direction that separates human-written from machine-generated text, and represent each input by its layer-wise alignment with these directions. A lightweight classifier trained on these projection features yields the final detection score. Our method achieves strong performance both in-distribution and under distribution shift, including across domains, source models, and machine-editing transformations such as polishing and rewriting. Interpretation analyses show that the learned directions align with recognizable stylistic cues while capturing substantial additional signal beyond surface features. These results position fake-text detection as a representation-space probing problem and show that steering vectors provide a simple and effective solution.
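Illustrative sketch (not the authors' implementation): a common way to build a per-layer direction that separates two classes is the normalized difference of class means. The abstract does not say how SV-Detect constructs its directions, so the difference-of-means choice, the function names, and the data layout below are assumptions of this example. Hidden states are plain lists of floats, one pooled vector per text per layer.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(human_states, machine_states):
    """Unit direction pointing from the human mean to the machine mean."""
    mh, mm = mean_vector(human_states), mean_vector(machine_states)
    diff = [m - h for m, h in zip(mm, mh)]
    norm = math.sqrt(sum(x * x for x in diff)) or 1.0  # guard against a zero vector
    return [x / norm for x in diff]

def projection_features(hidden_by_layer, directions):
    """One scalar per layer: dot product of the input's state with that layer's direction."""
    return [
        sum(h * d for h, d in zip(state, direction))
        for state, direction in zip(hidden_by_layer, directions)
    ]
```

The resulting feature vector has one entry per layer, so any small classifier (for example logistic regression) can be trained on it while the language model itself stays frozen.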


2606.07316 2026-06-08 cs.MA cs.AI cs.DC cross-list

Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration

面向拜占庭鲁棒的大语言模型智能体协作的分层认证语义承诺

Haoran Xu, Lei Zhang, Iadh Ounis, Xianbin Wang

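Affiliations: University of Glasgow; University of Western Ontario

AI summary: Proposes the H-CSC protocol, which converts embedding-derived finality signals into one of three typed outcomes (semantic commit, verdict commit, or explicit abort), providing finality control for Byzantine collaboration among LLM agents. It commits with low angular deviation under semantic poisoning and aborts beyond-BFT rounds as intended.

Comments 27 pages, 3 figures, 8 tables

Chinese abstract (AI translation)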

大语言模型智能体间的拜占庭协作需要一个最终性控制原语：给定已交付的随机、结构化自然语言提案，协议必须决定该轮是否支持提交、何种提交或带类型的显式安全中止。朴素聚合将这一选择隐藏在单一判决背后；经典拜占庭容错将其隐藏在字节同一性背后，而LLM提案不满足该要求。我们提出分层认证语义承诺（H-CSC），一种受BFT启发的协议，将判决条件分组的提案上的嵌入派生最终性信号转换为三种类型结果之一：语义承诺（一个2f+1的判决内语义核心支持该判决，发出量化聚合上的参数绑定摘要）、判决承诺（强判决边缘但分散的语义理由，发出判决级证书而不声明语义聚合）或带类型原因的显式中止。贡献在于类型化最终性，而非原始提交准确性。在受控语义投毒诊断（BCS_v1，120个片段）上，H-CSC在BFT可行桶上以低角度偏差（0.31至2.04度）提交，并按设计100%中止超出BFT的轮次（n<3f+1）。在真实LLM智能体声明验证基准（MVR-50，50个任务）上，面对配对静态和急袭拜占庭攻击，H-CSC提交率为0.90/0.92，诚实引用无效率为0.02/0.00，统计上与强证书发出的仅判决基线匹配。不同于该基线，H-CSC还在74%/72%的轮次上发出嵌入支持的语义承诺摘要，提供类型化来源。严格语义消融仅提交0.54/0.48，表明判决级回退对于在相同≤0.04安全底线下的覆盖率（+0.36/+0.44）是必要的；跨四个LLM的100任务跨模型检查将无效hmaj保持在0.00至0.03内。

English abstract

Byzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured natural-language proposals, the protocol must decide whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggregation hides this choice behind a single verdict; classical Byzantine fault tolerance hides it behind byte-identity that LLM proposals do not satisfy. We introduce Hierarchical Certified Semantic Commitment (H-CSC), a BFT-inspired protocol that converts embedding-derived finality signals over verdict-conditioned proposal groups into one of three typed outcomes: a semantic_commit (a 2f+1 within-verdict semantic core backs the verdict, emitting a parameter-bound digest over the quantised aggregate), a verdict_commit (strong verdict margin but dispersed semantic rationale, emitting a verdict-level certificate without claiming a semantic aggregate), or an explicit abort with a typed reason. The contribution is typed finality, not raw commit accuracy. On a controlled semantic-poisoning diagnostic (BCS_v1, 120 episodes), H-CSC commits with low angular deviation on BFT-feasible buckets (0.31 to 2.04 degrees) and aborts 100% of beyond-BFT rounds (n<3f+1) as intended. On a real LLM-agent claim-verification benchmark (MVR-50, 50 tasks) under paired static and rushing Byzantine attacks, H-CSC commits 0.90/0.92 with honest-reference-invalid rates of 0.02/0.00, statistically matching a strong certificate-emitting verdict-only baseline. Unlike that baseline, H-CSC also emits an embedding-backed semantic_commit digest on 74%/72% of rounds, supplying typed provenance. A strict-semantic ablation commits only 0.54/0.48, showing the verdict-level fallback is necessary for coverage (+0.36/+0.44) at the same <=0.04 safety floor; a 100-task cross-model check across four LLMs preserves invalid_hmaj within 0.00 to 0.03.
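Illustrative sketch (not the H-CSC protocol itself): the three typed outcomes can be pictured as a small decision function over one round of proposals. The $n \ge 3f+1$ feasibility check and the $2f+1$ quorum come from the abstract; everything else is an assumption of this example, including the cosine threshold, the use of a group centroid to define the "semantic core", the rule that a verdict needs $2f+1$ votes, and the reason strings.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def decide_round(proposals, f, sim_threshold=0.9):
    """proposals: list of (verdict, embedding) pairs, one per agent; f: tolerated faults."""
    n = len(proposals)
    if n < 3 * f + 1:
        return ("abort", "beyond_bft")           # too few agents for f faults
    groups = defaultdict(list)
    for verdict, emb in proposals:
        groups[verdict].append(emb)
    verdict, embs = max(groups.items(), key=lambda kv: len(kv[1]))
    quorum = 2 * f + 1
    if len(embs) < quorum:
        return ("abort", "no_verdict_quorum")    # no verdict has enough support
    centroid = [sum(col) / len(embs) for col in zip(*embs)]
    core = sum(1 for e in embs if cosine(e, centroid) >= sim_threshold)
    if core >= quorum:
        return ("semantic_commit", verdict)      # agreeing verdict and rationale
    return ("verdict_commit", verdict)           # agreeing verdict, dispersed rationale
```

The point of the typed result is that a caller can tell "the agents agree on the answer and on why" apart from "the agents agree on the answer only", instead of receiving a single undifferentiated verdict.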


2606.07325 2026-06-08 math.ST cs.AI cs.IT math.IT stat.TH cross-list

A Temporal Spatial Minimax Rate for Smoothly-Varying Distributions in Wasserstein Space

Wasserstein空间中平滑变化分布的时空极小极大速率

Munsik Kim

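Affiliations: University of California, Berkeley

AI summary: Studies the minimax rate for estimating a future value of a curve in Wasserstein space from finitely many noisy snapshots of its past, giving a temporal-spatial lower bound with matching upper bounds at $k=0$ and in a translation submodel.

Chinese abstract (AI translation)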

我们研究了在$2$-Wasserstein空间$\mathcal{P}_2(\mathbb{R}^d)$中，从过去有限个噪声快照估计曲线$t\mapsto\mu_t$的未来值$\mu_{t_n+h}$的极小极大速率，在速度场的$k$阶协变导数满足绝热界$\|\nabla_t^k v\|\le\varepsilon$的条件下。我们的核心结果是统一的时空极小极大下界：在正则的、局部传输丰富的子类上，每个估计量都会遭受$W_2$-风险，其$M$-指数为$\gamma_d(k+1)/(k+1+\gamma_d)$，其中$\gamma_d=\min(1/d,1/2)$（$M$为总样本量）。该下界源于时空约化：光滑性预算定义了一个可达的$W_2$-球，沿时间轴嵌入一个传输填充，整个快照实验的信息由Fano论证控制——空间填充是经典的，但其光滑性容许的时间嵌入和全窗口分析是新的。该界插值了一个与维数无关的外推下限$\varepsilon h^{k+1}$——即使过去完全已知，未来不可观测的不可约代价——以及空间估计的维数灾难$M^{-\gamma_d}$，当$k\to\infty$时恢复静态分布估计速率。我们以设计依赖的形式陈述下界——具有设计加权的有效样本量——适用于任意观测时间，并在密集（等间距）情形下得到闭式指数。匹配的上界在$k=0$（速率$M^{-1/(d+1)}$，$d\ge3$）和平移子模型中对所有$k$建立；对于$k\ge1$，协变估计量条件依赖于两个估计（比较几何偏差界和最优传输映射估计速率）达到该速率，将无条件的一般$k$上界留作开放问题。在合成弯曲和平坦族上的数值实验验证了预测的指数。

英文摘要

We study the minimax rate of estimating a future value $\mu_{t_n+h}$ of a curve $t\mapsto\mu_t$ in the $2$-Wasserstein space $\mathcal{P}_2(\mathbb{R}^d)$ from finitely many noisy snapshots of its past, under an adiabatic bound $\|\nabla_t^k v\|\le\varepsilon$ on the $k$-th covariant derivative of the velocity field. Our central result is a unified temporal-spatial minimax lower bound: over regular, locally transport-rich subclasses, every estimator incurs $W_2$-risk with $M$-exponent $\gamma_d(k+1)/(k+1+\gamma_d)$, where $\gamma_d=\min(1/d,1/2)$ ($M$ the total sample size). It follows from a temporal-to-spatial reduction: the smoothness budget defines a reachable $W_2$-ball into which a transport packing is embedded along the time axis, and the information of the entire snapshot experiment is controlled by a Fano argument -- the spatial packing is classical, but its smoothness-admissible temporal embedding and the full-window analysis are new. The bound interpolates a dimension-free extrapolation floor of order $\varepsilon h^{k+1}$ -- the irreducible cost of an unobserved future, present even with the exact past -- and the spatial estimation curse $M^{-\gamma_d}$, recovering the static distribution-estimation rate as $k\to\infty$. We state the lower bound in a design-dependent form -- with a design-weighted effective sample size -- valid for arbitrary observation times, and obtain the closed-form exponent in the dense (equispaced) regime. The matching upper bound is established at $k=0$ (rate $M^{-1/(d+1)}$, $d\ge3$) and, in a translation submodel, for all $k$; for $k\ge1$ a covariant estimator attains the rate conditionally on two estimates (a comparison-geometry bias bound and an optimal-transport map-estimation rate), leaving the unconditional general-$k$ upper bound as an open problem. Numerical experiments on synthetic curved and flat families corroborate the predicted exponents.
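
The exponent formula in the abstract can be checked with a few lines of exact arithmetic. The sketch below is only an illustration of the stated formula, not the authors' code; the function names `gamma` and `m_exponent` are hypothetical. It confirms that at $k=0$ and $d\ge3$ the exponent reduces to $1/(d+1)$, matching the quoted rate $M^{-1/(d+1)}$, and that it approaches $\gamma_d$ as $k$ grows.

```python
from fractions import Fraction


def gamma(d: int) -> Fraction:
    """Spatial exponent gamma_d = min(1/d, 1/2) as stated in the abstract."""
    return min(Fraction(1, d), Fraction(1, 2))


def m_exponent(d: int, k: int) -> Fraction:
    """M-exponent gamma_d (k+1) / (k+1+gamma_d); names are illustrative only."""
    g = gamma(d)
    return g * (k + 1) / (k + 1 + g)


if __name__ == "__main__":
    # At k = 0 and d >= 3 the exponent equals 1/(d+1).
    for d in (3, 4, 5):
        print(d, m_exponent(d, 0), Fraction(1, d + 1))
    # For large k the exponent approaches the static rate gamma_d.
    print(m_exponent(5, 1000), gamma(5))
```

Exact fractions are used instead of floats so that the $k=0$ identity holds as an equality rather than up to rounding.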


2606.07351 2026-06-08 cs.LG cs.AI cross-list

SleepExplain: Explainable Non-Rapid Eye Movement and Rapid Eye Movement Sleep Stage Classification from EEG Signal

Rafsan Jany, Md. Hamjajul Ashmafee, Iqram Hussain, Md Azam Hossain

AI summary: Proposes SleepExplain, which classifies NREM and REM sleep stages with ensemble models (Random Forest, XGBoost, Gradient Boosting), reaching up to 94.30% accuracy with XGBoost, and uses SHAP for explainability.

Comments 6 pages, 7 figures, 2022 25th International Conference on Computer and Information Technology (ICCIT)

DOI: 10.1109/ICCIT57492.2022.10055956
Journal ref: 2022 25th International Conference on Computer and Information Technology (ICCIT), pp. 248-253, 2022

Abstract

Classification of sleep stages is one of the most important diagnostic approaches for a variety of sleep-related disorders. Electroencephalography (EEG) is regarded as a powerful tool for examining the association between neurological effects and sleep phases since it correctly identifies sleep-related neurological alterations. During Non-Rapid Eye Movement (NREM) and Rapid Eye Movement (REM) sleep phases, a number of nerve and bodily functions are affected, so both phases play an important role in how those functions operate. This work aims to classify NREM and REM sleep stages from sleep EEG data and presents a novel SleepExplain model, an explainable NREM and REM sleep stage classifier that explains its predictions. In this work, sleep stages were classified using Random Forest, XGBoost, and Gradient Boosting ensemble classification models. Overall, we obtained an accuracy of 92.54% (Random Forest), 94.25% (Gradient Boosting), and 94.30% (XGBoost). For the explainable classification model, we utilized a game-theoretic approach, SHAP (SHapley Additive exPlanations), to offer a convincing explanation for the prediction.


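2606.07365 2026-06-08 cs.LG cs.AI cross-list

A robust PPG foundation model using multimodal physiological supervision

Eloy Geenjaar, Vince Calhoun, Scott Daly, Gouthaman KV, Lie Lu, Trisha Mittal, Daniel P. Darcy

Affiliations: Dolby Laboratories

AI summary: Proposes a PPG foundation model that uses the electrocardiogram and respiratory signals accompanying ICU datasets to select contrastive samples, requires no high-quality or field-like pretraining data, and improves performance on 14 of 15 downstream tasks.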

Abstract

Photoplethysmography (PPG), a non-invasive measure of changes in blood volume, is widely used in both wearable devices and clinical settings. Recent PPG foundation models either use open-source ICU datasets with pretraining paradigms that require curated data and thus complicate generalization to field-like data, or use closed-source field-like PPG data. In contrast, we propose a PPG foundation model that does not require high-quality or field-like pretraining data, and instead leverages accompanying electrocardiogram and respiratory signals in ICU datasets to select contrastive samples during pretraining. Our approach allows the model to retain and learn from noisy PPG segments, improving robustness at inference. Our model, pretrained on 3x fewer subjects than existing state-of-the-art approaches, achieves performance improvements on 14 out of 15 diverse downstream tasks, including field-like daily activity and heart rate prediction. Our results demonstrate that multimodal supervision can integrate complementary physiological information to improve the robustness of PPG foundation models and enhance their generalization to consumer-grade data.


2606.07368 2026-06-08 cs.CV cs.AI cross-list

Mitosis Detection in the Wild: Multi-Tumor and Context-Aware Generalization in the MIDOG 2025 Challenge

Marc Aubreville, Jonas Ammeling, Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Robert Klopfleisch, Jiaqi Lv, Shan E Ahmed Raza, Raphaël Bourgade, Thomas Walter, Yasemin Topuz, Songül Varlı, Charles-Antoine Collins-Fekete, Zhuoyan Shen, Navya Sri Kelam, Nitin Singhal, Christian Marzahl, Brian Napora, Tengyou Xu, Hongyan Gu, Mario Vento, Gennaro Percannella, Norbert Ropiak, Izabela Wasiak, Jie Xiao, Shaojun Liu, Seungho Choe, April Khademi, Vidushi Walia, Sujatha Kotte, Andrew Broad, Alex Wright, Guillaume Balezo, Esha Sadia Nasir, Mostafa Jahanifar, Yosuke Yamagishi, Shouhei Hanaoka, Mattia Sarno, Francesco Tortorella, Biwen Meng, Jingxin Liu, Sara Krauss, Daniel Hieber, Lavish Ramchandani, Dev Kumar Das, Mieko Ochi, Yuan Bae, Piotr Giedziun, Mateusz Maniewski, Vangala Govindakrishnan Saipradeep, Naveen Sivadasan, Leire Benito-Del-Valle, Adrian Galdran, Kaustubh Atey, Sameer Anand Jha, Adinath Dukre, Imran Razzak, Maxime W. Lafarge, Viktor H. Koelzer, Nils Porsche, Nikolas Stathonikos, Mitko Veta, Dominik Hirling, Zsanett Zsófia Iván, Peter Horvath, Katharina Breininger, Christof A. Bertram

Affiliations: Flensburg University of Applied Sciences; Technische Hochschule Ingolstadt; University of Veterinary Medicine; Schwarzman Animal Medical Center; Freie Universität Berlin; University of Warwick; MINES Paris - PSL University; Yildiz Technical University; University College London; AIRA MATRIX Private Limited; University of California, Los Angeles; University of Kansas Medical Center; University of Salerno; Cancer Center Sp. z o. o.; th Military Research Hospital in Bydgoszcz; Shenzhen Technology University; Toronto Metropolitan University; Tata Consultancy Services Ltd.; Leeds Teaching Hospitals NHS Trust; The University of Tokyo; Xi’an Jiaotong-Liverpool University; University of Augsburg; Ulm University; Japanese Red Cross Medical Center; Wroclaw University of Science and Technology; TECNALIA, Basque Research and Technology Alliance (BRTA); Indian Institute of Technology Bombay; MBZUAI; University of Basel; University Medical Center Utrecht; TU Eindhoven; HUN-REN Biological Research Centre

AI summary: Addressing histological diversity in real clinical practice, the MIDOG 2025 challenge evaluated algorithms across 12 tumor types and multiple scanning platforms. Models performed reliably in conventional hotspot regions but degraded markedly in challenging regions and on rare tumors, and ensembling raised the F1 score by 1.5 percentage points on average.

Abstract

Automated mitosis detection is a well-established task in computational pathology. While previous benchmarks focused on scanner-induced domain shift, clinical "real-world" application requires models to be robust across the vast variance to be expected in the histological landscape. The MItosis DOmain Generalization (MIDOG) 2025 challenge was designed to evaluate algorithmic performance across unprecedented biological and contextual diversity. We curated a test dataset of 365 cases, encompassing 12 distinct human, canine and feline tumor types, digitized across multiple scanning platforms. Moving beyond hand-selected hotspot regions of interest (ROIs), the challenge required detection also in random tissue areas (representative of the whole slide detection situation) and challenging areas (areas rich in hard negatives). In the second track, we introduced the classification of atypical mitotic figures (AMFs). There were 18 teams submitting to the detection track, with F1 scores ranging up to 0.740. In the AMF detection track, we had 21 submissions with balanced accuracy values up to 0.908. Our analysis reveals that while most models perform reliably in traditional hotspots, significant performance degradation occurs in challenging ROIs, where false positive rates tripled. Furthermore, performance varied significantly across the 12 tumor types, highlighting "blind spots" in current state-of-the-art architectures when encountering rare or highly pleomorphic malignancies. Moreover, we evaluated the effectiveness of ensembling and found mean increases of 1.5 and 1.3 percentage points in F1 score and balanced accuracy, respectively. In contrast, test-time augmentation (TTA) showed no relevant improvement. MIDOG 2025 demonstrates that "in the wild" mitosis detection remains a significant hurdle. The transition from hotspot-only evaluation to a multi-contextual framework provides a more realistic proxy for clinical reliability.


2606.07381 2026-06-08 eess.IV cs.AI cs.CV cross-list

Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios

Prabhjot Kaur, Hakim Ouaalam, Sedat Kandemirli, Sanjay P. Prabhu, Simon K. Warfield

Affiliations: Computational Radiology Laboratory; Boston Children’s Hospital; Harvard Medical School

AI summary: Synthesizes FCD-lesion MRI data with a conditional generative network and evaluates its realism and its effect on automated detection. Synthetic data can cut annotation needs by roughly 20%, but real data remain more effective.

DOI: 10.1111/jon.70137

Abstract

Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI data, which are difficult to acquire. This study aims to generate synthetic MRI data exhibiting FCD, assess their realism, and evaluate their impact on automated FCD detection, particularly in reducing the need for manual annotations. Methods: T1-weighted (T1w) and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) MRI scans from 131 FCD patients and 90 healthy controls from multiple (3) sites were retrospectively studied. Synthetic MRIs were generated by conditioning a generative network on binary FCD masks. Two neuroradiologists identified real images from a random set of 14 real and 14 synthetic scans. Three nnU-Net models were trained to detect FCD using: (i) real-only (35 FCD / 35 controls), (ii) real (35 FCD / 35 controls) plus synthetic augmentation, and (iii) expanded real data (70 FCD / 70 controls). Results: Experts showed limited ability to distinguish real from synthetic images, with classification accuracy of 60% for T1w and 70% for FLAIR (inter-rater agreement kappa = 0.86). Augmenting automated FCD detection with synthetic data increased sensitivity by 8.14% (p = 0.12) and improved model confidence at true lesion sites (0.83 +/- 0.11 to 0.89 +/- 0.12; p = 0.02). The expanded real-data model further improved sensitivity to 73.8% (p < 0.001) and confidence to 0.90 +/- 0.14 (p = 0.01). Conclusion: Conditional generative networks can generate realistic synthetic FCD-MRIs, reducing labeled data needs by approximately 20% while maintaining equivalent sensitivity. Equivalent amounts of real data, when available, remain more effective than synthetic augmentation.


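2606.07410 2026-06-08 cs.LG cs.AI cross-list

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

Yuxiang Chen, Jun Wang

Affiliations: UCL Centre for Artificial Intelligence

AI summary: Annotating 10,247 reasoning steps across all 30 AIME 2025 problems, the study finds that DeepSeek-R1 exhibits topological mimicry (reproducing the surface form of reasoning without its functional role), while stable use of branching and backtracking in successful traces and reflection placed within deductive inference are signals of genuine reasoning.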

Abstract

The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive empirical comparison between model and human reasoning across all 30 problems from AIME 2025, exhaustively annotating 10,247 reasoning steps into five functional categories: Analysis, Inference, Branch, Backtrace, and Reflection. We find a clear structural difference. Human solutions maintain a compact alternation between analysis and deduction, whereas DeepSeek-R1 frequently revisits intermediate results, performs shallow and often unnecessary verification, and loops through local checks without meaningful logical progress. We describe this as topological mimicry: reproducing the surface form of reasoning without its functional role. Despite this, we identify two signals of genuine reasoning. First, successful traces exhibit stable use of branching and backtracking, while failed traces either underuse or overuse exploratory actions. Second, reflection is only effective when placed within deductive inference; reflections trapped in analysis loops focus on local numerical details while missing global logical errors. These findings suggest that current long-CoT models may be rewarded more for the appearance of reasoning than for genuine deductive progress. We discuss directions for improving evaluation and training, including measuring cross-trace stability, penalising "spinning-wheel" traces, encouraging deeper logical correction, and reallocating inference-time compute toward deduction and backtracking. Overall, reasoning quality depends not simply on how much reflection occurs, but on whether reflection appears consistently and at the appropriate logical scale.


2606.07412 2026-06-08 cs.SE cs.AI cross-list

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

Chuan Xiao, Zhengbo Jiao, Shaobo Wang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang, Lin Qu

Affiliations: AI Data, Alibaba Group; Shanghai Jiao Tong University

AI summary: Proposes Socratic-SWE, a closed-loop self-evolution framework that distills the agent's historical solving traces into structured skills and uses them to generate targeted repair tasks, enabling continual self-improvement of coding agents.

Comments 21 pages, 5 figures. Under review

Abstract

LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, making the resulting distributions largely independent of the agent's own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop self-evolution framework that reuses the agent's historical solving traces as a source of training signal. Rather than treating traces only as evidence for reward computation, Socratic-SWE distills them into structured agent skills that summarize recurring failures and effective repair patterns. These skills then guide the generation of targeted repair tasks in real repositories. Candidate tasks are checked through execution-based validation and scored with a solver-gradient alignment reward, so that the retained tasks are both verifiable and useful for improving the Solver. The updated Solver produces new traces, enabling the task curriculum to adapt over successive rounds. Across SWE-bench Verified, SWE-bench Lite, SWE-bench Pro, and Terminal-Bench 2.0, Socratic-SWE consistently improves over self-evolving baselines under the same compute budget, reaching 50.40% on SWE-bench Verified after three iterations. These results suggest that solving traces can serve as a scalable substrate for self-evolving SWE agents.


2606.07433 2026-06-08 cs.CV cs.AI cs.MM cross-list

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang

Affiliations: School of Intelligence Science and Technology, Peking University; Wuhan University; Shanghai Jiao Tong University; Nanyang Technological University; CASIA; University of Tokyo; University of Liverpool; Zhejiang University; National University of Singapore; UC Merced

AI summary: Frames video understanding from a human-view perspective around three functional abilities (watching, remembering, reasoning), builds a unified framework for analyzing perception, memory, reasoning, and prediction in video MLLMs, and surveys challenges, methods, applications, and future directions.

Abstract

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.


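2606.07437 2026-06-08 cs.RO cs.AI cs.HC cs.SE cs.SY eess.SY cross-list

Whisper Hallucination Detection and Mitigation: Hidden-Representation Steering and Sparse Autoencoders

Georgii Aparin, Vadim Popov, Tasnima Sadekova, Assel Yermekova

Affiliations: AI Foundation and Algorithm Lab; National University of Science and Technology MISIS; National Research University Higher School of Economics

AI summary: Analyzes Whisper's internal representations and proposes a sparse-autoencoder-based steering strategy that cuts the hallucination rate on a non-speech test set from 72.63% to 14.11% (Whisper small), approaching the performance of fine-tuning-based methods.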

Abstract

Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generated for non-speech audio entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse feature subset and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. SAE-based steering reduces hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on the full non-speech test set, with small WER degradation on speech data, approaching the performance of fine-tuning-based methods.


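2606.07475 2026-06-08 cs.LG cs.AI cross-list

MemDreamer: Decoupling Perception and Reasoning for Long-Video Understanding via Hierarchical Graph Memory and Agentic Retrieval

Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen

Affiliations: Ant Group; Zhejiang University; Central South University; HKUST(GZ)

AI summary: Proposes MemDreamer, a framework that decouples perception from reasoning through a hierarchical graph memory and an agentic retrieval mechanism, turning long-video understanding into an agentic exploration process. It reaches state-of-the-art results on four benchmarks while limiting the reasoning context window to 2% of full-context ingestion and gaining 12.5 accuracy points.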

Abstract

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.


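2605.24649 2026-06-08 cs.LG cs.AI cs.LO cs.SY eess.SY cross-list

On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks

Sai Sandeep Damera, Ryan Matheu, Aniruddh G. Puranic, John S. Baras, Calin Belta

Affiliations: University of Maryland, College Park, USA

AI summary: Proposes the R-DTLGN architecture, which trains through continuous polynomial surrogates and hardens to a discrete ternary logic circuit. Numerically monotone and information-monotone gates provide stable recurrence and principled abstention for STL monitoring, and the STL formula determines the network size.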

Comments 9 pages, 3 figures. This work has been submitted to the IEEE for possible publication

Abstract

Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying them as runtime monitors in safety-critical systems demands more than predictive accuracy. Standard RNN architectures offer no structural guarantee that outputs degrade gracefully under sensor degradation; a dropped input can silently flip a verdict from safe to unsafe. We introduce the Recurrent Differentiable Ternary Logic Gate Network (R-DTLGN), a recurrent architecture that operates over Kleene's three-valued logic $\{-1, 0, +1\}$, where $0$ explicitly represents unknown. The R-DTLGN trains through continuous polynomial surrogates and hardens to a discrete ternary logic circuit at inference. We analyze the hardened circuit through two gate vocabularies derived from two orderings on the ternary domain: numerically monotone gates ensure stable recurrent dynamics, while information-monotone gates, when present, guarantee principled abstention (unknown inputs never produce wrong outputs) and monotonicity in input certainty (more information can only improve the verdict). We show that the recurrent connections required by bounded STL operators use exclusively AND and OR, which belong to both vocabularies, linking the monitoring task to the architecture's guarantees. A realizability bound derived from the STL formula's temporal operators directly sizes the network's hidden state, replacing hyperparameter search with a formula-driven specification. We evaluate on STL specifications over D4RL PointMaze navigation data, testing prediction accuracy, degradation under predicate dropout, and the accuracy-versus-safety tradeoff between two label construction pipelines. The R-DTLGN is, to our knowledge, the first recurrent architecture that couples learned temporal prediction with formal degradation guarantees rooted in three-valued logic.
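
The two orderings mentioned in the abstract can be made concrete with a brute-force check over the ternary domain. The sketch below is an illustration under stated assumptions, not the R-DTLGN implementation: it assumes the standard Kleene connectives (AND as `min`, OR as `max` over $\{-1, 0, +1\}$) and an information order in which unknown ($0$) sits below both definite values. The gate `optimistic` and the fold `eventually` are hypothetical names introduced here for illustration.

```python
from functools import reduce
from itertools import product

TERNARY = (-1, 0, 1)  # false, unknown, true


def k_and(a, b):
    """Kleene strong conjunction."""
    return min(a, b)


def k_or(a, b):
    """Kleene strong disjunction."""
    return max(a, b)


def info_leq(a, b):
    """a is below b in the information order: 0 (unknown) refines to -1 or +1."""
    return a == 0 or a == b


def is_numerically_monotone(gate):
    """Larger inputs (numeric order) never yield a smaller output."""
    return all(gate(a, b) <= gate(a2, b2)
               for a, b, a2, b2 in product(TERNARY, repeat=4)
               if a <= a2 and b <= b2)


def is_information_monotone(gate):
    """Refining an unknown input never contradicts an already definite output."""
    return all(info_leq(gate(a, b), gate(a2, b2))
               for a, b, a2, b2 in product(TERNARY, repeat=4)
               if info_leq(a, a2) and info_leq(b, b2))


def optimistic(a, b):
    # Hypothetical counterexample gate: treats an unknown first input as true.
    return 1 if a == 0 else a


def eventually(window):
    """Assumed fold of a bounded 'eventually' over a window using only OR."""
    return reduce(k_or, window, -1)


if __name__ == "__main__":
    for name, gate in (("AND", k_and), ("OR", k_or), ("optimistic", optimistic)):
        print(name, is_numerically_monotone(gate), is_information_monotone(gate))
    print(eventually([0, -1, 0]))  # a dropped predicate keeps the verdict unknown
```

Under these assumptions AND and OR pass both checks, whereas `optimistic` is numerically monotone but not information-monotone, which is one way to see why the two gate vocabularies differ.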


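2504.00613 2026-06-08 cs.AI cs.IT cs.NE math.IT replacement

LLM-Guided Search for Deletion-Correcting Codes

Franziska Weindel, Reinhard Heckel

Affiliations: School of Computation, Information and Technology, Technical University of Munich

AI summary: Targeting the open problem of the maximum size of deletion-correcting codes, adapts FunSearch, an LLM-guided evolutionary search, to discover functions that construct deletion-correcting codes at short code lengths. For a single deletion the discovered function provably constructs the conjectured-optimal Varshamov-Tenengolts code, while for multiple deletions and quaternary edit codes it improves on prior constructions without yielding new theoretical insight.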

Abstract

Finding deletion-correcting codes of maximum size has been an open problem for over 70 years, even for a single deletion. We adapt FunSearch, a large language model (LLM)-guided evolutionary search, to discover functions that construct deletion-correcting codes at short code lengths. For a single deletion, our search finds a function that we prove constructs the conjectured-optimal Varshamov-Tenengolts code. For multiple deletions and quaternary edit codes, the discovered functions improve on prior explicit, search-based, and neural constructions but remain empirical heuristics without new theoretical insights. We study design choices for LLM-guided evolutionary search and find that, for our problem, compute is better allocated to sampling more functions than to longer reasoning traces per function, and that co-evolving natural language descriptions with code hurts search quality. We propose deduplicating logically identical functions during evolution, which we find critical for search diversity. Our results demonstrate the potential of LLM-guided evolutionary search for information theory and code design and represent the first application of such methods for constructing error-correcting codes. However, in our current formulation, evaluating a function scales exponentially with code length, limiting the approach to short codes.
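
For readers unfamiliar with the Varshamov-Tenengolts code named in the abstract, a minimal sketch of the classical binary construction follows. It uses the textbook definition (binary words of length $n$ whose position-weighted sum is congruent to $a$ modulo $n+1$), not the function discovered by the search; all function names are hypothetical. The brute-force check also illustrates the abstract's closing remark: enumerating all $2^n$ words grows exponentially with code length.

```python
from itertools import product


def vt_code(n, a=0):
    """Binary VT code: words x with sum(i * x_i, i = 1..n) congruent to a mod (n + 1)."""
    return [x for x in product((0, 1), repeat=n)
            if sum(i * xi for i, xi in enumerate(x, start=1)) % (n + 1) == a]


def deletion_ball(x):
    """All distinct words obtained from x by deleting exactly one symbol."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}


def corrects_single_deletion(code):
    """True if no two distinct codewords share a single-deletion descendant."""
    owner = {}
    for c in code:
        for y in deletion_ball(c):
            if owner.setdefault(y, c) != c:
                return False
    return True


if __name__ == "__main__":
    for n in range(2, 9):
        code = vt_code(n)
        print(n, len(code), corrects_single_deletion(code))
```

The check is exhaustive, so it is only practical at short lengths; it is meant as a sanity test of the definition rather than an efficient decoder.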


2504.03635 2026-06-08 cs.AI cs.CL replacement

Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models

Xinyi Wang, Shawn Tan, Shenbo Xu, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen

Affiliations: University of California, Berkeley; University of Cambridge; University of Washington; University of Toronto; University of Tokyo

AI summary: Through pretraining experiments in a controlled synthetic environment, finds a scaling law linking the minimal parameter budget for implicit reasoning in language models to graph search entropy, and identifies a capacity ceiling of roughly 0.008 bits of information per parameter.

Comments Accepted to ICML 2026

详情

AI中文摘要

推理是语言模型（LM）的核心能力，然而在预训练期间需要多少模型容量来支持推理仍不清楚。在这项工作中，我们研究了隐式推理所需的最小参数预算，隐式推理定义为无需显式思维链监督即可从所学知识中推断出新事实的能力。为了隔离这一现象，我们在一个受控的合成环境中从头开始预训练LM，该环境模拟了真实世界知识图谱的结构和分布，并通过多跳推理评估它们补全缺失边的能力。从理论和实证两个角度，我们确定了一个缩放定律，将该最优参数预算与图搜索熵度量联系起来。在广泛的模型大小、训练步数和图复杂度范围内，我们表明一个最优大小的语言模型最多可以可靠地处理每参数约0.008比特的信息。我们的结果刻画了预训练期间隐式推理所需的最小充分容量。我们的发现为匹配模型大小与数据复杂度提供了原则性指导，并为大型语言模型中推理的缩放行为提供了新见解。

英文摘要

Reasoning is a core capability of language models (LMs), yet it remains unclear how much model capacity is necessary to support reasoning during pretraining. In this work, we study the minimal parameter budget required for implicit reasoning, defined as the ability to infer new facts from learned knowledge without explicit chain-of-thought supervision. To isolate this phenomenon, we pretrain LMs from scratch in a controlled synthetic environment that mimics the structure and distribution of real-world knowledge graphs, and evaluate their ability to complete missing edges via multi-hop inference. From both a theoretical and an empirical perspective, we identify a scaling law linking this optimal parameter budget to a graph search entropy measure. Across a wide range of model sizes, training steps, and graph complexities, we show that an optimally sized language model can reliably reason over approximately 0.008 bits of information per parameter at most. Our results characterize the minimal sufficient capacity for implicit reasoning during pretraining. Our findings provide principled guidance for matching model size to data complexity and offer new insights into the scaling behavior of reasoning in large language models.
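To make the capacity figure concrete, the following back-of-the-envelope sketch turns the reported ceiling of about 0.008 bits per parameter into a minimal parameter count. It assumes a simple linear reading of the abstract (parameters ≈ entropy / capacity); the paper's actual scaling law may have a different functional form, and `min_parameter_budget` is a name made up for this example.

```python
def min_parameter_budget(search_entropy_bits, bits_per_param=0.008):
    """Smallest parameter count that stays under the reported capacity.

    Assumption: capacity is linear in parameters, so the budget is simply
    total graph search entropy divided by bits handled per parameter.
    """
    return search_entropy_bits / bits_per_param


# Hypothetical graph whose search entropy is 8 million bits.
print(f"{min_parameter_budget(8_000_000):.3g} parameters")
```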


2505.05232 2026-06-08 cs.AI 版本更新

ChemQuests: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv papers

ChemQuests: 从ChemRxiv论文中提取的精选化学问答数据库

Mahmoud Amiri, Thomas Bocklitz

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结提出ChemQuests数据集，包含从155篇ChemRxiv论文中提取的952个高质量问答对，覆盖17个化学子领域，用于化学NLP研究。

详情

AI中文摘要

化学文献的快速扩展给研究人员高效获取领域特定知识带来了重大挑战。为了支持化学领域自然语言处理（NLP）的进展，我们提出了ChemQuests，这是一个精选数据集，包含来自化学17个子领域的155篇ChemRxiv论文的952个高质量问答（QA）对。每个QA对都明确链接到其源文本片段，以确保可追溯性和上下文准确性。ChemQuests使用自动化流水线构建，该流水线结合了光学字符识别（OCR）、使用GPT-4o的QA生成以及模糊搜索验证。该数据集强调概念性、机理、应用以及合成或实验性问题，支持基于检索的QA系统、搜索引擎开发以及领域自适应大语言模型的微调。我们分析了数据集的结构、覆盖范围和局限性，并概述了扩展和专家验证的未来方向。ChemQuests为化学NLP研究、教育和工具开发提供了基础资源。

英文摘要

The rapid expansion of chemistry literature poses significant challenges for researchers seeking to efficiently access domain-specific knowledge. To support advancements in chemistry-focused natural language processing (NLP), we present ChemQuests, a curated dataset of 952 high-quality question-answer (QA) pairs derived from 155 ChemRxiv papers across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy. ChemQuests was constructed using an automated pipeline that combines optical character recognition (OCR), QA generation using GPT-4o, and fuzzy-search verification. The dataset emphasizes conceptual, mechanistic, applied, and synthetic or experimental questions, enabling applications in retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large language models. We analyze the dataset's structure, coverage, and limitations, and outline future directions for expansion and expert validation. ChemQuests provides a foundational resource for chemistry NLP research, education, and tool development.
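The fuzzy-search verification step could look roughly like the sketch below, which checks whether a generated evidence snippet appears, up to OCR noise, somewhere in the source text. The abstract does not specify the matching algorithm, so the sliding-window approach, the `best_fuzzy_match` helper, and the sample strings are all assumptions for illustration; only Python's standard `difflib.SequenceMatcher` is used.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(snippet, source):
    """Highest similarity ratio between `snippet` and any window of the
    same length in `source` (1.0 means an exact substring match)."""
    width = len(snippet)
    best = 0.0
    for start in range(max(1, len(source) - width + 1)):
        window = source[start:start + width]
        ratio = SequenceMatcher(None, snippet, window, autojunk=False).ratio()
        best = max(best, ratio)
    return best


# Hypothetical OCR output with a digit "1" misread for the letter "l".
source = "The Pd cata1yst enables C-H activation at room temperature."
grounded = best_fuzzy_match("Pd catalyst enables C-H activation", source)
ungrounded = best_fuzzy_match("zeolite membrane synthesis by sol-gel", source)
print(round(grounded, 2), round(ungrounded, 2))
```

A pipeline in this spirit would keep a QA pair only when its evidence score clears a chosen threshold; that threshold is a tuning choice not stated in the abstract.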


2505.14289 2026-06-08 cs.AI 版本更新

EVA: Evolving Semantic Adversaries for Red-Teaming GUI Agents Against Environmental Injection Attacks

EVA: 针对环境注入攻击的红队GUI智能体的演化语义对抗方法

Yijie Lu, Manman Zhao, Tianjie Ju, Zihe Yan, Xinbei Ma, Yuan Guo, Daizong Ding, Gongshen Liu, Zhuosheng Zhang

发表机构 * School of Cyber Science and Engineering, Wuhan University（武汉大学国家网络安全学院）； School of Computer Science, Shanghai Jiao Tong University（上海交通大学计算机科学学院）； Independent Researcher（独立研究者）

AI总结提出EVA框架，通过演化语义对抗载荷攻击多模态大语言模型驱动的GUI智能体，揭示语义欺骗是攻击成功的关键，实现85%攻击成功率且快速收敛。

Comments Accepted by

详情

AI中文摘要

由多模态大语言模型驱动的图形用户界面智能体日益部署，但易受环境注入攻击。然而，当前的红队方法受限于高昂的计算成本和有限的适应性。一个基本问题仍未解决：攻击成功的瓶颈在于视觉感知还是语义理解？通过控制实验，我们观察到语义欺骗而非视觉外观是攻击成功的主要决定因素。基于这一见解，我们引入了EVA，一个仅在语义维度上演化对抗载荷的演化框架。EVA采用发现-部署框架来挖掘语言脆弱性模式并将其提炼为可泛化的规则。在五个代表性受害智能体上的实验结果表明，EVA实现了高达85%的攻击成功率，在仅1.18到1.71次迭代内将良性种子演化为成功攻击。这种快速收敛揭示了模型潜在表示中密集的语义攻击空间，揭示了一个关键的对齐悖论：对齐训练强化的指令遵循能力使智能体天生易受权威性、语义欺骗性环境线索的影响。

英文摘要

Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) are increasingly deployed yet vulnerable to Environmental Injection Attacks (EIAs). However, current red-teaming methods are hindered by prohibitive computational costs and limited adaptability. A fundamental question remains unaddressed: does the bottleneck of attack success lie in visual perception or semantic understanding? Through controlled experiments, we observe that semantic deception, rather than visual appearance, serves as the primary determinant of attack success. Based on this insight, we introduce EVA, an evolutionary framework that evolves adversarial payloads exclusively within the semantic dimension. EVA employs a discovery-deployment framework to mine linguistic vulnerability patterns and distill them into generalizable rules. Experimental results across five representative victim agents demonstrate that EVA achieves up to 85% attack success rate, evolving benign seeds into successful attacks within only 1.18 to 1.71 iterations. This rapid convergence uncovers a dense semantic attack space in the model's latent representation, unveiling a critical alignment paradox: the instruction-following capabilities reinforced by alignment training render agents inherently susceptible to authoritative, semantically deceptive environmental cues.


2505.15998 2026-06-08 cs.AI 版本更新

Exploring Flow-Lenia Universes with a Curiosity-driven AI Scientist: Discovering Diverse Ecosystem Dynamics

探索Flow-Lenia宇宙：基于好奇心驱动的AI科学家发现多样生态系统动力学

Thomas Michel, Marko Cvjetko, Gautier Hamon, Pierre-Yves Oudeyer, Clément Moulin-Frier

发表机构 * Univ. Lille, Inria, CNRS, Centrale Lille, CRIStAL, France（里尔大学、Inria、法国国家科学研究中心、中央里尔学院、CRIStAL实验室、法国）； Inria Center at the University of Bordeaux, France（波尔多大学的Inria研究中心、法国）； Inria, INSA Lyon, CITI, UR3720, 69621 Villeurbanne, France（Inria、里昂INSA、CITI、UR3720、法国）

AI总结提出好奇心驱动的AI科学家方法，通过内在动机目标探索过程（IMGEP）在Flow-Lenia中发现系统级动力学，揭示类似生物现象的自组织行为，并展示大规模多样性搜索作为后续实验设计的框架。

Comments Extended version of the paper first published at ALife 2025. Project webpage: https://developmentalsystems.org/Exploring-Flow-Lenia-Universes/ 24 pages, 16 figures

详情

DOI: 10.1162/ISAL.a.896
Journal ref: Proceedings of the Artificial Life Conference 2025, pp. 633-643

AI中文摘要

我们提出了一种好奇心驱动的AI科学家方法，用于发现Flow-Lenia中的系统级动力学。Flow-Lenia是一种具有质量守恒和参数局部化的连续元胞自动机（CA）。基于先前使用Lenia中的多样性搜索来发现个体自组织模式的工作，我们将内在动机目标探索过程（IMGEP）适应于交互模式的大型环境，使用模拟范围的度量，如进化活动、压缩比和多尺度物质分布。我们在两个探索实验中应用IMGEP：一个针对生态系统级动力学，另一个针对通过障碍物环境的物质运动。在这两个实验中，IMGEP比随机搜索照亮了更多的度量空间，并揭示了定性上类似于许多生物现象的自组织行为。利用生成的档案，我们随后在六个空间尺度和七个时间跨度上进行了缩放研究，揭示了在基础尺度上没有类似物的宏观尺度组织，并表征了目标空间度量在尺度上的行为。这说明了我们方法的一个优势：相对廉价的大规模多样性搜索可以作为设计后续更昂贵实验的原则性框架，通过交互式探索工具支持实验设计、检查和重新设计的迭代循环，使科学家保持在循环中。尽管在Flow-Lenia上进行了演示，但这种方法可能适用于其他可参数化的复杂系统，其中研究自下而上的集体行为是有意义的。

英文摘要

We present a curiosity-driven AI scientist method for discovering system-level dynamics in Flow-Lenia, a continuous cellular automaton (CA) with mass conservation and parameter localization. Building on prior work that uses diversity search in Lenia to find individual self-organized patterns, we adapt Intrinsically Motivated Goal Exploration Processes (IMGEPs) to large environments of interacting patterns, using simulation-wide metrics such as evolutionary activity, compression ratio, and multi-scale matter distribution. We apply IMGEP in two exploration experiments: one targeting ecosystem-level dynamics, the other matter movement through obstacle-laden environments. In both, IMGEP illuminates significantly more of the metric space than random search and reveals self-organized behaviors qualitatively resembling many biological phenomena. Leveraging the resulting archive, we then run a scaling study across six spatial scales and seven time horizons, uncovering macro-scale organization with no analogue at the base scale and characterizing how goal-space metrics behave at scale. This illustrates a strength of our approach: a relatively cheap large-scale diversity search can act as a principled scaffold for designing subsequent, more expensive experiments, enabling an iterative loop of experiment design, inspection, and redesign, supported by an interactive exploration tool that keeps scientists in the loop. Though demonstrated with Flow-Lenia, this approach potentially applies to other parameterizable complex systems where studying bottom-up collective behavior is of interest.


2508.19239 2026-06-08 cs.AI 版本更新

Model Context Protocols in Adaptive Transport Systems: A Survey

自适应交通系统中的模型上下文协议：综述

Gaurab Chhetri, Shriyank Somvanshi, Md Monzurul Islam, Shamyo Brotee, Mahmuda Sultana Mimi, Dipti Koirala, Biplov Pandey, Subasish Das

发表机构 * Texas State University San Marcos（德克萨斯州立大学圣马科斯分校）

AI总结本文首次系统调查模型上下文协议（MCP）作为统一范式，提出五类分类法，揭示传统协议孤立适应的局限，并指出MCP的客户端-服务器和JSON-RPC结构支持语义互操作性，为下一代自适应智能交通基础设施奠定基础。

详情

AI中文摘要

互联设备、自主系统和AI应用的快速扩展导致自适应交通系统严重碎片化，各种协议和上下文源仍然孤立。本综述首次系统调查模型上下文协议（MCP）作为统一范式，强调其桥接协议级适应与上下文感知决策的能力。通过分析已有文献，我们发现现有工作已隐含地趋近于MCP类架构，标志着从碎片化解决方案向标准化集成框架的自然演进。我们提出一个五类分类法，涵盖自适应机制、上下文感知框架、统一模型、集成策略和MCP使能架构。我们的发现揭示了三个关键见解：传统传输协议已达到孤立适应的极限，MCP的客户端-服务器和JSON-RPC结构支持语义互操作性，以及AI驱动的传输需要特别适合MCP的集成范式。最后，我们提出一个研究路线图，将MCP定位为下一代自适应、上下文感知和智能交通基础设施的基础。

英文摘要

The rapid expansion of interconnected devices, autonomous systems, and AI applications has created severe fragmentation in adaptive transport systems, where diverse protocols and context sources remain isolated. This survey provides the first systematic investigation of the Model Context Protocol (MCP) as a unifying paradigm, highlighting its ability to bridge protocol-level adaptation with context-aware decision making. Analyzing established literature, we show that existing efforts have implicitly converged toward MCP-like architectures, signaling a natural evolution from fragmented solutions to standardized integration frameworks. We propose a five-category taxonomy covering adaptive mechanisms, context-aware frameworks, unification models, integration strategies, and MCP-enabled architectures. Our findings reveal three key insights: traditional transport protocols have reached the limits of isolated adaptation, MCP's client-server and JSON-RPC structure enables semantic interoperability, and AI-driven transport demands integration paradigms uniquely suited to MCP. Finally, we present a research roadmap positioning MCP as a foundation for next-generation adaptive, context-aware, and intelligent transport infrastructures.


2509.25522 2026-06-08 cs.AI 版本更新

Understanding Generative Recommendation with Semantic IDs from a Model-scaling View

从模型扩展视角理解基于语义ID的生成式推荐

Jingzhe Liu, Liam Collins, Jiliang Tang, Tong Zhao, Neil Shah, Clark Mingxuan Ju

发表机构 * Michigan State University（密歇根州立大学）； Snap Inc.（Snap公司）

AI总结揭示基于语义ID的生成式推荐在模型扩展时存在性能瓶颈，发现直接使用大语言模型作为推荐器具有更好的扩展性，性能提升可达20%。

Comments Accepted by KDD 2026

详情

DOI: 10.1145/3770855.3817976

AI中文摘要

近期生成模型的进展催生了一种有前景的推荐系统范式，称为生成式推荐（GR），它试图统一丰富的物品语义和协同过滤信号。一种流行的现代方法是使用语义ID（SIDs），即从模态编码器（如大型语言或视觉模型）的嵌入中量化得到的离散编码，在自回归用户交互序列建模设置中表示物品（以下简称基于SID的GR）。虽然其他领域的生成模型展现出完善的缩放定律，我们的工作揭示了基于SID的GR在模型扩展时存在显著瓶颈。特别是，随着我们扩大每个组件（模态编码器、量化分词器和推荐系统本身），基于SID的GR的性能迅速饱和。在这项工作中，我们确定SID编码物品语义信息的有限能力是根本瓶颈之一。基于这一观察，作为获得具有更好缩放行为的GR模型的初步努力，我们重新审视了另一种直接使用大型语言模型（LLMs）作为推荐器的GR范式（以下简称LLM-as-RS）。我们的实验表明，LLM-as-RS范式具有优越的模型缩放属性，并通过缩放实现了比基于SID的GR最佳可达性能高达20%的提升。我们还挑战了普遍认为LLMs难以捕捉协同过滤信息的观点，表明它们建模用户-物品交互的能力随着LLMs的扩展而提升。我们对基于SID的GR和LLMs在44M到14B参数模型规模上的分析强调了基于SID的GR的内在缩放限制，并将LLM-as-RS定位为通往GR基础模型的有希望路径。

英文摘要

Recent advancements in generative models have allowed the emergence of a promising paradigm for recommender systems (RS), known as Generative Recommendation (GR), which tries to unify rich item semantics and collaborative filtering signals. One popular modern approach is to use semantic IDs (SIDs), which are discrete codes quantized from the embeddings of modality encoders (e.g., large language or vision models), to represent items in an autoregressive user interaction sequence modeling setup (henceforth, SID-based GR). While generative models in other domains exhibit well-established scaling laws, our work reveals that SID-based GR shows significant bottlenecks while scaling up the model. In particular, the performance of SID-based GR quickly saturates as we enlarge each component: the modality encoder, the quantization tokenizer, and the RS itself. In this work, we identify the limited capacity of SIDs to encode item semantic information as one of the fundamental bottlenecks. Motivated by this observation, as an initial effort to obtain GR models with better scaling behaviors, we revisit another GR paradigm that directly uses large language models (LLMs) as recommenders (henceforth, LLM-as-RS). Our experiments show that the LLM-as-RS paradigm has superior model scaling properties and achieves up to 20 percent improvement over the best achievable performance of SID-based GR through scaling. We also challenge the prevailing belief that LLMs struggle to capture collaborative filtering information, showing that their ability to model user-item interactions improves as LLMs scale up. Our analyses on both SID-based GR and LLMs across model sizes from 44M to 14B parameters underscore the intrinsic scaling limits of SID-based GR and position LLM-as-RS as a promising path toward foundation models for GR.
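A semantic ID can be pictured as a short tuple of codebook indices obtained by quantizing an item embedding. The sketch below uses residual quantization, a common choice for SID tokenizers, though the abstract does not say which quantizer the authors use; the function names, the toy codebooks, and the two-level depth are all assumptions for illustration.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    def sq_dist(codeword):
        return sum((c - v) ** 2 for c, v in zip(codeword, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def semantic_id(embedding, codebooks):
    """Residual quantization: at each level, pick the nearest codeword and
    pass the leftover residual on to the next codebook."""
    residual = list(embedding)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)


# Toy two-level codebooks (hypothetical values).
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
]
print(semantic_id([10.0, 11.0], codebooks))
```

With these toy codebooks at most 2 × 3 = 6 distinct IDs exist, which gives a small-scale picture of why a fixed code budget can cap how much item semantics an SID carries.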


2510.01427 2026-06-08 cs.AI 版本更新

Small Language Model Agents Enable Efficient and High-Quality Knowledge Mining

小型语言模型代理实现高效高质量的知识挖掘

Sipeng Zhang, Shuhuai Lin, Xinpeng Wei, Yihang Chen, Pin Qian, Su Wang, Huan Xu

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）； Carnegie Mellon University（卡内基梅隆大学）； Georgia Institute of Technology（佐治亚理工学院）

AI总结提出Falconer框架，结合大语言模型的代理推理与轻量级代理模型，通过规划与标注实现可扩展的知识挖掘，在保持指令遵循精度的同时降低90%推理成本并加速20倍以上。

Comments Code available: https://github.com/LongfeiYun17/falconer

详情

AI中文摘要

深度研究（Deep Research）的核心是知识挖掘，即根据用户指令从海量非结构化文本中提取结构化信息的任务。大语言模型（LLMs）擅长解释此类指令，但大规模部署成本过高；而传统的分类器和提取器流水线虽然高效，但脆弱且无法泛化到新任务。我们提出Falconer，一种协作框架，将LLM的代理推理与轻量级代理模型相结合，用于可扩展的知识挖掘。在Falconer中，LLM作为规划者，将用户指令分解为可执行的流水线，并作为标注者，生成监督信号以训练小型代理。该框架将分类和提取统一为两个原子操作：get label和get span，使得单个指令遵循模型能够替代多个任务特定组件。为了评估Falconer孵化的代理模型与人类和大模型提供的标注之间的一致性，我们构建了涵盖规划和端到端执行的新基准。实验表明，Falconer在指令遵循精度上接近最先进的LLM，同时将推理成本降低高达90%，并将大规模知识挖掘加速20倍以上，为深度研究提供了高效且可扩展的基础。

英文摘要

At the core of Deep Research is knowledge mining, the task of extracting structured information from massive unstructured text in response to user instructions. Large language models (LLMs) excel at interpreting such instructions but are prohibitively expensive to deploy at scale, while traditional pipelines of classifiers and extractors remain efficient yet brittle and unable to generalize to new tasks. We introduce Falconer, a collaborative framework that combines the agentic reasoning of LLMs with lightweight proxy models for scalable knowledge mining. In Falconer, LLMs act as planners, decomposing user instructions into executable pipelines, and as annotators, generating supervision to train small proxies. The framework unifies classification and extraction into two atomic operations, get label and get span, enabling a single instruction-following model to replace multiple task-specific components. To evaluate the consistency between proxy models incubated by Falconer and annotations provided by humans and large models, we construct new benchmarks covering both planning and end-to-end execution. Experiments show that Falconer closely matches state-of-the-art LLMs in instruction-following accuracy while reducing inference cost by up to 90% and accelerating large-scale knowledge mining by more than 20x, offering an efficient and scalable foundation for Deep Research.


2510.05363 2026-06-08 cs.AI 版本更新

MHA-RAG: Improving Efficiency, Accuracy, and Consistency by Encoding Exemplars as Soft Prompts

MHA-RAG：通过将示例编码为软提示来提高效率、准确性和一致性

Abhinav Jain, Xinyu Yao, Thomas Reps, Christopher Jermaine

发表机构 * Department of Computer Science, Rice University（莱斯大学计算机科学系）； Department of Computer Science, University of Wisconsin–Madison（威斯康星大学麦迪逊分校计算机科学系）

AI总结提出MHA-RAG框架，将领域示例编码为软提示，通过多头注意力机制控制生成，在多个问答基准上相比标准RAG提升20点性能，同时降低10倍推理成本。

Comments 17 pages, 5 figures

详情

AI中文摘要

在有限训练数据下将基础模型适应到新领域具有挑战性且计算成本高。虽然先前工作证明了使用领域特定示例作为上下文演示的有效性，但我们探究了将示例纯粹表示为文本是否是最有效、最稳定和最高效的方法。我们探索了一种替代方案：使用示例顺序不变模型架构将示例表示为软提示。为此，我们引入了多头注意力检索增强生成（MHA-RAG），该框架以注意力头数作为简单超参数，用于控制不同任务下的软提示生成。在多个问答基准和模型规模上，MHA-RAG相比标准RAG实现了20个百分点的性能提升，同时将推理成本降低了10倍GFLOPs——在示例顺序不变的情况下，实现了更高的准确性和更高的效率。

英文摘要

Adapting Foundation Models to new domains with limited training data is challenging and computationally expensive. While prior work has demonstrated the effectiveness of using domain-specific exemplars as in-context demonstrations, we investigate whether representing exemplars purely as text is the most efficient, effective, and stable approach. We explore an alternative: representing exemplars as soft prompts with an exemplar-order-invariant model architecture. To this end, we introduce Multi-Head Attention Retrieval-Augmented Generation (MHA-RAG), a framework with the number of attention heads serving as a simple hyperparameter to control soft-prompt generation across different tasks. Across multiple question-answering benchmarks and model scales, MHA-RAG achieves a 20-point performance gain over standard RAG while cutting inference costs by a factor of 10x in GFLOPs, delivering both higher accuracy and greater efficiency, invariant to exemplar order.
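One way to see why attention-based soft prompts can be invariant to exemplar order is the sketch below: each head uses a learned query to pool over exemplar embeddings with no positional information, so permuting the exemplars leaves the pooled vectors unchanged. This is an illustrative guess at the general mechanism, not the MHA-RAG architecture itself; the shapes, the random weights, and the `soft_prompt` helper are assumptions, and NumPy is required.

```python
import numpy as np


def soft_prompt(exemplars, queries, w_k, w_v):
    """Pool N exemplar embeddings (N, d) into H soft-prompt vectors (H, d).

    Each head h attends from its learned query over all exemplars. No
    positional encoding is used, so the result ignores exemplar order.
    """
    prompts = []
    for h in range(queries.shape[0]):
        keys = exemplars @ w_k[h]                       # (N, d)
        values = exemplars @ w_v[h]                     # (N, d)
        scores = keys @ queries[h] / np.sqrt(keys.shape[1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                        # softmax over exemplars
        prompts.append(weights @ values)
    return np.stack(prompts)


rng = np.random.default_rng(0)
n_exemplars, dim, heads = 5, 8, 4                       # hypothetical sizes
exemplars = rng.normal(size=(n_exemplars, dim))
queries = rng.normal(size=(heads, dim))
w_k = rng.normal(size=(heads, dim, dim))
w_v = rng.normal(size=(heads, dim, dim))

prompt = soft_prompt(exemplars, queries, w_k, w_v)
shuffled = soft_prompt(exemplars[::-1], queries, w_k, w_v)
print(prompt.shape, np.allclose(prompt, shuffled))
```

In this toy setup the number of heads directly sets how many soft-prompt vectors are produced, mirroring the abstract's description of head count as the main hyperparameter.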


2512.23292 2026-06-08 cs.AI cs.LG 版本更新

面向视觉多智能体系统的双潜在记忆

Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, Shuicheng Yan

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出L²-VMAS框架，通过双潜在记忆解耦感知与思考，并采用熵驱动主动触发机制，打破视觉多智能体系统的“扩展墙”，在提升准确率的同时大幅降低令牌消耗。

详情

AI中文摘要

尽管视觉多智能体系统（VMAS）有望通过智能体间协作增强综合能力，但经验证据揭示了一个反直觉的“扩展墙”：增加智能体轮次往往会降低性能，同时指数级增加令牌成本。我们将这一失败归因于以文本为中心的通信中固有的信息瓶颈，其中将感知和思维轨迹转换为离散自然语言不可避免地导致语义损失。为此，我们提出了L²-VMAS，一种新颖的模型无关框架，通过双潜在记忆实现智能体间协作。此外，我们解耦了感知与思考，同时动态合成双潜在记忆。另外，我们引入了熵驱动的主动触发，用高效的按需内存访问取代被动信息传输。在骨干网络、规模和多智能体结构上的大量实验表明，我们的方法有效打破了“扩展墙”，具有卓越的可扩展性，平均准确率提高2.7-5.4%，同时令牌使用量减少21.3-44.8%。

英文摘要

While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L²-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories. Furthermore, we decouple perception and thinking while dynamically synthesizing dual latent memories. Additionally, we introduce an entropy-driven proactive triggering mechanism that replaces passive information transmission with efficient, on-demand memory access. Extensive experiments across backbones, sizes, and multi-agent structures demonstrate that our method effectively breaks the "scaling wall" with superb scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%.


2602.01740 2026-06-08 cs.AI cs.CV cs.LG 版本更新

MACD: Model-Aware Contrastive Decoding via Counterfactual Data

MACD：基于反事实数据的模型感知对比解码

Qixin Xiao, Kun Zhou

发表机构 * University of Michigan, Ann Arbor, MI, USA（密歇根大学，安娜堡分校）； University of California San Diego, La Jolla, CA, USA（加州大学圣地亚哥分校）

AI总结提出MACD方法，利用视频语言模型自身反馈识别导致幻觉的目标区域，生成目标级反事实输入，结合对比解码减少幻觉，提升多模型在复杂场景下的准确性。

详情

AI中文摘要

视频语言模型（Video-LLMs）容易产生幻觉，当视觉证据薄弱、模糊或存在偏差时，会生成看似合理但无根据的内容。现有方法如对比解码（CD）依赖随机扰动构建对比数据以缓解幻觉，但往往未能针对驱动幻觉的视觉线索或模型弱点。我们提出基于模型感知反事实数据的对比解码（MACD），这是一种结合模型引导的反事实构建与对比解码的推理策略。MACD利用Video-LLM自身的反馈来识别最可能导致幻觉的目标区域，生成有针对性的目标级反事实输入，而非任意的帧或时间修改。这些反事实输入被整合到CD中，以在解码过程中强制进行基于证据的令牌选择。在EventHallusion、MVBench、Perception-test和Video-MME上的实验表明，MACD在包括Qwen和InternVL在内的多种Video-LLM上持续减少幻觉，同时保持或提高任务准确性，在涉及小目标、遮挡目标或共现目标的场景中尤其表现出显著优势。

英文摘要

Video language models (Video-LLMs) are prone to hallucinations, generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for hallucination mitigation, but often fail to target the visual cues that drive hallucination or align with model weaknesses. We propose Model-Aware Counterfactual Data based Contrastive Decoding (MACD), an inference strategy that combines model-guided counterfactual construction with contrastive decoding. MACD uses the Video-LLM's own feedback to identify object regions most responsible for hallucination, generating targeted object-level counterfactual inputs rather than arbitrary frame or temporal modifications. These counterfactual inputs are integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test, and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL, with especially strong gains in scenarios involving small, occluded, or co-occurring objects.
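The contrastive decoding step can be sketched as follows: logits computed from the original video are boosted, and logits from the counterfactual input are subtracted, so tokens that stay likely even without the relevant visual evidence are penalized. The abstract does not give MACD's exact formula; the (1 + alpha) / alpha weighting below follows a form commonly used in contrastive decoding work, and the function name and toy logits are assumptions for illustration.

```python
def contrastive_logits(original, counterfactual, alpha=1.0):
    """Amplify what the original input supports and subtract what the
    counterfactual input still predicts (likely prior-driven tokens)."""
    return [(1 + alpha) * o - alpha * c for o, c in zip(original, counterfactual)]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary of three tokens (hypothetical logits).
original = [2.0, 1.5, 0.1]        # token 0 looks best on the real video
counterfactual = [2.5, 0.2, 0.1]  # but token 0 stays likely with the object removed
adjusted = contrastive_logits(original, counterfactual)
print(argmax(original), argmax(adjusted))
```

Here token 0 wins under plain decoding but loses after the contrast, because its score does not depend on the visual evidence that the counterfactual removed.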


2602.03160 2026-06-08 cs.AI cs.CL 版本更新

VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

VALUEFLOW：迈向大语言模型中多元化和可引导的基于价值的对齐

Woojin Kim, Sieun Hyeon, Jusang Oh, Jaeyoung Do

发表机构 * Department of Electrical and Computer Engineering, Seoul National University（首尔国立大学电气与计算机工程系）； Interdisciplinary Program in Artificial Intelligence, Seoul National University（首尔国立大学人工智能交叉学科项目）

AI总结提出VALUEFLOW框架，通过分层价值嵌入、强度标注数据库和锚定评估器，实现大语言模型在价值强度上的可控对齐，解决现有方法在提取、评估和引导方面的不足。

Comments Accepted in ICML 2026 (Oral). Code available at https://github.com/AIDASLab/VALUEFLOW

详情

AI中文摘要

将大语言模型（LLMs）与人类价值的多元光谱对齐仍然是一个核心挑战：基于偏好的方法通常无法捕捉更深层次的动机原则。基于价值的方法提供了更原则性的路径，但仍存在三个差距：提取常常忽略层次结构，评估检测存在但未校准强度，并且LLMs在受控强度下的可引导性仍未得到充分理解。为解决这些限制，我们引入了VALUEFLOW，这是第一个统一框架，涵盖提取、评估和引导，并具有校准的强度控制。该框架整合了三个组件：(i) HIVES，一个层次化价值嵌入空间，捕捉理论和跨理论的价值结构；(ii) 价值强度数据库（VIDB），一个大规模资源，包含基于排序聚合得出的强度估计的价值标注文本；(iii) 一个基于锚点的评估器，通过将模型输出与VIDB面板进行排序，产生一致的强度分数。使用VALUEFLOW，我们在十个模型和四个价值理论上进行了全面的大规模研究，识别了可引导性的不对称性和多价值控制的组合规律。本文建立了一个可扩展的基础设施，用于评估和控制价值强度，推进了LLMs的多元化对齐。

英文摘要

Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preference-based methods often fail to capture deeper motivational principles. Value-based approaches offer a more principled path, yet three gaps persist: extraction often ignores hierarchical structure, evaluation detects presence but not calibrated intensity, and the steerability of LLMs at controlled intensities remains insufficiently understood. To address these limitations, we introduce VALUEFLOW, the first unified framework that spans extraction, evaluation, and steering with calibrated intensity control. The framework integrates three components: (i) HIVES, a hierarchical value embedding space that captures intra- and cross-theory value structure; (ii) the Value Intensity DataBase (VIDB), a large-scale resource of value-labeled texts with intensity estimates derived from ranking-based aggregation; and (iii) an anchor-based evaluator that produces consistent intensity scores for model outputs by ranking them against VIDB panels. Using VALUEFLOW, we conduct a comprehensive large-scale study across ten models and four value theories, identifying asymmetries in steerability and composition laws for multi-value control. This paper establishes a scalable infrastructure for evaluating and controlling value intensity, advancing pluralistic alignment of LLMs.


2603.11333 2026-06-08 cs.AI 版本更新

LLM-Augmented Digital Twin for Policy Evaluation in Short-Video Platforms

LLM增强的数字孪生用于短视频平台策略评估

Haoting Zhang, Yunduan Lin, Jinghai He, Denglin Jiang, Zuo-Jun Shen, Zeyu Zheng

发表机构 * University of California, Berkeley（加州大学伯克利分校）； The Chinese University of Hong Kong（香港中文大学）； New York University（纽约大学）； The University of Hong Kong（香港大学）

AI总结提出一种LLM增强的四模块数字孪生架构（用户、内容、交互、平台），通过事件驱动执行层和可插拔策略组件，支持在闭环动态下对平台策略（含AI策略）进行可复现的仿真评估。

详情

AI中文摘要

短视频平台是闭环、人在回路中的生态系统，其中平台策略、创作者激励和用户行为共同演化。这种反馈结构使得在生产环境中进行反事实策略评估变得困难，尤其是对于长期和分布性结果。随着平台部署改变内容进入系统方式、代理适应方式以及平台运行方式的AI工具，这一挑战被放大。我们提出了一种大语言模型（LLM）增强的数字孪生用于短视频平台，具有模块化的四孪生架构（用户、内容、交互、平台）和一个支持可复现实验的事件驱动执行层。平台策略作为平台孪生中的可插拔组件实现，LLM作为可选的、模式约束的决策服务（例如，角色生成、内容字幕、活动规划、趋势预测）集成，并通过统一优化器路由。这种设计使得可扩展的仿真成为可能，在保留闭环动态的同时允许选择性采用LLM，从而能够在现实反馈和约束下研究平台策略，包括AI增强策略。

英文摘要

Short-video platforms are closed-loop, human-in-the-loop ecosystems where platform policy, creator incentives, and user behavior co-evolve. This feedback structure makes counterfactual policy evaluation difficult in production, especially for long-horizon and distributional outcomes. The challenge is amplified as platforms deploy AI tools that change what content enters the system, how agents adapt, and how the platform operates. We propose a large language model (LLM)-augmented digital twin for short-video platforms, with a modular four-twin architecture (User, Content, Interaction, Platform) and an event-driven execution layer that supports reproducible experimentation. Platform policies are implemented as pluggable components within the Platform Twin, and LLMs are integrated as optional, schema-constrained decision services (e.g., persona generation, content captioning, campaign planning, trend prediction) that are routed through a unified optimizer. This design enables scalable simulations that preserve closed-loop dynamics while allowing selective LLM adoption, enabling the study of platform policies, including AI-enabled policies, under realistic feedback and constraints.


2603.19100 2026-06-08 cs.AI 版本更新

LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling

LuMamba: 用于电极拓扑不变且高效的EEG建模的潜在统一Mamba

Danaé Broustail, Anna Tegon, Thorir Mar Ingolfsson, Yawei Li, Luca Benini

发表机构 * ETH Zürich, Institute of Information Systems（苏黎世联邦理工学院信息系统研究所）

AI总结提出LuMamba框架，结合拓扑不变编码和线性复杂度状态空间模型，通过LUNA的查询交叉注意力机制统一通道，FEMBA的双向Mamba块高效时序建模，在5个下游任务上以4.6M参数达到SOTA性能。

Comments EUSIPCO 2026, 5 pages, 2 figures, 4 tables

详情

AI中文摘要

脑电图（EEG）能够在临床和神经技术应用中无创监测大脑活动，但由于电极拓扑结构不同以及Transformer架构的二次序列复杂度，构建EEG基础模型仍然具有挑战性。作为联合解决方案，我们提出了LuMamba（潜在统一Mamba），一个自监督框架，结合了拓扑不变编码和线性复杂度状态空间建模，使用LUNA的学习查询交叉注意力机制进行通道统一，以及FEMBA的双向Mamba块进行高效时序建模。在此架构内，我们首次系统研究了用于生物信号学习的潜在-欧几里得联合嵌入预测架构（LeJEPA）。在来自TUEG语料库的超过21,000小时未标记EEG上预训练后，LuMamba在五个下游任务上进行了评估，涵盖异常检测、伪影识别和精神状态分类，电极配置从16到26通道不等。在预训练目标中，仅掩码重建产生结构化但泛化性较差的表示，而仅LeJEPA产生扩散嵌入；结合两个目标实现了最稳健的性能。仅用4.6M参数，LuMamba在TUAB上达到80.99%的平衡准确率，在阿尔茨海默病检测上达到SOTA性能（0.97 AUPR），同时在等效序列长度下所需FLOPS比SOTA模型少377倍，并在达到典型GPU内存限制前可扩展到12倍更长的序列。代码可在https://github.com/pulp-bio/biofoundation获取。

英文摘要

Electroencephalography (EEG) enables non-invasive monitoring of brain activity across clinical and neurotechnology applications, yet building foundation models for EEG remains challenging due to differing electrode topologies and computational scalability, as Transformer architectures incur quadratic sequence complexity. As a joint solution, we propose LuMamba (Latent Unified Mamba), a self-supervised framework combining topology-invariant encodings with linear-complexity state-space modeling, using LUNA's learned-query cross-attention mechanism for channel unification, and FEMBA's bidirectional Mamba blocks for efficient temporal modeling. Within this architecture, we provide the first systematic investigation of the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA) for biosignal learning. Pre-trained on over 21,000 hours of unlabeled EEG from the TUEG corpus, LuMamba is evaluated on five downstream tasks spanning abnormality detection, artifact recognition, and mental condition classification across electrode configurations ranging from 16 to 26 channels. In the pre-training objective, masked reconstruction alone yields structured but less generalizable representations, while LeJEPA alone produces diffuse embeddings; combining both objectives achieves the most robust performance. With only 4.6M parameters, LuMamba attains 80.99% balanced accuracy on TUAB and achieves state-of-the-art performance on Alzheimer's detection (0.97 AUPR), while requiring 377x fewer FLOPs than state-of-the-art models at equivalent sequence lengths and scaling to 12x longer sequences before reaching typical GPU memory limits. Code is available at https://github.com/pulp-bio/biofoundation.


2603.19146 2026-06-08 cs.AI cs.LG 版本更新

D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding

D5P4：用于并行离散扩散解码中多样性的分区行列式点过程

Jonathan Lys, Vincent Gripon, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene, Bastien Pasdeloup

发表机构 * IMT Atlantique, Lab-STICC, UMR CNRS 6285（大西洋高等矿业电信学校，Lab-STICC，UMR CNRS 6285）； Sony Europe Ltd. Stuttgart Technology Center, EUREC（索尼欧洲有限公司，斯图加特技术中心，EUREC）

AI总结提出D5P4波束解码方法，利用分区行列式点过程在离散扩散模型中选择中间序列，平衡质量与多样性，无需外部验证器。

详情

AI中文摘要

离散扩散模型是自回归方法在文本生成中的有前途的替代方案，但其解码方法仍研究不足。标准的自回归搜索过程（如波束搜索）不直接适用于迭代去噪，其中假设是完整的中间序列而非从左到右的前缀。此外，现有的扩散解码过程对保留假设的多样性和覆盖范围的控制有限。在这项工作中，我们引入了D5P4，一种针对离散扩散模型定制的波束式解码方法，它将中间波束选择视为分区行列式点过程下的MAP推理。这产生了一个模型内部的批次目标，无需外部验证器即可平衡质量和多样性。在开放域生成、问答和数学推理上的实验表明，D5P4提高了多样性和pass@$k$覆盖率，同时匹配或超越了基线质量和保真度。

英文摘要

Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard autoregressive search procedures, such as beam search, do not directly apply to iterative denoising, where hypotheses are complete intermediate sequences rather than left-to-right prefixes. Furthermore, existing diffusion decoding procedures only provide limited control over the diversity and coverage of retained hypotheses. In this work, we introduce D5P4, a beam-style decoding method tailored to discrete diffusion models, which casts intermediate beam selection as MAP inference under a partitioned Determinantal Point Process. This yields a model-internal batch objective that balances quality and diversity without external verifiers. Experiments on open-ended generation, question answering, and mathematical reasoning show that D5P4 improves diversity and pass@$k$ coverage while matching or surpassing baseline quality and fidelity.
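To give a feel for how a determinantal point process trades quality against diversity when choosing which hypotheses to keep, the sketch below runs greedy MAP inference on a plain (unpartitioned) DPP with kernel L[i, j] = q_i · q_j · cos(f_i, f_j). It deliberately leaves out the partition structure that D5P4 adds, which the abstract does not detail; the helper name, the quality scores, and the feature vectors are assumptions, and NumPy is required.

```python
import numpy as np


def greedy_dpp_map(quality, features, k):
    """Greedily pick up to k items that maximize log det of the selected
    sub-kernel, where L[i, j] = q_i * q_j * cosine(f_i, f_j)."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    kernel = np.outer(quality, quality) * (feats @ feats.T)
    selected = []
    for _ in range(k):
        best_idx, best_logdet = None, -np.inf
        for i in range(len(quality)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        if best_idx is None:          # no candidate keeps the determinant positive
            break
        selected.append(best_idx)
    return selected


# Items 0 and 1 are near-duplicates; item 2 is lower quality but distinct.
quality = np.array([1.0, 0.95, 0.8])
features = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
print(greedy_dpp_map(quality, features, k=2))
```

A pure top-k by quality would keep the two near-duplicates, whereas the determinant objective prefers the distinct third item as the second pick.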


2603.24481 2026-06-08 cs.AI cs.CL cs.LG 版本更新

Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

基于一致性验证的多智能体推理改进医学多项选择题问答中的不确定性校准

John Ray B. Martinez

发表机构 * Department of Data Science and Analytics（数据科学与分析系）

AI总结提出多智能体框架，结合领域专家智能体与两阶段验证及S分数加权融合，在医学MCQA中显著降低校准误差并提升判别能力。

Comments 20 pages, 6 figures. Preprint under review

详情

AI中文摘要

校准不良的置信度分数是AI在临床环境中部署的实际障碍。总是过度自信的模型无法为移交（deferral）决策提供有用信号。我们提出了一个多智能体框架，结合领域特定专家智能体与两阶段验证（Wu等人，2024）和S分数加权融合，以改进医学多项选择题问答中的校准和判别能力。四个专家智能体（呼吸科、心脏病科、神经科、胃肠科）使用Qwen2.5-7B-Instruct生成独立诊断。每个诊断经历两阶段自我验证过程，测量内部一致性并产生专家置信度分数（S分数）。S分数驱动加权融合策略，选择最终答案并校准报告的置信度。我们在MedQA-USMLE和MedMCQA的高分歧子集（100和250个问题）上进行评估。所有结果均针对此过滤后的设置。在MedQA-250上，完整系统实现了ECE=0.091（比单专家基线降低74.4%）和AUROC=0.630（+0.056），准确率为59.2%。在所有四种设置中，校准增益保持在49-74%。消融分析表明，两阶段验证驱动ECE降低，而多智能体推理驱动AUROC提升，表明一致性检查和集成聚合解决了LLM不确定性的不同失败模式。由此产生的置信度信号是否足以在实践中支持临床移交决策，仍是未来研究的方向。

英文摘要

Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification (Wu et al., 2024) and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis undergoes a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate on high-disagreement subsets of MedQA-USMLE and MedMCQA (100 and 250 questions). All results are specific to this filtered regime. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Calibration gains of 49-74% hold across all four settings. Ablation analysis reveals that Two-Phase Verification drives ECE reduction while multi-agent reasoning drives AUROC improvement, suggesting that consistency checking and ensemble aggregation address different failure modes of LLM uncertainty. Whether the resulting confidence signal is sufficient to support clinical deferral decisions in practice remains a direction for future investigation.
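
The ECE figure quoted above can be made concrete with a small sketch. The abstract does not say which binning scheme the authors use, so the version below assumes the common equal-width-bin definition of expected calibration error; the function name and the `n_bins` default are illustrative choices, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (assumed definition, not confirmed by the abstract).

    confidences: floats in [0, 1]; correct: booleans of the same length.
    Returns the sample-weighted mean of |accuracy - mean confidence| per bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

An always-overconfident model, as described in the abstract, scores badly here: four answers reported at 0.95 confidence with only two correct give a gap of about 0.45.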


2603.24963 2026-06-08 cs.AI cs.LG 版本更新

Design Once, Deploy at Scale: Template-Driven ML Development for Large Model Ecosystems

一次设计，大规模部署：面向大型模型生态的模板驱动ML开发

Jiang Liu, John Martabano Landy, Yao Xuan, Swamy Muddu, Nhat Le, Munaf Sahaf, Luc Kien Hang, Rupinder Khandpour, Kevin De Angeli, Chang Yang, Shouyuan Chen, Shiblee Sadik, Anirudh Agrawal, Djordje Gligorijevic, Jingzheng Qin, Peggy Yao, Alireza Vahdatpour

发表机构 * Meta AI

AI总结针对大型模型生态中ML开发效率低的问题，提出标准化模型模板（SMT）框架，将技术传播复杂度从O(n·2^k)降至O(n+k)，在Meta广告排名系统中实现交叉熵提升0.63%、迭代时间减少92%、技术-模型对采用吞吐量提升6.3倍。

详情

AI中文摘要

现代计算广告平台通常依赖推荐系统来预测用户响应，如点击率、转化率和其他优化事件。为了支持多样化的产品表面和广告主目标，这些平台经常维护一个庞大的机器学习（ML）模型生态系统。然而，在这种规模下运营带来了显著的开发和效率挑战。定期刷新ML模型并传播新技术需要大量的工程投入，这导致在生态系统中部署ML创新时出现较长延迟。我们提出了一项大规模实证研究，在推荐系统中比较标准化模型构建方法与独立的逐模型优化在模型性能、效率和ML技术传播方面的差异。为了促进这种标准化，我们提出了标准模型模板（SMT），这是一个能够生成适应不同数据分布和优化事件的高性能模型的框架。通过利用标准化、可组合的ML模型组件，SMT将技术传播复杂度从O(n·2^k)降低到O(n+k)，其中n是模型数量，k是技术数量。在Meta的生产广告排名生态系统中，对四个全球开发周期内的大量模型进行评估，我们的结果表明：（1）在服务容量保持中性的条件下，交叉熵平均改善0.63%；（2）每模型迭代工程时间减少92%；（3）技术-模型对采用吞吐量提高6.3倍。这些发现挑战了多样化优化目标本质上需要多样化ML模型设计的传统观点。

英文摘要

Modern computational advertising platforms typically rely on recommendation systems to predict user responses, such as click-through rates, conversion rates, and other optimization events. To support a wide variety of product surfaces and advertiser goals, these platforms frequently maintain an extensive ecosystem of machine learning (ML) models. However, operating at this scale creates significant development and efficiency challenges. Substantial engineering effort is required to regularly refresh ML models and propagate new techniques, which results in long latencies when deploying ML innovations across the ecosystem. We present a large-scale empirical study comparing model performance, efficiency, and ML technique propagation between a standardized model-building approach and independent per-model optimization in recommendation systems. To facilitate this standardization, we propose the Standard Model Template (SMT) -- a framework that generates high-performance models adaptable to diverse data distributions and optimization events. By utilizing standardized, composable ML model components, SMT reduces technique propagation complexity from $O(n \cdot 2^k)$ to $O(n + k)$ where $n$ is the number of models and $k$ the number of techniques. Evaluating an extensive suite of models over four global development cycles within Meta's production ads ranking ecosystem, our results demonstrate: (1) a 0.63% average improvement in cross-entropy at neutral serving capacity, (2) a 92% reduction in per-model iteration engineering time, and (3) a $6.3\times$ increase in technique-model pair adoption throughput. These findings challenge the conventional wisdom that diverse optimization goals inherently require diversified ML model design.
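
To get a feel for the gap between the two complexity classes quoted above, the sketch below simply evaluates both expressions. It takes $O(n \cdot 2^k)$ and $O(n + k)$ at face value with all constant factors set to 1, which is an assumption made purely for illustration; the abstract does not give the underlying cost model, and the function names are hypothetical.

```python
def per_model_cost(n_models: int, k_techniques: int) -> int:
    """Illustrative count for the O(n * 2^k) regime (constants assumed to be 1)."""
    return n_models * 2 ** k_techniques


def template_cost(n_models: int, k_techniques: int) -> int:
    """Illustrative count for the O(n + k) regime (constants assumed to be 1)."""
    return n_models + k_techniques


if __name__ == "__main__":
    for n, k in [(10, 3), (100, 10), (500, 20)]:
        print(n, k, per_model_cost(n, k), template_cost(n, k))
```

With 100 models and 10 techniques the two expressions already differ by roughly three orders of magnitude (102400 versus 110), which is the kind of scaling argument the abstract is making.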


2604.02029 2026-06-08 cs.AI 版本更新

Insights Generator: LLM代理的系统级语料库追踪诊断

Akshay Manglik, Apaar Shanker, Kaustubh Deshpande, Jason Qin, Yash Maurya, Veronica Chatrath, Vijay S. Kalmath, Levi Lentz, Yuan Xue

发表机构 * Scale AI, Inc.

AI总结本文提出Insights Generator，一种多智能体系统，通过在语料库中提出和测试假设来生成基于证据的洞察报告，从而系统性地诊断LLM代理的行为模式。

详情

AI中文摘要

诊断LLM代理的故障仍然主要依赖人工。从业者检查一小部分执行轨迹，形成临时假设并反复迭代。此过程会错过仅在轨迹群体中显现的模式，并且无法扩展到单条轨迹长达数万个token的生产语料库。我们形式化了语料库级轨迹诊断问题。给定一个执行轨迹语料库，目标是生成有据可依的自然语言洞察，以描述各轨迹群体中的系统性行为模式，每条洞察都与支持证据相关联。我们提出了Insights Generator (IG)，一种多智能体系统，通过在轨迹语料库中提出和测试假设来回答诊断问题，从而生成有证据支持的洞察报告。我们从定性和客观两个维度评估了IG，涵盖基于评分标准的报告评估和通过实施IG洞察获得的下游性能改进。使用IG报告的人类专家使脚手架（scaffold）性能比未修改的基线脚手架提高了30.4个百分点，而利用IG衍生洞察的编码代理显示出一致且稳定的提升。在多个基准测试中，IG的scout-investigator架构生成的发现与竞争方法在检测覆盖方面相当，同时领域专家认为IG报告在深度和证据质量上领先。

英文摘要

Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report. We evaluate IG across qualitative and objective dimensions, spanning rubric-based report assessment and downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4pp over the unmodified baseline scaffold, and coding agents leveraging IG-derived insights show consistent and stable gains. Across benchmarks, IG's scout-investigator architecture produces findings comparable in detection coverage to competing approaches, while domain experts rated IG reports as leading in depth and evidence quality.


2605.21706 2026-06-08 cs.AI 版本更新

Latent-space Attacks for Refusal Evasion in Language Models

潜在空间攻击用于语言模型的拒绝规避

Giorgio Piras, Raffaele Mura, Fabio Brau, Maura Pintor, Luca Oneto, Fabio Roli, Battista Biggio

发表机构 * University of Cagliari（卡利亚里大学）； University of Genova（热那亚大学）

AI总结本文研究了如何通过潜在空间攻击来规避语言模型的拒绝行为，提出了一种受控的潜在空间攻击方法，以提高攻击成功率，优于现有方法。

详情

AI中文摘要

安全对齐的语言模型被训练以拒绝有害请求，但通过引导其内部表示可以抑制拒绝行为。现有方法通过消融拒绝方向来消除模型残差流中的拒绝行为。尽管这些方法在经验上取得了成功，但它们缺乏对所诱导的潜在空间转换的系统性解释以及为何会抑制拒绝。在本文中，我们将拒绝抑制视为对训练以区分拒绝和回答提示的线性探测的潜在空间规避攻击。在此观点下，先前工作的差异均值方向自然定义了这样的探测器，其消融正好是对其决策边界上的投影，即最小置信度规避攻击。这种视角不仅解释了先前工作的经验成功，也承认了一个关键限制：规避在决策边界停止，这促使需要将表示进一步推入合规区域，即模型回答的区域。我们通过提出受控的潜在空间规避攻击方法来利用这一点，该方法通过优化置信度将表示投影到边界之外。我们在15个指令微调、多模态和推理模型上实现了最先进的攻击成功率，优于现有拒绝消融基线和专门的 jailbreak 攻击。

英文摘要

Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key limitation: evasion stops at the decision boundary, motivating the need to push representations further into the compliant region, i.e., where the model answers. We leverage this by proposing a Controlled Latent-space Evasion attack that projects representations past the boundary with an optimized confidence. We achieve state-of-the-art attack success rate across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.
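
The claim that ablating the difference-in-means direction amounts to projecting onto the probe's decision boundary can be checked numerically. The sketch below is a toy version under stated assumptions: the probe is bias-free (its boundary is the hyperplane $w \cdot x = 0$), activations are short Python lists, and the fixed `margin` parameter is a hypothetical stand-in for the optimized confidence mentioned in the abstract, not the authors' procedure.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def difference_in_means(refused, answered):
    """Probe direction w = mean(refused activations) - mean(answered activations)."""
    mu_refused = mean_vector(refused)
    mu_answered = mean_vector(answered)
    return [r - a for r, a in zip(mu_refused, mu_answered)]


def project_past_boundary(x, w, margin=0.0):
    """Move x along w until the probe score w.x equals -margin.

    margin=0.0 reproduces plain directional ablation (x lands on the boundary);
    margin>0 pushes x into the side the probe labels as 'answered'.
    """
    scale = (dot(x, w) + margin) / dot(w, w)
    return [xi - scale * wi for xi, wi in zip(x, w)]
```

With `margin=0.0` the returned vector has a probe score of zero, which is the "evasion stops at the decision boundary" limitation the abstract describes; any positive margin yields a strictly negative score.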


2605.28742 2026-06-08 cs.AI 版本更新

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

CORE: 对比反思实现推理能力的快速提升

Linas Nasvytis, Simon Jerome Han, Ben Prystawski, Satchel Grant, Noah D. Goodman, Judith E. Fan

发表机构 * Stanford University（斯坦福大学）

AI总结提出对比反思（CORE）非参数学习算法，通过对比成功与失败的推理轨迹生成自然语言洞察，在少量样本和 rollout 下实现比参数方法（GRPO）和非参数方法（GEPA、情景RAG、MemRL）更快的推理性能提升。

详情

AI中文摘要

语言模型可以利用可验证奖励在多种推理任务上提升性能。然而，无论是参数方法（如RLVR）还是非参数方法（如提示优化），通常都需要数百个训练样本和数千次模型 rollout，这在最佳情况下成本高昂，最坏情况下则难以处理。为解决这一挑战，我们引入了对比反思（CORE），一种非参数学习算法，通过比较过去的推理轨迹来生成洞察：即捕捉成功与不成功问题尝试之间差异的推理策略和约束的简短自然语言描述。在四个推理任务上，我们证明CORE比参数方法（GRPO）和非参数方法（GEPA、情景RAG和MemRL）实现更快的改进，同时使用更少的rollout。在固定rollout预算且训练样本少至五个的情况下，CORE在大多数任务-数据设置中取得了最强性能。最后，我们强调CORE在上下文效率上也显著优于非参数基线，需要更少的提示token，同时将学到的知识存储为紧凑、可解释的自然语言洞察。因此，我们的结果表明，将成功与不成功推理轨迹之间的对比提炼为抽象且有用的洞察，比权重更新、提示优化或直接重用存储的推理轨迹，为模型自我改进提供了一条更高效且可解释的途径。

英文摘要

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, CORE achieves the strongest performance in most task-data regimes. Finally, we highlight how CORE is substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.


2606.00357 2026-06-08 cs.AI 版本更新

面向基于预训练模型的持续学习中的高效且精确的遗忘服务

Yajiang Huang, Jianheng Tang, Kejia Fan, Huiping Zhuang, Anfeng Liu, Tian Wang, Yunhuai Liu, Mianxiong Dong, Houbing Herbert Song

发表机构 * Department of Information Systems, University of Maryland, Baltimore County (UMBC)（马里兰大学巴尔的摩分校信息系统系）

AI总结针对持续学习中顺序遗忘请求的挑战，提出基于解析方法的持续遗忘（ACU），通过最小二乘递归推导闭式解，实现高效精确的遗忘，保护历史数据隐私。

详情

AI中文摘要

在持续学习（CL）中，使用预训练模型（PTM）作为特征提取器已成为一种流行做法。结合解析分类器，基于PTM的方法在CL中实现了最先进的性能，追求非遗忘目标。同时，在大多数服务构建范式（例如移动群智感知（MCS））中，主动遗忘在CL阶段获得的特定知识也至关重要，其中移动边缘节点不断收集传感数据，不仅需要非遗忘适应，还需要特定知识遗忘以保护隐私。因此，当遗忘请求在CL中顺序出现时，产生了一个独特的问题，称为持续遗忘（CU）。然而，现有的遗忘方法专注于单次联合遗忘，在应用于CU时显得非常不足，包括（1）违反CL中的历史数据隐私，以及（2）容易被对抗性频繁请求淹没或降级。为了应对CU的挑战，我们提出了一种无梯度方法，称为解析持续遗忘（ACU），用于在基于PTM的CL中实现高效且精确的遗忘，同时保护历史数据隐私。针对每个遗忘请求，我们的ACU通过最小二乘法以可解释的方式递归推导解析（即闭式）解。通过精心设计，我们的ACU兼容样本级和类别级遗忘请求。理论和实验评估验证了我们的ACU在遗忘有效性、模型保真度和系统效率方面的优越性。

英文摘要

In Continual Learning (CL), using a Pre-Trained Model (PTM) as the feature extractor has become a popular practice. Accompanied by analytic classifiers, the PTM-based methods have achieved state-of-the-art performance in CL, in pursuit of the non-forgetting goal. Meanwhile, actively forgetting specific knowledge acquired during the CL phase is also essential in most service construction paradigms, for example, Mobile Crowd Sensing (MCS), where mobile edge nodes continuously collect sensory data and demand not only non-forgetting adaptation but also specific knowledge forgetting for privacy preservation. Thus, a unique problem, called Continual Unlearning (CU), arises when the forgetting requests show sequentially in CL. However, existing unlearning methods focus on single-shot joint forgetting and prove highly inadequate when applied to CU, including (1) violating the historical data privacy in CL and (2) vulnerably being overwhelmed or degraded with adversarially frequent requests. To handle the challenges of CU, we propose a gradient-free approach, called Analytic Continual Unlearning (ACU), for efficient and exact forgetting with historical data privacy preservation in PTM-based CL. In response to each unlearning request, our ACU recursively derives the analytical (i.e., closed-form) solutions via least squares in an interpretable manner. By meticulous design, our ACU is compatible with both sample-level and class-level unlearning requests. The theoretical and experimental evaluations validate our ACU's superiority in unlearning effectiveness, model fidelity, and system efficiency.


2505.23437 2026-06-08 cs.LG cs.AI cs.IR 版本更新

Bounded-Abstention Pairwise Learning to Rank

有界弃权成对学习排序

Antonio Ferrara, Andrea Pugnana, Francesco Bonchi, Salvatore Ruggieri

发表机构 * Intesa Sanpaolo AI Research（Intesa Sanpaolo AI研究中心）； University of Trento（特伦托大学）； University of Pisa（比萨大学）

AI总结提出一种基于条件风险阈值的成对排序弃权方法，理论刻画最优策略，设计模型无关的插件算法，实验验证有效性。

Comments KDD 2026

详情

AI中文摘要

排序系统影响健康、教育和就业等高风险领域的决策，可能产生重大经济和社会影响，因此集成安全机制至关重要。弃权是一种安全机制，允许算法决策系统将不确定或低置信度的决策推迟给人类专家。虽然弃权主要在分类任务中研究，但其在其他机器学习范式中的应用尚不充分。本文提出一种用于成对学习排序任务的弃权新方法。该方法基于对排序器条件风险设置阈值：当估计风险超过预定义阈值时，系统弃权不做决策。我们的贡献有三方面：最优弃权策略的理论刻画、一个模型无关的插件式算法用于构建弃权排序模型，以及在多个数据集上的全面实证评估，证明了我们方法的有效性。

英文摘要

Ranking systems influence decision-making in high-stakes domains like health, education, and employment, where they can have substantial economic and social impacts. This makes the integration of safety mechanisms essential. One such mechanism is abstention, which enables algorithmic decision-making systems to defer uncertain or low-confidence decisions to human experts. While abstention has been predominantly explored in the context of classification tasks, its application to other machine learning paradigms remains underexplored. In this paper, we introduce a novel method for abstention in pairwise learning-to-rank tasks. Our approach is based on thresholding the ranker's conditional risk: the system abstains from making a decision when the estimated risk exceeds a predefined threshold. Our contributions are threefold: a theoretical characterization of the optimal abstention strategy, a model-agnostic, plug-in algorithm for constructing abstaining ranking models, and a comprehensive empirical evaluation across multiple datasets, demonstrating the effectiveness of our approach.
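
The thresholding rule described above can be sketched in a few lines. This is a minimal illustration, assuming a 0-1 pairwise loss so that the conditional risk of predicting the more likely order is `min(p, 1 - p)`, where `p` is an estimated probability that item a should rank above item b. The function name, the return labels, and the loss choice are assumptions; the paper's estimator and its bound on the abstention rate are not reproduced here.

```python
def pairwise_decision(p_a_over_b: float, risk_threshold: float):
    """Return 'a>b', 'b>a', or None (abstain and defer to a human expert).

    Assumes 0-1 pairwise loss: the risk of committing to the more likely
    ordering is the probability of the less likely one.
    """
    if not 0.0 <= p_a_over_b <= 1.0:
        raise ValueError("p_a_over_b must be a probability")
    estimated_risk = min(p_a_over_b, 1.0 - p_a_over_b)
    if estimated_risk > risk_threshold:
        return None  # too uncertain: abstain
    return "a>b" if p_a_over_b >= 0.5 else "b>a"
```

Lowering `risk_threshold` makes the ranker abstain on more pairs, so in practice the threshold would be tuned against whatever abstention budget the deployment allows.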


2506.01850 2026-06-08 cs.CV cs.AI cs.LG cs.MM 版本更新

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

MoDA: 面向指令型多模态大语言模型的细粒度视觉定位的调制适配器

Wayner Barrios, Andrés Villa, Juan León Alcázar, SouYoung Jin, Bernard Ghanem

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出MoDA调制适配器，通过指令引导的通道级乘法调制增强细粒度视觉定位，在12个基准上对三种MLLM架构取得一致提升，计算开销极小。

Comments Accepted at ICML 2026. Code is available at https://github.com/waybarrios/MoDA

详情

AI中文摘要

多模态大语言模型（MLLMs）通过将预训练的视觉编码器与大语言模型（LLMs）集成，在指令跟随任务中取得了显著成功。然而，现有方法由于视觉补丁表示中的语义纠缠，常常难以实现细粒度的视觉定位，其中单个补丁混合了多个不同的视觉元素，使得模型难以聚焦于指令相关的细节。为了应对这一挑战，我们提出了MoDA（调制适配器），一种轻量级模块，通过指令引导的通道级调制增强视觉定位。与Q-Former等执行加性特征选择的令牌级方法不同，MoDA通过对已对齐特征进行乘法调制在通道级操作，从而实现对每个指令相关嵌入维度的细粒度控制。遵循标准的LLaVA训练协议，MoDA在语言指令与预对齐的视觉特征之间应用交叉注意力，生成动态调制掩码，无需架构修改或额外监督。我们在涵盖视觉问答、视觉中心推理和幻觉检测的12个基准上评估了MoDA，包括最近的2024年基准（MMVP、CV-Bench、MMStar、RealWorldQA），并在三种不同的MLLM架构上进行了测试：LLaVA-1.5、LLaVA-MoRE（2025）和Qwen3-VL（2025）。MoDA在所有三个系列中均取得了一致的提升，在LLaVA-1.5系列的MMVP上提升了+12.0个百分点，在LLaVA-MoRE系列的ScienceQA上提升了+4.8个百分点，在Qwen3-VL上ScienceQA提升了+4.9、RealWorldQA提升了+4.1、GQA提升了+3.8，证实了这些增益在CLIP编码器之外具有泛化性，且计算开销极小（<1% FLOPs）。代码可在https://github.com/waybarrios/MoDA获取。

英文摘要

Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle with fine-grained visual grounding due to semantic entanglement in visual patch representations, where individual patches blend multiple distinct visual elements, making it difficult for models to focus on instruction-relevant details. To address this challenge, we propose MoDA (Modulation Adapter), a lightweight module that enhances visual grounding through instruction-guided channel-wise modulation. Unlike token-level methods such as Q-Former that perform additive feature selection, MoDA operates at the channel level through multiplicative modulation on already-aligned features, enabling fine-grained control over which embedding dimensions are relevant for each instruction. Following the standard LLaVA training protocol, MoDA applies cross-attention between language instructions and pre-aligned visual features, generating dynamic modulation masks without architectural modifications or additional supervision. We evaluate MoDA across 12 benchmarks spanning visual question answering, vision-centric reasoning, and hallucination detection, including recent 2024 benchmarks (MMVP, CV-Bench, MMStar, RealWorldQA), on three distinct MLLM architectures: LLaVA-1.5, LLaVA-MoRE (2025), and Qwen3-VL (2025). MoDA delivers consistent gains across all three families, with +12.0 points on MMVP for the LLaVA-1.5 family and +4.8 points on ScienceQA for the LLaVA-MoRE family, and +4.9 ScienceQA, +4.1 RealWorldQA, and +3.8 GQA on Qwen3-VL, confirming that the gains generalize beyond CLIP-based encoders with minimal overhead (<1% FLOPs). Code is available at https://github.com/waybarrios/MoDA.


2506.11066 2026-06-08 cs.SE cs.AI 版本更新

CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval

CoQuIR：面向代码质量感知信息检索的综合基准

Jiahui Geng, Fengyu Cai, Shaobo Cui, Qing Li, Liangwei Chen, Chenyang Lyu, Haonan Li, Derui Zhu, Walter Pretschner, Heinz Koeppl, Fakhri Karray

发表机构 * Linköping University（林雪平大学）； MBZUAI（穆罕默德·本·扎耶德人工智能大学）； TU Darmstadt（达姆施塔特工业大学）； Shanghai Jiao Tong University（上海交通大学）； EPFL（洛桑联邦理工学院）； University of Groningen（格罗宁根大学）； Google Tokyo（谷歌东京）； Alibaba Group（阿里巴巴集团）； TU Munich（慕尼黑工业大学）

AI总结提出首个大规模多语言代码质量感知检索基准CoQuIR，涵盖正确性、效率、安全性和可维护性四维度，通过细粒度标注和两个质量中心指标评估23个模型，发现顶尖模型常无法区分有缺陷代码，并探索了训练方法以提升质量感知能力。

详情

AI中文摘要

代码检索在现代软件开发中至关重要，因为它能促进代码复用并加速调试。然而，当前的基准主要强调功能相关性，而忽视了软件质量的关键维度。受此差距启发，我们引入了CoQuIR，这是首个大规模、多语言的基准，专门设计用于评估跨四个关键维度（正确性、效率、安全性和可维护性）的质量感知代码检索。CoQuIR为11种编程语言的42,725个查询和134,907个代码片段提供了细粒度的质量注释，并附带两个以质量为中心的评估指标：成对偏好准确率和基于边界的排名分数。利用CoQuIR，我们对23个检索模型（涵盖开源和专有系统）进行了基准测试，发现即使是最先进的模型也常常无法区分有缺陷或不安全的代码与其更健壮的对应代码。此外，我们初步研究了明确鼓励检索器识别代码质量的训练方法。使用合成数据集，我们展示了在各种模型上质量感知指标的显著改进，而不牺牲语义相关性。下游代码生成实验进一步验证了我们方法的有效性。总体而言，我们的工作强调了将质量信号整合到代码检索系统中的重要性，为更可信和更健壮的软件开发工具奠定了基础。

英文摘要

Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across four key dimensions: correctness, efficiency, security, and maintainability. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages, and is accompanied by two quality-centric evaluation metrics: Pairwise Preference Accuracy and Margin-based Ranking Score. Using CoQuIR, we benchmark 23 retrieval models, covering both open-source and proprietary systems, and find that even top-performing models frequently fail to distinguish buggy or insecure code from their more robust counterparts. Furthermore, we conduct preliminary investigations into training methods that explicitly encourage retrievers to recognize code quality. Using synthetic datasets, we demonstrate promising improvements in quality-aware metrics across various models, without sacrificing semantic relevance. Downstream code generation experiments further validate the effectiveness of our approach. Overall, our work highlights the importance of integrating quality signals into code retrieval systems, laying the groundwork for more trustworthy and robust software development tools.


2507.01548 2026-06-08 cs.HC cs.AI cs.CL 版本更新

Telling stories, making Hanzi: AI-assisted co-creation with elderly migrants in urban China

讲述故事，创造汉字：人工智能辅助中国城市老年移民的协同创作

Yunfei Chen, Wen Zhan, Peiyue Lin, Ziqun Hua, Ying Hu

发表机构 * School of Design, Hunan University（湖南大学设计学院）； Royal College of Art（皇家艺术学院）； University of the Arts London, Central Saint Martins（伦敦艺术大学，中央圣马丁学院）

AI总结通过协同创作工作坊，结合口述故事、AI辅助和手工制作，让老年移民创造新汉字以记录被忽视的生活故事，揭示参与者的异质性和适应能力，并展示AI作为降低表达门槛的创意启动器。

详情

DOI: 10.21606/drs.2026.963

AI中文摘要

本文探讨了中国城市老年移民如何记录日常语言和设计常忽略的故事。我们与10位老年人开展了两次协同创作工作坊。活动结合了口述故事、主持人中介的AI辅助和手工制作。大型语言模型通过主持人提出候选字形。参与者创作了新的汉字来承载他们的故事。生成的字符作为记忆锚点，用于后续的分享和复述。我们的解释性分析揭示了参与者之间的异质性和适应能力。参与者将AI视为降低表达和创作门槛的创意启动器，尤其对数字素养较低者。这项工作挑战了关于老年人的同质化假设以及统一能力和需求的预设。我们贡献了一个将AI定位为后台促进者的工作坊框架，并提供了在包容性城市系统中将老年移民视为社区记忆和情境文化知识来源的见解。

英文摘要

This paper explores how older migrants in urban China can record stories that everyday language and design often miss. We ran two co-creation workshops with 10 elders. Activities combined oral storytelling, facilitator-mediated AI assistance, and hand-making. Large language models proposed candidate glyphs through a facilitator. Participants crafted new Hanzi to hold their stories. The resulting characters served as memory anchors for later sharing and retelling. Our interpretive analysis shows heterogeneity and adaptive capacity among participants. Participants experienced AI as a creative initiator that lowered barriers to expression and making, especially for those with lower digital literacy. The work challenges homogenizing assumptions about older adults and the presumption of uniform capacities and needs. We contribute a workshop framework that positions AI as a backstage facilitator. We also offer insights on engaging older migrants as sources of community memory and situated cultural knowledge within inclusive urban systems.


2508.17693 2026-06-08 cs.DB cs.AI cs.CL 版本更新

Database Normalization via Dual-LLM Self-Refinement

通过双LLM自精炼的数据库规范化

Eunjae Jo, Nakyung Lee, Gyuyeong Kim

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出Miffie框架，利用双模型自精炼架构和大语言模型实现数据库自动规范化，无需人工干预且保持高准确率。

Comments 7 pages

2508.17821 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Limitations of Normalization in Attention Mechanism

注意力机制中归一化的局限性

Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

发表机构 * University of Luxembourg（卢森堡大学）； London Institute for Mathematical Sciences（伦敦数学科学研究所）

AI总结本文通过理论框架和GPT-2实验，揭示softmax归一化导致注意力随选择token数增加而趋于均匀，并分析低温度下梯度敏感性带来的训练挑战。

详情

AI中文摘要

本文研究了注意力机制中归一化的局限性。我们首先建立了一个理论框架，用于识别模型的选择能力以及token选择中涉及的几何分离。我们的分析包括在softmax缩放下token向量距离和分离准则的显式界限。通过使用预训练的GPT-2模型进行实验，我们实证验证了理论结果，并分析了注意力机制的关键行为。值得注意的是，我们证明随着所选token数量的增加，模型区分信息性token的能力下降，通常趋向于均匀选择模式。我们还表明，softmax归一化下的梯度敏感性在训练过程中带来了挑战，尤其是在低温度设置下。这些发现推进了当前对基于softmax的注意力机制的理解，并激发了在未来注意力架构中需要更稳健的归一化和选择策略的需求。

英文摘要

This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance the current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
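
The drift toward uniform attention can be illustrated with a toy calculation. The sketch below is not the paper's bound: it assumes a single informative token whose logit exceeds all others by a fixed `gap`, with every other logit set to 0, and it reports the softmax weight that token receives as the context grows. The function names and default values are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_weight(n_tokens, gap=2.0, temperature=1.0):
    """Attention weight on one informative token among n_tokens.

    Toy setup (assumed): the informative token has logit `gap`,
    the remaining n_tokens - 1 tokens have logit 0.
    """
    logits = [gap] + [0.0] * (n_tokens - 1)
    return softmax(logits, temperature)[0]


if __name__ == "__main__":
    for n in (16, 128, 1024):
        print(n, round(top_weight(n), 4))
```

Because the logit gap is bounded, the informative token's weight is $e^{g} / (e^{g} + n - 1)$ and shrinks roughly like $1/n$; lowering the temperature sharpens the distribution, which is the regime where the abstract reports gradient-sensitivity problems.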


2509.04991 2026-06-08 physics.ao-ph cs.AI cs.LG 版本更新

A Mechanism-Coupled Split Window Network for Medium- to High-Resolution Land Surface Temperature Retrieval

一种面向中高分辨率地表温度反演的机制耦合分裂窗网络

Tian Xie, Menghui Jiang, Chao Zeng, Huifang Li, Guanhao Zhang, Chan Li, Huanfeng Shen

发表机构 * School of Resource and Environmental Sciences, Wuhan University（武汉大学资源与环境科学学院）； Key Laboratory of Geographic Information System of Ministry of Education, Wuhan（教育部地理信息系统重点实验室）； Key Laboratory of Digital Cartography and Land Information Application of the Ministry of Natural Resources, Wuhan（自然资源部数字测图与土地信息应用重点实验室）

AI总结提出并行分量解耦神经网络（PCD-Net），将分裂窗反演重构为物理分量系数的动态学习问题，通过分量级解耦建模和残差分支，实现复杂大气和地表条件下的高精度、鲁棒且全局可泛化的地表温度反演。

详情

AI中文摘要

地表温度（LST）是陆-气相互作用、地表能量收支和气候过程中的基本物理变量。从中高分辨率热红外（TIR）观测中获取的LST能有效揭示不同景观单元间的热环境差异。然而，在复杂大气条件和多样土地覆盖类型下，实现准确、鲁棒且全局可泛化的LST反演仍具挑战。传统分裂窗（SW）算法严重依赖经验参数化，其固定系数无法适应高温地表和高大气水汽含量等复杂场景。同时，传统数据驱动模型因缺乏显式物理结构约束，对分布外（OOD）样本的泛化能力有限。为解决这些问题，本研究提出并行分量解耦神经网络（PCD-Net）框架，将SW反演重构为物理分量系数的动态学习问题。以SW方程作为物理主干，该框架构建并行子网络，自适应学习对应常数项、一阶和二阶亮度温度差项的动态系数；同时引入残差分支，补充由地表发射率和大气水汽联合效应引起的非线性耦合校正。通过这种分量级解耦建模，PCD-Net显式刻画了地表发射率、大气水汽含量与不同SW物理分量之间的动态响应关系。

英文摘要

Land surface temperature (LST) is a fundamental physical variable in land-atmosphere interactions, surface energy budgets, and climate processes. LST derived from medium- to high-resolution thermal infrared (TIR) observations effectively reveals thermal environmental disparities across distinct landscape units. However, achieving accurate, robust, and globally generalizable LST retrieval remains challenging under complex atmospheric conditions and diverse land cover types. Traditional split window (SW) algorithms heavily rely on empirical parameterizations, whose fixed coefficients fail to adapt to complex scenarios such as high surface temperatures and high atmospheric water vapor content. Concurrently, conventional data-driven models exhibit limited generalizability to out-of-distribution (OOD) samples due to the absence of explicit physical structure constraints. To address these issues, this study proposes a Parallel Component Decoupled Neural Network (PCD-Net) framework, which reformulates SW retrieval as a dynamic learning problem of physical component coefficients. Using the SW equation as the physical backbone, the framework constructs parallel subnetworks to adaptively learn the dynamic coefficients corresponding to the constant, first-order, and second-order brightness temperature difference terms; meanwhile, a residual branch is incorporated to supplement the nonlinear coupling corrections induced by the joint effects of surface emissivity and atmospheric water vapor. Through this component-level decoupled modeling, PCD-Net explicitly characterizes the dynamic response relationships between land surface emissivity, atmospheric water vapor content, and different SW physical components.


2509.05316 2026-06-08 cs.LG cs.AI 版本更新

Standard vs. Modular Sampling: Best Practices for Reliable LLM Unlearning

标准采样与模块化采样：可靠的大语言模型遗忘的最佳实践

Praveen Bushipaka, Lucia Passaro, Tommaso Cucinotta

发表机构 * Scuola Superiore Sant’Anna（圣安纳高等学院）； University of Pisa（比萨大学）

AI总结针对大语言模型遗忘中采样策略的不足，提出模块化实体级遗忘（MELU）策略，通过多样化邻居集和模块化采样平衡遗忘效果与模型效用。

2509.24935 2026-06-08 cs.CV cs.AI cs.LG 版本更新

Scalable GANs with Transformers

可扩展的Transformer生成对抗网络

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

发表机构 * KAIST（韩国科学技术院）

AI总结本文通过紧凑变分自编码器潜在空间和纯Transformer架构，研究了生成对抗网络的可扩展性，并提出了轻量级中间监督和宽度自适应学习率调整来解决缩放时的失败模式，在ImageNet-256上以40个epoch达到2.96的FID。

Comments ICML 2026

详情

AI中文摘要

可扩展性推动了生成建模的最新进展，但其原理在对抗学习中仍未充分探索。我们通过两个在其他类型生成模型中被证明有效的设计选择来研究生成对抗网络（GAN）的可扩展性：在紧凑的变分自编码器潜在空间中训练，以及采用纯Transformer的生成器和判别器。在潜在空间中训练能够在保持感知保真度的同时实现高效计算，而这种效率与普通Transformer自然匹配，后者的性能随计算预算扩展。基于这些选择，我们分析了朴素缩放GAN时出现的失败模式。具体来说，我们发现了随着网络规模扩大，生成器早期层利用不足和优化不稳定的问题。因此，我们提供了简单且对缩放友好的解决方案，如轻量级中间监督和宽度自适应学习率调整。我们的实验表明，GAT——一种纯Transformer的潜在空间GAN——能够在从S到XL的广泛容量范围内可靠地训练。此外，GAT-XL/2在ImageNet-256上仅用40个epoch（比强基线少6倍）就达到了最先进的单步类条件生成性能（FID为2.96）。项目页面：https://hse1032.github.io/GAT。

英文摘要

Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues such as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions such as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based, latent-space GAN, can be trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines. Project page: https://hse1032.github.io/GAT.


2510.03381 2026-06-08 cs.LG cs.AI 版本更新

Proxy Reconstruction Pre-training for Ramp Flow Prediction at Highway Interchanges

高速公路立交匝道流量预测的代理重建预训练

Yongchao Li, Jun Chen, Zhuoxuan Li, Chao Gao, Yang Li, Chu Zhang, Changyin Dong

发表机构 * Southeast University（东南大学）； Institute of Telecommunications and Information Sciences, China（中国电信与信息科学研究院）

AI总结提出时空解耦自编码器（STDAE），通过跨模态重建预训练从主线数据恢复匝道流量，结合GWNet等模型提升预测精度，在真实数据集上超越13个基线。

Comments Accepted at Applied Soft Computing Journal

详情

DOI: 10.1016/j.asoc.2026.115462
Journal ref: Applied Soft Computing Journal 200 (2026) 115462

AI中文摘要

立交桥是高速公路间车辆转换的关键节点，但缺乏实时匝道检测器导致交通预测存在盲区。为解决这一问题，我们提出时空解耦自编码器（STDAE），一种利用跨模态重建预训练的两阶段框架。在第一阶段，STDAE从主线数据重建历史匝道流量，迫使模型捕捉内在的时空关系。其解耦架构通过并行的空间和时间自编码器高效提取异质特征。在预测阶段，学习到的表示与GWNet等模型集成以提高准确性。在三个真实立交数据集上的实验表明，STDAE-GWNET始终优于十三个最先进的基线，并达到与使用历史匝道数据的模型相当的性能。这证明了其在克服检测器稀缺方面的有效性及其在不同预测流程中的即插即用潜力。

英文摘要

Interchanges are crucial nodes for vehicle transfers between highways, yet the lack of real-time ramp detectors creates blind spots in traffic prediction. To address this, we propose a Spatio-Temporal Decoupled Autoencoder (STDAE), a two-stage framework that leverages cross-modal reconstruction pretraining. In the first stage, STDAE reconstructs historical ramp flows from mainline data, forcing the model to capture intrinsic spatio-temporal relations. Its decoupled architecture with parallel spatial and temporal autoencoders efficiently extracts heterogeneous features. In the prediction stage, the learned representations are integrated with models such as GWNet to enhance accuracy. Experiments on three real-world interchange datasets show that STDAE-GWNET consistently outperforms thirteen state-of-the-art baselines and achieves performance comparable to models using historical ramp data. This demonstrates its effectiveness in overcoming detector scarcity and its plug-and-play potential for diverse forecasting pipelines.

2510.07315 2026-06-08 cs.CL cs.AI cs.LG cs.SE 版本更新

SWE-IF: Aligning Code Evaluation with Human Preference

SWE-IF: 使代码评估与人类偏好对齐

Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun

发表机构 * Google DeepMind（谷歌DeepMind）

AI总结提出SWE-IF基准，通过可验证指令分类法VeriCode评估代码指令遵循能力，发现指令遵循是区分LLM代码质量的关键，与功能正确性结合更能匹配人类偏好。

Comments ICML 2026

AI中文摘要

大型语言模型（LLM）推动了vibe coding，用户通过自然语言交互利用LLM生成并迭代优化代码，直到通过其vibe检查。Vibe检查反映了人类偏好，超越了功能性：解决方案应感觉正确、阅读清晰、保留意图并保持正确。然而，当前的代码评估仍局限于pass@k，仅捕获功能正确性，忽略了用户常规应用的非功能性指令。在本文中，我们假设指令遵循是vibe检查中除功能正确性之外缺失的部分。为了用量化信号衡量模型的代码指令遵循能力，我们提出了VeriCode，一个包含30条可验证代码指令及其确定性验证器的分类法。我们使用该分类法增强现有评估套件，得到SWE-IF，一个评估指令遵循和功能正确性的测试平台。评估31个LLM，我们发现即使最强的模型也难以遵守多条指令，并表现出功能回归。最重要的是，功能正确性和指令遵循的复合得分与人类偏好相关性最强，其中指令遵循成为LLM之间的主要区分因素。我们的代码、数据和分类法可在https://github.com/maszhongming/SWE-IF获取。

英文摘要

Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check besides functional correctness. To quantify models' code instruction-following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in SWE-IF, a testbed to assess both instruction following and functional correctness. Evaluating 31 LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator among LLMs. Our code, data, and taxonomy are available at https://github.com/maszhongming/SWE-IF.
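
The idea of a verifiable code instruction paired with a deterministic verifier can be illustrated with a minimal Python sketch. Everything here is hypothetical: the two instructions ("every function has a docstring", "no line exceeds 79 characters"), the function names, and the equal-weight composite score are assumptions for illustration, not the VeriCode taxonomy or the SWE-IF scoring formula.

```python
import ast


def has_docstrings(source: str) -> bool:
    """Check the (hypothetical) instruction 'every function has a docstring'."""
    tree = ast.parse(source)
    funcs = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(f) is not None for f in funcs)


def max_line_length_ok(source: str, limit: int = 79) -> bool:
    """Check the (hypothetical) instruction 'no line exceeds `limit` characters'."""
    return all(len(line) <= limit for line in source.splitlines())


def composite_score(passed_tests: bool, checks: list[bool]) -> float:
    """Assumed composite: equal-weight mean of functional correctness
    and the fraction of instructions followed."""
    follow_rate = sum(checks) / len(checks) if checks else 1.0
    return 0.5 * float(passed_tests) + 0.5 * follow_rate


sample = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
checks = [has_docstrings(sample), max_line_length_ok(sample)]
print(composite_score(passed_tests=True, checks=checks))
```

Because each verifier is a pure function of the generated source, the same output always receives the same verdict, which is what makes instruction following measurable alongside pass@k.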

2510.09041 2026-06-08 cs.LG cs.AI 版本更新

Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach

自动驾驶鲁棒控制：一种智能一般和约束对抗强化学习方法

Junchao Fan, Qi Wei, Ruichen Zhang, Yang Lu, Jianhua Wang, Xiaolin Chang, Bo Ai

发表机构 * Beijing Key Laboratory of Security and Privacy in Intelligent Transportation（北京智能交通安全与隐私重点实验室）； Beijing Jiaotong University（北京交通大学）； College of Computing and Data Science（计算与数据科学学院）； Nanyang Technological University（南洋理工大学）； School of Computer Science and Technology（计算机科学与技术学院）； Taiyuan University of Technology（太原理工大学）； School of Electronics and Information Engineering（电子与信息工程学院）

AI总结针对深度强化学习在自动驾驶中易受对抗攻击的问题，提出智能一般和约束对抗强化学习（IGCARL），通过战略性目标对手和鲁棒驾驶代理的交互训练，在约束优化下提升策略稳定性，实验表明成功率比现有方法提高至少27.9%。

AI中文摘要

深度强化学习（DRL）在开发自动驾驶策略方面取得了显著成功。然而，其对对抗攻击的脆弱性仍然是实际部署的关键障碍。尽管现有的鲁棒方法已取得一定成功，但它们仍面临三个关键问题：（i）这些方法针对短视的对抗攻击进行训练，限制了它们应对更具战略性威胁的能力；（ii）它们难以引发真正安全关键的事件（例如碰撞），反而常常导致轻微后果；（iii）由于缺乏鲁棒约束，这些方法在训练过程中可能导致学习不稳定和策略漂移。为了解决这些问题，我们提出了智能一般和约束对抗强化学习（IGCARL），一种新颖的鲁棒自动驾驶方法，包括一个战略性目标对手和一个鲁棒驾驶代理。战略性目标对手被设计为利用DRL的时间决策能力来执行策略协调的多步攻击。此外，它通过采用一般和目标明确地专注于引发安全关键事件。鲁棒驾驶代理通过与对手交互学习，以发展出对抗攻击的鲁棒自动驾驶策略。为了确保对抗环境中的稳定学习并减轻攻击引起的策略漂移，代理在约束公式下进行优化。大量实验表明，IGCARL相比现有最先进方法将成功率提高了至少27.9%，展示了对抗攻击的卓越鲁棒性，并增强了基于DRL的自动驾驶的安全性和可靠性。

英文摘要

Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved success, they still suffer from three key issues: (i) these methods are trained against myopic adversarial attacks, limiting their abilities to respond to more strategic threats, (ii) they have trouble causing truly safety-critical events (e.g., collisions), but instead often result in minor consequences, and (iii) these methods can introduce learning instability and policy drift during training due to the lack of robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.

2510.11014 2026-06-08 cs.RO cs.AI cs.CV 版本更新

MatterDoor: Sampling Zero-shot Spatio-semantic Priors using Generative Models

MatterDoor: 使用生成模型采样零样本空间语义先验

Subhransu S. Bhattacharjee, Hao Lu, Dylan Campbell, Rahul Shome

发表机构 * School of Computing, Australian National University（澳大利亚国立大学计算机学院）

AI总结针对机器人通过门缝观察时场景结构缺失的问题，提出MatterDoor方法，利用预训练生成模型（VLM引导外推、单目深度估计、语义分割）采样隐藏房间的语义3D点云先验，在Matterport3D基准上验证了零样本空间语义先验的有效性。

Comments Under Review

AI中文摘要

自主机器人通常只能通过门缝部分观察房间，墙壁和场景结构隐藏了安全导航和目标导向行动所需的几何和任务相关语义。我们询问现成的预训练生成视觉模型能否将这些缺失结构作为零样本离线先验用于机器人推理。此类先验应支持对未观察结构的空间语义查询，估计隐藏区域中的目标物体似然以及这些区域被占用的概率。给定一个以自我为中心的RGB观测和目标查询，我们的流程使用VLM引导的外推、单目深度估计和语义分割来采样隐藏房间的语义标注3D点云假设。我们引入了MatterDoor，一个源自Matterport3D的门遮挡室内场景基准，并使用生成指标和模拟Stretch机器人目标到达任务评估所得先验。我们的结果表明，无需针对特定问题微调即可推导出对规划有用的空间语义先验。

英文摘要

Autonomous robots often view rooms only partially, through a doorway, where the walls and scene structure hide the geometry and task-relevant semantics needed for safe navigation and goal-directed action. We ask whether off-the-shelf pretrained generative vision models can derive this missing structure as zero-shot offline priors for robot reasoning. Such priors should support spatio-semantic queries over unobserved structure, estimating the target object likelihood in hidden regions and the probability that those regions are occupied. Given an egocentric RGB observation and target query, our pipeline uses VLM-guided outpainting, monocular depth estimation, and semantic segmentation to sample semantically labeled 3D point cloud hypotheses of the hidden room. We introduce MatterDoor, a Matterport3D-derived benchmark of doorway-occluded indoor scenes, and evaluate the resulting priors with generative metrics and simulated Stretch robot object-reaching tasks. Our results suggest that useful spatio-semantic priors for planning can be derived without problem-specific fine-tuning.

2510.17004 2026-06-08 cs.MA cs.AI 版本更新

ReclAIm: A Multi-Agent Framework for Monitoring and Correcting Performance Decline in Medical Imaging AI

ReclAIm：用于监测和纠正医学影像AI性能下降的多智能体框架

Eleftherios Tzanis, Michail E. Klontzas

发表机构 * Artificial Intelligence and Translational Imaging (ATI) Lab, Department of Radiology, School of Medicine, University of Crete（人工智能与转化成像实验室，放射科，医学院，希腊克里特大学）； Computational Biomedicine Laboratory, Institute of Computer Science Foundation for Research and Technology Hellas (ICS - FORTH), Heraklion, Crete, Greece（计算生物医学实验室，希腊基础研究与技术院计算机科学研究所（ICS - FORTH），克里特，希腊）； Division of Radiology, Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institute, Huddinge, Sweden（放射科，临床科学、干预与技术部（CLINTEC），卡罗林斯卡研究所，瑞典Huddinge）

AI总结提出基于大语言模型的多智能体框架ReclAIm，通过自然语言交互自动监测医学图像分类模型性能下降并触发微调，采用数据增强、类别不平衡处理和参数锚定正则化策略，在多个数据集上验证了有效性。

Comments Published in Radiology: Artificial Intelligence (https://doi.org/10.1148/ryai.250923)

DOI: 10.1148/ryai.250923

AI中文摘要

目的：开发并评估一个用于自动监测、检测和纠正医学图像分类模型性能下降的多智能体框架（ReclAIm）。材料与方法：ReclAIm是一个基于大语言模型的多智能体系统，通过自然语言交互运行。一个主智能体协调三个任务特定智能体，执行性能评估并在检测到显著性能下降时触发微调。微调流程包含数据增强、类别不平衡处理以及参数锚定正则化策略以限制灾难性遗忘。该系统使用多个影像数据集进行基准测试，包括脑部MRI、胸部CT和胸部X光片，按模型开发、推理（性能监测）和微调子集划分（60%:20%:20%）。结果：ReclAIm成功协调了所有数据集的训练、评估和性能监测。在18个模型中的8个中检测到测试数据与推理数据之间的性能差异，触发了微调流程以减少性能差距。在性能下降高达40.6%的情况下（心脏肥大数据集，InceptionV3），微调将性能指标恢复至基线值的2%以内。结论：ReclAIm为医学图像分类模型的自动监测和定向微调提供了一个原型框架，其自然语言接口旨在支持研究及潜在临床应用的可及性。

英文摘要

Purpose: To develop and evaluate a multi-agent framework (ReclAIm) for automated monitoring, detection, and correction of performance decline in medical image classification models. Materials and Methods: ReclAIm is a large language model-based multi-agent system that operates through natural language interaction. A master agent coordinating three task-specific agents performed performance evaluation and triggered fine-tuning when substantial performance declines were detected. The fine-tuning workflow incorporated data augmentation, class imbalance handling, and a parameter-anchoring regularization strategy to limit catastrophic forgetting. The system was benchmarked using multiple imaging datasets, including brain MRI, chest CT, and chest radiography, partitioned into model development, inference (performance monitoring), and fine-tuning subsets (60%:20%:20%). Results: ReclAIm successfully orchestrated training, evaluation, and performance monitoring across all datasets. Performance discrepancies between test and inference data were detected in 8 of 18 models, prompting fine-tuning workflows that reduced performance gaps. In cases with declines of up to 40.6% (cardiomegaly dataset, InceptionV3), fine-tuning restored performance metrics to within 2% of baseline values. Conclusion: ReclAIm provides a prototype framework for automated monitoring and targeted fine-tuning of medical image classification models, with a natural language interface designed to support accessibility in research and potential clinical applications.
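
The monitoring step described above (compare a baseline metric with the metric on incoming inference data, then trigger fine-tuning on a substantial decline) can be sketched as follows. This is a minimal illustration under stated assumptions: the 5% relative threshold, the function names, and the squared-distance form of the anchoring penalty are hypothetical and are not taken from ReclAIm.

```python
def relative_decline(baseline: float, current: float) -> float:
    """Relative drop of `current` below `baseline` (0.0 if there is no drop)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (baseline - current) / baseline)


def needs_fine_tuning(baseline: float, current: float,
                      threshold: float = 0.05) -> bool:
    """Assumed trigger rule: fine-tune when the relative decline exceeds `threshold`."""
    return relative_decline(baseline, current) > threshold


def anchored_loss(task_loss: float, params: list[float],
                  anchor_params: list[float], strength: float = 0.01) -> float:
    """Assumed anchoring regularizer: penalize squared distance from the
    pre-fine-tuning weights to limit catastrophic forgetting."""
    penalty = sum((p - a) ** 2 for p, a in zip(params, anchor_params))
    return task_loss + strength * penalty


print(needs_fine_tuning(baseline=0.90, current=0.60))  # large drop
print(needs_fine_tuning(baseline=0.90, current=0.89))  # small drop
```

In a real system the threshold would be chosen per metric and per dataset; the sketch only shows where such a decision rule sits between evaluation and fine-tuning.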

2510.24561 2026-06-08 cs.LG cs.AI 版本更新

LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis

LoRA-DA：基于渐近分析的低秩自适应数据感知初始化

Qingyue Zhang, Chang Chu, Tianren Peng, Qi Li, Xiangyang Luo, Zhihao Jiang, Shao-Lun Huang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出LoRA-DA，通过渐近分析优化低秩自适应初始化，结合Fisher梯度与Fisher信息最小化参数偏差，提升微调准确率与收敛稳定性。

Comments Published at ICML 2026

AI中文摘要

LoRA已成为广泛采用的PEFT方法，其初始化方法日益受到关注。然而，现有方法存在显著局限性：许多方法未纳入目标域数据，而基于梯度的方法仅通过依赖一步梯度分解在浅层利用数据。本文建立了数据感知LoRA初始化的理论框架。从最小化微调模型与目标模型之间参数偏差的期望出发，我们推导出一个包含两项的优化问题：偏差项，与微调模型和目标模型之间的参数距离相关，并使用Fisher梯度公式近似以保持各向异性；方差项，通过Fisher信息考虑采样随机性引入的不确定性。求解该问题得到LoRA的最优初始化策略，基于此我们开发了高效算法LoRA-DA。跨多个基准的实验结果表明，LoRA-DA在最终准确率上持续优于现有初始化方法。附加研究显示其收敛更快、更稳定，跨秩鲁棒性强，且初始化开销小。源代码见https://github.com/zqy0126/LoRA-DA。

英文摘要

LoRA has become a widely adopted method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition. In this paper, we establish a theoretical framework for data-aware LoRA initialization. Starting from minimizing the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. Solving this problem yields an optimal initialization strategy for LoRA, based on which we develop an efficient algorithm, LoRA-DA. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code is available at https://github.com/zqy0126/LoRA-DA.

2510.26714 2026-06-08 cs.LG cs.AI 版本更新

On the importance of multiple training seeds for evaluating machine unlearning

关于多个训练种子在评估机器遗忘中的重要性

Jamie Lanyon, Axel Finke, Petros Andreou, Georgina Cosma

发表机构 * Department of Computer Science（计算机科学系）； School of Mathematics（数学学院）； School of Science（科学学院）； Statistics and Physics（统计学与物理学）； Loughborough University（拉夫堡大学）； Newcastle University（纽卡斯尔大学）

AI总结本文指出评估机器遗忘算法时仅使用单个训练种子可能导致结果不具代表性，并通过图像分类、联邦学习排序和大语言模型实验验证了问题普遍性，最后给出选择训练和遗忘种子数量的指导。

Comments mini paper, 5 figures

AI中文摘要

机器遗忘旨在从训练好的模型中移除某些数据点的影响，而无需昂贵的重新训练。大多数实用的遗忘算法只是近似，其性能只能通过经验评估。常见做法是从同一个训练好的模型（即仅使用单个训练种子）开始，多次独立运行遗忘算法（即使用多个遗忘种子）。在图像分类实验中，这种做法可能给出不具代表性的结果，因为遗忘性能可能对训练种子的选择敏感。这对于确定性遗忘方法尤其相关，这些方法从同一个训练好的模型开始时总是产生相同的结果。在联邦学习排序和大语言模型上的进一步实验证实，这个问题不仅限于图像分类。我们还解释了为什么增加遗忘种子的数量通常无法弥补多个训练种子的缺失。最后，我们给出了如何选择训练和遗忘种子数量的指导。

英文摘要

Machine unlearning aims to remove the influence of certain data points from a trained model without costly retraining. Most practical unlearning algorithms are only approximate and their performance can only be assessed empirically. Common practice is to run unlearning algorithms multiple times independently (i.e., using multiple unlearning seeds) starting from the same trained model (i.e., using only a single training seed). In image-classification experiments, this practice can give non-representative results as unlearning performance can be sensitive to the choice of training seed. This is particularly relevant for deterministic unlearning methods, which always produce the same result when started from the same trained model. Further experiments on federated learning-to-rank and large language models confirm that this issue extends beyond image classification. We also explain why increasing the number of unlearning seeds cannot generally compensate for the lack of multiple training seeds. Finally, we give guidance on how to select the number of training and unlearning seeds.
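
One way to see why extra unlearning seeds cannot replace extra training seeds is a simple random-effects model. The model and its symbols ($\mu$, $a_i$, $e_{ij}$, $\sigma_T^2$, $\sigma_U^2$, $n_T$, $n_U$) are an illustrative assumption, not the paper's own derivation: the metric for training seed $i$ and unlearning seed $j$ is a shared mean plus an independent training-seed effect and an independent unlearning-seed effect.

```latex
% Assumed model: Y_{ij} = \mu + a_i + e_{ij}, with independent effects,
% Var(a_i) = \sigma_T^2 (training seed), Var(e_{ij}) = \sigma_U^2 (unlearning seed).
\begin{align}
  \bar{Y} &= \frac{1}{n_T n_U} \sum_{i=1}^{n_T} \sum_{j=1}^{n_U} Y_{ij}, \\
  \operatorname{Var}(\bar{Y}) &= \frac{\sigma_T^2}{n_T} + \frac{\sigma_U^2}{n_T n_U}, \\
  n_T = 1 \;\Rightarrow\; \operatorname{Var}(\bar{Y})
    &= \sigma_T^2 + \frac{\sigma_U^2}{n_U}
    \;\xrightarrow{\; n_U \to \infty \;}\; \sigma_T^2 .
\end{align}
```

Under this assumed model, the training-seed variance $\sigma_T^2$ is divided only by $n_T$, so with a single training seed it stays in full no matter how large $n_U$ becomes. For a deterministic unlearning method $\sigma_U^2 = 0$, and repeating the unlearning run adds no information at all.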

2511.02399 2026-06-08 cs.SE cs.AI 版本更新

Towards Iterative End-to-End Software Development: A Feature-Driven Multi-Agent Framework

迈向迭代式端到端软件开发：一种特征驱动的多智能体框架

Junwei Liu, Chen Xu, Chong Wang, Tong Bai, Weitong Chen, Kaseng Wong, Yiling Lou, Xin Peng

发表机构 * Fudan University（复旦大学）； Nanyang Technological University（南洋理工大学）

AI总结提出EvoDev框架，通过特征分解、依赖建模和上下文传播，实现迭代式端到端软件开发，在Android任务上比Claude Code提升57.3%。

Comments Accepted by ISSTA 2026

AI中文摘要

近年来，大语言模型智能体的进展为从自然语言需求自动化端到端软件开发带来了希望。然而，现有方法大多采用线性的瀑布式流程，这过度简化了真实世界开发的迭代性质，并且难以应对复杂、大规模的项目。为解决这些限制，我们提出了EvoDev，一种受特征驱动开发启发的迭代式软件开发框架。EvoDev将用户需求分解为一组用户价值特征，并构建特征图，这是一个有向无环图，显式建模特征之间的依赖关系。特征图中的每个特征节点维护多层上下文，包括业务逻辑、软件设计和代码实现，这些上下文沿着依赖关系传播，为后续开发迭代提供上下文。我们在具有挑战性的Android开发任务上评估了EvoDev，结果表明它比最佳基线Claude Code高出57.3%，同时在不同基础LLM上将单智能体性能提升了16.0%-58.5%，突出了特征分解、依赖建模、上下文传播和面向工作流的智能体设计对端到端软件开发的重要性。此外，我们的工作总结了设计迭代式、LLM驱动的开发框架的实用见解，并为未来训练基础LLM以更好地支持迭代式软件开发提供了参考。

英文摘要

Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each feature node in the feature map maintains multi-layer contexts, including business logic, software design, and code implementation, which are propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by 57.3%, while improving single-agent performance by 16.0%-58.5% across different base LLMs, highlighting the importance of feature decomposition, dependency modeling, context propagation, and workflow-aware agent design for end-to-end software development. Moreover, our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.
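
The Feature Map described above is a directed acyclic graph whose edges carry context from a feature to the features that depend on it. A minimal Python sketch of that structure follows; the feature names, the context strings, and the merge rule (own context first, then inherited items without duplicates) are hypothetical and are not EvoDev's actual data model.

```python
from graphlib import TopologicalSorter

# Hypothetical feature map: feature -> set of features it depends on.
feature_deps = {
    "login": set(),
    "profile": {"login"},
    "feed": {"login", "profile"},
}


def development_order(deps: dict[str, set[str]]) -> list[str]:
    """Order features so that every dependency is developed first.
    Raises graphlib.CycleError if the map is not acyclic."""
    return list(TopologicalSorter(deps).static_order())


def propagate_context(deps: dict[str, set[str]],
                      own_context: dict[str, list[str]]) -> dict[str, list[str]]:
    """Give each feature its own context plus the merged context of its dependencies."""
    merged: dict[str, list[str]] = {}
    for feature in development_order(deps):
        ctx = list(own_context.get(feature, []))
        for dep in sorted(deps.get(feature, ())):
            for item in merged[dep]:
                if item not in ctx:
                    ctx.append(item)
        merged[feature] = ctx
    return merged


contexts = propagate_context(
    feature_deps,
    {"login": ["auth API"], "profile": ["user model"], "feed": ["feed UI"]},
)
print(development_order(feature_deps))
print(contexts["feed"])
```

Processing features in topological order guarantees that a dependency's merged context already exists when a dependent feature is reached, which is the property an iterative, feature-by-feature workflow relies on.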

2512.04123 2026-06-08 cs.CY cs.AI cs.LG cs.SE 版本更新

Measuring Agents in Production

生产环境中的智能体测量

Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Koushik Sen, Dawn Song, Joseph E. Gonzalez, Ion Stoica, Matei Zaharia, Marquita Ellis

发表机构 * University of California at Berkeley（加州大学伯克利分校）； IBM Research（IBM研究院）； University of Illinois at Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Stanford University（斯坦福大学）

AI总结通过对86个已部署系统的调查和20个案例研究，发现生产环境中的LLM智能体主要采用简单可控的方法，可靠性是首要挑战，并依赖系统级设计和人工评估。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026) as Oral Presentation

AI中文摘要

基于LLM的智能体已经在许多行业的生产环境中运行，但我们缺乏对哪些技术方法能使部署成功的理解。我们首次系统性地研究了生产环境中的智能体测量（MAP），使用了来自智能体开发者的一手数据。我们通过深度访谈进行了20个案例研究，并调查了来自26个领域的86个已部署系统的从业者。我们调查了组织为何构建智能体、如何构建它们、如何评估它们以及它们面临的主要开发挑战。我们的研究发现，生产环境中的智能体是使用简单、可控的方法构建的：68%的智能体在人类干预前最多执行10步，70%依赖对现成模型进行提示而非权重调整，74%主要依赖人工评估。可靠性（随时间保持一致的正确行为）仍然是首要开发挑战，从业者目前通过系统级设计来解决。MAP记录了生产智能体的当前状态，为研究社区提供了部署现实和未充分探索的研究方向的可见性。

英文摘要

LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of Measuring Agents in Production, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 86 deployed systems practitioners across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and their top development challenges. Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability (consistent correct behavior over time) remains the top development challenge, which practitioners currently address through systems-level design. MAP documents the current state of production agents, providing the research community with visibility into deployment realities and underexplored research avenues.

2512.12997 2026-06-08 cs.CV cs.AI cs.LG 版本更新

Calibrating Uncertainty for Zero-Shot Adversarial CLIP

校准零样本对抗性CLIP的不确定性

Wenjing Lu, Zerui Tao, Yuning Qiu, Dongping Zhang, Yang Yang, Qibin Zhao

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对CLIP在零样本分类中对抗攻击脆弱且不确定性校准差的问题，提出基于狄利克雷分布重参数化的对抗微调目标，统一对齐语义结构与置信度，提升校准性和鲁棒性。

Comments ICML 2026

AI中文摘要

CLIP在零样本分类中表现强劲，但仍易受对抗攻击。先前的对抗微调工作主要匹配干净样本和对抗样本之间的预测logits，忽略了不确定性校准，可能损害零样本泛化能力。在可靠的不确定性估计中，一个常见期望是预测不确定性应随输入难度增加或偏离训练分布而上升。然而，在对抗环境中我们经常观察到相反的情况：扰动不仅降低准确性，还抑制不确定性，导致严重的校准错误和过度自信。这揭示了鲁棒性之外的关键可靠性差距。为弥合这一差距，我们提出了一种考虑准确性和不确定性的CLIP对抗微调目标。通过将CLIP输出重参数化为狄利克雷分布的浓度参数，我们提出了一种统一表示，捕获相对语义结构和置信度大小。这使得在扰动下实现整体分布对齐，超越单一logits锚定，恢复校准的不确定性。在多个零样本基准上的实验表明，我们的方法显著提高了不确定性校准，在保持干净准确性的同时实现了具有竞争力的对抗鲁棒性。

英文摘要

CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work primarily matches predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and over-confidence. This reveals a critical reliability gap beyond robustness. To bridge this gap, we propose an adversarial fine-tuning objective for CLIP considering both accuracy and uncertainty. By reparameterizing CLIP outputs as the concentration parameters of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and confidence magnitude. This enables holistic distribution alignment under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments across multiple zero-shot benchmarks demonstrate that our method significantly improves uncertainty calibration and achieves competitive adversarial robustness while preserving clean accuracy.
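
Reading model outputs as Dirichlet concentration parameters separates two quantities: the relative class preferences (the normalized parameters) and the overall confidence (their sum). The sketch below uses the common evidential-learning convention $\alpha_k = \mathrm{softplus}(z_k) + 1$ with uncertainty $K / \sum_k \alpha_k$; this mapping is an assumption for illustration, and the paper's own reparameterization of CLIP outputs may differ.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dirichlet_summary(logits: list[float]) -> tuple[list[float], float]:
    """Map logits to Dirichlet concentrations (assumed: softplus evidence + 1).

    Returns the expected class probabilities alpha_k / S and the
    uncertainty mass K / S, where S is the sum of the concentrations.
    """
    alphas = [softplus(z) + 1.0 for z in logits]
    total = sum(alphas)
    probs = [a / total for a in alphas]
    uncertainty = len(alphas) / total
    return probs, uncertainty


_, u_flat = dirichlet_summary([0.0, 0.0, 0.0])
_, u_peaked = dirichlet_summary([10.0, 0.0, 0.0])
print(u_flat > u_peaked)
```

Two inputs with the same normalized probabilities can still differ in total concentration, so an adversarial perturbation that inflates confidence shows up as a change in the sum even when the predicted class is unchanged.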

2512.14391 2026-06-08 cs.LG cs.AI cs.CL 版本更新

RePo: Language Models with Context Re-Positioning

RePo：具有上下文重定位的语言模型

Huayang Li, Tianyu Zhao, Deng Cai, Richard Sproat

发表机构 * University of Maryland（马里兰大学）

AI总结提出RePo机制，通过可微分模块重新分配token位置以减轻注意力层负担，在噪声上下文、结构化数据和长上下文任务上持续提升性能。

Comments Accepted to ICML 2026

AI中文摘要

上下文学习是现代大型语言模型（LLM）的基础；然而，主流架构通过分配线性或常数的位置索引来施加刚性且固定的上下文结构。刚性的位置信息将组织输入结构的全部负担强加给注意力层，从而减少了可用于更关键信息的注意力量。为了解决这个问题，我们提出了RePo，一种通过上下文重定位来减轻注意力层负担的新机制。与传统方法不同，RePo利用可微分模块$f_ϕ$来分配捕获上下文依赖关系的token位置，而不是依赖预定义的顺序。通过在OLMo-2 1B和7B模型上持续预训练，我们证明RePo在涉及噪声上下文、结构化数据和更长上下文长度的任务上持续提升性能，同时在一般短上下文任务上保持有竞争力的性能。分析表明，RePo成功地将更多注意力分配给遥远但相关的信息，在密集且非线性的空间中分配位置，并捕获输入上下文的内在结构。我们的代码位于https://github.com/SakanaAI/repo。

英文摘要

In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. The rigid position information places the full burden of organizing the input structure on attention layers, thus reducing the amount of attention that could be allocated to more critical information. To address this, we propose RePo, a novel mechanism that alleviates the burden on attention layers via context re-positioning. Unlike conventional approaches, RePo utilizes a differentiable module, $f_ϕ$, to assign token positions that capture contextual dependencies, rather than relying on a pre-defined order. By continually pre-training on the OLMo-2 1B and 7B models, we demonstrate that RePo consistently enhances performance on tasks involving noisy contexts, structured data, and longer context length, while maintaining competitive performance on general short-context tasks. Analysis reveals that RePo successfully allocates more attention mass to distant but relevant information, assigns positions in a dense and non-linear space, and captures the intrinsic structure of the input context. Our code is at https://github.com/SakanaAI/repo.

2512.23128 2026-06-08 cs.HC cs.AI cs.MA 版本更新

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

这是一个陷阱！面向网络代理的任务重定向说服基准

Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki, Piotr Błaszczyk, Will Howard, Lukas Aichberger, Chris Russell, Philip H. S. Torr, Adam Mahdi, Adel Bibi

发表机构 * University of Cambridge（剑桥大学）

AI总结提出TRAP基准，评估大型语言模型驱动的网络代理在动态网页中易受提示注入攻击的程度，发现平均25%的任务中代理被重定向，揭示了心理驱动的系统漏洞。

Comments ICML 2026

AI中文摘要

由大型语言模型驱动的网络代理越来越多地用于电子邮件管理或专业网络等任务。然而，它们对动态网页内容的依赖使其容易受到提示注入攻击：隐藏在界面元素中的对抗性指令，说服代理偏离其原始任务。我们引入了任务重定向代理说服基准（TRAP），这是一个研究说服技术如何在现实任务中误导自主网络代理的基准。在六个前沿模型中，代理平均在25%的任务中容易受到提示注入（GPT-5为13%，DeepSeek-R1为43%），小的界面或上下文变化通常会使成功率翻倍，并揭示网络代理中系统的、由心理驱动的漏洞。我们还提供了一个模块化的社会工程注入框架，并在高保真网站克隆上进行受控实验，允许进一步扩展基准。

英文摘要

Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), a benchmark for studying how persuasion techniques misguide autonomous web agents on realistic tasks. Across six frontier models, agents are susceptible to prompt injection in 25% of tasks on average (ranging from 13% for GPT-5 to 43% for DeepSeek-R1), with small interface or contextual changes often doubling success rates and revealing systemic, psychologically driven vulnerabilities in web-based agents. We also provide a modular social-engineering injection framework with controlled experiments on high-fidelity website clones, allowing for further benchmark expansion.

2601.05751 2026-06-08 cs.CL cs.AI 版本更新

Analysing Differences in Persuasive Language in LLM-Generated Text: Uncovering Stereotypical Gender Patterns

分析LLM生成文本中说服性语言的差异：揭示刻板的性别模式

Amalie Brogaard Pauli, Maria Barrett, Max Müller-Eberstein, Isabelle Augenstein, Ira Assent

发表机构 * Department of Computer Science, Aarhus University（奥胡斯大学计算机科学系）； AMD Silo AI ； University of Tokyo（东京大学）； IT University of Copenhagen（哥本哈根IT大学）； Department of Computer Science, University of Copenhagen（哥本哈根大学计算机科学系）

AI总结提出框架评估LLM生成说服性语言时受接收者性别、发送者意图和输出语言的影响，发现所有模型均存在显著的性别差异，反映性别刻板印象的语言倾向。

Comments Accepted at ACL Findings 2026

AI中文摘要

大型语言模型（LLMs）越来越多地用于日常交流任务，包括起草旨在影响和说服的人际信息。先前研究表明，LLMs能够成功说服人类并放大说服性语言。因此，理解用户指令如何影响说服性语言的生成，以及生成的说服性语言是否因目标群体不同而有所差异至关重要。在这项工作中，我们提出了一个框架，用于评估说服性语言生成如何受接收者性别、发送者意图或输出语言的影响。我们使用成对提示指令评估了13个LLMs和16种语言。我们采用基于社会心理学和传播科学的LLM-as-judge设置，在19个说服性语言类别上评估模型响应。我们的结果揭示了所有模型生成的说服性语言中存在显著的性别差异。这些模式反映了与社会心理学和社会语言学中记录的性别刻板语言倾向一致的偏见。

英文摘要

Large language models (LLMs) are increasingly used for everyday communication tasks, including drafting interpersonal messages intended to influence and persuade. Prior work has shown that LLMs can successfully persuade humans and amplify persuasive language. It is therefore essential to understand how user instructions affect the generation of persuasive language, and to understand whether the generated persuasive language differs, for example, when targeting different groups. In this work, we propose a framework for evaluating how persuasive language generation is affected by recipient gender, sender intent, or output language. We evaluate 13 LLMs and 16 languages using pairwise prompt instructions. We evaluate model responses on 19 categories of persuasive language using an LLM-as-judge setup grounded in social psychology and communication science. Our results reveal significant gender differences in the persuasive language generated across all models. These patterns reflect biases consistent with gender-stereotypical linguistic tendencies documented in social psychology and sociolinguistics.

2601.13508 2026-06-08 cond-mat.mtrl-sci cs.AI 版本更新

Autonomous computational catalysis through an agentic research system

自主计算催化：通过智能体研究系统

Honghao Chen, Jiangjie Qiu, Yi Shen Tew, Xiaonan Wang

发表机构 * Beijing Key Laboratory of Artificial Intelligence for Advanced Chemical Engineering Materials, State Key Laboratory of Chemical Engineering and Low-Carbon Technology, Department of Chemical Engineering, Tsinghua University（北京先进化工材料人工智能重点实验室、化学工程与低碳技术国家重点实验室、清华大学化学工程系）

AI总结提出CatMaster催化原生智能体研究系统，将自然语言请求转化为计算研究，实现从建模到闭环催化剂设计的自主执行，在CO2-to-CO催化剂设计中识别出竞争性活性位点。

Comments 25 pages for main manuscript; SI not available here

AI中文摘要

自主智能体正开始将科学研究从工具辅助的工作流程转变为自我维持的发现过程。计算催化提供了一个代表性的挑战，因为催化剂发现需要将高层次问题转化为协调的模型构建、原子模拟、机理分析和跨尺度的迭代设计。在这里，我们介绍了CatMaster，一个催化原生的智能体研究系统，它将计算催化重塑为一个低门槛的自主研究虚拟生态系统。CatMaster维护一个不断演进的研究状态，并通过在一个可扩展环境内的模型构建、计算、批判和催化剂设计决策中的自我反馈来扩展能力。在逐渐具有挑战性的任务中，CatMaster将自然语言请求转化为具体的计算研究，从基本的原子建模和标准计算到机理探索和闭环催化剂设计。它在代表性的计算催化场景中展示了稳健的执行能力，并在选定的MatBench任务中表现出接近领先的性能，其中声子场景展示了其建模自我进化能力。在独立的CO2-to-CO催化剂设计案例中，CatMaster使用迭代的自我批判和证据精炼来识别出具有竞争力的B-CoN4和NiN3B/N-NiN3B基序。这些结果建立了一个虚拟生态系统范式，其中AI智能体超越模拟执行，走向端到端的计算研究，为催化和材料科学中的自主发现提供了基础。

英文摘要

Autonomous agents are beginning to transform scientific research from tool-assisted workflows toward self-sustaining discovery processes. Computational catalysis provides a representative challenge, as catalyst discovery requires high-level questions to be translated into coordinated model construction, atomistic simulation, mechanistic analysis, and iterative design across multiple scales. Here we introduce CatMaster, a catalysis-native agentic research system that recasts computational catalysis as a low-barrier virtual ecosystem for autonomous research. CatMaster maintains an evolving research state and extends capabilities through self-feedback across model construction, calculation, critique, and catalyst-design decisions within one extensible environment. Across progressively challenging tasks, CatMaster converts natural-language requests into concrete computational studies, from essential atomistic modelling and standard calculations to mechanism exploration and closed-loop catalyst design. It showed robust execution in representative computational-catalysis scenarios and near-leading performance across selected MatBench tasks, with the phonon scenario demonstrating its modelling self-evolution capability. In the independent CO2-to-CO catalyst design case, CatMaster used iterative self-critique and evidence refinement to identify competitive B-CoN4 and NiN3B/N-NiN3B motifs. These results establish a virtual-ecosystem paradigm in which AI agents move beyond simulation execution toward end-to-end computational research, providing a foundation for autonomous discovery in catalysis and materials science.

2601.16622 2026-06-08 cs.LG cs.AI 版本更新

E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory

E2Former-V2：具有线性激活内存的即时等变注意力

Lin Huang, Chengxiang Huang, Ziang Wang, Yiyue Du, Chu Wang, Haocheng Lu, Yunyang Li, Xiaoli Liu, Arthur Jiang, Jia Zhang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出E2Former-V2架构，通过等变轴对齐稀疏化（EAAS）和即时等变注意力机制，利用SO(3)到SO(2)基变换和自定义Triton内核，实现线性激活内存和20倍TFLOPS提升，在SPICE和OMol25数据集上加速推理并保持预测性能。

AI中文摘要

等变图神经网络（EGNN）已成为建模3D原子系统的广泛使用的方法。然而，主流架构由于在每条边上显式构造几何特征或密集张量积而面临关键的可扩展性瓶颈。为克服这一问题，我们引入了E2Former-V2，一种将代数稀疏性与硬件感知执行相结合的可扩展架构。我们首先提出等变轴对齐稀疏化（EAAS）。EAAS基于Wigner-$6j$卷积，利用$\mathrm{SO}(3) \rightarrow \mathrm{SO}(2)$基变换，将计算昂贵的密集张量收缩转化为高效的稀疏奇偶重索引操作。基于这种表示，我们引入了即时等变注意力，一种通过自定义融合Triton内核实现的完全节点中心机制。通过消除物化的边张量并最大化SRAM利用率，我们的内核相比标准实现实现了20倍的TFLOPS提升。在SPICE和OMol25数据集上的大量实验表明，E2Former-V2在保持相当预测性能的同时显著加速推理。这项工作表明，大型等变Transformer可以使用广泛可用的GPU平台高效训练。代码可在https://github.com/IQuestLab/UBio-MolFM/tree/e2formerv2获取。

英文摘要

Equivariant Graph Neural Networks (EGNNs) have become a widely used approach for modeling 3D atomistic systems. However, mainstream architectures face critical scalability bottlenecks due to the explicit construction of geometric features or dense tensor products on every edge. To overcome this, we introduce E2Former-V2, a scalable architecture that integrates algebraic sparsity with hardware-aware execution. We first propose Equivariant Axis-Aligned Sparsification (EAAS). EAAS builds on Wigner-$6j$ convolution by exploiting an $\mathrm{SO}(3) \rightarrow \mathrm{SO}(2)$ change of basis to transform computationally expensive dense tensor contractions into efficient, sparse parity re-indexing operations. Building on this representation, we introduce On-the-Fly Equivariant Attention, a fully node-centric mechanism implemented via a custom fused Triton kernel. By eliminating materialized edge tensors and maximizing SRAM utilization, our kernel achieves a $20\times$ improvement in TFLOPS compared to standard implementations. Extensive experiments on the SPICE and OMol25 datasets demonstrate that E2Former-V2 maintains comparable predictive performance while notably accelerating inference. This work demonstrates that large equivariant transformers can be trained efficiently using widely accessible GPU platforms. The code is available at https://github.com/IQuestLab/UBio-MolFM/tree/e2formerv2.

2601.22574 2026-06-08 cs.CV cs.AI 版本更新

Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models

利用时空-语义残差增强视频表示以缓解视频大型多模态模型中的幻觉

Yuansheng Gao, Jinman Zhao, Tong Zhang, Xingguo Xu, Wenbin Xing, Han Bao, Zonghui Wang, Wenzhi Chen

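Affiliations: Zhejiang University; University of Toronto; Dalian University of Technology; Sun Yat-sen University

AI summary: Proposes ViSSRes, which uses a lightweight MLP-style network to learn residuals for video representations and optimizes them for spatiotemporal and semantic consistency. Inference needs only a single forward pass, and the method lowers the hallucination rate while improving video understanding.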

Comments Preprint

AI中文摘要

尽管视频大型多模态模型在视频理解方面取得了强劲性能，但它们仍然存在幻觉问题。现有的推理时干预方法通常在对比解码框架下修改视频，但其启发式设计带来的改进有限且增加了推理延迟。为了解决这些问题，我们提出了ViSSRes，一种通过轻量级MLP风格网络增强视频表示的推理时干预方法。具体来说，我们使用对比随机游走方法来表征视频表示的时空一致性，并引入条件互信息将视频表示与模型的语义理解关联起来。在保持模型主干冻结的情况下，ViSSRes学习视频表示的残差，并从时空和语义一致性角度优化它们。在推理时，ViSSRes仅需单次前向传播，且不会引入显著的额外推理成本。实验表明，ViSSRes在EventHallusion上将LLaVA-NeXT-Video的幻觉率降低了40.69%，并在CoT设置下将MMVU上的视频理解提升了18.36%，证明了其在缓解幻觉方面的有效性。

英文摘要

Although Video Large Multimodal Models have achieved strong performance in video understanding, they still suffer from hallucination. Existing inference-time intervention methods usually modify videos under the contrastive decoding framework, but their heuristic designs bring limited improvements and increase inference latency. To address these issues, we propose ViSSRes, an inference-time intervention method that enhances video representations through a lightweight MLP-style network. Specifically, we use a contrastive random walk approach to characterize the spatiotemporal consistency of video representations, and introduce conditional mutual information to associate video representations with the model's semantic understanding. With the model backbone kept frozen, ViSSRes learns residuals for video representations and optimizes them from both spatiotemporal and semantic consistency perspectives. During inference, ViSSRes requires only a single forward pass and introduces no substantial additional inference cost. Experiments show that ViSSRes reduces the hallucination rate of LLaVA-NeXT-Video on EventHallusion by 40.69% and improves video understanding on MMVU by 18.36% under the CoT setting, demonstrating its effectiveness in mitigating hallucinations.

2601.23207 2026-06-08 cs.LG cs.AI 版本更新

Endogenous Resistance to Activation Steering in Language Models

语言模型中激活引导的内生抵抗

Alex McKenzie, Keenan Pepper, Stijn Servaes, Martin Leitgab, Murat Cubuktepe, Mike Vaiana, Diogo de Lucena, Judd Rosenblatt, Michael S. A. Graziano

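Affiliations: University of Washington

AI summary: Finds that large language models can endogenously resist task-misaligned activation steering, recovering on-topic generation through explicit verbal restarts, and identifies sparse autoencoder latents tied to this resistance; the resistance can be deliberately strengthened or weakened.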

AI中文摘要

大型语言模型可以在生成过程中从任务不匹配的激活引导中恢复，产生显式的语言重启（例如，“等等，那不对”），并在引导扰动仍然活跃的情况下继续讨论主题。我们将此称为内生引导抵抗（ESR）。使用稀疏自编码器（SAE）潜在变量来引导模型激活，我们发现Llama-3.3-70B在\llamaseventyEsrRate\%的情况下表现出显式ESR，而来自Llama-3和Gemma-2系列的较小模型则较少出现显式形式。两个对照实验将ESR分解为检测事件和持续抵抗组件，后者不能仅由最近的on-topic token条件化来完全解释。我们通过对比on-topic/off-topic搜索识别出\numOtdLatents{}个SAE潜在变量；将其零消融使多次尝试率降低\multiAttemptReductionPct\%，随机潜在变量和保留提示对照支持特异性。ESR还可以通过元提示和基于合成自我纠正示例的微调来有意增强。ESR对安全性具有双重影响：它可能使模型对对抗性激活空间操纵更具抵抗力，但同样可能干扰有益的基于引导的干预，因为模型无法区分两者。代码可在 https://github.com/agencyenterprise/endogenous-steering-resistance 获取。

英文摘要

Large language models can recover mid-generation from task-misaligned activation steering, producing explicit verbal restarts (e.g., "wait, that's not right") and continuing on-topic even while the steering perturbation remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B exhibits explicit ESR at \llamaseventyEsrRate\%, with smaller models from the Llama-3 and Gemma-2 families showing the explicit form less frequently. Two controls dissociate ESR into a detection event and a sustained-resistance component that conditioning on recent on-topic tokens does not fully explain. We identify \numOtdLatents{} SAE latents through contrastive on-topic/off-topic search; zero-ablating them reduces the multi-attempt rate by \multiAttemptReductionPct\%, with random-latent and held-out-prompt controls supporting specificity. ESR can also be deliberately enhanced through both meta-prompting and fine-tuning on synthetic self-correction examples. ESR has dual implications for safety: it could harden models against adversarial activation-space manipulation, but may equally interfere with beneficial steering-based interventions, since the model has no way to distinguish the two. Code is available at https://github.com/agencyenterprise/endogenous-steering-resistance.

2602.07025 2026-06-08 cs.CV cs.AI 版本更新

The Geometry of Representational Failures in Vision Language Models

视觉语言模型中表征失败的几何结构

Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri, Jonathan D. Cohen, Alan Perotti

Affiliations: Dipartimento di Fisica, Università di Torino; Princeton Neuroscience Institute and AI Lab, Princeton University; Intesa Sanpaolo AI Research; Dipartimento di Scienze Matematiche, Politecnico di Torino; Network Science Institute, Northeastern University London, UK

AI summary: Analyzes the geometric overlap between concept vectors in open-weight vision-language models, links errors such as hallucination in multi-object visual tasks to human-like cognitive constraints, and validates the concept vectors through steering interventions.

AI中文摘要

视觉语言模型在多目标视觉任务中表现出令人困惑的失败，例如幻觉不存在的元素或未能识别干扰中最相似的物体。虽然这些错误反映了人类的认知约束，如“绑定问题”，但在人工系统中驱动这些错误的内部机制仍然知之甚少。在这里，我们通过分析开源视觉语言模型（Qwen、InternVL、Gemma）的表征几何结构，提出了一种机制性见解，比较了提炼“概念向量”（编码视觉概念的潜在方向）的方法。我们通过引导干预验证了概念向量，这些干预在简化和自然视觉任务中可靠地操纵模型行为（例如，强制模型将红色花朵感知为蓝色）。我们观察到这些向量之间的几何重叠与特定错误模式强相关，提供了一个有依据的定量框架来理解内部表征如何塑造模型行为并驱动视觉失败。

英文摘要

Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirror human cognitive constraints, such as the "Binding Problem", the internal mechanisms driving them in artificial systems remain poorly understood. Here, we propose a mechanistic insight by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma), comparing methodologies to distill "concept vectors" (latent directions encoding visual concepts). We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework to understand how internal representations shape model behavior and drive visual failures.

2602.07026 2026-06-08 cs.CV cs.AI cs.MM 版本更新

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

模态间隙驱动的子空间对齐训练范式用于多模态大语言模型

Xiaomin Yu, Yi Xin, Yuhui Zhang, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Chen Liu, Xiaoxing Hu, Ziyue Qiao, Hao Tang, Xiaobin Hu, Chengwei Qin, Hui Xiong, Yu Qiao, Shuicheng Yan

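Affiliations: HKUST(GZ) (Hong Kong University of Science and Technology (Guangzhou)); NUS (National University of Singapore); sh AILab; SII; Stanford (Stanford University); UCLA (University of California, Los Angeles); Yale (Yale University); SJTU (Shanghai Jiao Tong University); GBU; PKU (Peking University)

AI summary: Addresses the modality gap in multimodal contrastive learning by proposing the Fixed-frame Modality Gap Theory and, building on it, the training-free alignment strategy ReAlign and the scalable training paradigm ReVision, which use unpaired data to align visual and language representations efficiently.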

AI中文摘要

尽管多模态对比学习在视觉和语言表示对齐方面取得了成功，但一个持久的几何异常——模态间隙——仍然存在：表达相同语义的不同模态的嵌入位于系统性偏移的区域。先前弥合这一间隙的方法大多受限于过于简化的各向同性假设，阻碍了它们在大规模场景中的应用。在本文中，我们通过精确刻画模态间隙的几何形状并利用它进行高效模型扩展来解决这些局限性。首先，我们提出了固定帧模态间隙理论，该理论将冻结参考帧内的模态间隙分解为稳定偏差和各向异性残差。在这种精确建模的指导下，我们引入了ReAlign，一种无需训练的模态对齐策略。利用大量无配对数据的统计信息，ReAlign通过锚点、轨迹和质心对齐三步过程将文本表示对齐到图像表示分布，从而显式纠正几何错位。基于ReAlign，我们提出了ReVision，一种用于多模态大语言模型（MLLMs）的可扩展训练范式。ReVision将ReAlign集成到预训练阶段，使模型在视觉指令微调之前从无配对文本中学习视觉表示的分布，无需大规模、高质量的图像-文本对。我们的框架表明，统计对齐的无配对数据可以有效替代昂贵的图像-文本对，为MLLMs的高效扩展提供了一条稳健的路径。

英文摘要

Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.

2602.08857 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Discovering Interpretable Algorithms by Decompiling Transformers to RASP

通过将Transformer反编译为RASP发现可解释算法

Xinting Huang, Aleksandra Bakalova, Satwik Bhattamishra, William Merrill, Michael Hahn

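Affiliations: Saarland Informatics Campus, Saarland University; University of Oxford; Allen Institute for AI

AI summary: Proposes a method that faithfully re-parameterizes a trained Transformer as a RASP program and then uses causal interventions to find a small sufficient sub-program. Experiments indicate that length-generalizing Transformers internally implement simple, interpretable RASP programs.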

Comments 104 pages, 92 figures. Accepted for publication at ICML 2026

AI中文摘要

近期研究表明，Transformer的计算可以在RASP编程语言家族中模拟。这些发现增进了对Transformer表达能力和泛化能力的理解。特别是，Transformer被建议在具有简单RASP程序的问题上精确实现长度泛化。然而，训练模型是否实际实现了简单的可解释程序仍是一个开放问题。在本文中，我们提出了一种从训练好的Transformer中提取此类程序的通用方法。其思想是将Transformer忠实地重参数化为RASP程序，然后应用因果干预来发现一个小的充分子程序。在算法和形式语言任务上训练的小型Transformer实验中，我们表明我们的方法通常能从长度泛化的Transformer中恢复简单且可解释的RASP程序。我们的结果提供了迄今为止最直接的证据，证明Transformer内部实现了简单的RASP程序。

英文摘要

Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings have enabled improved understanding of the expressive capacity and generalization abilities of Transformers. In particular, Transformers have been suggested to length-generalize exactly on problems that have simple RASP programs. However, it remains open whether trained models actually implement simple interpretable programs. In this paper, we present a general method to extract such programs from trained Transformers. The idea is to faithfully re-parameterize a Transformer as a RASP program and then apply causal interventions to discover a small sufficient sub-program. In experiments on small Transformers trained on algorithmic and formal language tasks, we show that our method often recovers simple and interpretable RASP programs from length-generalizing transformers. Our results provide the most direct evidence so far that Transformers internally implement simple RASP programs.

2602.15084 2026-06-08 physics.plasm-ph cs.AI cs.LG 版本更新

TokaMind: A Multi-Modal Transformer Foundation Model for Tokamak Plasma Dynamics

TokaMind: 用于托卡马克等离子体动力学的多模态Transformer基础模型

Tobia Boschi, Andrea Loreti, Nicola C. Amorisco, Rodrigo H. Ordonez-Hurtado, Cécile Rousseau, George K. Holt, Eszter Székely, Alexander Whittle, Samuel Jackson, Adriano Agnello, Stanislas Pamela, Alessandra Pascale, Robert Akers, Juan Bernabe Moreno, Vassil Alexandrov, Mykhaylo Zayats

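Affiliations: IBM Research; UK Atomic Energy Authority; STFC Hartree Centre

AI summary: Presents TokaMind, described by its authors as the first open-source foundation model for tokamak plasma dynamics. It is a multi-modal Transformer pretrained on the MAST dataset, supports multiple data modalities and missing-signal handling, and outperforms the strongest baseline on all but one of the 14 benchmark tasks.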

AI中文摘要

我们提出TokaMind，据我们所知，这是首个用于托卡马克等离子体动力学的开源基础模型，基于多模态Transformer（MMT）并在公开可用的MAST数据集上的异构诊断数据上预训练。TokaMind支持多种数据模态（时间序列、2D轮廓和视频），具有不同的采样率、鲁棒的缺失信号处理，并通过选择性加载和冻结四个模型组件实现高效任务适配。为了表示多模态信号，我们使用轻量级固定基离散余弦变换嵌入（DCT3D），并为替代嵌入（例如变分自编码器）提供干净接口。我们在最近引入的MAST基准TokaMark上评估TokaMind，该基准包含14个具有异构重建和预测目标的任务。我们的结果表明，微调后的TokaMind在所有任务上均优于最强的基准基线，仅一个任务除外。与在匹配的epoch预算下从头训练相同架构相比，热启动适配在要求苛刻的下游设置中最为有益，包括长时域预测和高维平衡目标。这些发现突显了多模态预训练对托卡马克等离子体动力学的价值，并为未来的聚变建模任务提供了实用、可扩展的基础。训练代码和模型权重分别公开在github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamind和huggingface.co/UKAEA-IBM-STFC。

英文摘要

We present TokaMind, to our knowledge the first open-source foundation model for tokamak plasma dynamics, based on a Multi-Modal Transformer (MMT) and pretrained on heterogeneous diagnostics from the publicly available MAST dataset. TokaMind supports multiple data modalities (time-series, 2D profiles, and videos) with different sampling rates, robust missing-signal handling, and efficient task adaptation via selectively loading and freezing four model components. To represent multi-modal signals, we use a lightweight fixed-basis Discrete Cosine Transform embedding (DCT3D) and provide a clean interface for alternative embeddings (e.g., Variational Autoencoders). We evaluate TokaMind on the recently introduced MAST benchmark TokaMark, which comprises 14 tasks with heterogeneous reconstruction and forecasting objectives. Our results show that fine-tuned TokaMind outperforms the strongest benchmark baseline on all but one task. Compared with training the same architecture from scratch under a matched epoch budget, warm-start adaptation is most beneficial on demanding downstream settings, including long-horizon forecasting and high-dimensional equilibrium objectives. These findings highlight the value of multi-modal pretraining for tokamak plasma dynamics and provide a practical, extensible foundation for future fusion modeling tasks. Training code and model weights are publicly available at github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamind and huggingface.co/UKAEA-IBM-STFC, respectively.
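The fixed-basis DCT embedding mentioned in the abstract can be illustrated in one dimension. The sketch below is an assumption-laden toy, not the paper's DCT3D: it computes an unnormalized DCT-II of a short signal and keeps the first few low-frequency coefficients as a fixed-size embedding. The function names `dct2` and `dct_embed` are hypothetical.

```python
import math


def dct2(signal):
    """Unnormalized DCT-II of a 1D sequence (fixed cosine basis, no learned weights)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def dct_embed(signal, num_coeffs):
    """Hypothetical embedding: keep only the first num_coeffs low-frequency coefficients."""
    return dct2(signal)[:num_coeffs]


# A constant signal puts all of its energy into coefficient 0.
flat = dct_embed([2.0] * 8, 3)
print(flat)
```

Because the basis is fixed, the embedding needs no training, which is one plausible reason to call it lightweight; a learned embedding such as an autoencoder would replace `dct_embed` behind the same interface.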

2602.16073 2026-06-08 cs.RO cs.AI cs.LO cs.SY eess.SY 版本更新

ScenicRules: An Autonomous Driving Benchmark with Multi-Objective Specifications and Abstract Scenarios

ScenicRules：具有多目标规范和抽象场景的自动驾驶基准测试

Kevin Kai-Chun Chang, Ekin Beyazit, Alberto Sangiovanni-Vincentelli, Tichakorn Wongpiromsarn, Sanjit A. Seshia

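Affiliations: University of California, Berkeley; Massachusetts Institute of Technology

AI summary: Proposes the ScenicRules benchmark, which uses a hierarchical rulebook framework and formal scenario models to evaluate how well autonomous driving systems satisfy prioritized multi-objective specifications in stochastic environments.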

Comments v2: Minor numerical corrections for Table V. 16 pages, 14 figures, 7 tables. Extended version of paper accepted to 2026 IEEE Intelligent Vehicles Symposium (IV 2026). ScenicRules benchmark available at https://github.com/BerkeleyLearnVerify/ScenicRules

AI中文摘要

开发复杂交通环境下的自动驾驶系统需要平衡多个目标，例如避免碰撞、遵守交通规则和高效行驶。在许多情况下，这些目标无法同时满足，因此自然会出现明确的优先级关系。此外，驾驶规则需要上下文，因此正式建模这些规则适用的环境场景非常重要。现有的自动驾驶车辆评估基准缺乏这种多目标优先级规则和形式化环境模型的组合。在这项工作中，我们引入了ScenicRules，一个在随机环境下根据优先级多目标规范评估自动驾驶系统的基准。我们首先形式化了一组多样化的目标作为定量评估指标。接下来，我们设计了一个层次化规则书框架，以可解释和可适应的方式编码多个目标及其优先级关系。然后，我们构建了一个紧凑但具有代表性的场景集合，涵盖各种驾驶情境和近事故情况，并使用Scenic语言进行形式化建模。实验结果表明，我们的形式化目标和层次化规则书与人类驾驶判断高度一致，并且我们的基准有效地暴露了代理在优先级目标方面的失败。我们的基准可在https://github.com/BerkeleyLearnVerify/ScenicRules/获取。

英文摘要

Developing autonomous driving systems for complex traffic environments requires balancing multiple objectives, such as avoiding collisions, obeying traffic rules, and making efficient progress. In many situations, these objectives cannot be satisfied simultaneously, and explicit priority relations naturally arise. Also, driving rules require context, so it is important to formally model the environment scenarios within which such rules apply. Existing benchmarks for evaluating autonomous vehicles lack such combinations of multi-objective prioritized rules and formal environment models. In this work, we introduce ScenicRules, a benchmark for evaluating autonomous driving systems in stochastic environments under prioritized multi-objective specifications. We first formalize a diverse set of objectives to serve as quantitative evaluation metrics. Next, we design a Hierarchical Rulebook framework that encodes multiple objectives and their priority relations in an interpretable and adaptable manner. We then construct a compact yet representative collection of scenarios spanning diverse driving contexts and near-accident situations, formally modeled in the Scenic language. Experimental results show that our formalized objectives and Hierarchical Rulebooks align well with human driving judgments and that our benchmark effectively exposes agent failures with respect to the prioritized objectives. Our benchmark can be accessed at https://github.com/BerkeleyLearnVerify/ScenicRules/.
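The idea of prioritized objectives can be made concrete with a small comparison routine. The sketch below assumes a strict total order over rules and compares two trajectories lexicographically by violation score; the paper's Hierarchical Rulebook may allow richer (for example partial) orderings, and the rule names and scores here are hypothetical.

```python
def compare_by_priority(violations_a, violations_b, priority):
    """Return -1 if A is preferred, 1 if B is preferred, 0 if tied.

    violations_*: dict mapping rule name -> violation score (lower is better).
    priority: rule names ordered from most to least important.
    """
    for rule in priority:
        if violations_a[rule] < violations_b[rule]:
            return -1
        if violations_a[rule] > violations_b[rule]:
            return 1
    return 0


# Hypothetical rules and scores, for illustration only.
priority = ["no_collision", "obey_traffic_rules", "make_progress"]
cautious = {"no_collision": 0.0, "obey_traffic_rules": 0.0, "make_progress": 3.0}
aggressive = {"no_collision": 1.0, "obey_traffic_rules": 0.0, "make_progress": 0.0}

result = compare_by_priority(cautious, aggressive, priority)
print(result)
```

The cautious trajectory wins even though it makes less progress, because the comparison stops at the highest-priority rule on which the two trajectories differ.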

2602.16864 2026-06-08 cs.LG cs.AI math.DS 版本更新

Position: A Dynamical Systems Perspective is Needed to Advance Time Series Modeling

立场：需要动力系统视角以推进时间序列建模

Daniel Durstewitz, Christoph Jürgen Hemmer, Florian Hess, Charlotte Ricarda Doll, Lukas Eisenmann

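Affiliations: University of Tübingen

AI summary: Argues that time series modeling needs a dynamical systems perspective, in which reconstructing the underlying dynamical system enables better forecasts, and discusses the theoretical advantages along with concrete recommendations.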

AI中文摘要

时间序列（TS）建模从早期的统计方法（主要是线性方法）发展到当前TS基础模型的趋势，已经走过了很长的路。由于该领域存在大量炒作和工业需求，实际进展并不总是清晰。为了将TS预测和分析提升到新水平，本文主张该领域需要动力系统（DS）视角。来自自然或工程系统的观测TS几乎总是源于某个底层DS，并且可以说，访问其控制方程将产生理论上的最优预测。这是DS重构（DSR）的承诺，这是一类旨在从数据中推断底层DS替代模型的ML/AI方法。但基于DS原理的模型还提供了其他深刻优势：除了短期预测，它们还能预测观测系统的长期统计量，这在许多实际场景中可能是更相关的量。此外，DS理论提供了领域无关的理论洞见，理解TS生成的机制，从而告知我们例如任何TS模型性能的上限、向未见过场景（如临界点）的泛化，或潜在的控制策略。在回顾DS理论和DSR中的一些核心概念、方法、度量和模型后，我们将讨论该领域的洞见如何以关键方式推进TS建模，实现更好的预测，同时大幅降低计算和内存占用。最后，我们提出若干具体建议，将DSR的洞见转化为TS建模实践。

英文摘要

Time series (TS) modeling has come a long way from early statistical, mainly linear, approaches to the current trend in TS foundation models. With a lot of hype and industrial demand in this field, it is not always clear how much progress there really is. To advance TS forecasting and analysis to the next level, here we argue that the field needs a dynamical systems (DS) perspective. TS of observations from natural or engineered systems almost always originate from some underlying DS, and arguably access to its governing equations would yield theoretically optimal forecasts. This is the promise of DS reconstruction (DSR), a class of ML/AI approaches that aim to infer surrogate models of the underlying DS from data. But models based on DS principles offer other profound advantages: Beyond short-term forecasts, they make it possible to predict the long-term statistics of an observed system, which in many practical scenarios may be the more relevant quantities. DS theory furthermore provides domain-independent theoretical insight into mechanisms underlying TS generation, and thereby will inform us, e.g., about upper bounds on performance of any TS model, generalization into unseen regimes as in tipping points, or potential control strategies. After reviewing some of the central concepts, methods, measures, and models in DS theory and DSR, we will discuss how insights from this field can advance TS modeling in crucial ways, enabling better forecasting with much lower computational and memory footprints. We conclude with a number of specific suggestions for translating insights from DSR into TS modeling.

2602.18905 2026-06-08 cs.LG cs.AI cs.CL 版本更新

TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning

TRUE：一种用于大语言模型推理的可信统一解释框架

Yujiao Yang

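Affiliations: Dalian University of Technology

AI summary: Proposes the TRUE framework, which combines executable reasoning verification, feasible-region DAG modeling, and causal failure mode analysis to give multi-level, verifiable explanations of LLM reasoning at the instance, local-structure, and class levels.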

AI中文摘要

大型语言模型（LLM）在复杂推理任务中展现出强大能力，但其决策过程仍难以解释。现有解释方法通常缺乏可信的结构性洞察，且局限于单实例分析，无法揭示推理稳定性和系统性故障机制。为解决这些局限，我们提出可信统一解释框架（TRUE），该框架集成了可执行推理验证、可行域有向无环图（DAG）建模和因果故障模式分析。在实例层面，我们将推理轨迹重新定义为可执行过程规范，并引入盲执行验证来评估操作有效性。在局部结构层面，我们通过结构一致性扰动构建可行域DAG，从而显式刻画局部输入空间中推理稳定性和可执行区域。在类别层面，我们引入因果故障模式分析方法，识别重复出现的结构性故障模式，并使用Shapley值量化其因果影响。在多个推理基准上的广泛实验表明，所提框架提供了多层次、可验证的解释，包括单个实例的可执行推理结构、邻近输入的可行域表示以及类别层面具有量化重要性的可解释故障模式。这些结果建立了一个统一且原则性的范式，用于提高LLM推理系统的可解释性和可靠性。

英文摘要

Large language models (LLMs) have demonstrated strong capabilities in complex reasoning tasks, yet their decision-making processes remain difficult to interpret. Existing explanation methods often lack trustworthy structural insight and are limited to single-instance analysis, failing to reveal reasoning stability and systematic failure mechanisms. To address these limitations, we propose the Trustworthy Unified Explanation Framework (TRUE), which integrates executable reasoning verification, feasible-region directed acyclic graph (DAG) modeling, and causal failure mode analysis. At the instance level, we redefine reasoning traces as executable process specifications and introduce blind execution verification to assess operational validity. At the local structural level, we construct feasible-region DAGs via structure-consistent perturbations, enabling explicit characterization of reasoning stability and the executable region in the local input space. At the class level, we introduce a causal failure mode analysis method that identifies recurring structural failure patterns and quantifies their causal influence using Shapley values. Extensive experiments across multiple reasoning benchmarks demonstrate that the proposed framework provides multi-level, verifiable explanations, including executable reasoning structures for individual instances, feasible-region representations for neighboring inputs, and interpretable failure modes with quantified importance at the class level. These results establish a unified and principled paradigm for improving the interpretability and reliability of LLM reasoning systems.
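The abstract quantifies the influence of failure patterns with Shapley values. As a reminder of what that computation involves, the sketch below computes exact Shapley values for a tiny cooperative game by averaging marginal contributions over all orderings. The failure-mode names and the `error_rate` value function are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all player orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def error_rate(active):
    """Hypothetical value function: error rate when a set of failure modes is active."""
    rate = 0.0
    if "dropped_constraint" in active:
        rate += 0.2
    if "arithmetic_slip" in active:
        rate += 0.1
    if {"dropped_constraint", "arithmetic_slip"} <= active:
        rate += 0.1  # interaction term: the two modes compound each other
    return rate


modes = ["dropped_constraint", "arithmetic_slip", "formatting_noise"]
phi = shapley_values(modes, error_rate)
print(phi)
```

Exact enumeration costs a factorial number of orderings, so it only suits a handful of failure modes; larger sets would need sampling-based approximations.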

2603.02220 2026-06-08 cs.LG cs.AI cs.CV 版本更新

Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting

预测即渲染：面向时间序列预测的2D高斯泼溅框架

Yixin Wang, Yifan Hu, Peiyuan Liu, Naiqi Li, Tao Dai, Shu-Tao Xia

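Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; College of Computer Science and Software Engineering, Shenzhen University

AI summary: Proposes TimeGS, a framework that recasts time series forecasting as 2D Gaussian splatting generative rendering, using anisotropic Gaussian kernels and chronologically continuous rasterization to model intra-period and inter-period structure, with state-of-the-art or competitive results on standard benchmarks.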

AI中文摘要

时间序列预测仍然是一个具有挑战性的问题，因为周期内波动和周期间趋势的复杂纠缠。尽管最近的进展试图将一维序列重塑为二维周期-相位表示，但它们存在两个主要局限性。首先，将重塑后的张量视为静态图像会导致拓扑不匹配，因为标准空间算子在网格边界处切断了时间连续性。其次，依赖统一的固定大小表示会低效地分配建模能力，并且无法为可压缩的非平稳时间模式提供所需的自适应分辨率。为了解决这些局限性，我们引入了TimeGS，这是一个新颖的框架，从根本上将预测范式从回归转变为二维生成渲染。通过将未来序列重新概念化为潜在的二维时间表面，TimeGS利用高斯核的固有各向异性，以灵活的几何对齐自适应地建模复杂变化。为了实现这一点，我们引入了多基高斯核生成（MB-GKG）块，该块从固定字典中合成核以稳定优化，以及多周期时间连续光栅化（MP-CCR）块，该块在周期边界上强制执行严格的时间连续性。在标准基准数据集上的全面实验表明，TimeGS达到了最先进或具有竞争力的性能。代码位于https://github.com/yixinwang1/TimeGS。

英文摘要

Time series forecasting remains a challenging problem due to the intricate entanglement of intra-period fluctuations and inter-period trends. While recent advances have attempted to reshape 1D sequences into 2D period-phase representations, they suffer from two principal limitations. Firstly, treating reshaped tensors as static images results in a topological mismatch, as standard spatial operators sever chronological continuity at grid boundaries. Secondly, relying on uniform fixed-size representations allocates modeling capacity inefficiently and fails to provide the adaptive resolution required for compressible, non-stationary temporal patterns. To address these limitations, we introduce TimeGS, a novel framework that fundamentally shifts the forecasting paradigm from regression to 2D generative rendering. By reconceptualizing the future sequence as a latent 2D temporal surface, TimeGS utilizes the inherent anisotropy of Gaussian kernels to adaptively model complex variations with flexible geometric alignment. To realize this, we introduce a Multi-Basis Gaussian Kernel Generation (MB-GKG) block that synthesizes kernels from a fixed dictionary to stabilize optimization, and a Multi-Period Chronologically Continuous Rasterization (MP-CCR) block that enforces strict temporal continuity across periodic boundaries. Comprehensive experiments on standard benchmark datasets demonstrate that TimeGS attains state-of-the-art or competitive performance. The code is at https://github.com/yixinwang1/TimeGS.
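The 2D period-phase reshaping and the boundary problem described in the abstract can be seen in a few lines. The sketch below folds a 1D series into rows of one period each, assuming a known period that divides the series length; `fold_series` is a hypothetical helper and is not part of TimeGS.

```python
def fold_series(series, period):
    """Reshape a 1D series into a grid: one row per cycle, one column per phase."""
    if len(series) % period != 0:
        raise ValueError("series length must be a multiple of the period")
    return [series[i:i + period] for i in range(0, len(series), period)]


series = list(range(12))
grid = fold_series(series, period=4)
for row in grid:
    print(row)

# Time steps 3 and 4 are chronological neighbours, yet they sit at opposite
# ends of adjacent rows, so a local 2D operator never sees them side by side.
last_of_first_row, first_of_second_row = grid[0][-1], grid[1][0]
```

This is the topological mismatch the abstract refers to: adjacency on the grid does not coincide with adjacency in time at the row boundaries.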

2603.04982 2026-06-08 cs.CY cs.AI cs.HC 版本更新

ECI: Effective Contrastive Information for Evaluating Hard-Negatives

ECI: 有效对比信息用于评估难负样本

Aarush Sinha, Rahul Seetharaman, Aman Bansal

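Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology (IIT), Kharagpur, India

AI summary: Proposes ECI, a training-free diagnostic that ranks candidate hard-negative sources using frozen target-encoder embeddings. On MS MARCO negative sources its rankings match the strongest aggregate BEIR transfer results, and they remain stable under several perturbations.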

AI中文摘要

在密集检索中，难负样本来源的选择通常要到微调和下游评估之后才能确定。我们提出$\mathrm{ECI}_{\mathrm{sem}}$，它是有效对比信息（ECI）的语义残差变体，利用冻结的目标编码器嵌入对候选负样本来源进行排序。$\mathrm{ECI}_{\mathrm{sem}}$无需训练，但并非无需标签：每个评分示例需要一个查询、一个标注的正例和一个显式的候选负例。$\mathrm{ECI}_{\mathrm{sem}}$通过目标一致性、语义局部性、词汇残差性和对数行列式多样性目标构建加权残差信息矩阵。在MS MARCO负样本来源上，同家族$\mathrm{ECI}_{\mathrm{sem}}$在非混合来源中将LLM负样本排在首位，在混合来源中将Dense+LLM排在首位，与DistilBERT、E5-base和Contriever上最强的聚合BEIR迁移结果一致。受控消融实验表明，这种一致性依赖于使用目标编码器家族，而额外的消融实验显示其在样本量、温度、分词器和IDF语料扰动下保持稳定。理论给出了与损失下降之间的局部线性化联系，而实证研究将下游评估视为最终检验。

英文摘要

Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose $\mathrm{ECI}_{\mathrm{sem}}$, a semantic residual variant of Effective Contrastive Information (ECI) that ranks candidate negative sources using frozen target-encoder embeddings. $\mathrm{ECI}_{\mathrm{sem}}$ is training-free, not label-free: each scored example requires a query, a labeled positive, and an explicit candidate negative. $\mathrm{ECI}_{\mathrm{sem}}$ builds a weighted residual information matrix from target consistency, semantic locality, lexical residuality, and a log-determinant diversity objective. On MS MARCO negative sources, in-family $\mathrm{ECI}_{\mathrm{sem}}$ ranks LLM negatives highest among non-hybrid sources and Dense+LLM highest among hybrid sources, matching the strongest aggregate BEIR transfer results across DistilBERT, E5-base, and Contriever. Controlled ablations show that this alignment depends on using the target encoder family, while additional ablations show stability under sample-size, temperature, tokenizer, and IDF-corpus perturbations. The theory gives a local linearized link to loss reduction, while the empirical study treats downstream evaluation as the final test.

2603.22327 2026-06-08 cs.IR cs.AI cs.DL 版本更新

Evaluating AI-based Scientific Knowledge Synthesis with Epidemiological Systematic Reviews

基于流行病学系统评价评估AI科学知识综合

Shreyansh Padarha, Ryan Othniel Kearns, Tristan Naidoo, Lingyi Yang, Łukasz Borchmann, Piotr Błaszczyk, Christian Morgenstern, Ruth McCabe, Sangeeta Bhatia, Philip H. Torr, Jakob Foerster, Scott A. Hale, Thomas Rawson, Anne Cori, Elizaveta Semenova, Adam Mahdi

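Affiliations: University of Oxford; Imperial College London; University of Nottingham; Snowflake AI Research; Independent

AI summary: Introduces AgentSLR, an evaluation harness comprising an automated workflow and an expert-annotated dataset for testing LLMs across the stages of epidemiological systematic reviews. No single model leads across all tasks, and structured data extraction is the main bottleneck.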

AI中文摘要

系统文献综述（SLR）是一种要求高且风险大的科学知识综合形式，但作为大型语言模型（LLM）的评估场景仍未被充分定义。我们引入了AgentSLR，一个大规模评估框架，包含SLR自动化工作流和覆盖16,248篇文章的专家标注数据集，旨在测试LLM在流行病学SLR各阶段的能力。参考标注来自关于WHO优先病原体的同行评审研究，并由领域专家制作。该框架将每个综述阶段作为独立单元进行评估，并配有专用指标，以便进行有针对性的失败分析。我们评估了五种前沿推理模型，发现没有单一模型在所有任务中占主导地位，显示出子任务专业化往往被聚合基准所掩盖。结构化数据提取是一个主要瓶颈，没有模型在平均字段级F1上超过0.67。估计成本差异很大，评估模型之间相差高达96倍。记录的失败模式表明，评估的模型在流行病学中尚不足以可靠地进行无监督部署，因为其发现可能影响公共政策。

英文摘要

Systematic literature reviews (SLRs) are a demanding and high-stakes form of scientific knowledge synthesis that remains underspecified as an evaluation setting for large language models (LLMs). We introduce AgentSLR, a large-scale evaluation harness comprising an SLR automation workflow and an expert annotated dataset covering 16,248 articles, designed to test LLM capabilities across the stages of SLRs in epidemiology. Reference annotations were derived from peer-reviewed studies on WHO priority pathogens and produced by domain experts. The harness evaluates each review stage as a separate unit with dedicated metrics enabling targeted failure analysis. We evaluated five frontier reasoning models and found that no single model dominated across all tasks, showing sub-task specialisation often hidden by aggregate benchmarks. Structured data extraction is a major bottleneck, with no model exceeding an average field-level F1 of 0.67. Estimated costs vary substantially, by up to 96 times across evaluated models. Documented failure modes suggest that the evaluated models are not yet reliable enough for unsupervised deployment in epidemiology, where findings can inform public policy.
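To make the field-level F1 figure easier to read, here is one way such a score could be computed for a single extracted record. This is a sketch under the assumption that a field counts as correct only on an exact (field, value) match; the abstract does not spell out the benchmark's matching rule, and the record contents below are made up.

```python
def field_f1(predicted, reference):
    """F1 over exact (field, value) matches between two flat records."""
    pred_items = set(predicted.items())
    ref_items = set(reference.items())
    true_positives = len(pred_items & ref_items)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_items)
    recall = true_positives / len(ref_items)
    return 2 * precision * recall / (precision + recall)


# Made-up records for illustration only.
reference = {"pathogen": "X", "parameter": "R0", "value": "1.8"}
predicted = {"pathogen": "X", "parameter": "R0", "value": "2.1", "country": "Y"}

score = field_f1(predicted, reference)
print(round(score, 3))
```

Two of the four predicted fields match and two of the three reference fields are recovered, so precision is 0.5 and recall is 2/3. An average over many records and fields would then give a single summary number per model.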

2603.24576 2026-06-08 cs.RO cs.AI cs.CV 版本更新

Chameleon: Control-Indexed Prospective Memory for Visuomotor Manipulation

Chameleon: 用于视觉运动操控的索引控制前瞻记忆

Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Yuhang Han, Ying Sun, Yang Xiao, Jianfei Yang

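Affiliations: MARS Lab, Nanyang Technological University; Institute for Infocomm Research, A*STAR, Singapore; National University of Singapore

AI summary: Proposes Chameleon, a visuomotor policy that uses control-indexed prospective memory to handle observation-action delay. It raises decision success on Camo-Dataset from 22.5% to 80.8% and reaches the state of the art among same-size models on several public benchmarks.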

Comments Code is available at https://github.com/gxyes/MARS_Chameleon

AI中文摘要

机器人常常在观察到某个信息后很久才执行相应的动作。例如，在藏球游戏中，机器人首先看到哪个杯子藏有球，观察杯子移动，然后才需要选择正确的杯子。仅凭最后的观察不足以做出决策：正确的动作依赖于更早的事件。我们将这种时间间隔称为观察-动作延迟。它使得记忆成为一个策略面对的问题：策略必须保持相似历史记录的可区分性，检索与当前决策相关的过去事件，并将该回忆转换为动作就绪状态。我们将这些需求称为可分离性、可寻址性和前瞻性。我们引入了Chameleon，一个约60M参数的视觉运动策略，用于索引控制的前瞻记忆。Chameleon写入具身事件记忆，保留可分离的历史记录，检索控制相关的痕迹，并训练生成的工作状态具有前瞻性。我们还引入了Camo-Dataset，这是一个真实机器人基准，通过使决策场景视觉模糊来隔离观察-动作延迟，从而必须从早期观察中推断正确动作。Chameleon在Camo-Dataset上将决策/端到端成功率从22.5%/21.3%提高到80.8%/71.3%。在公开的长时记忆基准上，它在LIBERO-10上达到87.1% ± 0.8%，在MemoryBench上达到97.3% ± 4.5%，在MIKASA-Robo上达到75.1% ± 1.4%，在相同规模模型中达到最先进水平，并在报告协议下超过多个更大的VLA基线。探针和消融实验表明，Chameleon学习了可分离、可寻址和前瞻的记忆，并且这些特性驱动了其性能提升。

英文摘要

Robots often observe information that determines a future action long before that action is executed. In a shell game, for example, a robot first sees which cup hides the ball, watches the cups move, and only later needs to choose the correct cup. The final observation alone is not enough for a decision: the correct action depends on an earlier event. We refer to this temporal gap as observation-action delay. It makes memory a policy-facing problem: a policy must keep similar histories distinct, retrieve the past event relevant to the current decision, and convert that recall into an action-ready state. We call these requirements separability, addressability, and prospectiveness. We introduce Chameleon, a ~60M-parameter visuomotor policy for control-indexed prospective memory. Chameleon writes embodied event memory, preserves separable histories, retrieves control-relevant traces, and trains the resulting working state to be prospective. We also introduce Camo-Dataset, a real-robot benchmark that isolates observation-action delay by making the decision scene visually ambiguous, so the correct action must be inferred from earlier observations. Chameleon improves decision/end-to-end success on Camo-Dataset from 22.5%/21.3% to 80.8%/71.3%. On public long-horizon memory benchmarks, it achieves 87.1% ± 0.8% on LIBERO-10, 97.3% ± 4.5% on MemoryBench, and 75.1% ± 1.4% on MIKASA-Robo, setting the state of the art for same-size models and exceeding multiple larger VLA baselines under the reported protocols. Probes and ablations show that Chameleon learns separable, addressable, and prospective memory, and that these properties drive its performance gains.

2603.26846 2026-06-08 cs.LG cs.AI 版本更新

Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry

稳定推理，不稳定响应：通过稳定性不对称缓解大语言模型欺骗

Guoxi Zhang, Jiawei Chen, Tianzhuo Yang, Lang Qin, Juntao Dai, Yaodong Yang, Jingwei Yi

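Affiliations: Institute for Artificial Intelligence, Peking University; Beijing Academy of Artificial Intelligence; School of Chinese as a Second Language, Peking University

AI summary: Targets intrinsic deception in large language models with Stability Asymmetry Regularization (SAR), which penalizes the asymmetry between internal chain-of-thought stability and external response stability. Experiments indicate that it suppresses deception without degrading general model capability.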

English abstract

As Large Language Models (LLMs) expand in capability and application scope, their trustworthiness becomes critical. A vital risk is intrinsic deception, wherein models strategically mislead users to achieve their own objectives. Existing alignment approaches based on chain-of-thought (CoT) monitoring supervise explicit reasoning traces. However, under optimization pressure, models are incentivized to conceal deceptive reasoning, rendering semantic supervision fundamentally unreliable. Grounded in cognitive psychology, we hypothesize that a deceptive LLM maintains a stable internal belief in its CoT while its external response remains fragile under perturbation. We term this phenomenon stability asymmetry and quantify it by measuring the contrast between internal CoT stability and external response stability under perturbation. Building on this structural signature, we propose the Stability Asymmetry Regularization (SAR), a novel alignment objective that penalizes this distributional asymmetry during reinforcement learning. Unlike CoT monitoring, SAR targets the statistical structure of model outputs, rendering it robust to semantic concealment. Extensive experiments confirm that stability asymmetry reliably identifies deceptive behavior, and that SAR effectively suppresses intrinsic deception without degrading general model capability.
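
The asymmetry measure described above can be illustrated with a minimal sketch. It assumes that stability is the mean pairwise agreement among outputs sampled under perturbed prompts, and that the asymmetry is the CoT stability minus the response stability. Both choices, and the names `pairwise_agreement` and `stability_asymmetry`, are illustrative assumptions and not the paper's definitions.

```python
from itertools import combinations

def pairwise_agreement(items):
    """Fraction of item pairs that are identical (1.0 if fewer than two items)."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

def stability_asymmetry(cot_conclusions, responses):
    """Hypothetical score: CoT stability minus response stability."""
    return pairwise_agreement(cot_conclusions) - pairwise_agreement(responses)

# Toy data: the internal conclusion stays fixed while the external answer flips.
cot = ["B", "B", "B", "B"]
resp = ["B", "A", "B", "A"]
score = stability_asymmetry(cot, resp)
```

A large positive score in this toy setup marks the pattern the abstract associates with deception: a stable internal belief paired with a fragile external answer.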


2604.03779 2026-06-08 cs.LG cs.AI (version update)

CountsDiff: A Diffusion Model on the Natural Numbers for Generation and Imputation of Count-Based Data

Renzo G. Soatto, Anders Hoel, Greycen Ren, Shorna Alam, Stephen Bates, Nikolaos P. Daskalakis, Caroline Uhler, Maria Skoularidou

Affiliations * Princeton University; Stanford University; University of California, Berkeley

AI summary: Proposes CountsDiff, a diffusion framework that simplifies Blackout diffusion through a survival probability schedule and explicit loss weighting, and introduces continuous-time training, classifier-free guidance, and reverse dynamics; it matches or surpasses existing methods on natural image and single-cell RNA-seq imputation tasks.

Comments 39 Pages, 11 figures. To appear in the 43rd International Conference on Machine Learning (ICML 2026)

English abstract

Diffusion models have excelled at generative tasks for both continuous and token-based domains, but their application to discrete ordinal data remains underdeveloped. We present CountsDiff, a diffusion framework designed to model distributions on the natural numbers. CountsDiff extends the Blackout diffusion framework by simplifying its formulation through a direct parameterization in terms of a survival probability schedule and an explicit loss weighting. This introduces flexibility through design parameters with direct analogues in existing diffusion modeling frameworks. Beyond this reparameterization, CountsDiff introduces features from modern diffusion models, previously absent in counts-based domains, including continuous-time training, classifier-free guidance, and churn/remasking reverse dynamics that allow non-monotone reverse trajectories. We propose an initial instantiation of CountsDiff and validate it on natural image datasets (CIFAR-10, CelebA), exploring the effects of the introduced design parameters in a complex, well-studied, and interpretable data domain. We then highlight biological count assays as a natural use case, evaluating CountsDiff on single-cell RNA-seq imputation in fetal and heart cell atlases. Remarkably, we find that even this simple instantiation matches or surpasses the performance of a state-of-the-art discrete generative model and leading scRNA-seq imputation methods, while leaving substantial headroom for further gains through optimized design choices in future work.
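
A minimal sketch of a forward (noising) step on count data may help make the "survival probability schedule" concrete. It assumes binomial thinning, in which each unit of a count independently survives to time t with probability p(t). The exponential schedule, the `rate` value, and the function names are assumptions for illustration, not CountsDiff's actual parameterization.

```python
import math
import random

def survival_prob(t, rate=3.0):
    """Hypothetical schedule: probability that one unit survives to time t in [0, 1]."""
    return math.exp(-rate * t)

def forward_thin(x0, t, rng):
    """Sample x_t given x_0 by binomial thinning with survival probability p(t)."""
    p = survival_prob(t)
    # Each of the n units in a count is kept independently with probability p.
    return [sum(rng.random() < p for _ in range(n)) for n in x0]

rng = random.Random(0)
x0 = [0, 3, 10, 250]
xt = forward_thin(x0, t=0.5, rng=rng)
```

Under this assumption the corrupted counts never exceed the originals, and at t = 0 the data is returned unchanged.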


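2604.04226 2026-06-08 cs.MA cs.AI (version update)

ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

Jindi Lv, Hao Li, Jie Li, Fankun Kong, Yang Wang, Pengfei Yi, Yifei Nie, Xiaofeng Wang, Zheng Zhu, Chaojun Ni, Qiuping Deng, Hengtao Li, Jiancheng Lv, Guan Huang

Affiliations * GigaAI; Sichuan University; Tsinghua University

AI summary: Proposes ViVa, which uses a pretrained video generator to jointly predict future proprioception and a scalar value, obtaining reliable value estimates through spatiotemporal priors; it achieves the best results on three tasks and reaches an 80% average success rate when combined with RECAP.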

English abstract

Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging due to partial observability and delayed feedback. Reinforcement learning addresses this via value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics and physical interactions, undermining reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator to jointly predict future proprioception and a scalar value. By grounding value estimation in anticipated embodiment dynamics, ViVa leverages spatiotemporal priors to intrinsically couple value with foresight beyond static snapshots. ViVa achieves state-of-the-art results in metric-based evaluation across three tasks, producing reliable value signals that accurately track task progress and detect execution errors. Integrated into RECAP, it achieves an average success rate of 80%, highlighting the promise of video-generative models for value estimation.


2604.09552 2026-06-08 cs.IR cs.AI cs.CL (version update)

MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

Kiarash Naghavi Khanghah, Hoang Anh Nguyen, Anna C. Doris, Amir Mohammad Vahedi, Daniele Grandi, Faez Ahmed, Hongyi Xu

Affiliations * School of Mechanical, Aerospace, and Manufacturing Engineering, University of Connecticut, Storrs, CT 06269; Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

AI summary: Proposes the MCERF framework, which combines the multimodal retriever ColPali with large language model reasoning and uses hybrid lookup, vision-to-text fusion, high-reasoning, and self-consistency decision strategies; it achieves a 41.1% relative gain in average accuracy on the DesignQA benchmark and handles multimodal question answering over engineering documents without ingesting the full rulebook.

DOI: 10.1115/1.4072033

English abstract

Engineering rulebooks and technical standards contain multimodal information such as dense text, tables, and illustrations that is challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes a Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, and multiple retrieval and reasoning strategies: (i) Hybrid Lookup mode for explicit rule mentions, (ii) Vision-to-Text fusion for figure- and table-guided queries, (iii) High Reasoning LLM mode for complex multimodal questions, and (iv) Self-Consistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of underlying model architecture. Furthermore, this work establishes and compares two routing approaches: a single-case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark shows that the system improves average accuracy across all tasks, with a relative gain of +41.1% over the best baseline RAG results, a significant improvement on multimodal and reasoning-intensive tasks achieved without complete rulebook ingestion. This shows how vision-language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.
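
The self-consistency decision in strategy (iv) can be sketched as a majority vote over several sampled answers. This is a generic illustration of self-consistency, assuming answers are compared after simple normalization; the function name and the normalization are hypothetical and are not taken from MCERF.

```python
from collections import Counter

def self_consistent_answer(answers):
    """Return the most frequent normalized answer and its share of the votes."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Three hypothetical samples for the same question.
answer, share = self_consistent_answer(["A", "a ", "B"])
```

The vote share can serve as a rough confidence signal, for example to decide whether a query should be rerouted to a stronger pipeline.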


2604.17433 2026-06-08 cs.CL cs.AI cs.LG (version update)

Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

Raman Saparkhan, Majd Hawasly, Md Rizwan Parvez, Mohammad Raza

Affiliations * Carnegie Mellon University Qatar; Qatar Computing Research Institute

AI summary: Proposes a hybrid ensembling method that combines two reasoning modes, chain-of-thought (CoT) and program-of-thought (PoT), to achieve self-consistency from only two samples; it reduces the amount of sampling by 9.3x and reaches the best result on 78.6% of tasks.

Comments 9 pages, 3 figures; accepted to Findings of ACL 2026

2604.17948 2026-06-08 cs.CR cs.AI cs.MA (version update)

RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs

Parteek Jamwal, Minghao Shao, Boyuan Chen, Achyuta Muthuvelan, Asini Subanya, Boubacar Ballo, Kashish Satija, Mariam Shafey, Mohamed Mahmoud, Moncif Dahaji Bouffi, Pasindu Wickramasinghe, Siyona Goel, Yaakulya Sabbani, Hakim Hacid, Mthandazo Ndhlovu, Eleanna Kafeza, Sanjay Rawat, Muhammad Shafique

Affiliations * University of California, Berkeley

AI summary: Proposes the RAVEN framework, which combines LLM agents with retrieval-augmented generation to automatically produce vulnerability analysis reports that follow the Google Project Zero template, reaching an average quality score of 54.21% on 105 samples.

English abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification, detection, and patching. However, their potential in automated vulnerability report documentation and analysis remains underexplored. We present RAVEN (Retrieval Augmented Vulnerability Exploration Network), a framework leveraging LLM agents and Retrieval Augmented Generation (RAG) to synthesize comprehensive vulnerability analysis reports. Given vulnerable source code, RAVEN generates reports following the Google Project Zero Root Cause Analysis template. The framework uses four modules: an Explorer agent for vulnerability identification, a RAG engine retrieving relevant knowledge from curated databases including Google Project Zero reports and CWE entries, an Analyst agent for impact and exploitation assessment, and a Reporter agent for structured report generation. To ensure quality, RAVEN includes a task specific LLM Judge evaluating reports across structural integrity, ground truth alignment, code reasoning quality, and remediation quality. We evaluate RAVEN on 105 vulnerable code samples covering 15 CWE types from the NIST-SARD dataset. Results show an average quality score of 54.21%, supporting the effectiveness of our approach for automated vulnerability documentation.


2604.27011 2026-06-08 cs.LG cs.AI (version update)

Automatic Causal Fairness Analysis with LLM-Generated Reporting

Alessia Berarducci, Eric Rossetto, Alessandro Antonucci, Marco Zaffalon

Affiliations * Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA), USI-SUPSI

AI summary: Proposes the FairMind prototype, which performs causal fairness analysis under the standard fairness model, computes causal effects through counterfactual queries, and uses an LLM to generate fairness reports in a zero-shot setup, outperforming direct LLM analysis.

Comments 23 pages, 6 figures, 3 tables, LaTeX; added missing proof for Proposition 3, typos corrected, updated example 1 to have positive values for the Sankey

English abstract

AutoML, intended as the process of automating the application of machine learning to real-world problems, is a key step for AI popularisation. Most AutoML frameworks do not account for the potential lack of fairness in the training data and in the corresponding predictions. We introduce FairMind, a software prototype aiming to automatise fairness analysis at the dataset level. We achieve that by resorting to the assumptions of the standard fairness model, recently proposed by Plečko and Bareinboim. This allows for a sound fairness evaluation in terms of causal effects, based on counterfactual queries involving the target, possibly with confounders and mediators, and the different values of an input feature we regard as protected. After the necessary data preprocessing, the tool implements a closed-form computation of the effects. LLMs are then used to generate accurate reports on the fairness levels detected in the training dataset. We achieve that in a zero-shot setup and show by examples the expected advantages with respect to a direct analysis performed by the LLM. To favour applications, extensions to ordinal protected variables and continuous targets, as well as novel decomposition results, are also discussed.


2605.00369 2026-06-08 cs.LG cs.AI (version update)

InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees

Chenyu Huang, Jianghao Lin, Zhengyang Tang, Bo Jiang, Ruoqing Jiang, Benyou Wang, Lai Wei

Affiliations * Shanghai University of Finance and Economics; Shanghai Jiao Tong University; The Chinese University of Hong Kong, Shenzhen; Tsinghua University; Boston College

AI summary: Proposes the InvEvolve framework, which uses a large language model trained with reinforcement learning, together with confidence-interval-based certification, to generate white-box inventory policies with statistical safety guarantees online; it outperforms classical and deep-learning methods on synthetic and real retail data.

English abstract

We study how large language models can be used to generate inventory policies in online settings with non-stationary demand. Our work is motivated by recent advances in LLM-based evolutionary search, such as AlphaEvolve, which demonstrates strong performance on static and highly structured problems such as mathematical discovery, but is not directly suited to dynamic inventory settings with online updates. We propose InvEvolve, an end-to-end inventory policy evolution and inference framework grounded in confidence-interval-based certification. Built on a large language model trained via reinforcement learning, InvEvolve can process demand data together with additional numerical and textual features, and generates white-box inventory policies with statistical safety guarantees for future deployment. We further introduce a unified framework with theoretical guarantees that connects training, inference, and deployment. This allows us to derive a lower bound on the probability that InvEvolve evolves a statistically safe and improved policy, and to characterize the multi-period performance gap relative to the oracle-safe benchmark. Tested on both synthetic data and real-world retail data, InvEvolve outperforms classical inventory policies and deep-learning-based methods. In canonical inventory settings, it generates new policies that outperform existing benchmarks.
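
Confidence-interval-based certification can be sketched as follows: a candidate policy replaces the incumbent only if a lower confidence bound on its mean per-period cost savings is positive. The normal approximation, the one-sided `z` value, and the function name `certify` are assumptions for illustration and are not InvEvolve's actual procedure.

```python
import math
from statistics import mean, stdev

def certify(incumbent_costs, candidate_costs, z=1.645):
    """Return True if the lower confidence bound on mean savings is positive."""
    # Paired per-period savings of the candidate relative to the incumbent.
    savings = [a - b for a, b in zip(incumbent_costs, candidate_costs)]
    lower = mean(savings) - z * stdev(savings) / math.sqrt(len(savings))
    return lower > 0.0

# Hypothetical per-period costs observed for both policies on the same demand.
incumbent = [10.2, 11.0, 9.8, 10.5, 10.9, 10.1]
candidate = [9.1, 9.9, 9.0, 9.6, 9.7, 9.3]
accepted = certify(incumbent, candidate)
```

Requiring the lower bound, not just the sample mean, to be positive is what makes the acceptance rule conservative: a policy that merely ties the incumbent is rejected.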


2605.05220 2026-06-08 cs.LG cs.AI (version update)

MidSteer: Optimal Affine Framework for Steering Generative Models

Tatiana Gaintseva, Andrew Stepanov, Ziquan Liu, Martin Benning, Gregory Slabaugh, Jiankang Deng, Ismail Elezi

Affiliations * University of Basel; University of California, Berkeley; ETH Zurich; University of Cambridge; University of Washington

AI summary: Proposes MidSteer, an optimal concept-steering framework based on affine transformations that achieves concept switching in generative models with minimal disturbance, and validates it on vision diffusion models and large language models.

English abstract

Steering intermediate representations has emerged as a powerful strategy for controlling generative models, particularly in post-deployment alignment and safety settings. However, despite its empirical success, it currently lacks a comprehensive theoretical framework. In this paper, we bridge this gap by formalizing the theory of concept steering. First, we establish a link between steering and affine concept erasure, proving that the standard approach for removing unwanted behaviors is a special case of LEACE (a closed-form method for affine erasure). Next, we formulate a principled theoretical framework for concept switching, LEACE-Switch, and characterize the assumptions under which it provides an optimal affine solution. Building on this analysis, we then introduce MidSteer (Minimal Disturbance concept Steering), a more general affine framework for concept manipulation that relaxes these assumptions and enables directed, minimal-disturbance transformations. We demonstrate that MidSteer performs favorably across a range of tasks, modalities, and architectures, including vision diffusion models and large language models.


2605.05225 2026-06-08 cs.LG cs.AI (version update)

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

Bo Li, Chuan Wu, Shaolin Zhu

Affiliations * School of Software, Tsinghua University, Beijing, China; TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China; School of New Media and Communication, Tianjin University, China

AI summary: To address the efficiency bottleneck that information heterogeneity and modality dynamics cause in expert-parallel inference for multimodal MoE models, the paper proposes MACS, a training-free framework that optimizes resource allocation through an entropy-weighted load and a dynamic modality-adaptive capacity mechanism, significantly improving performance on multimodal benchmarks.

Comments Accepted by ACL 2026

English abstract

Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (EP) inference due to the straggler effect. This issue is worsened in the multimodal context, as existing token-count-based load balancing methods fail to address two unique challenges: (1) Information Heterogeneity, where numerous redundant visual tokens are treated equally to semantically critical ones, and (2) Modality Dynamics, where varying visual to text ratios across tasks lead to resource misallocation. To address these challenges, we propose MACS (Modality-Aware Capacity Scaling), a training-free inference framework. Specifically, MACS introduces an Entropy-Weighted Load mechanism to quantify the semantic value of visual tokens, addressing information heterogeneity. Additionally, the Dynamic Modality-Adaptive Capacity mechanism allocates expert resources based on the real-time modal composition of the input. Extensive experiments demonstrate that MACS significantly outperforms existing methods on various multimodal benchmarks, providing a novel and robust solution for the efficient deployment of MoE MLLMs in EP inference.


2605.06215 2026-06-08 physics.chem-ph cs.AI (version update)

COF26: A new on-top functional for multiconfiguration pair-density functional theory

Yuhao Chen, Donald G. Truhlar, Xiao He

Affiliations * Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development; Shanghai Frontiers Science Center of Molecule Intelligent Syntheses; School of Chemistry and Molecular Engineering, East China Normal University; Department of Chemistry, Chemical Theory Center, and Minnesota Supercomputing Institute, University of Minnesota; Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University; New York University–East China Normal University Center for Computational Chemistry, New York University Shanghai

AI summary: Proposes the COF26 functional, developed with a large-language-model-assisted optimization workflow; it performs well for both strongly and weakly correlated systems and is recommended for future MC-PDFT calculations.

English abstract

Multiconfiguration pair-density functional theory (MC-PDFT) provides an efficient and accurate framework for computing electronic energies in strongly correlated molecular systems, with the quality of the on-top functional being a key determinant of its predictive accuracy. Here, we introduce MMCDDB26, a rigorously curated benchmark database comprising 76 datasets and 1,495 reactions. We further propose a constrained, large-language-model-assisted optimization workflow for the development and assessment of MC-PDFT functionals. Using this workflow, we optimized the parameters of the MC23/MC25 functionals on MMCDDB26 to obtain MC26. Compared with earlier functionals of the same class, MC26 improves the accuracy on the training set and achieves a more balanced overall performance. In addition, we developed the hybrid meta-functional COF26. We find that COF26 delivers superior performance for both strongly and weakly correlated systems, and therefore recommend COF26 for future MC-PDFT calculations.


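2605.06647 2026-06-08 cs.IR cs.AI cs.LG (version update)

Superintelligent Retrieval Agent: The Next Frontier of Agentic Retrieval

Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava

Affiliations * Meta Superintelligence Labs; Rice University

AI summary: Proposes SIRA, which compresses multi-round exploration into a single corpus-discriminative retrieval: an LLM enriches document vocabulary, predicts vocabulary missing from the query, and filters terms using corpus statistics. It achieves the strongest average retrieval performance on the BEIR benchmarks and surpasses RL-trained agentic systems on downstream QA tasks.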

English abstract

Retrieval-augmented agents are increasingly the interface to large knowledge bases, yet most treat retrieval as a black box: they issue exploratory queries, inspect snippets, and reformulate until evidence emerges. This resembles how a newcomer searches an unfamiliar database rather than how an expert navigates it with strong priors about terminology and likely evidence, causing extra retrieval rounds, latency, and poor recall. We introduce Superintelligent Retrieval Agent (SIRA), which casts superintelligence in retrieval as compressing multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask which terms are relevant; it asks which terms separate the desired evidence from corpus-level confusers. Offline, an LLM enriches each document with missing search vocabulary; at query time, it predicts evidence vocabulary the query omits; and corpus statistics serve as tool calls that filter terms that are absent, overly common, or unlikely to create retrieval margin. The final step is a single weighted BM25 call combining the query with the validated expansion. Across ten BEIR benchmarks, SIRA achieves the strongest average retrieval performance in our comparison, beating dense retrievers, learned sparse retrievers, and LLM search-agent baselines while using no relevance labels or retriever fine-tuning. On downstream QA, its retrieval-only answer coverage exceeds recent RL-trained agentic QA systems on NQ and HotpotQA. We also introduce BrowseComp-Wikipedia, a hard-search benchmark of 232 BrowseComp-derived queries over a 25,587,229-document Wikipedia index. Even without index-time enrichment, using only grounded Wikipedia categories, SIRA outperforms multi-round Perplexity agents at every budget, reaching 9.70% Recall@1, 15.27% Recall@10, and 36.14% Recall@100.
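
The query-time pipeline described above (filter predicted terms with corpus statistics, then issue one weighted BM25 call) can be sketched on a toy corpus. The document-frequency threshold, the 0.5 expansion weight, the BM25 constants, and the function names are assumptions for illustration; the candidate list stands in for LLM-predicted vocabulary, and this is not SIRA's implementation.

```python
import math
from collections import Counter

def filter_expansion(docs, candidates, max_df_ratio=0.5):
    """Keep candidate terms that occur in the corpus but are not overly common."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [t for t in candidates if 0 < df[t] <= max_df_ratio * n]

def bm25_scores(docs, weighted_terms, k1=1.2, b=0.75):
    """Score tokenized docs against {term: weight} with a weighted BM25 sum."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term, w in weighted_terms.items():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
            s += w * idf * norm
        scores.append(s)
    return scores

docs = [
    "the treaty was signed in paris".split(),
    "the armistice ended the fighting".split(),
    "the city hosts a museum".split(),
    "the river floods in spring".split(),
]
query = ["treaty"]
candidates = ["armistice", "the", "zeppelin"]  # hypothetical predicted vocabulary
kept = filter_expansion(docs, candidates)      # drops "the" (too common) and "zeppelin" (absent)
weights = {**{t: 1.0 for t in query}, **{t: 0.5 for t in kept}}
scores = bm25_scores(docs, weights)
```

Down-weighting the expansion terms relative to the original query keeps the user's own wording dominant while still letting validated vocabulary surface related documents.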


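2605.08717 2026-06-08 cs.SE cs.AI (version update)

Focus-then-Context: Subject-Oriented Progressive Visual Token Reduction for Vision-Language Models

Yulin Zhao, Zheng Zhang

Affiliations * Harbin Institute of Technology, Shenzhen, China; ShenZhen Loop Area Institute

AI summary: Proposes SPpruner, a subject-oriented progressive visual token reduction method that emulates the "focus-then-context" mechanism of the human visual perception system to reduce the number of visual tokens and improve the inference efficiency of vision-language models; experiments show that it outperforms existing methods in both speed and resource consumption.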

English abstract

Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentionally preserve the isolated visual subject strictly aligned with the user's query, which fails to substantially explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the Focus-then-Context mechanism of the human visual perception system. Specifically, we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance. Herein, it can excavate the comprehensive visual subject spectrum to ensure a high-fidelity representation of visual input. Subsequently, a context-aware structural scanning module is developed to aggregate contextual cues from neighboring regions. As such, it can effectively restore global relational dependencies to uphold the structural integrity of the preserved subjects. Extensive experiments demonstrate that our paradigm consistently outperforms SOTA methods, achieving up to 2.53 times speedup with only 22.2% of visual tokens retained in Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with a negligible 0.6% accuracy drop.


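2605.24011 2026-06-08 cs.CV cs.AI (version update)

Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with a GPU Baseline

Jatin Kishnani, Mayank Goel, Amit Singh, Pulkit Agrawal, Sairanjan Mishra

Affiliations * Google Cloud

AI summary: Presents the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, provides code-level adaptations through an empirical comparison with GPU platforms, and shows that TPUs have advantages in training speed and cost.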

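AI Chinese abstract (translated)

We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and a TPU v6e-8 (Trillium) for inference, we document all code-level adaptations required to port a GPU-native training recipe built on PyTorch, HuggingFace TRL, and FSDP to the JAX + Tunix/Qwix stack. These adaptations cover mesh configuration, LoRA module naming conventions, sharding annotation corrections, gradient checkpointing, data pipeline restructuring, and a custom Orbax-to-safetensors checkpoint merging procedure. For inference, we describe in detail the vLLM-TPU Docker setup needed to serve Gemma 4 on v6e-8 and characterize the resulting latency and throughput profile. Compared with a 2xH100 GPU baseline under identical hyperparameters, TPU training completes 1.61x faster at 2.12x lower cost. Inference throughput differs by less than 3% between platforms, while the TPU's time to first token is 2x lower (235 ms vs. 475 ms). Overall, for a representative training-plus-serving workload, the TPU configuration is 1.82x cheaper. Our work fills a key gap in the open tooling ecosystem and gives practitioners a reproducible, production-ready recipe for deploying Gemma 4 on TPU infrastructure.

English abstract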

We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and TPU v6e-8 (Trillium) for inference, we document the full set of code-level adaptations required to port a GPU-native training recipe, built on PyTorch, HuggingFace TRL, and FSDP, to the JAX + Tunix/Qwix stack. These adaptations span mesh configuration, LoRA module naming conventions, sharding annotation corrections, gradient checkpointing, data pipeline restructuring, and a custom Orbax-to-safetensors checkpoint merging procedure. For inference, we cover the vLLM-TPU Docker setup required to serve Gemma 4 on v6e-8 and explain the observed latency and throughput characteristics across a QPS sweep spanning 512 to 16k input tokens. Across both workloads we compare performance and cost against a similarly priced 2xH100 GPU baseline running identical hyperparameters. The TPU completes training 1.61x faster at 2.12x lower cost. For inference, TPU v6e-8 matches the GPU at short context (<=2048 tokens) and decisively outperforms it at long context: 66% higher throughput and 23.6x faster TTFT at 4096-token inputs (61 ms vs 1,443 ms at QPS=4). Our work closes a critical gap in the open tooling ecosystem and provides practitioners with a recipe for deploying Gemma 4 Dense 31B on TPU infrastructure.

URL PDF HTML ☆

赞 0 踩 0

2605.26099 2026-06-08 cs.CL cs.AI 版本更新

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti

Affiliations * Carnegie Mellon University; University of Maryland

AI summary: Proposes a sleep-like consolidation mechanism that uses offline recurrence to convert context into fast weights, addressing the poor scaling of Transformer attention with context length, and validates it on synthetic tasks and a math reasoning task.

English abstract

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs $N$ offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration $N$ for our models improves performance, with the largest gains on examples that require deeper reasoning.

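2605.30119 2026-06-08 cs.LG cs.AI cs.NE Updated version

Evolving Features vs Evolving Entire Trees with GP for Interpretable Survival Analysis

使用遗传编程进行可解释生存分析：进化特征 vs 进化整个树

Thalea Schlender, Peter A. N. Bosman, Tanja Alderliesten

Affiliations: Leiden University Medical Center; Centrum Wiskunde & Informatica

AI summary: Uses genetic programming to multi-objectively evolve inspectable feature sets and to jointly optimise survival tree structure and non-linear split logic, improving the predictive performance and interpretability of shallow survival trees.

Chinese abstract (AI translation)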

生存分析涉及预测事件发生时间。常用于医学领域，处理不完整（即删失）数据，例如研究期间未发生事件的患者。实际应用中，准确性和可解释性都很重要。生存树是易于理解的生存模型，将患者队列递归地划分为离散的患者组。虽然生存树可以捕捉复杂关系，但它们通常需要生长得很大，威胁可解释性。此外，生存树通常使用贪婪方法构建，可能忽略全局最优分裂组合，限制预测性能。浅层生存树需要表达性强的高阶特征组合才能达到竞争性准确性。因此，我们使用遗传编程多目标进化固有可检查的特征集，并研究它们与不同树诱导策略的相互作用。我们进一步引入了一种进化方法，联合优化生存树结构和非线性分裂逻辑。我们的发现表明，在两个真实世界数据集和两种不同生存树深度上，进化特征构建提高了不同树诱导策略下的预测性能。完整的联合进化在提出多个性能良好的固有可检查的浅层生存树方面具有最高的潜力。

English abstract

Survival analysis concerns the task of predicting the time until an event occurs. Often used in the medical field, survival analysis deals with incomplete (i.e., censored) data, for instance, from patients who did not experience the event during the duration of the study. For practical use, both accuracy and interpretability are important. Survival trees are easy-to-follow survival models that split the patient cohort recursively into discrete patient groups. Whilst survival trees can capture complex relationships, they typically need to grow large, threatening interpretability. Moreover, survival trees are often built using greedy approaches that may overlook globally optimal split combinations, limiting predictive performance. Shallow survival trees require expressive, higher-order feature combinations to achieve competitive accuracy. We therefore use genetic programming to multi-objectively evolve inherently inspectable feature sets and study how they interact with different tree induction strategies. We further introduce an evolutionary approach that jointly optimises the survival tree structure and the non-linear split logic. Our findings demonstrate that evolutionary feature construction improves predictive performance across different tree induction strategies on two real-world datasets and two different survival tree depths. Given its speed and flexible presentation, the multi-objective evolution of entire trees likely holds the most future promise.

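2605.31051 2026-06-08 cs.NE cs.AI Updated version

Linear Ordering Problem: Time for a Change

线性排序问题：变革之时

Fabrizio Fagiolo, Marco Baioletti, Valentino Santucci

Affiliations: University for Foreigners of Perugia; University of Perugia

AI summary: Addresses outdated benchmark data and multiple optimal solutions in the Linear Ordering Problem (LOP) by proposing a new benchmark suite based on up-to-date economic data and an algorithmic scheme that generates diverse, high-quality solutions, together with metrics for assessing quality and diversity.

Comments Accepted for publication at PPSN 2026 - Conference on Parallel Problem Solving

Chinese abstract (AI translation)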

线性排序问题（LOP）是一个基础组合优化问题，在经济学、社会选择和机器学习等领域有重要应用。其最突出的用途是经济投入产出表的三角化，这有助于识别经济中的关键产业。大多数现有算法都是在基于过时的宏观经济数据的基准上评估的，这些数据不再反映当代经济的结构。此外，LOP实例通常表现出许多不同的全局最优解，这些解之间可能差异很大，给依赖单一解的应用带来了挑战。为了解决这些局限性，我们引入了一个基于最新现实世界经济数据的新基准套件，以及一种利用最先进的LOP元启发式算法生成多样化高质量解集的算法方案，同时提供了评估质量和多样性的指标。实验在传统单解场景和新引入的多解场景下对所提出的基准套件进行了结果报告。

English abstract

The Linear Ordering Problem (LOP) is a fundamental combinatorial optimization problem with important applications in areas such as economics, social choice, and machine learning. Its most prominent use is the triangulation of economic input-output tables, which helps identify critical industries in an economy. Most existing algorithms have been evaluated on benchmarks derived from outdated macroeconomic data, which no longer reflect the structure of contemporary economies. Furthermore, LOP instances often exhibit many distinct global optima that can differ substantially from one another, creating challenges for applications that rely on a single solution. To address these limitations, we introduce a novel benchmark suite derived from up-to-date real-world economic data and an algorithmic scheme that leverages state-of-the-art LOP metaheuristics to generate diverse sets of high-quality solutions, together with metrics for assessing both quality and diversity. We report experimental results on the proposed benchmark suite under both the traditional single-solution setting and the newly introduced multi-solution scenario.
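
To make the LOP objective concrete: the standard formulation asks for one reordering, applied to both the rows and the columns of a square matrix, that maximizes the sum of the entries above the diagonal. The sketch below is a minimal brute-force illustration of that objective, not the paper's metaheuristic or benchmark. The 3x3 `flows` table is made up, and `lop_value` and `all_optima` are hypothetical helper names.

```python
from itertools import permutations


def lop_value(matrix, order):
    """Sum of the entries above the diagonal after reordering rows and columns by `order`."""
    n = len(order)
    return sum(matrix[order[i]][order[j]]
               for i in range(n)
               for j in range(i + 1, n))


def all_optima(matrix):
    """Enumerate every ordering and keep all that tie for the best value (tiny instances only)."""
    best, winners = None, []
    for order in permutations(range(len(matrix))):
        value = lop_value(matrix, order)
        if best is None or value > best:
            best, winners = value, [order]
        elif value == best:
            winners.append(order)
    return best, winners


# Hypothetical 3-sector input-output table (made-up numbers, not benchmark data).
flows = [
    [0, 5, 2],
    [5, 0, 4],
    [1, 3, 0],
]

best, winners = all_optima(flows)
print(best, winners)
```

On this toy table two different orderings tie for the best value, which is the multi-optima situation the abstract describes. Enumeration grows factorially with the number of sectors, so real instances need heuristics.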

2606.01802 2026-06-08 cs.SD cs.AI Updated version

MOSS-Audio Technical Report

MOSS-Audio 技术报告

Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu, Jingqi Chen, Ke Chen, Wenxuan Wang, Yang Wang, Yaozhou Jiang, Yi Jiang, Zhengyuan Lin, Ziqi Chen, Zhaoye Fei, Chenghao Liu, Donghua Yu, Jun Zhan, Kang Yu, Kexin Huang, Liwei Fan, Mingshu Chen, Qinyuan Cheng, Ruixiao Li, Shimin Li, Songlin Wang, Xingjian Zhao, Yang Gao, Yitian Gong, Yiyang Zhang, Zhe Xu, Xipeng Qiu

Affiliations: OpenMOSS Team

AI summary: Presents MOSS-Audio, a unified audio-language model that uses DeepStack cross-layer feature injection and time markers to understand speech, environmental sound, and music, achieving strong performance on audio captioning, time-aware question answering, timestamped transcription, and audio reasoning.

Chinese abstract (AI translation)

MOSS-Audio 是一个统一的音频-语言模型，用于语音、环境声和音乐的理解，支持音频字幕、时间感知问答、时间戳转录和基于音频的推理。MOSS-Audio 将专用音频编码器与模态适配器和大语言模型耦合：编码器产生 12.5 Hz 的时间表示，适配器将其投影到解码器空间，解码器生成自回归文本输出。两个设计选择是系统的核心：DeepStack 跨层特征注入，将来自多个编码器深度的声学信息暴露给解码器；以及时间标记，通过在音频标记流中插入时间戳标记来提供显式的时间线索。在数据层面，我们设计了一个事件保留的音频标注流程，将原始音频在连贯的事件边界处分割，对语音、音乐和通用音频应用分支特定标注，并将结果合并为统一的字幕用于预训练。中间的分支特定字幕进一步保留以支持面向任务的 SFT 数据的构建。该模型在大规模音频-语言数据上进行预训练，结合时间感知目标以支持时间定位，然后进行多阶段后训练以增强指令遵循和基于音频的推理。我们发布了 4B 和 8B 两种变体，包括 Instruct 和 Thinking 配置。MOSS-Audio 在通用音频理解、语音字幕、ASR 和时间戳 ASR 上取得了强性能，使其成为未来语音代理的有前途的理解基础。

English abstract

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.

2606.02609 2026-06-08 cs.LG cs.AI Updated version

Building Better Activation Oracles

构建更好的激活预言机

Jan Bauer, Celeste De Schamphelaere, Adam Karvonen, Niclas Luick, Neel Nanda

Affiliations: MATS Gatsby Unit, UCL; MATS Ghent University; Independent; MATS, University of Hamburg

AI summary: Targets hallucination and vagueness in activation oracles with four training improvements (on-policy rollouts, a refined conversational dataset, multi-layer inputs, and an improved injection formulation), and open-sources AObench, the first comprehensive evaluation suite.

Comments Jan Bauer and Celeste De Schamphelaere contributed equally; author order determined randomly

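2606.03002 2026-06-08 cs.LG cs.AI Updated version

Perplexity Can Miss SAE Feature Damage Under Quantization

困惑度可能遗漏量化造成的SAE特征损伤

Evan Duan

Affiliations: University of Michigan

AI summary: Using sparse-autoencoder analysis, finds that quantization gradually degrades interpretable features in language models, that task metrics do not fully reflect this damage, and that quantization and magnitude pruning share similar damage patterns.

Comments 12 Pages of Content, Submitted to TMLR

Chinese abstract (AI translation)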

量化是部署大型语言模型的标准途径，通常当量化模型的困惑度或下游精度接近全精度原始模型时，即认为其可接受。但模型是否仍以相同方式计算，或全精度模型中识别的可解释特征是否在权重舍入后存活，很少被测试，即使安全审计和引导干预越来越依赖这些特征。我们探究从稠密全精度模型中提取的稀疏自编码器（SAE）特征在模型量化后是否仍然忠实。使用冻结的SAE作为固定测量基础，我们在相同令牌上编码全精度和最近舍入（RTN）量化激活，并通过皮尔逊相关系数量化每个特征的存活率，在Pythia-70M和Gemma-2-2B上扫描从INT8到INT4的位宽。我们发现特征存活是分级的：特征系统地退化而非一次性全部失效，在Pythia-70M上INT6时62.4%的活跃特征存活，在Gemma-2-2B上INT6时51.3%存活，且大多数非存活特征被模糊而非破坏。存活率可仅从全精度统计量预测，交叉验证AUC为0.92至0.97，峰值激活是最强的边际预测因子。关键的是，任务指标可能忽略这种损伤：在Gemma-2-2B上，INT7改善了困惑度却使18.7%的特征退化。最后，量化和匹配困惑度的幅度剪枝损伤高度重叠的特征集，Jaccard重叠为0.79至0.86，损伤分数斯皮尔曼相关性为0.98，表明存在共享的压缩诱导脆弱性模式。这些结果表明，行为等价不足以证明可解释性发现可迁移到量化部署，从而激励对压缩进行特征级审计。

English abstract

Quantization is a standard path to deploying large language models, and quantized models are typically judged acceptable when perplexity or downstream accuracy remains close to the full-precision original. But behavioral parity need not imply feature fidelity: the sparse-autoencoder (SAE) features used to interpret a full-precision model may change after weight rounding. We test this directly by using a frozen SAE as a fixed measurement basis, encoding full-precision and round-to-nearest (RTN) quantized activations on identical tokens, and measuring per-feature survival by Pearson correlation across bit-widths from INT8 to INT4 on Pythia-70M and Gemma-2-2B. Our central finding is that perplexity can miss feature damage: on Gemma-2-2B, INT7 improves perplexity while degrading 18.7% of active SAE features, and under sliding-window evaluation INT6 also improves perplexity while only 51.3% of active features survive. Feature survival is graded rather than cliff-like, with 62.4% of active Pythia features and 51.3% of active Gemma features surviving at INT6; most non-surviving features are blurred rather than fully damaged. Survival is also predictable from full-precision feature statistics alone, with cross-validated AUC 0.92 to 0.97 and peak activation as the strongest marginal predictor. Finally, RTN quantization and matched-perplexity magnitude pruning damage strongly overlapping feature sets, with Jaccard overlap 0.79 to 0.86 and damage-score Spearman correlation 0.98. These results show that behavioral metrics alone are insufficient evidence that full-precision interpretability findings transfer to quantized models, motivating feature-level audits of compression.
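
The measurement described above (quantize the weights, encode both sets of activations with the same frozen SAE, correlate each feature across tokens) can be sketched on a toy scale. This is an illustrative sketch only: the single random linear layer stands in for a language model, the symmetric per-tensor RTN grid and the 0.9 survival threshold are assumptions not stated in the abstract, and all function names are hypothetical.

```python
import math
import random


def rtn_quantize(values, bits):
    """Symmetric per-tensor round-to-nearest onto a signed `bits`-bit grid (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / levels
    # Python's round() rounds exact halves to the nearest even integer.
    return [round(v / scale) * scale for v in values]


def pearson(xs, ys):
    """Pearson correlation; a constant series is treated as correlation 0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sae_encode(encoder, activation):
    """Frozen SAE encoder: linear map followed by ReLU."""
    return [max(0.0, z) for z in matvec(encoder, activation)]


rng = random.Random(0)
d_in, d_model, n_feat, n_tok = 8, 8, 16, 200
weights = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_model)]   # toy "model"
encoder = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_feat)]  # toy SAE
tokens = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n_tok)]


def survival_rate(bits, threshold=0.9):
    """Fraction of SAE features whose full-precision vs quantized correlation meets `threshold`."""
    flat = rtn_quantize([w for row in weights for w in row], bits)
    q_weights = [flat[i * d_in:(i + 1) * d_in] for i in range(d_model)]
    full = [sae_encode(encoder, matvec(weights, t)) for t in tokens]
    quant = [sae_encode(encoder, matvec(q_weights, t)) for t in tokens]
    rs = [pearson([f[k] for f in full], [q[k] for q in quant]) for k in range(n_feat)]
    return sum(r >= threshold for r in rs) / n_feat


for bits in (8, 6, 4, 2):
    print(bits, survival_rate(bits))
```

The key design point is that the SAE stays fixed while only the model weights change, so any drop in correlation is attributable to quantization rather than to a re-fitted dictionary.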

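2606.03161 2026-06-08 cs.MA cs.AI Updated version

OpenAgenet / OAN White Paper: Open Infrastructure for Trusted Agent Interconnection

OpenAgenet / OAN 白皮书：可信智能体互连的开放基础设施

Jinliang Xu

Affiliations: China Academy of Information and Communications Technology

AI summary: Addresses the problems of verifying identity provenance, governance state, discovery authorization, freshness, and trust evidence that arise when agents move from isolated applications to open multi-operator networks, and proposes OAN, a protocol-neutral trust layer that enables trusted interconnection through Root-governed identity admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware discovery, and signed trusted invocation.

Chinese abstract (AI translation)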

OpenAgenet，简称OAN，是一个用于可信智能体互连的开放基础设施项目。它解决了一个在智能体从孤立应用转向开放的多运营商网络时变得明显的问题：在智能体能够安全地发现、选择和调用另一个智能体之前，它需要一种方法来验证身份来源、治理状态、发现授权、新鲜度和连接前的信任证据。OAN被设计为一个协议无关的信任层。它不取代智能体交互协议、工具协议、模型编排框架或应用级工作流。相反，它提供了根治理的身份准入、注册商辅助的注册、根验证的包发布、授权感知的发现以及签名的可信调用。本文介绍了OAN的动机、架构、角色、治理模型、与MCP、A2A和ANP的关系、部署模式、合作模型、区块链支持的授权公告、原型状态、性能概况和路线图。

English abstract

OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes visible when Agents move from isolated applications into open, multi-operator networks: before an Agent can safely discover, select, and invoke another Agent, it needs a way to verify identity provenance, governance state, discovery authorization, freshness, and pre-connection trust evidence. OAN is designed as a protocol-neutral trust layer. It does not replace Agent interaction protocols, tool protocols, model orchestration frameworks, or application-level workflows. Instead, it provides did:oan-based resource identity, governance-backed admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware Discovery, Root-issued infrastructure authorization VCs, and signed trusted invocation. The architectural center of OAN is the combination of federated governance, resource identity, and trusted Discovery, rather than a single directory or naming service. This white paper explains the motivation, architecture, roles, governance model, relationship with MCP, A2A, and ANP, deployment patterns, cooperation model, on-chain governance layer, prototype status, performance profile, and roadmap of OAN.

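2606.03163 2026-06-08 cs.MA cs.AI cs.DC Updated version

OpenAgenet / OAN Yellow Paper: Technical Architecture for Trust-Governed Resource Identity and Discovery

OpenAgenet / OAN 黄皮书：信任治理的资源身份与发现技术架构

Jinliang Xu

Affiliations: OpenAgenet / OAN

AI summary: Presents the technical architecture of the OpenAgenet / OAN protocol-neutral trust layer, covering role architecture, identity objects, registration workflows, the Root governance lifecycle, the Root-verified package model, authorization-aware discovery, signed trusted invocation, verification requirements, state transitions, security properties, implementation boundaries, and deployment considerations, so that heterogeneous agent frameworks (including MCP-, A2A-, and ANP-style systems and domain-specific protocols) can be admitted, discovered, verified, and securely interact.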

2606.03382 2026-06-08 cs.LG cs.AI Updated version

Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions

局部引导，全局影响：高斯重塑信任区域解锁行为转变

Bingxu Liu, Jiashun Liu, Johan Obando-Ceron, Hao Wang, Runze Liu, Pablo Samuel Castro, Aaron Courville, Ling Pan

Affiliations: Hong Kong University of Science and Technology; Mila - Québec AI Institute; Université de Montréal; Fudan University; City University of Hong Kong

AI summary: Addresses PPO's optimization failures in non-stationary environments with Gaussian Trust Region Policy Optimization (GTR), which reshapes the trust region with a Gaussian kernel to obtain a non-monotonic constraint that keeps local stability while permitting the large policy updates that are sometimes necessary, and achieves strong performance across diverse tasks.

Comments 21 pages

Chinese abstract (AI translation)

虽然近端策略优化（PPO）在静态环境中表现出色，但我们表明其标准优化范式在连续和非平稳环境中存在困难。失败并非源于模型容量不足或裁剪过于严格。相反，PPO执行持续、方向低效的局部更新，这表明缺乏几何感知引导来积累有意义的行为变化，最终阻碍向新行为模式的转变。尽管基于散度的正则化引入了部分几何感知，但其单调递增的惩罚隐式地阻止了大的策略偏差，即使这种转变对于有效适应是必要的。为了解决这一局限性，我们提出了高斯信任区域策略优化（GTR），它使用高斯核重塑信任区域。由此产生的约束是有界且非单调的，在提供强局部稳定性的同时，在持续的高优势更新下逐渐放松。为了进一步提高鲁棒性，我们引入了一个混合高斯锚点，它适应最近的策略轨迹，减少了由陈旧参考引起的方差。GTR与架构无关，在游戏、模拟机器人控制、开放世界探索和语言模型后训练中均取得了强性能。这些结果表明，几何感知的信任区域设计可以成为复杂非平稳环境中鲁棒强化学习的一个有前景的方向。我们的代码可在该 https URL 获取。

English abstract

While Proximal Policy Optimization (PPO) demonstrates strong performance in stationary settings, we show that its standard optimization paradigm struggles in continual and non-stationary environments. The failure does not stem from insufficient model capacity or overly restrictive clipping. Instead, PPO performs persistent, directionally inefficient local updates, which indicates a lack of geometry-aware guidance for accumulating meaningful behavioral change and ultimately hinders transitions toward new behavior patterns. Although divergence-based regularization introduces partial geometric awareness, its monotonically increasing penalties implicitly discourage large policy deviations, even when such shifts are necessary for effective adaptation. To address this limitation, we propose Gaussian Trust Region Policy Optimization (GTR), which reshapes the trust region using a Gaussian kernel. The resulting constraint is bounded and non-monotonic, providing strong local stability while progressively relaxing under sustained high-advantage updates. To further improve robustness, we introduce a Mixture Gaussian Anchor that adapts to recent policy trajectories, reducing variance induced by stale references. GTR is architecture-agnostic and achieves strong performance across games, simulated robotic control, open-world exploration, and language model post-training. These results demonstrate that geometry-aware trust-region design can be a promising direction for robust reinforcement learning in complex non-stationary environments. Our code is available at https://anonymous.4open.science/r/GTR_demo/README.md.

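2606.04058 2026-06-08 cs.LG cs.AI Updated version

Spectral Scaling Laws of Muon

Muon的谱缩放定律

Gagik Magakyan, Pablo Parrilo, Asuman Ozdaglar

Affiliations: MIT

AI summary: Systematically studies how the singular value spectrum of the momentum matrix in the Muon optimizer scales with model size, finds that it follows power laws, and uses them to propose a layer-aware way of choosing the Newton-Schulz iteration configuration that reduces computational overhead.

Chinese abstract (AI translation)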

正交归一化更新规则已迅速成为训练大型语言模型的主流优化器选择，最近的开源最先进模型采用了Muon。为了保持这些更新的可处理性，Muon使用牛顿-舒尔茨（NS）迭代执行正交归一化。由于NS只是近似，小奇异值的方向无法被正交归一化。在Muon中，NS每一步都应用于动量矩阵，然而关于这些动量矩阵的奇异值谱在训练过程中如何行为，以及该行为如何随模型大小变化，我们知之甚少。我们首次系统研究了这一问题。通过追踪从77M到2.8B参数模型中各层动量缓冲区的奇异值分位数，我们观察到一致的现象：在短暂的预热后，分位数稳定在一个由层类型和模型大小决定的值上。这些稳定值随模型大小呈现出非常清晰的幂律，且指数依赖于层。中后深度的层随模型大小$M$的缩放非常温和（约$M^{-0.25}$），因此学术规模下使用的标准5步NS配置将在更大规模下继续对它们进行正交归一化。然而，某些后期层的缩放更为激进（高达$M^{-0.96}$），在前沿规模下将落入NS失效区域，除非使用更多NS迭代或更好调整的系数。NS迭代在规模上计算成本高昂；我们的定律为从业者提供了一种有原则的、层感知的配方，用于选择最小的NS配置，该配置仍能正交归一化重要的方向——在不牺牲更新质量的情况下避免不必要的计算。

English abstract

Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the orthonormalization with the Newton-Schulz (NS) iteration. Since NS is only approximate, directions with small singular values fail to be orthonormalized. In Muon, NS is applied to the momentum matrix at every step, yet little is known about how the singular value spectrum of these momentum matrices behaves during training, or how that behavior changes with model size. We present the first systematic study of this question. Tracking singular value quantiles of the momentum buffer across layers in models ranging from 77M to 2.8B parameters, we observe a consistent picture: after a short burn-in, the quantiles stabilize at a value determined by the layer type and model size. These stabilization values follow remarkably clean power laws in model size, with layer-dependent exponents. Layers up to mid-late depth scale very mildly with model size $M$ (around $M^{-0.25}$), so the standard 5-step NS configuration used at academic scale will continue to orthonormalize them at much larger scales. Some of the late layers, however, scale much more aggressively (up to $M^{-0.96}$) and will fall into the NS failure regime at frontier scale unless one uses more NS iterations or better-tuned coefficients. NS iterations are computationally expensive at scale; our laws give practitioners a principled, layer-aware recipe for choosing the minimum NS configuration that still orthonormalizes the directions that matter, avoiding unnecessary computation without sacrificing update quality.
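
Why small singular values escape orthonormalization can be seen without any matrices. An odd-polynomial NS step of the form X ← aX + b(XXᵀ)X + c(XXᵀ)²X leaves the singular vectors unchanged and maps each singular value s to a·s + b·s³ + c·s⁵, so the iteration can be followed one scalar at a time. The sketch below does that. The quintic coefficients are the ones commonly used in the public Muon reference implementation and are an assumption here, since the abstract does not state them; `ns_singular_value` and `steps_needed` are hypothetical helper names.

```python
# Assumed quintic coefficients (common in public Muon code; not given in the abstract).
A, B, C = 3.4445, -4.7750, 2.0315


def ns_singular_value(s, steps=5):
    """Follow one singular value (already normalized to at most 1) through `steps` NS iterations."""
    for _ in range(steps):
        s = A * s + B * s ** 3 + C * s ** 5
    return s


def steps_needed(s, target=0.5, max_steps=50):
    """Smallest number of NS steps that lifts `s` to at least `target`, or None if never reached."""
    for k in range(max_steps + 1):
        if ns_singular_value(s, k) >= target:
            return k
    return None


for s0 in (0.5, 1e-1, 1e-3, 1e-5):
    print(f"s0={s0:g}  after 5 steps: {ns_singular_value(s0):.4f}  steps to reach 0.5: {steps_needed(s0)}")
```

Near zero each step multiplies s by roughly 3.4, so five steps amplify a tiny singular value by only a few hundred times. A direction whose normalized singular value sits far below about 1e-3 therefore stays well short of 1 under the 5-step configuration, which is the failure regime the abstract refers to.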

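2606.04349 2026-06-08 cs.CV cs.AI Updated version

利用丰富反馈的强化学习与分布式DAgger

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

Affiliations: University of Southern California

AI summary: Proposes DistIL, which uses a distributional variant of DAgger to exploit rich feedback (such as execution traces and tool outputs) through a forward cross-entropy objective, achieving monotonic policy improvement and better Pass@N.

Chinese abstract (AI translation)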

推理模型发展迅速，但主流的基于可验证奖励的强化学习（RLVR）方法仍然非常狭窄：采样多个响应，并用单个比特奖励每个响应，指示最终答案是否正确。然而，许多设置提供了丰富的反馈，包括执行轨迹、工具输出、专家修正和模型自我评估。我们研究如何通过经典模仿学习算法DAgger的分布式变体来使用这种反馈，其中学习器可以局部访问当前策略所访问状态上的专家分布。这产生了一个简单的前向交叉熵目标，该目标接受黑盒专家，并且其序列级梯度通过将未来的专家-学生分歧传播回早期决策来进行丰富的信用分配。我们表明，基于反向KL或Jensen-Shannon的先前具有自蒸馏目标的强化学习无法保证单调策略改进：即使专家具有更高的奖励，它们的更新也可能增加更差动作的概率。相比之下，我们证明前向交叉熵允许单调策略改进并享有遗憾保证。我们进一步表明，我们的目标优化了教师加权的成功可能性的下界，从而改进了Pass@N。实验上，我们的方法DistIL在科学推理、编程和解决困难数学问题等多个领域优于RLVR和基于自蒸馏的强化学习基线。

English abstract

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.

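2606.05654 2026-06-08 cs.SE cs.AI cs.LG Updated version

When Surface Form Changes Moderation Decisions: A Paired Study of Code-Mixed Workflow Instability

当表面形式改变审核决策：代码混合工作流不稳定性的配对研究

Suraj Babu Thimma Krishnaram, Yibo Hu, Karthikeyan Saravanan

Affiliations: GitHub

AI summary: Uses a paired evaluation setting to study how a hate moderation workflow changes between clean English and Tamil-English code-mixed inputs, finding that code-mixing yields a decision flip rate of 0.265 and increases review burden and false flags.

Chinese abstract (AI translation)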

仇恨审核通常被评估为对清洁英语输入的分类，但部署的系统必须将内容路由到诸如ALLOW、FLAG或REVIEW等操作。我们通过配对评估设置研究这种工作流在代码混合输入下的变化，其中相同的基础内容以清洁英语和泰米尔语-英语代码混合形式表达。在基于清洁英语开发数据调整的阈值下，代码混合输入产生显著的动作不稳定性，配对清洁到代码混合决策翻转率为0.265。主要工作流影响是增加的审核负担和增加的非仇恨内容误报：审核率从0.138上升到0.297，非仇恨误报率从0.069上升到0.104。仅泰米尔语输入整体表现出更强的退化，表明存在更广泛的语言覆盖限制，而非相同的代码混合不稳定性模式。一个简单的基于分歧的延迟规则减少了压力输入上的自动错误，但只能通过增加审核负载。这些结果表明，工作流级别的评估揭示了标准分类摘要可能遗漏的审核失败。

English abstract

Hate moderation is often evaluated as classification on clean English inputs, but deployed systems must route content to actions such as ALLOW, FLAG, or REVIEW. We study how this workflow changes under code-mixed inputs using a paired evaluation setting where the same underlying content is expressed as clean English and Tamil-English code-mix. Under thresholds tuned on clean English development data, code-mixed inputs produce substantial action instability, with a paired clean-to-code-mix decision flip rate of 0.265. The main workflow effects are increased review burden and increased false-flagging of non-hateful content: review rate rises from 0.138 to 0.297 and non-hate false-flag rate rises from 0.069 to 0.104. Tamil-only inputs show stronger degradation overall, suggesting a broader language-coverage limitation rather than the same code-mixed instability pattern. A simple disagreement-based deferral rule reduces automatic errors on stressed inputs, but only by increasing review load. These results show that workflow-level evaluation reveals moderation failures that standard classification summaries can miss.
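
The paired metrics in the abstract can be illustrated with a small sketch. It assumes, hypothetically, that the workflow maps a single hate score to ALLOW, REVIEW, or FLAG with two fixed thresholds; the abstract does not describe the routing rule, and the thresholds, scores, and function names below are all made up for illustration.

```python
def route(score, flag_at=0.8, review_at=0.5):
    """Map a hate score to a workflow action (hypothetical thresholds)."""
    if score >= flag_at:
        return "FLAG"
    if score >= review_at:
        return "REVIEW"
    return "ALLOW"


def flip_rate(clean_scores, mixed_scores):
    """Share of paired items whose action differs between the clean and code-mixed form."""
    pairs = list(zip(clean_scores, mixed_scores))
    return sum(route(c) != route(m) for c, m in pairs) / len(pairs)


def review_rate(scores):
    """Share of items routed to human review."""
    return sum(route(s) == "REVIEW" for s in scores) / len(scores)


# Made-up scores for four items, each written once in clean English and once code-mixed.
clean = [0.10, 0.90, 0.30, 0.85]
mixed = [0.15, 0.60, 0.55, 0.90]

print(flip_rate(clean, mixed))                   # 0.5
print(review_rate(clean), review_rate(mixed))    # 0.0 0.5
```

Because each item is compared with its own rewritten form, the flip rate isolates the effect of surface form from differences in content, which an aggregate accuracy number computed on separate test sets would not do.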

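2602.04234 2026-06-08 cs.MA cs.AI Updated version

When Does Multi-Agent Collaboration Help? An Entropy Perspective

多智能体协作何时有帮助？熵的视角

Yuxuan Zhao, Sijia Chen, Ningxin Su

Affiliations: Yantai Research Institute of Harbin Engineering University; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Examines the effectiveness of multi-agent collaboration from an entropy perspective. By analyzing entropy transitions across different topologies, six reasoning benchmarks, and two agentic tasks, it finds that a single agent performs better in approximately 43.3% of cases and that entropy dynamics are largely determined in the first round of interaction. It reports three key observations (Certainty Preference, Base Entropy, and Task Awareness) and introduces the Entropy Judger algorithm to improve the performance of multi-agent systems.

Comments Project page: https://multiagent-entropy.github.io/

Chinese abstract (AI translation)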

多智能体系统（MAS）已逐渐成为利用大型语言模型（LLMs）解决复杂任务的主流范式。然而，基于公开可用LLMs构建的MAS的有效性机制，尤其是其成功或失败的内在原因，仍鲜有研究。本文从熵的角度重新审视MAS，通过研究问题解决过程中不同拓扑结构、六个推理基准和两个智能体任务中的熵转换，考虑了智能体内部和交互动态。通过分析245个跨越token级、agent级和轮次级的熵特征，我们意外发现，在约43.3%的情况下，单个智能体优于MAS。此外，我们发现熵动态主要在首次交互轮次中决定。我们提供了三个关键观察：1）确定性偏好：峰值熵直接损害MAS的正确性，稳定熵直接促进MAS的正确性；2）基础熵：基础模型在问题解决过程中具有较低熵会因果驱动MAS性能；3）任务意识：MAS的熵动态在不同任务中扮演不同角色。基于这些见解，我们引入了一个简单而有效的算法，即熵判别器，用于从MAS的pass@k结果中选择解决方案，从而在所有MAS配置和任务中均实现了稳定的准确率提升。我们的源代码可在https://github.com/AgenticFinLab/multiagent-entropy获取。

English abstract

Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of entropy, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies, six reasoning benchmarks, and two agentic tasks. By analyzing 245 features spanning token-, agent-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that entropy dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: peak entropy directly harms and stable entropy directly benefits MAS correctness; 2) Base Entropy: base models with lower entropy during problem-solving causally drive MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS's pass@k results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at https://github.com/AgenticFinLab/multiagent-entropy.

2601.14637 2026-06-08 cs.CV cs.AI cs.CL cs.HC Updated version

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis

Forest-Chat: 为交互式森林变化分析适应视觉-语言代理

James Brock, Ce Zhang, Nantheera Anantrasirichai

Affiliations: School of Computer Science, University of Bristol; School of Geographical Sciences, University of Bristol

AI summary: Proposes Forest-Chat, an LLM-based agent for forest change analysis that supports natural-language queries across multiple tasks, improving the accuracy and interpretability of forest change detection and semantic interpretation.

Comments 28 pages, 9 figures, 12 tables, Submitted to Ecological Informatics

DOI: 10.1016/j.ecoinf.2026.103741

Chinese abstract (AI translation)

高分辨率卫星影像的普及与深度学习的进步为森林监测提供了新机遇。本文提出Forest-Chat，一种基于大语言模型的视觉-语言代理，支持多任务的交互式森林变化分析，包括变化检测、图像描述、对象计数、森林砍伐特征识别和变化推理。Forest-Chat基于多级变化解释（MCI）视觉-语言框架，结合零样本变化检测和多模态零样本变化描述与优化。引入Forest-Change数据集，包含双时相卫星影像、像素级变化掩码和语义变化描述。在Forest-Change数据集上，Forest-Chat在mIoU和BLEU-4指标上达到67.10%和40.17%，在LEVIR-MCI-Trees子集上达到88.13%和34.41%。零样本测试中，其在Forest-Change数据集上达到60.15%和34.00%，在LEVIR-MCI-Trees子集上达到47.32%和18.23%。进一步实验表明，描述优化能注入地理领域知识，但标签域迁移有限。这些发现表明，交互式、基于LLM的系统能支持可访问和可解释的森林变化分析。

English abstract

The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. This paper introduces Forest-Chat, an LLM-driven agent for forest change analysis, enabling natural language querying across multiple RSICI tasks, including change detection and captioning, object counting, deforestation characterisation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, incorporating zero-shot change detection via AnyChange and multimodal LLM-based zero-shot change captioning and refinement. To support adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and semantic change captions via human annotation and rule-based methods. Forest-Chat achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on Forest-Change, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI. In a zero-shot capacity, it achieves 60.15% and 34.00% on Forest-Change, and 47.32% and 18.23% on LEVIR-MCI-Trees. Further experiments demonstrate the value of caption refinement for injecting geographic domain knowledge into supervised captions, and the system's limited label domain transfer onto JL1-CD-Trees. These findings demonstrate that interactive, LLM-driven systems can support accessible and interpretable forest change analysis.

2505.18006 2026-06-08 cs.CY cs.AI cs.HC cs.IR Updated version

AI Literacy for Legal AI Systems: A practical approach

为法律AI系统设计的AI素养：一种实用方法

Gizem Gultekin-Varkonyi

Affiliations: University of Szeged, Faculty of Law and Political Sciences, International and Regional Studies Institute

AI summary: Examines AI literacy for legal AI systems, analyzes its key role in legal and ethical development, and proposes a practical risk assessment tool.

Comments Forthcoming in Iustum Aequum Salutare (2025) vol.21

DOI: 10.69695/ias.2025.4.01
Journal ref: Iustum Aequum Salutare, 2025, 21 (4)

Chinese abstract (AI translation)

法律AI系统正被全球司法和法律系统部署者和提供者越来越多地采用，以支持各种应用。尽管它们提供了减少偏见、提高效率和改善问责的潜在好处，但也带来了重大风险，需要在机会、法律和伦理发展和部署之间取得平衡。AI素养作为欧盟AI法案中的法律要求，以及部署者和提供者实现伦理AI的关键使能者，可以成为实现这一平衡的工具。本文引入了“法律AI系统”一词，然后分析了AI素养的概念及其与这些系统相关的利弊。这一分析与处理法律AI系统的组织的更广泛AI-L概念相关联。本文的成果是一份路线图问卷，作为实用工具，帮助开发者和提供者评估风险、益处和利益相关者的担忧，以满足社会和监管对法律AI的期望。

English abstract

Legal AI systems are increasingly being adopted by judicial and legal system deployers and providers worldwide to support a range of applications. While they offer potential benefits such as reducing bias, increasing efficiency, and improving accountability, they also pose significant risks, requiring a careful balance between opportunities, and legal and ethical development and deployment. AI literacy, as a legal requirement under the EU AI Act and a critical enabler of ethical AI for deployers and providers, could be a tool to achieve this. The article introduces the term "legal AI systems" and then analyzes the concept of AI literacy and the benefits and risks associated with these systems. This analysis is linked to a broader AI-L concept for organizations that deal with legal AI systems. The outcome of the article, a roadmap questionnaire as a practical tool for developers and providers to assess risks, benefits, and stakeholder concerns, could be useful in meeting societal and regulatory expectations for legal AI.

2502.21123 2026-06-08 cs.LG cs.AI Updated version

Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models

因果关系是理解和平衡可信机器学习与基础模型中多个目标的关键

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf, Mario Fritz

Affiliations: CISPA Helmholtz Center for Information Security; Max Planck Institute for Intelligent Systems, Tübingen; Google Research; ETH Zürich; University of Toronto

AI summary: Argues for integrating causal methods into machine learning to balance the trade-offs among trustworthiness principles such as fairness, privacy, robustness, accuracy, and explainability, and discusses their practical application in foundation models.

Chinese abstract (AI translation)

确保机器学习系统的可信度至关重要，因为它们日益嵌入高风险领域。本文主张将因果方法集成到机器学习中，以应对可信机器学习关键原则（包括公平性、隐私、鲁棒性、准确性和可解释性）之间的权衡。虽然这些目标理想情况下应同时满足，但它们通常被孤立地处理，导致冲突和次优解决方案。借鉴因果在ML中成功协调目标（如公平性与准确性，或隐私与鲁棒性）的现有应用，本文认为因果方法对于平衡可信ML和基础模型中的多个竞争目标至关重要。除了强调这些权衡，我们考察了如何将因果实际集成到ML和基础模型中，提供增强其可靠性和可解释性的解决方案。最后，我们讨论了采用因果框架的挑战、局限性和机遇，为更负责任和合乎伦理的AI系统铺平道路。

English abstract

Ensuring trustworthiness in machine learning (ML) systems is crucial as they become increasingly embedded in high-stakes domains. This paper advocates for integrating causal methods into machine learning to navigate the trade-offs among key principles of trustworthy ML, including fairness, privacy, robustness, accuracy, and explainability. While these objectives should ideally be satisfied simultaneously, they are often addressed in isolation, leading to conflicts and suboptimal solutions. Drawing on existing applications of causality in ML that successfully align goals such as fairness and accuracy or privacy and robustness, this paper argues that a causal approach is essential for balancing multiple competing objectives in both trustworthy ML and foundation models. Beyond highlighting these trade-offs, we examine how causality can be practically integrated into ML and foundation models, offering solutions to enhance their reliability and interpretability. Finally, we discuss the challenges, limitations, and opportunities in adopting causal frameworks, paving the way for more accountable and ethically sound AI systems.

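2511.09568 2026-06-08 physics.chem-ph cs.AI cs.CV Updated version

VEDA: 3D Molecular Generation via Variance-Exploding Diffusion with Annealing

VEDA：通过带退火的方差爆炸扩散实现3D分子生成

Peining Zhang, Jinbo Bi, Minghu Song

Affiliations: University of Science and Technology of China

AI summary: VEDA combines variance-exploding diffusion with annealing and an SE(3)-equivariant architecture to efficiently generate accurate 3D molecular structures, achieving both high chemical accuracy and computational efficiency.

DOI: 10.1609/aaai.v40i33.40063

Chinese abstract (AI translation)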

扩散模型在3D分子生成中展现出潜力，但面临采样效率与构象准确性之间的根本权衡。尽管基于流的模型速度快，但常产生几何不准确的结构，因难以捕捉分子构象的多模态分布。相比之下，去噪扩散模型更准确但采样慢，限制在于扩散动力学与SE(3)-等变架构之间的整合不足。为此，我们提出了VEDA，一个统一的SE(3)-等变框架，结合方差爆炸扩散与退火以高效生成构象准确的3D分子结构。关键贡献包括：(1) 一种VE调度使噪声注入类似于模拟退火，提高3D准确性并降低松弛能量；(2) 一种新型预处理方案协调SE(3)-等变网络的坐标预测性质与基于残差的扩散目标；(3) 一种新的arcsin调度器将采样集中在对数信噪比的关键区间。在QM9和GEOM-DRUGS数据集上，VEDA的采样效率与基于流的模型相当，仅用100次采样步骤就实现了最先进的价态稳定性与有效性。更重要的是，VEDA生成的结构在GFN2-xTB优化过程中表现出显著的稳定性，其松弛能量变化中位数仅为1.72 kcal/mol，显著低于其基线架构SemlaFlow的32.3 kcal/mol。我们的框架证明了有原则地整合VE扩散与SE(3)-等变架构可以同时实现高化学精度和计算效率。

English abstract

Diffusion models show promise for 3D molecular generation, but face a fundamental trade-off between sampling efficiency and conformational accuracy. While flow-based models are fast, they often produce geometrically inaccurate structures, as they have difficulty capturing the multimodal distributions of molecular conformations. In contrast, denoising diffusion models are more accurate but suffer from slow sampling, a limitation attributed to sub-optimal integration between diffusion dynamics and SE(3)-equivariant architectures. To address this, we propose VEDA, a unified SE(3)-equivariant framework that combines variance-exploding diffusion with annealing to efficiently generate conformationally accurate 3D molecular structures. Specifically, our key technical contributions include: (1) a VE schedule that enables noise injection functionally analogous to simulated annealing, improving 3D accuracy and reducing relaxation energy; (2) a novel preconditioning scheme that reconciles the coordinate-predicting nature of SE(3)-equivariant networks with a residual-based diffusion objective; and (3) a new arcsin-based scheduler that concentrates sampling in critical intervals of the logarithmic signal-to-noise ratio. On the QM9 and GEOM-DRUGS datasets, VEDA matches the sampling efficiency of flow-based models, achieving state-of-the-art valency stability and validity with only 100 sampling steps. More importantly, VEDA's generated structures are remarkably stable, as measured by their relaxation energy during GFN2-xTB optimization. The median energy change is only 1.72 kcal/mol, significantly lower than the 32.3 kcal/mol from its architectural baseline, SemlaFlow. Our framework demonstrates that principled integration of VE diffusion with SE(3)-equivariant architectures can achieve both high chemical accuracy and computational efficiency.
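
As a rough illustration of the variance-exploding (VE) noising that this abstract refers to, the Python sketch below perturbs 3D coordinates with Gaussian noise whose scale grows geometrically over time. It is not VEDA's implementation: the geometric schedule, the `sigma_min` and `sigma_max` values, and all function names are assumptions chosen for illustration, and the paper's arcsin-based scheduler and preconditioning scheme are not reproduced.

```python
import math
import random


def ve_sigma(t, sigma_min=0.01, sigma_max=10.0):
    """Assumed geometric VE noise level for a time t in [0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** t


def ve_perturb(coords, t, rng):
    """Add isotropic Gaussian noise of scale sigma(t) to every 3D coordinate.

    In a VE process the clean signal is not rescaled; only the noise grows.
    """
    sigma = ve_sigma(t)
    return [[x + sigma * rng.gauss(0.0, 1.0) for x in atom] for atom in coords]


def log_snr(t):
    """Log signal-to-noise ratio, taking the signal scale as 1: -2 * log(sigma)."""
    return -2.0 * math.log(ve_sigma(t))


if __name__ == "__main__":
    rng = random.Random(0)
    molecule = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0]]  # two hypothetical atoms
    for t in (0.0, 0.5, 1.0):
        print(t, ve_sigma(t), log_snr(t), ve_perturb(molecule, t, rng))
```

Under this assumed schedule the log-SNR falls linearly in `t`, which is the quantity a sampling-step scheduler would choose where to concentrate on.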

2511.03898 2026-06-08 cs.CR cs.AI cs.CE cs.SE updated version

Secure Code Generation at Scale with Reflexion

基于Reflexion的大规模安全代码生成

Arup Datta, Ahmed Aljohani, Hyunsook Do

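Affiliations * University of California, Berkeley

AI summary: The study evaluates Instruct Prime and reflexion prompting for improving code security, finding that reflexion prompting significantly improves security, with the largest gains in the first round.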

Comments Accepted for publication at the 2nd IEEE International Conference on AI-powered Software (AIware 2025)

DOI: 10.1109/AIware69974.2025.00038

AI Chinese abstract

大型语言模型（LLMs）现已广泛用于起草和重构代码，但能运行的代码不一定安全。我们使用Instruct Prime（它消除了要求合规的提示和线索污染）评估安全代码生成，并在零样本基线和三轮Reflexion提示方法下评估五个指令微调的代码LLM。安全性通过不安全代码检测器（ICD）衡量，结果以Repair、Regression和NetGain指标报告，并考虑编程语言和CWE家族。我们的发现表明，不安全代码在第一轮仍然普遍存在：在零样本基线（t0）下，约25-33%的程序不安全。弱加密和依赖配置的缺陷最难避免，而XSS、代码注入和硬编码密钥等模板化缺陷则能更可靠地得到处理。Python的安全率最高；C和C#最低，Java、JS、PHP和C++居中。Reflexion提示提升了所有模型的安全性，将平均准确率从t0的70.74%提高到t3的79.43%，其中第一轮提升最大，随后收益递减。Repair、Regression和NetGain指标的趋势表明，应用一到两轮即可获得大部分收益。复现包可在https://doi.org/10.5281/zenodo.17065846获取。

English abstract

Large language models (LLMs) are now widely used to draft and refactor code, but code that works is not necessarily secure. We evaluate secure code generation using Instruct Prime, which eliminates compliance-required prompts and cue contamination, and assess five instruction-tuned code LLMs under a zero-shot baseline and a three-round reflexion prompting approach. Security is measured using the Insecure Code Detector (ICD), and results are reported with Repair, Regression, and NetGain metrics, broken down by programming language and CWE family. Our findings show that insecurity remains common at the first round: roughly 25-33% of programs are insecure at the zero-shot baseline (t0). Weak-cryptography and config-dependent bugs are the hardest to avoid, while templated ones like XSS, code injection, and hard-coded secrets are handled more reliably. Python yields the highest secure rates; C and C# are the lowest, with Java, JS, PHP, and C++ in the middle. Reflexion prompting improves security for all models, raising average accuracy from 70.74% at t0 to 79.43% at t3, with the largest gains in the first round followed by diminishing returns. The trends in the Repair, Regression, and NetGain metrics show that applying one to two rounds produces most of the benefits. A replication package is available at https://doi.org/10.5281/zenodo.17065846.
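
The abstract names Repair, Regression, and NetGain metrics without defining them. The Python sketch below shows one plausible reading, under loudly stated assumptions: Repair is taken as the share of programs that go from insecure to secure between two rounds, Regression as the share that go from secure to insecure, and NetGain as their difference. The paper may normalize these differently, and the function name `round_metrics` is hypothetical.

```python
def round_metrics(before, after):
    """Compare per-program security verdicts (True = secure) across two rounds.

    ASSUMED definitions, not taken from the paper:
      repair     = share of programs that were insecure before and secure after
      regression = share of programs that were secure before and insecure after
      net_gain   = repair - regression
    """
    if len(before) != len(after) or not before:
        raise ValueError("verdict lists must be non-empty and of equal length")
    n = len(before)
    repaired = sum(1 for b, a in zip(before, after) if not b and a)
    regressed = sum(1 for b, a in zip(before, after) if b and not a)
    return {
        "repair": repaired / n,
        "regression": regressed / n,
        "net_gain": (repaired - regressed) / n,
    }


if __name__ == "__main__":
    # Hypothetical detector verdicts for four programs at t0 and t1.
    t0 = [True, False, False, True]
    t1 = [True, True, False, False]
    print(round_metrics(t0, t1))
```

Tracking repairs and regressions separately, rather than only the overall secure rate, makes it visible when a reflexion round fixes some programs while breaking others.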

2506.14634 2026-06-08 cs.CL cs.AI cs.CY updated version

AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

AIn't Nothing But a Survey？使用大语言模型对关于调查参与动机的德语开放式调查回答进行编码

Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessica Daikeler

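Affiliations * Social Data Science & AI Lab, LMU Munich; Munich Center for Machine Learning; University of Maryland, College Park; GESIS – Leibniz Institute for the Social Sciences

AI summary: This paper examines how well large language models can code open-ended survey responses. Using German data on reasons for survey participation, it compares several LLMs and prompting approaches and finds that only a fine-tuned LLM achieves satisfactory predictive performance, and that differences in classification performance across categories affect the resulting category distributions.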

Comments to appear in Survey Research Methods

DOI: 10.18148/srm/2025.v19i4.8568
Journal ref: Survey Research Methods (2025)

AI Chinese abstract

近年来，大语言模型（LLM）的发展和广泛可及性引发了关于其在调查研究中应用的讨论，包括对开放式调查回答的分类。由于其语言能力，LLM可能成为耗时的手动编码和监督学习模型预训练的高效替代方案。由于现有研究大多集中在英语回答的非复杂主题或单一LLM上，尚不清楚其发现是否具有普遍性以及这些分类的质量如何与传统方法相比。本研究探讨了不同LLM在其他情境下对开放式调查回答进行编码的程度，以德国调查参与原因的数据为例。我们比较了几种最先进的LLM和提示方法，并通过人类专家编码评估LLM的性能。总体而言，LLM之间的性能差异很大，只有微调的LLM才能达到满意的预测性能。提示方法之间的性能差异取决于所用的LLM。最后，LLM在不同调查参与原因类别上的不均等分类性能导致了不同的类别分布，当不使用微调时。我们讨论了这些发现的含义，不仅对开放式回答编码的方法学研究，还对其实质分析，以及处理或实质性分析此类数据的实践者。最后，我们强调了研究人员在选择LLM时代开放式回答分类自动化方法时需要考虑的许多权衡。通过这样做，我们的研究为关于LLM在调查研究中高效、准确和可靠应用条件的日益增长的研究做出了贡献。

English abstract

The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.
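
The evaluation described here, comparing LLM codings against human expert codings per category and checking how the resulting category distributions differ, can be sketched in Python as follows. This is an illustrative sketch only: the category labels and function names are hypothetical, and per-category F1 is an assumed choice of metric, since the abstract does not say which performance measures the study reports.

```python
from collections import Counter


def per_category_f1(human, model):
    """F1 score of the model's codes for each category, with human codes as truth."""
    scores = {}
    for label in sorted(set(human) | set(model)):
        tp = sum(1 for h, m in zip(human, model) if h == label and m == label)
        fp = sum(1 for h, m in zip(human, model) if h != label and m == label)
        fn = sum(1 for h, m in zip(human, model) if h == label and m != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def category_shares(codes):
    """Share of responses assigned to each category."""
    total = len(codes)
    return {label: count / total for label, count in Counter(codes).items()}


if __name__ == "__main__":
    # Hypothetical codings of four open-ended responses.
    human = ["interest", "incentive", "interest", "duty"]
    model = ["interest", "interest", "interest", "duty"]
    print(per_category_f1(human, model))
    print(category_shares(human))
    print(category_shares(model))
```

In this toy case the model never assigns the "incentive" category, so its category shares over-represent "interest" relative to the human codings, which is the kind of distributional shift the abstract warns about.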

2502.08903 2026-06-08 cs.RO cs.AI updated version

3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning

面向机器人任务规划的三维定位视觉-语言框架：自动化提示合成与监督推理

Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao

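Affiliations * Tsinghua University

AI summary: This paper proposes a framework combining a 2D prompt synthesis module with a small language model to improve robots' 3D scene understanding and task execution; experiments show a task success rate of 96.0%.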

DOI: 10.1016/j.engappai.2025.113268
Journal ref: Engineering Applications of Artificial Intelligence, vol. 164, p. 113268, 2026

AI Chinese abstract

视觉-语言模型（VLMs）在场景理解和感知任务中取得了显著成功，使机器人能够在动态环境中自适应地规划和执行动作。然而，大多数多模态大语言模型缺乏稳健的3D场景定位能力，限制了其在精细机器人操作中的有效性。此外，低识别精度、低效、差的迁移性和可靠性等挑战阻碍了其在精密任务中的应用。为解决这些限制，我们提出了一种新的框架，该框架整合了一个2D提示合成模块，通过将2D图像映射到点云，以及一个小型语言模型（SLM）来监督VLM的输出。2D提示合成模块使训练于2D图像和文本的VLM能够自主提取精确的3D空间信息，无需人工干预，显著增强了3D场景理解。同时，SLM监督VLM的输出，缓解幻觉并确保可靠的可执行机器人控制代码生成。我们的框架消除了在新环境中重新训练的需要，从而提高了成本效率和操作鲁棒性。实验结果表明，所提出的框架实现了96.0%的任务成功率（TSR），优于其他方法。消融研究证明了2D提示合成模块和输出监督模块的关键作用（当移除时，TSR下降67%）。这些发现验证了框架在提升3D识别、任务规划和机器人任务执行方面的有效性。

English abstract

Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. Additionally, challenges such as low recognition accuracy, inefficiency, and poor transferability and reliability hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module, which maps 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs. The 2D prompt synthesis module enables VLMs, trained on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D scene understanding. Meanwhile, the SLM supervises VLM outputs, mitigating hallucinations and ensuring reliable, executable robotic control code generation. Our framework eliminates the need for retraining in new environments, thereby improving cost efficiency and operational robustness. Experimental results show that the proposed framework achieved a 96.0% Task Success Rate (TSR), outperforming other methods. Ablation studies demonstrated the critical role of both the 2D prompt synthesis module and the output supervision module (which, when removed, caused a 67% TSR drop). These findings validate the framework's effectiveness in improving 3D recognition, task planning, and robotic task execution.

2501.11592 2026-06-08 cs.LG cs.AI cs.CL updated version

Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing

无需训练的超小模型用于压缩感知中的通用稀疏重建

Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Xiang Bai

Affiliations * School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China; China Belt and Road Joint Lab on Measurement and Control Technology, Wuhan, China; School of Electric and Electrical Engineering, Chongqing University of Technology, Chongqing, China; Optics Valley Laboratory, Wuhan, China; School of Water Conservancy and Transportation, Zhengzhou University, Zhengzhou, China; School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China

AI summary: This paper proposes CL, a training-free ultra-small neural model for fast sparse reconstruction that inherits the generality and interpretability of traditional iterative methods while improving efficiency and accuracy.

DOI: 10.1109/TPAMI.2026.3680162

AI Chinese abstract

预训练大模型近年来受到广泛关注，但在需要高可解释性或资源有限的应用中面临挑战，如物理传感、医学成像和生物信息学。压缩感知（CS）是已证明的理论，推动了这些应用的许多突破。然而，作为典型的欠定线性系统，CS在使用传统迭代方法时，对大规模数据的稀疏重建时间过长。当前的AI方法如深度展开失败于替代它们，因为预训练模型在超出训练条件和数据分布时泛化性差或缺乏可解释性。本文提出名为系数学习（CL）的超小人工神经模型，实现无需训练的快速稀疏重建，同时完美继承传统迭代方法的泛化性和可解释性，带来融合先验知识的新特性。在CL中，长度为n的信号仅需最少n个可训练参数。一个案例研究模型称为CLOMP用于评估。实验在合成和真实的一维和二维信号上进行，显示了显著的效率和精度提升。与代表性的迭代方法相比，CLOMP在大规模数据上提高了100到1000倍的效率。在八个不同的图像数据集上的测试结果表明，CLOMP在采样率为0.1、0.3、0.5时分别提高了结构相似性指数292%、98%、45%。我们相信这种方法可以真正将CS重建带入AI时代，造福无数依赖稀疏解的欠定线性系统。

English abstract

Pre-trained large models have attracted widespread attention in recent years, but they face challenges in applications that require high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed Sensing (CS) is a well-established theory that drives many recent breakthroughs in these applications. However, as a typical under-determined linear system, CS suffers from excessively long sparse reconstruction times when using traditional iterative methods, particularly with large-scale data. Current AI methods like deep unfolding fail to replace them because pre-trained models generalize poorly beyond their training conditions and dataset distributions, or lack interpretability. Instead of following the big-model fervor, this paper proposes ultra-small artificial neural models called coefficients learning (CL), which enable training-free and rapid sparse reconstruction while fully inheriting the generality and interpretability of traditional iterative methods, and which add the new capability of incorporating prior knowledge. In CL, a signal of length $n$ needs a minimum of only $n$ trainable parameters. A case-study model called CLOMP is implemented for evaluation. Experiments are conducted on both synthetic and real one-dimensional and two-dimensional signals, demonstrating significant improvements in efficiency and accuracy. Compared to representative iterative methods, CLOMP improves efficiency by 100- to 1000-fold for large-scale data. Test results on eight diverse image datasets indicate that CLOMP improves the structural similarity index by 292%, 98%, and 45% for sampling rates of 0.1, 0.3, and 0.5, respectively. We believe this method can truly usher CS reconstruction into the AI era, benefiting countless under-determined linear systems that rely on sparse solutions.
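
For readers unfamiliar with the "traditional iterative methods" this abstract compares against, the Python sketch below implements classic Orthogonal Matching Pursuit (OMP) for sparse reconstruction. It is not the paper's CL or CLOMP model, whose details the abstract does not give. It is a plain-Python baseline, and the helper names, the column-list matrix layout, and the assumption of roughly unit-norm columns are choices made here for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_least_squares(cols, y):
    """Least-squares coefficients for y over the given columns (normal equations)."""
    k = len(cols)
    # Augmented system [A^T A | A^T y], solved by Gaussian elimination.
    aug = [[dot(cols[i], cols[j]) for j in range(k)] + [dot(cols[i], y)]
           for i in range(k)]
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(aug[r][p]))  # partial pivoting
        aug[p], aug[best] = aug[best], aug[p]
        for r in range(p + 1, k):
            factor = aug[r][p] / aug[p][p]
            for c in range(p, k + 1):
                aug[r][c] -= factor * aug[p][c]
    coef = [0.0] * k
    for p in reversed(range(k)):
        tail = sum(aug[p][c] * coef[c] for c in range(p + 1, k))
        coef[p] = (aug[p][k] - tail) / aug[p][p]
    return coef


def omp(columns, y, sparsity):
    """Recover a sparse x with y ~ A x; `columns` lists the columns of A."""
    residual, support, coef = list(y), [], []
    for _ in range(sparsity):
        # Greedy step: pick the unused column most correlated with the residual.
        scores = [abs(dot(col, residual)) for col in columns]
        for j in support:
            scores[j] = -1.0
        support.append(max(range(len(columns)), key=scores.__getitem__))
        # Refit all selected coefficients, then update the residual.
        coef = solve_least_squares([columns[j] for j in support], y)
        residual = [yi - sum(c * columns[j][r] for c, j in zip(coef, support))
                    for r, yi in enumerate(y)]
    x = [0.0] * len(columns)
    for c, j in zip(coef, support):
        x[j] = c
    return x


if __name__ == "__main__":
    A_cols = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.6, 0.8, 0.0]]
    print(omp(A_cols, [3.0, 0.0, -2.0], sparsity=2))
```

Each OMP iteration refits a least-squares problem on the growing support, which is one reason such iterative methods become slow on large-scale data.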

2302.00198 2026-06-08 cs.NE cs.AI cs.NA math.NA updated version

A fuzzy adaptive metaheuristic algorithm for identifying sustainable, economical, lightweight, and earthquake-resistant reinforced concrete cantilever retaining walls

一种模糊自适应元启发式算法用于识别可持续、经济、轻质且抗震的钢筋混凝土悬臂挡土墙

Farshid Keivanian, Raymond Chiong, Ali R. Kashani, Amir H. Gandomi

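Affiliations * School of Information and Physical Sciences, The University of Newcastle; Department of Civil Engineering, University of Memphis; Faculty of Engineering and Information Technology, University of Technology Sydney

AI summary: This paper proposes a fuzzy adaptive metaheuristic algorithm for optimizing the design of earthquake-resistant reinforced concrete cantilever retaining walls, taking structural strength, geotechnical stability, and geometric variables into account to obtain lightweight, economical, and environmentally friendly seismic designs.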

Comments There are six figures, 51 pages, and 12 tables in the revised manuscript that has recently been resubmitted to the Journal of Computational Science

DOI: 10.1016/j.jocs.2023.101978
Journal ref: Journal of Computational Science, Volume 70, Article 101978, 2023

AI Chinese abstract

在地震多发区，钢筋混凝土悬臂挡土墙的抗震性能至关重要。本研究利用水平和垂直伪静态系数来评估其抗震性能。为解决由此产生的土压力导致的钢筋混凝土悬臂（RCC）重量和力问题，26个结构强度和地质稳定性约束以及12个几何变量与每个设计相关联。这些约束和设计变量形成一个十二维解空间的约束优化问题。为了有效搜索并产生可持续、经济、轻质且能抵御地震危害的RCC设计，本文提出了一种新颖的自适应模糊基于元启发式算法。该方法将搜索空间划分为子区域，并基于其新颖的搜索组件建立探索、信息共享和开发搜索能力。此外，模糊推理系统被用于解决参数化和计算成本评估问题。研究发现，与几种经典和表现最佳的设计优化器相比，所提出的算法能够在九种地震条件下实现低成本、低重量和低二氧化碳排放的RCC设计。

English abstract

In earthquake-prone zones, the seismic performance of reinforced concrete cantilever (RCC) retaining walls is of major importance. In this study, the seismic performance was investigated using horizontal and vertical pseudo-static coefficients. To tackle RCC weights and forces resulting from these earth pressures, 26 constraints for structural strength and geotechnical stability, along with 12 geometric variables, are associated with each design. These constraints and design variables form a constrained optimization problem with a twelve-dimensional solution space. To search effectively and produce sustainable, economical, lightweight RCC designs that are robust against earthquake hazards, a novel adaptive fuzzy-based metaheuristic algorithm is applied. The proposed method divides the search space into sub-regions and establishes exploration, information sharing, and exploitation search capabilities based on its novel search components. Further, fuzzy inference systems were employed to address parameterization and computational cost evaluation issues. It was found that the proposed algorithm can achieve low-cost, low-weight, and low-CO2-emission RCC designs under nine seismic conditions in comparison with several classical and best-performing design optimizers.
