arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.07990 2026-05-27 cs.CL cs.AI cs.LG cs.SE

Tool Calling is Linearly Readable and Steerable in Language Models

Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

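Affiliations: University College London; Holistic AI; Imperial College London

AI summary: The paper finds that language models contain linear directions corresponding to tool selection. Intervening along such a direction switches the tool call, and the same directions can flag likely errors in advance. The findings are validated across multiple models and benchmarks.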

Comments 24 pages. ACL ARR May 2026 submission (EMNLP 2026 preferred venue); v2 reflects revised manuscript

Abstract

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As agents take on consequential actions, one bad tool call can do real damage. We currently have no way to look inside the model and catch the mistake before it happens; this paper shows that we can. Inside the model, the choice of tool is carried by a single direction in activation space, one direction per pair of tools. Adding that direction during generation switches which tool the model picks. Across 12 instruction-tuned and 6 base models spanning Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), this works at 83-100% accuracy on 4B+ instruction-tuned models on a 15-tool synthetic benchmark and at 77-94% on the real-API benchmark $τ$-bench airline. The JSON arguments that follow automatically adapt to the new tool's schema, so flipping the name is enough. The same per-tool directions also flag likely errors before they happen: queries where the model is unsure between two tools fail 21x more often than queries where it is not (Gemma 3 27B). This is not just topic injection: random vectors at the same magnitude give a 0% switch rate, and a probe within a single domain (14 airline tools that share one topic) still reads which tool the model will call at top-1 61-89% across five 4B-14B models. Even base models already carry the right tool internally before they can emit it: reading the chosen tool off the model's internal state (cosine readout) recovers 61-82% accuracy on BFCL while base generation lands at 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. Our results cover single-turn, fixed-menu settings; on multi-turn agent loops the same intervention is less stable (matched-baseline gain or loss of up to 30 percentage points with no consistent direction).
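The read-and-steer idea in this abstract can be illustrated with a toy sketch. It assumes, without confirmation from the abstract, that the per-pair direction is estimated as a difference of class-mean activations and that the readout compares an activation with per-tool means by cosine similarity. The tool names, the 3-dimensional activations, and the steering scale `1.5` are all hypothetical.

```python
import math

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pair_direction(acts_a, acts_b):
    """Assumed estimator: difference of class means, pointing from tool A to tool B."""
    return [b - a for a, b in zip(mean(acts_a), mean(acts_b))]

def steer(activation, direction, alpha):
    """Add the scaled direction to an activation vector."""
    return [x + alpha * d for x, d in zip(activation, direction)]

def cosine_readout(activation, tool_means):
    """Return the tool whose mean activation is most cosine-similar."""
    return max(tool_means, key=lambda name: cosine(activation, tool_means[name]))

# Hypothetical toy activations for two tools.
acts_email = [[1.0, 0.1, 0.0], [0.9, -0.1, 0.1]]
acts_calendar = [[0.0, 1.0, 0.1], [0.1, 0.9, -0.1]]
tool_means = {"send_email": mean(acts_email), "create_event": mean(acts_calendar)}

direction = pair_direction(acts_email, acts_calendar)
query = [1.0, 0.0, 0.0]                 # reads out as send_email
steered = steer(query, direction, 1.5)  # pushed toward create_event

print(cosine_readout(query, tool_means), cosine_readout(steered, tool_means))
```

In a real model the vectors would be residual-stream activations at a chosen layer and token position, which the abstract does not specify.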

2605.07632 2026-05-27 cs.CL cs.AI cs.LG

Post-training makes large language models less human-like

Marcel Binz, Elif Akata, Abdullah Almaatouq, Mohammed Alsobay, Oleksii Ariasov, Franziska Brändle, David Broska, Jason W. Burton, Nuno Busch, Frederick Callaway, Vanessa Cheung, Brian Christian, Julian Coda-Forno, Can Demircan, Vittoria Dentella, Maria K. Eckstein, Noémi Éltető, Michael Franke, Thomas L. Griffiths, Fritz Günther, Susanne Haridi, Sebastian Hellmann, Stefan Herytash, Linus Hof, Eleanor Holton, Isabelle Hoxha, Zak Hussain, Akshay Jagadish, Elif Kara, Valentin Kriegmair, Evelina Leivada, Li Ji-An, Tobias Ludwig, Maximilian Maier, Marcelo G. Mattar, Marvin Mathony, Alireza Modirshanechi, Robin Na, Mariia Nadverniuk, Antonios Nasioulas, Surabhi S. Nath, Helen Niemeyer, Kate Nussenbaum, Sebastian Olschewski, Thorsten Pachur, Stefano Palminteri, Aliona Petrenco, Camille V. Phaneuf-Hadd, Angelo Pirrone, Manuel Rausch, Laura Raveling, Shashank Reddy, Milena Rmus, Evan M. Russek, Tankred Saanum, Kai Sandbrink, Louis Schiekiera, Johannes A. Schubert, Luca M. Schulze Buschoff, Nishad Singhi, Leah H. Somerville, Mikhail S. Spektor, Xin Sui, Christopher Summerfield, Mirko Thalmann, Anna I. Thoma, Taisiia Tikhomirova, Vuong Truong, Polina Tsvilodub, Konstantinos Voudouris, Kristin Witte, Shuchen Wu, Dirk U. Wulff, Hua-Dong Xiong, Songlin Xu, Lance Ying, Xinyu Zhang, Jian-Qiao Zhu, Eric Schulz

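Affiliations: Helmholtz Munich; Massachusetts Institute of Technology; University of Tübingen; University of Oxford; Stanford

AI summary: Using the newly introduced Psych-201 dataset, the paper finds that post-training (the stage that turns base models into useful assistants) consistently reduces alignment with human behavior, that this misalignment widens in newer model generations, and that persona induction does not improve individual-level predictions.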

Abstract

Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.

2605.07521 2026-05-27 cs.AI

From Feasible to Practical: Pareto-Optimal Synthesis Planning

Friedrich Hastedt, Dongda Zhang, Antonio del Rio Chanona

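Affiliations: Department of Chemical Engineering, Imperial College London, UK; Department of Chemical Engineering, University of Manchester, UK

AI summary: To address the neglect of multi-objective trade-offs in existing synthesis planning methods, the paper proposes the MORetro* algorithm, which uses multi-objective A* search to generate a Pareto front capturing trade-offs among metrics such as cost, sustainability, and toxicity.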

Comments Published in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Current computer-aided synthesis planning (CASP) methods often treat retrosynthesis as solved once a single feasible route is identified, focusing primarily on convergence or shortest-path metrics. This view is misaligned with real-world practice, where chemists must balance competing objectives such as cost, sustainability, toxicity, and overall yield. To address this, we formulate synthesis planning as a multi-objective search problem and introduce MORetro*, an algorithm that generates a Pareto front of synthesis routes to explicitly capture trade-offs among user-defined criteria. MORetro* uses weighted scalarization and BO-informed sampling to efficiently navigate the combinatorial search space and prioritize promising trade-offs. Building on multi-objective A*-search, we provide optimality guarantees showing that, for a fixed single-step model, MORetro* recovers the true Pareto front under admissibility. Across multiple retrosynthesis benchmarks, MORetro* produces diverse, high-quality Pareto fronts, uncovering solutions overlooked by single-objective approaches and better aligning CASP outputs with industrial decision-making.
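To make the notion of a Pareto front of routes concrete, here is a minimal sketch that filters a set of candidate routes down to the non-dominated ones and then picks one route under a weighted scalarization. It is not MORetro* itself: the route names, the three objectives (all minimized), and the cost values are hypothetical, and the search over the retrosynthesis tree is left out entirely.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(routes):
    """Keep only routes that no other route dominates (all objectives minimized)."""
    return {
        name: costs
        for name, costs in routes.items()
        if not any(
            dominates(other, costs)
            for other_name, other in routes.items()
            if other_name != name
        )
    }

def scalarize(costs, weights):
    """Weighted sum of objectives: one way to pick a single point on the front."""
    return sum(w * c for w, c in zip(weights, costs))

# Hypothetical routes with (cost, toxicity score, waste score).
routes = {
    "route_a": (10.0, 3.0, 0.2),
    "route_b": (12.0, 1.0, 0.2),
    "route_c": (13.0, 3.5, 0.4),  # dominated by route_a
    "route_d": (9.0, 4.0, 0.1),
}

front = pareto_front(routes)
cheapest = min(front, key=lambda name: scalarize(front[name], (1.0, 0.0, 0.0)))
print(sorted(front), cheapest)
```

Changing the weight vector moves the selected route along the front, which is the kind of trade-off a chemist would inspect.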

2603.12647 2026-05-27 cs.CV cs.AI

LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction

ZY Chen, F Zhu, H Zhu, DY Kong, XK Kuang, YJ Zhang, CM Jiang

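Affiliations: Waymo Open Dataset

AI summary: Proposes a Salient Gaussian representation that combines LiDAR reflectance with RGB, using structure-aware initialization, reflectance calibration, and joint alignment to achieve efficient and robust self-driving scene reconstruction.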

Comments 8 pages, 7 figures

Abstract

Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.

2605.07053 2026-05-27 cs.CL cs.AI

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

Jyotika Singh, Fang Tu, Aziza Mirsaidova, Amit Agarwal, Hitesh Laxmichand Patel, Sandip Ghoshal, Miguel Ballesteros, Karan Dua, Yassine Benajiba, Weiyi Sun, Tao Sheng, Graham Horwood, Sujith Ravi, Dan Roth

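Affiliations: Oracle AI

AI summary: Proposes the GSM-SEM framework, which generates semantically diverse variants of math problems by modifying entities, attributes, and relationships, reducing memorization bias from fixed test sets. Performance drops are observed across multiple benchmarks.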

Abstract

Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.

2604.08059 2026-05-27 cs.RO cs.AI

Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

Xue Qin, Simin Luan, John See, Zeyd Boukhers, Cong Yang, Zhijun Li

Affiliations: School of Software, Harbin Institute of Technology; School of Computer Science and Technology, Harbin Institute of Technology; School of Mathematical and Computer Sciences, Heriot-Watt University, Malaysia Campus; School of Future Science and Engineering, Soochow University; Fraunhofer Institute for Applied Information Technology

AI summary: For AI-component-based systems, proposes a governed capability evolution framework that achieves safe deployment through four compatibility checks and a seven-stage upgrade pipeline, with zero unsafe activations in embodied-agent experiments.

Comments 42 pages, 7 figures, 12 tables

Abstract

Software systems built from versioned AI components increasingly need lifecycle-time governance: when a capability module evolves into a new version, the hosting system must decide whether the new version may be activated safely, under what deployment conditions it should run, how it must be monitored, and when it should be rolled back. Existing software-deployment patterns (canary release, blue-green, feature flags, and MLOps pipelines) address parts of this loop but were designed for stateless web services rather than for stateful, policy-constrained runtimes that drive AI components in the field. We formulate governed capability evolution as a first-class software-lifecycle problem for AI-component-based systems and propose a staged upgrade framework in which every new capability version is treated as a governed deployment candidate rather than an immediately executable replacement. The framework introduces four upgrade compatibility checks (interface, policy, behavioral, recovery) and organizes them into a seven-stage pipeline (candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, rollback, audit). We implement a reference prototype on a PyBullet manipulation testbed with ROS 2 middleware and evaluate it over 6 rounds of capability upgrade with 15 random seeds. Naive upgrade achieves 72.9% task success but drives unsafe activation to 60% by the final round; governed upgrade retains comparable success (67.4%) while maintaining zero unsafe activations across all rounds (Wilcoxon p=0.003). Shadow deployment reveals 40% of upgrade regressions invisible to sandbox evaluation alone, and rollback succeeds in 79.8% of post-activation drift scenarios.
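The gating logic described here can be sketched as a small function that treats a new version as a candidate and activates it only if every check passes. This is a simplified illustration under stated assumptions: the four check names come from the abstract, but the `Candidate` class, the boolean check results, the single `shadow_ok` flag, and the collapse of the seven stages into one function are all hypothetical simplifications.

```python
from dataclasses import dataclass, field

# Check names as listed in the abstract.
CHECKS = ("interface", "policy", "behavioral", "recovery")

@dataclass
class Candidate:
    version: str
    check_results: dict          # hypothetical: check name -> bool
    shadow_ok: bool = True       # hypothetical: outcome of shadow deployment
    audit_log: list = field(default_factory=list)

def govern_upgrade(candidate, active_version):
    """Return the version that should be active after the upgrade attempt."""
    for check in CHECKS:
        passed = candidate.check_results.get(check, False)  # missing check = fail
        candidate.audit_log.append((check, passed))
        if not passed:
            return active_version  # reject: keep the current version
    candidate.audit_log.append(("shadow", candidate.shadow_ok))
    if not candidate.shadow_ok:
        return active_version      # regression seen only in shadow mode
    candidate.audit_log.append(("activated", True))
    return candidate.version

good = Candidate("v2", {name: True for name in CHECKS})
bad = Candidate("v3", {"interface": True, "policy": False})
shadow_fail = Candidate("v4", {name: True for name in CHECKS}, shadow_ok=False)

print(govern_upgrade(good, "v1"), govern_upgrade(bad, "v1"), govern_upgrade(shadow_fail, "v1"))
```

Online monitoring and rollback after activation are not modeled here; they would wrap the returned version in a loop that can revert to `active_version`.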

2511.22882 2026-05-27 cs.LG math.PR

Normalizing Flows on Quotient Manifolds via Boundary Quotients

William Ghanem, Benjamin Cai

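Affiliations: The University of Texas at Austin

AI summary: Proposes the boundary quotient framework for learning densities on manifolds that arise as boundary quotients of simpler domains, and constructs normalizing flows on quotient manifolds under discrete group actions. Effectiveness is demonstrated on genus-g surfaces and lens spaces.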

Abstract

We introduce boundary quotients and present a framework for learning densities on manifolds that arise as boundary quotients of simpler domains. We show that this framework can be used to construct normalizing flows on quotient manifolds $N/G$, where a discrete group $G$ acts on $N$. We instantiate this construction for genus-$g$ surfaces $\Sigma_g$. When $G$ is finite, we show applicability to symmetry-aware learning; we demonstrate this on cyclic quotients of the 3-sphere. Experiments on lens spaces show that simple pre-quotient RealNVP models can achieve strong results while being substantially cheaper to evaluate.

2509.26619 2026-05-27 cs.CL cs.AI

Searching the Internet for Challenging Benchmarks at Scale

Wenda Xu, Vilém Zouhar, Parker Riley, Mara Finkelstein, Markus Freitag, Daniel Deutsch

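Affiliations: Google; ETH Zurich

AI summary: Proposes an automatic framework that formulates the search over Internet topics as a multi-armed bandit problem and uses an epsilon-greedy strategy to efficiently find the most challenging topics, constructing benchmarks without human curation.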

Abstract

Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving little headroom to expose genuine model weaknesses -- and even expert-curated challenge sets quickly saturate after hillclimbing. We present a fully automatic framework that searches the Internet at scale to construct challenging benchmarks without human curation. The key insight is to model the Internet as a vast space of topics and formalize the search as a multi-armed bandit problem, where each topic's difficulty is revealed only through expensive sample-and-evaluate queries. Our epsilon-greedy strategy identifies the most challenging topics while exploring only 6% of the search space -- a 100 times cost reduction over exhaustive evaluation. We validate on machine translation and knowledge question answering, confirming that discovered difficulty is robust across independent metrics (GEMBA-SQA and MetricX), languages, and models.
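The bandit formulation can be sketched with a generic epsilon-greedy loop in which each topic is an arm and each pull is one expensive sample-and-evaluate query. This is a textbook sketch, not the paper's implementation: the topic names, the `difficulty` table standing in for the evaluation call, the budget, and the value of `epsilon` are all hypothetical.

```python
import random

def epsilon_greedy_search(topics, evaluate, budget, epsilon=0.1, seed=0):
    """Spend `budget` evaluations, mostly on the topic that looks hardest so far."""
    rng = random.Random(seed)
    totals = {t: 0.0 for t in topics}
    counts = {t: 0 for t in topics}
    for _ in range(budget):
        tried = [t for t in topics if counts[t] > 0]
        if not tried or rng.random() < epsilon:
            topic = rng.choice(topics)  # explore a random topic
        else:
            # exploit: topic with the highest mean observed difficulty
            topic = max(tried, key=lambda t: totals[t] / counts[t])
        totals[topic] += evaluate(topic)
        counts[topic] += 1
    means = {t: totals[t] / counts[t] for t in topics if counts[t] > 0}
    return means, counts

# Hypothetical stand-in for an expensive query: higher value = harder for the model.
difficulty = {"tax_law": 0.7, "recipes": 0.1, "folk_poetry": 0.9, "sports": 0.2}
topics = list(difficulty)

means, counts = epsilon_greedy_search(topics, lambda t: difficulty[t], budget=50, epsilon=0.2)
hardest_seen = max(means, key=means.get)
print(hardest_seen, counts)
```

With a noisy `evaluate`, the running means would fluctuate and more exploration would be needed before the estimate of the hardest topic stabilizes.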

2605.01489 2026-05-27 cs.AI cs.CL

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

Tianshi Zheng, Rui Wang, Xiyun Li, Kelvin Kiu Wai Tam, Newt Nguyen Kim Hue Nam, Wei Fan, Yangqiu Song, Tianqing Fang

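Affiliations: HKUST; CUHK; Tencent AI Lab

AI summary: Proposes the SciResearcher framework, which synthesizes conceptual and computational tasks grounded in academic evidence and uses them to train agents, reaching state-of-the-art performance at its parameter scale on benchmarks such as HLE-Bio/Chem-Gold.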

Comments 23 pages, 6 figures, 15 tables

Abstract

Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.

2605.01188 2026-05-27 cs.CL

Compute Optimal Tokenization

Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer

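Affiliations: FAIR at Meta; University of Washington

AI summary: By training 988 BLT models, the paper studies how compression rate (average bytes per token) affects scaling trends. It finds that in compute-optimal configurations the model parameter count scales proportionally with data size in bytes, and that the optimal compression rate differs from the one obtained with BPE and decreases with compute.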

Abstract

Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.
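The compression rate defined in this abstract (average bytes of text per token) is easy to compute for any tokenization. The sketch below uses two toy tokenizations, whitespace splitting and raw bytes, purely for illustration; neither is BPE or BLT, and the example sentence is hypothetical. The helper `tokens_for_byte_budget` shows why a fixed byte budget yields different token counts at different compression rates.

```python
def compression_rate(text, tokens):
    """Average number of UTF-8 bytes of text per token."""
    return len(text.encode("utf-8")) / len(tokens)

def tokens_for_byte_budget(num_bytes, rate):
    """Number of tokens a byte budget corresponds to at a given compression rate."""
    return num_bytes / rate

text = "Scaling laws depend on the data unit."

whitespace_tokens = text.split()          # toy coarse tokenization
byte_tokens = list(text.encode("utf-8"))  # one token per byte

print(compression_rate(text, whitespace_tokens))  # bytes per whitespace token
print(compression_rate(text, byte_tokens))        # exactly 1 byte per token
print(tokens_for_byte_budget(4570, 4.57))         # about 1000 tokens at 4.57 bytes/token
```

The figure of 4.57 bytes per token is the BPE rate quoted in the abstract; the 4570-byte budget is an arbitrary illustration.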

2603.23985 2026-05-27 cs.LG

Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score

Jimyung Hong, Jaehyung Kim

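Affiliations: Yonsei University

AI summary: Proposes DIET, a training-free, dimension-wise structured pruning method that builds a global mask by majority voting over cross-task activation magnitudes, retaining task awareness while avoiding costly training. It substantially improves post-pruning accuracy on Gemma-2 models.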

Comments 14 pages, 10 figures. Code available at https://github.com/Jimmy145123/DIET

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions or layers, yet existing methods face critical trade-offs: task-agnostic approaches cannot adapt to task-specific requirements, while task-aware methods require costly training to learn task adaptability. We propose DIET (Dimension-wise global pruning of LLMs via merging Task-wise importance scores), a training-free structured pruning method that combines dimension-level granularity with task-aware selection. DIET profiles activation magnitudes across tasks using only 100 samples per task, then applies majority voting to construct a single global mask. DIET incurs no large pre-computation or training cost. Experiments on seven zero-shot benchmarks using Gemma-2 2B and 9B models demonstrate the effectiveness of DIET; for example, at 20% sparsity on Gemma-2 2B, DIET achieves an average accuracy improvement of nearly 10% over previous state-of-the-art structured pruning methods. This advantage persists across various sparsity levels and model scales, positioning DIET as a practical and robust choice for structured LLM pruning.
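The profile-then-vote procedure can be sketched on toy data. The sketch assumes that each task votes for its top-`keep` dimensions by mean absolute activation and that the global mask keeps the `keep` dimensions with the most votes, with ties broken by dimension index. The abstract confirms only that activation magnitudes are profiled per task and merged by majority voting, so the scoring rule, the tie-breaking, the task names, and the 4-dimensional activations are hypothetical.

```python
from collections import Counter

def task_keep_set(activations, keep):
    """Top-`keep` dimensions by mean absolute activation for one task."""
    dims = len(activations[0])
    score = [sum(abs(row[d]) for row in activations) / len(activations) for d in range(dims)]
    ranked = sorted(range(dims), key=lambda d: score[d], reverse=True)
    return set(ranked[:keep])

def global_mask(per_task_activations, keep):
    """Merge per-task votes into one boolean keep-mask over dimensions."""
    votes = Counter()
    dims = None
    for acts in per_task_activations.values():
        dims = len(acts[0])
        votes.update(task_keep_set(acts, keep))
    # Most-voted dimensions first; ties go to the lower index (assumed rule).
    ranked = sorted(range(dims), key=lambda d: (-votes[d], d))
    kept = set(ranked[:keep])
    return [d in kept for d in range(dims)]

# Hypothetical activations: one sample per task, four hidden dimensions.
per_task = {
    "task_a": [[3.0, 0.1, 2.0, 0.2]],   # votes for dims 0 and 2
    "task_b": [[2.5, 1.5, 0.1, 0.3]],   # votes for dims 0 and 1
    "task_c": [[0.2, 0.1, 4.0, 3.0]],   # votes for dims 2 and 3
}

mask = global_mask(per_task, keep=2)
print(mask)
```

Applying the mask would mean dropping the `False` dimensions from every weight matrix that reads or writes that hidden dimension.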

2601.21972 2026-05-27 cs.AI cs.DC cs.MA

Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

Shuo Liu, Tianle Chen, Ryan Amiri, Christopher Amato

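Affiliations: Northeastern University, Boston, MA

AI summary: For optimizing decentralized LLM collaboration, proposes two multi-agent actor-critic methods (CoLLM-CC and CoLLM-DC). Experiments show that on long-horizon or sparse-reward tasks the centralized-critic method outperforms both Monte Carlo methods and the decentralized-critic method.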

Abstract

Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues; we therefore develop Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose two MAAC approaches: CoLLM-CC, with a Centralized Critic, and CoLLM-DC, with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge.
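The contrast between Monte Carlo returns and critic-based targets can be sketched for a single trajectory. This is a generic single-agent illustration of the two learning signals, not CoLLM-CC or CoLLM-DC: the reward sequence, the discount factor, and the critic values are hypothetical, and the multi-agent structure is omitted.

```python
def monte_carlo_returns(rewards, gamma):
    """Discounted return from each step to the end of the episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]

def td_advantages(rewards, values, gamma):
    """One-step advantage r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has len(rewards) + 1 entries; the last is the value of the
    final state (0.0 if the episode terminated).
    """
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]

# Sparse-reward toy episode: reward arrives only at the last step.
rewards = [0.0, 0.0, 1.0]
gamma = 0.5

mc = monte_carlo_returns(rewards, gamma)

# A critic that happens to predict the true returns gives zero advantages,
# so the policy-gradient signal carries no extra noise from this episode.
perfect_values = mc + [0.0]
adv = td_advantages(rewards, perfect_values, gamma)

print(mc, adv)
```

The Monte Carlo signal depends on the whole remaining episode, which is where its variance comes from on long horizons, while the one-step target depends only on the next reward and the critic's estimate.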

2605.00412 2026-05-27 cs.AI cs.RO

Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

Sen Cui, Jingheng Ma

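Affiliations: Tsinghua University

AI summary: Proposes Hamiltonian World Models, which use a structured latent phase space and Hamiltonian dynamics to pursue physically reliable, action-controllable, and long-horizon stable future prediction for embodied decision making.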

Abstract

World models have recently re-emerged as a central paradigm for embodied intelligence, robotics, autonomous driving, and model-based reinforcement learning. However, current world model research is often dominated by three partially separated routes: 2D video-generative models that emphasize visual future synthesis, 3D scene-centric models that emphasize spatial reconstruction, and JEPA-like latent models that emphasize abstract predictive representations. While each route has made important progress, they still struggle to provide physically reliable, action-controllable, and long-horizon stable predictions for embodied decision making. In this paper, we argue that the bottleneck of world models is no longer only whether they can generate realistic futures, but whether those futures are physically meaningful and useful for action. We propose Hamiltonian World Models as a physically grounded perspective on world modeling. The key idea is to encode observations into a structured latent phase space, evolve the latent state through Hamiltonian-inspired dynamics with control, dissipation, and residual terms, decode the predicted trajectory into future observations, and use the resulting rollouts for planning. We discuss how Hamiltonian structure may improve interpretability, data efficiency, and long-horizon stability, while also noting practical challenges in real-world robotic scenes involving friction, contact, non-conservative forces, and deformable objects.
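The phrase "Hamiltonian-inspired dynamics with control, dissipation, and residual terms" can be illustrated with a hand-written one-dimensional example. The sketch assumes a fixed quadratic Hamiltonian H(q, p) = p²/(2m) + kq²/2 and a symplectic Euler integrator; in the model the abstract describes, the Hamiltonian would be learned over a latent phase space, and the learned residual term is left out here. All parameter values are hypothetical.

```python
def energy(q, p, mass=1.0, stiffness=1.0):
    """H(q, p) = p^2 / (2m) + k q^2 / 2."""
    return p * p / (2.0 * mass) + stiffness * q * q / 2.0

def step(q, p, dt, mass=1.0, stiffness=1.0, damping=0.0, control=0.0):
    """One symplectic Euler step with added dissipation and control.

    dp/dt = -dH/dq - damping * dH/dp + control
    dq/dt =  dH/dp
    """
    p_new = p + dt * (-stiffness * q - damping * p / mass + control)
    q_new = q + dt * p_new / mass  # uses the updated momentum
    return q_new, p_new

def rollout(q, p, steps, dt, damping=0.0, control=0.0):
    """Roll the latent state forward and return the full trajectory."""
    trajectory = [(q, p)]
    for _ in range(steps):
        q, p = step(q, p, dt, damping=damping, control=control)
        trajectory.append((q, p))
    return trajectory

conservative = rollout(1.0, 0.0, steps=1000, dt=0.01)
damped = rollout(1.0, 0.0, steps=1000, dt=0.01, damping=0.5)

print(energy(*conservative[-1]), energy(*damped[-1]))
```

Without damping the energy stays close to its initial value over the whole rollout, which is the kind of long-horizon stability the abstract attributes to Hamiltonian structure; with damping it decays, as expected for a system with friction.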

2604.27604 2026-05-27 cs.CV cs.CE

Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

Junpeng Ding, Zichen Tang, Haihong E, Mengyuan Ji, Yang Liu, Haolin Tian, Haiyang Sun, Pengqi Sun, Yang Xu, Yichen Liu, Haocheng Gao, Zijie Xi, Ruomeng Jiang, Peizhi Zhao, Rongjin Li, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Jintong Chen, Siying Lin

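Affiliations: Beijing University of Posts and Telecommunications

AI summary: Proposes the SPUR benchmark, which uses 4,264 question-answer pairs to evaluate multimodal large language models on fine-grained perception, cross-panel relation understanding, and expert-level reasoning over scientific experimental images, revealing the gap between current models and expert-level performance.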

Comments Accepted to ACL 2026 Main Conference

Abstract

We introduce SPUR, a comprehensive benchmark for scientific experimental image perception, understanding, and reasoning, comprising 4,264 question-answering (QA) pairs derived from 1,084 expert-curated images. SPUR features three key innovations: (1) Panel-Level Fine-Grained Perception: evaluating the visual perception of multimodal large language models (MLLMs) across three dimensions (numerical, morphological, and information localization) on six fine-grained panel types; (2) Cross-Panel Relation Understanding: utilizing complex images with an average of 14.3 panels per sample to evaluate MLLMs' ability to decipher intricate cross-panel relations; (3) Expert-Level Reasoning: assessment of qualitative and quantitative reasoning across five experimental paradigms to determine if models can infer conclusions from evidence as human experts do. Comprehensive evaluation of 20 MLLMs and four multimodal Chain-of-Thought (MCoT) methods reveals that current models fall significantly short of the expert-level requirements for scientific image interpretation, underscoring a critical bottleneck in AI for Science (AI4S) research.

2604.27292 2026-05-27 cs.AI

The Two Boundaries: Why Behavioral AI Governance Fails Structurally

两个边界:为什么行为性AI治理在结构上失败

Alan L. McCann

发表机构 * Mashin, Inc.(Mashin公司)

AI总结 本文提出形式化框架,利用Rice定理证明行为性AI治理存在结构性的不可判定间隙,并定义共延治理作为可测试标准,通过分离计算与效应实现结构治理。

Comments 17 pages, 2 figures. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. v2: corrected cross-reference identifiers for companion papers; updated license

详情
AI中文摘要

每个产生效应的系统都有两个边界:它能做什么(表达能力)和治理覆盖什么(治理)。在几乎所有已部署的AI系统中,这些边界是独立定义的,从而产生三个区域:受治理的能力(唯一有用的区域)、未受治理的能力(风险)以及针对不存在的能力的治理策略(作秀)。三个区域中有两个是失败模式。我们关注效应的治理:AI系统在世界中执行的动作(API调用、数据库写入、工具调用)。这不同于模型输出的治理(内容质量、偏见、公平性),后者在不同层面运作并需要不同的机制。我们提出了一个形式化框架来分析这种结构性差距。Rice定理(1953)证明,对于任何试图行为性地治理效应的图灵完备架构,该差距在一般情况下是不可判定的:没有算法可以决定任意程序的非平凡语义属性,包括属性“该程序的效应符合治理策略”。我们定义了共延治理:一种系统属性,其中表达能力边界等于治理边界。我们证明共延治理需要架构决策(将计算与效应分离),而不是事后添加的治理层。我们表明,在这种分离下的结构治理包含了独立的治理基础设施:治理检查成为执行流水线的一部分,而不是与之并行的第二系统。我们提出共延治理作为任何AI治理系统的可测试标准:要么两个边界可证明相同,要么风险和作秀在结构上不可避免。证明在Coq中机械化(454个定理,36个模块,0个待证)。

英文摘要

Every system that performs effects has two boundaries: what it can do (expressiveness) and what governance covers (governance). In nearly all deployed AI systems, these boundaries are defined independently, creating three regions: governed capabilities (the only useful region), ungoverned capabilities (risk), and governance policies that address non-existent capabilities (theater). Two of the three regions are failure modes. We focus on the governance of effects: actions that AI systems perform in the world (API calls, database writes, tool invocations). This is distinct from the governance of model outputs (content quality, bias, fairness), which operates at a different level and requires different mechanisms. We present a formal framework for analyzing this structural gap. Rice's theorem (1953) proves the gap is undecidable in the general case for any Turing-complete architecture that attempts to govern effects behaviorally: no algorithm can decide non-trivial semantic properties of arbitrary programs, including the property "this program's effects comply with the governance policy." We define coterminous governance: a system property where the expressiveness boundary equals the governance boundary. We show that coterminous governance requires an architectural decision (separating computation from effect) rather than a governance layer added after the fact. We show that structural governance under this separation subsumes separate governance infrastructure: governance checks become part of the execution pipeline rather than a second system running alongside it. We propose coterminous governance as the testable criterion for any AI governance system: either the two boundaries are provably identical, or risk and theater are structurally inevitable. Proofs are mechanized in Coq (454 theorems, 36 modules, 0 admitted).
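The three regions described in the abstract above can be illustrated with plain set operations. This is a minimal sketch, not the paper's formalism: it assumes both boundaries are given as finite sets of effect names, which sidesteps the undecidability the abstract discusses for Turing-complete systems. The function names and the effect names (`api_call`, `send_fax`, and so on) are hypothetical.

```python
def governance_regions(expressible, governed):
    """Split capabilities and policies into the three regions named in the abstract."""
    return {
        "governed": expressible & governed,   # capabilities covered by a policy
        "risk": expressible - governed,       # capabilities with no policy
        "theater": governed - expressible,    # policies for capabilities that do not exist
    }


def is_coterminous(expressible, governed):
    """True when the expressiveness boundary equals the governance boundary."""
    return expressible == governed


# Hypothetical effect names, chosen only for illustration.
capabilities = {"api_call", "db_write", "tool_invoke"}
policies = {"api_call", "db_write", "send_fax"}

regions = governance_regions(capabilities, policies)
print(regions["risk"])      # the ungoverned capability
print(regions["theater"])   # the policy with nothing to govern
print(is_coterminous(capabilities, policies))
```

In this toy setting the two failure regions are empty exactly when the sets are equal, which is the coterminous condition the abstract proposes as a test.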

2604.27289 2026-05-27 cs.AI

Mechanized Foundations of Structural Governance: Machine-Checked Proofs for Governed Intelligence

结构治理的机械化基础:受治理智能的机器验证证明

Alan L. McCann

发表机构 * Mashin, Inc.(Mashin公司)

AI总结 本文通过Coq机械化证明和纸上证明,建立了认知工作流系统中结构治理的理论基础,包括共归纳安全谓词、治理不变性定理、充分性定理、交替范式、必要性定理,并通过属性测试验证了BEAM运行时与规范的一致性。

Comments 27 pages, 4 figures, 1 table. Code and proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. v2: corrected cross-reference identifiers for companion papers. Updated license

详情
AI中文摘要

我们提出了认知工作流系统结构治理理论中的五个结果。其中三个使用Interaction Trees库和参数化共归纳在Coq 8.19中机械化实现;两个通过显式归约在纸上证明。共归纳安全谓词(gov_safe)是一个共归纳性质,捕获无限程序行为的治理安全性,由布尔权限标志索引,该标志对于未治理的I/O可证明为假,对于治理的解释为真(机械化)。治理不变性定理证明治理在元递归塔上是统一的:第n+1层的治理通过类型的定义性等式归约为第n层的治理(机械化)。充分性定理证明四个原子原语(代码、推理、内存、调用)对于任何离散智能系统在表达上完备,形式化为Kleisli范畴的组合闭包(机械化)。交替范式提供任何机器到交替代码和效果层的规范分解,具有合流重写系统(纸上证明)。必要性定理通过显式归约为Rice定理证明,对于需要语义判断的问题,架构不透明组件(推理原语)在数学上是必要的(纸上证明)。第六个贡献将抽象模型连接到部署的运行时:验证解释器规范在Coq中形式化了BEAM运行时的信任、能力和哈希链逻辑,然后使用基于属性的测试对运行系统进行测试,使用超过70,000个随机生成的指令序列,零分歧。机械化包括约12,000行代码,跨越36个模块,包含454个定理和零个待证明引理。

英文摘要

We present five results in the theory of structural governance for cognitive workflow systems. Three are mechanized in Coq 8.19 using the Interaction Trees library with parameterized coinduction; two are proved on paper with explicit reductions. The Coinductive Safety Predicate (gov_safe) is a coinductive property that captures governance safety for infinite program behaviors, indexed by a boolean permission flag that is provably false for ungoverned I/O and true for governed interpretations (mechanized). The Governance Invariance Theorem establishes that governance is uniform across the meta-recursive tower: governance at level n+1 reduces to governance at level n by definitional equality of the type (mechanized). The Sufficiency Theorem proves that four atomic primitives (code, reason, memory, call) are expressively complete for any discrete intelligent system, formalized as compositional closure of a Kleisli category (mechanized). The Alternating Normal Form provides a canonical decomposition of any machine into alternating code and effect layers, with a confluent rewriting system (paper proof). The Necessity Theorem proves via explicit reduction to Rice's theorem that an architecturally opaque component (the reason primitive) is mathematically necessary for problems requiring semantic judgment (paper proof). A sixth contribution connects the abstract model to the deployed runtime: the Verified Interpreter Specification formalizes the BEAM runtime's trust, capability, and hash chain logic in Coq, then tests the running system against this specification using property-based testing with over 70,000 randomly generated directive sequences and zero disagreements. The mechanization comprises approximately 12,000 lines across 36 modules with 454 theorems and zero admitted lemmas.

2604.24764 2026-05-27 cs.CV

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

World-R1:通过强化学习为文本到视频生成注入3D约束

Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang

发表机构 * Zhejiang University(浙江大学) Microsoft Research(微软研究院) Independent Researcher(独立研究者)

AI总结 提出World-R1框架,利用强化学习(Flow-GRPO)结合3D基础模型和视觉语言模型的反馈,在不修改架构的情况下增强视频生成的3D一致性,并采用周期解耦训练策略平衡刚体几何与动态场景。

Comments ICML 2026, Project Page: https://aka.ms/world-r1, Code: https://github.com/microsoft/World-R1

详情
AI中文摘要

最近的视频基础模型展示了令人印象深刻的视觉合成能力,但经常遭受几何不一致性的困扰。现有方法尝试通过架构修改注入3D先验,但往往导致高计算成本并限制可扩展性。我们提出World-R1,一个通过强化学习将视频生成与3D约束对齐的框架。为促进这种对齐,我们引入了一个专门为世界模拟定制的纯文本数据集。利用Flow-GRPO,我们使用预训练的3D基础模型和视觉语言模型的反馈来优化模型,在不改变底层架构的情况下强制执行结构一致性。我们进一步采用周期解耦训练策略来平衡刚体几何一致性与动态场景流畅性。大量评估表明,我们的方法显著增强了3D一致性,同时保留了基础模型的原始视觉质量,有效弥合了视频生成与可扩展世界模拟之间的差距。

英文摘要

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.

2604.19499 2026-05-27 cs.CL

Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics

秩湍流Delta与可解释的文体测量Delta方法

Dmitry Pronin, Evgeny Kazartsev

发表机构 * HSE University(俄罗斯高等经济大学)

AI总结 本文引入两种新的作者归属度量——秩湍流Delta和Jensen-Shannon Delta,通过将距离函数应用于概率分布来推广Burrows经典Delta,并提出词级分解实现数值可解释性,在四个语料库上验证了方法的有效性。

Comments Published in Digital Scholarship in the Humanities. The version of record is available at https://academic.oup.com/dsh/advance-article-abstract/doi/10.1093/llc/fqag072/8692587 Code available at: https://github.com/DDPronin/Rank-Turbulence-Delta

Journal ref Digital Scholarship in the Humanities, 2026

详情
AI中文摘要

本文介绍了两种新的作者归属度量——秩湍流Delta和Jensen-Shannon Delta,它们通过应用为概率分布设计的距离函数来推广Burrows经典Delta。我们首先阐述了这些度量的理论基础,对比了词频向量的中心化和非中心化z分数,并将非中心化向量重新解释为概率分布。基于这一表示,我们开发了一种词级分解,使每个Delta距离在数值上可解释,从而促进细读和结果验证。这些方法的有效性在英语、德语、法语和俄语的四个文学语料库上进行了评估。英语、德语和法语数据集来自Project Gutenberg,而俄语基准是SOCIOLIT语料库,包含89位作者的639部作品,时间跨度从18世纪到21世纪。秩湍流Delta达到了与余弦Delta相当的归属准确率;Jensen-Shannon Delta始终匹配或超越经典Burrows Delta的性能。最后,在扩展的SOCIOLIT语料库上重新评估了几种已有的归属算法,提供了它们在显著时间和风格变化下鲁棒性的现实估计。

英文摘要

This article introduces two new measures for authorship attribution - Rank-Turbulence Delta and Jensen-Shannon Delta - which generalise Burrows's classical Delta by applying distance functions designed for probabilistic distributions. We first set out the theoretical basis of the measures, contrasting centred and uncentred z-scoring of word-frequency vectors and re-casting the uncentred vectors as probability distributions. Building on this representation, we develop a token-level decomposition that renders every Delta distance numerically interpretable, thereby facilitating close reading and the validation of results. The effectiveness of the methods is assessed on four literary corpora in English, German, French and Russian. The English, German and French datasets are compiled from Project Gutenberg, whereas the Russian benchmark is the SOCIOLIT corpus containing 639 works by 89 authors spanning the eighteenth to the twenty-first centuries. Rank-Turbulence Delta attains attribution accuracy comparable with Cosine Delta; Jensen-Shannon Delta consistently matches or exceeds the performance of canonical Burrows's Delta. Finally, several established attribution algorithms are re-evaluated on the extended SOCIOLIT corpus, providing a realistic estimate of their robustness under pronounced temporal and stylistic variation.
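The idea of a distance over word-frequency distributions with a token-level decomposition can be sketched as follows. This is an illustrative assumption, not the authors' definition: it uses the standard Jensen-Shannon divergence (base 2) on raw relative frequencies, whereas the article derives its distributions from uncentred z-scores. The helper names and the toy counts are hypothetical.

```python
import math


def normalise(counts):
    """Turn raw word counts into a probability distribution."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def js_contributions(p, q):
    """Per-word contributions to the Jensen-Shannon divergence (log base 2).

    The contributions are non-negative and sum to the total divergence,
    so each word's share of the distance can be inspected directly.
    """
    contrib = {}
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        term = 0.0
        if pw > 0:
            term += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            term += 0.5 * qw * math.log2(qw / mw)
        contrib[word] = term
    return contrib


# Toy counts for two texts (hypothetical data).
text_a = normalise({"the": 6, "and": 3, "of": 1})
text_b = normalise({"the": 4, "and": 4, "of": 2})

parts = js_contributions(text_a, text_b)
distance = sum(parts.values())
for word, share in sorted(parts.items(), key=lambda kv: -kv[1]):
    print(word, round(share, 4))
print("total", round(distance, 4))
```

Sorting the contributions shows which words drive the distance between two texts, which is the kind of numerical interpretability the abstract describes.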

2401.07669 2026-05-27 cs.CV

SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels

SRL-CLIP: 通过结构化语义角色标签实现高效的CLIP视频适配

Darshan Singh, Zeeshan Khan, Makarand Tapaswi

发表机构 * CVIT, IIIT Hyderabad(IIIT海得拉巴计算机视觉研究所) Inria, École normale supérieure, CNRS, PSL Research University(法国国家信息与自动化研究所、巴黎高等师范学院、法国国家科学研究中心、巴黎文理研究大学)

AI总结 本文提出SRL-CLIP,利用结构化语义角色标签(SRL)生成规则化字幕,仅用23k视频-字幕对进行对比微调,即可高效适配CLIP用于通用视频理解,在零样本文本-视频检索上性能优于参数多4-8倍、数据多6000倍的模型。

Comments Accepted to the CV4Smalls Workshop at CVPR 2026

详情
AI中文摘要

将CLIP适配到视频领域因其语义丰富表示而日益流行。虽然CLIP是一个良好的起点,但它通常需要在大型视频叙述或字幕数据集(如HowTo100M、WebVid2.5M)上进行后预训练(对比微调)。然而,此类叙述或字幕往往缺乏全面信息来整体表示视频。由于文本的学习信号稀疏,视觉学习效率低下,适配需要数百万样本进行后预训练。在这项工作中,我们提出疑问:是否可能高效地将CLIP适配到通用和整体的视频理解?我们使用带有结构化和密集语义角色标签(SRL)的视频,这些标签以结构化格式捕获动作、人物或物体、属性、副词(方式)和位置,从而整体表示整个视频。我们从SRL生成基于规则的字幕,并证明仅对23k视频-字幕对进行简单的对比微调就足以学习强大的、可迁移的表示,适用于需要不同感知粒度水平的多种视频理解任务。我们的适配CLIP模型SRL-CLIP在零样本文本-视频检索上展现出与最先进模型相当或更优的性能,而这些模型拥有4-8倍更多的参数,并在多达6000倍更多的数据上进行了后预训练。SRL-CLIP在多个视频基准上超越了CLIP,突显了高效学习和改进的表示能力。

英文摘要

Adapting CLIP for videos has gained popularity due to its semantic and rich representation. While CLIP is a good starting point, it typically undergoes post-pretraining (contrastive finetuning) on large video narration or caption datasets (e.g. HowTo100M, WebVid2.5M). However, such narrations or captions often lack comprehensive information needed to represent a video holistically. As the learning signal from text is sparse, the visual learning is inefficient and adaptation requires millions of samples to post-pretrain. In this work, we ask: is it possible to efficiently adapt CLIP for general and holistic video understanding? We use videos labeled with structured and dense Semantic Role Labels (SRLs) that capture actions, people or objects, their attributes, adverbs (manner), and location in a structured format representing the entire video in a holistic way. We generate rule-based captions from SRLs and demonstrate that simple contrastive finetuning on a mere 23k video-caption pairs is adequate to learn powerful, transferable representations applicable across a diverse range of video understanding tasks that require varying levels of perceptual granularity. Our adapted CLIP model, SRL-CLIP, exhibits comparable or superior performance on zero-shot text-to-video retrieval compared to state-of-the-art models that possess 4-8x more parameters and are post-pretrained on up to 6000x more data. SRL-CLIP surpasses CLIP on multiple video benchmarks, underscoring the efficient learning and improved representations.

2603.13381 2026-05-27 cs.LG cs.AI

Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

超越注意力投影中的线性:支持非线性查询的理由

Marko Karbevski

发表机构 * Simplicity Technologies(简化科技)

AI总结 本文提出用非线性残差替换注意力中的查询投影W_Q,通过瓶颈MLP实现,在GPT-3小模型上验证了性能提升。

Comments Accepted at the ICLR 2026 GRaM workshop: https://openreview.net/forum?id=pwdnneFiNZ#discussion

详情
AI中文摘要

最近的代数分析表明,在仅解码器和仅编码器Transformer中,查询投影$W_Q$可以设置为恒等映射而不会显著降低性能。这是因为注意力仅通过乘积$XW_Q, XW_K, XW_V$依赖于$X$,允许基变换被相邻层吸收并通过网络传播。我们将$W_Q \in \R^{d \times d}$替换为非线性残差形式$Q(X) = X + f_θ(X)$,其中$f_θ$是一个瓶颈MLP,具有$d^2 + O(d)$个参数。恒等项将非线性锚定到已知良好的先验。在GPT-3 small风格模型上的实验显示,与基线相比持续改进(验证对数损失降低$2.40\%$,困惑度降低$6.81\%$),明显优于非嵌入参数多12.5%的模型。这些结果促使人们在更大规模和跨模态场景下开展进一步研究。

英文摘要

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \R^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_θ(X)$, where $f_θ$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline ($2.40\%$ lower validation log-loss, $6.81\%$ lower perplexity), comfortably outperforming a model with 12.5\% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
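The residual form $Q(X) = X + f_θ(X)$ can be sketched in NumPy as below. The details are assumptions made for illustration and are not taken from the paper: a two-layer MLP with hidden width $r = d/2$ (so the two weight matrices together hold $d^2$ parameters and the biases add $O(d)$), a tanh-approximated GELU, and a zero-initialised output layer so that $Q(X) = X$ at initialisation. All function names are hypothetical.

```python
import numpy as np


def init_params(d, r, rng):
    """Bottleneck MLP parameters: d -> r -> d."""
    return {
        "W1": rng.normal(0.0, d ** -0.5, size=(d, r)),
        "b1": np.zeros(r),
        "W2": np.zeros((r, d)),  # assumption: zero init, so Q(X) = X at the start
        "b2": np.zeros(d),
    }


def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def nonlinear_query(X, p):
    """Q(X) = X + f_theta(X): identity term plus a bottleneck MLP."""
    hidden = gelu(X @ p["W1"] + p["b1"])
    return X + hidden @ p["W2"] + p["b2"]


def param_count(p):
    return sum(v.size for v in p.values())


d, r = 64, 32  # r = d / 2 gives d*r + r*d = d^2 weights
params = init_params(d, r, np.random.default_rng(0))
X = np.random.default_rng(1).normal(size=(5, d))  # 5 token vectors

Q = nonlinear_query(X, params)
print(Q.shape, param_count(params), d * d)
```

With $r = d/2$ the weight count matches a dense $d \times d$ projection, so the nonlinearity is added at roughly the parameter budget of the $W_Q$ it replaces.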

2512.05794 2026-05-27 cs.LG cs.AI q-bio.QM

Mechanistic Interpretability of Antibody Language Models Using SAEs

使用 SAE 对抗体语言模型的机制可解释性研究

Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane

发表机构 * Department of Statistics, University of Oxford, UK(英国牛津大学统计系) Reticular, San Francisco, USA(美国旧金山Reticular公司) EECS, MIT, Cambridge MA, USA(美国麻省理工学院电子工程与计算机科学系) Leyden Laboratories BV, Leiden, The Netherlands(荷兰莱顿实验室)

AI总结 本研究采用 TopK 和 Ordered 稀疏自编码器(SAE)对抗体语言模型进行机制可解释性分析,发现 TopK SAE 能揭示有意义的生物学潜在特征但无法保证生成控制,而 Ordered SAE 通过层次结构可靠识别可操控特征但激活模式更复杂。

Comments v3: 15 pages; corrected author list and affiliations in the main text; minor text changes; updated steering results following minor code changes; conclusions and findings remain unchanged; included link to data and code in the Data Availability section

详情
AI中文摘要

稀疏自编码器(SAE)是一种机制可解释性技术,已被用于揭示大型蛋白质语言模型中学到的概念。在此,我们采用 TopK 和 Ordered SAE 来研究自回归抗体语言模型,并引导其生成。我们表明,TopK SAE 可以揭示有生物学意义的潜在特征,但高特征-概念相关性并不能保证对生成的因果控制。相比之下,Ordered SAE 施加了层次结构,能够可靠地识别可操控特征,但代价是激活模式更复杂且可解释性较低。这些发现推进了领域特异性蛋白质语言模型的机制可解释性,并表明,虽然 TopK SAE 足以将潜在特征映射到概念,但在需要精确生成引导时,Ordered SAE 更可取。

英文摘要

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that have been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models, and steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.

2604.21454 2026-05-27 cs.CL cs.AI

Reasoning Primitives in Hybrid and Non-Hybrid LLMs: Do Architectural Differences Yield Advantages in State-Tracking and Recall?

混合与非混合大语言模型中的推理原语:架构差异在状态追踪和召回中是否带来优势?

Shivam Rawat, Lucie Flek, Florian Mai, Nicholas Kluge Corrêa

发表机构 * Lamarr Institute for Machine Learning and Artificial Intelligence(拉玛尔机器学习与人工智能研究所) Rheinische Friedrich-Wilhelms-Universität Bonn(波恩大学)

AI总结 本研究通过五个受控任务族比较了Transformer和混合架构在状态召回任务上的表现,发现推理增强是主要优势因素,而混合架构的优势较窄且依赖于任务。

详情
AI中文摘要

大型语言模型中的推理通常被视为单一能力,但其部分收益可能源于更简单的底层操作。我们通过五个以状态召回为中心的受控任务族,研究了两种这样的原语——召回和状态追踪,并比较了匹配的Transformer和混合架构(有无推理增强)。在整个套件中,推理增强变体显著优于仅指令变体,通常差距很大。这一模式与“状态超越令牌”观点一致:外部化推理痕迹之所以有帮助,是因为它们在令牌空间中向前传递中间状态。相比之下,一旦推理令牌可用,混合归纳偏置在准确性上并不产生统一优势。当架构差异确实出现时,它们遵循任务结构:混合Think模型在严格顺序的链式更新上更稳健,而Transformer Think模型在平面多跳检索上更稳健。因此,我们将本研究的主要贡献视为对状态召回任务性能驱动因素的描述性说明:推理令牌增强似乎是主导因素,而混合优势更窄、依赖于任务,并且可能更多关乎推理效率而非整体能力。我们还发布了重现这些结果所需的代码库和数据。

英文摘要

Reasoning in large language models is often discussed as a single capability, but some of its gains may stem from simpler underlying operations. We examine two such primitives, recall and state-tracking, through five controlled task families centered on state-based recall, and compare matched transformer and hybrid architectures with and without reasoning augmentation. Across the suite, reasoning-augmented variants substantially outperform instruction-only variants, often by large margins. This pattern is consistent with the State over Tokens view: externalized reasoning traces help because they carry the intermediate state forward in token space. By contrast, hybrid inductive bias does not yield a uniform advantage in accuracy once reasoning tokens are available. When architectural differences do appear, they follow task structure: the hybrid Think model is more robust on strictly sequential chained updates, whereas the transformer Think model is more robust on flat multi-hop retrieval. We therefore cast the main contribution of this study as a descriptive account of what drives performance on state-based recall tasks: reasoning-token augmentation appears to be the dominant factor, while hybrid advantages are narrower, task-dependent, and potentially more about inference efficiency than overall capability. We also release the codebase and data required to reproduce these results.

2604.19673 2026-05-27 cs.CV

InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

InHabit: 利用图像基础模型实现可扩展的3D人体放置

Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll

发表机构 * Bosch Center of Artificial Intelligence(博世人工智能中心) Max Planck Institute for Informatics(马克斯·普朗克信息学研究所)

AI总结 提出InHabit方法,通过渲染-生成-提升流程利用2D基础模型知识自动生成3D场景中与几何一致的人体交互数据,并构建大规模数据集InHabitants,显著提升3D人体-场景重建和接触估计性能。

详情
AI中文摘要

训练具身智能体像人类一样理解3D场景需要大量人类与多样环境有意义交互的数据,但此类数据稀缺。真实世界捕捉成本高昂且局限于受控环境,而现有合成数据集依赖简单几何启发式,忽略了丰富的场景上下文。相比之下,在互联网规模上训练的2D基础模型已获得关于人类-环境交互的常识知识。为了将这些知识迁移到3D,我们引入了InHabit,一种自动且可扩展的数据生成器,用于在3D场景中填充交互的人类。InHabit遵循渲染-生成-提升原则:给定渲染的3D场景,视觉语言模型提出上下文相关的动作,图像编辑模型插入一个人体,优化过程将编辑结果提升为与场景几何对齐的物理上合理的SMPL-X人体。应用于Habitat-Matterport3D,InHabit生成了InHabitants,这是首个大规模逼真3D人-场景交互数据集,包含约800个建筑规模场景中的78K个样本,具有完整的3D几何、SMPL-X人体和图像。用InHabitants增强标准训练数据改进了基于RGB的3D人-场景重建和接触估计,在感知用户研究中,我们的数据在78%的情况下优于先前技术。

英文摘要

Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics, ignoring rich scene context. In contrast, 2D foundation models trained at internet scale have acquired commonsense knowledge of human-environment interactions. To transfer this knowledge to 3D, we introduce InHabit, an automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces InHabitants, the first large-scale photorealistic 3D human-scene interaction dataset, with 78K samples across $\sim$800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and images. Augmenting standard training data with InHabitants improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over prior art.

2604.19667 2026-05-27 cs.CL cs.AI cs.CV cs.LG cs.MA

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Chat2Workflow: 用自然语言生成可执行可视化工作流的基准

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Tencent(腾讯)

AI总结 提出Chat2Workflow基准,用于评估大语言模型从自然语言生成可执行可视化工作流的能力,并设计了一个智能体基线以提升性能。

Comments Work in progress

详情
AI中文摘要

目前,可执行的可视化工作流已成为实际工业部署中的主流范式,提供了强大的可靠性和可控性。然而,在当前实践中,此类工作流几乎完全通过手动工程构建:开发人员必须仔细设计工作流,为每个步骤编写提示,并随着需求的变化反复修改逻辑——这使得开发成本高昂、耗时且容易出错。为了研究大语言模型能否自动化这一多轮交互过程,我们引入了Chat2Workflow,一个直接从自然语言生成可执行可视化工作流的基准,并提出了一个稳健的智能体基线以提高性能。该基准基于大量真实业务工作流构建,每个实例的设计使得生成的工作流可以转换并直接部署到实际工作流平台(如Dify和Coze)上。实验结果表明,尽管最先进的语言模型通常能捕捉高层次意图,但在生成正确、稳定且可执行的工作流方面仍存在困难,尤其是在面对复杂且不断变化的需求时。尽管我们的智能体基线带来了高达6.05%的解决率提升,但剩余的现实差距使Chat2Workflow成为推进工业级自动化的基础。代码可在https://github.com/zjunlp/Chat2Workflow获取。

英文摘要

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve -- making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic baseline to improve performance. The benchmark is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially given complex and evolving requirements. Although our agentic baseline yields up to 6.05% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.

2604.18751 2026-05-27 cs.LG cs.AI stat.ME stat.ML

Beyond Coefficients: Forecast-Necessity Testing for Interpretable Causal Discovery in Nonlinear Time-Series Models

超越系数:非线性时间序列模型中可解释因果发现的预测必要性检验

Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge

发表机构 * Lucy Family Institute for Data & Society(露西家族数据与社会研究所) University of Notre Dame(圣母大学) Department of Political Science(政治学系)

AI总结 针对非线性时间序列模型中因果分数被误读为回归系数的问题,提出基于边消融和预测比较的预测必要性检验框架,以评估因果关系的实际必要性。

详情
AI中文摘要

非线性机器学习模型越来越多地用于发现时间序列数据中的因果关系,但其输出的解释仍不明确。特别是,正则化神经自回归模型产生的因果分数常被视为回归系数的类比,导致误导性的统计显著性声明。在本文中,我们认为非线性时间序列模型中的因果相关性应通过预测必要性而非系数大小来评估,并提出了一种实用的评估程序。我们提出了一个基于系统边消融和预测比较的可解释评估框架,用于测试候选因果关系是否对准确预测是必要的。以神经加性向量自回归作为案例研究模型,我们将该框架应用于一个关于民主发展的真实世界案例研究,该案例将面板数据(139个国家的民主指标)建模为多元时间序列。我们表明,具有相似因果分数的关系由于冗余、时间持久性和特定制度效应,其预测必要性可能差异巨大。我们的结果展示了预测必要性检验如何支持应用AI系统中更可靠的因果推理,并为在高风险领域解释非线性时间序列模型提供实用指导。

英文摘要

Nonlinear machine-learning models are increasingly used to discover causal relationships in time-series data, yet the interpretation of their outputs remains poorly understood. In particular, causal scores produced by regularized neural autoregressive models are often treated as analogues of regression coefficients, leading to misleading claims of statistical significance. In this paper, we argue that causal relevance in nonlinear time-series models should be evaluated through forecast necessity rather than coefficient magnitude, and we present a practical evaluation procedure for doing so. We present an interpretable evaluation framework based on systematic edge ablation and forecast comparison, which tests whether a candidate causal relationship is required for accurate prediction. Using Neural Additive Vector Autoregression as a case study model, we apply this framework to a real-world case study of democratic development, modeled as a multivariate time series of panel data - democracy indicators across 139 countries. We show that relationships with similar causal scores can differ dramatically in their predictive necessity due to redundancy, temporal persistence, and regime-specific effects. Our results demonstrate how forecast-necessity testing supports more reliable causal reasoning in applied AI systems and provides practical guidance for interpreting nonlinear time-series models in high-stakes domains.

2504.01733 2026-05-27 cs.AI cs.CC cs.LO

Epistemic Skills: Reasoning about Knowledge and Oblivion

认知技能:关于知识与遗忘的推理

Xiaolong Liang, Yì N. Wáng

发表机构 * School of Philosophy, Shanxi University, Taiyuan, Shanxi, China(山西大学哲学学院) School of Philosophy and Social Development, Shandong University, Jinan, Shandong, China(山东大学哲学与社会发展学院)

AI总结 本文提出一类认知逻辑,通过加权模型系统引入“认知技能”度量,将知识获取建模为技能提升、遗忘建模为技能下降,并研究可知性与可遗忘性以及de re与de dicto表达的区别,分析了模型检测和可满足性的计算复杂性。

Journal ref Logical Methods in Computer Science, Volume 22, Issue 2 (May 25, 2026) lmcs:15460

详情
AI中文摘要

本文提出了一类认知逻辑,用于捕捉获取知识和陷入遗忘的动态过程,同时融入群体知识的概念。该方法基于加权模型系统,引入“认知技能”度量来表示与知识更新相关的认知能力。在此框架内,知识获取被建模为技能提升的过程,而遗忘则被表示为技能下降的结果。该框架进一步支持探索“可知性”和“可遗忘性”,分别定义为通过技能提升获得知识的潜力和通过技能下降陷入遗忘的潜力。此外,它还支持对认知de re与de dicto表达之间区别的详细分析。研究了模型检测和可满足性问题的计算复杂性,提供了对其理论基础和实际意义的洞察。

英文摘要

This paper presents a class of epistemic logics that captures the dynamics of acquiring knowledge and descending into oblivion, while incorporating concepts of group knowledge. The approach is grounded in a system of weighted models, introducing an "epistemic skills" metric to represent the epistemic capacities tied to knowledge updates. Within this framework, knowledge acquisition is modeled as a process of upskilling, whereas oblivion is represented as a consequence of downskilling. The framework further enables exploration of "knowability" and "forgettability," defined as the potential to gain knowledge through upskilling and to lapse into oblivion through downskilling, respectively. Additionally, it supports a detailed analysis of the distinctions between epistemic de re and de dicto expressions. The computational complexity of the model checking and satisfiability problems is examined, offering insights into their theoretical foundations and practical implications.

2604.18103 2026-05-27 cs.AI

Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

稳定性意味着冗余:Delta注意力选择性停止用于高效长上下文预填充

Yujie Chen, Tailai Chen, Yifeng Gao, Zoe Wanying He, Yijue Xu, Shaobo Wang, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of California, San Diego(加州大学圣地亚哥分校) Carnegie Mellon University(卡内基梅隆大学)

AI总结 针对长上下文场景中预填充计算成本高的问题,提出一种无需训练的Delta注意力选择性停止策略(DASH),通过监控自注意力层更新动态来停止稳定令牌的处理,从而在不牺牲模型准确性和硬件效率的前提下实现预填充加速。

Comments Accepted to ACL 2026 main conference

详情
AI中文摘要

预填充计算成本在长上下文设置中对大型语言模型(LLMs)和大型多模态模型(LMMs)构成了显著瓶颈。虽然令牌剪枝减少了序列长度,但先前的方法依赖于启发式规则,这些规则与FlashAttention等硬件高效内核不兼容。在这项工作中,我们观察到令牌会向\textit{语义固定点}演化,使得进一步处理变得冗余。为此,我们引入了Delta注意力选择性停止(DASH),这是一种无需训练的策略,通过监控自注意力机制的逐层更新动态来选择性停止已稳定的令牌。大量评估证实,DASH在语言和视觉基准测试中具有泛化能力,在保持模型准确性和硬件效率的同时,实现了显著的预填充加速。代码将在https://github.com/verach3n/DASH.git发布。

英文摘要

Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward \textit{semantic fixing points}, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git.

2510.06133 2026-05-27 cs.CL cs.AI

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

CreditDecoding: 利用轨迹信用加速扩散大语言模型中的并行解码

Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, Weiyao Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团) Westlake University(西湖大学)

AI总结 针对扩散大语言模型并行解码中正确令牌被反复重掩导致冗余迭代的问题,提出基于轨迹信用的无训练并行解码方法CreditDecoding,融合历史证据与当前logits提升低置信度正确令牌的置信度,实现高达5.48倍加速并提升准确性。

Comments 19 pages, 13 figures, 9 tables, Accepted to ACL 2026 main conference

详情
AI中文摘要

扩散大语言模型(dLLMs)通过迭代去噪生成文本。在普遍采用的并行解码方案中,每一步仅确认高置信度位置,而重掩其他位置。通过分析dLLM去噪轨迹,我们发现一个关键的低效问题:模型通常在目标令牌的置信度足够高以被解码之前的几个步骤就预测出正确令牌。这种早期预测与后期解码之间的差距导致已正确的令牌被反复重掩,造成冗余迭代并限制加速。为利用这种时间冗余,我们引入轨迹信用(Trace Credit),通过累积历史证据来量化令牌的解码潜力。基于此,我们提出CreditDecoding,一种无训练的并行解码方法,将轨迹信用与当前logits融合,以提升正确但低置信度令牌的置信度,从而加速去噪并提高鲁棒性。在八个基准测试上,CreditDecoding在LLaDA-8B上实现了高达5.48倍的加速和+0.48的准确率提升,并在多种dLLM架构和参数规模上持续改进性能。它还能扩展到长上下文,并与主流推理优化方法正交,使其成为一种实用且广泛适用的解决方案。

英文摘要

Diffusion large language models (dLLMs) generate text through iterative denoising. In commonly adopted parallel decoding schemes, each step confirms only high-confidence positions while remasking the others. By analyzing dLLM denoising traces, we uncover a key inefficiency: models often predict the correct target token several steps before its confidence becomes high enough to be decoded. This gap between early prediction and late decoding forces repeated remasking of already-correct tokens, causing redundant iterations and limiting acceleration. To exploit this temporal redundancy, we introduce Trace Credit to quantify a token's decoding potential by accumulating historical evidence. Building on this, we propose CreditDecoding, a training-free parallel decoding method that fuses Trace Credit with current logits to boost the confidence of correct but underconfident tokens, thereby accelerating denoising and improving robustness. On eight benchmarks, CreditDecoding achieves up to 5.48 times speedup with +0.48 accuracy on LLaDA-8B and consistently improves performance across diverse dLLM architectures and parameter scales. It further scales to long contexts and remains orthogonal to mainstream inference optimizations, making it a practical and widely applicable solution.

2604.14684 2026-05-27 cs.CV

DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

DETR-ViP:具有鲁棒判别性视觉提示的检测Transformer

Bo Qian, Dahu Shi, Xing Wei

发表机构 * School of Software Engineering, Xi’an Jiaotong University(西安交通大学软件工程学院) State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University(西安交通大学人机混合增强智能国家重点实验室) CCAI, Zhejiang University(浙江大学计算机辅助设计与智能交互研究院) Hikrobot Co., Ltd.(海康机器人有限公司)

AI总结 提出DETR-ViP框架,通过全局提示集成和视觉-文本提示关系蒸馏学习可区分视觉提示,并采用选择性融合策略实现鲁棒检测,在多个数据集上显著提升视觉提示检测性能。

Comments Published as a conference paper at ICLR 2026

AI-generated Chinese abstract

视觉提示目标检测能够交互式且灵活地定义目标类别,从而促进开放词汇检测。由于视觉提示直接来源于图像特征,在识别稀有类别时通常优于文本提示。然而,视觉提示检测的研究很大程度上被忽视,通常被视为训练文本提示检测器的副产品,这阻碍了其发展。为充分释放视觉提示检测的潜力,我们研究了其性能次优的原因,并揭示根本问题在于视觉提示缺乏全局可区分性。受这些观察启发,我们提出DETR-ViP,一个鲁棒的目标检测框架,能够产生类别可区分的视觉提示。在基础图像-文本对比学习之上,DETR-ViP结合了全局提示集成和视觉-文本提示关系蒸馏,以学习更具判别性的提示表示。此外,DETR-ViP采用选择性融合策略,确保稳定且鲁棒的检测。在COCO、LVIS、ODinW和Roboflow100上的大量实验表明,DETR-ViP在视觉提示检测中相比其他最先进方法取得了显著更高的性能。一系列消融研究和分析进一步验证了所提出改进的有效性,并揭示了视觉提示检测能力增强的潜在原因。

English abstract

Visual-prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual-prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text-prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate why its performance is suboptimal and find that the underlying issue is the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual-prompted detection than other state-of-the-art methods. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.
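The "basic image-text contrastive learning" that the abstract builds on can be sketched as a symmetric InfoNCE-style loss between visual-prompt and text-prompt embeddings. This is a generic illustration, not DETR-ViP's training objective: the embeddings, the `temperature` value, and the function names are assumptions, and the paper's global prompt integration, relation distillation, and selective fusion are not modeled here.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(sim):
    """Mean of -log softmax(row_i)[i]: row i's matching column is i."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)

def contrastive_loss(visual, text, temperature=0.07):
    """Symmetric contrastive loss; visual[i] and text[i] are the same class."""
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in text]
    # Temperature-scaled cosine similarity between every visual/text pair.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
           for vi in v]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))

# Hypothetical 3-class embeddings: row i of each list belongs to class i.
visual_prompts = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
text_prompts = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = contrastive_loss(visual_prompts, text_prompts)
# Rotating the text rows breaks the class pairing, so the loss should rise.
shuffled = contrastive_loss(visual_prompts, text_prompts[1:] + text_prompts[:1])
print(f"aligned={aligned:.4f} shuffled={shuffled:.4f}")
```

The loss is low when each visual prompt is closest to its own class's text prompt and high when the pairing is broken. That gives a simple way to quantify how class-distinguishable a set of prompts is.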

2604.14640 2026-05-27 cs.CL cs.AI

Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models

Fact4ac在金融虚假信息检测挑战赛中的方法:通过微调和少样本提示的大语言模型实现无参考金融虚假信息检测

Cuong Hoang, Le-Minh Nguyen

Affiliations: KaiNKaiho

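AI summary: Proposes a large language model framework that combines zero-shot/few-shot prompting with LoRA parameter-efficient fine-tuning for financial misinformation detection without external evidence, reaching 95.4% and 96.3% accuracy on the public and private test sets respectively and ranking first in the competition.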

Journal ref: Proceedings of the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD 2026), 20th International AAAI Conference on Web and Social Media

AI-generated Chinese abstract

金融虚假信息的泛滥对市场稳定和投资者信任构成严重威胁,误导市场行为并造成关键信息不对称。检测此类误导性叙述本身具有挑战性,尤其是在现实场景中,外部证据或用于交叉验证的补充参考资料严格不可用。本文介绍了我们在“无参考金融虚假信息检测”共享任务中的获胜方法。该任务基于最近提出的RFC-BENCH框架(Jiang等人,2026),挑战模型仅依赖内部语义理解和上下文一致性而非外部事实核查来判断金融声明的真实性。为应对这一艰巨的评估设置,我们提出了一个综合框架,利用最先进的大语言模型(LLM)的推理能力。我们的方法系统地集成了上下文学习(特别是零样本和少样本提示策略)以及通过低秩适应(LoRA)的参数高效微调(PEFT),以最优方式使模型与金融操纵的微妙语言线索对齐。我们提出的系统表现出卓越效果,成功在两个官方排行榜上均获得第一名。具体来说,我们在公开测试集上达到95.4%的准确率,在私有测试集上达到96.3%的准确率,突显了我们方法的鲁棒性,并有助于加速金融自然语言处理中上下文感知的虚假信息检测。我们的模型(14B和32B)可在https://huggingface.co/KaiNKaiho获取。

English abstract

The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the "Reference-Free Financial Misinformation Detection" shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al., 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to align the models with the subtle linguistic cues of financial manipulation. Our proposed system secured first place on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing. Our models (14B and 32B) are available at https://huggingface.co/KaiNKaiho.
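Two ingredients named in the abstract above, few-shot prompting and LoRA, can be sketched in a few lines. This is an illustrative sketch only: the prompt wording, the `TRUE`/`FALSE` label set, the example claims, and the 4096-dimensional layer size are all hypothetical and do not come from the paper or from the shared task. The one general fact used is that a rank-r LoRA adapter on a d_out × d_in weight adds r·(d_out + d_in) trainable parameters.

```python
def build_prompt(claim, examples=()):
    """Build a zero-shot (no examples) or few-shot classification prompt."""
    lines = [
        "Decide whether the financial claim is TRUE or FALSE "
        "using only the claim text.",
        "",
    ]
    # Each labeled demonstration precedes the query in the same format.
    for text, label in examples:
        lines += [f"Claim: {text}", f"Answer: {label}", ""]
    lines += [f"Claim: {claim}", "Answer:"]
    return "\n".join(lines)

def lora_trainable_params(d_out, d_in, rank):
    """Parameters added by LoRA beside one frozen d_out x d_in weight.

    LoRA learns B (d_out x rank) and A (rank x d_in), so the count is
    rank * (d_out + d_in).
    """
    return rank * (d_out + d_in)

# Invented demonstrations, for illustration only.
EXAMPLES = [
    ("This fund guarantees a 40% monthly return with zero risk.", "FALSE"),
    ("The company disclosed its quarterly revenue in a regulatory filing.", "TRUE"),
]

zero_shot = build_prompt("Shares will triple by Friday, insiders confirm.")
few_shot = build_prompt("Shares will triple by Friday, insiders confirm.", EXAMPLES)
print(few_shot)

# Hypothetical 4096 x 4096 projection with rank 16.
full = 4096 * 4096
added = lora_trainable_params(4096, 4096, 16)
print(f"LoRA trains {added} of {full} weights ({100 * added / full:.2f}%)")
```

The same `build_prompt` function covers both prompting modes, so zero-shot and few-shot runs differ only in the `examples` argument. The parameter count shows why low-rank adapters make fine-tuning models at the 14B and 32B scale comparatively cheap.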