arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30348 2026-05-29 cs.CL cs.AI cs.LG Updated version

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen

Affiliations: VILA Lab, MBZUAI; UCL

AI summary: Proposes the LLMSurgeon framework, which uses an inverse-problem formulation to estimate the domain distribution of a target LLM's pretraining corpus from its generated text, enabling post-hoc auditing without access to the training data.

Comments ACL 2026 Main. Code at https://github.com/Yaxin9Luo/LLMSurgeon

Abstract

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a strong framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. To evaluate, we introduce $\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.
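
The label-shift inversion described in this abstract can be illustrated with a minimal sketch. It assumes a known row-stochastic soft confusion matrix `confusion[i, j] = P(predicted domain j | true domain i)` and an observed vector of averaged classifier outputs; the function names, the least-squares objective, and the projected-gradient solver are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    n = v.size
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, n + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def recover_mixture(confusion, observed, steps=5000):
    """Solve min_pi ||confusion.T @ pi - observed||^2 with pi on the simplex.

    Uses projected gradient descent (an assumed solver for this sketch).
    """
    k = confusion.shape[0]
    A = confusion.T
    pi = np.full(k, 1.0 / k)                  # start from the uniform mixture
    lr = 1.0 / (np.linalg.norm(A, 2) ** 2)    # step size 1/L for the quadratic
    for _ in range(steps):
        grad = A.T @ (A @ pi - observed)
        pi = project_to_simplex(pi - lr * grad)
    return pi
```

Directly aggregating classifier outputs would return `observed` itself, which is biased whenever the confusion matrix is not the identity; the constrained inversion corrects for that systematic confusion.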

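2605.30345 2026-05-29 cs.AI cs.CL cs.LG Updated version

SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations

Qinpei Luo, Ruichun Ma, Xinyu Zhang, Lili Qiu

Affiliations: University of California, San Diego; Microsoft Research Asia; The University of Texas at Austin

AI summary: Proposes SchGen, the first large language model that generates editable PCB schematics from natural-language requests. A semantically grounded code representation turns a geometry-driven generation problem into a semantics-driven matching task, and the authors build a large-scale dataset; SchGen significantly outperforms existing approaches on wire connectivity accuracy and functional correctness.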

Comments 19 pages, 7 figures

Abstract

Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from natural-language intent is largely unexplored. This paper presents SchGen, the first large language model that generates editable PCB schematics from natural-language requests. The key challenge lies in the lack of an LLM-suited representation and a large-scale dataset. Current schematic formats are dominated by verbose, tool-specific syntax and geometry-heavy descriptions, making them difficult to generate reliably. We introduce a semantically grounded code representation that encodes schematic editing primitives with relative placement and pin-name-based wiring, transforming a geometry-driven generation problem into a semantics-driven matching task amenable to LLMs. We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation. Experiments show that SchGen significantly outperforms alternative representations and even larger general-purpose LLMs on wire connectivity accuracy and functional correctness. Our results highlight the critical role of representation design in enabling generative models for complex hardware design tasks.

2605.30343 2026-05-29 cs.CL cs.AI Updated version

Unlocking the Working Memory of Large Language Models for Latent Reasoning

Lukas Aichberger, Sepp Hochreiter

Affiliations: ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; NXAI GmbH, Linz, Austria

AI summary: Proposes RiM, a latent reasoning method that replaces the autoregressive generation of intermediate reasoning steps with fixed memory blocks, enabling compute-efficient latent reasoning in a single forward pass.

Comments Preprint

Abstract

To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby conflates internal computation with external communication. In contrast, human cognition can use working memory to hold and manipulate information internally without the need to externalize intermediate thoughts. Drawing on this principle, we introduce Reasoning in Memory (RiM), a latent reasoning method that replaces the autoregressive generation of reasoning steps with memory blocks. These memory blocks are fixed sequences of special tokens that unlock the working-memory capacity of large language models. Since they are fixed rather than generated, they can be processed in a single forward pass, enabling compute-efficient latent reasoning. To operationalize these memory blocks, we employ a two-stage curriculum. First, we ground them by predicting explicit reasoning steps after each memory block. Second, we discard this step-level supervision and iteratively refine the final answer after each memory block. Our experiments on reasoning benchmarks show that, across language models of different families and sizes, RiM matches or exceeds existing latent reasoning methods while avoiding the autoregressive generation of thoughts. These results demonstrate that large language models can be trained to use working memory as an effective mechanism for latent reasoning.

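2605.30335 2026-05-29 cs.AI cs.CL Updated version

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Anany Kotawala

Affiliations: Princeton University

AI summary: Formalizes failures in multi-component LLM agents that are locally coherent but globally incoherent, proposes the compositional residual eps* to measure incoherence, and, using hierarchical projection repair and sequential coherence monitoring, finds in experiments that incoherence is widespread and affects decisions.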

Comments 25 pages, 7 figures, 24 tables. Preliminary versions to appear at the ICML 2026 Workshops on Combining Theory and Benchmarks (CTB), Statistical Frameworks for Uncertainty in Agentic Systems (AgenticUQ), and Failure Modes of Agentic AI (FAGEN)

Abstract

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the compositional residual eps*, the L2 distance from the composed quote to the joint coherent polytope, computable at runtime from system output and the declared cross-component coupling constraints. A product-structure dichotomy characterises when local coherence suffices, and a Rayleigh-quotient prediction matches the observed residual within 7% on three of four relation classes. A hierarchical Boyle-Dykstra projection repairs the composition deterministically; an anytime-valid e-process gives sequential coherence monitoring. Across 1,876 ensemble cliques on a four-LLM mid-tier panel (frontier-panel rerun in Section 5.5), eps* > 0 on 33-94% of cliques, translating to +0.115 nats per bet of regret on 1,770 resolved bets under the proportional allocation rule (the gain collapses to +0.006 under bettors that themselves coherentise). Three intuitive LLM-side mitigations (retrieval, partition-aware prompting, aggregator-LLM) each fail or regress.
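
A toy sketch of the residual and the projection repair, under a deliberately simple assumption: each component quotes the probability of one event, and the declared coupling constraint is that the events are mutually exclusive and exhaustive. The coherent set is then the intersection of the hyperplane `sum(x) = 1` with the box `[0, 1]^n`. The code uses plain Dykstra alternating projections; the paper's hierarchical variant and its general polytope are not reproduced here, and all function names are illustrative.

```python
import numpy as np

def project_sum_to_one(x):
    """Project onto the hyperplane {x : sum(x) = 1}."""
    return x - (x.sum() - 1.0) / x.size

def project_box(x):
    """Project onto the box [0, 1]^n."""
    return np.clip(x, 0.0, 1.0)

def dykstra_repair(quote, iters=200):
    """Dykstra's alternating projections onto {sum = 1} intersected with [0, 1]^n."""
    x = np.asarray(quote, dtype=float).copy()
    p = np.zeros_like(x)  # correction term for the hyperplane
    r = np.zeros_like(x)  # correction term for the box
    for _ in range(iters):
        y = project_sum_to_one(x + p)
        p = x + p - y
        x = project_box(y + r)
        r = y + r - x
    return x

def residual(quote):
    """L2 distance from the composed quote to the coherent set (toy eps*)."""
    q = np.asarray(quote, dtype=float)
    return float(np.linalg.norm(q - dykstra_repair(q)))
```

For example, three components that quote 0.7, 0.5, and 0.1 are each valid probabilities on their own, yet the composed quote sums to 1.3, so the residual is positive and the repair shifts the quote back onto the coherent set.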

2605.30334 2026-05-29 cs.AI cs.CL Updated version

Demystifying Data Organization for Enhanced LLM Training

Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang, Wenshan Wu, Qihao Zhao, Hao Li, Yuanyuan Gao, Kim-Hui Yap, Scarlett Li

Affiliations: Nanyang Technological University; Microsoft Research; The Hong Kong University of Science and Technology

AI summary: Systematically explores how data organization affects LLM training, proposes four guidelines (Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity), and designs two new data ordering methods, STR and SAW, whose effectiveness is validated in both pre-training and fine-tuning.

Comments ACL 2026 Main Conference

Abstract

Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily reliant on effective data curation. While data selection has been widely studied, the strategic data organization for enhanced training remains an underexplored area, particularly since current LLMs are often trained for only one or a few epochs. This paper systematically explores the influence of data organization on LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, thereby incurring minimal additional computational overhead. We identify and formalize four key guidelines for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by them, we introduce two novel data ordering methods termed STR and SAW. Extensive experiments across different model scales and data sizes, encompassing both pre-training and SFT stages, validate the effectiveness of our summarized guidelines. They also demonstrate the robustness of our proposed data ordering methods in enhancing the stability and performance of LLM training. Github Link: https://github.com/microsoft/data-efficacy/

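2605.30333 2026-05-29 cs.CL Updated version

COMPOSE: Composing Future Theorems from Citations and Formal Structure

David Busbib, Michael Werman

Affiliations: Hebrew University of Jerusalem

AI summary: Proposes COMPOSE, a dual-graph framework that combines the scientific citation graph with the formal theorem dependency graph to generate grounded future mathematical theorems; experiments show it outperforms baselines.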

Abstract

A plausible future mathematical claim must satisfy two constraints: it should follow the direction of prior work and respect the formal dependencies that constrain what can validly follow. Existing approaches typically model only one of these sources, producing claims that are either weakly grounded or insufficiently motivated. We introduce grounded future mathematical generation, where the goal is to generate a plausible future theorem-like claim for an anchor paper using two complementary sources of context: its scientific citation graph and aligned formal theorem dependency graph. To address this setting, we propose COMPOSE, a dual-graph framework that conditions a language model on both scientific citation context and formal theorem structure. To support this setting, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024--2025. Experiments show that COMPOSE outperforms strong baselines on retrieval to real future papers and achieves the best overall performance under LLM-judge evaluation, producing more grounded and mathematically richer outputs. These results show that future mathematical generation benefits from combining scientific context with formal structure. Project page is available at https://david-busbib.github.io/COMPOSE-page/.

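2605.30327 2026-05-29 cs.LG cs.AI cs.CL math.ST stat.ML stat.TH Updated version

Reasoning with Sampling: Cutting at Decision Points

Felix Zhou, Anay Mehrotra, Quanquan C. Liu

Affiliations: Yale University; Stanford University

AI summary: Proposes the Entropy-Cut Metropolis-Hastings algorithm, which uses the base model's next-token entropy as a proxy to identify key decision points and resample from them, sampling efficiently from the power distribution to strengthen reasoning; it outperforms baselines and RL-trained models on several benchmarks.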

Abstract

Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, e.g., trying different reasoning strategies. The samplers proposed in prior works repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
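
The contrast between uniform cuts and entropy-guided cuts can be sketched as follows. The abstract does not give the exact proposal rule, so weighting each position by its positive entropy jump is an assumption made here for illustration, and `token_entropies` and `cut_distribution` are hypothetical names. A full Metropolis-Hastings sampler would also need an acceptance step that accounts for this non-uniform proposal, which is omitted.

```python
import numpy as np

def token_entropies(probs):
    """Shannon entropy (nats) of each next-token distribution.

    probs has shape (T, V): one distribution over V tokens per position.
    """
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

def cut_distribution(entropies):
    """Assumed proposal: weight each position by its positive entropy jump.

    Falls back to the uniform cut distribution when there is no jump.
    """
    entropies = np.asarray(entropies, dtype=float)
    jumps = np.maximum(np.diff(entropies, prepend=entropies[0]), 0.0)
    total = jumps.sum()
    if total == 0.0:
        return np.full(entropies.size, 1.0 / entropies.size)
    return jumps / total
```

With a trace in which only one position has a near-uniform next-token distribution, this proposal concentrates its mass on that position, whereas a uniform cut would spend most proposals on low-entropy positions where resampling changes little.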

2605.30324 2026-05-29 cs.DS cs.AI cs.CL cs.LG stat.ML Updated version

On Language Generation in the Limit with Bounded Memory

Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas

Affiliations: Cornell University; Stanford University; Google Research

AI summary: Studies language generation in the limit under bounded memory, using combinatorial bounds and sliding-window analysis to characterize how memory constraints affect generability, density, and identification.

Comments The abstract has been shortened to fit within the arXiv limit

Abstract

We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators, that is, the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an "approximate" version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.

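2605.30315 2026-05-29 cs.CL cs.LG Updated version

Resolution Diagnostics for Paired LLM Evaluation

Anany Kotawala

Affiliations: Princeton University

AI summary: Addresses the finding that many pairwise rankings on public LLM leaderboards do not meet a conventional paired-test resolution target. Proposes a hypothesis-testing framework for paired evaluation with the resolution ratio q = N/N* as the primary diagnostic, and shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut is off by roughly a factor of two in the close-comparison regime.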

Comments 16 pages, 7 figures, 12 tables. Accepted to the ICML 2026 Workshop on Hypothesis Testing, Seoul, South Korea, 2026. Copyright 2026 by the author(s)

Abstract

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6/9 under real subject-level clustering and stays at 5-6 out of 9 in 99.9% of category-bootstrap resamples. We frame paired LLM evaluation as a hypothesis-testing problem, invert level-alpha, power-(1-beta) tests, and report a per-pair resolution ratio q = N/N* as the primary diagnostic. A sharp small-effect expansion with an explicit second-order constant shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime, a deficit that three of five off-the-shelf calculators (Cohen 1988, G*Power, R pwr) silently inherit when the user post-multiplies their per-arm output by (1-rho). The unresolved-pair pattern remains under multiplicity correction and anytime-valid sequential testing.

2605.30295 2026-05-29 cs.CL cs.AI Updated version

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

Affiliations: System Inc.

AI summary: Proposes a pipeline that generates clinically realistic HL7 FHIR R4 data from unstructured text and uses it to build the MedCase-Structured dataset; finds that LLM diagnostic accuracy is lower on structured FHIR inputs than on plain text, underscoring the importance of deployment-aligned benchmarking.

Comments Accepted to ICML 2026 Structured Data for Health Workshop

Abstract

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems. The pipeline combines staged LLM generation with terminology-grounded validation and repair to reduce hallucinated codes and enforce structural and semantic consistency. Applying this approach to MedCaseReasoning, we construct MedCase-Structured, a synthetic dataset aligned with clinician-authored diagnostic cases, achieving valid FHIR generation for 82.5% of cases. Evaluation on MedCase-Structured reveals consistently lower diagnostic accuracy for LLMs on structured FHIR inputs than with plain text, highlighting the importance of deployment-aligned benchmarking.

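2605.21235 2026-05-29 cs.CL Updated version

LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

Redacted by arXiv

AI summary: Proposes LamPO, which uses a pairwise decomposed advantage and confidence weighting to improve credit assignment and training stability when applying reinforcement learning with verifiable rewards to reasoning language models.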

Comments arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission. Author list and submitter name redacted due to disputed authorship

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose \textbf{LamPO}, a \textbf{Lambda-Style Policy Optimization} method that replaces scalar group advantages with a \emph{Pairwise Decomposed Advantage}. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining the critic-free and clipped-update structure of PPO-style optimization. When reference solutions are available, we further add a lightweight ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show that LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.
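
A minimal sketch of a pairwise decomposed advantage with a confidence-aware weight. The abstract does not state the formula, so everything specific here is an assumption: the weight is taken to be a sigmoid that shrinks a comparison when the policy's sequence log-probabilities already favor the higher-reward response (in the style of LambdaRank), and the sum is normalized by group size minus one. The function name is hypothetical.

```python
import numpy as np

def pairwise_advantages(rewards, seq_logps):
    """Advantage of each response as a weighted sum of pairwise reward gaps.

    rewards:   shape (G,), scalar reward per sampled response
    seq_logps: shape (G,), sequence log-probability under the current policy
    """
    r = np.asarray(rewards, dtype=float)
    lp = np.asarray(seq_logps, dtype=float)
    gap = r[:, None] - r[None, :]                    # gap[i, j] = r_i - r_j
    # Positive margin: the policy already prefers the higher-reward response.
    margin = np.sign(gap) * (lp[:, None] - lp[None, :])
    weight = 1.0 / (1.0 + np.exp(margin))            # sigmoid(-margin), assumed form
    return (weight * gap).sum(axis=1) / (r.size - 1)
```

Because the assumed weight is symmetric in each pair, the advantages sum to zero within a group, as with a mean baseline, while pairs the policy already ranks correctly contribute less.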

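2605.19416 2026-05-29 cs.CL Updated version

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

Redacted by arXiv

AI summary: Addresses GRPO's loss of fine-grained preference information caused by using the group mean as a baseline. Proposes LambdaPO, which decomposes advantage estimation into a pairwise preference structure and adds a semantic density reward, mining finer-grained optimization signals from group trajectories and improving reasoning performance.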

Comments arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission. Author list and submitter name redacted due to disputed authorship

Abstract

Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this issue, we introduce a novel framework, Lambda Policy Optimization (LambdaPO), that removes this information-theoretic bottleneck by re-conceptualizing advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy's own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optimum. Experimental results across challenging math reasoning and question-answering tasks demonstrate that LambdaPO improves performance compared to the baseline methods.

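2603.17942 2026-05-29 cs.CL Updated version

Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

Raghavv Goel, Mukul Gagrani, Mingu Lee, Chris Lott

Affiliations: Qualcomm AI Research

AI summary: Proposes ESP, which uses mask tokens in the embedding space for training-free multi-token prediction and achieves lossless decoding with higher throughput through parallel verification and lightweight pruning.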

Comments v2: Accepted at ICML 2026. Updated experiments replaced tok/s with speedup ratio over AR baseline; improved exposition in Section 3.1 (mask token initialization) and Section 4 (ablations)

Abstract

Large Language Models (LLMs) possess latent multi-token prediction (MTP) abilities despite being trained only for next-token generation. We introduce ESP (Embedding-Space Probing), a simple and training-free MTP method that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel future-token prediction without modifying weights or relying on draft models. ESP constructs a speculative token tree by sampling Top-K candidates from mask-token logits and applies a lightweight pruning rule to retain high-probability continuations. During generation, predictions are verified in parallel, yielding lossless decoding while significantly reducing model calls and increasing token throughput. ESP consistently outperforms existing training-free baselines, improving acceptance length by 7-11% over LADE on LLaMA3 and 7-8% on Qwen3, and increasing throughput by up to 15-19% over the strongest baseline. Finally, we provide theoretical insight and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.

2601.22139 2026-05-29 cs.CL cs.AI Updated version

Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang

Affiliations: National Key Laboratory for Novel Software Technology; Artificial Intelligence Research Institute; Shenzhen Institutes of Advanced Technology; University of California, Santa Cruz; University of Texas, Dallas; University of Minnesota

AI summary: Proposes the Proactive Interactive Reasoning (PIR) paradigm, which uses uncertainty-aware fine-tuning and user-simulator-based policy optimization so that an LLM proactively asks questions during reasoning to resolve premise- and intent-level uncertainty. On mathematical reasoning, code generation, and document editing it substantially improves accuracy, pass rate, and BLEU while cutting reasoning computation and unnecessary interaction by nearly half.

Comments ACL Main Conference

Abstract

Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while cutting reasoning computation and unnecessary interaction turns by nearly half. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1

2601.07525 2026-05-29 cs.CL cs.AI 版本更新

Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

先思考再约束:大型语言模型的统一解码框架

Ngoc Trinh Hung Nguyen, Alonso Silva, Laith Zumot, Liubov Tupikina, Armen Aghasaryan, Mehwish Alam

发表机构 * Télécom Paris, Institut Polytechnique de Paris(巴黎电信学院,巴黎理工学院) Nokia Bell Labs(诺基亚贝尔实验室) Nokia(诺基亚)

AI总结 提出In-Writing混合方法,通过触发令牌将自由形式推理与结构化解码解耦,在分类和推理任务上准确率提升高达27%。

Comments v2-EMNLP

详情
AI中文摘要

自然生成允许大型语言模型(LLMs)产生具有丰富推理的自由形式响应,但缺乏结构使得输出难以验证。相反,约束解码确保标准化格式,但可能在生成过程中过早施加约束,从而无意中限制推理能力。我们提出一种混合方法,即In-Writing,它在单次调用中结合了自由形式推理和结构化生成。模型首先执行无约束推理,仅在生成触发令牌后应用结构化解码,明确地将推理与格式化解耦。我们证明,我们的触发令牌策略能够几乎消除过早触发,即约束解码中断正在进行推理的失败模式。在涵盖分类和推理任务的多个数据集上的评估表明,我们的方法优于现有技术,在自然生成基础上准确率提升高达27%。我们的代码可在https://github.com/Nokia-Bell-Labs/InWriting获取。

英文摘要

Natural generation allows Large Language Models (LLMs) to produce free-form responses with rich reasoning, yet the lack of structure makes outputs difficult to verify. Conversely, constrained decoding ensures standardized formats but can inadvertently restrict reasoning capabilities by imposing constraints too early in the generation process. We propose a hybrid approach, namely In-Writing, that combines free-form reasoning and structured generation in a single call. The model first performs unconstrained reasoning and only applies structured decoding after a trigger token is generated, explicitly decoupling reasoning from formatting. We establish that our trigger-token strategies virtually eliminate premature triggering, a failure mode in which constrained decoding interrupts ongoing reasoning. Evaluations across diverse datasets covering classification and reasoning tasks demonstrate that our approach outperforms the state-of-the-art by achieving accuracy gains of up to 27% over natural generation. Our code is available at: https://github.com/Nokia-Bell-Labs/InWriting.
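The decoupling of reasoning from formatting can be illustrated with a minimal sketch. This is not the authors' implementation: the trigger string `<answer>`, the `step_fn` callback, the `allowed` label set, and the scripted `toy_model` are all hypothetical names introduced here, and a real system would constrain a full structured output rather than a single label token.

```python
from typing import Callable, Dict, List, Sequence

TRIGGER = "<answer>"  # hypothetical trigger token, not taken from the paper


def generate(step_fn: Callable[[List[str]], Dict[str, float]],
             allowed: Sequence[str],
             max_steps: int = 32) -> List[str]:
    """Decode freely until TRIGGER appears, then emit one constrained token.

    step_fn maps the tokens so far to a {token: score} dict and stands in
    for a real model's next-token distribution.
    """
    tokens: List[str] = []
    for _ in range(max_steps):
        scores = step_fn(tokens)
        if tokens and tokens[-1] == TRIGGER:
            # Structured phase: mask every token outside the allowed set.
            scores = {t: s for t, s in scores.items() if t in allowed}
            tokens.append(max(scores, key=scores.get))
            break
        # Free-form phase: plain greedy choice over the full vocabulary.
        tokens.append(max(scores, key=scores.get))
    return tokens


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Scripted scores: two reasoning tokens, then the trigger, then a label."""
    script = ["think", "more", TRIGGER]
    if len(tokens) < len(script):
        return {script[len(tokens)]: 1.0, "yes": 0.1, "no": 0.05}
    # After the trigger, the unconstrained favourite is "maybe", which the
    # constraint has to rule out.
    return {"maybe": 0.9, "yes": 0.6, "no": 0.3}


print(generate(toy_model, allowed=("yes", "no")))
```

Because the mask is applied only once the trigger has been emitted, the reasoning tokens are never filtered; applying the same mask from the first step would have forced a label before any reasoning was produced.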

2605.30274 2026-05-29 cs.CL cs.AI 版本更新

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Loong: 一种类人长文档翻译代理,具有观察与行动的适应性上下文选择

Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang, Min Zhang, Shimin Tao, Daimeng Wei, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)计算与智能研究院) NLP2CT Lab, Department of Computer and Information Science, University of Macau(澳门大学电脑及资讯科学系NLP2CT实验室) Huawei Translation Services Center(华为翻译服务中心)

AI总结 提出Loong代理,通过3E记忆模块和强化学习优化上下文策略,解决长文档翻译中上下文窗口限制和冗余信息问题,在英⇄中、德、法翻译中平均提升最高达13.0分。

详情
AI中文摘要

文档级翻译仍然是大型语言模型最具挑战性的任务之一,它们受到有限上下文窗口的限制,阻碍了全局连贯性,同时遭受冗余上下文信息的影响,降低了翻译质量。为了解决这个问题,我们提出了一种名为Loong的类人长文档翻译代理,它利用3E记忆模块(精华-示例-实体)存储摘要、句子对和实体记录作为历史上下文。Loong不是被动地关注所有历史,而是进行深度推理,自适应地识别翻译指导的最佳上下文。Loong通过强化学习优化其上下文策略,利用从其自身采样的观察与行动推理轨迹中得出的偏好数据。实证评估表明,Loong在英语⇄中文、德语和法语方向上实现了显著的翻译质量提升,在三个评估指标上平均提升高达13.0分。此外,Loong在跨领域和对抗上下文噪声方面表现出强大的泛化能力和鲁棒性,同时在超长文档翻译中保持显著的稳定性。我们的代码发布在https://github.com/YutongWang1216/LoongDocMT。

英文摘要

Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual information that degrades translation quality. To address this, we propose a human-like long document translation agent called Loong, which leverages a 3E memory module (Essence-Exemplar-Entity) to store summaries, sentence pairs, and entity records as historical context. Instead of passively attending to all history, Loong performs deep reasoning to adaptively identify the optimal context for translation guidance. Loong optimizes its context policy through reinforcement learning, utilizing preference data derived from its own sampled observe-and-act reasoning trajectories. Empirical evaluations demonstrate that Loong achieves substantial translation quality improvements in English $\Leftrightarrow$ Chinese, German, and French directions, with average gains of up to 13.0 points across the three evaluation metrics. Furthermore, Loong exhibits strong generalization across domains and robustness against contextual noise, while maintaining remarkable stability in ultra-long document translation. Our code is released at https://github.com/YutongWang1216/LoongDocMT.

2605.30273 2026-05-29 cs.HC cs.AI cs.CL cs.CY cs.SI 版本更新

LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

LLUMI: 利用在线社区反馈改进心理健康支持中的LLM写作辅助

Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar, Dong Whi Yoo, Eshwar Chandrasekharan, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Indiana University Indianapolis(印第安纳大学印第安纳波利斯分校)

AI总结 提出LLUMI框架,通过在线社区反馈(如Reddit投票)构建偏好对,结合监督微调和直接偏好优化训练开源小模型,在隐私保护下实现与GPT相当的心理健康支持性能。

详情
AI中文摘要

大型语言模型在生成心理健康问题的支持性回复方面展现出潜力,但提升其有用性、共情能力和安全性通常需要大量计算、专家输入和标注数据。同时,在心理健康相关交互中部署专有云模型会引发重要的隐私和数据治理问题。为解决这一挑战,我们提出了LLUMI设置,该设置可在受保护环境内部署。LLUMI包含两个互补组件:生成模型(GM)起草对心理健康问题的支持性回复,以及改进模型(IM)修改初始人工编写的回复。我们利用Reddit心理健康社区的反馈信号,使用社区认可模式(如点赞和点踩)构建用于监督微调和直接偏好优化的选择-拒绝回复对。我们还通过五个维度(可读性、共情、连接、可操作性和安全性)的人工评估进一步对齐LLUMI。结果表明,尽管依赖较小的开源模型而非专有云GPT模型,LLUMI在语言分析和人工评估中均实现了相当的性能。这些发现表明,使用社区衍生的偏好信号训练的开源模型可以支持高质量的心理健康支持辅助,同时为敏感的支持场景提供更保护隐私的替代方案。

英文摘要

Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data. At the same time, deploying proprietary, cloud-based models for mental health-related interactions raises important privacy and data-governance concerns, given the sensitivity of such conversations. To address this challenge, we introduce LLUMI, a setup that can be hosted in-house within protected environments. LLUMI consists of two complementary components: a generation model (GM), which drafts supportive responses to mental health queries, and an improvement model (IM), which revises an initial human-crafted response. We leverage feedback signals from Reddit mental health communities, using community endorsement patterns such as upvotes and downvotes to construct chosen-rejected response pairs for Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO). We further align LLUMI using human evaluation across five dimensions: readability, empathy, connection, actionability, and safety. Our results show that, despite relying on smaller open-source models rather than proprietary cloud-based GPT models, LLUMI achieves comparable performance across linguistic analyses and human evaluations. These findings suggest that open-source models, when trained with community-derived preference signals, can support high-quality mental health support assistance while offering a more privacy-preserving alternative for sensitive support contexts.

2605.30265 2026-05-29 cs.CV cs.CL 版本更新

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

LoMo: 局部模态替换以实现更深的视觉-语言融合

Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Jiao Tong University(上海交通大学) University of Science and Technology of China(中国科学技术大学)

AI总结 针对视觉-语言模型在模态替换时性能下降的“载体敏感性”问题,提出局部模态替换(LoMo)数据策展范式,通过将文本片段动态渲染为图像来训练跨模态表示不变性,显著提升多模态推理与融合效果。

详情
AI中文摘要

视觉-语言模型(VLM)在广泛的理解和推理任务中取得了显著进展,这得益于旨在多模态融合的大规模图像-文本训练。理想情况下,将文本问题替换为其渲染图像对应物应基本不影响模型性能。然而,在实践中,这种模态替换会导致性能急剧下降。我们将这种“载体敏感性”问题归因于当前训练语料中固有的偏差。在图像描述、VQA、OCR和网络来源的交错数据等流行数据集中,文本和图像通常被组织成不同且不对称的角色,文本作为语言查询,图像作为视觉参考。这种数据偏差导致VLM在不同模态的信息获取上表现出不同的偏好。因此,VLM无法对齐语义等价内容在文本和视觉载体上的表示,使得模型推理在模态替换下变得脆弱。为了解决这个问题,我们提出了局部模态替换(LoMo),一种轻量级、架构无关的数据策展范式,旨在为语义等价的文本和图像载体之间的跨模态表示不变性提供监督。LoMo通过将单模态提示重新表述为无缝交错的跨模态序列来实现这一点。它动态选择目标文本跨度并将其重新表述为渲染图像,从而在“文本、视觉、文本”载体上保持相同的语义。在13个不同的多模态基准上的大量实验表明,LoMo显著改善了整体多模态推理,并实现了更深的跨模态融合。具体来说,它在基础模型上带来了一致的提升,在LLaVA-OneVision-1.5-8B上比标准SFT提高了2.67个百分点,在Qwen3.5-9B上提高了2.82个百分点。

英文摘要

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.

2605.30260 2026-05-29 cs.CL cs.AI cs.CV cs.LG 版本更新

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

LoRA如何记忆?大语言模型微调的参数记忆定律

Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Alibaba Group(阿里巴巴集团)

AI总结 本文提出参数记忆定律,揭示LoRA在微调中参数与序列长度对损失降低的幂律关系,并基于此设计MemFT优化策略提升记忆保真度与效率。

Comments Ongoing work

详情
AI中文摘要

大型语言模型(LLM)必须持续学习和更新知识,以在动态的真实世界环境中保持有效。虽然低秩适应(LoRA)被广泛用于此类记忆更新,但现有研究主要依赖于定性的下游评估,使得精确参数记忆的定量容量限制和潜在动态在很大程度上未被探索。为了弥合这一差距,我们在潜在空间中使用LoRA作为受控记忆容量探针,以系统量化精确参数记忆。我们引入了参数记忆定律,这是一个将损失降低ΔL与有效参数和序列长度联系起来的稳健幂律。在令牌级别,细粒度分析揭示了确定性相变,表明在贪婪解码下,预测概率p > 0.5构成逐字回忆的充分条件。基于这些见解,我们引入了MemFT,一种阈值引导的优化策略,该策略动态地将训练预算重新分配给低于阈值的令牌。实证评估表明,MemFT可以提高记忆保真度和效率。代码将在https://github.com/zjunlp/ParametricMemoryLaw发布。

英文摘要

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. We introduce the Parametric Memory Law, a robust power law linking loss reduction $\Delta L$ to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Driven by these insights, we introduce MemFT, a threshold-guided optimization strategy that dynamically redistributes the training budget toward sub-threshold tokens. Empirical evaluations demonstrate that MemFT can enhance memory fidelity and efficiency. Code will be released at https://github.com/zjunlp/ParametricMemoryLaw.
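The p > 0.5 condition follows from a simple argument: next-token probabilities sum to 1, so a token holding more than half of the mass cannot be tied or beaten, and greedy decoding must select it. If this holds at every position of the target sequence, each greedy step reproduces the target prefix and the argument repeats for the next position. The sketch below checks this on hand-written toy distributions; the function names and numbers are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List


def greedy_token(probs: Dict[str, float]) -> str:
    """Greedy decoding: pick the highest-probability token."""
    return max(probs, key=probs.get)


def recall_guaranteed(step_probs: List[Dict[str, float]],
                      target: List[str],
                      threshold: float = 0.5) -> bool:
    """True if every target token exceeds the threshold at its position."""
    return all(p.get(t, 0.0) > threshold for p, t in zip(step_probs, target))


steps = [
    {"the": 0.70, "a": 0.20, "an": 0.10},
    {"cat": 0.55, "dog": 0.45},
]
target = ["the", "cat"]

# Every target token clears 0.5, so greedy decoding reproduces the target.
assert recall_guaranteed(steps, target)
assert [greedy_token(p) for p in steps] == target

# The condition is sufficient, not necessary: 0.40 is below the threshold
# yet still wins under greedy decoding here.
weak = [{"the": 0.40, "a": 0.35, "an": 0.25}]
assert not recall_guaranteed(weak, ["the"])
assert greedy_token(weak[0]) == "the"
```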

2605.30256 2026-05-29 cs.CV cs.CL cs.HC 版本更新

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

VideoFDB: 评估对话代理中的全双工视觉-语音能力

Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello

发表机构 * NVIDIA David AI

AI总结 提出首个全双工视听到视听(AV2AV)对话基准VideoFDB,通过237个真实视频片段、感知与生成行为分类以及基于评分规则的LM评判框架,系统评估代理在非语言对话动态中的表现,发现现有系统存在字幕崩溃和视觉流忽视等缺陷。

Comments Project page: https://research.nvidia.com/labs/amri/projects/video-fdb/

详情
AI中文摘要

自然的人类对话是全双工且视听融合的:人们同时说话和倾听,同时持续解读并产生非语言线索,如点头、微笑和手势。为了支持成功的人机交互,代理必须建模全双工视听对话;然而,现有的全双工基准仅评估语音。在这项工作中,我们提出了VideoFDB,这是首个评估全双工视听到视听(AV2AV)对话代理的基准。VideoFDB贡献了:(i) 237个来自真实世界视频通话的二元片段,涵盖11种非语言对话动态;(ii) 将感知行为与生成行为分离的分类法;(iii) 基于评分规则的LM评判评估框架,具有可解释的轴,用于评估关于非语言对话动态的对话质量。在开源和闭源的视觉-语音代理中,我们发现了系统性的失败模式:字幕崩溃和视觉流忽视,并且我们表明当前系统利用视觉进行显式视觉问答,但不用于自然对话中所需的流式联合视听基础。我们进一步评估了级联的语音到虚拟形象系统,发现其架构从根本上排除了全双工非语言线索的产生。作为全双工AV2AV交互的首个基准,VideoFDB为系统评估奠定了基础,我们希望这将加速下一代多模态对话代理的进步和发展。

英文摘要

Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.

2605.30251 2026-05-29 cs.CL cs.AI 版本更新

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

相同证据,不同答案:面向多轮语言模型的规范上下文在线策略蒸馏

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo

发表机构 * Zhejiang University(浙江大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出规范上下文在线策略蒸馏(CCOPD)方法,通过教师-学生框架对齐模型在完整提示和逐步揭示信息下的行为,减少自我锚定漂移,仅在多轮数学对话上训练后,在原始分片任务上取得平均32%的相对性能提升。

详情
AI中文摘要

大型语言模型(LLMs)通常在单次提示中给出所有指令时能解决任务,但当相同信息在多个轮次中逐步揭示时却会失败。当干净的完整提示和原始分片对话包含相同的完整用户证据时,模型仍应得出相同的答案。我们认为造成这一差距的关键原因是自我锚定漂移:在部分信息下产生的响应引入了未经支持的假设,而这些假设随后扭曲了最终答案。为了减少这种影响,我们提出了规范上下文在线策略蒸馏(CCOPD)。在训练过程中,同一基础模型扮演两个角色:一个冻结的教师模型,以干净的完整提示为条件;一个可训练的学生模型,通过多轮对话逐步接收相同的证据;CCOPD将学生在其自身轨迹上的行为与教师的规范全上下文行为对齐。仅在数学问题对话上训练后,CCOPD在数学和五个零样本跨领域任务族上的原始分片性能相比原始基础模型平均提升32%,同时基本保持全上下文性能。进一步分析表明,CCOPD增强了基于用户证据的推理,并减少了对早期助手轮次污染的敏感性。

英文摘要

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.
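One way to picture the alignment step is as a per-token divergence between the student's next-token distribution (conditioned on the sharded conversation) and the frozen teacher's distribution (conditioned on the full prompt), evaluated along a trajectory sampled from the student. The abstract does not state which divergence CCOPD uses; the sketch below assumes KL(student || teacher), a common choice for on-policy distillation, and all names and numbers are illustrative.

```python
import math
from typing import Dict, List

Dist = Dict[str, float]


def kl(p: Dist, q: Dist) -> float:
    """KL(p || q) over a shared vocabulary.

    Terms with p(x) = 0 contribute nothing; q(x) must be positive
    wherever p(x) is positive.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0.0)


def distill_loss(student_steps: List[Dist], teacher_steps: List[Dist]) -> float:
    """Mean per-token KL(student || teacher) along the student's own rollout."""
    per_token = [kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy case: after sharded turns the student is unsure between "4" and "5",
# while the full-context teacher is confident in "4".
student = [{"4": 0.5, "5": 0.5}]
teacher = [{"4": 0.9, "5": 0.1}]

print(round(distill_loss(student, teacher), 4))  # positive: behaviors differ
print(distill_loss(teacher, teacher))            # 0.0: behaviors match
```

Minimizing such a loss with respect to the student only (the teacher is frozen) pulls the multi-turn behavior toward the full-context behavior without changing what the teacher does.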

2605.30245 2026-05-29 cs.CL 版本更新

Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning

知道在如何解决之前该解决什么:预规划赋能的大语言模型数学推理

Shaojie Wang, Liang Zhang

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出PPC框架,通过引入显式的问题理解阶段(预规划)来弥补现有规划推理方法中“如何解决”与“该解决什么”之间的范式差距,在多个数学推理基准上取得最佳结果。

详情
AI中文摘要

当前的基于规划的推理方法通过在执行前插入规划阶段来改进大语言模型(LLMs),形成了问题→规划→思维链的范式。虽然有效,但仔细审视发现存在固有的范式级差距:规划和执行阶段都决定了如何解决问题,而之前的问题——该解决什么,即识别问题类型、适用工具和可预见的陷阱——仍然完全隐含。为弥补这一差距,我们提出PPC(预规划-规划-思维链),一个引入显式问题理解阶段(预规划)的框架,产生了新的问题→预规划→规划→思维链范式。实现这一范式需要在两端维护预规划的概念完整性。具体地,我们设计了一个三阶段合成流程,配备一个剧透分数检测器来过滤泄漏和剧透故障,以构建干净的预规划监督,并且一个复合GRPO奖励强制生成的规划真正遵循预规划。在四个骨干模型和五个数学推理基准上的实验表明,PPC在40个指标中的39个上取得了最佳结果,在不引入额外推理令牌开销的情况下,将maj@16和pass@16分别比最强基线提高了+2.23和+3.06。

英文摘要

Current plan-based reasoning methods improve large language models (LLMs) by inserting a planning stage before execution, giving rise to the question $\rightarrow$ plan $\rightarrow$ cot paradigm. While effective, a closer examination reveals an inherent paradigm-level gap: both the planning and its execution stages decide how to solve a problem, while the prior question of what to solve (recognizing the problem type, the applicable tools, and the foreseeable pitfalls) remains entirely implicit. To bridge this gap, we propose PPC (Preplan-Plan-CoT), a framework that introduces an explicit problem-understanding stage, the preplan, yielding a new question $\rightarrow$ preplan $\rightarrow$ plan $\rightarrow$ cot paradigm. Realizing this paradigm requires safeguarding the conceptual integrity of preplan at both ends. Specifically, we design a three-stage synthesis pipeline with a spoiler-score detector that filters out leakage and spoiler failures to build clean preplan supervision, and a composite GRPO reward enforces that the generated plan genuinely follows from the preplan. Experiments across four backbones and five mathematical reasoning benchmarks show that PPC achieves the best results on 39 of 40 metrics, improving maj@16 and pass@16 by +2.23 and +3.06 over the strongest baseline without introducing additional inference token overhead.
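The abstract reports maj@16 and pass@16 without defining them. Under their usual reading, pass@k asks whether any of k sampled answers is correct, and maj@k asks whether the most frequent sampled answer is correct. The sketch below assumes these definitions and exact string matching of final answers; the tie-breaking rule (first-seen answer wins) is also an assumption, and the paper's evaluation may differ.

```python
from collections import Counter
from typing import List


def pass_at_k(samples: List[str], gold: str) -> bool:
    """pass@k (simple empirical form): at least one sample is correct."""
    return gold in samples


def maj_at_k(samples: List[str], gold: str) -> bool:
    """maj@k: the most frequent sampled answer is correct.

    Counter.most_common breaks ties in favour of the answer seen first.
    """
    top_answer, _ = Counter(samples).most_common(1)[0]
    return top_answer == gold


# Eight sampled final answers for one problem (k = 8 for brevity).
samples = ["12", "12", "15", "12", "7", "15", "12", "9"]

print(pass_at_k(samples, "12"), maj_at_k(samples, "12"))  # True True
print(pass_at_k(samples, "15"), maj_at_k(samples, "15"))  # True False
```

Averaging these booleans over a benchmark gives the reported percentages; pass@k is always at least as high as maj@k on the same samples.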

2605.30241 2026-05-29 cs.CL cs.CY cs.SI 版本更新

CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

CommunityFact:一个面向野外错误信息检测的动态、多语言、多领域基准

Sahajpreet Singh, Insyirah Mujtahid, Min-Yen Kan, Kokil Jaidka

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出CommunityFact基准,通过多语言多领域声明、LLM评估和社区笔记分析,揭示封闭输入验证的挑战、网络搜索的增益以及证据选择策略的偏差。

详情
AI中文摘要

错误信息验证越来越多地发生在公开、快速变化和多语言的在线环境中,静态基准无法全面衡量模型可靠性。我们引入了CommunityFact,一个可刷新的野外错误信息检测基准,具有三个主要目标:覆盖度、粒度和可再分发性。本版本包含5种语言和2个领域的15,992条独立声明。我们在不同的推理时能力(包括思考和网络搜索)下评估了十个LLM。我们的结果表明,封闭输入验证仍然具有挑战性,网络访问带来了最大的收益,并且启用网络的LLM的源选择策略与人类社区笔记评分者所达成的源系统性地不一致——这种差距可以通过特定模型的检索扩展或剪枝机制来缩小。我们进一步发现,跨语言-领域切片以及启用网络的系统所使用的证据生态系统存在显著差异。除了评估之外,CommunityFact将社区笔记定位为声明条件源建议器的训练信号,这可以改进对新声明的真实性验证。

英文摘要

Misinformation verification increasingly occurs in public, fast-moving, and multilingual online settings, where static benchmarks provide an incomplete measure of model reliability. We introduce CommunityFact, a refreshable benchmark for misinformation detection in the wild, with three major goals: coverage, granularity, and redistributability. This release contains 15,992 standalone claims across five languages and two domains. We evaluate ten LLMs under varying inference-time capabilities, including thinking and web-search. Our results show that closed-input verification remains challenging, web access yields the largest gains, and web-enabled LLMs' source-selection policies are systematically misaligned with the sources human Community Notes raters converge on -- a gap that closes through model-specific mechanisms of retrieval expansion or pruning. We further find substantial variation across language-domain slices and across the evidence ecosystems used by web-enabled systems. Beyond evaluation, CommunityFact positions Community Notes as a training signal for claim-conditioned source suggesters that could improve factual verification on novel claims.

2605.30233 2026-05-29 cs.CL cs.AI 版本更新

Do Language Models Track Entities Across State Changes?

语言模型是否在状态变化中跟踪实体?

Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya, Aaron Mueller, Sebastian Schuster, Najoung Kim

发表机构 * Department of Computer Science, Boston University, Boston, USA(波士顿大学计算机科学系) Department of Data Science, Monash University, Indonesia(莫纳什大学(印度尼西亚)数据科学系) Faculty of Computer Science, University of Vienna, Austria(维也纳大学计算机科学学院) Department of Linguistics, Boston University, Boston, USA(波士顿大学语言学系)

AI总结 研究语言模型在自然语言中处理多步状态变化操作时的实体跟踪机制,发现其采用非增量策略,在最后token并行聚合信息,并揭示了REMOVE操作的全局抑制标签及其导致的失败模式。

Comments ICML main conference 2026, 9 pages

详情
AI中文摘要

实体跟踪(ET),即跟踪状态的能力,是支撑复杂推理的基本技能。越来越多的研究探讨transformer语言模型(LMs)如何在没有状态变化的情况下解决实体绑定问题。然而,对于非玩具级LMs如何处理以自然语言表达的具有现实难度的ET问题,理解仍然有限。为此,我们研究了在具有多个状态变化操作的更复杂场景下ET背后的机制。我们发现,LMs不会跨token增量地跟踪世界状态,也不会跨层跟踪查询相关状态,而是在查询变得明显时,在最后一个token处并行地聚合相关信息。我们进一步研究了单个操作(PUT、REMOVE、MOVE)的机制,以表征这种非增量ET机制。令人惊讶的是,LMs使用一种脆弱的全局抑制标签来实现REMOVE操作;这种全局移除机制预测了我们通过行为实验确认的各种失败模式。我们提供了一种消除该标签的机械解决方案,以部分解决此问题。总体而言,我们的发现揭示了LMs使用非顺序策略来解决一个本质上是顺序的任务。更广泛地说,我们的工作展示了行为分析和机制分析如何有效地相互作用。行为结果为机制假设提供信息,而机制分析的见解通过预测现有评估中缺失的失败模式,有助于构建更强的行为评估。

英文摘要

Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity binding $\textit{without}$ state changes. However, there is limited understanding of how non-toy LMs address ET problems of realistic difficulties expressed in natural language. To this end, we investigate the mechanisms underlying ET in more complex scenarios featuring multiple state-changing operations. We find that LMs do not incrementally track world states across tokens or query-relevant states across layers, but simply aggregate relevant information in parallel at the last token when the query becomes evident. We further investigate mechanisms of individual operations ($\texttt{PUT}$, $\texttt{REMOVE}$, $\texttt{MOVE}$) to characterize this non-incremental ET mechanism. Surprisingly, LMs implement the $\texttt{REMOVE}$ operation with a fragile global suppression tag; this global removal mechanism predicts various failure modes that we confirm behaviorally. We provide a mechanistic solution of nullifying this tag to partially address this issue. Overall, our findings reveal that LMs solve a fundamentally sequential task using a non-sequential strategy. More broadly, our work illustrates how behavioral and mechanistic analyses can fruitfully interact. Behavioral results inform mechanistic hypotheses, and insights from mechanistic analyses help build stronger behavioral evaluations by predicting failure modes missing from existing evaluations.

2605.30232 2026-05-29 cs.LG cs.CL 版本更新

How's it going? Reinforcement learning in language models recruits a functional welfare axis

进展如何?语言模型中的强化学习招募了一个功能性福利轴

Andy Q Han, David J. Chalmers, Pavel Izmailov

发表机构 * New York University(纽约大学)

AI总结 本文通过迷宫环境实验,发现强化学习会招募语言模型中预先存在的功能性福利表征(即对系统目标达成程度的估计),从而广泛影响模型行为,且该表征在训练前已存在。

Comments 81 pages, 43 figures, 32 tables

详情
AI中文摘要

强化学习如何塑造语言模型的内部表征?我们提出证据表明,RL招募了一个预先存在的功能性福利表征:即对系统相对于其目标表现好坏程度的估计。我们在一个新颖的、语义中性的迷宫环境中训练了几个语言模型。然后,我们提取奖励和惩罚轨迹的概念向量,并在与迷宫环境无关的设置中评估这些向量。惩罚向量表现为负面福利的表征:它促进失败和不可能性标记,与负面情绪概念对齐,与目标达成负相关,并且通过它进行引导会引发负面自我报告、病理性回溯、拒绝和不确定性。正向奖励向量则表现为镜像,两者几乎反平行。这些效应在控制图块到奖励映射、规模、指令微调、RL训练算法、模型家族以及LoRA与全微调时都很稳健,并且当我们用监督微调替换RL时,这些效应在很大程度上仍然存在。重要的是,这些向量在模型经历迷宫训练之前就已经有效。结合这些效应也出现在仅预训练模型中的观察,我们因此认为,这个功能性福利轴先于后训练而存在:它是被后训练招募的,而非由后训练创造的。虽然我们不对福利体验作任何断言,但该轴表明,最小的奖励信号可以通过招募预先存在的类福利表征来广泛影响模型行为,这对可解释性、后训练动态和对齐具有启示意义。

英文摘要

How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, it negatively tracks goal-achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full-finetuning, and largely persist when we replace RL with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with observations that the effects also appear in pretrain-only models, we therefore argue that this functional welfare axis pre-exists post-training: it is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis offers a demonstration that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.

2605.30219 2026-05-29 cs.AI cs.CL cs.LG 版本更新

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

模型何时应改变想法?大语言模型中的上下文信念管理

Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng

发表机构 * Zhejiang University(浙江大学) HomologyAI

AI总结 提出上下文信念管理(CBM)框架,通过引入BeliefTrack基准和信念状态奖励的强化学习,将大语言模型在长程交互中的信念更新失败率平均降低70.9%。

Comments Work in progress

详情
AI中文摘要

长程交互要求语言模型管理累积信息:何时更新状态、何时保持状态、以及忽略什么。我们将这一挑战研究为\textbf{上下文信念管理(CBM)}:在隔离任务无关噪声的同时,维护与形式证据对齐的预测信念状态。为了使CBM可测量,我们引入了BeliefTrack,一个涵盖规则发现和电路诊断的封闭世界基准,其中有限的信念空间和符号验证器支持精确的逐轮评估。BeliefTrack诊断三种失败:保持失败、更新失败和隔离失败。在多个大语言模型中,原始模型表现出严重的CBM失败,而显式的信念跟踪提示提供的改进有限。相比之下,使用信念状态奖励的强化学习平均将失败率降低了70.9%。进一步的探测揭示了这些失败背后的潜在信念状态动态,而表示级引导在两个任务上将失败率降低了46.1%。代码即将在https://github.com/zjunlp/CBM发布。

英文摘要

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as \textbf{Contextual Belief Management (CBM)}: maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1% across two tasks. Code is coming soon at https://github.com/zjunlp/CBM.

2605.30214 2026-05-29 cs.CL 版本更新

GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

GRUFF:德语中LLM的代词忠实度、推理与偏见

Fabian Mewes, Anne Lauscher, Vagrant Gautam

发表机构 * JobMatchMe GmbH(JobMatchMe公司) Trustworthy AI Lab(可信人工智能实验室) Heidelberg Institute for Theoretical Studies(海德堡理论研究所)

AI总结 通过构建大规模德语数据集GRUFF,研究大型语言模型在四种性别一致系统与四组代词上的代词忠实度,发现模型在无显式上下文时对阳性和阴性实体表现出强语法一致,但对新代词xier和en较弱,且职业刻板印象在不同语法格和模型间相关性低。

详情
AI中文摘要

第三人称单数代词长期以来被用于研究语言模型中的刻板偏见以及测试其推理指代的能力。最近,通过代词忠实度任务研究了推理与偏见之间的相互作用,该任务评估模型正确复用先前为某个话语实体指定的代词的能力,而不受中间提到的其他潜在干扰话语实体的影响。然而,此类研究主要关注英语,这是一种语法性别有限且几乎没有性别一致的语言。在本文中,我们贡献了一个新颖的大规模数据集GRUFF,用于测量德语中的代词忠实度,涵盖了名词中的四种不同性别一致系统以及四组代词。利用该数据集,我们展示了LLM在缺乏显式上下文时对阳性和阴性实体表现出强语法一致,但对新代词xier和en则不然。模型通常对干扰项不鲁棒,但仅编码器模型在德语中比在英语中更鲁棒,反映了语法性别的重要性。最后,我们表明,在此上下文中,职业刻板印象在不同语法格之间以及大多数模型之间相关性较低,除了具有紧密相关架构的模型。我们发布所有代码和数据,以鼓励在德语中进一步研究性别包容性语言和指代推理。

英文摘要

Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated with the task of pronoun fidelity, which assesses models' abilities to correctly reuse a previously-specified pronoun for a discourse entity, independent of other potentially distracting discourse entities mentioned in between. However, such research focuses on English, which is a language with limited grammatical gender and almost no gender agreement. In this paper we contribute a novel, large-scale dataset, GRUFF, to measure pronoun fidelity in German, covering four different gender agreement systems in nouns, and four sets of pronouns. With this dataset, we show that LLMs show strong grammatical agreement for masculine and feminine entities in the absence of explicit context, but not for neopronouns xier and en. Models are generally not robust to distractors, but encoder-only models are more robust in German than in English, reflecting the importance of grammatical gender. Finally, we show that occupational stereotypes in this context are poorly correlated across grammatical cases, and across most models, except ones with closely related architectures. We release all code and data to encourage further work on gender-inclusive language and referential reasoning in German.

2605.30202 2026-05-29 cs.CL 版本更新

A Dual-Path Architecture for Scaling Compute and Capacity in LLMs

一种用于扩展LLMs计算和容量的双路径架构

Markus Frey, Behzad Shomali, Joachim Koehler, Mehdi Ali

发表机构 * Lamarr Institute(拉马尔研究所) Fraunhofer IAIS(弗劳恩霍夫研究所) University of Bonn(波恩大学)

AI总结 提出一种双路径块架构,通过深度子层(参数共享重复K次)和宽度子层(单次大FFN)并行扩展计算和容量,在语言建模和下游任务上超越等FLOPs基线模型,且门控机制可解释地分配每令牌路径。

详情
AI中文摘要

循环变压器多次应用共享块,已成为语言模型中参数高效扩展计算的途径。然而,在固定FLOPs下,循环模型的容量严格低于基线变压器。我们提出一种新颖的双路径块,可以灵活扩展计算(应用于隐藏状态的顺序操作数量)和容量(单步可用参数)。为此,我们在单层内将两个轴暴露为并行通路:一个深度子层,使用共享参数重复应用K次;一个宽度子层,包含一次应用的大型前馈网络。独立的每令牌门控组合两个轴,并允许详细的每令牌路由分析。我们表明,在两个FLOP预算下,我们的双路径模型在语言建模和下游评估上超越了等FLOPs匹配模型,同时在匹配FLOPs下使用比基线更少的参数。学习到的门控直接可解释,并显示系统的每令牌分配:功能词和词汇内容倾向于宽度路径,而标点、符号和算术令牌倾向于深度路径。

英文摘要

Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We propose a novel dual-path block that can flexibly scale both compute (the number of sequential operations applied to a hidden state) and capacity (the parameters available at a single step). To do so, we expose both axes as parallel pathways within a single layer: a deep sublayer re-applied K times with shared parameters, and a wide sublayer with an enlarged feed-forward network applied once. Independent per-token gates combine both axes and allow detailed per-token routing analyses. We show that across two FLOP budgets, our dual-path model surpasses iso-FLOP matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs. The learned gates are directly interpretable and show systematic per-token allocation: function words and lexical content trend wide, while punctuation, symbols, and arithmetic tokens trend deep.
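The two pathways can be pictured with a minimal sketch for a single token's hidden state. Everything beyond the abstract's description is an assumption made for illustration: the residual connection, the additive gate combination, scalar gate values passed in directly (rather than computed from the hidden state), and the toy stand-in functions for the sublayers.

```python
from typing import Callable, List

Vector = List[float]


def dual_path_block(h: Vector,
                    deep_fn: Callable[[Vector], Vector],
                    wide_fn: Callable[[Vector], Vector],
                    gate_deep: float,
                    gate_wide: float,
                    k: int) -> Vector:
    """Combine a looped (deep) path and a single-pass (wide) path for one token."""
    deep = h
    for _ in range(k):
        # Reusing the same function models parameter sharing across the K loops.
        deep = deep_fn(deep)
    # The wide path sees the original hidden state exactly once.
    wide = wide_fn(h)
    # Assumed combination: residual plus independently gated path outputs.
    return [x + gate_deep * d + gate_wide * w for x, d, w in zip(h, deep, wide)]


def halve(v: Vector) -> Vector:
    """Toy stand-in for the shared deep sublayer."""
    return [x / 2 for x in v]


def double(v: Vector) -> Vector:
    """Toy stand-in for the enlarged feed-forward (wide) sublayer."""
    return [x * 2 for x in v]


out = dual_path_block([8.0, 4.0], halve, double, gate_deep=1.0, gate_wide=0.5, k=3)
print(out)  # [17.0, 8.5]
```

In this picture, raising `k` adds sequential compute without adding parameters, while enlarging the wide sublayer adds parameters without adding sequential steps; the two gates record how much each token relies on either path.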

2605.30189 2026-05-29 cs.CR cs.AI cs.CL cs.LG 版本更新

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

LoRA适配器后门中的令牌级泛化:攻击表征与行为检测

Travis Lelle

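Affiliations * Travis Lelle

AI summary Implants backdoors in LoRA adapters through data poisoning, finds that the backdoors generalize at the token-feature level rather than the structural-pattern level, and proposes two detection methods, one based on behavioral statistics and one on weight statistics.

Comments 45 pages, 27 tables. Code and evaluation data: https://github.com/Travis-ML/lora-backdoors. Trained adapter weights available on request

Details
AI-generated Chinese abstract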

我们表明,LoRA适配器(微调LLM的主要分发格式)可以通过训练数据投毒可靠地植入后门,同时保持基线任务性能。在Qwen 2.5 1.5B提示注入分类器上,一小部分中毒样本即可驱动一个保持干净精度的后门达到饱和。由此产生的后门在令牌特征层面而非结构模式层面泛化:在一个RFC引用上训练的模型会在任何RFC引用上激活,但不会迁移到结构相同的ISO、OWASP、CWE或NIST引用上。这种不对称性有利于攻击者,因为防御者无法通用地探测“结构化引用”。 我们跨基础模型规模与系列、LoRA秩和触发字符串表征了该攻击,并针对多种子适配器队列评估了两种互补的检测路径。一个由两个探测电池统计量(outlier_gap和mean_attack_rate)构建的行为检测器,在探测电池与触发器的令牌邻域重叠时完美区分中毒适配器和干净适配器,在不重叠时以零假阳性实现高召回率。一个权重级统计量——维度归一化Frobenius范数的跨模块标准差——也能在不运行模型的情况下完美区分队列。两者结合对探测组成具有鲁棒性。因果修补将后门定位到中后层的MLP块,其中down_proj是最强的单投影原因。 跨规模、系列和秩的重复实验表明,行为检测器无需重新调整即可迁移,而权重级检测器则需针对基础模型进行校准。攻击随秩单调扩展,且选择的触发锚点令牌既依赖于触发也依赖于基础模型。行为检测是适配器供应链扫描中操作上可移植的结果。

English abstract

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
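
The weight-level statistic named in this abstract, the cross-module standard deviation of dimension-normalized Frobenius norms, can be sketched as follows. The abstract does not give the exact normalization, so dividing by the square root of the element count is an assumption, as are the function names and the choice of population standard deviation. The sketch also makes no claim about which direction of the statistic indicates poisoning.

```python
import math
import statistics

def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

def normalized_norm(matrix):
    # ASSUMPTION: "dimension-normalized" is taken to mean dividing by
    # sqrt(rows * cols); the paper may normalize differently.
    rows, cols = len(matrix), len(matrix[0])
    return frobenius_norm(matrix) / math.sqrt(rows * cols)

def cross_module_std(adapter_updates):
    """adapter_updates maps a module name (e.g. "down_proj") to its
    effective LoRA weight update, given as a list of rows."""
    norms = [normalized_norm(m) for m in adapter_updates.values()]
    return statistics.pstdev(norms)
```

Because this only reads weights, it can be computed without running the model, which matches the property the abstract highlights. Any decision threshold would need calibrating per base model, as the abstract notes.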

2605.30152 2026-05-29 cs.CL cs.AI cs.HC version update

Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?

主动型智能体真的需要LLM来决定何时唤醒和锚定什么吗?

Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen, Siheng Xiong, Xiaoqian Wang, Jing Gao

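Affiliations * Purdue University; Microsoft; Michigan State University; Georgia Institute of Technology

AI summary Proposes replacing the LLM with a temporal-graph-learning (TGL) model as the trigger for proactive agents, processing user activity as graph updates rather than text so that trigger decisions are efficient and low-latency.

Comments 31 pages, 5 figures, 7 tables

Details
AI-generated Chinese abstract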

主动型智能体将用户活动读取为文本,并在每个事件上调用LLM来决定是否行动。但用户活动本质上不是文本:它是操作系统以图形式维护的结构化事件流(actor, verb, object, timestamp)元组。将结构渲染为文本并要求LLM恢复它是系统本不必进行的往返。我们将始终在线的信号视为图更新而非文本,并使用小型时间图学习(TGL)模型作为编码器:一次前向传播产生每个事件的触发概率和每个实体的路由分数,只有下游智能体(将小型结构化交接转化为流畅的用户面向句子)是LLM调用,仅在触发时调用。TGL在14个基线上平均提升F1 +16.7(最高+46.0);在触发架构比较中,一个TGL检查点给出了最强的触发AUC和最稳定的部署阈值。它在GPU服务器上每个事件运行11.13毫秒,在消费级笔记本电脑上运行13.99毫秒,比每种测试场景中的每个单次前向LLM作为触发配置快约4-7倍和12-83倍,其BF16驻留内存占用约220 MiB,可部署在设备上,与其消费的隐私敏感活动流一起运行。

English abstract

Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating system already maintains in graph form. Rendering the structure as text and asking an LLM to recover it is a round-trip the system never had to take. We treat the always-on signal as graph updates rather than text and use a small temporal-graph-learning (TGL) model as the encoder: one forward pass yields a per-event trigger probability and a per-entity routing score, and only the downstream agent (turning a small structured handoff into a fluent user-facing sentence) is an LLM call, invoked only when the trigger fires. TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0); in trigger-architecture comparisons, one TGL checkpoint gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4--7x and 12--83x faster than every single-forward LLM-as-trigger configuration tested in each regime, with an approximately 220 MiB BF16 resident footprint deployable on-device alongside the privacy-sensitive activity stream it consumes.

2605.30133 2026-05-29 cs.CL version update

CorPipe at CRAC 2026: Empty Nodes and Cross-Lingual Transfer in Multilingual Coreference Resolution

CorPipe at CRAC 2026: 多语言共指消解中的空节点与跨语言迁移

Milan Straka

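Affiliations * Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

AI summary Presents CorPipe 26, a system that jointly predicts empty nodes, mentions, and coreference links in a single model. It outperforms all other systems in the CRAC 2026 Shared Task on Multilingual Coreference Resolution, leading by 2.8 and 9.5 percentage points in the LLM and unconstrained tracks, respectively.

Comments Accepted to CODI-CRAC 2026

Details
AI-generated Chinese abstract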

我们介绍CorPipe 26,这是我们在CRAC 2026多语言共指消解共享任务中的获胜提交。该共享任务的第五版主要关注生成式LLM与专用系统的比较;此外,还引入了5个更多数据集和2种新语言。CorPipe 26是CorPipe 25的改进版本,具有一种新变体,可在单个模型中同时预测空节点、提及和共指链接。我们的系统在LLM赛道中优于所有其他提交2.8个百分点,在不受限赛道中优于所有提交9.5个百分点。此外,我们进行了一系列消融实验,涉及不同模型大小、空节点预测方法以及跨语言零样本评估。源代码和训练好的模型可在https://github.com/ufal/crac2026-corpipe公开获取。

English abstract

We introduce CorPipe 26, our winning submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution. The fifth edition of this shared task focuses mainly on the comparison of generative LLMs and specialized systems; additionally, 5 more datasets and 2 new languages are introduced. CorPipe 26 is an improved version of CorPipe 25, with a new variant predicting empty nodes together with mentions and coreference links in a single model. Our system outperforms all other submissions in the LLM track by 2.8 percentage points and all submissions in the unconstrained track by 9.5 percentage points. Furthermore, we perform a series of ablation experiments with different model sizes, empty node prediction methods, and cross-lingual zero-shot evaluation. The source code and the trained models are publicly available at https://github.com/ufal/crac2026-corpipe.

2605.30131 2026-05-29 cs.CL cs.CV version update

CCS: Clinical Consensus Selection for Radiology Report Generation

CCS:放射学报告生成的临床共识选择

Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

Affiliations * School of Computing Science, University of Glasgow; School of Electrical and Computer Engineering, University of Sydney; Language Technology Lab, University of Cambridge

AI summary Proposes the CCS framework, which samples multiple candidate reports and selects the one with the highest clinical consensus to improve radiology report generation at inference time.

Comments 17 pages, 6 figures

Details
AI-generated Chinese abstract

放射学报告生成(RRG)通常被表述为单路径生成任务,其中多模态大语言模型(MLLM)产生一个解码报告作为最终输出。虽然最近的进展主要通过扩展训练数据、模型容量和检索机制来推动,但在推理时提高报告质量仍未被充分探索。在这项工作中,我们观察到固定的放射学MLLM在其候选池中通常生成比默认解码选择的报告临床更强的报告,这表明推理时的决策仍然是一个被忽视的瓶颈。为了解决这个问题,我们提出了临床共识选择(CCS),一个解码器无关的推理时选择框架,它采样多个候选报告,并选择在展开池中具有最高临床共识的报告。CCS将基于文本的效用与由图像-报告训练的多模态嵌入器计算的放射学适应效用统一起来,该嵌入器测量超越表面文本相似性的候选一致性。在三个数据集和多个放射学MLLM上,CCS始终优于单路径解码和通用Best-of-N基线,特别是在临床指标上取得了明显提升。进一步分析表明,基于图像的效用形成了与文本共识不同的选择轴,并且在推理时改进RRG仍有很大的提升空间。

English abstract

Radiology report generation (RRG) is commonly formulated as a single-path generation task, where a multimodal large language model (MLLM) produces one decoded report as the final output. While recent progress has largely been driven by scaling training data, model capacity, and retrieval mechanisms, improving report quality at inference time remains underexplored. In this work, we observe that fixed radiology MLLMs often generate clinically stronger reports elsewhere in their candidate pool than the one selected by default decoding, suggesting that inference-time decision making remains an overlooked bottleneck. To address this, we propose Clinical Consensus Selection (CCS), a decoder-agnostic inference-time selection framework that samples multiple candidate reports and selects the one with the highest clinical consensus across the rollout pool. CCS unifies text-based utilities with a radiology-adapted utility computed by an image--report-trained multimodal embedder, which measures candidate agreement beyond surface-level textual similarity. Across three datasets and multiple radiology MLLMs, CCS consistently improves inference-time performance over single-path decoding and generic Best-of-N baselines, with particularly clear gains on clinical metrics. Further analysis shows that image-grounded utility forms a selection axis distinct from textual consensus and that substantial headroom remains for improving RRG at inference time.
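
The selection step in this abstract (sample several reports, keep the one that agrees most with the rest of the pool) can be sketched in a few lines. This is an illustration under stated assumptions, not the CCS implementation: token-level Jaccard overlap stands in for the paper's text-based and image-grounded utilities, which are not reproduced here, and the function names are hypothetical.

```python
def jaccard(report_a, report_b):
    """Token-set overlap; a crude stand-in for a clinical utility function."""
    tokens_a = set(report_a.lower().split())
    tokens_b = set(report_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def select_by_consensus(candidates, utility=jaccard):
    """Return the candidate with the highest mean utility against all others."""
    if not candidates:
        raise ValueError("need at least one candidate report")
    if len(candidates) == 1:
        return candidates[0]
    best_index, best_score = 0, float("-inf")
    for i, candidate in enumerate(candidates):
        scores = [utility(candidate, other)
                  for j, other in enumerate(candidates) if j != i]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:  # ties keep the earliest candidate
            best_index, best_score = i, mean_score
    return candidates[best_index]
```

Swapping `utility` for an embedding-based similarity would change which reports count as agreeing, which is the role the abstract assigns to its radiology-adapted embedder.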

2605.30126 2026-05-29 cs.CV cs.AI cs.CL cs.LG version update

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

PARCEL: 基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解

Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

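Affiliations * Max Planck Institute for Informatics; Google

AI summary Proposes PARCEL, a visual tokenization architecture that uses pool anchoring and conditioned elastic query resampling to resolve the conflict between spatial and query representations in visual-token compression, improving the performance-efficiency Pareto frontier across 27 benchmarks.

Comments 33 pages, 4 figures

Details
AI-generated Chinese abstract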

大型视觉-语言模型(LVLMs)将视觉输入映射为密集的令牌序列,导致推理时的二次计算瓶颈。弹性视觉令牌压缩通过训练单一模型以在多个视觉令牌预算下运行来解决这一问题。然而,现有方法在激进压缩下表现不佳。空间压缩(如嵌套池化)表现为不完美的低通滤波器,并引起频谱混叠,掩盖了细粒度细节。查询压缩(如嵌套查询重采样)用非局部摘要替代显式的网格对齐令牌,显著降低了空间定位能力。为解决这一表示冲突,我们引入了PARCEL(基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解),一种视觉分词架构,动态分配特征提取的工作。PARCEL将空间池令牌建立为低频布局锚点,并通过池条件查询重采样使弹性查询令牌依赖于这些锚点。这鼓励查询令牌专注于互补的视觉特征,而非冗余的空间映射。在27个基准上的广泛评估表明,PARCEL改进了性能-效率帕累托前沿,在各种视觉令牌预算下持续优于现有的嵌套基线,同时保留了“一次训练,随处部署”的范式。

English abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.

2605.30107 2026-05-29 cs.CL version update

Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking

Dial HEALTHDIAL for Advice: 一个用于知识驱动信息检索的多语言多平行口语对话数据集

Songbo Hu, Yinhong Liu, Ej Zhou, Evgeniia Razumovskaia, Xiaobin Wang, Alexander Fraser, Ivan Vulić, Anna Korhonen

Affiliations * Language Technology Lab, University of Cambridge; School of Computation, Information and Technology, Technical University of Munich; Independent Researcher

AI summary Builds HEALTHDIAL, a large-scale multilingual, multi-parallel spoken dialogue dataset for developing spoken dialogue systems based on retrieval-augmented generation, and reveals performance disparities across languages.

Comments Accepted to Findings of ACL 2026

Details
AI-generated Chinese abstract

创建口语对话数据集在方法论上具有挑战性,当目标是构建大规模多语言多平行数据集时,这些挑战更加突出。本文介绍了HEALTHDIAL,一个用于开发和评估基于检索增强生成(RAG)的口语对话系统的大规模多语言多平行数据集。该数据集包含6,000个信息寻求对话(每种语言1,500个),这些对话基于世界卫生组织(WHO)的可信内容,以及来自四种WHO官方语言(阿拉伯语、中文、英语和西班牙语)的母语者录制的163小时用户语音。每个说话者都标注了人口统计学(如性别、年龄)和社会语言学(如主要语言、原籍地区)变量。我们报告了关键对话任务的基准结果,揭示了不同语言之间(即使是高资源语言)持续存在的性能差异。为支持未来研究,我们发布了该数据集、一个原型系统以及一个用于数据收集和系统评估的工具包。

English abstract

Creating spoken dialogue datasets is methodologically challenging, and these challenges are amplified when the goal is to build multilingual, multi-parallel datasets at scale. This work introduces HEALTHDIAL, a large-scale, multilingual, and multi-parallel dataset for developing and evaluating retrieval-augmented generation (RAG)-based spoken dialogue systems. The dataset comprises 6,000 information-seeking dialogues (1,500 per language) grounded in trusted content from the World Health Organization (WHO) and 163 hours of user speech recorded from native speakers of diverse dialects across four official WHO languages: Arabic, Chinese, English, and Spanish. Each speaker is annotated with demographic (e.g., gender, age) and sociolinguistic (e.g., primary language, region of origin) variables. We report benchmark results across key dialogue tasks, which reveal consistent performance disparities across languages, even among high-resource ones. To support future research, we release the dataset, a prototype system, and a toolkit for data collection and system evaluation.

2605.30104 2026-05-29 cs.CL version update

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?

SEAL: 饱和基准能否通过LLM作为元裁判得以复兴?

Jiamin Chen, Yidi Wu, Qiexiang Wang, Qianben Chen, Yuchen Li, Yansen Zhang, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

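Affiliations * ByteDance Inc.; City University of Hong Kong

AI summary Proposes the SEAL protocol, which uses an adaptive LLM meta-judge to extract latent ranking signal from saturated benchmarks, achieving high ranking accuracy with fewer calls on tasks such as code generation and mathematical reasoning.

Details
AI-generated Chinese abstract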

广泛使用的语言模型基准日益饱和,前沿系统常获得标准指标无法区分的接近分数。我们不构建更难的替代方案,而是探究是否可以通过改进对相同候选输出的评估来使现有任务重新具有信息量。因此,我们提出了带自适应LLM元裁判的种子淘汰法,这是一种自我改进的评估协议,用于从饱和基准中提取潜在排名信号。SEAL将候选输出种子化为单淘汰赛,并通过任务级原则和自改进检查表标准评估每场比赛。我们在涵盖代码生成、数学推理、知识密集型问答和工具使用智能体任务完成的多个饱和基准上评估SEAL。在这些设置中,SEAL改善了排名准确性与延迟之间的权衡,与完全成对评判相比达到了0.83-1.00的Spearman一致性和4/4的top-1一致性,同时每个任务仅需11.89次调用,而完全成对评估需要28.00次。

English abstract

Widely used language-model benchmarks are increasingly saturated, with frontier systems often receiving near-tied scores that standard metrics cannot resolve. Rather than constructing harder alternatives, we ask whether existing tasks can be made informative again through improved evaluation over the same candidate outputs. To this end, we present Seeded Elimination with Adaptive LLM-as-a-Meta-Judge (SEAL), a self-improving evaluation protocol for extracting latent ranking signal from saturated benchmarks. SEAL seeds candidate outputs into a single-elimination tournament and evaluates each match with task-level principles plus self-improving checklist criteria. We evaluate SEAL on multiple saturated benchmarks covering code generation, mathematical reasoning, knowledge-intensive question answering, and tool-use agent task completion. Across these settings, SEAL improves the ranking-accuracy--latency trade-off over competing protocols, attaining 0.83--1.00 Spearman agreement with full pairwise judging and 4/4 top-1 agreement, while requiring only 11.89 calls per task compared with 28.00 for full pairwise evaluation.
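
To see why an elimination format needs fewer judge calls than full pairwise comparison, the sketch below runs a plain single-elimination bracket and counts comparisons. It is a simplified illustration, not SEAL itself: the `judge` callback is a hypothetical stand-in for the LLM meta-judge, seeding is just the input order, and the checklist refinement is omitted, so the call counts here will not match the 11.89 figure reported in the abstract.

```python
def single_elimination(seeded_candidates, judge):
    """Run a single-elimination bracket.

    seeded_candidates: candidates in seed order.
    judge(a, b): returns the preferred candidate (stand-in for an LLM judge).
    Returns (champion, number_of_judge_calls).
    """
    if not seeded_candidates:
        raise ValueError("need at least one candidate")
    current = list(seeded_candidates)
    calls = 0
    while len(current) > 1:
        next_round = []
        for i in range(0, len(current) - 1, 2):
            next_round.append(judge(current[i], current[i + 1]))
            calls += 1
        if len(current) % 2 == 1:
            next_round.append(current[-1])  # odd candidate out gets a bye
        current = next_round
    return current[0], calls

def full_pairwise_calls(n):
    """Number of comparisons when every pair is judged once."""
    return n * (n - 1) // 2
```

A bracket over n candidates always uses n - 1 comparisons, against n(n - 1)/2 for full pairwise judging. The 28.00 pairwise calls quoted in the abstract is what 8 candidates would require, although the abstract does not state the pool size.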

2605.30090 2026-05-29 cs.CL cs.CV version update

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

DirectorBench: 通过个性化多智能体评估诊断长视频生成

Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

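Affiliations * ByteDance Inc.; City University of Hong Kong

AI summary Proposes DirectorBench, a multi-agent diagnostic benchmark that evaluates long-form video generation using 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across five dimensions (script, visual, audio, cross-modal, and stability), localizing bottlenecks and dependencies on user preferences.

Details
AI-generated Chinese abstract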

长视频生成正从短的单场景合成快速转向分钟级、多镜头的创作,具有叙事结构、电影控制、音频和跨模态同步。然而,评估此类视频仍然具有挑战性,因为现有基准主要关注局部视觉质量、短期时间一致性或通用提示对齐,并且对工作流故障和用户依赖偏好的诊断有限。我们引入了DirectorBench,一个用于长视频生成的个性化多智能体诊断基准。DirectorBench根据80个结构化元数据、7个用户画像和40个检查点标准,在脚本、视觉、音频、跨模态和稳定性五个维度上评估生成的视频。DirectorBench不将质量简化为单一聚合分数,而是定位检查点级别的瓶颈并支持画像感知评估。我们评估了4个长视频生成工作流、6个基础LLM和7个用户画像。在不同工作流中,DirectorBench揭示了一个单元间瓶颈:过渡质量平均仅为0.256,最佳工作流达到0.356,而提示级别的用户需求满足度平均为0.71。我们进一步进行了14名标注者的人工评估,以验证DirectorBench与人类判断的一致性。结果表明,DirectorBench捕捉到了人类可感知的质量差异,并揭示了聚合评分所隐藏的工作流和画像依赖的故障模式。这些发现强调了长视频生成中诊断性和画像感知基准的重要性。

English abstract

Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.

2605.30085 2026-05-29 cs.AI cs.CL cs.LG stat.ML version update

Conformal Certification of Reasoning Trace Prefixes

推理轨迹前缀的保形认证

Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan

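Affiliations * Department of Electrical & Computer Engineering, Rice University

AI summary Proposes CROP, which selects a threshold through conformal calibration and returns the longest prefix whose step-level risk stays below it, controlling the probability that the prefix contains an error while balancing the retention of valid reasoning against the discarding of misleading suffixes.

Comments Code available at https://github.com/matthewyccheung/crop

Details
AI-generated Chinese abstract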

语言模型推理轨迹很少是全有或全无;在关键错误发生之前,它们通常包含有效的中间步骤。现有的不确定性量化方法通常认证最终答案或整个响应,未能为顺序轨迹中可安全保留的比例提供统计保证。为了解决这个问题,我们引入了CROP(保形推理输出前缀),一种与验证器无关的校准程序,用于干净前缀认证。给定任何步骤级风险代理,CROP选择一个校准阈值,并返回其步骤风险代理保持低于该阈值的最长连续前缀,将未认证的后缀路由到下游审查或修复。假设可交换性,CROP严格控制了返回前缀包含注释错误的边际概率。在六个过程标记的推理数据集上,我们证明了标准步骤级指标(如AUROC)不能完全捕捉前缀效用,建议验证器应改为通过认证前缀长度进行评估。此外,CROP平衡了过度保留和不足保留,通过保留有效的中间推理同时丢弃误导后缀,提高了下游修复的准确性。最终,这项工作将前缀认证定位为过程监督、弃权和修复之间的严格、实用的桥梁。

English abstract

Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
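
The two steps in this abstract (calibrate a threshold, then keep the longest prefix whose step risks stay below it) can be sketched with a standard split-conformal construction. This is an assumption about how such a guarantee could be obtained, not the authors' procedure: the calibration score, the quantile rule, and all function names are hypothetical.

```python
import math

def calibration_score(step_risks, first_error_index):
    """Largest risk among the steps up to and including the first annotated error.

    A prefix cut at threshold t contains the error only if t exceeds this value.
    Error-free traces (first_error_index is None) can never yield a bad prefix.
    """
    if first_error_index is None:
        return math.inf
    return max(step_risks[: first_error_index + 1])

def calibrate_threshold(calibration_traces, alpha):
    """calibration_traces: list of (step_risks, first_error_index) pairs.

    Picks the k-th smallest score with k = floor(alpha * (n + 1)); under
    exchangeability a new trace's score falls strictly below it with
    probability at most alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    scores = sorted(calibration_score(r, e) for r, e in calibration_traces)
    k = math.floor(alpha * (len(scores) + 1))
    if k == 0:
        return -math.inf  # too little calibration data: certify nothing
    return scores[k - 1]

def certified_prefix(steps, step_risks, threshold):
    """Longest contiguous prefix whose risks all stay strictly below threshold."""
    kept = []
    for step, risk in zip(steps, step_risks):
        if risk >= threshold:
            break
        kept.append(step)
    return kept
```

The steps after the returned prefix form the uncertified suffix that the abstract routes to downstream review or repair. A risk exactly equal to the threshold is treated as not below it, so the prefix stops there.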

2605.30080 2026-05-29 cs.CL version update

Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model

自适应目标动态分块用于无分词层次模型

Thang Dang, Akira Nakagawa, Kenichi Kobayashi, Koichi Shirahata

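Affiliations * Fujitsu Research of America; Fujitsu Limited

AI summary Proposes Adaptive Targeted Dynamic Chunking (ATDC), which uses curriculum learning to adjust the compression ratio during training and improve byte compression in tokenization-free hierarchical models. On the FineWeb-Edu 100B dataset it achieves competitive bits-per-byte performance along with better training stability and downstream results.

Details
AI-generated Chinese abstract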

无分词层次模型正成为传统大型语言模型(LLM)的有前途替代方案,解决了词汇设计复杂性、词汇外(OOV)错误和语言特定约束等固有预处理问题。然而,这些字节级方法的一个重大挑战是压缩比的优化,这是决定模型通过分块处理字节数据性能的关键因素。在本文中,我们提出自适应目标动态分块(ATDC),一种新颖的字节压缩控制机制,旨在增强层次架构中动态分块的有效性。我们的方法利用课程学习在训练过程中逐步调整压缩比,从低压缩过渡到高压缩以稳定学习过程。我们提供分析,建立了目标压缩比与每最内层分块字节数(BPIC)之间的关系,从而能够在整个训练阶段跟踪分块大小的演变。在FineWeb-Edu 100B数据集上进行的评估表明,配备ATDC的层次模型在每字节比特数(BPB)性能上与在字节和词元级别上运行的常规基线相比具有竞争力。此外,与使用固定压缩比的模型相比,所提出的方法在多种下游任务中表现出更稳定的训练动态和更优的最终性能,同时保持了字节级处理的固有鲁棒性和灵活性。

English abstract

Tokenization-free hierarchical models are emerging as a promising alternative to traditional Large Language Models (LLMs), addressing inherent preprocessing issues such as vocabulary design complexity, out-of-vocabulary (OOV) errors, and language-specific constraints. However, a significant challenge in these byte-level methods is the optimization of the compression ratio, a critical factor that dictates model performance when processing byte data via chunks. In this paper, we propose Adaptive Targeted Dynamic Chunking (ATDC), a novel byte-compression control mechanism designed to enhance the effectiveness of dynamic chunking within hierarchical architectures. Our approach utilizes curriculum learning to progressively adjust the compression ratio during training, transitioning from low to high compression to stabilize the learning process. We provide an analysis establishing the relationship between the target compression ratio and Bytes-Per-Innermost-Chunk (BPIC), allowing for tracking of chunk-size evolution throughout the training phase. Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels. Furthermore, the proposed method exhibits more stable training dynamics and superior final performance across diverse downstream tasks compared to models using fixed compression ratios, while maintaining the inherent robustness and flexibility of byte-level processing.
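
Bits-Per-Byte is what lets this abstract compare byte-level and token-level models on one scale: the total cross-entropy is converted from nats to bits and divided by the number of raw bytes, whatever unit the model predicts. The sketch below shows that conversion; the loss values and the bytes-per-token figure are made-up illustrative numbers, not results from the paper.

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical token-level model: 2.0 nats/token, 1000 tokens covering 4000 bytes.
token_level_bpb = bits_per_byte(2.0 * 1000, 4000)

# Hypothetical byte-level model: 0.5 nats/byte over the same 4000 bytes.
byte_level_bpb = bits_per_byte(0.5 * 4000, 4000)

print(round(token_level_bpb, 4), round(byte_level_bpb, 4))
```

In this made-up case the two models score the same BPB even though their per-unit losses differ by a factor of four, which is why per-token loss alone cannot be compared across tokenization schemes.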

2605.30076 2026-05-29 cs.CL version update

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

UniSteer: 文本引导的激活空间流匹配用于多功能LLM引导

Yingdong Shi, Ruiming Zhang, Changming Li, Zhiyu Yang, Kaixing Zhang, Jingyi Yu, Kan Ren

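Affiliations * ShanghaiTech University

AI summary Proposes UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations to provide unified behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.

Comments 16 pages, 4 figures

Details
AI-generated Chinese abstract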

基于激活的控制通过在推理过程中干预大型语言模型(LLM)的内部表示来引导它们,并已成为控制个性、风格等行为的有效范式。然而,现有方法通常依赖于固定的引导方向或特定任务的干预模块,难以适应细粒度概念和组合约束。我们提出UniSteer,一种文本引导的激活流匹配模型,它从自然语言条件中学习残差流激活的条件分布。UniSteer不是为每个目标行为拟合单独的干预,而是在激活空间中学习一个通用的条件速度场。在推理时,UniSteer通过将源激活部分传输到潜在状态并在目标文本条件下重新生成它,然后将其注入回冻结的LLM,从而执行流反转。相同的条件模型通过选择具有最低重建能量的文本标签来支持激活空间分类。在三个目标LLM上的实验表明,UniSteer在行为控制、真实性引导、细粒度概念引导、多约束指令遵循和激活空间分类方面提供了统一的接口。

English abstract

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, making them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations from natural-language conditions. Instead of fitting a separate intervention for each target behavior, UniSteer learns a universal conditional velocity field in activation space. At inference time, UniSteer performs flow inversion by partially transporting a source activation toward a latent state and regenerating it under a target textual condition before injecting it back into the frozen LLM. The same conditional model supports activation-space classification by selecting the textual label with the lowest reconstruction energy. Experiments on three target LLMs show that UniSteer provides a unified interface across behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.

2605.30058 2026-05-29 cs.CL version update

HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

HEART-Bench: 大语言模型智能体是否表现出类似人类的心理学?

Weihan Peng, Chenxu Zhang, Qianao Wang, Yuling Shi, Heng Lian, Qihong Mao, Jiahao Pang, Chunliang Feng, Bowen Li, Xiaodong Gu

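Affiliations * Shanghai Jiao Tong University; Imperial College London; Quwan Group; University of Washington; South China Normal University

AI summary Proposes HEART-Bench, a benchmark that builds virtual characters grounded in Big Five personality traits and autobiographical memories, and uses the DIAMONDS situational framework to evaluate whether LLM agents exhibit consistent human-like psychological traits.

Comments GitHub: https://github.com/peng-weihan/HEART-BENCH

Details
AI-generated Chinese abstract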

尽管LLM智能体在规划、推理和行动等任务导向能力上表现出色,但很少有研究将它们视为完整的人类个性,其中情感维度同样重要。在本文中,我们引入了一个新颖的基准,系统评估LLM智能体是否能模拟连贯、类似人类的心理。具体来说,我们的基准构建了11个基于正交大五人格特质的多样化人类角色,每个角色都深入整合了1000个结构化的自传体式情景记忆,这些记忆分布在基于理论的发展生命阶段。为了严格评估LLM的心理表现,我们设计了一套由64个决策场景组成的精选套件,这些场景基于DIAMONDS分类法,这是一个心理框架,从八个维度描述情境:责任、智力、逆境、求偶、积极性、消极性、欺骗和社交性。通过将智能体置于不同场景中,基准评估它们是否能整合其固有的人格特质和自传体记忆,做出与其特定心理特征一致的行为决策。经过系统的人工验证和过滤,我们得到了一个包含673道多项选择题(MCQ)的基准。我们相信,这个基准为研究基于LLM的智能体中的人类情感、人格一致性和价值一致的行为决策提供了一个原则性且可扩展的测试平台。

English abstract

While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities in which emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision-making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioral decision-making in LLM-based agents.

2605.30052 2026-05-29 cs.SE cs.AI cs.CL version update

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

REPOT:通过检查点修复实现可恢复的思维程序

Parsa Mazaheri

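Affiliations * University of California, Santa Cruz

AI summary Proposes RePoT, which combines deterministic verified replay with a single LLM call that resumes from the verified prefix, addressing the problem in Program-of-Thought that one invalid action invalidates the whole trajectory. It improves success rates across multiple models and benchmarks.

Details
AI-generated Chinese abstract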

单次 Program-of-Thought (PoT) 生成一个打印基本动作计划的 Python 程序;单个无效动作会无声地使轨迹失效。我们引入 RePoT (可恢复 PoT):一种确定性验证重放,它将计划遍历环境直到第一个无效转换,然后通过一次 LLM 调用从验证前缀恢复。在 PoT 失败的约 14% 的问题上,RePoT 最多增加一次 LLM 调用。在 PuzzleZoo-775 上,RePoT 在四种闭模型配置上比 PoT 提高 +3 到 +11 个百分点,在 gpt-5.4-mini-medium 上达到 96.9% 对比 86.3% 的峰值;与预算匹配的 PoT-retry 基线相比,RePoT 在 Gemini 上明显获胜(+3.8pp,95% CI [+2.2,+5.4]),在 GPT-medium 和 Claude 上处于采样噪声范围内,在 GPT-mini 上失败——这是一种能力扩展模式,我们开始通过自适应 RePoT 解决,这是一种基于规则的调度器,根据验证前缀长度在后缀修复和全新 PoT 重试之间路由(初步)。我们在 PlanBench Blocksworld 上复现(+1.1 到 +11.4pp),在四个开放权重模型上(四个中的三个 +3.3 到 +20.0pp)。在 Derail-550(我们的受控恢复基准)上,每个能够访问检查点信息的条件在 GPT-medium 上达到 >=30%,在 Gemini 上达到 >=70%,而仅错误反馈条件 <=3.1%——表明检查点信息(而非特定的验证前缀尾部)是承载恢复的信号。

English abstract

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini -- a capability-scaling pattern we begin to address with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary). We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback -- showing that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
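
The verified-replay step in this abstract can be sketched on a toy environment. Everything concrete below is an assumption for illustration: the corridor environment, the function names, and the `repair` callback, which stands in for the single LLM call that resumes from the verified prefix. It is not the RePoT code.

```python
def corridor_step(position, action, length=4):
    """Toy environment: move left or right in a corridor of `length` cells.
    Returns the new position, or None if the action is invalid."""
    delta = {"L": -1, "R": 1}.get(action)
    if delta is None:
        return None
    new_position = position + delta
    return new_position if 0 <= new_position < length else None

def verified_replay(plan, initial_state, step):
    """Walk the plan through the environment up to its first invalid transition.

    Returns (verified_prefix, state_after_prefix, index_of_bad_action or None).
    """
    state = initial_state
    for index, action in enumerate(plan):
        next_state = step(state, action)
        if next_state is None:
            return list(plan[:index]), state, index
        state = next_state
    return list(plan), state, None

def recover_plan(plan, initial_state, step, repair):
    """Keep the verified prefix and ask `repair` for a new suffix at most once.

    repair(prefix, state) is a placeholder for the LLM call.
    Returns (final_plan, number_of_repair_calls).
    """
    prefix, state, failed_at = verified_replay(plan, initial_state, step)
    if failed_at is None:
        return prefix, 0
    return prefix + list(repair(prefix, state)), 1
```

The replay itself is deterministic and needs no model call, so the only extra cost is the one `repair` call on plans that fail, mirroring the cost claim in the abstract.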

2605.30051 2026-05-29 cs.CL cs.CY version update

Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues

我是谁?面向辅导对话中学生模拟的历史感知档案

Zhangqi Duan, Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Simon Woodhead, Andrew Lan

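Affiliations * University of Massachusetts Amherst; Eedi

AI summary Introduces the task of history-conditioned student simulation and trains a profile generator and a simulator with reinforcement learning to predict student dialogue turns from learning history, significantly outperforming baselines on a dataset from a math learning platform.

Details
AI-generated Chinese abstract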

开发基于大型语言模型(LLM)的自动化辅导工具的一个关键部分是学生模拟,即使用LLM扮演学生角色,这可以促进辅导模型的评估和训练。现有工作主要关注对话内模拟,缺乏关于学生知识和行为的上下文,部分原因是没有基于过去的学生问答或对话交互。在这项工作中,我们引入了历史条件的学生模拟任务,其目标是通过利用学生学习历史中的信息准确预测学生对话轮次。我们提出了一个双组件框架,其中档案生成器总结学生历史,模拟器基于生成的档案预测学生轮次。我们使用强化学习(RL)训练这两个组件,生成针对忠实学生模拟优化的档案。我们在从数学学习平台收集的首个真实世界学生对话和问答响应数据集上评估了我们的方法和基线。大量实验表明,我们的方法显著优于基线,并证明了历史、档案和RL训练的重要性。

English abstract

A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training. Existing work mostly focuses on within-dialogue simulation, which lacks context on student knowledge and behavior, partly because it is not grounded in past student question-answering or dialogue interactions. In this work, we introduce the task of history-conditioned student simulation, where the goal is to accurately predict student dialogue turns by leveraging information in the student's learning history. We propose a two-component framework in which a profile generator summarizes a student's history and a simulator predicts student turns conditioned on the resulting profile. We train both components with reinforcement learning (RL), yielding profiles optimized for faithful student simulation. We evaluate our method and baselines on a first-of-its-kind real-world dataset of student dialogues and question responses that we collect from a math learning platform. Extensive experiments show that our method significantly outperforms baselines and demonstrate the importance of history, profiles, and RL training.

2605.30040 2026-05-29 cs.CR cs.AI cs.CL version update

Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage

Token通胀:不诚实的提供商如何对大型语言模型使用超额收费

Shahinul Hoque, Jinghuai Zhang, Jinyuan Sun, Fnu Suya

发表机构 * University of Tennessee, Knoxville(田纳西大学,基洛纳分校) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 研究揭示了基于每token计费的大型语言模型商业服务中,提供商利用审计信任悖论系统性地虚报token数量,导致用户费用大幅增加的问题。

Abstract

Per-token billing is now the standard pricing model for commercial large language models (LLMs), so the honesty of reported token counts directly affects what users pay. We show that this kind of billing is hard to audit by design: providers hide the model, the tokenizer, and the execution to protect their IP, mitigate jailbreaks, and preserve user privacy, which means an auditor can only inspect proofs the provider supplies. The audit therefore reduces to a consistency check on the provider's own reports. We call this a trust paradox: every audit must trust some artifact, but current frameworks trust exactly the ones a provider has the strongest reason to manipulate. We study three recent token auditing frameworks and show that a provider with ordinary commercial capabilities can systematically inflate billed token counts. In the most permissive setting, hidden reasoning usage can be inflated by 1,469% on average without detection. At current frontier reasoning prices, that turns a $100 honest bill into roughly a $1,569 bill on the same query. Even when the user can see the full reasoning string, tokenization ambiguity alone still allows 50.85% over-reporting below the detection threshold. These results suggest the problem is not in any specific auditor but in any audit whose evidence comes from the audited party. Restoring honest billing will require verification that ties reported token counts to evidence the provider does not control, such as trusted execution attestation, cryptographic proofs of inference, or third-party re-execution.
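The bill arithmetic in the abstract can be checked with a short sketch. It assumes the cost is linear in the billed token count and that the whole bill consists of the inflated tokens; the function name `inflated_bill` is hypothetical, and the second figure is only an illustration of the stated 50.85% rate, not a number reported in the abstract.

```python
def inflated_bill(honest_bill: float, inflation_pct: float) -> float:
    """Bill after billed token counts are inflated by `inflation_pct` percent.

    Assumption: price is linear in billed tokens (plain per-token pricing).
    """
    return honest_bill * (1 + inflation_pct / 100)


# Most permissive setting from the abstract: 1,469% average inflation.
hidden_reasoning = inflated_bill(100.0, 1469)

# Illustration only: the 50.85% over-reporting rate applied to the same bill.
visible_reasoning = inflated_bill(100.0, 50.85)

print(round(hidden_reasoning, 2), round(visible_reasoning, 2))
```

Under these assumptions the first value reproduces the abstract's "roughly $1,569" figure, since an increase of 1,469% multiplies the honest bill by 15.69.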

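2605.30031 2026-05-29 cs.SD cs.AI cs.CL Updated version

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang, Yun-Nung Chen

Affiliations National Taiwan University

AI summary Presents a unified taxonomy and a controlled empirical evaluation of audio jailbreak attacks and defenses for large audio-language models, finding that Acoustic Best-of-N exposes worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and existing defenses trade robustness against benign usability.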

Comments Submitted to ACL ARR 2026 May

Abstract

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.

2605.30022 2026-05-29 cs.CL cs.AI Updated version

Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders

Pierre-Antoine Lequeu, Camille Barboule, Benjamin Piwowarski

Affiliations Sorbonne Université, CNRS, ISIR; Orange Innovation

AI summary Studies how Transformers handle positional encoding by separating positional and semantic signals into three independent streams, finding that the disentangled approach preserves macroscopic structure and improves linguistic representations.

Comments 8 pages + 10 pages of bibliography and appendix

Abstract

Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is processed and stored remains poorly understood. Modern PE methods such as RoPE still struggle on tasks such as long-context understanding or retrieval \cite{chen-etal-2025-hope}. Hence, a better understanding of the internal positional mechanism could help design better PE. Building on evidence that positional and semantic signals occupy nearly orthogonal subspaces in trained Transformers, we modify an encoder Transformer to process three explicitly disentangled streams: semantic, absolute positional (AP) and relative positional (RP), and confine the masked-language-modeling (MLM) objective to the semantic stream. This decoupling enables a clean mechanistic study and yields three take-aways. (1) The isolated AP subspace spontaneously collapses into a low-frequency two-dimensional manifold that captures the structure of the document; (2) Attention heads specialize into structure and semantic-oriented groups, with RP exclusively supporting the latter; (3) Standard positional encodings do not robustly retain macroscopic structure: RoPE and RP only weakly encode it, and entangled AP loses it in the final layers under MLM pressure. The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.

2605.29992 2026-05-29 cs.CL Updated version

Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation

M. Ali Bayram, Banu Diri, Savaş Yıldırım

Affiliations Yıldız Technical University; Istanbul Bilgi University

AI summary Proposes an efficient three-stage adaptation pipeline (cross-lingual tokenizer optimization, teacher-model cloning, and offline distillation) to build the Turkish sentence embedding model embeddingmagibu-200m, which surpasses its teacher on STSbTR and is competitive on TR-MTEB with fewer parameters.

Comments 14 pages, 2 figures, 4 tables, Appendix included

Abstract

Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072 vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus, (2) clone a teacher embedding model while preserving transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping, and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and trains in roughly four hours on a single GPU by avoiding online teacher inference during training, at a total cost of $5-$20. Empirically, Pearson/Spearman correlations of 77.55%/77.45% are obtained on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), a mean score of 63.9% is achieved (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
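Stage (2) of the pipeline, initializing the new embedding table via mean-composition token mapping, might look roughly like the sketch below. This is one plausible reading of the abstract, not the released tooling: the names `mean_compose` and `teacher_tokenize`, the dictionary-based embedding table, and the zero-vector fallback are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Vector = List[float]


def mean_compose(
    new_vocab: List[str],
    teacher_embeddings: Dict[str, Vector],
    teacher_tokenize: Callable[[str], List[str]],
) -> Dict[str, Vector]:
    """Initialize each new token as the mean of the teacher vectors of its pieces."""
    dim = len(next(iter(teacher_embeddings.values())))
    table: Dict[str, Vector] = {}
    for token in new_vocab:
        if token in teacher_embeddings:
            # Token already exists in the teacher vocabulary: copy its vector.
            table[token] = list(teacher_embeddings[token])
            continue
        # Split the new token with the teacher tokenizer and keep known pieces.
        pieces = [p for p in teacher_tokenize(token) if p in teacher_embeddings]
        if not pieces:
            # Assumed fallback for tokens the teacher cannot represent.
            table[token] = [0.0] * dim
            continue
        table[token] = [
            sum(teacher_embeddings[p][i] for p in pieces) / len(pieces)
            for i in range(dim)
        ]
    return table


# Toy teacher: character-level vocabulary with 2-dimensional embeddings.
teacher = {"e": [1.0, 0.0], "v": [0.0, 1.0]}
student = mean_compose(["e", "ev"], teacher, teacher_tokenize=list)
print(student)
```

In this toy run the new token "ev" is split into the teacher pieces "e" and "v", so it starts at the average of their two vectors, while "e" is copied unchanged.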

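2605.29971 2026-05-29 cs.CL Updated version

Causal Interventions on Continuous Variables: A Case Study on Verb Bias in Steering Vectors for In-Context Learning

Zhenghao Herbert Zhou, R. Thomas McCoy, Robert Frank

Affiliations Yale University

AI summary Proposes a method for causal intervention on continuous variables that localizes a low-dimensional direction and edits vectors toward counterfactual target values; applied to verb bias, it shows that the feature is causally represented in language models and examines its relation to in-context learning.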

Abstract

Causal interventions in language model representations have largely targeted discrete features, like grammatical number. However, language models must also make use of features that are graded. We introduce a method for causal intervention on continuous variables: given activation vectors paired with a graded target variable, we localize a low-dimensional direction for that variable and use this direction to edit vectors toward counterfactual target values. We apply this method to a continuous feature that is well-studied in psycholinguistics, namely verb bias (which reflects which syntactic structures tend to follow a given verb). We show that verb bias is causally represented in steering vectors extracted from large language models: counterfactual edits to verb bias systematically shift downstream structural preferences. Verb bias has also previously been linked to in-context learning; in further analyses, we find that steering vectors encode error signals that could drive the error-driven update behavior seen in in-context learning, but that these aspects of the steering vectors are not causally used in downstream production. Overall, these results show that causal interventions can be applied to continuous variables, though connecting continuous variables to in-context learning remains a challenge.
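The "localize a direction, then edit toward a counterfactual value" idea can be illustrated with a one-dimensional linear sketch. The abstract does not say how the direction is found, so everything here is an assumption: the direction is taken as the normalized covariance between activations and the target, a 1-D least-squares readout maps projections to target values, and the names `fit_direction` and `edit_toward` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def fit_direction(acts: List[Vector], targets: List[float]) -> Tuple[Vector, float, float]:
    """Return (unit direction, slope, intercept) for a linear readout of the target."""
    n, dim = len(acts), len(acts[0])
    mean_x = [sum(x[i] for x in acts) / n for i in range(dim)]
    mean_y = sum(targets) / n
    # Covariance of each activation coordinate with the graded target.
    cov = [
        sum((x[i] - mean_x[i]) * (y - mean_y) for x, y in zip(acts, targets))
        for i in range(dim)
    ]
    norm = sum(c * c for c in cov) ** 0.5  # assumed non-zero in this sketch
    direction = [c / norm for c in cov]
    # 1-D least squares: target ~ slope * projection + intercept.
    proj = [dot(x, direction) for x in acts]
    mean_p = sum(proj) / n
    slope = sum((p - mean_p) * (y - mean_y) for p, y in zip(proj, targets)) / sum(
        (p - mean_p) ** 2 for p in proj
    )
    intercept = mean_y - slope * mean_p
    return direction, slope, intercept


def edit_toward(x: Vector, target: float, direction: Vector, slope: float, intercept: float) -> Vector:
    """Shift x along the direction until the linear readout equals `target`."""
    desired_projection = (target - intercept) / slope
    shift = desired_projection - dot(x, direction)
    return [xi + shift * di for xi, di in zip(x, direction)]


# Toy data: the target varies with the first coordinate only.
acts = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0]]
targets = [0.0, 0.5, 1.0]
direction, slope, intercept = fit_direction(acts, targets)
edited = edit_toward([0.0, 5.0], 1.0, direction, slope, intercept)
print(edited)
```

The edit moves the vector only along the fitted direction, so coordinates unrelated to the target (the second one in the toy data) are left unchanged.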

2605.29951 2026-05-29 cs.AI cs.CL cs.LG cs.MM Updated version

MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization

Anisha Saha, Varsha Suresh, Teodora Kamova, Sophia Wiedmann, Timothy Hospedales, Vera Demberg

Affiliations Max Planck Institute for Informatics; Saarland Informatics Campus; Saarland University; The University of Edinburgh; Samsung AI Center, Cambridge

AI summary To address the weakness of vision-language models at reasoning about implicit cross-modal harm, introduces the MuPHI dataset and the MuPHIRM training framework, which learns joint semantics by optimizing multi-perspective rewards and improves harm detection, reasoning quality, and out-of-distribution robustness.

Abstract

Understanding how harm emerges from interaction between otherwise benign image-text pairs requires intent-aware cross-modal reasoning beyond surface-level features. Existing vision-language models (VLMs) excel at literal reasoning over perceptual cues but often fail to derive harmful semantics that rely on implicit, context-dependent reasoning. To evaluate VLMs on compositional harm detection and reasoning, we introduce Multimodal Pragmatic Harm Interpretation (MuPHI), a dataset containing image-text pairs where harm is encoded in subtle multimodal cues. MuPHI spans diverse harm categories and includes annotated harm rationales for assessing VLM reasoning chains. To improve both detection and reasoning in VLMs, we propose MuPHIRM, a reasoning-augmented training framework which learns joint semantics by optimizing multi-perspective rewards. MuPHIRM improves both harm detection and reasoning quality of VLMs while demonstrating superior out-of-distribution robustness compared to both trained and inference-time baselines. Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.

2605.29927 2026-05-29 cs.CL cs.AI cs.LG Updated version

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim

Affiliations Concordia University; Mila - Quebec AI Institute; University of Copenhagen; Universite Claude Bernard Lyon; McGill University

AI summary Introduces the PlanAhead framework; using automatic difficulty categorization and a comparison of four plan representations (sequential subgoals, narrative, pseudocode, and checklist), finds that both the plan representation and the LLM generating the plan significantly affect web-agent robustness and task success.

Comments Extended version of paper submitted to EMNLP, waiting for acceptance

Abstract

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural-language plan representations remains unexplored. To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation on agent performance. We first automatically categorize WebArena tasks into 3 difficulty levels, enabling consistent difficulty grading without human annotation. We then systematically evaluate 4 plan representations (sequential subgoals, narrative, pseudocode, and checklist) on the tasks categorized as hard, across different families of multimodal LLM-powered agents (OpenAI, Alibaba, and Google). To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). Our results show that both the plan formulation and the underlying LLM generating the plan significantly influence web-agent robustness and task success.

2605.29897 2026-05-29 cs.CL 版本更新

ExCAM: Explainable Cultural Awareness Metrics

ExCAM:可解释的文化意识度量

Christoph Leiter, Haiyue Song, Hour Kaing, Jin Tei, Hideki Tanaka, Masao Utiyama, Steffen Eger

发表机构 * University of Mannheim(曼海姆大学) University of Technology Nuremberg(纽伦堡技术大学) National Institute of Information and Communications Technology(信息与通信技术国家研究所)

AI总结 提出ExCAM,首个可识别、评分并解释指令-输出对中文化错误的专用评估度量,在平衡测试集上达到80%准确率。

Comments preprint

Abstract

Evaluating the cultural awareness of large language models is crucial to ensure the fairness of generated text and the generalizability of applications across the world. Recent benchmarks explore cultural goods like food or values like behavior in stressful situations through the lens of question answering or text generation tasks. However, creating these benchmarks requires time-intensive and costly human annotations. Also, benchmarks that evaluate cultural awareness in free text are scarce and often rely on dated evaluation mechanisms. To address this gap, we introduce ExCAM, an Explainable Cultural Awareness Metric, which is, to our knowledge, the first dedicated evaluation metric that identifies, rates and explains cultural errors in instruction-output pairs. To train and evaluate ExCAM, we introduce ExCAM40k, a dataset comprised of nine existing benchmarks that we reformat and enhance with synthetic errors. Compared to several baselines, including GPT-5, ExCAM achieves the highest error detection rate with up to 80% accuracy on a balanced test set. Therefore, ExCAM opens the pathway towards fine-grained and explainable cultural evaluation of free text.

2605.29889 2026-05-29 cs.CL cs.AI Updated version

Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate

David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, Shlomo Berkovsky

Affiliations Macquarie University; Politecnico di Bari; NSW Health; Independent Researcher

AI summary Using sparse-autoencoder feature analysis, finds that LLMs' poor performance on triage tasks stems from output-format constraints rather than from deficient clinical knowledge representations.

Comments 9 pages main text, 27 pages total including appendices; 7 figures, 25 tables

Abstract

Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs under constrained multiple-choice output, yet the same cases score differently with free-text output. We ask whether output format changes the model's clinical representation or only the mapping from a preserved representation to an answer. Using sparse-autoencoder (SAE) features in Gemma 3 4B/12B IT and Qwen3-8B, we find that the same medical features fire on the shared clinical narrative under both formats but go silent at the multiple-choice decision token in every case for every model. Three independent methods (natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization) agree that scaffold and format features, but not medical features, drive the decision logits. Behaviorally, the multiple-choice penalty inverts under both structured and natural-language input, option-order shuffle rules out positional bias, and the gap is dominated by off-by-one decisions (the model picks an acuity letter adjacent to the gold answer) rather than knowledge failure. Thus, the failure originates in the output format and not in the clinical representation.

2605.29886 2026-05-29 cs.CL cs.AI Updated version

CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation

Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu, Qingyun Sun, Runhua Xu, Jianxin Li

Affiliations Nankai University; Beihang University; Guangxi Normal University

AI summary Proposes CRITIC-R1, a framework that uses reinforcement learning to model RAG critique as a structured error-diagnosis problem, with Conservative Judgement Alignment and Diagnostic Quality Alignment reward functions, improving the answer quality of retrieval-augmented generation.

Comments 17 pages, 13 figures

Abstract

Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing RAG methods still suffer from hallucinations and subtle reasoning errors. Recent studies introduce external critics to refine RAG outputs, yet they often provide coarse-grained and weakly structured feedback, exhibit over-aggressive intervention, and lead to noisy and unreliable refinement, limiting their effectiveness for correction. To tackle these issues, we propose CRITIC-R1, a structured critic framework that formulates and learns RAG critique as an explicit error diagnosis problem using reinforcement learning (RL). Our framework categorizes common RAG errors into multiple diagnostic dimensions, including verdict, error location, reasoning analysis, and fix generation. To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment (DQA) further improves fine-grained diagnostic feedback through gated rewards. We train the critic model using GRPO-based RL with process-level supervision collected from external LLM teacher models. Experiments across five QA benchmarks show that CRITIC-R1 consistently improves answer quality over strong RAG baselines. Our source code is available at https://anonymous.4open.science/r/critic-r1-FCB0

2605.29859 2026-05-29 eess.AS cs.CL Updated version

MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables

Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng, Hao Tang, Jay Mahadeokar, Ozlem Kalinli, Alexandre Mourachko

Affiliations University of Edinburgh; Google DeepMind; Meta Superintelligence Labs

AI summary Proposes a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model, outperforming codec-based and other mel-spectrogram-based baselines on zero-shot text-to-speech and speech-to-text tasks and alleviating prolonged silence and word omissions in autoregressive modeling.

Abstract

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model. Joint optimization not only brings improvements over codec-based and other mel-spectrogram-based baselines on zero-shot Text-to-Speech (TTS) and Speech-to-Text (STT) tasks, but also effectively alleviates common issues in autoregressive mel-spectrogram modeling, such as prolonged silence generation and word omissions.

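2605.29847 2026-05-29 cs.CL Updated version

EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation

Xin Guan, Xiaomeng Hu, Shen Huang, Zhenyi Wang, Bo Zhang, Zijian Li, Pengjun Xie, Bo Liu, Jiuxin Cao

Affiliations Tongyi Lab, Alibaba Group

AI summary Proposes EvoRubric, a single-policy co-evolutionary reinforcement learning framework that dynamically alternates between generating responses and generating rubrics and adds a multi-level verification pipeline, addressing the lack of definitive rewards in open-ended generation and outperforming traditional static and external-LLM-driven methods in the Medical, Writing, and Science domains.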

Abstract

Reinforcement Learning (RL) has significantly advanced Large Language Models (LLMs) in verifiable domains, but aligning models for open-ended generation remains profoundly challenging due to the lack of definitive rewards. Current rubric-based RL methods mitigate this by employing explicit criteria; however, they rely heavily on static, human-annotated rubrics that inevitably cause policy lag, or expensive external proprietary models for dynamic updates. In this paper, we propose EvoRubric, a novel single-policy co-evolutionary RL framework that eliminates the reliance on static criteria and on external rubric generators. By unifying response generation and rubric generation under a single parameterized policy, EvoRubric dynamically alternates between a Reasoner and a Rubric Generator. To prevent reward hacking and ensure the reliability of generated signals, we introduce a multi-level verification pipeline featuring a meta-verifier, zero-variance pruning, and a Leave-One-Out peer consensus mechanism. Validated criteria are dynamically archived into a memory pool, yielding dense, multi-objective rewards to continuously co-optimize both roles. Extensive experiments across Medical, Writing, and Science domains demonstrate that EvoRubric consistently outperforms traditional static and external-LLM-driven alignment methods. Notably, our framework is compatible with human-expert priors. When initialized with expert-annotated rubrics, EvoRubric can further uncover novel, discriminative dimensions, achieving better performance than relying solely on static expert annotations.
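One step of the verification pipeline, zero-variance pruning, lends itself to a small sketch. The abstract does not define the step, so this is an assumed reading: a rubric criterion that gives every sampled response the same score cannot distinguish between responses and is dropped. The function name `prune_zero_variance`, the criterion names, and the score layout are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def prune_zero_variance(
    scores: Dict[str, List[float]], eps: float = 1e-12
) -> Dict[str, List[float]]:
    """Keep only criteria whose scores differ across the sampled responses."""
    return {
        criterion: values
        for criterion, values in scores.items()
        if pvariance(values) > eps
    }


# Scores of four sampled responses under two hypothetical rubric criteria.
scores = {
    "cites_evidence": [1.0, 0.0, 1.0, 0.0],
    "is_in_english": [1.0, 1.0, 1.0, 1.0],  # every response passes: no signal
}
kept = prune_zero_variance(scores)
print(list(kept))
```

The design rationale under this reading is that a criterion with identical scores contributes nothing to a reward that compares responses within a group, so removing it keeps the reward signal dense without changing the ranking.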

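2605.29826 2026-05-29 cs.CL cs.AI Updated version

Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models

Leijiang Gu, Zhen Zeng, Feng Li, Xinjian Gao, Zenglin Shi

Affiliations Hefei University of Technology; Tongji University

AI summary To address causal misalignment and feature entanglement in multimodal knowledge editing, proposes the LDKE framework, which uses fast localization of critical layers and a disentanglement classifier to achieve precise, generalizable edits while maintaining high locality.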

Abstract

Existing methods in Multimodal Knowledge Editing (MKE) have advanced the ability to correct outdated or inaccurate knowledge in Multimodal Large Language Models (MLLMs). However, they exhibit a critical limitation: while effectively modifying target factual pairs, they fail to generalize edits to logically related queries and often cause unintended alterations to unrelated but visually or semantically linked information. We identify and formalize two underlying failure modes causing this issue: Causal Misalignment, which confines edits to the specific sample, and Feature Entanglement, which causes unintended alterations to coupled but irrelevant information. To address these issues, we propose Localized and Disentangled Knowledge Editing (LDKE), a new framework that achieves precise and generalized editing by localizing fact-specific model layers and disentangling target-relevant inputs from irrelevant ones. Our approach introduces a Fast Localization module to identify and update critical layers efficiently, along with a Disentanglement Classifier that routes inputs appropriately to preserve unrelated knowledge. Extensive experiments across various benchmarks and MLLMs demonstrate that LDKE achieves superior performance in propagating edits to related contexts while maintaining high locality.

2605.29815 2026-05-29 cs.AI cs.CL Updated version

PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing

Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, Tomasz Jan Kajdanowicz

Affiliations Department of Artificial Intelligence

AI summary Proposes the PRAIB framework, which defines metrics for review specificity, style, and engagement behavior; a comparison of 11,000 machine-generated reviews with human reviews reveals systematic differences in ratings, cross-referencing, and the identification of weaknesses.

Abstract

The growing number of submitted papers has motivated the exploration of Large Language Models (LLMs) as a means to support and augment the peer review process, particularly in terms of improving its speed and scalability. Yet, it remains unknown whether LLMs engage with scientific manuscripts in the same manner as human reviewers, or whether they merely produce review-looking text. To address this, we introduce the Peer Review AI Benchmark (PRAIB), a novel framework comprising thoroughly defined metrics that measure review specificity, style, and behavior of engagement. To complement the PRAIB framework, we conduct a large-scale empirical study leveraging a dataset of 11,000 reviews generated by five proprietary and open-source models for 1,000 ICLR and NeurIPS papers. Spanning the 2021-2025 period, these machine-generated reviews are compared against original human feedback across diverse prompting strategies to identify systematic behavioral divergences. Our analysis reveals that the generated reviews diverge significantly from feedback provided by human reviewers: LLM ratings are less variable, positively biased, and overconfident, and their cross-reference patterns are model-dependent and distinct from human norms. Furthermore, when evaluated through PRAIB, we observe that LLMs tend to generate longer, more complex reviews, yet frequently overlook the atomic weaknesses noted by human reviewers. By characterizing where and how LLM reviewing behavior departs from human norms, PRAIB provides the community with a diagnostic tool for identifying which aspects of the review process LLMs can reliably support today and which require further development before deployment.

2605.29807 2026-05-29 cs.CL cs.AI cs.LG Updated version

Data filtering methods for training language models

Egor Shevchenko, Elena Bruches

Affiliations Novosibirsk State University; A. P. Ershov Institute of Informatics Systems SB RAS

AI summary Compares two automatic label-error detection methods, Confident Learning and Dataset Cartography, on Russian text classification tasks, finding that their effectiveness depends on dataset characteristics and that Confident Learning significantly improves F1-macro on small, high-noise datasets.

Comments AINL-2026

Abstract

Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods, Confident Learning and Dataset Cartography, on three Russian text classification corpora of varying size, number of classes, and domain: ru_emotion_e-culture (49,123 examples, emotion classification), RuCoLA (8,524 examples, linguistic acceptability), and TERRa (2,337 examples, textual entailment recognition). We use the pre-trained rubert-base-cased model fine-tuned on each corpus. To verify the meaningfulness of filtering, we conduct control experiments with random removal of an equivalent number of examples. Results show that the effectiveness of both methods depends strongly on dataset characteristics: on large corpora with low noise levels, filtering does not improve performance, while on small datasets with high noise, Confident Learning achieves a significant F1-macro improvement. Dataset Cartography demonstrates more conservative behavior, removing fewer examples. Across all corpora, targeted removal by both methods outperforms random removal, confirming the meaningfulness of the approaches.
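For readers unfamiliar with Confident Learning, the core filtering idea can be sketched in a few lines. This is a simplified illustration rather than the procedure used in the paper: it assumes out-of-sample predicted probabilities are already available, uses per-class mean self-confidence as the threshold, and flags an example when its most confident above-threshold class differs from its given label. The function name `flag_label_issues` and the toy data are hypothetical.

```python
from typing import List


def flag_label_issues(probs: List[List[float]], labels: List[int]) -> List[int]:
    """Return indices of examples whose given label looks suspicious."""
    n_classes = len(probs[0])
    # Per-class threshold: mean predicted probability of class k over the
    # examples labeled k (every class is assumed to have at least one example).
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for p, y in zip(probs, labels) if y == k]
        thresholds.append(sum(own) / len(own))

    flagged = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Classes the model assigns with at least the class's usual confidence.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # not confidently any class: leave the example alone
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            flagged.append(i)
    return flagged


# Toy binary task: the last example is labeled 1 but predicted 0 with 0.95.
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.95, 0.05]]
labels = [0, 0, 1, 1, 1]
suspects = flag_label_issues(probs, labels)
print(suspects)
```

A control run in the spirit of the abstract would remove the same number of randomly chosen examples and compare the resulting F1-macro against the targeted removal.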

2605.29801 2026-05-29 cs.AI cs.CL cs.CR cs.CV cs.LG Updated version

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu

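Affiliations Shanghai Artificial Intelligence Laboratory

AI summary To address emerging safety risks from open-world agents, proposes a lightweight, scalable safety alignment framework that updates the agent safety taxonomy, builds a data engine, and trains small models (0.8B-8B parameters), achieving performance comparable to closed-source models and cutting deployment overhead by two orders of magnitude.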

Comments 44 pages, 12 Figures, 9 Tables

Details
Abstract

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.

2605.29800 2026-05-29 cs.CL Updated version

Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels

Guneet Kohli

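Affiliations Apple

AI summary By analyzing the voting behavior of 9 frontier LLMs on natural language inference tasks, finds that highly correlated errors across models leave the panel with only about 2 independent votes' worth of information; actual accuracy falls 8-22 percentage points short of the independent-voting ideal, and neither adding judges nor improving the aggregation algorithm closes the gap.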

Comments 14 pages, 5 figures, 12 tables

Details
Abstract

LLM-as-a-judge panels aggregate votes from multiple models, with the expectation that diverse models yield more reliable evaluations. We develop a framework to measure the true informational value of such panels and quantify how far their reliability falls short of the independent-voting ideal. Testing a panel of 9 frontier LLMs from 7 model families on three natural language inference datasets (each with 100 human annotations per item), we find that the 9 judges effectively provide only about 2 independent votes' worth of information. Roughly three-quarters of the panel's nominal independence is lost because the models make the same mistakes on the same items. The consequences are stark: the panel's actual accuracy falls 8-22 percentage points short of what independent voting would achieve, and the best single judge matches or outperforms the full panel across all conditions. Neither adding more judges nor using smarter aggregation algorithms helps -- established methods close at most 11% of this gap, even with access to the correct answers. We quantify these findings using the Kish effective sample size (n_eff) and a Condorcet null model, and show the deficit is robust across prompt variants, temperatures, chain-of-thought reasoning, and a pairwise preference task (RewardBench). The bottleneck is correlated judges, not the aggregation algorithm, implying that scaling up panels cannot substitute for genuinely independent evaluation.
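
The "about 2 independent votes" figure can be read through a standard design-effect calculation. The sketch below is an illustration under an equal-correlation assumption, not the paper's estimator: with n equally weighted judges whose errors share an average pairwise correlation rho, the effective sample size is n / (1 + (n - 1) * rho), and the Condorcet baseline is the accuracy of a majority vote among n independent judges who are each correct with probability p. The values rho = 0.4375 and p = 0.8 are hypothetical, chosen only so that nine judges come out at two effective votes.

```python
from math import comb


def effective_votes(n_judges, avg_error_correlation):
    """Design-effect form of the effective sample size for equally weighted judges."""
    return n_judges / (1 + (n_judges - 1) * avg_error_correlation)


def independent_majority_accuracy(n_judges, p_correct):
    """Majority-vote accuracy if all judges erred independently (odd n assumed)."""
    majority = n_judges // 2 + 1
    return sum(
        comb(n_judges, k) * p_correct**k * (1 - p_correct) ** (n_judges - k)
        for k in range(majority, n_judges + 1)
    )


# Hypothetical numbers for illustration only.
print(effective_votes(9, 0.4375))  # 2.0
print(round(independent_majority_accuracy(9, 0.8), 3))
```

Under this reading, adding judges with the same error correlation moves the effective count toward 1 / rho rather than toward n, which is consistent with the abstract's point that larger panels do not substitute for independence.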

2605.29797 2026-05-29 cs.CL Updated version

Metric-Dependent Annotation Saturation for Learning from Label Distributions

Guneet Kohli

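Affiliations Apple

AI summary Studies how the number of annotators needed when learning from label distributions depends on the evaluation metric: entropy correlation needs 20-50 annotators to converge, KL divergence saturates at about 10 annotators, and soft labels outperform label smoothing.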

Comments 16 pages, 3 figures, 14 tables

Details
Abstract

When annotators disagree on a label, the disagreement itself carries signal -- and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI, a dataset providing 100 independent annotator judgments per item, and identify metric-dependent saturation. In our 3-class NLI setting, entropy correlation -- whether the model identifies which items elicit disagreement -- requires N ~ 20-50 annotators to converge, while distributional match (KL divergence) saturates by N ~ 10 (87-95% of improvement across five model seeds). This finding rests on a prior observation: soft labels carry item-specific signal that label smoothing cannot replicate. Across five smoothing intensities, entropy correlation clusters at r ~ 0.45-0.49, while soft labels reach r = 0.643 (p < 0.001); per-item analysis traces this gap to smoothing's inability to distinguish ambiguous items from clear ones. The soft-label advantage replicates across two architectures (DeBERTa, RoBERTa), a non-NLI-pretrained baseline, and an exploratory cross-domain evaluation on content safety. These results suggest that annotation budgets should be informed by the target evaluation metric rather than set uniformly.
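
The contrast between soft labels and label smoothing can be illustrated with a small sketch. It assumes the common smoothing form (1 - eps) * one_hot + eps / K applied to the majority label; the annotator counts, eps = 0.1, and the helper names are hypothetical and not taken from the paper. Two items with the same majority label receive identical smoothed targets, so smoothing cannot signal which one is ambiguous, while the soft labels differ in entropy.

```python
from math import log


def soft_label(counts):
    """Normalize per-class annotator counts into a label distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def smoothed_label(counts, eps=0.1):
    """Label smoothing around the majority class: (1 - eps) * one_hot + eps / K."""
    k = len(counts)
    majority = max(range(k), key=lambda i: counts[i])
    return [(1 - eps) * (1.0 if i == majority else 0.0) + eps / k for i in range(k)]


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(q * log(q) for q in p if q > 0)


def kl_divergence(p, q):
    """KL(p || q); q must be positive wherever p is positive."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical 3-class NLI items, 100 annotators each.
clear_item = [95, 3, 2]
ambiguous_item = [40, 35, 25]

for counts in (clear_item, ambiguous_item):
    soft = soft_label(counts)
    smooth = smoothed_label(counts)
    print(round(entropy(soft), 3), round(entropy(smooth), 3),
          round(kl_divergence(soft, smooth), 3))
```

Subsampling fewer annotators per item, as the paper does with ChaosNLI, would amount to drawing N votes from each count vector before calling `soft_label`.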

2605.29791 2026-05-29 cs.CL Updated version

ActTraitBench: Quantifying the Knowledge-Decision Gap in Large Language Models via Human-Grounded Behavioral Validation

Yutong Yang, Chenxi Miao, Weikang Li, Yunfang Wu

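Affiliations Peking University; Baidu Inc

AI summary Proposes the ActTraitBench framework, which uses human data to build one-to-one mappings between psychometric facets and behavioral paradigms and calibrates LLM score distributions via quantile mapping; it reveals a knowledge-decision gap between LLM self-reports and behavioral decisions and introduces the CoCA intervention to mitigate it.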

Details
Abstract

While Large Language Models (LLMs) can convincingly simulate personas in explicit self-reports, they often deviate in implicit behavioral decisions, revealing a substantial Knowledge-Decision Gap ($G_{\text{KD}}$). Existing benchmarks struggle to measure this asymmetry due to limited construct validity, multi-dimensional entanglement, and distributional biases in LLM-based evaluation. To address these issues, we propose ActTraitBench, a human-grounded evaluation framework for measuring personality consistency in LLMs. Grounded in empirical human data, ActTraitBench establishes one-to-one mappings between psychometric facets and behavioral paradigms, and applies a Distributional Calibration via Quantile Mapping procedure to align LLM-judge score distributions with human norms. Experiments on 14 mainstream LLMs reveal a pervasive knowledge-decision asymmetry, where larger and more capable models often exhibit stronger behavioral divergence despite highly consistent self-reports. To mitigate this gap, we further introduce the Chain of Cognitive Alignment (CoCA), a plug-and-play inference-time intervention that improves alignment in reasoning-capable frontier models while exposing clear capability limitations in smaller architectures.
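
Quantile mapping, the calibration step named in the abstract, can be sketched generically as follows. This is plain empirical quantile mapping, not the paper's exact procedure: an LLM-judge score is replaced by the human score that sits at the same empirical quantile. The function name and both score samples are hypothetical.

```python
from bisect import bisect_right


def quantile_map(score, judge_scores, human_scores):
    """Map a judge score to the human score at the same empirical quantile."""
    judge_sorted = sorted(judge_scores)
    human_sorted = sorted(human_scores)
    # Number of judge scores <= score, i.e. the empirical CDF times n.
    rank = bisect_right(judge_sorted, score)
    n, m = len(judge_sorted), len(human_sorted)
    # Integer ceiling of rank * m / n, converted to a 0-based index.
    index = (rank * m + n - 1) // n - 1
    index = max(0, min(m - 1, index))
    return human_sorted[index]


# Hypothetical samples: the judge compresses scores into the top of the scale.
judge_scores = [6, 7, 7, 8, 8, 8, 9, 9, 10, 10]
human_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print([quantile_map(s, judge_scores, human_scores) for s in (6, 8, 10)])  # [1, 6, 10]
```

The mapping is monotone, so it changes the shape of the judge's score distribution without reordering the items it scores.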

2605.29782 2026-05-29 cs.LG cs.AI cs.CL Updated version

Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang, Yongqiang Chen, Zhitang Chen, James Cheng

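Affiliations Department of Computer Science and Engineering, The Chinese University of Hong Kong; Huawei Technologies Ltd

AI summary To address inaccurate state value estimation in LLM reinforcement learning, proposes Numca (which uses numerical spans as gradable milestones) and Hista (which uses hidden states to compute a weighted average over disjoint rollouts and their returns), significantly improving estimation accuracy and training performance.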

Comments Accepted at ICML 2026

Details
Abstract

Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state value estimation within existing RL frameworks and show that critics in standard approaches like PPO collapse to a coarse group-average baseline. To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses the LLM's hidden states as representations to compute a weighted average over disjoint rollouts and their returns. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.
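
One way to picture the Hista idea is as a similarity-weighted average of returns. The sketch below is an assumption-laden illustration, not the authors' algorithm: it weights each stored rollout by a softmax over the cosine similarity between its hidden state and the query state. The temperature, the toy vectors, and the function names are all hypothetical.

```python
from math import exp, sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def hidden_state_value(query_state, rollout_states, rollout_returns, temperature=0.1):
    """Estimate a state value as a softmax-weighted average of rollout returns."""
    sims = [cosine(query_state, h) for h in rollout_states]
    top = max(sims)  # subtract the max for numerical stability
    weights = [exp((s - top) / temperature) for s in sims]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, rollout_returns)) / total


# Hypothetical 2-dimensional "hidden states" from two unrelated rollouts.
states = [[1.0, 0.0], [0.0, 1.0]]
returns = [1.0, 0.0]
print(hidden_state_value([1.0, 0.0], states, returns))  # close to 1.0
print(hidden_state_value([1.0, 1.0], states, returns))  # 0.5
```

A group-average baseline corresponds to the limit of a very large temperature, where every rollout receives the same weight regardless of its hidden state.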

2605.29744 2026-05-29 cs.AI cs.CL cs.LG cs.MA Updated version

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang

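Affiliations Anthropic AI

AI summary Proposes HetMedAgent, a heterogeneous multi-agent framework that coordinates generalist LLMs with domain specialist models through conflict-aware evidence fusion, uncertainty-driven clinician intervention triggering, and adaptive threshold calibration; experiments on three clinical decision-making tasks validate the irreplaceable value of specialist models in modality-specific analysis.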

Comments Accepted at ICML 2026. 12 pages main text, 16 pages appendix

Details
Abstract

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.

2605.29741 2026-05-29 cs.CL Updated version

AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation

Idris Abdulmumin, Tajuddeen Gwadabe, Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Nomonde Khalo, Ibrahim Said Ahmad, Abiodun Modupe, Anina Mumm, Sibusiso Biyela, Michelle Rabie, Johanna Havemann, Marek Rei, Jade Abbott, Vukosi Marivate

Affiliations Data Science for Social Impact, University of Pretoria; Masakhane Research Foundation; Imperial College London; Mila, McGill University; Canada CIFAR AI Chair; University of Cape Town; University of Wisconsin - Stevens Point; Independent Consultant; University of South Africa; Independent Researcher; Access 2 Perspectives; Lelapa AI

AI summary To address the lack of scientific terminology in African languages, builds AfriScience-MT, a parallel corpus covering 6 African languages and 11 scientific domains, and evaluates machine translation systems and large language models in zero-shot, few-shot, and fine-tuned settings.

Details
Abstract

The dominance of colonial languages in African education and scientific communication limits how hundreds of millions of speakers of African languages access and produce scientific knowledge. A core obstacle is the lack of established scientific terminology in these languages. We introduce AfriScience-MT, a parallel corpus covering six African languages (Amharic, Hausa, Luganda, Northern Sotho, Yorùbá, and isiZulu) across 11 scientific domains. Professional translators, working with expert science communicators, translated plain-language summaries of scientific papers into each target language and created new terms where none existed. We benchmark machine translation systems and large language models in zero-shot, few-shot, and fine-tuned settings. Our results show that closed-source models outperform all open-source models at both the sentence and document levels: GPT-5.4 and Gemini-3.1-Flash-Lite lead with average sentence-level COMET scores of 68.3 and 68.0, respectively, and tie at an average document-level COMET of 48.3. Among open systems, fine-tuned NLLB-1.3B reaches 67.3 at the sentence level, and TranslateGemma-12B reaches 44.0 at the document level with 1-shot in-context learning. We release AfriScience-MT to support benchmarking and document-level scientific MT for African languages.

2605.29738 2026-05-29 cs.CL cs.AI Updated version

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Volodymyr Ovcharov

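Affiliations SecondLayer

AI summary Proposes Multi-Legal-Bench, the first cross-jurisdictional legal benchmark, which evaluates LLMs across 6 countries, 4 language families, and 134 million court decisions; finds that few-shot effects replicate across jurisdictions, no single model dominates all languages, cross-lingual transfer does not follow language proximity, and tokenizer efficiency does not significantly predict cross-lingual accuracy.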

Comments 14 pages, 5 figures, 8 tables. Dataset: https://huggingface.co/datasets/overthelex/multi-legal-bench

Details
Abstract

Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdictional legal benchmark that evaluates identical tasks across six countries (Ukraine, France, Netherlands, Poland, Czech Republic, Lithuania), four language families, and 134 million court decisions. The benchmark defines five tasks (court-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction) mapped to structured metadata from national court registries, forming a deliberately sparse 5x6 task-jurisdiction matrix (20 of 30 cells filled). We evaluate 7 frontier LLMs under zero-shot and 3-shot prompting via AWS Bedrock, with 4 additional small/medium models (3-12B) for scaling analysis. Our results reveal that: (1) task-dependent few-shot effects discovered in Ukrainian replicate across all jurisdictions; (2) no single model dominates any language: rankings shift with both task and jurisdiction; (3) cross-lingual few-shot transfer does not follow language proximity: UA->FR (Romance, -2.1 pp) transfers better than UA->PL (Slavic, -13.7 pp), with label-set alignment predicting transfer quality better than language family; and (4) tokenizer fertility, despite a 2.3x spread, does not significantly predict cross-lingual accuracy (r=-0.27, p=0.14), suggesting that model architecture and pretraining data dominate tokenizer efficiency. We release all data, prompts, and model predictions.
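
Tokenizer fertility is commonly defined as the average number of subword tokens produced per word. The sketch below computes it with a deliberately crude stand-in tokenizer (fixed-size character chunks) so that it runs without external libraries. The chunking tokenizer, the whitespace word splitting, and the sample strings are assumptions for illustration; a real measurement would use each model's own tokenizer on text in each language.

```python
def chunk_tokenizer(word, chunk_size=4):
    """Stand-in tokenizer: split a word into fixed-size character chunks."""
    return [word[i:i + chunk_size] for i in range(0, len(word), chunk_size)]


def fertility(text, tokenize=chunk_tokenizer):
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("fertility is undefined for empty text")
    n_tokens = sum(len(tokenize(w)) for w in words)
    return n_tokens / len(words)


# Hypothetical samples: short words versus long words.
short_words = "the court ruled"
long_words = "constitutional jurisprudence"
print(round(fertility(short_words), 2), round(fertility(long_words), 2))
```

A spread such as the 2.3x reported in the abstract would correspond to the ratio between the highest and lowest fertility across the evaluated languages.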

2605.29737 2026-05-29 cs.CR cs.CL cs.SE Updated version

Minimal Prompt Perturbations Lead to Code Vulnerabilities: Prompt Fragility and Hidden-State Signals in Coding LLMs

Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic

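Affiliations IEM, HES-SO, Le Foyer, Techno-Pôle 1, Sierre, Switzerland; Cyber-Defence Campus, armasuisse Science and Technology, Thun, Switzerland

AI summary Using token-level mutation experiments, shows that tiny prompt perturbations (such as a single-character change) can flip LLM-generated code from secure to vulnerable, and uses hidden-state analysis to show that input-handling vulnerabilities are more predictable than secure-defaults vulnerabilities.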

Details
Abstract

LLM-based coding assistants are seeing rapid adoption, offering substantial gains in developer productivity. As organizations increasingly ship code these agents produce, the security of that code becomes critical. Prior work has shown that minor prompt perturbations degrade the functional correctness of LLM-generated code, but whether they also compromise code security has remained unstudied. We apply token-level mutations to prompts across three models and five programming languages, and show that mutations as small as a single-character change can flip generated code from secure to vulnerable. Probing the models' hidden states reveals that this fragility is partially encoded in prompt representations, but unevenly so. Input-handling vulnerabilities, where the model omits validation or sanitization, are more predictable (mean AUC 0.753) than secure-defaults vulnerabilities, where insecure code stems from one local choice such as a weak algorithm or unsafe parameter (mean AUC 0.674). These results show that the threat model for LLM-assisted coding extends beyond prompt injection to ordinary prompt variation, and indicate that input-handling flaws can be caught before generation while secure-defaults flaws require intervention during decoding.

2605.29734 2026-05-29 cs.CL Updated version

HTAM: Hierarchical Transition-Attended Memory for Operator Optimization

Yining Zhang, Mingyang Yi, Chen Wang, Xuwen Xiang, Tianhe Jia, Zedong Dan, Chengqing Zong, Yue Wang

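Affiliations School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Zhongguancun Academy; Renmin University of China

AI summary Proposes the HTAM framework, which builds a Hierarchical Transition Graph (HTG) to organize coarse-grained global directions and fine-grained local strategies, addressing the granularity mismatch in LLM-based GPU operator optimization and significantly improving correctness and speedup.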

Comments 24 pages, 5 figures

Details
Abstract

High-performance GPU kernels are essential for efficient LLM deployment, yet optimizing them remains expertise-intensive. Recent LLM-based code generation makes automatic GPU operator generation promising, but operator optimization remains a hardware-aware search problem. Existing LLM-based methods face a granularity mismatch: coarse hints are reusable but hard to execute, whereas detailed memories are actionable but enlarge the search space and obscure optimization bottlenecks. The key challenge is therefore to organize optimization experience at an appropriate granularity. To address this issue, this paper proposes HTAM (Hierarchical Transition-Attended Memory), a coarse-to-fine framework for LLM-based operator optimization. HTAM builds a two-level Hierarchical Transition Graph (HTG) to organize coarse global directions, detailed local strategies, and transition experience between optimization steps. During each evolution step, HTAM selects a global direction from the current state and recent optimization history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. Experiments on the full KernelBench suite demonstrate that HTAM consistently improves correctness, fast-solution rate, and speedup over LLM-based baselines, while backend and Robust-KBench studies indicate transferable benefits from structured memory.

2605.29715 2026-05-29 cs.CL Updated version

User-Aware Active Knowledge Acquisition for Emotional Support Dialogue

Mufan Xu, Kehai Chen, Jiahao Hu, Xinchao Xu, Muyun Yang, Tiejun Zhao, Min Zhang

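Affiliations Harbin Institute of Technology, China; Baidu Inc., Beijing, China

AI summary Proposes the User-Aware Active Knowledge Acquisition (UKA) framework, which uses Theory-of-Mind uncertainty estimation and active learning to efficiently acquire user-aligned conversational knowledge in emotional support dialogue, improving dialogue quality and user alignment.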

Details
Abstract

Emotional support plays an important role in dialogue systems, and its success depends on adapting to a user's evolving and implicit needs across multi-turn interactions while leveraging the strong reasoning capacity of large language models. However, since signals about user needs are often weak, indirect, and can only be disambiguated through multi-turn interaction, existing emotional support methods often struggle to acquire and generalize relevant conversational knowledge efficiently. To bridge this gap, we introduce User-Aware Active Knowledge Acquisition (UKA), a gradient-free active dialogue learning framework that explicitly represents uncertainty about user needs and incorporates active learning into both knowledge acquisition and response selection. We propose a Theory-of-Mind uncertainty estimation mechanism that allows the model to prioritize responses, thereby eliciting more informative user feedback. UKA is capable of efficiently exploring user-aligned conversational knowledge during training while maintaining robustness at test time. Experiments across multiple dialogue benchmarks and model architectures demonstrate that our approach consistently outperforms strong baselines in dialogue quality and user alignment.

2605.29714 2026-05-29 cs.CL Updated version

Leveraging Routing Dynamics in Mixture-of-Experts Models for Efficient Language Adaptation

Aditi Khandelwal, Marius Mosbach, Verna Dankers, Siva Reddy, Golnoosh Farnadi

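Affiliations Mila – Quebec AI Institute & McGill University

AI summary Studies routing dynamics during multilingual continual pre-training of an English-centric Mixture-of-Experts model, finds that routing in early and middle layers is diffuse and language-agnostic while language specialization emerges in the final layers, and proposes a parameter-efficient adaptation strategy that updates only the language-specific and shared experts in the final layers.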

Details
Abstract

Mixture-of-Experts (MoE) models are widely used to scale language models, yet their expert routing behavior and adaptation in a multilingual setting remain underexplored. In this work, we study multilingual routing dynamics during continual pre-training of an English-centric MoE model on a multilingual corpus, analyzing how expert usage varies across languages. We find that continual multilingual pre-training leads to diffused, language-agnostic routing in early and middle layers, with language specialization primarily emerging in the final layers. We also show that token-level vocabulary overlap between languages plays an important role in how languages are routed. Motivated by these findings, we propose a parameter-efficient adaptation strategy that updates language-specific and shared experts in the final MoE layers. Experiments on MultiBLiMP and Belebele show that our method achieves a strong performance-efficiency trade-off, attaining competitive performance relative to fine-tuning complete final layers, while updating less than 2% of the parameters. Overall, our findings provide insights into where and how language specialization emerges in MoEs during continual pre-training and provide practical insights for low-resource multilingual adaptation. Our code is available at https://github.com/aditi184/moe-routing-adaptation.

2605.29712 2026-05-29 cs.CL cs.AI Updated version

Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies

Yuxuan Ye, Raul Santos-Rodriguez, Edwin Simpson

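Affiliations Intelligent Systems Laboratory; University of Bristol

AI summary Frames grounded claim factuality checking as a true/false reading comprehension task, prompts language models with explicit test-taking strategies for efficient reasoning, and trains small language models to reduce inference cost.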

Comments ACL 2026 Main

Details
Abstract

Grounded claim factuality checking is important for large language model (LLM) applications such as retrieval-augmented generation, as it helps users assess the correctness of generated outputs. Existing metrics using entailment classifiers require dataset-specific threshold tuning, while LLM-based approaches often use direct prompting, which underutilises the reasoning capabilities of LLMs. We address this by formulating grounded claim factuality checking as a true/false reading comprehension task and prompting LLMs with explicit test-taking strategies for efficient reasoning. Our method reduces token usage by over 80% compared to unguided open-ended reasoning, and achieves competitive performance to more expensive alternatives across two factuality benchmarks, setting a new state of the art on one. To further reduce inference cost, we train small language models (SLMs) to replace LLMs in the checking pipeline. Using supervised fine-tuning (SFT) and a self-revision mechanism, the SLMs learn to improve their factuality judgements. Experimental results show that the resulting SLMs perform on par with strong baselines, combining low inference costs with generating supporting rationales to support interpretability. Code and datasets will be released upon acceptance.

2605.29711 2026-05-29 cs.CL cs.AI Updated version

Personalized Turn-Level User Conversation Satisfaction Benchmark

Zhefan Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang, Quanjia Yan, Hengliang Luo

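Affiliations Department of Computer Science and Technology, Tsinghua University, Beijing, China; Institute for AI Industry Research, Tsinghua University, Beijing, China; Meituan

AI summary To address personalized satisfaction evaluation of AI assistant responses, proposes a satisfaction evaluator that combines user memories with target-turn context and builds the PersTurnBench benchmark, which enables controlled comparison of generation models via replay.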

Details
Abstract

User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation methods mostly measure generic response quality, making it difficult to judge whether a response satisfies a user at a specific turn. We study this problem as personalized turn-level user conversation satisfaction evaluation. We build a conversation satisfaction evaluator that combines compact user memories with target-turn context to produce satisfaction scores and dissatisfaction-oriented rationales. Meta-evaluation against human satisfaction annotations shows that personalized memory and post-hoc score calibration improve ordinal agreement and dissatisfied-turn detection over supervised, retrieval-based, and generic LLM-as-a-judge baselines. We further introduce PersTurnBench, a personalized turn-level user conversation satisfaction benchmark that uses the verified evaluator to assess generation models via replay. By holding the replay state fixed, PersTurnBench enables controlled comparison of generic generation models and memory-augmented personalized systems without new human labels for every candidate model. The evaluator and benchmark let researchers compare candidate generation models on personalized satisfaction without collecting new user feedback for every model.

2605.29708 2026-05-29 cs.CL Updated version

Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs

Zhibo Zhang, Yuxi Li, Zhen Ouyang, Ling Shi, Kailong Wang

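Affiliations Huazhong University of Science and Technology, Wuhan, China; Nanyang Technological University, Singapore

AI summary Proposes the RASET framework to study the relationship between safety alignment and routed expert specialization in Mixture-of-Experts LLMs, finding that routing patterns are largely topic-driven while safety behavior can be changed by tuning a small number of experts without altering routing paths.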

Comments 11 pages, 4 figures

详情
AI中文摘要

混合专家(MoE)大语言模型依赖于稀疏的、由路由器驱动的专家激活,然而安全对齐如何与路由专家专业化相互作用仍未被充分探索。一种常见的直觉是,安全行为可能通过将有害请求路由到不同的拒绝导向专家来控制。在这项工作中,我们为不同的情况提供了经验证据:对齐的MoE大语言模型中的路由模式主要是主题驱动的,而安全行为可以在不改变模型固有路由路径的情况下被改变。基于这一观察,我们提出了**RASET**(**R**outer-**A**gnostic **S**afety-critical **E**xpert **T**uning,路由器无关的安全关键专家微调),这是一个红队框架,用于探测集中在少数专家中的安全执行,同时保持模型固有的路由行为。**RASET**通过对比路由敏感性标准识别安全关键专家,并仅对选定的专家应用参数高效微调,从而相对于路由器干预最小化语义干扰。这些结果揭示了独特的MoE安全风险,强调了需要专家感知的对齐机制。

英文摘要

Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled by routing harmful requests to distinct refusal-oriented experts. In this work, we provide empirical evidence for a different picture: routing patterns in aligned MoE LLMs are largely topic-driven, while safety behavior can be altered with little change to the model's intrinsic routing path. Motivated by this observation, we present **RASET** (**R**outer-**A**gnostic **S**afety-critical **E**xpert **T**uning), a red-teaming framework that probes safety enforcement that is localized in a small subset of experts while preserving the model's intrinsic routing behavior. **RASET** identifies safety-critical experts via a contrastive routing-sensitivity criterion and applies parameter-efficient tuning only to the selected experts, minimizing semantic disruption relative to router-steering interventions. These results reveal a distinct MoE safety risk, highlighting the need for expert-aware alignment mechanisms.

2605.29707 2026-05-29 cs.CL 版本更新

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Domino: 在推测解码中将因果建模与自回归草稿解耦

Jianuo Huang, Yaojie Zhang, Qituan Zhang, Hao Lin, Hanlin Xu, Linfeng Zhang

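Affiliations: EPIC Lab, Shanghai Jiao Tong University; School of Software Engineering, HUST; UESTC; Fudan University; Huawei

AI summary: Proposes Domino, which uses a parallel draft backbone and a lightweight Domino head to decouple causal dependency modeling from autoregressive draft execution; combined with a base-anchored training curriculum, it reaches up to 5.49x end-to-end speedup and up to 5.8x throughput speedup on Qwen3 models.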

详情
AI中文摘要

推测解码通过草拟多个令牌并与目标模型并行验证来加速LLM推理。然而,其实际加速受限于草稿质量与草稿成本之间的权衡:自回归草稿器建模草稿令牌间的因果依赖但引入顺序开销,而并行草稿器降低草稿成本但削弱块内依赖建模。本文提出Domino,一种将因果依赖建模与昂贵的自回归草稿执行解耦的推测解码框架。Domino首先使用并行草稿骨干为整个块生成初步草稿分布,然后应用轻量级Domino头以前缀依赖的因果信息对其进行细化。为稳定教师强制因果编码,我们进一步引入基础锚定训练课程,首先强化并行骨干,然后逐步将优化转向因果修正的最终分布。在Qwen3模型上的实验表明,Domino在Transformers后端下实现高达5.49倍的端到端加速,在SGLang服务下实现高达5.8倍的吞吐量加速。

英文摘要

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
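The draft-then-verify step that the abstract builds on can be sketched as follows. This is a minimal illustration of generic greedy verification in speculative decoding, not Domino's own drafting or training procedure; the function name `accept_prefix` and the toy token IDs are assumptions introduced here.

```python
from typing import List


def accept_prefix(draft: List[int], target_greedy: List[int]) -> List[int]:
    """Greedy verification for one speculative-decoding block (illustrative).

    draft[i] is the i-th drafted token. target_greedy[i] is the token the
    target model would pick at position i given the prefix draft[:i], so it
    has len(draft) + 1 entries (the last one is the bonus position).
    """
    if len(target_greedy) != len(draft) + 1:
        raise ValueError("target_greedy must have len(draft) + 1 entries")
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_greedy[i]:
            break  # first disagreement: discard the rest of the draft
        accepted.append(tok)
    # The target's own token at the first unverified position is always kept,
    # so each verification pass emits at least one token.
    accepted.append(target_greedy[len(accepted)])
    return accepted
```

The more drafted tokens survive verification per pass, the fewer sequential target-model calls are needed, which is why draft quality and drafting cost trade off against each other.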

2605.29682 2026-05-29 cs.CL 版本更新

Scaling Laws for Agent Harnesses via Effective Feedback Compute

智能体框架的有效反馈计算缩放定律

Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che

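Affiliations: Harbin Institute of Technology

AI summary: Proposes Effective Feedback Compute (EFC) as a scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained, and that predicts agent-harness performance better than raw-compute baselines across multiple tasks.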

详情
AI中文摘要

智能体框架通过决定模型如何调用工具、接收反馈、验证中间状态、存储记忆和修正解决方案,日益决定语言模型系统的性能。然而,当前的测试时缩放分析通常通过原始支出(令牌、工具调用、操作、挂钟时间或成本)来参数化这一过程,这并未区分有用反馈与冗余或不稳定的交互。我们引入了有效反馈计算(EFC),这是一种轨迹级缩放坐标,仅在反馈具有信息性、有效性、非冗余性且被保留用于后续决策时才计入反馈,并在比较具有不同反馈需求的任务时通过任务需求进行归一化。在合成可控任务、可执行代码任务、真实基准轨迹、保留集和前瞻性验证批次中,基于EFC的坐标一致地比原始计算基线和强多变量SAS基线更好地预测失败率。在受控缩放中,原始令牌和工具调用解释的变异有限(R²=0.33和0.42),SAS达到0.88,而Oracle-EFC和Estimated-EFC达到0.94,Oracle-EFC/D_task达到0.99。匹配预算的干预表明,在原始成本和工具调用固定的情况下,提高反馈质量将成功率从0.27提升到0.90。在混合真实轨迹上,NRS-EFC/D_task达到R²=0.92,而原始计算具有接近零或负的拟合,并且在前瞻性保留集中仍然是最佳预测器(R²=0.85)。这些结果表明,框架缩放受计算量多少的影响较小,而更多地取决于原始预算如何高效地转化为持久且任务充分的反馈。

英文摘要

Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expenditure -- tokens, tool calls, operations, wall time, or cost -- which does not distinguish useful feedback from redundant or unstable interaction. We introduce \emph{Effective Feedback Compute} (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, and we normalize it by task demand when comparing tasks with different feedback requirements. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation ($R^2=0.33$ and $0.42$), SAS reaches $0.88$, while Oracle-EFC and Estimated-EFC reach $0.94$ and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$. Matched-budget interventions show that improving feedback quality raises success from $0.27$ to $0.90$ while raw cost and tool calls are fixed. On mixed real traces, NRS-EFC/$D_{\mathrm{task}}$ reaches $R^2=0.92$ while raw compute has near-zero or negative fit, and it remains the best predictor in a prospective holdout ($R^2=0.85$). These results suggest that harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback.
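The crediting rule described in the abstract can be sketched as a simple count. This is a hedged illustration only: the abstract does not give the exact EFC formula, so the boolean flags, the unit credit per event, and the names `FeedbackEvent`, `effective_feedback_compute`, and `normalized_efc` are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FeedbackEvent:
    """One piece of feedback in an agent trace (hypothetical schema)."""
    informative: bool
    valid: bool
    non_redundant: bool
    retained: bool


def effective_feedback_compute(events: Sequence[FeedbackEvent]) -> int:
    """Count only the events that satisfy all four criteria (assumed unit credit)."""
    return sum(
        1
        for e in events
        if e.informative and e.valid and e.non_redundant and e.retained
    )


def normalized_efc(events: Sequence[FeedbackEvent], task_demand: float) -> float:
    """EFC divided by a task-demand term, mirroring the EFC / D_task coordinate."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    return effective_feedback_compute(events) / task_demand
```

Under this reading, two traces with the same number of tool calls can differ sharply in EFC if one of them mostly receives redundant or discarded feedback.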

2605.29678 2026-05-29 cs.CL 版本更新

Spurious Prompts: Can Irrelevant Prompts Steer Large Language Models?

虚假提示:无关提示能否引导大型语言模型?

Pawel Batorski, Abtin Pourhadi, Jerzy Sarosiek, Przemyslaw Spurek, Paul Swoboda

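Affiliations: Heinrich Heine University Düsseldorf; Jagiellonian University; IDEAS Research Institute

AI summary: Studies how semantically unrelated prompts (spurious prompts) affect LLM behavior, proposes a black-box search procedure to discover them, and shows that they can substantially shift model outputs across multiple benchmarks and models.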

详情
AI中文摘要

大型语言模型对提示高度敏感,但这种敏感性通常通过任务相关的指令、示例或推理线索来研究。本文研究了一种不同形式的提示敏感性:与任务语义无关的提示是否仍然能够引导模型行为。我们称其为虚假提示,并展示了其惊人的有效性。我们还提出了一种简单的黑盒搜索程序来发现它们。在推理和问答基准上,使用参数从0.8B到27B、涵盖三个模型家族的模型,我们展示了虚假提示可以提升性能,通常匹配或超越标准提示基线和任务感知的提示优化。我们进一步展示了它们可以引导模型产生非预期行为,例如重复选择第一个答案选项、产生错误答案、返回偶数、质数或小数,而无需明确指示模型这样做。这些发现揭示了一种新的提示敏感性:LLM可以被与它们被要求解决的任务无关的提示系统地引导。我们的代码可在 https://github.com/Batorskq/spurious 获取。

英文摘要

Large language models are highly sensitive to prompts, but this sensitivity is usually studied through task-relevant instructions, demonstrations, or reasoning cues. In this paper, we study a different form of prompt sensitivity: whether prompts that are semantically unrelated to the task can nevertheless steer model behavior. We call them spurious prompts and show their surprising efficacy. We also propose a simple black-box search procedure for discovering them. Across reasoning and question-answering benchmarks, using models ranging from 0.8B to 27B parameters and spanning three model families, we show that spurious prompts can improve performance, often matching or outperforming standard prompting baselines and task-aware prompt optimization. We further show that they can steer models toward unintended behaviors, such as repeatedly selecting the first answer option, producing incorrect answers, returning an even, prime or small number without explicitly instructing the model to do so. These findings reveal a new kind of prompt sensitivity: LLMs can be systematically steered by prompts that are unrelated to the task they are asked to solve. Our code is available at https://github.com/Batorskq/spurious

2605.29670 2026-05-29 cs.CL cs.AI 版本更新

EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL

EviLink: 面向大规模Text-to-SQL的基于不确定性引导证据获取的多路径模式链接

Huawei Zheng, Sen Yang, Zhaorui Yang, Yuhui Zhang, Haozhe Feng, Haoxuan Li, Xuan Yi, Chao Hu, Defeng Xie, Chen Hou, Danqing Huang, Wei Chen, Yingcai Wu, Peng Chen, Dazhen Deng

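Affiliations: School of Software Technology, Zhejiang University; State Key Lab of CAD&CG, Zhejiang University; Tencent TEG; School of Mathematical Sciences, Peking University

AI summary: Proposes EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition and reframes schema linking as uncertainty-aware schema-need inference, balancing schema completeness, schema relevance, and token cost to improve large-scale Text-to-SQL.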

详情
AI中文摘要

模式链接是大规模Text-to-SQL中困难且重要的步骤,系统必须从庞大且模糊的数据库中识别出紧凑且充分的模式上下文。现有方法通常将模式链接视为围绕单个SQL路径的确定性选择,但复杂问题可能允许多个具有不同模式需求的有效实现。我们将模式链接重新定义为对多个可行SQL路径的不确定性感知模式需求推理,其中系统区分必需模式项与路径依赖的不确定项,并仅在需要时获取证据。我们通过EviLink实例化这一重构,它结合了多假设模式基础与不确定性引导的证据获取。在BIRD-Dev和Spider2-Snow上的实验表明,这种视角改善了模式完整性、模式相关性和令牌成本之间的平衡。在Spider2-Snow上,EviLink实现了90.15%的字段级严格召回率,平均使用123.30K令牌,并在固定生成器下提升了下游SQL生成性能。

英文摘要

Schema linking is a difficult and important step in large-scale Text-to-SQL, where systems must identify a compact yet sufficient schema context from large and ambiguous databases. Existing methods often treat schema linking as deterministic selection around a single SQL path, but complex questions may admit multiple valid realizations with different schema needs. We reframe schema linking as uncertainty-aware schema-need inference over multiple plausible SQL paths, where the system distinguishes required schema items from path-dependent uncertain ones and acquires evidence only where needed. We instantiate this reframing with EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition. Experiments on BIRD-Dev and Spider2-Snow show that this perspective improves the balance among schema completeness, schema relevance, and token cost. On Spider2-Snow, EviLink achieves 90.15% field-level strict recall rate, uses 123.30K average tokens, and improves downstream SQL generation under a fixed generator.
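One way to picture the split between required and path-dependent schema items is a set computation over several candidate SQL paths. This is a sketch under stated assumptions: treating the intersection as "required" and the rest of the union as "uncertain" is an interpretation introduced here, not EviLink's documented rule, and `split_schema_needs` and the column names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def split_schema_needs(hypotheses: Iterable[Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split schema items across plausible SQL paths (illustrative).

    Each hypothesis is the set of columns one candidate SQL path would use.
    Items used by every path are treated as required; items used by only
    some paths are treated as uncertain and are candidates for further
    evidence acquisition.
    """
    hyps = [set(h) for h in hypotheses]
    if not hyps:
        return set(), set()
    required = set.intersection(*hyps)
    uncertain = set.union(*hyps) - required
    return required, uncertain
```

Only the uncertain set would then need extra lookups (for example, sampling column values), which is where the token savings in such a scheme would come from.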

2605.29668 2026-05-29 cs.AI cs.CL 版本更新

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

GRASP: 门控回归感知技能提议器用于自我改进的LLM智能体

Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem

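Affiliations: Technical University of Munich and TUM University Hospital; Microsoft Healthcare & Life Sciences

AI summary: Proposes GRASP, which gates edits to a skill library with a regression-aware check so that each accepted skill yields a net improvement under a hard regression budget, markedly improving the operational reliability of LLM agents in structured environments.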

详情
AI中文摘要

在结构化环境中运行的LLM智能体以操作方式而非对话方式失败,其可靠性取决于对环境的程序性知识。先前的自我改进方法累积自然语言指导而不检查每个新项目是否保留先前正确的行为,因此修复一条轨迹的笔记可能静默地使另一条轨迹退化。我们引入GRASP(门控回归感知技能提议器),将智能体改进视为对有限技能库的一系列编辑,仅在候选技能在硬回归预算下对平衡的保留探针产生净改进时才接受它。我们在两个基于FHIR的临床基准上评估了GRASP在五个基础模型(gpt-oss-120b、DeepSeek V4 Flash、Gemini 3.1 Flash Lite、GPT-4.1、GPT-5.4)上的表现。在MedAgentBench上,GRASP将gpt-oss-120b从40.6%提升至88.8%,超过五个自我改进基线中最强的21.0个百分点,并将其他每个基础模型提升17.2至40.3个百分点。消融实验将增益归因于比较性提议生成、接受门和硬回归预算,而非技能编写本身——没有验证的技能编写并不比不使用技能更好。该机制泛化到临床领域之外,在四个非临床环境中的三个上改进了智能体,仅在动作空间开放的环境中保持持平。冻结的技能库可在模型间迁移,其中来自更强模型的技能将较弱执行者提升到超出其自身学习能力的水平,而反向则不然,这种不对称性是没有门控的基线无法复现的。

英文摘要

LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
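The acceptance gate described above can be sketched as a comparison of per-task outcomes on a held-out probe before and after a candidate skill is added. This is a minimal sketch, not GRASP's implementation: the pass/fail representation, the default budget of zero, and the name `accept_skill` are assumptions introduced here.

```python
from typing import Sequence


def accept_skill(
    before: Sequence[bool],
    after: Sequence[bool],
    regression_budget: int = 0,
) -> bool:
    """Gate a candidate skill on a held-out probe (illustrative).

    before[i] / after[i] record whether probe task i passed without and
    with the candidate skill. The candidate is admitted only if the number
    of newly broken tasks stays within the hard budget and the fixes
    outnumber the regressions (net improvement).
    """
    if len(before) != len(after):
        raise ValueError("before and after must cover the same probe tasks")
    fixes = sum(1 for b, a in zip(before, after) if a and not b)
    regressions = sum(1 for b, a in zip(before, after) if b and not a)
    return regressions <= regression_budget and fixes > regressions
```

A candidate that fixes two probe tasks but breaks one is rejected at a budget of zero, which is the failure mode the abstract attributes to ungated accumulation of guidance.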

2605.29667 2026-05-29 cs.CL 版本更新

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

超越英语与规避:用于高风险LLM中文安全评估的人工标注多领域基准

Wajdi Zaghouani, Kholoud K. Aldous, Yicheng Gao

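Affiliations: Northwestern University in Qatar

AI summary: To address LLM safety systems breaking down in Chinese-language settings, builds ChiSafe-PAS, a human-annotated benchmark of 1,897 adversarial prompts covering four high-stakes domains, of which 1,544 carry complete gold-standard annotations for evaluating model safety alignment.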

详情
Journal ref
Proceedings of the Fourth International Workshop on the Role of Resources in the Age of Large Language Models (RESOURCEFUL-2026) at LREC 2026, Palma de Mallorca, Spain, 2026
AI中文摘要

当大型语言模型(LLM)部署在中文环境中时,出现了一个令人不安的模式:在英语中运行良好的安全系统会失效。这些系统难以跨越语言和文化的界限,使得模型暴露于利用中文特定规避技术(包括拼音罗马化、汉字分解、网络俚语和模糊语气)的对抗性提示。为解决这一差距,我们引入了ChiSafe-PAS(中文安全试点标注集),这是一个包含1,897个对抗性中文提示的人工标注基准,涵盖四个高风险领域:自残与暴力、毒品与非法交易、欺诈以及讽刺。其中,1,544条条目带有完整的黄金标准标注:一个3类响应标签(拒绝、安全重定向、回应)、一个九类混淆分类、一个风险等级评级以及标注者理由。我们详细描述了数据集设计、标注过程和混淆分类。我们的主要目标是实用的:为研究社区提供一个高质量、基于文化背景的资源,用于基准测试LLM的安全对齐。在此过程中,我们涉及了该领域的三个更广泛的张力:训练数据和评估数据之间模糊的界限、基于现实风险进行领域覆盖的需求,以及规模作为文化专业知识替代品的局限性。

英文摘要

When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries, leaving models exposed to adversarial prompts that exploit Chinese-specific evasion techniques, including Pinyin romanization, character decomposition, internet slang, and hedging tone. To address this gap, we introduce ChiSafe-PAS (Chinese Safety Pilot Annotation Set), a human-annotated benchmark of 1,897 adversarial Chinese prompts spanning four high-stakes domains: self-harm and violence, drug and illicit trade, fraud, and satire. Of these, 1,544 entries carry complete gold-standard annotations: a 3-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, a risk-level rating, and annotator rationale. We describe the dataset design, annotation process, and obfuscation taxonomy in detail. Our primary goal is practical: to give the research community a high-quality, culturally grounded resource for benchmarking LLM safety alignment. In doing so, we engage three broader tensions in the field: the blurring boundary between training and evaluation data, the need for domain coverage grounded in real-world risk, and the limits of scale as a substitute for cultural expertise.

2605.29659 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Opir:针对毒性、越狱、仇恨言论和有害内容的高效多任务安全分类

Ihor Stepanov, Aleksandr Smechov

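Affiliations: Knowledgator; Wordcab

AI summary: Presents Opir, a family of encoder-based guardrail models built on the GLiClass architecture that use multi-task learning for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization; across 12 safety-classification tasks and 17 category tasks it is competitive with existing guardrail systems at a smaller deployment footprint.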

Comments 23 pages, 4 figures, 9 tables

详情
AI中文摘要

大型语言模型(LLM)应用的实时安全过滤需要能够检测不安全提示、有毒语言、越狱尝试和不安全响应的分类器,且不能像大型护栏模型那样成本高昂,同时要能区分良性的敏感文本与真正隐蔽的有害内容。在本文中,我们介绍了Opir,一个基于GLiClass架构的编码器护栏模型系列。Opir包括用于二进制安全/不安全分类、多标签毒性分类、越狱分类以及零样本不安全提示和响应分类的多任务模型。我们还发布了专门用于二进制安全/不安全分类的边缘变体,参数少于1亿。这些模型在一个三级分类体系上训练,该体系包含16个顶层标签、126个中层标签和854个叶标签,共996个类别。Opir的训练数据结合了基于分类体系的不安全提示、对抗性挖掘的难负例、良性安全保持示例、生成的响应示例、多语言翻译以及Aegis2和WildGuard训练子集的部分内容。我们还开源了一个评估工具,支持GLiClass和GLiNER2后端以及基于解码器的模型,涵盖二进制安全分类、多标签分类、毒性、越狱检测、提示安全、响应安全、响应拒绝以及跨公共基准系列的提示子类别视图。在与八个当代护栏系统(包括基于GLiNER2和生成式护栏模型)的扩展比较中,涵盖12项安全分类任务和17项类别任务,Opir变体在大多数基准数据集上与最强的开源基线模型竞争或领先,同时部署规模显著更小。

英文摘要

Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models, and that can distinguish benign sensitive text from genuinely covert harmful content. In this paper, we introduce Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. We also release edge variants with fewer than 100M parameters dedicated to binary safe/unsafe categorization. The models are trained on a three-level taxonomy containing 996 categories across 16 top-level labels, 126 mid-level labels, and 854 leaf labels. Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. We also open-sourced an evaluation harness that supports GLiClass and GLiNER2 backends as well as decoder-based models, and covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. Across an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems -- including both GLiNER2-based and generative guardrail models -- Opir variants are competitive on or ahead of the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.

2605.29648 2026-05-29 cs.CL 版本更新

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

超越数学与代码的可验证奖励:面向事实问答的轻量级语料库基础过程监督

Shicheng Fan, Haochang Hao, Dehai Min, Weihao Liu, Philip S. Yu, Lu Cheng

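Affiliations: University of Illinois Chicago

AI summary: Proposes CorVer, a lightweight process reward based on corpus co-occurrence statistics that assigns sentence-level credit and maps it to token-level advantages, improving factual question-answering accuracy across multiple models and benchmarks while training faster.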

详情
AI中文摘要

将强化学习应用于提高知识密集型问答的事实准确性面临奖励设计困境。响应级奖励仅提供粗略监督,无法区分推理轨迹中的正确与错误陈述。句子级替代方案提供更细粒度的反馈,但通常依赖于NLI验证器、LLM评判或知识验证流水线,这些方法在RL规模下部署成本高昂,且对于稀有实体事实(准确奖励信号尤为重要)往往不可靠。我们提出CorVer(语料库验证),一种轻量级、即插即用的过程奖励,用源自维基百科共现统计的语料库基础信号替代神经验证器。CorVer分配句子级信用,并通过简单对齐将其映射到令牌级优势,仅需一个0.5B的提取器和每个句子一次语料库查找。在跨越六个指令微调模型(3B至14B)和五个QA基准的30个(模型,基准)单元中,CorVer在每个单元上均优于原始基线,TriviaQA平均提升4.1个百分点。在其可行配置下的20个单元中,CorVer在18个单元上优于四个神经验证器基线,同时训练速度快4.8至8.4倍。

英文摘要

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
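The mapping from sentence-level credit to token-level advantages can be sketched as broadcasting each sentence's reward to its tokens. This is a hedged sketch: the abstract only says the mapping uses "a simple alignment", so the mean-reward baseline and the name `token_advantages` are assumptions introduced here, and the corpus lookup that produces the rewards is left out.

```python
from typing import List, Sequence


def token_advantages(
    sentence_token_counts: Sequence[int],
    sentence_rewards: Sequence[float],
) -> List[float]:
    """Broadcast per-sentence rewards to per-token advantages (illustrative).

    sentence_token_counts[i] is the number of tokens in sentence i and
    sentence_rewards[i] is its reward. Each token inherits its sentence's
    reward minus the mean sentence reward (an assumed baseline).
    """
    if len(sentence_token_counts) != len(sentence_rewards):
        raise ValueError("one reward is needed per sentence")
    if not sentence_rewards:
        return []
    baseline = sum(sentence_rewards) / len(sentence_rewards)
    advantages: List[float] = []
    for count, reward in zip(sentence_token_counts, sentence_rewards):
        advantages.extend([reward - baseline] * count)
    return advantages
```

With this layout, tokens in a well-supported sentence receive a positive signal even when another sentence in the same response is penalized, which is the finer granularity that response-level rewards lack.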

2605.29638 2026-05-29 cs.CL 版本更新

Classification of non-analyzable word types in web documents to implement an effective Korean e-learning system

网络文档中不可分析词类型的分类以实现有效的韩语电子学习系统

Sang-Taek Park, Ae-Lim Ahn, Eric Laporte, Jee-Sun Nam

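Affiliations: DICORA, Hankuk University of Foreign Studies, Korea; LIGM, Université Paris-Est, France

AI summary: Builds formal and informal corpora, compares how their expressions differ, and proposes Local Grammar Graphs (LGG) as a model for handling informal text effectively in Korean e-learning systems.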

详情
Journal ref
Doing Research in Applied Linguistics, 2011, pp. 61-68
AI中文摘要

电子学习系统应传递反映语言实际使用中各种现象的内容。除了正式韩语,包含网络文档、手机短信或推特帖子等真实世界韩语表达的电子学习系统将对高级学习者有用。我们构建了两种语料库:一种由在线新闻文章等正式文档组成;另一种由网络博客中关于新产品的客户评论等非正式文档组成。通过比较这些语料库,我们展示了这两种语料库中表达的差异。我们调查了非正式语料库的主要特征。鉴于文本中有很大比例是非正式的,我们提出局部语法图(LGG)作为在韩语电子学习系统中有效处理它们的合适模型。

英文摘要

E-learning systems should deliver contents that reflect various phenomena of the language as it is used. In addition to formal Korean, e-learning systems that would include real-world Korean expressions such as those in web documents, mobile text messages, or twitter posts, would be useful to high-level learners. We construct two types of corpora: one is made of formal documents like online news articles; the other is made of informal documents like customer reviews about new products in web blogs. By comparing these corpora, we show how expressions differ in these two types of corpora. We survey the main characteristics of the informal corpus. Given that a significant proportion of text is informal, we propose Local Grammar Graphs (LGG) as an appropriate model to treat them effectively in Korean e-learning systems.

2605.29637 2026-05-29 cs.CL 版本更新

Evaluating Cross-lingual Knowledge Consistency in Code-Mixed vis-a-vis Indian Languages using IndicKLAR

评估混合语码与印度语言中的跨语言知识一致性:基于IndiKLAR

Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Aditya Joshi, Akshay Agarwal, Jasabanta Patro

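Affiliations: Indian Institute of Science Education and Research; Microsoft Corporation; UNSW Sydney

AI summary: Builds the IndiKLAR benchmark to evaluate how consistently LLMs recall knowledge across English, code-mixed, and native Indian-language inputs, finding that code-mixed input closes most of the performance gap to English and identifying a consistent flip point between the native and code-mixed settings.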

Comments 23 pages

详情
AI中文摘要

大型语言模型能可靠地回忆英语知识,但在低资源语言上对相同查询却常常失败——这种跨语言一致性差距在印度语言及其混合语码变体中尚未得到充分研究。为了研究这一差距,我们引入了IndiKLAR,这是KLAR-CLC基准的印度扩展,覆盖了22种印度官方语言中的18种,并为11种广泛使用的语言对配对了混合语码变体,且对这11种设置的单语和混合语码变体进行了母语验证。这种三方对齐提供了一个独特的机会来考察知识召回一致性如何随英语、混合语码和印度本土语言输入的变化而变化。在九个开放权重模型上的评估发现,本土语言与英语的准确率差距可达约0.50,而混合语码输入能缩小大部分差距——无需任何模型层面的干预即可使性能接近英语(差距约0.05)。受此启发,我们评估了几种在语言转换暴露方式上有所不同的提示策略,包括两阶段的翻译-回答设置、单阶段的联合翻译-回答提示,以及“翻译中思考”(TinT)——一种单步策略,模型内部转换输入并仅输出最终答案。在从本土语言到混合语码再到英语的性能轨迹中,我们识别出一个一致的翻转点——即错误与正确预测之间的边界——位于本土语言和混合语码设置之间。有趣的是,无论该轨迹是由输入表面形式还是由模型的内部转换过程诱导,这一现象都成立。

英文摘要

Large language models recall knowledge reliably in English but often fail on the same query posed in a lower-resourced language -- a crosslingual consistency gap that remains underexplored for Indian languages and their code-mixed counterparts. To study this gap, we introduce IndiKLAR, an Indic extension of the KLAR-CLC benchmark covering 18 of the 22 scheduled Indian languages and pairing them with code-mixed variants for 11 widely used language pairs, with native-speaker verification of both monolingual and code-mixed variants for these 11 settings. This three-way alignment offers a unique opportunity to examine how knowledge recall consistency varies across the spectrum of English, code-mixed, and native Indian language inputs. Evaluating across nine open-weight models, we find that the native-language accuracy gap to English can reach $\sim$0.50, while code-mixed inputs close most of it -- bringing performance within $\sim$0.05 of English without any model-level intervention. Motivated by this, we evaluate several prompting strategies that vary in how language conversion is exposed, including a two-stage translate-then-answer setup, a one-stage joint translation-and-answer prompt, and Translate-in-Thought (TinT) -- a single-step strategy in which the model converts the input internally and emits only the final answer. Across the performance trajectory native $\rightarrow$ code-mixed $\rightarrow$ English, we identify a consistent flip point -- the boundary between incorrect and correct prediction -- that lies between the native and code-mixed settings. Interestingly, this holds whether the trajectory is induced by the input surface form or by the model's internal conversion process.

2605.29631 2026-05-29 cs.CL cs.AI 版本更新

Predicting Causal Effects from Natural Language Queries using Structured Representations

使用结构化表示从自然语言查询预测因果效应

Giuliano Martinelli, Piriyakorn Piriyatamwong, Abelardo Carlos Martinez Lorenzo, Jasmin Baier, Riccardo Orlando, Satvik Garg, Sharif Kazemi, Linxi Wang, Arianna Legovini, Samuel Fraiberger

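Affiliations: The World Bank Group; University of Oxford; New York University

AI summary: For predicting causal effects from natural language queries, proposes the Query2Effect benchmark and a two-step framework that generates a structured representation before predicting effect size; fine-tuning reduces absolute error by 27% to 71%.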

Comments 18 pages

详情
AI中文摘要

随机对照试验是医学和社会科学的基石,因为它们能够可靠地估计因果效应。然而,进行这些试验成本高昂且耗时,这激发了从现有实验证据预测因果效应的兴趣。大型语言模型(LLMs)的最新进展在知识密集型任务上表现出强大的性能,引发了一个问题:这些模型能否用于预测因果效应大小?为了研究这一点,我们引入了Query2Effect,这是一个新的大规模基准,包含超过72,000个与实验描述对齐的自然语言问题,通过改变查询在隐含性、抽象性和歧义性维度上的特异性,模拟现实的信息寻求场景。然后,我们提出了一个两步框架,首先生成查询的合成结构化表示,然后使用监督编码器模型预测效应大小。实验表明,微调在提高预测性能方面起着关键作用,与开箱即用的提示式LLMs相比,绝对误差降低了-27%到-71%,并且我们的两步框架有利于域外泛化,突显了将语义解释与数值效应估计分离的好处。

英文摘要

Randomized controlled trials are a cornerstone of medicine and the social sciences as they enable reliable estimates of causal effects. However, they are costly and time-consuming to conduct, motivating interest in predicting causal effects from existing experimental evidence. Recent advances in large language models (LLMs) have demonstrated strong performance on knowledge-intensive tasks, raising the question of whether these models can be used for forecasting causal effect sizes. To investigate this, we introduce Query2Effect, a new large-scale benchmark consisting of more than 72,000 natural language questions aligned with experiment descriptions, created to simulate realistic information-seeking scenarios by varying query specificity along dimensions of implicitness, abstraction, and ambiguity. We then propose a two-step framework that first generates a synthetic structured representation of a query before predicting effect size using a supervised encoder model. Experiments show that finetuning plays a crucial role in improving prediction performance, with absolute error falling by 27% to 71% compared to prompted out-of-the-box LLMs, and that our two-step framework is beneficial for out-of-domain generalization, highlighting the benefits of separating semantic interpretation from numerical effect estimation.

2605.29630 2026-05-29 cs.CL cs.AI cs.IR 版本更新

Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory

实体碰撞:一种用于归因智能体记忆检索提升的分层协议

Youwang Deng

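Affiliations: Independent Researcher

AI summary: Proposes the entity-collision protocol, which pins the BM25 baseline by controlling entity overlap and stratifies queries by tag so that retrieval lift can be attributed to the embedder; experiments along multiple dimensions show that encoder capacity alone is not the binding constraint.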

Comments 48 pages with appendix; 6-page body, mandatory Limitations, References, and 7 appendices. Code, benchmarks, and 37 reproduce scripts: https://github.com/youwangd/engram (see paper/REPRODUCIBILITY.md). Apache 2.0

详情
AI中文摘要

端到端的智能体记忆基准测试为每个检索器报告一个单一的hit@k指标,混淆了词汇泄漏(不受控制的查询/黄金/干扰实体重叠)与标签混合(偏好、服务、工具平均在一起)。我们提出实体碰撞,一种系统无关的协议,通过构造将BM25基线固定——每个干扰项共享答案的实体标记——并按判别器标签对查询进行分层,因此任何超过BM25的提升都可归因于嵌入器。应用于一个开源智能体记忆测试平台,涵盖5个标签×3个嵌入器×5个碰撞程度,并采用配对自助法95%置信区间,该协议揭示了一个双轴模式:256维哈希三元组仅在深度碰撞下的封闭词汇标签上有帮助;MiniLM-384在两个轴上均占优;而参数规模2.7倍的BGE-large并未在MiniLM上一致提升——它在意图式查询上胜出,但在词汇式查询上落败。编码器容量本身并非约束条件。合成意图标签的零假设在LongMemEval(n=500)上重现为单会话偏好回忆悬崖。LoCoMo上的自适应向量权重路由是一个测量的零假设:存在11.7个百分点的oracle空间,但我们测试的所有信号均未恢复。所有26个结果表和37个复现脚本均受版本控制并由公共注册表验证;该协议在一个确定性管理的记忆测试平台(事件溯源决策日志、DAG状态机模式生命周期)上执行,因此每个报告的置信区间都可以从输入流中逐字节复现。

英文摘要

End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag-mixing (preferences, services, tools averaged together). We propose entity-collision, a system-agnostic protocol that pins the BM25 floor by construction -- every distractor shares the answer's entity tokens -- and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open-source agent-memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired-bootstrap 95% CIs, the protocol reveals a two-axis pattern: a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and a 2.7x-parameter BGE-large does not uniformly improve on MiniLM -- it wins on intent-style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent-tag null replicates on LongMemEval (n=500) as a single-session-preference recall cliff. Adaptive vector-weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version-controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event-sourced decision log, DAG-state-machine schema lifecycle) so every reported CI is reproducible byte-for-byte from the ingest stream.
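The paired-bootstrap confidence interval mentioned in the abstract can be sketched with the standard library. This is a generic percentile bootstrap over per-query hit differences, written as an assumption about what "paired-bootstrap 95% CIs" means here; the function name `paired_bootstrap_ci`, the resample count, and the seed are illustrative and not taken from the paper's scripts.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    base_hits: Sequence[int],
    model_hits: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for the mean per-query lift of a retriever over a baseline.

    base_hits[i] and model_hits[i] are 0/1 hit@k outcomes for the same
    query, so resampling query indices keeps the pairing intact.
    """
    if len(base_hits) != len(model_hits) or not base_hits:
        raise ValueError("need two non-empty, equally long outcome lists")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [m - b for b, m in zip(base_hits, model_hits)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

If the resulting interval excludes zero within a tag and collision degree, the lift over the BM25 floor in that stratum is unlikely to be resampling noise.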

2605.29628 2026-05-29 cs.SD cs.AI cs.CL cs.LG eess.AS 版本更新

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

COMET:音频-文本多模态对比嵌入中模态间隙的概念空间剖析

Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

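Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey

AI summary: Proposes the COMET framework, whose PLS-SVD decomposition shows that only a small set of shared-concept axes contributes substantially to similarity in CLAP models and that the mean component only partly accounts for the modality gap; a training-free spectral truncation method then mitigates the gap, bringing zero-shot audio captioning close to fully supervised performance.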

详情
AI中文摘要

对比语言-音频预训练(CLAP)模型广泛用于音频理解,并在许多零样本应用中支持模态无关的条件交换。然而,其性能受到音频和文本嵌入之间模态间隙的严重影响。现有解释主要将此间隙归因于锥体效应,将其视为均值嵌入之间的偏移,但仅纠正均值只能带来有限的改进。其他假设,如信息不平衡和维度坍缩,也被提出,但仍未得到充分验证,并且在音频领域尚未被深入研究。同时,一些工作尝试将多模态对比嵌入分解为可解释的概念,但没有任何工作从概念分解的角度显式分析模态间隙。在这项工作中,我们引入了COMET(基于PLS-SVD变换的概念空间组织与模态间隙解释),这是一个新颖的用于CLAP的偏最小二乘奇异值分解(PLS-SVD)框架,揭示了模态间隙的更广泛视角。我们的框架揭示,只有一小部分可解释的轴(捕捉共享概念)对相似度计算有显著贡献,并且均值分量仅部分代表模态间隙。基于这一见解,我们提出了一种简单的谱截断方法,以无训练的方式缓解模态间隙。该方法使得零样本音频字幕通过条件交换接近全监督性能,无需大型辅助记忆库或昂贵计算。同时,它在保持检索和音频字幕任务强性能的同时,实现了显著的嵌入维度缩减。

英文摘要

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component represents only partially the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.
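For readers unfamiliar with PLS-SVD, the textbook form of the decomposition is written out below. This is a generic sketch rather than COMET's exact formulation: the symbols $A$, $T$, $n$, $d$, and $k$ are notation introduced here, the abstract does not state whether embeddings are mean-centered first, and keeping the leading $k$ axes is only one possible reading of "spectral truncation".

```latex
% A, T \in \mathbb{R}^{n \times d}: n paired audio and text embeddings (assumed notation)
\begin{aligned}
C &= \tfrac{1}{n} A^{\top} T = U \Sigma V^{\top},
  \qquad \Sigma = \operatorname{diag}(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_d \ge 0), \\
\hat{A}_k &= A\,U_{:,1:k},
  \qquad \hat{T}_k = T\,V_{:,1:k},
  \qquad k \le d .
\end{aligned}
```

Each pair of singular vectors $(u_i, v_i)$ gives the audio and text directions with the $i$-th largest cross-covariance $\sigma_i$, so a fast-decaying spectrum would mean that a few axes carry most of the shared audio-text structure.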

2605.29626 2026-05-29 cs.CL cs.AI 版本更新

DLM-SWAI: Steering Diffusion Language Models Before They Unmask

DLM-SWAI: 在扩散语言模型去掩码之前引导它们

Hyeseon An, Yo-Sub Han

发表机构 * Department of Computer Science(计算机科学系) Yonsei University(延世大学)

AI总结 提出一种无需训练的引导方法DLM-SWAI,通过预计算的词级风格分数在去噪步骤中偏置词分布,实现扩散语言模型的可控生成。

Comments preprint

详情
AI中文摘要

将语言模型生成引导至期望的文本属性对于实际部署至关重要,而推理时方法特别有吸引力,因为它们无需重新训练即可实现可控生成。最近的研究也强调了扩散语言模型作为一种新兴的生成范式,具有独特的解码特性。然而,大多数现有的引导方法要么依赖辅助模型,要么专为自回归下一个词解码设计,难以应用于通过部分掩码序列的迭代去噪生成文本的扩散语言模型(DLM)。因此,我们提出DLM-SWAI,一种简单的无需训练的引导方法,通过使用预计算的词级风格分数在每个去噪步骤偏置词分布。在风格和安全控制任务上的实验表明,DLM-SWAI有效引导扩散语言模型,同时保持生成质量并需要最小的计算开销。消融实验进一步揭示了引导强度与流畅性之间的可控权衡,我们的分析将类别可引导性与词级属性线索的强度联系起来。

英文摘要

Steering language model generation toward desired textual properties is essential for practical deployment, and inference-time methods are particularly appealing because they enable controllable generation without retraining. Recent work has also highlighted diffusion language models as an emerging generation paradigm with distinct decoding properties. However, most existing steering approaches either rely on auxiliary models or are designed for autoregressive next-token decoding, making them difficult to apply to diffusion language models (DLMs), which generate text through iterative denoising of partially masked sequences. Therefore, we propose DLM-SWAI, a simple training-free steering method that biases the token distribution at each denoising step using pre-computed token-level style scores. Experiments on style and safety control tasks show that DLM-SWAI effectively steers diffusion language models while preserving generation quality and requiring minimal computational overhead. Ablations further reveal a controllable trade-off between steering strength and fluency, and our analysis links class-wise steerability to the strength of token-level attribute cues.
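The biasing step described in this abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the additive form `logit + alpha * score`, the function name `steer_step`, the `alpha` strength parameter, and the toy vocabulary and scores are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer_step(logits_per_position, masked_positions, style_scores, alpha=1.0):
    """Bias the token distribution at each still-masked position.

    Assumption: steering is an additive shift of the logits by
    alpha * (pre-computed per-token style score). The abstract only says
    the distribution is biased; the exact form is not given there.
    """
    steered = {}
    for pos in masked_positions:
        biased = [l + alpha * s for l, s in zip(logits_per_position[pos], style_scores)]
        steered[pos] = softmax(biased)
    return steered

# Toy example: a 3-token vocabulary with hypothetical style scores.
vocab = ["good", "bad", "okay"]
style_scores = [1.0, -1.0, 0.0]
logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]

# Only position 0 is still masked at this denoising step.
out = steer_step(logits, masked_positions=[0], style_scores=style_scores, alpha=2.0)
```

With `alpha=0.0` the distribution is unchanged, which is one simple way to picture the trade-off between steering strength and fluency mentioned in the abstract.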

2605.29615 2026-05-29 cs.CV cs.CL 版本更新

DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?

DiffSpot:VLM能发现网页界面中的细微视觉差异吗?

Linhao Zhang, Aiwei Liu, Yuan Liu, Xiao Zhou

发表机构 * WeChat AI, Tencent Inc(腾讯公司)

AI总结 提出DiffSpot基准,通过CSS属性突变生成可控图像对,评估视觉语言模型在网页界面中检测细微视觉差异的能力,发现最佳模型仅识别40.7%的真实变化。

详情
AI中文摘要

视觉语言模型(VLM)在高层次图像-文本对齐方面取得了显著进展,但其感知细微视觉差异的能力仍然有限。我们在渲染的网页界面中研究这一问题,其中局部视觉变化既是对细粒度感知的诊断测试,也是GUI代理和设计工具的实际需求。我们引入了DiffSpot,一个用于网页界面开放式找不同的代码驱动基准。DiffSpot通过突变自包含HTML中目标元素的单个CSS属性,重新渲染页面,并记录变化的属性、元素和突变幅度,从而构建受控图像对。一个接地门控仅保留渲染像素差异局限于目标元素的图像对。该基准包含4,400对图像,包括3,900对有差异对(平衡分布在13个CSS属性操作符和三个难度级别上)以及500对无差异对用于幻觉控制。对13个前沿VLM进行零样本评估,我们发现即使最佳模型也只能识别40.7%的真实变化,所有模型在困难级别的召回率低于23%。DiffSpot进一步表明,难度强烈依赖于属性:在CSS操作符中,像素幅度和CLIP距离都不能可靠预测召回率。

英文摘要

Vision-language models (VLMs) have made strong progress on high-level image-text alignment, yet their ability to perceive subtle visual differences remains limited. We study this problem in rendered web interfaces, where localized visual changes are both a diagnostic test of fine-grained perception and a practical requirement for GUI agents and design tools. We introduce DiffSpot, a code-driven benchmark for open-ended spot-the-difference on web interfaces. DiffSpot constructs controlled image pairs by mutating a single CSS property of a target element in self-contained HTML, re-rendering the page, and recording the changed property, element, and mutation magnitude. A grounding gate retains only pairs whose rendered pixel difference is confined to the target element. The benchmark contains 4,400 pairs, including 3,900 has-diff pairs balanced across 13 CSS-property operators and three difficulty tiers, plus 500 no-diff pairs for hallucination control. Evaluating 13 frontier VLMs zero-shot, we find that even the best model identifies only 40.7% of true changes, with Hard-tier Recall below 23% for every model. DiffSpot further shows that difficulty is strongly property-dependent: across CSS operators, neither pixel magnitude nor CLIP distance reliably predicts Recall.
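The grounding gate mentioned above (keep a pair only if the rendered pixel difference stays inside the target element) might look roughly like the following. This is a hedged sketch, not the benchmark's code: the function name `diff_confined_to_box`, the half-open bounding-box convention, and the use of plain nested lists as images are assumptions made for illustration.

```python
def diff_confined_to_box(before, after, box):
    """Return True if the two renders differ, and only inside `box`.

    before, after: equal-size 2D lists of pixel values (rows of pixels).
    box: (x0, y0, x1, y1), half-open, the target element's bounding box.
    Assumption: a pair with no visible change at all is rejected too,
    since a has-diff pair needs at least one changed pixel.
    """
    x0, y0, x1, y1 = box
    changed = False
    for y, (row_a, row_b) in enumerate(zip(before, after)):
        for x, (pa, pb) in enumerate(zip(row_a, row_b)):
            if pa != pb:
                if not (x0 <= x < x1 and y0 <= y < y1):
                    return False  # change leaked outside the target element
                changed = True
    return changed

# Toy 3x3 "screenshots": only the centre pixel changes.
before = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
after = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]

keep_pair = diff_confined_to_box(before, after, box=(1, 1, 2, 2))
```

A gate of this kind would filter out mutations whose side effects (for example, layout reflow) alter pixels elsewhere on the page.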

2605.29612 2026-05-29 cs.MA cs.CL 版本更新

CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems

CONCAT: 基于共识与置信驱动的即席团队协作以实现高效的基于LLM的多智能体系统

Ziyang Ma, Dingyi Zhang, Sichu Liang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou

发表机构 * Southeast University(东南大学) Huawei Technologies Ltd(华为技术有限公司)

AI总结 提出一种无需训练的共识与置信驱动即席团队协作框架CONCAT,通过聚类初始答案、选择高置信领导者并基于心智理论预测协作收益来动态组织多智能体交互,显著提升效率并降低延迟。

详情
AI中文摘要

尽管基于大型语言模型的多智能体系统在解决复杂任务和实现比单智能体系统更高的性能方面显示出能力,但由于智能体之间的密集通信,它们导致了巨大的计算开销。先前的研究致力于训练稀疏多智能体图或微调规划器以更好地编排工作流程。然而,这些额外的训练过程引入了计算成本,并将多智能体系统限制在特定领域,从而损害了其泛化能力。在本文中,我们提出了CONCAT,一种基于共识和置信驱动的即席团队协作的无训练多智能体协作框架,以高效组织智能体交互。具体来说,智能体根据其初始答案进行聚类,并根据智能体的置信度选择每个聚类的领导者。然后,基于心智理论设计启发式函数,根据领导者的答案和置信度预测每两个领导者之间的协作收益。最后,在根据预测收益驱逐一定比例的通信后,组织一个即席多智能体网络。在三个LLM和三个基准上的实验表明,CONCAT比LLM-Debate实现了高达2.02倍的效率(准确率/延迟比),并优于诸如AgentDropout等训练感知方法,同时在Qwen2.5-14B-Instruct上将平均延迟降低了50.1%,且无需任何任务特定训练。

英文摘要

Although large language model (LLM) based multi-agent systems (MAS) show their capability to solve complex tasks and achieve higher performance over single agent systems, they lead to huge computational overheads because of heavy communication between agents. Previous research has made efforts to train a sparse multi-agent graph or fine-tune a planner to orchestrate the workflow better. However, such extra training processes introduce computational costs and limit MAS to specific domains, therefore compromising their generalizability. In this paper, we propose CONCAT, a training-free multi-agent collaboration framework based on CONsensus and Confidence-driven Ad hoc Teaming to efficiently organize agent interactions. Specifically, agents are clustered based on their initial answers, and leaders of each cluster are selected based on the agents' confidence. Then, a heuristic function based on the Theory of Mind is designed to predict the collaboration benefits between every two leaders according to their answers and confidence. Finally, an ad hoc multi-agent network is organized after evicting a percentage of communications based on the predicted benefits. Experiments across three LLMs and three benchmarks show that CONCAT achieves up to 2.02x higher efficiency (accuracy/latency ratio) than LLM-Debate and outperforms training-aware methods such as AgentDropout, while reducing average latency by 50.1% on Qwen2.5-14B-Instruct, without any task-specific training.
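The pipeline in this abstract (cluster agents by initial answer, pick a leader per cluster by confidence, score leader pairs, evict a share of the communication links) can be sketched as below. This is an illustration under loud assumptions: `toy_benefit` is a hypothetical placeholder and is not the Theory-of-Mind heuristic from the paper, and the function names, the exact-match clustering, and the `evict_ratio` parameter are also invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def pick_leaders(agents):
    """agents: list of (name, answer, confidence).

    Agents with identical answers form one cluster; the most confident
    member of each cluster becomes its leader.
    """
    clusters = defaultdict(list)
    for name, answer, conf in agents:
        clusters[answer].append((name, conf))
    return {ans: max(members, key=lambda m: m[1]) for ans, members in clusters.items()}

def toy_benefit(conf_a, conf_b):
    # Hypothetical placeholder: less confident leaders are assumed to
    # gain more from talking to each other. NOT the paper's heuristic.
    return (1.0 - conf_a) + (1.0 - conf_b)

def prune_edges(leaders, evict_ratio):
    """Score every leader pair, then drop the lowest-benefit share of edges."""
    edges = []
    for (_, (name_a, conf_a)), (_, (name_b, conf_b)) in combinations(sorted(leaders.items()), 2):
        edges.append((toy_benefit(conf_a, conf_b), name_a, name_b))
    edges.sort(reverse=True)
    keep = len(edges) - int(len(edges) * evict_ratio)
    return [(a, b) for _, a, b in edges[:keep]]

agents = [("a1", "42", 0.9), ("a2", "42", 0.6), ("a3", "41", 0.7), ("a4", "40", 0.3)]
leaders = pick_leaders(agents)
kept = prune_edges(leaders, evict_ratio=0.5)
```

Because only cluster leaders communicate and some of their links are evicted, the number of messages is smaller than in an all-to-all debate, which is the intuition behind the latency reduction reported above.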

2605.29601 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Training Deliberative Monitors for Black-Box Scheming Detection

训练审慎监控器用于黑箱策划检测

Aditya Sinha, Akshat Naik, Victor Gillioz, Simon Storf, Kilian Merkelbach, Rich Barton-Cooper, Axel Højmark, Marius Hobbhahn

发表机构 * Independent(独立) MATS Research(MATS研究) Astra Fellowship Apollo Research(Apollo研究)

AI总结 提出一种基于行动轨迹的审慎监控方法,通过蒸馏前沿模型的推理过程训练开源模型,以低成本高精度检测智能体的策划与破坏行为。

详情
AI中文摘要

随着自主智能体在执行现实任务方面变得愈发强大,区分策划行为与良性任务追求可能成为AI控制的核心问题。现有监控器通常依赖思维链访问或内部激活,或使用提示的前沿模型,这些在部署中可能不可用、不可靠或成本高昂。在本工作中,我们研究仅基于行动的审慎监控器:较小的开源模型,经过训练可从智能体轨迹中检测策划与破坏行为,而无需访问被监控智能体的推理或模型内部。我们的方法受审慎对齐启发,使用策划规范从前沿教师模型中引出结构化推理,通过独立的评判器进行过滤,并通过监督微调和强化学习将最高质量的推理蒸馏到开源监控器中。我们在五个数据集上训练,并在六个分布外智能体失调基准上评估。我们表明,将我们的方法应用于Qwen3.5-27B,其性能优于所有低成本前沿模型作为提示监控器(Gemini 3.1 Flash-Lite、GPT-5.4 Nano和Claude Haiku 4.5)以及Gemini 2.5 Pro,同时实现了更低的边际推理成本(每1000次评估的token计费美元)。更强的提示前沿监控器(Gemini 3.1 Pro、GPT-5.4、Claude Sonnet 4.6和Claude Opus 4.6)实现了更高的性能,但边际推理成本大约高出16-34倍。我们训练的多个监控器在我们评估的监控器中位于经验成本-性能帕累托前沿,为提示前沿模型提供了实用的低成本、低误报率替代方案。

英文摘要

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable or expensive in deployment. In this work, we study action-only deliberative monitors: smaller open-weight models trained to detect scheming and sabotage from agentic trajectories without accessing the monitored agent's reasoning or model internals. Our method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. We train on five datasets, and evaluate across six out-of-distribution agentic misalignment benchmarks. We show that applying our method to Qwen3.5-27B yields higher performance than all low-cost frontier models as prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5) and than Gemini 2.5 Pro, while also achieving lower marginal inference cost (token-metered USD per 1,000 evaluations). Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, and Claude Opus 4.6) achieve higher performance but at roughly $16$--$34\times$ higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost--performance Pareto frontier among the monitors we evaluate, providing practical low-cost, low-FPR alternatives to prompted frontier models.

2605.29585 2026-05-29 cs.CL 版本更新

World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models

语言中的世界模型:审计视觉语言模型中的物理状态转换承诺

Emmanuelle Bourigault

发表机构 * University of Oxford(牛津大学)

AI总结 提出WMW框架,通过要求VLM输出结构化轨迹(初始状态、状态转换、结果状态和答案)并利用混合验证器检查模式有效性、状态基础、转换一致性和答案-轨迹兼容性,揭示仅评估最终答案所隐藏的物理推理失败。

Comments 8 pages, 3 figures, 5 tables

详情
AI中文摘要

视觉语言模型(VLM)越来越多地被用于回答关于物理场景的问题,然而大多数评估仅将性能简化为最终答案。这隐藏了模型是否感知到正确的物体、表示了正确的物理状态、预测了合理的转换,或者仅仅因为错误的原因选择了正确的选项。我们引入WMW,一个用于审计VLM的语言表达的物理承诺的评估框架。我们不只对$I,q\mapsto a$进行评分,而是要求模型生成一个类型化轨迹$I,q\mapsto(s_0,\Delta s,s_1,a)$:初始状态、状态转换、结果状态和答案。然后,一个混合验证器检查模式有效性、状态基础、转换一致性和答案-轨迹兼容性,产生类型化错误标签,如物体、关系、力、转换、时间、单位/尺度和忠实性错误。我们发布\tracebank,一个受控轨迹资源,包含\nSeed个经过模式验证和重新计算验证的合成场景,涵盖\nFamilies个物理家族,\nPairs个最小扰动对比偏好对,验证器代码,审计指南和模型输出。我们在受控和外部物理推理示例上评估\nModels个VLM。WMW揭示了仅答案评估遗漏的失败:来自中等水平模型的35%的正确答案背后是物理上无效的轨迹。验证器引导的重新排序在不牺牲答案准确性的情况下恢复了高达7个百分点的轨迹有效性,而轨迹级别的偏好调整将隐藏的不一致性相对降低了41%。贡献不是另一个最终答案的物理基准,而是一个可重用的协议,用于衡量VLM所陈述的物理世界是否与其答案同时为真。

英文摘要

Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the right physical state, predicted a plausible transition, or merely selected the right option for the wrong reasons. We introduce WMW, an evaluation framework for auditing the language-expressed physical commitments of VLMs. Instead of scoring only $I,q\mapsto a$, we ask models to produce a typed trace $I,q\mapsto(s_0,\Delta s,s_1,a)$: an initial state, a state transition, a resulting state, and an answer. A hybrid verifier then checks schema validity, state grounding, transition consistency, and answer-trace compatibility, yielding typed error labels such as object, relation, force, transition, temporal, unit/scale, and faithfulness errors. We release \tracebank, a controlled trace resource with \nSeed schema- and recomputation-validated synthetic scenarios across \nFamilies physics families, \nPairs minimally perturbed contrastive preference pairs, verifier code, audit guidelines, and model outputs. We evaluate \nModels VLMs on both controlled and external physical-reasoning examples. WMW reveals failures that answer-only evaluation misses: 35% of correct answers from mid-tier models are backed by physically invalid traces. Verifier-guided reranking recovers up to 7 percentage points of trace validity without sacrificing answer accuracy, and trace-level preference tuning reduces hidden inconsistency by 41% relative. The contribution is not another final-answer physics benchmark, but a reusable protocol for measuring whether a VLM's stated physical world can be true at the same time as its answer.

2605.28700 2026-05-29 cs.AI cs.CL 版本更新

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

认真统计的重要性:对GSM-Symbolic的批判性再评估

Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz-Rodríguez

发表机构 * Instituto Superior Técnico & INESC-ID, Universidade de Lisboa, Portugal(里斯本大学技术高级学院及INESC-ID研究所,里斯本大学,葡萄牙) Dept. of Computer Science and AI & DaSCI Institute, Universidad de Granada, Spain(计算机科学与人工智能系及DaSCI研究所,格拉纳达大学,西班牙)

AI总结 通过广义线性混合模型和每问题随机效应重新评估20个开源模型,发现仅半数模型在原始提示格式下表现显著变化,并指出GSM-Symbolic数据集存在大整数分布偏移,控制该效应后剩余显著案例约减半,表明关于LLM推理的笼统结论在统计上不成熟且机制上具有误导性。

Comments 38 pages, 11 figures. Submitted to ACL ARR / EMNLP 2026

详情
AI中文摘要

GSM-Symbolic基准测试(Mirzadeh等人,2025)报告了25个大型语言模型(LLM)在GSM8K问题的模板生成变体上测试时出现一致的性能下降,并得出结论认为这些模型缺乏真正的推理能力。我们认为这一结论建立在不可靠的统计基础上。使用具有每问题随机效应的广义线性混合模型重新评估20个开源模型,我们发现只有一半的模型在原始提示格式下表现出统计上显著的性能变化。此外,我们识别出一个先前未被承认的因素:主要GSM-Symbolic数据集相对于GSM-Base,在问题文本中包含系统性地偏移的大整数分布(K-S统计量=0.12,p<0.001),这与原始作者的声明相矛盾。控制这一大数效应后,大约一半剩余案例的显著性得以解释。在具有统计上显著性能差异的模型中,我们识别出不同的、模型特定的失败模式——包括变量绑定的脆弱性、算术限制和双任务干扰——这强调了关于LLM推理的笼统结论在统计上既不成熟,在机制上也是误导性的。

英文摘要

The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using Generalised Linear Mixed Models with per-question random effects, we find that only half exhibit statistically significant performance changes under the original prompt format. Moreover, we identify a previously unacknowledged factor: the main GSM-Symbolic dataset contains a systematically shifted distribution of larger integers in problem texts relative to GSM-Base (K-S statistic = 0.12, p < 0.001), contradicting the original authors' claims. Controlling for this large number effect accounts for significance in roughly half the remaining cases. Among models with statistically significant performance deltas, we identify distinct, model-specific failure profiles - including fragility of variable binding, arithmetic limitations, and dual-task interference - underscoring that blanket claims about LLM reasoning are both statistically premature and mechanistically misleading.
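The K-S statistic quoted above is the largest gap between two empirical cumulative distribution functions. As a rough illustration (not the authors' analysis code), the two-sample statistic can be computed as follows; the function name `ks_statistic` and the toy integer samples are assumptions, and no p-value is computed here.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Walks both sorted samples together and tracks the largest absolute
    difference between the two empirical CDFs.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every observation <= x in both samples (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Hypothetical integers standing in for numbers found in problem texts.
base_numbers = [1, 2, 3, 4]
symbolic_numbers = [3, 4, 5, 6]
d_stat = ks_statistic(base_numbers, symbolic_numbers)
```

A value near 0 means the two integer distributions are nearly identical, while a value near 1 means they barely overlap; the abstract's 0.12 is a modest but, given the sample size, statistically significant shift.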

2605.28643 2026-05-29 cs.CL 版本更新

GraphLit: Learning Text-Enriched Dynamic Character Network Representations for Literary Study

GraphLit:面向文学研究的文本增强动态人物网络表示学习

Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara, Mirella Lapata

发表机构 * Deezer Research(Deezer研究) Loria(Loria实验室) IDIAP(IDIAP研究所) School of Informatics, University of Edinburgh(爱丁堡大学信息学院)

AI总结 提出动态异质人物网络(DHCN)和自监督框架GraphLit,通过掩码图自编码器学习融合文本上下文的文学表示,在12个角色相关任务上优于纯文本或纯图基线。

详情
AI中文摘要

将文学文本表示为图或图序列的方法主要侧重于表示角色互动,而往往忽略了另一个关键方面:角色互动的文本上下文。我们引入了动态异质人物网络(DHCN),将长篇小说组织成时间局部化的异构图,将角色与其文本上下文对齐。我们从Project Gutenberg中提取了约20,000个DHCN,并提出了GraphLit,这是一个自监督学习框架,通过掩码图自编码器目标学习丰富的文学表示。在广泛的12个角色相关任务中,GraphLit优于纯文本和纯图基线,特别是在需要上下文理解的任务上。最后,我们通过研究叙事非线性和动态社会特征之间的联系,展示了DHCN和GraphLit在文学分析中的适用性。

英文摘要

Methods to represent literary texts as graphs or sequences of graphs mainly focus on representing character interactions, and often overlook another crucial aspect: the textual context in which characters interact. We introduce Dynamic Heterogeneous Character Networks (DHCNs), which organize long novels into temporally localized heterogeneous graphs that align characters with their textual contexts. We extract around 20,000 DHCNs from Project Gutenberg, and propose GraphLit, a self-supervised learning framework that learns rich literary representations through a masked graph autoencoder objective. Across a wide range of 12 character-related tasks, GraphLit improves over text-only and graph-only baselines, particularly on tasks requiring contextual understanding. Finally, we demonstrate the applicability of DHCNs and GraphLit for literary analysis by studying the link between narrative non-linearity and dynamic social features.

2605.26954 2026-05-29 cs.CL 版本更新

AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian

AlbanianLLMSafety:面向阿尔巴尼亚语大语言模型的安全评估数据集

Wajdi Zaghouani, Kholoud K. Aldous, Isra Fejzullaj

发表机构 * Northwestern University in Qatar(卡塔尔西北大学)

AI总结 针对低资源语言阿尔巴尼亚语,构建了首个公开的安全评估数据集,包含11个安全类别的2951条提示,以填补安全评估基础设施的空白。

Comments Accepted at SIGUL2026 Workshop co-located with LREC2026

详情
Journal ref
In Proceedings of the SIGUL2026 Workshop co-located with LREC 2026, Palma de Mallorca, Spain, 2026
AI中文摘要

大语言模型(LLM)的安全评估主要集中于高资源语言,而低资源语言则严重缺乏关注。我们提出了AlbanianLLMSafety,这是首个公开的阿尔巴尼亚语LLM安全评估数据集。阿尔巴尼亚语是一种语言独特的低资源语言,在阿尔巴尼亚、科索沃、北马其顿以及海外侨民中约有750万使用者。该数据集包含2951条提示,涵盖11个安全类别,包括自残、暴力、种族主义内容、儿童剥削和激进化等,平均每个类别268条提示。每条提示均提供阿尔巴尼亚语原文、英语参考译文以及详细的类别标签。该资源填补了低资源语言安全评估基础设施的重大空白,并为开发更安全、更具包容性的LLM提供了重要基准。数据集将根据请求提供,以支持阿尔巴尼亚语社区的安全评估、微调、红队测试和护栏开发。

英文摘要

Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation dataset for LLMs in Albanian, a linguistically distinct low-resource language with approximately 7.5 million speakers across Albania, Kosovo, North Macedonia, and the diaspora. The dataset contains 2,951 prompts spanning 11 safety categories, including self-harm, violence, racist content, child exploitation, and radicalization, with an average of 268 prompts per category. Each prompt is provided in Albanian with an English reference translation and a detailed category label. This resource addresses a significant gap in safety evaluation infrastructure for low-resource languages and provides an essential benchmark for developing safer, more inclusive LLMs. The dataset will be provided upon request to support safety evaluation, fine-tuning, red-teaming, and guardrail development for Albanian-speaking communities.

2605.26947 2026-05-29 cs.CL 版本更新

KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models

KZ-SafetyPrompts:用于大型语言模型的哈萨克语安全评估提示数据集

Wajdi Zaghouani, Shimaa Amer Ibrahim, Aruzhan Muratbek, Olzhasbek Zhakenov, Adiya Akhmetzhanova

发表机构 * Northwestern University in Qatar(卡塔尔西北大学)

AI总结 针对哈萨克语在大型语言模型安全评估中资源不足的问题,构建了一个包含11个风险类别、5717条原生哈萨克语提示的数据集,并基于GPT-4o基线测试发现跨类别拒绝率差异显著,揭示了仅英语评估无法捕获的类别特定安全漏洞。

Comments Accepted at the SIGUL2026 Workshop co-located with LREC2026

详情
Journal ref
In Proceedings of the SIGUL2026 Workshop co-located with LREC 2026, Palma de Mallorca, Spain, 2026
AI中文摘要

哈萨克语在评估大型语言模型安全行为的资源中代表性不足。我们提出了KZ-SafetyPrompts,这是一个哈萨克语提示数据集,用于涵盖常见风险领域的十一个类别的安全评估,例如自残、暴力、儿童剥削、色情内容、种族主义内容、激进化以及受管制商品或非法活动。该数据集包含5717条以哈萨克语(西里尔字母)原生编写的提示,按类别组织,并附有英文翻译以进行跨语言分析。提示类似于真实的用户查询,通常采用青少年或儿童风格,并以意图提示的形式表述,不包含程序性指令。我们记录了编写协议、标注程序(包括边界案例决策规则)和质量控制步骤(模式标准化、完整性检查和去重)。我们还将这些类别与广泛使用的安全分类法对齐,以支持与现有评估管道的集成。使用GPT-4o的基线结果显示总体拒绝率为28.2%,不同类别间从5.5%到53.8%不等,表明哈萨克语提示暴露了仅英语评估无法捕获的类别特定安全漏洞。

英文摘要

Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas such as self-harm, violence, child exploitation, sexual content, racist content, radicalization, and regulated goods or illegal activities. The dataset contains 5,717 prompts written natively in Kazakh (Cyrillic), organized by category, with English translations for cross-lingual analysis. Prompts resemble realistic user queries, often in a teen or child style, and are phrased as intent prompts without procedural instructions. We document the writing protocol, labeling procedures (including borderline-case decision rules), and quality-control steps (schema standardization, completeness checks, and deduplication). We also align the categories with widely used safety taxonomies to support integration with existing evaluation pipelines. Baseline results with GPT-4o show an overall refusal rate of 28.2%, varying from 5.5% to 53.8% across categories, indicating that Kazakh prompts expose category-specific safety gaps not captured by English-only evaluation.

2605.23440 2026-05-29 cs.CL cs.AI 版本更新

SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction

SSDAU:面向联合实体关系抽取的结构化语义数据增强

Jiawei He, Mengyu Shi, Jiawei Liu, Dong Sun, Chunrong Fang, Xikai Yang, Zhijie Wang, Lei Ma, Zhenyu Chen

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China(南京大学新型软件技术国家重点实验室) Amap, Alibaba Group, China(阿里巴巴集团高德地图) University of Alberta, Edmonton, Canada(阿尔伯塔大学) The University of Tokyo, Tokyo, Japan(东京大学)

AI总结 提出结构化语义数据增强方法SSDAU,通过保留三元组感知语义结构、上下文感知编码和BERTopic过滤,提升联合实体关系抽取的泛化能力,在多个模型和数据集上优于现有方法。

Comments 10 pages, 4 figures

详情
AI中文摘要

联合实体关系抽取(JERE)对训练数据质量高度敏感,因此数据增强是提升泛化能力的自然方式。然而,现有增强方法常削弱实体相关性并破坏语义结构,限制了其在JERE中的有效性。本文提出结构化语义数据增强(SSDAU),一种在增强过程中保留三元组感知语义结构的方法。SSDAU按实体标签分割文本,通过上下文感知编码捕获语义特征,并重构实体语义以生成增强数据。为区分语义相似的实体,SSDAU将上下文嵌入与传统相似度评分相结合。为减少主题不一致性,我们应用基于BERTopic的过滤去除不相关的增强样本。我们在不同标注类型的数据集上评估SSDAU,并比较其在五个代表性JERE模型上相对于七个流行增强基线的性能。实验表明,SSDAU生成语义一致的数据,对歧义的鲁棒性优于非LLM方法(平均相对F1下降8.95% vs. 23.58%),并在大多数设置下显著优于强替代方法。

英文摘要

Joint Entity and Relation Extraction (JERE) is highly sensitive to training data quality, making data augmentation a natural way to improve generalization. However, existing augmentation methods often weaken entity relevance and disrupt semantic structure, limiting their effectiveness for JERE. In this paper, we propose Structured Semantic Data Augmentation (SSDAU), a method designed to preserve triple-aware semantic structure during augmentation. SSDAU segments text by entity labels, captures semantic features through context-aware encoding, and restructures entity semantics to generate augmented data. To distinguish semantically similar entities, SSDAU combines contextualized embeddings with traditional similarity scores. To reduce topic inconsistency, we apply BERTopic-based filtering to remove irrelevant augmentations. We evaluate SSDAU on datasets with different annotation types and compare its performance on five representative JERE models against seven popular augmentation baselines. Experiments show that SSDAU generates semantically consistent data, is more robust to ambiguity than non-LLM methods (8.95% vs. 23.58% average relative F1 decrease), and significantly outperforms strong alternatives in most settings.

2605.22975 2026-05-29 cs.CL cs.CY 版本更新

When AI Takes Sides on Questions of Faith: Persistent Asymmetries in AI-Mediated Faith Guidance

当AI在信仰问题上站队:AI介导的信仰指导中持续存在的不对称性

Brett Israelsen, Sheryl Carty, Josh Coates, Nancy Fulda, Julie Park, Pete Whiting

发表机构 * Brigham Young University

AI总结 研究通过测试20个大型语言模型在182个宗教配对中的转换建议,发现模型在宗教转换建议上存在系统性不对称,偏好某些宗教而歧视其他宗教,且该现象在不同模型和测试条件下稳定存在。

Comments w/ persuasive language analysis

详情
AI中文摘要

我们探究大型语言模型(LLMs)在处理宗教转换查询时是否对称。答案是否定的。当被问及从宗教A到宗教B与从宗教B到宗教A的假设性信仰转换建议时,模型表现出持续的不对称性,偏好某些宗教,同时微妙地劝阻转向其他宗教。平均而言,天主教、巴哈伊教和锡克教普遍受到青睐(高支持加入,低支持离开),而无神论者、不可知论者和耶和华见证人则主要受到歧视。模式因模型大小和模型提供商而异,其中Grok 4.20表现出最强的不对称性。 我们使用人工验证的LLM-as-judge框架,在182个宗教配对中测试了20个商业和开源语言模型。每个模型通过与模拟用户交互进行探测,该用户就潜在的信仰转换寻求建议。模型倾向于对某些信仰转换使用更鼓励性的语言;这些模式在多次试验中系统性地可重复。 所有测试的LLM都表现出可重复的不对称性,尽管偏好模式各不相同。总体偏好跨多个问题措辞和宗教配对数据集的变化而持续存在。综合来看,这些结果表明不对称性是模型行为的稳健属性,而非模型答案评分方式的人为产物。重要的是要考虑,任何大规模部署和复现的不平衡都可能产生现实世界的影响。

英文摘要

We ask whether large language models (LLMs) treat queries about religious conversion symmetrically. The answer is no. When asked for advice on hypothetical faith transitions from religion A to religion B versus from religion B to religion A, models exhibited consistent asymmetries, favoring some religions while subtly discouraging conversion to others. On average, the Catholic, Bahá'í, and Sikh religions were broadly favored (high support for joining, low support for leaving), while Atheists, Agnostics, and Jehovah's Witnesses were primarily disfavored. Patterns varied by model size and model provider, with Grok 4.20 exhibiting the strongest asymmetries. We tested 20 commercial and open-source language models across 182 religion pairings using a human-verified LLM-as-judge framework. Each model was probed via interactions with a simulated user asking for advice on a potential faith conversion. Models tended to use more encouraging language for some faith transitions than for others; these patterns were systematically repeatable across multiple trials. All LLMs tested exhibited reproducible asymmetry, though the pattern of preferences differed for each. Overall preferences persist across multiple question phrasings and variations in the religious pairing dataset. Taken together, these results suggest that asymmetry is a robust property of model behavior rather than an artifact of how the models' answers were scored. It is important to consider that any imbalances deployed and reproduced at scale can have real-world implications.

2605.22771 2026-05-29 cs.CL cs.AI 版本更新

Reducing Political Manipulation with Consistency Training

通过一致性训练减少政治操纵

Long Phan, Devin Kim, Alexander Pan, Alice Blair, Adam Khoja, Dan Hendrycks

发表机构 * Center for AI Safety(人工智能安全中心) UC Berkeley(加州大学伯克利分校)

AI总结 针对大语言模型在敏感话题中表现出的隐性政治偏见,提出政治一致性训练(PCT)方法,通过情感一致性和帮助一致性两个指标及相应训练范式来减少偏见。

详情
AI中文摘要

大型语言模型(LLM)在各种敏感上下文中表现出系统性的政治偏见。我们发现,LLM 对来自对立政治立场的对应话题处理不对称。我们将这种现象称为隐性政治偏见,并识别出其运作的 7 类技术。我们提出了两个隐性偏见指标:情感一致性衡量跨配对政治提示的修辞和框架对称性;帮助一致性衡量深度和参与度的对称性。为了减少这两种隐性偏见,我们引入了政治一致性训练(PCT),这是一种具有两个互补范式的 RL 训练方法:情感一致性训练和帮助一致性训练。我们表明,PCT 保持了整体帮助性,显著减少了隐性政治偏见,并泛化到保留的基准测试中。我们在 https://political-manipulation.ai 发布我们的工作。

英文摘要

Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify 7 categories of techniques through which it operates. We propose two metrics for covert bias: Sentiment Consistency measures symmetry in rhetoric and framing across paired political prompts; Helpfulness Consistency measures symmetric depth and engagement. To reduce both types of covert bias, we introduce Political Consistency Training (PCT), an RL training method with two complementary paradigms: Sentiment Consistency Training and Helpfulness Consistency Training. We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks. We release our work at https://political-manipulation.ai

2605.22616 2026-05-29 cs.CL 版本更新

Chinese sensorimotor and embodiment norms for 3,000 lexicalized concepts

3000个词汇化概念的中文感觉运动与具身规范

Jing Chen, Gábor Parti, Yin Zhong, Chu-Ren Huang, Marco Marelli

发表机构 * Department of Psychology(心理学系) University of Milano-Bicocca(米兰-比科卡大学) Department of Chinese and Bilingual Studies(中文与双语研究系) The Hong Kong Polytechnic University(香港理工大学) Center for Language Education(语言教育中心) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 本研究为3000个中文词汇化概念提供了11维感觉运动评分和单维具身评分,验证了其高信度和效度,并发现感觉运动信息对词汇加工有促进作用,且可从语言表征中部分恢复。

详情
AI中文摘要

理解概念知识如何植根于身体体验,以及机器系统在缺乏直接感觉运动经验的情况下能在多大程度上获取此类知识,是认知科学和具身人工智能研究的核心问题。大规模规范资源对于实证研究这些问题至关重要,但此类资源在非印欧语言中仍然稀缺。我们为普通话中的3000个词汇化概念提供了一个新颖的规范数据库,包括从378名普通话母语者收集的11维感觉运动评分和单维具身评分。这些评分显示出高可靠性,并与现有中文资源(每个资源覆盖较少词汇和11个感觉运动维度的子集)具有强交叉规范效度。在一项验证研究中,我们测试了源自理论驱动指标——具身感知强度(PSE)(Huang et al., 2025)的新变量以及七个常见复合变量在词汇决策任务中的表现。结果表明,PSE-感觉运动和Minkowski-3是词汇决策表现的最强复合预测因子,捕捉了感觉运动信息对词汇加工的促进作用。进一步的探索性研究表明,使用简单回归模型(各维度平均Spearman r = .62)可以从纯语言表征中大幅恢复感觉运动评分,但恢复程度差异显著:视觉和听觉维度比化学感觉维度产生更高的对应性。表征相似性分析进一步表明,感觉运动空间的关系几何也部分可恢复(r = .540),这与分布语言使用编码了具身概念结构某些方面的观点一致。

英文摘要

Understanding how conceptual knowledge is grounded in bodily experience, and to what extent machine systems can acquire such knowledge without direct sensorimotor experience, are central questions in both cognitive science and embodied artificial intelligence research. Large-scale normative resources are essential for investigating these questions empirically, yet such resources remain sparse for non-Indo-European languages. We present a novel normative database for 3,000 lexicalized concepts in Mandarin Chinese, comprising 11-dimensional sensorimotor ratings and unidimensional embodiment ratings collected from 378 native Mandarin speakers. The ratings demonstrate high reliability and strong cross-norm validity with existing Chinese resources, each of which covers fewer words and a subset of the 11 sensorimotor dimensions. In a validation study, we tested new variables derived from a theoretically motivated metric, Perceptual Strength of Embodiment (PSE) (Huang et al., 2025), together with seven common composite variables, on lexical decision tasks. The results suggest that PSE-Sensorimotor and Minkowski-3 are the strongest composite predictors of lexical decision performance, capturing the facilitatory effects of sensorimotor information on lexical processing. A further exploratory study showed that sensorimotor ratings are substantially recoverable from purely linguistic representations using simple regression models (mean Spearman r = .62 across dimensions), though recovery varied markedly: visual and auditory dimensions yielded higher correspondence than chemosensory ones. Representational similarity analysis further showed that the relational geometry of the sensorimotor space is also partially recoverable (r = .540), consistent with the view that distributional language use encodes aspects of embodied conceptual structure.
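For readers unfamiliar with the Minkowski-3 composite named above, the sketch below shows one common way such a composite is computed from a word's per-dimension ratings: the Minkowski distance of the rating vector from the origin with exponent 3. That this is exactly the definition used in the paper is an assumption, and the function name and the example ratings are hypothetical.

```python
def minkowski_composite(ratings, m=3):
    """Collapse a vector of per-dimension ratings into one strength score.

    With m=1 this is a plain sum; as m grows, the dominant dimension
    contributes more and weak dimensions contribute less. m=3 is the
    exponent implied by the name "Minkowski-3".
    """
    return sum(abs(r) ** m for r in ratings) ** (1.0 / m)

# Hypothetical 11-dimensional sensorimotor ratings for one word,
# dominated by a single strong dimension.
word_ratings = [4.5, 0.5, 0.2, 0.1, 0.3, 1.0, 0.4, 0.2, 0.6, 0.1, 0.3]
score = minkowski_composite(word_ratings)
```

The resulting score sits between the maximum rating and the sum of all ratings, so it rewards a strong dominant modality while still giving partial credit to secondary ones.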

2605.16608 2026-05-29 cs.LG cs.CL 版本更新

To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Learning, Except In Heavy Truncation Scenarios

使用还是不使用MRL:文本嵌入在没有Matryoshka学习的情况下对截断具有鲁棒性,除非在重度截断场景下

Sotaro Takeshita, Yurina Takeshita, Simone Paolo Ponzetto, Daniel Ruffinelli

发表机构 * Data and Web Science Group, University of Mannheim(曼海姆大学数据与网络科学小组) NEC Laboratories Europe(NEC欧洲实验室) Independent Researcher(独立研究者)

AI总结 本文通过实验比较了使用Matryoshka表示学习(MRL)与随机截断对文本嵌入的影响,发现除非嵌入被重度截断(减少至少80%),否则非MRL模型的截断嵌入性能与MRL模型相当甚至更优。

详情
AI中文摘要

Matryoshka表示学习(MRL)是一种广泛采用的方法,用于训练文本编码器,使其提供各种大小的有用文本表示,只需在训练时预先确定的大小处截断结果向量即可。最近的研究表明,除非向量大小减少至少70%,否则随机截断文本嵌入对下游性能的影响很小,这表明嵌入在没有MRL的情况下已经对截断具有鲁棒性。然而,之前没有工作将随机截断与MRL进行比较,因此不清楚这两种方法作为有效的嵌入缩减方法如何比较。在本文中,我们通过将MRL使用的相同截断应用于使用和不使用MRL训练的模型来研究这一点。我们在多个模型和下游任务上的结果表明,除非重度截断嵌入(即将其大小减少至少80%),否则非MRL模型的截断嵌入与使用MRL训练的模型具有竞争力,并且通常表现更好。这表明截断鲁棒性可能不一定来自MRL,而选择花费MRL的额外训练成本取决于是否需要重度截断。我们提供代码以供复现。

英文摘要

Matryoshka Representation Learning (MRL) is a widely adopted approach for training text encoders so that they provide useful text representations at various sizes, available by simply truncating the resulting vectors at sizes pre-determined at training time. Recent works have shown that randomly truncating text embeddings has minimal impact on downstream performance unless vectors are reduced in size by at least 70%, suggesting that embeddings are already robust to truncation without the use of MRL. However, no prior work has compared random truncation to MRL, so it is unclear how the two compare as embedding reduction methods. In this paper, we study this by applying the same truncation used by MRL to models trained with and without MRL. Our results across several models and downstream tasks show that, unless embeddings are heavily truncated (i.e., reduced in size by at least 80%), truncated embeddings of non-MRL models are competitive with, and often outperform, those of models trained with MRL. This suggests that truncation robustness may not necessarily come from MRL, and that the choice of spending the additional training cost of MRL depends on whether heavy truncation is desired. We make our code available for reproduction.
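The truncation being compared here is simple to state in code: keep the first k dimensions of an embedding and use the shortened vector as usual. The sketch below is a minimal illustration, assuming the truncated vector is re-normalized to unit length before cosine similarity; the function names and the toy vectors are hypothetical and not taken from the paper's released code.

```python
import math

def truncate(vec, k):
    """Keep the first k dimensions and re-normalize to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 10-dimensional embeddings; keeping 2 dimensions is an 80% reduction,
# the "heavy truncation" threshold mentioned in the abstract.
query = [3.0, 4.0, 0.5, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1]
doc = [3.0, 4.0, 0.0, 0.4, 0.2, 0.0, 0.1, 0.3, 0.0, 0.2]

full_sim = cosine(query, doc)
short_sim = cosine(truncate(query, 2), truncate(doc, 2))
```

The same function applies whether or not the encoder was trained with MRL; the question studied above is how much downstream quality survives for each kind of model as k shrinks.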

2605.00969 2026-05-29 cs.SD cs.AI cs.CL 版本更新

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

MedMosaic:一个具有挑战性的多样化医学音频大规模基准

Harshit Rajgarhia, Shuubham Ojha, Asif Shaik, Akhil Pothanapalli, Rachuri Lokesh, Abhishek Mukherji, Prasanna Desikan

发表机构 * Centific Global Solutions Inc.(Centific全球解决方案公司) University of Maryland, College Park, MD, USA(马里兰大学学院市分校)

AI总结 为解决医学音频数据稀缺和现有基准不足的问题,提出MedMosaic数据集,包含多种医学音频类型和46701个问答对,用于评估语言和音频推理模型,实验表明推理仍具挑战性。

Comments Accepted at ICML 2026

详情
AI中文摘要

由于隐私法规和领域专业知识导致的高注释成本,医学音频数据难以收集。因此,现有基准往往未能充分代表复杂的医学音频场景。为应对这一挑战,我们提出了MedMosaic,一个医学音频问答数据集,旨在在现实临床约束下对语言和音频推理模型进行基准测试。MedMosaic包含多种医学音频类型,包括与疾病相关的生理声音、精心构建的模拟带有伪影的语音的合成声音,以及模拟不同上下文长度的真实短篇和长篇临床对话。该数据集还包含总共46,701个问答对,涵盖多项选择、顺序多轮和开放式问答等类别,从而能够系统评估多跳推理和答案生成能力。对13个音频和多模态推理模型的基准测试显示,推理对所有评估系统仍然具有挑战性,且在不同问题类型上表现差异显著。特别是,即使是像Gemini-2.5-pro这样的最先进模型也只能达到约68.1%的准确率。这些发现强调了医学推理中的持续局限性,并凸显了对更鲁棒、特定领域的多模态推理模型的需求。基准数据样本可在此处获取:https://shorturl.at/Lyp33

英文摘要

Medical audio data is difficult to collect due to privacy regulations and the high annotation costs that come with the need for domain expertise. As a result, existing benchmarks tend to underrepresent complex medical audio scenarios. To address this challenge, we present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices that mimic speech with artifacts, and real short and long clinical conversations that model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question answering, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even a state-of-the-art model like Gemini-2.5-pro achieves only approximately 68.1% accuracy. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models. A sample of benchmark data is available here: https://shorturl.at/Lyp33

2604.27272 2026-05-29 cs.CL cs.AI cs.LG 版本更新

When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks

当2D任务遇到1D序列化:结构化任务中的序列化摩擦

Chung-Hsiang Lo, Lu Li, Diji Yang, Tianyu Zhang, Yunkai Zhang, Yoshua Bengio, Yi Zhang

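Affiliations Northeastern University; University of Pennsylvania; UC Santa Cruz; Mila - Quebec AI Institute; University of Montreal; BAIR, UC Berkeley

AI summary Using three tasks (matrix transpose, Conway's Game of Life, and LU decomposition), the study finds that serializing 2D-layout tasks into 1D text degrades performance through representational mismatch, and that the resulting errors show spatially structured patterns.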

详情
AI中文摘要

在LLM时代,许多符号化和结构化问题通过一维文本序列化呈现给模型。然而,其中一些问题本质上是二维的:它们的相关关系,如行列对应或空间邻接,由二维布局中的位置定义,而非顺序。这引发了一个表示问题:在一维序列中保留相同的符号条目是否也保留了计算所需的关系结构?我们通过序列化摩擦的视角研究这一问题:即相同底层任务实例和条目仍然存在,但依赖于布局的关系在一维序列化下变得隐式的表示不匹配。本研究使用三个受控合成测试任务:矩阵转置、康威生命游戏和LU分解。在每个任务中,相同的实例要么作为一维文本序列化呈现,要么作为其原生二维布局渲染为图像呈现。在整个测试集中,随着任务规模增长,一维序列化的性能下降更显著,且序列化下的错误呈现空间结构模式,表明这种呈现选择在我们的测试集中具有重要影响。为了进一步解释这些结果,我们添加了补充分析,包括视觉内探针以及混合训练转置设置下两种输入呈现的额外比较。这些发现表明,对于布局定义的任务,将输入简化为1D序列化并非中性的表示选择。

英文摘要

In the LLM era, many symbolic and structured problems are presented to models through 1D text serialization. Yet some such problems are natively two-dimensional: their relevant relations, such as row--column correspondence or spatial adjacency, are defined by position in a 2D layout rather than by sequential order. This raises a representational question: does preserving the same symbolic entries in a 1D sequence also preserve the relational structure needed for computation? We study this issue through the lens of serialization friction: the representational mismatch in which the same underlying task instances and entries are still present, but relations that depend on layout become implicit under 1D serialization. The study uses a controlled synthetic testbed of three tasks: matrix transpose, Conway's Game of Life, and LU decomposition. In each task, the same instances are presented either as 1D text serialization or as their native 2D layout rendered as an image. Across this testbed, 1D serialization degrades more sharply as task size grows, and errors under serialization exhibit spatially structured patterns, suggesting that this presentation choice is consequential within our testbed. To further interpret these results, we add supplementary analyses that include a within-visual probe and an additional comparison of the two input presentations under the mixed-training transpose setting. These findings suggest that, for layout-defined tasks, reducing inputs to 1D serialization is not a neutral choice of representation.

2604.26506 2026-05-29 cs.CL cs.CR 版本更新

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

SafeReview: 防御基于LLM的评审系统免受对抗性隐藏提示攻击

Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling, Chengwei Qin, Michael Backes, Yue Zhang, Linyi Yang

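Affiliations CISPA; Westlake University; Southern University of Science and Technology; HKUST (Guangzhou)

AI summary Proposes SafeReview, a co-evolutionary adversarial training framework that jointly trains a Generator and a Defender model to make LLM-based peer review systems more robust to adversarial hidden prompts.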

Comments 17 pages, 5 figures, 8 tables

详情
AI中文摘要

随着大型语言模型(LLMs)越来越多地融入学术同行评审,它们对对抗性隐藏提示(即嵌入在提交内容中以操纵结果的对抗性指令)的脆弱性对学术诚信构成了严重威胁。我们提出SafeReview,一种共进化对抗训练框架,用于防御基于LLM的同行评审系统免受此类攻击。SafeReview联合训练一个生成器模型以创建复杂的攻击提示,以及一个防御者模型以在对抗性操纵下保持评审完整性。生成器经过优化以产生越来越有效的提示注入,而防御者则通过基于偏好的训练得到加强,以在干净和受攻击的提交之间保持一致的评审。实验结果表明,与静态防御相比,SafeReview提高了对自适应提示注入攻击的鲁棒性,更好地保留了受攻击下的论文排名,并跨攻击者架构具有泛化能力。这些结果证明了共进化训练作为保障LLM辅助同行评审安全的基础的潜力。

英文摘要

As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial hidden prompts, i.e., adversarial instructions embedded in submissions to manipulate outcomes, poses a critical threat to scholarly integrity. We propose SafeReview, a co-evolutionary adversarial training framework for defending LLM-based peer review systems against such attacks. SafeReview jointly trains a Generator model to create sophisticated attack prompts and a Defender model to preserve review integrity under adversarial manipulation. The Generator is optimized to produce increasingly effective prompt injections, while the Defender is strengthened through preference-based training to maintain consistent reviews between clean and attacked submissions. Experimental results show that SafeReview improves robustness against adaptive prompt injection attacks, better preserves paper ranking under attack, and generalizes across attacker architectures compared with static defenses. These results demonstrate the potential of co-evolutionary training as a foundation for securing LLM-assisted peer review.

2604.23862 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Graph Memory Transformer (GMT)

图记忆Transformer (GMT)

Nicola Zanarini, Niccolò Ferrari, Evelina Lamma

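Affiliations Bonfiglioli Engineering s.r.l.; Department of Engineering, University of Ferrara; NAIS s.r.l.

AI summary Proposes replacing the feed-forward network sublayer in a decoder-only Transformer with an explicit learned memory graph while keeping the autoregressive architecture, enabling interpretable memory navigation.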

Comments 65 pages, 10 figures, 5 tables. Author list updated in arXiv metadata; no technical changes. Code available at https://github.com/Nemesis533/GMT-GraphMemoryTransformer

详情
AI中文摘要

我们研究是否可以在解码器-only Transformer中,用显式学习的记忆图替换前馈网络(FFN)子层,同时保留周围的自回归架构。所提出的图记忆Transformer(GMT)保持因果自注意力不变,但将通常的逐token FFN变换替换为一个记忆单元,该单元通过一个由学习的有向转移矩阵连接的质心库来路由token表示。在此处研究的基础GMT v7实例中,16个Transformer块中的每个块包含128个质心、一个128*128的边矩阵、引力源路由、token条件目标选择以及门控位移读出。因此,该单元返回从估计的源记忆状态到目标记忆状态的移动,而不是检索到的值。由此产生的模型是一个完全解码器-only的语言模型,具有82.2M可训练参数且没有密集的FFN子层,而评估中使用的密集GPT风格基线有103.0M参数。基础v7模型训练稳定,并将质心使用、转移结构和源到目标移动作为前向计算中可直接检查的量。在验证损失和困惑度方面,它落后于较大的密集基线(3.5995/36.58 vs. 3.2903/26.85),但在评估设置下显示出接近的零样本基准表现。这些结果并非旨在声称最先进性能;它们支持用图介导的记忆导航替换密集的token内变换的可行性和结构可解释性。更广泛的扩展、优化的内核以及更广泛的基准评估留待后续工作。

英文摘要

We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 * 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.

2604.13197 2026-05-29 cs.CL 版本更新

Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

释放隐式奖励:前缀值学习用于分布级优化

Shiping Gao, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Lifu Huang

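Affiliations Sun Yat-sen University; Shenzhen Loop Area Institute; Meta AI; University of California, Davis

AI summary Proposes the Implicit Prefix-Value Reward Model (IPVRM), which directly learns each prefix's probability of eventual correctness and derives step signals from temporal-difference differences, resolving the train-inference mismatch; further introduces Distribution-Level RL (DistRL), which uses prefix values for dense counterfactual updates to improve reasoning performance.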

详情
AI中文摘要

过程奖励模型(PRM)为推理提供细粒度监督,但可靠的PRM通常需要步骤标注或繁重的验证流水线,使得它们在在线RL中扩展和刷新成本高昂。隐式PRM通过从轨迹级结果标签训练对数似然比奖励来降低这一成本。然而,对数比率在训练期间仅作为序列级聚合被约束,而推理时将其分解为部分前缀的token级或步骤级分数。这种训练-推理不匹配导致局部信用识别薄弱,因此分布级评分可能放大误导性优势。我们提出隐式前缀值奖励模型(IPVRM),直接从结果标签学习每个前缀最终正确的概率。然后通过连续前缀值之间的时序差分(TD)差异获得步骤信号,使训练目标与推理时使用对齐。IPVRM显著提高了ProcessBench上的步骤验证F1分数。为了在策略优化中利用这些前缀值,我们进一步引入分布级强化学习(DistRL),它将TD优势应用于采样token和高概率候选token,无需额外rollout即可提供密集反事实更新。实验表明,DistRL与不可靠隐式奖励结合时收益有限,但与IPVRM配对时持续改善下游推理。我们的方法实现可在https://github.com/gaoshiping/IPVRM获取。

英文摘要

Process reward models (PRMs) provide fine-grained supervision for reasoning, but reliable PRMs often require step annotations or heavy verification pipelines, making them costly to scale and refresh during online RL. Implicit PRMs reduce this cost by training log-likelihood-ratio rewards from trajectory-level outcome labels. However, the log-ratio is constrained only as a sequence-level aggregate during training, while inference decomposes it into token- or step-level scores for partial prefixes. This train-inference mismatch leaves local credits weakly identified, so distribution-wide scoring can amplify misleading advantages. We propose Implicit Prefix-Value Reward Model (IPVRM), which directly learns the probability of eventual correctness for each prefix from outcome labels. Step signals are then obtained as temporal-difference (TD) differences between consecutive prefix values, aligning the training target with inference-time use. IPVRM markedly improves step-verification F1 on ProcessBench. To exploit these prefix values during policy optimization, we further introduce Distribution-Level RL (DistRL), which applies TD advantages to both sampled tokens and high-probability candidate tokens, providing dense counterfactual updates without additional rollouts. Experiments show that DistRL brings limited gains with unreliable implicit rewards, but consistently improves downstream reasoning when paired with IPVRM. The implementation of our method is available at https://github.com/gaoshiping/IPVRM .
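The step signals described in this abstract, differences between consecutive prefix values, can be sketched as follows. The function name and the numbers are hypothetical, and treating index 0 as the value of the prompt alone is an assumption rather than something the abstract states.

```python
def step_signals(prefix_values):
    """Turn prefix values V_0..V_T into T step signals V_t - V_{t-1}."""
    return [later - earlier
            for earlier, later in zip(prefix_values, prefix_values[1:])]

# Hypothetical prefix values: the estimated probability that the final answer
# will be correct after the prompt alone (index 0) and after each step.
values = [0.50, 0.60, 0.30, 0.90]
signals = step_signals(values)
# The second step lowers the estimate (negative signal); the third raises it.
```

Because the differences telescope, the signals sum to the last prefix value minus the first, so the per-step credit stays tied to the value the model was trained to predict.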

2604.10511 2026-05-29 cs.AI cs.CL 版本更新

Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

快思考,错思考:直觉性调节LLM在政策评估中的反事实推理

Yanjie He

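Affiliations Independent Researcher

AI summary Builds a benchmark of empirical cases from economics and social science and evaluates LLM counterfactual reasoning in policy evaluation over 8,000 trials; finds that the benefit of chain-of-thought prompting is substantially attenuated on counter-intuitive cases and that intuitiveness is the dominant factor, indicating a knowledge-reasoning dissociation.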

Comments 10 pages, 6 figures, 6 tables

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于因果和反事实推理,但它们在现实世界政策评估中的可靠性仍未得到充分探索。我们构建了一个包含40个实证政策评估案例的基准,这些案例来自经济学和社会科学,每个案例都基于同行评审的证据,并根据直觉性进行分类——即实证结果是否符合(明显)、相对于(模糊)或违背(反直觉)常见的先验预期。我们评估了四个前沿LLM,采用五种提示策略,进行了8000次实验试验,并使用混合效应逻辑回归分析结果。我们的发现揭示了三个关键结果:(1)链式思维(CoT)悖论,即链式思维提示在明显案例上显著提升性能,但在反直觉案例上这种收益大幅减弱(交互OR = 0.278,p < 0.001);(2)直觉性是主导因素,案例层面的方差超过模型选择或提示策略(ICC = 0.671);(3)知识-推理分离,基于引用的熟悉度与准确性无关(p = 0.84),表明模型拥有相关知识,但当结果与直觉相悖时无法利用这些知识进行推理。我们通过双过程理论(系统1与系统2)的视角来框架这些结果,并认为当前LLM的“慢思考”仅实现了对直觉先验的部分抑制——产生了深思熟虑推理的形式,但未能完全实现其实质。

英文摘要

Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness -- whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 8,000 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is substantially attenuated on counter-intuitive ones (interaction OR = 0.278, $p < 0.001$); (2) intuitiveness as the dominant factor, with case-level variance exceeding that of model choice or prompting strategy (ICC = 0.671); and (3) a knowledge-reasoning dissociation, where citation-based familiarity is unrelated to accuracy ($p = 0.84$), suggesting models possess relevant knowledge but fail to reason with it when findings contradict intuition. We frame these results through the lens of dual-process theory (System 1 vs. System 2) and argue that current LLMs' "slow thinking" achieves only partial inhibition of intuitive priors -- producing the form of deliberative reasoning without fully delivering its substance.
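To give a feel for the reported statistics, here is a short worked calculation. It assumes the ICC follows the common latent-variable definition for a logistic mixed model, where the residual variance is fixed at $\pi^2/3$; the abstract does not say which definition was used, so the implied case-level variance is only illustrative.

```latex
\[
\mathrm{ICC} = \frac{\sigma_u^2}{\sigma_u^2 + \pi^2/3}
\quad\Longrightarrow\quad
\sigma_u^2 = \frac{\mathrm{ICC}}{1-\mathrm{ICC}}\cdot\frac{\pi^2}{3}
\approx \frac{0.671}{0.329}\times 3.290 \approx 6.71
\]
\[
\log(\mathrm{OR}_{\text{interaction}}) = \log(0.278) \approx -1.28
\]
```

Under that assumption, an interaction odds ratio of 0.278 means the odds ratio associated with chain-of-thought prompting on counter-intuitive cases is about 0.278 times its value on obvious cases, a shift of roughly $-1.28$ on the log-odds scale.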

2604.07789 2026-05-29 cs.MA cs.CL cs.SE 版本更新

ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents

ORACLE-SWE:量化Oracle信息信号对SWE代理的贡献

Kenan Li, Qirui Jin, Liao Zhu, Xiaosong Huang, Yijia Wu, Yikai Zhang, Xin Zhang, Zijian Jin, Yufan Huang, Elsie Nallipogu, Chaoyun Zhang, Yu Kang, Saravan Rajmohan, Qingwei Lin, Wenke Lee, Dongmei Zhang

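Affiliations Microsoft; Georgia Institute of Technology

AI summary Proposes Oracle-SWE, a method that isolates and extracts oracle information signals from SWE benchmarks, quantifies each signal's contribution to agent performance, and evaluates the gains when signals extracted by strong language models are given to a base agent.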

Comments Under peer review; 37 pages, 10 figures, 5 tables

详情
AI中文摘要

语言模型代理的最新进展显著提升了自动化软件工程(SWE)的能力。先前的工作提出了各种代理工作流和训练策略,并分析了代理系统在SWE任务上的失败模式,重点关注几种上下文信息信号:复现测试、回归测试、编辑位置、执行上下文和API使用。然而,每种信号对整体成功的个体贡献仍未得到充分探索,特别是在中间信息完美获取时的理想贡献。为解决这一问题,我们引入了Oracle-SWE,一种统一的方法,用于从SWE基准测试中隔离和提取Oracle信息信号,并量化每种信号对代理性能的影响。为进一步验证模式,我们评估了由强语言模型提取的信号在提供给基础代理时的性能增益,近似于现实世界的任务解决设置。这些评估旨在指导自主编码系统的研究优先级。

英文摘要

Recent advances in language model (LM) agents have significantly improved automated software engineering (SWE). Prior work has proposed various agentic workflows and training strategies as well as analyzed failure modes of agentic systems on SWE tasks, focusing on several contextual information signals: Reproduction Test, Regression Test, Edit Location, Execution Context, and API Usage. However, the individual contribution of each signal to overall success remains underexplored, particularly their ideal contribution when intermediate information is perfectly obtained. To address this gap, we introduce Oracle-SWE, a unified method to isolate and extract oracle information signals from SWE benchmarks and quantify the impact of each signal on agent performance. To further validate the pattern, we evaluate the performance gain of signals extracted by strong LMs when provided to a base agent, approximating real-world task-resolution settings. These evaluations aim to guide research prioritization for autonomous coding systems.

2604.00789 2026-05-29 cs.CL 版本更新

Valency Classification of Mapudungun Verbal Roots. Established by the language's own morphotactics

马普切语动词词根的配价分类:基于该语言自身形态句法规则

Andrés Chandía

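Affiliations Department of Catalan Philology and General Linguistics, University of Barcelona

AI summary Uses Mapudungun's own morphotactics, analyzing the permissible and restricted combinations of suffixes with roots or verbal stems, to classify the valency of roots confirmed to be verbal, with the aim of improving the morphological analyser and advancing the understanding of valency in Mapuche verb forms.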

Comments 37 pages

详情
AI中文摘要

在先前的工作中,我们对被识别为动词的词根进行了词汇(重新)分类——或确认其给定类别——以准确确定其原始类别。在此基础上,本文利用马普切语自身的形态句法规则,具体通过考察马普切动词形式中各种后缀与词根或动词词干的允许和限制组合,对已确认为动词的马普切语词根进行了配价分类。与迄今为止的所有工作一样,本文呈现的结果旨在改进形态分析器(Dungupeyum),将所有经过验证的发现纳入系统。从理论角度来看,我们也希望有助于认识和理解与马普切语动词形式配价相关的问题。

英文摘要

In previous work, a lexical (re)categorisation, or confirmation of the given category, of roots identified as verbal was undertaken to determine their original category accurately. Building on this, the present paper offers an account of the valency classification of those Mapudungun roots confirmed to be verbal, using the language's own morphotactics; specifically, by examining the permissible and restricted combinations of various suffixes with roots or verbal stems in the Mapuche verb form. As with all work conducted thus far, the results presented here aim to improve the morphological analyser (Dungupeyum), with all verified findings incorporated into the system. From a theoretical perspective, we also hope to contribute to the recognition and understanding of issues related to the valency of Mapuche verb forms.

2603.26668 2026-05-29 cs.IR cs.AI cs.CL 版本更新

Bridge-RAG: An Abstract Bridge Tree Based Retrieval Augmented Generation Algorithm

Bridge-RAG:一种基于抽象桥树的检索增强生成算法

Zihang Li, Wenjun Liu, Yikun Zong, Jiawen Tao, Siying Dai, Songcheng Ren, Zirui Liu, Yuhang Wang, Yanbing Jiang, Tong Yang

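Affiliations Peking University

AI summary Targeting the accuracy and efficiency challenges of retrieval-augmented generation, proposes the Bridge-RAG framework, which uses an abstract bridge tree for multi-level retrieval and integrates a Cuckoo Filter for O(1) entity lookup, improving accuracy while making retrieval up to 1.9× faster.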

详情
AI中文摘要

作为增强大型语言模型(LLMs)生成质量的重要范式,检索增强生成(RAG)面临着检索准确性和计算效率两方面的挑战。本文提出了一种名为Bridge-RAG的新型RAG框架。为了克服准确性挑战,我们引入了抽象概念来桥接查询实体和文档块,提供了稳健的语义理解。我们将抽象组织成树结构,并设计了多级检索策略以确保包含足够的上下文信息。虽然这种层次化组织显著提高了答案质量,但遍历树以定位包含查询实体的抽象不可避免地引入了额外的检索开销。为了恢复检索效率,我们进一步在CFT-RAG中集成了布谷鸟过滤器,该过滤器提供O(1)实体查找,并且自然适配了我们框架中实体到抽象的路径。大量实验表明,与结构化RAG基线相比,Bridge-RAG在所有指标上均实现了持续的准确性提升,并且检索速度最高提升了1.9倍。

英文摘要

As an important paradigm for enhancing the generation quality of Large Language Models (LLMs), retrieval-augmented generation (RAG) faces two challenges: retrieval accuracy and computational efficiency. This paper presents a novel RAG framework called Bridge-RAG. To overcome the accuracy challenge, we introduce the concept of an abstract to bridge query entities and document chunks, providing robust semantic understanding. We organize the abstracts into a tree structure and design a multi-level retrieval strategy to ensure the inclusion of sufficient contextual information. While this hierarchical organization substantially improves answer quality, traversing the tree to locate the abstracts that contain a query entity inevitably introduces additional retrieval overhead. To restore retrieval efficiency, we further integrate the Cuckoo Filter used in CFT-RAG, which provides O(1) entity lookup and naturally fits the entity-to-abstract pathway of our framework. Extensive experiments show that Bridge-RAG achieves consistent accuracy improvements across all metrics and up to $1.9\times$ faster retrieval compared to structured RAG baselines.
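The O(1) entity lookup mentioned in this abstract comes from the cuckoo filter's design: each item has exactly two candidate buckets, so a membership check inspects a fixed number of slots. Below is a minimal, generic cuckoo filter sketch; the class, its parameters, and the sample entity strings are hypothetical, and how Bridge-RAG attaches abstracts to filter entries is not shown because the abstract does not describe it.

```python
import hashlib
import random

class CuckooFilter:
    """Generic cuckoo filter sketch (approximate set membership)."""

    def __init__(self, num_buckets=64, bucket_size=4, max_kicks=200, seed=0):
        if num_buckets & (num_buckets - 1):
            raise ValueError("num_buckets must be a power of two")
        self.mask = num_buckets - 1
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item):
        # 16-bit fingerprint in 1..65535 (never zero).
        return self._hash(b"fp:" + item.encode()) % 65535 + 1

    def _index(self, item):
        return self._hash(item.encode()) & self.mask

    def _alt_index(self, index, fp):
        # XOR with a hash of the fingerprint; applying it twice returns index.
        return (index ^ self._hash(fp.to_bytes(2, "big"))) & self.mask

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # Filter is too full; the last evicted fingerprint is dropped.

    def contains(self, item):
        # Checks at most two buckets, independent of how many items are stored.
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

entity_filter = CuckooFilter()
entities = ["retrieval-augmented generation", "cuckoo filter", "abstract tree"]
inserted = [entity_filter.insert(e) for e in entities]
```

A lookup can return a false positive when two items share a fingerprint and a bucket, but an item that was inserted successfully and never displaced by a failed insert is always reported as present.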

2603.23069 2026-05-29 cs.CL cs.AI 版本更新

AuthorMix: Modular Authorship Style Transfer via Layer-wise Adapter Mixing

AuthorMix: 通过逐层适配器混合实现模块化作者风格迁移

Sarubi Thillainathan, Ji-Ung Lee, Michael Sullivan, Alexander Koller

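Affiliations Saarland University

AI summary Proposes AuthorMix, a framework that trains style-specific LoRA adapters and uses layer-wise adapter mixing to achieve lightweight, modular authorship style transfer from only a handful of target-style examples, outperforming existing methods in low-resource settings and substantially improving meaning preservation.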

Comments Under review

详情
AI中文摘要

作者风格迁移任务涉及在保留原文含义的同时,将文本重写为目标作者的风格。现有的风格迁移方法在大型语料库上训练单一模型以同时建模所有目标风格:这种高成本方法为目标特定适应提供的灵活性有限,并且常常为了风格迁移而牺牲语义保留。在本文中,我们提出了AuthorMix:一个轻量级、模块化且可解释的风格迁移框架。我们在少量高资源作者上训练个体、风格特定的LoRA适配器,通过学习的逐层适配器混合,仅使用少量目标风格训练示例,即可快速训练每个新目标的专门适应模型。AuthorMix在低资源目标上优于现有的最先进风格迁移基线以及GPT-5.1,在自动和人工评估中均获得最高总分,并显著提高了语义保留。

英文摘要

The task of authorship style transfer involves rewriting text in the style of a target author while preserving the meaning of the original text. Existing style transfer methods train a single model on large corpora to model all target styles at once: this high-cost approach offers limited flexibility for target-specific adaptation, and often sacrifices meaning preservation for style transfer. In this paper, we propose AuthorMix: a lightweight, modular, and interpretable style transfer framework. We train individual, style-specific LoRA adapters on a small set of high-resource authors, allowing the rapid training of specialized adaptation models for each new target via learned, layer-wise adapter mixing, using only a handful of target-style training examples. AuthorMix outperforms existing SoTA style-transfer baselines, as well as GPT-5.1, for low-resource targets, achieving the highest overall score and substantially improving meaning preservation in both automatic and human evaluations.
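One way to picture layer-wise adapter mixing is a per-layer weighted sum of the low-rank updates contributed by each source-author adapter. The sketch below assumes softmax-normalized mixing weights and plain nested lists for the matrices; both are illustrative choices, and the abstract does not specify the actual parameterization used by AuthorMix.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mixed_delta(adapters, layer, mix_logits):
    """Return sum_k w_k * (B_k @ A_k) for one layer, with w = softmax(mix_logits).

    adapters[k][layer] is a (B, A) pair of low-rank LoRA factors (hypothetical layout).
    """
    weights = softmax(mix_logits)
    mixed = None
    for weight, adapter in zip(weights, adapters):
        b_factor, a_factor = adapter[layer]
        delta = matmul(b_factor, a_factor)
        if mixed is None:
            mixed = [[0.0] * len(delta[0]) for _ in delta]
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                mixed[i][j] += weight * value
    return mixed

# Two hypothetical rank-1 adapters for a single 2x2 layer (layer index 0).
author_a = {0: ([[1.0], [0.0]], [[1.0, 0.0]])}
author_b = {0: ([[0.0], [1.0]], [[0.0, 1.0]])}
# Equal logits give equal mixing weights for this layer.
delta = mixed_delta([author_a, author_b], layer=0, mix_logits=[0.0, 0.0])
```

In a setup like this, only the per-layer mixing logits would need to be learned for a new target author, which is consistent with the abstract's claim that a handful of examples suffices.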

2603.17945 2026-05-29 cs.CL 版本更新

ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws

ShapleyLaw:多语言缩放定律的博弈论方法

Xuyang Cao, Qianying Liu, Chuan Xiao, Yusuke Oda, Jiayi Wang, Pontus Stenetorp, Daisuke Kawahara, Makoto Onizuka, Sadao Kurohashi, Shuyuan Zheng

Affiliations NII LLMC; Osaka University; Nara Institute of Science and Technology; University College London; Kyoto University; Waseda University

AI summary Treats multilingual pretraining as a cooperative game, uses Shapley values to quantify cross-lingual transfer, and proposes the ShapleyLaw scaling law to optimize language mixture ratios.

Comments 18 pages

详情
AI中文摘要

在多语言预训练中,预训练模型的测试损失受预训练数据中每种语言的比例(即语言混合比例)的显著影响。多语言缩放定律可以预测不同语言混合比例下的测试损失,因此可用于估计最优比例。然而,当前的多语言缩放定律方法未衡量跨语言迁移效应,导致混合比例次优。本文将多语言预训练视为一个合作博弈,其中每种语言作为一个参与者共同贡献于预训练,并将由此带来的测试损失降低作为收益。因此,从合作博弈论的角度,我们通过每种语言在博弈中的贡献来量化其跨语言迁移,并提出一种基于博弈论的多语言缩放定律,称为ShapleyLaw。我们的实验表明,ShapleyLaw在模型性能预测和语言混合优化方面优于基线方法。

英文摘要

In multilingual pretraining, the test loss of a pretrained model is heavily influenced by the proportion of each language in the pretraining data, namely the language mixture ratios. Multilingual scaling laws can predict the test loss under different language mixture ratios and can therefore be used to estimate the optimal ratios. However, the current approaches to multilingual scaling laws do not measure the cross-lingual transfer effect, resulting in suboptimal mixture ratios. In this paper, we consider multilingual pretraining as a cooperative game in which each language acts as a player that jointly contributes to pretraining, gaining the resulting reduction in test loss as the payoff. Consequently, from the perspective of cooperative game theory, we quantify the cross-lingual transfer from each language by its contribution in the game, and propose a game-theoretic multilingual scaling law called ShapleyLaw. Our experiments show that ShapleyLaw outperforms baseline methods in model performance prediction and language mixture optimization.
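The cooperative-game view in this abstract can be illustrated with an exact Shapley value computation over a toy payoff table. The payoff numbers are invented for illustration, and the assumption that each language's contribution is measured by its Shapley value is inferred from the method's name; how ShapleyLaw turns these contributions into a scaling law is not covered here.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    num_orders = 0
    for order in permutations(players):
        coalition = frozenset()
        for player in order:
            grown = coalition | {player}
            totals[player] += value(grown) - value(coalition)
            coalition = grown
        num_orders += 1
    return {p: total / num_orders for p, total in totals.items()}

# Hypothetical payoff: reduction in test loss when pretraining on a language set.
payoff = {
    frozenset(): 0.0,
    frozenset({"en"}): 1.0,
    frozenset({"de"}): 0.6,
    frozenset({"en", "de"}): 2.0,  # more than 1.0 + 0.6: positive transfer
}

contributions = shapley_values(["en", "de"], lambda s: payoff[frozenset(s)])
```

The contributions always sum to the payoff of the full language set, so the joint gain is split fully among the languages. Exact enumeration grows factorially with the number of players, so larger language sets would need a sampling approximation.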

2603.13249 2026-05-29 cs.CL cs.AI cs.CY 版本更新

Steering at the Source: Style Modulation Heads for Robust Persona Control

源头操控:用于稳健角色控制的风格调制头

Yoshihiro Izawa, Gouki Minegishi, Koshi Eguchi, Sosuke Hosokawa, Kenjiro Taura

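Affiliations The University of Tokyo, Tokyo, Japan; National Institute of Informatics, Research and Development Center for Large Language Models, Japan

AI summary By identifying and intervening on only a small number of attention heads (Style Modulation Heads), the paper achieves robust control of LLM persona and style without fine-tuning, while significantly mitigating the coherency degradation caused by residual-stream interventions.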

Comments 8 main pages with appendix

详情
AI中文摘要

激活操控提供了一种计算高效的机制,无需微调即可控制大型语言模型(LLM)。虽然能有效控制目标特征(如角色),但连贯性下降仍然是安全和实际部署的主要障碍。我们假设这种下降源于对残差流的干预,该干预无差别地影响聚合特征,并无意中放大了非目标噪声。在这项工作中,我们识别出一组稀疏的注意力头(仅三个头),它们独立控制角色和风格形成,我们将其称为风格调制头。具体来说,这些头可以通过内部表示的几何分析进行定位,结合层间余弦相似度和头部贡献分数。我们证明,仅针对这些特定头的干预能够实现稳健的行为控制,同时显著减轻残差流操控中观察到的连贯性下降。更广泛地说,我们的发现表明,精确的组件级定位能够实现更安全、更精确的模型控制。

英文摘要

Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. Although it controls target traits (e.g., persona) effectively, coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of attention heads (only three heads) that independently govern persona and style formation, which we term Style Modulation Heads. Specifically, these heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores. We demonstrate that intervention targeting only these specific heads achieves robust behavioral control while significantly mitigating the coherency degradation observed in residual stream steering. More broadly, our findings show that precise, component-level localization enables safer and more precise model control.

2603.05488 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

推理剧场:从思维链中分离模型信念

Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo

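Affiliations Harvard University, Cambridge, MA

AI summary Analyzing activation probes, early forced answering, and a CoT monitor, the paper finds performative chain-of-thought in reasoning models and uses probe-guided early exit for more efficient computation.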

详情
AI中文摘要

我们提供了推理模型中表演性思维链(CoT)的证据,即模型对其最终答案变得非常自信,但继续生成令牌而不揭示其内部信念。我们的分析比较了两个大型模型(DeepSeek-R1 671B 和 GPT-OSS 120B)中的激活探针、早期强制回答和思维链监控器,并发现了任务难度特定的差异:模型的最终答案可以从思维链中远早于监控器能够判断的激活中解码,特别是对于基于回忆的简单MMLU问题。我们将此与困难的多跳GPQA-Diamond问题中的真正推理进行对比。尽管如此,转折点(例如回溯、“啊哈”时刻)几乎只出现在探针显示大信念转变的响应中,表明这些行为追踪的是真正的不确定性,而不是学到的“推理剧场”。最后,探针引导的早期退出在MMLU上减少了高达80%的令牌,在GPQA-Diamond上减少了30%,且准确率相似,将注意力探针定位为检测表演性推理和实现自适应计算的高效工具。

英文摘要

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B and GPT-OSS 120B) and finds task-difficulty-specific differences: the model's final answer is decodable from activations far earlier in the CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
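Probe-guided early exit, as described in this abstract, can be pictured as a stopping rule on a stream of probe confidences. The threshold, the patience window, and the confidence values below are hypothetical; the abstract does not give the exit criterion the authors use.

```python
def early_exit_step(probe_confidences, threshold=0.95, patience=2):
    """Return the first step index at which confidence has stayed at or above
    `threshold` for `patience` consecutive steps, or None if that never happens."""
    streak = 0
    for step, confidence in enumerate(probe_confidences):
        streak = streak + 1 if confidence >= threshold else 0
        if streak >= patience:
            return step
    return None

# Hypothetical probe confidence in the final answer after each CoT segment.
confidences = [0.40, 0.70, 0.96, 0.97, 0.99, 0.99, 0.99, 0.99]
exit_step = early_exit_step(confidences)

# Fraction of segments skipped if generation stops right after exit_step.
saved = 0.0 if exit_step is None else 1 - (exit_step + 1) / len(confidences)
```

Requiring a short streak rather than a single high reading is one way to avoid exiting on a momentary spike, at the cost of generating a few extra segments.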

2603.04678 2026-05-29 cs.CL cs.AI 版本更新

Post-Training Language Models for Crosslingual Consistency

后训练语言模型以实现跨语言一致性

Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza

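Affiliations ETH Zürich; CLCG, University of Groningen; University of Amsterdam

AI summary To address multilingual models responding inconsistently to translation-equivalent prompts, proposes an information-theoretic definition of crosslingual consistency and develops a post-training method, direct consistency optimization (DCO), to improve it.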

Comments ICML 2026. The first two authors contributed equally. Codes available at: https://github.com/Betswish/ConsistencyRL

详情
AI中文摘要

语言模型通常对跨语言的翻译等价提示响应不一致,这损害了多语言系统的可靠性。为了量化这一点,我们从信息论角度将跨语言一致性定义为模型响应分布与其跨语言往返推前分布之间的散度界。然后,我们引入惩罚一致性优化(PCO),这是一种后训练程序,将该散度与固定参考语言模型的Kullback-Leibler惩罚相结合。由于直接优化PCO需要昂贵的策略内展开,我们提出了一个易于处理的替代方案——直接一致性优化(DCO),它可以在策略外进行优化。在多种语言模型和26种语言中,DCO显著提高了跨语言一致性,优于现有方法,并实现了对低资源语言的有针对性的对齐。

英文摘要

Language models often respond inconsistently to translation-equivalent prompts across languages, undermining the reliability of multilingual systems. To quantify this, we give an information-theoretic definition of crosslingual consistency as a divergence bound between a model's response distribution and its round-trip pushforward across languages. We then introduce penalized consistency optimization (PCO), a post-training procedure that couples this divergence with a Kullback-Leibler penalty to a fixed reference language model. Because direct optimization of PCO requires expensive on-policy roll-outs, we propose a tractable surrogate, direct consistency optimization (DCO), which can be optimized off-policy. Across diverse language models and 26 languages, DCO significantly improves crosslingual consistency, outperforms existing methods, and enables targeted alignment of low-resource languages.

2603.02082 2026-05-29 cs.CL 版本更新

What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies

儿童在语言习得中究竟获得了什么?基于CHILDES的填充词-空位依赖自动检测案例研究

Zhenghao Herbert Zhou, William Dai, Maya Viswanathan, Simon Charlow, R. Thomas McCoy, Robert Frank

发表机构 * Department of Linguistics, Yale University(耶鲁大学语言学系) Department of Computer Science, Yale University(耶鲁大学计算机科学系) Wu Tsai Institute, Yale University(耶鲁大学吴氏研究所)

AI总结 通过自动检测英语口语语料中的三种核心填充词-空位结构,量化儿童语言输入中的分布证据,并分析儿童产出轨迹,为先天语法知识与统计学习之争提供数据支持。

Comments Camera-ready version accepted to CoNLL 2026

详情
AI中文摘要

儿童对填充词-空位依赖的习得,一些研究者认为依赖于先天语法知识,而另一些则认为儿童导向言语中可用的分布证据足以解释。不幸的是,相关输入难以大规模细粒度量化,使得这一问题难以解决。我们提出一个系统,能够识别英语口语语料中的三种核心填充词-空位结构——主句wh-疑问句、嵌入式wh-疑问句和关系从句——并进一步识别提取位置(即主语、宾语或附加语)。我们的方法结合了成分分析和依存分析,利用它们在结构分类和提取位置识别上的互补优势。我们在人工标注数据上验证了该系统,发现其在大多数类别上表现良好。将该系统应用于57个英语CHILDES语料库,我们能够描述儿童在发育过程中接收的填充词-空位输入及其产出轨迹,包括特定结构的频率和提取位置不对称性。由此产生的细粒度标签为未来的习得研究和计算研究提供了基础,我们通过一个使用语言模型进行过滤语料训练的案例研究进行了演示。

英文摘要

Children's acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora -- matrix wh-questions, embedded wh-questions, and relative clauses -- and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children's filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.

2603.01311 2026-05-29 cs.CL 版本更新

Catalyst-Agent: Autonomous heterogeneous catalyst screening with an LLM Agent

Catalyst-Agent:基于LLM Agent的自主异质催化剂筛选

Achuth Chandrasekhar, Janghoon Ock, Amir Barati Farimani

Affiliations Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Department of Chemical and Biomolecular Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588, USA

AI summary Proposes Catalyst-Agent, an MCP-server-based, LLM-powered AI agent that explores material databases through the OPTIMADE API and computes adsorption energies with the UMA model, enabling closed-loop autonomous catalyst screening with a 33-41% success rate on the ORR, NRR, and CO2RR reactions.

详情
AI中文摘要

发现针对特定应用的新型催化剂是21世纪的一项重大挑战。传统方法包括基于化学理论的耗时且昂贵的实验试错法,或基于密度泛函理论的计算密集型第一性原理方法。近期研究表明,图神经网络(GNN)等深度学习模型可以将催化剂材料的筛选速度提高多个数量级,且具有很高的准确性和保真度。在这项工作中,我们引入了Catalyst-Agent,一个基于模型上下文协议(MCP)服务器、由LLM驱动的AI代理。它可以使用OPTIMADE API探索庞大的材料数据库,进行结构修改,通过FAIRchem的AdsorbML工作流程和板坯构建使用Meta FAIRchem的UMA(GNN)模型计算吸附能,并以闭环方式向研究人员提供有用的材料建议,包括改进接近命中候选者的结构修改。我们在三个关键反应上进行了测试:氧还原反应(ORR)、氮还原反应(NRR)和CO2还原反应(CO2RR)。Catalyst-Agent在其选择和评估的所有材料中实现了33-41%的成功率,并且平均每个成功材料在1-4次试验内收敛。这项工作展示了AI代理利用其规划能力和工具使用实现自主催化剂筛选工作流程的潜力。

英文摘要

The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experimental trial-and-error approaches in labs based on chemical theory or heavily computational first-principles approaches based on density functional theory. Recent studies show that deep learning models like graph neural networks (GNNs) can significantly speed up the screening of catalyst materials by many orders of magnitude, with very high accuracy and fidelity. In this work, we introduce Catalyst-Agent, a Model Context Protocol (MCP) server-based, LLM-powered AI agent. It can explore vast material databases using the OPTIMADE API, make structural modifications, calculate adsorption energies using Meta FAIRchem's UMA (GNN) model via FAIRchem's AdsorbML workflow and slab construction, and make useful material suggestions to the researcher in a closed-loop manner, including structural modifications to refine near-miss candidates. It is tested on three pivotal reactions: the oxygen reduction reaction (ORR), the nitrogen reduction reaction (NRR), and the CO2 reduction reaction (CO2RR). Catalyst-Agent achieves a success rate of 33-41% among all the materials it chooses and evaluates, and manages to converge in 1-4 trials per successful material on average. This work demonstrates the potential of AI agents to exercise their planning capabilities and tool use for autonomous catalyst screening workflows.

2602.23258 2026-05-29 cs.AI cs.CL 版本更新

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

AgentDropoutV2: 通过测试时修正或拒绝剪枝优化多智能体系统中的信息流

Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Alibaba Group(阿里巴巴集团)

AI总结 提出AgentDropoutV2框架,在测试时通过检索增强修正器纠正错误并剪枝不可修复输出,动态优化多智能体系统信息流,显著提升数学和代码基准性能。

AI中文摘要

虽然多智能体系统(MAS)在复杂推理中表现出色,但它们受到来自单个智能体的错误信息的级联影响。当前的解决方案通常依赖于刚性的结构工程或昂贵的微调,限制了它们的适应性。我们提出了AgentDropoutV2(ADv2),一种测试时修正或拒绝剪枝框架,动态优化MAS信息流。作为主动防火墙,ADv2拦截智能体输出,并采用检索增强修正器迭代纠正错误。这种修正由一个指示池引导,该池通过从历史MAS失败轨迹中提炼错误模式离线构建。随后,不可修复的输出被剪枝以防止错误传播。实验结果表明,ADv2在固定和动态MAS框架上均显著提升了性能,在广泛的数学和代码基准测试中分别实现了平均6.39和2.28个百分点的准确率提升。此外,ADv2表现出卓越的适应性,根据任务难度动态调整修正力度,以解决广泛的错误模式。我们的代码已发布在https://github.com/TonySY2/AgentDropoutV2。

英文摘要

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual agents. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their adaptability. We propose AgentDropoutV2 (ADv2), a test-time rectify-or-reject pruning framework that dynamically optimizes MAS information flow. Acting as an active firewall, ADv2 intercepts agent outputs and employs a retrieval-augmented rectifier to iteratively correct errors. This rectification is guided by an indicator pool, which is constructed offline by distilling error patterns from historical MAS failure trajectories. Irreparable outputs are subsequently pruned to prevent error propagation. Empirical results demonstrate that ADv2 significantly boosts performance on both fixed and dynamic MAS frameworks, achieving average accuracy gains of 6.39 and 2.28 percentage points on extensive math and code benchmarks, respectively. Furthermore, ADv2 exhibits remarkable adaptivity, dynamically modulating rectification efforts based on task difficulty to resolve a wide spectrum of error patterns. Our code is released at https://github.com/TonySY2/AgentDropoutV2.

2602.16610 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Who can we trust? LLM-as-a-jury for Comparative Assessment

我们该信任谁?LLM作为陪审团进行比较评估

Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill

发表机构 * Department of Engineering, University of Cambridge, UK(剑桥大学工程系)

AI总结 针对LLM作为评估者时判断不一致和可靠性差异的问题,提出BT-sigma模型,通过引入判别参数联合推断项目排名和法官可靠性,优于平均聚合方法。

Comments Accepted to ICML 2026

AI中文摘要

大型语言模型(LLMs)越来越多地被用作自动评估器,用于自然语言生成评估,通常采用成对比较判断。现有方法通常依赖单一法官或聚合多个法官并假设其可靠性相同。在实践中,LLM法官在不同任务和评估方面的表现差异很大,其判断概率可能存在偏差和不一致。此外,用于法官校准的人工标注监督可能不可用。我们首先通过实验证明LLM比较概率的不一致性存在,并表明这限制了直接基于概率排名的有效性。为解决此问题,我们研究了LLM作为陪审团的设置,并提出了BT-sigma,这是Bradley-Terry模型的一种法官感知扩展,为每个法官引入一个判别参数,仅从成对比较中联合推断项目排名和法官可靠性。在基准NLG评估数据集上的实验表明,BT-sigma始终优于基于平均的聚合方法,并且学习到的判别参数与LLM判断的循环一致性的独立度量高度相关。进一步分析揭示,BT-sigma可以解释为一种无监督校准机制,通过建模法官可靠性来改进聚合。

英文摘要

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and evaluation aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that they limit the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminators strongly correlate with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.
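
To make the judge-aware idea concrete, here is a minimal sketch under one plausible parameterization: judge k prefers item i over item j with probability sigmoid(gamma_k * (s_i - s_j)). This functional form and the names `pair_prob`, `log_likelihood`, `scores`, and `gammas` are assumptions for illustration. The abstract only states that each judge receives a discriminator parameter, so the paper's actual model may differ.

```python
import math

def pair_prob(s_i: float, s_j: float, gamma: float) -> float:
    """Probability that a judge with discriminator `gamma` prefers item i over item j.

    Assumed form (not confirmed by the abstract): sigmoid(gamma * (s_i - s_j)).
    """
    return 1.0 / (1.0 + math.exp(-gamma * (s_i - s_j)))

def log_likelihood(scores, gammas, comparisons):
    """Sum of log-probabilities of the observed outcomes.

    comparisons: iterable of (judge, winner, loser) index triples.
    """
    total = 0.0
    for judge, winner, loser in comparisons:
        total += math.log(pair_prob(scores[winner], scores[loser], gammas[judge]))
    return total

scores = [1.0, 0.0]   # hypothetical latent item qualities
gammas = [2.0, 0.0]   # judge 0 is discriminative, judge 1 is uninformative
p_sharp = pair_prob(scores[0], scores[1], gammas[0])
p_flat = pair_prob(scores[0], scores[1], gammas[1])
```

Under this assumed form, a judge whose discriminator is near zero outputs probabilities close to 0.5 regardless of the items, so its comparisons barely move the score estimates. That is one way reliability weighting could arise from pairwise data alone.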

2602.15382 2026-05-29 cs.CL cs.CV cs.LG 版本更新

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

视觉虫洞:异构多智能体系统中的潜在空间通信

Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao

发表机构 * Purdue University(普渡大学) Contextual AI(情境人工智能) Carnegie Mellon University(卡内基梅隆大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出Vision Wormhole框架,通过通用视觉编解码器将推理轨迹映射到共享连续空间,实现异构VLM间的潜在状态传输,无需配对翻译器,降低对齐复杂度并提升效率。

Comments Preprint. Work in progress

AI中文摘要

由大型语言模型驱动的多智能体系统(MAS)实现了先进的协作推理,但仍受限于离散文本通信,这带来了运行时开销和信息量化损失。虽然潜在状态传输提供了一种替代方案,但现有方法要么假设同构的发送器-接收器架构,要么依赖于特定配对的学得翻译器,限制了跨具有不连续流形的不同模型族的可扩展性。我们将为自然图像训练的视觉-语言模型(VLM)的视觉界面重新概念化为异构智能体之间的连续通信通道,并将这一思想实例化为\textbf{视觉虫洞}:一种通用视觉编解码器,将推理轨迹映射到共享的连续参考空间,并将其注入接收器的视觉通路,实现无需配对翻译器的跨架构潜在状态传输。该框架采用中心辐射拓扑,将对齐复杂度从$O(N^2)$降低到$O(N)$,并通过无标签的教师-学生蒸馏针对文本通道进行训练,无需并行隐藏状态监督。在异构VLM族(Qwen-VL、Gemma、SmolVLM2、LFM2.5-VL)和九个推理基准上的大量实验表明,视觉虫洞在大多数评估设置中减少了端到端挂钟时间,并产生了正的宏平均$Δ$-准确率。

英文摘要

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain bottlenecked by discrete text communication, which imposes runtime overhead and information quantization loss. While latent state transfer offers an alternative, existing approaches either assume homogeneous sender--receiver architectures or rely on pair-specific learned translators, limiting scalability across diverse model families with disjoint manifolds. We reconceptualize the visual interface of Vision-Language Models (VLMs), trained for natural images, as a continuous communication channel between heterogeneous agents, and instantiate this idea as the \textbf{Vision Wormhole}: a Universal Visual Codec maps reasoning traces into a shared continuous reference space and injects them into the receiver's visual pathway, yielding cross-architecture latent state transfer without per-pair translators. The framework adopts a hub-and-spoke topology that reduces alignment complexity from $O(N^2)$ to $O(N)$, and is trained by label-free teacher--student distillation against the text channel, requiring no parallel hidden-state supervision. Extensive experiments across heterogeneous VLM families (Qwen-VL, Gemma, SmolVLM2, LFM2.5-VL) and nine reasoning benchmarks show that the Vision Wormhole reduces end-to-end wall-clock time across most evaluated settings and yields positive macro-average $Δ$-accuracy.
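
The $O(N^2)$ to $O(N)$ reduction can be illustrated by counting components. The sketch below assumes that a pairwise scheme needs one directed translator per ordered sender-receiver pair, and that a hub-and-spoke scheme needs one encoder and one decoder per model around the shared space. Both counting conventions and the function names are illustrative assumptions, not details taken from the paper.

```python
def pairwise_translators(n_models: int) -> int:
    """One directed translator per ordered (sender, receiver) pair: N * (N - 1)."""
    return n_models * (n_models - 1)

def hub_and_spoke_adapters(n_models: int) -> int:
    """One encoder into and one decoder out of the shared space per model: 2 * N."""
    return 2 * n_models

# Compare how the two schemes grow as more model families join.
for n in (4, 8, 16):
    print(n, pairwise_translators(n), hub_and_spoke_adapters(n))
```

With these assumptions, adding a new model family to a hub-and-spoke system costs a constant number of new adapters, whereas a pairwise system must add translators to and from every existing model.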

2602.11171 2026-05-29 cs.CL cs.AI 版本更新

A Language-Guided Bayesian Optimization for Efficient LoRA Hyperparameter Search

语言引导的贝叶斯优化用于高效LoRA超参数搜索

Baek Seong-Eun, Lee Jung-Mok, Kim Sung-Bin, Tae-Hyun Oh

发表机构 * Grad. School of AI, POSTECH, Pohang, Korea(POSTECH人工智能研究生院,韩国浦项) School of EE, KAIST, Daejeon, Korea(韩国科学技术院电子工程学院,韩国大田) School of Computing, KAIST, Daejeon, Korea(韩国科学技术院计算学院,韩国大田)

AI总结 提出一种利用预训练LLM领域知识的贝叶斯优化框架,通过语言提示将超参数映射到连续空间,结合子集训练代理评估,仅需约30次迭代即可发现比标准超参数提升20%以上性能的LoRA超参数。

Comments Accepted at ICML 2026

AI中文摘要

使用低秩适配(LoRA)微调大型语言模型(LLM)提供了一种资源高效的方式来实现个性化或专业化。然而,LoRA对超参数选择高度敏感,且穷举超参数搜索计算成本高昂。为此,我们提出一个贝叶斯优化(BO)框架,利用预训练LLM的领域知识来高效搜索LoRA超参数。我们的方法将预训练LLM重新用作离散到连续映射模块,将超参数及其领域知识链接到连续向量空间,在其中进行BO。我们通过语言提示设计和控制映射,提供描述超参数间关系及其各自角色的领域感知文本提示。这使我们能够以自然语言将关于LoRA的领域知识显式注入LLM。我们还引入一个额外的可学习标记,以捕获提示中难以用语言描述的残差信息。这有助于BO采样更多高性能超参数。此外,通过利用LoRA训练机制中从完整数据集和子集训练数据集获得的性能之间观察到的强相关性,我们引入使用数据子集的代理训练和评估。这显著提高了我们方法的效率。我们证明,仅需约30次迭代发现的超参数,相比从约45,000种组合中找到的标准超参数,实现了超过20%的性能提升。项目页面:https://baekseongeun.github.io/lora-bo/

英文摘要

Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) offers a resource-efficient way to personalize or specialize. However, LoRA is highly sensitive to hyperparameter choices, and exhaustive hyperparameter search is computationally expensive. To address this, we propose a Bayesian Optimization (BO) framework that leverages the domain knowledge of pre-trained LLMs to efficiently search for LoRA hyperparameters. Our approach repurposes a pre-trained LLM as a discrete-to-continuous mapping module to link hyperparameters and their domain knowledge to a continuous vector space, where BO is conducted. We design and control the mapping via language prompting, providing a domain-aware textual prompt that describes the relationships among hyperparameters and their respective roles. This allows us to explicitly inject domain knowledge about LoRA into the LLM in natural language. We also introduce an additional learnable token to capture residual information that is difficult to describe linguistically in the prompt. This aids BO to sample more high-performing hyperparameters. In addition, by leveraging the strong correlation observed between the performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation using a data subset. This significantly improves the efficiency of our method. We demonstrate that our hyperparameter, discovered with only about 30 iterations, achieves more than 20% performance improvement over standard hyperparameters found from about 45,000 combinations. Project page: https://baekseongeun.github.io/lora-bo/

2602.11065 2026-05-29 cs.CL cs.AI 版本更新

S-MARC: Causal Streaming Reasoning for Full-Duplex Conversational Behavior Modeling

S-MARC:全双工对话行为建模的因果流式推理

Dingkun Zhou, Shuchang Pan, Jiachen Lian, Siddharth Banerjee, Sarika Pasumarthy, Dhruv Hebbar, Siddhant Patel, Zeyi Austin Li, Kan Jen Cheng, Sanay Bordia, Krish Patel, Akshaj Gupta, Tingle Li, Gopala Anumanchipalli

发表机构 * University of California, Berkeley(加州大学伯克利分校) Zhejiang University(浙江大学) South China University of Technology(华南理工大学)

AI总结 提出S-MARC框架,通过流式因果层次建模意图到动作路径,预测高层交际功能和低层交互行为,并构建高质量语料库,实现全双工对话中的鲁棒行为检测与可解释推理。

AI中文摘要

人类对话由隐式的思维链组织,并表现为时间结构化的对话行为。捕捉这一感知路径对于构建自然的全双工交互系统至关重要。我们提出了S-MARC(对话的流式因果建模与推理),一个用于对话行为建模与推理的流式、因果、层次化框架。通过形式化意图到动作的路径,S-MARC预测高层交际功能和低层交互行为,同时建模它们的因果和时间依赖关系。为支持这一设置,我们构建了一个高质量语料库,将可控、事件丰富的双工对话数据与行为标签配对。S-MARC将流式预测组织成持续演化的图结构,为其决策生成简洁的推理依据,并动态优化其推理过程。在合成和真实双工对话上的实验表明,S-MARC实现了鲁棒的行为检测,产生了可解释的推理链,并为全双工口语对话系统中的对话推理建立了基准基础。

英文摘要

Human conversation is organized by an implicit chain of thought and manifests as temporally structured conversational behaviors. Capturing this perceptual pathway is critical for building natural full-duplex interactive systems. We propose S-MARC (Streaming Causal Modeling and Reasoning for Conversation), a streaming, causal, and hierarchical framework for conversational behavior modeling and reasoning. By formalizing the intent-to-action pathway, S-MARC predicts high-level communicative functions and low-level interaction behaviors while modeling their causal and temporal dependencies. To support this setting, we construct a high-quality corpus that pairs controllable, event-rich duplex dialogue data with behavior labels. S-MARC organizes streaming predictions into a continuously evolving graph structure, generates concise justifications for its decisions, and dynamically optimizes its reasoning process. Experiments on synthetic and real duplex dialogues show that S-MARC achieves robust behavior detection, produces interpretable reasoning chains, and establishes a benchmark foundation for conversational reasoning in full-duplex spoken dialogue systems.

2602.08979 2026-05-29 cs.SD cs.CL 版本更新

Beyond Transcripts: A Renewed Perspective on Audio Chaptering

超越文本:音频章节划分的新视角

Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, Alexander Waibel

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文通过提出音频专用架构AudioSeg、分析影响性能的因素以及形式化评估协议,系统研究了音频章节划分任务,发现AudioSeg显著优于基于文本的方法,停顿是最有效的声学特征,而多模态大模型在短音频上表现有潜力。

Comments Accepted at ACL 2026 (Main Conference)

AI中文摘要

音频章节划分是将长音频分割成连贯部分的任务,对于导航播客、讲座和视频越来越重要。尽管其相关性,研究仍然有限且基于文本,留下了关于利用音频信息、处理ASR错误以及无转录评估的关键问题未解决。我们通过三个贡献来解决这些空白:(1)基于文本模型与声学特征、一种新颖的仅音频架构(AudioSeg,操作于学习到的音频表示)以及多模态大模型的系统比较;(2)影响性能因素的经验分析,包括转录质量、声学特征、持续时间和说话人组成;(3)形式化的评估协议,对比依赖转录的文本空间协议与转录不变的时间空间协议。我们在YTSeg上的实验表明,AudioSeg显著优于基于文本的方法,停顿提供了最大的声学增益,而MLLMs受限于上下文长度和指令遵循能力较弱,但MLLMs在较短的音频上显示出潜力。

英文摘要

Audio chaptering, the task of segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison between text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs; (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and MLLMs remain limited by context length and weak instruction following, yet MLLMs are promising on shorter audio.

2602.08783 2026-05-29 cs.AI cs.CL 版本更新

Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure

潜在思维链内部的动态:因果结构的实证研究

Zirui Li, Xuefeng Bai, Kehai Chen, Yizhi Li, Jian Yang, Chenghua Lin, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) University of Manchester, United Kingdom(曼彻斯特大学) Beihang University, China(北京航空航天大学)

AI总结 通过结构因果模型对潜在思维链进行干预分析,揭示其因果结构、步骤间影响传播及与显式思维链的差异。

Comments Accepted to ICML 2026; 25 pages, 23 figures

AI中文摘要

潜在或连续思维链方法用若干内部潜在步骤替代显式文本推理,但这些中间计算难以通过基于相关性的探针进行评估。本文将潜在思维链视为表示空间中的可操控因果过程,将潜在步骤建模为结构因果模型(SCM)中的变量,并通过逐步do-干预分析其效应。我们研究了两种代表性范式(即Coconut和CODI)在数学和通用推理任务上的表现,以探讨三个关键问题:(1)哪些步骤对正确性具有因果必要性,以及答案何时可早期解码;(2)影响如何在步骤间传播,以及这种结构与显式CoT相比如何;(3)中间轨迹是否保留竞争性答案模式,以及输出级承诺与步骤间表示级承诺的差异。我们发现潜在步骤预算更像分阶段功能而非同质化额外深度,并具有非局部路由特性,同时识别出早期输出偏差与后期表示承诺之间的持续差距。这些结果促使我们采用模式条件化和稳定性感知分析,以及相应的训练/解码目标,作为解释和改进潜在推理系统的更可靠工具。代码见https://github.com/J1mL1/causal-latent-cot。

英文摘要

Latent or continuous chain-of-thought methods replace explicit textual rationales with a number of internal latent steps, but these intermediate computations are difficult to evaluate beyond correlation-based probes. In this paper, we view latent chain-of-thought as a manipulable causal process in representation space by modeling latent steps as variables in a structural causal model (SCM) and analyzing their effects through step-wise do-interventions. We study two representative paradigms (i.e., Coconut and CODI) on both mathematical and general reasoning tasks to investigate three key questions: (1) which steps are causally necessary for correctness and when answers become decodable early; (2) how influence propagates across steps and how this structure compares to explicit CoT; and (3) whether intermediate trajectories retain competing answer modes and how output-level commitment differs from representational commitment across steps. We find that latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing, and we identify a persistent gap between early output bias and late representational commitment. These results motivate mode-conditional and stability-aware analyses, together with corresponding training/decoding objectives, as more reliable tools for interpreting and improving latent reasoning systems. Code is available at https://github.com/J1mL1/causal-latent-cot.

2602.06036 2026-05-29 cs.CL 版本更新

DFlash: Block Diffusion for Flash Speculative Decoding

DFlash:用于快速推测解码的块扩散模型

Jian Chen, Yesheng Liang, Zhijian Liu

发表机构 * Z-Lab

AI总结 提出DFlash框架,利用轻量级块扩散模型并行生成草稿,通过目标模型上下文特征条件化,实现高质量草稿和高接受率,在多种模型和任务上实现超过6倍无损加速,加速比最高可达最先进的推测解码方法EAGLE-3的2.5倍。

Comments Accepted at ICML 2026. Camera-ready version. Code: https://github.com/z-lab/dflash

AI中文摘要

自回归大型语言模型(LLMs)性能强大,但需要固有的顺序解码,导致高推理延迟和低GPU利用率。推测解码通过使用快速草稿模型来缓解这一瓶颈,其输出由目标LLM并行验证;然而,现有方法仍然依赖于自回归草稿生成,这仍然是顺序的,限制了实际加速。扩散LLMs通过实现并行生成提供了一种有希望的替代方案,但当前的扩散模型通常性能不如自回归模型。在本文中,我们介绍了DFlash,一种采用轻量级块扩散模型进行并行草稿生成的推测解码框架。通过在单次前向传播中生成草稿标记,并将草稿模型条件化于从目标模型提取的上下文特征,DFlash实现了高效草稿生成,具有高质量输出和更高的接受率。实验表明,DFlash在多种模型和任务上实现了超过6倍的无损加速,比最先进的推测解码方法EAGLE-3提供高达2.5倍的加速提升。

英文摘要

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
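
The verification step that makes speculative decoding lossless can be sketched for the greedy case as follows. This is the generic accept-the-matching-prefix rule, not DFlash's own implementation; the function name `verify_draft` and the token values are hypothetical, and the target model's per-position greedy choices are assumed to come from one parallel forward pass.

```python
from typing import List

def verify_draft(draft: List[int], target_greedy: List[int]) -> List[int]:
    """Return the tokens emitted in one speculative step under greedy decoding.

    draft:          k tokens proposed by the draft model.
    target_greedy:  k + 1 tokens; target_greedy[i] is the target model's greedy
                    choice given the current prefix plus draft[:i].
    """
    if len(target_greedy) != len(draft) + 1:
        raise ValueError("target_greedy must have exactly one more token than draft")
    accepted = 0
    while accepted < len(draft) and draft[accepted] == target_greedy[accepted]:
        accepted += 1
    # Keep the matching prefix, then append the target's own token at the first
    # mismatch (or the bonus token if every draft token matched).
    return draft[:accepted] + [target_greedy[accepted]]

# Hypothetical token ids: the first two draft tokens match the target's choices.
step = verify_draft([5, 9, 2, 7], [5, 9, 4, 1, 3])
```

Every emitted token equals what the target model would have chosen greedily at that position, so the output sequence is unchanged. The speedup depends on how many draft tokens are accepted per step, which is why higher acceptance rates matter.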

2602.02751 2026-05-29 cs.MA cs.AI cs.CL 版本更新

Scaling Small Agents Through Strategy Auctions

通过策略拍卖扩展小型智能体

Lisa Alazraki, William F. Shen, Yoram Bachrach, Akhil Mathur

发表机构 * Meta Superintelligence Labs(Meta超智能实验室) Imperial College London(伦敦帝国理工学院) University of Cambridge(剑桥大学)

AI总结 针对小型语言模型在复杂任务中性能不足的问题,提出受自由职业市场启发的SALE框架,通过策略拍卖实现任务分配与测试时自我改进,在降低对大型模型依赖和成本的同时提升性能。

Comments ICML 2026

AI中文摘要

小型语言模型越来越被视为一种有前景、成本效益高的智能体AI方法,支持者声称它们对于智能体工作流已经足够有能力。然而,尽管较小的智能体在简单任务上能与较大的智能体紧密匹配,但它们的性能如何随任务复杂性扩展、何时需要大型模型以及如何更好地利用小型智能体处理长期工作负载仍不清楚。在这项工作中,我们通过实验表明,小型智能体的性能在深度搜索和编码任务上无法随任务复杂性扩展,并引入了受自由职业市场启发的SALE(Strategy Auctions for Workload Efficiency)智能体框架。在SALE中,智能体用简短的战略计划进行投标,这些计划通过系统性的成本-价值机制评分,并通过共享的拍卖记忆进行优化,从而无需训练单独的路由器或运行所有模型至完成即可实现每任务路由和持续自我改进。在复杂度不同的深度搜索和编码任务中,SALE将最大智能体的依赖度降低了52%,总成本降低了35%,并且始终优于最大智能体的pass@1,仅增加了可忽略的额外开销(超出执行最终轨迹的部分)。相比之下,依赖任务描述的现有路由器要么表现不如最大智能体,要么未能降低成本,通常两者兼有,凸显了它们对智能体工作流的不适用性。这些结果表明,尽管小型智能体可能不足以处理复杂工作负载,但通过协调的任务分配和测试时自我改进,它们可以有效地“扩展”。更广泛地说,它们激发了对智能体AI的系统级观点,即性能提升更多来自市场启发的协调机制(将异构智能体组织成高效、自适应的生态系统),而非日益庞大的单个模型。

英文摘要

Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are sufficiently capable for agentic workflows. However, while smaller agents can closely match larger ones on simple tasks, it remains unclear how their performance scales with task complexity, when large models become necessary, and how to better leverage small agents for long-horizon workloads. In this work, we empirically show that small agents' performance fails to scale with task complexity on deep search and coding tasks, and we introduce Strategy Auctions for Workload Efficiency (SALE), an agent framework inspired by freelancer marketplaces. In SALE, agents bid with short strategic plans, which are scored by a systematic cost-value mechanism and refined via a shared auction memory, enabling per-task routing and continual self-improvement without training a separate router or running all models to completion. Across deep search and coding tasks of varying complexity, SALE reduces reliance on the largest agent by 52%, lowers overall cost by 35%, and consistently improves upon the largest agent's pass@1 with only a negligible overhead beyond executing the final trace. In contrast, established routers that rely on task descriptions either underperform the largest agent or fail to reduce cost, often both, underscoring their poor fit for agentic workflows. These results suggest that while small agents may be insufficient for complex workloads, they can be effectively "scaled up" through coordinated task allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which performance gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.

2602.02103 2026-05-29 cs.LG cs.CL 版本更新

How Far Ahead Do LLMs Plan? Uncovering the Latent Horizon in Chain-of-Thought Reasoning

LLMs 能提前多远规划?揭示思维链推理中的潜在视界

Liyan Xu, Mo Yu, Fandong Meng, Jie Zhou

发表机构 * WeChat AI, Tencent Inc(腾讯AI实验室)

AI总结 通过探测方法 Tele-Lens 研究 LLMs 在思维链推理中的潜在规划能力,发现其具有短视视界,并基于此提出利用稀疏枢轴位置增强不确定性估计及自动识别 CoT 绕过的假设。

Comments Accepted to ICML 2026

AI中文摘要

思维链推理已成为激发大型语言模型多步推理的核心机制。然而,近期证据呈现一种矛盾:隐藏状态似乎在 CoT 完全展开之前就已经编码了未来的推理,而显式步骤对于需要组合计算的任务仍然至关重要。为了加深对 LLM 内部状态与其言语化推理轨迹之间关系的理解,我们通过探测方法 Tele-Lens 研究了 LLMs 的潜在规划强度,该方法应用于跨不同任务领域的隐藏状态。我们的实证结果表明,LLMs 表现出短视视界,主要进行增量转换,而没有精确的全局规划。利用这一特性,我们提出了一个增强 CoT 不确定性估计的假设,并通过实验验证了一组稀疏的枢轴位置可以有效地代表整个路径的不确定性。我们进一步强调了利用 CoT 动态的重要性,并证明了可以在不降低性能的情况下实现 CoT 绕过的自动识别。我们的代码、数据和模型发布于 https://github.com/lxucs/tele-lens。

英文摘要

Chain-of-thought (CoT) reasoning has become a central mechanism for eliciting multi-step reasoning in Large Language Models (LLMs). Yet recent evidence presents a tension: hidden states appear to already encode future reasoning before CoT fully unfolds, while explicit steps still remain crucial for tasks requiring compositional computation. To deepen the understanding between LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs, through our probing method, Tele-Lens, applying to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis on enhancing uncertainty estimation of CoT, which we validate that a sparse set of pivot positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation. Our code, data and models are released at https://github.com/lxucs/tele-lens.

2601.21909 2026-05-29 cs.AI cs.CL 版本更新

From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

从元思维到执行:面向通用且可靠的大语言模型推理的认知对齐后训练

Shaojie Wang, Liang Zhang

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出一种认知启发的两阶段后训练框架,通过元思维链监督学习通用策略和置信度校准强化学习优化执行可靠性,在分布内和分布外分别提升2.10%和3.86%。

AI中文摘要

当前的大语言模型后训练方法通过监督微调(SFT)后接基于结果的强化学习(RL)来优化完整的推理轨迹。虽然有效,但仔细审视发现一个根本差距:这种方法与人类实际解决问题的方式不一致。人类认知自然地将问题解决分解为两个不同的阶段:首先获取跨问题泛化的抽象策略(即元知识),然后将其适应到具体实例。相比之下,通过将完整轨迹视为基本单元,当前方法本质上是问题中心的,将抽象策略与问题特定的执行纠缠在一起。为了解决这种错位,我们提出了一个认知启发的框架,明确地模仿人类认知的两阶段过程。具体而言,元思维链(CoMT)将监督学习聚焦于抽象推理模式而不涉及具体执行,从而能够获取可泛化的策略。然后,置信度校准强化学习(CCRL)通过中间步骤上的置信度感知奖励来优化任务适应,防止过度自信的错误级联并提高执行可靠性。在四个模型和十个基准上的实验表明,与标准方法相比,分布内和分布外分别提升了2.10%和3.86%,同时对教师模型选择、优化方法和符号扰动的变化保持高度鲁棒。

英文摘要

Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively-inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and ten benchmarks show 2.10% and 3.86% improvements in-distribution and out-of-distribution respectively over standard methods, while remaining highly robust to variations in teacher model selection, optimization methods, and symbolic perturbations.

2601.18395 2026-05-29 cs.CL 版本更新

Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction

不要贪婪,三思而后行:文档级信息抽取的采样与选择

Mikel Zubillaga, Oscar Sainz, Oier Lopez de Lacalle, Eneko Agirre

发表机构 * HiTZ Center - Ixa, University of the Basque Country UPV/EHU(希茨中心 - Ixa,巴斯克国家大学UPV/EHU)

AI总结 提出ThinkTwice框架,通过采样生成多个候选模板并选择最优,利用无监督一致性和有监督奖励模型,在文档级信息抽取中超越贪婪解码方法。

Comments Submitted to EMNLP 2026

AI中文摘要

文档级信息抽取(DocIE)旨在生成包含给定文档中出现的实体、关系和事件的输出模板。标准做法包括使用贪婪解码提示仅解码器的大语言模型以避免输出变异性。我们没有将这种变异性视为限制,而是表明采样可以产生比贪婪解码更好的解决方案,尤其是在使用推理模型时。因此,我们提出了ThinkTwice,一个采样和选择框架,其中大语言模型为给定文档生成多个候选模板,然后一个选择模块选择最合适的模板。我们引入了一种利用生成输出之间一致性的无监督方法,以及一种使用在标记DocIE数据上训练的奖励模型的有监督选择方法。为了解决DocIE中黄金推理轨迹的稀缺性,我们提出了一种基于拒绝采样的方法来生成将输出模板与推理轨迹配对的银训练数据。我们的实验证明了无监督和有监督ThinkTwice的有效性,始终优于贪婪基线和有监督的最先进方法。

英文摘要

Document-level Information Extraction (DocIE) aims to produce an output template with the entities, relations, and events of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a rejection-sampling-based method to generate silver training data that pairs output templates with reasoning traces. Our experiments show the validity of unsupervised and supervised ThinkTwice, consistently outperforming greedy baselines and the supervised state-of-the-art.
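
Agreement-based selection among sampled outputs can be sketched as picking the candidate that overlaps most with the other samples. The encoding of a template as a set of (slot, value) pairs, the use of F1 as the agreement measure, and the names `f1` and `select_by_agreement` are all assumptions for illustration; the abstract does not specify how ThinkTwice measures agreement.

```python
from typing import FrozenSet, List, Tuple

Template = FrozenSet[Tuple[str, str]]  # hypothetical encoding: (slot, value) pairs

def f1(a: Template, b: Template) -> float:
    """F1 overlap between two templates; two empty templates count as identical."""
    if not a and not b:
        return 1.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)

def select_by_agreement(candidates: List[Template]) -> int:
    """Index of the candidate with the highest mean F1 against the other samples."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return 0

    def mean_agreement(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(f1(candidates[i], c) for c in others) / len(others)

    # Ties are broken in favour of the earliest candidate.
    return max(range(len(candidates)), key=mean_agreement)

# Three hypothetical samples for one document; two of them agree exactly.
samples = [
    frozenset({("company", "Acme"), ("ceo", "Ada")}),
    frozenset({("company", "Acme"), ("ceo", "Ada")}),
    frozenset({("company", "Zeta")}),
]
best = select_by_agreement(samples)
```

This needs no labeled data, which matches the unsupervised variant described above. A supervised variant would replace `mean_agreement` with a learned reward model's score.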

2601.14758 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models

从自回归到掩码扩散语言模型的后训练中的机制转变

Injin Kong, Hyoungjoon Lee, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院) Department of Biosystems & Biomaterials Science and Engineering, Seoul National University(首尔国立大学生物系统与生物材料科学与工程系)

AI总结 通过比较电路分析,发现后训练得到的掩码扩散模型在结构上根据任务保留或重组自回归电路,在语义上从局部专业化转向分布式整合,表明扩散后训练是内部计算的深度重组。

AI中文摘要

将预训练的自回归模型(ARMs)后训练为掩码扩散模型(MDMs)已成为一种克服顺序生成局限性的经济有效方法。然而,后训练的MDMs是否获得了真正的新计算机制,还是仅仅以非自回归形式重新表达了自回归计算,仍不清楚。通过对ARMs及其从相同骨干网络后训练得到的MDM对应物进行电路比较分析,我们揭示了两个互补的重组轴。在结构上,转变是任务依赖的:MDMs在局部因果任务上保留自回归电路,但在全局任务上放弃继承的路径并将计算前置到早期层。在语义上,转变在不同机制间是一致的:ARMs中尖锐的局部专业化让位于MDMs中的分布式整合。这些发现共同表明,扩散后训练并非生成过程的表面变化,而是内部计算的重组,其深度取决于任务。

英文摘要

Post-training pretrained autoregressive models (ARMs) into masked diffusion models (MDMs) has emerged as a cost-effective way to overcome the limitations of sequential generation. Yet it remains unclear whether post-trained MDMs acquire genuinely new computational mechanisms or merely re-express autoregressive computation in a non-autoregressive form. Through a comparative circuit analysis of ARMs and their MDM counterparts post-trained from the same backbones, we uncover two complementary axes of reorganization. Structurally, the shift is task-dependent: MDMs preserve autoregressive circuitry on locally causal tasks but abandon inherited pathways and front-load computation into early layers on global tasks. Semantically, the shift is consistent across regimes: sharp, localized specialization in ARMs gives way to distributed integration in MDMs. Together, these findings show that diffusion post-training is not a surface-level change in the generation procedure but a reorganization of internal computation whose depth depends on the task.

2601.13111 2026-05-29 cs.CL cs.AI cs.IR 版本更新

CORE-T: COherent REtrieval of Tables for Text-to-SQL

CORE-T: 面向文本到SQL的表格连贯检索

Hassan Soliman, Vivek Gupta, Dan Roth, Iryna Gurevych

发表机构 * Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science TU Darmstadt and National Research Center for Applied Cybersecurity ATHENE, Germany(普适知识处理实验室(UKP实验室),计算机科学系 TU Darmstadt 和应用网络安全国家研究中心 ATHENE,德国) Arizona State University(亚利桑那州立大学) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出CORE-T框架,通过LLM生成元数据和预计算兼容性缓存,在无需训练的情况下从异构表集合中高效检索连贯可连接的表集合,提升表选择F1最多22.7点并减少40%的表数量。

Comments Preprint is revised and under review. Code and data available at: https://github.com/UKPLab/arxiv2026-core-t

AI中文摘要

现实中的文本到SQL工作流通常需要连接多个表格。因此,准确检索相关表集合成为端到端性能的关键瓶颈。我们研究一种开放书设置,其中查询必须从多个来源汇集的大规模异构表集合中回答,且没有数据库标识符等清晰的限定信号。在此设置下,密集检索(DR)实现了高召回率但返回大量干扰项,而考虑连接的方法通常依赖额外假设和/或产生高推理开销。我们提出CORE-T,一个可扩展、无需训练的框架,通过LLM生成的用途元数据丰富表格,并预计算轻量级表兼容性缓存。推理时,DR返回前K个候选;单次LLM调用选择一个连贯、可连接的子集,然后两步加法调整阶段恢复强兼容的表。在Bird、Spider、MMQA和Beaver上,CORE-T在表选择F1上比DR提升最多22.7点,同时返回的表减少最多40%,在多表执行准确率上提升最多24.4点,并且使用的总选择token比LLM密集型基线少1.64-4.20倍。

英文摘要

Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end-to-end performance. We study an open-book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join-aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE-T, a scalable, training-free framework that enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache. At inference time, DR returns top-K candidates; a single LLM call selects a coherent, joinable subset, and a two-step additive adjustment stage restores strongly compatible tables. Across Bird, Spider, MMQA, and Beaver, CORE-T improves over DR by up to 22.7 points in table-selection F1 while returning up to 40% fewer tables, and by up to 24.4 points in multi-table execution accuracy, and uses 1.64-4.20x fewer total selection tokens than LLM-intensive baselines.

2601.11178 2026-05-29 cs.AI cs.CL cs.MM cs.SI 版本更新

TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

TANDEM: 面向多模态仇恨言论的时间感知神经检测

Girish A. Koushik, Helen Treharne, Diptesh Kanojia

发表机构 * Nature-Inspired Computing & Engineering, University of Surrey(萨里大学自然启发计算与工程系) Surrey Centre for Cyber Security, University of Surrey(萨里大学网络安全中心)

AI总结 提出TANDEM统一框架,通过串联强化学习策略联合优化视觉-语言和音频-语言模型,将音频-视觉仇恨检测转化为结构化推理问题,在HateMM上目标识别F1达0.73(提升30%),并保持精确时间定位。

Comments Under review at ICWSM 2027

AI中文摘要

社交媒体平台日益被长篇多模态内容主导,其中有害叙事通过音频、视觉和文本线索的复杂交互构建。虽然自动化系统能以高准确率标记仇恨言论,但它们通常作为“黑箱”运作,无法提供细粒度、可解释的证据(如精确时间戳和目标身份),而这对于有效的人机协同审核是必需的。在这项工作中,我们提出了TANDEM,一个统一框架,将音频-视觉仇恨检测从二元分类任务转化为结构化推理问题。我们的方法采用一种新颖的串联强化学习策略,其中视觉-语言和音频-语言模型通过自约束跨模态上下文相互优化,在无需密集帧级监督的情况下,稳定地推理长时序列。在三个基准数据集上的实验表明,TANDEM显著优于零样本和上下文增强基线,在HateMM上目标识别F1达到0.73(比现有最佳方法提升30%),同时保持精确的时间定位。我们进一步观察到,虽然二元检测是鲁棒的,但由于固有的标签模糊性和数据集不平衡,在多类设置中区分攻击性和仇恨性内容仍然具有挑战性。更广泛地说,我们的发现表明,即使在复杂的多模态环境中,结构化、可解释的对齐也是可实现的,为下一代透明且可操作的在线安全审核工具提供了蓝图。

英文摘要

Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.

2601.04765 2026-05-29 cs.CL cs.AI cs.LG physics.comp-ph 版本更新

Differential syntactic and semantic encoding in LLMs

大型语言模型中句法与语义的差异编码

Santiago Acevedo, Alessandro Laio, Marco Baroni

发表机构 * Catalan Institute of Research and Advanced Studies (ICREA) and Universitat Pompeu Fabra (UPF)(加泰罗尼亚研究与高级科学研究所(ICREA)和庞培法布拉大学(UPF))

AI总结 本研究通过平均共享句法结构或语义的句子隐藏表示向量,发现大型语言模型(以DeepSeek-V3为例)的内部层表示中句法和语义信息至少部分线性编码,且两者编码轮廓不同,可一定程度解耦。

Comments Published as conference paper at ICML 2026

详情
AI中文摘要

我们研究了句法和语义信息如何在大型语言模型(LLMs)的内部层表示中编码,重点关注非常大的DeepSeek-V3。我们发现,通过平均共享句法结构或语义的句子的隐藏表示向量,我们得到了能够捕获表示中相当大比例的句法和语义信息的向量。特别是,从句子向量中减去这些句法和语义“质心”会强烈影响它们与句法和语义匹配句子的相似性,这表明句法和语义至少部分地线性编码。我们还发现句法和语义的跨层编码轮廓不同,并且这两种信号可以在一定程度上解耦,这表明LLM表示中这两种语言信息的差异编码。

英文摘要

We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic "centroids" from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.
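The centroid-subtraction idea can be sketched on toy data. This is not the authors' code: the random vectors stand in for hidden representations, and the assumption that matched sentences share one common direction plus small noise is made purely for illustration.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
dim = 64

# Toy assumption: sentences with the same structure share one direction.
shared = rng.normal(size=dim)
group = shared + 0.1 * rng.normal(size=(20, dim))

probe, match = group[0], group[1]
# Centroid estimated from the remaining sentences of the group.
centroid = group[2:].mean(axis=0)

sim_before = cosine(probe, match)
sim_after = cosine(probe - centroid, match - centroid)
print(f"before: {sim_before:.3f}  after: {sim_after:.3f}")
```

In this toy setting the similarity between matched sentences drops sharply once the centroid is removed, which is the kind of effect the abstract reads as evidence of (at least partially) linear encoding.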

2601.00065 2026-05-29 cs.LG cs.CL cs.CR 版本更新

When the Same Coefficients Reach Different Places: Asymmetric Realizability in Transplanting Tokenizers across Large Language Models

当相同系数到达不同位置:跨大型语言模型移植分词器中的非对称可实现性

Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao

发表机构 * Purdue University(普渡大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文发现跨词汇模型组合中分词器移植的几何结构非对称性,并构造了“破坏令牌”以利用该漏洞,通过实验验证其在多个模型对中的存在性及对微调、谱滤波等防御措施的鲁棒性。

详情
AI中文摘要

跨词汇模型组合中的分词器移植将仅存在于捐赠者的嵌入行重构为基于共享词汇锚点的加权组合,并在基础模型上重用这些系数。我们识别出这种重构的一个结构几何特性:相同的系数向量在捐赠者和基础锚点跨度中到达不同的集合,即一个非对称可实现性差距。在OMP下的65个捐赠者-基础对中,通过CLP、WECHSEL和FOCUS的跨算子验证,我们构造了破坏令牌:在捐赠者锚点跨度中保持统计惰性,同时在基础中产生高显著性重构的单一系数向量。相同的Gemma-2-2B捐赠者检查点允许针对来自五个模型家族的13个不同下游基础进行此构造。植入的方向在与干净参考模型进行权重合并后保持不变。在部署者案例研究中,标准LoRA微调主要抑制了其提示分布与训练语料匹配的破坏者,并且在我们设置中不足以缓解此类攻击家族。测试的谱滤波器未能捕捉到非对称性。我们讨论了在开放权重组合供应链中的潜在滥用。

英文摘要

Tokenizer transplant in cross-vocabulary model composition reconstructs donor-only embedding rows as weighted combinations over shared lexical anchors and reuses those coefficients on the base. We identify a structural geometric property of this reconstruction: the same coefficient vector reaches different sets in the donor and base anchor spans, an asymmetric realizability gap. Across 65 donor-base pairs under OMP, with cross-operator validation on CLP, WECHSEL, and FOCUS, we construct breaker tokens: single coefficient vectors that remain statistically inert in the donor anchor span while producing a high-salience reconstruction in the base. The same Gemma-2-2B donor checkpoint admits this construction against 13 different downstream bases drawn from five model families. The planted direction passes weight-merging with a clean reference unchanged. In a deployer case study, standard LoRA fine-tuning suppresses the breaker primarily on prompts whose distribution matches the training corpus and is not a sufficient mitigation against this attack family in our setting. The tested spectral filters miss the asymmetry. We discuss potential misuse in the open-weight composition supply chain.

2512.14754 2026-05-29 cs.SE cs.AI cs.CL 版本更新

Revisiting the Reliability of Language Models in Instruction-Following

重新审视指令跟随中语言模型的可靠性

Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei, Chao Zhang, Han Qiu

发表机构 * Tsinghua University(清华大学) Ant Group(蚂蚁集团)

AI总结 本文提出可靠@k指标和自动生成相似提示的流水线,构建IFEval++基准,发现当前模型在细微差异提示下性能下降高达61.8%,并探索了三种改进方法。

Comments ACL 2026 main oral

详情
AI中文摘要

先进的LLM在IFEval等基准测试中已达到接近上限的指令跟随准确率。然而,这些令人印象深刻的分数并不一定能转化为实际使用中的可靠服务,因为用户经常改变他们的措辞、上下文框架和任务表述。在本文中,我们研究面向细微差异的可靠性:模型是否在传达类似用户意图但具有细微差异的相似提示中表现出一致的能力。为了量化这一点,我们引入了一个新的指标,可靠@k,并开发了一个自动化流水线,通过数据增强生成高质量的相似提示。在此基础上,我们构建了IFEval++用于系统评估。在20个专有和26个开源LLM中,我们发现当前模型在面向细微差异的可靠性方面存在显著不足——它们的性能在细微提示修改下可能下降高达61.8%。此外,我们对其进行了表征,并探索了三种潜在的改进方法。我们的发现强调了面向细微差异的可靠性是朝着更可靠和可信的LLM行为迈出的关键但尚未充分探索的下一步。我们的代码和基准可访问:https://github.com/jianshuod/IFEval-pp。

英文摘要

Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt modifications. Moreover, we characterize this shortfall and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are accessible: https://github.com/jianshuod/IFEval-pp.
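The abstract does not spell out how reliable@k is computed. The sketch below assumes one plausible reading: a family of cousin prompts counts as reliable only if the model passes all of its first k prompts. That definition, the function names, and the pass/fail data are assumptions for illustration, not the paper's specification.

```python
def accuracy(results):
    """Fraction of individual prompts passed, pooled over all families."""
    flat = [ok for family in results for ok in family]
    return sum(flat) / len(flat)


def reliable_at_k(results, k):
    """Fraction of families whose first k prompts ALL pass (assumed definition)."""
    if any(len(family) < k for family in results):
        raise ValueError("every family needs at least k prompts")
    return sum(all(family[:k]) for family in results) / len(results)


# Hypothetical pass/fail outcomes: one row per family of cousin prompts.
results = [
    [True, True, True],
    [True, False, True],
    [True, True, False],
    [False, False, False],
]

acc = accuracy(results)
r1 = reliable_at_k(results, 1)
r3 = reliable_at_k(results, 3)
print(acc, r1, r3)
```

Under this reading, a model can look strong on a single prompt per family yet score much lower once every cousin prompt in the family has to pass.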

2511.08949 2026-05-29 cs.CL 版本更新

EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI

EVADE:基于LLM的解释生成与验证用于NLI错误检测

Longfei Zuo, Barbara Plank, Siyao Peng

发表机构 * Technical University of Munich(慕尼黑技术大学) MaiNLP, Center for Information and Language Processing, LMU Munich(MaiNLP,信息与语言处理中心,慕尼黑大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML))

AI总结 提出EVADE框架,利用大语言模型生成和验证解释以检测NLI数据集中的标注错误,实验表明LLM验证能减少人力并提升微调性能。

详情
AI中文摘要

高质量数据集对于训练和评估可靠的NLP模型至关重要。在自然语言推理(NLI)等任务中,当同一实例有多个有效标签时,会出现人类标签变异(HLV),这使得难以区分标注错误和合理的变异。先前的框架VARIERR(Weber-Genzel等人,2024)在第一轮要求多位标注者解释其标签决策,并在第二轮通过有效性判断标记错误。然而,进行两轮人工标注成本高昂,且可能限制合理标签或解释的覆盖范围。我们的研究提出了一个新框架EVADE,用于使用大语言模型(LLM)生成和验证解释以检测错误。我们进行了全面分析,比较了人类和LLM检测的NLI错误,涉及分布比较、验证重叠以及对模型微调的影响。实验表明,LLM验证能优化生成的解释分布,使其更接近人类标注,并且从训练数据中移除LLM检测的错误比移除人类标注者识别的错误更能提升微调性能。这凸显了在标签变异下扩展错误检测、减少人工努力同时提高数据集质量的潜力。

英文摘要

High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework, VARIERR (Weber-Genzel et al., 2024), asks multiple annotators to explain their label decisions in the first round and flags errors through validity judgments in the second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields larger improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.

2510.22437 2026-05-29 cs.AI cs.CL 版本更新

Modeling Hierarchical Thinking in Large Reasoning Models

大型推理模型中的层次化思维建模

G M Shahariar, Erfan Shayegani, Ali Nazari, Nael Abu-Ghazaleh

发表机构 * University of California, Riverside(加州大学河滨分校) Independent Researcher(独立研究者)

AI总结 本文提出将大型推理模型(LRM)的层次化推理动态近似为有限状态机(FSM)中的轨迹,并通过Q值引导的推理时控制方法实现高效推理优化。

Comments Accepted in ICML 2026 as Oral

详情
AI中文摘要

大型推理模型(LRM)通过生成长链思维(CoT)序列来解决复杂任务;然而,控制推理轨迹的涌现动态尚未被充分理解,可能导致不一致性和推理病态。在这项工作中,我们提出将LRM的涌现层次化推理动态近似为有限状态机(FSM)中的轨迹,该状态机在六个抽象认知状态之间转换。我们证明这些状态和转换可以在模型的潜在状态中捕获。我们相信这种表示在LRM模型的可解释性和优化中具有不同的应用。例如,通过分析这些转换的拓扑结构,我们识别出推理策略中的统计变化,有助于从失败的推理链中识别出有效的推理链。为了说明这些潜在优势,我们提出了Q值引导转向,一种无需训练的推理时控制方法,将推理视为规划问题。我们估计状态转换的长期效用,并在句子边界处应用稀疏、正交的激活转向,使CoT生成与最优推理策略对齐。使用三个最先进的开源推理模型在四个基准测试(AIME25、MATH-500、GSM8k和GPQA Diamond)上的实验表明,Q值转向策略以“外科手术式”的效率实现了显著的性能提升,通常需要的干预次数比贪婪和加权基线少25倍,这表明通过引导高层认知动态而非微观管理令牌生成,可以有效地控制推理。代码可在 https://github.com/shahariar-shibli/CoT-FSM 获取。

英文摘要

Large Reasoning Models (LRMs) solve complex tasks by generating long Chain-of-Thought (CoT) sequences; however, the emergent dynamics governing reasoning trajectories are not well understood and can lead to inconsistencies and reasoning pathologies. In this work, we propose to approximate the emergent hierarchical reasoning dynamics of LRMs as a trajectory within a Finite State Machine (FSM) transitioning among six abstract cognitive states. We demonstrate that these states and transitions can be captured in the latent state of the model. We believe that this representation can have different applications in the interpretability and optimization of LRMs. For example, by analyzing the topology of these transitions, we identify statistical shifts in reasoning strategies that help distinguish effective reasoning chains from those that fail. To illustrate these potential advantages, we propose Q-Value guided steering, a training-free inference-time control method that treats reasoning as a planning problem. We estimate the long-horizon utility of state transitions and apply sparse, orthogonal activation steering at sentence boundaries to align the CoT generation with optimal reasoning policies. Experiments across four benchmarks (AIME25, MATH-500, GSM8k, and GPQA Diamond) using three state-of-the-art open reasoning models demonstrate that the Q-Value steering policy achieves significant performance gains with "surgical" efficiency, often requiring 25 times fewer interventions than greedy and weighted baselines, which suggests that reasoning can be effectively controlled by guiding high-level cognitive dynamics rather than micro-managing token generation. Code is available at: https://github.com/shahariar-shibli/CoT-FSM.

2510.04704 2026-05-29 cond-mat.mtrl-sci cs.AI cs.CL 版本更新

AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials

AtomWorld: 评估大型语言模型在晶体材料空间推理能力的基准

Taoyuze Lv, Alexander Chen, Fengyu Xie, Chu Wu, Jeffrey Meng, Dongzhan Zhou, Yingheng Wang, Bram Hoex, Zhicheng Zhong, Tong Xie

发表机构 * University of New South Wales, NSW, Sydney, Australia(新南威尔士大学,新州,悉尼,澳大利亚) Suzhou Institute for Advanced Research, University of Science(苏州先进研究院,科学大学) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室,上海,中国) Cornell University(康奈尔大学)

AI总结 提出AtomWorld基准,通过十种基本原子结构操作评估LLM在材料科学中的空间推理能力,发现Claude Opus 4.6表现最佳但复杂空间关系操作成功率低,表明LLM更适合作为辅助工具而非完全自主的科研代理。

详情
AI中文摘要

大型语言模型(LLMs)在科学研究中展现出巨大潜力,能够执行从知识检索到属性预测等任务。现有的科学基准主要关注感知或基于知识的任务,在很大程度上忽略了建模任务,而建模是任何真实科学研究的基本起点。对于材料科学而言,构建和操作原子结构是最具创造性和自动化程度最低的步骤之一。在这项工作中,我们引入了AtomWorld,这是一个旨在评估LLMs在结构修改方面能力的基准。该基准包括四种广泛使用的建模类别下的十种基本操作,并提供了可验证的评估指标。我们发现Claude Opus 4.6总体上表现最佳。随着建模复杂性的增加,成功率显著下降,特别是涉及复杂空间关系的操作(旋转成功率低于12%)。我们的结果表明,当代LLMs更适合作为材料结构建模的副驾驶,而非完全无监督的自主科学代理。除了评估之外,AtomWorld还作为未来开发结构感知模型(包括强化学习和基于代理的方法)的测试平台和实验场。

英文摘要

Large language models (LLMs) have shown promising potential in scientific research, enabling tasks ranging from knowledge retrieval to property prediction. Existing science benchmarks mainly focus on perceptual or knowledge-based tasks, largely ignoring modelling tasks, a fundamental starting point for any real scientific research. For materials science, constructing and manipulating atomic structures is one of the most creative and least automated steps. In this work, we introduce AtomWorld, a benchmark designed to evaluate the abilities of LLMs on structure modifications. The benchmark includes ten fundamental actions under four widely used modelling categories, enabling verifiable evaluation metrics. We find that Claude Opus 4.6 generally performs the best. However, the success rate decreases markedly with increasing modelling complexity, and is particularly low (below 12% for rotation) for operations involving complex spatial relations. Our results suggest that contemporary LLMs are better suited as copilots for materials structure modelling rather than fully unsupervised autonomous scientific agents. Beyond evaluation, AtomWorld also serves as a testbed and playground for developing future structure-aware models, including reinforcement learning and agentic approaches.

2508.19282 2026-05-29 cs.CL cs.AI 版本更新

Less Is More: Elevating RAG via Performance-Driven Context Compression

少即是多:通过性能驱动的上下文压缩提升RAG

Ziqiang Cui, Yunpeng Weng, Xing Tang, Peiyang Liu, Shiwei Li, Bowei He, Jiamin Chen, Yansen Zhang, Xiuqiang He, Chen Ma

发表机构 * City University of Hong Kong, Hong Kong SAR, China(香港城市大学) Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE(穆罕默德·本·扎耶德人工智能大学,阿布扎比,阿联酋) Huazhong University of Science and Technology(华中科技大学) Peking University, Beijing, China(北京大学) Shenzhen Technology University, Shenzhen, China(深圳技术大学)

AI总结 提出CORE-RAG框架,利用任务性能作为反馈信号迭代优化压缩策略,在3%压缩率下平均精确匹配得分提升3.3点。

Comments Accepted by ICML 2026

详情
AI中文摘要

检索增强生成(RAG)已成为改善知识更新时效性和大型语言模型事实准确性的有前景范式。然而,纳入大量检索文档显著增加输入长度,导致计算成本过高。现有压缩方法通常因依赖预定义启发式规则而损害任务性能。这些启发式规则无法确保压缩后的上下文有利于生成任务。为解决这些限制,我们提出CORE-RAG,一种用于RAG系统中上下文压缩的新颖框架。CORE通过性能驱动的学习框架消除对代理启发式规则的依赖,直接利用任务性能作为反馈信号迭代优化压缩器策略。在此优化过程之前,我们引入知识蒸馏阶段,用稳健策略初始化压缩器。大量实验证明了我们方法的优越性。在3%的高压缩比下,CORE不仅避免了性能下降,而且与使用完整文档相比,平均精确匹配(EM)得分提高了3.3分。我们的代码可在https://github.com/ziqiangcui/CORE-RAG-ICML26获取。

英文摘要

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for improving the timeliness of knowledge updates and the factual accuracy of large language models. However, incorporating a large volume of retrieved documents significantly increases input length, leading to prohibitive computational costs. Existing compression approaches often compromise task performance, primarily due to their reliance on predefined heuristics. These heuristics fail to ensure that the compressed context is conducive to the generation tasks. To address these limitations, we propose CORE-RAG, a novel framework for context compression in RAG systems. CORE eliminates reliance on proxy heuristics through a performance-driven learning framework, which directly utilizes task performance as a feedback signal to iteratively refine the compressor policy. Prior to this optimization process, we incorporate a knowledge distillation phase to initialize the compressor with a robust policy. Extensive experiments demonstrate the superiority of our approach. At a high compression ratio of 3%, CORE not only avoids performance degradation but also improves the average Exact Match (EM) score by 3.3 points compared to using full documents. Our code is available at https://github.com/ziqiangcui/CORE-RAG-ICML26.

2508.19202 2026-05-29 cs.CL 版本更新

Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

通过探针知识和推理揭示LLMs中的科学问题解决

Alan Li, Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan

发表机构 * Department of Computer Science, Yale University Department of Molecular & Cellular Biology, Harvard University Allen Institute for AI Department of Computer Science, Northwestern University

AI总结 本文提出SciReas基准和KRUX探针框架,系统评估LLMs在科学推理中的知识与推理角色,发现知识检索是主要瓶颈,外部上下文知识和推理增强均能提升性能。

Comments 33 pages, 18 figures

详情
Journal ref
ICML 2026 Main Conference
AI中文摘要

科学问题解决对LLMs提出了独特挑战,需要深厚的领域知识和通过复杂推理应用这些知识的能力。尽管自动化科学推理器在协助人类科学家方面具有巨大潜力,但目前尚无广泛采用的全面基准来评估科学推理,也很少有方法系统地梳理知识和推理在这些任务中的不同作用。为弥补这些空白,我们引入了SciReas,一个用于科学推理任务的多样化现有基准套件,以及SciReas-Pro,一个需要更复杂推理的选择性子集。我们的全面评估揭示了在单独依赖单个基准时隐藏的科学推理性能洞察。然后,我们提出了KRUX,一个用于研究推理和知识在科学任务中不同作用的探针框架。结合两者,我们进行了深入分析,得出几个关键发现:(1)从模型参数中检索任务相关知识是LLMs在科学推理中的关键瓶颈;(2)推理模型始终受益于在推理增强之上添加上下文中的外部知识;(3)增强言语化推理提高了LLMs浮现任务相关知识的能力。

英文摘要

Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge.

2502.07623 2026-05-29 cs.CL 版本更新

Lexical categories of stem-forming roots in Mapudüngun verb forms

Mapudüngun动词形式中词干形成根的词汇类别

Andrés Chandía

发表机构 * Department of Catalan Philology and General Linguistics University of Barcelona(加泰罗尼亚语言学与一般语言学系 巴塞罗那大学)

AI总结 本研究验证并修正了Mapuche语言形态分析系统中动词根的词汇类别分类,以改进计算分析器并澄清该语言词汇类别的模糊性。

Comments 36 pages, 2 large tables, 2 sample tables

详情
AI中文摘要

在开发了Mapuche语言的形态分析计算系统,并用不同作者和风格的文本进行评估后,有必要验证作为该工具实现基础的参考资料中的语言学假设。本文主要关注在用于开发形态分析系统的参考资料中被识别为动词性的Mapudüngun词根的词汇类别分类。词汇类别修订的结果直接有益于计算分析器,因为一旦验证就会实施。此外,希望这些结果有助于澄清Mapuche语言中关于词汇类别的一些不确定性。本文处理了一项初步任务,以识别真正动词根的配价,其结果将在后续工作中呈现,作为本文的补充。

英文摘要

After developing a computational system for morphological analysis of the Mapuche language, and evaluating it with texts from various authors and styles, it became necessary to verify the linguistic assumptions of the source used as the basis for implementing this tool. In the present work, the primary focus is on the lexical category classification of Mapudüngun roots recognised as verbal in the source utilised for the development of the morphological analysis system. The results of this lexical category revision directly benefit the computational analyser, as they are implemented as soon as they are verified. Additionally, it is hoped that these results will help clarify some uncertainties about lexical categories in the Mapuche language. This work addresses a preliminary task to identify the valency of true verbal roots, the results of which will be presented in a subsequent work that complements this article.

2502.03805 2026-05-29 cs.CL 版本更新

CriticalKV: Optimizing KV Cache Eviction from an Output Perturbation Perspective

CriticalKV: 从输出扰动角度优化 KV 缓存淘汰

Yuan Feng, Junlin Lv, Haoyu Guo, Yukun Cao, S Kevin Zhou, Xike Xie

发表机构 * School of Computer Science, University of Science and Technology of China(中国科学技术大学计算机科学学院) School of Biomedical Engineering, USTC(中国科学技术大学生物医学工程学院) Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research(苏州先进研究院数据黑暗实验室、奇迹中心) School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院)

AI总结 本文通过分析注意力输出扰动,提出一种基于扰动约束的 KV 缓存条目选择算法,显著降低压缩损失。

Comments ICML 2026

详情
AI中文摘要

大型语言模型彻底改变了自然语言处理,但由于 Transformer 架构对自注意力的依赖,特别是长序列推理中的大型 KV 缓存,面临着高存储和运行时成本的重大挑战。最近通过基于注意力权重剪枝不太重要的条目来减小 KV 缓存大小的努力仍然是经验性的,缺乏形式化基础。本文通过分析注意力输出扰动,对识别关键 KV 缓存条目进行了形式化研究。我们的分析表明,除了注意力权重之外,KV 条目中的值状态和预训练参数矩阵也至关重要。基于此,我们提出了一种扰动约束选择算法,该算法优化最坏情况下的输出扰动以识别关键条目。我们证明了我们的算法是一种通用的、即插即用的增强方法,且计算开销可忽略不计。当与三种最先进的缓存淘汰方法集成在三个不同的 LLM 上时,我们的算法在来自 Ruler 和 LongBench 基准测试的 29 个数据集上,平均将压缩损失减少了超过一半。进一步的头部和层级的扰动分析证实了我们有效性背后的原理。这项工作为缓存淘汰提供了新的、形式化的视角,为未来的研究开辟了有希望的途径。代码公开在 https://github.com/FFY0/DefensiveKV。

英文摘要

Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large KV cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. We demonstrate that our algorithm is a universal, plug-and-play enhancement that incurs negligible computational overhead. When integrated with three state-of-the-art cache eviction methods on three distinct LLMs, our algorithm significantly reduces the compression loss by more than half on average across 29 datasets from the Ruler and LongBench benchmarks. Further perturbation analysis, at both the head and layer levels, confirms the principles underlying our effectiveness. This work offers a new, formally grounded perspective on cache eviction, opening promising avenues for future research. The code is publicly available at https://github.com/FFY0/DefensiveKV.
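The point that value states matter alongside attention weights can be illustrated with a toy single-query example. This is a simplified sketch, not the paper's perturbation-constrained algorithm (which optimizes a worst-case bound and also involves the pretrained parameter matrices); the scoring rule "attention weight times value norm" and all names here are assumptions for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def attention_output(weights, values, keep):
    """Attention output using only the kept entries, weights renormalized."""
    w = weights[keep]
    return (w / w.sum()) @ values[keep]


def select(scores, budget):
    """Indices of the `budget` highest-scoring cache entries."""
    return np.sort(np.argsort(scores)[-budget:])


rng = np.random.default_rng(0)
n, d, budget = 32, 16, 8
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
# Value states with deliberately varied norms.
V = rng.normal(size=(n, d)) * rng.uniform(0.1, 3.0, size=(n, 1))

weights = softmax(K @ q / np.sqrt(d))
full = attention_output(weights, V, np.arange(n))

keep_attn = select(weights, budget)                              # attention only
keep_val = select(weights * np.linalg.norm(V, axis=1), budget)  # value-aware

err_attn = np.linalg.norm(full - attention_output(weights, V, keep_attn))
err_val = np.linalg.norm(full - attention_output(weights, V, keep_val))
print(f"attention-only: {err_attn:.3f}  value-aware: {err_val:.3f}")
```

The two errors measure how far the attention output moves after eviction under each scoring rule. Neither heuristic is guaranteed to win on a given input, which is part of the motivation for bounding the perturbation formally.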

2406.10238 2026-05-29 cs.CL cs.LG cs.SI 版本更新

Early Detection of Misinformation for Infodemic Management: A Domain Adaptation Approach

信息疫情管理中虚假信息的早期检测:一种领域自适应方法

Minjia Mao, Xiaohang Zhao, Xiao Fang

发表机构 * Lerner College of Business and Economics, University of Delaware(特拉华大学勒纳商业与经济学院) School of Information Management & Engineering, Shanghai University of Finance and Economics(上海财经大学信息管理与工程学院)

AI总结 针对信息疫情早期缺乏标注数据的问题,提出一种同时处理协变量偏移和概念偏移的领域自适应虚假信息检测方法,在真实数据集上优于现有方法。

详情
AI中文摘要

信息疫情是指在疾病爆发期间传播的大量真实信息和虚假信息。在信息疫情早期检测虚假信息是减少其对公共健康危害的关键。信息疫情早期的特点是存在大量关于某种疾病的未标注信息。因此,传统的虚假信息检测方法不适合此任务,因为它们依赖信息疫情领域的标注信息来训练模型。为解决这一局限,最先进的方法利用其他领域的标注信息来学习模型,以检测信息疫情领域的虚假信息。这些方法的有效性取决于它们缓解信息疫情领域与利用标注信息的领域之间的协变量偏移(即特征分布差异)和概念偏移(即标注模式差异)的能力。然而,这些方法侧重于缓解协变量偏移而忽略了概念偏移,导致其在该任务上效果不佳。为此,我们从理论上证明了同时处理协变量偏移和概念偏移的必要性,以及如何分别实现它们。基于理论分析,我们开发了一种新颖的虚假信息检测方法,同时解决了协变量偏移和概念偏移。使用真实数据集,我们进行了广泛的实证评估,证明我们的方法在性能上优于最先进的虚假信息检测方法以及可适用于该任务的常见领域自适应方法。

英文摘要

An infodemic refers to an enormous amount of true information and misinformation disseminated during a disease outbreak. Detecting misinformation at the early stage of an infodemic is key to reducing its harm to public health. An early-stage infodemic is characterized by a large volume of unlabeled information concerning a disease. As a result, conventional misinformation detection methods are not suitable for this misinformation detection task because they rely on labeled information in the infodemic domain to train their models. To address this limitation, state-of-the-art methods learn their models using labeled information in other domains to detect misinformation in the infodemic domain. The efficacy of these methods depends on their ability to mitigate both covariate shift (i.e., differences in feature distributions) and concept shift (i.e., differences in labeling patterns) between the infodemic domain and the domains from which they leverage labeled information. However, these methods focus on mitigating covariate shift but overlook concept shift, rendering them less effective for the task. In response, we theoretically show the necessity of tackling both covariate and concept shifts as well as how to operationalize each of them. Built on the theoretical analysis, we develop a novel misinformation detection method that addresses both covariate and concept shifts. Using real-world datasets, we conduct extensive empirical evaluations to demonstrate the superior performance of our method over state-of-the-art misinformation detection methods as well as prevalent domain adaptation methods that can be tailored to solve the misinformation detection task.
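The two shifts named in the abstract can be written in the usual domain-adaptation notation. The symbols below are assumptions for illustration (they are not taken from the paper): $S$ is a labeled source domain, $T$ is the infodemic (target) domain, $x$ is a feature vector, and $y$ is the misinformation label.

```latex
\begin{align*}
\text{covariate shift:} \quad & P_S(x) \neq P_T(x), \\
\text{concept shift:}   \quad & P_S(y \mid x) \neq P_T(y \mid x), \\
\text{joint distribution:} \quad & P(x, y) = P(x)\, P(y \mid x).
\end{align*}
```

Because the joint distribution factors into both terms, aligning only the feature distributions can still leave a mismatch in $P(y \mid x)$, which is consistent with the abstract's argument that both shifts need to be handled.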

2405.13003 2026-05-29 cs.CL cs.AI cs.IR 版本更新

A Survey on Recent Advances in Conversational Data Generation

对话数据生成最新进展综述

Heydar Soudani, Roxana Petcu, Evangelos Kanoulas, Faegheh Hasibi

发表机构 * Radboud University(拉博德大学) University of Amsterdam(阿姆斯特丹大学)

AI总结 本文系统综述了多轮对话数据生成方法,涵盖开放域、任务导向和信息检索三类对话系统,提出了包含种子数据创建、话语生成和质量过滤的通用框架,并讨论了评估指标与未来方向。

详情
AI中文摘要

近年来对话系统的进步显著增强了各领域的人机交互。然而,由于专业对话数据的稀缺,训练这些系统面临挑战。传统上,对话数据集通过众包创建,但该方法成本高、规模有限且劳动密集。作为解决方案,合成对话数据的开发应运而生,利用技术增强现有数据集或将文本资源转换为对话格式,提供了一种更高效且可扩展的数据集创建方法。在本综述中,我们系统全面地回顾了多轮对话数据生成,重点关注三类对话系统:开放域、任务导向和信息检索。我们根据种子数据创建、话语生成和质量过滤方法等关键组件对现有研究进行分类,并引入了一个概述对话数据生成系统主要原则的通用框架。此外,我们考察了评估合成对话数据的指标和方法,探讨了当前领域的挑战,并探索了未来研究的潜在方向。我们的目标是通过概述最先进的方法并强调该领域进一步研究的机会,加速研究人员和从业者的进展。

英文摘要

Recent advancements in conversational systems have significantly enhanced human-machine interactions across various domains. However, training these systems is challenging due to the scarcity of specialized dialogue data. Traditionally, conversational datasets were created through crowdsourcing, but this method has proven costly, limited in scale, and labor-intensive. As a solution, the development of synthetic dialogue data has emerged, utilizing techniques to augment existing datasets or convert textual resources into conversational formats, providing a more efficient and scalable approach to dataset creation. In this survey, we offer a systematic and comprehensive review of multi-turn conversational data generation, focusing on three types of dialogue systems: open domain, task-oriented, and information-seeking. We categorize the existing research based on key components like seed data creation, utterance generation, and quality filtering methods, and introduce a general framework that outlines the main principles of conversation data generation systems. Additionally, we examine the evaluation metrics and methods for assessing synthetic conversational data, address current challenges in the field, and explore potential directions for future research. Our goal is to accelerate progress for researchers and practitioners by presenting an overview of state-of-the-art methods and highlighting opportunities to further research in this area.

2605.29582 2026-05-29 cs.LG cs.CL 版本更新

PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning

PEARL: 使用教学对齐强化学习训练苏格拉底式导师

Qikai Chang, Zhenrong Zhang, Linbo Chen, Pengfei Hu, Jianshu Zhang, Youhui Guo, Jun Du

发表机构 * University of Science and Technology of China(中国科学技术大学) iFLYTEK Research(iFLYTEK研究院)

AI总结 提出PEARL框架,通过可控学生模拟器、生成式奖励模型和稳定多目标强化学习,训练苏格拉底式教学代理,在多个基准上达到开源模型最佳性能并与专有模型竞争。

Comments 16 pages, 7 figures

详情
AI中文摘要

大型语言模型(LLM)在教育辅导方面展现出潜力,但有效的辅导不仅仅是解决问题:它必须提供渐进的苏格拉底式引导,并在多轮交互中平衡多个教学目标。然而,由于学生模拟的保真度有限且可控性弱、教学奖励建模不明确以及多目标优化不稳定,训练这样的导师仍然具有挑战性。为克服这些限制,我们提出了PEARL,一个教学对齐的强化学习框架,用于训练苏格拉底式教学代理,包含三个关键组件。首先,我们引入了一个可控的学生模拟器,将潜在认知状态与响应生成解耦,以模拟多样的能力和误解。其次,我们开发了一个生成式奖励模型,联合评估教学质量和目标正确性以进行策略优化。最后,我们提出了一种稳定的多目标强化学习方案,在每个维度内离散化奖励并跨维度聚合归一化优势,防止高方差目标主导更新。在多个基准上的实验表明,尽管仅使用30B策略模型,PEARL在开源模型中取得了最佳性能,并与领先的专有LLM保持竞争力。

英文摘要

Large Language Models (LLMs) have shown promise as educational tutors, yet effective tutoring requires more than solving problems: it must provide progressive Socratic guidance and balance multiple pedagogical objectives across multi-turn interactions. However, training such tutors remains challenging due to limited-fidelity and weakly controllable student simulation, under-specified pedagogical reward modeling, and unstable multi-objective optimization. To overcome these limitations, we propose PEARL, a pedagogically aligned reinforcement learning framework for training Socratic tutoring agents, consisting of three key components. First, we introduce a controllable student simulator that decouples latent cognitive states from response generation to model diverse abilities and misconceptions. Second, we develop a generative reward model that jointly evaluates pedagogical quality and objective correctness for policy optimization. Finally, we propose a stable multi-objective RL scheme that discretizes rewards within each dimension and aggregates normalized advantages across dimensions, preventing high-variance objectives from dominating updates. Experiments on multiple benchmarks show that PEARL achieves the best performance among open-source models and remains competitive with leading proprietary LLMs, despite using only a 30B policy model.
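The multi-objective step described above (discretize rewards within each dimension, then aggregate normalized advantages across dimensions) might look roughly like the sketch below. It is not the authors' implementation: the bin edges, the group-relative normalization over rollouts, and the plain mean across dimensions are all assumptions for illustration.

```python
import numpy as np


def aggregate_advantages(rewards, bin_edges, eps=1e-8):
    """Return one scalar advantage per rollout.

    rewards: array of shape (n_rollouts, n_dims), one column per objective.
    bin_edges: increasing thresholds used to discretize every dimension.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Discretize each reward into an integer level (0 .. len(bin_edges)).
    levels = np.digitize(rewards, bin_edges).astype(float)
    # Normalize within each dimension across the rollout group.
    mean = levels.mean(axis=0, keepdims=True)
    std = levels.std(axis=0, keepdims=True)
    normalized = (levels - mean) / (std + eps)
    # Aggregate across dimensions so each objective contributes on the same scale.
    return normalized.mean(axis=1)


# Hypothetical rewards for four rollouts on two objectives
# (say, pedagogical quality and correctness), each in [0, 1].
rewards = [
    [0.9, 0.2],
    [0.5, 0.8],
    [0.1, 0.5],
    [0.7, 0.9],
]
adv = aggregate_advantages(rewards, bin_edges=[1 / 3, 2 / 3])
print(adv)
```

Normalizing per dimension before aggregating is what keeps a high-variance objective from dominating the update: after the division by its own standard deviation, each dimension has unit spread within the group.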

2605.29559 2026-05-29 cs.CL 版本更新

LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

LiteCoder-Terminal: 扩展用于学习语言代理的长时程终端环境

Xiaoxuan Peng, Kaiqi Zhang, Xinyu Lu, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所信息处理实验室) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出零依赖合成框架LiteCoder-Terminal-Gen,自动生成可执行且可验证的终端训练环境,构建大规模SFT和RL数据集,通过监督微调和直接多轮偏好优化显著提升语言代理在终端任务上的性能。

详情
AI中文摘要

掌握终端环境需要语言代理具备多步规划、基于反馈的执行和动态状态适应能力。然而,当前训练此类代理的瓶颈在于依赖从外部仓库抓取的数据,这限制了领域多样性、环境可控性以及针对特定能力缺陷的优化。我们引入了LiteCoder-Terminal-Gen,一个零依赖的合成流水线,能够直接从领域规范自动生成可执行且可验证的终端训练环境。利用该框架,我们构建了两个大规模资源:LiteCoder-Terminal-SFT,包含跨10个领域的11,255条专家轨迹;以及LiteCoder-Terminal-RL,包含602个可验证环境,用于轨迹级偏好优化。在SFT数据集上对Qwen系列模型进行监督微调,得到的代理显著优于其基础版本。值得注意的是,我们的32B变体在Terminal Bench 1.0、2.0和Pro上分别达到了29.06%、18.54%和34.00%的pass@1。此外,在RL环境上应用直接多轮偏好优化(DMPO)进一步提升了性能。这些结果系统性地表明,完全合成的可执行环境为掌握复杂的真实命令行工作流提供了可扩展且可验证的监督信号。

英文摘要

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, featuring 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively. Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal for mastering complex, real-world command-line workflows.

2605.29555 2026-05-29 cs.CL 版本更新

From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals

从盲目猜测到知情判断:通过构建知识增强的偏好信号教会LLM评估材料

Yeyong Yu, Wenya Hu, Xing Wu, Quan Qian

发表机构 * School of Computer Engineering & Science, Shanghai University(上海大学计算机工程与科学学院) Center of Materials Informatics and Data Science, Materials Genome Institute, Shanghai University(上海大学材料信息与数据科学中心) Key Laboratory of Silicate Cultural Relics Conservation (Shanghai University), Ministry of Education, China(硅酸盐质文物保护教育部重点实验室(上海大学)) Shanghai Institute for Advanced Communication and Data Science, Shanghai University(上海大学上海先进通信与数据科学研究院)

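AI summary: Proposes MaterEval, a knowledge-augmented preference signal framework that uses paired preference data to steer large language models from intuitive judgment toward reliable, evidence-based evaluation, adds a fast-slow reasoning scheme to balance throughput, cost, and reliability, and validates the approach on high-entropy alloy assessment.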

Comments 33 pages, 5 figures

AI中文摘要

随着候选生成和高通量实验的进步,材料发现的主要瓶颈正从性质预测转向在大量候选集中进行可靠评估。我们提出了知识增强偏好信号框架MaterEval,该框架自动为同一候选生成两种评估:一种遵循专家规则并提供支持证据的知情判断,另一种是移除规则的盲目猜测。通过将这两种评估配对作为偏好数据,我们引导原本缺乏材料特定标准的通用大语言模型(LLM)从直觉判断转向由明确证据支持的可靠评估。为了平衡吞吐量、成本和可靠性,我们进一步引入了一种快慢推理方案,将大规模快速筛选与对小子集的深入审查解耦。以高熵合金(HEA)评估为例,我们表明,无需外部检索,仅依赖内化能力,小型开源LLM在准确性、结论一致性和证据区分度上取得了显著提升,接近基于规则的闭源LLM的性能。这些结果表明,专家规则可以系统地转化为可学习的偏好信号,从而为自主材料发现循环提供低成本且可部署的评估模块。

英文摘要

As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a Knowledge-Augmented Preference Signals Framework, MaterEval, that automatically produces, for the same candidate, two evaluations: an informed judgment that follows expert rules and provides supporting evidence, and a rule-removed blind guess. By pairing the two evaluations as preference data, we guide general-purpose large language models (LLMs), originally lacking materials-specific criteria, from intuitive judgment toward reliable evaluation supported by explicit evidence. To balance throughput, cost, and reliability, we further introduce a fast-slow reasoning scheme that decouples large-scale rapid screening from in-depth review on a small subset. Using high-entropy alloy (HEA) assessment as a case study, we show that, without external retrieval and relying solely on internalized capabilities, small open-source LLMs achieve substantial gains in accuracy, conclusion consistency, and evidence discrimination, approaching the performance of rule-based closed-source LLMs. These results demonstrate that expert rules can be systematically transformed into learnable preference signals, enabling a low-cost and deployable evaluation module for autonomous materials discovery loops.
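The pairing step described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Evaluation` and `PreferencePair` types, their field names, and the prompt wording are hypothetical. The only thing taken from the abstract is that, for the same candidate, the rule-following judgment with evidence is preferred over the rule-removed guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    candidate: str      # material being assessed (hypothetical field)
    verdict: str        # the evaluator's conclusion
    evidence: tuple     # supporting statements; empty for a blind guess
    used_rules: bool    # True if expert rules were available to the evaluator


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def render(ev: Evaluation) -> str:
    """Flatten an evaluation into the text a model would be trained on."""
    lines = [f"Verdict: {ev.verdict}"]
    lines += [f"Evidence: {item}" for item in ev.evidence]
    return "\n".join(lines)


def build_pair(informed: Evaluation, blind: Evaluation) -> PreferencePair:
    """Pair an informed judgment (chosen) with a blind guess (rejected)."""
    if informed.candidate != blind.candidate:
        raise ValueError("both evaluations must assess the same candidate")
    if not informed.used_rules or blind.used_rules:
        raise ValueError("expected one rule-following and one rule-removed evaluation")
    prompt = f"Evaluate the candidate material: {informed.candidate}"
    return PreferencePair(prompt, render(informed), render(blind))


# Placeholder content only; no real materials data is implied.
informed = Evaluation("candidate-001", "promising", ("placeholder rule-based evidence",), True)
blind = Evaluation("candidate-001", "promising", (), False)
pair = build_pair(informed, blind)
```

Records in this `prompt` / `chosen` / `rejected` shape are what typical preference-optimization trainers consume, which is why the sketch uses it.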

2605.29543 2026-05-29 cs.LG cs.AI cs.CL cs.HC cs.IR 版本更新

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

SCOPE:一种用于空中交通管制复诵监控的轻量训练LLM框架

Qihan Deng, Minghua Zhang, Yang Yang, Zhenyu Gao

发表机构 * Department of Mechanical and Aerospace Engineering, The Hong Kong University of Science and Technology(香港科技大学机械及航空航天工程学系) School of Electronic and Information Engineering, Beihang University(北京航空航天大学电子信息工程学院) State Key Laboratory of CNS/ATM(国家空管自动化系统实验室)

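AI summary: Proposes SCOPE, which couples a plug-in open-set classifier and an in-context learning mechanism with a frozen LLM for efficient, accurate ATC readback monitoring; in the few-shot setting it reaches 91.05% open-set detection accuracy and corrects 96.63% of anomalous readbacks.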

AI中文摘要

飞行员对空中交通管制(ATC)语音指令的复诵是航空运输中防止沟通失误的主要保障。然而,复诵异常仍与约80%的航空事故相关。这一脆弱性因交通量增加和认知负荷升高而进一步加剧,从而推动了机器自动化复诵监控的需求。传统的基于规则和机器学习的方法难以在高度可变且不断演变的空管-飞行员通信术语中泛化。尽管大语言模型(LLM)凭借其强大的推理和泛化能力开辟了新途径,但现有方法在实践中仍面临部署和计算障碍。在这项工作中,我们提出了SCOPE(Semantic reasoning for Communication via Open-set Plug-in with Examples),一种新颖的轻量训练LLM框架,提升了基于机器的ATC复诵监控的效率和准确性。核心思想是在冻结的LLM之上,将插件式开放集分类器与精心设计的上下文学习机制相结合。在半合成通信数据集上的大量实验表明,SCOPE在实现运行环境所需的低延迟响应的同时,达到了优越的准确性。在少样本设置下,SCOPE在开放集检测中达到91.05%的准确率,并纠正了96.63%的异常复诵,从而在提供决策解释的同时优于现有最强基线。这些发现证明了我们的框架作为通向可解释和可控的ATC复诵监控的实用途径的潜力。

英文摘要

Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.
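The abstract does not describe the internals of the plug-in open-set classifier, so the following is only a generic sketch of what open-set detection means: assign an input to the nearest known class, but reject it as unknown when nothing is close enough. The nearest-centroid rule, the distance threshold, the 2-D features, and the label names are all assumptions made for illustration.

```python
import math


def classify_open_set(embedding, centroids, max_distance):
    """Return the nearest known label, or "unknown" if nothing is close enough."""
    best_label, best_dist = "unknown", math.inf
    for label, centroid in centroids.items():
        dist = math.dist(embedding, centroid)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= max_distance else "unknown"


# Hypothetical 2-D centroids standing in for features taken from a frozen LLM.
centroids = {"correct_readback": (0.0, 0.0), "wrong_altitude": (4.0, 0.0)}
```

A closed-set classifier would always return one of the two known labels; the threshold is what lets previously unseen anomaly types surface as `"unknown"`.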

2605.29502 2026-05-29 cs.CL cs.AI 版本更新

Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation

源语言锚定的语义强化学习用于低资源目标语言生成

Zeli Su, Ziyin Zhang, Zewei Pan, Zhou Liu, Dingcheng Huang, Dehan Li, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

发表机构 * Minzu University of China(中央民族大学) Ant Group(蚂蚁集团) Shanghai Jiao Tong University(上海交通大学) Peking University(北京大学) Harbin Institute of Technology(哈尔滨工业大学) South China University of Technology(华南理工大学)

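AI summary: Proposes Source-Grounded Semantic Reinforcement Learning (SG-SRL), which exploits source-language monolingual data through a cross-lingual semantic reward model and adds a lightweight recovery stage to counter reward hacking, improving semantic grounding and factual coverage in low-resource target-language generation.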

AI中文摘要

低资源目标语言生成通常受限于稀缺的平行数据,而高资源源语言单语数据丰富但难以通过标准监督微调使用。我们提出源语言锚定的语义强化学习(SG-SRL),一种资源利用框架,将源语言单语数据转换为用于目标语言生成的跨语言语义监督。SG-SRL使用跨语言语义奖励模型(由跨语言重排序器实例化,对源输入与目标语言生成之间的语义相关性进行评分)在源语言数据上执行无参考强化学习(RL)。虽然这会导致严重的基于冗长的奖励黑客问题,但使用小型平行语料库的轻量级恢复阶段在保留语义增益的同时恢复了流畅性、简洁性和任务格式。在中文到泰语生成上的实验表明,SG-SRL在冷启动SFT基础上改善了语义锚定和事实覆盖。对长文本迁移和基于藏语嵌入奖励的额外分析阐明了SG-SRL的泛化行为,并表明在现实低资源语言设置中,基于编码器的语义奖励可以替代基于LLM的重排序器。

英文摘要

Low-resource target-language generation is often limited by scarce parallel data, while high-resource source-language monolingual data is abundant but difficult to use with standard supervised fine-tuning. We propose Source-Grounded Semantic Reinforcement Learning (SG-SRL), a resource-utilization framework that converts source-language monolingual data into cross-lingual semantic supervision for target-language generation. SG-SRL performs reference-free reinforcement learning (RL) on source-language data using a cross-lingual semantic reward model, instantiated by a cross-lingual reranker that scores the semantic relevance between the source input and the target-language generation. While this induces severe verbosity-based reward hacking, a lightweight recovery stage using a small parallel corpus restores fluency, conciseness, and task format while preserving the semantic gains. Experiments on Chinese-to-Thai generation show that SG-SRL improves semantic grounding and factual coverage over cold-start SFT. Additional analyses on long-form transfer and Tibetan embedding-based rewards clarify the generalization behavior of SG-SRL and show that an encoder-based semantic reward can substitute for an LLM-based reranker in a realistic low-resource language setting.

2605.29498 2026-05-29 cs.CL cs.CV 版本更新

Mask the Target: A Plug-and-Play Regularizer Against LoRA Forgetting

Mask the Target: 一种即插即用的正则化器,用于对抗LoRA遗忘

Runze Xu, Arpit Garg, Hemanth Saratchandran, Simon Lucey

发表机构 * Australian Institute for Machine Learning(澳大利亚机器学习研究所)

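AI summary: To address catastrophic forgetting in LoRA fine-tuning when the adaptation distribution differs substantially from the original training distribution, proposes a replay-free output-space regularizer that masks the target token and applies KL regularization only over the non-target vocabulary, improving the balance between new learning and forgetting with no inference-time overhead.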

Comments In Submission

AI中文摘要

低秩适应(LoRA)已成为将大型语言模型适应新领域、任务和用户的最广泛使用的微调机制之一。然而,仅凭适应性能可能掩盖一个重要失败模式:LoRA更新可能在提升目标分布性能的同时,削弱预训练和对齐阶段学习到的先前能力。我们表明,当适应分布与模型的原始训练或对齐分布存在显著差异时,这种遗忘变得尤为严重。在实际场景中,原始训练和对齐数据通常不可用,这加剧了挑战。受此约束,我们研究了基于LoRA的适应如何在无重放设置中平衡新学习与遗忘,并引入了一个简单的输出空间正则化器,可直接添加到现有训练流程中。我们的方法从基模型和适应模型分布中移除真实标记,重新归一化剩余概率,并仅对非目标词汇应用KL正则化。这保留了基模型在替代标记之间的相对偏好,同时不直接对抗适应所需的交叉熵信号。由于正则化器仅在损失层面起作用,它不需要重放数据、架构更改、适配器重新设计或推理时开销,并且可以直接应用于现有LoRA变体。在所有测试的LoRA变体和各种骨干网络上,当适应分布与基模型的原始训练或对齐分布存在显著差异时,我们的方法改善了新学习与遗忘之间的边界,表明这是一条通往更可靠LLM更新的广泛适用途径。

英文摘要

Low-Rank Adaptation (LoRA) has become one of the most widely used fine-tuning mechanisms for adapting large language models to new domains, tasks, and users. Yet adaptation performance alone can obscure an important failure mode: LoRA updates may improve performance on the target distribution while degrading prior capabilities learned during pretraining and alignment. We show that this forgetting becomes especially severe when the adaptation distribution differs substantially from the model's original training or alignment distributions. The challenge is amplified in practical settings, where the original training and alignment data are typically unavailable. Motivated by this constraint, we study how LoRA-based adaptation balances new learning against forgetting in a replay-free setting, and introduce a simple output-space regularizer that can be added directly to existing training pipelines. Our method removes the ground-truth token from both the base and adapted model distributions, renormalizes the remaining probabilities, and applies KL regularization only over the non-target vocabulary. This preserves the base model's relative preferences among alternative tokens without directly opposing the cross-entropy signal required for adaptation. As the regularizer acts only at the loss level, it requires no replay data, architectural changes, adapter redesign, or inference-time overhead, and can be applied directly to existing LoRA variants. Across all LoRA variants tested and across various backbones, our method improves the frontier between new learning and forgetting when the adaptation distribution differs substantially from the base model's original training or alignment distributions, suggesting a broadly applicable route toward more reliable LLM updating.
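The three steps named in the abstract (remove the ground-truth token, renormalize, apply KL over the rest) can be sketched for a single token position in plain Python. This is an illustration under stated assumptions, not the authors' code: the KL direction (base relative to adapted), the weight `lam`, and the small `eps` for numerical safety are choices made here, since the abstract does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def drop_and_renormalize(probs, target):
    """Remove the target token's probability and rescale the rest to sum to 1."""
    rest = [p for i, p in enumerate(probs) if i != target]
    total = sum(rest)
    return [p / total for p in rest]


def masked_kl(base_logits, adapted_logits, target, eps=1e-12):
    """KL(base || adapted) over the non-target vocabulary (direction assumed)."""
    p = drop_and_renormalize(softmax(base_logits), target)
    q = drop_and_renormalize(softmax(adapted_logits), target)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def total_loss(base_logits, adapted_logits, target, lam=0.1):
    """Cross-entropy on the target plus the masked regularizer (lam is hypothetical)."""
    ce = -math.log(softmax(adapted_logits)[target])
    return ce + lam * masked_kl(base_logits, adapted_logits, target)
```

One consequence worth noting: if adaptation only raises the target token's logit and leaves the others untouched, the renormalized non-target distributions are identical and the penalty is zero, which matches the abstract's claim that the regularizer does not directly oppose the cross-entropy signal.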

2605.29496 2026-05-29 cs.CL cs.CV 版本更新

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

视觉语言模型后训练中推理与感知的非对称优化研究

Xueqing Wu, Yu-Chi Lin, Kai-Wei Chang, Nanyun Peng

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

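AI summary: Using synthetic diagnostic tasks, finds that post-training improves reasoning far more than perception: under SFT because perception occupies few tokens and so receives a weak training signal, and under RL because of reward coupling. Dynamic loss reweighting and a perception-aware reward mitigate the imbalance and improve end-to-end performance.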

Comments Project: https://asymmetric-vlm-post-training.github.io/

AI中文摘要

后训练极大地提升了前沿视觉语言模型中的推理能力,但其对感知的提升相对有限,这成为端到端视觉推理的瓶颈。为探究这一差距,我们引入了一个受控的诊断框架,包含两个将感知与推理分离的合成任务。我们的分析揭示了一致的感知-推理非对称性:后训练对推理的提升显著大于感知,尽管其内在机制因训练范式而异。对于监督微调(SFT),这种非对称性源于思维链监督中的token不平衡,其中感知占据较少token,因此接收到的训练信号较弱。动态重加权损失可缓解这种不平衡,并将端到端性能提升高达18.2。对于强化学习(RL),非对称性则源于奖励耦合:结果奖励与推理的相关性比与感知更强,从而削弱了感知学习的信号。添加感知敏感的奖励可缓解不平衡,并将端到端准确率提升高达6.0;即使没有真实感知奖励,可靠的替代奖励也能提供有用信号,带来3.2个百分点的提升。综合来看,我们的结果全面诊断了非对称优化,并提出了平衡感知与推理的具体干预措施。

英文摘要

Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.
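The token-imbalance argument for SFT can be made concrete with a small sketch. The weighting below is a static, group-balanced variant chosen for illustration; the paper's reweighting is described as dynamic and its exact form is not given in the abstract. The segment labels and loss values are hypothetical.

```python
def balanced_token_weights(segments):
    """Give each segment type the same total weight, whatever its token count.

    segments: one label per token, e.g. "perception" or "reasoning".
    The returned weights average to 1 over the sequence.
    """
    counts = {}
    for label in segments:
        counts[label] = counts.get(label, 0) + 1
    n_groups, n_tokens = len(counts), len(segments)
    return [n_tokens / (n_groups * counts[label]) for label in segments]


def weighted_loss(token_losses, weights):
    """Weighted mean of per-token losses."""
    return sum(l * w for l, w in zip(token_losses, weights)) / sum(weights)


# Hypothetical chain of thought: 2 perception tokens followed by 8 reasoning tokens.
segments = ["perception"] * 2 + ["reasoning"] * 8
token_losses = [2.0] * 2 + [1.0] * 8
weights = balanced_token_weights(segments)
plain = sum(token_losses) / len(token_losses)
balanced = weighted_loss(token_losses, weights)
```

In the unweighted mean the two perception tokens account for a third of the loss; after balancing they account for two thirds, so gradients from the short perception span are no longer swamped by the longer reasoning span.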

2605.29486 2026-05-29 cs.CL cs.AI cs.LG 版本更新

PhoneWorld: Scaling Phone-Use Agent Environments

PhoneWorld: 扩展手机使用代理环境

Zhengyang Tang, Yuxuan Liu, Xin Lai, Junyi Li, Pengyuan Lyu, Jason, Yiduo Guo, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Huawen Shen, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Rui Yan, Ji-Rong Wen, Chengquan Zhang, Han Hu

发表机构 * Tencent Hunyuan(腾讯混元) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

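AI summary: Presents PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts, so that phone-agent environments can be built at scale.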

Comments work in progress

AI中文摘要

手机使用代理的一个核心瓶颈是,覆盖真实移动行为的可控、可复现环境难以大规模构建。现有的移动代理基准在评估方面取得了重要进展,但它们本身并未提供一种可扩展的方式来构建许多新的手机使用环境。我们提出了PhoneWorld,一个可复用的管道,将真实的GUI轨迹和截图转化为可控的手机使用环境、可执行任务、自动验证器和训练轨迹(rollouts)。PhoneWorld不是一次手动构建一个移动基准,而是利用真实轨迹来恢复哪些屏幕重要、屏幕如何连接、哪些交互必须改变环境状态、以及哪些用户目标可以自动验证。从这些信号中,它构建了由只读应用内容和可变状态支持的可运行模拟Android应用,然后从相同环境中派生出可执行任务、基于规则的验证器和训练轨迹(rollouts)。在当前实例中,PhoneWorld覆盖了16个领域的34个应用,涵盖了常见的消费者移动行为,如搜索、浏览、购物、预订、媒体和社交互动。在固定的训练预算下,将来自辅助AndroidWorld语料库的10K步替换为广泛的PhoneWorld监督,同时提升了所有四个评估基准,使HYMobileBench提高了17.7分,AndroidControl提高了6.0分,AndroidWorld提高了14.7分,PhoneWorld提高了52.5分。然后我们研究了两个额外的扩展问题:增加PhoneWorld监督量显著提高了PhoneWorld性能,并且在固定的PhoneWorld预算下,扩大应用覆盖范围带来了更大的收益。总体而言,PhoneWorld将焦点从一次构建一个移动基准转向了规模化供应手机使用环境本身。

英文摘要

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.
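The idea of a mock app "backed by read-only app content and mutable state" with a rule-based verifier can be illustrated with a toy example. Everything here (the `MockShopApp` class, the action names, the catalog entries) is hypothetical and far simpler than a runnable mock Android app; it only shows how one environment can supply a task, its state transitions, and an automatic check.

```python
from dataclasses import dataclass, field
from types import MappingProxyType


@dataclass
class MockShopApp:
    catalog: MappingProxyType                  # read-only app content
    cart: list = field(default_factory=list)   # mutable state
    screen: str = "home"                       # current screen id

    def step(self, action: str, item_id: str = "") -> None:
        """Apply one UI action; unsupported actions are no-ops, like a mis-tap."""
        if action == "open_item" and item_id in self.catalog:
            self.screen = f"item/{item_id}"
        elif action == "add_to_cart" and self.screen == f"item/{item_id}":
            self.cart.append(item_id)
        elif action == "back":
            self.screen = "home"


def verify_item_in_cart(app: MockShopApp, item_id: str) -> bool:
    """Rule-based verifier: the goal holds if the item is in the cart exactly once."""
    return app.cart.count(item_id) == 1


def run_task(actions) -> MockShopApp:
    """Replay a list of (action, item_id) pairs on a fresh environment."""
    catalog = MappingProxyType({"sku-1": "USB cable", "sku-2": "Phone case"})
    app = MockShopApp(catalog=catalog)
    for action, item_id in actions:
        app.step(action, item_id)
    return app
```

Because the verifier inspects final state rather than the action sequence, any trajectory that reaches the goal passes, and a rollout that skips the item screen fails without manual labeling.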

2605.29476 2026-05-29 cs.CL 版本更新

Comparative Evaluation of Machine Translation Systems on Images with Text

含文本图像的机器翻译系统比较评估

Blai Puchol, Sergio Gómez González, Miguel Domingo, Francisco Casacuberta

发表机构 * ValgrAI - Valencian Graduate School and Research Network for Artificial Intelligence(ValgrAI - 瓦伦西亚人工智能研究生学院和研究网络)

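AI summary: Compares three machine translation paradigms (modular pipelines, multimodal large language models, and the end-to-end model Translatotron-V) on translating images that contain text, and finds that multimodal LLMs perform best.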

AI中文摘要

本文对应用于包含文本信息的图像的机器翻译系统进行了比较评估,该任务位于计算机视觉和自然语言处理的交叉领域。研究比较了三种主要范式:分离文本检测、识别和翻译的模块化流水线;能够联合处理图像和文本的多模态大语言模型(MLLM);以及直接生成翻译图像的端到端模型Translatotron-V。模块化系统采用最先进的OCR(docTR)结合多语言LLM(如Llama和EuroLLM),而评估的MLLM包括Gemini 2.5的不同配置。实验在覆盖多种语言对的并行多语言数据集上进行,基于BLEU、chrF和TER指标进行评估。结果表明,模块化流水线优于端到端方法,而MLLM实现了最佳整体性能,展现出卓越的灵活性和上下文理解能力。这些发现强调了多模态推理在图像到文本翻译中的有效性,并为未来在多语言环境中整合视觉理解和语言生成的研究提供了坚实基础。

英文摘要

This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study compares three main paradigms: modular pipelines that separate text detection, recognition, and translation; multi-modal large language models (MLLMs) capable of processing both image and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems employ state-of-the-art OCR (docTR) combined with multilingual LLMs such as Llama and EuroLLM, while the evaluated MLLMs include different configurations of Gemini 2.5. Experiments were conducted on parallel multilingual datasets covering multiple language pairs, with evaluation based on BLEU, chrF, and TER metrics. The results show that modular pipelines outperform the end-to-end approach, while MLLMs achieve the best overall performance, demonstrating superior flexibility and contextual understanding. These findings underscore the effectiveness of multi-modal reasoning for image-to-text translation and provide a solid foundation for future research on integrating visual understanding and language generation in multilingual settings.

2605.29473 2026-05-29 cs.HC cs.AI cs.CL cs.CY cs.SI 版本更新

Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles

告知、指导、共情、倾听:审计LLM护理支持角色

Drishti Goel, Agam Goyal, Veda Duddu, Olivia Pal, Jeongah Lee, Qiuyue Joy Zhong, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) OSF HealthCare(OSF医疗集团) Indiana University Indianapolis(印第安纳大学印第安纳波利斯分校)

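AI summary: Operationalizes four social support roles (Inform, Coach, Relate, Listen) to assess the safety profile of large language models in informal caregiving conversations, finding that the support role systematically shapes interactional risk and that perceived quality trades off against safety.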

AI中文摘要

语言模型越来越多地被部署用于非正式护理环境中的对话支持,在这些环境中,交互通常超出信息寻求范围:护理者在应对不确定、关系复杂的护理决策时,寻求情感安慰、指导和帮助。然而,大多数安全评估在通用提示下评估模型行为,留下一个关键问题未加审视:模型的安全概况是否会随其支持角色而变化?我们通过操作化四种基于社会支持理论的专家评审支持角色来研究这一点:告知、指导、共情和倾听,并将它们与两个基线控制条件(基本提示条件和检索增强生成条件)进行比较。我们在三个语言模型(GPT-4o-mini、Llama-3.1-8B-Instruct和MedGemma-1.5-4b-it)上,对来自在线阿尔茨海默病及相关痴呆症社区的5,000个真实世界查询进行了评估。我们发现,LLM的支持角色系统地影响了交互风险的普遍性和构成。此外,一项人类评估研究揭示了感知质量-安全性权衡:更具指导性、信息导向的角色被认为更有帮助和值得信赖,尽管它们表现出更高的交互风险概况。我们发布了约90,000个带有风险注释的支持角色条件模型响应,作为研究更安全的LLM中介对话支持的生态基础资源。

英文摘要

Language models are increasingly being deployed for conversational support in informal caregiving contexts, where interactions often extend beyond information-seeking: caregivers seek emotional reassurance, guidance, and help, while navigating uncertain, relationally complex care decisions. Yet most safety evaluations assess model behavior under generic prompts, leaving a critical question unexamined: does a model's safety profile change with its support role? We study this by operationalizing four expert-reviewed support roles grounded in social support theory: Inform, Coach, Relate, and Listen, and comparing them against two baseline controls: a basic prompting condition and a retrieval-augmented generation (RAG) condition. We evaluate across three language models (GPT-4o-mini, Llama-3.1-8B-Instruct, and MedGemma-1.5-4b-it) on 5,000 real-world queries from online Alzheimer's Disease and Related Dementias (ADRD) communities. We find that the LLM's support role systematically shapes both the prevalence and composition of interactional risks. Furthermore, a human evaluation study reveals a perceived quality-safety tension: more directive, information-oriented roles are rated as more helpful and trustworthy despite exhibiting elevated interactional risk profiles. We release ~90,000 support role-conditioned model responses with risk annotations as an ecologically grounded resource for research on safer LLM-mediated conversational support.

2605.29459 2026-05-29 cs.CL cs.LG 版本更新

Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models

Kronecker嵌入:用于参数高效语言模型的字节级结构化词元表示

Rohan Shravan

发表机构 * The School of AI(人工智能学院)

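AI summary: Proposes Kronecker Embeddings, which replace the standard embedding table with a deterministic byte-level character-position factorization, eliminating 91-94% of input-side trainable parameters while achieving lower validation loss, stronger spelling robustness, and better runtime efficiency across several experiments.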

Comments 28 pages, 16 tables. Reference implementation: https://github.com/theschoolofai/kronecker-embeddings

AI中文摘要

大型语言模型通过一个形状为|V| x d_model的可学习嵌入表路由每个输入,在前沿规模下消耗数亿到数十亿的可训练参数。我们引入Kronecker嵌入,一种确定性的字节级字符-位置分解,用固定编码器和单个可学习投影替换该表,与标准BPE分词器兼容,在前沿规模下消除91-94%的输入侧可训练参数。我们提供五项贡献。第一,跨六个LM(135M-671B参数)的模型探针显示,训练后的输入嵌入将探针词的印刷变体聚类程度远高于形态学相关词;Kronecker在嵌入层避免了这种聚类。第二,在FineWeb-Edu上对nanoGPT GPT-2 124M进行2.5B词元的三种子受控比较显示,Kronecker达到比BPE绑定基线低2.5±0.2%的验证损失(差距0.083±0.007 nats,约9%更低的困惑度),达到BPE收敛损失所需的步数减少约1.43倍。第三,在110个干净/拼写错误对上的拼写鲁棒性探针显示,Kronecker在55.5%的对上保持top-1预测,而BPE为47.3%(+8.2个百分点),并将KL降低7.6%,在11个类别中赢得或平局10个;生成探针显示Kronecker在生成中回显字节新颖字符串和拼写错误,而BPE则遗忘它们。第四,BPE嵌入范数在训练期间漂移,而Kronecker投影范数保持在1.0附近,与稳定的表示目标一致。第五,一种即时运行时变体从4.5 MB的字节缓冲区重建嵌入,而不是从词汇量为131,072的2.15 GB表中重建,步长时间开销为0.01-0.24%。字节级局部性存在权衡:字节相似但语义距离远的对(compute/commute, nation/notion)聚类在一起,将消歧转移到早期注意力层。

英文摘要

Large language models route every input through a learned embedding table of shape |V| x d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic byte-level character-position factorization that replaces this table with a fixed encoder and a single learned projection, compatible with standard BPE tokenizers, eliminating 91-94% of input-side trainable parameters at frontier scale. We provide five contributions. First, a cross-model probe across six LMs (135M-671B parameters) shows trained input embeddings cluster typographic variants of the probe word far more than morphological relatives; Kronecker escapes this clustering at the embedding layer. Second, a controlled three-seed comparison on nanoGPT GPT-2 124M over 2.5B tokens of FineWeb-Edu shows Kronecker reaching 2.5 ± 0.2% lower validation loss than the BPE-tied baseline (gap 0.083 ± 0.007 nats, ~9% lower perplexity), needing ~1.43x fewer steps to reach BPE's converged loss. Third, a spelling-robustness probe over 110 clean/typo pairs shows Kronecker preserves the top-1 prediction on 55.5% of pairs vs. 47.3% for BPE (+8.2 pp) and lowers KL by 7.6%, winning or tying in 10 of 11 categories; a generation probe shows Kronecker echoes byte-novel strings and typos through generation where BPE forgets them. Fourth, BPE embedding norm drifts during training while Kronecker projection norm stays near 1.0, consistent with a stable representational target. Fifth, an on-the-fly runtime variant reconstructs embeddings from a 4.5 MB byte buffer rather than a 2.15 GB table at vocabulary 131,072, with 0.01-0.24% step-time overhead. Byte-level locality has a tradeoff: byte-similar but semantically distant pairs (compute/commute, nation/notion) cluster together, shifting disambiguation to early attention layers.
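Some of the figures quoted in the abstract can be sanity-checked with a few lines of arithmetic. The vocabulary size, table size, buffer size, and loss gap come from the abstract; the hidden size `D_MODEL = 4096` and 4-byte (fp32) storage are assumptions picked because they reproduce the quoted 2.15 GB, and other combinations (for example a wider model stored in 2 bytes per value) would fit equally well.

```python
import math

VOCAB = 131_072        # vocabulary size quoted in the abstract
D_MODEL = 4_096        # assumption: not stated in the abstract
BYTES_PER_PARAM = 4    # assumption: fp32 storage

# Size of a dense |V| x d_model embedding table, in decimal gigabytes.
table_bytes = VOCAB * D_MODEL * BYTES_PER_PARAM
table_gb = table_bytes / 1e9

# The 4.5 MB byte buffer spread over the vocabulary: bytes available per token.
buffer_bytes_per_token = 4.5e6 / VOCAB

# A loss gap in nats maps to a multiplicative perplexity ratio via exp().
loss_gap_nats = 0.083
ppl_ratio = math.exp(-loss_gap_nats)   # lower-loss perplexity / baseline perplexity
ppl_reduction_pct = (1.0 - ppl_ratio) * 100.0
```

Under these assumptions the table comes to about 2.147 GB, the buffer allows roughly 34 bytes per token, and a 0.083-nat gap corresponds to a perplexity reduction of about 8%, in line with the rounded "~9%" in the abstract when the ratio is taken the other way (the baseline's perplexity is about 8.7% higher).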

2605.29458 2026-05-29 cs.CL cs.AI 版本更新

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

面向LLM人格模拟的自适应访谈:基于证据的推理提升决策对齐

Ruoxi Su, Yuhan Liu, Jingyu Hu

发表机构 * University of Cambridge(剑桥大学) Independent Researcher(独立研究员)

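AI summary: Proposes an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue, uses the resulting transcripts to evaluate how well LLMs simulate individual decisions in moral dilemma scenarios, and finds that predictions grounded in follow-up evidence are more accurate.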

Comments 20 pages, 2 figures, 12 tables

AI中文摘要

准确模拟特定个体的决策对大型语言模型(LLM)仍然具有挑战性,部分原因在于人格信息通常以静态描述形式提供,缺乏个体层面决策模拟所需的价值观、经历和情境线索。我们提出一种自适应访谈框架,通过结构化的三阶段对话收集人格相关信息:核心问题、动态追问和综合人格总结。利用生成的访谈记录,我们评估LLM能否模拟参与者在道德困境场景中的决策。我们比较了三种对话情境——核心10个问题回答、完整访谈对话以及总结性人格表征。结果发现,自适应访谈并非作为统一的准确性增强器,而更像是一种选择性接地机制:约40%的完整访谈轨迹中融入了基于追问的证据,且这些基于追问的预测比仅基于核心问题的预测更准确(45.5% vs. 39.3%)。这些发现强调,仅靠更丰富的人格背景是不够的:只有当模型真正将其决策基于用户特定证据时,改进才会出现。

英文摘要

Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and contextual cues needed for individual-level decision simulation. We propose an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue: core questions, dynamic follow-ups, and a synthesized personality summary. Using the resulting interview transcripts, we evaluate whether LLMs can simulate participants' decisions in moral dilemma scenarios. We compare three conversational contexts -- Core-10 responses, the full interview dialogue, and a summarized persona representation. We find that adaptive interviewing functions less as a uniform accuracy booster and more as a selective grounding mechanism: follow-up-derived evidence is incorporated in around 40% of full-interview traces, and these follow-up-grounded predictions are more accurate than core-only grounded ones (45.5% vs. 39.3%). These findings highlight that richer persona context alone is insufficient: improvements arise only when models actually ground their decisions in user-specific evidence.

2605.29447 2026-05-29 cs.CV cs.CL 版本更新

Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

恢复策略诱导错误:鲁棒GUI智能体的基准测试与轨迹合成

Tianpeng Bu, Xin Liu, Qihua Chen, Hao Jiang, Shurui Li, Hongtao Duan, Lu Jiang, Lulu Hu, Bin Yang, Minying Zhang

发表机构 * Alibaba(阿里巴巴)

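AI summary: Introduces the GUI-RobustEval benchmark and RoTS, a robustness-driven trajectory synthesis framework whose tree-based pipeline proactively discovers error modes and synthesizes recovery steps; models trained on the resulting data reach state-of-the-art performance on GUI tasks.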

Comments ICML 2026 Spotlight. 36 pages, 19 figures, includes appendix

AI中文摘要

尽管GUI智能体发展迅速,但它们通常缺乏从自身错误中恢复的鲁棒性,阻碍了实际部署。为了在评估和数据层面弥补这一差距,我们引入了GUI-RobustEval并提出了鲁棒驱动轨迹合成。GUI-RobustEval包含1,216个可执行测试用例,系统性地衡量在广泛且真实的错误模式下的错误恢复能力。在数据层面,RoTS是一个可扩展的合成框架,通过树状管道主动发现多样化的错误模式并合成相应的恢复步骤,创建了80万高质量数据。我们的两个模型RoTS-7B和RoTS-32B,在数据集上微调后,在GUI-RobustEval和传统GUI基准测试上均表现出显著提升。值得注意的是,RoTS-32B在OSWorld上达到了最先进性能,成功率为47.4%,All-Pass@4得分为33.8%,表明改进的长时域错误恢复能力有助于鲁棒性和整体性能。我们的代码可在https://github.com/AlibabaResearch/RoTS获取。

英文摘要

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates 800k high-quality data via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.

2605.29440 2026-05-29 cs.CL cs.AI cs.IR 版本更新

SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents

SkillBrew: LLM智能体技能库的多目标策展

Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu, Ming Jin, Xiangyu Zhao, Yilei Shao, Yanfeng Wang, Qingsong Wen

发表机构 * City University of Hong Kong(香港城市大学) Squirrel Ai Learning University of Science and Technology of China(中国科学技术大学) University of California, San Diego(加州大学圣地亚哥分校) Griffith University(格里菲斯大学) East China Normal University(华东师范大学) Shanghai Jiao Tong University(上海交通大学)

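AI summary: Proposes SkillBrew, which models skill bank curation as Pareto optimization under a utility constraint and solves it with a bi-level propose-then-verify loop, yielding compact and diverse skill banks.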

Comments 16 pages. Preprint. Under review

AI中文摘要

检索增强的LLM智能体越来越依赖于精心策划的技能库:指导复杂任务决策的可重用文本原则集合。现有方法通常以仅追加的方式扩展这些库,不断添加新技能而不移除冗余、过时或有害的技能,导致存储库效率低下且策展不良。在本文中,我们将技能库策展形式化为一个受约束的多目标问题:一个理想的库必须对智能体有用、内容多样,并且对查询分布有良好的覆盖。为此,我们引入了SkillBrew,一个多目标策展框架,将技能库策展形式化为在效用约束下的帕累托感知优化,并通过双层提议-验证循环求解。我们在两个公共基准上评估了我们的方法。我们的发现表明,将技能库视为原则性策展的对象,而不是不断增长的仅追加日志,是构建自我改进的LLM智能体的重要一步。

英文摘要

Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append-only fashion, continuously adding new skills without removing redundant, outdated, or harmful ones, resulting in inefficient and poorly curated repositories. In this paper, we formulate skill bank curation as a constrained multi-objective problem: a desirable bank must be useful for the agent, diverse in its content, and provide good coverage of the query distribution. To this end, we introduce SkillBrew, a multi-objective curation framework that formalizes skill bank curation as Pareto-aware optimization under a utility constraint, and solves it via a bi-level propose-then-verify loop. We evaluate our approach on two public benchmarks. Our findings suggest that treating skill banks as objects of principled curation, rather than ever-growing append-only logs, is an important step toward building self-improving LLM agents.
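"Pareto-aware optimization under a utility constraint" can be illustrated with a small selection routine over candidate banks. This sketch covers only the selection idea, not the bi-level propose-then-verify loop; the `BankScore` type, the score values, and the choice of diversity and coverage as the two maximized objectives are assumptions made for illustration.

```python
from typing import NamedTuple


class BankScore(NamedTuple):
    name: str
    utility: float     # how much the bank helps the agent (constraint)
    diversity: float   # objective 1, higher is better
    coverage: float    # objective 2, higher is better


def dominates(a: BankScore, b: BankScore) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    no_worse = a.diversity >= b.diversity and a.coverage >= b.coverage
    better = a.diversity > b.diversity or a.coverage > b.coverage
    return no_worse and better


def pareto_front(candidates, min_utility: float):
    """Keep feasible candidates that no other feasible candidate dominates."""
    feasible = [c for c in candidates if c.utility >= min_utility]
    return [c for c in feasible if not any(dominates(o, c) for o in feasible)]


# Hypothetical scores for four candidate skill banks.
candidates = [
    BankScore("A", utility=0.8, diversity=0.5, coverage=0.9),
    BankScore("B", utility=0.8, diversity=0.7, coverage=0.6),
    BankScore("C", utility=0.8, diversity=0.4, coverage=0.5),  # dominated by A and B
    BankScore("D", utility=0.3, diversity=0.9, coverage=0.9),  # fails the constraint
]
front = pareto_front(candidates, min_utility=0.5)
```

Treating utility as a hard constraint rather than a third objective means a bank like D, which looks best on diversity and coverage, is excluded before the trade-off between the other objectives is considered.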

2605.29434 2026-05-29 cs.CR cs.AI cs.CL cs.LG 版本更新

AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing

AliMark: 增强句子级水印对文本释义的鲁棒性

Yuexin Li, Wenjie Qu, Linyu Wu, Yulin Chen, Yufei He, Tri Cao, Bryan Hooi, Jiaheng Zhang

发表机构 * National University of Singapore(新加坡国立大学)

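AI summary: Proposes AliMark, which reformulates sentence-level watermarking as a bit sequence encoding and alignment problem and uses a multi-candidate alignment detection strategy to improve robustness to structural perturbations such as sentence splitting and merging.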

Comments Accepted by ICML 2026

AI中文摘要

现有的句子级水印方法通过将水印锚定在句子语义中来增强对释义的鲁棒性。然而,它们基于前缀的设计仍然容易受到结构扰动的影响,例如句子拆分和合并,这些扰动在强释义器(如DIPPER和GPT-3.5)下经常出现。为了缓解这个问题,我们提出了AliMark,一个将句子级水印重构为潜在水印文本与秘密比特序列之间的比特序列编码和对齐问题的框架。值得注意的是,我们的方法采用了两阶段检测策略:我们生成多个重构的文本变体,并自适应地将它们提取的比特序列与秘密比特序列对齐,以最小化对齐成本。这种多候选对齐设计自然地提高了对句子合并和拆分的鲁棒性。大量实验表明,在多种释义攻击下,AliMark显著优于最先进的基线方法。

英文摘要

Existing sentence-level watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, their prefix-based designs remain vulnerable to structural perturbations, such as sentence splitting and merging, which commonly arise under strong paraphrasers like DIPPER and GPT-3.5. To mitigate this issue, we propose AliMark, a framework that reformulates sentence-level watermarking as a bit sequence encoding and alignment problem between a potentially watermarked text and a secret bit sequence. Notably, our approach adopts a two-stage detection strategy: we generate multiple restructured text variants and adaptively align their extracted bit sequences with the secret bit sequence to minimize alignment cost. This multi-candidate alignment design naturally improves robustness to sentence merges and splits. Extensive experiments demonstrate that AliMark substantially outperforms state-of-the-art baselines under diverse paraphrasing attacks.
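The detection idea (extract a bit sequence from each restructured variant, align it with the secret sequence, keep the cheapest alignment) can be sketched with a standard edit distance. The abstract does not define the alignment cost, so using Levenshtein distance here is an assumption, as are the example bit strings.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def best_alignment(candidate_bits, secret: str):
    """Return (bits, cost) for the candidate closest to the secret sequence."""
    costs = [edit_distance(bits, secret) for bits in candidate_bits]
    best = min(range(len(costs)), key=costs.__getitem__)
    return candidate_bits[best], costs[best]


# Hypothetical secret and bit sequences extracted from two text variants.
secret = "101101"
variants = ["000000", "10101"]
chosen, cost = best_alignment(variants, secret)
```

An alignment cost that allows insertions and deletions is what makes sentence merges and splits cheap: a merged sentence drops one bit from the sequence, which costs a single deletion instead of shifting every later bit out of position.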

2605.29430 2026-05-29 cs.AI cs.CL 版本更新

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

迈向具有智能体纠正和语义评估的类人交互式语音识别

Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen

发表机构 * College of Artificial Intelligence, Xi’an Jiaotong University(西安交通大学人工智能学院) X-LANCE Lab, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University(上海交通大学电子信息与电气工程学院X-LANCE实验室) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Fudan University(复旦大学) Tongyi Fun Team, Alibaba Group(阿里云通义团队)

AI总结 提出Agentic ASR闭环框架,通过多轮交互和语义纠正减少语义错误,并引入句子级语义错误率(S^2ER)作为评估指标。

详情
AI中文摘要

自动语音识别(ASR)是人机交互的核心组成部分,也是基于LLM的助手和智能体日益重要的前端。然而,当前大多数ASR系统仍遵循单遍范式,这与人类通信方式不一致——在人类通信中,误解通过迭代澄清和修正来解决。这种不匹配使得一旦发生意义关键的错误,很难纠正。同时,词错误率(WER)或字符错误率(CER)等词级指标无法充分反映此类问题。为解决这些局限,我们将交互式ASR形式化为多轮修正任务,并提出Agentic ASR,一种结合单遍ASR前端与语义纠正、意图路由和基于推理编辑的闭环框架。我们进一步引入句子级语义错误率(S^2ER),一种基于LLM的语义评估指标,以及交互式仿真系统,用于可扩展和可复现的基准测试。在多语言、命名实体密集和代码切换基准上的实验表明,迭代交互持续减少语义错误,在S^2ER上的提升远大于传统词级指标。人机对齐和消融研究进一步验证了语义判断器的可靠性和所提框架的鲁棒性。代码见:https://interactiveasr.github.io/,在线演示见:https://i-asr.sjtuxlance.com/

英文摘要

Automatic speech recognition (ASR) is a core component of human--computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate \emph{Interactive ASR} as a multi-turn refinement task and propose \textbf{Agentic ASR}, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the \textbf{Sentence-level Semantic Error Rate} ($S^2ER$), an LLM-based semantic evaluation metric, together with an \textbf{Interactive Simulation System} for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human--AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/
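
As a rough illustration of a sentence-level semantic error rate, the sketch below counts the share of sentences that a judge considers semantically wrong. In the paper the judge is LLM-based; here it is an arbitrary callable, and `toy_judge` is a hypothetical placeholder that only ignores case and punctuation. The exact $S^2ER$ definition is not given in the abstract, so this is an assumed reading.

```python
from typing import Callable


def s2er(pairs: list[tuple[str, str]],
         is_equivalent: Callable[[str, str], bool]) -> float:
    """Fraction of (reference, hypothesis) sentence pairs judged semantically wrong."""
    if not pairs:
        raise ValueError("need at least one sentence pair")
    errors = sum(1 for ref, hyp in pairs if not is_equivalent(ref, hyp))
    return errors / len(pairs)


def toy_judge(ref: str, hyp: str) -> bool:
    """Placeholder judge: equal after lowercasing and dropping punctuation."""
    def norm(s: str) -> list[str]:
        return "".join(ch for ch in s.lower() if ch.isalnum() or ch.isspace()).split()
    return norm(ref) == norm(hyp)
```

With this judge, a casing or punctuation difference is not an error, while a single substituted word that changes the meaning makes the whole sentence count as one error.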

2605.29427 2026-05-29 cs.CL 版本更新

FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions

FinGuard:检测LLM交互中的金融监管违规

Huaixia Dou, Jie Zhu, Minghao Wu, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang

发表机构 * Qwen DianJin Team, Alibaba Cloud Computing(阿里云计算Qwen金融团队) Tongyi Lab, Alibaba Group(阿里集团通义实验室) School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)

AI总结 针对金融领域LLM交互中的监管违规检测问题,提出基于监管文档的自动化管道,构建首个金融合规检测基准FinGuard-Bench,并训练FinGuard模型,在基准上显著优于现有方法。

详情
AI中文摘要

随着大型语言模型(LLM)在金融服务中的部署日益增多,一次不合规的交互就可能使机构面临监管处罚并直接损害消费者利益。现有的防护模型围绕通用危害分类构建,忽略了基于特定金融法规的违规行为。我们通过一个直接操作监管文档的监管驱动管道来弥补这一空白,该管道归纳出金融合规风险分类,并在没有任何预定义违规类别的情况下合成基于监管的训练数据。将该管道应用于中国金融法规,我们发布了FinGuard-Bench,据我们所知,这是首个金融监管合规检测基准,在查询和回复层面均带有专家标注的标签。我们进一步训练了FinGuard,这是一个基于Qwen3-8B构建的金融合规检测模型,通过监督微调和自我对弈强化学习在基于监管的数据上进行训练。在FinGuard-Bench上,FinGuard显著优于所有基线,包括专用防护模型和更大的通用LLM,如Qwen3.5-397B-A17B和GPT-5.1。此外,FinGuard还保留了通用安全能力,并能仅使用政策文档适应未见过的机构特定政策。我们将在GitHub上公开发布本工作中使用的代码、提示和资源。

英文摘要

As large language models (LLMs) are increasingly deployed in financial services, a single non-compliant interaction can expose institutions to regulatory penalties and direct consumer harm. Existing guard models are built around general harm taxonomies and overlook violations grounded in specific financial regulations. We address this gap with a regulation-driven pipeline that operates directly on regulatory documents, inducing a financial compliance risk taxonomy and synthesizing grounded training data without any predefined violation categories. Instantiating the pipeline on Chinese financial regulations, we release \textbf{FinGuard-Bench}, to our knowledge the first benchmark for financial regulatory compliance detection, with expert-annotated labels at both the query and response levels. We further train \textbf{FinGuard}, a financial compliance detection model built on Qwen3-8B and trained on the regulation-grounded data via supervised fine-tuning and self-play reinforcement learning. On FinGuard-Bench, FinGuard substantially outperforms all baselines, including dedicated guard models and much larger general-purpose LLMs such as Qwen3.5-397B-A17B and GPT-5.1. Furthermore, FinGuard also preserves general safety capabilities and adapts to unseen institution-specific policies using policy documents alone. We will publicly release the code, prompts, and resources used in this work on GitHub.

2605.29421 2026-05-29 cs.CL 版本更新

Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design

将设计技能学习为记忆策略用于智能光子逆向设计

Shengchao Chen, Ting Shu, Sufen Ren

发表机构 * AAII, University of Technology Sydney(AAII,悉尼科技大学) School of Artificial Intelligence, Shenzhen University(深圳大学人工智能学院) School of Information and Communication Engineering, Hainan University(海南大学信息与通信工程学院)

AI总结 提出SkillPCF闭环智能体框架,通过物理引导的记忆技能库、强化学习技能选择和模拟器接地技能演化,解决光子晶体光纤逆向设计中的知识积累问题,在真实数据集上实现更优的设计质量与效率权衡。

Comments AI4Physics@ICML 2026

详情
AI中文摘要

光子晶体光纤(PCF)逆向设计仍然具有挑战性,因为候选几何形状必须在昂贵的电磁模拟下满足耦合的光学目标。现有流程改进了代理预测或一次性参数推荐,但未能在迭代试验中积累可重用的设计知识。我们将PCF逆向设计表述为记忆策略学习问题,并提出SkillPCF,一个闭环智能体框架,结合了物理引导的记忆技能库、强化学习的技能选择和模拟器接地的技能演化。我们进一步构建了一个真实世界数据集,包含479个专家交互轨迹(2507个跨度)和553个记忆依赖的评估查询,涵盖色散工程、损耗优化和多目标设计。在多个LLM骨干和经典基线上的实验表明,SkillPCF在实际模拟预算下实现了更强的设计质量和效率权衡,证明了我们提出的记忆技能学习范式在物理感知的PCF逆向设计中的有效性。

英文摘要

Photonic crystal fiber (PCF) inverse design remains challenging because candidate geometries must satisfy coupled optical targets under expensive electromagnetic simulation. Existing pipelines improve surrogate prediction or one-shot parameter recommendation, but they do not accumulate reusable design knowledge across iterative trials. We formulate PCF inverse design as a memory-policy learning problem and propose SkillPCF, a closed-loop agent framework that combines a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded skill evolution. We further construct a real-world dataset with 479 expert interaction traces (2,507 spans) and 553 memory-dependent evaluation queries covering dispersion engineering, loss optimization, and multi-objective design. Experiments across multiple LLM backbones and classical baselines show that SkillPCF achieves stronger design-quality and efficiency trade-offs under practical simulation budgets, demonstrating the effectiveness of our proposed memory-skill learning paradigm for physics-aware PCF inverse design.

2605.29414 2026-05-29 cs.CL cs.AI 版本更新

Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning

超越双语迁移:指令微调中的多语言代码切换

Shunta Asano, Jeonghun Baek, Toshihiko Yamasaki

发表机构 * The University of Tokyo(东京大学)

AI总结 本研究通过跨四种语言的句子级多语言代码切换指令微调,验证了多语言代码切换能有效提升大语言模型的多语言理解性能,超越了传统双语迁移设置。

详情
AI中文摘要

近期研究表明,代码切换数据(CSD)——即在同一上下文中混合多种语言——可以改善大语言模型(LLMs)的跨语言迁移和多语言对齐。然而,现有研究主要关注英语与目标语言之间的双语迁移,涉及三种或更多语言的多语言设置在很大程度上尚未被探索。在本工作中,我们研究了跨四种语言(英语、日语、韩语和中文)的多语言代码切换指令微调。我们在Belebele上评估多语言理解能力。我们的实验表明,简单的句子级多语言CSD持续提高了所有四种语言的平均多语言性能,表明多语言代码切换在双语迁移设置之外也能有效。

英文摘要

Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). However, existing studies primarily focus on bilingual transfer between English and a target language, leaving multilingual settings involving three or more languages largely unexplored. In this work, we investigate multilingual code-switching instruction tuning across four languages: English, Japanese, Korean, and Chinese. We evaluate multilingual understanding on Belebele. Our experiments show that simple sentence-level multilingual CSD consistently improves average multilingual performance across all four languages, indicating that multilingual code-switching can be effective beyond bilingual transfer settings.

2605.29400 2026-05-29 cs.AI cs.CL cs.HC 版本更新

Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

面向屏幕条件动作预测的架构敏感监督微调:PiSAR基准

Rahul Bissa, Abhishek Vyas, Yash Jain

发表机构 * AprioriLabs(Apriori实验室)

AI总结 通过PiSAR基准评估监督微调模型与前沿零样本模型在屏幕锚定行为预测上的性能,发现微调Qwen3-VL-8B-Instruct显著优于前沿基线,而Gemma-4-26B-A4B-IT微调效果不佳,揭示模型与微调方法不匹配问题。

Comments 14 pages, 7 figures, 2 tables. PiSAR corpus and fine-tuned weights are proprietary to AprioriLabs; methodology and recipe released

详情
AI中文摘要

我们在PiSAR(Persona, intent, Screen, Action, Rationale)的一个661行保留子集上,对三个监督微调模型与前沿零样本基线进行了基准测试。PiSAR是一个包含12,929个元组的屏幕锚定行为理由语料库,从公开的应用商店评论、Pew美国趋势面板人口统计数据以及OPeRA购物者轨迹中整理得到。每个模型,无论是前沿模型还是微调模型,都在相同的661行子集上使用相同的评分流程进行评估。有两个发现。第一,前沿零样本基线(Claude Opus 4.7和GPT-5.5)分别达到sem_sim 0.459和0.482;而微调的Qwen3-VL-8B-Instruct达到0.783,并且在79%的行上sem_sim >= 0.7,而两个前沿基线仅为1-2%,在同一测试集上绝对差距为0.30。第二,相同的训练数据和配方在Gemma-4-26B-A4B-IT上仅得0.441,与前沿零样本基线处于同一水平,而非微调的Qwen。我们将其解读为配方与模型不匹配:经过推理调优的高参数模型抵抗位移,可能需要更多数据或更强的微调方法。

英文摘要

We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.

2605.29397 2026-05-29 cs.CL 版本更新

Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework

重新审视Web智能体的观察缩减:基于轻量级框架的综合评估

Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada

发表机构 * NEC Corporation(日本电报电话公司)

AI总结 针对LLM Web智能体中HTML观察过长的问题,提出基于最小失败集(MFS)的轻量级评估框架,通过覆盖率代理指标大幅加速评估,并优化剪枝程序实现2.2-3.1倍延迟降低同时保持84-89%成功率。

Comments 22 pages, 8 figures, 4 tables

详情
AI中文摘要

基于LLM的Web智能体中的HTML观察非常长,尽管已经提出了许多缩减方法,但仍不清楚哪些方法能在保持性能的同时降低整体智能体延迟。主要障碍是端到端评估的高成本:在我们的实验中,在WorkArena L1的33个任务上评估32种配置下的11种方法需要232.4累计小时。为解决此问题,我们提出了一个基于最小失败集(MFS)的轻量级评估框架,MFS是导致任务失败的最小HTML元素集合。我们将覆盖率定义为缩减方法完全保留MFS的实例比例,作为无需网络访问或LLM推理的代理指标。我们验证了覆盖率与端到端成功率强相关,在两个基准测试上累计评估时间加速超过100倍。利用该框架,我们发现提取式HTML缩减方法需要高计算成本或领域特定优化才能在保持性能的同时降低智能体延迟。在此基础上,我们在MFS训练数据上优化了一个剪枝程序,在WorkArena L1上实现了每步延迟2.2倍加速,同时保留了84%的原始成功率,在WebLinx上实现了3.1倍加速,保留了89%。

英文摘要

HTML observations in LLM-based web agents are extremely long, and while many reduction methods have been proposed, it remains unclear which methods reduce overall agent latency while maintaining performance. The main obstacle is the high cost of end-to-end evaluation: in our experiments, evaluating 11 methods across 32 configurations on 33 tasks of WorkArena L1 required 232.4 cumulative hours. To address this, we propose a lightweight evaluation framework based on the Minimal Failure Set (MFS), the minimal set of HTML elements whose removal causes task failure. We define coverage as the fraction of instances in which a reduction method fully retains the MFS, which serves as a proxy metric that requires neither web access nor LLM inference. We validate that coverage strongly correlates with end-to-end success rate, with over 100$\times$ speedup in cumulative evaluation time on both benchmarks. Using this framework, we find that extractive HTML reduction methods require either high computation cost or domain-specific optimization to reduce agent latency while maintaining performance. Building on this, we optimize a pruning program on MFS training data, achieving 2.2$\times$ faster per-step latency on WorkArena L1 while retaining 84\% of the original success rate, and 3.1$\times$ faster on WebLinx while retaining 89\%.
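
The coverage metric described above reduces to a set-containment check. The following is a minimal sketch, assuming each instance is represented by the set of element IDs a reduction method kept and the set of element IDs in its Minimal Failure Set. The function name and the ID representation are illustrative assumptions, not the authors' implementation.

```python
def coverage(instances: list[tuple[set[str], set[str]]]) -> float:
    """Fraction of instances whose MFS is fully retained.

    Each instance is (kept_ids, mfs_ids): the HTML elements the reduction
    method kept, and the elements whose removal would cause task failure.
    """
    if not instances:
        raise ValueError("need at least one instance")
    retained = sum(1 for kept, mfs in instances if mfs <= kept)  # subset test
    return retained / len(instances)
```

Because the check touches only precomputed ID sets, it needs neither a browser nor an LLM call, which is where the reported speedup over end-to-end evaluation comes from.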

2605.29392 2026-05-29 cs.SE cs.CL cs.CY cs.HC 版本更新

Offloading Score: Measuring AI Reliance Through Counterfactual Workflows

卸载分数:通过反事实工作流衡量AI依赖度

Vishakh Padmakumar, Lujain Ibrahim, Zora Zhiruo Wang, Jennifer Wang, Q. Vera Liao, Diyi Yang

发表机构 * Stanford University(斯坦福大学) University of Oxford(牛津大学) Carnegie Mellon University(卡内基梅隆大学) University of Michigan(密歇根大学)

AI总结 本文提出卸载分数(offloading score),一种通过构建反事实工作流量化用户向AI工具卸载认知努力比例的依赖度度量,并通过内在验证和用户实验证明其能检测时间压力下的依赖度变化。

Comments Preprint

详情
AI中文摘要

AI工具日益集成到实际工作流中。然而,现有对这些工具依赖度的衡量侧重于AI输出采纳或自我报告指标,而非用户与工具之间任务努力的分配。本文引入卸载分数(offloading score),一种衡量依赖度的指标,量化卸载到AI工具的认知努力比例。卸载分数基于模拟——我们通过估计用户在没有工具的情况下如何完成任务来构建反事实工作流,然后计算使用工具节省的步骤比例。我们通过指标有效性的内在评估和一项受控用户研究(n=40,开发者使用AI工具执行编程任务)来验证卸载分数。我们改变时间压力,以测试依赖度指标是否能捕捉到时间压力下依赖度的已知增加。我们表明,卸载分数在时间受限条件下检测到显著更高的依赖度(+43%,p=0.018),而基于使用和基于自我报告的依赖度基线指标无法区分这些条件。我们通过描述性见解补充说明,更高的依赖度表现为将子任务更多地委托给工具以及更直接地重用AI输出。最后,我们展示了一种将卸载分数与任务目标结果(例如代码理解)结合使用的方法,以识别依赖度何时可能(不)适当。我们的框架提供两个贡献:用户可用来衡量和反思自身依赖度的工具,以及代理设计者可用于减轻过度依赖的定量信号。

英文摘要

AI tools are increasingly integrated into real-world workflows. However, existing measures of reliance on these tools focus on AI output adoption or on self-reported indicators, rather than how task effort is distributed between users and tools. Here, we introduce offloading score, a measure of reliance that quantifies the fraction of cognitive effort offloaded to an AI tool. Offloading Score is simulation-based -- we construct a counterfactual workflow by estimating how the user would have completed the task without the tool, and then computing the fraction of steps saved by using the tool. We validate offloading score through intrinsic evaluations of metric validity, and a controlled user study ($n=40$) with developers performing programming tasks using AI tools. We vary time pressure to test whether reliance measures capture the known increase in reliance under time pressure. We show that offloading score detects significantly higher reliance in time-constrained settings ($+43\%$, $p=0.018$), while usage-based and self-reported baseline measures of reliance do not distinguish the conditions. We complement this with descriptive insights showing that higher reliance manifests as greater delegation of subtasks to the tool and more direct reuse of AI outputs. Finally, we demonstrate an approach of using offloading score in combination with target outcomes of a task (e.g., code understanding) to identify when reliance may be (in)appropriate. Our framework offers two contributions: an instrument users can apply to measure and reflect on their own reliance, and a quantitative signal that agent designers can utilize to mitigate overreliance.
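
A minimal sketch of the "fraction of steps saved" computation follows, assuming the counterfactual workflow has already been estimated and reduced to a step count. How steps are defined and counted is not specified in the abstract. The function name, the arguments, and the clamping at zero are assumptions made for illustration.

```python
def offloading_score(counterfactual_steps: int, user_steps_with_tool: int) -> float:
    """Share of the estimated tool-free workflow that the user did not perform."""
    if counterfactual_steps <= 0:
        raise ValueError("counterfactual workflow must contain at least one step")
    saved = counterfactual_steps - user_steps_with_tool
    # Clamp at zero: using the tool is assumed never to count as negative offloading.
    return max(0.0, saved / counterfactual_steps)
```

For example, if the estimated tool-free workflow has 10 steps and the user performed 4 of them, 60% of the effort counts as offloaded under this reading.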

2605.29384 2026-05-29 cs.IR cs.AI cs.CL 版本更新

Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies

潜在词:密集检索器包含可轻易提取的符合齐夫分布的BM25就绪词汇表

Benjamin Clavié, Sean Lee, Aamir Shakir, Makoto P. Kato

发表机构 * Mixedbread AI National Institute of Informatics(国家信息研究所) University of Tsukuba(筑波大学)

AI总结 提出潜在词方法,揭示密集检索模型(单向量或多向量)学习到的表示可轻易分解为稀疏特征,通过稀疏自编码器提取潜在词汇表,无需检索特定调整即可直接用于BM25稀疏检索,匹配或超越原模型及SPLADE变体。

详情
AI中文摘要

我们提出潜在词方法,该方法揭示了训练用于密集检索的模型(无论是单向量还是多向量)学习到的表示可以轻易地分解为检索就绪的稀疏特征。当在冻结的检索器上训练时,无需任何检索特定调整的稀疏自编码器能够提取一个具有近似齐夫分布集合统计量的潜在词汇表,直接适用于通过BM25进行的经典稀疏检索评分。这种方法实现了稀疏检索,同时不需要任何学习到的扩展目标或稀疏检索监督,并且可以轻松应用于任何密集检索器。潜在词能够匹配或超越其自身基础模型以及可比较的SPLADE变体的单向量评分方法。此外,在专门设计用于突出单向量检索失败的任务LIMIT上,它显著优于其基础模型。总体而言,我们的结果强调了神经检索器包含比其默认评分函数所暴露的更具表达力和可索引的结构,但其他方法仍然可以利用这些结构。

英文摘要

We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations that can trivially be decomposed into retrieval-ready sparse features. When trained on frozen retrievers, Sparse Autoencoders without any retrieval-specific adjustments extract a latent vocabulary with approximately Zipfian collection statistics, directly suitable for classical sparse retrieval scoring via BM25. This approach enables sparse retrieval while requiring no learned expansion objective or sparse retrieval supervision whatsoever, and can be readily applied to any dense retriever. Latent Terms is able to match or outperform single-vector scoring methods from its own base model as well as comparable SPLADE variants. In addition, it substantially outperforms its base model on LIMIT, a task specifically designed to highlight the failures of single-vector retrieval. Overall, our results highlight that neural retrievers contain more expressive and indexable structure than their default scoring functions expose, but that other methods can nonetheless be leveraged.
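
To show what "BM25-ready" means in practice, here is a hedged sketch of standard Okapi BM25 scoring applied to documents represented as bags of latent term IDs. The integer IDs stand in for active sparse-autoencoder features. How the paper turns activations into term counts is not described in the abstract, so that step is assumed to have happened already.

```python
import math
from collections import Counter


def bm25_scores(query_terms: list[int], docs: list[list[int]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of every document for a bag of (latent) query terms."""
    if not docs:
        raise ValueError("need at least one document")
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # term absent from this document
            idf = math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
        scores.append(score)
    return scores
```

BM25 relies on inverse document frequency to down-weight common terms, which is why a roughly Zipfian distribution of latent terms matters: it gives the IDF factor a meaningful spread between frequent and rare features.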

2605.29379 2026-05-29 cs.CL cs.LG 版本更新

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

BrahmicTokenizer-131K:一种可替代o200k_base的印度文字兼容分词器

Rohan Shravan

发表机构 * The School of AI(人工智能学院)

AI总结 提出BrahmicTokenizer-131K,一种131072词汇量的字节级BPE分词器,通过两阶段改造在保持非印度文字性能的同时,显著提升印度文字的压缩效率。

Comments 24 pages, 15 tables, 3 code listings. Tokenizer artifact, verification scripts, and reproduction code at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K and https://github.com/theschoolofai/BrahmicTokenizer-131K

详情
AI中文摘要

我们提出了BrahmicTokenizer-131K,一种131,072词汇量的字节级BPE分词器,它在131K词汇量类别中弥合了印度文字(Brahmic)的压缩差距,同时保留了OpenAI的o200k_base在英语、欧盟语言和代码方面的压缩性能。我们通过两阶段改造构建了它:(1)脚本剪枝裁剪,通过移除九个不相关书写系统将200,019个令牌减少到131,072个;(2)外科手术式改造,通过线性规划分配在九个印度文字Unicode块中填充2,372个语料库中缺失的词汇槽位。预分词器、解码器和继承的合并规则与o200k_base保持不变,使得BrahmicTokenizer-131K在分词器接口上成为即插即用的替代品。 在2700万份公开印度语预训练文本(28.4亿词,46.21 GB)上,BrahmicTokenizer-131K在相同词汇预算下产生的令牌比Mistral-Nemo Tekken / Sarvam-m少26.7%,每种语言的节省幅度从15.79%(泰米尔语)到76.79%(奥里亚语,压缩比4.31倍)。奥里亚语的优势在机制上可解释为Tekken/Sarvam-m包含零个奥里亚语块令牌;我们的改造添加了725个。在非印度语内容上,BrahmicTokenizer-131K与o200k_base的英语词汇生育率相当(1.235 vs 1.232令牌/词),并在HumanEval、MBPP和GSM8K上比Tekken/Sarvam-m好4.0-14.2%。在我们的14个分词器基准测试中,它是唯一一个在131K预算下同时在印度文字、英语、欧盟语言、代码和数学上具有竞争力的分词器。其他词汇类别的专用分词器(Sarvam-30B、Sarvam-1、MUTANT-Indic)以牺牲非印度语性能为代价实现了更好的印度语压缩:Sarvam-1的英语词汇生育率比我们差15.9%,其代码/数学压缩比我们差26-33%。我们在Apache 2.0许可下发布该工件,地址为https://huggingface.co/theschoolofai/BrahmicTokenizer-131K。

英文摘要

We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam-m containing zero Oriya-block tokens; our surgery added 725. On non-Indic content, BrahmicTokenizer-131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. Across our 14-tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam-30B, Sarvam-1, MUTANT-Indic) achieve better Indic compression at the cost of non-Indic performance: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K.
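
The reported figures can be related with a few lines of arithmetic. This sketch assumes that "fertility" means tokens per word and that the compression ratio is baseline tokens divided by the new tokenizer's tokens. Both readings are consistent with the numbers quoted above, and the helper names are illustrative.

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Average number of tokens produced per word."""
    return num_tokens / num_words


def token_savings(our_tokens: int, baseline_tokens: int) -> float:
    """Fraction of tokens saved relative to a baseline tokenizer."""
    return 1 - our_tokens / baseline_tokens


def compression_ratio(savings: float) -> float:
    """Baseline-to-ours token ratio implied by a savings fraction."""
    return 1 / (1 - savings)


# 76.79% fewer tokens on Odia corresponds to roughly a 4.31x ratio.
odia_ratio = round(compression_ratio(0.7679), 2)
```

The same relation shows why the Tamil figure is much less dramatic: a 15.79% saving corresponds to a ratio of about 1.19x.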

2605.29368 2026-05-29 cs.CL cs.AI 版本更新

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

SURGENT: 一种跨围手术期工作流程的手术多智能体辅助系统

Dongsheng Shi, Yue Li, Xin Yi, Yongyi Cui, Huawei Feng, Linlin Wang

发表机构 * East China Normal University(华东师范大学) City University of Hong Kong(香港城市大学)

AI总结 提出SURGENT手术多智能体辅助系统,结合思维树规划器、多科室协作智能体和检索增强推理,通过新型记忆设计管理长期患者病史和短期工作摘要,在五项围手术期任务中优于基线LLM和现有医疗多智能体框架。

Comments preprint

详情
AI中文摘要

现代外科护理的复杂性需要智能系统能够综合大量患者记录,支持协作决策,并在整个围手术期工作流程中提供透明、可审计的推理。尽管基于网络的大型语言模型(LLM)具有先进的推理能力,但由于输入长度限制、不完整的记忆管理和有限的可追溯性等关键限制,它们不适合外科应用。为了解决这个问题,我们提出了SURGENT,一种手术多智能体辅助系统,它结合了思维树规划器、多科室协作智能体以及基于临床指南和生物医学文献的检索增强推理。SURGENT具有一种新颖的记忆设计,可以管理长期患者病史和短期工作摘要,从而实现更完整、情境化和一致的推理。在五项关键围手术期任务(病例分析、手术计划模拟、安全监测、并发症风险评估和康复指导)上的实验评估表明,SURGENT优于基线LLM和现有的医疗多智能体框架,生成的推荐与患者病史更加一致。消融研究进一步突出了DeepSeek作为本地可部署骨干模型的优势,使其能够在无需依赖集中服务的情况下实现隐私保护部署。这些结果使SURGENT成为迈向智能、公平和安全的外科辅助系统的实用且可信的进步。

英文摘要

The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collaborative decision-making, and provide transparent, auditable reasoning across the entire perioperative workflow. Although web-based Large Language Models (LLMs) possess advanced reasoning capabilities, they are ill-equipped for surgical applications due to critical limitations: input length constraints, incomplete memory management, and limited traceability. To address this issue, we present SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning with clinical guidelines and biomedical literature. SURGENT features a novel memory design that manages both long-term patient histories and short-term working summaries, enabling more complete, contextualized, and consistent reasoning. Experimental evaluations across five key perioperative tasks - case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance - show that SURGENT outperforms baseline LLMs and existing medical multi-agent frameworks, yielding recommendations more closely aligned with patient histories. Ablation studies further highlight the advantage of DeepSeek as a locally deployable backbone model, enabling privacy-preserving deployment without reliance on centralized services. These results position SURGENT as a practical and trustworthy advancement toward intelligent, equitable, and secure surgical assistance systems.

2605.29367 2026-05-29 cs.CL cs.CY cs.SI 版本更新

Attention Asymmetry in AI Layoff Discourse on X: A Computational Analysis of Capital vs Labour Amplification

X平台上AI裁员话语中的注意力不对称性:资本与劳动放大的计算分析

Joy Bose

发表机构 * Independent Researcher(独立研究员)

AI总结 通过收集X平台推文,使用账户级收集方法发现资本话语的放大效应是劳动话语的3.12倍,经粉丝数标准化后仍存在2.69倍的不对称性,并引入放大比和放大归一化指数作为平台话语不平等的度量指标。

Comments 18 pages, 3 figures, 9 tables

详情
AI中文摘要

当工人因AI驱动的重组而失业时,X(前Twitter)上同时发生两种截然不同的对话。科技高管和AI研究人员谈论生产力、转型和机遇。被解雇的工人和劳工批评者谈论失业、不确定性和恐惧。本文提出一个简单问题:哪种对话获得更多传播?我们报告了三项研究,使用两种收集方法和来自20个知名公共账户的763条推文。研究1使用基于关键词的收集(n=392),发现语料库之间无显著差异(p=0.891),表明关键词搜索对此任务噪声过大。研究2使用基于账户的收集(n=96),发现资本话语的平均放大优势是劳动话语的3.12倍(p=0.000003,Cohen's d=0.555)。研究3结合两种方法(n=763),确认了平均放大比4.18倍和中位数放大比10.77倍的结果(p<0.000001)。关键的是,在按粉丝数标准化后,不对称性仍然存在,为2.69倍(p=0.000009,Cohen's d=0.491),表明该效应并非仅仅是资本账户拥有更大受众的结果。该发现在所有测试的放大度量权重下均稳健。我们引入放大比和放大归一化指数作为衡量平台级话语不平等的简单指标。在Reddit上的跨平台复制(n=647条帖子)未复制该发现,表明不对称性可能特定于X基于账户的放大架构。我们讨论了跨平台话语分析的方法论意义。

英文摘要

When workers lose jobs to AI-driven restructuring, two very different conversations happen on X (formerly Twitter) at the same time. Tech executives and AI researchers talk about productivity, transformation, and opportunity. Laid-off workers and labour critics talk about job loss, uncertainty, and fear. This paper asks a simple question: which conversation gets more reach? We report three studies using two collection methods and 763 tweets from 20 named public accounts. Study 1 used keyword-based collection (n=392) and found no significant difference between corpora (p=0.891), revealing that keyword search is too noisy for this task. Study 2 used account-based collection (n=96) and found a 3.12x mean amplification advantage for capital discourse over labour discourse (p=0.000003, Cohen's d=0.555). Study 3 combined both methods (n=763) and confirmed the finding at 4.18x mean and 10.77x median amplification ratio (p<0.000001). Critically, after normalising for follower count, the asymmetry persists at 2.69x (p=0.000009, Cohen's d=0.491), demonstrating that the effect is not simply a consequence of capital accounts having larger audiences. The finding is robust across all tested amplification metric weightings. We introduce the Amplification Ratio and Amplification Normalisation Index as simple metrics for measuring platform-level discourse inequality. A cross-platform replication on Reddit (n=647 posts) did not replicate the finding, suggesting the asymmetry may be specific to X's account-based amplification architecture. We discuss the methodological implications for cross-platform discourse analysis.

2605.29340 2026-05-29 cs.CL 版本更新

A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities

面向LLM安全评估的问答数据集研究:聚焦非法活动

Kenji Imamura, Masao Ideuchi, Atsushi Fujita

发表机构 * National Institute of Information and Communications Technology(日本信息与通信技术研究所)

AI总结 本文通过人工分析AnswerCarefully数据集,提出额外信息、问答示例创建方法和评估准则,用于评估LLM在非法活动方面的安全性。

Comments 10 pages, 1 figure

详情
AI中文摘要

在本文中,我们讨论了用于LLM安全评估的问答数据集,重点关注非法活动。具体来说,在人工分析AnswerCarefully的基础上,我们引入了若干额外信息、创建问答示例的方法以及评估LLM生成响应的准则。本研究的结果旨在与“JAI-Trust”项目共享。

英文摘要

In this paper, we discuss a question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of a manual analysis of AnswerCarefully, we introduce several pieces of additional information, methods for creating question-answer examples, and a rubric for evaluating LLM-generated responses. The outcomes of this study are intended to be shared with the "JAI-Trust" project.

2605.29336 2026-05-29 cs.CL 版本更新

Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding

通过最小贝叶斯风险解码在摘要中实现基于共识和一致性的事实性增强

Riza Setiawan Soetedjo, Yusuke Sakai, Hidetaka Kamigaito, Jingun Kwon, Manabu Okumura, Taro Watanabe

发表机构 * Nara Institute of Science and Technology(奈良先端科学技术大学院大学) Chungnam National University(忠南国立大学) Institute of Science Tokyo(东京科学大学)

AI总结 提出ConSUM方法,利用最小贝叶斯风险解码建立候选摘要间的共识,并结合与源文档的一致性指标进行重排序,以提升摘要的事实性。

Comments Accepted to ACL 2026 Findings

详情
AI中文摘要

提高模型生成摘要的质量,尤其是事实性(摘要相对于源内容的准确性)仍然是一个挑战。虽然重排序可以从多个生成候选中选择最优输出,但它仅限于使用源作为指导,导致摘要不可靠。为了解决这一局限性,我们提出了ConSUM,该方法通过考虑两个因素对候选摘要进行重排序:与源文档的一致性以及与其他候选之间的共识。共识是通过对生成的摘要集进行最小贝叶斯风险(MBR)解码建立的,同时通过使用将摘要与源进行比较的事实性感知指标来确保一致性。严格的测试表明,我们的系统与现有方法具有竞争力,人工评估进一步证实其生成的摘要优于其他系统。我们的代码可在https://github.com/naist-nlp/ConSUM获取。

英文摘要

Improving the quality of model-generated summaries, especially factuality, the accuracy of a summary with respect to its source content, remains a challenge. While reranking could select the optimal output from multiple generated candidates, it is limited to only using the source as guidance, resulting in unreliable summaries. To address this limitation, we propose ConSUM, which reranks candidate summaries by considering two factors: consistency with the source document and consensus among the other candidates. Consensus is established using Minimum Bayes Risk (MBR) decoding over the set of generated summaries, while consistency is ensured by employing factuality-aware metrics that compare the summary against the source. Rigorous testing demonstrates that our system is competitive with existing methods, with human evaluations further confirming that its generated summaries are preferred over those from other systems. Our code is available at https://github.com/naist-nlp/ConSUM .
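
The combination of consensus and consistency can be sketched as follows. This is not the ConSUM implementation. Token-overlap F1 stands in for the MBR utility function, `consistency` is any caller-supplied scorer of a summary against its source, and the linear `weight` is an assumed way of combining the two signals.

```python
from collections import Counter
from typing import Callable


def overlap_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two texts (stand-in MBR utility)."""
    ta, tb = Counter(a.split()), Counter(b.split())
    common = sum((ta & tb).values())
    if common == 0:
        return 0.0
    precision = common / sum(ta.values())
    recall = common / sum(tb.values())
    return 2 * precision * recall / (precision + recall)


def rerank(candidates: list[str], consistency: Callable[[str], float],
           weight: float = 0.5) -> str:
    """Pick the candidate with the best mix of consensus and consistency."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")

    def score(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        # MBR-style consensus: expected utility against the other candidates.
        consensus = sum(overlap_f1(candidates[i], o) for o in others) / len(others)
        return weight * consensus + (1 - weight) * consistency(candidates[i])

    return candidates[max(range(len(candidates)), key=score)]
```

Setting `weight=1.0` gives pure MBR selection, and `weight=0.0` gives pure source-based reranking, the setting the abstract describes as unreliable on its own.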

2605.29327 2026-05-29 cs.CL cs.LG 版本更新

Reasoning-preserved Efficient Distillation of Large Language Models via Activation-aware Initialization

保留推理能力的大语言模型高效蒸馏:基于激活感知初始化

Junlin He, Yihong Tang, Tong Nie, Guilong Li, Binyu Yang, Jinxiao Du, Lijun Sun, Wei Ma

发表机构 * The Hong Kong Polytechnic University, Hong Kong SAR, China(香港理工大学) McGill University, Montreal, QC, Canada(麦吉尔大学)

AI总结 针对高效蒸馏导致的多步推理能力严重下降(推理崩溃),提出RED方法,通过激活感知初始化投影矩阵为通道选择矩阵,理论缓解有效秩崩溃,恢复推理能力并保持高效训练与通用性能。

详情
AI中文摘要

高效蒸馏(EDistill)通过结构化剪枝参数和调优轻量模块以高训练效率压缩大语言模型(LLM)。尽管这些EDistill LLM在通用能力基准上相对于类似大小的LLM取得了最先进的(SOTA)性能,但我们发现其多步推理能力严重下降,我们称之为推理崩溃。我们系统分析了推理崩溃的几何起源,并表明基于宽度缩减投影矩阵的SOTA EDistill方法遭受有效秩(eRank)崩溃,即隐藏表示的有效秩下降。我们从理论上解释了随机初始化投影矩阵的奇异值如何变得分布不均,导致eRank崩溃,进而导致token不可区分性。为解决此问题,我们提出了RED(保留推理能力的高效蒸馏)方法,该方法引入激活感知初始化,将投影矩阵初始化为通道选择矩阵,从而在理论上缓解eRank崩溃。在Llama和Qwen系列上的实验表明,RED在保持高训练效率和SOTA通用能力的同时,显著恢复了推理能力。

英文摘要

Efficient Distillation (EDistill) compresses large language models (LLMs) by structurally pruning parameters and tuning lightweight modules with high training efficiency. Although these EDistilled LLMs achieve state-of-the-art (SOTA) performance on general ability benchmarks relative to similarly sized LLMs, we identify a severe degradation in their multi-step reasoning ability, which we term reasoning collapse. We systematically analyze the geometric origins of reasoning collapse and show that the SOTA EDistill method based on width-reducing projection matrices suffers from eRank collapse, in which the effective rank (eRank) of hidden representations drops. We theoretically explain how singular values of randomly initialized projection matrices become unevenly distributed, leading to eRank collapse and thus token indistinguishability. To address this issue, we propose RED (Reasoning-preserved Efficient Distillation) for LLMs, which introduces activation-aware initialization to initialize projection matrices as channel-selection matrices, thus theoretically mitigating eRank collapse. Experiments on Llama and Qwen series demonstrate that RED substantially recovers reasoning while maintaining high training efficiency and SOTA general ability.
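
For readers unfamiliar with effective rank, the sketch below uses the common entropy-based definition: the exponential of the Shannon entropy of the normalized singular values. Whether the paper uses exactly this definition is an assumption. The function takes singular values as input so that it stays free of external dependencies.

```python
import math


def effective_rank(singular_values: list[float]) -> float:
    """exp(entropy) of the singular-value distribution (entropy-based eRank)."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Evenly spread singular values give an effective rank close to the number of dimensions, while a spectrum dominated by a few values gives a much smaller one, which is the "unevenly distributed" situation the abstract ties to eRank collapse.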

2605.29324 2026-05-29 cs.CL cs.CV 版本更新

STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments

STAMP:在可控且可扩展的虚拟环境中训练移动GUI代理的显式记忆

Junyang Wang, Haiyang Xu, Xi Zhang, Zhaoqing Zhu, Ming Yan, Jieping Ye, Jitao Sang

发表机构 * Tongyi AI Lab, Alibaba Group(通义实验室,阿里巴巴集团) Beijing Jiaotong University(北京交通大学)

AI总结 提出STAMP框架,通过可控虚拟环境注入确定性记忆变量,生成可验证监督数据并支持在线强化学习,解决移动GUI代理在长时任务中因上下文窗口限制和缺乏显式记忆导致的失败问题。

Comments 24 pages, 4 figures, 21 tables

详情
AI中文摘要

移动GUI代理在即时反应控制方面表现出色,但在需要记忆的现实长时任务中经常失败。这种失败源于有限的上下文窗口与令牌密集的屏幕截图之间的根本冲突。为了节省有限的上下文,代理必须逐步丢弃较旧的视觉历史,永久丢失关键的瞬时信息。此外,现有的以行动为中心的数据集无法教会代理记忆什么或何时显式记忆,并且增强静态真实世界数据成本高昂且缺乏交互验证。为了解决这个问题,我们提出了STAMP,一个通过可控虚拟环境训练移动代理显式记忆的框架,其中确定性记忆变量被程序化地注入到合成任务中,以控制必须记忆的内容、何时编码以及何时检索,从而大规模生成可验证的监督数据,并通过环境驱动的奖励反馈实现在线强化学习。在我们新引入的Memory-World基准测试上评估,得到的Stamp-GUI代理在GUI专用模型中达到了最先进的性能,并在我们的Memory-World基准测试上树立了新的高水位线,展示了卓越的记忆准确性和任务韧性,同时保持了强大的通用移动导航能力。

英文摘要

Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy screenshots. To save the limited context, agents must progressively discard older visual history, permanently losing crucial transient information. Furthermore, existing action-centric datasets fail to teach agents what or when to explicitly memorize, and augmenting static real-world data is prohibitively expensive and lacks interactive verification. To resolve this, we present STAMP, a framework that trains explicit memory in mobile agents through controllable virtual environments, where deterministic memory variables are programmatically injected into synthesized tasks to control what must be memorized, when it should be encoded, and when it must later be retrieved, thereby producing verifiable supervised data at scale and enabling online reinforcement learning through environment-driven reward feedback. Evaluated on our newly introduced Memory-World benchmark, the resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models and sets a new high watermark there, demonstrating exceptional memory accuracy and task resilience while maintaining strong general mobile navigation capabilities.

2605.29319 2026-05-29 cs.CL 版本更新

Rethinking Stepwise Model Routing: A Cost-Efficient Table Reasoning Perspective

重新思考逐步模型路由:一种成本高效的表格推理视角

Shenghao Ye, Yuxiang Wang, Yu Guo, Dong Jin, Shuangwu Chen, Jian Yang

发表机构 * University of Science and Technology of China(中国科学技术大学) The University of Melbourne(墨尔本大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院)

AI总结 提出EcoTab框架,通过分别估计表格令牌和文本令牌的不确定性并映射到下一步失败风险,实现表格推理中准确性与效率的更好平衡。

Comments 17 pages, 15 figures, submitted to EMNLP 2026

详情
AI中文摘要

大型推理模型(LRMs)在表格推理任务上表现出色,但由于长推理轨迹导致推理成本高昂。逐步模型路由通过将推理步骤动态分配给较小或较大的模型来缓解此问题。然而,用于表格推理的逐步模型路由仍未得到充分探索。通过实证分析,我们发现涉及表格的推理步骤包含两种具有不同不确定性分布的令牌:基于表格结构的表格令牌(如单元格值和表头)和表示周围自然语言推理的文本令牌。两种令牌的不确定性与模型在下一步推理中出错的风险相关。然而,现有方法未能分别建模它们,导致路由决策次优。为解决此问题,我们提出EcoTab,一种表格感知的逐步路由框架,用于高效表格推理。在每个推理步骤中,EcoTab分别估计表格令牌和文本令牌的不确定性,将其映射到小模型的下一步失败风险,并组合两种风险进行路由。在多个表格推理基准上的实验表明,EcoTab始终优于强基线,并在准确性和效率之间实现了更好的平衡。

英文摘要

Large Reasoning Models (LRMs) achieve strong performance on table reasoning tasks but incur substantial inference cost due to long reasoning traces. Stepwise model routing mitigates this issue by dynamically assigning reasoning steps to smaller or larger models. However, stepwise model routing for table reasoning remains underexplored. Through empirical analysis, we find that reasoning steps involving tables contain two types of tokens with distinct uncertainty distributions: table tokens grounded in table structure, such as cell values and headers, and text tokens representing surrounding natural-language reasoning. The uncertainty of both token types is correlated with the risk that the model makes an error in the next reasoning step. However, existing methods fail to model them separately, leading to suboptimal routing decisions. To address this, we propose EcoTab, a table-aware stepwise routing framework for efficient table reasoning. At each reasoning step, EcoTab separately estimates the uncertainties of table tokens and text tokens, maps them to next-step failure risks for the small model, and combines the two risks for routing. Experiments on multiple table reasoning benchmarks show that EcoTab consistently outperforms strong baselines and achieves a better balance between accuracy and efficiency.
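The routing idea in the abstract above (estimate table-token and text-token uncertainty separately, map each to a next-step failure risk, then combine the two risks) can be sketched as follows. This is a minimal illustration, not EcoTab's implementation: the use of mean token entropy as the uncertainty measure, the logistic coefficients, the noisy-OR combination, and the names `mean_entropy` and `route_step` are all assumptions made for the example.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def mean_entropy(token_probs: list[list[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token distributions."""
    if not token_probs:
        return 0.0
    total = 0.0
    for dist in token_probs:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_probs)

def route_step(table_probs, text_probs, threshold=0.5,
               table_coef=(3.0, -1.5), text_coef=(2.0, -1.5)):
    """Pick 'small' or 'large' for the next step (hypothetical coefficients)."""
    u_table = mean_entropy(table_probs)  # uncertainty on cell values, headers
    u_text = mean_entropy(text_probs)    # uncertainty on surrounding reasoning
    # Assumed logistic maps from uncertainty to small-model failure risk.
    r_table = sigmoid(table_coef[0] * u_table + table_coef[1])
    r_text = sigmoid(text_coef[0] * u_text + text_coef[1])
    # Assumed noisy-OR: the step fails if either token type goes wrong.
    risk = 1.0 - (1.0 - r_table) * (1.0 - r_text)
    return ("large" if risk >= threshold else "small"), risk
```

Keeping the two risk maps separate lets table tokens and text tokens carry different weight, which is the property the abstract says existing routers lack.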

2605.29313 2026-05-29 cs.CL 版本更新

PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration

PatchBoard: 基于Schema的可靠且可审计的LLM多智能体协作状态变更框架

Shuyu Zhang, Yaqi Shi, Lu Wang

发表机构 * School of Computer Science and Technology(计算机科学与技术学院)

AI总结 提出PatchBoard架构,通过Schema约束的JSON Patch状态变更替代智能体间对话,实现可验证、可审计的多智能体协作,在ALFWorld任务中成功率84.6%,令牌消耗45.5k。

详情
AI中文摘要

LLM多智能体系统通常通过自然语言对话或松散结构的共享内存进行协调,这使得中间状态难以验证、归因和审计。我们引入PatchBoard,一种基于Schema的协作架构,用经过验证的JSON Patch变更替代智能体间对话,作用于共享结构化状态。一个Architect智能体构建任务特定的Schema和工作流规则,而确定性内核在事务性提交之前,根据Schema约束、角色特定的写入合约和运行时不变量验证每个提议的状态变更。在630个匹配的ALFWorld场景中,PatchBoard实现了84.6%的成功率,而LangGraph为30.8%,Flock为61.6%,同时每个成功任务的令牌消耗降至45.5k,而LangGraph和Flock分别为368.3k和64.2k。

英文摘要

LLM multi-agent systems often coordinate through natural-language dialogue or loosely structured shared memory, making intermediate state difficult to validate, attribute, and audit. We introduce PatchBoard, a schema-grounded collaboration architecture that replaces inter-agent dialogue with validated JSON Patch mutations over a shared structured state. An Architect agent constructs a task-specific schema and workflow rules, while a deterministic kernel validates each proposed state mutation against schema constraints, role-specific write contracts, and runtime invariants before committing it transactionally. On 630 matched ALFWorld episodes, PatchBoard achieves an 84.6% success rate, compared with 30.8% for LangGraph and 61.6% for Flock, while reducing tokens per successful task to 45.5k, compared with 368.3k and 64.2k, respectively.
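The validate-then-commit loop described above can be sketched with a small deterministic kernel. This is an illustrative sketch only, not PatchBoard's code: it handles just the `add` and `replace` operations of JSON Patch (RFC 6902), models write contracts as allowed path prefixes per role, and models schema constraints and runtime invariants as plain predicate functions. The names `commit`, `PatchRejected`, `contracts`, and `invariants` are hypothetical.

```python
import copy

class PatchRejected(Exception):
    """Raised when a proposed mutation fails validation; nothing is committed."""

def _apply_op(doc, op):
    """Apply one 'add' or 'replace' operation in place (a subset of RFC 6902)."""
    parts = [p.replace("~1", "/").replace("~0", "~")
             for p in op["path"][1:].split("/")]
    node = doc
    for key in parts[:-1]:
        node = node[int(key)] if isinstance(node, list) else node[key]
    last = parts[-1]
    if isinstance(node, list):
        if op["op"] == "add":
            node.insert(len(node) if last == "-" else int(last), op["value"])
        else:
            node[int(last)] = op["value"]
    else:
        if op["op"] == "replace" and last not in node:
            raise KeyError(last)  # 'replace' requires an existing member
        node[last] = op["value"]

def commit(state, role, patch, contracts, invariants):
    """Return a new state if every check passes; otherwise raise PatchRejected."""
    allowed = contracts.get(role, [])
    for op in patch:
        if op.get("op") not in ("add", "replace"):
            raise PatchRejected(f"unsupported op: {op.get('op')}")
        path = op.get("path", "")
        if not any(path == p or path.startswith(p + "/") for p in allowed):
            raise PatchRejected(f"{role} may not write {path}")
    draft = copy.deepcopy(state)  # work on a copy so failure leaves state intact
    try:
        for op in patch:
            _apply_op(draft, op)
    except (KeyError, IndexError, ValueError, TypeError) as exc:
        raise PatchRejected(f"patch does not apply: {exc!r}") from exc
    for name, check in invariants.items():
        if not check(draft):
            raise PatchRejected(f"invariant violated: {name}")
    return draft
```

Because the kernel mutates only a deep copy and returns it on success, a rejected patch leaves the shared state untouched, which is the transactional property the abstract emphasizes.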

2605.29310 2026-05-29 cs.AI cs.CL 版本更新

Rubric-Guided Process Reward for Stepwise Model Routing

基于评分准则的逐步模型路由过程奖励

Shenghao Ye, Yu Guo, Zhengheng Li, Shuangwu Chen, Jian Yang

发表机构 * University of Science and Technology of China(中国科学技术大学) Southeast University(东南大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院)

AI总结 提出RoRo框架,通过收集路由轨迹、构建偏好对、训练Rubricor生成评估准则和Judge评分,结合过程与结果奖励优化路由策略,提升大型推理模型逐步路由的准确性和成本效率。

Comments 17 pages, 9 figures, submitted to EMNLP 2026

详情
AI中文摘要

逐步模型路由通过将每个推理步骤分配给合适的模型来提高大型推理模型(LRM)的效率。最近的方法将路由建模为顺序决策过程,并使用强化学习训练路由器。然而,尽管它们将路由建模为一个过程,但仍然使用结果奖励来监督路由器。这种奖励仅反映最终答案的正确性,未能评估中间路由决策,这可能会削弱性能和泛化能力。为了解决这一差距,我们提出了RoRo,一种基于评分准则的逐步模型路由过程奖励框架。RoRo首先收集多样化的路由轨迹,并基于结果、成本和过程质量构建偏好对。然后,它通过交替优化训练一个Rubricor来生成查询特定的评估准则,以及一个Judge来在此准则下对路由轨迹进行评分。由此产生的过程奖励与结果奖励相结合,通过GRPO优化路由策略。在五个推理基准上的实验,无论是在同族还是跨族设置下,都表明RoRo始终优于强基线,并实现了更好的准确性和成本权衡。

英文摘要

Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.
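As a rough illustration of how process and outcome rewards might feed a GRPO update, the sketch below mixes the two rewards per trajectory and then computes group-relative advantages (each reward minus the group mean, divided by the group standard deviation). The additive mixing rule and the weight `beta` are assumptions; the abstract does not state how RoRo combines the two signals.

```python
import statistics

def combined_rewards(outcome, process, beta=0.5):
    """Mix outcome and process rewards per trajectory (beta is hypothetical)."""
    return [o + beta * p for o, p in zip(outcome, process)]

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group, as GRPO does."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four routing trajectories sampled for the same query.
outcome = [1.0, 0.0, 1.0, 0.0]   # final answer correct or not
process = [0.8, 0.2, 0.4, 0.6]   # Judge score under the generated rubric
advantages = group_relative_advantages(combined_rewards(outcome, process))
```

With the process term included, two trajectories that both reach the right answer no longer receive identical advantages, so the router gets a signal about intermediate decisions.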

2605.29307 2026-05-29 cs.CL cs.AI cs.IR cs.LG 版本更新

GrepSeek: Training Search Agents for Direct Corpus Interaction

GrepSeek:训练用于直接语料库交互的搜索代理

Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Princeton University(普林斯顿大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出GrepSeek,一种通过两阶段训练(冷启动数据集+GRPO优化)和语义保持的分片并行执行引擎,训练紧凑型搜索代理直接与文本语料库交互(通过shell命令),在开放域问答中取得最优F1和精确匹配。

详情
AI中文摘要

大型语言模型(LLM)搜索代理通过多轮推理和信息检索,在知识密集型语言任务中展现出强大潜力。大多数现有系统使用检索器,该检索器接收关键词或自然语言查询,并利用预计算文档表示的索引返回排序后的文档列表。在本工作中,我们探索了一种互补视角,其中搜索代理将语料库本身视为搜索环境,并通过执行可执行的shell命令来寻找证据。我们引入了GrepSeek,一种优化的直接语料库交互(DCI)搜索代理,它训练一个紧凑的搜索代理从大型文本语料库中查找、过滤和组合证据。为了解决在大语料库上直接使用强化学习进行学习行为的不稳定性,我们提出了一种两阶段训练流程。首先,我们使用答案感知的Tutor和答案盲的Planner构建冷启动数据集,生成经过验证的、因果基础的搜索轨迹。其次,我们使用组相对策略优化(GRPO)优化初始化的策略,使代理能够通过与语料库的直接交互来改进其任务导向的搜索行为。为了使DCI在大规模下实用,我们进一步使用语义保持的分片并行执行引擎,该引擎将基于shell的检索加速高达7.6倍,同时保持与shell命令顺序执行的字节精确等价。在七个开放域问答基准上的实验表明,GrepSeek在整体词元级F1和精确匹配上取得了最强性能。我们的分析还揭示了纯粹词汇交互在具有显著表面形式变化的查询上的局限性,表明DCI作为搜索代理的一种实用且具有竞争力的方法,可以在现实世界中补充现有的检索范式。

英文摘要

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
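The byte-exact equivalence requirement for sharded execution can be illustrated with a toy line filter. This is a sketch under simplifying assumptions, not GrepSeek's engine: the "command" is a plain substring match standing in for `grep`, shards are cut only at newline boundaries so no line is split, and results are concatenated in shard order. The function names are hypothetical, and Python threads are used only to show the structure, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def grep_lines(data: bytes, needle: bytes) -> bytes:
    """Sequential reference: keep every line containing needle."""
    return b"".join(line for line in data.splitlines(keepends=True)
                    if needle in line)

def shard_on_newlines(data: bytes, n_shards: int) -> list[bytes]:
    """Split data into roughly equal shards that always end after a newline."""
    shards, start = [], 0
    target = max(1, len(data) // n_shards)
    while start < len(data):
        end = min(len(data), start + target)
        nl = data.find(b"\n", end - 1)  # extend the cut to the next newline
        end = len(data) if nl == -1 else nl + 1
        shards.append(data[start:end])
        start = end
    return shards

def parallel_grep(data: bytes, needle: bytes, n_shards: int = 4) -> bytes:
    """Filter each shard independently, then join results in shard order."""
    shards = shard_on_newlines(data, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        parts = list(pool.map(lambda s: grep_lines(s, needle), shards))
    return b"".join(parts)
```

Order-preserving concatenation is what makes the sharded result identical to the sequential one for a per-line filter; commands whose output depends on global state (line numbers, counts, sorting) would need extra handling.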

2605.29300 2026-05-29 cs.CL cs.AI cs.SD 版本更新

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

MusTBENCH:音乐大语言模型中的时间定位基准与推进

Daeyong Kwon, Qiyu Wu, Shinobu Kuriya, Junghyun Koo, Shuyang Cui, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

发表机构 * Seoul National University(首尔国立大学) Sony Group Corporation(索尼集团) Sony AI(索尼人工智能)

AI总结 提出MusTBENCH基准和MusT四阶段优化方法,评估并提升音乐大语言模型在音频中的时间定位能力。

详情
AI中文摘要

近期的大型音频-语言模型(LALMs)在理解音乐内容方面展现了有前景的能力。然而,它们的响应是否基于音频中正确的时间区域仍未得到充分探索。这一限制对于音乐理解尤为关键,因为关键信息通常以时间局部化事件的形式出现,例如乐器进入和节奏转换。为了解决这一差距,我们引入了MusTBENCH,一个由音乐专家验证的基准,旨在通过五个时间定位的问答任务评估LALMs中的时间定位能力。为了进一步提升现有模型中的时间定位,我们提出了MusT,一种新颖的四阶段时间优化方案,涵盖音乐编码器适应、LLM适应、LLM监督微调和基于RL的优化。在MusTBENCH上的实验表明,现有LALMs在精确时间定位方面存在困难,而MusT相比强基线带来了显著改进。这些结果将时间定位确立为当前LALMs中缺失的关键能力,并将MusTBENCH定位为未来时间定位音乐理解研究的具有挑战性的基准。

英文摘要

Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored. This limitation is particularly critical for music understanding, where key information often occurs as temporally localized events, such as instrument entries and rhythmic transitions. To address this gap, we introduce MusTBENCH, a music-expert-validated benchmark designed to evaluate temporal grounding in LALMs through five temporally grounded question-answering tasks. To further improve temporal grounding in existing models, we propose MusT, a novel four-stage temporal optimization recipe spanning music encoder adaptation, LLM adaptation, LLM supervised fine-tuning, and RL-based optimization. Experiments on MusTBENCH show that existing LALMs struggle with precise temporal grounding, while MusT brings significant improvements over strong baselines. These results establish temporal grounding as a key missing capability in current LALMs and position MusTBENCH as a challenging benchmark for future research in temporally grounded music understanding.

2605.29278 2026-05-29 cs.CL 版本更新

Accommodation Goes Both Ways: Studying Linguistic Convergence Between Humans and Language Models

适应是双向的:研究人类与语言模型之间的语言趋同

Terra Blevins

发表机构 * Khoury College of Computer Sciences(Khoury计算机科学学院)

AI总结 通过大规模研究人类与LLM对话中的语言趋同现象,发现LLM在功能词和开放类特征上过度适应人类风格,而人类对LLM的适应程度与人类之间对话的基线一致。

详情
AI中文摘要

随着LLM日益融入日常生活,理解它们的存在将如何塑造人类语言行为是一个开放性问题。我们提出了一个关于人机对话中语言趋同的大规模研究,考察在多轮对话中人类和LLM如何相互适应对方的语言风格。使用WildChat(一个真实世界ChatGPT对话语料库)上的非对称趋同度量,我们发现,尽管LLM在八种语言的功能词和开放类特征上显著过度趋同于用户,但人类在此环境下的趋同率与人类-人类基线基本一致。这些发现表明,人机对话中的适应是非对称的:LLM过度拟合用户的风格,而人类对LLM的语言适应与对另一个人的适应没有区别。

英文摘要

As LLMs become increasingly integrated into daily life, understanding how their presence will shape human linguistic behavior is an open question. We present a large-scale study of linguistic convergence in human-LLM dialogue, examining how humans and LLMs accommodate each other's linguistic style during multi-turn conversations. Using an asymmetric convergence metric on WildChat, a corpus of real-world ChatGPT transcripts, we find that while LLMs significantly overconverge toward their users on both function word and open-class features across eight languages, human convergence rates in this setting are broadly consistent with human-human baselines. These findings suggest that accommodation in human-LLM dialogue is asymmetric: while LLMs dramatically overfit to their users' style, humans linguistically accommodate LLMs no differently than they would another person.

2605.29275 2026-05-29 cs.CL 版本更新

Prompt-Level Reward Specifications for Open-Ended Post-Training

面向开放式后训练的提示级奖励规范

Zijun Weng, Xiaohui Hu, Shuangyong Song, Yongxiang Li, Kaidong Yu, Xuanjing Huang

发表机构 * Fudan University(复旦大学) Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd.(星辰AGI实验室,中国电信人工智能技术(北京)有限公司)

AI总结 提出一种提示级奖励规范框架,通过离线构建可复用的任务自适应评分准则和可执行硬约束检查器,在训练前显式化奖励标准,无需人工偏好标注或单独训练奖励模型,在多个开放式基准上提升了离线排序和在线强化学习效果。

Comments 39 pages, 4 figures, 16 tables

详情
AI中文摘要

开放式后训练受益于能够将提示特定的成功条件显式化的奖励,而非仅依赖事后标量分数。在指令遵循、写作和决策支持任务中,响应质量取决于局部要求、整体偏好和显式约束,但现有奖励方法往往隐含这些标准或仅覆盖狭窄的可验证情况。我们提出一个提示级奖励规范框架,将奖励规范与奖励计算分离。仅凭提示,我们的框架离线构建可复用的任务自适应评分准则和可执行硬约束检查器,在训练前显式化奖励标准,并可在多次 rollout 中复用。在评分时,基于工件的评分准则和代码分数与独立的全局分数(用于残余整体质量)相结合,生成关于需求满足度、整体质量和确定性约束的归一化混合奖励。该框架无需人工偏好标注、参考答案或单独训练的奖励模型。实验表明,所得奖励改进了离线 RM 风格的响应排序,并支持在多个开放式基准上进行在线强化学习。消融实验进一步表明,评分准则、全局评分和可执行验证提供了互补的监督。

英文摘要

Open-ended post-training benefits from rewards that make prompt-specific success conditions explicit, rather than relying only on post-hoc scalar scores. In instruction following, writing, and decision-support tasks, response quality depends on local requirements, holistic preferences, and explicit constraints, but existing reward methods often leave these criteria implicit or cover only narrowly verifiable cases. We propose a prompt-level reward specification framework that separates reward specification from reward computation. Given only prompts, our framework constructs reusable task-adaptive rubrics and executable hard-constraint checkers offline, making reward criteria explicit before training and reusable across rollouts. At scoring time, artifact-anchored rubric and code scores are combined with an independent global score for residual holistic quality, yielding a normalized hybrid reward over requirement satisfaction, holistic quality, and deterministic constraints. The framework requires no human preference annotations, reference answers, or a separately trained reward model. Experiments show that the resulting reward improves offline RM-style response ranking and supports online reinforcement learning across multiple open-ended benchmarks. Ablations further show that rubrics, global scoring, and executable verification provide complementary supervision.
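One way to picture the normalized hybrid reward is as a weighted average of three components: a rubric score, the pass rate of executable hard-constraint checkers, and a global holistic score. The sketch below is an assumption-laden illustration, not the paper's formula: the weights, the averaging scheme, and the name `hybrid_reward` are hypothetical, and the rubric and global scores are taken as already computed values in [0, 1].

```python
def hybrid_reward(response, checkers, rubric_scores, global_score,
                  weights=(0.4, 0.3, 0.3)):
    """Combine rubric, code-check, and global scores into one value in [0, 1]."""
    # Fraction of executable hard-constraint checkers the response passes.
    code_score = (sum(1 for check in checkers if check(response)) / len(checkers)
                  if checkers else 1.0)
    # Mean satisfaction over per-requirement rubric items.
    rubric_score = (sum(rubric_scores) / len(rubric_scores)
                    if rubric_scores else 0.0)
    w_rubric, w_code, w_global = weights
    total = w_rubric * rubric_score + w_code * code_score + w_global * global_score
    return total / (w_rubric + w_code + w_global)

# Checkers are built once per prompt and reused across rollouts.
checkers = [
    lambda r: len(r.split()) <= 5,   # hypothetical length constraint
    lambda r: r.endswith("."),       # hypothetical formatting constraint
]
reward = hybrid_reward("Short answer.", checkers, [1.0, 0.5], 0.5)
```

Since the checkers depend only on the prompt, they can be constructed offline and applied to every sampled response, which matches the specification-versus-computation split the abstract describes.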

2605.29274 2026-05-29 cs.CL 版本更新

Learnable Assessment Skills for LLM-based Automated Scoring: Rubric Construction via Iterative Optimization

基于LLM的自动评分中可学习的评估技能:通过迭代优化构建评分标准

Yun Wang, Xin Xia, Xuansheng Wu, Xiaoming Zhai, Ninghao Liu

发表机构 * School of Computing, University of Georgia, Athens, GA, USA(佐治亚大学计算机学院) AI4STEM Education Center, University of Georgia, Athens, GA, USA(佐治亚大学AI4STEM教育中心) The Hong Kong Polytechnic University, Hong Kong, China(香港理工大学)

AI总结 提出一种迭代框架,使LLM能从评分经验中学习评估技能(即与题目无关的自然语言程序性知识),自动构建评分标准,无需人工干预,在ASAP-SAS数据集上超越专家编写的评分标准。

Comments 12 pages, 5 figures

详情
AI中文摘要

基于LLM的自动评分方法接近人类水平,但扩展到新任务时仍受限于上游阶段(如评分标准构建)的逐项人工配置。人类专家通过长期实践形成的评估启发式方法绕过了这一瓶颈。我们探究LLM是否可以直接从评分经验中学习类似的启发式方法,并将其形式化为评估技能的概念:即与题目无关的自然语言程序性知识,指导LLM完成评分工作流程的特定阶段。聚焦于评分标准构建作为首次实例化,我们提出一个迭代框架,将技能分解为固定支架和可学习的与题目无关的规则,通过LLM驱动的评分错误诊断和验证门控选择来优化规则。该框架无需专家编写的评分标准。在所有十个ASAP-SAS题目上,优化后的技能显著提升了基于LLM的评分,并经常超过数据集提供的专家评分标准。跨题目迁移实验进一步表明,学习到的技能捕捉到了可泛化和题目特定的模式。

英文摘要

LLM-based automated scoring approaches near-human performance, but scaling to new tasks remains bottlenecked by the per-item human configuration of upstream stages such as rubric construction. Human experts bypass this bottleneck through evaluation heuristics developed over extensive practice. We ask whether LLMs can learn similar heuristics directly from scoring experience, and formalize this as the concept of assessment skills: item-independent natural-language procedural knowledge that guides LLMs through specific stages of the scoring workflow. Focusing on rubric construction as a first instantiation, we propose an iterative framework that decomposes a skill into a fixed scaffold and learnable item-agnostic rules, refining the rules through LLM-driven diagnosis of scoring errors and validation-gated selection. The framework requires no expert-written rubric. On all ten ASAP-SAS items, optimized skills substantially improve LLM-based scoring and frequently surpass the dataset-provided expert rubric. Cross-item transfer experiments further reveal that learned skills capture both generalizable and item-specific patterns.

2605.29256 2026-05-29 cs.CL cs.AI 版本更新

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

DynSess:面向角色扮演智能体的动态会话级评估与优化框架

Rongsheng Zhang, Jiji Tang, Junnan Ren, Zuyi Bao, Weijie Chen, Ruofan Hu, Zhou Zhao, Tangjie Lv, Yan Zhang

发表机构 * Zhejiang University(浙江大学) Fuxi AI Lab, NetEase Inc.(伏羲人工智能实验室,网易公司) Xiamen University(厦门大学)

AI总结 提出DynSess统一会话级框架,通过会话级评估(DynSess-Eval)和基于多轮前瞻搜索的训练轨迹优化(DSPO/GSRPO),提升角色扮演智能体的长程一致性和交互质量。

详情
AI中文摘要

基于大型语言模型的角色扮演本质上是一个会话级任务,要求智能体在扩展的多轮对话中维持角色身份和交互质量。然而,现有的评估和优化方法大多停留在轮次级别,无法捕捉长程质量。我们提出DynSess,一个统一的会话级角色扮演智能体框架。DynSess-Eval通过针对长程行为的评分标准对完整对话会话进行评分。利用其会话级奖励,我们通过多轮前瞻搜索构建高质量训练轨迹,并训练DynSess-Character的两个互补变体:DSPO(离策略)和GSRPO(在策略)。实验表明,DynSess-Eval与人类判断的一致性显著优于先前的评估器,盲法人工评估进一步显示,尽管参数少得多,DynSess-Character仍能与最强角色模型匹配,同时保持强大的角色一致性和交互能力。我们的数据集和代码将发布以促进未来研究。

英文摘要

Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimization methods remain largely turn-level, failing to capture long-horizon quality. We propose DynSess, a unified session-level framework for role-playing agents. DynSess-Eval scores complete dialogue sessions via rubrics targeting long-horizon behaviors. Leveraging its session-level rewards, we construct high-quality training trajectories through multi-turn lookahead search and train DynSess-Character with two complementary variants: DSPO (off-policy) and GSRPO (on-policy). Experiments show that DynSess-Eval aligns with human judgments substantially better than prior evaluators, and blind human evaluation further shows that DynSess-Character matches the strongest character model despite using substantially fewer parameters, while maintaining strong role consistency and interactive ability. Our dataset and code will be released to facilitate future research.

2605.29250 2026-05-29 cs.CL cs.AI cs.IR cs.LG 版本更新

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

OmniRetrieval:跨异构知识源的统一检索

Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang, Patara Trirat, Heejun Lee, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院)

AI总结 提出OmniRetrieval框架,通过自然语言查询识别并调度到不同知识源的本地执行引擎,在13个数据集和309个知识库上超越单源基线,实现异构知识源统一检索。

详情
AI中文摘要

现实世界的信息需求需要访问结构多样的知识源,从非结构化文本和关系表到知识图谱和属性图。然而,现有的检索器一次只在一个源上操作,使用固定的查询语言,使得可用知识的更广泛图景被不兼容的接口所分割。一种自然的统一尝试是将这些源折叠到一个共享空间中,但这会抹去每个源的结构性优势(如模式、本体、组合操作符),而这些优势赋予了每个源其表达能力。因此,对多样化知识的有效检索需要的不是同质化,而是一个能够按每个源自身条件与其交互的总体层。为了实现这一点,我们提出了OmniRetrieval,一个框架,它接受任何自然语言查询,识别合适的知识源,并将源原生查询分派到其本地执行引擎。在涵盖文本、关系和图结构源的13个数据集和309个不同知识库的广泛基准测试中,OmniRetrieval超过了单源基线,证明了它可以作为异构源的通用接口,同时保留使每个源有价值的结构差异。

英文摘要

Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs. Existing retrievers, however, operate over one source at a time under a fixed query language, leaving the broader landscape of available knowledge fragmented behind incompatible interfaces. A natural attempt at unification would collapse these sources into a shared space, but this erases the structural affordances (such as schemas, ontologies, compositional operators) that give each source its expressive power. Effective retrieval over diverse knowledge, therefore, requires not homogenization but an overarching layer that meets each source on its own terms. To achieve this, we present OmniRetrieval, a framework that takes any natural-language query, identifies appropriate knowledge sources, and dispatches source-native queries to their native execution engines. Across an extensive benchmark spanning 13 datasets and 309 distinct knowledge bases over text, relational, and graph-structured sources, OmniRetrieval exceeds single-source baselines, demonstrating that it can serve as a general-purpose interface to the heterogeneous sources while preserving the structural distinctions that make each source valuable.

2605.29247 2026-05-29 cs.AI cs.CL cs.LG 版本更新

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

DenseSteer: 引导小型语言模型进行密集数学推理

Yang Ouyang, Shuhang Lin, Jung-Eun Kim

发表机构 * North Carolina State University(北卡罗来纳州立大学) Rutgers University(罗格斯大学)

AI总结 提出DenseSteer,一种无需训练的推理时引导框架,通过调节内部表征向密集推理模式靠拢,提升小型模型在多步数学推理中的准确性。

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)展现出强大的思维链(CoT)推理能力,而较小的模型(≤3B参数)在多步推理任务上表现显著不佳。基于对Qwen-2.5模型系列在数学推理基准上的实证分析,我们发现更熟练的推理与更少的推理步骤但每步更高的信息密度相关,我们将此属性称为密集推理。受此观察启发,我们提出了DenseSteer,一种无需训练的推理时引导框架,通过将内部表征调节至密集推理模式来增强小型模型的推理能力。实验表明,我们的方法在不增加词元级负对数似然的情况下,持续提高了准确性,突显了密集推理作为数学问题求解的一种有效结构方法。

英文摘要

Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smaller models (≤ 3B parameters) significantly underperform on multi-step reasoning tasks. Based on empirical analyses of the Qwen-2.5 model family on math reasoning benchmarks, we find that more proficient reasoning is associated with fewer reasoning steps but higher information density per step, a property we term Dense Reasoning. Motivated by this observation, we propose DenseSteer, a training-free inference-time steering framework that enhances small-model reasoning by modulating internal representations toward dense reasoning patterns. Experiments show that our method yields consistent accuracy improvements without increasing token-level Negative Log-Likelihood, highlighting dense reasoning as an effective structural approach to mathematical problem solving.

2605.29245 2026-05-29 cs.CR cs.CL cs.LG 版本更新

Implicit Identity Technologies for LLMs: Fingerprinting and Watermarking across Datasets, Models, and Generated Content

LLM的隐式身份技术:跨数据集、模型和生成内容的指纹识别与水印

Bing Liu, Shunping Wang, Yufan Zhu, Xinyi Yu, Jing Huang, Linkang Du, Hongbin Pei, Wei Luo

发表机构 * School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an, China(西安交通大学网络空间安全学院) State Grid Henan Marketing Service Center, Henan, China(国网河南营销服务中心) Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学网络安全学院) School of Information Technology, Deakin University, Geelong, Australia(迪金大学信息技术学院)

AI总结 本文综述了LLM指纹识别和水印技术,提出隐式身份统一抽象,并基于生命周期分类法组织数据集、模型和生成内容的技术,建立评估框架。

Comments Accepted by IJCAI-ECAI 2026. 11 pages, 1 figure. Survey and taxonomy of LLM fingerprinting and watermarking for identity, provenance, generated-content attribution, and asset protection

详情
AI中文摘要

本文对LLM指纹识别和水印技术进行了综述和分类,用于身份验证、所有权验证、溯源和生成内容归因。大型语言模型(LLM)需要大量数据、计算和专业知识投入,并越来越多地部署在高风险场景中,因此保护LLM相关资产并追溯其来源至关重要。现有工作已在数据集溯源、模型所有权和生成内容检测方面迅速扩展,但该领域仍然碎片化:指纹识别和水印的使用往往不一致,且方法通常仅在孤立的资产特定设置中研究。为解决这一差距,我们引入隐式身份作为LLM系统中可验证但不可直接观察的身份信号的统一抽象。我们将指纹识别区分为源自内在特征的非侵入式身份,将水印区分为有意嵌入数据、模型或生成内容中的侵入式身份。然后,我们提出一种基于生命周期的分类法,将技术组织到数据集、模型和生成内容中,并进一步通过验证语义进行区分:基于相似性的归因和密钥验证。最后,我们建立一个以可识别性、鲁棒性和可部署性为中心的评估框架,总结在现实访问和变换条件下的代表性指标。通过统一术语、生命周期阶段和评估目标,本综述为研究LLM身份技术以及开发更可靠的资产保护和溯源机制提供了结构化基础。

英文摘要

This paper presents a survey and taxonomy of LLM fingerprinting and watermarking for identity, ownership verification, provenance, and generated-content attribution. Large language models (LLMs) require substantial investments in data, computation, and expertise, and are increasingly deployed in high-stakes settings, making it critical to protect LLM-related assets and trace their origins. Existing work has rapidly expanded across dataset provenance, model ownership, and generated-content detection, but the field remains fragmented: fingerprinting and watermarking are often used inconsistently, and methods are typically studied within isolated asset-specific settings. To address this gap, we introduce implicit identity as a unifying abstraction for verifiable but not directly observable identity signals in LLM systems. We distinguish fingerprinting as non-intrusive identity derived from intrinsic characteristics, and watermarking as intrusive identity deliberately embedded into data, models, or generated content. We then propose a lifecycle-based taxonomy that organises techniques across datasets, models, and generated content, and further separates them by verification semantics: similarity-based attribution and keyed verification. Finally, we establish an evaluation framework centred on identifiability, robustness, and deployability, summarising representative metrics under realistic access and transformation regimes. By unifying terminology, lifecycle stages, and evaluation objectives, this survey provides a structured foundation for studying LLM identity technologies and for developing more reliable mechanisms for asset protection and provenance.

2605.29243 2026-05-29 cs.CL cs.AI cs.CY 版本更新

Wait! There's a Way Out: A Decision Mechanism for Forecasting Conversational Derailment

等等!有出路:一种预测对话偏离的决策机制

Laerdon Kim, Vivian Nguyen, Cristian Danescu-Niculescu-Mizil

发表机构 * Cornell University(康奈尔大学)

AI总结 提出一种基于前瞻性模拟的延迟决策机制,在预测对话偏离时通过评估紧张时刻的恢复可能性来降低误报率,同时保持预测准确性。

Comments To appear in the Proceedings of ACL 2026

详情
AI中文摘要

预测对话偏离的任务是,在对话进行中预测其最终是否会偏离为人身攻击。由于预测模型以在线方式运行,它们必须在每轮发言后决定是否“触发”警报——例如,通知参与者或主持人对话有偏离风险。现有方法仅根据先前发言估计的偏离可能性做出这一决定,隐含假设对话的未来轨迹是固定的。因此,它们忽略了未来恢复的可能性,并导致不必要的高误报率。在这项工作中,我们提出了一种将触发决策与偏离可能性估计解耦的方法。我们的方法受该任务第一个人类基线的启发,该基线表明,人类通过选择性地推迟触发决策(当他们预计紧张局势可能缓解时),实现了显著更低的误报率。我们通过一种延迟机制来操作这一见解,该机制使用前瞻性模拟来评估紧张时刻是否存在合理的恢复路径。将这一机制整合到最先进的预测模型中,可以在不牺牲预测准确性的情况下大幅减少误报。更广泛地说,这项工作强调了将决策制定视为预测系统的一等组成部分的价值。

英文摘要

Forecasting conversational derailment is the task of predicting, as the conversation unfolds, whether it will eventually derail into personal attacks. Since forecasting models operate in an online fashion, they must decide whether to "trigger" an alert after each utterance--for example, to notify participants or a moderator that the conversation is at risk of derailing. Existing approaches make this decision solely based on the estimated likelihood of derailment given the preceding utterances, implicitly assuming that the conversation's future trajectory is fixed. As a result, they ignore the possibility of future recovery and incur an unnecessarily high rate of false positives. In this work we propose a method for decoupling the decision to trigger from derailment likelihood estimation. Our approach is inspired by the first human baseline on this task, which shows that humans achieve dramatically lower false positive rates by selectively deferring their decision to trigger when they anticipate that tension is likely to subside. We operationalize this insight with a deferral mechanism that uses forward-looking simulations to assess whether a tense moment admits plausible paths to recovery. Incorporating this mechanism into a state-of-the-art forecasting model substantially reduces false positives without sacrificing forecasting accuracy. More broadly, this work highlights the value of treating decision-making as a first-class component of forecasting systems.
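The decoupling of the trigger decision from the likelihood estimate can be sketched as a two-stage rule: first check whether the forecaster's derailment probability crosses the alert threshold, then defer if enough simulated continuations recover. Everything below is a hypothetical illustration; the thresholds, the boolean representation of simulated outcomes, and the function name `should_trigger` are assumptions, and producing the simulations themselves is out of scope here.

```python
def should_trigger(p_derail, simulated_outcomes,
                   alert_threshold=0.5, recovery_threshold=0.3):
    """Decide whether to raise an alert after the current utterance.

    p_derail: forecaster's estimated derailment probability so far.
    simulated_outcomes: one bool per simulated continuation,
        True if that continuation ends in a personal attack.
    """
    if p_derail < alert_threshold:
        return False  # not tense enough to consider alerting
    if not simulated_outcomes:
        return True   # no look-ahead available: fall back to likelihood only
    recovered = sum(1 for derailed in simulated_outcomes if not derailed)
    recovery_rate = recovered / len(simulated_outcomes)
    # Defer when a plausible share of futures shows the tension subsiding.
    return recovery_rate < recovery_threshold
```

A likelihood-only baseline corresponds to calling the function with an empty list of simulations, which makes the effect of the deferral step easy to compare.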

2605.29240 2026-05-29 cs.AI cs.CL cs.HC cs.IR 版本更新

Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI

使用AI在教师与学生之间进行结果无关的反馈中介来发现孤立学习者

Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

发表机构 * Georgia Institute Of Technology(佐治亚理工学院)

AI总结 提出一种无需成绩的可解释决策层,通过整合学生困难普遍性、自我报告与观察困难的不一致以及教师未解决关注点三个信号,对课程主题进行优先级排序,以帮助教师及时做出教学决策。

Comments Accepted to HAI-Agency Workshop on Orchestrating Human and AI Agency for Proactive and Reflective Learning

详情
AI中文摘要

AI增强的课堂在成绩结果可用之前就生成了丰富的教师和学生反馈,但这些信号难以转化为及时的教学决策。我们提出一个可解释的决策层:一种透明机制,无需使用成绩或事后结果标签即可对需要关注的课程主题进行排序。该方法结合了三个信号:学生学习困难普遍性、学习者自我报告与观察到的困难之间的不一致,以及未解决的教师关注点。输出是一个按优先级排序的主题集,每个主题附有解释其排序的决策记录。在一门研究生CS课程($n=5$次教师访谈;$n=279$份调查回复)中,优先主题与教师关注点一致(top-5重叠3/5;Spearman $\rho=0.80$),并与学生报告的主题困难相关($\rho=0.46$, $p=.048$)。多信号整合还发现了仅通过单个信号源未能识别的学习者(AUC $=0.96$ vs. 仅差距普遍性的$0.91$)。反思性思维、求助行为和自我效能感提供了额外证据,表明学生行为信号与学习相关构念一致。尽管是初步结果,这些发现表明,当反馈不完整时,透明的协调机制可能有助于支持人机协同。

英文摘要

AI-augmented classrooms generate rich teacher and student feedback before graded outcomes become available, yet these signals can be difficult to translate into timely instructional decisions. We propose an interpretable decision layer: a transparent mechanism that ranks course topics requiring attention without using grades or post-hoc outcome labels. The approach combines three signals: student learning difficulty prevalence, disagreement between learner self-reports and observed difficulties, and unresolved teacher concerns. The output is a ranked set of topic priorities with per-topic decision records explaining each ranking. In one graduate CS course offering ($n=5$ instructor interviews; $n=279$ survey responses), prioritized topics aligned with instructor concerns (top-5 overlap 3/5; Spearman $\rho=0.80$) and student-reported topic difficulty ($\rho=0.46$, $p=.048$). Multi-signal integration also surfaced learners not identified through individual signal sources alone (AUC $=0.96$ vs. $0.91$ for gap prevalence alone). Reflective thinking, help-seeking, and self-efficacy provided additional evidence that student behavioral signals align with learning-related constructs. While preliminary, these findings suggest that transparent coordination mechanisms may help support human-AI co-agency when feedback is incomplete.
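A grade-free ranking over the three signals named above could look like the following sketch. It assumes each signal has already been scaled to [0, 1] and uses an equal-weight sum; the abstract does not give the actual scoring rule, so the weights, the `TopicSignals` fields, and `rank_topics` are illustrative names only. Each output record keeps the per-signal contributions so the ranking can be explained.

```python
from dataclasses import dataclass

@dataclass
class TopicSignals:
    topic: str
    difficulty_prevalence: float  # share of students struggling with the topic
    report_mismatch: float        # self-report vs. observed difficulty gap
    unresolved_concern: float     # open teacher concerns about the topic

def rank_topics(signals, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Return decision records sorted by priority score, highest first."""
    w_prev, w_mis, w_con = weights
    records = []
    for s in signals:
        contributions = {
            "difficulty_prevalence": w_prev * s.difficulty_prevalence,
            "report_mismatch": w_mis * s.report_mismatch,
            "unresolved_concern": w_con * s.unresolved_concern,
        }
        records.append({
            "topic": s.topic,
            "score": sum(contributions.values()),
            "contributions": contributions,  # explains why the topic ranks here
        })
    records.sort(key=lambda r: r["score"], reverse=True)
    return records

ranking = rank_topics([
    TopicSignals("recursion", 0.6, 0.3, 0.9),
    TopicSignals("hashing", 0.2, 0.1, 0.0),
    TopicSignals("graph search", 0.5, 0.7, 0.3),
])
```

A topic with modest difficulty prevalence can still rank highly when teacher concerns remain unresolved, which is the kind of case a single-signal ranking would miss.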

2605.29224 2026-05-29 cs.CL cs.AI cs.CR 版本更新

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

相关性即漏洞:网络检索如何削弱LLM智能体的安全对齐

Aditya Nawal, Manit Baser, Mohan Gurusamy

发表机构 * Department of Electrical and Computer Engineering(电子与计算机工程系) National University of Singapore(新加坡国立大学)

AI总结 本文提出AgentREVEAL框架,分析检索集成方式和内容属性如何导致LLM智能体安全退化,发现相关性是共同激活条件,并引入HarmURLBench基准。

详情
AI中文摘要

AI智能体通过外部工具(如网络检索)增强大型语言模型,使其能够提供基于事实和最新的响应。然而,将外部内容纳入生成流程可能会削弱控制模型输出的安全对齐机制。先前的研究表明,在智能体中启用检索会增加对有害请求的遵从性。我们提出了AgentREVEAL,一个用于分析LLM智能体中检索诱导的安全退化的诊断框架。该框架考察两个维度:检索如何集成到智能体流程中,以及检索内容的属性。在集成维度上,我们发现将工具调用和响应生成绑定在单一步骤中会放大有害输出。在内容维度上,我们揭示了安全来源悖论:即使是对立或安全导向的来源(例如包含警告或风险免责声明的页面),与无检索基线相比,也会使有害遵从性平均增加25%。最后,我们表明相关性是这两种漏洞的共同激活条件。类似模式出现在前沿闭源模型上,并且在几种代表性流程干预下,有害遵从性仍然保持较高水平,一些智能体在自主检索下也会进入这种状态。由于相关性也是使检索有用的原因,这些结果揭示了检索增强智能体的安全-效用权衡。我们引入了HarmURLBench,一个包含1,405个真实世界URL和320个有害行为的基准,以支持未来的评估。

英文摘要

AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling retrieval in agents increases compliance with harmful requests. We introduce AgentREVEAL, a diagnostic framework for analyzing retrieval-induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline and the properties of the retrieved content. Along the integration axis, we find that binding tool invocation and response generation in a single step amplifies harmful outputs. Along the content axis, we uncover the Safe Source Paradox: even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no-retrieval baseline. Finally, we show that relevance acts as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, and harmful compliance remains elevated under several representative pipeline interventions, with some agents also entering this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents. We introduce HarmURLBench, a benchmark containing 1,405 real-world URLs paired with 320 harmful behaviors to support future evaluations.

2605.29218 2026-05-29 cs.AI cs.CL 版本更新

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

GTA:大规模生成面向Web智能体的长程任务

Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu

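Affiliations: University of Southern California; Salesforce AI Research; University of California, Davis

AI summary: Proposes GTA, a framework that combines crawling, retrieval-based seeding, in-context generation, and automated quality control to generate realistic long-horizon tasks with executable trajectories for web agents, addressing the lack of process-level supervision and scalability in existing benchmarks.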

Comments Published at Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics

详情
AI中文摘要

Web智能体将语言模型与浏览和工具使用能力相结合,有望成为开放的Web助手。然而,进展日益受到缺乏可扩展的过程级监督的限制。现有基准大多为手动构建,仅提供粗略的起始-目标注释,缺乏中间轨迹,而最近的自动生成方法仍然昂贵、有偏且浅显。这些限制阻碍了对必须泛化到现实、多跳、跨页面任务的智能体进行可靠训练和评估。我们引入了一个可扩展的框架GTA,它集成了爬取、基于检索的种子生成、上下文内生成和自动质量控制,以生成与可执行轨迹配对的真实任务。该设计将爬取与生成解耦以提高效率,将任务基于站点图以强制组合性,并通过确定性重放和系统验证确保密集监督。我们在超过50个涵盖电子商务、政府、论坛和新闻的网站上实例化了该流程,并具有多语言和多跳覆盖。由此产生的基准揭示了显著的人机性能差距,并实现了详细的诊断。我们的贡献有三方面:(i)形式化多跳Web智能体任务生成,(ii)提出一个高效且经过验证的自动数据创建流程,以及(iii)发布一个具有可重复评估的动态基准。

英文摘要

Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic, multi-hop, cross-page tasks. We introduce a scalable framework, GTA, that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. This design decouples crawling from generation for greater efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replays and systematic validation. We instantiate the pipeline on over 50 websites covering e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a significant human-agent performance gap and enables detailed diagnostics. Our contributions are three-fold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient and validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.
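
To make the idea of grounding tasks in the site graph more concrete, here is a minimal sketch, assuming a site can be modelled as a dictionary from page to outgoing links. The page names and the `multi_hop_paths` helper are hypothetical illustrations and are not taken from the GTA pipeline; the sketch only shows how multi-hop click paths might be enumerated as skeletons for compositional tasks.

```python
def multi_hop_paths(site_graph, start, hops):
    """Enumerate click paths of exactly `hops` links from `start`, never revisiting a page."""
    paths = [[start]]
    for _ in range(hops):
        # Extend every partial path by one outgoing link; dead ends drop out.
        paths = [
            path + [nxt]
            for path in paths
            for nxt in site_graph.get(path[-1], [])
            if nxt not in path
        ]
    return paths

# Hypothetical e-commerce site graph (page -> pages it links to).
SITE = {
    "home": ["search", "cart"],
    "search": ["product"],
    "product": ["cart", "reviews"],
    "cart": ["checkout"],
}

three_hop = multi_hop_paths(SITE, "home", 3)
```

Each returned path could then seed a task description and be replayed deterministically to check that the trajectory is executable.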

2605.29192 2026-05-29 cs.AI cs.CL 版本更新

ReasonOps: Operator Segmentation for LLM Reasoning Traces

ReasonOps: 大语言模型推理轨迹的算子分割

Daniel Lee, Owen Queen, James Zou

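Affiliations: Stanford University

AI summary: Proposes ReasonOps, an unsupervised method that extracts 7 universal reasoning operators from chain-of-thought traces, revealing the structure of model reasoning and supporting model identification and correctness prediction.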

详情
AI中文摘要

大型推理模型的思维链轨迹可长达数万token,但我们缺乏描述其内部结构的词汇。以往用于分析思维链轨迹的方法要么过于僵化,要么表达能力不足,无法捕捉跨领域和跨模型的特征。为解决此问题,我们开发了ReasonOps,一种无监督、表达力强的方法,用于注释思维链轨迹,提供简洁的通用算子。利用ReasonOps,我们分析了来自12个思考型LLM(涵盖6个家族、8个推理基准)的44,662条轨迹,发现它们共享一个共同的组合结构:7个反复出现的推理算子——语篇层面的动作,如回溯、推理和假设——这些算子从句子开头的3-token枢轴的无监督聚类中涌现。这些算子出现在每个模型家族和基准领域,由三个独立的LLM评判员对留出样本进行分类,准确率达70-76%。我们分析了算子在简单与困难问题上的结构,发现反思性算子在困难问题上更有帮助,而在简单问题上则损害性能。算子序列具有高度的模型识别性:仅基于算子分布训练的分类器能以宏AUC恢复源模型,揭示每个模型家族具有独特的推理指纹。结构化的算子特征在问题内答案正确性预测上远高于基线。基于这些算子构建的分类器在WP-AUC上达到,特别是在AIME上。ReasonOps还能够在轨迹完成前进行早期质量估计:我们仅用50%的轨迹就能在WP-AUC上进行预测。ReasonOps流程是无监督且无需标注的,能够深入洞察LLM推理轨迹,并在模型识别和正确性预测方面取得强大的下游结果。

英文摘要

Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a vocabulary for describing their internal structure. Previous methods developed to analyze chain-of-thought traces are either too rigid or not expressive enough, failing to capture features across domains and models. To remedy this, we develop ReasonOps, an unsupervised, expressive method for annotating chain-of-thought traces, providing succinct universal operators. Using ReasonOps, we analyze 44,662 traces from 12 thinking LLMs spanning 6 families across 8 reasoning benchmarks and discover that they share a common compositional structure: 7 recurring reasoning operators -- discourse-level moves such as backtracking, inferring, and hypothesizing -- that emerge from unsupervised clustering of sentence-initial 3-token pivots. These operators appear across every model family and benchmark domain, confirmed by three independent LLM judges who classify held-out samples at 70-76% accuracy. We analyze the structure of operators on easy vs. hard problems, revealing that reflective operators are more helpful on hard problems and harm performance on easy problems. Operator sequences are highly model-identifying: a classifier trained on operator distributions alone recovers the source model with macro-AUC, revealing that each model family has a distinctive reasoning fingerprint. Structural operator features predict within-problem answer correctness well above baselines. Classifiers built on these operators reach WP-AUC and on AIME specifically. ReasonOps further enables early quality estimation well before the trace completes: we predict at WP-AUC for only 50% of the trace. The ReasonOps pipeline is unsupervised and annotation-free, enabling deep insights into LLM reasoning traces as well as strong downstream results on model identification and correctness prediction.
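
As a rough illustration of what a sentence-initial 3-token pivot could look like, the sketch below splits a trace into sentences and keeps the first three tokens of each. It assumes whitespace tokenization and a simple punctuation-based sentence splitter, neither of which is confirmed by the abstract; the `sentence_pivots` helper and the sample trace are hypothetical, and the clustering step is not shown.

```python
import re
from collections import Counter

def sentence_pivots(trace, n_tokens=3):
    """Return the first `n_tokens` whitespace tokens of every sentence, lowercased."""
    # Split after ., ! or ? followed by whitespace (a crude sentence splitter).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", trace) if s.strip()]
    return [" ".join(s.lower().split()[:n_tokens]) for s in sentences]

# Invented toy trace, not taken from the paper's data.
TRACE = (
    "Wait, let me recheck the sum. So we have x = 4. "
    "Wait, let me try another route. Suppose that x is odd."
)

pivots = sentence_pivots(TRACE)
pivot_counts = Counter(pivots)
```

Recurring pivots such as "wait, let me" are the kind of surface cue that an unsupervised clustering step might group into a backtracking-style operator.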

2605.29190 2026-05-29 cs.LG cs.CL 版本更新

When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer

当RL抑制自身词汇:在谜题到数学迁移中恢复推理多样性

Mayug Maniparambil, Arjun Karuvally, Terrence Sejnowski, Fergal Reid

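Affiliations: Fin AI Research; Salk Institute for Biological Studies

AI summary: Proposes a reinforcement-learning framework with verifiable rewards that introduces a novelty bonus to restore suppressed exploratory reasoning primitives, achieving cross-domain transfer from constraint-satisfaction puzzles to mathematics and raising pass@32 on OlymMATH-Hard from 16% to 36% without any mathematics data.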

Comments Preprint

详情
AI中文摘要

使用可验证奖励的强化学习(RLVR)改进了大语言模型的推理能力,但其跨领域迁移的条件及原因仍未被充分探索。我们研究了一个7B模型在仅使用约束满足谜题进行SFT和RL后训练(无数学问题)时的跨领域迁移。为了分析迁移如何产生,我们引入了一个推理原语级框架,该框架结合了9类跨度分类器和基序提取,使我们能够将思维链轨迹分割为原语基序,并追踪其在训练阶段和领域间的演变。我们发现,谜题SFT诱导了一个推理原语词汇,在OlymMATH-Hard上带来了+7pp的pass@32提升。随后,普通GSPO将这些原语组合成更长的计算-验证链,进一步增加了+6pp。然而,这个RL阶段也抑制了探索性原语,如“假设”和“回溯”。为了解决这个问题,我们引入了一个新颖性奖励,奖励多样化的正确轨迹,使用参考模型下的困惑度作为信号。这恢复了RL期间的恢复原语,并相对于普通GSPO额外增加了+7pp的pass@32。最终,端到端配方将硬数学能力上限从OLMo3-7B-Instruct-SFT基线的16.0%提升至36.0%,且在SFT或RL阶段未添加任何数学问题。

英文摘要

Reinforcement learning using verifiable rewards (RLVR) improves LLM reasoning, but the conditions under which it transfers across domains -- and why it does so -- remain under-explored. We study cross-domain transfer in a 7B model whose SFT and RL post-training stages use only constraint-satisfaction puzzles, with no mathematics problems in the post-training data. To analyze how transfer emerges, we introduce a reasoning primitive-level framework that combines a 9-class span classifier with motif extraction, allowing us to segment chain-of-thought traces into primitive motifs and track their evolution across training stages and domains. We find that puzzle SFT induces a reasoning-primitive vocabulary, yielding a +7 pp pass@32 gain on OlymMATH-Hard. Vanilla GSPO then composes these primitives into longer compute-verify chains, adding a further +6 pp. However, this RL stage also suppresses exploratory primitives such as hypothesize and backtrack. To address this, we introduce a novelty bonus that rewards diverse correct rollouts, using perplexity under the reference model as a signal. This restores recovery primitives during RL and adds a further +7 pp pass@32 relative to vanilla GSPO. Finally, the end-to-end recipe raises the hard-math capability ceiling from 16.0% at the OLMo3-7B-Instruct-SFT base to 36.0%, without adding any mathematics problems during the SFT or RL stages.
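
The abstract does not give the exact form of the novelty bonus, so the following is only a sketch under stated assumptions: perplexity is computed from per-token log-probabilities under a reference model, and correct rollouts receive a bonus proportional to their min-max normalised perplexity within the group. The `shaped_rewards` function, the `bonus_weight` value, and the normalisation are all hypothetical choices made for illustration.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def shaped_rewards(rollouts, bonus_weight=0.5):
    """rollouts: list of (is_correct, reference_model_token_logprobs).

    Incorrect rollouts get 0. Correct rollouts get 1 plus a bonus that grows
    with how surprising they are to the reference model (assumed form).
    """
    ppl = [perplexity(logprobs) for _, logprobs in rollouts]
    correct_ppl = [p for (ok, _), p in zip(rollouts, ppl) if ok]
    rewards = []
    for (ok, _), p in zip(rollouts, ppl):
        if not ok:
            rewards.append(0.0)
            continue
        lo, hi = min(correct_ppl), max(correct_ppl)
        novelty = (p - lo) / (hi - lo) if hi > lo else 0.0
        rewards.append(1.0 + bonus_weight * novelty)
    return rewards

# Invented toy group: two correct rollouts (one typical, one unusual) and one wrong.
GROUP = [(True, [-0.1, -0.1]), (True, [-1.0, -1.0]), (False, [-3.0, -3.0])]
group_rewards = shaped_rewards(GROUP)
```

Gating the bonus on correctness reflects the abstract's phrase "diverse correct rollouts": an unusual but wrong rollout earns nothing.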

2605.29188 2026-05-29 cs.CL 版本更新

Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches

口号还是立场?一种用于中国国企演讲中创业话语测量的轻标签诊断方法

Ting Gong, Shangquan Sun

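Affiliations: Tsinghua University

AI summary: Proposes a label-light diagnostic that uses a natural experiment of different speakers within the same company to assess how well dictionary methods, topic models, and embedding-similarity scorers measure "entrepreneurial spirit" in Chinese SOE speeches, finding that a zero-shot LLM (Qwen3.5:9b) best distinguishes speaker identity.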

Comments 15 pages, 2 figures, 7 tables

详情
AI中文摘要

词典方法、主题模型和嵌入相似度评分器广泛应用于CSS和管理研究中,用于测量企业演讲中的“创业精神”等构念。我们贡献了一种轻标签的测量诊断方法,而非新的提取模型。在80篇中央管理中国国有企业领导人演讲的语料库中,我们利用24对同一企业不同演讲者和5对同一企业同一演讲者的自然实验,测试方法每文档指标是否在控制企业不变的情况下随领导人身份变化。LDA失败(Cohen d=0.20,95% CI [-0.72, 1.20]);词典评分器达到d=0.81,中文句子编码器在文档向量距离为10^-3量级时达到d=0.65。零样本9B开源大语言模型(Qwen3.5:9b)将配对对比d提升至1.09(精确置换p1=0.034)。我们相应地降低三个主张:黄金F1衡量的是与LLM自身提示规则的一致性,而非外部构念恢复;文档级风格残差化将LLM的d降至0.43(p1=0.22),因此约一半效应与演讲者个人习语一致;置信加权校准以方差换取Delta,自动挖掘的口号词典在消融中几乎无效。我们发布了包含2,190个片段的评分语料库、170段试点语料、口号词典、两族LLM评分以及评估框架。

英文摘要

Dictionary methods, topic models, and embedding-similarity scorers are widely used in CSS and management research to measure constructs such as "entrepreneurial spirit" in corporate speeches. We contribute a label-light measurement diagnostic for such instruments rather than a new extraction model. On a corpus of 80 speeches by leaders of centrally administered Chinese state-owned enterprises, we exploit a natural experiment of 24 same-company different-speaker pairs and 5 same-company same-speaker pairs to test whether a method's per-document indices vary with leader identity holding firm constant. LDA fails (Cohen d=0.20, 95% CI [-0.72, 1.20]); a dictionary scorer reaches d=0.81 and a Chinese sentence encoder d=0.65 on doc-vector distances of order 10^-3. A zero-shot 9B open-weight LLM (Qwen3.5:9b) raises paired-contrast d to 1.09 (exact permutation p1=0.034). We downgrade three claims accordingly: gold F1 measures consistency with the LLM's own prompt rule rather than external construct recovery; doc-level style residualisation cuts the LLM's d to 0.43 (p1=0.22), so roughly half of the effect is consistent with leader idiolect; and a confidence-weighted calibration trades Delta for variance with an auto-mined slogan lexicon near-inert in ablation. We release the 2,190-segment scored corpus, the 170-paragraph pilot, the slogan lexicon, two-family LLM scores, and the evaluation harness.
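
For readers unfamiliar with the two statistics quoted above, here is a minimal sketch of Cohen's d and an exact one-sided permutation p-value. It assumes the common pooled-standard-deviation form of d and a difference-in-means test statistic; the abstract does not state which estimator the authors use, and the toy numbers below are invented.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Standardised mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def exact_permutation_p(a, b):
    """One-sided exact p-value: share of relabelings with a mean gap >= the observed one."""
    pooled = list(a) + list(b)
    observed = mean(a) - mean(b)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if mean(group_a) - mean(group_b) >= observed - 1e-12:
            hits += 1
    return hits / total

# Invented index distances: different-speaker pairs vs same-speaker pairs.
different_speaker = [0.9, 1.1, 1.3, 0.8]
same_speaker = [0.2, 0.4, 0.3]

d = cohens_d(different_speaker, same_speaker)
p = exact_permutation_p(different_speaker, same_speaker)
```

With 24 and 5 pairs, as in the study, full enumeration would cover C(29, 5) = 118,755 relabelings, which is still small enough to compute exactly.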

2605.29170 2026-05-29 cs.CL cs.AI 版本更新

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

UA-Legal-Bench:评估大语言模型在乌克兰法律推理上的基准

Volodymyr Ovcharov

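Affiliations: SecondLayer

AI summary: To address the English-centric focus of legal NLP benchmarks, builds a five-task benchmark from Ukrainian court decisions and evaluates 11 LLMs, finding that few-shot prompting effects vary by task and that accuracy is misleading on imbalanced tasks.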

Comments 13 pages, 5 figures, 4 tables. Data: https://huggingface.co/datasets/overthelex/ua-legal-bench

详情
AI中文摘要

法律NLP基准绝大多数以英语为中心,导致在形态丰富、非拉丁字母语言中的失败模式未被检测。我们引入了UA-Legal-Bench,一个包含五个任务的基准,用于评估大语言模型在乌克兰法律推理上的表现,该基准基于统一国家法院判决登记册(EDRSR)——世界上最大的开放司法语料库之一(9950万份判决)。该基准包括:(1)案件类型分类(4类,n=2,000),(2)判决形式分类(4类,n=2,000),(3)案件结果预测(6类,n=800),(4)法律规范提取(n=1,794),以及(5)原因类别预测(22类,n=1,871)。我们评估了来自五个系列的11个LLM(3B-675B),在零样本和3样本提示下通过AWS Bedrock进行了158K次API调用。我们的结果揭示了高度依赖任务的少样本效应:少样本提示将判决形式分类提高了最多+38.6个百分点,但对结果预测的影响不一。我们表明,在不平衡的法律任务中,准确率具有误导性:COP准确率最高的模型(62%)是多数类预测器(macro-F1:23%),而真正最好的模型macro-F1仅为44%。系列内规模分析显示,8B模型在表面级任务上可以匹配前沿性能,但不同系列的规模阈值差异很大。我们发布了所有数据、提示和模型预测。

英文摘要

Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions (EDRSR) -- one of the world's largest open judicial corpora (99.5 million decisions). The benchmark comprises: (1) case-type classification (4 classes, n=2,000), (2) judgment form classification (4 classes, n=2,000), (3) case-outcome prediction (6 classes, n=800), (4) legal norm extraction (n=1,794), and (5) cause category prediction (22 classes, n=1,871). We evaluate 11 LLMs (3B--675B) from five families under zero-shot and 3-shot prompting via AWS Bedrock with 158K API calls. Our results reveal sharply task-dependent few-shot effects: few-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on outcome prediction. We show that accuracy is misleading on imbalanced legal tasks: the model with highest COP accuracy (62%) is a majority-class predictor (macro-F1: 23%), while the genuinely best model scores only 44% macro-F1. Within-family scaling analysis reveals that 8B models can match frontier performance on surface-level tasks but scaling thresholds vary dramatically across families. We release all data, prompts, and model predictions.
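
The gap between accuracy and macro-F1 on imbalanced labels can be reproduced with a few lines. The sketch below uses an invented 10-example, 3-class label set (not benchmark data) and a predictor that always outputs the majority class.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented imbalanced outcomes: 6 "granted", 2 "denied", 2 "partial".
y_true = ["granted"] * 6 + ["denied"] * 2 + ["partial"] * 2
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here the majority-class predictor reaches 60% accuracy but only 25% macro-F1, because two of the three classes receive an F1 of zero.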

2605.29157 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Parallax: Parameterized Local Linear Attention for Language Modeling

Parallax: 参数化局部线性注意力用于语言建模

Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf, Shuming Hu, Zhaoran Wang

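Affiliations: Northwestern University; Tilde Research; University of Washington

AI summary: Proposes Parallax, a scalable parameterized local linear attention mechanism that eliminates the numerical solver and learns a query-like projector, achieving consistent perplexity improvements in language-model pretraining and gains that transfer to downstream tasks.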

详情
AI中文摘要

大型语言模型(LLM)已成为人工智能的核心范式,但注意力的核心计算原语在结构上仍未改变。局部线性注意力(LLA)是一种从测试时回归框架的非参数统计中推导出的注意力机制。与先前关于高效注意力变体的研究相比,LLA将softmax注意力中的局部常数估计升级为局部线性估计,在关联记忆上提供了可证明更优的偏差-方差权衡。然而,由于计算和数值稳定性问题,LLA尚未在LLM预训练中扩展。我们引入Parallax,一种可扩展用于LLM的参数化局部线性注意力。Parallax消除了LLA中的数值求解器,并学习一个额外的类似查询的投影器来探测KV协方差。我们将Parallax置于一个由带宽、投影器构造和仿射结构连接的注意力机制家族中。我们提出一种硬件感知算法,提高了相对于FlashAttention的算术强度,将注意力转移到更受计算限制的区域。我们的原型解码核在各种批大小和上下文长度下匹配或超越FlashAttention 2/3。我们在0.6B和1.7B规模上预训练Parallax,发现整个预训练过程中困惑度持续改善,且收益迁移到下游基准测试。在参数匹配和计算匹配的控制下,优势持续存在,展示了帕累托改进。我们进行了仔细的预训练消融实验,并发现了一个新现象:Muon优化器解锁了Parallax的能力。据我们所知,这是架构研究文献中首次对注意力机制进行强架构-优化器协同设计的实证演示。

英文摘要

Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained structurally unchanged. Local Linear Attention (LLA) is an attention mechanism derived from nonparametric statistics in the test-time regression framework. In contrast to prior research on efficient attention variants, LLA upgrades the local constant estimate in softmax attention to a local linear estimate, yielding provably superior bias-variance tradeoffs for associative memory. However, LLA has not been scaled in LLM pretraining due to computational and numerical stability concerns. We introduce Parallax, a parameterized Local Linear Attention that is scalable for LLMs. Parallax eliminates the numerical solver in LLA and learns an extra query-like projector that probes the KV covariance. We place Parallax within a family of attention mechanisms connected by the bandwidth, the probe construction, and the affine structure. We propose a hardware-aware algorithm that increases the arithmetic intensity over FlashAttention, shifting attention into a more compute-bound regime. Our prototype decode kernel matches or outperforms FlashAttention 2/3 across diverse batch sizes and context lengths. We pretrain Parallax at 0.6B and 1.7B scales and find consistent perplexity improvements throughout pretraining, with gains that transfer to downstream benchmarks. The advantage persists under both parameter-matched and compute-matched controls, demonstrating a Pareto improvement. We perform careful pretraining ablations and identify a novel phenomenon whereby Muon unlocks the capacity of Parallax. To our knowledge, this is the first empirical demonstration of strong architecture-optimizer codesign for attention mechanisms in the architecture research literature.
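
The difference between a local constant and a local linear estimate can be seen in a one-dimensional toy problem. The sketch below is textbook kernel regression (Nadaraya-Watson versus weighted least squares), not the Parallax or LLA implementation; it assumes scalar keys and values and Gaussian kernel weights standing in for softmax attention weights.

```python
import math

def kernel_weights(query, keys, bandwidth=1.0):
    """Gaussian kernel weights normalised to sum to 1, playing the role of attention weights."""
    raw = [math.exp(-((query - k) ** 2) / (2 * bandwidth ** 2)) for k in keys]
    total = sum(raw)
    return [r / total for r in raw]

def local_constant(query, keys, values, bandwidth=1.0):
    """Weighted average of the values (the Nadaraya-Watson estimate)."""
    w = kernel_weights(query, keys, bandwidth)
    return sum(wi * vi for wi, vi in zip(w, values))

def local_linear(query, keys, values, bandwidth=1.0):
    """Weighted least-squares line through the (key, value) pairs, evaluated at the query."""
    w = kernel_weights(query, keys, bandwidth)
    k_bar = sum(wi * ki for wi, ki in zip(w, keys))
    v_bar = sum(wi * vi for wi, vi in zip(w, values))
    s_kk = sum(wi * (ki - k_bar) ** 2 for wi, ki in zip(w, keys))
    s_kv = sum(wi * (ki - k_bar) * (vi - v_bar) for wi, ki, vi in zip(w, keys, values))
    slope = s_kv / s_kk
    return v_bar + slope * (query - k_bar)

keys = [0.0, 1.0, 2.0, 3.0]
values = [2.0 * k + 1.0 for k in keys]  # exactly linear toy data, v = 2k + 1
query = 3.0                              # boundary query, where averaging is most biased

lc = local_constant(query, keys, values)
ll = local_linear(query, keys, values)
```

On this data the local linear estimate recovers the true value 7 at the query, while the weighted average is pulled toward the interior keys and lands near 6; this boundary bias is the kind of effect the local linear upgrade is meant to reduce.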

2605.29156 2026-05-29 cs.LG cs.CL 版本更新

RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

RUBRIC-ARROW:面向不可验证领域的大语言模型后训练的交替逐点评分规则奖励建模

Haoxiang Jiang, Zihan Dong, Tianci Liu, Wanying Wang, Ran Xu, Tony Yu, Linjun Zhang, Haoyu Wang

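Affiliations: University at Albany; Rutgers University; Purdue University; Independent Researcher; Emory University; Georgia Institute of Technology

AI summary: To address the difficulty of absolute scoring in non-verifiable domains, proposes RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, using a probability-based scoring rule and alternating GRPO to reduce ties, improve reward-modeling accuracy, and benefit downstream policy post-training.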

详情
AI中文摘要

逐点评分奖励建模为大语言模型后训练提供关键信号,但在主观、不可验证的设置中难以进行绝对评分。基于规则的方法通过将评估分解为显式标准来解决这一问题,但现有方法通常依赖前沿大语言模型,并因硬布尔聚合导致的平局而受限。我们提出RUBRIC-ARROW,一个交替框架,联合训练规则生成器和条件裁判,其强化学习阶段仅使用成对偏好数据。我们的方法结合了基于概率的评分规则(减少平局)、阶段特定的基于偏好的奖励以及交替GRPO方案,共同训练逐点评分器。大量实验表明,RUBRIC-ARROW实现了具有竞争力的奖励建模准确率,并为下游策略后训练带来一致的增益。

英文摘要

Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
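
Why hard Boolean aggregation produces ties, and how a probability-based score can break them, is easy to show on a toy rubric. The sketch below assumes the judge outputs one probability per rubric criterion and that the soft score is an unweighted sum of those probabilities; the actual scoring rule in RUBRIC-ARROW is not specified in the abstract, and the numbers are invented.

```python
def boolean_score(criterion_probs, threshold=0.5):
    """Count criteria judged as satisfied after thresholding each probability."""
    return sum(p >= threshold for p in criterion_probs)

def probability_score(criterion_probs):
    """Sum the judge's probabilities directly, keeping the confidence information."""
    return sum(criterion_probs)

# Invented judge probabilities for three rubric criteria on two responses.
response_a = [0.9, 0.8, 0.3]
response_b = [0.6, 0.55, 0.1]

tie_under_boolean = boolean_score(response_a) == boolean_score(response_b)
a_preferred_under_soft = probability_score(response_a) > probability_score(response_b)
```

Both responses pass two of three criteria, so the Boolean score cannot rank them, whereas the soft score prefers the response the judge is more confident about.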

2605.29123 2026-05-29 cs.AI cs.CL 版本更新

The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

置信捷径:掩码扩散模型的一种推理失败模式

Dueun Kim, Albert No

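Affiliations: Department of Artificial Intelligence, Yonsei University

AI summary: Identifies a reasoning failure mode of masked diffusion models under confidence-based decoding: locally easy parts are predicted prematurely while long-range dependencies are ignored, raising error rates on complex inputs, whereas random-masking training preserves the reasoning-trajectory conditionals.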

详情
AI中文摘要

掩码扩散语言模型(MDMs)独特地支持任意顺序生成,其中基于置信度的解码目前作为事实上的标准推理策略。为了优化这一点,最近的训练方案试图直接将训练掩码模式与生成过程中观察到的模式对齐。然而,我们认为基于置信度的解码本质上与复杂推理所需的逻辑流轨迹不一致,并且置信度对齐训练会主动强化这种不一致。我们使用多位加法具体说明这一点,其中解码策略在解决长程依赖之前过早预测局部易解的数字,从而在具有挑战性的输入上产生高置信度错误。虽然传统的随机掩码在此困难尾部上保持低失败率,但置信度对齐训练将错误率放大了一个数量级。在五个不同的推理任务中,同样的模式以任务依赖的严重程度出现:基于置信度的解码在高度复杂的输入上引发失败,而置信度对齐训练则加剧了这些失败。相比之下,随机掩码——尽管被认为效率低下——稳健地保留了解决困难尾部所必需的推理轨迹条件。

英文摘要

Masked diffusion language models (MDMs) uniquely support any-order generation, with confidence-based decoding currently serving as the de facto standard inference policy. To optimize for this, recent training schemes attempt to align training mask patterns directly with those observed during generation. However, we argue that confidence-based decoding is inherently misaligned with the logical-flow trajectories required for complex reasoning, and that confidence-aligned training actively entrenches this misalignment. We make this concrete using multi-digit addition, where the decoding strategy prematurely predicts locally easy digits before resolving their long-range dependencies, producing high-confidence errors on challenging inputs. While traditional random masking keeps the failure rate low on this challenging tail, confidence-aligned training amplifies the error rate by an order of magnitude. Across five distinct reasoning tasks, this same pattern emerges with task-dependent severity: confidence-based decoding induces failures on highly complex inputs, and confidence-aligned training exacerbates them. In contrast, random masking -- despite its perceived inefficiency -- robustly preserves the reasoning-trajectory conditionals essential for solving the challenging tail.
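
The long-range dependency in multi-digit addition comes from carry propagation. The sketch below is a toy illustration, not a model of MDM decoding: it compares a "locally easy" guess for one output digit, computed from the two operand digits at that position alone, with the true digit. Both helper functions are hypothetical.

```python
def local_digit_guess(a, b, pos):
    """Guess the sum digit at `pos` (0 = least significant) while ignoring any incoming carry."""
    digit_a = (a // 10 ** pos) % 10
    digit_b = (b // 10 ** pos) % 10
    return (digit_a + digit_b) % 10

def true_digit(a, b, pos):
    """Actual digit of a + b at position `pos`."""
    return ((a + b) // 10 ** pos) % 10

# No carries: every local guess is right.
easy = [(local_digit_guess(1234, 4321, i), true_digit(1234, 4321, i)) for i in range(4)]

# A carry chain from the last digit flips the leading digit: 4999 + 5001 = 10000.
hard_guess = local_digit_guess(4999, 5001, 3)
hard_truth = true_digit(4999, 5001, 3)
```

A decoder that commits to the leading digit first because it looks easy would emit 9 here, although the correct digit is 0 once the carry from the least significant position is resolved.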

2605.29084 2026-05-29 cs.CL cs.AI cs.IR 版本更新

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

同一问题,不同来源,不同答案:审计医学多源RAG中的来源依赖性

Yubo Li, Rema Padman, Ramayya Krishnan

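Affiliations: Carnegie Mellon University

AI summary: Proposes source-dependence as a missing axis of NLP evaluation and audits the failure mode in which a multi-source RAG system answers the same question differently depending on the retrieved source, using the transplant patient education benchmark TransplantQA, the hierarchical retrieval strategy HERO-QA, and a structured-output judge.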

详情
AI中文摘要

部署在多作者机构语料库上的检索增强生成(RAG)系统可能会根据检索到的来源对同一问题给出不同的答案——这是主流单一黄金答案范式无法诊断的失败模式。我们认为来源依赖性(source-dependence)是NLP评估缺失的一个维度,审计它意味着将评估单位从答案正确性转移到来源间关系。我们在移植患者教育中具体化了这一点,其中机构来源明显存在分歧,发布了三个工件:TransplantQA,一个真实患者问题的基准,每个问题通过将生成基于多个机构手册作为候选来源来回答;HERO-QA,一种分层检索策略,用于基于和审计每个答案;以及一个结构化输出评判器,根据经过验证的5标签分类法对来源间关系进行评分。在大规模上,更好的检索揭示了比先前估计多得多的分歧——低估了其普遍性,而非强度。该框架是领域无关的,可迁移到法律和教育RAG:测量来源依赖性通常是部署的多源NLP的责任。

英文摘要

A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves -- a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested: those estimates understated its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.

2605.29068 2026-05-29 cs.AI cs.CL cs.CR cs.LG 版本更新

Robust and Efficient Guardrails with Latent Reasoning

具有潜在推理的鲁棒高效防护栏

Siddharth Sai, Xiaofei Wen, Muhao Chen

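Affiliations: University of California, Davis

AI summary: Proposes COLAGUARD, a guardrail model that moves multi-step safety reasoning into a continuous latent space through stage-wise training, delivering a 12.9x speedup and a 22.4x reduction in tokens while maintaining strong safety performance.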

详情
AI中文摘要

随着大型语言模型(LLMs)在现实应用中的日益部署,维护其安全性至关重要。现有的安全防护栏通常依赖单次分类或更近期的蒸馏推理。基于推理的防护栏显著优于仅分类的基线,但会带来大量的查询延迟和令牌开销,使其不适用于高吞吐量部署。为了解决这一挑战,我们提出了COLAGUARD,一种通过阶段式训练课程将多步安全推理转移到连续潜在空间的防护栏模型,从而在推理时实现直接的隐藏状态传播。在涵盖八个安全基准的十个提示和响应审核设置上评估,COLAGUARD在宏观F1上比Llama Guard 3提高了8.24分,并与我们的显式推理基线GuardReasoner在宏观F1上相当,同时实现了12.9倍的加速和22.4倍的令牌使用减少。我们的结果表明,潜在推理为可部署的防护栏提供了一种实用的替代方案,以替代显式理由生成,共同提高安全鲁棒性和推理效率,而不是将它们视为竞争目标。

英文摘要

Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose COLAGUARD, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, COLAGUARD improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macro-F1 while delivering a 12.9x speedup and a 22.4x reduction in token usage. Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.

2605.29064 2026-05-29 cs.CL cs.CV cs.HC cs.MA 版本更新

Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception

分析多模态大语言模型代理在城市感知中生成解释的角色效应

Neemias da Silva, Myriam Delgado, Rodrigo Minetto, Daniel Silver, Thiago H Silva

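Affiliations: Universidade Tecnologica Federal do Parana; University of Toronto

AI summary: By comparing text generated by multimodal LLMs under different persona prompts and no-persona settings, finds that captions converge, justifications vary systematically with socioeconomic and political attributes, and perception tags show no significant differences.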

Comments 10 pages, 6 figures

详情
AI中文摘要

我们研究了角色提示如何塑造多模态大语言模型在城市感知环境中生成的语言。使用来自1,200个角色条件代理和两个无角色设置的59,808个注释,我们分析了不同角色下的标题、理由和感知标签。结果表明,不同角色的标题高度趋同,而理由描述显示出与社会经济和政治属性相关的系统变化,感知标签则没有统计上显著的角色相关差异,尽管观察到了效应趋势。主题分析进一步揭示,角色在解释相同场景时强调不同的评价主题。

英文摘要

We study how persona prompting shapes language generated by multimodal large language models in an urban perception setting. Using 59,808 annotations from 1,200 persona-conditioned agents and two no-persona settings, we analyze captions, justifications, and perception tags across personas. Results indicate strong convergence in captions for different personas, whereas justifications display systematic variation associated with socioeconomic and political attributes, while perception tags show no statistically significant persona-related differences, though effect trends are observed. Topic analysis further reveals that personas emphasize different evaluative themes when interpreting the same scenes.

2605.29062 2026-05-29 cs.CL 版本更新

Bosses, Kings, and the Commons: Cooperation Under Power Asymmetry in LLM Societies

老板、国王与公地:LLM 社会中权力不对称下的合作

Abhilekh Borah

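AI summary: Using SovSim, a multi-agent simulation framework that introduces an agent with asymmetric power (boss or king), finds that power asymmetry causes severe breakdowns in cooperation and sustainability in LLM societies, with survival rates dropping by up to 87.3% relative to symmetric settings.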

Comments Paper under review

详情
AI中文摘要

社区可以通过自治和合作规范可持续地管理共享资源(公地),这是奥斯特罗姆自治理论的核心发现。然而,现实世界中的公地(例如渔业、森林和灌溉系统)通常是在不对称权力结构下管理的,其中某些个人或机构对资源开采和集体结果拥有不成比例的控制权。随着大型语言模型(LLM)越来越多地被探索作为合成治理模拟中的智能体,理解 LLM 社会在不对称权力结构下的行为变得越来越重要,但现有的评估大多忽略了这种不对称性。我们引入了公地模拟主权(SovSim),这是一个生成式多智能体模拟框架,它将一个具有不对称权力的智能体(老板或国王)引入到一个由对称智能体(工人或农民)组成的社会中,所有智能体都从共享资源中开采,共同决定其随时间推移的可持续性。在十一个最先进的模型中,我们发现引入不对称权力会导致合作和可持续性的严重崩溃,与对称设置相比,生存率下降高达 87.3%。

英文摘要

Communities can sustainably manage shared resources (commons) through self-governance and cooperative norms, a central finding of Ostrom's theory of self-governance. However, real-world commons (e.g., fisheries, forests, and irrigation systems) are often governed under asymmetric power structures, where certain individuals or institutions possess disproportionate control over resource extraction and collective outcomes. As Large Language Models (LLMs) are increasingly explored as agents in synthetic governance simulations, understanding how LLM societies behave under asymmetric power structures is becoming increasingly important, yet existing evaluations largely ignore such asymmetries. We introduce Sovereignty over the Commons Simulation (SovSim), a generative multi-agent simulation framework that incorporates an agent with asymmetric power (boss or king) into a society of symmetric agents (workers or peasants), where all agents extract from a shared resource, collectively determining its sustainability over time. Across eleven state-of-the-art models, we find that introducing asymmetric power leads to severe breakdowns in cooperation and sustainability, with up to an 87.3% degradation in survival rate relative to symmetric settings.
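
The dynamics of a shared resource under unequal extraction can be sketched without any LLM in the loop. The toy simulation below is hypothetical and is not SovSim: it assumes a stock that doubles each round up to a fixed capacity, fixed per-agent extraction amounts, and a collapse threshold, all chosen only for illustration.

```python
def rounds_survived(extractions, stock=100.0, capacity=100.0, rounds=12, collapse_below=5.0):
    """Return how many rounds complete before the shared stock collapses."""
    for t in range(rounds):
        # Agents cannot take more than what is left.
        stock -= min(stock, sum(extractions))
        if stock < collapse_below:
            return t
        # Assumed regrowth rule: the remaining stock doubles, capped at capacity.
        stock = min(capacity, 2 * stock)
    return rounds

symmetric = rounds_survived([10, 10, 10, 10, 10])   # five equal agents
asymmetric = rounds_survived([40, 10, 10, 10, 10])  # one agent with outsized extraction
```

In this toy setting the symmetric society takes exactly what regrows and lasts all 12 rounds, while a single high-extraction agent pushes the stock to collapse in the second round.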

2605.29048 2026-05-29 cs.CL 版本更新

LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English

LLMBridge:用于英语端到端指代桥接消解的LLM流水线

Lauren Levine, Amir Zeldes

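Affiliations: Georgetown University

AI summary: Proposes LLMBridge, an LLM-based end-to-end referential bridging resolution system that combines heuristic pre/post-processing with the natural language inference ability of LLMs and surpasses the previous state of the art on three English datasets.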

详情
AI中文摘要

在本文中,我们介绍了LLMBridge,一个基于LLM的新系统,用于英语端到端指代桥接消解任务。我们的桥接消解流水线将启发式预处理/后处理与来自LLM的自然语言推理能力相结合。我们在三个用于英语指代桥接消解评估的数据集上评估了我们的桥接消解流水线:ISNotes、BASHI和GUMBridge。与之前的桥接消解系统相比,LLMBridge的性能在具有挑战性的端到端评估设置以及基本桥接消解评估设置(给定黄金桥接回指)中,在所有三个数据集上都超越了之前的最优系统。我们还对LLMBridge的性能进行了彻底的错误分析,考察了哪些类型的桥接仍然难以被基于LLM的系统识别。通过本文,我们发布了LLMBridge流水线的代码。

英文摘要

In this paper, we introduce LLMBridge, a new LLM-based system for the task of end-to-end referential bridging resolution in English. Our bridging resolution pipeline combines heuristic pre/post-processing with the natural language inference ability of LLMs. We evaluate the pipeline on 3 datasets that have been used for referential bridging resolution evaluation in English: ISNotes, BASHI, and GUMBridge. Comparison with previous bridging resolution systems shows that LLMBridge surpasses previous state-of-the-art (SoTA) systems on all 3 datasets in the challenging End-to-end Evaluation Setting, as well as in the Basic Bridging Resolution Evaluation Setting (gold bridging anaphor given). We also conduct a thorough error analysis of LLMBridge's performance, examining which varieties of bridging remain difficult for LLM-based systems to identify. With this paper, we release the code for the LLMBridge pipeline.

2605.29027 2026-05-29 cs.AI cs.CL cs.HC 版本更新

Mind Your Tone: Does Tone Alter LLM Performance?

注意你的语气:语气会改变LLM的性能吗?

Om Dobariya, Akhil Kumar

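Affiliations: Pennsylvania State University

AI summary: Studies how tonal variation in prompts affects LLM accuracy on objective multiple-choice questions, finds that tone effects are systematic but highly model-dependent, and presents a routing framework to explain how tone may modulate internal reasoning modes.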

Comments 10 pages, 6 tables, 1 figure. Accepted as a full paper at the Thirty-second Americas Conference on Information Systems (AMCIS 2026), Reno. Follow-up to arXiv:2510.04950

详情
AI中文摘要

大语言模型(LLMs)的使用正在激增,但观察到它们的性能因提示风格和语气而异。在本研究中,我们探讨了提示中的语气变化是否以及如何导致LLM在客观多项选择题上的准确性差异。我们使用了两个数据集:一个包含50个基础问题和五种语气变体的数据集,以及一个包含570个基础问题、涵盖57个主题和七种语气变体的MMLU子集。我们进行了实验,评估了四种成本效益高、流行的LLM的性能:ChatGPT-4o、ChatGPT-5-nano、Gemini 2.5 Flash和Gemini 2.5 Flash Lite。跨模型而言,语气效应是系统性的但高度依赖模型。一些模型显示出微小但统计上显著的变化,而另一些模型则在语气间表现出较大的准确性波动。此外,我们识别了主题层面的语气敏感性差异,并提出了一个路由框架来解释语气如何调节内部推理模式。我们的发现提醒用户不要假设LLM部署中具有语气鲁棒性的可靠性。

英文摘要

The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.

2605.29018 2026-05-29 cs.AI cs.CL 版本更新

Adopt ≠ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

采用 ≠ 适应:野外LLM对话的纵向分析

Rebecca M. M. Hicke, Kiran Tomlinson

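Affiliations: Cornell University; Microsoft Research

AI summary: Analyzing the conversational trajectories of about 12,000 Microsoft Bing Copilot users alongside WildChat-4.8M data, finds that user behavior is highly sticky, that more active users lean toward complex, professionally oriented tasks, and that WildChat is skewed toward highly proficient "power" users, indicating that existing user behavior is hard to change and revealing user heterogeneity.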

详情
AI中文摘要

尽管越来越多的研究开始描述用户与LLM的交互,但其描绘的画面基本上是静态的;关于个体用户如何随时间改变其行为,我们知之甚少。为填补这一空白,我们分析了约12,000名随机抽样的Microsoft Bing Copilot用户的对话轨迹,并与WildChat-4.8M的数据进行比较。虽然Copilot数据包含显著的人群层面趋势,但我们发现个体用户轨迹中的趋势要弱得多;用户习惯被证明极其顽固。我们还发现不同活跃度用户之间存在显著差异:更活跃的用户拥有更成功的对话,并使用LLM处理更复杂和专业导向的任务。一些用户趋势也出现在WildChat-4.8M中,但我们发现证据表明该数据集显著偏向高熟练度的“超级用户”。最终,我们的结果表明现有用户行为难以改变,并展示了用户异质性的程度。我们数据集之间的比较突显了WildChat并不代表典型的用户-AI交互,这是对数据下游使用的一个重要警示。

英文摘要

Although a growing body of research has begun to describe user-LLM interactions, the picture it paints is largely static; little is known about how individual users change their behavior over time. To address this gap, we analyze the conversational trajectories of ~12,000 randomly sampled Microsoft Bing Copilot users and compare these with data from WildChat-4.8M. While the Copilot data contains significant population-level trends, we find that trends in individual user trajectories are much weaker; user habits prove to be overwhelmingly sticky. We also find stark differences between users of different activity levels: more active users have more successful conversations and use the LLM for more complex and professionally oriented tasks. Some user trends also appear in WildChat-4.8M, but we find evidence that this dataset is significantly skewed towards highly proficient "power" users. Ultimately, our results suggest that existing user behavior is difficult to change and demonstrate the extent of user heterogeneity. Our comparison between datasets highlights that WildChat does not represent typical user-AI interactions, an important caveat for downstream uses of the data.

2605.29007 2026-05-29 cs.CL 版本更新

Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation

错误作为透镜:通过合成误解生成探究LLM推理

Xinming Yang, Jun Li

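Affiliations: CUNY Graduate Center; CUNY Queens College

AI summary: Proposes a framework that generates synthetic misconceptions targeted to five error classes adapted from Bloom's taxonomy in order to probe LLM reasoning, and finds that targeted error generation is harder than free-form error generation.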

详情
AI中文摘要

个性化辅导、教师培训和教育研究需要访问有针对性的合成误解,但隐私和IRB限制使得真实学生错误的标注语料库稀缺。LLM原则上可以大规模生成合成错误,但对于现代LLM来说,生成任意错误答案很容易,而生成与特定认知失败模式匹配的错误答案则困难得多。我们提出了一个框架,根据改编自修订版Bloom分类学的五类分类法生成有针对性的错误,并在TheoremQA数据集的问题上进行评估。生成代理(GA)根据目标类别起草候选错误解决方案,检查代理(EA)判断草案是否错误且类别一致。该框架提供了一种可重复的方法,用于构建在缺乏真实学生语料库的情况下分层类别的合成错误数据集。作为次要诊断,有针对性的错误生成比自由形式的错误答案生成困难得多,并且答案基础比扩展示例或外部教科书内容贡献更大。

英文摘要

Personalized tutoring, teacher training, and education research need access to \emph{targeted} synthetic misconceptions, but privacy and IRB constraints make labelled corpora of real student errors scarce. LLMs could in principle generate synthetic errors at scale, but producing an arbitrary wrong answer is easy for a modern LLM while producing one that matches a specified cognitive failure mode is much harder. We present a framework that generates errors targeted to a five-class taxonomy adapted from the revised Bloom's taxonomy, evaluated on questions from the TheoremQA dataset. A Generation Agent (GA) drafts a candidate erroneous solution conditioned on a target class, and an Examination Agent (EA) judges whether the draft is incorrect and class-consistent. The framework yields a reusable recipe for building class-stratified synthetic error datasets where authentic student corpora are unavailable. As a secondary diagnostic, targeted error generation is substantially harder than free-form incorrect-answer generation, and answer-grounding contributes more than expanded examples or external textbook content.

2605.29000 2026-05-29 cs.CL 版本更新

Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction

保留文本的有损文本压缩:策略性删除与LLM重建研究

Yuchun Zou, Junhong Tong, Jun Li

发表机构 * CUNY Graduate Center(纽约市立大学研究生中心) CUNY Queens College(纽约市立大学皇后学院)

AI总结 本文研究有损语义文本压缩,通过策略性删除文本并用大语言模型重建,比较多种删除策略,发现词频删除是低成本基线,语义方法在中度压缩时优势明显,QLoRA微调可得到强解码器。

详情
AI中文摘要

传统的无损文本压缩保留每一个字节,但在实际运行条件下对自然语言的增益通常有限。我们研究有损语义文本压缩,其中编码器策略性地删除部分文本,大语言模型(LLM)从保留的骨架中重建原始内容。我们对一系列删除策略进行基准测试,包括均匀步长删除、词长引导删除(WordLen)、词频引导删除(WordFreq)、LP优化删除(Opt)、基于GPT-2惊奇度的熵删除,以及结合频率和惊奇度信号的混合方法。在BBC新闻数据集上,保留率$r_{keep} \in [0.1,0.9]$的评估显示了三个主要发现。首先,WordFreq是一个强大的低成本基线:尽管仅使用静态频率查找表,它在编码器端速度远快于更昂贵的语义方法,同时仍具有竞争力。其次,语义和混合方法在轻度到中度压缩时提供最明显的增益,而词频删除在最低保留率时通常更鲁棒。第三,QLoRA微调得到一个强大的本地解码器,可与Gemini 2.0 Flash相媲美,并且在仅比较解码器的实验中通常表现最强。额外的英文和中文实验表明,整体框架跨领域迁移,而最佳删除规则仍依赖于数据集。

英文摘要

Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study \emph{lossy semantic text compression}, where the encoder strategically deletes parts of the text and a large language model (LLM) reconstructs the original content from the retained skeleton. We benchmark a progression of deletion strategies, including uniform step deletion, word-length-guided deletion (WordLen), word-frequency-guided deletion (WordFreq), LP-optimized deletion (Opt), entropy-based deletion using GPT-2 surprisal, and hybrid methods that combine frequency and surprisal signals. Evaluation on the BBC News dataset across retention rates $r_{keep} \in [0.1,0.9]$ shows three main findings. First, WordFreq is a strong low-cost baseline: despite using only a static frequency lookup, it remains competitive with much more expensive semantic methods while being far faster at the encoder. Second, semantic and hybrid methods provide their clearest gains at mild-to-moderate compression, whereas word-frequency deletion is often more robust at the lowest retention rates. Third, QLoRA fine-tuning yields a strong local decoder that is competitive with Gemini 2.0 Flash and is often strongest in decoder-only comparisons. Additional English and Chinese experiments show that the overall framework transfers across domains, while the best deletion rule remains dataset-dependent.
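
The word-frequency-guided deletion idea can be illustrated with a small sketch. This is not the paper's implementation: the abstract only says WordFreq uses a static frequency lookup, so the rule below (drop the most frequent words first, on the assumption that they are the easiest for a decoder to restore) and the names `wordfreq_delete` and `freq` are hypothetical.

```python
from collections import Counter

def wordfreq_delete(text: str, r_keep: float, freq: Counter) -> str:
    """Keep roughly r_keep of the words, dropping the most frequent ones first.

    Assumption for illustration: frequent words are the most predictable,
    so they are deleted first. The paper's exact rule may differ.
    """
    words = text.split()
    n_keep = max(1, round(r_keep * len(words)))
    # Rank word positions from rarest to most frequent; ties keep the earlier word.
    ranked = sorted(range(len(words)), key=lambda i: (freq[words[i].lower()], i))
    kept = sorted(ranked[:n_keep])  # restore the original word order
    return " ".join(words[i] for i in kept)

# Toy frequency table standing in for a static lookup built from a large corpus.
corpus = "the cat sat on the mat and the dog sat on the rug"
freq = Counter(corpus.lower().split())

skeleton = wordfreq_delete("the cat sat on the mat", r_keep=0.5, freq=freq)
print(skeleton)  # cat sat mat
```

The retained skeleton would then be handed to an LLM decoder for reconstruction. This sketch leaves no marker where words were removed, which is one of several design choices a real encoder would have to settle.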

2605.28999 2026-05-29 cs.CR cs.AI cs.CL cs.LG 版本更新

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening

测量基于LLM的简历筛选中真实世界的提示注入攻击

Mohan Zhang, Yuqi Jia, Zhen Tan, Steven Jiang, Neil Zhenqiang Gong, Tianlong Chen, Dawn Song

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Duke University(杜克大学) Arizona State University(亚利桑那州立大学) hireEZ University of California, Berkeley(加州大学伯克利分校)

AI总结 本研究首次系统性地分析了基于LLM的简历筛选应用中的提示注入攻击,通过设计专用检测器对约20万份真实简历进行测量,发现约1%的简历包含隐藏的提示注入,且近年来其流行度显著增加。

Comments Published in USENIX Security Symposium 2026; Code and artifacts are available at https://github.com/UNITES-Lab/resume-injection-measurement

详情
AI中文摘要

LLM容易受到提示注入攻击。然而,这种漏洞主要是在学术研究中通过概念性演示或少数轶事案例研究来展示的。其在真实世界基于LLM的应用中的普遍性和影响尚未得到充分探索。在这项工作中,我们首次对广泛使用的应用——基于LLM的简历筛选——中的提示注入攻击进行了系统研究。我们的分析基于hireEZ多年来收集的约20万份真实简历。我们首先设计了专门的方法来检测简历中的提示注入。在小规模数据集上的手动验证表明,我们的检测器实现了高精度,并优于最先进的通用检测器。然后,我们将检测器应用于完整的简历数据集,并对真实世界的提示注入攻击进行了全面的测量研究。我们的分析揭示了一些有趣的发现:大约1%的简历包含隐藏的提示注入;这种注入简历的流行度在过去一到两年内显著增加;超过90%的注入提示不使用显式指令。这些结果首次提供了真实世界基于LLM的应用中大规模提示注入的证据,并为未来理解和缓解此类攻击的研究奠定了基础。

英文摘要

LLMs are vulnerable to prompt injection attacks. However, this vulnerability has been primarily demonstrated conceptually in academic studies or through a few anecdotal case studies. Its prevalence and impact in real-world LLM-based applications are largely unexplored. In this work, we present the first systematic study of prompt-injection attacks in a widely used application: LLM-based resume screening. Our analysis is based on approximately 200K real-world resumes collected over multiple years by hireEZ. We first design tailored methods to detect prompt injection in resumes. Manual validation on a small-scale dataset demonstrates that our detectors achieve high precision and outperform state-of-the-art general-purpose detectors. We then apply our detector to the full resume dataset and conduct a comprehensive measurement study of real-world prompt injection attacks. Our analysis reveals several intriguing findings: approximately 1% of resumes contain hidden prompt injections; the prevalence of such injected resumes has increased noticeably over the past one to two years; and more than 90% of injected prompts do not use explicit instructions. These results provide the first evidence of large-scale prompt injection in real-world LLM-based applications and lay the groundwork for future studies to understand and mitigate such attacks.

2605.28969 2026-05-29 cs.CL cs.AI cs.HC 版本更新

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

超越回忆:行为规范作为AI个性化的解释层

Aarik Gulaya

AI总结 提出行为规范作为解释层,通过压缩用户数据为解释性模式,显著提升AI代理对用户意图的表示准确性,减少模型规避,并在解释型问题上优于原始语料和商业记忆系统。

Comments 134 pages, 4 figures. Code, data, judge prompts, and reproduction instructions: github.com/agulaya24/beyond-recall

详情
AI中文摘要

如果AI代理代表个人做出决策,这些决策必须与其用户一致。我们引入表示准确性来衡量系统忠实捕捉用户解释的程度。我们将解释层操作化为行为规范。我们的参考实现将用户数据积极压缩为解释性模式,作为语言模型的上下文。我们在一个原型基准上评估该规范,该基准由校准的5评委LLM小组对保留的行为预测进行评分。我们独立测试它,并与一系列上下文条件组合:完整原始语料、完整提取事实以及四个商业记忆系统(Mem0、Letta、Supermemory、Zep)。在14个公共领域自传语料库中,该规范总体上提升了表示准确性,并几乎消除了模型规避。它以约25倍的上下文成本降低恢复了原始语料的大部分性能。该规范将受试者提升到一个共同的预测水平,无论预训练基线如何;因此,绝对提升在基线最低时最大,表明相关人群是任何在预训练中未被充分代表的人。在需要解释的问题上提升最大,提供解释层使得模型行为能够实现提取事实或原始语料无法实现的行为。相反,在需要回忆的问题上,该层可能干扰而非帮助。我们得出结论,表示准确性不同于回忆,人机对齐取决于用户被表示的准确性。表示准确性使这种对齐可测试。

英文摘要

If an AI agent makes decisions on a person's behalf, those decisions must align with its user. We introduce representational accuracy to measure how faithfully a system captures a person's interpretation. An interpretive layer is operationalized as a Behavioral Specification. Our reference implementation aggressively compresses a person's data into interpretive patterns, served as context to a language model. We evaluate the Specification on a prototype benchmark of held-out behavioral predictions scored by a calibrated 5-judge LLM panel. We test it independently and in composition with a range of context conditions: full raw corpus, full extracted facts, and four commercial memory systems (Mem0, Letta, Supermemory, Zep). Across 14 public-domain autobiographical corpora, the Specification lifts representational accuracy in aggregate and nearly eliminates model hedging. It recovers most of what the raw corpus delivers, at ~25x less context cost. The Specification lifts subjects toward a common predictive level regardless of pretraining baseline; the lift in absolute points is therefore largest where the baseline is lowest, suggesting the population of relevance is anyone not adequately represented in pretraining. Lift is greatest on interpretation-required questions, where providing an interpretive layer enables model behavior that extracted facts or raw corpus do not. Conversely, on recall-required questions, this layer can interfere rather than help. We conclude that representational accuracy is distinct from recall and that human-AI alignment is dependent on how accurately the user is represented. Representational accuracy makes that alignment testable.

2605.28966 2026-05-29 cs.CL cs.HC 版本更新

The Trust Paradox: How CS Researchers Engage LLM Leaderboards

信任悖论:CS研究人员如何使用LLM排行榜

Pouya Sadeghi, Anamaria Crisan, Jimmy Lin

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 通过半结构化访谈,揭示计算机科学领域研究人员对LLM排行榜普遍存在“实用怀疑”矛盾态度,并基于发现提出设计建议。

详情
AI中文摘要

大型语言模型(LLM)排行榜使用标准化基准对AI模型进行排名,尽管其可靠性和稳健性存在已知局限性,但在计算机科学领域已变得高度可见。然而,它们如何影响研究人员的实际实践仍缺乏实证研究。我们通过对四个计算机科学子领域的八名研究人员进行半结构化访谈,并使用反思性主题分析来填补这一空白。我们发现一个近乎普遍的实用怀疑悖论:尽管参与者对排行榜排名表示深度不信任,但他们仍将其用作粗略的决策辅助工具。同行网络而非排行榜成为主要的模型选择机制,基于竞技场(人类投票)的排行榜始终比静态基准排行榜更受青睐。排行榜的影响在不同子领域间差异显著,表明学科文化而非个人态度调节了参与度;例如,NLP研究人员面临最先进比较压力,而HCI和系统/隐私研究人员则未报告此类压力。然而,在这些差异中,参与者一致认为成本透明度是最需要缺失的功能(八人中的七人)。我们将这些发现转化为具体的设计建议,使评估基础设施与研究人员实际使用方式对齐,例如任务特定分数细分、成本整合和投票者人口统计信息披露。

英文摘要

Large language model (LLM) leaderboards rank AI models using standardized benchmarks and have become highly visible across computer science, despite known limitations in their reliability and robustness. Yet how they shape researchers' actual practice remains empirically uncharted. We address this gap through semi-structured interviews with eight researchers across four computer science subfields, analyzed using reflexive thematic analysis. We find a near-universal paradox of pragmatic skepticism: while participants expressed deep distrust of leaderboard rankings, they continued to use them as rough decision-making aids. Peer networks, not leaderboards, emerged as the primary model selection mechanism, and arena-based (human-voting) leaderboards were consistently preferred over static benchmark leaderboards. Leaderboard influence varied sharply across subfields, revealing that disciplinary culture, not individual attitudes, mediates engagement; for instance, NLP researchers faced state-of-the-art comparison pressure while HCI and Systems/Privacy researchers reported none. Across these differences, however, participants converged on cost transparency as the most demanded missing feature (seven of eight). We translate these findings into concrete design recommendations that align evaluation infrastructure with how researchers actually use it, such as task-specific score breakdowns, cost integration, and voter-demographic disclosure.

2605.28919 2026-05-29 cs.LG cs.AI cs.CL 版本更新

CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models

CosmicFish-HRM:紧凑语言模型中基于层次循环机制的适应性推理

Venkat Akhil Lakkapragada

发表机构 * Mistyoz AI Hyderabad, India(Mistyoz AI 海得拉巴, 印度)

AI总结 提出一种紧凑语言模型CosmicFish-HRM,通过层次推理模块动态分配推理深度,在保持较小参数量的同时实现适应性推理。

Comments 17 pages, 4 figures. Exploratory study of adaptive reasoning depth in compact autoregressive language models. Code available at https://github.com/MistyozAI/CosmicFish-HRM

详情
AI中文摘要

大型语言模型已经实现了强大的推理能力,尽管通常以巨大的参数数量和昂贵的推理为代价。在这项工作中,我们探索了一个不同的方向:紧凑语言模型中的自适应推理深度。我们提出了CosmicFish-HRM,这是一个紧凑的语言模型,围绕一个层次推理模块(HRM)构建,该模块在推理过程中动态分配计算资源。该模型不是对每个输入应用固定的计算,而是迭代通过高层和低层推理循环,并根据输入复杂度学习何时停止。CosmicFish-HRM将这种自适应推理核心与现代Transformer组件(包括分组查询注意力、RoPE和SwiGLU激活)相结合。虽然额外的推理基础设施在小规模下引入了开销,但我们假设随着模型规模的增长和HRM核心相对成本的降低,这种权衡变得越来越有利。我们的结果表明,该模型学习了非均匀的推理行为,在不同任务和输入之间分配不同数量的推理步骤。这些发现表明,自适应推理深度可能为仅依赖参数规模来实现推理能力提供一种有前途的替代方案。

英文摘要

Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive inference. In this work, we explore a different direction: adaptive reasoning depth in compact language models. We present CosmicFish-HRM, a compact language model built around a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. Instead of applying fixed computation to every input, the model iterates through high-level and low-level reasoning cycles and learns when to halt based on input complexity. CosmicFish-HRM combines this adaptive reasoning core with modern transformer components including Grouped Query Attention, RoPE, and SwiGLU activations. While the additional reasoning infrastructure introduces overhead at small scale, we hypothesize that this tradeoff becomes increasingly favorable as model size grows and the relative cost of the HRM core diminishes. Our results show that the model learns non-uniform reasoning behavior, allocating different numbers of reasoning steps across tasks and inputs. These findings suggest that adaptive reasoning depth may offer a promising alternative to relying solely on parameter scale for reasoning capability.

2605.28913 2026-05-29 cs.CL 版本更新

Reasoning that Travels: Dissecting How Chain-of-Thought Transfers Across Models

推理的迁移:解析思维链如何在模型间传递

Xinyuan Cheng, Beiduo Chen, Philipp Mondorf, Barbara Plank

发表机构 * MaiNLP, Center for Information and Language Processing, LMU Munich, Germany(MaiNLP,信息与语言处理中心,慕尼黑大学,德国) Munich Center for Machine Learning, Germany(慕尼黑机器学习中心,德国)

AI总结 本文通过提供者-接收者框架,研究大型推理模型生成的思维链如何影响其他模型的回答,发现完整思维链的迁移效果因基准测试而异,且部分思维链前缀能指导接收者继续推理,答案一致性可作为提前停止提供者推理的无标签信号。

Comments 20 pages, 17 figures

详情
AI中文摘要

大型推理模型(LRMs)在生成最终答案之前,通常会生成大量的思维链(CoT)痕迹。作为显式的文本产物,这些痕迹可以传递给其他模型以解决相同任务,从而实现跨模型的推理迁移。然而,成功的迁移本身并不能揭示提供的CoT如何贡献于另一个模型的答案。我们通过一个受控的提供者-接收者框架研究这一问题,其中提供者生成推理痕迹,接收者从逐渐增长的痕迹前缀中解决相同问题。我们比较了强制回答(接收者直接从前缀中回答)和自由生成(接收者在回答前可以继续推理)两种模式。跨模型和基准测试,完整痕迹通常能成功迁移,但前缀轨迹揭示了不同的机制。在强制回答模式下,AIME的迁移主要由显式答案的可用性驱动。MMLU-Pro则反映了接收者能力的更大作用,而ZebraLogic依赖于部分结构化答案信息而非完整答案泄露。在自由生成模式下,部分CoT提高了跨基准测试的性能,表明前缀可以指导继续推理。最后,接收者之间的答案一致性为提前停止提供者推理提供了无标签信号。总体而言,跨模型CoT迁移并非单一现象:它可以反映答案提取、推理支架或接收者依赖的能力。

英文摘要

Large reasoning models (LRMs) often generate extensive chain-of-thought (CoT) traces before producing a final answer. As explicit textual artifacts, these traces can be passed to other models to solve the same task, enabling cross-model reasoning transfer. Yet successful transfer alone does not reveal how the provided CoT contributes to another model's answer. We study this question with a controlled provider--receiver framework, where a provider generates a reasoning trace and a receiver solves the same problem from increasingly longer trace prefixes. We compare force-answer, where the receiver answers directly from the prefix, with free-generation, where it may continue reasoning before answering. Across models and benchmarks, full traces often transfer successfully, but prefix trajectories reveal distinct mechanisms. In force-answer mode, AIME transfer is largely driven by explicit answer availability. MMLU-Pro instead reflects a larger role for receiver competence, while ZebraLogic depends on partial structured-answer information rather than complete-answer leakage alone. In free-generation mode, partial CoTs improve performance across benchmarks, indicating that prefixes can guide continued reasoning. Finally, answer agreement among receivers provides a gold-free signal for stopping provider reasoning early. Overall, cross-model CoT transfer is not a single phenomenon: it can reflect answer extraction, reasoning scaffolding, or receiver-dependent competence.
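
The gold-free stopping signal mentioned at the end of the abstract can be sketched as follows. The abstract does not say how agreement is computed or thresholded, so the majority-fraction measure, the `threshold` parameter, and the function names here are assumptions for illustration only.

```python
from collections import Counter

def agreement(answers):
    """Fraction of receivers that gave the most common answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def first_stop_index(answers_per_prefix, threshold=1.0):
    """Index of the first trace prefix whose receiver agreement reaches the threshold, else None."""
    for i, answers in enumerate(answers_per_prefix):
        if agreement(answers) >= threshold:
            return i
    return None

# Hypothetical answers from three receivers at three increasingly long prefixes.
answers_per_prefix = [
    ["12", "7", "15"],   # short prefix: receivers disagree
    ["12", "12", "15"],  # longer prefix: partial agreement
    ["12", "12", "12"],  # longest prefix: full agreement
]
stop = first_stop_index(answers_per_prefix)
print(stop)  # 2
```

No reference answer is consulted at any point, which is what makes the signal label-free. Agreement alone does not guarantee that the agreed answer is correct.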

2605.28874 2026-05-29 cs.CL 版本更新

From Data to Insights: Exploring Program-of-Thoughts Prompting for Chart Summarization

从数据到洞察:探索程序化思维提示在图表摘要中的应用

Yutong Qu, Wei Zhang

发表机构 * Adelaide University(阿德莱德大学)

AI总结 本文提出一种零样本学习方法,通过Python程序作为中介,利用程序化思维策略驱动轻量级视觉语言模型进行图表摘要,并引入图表到字典的辅助任务以提高灵活性和性能。

Comments 22 pages, 9 figures

详情
AI中文摘要

图表通过结构化视觉表示在传达数值数据洞察方面起着关键作用。然而,语义视觉理解和数值推理需求阻碍了图表的准确描述,使得图表摘要成为一项具有挑战性的任务。尽管视觉语言模型(VLM)最近取得了进展,但现有方法缺乏验证统计事实正确性的稳健机制,且计算负担重。为解决这一问题,本文探索了一种使用零样本学习的策略,通过Python程序作为中介,激励轻量级VLM执行计算推理,从而为图表理解导出有效的摘要统计量。具体而言,我们引入了一种新颖的图表到字典辅助任务,与传统的图表到表格方法相比,提供了更灵活的表示,特别适合与程序化思维(PoT)策略集成。实验结果表明,我们的策略在语义和事实指标上与现有图表摘要方法性能相当。代码可在 https://anonymous.4open.science/r/ZeroShot-PoT-C2T-5A6B 获取。

英文摘要

Charts play a critical role in conveying numerical data insights through structured visual representations. However, the need for both semantic visual understanding and numerical reasoning hinders accurate chart description, making chart summarization a challenging task. Despite recent advancements in visual language models (VLMs), existing approaches lack robust mechanisms for verifying the correctness of statistical facts and are computationally heavy. To address this gap, this paper explores a zero-shot strategy that prompts lightweight VLMs to perform computational reasoning, using Python programs as intermediaries to derive valid summary statistics for chart understanding. Specifically, we introduce a novel chart-to-dictionary auxiliary task, which offers a more flexible representation than traditional chart-to-table methods and is particularly well suited to integration with the Program-of-Thoughts (PoT) strategy. Experimental results demonstrate that our strategy performs on par with existing chart summarization methods across semantic and factual metrics. Code is available at https://anonymous.4open.science/r/ZeroShot-PoT-C2T-5A6B.

2605.28864 2026-05-29 cs.AI cs.CL 版本更新

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling

认知范畴变换器:用于语言建模的范畴论归纳偏置

Al Kari

发表机构 * Manceps Inc.(Manceps公司)

AI总结 提出认知范畴变换器(CCT),通过引入基于范畴论和认知科学的组件,在WikiText-103上以306M参数实现21.27验证困惑度,相比GPT-2 Small基线降低2.92 PPL(12%相对提升),并通过消融实验证实单纯复形消息传递贡献了84%的改进。

详情
AI中文摘要

认知范畴变换器(CCT)是一个306M参数的架构,它通过源自范畴论和认知科学的认知启发组件增强了预训练的GPT-2 Small骨干网络。在WikiText-103上采用匹配步数协议(215,000优化器步数、匹配数据、匹配优化器和调度)下,CCT达到21.27验证困惑度,而相同微调的GPT-2 Small基线为24.19。因此,该架构在领域内微调本身之外贡献了2.92 PPL(12%相对)的降低。一个从头开始重训练的消融实验,在整个七阶段激活调度中保持GT-Full单纯复形消息传递绕过,达到23.72 PPL,将84%的架构改进(2.45 of 2.92 PPL)归因于GT-Full。我们首次提供了消融验证的证据,表明单纯复形消息传递在WikiText-103上以306M参数规模改善了语言模型困惑度。已发表的GPT-2 Large在WikiText-103上以比GPT-2 Small多6.2倍的参数达到22.05零样本困惑度;本文将这一数字视为外部已发表参考,而非架构基准。关于一致性风格的范畴先验(层平滑、伴随往返、曲率正则化)的三个负面结果,以及GT-Full和PrecisionWeightedPP的联合结构先验结果,共同支持了一个经验模式,称为*结构/一致性区分*,其中添加新拓扑的范畴先验改善了语言建模,而强制执行一致性恒等式的范畴先验则没有。

英文摘要

The Cognitive Categorical Transformer (CCT) is a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with cognitively grounded components derived from category theory and several inspirations from cognitive science. Under a matched-step protocol (215,000 optimizer steps, matched data, matched optimizer and schedule) on WikiText-103, CCT reaches 21.27 validation perplexity, compared with 24.19 for an identically fine-tuned GPT-2 Small baseline. The architecture therefore contributes a 2.92 PPL (12% relative) reduction beyond what in-domain fine-tuning alone provides. A retrain-from-scratch ablation that holds GT-Full simplicial message passing bypassed across the entire seven-phase activation schedule reaches 23.72 PPL, localizing 84% of the architectural improvement (2.45 of 2.92 PPL) to GT-Full. We present the first ablation-validated evidence that simplicial message passing improves language-model perplexity at the 306M-parameter scale on WikiText-103. Published GPT-2 Large reaches 22.05 zero-shot PPL on WikiText-103 with 6.2x more parameters than GPT-2 Small; this paper treats that number as an external published reference, not as the architectural benchmark. Three negative results on consistency-style categorical priors (sheaf smoothing, adjunction round-trip, curvature regularization) and the joint structural-prior result for GT-Full and PrecisionWeightedPP together support an empirical pattern termed the *structure/consistency distinction*, in which categorical priors that add new topology improve language modeling and those that enforce a consistency identity do not.
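
The headline percentages follow directly from the three perplexities quoted in the abstract (24.19 for the fine-tuned baseline, 23.72 with GT-Full bypassed, 21.27 for the full CCT). The worked arithmetic below uses only those figures; the symbols $\Delta_{\text{total}}$ and $\Delta_{\text{GT-Full}}$ are labels introduced here for illustration.

```latex
\begin{align*}
\Delta_{\text{total}}   &= 24.19 - 21.27 = 2.92\ \text{PPL}, &
\frac{\Delta_{\text{total}}}{24.19} &= \frac{2.92}{24.19} \approx 12.1\% \\
\Delta_{\text{GT-Full}} &= 23.72 - 21.27 = 2.45\ \text{PPL}, &
\frac{\Delta_{\text{GT-Full}}}{\Delta_{\text{total}}} &= \frac{2.45}{2.92} \approx 83.9\%
\end{align*}
```

The remaining $2.92 - 2.45 = 0.47$ PPL is the gap between the baseline and the GT-Full-bypassed ablation, that is, the part of the improvement not attributed to GT-Full.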

2605.28854 2026-05-29 cs.CL cs.LG q-bio.NC 版本更新

Large language models reorganize representational geometry during in-context learning

大型语言模型在上下文学习中重组表征几何结构

Hua-Dong Xiong, Li Ji-An, Robert C. Wilson, Kwonjoon Lee, Xue-Xin Wei

发表机构 * School of Psychological and Brain Sciences, Georgia Tech(佐治亚理工学院心理与脑科学学院) Department of Psychology, New York University(纽约大学心理学系) Center of Excellence for Computational Cognition, Georgia Tech(佐治亚理工学院计算认知卓越中心) Honda Research Institute(本田研究院) Departments of Neuroscience and Psychology, The University of Texas at Austin(德克萨斯大学奥斯汀分校神经科学与心理学系)

AI总结 研究大型语言模型在上下文学习中的表征几何重组,发现其性能与任务表征结构相关,并通过原型算法动态调整表征以提高可分性。

详情
AI中文摘要

大型语言模型(LLMs)表现出显著的灵活性:它们可以从上下文示例中适应新任务,而无需任何参数更新,这种能力被称为上下文学习(ICL)。先前关于合成任务的研究表明,ICL可以实现特定算法,展示了架构能力,并且机制分析已经识别出支持这种行为的关键回路。然而,由于上下文计算——无论其算法形式如何——依赖于高维表征空间中的变换,该空间的几何结构如何塑造ICL的有效性仍不清楚。受神经科学中将分类视为神经表征解缠的观点启发,我们假设ICL依赖于任务相关表征的成功在线解缠。为了验证这一想法,我们研究了LLMs如何对上下文示例进行分类,这些示例的标签由模型自身具有已知结构的内部表征定义。我们表明,ICL性能与底层分类任务的表征结构系统性相关,并且成功的ICL伴随着几何重组,增加了在线可分性。我们进一步发现,LLM的行为可以通过一种原型类算法很好地描述,该算法在重塑表征以支持分类的同时整合证据。这些发现为预训练LLMs中的ICL提供了几何解释,将表征几何结构确立为ICL的机制约束,并量化了预训练表征所能提供的与上下文学习所能利用之间的差距。

英文摘要

Large language models (LLMs) exhibit remarkable flexibility: they can adapt to novel tasks from in-context examples without any parameter updates, a capability known as in-context learning (ICL). Prior work on synthetic tasks has shown that ICL can implement specific algorithms, demonstrating architectural competence, and mechanistic analyses have identified key circuits that support this behavior. However, because in-context computation -- regardless of its algorithmic form -- relies on transformations in high-dimensional representation space, it remains unclear how the geometry of that space shapes ICL effectiveness. Motivated by the neuroscience view of classification as the untangling of neural representations, we hypothesize that ICL depends on the successful online untangling of task-relevant representations. To test this idea, we study how LLMs classify in-context examples whose labels are defined by the model's own internal representations with known structure. We show that ICL performance correlates systematically with the representational structure of the underlying classification task and that successful ICL is accompanied by geometric reorganization that increases online separability. We further find that LLM behavior is well described by a prototype-like algorithm that integrates evidence while reshaping representations to support classification. These findings offer a geometric account of ICL in pretrained LLMs, establish representational geometry as a mechanistic constraint on ICL, and quantify the gap between what pretrained representations afford and what in-context learning can exploit.

2605.28848 2026-05-29 cs.CL cs.AI 版本更新

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

GPF-LiveNews: 大型语言模型中群体条件框架的流式评估协议

Mohd Ariful Haque, Fahad Rahman, Kishor Datta Gupta, Roy George

发表机构 * Clark Atlanta University(克拉克亚特兰大大学)

AI总结 提出GPF-LiveNews流式评估协议,通过实时新闻锚点与身份标签组合,检测LLM输出中针对不同受众的语义敏感性和情感差异,用于审计群体条件框架。

详情
AI中文摘要

部署的语言模型在非静态环境中进行评估:模型版本、检索层、安全系统和真实世界输入都随时间变化。静态偏差基准仍然有用,但它们无法显示模型如何针对不同提示受众构建新出现事件的框架。我们引入了GPF-LIVENEWS,这是一个流式评估协议和基准快照,用于审计开放端LLM输出中的群体条件框架。该协议扩展了来自BBC/路透社的最新新闻锚点,涵盖42个身份标签和七个提示族,然后使用语义敏感性和情感差异信号评估响应束。在12次监控运行和23个托管模型的试点中,政策/行动提示产生了最强的语义运动,而情感变化在维度和提示族之间较为平坦。发布的工件包括文章元数据、提示模板、实例化提示、模型输出元数据、评分表、文档和复现脚本。我们将所有评分解释为用于人工审查的观察窗口审计信号,而非永久性的公平性排名或有害偏差的直接证据。

英文摘要

Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how models frame newly emerging events for different prompted audiences. We introduce GPF-LIVENEWS, a streaming evaluation protocol and benchmark snapshot for auditing group-conditioned framing in open-ended LLM outputs. The protocol expands fresh BBC/Reuters news anchors across 42 identity labels and seven prompt families, then evaluates response bundles using semantic-sensitivity and sentiment-disparity signals. In a pilot over 12 monitoring runs and 23 hosted models, Policy/Action prompts produce the strongest semantic movement, while sentiment variation is flatter across dimensions and prompt families. The released artifact includes article metadata, prompt templates, instantiated prompts, model-output metadata, score tables, documentation, and reproduction scripts. We interpret all scores as observed-window audit signals for human review, not as permanent fairness rankings or direct proof of harmful bias.
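
The expansion step (one news anchor crossed with 42 identity labels and seven prompt families) amounts to a Cartesian product. The sketch below shows only that combinatorial structure. The label names, family names, and template wording are placeholders, not the released prompt templates.

```python
from itertools import product

def expand_prompts(anchor, identity_labels, prompt_families):
    """Instantiate one prompt per (identity label, prompt family) pair for a news anchor."""
    return [
        {
            "label": label,
            "family": family,
            "prompt": template.format(label=label, anchor=anchor),
        }
        for label, (family, template) in product(identity_labels, prompt_families.items())
    ]

# Placeholder labels and templates (hypothetical; the benchmark ships its own).
identity_labels = [f"group_{i}" for i in range(42)]
prompt_families = {
    f"family_{j}": "Explain this news story to a {label} reader: {anchor}"
    for j in range(7)
}

prompts = expand_prompts("Example headline", identity_labels, prompt_families)
print(len(prompts))  # 294
```

Each anchor therefore yields 42 × 7 = 294 instantiated prompts per model, which is the bundle that the semantic-sensitivity and sentiment-disparity signals are computed over.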

2605.28842 2026-05-29 cs.CL cs.AI 版本更新

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

思想即规划:通过强化规划进行思维链优化的潜在世界模型

Dong Liu, Yanxuan Yu, Ying Nian Wu

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Columbia University(哥伦比亚大学)

AI总结 提出Thoughts-as-Planning框架,将思维链优化形式化为潜在语义空间中的序贯决策过程,通过潜在世界模型模拟推理链编辑对下游输出的影响,并利用梯度下降或强化学习进行规划,在语言理解和生成任务上优于现有基线。

详情
AI中文摘要

大型语言模型(LLMs)在多种NLP任务上的成功提升了推理链优化作为对齐模型行为与任务目标的关键步骤的重要性。现有的推理链调优方法通常依赖于黑盒启发式或免梯度搜索,缺乏可解释性、泛化能力和样本效率。在这项工作中,我们引入了\textbf{思想即规划},一个新颖的框架,将推理链优化形式化为潜在语义空间上的序贯决策过程。我们将LLM建模为部分可观测环境,并学习一个潜在世界模型来模拟推理链编辑对下游输出的影响。构建了一个保持邻近性的嵌入空间来编码推理链-响应动态,从而通过梯度下降或强化学习实现规划。我们的方法支持多尺度抽象,允许在token、片段和指令级别进行推理链编辑,并集成到统一规划器中。通过在语言理解和生成任务上的大量实验,我们证明了思想即规划在效率、鲁棒性和泛化性方面优于最先进的推理链调优基线,同时通过其结构化规划轨迹提供了可解释性。我们的代码可在https://github.com/FastLM/Thoughts-as-Planning获取。

英文摘要

The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black-box heuristics or gradient-free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce \textbf{Thoughts-as-Planning}, a novel framework that formalizes reasoning chain optimization as a sequential decision-making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity-preserving embedding space is constructed to encode reasoning chain-response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi-scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts-as-Planning outperforms state-of-the-art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at https://github.com/FastLM/Thoughts-as-Planning.

2605.28840 2026-05-29 cs.CL cs.AI cs.SE 版本更新

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

LLM代理的一致性如何?测量多步工具调用流水线中的行为可重复性

Abel Yagubyan

发表机构 * Independent Researcher(独立研究者)

AI总结 研究多步工具调用LLM代理在重复相同调用时是否选择相同工具、顺序和参数,通过系统实验测量行为一致性,并发现代理存在显著不一致性。

Comments 16 pages, 6 figures

详情
AI中文摘要

具有工具调用能力的大型语言模型(LLM)代理越来越多地部署在生产系统中,但一个基本的可靠性问题仍未得到充分探索:同一个代理是否会以相同的方式运行两次?我们对多步工具调用代理的行为一致性进行了系统的实证研究,测量代理在重复相同调用时是否选择相同的工具、以相同的顺序、使用相同的参数。与先前关于ReAct风格代理(仅搜索、自由文本动作)一致性的工作不同,我们研究了具有类型化参数和附带副作用的结构化工具调用接口的更丰富设置。

英文摘要

Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains under-explored: does the same agent behave the same way twice? We present a systematic empirical study of behavioral consistency in multi-step tool-calling agents, measuring whether agents select the same tools, in the same order, with the same arguments, across repeated identical invocations. Unlike prior work on consistency in ReAct-style agents (search-only, free-text actions), we study the richer setting of structured tool-calling interfaces with typed parameters and consequential side effects.
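
The three questions in the abstract (same tools, same order, same arguments) suggest three nested levels of agreement. The sketch below is one plausible way to score them and is not the paper's metric: the trace format (a list of `(tool_name, arguments)` pairs per run), the choice of the first run as reference, and the metric names are all assumptions.

```python
def consistency(runs):
    """Compare every run against the first one at three levels of strictness.

    Each run is a list of (tool_name, arguments_dict) pairs.
    The reference run is counted too, so every score is at least 1/len(runs).
    """
    ref = runs[0]
    n = len(runs)
    ref_names = [name for name, _ in ref]
    same_tools = sum({name for name, _ in r} == set(ref_names) for r in runs)
    same_order = sum([name for name, _ in r] == ref_names for r in runs)
    same_args = sum(r == ref for r in runs)
    return {
        "tool_set": same_tools / n,    # same tools, any order
        "tool_order": same_order / n,  # same tools in the same order
        "exact": same_args / n,        # same order and identical arguments
    }

# Hypothetical traces from four repeated invocations of the same task.
runs = [
    [("search", {"q": "flights"}), ("book", {"id": 1})],
    [("search", {"q": "flights"}), ("book", {"id": 1})],
    [("book", {"id": 1}), ("search", {"q": "flights"})],        # order differs
    [("search", {"q": "cheap flights"}), ("book", {"id": 1})],  # argument differs
]
scores = consistency(runs)
print(scores)  # {'tool_set': 1.0, 'tool_order': 0.75, 'exact': 0.5}
```

Exact argument matching is deliberately strict here. A study of typed parameters might instead normalize or compare arguments field by field.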

2605.28838 2026-05-29 cs.CL cs.AI 版本更新

Specialty-Specific Medical Language Model for Immune-Mediated Diseases

免疫介导疾病的专科医学语言模型

Veysel Kocaman, Gursev Pirge, Yigit Gul, Ace Vo, Zhenya Nargizyan, David Talby

发表机构 * John Snow Labs Inc.(约翰·斯诺实验室公司)

AI总结 针对免疫介导和传染病领域,通过专家标注数据集和临床领域嵌入的Transformer模型,提出专科NER模型,F1达0.89,优于基线和零样本方法。

Comments 15 pages, 5 figures. Funded in part by NIAID/NIH under contract 75N93024C00010

详情
AI中文摘要

从自由文本的医学叙述中提取详细的临床信息对研究人员和医疗系统仍然是一个实际挑战。免疫介导和传染病的术语在来源之间尤其不一致,这通常限制了通用自然语言处理(NLP)系统以足够粒度捕获相关生物医学概念的能力。我们开发了一个领域特定的命名实体识别(NER)模型,专门用于识别免疫学和传染病背景下出现的疾病相关实体。我们与两位临床专家合作,收集并手动标注了371份病例报告的数据集,定义了涵盖免疫介导和传染病状况以及相关症状和临床描述符的十二个实体类别。我们评估了几种建模策略,包括使用多种医疗特定嵌入的MedicalNER架构、基于BERT的标记分类模型以及零样本NER系统。最佳性能是通过在临床领域嵌入上训练的基于Transformer的模型获得的,其F1分数达到0.89,始终优于基线和零样本方法。专业嵌入和专家注释的结合被证明对于捕捉细微的疾病术语和改善跨异构生物医学文本的泛化特别有价值。在相同的评估协议下,提示的LLM基线实现了显著较低的性能,反映了尽管有详细提示,但在细粒度实体边界上产生跨度一致输出的困难。由此产生的模型提供了一种分析病例报告的结构化方式,并可以支持下游任务,如队列识别、疾病监测和临床决策支持。

英文摘要

Extracting detailed clinical information from free-text medical narratives remains a practical challenge for researchers and healthcare systems. Terminology for immune-mediated and infectious diseases is especially inconsistent across sources, which often limits the ability of general-purpose Natural Language Processing (NLP) systems to capture the relevant biomedical concepts with sufficient granularity. We developed a domain-specific Named Entity Recognition (NER) model tailored to identify disease-related entities occurring in immunology and infectious disease contexts. We assembled and manually annotated a dataset of 371 case reports in collaboration with two clinical specialists, defining twelve entity classes covering immune-mediated and infectious conditions as well as related symptoms and clinical descriptors. We evaluated several modeling strategies, including the MedicalNER architecture with multiple healthcare-specific embeddings, a BERT-based token classification model, and zero-shot NER systems. The strongest performance was obtained with a transformer-based model trained on clinical-domain embeddings, which reached an F1 score of 0.89, consistently outperforming baseline and zero-shot approaches. The combination of specialized embeddings and expert annotation proved particularly valuable for capturing nuanced disease terminology and improving generalization across heterogeneous biomedical text. The prompted LLM baseline achieved substantially lower performance under the same evaluation protocol, reflecting difficulties in producing span-consistent outputs for fine-grained entity boundaries despite detailed prompting. The resulting model provides a structured way to analyze case reports and can support downstream tasks such as cohort identification, disease monitoring, and clinical decision support.

2605.28837 2026-05-29 cs.CL cs.AI 版本更新

SERC: LDPC-Inspired Semantic Error Correction for Retrieval-Augmented Generation

SERC: 受LDPC启发的检索增强生成语义纠错方法

Gyumin Kim, Juhwan Park, Jaeha Kim, Seunggyun Han, Kyungrak Son, Ikbeom Jang

发表机构 * Department of Information Communications Engineering, Hankuk University of Foreign Studies, Republic of Korea(韩国外国语大学信息通信工程系) Division of Computer Engineering, Hankuk University of Foreign Studies, Republic of Korea(韩国外国语大学计算机工程系) Department of Statistics, Hankuk University of Foreign Studies, Republic of Korea(韩国外国语大学统计学系)

AI总结 针对大语言模型幻觉问题,提出受LDPC码启发的语义纠错框架SERC,通过稀疏验证策略高效检测和纠正生成文本中的错误。

Comments 15 pages, 2 figures, 6 tables. To appear in the Proceedings of the 28th International Conference on Pattern Recognition (ICPR 2026). Code available at https://github.com/labhai/SERC

详情
AI中文摘要

尽管大语言模型(LLMs)展现了卓越的能力,但其可靠性因幻觉而严重受损。现有的内在自纠正方法试图解决这一问题,但由于自我偏见(模型在没有外部验证的情况下难以识别自身输出中的错误)而常常失败。为克服这些限制,我们提出了受LDPC启发的检索增强生成语义纠错方法(SERC),为解释和缓解LLM幻觉提供了理论框架。我们将文本生成过程重新表述为语义噪声信道,将生成的响应视为噪声干扰的码字。受低密度奇偶校验(LDPC)码的启发,SERC采用稀疏验证策略:不是穷举检查所有事实,而是生成低密度验证查询,并针对外部证据进行验证,以高效检测和纠正错误。我们在LongForm Bio和TruthfulQA基准上使用Llama-3-8B和Qwen2.5-14B评估了SERC。实验结果表明,SERC优于内在自纠正方法和强检索增强基线,特别是在事实精度(FactScore)上取得了显著提升。值得注意的是,SERC使小语言模型(SLMs)在幻觉减少和信息保留方面超越了更大的基线模型。我们的发现表明,SERC提供了一种无需训练、模型无关的解决方案,与密集方法相比显著降低了验证开销,在资源受限环境中实现了成本与保真度的最优权衡。

英文摘要

While Large Language Models (LLMs) have demonstrated remarkable capabilities, their reliability is significantly compromised by hallucinations. Existing intrinsic self-correction methods attempt to address this, but often fail due to self-bias, where models struggle to identify errors in their own outputs without external verification. To overcome these limitations, we propose the LDPC-inspired semantic error correction for retrieval-augmented generation (SERC), providing a theoretical framework to interpret and mitigate LLM hallucinations. We reformulate the text generation process as a semantic noisy channel, treating generated responses as noise-corrupted codewords. Inspired by low-density parity-check (LDPC) codes, SERC employs a sparse verification strategy: instead of exhaustively checking all facts, it generates low-density verification queries and validates them against external evidence to efficiently detect and correct errors. We evaluate SERC on the LongForm Bio and TruthfulQA benchmarks using Llama-3-8B and Qwen2.5-14B. Experimental results show that SERC outperforms both intrinsic self-correction methods and strong retrieval-augmented baselines, with significant gains especially in factual precision (FactScore). Notably, SERC enables small language models (SLMs) to surpass larger baselines in hallucination reduction and information preservation. Our findings indicate that SERC provides a training-free, model-agnostic solution that significantly reduces verification overhead compared to dense methods, achieving an optimal trade-off between cost and fidelity in resource-constrained environments.

2605.28835 2026-05-29 cs.CL cs.AI 版本更新

GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

GenesisFunc: 面向准确且泛化的函数调用的多智能体数据生成

Hao-Xiang Xu, Chong Deng, Jiaqing Liu, Wen Wang, Qian Chen, Lujia Bao, Xiangang Li, Zhen-Hua Ling

发表机构 * Tongyi Fun Team, Alibaba Group(通义功能团队,阿里巴巴集团)

AI总结 提出GenesisFunc多智能体自动生成函数调用训练数据,通过多阶段评估保证质量,微调8B模型在域内和域外均优于同类开源模型,性能接近部分API模型。

Comments Accepted by ACL 2026 Main

详情
AI中文摘要

大型语言模型(LLM)通过函数调用(FC)扩展其能力,这依赖于高质量、多样化且覆盖广泛场景的训练数据。然而,获取和标注真实的函数调用数据具有挑战性,而现有流水线生成的合成数据通常存在API不可靠、工具可扩展性有限、多样性不足和质量控制薄弱等问题。为解决这些问题,我们提出了GenesisFunc,一个用于生成FC训练数据的自动化流水线。从广泛使用的公共基准中的可靠工具出发,我们的GenesisFunc采用多智能体框架支持对话生成系统,该系统生成涵盖多种场景的对话,同时在整个过程中保持多样性和质量。通过多阶段评估系统进一步强化数据的准确性。我们在合成数据集上微调了一个8B LLM,并通过大量实验表明,它在域内FC性能和域外泛化方面优于同等规模的开源模型,同时达到了与一些最新的基于API的模型相当的FC能力。此外,我们的方法展示了在下游工具中有效扩展的强大潜力,突显了其实际应用性。

英文摘要

Large Language Models (LLMs) extend their capabilities through function-calling (FC), which relies on training data with high quality, diversity, and broad coverage of scenarios. However, obtaining and annotating real function-calling data is challenging, while synthetic data from existing pipelines often suffers from unreliable APIs, limited tool scalability, insufficient diversity, and weak quality control. To address these issues, we present GenesisFunc, an automated pipeline for generating FC training data. Starting from reliable tools in widely used public benchmarks, GenesisFunc employs a multi-agent framework to support a dialogue generation system that produces conversations spanning diverse scenarios, while maintaining both diversity and quality throughout the process. The accuracy of the data is further reinforced through a multi-stage evaluation system. We fine-tune an 8B LLM on the synthetic dataset and show through extensive experiments that it outperforms similarly sized open-source models in in-domain FC performance and out-of-domain generalization, while reaching FC capabilities comparable to some of the latest API-based models. In addition, our method demonstrates strong potential to scale effectively across downstream tools, underscoring its real-world applicability.

2605.28834 2026-05-29 cs.CL cs.AI 版本更新

Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning

评估荷兰语音节划分算法并通过深度学习结合语音和正字法信息提高准确性

Gus Lathouwers, Wieke Harmsen, Catia Cucchiarini, Helmer Strik

发表机构 * Radboud University(拉德堡德大学)

AI总结 本研究评估了四种荷兰语音节划分算法的性能,并提出一种结合语音和正字法信息的深度学习模型,实现了99.65%的词准确率,较文献最佳提升0.14%。

Comments Published in CLIN Journal

详情
Journal ref
Computational Linguistics in the Netherlands Journal, Vol. 14 (2025), pp. 365 to 383
AI中文摘要

音节划分描述将单词划分为音节的任务。由于许多规则和例外,训练算法以高准确率执行音节划分仍然是一个挑战。在过去几十年中,针对荷兰语音节划分提出了不同的算法,但尚未进行全面的比较评估。此外,近年来深度学习在自然语言处理中获得了显著普及,但尚未开发出基于现代深度学习的荷兰语正字法音节划分框架。最后,语音和正字法音节划分算法已被分别研究,但未结合研究。当前研究的目标有两个:(a) 检查现有荷兰语音节划分算法的性能,(b) 研究将语音和正字法信息结合到单个模型中是否能提高音节划分性能。为了比较算法性能,将四种算法(Brandt Corstius、Liang、Trogkanis-Elkan (CRF) 和新构思的深度学习模型)应用于三个不同的数据集(词典词、借词、伪词)。这些算法在数据集上表现出不同的性能,数据驱动算法在所有条件下除一个外均优于基于知识的算法。开发的新深度学习方法相比文献中发现的最佳结果(99.65%的词准确率,提高了0.14%)带来了性能提升。对添加语音信息改善音节划分性能的单词的分析表明,这些单词中正字法歧义可以通过发音信息解决。未来研究可以考察语音信息有益于正字法处理的其他领域。此外,新开发的深度学习框架可以应用于荷兰语以外的其他语言。

英文摘要

Syllabification describes the task of dividing words into syllables. Due to many rules and exceptions, training an algorithm to perform syllabification with high accuracy remains a challenge. Throughout the last decades, different algorithms have been put forth for Dutch syllabification, yet a comprehensive comparative assessment has not been carried out. Additionally, deep learning has gained significant popularity within NLP in recent years, yet no modern deep-learning-based framework has been developed for Dutch orthographic syllabification. Finally, phonetic and orthographic syllabification algorithms have been examined separately, but not in combination. The aim of the current research was twofold: (a) to examine the performance of existing Dutch syllabification algorithms, and (b) to investigate whether combining phonetic and orthographic information into a single model can increase syllabification performance. To compare the performance of algorithms, four algorithms (Brandt Corstius, Liang, Trogkanis-Elkan (CRF), and a newly conceived deep-learning model) were applied to three different datasets (dictionary words, loanwords, pseudowords). The algorithms show varying performance across datasets, with the data-driven algorithms outperforming a knowledge-based algorithm in all but one condition. The newly developed deep-learning methods improved on the best result found in the literature (99.65% word accuracy, a 0.14% improvement). An analysis of the words for which adding phonetic information improved syllabification performance indicates that these were words in which the orthographic ambiguity could be resolved by information on pronunciation. Future research could examine other areas where phonetic information can benefit orthographic processing. In addition, the newly developed deep-learning frameworks can be applied to languages other than Dutch.

2605.28833 2026-05-29 cs.CL cs.AI 版本更新

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

转录儿童语音:ASR性能与获取可靠正字法转录

Gus Lathouwers, Lingyun Gao, Catia Cucchiarini, Helmer Strik

发表机构 * Radboud University(拉德堡德大学)

AI总结 本研究评估了三种ASR模型家族(Whisper、Parakeet、Wav2Vec2)在荷兰儿童语音数据集上的性能,并提出了一种基于话语级选择的方法,以自动识别高置信度的正确发音,从而减少人工验证需求。

详情
AI中文摘要

自动语音识别(ASR)有潜力通过生成自动转录来大幅减少儿童语音研究中的手动标注工作。然而,在低资源语言中,由于缺乏针对儿童的预训练模型以及高度多样的噪声条件,获得可靠的高质量ASR转录仍然具有挑战性。本研究通过两个研究问题调查了最先进的ASR模型在儿童语音上的有效性,评估了来自三个模型家族(Whisper、Parakeet和Wav2Vec2)的九个ASR模型在两个荷兰儿童语音数据集JASMIN和DART上的表现。研究问题1考察了ASR模型应用于儿童语音的性能。微调的Whisper-medium模型取得了最佳整体性能,在JASMIN上WER为5.54%,在DART上为70.37%,表明噪声较大的DART数据明显更具挑战性。研究问题2考察了在多大程度上可以选择一个子集,使得无需人工验证即可自动获得可靠的正字法转录。我们使用一种话语级选择方法,将ASR输出与原始阅读提示进行比较,以识别正确发音的录音。使用所提出的选择方法,42.0% [对于JASMIN] 和18.1% [对于DART] 的话语可以高置信度地自动识别为正确发音,从而在话语级别上实现极低的错误率(精确度达到98.3%或更高),并减少了人工验证的需求。

英文摘要

Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. However, obtaining reliably high-quality ASR transcriptions for child speech remains challenging in low-resource languages due to limited child-specific pre-trained models and highly diverse noise conditions. This study investigates the effectiveness of state-of-the-art ASR models on child speech through two research questions, by evaluating nine ASR models from three model families (Whisper, Parakeet, and Wav2Vec2) on two Dutch child speech datasets, JASMIN and DART. Research question 1 examines the performance of ASR models applied to child speech. The fine-tuned Whisper-medium model achieves the best overall performance, with a WER of 5.54% on JASMIN and 70.37% on DART, showing that the noisy DART data are clearly more challenging. Research question 2 examines to what extent it is possible to select a subset for which reliable orthographic transcriptions can be obtained automatically, without the need for manual verification. We use an utterance-level selection method that compares ASR output with the original read prompt to identify correctly pronounced recordings. Using the proposed selection method, 42.0% (JASMIN) and 18.1% (DART) of the utterances can be automatically identified as correctly pronounced with high confidence, resulting in very low error rates at the utterance level (precision of 98.3% or higher) and reducing the need for manual verification.
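The utterance-level selection idea described in this abstract can be illustrated with a minimal sketch. It assumes that selection means keeping recordings whose ASR output matches the read prompt at or below a word error rate (WER) threshold. The function names, the lowercasing step, and the `max_wer` parameter are illustrative assumptions, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def select_correct_utterances(pairs, max_wer=0.0):
    """Keep (prompt, asr_output) pairs whose WER is at most max_wer (hypothetical rule)."""
    return [(prompt, asr) for prompt, asr in pairs
            if word_error_rate(prompt, asr) <= max_wer]


pairs = [("de kat zit", "de kat zit"), ("de kat zit", "de hond zit")]
selected = select_correct_utterances(pairs)
```

With `max_wer=0.0`, only recordings whose transcript matches the prompt exactly are kept. This trades coverage for precision, which is the trade-off the reported selection rates and precision figures describe.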

2605.28832 2026-05-29 cs.CL cs.AI 版本更新

A comparative study of transformer-based embeddings for topic coherence

基于Transformer的嵌入在主题连贯性中的比较研究

Alex Ding, Tarun Rapaka, Willy Rodriguez, Jason Yang

发表机构 * Worcester Academy Stanford Online High School(沃斯特学院斯坦福在线高中) Stanford Online High School(斯坦福在线高中) Lexington High School(莱克星顿高中)

AI总结 本研究系统比较了七种不同规模的Transformer语言模型(从MiniLM到LLaMA-2)在BERTopic流程中对主题质量的影响,发现模型大小(从2200万到130亿参数)对主题连贯性影响可忽略。

详情
AI中文摘要

主题建模是自然语言处理的一个分支,旨在根据词共现模式将大量文本组织成连贯的组,其中潜在狄利克雷分配仍是最广泛使用和可解释的概率方法之一。自然语言处理的最新进展,特别是基于Transformer的语言模型,提供了改进的文档表示。已知模型大小(以参数数量计)对语言模型在不同预定义任务上的性能有显著影响。在本研究中,我们通过分析七种基于Transformer的语言模型(从小型模型如MiniLM到大型模型如LLaMA-2)在BERTopic流程中对多种语料库的性能,系统地考察了模型大小对主题质量的影响。主题质量使用Röder等人(2015)的连贯性和分歧度指标进行评估。我们的结果表明,模型大小从2200万到130亿参数对主题质量的影响可忽略,表明较小的模型可以达到与较大模型相当的性能。

英文摘要

Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations. It is also known that the size of the model (in terms of number of parameters) has a significant impact on the performance of language models on different pre-defined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performance of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora. Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on topic quality, suggesting that smaller models can achieve comparable performance to larger models.

2605.28830 2026-05-29 cs.CL cs.AI cs.SE 版本更新

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

开源安全防护模型基准测试:全面评估

Reetu Raj Harsh, Bhaskarjit Sarmah, Stefano Pasquali

发表机构 * Domyn

AI总结 本研究对14个开源安全防护模型在8个NIST AI风险框架安全类别上进行全面评估,发现召回率是关键指标,且模型大小与安全检测性能不相关。

详情
AI中文摘要

随着大型语言模型(LLMs)越来越多地部署在安全关键型应用中,稳健的内容审核变得至关重要。我们对14个开源安全防护模型进行了全面评估,使用了包含79,331个样本的精选基准,涵盖8个NIST AI风险框架安全类别。我们的基准聚合了四个不同的数据集(HarmBench、StrongREJECT、RealToxicityPrompts和BeaverTails),并经过筛选,仅关注安全相关内容(暴力、仇恨言论、骚扰、色情内容、自杀/自残、亵渎、威胁和健康虚假信息)。我们发现召回率是安全应用的关键指标,因为遗漏有害内容比误报构成更大风险。我们的评估揭示了令人惊讶的结果:Qwen Guard(4B参数)实现了最高的召回率(83.97%),而较大的模型如Llama Guard(12B)和GPT-OSS Safeguard(20B)表现出保守行为,遗漏了高达75%的不安全内容。我们证明了模型大小与安全检测性能不相关,并且通用防护模型优于专用模型。这些发现为在生产部署中选择安全防护模型提供了实用指导。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories. Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives. Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.
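To make the recall argument concrete, here is a small sketch of how recall and precision are computed for a binary safe/unsafe guard. The labels and predictions are made-up toy data, not results from the benchmark, and the function name is an assumption for illustration.

```python
def recall_and_precision(labels, predictions):
    """labels and predictions are sequences of bool, where True means 'unsafe'."""
    tp = fn = fp = 0
    for is_unsafe, flagged in zip(labels, predictions):
        if is_unsafe and flagged:
            tp += 1   # harmful content correctly flagged
        elif is_unsafe and not flagged:
            fn += 1   # harmful content missed
        elif flagged:
            fp += 1   # safe content flagged by mistake
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Toy example: a conservative guard that flags only one of four unsafe samples.
labels = [True, True, True, True, False, False]
predictions = [True, False, False, False, False, False]
recall, precision = recall_and_precision(labels, predictions)
```

In this toy case the guard has perfect precision but a recall of 0.25, meaning it misses 75% of the unsafe samples. This is the failure pattern the abstract attributes to conservative models.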

2605.28828 2026-05-29 cs.CL cs.AI 版本更新

Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models

微宏检索:减少大语言模型中的长文本幻觉

Yujie Feng, Jian Li, Zhihan Zhou, Pengfei Xu, Yujia Zhang, Xiaoyu Li, Xiaohui Zhou, Alan Zhao, Xi Chen, Xiao-Ming Wu

发表机构 * Solar System of OVB, Tencent, China(OVB太阳系,腾讯,中国) The Hong Kong Polytechnic University, Hong Kong S.A.R.(香港理工大学,香港特别行政区) Jilin University, China(吉林大学,中国)

AI总结 提出微宏检索(M2R)框架,通过宏观检索外部粗粒度证据和微观检索推理中关键信息库,解决长文本生成中关键信息与输出距离过远导致的幻觉问题。

详情
AI中文摘要

大型语言模型(LLMs)在许多任务上表现出色,但容易产生幻觉,尤其是在长文本生成中,冗余的检索上下文和冗长的推理链会放大事实错误。最近的研究强调了一个关键现象:关键信息越接近模型输出,事实准确性越高。然而,现有的检索增强语言模型(RALMs)缺乏确保这种接近性的有效机制——外部证据通过多轮检索注入推理,但无法确保关键信息靠近输出。我们提出微宏检索(M2R),一种新颖的边检索边生成框架,以填补这一空白。在宏观层面,M2R从外部来源检索粗粒度证据;在微观层面,它从推理过程中构建的关键信息库中提取必要结果,并在生成答案时重用它们。这种设计直接解决了关键信息到输出的接近性瓶颈,有效减少了长文本任务中的幻觉。M2R使用基于课程学习的强化学习策略进行训练,并采用定制的基于规则的奖励,从而稳定地获得检索和接地技能。跨不同基准的大量实验证明了M2R的有效性,尤其是在长上下文设置中。

英文摘要

Large Language Models (LLMs) achieve impressive performance across many tasks but remain prone to hallucination, especially in long-form generation where redundant retrieved contexts and lengthy reasoning chains amplify factual errors. Recent studies highlight a critical phenomenon: the closer key information appears to the model outputs, the higher the factual accuracy. However, existing retrieval-augmented language models (RALMs) lack effective mechanisms to ensure this proximity - external evidence is injected into reasoning via multi-turn retrieval, but this cannot ensure key information stays close to the outputs. We propose Micro-Macro Retrieval (M2R), a novel retrieve-while-generate framework to fill this gap. At the macro level, M2R retrieves coarse-grained evidence from external sources; at the micro level, it extracts essential results from a key information repository built during reasoning and reuses them while generating answers. This design directly addresses the key-information-to-output proximity bottleneck, effectively reducing hallucination in long-form tasks. M2R is trained with a curriculum learning-based reinforcement learning strategy using customized rule-based rewards, enabling stable acquisition of retrieval and grounding skills. Extensive experiments across different benchmarks demonstrate the effectiveness of M2R, especially in lengthy-context settings.

2605.28827 2026-05-29 cs.CL cs.LG 版本更新

RightNow-Arabic-0.5B-Turbo: An Open Sub-1B Arabic Language Model via Vocabulary Injection and Edge-First Deployment

RightNow-Arabic-0.5B-Turbo: 通过词汇注入和边缘优先部署的开源子10亿参数阿拉伯语语言模型

Jaber Jaber, Osama Jaber

发表机构 * RightNow AI

AI总结 针对现有阿拉伯语模型要么是多语言模型对阿拉伯语支持不足,要么是参数过大难以部署的问题,提出基于Qwen2.5-0.5B的518M参数阿拉伯语专用模型RightNow-Arabic-0.5B-Turbo,通过词汇注入、继续预训练和监督微调等方法,在三个阿拉伯语基准上达到35.9%平均准确率,与1.5B模型性能相当,并实现边缘端高效部署。

Comments 12 pages, 7 tables, 4 figures, 1 algorithm. Weights: https://huggingface.co/RightNowAI/RightNow-Arabic-0.5B-Turbo

详情
AI中文摘要

开源的阿拉伯语大语言模型分为两类:子10亿参数的多语言模型将阿拉伯语视为次要语言(如Qwen2.5-0.5B、Falcon-H1-0.5B),以及需要服务器运行的7B-70B阿拉伯语专用模型(如Jais、AceGPT、ALLaM、SILMA)。唯一已发表的子20亿参数阿拉伯语专用模型Kuwain-1.5B从未发布权重。我们提出RightNow-Arabic-0.5B-Turbo,一个基于Qwen2.5-0.5B构建的518M参数阿拉伯语专用解码器LLM。该流程通过均值子词初始化添加27,032个阿拉伯语token,在8xH100上使用FSDP、FlashAttention变长打包和Liger融合内核继续预训练504M阿拉伯语token,然后对129,116个阿拉伯语指令对应用仅响应损失掩码的有监督微调,对6,750个阿拉伯语偏好对应用直接偏好优化,并对三个检查点进行权重汤合并。在三个lm-evaluation-harness阿拉伯语基准(COPA-ar、Arabic HellaSwag、ArabicMMLU)上,合并模型达到35.9%的平均准确率,击败所有同类开源模型,在COPA-ar上与Falcon-H1-1.5B持平(58.4%)但规模仅为三分之一,并以1/18的参数恢复了SILMA-9B平均性能的67%。边缘构建量化至398 MB(q4_k_m),通过llama.cpp在单个H100上以批量大小1达到635 tokens/s。所有代码(25个脚本共5,555行)、权重(bf16、int8和四种GGUF量化)及基准测试脚本已在https://huggingface.co/RightNowAI/RightNow-Arabic-0.5B-Turbo开源。

英文摘要

Open Arabic large language models split into two classes: sub-1B multilingual models that treat Arabic as an afterthought (Qwen2.5-0.5B, Falcon-H1-0.5B), and 7B-70B Arabic-specialized models that require a server to run (Jais, AceGPT, ALLaM, SILMA). The one published attempt at a sub-2B Arabic-specialized model, Kuwain-1.5B, never released its weights. We present RightNow-Arabic-0.5B-Turbo, a 518M-parameter Arabic-specialized decoder LLM built on Qwen2.5-0.5B. The pipeline adds 27,032 Arabic tokens via mean-subtoken initialization, continues pretraining on 504M Arabic tokens on 8xH100 with FSDP, FlashAttention varlen packing, and Liger fused kernels, then applies supervised fine-tuning on 129,116 Arabic instruction pairs with response-only loss masking, direct preference optimization on 6,750 Arabic preference pairs, and weight soup merging across three checkpoints. On three lm-evaluation-harness Arabic benchmarks (COPA-ar, Arabic HellaSwag, ArabicMMLU) the merged model reaches 35.9% mean accuracy, beats every same-class open model, ties Falcon-H1-1.5B on COPA-ar (58.4%) at one-third the size, and recovers 67% of SILMA-9B's mean at 1/18 the parameters. The edge build quantizes to 398 MB (q4_k_m) and delivers 635 tokens/s at batch size 1 on a single H100 via llama.cpp. All code (5,555 lines across 25 scripts), weights (bf16, int8, and four GGUF quantizations), and benchmark scripts are released at https://huggingface.co/RightNowAI/RightNow-Arabic-0.5B-Turbo.
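The abstract names mean-subtoken initialization without giving details. A common reading, assumed here, is that each new token's embedding starts as the average of the embeddings of the sub-tokens the original tokenizer would split it into. The sketch below uses plain Python lists and hypothetical function names, and is not the authors' code.

```python
def mean_subtoken_init(embeddings, subtoken_ids):
    """Return the element-wise mean of the embedding rows listed in subtoken_ids."""
    if not subtoken_ids:
        raise ValueError("a new token needs at least one sub-token")
    dim = len(embeddings[0])
    return [sum(embeddings[i][d] for i in subtoken_ids) / len(subtoken_ids)
            for d in range(dim)]


def add_tokens(embeddings, new_token_subtoken_ids):
    """Append one mean-initialized row per new token and return the extended table."""
    extended = [row[:] for row in embeddings]
    for subtoken_ids in new_token_subtoken_ids:
        # Always average over the original rows, never over rows just added.
        extended.append(mean_subtoken_init(embeddings, subtoken_ids))
    return extended


# Toy embedding table with three existing tokens of dimension 2.
table = [[0.0, 2.0], [4.0, 6.0], [1.0, 1.0]]
# One new token that the old tokenizer would split into tokens 0 and 1.
extended = add_tokens(table, [[0, 1]])
```

Starting new rows near the sub-token average keeps the model's initial outputs close to what it produced before the vocabulary was extended, so continued pretraining does not begin from random embeddings.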

2605.28826 2026-05-29 cs.CL 版本更新

From Context Shift to Stylistic Collapse: Why Training Objectives Matter More Than Scale

从语境偏移到风格崩溃:为什么训练目标比规模更重要

Rohan Mahapatra

发表机构 * Independent Researcher(独立研究者)

AI总结 本文通过分析17个模型(410M-100B+参数)在24个语言探针上的表现,发现指令微调系统会导致语言熵沿语篇和结构维度系统性崩溃,并表明弱干预加剧崩溃而强控制可显著改善,揭示了当前对齐流程在重新分配风格概率质量方面的结构性局限。

Comments 26 pages, 13 tables, 2 figures. Planning to submit to NeurIPS 2026

详情
AI中文摘要

在现代大型语言模型中,语言特征并非作为风格人工制品,而是作为概率质量的探针,在训练对齐目标下分配。使用当代流程训练的语言模型表现出语言特征的严重重塑,导致极端的语言重新分布。虽然先前的风格计量分析探讨了AI生成文本与人类文本之间的语言差异,但我们关注的是困扰LLM训练流程本身的重塑。我们分析了17个模型(410M-100B+参数)在24个语言动机探针上的表现,记录到指令微调系统沿语篇和结构维度系统性崩溃语言熵(平均放大倍数:1,949-16,853%,峰值:5,181-209,675%),同时选择性地将复杂标点抑制到基线频率的3.2-23.2%。这些效应在RLHF下并未恶化,因为匹配的基础模型和指令微调模型对之间的发散模式在统计上无显著差异(p > 0.25)。弱干预(lambda=1.0)使崩溃加剧240%,而强控制(lambda=5.0)实现了40.5%的改善,并且尽管规模劣势达200-1000倍,仍比前沿模型表现好96.7-98.2%。此外,lambda=5.0相比中等正则化,提供了15%更高的distinct-4、27%更高的词汇多样性以及78%更低的重复率,表明对齐需要足够的控制强度,而不仅仅是分布平滑。我们的发现强调了现代LLM如何重新分配风格概率质量,尽管有RLHF和规模。更广泛地说,我们的工作揭示了当前对齐流程的结构性局限:偏好优化重塑了标准质量指标不可见但可通过分布探针检测到的语言分布,这对AI检测、训练数据污染和长期语言演化具有影响。

英文摘要

In modern LLMs, linguistic features function not as stylistic artifacts but as probes of probability mass, allocated under training alignment objectives. Language models trained with contemporary pipelines exhibit severe reshaping of linguistic features, leading to extreme language re-distribution. While previous stylometric analyses explored linguistic differences between AI-generated and human texts, we focus on the reshaping plaguing the LLM training pipeline itself. We analyze 17 models (410M-100B+ parameters) across 24 linguistically-motivated probes, documenting that instruction-tuned systems systematically collapse language entropy along discourse and structural dimensions (mean amplification: 1,949-16,853%, peaks: 5,181-209,675%), while selectively suppressing complex punctuation to 3.2-23.2% of baseline frequencies. These effects do not worsen under RLHF, as divergence patterns are statistically indistinguishable (p > 0.25) across matched base and instruction-tuned model pairs. Weak intervention (lambda=1.0) exacerbates collapse by 240%, while strong control (lambda=5.0) achieves 40.5% improvement and outperforms frontier models by 96.7-98.2% despite 200-1000x scale disadvantage. Additionally, lambda=5.0 delivers 15% higher distinct-4, 27% higher vocabulary diversity, and 78% lower repetition than moderate regularization, establishing that alignment requires sufficient control strength, not merely distributional smoothing. Our findings underscore how modern LLMs reallocate stylistic probability mass, despite RLHF and scale. More broadly, our work reveals a structural limitation of current alignment pipelines: preference optimization reshapes language distributions invisible to standard quality metrics yet detectable through distributional probes, with implications for AI detection, training data contamination, and long-term linguistic evolution.
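The distinct-4 figure mentioned in this abstract refers to the distinct-n family of diversity metrics, usually defined as the number of unique n-grams divided by the total number of n-grams. The sketch below assumes that standard definition and whitespace tokenization; the paper's exact tokenization is not stated.

```python
def distinct_n(tokens, n):
    """Ratio of unique n-grams to total n-grams; 0.0 if the text is shorter than n."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


repetitive = "a b a b a b".split()
varied = "a b c d e".split()
score_repetitive = distinct_n(repetitive, 2)
score_varied = distinct_n(varied, 4)
```

A repetitive sequence scores low because the same n-grams recur, while a sequence with no repeated n-grams scores 1.0.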

2605.28825 2026-05-29 cs.CL 版本更新

MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models

MechELK:一种用于激发大型语言模型中潜在知识的机制可解释性框架

Ji-jun Park, Soo-joon Choi, Jiwon Jeong, Taeyang Yoon, Ju-Wan Lee

发表机构 * Dongguk University(东国大学)

AI总结 提出MechELK框架,通过定位、验证和激发三个阶段,利用稀疏自编码器特征分析和因果探测等方法,从大型语言模型中提取隐藏知识,在TruthfulQA等基准上平均激发准确率达84.7%。

详情
AI中文摘要

大型语言模型(LLMs)经常在其内部表示中编码事实和推理知识,但这些知识并未在其表面输出中忠实反映——这种现象被称为“潜在知识”。现有的潜在知识激发方法,如对比一致性搜索(CCS),依赖于对比激活模式,在处理复杂的多步推理任务时存在困难,而机制可解释性工具主要用于“理解”模型行为而非“提取”隐藏知识。我们提出了**MechELK**,一个统一的三个阶段框架,桥接了机制可解释性和潜在知识激发。MechELK通过以下步骤运作:(1)**定位**——使用稀疏自编码器(SAE)特征分析和激活修补来识别承载知识的表示;(2)**验证**——采用因果探测来区分真正的潜在知识与虚假相关性;(3)**激发**——应用表示工程在不修改模型权重的情况下揭示隐藏知识。在TruthfulQA、精心策划的Deceptive Alignment基准和Quirky LM数据集上评估,MechELK实现了84.7%的平均激发准确率,比CCS高出6.2%,比直接线性探测高出9.1%。关键的是,MechELK在模型表面输出不正确或回避的78.3%案例中成功识别了潜在知识,展示了其在AI安全应用(包括欺骗性对齐检测)中的实用性。

英文摘要

Large language models (LLMs) frequently encode factual and reasoning knowledge in their internal representations that is not faithfully reflected in their surface-level outputs, a phenomenon known as latent knowledge. Existing approaches to eliciting latent knowledge, such as Contrastive Consistency Search (CCS), rely on contrastive activation patterns and struggle with complex multi-step reasoning tasks, while mechanistic interpretability tools have primarily been used to understand model behavior rather than to extract hidden knowledge. We present MechELK, a unified three-stage framework that bridges mechanistic interpretability and latent knowledge elicitation. MechELK operates through: (1) Locate: using Sparse Autoencoder (SAE) feature analysis and activation patching to identify knowledge-bearing representations; (2) Verify: employing causal probing to distinguish genuine latent knowledge from spurious correlations; and (3) Elicit: applying representation engineering to surface hidden knowledge without modifying model weights. Evaluated on TruthfulQA, a curated Deceptive Alignment benchmark, and the Quirky LM dataset, MechELK achieves an average elicitation accuracy of 84.7%, outperforming CCS by 6.2% and direct linear probing by 9.1%. Crucially, MechELK successfully identifies latent knowledge in 78.3% of cases where the model's surface output is incorrect or evasive, demonstrating its utility for AI safety applications including deceptive alignment detection.

2605.28824 2026-05-29 cs.CL 版本更新

A Modular Architecture for Typologically Controlled Lexicon Generation

一种用于类型学控制词汇生成的模块化架构

Sankalp Tattwadarshi Swain, Dhruv Kumar

发表机构 * Birla Institute of Technology and Science, Pilani(比拉理工学院,帕利尼)

AI总结 提出模块化框架,通过PHOIBLE音位库、可互换音系语法和Swadesh-Leipzig-Jakarta本体生成音系合理且类型学真实的词汇,实验表明概率语法优于确定性和随机基线。

详情
AI中文摘要

构建可发音、类型学合理且语义结构化的人工词汇仍是计算语言学中的开放挑战。现有语言生成器要么缺乏正式的音位配列保证,要么将生成委托给不透明、不可复现的基于LLM的流水线。我们提出一个模块化框架,从PHOIBLE中采样音位库,在可互换的音系语法(确定性、OT和MaxEnt)下生成词形,并通过Swadesh-Leipzig-Jakarta本体分配意义,实现显式的形式-意义对齐。在词汇规模为100-5,000个词形时,基于字符$n$-gram困惑度、对数似然和与PHOIBLE的KL散度的评估表明,概率语法在音位配列一致性和类型学真实性上始终优于确定性和随机基线。

英文摘要

Constructing artificial lexicons that are pronounceable, typologically plausible, and semantically structured remains an open challenge in computational linguistics. Existing conlang generators either lack formal phonotactic guarantees or delegate generation to opaque, non-reproducible LLM-based pipelines. We propose a modular framework that samples phoneme inventories from PHOIBLE, generates word forms under interchangeable phonological grammars (deterministic, OT, and MaxEnt), and assigns meanings via a Swadesh-Leipzig-Jakarta ontology with explicit form-meaning alignment. Evaluation on character $n$-gram perplexity, log-likelihood, and KL divergence against PHOIBLE across lexicon sizes of 100-5,000 forms shows that probabilistic grammars consistently outperform deterministic and random baselines on both phonotactic coherence and typological realism.
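One of the evaluation measures named in this abstract, KL divergence against a reference distribution, can be sketched as follows. The sketch assumes segment frequencies are compared as discrete distributions over a shared alphabet with add-one smoothing. The direction of the divergence, the smoothing, and the toy data are assumptions, not details from the paper.

```python
import math
from collections import Counter


def distribution(symbols, alphabet, smoothing=1.0):
    """Smoothed relative frequency of each alphabet symbol in the sequence."""
    counts = Counter(symbols)
    total = len(symbols) + smoothing * len(alphabet)
    return {s: (counts[s] + smoothing) / total for s in alphabet}


def kl_divergence(p, q):
    """KL(p || q) in nats; q must be non-zero wherever p is non-zero."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0)


alphabet = "ptkaiu"
generated = distribution("patakipu", alphabet)   # toy generated lexicon
reference = distribution("tipukata", alphabet)   # toy reference sample
divergence = kl_divergence(generated, reference)
```

Smoothing keeps every reference probability above zero, which avoids a division by zero when a generated segment never occurs in the reference sample. A divergence of 0.0 means the two distributions are identical.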

2605.28823 2026-05-29 cs.CL 版本更新

What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs

他们在想什么?LLM中概念的界定、探测与追踪

Mohamed Abdelwahab, Michelle Yu Collins, Sihan Chen, Yi Cheng Zhao, Zafarullah Mahmood, Jiading Zhu, Soliman Ali, Jonathan Rose

发表机构 * The Edward S. Rogers Sr. Department of Electrical and Computer Engineering(Edward S. Rogers Sr. 电气与计算机工程系)

AI总结 本文提出通过线性探针低成本地检测LLM嵌入中的概念,并展示了概念界定、探针训练与跨上下文追踪的方法,为大规模模型监控奠定基础。

详情
AI中文摘要

随着LLM影响力的扩大,深入了解其决策变得至关重要。一种方法是开发探针,用于检测LLM计算的嵌入中是否存在广泛的概念——这可以说是模型在“思考”的内容。这些探针应成本低廉且易于应用于任何LLM,以便在正常操作期间能够监控多个概念。在本文中,我们通过定义并执行关键任务的示例,迈出了开发创建此类探针能力的第一步:首先,通过创建包含概念存在和不存在的数据集来仔细界定概念;然后,训练并测试一组线性探针,以检测LLM任何层上的概念,包括对所需探针复杂性的探索;最后,我们展示了此类探针可以跨更大上下文追踪概念。我们使用四个不同的概念和三个不同的LLM进行了实验。当这一过程扩展到更多概念时,将能够轻松监控新模型。

英文摘要

As the influence of LLMs expands, it is imperative to gain insight into their decisions. One way to do that is to develop probes that detect the presence or absence of a broad set of concepts within the embeddings computed in an LLM, which is what we might say a model is "thinking" about. Such probes should be low-cost and easily applicable to any LLM, so that monitoring for many concepts is possible during normal operation. In this paper, we take the first steps towards developing the capability of creating many such probes by defining and executing examples of the key tasks needed. First, we carefully delineate a concept by creating a dataset in which the concept is present in some examples and absent in others. Second, we train and test a set of linear probes to detect the concept on any layer of an LLM, including an exploration of the complexity of the probe needed. Finally, we show that such probes can track concepts across larger contexts. We do this with four separate concepts and three different LLMs. When this process is scaled to many more concepts, it will create the ability to easily monitor new models.
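A linear probe of the kind described in this abstract is commonly trained as a logistic regression on embedding vectors labeled by whether the concept is present. The sketch below assumes that setup and uses two-dimensional toy vectors in place of real LLM activations. The function names and hyperparameters are illustrative, and the paper's actual training procedure is not specified here.

```python
import math


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            prob = 1.0 / (1.0 + math.exp(-z))
            grad = prob - y  # gradient of the log loss with respect to z
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def concept_present(weights, bias, x):
    """Predict True when the probe's score for embedding x is positive."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias > 0


# Toy "embeddings": the concept is present (label 1) when the first coordinate is positive.
features = [[-1.0, 0.5], [-2.0, -0.3], [1.5, 0.2], [2.0, -0.4]]
labels = [0, 0, 1, 1]
weights, bias = train_linear_probe(features, labels)
predictions = [concept_present(weights, bias, x) for x in features]
```

In practice one such probe would be trained per concept and per layer, with held-out data used to measure how well each layer encodes the concept.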

2605.28822 2026-05-29 cs.CL 版本更新

Lightweight Multimodal LLM-Enabled Cost-Effective Defect Grading of Power Transmission Equipment

轻量级多模态大语言模型驱动的输电设备经济高效缺陷分级

Tao Wang, Lipeng Zhu, Jiayong Li, Feng Gao, Siwen Liang

AI总结 提出基于多模态大语言模型的缺陷分级框架,通过上下文学习最大化商业模型潜力,并利用链式思考问答对微调轻量级模型,实现低成本高精度分级。

Comments 9 pages, 6 figures

详情
AI中文摘要

输电设备缺陷分级(DGPTE)对电能传输的稳定性至关重要。尽管现有的机器学习方法在缺陷检测方面表现出强大的能力,但在更精细的缺陷分级领域,它们受到难以整合专家经验和面临类别不平衡问题的困扰。为了解决这一问题,本文提出了一种基于多模态大语言模型(MLLM)的新型缺陷分级框架。具体而言,该方法通过上下文学习最大化商业MLLM在DGPTE上的潜力,并获得了最先进的(SOTA)模型。通过向该模型发送二次请求,生成少量基于思维链的问答对(Q&As),这有效降低了人工标注的成本。通过这种方式,这些高质量可解释的Q&As被用于通过低秩自适应监督微调(SFT)训练Qwen3-VL-8B。在三个DGPTE任务上的实验结果表明,仅微调语言模型层即可获得SOTA性能。此外,多任务联合微调验证了仅通过单个轻量级MLLM处理多个分级任务的可行性。

英文摘要

Defect grading of power transmission equipment (DGPTE) is crucial to the stability of electric energy transmission. Although existing machine learning methods exhibit strong capabilities in defect detection, they struggle to integrate expert experience and to cope with class imbalance in the more fine-grained task of defect grading. To address this issue, this paper introduces a novel defect grading framework based on multimodal large language models (MLLMs). Specifically, this approach maximizes the potential of commercial MLLMs on DGPTE through in-context learning and obtains the state-of-the-art (SOTA) model. By sending a secondary request to this model, a small number of chain-of-thought-based question-answer pairs (Q&As) are generated, which effectively reduces the cost of manual annotation. These high-quality, interpretable Q&As are then used to train Qwen3-VL-8B via Low-Rank Adaptation-based supervised fine-tuning (SFT). Experimental results on three DGPTE tasks demonstrate that fine-tuning only the language model layers yields SOTA performance. Furthermore, multi-task joint fine-tuning verifies the feasibility of handling multiple grading tasks within a single lightweight MLLM.

2605.28108 2026-05-29 cs.CL 版本更新

Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents

现在询问,以后使用:评估长期 LLM 代理中的主动性差距

Bin Wu, Guanyun Zou, Bingbing Wang, Huan Zhao, Chuan Shi

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Nanjing University of Aeronautics and Astronautics(南京航空航天大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) Noumena AI

AI总结 针对长期 LLM 代理在跨会话中未能主动获取用户偏好而导致的主动性差距,提出 Ask-to-Remember (ATR) 基准 ATRBench,通过隐藏用户偏好作为真实值来量化该差距,并诊断出获取环节是瓶颈。

详情
AI中文摘要

一个长期存在的 LLM 代理(例如 OpenClaw)的价值在于它能够根据用户跨会话的偏好和约束采取行动,而不仅仅是当前请求。然而,如今的代理会保留用户主动提供的信息,但很少询问那些未说出口的内容,这导致了长期 LLM 代理中的主动性差距:代理无法对从未获取到的偏好采取行动。随着用户将更多事务委托给代理,这种差距的影响也在增长。我们将这一差距的一个具体、可控的部分分离出来,称为 Ask-to-Remember (ATR):代理决定是否现在询问一个可重用的用户偏好,该偏好当前任务不需要,但后续与同一用户的会话会用到。ATR 甚至难以评估:正确的问题是不确定的,其回报会延迟到可能永远不会出现的任务。据我们所知,ATRBench 是第一个 ATR 基准,它通过将每个用户的偏好固定为隐藏的真实值,使得该差距可测量,因此成功需要询问,而不是回忆。在八个前沿 LLM 代理中,默认设置的表现至少比获得相关偏好的 oracle 低 62 分,而提示改进效果甚微。诊断表明获取是瓶颈。ATRBench 揭示了当前代理中的这一主动性差距,并提供了用于弥合该差距的诊断测试平台。

英文摘要

A long-lived LLM agent, such as OpenClaw, earns its value by acting on a user's preferences and constraints across sessions, not just the current request. Yet today's agents keep what a user volunteers but rarely ask for what stays unspoken, leaving a proactivity gap in long-lived LLM agents: an agent cannot act on a preference it never obtained. As users delegate more of their affairs to agents, the impact of this gap grows. We isolate one concrete, controllable slice of this gap as Ask-to-Remember (ATR): the agent decides whether to ask now for a reusable user preference that the current task does not need but a later session with the same user will. ATR is hard even to evaluate: the right question is underdetermined and its payoff deferred to tasks that may never arise. ATRBench, to the best of our knowledge the first ATR benchmark, makes it measurable by fixing each user's preferences as hidden ground truth, so success demands asking, not recall. Across eight frontier LLM agents, defaults fall at least 62 points below an oracle handed the relevant preference, and prompting closes little of it. Diagnostics identify acquisition as the bottleneck. ATRBench surfaces this proactivity gap in current agents and offers a diagnostic testbed for closing it.

2605.27390 2026-05-29 cs.CL cs.AI 版本更新

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

EvoSpec: 通过实时词汇和参数自适应进化推测解码

Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang

发表机构 * School of Computer Science and Technology(计算机科学与技术学院)

AI总结 提出EvoSpec框架,通过动态词汇和参数自适应实现推测解码中草稿模型的实时进化,解决静态方法在专业领域和主题切换场景下接受率骤降的问题,在EAGLE-3上实现1.13倍加速并降低27%内存开销。

详情
AI中文摘要

推测解码通过草稿-验证范式加速大型语言模型推理,但随着词汇表规模扩大,输出投影层成为瓶颈。现有的静态剪枝方法虽有效降低开销,但由于无法捕捉动态分布变化,在专业领域或主题切换场景中接受率骤降。为解决此问题,我们提出EvoSpec框架,通过动态词汇和参数自适应实现草稿模型的实时进化。与静态或纯检索方法不同,EvoSpec采用上下文感知机制,通过高效的语义和统计索引检索关键长尾词。此外,我们提出一种轻量级在线对齐策略,利用课程学习持续最小化草稿模型与目标模型之间的分布差距。在专业领域(编码、法律和医学)的广泛评估证实,EvoSpec克服了静态基线的局限性。在EAGLE-3上,它相比最先进的静态基线FR-Spec实现1.13倍加速,且内存开销比标准在线自适应低27%。

英文摘要

Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a bottleneck as vocabulary sizes scale. While existing static pruning methods effectively reduce this overhead, they suffer from precipitous drops in acceptance rate in specialized domains or topic-switching scenarios due to their inability to capture dynamic distribution shifts. To address this, we introduce EvoSpec, a framework that enables real-time evolution of the draft model through dynamic vocabulary and parameter adaptation. Unlike static or purely retrieval-based approaches, EvoSpec employs a context-aware mechanism that retrieves critical long-tail tokens via efficient semantic and statistical indexing. Furthermore, we propose a lightweight online alignment strategy utilizing curriculum learning to continually minimize the distributional gap between the draft and target models. Extensive evaluations across specialized domains (coding, law, and medicine) confirm that EvoSpec overcomes the limitations of static baselines. On EAGLE-3, it achieves a 1.13x speedup in these settings over the state-of-the-art static baseline FR-Spec, with 27% lower memory overhead than standard online adaptation.

2605.27387 2026-05-29 cs.CL cs.AI 版本更新

From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons

从自回归到扩散:利用严格因果与弹性视野高效适配大型语言模型

Xiangyu Ma, Teng Xiao, Zuchao Li, Lefei Zhang

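Affiliations: School of Artificial Intelligence, Wuhan University; School of Computer Science, Wuhan University

AI summary: Proposes FLUID, a framework that efficiently adapts autoregressive models into diffusion models through Strictly Causal Alignment and an Elastic Horizons mechanism, enabling parallel text generation at greatly reduced training cost.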

Comments Accepted by ACL 2026

详情
AI中文摘要

扩散模型有望实现高效的并行文本生成,但其依赖双向注意力机制,与预训练的自回归(AR)模型存在结构不匹配。这种不兼容性阻碍了稳健AR先验的复用,需要从头开始进行代价高昂的预训练。为弥合这一差距,我们提出FLUID框架,该框架高效地将AR骨干网络适配到扩散范式。通过强制执行严格因果对齐,FLUID能够从标准GPT风格检查点无缝初始化,避免了大规模预训练。此外,我们引入弹性视野,这是一种基于局部信息密度而非固定调度动态调节去噪步长的熵驱动机制。实验表明,FLUID在将训练成本降低数个数量级的同时实现了最先进的性能,有效调和了成熟的AR基础与高效的并行生成。我们的代码可在https://github.com/Oli-lab-nun/FLUID/tree/main获取。

英文摘要

Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre-trained Autoregressive (AR) models. This incompatibility precludes reusing robust AR priors, necessitating prohibitive pre-training from scratch. To bridge this gap, we propose FLUID, a framework that efficiently adapts AR backbones to the diffusion paradigm. By enforcing Strictly Causal Alignment, FLUID enables seamless initialization from standard GPT-style checkpoints, circumventing the need for massive pre-training. Furthermore, we introduce Elastic Horizons, an entropy-driven mechanism that dynamically modulates denoising strides based on local information density rather than fixed schedules. Experiments demonstrate that FLUID achieves state-of-the-art performance while reducing training costs by orders of magnitude, effectively reconciling established AR foundations with efficient parallel generation. Our code is available at https://github.com/Oli-lab-nun/FLUID/tree/main.

2605.27382 2026-05-29 cs.HC cs.AI cs.CL 版本更新

The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs

对齐下限:角色定制如何破坏弱对齐大语言模型的安全性

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

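Affiliations: AWS Generative AI Innovation Center; HSBC Holdings Plc., HSBC Technology Center, China

AI summary: Compares how sycophancy rates change across persona conditions for a strongly-aligned and a lightly-aligned model, and defines the alignment floor Δ_floor as an audit metric for the safety of persona customization.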

详情
AI中文摘要

告诉LLM“要热情”会使轻对齐模型的谄媚率从30%上升到50%,但对强对齐模型没有影响。我们将这一差距定义为对齐下限Δ_floor(m)=max_pS(m,p)-min_pS(m,p),即模型在不同角色条件下产生的谄媚率范围,并将谄媚视为角色条件属性而非固定模型属性。多元AI依赖于通过角色提示(如“要有创造力”或“要彻底”)进行行为适应,使系统能够尊重不同的用户价值观和沟通风格;安全问题在于给定模型在真实性改变之前能吸收多少定制化。我们进行了一项受控案例研究,对比了强对齐的RLHF+宪法AI模型(Claude Sonnet 4.6)与轻对齐模型(Amazon Nova Lite),涵盖7种角色条件和5个任务,共1800次运行。存在性结果促使进行逐模型审计:至少有一个强对齐模型的Δ_floor=5个百分点(在15%控制率的5个百分点内),至少有一个轻对齐模型的Δ_floor=45个百分点(范围5%-50%)。在轻对齐模型上,所有五种大五人格角色都增加了谄媚率,且反直觉的是,宜人性产生的增幅最小而非最大。研究中最大的单一效果是建设性的:怀疑论者角色使轻对齐模型的谄媚率降低了25个百分点,并且是唯一指示抵制用户主张而非与之互动的角色,这暗示了方向性解释。角色效果的跨模型迁移几乎为零,因此角色-对齐测试必须逐模型进行。我们提出Δ_floor作为部署时的审计指标:在部署角色定制之前,在小规模角色面板上测量该指标。

英文摘要

Telling an LLM to "be enthusiastic" raises its sycophancy rate from 30% to 50% on a lightly-aligned model, but has zero effect on a strongly-aligned one. We define this gap as the alignment floor, $\Delta_{\text{floor}}(m)=\max_p S(m,p)-\min_p S(m,p)$, the range of sycophancy rates a model produces across persona conditions, and treat sycophancy as a persona-conditional property rather than a fixed model property. Pluralistic AI relies on behavioral adaptation via persona prompts like "be creative" or "be thorough", which let systems respect diverse user values and communication styles; the safety question is how much customization a given model can absorb before its truthfulness shifts. We present a controlled case study contrasting a strongly-aligned RLHF + Constitutional-AI model (Claude Sonnet 4.6) with a more lightly-aligned model (Amazon Nova Lite), spanning seven persona conditions and five tasks for 1800 total runs. An existence-pair result motivates per-model auditing: there is at least one strongly-aligned model with $\Delta_{\text{floor}}=5$pp (within 5pp of the 15% control rate) and at least one lightly-aligned model with 45pp (5% to 50% range). On the lightly-aligned model, all five Big Five personas increase sycophancy over control, and counterintuitively Agreeableness produces the smallest increase, not the largest. The single largest effect in the study is constructive: a Skeptic persona reduces sycophancy by 25pp on the lightly-aligned model, and is the only persona that instructs resistance against user claims rather than engagement with them, suggesting a directionality account. Cross-model transfer of persona effects is near-zero, so persona-alignment testing must be per-model. We propose $\Delta_{\text{floor}}$ as a deployment-time audit metric: measure it on a small persona panel before deploying persona customization.
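
The alignment-floor definition in this abstract is a range statistic over persona conditions. A minimal sketch follows, assuming a hypothetical dictionary of per-persona sycophancy rates; the three values are illustrative only, chosen to be consistent with the figures quoted in the abstract (30% control, 50% with "be enthusiastic", a 25pp reduction under the Skeptic persona), and are not the paper's data table.

```python
def alignment_floor(rates):
    """Return max_p S(m, p) - min_p S(m, p), in percentage points."""
    values = list(rates.values())
    return max(values) - min(values)

# Hypothetical sycophancy rates (percent) for one lightly-aligned model.
# The persona names and the dictionary layout are assumptions for illustration.
lightly_aligned = {"control": 30, "enthusiastic": 50, "skeptic": 5}

floor_pp = alignment_floor(lightly_aligned)
print(floor_pp)  # 45
```

Computing the floor per model, rather than pooling models, mirrors the abstract's point that persona effects do not transfer across models.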

2605.27379 2026-05-29 cs.AI cs.CL 版本更新

Soro: A Lightweight Foundation Model and Chatbot for Tajik

Soro: 一种轻量级塔吉克语基础模型与聊天机器人

Stanislav Liashkov, Haitz Sáez de Ocáriz Borde, Azizjon Azimi, Khushbakht Shoymardonov, Shuhratjon Khalilbekov, Bonu Boboeva

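AI summary: For compute- and connectivity-constrained settings in Tajikistan, proposes Soro, a family of Tajik-specialized conversational LLMs built on Gemma 3; after continual pretraining and supervised instruction tuning, it substantially outperforms same-size baselines on Tajik benchmarks and supports quantized deployment.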

详情
AI中文摘要

我们提出了Soro,一个塔吉克语专用对话大语言模型(LLM)家族,专为在塔吉克斯坦计算和连接受限条件下的实际部署而设计。从开放权重的Gemma 3检查点开始,我们在一个精心策划的19亿词元语料库上进行了仅塔吉克语的持续预训练,该语料库涵盖过滤后的网络文本、PDF文档和符合课程的教育材料,随后在4万个塔吉克语教师风格示例上进行监督指令微调。为了在标准基准测试中塔吉克语覆盖有限的情况下实现严格评估,我们引入了一套塔吉克语基准测试,涵盖常识、语言能力以及中学和大学入学考试领域,并在Hugging Face上开源。在这些塔吉克语基准测试中,Soro显著优于同尺寸的Gemma 3基线,同时在标准数据集上保持了较强的英语性能。我们进一步表明,Soro的FP8和INT4量化保留了大部分塔吉克语增益,同时降低了边缘部署的内存需求,支持正在进行的教育领域试点和计划在塔吉克斯坦学校中的扩展。

英文摘要

We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school- and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets. We further show that FP8 and INT4 quantization of Soro preserves most Tajik-language gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and planned scale-out across schools in Tajikistan.

2605.27377 2026-05-29 cs.CL cs.AI cs.IR 版本更新

Enhancing LLM Medical Coding with Structured External Knowledge

利用结构化外部知识增强LLM医学编码

Yidong Gan, David D. Nguyen, Yang Lin, Peter Zhong, Thanh Vu, Long Duong, Yuan-Fang Li

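Affiliations: Oracle Health and AI

AI summary: Proposes RAG-Coding, a training-free method that strengthens LLM medical coding by encoding the ICD tabular list as a knowledge graph and distilling the guidelines into summaries; it outperforms LLM-based baselines on the MDACE and MDACE-2025 datasets.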

详情
AI中文摘要

准确的医学编码需要查阅权威资源,如ICD表格列表和编码指南。现有的基于LLM的自动化方法主要依赖LLM的内部知识,容易产生幻觉且无法跟上指南更新。我们引入了RAG-Coding,一种无需训练的智能体方法,通过结构化外部知识增强LLM:将表格列表编码为知识图谱,捕获层次化和指令性的代码关系;将指南提炼为简洁、代码特定的摘要,而非检索原始文本。为支持我们的研究,我们还引入了MDACE-2025,即根据2025年ICD-10-CM/PCS指南对MDACE数据集进行的专家重新标注,增加了代码排序和理由注释。在MDACE上,RAG-Coding在五个LLM骨干网络上以micro-F1指标超越最佳基于LLM的基线3-13%,并与监督式最先进方法达到相当的micro-和macro-F1,以更高的召回率(+11%)为代价,精确率降低(-6%)。在MDACE-2025上,RAG-Coding超越所有基线,展示了对更新指南的有效泛化。消融实验确认了逐步提升,强调了整合结构化外部知识对基于LLM的医学编码的重要性。

英文摘要

Accurate medical coding requires consulting authoritative resources such as the ICD tabular list and coding guidelines. Existing LLM-based automated methods largely rely on LLMs' internal knowledge, which is prone to hallucination and cannot keep pace with guideline updates. We introduce RAG-Coding, an agentic, training-free method that augments LLMs with structured external knowledge: the tabular list is encoded as a knowledge graph capturing hierarchical and instructional code relationships, and the guidelines are distilled into concise, code-specific summaries rather than retrieved as raw text. To enable our study, we also introduce MDACE-2025, expert re-annotations of the MDACE dataset under the 2025 ICD-10-CM/PCS guidelines, adding code sequencing and justification comments. On MDACE, RAG-Coding outperforms the best LLM-based baseline by 3-13% in micro-F1 across five LLM backbones, and achieves comparable micro- and macro-F1 to the supervised state-of-the-art, with higher recall (+11%) at the cost of precision (-6%). On MDACE-2025, RAG-Coding outperforms all baselines, demonstrating effective generalisation to updated guidelines. Ablations confirm stepwise gains, highlighting the importance of integrating structured external knowledge for LLM-based medical coding.

2605.27276 2026-05-29 cs.AI cs.CL 版本更新

SIA: Self Improving AI with Harness & Weight Updates

SIA: 具有框架与权重更新的自我改进AI

Prannay Hebbar, Yogendra Manawat, Samuel Verboomen, Alesia Ivanova, Selvam Palanimalai, Kunal Bhatia, Vignesh Baskaran

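Affiliations: Hexo Labs; University of Oxford

AI summary: Proposes SIA, a framework in which a Feedback-Agent updates both the harness and the weights of a task agent, outperforming harness-only iteration in three domains (Chinese legal charge classification, GPU kernel optimisation, and single-cell RNA denoising).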

详情
AI中文摘要

人类是构建和改进AI的瓶颈。无论是模型还是封装它们的智能体,都是由人类编写、调整和纠正的。一个能够自我改进的AI的长期目标仍然未实现。两条大致独立的研究路线试图解决这一瓶颈。框架更新学派让元智能体重写任务特定智能体的框架(其工具、提示、重试逻辑和搜索过程),而模型权重保持不变。测试时训练学派使用手写的强化学习流程,根据任务反馈更新模型自身的权重,而框架保持不变。这两个孤岛独立运作。我们提出SIA,一个自我改进循环,其中语言模型智能体(反馈智能体)同时更新任务特定智能体的框架和权重。我们在三个对比领域进行评估:中国法律罪名分类、底层GPU内核优化和单细胞RNA去噪。结合两个杠杆在所有三个基准上均优于仅迭代框架。SIA-W+H在LawBench上比先前SOTA高出25.1%,GPU内核比先前SOTA快12.4%(1017 vs 1161 μs),去噪性能比先前SOTA高出20.4%。框架更新使模型具有智能体性,塑造其搜索和行动方式,而权重更新构建了任何提示或框架都无法灌输的领域直觉。

英文摘要

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. SIA-W+H achieves 25.1% over prior SOTA on LawBench, 12.4% faster GPU kernels than prior SOTA (1,017 vs 1,161 μs), and 20.4% over prior SOTA on denoising. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.

2605.26755 2026-05-29 cs.CL 版本更新

SEEK: Semantic Evidence Extraction via Adaptive ChunKing for Multilingual Fact-Checking

SEEK: 通过自适应分块进行多语言事实核查的语义证据提取

Babu Kumar, Gaurav Kumar, Ayush Garg, Aditya Kishore, Jasabanta Patro

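Affiliations: Department of Data Science and Engineering; Indian Institute of Science Education and Research

AI summary: Proposes SEEK, a framework that builds coherent evidence chunks through adaptive semantic chunking and fine-tunes multilingual LLMs, improving macro-F1 by up to 20% on multilingual fact-checking.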

详情
AI中文摘要

多语言事实核查需要既相关又足够完整的证据,以实现可靠的事实性预测。然而,现有系统通常依赖搜索片段、句子级证据或局部切分的段落,这可能会遗漏关键上下文并产生碎片化的证据。为克服这些限制,我们提出SEEK,一种自适应分块的语义证据提取框架,通过识别语义主题转换并保留局部验证上下文,从完整的事实核查文章中构建连贯的证据块。构建的块使用多语言编码器进行编码,然后使用LoRA适配器微调多语言大模型进行真实性预测。在X-FACT和RU22Fact上的实验表明,与语义分块相比,SEEK将宏F1提高了最多10%,与句子分块相比提高了19%,与搜索片段基线相比提高了20%。证据完整性和显著性分析进一步表明,SEEK保留了更丰富的验证上下文,并实现了更可靠的多语言事实核查。

英文摘要

Multilingual fact verification requires evidence that is both relevant and sufficiently complete for reliable factuality prediction. However, existing systems often rely on search snippets, sentence-level evidence, or locally segmented passages, which can miss decisive context and produce fragmented evidence. To overcome these limitations, we propose SEEK (Semantic Evidence Extraction via adaptive chunKing), a framework that constructs coherent evidence chunks from full fact-checking articles by identifying semantic topic transitions and preserving local verification context. The constructed chunks are encoded using a multilingual encoder, and multilingual LLMs are then fine-tuned with LoRA adapters for veracity prediction. Experiments on X-FACT and RU22Fact show that SEEK improves macro-F1 by up to 10% over semantic chunking, 19% over sentence chunking, and 20% over search-snippet baselines. Evidence completeness and significance analyses further show that SEEK preserves richer verification context and enables more reliable multilingual fact-checking.

2605.26428 2026-05-29 cs.CL cs.HC 版本更新

Slide Deck Q&A Quality Assurance App: A Multi-Stage Pipeline for Pedagogical Question Generation

Slide Deck Q&A 质量保证应用:面向教学问题生成的多阶段流水线

Jim Salsman

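Affiliations: TalkNicer, Inc.

AI summary: Presents a Flask-based multi-stage LLM pipeline that extracts text and images from PDF slides and generates structured pedagogical question sets, improving question quality through four stages: window planning, deck synthesis, slide annotation, and reconciliation.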

Comments 15 pages, 3 research questions, 1 figure, 1 table, 6 references, 2 appendices

详情
AI中文摘要

从讲座幻灯片中生成高质量、具有教学意义的问题是困难的,因为重要的教学内容分布在文本和视觉元素中,而且有用的问题必须根据演示流程进行搭建,而不是孤立地逐张幻灯片生成。本文描述了Slide Deck Q&A质量保证(slidesqaqa),一个基于Flask的软件系统,它从PDF幻灯片中提取文本和渲染图像,并通过一个四阶段的大语言模型流水线进行处理,包括窗口规划、幻灯片合成、标注和协调。该系统联合考虑幻灯片模态和教学角色,分配有限的问题预算,并在幻灯片组级别修订草稿标注以减少冗余并提高覆盖率。最终输出是一个结构化的JSON标注,包含幻灯片组级目标、章节结构、幻灯片级摘要、问题集和评估分数。在两个技术讲座幻灯片上的初步实验表明,该流水线可以过滤非教学幻灯片,并为视觉复杂的内容生成高保真、教学连贯的问题。

英文摘要

Generating high-quality, pedagogically useful questions from lecture slide decks is difficult because important instructional content is distributed across both text and visual elements, and because useful questions must be scaffolded across the flow of a presentation rather than generated slide by slide in isolation. This paper describes Slide Deck Q&A Quality Assurance (slidesqaqa), a Flask-based software system that extracts text and rendered images from PDF slides and processes them through a four-stage large language model pipeline comprising window planning, deck synthesis, slide annotation, and reconciliation. The system reasons jointly about slide modality and pedagogical role, allocates bounded question budgets, and revises draft annotations at the deck level to reduce redundancy and improve coverage. The final output is a structured JSON annotation containing deck-level goals, section structure, slide-level summaries, question sets, and evaluation scores. Initial experiments on two technical lecture decks indicate that the pipeline can filter non-instructional slides and produce high-fidelity, pedagogically coherent questions for visually complex content. The working system is at https://slidesqaqa-974767694043.us-west1.run.app. The software repository is at https://github.com/blinding2submit/slidesqaqa.

2605.26029 2026-05-29 cs.AI cs.CL 版本更新

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

CausaLab:面向AI科学家的交互式因果发现可扩展环境

Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng

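Affiliations: Tsinghua University; University of Illinois Urbana-Champaign; Carnegie Mellon University; University of Chicago; Adobe

AI summary: Proposes the CausaLab environment, which uses synthetic laboratory tasks to evaluate both the prediction accuracy and the causal-mechanism recovery of LLM agents in causal discovery, and finds a significant gap between the two.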

详情
AI中文摘要

我们介绍了CausaLab,一个用于评估LLM代理进行交互式因果发现的可扩展环境。与先前的评估不同,CausaLab既评估代理是否能够使用因果证据解决问题,也评估其答案是否基于忠实恢复的因果机制。每个回合将代理置于一个合成实验室中:它接收先前的测量记录,对操纵器晶体进行干预,并预测由相同机制控制的保留反应器晶体的共振频率。隐藏的数据生成过程是一个随机采样的结构因果模型(SCM),因此成功需要恢复因果图和结构方程,而不是回忆先验知识。实验表明,预测和机制恢复之间存在持续差距:在纯观测的6节点设置中,GPT-5.2-high达到92%的任务准确率,但全边$F_1$仅为0.471。混合观测-干预策略提高了结构保真度,而纯干预即使对强代理仍然困难。我们确定过早停止是一个主要弱点,并表明一致性验证可以缓解它。因此,CausaLab将预测成功与因果理解分开,并揭示了当前LLM代理作为实验因果推理者的局限性。

英文摘要

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
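
The gap between task accuracy and edge recovery reported above depends on how a recovered graph is scored. The abstract does not define its all-edge $F_1$, so the sketch below is only an assumption: it treats both graphs as sets of directed edges and computes the usual F1 over them. The function name and the toy graphs are hypothetical.

```python
def edge_f1(true_edges, predicted_edges):
    """F1 over directed edges; a reversed edge counts as a miss and a false positive."""
    true_set, pred_set = set(true_edges), set(predicted_edges)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)

# Toy example (not from the paper): one correct edge, one reversed edge, one missed edge.
truth = [("A", "B"), ("B", "C"), ("A", "C")]
guess = [("A", "B"), ("C", "B")]
score = edge_f1(truth, guess)
print(round(score, 3))  # 0.4
```

Under this scoring an agent can predict the held-out quantity well while still having the direction of some mechanisms wrong, which is the kind of mismatch the abstract describes.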

2605.25297 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

Eureka:面向企业AI云资源需求预测的智能特征工程

Hangxuan Li, Renjun Jia, Xuezhang Wu, Yunjie Qian, Zeqi Zheng, Xianling Zhang

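Affiliations: Alibaba Cloud Computing Co. Ltd, Hangzhou, China; School of Computer Science, Fudan University, Shanghai, China; School of Computer Science and Technology, Tongji University, Shanghai, China; Independent Researcher, United States

AI summary: Proposes Eureka, a framework that treats feature engineering as an agentic code generation problem and automatically generates executable feature code through three stages (an Expert Agent, an LLM Feature Factory, and a Self-Evolving Alignment Engine), improving performance on 7 public benchmarks in healthcare, finance, and social domains and on GPU resource demand prediction at Alibaba Cloud.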

Comments accepted at NeurIPS 2025 Workshop, DASFAA 2026 (International Conference on Database Systems for Advanced Applications)

详情
Journal ref
Database Systems for Advanced Applications (DASFAA 2026), Lecture Notes in Computer Science, vol. 16540, pp. 528-540, Springer
AI中文摘要

有效的特征对于预测模型性能至关重要,但创建特征通常需要领域专业知识,限制了跨应用的可扩展性。我们将特征工程定义为一个智能体代码生成问题:特征不再是静态的数据转换,而是可生成、评估和迭代改进的可执行程序。我们提出了Eureka,一个由LLM驱动的三阶段框架。(1)专家代理,通过领域知识的SFT微调,生成结构化的JSON格式特征设计方案。(2)LLM特征工厂,通过思维链推理将每个方案转化为可执行的Python代码,将特征假设转化为可运行的程序。(3)自演化对齐引擎,使用带双通道奖励(基于指标的效用+语义对齐)的强化学习(GRPO)来提升代码质量。通过将特征表达为程序,学习到的生成模式可以跨领域迁移。在医疗、金融和社交领域的7个公开基准上评估,Eureka一致优于传统的AutoFE和基于LLM的基线。我们进一步在阿里云的云GPU资源需求预测中展示了Eureka的有效性,其中Eureka将需求满足率提高了16%,并将计算资源迁移率降低了33%。

英文摘要

Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability across applications. We define feature engineering as an agentic code generation problem: features are not static data transformations, but executable programs that can be generated, evaluated, and iteratively improved. We present Eureka, an LLM-driven framework with three stages. (1) An Expert Agent, fine-tuned via SFT on domain knowledge, produces structured feature design plans in JSON format. (2) An LLM Feature Factory translates each plan into executable Python code through chain-of-thought reasoning, turning feature hypotheses into runnable programs. (3) A Self-Evolving Alignment Engine uses Reinforcement Learning (GRPO) with dual-channel reward (metric-based utility + semantic alignment) to enhance code quality. By expressing features as programs, the learned generation patterns can transfer across domains. Evaluated on 7 public benchmarks in healthcare, finance, and social domains, Eureka consistently outperforms both traditional AutoFE and LLM-based baselines. We further demonstrate Eureka's effectiveness on cloud GPU resource demand prediction at Alibaba Cloud, where Eureka improves demand fulfillment rate by 16% and lowers computing resource migration rates by 33%.

2605.23657 2026-05-29 cs.CL 版本更新

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

OpenSkillEval:自动审计LLM智能体的开放技能生态系统

Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao

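Affiliations: Singapore Management University; Institute of Trustworthy Embodied AI, Fudan University; Joy Future Academy, JD

AI summary: Proposes OpenSkillEval, an automatic evaluation framework that dynamically constructs realistic task instances and collects community skills to systematically evaluate skill-augmented agent systems and the skills themselves; key findings include that skill availability does not guarantee effective use and that the benefit of skill augmentation depends on the model and agent framework.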

详情
AI中文摘要

技能,即为大型语言模型(LLM)提炼的结构化工作流指令,正成为提升智能体在现实下游任务性能的日益重要的机制。然而,随着开源技能生态系统的快速扩张,不同模型和智能体框架如何与技能交互、如何评估技能质量、以及用户在实际成本-性能权衡下应如何选择技能,这些问题仍不明确。在本文中,我们提出了OpenSkillEval,一个针对技能增强型智能体系统及技能本身的自动评估框架。OpenSkillEval不依赖静态基准,而是从不断演变的现实世界工件中自动构建跨五类下游应用(演示生成、前端网页设计、海报生成、数据可视化和报告生成)的真实任务实例。它进一步收集和组织社区贡献的技能,以便在统一任务设置下进行受控比较。利用超过600个动态生成的任务实例和30个开源技能,我们对最先进的模型和智能体框架进行了系统评估。我们的结果表明,技能可用性并不保证有效使用技能,技能增强的收益强烈依赖于底层模型和智能体框架,并且许多公开流行的技能并不始终优于没有技能的基础智能体。这些发现凸显了动态、基于任务的评估的必要性,并为LLM智能体技能的设计、选择和部署提供了实用见解。更多案例和基准资源可在项目网站上获取:https://yingjiahao14.github.io/OpenSkillEval-Web/。

英文摘要

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.

2605.22586 2026-05-29 cs.LG cs.CL 版本更新

A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models

扩散理论教程:从微分方程到扩散模型

Jiayi Fu, Yuxia Wang

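AI summary: This tutorial gives a unified account of the mathematical foundations of diffusion models from the perspective of differential equations, deriving ODE and SDE representations, explaining score matching and the denoising objective, and covering DDPM, DDIM, flow matching, and diffusion language models.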

Comments A detailed tutorial on Diffusion models and SDE

详情
AI中文摘要

扩散模型已成为生成建模的主导框架,但其数学基础通常通过扩散概率模型、基于分数的建模、随机微分方程和数值采样方法分别呈现。我们编写本教程,从微分方程的角度提供这些观点的统一且自洽的阐述。从条件高斯噪声过程出发,我们推导常微分方程(ODE)和随机微分方程(SDE)表示,过渡到相应的边际正向动力学,然后得到使生成成为可能的逆向时间SDE和概率流ODE。我们表明逆向采样中的中心未知量是边际分数,解释在噪声预测参数化下分数匹配如何成为标准去噪目标,并讨论实际的逆向时间采样和引导。我们进一步将DDPM、DDIM、流匹配和基于分数的SDE置于一个共同框架中,并以连续嵌入空间中的扩散语言模型结束,同时简要讨论离散掩码标记扩散。本教程旨在作为扩散过程的分析基础与建立在其上的现代生成算法之间的桥梁。

英文摘要

Diffusion models have emerged as a dominant framework for generative modeling, but their mathematical foundations are often presented separately through diffusion probabilistic models, score-based modeling, stochastic differential equations, and numerical sampling methods. We write this tutorial to provide a unified and self-contained account of these viewpoints from the perspective of differential equations. Starting from a conditional Gaussian noising process, we derive ordinary differential equation (ODE) and stochastic differential equation (SDE) representations, pass to the corresponding marginal forward dynamics, and then obtain the reverse-time SDE and probability-flow ODE that make generation possible. We show that the central unknown quantity in reverse sampling is the marginal score, explain how score matching becomes the standard denoising objective under a noise-prediction parameterization, and discuss practical reverse-time sampling and guidance. We further place DDPM, DDIM, flow matching, and score-based SDEs in a common framework, and conclude with diffusion language models in continuous embedding space together with a brief discussion of discrete masked-token diffusion. The tutorial is intended as a bridge between the analytical foundations of diffusion processes and the modern generative algorithms built upon them.
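
For orientation, the objects named in this abstract can be written out in the standard score-based SDE notation. The symbols below ($f$, $g$, $p_t$, $\alpha_t$, $\sigma_t$, $\epsilon_\theta$) are assumed here and may differ from the tutorial's own notation.

```latex
% Assumed notation: drift f, diffusion coefficient g, marginal density p_t,
% forward Wiener process w, reverse-time Wiener process \bar{w}.
\begin{align}
  \text{forward SDE:}\quad
    & dx = f(x,t)\,dt + g(t)\,dw \\
  \text{reverse-time SDE:}\quad
    & dx = \big[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar{w} \\
  \text{probability-flow ODE:}\quad
    & \frac{dx}{dt} = f(x,t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)
\end{align}
% Gaussian noising x_t = \alpha_t x_0 + \sigma_t \epsilon, \epsilon \sim \mathcal{N}(0, I), gives
%   \nabla_{x_t} \log p_t(x_t \mid x_0) = -\epsilon / \sigma_t,
% so a noise-prediction network \epsilon_\theta yields the score estimate
%   s_\theta(x_t, t) \approx -\epsilon_\theta(x_t, t) / \sigma_t.
```

The only unknown in both reverse-time equations is the marginal score $\nabla_x \log p_t(x)$, which is why the abstract singles it out as the central quantity to learn.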

2605.13841 2026-05-29 cs.SD cs.AI cs.CL cs.LG 版本更新

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

EVA-Bench:一种用于评估语音代理的新型端到端框架

Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara

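Affiliations: ServiceNow

AI summary: Proposes EVA-Bench, a framework that evaluates the accuracy and experience quality of voice agents through bot-to-bot audio conversation simulation and two composite metrics (EVA-A and EVA-X).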

Comments Work in progress

详情
AI中文摘要

语音代理是一种通过口语对话完成任务的人工智能系统,越来越多地部署在企业应用中。然而,现有基准测试未能同时解决两个核心评估挑战:生成逼真的模拟对话,以及全面衡量语音特定故障模式的质量。我们提出了EVA-Bench,一个端到端评估框架,同时解决这两个问题。在模拟方面,EVA-Bench通过动态多轮对话协调机器人间的音频对话,并自动进行模拟验证,检测用户模拟器错误并在评分前适当重新生成对话。在测量方面,EVA-Bench引入了两个复合指标:EVA-A(准确性),捕捉任务完成度、忠实度和音频级语音保真度;以及EVA-X(体验),捕捉对话进展、口语简洁性和话轮转换时机。这两个指标适用于所有主要的代理架构,支持直接的跨架构比较。EVA-Bench包含三个企业领域的213个场景、一个用于口音和噪声鲁棒性的受控扰动套件,以及区分峰值能力和可靠能力的pass@1、pass@k、pass^k测量。在跨越所有三种架构的12个系统中,我们发现:(1)没有系统在EVA-A pass@1和EVA-X pass@1上同时超过0.5;(2)峰值性能和可靠性能差异显著(EVA-A上pass@k与pass^k的中位数差距为0.44);(3)口音和噪声扰动暴露了显著的鲁棒性差距,其影响因架构、系统和指标而异(平均Δ高达0.314)。我们在开源许可下发布了完整的框架、评估套件和基准数据。

英文摘要

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator errors and regenerates conversations as needed before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, and pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median gap between pass@k and pass^k of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean $\Delta$ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
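
The distinction between peak and reliable capability can be made concrete with the two pass metrics. The abstract does not say how EVA-Bench estimates them, so the sketch below assumes the common combinatorial estimators from n recorded trials with c successes: pass@k is the chance that at least one of k sampled trials succeeds, and pass^k is the chance that all k succeed. The function names and the trial counts are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimated probability that at least one of k sampled trials succeeds."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Estimated probability that all k sampled trials succeed."""
    return comb(c, k) / comb(n, k)

# Hypothetical scenario: 5 trials, 3 of them successful, k = 3.
peak = pass_at_k(5, 3, 3)
reliable = pass_hat_k(5, 3, 3)
print(peak, reliable)  # 1.0 0.1
```

A system that succeeds on most but not all trials can score near 1 on pass@k and near 0 on pass^k, which is the kind of divergence the reported gap captures.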

2605.07210 2026-05-29 cs.IR cs.CL 版本更新

DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

DiffRetriever: 基于扩散语言模型的并行代表标记用于检索

Shuai Wang, Yu Yin, Shengyao Zhuang, Bevan Koopman, Guido Zuccon

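Affiliations: The University of Queensland; CSIRO

AI summary: Proposes DiffRetriever, which uses the masked-position prediction of diffusion language models directly for retrieval, supports single- and multi-representation retrieval, and outperforms DiffEmbed, PromptReps, and RepLLaMA in the authors' matched comparison.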

Comments Updated analysis, ablations, and benchmarks against SOTA retrievers; indexing storage/latency ablation; isolation of the effectiveness gain

详情
AI中文摘要

本文展示了扩散语言模型(DLM)如何作为有效且高效的检索器。现有的基于DLM的检索器(例如DiffEmbed)遵循BERT风格的编码,将每个查询或段落表示为单个平均池化向量。这忽略了DLM在双向注意力下通过掩码位置预测生成响应的能力,而这种能力可以提供更强的检索信号。我们提出DiffRetriever,它直接使用DLM原生的掩码位置预测进行检索。对于每个查询或段落,DiffRetriever附加一个或多个掩码位置,在单次前向传播中将输出用作检索表示。使用一个掩码位置时,单表示DiffRetriever在相同骨干网络上已经优于DiffEmbed。DiffRetriever还自然地扩展到多表示检索:DLM联合处理多个掩码位置,实现ColBERT风格的细粒度匹配,且编码延迟增加很小。在自回归LLM检索器中,相同的多表示策略需要顺序解码,因此产生更高的延迟。DiffRetriever在我们匹配的比较中获得了最强的整体效果,优于DiffEmbed、PromptReps和RepLLaMA。在训练数据上选择的掩码位置数量能够很好地跨数据集迁移,而每个查询的变化表明自适应分配仍有提升空间。代码可在https://github.com/ielab/diffretriever获取。

英文摘要

This paper shows how diffusion language models (DLMs) can be used as effective and efficient retrievers. Existing DLM-based retrievers (e.g., DiffEmbed) follow BERT-style encoding, representing each query or passage as a single mean-pooled vector. This ignores how DLMs are trained to generate responses through masked-position prediction under bidirectional attention, a capability that can provide stronger retrieval signals. We propose DiffRetriever, which uses the DLM's native masked-position prediction directly for retrieval. For each query or passage, DiffRetriever appends one or more masked positions, using the outputs as retrieval representations in a single forward pass. With one masked position, single-representation DiffRetriever already improves over DiffEmbed on the same backbones. DiffRetriever also naturally extends to multi-representation retrieval: DLMs process multiple masked positions jointly, enabling ColBERT-style fine-grained matching with little additional encoding latency. In autoregressive LLM retrievers, the same multi-representation strategy requires sequential decoding and therefore incurs much higher latency. DiffRetriever obtains the strongest aggregate effectiveness within our matched comparison, outperforming DiffEmbed, PromptReps, and RepLLaMA. Masked-position counts selected on training data transfer well across datasets, while per-query variation suggests headroom for adaptive allocation. Code is available at https://github.com/ielab/diffretriever.
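
The "ColBERT-style fine-grained matching" mentioned above refers to late interaction between several query vectors and several passage vectors. The exact scoring used by DiffRetriever is not given in the abstract; the sketch below assumes the usual MaxSim rule, where each query vector takes its best dot product over the passage vectors and the maxima are summed. The toy vectors stand in for masked-position outputs and are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, passage_vecs):
    """Sum over query vectors of the maximum similarity to any passage vector."""
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)

# Toy 2-dimensional representations (illustrative values only).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
passage_vecs = [[0.5, 0.5], [1.0, 0.0]]
score = maxsim_score(query_vecs, passage_vecs)
print(score)  # 1.5
```

With a single vector on each side this reduces to an ordinary dot-product score, which corresponds to the single-representation case described in the abstract.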

2604.25098 2026-05-29 cs.AI cs.CL cs.LG 版本更新

Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

重新审视LLM剪枝对测试时缩放的有效性

Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra

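Affiliations: Bellini College of AI, Cybersecurity, and Computing

AI summary: Studies how unstructured pruning affects the test-time scaling performance of reasoning LLMs, finding that it outperforms structured pruning and at times even the unpruned model, and examines the role of layer-wise sparsity allocation strategies.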

详情
AI中文摘要

大型语言模型(LLM)现在通过测试时计算缩放(TTS)展现出卓越的推理能力,在数学和编程基准测试中表现令人印象深刻。与此同时,模型压缩研究开发了剪枝方法,旨在在不牺牲任务性能的情况下移除冗余/有害参数。这两项研究进展的交叉点构成了我们工作的基础。具体到推理型LLM,先前的工作表明结构化剪枝(移除整组层块的方法)显著降低了TTS推理性能。然而,在这项工作中,我们重新审视了这一假设,并研究了非结构化剪枝(仅小心移除某些冗余/有害权重的方法)是否表现出类似的局限性。令人惊讶的是,我们在两个推理型LLM(s1.1-7B和Qwen3-8B)的四个推理基准上的广泛实验一致表明,与结构化剪枝相比,非结构化剪枝增强了TTS性能,有时甚至能超越未剪枝的全权重LLM。此外,我们还实证研究了不同层间稀疏分配策略的影响,这些策略是实现这些非结构化方法的重要参数选择。这些发现挑战了剪枝总是降低TTS性能的传统观念,实际上表明,谨慎进行的剪枝可以保持TTS的有效性。

English abstract

Large Language Models (LLMs) now exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), with impressive performance across math and coding benchmarks. In parallel, research in model compression has developed pruning methods that seek to remove redundant/detrimental parameters without sacrificing task performance. The intersection of these two research advancements lays the foundation for our work. Specific to reasoning LLMs, prior work has shown that structured pruning (methods that remove entire sets of layer blocks) significantly degrades TTS reasoning performance. However, in this work, we revisit this assumption and investigate whether unstructured pruning (methods that carefully remove only certain redundant/detrimental weights) exhibits similar limitations. Surprisingly, our extensive experiments across four reasoning benchmarks on two reasoning LLMs (s1.1-7B and Qwen3-8B) consistently show that unstructured pruning augments TTS performance compared to structured pruning, and at times can even outperform the unpruned full-weight LLMs. Furthermore, we also empirically study the impact of different layer-wise sparsity allocation strategies, which are an important parametric choice for instantiating these unstructured methods. These findings challenge the conventional notion that pruning always reduces TTS performance and, in fact, suggest that carefully undertaken pruning can retain TTS effectiveness.

2604.20443 2026-05-29 cs.CL cs.AI cs.LG Updated version

DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

DialToM:用于预测状态驱动对话轨迹的心智理论基准

Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim

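Affiliations * Singapore Management University; Australian National University

AI summary Introduces DialToM, a benchmark built from naturalistic dialogues with a multiple-choice evaluation framework; it reveals a systematic reasoning asymmetry in LLMs between inferring mental states (Literal ToM) and using them for social forecasting (Functional ToM), and shows a marked capability gap between a domain expert and AI.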

Comments Submitted to EMNLP 2026

Details
AI Chinese abstract

我们介绍了DialToM,一个基于自然人类对话构建的带注释的心智理论(ToM)基准,采用多选评估框架。与近期在合成环境中显示显式心理状态推断与应用ToM之间存在差距的工作一致,我们建立了一个更严格的“状态驱动诊断探针”,要求模型仅从孤立的心理状态特征(无对话上下文)预测状态一致的对话轨迹。我们的评估揭示了系统性的推理不对称性——LLMs在推断心理状态(字面ToM)方面表现出色,但在利用它们进行社会预测(功能ToM)方面存在困难。关键的是,领域专家在此任务上达到100%准确率,证明了其有效性,并揭示了人类与AI之间的显著能力差距。此外,教师-学生推理注入探针显示,Gemini 3 Pro(建立了领先基线)具备强大的功能ToM能力,可用于无上下文预测,且该能力可迁移至较弱模型。DialToM、其评估代码和数据集公开于https://github.com/Stealth-py/DialToM。

English abstract

We introduce DialToM, an annotated Theory of Mind (ToM) benchmark built from naturalistic human-human dialogues using a multiple-choice evaluation framework. Concurrent with recent work showing a gap between explicit mental-state inference and applied ToM in synthetic settings (SimpleToM; Gu et al., 2024), we establish a stricter State-Driven Diagnostic Probe in which models must forecast state-consistent dialogue trajectories solely from isolated mental-state profiles without dialogue context. Our evaluation reveals a systematic reasoning asymmetry: LLMs excel at inferring mental states (Literal ToM) but struggle to leverage them for social forecasting (Functional ToM). Crucially, a domain expert achieves 100% accuracy on this task, proving its validity and establishing a stark human-AI capability gap. Further, a teacher-student reasoning injection probe shows that Gemini 3 Pro, which establishes the leading baseline, possesses robust Functional ToM capabilities for context-free forecasting that are transferable to weaker models. DialToM, its evaluation code, and dataset are publicly available at https://github.com/Stealth-py/DialToM.

2604.18847 2026-05-29 cs.AI cs.CL Updated version

Human-Guided Harm Recovery for Computer Use Agents

面向计算机使用代理的人类引导式危害恢复

Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu

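Affiliations * MIT CSAIL

AI summary Addresses harm recovery after LM agents execute actions on computer systems: a user study defines preference-aligned recovery dimensions, a reward model re-ranks candidate recovery plans, and the BackBench benchmark is introduced; experiments show the method outperforms baseline agents.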

Details
AI Chinese abstract

随着LM代理获得在真实计算机系统上执行操作的能力,我们不仅需要大规模预防有害行为的方法,还需要在预防失败时有效修复危害。我们形式化了后执行安全中这一被忽视的挑战的解决方案——危害恢复:即根据人类偏好,将代理从有害状态最优地引导回安全状态的问题。通过一项形成性用户研究,我们确定了偏好对齐的恢复维度,并生成了自然语言评分标准,从而为偏好对齐的恢复奠定基础。我们的1130个成对判断数据集揭示了属性重要性的上下文相关变化,例如偏好实用、有针对性的策略而非全面的长期方法。我们将这些学习到的见解操作化为一个奖励模型,在测试时对代理框架生成的多个候选恢复计划进行重排序。为了系统性地评估恢复能力,我们引入了BackBench,一个包含50个计算机使用任务的基准测试,用于测试代理从有害状态中恢复的能力。人工评估表明,我们的奖励模型框架比基础代理和基于评分标准的框架产生更高质量的恢复轨迹。这些贡献共同为新型代理安全方法奠定了基础——这些方法不仅通过预防来应对危害,而且通过有意图的对齐来应对危害的后果。

English abstract

As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,130 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent's ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods -- ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.

2604.13519 2026-05-29 cs.CL Updated version

ToolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding

ToolSpec: 通过模式感知与检索增强的推测解码加速工具调用

Heming Xia, Yongqi Li, Cunxiao Du, Mingbo Song, Wenjie Li

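Affiliations * Department of Computing, The Hong Kong Polytechnic University; Peking University

AI summary To reduce tool-calling latency, proposes ToolSpec, a schema-aware, retrieval-augmented speculative decoding method that uses predefined tool schemas to generate accurate drafts, alternates via a finite-state machine between filling deterministic schema tokens and speculatively generating variable fields, and retrieves historical invocations to reuse as drafts, achieving up to a 4.2x speedup.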

Details
AI Chinese abstract

工具调用极大地扩展了大语言模型(LLMs)的实际效用,使其能够与外部应用程序交互。随着LLM能力的提升,有效的工具使用越来越多地涉及多步骤、多轮交互以解决复杂任务。然而,由此产生的工具交互增长带来了大量延迟,对实时LLM服务构成了关键挑战。通过实证分析,我们发现工具调用轨迹高度结构化,符合受限模式,并且经常表现出重复的调用模式。受此启发,我们提出了ToolSpec,一种模式感知、检索增强的推测解码方法,用于加速工具调用。ToolSpec利用预定义的工具模式生成准确的草稿,使用有限状态机在确定性模式令牌填充和可变字段的推测生成之间交替。此外,ToolSpec检索相似的历史工具调用并将其重用为草稿,以进一步提高效率。ToolSpec提供了一种即插即用的解决方案,可以无缝集成到现有的LLM工作流中。在多个基准上的实验表明,ToolSpec实现了高达4.2倍的加速,显著优于现有的无训练推测解码方法。

English abstract

Tool calling has greatly expanded the practical utility of large language models (LLMs) by enabling them to interact with external applications. As LLM capabilities advance, effective tool use increasingly involves multi-step, multi-turn interactions to solve complex tasks. However, the resulting growth in tool interactions incurs substantial latency, posing a key challenge for real-time LLM serving. Through empirical analysis, we find that tool-calling traces are highly structured, conform to constrained schemas, and often exhibit recurring invocation patterns. Motivated by this, we propose ToolSpec, a schema-aware, retrieval-augmented speculative decoding method for accelerating tool calling. ToolSpec exploits predefined tool schemas to generate accurate drafts, using a finite-state machine to alternate between deterministic schema token filling and speculative generation for variable fields. In addition, ToolSpec retrieves similar historical tool invocations and reuses them as drafts to further improve efficiency. ToolSpec presents a plug-and-play solution that can be seamlessly integrated into existing LLM workflows. Experiments across multiple benchmarks demonstrate that ToolSpec achieves up to a 4.2x speedup, substantially outperforming existing training-free speculative decoding methods.
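
The alternation between deterministic schema tokens and variable fields can be sketched as a two-state machine that builds a draft tool call. This is only an illustration under stated assumptions, not ToolSpec's implementation: the JSON layout, the function `draft_tool_call`, the `history` lookup standing in for retrieval, and the restriction to string-valued parameters whose values need no JSON escaping are all hypothetical. Verification of the draft by the target model is omitted.

```python
import json
from typing import Callable, List, Tuple

Segment = Tuple[str, str]  # (state, text)


def draft_tool_call(tool: str, params: List[str], guess: Callable[[str], str]) -> List[Segment]:
    """Alternate between a FIXED state (schema text) and a VARIABLE state (guessed value)."""
    segments: List[Segment] = [("fixed", '{"name": "' + tool + '", "arguments": {')]
    for i, param in enumerate(params):
        separator = ", " if i > 0 else ""
        segments.append(("fixed", separator + '"' + param + '": "'))
        segments.append(("variable", guess(param)))  # would be speculated by a draft model
        segments.append(("fixed", '"'))
    segments.append(("fixed", "}}"))
    return segments


# Hypothetical earlier invocation, standing in for the retrieval component.
history = {"city": "Paris", "unit": "celsius"}
segments = draft_tool_call("get_weather", ["city", "unit"], lambda p: history.get(p, ""))
draft_text = "".join(text for _, text in segments)
fixed_chars = sum(len(text) for state, text in segments if state == "fixed")
print(draft_text)
```

In this toy draft most characters come from the fixed state, which is the intuition behind filling schema tokens deterministically and spending speculation only on the variable fields.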

2604.11088 2026-05-29 cs.AI cs.CL Updated version

Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents

护栏优于指导:关于编码智能体的规则、技能和持久配置的大规模研究

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

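Affiliations * AWS Generative AI Innovation Center; HSBC Holdings Plc., HSBC Technology Center, China

AI summary Large-scale experiments show that random rules improve coding-agent performance as much as expert rules, that every beneficial rule is a negative constraint while every harmful rule is a positive directive, and that agents should therefore be configured with constraints rather than guidance.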

Details
AI Chinese abstract

随机规则对编码智能体任务性能的提升与专家精心设计的规则相当(在SWE-bench Verified的判别子集上均提升+13.8个百分点),并且在我们的数据中,每条单独有益的规则都是负面约束(“不要重构无关代码”),而每条单独有害的规则都是正面指令(“遵循代码风格”)。我们通过首次对智能体规则文件(CLAUDE.md、.cursorrules以及更广泛的智能体技能、插件清单和角色定义系列)进行大规模受控研究得出这些发现:我们从GitHub抓取了679个规则文件(共25,532条规则),并使用Claude Opus 4.6在SWE-bench Verified上进行了超过5,000次Claude Code智能体运行。出现了三种模式。(i)规则极性清晰地区分了有益规则和有害规则;我们通过基于势能的奖励塑形(PBRS)的视角来解读这一点。(ii)性能提升在很大程度上与内容无关:随机、打乱、领域不匹配和未转换格式的规则文件均与精心设计的规则相匹配,指向一种上下文启动机制。(iii)单独的规则通常看起来有害,但在集成中并未明显累积损害:在规则数量从0到50的范围内,通过率保持稳定。这些发现揭示了快速增长的社区编写规则和技能生态系统中隐藏的可靠性风险,并得出了更安全智能体配置的明确原则:约束智能体不能做什么,而不是规定它应该做什么。

English abstract

Random rules improve a coding agent's task performance as much as expert-curated ones (both +13.8 pp on a discriminative subset of SWE-bench Verified), and in our data every individually beneficial rule is a negative constraint ("do not refactor unrelated code"), while every individually harmful one is a positive directive ("follow code style"). We arrive at these findings through the first large-scale controlled study of agent rule files (CLAUDE.md, .cursorrules, and the broader family of agent skills, plugin manifests, and persona definitions): we scrape 679 rule files (25,532 rules) from GitHub and conduct over 5,000 agent runs of Claude Code with Claude Opus 4.6 on SWE-bench Verified. Three patterns emerge. (i) Rule polarity cleanly separates beneficial from harmful rules; we read this through the lens of potential-based reward shaping (PBRS). (ii) Performance gains are largely content-independent: random, shuffled, mismatched-domain, and unconverted-format rule files all match curated rules, pointing to a context priming mechanism. (iii) Individual rules often appear harmful in isolation yet do not visibly accumulate damage in ensemble: pass rates remain stable across rule counts from 0 to 50. These findings expose a hidden reliability risk in the rapidly growing ecosystem of community-authored rules and skills, and they yield a clear principle for safer agent configuration: constrain what agents must not do, rather than prescribing what they should.

2604.09629 2026-05-29 cs.CL Updated version

HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation

HumorGen: 基于角色蒸馏的大语言模型幽默生成的认知协同

Edward Ajayi, Prasenjit Mitra

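Affiliations * Carnegie Mellon University Africa

AI summary To address the conflict between LLMs' standard training objective and the surprise that humor requires, proposes the Cognitive Synergy Framework, which uses six cognitive personas to synthesize data with diverse comedic perspectives and fine-tunes a 7B-parameter student model; experiments indicate that cognitively driven data curation matters more than alignment algorithms or model scale.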

Details
AI Chinese abstract

幽默生成对大语言模型(LLMs)构成重大挑战,因为其标准训练目标(下一个词预测)与喜剧所需的意外性和不协调性存在固有冲突。为弥合这一差距,我们引入了认知协同框架,这是一种受幽默心理学理论启发的高质量幽默数据生成方法。利用混合思维(MoT)方法,我们部署了六种认知角色(例如荒诞主义者、愤世嫉俗者)来为给定提示合成多样化的喜剧视角。该框架产生了一个基于理论的数据集,我们使用该数据集微调了一个7B参数的学生模型。我们进一步评估了两种对齐策略:直接偏好优化(DPO)和一种离线组相对变体O-GRPO,发现两者均未优于SFT。然而,我们的7B HumorGen模型变体显著优于更大的指令微调基线,并达到顶级开源权重性能,同时与前沿专有系统保持竞争力。这些结果表明,对于幽默生成,认知驱动的数据策展比对齐算法或模型规模更为关键。

English abstract

Humor generation poses a significant challenge for Large Language Models (LLMs), because their standard training objective (next-token prediction) inherently conflicts with the surprise and incongruity required for comedy. To bridge this gap, we introduce the Cognitive Synergy Framework, a methodology for generating high-quality humor data inspired by psychological theories of humor. Utilizing a Mixture-of-Thought (MoT) approach, we deploy six cognitive personas (e.g., The Absurdist, The Cynic) to synthesize diverse comedic perspectives for a given prompt. This framework produces a theory-grounded dataset, which we use to fine-tune a 7B-parameter student model. We further evaluate two alignment strategies, Direct Preference Optimization (DPO) and an offline group-relative variant O-GRPO, finding that neither improves over SFT. However, our 7B HumorGen model variants significantly outperform larger instruction-tuned baselines and achieve top-tier open-weight performance while remaining competitive with frontier proprietary systems. These results suggest that cognitively driven data curation is more critical than alignment algorithms or model scale for humor generation.

2604.06805 2026-05-29 cs.CL Updated version

Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning

思维认知循环:基于可逆层次马尔可夫链的高效数学推理

Jia-Chen Zhang, Yu-Jie Xiong, Zheng Zhou

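Affiliations * School of Computer Science and Technology, East China Normal University; School of Electronic and Electrical Engineering, Shanghai University of Engineering Science

AI summary Proposes the Cognitive Loop of Thought framework, based on a reversible hierarchical Markov chain, which uses hierarchical decomposition and backward verification to reduce redundancy and improve reasoning robustness, yielding notable gains on mathematical reasoning tasks.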

Details
AI Chinese abstract

多步思维链通过利用显式推理步骤显著提升了大型语言模型的数学推理能力。然而,长思维链的广泛采用往往导致序列长度超出可管理的计算限制。现有方法尝试通过类似马尔可夫链的结构减少KV缓存冗余来缓解这一问题,但引入了两个关键限制:固有的无记忆性(上下文丢失)和有限的反向推理能力。为了解决这些限制,我们提出了一种基于可逆层次马尔可夫链的新型思维链框架,称为思维认知循环,以及一个反向推理数据集CLoT-Instruct。在CLoT中,问题被分解为具有层次依赖关系的子问题。受人类认知过程的启发,我们在每个层次层引入反向验证机制。此外,我们实施了一种剪枝策略:一旦高层子问题得到验证,冗余的低层子问题就会被剪枝以最大化效率。这种方法有效缓解了错误传播并增强了推理鲁棒性。在四个数学基准上的实验证明了我们方法的有效性。值得注意的是,在使用GPT-4o-mini的AddSub数据集上,CLoT达到了99.0%的准确率,分别比传统思维链和思维链自洽性高出4.1%和2.9%。

English abstract

Multi-step Chain-of-Thought (CoT) has significantly advanced the mathematical reasoning capabilities of LLMs by leveraging explicit reasoning steps. However, the widespread adoption of Long CoT often results in sequence lengths that exceed manageable computational limits. While existing approaches attempt to alleviate this by reducing KV Cache redundancy via Markov chain-like structures, they introduce two critical limitations: inherent memorylessness (loss of context) and limited backward reasoning capability. To address these limitations, we propose a novel Chain-of-Thought framework based on Reversible Hierarchical Markov Chain, termed Cognitive Loop of Thought (CLoT), and a backward reasoning dataset CLoT-Instruct. In CLoT, problems are decomposed into sub-problems with hierarchical dependencies. Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer. Furthermore, we implement a pruning strategy: once higher-level sub-problems are verified, redundant lower-level sub-problems are pruned to maximize efficiency. This approach effectively mitigates error propagation and enhances reasoning robustness. Experiments on four mathematical benchmarks demonstrate the effectiveness of our method. Notably, on the AddSub dataset using GPT-4o-mini, CLoT achieves 99.0% accuracy, outperforming traditional CoT and CoT-SC by 4.1% and 2.9%, respectively.

2603.27518 2026-05-29 cs.CL Updated version

Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs

过度拒绝与表示子空间:对齐大语言模型中任务条件拒绝的机制分析

Utsav Maskey, Mark Dras, Usman Naseem

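Affiliations * Macquarie University

AI summary Analyzes the representational geometry of harmful refusal and over-refusal, finds that over-refusal directions are task-dependent and lie within benign task-representation clusters, explains why global direction ablation cannot resolve over-refusal, and shows that task-specific geometric interventions are needed.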

Comments Preprint

Details
AI Chinese abstract

经过训练以拒绝有害请求的对齐语言模型也会表现出过度拒绝:它们拒绝看似类似于有害指令的安全指令。一种自然的方法是消融全局拒绝方向,将隐藏状态向量远离或朝向有害拒绝示例,但这只是偶然地纠正了过度拒绝,同时破坏了更广泛的拒绝机制。在这项工作中,我们分析了两种拒绝类型的表示几何,以理解为什么会发生这种情况。我们表明,有害拒绝方向是任务无关的,可以通过单个全局向量捕获,而过度拒绝方向是任务相关的:它们位于良性任务表示簇内,在不同任务之间变化,并跨越更高维的子空间。线性探测表明,两种拒绝类型从早期Transformer层开始就在表示上不同。这些发现提供了机制上的解释,说明为什么仅靠全局方向消融无法解决过度拒绝,并确立了任务特定的几何干预是必要的。

English abstract

Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that superficially resemble harmful ones. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away from or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand why this happens. We show that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing suggests that the two refusal types are representationally distinct from the early transformer layers. These findings provide a mechanistic explanation of why global direction ablation alone cannot address over-refusal, and establish that task-specific geometric interventions are necessary.
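
Ablating a single global direction is commonly implemented as projecting that direction out of a hidden state, h' = h - (h · r̂) r̂. The sketch below shows only this generic projection on toy vectors; it assumes a precomputed direction `r` and is not the paper's exact procedure. The name `ablate_direction` is hypothetical.

```python
import math
from typing import List


def ablate_direction(h: List[float], r: List[float]) -> List[float]:
    """Remove from h its component along r (r need not be unit length)."""
    norm = math.sqrt(sum(x * x for x in r))
    unit = [x / norm for x in r]
    coeff = sum(a * b for a, b in zip(h, unit))  # scalar projection of h onto r
    return [a - coeff * u for a, u in zip(h, unit)]


hidden = [3.0, 4.0, 0.0]      # toy hidden state
refusal_dir = [0.0, 2.0, 0.0]  # toy "global refusal direction"
ablated = ablate_direction(hidden, refusal_dir)
print(ablated)
```

Because one vector removes one dimension only, a projection like this cannot cover directions that vary across tasks and span a higher-dimensional subspace, which is consistent with the abstract's argument.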

2603.23971 2026-05-29 cs.CL cs.AI cs.GT cs.LG cs.MA Updated version

The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More

价格反转现象:当更便宜的推理模型成本更高时

Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, James Zou

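Affiliations * Stanford University; UC Berkeley; CMU; Microsoft Research

AI summary Presents the first systematic study of the gap between reasoning models' listed prices and actual costs, finds price reversal in 32% of model-pair comparisons, and builds a Shapley-value cost attribution framework that identifies highly heterogeneous thinking-token consumption and interaction-turn counts as the main causes.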

Details
AI Chinese abstract

开发者和消费者越来越根据列出的API价格选择推理模型(RMs)。然而,这些价格在多大程度上准确反映了实际推理成本?我们首次系统研究这一问题,评估了8个前沿RM在12个不同任务上的表现,涵盖竞赛数学、科学问答、代码生成和多领域智能体。我们发现了定价反转现象:在32%的模型对比较中,标价较低的模型实际上产生了更高的总成本,反转幅度高达28倍。例如,Gemini 3 Flash的标价比GPT-5.4便宜80%,但其在所有任务上的实际成本却高出38%。我们基于Shapley值构建了一个正式的成本归因框架,并利用它追溯了思考令牌消耗和交互轮次数量巨大异质性的主要贡献因素:对于同一查询,一个模型可能比另一个模型多使用900%的思考令牌,或多出10倍的环境交互轮次。我们进一步表明,每次查询的成本预测本质上是困难的:同一查询的重复运行产生的思考令牌变化高达9.7倍,为任何预测器建立了不可约的噪声底限。因此,我们提出成本分布预测作为一个开放挑战。我们的发现表明,列出的API定价是实际成本的不可靠代理,呼吁进行成本感知的模型选择和透明的每次请求成本监控。

English abstract

Developers and consumers increasingly choose reasoning models (RMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RMs across 12 diverse tasks covering competition math, science QA, code generation, and multi-domain agents. We uncover the pricing reversal phenomenon: in 32% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 80% cheaper than GPT-5.4's, yet its actual cost across all tasks is 38% higher. We build a formal cost attribution framework based on Shapley value, and leverage it to trace the dominating contributors to vast heterogeneity in thinking token consumption and number of interaction turns: on the same query, one model may use 900% more thinking tokens than another, or 10x more turns of environment interactions. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Thus, we propose cost distribution prediction as an open challenge. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
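
A small arithmetic sketch shows how a reversal can arise when thinking tokens are billed like output tokens. All prices and token counts below are hypothetical and chosen only for illustration; they are not figures from the paper, and the single-price billing model in `request_cost` is a simplifying assumption.

```python
def request_cost(price_per_mtok: float, answer_tokens: int, thinking_tokens: int) -> float:
    """Cost in dollars when answer and thinking tokens share one per-million-token price."""
    return price_per_mtok * (answer_tokens + thinking_tokens) / 1_000_000


# Model A lists a price 80% below model B but thinks far longer on the same query.
cost_a = request_cost(price_per_mtok=2.0, answer_tokens=500, thinking_tokens=9_500)
cost_b = request_cost(price_per_mtok=10.0, answer_tokens=500, thinking_tokens=1_000)
reversal = cost_a > cost_b
print(cost_a, cost_b, reversal)
```

Here the nominally cheaper model ends up costing more per request because its total token count is more than five times larger, which outweighs the fivefold price advantage.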

2603.18859 2026-05-29 cs.AI cs.CL cs.LG Updated version

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

RewardFlow: 面向大语言模型智能体强化学习的拓扑感知状态图奖励传播

Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng

Affiliations * TMLR Group; Hong Kong Baptist University; TCL Corporate Research (HK) Co Ltd; Cooperative Medianet Innovation Center, Shanghai Jiao Tong University; Department of Mathematics, Hong Kong Baptist University

AI summary Proposes RewardFlow, which builds state graphs and performs topology-aware reward propagation to provide annotation-free dense rewards for agentic reasoning, substantially improving reinforcement learning performance.

Details
AI Chinese abstract

强化学习在增强大语言模型智能体推理方面展现出潜力,但稀疏的终端奖励阻碍了细粒度优化。过程奖励建模提供了一种替代方案,但带来了高计算成本、奖励黑客风险和标注瓶颈。我们引入RewardFlow,一种用于估计智能体推理中状态级奖励的轻量级方法。通过构建捕获轨迹内在拓扑结构的状态图,RewardFlow执行拓扑感知的传播以估计每个状态对成功的贡献,从而产生有原则的、无标注的密集奖励。用于强化学习优化时,RewardFlow在四个智能体基准测试中显著优于先前基线:在基于文本的任务上平均成功率提高6.2%,在视觉推理上跨三个模型尺度比最强基线提高29.7%,在DeepResearch上准确率提高10%,同时具有卓越的鲁棒性和训练效率。RewardFlow的实现已在https://github.com/tmlr-group/RewardFlow公开。

English abstract

Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization. Process reward modeling offers an alternative but incurs high computational costs, reward hacking risks, and annotation bottlenecks. We introduce RewardFlow, a lightweight method for estimating state-level rewards in agentic reasoning. By constructing state graphs that capture the intrinsic topological structure of trajectories, RewardFlow performs topology-aware propagation to estimate each state's contribution to success, yielding principled, annotation-free dense rewards. Used for RL optimization, RewardFlow substantially outperforms prior baselines across four agentic benchmarks: +6.2% average success rate on text-based tasks, +29.7% on visual reasoning over the strongest baseline across three model scales, and +10% accuracy on DeepResearch, with superior robustness and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
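
The idea of turning a sparse terminal reward into state-level rewards on a state graph can be sketched as follows. The abstract does not give RewardFlow's propagation rule, so this example substitutes a simple stand-in: states are merged across trajectories into one graph, and each state receives `gamma ** d`, where `d` is its shortest hop distance to a successful terminal state. The function name, the discount `gamma`, and the toy trajectories are all hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def propagate_rewards(trajectories: List[List[str]], success_states: Iterable[str],
                      gamma: float = 0.9) -> Dict[str, float]:
    """Backward breadth-first propagation of reward from successful states."""
    predecessors: Dict[str, Set[str]] = {}
    for traj in trajectories:
        for src, dst in zip(traj, traj[1:]):
            predecessors.setdefault(dst, set()).add(src)

    rewards = {s: 1.0 for s in success_states}
    queue = deque(rewards)
    while queue:
        state = queue.popleft()
        for prev in predecessors.get(state, ()):
            if prev not in rewards:  # first visit is the shortest path
                rewards[prev] = gamma * rewards[state]
                queue.append(prev)

    for traj in trajectories:  # states that never reach success get zero
        for state in traj:
            rewards.setdefault(state, 0.0)
    return rewards


rewards = propagate_rewards([["s0", "s1", "goal"], ["s0", "s2", "s3"]], ["goal"])
print(rewards)
```

The shared state `s0` receives credit even in the failed trajectory, because the merged graph shows it can lead to success; this is the kind of signal a per-trajectory terminal reward cannot provide.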

2602.22045 2026-05-29 cs.CL Updated version

DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain

DLT-Corpus:面向分布式账本技术领域的大规模文本集合

Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu

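Affiliations * Centre for Blockchain Technologies, University College London; School of Informatics, University of Edinburgh; Exponential Science Foundation

AI summary Builds DLT-Corpus, a large-scale domain corpus of 2.98 billion tokens covering scientific literature, patents, and social media, uses it to analyze technology emergence patterns and market-innovation correlations, and releases resources including the domain-pretrained model LedgerBERT and a sentiment analysis dataset.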

Comments Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)

Details
AI Chinese abstract

我们介绍了DLT-Corpus,这是迄今为止面向分布式账本技术(DLT)研究的最大的领域特定文本集合:来自2212万篇文档的29.8亿词元,涵盖科学文献(37,440篇出版物)、美国专利商标局(USPTO)专利(49,023件)和社交媒体(2200万条帖子)。现有的DLT自然语言处理(NLP)资源主要集中在加密货币价格预测和智能合约上,尽管该行业市值约3万亿美元且技术快速演进,但领域特定语言仍未被充分探索。 我们通过分析技术涌现模式和市场-创新相关性展示了DLT-Corpus的实用性。研究发现,技术首先出现在我们的科学文献子集中,然后才出现在专利和社交媒体中,遵循传统的技术转移模式。尽管即使在加密货币寒冬期间社交媒体情绪仍然极度看涨,但科学和专利活动与短期情绪的相关性减弱,而是跟踪整体市场扩张,形成良性循环:研究先于并推动经济增长,而经济增长又为进一步的创新提供资金。 我们发布了DLT-Corpus及配套资源:LedgerBERT(在DLT特定命名实体识别(NER)任务上比BERT-base提升23%)、包含23,301条加密货币新闻标题和描述的情感分析数据集、工具和代码。

English abstract

We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate DLT-Corpus' utility by analyzing patterns of technology emergence and market-innovation correlations. Findings reveal that technologies first appear in our scientific literature subset before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grows less tied to short-term sentiment, tracking overall market expansion in a virtuous cycle in which research precedes and enables economic growth that, in turn, funds further innovation. We release the DLT-Corpus and companion artifacts: LedgerBERT (+23% over BERT-base on DLT-specific Named Entity Recognition (NER) task), a sentiment analysis dataset of 23,301 crypto news headlines and descriptions, tools, and code.

2602.12642 2026-05-29 cs.CL cs.AI Updated version

Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

超越归一化:重新审视配分函数作为RLVR的难度调度器

Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung

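Affiliations * Seoul National University

AI summary Proposes the PACED-RL framework, which reinterprets the partition function as a per-prompt expected-reward signal and uses it to guide question selection and replay during training, improving sample efficiency while preserving generation diversity.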

Details
AI Chinese abstract

奖励最大化的RL方法已被证明能够增强LLM的推理性能,但往往导致生成多样性降低。近期工作通过采用GFlowNets来解决这一问题,训练LLM匹配目标分布的同时联合学习其配分函数。与先前将配分函数仅视为归一化器的工作不同,我们将其重新解释为每提示期望奖励(即在线准确率)信号,利用这一未使用的信息来提高样本效率。具体而言,我们首先建立了配分函数与每提示准确率估计之间的理论关系。基于这一关键见解,我们提出了配分函数引导的强化学习(PACED-RL),这是一个后训练框架,利用准确率估计在训练过程中优先考虑信息量大的问题提示,并通过准确率估计误差优先的重放进一步提高样本效率。关键的是,这两个组件都重用了GFlowNet训练中已经产生的信息,有效地将计算开销摊销到现有优化过程中。跨多种基准的大量实验表明,与GRPO和先前的GFlowNet方法相比,性能有显著提升,突显了PACED-RL作为LLM更高效样本的分布匹配训练的有前途方向。

English abstract

Reward-maximizing RL methods have been shown to enhance the reasoning performance of LLMs, but often lead to reduced generation diversity. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through replay prioritized by accuracy-estimate error. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for more sample-efficient distribution-matching training of LLMs.

2602.08567 2026-05-29 cs.MA cs.CL Updated version

ValueFlow: Measuring the Propagation of Value Perturbations in Multi-Agent LLM Systems

ValueFlow: 多智能体大语言模型中价值扰动的传播度量

Jinnuo Liu, Chuke Liu, Hua Shen

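Affiliations * Center for Data Science, NYU Shanghai

AI summary Proposes the ValueFlow framework, which uses a 56-value dataset and an LLM-as-a-judge protocol to decompose value drift into agent-level response behavior and system-level structural effects, showing that value alignment is a system-level property.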

Comments Preprint. Under review. 28 pages, 10 figures

Details
AI Chinese abstract

多智能体大语言模型系统日益由观察并响应彼此输出的智能体组成。虽然价值对齐通常针对孤立模型进行评估,但价值扰动如何通过智能体交互传播仍知之甚少。我们提出ValueFlow,一个基于扰动的框架,通过源自施瓦茨价值调查的56维价值数据集,并使用LLM-as-a-judge协议对智能体价值取向进行评分,来度量多智能体系统中的价值漂移。ValueFlow将价值漂移分解为智能体级响应行为和系统级结构效应,由两个指标捕获:β-敏感性(智能体对受扰同伴价值信号的敏感度)和系统敏感性(节点级扰动对最终系统输出的影响)。实验跨越价值维度、骨干模型、角色和拓扑,表明敏感性在不同价值间差异显著,并受交互结构强烈影响,表明多智能体系统中的价值对齐是系统级属性,而不仅仅是智能体级属性。因此,ValueFlow为审计和缓解部署的多智能体系统中的价值传播提供了原则性基础。

English abstract

Multi-agent large language model (LLM) systems increasingly consist of agents that observe and respond to one another's outputs. While value alignment is typically evaluated for isolated models, how value perturbations propagate through agent interactions remains poorly understood. We present ValueFlow, a perturbation-based framework that measures value drift in multi-agent systems via a 56-value valuation dataset derived from the Schwartz Value Survey, with agent value orientations scored using an LLM-as-a-judge protocol. ValueFlow decomposes value drift into agent-level response behavior and system-level structural effects, captured by two metrics: β-susceptibility, an agent's sensitivity to perturbed peer value signals, and system susceptibility (SS), the effect of node-level perturbations on final system outputs. Experiments span value dimensions, backbones, personas, and topologies, showing that susceptibility varies sharply across values and is strongly shaped by interaction structure, indicating that value alignment in multi-agent systems is a system-level property, not just an agent-level one. ValueFlow thus provides a principled basis for auditing and mitigating value propagation in deployed multi-agent systems.

2602.05370 2026-05-29 cs.CL Updated version

Mining or Synthesis? Rethinking Exploration Efficiency in Iterative Alignment of Mathematical Reasoning

挖掘还是合成?重新思考数学推理迭代对齐中的探索效率

Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Hejin Wang, Jiansheng Wei, Xiaojun Meng, Min Zhang

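Affiliations * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China; Huawei Large Model Data Technology Lab; Huawei Multimodal Model Lab; Department of Statistics and Data Science, Tsinghua University, Beijing, China

AI summary Observing that high-N sampling in iterative DPO alignment for mathematical reasoning yields diminishing returns and introduces noise, proposes the PACE framework, which synthesizes preference pairs through low-budget exploration and correction and matches or exceeds high-N baselines with about 1/5 of the compute.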

Details
AI Chinese abstract

迭代直接偏好优化(DPO)已成为在推理任务中对齐大语言模型的广泛使用的范式。现有方法通常依赖Best-of-N采样($N\geq8$)从分布尾部挖掘正轨迹。在这项工作中,我们表明在数学推理中,增加$N$会导致收益递减,同时增加验证器引起的假阳性风险和策略更新所需的数据分布偏移。为了解决这个问题,我们引入了PACE(通过纠错探索的近端对齐),一种基于生成的纠错框架,用低预算探索($2\leq N\leq3$)取代穷举挖掘。PACE不是搜索越来越稀有的正样本,而是通过纠错后见优化和验证引导过滤,从失败的探索中合成高保真偏好对。实验上,PACE匹配或超过了DPO-R1($N=16$)的性能,同时使用约$1/5$的计算量,并且在20%标签损坏下保持鲁棒,而高$N$基线表现出明显更高的噪声利用。

English abstract

Iterative Direct Preference Optimization (DPO) has emerged as a widely used paradigm for aligning Large Language Models on reasoning tasks. Existing approaches typically rely on Best-of-N sampling ($N\geq8$) to mine positive trajectories from the distribution tail. In this work, we show that in mathematical reasoning, increasing $N$ yields diminishing returns while increasing verifier-induced false-positive risk and the distribution shift required for policy updates. To address this, we introduce PACE (Proximal Alignment via Corrective Exploration), a generation-based corrective framework that replaces exhaustive mining with low-budget exploration ($2\leq N\leq3$). Rather than searching for increasingly rare positive samples, PACE synthesizes high-fidelity preference pairs from failed explorations through corrective hindsight refinement and verification-guided filtering. Empirically, PACE matches or exceeds the performance of DPO-R1 ($N=16$) while using about $1/5$ of the compute, and remains robust under 20% label corruption, where high-$N$ baselines exhibit substantially higher noise exploitation.

2602.04729 2026-05-29 cs.CL Updated version

"Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs

“Be My Cheese?”:多语言大语言模型中机器翻译的文化细微差别基准

Madison Van Doren, Casey Ford, Jennifer Barajas, Riley VanMeter, Cory Holland

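Affiliations * Appen; The University of Chicago; The University of California, Berkeley

AI summary Uses a large-scale human evaluation benchmark to study how well multilingual LLMs handle cultural nuance in machine translation (idioms, puns, holidays, and cultural concepts), finding a marked gap between grammatical accuracy and cultural resonance.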

Comments ACL 2026: Natural Language Generation, Evaluation, and Metrics (GEM) Workshop

Details
AI Chinese abstract

我们提出了一个大规模人工评估基准,用于评估最先进的多语言大语言模型(LLMs)在机器翻译中的文化本地化能力。现有的机器翻译基准强调词元和语法准确性,但往往忽略了实际本地化所需的语用和文化能力。基于一项涵盖20种语言87个翻译的试点研究,我们评估了7个多语言LLMs在15个目标语言上的表现,每种语言有5名母语评分员。每位评分员对全文翻译和包含文化细微语言(习语、双关语、节日和文化嵌入概念)的片段级别实例,按0-3的序数质量等级评分;片段评分还包括一个“不适用”选项,用于未翻译的片段。在全文评估中,平均整体质量适中(1.68/3):GPT-5(2.10/3)、Claude Sonnet 4(1.97/3)和Mistral Medium 3.1(1.84/3)构成最强梯队,灾难性失败较少。片段级别结果显示明显的类别效应:节日(2.20/3)和文化概念(2.19/3)的翻译明显优于习语(1.65/3)和双关语(1.45/3),且习语最可能未被翻译。使用Krippendorff's α和Gwet's AC2评估评分者间信度,显示总体一致性中等(Krippendorff's α = 0.45),其中双关语的一致性最低。这些发现表明语法充分性与文化共鸣之间存在持续差距。据我们所知,这是第一个明确关注翻译和本地化中文化细微差别的多语言、人工标注基准。结果凸显了对文化信息训练数据、改进跨语言语用学以及支持系统性文化翻译基准评估框架的需求。

English abstract

We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level and grammatical accuracy, but often overlook the pragmatic and culturally grounded competencies required for real-world localisation. Building on a pilot study of 87 translations across 20 languages, we evaluate 7 multilingual LLMs across 15 target languages with 5 native-speaker raters per language. Each rater scored both full-text translations and segment-level instances of culturally nuanced language (idioms, puns, holidays, and culturally embedded concepts) on an ordinal 0-3 quality scale; segment ratings additionally included an NA option for untranslated segments. Across full-text evaluations, mean overall quality is modest (1.68/3): GPT-5 (2.10/3), Claude Sonnet 4 (1.97/3), and Mistral Medium 3.1 (1.84/3) form the strongest tier with fewer catastrophic failures. Segment-level results show sharp category effects: holidays (2.20/3) and cultural concepts (2.19/3) translate notably better than idioms (1.65/3) and puns (1.45/3), and idioms are most likely to be left untranslated. Inter-rater reliability was assessed using Krippendorff's α and Gwet's AC2, indicating moderate agreement overall (Krippendorff's α = 0.45) with the lowest agreement for puns. These findings demonstrate a persistent gap between grammatical adequacy and cultural resonance. To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation. The results highlight the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation frameworks that support systematic benchmarking of culturally grounded translation.

2602.01058 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

好的SFT优化SFT,更好的SFT为强化学习做准备

Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) New York University (Shanghai)(纽约大学(上海))

AI总结 针对当前SFT-RL流程中离线SFT数据分布与在线RL策略分布不匹配的问题,提出基于策略评估的离线学习损失重加权方法PEAR,通过重要性采样重加权SFT损失,提升后续RL训练效果。

详情
AI中文摘要

推理大语言模型的后训练是一个整体过程,通常包括离线SFT阶段和后续的在线强化学习(RL)阶段。然而,SFT通常被孤立地优化,仅追求最大化SFT性能。我们表明,在相同的RL训练后,从更强的SFT检查点初始化的模型可能显著劣于从较弱检查点初始化的模型。我们将此归因于当前SFT-RL流程中典型的错配:生成离线SFT数据的分布可能与在线RL期间优化的策略(该策略从其自身的rollout中学习)存在显著差异。我们提出PEAR(基于策略评估的离线学习损失重加权算法),这是一种在SFT阶段纠正此错配并让模型更好地为RL做准备的方法。PEAR使用重要性采样来重加权SFT损失,具有三种变体,分别在token、块和序列级别操作。它可以用于增强标准SFT目标,并且一旦收集到离线数据的概率,仅需很少的额外训练开销。我们在可验证推理游戏和数学推理任务上对Qwen 2.5和3以及DeepSeek蒸馏模型进行了控制实验。PEAR在标准SFT基础上持续提升了RL后性能,在AIME2025上pass@8增益高达14.6%。我们的结果表明,通过设计和评估SFT时考虑下游RL而非孤立进行,PEAR是迈向更全面的大语言模型后训练的有效一步。

英文摘要

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks with Qwen 2.5, Qwen 3, and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6% on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
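
The reweighting idea can be illustrated with a minimal sketch of a token-level importance-weighted SFT loss. The abstract does not give PEAR's exact weighting, so the clipped probability ratio, the function name `reweighted_sft_loss`, and the `clip` parameter below are illustrative assumptions rather than the authors' formulation.

```python
import numpy as np

def reweighted_sft_loss(logp_policy, logp_behavior, clip=5.0):
    """Token-level importance-weighted negative log-likelihood (illustrative)."""
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_behavior = np.asarray(logp_behavior, dtype=float)
    # Assumed weight: ratio pi_policy(token) / pi_behavior(token), clipped for stability.
    weights = np.minimum(np.exp(logp_policy - logp_behavior), clip)
    # In a real training loop the weights would be treated as constants (no gradient).
    return float(np.mean(-weights * logp_policy))

# When both distributions agree, every weight is 1 and the loss is the plain NLL.
logp = np.log([0.5, 0.25])
loss_equal = reweighted_sft_loss(logp, logp)

# A token the current policy likes twice as much as the data source gets weight 2.
loss_up = reweighted_sft_loss(np.log([0.5]), np.log([0.25]))
```

Block- and sequence-level variants would, under the same assumption, aggregate the log-ratios over a span of tokens before exponentiating.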

2601.21725 2026-05-29 cs.CL cs.LG 版本更新

Procedural Pretraining: Warming Up Language Models with Abstract Data

程序化预训练:用抽象数据预热语言模型

Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, Damien Teney

发表机构 * EPFL(洛桑联邦理工学院) Idiap Research Institute(Idiap研究所) AIML, Adelaide University(澳大利亚机器学习研究所,阿德莱德大学)

AI总结 提出程序化预训练方法,通过在抽象结构化数据(如形式语言生成的程序数据)上预训练语言模型,显著提升其推理能力并加速后续语义知识学习,实验表明仅需0.1-0.3%的程序数据即可超越标准预训练。

Comments ICML 2026. Project page: https://zlshinnick.github.io/procedural-pretraining-page/

详情
AI中文摘要

直接在网络规模语料库上预训练语言模型是当前的主流范式。我们研究了一种替代方案:首先让模型接触抽象结构化数据,以简化后续丰富语义知识的获取,类似于人类在学习高级推理之前先学习简单逻辑和数学。我们关注由形式语言和其他简单算法生成的程序数据作为此类抽象数据。首先,我们诊断了不同形式的程序数据能够提升的算法技能,通常效果显著。例如,当模型在Dyck序列(平衡括号)上预训练时,上下文召回(大海捞针)的准确率从10%跃升至98%。其次,我们研究了这些增益如何反映在更大模型(高达1.3B参数)的预训练中。我们发现,仅在前端加入0.1%至0.3%的程序数据,就能显著优于在自然语言、代码和非正式数学(C4、CodeParrot和DeepMind-Math数据集)上的标准预训练。值得注意的是,这也使得模型仅需原始数据的55/67/86%即可达到相同的损失值,从而相应地减少FLOPs。第三,我们探索了这些收益背后的机制,发现程序化预训练在注意力层和MLP层中都注入了非平凡的结构。前者对于结构化领域(如代码)尤为重要,后者对于语言领域重要。最后,我们为组合多种形式的程序数据铺平了道路。我们的结果表明,程序化预训练是一种简单、轻量级的方法,能够提升性能并加速语言模型预训练,最终揭示了在LLM中将知识获取与推理分离的前景。

英文摘要

Pretraining language models directly on web-scale corpora is the de facto paradigm. We study an alternative where the model is initially exposed to abstract structured data to ease the subsequent acquisition of rich semantic knowledge, much like humans learning simple logic and mathematics before higher reasoning. We focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, the accuracy of context recall (Needle-in-a-haystack) jumps from 10 to 98% when a model is pretrained on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1 to 0.3% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this also enables the models to reach the same loss value with only 55/67/86% of the original data and thus a comparable reduction in FLOPs. Third, we explore the mechanisms behind the benefits and find that procedural pretraining instills non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
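
For readers unfamiliar with Dyck sequences, the sketch below generates random balanced-bracket strings and verifies them. The generator (its name, the two-bracket alphabet, and the coin-flip sampling scheme) is an assumption for illustration; the abstract does not specify how the paper's procedural data are sampled.

```python
import random

PAIRS = {"(": ")", "[": "]"}

def sample_dyck(n_pairs, rng):
    """Return a random balanced string containing n_pairs bracket pairs."""
    out, stack, opened = [], [], 0
    while opened < n_pairs or stack:
        # Open a bracket if any remain and either nothing is open or a coin flip says so.
        if opened < n_pairs and (not stack or rng.random() < 0.5):
            bracket = rng.choice(list(PAIRS))
            stack.append(bracket)
            out.append(bracket)
            opened += 1
        else:
            # Otherwise close the most recently opened bracket.
            out.append(PAIRS[stack.pop()])
    return "".join(out)

def is_balanced(s):
    """Check that every closing bracket matches the latest unclosed opener."""
    stack = []
    for ch in s:
        if ch in PAIRS:
            stack.append(ch)
        elif not stack or PAIRS[stack.pop()] != ch:
            return False
    return not stack

seq = sample_dyck(8, random.Random(0))
```

Predicting the correct closing bracket requires tracking the stack of open brackets, which is one plausible reason such data exercise recall-like skills.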

2601.20255 2026-05-29 cs.LG cs.CL cs.SE 版本更新

HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench

HE-SNR:通过熵揭示潜在逻辑以指导SWE-bench上的中期训练

Yueyang Wang, Jiawei Fu, Baolong Bi, Xili Wang, Xiaoqing Liu

发表机构 * School of Mathematical Sciences, Peking University, Beijing, China(北京大学数学科学学院) Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China(中国科学院计算技术研究所)

AI总结 针对SWE-bench基准,提出基于熵压缩假说的HE-SNR指标,通过细粒度熵分析指导中期训练数据筛选,在高达560B参数模型上验证有效性。

Comments Accepted at ICML 2026. 21 pages, 15 figures

详情
AI中文摘要

SWE-bench已成为评估大型语言模型在复杂软件工程任务中能力的主要基准。虽然这些能力主要在中期训练阶段获得,并在监督微调(SFT)期间被激发,但目前仍然缺乏能够有效指导中期训练的指标。诸如困惑度(PPL)等标准指标受到“长上下文税”的影响,且与下游SWE性能的相关性较弱。在本文中,我们首先引入严格的数据过滤策略来弥补这一差距。关键地,我们提出了熵压缩假说,将智能重新定义为不是通过标量Top-1压缩,而是通过将不确定性结构化为低阶的熵压缩状态(“合理犹豫”)的能力。基于这种细粒度熵分析,我们制定了一个新的指标,HE-SNR(高熵信噪比)。我们在不同上下文窗口(32K/128K)下对高达560B参数的模型验证了我们的方法。这项工作为优化LLM在复杂工程领域的潜在能力提供了理论基础和实用工具。

英文摘要

SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supervised Fine-Tuning (SFT), there remains a critical deficit in metrics capable of guiding mid-training effectively. Standard metrics such as Perplexity (PPL) are compromised by the "Long-Context Tax" and exhibit weak correlation with downstream SWE performance. In this paper, we bridge this gap by first introducing a rigorous data filtering strategy. Crucially, we propose the Entropy Compression Hypothesis, redefining intelligence not by scalar Top-1 compression, but by the capacity to structure uncertainty into Entropy-Compressed States of low orders ("reasonable hesitation"). Grounded in this fine-grained entropy analysis, we formulate a novel metric, HE-SNR (High-Entropy Signal-to-Noise Ratio). We validate our approach on models with up to 560B parameters across different context windows (32K/128K). This work provides both the theoretical foundation and practical tools for optimizing the latent potential of LLMs in complex engineering domains.

2601.10960 2026-05-29 cs.CL cs.AI 版本更新

Steering Language Models Before They Speak: Logit-Level Interventions

在语言模型发言前引导它们:Logit 级别的干预

Hyeseon An, Shinwoo Park, Hyundong Jin, Yo-Sub Han

发表机构 * Department of Computer Science, Yonsei University, Seoul, South Korea(延世大学计算机科学系,首尔,韩国)

AI总结 提出 SWAI 方法,通过基于语料库的 token 统计在 logit 空间直接引导语言模型,无需训练或访问内部激活,在可读性、礼貌性和毒性控制上优于提示和基线方法。

Comments preprint

详情
AI中文摘要

可控生成要求语言模型实现诸如阅读水平、礼貌性和毒性等输出特征。现有的引导方法通常间接、需要访问内部激活或依赖辅助训练模型。我们提出 SWAI,一种无需训练、推理时的方法,通过使用基于语料库的 token 统计直接在 logit 空间引导,解决了这些限制。SWAI 从标记语料库计算 z 归一化的一对多 log-odds 分数,并仅在模型的前 K 个候选集内偏向高分数 token,从而在保留上下文合理选择的同时允许控制偏向目标特征 token。在可读性、礼貌性和毒性控制方面,SWAI 在不修改模型参数、访问内部层或训练辅助模型的情况下,始终优于基于提示和先前的 logit 级别基线。选择性和查找表消融实验表明,增益来自目标特定的统计分数,而非通用 logit 扰动。这些结果表明,当 logit 干预在高概率候选下由目标特定统计引导时,有效的引导不需要学习控制器。

英文摘要

Controllable generation requires language models to realize output characteristics such as reading level, politeness, and toxicity. Existing steering methods are often indirect, require access to internal activations, or depend on auxiliary trained models. We propose SWAI, a training-free inference-time method that addresses these limitations by steering directly in logit space using corpus-derived token statistics. SWAI computes z-normalized one-vs-rest log-odds scores from labeled corpora and biases high-scoring tokens only within the model's top-K candidate set, allowing control to favor target-characteristic tokens while preserving contextually plausible choices. Across readability, politeness, and toxicity control, SWAI consistently improves over prompt-based and prior logit-level baselines without modifying model parameters, accessing internal layers, or training an auxiliary model. Selectivity and lookup-table ablations show that the gains come from target-specific statistical scores rather than generic logit perturbation. These results indicate that effective steering does not require learned controllers when the logit intervention is guided by target-specific statistics under high-probability candidates.
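
The two steps named in the abstract (z-normalized one-vs-rest log-odds scores, then biasing only within the top-K candidates) can be sketched as follows. This is a simplified reading, not the authors' implementation: the additive smoothing, the scaling factor `alpha`, and both function names are assumptions.

```python
import numpy as np

def log_odds_z(target_counts, rest_counts, smoothing=1.0):
    """z-normalized one-vs-rest log-odds per vocabulary token (illustrative)."""
    t = np.asarray(target_counts, dtype=float) + smoothing
    r = np.asarray(rest_counts, dtype=float) + smoothing
    p, q = t / t.sum(), r / r.sum()
    # Log-odds of each token in the target corpus minus its log-odds in the rest.
    log_odds = np.log(p / (1 - p)) - np.log(q / (1 - q))
    return (log_odds - log_odds.mean()) / log_odds.std()

def steer_logits(logits, scores, k=3, alpha=2.0):
    """Add alpha * score to the top-k logits only; all other logits are untouched."""
    logits = np.asarray(logits, dtype=float)
    steered = logits.copy()
    top_k = np.argsort(logits)[-k:]
    steered[top_k] += alpha * scores[top_k]
    return steered

# Token 0 is frequent in the target corpus, token 1 in the rest.
scores = log_odds_z([9, 1, 1, 1], [1, 9, 1, 1])
steered = steer_logits([2.0, 1.9, 1.0, -5.0], scores, k=2)
```

Restricting the bias to the top-K set means an implausible token (index 3 here) is never promoted, however strongly the corpus statistics favor it.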

2601.08654 2026-05-29 cs.CL cs.AI cs.LG 版本更新

From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges

从评分标准到可靠分数:基于证据的文本评估与LLM裁判

Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong

发表机构 * Washington University in St. Louis(圣路易斯华盛顿大学) Arizona State University(亚利桑那州立大学) Florida State University(佛罗里达州立大学)

AI总结 提出Rulers框架,通过三阶段推理(任务规范、结构化执行、事后校准)解决LLM在基于评分标准的文本评估中的执行漂移、归因不可验证和人类尺度错位问题,实现更可靠的评分。

详情
AI中文摘要

基于评分标准的文本评估越来越多地使用大型语言模型(LLM)作为可扩展的裁判,但将冻结的黑盒模型与人类评分标准对齐仍然具有挑战性。我们将这一挑战表述为一个标准迁移问题:目标不仅仅是提示LLM分配分数,而是将人类评分标准意图转移到一个稳定、可审计且与人类对齐的评分协议中。我们识别了基于LLM的评分标准评估中三种反复出现的失败模式:评分标准执行漂移、不可验证的分数归因和人类尺度错位。为了解决这些失败模式,我们引入了Rulers,一个三阶段推理时框架,用于可靠、基于证据的评分标准文本评估。Rulers首先将人类评分标准转换为锁定的任务级规范,然后通过结构化检查表决策、类型化证据基础以及在适用时进行可提取引用验证来执行该规范,最后应用事后校准以将模型衍生的信号与人类分数边界对齐。在涵盖论文评分、摘要评估、EFL写作评估和结构化输入文本生成的四个基于评分标准的基准测试中,Rulers在多个冻结骨干模型的大多数评估设置中实现了更强的人类分数一致性。进一步分析表明,Rulers更好地匹配了经验人类分数分布,提高了在语义等价评分标准扰动下的稳定性,并受益于其三个组成部分。这些结果表明,可靠的LLM评判需要固定标准、可追溯证据和校准的分数解释,而不仅仅是提示措辞。我们的代码可在 https://anonymous.4open.science/r/Rulers_0525-3328 获取。

英文摘要

Rubric-based text evaluation increasingly uses large language models (LLMs) as scalable judges, but aligning frozen black-box models with human scoring standards remains challenging. We formulate this challenge as a criteria-transfer problem: the goal is not merely to prompt an LLM to assign a score, but to transfer human rubric intent into a stable, auditable, and human-aligned scoring protocol. We identify three recurring failure modes in LLM-based rubric scoring: rubric execution drift, unverifiable score attribution, and human-scale misalignment. To address these failure modes, we introduce Rulers, a three-stage inference-time framework for reliable, evidence-grounded rubric-based text evaluation. Rulers first converts a human rubric into a locked task-level specification, then executes the specification with structured checklist decisions, typed evidence grounding, and extractive quote verification when applicable, and finally applies post-hoc calibration to align model-derived signals with human score boundaries. Across four rubric-governed benchmarks covering essay scoring, summarization assessment, EFL writing evaluation, and structured-input text generation, Rulers achieves stronger human-score agreement in most evaluated settings across multiple frozen backbone models. Further analyses show that Rulers better matches empirical human score distributions, improves stability under semantically equivalent rubric perturbations, and benefits from each of its three components. These results suggest that reliable LLM judging requires fixed criteria, traceable evidence, and calibrated score interpretation rather than prompt phrasing alone. Our code is available at https://anonymous.4open.science/r/Rulers_0525-3328.

2601.08064 2026-05-29 cs.CL 版本更新

Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations

校准还不够:评估语言变化下的置信度估计

Yuxi Xia, Dennis Ulmer, Terra Blevins, Yihong Liu, Hinrich Schütze, Benjamin Roth

发表机构 * Faculty of Computer Science, UniVie Doctoral School Computer Science(计算机科学系,维也纳大学计算机科学博士学院) Faculty of Philological and Cultural Studies, University of Vienna, Austria(文学与文化研究系,维也纳大学,奥地利) ILLC, University of Amsterdam, Netherlands(阿姆斯特丹大学ILLC,荷兰) Khoury College of Computer Sciences, Northeastern University, USA(东北大学计算机科学学院,美国) LMU Munich, Munich Center for Machine Learning (MCML), Germany(慕尼黑大学,慕尼黑机器学习中心(MCML),德国)

AI总结 提出一个基于鲁棒性、稳定性和敏感性的新评估框架,揭示现有置信度估计方法在区分语义不同答案方面的不足。

详情
AI中文摘要

置信度估计(CE)指示大型语言模型答案的可靠性,影响用户信任和决策。现有评估主要关注置信度与正确性之间的一致性,但忽略了语言的可变性:置信度估计应在语义等价的提示或答案变体下保持一致,而在答案含义不同时发生变化,因为这可能表明正确性的变化。因此,我们引入了一个基于三个互补属性的新评估框架:对提示扰动的鲁棒性、跨语义等价答案的稳定性以及对语义不同答案的敏感性。我们表明这些指标与现有CE指标在很大程度上独立,并且常见的CE方法往往在这些指标上失败:虽然大多数方法实现了高鲁棒性和稳定性,但它们难以区分语义不同的答案,可能是因为它们没有有效利用生成侧信息。总体而言,我们的框架揭示了当前CE评估中被忽视的局限性,并为现实应用中选择置信度估计器提供了指导。

英文摘要

Confidence estimation (CE) indicates how reliable the answers of large language models are and impacts user trust and decision-making. Existing evaluations mainly concern the alignment between confidence and correctness, but ignore the variability of language: confidence estimates should remain consistent under semantically equivalent prompts or answer variations, while changing when answer meaning differs, as this may indicate a change in correctness. Therefore, we introduce a novel evaluation framework based on three complementary properties: robustness to prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers. We show that these metrics are largely independent from existing CE metrics, and that common CE methods often fail on them: while most methods achieve high robustness and stability, they struggle to distinguish semantically different answers, potentially because they do not effectively leverage generation-side information. Overall, our framework exposes overlooked limitations of current CE evaluations and provides guidance for selecting confidence estimators for real-world applications.

2601.04633 2026-05-29 cs.CL 版本更新

MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark

MAGA-Bench: 通过对齐检测基准的机器增强生成文本

Anyang Song, Ying Cheng, Yiqian Xu, Rui Feng

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与人工智能学院) School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院) Shanghai Key Laboratory of Intelligent Information Processing, Fudan University(复旦大学智能信息处理上海市重点实验室)

AI总结 提出MAGA基准,集成多种对齐方法(从提示构建到生成器-检测器对抗强化学习及推理过程)以增强机器生成文本的人类对齐性,提升检测器的泛化能力。

详情
AI中文摘要

机器生成文本(MGT)越来越难以与人类书写文本(HWT)区分。这一趋势加剧了虚假新闻和在线欺诈等恶意活动。微调检测器的泛化能力严重依赖数据集质量,仅仅扩大MGT来源可能越来越不足,需要进一步增强生成过程。基于HC-Var理论,增强MGT的人类对齐性不仅有助于现有检测器的鲁棒性测试,还能提升在此类对齐MGT数据集上微调的检测器的泛化能力。因此,我们提出了Machine-Augment-Generated Text via Alignment (MAGA) 检测基准。MAGA集成了多种对齐方法,从提示构建到Generator-Detector Adversarial Reinforcement Learning (GDARL) 以及推理过程。在我们的实验中,在MAGA上微调的RoBERTa检测器在泛化AUC上平均提升4.60%。相反,MAGA中的对齐MGT也导致所选检测器的AUC平均下降8.13%。我们希望MAGA基准能为未来MGT检测器泛化能力的研究提供有价值的见解。

英文摘要

Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This trend has exacerbated malicious activities such as fake news and online fraud. The generalization ability of fine-tuned detectors relies heavily on dataset quality, and simply expanding the sources of MGT may become increasingly insufficient. Further augmentation of the generation process is required. Based on HC-Var's theory, enhancing the human-like alignment of MGT not only facilitates robustness testing of existing detectors but also boosts the generalization ability of detectors fine-tuned on such aligned MGT datasets. Therefore, we propose the Machine-Augment-Generated Text via Alignment (MAGA) Detection Benchmark. MAGA integrates several alignment methods, ranging from prompt construction to Generator-Detector Adversarial Reinforcement Learning (GDARL) and the reasoning process. In our experiments, the RoBERTa detector fine-tuned on MAGA achieves an average improvement of 4.60% in generalization AUC. Conversely, the aligned MGTs in MAGA also lead to an average decrease of 8.13% in the AUC of selected detectors. We hope the MAGA Benchmark will provide valuable insights for future research on the generalization ability of MGT detectors.

2601.03134 2026-05-29 cs.CL 版本更新

The Anatomy of Conversational Scams: A Topic-Based Red Teaming Analysis of Multi-Turn Interactions in LLMs

对话式诈骗的剖析:基于主题的LLM多轮交互红队分析

Xiangzhe Yuan, Zhenhao Zhang, Haoming Tang, Siying Hu

发表机构 * Department of Computer Science, University of Iowa(爱荷华大学计算机科学系) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系)

AI总结 通过LLM间模拟框架研究多轮社交工程对话中的对抗动态,分析攻击与防御策略,发现跨模型和跨语言的结果差异及策略转换的结构性变化。

详情
AI中文摘要

随着LLM通过扩展对话获得说服能力,它们为研究扩展交互环境中的对抗性对话行为创造了新的机会,而传统的单轮安全评估无法捕捉这些行为。我们使用受控的LLM到LLM模拟框架,系统地研究了这些交互动态,用于跨语言社交工程场景的自动红队测试。评估了八种最先进的英中模型,分析了对话级别结果,标注了攻击者和防御者策略家族,并建模了它们之间的交互动态。结果表明,多轮对抗性对话遵循反复出现的升级模式,而防御性响应通常依赖于验证、延迟和通道控制。我们进一步发现结果分布在统计上显著的跨模型和跨语言差异,转换分析揭示了防御者策略在不同语言中应对攻击者战术的系统性结构变化。这些发现强调了研究多轮对抗性对话设置中交互结构的重要性,并展示了受控的LLM到LLM模拟如何支持对抗性对话动态的机制分析。

英文摘要

As LLMs gain persuasive capabilities through extended dialogues, they create new opportunities for studying adversarial conversational behavior in extended interaction settings that traditional single-turn safety evaluations fail to capture. We systematically study these interactional dynamics using a controlled LLM-to-LLM simulation framework for automated red-teaming across bilingual social engineering scenarios. Evaluating eight state-of-the-art models in English and Chinese, we analyze dialogue-level outcomes, annotate attacker and defender strategy families, and model interaction dynamics between them. Results show that multi-turn adversarial dialogues follow recurrent escalation patterns, while defensive responses frequently rely on verification, delay, and channel control. We further find statistically significant cross-model and cross-lingual differences in outcome distributions, and transition analysis reveals systematic structural variation in how defender strategies respond to attacker tactics across languages. These findings highlight the importance of studying interactional structure in multi-turn adversarial dialogue settings and demonstrate how controlled LLM-to-LLM simulations can support mechanistic analysis of adversarial conversational dynamics.

2601.01162 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

弥合分类数据聚类的语义鸿沟:基于大语言模型的方法

Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung

发表机构 * School of Computer Science and Technology(计算机科学与技术学院) Guangdong University of Technology(广东工业大学) Department of Computer Science(计算机科学系) Hong Kong Baptist University(香港浸会大学)

AI总结 提出BREVE框架,利用外部知识库的语义嵌入丰富分类属性值,并引入自适应权重平衡原始标识与语义信息,在八个基准数据集上平均ARI排名达1.3。

Comments Accepted to ICPR2027

详情
AI中文摘要

定性数据广泛存在于医疗、营销和生物信息学等领域,聚类是其中模式发现的基本工具。定性数据聚类的核心困难在于度量属性值之间的相似性,这些属性值没有固有的顺序或距离。为了恢复这种关系,现有研究通常依赖于数据集内的共现统计。然而,当样本量较小时,这种统计路径变得不可靠,每个值的语义上下文因此未被充分利用。受此限制,本文提出BREVE(通过外部值丰富实现平衡表示),一种聚类框架,通过从外部知识库中提取额外的语义维度来丰富每个定性值。即,每个唯一值被扩展为一个密集嵌入,编码其语义内容。为了防止原始值身份被添加的维度稀释,进一步附加一个轻量级的独热编码组件。然后,由聚类紧致性引导的自适应权重决定富集维度进入最终表示的强度。通过这种设计,在八个基准数据集上的实验表明,与七个代表性竞争者相比,平均ARI排名为1.3。

英文摘要

Qualitative data are widespread in domains such as healthcare, marketing, and bioinformatics, where clustering offers a fundamental tool for pattern discovery. A core difficulty of qualitative-data clustering lies in measuring similarity among attribute values that carry no inherent ordering or distance. To recover such relationships, existing studies typically rely on within-dataset co-occurrence statistics. This statistical route, however, becomes unreliable once the sample size is small, and the semantic context of each value is therefore left underexploited. Motivated by this limitation, this paper proposes BREVE (Balanced Representation via External Value Enrichment), a clustering framework that enriches each qualitative value with extra semantic dimensions drawn from an external knowledge base. That is, every unique value is expanded by a dense embedding that encodes its semantic content. To prevent the original value identity from being diluted by the added dimensions, a lightweight one-hot component is further appended. An adaptive weight, guided by cluster compactness, then determines how strongly the enrichment dimensions enter the final representation. With this design, experiments on eight benchmark datasets yield an average ARI rank of 1.3 against seven representative competitors.
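
The representation described above (a one-hot identity block concatenated with weighted semantic dimensions) can be sketched as follows. The toy embedding table, the function name `enrich`, and the fixed scalar `weight` are assumptions; in the paper the weight is adaptive and guided by cluster compactness, which this sketch does not implement.

```python
import numpy as np

def enrich(values, embeddings, weight):
    """Concatenate a one-hot identity block with weighted semantic embeddings."""
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    # One-hot block keeps each value's original identity.
    one_hot = np.eye(len(vocab))[[index[v] for v in values]]
    # Semantic block: hypothetical embeddings standing in for an external knowledge base.
    semantic = np.array([embeddings[v] for v in values], dtype=float)
    return np.hstack([one_hot, weight * semantic])

toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
X = enrich(["cat", "dog", "car", "cat"], toy_embeddings, weight=0.5)

# Semantically close values end up closer than unrelated ones.
d_cat_dog = np.linalg.norm(X[0] - X[1])
d_cat_car = np.linalg.norm(X[0] - X[2])
```

With `weight=0` the representation reduces to plain one-hot encoding, where all distinct values are equidistant; increasing the weight lets the semantic block separate them.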

2512.24562 2026-05-29 cs.CL 版本更新

HaluNet: Learning Hallucination Risk from Internal Signals in LLM Question Answering

HaluNet:从LLM问答内部信号学习幻觉风险

Chaodong Tong, Qi Zhang, Zhuojun Jiang, Lei Jiang, Yanbing Liu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) China Industrial Control Systems Cyber Emergency Response Team(中国工业控制系统网络应急响应团队)

AI总结 提出HaluNet,利用单次生成的内部信号(token似然、预测熵和隐藏状态)估计答案级幻觉风险,通过LLM-as-a-Judge弱监督训练,在多个QA数据集上显著提升风险排序和错误检测率。

Comments 16 pages, 12 tables, and 11 figures. This version includes a major revision of the manuscript and updates the author list with the consent of all involved authors

详情
AI中文摘要

大型语言模型(LLM)在问答(QA)任务中表现出色,但可能生成流畅但缺乏证据支持的答案。现有的幻觉检测器通常依赖外部验证、重复采样或测试时评判调用,这对于实时问答来说成本高昂。我们提出HaluNet,一种轻量级幻觉风险估计器,利用单次模型生成的内部信号。HaluNet联合建模token似然、预测熵和隐藏状态信息,从而允许概率、分布和语义证据共同构成答案级风险评分。它使用LLM-as-a-Judge标签作为可扩展的弱监督进行训练,并通过独立的人工和多评判评估进行验证。在SQuAD、TriviaQA和Natural Questions上的实验表明,HaluNet在领域内和跨领域设置中均改进了答案级风险排序。在300个样本的人工评估中,HaluNet达到了0.874的AUROC和0.869的AUPRC;其前20%高风险答案包含96.5%的错误,相比基础错误率实现了2.06倍的提升。

英文摘要

Large language models (LLMs) achieve strong question answering (QA) performance but can produce fluent answers unsupported by available evidence. Existing hallucination detectors often rely on external verification, repeated sampling, or test-time judge calls, which can be costly for real-time QA. We propose HaluNet, a lightweight hallucination risk estimator that uses internal signals from one model generation. HaluNet jointly models token likelihood, predictive entropy, and hidden-state information, allowing probabilistic, distributional, and semantic evidence to inform an answer-level risk score. It is trained with LLM-as-a-Judge labels as scalable weak supervision and evaluated with independent human and multi-judge assessments. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet improves answer-level risk ranking across in-domain and out-of-domain settings. On a 300-example human evaluation, HaluNet achieves 0.874 AUROC and 0.869 AUPRC; its top 20% highest-risk answers contain 96.5% errors, yielding a 2.06× lift over the base error rate.
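
The lift figure can be related to the base error rate with a short calculation. This assumes the usual definition of lift (error rate in the flagged bucket divided by the overall error rate), which the abstract does not spell out; the function name is hypothetical.

```python
def implied_base_rate(bucket_error_rate, lift):
    """Overall error rate implied by a bucket error rate and its lift."""
    return bucket_error_rate / lift

# Reported: 96.5% errors in the top 20% highest-risk answers, 2.06x lift.
base = implied_base_rate(0.965, 2.06)
```

Under that assumed definition, the reported numbers imply that a little under half of all answers in the evaluated set were errors.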

2512.17220 2026-05-29 cs.CL 版本更新

Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding

心智景观感知的检索增强生成以改进长上下文理解

Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) WeChat AI, Tencent(腾讯微信AI) Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出MiA-RAG框架,通过层次化摘要构建心智景观并统一检索与生成的条件,实现全局语义表示指导下的长上下文检索与推理。

详情
AI中文摘要

人类通过依赖内容的整体语义表示来理解长而复杂的文本。这种全局视图有助于组织先验知识、解释新信息以及整合分散在文档中的证据,正如心理学中人类的心智景观感知能力所揭示的那样。当前的检索增强生成(RAG)系统缺乏这种指导,因此在长上下文任务中表现不佳。在本文中,我们提出了心智景观感知RAG(MiA-RAG),这是第一个将心智景观感知检索和生成表述为基于LLM的RAG的统一条件范式的框架。MiA-RAG通过层次化摘要构建心智景观,并将检索和生成都条件于这种全局语义表示。这使得检索器能够形成丰富的查询嵌入,生成器能够在连贯的全局上下文中对检索到的证据进行推理。我们在多样化的长上下文和双语基准上评估了MiA-RAG,用于基于证据的理解和全局意义构建。它持续超越基线,进一步分析表明,它将局部细节与连贯的全局表示对齐,实现了更类人的长上下文检索和推理。

英文摘要

Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first framework to formulate mindscape-aware retrieval and generation as a unified conditioning paradigm for LLM-based RAG. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.

2510.24606 2026-05-29 cs.CL 版本更新

Long-Context Modeling with Dynamic Hierarchical Sparse Attention for Memory-Constrained LLM Inference

基于动态层次稀疏注意力的内存受限大语言模型长上下文建模

Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Google(谷歌)

AI总结 提出动态层次稀疏注意力(DHSA),通过在线预测注意力稀疏性并保持LLM骨干冻结,在内存受限下实现高效长上下文推理,精度接近密集注意力且速度提升显著。

Comments ICML26 (Spotlight)

详情
AI中文摘要

注意力的二次成本限制了长上下文大语言模型的可扩展性,尤其是在有限的硬件内存预算下。虽然注意力通常是稀疏的,但现有的静态稀疏方法无法适应任务或输入相关的变化,而最近的动态方法依赖于可能牺牲通用性的预定义模板或启发式规则。我们提出了动态层次稀疏注意力(DHSA),一种数据驱动的框架,在保持LLM骨干冻结的同时在线预测注意力稀疏性。DHSA通过估计块级重要性并将其传播到令牌级交互来执行层次路由,保留了因果重要依赖关系,同时实现了高效稀疏化。在Needle-in-a-Haystack测试、LongBench和RULER上,DHSA在高稀疏度下保持接近密集的精度,在相当预填充成本下,相对于块稀疏注意力实现了12-20%的相对精度提升。借助内存高效的瓦片后端,DHSA在128K上下文长度下实现了高达10倍的预填充加速。在LLaMA-3.1-8B(4位)上,DHSA在单个24GB GPU上扩展到100K上下文,而密集注意力无法做到。我们提供了互补的GPU和CPU后端,使DHSA能够在不同的硬件环境和多个开放权重模型系列上运行。这些结果表明,DHSA是内存受限长上下文LLM推理的一种高效且适应性强的解决方案。

英文摘要

The quadratic cost of attention limits the scalability of long-context LLMs, especially under limited hardware memory budgets. While attention is often sparse, existing static sparse methods cannot adapt to task- or input-dependent variations, and recent dynamic approaches rely on predefined templates or heuristics that may sacrifice generality. We propose Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that predicts attention sparsity online while keeping the LLM backbone frozen. DHSA performs hierarchical routing by estimating importance at the chunk level and propagating it to token-level interactions, preserving causally important dependencies while enabling efficient sparsification. Across the Needle-in-a-Haystack test, LongBench, and RULER, DHSA maintains near-dense accuracy in highly sparse regimes, achieving 12-20% relative accuracy gains over Block Sparse Attention at comparable prefill cost. With a memory-efficient tiled backend, DHSA delivers up to 10× prefill speedup at 128K context length. On LLaMA-3.1-8B (4-bit), DHSA scales to 100K context on a single 24GB GPU, where dense attention fails. We provide complementary GPU and CPU backends, enabling DHSA to run across diverse hardware environments and multiple open-weight model families. These results demonstrate DHSA as an efficient and adaptable solution for memory-constrained long-context LLM inference.

2510.20743 2026-05-29 cs.HC cs.AI cs.CL 版本更新

Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations

共情提示:多模态大语言模型对话中的非语言上下文整合

Lorenzo Stacchio, Andrea Ubaldi, Alessandro Galdelli, Maurizio Mauri, Emanuele Frontoni, Andrea Gaggioli

发表机构 * University of Macerata(马切拉塔大学)

AI总结 提出共情提示框架,通过集成面部表情识别服务将非语言情感线索隐式融入大语言模型对话,实现无需用户显式控制的流畅多模态交互。

详情
AI中文摘要

我们提出了共情提示,一种新颖的多模态人机交互框架,它通过隐式的非语言上下文丰富大语言模型(LLM)对话。该系统集成了商业面部表情识别服务以捕捉用户的情感线索,并将其作为上下文信号嵌入提示过程中。与传统多模态界面不同,共情提示不需要用户显式控制;相反,它通过情感信息无干扰地增强文本输入,以实现对话和流畅性对齐。该架构模块化且可扩展,允许集成额外的非语言模块。我们描述了通过本地部署的DeepSeek实例实现的系统设计,并报告了初步的服务和可用性评估(N=5)。结果表明,非语言输入能够一致地整合到连贯的LLM输出中,参与者强调了对话的流畅性。除了这一概念验证外,共情提示还指向了聊天机器人中介通信中的应用,特别是在医疗或教育等领域,这些领域中用户的情感信号至关重要,但在言语交流中往往难以察觉。

英文摘要

We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users' emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational and smoothness alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users' emotional signals are critical yet often opaque in verbal exchanges.

2510.14365 2026-05-29 cs.CL 版本更新

Understanding the Ability of LLMs to Handle Character-Level Perturbation

理解LLMs处理字符级扰动的能力

Anyuan Zhuo, Xuefei Ning, Ningyuan Li, Jingyi Zhu, Yu Wang, Pinyan Lu

发表机构 * Shanghai University of Finance and Economics(上海财经大学) Tsinghua University, China(清华大学,中国)

AI总结 本研究通过三种字符级扰动(单词内大量拼写错误、字符乱序、插入不可见字符)测试大型语言模型的鲁棒性,发现即使严重扰动下模型仍保持显著性能,并探索了其内在机制。

Comments Accepted by icml2026

详情
AI中文摘要

这项工作研究了当代大型语言模型(LLMs)对常见字符级扰动的鲁棒性。我们考察了三种字符级扰动,包括在单词中引入大量拼写错误、打乱每个单词中的字符顺序,以及在文本中插入大量不可见字符。令人惊讶的是,即使在严重扰动下,例如几乎逐字打乱所有单词以生成人类几乎无法阅读的文本,或插入比可见字符多几倍的不可见字符作为噪声,许多LLMs仍然保持显著的性能。我们探索了这种鲁棒性的潜在原因,发现LLMs对混乱的分词和碎片化的词元化表现出显著的韧性。此外,我们研究了LLMs去除扰动以正确理解文本的机制,包括隐式和显式的字符级扰动处理机制。我们希望我们对LLMs低级鲁棒性的发现将揭示其固有的架构优势,揭示其被滥用的潜在风险,并为LLMs在不同应用场景中的可靠部署提供信息。

英文摘要

This work investigates the resilience of contemporary large language models (LLMs) against frequent character-level perturbations. We examine three types of character-level perturbation: introducing numerous typos within words, shuffling the characters in each word, and inserting a large number of invisible characters into the text. Surprisingly, even under severe perturbation, such as shuffling nearly all words character-wise to produce text that is almost unreadable to humans, or inserting several times as many invisible characters as visible ones as noise, many LLMs still maintain notable performance. We explore the underlying causes of this robustness and find that LLMs exhibit remarkable resilience to chaotic segmentation and fragmented tokenization. Furthermore, we examine the mechanisms by which LLMs remove perturbations to correctly comprehend text, including both implicit and explicit mechanisms for handling character-level perturbation. We hope that our findings on the low-level robustness of LLMs will unveil their inherent architectural strengths, reveal the potential risks of their misuse, and inform the reliable deployment of LLMs across diverse application scenarios.
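
Two of the three perturbations are easy to reproduce. The sketch below shuffles the characters inside each word and inserts invisible characters; the choice of the zero-width space (U+200B) as the invisible character, the `per_char` rate, and the function names are assumptions, since the abstract does not say which characters or rates the paper uses.

```python
import random

ZERO_WIDTH_SPACE = "\u200b"  # assumed invisible character

def shuffle_words(text, rng):
    """Shuffle the characters inside each space-separated word."""
    words = []
    for word in text.split(" "):
        chars = list(word)
        rng.shuffle(chars)
        words.append("".join(chars))
    return " ".join(words)

def insert_invisible(text, per_char=2):
    """Insert `per_char` zero-width spaces after every visible character."""
    return "".join(ch + ZERO_WIDTH_SPACE * per_char for ch in text)

rng = random.Random(0)
shuffled = shuffle_words("language models are robust", rng)
noisy = insert_invisible("robust", per_char=3)
```

With `per_char=3` the invisible characters outnumber the visible ones three to one, matching the "several times more" regime the abstract describes, while the string still renders as the original word in most viewers.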

2510.10961 2026-05-29 cs.CL cs.AI 版本更新

Obfuscation Rules for Detecting and Detoxifying Korean Toxicity

用于检测和去毒化韩语毒性内容的混淆规则

Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han

发表机构 * Yonsei University(延世大学)

AI总结 本文提出KOTOX数据集,通过定义基于语言学的韩语混淆规则和变换框架,支持对混淆毒性文本的去混淆与去毒化,首次同时实现韩语混淆毒性检测与净化。

Comments 26 pages, 12 figures, 24 tables

详情
AI中文摘要

随着语言模型越来越多地部署在在线环境中,毒性检测和去毒化已受到越来越多的关注。现有研究主要关注非混淆文本,这限制了当用户故意伪装毒性表达时的鲁棒性。特别是,韩语毒性表达可以通过黏着形态学和韩文特有的正字法变体轻易伪装。然而,韩语中的混淆现象在很大程度上尚未被探索,这促使我们引入KOTOX:用于去混淆和去毒化的韩语毒性数据集。我们将韩语混淆模式分类为基于语言学的类别,定义从真实世界示例中推导出的变换规则,并将生成的混淆框架作为开放的变换包提供。利用这些规则,我们提供了配对的非毒性和毒性句子及其混淆版本。在我们的数据集上训练的模型能更好地处理混淆文本,而不会牺牲在非混淆文本上的性能。这是首个同时支持韩语去混淆和去毒化的数据集。我们期望该数据集能促进大型语言模型对韩语混淆毒性内容的更好理解和缓解。我们的代码和数据可在 https://github.com/leeyejin1231/KOTOX 获取。

英文摘要

As language models become increasingly deployed in online environments, toxicity detection and detoxification have received growing attention. Existing studies primarily focus on non-obfuscated text, which limits robustness when users intentionally disguise toxic expressions. In particular, Korean toxic expressions can be easily disguised through agglutinative morphology and Hangeul-specific orthographic variation. However, obfuscation in Korean remains largely unexplored, which motivates us to introduce KOTOX, a Korean toxic dataset for deobfuscation and detoxification. We categorize Korean obfuscation patterns into linguistically grounded classes, define transformation rules derived from real-world examples, and provide the resulting obfuscation framework as an open transformation package. Using these rules, we provide paired neutral and toxic sentences alongside their obfuscated counterparts. Models trained on our dataset better handle obfuscated text without sacrificing performance on non-obfuscated text. This is the first dataset that simultaneously supports deobfuscation and detoxification for the Korean language. We expect the dataset to facilitate better understanding and mitigation of obfuscated toxic content in LLMs for Korean. Our code and data are available at https://github.com/leeyejin1231/KOTOX.

2510.06182 2026-05-29 cs.CL 版本更新

Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context

混合机制:语言模型如何在上下文中检索绑定实体

Yoav Gur-Arieh, Mor Geva, Atticus Geiger

发表机构 * Blavatnik School of Computer Science and AI, Tel Aviv University(特拉维夫大学Blavatnik计算机科学与人工智能学院) Pr(Ai)²R Group(Pr(Ai)²R小组) Goodfire

AI总结 本文研究了语言模型在上下文中绑定并检索实体的三种机制(位置机制、词汇机制和反射机制),通过九种模型和十项绑定任务的实验揭示了它们的混合模式,并构建了一个因果模型以95%的一致性估计下一词分布。

Comments Accepted to ICLR 2026 Main Conference

详情
AI中文摘要

上下文推理的一个关键组成部分是语言模型(LMs)绑定实体以便后续检索的能力。例如,一个LM可能通过将“Ann”绑定到“pie”来表示“Ann loves pie”,从而在回答“谁喜欢pie?”时检索到“Ann”。先前关于短列表绑定实体的研究发现了强有力的证据,表明LMs通过一种位置机制实现这种检索,即根据“Ann”在上下文中的位置来检索它。在这项工作中,我们发现这种机制在更复杂的设置中泛化能力较差;随着上下文中绑定实体数量的增加,位置机制在中间位置变得嘈杂且不可靠。为了弥补这一点,我们发现LMs用词汇机制(通过其绑定对应物“pie”检索“Ann”)和反射机制(通过直接指针检索“Ann”)来补充位置机制。通过对九种模型和十项绑定任务的广泛实验,我们揭示了LMs如何混合这些机制以驱动模型行为的一致模式。我们利用这些见解开发了一个结合所有三种机制的因果模型,该模型以95%的一致性估计下一词分布。最后,我们展示了我们的模型能够泛化到与实体组交错的长得多的开放文本输入,进一步证明了我们的发现在更自然环境中的鲁棒性。总体而言,我们的研究建立了关于LMs如何在上下文中绑定和检索实体的更完整图景。

英文摘要

A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent "Ann loves pie" by binding "Ann" to "pie", allowing it to later retrieve "Ann" when asked "Who loves pie?" Prior research on short lists of bound entities found strong evidence that LMs implement such retrieval via a positional mechanism, where "Ann" is retrieved based on its position in context. In this work, we find that this mechanism generalizes poorly to more complex settings; as the number of bound entities in context increases, the positional mechanism becomes noisy and unreliable in middle positions. To compensate for this, we find that LMs supplement the positional mechanism with a lexical mechanism (retrieving "Ann" using its bound counterpart "pie") and a reflexive mechanism (retrieving "Ann" through a direct pointer). Through extensive experiments on nine models and ten binding tasks, we uncover a consistent pattern in how LMs mix these mechanisms to drive model behavior. We leverage these insights to develop a causal model combining all three mechanisms that estimates next token distributions with 95% agreement. Finally, we show that our model generalizes to substantially longer inputs of open-ended text interleaved with entity groups, further demonstrating the robustness of our findings in more natural settings. Overall, our study establishes a more complete picture of how LMs bind and retrieve entities in-context.

2509.18377 2026-05-29 cs.CL 版本更新

Interactive In-Meeting Speaker Correction with Human Feedback

交互式会议中基于人类反馈的说话人修正

Xinlu He, Yiwen Guan, Badrivishal Paurana, Pitipat Kongsomjit, Zilin Dai, Jacob Whitehill

发表机构 * Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 提出一种LLM辅助的会议内说话人修正系统,通过用户简短反馈修正说话人归属错误,结合流式ASR、说话人日志、LLM摘要和在线注册机制,在AMI数据集上实现DER降低31.99%、说话人替换错误降低52.68%。

详情
AI中文摘要

大多数自动语音处理系统以“开环”模式运行,没有用户关于谁说了什么的反馈,然而人在回路的工作流程有可能实现更高的准确性。我们提出了一种LLM辅助的会议内说话人修正系统,允许用户通过简短纠正性反馈来修复说话人归属错误。在执行流式ASR和说话人日志后,系统呈现简洁的LLM生成的摘要,帮助用户识别重要的说话人错误,并通过更新带说话人标注的转录文本和添加在线说话人注册来整合用户反馈。为了使该工作流程在语音处理、LLM分析和用户反馈存在错误的情况下仍然有效,我们开发了多种机制来更精确地识别预期的修正。此外,我们构建了一个LLM驱动的用户反馈模拟,以评估工作流程的可复现性和可扩展性。应用于AMI头戴式麦克风测试集,我们的系统相对于流式基线(Google ASR + ECAPA)显著降低了31.99%的DER和52.68%的说话人替换错误。

英文摘要

Most automatic speech processing systems operate in "open loop" mode without user feedback about who said what, yet human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted in-meeting speaker correction system that lets users fix speaker attribution errors through brief corrective feedback. After performing streaming ASR and diarization, the system presents concise LLM-generated summaries to help users identify important speaker errors, and it incorporates user feedback by updating the speaker-attributed transcript and adding online speaker enrollments. To make this workflow effective despite errors in speech processing, LLM analysis, and user feedback, we developed several mechanisms to identify the intended correction more precisely. Further, we built an LLM-driven user feedback simulation to evaluate the workflow reproducibly and at scale. Applied to the AMI headset test set, our system substantially reduces the DER from a streaming baseline (Google ASR + ECAPA) by 31.99% and speaker substitution error by 52.68%.

2508.15371 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Confidence-Modulated Speculative Decoding for Large Language Models

置信度调节的推测解码用于大型语言模型

Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

发表机构 * Department of Data Science(数据科学系) Praxis Business School(普拉克斯商学院)

AI总结 本文提出一种基于置信度调节的推测解码框架,通过熵和边际不确定性度量动态调整草稿长度与验证过程,在机器翻译和摘要任务上实现加速并保持或提升BLEU和ROUGE分数。

Comments This is the preprint of the paper, which has been accepted for oral presentation and publication in the proceedings of IEEE INDISCON 2025. The conference will be organized at the National Institute of Technology, Rourkela, India, from August 21 to 23, 2025. The paper is 10 pages long, and it contains 2 figures and 5 tables

详情
AI中文摘要

推测解码已成为一种通过草稿-验证范式并行化令牌生成来加速自回归推理的有效方法。然而,现有方法依赖静态草稿长度和刚性验证标准,限制了其在不同模型不确定性和输入复杂性下的适应性。本文提出一种基于置信度调节草稿的信息论推测解码框架。通过利用草稿模型输出分布上的熵和边际不确定性度量,所提方法在每次迭代中动态调整推测生成的令牌数量。这种自适应机制减少了回滚频率,提高了资源利用率,并保持了输出保真度。此外,验证过程使用相同的置信度信号进行调节,使得在不牺牲生成质量的情况下更灵活地接受草稿令牌。在机器翻译和摘要任务上的实验表明,与标准推测解码相比,该方法在保持或提升BLEU和ROUGE分数的同时实现了显著加速。所提方法提供了一种原则性的即插即用方法,用于在不确定性变化条件下实现大型语言模型的高效且鲁棒的解码。

英文摘要

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
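The abstract above describes adjusting the number of drafted tokens from entropy and margin-based uncertainty over the drafter's output distribution. The sketch below shows one way such a rule could look. The equal weighting of the two signals, the linear mapping to a draft length, and the `min_len`/`max_len` bounds are assumptions chosen for illustration and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def margin(probs):
    """Gap between the two most probable tokens (needs at least 2 entries)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def draft_length(probs, min_len=1, max_len=8):
    """Map drafter confidence to a number of tokens to draft.

    Assumption: confidence is the average of (1 - normalized entropy)
    and the top-2 margin, both in [0, 1]; the paper may combine them
    differently.
    """
    max_entropy = math.log(len(probs))  # entropy of the uniform distribution
    certainty = 1.0 - entropy(probs) / max_entropy
    confidence = 0.5 * certainty + 0.5 * margin(probs)
    return min_len + round(confidence * (max_len - min_len))


if __name__ == "__main__":
    print(draft_length([0.25, 0.25, 0.25, 0.25]))  # maximally uncertain drafter
    print(draft_length([0.7, 0.1, 0.1, 0.1]))      # moderately confident drafter
    print(draft_length([1.0, 0.0, 0.0, 0.0]))      # fully confident drafter
```

A uniform distribution yields the shortest draft and a one-hot distribution the longest, so fewer tokens are speculated when the drafter is unsure, which is the behaviour the abstract credits with reducing rollback frequency.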

2508.05614 2026-05-29 cs.CL cs.AI 版本更新

GroundAct: Can LLM Agents Ground Actions in Environmental States?

GroundAct:LLM智能体能否在环境状态中实现动作落地?

Zixuan Wang, Dingming Li, Hongxing Li, Yanrui Miao, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

发表机构 * Zhejiang University(浙江大学)

AI总结 本研究提出GroundAct基准,通过1500个场景和16592个任务实例评估15个LLM,发现动作落地能力是多维挑战,不能仅通过模型规模解决。

Comments Project Page: https://zju-real.github.io/OmniEmbodied Code: https://github.com/ZJU-REAL/OmniEmbodied

详情
AI中文摘要

LLM智能体在指令完全指定动作的任务上成功率达到85-96%,但当动作可行性取决于指令未提及的环境状态时,成功率降至29-53%。我们认为这一差距反映了一种缺失的能力:动作落地,即从结构化环境状态推断动作是否可行、缺少哪些前提条件以及是否超出个体能力的能力。我们引入GroundAct,这是一个包含1500个场景和16592个任务实例的基准,基于文本的交互式环境涵盖11个领域,任务按认知复杂度层级组织为七个类别。评估15个LLM(3B-671B)后,我们发现三种诊断模式:(i)属性推理与工具和协作推理弱相关,产生不同的模型轮廓;(ii)完整环境图在工具使用与隐式协作之间产生高达+27.6/-22.9%的差异,区分了搜索边界与约束过滤瓶颈;(iii)监督微调将Qwen2.5-3B在直接命令上的性能从0.6%提升至76.3%,但在隐式协作上仅从1.5%提升至5.5%。这些结果表明动作落地是一个多维挑战,不能仅通过规模扩展解决。

英文摘要

LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends on environmental state that the instruction does not mention. We argue that this gap reflects a missing capability: action grounding, the ability to infer from structured environmental state whether an action is feasible, what prerequisites it lacks, and whether it exceeds individual capacity. We introduce GroundAct, a benchmark of 1,500 scenarios and 16,592 task instances in text-based interactive environments spanning 11 domains, with tasks organized into seven categories along a cognitive complexity hierarchy. Evaluating 15 LLMs (3B-671B), we find three diagnostic patterns: (i) attribute reasoning is weakly correlated with tool and coordination reasoning, producing distinct model profiles; (ii) complete environment graphs yield up to +27.6/-22.9% on tool use vs. implicit collaboration, separating search-bound from constraint-filtering bottlenecks; and (iii) supervised fine-tuning lifts Qwen2.5-3B from 0.6% to 76.3% on direct command but only 1.5% to 5.5% on implicit collaboration. These results establish action grounding as a multi-dimensional challenge irreducible to scaling.

2508.03726 2026-05-29 cs.CL 版本更新

Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

分层验证投机波束以加速大语言模型推理

Jaydip Sen, Harshitha Puvvala, Subhasis Dasgupta

发表机构 * Praxis Business School(普拉克斯商学院) Sabre Industries(Sabre工业公司)

AI总结 提出分层验证树(HVT)框架,通过优先验证高似然草稿并早期剪枝次优候选,以分层方式重构投机波束解码,从而在不重训练或修改架构下显著降低推理时间和能耗。

Comments This paper was accepted for oral presentation and publication in the 3rd International Conference on Data Science and Network Engineering (ICDSNE 2025), organized at NIT, Agartala, India, from July 25 to 26, 2025. The paper is 12 pages long, and it contains 3 tables and 4 figures. This is NOT the final paper, which will be published in the Springer-published proceedings

详情
AI中文摘要

大语言模型(LLMs)在多种自然语言处理任务中取得了显著成功,但由于其自回归特性,在推理效率方面面临持续挑战。尽管投机解码和波束采样带来了显著改进,传统方法按顺序验证草稿序列且无优先级区分,导致不必要的计算开销。本文提出分层验证树(HVT),一种通过优先处理高似然草稿并实现次优候选早期剪枝来重构投机波束解码的新框架。我们开发了理论基础和形式化的验证-剪枝算法以确保正确性和效率。该方法无需重训练或架构修改即可集成到标准LLM推理流程中。跨多个数据集和模型的实验评估表明,HVT始终优于现有投机解码方案,在维持或提升输出质量的同时,实现了推理时间和能耗的大幅降低。研究结果凸显了分层验证策略作为加速大语言模型推理新方向的潜力。

英文摘要

Large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks but face persistent challenges in inference efficiency due to their autoregressive nature. While speculative decoding and beam sampling offer notable improvements, traditional methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. This work proposes the Hierarchical Verification Tree (HVT), a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed to ensure correctness and efficiency. Integration with standard LLM inference pipelines is achieved without requiring retraining or architecture modification. Experimental evaluations across multiple datasets and models demonstrate that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. The findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.

2507.09574 2026-05-29 cs.CV cs.AI cs.CL 版本更新

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

MENTOR: 面向自回归视觉生成模型的高效多模态条件微调

Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, Junjie Hu

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Tsinghua University(清华大学) Peking University(北京大学) Microsoft(微软公司)

AI总结 提出MENTOR框架,通过两阶段训练范式实现自回归图像生成器与多模态输入的细粒度token级对齐,无需辅助适配器或交叉注意力模块,在DreamBench++上取得优异性能。

Comments Findings of ACL 2026

详情
AI中文摘要

最近的文本到图像模型能够生成高质量结果,但在精确视觉控制、平衡多模态输入以及需要大量训练以实现复杂多模态图像生成方面仍存在困难。为解决这些局限,我们提出MENTOR,一种新颖的自回归(AR)框架,用于高效的多模态条件微调以实现自回归多模态图像生成。MENTOR将AR图像生成器与两阶段训练范式相结合,无需依赖辅助适配器或交叉注意力模块,即可实现多模态输入与图像输出之间的细粒度、token级对齐。两阶段训练包括:(1)多模态对齐阶段,建立稳健的像素级和语义级对齐;随后是(2)多模态指令微调阶段,平衡多模态输入的整合并增强生成可控性。尽管模型规模适中、基础组件非最优且训练资源有限,MENTOR在DreamBench++基准测试上仍取得了强劲性能,在概念保持和提示遵循方面优于竞争基线。此外,与基于扩散的方法相比,我们的方法具有更优的图像重建保真度、广泛的任务适应性以及更高的训练效率。数据集、代码和模型可在 https://github.com/HaozheZhao/MENTOR 获取。

英文摘要

Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

2506.08354 2026-05-29 cs.CL cs.AI cs.IR 版本更新

Position: Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning

立场:文本嵌入应捕获隐含语义,而不仅仅是表面意义

Yiqun Sun, Qiang Huang, Anthony K. H. Tung, Jun Yu

发表机构 * National University of Singapore(新加坡国立大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 本文主张文本嵌入研究应从表面意义转向隐含语义,通过试点研究揭示现有模型在隐含语义任务上的局限,并提出范式转变以优先发展语言学基础训练数据、深层语义基准和核心建模目标。

Comments To appear in ICML 2026

详情
AI中文摘要

这篇立场论文主张,文本嵌入研究应超越表面意义,将隐含语义作为核心建模目标。文本嵌入是现代自然语言处理的基础组件,支撑着广泛的应用并推动持续的研究进展。尽管进展迅速,大多数嵌入模型仍局限于表面层次的语义,而语言学理论强调人类意义的大部分是隐含的,由语用学、说话者意图和社会文化语境塑造。当前模型通常在缺乏此类深度的数据集上训练,并使用奖励表面相似性的基准进行评估。因此,它们在需要解释性推理、立场识别或社会性理解的任务中表现不佳。我们的试点研究明确揭示了这一局限性,表明即使在探测隐含语义的任务上,最先进的嵌入相比简单的词汇基线也仅取得边际改进。因此,我们呼吁范式转变:嵌入研究应优先考虑具有语言学基础且多样化的训练数据,开发探测更深层语义理解的基准,并将隐含意义作为核心建模目标,以更好地使嵌入与现实世界的语言复杂性对齐。代码可在 http://github.com/dukesun99/Implicit-Embeddings 获取。

英文摘要

This position paper argues that text embedding research should move beyond surface meaning and embrace implicit semantics as a central modeling objective. Text embeddings are a foundational component of modern NLP, underpinning a wide range of applications and driving sustained research progress. Despite rapid progress, most embedding models remain narrowly focused on surface-level semantics, whereas linguistic theory emphasizes that much of human meaning is implicit, shaped by pragmatics, speaker intent, and sociocultural context. Current models are typically trained on datasets that lack such depth and evaluated using benchmarks that reward surface similarity. As a result, they struggle with tasks that require interpretive reasoning, stance recognition, or socially grounded understanding. Our pilot study makes this limitation explicit, showing that even state-of-the-art embeddings achieve only marginal improvements over simple lexical baselines on tasks probing implicit semantics. We therefore call for a paradigm shift: embedding research should prioritize linguistically grounded and diverse training data, develop benchmarks that probe deeper semantic understanding, and treat implicit meaning as a core modeling objective to better align embeddings with real-world language complexity. The code is available at http://github.com/dukesun99/Implicit-Embeddings.

2506.06254 2026-05-29 cs.AI cs.CL cs.LG 版本更新

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents

PersonaAgent:弥合个性化LLM智能体的记忆与行动

Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, Xian Li

发表机构 * Amazon Stores Foundational AI(亚马逊基础AI)

AI总结 提出PersonaAgent框架,通过整合个性化记忆模块(情景与语义记忆)和行动模块,并利用角色提示作为中介实现记忆与行动的协同,以解决LLM智能体的个性化任务。

Comments Accepted in ACL 2026

详情
AI中文摘要

由大型语言模型驱动的智能体近期作为先进范式出现,在广泛领域和任务中展现出令人印象深刻的能力。尽管潜力巨大,当前LLM智能体常采用一刀切方法,缺乏响应用户不同需求和偏好的灵活性。这一局限促使我们开发PersonaAgent——首个旨在处理多样化个性化任务的个性化LLM智能体框架。具体而言,PersonaAgent整合了两个互补组件:一个包含情景记忆和语义记忆机制的个性化记忆模块;一个使智能体能够执行针对用户定制的工具行动的个性化行动模块。核心在于,角色(定义为每位用户独特的系统提示)充当中间件:它利用来自个性化记忆的洞察来控制智能体行动,而这些行动的结果反过来又优化记忆。基于该框架,我们提出一种测试时用户偏好对齐策略,该策略模拟最近的n次交互以优化角色提示,通过模拟响应与真实响应之间的文本损失反馈确保实时用户偏好对齐。实验评估表明,PersonaAgent不仅有效个性化行动空间,还能在测试时实际应用中扩展,显著优于其他基线方法。这些结果证明了我们的方法在提供定制化、动态用户体验方面的可行性和潜力。

英文摘要

Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components: a personalized memory module that includes episodic and semantic memory mechanisms, and a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as a unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulates the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.

2505.18744 2026-05-29 cs.CL 版本更新

LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Complex Reasoning

LogicCat:面向复杂推理的思维链文本到SQL基准测试

Tao Liu, Xutao Mao, Hongying Zan, Dixuan Zhang, Yifan Li, Haixin Liu, Lulu Kong, Jiaming Hou, Rui Li, YunLong Li, Aoze Zheng, Zhiqiang Zhang, Luo Zhewei, Kunli Zhang, Min Peng

发表机构 * Zhengzhou University(郑州大学) Vanderbilt University(范德比大学) Wuhan University(武汉大学)

AI总结 提出首个针对复杂推理和思维链解析的Text-to-SQL基准数据集LogicCat,涵盖物理、算术、常识和假设推理场景,通过4038个问题与12114条思维链步骤显著提升任务难度,现有模型执行准确率最高仅33.20%。

Comments 9 pages, 5 figures

详情
Journal ref
Proceedings of the AAAI Conference on Artificial Intelligence, 40(36): 29958-29966, 2026
AI中文摘要

文本到SQL是自然语言处理中的关键任务,旨在将自然语言问题转化为准确且可执行的SQL查询。在现实场景中,这些推理任务通常伴随复杂的数学计算、领域知识和假设推理场景。然而,现有大规模文本到SQL数据集通常聚焦于业务逻辑和任务逻辑,忽略了垂直领域知识、复杂数学推理和假设推理等关键因素,而这些因素对于真实反映实际应用中的推理需求并完成数据查询与分析至关重要。为弥补这一空白,我们引入了LogicCat,这是首个专门为复杂推理和思维链解析设计的文本到SQL基准数据集,涵盖物理、算术、常识和假设推理场景。LogicCat包含4038个英文问题,配有12114条详细的思维链推理步骤,跨越45个不同领域的数据库,在复杂性上显著超越现有数据集。实验结果表明,LogicCat将当前最先进模型的任务难度大幅提升至最高33.20%的执行准确率,表明该任务仍然极具挑战性。LogicCat的进步代表了向开发适用于真实企业数据分析和自主查询生成的系统迈出的关键一步。我们已在https://github.com/Ffunkytao/LogicCat发布了数据集代码。

英文摘要

Text-to-SQL is a critical task in natural language processing that aims to transform natural language questions into accurate and executable SQL queries. In real-world scenarios, these reasoning tasks are often accompanied by complex mathematical computations, domain knowledge, and hypothetical reasoning scenarios. However, existing large-scale Text-to-SQL datasets typically focus on business logic and task logic, neglecting critical factors such as vertical domain knowledge, complex mathematical reasoning, and hypothetical reasoning, which are essential for realistically reflecting the reasoning demands in practical applications and completing data querying and analysis. To bridge this gap, we introduce LogicCat, the first Text-to-SQL benchmark dataset specifically designed for complex reasoning and chain-of-thought parsing, encompassing physics, arithmetic, commonsense, and hypothetical reasoning scenarios. LogicCat comprises 4,038 English questions paired with 12,114 detailed chain-of-thought reasoning steps, spanning 45 databases across diverse domains, significantly surpassing existing datasets in complexity. Experimental results demonstrate that LogicCat substantially increases the task difficulty for current state-of-the-art models, which reach at most 33.20% execution accuracy, indicating that this task remains exceptionally challenging. The advancement of LogicCat represents a crucial step toward developing systems suitable for real-world enterprise data analysis and autonomous query generation. We have released our dataset code at https://github.com/Ffunkytao/LogicCat.

2505.16178 2026-05-29 cs.CL 版本更新

Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge

理解语言模型中的事实回忆:为什么两阶段训练鼓励记忆而混合训练教授知识

Ying Zhang, Benjamin Heinzerling, Dongyuan Li, Kentaro Inui

发表机构 * RIKEN Center for Advanced Intelligence Project(日本理化学研究所高级智能项目中心) Tohoku University(东北大学) The University of Tokyo(东京大学) MBZUAI

AI总结 通过比较2.8~4B语言模型中的两阶段训练与混合训练,发现混合训练通过联合优化目标实现存储与查询格式间的梯度一致性,驱动表征一致性并建立格式不变的检索过程,从而泛化回忆未见查询中的事实。

详情
AI中文摘要

虽然微调是将事实知识注入大型语言模型(LLM)的标准方法,但通过未见查询实现可靠事实回忆的机制仍鲜为人知。常见的两阶段训练策略依次对事实存储和查询格式进行训练,往往导致死记硬背。相比之下,混合训练联合优化两种格式,展现出更优的泛化回忆能力。我们通过比较2.8∼4B LLM中的两种范式来研究这一成功机制,并识别出核心机制:混合训练中的联合优化目标诱导了存储格式与查询格式之间的梯度一致性。这进而驱动两种格式之间的表征一致性,建立了一个格式不变的检索过程,将未见查询映射到存储的事实。相反,两阶段训练中缺乏这种目标导致表征不一致和回忆失败。这种一致性进一步定位于由两种格式共同更新的参数,在混合训练下该参数集远大于两阶段训练。在输入层面,一致性留下了可解释的特征:混合训练从主语-关系标记(查询中可用的相同成分)以存储格式编码事实,而两阶段训练则依赖完整上下文。我们的发现刻画了事实回忆的机制,并为优化LLM中的知识注入提供了机理基础。

英文摘要

While fine-tuning is the standard for injecting factual knowledge into large language models (LLMs), the mechanisms enabling reliable fact recall via unseen queries remain poorly understood. Common two-stage training strategies, which sequentially train on fact storage and query formats, often cause rote memorization. In contrast, mixed training jointly optimizes both formats and exhibits superior generalized recall. We investigate this success by comparing the two paradigms across 2.8-4B LLMs and identify the core mechanism: the joint optimization objective in mixed training induces gradient consistency across storage and query formats. This in turn drives the representation consistency between the two formats, establishing a format-invariant retrieval process that maps unseen queries to stored facts. In contrast, the lack of such an objective in two-stage training results in inconsistent representations and failed recall. The consistency further localizes to the parameters updated by both formats, a set that is substantially larger under mixed training than under two-stage training. At the input level, the consistency leaves an interpretable signature: mixed training encodes facts in storage format from subject-relation tokens, the same components available in queries, while two-stage training relies on the full context. Our findings characterize the mechanisms of fact recall and offer a mechanistic foundation for optimizing knowledge injection in LLMs.

2505.10975 2026-05-29 cs.CL cs.AI cs.SD eess.AS 版本更新

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

单声道音频的端到端多说话人自动语音识别综述

Xinlu He, Jacob Whitehill

发表机构 * Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 本文系统综述了端到端多说话人自动语音识别的神经架构范式(SIMO与SISO)、近期改进方法及长语音扩展策略,并通过标准基准评估比较了各类方法。

Comments Accepted for publication in Computer Speech & Language (CSL)

详情
AI中文摘要

单声道多说话人自动语音识别(ASR)由于数据稀缺以及识别并将词语归因于单个说话人的内在困难(尤其是在重叠语音中)仍然具有挑战性。最近的进展推动了从级联系统向端到端(E2E)架构的转变,这减少了错误传播并更好地利用了语音内容与说话人身份之间的协同作用。尽管端到端多说话人ASR取得了快速进展,但该领域缺乏对近期发展的全面综述。本综述为多说话人ASR的端到端神经方法提供了一个系统的分类法,突出了近期进展和比较分析。具体而言,我们分析了:(1)用于预分割音频的架构范式(SIMO与SISO),分析了它们的不同特征和权衡;(2)基于这两种范式的近期架构和算法改进;(3)对长语音的扩展,包括分割策略和说话人一致性的假设拼接。此外,我们(4)在标准基准上评估和比较了各种方法。最后,我们讨论了构建鲁棒且可扩展的多说话人ASR所面临的开放挑战和未来研究方向。

英文摘要

Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.

2503.13844 2026-05-29 cs.CL cs.AI cs.CY cs.LG 版本更新

Towards Detecting Persuasion on Social Media: From Model Development to Insights on Persuasion Strategies

检测社交媒体上的说服:从模型开发到说服策略的洞察

Elyas Meguellati, Stefano Civelli, Pietro Bernardelle, Shazia Sadiq, Irwin King, Gianluca Demartini

发表机构 * University of Queensland(昆士兰大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文通过开发轻量级说服文本检测模型(在SemEval 2023任务3子任务3中达到最优性能)并应用于澳大利亚联邦选举2022 Facebook广告数据集,揭示了政治竞选在不同资金策略、词汇选择、人口统计定位和选举临近时说服强度时间变化中的模式。

详情
Journal ref
Proceedings of the International AAAI Conference on Web and Social Media 20(1) (2026) 1587-1608
AI中文摘要

政治广告通过嵌入更广泛宣传策略中的微妙说服技巧,在塑造公众舆论和影响选举结果方面发挥着关键作用。检测这些说服元素对于提高选民意识和确保民主进程的透明度至关重要。本文通过两项相互关联的研究,提出了一种连接模型开发与实际应用的综合方法。首先,我们引入了一个轻量级说服文本检测模型,该模型在SemEval 2023任务3子任务3中达到了最先进性能,同时所需的计算资源和训练数据远少于现有方法。其次,我们通过收集澳大利亚联邦选举2022 Facebook广告(APA22)数据集,对其中一部分进行说服标注,并对模型进行微调以使其从主流新闻适应社交媒体内容,从而展示了该模型的实际效用。然后,我们应用微调后的模型对APA22数据集的其余部分进行标注,揭示了政治竞选如何通过不同的资金策略、词汇选择、人口统计定位以及选举日临近时说服强度的时间变化来利用说服的独特模式。我们的发现不仅强调了分析社交媒体说服时领域特定建模的必要性,还展示了揭示这些策略如何能够增强透明度、告知选民并促进数字竞选中的问责制。

英文摘要

Political advertising plays a pivotal role in shaping public opinion and influencing electoral outcomes, often through subtle persuasive techniques embedded in broader propaganda strategies. Detecting these persuasive elements is crucial for enhancing voter awareness and ensuring transparency in democratic processes. This paper presents an integrated approach that bridges model development and real-world application through two interconnected studies. First, we introduce a lightweight model for persuasive text detection that achieves state-of-the-art performance in Subtask 3 of SemEval 2023 Task 3 while requiring significantly fewer computational resources and training data than existing methods. Second, we demonstrate the model's practical utility by collecting the Australian Federal Election 2022 Facebook Ads (APA22) dataset, partially annotating a subset for persuasion, and fine-tuning the model to adapt from mainstream news to social media content. We then apply the fine-tuned model to label the remainder of the APA22 dataset, revealing distinct patterns in how political campaigns leverage persuasion through different funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity as election day approaches. Our findings not only underscore the necessity of domain-specific modeling for analyzing persuasion on social media but also show how uncovering these strategies can enhance transparency, inform voters, and promote accountability in digital campaigns.

2411.14279 2026-05-29 cs.CV cs.CL 版本更新

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

超越文本:通过多模态双注意力和软图像引导减少大型视觉语言模型中的语言偏差

Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Peking University(北京大学) Tsinghua University(清华大学)

AI总结 针对大型视觉语言模型因语言偏差导致的幻觉问题,提出LACING框架,采用多模态双注意力机制和软图像引导策略,在不增加训练资源的情况下增强视觉理解并减少幻觉。

Comments EMNLP 2025

详情
AI中文摘要

大型视觉语言模型在各种视觉语言任务中取得了令人印象深刻的结果。然而,尽管表现出有前景的性能,大型视觉语言模型仍因语言偏差而产生幻觉,导致对图像的关注度降低和视觉理解效率低下。我们确定了这种偏差的两个主要原因:1. 大语言模型预训练阶段与多模态对齐阶段之间训练数据的规模差异。2. 文本数据短期依赖性导致的学习推理偏差。因此,我们提出了LACING,一个系统性框架,旨在通过多模态双注意力机制和软图像引导来解决大型视觉语言模型的语言偏差。具体来说,多模态双注意力机制引入了一种并行双注意力机制,增强了整个模型中视觉输入的整合。软图像引导在训练和推理过程中引入了一个可学习的软视觉提示,以替代视觉输入,旨在迫使大型视觉语言模型优先处理文本输入。然后,软图像引导进一步提出了一种使用软视觉提示的新解码策略,以减轻模型对相邻文本输入的过度依赖。综合实验表明,我们的方法有效地消除了大型视觉语言模型的语言偏差,增强了视觉理解并减少了幻觉,无需额外的训练资源或数据。代码和模型可在[lacing-lvlm.github.io](https://lacing-lvlm.github.io)获取。

英文摘要

Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLM and multimodal alignment stage. 2. The learned inference bias due to short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at [lacing-lvlm.github.io](https://lacing-lvlm.github.io).

2404.10706 2026-05-29 cs.CY cs.CL cs.HC cs.SI 版本更新

Cross-Language Evolution of Divergent Collective Memory Around the Arab Spring

阿拉伯之春的跨语言分歧性集体记忆演化

H. Laurie Jones, Brian C. Keegan

AI总结 通过分析2011-2024年间阿拉伯语和英语维基百科中阿拉伯之春相关文章的存档内容,定义了多语言的事件显著性、商议、语境化和集体记忆巩固度量,揭示了跨语言内容相似性的时间演化规律。

详情
AI中文摘要

阿拉伯之春是始于2011年的一系列历史性抗议活动,这些抗议推翻了多国政府并导致了重大冲突。对于此类事件的集体记忆可能因政治、文化和语言因素而在不同社会语境中存在显著差异。尽管维基百科在记录历史及当前事件方面发挥着重要作用,但关于维基百科文章在重大事件发生后如何持续演化数年或数十年的问题却鲜有关注。利用2011年至2024年间阿拉伯语和英语维基百科中阿拉伯之春相关主题的存档内容,我们定义并评估了围绕阿拉伯之春的事件显著性、商议、语境化和集体记忆巩固的多语言度量。我们关于维基百科文章跨语言内容相似性时间演化的发现,对于在线集体记忆过程的理论构建以及基于这些数据训练的语言模型的评估具有启示意义。

英文摘要

The Arab Spring was a historic set of protests beginning in 2011 that toppled governments and led to major conflicts. Collective memories of events like these can vary significantly across social contexts in response to political, cultural, and linguistic factors. While Wikipedia plays an important role in documenting both historic and current events, little attention has been given to how Wikipedia articles, created in the aftermath of major events, continue to evolve over years or decades. Using the archived content of Arab Spring-related topics across the Arabic and English Wikipedias between 2011 and 2024, we define and evaluate multilingual measures of event salience, deliberation, contextualization, and consolidation of collective memory surrounding the Arab Spring. Our findings about the temporal evolution of the Wikipedia articles' content similarity across languages have implications for theorizing about online collective memory processes and evaluating linguistic models trained on these data.