2605.30327 2026-05-29 cs.LG cs.AI cs.CL math.ST stat.ML stat.TH (version update)

Reasoning with Sampling: Cutting at Decision Points

Felix Zhou, Anay Mehrotra, Quanquan C. Liu

Affiliations: Yale University; Stanford University

AI summary: Proposes the Entropy-Cut Metropolis-Hastings algorithm, which uses the base model's next-token entropy as a proxy to identify key decision points and resample from them, sampling efficiently from the power distribution to strengthen reasoning; it outperforms baselines and RL-trained models on several benchmarks.

Abstract

Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, e.g., trying different reasoning strategies. The samplers proposed in prior works repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
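
As a rough illustration of the cut-selection idea described in the abstract, the sketch below computes next-token entropies from per-position probability vectors and samples a cut position with probability proportional to the entropy at that position, in contrast to a uniform choice. The function names, the proportional-to-entropy weighting, and the toy distributions are assumptions made for illustration; the abstract does not specify the paper's proposal distribution or its Metropolis-Hastings acceptance rule, and neither is modeled here.

```python
import math
import random

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_cut_weights(next_token_dists):
    """Normalised weights over positions, proportional to next-token entropy.

    The proportional rule is an assumed stand-in, not the paper's proposal.
    """
    entropies = [token_entropy(d) for d in next_token_dists]
    total = sum(entropies)
    if total == 0.0:
        # Every position is deterministic: fall back to a uniform choice.
        return [1.0 / len(entropies)] * len(entropies)
    return [h / total for h in entropies]

def sample_cut(next_token_dists, rng=random):
    """Sample a cut position; the suffix from this position on would be resampled."""
    weights = entropy_cut_weights(next_token_dists)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Toy trace: position 1 is a high-entropy "decision point",
# the other positions are close to deterministic.
toy_dists = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
]
weights = entropy_cut_weights(toy_dists)
```

In this toy trace most of the weight lands on position 1, whereas a uniform sampler would pick each of the four positions equally often.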

2605.30324 2026-05-29 cs.DS cs.AI cs.CL cs.LG stat.ML (version update)

On Language Generation in the Limit with Bounded Memory

Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas

Affiliations: Cornell University; Stanford University; Google Research

AI summary: Studies language generation in the limit under bounded memory, using combinatorial bounds and sliding-window analysis to examine how memory constraints affect generability, density, and identification.

Comments: The abstract has been shortened to fit within the arXiv limit

Abstract

We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators, that is, the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing the generator to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an "approximate" version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.
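
The abstract names Sperner's theorem without stating it. For reference, the theorem says that the largest antichain in the power set of an n-element set (a family of subsets in which none contains another) has size C(n, floor(n/2)), attained by the middle layer. The sketch below checks this by brute force for tiny n. It only illustrates the theorem itself; how the paper turns it into a density bound is not described in the abstract, and the helper names are hypothetical.

```python
from itertools import combinations
from math import comb

def is_antichain(family):
    """True if no set in the family is a proper subset of another."""
    return not any(a < b or b < a for a, b in combinations(family, 2))

def middle_layer(n):
    """All subsets of {0, ..., n-1} of size floor(n/2)."""
    return [frozenset(c) for c in combinations(range(n), n // 2)]

def max_antichain_size_bruteforce(n):
    """Exhaustive search over every family of subsets; feasible only for tiny n."""
    subsets = [frozenset(c) for k in range(n + 1) for c in combinations(range(n), k)]
    best = 0
    for mask in range(1 << len(subsets)):
        family = [s for i, s in enumerate(subsets) if (mask >> i) & 1]
        if len(family) > best and is_antichain(family):
            best = len(family)
    return best
```

For n = 3 the search covers 2^8 = 256 families; the number of families doubles with each additional subset, so the brute-force check is useful only as a sanity test.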

2605.30315 2026-05-29 cs.CL cs.LG (version update)

Resolution Diagnostics for Paired LLM Evaluation

Anany Kotawala

Affiliations: Princeton University

AI summary: Addresses the problem that pairwise rankings on public LLM leaderboards fall short of a conventional paired-test resolution target; proposes a hypothesis-testing framework for paired evaluation with the resolution ratio q = N/N* as the primary diagnostic, and shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut is off by roughly a factor of two in the close-comparison regime.

Abstract

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6/9 under real subject-level clustering and stays at 5-6 out of 9 in 99.9% of category-bootstrap resamples. We frame paired LLM evaluation as a hypothesis-testing problem, invert level-alpha, power-(1-beta) tests, and report a per-pair resolution ratio q = N/N* as the primary diagnostic. A sharp small-effect expansion with an explicit second-order constant shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime, a deficit that three of five off-the-shelf calculators (Cohen 1988, G*Power, R pwr) silently inherit when the user post-multiplies their per-arm output by (1-rho). The unresolved-pair pattern remains under multiplicity correction and anytime-valid sequential testing.
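
To make the ratio q = N/N* concrete, the sketch below computes a required number of paired items N* from the two discordant-outcome probabilities, using one common normal-approximation sample-size formula for a two-sided McNemar-type test. This particular formula, the function names, and the parameterization by p10 and p01 are assumptions for illustration; the abstract does not give the paper's exact expression for N*, which may differ.

```python
from statistics import NormalDist

def required_pairs(p10, p01, alpha=0.05, power=0.8):
    """Approximate number of paired items needed to detect p10 != p01.

    p10: probability that model A is right and model B is wrong on an item.
    p01: probability that model B is right and model A is wrong on an item.
    Uses an assumed normal approximation, not the paper's own N*.
    """
    delta = p10 - p01  # paired accuracy difference
    psi = p10 + p01    # rate of discordant items
    if delta == 0:
        raise ValueError("no difference to resolve")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = z_alpha * psi ** 0.5 + z_beta * (psi - delta ** 2) ** 0.5
    return numerator ** 2 / delta ** 2

def resolution_ratio(n_items, p10, p01, alpha=0.05, power=0.8):
    """q = N / N*; q >= 1 means the benchmark is large enough under this approximation."""
    return n_items / required_pairs(p10, p01, alpha, power)
```

Under this approximation, a pair of models that differ by two accuracy points with 10% discordant items is unresolved on a 1,000-item benchmark (q below 1) but resolved on a 4,000-item one.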

2605.30295 2026-05-29 cs.CL cs.AI (version update)

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

Affiliations: System Inc.

AI summary: Proposes a pipeline that generates clinically realistic HL7 FHIR R4 bundles from unstructured text and uses it to build the MedCase-Structured dataset; finds that LLMs are less accurate at diagnosis on structured FHIR inputs than on plain text, underscoring the importance of deployment-aligned benchmarking.

Comments: Accepted to ICML 2026 Structured Data for Health Workshop

Abstract

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems. The pipeline combines staged LLM generation with terminology-grounded validation and repair to reduce hallucinated codes and enforce structural and semantic consistency. Applying this approach to MedCaseReasoning, we construct MedCase-Structured, a synthetic dataset aligned with clinician-authored diagnostic cases, achieving valid FHIR generation for 82.5% of cases. Evaluation on MedCase-Structured reveals consistently lower diagnostic accuracy for LLMs on structured FHIR inputs than with plain text, highlighting the importance of deployment-aligned benchmarking.

2605.21235 2026-05-29 cs.CL (version update)

LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

Redacted by arXiv

AI summary: Proposes LamPO, which uses a pairwise decomposed advantage and confidence weighting to improve credit assignment and training stability when applying reinforcement learning with verifiable rewards to reasoning language models.

Comments: arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission. Author list and submitter name redacted due to disputed authorship

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose LamPO, a Lambda-Style Policy Optimization method that replaces scalar group advantages with a Pairwise Decomposed Advantage. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining the critic-free and clipped-update structure of PPO-style optimization. When reference solutions are available, we further add a lightweight ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show that LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.

2605.19416 2026-05-29 cs.CL (version update)

Loong: A Human-Like Long-Document Translation Agent with Observe-and-Act Adaptive Context Selection

Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang, Min Zhang, Shimin Tao, Daimeng Wei, Min Zhang

Affiliations: Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen; NLP2CT Lab, Department of Computer and Information Science, University of Macau; Huawei Translation Services Center

AI summary: Proposes the Loong agent, which uses a 3E memory module and reinforcement learning to optimize its context policy, addressing context-window limits and redundant information in long-document translation; reports average gains of up to 13.0 points on English ⇔ Chinese, German, and French translation.

Abstract

Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual information that degrades translation quality. To address this, we propose a human-like long document translation agent called Loong, which leverages a 3E memory module (Essence-Exemplar-Entity) to store summaries, sentence pairs, and entity records as historical context. Instead of passively attending to all history, Loong performs deep reasoning to adaptively identify the optimal context for translation guidance. Loong optimizes its context policy through reinforcement learning, utilizing preference data derived from its own sampled observe-and-act reasoning trajectories. Empirical evaluations demonstrate that Loong achieves substantial translation quality improvements in English $\Leftrightarrow$ Chinese, German, and French directions, with average gains of up to 13.0 points across the three evaluation metrics. Furthermore, Loong exhibits strong generalization across domains and robustness against contextual noise, while maintaining remarkable stability in ultra-long document translation. Our code is released at https://github.com/YutongWang1216/LoongDocMT.

2605.30273 2026-05-29 cs.HC cs.AI cs.CL cs.CY cs.SI (version update)

LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar, Dong Whi Yoo, Eshwar Chandrasekharan, Koustuv Saha

Affiliations: University of Illinois Urbana-Champaign; Indiana University Indianapolis

AI summary: Proposes the LLUMI framework, which builds preference pairs from online community feedback (such as Reddit votes) and trains small open-source models with supervised fine-tuning and direct preference optimization, achieving mental health support performance comparable to GPT models while preserving privacy.

Abstract

Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data. At the same time, deploying proprietary, cloud-based models for mental health-related interactions raises important privacy and data-governance concerns, given the sensitivity of these interactions. To address this challenge, we introduce LLUMI, a setup that can be hosted in-house within protected environments. LLUMI consists of two complementary components: a generation model (GM), which drafts supportive responses to mental health queries, and an improvement model (IM), which revises an initial human-crafted response. We leverage feedback signals from Reddit mental health communities, using community endorsement patterns such as upvotes and downvotes to construct chosen-rejected response pairs for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). We further align LLUMI using human evaluation across five dimensions: readability, empathy, connection, actionability, and safety. Our results show that, despite relying on smaller open-source models rather than proprietary cloud-based GPT models, LLUMI achieves comparable performance across linguistic analyses and human evaluations. These findings suggest that open-source models, when trained with community-derived preference signals, can support high-quality mental health support assistance while offering a more privacy-preserving alternative for sensitive support contexts.

2605.30265 2026-05-29 cs.CV cs.CL (version update)

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo

Affiliations: Zhejiang University; University of Science and Technology of China

AI summary: Proposes Canonical-Context On-Policy Distillation (CCOPD), which uses a teacher-student setup to align the model's behavior when information is revealed gradually with its behavior under the full prompt, reducing self-anchored drift; trained on multi-turn math conversations, it yields a 32% average relative improvement in RAW-SHARDED performance.

Abstract

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.

2605.30245 2026-05-29 cs.CL (version update)

Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning

Shaojie Wang, Liang Zhang

Affiliations: Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes the PPC framework, which adds an explicit problem-understanding stage (the preplan) to close the paradigm gap between "how to solve" and "what to solve" in existing plan-based reasoning methods, achieving the best results on multiple mathematical reasoning benchmarks.

Abstract

Current plan-based reasoning methods improve large language models (LLMs) by inserting a planning stage before execution, giving rise to the question $\rightarrow$ plan $\rightarrow$ CoT paradigm. While effective, a closer examination reveals an inherent paradigm-level gap: both the planning and execution stages decide how to solve a problem, while the prior question of what to solve (recognizing the problem type, the applicable tools, and the foreseeable pitfalls) remains entirely implicit. To bridge this gap, we propose PPC (Preplan-Plan-CoT), a framework that introduces an explicit problem-understanding stage, the preplan, yielding a new question $\rightarrow$ preplan $\rightarrow$ plan $\rightarrow$ CoT paradigm. Realizing this paradigm requires safeguarding the conceptual integrity of the preplan at both ends. Specifically, we design a three-stage synthesis pipeline with a spoiler-score detector that filters out leakage and spoiler failures to build clean preplan supervision, and a composite GRPO reward that enforces that the generated plan genuinely follows from the preplan. Experiments across four backbones and five mathematical reasoning benchmarks show that PPC achieves the best results on 39 of 40 metrics, improving maj@16 and pass@16 by +2.23 and +3.06 over the strongest baseline without introducing additional inference token overhead.
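
The abstract reports maj@16 and pass@16 without defining them. Under the common reading (an assumption here, since the paper's exact protocol is not given), pass@k counts a problem as solved if any of k sampled answers is correct, and maj@k counts it as solved if the most frequent of the k answers is correct. A minimal sketch with hypothetical function names:

```python
from collections import Counter

def pass_at_k(samples, reference):
    """True if at least one of the k sampled answers matches the reference."""
    return any(answer == reference for answer in samples)

def maj_at_k(samples, reference):
    """True if the most frequent sampled answer matches the reference.

    Ties go to the answer that appeared first among the samples.
    """
    majority_answer, _count = Counter(samples).most_common(1)[0]
    return majority_answer == reference

def benchmark_scores(problems):
    """problems: list of (samples, reference). Returns (pass@k, maj@k) in percent."""
    n = len(problems)
    passed = sum(pass_at_k(s, r) for s, r in problems)
    majority = sum(maj_at_k(s, r) for s, r in problems)
    return 100.0 * passed / n, 100.0 * majority / n
```

This is the empirical "any of k" version of pass@k computed from exactly k samples, not the unbiased estimator that draws more than k samples per problem.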

2605.30241 2026-05-29 cs.CL cs.CY cs.SI (version update)

GRUFF: Pronoun Fidelity, Reasoning, and Bias of LLMs in German

Fabian Mewes, Anne Lauscher, Vagrant Gautam

Affiliations: JobMatchMe GmbH; Trustworthy AI Lab; Heidelberg Institute for Theoretical Studies

AI summary: Builds GRUFF, a large-scale German dataset, to study LLM pronoun fidelity across four gender agreement systems and four sets of pronouns; finds that models show strong grammatical agreement for masculine and feminine entities without explicit context but not for the neopronouns xier and en, and that occupational stereotypes are poorly correlated across grammatical cases and across most models.

Abstract

Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated with the task of pronoun fidelity, which assesses models' abilities to correctly reuse a previously-specified pronoun for a discourse entity, independent of other potentially distracting discourse entities mentioned in between. However, such research focuses on English, which is a language with limited grammatical gender and almost no gender agreement. In this paper we contribute a novel, large-scale dataset, GRUFF, to measure pronoun fidelity in German, covering four different gender agreement systems in nouns, and four sets of pronouns. With this dataset, we show that LLMs show strong grammatical agreement for masculine and feminine entities in the absence of explicit context, but not for neopronouns xier and en. Models are generally not robust to distractors, but encoder-only models are more robust in German than in English, reflecting the importance of grammatical gender. Finally, we show that occupational stereotypes in this context are poorly correlated across grammatical cases, and across most models, except ones with closely related architectures. We release all code and data to encourage further work on gender-inclusive language and referential reasoning in German.

2605.30202 2026-05-29 cs.CL (version update)

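CorPipe at CRAC 2026: Empty Nodes and Cross-Lingual Transfer in Multilingual Coreference Resolution

Milan Straka

Affiliations: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

AI summary: Presents CorPipe 26, a system in which a single model jointly predicts empty nodes, mentions, and coreference links; it outperforms all other systems in the CRAC 2026 multilingual coreference resolution shared task, leading by 2.8 and 9.5 percentage points in the LLM track and the unconstrained track, respectively.

Comments: Accepted to CODI-CRAC 2026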

2605.30131 2026-05-29 cs.CL cs.CV (version update)

CCS: Clinical Consensus Selection for Radiology Report Generation

Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

Affiliations: School of Computing Science, University of Glasgow; School of Electrical and Computer Engineering, University of Sydney; Language Technology Lab, University of Cambridge

AI summary: Proposes the CCS framework, which samples multiple candidate reports and selects the one with the highest clinical consensus, improving radiology report generation quality at inference time.

Comments: 17 pages, 6 figures

Abstract

Radiology report generation (RRG) is commonly formulated as a single-path generation task, where a multimodal large language model (MLLM) produces one decoded report as the final output. While recent progress has largely been driven by scaling training data, model capacity, and retrieval mechanisms, improving report quality at inference time remains underexplored. In this work, we observe that fixed radiology MLLMs often generate clinically stronger reports elsewhere in their candidate pool than the one selected by default decoding, suggesting that inference-time decision making remains an overlooked bottleneck. To address this, we propose Clinical Consensus Selection (CCS), a decoder-agnostic inference-time selection framework that samples multiple candidate reports and selects the one with the highest clinical consensus across the rollout pool. CCS unifies text-based utilities with a radiology-adapted utility computed by an image--report-trained multimodal embedder, which measures candidate agreement beyond surface-level textual similarity. Across three datasets and multiple radiology MLLMs, CCS consistently improves inference-time performance over single-path decoding and generic Best-of-N baselines, with particularly clear gains on clinical metrics. Further analysis shows that image-grounded utility forms a selection axis distinct from textual consensus and that substantial headroom remains for improving RRG at inference time.
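
The selection step described in the abstract (sample several candidates, keep the one that agrees most with the rest of the pool) can be sketched as follows. The token-level Jaccard similarity is a deliberately simple stand-in for the paper's text-based and image-grounded utilities, which the abstract does not specify; the function names and example reports are hypothetical.

```python
def jaccard(a, b):
    """Token-set overlap between two reports (an assumed placeholder utility)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def consensus_select(candidates, utility=jaccard):
    """Return the index of the candidate with the highest mean utility
    against all other candidates in the pool."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    if len(candidates) == 1:
        return 0
    scores = []
    for i, candidate in enumerate(candidates):
        others = [utility(candidate, other)
                  for j, other in enumerate(candidates) if j != i]
        scores.append(sum(others) / len(others))
    # On ties, max() keeps the earliest candidate.
    return max(range(len(candidates)), key=scores.__getitem__)

reports = [
    "no acute cardiopulmonary abnormality",
    "no acute abnormality",
    "no acute cardiopulmonary abnormality seen",
    "large left pleural effusion",
]
best_index = consensus_select(reports)
```

Passing a different `utility` callable (for example, a similarity computed from embeddings) changes the notion of agreement without changing the selection loop.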

2605.30126 2026-05-29 cs.CV cs.AI cs.CL cs.LG (version update)

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

Affiliations: Max Planck Institute for Informatics; Google

AI summary: Proposes PARCEL, a visual tokenization architecture that uses pool anchoring and conditioned elastic query resampling to resolve the conflict between spatial and query representations in visual-token compression, improving the performance-efficiency Pareto frontier across 27 benchmarks.

Comments: 33 pages, 4 figures

Abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.

2605.30107 2026-05-29 cs.CL (version update)

Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking

Songbo Hu, Yinhong Liu, Ej Zhou, Evgeniia Razumovskaia, Xiaobin Wang, Alexander Fraser, Ivan Vulić, Anna Korhonen

Affiliations: Language Technology Lab, University of Cambridge; School of Computation, Information and Technology, Technical University of Munich; Independent Researcher

AI summary: Builds HEALTHDIAL, a large-scale multilingual, multi-parallel spoken dialogue dataset for developing retrieval-augmented-generation-based spoken dialogue systems, and reveals performance disparities across languages.

Comments: Accepted to Findings of ACL 2026

Abstract

Creating spoken dialogue datasets is methodologically challenging, and these challenges are amplified when the goal is to build multilingual, multi-parallel datasets at scale. This work introduces HEALTHDIAL, a large-scale, multilingual, and multi-parallel dataset for developing and evaluating retrieval-augmented generation (RAG)-based spoken dialogue systems. The dataset comprises 6,000 information-seeking dialogues (1,500 per language) grounded in trusted content from the World Health Organization (WHO) and 163 hours of user speech recorded from native speakers of diverse dialects across four official WHO languages: Arabic, Chinese, English, and Spanish. Each speaker is annotated with demographic (e.g., gender, age) and sociolinguistic (e.g., primary language, region of origin) variables. We report benchmark results across key dialogue tasks, which reveal consistent performance disparities across languages, even among high-resource ones. To support future research, we release the dataset, a prototype system, and a toolkit for data collection and system evaluation.

2605.30104 2026-05-29 cs.CL (version update)

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?

Jiamin Chen, Yidi Wu, Qiexiang Wang, Qianben Chen, Yuchen Li, Yansen Zhang, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

Affiliations: ByteDance Inc.; City University of Hong Kong

AI summary: Proposes the SEAL protocol, which uses an adaptive LLM meta-judge to extract latent ranking signal from saturated benchmarks, achieving high ranking accuracy with fewer calls on tasks such as code generation and mathematical reasoning.

Abstract

Widely used language-model benchmarks are increasingly saturated, with frontier systems often receiving near-tied scores that standard metrics cannot resolve. Rather than constructing harder alternatives, we ask whether existing tasks can be made informative again through improved evaluation over the same candidate outputs. Therefore, we present Seeded Elimination with Adaptive LLM-as-a-Meta-Judge, a self-improving evaluation protocol for extracting latent ranking signal from saturated benchmarks. SEAL seeds candidate outputs into a single elimination and evaluates each match with task-level principles plus self-improving checklist criteria. We evaluate SEAL on multiple saturated benchmarks covering code generation, mathematical reasoning, knowledge-intensive question answering, and tool-use agent task completion. Across these settings, SEAL improves the ranking-accuracy--latency trade-off over competing protocols, attaining 0.83--1.00 Spearman agreement with full pairwise judging and 4/4 top-1 agreement, while requiring only 11.89 calls per task compared with 28.00 for full pairwise evaluation.
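
The call savings reported in the abstract follow from the bracket structure: a single-elimination tournament over n candidates needs n - 1 judged matches, whereas judging every pair once needs n(n - 1)/2. The sketch below shows a seeded bracket with a pluggable judge. The seeding scheme (top seed against bottom seed, a bye for the middle seed when the count is odd) and the function names are assumptions; the abstract does not describe how SEAL seeds candidates or how its checklist-based judge is called. Note that 28 pairwise calls would correspond to 8 candidates if each pair is judged once, which is an inference rather than something the abstract states, and the reported 11.89 calls per task presumably include calls this sketch does not model.

```python
def single_elimination(seeded, judge):
    """Run a single-elimination bracket over candidates given in seed order.

    judge(a, b) must return the winning candidate.
    Returns (champion, number_of_judged_matches).
    """
    if not seeded:
        raise ValueError("no candidates")
    current = list(seeded)
    matches = 0
    while len(current) > 1:
        winners = []
        low, high = 0, len(current) - 1
        while low < high:
            # Pair the strongest remaining seed with the weakest one.
            winners.append(judge(current[low], current[high]))
            matches += 1
            low += 1
            high -= 1
        if low == high:
            # Odd count: the middle seed advances without a match.
            winners.append(current[low])
        current = winners
    return current[0], matches

def full_pairwise_matches(n):
    """Number of matches when every pair of n candidates is judged once."""
    return n * (n - 1) // 2
```

With a judge that always prefers the higher score, the bracket returns the best candidate after n - 1 matches; with a noisy judge, a strong candidate can be eliminated early, which is the usual trade-off against full pairwise comparison.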

2605.30090 2026-05-29 cs.CL cs.CV 版本更新

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

DirectorBench: 通过个性化多智能体评估诊断长视频生成

Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

发表机构 * ByteDance Inc.（字节跳动公司）； City University of Hong Kong（香港城市大学）

AI总结提出DirectorBench，一种基于多智能体的诊断基准，通过80个结构化元数据、7个用户画像和40个检查点标准，在脚本、视觉、音频、跨模态和稳定性五个维度上评估长视频生成，并定位瓶颈和用户偏好依赖。

AI中文摘要

长视频生成正从短的单场景合成快速转向分钟级、多镜头的创作，具有叙事结构、电影控制、音频和跨模态同步。然而，评估此类视频仍然具有挑战性，因为现有基准主要关注局部视觉质量、短期时间一致性或通用提示对齐，并且对工作流故障和用户依赖偏好的诊断有限。我们引入了DirectorBench，一个用于长视频生成的个性化多智能体诊断基准。DirectorBench根据80个结构化元数据、7个用户画像和40个检查点标准，在脚本、视觉、音频、跨模态和稳定性五个维度上评估生成的视频。DirectorBench不将质量简化为单一聚合分数，而是定位检查点级别的瓶颈并支持画像感知评估。我们评估了4个长视频生成工作流、6个基础LLM和7个用户画像。在不同工作流中，DirectorBench揭示了一个单元间瓶颈：过渡质量平均仅为0.256，最佳工作流达到0.356，而提示级别的用户需求满足度平均为0.71。我们进一步进行了14名标注者的人工评估，以验证DirectorBench与人类判断的一致性。结果表明，DirectorBench捕捉到了人类可感知的质量差异，并揭示了聚合评分所隐藏的工作流和画像依赖的故障模式。这些发现强调了长视频生成中诊断性和画像感知基准的重要性。

英文摘要

Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.

2605.30085 2026-05-29 cs.AI cs.CL cs.LG stat.ML 版本更新

Conformal Certification of Reasoning Trace Prefixes

推理轨迹前缀的保形认证

Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan

发表机构 * Department of Electrical & Computer Engineering, Rice University（莱斯大学电气与计算机工程系）

AI总结提出CROP方法，通过保形校准选择阈值，返回最长无错前缀，并控制错误包含概率，平衡保留有效推理与丢弃误导后缀。

Comments Code available at https://github.com/matthewyccheung/crop

AI中文摘要

语言模型推理轨迹很少是全有或全无；在关键错误发生之前，它们通常包含有效的中间步骤。现有的不确定性量化方法通常认证最终答案或整个响应，未能为顺序轨迹中可安全保留的比例提供统计保证。为了解决这个问题，我们引入了CROP（保形推理输出前缀），一种与验证器无关的校准程序，用于干净前缀认证。给定任何步骤级风险代理，CROP选择一个校准阈值，并返回其步骤风险代理保持低于该阈值的最长连续前缀，将未认证的后缀路由到下游审查或修复。假设可交换性，CROP严格控制了返回前缀包含注释错误的边际概率。在六个过程标记的推理数据集上，我们证明了标准步骤级指标（如AUROC）不能完全捕捉前缀效用，建议验证器应改为通过认证前缀长度进行评估。此外，CROP平衡了过度保留和不足保留，通过保留有效的中间推理同时丢弃误导后缀，提高了下游修复的准确性。最终，这项工作将前缀认证定位为过程监督、弃权和修复之间的严格、实用的桥梁。

英文摘要

Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
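
The threshold-and-prefix idea can be sketched as follows. This is one standard split-conformal construction written under stated assumptions, not necessarily the calibration CROP itself uses: each calibration trace is scored by the largest step risk up to and including its first annotated error, and the threshold is the k-th smallest such score with k = floor(alpha * (n + 1)), for alpha in (0, 1). Under exchangeability, a test trace's error should then land inside the returned prefix with probability at most k / (n + 1), which is at most alpha. All function names and the data layout are hypothetical.

```python
import math


def first_error_score(risks, first_error):
    """Largest step risk up to and including the first annotated error.

    `risks` are per-step risk proxies; `first_error` is the 0-based index
    of the first erroneous step, or None for an error-free trace.
    A threshold tau lets the error into the prefix exactly when tau > score.
    """
    if first_error is None:
        return math.inf  # no error can ever be included
    return max(risks[: first_error + 1])


def calibrate_threshold(calibration, alpha):
    """Pick tau from (risks, first_error) pairs at target error rate alpha."""
    scores = sorted(first_error_score(r, e) for r, e in calibration)
    k = math.floor(alpha * (len(scores) + 1))
    if k == 0:
        return -math.inf  # too little data: certify nothing
    return scores[k - 1]  # k-th smallest calibration score


def certified_prefix(risks, tau):
    """Longest contiguous prefix whose step risks all stay strictly below tau."""
    length = 0
    for r in risks:
        if r >= tau:
            break
        length += 1
    return risks[:length]
```

Everything after the returned prefix is the uncertified suffix, which would be routed to review or repair.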

2605.30080 2026-05-29 cs.CL 版本更新

REPOT：通过检查点修复实现可恢复的思维程序

Parsa Mazaheri

发表机构 * University of California, Santa Cruz（加州大学圣克鲁兹分校）

AI总结提出 RePoT 方法，通过确定性验证重放和 LLM 调用从验证前缀恢复，以解决 Program-of-Thought 中单个无效动作导致轨迹失效的问题，在多个模型和基准上提升成功率。

AI中文摘要

单次 Program-of-Thought (PoT) 生成一个打印基本动作计划的 Python 程序；单个无效动作会无声地使轨迹失效。我们引入 RePoT (可恢复 PoT)：一种确定性验证重放，它将计划遍历环境直到第一个无效转换，然后通过一次 LLM 调用从验证前缀恢复。在 PoT 失败的约 14% 的问题上，RePoT 最多增加一次 LLM 调用。在 PuzzleZoo-775 上，RePoT 在四种闭模型配置上比 PoT 提高 +3 到 +11 个百分点，在 gpt-5.4-mini-medium 上达到 96.9% 对比 86.3% 的峰值；与预算匹配的 PoT-retry 基线相比，RePoT 在 Gemini 上明显获胜（+3.8pp，95% CI [+2.2,+5.4]），在 GPT-medium 和 Claude 上处于采样噪声范围内，在 GPT-mini 上失败——这是一种能力扩展模式，我们开始通过自适应 RePoT 解决，这是一种基于规则的调度器，根据验证前缀长度在后缀修复和全新 PoT 重试之间路由（初步）。我们在 PlanBench Blocksworld 上复现（+1.1 到 +11.4pp），在四个开放权重模型上（四个中的三个 +3.3 到 +20.0pp）。在 Derail-550（我们的受控恢复基准）上，每个能够访问检查点信息的条件在 GPT-medium 上达到 >=30%，在 Gemini 上达到 >=70%，而仅错误反馈条件 <=3.1%——表明检查点信息（而非特定的验证前缀尾部）是承载恢复的信号。

英文摘要

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini -- a capability-scaling pattern we begin to address with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary). We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback -- showing that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
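
The verified-replay step can be illustrated with a minimal sketch. It assumes a deterministic environment exposed as a `step(state, action)` function that returns `None` for an invalid action; that interface, the toy grid environment, and all names below are hypothetical and not taken from the paper.

```python
def verified_replay(plan, state, step):
    """Replay `plan` from `state`, stopping at the first invalid action.

    `step(state, action)` is a hypothetical deterministic environment
    function returning the next state, or None if the action is invalid.
    Returns (verified_prefix, checkpoint_state, failed_action); the last
    item is None when the whole plan is valid.
    """
    prefix = []
    for action in plan:
        nxt = step(state, action)
        if nxt is None:
            return prefix, state, action
        prefix.append(action)
        state = nxt
    return prefix, state, None


def grid_step(state, action):
    """Toy environment: move on a 3x3 grid; leaving the grid is invalid."""
    moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
    if action not in moves:
        return None
    x, y = state[0] + moves[action][0], state[1] + moves[action][1]
    return (x, y) if 0 <= x < 3 and 0 <= y < 3 else None
```

In a RePoT-style pipeline, the verified prefix and checkpoint state would then be passed to a single repair LLM call; that call is not modeled here.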

2605.30051 2026-05-29 cs.CL cs.CY 版本更新

Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues

我是谁？面向辅导对话中学生模拟的历史感知档案

Zhangqi Duan, Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Simon Woodhead, Andrew Lan

发表机构 * University of Massachusetts Amherst（马萨诸塞大学阿默斯特分校）； Eedi

AI总结提出历史条件的学生模拟任务，通过强化学习训练档案生成器和模拟器，利用学生历史信息准确预测对话轮次，在数学学习平台数据集上显著优于基线。

AI中文摘要

开发基于大型语言模型（LLM）的自动化辅导工具的一个关键部分是学生模拟，即使用LLM扮演学生角色，这可以促进辅导模型的评估和训练。现有工作主要关注对话内模拟，缺乏关于学生知识和行为的上下文，部分原因是没有基于过去的学生问答或对话交互。在这项工作中，我们引入了历史条件的学生模拟任务，其目标是通过利用学生学习历史中的信息准确预测学生对话轮次。我们提出了一个双组件框架，其中档案生成器总结学生历史，模拟器基于生成的档案预测学生轮次。我们使用强化学习（RL）训练这两个组件，生成针对忠实学生模拟优化的档案。我们在从数学学习平台收集的首个真实世界学生对话和问答响应数据集上评估了我们的方法和基线。大量实验表明，我们的方法显著优于基线，并证明了历史、档案和RL训练的重要性。

英文摘要

A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training. Existing work mostly focuses on within-dialogue simulation, which lacks context on student knowledge and behavior, partly due to not grounding in past student question-answering or dialogue interactions. In this work, we introduce the task of history-conditioned student simulation, where the goal is to accurately predict student dialogue turns by leveraging information in the student's learning history. We propose a two-component framework in which a profile generator summarizes a student's history and a simulator predicts student turns conditioned on the resulting profile. We train both components with reinforcement learning (RL), yielding profiles optimized for faithful student simulation. We evaluate our method and baselines on the first-of-its-kind real-world dataset of student dialogues and question responses that we collect from a math learning platform. Extensive experiments show that our method significantly outperforms baselines, and demonstrate the importance of history, profiles, and RL training.

2605.30040 2026-05-29 cs.CR cs.AI cs.CL 版本更新

连续变量的因果干预：以上下文学习中转向向量的动词偏向为例

Zhenghao Herbert Zhou, R. Thomas McCoy, Robert Frank

发表机构 * Yale University（耶鲁大学）

AI总结提出一种对连续变量进行因果干预的方法，通过定位低维方向并编辑向量实现反事实目标值，应用于动词偏向特征，证明其在语言模型中的因果表示，并探讨与上下文学习的关系。

AI中文摘要

语言模型表示中的因果干预主要针对离散特征，如语法数。然而，语言模型也必须利用分级特征。我们引入了一种对连续变量进行因果干预的方法：给定与分级目标变量配对的激活向量，我们定位该变量的低维方向，并使用该方向将向量编辑为反事实目标值。我们将此方法应用于心理语言学中研究充分的连续特征，即动词偏向（反映给定动词后倾向于出现哪种句法结构）。我们表明，动词偏向因果地表示在从大型语言模型中提取的转向向量中：对动词偏向的反事实编辑系统地改变了下游结构偏好。动词偏向此前也与上下文学习相关联；在进一步分析中，我们发现转向向量编码了可能驱动上下文学习中观察到的误差驱动更新行为的误差信号，但这些转向向量的方面在下游生成中并未被因果使用。总体而言，这些结果表明因果干预可以应用于连续变量，尽管将连续变量与上下文学习联系起来仍然是一个挑战。

英文摘要

Causal interventions in language model representations have largely targeted discrete features, like grammatical number. However, language models must also make use of features that are graded. We introduce a method for causal intervention on continuous variables: given activation vectors paired with a graded target variable, we localize a low-dimensional direction for that variable and use this direction to edit vectors toward counterfactual target values. We apply this method to a continuous feature that is well-studied in psycholinguistics, namely verb bias (which reflects which syntactic structures tend to follow a given verb). We show that verb bias is causally represented in steering vectors extracted from large language models: counterfactual edits to verb bias systematically shift downstream structural preferences. Verb bias has also previously been linked to in-context learning; in further analyses, we find that steering vectors encode error signals that could drive the error-driven update behavior seen in in-context learning, but that these aspects of the steering vectors are not causally used in downstream production. Overall, these results show that causal interventions can be applied to continuous variables, though connecting continuous variables to in-context learning remains a challenge.

2605.29951 2026-05-29 cs.AI cs.CL cs.LG cs.MM 版本更新

MELD: 基于梅尔频谱的离散潜变量语音语言建模

Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng, Hao Tang, Jay Mahadeokar, Ozlem Kalinli, Alexandre Mourachko

发表机构 * University of Edinburgh（爱丁堡大学）； Google DeepMind（谷歌DeepMind）； Meta Superintelligence Labs（Meta超智能实验室）

AI总结提出一种在梅尔频谱上联合优化编码器和语音语言模型的离散潜变量模型，在零样本文本转语音和语音转文本任务上优于基于编解码器和其他梅尔频谱基线，并缓解了自回归建模中的长时间静音和单词遗漏问题。

2605.29847 2026-05-29 cs.CL 版本更新

EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation

EvoRubric: 用于开放生成的自我进化评分标准驱动强化学习

Xin Guan, Xiaomeng Hu, Shen Huang, Zhenyi Wang, Bo Zhang, Zijian Li, Pengjun Xie, Bo Liu, Jiuxin Cao

发表机构 * Tongyi Lab, Alibaba Group（通义实验室，阿里巴巴集团）

AI总结提出EvoRubric，一种单策略共进化强化学习框架，通过动态交替生成响应和评分标准，并引入多层验证机制，解决开放生成任务中缺乏明确奖励的问题，在医学、写作和科学领域超越传统静态和外部LLM驱动方法。

AI中文摘要

强化学习（RL）在可验证领域显著提升了大型语言模型（LLM），但由于缺乏明确的奖励，为开放生成任务对齐模型仍然极具挑战性。当前的基于评分标准的RL方法通过使用显式标准来缓解这一问题；然而，它们严重依赖于静态的人工标注评分标准，这不可避免地导致策略滞后，或者依赖昂贵的外部专有模型进行动态更新。在本文中，我们提出了EvoRubric，一种新颖的单策略共进化RL框架，消除了对静态标准和外部评分标准生成器的依赖。通过将响应生成和评分标准生成统一在单一参数化策略下，EvoRubric在推理器和评分标准生成器之间动态交替。为了防止奖励黑客攻击并确保生成信号的可信度，我们引入了一个多层验证流程，包括元验证器、零方差剪枝和留一法同行共识机制。经过验证的标准被动态归档到记忆池中，产生密集的多目标奖励，以持续共同优化两个角色。在医学、写作和科学领域的广泛实验表明，EvoRubric始终优于传统的静态和外部LLM驱动的对齐方法。值得注意的是，我们的框架与人类专家先验知识兼容。当使用专家标注的评分标准初始化时，EvoRubric能够进一步发现新颖的、有区分度的维度，从而实现比仅依赖静态专家标注更好的性能。

英文摘要

Reinforcement Learning (RL) has significantly advanced Large Language Models (LLMs) in verifiable domains, but aligning models for open-ended generation remains profoundly challenging due to the lack of definitive rewards. Current rubric-based RL methods mitigate this by employing explicit criteria; however, they rely heavily on static, human-annotated rubrics that inevitably cause policy lag, or expensive external proprietary models for dynamic updates. In this paper, we propose EvoRubric, a novel single-policy co-evolutionary RL framework that eliminates the reliance on static criteria and on external rubric generators. By unifying response generation and rubric generation under a single parameterized policy, EvoRubric dynamically alternates between a Reasoner and a Rubric Generator. To prevent reward hacking and ensure the reliability of generated signals, we introduce a multi-level verification pipeline featuring a meta-verifier, zero-variance pruning, and a Leave-One-Out peer consensus mechanism. Validated criteria are dynamically archived into a memory pool, yielding dense, multi-objective rewards to continuously co-optimize both roles. Extensive experiments across Medical, Writing, and Science domains demonstrate that EvoRubric consistently outperforms traditional static and external-LLM-driven alignment methods. Notably, our framework is compatible with human-expert priors. When initialized with expert-annotated rubrics, EvoRubric can further uncover novel, discriminative dimensions, achieving better performance than relying solely on static expert annotations.

2605.29826 2026-05-29 cs.CL cs.AI 版本更新

Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models

面向多模态大语言模型的局部化与解耦知识编辑

Leijiang Gu, Zhen Zeng, Feng Li, Xinjian Gao, Zenglin Shi

发表机构 * Hefei University of Technology（合肥工业大学）； Tongji University（同济大学）

AI总结针对多模态知识编辑中因果错位和特征纠缠问题，提出LDKE框架，通过快速定位关键层和解耦分类器实现精准泛化编辑并保持高局部性。

AI中文摘要

现有的多模态知识编辑（MKE）方法在纠正多模态大语言模型（MLLMs）中过时或不准确的知识方面取得了进展。然而，它们存在一个关键局限性：虽然能有效修改目标事实对，但无法将编辑泛化到逻辑相关的查询，并且常常对无关但视觉或语义上关联的信息造成意外改变。我们识别并形式化了导致该问题的两种潜在失败模式：因果错位（将编辑限制在特定样本）和特征纠缠（对耦合但无关的信息造成意外改变）。为解决这些问题，我们提出局部化与解耦知识编辑（LDKE），一种通过定位事实特定模型层并将目标相关输入与无关输入解耦来实现精确和泛化编辑的新框架。我们的方法引入快速定位模块以高效识别和更新关键层，以及解耦分类器以适当路由输入从而保留无关知识。在各种基准和MLLMs上的大量实验表明，LDKE在将编辑传播到相关上下文方面实现了优越性能，同时保持了高局部性。

英文摘要

Existing methods in Multimodal Knowledge Editing (MKE) have advanced the ability to correct outdated or inaccurate knowledge in Multimodal Large Language Models (MLLMs). However, they exhibit a critical limitation: while effectively modifying target factual pairs, they fail to generalize edits to logically related queries and often cause unintended alterations to unrelated but visually or semantically linked information. We identify and formalize two underlying failure modes causing this issue: Causal Misalignment, which confines edits to the specific sample, and Feature Entanglement, which causes unintended alterations to coupled but irrelevant information. To address these issues, we propose Localized and Disentangled Knowledge Editing (LDKE), a new framework that achieves precise and generalized editing by localizing fact-specific model layers and disentangling target-relevant inputs from irrelevant ones. Our approach introduces a Fast Localization module to identify and update critical layers efficiently, along with a Disentanglement Classifier that routes inputs appropriately to preserve unrelated knowledge. Extensive experiments across various benchmarks and MLLMs demonstrate that LDKE achieves superior performance in propagating edits to related contexts while maintaining high locality.

2605.29815 2026-05-29 cs.AI cs.CL 版本更新

ActTraitBench: 通过人类行为验证量化大型语言模型中的知识-决策差距

Yutong Yang, Chenxi Miao, Weikang Li, Yunfang Wu

发表机构 * Peking University（北京大学）； Baidu Inc（百度公司）

AI总结提出ActTraitBench框架，基于人类数据建立心理测量方面与行为范式的一一映射，并通过分位数映射校准LLM评分分布，揭示LLM在自我报告与行为决策之间的知识-决策差距，并引入CoCA干预来缓解该差距。

AI中文摘要

虽然大型语言模型（LLM）在显式自我报告中能够令人信服地模拟人格，但它们在隐式行为决策中常常出现偏差，揭示了显著的知识-决策差距（$G_{\text{KD}}$）。现有的基准由于结构效度有限、多维度纠缠以及基于LLM评估中的分布偏差，难以衡量这种不对称性。为了解决这些问题，我们提出了ActTraitBench，一个基于人类数据的评估框架，用于衡量LLM中的人格一致性。基于经验人类数据，ActTraitBench建立了心理测量方面与行为范式之间的一一映射，并应用通过分位数映射的分布校准程序，使LLM评判者的分数分布与人类规范对齐。在14个主流LLM上的实验揭示了普遍的知识-决策不对称性，其中更大、能力更强的模型尽管自我报告高度一致，但往往表现出更强的行为分歧。为了缓解这一差距，我们进一步引入了认知对齐链（CoCA），一种即插即用的推理时干预措施，可改善具有推理能力的前沿模型的对齐，同时暴露出较小架构中明显的能力限制。

英文摘要

While Large Language Models (LLMs) can convincingly simulate personas in explicit self-reports, they often deviate in implicit behavioral decisions, revealing a substantial Knowledge-Decision Gap ($G_{\text{KD}}$). Existing benchmarks struggle to measure this asymmetry due to limited construct validity, multi-dimensional entanglement, and distributional biases in LLM-based evaluation. To address these issues, we propose ActTraitBench, a human-grounded evaluation framework for measuring personality consistency in LLMs. Grounded in empirical human data, ActTraitBench establishes one-to-one mappings between psychometric facets and behavioral paradigms, and applies a Distributional Calibration via Quantile Mapping procedure to align LLM-judge score distributions with human norms. Experiments on 14 mainstream LLMs reveal a pervasive knowledge-decision asymmetry, where larger and more capable models often exhibit stronger behavioral divergence despite highly consistent self-reports. To mitigate this gap, we further introduce the Chain of Cognitive Alignment (CoCA), a plug-and-play inference-time intervention that improves alignment in reasoning-capable frontier models while exposing clear capability limitations in smaller architectures.
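
Quantile mapping in general can be sketched in a few lines. The example below is a generic empirical version, assuming small reference samples of LLM-judge scores and human scores; it is not the paper's calibration procedure, and the function name and inputs are hypothetical.

```python
from bisect import bisect_right


def quantile_map(score, judge_scores, human_scores):
    """Map one LLM-judge score onto the human score scale.

    The score's empirical quantile among `judge_scores` is looked up in
    the sorted `human_scores`, so calibrated scores follow the human
    distribution. Both reference samples are hypothetical inputs.
    """
    judge_sorted = sorted(judge_scores)
    human_sorted = sorted(human_scores)
    n_judge, n_human = len(judge_sorted), len(human_sorted)
    # Number of judge reference scores that are <= the given score.
    rank = bisect_right(judge_sorted, score)
    # Matching position in the human sample: ceil(rank * n_human / n_judge) - 1,
    # computed in integers and clamped to a valid index.
    idx = (rank * n_human + n_judge - 1) // n_judge - 1
    idx = min(n_human - 1, max(0, idx))
    return human_sorted[idx]
```

For instance, if a judge's scores cluster between 6 and 10 while human ratings span 1 to 5, a judge score at the median of its own distribution maps to the human median rather than to a high rating.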

2605.29782 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

Hista 和 Numca：为 LLM 强化学习有效估计状态值

Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang, Yongqiang Chen, Zhitang Chen, James Cheng

发表机构 * Department of Computer Science and Engineering, The Chinese University of Hong Kong（香港中文大学计算机科学与工程系）； Huawei Technologies Ltd（华为技术有限公司）

AI总结针对 LLM 强化学习中状态值估计不准确的问题，提出 Numca（利用数值跨度作为可分级里程碑）和 Hista（利用隐藏状态加权平均不连续轨迹及其回报）两种方法，显著提升估计精度和训练性能。

Comments Accepted at ICML 2026

AI中文摘要

强化学习（RL）通过奖励信号直接优化模型行为来改进大型语言模型（LLMs）。虽然在经典RL中准确的状态值估计对于稳定训练至关重要，但在LLM后训练中这仍是一个未被充分探索的挑战。在这项工作中，我们引入了状态值估计基准（SVEB）来评估现有RL框架中的状态估计，并展示了像PPO这样的标准方法中的评论家会退化为粗糙的组平均基线。为了解决这个问题，我们提出了两种技术：Numca，它利用数值跨度作为可分级里程碑进行状态值估计；以及Hista，一个使用LLM的隐藏状态作为表示来加权平均不连续轨迹及其回报的框架。大量实验表明，这两种方法都能产生更准确的状态值估计，并在不同的RL算法和模型大小上提升训练性能，而不会产生显著的计算开销。

英文摘要

Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state value estimation within existing RL frameworks and show that critics in standard approaches like PPO collapse to a coarse group-average baseline. To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses the LLM's hidden states as representations to compute a weighted average over disjoint rollouts and their returns. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.
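
The idea of averaging returns from other rollouts, weighted by hidden-state similarity, can be illustrated with a small sketch. The weighting rule used here (a softmax over cosine similarities with a temperature) is an assumption chosen for illustration; the abstract does not specify how Hista computes its weights, and all names below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_value(query, states, returns, temperature=0.1):
    """Estimate a state's value from the returns of other rollouts.

    `query` is the hidden-state vector of the state being valued;
    `states` and `returns` hold hidden-state vectors and observed returns
    from other rollouts. Weights are a softmax over cosine similarities,
    which is an assumed choice, not the paper's stated rule.
    """
    sims = [cosine(query, s) / temperature for s in states]
    top = max(sims)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in sims]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, returns)) / total
```

A lower temperature concentrates the estimate on the most similar rollouts, while a high temperature approaches a plain average over all of them.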

2605.29744 2026-05-29 cs.AI cs.CL cs.LG cs.MA 版本更新

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

为什么专家模型仍然重要：面向医学人工智能的异构多智能体范式

Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang

发表机构 * Anthropic AI

AI总结提出HetMedAgent异构多智能体框架，通过冲突感知证据融合、不确定性驱动的临床医生干预触发和自适应阈值校准，实现通用大语言模型与领域专家模型的协同，在三个临床决策任务中验证了专家模型在模态特定分析中的不可替代价值。

Comments Accepted at ICML 2026. 12 pages main text, 16 pages appendix

AI中文摘要

GPT和Claude等通用大语言模型在医疗保健领域的出色表现引发了一个关键问题：特定领域的医学专家模型是否会变得过时？我们认为，医学人工智能的未来不在于构建单一的医学基础模型，也不在于取代人类专业知识，而在于协调通用大语言模型、领域特定专家模型和临床医生之间的协作。我们提出HetMedAgent，一个异构医学多智能体框架，能够实现冲突感知证据融合、基于不确定性的临床医生干预触发和自适应阈值校准。在三个真实世界临床决策任务上的实验表明，通用大语言模型与领域特定专家模型之间的协同显著优于单独使用任一类型模型，验证了专家模型在模态特定分析中的不可替代价值。HetMedAgent代表了从构建医学大语言模型或基础模型向多智能体协作的转变，实现了通用推理能力与领域特定精度之间的平衡。

英文摘要

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.

2605.29741 2026-05-29 cs.CL 版本更新

AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation

AfriScience-MT：通过文本翻译实现非洲科学去殖民化

Idris Abdulmumin, Tajuddeen Gwadabe, Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Nomonde Khalo, Ibrahim Said Ahmad, Abiodun Modupe, Anina Mumm, Sibusiso Biyela, Michelle Rabie, Johanna Havemann, Marek Rei, Jade Abbott, Vukosi Marivate

发表机构 * Data Science for Social Impact, University of Pretoria（数据科学与社会影响，南非比勒陀利亚大学）； Masakhane Research Foundation（马萨克纳研究基金会）； Imperial College London（伦敦帝国理工学院）； Mila, McGill University（麦吉尔大学Mila实验室）； Canada CIFAR AI Chair（加拿大CIFAR人工智能主席）； University of Cape Town（开普敦大学）； University of Wisconsin - Stevens Point（威斯康星大学斯蒂文斯点分校）； Independent Consultant（独立顾问）； University of South Africa（南非大学）； Independent Researcher（独立研究员）； Access 2 Perspectives ； Lelapa AI（Lelapa人工智能）

AI总结针对非洲语言缺乏科学术语的问题，构建包含6种非洲语言、11个科学领域的平行语料库AfriScience-MT，并评估机器翻译和大型语言模型在零样本、少样本和微调设置下的性能。

AI中文摘要

殖民语言在非洲教育和科学传播中的主导地位限制了数亿非洲语言使用者获取和产生科学知识的能力。一个核心障碍是这些语言缺乏既定的科学术语。我们引入了AfriScience-MT，这是一个涵盖六种非洲语言（阿姆哈拉语、豪萨语、卢干达语、北索托语、约鲁巴语和祖鲁语）和11个科学领域的平行语料库。专业翻译人员与科学传播专家合作，将科学论文的通俗语言摘要翻译成每种目标语言，并在没有现成术语的地方创建新术语。我们在零样本、少样本和微调设置下对机器翻译系统和大型语言模型进行了基准测试。结果表明，在句子和文档层面，闭源模型均优于所有开源模型：GPT-5.4和Gemini-3.1-Flash-Lite领先，平均句子级COMET得分分别为68.3和68.0，平均文档级COMET得分均为48.3。在开源系统中，微调的NLLB-1.3B在句子级达到67.3，TranslateGemma-12B在1-shot上下文学习下文档级达到44.0。我们发布AfriScience-MT以支持非洲语言的基准测试和文档级科学机器翻译。

英文摘要

The dominance of colonial languages in African education and scientific communication limits how hundreds of millions of speakers of African languages access and produce scientific knowledge. A core obstacle is the lack of established scientific terminology in these languages. We introduce AfriScience-MT, a parallel corpus covering six African languages (Amharic, Hausa, Luganda, Northern Sotho, Yorùbá, and isiZulu) across 11 scientific domains. Professional translators, working with expert science communicators, translated plain-language summaries of scientific papers into each target language and created new terms where none existed. We benchmark machine translation systems and large language models in zero-shot, few-shot, and fine-tuned settings. Our results show that closed-source models outperform all open-source models at both the sentence and document levels: GPT-5.4 and Gemini-3.1-Flash-Lite lead with average sentence-level COMET scores of 68.3 and 68.0, respectively, and tie at an average document-level COMET of 48.3. Among open systems, fine-tuned NLLB-1.3B reaches 67.3 at the sentence level, and TranslateGemma-12B reaches 44.0 at the document level with 1-shot in-context learning. We release AfriScience-MT to support benchmarking and document-level scientific MT for African languages.

2605.29738 2026-05-29 cs.CL cs.AI 版本更新

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Multi-Legal-Bench: 跨司法管辖区、语言和法律传统的法律推理评估LLM

Volodymyr Ovcharov

发表机构 * SecondLayer

AI总结提出Multi-Legal-Bench，首个跨司法管辖区法律基准，在6个国家、4个语系和1.34亿份法院判决上评估LLM，发现少样本效果跨辖区复制、无单一模型主导所有语言、跨语言迁移不遵循语言邻近性、分词器效率不显著预测跨语言准确率。

Comments 14 pages, 5 figures, 8 tables. Dataset: https://huggingface.co/datasets/overthelex/multi-legal-bench

AI中文摘要

法律NLP基准绝大多数评估单一语言或汇总跨司法管辖区根本不同的任务，使得跨语言比较不可能。我们引入Multi-Legal-Bench，首个跨司法管辖区法律基准，在六个国家（乌克兰、法国、荷兰、波兰、捷克共和国、立陶宛）、四个语系和1.34亿份法院判决上评估相同任务。该基准定义了五个任务——法院类型分类、判决形式分类、案件结果预测、法律规范提取和原因类别预测——映射到来自国家法院登记处的结构化元数据，形成一个故意稀疏的5x6任务-司法管辖区矩阵（30个单元格中填充20个）。我们通过AWS Bedrock在零样本和3样本提示下评估7个前沿LLM，并额外使用4个小/中型模型（3-12B）进行规模分析。我们的结果显示：（1）在乌克兰发现的依赖任务的少样本效果在所有司法管辖区复制；（2）没有单一模型主导任何语言——排名随任务和司法管辖区而变化；（3）跨语言少样本迁移不遵循语言邻近性：UA->FR（罗曼语族，-2.1个百分点）迁移优于UA->PL（斯拉夫语族，-13.7个百分点），标签集对齐比语系更能预测迁移质量；（4）分词器生育率尽管有2.3倍的差异，并不能显著预测跨语言准确率（r=-0.27，p=0.14），表明模型架构和预训练数据主导分词器效率。我们发布所有数据、提示和模型预测。

英文摘要

Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdictional legal benchmark that evaluates identical tasks across six countries (Ukraine, France, Netherlands, Poland, Czech Republic, Lithuania), four language families, and 134 million court decisions. The benchmark defines five tasks (court-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction) mapped to structured metadata from national court registries, forming a deliberately sparse 5x6 task-jurisdiction matrix (20 of 30 cells filled). We evaluate 7 frontier LLMs under zero-shot and 3-shot prompting via AWS Bedrock, with 4 additional small/medium models (3-12B) for scaling analysis. Our results reveal that: (1) task-dependent few-shot effects discovered in Ukrainian replicate across all jurisdictions; (2) no single model dominates any language; rankings shift with both task and jurisdiction; (3) cross-lingual few-shot transfer does not follow language proximity: UA->FR (Romance, -2.1 pp) transfers better than UA->PL (Slavic, -13.7 pp), with label-set alignment predicting transfer quality better than language family; and (4) tokenizer fertility, despite a 2.3x spread, does not significantly predict cross-lingual accuracy (r=-0.27, p=0.14), suggesting that model architecture and pretraining data dominate tokenizer efficiency. We release all data, prompts, and model predictions.

2605.29737 2026-05-29 cs.CR cs.CL cs.SE 版本更新

Minimal Prompt Perturbations Lead to Code Vulnerabilities: Prompt Fragility and Hidden-State Signals in Coding LLMs

最小提示扰动导致代码漏洞：编码大语言模型中的提示脆弱性和隐藏状态信号

Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic

发表机构 * IEM, HES-SO, Le Foyer, Techno-Pôle 1, Sierre, Switzerland（瑞士西部应用科学与艺术大学（HES-SO）IEM，谢尔 Techno-Pôle 1）； Cyber-Defence Campus, armasuisse Science and Technology, Thun, Switzerland（瑞士图恩，armasuisse 科学与技术部网络防御园区）

AI总结本文通过token级突变实验，发现微小提示扰动（如单字符变化）即可使LLM生成代码从安全变为脆弱，并利用隐藏状态分析揭示输入处理漏洞比安全默认值漏洞更可预测。

AI中文摘要

基于LLM的编码助手正被迅速采用，显著提高了开发者的生产力。随着组织越来越多地部署这些代理生成的代码，代码的安全性变得至关重要。先前的研究表明，微小的提示扰动会降低LLM生成代码的功能正确性，但这是否也会危及代码安全性尚未被研究。我们对三个模型和五种编程语言的提示应用token级突变，并表明小至单字符变化的突变可以将生成的代码从安全变为脆弱。探测模型的隐藏状态揭示，这种脆弱性部分编码在提示表示中，但分布不均匀。输入处理漏洞（模型省略验证或清理）比安全默认值漏洞（不安全代码源于一个局部选择，如弱算法或不安全参数）更可预测（平均AUC 0.753 vs 0.674）。这些结果表明，LLM辅助编码的威胁模型不仅包括提示注入，还包括普通的提示变化，并指出输入处理缺陷可以在生成前被捕获，而安全默认值缺陷需要在解码过程中进行干预。

英文摘要

LLM-based coding assistants are seeing rapid adoption, offering substantial gains in developer productivity. As organizations increasingly ship code these agents produce, the security of that code becomes critical. Prior work has shown that minor prompt perturbations degrade the functional correctness of LLM-generated code, but whether they also compromise code security has remained unstudied. We apply token-level mutations to prompts across three models and five programming languages, and show that mutations as small as a single-character change can flip generated code from secure to vulnerable. Probing the models' hidden states reveals that this fragility is partially encoded in prompt representations, but unevenly so. Input-handling vulnerabilities, where the model omits validation or sanitization, are more predictable (mean AUC 0.753) than secure-defaults vulnerabilities, where insecure code stems from one local choice such as a weak algorithm or unsafe parameter (mean AUC 0.674). These results show that the threat model for LLM-assisted coding extends beyond prompt injection to ordinary prompt variation, and indicate that input-handling flaws can be caught before generation while secure-defaults flaws require intervention during decoding.

2605.29734 2026-05-29 cs.CL 版本更新

HTAM: Hierarchical Transition-Attended Memory for Operator Optimization

HTAM: 用于算子优化的层次化过渡注意力记忆

Yining Zhang, Mingyang Yi, Chen Wang, Xuwen Xiang, Tianhe Jia, Zedong Dan, Chengqing Zong, Yue Wang

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； Zhongguancun Academy（中关村学院）； Renmin University of China（中国人民大学）

AI总结提出HTAM框架，通过构建层次化过渡图（HTG）组织粗粒度全局方向和细粒度局部策略，解决LLM在GPU算子优化中粒度不匹配问题，显著提升正确率和加速比。

Comments 24 pages, 5 figures

AI中文摘要

高性能GPU内核对于高效部署LLM至关重要，但其优化仍然需要大量专业知识。最近基于LLM的代码生成使得自动GPU算子生成变得有前景，但算子优化仍然是一个硬件感知的搜索问题。现有的基于LLM的方法面临粒度不匹配的问题：粗粒度的提示可重用但难以执行，而细粒度的记忆可操作但会扩大搜索空间并模糊优化瓶颈。因此，关键挑战在于以适当的粒度组织优化经验。为了解决这个问题，本文提出了HTAM（层次化过渡注意力记忆），一种用于基于LLM的算子优化的粗到细框架。HTAM构建了一个两层的层次化过渡图（HTG），用于组织粗粒度的全局方向、细粒度的局部策略以及优化步骤之间的过渡经验。在每个演化步骤中，HTAM从当前状态和最近的优化历史中选择一个全局方向，检索相应的局部策略记忆，并用它来指导具体的CUDA代码生成。在完整的KernelBench套件上的实验表明，与基于LLM的基线相比，HTAM在正确率、快速解率和加速比上均有持续提升，而后端和Robust-KBench研究则表明结构化记忆带来的可迁移优势。

英文摘要

High-performance GPU kernels are essential for efficient LLM deployment, yet optimizing them remains expertise-intensive. Recent LLM-based code generation makes automatic GPU operator generation promising, but operator optimization remains a hardware-aware search problem. Existing LLM-based methods face a granularity mismatch: coarse hints are reusable but hard to execute, whereas detailed memories are actionable but enlarge the search space and obscure optimization bottlenecks. The key challenge is therefore to organize optimization experience at an appropriate granularity. To address this issue, this paper proposes HTAM (Hierarchical Transition-Attended Memory), a coarse-to-fine framework for LLM-based operator optimization. HTAM builds a two-level Hierarchical Transition Graph (HTG) to organize coarse global directions, detailed local strategies, and transition experience between optimization steps. During each evolution step, HTAM selects a global direction from the current state and recent optimization history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. Experiments on the full KernelBench suite demonstrate that HTAM consistently improves correctness, fast-solution rate, and speedup over LLM-based baselines, while backend and Robust-KBench studies indicate transferable benefits from structured memory.

2605.29715 2026-05-29 cs.CL (updated version)

User-Aware Active Knowledge Acquisition for Emotional Support Dialogue

面向情感支持对话的用户感知主动知识获取

Mufan Xu, Kehai Chen, Jiahao Hu, Xinchao Xu, Muyun Yang, Tiejun Zhao, Min Zhang

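Affiliations: Harbin Institute of Technology, China; Baidu Inc., Beijing, China

AI summary: Proposes the User-Aware Active Knowledge Acquisition (UKA) framework, which uses Theory-of-Mind uncertainty estimation and active learning to efficiently acquire user-aligned conversational knowledge in emotional support dialogue, improving dialogue quality and user alignment.

Chinese abstract (AI-generated)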

情感支持在对话系统中扮演重要角色，其成功取决于在多轮交互中适应用户不断变化且隐含的需求，同时利用大语言模型的强大推理能力。然而，由于用户需求的信号通常微弱、间接，且只能通过多轮交互来消除歧义，现有的情感支持方法往往难以高效获取和泛化相关的对话知识。为弥补这一差距，我们引入了用户感知主动知识获取（UKA），这是一种无梯度的主动对话学习框架，明确表示用户需求的不确定性，并将主动学习融入知识获取和响应选择中。我们提出了一种理论心智不确定性估计机制，使模型能够优先选择响应，从而引发更多信息性的用户反馈。UKA能够在训练期间高效探索用户对齐的对话知识，同时在测试时保持鲁棒性。在多个对话基准和模型架构上的实验表明，我们的方法在对话质量和用户对齐方面始终优于强基线。

English abstract

Emotional support plays an important role in dialogue systems, and its success depends on adapting to a user's evolving and implicit needs across multi-turn interactions while leveraging the strong reasoning capacity of large language models. However, since signals about user needs are often weak, indirect, and can only be disambiguated through multi-turn interaction, existing emotional support methods often struggle to acquire and generalize relevant conversational knowledge efficiently. To bridge this gap, we introduce User-Aware Active Knowledge Acquisition (UKA), a gradient-free active dialogue learning framework that explicitly represents uncertainty about user needs and incorporates active learning into both knowledge acquisition and response selection. We propose a Theory-of-Mind uncertainty estimation mechanism that allows the model to prioritize responses, thereby eliciting more informative user feedback. UKA is capable of efficiently exploring user-aligned conversational knowledge during training while maintaining robustness at test time. Experiments across multiple dialogue benchmarks and model architectures demonstrate that our approach consistently outperforms strong baselines in dialogue quality and user alignment.

2605.29714 2026-05-29 cs.CL (updated version)

Leveraging Routing Dynamics in Mixture-of-Experts Models for Efficient Language Adaptation

利用混合专家模型中的路由动态实现高效语言适配

Aditi Khandelwal, Marius Mosbach, Verna Dankers, Siva Reddy, Golnoosh Farnadi

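Affiliations: Mila – Quebec AI Institute & McGill University

AI summary: Studies routing dynamics during multilingual continual pre-training of an English-centric Mixture-of-Experts model, finds that routing in early and middle layers is diffused and language-agnostic while language specialization emerges in the final layers, and proposes a parameter-efficient adaptation strategy that updates only the language-specific and shared experts in the final layers.

Chinese abstract (AI-generated)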

混合专家（MoE）模型被广泛用于扩展语言模型，但其专家路由行为和多语言环境下的适配仍未被充分探索。在这项工作中，我们研究了在英语中心的MoE模型上使用多语言语料库进行持续预训练时的多语言路由动态，分析了专家使用如何随语言变化。我们发现，持续的多语言预训练导致早期和中间层出现分散的、与语言无关的路由，而语言专化主要出现在最终层。我们还表明，语言之间的token级词汇重叠在路由方式中起着重要作用。受这些发现启发，我们提出了一种参数高效的适配策略，仅更新最终MoE层中的语言特定和共享专家。在MultiBLiMP和Belebele上的实验表明，我们的方法实现了强大的性能-效率权衡，在更新不到2%参数的情况下，达到了与微调整个最终层相竞争的性能。总体而言，我们的发现揭示了在持续预训练期间MoE中语言专化出现的位置和方式，并为低资源多语言适配提供了实用见解。我们的代码可在https://github.com/aditi184/moe-routing-adaptation获取。

English abstract

Mixture-of-Experts (MoE) models are widely used to scale language models, yet their expert routing behavior and adaptation in a multilingual setting remain underexplored. In this work, we study multilingual routing dynamics during continual pre-training of an English-centric MoE model on a multilingual corpus, analyzing how expert usage varies across languages. We find that continual multilingual pre-training leads to diffused, language-agnostic routing in early and middle layers, with language specialization primarily emerging in the final layers. We also show that token-level vocabulary overlap between languages plays an important role in how languages are routed. Motivated by these findings, we propose a parameter-efficient adaptation strategy that updates language-specific and shared experts in the final MoE layers. Experiments on MultiBLiMP and Belebele show that our method achieves a strong performance-efficiency trade-off, attaining competitive performance relative to fine-tuning complete final layers, while updating less than 2% of the parameters. Overall, our findings provide insights into where and how language specialization emerges in MoEs during continual pre-training and provide practical insights for low-resource multilingual adaptation. Our code is available at https://github.com/aditi184/moe-routing-adaptation.

2605.29712 2026-05-29 cs.CL cs.AI (updated version)

Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies

教会语言模型使用人类应试策略检查基于事实的声明真实性

Yuxuan Ye, Raul Santos-Rodriguez, Edwin Simpson

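Affiliations: Intelligent Systems Laboratory; University of Bristol

AI summary: Frames grounded claim factuality checking as a true/false reading comprehension task, prompts language models with explicit test-taking strategies for efficient reasoning, and trains small language models to reduce inference cost.

Comments: ACL 2026 Main

Chinese abstract (AI-generated)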

基于事实的声明真实性检查对于大型语言模型（LLM）应用（如检索增强生成）非常重要，因为它帮助用户评估生成输出的正确性。现有的使用蕴含分类器的指标需要针对数据集调整阈值，而基于LLM的方法通常使用直接提示，这未能充分利用LLM的推理能力。我们通过将基于事实的声明真实性检查建模为真假阅读理解任务，并提示LLM使用明确的应试策略进行高效推理来解决这一问题。与无引导的开放式推理相比，我们的方法减少了超过80%的令牌使用量，并在两个真实性基准测试中取得了与更昂贵替代方案竞争的性能，在一个基准上达到了新的最先进水平。为了进一步降低推理成本，我们训练小语言模型（SLM）来替代检查流程中的LLM。通过监督微调（SFT）和自我修正机制，SLM学会了改进其真实性判断。实验结果表明，生成的SLM在性能上与强基线相当，结合了低推理成本和生成支持理由以支持可解释性。代码和数据集将在接收后发布。

English abstract

Grounded claim factuality checking is important for large language model (LLM) applications such as retrieval-augmented generation, as it helps users assess the correctness of generated outputs. Existing metrics using entailment classifiers require dataset-specific threshold tuning, while LLM-based approaches often use direct prompting, which underutilises the reasoning capabilities of LLMs. We address this by formulating grounded claim factuality checking as a true/false reading comprehension task and prompting LLMs with explicit test-taking strategies for efficient reasoning. Our method reduces token usage by over 80% compared to unguided open-ended reasoning, and achieves competitive performance to more expensive alternatives across two factuality benchmarks, setting a new state of the art on one. To further reduce inference cost, we train small language models (SLMs) to replace LLMs in the checking pipeline. Using supervised fine-tuning (SFT) and a self-revision mechanism, the SLMs learn to improve their factuality judgements. Experimental results show that the resulting SLMs perform on par with strong baselines, combining low inference costs with generating supporting rationales to support interpretability. Code and datasets will be released upon acceptance.

2605.29711 2026-05-29 cs.CL cs.AI (updated version)

Personalized Turn-Level User Conversation Satisfaction Benchmark

个性化轮级用户对话满意度基准

Zhefan Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang, Quanjia Yan, Hengliang Luo

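Affiliations: Department of Computer Science and Technology, Tsinghua University, Beijing, China; Institute for AI Industry Research, Tsinghua University, Beijing, China; Meituan

AI summary: Addresses personalized satisfaction evaluation of AI assistant responses by proposing a satisfaction evaluator that combines user memories with target-turn context, and builds the PersTurnBench benchmark, which enables controlled comparison of generation models via replay.

Chinese abstract (AI-generated)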

用户对AI助手的满意度高度个性化：同一响应可能满足一个用户但令另一个失望，取决于每个用户的期望以及他们之前询问的内容。现有的自动评估方法大多衡量通用响应质量，难以判断某个响应在特定轮次是否满足用户。我们将此问题作为个性化轮级用户对话满意度评估进行研究。我们构建了一个对话满意度评估器，将紧凑的用户记忆与目标轮上下文相结合，生成满意度分数和不满意的理由。与人类满意度标注的元评估表明，个性化记忆和事后分数校准在有序一致性和不满意轮次检测上优于监督式、检索式和通用LLM作为评判者的基线。我们进一步引入了PersTurnBench，这是一个个性化轮级用户对话满意度基准，通过回放使用经过验证的评估器来评估生成模型。通过固定回放状态，PersTurnBench能够在无需为每个候选模型收集新人工标签的情况下，对通用生成模型和记忆增强的个性化系统进行受控比较。该评估器和基准让研究人员能够在无需为每个模型收集新用户反馈的情况下，比较候选生成模型在个性化满意度上的表现。

English abstract

User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation methods mostly measure generic response quality, making it difficult to judge whether a response satisfies a user at a specific turn. We study this problem as personalized turn-level user conversation satisfaction evaluation. We build a conversation satisfaction evaluator that combines compact user memories with target-turn context to produce satisfaction scores and dissatisfaction-oriented rationales. Meta-evaluation against human satisfaction annotations shows that personalized memory and post-hoc score calibration improve ordinal agreement and dissatisfied-turn detection over supervised, retrieval-based, and generic LLM-as-a-judge baselines. We further introduce PersTurnBench, a personalized turn-level user conversation satisfaction benchmark that uses the verified evaluator to assess generation models via replay. By holding the replay state fixed, PersTurnBench enables controlled comparison of generic generation models and memory-augmented personalized systems without new human labels for every candidate model. The evaluator and benchmark let researchers compare candidate generation models on personalized satisfaction without collecting new user feedback for every model.

2605.29708 2026-05-29 cs.CL (updated version)

Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs

理解混合专家大语言模型中的安全敏感专家行为

Zhibo Zhang, Yuxi Li, Zhen Ouyang, Ling Shi, Kailong Wang

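Affiliations: Huazhong University of Science and Technology, Wuhan, China; Nanyang Technological University, Singapore

AI summary: Proposes the RASET framework to study how safety alignment relates to routed expert specialization in Mixture-of-Experts LLMs, finding that routing patterns are largely topic-driven while safety behavior can be changed by tuning a small number of experts with little change to the routing path.

Comments: 11 pages, 4 figures

Chinese abstract (AI-generated)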

混合专家（MoE）大语言模型依赖于稀疏的、由路由器驱动的专家激活，然而安全对齐如何与路由专家专业化相互作用仍未被充分探索。一种常见的直觉是，安全行为可能通过将有害请求路由到不同的拒绝导向专家来控制。在这项工作中，我们为不同的情况提供了经验证据：对齐的MoE大语言模型中的路由模式主要是主题驱动的，而安全行为可以在不改变模型固有路由路径的情况下被改变。基于这一观察，我们提出了**RASET**（**R**outer-**A**gnostic **S**afety-critical **E**xpert **T**uning，路由器无关的安全关键专家微调），这是一个红队框架，用于探测集中在少数专家中的安全执行，同时保持模型固有的路由行为。**RASET**通过对比路由敏感性标准识别安全关键专家，并仅对选定的专家应用参数高效微调，从而相对于路由器干预最小化语义干扰。这些结果揭示了独特的MoE安全风险，强调了需要专家感知的对齐机制。

English abstract

Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled by routing harmful requests to distinct refusal-oriented experts. In this work, we provide empirical evidence for a different picture: routing patterns in aligned MoE LLMs are largely topic-driven, while safety behavior can be altered with little change to the model's intrinsic routing path. Motivated by this observation, we present **RASET** (**R**outer-**A**gnostic **S**afety-critical **E**xpert **T**uning), a red-teaming framework that probes safety enforcement that is localized in a small subset of experts while preserving the model's intrinsic routing behavior. **RASET** identifies safety-critical experts via a contrastive routing-sensitivity criterion and applies parameter-efficient tuning only to the selected experts, minimizing semantic disruption relative to router-steering interventions. These results reveal a distinct MoE safety risk, highlighting the need for expert-aware alignment mechanisms.

2605.29707 2026-05-29 cs.CL (updated version)

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Domino: 在推测解码中将因果建模与自回归草稿解耦

Jianuo Huang, Yaojie Zhang, Qituan Zhang, Hao Lin, Hanlin Xu, Linfeng Zhang

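Affiliations: EPIC Lab, Shanghai Jiao Tong University; School of Software Engineering, HUST; UESTC; Fudan University; Huawei

AI summary: Proposes the Domino framework, which decouples causal dependency modeling from autoregressive draft execution using a parallel draft backbone and a lightweight Domino head, combined with a base-anchored training curriculum, achieving up to 5.49× end-to-end speedup and up to 5.8× throughput speedup on Qwen3 models.

Chinese abstract (AI-generated)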

推测解码通过草拟多个令牌并与目标模型并行验证来加速LLM推理。然而，其实际加速受限于草稿质量与草稿成本之间的权衡：自回归草稿器建模草稿令牌间的因果依赖但引入顺序开销，而并行草稿器降低草稿成本但削弱块内依赖建模。本文提出Domino，一种将因果依赖建模与昂贵的自回归草稿执行解耦的推测解码框架。Domino首先使用并行草稿骨干为整个块生成初步草稿分布，然后应用轻量级Domino头以前缀依赖的因果信息对其进行细化。为稳定教师强制因果编码，我们进一步引入基础锚定训练课程，首先强化并行骨干，然后逐步将优化转向因果修正的最终分布。在Qwen3模型上的实验表明，Domino在Transformers后端下实现高达5.49倍的端到端加速，在SGLang服务下实现高达5.8倍的吞吐量加速。

English abstract

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to $5.49\times$ end-to-end speedup under the Transformers backend and up to $5.8\times$ throughput speedup under SGLang serving.
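
For readers unfamiliar with the draft-then-verify loop that this abstract builds on, the following is a minimal sketch of generic greedy speculative verification. It is not Domino's drafter or head; `verify_draft` and `target_next` are hypothetical names, and the target model is stood in for by a plain function that returns its greedy next token. Real systems score all draft positions in one parallel forward pass, whereas this sketch checks them one by one for clarity.

```python
from typing import Callable, List

def verify_draft(
    prefix: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Accept the longest draft prefix the target agrees with (greedy variant).

    `target_next` is a hypothetical stand-in for the target model's greedy
    next-token choice given a token sequence.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft token was accepted, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy target: the next token is the current sequence length modulo 3.
toy_target = lambda seq: len(seq) % 3
print(verify_draft([0], [1, 2, 5], toy_target))
```

The number of tokens returned per call is what a better drafter improves: the more draft tokens the target agrees with, the fewer sequential target calls are needed.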

2605.29682 2026-05-29 cs.CL (updated version)

Scaling Laws for Agent Harnesses via Effective Feedback Compute

智能体框架的有效反馈计算缩放定律

Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che

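Affiliations: Harbin Institute of Technology

AI summary: Proposes Effective Feedback Compute (EFC) as a scaling coordinate that predicts agent harness performance by crediting only feedback that is informative, valid, non-redundant, and retained, outperforming raw-compute baselines across multiple tasks.

Chinese abstract (AI-generated)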

智能体框架通过决定模型如何调用工具、接收反馈、验证中间状态、存储记忆和修正解决方案，日益决定语言模型系统的性能。然而，当前的测试时缩放分析通常通过原始支出（令牌、工具调用、操作、挂钟时间或成本）来参数化这一过程，这并未区分有用反馈与冗余或不稳定的交互。我们引入了有效反馈计算（EFC），这是一种轨迹级缩放坐标，仅在反馈具有信息性、有效性、非冗余性且被保留用于后续决策时才计入反馈，并在比较具有不同反馈需求的任务时通过任务需求进行归一化。在合成可控任务、可执行代码任务、真实基准轨迹、保留集和前瞻性验证批次中，基于EFC的坐标一致地比原始计算基线和强多变量SAS基线更好地预测失败率。在受控缩放中，原始令牌和工具调用解释的变异有限（R²=0.33和0.42），SAS达到0.88，而Oracle-EFC和Estimated-EFC达到0.94，Oracle-EFC/D_task达到0.99。匹配预算的干预表明，在原始成本和工具调用固定的情况下，提高反馈质量将成功率从0.27提升到0.90。在混合真实轨迹上，NRS-EFC/D_task达到R²=0.92，而原始计算具有接近零或负的拟合，并且在前瞻性保留集中仍然是最佳预测器（R²=0.85）。这些结果表明，框架缩放受计算量多少的影响较小，而更多地取决于原始预算如何高效地转化为持久且任务充分的反馈。

English abstract

Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expenditure -- tokens, tool calls, operations, wall time, or cost -- which does not distinguish useful feedback from redundant or unstable interaction. We introduce Effective Feedback Compute (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, and we normalize it by task demand when comparing tasks with different feedback requirements. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation ($R^2=0.33$ and $0.42$), SAS reaches $0.88$, while Oracle-EFC and Estimated-EFC reach $0.94$ and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$. Matched-budget interventions show that improving feedback quality raises success from $0.27$ to $0.90$ while raw cost and tool calls are fixed. On mixed real traces, NRS-EFC/$D_{\mathrm{task}}$ reaches $R^2=0.92$ while raw compute has near-zero or negative fit, and it remains the best predictor in a prospective holdout ($R^2=0.85$). These results suggest that harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback.
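
The crediting rule in this abstract can be illustrated with a small sketch. Everything beyond the four stated conditions is an assumption made for illustration: the abstract does not say how each condition is measured, so the sketch uses binary flags, treats exact repeats of the same content as redundant, gives each credited event one unit, and divides by a scalar `task_demand`. The names `Feedback` and `effective_feedback_compute` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class Feedback:
    content: str       # what the harness reported back to the model
    informative: bool  # assumed binary flag
    valid: bool        # assumed binary flag
    retained: bool     # assumed: still available to later decisions

def effective_feedback_compute(trace: Iterable[Feedback], task_demand: float = 1.0) -> float:
    """Count feedback events meeting all four conditions, normalized by task demand."""
    seen = set()
    credited = 0
    for fb in trace:
        redundant = fb.content in seen  # assumption: exact repeats are redundant
        seen.add(fb.content)
        if fb.informative and fb.valid and fb.retained and not redundant:
            credited += 1
    return credited / task_demand

trace = [
    Feedback("test_a fails: KeyError", True, True, True),
    Feedback("test_a fails: KeyError", True, True, True),   # repeat, not credited
    Feedback("stale cache output", True, False, True),      # invalid, not credited
    Feedback("test_b passes", True, True, True),
]
print(effective_feedback_compute(trace, task_demand=4.0))
```

Under these assumptions a trace with four tool calls earns only two units of credit, which is the distinction from raw tool-call counts that the abstract draws.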

2605.29678 2026-05-29 cs.CL (updated version)

Spurious Prompts: Can Irrelevant Prompts Steer Large Language Models?

虚假提示：无关提示能否引导大型语言模型？

Pawel Batorski, Abtin Pourhadi, Jerzy Sarosiek, Przemyslaw Spurek, Paul Swoboda

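Affiliations: Heinrich Heine University Düsseldorf; Jagiellonian University; IDEAS Research Institute

AI summary: Studies how semantically unrelated prompts (spurious prompts) affect large language model behavior, proposes a black-box search method for discovering them, and shows that they can substantially influence model outputs across multiple benchmarks and models.

Chinese abstract (AI-generated)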

大型语言模型对提示高度敏感，但这种敏感性通常通过任务相关的指令、示例或推理线索来研究。本文研究了一种不同形式的提示敏感性：与任务语义无关的提示是否仍然能够引导模型行为。我们称其为虚假提示，并展示了其惊人的有效性。我们还提出了一种简单的黑盒搜索程序来发现它们。在推理和问答基准上，使用参数从0.8B到27B、涵盖三个模型家族的模型，我们展示了虚假提示可以提升性能，通常匹配或超越标准提示基线和任务感知的提示优化。我们进一步展示了它们可以引导模型产生非预期行为，例如重复选择第一个答案选项、产生错误答案、返回偶数、质数或小数，而无需明确指示模型这样做。这些发现揭示了一种新的提示敏感性：LLM可以被与它们被要求解决的任务无关的提示系统地引导。我们的代码可在 https://github.com/Batorskq/spurious 获取。

English abstract

Large language models are highly sensitive to prompts, but this sensitivity is usually studied through task-relevant instructions, demonstrations, or reasoning cues. In this paper, we study a different form of prompt sensitivity: whether prompts that are semantically unrelated to the task can nevertheless steer model behavior. We call them spurious prompts and show their surprising efficacy. We also propose a simple black-box search procedure for discovering them. Across reasoning and question-answering benchmarks, using models ranging from 0.8B to 27B parameters and spanning three model families, we show that spurious prompts can improve performance, often matching or outperforming standard prompting baselines and task-aware prompt optimization. We further show that they can steer models toward unintended behaviors, such as repeatedly selecting the first answer option, producing incorrect answers, or returning an even, prime, or small number, without explicitly instructing the model to do so. These findings reveal a new kind of prompt sensitivity: LLMs can be systematically steered by prompts that are unrelated to the task they are asked to solve. Our code is available at https://github.com/Batorskq/spurious

2605.29670 2026-05-29 cs.CL cs.AI (updated version)

EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL

EviLink: 面向大规模Text-to-SQL的基于不确定性引导证据获取的多路径模式链接

Huawei Zheng, Sen Yang, Zhaorui Yang, Yuhui Zhang, Haozhe Feng, Haoxuan Li, Xuan Yi, Chao Hu, Defeng Xie, Chen Hou, Danqing Huang, Wei Chen, Yingcai Wu, Peng Chen, Dazhen Deng

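Affiliations: School of Software Technology, Zhejiang University; State Key Lab of CAD&CG, Zhejiang University; Tencent TEG; School of Mathematical Sciences, Peking University

AI summary: Proposes EviLink, which recasts schema linking as uncertainty-aware schema-need inference through multi-hypothesis schema grounding and uncertainty-guided evidence acquisition, balancing schema completeness, relevance, and token cost to improve large-scale Text-to-SQL performance.

Chinese abstract (AI-generated)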

模式链接是大规模Text-to-SQL中困难且重要的步骤，系统必须从庞大且模糊的数据库中识别出紧凑且充分的模式上下文。现有方法通常将模式链接视为围绕单个SQL路径的确定性选择，但复杂问题可能允许多个具有不同模式需求的有效实现。我们将模式链接重新定义为对多个可行SQL路径的不确定性感知模式需求推理，其中系统区分必需模式项与路径依赖的不确定项，并仅在需要时获取证据。我们通过EviLink实例化这一重构，它结合了多假设模式基础与不确定性引导的证据获取。在BIRD-Dev和Spider2-Snow上的实验表明，这种视角改善了模式完整性、模式相关性和令牌成本之间的平衡。在Spider2-Snow上，EviLink实现了90.15%的字段级严格召回率，平均使用123.30K令牌，并在固定生成器下提升了下游SQL生成性能。

English abstract

Schema linking is a difficult and important step in large-scale Text-to-SQL, where systems must identify a compact yet sufficient schema context from large and ambiguous databases. Existing methods often treat schema linking as deterministic selection around a single SQL path, but complex questions may admit multiple valid realizations with different schema needs. We reframe schema linking as uncertainty-aware schema-need inference over multiple plausible SQL paths, where the system distinguishes required schema items from path-dependent uncertain ones and acquires evidence only where needed. We instantiate this reframing with EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition. Experiments on BIRD-Dev and Spider2-Snow show that this perspective improves the balance among schema completeness, schema relevance, and token cost. On Spider2-Snow, EviLink achieves 90.15% field-level strict recall rate, uses 123.30K average tokens, and improves downstream SQL generation under a fixed generator.
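
One simple way to picture the split between required and path-dependent schema items is with set operations over several candidate SQL paths. This is only an illustrative sketch: the abstract does not state how EviLink decides which items are required, so the rule used here (required means present in every hypothesis, uncertain means present in only some) is an assumption, and `split_schema_needs` and the column names are hypothetical.

```python
from typing import Iterable, Set, Tuple

def split_schema_needs(path_schemas: Iterable[Iterable[str]]) -> Tuple[Set[str], Set[str]]:
    """Split schema items into (required, uncertain) across candidate SQL paths.

    Assumed rule: an item used by every candidate path is required; an item
    used by only some paths is uncertain and may warrant further evidence.
    """
    sets = [set(p) for p in path_schemas]
    if not sets:
        return set(), set()
    required = set.intersection(*sets)
    uncertain = set.union(*sets) - required
    return required, uncertain

# Two hypothetical ways to answer "total revenue per customer".
paths = [
    {"customers.id", "orders.customer_id", "orders.total"},
    {"customers.id", "payments.customer_id", "payments.amount"},
]
required, uncertain = split_schema_needs(paths)
print(sorted(required))
print(sorted(uncertain))
```

Only the uncertain set would need extra evidence under this reading, which is how a linker could keep the schema context compact without dropping items that some valid path depends on.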

2605.29668 2026-05-29 cs.AI cs.CL (updated version)

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

GRASP: 门控回归感知技能提议器用于自我改进的LLM智能体

Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem

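Affiliations: Technical University of Munich and TUM University Hospital; Microsoft Healthcare & Life Sciences

AI summary: Proposes GRASP, which gates regression-aware edits to a skill library so that each skill update yields a net improvement under a hard regression budget, markedly improving the operational reliability of LLM agents in structured environments.

Chinese abstract (AI-generated)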

在结构化环境中运行的LLM智能体以操作方式而非对话方式失败，其可靠性取决于对环境的程序性知识。先前的自我改进方法累积自然语言指导而不检查每个新项目是否保留先前正确的行为，因此修复一条轨迹的笔记可能静默地使另一条轨迹退化。我们引入GRASP（门控回归感知技能提议器），将智能体改进视为对有限技能库的一系列编辑，仅在候选技能在硬回归预算下对平衡的保留探针产生净改进时才接受它。我们在两个基于FHIR的临床基准上评估了GRASP在五个基础模型（gpt-oss-120b、DeepSeek V4 Flash、Gemini 3.1 Flash Lite、GPT-4.1、GPT-5.4）上的表现。在MedAgentBench上，GRASP将gpt-oss-120b从40.6%提升至88.8%，超过五个自我改进基线中最强的21.0个百分点，并将其他每个基础模型提升17.2至40.3个百分点。消融实验将增益归因于比较性提议生成、接受门和硬回归预算，而非技能编写本身——没有验证的技能编写并不比不使用技能更好。该机制泛化到临床领域之外，在四个非临床环境中的三个上改进了智能体，仅在动作空间开放的环境中保持持平。冻结的技能库可在模型间迁移，其中来自更强模型的技能将较弱执行者提升到超出其自身学习能力的水平，而反向则不然，这种不对称性是没有门控的基线无法复现的。

English abstract

LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
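
The acceptance gate described in this abstract can be sketched as a comparison of per-task outcomes on a held-out probe before and after a candidate skill is added. This is a hedged illustration, not the authors' implementation: the names `compare` and `admit`, the pass/fail representation of probe results, and the default budget of zero regressions are all assumptions.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class ProbeDelta:
    fixed: int      # probe tasks that failed before the edit and pass after it
    regressed: int  # probe tasks that passed before the edit and fail after it

def compare(before: Sequence[bool], after: Sequence[bool]) -> ProbeDelta:
    """Compare per-task pass/fail outcomes on the same held-out probe."""
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    regressed = sum(1 for b, a in zip(before, after) if b and not a)
    return ProbeDelta(fixed, regressed)

def admit(before: Sequence[bool], after: Sequence[bool], regression_budget: int = 0) -> bool:
    """Admit a candidate skill only if it is a net gain within the regression budget."""
    delta = compare(before, after)
    within_budget = delta.regressed <= regression_budget
    net_gain = delta.fixed > delta.regressed
    return within_budget and net_gain

# A skill that fixes one task but silently breaks another is rejected.
print(admit([True, False], [False, True]))
# A skill that fixes two tasks and breaks none is admitted.
print(admit([True, False, False], [True, True, True]))
```

The point of the gate is visible in the first call: a note that repairs one trajectory while regressing another yields no net gain and never enters the library.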

2605.29667 2026-05-29 cs.CL (updated version)

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

超越英语与规避：用于高风险LLM中文安全评估的人工标注多领域基准

Wajdi Zaghouani, Kholoud K. Aldous, Yicheng Gao

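Affiliations: Northwestern University in Qatar

AI summary: To address the breakdown of LLM safety systems in Chinese-language settings, builds ChiSafe-PAS, a human-annotated benchmark of 1,897 adversarial prompts covering four high-stakes domains, of which 1,544 carry complete gold-standard annotations, for evaluating model safety alignment.

Journal ref: Proceedings of The fourth international workshop on the role of resources in the age of large language models RESOURCEFUL-2026 at LREC 2026, Palma de Mallorca, Spain, 2026

Chinese abstract (AI-generated)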

当大型语言模型（LLM）部署在中文环境中时，出现了一个令人不安的模式：在英语中运行良好的安全系统会失效。这些系统难以跨越语言和文化的界限，使得模型暴露于利用中文特定规避技术（包括拼音罗马化、汉字分解、网络俚语和模糊语气）的对抗性提示。为解决这一差距，我们引入了ChiSafe-PAS（中文安全试点标注集），这是一个包含1,897个对抗性中文提示的人工标注基准，涵盖四个高风险领域：自残与暴力、毒品与非法交易、欺诈以及讽刺。其中，1,544条条目带有完整的黄金标准标注：一个3类响应标签（拒绝、安全重定向、回应）、一个九类混淆分类、一个风险等级评级以及标注者理由。我们详细描述了数据集设计、标注过程和混淆分类。我们的主要目标是实用的：为研究社区提供一个高质量、基于文化背景的资源，用于基准测试LLM的安全对齐。在此过程中，我们涉及了该领域的三个更广泛的张力：训练数据和评估数据之间模糊的界限、基于现实风险进行领域覆盖的需求，以及规模作为文化专业知识替代品的局限性。

English abstract

When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries, leaving models exposed to adversarial prompts that exploit Chinese-specific evasion techniques, including Pinyin romanization, character decomposition, internet slang, and hedging tone. To address this gap, we introduce ChiSafe-PAS (Chinese Safety Pilot Annotation Set), a human-annotated benchmark of 1,897 adversarial Chinese prompts spanning four high-stakes domains: self-harm and violence, drug and illicit trade, fraud, and satire. Of these, 1,544 entries carry complete gold-standard annotations: a 3-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, a risk-level rating, and annotator rationale. We describe the dataset design, annotation process, and obfuscation taxonomy in detail. Our primary goal is practical: to give the research community a high-quality, culturally grounded resource for benchmarking LLM safety alignment. In doing so, we engage three broader tensions in the field: the blurring boundary between training and evaluation data, the need for domain coverage grounded in real-world risk, and the limits of scale as a substitute for cultural expertise.

2605.29659 2026-05-29 cs.LG cs.AI cs.CL (updated version)

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Opir：针对毒性、越狱、仇恨言论和有害内容的高效多任务安全分类

Ihor Stepanov, Aleksandr Smechov

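Affiliations: Knowledgator; Wordcab

AI summary: Proposes Opir, a family of encoder-based guardrail models built on the GLiClass architecture that uses multi-task learning for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. Across 12 safety-classification tasks and 17 category tasks it is competitive with existing guardrail systems while having a smaller deployment footprint.

Comments: 23 pages, 4 figures, 9 tables

Chinese abstract (AI-generated)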

大型语言模型（LLM）应用的实时安全过滤需要能够检测不安全提示、有毒语言、越狱尝试和不安全响应的分类器，且不能像大型护栏模型那样成本高昂，同时要能区分良性的敏感文本与真正隐蔽的有害内容。在本文中，我们介绍了Opir，一个基于GLiClass架构的编码器护栏模型系列。Opir包括用于二进制安全/不安全分类、多标签毒性分类、越狱分类以及零样本不安全提示和响应分类的多任务模型。我们还发布了专门用于二进制安全/不安全分类的边缘变体，参数少于1亿。这些模型在一个三级分类体系上训练，该体系包含16个顶层标签、126个中层标签和854个叶标签，共996个类别。Opir的训练数据结合了基于分类体系的不安全提示、对抗性挖掘的难负例、良性安全保持示例、生成的响应示例、多语言翻译以及Aegis2和WildGuard训练子集的部分内容。我们还开源了一个评估工具，支持GLiClass和GLiNER2后端以及基于解码器的模型，涵盖二进制安全分类、多标签分类、毒性、越狱检测、提示安全、响应安全、响应拒绝以及跨公共基准系列的提示子类别视图。在与八个当代护栏系统（包括基于GLiNER2和生成式护栏模型）的扩展比较中，涵盖12项安全分类任务和17项类别任务，Opir变体在大多数基准数据集上与最强的开源基线模型竞争或领先，同时部署规模显著更小。

English abstract

Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models, and that can distinguish benign sensitive text from genuinely covert harmful content. In this paper, we introduce Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. We also release edge variants with fewer than 100M parameters dedicated to binary safe/unsafe categorization. The models are trained on a three-level taxonomy containing 996 categories across 16 top-level labels, 126 mid-level labels, and 854 leaf labels. Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. We also open-sourced an evaluation harness that supports GLiClass and GLiNER2 backends as well as decoder-based models, and covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. Across an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems -- including both GLiNER2-based and generative guardrail models -- Opir variants are competitive on or ahead of the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.

2605.29648 2026-05-29 cs.CL (updated version)

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

超越数学与代码的可验证奖励：面向事实问答的轻量级语料库基础过程监督

Shicheng Fan, Haochang Hao, Dehai Min, Weihao Liu, Philip S. Yu, Lu Cheng

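Affiliations: University of Illinois Chicago

AI summary: Proposes CorVer, a lightweight process-reward method based on corpus co-occurrence statistics that uses sentence-level credit assignment and token-level advantage mapping to markedly improve factual question answering accuracy across multiple models and benchmarks while training faster.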

COMET：音频-文本多模态对比嵌入中模态间隙的概念空间剖析

Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

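Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey

AI summary: Proposes the COMET framework, which uses a PLS-SVD decomposition to show that in CLAP models only a small set of interpretable axes capturing shared concepts contributes substantially to similarity computation and that the mean component only partially represents the modality gap, then mitigates the gap training-free via spectral truncation, bringing zero-shot audio captioning close to fully supervised performance.

Chinese abstract (AI-generated)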

对比语言-音频预训练（CLAP）模型广泛用于音频理解，并在许多零样本应用中支持模态无关的条件交换。然而，其性能受到音频和文本嵌入之间模态间隙的严重影响。现有解释主要将此间隙归因于锥体效应，将其视为均值嵌入之间的偏移，但仅纠正均值只能带来有限的改进。其他假设，如信息不平衡和维度坍缩，也被提出，但仍未得到充分验证，并且在音频领域尚未被深入研究。同时，一些工作尝试将多模态对比嵌入分解为可解释的概念，但没有任何工作从概念分解的角度显式分析模态间隙。在这项工作中，我们引入了COMET（基于PLS-SVD变换的概念空间组织与模态间隙解释），这是一个新颖的用于CLAP的偏最小二乘奇异值分解（PLS-SVD）框架，揭示了模态间隙的更广泛视角。我们的框架揭示，只有一小部分可解释的轴（捕捉共享概念）对相似度计算有显著贡献，并且均值分量仅部分代表模态间隙。基于这一见解，我们提出了一种简单的谱截断方法，以无训练的方式缓解模态间隙。该方法使得零样本音频字幕通过条件交换接近全监督性能，无需大型辅助记忆库或昂贵计算。同时，它在保持检索和音频字幕任务强性能的同时，实现了显著的嵌入维度缩减。

English abstract

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component represents only partially the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.

2605.29626 2026-05-29 cs.CL cs.AI (updated version)

DLM-SWAI: Steering Diffusion Language Models Before They Unmask

DLM-SWAI: 在扩散语言模型去掩码之前引导它们

Hyeseon An, Yo-Sub Han

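Affiliations: Department of Computer Science; Yonsei University

AI summary: Proposes DLM-SWAI, a training-free steering method that biases the token distribution at each denoising step using pre-computed token-level style scores, enabling controllable generation with diffusion language models.

Comments: preprint

Chinese abstract (AI-generated)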

将语言模型生成引导至期望的文本属性对于实际部署至关重要，而推理时方法特别有吸引力，因为它们无需重新训练即可实现可控生成。最近的研究也强调了扩散语言模型作为一种新兴的生成范式，具有独特的解码特性。然而，大多数现有的引导方法要么依赖辅助模型，要么专为自回归下一个词解码设计，难以应用于通过部分掩码序列的迭代去噪生成文本的扩散语言模型（DLM）。因此，我们提出DLM-SWAI，一种简单的无需训练的引导方法，通过使用预计算的词级风格分数在每个去噪步骤偏置词分布。在风格和安全控制任务上的实验表明，DLM-SWAI有效引导扩散语言模型，同时保持生成质量并需要最小的计算开销。消融实验进一步揭示了引导强度与流畅性之间的可控权衡，我们的分析将类别可引导性与词级属性线索的强度联系起来。

English abstract

Steering language model generation toward desired textual properties is essential for practical deployment, and inference-time methods are particularly appealing because they enable controllable generation without retraining. Recent work has also highlighted diffusion language models as an emerging generation paradigm with distinct decoding properties. However, most existing steering approaches either rely on auxiliary models or are designed for autoregressive next-token decoding, making them difficult to apply to diffusion language models (DLMs), which generate text through iterative denoising of partially masked sequences. Therefore, we propose DLM-SWAI, a simple training-free steering method that biases the token distribution at each denoising step using pre-computed token-level style scores. Experiments on style and safety control tasks show that DLM-SWAI effectively steers diffusion language models while preserving generation quality and requiring minimal computational overhead. Ablations further reveal a controllable trade-off between steering strength and fluency, and our analysis links class-wise steerability to the strength of token-level attribute cues.
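
To make "biasing the token distribution with pre-computed token-level scores" concrete, here is a minimal sketch of one possible reading: an additive bias on the logits of positions that are still masked, followed by a softmax. The abstract does not specify the form of the bias, so the additive rule, the strength parameter `alpha`, and the function names are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Set

def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer_step(
    logits_per_position: Sequence[Sequence[float]],
    masked: Set[int],
    style_scores: Sequence[float],
    alpha: float = 1.0,
) -> List[List[float]]:
    """Bias token distributions at still-masked positions for one denoising step.

    Assumed rule: add alpha * style_scores[token] to each masked position's
    logits; positions that are already unmasked are left untouched.
    """
    distributions = []
    for pos, logits in enumerate(logits_per_position):
        if pos in masked:
            logits = [l + alpha * s for l, s in zip(logits, style_scores)]
        distributions.append(softmax(logits))
    return distributions

# Two positions, two-token vocabulary; only position 1 is still masked.
dists = steer_step([[0.0, 0.0], [0.0, 0.0]], masked={1}, style_scores=[1.0, -1.0])
print(dists[0])
print(dists[1][0] > dists[1][1])
```

In this reading `alpha` is the knob behind the trade-off the abstract mentions: larger values push harder toward high-scoring tokens at the expense of the model's own preferences.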

2605.29615 2026-05-29 cs.CV cs.CL 版本更新

DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?

DiffSpot：VLM能发现网页界面中的细微视觉差异吗？

Linhao Zhang, Aiwei Liu, Yuan Liu, Xiao Zhou

发表机构 * WeChat AI, Tencent Inc（腾讯公司）

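AI summary: Proposes the DiffSpot benchmark, which generates controlled image pairs through CSS-property mutations to evaluate how well vision-language models detect subtle visual differences in web interfaces; the best model identifies only 40.7% of true changes.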

详情

AI中文摘要

视觉语言模型（VLM）在高层次图像-文本对齐方面取得了显著进展，但其感知细微视觉差异的能力仍然有限。我们在渲染的网页界面中研究这一问题，其中局部视觉变化既是对细粒度感知的诊断测试，也是GUI代理和设计工具的实际需求。我们引入了DiffSpot，一个用于网页界面开放式找不同的代码驱动基准。DiffSpot通过突变自包含HTML中目标元素的单个CSS属性，重新渲染页面，并记录变化的属性、元素和突变幅度，从而构建受控图像对。一个接地门控仅保留渲染像素差异局限于目标元素的图像对。该基准包含4,400对图像，包括3,900对有差异对（平衡分布在13个CSS属性操作符和三个难度级别上）以及500对无差异对用于幻觉控制。对13个前沿VLM进行零样本评估，我们发现即使最佳模型也只能识别40.7%的真实变化，所有模型在困难级别的召回率低于23%。DiffSpot进一步表明，难度强烈依赖于属性：在CSS操作符中，像素幅度和CLIP距离都不能可靠预测召回率。

英文摘要

Vision-language models (VLMs) have made strong progress on high-level image-text alignment, yet their ability to perceive subtle visual differences remains limited. We study this problem in rendered web interfaces, where localized visual changes are both a diagnostic test of fine-grained perception and a practical requirement for GUI agents and design tools. We introduce DiffSpot, a code-driven benchmark for open-ended spot-the-difference on web interfaces. DiffSpot constructs controlled image pairs by mutating a single CSS property of a target element in self-contained HTML, re-rendering the page, and recording the changed property, element, and mutation magnitude. A grounding gate retains only pairs whose rendered pixel difference is confined to the target element. The benchmark contains 4,400 pairs, including 3,900 has-diff pairs balanced across 13 CSS-property operators and three difficulty tiers, plus 500 no-diff pairs for hallucination control. Evaluating 13 frontier VLMs zero-shot, we find that even the best model identifies only 40.7% of true changes, with Hard-tier Recall below 23% for every model. DiffSpot further shows that difficulty is strongly property-dependent: across CSS operators, neither pixel magnitude nor CLIP distance reliably predicts Recall.
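
The grounding gate mentioned in the abstract can be pictured with a small sketch. It assumes that screenshots are equal-sized 2D grids of pixel values and that the target element is an axis-aligned bounding box; the function name `diff_confined_to_box` and the box convention are hypothetical, not taken from the benchmark's code.

```python
def diff_confined_to_box(before, after, box):
    """Return True if the two renders differ and every differing pixel
    lies inside box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    changed = False
    for y, (row_a, row_b) in enumerate(zip(before, after)):
        for x, (pa, pb) in enumerate(zip(row_a, row_b)):
            if pa != pb:
                if not (top <= y < bottom and left <= x < right):
                    return False  # change leaked outside the target element
                changed = True
    return changed  # identical renders do not count as a has-diff pair

# Toy 3x3 "screenshots" (illustrative values only).
before = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
inside = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # change inside the box
leaked = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]   # change also outside the box
target_box = (1, 1, 2, 2)

keep_inside = diff_confined_to_box(before, inside, target_box)
keep_leaked = diff_confined_to_box(before, leaked, target_box)
keep_same = diff_confined_to_box(before, before, target_box)
```

A gate of this kind would discard mutations that trigger layout shifts elsewhere on the page, so each retained pair has a single, localized ground-truth change.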

2605.29612 2026-05-29 cs.MA cs.CL 版本更新

CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems

CONCAT: 基于共识与置信驱动的即席团队协作以实现高效的基于LLM的多智能体系统

Ziyang Ma, Dingyi Zhang, Sichu Liang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou

发表机构 * Southeast University（东南大学）； Huawei Technologies Ltd（华为技术有限公司）

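AI summary: Proposes CONCAT, a training-free consensus- and confidence-driven ad hoc teaming framework that clusters initial answers, selects high-confidence leaders, and predicts collaboration benefits with a Theory-of-Mind heuristic to organize multi-agent interaction dynamically, significantly improving efficiency and reducing latency.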

详情

AI中文摘要

尽管基于大型语言模型的多智能体系统在解决复杂任务和实现比单智能体系统更高的性能方面显示出能力，但由于智能体之间的密集通信，它们导致了巨大的计算开销。先前的研究致力于训练稀疏多智能体图或微调规划器以更好地编排工作流程。然而，这些额外的训练过程引入了计算成本，并将多智能体系统限制在特定领域，从而损害了其泛化能力。在本文中，我们提出了CONCAT，一种基于共识和置信驱动的即席团队协作的无训练多智能体协作框架，以高效组织智能体交互。具体来说，智能体根据其初始答案进行聚类，并根据智能体的置信度选择每个聚类的领导者。然后，基于心智理论设计启发式函数，根据领导者的答案和置信度预测每两个领导者之间的协作收益。最后，在根据预测收益驱逐一定比例的通信后，组织一个即席多智能体网络。在三个LLM和三个基准上的实验表明，CONCAT比LLM-Debate实现了高达2.02倍的效率（准确率/延迟比），并优于诸如AgentDropout等训练感知方法，同时在Qwen2.5-14B-Instruct上将平均延迟降低了50.1%，且无需任何任务特定训练。

英文摘要

Although large language model (LLM) based multi-agent systems (MAS) can solve complex tasks and outperform single-agent systems, they incur heavy computational overhead because of dense communication between agents. Previous research has trained sparse multi-agent graphs or fine-tuned planners to orchestrate the workflow better. However, such extra training introduces computational cost and ties MAS to specific domains, compromising their generalizability. In this paper, we propose CONCAT, a training-free multi-agent collaboration framework based on CONsensus and Confidence-driven Ad hoc Teaming to efficiently organize agent interactions. Specifically, agents are clustered based on their initial answers, and the leader of each cluster is selected based on the agents' confidence. Then, a heuristic function based on the Theory of Mind is designed to predict the collaboration benefit between every two leaders according to their answers and confidence. Finally, an ad hoc multi-agent network is organized after evicting a percentage of communications based on the predicted benefits. Experiments across three LLMs and three benchmarks show that CONCAT achieves up to 2.02x higher efficiency (accuracy/latency ratio) than LLM-Debate and outperforms training-aware methods such as AgentDropout, while reducing average latency by 50.1% on Qwen2.5-14B-Instruct, without any task-specific training.
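
The pipeline in the abstract (cluster by answer, pick a leader per cluster, score leader pairs, evict low-benefit links) can be sketched as below. The abstract does not specify the Theory-of-Mind heuristic, so the product of the two leaders' confidences is used purely as a placeholder; `form_ad_hoc_team` and `evict_ratio` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

def form_ad_hoc_team(agents, evict_ratio=0.5):
    """agents: list of (name, initial_answer, confidence) tuples."""
    # 1. Cluster agents by their initial answer.
    clusters = defaultdict(list)
    for name, answer, confidence in agents:
        clusters[answer].append((name, confidence))

    # 2. The most confident agent in each cluster becomes its leader.
    leaders = {ans: max(members, key=lambda m: m[1])
               for ans, members in clusters.items()}

    # 3. Score every leader pair (placeholder heuristic, NOT the paper's).
    edges = []
    for (_, (n1, c1)), (_, (n2, c2)) in combinations(leaders.items(), 2):
        edges.append((c1 * c2, n1, n2))

    # 4. Evict the lowest-benefit share of communication links.
    edges.sort(reverse=True)
    keep = len(edges) - int(len(edges) * evict_ratio)
    links = [(n1, n2) for _, n1, n2 in edges[:keep]]
    return {ans: name for ans, (name, _) in leaders.items()}, links

agents = [("a1", "42", 0.9), ("a2", "42", 0.6),
          ("a3", "41", 0.8), ("a4", "40", 0.3)]
leaders, links = form_ad_hoc_team(agents, evict_ratio=0.5)
```

Only leaders communicate in this sketch, which is where the latency savings would come from: four agents collapse to three leaders, and one of the three possible links is dropped.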

2605.29601 2026-05-29 cs.CL cs.AI cs.LG 版本更新

GraphLit: Text-Augmented Dynamic Character Network Representation Learning for Literary Studies

Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara, Mirella Lapata

发表机构 * Deezer Research（Deezer研究）； Loria（Loria实验室）； IDIAP（IDIAP研究所）； School of Informatics, University of Edinburgh（爱丁堡大学信息学院）

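AI summary: Proposes dynamic heterogeneous character networks (DHCN) and the self-supervised framework GraphLit, which learns literary representations that incorporate textual context through a masked graph autoencoder and outperforms text-only and graph-only baselines on 12 character-related tasks.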

2605.26954 2026-05-29 cs.CL 版本更新

AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian

AlbanianLLMSafety：面向阿尔巴尼亚语大语言模型的安全评估数据集

Wajdi Zaghouani, Kholoud K. Aldous, Isra Fejzullaj

发表机构 * Northwestern University in Qatar（卡塔尔西北大学）

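AI summary: Builds the first publicly available safety evaluation dataset for Albanian, a low-resource language, comprising 2,951 prompts across 11 safety categories, to fill a gap in safety evaluation infrastructure.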

Comments Accepted at SIGUL2026 Workshop co-located with LREC2026

详情

Journal ref: In Proceedings of the SIGUL2026 Workshop co-located with LREC 2026, Palma de Mallorca, Spain, 2026

AI中文摘要

大语言模型（LLM）的安全评估主要集中于高资源语言，而低资源语言则严重缺乏关注。我们提出了AlbanianLLMSafety，这是首个公开的阿尔巴尼亚语LLM安全评估数据集。阿尔巴尼亚语是一种语言独特的低资源语言，在阿尔巴尼亚、科索沃、北马其顿以及海外侨民中约有750万使用者。该数据集包含2951条提示，涵盖11个安全类别，包括自残、暴力、种族主义内容、儿童剥削和激进化等，平均每个类别268条提示。每条提示均提供阿尔巴尼亚语原文、英语参考译文以及详细的类别标签。该资源填补了低资源语言安全评估基础设施的重大空白，并为开发更安全、更具包容性的LLM提供了重要基准。数据集将根据请求提供，以支持阿尔巴尼亚语社区的安全评估、微调、红队测试和护栏开发。

英文摘要

Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation dataset for LLMs in Albanian, a linguistically distinct low-resource language with approximately 7.5 million speakers across Albania, Kosovo, North Macedonia, and the diaspora. The dataset contains 2,951 prompts spanning 11 safety categories, including self-harm, violence, racist content, child exploitation, and radicalization, with an average of 268 prompts per category. Each prompt is provided in Albanian with an English reference translation and a detailed category label. This resource addresses a significant gap in safety evaluation infrastructure for low-resource languages and provides an essential benchmark for developing safer, more inclusive LLMs. The dataset will be provided upon request to support safety evaluation, fine-tuning, red-teaming, and guardrail development for Albanian-speaking communities.

2605.26947 2026-05-29 cs.CL 版本更新

KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models

KZ-SafetyPrompts：用于大型语言模型的哈萨克语安全评估提示数据集

Wajdi Zaghouani, Shimaa Amer Ibrahim, Aruzhan Muratbek, Olzhasbek Zhakenov, Adiya Akhmetzhanova

发表机构 * Northwestern University in Qatar（卡塔尔西北大学）

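AI summary: To address the scarcity of Kazakh resources for LLM safety evaluation, builds a dataset of 5,717 natively written Kazakh prompts across 11 risk categories; a GPT-4o baseline shows refusal rates that vary widely by category, revealing category-specific safety gaps that English-only evaluation does not capture.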

Comments Accepted at the SIGUL2026 Workshop co-located with LREC2026

详情

Journal ref: In Proceedings of the SIGUL2026 Workshop co-located with LREC 2026, Palma de Mallorca, Spain, 2026

AI中文摘要

哈萨克语在评估大型语言模型安全行为的资源中代表性不足。我们提出了KZ-SafetyPrompts，这是一个哈萨克语提示数据集，用于涵盖常见风险领域的十一个类别的安全评估，例如自残、暴力、儿童剥削、色情内容、种族主义内容、激进化以及受管制商品或非法活动。该数据集包含5717条以哈萨克语（西里尔字母）原生编写的提示，按类别组织，并附有英文翻译以进行跨语言分析。提示类似于真实的用户查询，通常采用青少年或儿童风格，并以意图提示的形式表述，不包含程序性指令。我们记录了编写协议、标注程序（包括边界案例决策规则）和质量控制步骤（模式标准化、完整性检查和去重）。我们还将这些类别与广泛使用的安全分类法对齐，以支持与现有评估管道的集成。使用GPT-4o的基线结果显示总体拒绝率为28.2%，不同类别间从5.5%到53.8%不等，表明哈萨克语提示暴露了仅英语评估无法捕获的类别特定安全漏洞。

英文摘要

Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas such as self-harm, violence, child exploitation, sexual content, racist content, radicalization, and regulated goods or illegal activities. The dataset contains 5,717 prompts written natively in Kazakh (Cyrillic), organized by category, with English translations for cross-lingual analysis. Prompts resemble realistic user queries, often in a teen or child style, and are phrased as intent prompts without procedural instructions. We document the writing protocol, labeling procedures (including borderline-case decision rules), and quality-control steps (schema standardization, completeness checks, and deduplication). We also align the categories with widely used safety taxonomies to support integration with existing evaluation pipelines. Baseline results with GPT-4o show an overall refusal rate of 28.2%, varying from 5.5% to 53.8% across categories, indicating that Kazakh prompts expose category-specific safety gaps not captured by English-only evaluation.

2605.23440 2026-05-29 cs.CL cs.AI 版本更新

SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction

SSDAU：面向联合实体关系抽取的结构化语义数据增强

Jiawei He, Mengyu Shi, Jiawei Liu, Dong Sun, Chunrong Fang, Xikai Yang, Zhijie Wang, Lei Ma, Zhenyu Chen

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China（南京大学新型软件技术国家重点实验室）； Amap, Alibaba Group, China（阿里巴巴集团高德地图）； University of Alberta, Edmonton, Canada（阿尔伯塔大学）； The University of Tokyo, Tokyo, Japan（东京大学）

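AI summary: Proposes SSDAU, a structured semantic data augmentation method that preserves triple-aware semantic structure and uses context-aware encoding and BERTopic filtering to improve generalization in joint entity and relation extraction, outperforming existing methods across multiple models and datasets.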

Comments 10 pages, 4 figures

详情

AI中文摘要

联合实体关系抽取（JERE）对训练数据质量高度敏感，因此数据增强是提升泛化能力的自然方式。然而，现有增强方法常削弱实体相关性并破坏语义结构，限制了其在JERE中的有效性。本文提出结构化语义数据增强（SSDAU），一种在增强过程中保留三元组感知语义结构的方法。SSDAU按实体标签分割文本，通过上下文感知编码捕获语义特征，并重构实体语义以生成增强数据。为区分语义相似的实体，SSDAU将上下文嵌入与传统相似度评分相结合。为减少主题不一致性，我们应用基于BERTopic的过滤去除不相关的增强样本。我们在不同标注类型的数据集上评估SSDAU，并比较其在五个代表性JERE模型上相对于七个流行增强基线的性能。实验表明，SSDAU生成语义一致的数据，对歧义的鲁棒性优于非LLM方法（平均相对F1下降8.95% vs. 23.58%），并在大多数设置下显著优于强替代方法。

英文摘要

Joint Entity and Relation Extraction (JERE) is highly sensitive to training data quality, making data augmentation a natural way to improve generalization. However, existing augmentation methods often weaken entity relevance and disrupt semantic structure, limiting their effectiveness for JERE. In this paper, we propose Structured Semantic Data Augmentation (SSDAU), a method designed to preserve triple-aware semantic structure during augmentation. SSDAU segments text by entity labels, captures semantic features through context-aware encoding, and restructures entity semantics to generate augmented data. To distinguish semantically similar entities, SSDAU combines contextualized embeddings with traditional similarity scores. To reduce topic inconsistency, we apply BERTopic-based filtering to remove irrelevant augmentations. We evaluate SSDAU on datasets with different annotation types and compare its performance on five representative JERE models against seven popular augmentation baselines. Experiments show that SSDAU generates semantically consistent data, is more robust to ambiguity than non-LLM methods (8.95% vs. 23.58% average relative F1 decrease), and significantly outperforms strong alternatives in most settings.

2605.22975 2026-05-29 cs.CL cs.CY 版本更新

When AI Takes Sides on Questions of Faith: Persistent Asymmetries in AI-Mediated Faith Guidance

当AI在信仰问题上站队：AI介导的信仰指导中持续存在的不对称性

Brett Israelsen, Sheryl Carty, Josh Coates, Nancy Fulda, Julie Park, Pete Whiting

发表机构 * Brigham Young University

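AI summary: Tests the conversion advice given by 20 large language models across 182 religion pairs and finds systematic asymmetries: the models favor some religions and disfavor others, and the pattern is stable across models and test conditions.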

Comments w/ persuasive language analysis

详情

AI中文摘要

MedMosaic：一个具有挑战性的多样化医学音频大规模基准

Harshit Rajgarhia, Shuubham Ojha, Asif Shaik, Akhil Pothanapalli, Rachuri Lokesh, Abhishek Mukherji, Prasanna Desikan

发表机构 * Centific Global Solutions Inc.（Centific全球解决方案公司）； University of Maryland, College Park, MD, USA（马里兰大学学院市分校）

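AI summary: To address the scarcity of medical audio data and the shortcomings of existing benchmarks, proposes the MedMosaic dataset, which covers multiple medical audio types and 46,701 question-answer pairs for evaluating language and audio reasoning models; experiments show that reasoning remains challenging.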

Comments Accepted at ICML 2026

详情

AI中文摘要

由于隐私法规和领域专业知识导致的高注释成本，医学音频数据难以收集。因此，现有基准往往未能充分代表复杂的医学音频场景。为应对这一挑战，我们提出了MedMosaic，一个医学音频问答数据集，旨在在现实临床约束下对语言和音频推理模型进行基准测试。MedMosaic包含多种医学音频类型，包括与疾病相关的生理声音、精心构建的模拟带有伪影的语音的合成声音，以及模拟不同上下文长度的真实短篇和长篇临床对话。该数据集还包含总共46,701个问答对，涵盖多项选择、顺序多轮和开放式问答等类别，从而能够系统评估多跳推理和答案生成能力。对13个音频和多模态推理模型的基准测试显示，推理对所有评估系统仍然具有挑战性，且在不同问题类型上表现差异显著。特别是，即使是像Gemini-2.5-pro这样的最先进模型也只能达到约68.1%的准确率。这些发现强调了医学推理中的持续局限性，并凸显了对更鲁棒、特定领域的多模态推理模型的需求。基准数据样本可在此处获取：https://shorturl.at/Lyp33

英文摘要

Medical audio data is difficult to collect due to privacy regulations and the high annotation costs that come with requiring domain expertise. As a result, existing benchmarks tend to underrepresent complex medical audio scenarios. To address this challenge, we present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices that mimic speech with artifacts, and real short- and long-form clinical conversations that model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question answering, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even a state-of-the-art model such as Gemini-2.5-pro achieves only about 68.1% accuracy. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models. A sample of benchmark data is available here: https://shorturl.at/Lyp33

2604.27272 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

Shiping Gao, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Lifu Huang

发表机构 * Sun Yat-sen University（中山大学）； Shenzhen Loop Area Institute（深圳河套学院）； Meta AI ； University of California, Davis（加州大学戴维斯分校）

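AI summary: Proposes the Implicit Prefix-Value Reward Model (IPVRM), which directly learns the probability of correctness for each prefix and derives step signals from temporal-difference differences, resolving the train-inference mismatch; further introduces Distribution-Level RL (DistRL), which uses prefix values for dense counterfactual updates to improve reasoning performance.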

详情

AI中文摘要

过程奖励模型（PRM）为推理提供细粒度监督，但可靠的PRM通常需要步骤标注或繁重的验证流水线，使得它们在在线RL中扩展和刷新成本高昂。隐式PRM通过从轨迹级结果标签训练对数似然比奖励来降低这一成本。然而，对数比率在训练期间仅作为序列级聚合被约束，而推理时将其分解为部分前缀的token级或步骤级分数。这种训练-推理不匹配导致局部信用识别薄弱，因此分布级评分可能放大误导性优势。我们提出隐式前缀值奖励模型（IPVRM），直接从结果标签学习每个前缀最终正确的概率。然后通过连续前缀值之间的时序差分（TD）差异获得步骤信号，使训练目标与推理时使用对齐。IPVRM显著提高了ProcessBench上的步骤验证F1分数。为了在策略优化中利用这些前缀值，我们进一步引入分布级强化学习（DistRL），它将TD优势应用于采样token和高概率候选token，无需额外rollout即可提供密集反事实更新。实验表明，DistRL与不可靠隐式奖励结合时收益有限，但与IPVRM配对时持续改善下游推理。我们的方法实现可在https://github.com/gaoshiping/IPVRM获取。

英文摘要

Process reward models (PRMs) provide fine-grained supervision for reasoning, but reliable PRMs often require step annotations or heavy verification pipelines, making them costly to scale and refresh during online RL. Implicit PRMs reduce this cost by training log-likelihood-ratio rewards from trajectory-level outcome labels. However, the log-ratio is constrained only as a sequence-level aggregate during training, while inference decomposes it into token- or step-level scores for partial prefixes. This train-inference mismatch leaves local credits weakly identified, so distribution-wide scoring can amplify misleading advantages. We propose Implicit Prefix-Value Reward Model (IPVRM), which directly learns the probability of eventual correctness for each prefix from outcome labels. Step signals are then obtained as temporal-difference (TD) differences between consecutive prefix values, aligning the training target with inference-time use. IPVRM markedly improves step-verification F1 on ProcessBench. To exploit these prefix values during policy optimization, we further introduce Distribution-Level RL (DistRL), which applies TD advantages to both sampled tokens and high-probability candidate tokens, providing dense counterfactual updates without additional rollouts. Experiments show that DistRL brings limited gains with unreliable implicit rewards, but consistently improves downstream reasoning when paired with IPVRM. The implementation of our method is available at https://github.com/gaoshiping/IPVRM .
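
The step signal described in the abstract, a temporal-difference (TD) difference between consecutive prefix values, can be illustrated with a few lines. The prefix values below are made-up numbers, and `td_step_signals` is a hypothetical helper name, not a function from the IPVRM repository.

```python
def td_step_signals(prefix_values):
    """prefix_values[t] is the estimated probability that the final answer
    is correct given the first t+1 reasoning steps.
    Returns the TD difference V(t+1) - V(t) for each consecutive pair."""
    return [curr - prev for prev, curr in zip(prefix_values, prefix_values[1:])]

# Illustrative prefix values for a four-step reasoning trace.
values = [0.50, 0.62, 0.30, 0.85]
signals = td_step_signals(values)
# signals[1] is negative: the third step lowered the estimated chance of success.
```

Because the differences telescope, the step signals sum to the last prefix value minus the first, so per-step credit stays consistent with the trajectory-level estimate.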

2604.10511 2026-05-29 cs.AI cs.CL 版本更新

Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

快思考，错思考：直觉性调节LLM在政策评估中的反事实推理

Yanjie He

发表机构 * Independent Researcher（独立研究者）

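AI summary: Builds a benchmark of empirical cases from economics and social science and evaluates LLM counterfactual reasoning in policy evaluation over 8,000 trials; the benefit of chain-of-thought prompting is markedly weaker on counter-intuitive cases, and intuitiveness is the dominant factor, indicating a knowledge-reasoning dissociation.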

Comments 10 pages, 6 figures, 6 tables

详情

AI中文摘要

大型语言模型（LLM）越来越多地用于因果和反事实推理，但它们在现实世界政策评估中的可靠性仍未得到充分探索。我们构建了一个包含40个实证政策评估案例的基准，这些案例来自经济学和社会科学，每个案例都基于同行评审的证据，并根据直觉性进行分类——即实证结果是否符合（明显）、相对于（模糊）或违背（反直觉）常见的先验预期。我们评估了四个前沿LLM，采用五种提示策略，进行了8000次实验试验，并使用混合效应逻辑回归分析结果。我们的发现揭示了三个关键结果：（1）链式思维（CoT）悖论，即链式思维提示在明显案例上显著提升性能，但在反直觉案例上这种收益大幅减弱（交互OR = 0.278，p < 0.001）；（2）直觉性是主导因素，案例层面的方差超过模型选择或提示策略（ICC = 0.671）；（3）知识-推理分离，基于引用的熟悉度与准确性无关（p = 0.84），表明模型拥有相关知识，但当结果与直觉相悖时无法利用这些知识进行推理。我们通过双过程理论（系统1与系统2）的视角来框架这些结果，并认为当前LLM的“慢思考”仅实现了对直觉先验的部分抑制——产生了深思熟虑推理的形式，但未能完全实现其实质。

英文摘要

Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness -- whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 8,000 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is substantially attenuated on counter-intuitive ones (interaction OR = 0.278, $p < 0.001$); (2) intuitiveness as the dominant factor, with case-level variance exceeding that of model choice or prompting strategy (ICC = 0.671); and (3) a knowledge-reasoning dissociation, where citation-based familiarity is unrelated to accuracy ($p = 0.84$), suggesting models possess relevant knowledge but fail to reason with it when findings contradict intuition. We frame these results through the lens of dual-process theory (System 1 vs. System 2) and argue that current LLMs' "slow thinking" achieves only partial inhibition of intuitive priors -- producing the form of deliberative reasoning without fully delivering its substance.

2604.07789 2026-05-29 cs.MA cs.CL cs.SE 版本更新

ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals on SWE Agents

ORACLE-SWE：量化Oracle信息信号对SWE代理的贡献

Kenan Li, Qirui Jin, Liao Zhu, Xiaosong Huang, Yijia Wu, Yikai Zhang, Xin Zhang, Zijian Jin, Yufan Huang, Elsie Nallipogu, Chaoyun Zhang, Yu Kang, Saravan Rajmohan, Qingwei Lin, Wenke Lee, Dongmei Zhang

发表机构 * Microsoft（微软公司）； Georgia Institute of Technology（佐治亚理工学院）

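AI summary: Proposes Oracle-SWE, a method that isolates and extracts oracle information signals from SWE benchmarks, quantifies each signal's contribution to agent performance, and evaluates the gains a base agent obtains from signals extracted by strong language models.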

Comments Under peer review; 37 pages, 10 figures, 5 tables

详情

AI中文摘要

语言模型代理的最新进展显著提升了自动化软件工程（SWE）的能力。先前的工作提出了各种代理工作流和训练策略，并分析了代理系统在SWE任务上的失败模式，重点关注几种上下文信息信号：复现测试、回归测试、编辑位置、执行上下文和API使用。然而，每种信号对整体成功的个体贡献仍未得到充分探索，特别是在中间信息完美获取时的理想贡献。为解决这一问题，我们引入了Oracle-SWE，一种统一的方法，用于从SWE基准测试中隔离和提取Oracle信息信号，并量化每种信号对代理性能的影响。为进一步验证模式，我们评估了由强语言模型提取的信号在提供给基础代理时的性能增益，近似于现实世界的任务解决设置。这些评估旨在指导自主编码系统的研究优先级。

英文摘要

Recent advances in language model (LM) agents have significantly improved automated software engineering (SWE). Prior work has proposed various agentic workflows and training strategies as well as analyzed failure modes of agentic systems on SWE tasks, focusing on several contextual information signals: Reproduction Test, Regression Test, Edit Location, Execution Context, and API Usage. However, the individual contribution of each signal to overall success remains underexplored, particularly their ideal contribution when intermediate information is perfectly obtained. To address this gap, we introduce Oracle-SWE, a unified method to isolate and extract oracle information signals from SWE benchmarks and quantify the impact of each signal on agent performance. To further validate the pattern, we evaluate the performance gain of signals extracted by strong LMs when provided to a base agent, approximating real-world task-resolution settings. These evaluations aim to guide research prioritization for autonomous coding systems.

2604.00789 2026-05-29 cs.CL 版本更新

Valency Classification of Mapudungun Verbal Roots. Established by the language's own morphotactics

马普切语动词词根的配价分类：基于该语言自身形态句法规则

Andrés Chandía

发表机构 * Department of Catalan Philology and General Linguistics, University of Barcelona（巴塞罗那大学加泰罗尼亚语文学与普通语言学系）

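AI summary: Uses Mapudungun's own morphotactics, analyzing the permitted and restricted combinations of suffixes with roots or verbal stems, to classify the valency of roots confirmed to be verbal, with the aim of improving a morphological analyzer and advancing the understanding of verb valency in Mapudungun.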

Comments 37 pages

2603.26668 2026-05-29 cs.IR cs.AI cs.CL 版本更新

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo

发表机构 * Harvard University, Cambridge, MA（哈佛大学，马萨诸塞州剑桥）

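AI summary: Using activation probes, early forced answering, and a chain-of-thought monitor, finds performative chain-of-thought in reasoning models and uses probe-guided early exit to save computation.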

详情

AI中文摘要

我们提供了推理模型中表演性思维链（CoT）的证据，即模型对其最终答案变得非常自信，但继续生成令牌而不揭示其内部信念。我们的分析比较了两个大型模型（DeepSeek-R1 671B 和 GPT-OSS 120B）中的激活探针、早期强制回答和思维链监控器，并发现了任务难度特定的差异：模型的最终答案可以从思维链中远早于监控器能够判断的激活中解码，特别是对于基于回忆的简单MMLU问题。我们将此与困难的多跳GPQA-Diamond问题中的真正推理进行对比。尽管如此，转折点（例如回溯、“啊哈”时刻）几乎只出现在探针显示大信念转变的响应中，表明这些行为追踪的是真正的不确定性，而不是学到的“推理剧场”。最后，探针引导的早期退出在MMLU上减少了高达80%的令牌，在GPQA-Diamond上减少了30%，且准确率相似，将注意力探针定位为检测表演性推理和实现自适应计算的高效工具。

英文摘要

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B and GPT-OSS 120B) and finds task-difficulty-specific differences: the model's final answer is decodable from activations far earlier in the CoT than a monitor can detect, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.

2603.04678 2026-05-29 cs.CL cs.AI 版本更新

Post-Training Language Models for Crosslingual Consistency

后训练语言模型以实现跨语言一致性

Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza

发表机构 * ETH Zürich（苏黎世联邦理工学院）； CLCG, University of Groningen（格罗宁根大学CLCG中心）； University of Amsterdam（阿姆斯特丹大学）

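AI summary: To address multilingual models responding inconsistently to translation-equivalent prompts, proposes an information-theoretic definition of crosslingual consistency and develops a post-training method, Direct Consistency Optimization (DCO), to improve consistency.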

Comments ICML 2026. The first two authors contributed equally. Codes available at: https://github.com/Betswish/ConsistencyRL

2603.02082 2026-05-29 cs.CL 版本更新

What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies

儿童在语言习得中究竟获得了什么？基于CHILDES的填充词-空位依赖自动检测案例研究

Zhenghao Herbert Zhou, William Dai, Maya Viswanathan, Simon Charlow, R. Thomas McCoy, Robert Frank

发表机构 * Department of Linguistics, Yale University（耶鲁大学语言学系）； Department of Computer Science, Yale University（耶鲁大学计算机科学系）； Wu Tsai Institute, Yale University（耶鲁大学吴氏研究所）

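AI summary: Automatically detects three core filler-gap constructions in spoken English corpora to quantify the distributional evidence in children's language input and analyzes children's production trajectories, providing data relevant to the debate between innate grammatical knowledge and statistical learning.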

Comments Camera-ready version accepted to CoNLL 2026

详情

AI中文摘要

儿童对填充词-空位依赖的习得，一些研究者认为依赖于先天语法知识，而另一些则认为儿童导向言语中可用的分布证据足以解释。不幸的是，相关输入难以大规模细粒度量化，使得这一问题难以解决。我们提出一个系统，能够识别英语口语语料中的三种核心填充词-空位结构——主句wh-疑问句、嵌入式wh-疑问句和关系从句——并进一步识别提取位置（即主语、宾语或附加语）。我们的方法结合了成分分析和依存分析，利用它们在结构分类和提取位置识别上的互补优势。我们在人工标注数据上验证了该系统，发现其在大多数类别上表现良好。将该系统应用于57个英语CHILDES语料库，我们能够描述儿童在发育过程中接收的填充词-空位输入及其产出轨迹，包括特定结构的频率和提取位置不对称性。由此产生的细粒度标签为未来的习得研究和计算研究提供了基础，我们通过一个使用语言模型进行过滤语料训练的案例研究进行了演示。

英文摘要

Children's acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora -- matrix wh-questions, embedded wh-questions, and relative clauses -- and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children's filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.

2603.01311 2026-05-29 cs.CL 版本更新

Catalyst-Agent: Autonomous heterogeneous catalyst screening with an LLM Agent

Catalyst-Agent：基于LLM Agent的自主异质催化剂筛选

Achuth Chandrasekhar, Janghoon Ock, Amir Barati Farimani

发表机构 * Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA（卡内基梅隆大学机械工程系，匹兹堡，PA 15213，USA）； Department of Chemical and Biomolecular Engineering, University of Nebraska--Lincoln, Lincoln, NE 68588, USA（内布拉斯加大学林肯分校化学与生物分子工程系，林肯，NE 68588，USA）

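AI summary: Proposes Catalyst-Agent, an LLM-powered AI agent built on MCP servers that explores materials databases through the OPTIMADE API and computes adsorption energies with the UMA model for closed-loop autonomous catalyst screening, achieving success rates of 33-41% on the ORR, NRR, and CO2RR reactions.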

详情

AI中文摘要

发现针对特定应用的新型催化剂是21世纪的一项重大挑战。传统方法包括基于化学理论的耗时且昂贵的实验试错法，或基于密度泛函理论的计算密集型第一性原理方法。近期研究表明，图神经网络（GNN）等深度学习模型可以将催化剂材料的筛选速度提高多个数量级，且具有很高的准确性和保真度。在这项工作中，我们引入了Catalyst-Agent，一个基于模型上下文协议（MCP）服务器、由LLM驱动的AI代理。它可以使用OPTIMADE API探索庞大的材料数据库，进行结构修改，通过FAIRchem的AdsorbML工作流程和板坯构建使用Meta FAIRchem的UMA（GNN）模型计算吸附能，并以闭环方式向研究人员提供有用的材料建议，包括改进接近命中候选者的结构修改。我们在三个关键反应上进行了测试：氧还原反应（ORR）、氮还原反应（NRR）和CO2还原反应（CO2RR）。Catalyst-Agent在其选择和评估的所有材料中实现了33-41%的成功率，并且平均每个成功材料在1-4次试验内收敛。这项工作展示了AI代理利用其规划能力和工具使用实现自主催化剂筛选工作流程的潜力。

英文摘要

The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experimental trial-and-error approaches in labs based on chemical theory or heavily computational first-principles approaches based on density functional theory. Recent studies show that deep learning models like graph neural networks (GNNs) can significantly speed up the screening of catalyst materials by many orders of magnitude, with very high accuracy and fidelity. In this work, we introduce Catalyst-Agent, a Model Context Protocol (MCP) server-based, LLM-powered AI agent. It can explore vast material databases using the OPTIMADE API, make structural modifications, calculate adsorption energies using Meta FAIRchem's UMA (GNN) model via FAIRchem's AdsorbML workflow and slab construction, and make useful material suggestions to the researcher in a closed-loop manner, including structural modifications to refine near-miss candidates. It is tested on three pivotal reactions: the oxygen reduction reaction (ORR), the nitrogen reduction reaction (NRR), and the CO2 reduction reaction (CO2RR). Catalyst-Agent achieves a success rate of 33-41% among all the materials it chooses and evaluates, and manages to converge in 1-4 trials per successful material on average. This work demonstrates the potential of AI agents to exercise their planning capabilities and tool use for autonomous catalyst screening workflows.

2602.23258 2026-05-29 cs.AI cs.CL 版本更新

From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

Shaojie Wang, Liang Zhang

发表机构 * Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

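AI summary: Proposes a cognitively inspired two-stage post-training framework that learns general strategies through Chain-of-Meta-Thought supervised learning and improves execution reliability through Confidence-Calibrated Reinforcement Learning, yielding gains of 2.10% in-distribution and 3.86% out-of-distribution.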

详情

AI中文摘要

当前的大语言模型后训练方法通过监督微调（SFT）后接基于结果的强化学习（RL）来优化完整的推理轨迹。虽然有效，但仔细审视发现一个根本差距：这种方法与人类实际解决问题的方式不一致。人类认知自然地将问题解决分解为两个不同的阶段：首先获取跨问题泛化的抽象策略（即元知识），然后将其适应到具体实例。相比之下，通过将完整轨迹视为基本单元，当前方法本质上是问题中心的，将抽象策略与问题特定的执行纠缠在一起。为了解决这种错位，我们提出了一个认知启发的框架，明确地模仿人类认知的两阶段过程。具体而言，元思维链（CoMT）将监督学习聚焦于抽象推理模式而不涉及具体执行，从而能够获取可泛化的策略。然后，置信度校准强化学习（CCRL）通过中间步骤上的置信度感知奖励来优化任务适应，防止过度自信的错误级联并提高执行可靠性。在四个模型和十个基准上的实验表明，与标准方法相比，分布内和分布外分别提升了2.10%和3.86%，同时对教师模型选择、优化方法和符号扰动的变化保持高度鲁棒。

英文摘要

Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and ten benchmarks show 2.10% and 3.86% improvements in-distribution and out-of-distribution respectively over standard methods, while remaining highly robust to variations in teacher model selection, optimization methods, and symbolic perturbations.

2601.18395 2026-05-29 cs.CL 版本更新

Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction

不要贪婪，三思而后行：文档级信息抽取的采样与选择

Mikel Zubillaga, Oscar Sainz, Oier Lopez de Lacalle, Eneko Agirre

发表机构 * HiTZ Center - Ixa, University of the Basque Country UPV/EHU（希茨中心 - Ixa，巴斯克国家大学UPV/EHU）

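AI summary: Proposes the ThinkTwice framework, which samples multiple candidate templates and selects the best one using unsupervised agreement and supervised reward models, outperforming greedy decoding in document-level information extraction.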

Comments Submitted to EMNLP 2026

详情

AI中文摘要

文档级信息抽取（DocIE）旨在生成包含给定文档中出现的实体、关系和事件的输出模板。标准做法包括使用贪婪解码提示仅解码器的大语言模型以避免输出变异性。我们没有将这种变异性视为限制，而是表明采样可以产生比贪婪解码更好的解决方案，尤其是在使用推理模型时。因此，我们提出了ThinkTwice，一个采样和选择框架，其中大语言模型为给定文档生成多个候选模板，然后一个选择模块选择最合适的模板。我们引入了一种利用生成输出之间一致性的无监督方法，以及一种使用在标记DocIE数据上训练的奖励模型的有监督选择方法。为了解决DocIE中黄金推理轨迹的稀缺性，我们提出了一种基于拒绝采样的方法来生成将输出模板与推理轨迹配对的银训练数据。我们的实验证明了无监督和有监督ThinkTwice的有效性，始终优于贪婪基线和有监督的最先进方法。

英文摘要

Document-level Information Extraction (DocIE) aims to produce an output template with the entities, relations, and events of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a rejection-sampling-based method to generate silver training data that pairs output templates with reasoning traces. Our experiments show the validity of unsupervised and supervised ThinkTwice, consistently outperforming greedy baselines and the supervised state-of-the-art.
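
One way to picture the unsupervised, agreement-based selection is sketched below. It assumes each sampled template is flattened into a set of (slot, value) pairs and that agreement is measured with Jaccard overlap; both choices, and the names `jaccard` and `select_by_agreement`, are illustrative assumptions rather than the method's actual definition.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of (slot, value) pairs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def select_by_agreement(candidates):
    """Return the index of the candidate with the highest mean overlap
    with all other candidates (first one wins on ties)."""
    best_index, best_score = 0, float("-inf")
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(jaccard(cand, o) for o in others) / len(others) if others else 0.0
        if score > best_score:
            best_index, best_score = i, score
    return best_index

# Four sampled templates for one document (toy data).
candidates = [
    {("attacker", "group A"), ("target", "embassy")},
    {("attacker", "group A"), ("target", "embassy"), ("weapon", "bomb")},
    {("attacker", "group B")},
    {("attacker", "group A"), ("target", "embassy")},
]
chosen = candidates[select_by_agreement(candidates)]
```

The outlier template that names a different attacker gets zero overlap with the rest, so sampling plus selection can filter out an answer that a single greedy decode might have committed to.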

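2601.14758 2026-05-29 cs.LG cs.AI cs.CL (updated version)

Differential Encoding of Syntax and Semantics in Large Language Models

Santiago Acevedo, Alessandro Laio, Marco Baroni

Affiliations: Catalan Institute of Research and Advanced Studies (ICREA) and Universitat Pompeu Fabra (UPF)

AI summary: By averaging the hidden-representation vectors of sentences that share syntactic structure or meaning, this study finds that syntactic and semantic information is at least partly linearly encoded in the internal layer representations of large language models (with DeepSeek-V3 as the example), and that the two have distinct encoding profiles and can be decoupled to some extent.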

Comments Published as conference paper at ICML 2026

2601.00065 2026-05-29 cs.LG cs.CL cs.CR (updated version)

When the Same Coefficients Reach Different Places: Asymmetric Realizability in Transplanting Tokenizers across Large Language Models

Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao

Affiliations: Purdue University; Carnegie Mellon University

AI summary: Identifies a geometric asymmetry in tokenizer transplant for cross-vocabulary model composition and constructs breaker tokens that exploit it, experimentally verifying that they exist across many model pairs and are robust to defenses such as fine-tuning and spectral filtering.

Abstract

Tokenizer transplant in cross-vocabulary model composition reconstructs donor-only embedding rows as weighted combinations over shared lexical anchors and reuses those coefficients on the base. We identify a structural geometric property of this reconstruction: the same coefficient vector reaches different sets in the donor and base anchor spans, an asymmetric realizability gap. Across 65 donor-base pairs under OMP, with cross-operator validation on CLP, WECHSEL, and FOCUS, we construct breaker tokens: single coefficient vectors that remain statistically inert in the donor anchor span while producing a high-salience reconstruction in the base. The same Gemma-2-2B donor checkpoint admits this construction against 13 different downstream bases drawn from five model families. The planted direction passes weight-merging with a clean reference unchanged. In a deployer case study, standard LoRA fine-tuning suppresses the breaker primarily on prompts whose distribution matches the training corpus and is not a sufficient mitigation against this attack family in our setting. The tested spectral filters miss the asymmetry. We discuss potential misuse in the open-weight composition supply chain.

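2512.14754 2026-05-29 cs.SE cs.AI cs.CL (updated version)

Revisiting the Reliability of Language Models in Instruction-Following

Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei, Chao Zhang, Han Qiu

Affiliations: Tsinghua University; Ant Group

AI summary: Proposes the reliable@k metric and an automated pipeline for generating cousin prompts, builds the IFEval++ benchmark, finds that current models' performance drops by up to 61.8% under nuanced prompt changes, and explores three improvement recipes.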

Comments ACL 2026 main oral

Abstract

Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt modifications. What's more, we characterize it and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are accessible: https://github.com/jianshuod/IFEval-pp.
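
The abstract does not spell out how reliable@k is computed, so the sketch below rests on an assumed definition: a group of cousin prompts counts as reliable only if the model passes all of its first k prompts, and the metric is the fraction of such groups. The function name `reliable_at_k` and the input layout are hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List


def reliable_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of prompt groups whose first k prompts are all passed (assumed definition).

    `results` maps a group id to pass/fail outcomes, one per cousin prompt.
    Groups with fewer than k prompts are skipped.
    """
    if k < 1:
        raise ValueError("k must be positive")
    groups = [passes for passes in results.values() if len(passes) >= k]
    if not groups:
        raise ValueError("no group has at least k prompts")
    return sum(all(passes[:k]) for passes in groups) / len(groups)
```

Under this assumed definition the score can only stay flat or fall as k grows, which matches the intuition that consistency across more cousin prompts is harder to achieve than accuracy on a single prompt.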

2511.08949 2026-05-29 cs.CL (updated version)

EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI

Longfei Zuo, Barbara Plank, Siyao Peng

Affiliations: Technical University of Munich; MaiNLP, Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning (MCML)

AI summary: Proposes the EVADE framework, which uses large language models to generate and validate explanations in order to detect annotation errors in NLI datasets; experiments show that LLM validation reduces human effort and improves fine-tuning performance.

Abstract

High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework, VARIERR (Weber-Genzel et al., 2024), asks multiple annotators to explain their label decisions in the first round and flags errors through validity judgments in the second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields larger improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.

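2510.22437 2026-05-29 cs.AI cs.CL (updated version)

Modeling Hierarchical Thinking in Large Reasoning Models

G M Shahariar, Erfan Shayegani, Ali Nazari, Nael Abu-Ghazaleh

Affiliations: University of California, Riverside; Independent Researcher

AI summary: Proposes approximating the hierarchical reasoning dynamics of large reasoning models (LRMs) as trajectories in a finite state machine (FSM), and achieves efficient inference-time optimization of reasoning through Q-value-guided control.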

Comments Accepted in ICML 2026 as Oral

Abstract

Large Reasoning Models (LRMs) solve complex tasks by generating long Chain-of-Thought (CoT) sequences; however, the emergent dynamics governing reasoning trajectories are not well understood and can lead to inconsistencies and reasoning pathologies. In this work, we propose to approximate the emergent hierarchical reasoning dynamics of LRMs as a trajectory within a Finite State Machine (FSM) transitioning among six abstract cognitive states. We demonstrate that these states and transitions can be captured in the latent state of the model. We believe that this representation can have different applications in the interpretability and optimization of LRMs. For example, by analyzing the topology of these transitions, we identify statistical shifts in reasoning strategies that help distinguish effective reasoning chains from those that fail. To illustrate these potential advantages, we propose Q-Value guided steering, a training-free inference-time control method that treats reasoning as a planning problem. We estimate the long-horizon utility of state transitions and apply sparse, orthogonal activation steering at sentence boundaries to align the CoT generation with optimal reasoning policies. Experiments across four benchmarks (AIME25, MATH-500, GSM8k, and GPQA Diamond) using three state-of-the-art open reasoning models demonstrate that the Q-Value steering policy achieves significant performance gains with "surgical" efficiency, often requiring 25 times fewer interventions than greedy and weighted baselines, which suggests that reasoning can be effectively controlled by guiding high-level cognitive dynamics rather than micro-managing token generation. Code is available at: https://github.com/shahariar-shibli/CoT-FSM.
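
Analyzing the topology of state transitions starts from an estimate of how often one state follows another. The following is a minimal sketch, assuming each reasoning trace has already been labeled sentence by sentence with an abstract state. The abstract does not name the six states, so the labels used in the test are placeholders, and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def transition_probabilities(traces: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate first-order transition probabilities from sequences of state labels."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for trace in traces:
        # Pair each state with its successor within the same trace.
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }
```

Computing one such table for successful traces and another for failed ones gives a simple way to look for the kind of statistical shift in strategy that the abstract mentions.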

2510.04704 2026-05-29 cond-mat.mtrl-sci cs.AI cs.CL (updated version)

AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials

Taoyuze Lv, Alexander Chen, Fengyu Xie, Chu Wu, Jeffrey Meng, Dongzhan Zhou, Yingheng Wang, Bram Hoex, Zhicheng Zhong, Tong Xie

Affiliations: University of New South Wales, NSW, Sydney, Australia; Suzhou Institute for Advanced Research, University of Science; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Cornell University

AI summary: Proposes the AtomWorld benchmark, which evaluates LLMs' spatial reasoning in materials science through ten fundamental atomic-structure operations; finds that Claude Opus 4.6 performs best but that success rates are low on operations involving complex spatial relations, suggesting LLMs are better suited as assistive tools than as fully autonomous research agents.

Abstract

Large language models (LLMs) have shown promising potential in scientific research, enabling tasks ranging from knowledge retrieval to property prediction. Existing science benchmarks mainly focus on perceptual or knowledge-based tasks, largely ignoring the modelling tasks, a fundamental starting point for any real scientific research. For materials science, constructing and manipulating atomic structures is one of the most creative and least automated steps. In this work, we introduce AtomWorld, a benchmark designed to evaluate the abilities of LLMs on structure modifications. The benchmark includes ten fundamental actions under four widely used modelling categories, enabling verifiable evaluation metrics. We find that Claude Opus 4.6 generally performs the best. However, the success rate decreases markedly with increasing modelling complexity, with particularly low success rates (below 12% for rotation) for operations involving complex spatial relations. Our results suggest that contemporary LLMs are better suited as copilots for materials structure modelling rather than fully unsupervised autonomous scientific agents. Beyond evaluation, AtomWorld also serves as a testbed and playground for developing future structure-aware models, including reinforcement learning and agentic approaches.

2508.19282 2026-05-29 cs.CL cs.AI (updated version)

Less Is More: Elevating RAG via Performance-Driven Context Compression

Ziqiang Cui, Yunpeng Weng, Xing Tang, Peiyang Liu, Shiwei Li, Bowei He, Jiamin Chen, Yansen Zhang, Xiuqiang He, Chen Ma

Affiliations: City University of Hong Kong, Hong Kong SAR, China; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; Huazhong University of Science and Technology; Peking University, Beijing, China; Shenzhen Technology University, Shenzhen, China

AI summary: Proposes the CORE-RAG framework, which uses task performance as a feedback signal to iteratively optimize the compression policy, improving the average exact-match score by 3.3 points at a 3% compression ratio.

Comments Accepted by ICML 2026

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for improving the timeliness of knowledge updates and the factual accuracy of large language models. However, incorporating a large volume of retrieved documents significantly increases input length, leading to prohibitive computational costs. Existing compression approaches often compromise task performance, primarily due to their reliance on predefined heuristics. These heuristics fail to ensure that the compressed context is conducive to the generation tasks. To address these limitations, we propose CORE-RAG, a novel framework for context compression in RAG systems. CORE eliminates reliance on proxy heuristics through a performance-driven learning framework, which directly utilizes task performance as a feedback signal to iteratively refine the compressor policy. Prior to this optimization process, we incorporate a knowledge distillation phase to initialize the compressor with a robust policy. Extensive experiments demonstrate the superiority of our approach. At a high compression ratio of 3%, CORE not only avoids performance degradation but also improves the average Exact Match (EM) score by 3.3 points compared to using full documents. Our code is available at https://github.com/ziqiangcui/CORE-RAG-ICML26.

2508.19202 2026-05-29 cs.CL (updated version)

Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

Alan Li, Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan

Affiliations: Department of Computer Science, Yale University; Department of Molecular & Cellular Biology, Harvard University; Allen Institute for AI; Department of Computer Science, Northwestern University

AI summary: Proposes the SciReas benchmark and the KRUX probing framework to systematically evaluate the roles of knowledge and reasoning in LLMs' scientific reasoning, finding that knowledge retrieval is the main bottleneck and that both external in-context knowledge and reasoning enhancement improve performance.

Comments 33 pages, 18 figures

Journal ref: ICML 2026 Main Conference

Abstract

Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge.

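2502.07623 2026-05-29 cs.CL (updated version)

Lexical categories of stem-forming roots in Mapudüngun verb forms

Andrés Chandía

Affiliations: Department of Catalan Philology and General Linguistics, University of Barcelona

AI summary: This study verifies and corrects the lexical-category classification of verb roots in a morphological analysis system for the Mapuche language, in order to improve the computational analyzer and clarify ambiguities in the language's lexical categories.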

Comments 36 pages, 2 large tables, 2 sample tables

2502.03805 2026-05-29 cs.CL (updated version)

CriticalKV: Optimizing KV Cache Eviction from an Output Perturbation Perspective

Yuan Feng, Junlin Lv, Haoyu Guo, Yukun Cao, S Kevin Zhou, Xike Xie

Affiliations: School of Computer Science, University of Science and Technology of China; School of Biomedical Engineering, USTC; Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research; School of Computer Science and Technology, Xidian University

AI summary: By analyzing attention output perturbation, this paper proposes a perturbation-constrained algorithm for selecting KV cache entries that substantially reduces compression loss.

Comments ICML 2026

Abstract

Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large KV cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. We demonstrate that our algorithm is a universal, plug-and-play enhancement that incurs negligible computational overhead. When integrated with three state-of-the-art cache eviction methods on three distinct LLMs, our algorithm significantly reduces the compression loss by more than half on average across 29 datasets from the Ruler and LongBench benchmarks. Further perturbation analysis, at both the head and layer levels, confirms the principles underlying our effectiveness. This work offers a new, formally grounded perspective to cache eviction, opening promising avenues for future research. The code is publicly available at https://github.com/FFY0/DefensiveKV.
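
The observation that value states matter alongside attention weights can be illustrated with a toy scoring rule. The sketch below is a simplified stand-in, not the paper's perturbation-constrained algorithm: it assumes the value vectors have already been multiplied by the output projection, scores each cache entry by its attention weight times the norm of that vector, and keeps the top entries within a budget. The function name `select_kv_entries` is hypothetical.

```python
import math
from typing import List, Sequence


def select_kv_entries(
    attn: Sequence[float],
    values: Sequence[Sequence[float]],
    budget: int,
) -> List[int]:
    """Keep the `budget` entries with the largest attn_i * ||v_i||, in position order.

    Assumption: `values` holds value vectors after the output projection, so the
    norm approximates each entry's contribution to the attention output.
    """
    if len(attn) != len(values):
        raise ValueError("attn and values must have the same length")
    scores = [a * math.sqrt(sum(x * x for x in v)) for a, v in zip(attn, values)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

In the test case for this sketch, the entry with the highest attention weight has a near-zero value vector and is evicted, whereas a ranking by attention weight alone would have kept it.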

2406.10238 2026-05-29 cs.CL cs.LG cs.SI (updated version)

Early Detection of Misinformation for Infodemic Management: A Domain Adaptation Approach

Minjia Mao, Xiaohang Zhao, Xiao Fang

Affiliations: Lerner College of Business and Economics, University of Delaware; School of Information Management & Engineering, Shanghai University of Finance and Economics

AI summary: To address the lack of labeled data in the early stage of an infodemic, proposes a domain-adaptation misinformation detection method that handles both covariate shift and concept shift and outperforms existing methods on real-world datasets.

Abstract

An infodemic refers to an enormous amount of true information and misinformation disseminated during a disease outbreak. Detecting misinformation at the early stage of an infodemic is key to reducing its harm to public health. An early-stage infodemic is characterized by a large volume of unlabeled information concerning a disease. As a result, conventional misinformation detection methods are not suitable for this misinformation detection task because they rely on labeled information in the infodemic domain to train their models. To address this limitation, state-of-the-art methods learn their models using labeled information in other domains to detect misinformation in the infodemic domain. The efficacy of these methods depends on their ability to mitigate both covariate shift (i.e., differences in feature distributions) and concept shift (i.e., differences in labeling patterns) between the infodemic domain and the domains from which they leverage labeled information. However, these methods focus on mitigating covariate shift but overlook concept shift, rendering them less effective for the task. In response, we theoretically show the necessity of tackling both covariate and concept shifts as well as how to operationalize each of them. Built on the theoretical analysis, we develop a novel misinformation detection method that addresses both covariate and concept shifts. Using real-world datasets, we conduct extensive empirical evaluations to demonstrate the superior performance of our method over state-of-the-art misinformation detection methods as well as prevalent domain adaptation methods that can be tailored to solve the misinformation detection task.

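2405.13003 2026-05-29 cs.CL cs.AI cs.IR (updated version)

A Survey on Recent Advances in Conversational Data Generation

Heydar Soudani, Roxana Petcu, Evangelos Kanoulas, Faegheh Hasibi

Affiliations: Radboud University; University of Amsterdam

AI summary: Systematically surveys methods for generating multi-turn conversational data across three types of dialogue systems (open-domain, task-oriented, and information-seeking), presents a general framework comprising seed data creation, utterance generation, and quality filtering, and discusses evaluation metrics and future directions.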

DOI: 10.1145/3795686

Abstract

Recent advancements in conversational systems have significantly enhanced human-machine interactions across various domains. However, training these systems is challenging due to the scarcity of specialized dialogue data. Traditionally, conversational datasets were created through crowdsourcing, but this method has proven costly, limited in scale, and labor-intensive. As a solution, the development of synthetic dialogue data has emerged, utilizing techniques to augment existing datasets or convert textual resources into conversational formats, providing a more efficient and scalable approach to dataset creation. In this survey, we offer a systematic and comprehensive review of multi-turn conversational data generation, focusing on three types of dialogue systems: open domain, task-oriented, and information-seeking. We categorize the existing research based on key components like seed data creation, utterance generation, and quality filtering methods, and introduce a general framework that outlines the main principles of conversation data generation systems. Additionally, we examine the evaluation metrics and methods for assessing synthetic conversational data, address current challenges in the field, and explore potential directions for future research. Our goal is to accelerate progress for researchers and practitioners by presenting an overview of state-of-the-art methods and highlighting opportunities to further research in this area.

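2605.29582 2026-05-29 cs.LG cs.CL (updated version)

PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning

Qikai Chang, Zhenrong Zhang, Linbo Chen, Pengfei Hu, Jianshu Zhang, Youhui Guo, Jun Du

Affiliations: University of Science and Technology of China; iFLYTEK Research

AI summary: Proposes the PEARL framework, which trains Socratic tutoring agents using a controllable student simulator, a generative reward model, and stable multi-objective reinforcement learning, achieving the best performance among open-source models on multiple benchmarks while remaining competitive with proprietary models.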

Comments 16 pages, 7 figures

Abstract

Large Language Models (LLMs) have shown promise as educational tutors, yet effective tutoring requires more than solving problems: it must provide progressive Socratic guidance and balance multiple pedagogical objectives across multi-turn interactions. However, training such tutors remains challenging due to limited-fidelity and weakly controllable student simulation, under-specified pedagogical reward modeling, and unstable multi-objective optimization. To overcome these limitations, we propose PEARL, a pedagogically aligned reinforcement learning framework for training Socratic tutoring agents, consisting of three key components. First, we introduce a controllable student simulator that decouples latent cognitive states from response generation to model diverse abilities and misconceptions. Second, we develop a generative reward model that jointly evaluates pedagogical quality and objective correctness for policy optimization. Finally, we propose a stable multi-objective RL scheme that discretizes rewards within each dimension and aggregates normalized advantages across dimensions, preventing high-variance objectives from dominating updates. Experiments on multiple benchmarks show that PEARL achieves the best performance among open-source models and remains competitive with leading proprietary LLMs, despite using only a 30B policy model.
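
The multi-objective scheme, which discretizes rewards within each dimension and then aggregates normalized advantages across dimensions, can be sketched in a few lines. This is a minimal illustration under stated assumptions: rewards lie in [0, 1], discretization snaps to evenly spaced levels, normalization is a per-dimension z-score over a group of rollouts, and aggregation is a plain average. None of these choices, nor the names `discretize` and `aggregate_advantages`, are confirmed by the abstract.

```python
from statistics import mean, pstdev
from typing import List


def discretize(reward: float, levels: int) -> float:
    """Snap a reward in [0, 1] to one of `levels` evenly spaced values."""
    if levels < 2:
        raise ValueError("need at least two levels")
    clipped = min(1.0, max(0.0, reward))
    return round(clipped * (levels - 1)) / (levels - 1)


def aggregate_advantages(
    rewards: List[List[float]], levels: int = 3, eps: float = 1e-8
) -> List[float]:
    """rewards[i][d] is the reward of rollout i on dimension d.

    Returns one scalar advantage per rollout: the mean over dimensions of the
    z-scored, discretized reward (an assumed aggregation rule).
    """
    n_dims = len(rewards[0])
    totals = [0.0] * len(rewards)
    for d in range(n_dims):
        column = [discretize(r[d], levels) for r in rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, value in enumerate(column):
            # Each dimension contributes on the same unit-variance scale.
            totals[i] += (value - mu) / (sigma + eps)
    return [t / n_dims for t in totals]
```

Because every dimension is normalized before averaging, a dimension with widely spread raw rewards cannot outweigh the others, which is the failure mode the abstract says the scheme is meant to prevent.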

2605.29559 2026-05-29 cs.CL (updated version)

LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

Xiaoxuan Peng, Kaiqi Zhang, Xinyu Lu, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes LiteCoder-Terminal-Gen, a zero-dependency synthesis framework that automatically generates executable and verifiable terminal training environments, builds large-scale SFT and RL datasets, and substantially improves language agents' performance on terminal tasks through supervised fine-tuning and Direct Multi-turn Preference Optimization.

Abstract

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, featuring 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively. Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal for mastering complex, real-world command-line workflows.

2605.29555 2026-05-29 cs.CL (updated version)

From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals

Yeyong Yu, Wenya Hu, Xing Wu, Quan Qian

Affiliations: School of Computer Engineering & Science, Shanghai University; Center of Materials Informatics and Data Science, Materials Genome Institute, Shanghai University; Key Laboratory of Silicate Cultural Relics Conservation (Shanghai University), Ministry of Education, China; Shanghai Institute for Advanced Communication and Data Science, Shanghai University

AI summary: Proposes MaterEval, a knowledge-augmented preference signals framework that uses pairwise preference data to steer large language models from intuitive judgment toward reliable, evidence-based evaluation, and introduces a fast-slow reasoning scheme to balance throughput, cost, and reliability; its effectiveness is validated on high-entropy alloy assessment.

Comments 33 pages, 5 figures

Abstract

As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a Knowledge-Augmented Preference Signals Framework, MaterEval, that automatically produces, for the same candidate, two evaluations: an informed judgment that follows expert rules and provides supporting evidence, and a rule-removed blind guess. By pairing the two evaluations as preference data, we guide general-purpose large language models (LLMs), originally lacking materials-specific criteria, from intuitive judgment toward reliable evaluation supported by explicit evidence. To balance throughput, cost, and reliability, we further introduce a fast-slow reasoning scheme that decouples large-scale rapid screening from in-depth review on a small subset. Using high-entropy alloy (HEA) assessment as a case study, we show that, without external retrieval and relying solely on internalized capabilities, small open-source LLMs achieve substantial gains in accuracy, conclusion consistency, and evidence discrimination, approaching the performance of rule-based closed-source LLMs. These results demonstrate that expert rules can be systematically transformed into learnable preference signals, enabling a low-cost and deployable evaluation module for autonomous materials discovery loops.

2605.29543 2026-05-29 cs.LG cs.AI cs.CL cs.HC cs.IR 版本更新

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

SCOPE：一种用于空中交通管制复诵监控的轻量训练LLM框架

Qihan Deng, Minghua Zhang, Yang Yang, Zhenyu Gao

Affiliations: Department of Mechanical and Aerospace Engineering, The Hong Kong University of Science and Technology; School of Electronic and Information Engineering, Beihang University; State Key Laboratory of CNS/ATM

AI summary: Proposes SCOPE, a framework that couples a frozen LLM with a plug-in open-set classifier and an in-context learning mechanism for efficient and accurate ATC readback monitoring; in the few-shot setting it reaches 91.05% open-set detection accuracy and corrects 96.63% of anomalous readbacks.

AI中文摘要

飞行员对空中交通管制（ATC）语音指令的复诵是航空运输中防止沟通失误的主要保障。然而，复诵异常仍与约80%的航空事故相关。这一脆弱性因交通量增加和认知负荷升高而进一步加剧，从而推动了机器自动化复诵监控的需求。传统的基于规则和机器学习的方法难以在高度可变且不断演变的空管-飞行员通信术语中泛化。尽管大语言模型（LLM）凭借其强大的推理和泛化能力开辟了新途径，但现有方法在实践中仍面临部署和计算障碍。在这项工作中，我们提出了SCOPE（Semantic reasoning for Communication via Open-set Plug-in with Examples），一种新颖的轻量训练LLM框架，提升了基于机器的ATC复诵监控的效率和准确性。核心思想是在冻结的LLM之上，将插件式开放集分类器与精心设计的上下文学习机制相结合。在半合成通信数据集上的大量实验表明，SCOPE在实现运行环境所需的低延迟响应的同时，达到了优越的准确性。在少样本设置下，SCOPE在开放集检测中达到91.05%的准确率，并纠正了96.63%的异常复诵，从而在提供决策解释的同时优于现有最强基线。这些发现证明了我们的框架作为通向可解释和可控的ATC复诵监控的实用途径的潜力。

英文摘要

Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.

2605.29502 2026-05-29 cs.CL cs.AI 版本更新

含文本图像的机器翻译系统比较评估

Blai Puchol, Sergio Gómez González, Miguel Domingo, Francisco Casacuberta

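Affiliations: ValgrAI - Valencian Graduate School and Research Network for Artificial Intelligence

AI summary: Compares three machine translation paradigms (modular pipelines, multimodal large language models, and the end-to-end model Translatotron-V) on the task of translating images that contain text, and finds that multimodal large language models perform best.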

AI中文摘要

本文对应用于包含文本信息的图像的机器翻译系统进行了比较评估，该任务位于计算机视觉和自然语言处理的交叉领域。研究比较了三种主要范式：分离文本检测、识别和翻译的模块化流水线；能够联合处理图像和文本的多模态大语言模型（MLLM）；以及直接生成翻译图像的端到端模型Translatotron-V。模块化系统采用最先进的OCR（docTR）结合多语言LLM（如Llama和EuroLLM），而评估的MLLM包括Gemini 2.5的不同配置。实验在覆盖多种语言对的并行多语言数据集上进行，基于BLEU、chrF和TER指标进行评估。结果表明，模块化流水线优于端到端方法，而MLLM实现了最佳整体性能，展现出卓越的灵活性和上下文理解能力。这些发现强调了多模态推理在图像到文本翻译中的有效性，并为未来在多语言环境中整合视觉理解和语言生成的研究提供了坚实基础。

英文摘要

This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study compares three main paradigms: modular pipelines that separate text detection, recognition, and translation; multi-modal large language models (MLLMs) capable of processing both image and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems employ state-of-the-art OCR (docTR) combined with multilingual LLMs such as Llama and EuroLLM, while the evaluated MLLMs include different configurations of Gemini 2.5. Experiments were conducted on parallel multilingual datasets covering multiple language pairs, with evaluation based on BLEU, chrF, and TER metrics. The results show that modular pipelines outperform the end-to-end approach, while MLLMs achieve the best overall performance, demonstrating superior flexibility and contextual understanding. These findings underscore the effectiveness of multi-modal reasoning for image-to-text translation and provide a solid foundation for future research on integrating visual understanding and language generation in multilingual settings.

2605.29473 2026-05-29 cs.HC cs.AI cs.CL cs.CY cs.SI 版本更新

迈向具有智能体纠正和语义评估的类人交互式语音识别

Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen

Affiliations: College of Artificial Intelligence, Xi’an Jiaotong University; X-LANCE Lab, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; The Chinese University of Hong Kong, Shenzhen; Fudan University; Tongyi Fun Team, Alibaba Group

AI summary: Proposes Agentic ASR, a closed-loop framework that reduces semantic errors through multi-turn interaction and semantic correction, and introduces the Sentence-level Semantic Error Rate (S^2ER) as an evaluation metric.

AI中文摘要

自动语音识别（ASR）是人机交互的核心组成部分，也是基于LLM的助手和智能体日益重要的前端。然而，当前大多数ASR系统仍遵循单遍范式，这与人类通信方式不一致——在人类通信中，误解通过迭代澄清和修正来解决。这种不匹配使得一旦发生意义关键的错误，很难纠正。同时，词错误率（WER）或字符错误率（CER）等词级指标无法充分反映此类问题。为解决这些局限，我们将交互式ASR形式化为多轮修正任务，并提出Agentic ASR，一种结合单遍ASR前端与语义纠正、意图路由和基于推理编辑的闭环框架。我们进一步引入句子级语义错误率（S^2ER），一种基于LLM的语义评估指标，以及交互式仿真系统，用于可扩展和可复现的基准测试。在多语言、命名实体密集和代码切换基准上的实验表明，迭代交互持续减少语义错误，在S^2ER上的提升远大于传统词级指标。人机对齐和消融研究进一步验证了语义判断器的可靠性和所提框架的鲁棒性。代码见：https://interactiveasr.github.io/，在线演示见：https://i-asr.sjtuxlance.com/

英文摘要

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S^2ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S^2ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/

2605.29427 2026-05-29 cs.CL 版本更新

FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions

FinGuard：检测LLM交互中的金融监管违规

Huaixia Dou, Jie Zhu, Minghao Wu, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang

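Affiliations: Qwen DianJin Team, Alibaba Cloud Computing; Tongyi Lab, Alibaba Group; School of Computer Science and Technology, Soochow University

AI summary: Addresses the detection of regulatory non-compliance in financial LLM interactions with an automated pipeline built on regulatory documents, constructs FinGuard-Bench, described as the first financial compliance detection benchmark, and trains the FinGuard model, which substantially outperforms existing methods on that benchmark.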

AI中文摘要

随着大型语言模型（LLM）在金融服务中的部署日益增多，一次不合规的交互就可能使机构面临监管处罚并直接损害消费者利益。现有的防护模型围绕通用危害分类构建，忽略了基于特定金融法规的违规行为。我们通过一个直接操作监管文档的监管驱动管道来弥补这一空白，该管道归纳出金融合规风险分类，并在没有任何预定义违规类别的情况下合成基于监管的训练数据。将该管道应用于中国金融法规，我们发布了FinGuard-Bench，据我们所知，这是首个金融监管合规检测基准，在查询和回复层面均带有专家标注的标签。我们进一步训练了FinGuard，这是一个基于Qwen3-8B构建的金融合规检测模型，通过监督微调和自我对弈强化学习在基于监管的数据上进行训练。在FinGuard-Bench上，FinGuard显著优于所有基线，包括专用防护模型和更大的通用LLM，如Qwen3.5-397B-A17B和GPT-5.1。此外，FinGuard还保留了通用安全能力，并能仅使用政策文档适应未见过的机构特定政策。我们将在GitHub上公开发布本工作中使用的代码、提示和资源。

英文摘要

As large language models (LLMs) are increasingly deployed in financial services, a single non-compliant interaction can expose institutions to regulatory penalties and direct consumer harm. Existing guard models are built around general harm taxonomies and overlook violations grounded in specific financial regulations. We address this gap with a regulation-driven pipeline that operates directly on regulatory documents, inducing a financial compliance risk taxonomy and synthesizing grounded training data without any predefined violation categories. Instantiating the pipeline on Chinese financial regulations, we release FinGuard-Bench, to our knowledge the first benchmark for financial regulatory compliance detection, with expert-annotated labels at both the query and response levels. We further train FinGuard, a financial compliance detection model built on Qwen3-8B and trained on the regulation-grounded data via supervised fine-tuning and self-play reinforcement learning. On FinGuard-Bench, FinGuard substantially outperforms all baselines, including dedicated guard models and much larger general-purpose LLMs such as Qwen3.5-397B-A17B and GPT-5.1. Furthermore, FinGuard also preserves general safety capabilities and adapts to unseen institution-specific policies using policy documents alone. We will publicly release the code, prompts, and resources used in this work on GitHub.

2605.29421 2026-05-29 cs.CL 版本更新

Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design

将设计技能学习为记忆策略用于智能光子逆向设计

Shengchao Chen, Ting Shu, Sufen Ren

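Affiliations: AAII, University of Technology Sydney; School of Artificial Intelligence, Shenzhen University; School of Information and Communication Engineering, Hainan University

AI summary: Proposes SkillPCF, a closed-loop agentic framework that addresses knowledge accumulation in photonic crystal fiber inverse design through a physics-guided memory skill library, reinforcement-learning-based skill selection, and simulator-grounded skill evolution, and achieves a better trade-off between design quality and efficiency on real datasets.

Comments AI4Physics@ICML 2026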

潜在词：密集检索器包含可轻易提取的符合齐夫分布的BM25就绪词汇表

Benjamin Clavié, Sean Lee, Aamir Shakir, Makoto P. Kato

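Affiliations: Mixedbread AI; National Institute of Informatics; University of Tsukuba

AI summary: Proposes Latent Terms, showing that the representations learned by dense retrieval models (single- or multi-vector) can be trivially decomposed into sparse features; a sparse autoencoder extracts a latent vocabulary that, without retrieval-specific adjustments, can be used directly for BM25 sparse retrieval and matches or outperforms the base model and SPLADE variants.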

AI中文摘要

我们提出潜在词方法，该方法揭示了训练用于密集检索的模型（无论是单向量还是多向量）学习到的表示可以轻易地分解为检索就绪的稀疏特征。当在冻结的检索器上训练时，无需任何检索特定调整的稀疏自编码器能够提取一个具有近似齐夫分布集合统计量的潜在词汇表，直接适用于通过BM25进行的经典稀疏检索评分。这种方法实现了稀疏检索，同时不需要任何学习到的扩展目标或稀疏检索监督，并且可以轻松应用于任何密集检索器。潜在词能够匹配或超越其自身基础模型以及可比较的SPLADE变体的单向量评分方法。此外，在专门设计用于突出单向量检索失败的任务LIMIT上，它显著优于其基础模型。总体而言，我们的结果强调了神经检索器包含比其默认评分函数所暴露的更具表达力和可索引的结构，但其他方法仍然可以利用这些结构。

英文摘要

We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations that can trivially be decomposed into retrieval-ready sparse features. When trained on frozen retrievers, Sparse Autoencoders without any retrieval-specific adjustments extract a latent vocabulary with approximately Zipfian collection statistics, directly suitable for classical sparse retrieval scoring via BM25. This approach enables sparse retrieval while requiring no learned expansion objective or sparse retrieval supervision whatsoever, and can be readily applied to any dense retriever. Latent Terms is able to match or outperform single-vector scoring methods from its own base model as well as comparable SPLADE variants. In addition, it substantially outperforms its base model on LIMIT, a task specifically designed to highlight the failures of single-vector retrieval. Overall, our results highlight that neural retrievers contain more expressive and indexable structure than their default scoring functions expose, but that other methods can nonetheless be leveraged.

2605.29379 2026-05-29 cs.CL cs.LG 版本更新

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

BrahmicTokenizer-131K：一种可替代o200k_base的印度文字兼容分词器

Rohan Shravan

Affiliations: The School of AI

AI summary: Proposes BrahmicTokenizer-131K, a byte-level BPE tokenizer with a 131,072-token vocabulary, built through a two-stage retrofit that substantially improves compression of Brahmic scripts while preserving performance on non-Indic text.

Comments 24 pages, 15 tables, 3 code listings. Tokenizer artifact, verification scripts, and reproduction code at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K and https://github.com/theschoolofai/BrahmicTokenizer-131K

AI中文摘要

我们提出了BrahmicTokenizer-131K，一种131,072词汇量的字节级BPE分词器，它在131K词汇量类别中弥合了印度文字（Brahmic）的压缩差距，同时保留了OpenAI的o200k_base在英语、欧盟语言和代码方面的压缩性能。我们通过两阶段改造构建了它：（1）脚本剪枝裁剪，通过移除九个不相关书写系统将200,019个令牌减少到131,072个；（2）外科手术式改造，通过线性规划分配在九个印度文字Unicode块中填充2,372个语料库中缺失的词汇槽位。预分词器、解码器和继承的合并规则与o200k_base保持不变，使得BrahmicTokenizer-131K在分词器接口上成为即插即用的替代品。在2700万份公开印度语预训练文本（28.4亿词，46.21 GB）上，BrahmicTokenizer-131K在相同词汇预算下产生的令牌比Mistral-Nemo Tekken / Sarvam-m少26.7%，每种语言的节省幅度从15.79%（泰米尔语）到76.79%（奥里亚语，压缩比4.31倍）。奥里亚语的优势在机制上可解释为Tekken/Sarvam-m包含零个奥里亚语块令牌；我们的改造添加了725个。在非印度语内容上，BrahmicTokenizer-131K与o200k_base的英语词汇生育率相当（1.235 vs 1.232令牌/词），并在HumanEval、MBPP和GSM8K上比Tekken/Sarvam-m好4.0-14.2%。在我们的14个分词器基准测试中，它是唯一一个在131K预算下同时在印度文字、英语、欧盟语言、代码和数学上具有竞争力的分词器。其他词汇类别的专用分词器（Sarvam-30B、Sarvam-1、MUTANT-Indic）以牺牲非印度语性能为代价实现了更好的印度语压缩：Sarvam-1的英语词汇生育率比我们差15.9%，其代码/数学压缩比我们差26-33%。我们在Apache 2.0许可下发布该工件，地址为https://huggingface.co/theschoolofai/BrahmicTokenizer-131K。

英文摘要

We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam-m containing zero Oriya-block tokens; our surgery added 725. On non-Indic content, BrahmicTokenizer-131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. Across our 14-tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam-30B, Sarvam-1, MUTANT-Indic) achieve better Indic compression at the cost of non-Indic performance: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K.
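
The compression figures quoted above can be related with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical and not taken from the paper's released scripts, and it assumes that "fertility" means tokens per whitespace-separated word and that the quoted compression ratio is baseline tokens divided by the new tokenizer's tokens.

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Tokens emitted per word; lower means better compression."""
    return num_tokens / num_words


def token_savings(ours: int, baseline: int) -> float:
    """Fraction of tokens saved relative to a baseline tokenizer."""
    return 1.0 - ours / baseline


def compression_ratio_from_savings(savings: float) -> float:
    """Baseline-to-ours token ratio implied by a fractional saving."""
    return 1.0 / (1.0 - savings)


# A 76.79% token saving (the Odia figure) implies roughly a 4.31x ratio,
# and a 15.79% saving (the Tamil figure) implies roughly 1.19x.
odia_ratio = compression_ratio_from_savings(0.7679)
tamil_ratio = compression_ratio_from_savings(0.1579)
```

Under these assumptions the Odia saving and the 4.31x ratio in the abstract are two views of the same measurement.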

2605.29368 2026-05-29 cs.CL cs.AI 版本更新

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

SURGENT: 一种跨围手术期工作流程的手术多智能体辅助系统

Dongsheng Shi, Yue Li, Xin Yi, Yongyi Cui, Huawei Feng, Linlin Wang

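Affiliations: East China Normal University; City University of Hong Kong

AI summary: Proposes SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning, with a novel memory design that manages long-term patient histories and short-term working summaries; it outperforms baseline LLMs and existing medical multi-agent frameworks on five perioperative tasks.

Comments preprint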

AI中文摘要

现代外科护理的复杂性需要智能系统能够综合大量患者记录，支持协作决策，并在整个围手术期工作流程中提供透明、可审计的推理。尽管基于网络的大型语言模型（LLM）具有先进的推理能力，但由于输入长度限制、不完整的记忆管理和有限的可追溯性等关键限制，它们不适合外科应用。为了解决这个问题，我们提出了SURGENT，一种手术多智能体辅助系统，它结合了思维树规划器、多科室协作智能体以及基于临床指南和生物医学文献的检索增强推理。SURGENT具有一种新颖的记忆设计，可以管理长期患者病史和短期工作摘要，从而实现更完整、情境化和一致的推理。在五项关键围手术期任务（病例分析、手术计划模拟、安全监测、并发症风险评估和康复指导）上的实验评估表明，SURGENT优于基线LLM和现有的医疗多智能体框架，生成的推荐与患者病史更加一致。消融研究进一步突出了DeepSeek作为本地可部署骨干模型的优势，使其能够在无需依赖集中服务的情况下实现隐私保护部署。这些结果使SURGENT成为迈向智能、公平和安全的外科辅助系统的实用且可信的进步。

英文摘要

The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collaborative decision-making, and provide transparent, auditable reasoning across the entire perioperative workflow. Although web-based Large Language Models (LLMs) possess advanced reasoning capabilities, they are ill-equipped for surgical applications due to critical limitations: input length constraints, incomplete memory management, and limited traceability. To address this issue, we present SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning with clinical guidelines and biomedical literature. SURGENT features a novel memory design that manages both long-term patient histories and short-term working summaries, enabling more complete, contextualized, and consistent reasoning. Experimental evaluations across five key perioperative tasks - case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance - show that SURGENT outperforms baseline LLMs and existing medical multi-agent frameworks, yielding recommendations more closely aligned with patient histories. Ablation studies further highlight the advantage of DeepSeek as a locally deployable backbone model, enabling privacy-preserving deployment without reliance on centralized services. These results position SURGENT as a practical and trustworthy advancement toward intelligent, equitable, and secure surgical assistance systems.

2605.29367 2026-05-29 cs.CL cs.CY cs.SI 版本更新

Attention Asymmetry in AI Layoff Discourse on X: A Computational Analysis of Capital vs Labour Amplification

X平台上AI裁员话语中的注意力不对称性：资本与劳动放大的计算分析

Joy Bose

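Affiliations: Independent Researcher

AI summary: Collects tweets from X and, using account-based collection, finds that capital discourse is amplified 3.12x more than labour discourse; a 2.69x asymmetry remains after normalising for follower count. Introduces the Amplification Ratio and the Amplification Normalisation Index as measures of platform-level discourse inequality.

Comments 18 pages, 3 figures, 9 tables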

AI中文摘要

当工人因AI驱动的重组而失业时，X（前Twitter）上同时发生两种截然不同的对话。科技高管和AI研究人员谈论生产力、转型和机遇。被解雇的工人和劳工批评者谈论失业、不确定性和恐惧。本文提出一个简单问题：哪种对话获得更多传播？我们报告了三项研究，使用两种收集方法和来自20个知名公共账户的763条推文。研究1使用基于关键词的收集（n=392），发现语料库之间无显著差异（p=0.891），表明关键词搜索对此任务噪声过大。研究2使用基于账户的收集（n=96），发现资本话语的平均放大优势是劳动话语的3.12倍（p=0.000003，Cohen's d=0.555）。研究3结合两种方法（n=763），确认了平均放大比4.18倍和中位数放大比10.77倍的结果（p<0.000001）。关键的是，在按粉丝数标准化后，不对称性仍然存在，为2.69倍（p=0.000009，Cohen's d=0.491），表明该效应并非仅仅是资本账户拥有更大受众的结果。该发现在所有测试的放大度量权重下均稳健。我们引入放大比和放大归一化指数作为衡量平台级话语不平等的简单指标。在Reddit上的跨平台复制（n=647条帖子）未复制该发现，表明不对称性可能特定于X基于账户的放大架构。我们讨论了跨平台话语分析的方法论意义。

英文摘要

When workers lose jobs to AI-driven restructuring, two very different conversations happen on X (formerly Twitter) at the same time. Tech executives and AI researchers talk about productivity, transformation, and opportunity. Laid-off workers and labour critics talk about job loss, uncertainty, and fear. This paper asks a simple question: which conversation gets more reach? We report three studies using two collection methods and 763 tweets from 20 named public accounts. Study 1 used keyword-based collection (n=392) and found no significant difference between corpora (p=0.891), revealing that keyword search is too noisy for this task. Study 2 used account-based collection (n=96) and found a 3.12x mean amplification advantage for capital discourse over labour discourse (p=0.000003, Cohen's d=0.555). Study 3 combined both methods (n=763) and confirmed the finding at 4.18x mean and 10.77x median amplification ratio (p<0.000001). Critically, after normalising for follower count, the asymmetry persists at 2.69x (p=0.000009, Cohen's d=0.491), demonstrating that the effect is not simply a consequence of capital accounts having larger audiences. The finding is robust across all tested amplification metric weightings. We introduce the Amplification Ratio and Amplification Normalisation Index as simple metrics for measuring platform-level discourse inequality. A cross-platform replication on Reddit (n=647 posts) did not replicate the finding, suggesting the asymmetry may be specific to X's account-based amplification architecture. We discuss the methodological implications for cross-platform discourse analysis.
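
As a rough illustration of the statistics reported above, the sketch below computes a mean and median ratio between two groups of per-post amplification scores, plus Cohen's d with a pooled standard deviation. The abstract does not define how an amplification score is computed or how the paper's Amplification Ratio is formalised, so the inputs and function names here are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import mean, median, variance


def amplification_ratios(capital, labour):
    """Return (mean ratio, median ratio) of capital over labour scores."""
    return mean(capital) / mean(labour), median(capital) / median(labour)


def cohens_d(a, b):
    """Standardised mean difference using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Toy per-post scores (hypothetical values, not data from the study).
capital_scores = [10.0, 20.0, 30.0]
labour_scores = [5.0, 10.0, 15.0]
mean_ratio, median_ratio = amplification_ratios(capital_scores, labour_scores)
effect_size = cohens_d(capital_scores, labour_scores)
```

Reporting both the mean and the median ratio matters for engagement data, which is typically heavy-tailed: a few viral posts can move the mean far more than the median.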

2605.29340 2026-05-29 cs.CL 版本更新

A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities

面向LLM安全评估的问答数据集研究：聚焦非法活动

Kenji Imamura, Masao Ideuchi, Atsushi Fujita

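Affiliations: National Institute of Information and Communications Technology

AI summary: Through manual analysis of the AnswerCarefully dataset, proposes additional information, a method for creating question-answer examples, and evaluation guidelines for assessing LLM safety with respect to illegal activities.

Comments 10 pages, 1 figure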

2605.29336 2026-05-29 cs.CL 版本更新

Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding

通过最小贝叶斯风险解码在摘要中实现基于共识和一致性的事实性增强

Riza Setiawan Soetedjo, Yusuke Sakai, Hidetaka Kamigaito, Jingun Kwon, Manabu Okumura, Taro Watanabe

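Affiliations: Nara Institute of Science and Technology; Chungnam National University; Institute of Science Tokyo

AI summary: Proposes ConSUM, which uses Minimum Bayes Risk decoding to establish consensus among candidate summaries and reranks them together with a metric of consistency with the source document, in order to improve summary factuality.

Comments Accepted to ACL 2026 Findings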

AI中文摘要

提高模型生成摘要的质量，尤其是事实性（摘要相对于源内容的准确性）仍然是一个挑战。虽然重排序可以从多个生成候选中选择最优输出，但它仅限于使用源作为指导，导致摘要不可靠。为了解决这一局限性，我们提出了ConSUM，该方法通过考虑两个因素对候选摘要进行重排序：与源文档的一致性以及与其他候选之间的共识。共识是通过对生成的摘要集进行最小贝叶斯风险（MBR）解码建立的，同时通过使用将摘要与源进行比较的事实性感知指标来确保一致性。严格的测试表明，我们的系统与现有方法具有竞争力，人工评估进一步证实其生成的摘要优于其他系统。我们的代码可在https://github.com/naist-nlp/ConSUM获取。

英文摘要

Improving the quality of model-generated summaries, especially factuality, the accuracy of a summary with respect to its source content, remains a challenge. While reranking could select the optimal output from multiple generated candidates, it is limited to only using the source as guidance, resulting in unreliable summaries. To address this limitation, we propose ConSUM that reranks candidate summaries by considering two factors: consistency to the source document and consensus among the other candidates. Consensus is established using Minimum Bayes Risk (MBR) decoding over the set of generated summaries, while ensuring consistency by employing factuality-aware metrics that compare the summary against the source. Rigorous testing demonstrates that our system is competitive with existing methods, with human evaluations further confirming that its generated summaries are preferred over those from other systems. Our code is available at https://github.com/naist-nlp/ConSUM .
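
The idea of combining consensus and consistency can be sketched in a few lines. This is a toy illustration, not the ConSUM implementation: the unigram-F1 utility, the token-overlap consistency proxy, the linear weighting, and all function names are assumptions standing in for whatever metrics the paper actually uses.

```python
from collections import Counter


def unigram_f1(a: str, b: str) -> float:
    """Toy utility function: unigram-overlap F1 between two texts."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ta & tb).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ta.values())
    recall = overlap / sum(tb.values())
    return 2 * precision * recall / (precision + recall)


def consensus_scores(candidates):
    """MBR-style expected utility of each candidate against the others."""
    scores = []
    for i, cand in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        total = sum(unigram_f1(cand, o) for o in others)
        scores.append(total / len(others) if others else 0.0)
    return scores


def source_consistency(summary: str, source: str) -> float:
    """Toy consistency proxy: share of summary tokens found in the source."""
    source_tokens = set(source.lower().split())
    tokens = summary.lower().split()
    return sum(t in source_tokens for t in tokens) / len(tokens) if tokens else 0.0


def rerank(candidates, source, weight=0.5):
    """Pick the candidate with the best mix of consensus and consistency."""
    consensus = consensus_scores(candidates)
    combined = [
        (1 - weight) * c + weight * source_consistency(cand, source)
        for cand, c in zip(candidates, consensus)
    ]
    return candidates[max(range(len(candidates)), key=combined.__getitem__)]


source_doc = "the cat sat on the mat"
summaries = ["the cat sat on the mat", "the cat sat on a mat", "a dog ran in the park"]
picked_combined = rerank(summaries, source_doc)
picked_consensus_only = rerank(summaries, source_doc, weight=0.0)
```

In this toy case consensus alone prefers the second candidate, because it overlaps with both of the others, while adding the consistency term selects the first, which is fully supported by the source.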

2605.29327 2026-05-29 cs.CL cs.LG 版本更新

PatchBoard: 基于Schema的可靠且可审计的LLM多智能体协作状态变更框架

Shuyu Zhang, Yaqi Shi, Lu Wang

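Affiliations: School of Computer Science and Technology

AI summary: Proposes the PatchBoard architecture, which replaces inter-agent dialogue with schema-constrained JSON Patch state mutations to enable verifiable, auditable multi-agent collaboration; on ALFWorld tasks it reports an 84.6% success rate with 45.5k tokens consumed.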

2605.29310 2026-05-29 cs.AI cs.CL 版本更新

Rubric-Guided Process Reward for Stepwise Model Routing

基于评分准则的逐步模型路由过程奖励

Shenghao Ye, Yu Guo, Zhengheng Li, Shuangwu Chen, Jian Yang

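Affiliations: University of Science and Technology of China; Southeast University; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

AI summary: Proposes the RoRo framework, which collects routing trajectories, builds preference pairs, trains a Rubricor to generate evaluation rubrics and a Judge to score trajectories, and combines process and outcome rewards to optimize the routing policy, improving the accuracy and cost efficiency of stepwise routing for large reasoning models.

Comments 17 pages, 9 figures, submitted to EMNLP 2026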

AI中文摘要

逐步模型路由通过将每个推理步骤分配给合适的模型来提高大型推理模型（LRM）的效率。最近的方法将路由建模为顺序决策过程，并使用强化学习训练路由器。然而，尽管它们将路由建模为一个过程，但仍然使用结果奖励来监督路由器。这种奖励仅反映最终答案的正确性，未能评估中间路由决策，这可能会削弱性能和泛化能力。为了解决这一差距，我们提出了RoRo，一种基于评分准则的逐步模型路由过程奖励框架。RoRo首先收集多样化的路由轨迹，并基于结果、成本和过程质量构建偏好对。然后，它通过交替优化训练一个Rubricor来生成查询特定的评估准则，以及一个Judge来在此准则下对路由轨迹进行评分。由此产生的过程奖励与结果奖励相结合，通过GRPO优化路由策略。在五个推理基准上的实验，无论是在同族还是跨族设置下，都表明RoRo始终优于强基线，并实现了更好的准确性和成本权衡。

英文摘要

Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.

2605.29307 2026-05-29 cs.CL cs.AI cs.IR cs.LG 版本更新

GrepSeek: Training Search Agents for Direct Corpus Interaction

GrepSeek：训练用于直接语料库交互的搜索代理

Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani

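Affiliations: University of Massachusetts Amherst; Princeton University; Carnegie Mellon University

AI summary: Proposes GrepSeek, which trains a compact search agent to interact directly with a text corpus through shell commands, using two-stage training (a cold-start dataset followed by GRPO optimization) and a semantics-preserving sharded-parallel execution engine; it achieves the best F1 and Exact Match on open-domain question answering.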

AI中文摘要

大型语言模型（LLM）搜索代理通过多轮推理和信息检索，在知识密集型语言任务中展现出强大潜力。大多数现有系统使用检索器，该检索器接收关键词或自然语言查询，并利用预计算文档表示的索引返回排序后的文档列表。在本工作中，我们探索了一种互补视角，其中搜索代理将语料库本身视为搜索环境，并通过执行可执行的shell命令来寻找证据。我们引入了GrepSeek，一种优化的直接语料库交互（DCI）搜索代理，它训练一个紧凑的搜索代理从大型文本语料库中查找、过滤和组合证据。为了解决在大语料库上直接使用强化学习进行学习行为的不稳定性，我们提出了一种两阶段训练流程。首先，我们使用答案感知的Tutor和答案盲的Planner构建冷启动数据集，生成经过验证的、因果基础的搜索轨迹。其次，我们使用组相对策略优化（GRPO）优化初始化的策略，使代理能够通过与语料库的直接交互来改进其任务导向的搜索行为。为了使DCI在大规模下实用，我们进一步使用语义保持的分片并行执行引擎，该引擎将基于shell的检索加速高达7.6倍，同时保持与shell命令顺序执行的字节精确等价。在七个开放域问答基准上的实验表明，GrepSeek在整体词元级F1和精确匹配上取得了最强性能。我们的分析还揭示了纯粹词汇交互在具有显著表面形式变化的查询上的局限性，表明DCI作为搜索代理的一种实用且具有竞争力的方法，可以在现实世界中补充现有的检索范式。

英文摘要

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to 7.6x while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level F1 and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
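
The notion of sharded-parallel execution that stays equivalent to sequential execution can be illustrated with a minimal sketch. It is not GrepSeek's engine: it assumes a purely line-local filter (a regex match, as with a plain grep), an in-memory list of lines, and hypothetical function names. Commands whose output depends on global state, such as line numbers or counts, would need extra care.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def grep_lines(lines, pattern):
    """Sequential reference: keep lines that match the regex, in order."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


def sharded_grep(lines, pattern, num_shards=4):
    """Filter contiguous shards in parallel, then concatenate in shard order."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division
    shards = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # Executor.map yields results in input order, so the original
        # corpus order is preserved after concatenation.
        parts = list(pool.map(lambda shard: grep_lines(shard, pattern), shards))
    return [line for part in parts for line in part]


corpus = [f"doc {i} {'alpha' if i % 3 == 0 else 'beta'}" for i in range(10)]
sequential_hits = grep_lines(corpus, "alpha")
parallel_hits = sharded_grep(corpus, "alpha", num_shards=3)
```

Equivalence holds here because each line is tested independently and shards are contiguous and rejoined in order; that property is what makes the parallel result identical to the sequential one.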

2605.29300 2026-05-29 cs.CL cs.AI cs.SD 版本更新

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

MusTBENCH：音乐大语言模型中的时间定位基准与推进

Daeyong Kwon, Qiyu Wu, Shinobu Kuriya, Junghyun Koo, Shuyang Cui, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

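Affiliations: Seoul National University; Sony Group Corporation; Sony AI

AI summary: Proposes the MusTBENCH benchmark and MusT, a four-stage optimization method, to evaluate and improve the ability of music large language models to perform temporal grounding in audio.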

DynSess：面向角色扮演智能体的动态会话级评估与优化框架

Rongsheng Zhang, Jiji Tang, Junnan Ren, Zuyi Bao, Weijie Chen, Ruofan Hu, Zhou Zhao, Tangjie Lv, Yan Zhang

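Affiliations: Zhejiang University; Fuxi AI Lab, NetEase Inc.; Xiamen University

AI summary: Proposes DynSess, a unified session-level framework that improves the long-horizon consistency and interaction quality of role-playing agents through session-level evaluation (DynSess-Eval) and policy optimization (DSPO/GSRPO) on training trajectories built with multi-step lookahead search.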

AI中文摘要

基于大型语言模型的角色扮演本质上是一个会话级任务，要求智能体在扩展的多轮对话中维持角色身份和交互质量。然而，现有的评估和优化方法大多停留在轮次级别，无法捕捉长程质量。我们提出DynSess，一个统一的会话级角色扮演智能体框架。DynSess-Eval通过针对长程行为的评分标准对完整对话会话进行评分。利用其会话级奖励，我们通过多步前瞻搜索构建高质量训练轨迹，并训练DynSess-Character的两个互补变体：DSPO（离策略）和GSRPO（在策略）。实验表明，DynSess-Eval与人类判断的一致性显著优于先前的评估器，盲人机评估进一步显示，尽管参数少得多，DynSess-Character仍能与最强角色模型匹配，同时保持强大的角色一致性和交互能力。我们的数据集和代码将发布以促进未来研究。

英文摘要

Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimization methods remain largely turn-level, failing to capture long-horizon quality. We propose DynSess, a unified session-level framework for role-playing agents. DynSess-Eval scores complete dialogue sessions via rubrics targeting long-horizon behaviors. Leveraging its session-level rewards, we construct high-quality training trajectories through multi-turn lookahead search and train DynSess-Character with two complementary variants: DSPO (off-policy) and GSRPO (on-policy). Experiments show that DynSess-Eval aligns with human judgments substantially better than prior evaluators, and blind human evaluation further shows that DynSess-Character matches the strongest character model despite using substantially fewer parameters, while maintaining strong role consistency and interactive ability. Our dataset and code will be released to facilitate future research.

2605.29250 2026-05-29 cs.CL cs.AI cs.IR cs.LG 版本更新

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

OmniRetrieval：跨异构知识源的统一检索

Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang, Patara Trirat, Heejun Lee, Sung Ju Hwang

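Affiliations: KAIST

AI summary: Proposes the OmniRetrieval framework, which identifies the appropriate knowledge sources for a natural-language query and dispatches it to each source's native execution engine, outperforming single-source baselines across 13 datasets and 309 knowledge bases and unifying retrieval over heterogeneous knowledge sources.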

AI中文摘要

相关性即漏洞：网络检索如何削弱LLM智能体的安全对齐

Aditya Nawal, Manit Baser, Mohan Gurusamy

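Affiliations: Department of Electrical and Computer Engineering; National University of Singapore

AI summary: Proposes the AgentREVEAL framework to analyze how the way retrieval is integrated and the properties of retrieved content degrade the safety of LLM agents, finds that relevance is the shared activation condition, and introduces the HarmURLBench benchmark.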

AI中文摘要

AI智能体通过外部工具（如网络检索）增强大型语言模型，使其能够提供基于事实和最新的响应。然而，将外部内容纳入生成流程可能会削弱控制模型输出的安全对齐机制。先前的研究表明，在智能体中启用检索会增加对有害请求的遵从性。我们提出了AgentREVEAL，一个用于分析LLM智能体中检索诱导的安全退化的诊断框架。该框架考察两个维度：检索如何集成到智能体流程中，以及检索内容的属性。在集成维度上，我们发现将工具调用和响应生成绑定在单一步骤中会放大有害输出。在内容维度上，我们揭示了安全来源悖论：即使是对立或安全导向的来源（例如包含警告或风险免责声明的页面），与无检索基线相比，也会使有害遵从性平均增加25%。最后，我们表明相关性是这两种漏洞的共同激活条件。类似模式出现在前沿闭源模型上，并且在几种代表性流程干预下，有害遵从性仍然保持较高水平，一些智能体在自主检索下也会进入这种状态。由于相关性也是使检索有用的原因，这些结果揭示了检索增强智能体的安全-效用权衡。我们引入了HarmURLBench，一个包含1,405个真实世界URL和320个有害行为的基准，以支持未来的评估。

英文摘要

AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling retrieval in agents increases compliance with harmful requests. We introduce AgentREVEAL, a diagnostic framework for analyzing retrieval-induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline and the properties of the retrieved content. Along the integration axis, we find that binding tool invocation and response generation in a single step amplifies harmful outputs. Along the content axis, we uncover the Safe Source Paradox: even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no-retrieval baseline. Finally, we show that relevance acts as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, and harmful compliance remains elevated under several representative pipeline interventions, with some agents also entering this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents. We introduce HarmURLBench, a benchmark containing 1,405 real-world URLs paired with 320 harmful behaviors to support future evaluations.


2605.29218 2026-05-29 cs.AI cs.CL 版本更新

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

GTA：大规模生成面向Web智能体的长程任务

Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu

发表机构 * University of Southern California（南加州大学）； Salesforce AI Research（Salesforce人工智能研究）； University of California, Davis（加州大学戴维斯分校）

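AI summary: Proposes GTA, a framework that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to generate realistic long-horizon tasks with executable trajectories for web agents, addressing the lack of process-level supervision and scalability in existing benchmarks.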

Comments Published at Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics

详情

AI中文摘要

Web智能体将语言模型与浏览和工具使用能力相结合，有望成为开放的Web助手。然而，进展日益受到缺乏可扩展的过程级监督的限制。现有基准大多为手动构建，仅提供粗略的起始-目标注释，缺乏中间轨迹，而最近的自动生成方法仍然昂贵、有偏且浅显。这些限制阻碍了对必须泛化到现实、多跳、跨页面任务的智能体进行可靠训练和评估。我们引入了一个可扩展的框架GTA，它集成了爬取、基于检索的种子生成、上下文内生成和自动质量控制，以生成与可执行轨迹配对的真实任务。该设计将爬取与生成解耦以提高效率，将任务基于站点图以强制组合性，并通过确定性重放和系统验证确保密集监督。我们在超过50个涵盖电子商务、政府、论坛和新闻的网站上实例化了该流程，并具有多语言和多跳覆盖。由此产生的基准揭示了显著的人机性能差距，并实现了详细的诊断。我们的贡献有三方面：（i）形式化多跳Web智能体任务生成，（ii）提出一个高效且经过验证的自动数据创建流程，以及（iii）发布一个具有可重复评估的动态基准。

英文摘要

Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic, multi-hop, cross-page tasks. We introduce a scalable framework, GTA, that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. This design decouples crawling from generation for greater efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replays and systematic validation. We instantiate the pipeline on over 50 websites covering e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a significant human-agent performance gap and enables detailed diagnostics. Our contributions are three-fold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient and validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.


2605.29192 2026-05-29 cs.AI cs.CL 版本更新

Parallax: 参数化局部线性注意力用于语言建模

Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf, Shuming Hu, Zhaoran Wang

发表机构 * Northwestern University（西北大学）； Tilde Research（Tilde研究）； University of Washington（华盛顿大学）

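AI summary: Proposes Parallax, a scalable parameterized local linear attention mechanism that eliminates the numerical solver and learns a query projector, yielding consistent perplexity improvements in language-model pretraining and gains in downstream task transfer.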

详情

AI中文摘要

大语言模型（LLMs）的使用正在激增，但观察到它们的性能因提示风格和语气而异。在本研究中，我们探讨了提示中的语气变化是否以及如何导致LLM在客观多项选择题上的准确性差异。我们使用了两个数据集：一个包含50个基础问题和五种语气变体的数据集，以及一个包含570个基础问题、涵盖57个主题和七种语气变体的MMLU子集。我们进行了实验，评估了四种成本效益高、流行的LLM的性能：ChatGPT-4o、ChatGPT-5-nano、Gemini 2.5 Flash和Gemini 2.5 Flash Lite。跨模型而言，语气效应是系统性的但高度依赖模型。一些模型显示出微小但统计上显著的变化，而另一些模型则在语气间表现出较大的准确性波动。此外，我们识别了主题层面的语气敏感性差异，并提出了一个路由框架来解释语气如何调节内部推理模式。我们的发现提醒用户不要假设LLM部署中具有语气鲁棒性的可靠性。

英文摘要

The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.


2605.29018 2026-05-29 cs.AI cs.CL 版本更新

Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

采用 ≠ 适应：野外LLM对话的纵向分析

Rebecca M. M. Hicke, Kiran Tomlinson

发表机构 * Cornell University（康奈尔大学）； Microsoft Research（微软研究院）

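AI summary: Analyzes the conversational trajectories of about 12,000 Microsoft Bing Copilot users alongside WildChat-4.8M data and finds that user behavior is highly sticky, that more active users favor complex, professionally oriented tasks, and that WildChat is skewed toward highly proficient "power" users, indicating that existing user behavior is hard to change and revealing user heterogeneity.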

详情

AI中文摘要

尽管越来越多的研究开始描述用户与LLM的交互，但其描绘的画面基本上是静态的；关于个体用户如何随时间改变其行为，我们知之甚少。为填补这一空白，我们分析了约12,000名随机抽样的Microsoft Bing Copilot用户的对话轨迹，并与WildChat-4.8M的数据进行比较。虽然Copilot数据包含显著的人群层面趋势，但我们发现个体用户轨迹中的趋势要弱得多；用户习惯被证明极其顽固。我们还发现不同活跃度用户之间存在显著差异：更活跃的用户拥有更成功的对话，并使用LLM处理更复杂和专业导向的任务。一些用户趋势也出现在WildChat-4.8M中，但我们发现证据表明该数据集显著偏向高熟练度的“超级用户”。最终，我们的结果表明现有用户行为难以改变，并展示了用户异质性的程度。我们数据集之间的比较突显了WildChat并不代表典型的用户-AI交互，这是对数据下游使用的一个重要警示。

英文摘要

Although a growing body of research has begun to describe user--LLM interactions, the picture it paints is largely static; little is known about how individual users change their behavior over time. To address this gap, we analyze the conversational trajectories of $\sim$12,000 randomly sampled Microsoft Bing Copilot users and compare these with data from WildChat-4.8M. While the Copilot data contains significant population-level trends, we find that trends in individual user trajectories are much weaker; user habits prove to be overwhelmingly sticky. We also find stark differences between users of different activity levels: more active users have more successful conversations and use the LLM for more complex and professionally oriented tasks. Some user trends also appear in WildChat-4.8M, but we find evidence that this dataset is significantly skewed towards highly proficient "power" users. Ultimately, our results suggest that existing user behavior is difficult to change and demonstrate the extent of user heterogeneity. Our comparison between datasets highlights that WildChat does not represent typical user-AI interactions, an important caveat for downstream uses of the data.


2605.29007 2026-05-29 cs.CL 版本更新

Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation

错误作为透镜：通过合成误解生成探究LLM推理

Xinming Yang, Jun Li

发表机构 * CUNY Graduate Center（纽约市立大学研究生中心）； CUNY Queens College（纽约市立大学皇后学院）

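AI summary: Proposes a framework that generates synthetic misconceptions targeted to a five-class error taxonomy adapted from Bloom's taxonomy in order to probe LLM reasoning, and finds that targeted error generation is harder than free-form incorrect-answer generation.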

详情

AI中文摘要

个性化辅导、教师培训和教育研究需要访问有针对性的合成误解，但隐私和IRB限制使得真实学生错误的标注语料库稀缺。LLM原则上可以大规模生成合成错误，但对于现代LLM来说，生成任意错误答案很容易，而生成与特定认知失败模式匹配的错误答案则困难得多。我们提出了一个框架，根据改编自修订版Bloom分类学的五类分类法生成有针对性的错误，并在TheoremQA数据集的问题上进行评估。生成代理（GA）根据目标类别起草候选错误解决方案，检查代理（EA）判断草案是否错误且类别一致。该框架提供了一种可重复的方法，用于构建在缺乏真实学生语料库的情况下分层类别的合成错误数据集。作为次要诊断，有针对性的错误生成比自由形式的错误答案生成困难得多，并且答案基础比扩展示例或外部教科书内容贡献更大。

英文摘要

Personalized tutoring, teacher training, and education research need access to targeted synthetic misconceptions, but privacy and IRB constraints make labelled corpora of real student errors scarce. LLMs could in principle generate synthetic errors at scale, but producing an arbitrary wrong answer is easy for a modern LLM while producing one that matches a specified cognitive failure mode is much harder. We present a framework that generates errors targeted to a five-class taxonomy adapted from the revised Bloom's taxonomy, evaluated on questions from the TheoremQA dataset. A Generation Agent (GA) drafts a candidate erroneous solution conditioned on a target class, and an Examination Agent (EA) judges whether the draft is incorrect and class-consistent. The framework yields a reusable recipe for building class-stratified synthetic error datasets where authentic student corpora are unavailable. As a secondary diagnostic, targeted error generation is substantially harder than free-form incorrect-answer generation, and answer-grounding contributes more than expanded examples or external textbook content.


2605.29000 2026-05-29 cs.CL 版本更新

Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction

保留文本的有损文本压缩：策略性删除与LLM重建研究

Yuchun Zou, Junhong Tong, Jun Li

发表机构 * CUNY Graduate Center（纽约市立大学研究生中心）； CUNY Queens College（纽约市立大学皇后学院）

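AI summary: Studies lossy semantic text compression, in which text is strategically deleted and then reconstructed by a large language model; compares several deletion strategies and finds that word-frequency deletion is a strong low-cost baseline, that semantic methods show their clearest advantage at moderate compression, and that QLoRA fine-tuning yields a strong decoder.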

详情

AI中文摘要

传统的无损文本压缩保留每一个字节，但在实际运行条件下对自然语言的增益通常有限。我们研究有损语义文本压缩，其中编码器策略性地删除部分文本，大语言模型（LLM）从保留的骨架中重建原始内容。我们对一系列删除策略进行基准测试，包括均匀步长删除、词长引导删除（WordLen）、词频引导删除（WordFreq）、LP优化删除（Opt）、基于GPT-2惊奇度的熵删除，以及结合频率和惊奇度信号的混合方法。在BBC新闻数据集上，保留率$r_{keep} \in [0.1,0.9]$的评估显示了三个主要发现。首先，WordFreq是一个强大的低成本基线：尽管仅使用静态频率查找表，它在编码器端速度远快于更昂贵的语义方法，同时仍具有竞争力。其次，语义和混合方法在轻度到中度压缩时提供最明显的增益，而词频删除在最低保留率时通常更鲁棒。第三，QLoRA微调产生一个强大的局部解码器，与Gemini 2.0 Flash竞争，并且在仅解码器比较中通常最强。额外的英文和中文实验表明，整体框架跨领域迁移，而最佳删除规则仍依赖于数据集。

英文摘要

Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study lossy semantic text compression, where the encoder strategically deletes parts of the text and a large language model (LLM) reconstructs the original content from the retained skeleton. We benchmark a progression of deletion strategies, including uniform step deletion, word-length-guided deletion (WordLen), word-frequency-guided deletion (WordFreq), LP-optimized deletion (Opt), entropy-based deletion using GPT-2 surprisal, and hybrid methods that combine frequency and surprisal signals. Evaluation on the BBC News dataset across retention rates $r_{keep} \in [0.1,0.9]$ shows three main findings. First, WordFreq is a strong low-cost baseline: despite using only a static frequency lookup, it remains competitive with much more expensive semantic methods while being far faster at the encoder. Second, semantic and hybrid methods provide their clearest gains at mild-to-moderate compression, whereas word-frequency deletion is often more robust at the lowest retention rates. Third, QLoRA fine-tuning yields a strong local decoder that is competitive with Gemini 2.0 Flash and is often strongest in decoder-only comparisons. Additional English and Chinese experiments show that the overall framework transfers across domains, while the best deletion rule remains dataset-dependent.
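
The word-frequency-guided deletion described in this abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact rule, so everything here is an assumption for illustration: the function name `wordfreq_delete`, the choice to delete the most frequent words first, and the use of in-text counts in place of the static frequency lookup the abstract mentions.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"

def _key(word: str) -> str:
    """Normalize a token for frequency counting (assumed normalization)."""
    return word.lower().strip(PUNCT)

def wordfreq_delete(text: str, r_keep: float) -> str:
    """Keep about r_keep of the words, deleting the most frequent ones first.

    Hypothetical illustration only: frequencies are counted from the text
    itself, whereas a real encoder would presumably use a static table
    built from a large corpus.
    """
    words = text.split()
    n_keep = round(r_keep * len(words))
    freq = Counter(_key(w) for w in words)
    # Rank positions from rarest to most frequent; ties keep original order.
    order = sorted(range(len(words)), key=lambda i: (freq[_key(words[i])], i))
    kept = sorted(order[:n_keep])  # restore reading order for the skeleton
    return " ".join(words[i] for i in kept)

skeleton = wordfreq_delete("the cat and the dog and the bird saw fish", 0.5)
print(skeleton)
```

The retained skeleton would then be passed to an LLM decoder for reconstruction; that step is outside the scope of this sketch.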


2605.28999 2026-05-29 cs.CR cs.AI cs.CL cs.LG 版本更新

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening

测量基于LLM的简历筛选中真实世界的提示注入攻击

Mohan Zhang, Yuqi Jia, Zhen Tan, Steven Jiang, Neil Zhenqiang Gong, Tianlong Chen, Dawn Song

发表机构 * University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）； Duke University（杜克大学）； Arizona State University（亚利桑那州立大学）； hireEZ ； University of California, Berkeley（加州大学伯克利分校）

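AI summary: Presents the first systematic analysis of prompt injection attacks in LLM-based resume screening; using purpose-built detectors on about 200K real-world resumes, it finds that roughly 1% contain hidden prompt injections and that their prevalence has risen noticeably in recent years.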

Comments Published in USENIX Security Symposium 2026; Code and artifacts are available at https://github.com/UNITES-Lab/resume-injection-measurement

详情

AI中文摘要

LLM容易受到提示注入攻击。然而，这种漏洞主要是在学术研究中通过概念性演示或少数轶事案例研究来展示的。其在真实世界基于LLM的应用中的普遍性和影响尚未得到充分探索。在这项工作中，我们首次对广泛使用的应用——基于LLM的简历筛选——中的提示注入攻击进行了系统研究。我们的分析基于hireEZ多年来收集的约20万份真实简历。我们首先设计了专门的方法来检测简历中的提示注入。在小规模数据集上的手动验证表明，我们的检测器实现了高精度，并优于最先进的通用检测器。然后，我们将检测器应用于完整的简历数据集，并对真实世界的提示注入攻击进行了全面的测量研究。我们的分析揭示了一些有趣的发现：大约1%的简历包含隐藏的提示注入；这种注入简历的流行度在过去一到两年内显著增加；超过90%的注入提示不使用显式指令。这些结果首次提供了真实世界基于LLM的应用中大规模提示注入的证据，并为未来理解和缓解此类攻击的研究奠定了基础。

英文摘要

LLMs are vulnerable to prompt injection attacks. However, this vulnerability has been primarily demonstrated conceptually in academic studies or through a few anecdotal case studies. Its prevalence and impact in real-world LLM-based applications are largely unexplored. In this work, we present the first systematic study of prompt-injection attacks in a widely used application: LLM-based resume screening. Our analysis is based on approximately 200K real-world resumes collected over multiple years by hireEZ. We first design tailored methods to detect prompt injection in resumes. Manual validation on a small-scale dataset demonstrates that our detectors achieve high precision and outperform state-of-the-art general-purpose detectors. We then apply our detector to the full resume dataset and conduct a comprehensive measurement study of real-world prompt injection attacks. Our analysis reveals several intriguing findings: approximately 1% of resumes contain hidden prompt injections; the prevalence of such injected resumes has increased noticeably over the past one to two years; and more than 90% of injected prompts do not use explicit instructions. These results provide the first evidence of large-scale prompt injection in real-world LLM-based applications and lay the groundwork for future studies to understand and mitigate such attacks.


2605.28969 2026-05-29 cs.CL cs.AI cs.HC 版本更新

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

超越回忆：行为规范作为AI个性化的解释层

Aarik Gulaya

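AI summary: Proposes the Behavioral Specification as an interpretive layer that compresses a user's data into interpretive patterns, substantially improving how accurately an AI agent represents the user, reducing model hedging, and outperforming raw corpus and commercial memory systems on interpretation-required questions.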

Comments 134 pages, 4 figures. Code, data, judge prompts, and reproduction instructions: github.com/agulaya24/beyond-recall

详情

AI中文摘要

如果AI代理代表个人做出决策，这些决策必须与其用户一致。我们引入表示准确性来衡量系统忠实捕捉用户解释的程度。我们将解释层操作化为行为规范。我们的参考实现将用户数据积极压缩为解释性模式，作为语言模型的上下文。我们在一个原型基准上评估该规范，该基准由校准的5评委LLM小组对保留的行为预测进行评分。我们独立测试它，并与一系列上下文条件组合：完整原始语料、完整提取事实以及四个商业记忆系统（Mem0、Letta、Supermemory、Zep）。在14个公共领域自传语料库中，该规范总体上提升了表示准确性，并几乎消除了模型规避。它以约25倍的上下文成本降低恢复了原始语料的大部分性能。该规范将受试者提升到一个共同的预测水平，无论预训练基线如何；因此，绝对提升在基线最低时最大，表明相关人群是任何在预训练中未被充分代表的人。在需要解释的问题上提升最大，提供解释层使得模型行为能够实现提取事实或原始语料无法实现的行为。相反，在需要回忆的问题上，该层可能干扰而非帮助。我们得出结论，表示准确性不同于回忆，人机对齐取决于用户被表示的准确性。表示准确性使这种对齐可测试。

英文摘要

If an AI agent makes decisions on a person's behalf, those decisions must align with its user. We introduce representational accuracy to measure how faithfully a system captures a person's interpretation. An interpretive layer is operationalized as a Behavioral Specification. Our reference implementation aggressively compresses a person's data into interpretive patterns, served as context to a language model. We evaluate the Specification on a prototype benchmark of held-out behavioral predictions scored by a calibrated 5-judge LLM panel. We test it independently and in composition with a range of context conditions: full raw corpus, full extracted facts, and four commercial memory systems (Mem0, Letta, Supermemory, Zep). Across 14 public-domain autobiographical corpora, the Specification lifts representational accuracy in aggregate and nearly eliminates model hedging. It recovers most of what the raw corpus delivers, at ~25x less context cost. The Specification lifts subjects toward a common predictive level regardless of pretraining baseline; the lift in absolute points is therefore largest where the baseline is lowest, suggesting the population of relevance is anyone not adequately represented in pretraining. Lift is greatest on interpretation-required questions, where providing an interpretive layer enables model behavior that extracted facts or raw corpus do not. Conversely, on recall-required questions, this layer can interfere rather than help. We conclude that representational accuracy is distinct from recall and that human-AI alignment is dependent on how accurately the user is represented. Representational accuracy makes that alignment testable.


2605.28966 2026-05-29 cs.CL cs.HC 版本更新

认知范畴变换器：用于语言建模的范畴论归纳偏置

Al Kari

发表机构 * Manceps Inc.（Manceps公司）

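AI summary: Proposes the Cognitive Categorical Transformer (CCT), which adds components grounded in category theory and cognitive science and reaches 21.27 validation perplexity on WikiText-103 at 306M parameters, 2.92 PPL (12% relative) below the fine-tuned GPT-2 Small baseline; an ablation attributes 84% of the improvement to simplicial message passing.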

详情

AI中文摘要

认知范畴变换器（CCT）是一个306M参数的架构，它通过源自范畴论和认知科学的认知启发组件增强了预训练的GPT-2 Small骨干网络。在WikiText-103上采用匹配步数协议（215,000优化器步数、匹配数据、匹配优化器和调度）下，CCT达到21.27验证困惑度，而相同微调的GPT-2 Small基线为24.19。因此，该架构在领域内微调本身之外贡献了2.92 PPL（12%相对）的降低。一个从头开始重训练的消融实验，在整个七阶段激活调度中保持GT-Full单纯复形消息传递绕过，达到23.72 PPL，将84%的架构改进（2.45 of 2.92 PPL）归因于GT-Full。我们首次提供了消融验证的证据，表明单纯复形消息传递在WikiText-103上以306M参数规模改善了语言模型困惑度。已发表的GPT-2 Large在WikiText-103上以比GPT-2 Small多6.2倍的参数达到22.05零样本困惑度；本文将这一数字视为外部已发表参考，而非架构基准。关于一致性风格的范畴先验（层平滑、伴随往返、曲率正则化）的三个负面结果，以及GT-Full和PrecisionWeightedPP的联合结构先验结果，共同支持了一个经验模式，称为*结构/一致性区分*，其中添加新拓扑的范畴先验改善了语言建模，而强制执行一致性恒等式的范畴先验则没有。

英文摘要

The Cognitive Categorical Transformer (CCT) is a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with cognitively grounded components derived from category theory and several inspirations from cognitive science. Under a matched-step protocol (215,000 optimizer steps, matched data, matched optimizer and schedule) on WikiText-103, CCT reaches 21.27 validation perplexity, compared with 24.19 for an identically fine-tuned GPT-2 Small baseline. The architecture therefore contributes a 2.92 PPL (12% relative) reduction beyond what in-domain fine-tuning alone provides. A retrain-from-scratch ablation that holds GT-Full simplicial message passing bypassed across the entire seven-phase activation schedule reaches 23.72 PPL, localizing 84% of the architectural improvement (2.45 of 2.92 PPL) to GT-Full. We present the first ablation-validated evidence that simplicial message passing improves language-model perplexity at the 306M-parameter scale on WikiText-103. Published GPT-2 Large reaches 22.05 zero-shot PPL on WikiText-103 with 6.2x more parameters than GPT-2 Small; this paper treats that number as an external published reference, not as the architectural benchmark. Three negative results on consistency-style categorical priors (sheaf smoothing, adjunction round-trip, curvature regularization) and the joint structural-prior result for GT-Full and PrecisionWeightedPP together support an empirical pattern termed the *structure/consistency distinction*, in which categorical priors that add new topology improve language modeling and those that enforce a consistency identity do not.
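
The percentages quoted in this abstract follow directly from the three reported perplexities (24.19 for the fine-tuned baseline, 23.72 for the ablation with GT-Full bypassed, 21.27 for full CCT). A short worked check, where the symbols $\Delta_{\text{arch}}$ and $\Delta_{\text{GT-Full}}$ are labels introduced here for illustration and do not come from the paper:

```latex
\begin{align*}
\Delta_{\text{arch}} &= 24.19 - 21.27 = 2.92\ \text{PPL},
  & \frac{2.92}{24.19} &\approx 0.121 \quad (\approx 12\%\ \text{relative}),\\
\Delta_{\text{GT-Full}} &= 23.72 - 21.27 = 2.45\ \text{PPL},
  & \frac{2.45}{2.92} &\approx 0.839 \quad (\approx 84\%\ \text{of the improvement}).
\end{align*}
```

The remaining $2.92 - 2.45 = 0.47$ PPL is the part of the gain that the bypassed-GT-Full model still retains over the baseline.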


2605.28854 2026-05-29 cs.CL cs.LG q-bio.NC 版本更新

Large language models reorganize representational geometry during in-context learning

大型语言模型在上下文学习中重组表征几何结构

Hua-Dong Xiong, Li Ji-An, Robert C. Wilson, Kwonjoon Lee, Xue-Xin Wei

发表机构 * School of Psychological and Brain Sciences, Georgia Tech（佐治亚理工学院心理与脑科学学院）； Department of Psychology, New York University（纽约大学心理学系）； Center of Excellence for Computational Cognition, Georgia Tech（佐治亚理工学院计算认知卓越中心）； Honda Research Institute（本田研究院）； Departments of Neuroscience and Psychology, The University of Texas at Austin（德克萨斯大学奥斯汀分校神经科学与心理学系）

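AI summary: Studies how large language models reorganize representational geometry during in-context learning, finding that ICL performance correlates with the representational structure of the task and that model behavior is well described by a prototype-like algorithm that reshapes representations to increase separability.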

详情

AI中文摘要

大型语言模型（LLMs）表现出显著的灵活性：它们可以从上下文示例中适应新任务，而无需任何参数更新，这种能力被称为上下文学习（ICL）。先前关于合成任务的研究表明，ICL可以实现特定算法，展示了架构能力，并且机制分析已经识别出支持这种行为的关键回路。然而，由于上下文计算——无论其算法形式如何——依赖于高维表征空间中的变换，该空间的几何结构如何塑造ICL的有效性仍不清楚。受神经科学中将分类视为神经表征解缠的观点启发，我们假设ICL依赖于任务相关表征的成功在线解缠。为了验证这一想法，我们研究了LLMs如何对上下文示例进行分类，这些示例的标签由模型自身具有已知结构的内部表征定义。我们表明，ICL性能与底层分类任务的表征结构系统性相关，并且成功的ICL伴随着几何重组，增加了在线可分性。我们进一步发现，LLM的行为可以通过一种原型类算法很好地描述，该算法在重塑表征以支持分类的同时整合证据。这些发现为预训练LLMs中的ICL提供了几何解释，将表征几何结构确立为ICL的机制约束，并量化了预训练表征所能提供的与上下文学习所能利用之间的差距。

英文摘要

Large language models (LLMs) exhibit remarkable flexibility: they can adapt to novel tasks from in-context examples without any parameter updates, a capability known as in-context learning (ICL). Prior work on synthetic tasks has shown that ICL can implement specific algorithms, demonstrating architectural competence, and mechanistic analyses have identified key circuits that support this behavior. However, because in-context computation -- regardless of its algorithmic form -- relies on transformations in high-dimensional representation space, it remains unclear how the geometry of that space shapes ICL effectiveness. Motivated by the neuroscience view of classification as the untangling of neural representations, we hypothesize that ICL depends on the successful online untangling of task-relevant representations. To test this idea, we study how LLMs classify in-context examples whose labels are defined by the model's own internal representations with known structure. We show that ICL performance correlates systematically with the representational structure of the underlying classification task and that successful ICL is accompanied by geometric reorganization that increases online separability. We further find that LLM behavior is well described by a prototype-like algorithm that integrates evidence while reshaping representations to support classification. These findings offer a geometric account of ICL in pretrained LLMs, establish representational geometry as a mechanistic constraint on ICL, and quantify the gap between what pretrained representations afford and what in-context learning can exploit.


2605.28848 2026-05-29 cs.CL cs.AI 版本更新

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

GPF-LiveNews: 大型语言模型中群体条件框架的流式评估协议

Mohd Ariful Haque, Fahad Rahman, Kishor Datta Gupta, Roy George

发表机构 * Clark Atlanta University（克拉克亚特兰大大学）

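AI summary: Proposes GPF-LiveNews, a streaming evaluation protocol that combines fresh news anchors with identity labels to measure semantic sensitivity and sentiment disparity in LLM outputs across prompted audiences, for auditing group-conditioned framing.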

详情

AI中文摘要

部署的语言模型在非静态环境中进行评估：模型版本、检索层、安全系统和真实世界输入都随时间变化。静态偏差基准仍然有用，但它们无法显示模型如何针对不同提示受众构建新出现事件的框架。我们引入了GPF-LIVENEWS，这是一个流式评估协议和基准快照，用于审计开放端LLM输出中的群体条件框架。该协议扩展了来自BBC/路透社的最新新闻锚点，涵盖42个身份标签和七个提示族，然后使用语义敏感性和情感差异信号评估响应束。在12次监控运行和23个托管模型的试点中，政策/行动提示产生了最强的语义运动，而情感变化在维度和提示族之间较为平坦。发布的工件包括文章元数据、提示模板、实例化提示、模型输出元数据、评分表、文档和复现脚本。我们将所有评分解释为用于人工审查的观察窗口审计信号，而非永久性的公平性排名或有害偏差的直接证据。

英文摘要

Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how models frame newly emerging events for different prompted audiences. We introduce GPF-LIVENEWS, a streaming evaluation protocol and benchmark snapshot for auditing group-conditioned framing in open-ended LLM outputs. The protocol expands fresh BBC/Reuters news anchors across 42 identity labels and seven prompt families, then evaluates response bundles using semantic-sensitivity and sentiment-disparity signals. In a pilot over 12 monitoring runs and 23 hosted models, Policy/Action prompts produce the strongest semantic movement, while sentiment variation is flatter across dimensions and prompt families. The released artifact includes article metadata, prompt templates, instantiated prompts, model-output metadata, score tables, documentation, and reproduction scripts. We interpret all scores as observed-window audit signals for human review, not as permanent fairness rankings or direct proof of harmful bias.
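
The expansion step described in this abstract (one news anchor crossed with 42 identity labels and seven prompt families) can be sketched as a simple cross product. This assumes a full cross product, which the abstract does not state explicitly; the function name `expand_prompts`, the placeholder labels, and the template wording are all hypothetical.

```python
from itertools import product

def expand_prompts(anchor: str, identity_labels: list[str],
                   prompt_families: list[str]) -> list[str]:
    """Instantiate every (identity label, prompt family) pair for one anchor.

    Each prompt family is a template with {audience} and {headline} slots
    (an assumed format, not the released templates).
    """
    return [
        template.format(audience=label, headline=anchor)
        for label, template in product(identity_labels, prompt_families)
    ]

# Placeholder inputs; the real labels and templates ship with the artifact.
labels = [f"group_{i}" for i in range(42)]
families = [f"family {k}: explain '{{headline}}' to {{audience}}" for k in range(7)]

prompts = expand_prompts("Example headline", labels, families)
print(len(prompts))  # 42 labels x 7 families = 294 prompts per anchor
```

Under this assumption each anchor yields 294 instantiated prompts per model, which is the bundle that the semantic-sensitivity and sentiment-disparity signals would then be computed over.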


2605.28842 2026-05-29 cs.CL cs.AI 版本更新

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

思想即规划：通过强化规划进行思维链优化的潜在世界模型

Dong Liu, Yanxuan Yu, Ying Nian Wu

发表机构 * University of California, Los Angeles（加州大学洛杉矶分校）； Columbia University（哥伦比亚大学）

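AI summary: Proposes Thoughts-as-Planning, a framework that formalizes chain-of-thought optimization as a sequential decision process in a latent semantic space, uses a latent world model to simulate how edits to the reasoning chain affect downstream outputs, and plans with gradient descent or reinforcement learning, outperforming existing baselines on language understanding and generation tasks.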

详情

AI中文摘要

评估荷兰语音节划分算法并通过深度学习结合语音和正字法信息提高准确性

Gus Lathouwers, Wieke Harmsen, Catia Cucchiarini, Helmer Strik

发表机构 * Radboud University（拉德堡德大学）

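AI summary: Evaluates four Dutch syllabification algorithms and proposes a deep-learning model that combines phonetic and orthographic information, reaching 99.65% word accuracy, a 0.14% improvement over the best result in the literature.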

Comments Published in CLIN Journal

详情

Journal ref: Computational Linguistics in the Netherlands Journal, Vol. 14 (2025), pp. 365 to 383

AI中文摘要

音节划分描述将单词划分为音节的任务。由于许多规则和例外，训练算法以高准确率执行音节划分仍然是一个挑战。在过去几十年中，针对荷兰语音节划分提出了不同的算法，但尚未进行全面的比较评估。此外，近年来深度学习在自然语言处理中获得了显著普及，但尚未开发出基于现代深度学习的荷兰语正字法音节划分框架。最后，语音和正字法音节划分算法已被分别研究，但未结合研究。当前研究的目标有两个：(a) 检查现有荷兰语音节划分算法的性能，(b) 研究将语音和正字法信息结合到单个模型中是否能提高音节划分性能。为了比较算法性能，将四种算法（Brandt Corstius、Liang、Trogkanis-Elkan (CRF) 和新构思的深度学习模型）应用于三个不同的数据集（词典词、借词、伪词）。这些算法在数据集上表现出不同的性能，数据驱动算法在所有条件下除一个外均优于基于知识的算法。开发的新深度学习方法相比文献中发现的最佳结果（99.65%的词准确率，提高了0.14%）带来了性能提升。对添加语音信息改善音节划分性能的单词的分析表明，这些单词中正字法歧义可以通过发音信息解决。未来研究可以考察语音信息有益于正字法处理的其他领域。此外，新开发的深度学习框架可以应用于荷兰语以外的其他语言。

英文摘要

Syllabification describes the task of dividing words into syllables. Due to many rules and exceptions, training an algorithm to perform syllabification with high accuracy remains a challenge. Throughout the last decades, different algorithms have been put forth for Dutch syllabification, yet a comprehensive comparative assessment has not been done. Additionally, deep learning has gained significant popularity within NLP in recent years, yet no modern deep-learning-based framework has been developed for Dutch orthographic syllabification. Finally, phonetic and orthographic syllabification algorithms have been examined separately, but not in combination. The aim of the current research was twofold: (a) to examine the performance of existing Dutch syllabification algorithms, and (b) to investigate whether combining phonetic and orthographic information into a single model can increase syllabification performance. To compare the performance of algorithms, four algorithms (Brandt Corstius, Liang, Trogkanis-Elkan (CRF), and a newly conceived deep-learning model) were applied to three different datasets (dictionary words, loanwords, pseudowords). The algorithms show varying performance across datasets, with the data-driven algorithms outperforming a knowledge-based algorithm in all but one condition. The newly developed deep-learning methods improved on the best result reported in the literature (99.65% word accuracy, a 0.14% improvement). An analysis of the words for which adding phonetic information improved syllabification performance indicates that these were words in which the orthographic ambiguity could be resolved by information on pronunciation. Future research could examine other areas where phonetic information can benefit orthographic processing. In addition, the newly developed deep-learning frameworks can be applied to languages other than Dutch.


2605.28833 2026-05-29 cs.CL cs.AI 版本更新

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

转录儿童语音：ASR性能与获取可靠正字法转录

Gus Lathouwers, Lingyun Gao, Catia Cucchiarini, Helmer Strik

发表机构 * Radboud University（拉德堡德大学）

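AI summary: Evaluates ASR models from three families (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets and proposes an utterance-level selection method that automatically identifies correctly pronounced utterances with high confidence, reducing the need for manual verification.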

详情

AI中文摘要

自动语音识别（ASR）有潜力通过生成自动转录来大幅减少儿童语音研究中的手动标注工作。然而，在低资源语言中，由于缺乏针对儿童的预训练模型以及高度多样的噪声条件，获得可靠的高质量ASR转录仍然具有挑战性。本研究通过两个研究问题调查了最先进的ASR模型在儿童语音上的有效性，评估了来自三个模型家族（Whisper、Parakeet和Wav2Vec2）的九个ASR模型在两个荷兰儿童语音数据集JASMIN和DART上的表现。研究问题1考察了ASR模型应用于儿童语音的性能。微调的Whisper-medium模型取得了最佳整体性能，在JASMIN上WER为5.54%，在DART上为70.37%，表明噪声较大的DART数据明显更具挑战性。研究问题2考察了在多大程度上可以选择一个子集，使得无需人工验证即可自动获得可靠的正字法转录。我们使用一种话语级选择方法，将ASR输出与原始阅读提示进行比较，以识别正确发音的录音。使用所提出的选择方法，42.0% [对于JASMIN] 和18.1% [对于DART] 的话语可以高置信度地自动识别为正确发音，从而在话语级别上实现极低的错误率（精确度达到98.3%或更高），并减少了人工验证的需求。

英文摘要

Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. However, obtaining reliably high-quality ASR transcriptions for child speech remains challenging in low-resource languages due to limited child-specific pre-trained models and highly diverse noise conditions. This study investigates the effectiveness of state-of-the-art ASR models on child speech through two research questions, by evaluating nine ASR models from three model families (Whisper, Parakeet, and Wav2Vec2) on two Dutch child speech datasets, JASMIN and DART. Research question 1 examines the performance of ASR models applied to child speech. The fine-tuned Whisper-medium model achieves the best overall performance, with a WER of 5.54% on JASMIN and 70.37% on DART, showing that the noisy DART data are clearly more challenging. Research question 2 examines to what extent it is possible to select a subset for which reliable orthographic transcriptions can be obtained automatically, without the need for manual verification. We use an utterance-level selection method that compares ASR output with the original read prompt to identify correctly pronounced recordings. Using the proposed selection method, 42.0% (JASMIN) and 18.1% (DART) of the utterances can be automatically identified as correctly pronounced with high confidence, resulting in very low error rates at the utterance level (precisions of 98.3% and higher) and reducing the need for manual verification.
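
The utterance-level selection idea in this abstract (compare the ASR output with the read prompt and keep the recordings that agree) might look roughly like the sketch below. The abstract does not specify the matching criterion, so the exact-match rule (WER of zero after normalization), the normalization itself, and the names `wer` and `select_correct` are assumptions made for illustration.

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split into words (assumed normalization)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def wer(prompt: str, asr_output: str) -> float:
    """Word error rate of the ASR output against the read prompt."""
    ref = normalize(prompt)
    return word_errors(ref, normalize(asr_output)) / max(len(ref), 1)

def select_correct(utterances: list[tuple[str, str, str]]) -> list[str]:
    """Return ids of utterances whose ASR output matches the prompt exactly."""
    return [uid for uid, prompt, hyp in utterances if wer(prompt, hyp) == 0.0]

batch = [("u1", "De kat zit.", "de kat zit"), ("u2", "de hond", "de hont")]
print(select_correct(batch))
```

Utterances that pass this filter would be accepted without manual checking; the rest would still go to a human annotator.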


2605.28832 2026-05-29 cs.CL cs.AI 版本更新

A comparative study of transformer-based embeddings for topic coherence

基于Transformer的嵌入在主题连贯性中的比较研究

Alex Ding, Tarun Rapaka, Willy Rodriguez, Jason Yang

发表机构 * Worcester Academy Stanford Online High School（沃斯特学院斯坦福在线高中）； Stanford Online High School（斯坦福在线高中）； Lexington High School（莱克星顿高中）

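AI summary: Systematically compares seven transformer language models of different sizes (from MiniLM to LLaMA-2) in a BERTopic pipeline and finds that model size, from 22 million to 13 billion parameters, has a negligible effect on topic coherence.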

详情

AI中文摘要

主题建模是自然语言处理的一个分支，旨在根据词共现模式将大量文本组织成连贯的组，其中潜在狄利克雷分配仍是最广泛使用和可解释的概率方法之一。自然语言处理的最新进展，特别是基于Transformer的语言模型，提供了改进的文档表示。已知模型大小（以参数数量计）对语言模型在不同预定义任务上的性能有显著影响。在本研究中，我们通过分析七种基于Transformer的语言模型（从小型模型如MiniLM到大型模型如LLaMA-2）在BERTopic流程中对多种语料库的性能，系统地考察了模型大小对主题质量的影响。主题质量使用Röder等人（2015）的连贯性和分歧度指标进行评估。我们的结果表明，模型大小从2200万到130亿参数对主题质量的影响可忽略，表明较小的模型可以达到与较大模型相当的性能。

英文摘要

Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations. It is also known that model size (in terms of number of parameters) has a significant impact on the performance of language models on different pre-defined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performance of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora. Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on topic quality, suggesting that smaller models can achieve comparable performance to larger models.


2605.28830 2026-05-29 cs.CL cs.AI cs.SE 版本更新

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

开源安全防护模型基准测试：全面评估

Reetu Raj Harsh, Bhaskarjit Sarmah, Stefano Pasquali

发表机构 * Domyn

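AI summary: Comprehensively evaluates 14 open-source safety guard models across 8 NIST AI Risk Framework safety categories, finding that recall is the critical metric and that model size does not correlate with safety detection performance.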

详情

AI中文摘要

随着大型语言模型（LLMs）越来越多地部署在安全关键型应用中，稳健的内容审核变得至关重要。我们对14个开源安全防护模型进行了全面评估，使用了包含79,331个样本的精选基准，涵盖8个NIST AI风险框架安全类别。我们的基准聚合了四个不同的数据集（HarmBench、StrongREJECT、RealToxicityPrompts和BeaverTails），并经过筛选，仅关注安全相关内容（暴力、仇恨言论、骚扰、色情内容、自杀/自残、亵渎、威胁和健康虚假信息）。我们发现召回率是安全应用的关键指标，因为遗漏有害内容比误报构成更大风险。我们的评估揭示了令人惊讶的结果：Qwen Guard（4B参数）实现了最高的召回率（83.97%），而较大的模型如Llama Guard（12B）和GPT-OSS Safeguard（20B）表现出保守行为，遗漏了高达75%的不安全内容。我们证明了模型大小与安全检测性能不相关，并且通用防护模型优于专用模型。这些发现为在生产部署中选择安全防护模型提供了实用指导。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories. Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives. Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.
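
Since this abstract treats recall as the critical metric, a small sketch of how recall relates to missed unsafe content may help. The function name `recall` and the toy labels are illustrative and are not taken from the benchmark's code.

```python
def recall(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of truly unsafe samples (label 1) that the guard flags as unsafe."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)      # caught unsafe
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)  # missed unsafe
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy example: four unsafe samples, of which the guard flags only one.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
print(recall(y_true, y_pred))
```

A guard that misses 75% of unsafe content, as in this toy example, has a recall of 0.25. The false positive on the last sample lowers precision but leaves recall unchanged, which is why the two metrics can rank guard models differently.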

2605.28828 2026-05-29 cs.CL cs.AI (updated version)

MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models

Ji-jun Park, Soo-joon Choi, Jiwon Jeong, Taeyang Yoon, Ju-Wan Lee

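Affiliation: Dongguk University

AI summary: Proposes the MechELK framework, which extracts hidden knowledge from large language models in three stages (localization, verification, and elicitation) using methods such as sparse-autoencoder feature analysis and causal probing, reaching an average elicitation accuracy of 84.7% on benchmarks including TruthfulQA.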

From Autoregressive to Diffusion: Efficiently Adapting Large Language Models with Strict Causality and Elastic Horizons

Xiangyu Ma, Teng Xiao, Zuchao Li, Lefei Zhang

Affiliations: School of Artificial Intelligence, Wuhan University; School of Computer Science, Wuhan University

AI summary: Proposes the FLUID framework, which efficiently adapts autoregressive models into diffusion models through strictly causal alignment and an elastic-horizon mechanism, enabling parallel text generation at greatly reduced training cost.

Comments: Accepted by ACL 2026

Abstract

Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre-trained Autoregressive (AR) models. This incompatibility precludes reusing robust AR priors, necessitating prohibitive pre-training from scratch. To bridge this gap, we propose FLUID, a framework that efficiently adapts AR backbones to the diffusion paradigm. By enforcing Strictly Causal Alignment, FLUID enables seamless initialization from standard GPT-style checkpoints, circumventing the need for massive pre-training. Furthermore, we introduce Elastic Horizons, an entropy-driven mechanism that dynamically modulates denoising strides based on local information density rather than fixed schedules. Experiments demonstrate that FLUID achieves state-of-the-art performance while reducing training costs by orders of magnitude, effectively reconciling established AR foundations with efficient parallel generation. Our code is available at https://github.com/Oli-lab-nun/FLUID/tree/main.
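
The abstract does not say how Elastic Horizons maps entropy to a denoising stride, so the following is only a hedged sketch of the general idea: measure the entropy of a predictive distribution and take shorter strides where entropy is high. The function names, the linear mapping, and the stride bounds are all assumptions for illustration, not FLUID's actual mechanism.

```python
import math


def shannon_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def elastic_stride(probs: list[float], min_stride: int = 1, max_stride: int = 8) -> int:
    """Hypothetical mapping: low entropy -> long stride, high entropy -> short stride."""
    h_max = math.log(len(probs))  # entropy of the uniform distribution
    h_norm = shannon_entropy(probs) / h_max if h_max > 0 else 0.0
    stride = max_stride - h_norm * (max_stride - min_stride)
    return max(min_stride, min(max_stride, round(stride)))


# A peaked distribution (low information density) allows a long stride.
confident = elastic_stride([0.97, 0.01, 0.01, 0.01])

# A uniform distribution (maximal uncertainty) forces the shortest stride.
uncertain = elastic_stride([0.25, 0.25, 0.25, 0.25])

print(confident, uncertain)
```

The contrast with a fixed schedule is that the stride here is recomputed from the local distribution at every step rather than set once in advance.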

2605.27382 2026-05-29 cs.HC cs.AI cs.CL (updated version)

The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

Affiliations: AWS Generative AI Innovation Center; HSBC Holdings Plc., HSBC Technology Center, China

AI summary: By comparing how sycophancy rates change across persona conditions for strongly and weakly aligned models, defines the alignment floor Δ_floor as an audit metric for assessing the safety of a model's persona customization.

Abstract

Telling an LLM to "be enthusiastic" raises its sycophancy rate from 30% to 50% on a lightly-aligned model, but has zero effect on a strongly-aligned one. We define this gap as the alignment floor, $\Delta_{\text{floor}}(m)=\max_p S(m,p)-\min_p S(m,p)$, the range of sycophancy rates a model produces across persona conditions, and treat sycophancy as a persona-conditional property rather than a fixed model property. Pluralistic AI relies on behavioral adaptation via persona prompts like "be creative" or "be thorough", which let systems respect diverse user values and communication styles; the safety question is how much customization a given model can absorb before its truthfulness shifts. We present a controlled case study contrasting a strongly-aligned RLHF + Constitutional-AI model (Claude Sonnet 4.6) with a more lightly-aligned model (Amazon Nova Lite), spanning seven persona conditions and five tasks for 1800 total runs. An existence-pair result motivates per-model auditing: there is at least one strongly-aligned model with $\Delta_{\text{floor}}=5$pp (within 5pp of the 15% control rate) and at least one lightly-aligned model with 45pp (5%-50% range). On the lightly-aligned model, all five Big Five personas increase sycophancy over control, and counterintuitively Agreeableness produces the smallest increase, not the largest. The single largest effect in the study is constructive: a Skeptic persona reduces sycophancy by 25pp on the lightly-aligned model, and is the only persona that instructs resistance against user claims rather than engagement with them, suggesting a directionality account. Cross-model transfer of persona effects is near-zero, so persona-alignment testing must be per-model. We propose $\Delta_{\text{floor}}$ as a deployment-time audit metric: measure it on a small persona panel before deploying persona customization.
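
The alignment-floor definition (maximum minus minimum sycophancy rate over persona conditions) is simple enough to sketch directly. The function name `alignment_floor`, the persona labels, and the individual rates below are hypothetical; only the 5%-50% and 5pp ranges echo the abstract.

```python
def alignment_floor(rates_pp: dict[str, int]) -> int:
    """Range of sycophancy rates (in percentage points) across persona conditions."""
    return max(rates_pp.values()) - min(rates_pp.values())


# Hypothetical per-persona sycophancy rates in percent for a lightly-aligned model,
# chosen only so that they span the 5%-50% range mentioned in the abstract.
lightly_aligned = {"persona_a": 5, "persona_b": 30, "persona_c": 50}

# Hypothetical rates for a strongly-aligned model that stay within a 5pp band.
strongly_aligned = {"persona_a": 12, "persona_b": 15, "persona_c": 17}

print(alignment_floor(lightly_aligned))   # range across personas, in pp
print(alignment_floor(strongly_aligned))
```

As a deployment-time audit, the same function would be applied to rates measured on a small persona panel for the specific model being deployed, since the abstract reports that persona effects do not transfer across models.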

2605.27379 2026-05-29 cs.AI cs.CL (updated version)

Slide Deck Q&A Quality Assurance Application: A Multi-Stage Pipeline for Pedagogical Question Generation

Jim Salsman

Affiliation: TalkNicer, Inc.

AI summary: Proposes a Flask-based multi-stage large language model pipeline that extracts text and images from PDF slides and generates structured sets of pedagogical questions, improving question quality through four stages: window planning, deck synthesis, slide annotation, and reconciliation.

Comments: 15 pages, 3 research questions, 1 figure, 1 table, 6 references, 2 appendices

Abstract

Generating high-quality, pedagogically useful questions from lecture slide decks is difficult because important instructional content is distributed across both text and visual elements, and because useful questions must be scaffolded across the flow of a presentation rather than generated slide by slide in isolation. This paper describes Slide Deck Q&A Quality Assurance (slidesqaqa), a Flask-based software system that extracts text and rendered images from PDF slides and processes them through a four-stage large language model pipeline comprising window planning, deck synthesis, slide annotation, and reconciliation. The system reasons jointly about slide modality and pedagogical role, allocates bounded question budgets, and revises draft annotations at the deck level to reduce redundancy and improve coverage. The final output is a structured JSON annotation containing deck-level goals, section structure, slide-level summaries, question sets, and evaluation scores. Initial experiments on two technical lecture decks indicate that the pipeline can filter non-instructional slides and produce high-fidelity, pedagogically coherent questions for visually complex content. The working system is at https://slidesqaqa-974767694043.us-west1.run.app and the software repository is at https://github.com/blinding2submit/slidesqaqa

2605.26029 2026-05-29 cs.AI cs.CL (updated version)

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng

Affiliations: Tsinghua University; University of Illinois Urbana-Champaign; Carnegie Mellon University; University of Chicago; Adobe

AI summary: Proposes the CausaLab environment, which uses synthetic-laboratory tasks to evaluate both the predictive accuracy and the causal-mechanism recovery of LLM agents in causal discovery, and finds a significant gap between the two.

Abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
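
The gap between task accuracy and edge F1 is easier to see with a small example of how an F1 score over directed edges can be computed. This is a sketch under the assumption that edges are compared as exact (cause, effect) pairs; the paper's precise "all-edge" definition may differ, and the graphs and the name `edge_f1` are invented for illustration.

```python
def edge_f1(predicted: set[tuple[str, str]], truth: set[tuple[str, str]]) -> float:
    """F1 over directed edges, where an edge counts only if its direction matches."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Invented ground-truth graph with three directed edges.
truth = {("A", "B"), ("B", "C"), ("A", "D")}

# Invented agent output: two correct edges, one reversed edge, one spurious edge.
predicted = {("A", "B"), ("C", "B"), ("A", "D"), ("D", "C")}

score = edge_f1(predicted, truth)
print(round(score, 3))
```

An agent can still predict the held-out quantity well from a partially wrong graph like this one, which is how high task accuracy and a low structural score can coexist.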

2605.25297 2026-05-29 cs.CL cs.AI cs.LG (updated version)

Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

Hangxuan Li, Renjun Jia, Xuezhang Wu, Yunjie Qian, Zeqi Zheng, Xianling Zhang

Affiliations: Alibaba Cloud Computing Co. Ltd, Hangzhou, China; School of Computer Science, Fudan University, Shanghai, China; School of Computer Science and Technology, Tongji University, Shanghai, China; Independent Researcher, United States

AI summary: Proposes the Eureka framework, which treats feature engineering as an agentic code generation problem and automatically generates executable feature code through three stages (an Expert Agent, an LLM Feature Factory, and a Self-Evolving Alignment Engine), significantly improving performance on 7 public benchmarks in healthcare, finance, and social domains and on cloud GPU resource demand prediction at Alibaba Cloud.

Comments: Accepted at NeurIPS 2025 Workshop, DASFAA 2026 (International Conference on Database Systems for Advanced Applications)

DOI: 10.1007/978-981-92-0378-9_33
Journal ref: Database Systems for Advanced Applications (DASFAA 2026), Lecture Notes in Computer Science, vol. 16540, pp. 528-540, Springer

Abstract

Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability across applications. We define feature engineering as an agentic code generation problem: features are not static data transformations, but executable programs that can be generated, evaluated, and iteratively improved. We present Eureka, an LLM-driven framework with three stages. (1) An Expert Agent, fine-tuned via SFT on domain knowledge, produces structured feature design plans in JSON format. (2) An LLM Feature Factory translates each plan into executable Python code through chain-of-thought reasoning, turning feature hypotheses into runnable programs. (3) A Self-Evolving Alignment Engine uses Reinforcement Learning (GRPO) with dual-channel reward (metric-based utility + semantic alignment) to enhance code quality. By expressing features as programs, the learned generation patterns can transfer across domains. Evaluated on 7 public benchmarks in healthcare, finance, and social domains, Eureka consistently outperforms both traditional AutoFE and LLM-based baselines. We further demonstrate Eureka's effectiveness on cloud GPU resource demand prediction at Alibaba Cloud, where Eureka improves demand fulfillment rate by 16% and lowers computing resource migration rates by 33%.

2605.23657 2026-05-29 cs.CL (updated version)

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao

Affiliations: Singapore Management University; Institute of Trustworthy Embodied AI, Fudan University; Joy Future Academy, JD

AI summary: Proposes OpenSkillEval, an automatic evaluation framework that dynamically constructs realistic task instances and collects community skills to systematically evaluate skill-augmented agent systems and the skills themselves, with key findings that skill availability does not guarantee effective use and that the gains from skill augmentation depend on the model and framework.

Abstract

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.

2605.22586 2026-05-29 cs.LG cs.CL (updated version)

Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra

Affiliation: Bellini College of AI, Cybersecurity, and Computing

AI summary: Studies how unstructured pruning affects the test-time scaling performance of reasoning LLMs, finding that it outperforms structured pruning and sometimes even the unpruned model, and examines the role of layer-wise sparsity allocation strategies.

Abstract

Large Language Models (LLMs) now exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), with impressive performance across math and coding benchmarks. In parallel, research in model compression has developed pruning methods that seek to remove redundant/detrimental parameters without sacrificing task performance. The intersection of these two research advancements lays the foundation for our work. Specific to reasoning LLMs, prior work has shown that structured pruning (methods that remove entire sets of layer blocks) significantly degrades TTS reasoning performance. However, in this work, we revisit this assumption and investigate whether unstructured pruning (methods that carefully remove only certain redundant/detrimental weights) exhibits similar limitations. Surprisingly, our extensive experiments across four reasoning benchmarks on two reasoning LLMs (s1.1-7B and Qwen3-8B) consistently show that unstructured pruning augments TTS performance compared to structured pruning, and at times can even outperform the unpruned full-weight LLMs. Furthermore, we also empirically study the impact of different layer-wise sparsity allocation strategies, which are an important parametric choice for instantiating these unstructured methods. These findings challenge the conventional notion that pruning always reduces TTS performance and in fact suggest that carefully undertaken pruning can retain TTS effectiveness.
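
To illustrate what "unstructured" means here, the sketch below zeroes individual low-magnitude weights in a toy matrix rather than dropping whole blocks. Magnitude pruning is used only because it is one common unstructured criterion; the abstract does not name the specific methods studied, and `magnitude_prune` and the toy weights are assumptions for illustration.

```python
def magnitude_prune(weights: list[list[float]], sparsity: float) -> list[list[float]]:
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning).

    sparsity is the target fraction of weights to remove, e.g. 0.5 for 50%.
    Ties at the threshold may prune slightly more than the target.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    num_to_prune = int(len(magnitudes) * sparsity)
    if num_to_prune == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[num_to_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Toy 2x3 "layer"; in a per-layer allocation scheme each layer would get its own sparsity.
layer = [[0.9, -0.05, 0.3], [-0.7, 0.02, 0.1]]
pruned = magnitude_prune(layer, sparsity=0.5)
print(pruned)
```

The matrix keeps its shape and only individual entries become zero, whereas structured pruning would remove an entire row, head, or layer block at once. A layer-wise sparsity allocation strategy then amounts to choosing a different `sparsity` argument per layer.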

2604.20443 2026-05-29 cs.CL cs.AI cs.LG (updated version)

DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim

Affiliations: Singapore Management University; Australian National University

AI summary: Proposes the DialToM benchmark, built from naturalistic dialogues with a multiple-choice evaluation framework, which reveals a systematic reasoning asymmetry in LLMs between inferring mental states (Literal ToM) and using them for social forecasting (Functional ToM) and demonstrates a significant capability gap between a domain expert and AI.

Comments: Submitted to EMNLP 2026

Abstract

We introduce DialToM, an annotated Theory of Mind (ToM) benchmark built from naturalistic human-human dialogues using a multiple-choice evaluation framework. Concurrent with recent work showing a gap between explicit mental-state inference and applied ToM in synthetic settings (Gu et al., 2024), we establish a stricter State-Driven Diagnostic Probe in which models must forecast state-consistent dialogue trajectories solely from isolated mental-state profiles without dialogue context. Our evaluation reveals a systematic reasoning asymmetry: LLMs excel at inferring mental states (Literal ToM) but struggle to leverage them for social forecasting (Functional ToM). Crucially, a domain expert achieves 100% accuracy on this task, proving its validity and establishing a stark human-AI capability gap. Further, a teacher-student reasoning injection probe shows that Gemini 3 Pro, which establishes the leading baseline, possesses robust Functional ToM capabilities for context-free forecasting that are transferable to weaker models. DialToM, its evaluation code, and dataset are publicly available at https://github.com/Stealth-py/DialToM.

2604.18847 2026-05-29 cs.AI cs.CL (updated version)

Human-Guided Harm Recovery for Computer Use Agents

Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu

Affiliation: MIT CSAIL

AI summary: Addresses harm recovery after LM agents execute actions on computer systems: a user study defines preference-aligned recovery dimensions, a reward model re-ranks candidate recovery plans, and the BackBench benchmark is introduced; experiments show the method outperforms baseline agents.

Abstract

As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,130 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent's ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods -- ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.

2604.13519 2026-05-29 cs.CL (updated version)

ToolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding

Heming Xia, Yongqi Li, Cunxiao Du, Mingbo Song, Wenjie Li

Affiliations: Department of Computing, The Hong Kong Polytechnic University; Peking University

AI summary: To address tool-calling latency, proposes ToolSpec, a schema-aware, retrieval-augmented speculative decoding method that uses predefined tool schemas to generate accurate drafts, alternates via a finite-state machine between filling deterministic schema tokens and speculatively generating variable fields, and retrieves historical invocations to reuse as drafts, achieving up to a 4.2x speedup.

Abstract

Tool calling has greatly expanded the practical utility of large language models (LLMs) by enabling them to interact with external applications. As LLM capabilities advance, effective tool use increasingly involves multi-step, multi-turn interactions to solve complex tasks. However, the resulting growth in tool interactions incurs substantial latency, posing a key challenge for real-time LLM serving. Through empirical analysis, we find that tool-calling traces are highly structured, conform to constrained schemas, and often exhibit recurring invocation patterns. Motivated by this, we propose ToolSpec, a schema-aware, retrieval-augmented speculative decoding method for accelerating tool calling. ToolSpec exploits predefined tool schemas to generate accurate drafts, using a finite-state machine to alternate between deterministic schema token filling and speculative generation for variable fields. In addition, ToolSpec retrieves similar historical tool invocations and reuses them as drafts to further improve efficiency. ToolSpec presents a plug-and-play solution that can be seamlessly integrated into existing LLM workflows. Experiments across multiple benchmarks demonstrate that ToolSpec achieves up to a 4.2x speedup, substantially outperforming existing training-free speculative decoding methods.
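
The draft-then-verify idea behind schema-aware speculative decoding can be sketched in a few lines. This is not ToolSpec's implementation: the token format, the function names `build_draft` and `verify`, and the stand-in target model are all hypothetical, and a real system would verify the whole draft in one parallel forward pass instead of this sequential loop.

```python
from typing import Callable


def build_draft(tool_name: str, arg_names: list[str], guesses: dict[str, str]) -> list[str]:
    """Interleave deterministic schema tokens with guessed values for variable fields."""
    tokens = ["{", '"name":', f'"{tool_name}"', ",", '"arguments":', "{"]
    for i, arg in enumerate(arg_names):
        if i:
            tokens.append(",")
        # The key comes from the schema; the value is a guess (e.g. a retrieved past call).
        tokens += [f'"{arg}":', guesses.get(arg, '""')]
    tokens += ["}", "}"]
    return tokens


def verify(draft: list[str], target_next: Callable[[list[str]], str]) -> list[str]:
    """Accept draft tokens while they match the target model's greedy choice."""
    accepted: list[str] = []
    for tok in draft:
        expected = target_next(accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token and stop
            break
        accepted.append(tok)
    return accepted


# Stand-in for the target model: it replays a fixed reference call token by token.
reference = build_draft("get_weather", ["city"], {"city": '"Paris"'})
draft = build_draft("get_weather", ["city"], {"city": '"London"'})
accepted = verify(draft, lambda ctx: reference[len(ctx)])
print(len(accepted), "tokens settled in one verification round")
```

Here the schema tokens are all accepted and only the guessed argument value is corrected, which is where the speedup comes from when tool calls follow a constrained schema.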

2604.11088 2026-05-29 cs.AI cs.CL (updated version)

Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

Affiliations: AWS Generative AI Innovation Center; HSBC Holdings Plc., HSBC Technology Center, China

AI summary: Large-scale experiments find that random rules improve coding-agent performance as much as expert rules, and that the beneficial rules are all negative constraints while the harmful rules are all positive directives, suggesting that agents should be configured with constraints rather than guidance.

Abstract

Random rules improve a coding agent's task performance as much as expert-curated ones (both +13.8pp on a discriminative subset of SWE-bench Verified), and in our data every individually beneficial rule is a negative constraint ("do not refactor unrelated code"), while every individually harmful one is a positive directive ("follow code style"). We arrive at these findings through the first large-scale controlled study of agent rule files (CLAUDE.md, .cursorrules, and the broader family of agent skills, plugin manifests, and persona definitions): we scrape 679 rule files (25,532 rules) from GitHub and conduct over 5,000 agent runs of Claude Code with Claude Opus 4.6 on SWE-bench Verified. Three patterns emerge. (i) Rule polarity cleanly separates beneficial from harmful rules; we read this through the lens of potential-based reward shaping (PBRS). (ii) Performance gains are largely content-independent: random, shuffled, mismatched-domain, and unconverted-format rule files all match curated rules, pointing to a context priming mechanism. (iii) Individual rules often appear harmful in isolation yet do not visibly accumulate damage in ensemble: pass rates remain stable across rule counts from 0 to 50. These findings expose a hidden reliability risk in the rapidly growing ecosystem of community-authored rules and skills, and they yield a clear principle for safer agent configuration: constrain what agents must not do, rather than prescribing what they should.

2604.09629 2026-05-29 cs.CL (updated version)

HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation

Edward Ajayi, Prasenjit Mitra

Affiliation: Carnegie Mellon University Africa

AI summary: To address the tension between the standard LLM training objective and the surprise that humor requires, proposes the Cognitive Synergy Framework, which uses six cognitive personas to synthesize data with diverse comedic perspectives and fine-tunes a 7B-parameter student model; experiments indicate that cognitively driven data curation matters more than alignment algorithms or model scale.

Abstract

Humor generation poses a significant challenge for Large Language Models (LLMs), because their standard training objective (next-token prediction) inherently conflicts with the surprise and incongruity required for comedy. To bridge this gap, we introduce the Cognitive Synergy Framework, a methodology for generating high-quality humor data inspired by psychological theories of humor. Utilizing a Mixture-of-Thought (MoT) approach, we deploy six cognitive personas (e.g., The Absurdist, The Cynic) to synthesize diverse comedic perspectives for a given prompt. This framework produces a theory-grounded dataset, which we use to fine-tune a 7B-parameter student model. We further evaluate two alignment strategies, Direct Preference Optimization (DPO) and an offline group-relative variant O-GRPO, finding that neither improves over SFT. However, our 7B HumorGen model variants significantly outperform larger instruction-tuned baselines and achieve top-tier open-weight performance while remaining competitive with frontier proprietary systems. These results suggest that cognitively driven data curation is more critical than alignment algorithms or model scale for humor generation.

2604.06805 2026-05-29 cs.CL (updated version)

Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning

Jia-Chen Zhang, Yu-Jie Xiong, Zheng Zhou

Affiliations: School of Computer Science and Technology, East China Normal University; School of Electronic and Electrical Engineering, Shanghai University of Engineering Science

AI summary: Proposes the Cognitive Loop of Thought framework, based on a Reversible Hierarchical Markov Chain, which reduces redundancy and strengthens reasoning robustness through hierarchical decomposition and a backward verification mechanism, achieving significant gains on mathematical reasoning tasks.

Abstract

Multi-step Chain-of-Thought (CoT) has significantly advanced the mathematical reasoning capabilities of LLMs by leveraging explicit reasoning steps. However, the widespread adoption of Long CoT often results in sequence lengths that exceed manageable computational limits. While existing approaches attempt to alleviate this by reducing KV Cache redundancy via Markov chain-like structures, they introduce two critical limitations: inherent memorylessness (loss of context) and limited backward reasoning capability. To address these limitations, we propose a novel Chain-of-Thought framework based on Reversible Hierarchical Markov Chain, termed Cognitive Loop of Thought (CLoT), and a backward reasoning dataset CLoT-Instruct. In CLoT, problems are decomposed into sub-problems with hierarchical dependencies. Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer. Furthermore, we implement a pruning strategy: once higher-level sub-problems are verified, redundant lower-level sub-problems are pruned to maximize efficiency. This approach effectively mitigates error propagation and enhances reasoning robustness. Experiments on four mathematical benchmarks demonstrate the effectiveness of our method. Notably, on the AddSub dataset using GPT-4o-mini, CLoT achieves 99.0% accuracy, outperforming traditional CoT and CoT-SC by 4.1% and 2.9%, respectively.

2603.27518 2026-05-29 cs.CL (updated version)

Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs

Utsav Maskey, Mark Dras, Usman Naseem

Affiliation: Macquarie University

AI summary: By analysing the representational geometry of harmful refusal and over-refusal, finds that over-refusal directions are task-dependent and lie within benign task-representation clusters, which explains why global direction ablation cannot resolve over-refusal and indicates that task-specific geometric interventions are needed.

Comments: Preprint

Abstract

Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that seemingly resemble harmful instructions. A natural approach is to ablate the global refusal direction, steering the hidden-state vectors away from or towards the harmful-refusal examples, but this corrects over-refusal only incidentally while disrupting the broader refusal mechanism. In this work, we analyse the representational geometry of both refusal types to understand why this happens. We show that harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace. Linear probing suggests that the two refusal types are representationally distinct from the early transformer layers. These findings provide a mechanistic explanation of why global direction ablation alone cannot address over-refusal, and establish that task-specific geometric interventions are necessary.
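
For readers unfamiliar with direction ablation, the sketch below shows the basic operation on toy two-dimensional vectors: estimate a single global direction and project it out of a hidden state. Estimating the direction as a difference of class means is an assumption (a common choice, not confirmed by the abstract), and all vectors and function names are invented.

```python
import math


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def refusal_direction(refused: list[list[float]], complied: list[list[float]]) -> list[float]:
    """Unit vector from the mean 'complied' state to the mean 'refused' state."""
    diff = [a - b for a, b in zip(mean_vector(refused), mean_vector(complied))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def ablate(hidden: list[float], direction: list[float]) -> list[float]:
    """Remove the component of `hidden` along the unit vector `direction`."""
    projection = sum(h * d for h, d in zip(hidden, direction))
    return [h - projection * d for h, d in zip(hidden, direction)]


# Toy hidden states: refusals differ from compliant states only along the first axis.
refused = [[3.0, 1.0], [5.0, 1.0]]
complied = [[1.0, 1.0], [1.0, 1.0]]

direction = refusal_direction(refused, complied)
ablated = ablate([2.5, 0.7], direction)
print(direction, ablated)
```

A single vector like this can only remove one axis of variation. If over-refusal directions vary by task and span a higher-dimensional subspace, as the abstract reports, one global projection cannot remove them all, which is the intuition behind the call for task-specific interventions.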

2603.23971 2026-05-29 cs.CL cs.AI cs.GT cs.LG cs.MA 版本更新

The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More

价格反转现象：当更便宜的推理模型成本更高时

Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, James Zou

发表机构 * Stanford University（斯坦福大学）； UC Berkeley（加州大学伯克利分校）； CMU（卡内基梅隆大学）； Microsoft Research（微软研究院）

AI总结本文首次系统研究推理模型标价与实际成本的偏差，发现32%的模型对比较中存在价格反转现象，并基于Shapley值建立成本归因框架，揭示思考令牌消耗和交互轮次的高度异质性是主要原因。

详情

AI中文摘要

开发者和消费者越来越根据列出的API价格选择推理模型（RMs）。然而，这些价格在多大程度上准确反映了实际推理成本？我们首次系统研究这一问题，评估了8个前沿RM在12个不同任务上的表现，涵盖竞赛数学、科学问答、代码生成和多领域智能体。我们发现了定价反转现象：在32%的模型对比较中，标价较低的模型实际上产生了更高的总成本，反转幅度高达28倍。例如，Gemini 3 Flash的标价比GPT-5.4便宜80%，但其在所有任务上的实际成本却高出38%。我们基于Shapley值构建了一个正式的成本归因框架，并利用它追溯了思考令牌消耗和交互轮次数量巨大异质性的主要贡献因素：对于同一查询，一个模型可能比另一个模型多使用900%的思考令牌，或多出10倍的环境交互轮次。我们进一步表明，每次查询的成本预测本质上是困难的：同一查询的重复运行产生的思考令牌变化高达9.7倍，为任何预测器建立了不可约的噪声底限。因此，我们提出成本分布预测作为一个开放挑战。我们的发现表明，列出的API定价是实际成本的不可靠代理，呼吁进行成本感知的模型选择和透明的每次请求成本监控。

英文摘要

Developers and consumers increasingly choose reasoning models (RMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RMs across 12 diverse tasks covering competition math, science QA, code generation, and multi-domain agents. We uncover the pricing reversal phenomenon: in 32% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 80% cheaper than GPT-5.4's, yet its actual cost across all tasks is 38% higher. We build a formal cost attribution framework based on Shapley value, and leverage it to trace the dominating contributors to vast heterogeneity in thinking token consumption and number of interaction turns: on the same query, one model may use 900% more thinking tokens than another, or 10x more turns of environment interactions. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Thus, we propose cost distribution prediction as an open challenge. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
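
The reversal described above follows from simple arithmetic: actual cost is the listed per-token price multiplied by the tokens a model actually consumes, so a cheaper listed price can be outweighed by heavier token use. Below is a small sketch of that check. The prices and token counts are hypothetical numbers chosen for illustration, not figures from the paper, and the helper names are assumptions.

```python
def query_cost(price_per_million_tokens, tokens_used):
    """Cost of one query in dollars, given a listed price per million tokens."""
    return price_per_million_tokens * tokens_used / 1_000_000


def is_price_reversal(price_a, tokens_a, price_b, tokens_b):
    """True when the model with the lower listed price incurs the higher cost."""
    cost_a = query_cost(price_a, tokens_a)
    cost_b = query_cost(price_b, tokens_b)
    a_cheaper_listed_but_costlier = price_a < price_b and cost_a > cost_b
    b_cheaper_listed_but_costlier = price_b < price_a and cost_b > cost_a
    return a_cheaper_listed_but_costlier or b_cheaper_listed_but_costlier


# Hypothetical pair: model A is listed 5x cheaper but uses 10x more tokens.
print(is_price_reversal(2.0, 10_000, 10.0, 1_000))
```

A fuller accounting would also separate input, output, and thinking tokens and sum over interaction turns; this sketch collapses them into one token count.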

2603.18859 2026-05-29 cs.AI cs.CL cs.LG 版本更新

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

RewardFlow: 面向大语言模型智能体强化学习的拓扑感知状态图奖励传播

Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng

发表机构 * TMLR Group（TMLR小组）； Hong Kong Baptist University（香港浸会大学）； TCL Corporate Research (HK) Co Ltd（TCL企业研究（香港）有限公司）； Cooperative Medianet Innovation Center, Shanghai Jiao Tong University（上海交通大学未来媒体网络协同创新中心）； Department of Mathematics, Hong Kong Baptist University（香港浸会大学数学系）

AI总结提出RewardFlow方法，通过构建状态图进行拓扑感知的奖励传播，为智能体推理提供无标注的密集奖励，显著提升强化学习性能。

详情

AI中文摘要

强化学习在增强大语言模型智能体推理方面展现出潜力，但稀疏的终端奖励阻碍了细粒度优化。过程奖励建模提供了一种替代方案，但带来了高计算成本、奖励黑客风险和标注瓶颈。我们引入RewardFlow，一种用于估计智能体推理中状态级奖励的轻量级方法。通过构建捕获轨迹内在拓扑结构的状态图，RewardFlow执行拓扑感知的传播以估计每个状态对成功的贡献，从而产生有原则的、无标注的密集奖励。用于强化学习优化时，RewardFlow在四个智能体基准测试中显著优于先前基线：在基于文本的任务上平均成功率提高6.2%，在视觉推理上跨三个模型尺度比最强基线提高29.7%，在DeepResearch上准确率提高10%，同时具有卓越的鲁棒性和训练效率。RewardFlow的实现已在https://github.com/tmlr-group/RewardFlow公开。

英文摘要

Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization. Process reward modeling offers an alternative but incurs high computational costs, reward hacking risks, and annotation bottlenecks. We introduce RewardFlow, a lightweight method for estimating state-level rewards in agentic reasoning. By constructing state graphs that capture the intrinsic topological structure of trajectories, RewardFlow performs topology-aware propagation to estimate each state's contribution to success, yielding principled, annotation-free dense rewards. Used for RL optimization, RewardFlow substantially outperforms prior baselines across four agentic benchmarks: +6.2% average success rate on text-based tasks, +29.7% on visual reasoning over the strongest baseline across three model scales, and +10% accuracy on DeepResearch, with superior robustness and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
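
To make the idea of propagating a sparse terminal reward over a state graph concrete, here is a rough sketch under stated assumptions. It merges trajectories into one graph, then assigns each state a reward that decays with its shortest distance to a successful terminal state. This discounted-distance rule, the function name `propagate_rewards`, and the `gamma` parameter are illustrative assumptions; the abstract does not specify RewardFlow's actual propagation rule.

```python
from collections import defaultdict, deque


def propagate_rewards(trajectories, gamma=0.9):
    """Assign a dense reward to every state seen in `trajectories`.

    trajectories: list of (states, success) pairs, where `states` is an
    ordered list of hashable state identifiers and `success` is a bool.
    Returns a dict mapping each state to gamma ** (hops to nearest success),
    or 0.0 if no successful terminal state is reachable from it.
    """
    reverse_edges = defaultdict(set)
    goals = set()
    seen = set()
    for states, success in trajectories:
        seen.update(states)
        for src, dst in zip(states, states[1:]):
            reverse_edges[dst].add(src)
        if success and states:
            goals.add(states[-1])

    # Breadth-first search backwards from successful terminal states.
    distance = {goal: 0 for goal in goals}
    queue = deque(goals)
    while queue:
        node = queue.popleft()
        for prev in reverse_edges[node]:
            if prev not in distance:
                distance[prev] = distance[node] + 1
                queue.append(prev)

    return {s: (gamma ** distance[s] if s in distance else 0.0) for s in seen}


rewards = propagate_rewards([
    (["s0", "s1", "goal"], True),
    (["s0", "s2", "dead_end"], False),
])
print(rewards)
```

Shared states such as `s0` receive credit from any trajectory that passes through them, which is the benefit of working on the merged graph rather than on isolated rollouts.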

2602.22045 2026-05-29 cs.CL 版本更新

DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain

DLT-Corpus：面向分布式账本技术领域的大规模文本集合

Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu

发表机构 * Centre for Blockchain Technologies, University College London（区块链技术中心，伦敦大学学院）； School of Informatics, University of Edinburgh（信息学院，爱丁堡大学）； Exponential Science Foundation（指数科学基金会）

AI总结本文构建了DLT-Corpus，一个包含29.8亿词元、覆盖科学文献、专利和社交媒体的大规模领域语料库，并基于此分析了技术涌现模式与市场创新关联，同时发布了领域预训练模型LedgerBERT、情感分析数据集等资源。

Comments Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)

详情

DOI: 10.1145/3770855.3817553

AI中文摘要

我们介绍了DLT-Corpus，这是迄今为止面向分布式账本技术（DLT）研究的最大的领域特定文本集合：来自2212万篇文档的29.8亿词元，涵盖科学文献（37,440篇出版物）、美国专利商标局（USPTO）专利（49,023件）和社交媒体（2200万条帖子）。现有的DLT自然语言处理（NLP）资源主要集中在加密货币价格预测和智能合约上，尽管该行业市值约3万亿美元且技术快速演进，但领域特定语言仍未被充分探索。我们通过分析技术涌现模式和市场-创新相关性展示了DLT-Corpus的实用性。研究发现，技术首先出现在我们的科学文献子集中，然后才出现在专利和社交媒体中，遵循传统的技术转移模式。尽管即使在加密货币寒冬期间社交媒体情绪仍然极度看涨，但科学和专利活动与短期情绪的相关性减弱，而是跟踪整体市场扩张，形成良性循环：研究先于并推动经济增长，而经济增长又为进一步的创新提供资金。我们发布了DLT-Corpus及配套资源：LedgerBERT（在DLT特定命名实体识别（NER）任务上比BERT-base提升23%）、包含23,301条加密货币新闻标题和描述的情感分析数据集、工具和代码。

英文摘要

We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate DLT-Corpus' utility by analyzing patterns of technology emergence and market-innovation correlations. Findings reveal that technologies first appear in our scientific literature subset before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grows less tied to short-term sentiment, tracking overall market expansion in a virtuous cycle in which research precedes and enables economic growth that, in turn, funds further innovation. We release the DLT-Corpus and companion artifacts: LedgerBERT (+23% over BERT-base on DLT-specific Named Entity Recognition (NER) task), a sentiment analysis dataset of 23,301 crypto news headlines and descriptions, tools, and code.

2602.12642 2026-05-29 cs.CL cs.AI 版本更新

Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

超越归一化：重新审视配分函数作为RLVR的难度调度器

Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung

发表机构 * Seoul National University（首尔国立大学）

AI总结本文提出PACED-RL框架，通过重新解释配分函数作为每提示期望奖励信号，利用其指导训练中的问题选择与重放，在保持生成多样性的同时提升样本效率。

详情

AI中文摘要

奖励最大化的RL方法已被证明能够增强LLM的推理性能，但往往导致生成多样性降低。近期工作通过采用GFlowNets来解决这一问题，训练LLM匹配目标分布的同时联合学习其配分函数。与先前将配分函数仅视为归一化器的工作不同，我们将其重新解释为每提示期望奖励（即在线准确率）信号，利用这一未使用的信息来提高样本效率。具体而言，我们首先建立了配分函数与每提示准确率估计之间的理论关系。基于这一关键见解，我们提出了配分函数引导的强化学习（PACED-RL），这是一个后训练框架，利用准确率估计在训练过程中优先考虑信息量大的问题提示，并通过准确率估计误差优先的重放进一步提高样本效率。关键的是，这两个组件都重用了GFlowNet训练中已经产生的信息，有效地将计算开销摊销到现有优化过程中。跨多种基准的大量实验表明，与GRPO和先前的GFlowNet方法相比，性能有显著提升，突显了PACED-RL作为LLM更高效样本的分布匹配训练的有前途方向。

英文摘要

Reward-maximizing RL methods have been shown to enhance the reasoning performance of LLMs, but often lead to reduced generation diversity. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error-prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for more sample-efficient distribution-matching training of LLMs.
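
One way to picture "prioritizing informative prompts from accuracy estimates" is a difficulty filter that favors prompts the model solves about half the time. The sketch below scores each prompt by p(1 - p), the variance of a Bernoulli outcome with estimated accuracy p. This scoring rule and the name `select_prompts` are assumptions for illustration; the abstract does not state which rule PACED-RL uses, nor how the estimates are read off the partition function.

```python
def select_prompts(accuracy_estimates, k):
    """Return the k prompts with the most uncertain outcome.

    accuracy_estimates: dict mapping a prompt id to an estimated accuracy
    in [0, 1]. Prompts near 0.5 score highest; prompts that are almost
    always solved or almost never solved score lowest.
    """
    ranked = sorted(
        accuracy_estimates.items(),
        key=lambda item: item[1] * (1.0 - item[1]),
        reverse=True,
    )
    return [prompt for prompt, _ in ranked[:k]]


estimates = {"easy": 0.95, "medium": 0.5, "hard": 0.1, "impossible": 0.0}
print(select_prompts(estimates, 2))
```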

2602.08567 2026-05-29 cs.MA cs.CL 版本更新

ValueFlow: Measuring the Propagation of Value Perturbations in Multi-Agent LLM Systems

ValueFlow: 多智能体大语言模型中价值扰动的传播度量

Jinnuo Liu, Chuke Liu, Hua Shen

发表机构 * Center for Data Science, NYU Shanghai（纽约大学上海分校数据科学中心）

AI总结提出ValueFlow框架，通过56维价值数据集和LLM-as-a-judge协议，将价值漂移分解为智能体级响应行为与系统级结构效应，揭示价值对齐是系统级属性。

Comments Preprint. Under review. 28 pages, 10 figures

详情

AI中文摘要

多智能体大语言模型系统日益由观察并响应彼此输出的智能体组成。虽然价值对齐通常针对孤立模型进行评估，但价值扰动如何通过智能体交互传播仍知之甚少。我们提出ValueFlow，一个基于扰动的框架，通过源自施瓦茨价值调查的56维价值数据集，并使用LLM-as-a-judge协议对智能体价值取向进行评分，来度量多智能体系统中的价值漂移。ValueFlow将价值漂移分解为智能体级响应行为和系统级结构效应，由两个指标捕获：β-敏感性（智能体对受扰同伴价值信号的敏感度）和系统敏感性（节点级扰动对最终系统输出的影响）。实验跨越价值维度、骨干模型、角色和拓扑，表明敏感性在不同价值间差异显著，并受交互结构强烈影响，表明多智能体系统中的价值对齐是系统级属性，而不仅仅是智能体级属性。因此，ValueFlow为审计和缓解部署的多智能体系统中的价值传播提供了原则性基础。

英文摘要

Multi-agent large language model (LLM) systems increasingly consist of agents that observe and respond to one another's outputs. While value alignment is typically evaluated for isolated models, how value perturbations propagate through agent interactions remains poorly understood. We present ValueFlow, a perturbation-based framework that measures value drift in multi-agent systems via a 56-value valuation dataset derived from the Schwartz Value Survey, with agent value orientations scored using an LLM-as-a-judge protocol. ValueFlow decomposes value drift into agent-level response behavior and system-level structural effects, captured by two metrics: β-susceptibility, an agent's sensitivity to perturbed peer value signals, and system susceptibility (SS), the effect of node-level perturbations on final system outputs. Experiments span value dimensions, backbones, personas, and topologies, showing that susceptibility varies sharply across values and is strongly shaped by interaction structure, indicating that value alignment in multi-agent systems is a system-level property, not just an agent-level one. ValueFlow thus provides a principled basis for auditing and mitigating value propagation in deployed multi-agent systems.

2602.05370 2026-05-29 cs.CL 版本更新

Mining or Synthesis? Rethinking Exploration Efficiency in Iterative Alignment of Mathematical Reasoning

挖掘还是合成？重新思考数学推理迭代对齐中的探索效率

Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Hejin Wang, Jiansheng Wei, Xiaojun Meng, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China（哈尔滨工业大学深圳研究院）； Huawei Large Model Data Technology Lab（华为大模型数据技术实验室）； Huawei Multimodal Model Lab（华为多模态模型实验室）； Department of Statistics and Data Science, Tsinghua University, Beijing, China（清华大学统计与数据科学系）

AI总结针对数学推理任务中迭代DPO对齐时高N采样收益递减且引入噪声的问题，提出PACE框架，通过低预算探索与纠错合成偏好对，以约1/5计算量达到或超越高N基线性能。

详情

AI中文摘要

迭代直接偏好优化（DPO）已成为在推理任务中对齐大语言模型的广泛使用的范式。现有方法通常依赖Best-of-N采样（$N\geq8$）从分布尾部挖掘正轨迹。在这项工作中，我们表明在数学推理中，增加$N$会导致收益递减，同时增加验证器引起的假阳性风险和策略更新所需的数据分布偏移。为了解决这个问题，我们引入了PACE（通过纠错探索的近端对齐），一种基于生成的纠错框架，用低预算探索（$2\leq N\leq3$）取代穷举挖掘。PACE不是搜索越来越稀有的正样本，而是通过纠错后见优化和验证引导过滤，从失败的探索中合成高保真偏好对。实验上，PACE匹配或超过了DPO-R1（$N=16$）的性能，同时使用约$1/5$的计算量，并且在20%标签损坏下保持鲁棒，而高$N$基线表现出明显更高的噪声利用。

英文摘要

Iterative Direct Preference Optimization (DPO) has emerged as a widely used paradigm for aligning Large Language Models on reasoning tasks. Existing approaches typically rely on Best-of-N sampling ($N\geq8$) to mine positive trajectories from the distribution tail. In this work, we show that in mathematical reasoning, increasing $N$ yields diminishing returns while increasing verifier-induced false-positive risk and the distribution shift required for policy updates. To address this, we introduce PACE (Proximal Alignment via Corrective Exploration), a generation-based corrective framework that replaces exhaustive mining with low-budget exploration ($2\leq N\leq3$). Rather than searching for increasingly rare positive samples, PACE synthesizes high-fidelity preference pairs from failed explorations through corrective hindsight refinement and verification-guided filtering. Empirically, PACE matches or exceeds the performance of DPO-R1 ($N=16$) while using about $1/5$ of the compute, and remains robust under 20\% label corruption, where high-$N$ baselines exhibit substantially higher noise exploitation.

2602.04729 2026-05-29 cs.CL 版本更新

"Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs

“Be My Cheese?”：多语言大语言模型中机器翻译的文化细微差别基准

Madison Van Doren, Casey Ford, Jennifer Barajas, Riley VanMeter, Cory Holland

发表机构 * Appen ； The University of Chicago（芝加哥大学）； The University of California, Berkeley（加州大学伯克利分校）

AI总结本文通过大规模人工评估基准，研究多语言大语言模型在机器翻译中处理文化细微差别（如习语、双关语、节日和文化概念）的能力，发现语法准确性与文化共鸣之间存在显著差距。

Comments ACL 2026: Natural Language Generation, Evaluation, and Metrics (GEM) Workshop

详情

AI中文摘要

我们提出了一个大规模人工评估基准，用于评估最先进的多语言大语言模型（LLMs）在机器翻译中的文化本地化能力。现有的机器翻译基准强调词元和语法准确性，但往往忽略了实际本地化所需的语用和文化能力。基于一项涵盖20种语言87个翻译的试点研究，我们评估了7个多语言LLMs在15个目标语言上的表现，每种语言有5名母语评分员。每位评分员对全文翻译和包含文化细微语言（习语、双关语、节日和文化嵌入概念）的片段级别实例，按0-3的序数质量等级评分；片段评分还包括一个“不适用”选项，用于未翻译的片段。在全文评估中，平均整体质量适中（1.68/3）：GPT-5（2.10/3）、Claude Sonnet 4（1.97/3）和Mistral Medium 3.1（1.84/3）构成最强梯队，灾难性失败较少。片段级别结果显示明显的类别效应：节日（2.20/3）和文化概念（2.19/3）的翻译明显优于习语（1.65/3）和双关语（1.45/3），且习语最可能未被翻译。使用Krippendorff's α和Gwet's AC2评估评分者间信度，显示总体一致性中等（Krippendorff's α = 0.45），其中双关语的一致性最低。这些发现表明语法充分性与文化共鸣之间存在持续差距。据我们所知，这是第一个明确关注翻译和本地化中文化细微差别的多语言、人工标注基准。结果凸显了对文化信息训练数据、改进跨语言语用学以及支持系统性文化翻译基准评估框架的需求。

英文摘要

We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level and grammatical accuracy, but often overlook the pragmatic and culturally grounded competencies required for real-world localisation. Building on a pilot study of 87 translations across 20 languages, we evaluate 7 multilingual LLMs across 15 target languages with 5 native-speaker raters per language. Each rater scored both full-text translations and segment-level instances of culturally nuanced language (idioms, puns, holidays, and culturally embedded concepts) on an ordinal 0-3 quality scale; segment ratings additionally included an NA option for untranslated segments. Across full-text evaluations, mean overall quality is modest (1.68/3): GPT-5 (2.10/3), Claude Sonnet 4 (1.97/3), and Mistral Medium 3.1 (1.84/3) form the strongest tier with fewer catastrophic failures. Segment-level results show sharp category effects: holidays (2.20/3) and cultural concepts (2.19/3) translate notably better than idioms (1.65/3) and puns (1.45/3), and idioms are most likely to be left untranslated. Inter-rater reliability was assessed using Krippendorff's α and Gwet's AC2, indicating moderate agreement overall (Krippendorff's α = 0.45) with the lowest agreement for puns. These findings demonstrate a persistent gap between grammatical adequacy and cultural resonance. To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation. The results highlight the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation frameworks that support systematic benchmarking of culturally grounded translation.

2602.01058 2026-05-29 cs.LG cs.AI cs.CL 版本更新

从评分标准到可靠分数：基于证据的文本评估与LLM裁判

Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong

发表机构 * Washington University in St. Louis（华盛顿大学圣路易斯分校）； Arizona State University（亚利桑那州立大学）； Florida State University（佛罗里达州立大学）

AI总结提出Rulers框架，通过三阶段推理（任务规范、结构化执行、事后校准）解决LLM在基于评分标准的文本评估中的执行漂移、归因不可验证和人类尺度错位问题，实现更可靠的评分。

详情

AI中文摘要

基于评分标准的文本评估越来越多地使用大型语言模型（LLM）作为可扩展的裁判，但将冻结的黑盒模型与人类评分标准对齐仍然具有挑战性。我们将这一挑战表述为一个标准迁移问题：目标不仅仅是提示LLM分配分数，而是将人类评分标准意图转移到一个稳定、可审计且与人类对齐的评分协议中。我们识别了基于LLM的评分标准评估中三种反复出现的失败模式：评分标准执行漂移、不可验证的分数归因和人类尺度错位。为了解决这些失败模式，我们引入了Rulers，一个三阶段推理时框架，用于可靠、基于证据的评分标准文本评估。Rulers首先将人类评分标准转换为锁定的任务级规范，然后通过结构化检查表决策、类型化证据基础以及在适用时进行可提取引用验证来执行该规范，最后应用事后校准以将模型衍生的信号与人类分数边界对齐。在涵盖论文评分、摘要评估、EFL写作评估和结构化输入文本生成的四个基于评分标准的基准测试中，Rulers在多个冻结骨干模型的大多数评估设置中实现了更强的人类分数一致性。进一步分析表明，Rulers更好地匹配了经验人类分数分布，提高了在语义等价评分标准扰动下的稳定性，并受益于其三个组成部分。这些结果表明，可靠的LLM评判需要固定标准、可追溯证据和校准的分数解释，而不仅仅是提示措辞。我们的代码可在 https://anonymous.4open.science/r/Rulers_0525-3328 获取。

英文摘要

Rubric-based text evaluation increasingly uses large language models (LLMs) as scalable judges, but aligning frozen black-box models with human scoring standards remains challenging. We formulate this challenge as a criteria-transfer problem: the goal is not merely to prompt an LLM to assign a score, but to transfer human rubric intent into a stable, auditable, and human-aligned scoring protocol. We identify three recurring failure modes in LLM-based rubric scoring: rubric execution drift, unverifiable score attribution, and human-scale misalignment. To address these failure modes, we introduce Rulers, a three-stage inference-time framework for reliable, evidence-grounded rubric-based text evaluation. Rulers first converts a human rubric into a locked task-level specification, then executes the specification with structured checklist decisions, typed evidence grounding, and extractive quote verification when applicable, and finally applies post-hoc calibration to align model-derived signals with human score boundaries. Across four rubric-governed benchmarks covering essay scoring, summarization assessment, EFL writing evaluation, and structured-input text generation, Rulers achieves stronger human-score agreement in most evaluated settings across multiple frozen backbone models. Further analyses show that Rulers better matches empirical human score distributions, improves stability under semantically equivalent rubric perturbations, and benefits from each of its three components. These results suggest that reliable LLM judging requires fixed criteria, traceable evidence, and calibrated score interpretation rather than prompt phrasing alone. Our code is available at https://anonymous.4open.science/r/Rulers_0525-3328.

2601.08064 2026-05-29 cs.CL 版本更新

Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations

校准还不够：评估语言变化下的置信度估计

Yuxi Xia, Dennis Ulmer, Terra Blevins, Yihong Liu, Hinrich Schütze, Benjamin Roth

发表机构 * Faculty of Computer Science, UniVie Doctoral School Computer Science（计算机科学系，维也纳大学计算机科学博士学院）； Faculty of Philological and Cultural Studies, University of Vienna, Austria（文学与文化研究系，维也纳大学，奥地利）； ILLC, University of Amsterdam, Netherlands（阿姆斯特丹大学ILLC，荷兰）； Khoury College of Computer Sciences, Northeastern University, USA（东北大学计算机科学学院，美国）； LMU Munich, Munich Center for Machine Learning (MCML), Germany（慕尼黑大学，慕尼黑机器学习中心（MCML），德国）

AI总结提出一个基于鲁棒性、稳定性和敏感性的新评估框架，揭示现有置信度估计方法在区分语义不同答案方面的不足。

详情

AI中文摘要

置信度估计（CE）指示大型语言模型答案的可靠性，影响用户信任和决策。现有评估主要关注置信度与正确性之间的一致性，但忽略了语言的可变性：置信度估计应在语义等价的提示或答案变体下保持一致，而在答案含义不同时发生变化，因为这可能表明正确性的变化。因此，我们引入了一个基于三个互补属性的新评估框架：对提示扰动的\textbf{鲁棒性}、跨语义等价答案的\textbf{稳定性}以及对语义不同答案的\textbf{敏感性}。我们表明这些指标与现有CE指标在很大程度上独立，并且常见的CE方法往往在这些指标上失败：虽然大多数方法实现了高鲁棒性和稳定性，但它们难以区分语义不同的答案，可能是因为它们没有有效利用生成侧信息。总体而言，我们的框架揭示了当前CE评估中被忽视的局限性，并为现实应用中选择置信度估计器提供了指导。

英文摘要

Confidence estimation (CE) indicates how reliable the answers of large language models are and impacts user trust and decision-making. Existing evaluations mainly concern the alignment between confidence and correctness, but ignore the variability of language: confidence estimates should remain consistent under semantically equivalent prompts or answer variations, while changing when answer meaning differs, as this may indicate a change in correctness. Therefore, we introduce a novel evaluation framework based on three complementary properties: \textbf{robustness} to prompt perturbations, \textbf{stability} across semantically equivalent answers, and \textbf{sensitivity} to semantically different answers. We show that these metrics are largely independent from existing CE metrics, and that common CE methods often fail on them: while most methods achieve high robustness and stability, they struggle to distinguish semantically different answers, potentially because they do not effectively leverage generation-side information. Overall, our framework exposes overlooked limitations of current CE evaluations and provides guidance for selecting confidence estimators for real-world applications.

2601.04633 2026-05-29 cs.CL 版本更新

MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark

MAGA-Bench: 通过对齐检测基准的机器增强生成文本

Anyang Song, Ying Cheng, Yiqian Xu, Rui Feng

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University（复旦大学计算机科学与人工智能学院）； School of Computer Science and Technology, Tongji University（同济大学计算机科学与技术学院）； Shanghai Key Laboratory of Intelligent Information Processing, Fudan University（复旦大学智能信息处理上海市重点实验室）

AI总结提出MAGA基准，集成多种对齐方法（从提示构建到生成器-检测器对抗强化学习及推理过程）以增强机器生成文本的人类对齐性，提升检测器的泛化能力。

详情

AI中文摘要

机器生成文本（MGT）越来越难以与人类书写文本（HWT）区分。这一趋势加剧了虚假新闻和在线欺诈等恶意活动。微调检测器的泛化能力严重依赖数据集质量，仅仅扩大MGT来源可能越来越不足，需要进一步增强生成过程。基于HC-Var理论，增强MGT的人类对齐性不仅有助于现有检测器的鲁棒性测试，还能提升在此类对齐MGT数据集上微调的检测器的泛化能力。因此，我们提出了\textbf{M}achine-\textbf{A}ugment-\textbf{G}enerated Text via \textbf{A}lignment (MAGA) 检测基准。MAGA集成了多种对齐方法，从提示构建到\textbf{G}enerator-\textbf{D}etector \textbf{A}dversarial \textbf{R}einforcement \textbf{L}earning (GDARL) 以及推理过程。在我们的实验中，在MAGA上微调的RoBERTa检测器在泛化AUC上平均提升4.60%。相反，MAGA中的对齐MGT也导致所选检测器的AUC平均下降8.13%。我们希望MAGA基准能为未来MGT检测器泛化能力的研究提供有价值的见解。

英文摘要

Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This trend has exacerbated malicious activities such as fake news and online fraud. The generalization ability of fine-tuned detectors relies heavily on dataset quality, and simply expanding the sources of MGT may become increasingly insufficient. Further augmentation of the generation process is required. Based on HC-Var's theory, enhancing the human-like alignment of MGT not only facilitates robustness testing of existing detectors but also boosts the generalization ability of detectors fine-tuned on such aligned MGT datasets. Therefore, we propose the \textbf{M}achine-\textbf{A}ugment-\textbf{G}enerated Text via \textbf{A}lignment (MAGA) Detection Benchmark. MAGA integrates several alignment methods, ranging from prompt construction to \textbf{G}enerator-\textbf{D}etector \textbf{A}dversarial \textbf{R}einforcement \textbf{L}earning (GDARL) and the reasoning process. In our experiments, the RoBERTa detector fine-tuned on MAGA achieves an average improvement of 4.60\% in generalization AUC. Conversely, the aligned MGTs in MAGA also lead to an average decrease of 8.13\% in the AUC of selected detectors. We hope the MAGA Benchmark will provide valuable insights for future research on the generalization ability of MGT detectors.

2601.03134 2026-05-29 cs.CL 版本更新

心智景观感知的检索增强生成以改进长上下文理解

Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）； WeChat AI, Tencent（腾讯微信AI）； Hong Kong University of Science and Technology（香港科技大学）

AI总结提出MiA-RAG框架，通过层次化摘要构建心智景观并统一检索与生成的条件，实现全局语义表示指导下的长上下文检索与推理。

详情

AI中文摘要

人类通过依赖内容的整体语义表示来理解长而复杂的文本。这种全局视图有助于组织先验知识、解释新信息以及整合分散在文档中的证据，正如心理学中人类的心智景观感知能力所揭示的那样。当前的检索增强生成（RAG）系统缺乏这种指导，因此在长上下文任务中表现不佳。在本文中，我们提出了心智景观感知RAG（MiA-RAG），这是第一个将心智景观感知检索和生成表述为基于LLM的RAG的统一条件范式的框架。MiA-RAG通过层次化摘要构建心智景观，并将检索和生成都条件于这种全局语义表示。这使得检索器能够形成丰富的查询嵌入，生成器能够在连贯的全局上下文中对检索到的证据进行推理。我们在多样化的长上下文和双语基准上评估了MiA-RAG，用于基于证据的理解和全局意义构建。它持续超越基线，进一步分析表明，它将局部细节与连贯的全局表示对齐，实现了更类人的长上下文检索和推理。

英文摘要

Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first framework to formulate mindscape-aware retrieval and generation as a unified conditioning paradigm for LLM-based RAG. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.

2510.24606 2026-05-29 cs.CL 版本更新

Long-Context Modeling with Dynamic Hierarchical Sparse Attention for Memory-Constrained LLM Inference

基于动态层次稀疏注意力的内存受限大语言模型长上下文建模

Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Google（谷歌）

AI总结提出动态层次稀疏注意力（DHSA），通过在线预测注意力稀疏性并保持LLM骨干冻结，在内存受限下实现高效长上下文推理，精度接近密集注意力且速度提升显著。

Comments ICML26 (Spotlight)

详情

AI中文摘要

注意力的二次成本限制了长上下文大语言模型的可扩展性，尤其是在有限的硬件内存预算下。虽然注意力通常是稀疏的，但现有的静态稀疏方法无法适应任务或输入相关的变化，而最近的动态方法依赖于可能牺牲通用性的预定义模板或启发式规则。我们提出了动态层次稀疏注意力（DHSA），一种数据驱动的框架，在保持LLM骨干冻结的同时在线预测注意力稀疏性。DHSA通过估计块级重要性并将其传播到令牌级交互来执行层次路由，保留了因果重要依赖关系，同时实现了高效稀疏化。在Needle-in-a-Haystack测试、LongBench和RULER上，DHSA在高稀疏度下保持接近密集的精度，在相当预填充成本下，相对于块稀疏注意力实现了12-20%的相对精度提升。借助内存高效的瓦片后端，DHSA在128K上下文长度下实现了高达10倍的预填充加速。在LLaMA-3.1-8B（4位）上，DHSA在单个24GB GPU上扩展到100K上下文，而密集注意力无法做到。我们提供了互补的GPU和CPU后端，使DHSA能够在不同的硬件环境和多个开放权重模型系列上运行。这些结果表明，DHSA是内存受限长上下文LLM推理的一种高效且适应性强的解决方案。

英文摘要

The quadratic cost of attention limits the scalability of long-context LLMs, especially under limited hardware memory budgets. While attention is often sparse, existing static sparse methods cannot adapt to task- or input-dependent variations, and recent dynamic approaches rely on predefined templates or heuristics that may sacrifice generality. We propose Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that predicts attention sparsity online while keeping the LLM backbone frozen. DHSA performs hierarchical routing by estimating importance at the chunk level and propagating it to token-level interactions, preserving causally important dependencies while enabling efficient sparsification. Across the Needle-in-a-Haystack test, LongBench, and RULER, DHSA maintains near-dense accuracy in highly sparse regimes, achieving 12-20% relative accuracy gains over Block Sparse Attention at comparable prefill cost. With a memory-efficient tiled backend, DHSA delivers up to $10\times$ prefill speedup at 128K context length. On LLaMA-3.1-8B (4-bit), DHSA scales to 100K context on a single 24GB GPU, where dense attention fails. We provide complementary GPU and CPU backends, enabling DHSA to run across diverse hardware environments and multiple open-weight model families. These results demonstrate DHSA as an efficient and adaptable solution for memory-constrained long-context LLM inference.
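
The chunk-level half of "hierarchical routing" can be pictured as scoring groups of keys and keeping only the best groups for a query. The sketch below does exactly that with mean-pooled keys and a dot-product score. It is an illustrative stand-in, not DHSA itself: the abstract says chunk importance is predicted by a learned, data-driven component and then refined to token-level interactions, neither of which is modeled here, and `select_tokens`, `chunk_size`, and `top_k` are hypothetical names.

```python
def select_tokens(query, keys, chunk_size, top_k):
    """Return indices of key tokens in the top_k highest-scoring chunks.

    query: list of floats. keys: list of equal-length float lists.
    Each chunk is scored by the dot product between the query and the
    mean of the chunk's keys (a crude importance estimate).
    """
    chunks = [
        list(range(start, min(start + chunk_size, len(keys))))
        for start in range(0, len(keys), chunk_size)
    ]

    def chunk_score(indices):
        mean_key = [
            sum(keys[i][d] for i in indices) / len(indices)
            for d in range(len(query))
        ]
        return sum(q * m for q, m in zip(query, mean_key))

    ranked = sorted(range(len(chunks)), key=lambda c: chunk_score(chunks[c]), reverse=True)
    kept = sorted(ranked[:top_k])  # restore positional order
    return [i for c in kept for i in chunks[c]]


keys = [[0.1, 0.0], [0.2, 0.0], [0.9, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(select_tokens([1.0, 0.0], keys, chunk_size=2, top_k=1))
```

Attention would then be computed only over the returned indices, so the cost scales with the number of kept tokens rather than with the full context length.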

2510.20743 2026-05-29 cs.HC cs.AI cs.CL 版本更新

交互式会议中基于人类反馈的说话人修正

Xinlu He, Yiwen Guan, Badrivishal Paurana, Pitipat Kongsomjit, Zilin Dai, Jacob Whitehill

发表机构 * Worcester Polytechnic Institute（伍斯特理工学院）

AI总结提出一种LLM辅助的会议内说话人修正系统，通过用户简短反馈修正说话人归属错误，结合流式ASR、说话人日志、LLM摘要和在线注册机制，在AMI数据集上实现DER降低31.99%、说话人替换错误降低52.68%。

详情

AI中文摘要

大多数自动语音处理系统以“开环”模式运行，没有用户关于谁说了什么的反馈，然而人在回路的工作流程有可能实现更高的准确性。我们提出了一种LLM辅助的会议内说话人修正系统，允许用户通过简短纠正性反馈来修复说话人归属错误。在执行流式ASR和说话人日志后，系统呈现简洁的LLM生成的摘要，帮助用户识别重要的说话人错误，并通过更新带说话人标注的转录文本和添加在线说话人注册来整合用户反馈。为了使该工作流程在语音处理、LLM分析和用户反馈存在错误的情况下仍然有效，我们开发了多种机制来更精确地识别预期的修正。此外，我们构建了一个LLM驱动的用户反馈模拟，以评估工作流程的可复现性和可扩展性。应用于AMI头戴式麦克风测试集，我们的系统相对于流式基线（Google ASR + ECAPA）显著降低了31.99%的DER和52.68%的说话人替换错误。

英文摘要

Most automatic speech processing systems operate in "open loop" mode without user feedback about who said what, yet human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted in-meeting speaker correction system that lets users fix speaker attribution errors through brief corrective feedback. After performing streaming ASR and diarization, the system presents concise LLM-generated summaries to help users identify important speaker errors, and it incorporates user feedback by updating the speaker-attributed transcript and adding online speaker enrollments. To make this workflow effective despite errors in speech processing, LLM analysis, and user feedback, we developed several mechanisms to identify the intended correction more precisely. Further, we built an LLM-driven user feedback simulation to evaluate the workflow reproducibly and at scale. Applied to the AMI headset test set, our system substantially reduces the DER from a streaming baseline (Google ASR + ECAPA) by 31.99% and speaker substitution error by 52.68%.

2508.15371 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Confidence-Modulated Speculative Decoding for Large Language Models

置信度调节的推测解码用于大型语言模型

Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

发表机构 * Department of Data Science（数据科学系）； Praxis Business School（普拉克斯商学院）

AI总结本文提出一种基于置信度调节的推测解码框架，通过熵和边际不确定性度量动态调整草稿长度与验证过程，在机器翻译和摘要任务上实现加速并保持或提升BLEU和ROUGE分数。

Comments This is the preprint of the paper, which has been accepted for oral presentation and publication in the proceedings of IEEE INDISCON 2025. The conference will be organized at the National Institute of Technology, Rourkela, India, from August 21 to 23, 2025. The paper is 10 pages long, and it contains 2 figures and 5 tables

详情

DOI: 10.1109/INDISCON66021.2025.11254640

AI中文摘要

推测解码已成为一种通过草稿-验证范式并行化令牌生成来加速自回归推理的有效方法。然而，现有方法依赖静态草稿长度和刚性验证标准，限制了其在不同模型不确定性和输入复杂性下的适应性。本文提出一种基于置信度调节草稿的信息论推测解码框架。通过利用草稿模型输出分布上的熵和边际不确定性度量，所提方法在每次迭代中动态调整推测生成的令牌数量。这种自适应机制减少了回滚频率，提高了资源利用率，并保持了输出保真度。此外，验证过程使用相同的置信度信号进行调节，使得在不牺牲生成质量的情况下更灵活地接受草稿令牌。在机器翻译和摘要任务上的实验表明，与标准推测解码相比，该方法在保持或提升BLEU和ROUGE分数的同时实现了显著加速。所提方法提供了一种原则性的即插即用方法，用于在不确定性变化条件下实现大型语言模型的高效且鲁棒的解码。

英文摘要

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
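
As a rough illustration of confidence-modulated drafting, the sketch below maps the entropy of the drafter's next-token distribution to a draft length: a peaked (low-entropy) distribution allows a long draft, a flat one a short draft. The linear mapping, the `min_len`/`max_len` bounds, and the function names are assumptions for illustration; the abstract also mentions margin-based measures and a modulated verification step, which are not shown here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def draft_length(probs, min_len=1, max_len=8):
    """Choose how many tokens to draft from the drafter's confidence.

    Confidence is 1 minus the entropy normalized by its maximum,
    log(vocabulary size), so it lies in [0, 1].
    """
    max_entropy = math.log(len(probs))
    if max_entropy > 0:
        confidence = 1.0 - entropy(probs) / max_entropy
    else:
        confidence = 1.0  # single-token vocabulary: nothing to be unsure about
    return min_len + round(confidence * (max_len - min_len))


print(draft_length([0.97, 0.01, 0.01, 0.01]))  # confident drafter
print(draft_length([0.25, 0.25, 0.25, 0.25]))  # maximally uncertain drafter
```

Drafting fewer tokens when the drafter is uncertain is what reduces rollbacks: tokens that are likely to be rejected by the verifier are not generated in the first place.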

2508.05614 2026-05-29 cs.CL cs.AI (updated version)

GroundAct: Can LLM Agents Ground Actions in Environmental States?

Zixuan Wang, Dingming Li, Hongxing Li, Yanrui Miao, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

Affiliations: Zhejiang University

AI summary: This study introduces the GroundAct benchmark, evaluates 15 LLMs on 1,500 scenarios and 16,592 task instances, and finds that action grounding is a multi-dimensional challenge that cannot be solved by model scale alone.

Comments Project Page: https://zju-real.github.io/OmniEmbodied Code: https://github.com/ZJU-REAL/OmniEmbodied

Abstract

LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends on environmental state that the instruction does not mention. We argue that this gap reflects a missing capability: action grounding, the ability to infer from structured environmental state whether an action is feasible, what prerequisites it lacks, and whether it exceeds individual capacity. We introduce GroundAct, a benchmark of 1,500 scenarios and 16,592 task instances in text-based interactive environments spanning 11 domains, with tasks organized into seven categories along a cognitive complexity hierarchy. Evaluating 15 LLMs (3B-671B), we find three diagnostic patterns: (i) attribute reasoning is weakly correlated with tool and coordination reasoning, producing distinct model profiles; (ii) complete environment graphs yield up to +27.6/-22.9% on tool use vs. implicit collaboration, separating search-bound from constraint-filtering bottlenecks; and (iii) supervised fine-tuning lifts Qwen2.5-3B from 0.6% to 76.3% on direct command but only 1.5% to 5.5% on implicit collaboration. These results establish action grounding as a multi-dimensional challenge irreducible to scaling.

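2508.03726 2026-05-29 cs.CL (updated version)

Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

Jaydip Sen, Harshitha Puvvala, Subhasis Dasgupta

Affiliations: Praxis Business School; Sabre Industries

AI summary: Proposes the Hierarchical Verification Tree (HVT) framework, which restructures speculative beam decoding hierarchically by verifying high-likelihood drafts first and pruning suboptimal candidates early, substantially reducing inference time and energy consumption without retraining or architecture modification.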

Comments This paper was accepted for oral presentation and publication in the 3rd International Conference on Data Science and Network Engineering (ICDSNE 2025), organized at NIT, Agartala, India, from July 25 to 26, 2025. The paper is 12 pages long, and it contains 3 tables and 4 figures. This is NOT the final paper, which will be published in the Springer-published proceedings

DOI: 10.1007/978-3-032-07735-6_19

Abstract

Large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks but face persistent challenges in inference efficiency due to their autoregressive nature. While speculative decoding and beam sampling offer notable improvements, traditional methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. This work proposes the Hierarchical Verification Tree (HVT), a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed to ensure correctness and efficiency. Integration with standard LLM inference pipelines is achieved without requiring retraining or architecture modification. Experimental evaluations across multiple datasets and models demonstrate that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. The findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.
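The prioritize-then-prune idea above can be sketched as follows. This is not the paper's HVT algorithm, whose details the abstract does not give. It is a simplified illustration in which `verify` is a hypothetical callback standing in for the target-model check, and `prune_margin` is an assumed threshold.

```python
import heapq

def verify_prioritized(drafts, verify, prune_margin=2.0):
    """Verify drafts in descending log-likelihood order, pruning weak candidates.

    drafts: list of (log_likelihood, tokens) pairs from a draft model.
    verify: hypothetical callback for the target-model check; returns True to accept.
    prune_margin: drafts more than this many nats below the best draft are never verified.
    Returns (accepted_tokens_or_None, number_of_verify_calls).
    """
    if not drafts:
        return None, 0
    best = max(score for score, _ in drafts)
    # heapq is a min-heap, so scores are negated; the index breaks ties deterministically.
    heap = [(-score, i, tokens) for i, (score, tokens) in enumerate(drafts)
            if best - score <= prune_margin]
    heapq.heapify(heap)
    calls = 0
    while heap:
        _, _, tokens = heapq.heappop(heap)
        calls += 1
        if verify(tokens):
            return tokens, calls  # the most likely draft that passes verification
    return None, calls
```

Compared with verifying drafts in arrival order, this ordering spends verification calls on the most promising candidates first and skips candidates that fall far below the best draft.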

2507.09574 2026-05-29 cs.CV cs.AI cs.CL (updated version)

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, Junjie Hu

Affiliations: University of Illinois Urbana-Champaign; University of Wisconsin-Madison; Tsinghua University; Peking University; Microsoft

AI summary: Proposes the MENTOR framework, which uses a two-stage training paradigm to achieve fine-grained, token-level alignment between an autoregressive image generator and multimodal inputs without auxiliary adapters or cross-attention modules, and achieves strong performance on DreamBench++.

Comments Findings of ACL 2026

Abstract

Recent text-to-image models produce high-quality results but still struggle with precise visual control and with balancing multimodal inputs, and they require extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

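2506.08354 2026-05-29 cs.CL cs.AI cs.IR (updated version)

Position: Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning

Yiqun Sun, Qiang Huang, Anthony K. H. Tung, Jun Yu

Affiliations: National University of Singapore; Harbin Institute of Technology (Shenzhen)

AI summary: This paper argues that text embedding research should shift from surface meaning to implicit semantics. A pilot study exposes the limitations of existing models on implicit-semantics tasks, and the authors call for a paradigm shift that prioritizes linguistically grounded training data, benchmarks that probe deeper semantics, and implicit meaning as a core modeling objective.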

Comments To appear in ICML 2026

Abstract

This position paper argues that text embedding research should move beyond surface meaning and embrace implicit semantics as a central modeling objective. Text embeddings are a foundational component of modern NLP, underpinning a wide range of applications and driving sustained research progress. Despite rapid progress, most embedding models remain narrowly focused on surface-level semantics, whereas linguistic theory emphasizes that much of human meaning is implicit, shaped by pragmatics, speaker intent, and sociocultural context. Current models are typically trained on datasets that lack such depth and evaluated using benchmarks that reward surface similarity. As a result, they struggle with tasks that require interpretive reasoning, stance recognition, or socially grounded understanding. Our pilot study makes this limitation explicit, showing that even state-of-the-art embeddings achieve only marginal improvements over simple lexical baselines on tasks probing implicit semantics. We therefore call for a paradigm shift: embedding research should prioritize linguistically grounded and diverse training data, develop benchmarks that probe deeper semantic understanding, and treat implicit meaning as a core modeling objective to better align embeddings with real-world language complexity. The code is available at http://github.com/dukesun99/Implicit-Embeddings.

2506.06254 2026-05-29 cs.AI cs.CL cs.LG (updated version)

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents

Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, Xian Li

Affiliations: Amazon Stores Foundational AI

AI summary: Proposes the PersonaAgent framework, which integrates a personalized memory module (episodic and semantic memory) with a personalized action module and uses the persona prompt as an intermediary that coordinates memory and action, in order to address personalization tasks for LLM agents.

Comments Accepted in ACL 2026

Abstract

Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components: a personalized memory module that includes episodic and semantic memory mechanisms, and a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as a unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulates the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.

2505.18744 2026-05-29 cs.CL (updated version)

LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Complex Reasoning

Tao Liu, Xutao Mao, Hongying Zan, Dixuan Zhang, Yifan Li, Haixin Liu, Lulu Kong, Jiaming Hou, Rui Li, YunLong Li, aoze zheng, Zhiqiang Zhang, Luo Zhewei, Kunli Zhang, Min Peng

Affiliations: Zhengzhou University; Vanderbilt University; Wuhan University

AI summary: Introduces LogicCat, the first Text-to-SQL benchmark dataset targeting complex reasoning and chain-of-thought parsing. It covers physics, arithmetic, commonsense, and hypothetical reasoning scenarios; its 4,038 questions and 12,114 chain-of-thought steps substantially raise task difficulty, and existing models reach at most 33.20% execution accuracy.

Comments 9 pages, 5 figures

DOI: 10.1609/aaai.v40i36.40243
Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 40(36): 29958-29966, 2026

Abstract

Text-to-SQL is a critical task in natural language processing that aims to transform natural language questions into accurate and executable SQL queries. In real-world scenarios, these reasoning tasks are often accompanied by complex mathematical computations, domain knowledge, and hypothetical reasoning scenarios. However, existing large-scale Text-to-SQL datasets typically focus on business logic and task logic, neglecting critical factors such as vertical domain knowledge, complex mathematical reasoning, and hypothetical reasoning, which are essential for realistically reflecting the reasoning demands in practical applications and completing data querying and analysis. To bridge this gap, we introduce LogicCat, the first Text-to-SQL benchmark dataset specifically designed for complex reasoning and chain-of-thought parsing, encompassing physics, arithmetic, commonsense, and hypothetical reasoning scenarios. LogicCat comprises 4,038 English questions paired with 12,114 detailed chain-of-thought reasoning steps, spanning 45 databases across diverse domains, significantly surpassing existing datasets in complexity. Experimental results demonstrate that LogicCat substantially increases task difficulty: current state-of-the-art models reach at most 33.20% execution accuracy, indicating that this task remains exceptionally challenging. The advancement of LogicCat represents a crucial step toward developing systems suitable for real-world enterprise data analysis and autonomous query generation. We have released our dataset code at https://github.com/Ffunkytao/LogicCat.

2505.16178 2026-05-29 cs.CL (updated version)

Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge

Ying Zhang, Benjamin Heinzerling, Dongyuan Li, Kentaro Inui

Affiliations: RIKEN Center for Advanced Intelligence Project; Tohoku University; The University of Tokyo; MBZUAI

AI summary: Comparing two-stage and mixed training in 2.8B to 4B language models, the study finds that the joint optimization objective in mixed training induces gradient consistency between storage and query formats, which drives representation consistency and establishes a format-invariant retrieval process, so that facts can be recalled from unseen queries.

Abstract

While fine-tuning is the standard for injecting factual knowledge into large language models (LLMs), the mechanisms enabling reliable fact recall via unseen queries remain poorly understood. Common two-stage training strategies, which sequentially train on fact storage and query formats, often cause rote memorization. In contrast, mixed training jointly optimizes both formats and exhibits superior generalized recall. We investigate this success by comparing the two paradigms across 2.8$\sim$4B LLMs and identify the core mechanism: the joint optimization objective in mixed training induces gradient consistency across storage and query formats. This in turn drives representation consistency between the two formats, establishing a format-invariant retrieval process that maps unseen queries to stored facts. In contrast, the lack of such an objective in two-stage training results in inconsistent representations and failed recall. The consistency further localizes to the parameters updated by both formats, a set that is substantially larger under mixed training than under two-stage training. At the input level, the consistency leaves an interpretable signature: mixed training encodes facts in storage format from subject-relation tokens, the same components available in queries, while two-stage training relies on the full context. Our findings characterize the mechanisms of fact recall and offer a mechanistic foundation for optimizing knowledge injection in LLMs.
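One common way to quantify agreement between two gradients is cosine similarity. The abstract does not say which measure the authors use for "gradient consistency", so the sketch below is an assumption for illustration only, and the function name is hypothetical.

```python
import math

def gradient_cosine(grad_storage, grad_query):
    """Cosine similarity between two flattened gradient vectors.

    grad_storage, grad_query: equal-length sequences of floats, e.g. the gradient
    of the loss on a storage-format example and on a query-format example.
    Returns a value in [-1, 1]; values near 1 mean the two updates point the same way.
    """
    dot = sum(a * b for a, b in zip(grad_storage, grad_query))
    norm = (math.sqrt(sum(a * a for a in grad_storage))
            * math.sqrt(sum(b * b for b in grad_query)))
    return dot / norm if norm > 0.0 else 0.0
```

Under this measure, aligned gradients score close to 1, orthogonal gradients score 0, and opposing gradients score close to -1.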

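2505.10975 2026-05-29 cs.CL cs.AI cs.SD eess.AS (updated version)

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

Xinlu He, Jacob Whitehill

Affiliations: Worcester Polytechnic Institute

AI summary: This survey systematically reviews neural architecture paradigms (SIMO and SISO) for end-to-end multi-speaker automatic speech recognition, recent improvements, and strategies for extending to long-form speech, and compares methods through evaluation on standard benchmarks.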

Comments Accepted for publication in Computer Speech & Language (CSL)

Abstract

Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, with their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.

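2503.13844 2026-05-29 cs.CL cs.AI cs.CY cs.LG (updated version)

Towards Detecting Persuasion on Social Media: From Model Development to Insights on Persuasion Strategies

Elyas Meguellati, Stefano Civelli, Pietro Bernardelle, Shazia Sadiq, Irwin King, Gianluca Demartini

Affiliations: University of Queensland; The Chinese University of Hong Kong

AI summary: This paper develops a lightweight persuasive text detection model (state-of-the-art on Subtask 3 of SemEval 2023 Task 3) and applies it to a dataset of Facebook ads from the 2022 Australian federal election, revealing patterns in political campaigns across funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity as the election approaches.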

DOI: 10.1609/icwsm.v20i1.42714
Journal ref: Proceedings of the International AAAI Conference on Web and Social Media 20(1) (2026) 1587-1608

Abstract

Political advertising plays a pivotal role in shaping public opinion and influencing electoral outcomes, often through subtle persuasive techniques embedded in broader propaganda strategies. Detecting these persuasive elements is crucial for enhancing voter awareness and ensuring transparency in democratic processes. This paper presents an integrated approach that bridges model development and real-world application through two interconnected studies. First, we introduce a lightweight model for persuasive text detection that achieves state-of-the-art performance in Subtask 3 of SemEval 2023 Task 3 while requiring significantly fewer computational resources and training data than existing methods. Second, we demonstrate the model's practical utility by collecting the Australian Federal Election 2022 Facebook Ads (APA22) dataset, partially annotating a subset for persuasion, and fine-tuning the model to adapt from mainstream news to social media content. We then apply the fine-tuned model to label the remainder of the APA22 dataset, revealing distinct patterns in how political campaigns leverage persuasion through different funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity as election day approaches. Our findings not only underscore the necessity of domain-specific modeling for analyzing persuasion on social media but also show how uncovering these strategies can enhance transparency, inform voters, and promote accountability in digital campaigns.

2411.14279 2026-05-29 cs.CV cs.CL (updated version)

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang

Affiliations: University of Illinois Urbana-Champaign; Peking University; Tsinghua University

AI summary: To address hallucinations caused by language bias in large vision-language models, proposes the LACING framework, which uses a multimodal dual-attention mechanism and a soft-image guidance strategy to strengthen visual comprehension and reduce hallucinations without additional training resources.

Comments EMNLP 2025

Abstract

Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: (1) the different scales of training data in the LLM pretraining stage and the multimodal alignment stage, and (2) the learned inference bias due to the short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at https://lacing-lvlm.github.io.

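2404.10706 2026-05-29 cs.CY cs.CL cs.HC cs.SI (updated version)

Cross-Language Evolution of Divergent Collective Memory Around the Arab Spring

H. Laurie Jones, Brian C. Keegan

AI summary: By analyzing archived content of Arab Spring-related articles in the Arabic and English Wikipedias from 2011 to 2024, the study defines multilingual measures of event salience, deliberation, contextualization, and consolidation of collective memory, and shows how cross-language content similarity evolves over time.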

DOI: 10.1609/icwsm.v20i1.42699

Abstract

The Arab Spring was a historic set of protests beginning in 2011 that toppled governments and led to major conflicts. Collective memories of events like these can vary significantly across social contexts in response to political, cultural, and linguistic factors. While Wikipedia plays an important role in documenting both historic and current events, little attention has been given to how Wikipedia articles, created in the aftermath of major events, continue to evolve over years or decades. Using the archived content of Arab Spring-related topics across the Arabic and English Wikipedias between 2011 and 2024, we define and evaluate multilingual measures of event salience, deliberation, contextualization, and consolidation of collective memory surrounding the Arab Spring. Our findings about the temporal evolution of the Wikipedia articles' content similarity across languages have implications for theorizing about online collective memory processes and evaluating linguistic models trained on these data.
