arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
1. Large Language Models and Foundation Models (18 papers)

2606.13227 2026-06-12 cs.CL New submission

PolyAlign: Conditional Human-Distribution Alignment

L. D. M. S. Sai Teja, Ufaq Khan, Sathira Silva, Xiao Wu, Muhammad Haris Khan

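Affiliations: NIT Silchar (National Institute of Technology Silchar); MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

AI summary: Proposes PolyAlign, a framework that combines Bucket-Aware SFT with Human-Distribution Preference Optimization so that a language model matches the human response distribution appropriate to each interaction context, improving naturalness and distributional faithfulness.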

Comments 20 pages, 4 Figures, 8 Tables

English abstract

Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress the natural variation of human responses across languages, tasks, and dialogue settings. We study this problem as conditional human-distribution alignment: models should match the human response distribution appropriate to the current interaction context, rather than a universal response style. We introduce PolyAlign, a distribution-aware alignment framework that organizes bilingual interaction data into bucket-specific human reference distributions defined by language, interaction track, response family, and length. PolyAlign combines Bucket-Aware SFT, which balances optimization across heterogeneous buckets, with Human-Distribution Preference Optimization (HDPO), which regularizes preference learning using critic-estimated distance to bucket-specific human support. Across a bilingual evaluation suite covering English and Chinese single- and multi-turn settings, PolyAlign improves conditional naturalness and distributional faithfulness while preserving competitive task utility. The results suggest that post-training should move beyond global alignment objectives toward interaction-aware alignment with human response distributions.
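The bucket structure described in the abstract can be illustrated with a minimal sketch. The abstract only states that buckets are defined by language, interaction track, response family, and length, and that Bucket-Aware SFT balances optimization across them; the field names, the length bin edges, and the inverse-bucket-size weighting below are assumptions made for illustration, not the authors' method.

```python
from collections import Counter

def bucket_key(example, length_edges=(50, 200)):
    """Map an example to a (language, track, family, length_bin) bucket.

    The dictionary keys and the bin edges are hypothetical choices.
    """
    n = example["num_tokens"]
    # Count how many edges the length exceeds: 0 = short, 1 = medium, 2 = long.
    length_bin = sum(n > edge for edge in length_edges)
    return (example["language"], example["track"], example["family"], length_bin)

def balanced_weights(examples):
    """Give every bucket the same total weight, split evenly among its examples.

    This is one simple way to balance heterogeneous buckets (an assumption).
    """
    keys = [bucket_key(e) for e in examples]
    sizes = Counter(keys)
    return [1.0 / (len(sizes) * sizes[k]) for k in keys]
```

With this weighting, a bucket holding three examples and a bucket holding one example each contribute half of the total loss weight.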

2606.13624 2026-06-12 cs.CL New submission

Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

Jialin Gan, Xin Qiu, Guangzhe Chen, Xue Wang

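Affiliations: Zhejiang University; Harbin Institute of Technology; Shandong University

AI summary: Addresses token inefficiency in time series language models with an adaptive token budgeting framework that compresses time series tokens via frequency-domain structure and reduces prompt tokens layer by layer, achieving up to 7.68× faster inference and performance gains in 78% of evaluated settings.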

English abstract

Large language models (LLMs) have enabled time series (TS) analysis by jointly modeling numerical observations and textual context through a shared token interface. However, TS tokens and prompt tokens exhibit fundamentally different information structures, making uniform token processing inefficient. In this paper, we study token efficiency in TS language modeling from an asymmetric-token perspective. We show that TS tokens have highly uneven spectral contributions, where many tokens share redundant frequency patterns while a small subset preserves critical temporal evidence. We also observe that prompt-token influence attenuates with model depth, suggesting that full prompt retention across all layers is unnecessary. Based on these findings, we develop an adaptive token budgeting framework that compresses TS tokens via frequency-domain structure and progressively reduces prompt tokens across layers. Experiments across forecasting, classification, imputation, and anomaly detection demonstrate up to 7.68× inference acceleration and performance gains in 78% of evaluated settings, showing the effectiveness of asymmetric token compression for scalable TS foundation models.

2606.13634 2026-06-12 cs.CL math.CT New submission

Operads for compositional reasoning in LLMs

Nathaniel Bottman, Kyle Richardson

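AI summary: Proposes operads as a mathematical framework for question decomposition, defines the questions operad $Q$, interprets QA models as algebras over $Q$, and introduces an operadic consistency measure that the companion paper reports to be strongly correlated with accuracy.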

English abstract

Question decomposition, i.e. breaking a complex query into simpler sub-queries whose answers are composed to produce a final answer, is a widely used strategy for improving LLM reasoning, yet it currently lacks a rigorous mathematical foundation. In this paper, we propose operads, mathematical structures that model many-in, one-out operations and compositions thereof, as a natural framework for describing question decomposition. We define the questions operad $Q$, in which operations correspond to question templates and composition corresponds to substitution of sub-answers, and show how QA models can be interpreted as algebras over $Q$. Beyond reframing existing practice, this operadic perspective points toward new methods, in particular a notion of operadic consistency, which measures whether a QA model's answers agree across the partial collapses of a question decomposition tree. Empirical evaluation of operadic consistency is reported in our companion paper (Bottman, Liu, and Richardson, 2026), which finds it strongly correlated with accuracy across twelve LLMs and four multi-hop QA datasets and outperforming standard temperature-based self-consistency baselines. We argue that operads are the natural mathematical home for question decomposition, and that invariants such as operadic consistency open new directions for analyzing and improving the reliability of multi-step reasoning.

2606.13649 2026-06-12 cs.CL cs.LG New submission

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

Nathaniel Bottman, Yinhong Liu, Kyle Richardson

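Affiliations: Incubilate; University of Cambridge; Allen Institute for Artificial Intelligence

AI summary: Proposes operadic consistency (OC) as a label-free signal for detecting compositional reasoning failures in LLMs; on four multi-hop QA datasets it correlates strongly with accuracy (Pearson r ≥ 0.86) and outperforms self-consistency and related baselines.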

English abstract

Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model's direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
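The core check described in the abstract, that a direct answer should agree with the answer obtained by composing a stated decomposition, can be sketched as a per-question agreement score. The string normalization and exact-match comparison here are assumptions for illustration; the abstract does not specify how agreement is measured, and producing the two answers from a model is left out.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, replace non-alphanumeric ASCII with spaces, collapse whitespace.

    This ASCII-only normalization is a hypothetical choice.
    """
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", answer.lower()).split())

def operadic_consistency(direct_answer: str, composed_answer: str) -> float:
    """Return 1.0 if the direct and composed answers agree after normalization."""
    return float(normalize(direct_answer) == normalize(composed_answer))

def mean_consistency(pairs) -> float:
    """Average the per-question signal over (direct, composed) answer pairs."""
    return sum(operadic_consistency(d, c) for d, c in pairs) / len(pairs)
```

A model-level score such as `mean_consistency` is the kind of quantity that could then be correlated with accuracy across models, as the abstract reports for OC.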

2606.13680 2026-06-12 cs.CL cs.AI New submission

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen, Avinash Atreya, Hanjie Chen, Vicente Ordonez

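Affiliations: Meta Superintelligence Labs; Rice University

AI summary: Proposes RA-RFT, which trains a retriever with gold-relevance distillation and then applies reinforcement fine-tuning with retrieved analogous reasoning traces, improving mathematical reasoning performance.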

English abstract

Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning tasks: a semantically similar problem may demand an entirely different solution strategy, while a superficially different problem may share the same underlying reasoning pattern. We propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. RA-RFT uses gold-relevance distillation to train a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap, and then fine-tunes the policy model via reinforcement fine-tuning methods with retrieved analogous demonstrations, so the model learns to leverage reasoning traces under verifiable outcome rewards. We further analyze the diversity of retrieved contexts and find that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct reasoning scaffolds for individual problems. Across challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning methods. For example, it improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively -- suggesting that reasoning-aware retrieval is a complementary axis of improvement and orthogonal to advances in reward design or training curricula.

2606.12634 2026-06-12 cs.LG cs.AI cs.CL Cross-listed

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein

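Affiliations: Amazon Web Services

AI summary: Targets the sparse trajectory-level advantage signal in long-horizon tool-use reinforcement learning with Sibling-Guided Credit Distillation (SGCD): dynamic sampling yields successful and failed sibling rollouts, an external LLM contrasts them into a stepwise credit reference, and the resulting dense credit assignment improves results on AppWorld and τ³-airline.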

Comments 13 pages, 4 figures, 7 tables. Submitted to EMNLP 2026 Industry Track

English abstract

Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $τ^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $τ^3$-airline pass@1 $0.583 \to 0.602$.
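As a rough illustration of how bounded credit weights might reshape a broadcast trajectory-level advantage, the sketch below scales one advantage value by a clipped per-token weight. The abstract does not give the weighting formula; the mean-normalized divergence ratio, the clipping bounds, and the function name are all assumptions.

```python
def reshape_advantages(trajectory_advantage, divergences, low=0.5, high=1.5):
    """Scale a single trajectory-level advantage by bounded per-token weights.

    divergences: per-token teacher/student divergence values (non-negative).
    Weights are the divergence divided by its mean, clipped to [low, high].
    In an autograd framework these weights would be detached (treated as
    constants); plain Python floats carry no gradient, so nothing is needed here.
    """
    mean_div = sum(divergences) / len(divergences)
    if mean_div == 0:
        # No divergence signal: fall back to the uniform broadcast advantage.
        return [trajectory_advantage] * len(divergences)
    weights = [min(high, max(low, d / mean_div)) for d in divergences]
    return [trajectory_advantage * w for w in weights]
```

Because the weights only rescale the sign-preserving advantage within fixed bounds, the policy-gradient term stays the sole training signal, which matches the abstract's framing of distillation as credit assignment rather than a competing loss.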

2606.13106 2026-06-12 cs.LG cs.CL Cross-listed

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo

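Affiliations: HKUST(GZ) (Hong Kong University of Science and Technology (Guangzhou)); University of Cambridge; NTU (Nanyang Technological University); JoinQuant; HKUST (Hong Kong University of Science and Technology)

AI summary: Proposes SWITCH, a framework whose discrete boundary tokens make hidden-state-recurrence latent reasoning compatible with on-policy reinforcement learning and open to causal mechanistic analysis; experiments show it outperforms prior approaches.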

English abstract

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.

2606.13126 2026-06-12 cs.LG cs.AI cs.CL Cross-listed

MiniPIC: Flexible Position-Independent Caching in <100LOC

Nathan Ordonez, Thomas Parnell

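Affiliations: IBM Research

AI summary: Proposes MiniPIC, which uses a positional-encoding-free KV cache and user-controlled cache-reuse primitives to implement multiple position-independent caching methods in vLLM, raising prefill throughput and cutting time-to-first-token.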

Comments 13 pages, 5 figures

English abstract

Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible and fast vLLM design built from two ingredients: positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing and token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep), that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
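Why storing unrotated K vectors makes a cache entry reusable at any position can be seen from a small pure-Python sketch of rotary position embedding (RoPE). It is not vLLM code and does not reflect MiniPIC's attention backend; the function names and the toy vectors are illustrative assumptions. The rotation is applied when the key is read, using whatever logical position the current request assigns.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by position * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def attention_score(q, q_pos, k_unrotated, k_pos):
    """Dot product after rotating q and the cached, unrotated k at read time."""
    q_rot = rope(q, q_pos)
    k_rot = rope(k_unrotated, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot))
```

The score depends only on the offset `q_pos - k_pos`, so the same cached key gives the same result whether a span sits near the start of a prompt or a hundred tokens later. A cache that stored already-rotated keys would bake in one absolute position instead.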

2606.13473 2026-06-12 cs.LG cs.AI cs.CL Cross-listed

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng

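Affiliations: MiniMax; The Chinese University of Hong Kong; Fudan University; Peking University; Tsinghua University

AI summary: Proposes MaxProof, which combines generative-verifier reinforcement learning with population-level test-time scaling to produce competition-level mathematical proofs with the MiniMax-M3 series, exceeding the human gold-medal threshold on IMO 2025 and USAMO 2026.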

English abstract

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.
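The final step in the abstract, returning one proof from a candidate population through tournament selection, might look like the single-elimination sketch below. The abstract does not describe the bracket format or the comparison rule, so the pairing scheme and the `better` callback (standing in for a model-based ranker) are assumptions.

```python
def tournament_select(candidates, better):
    """Single-elimination tournament over candidates.

    better(a, b) returns True if a should advance over b. An unpaired
    candidate in an odd-sized round advances automatically (a bye).
    """
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one candidate")
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if better(a, b) else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```

A population of n candidates needs n - 1 pairwise comparisons this way, which keeps the number of ranker calls linear in the population size.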

2606.13603 2026-06-12 cs.LG cs.AI cs.CL Cross-listed

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

Daniel Scalena, Sara Candussio, Luca Bortolussi, Elisabetta Fersini, Malvina Nissim, Gabriele Sarti

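Affiliations: CLCG, University of Groningen; University of Milano-Bicocca; University of Trieste; Khoury College of Computer Sciences, Northeastern University

AI summary: Estimates the causal importance of chain-of-thought steps via early exit and finds a "commitment boundary" where reasoning shifts from transient guesses to a stable answer; later steps are epiphenomenal, so exiting early shortens reasoning by up to 55% with negligible performance impact.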

English abstract

Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer remains poorly understood. We estimate each step's causal importance via early exit and use this measure to study how answers form across the reasoning traces of several model families. Across diverse tasks, we find that reasoning typically crosses a "commitment boundary" -- a sharp transition from transient intermediate guesses to a stable, high-confidence answer. This transition often happens in a single step, well before the model's reasoning block ends, and is followed by "epiphenomenal" CoT steps that leave the final answer probability unaltered. Using attention probes, we show that answer-formation stages can be linearly decoded from intermediate reasoning steps with high accuracy and generalize robustly to unseen reasoning tasks. We exploit this signal to early-exit reasoning blocks at the commitment boundary, reducing the length of CoTs by up to 55% on average with negligible impact on model performance.
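One way to picture the commitment boundary is as the first reasoning step after which the early-exit probability of the final answer never drops below a confidence level again. The sketch below works offline on a list of per-step probabilities; the fixed threshold and the function name are assumptions, and the paper's probe-based detector is not reproduced here.

```python
def commitment_boundary(answer_probs, threshold=0.9):
    """Return the index of the first step from which the final-answer
    probability stays at or above threshold until the end, or None.

    answer_probs[i] is the probability of the final answer when exiting
    after reasoning step i (hypothetical input format).
    """
    boundary = None
    for i, p in enumerate(answer_probs):
        if p >= threshold:
            if boundary is None:
                boundary = i
        else:
            # A later dip means the earlier high value was a transient guess.
            boundary = None
    return boundary
```

Steps after the returned index are the candidates for epiphenomenal reasoning in the abstract's sense, since exiting at the boundary would already yield a stable answer.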

2507.10599 2026-06-12 cs.CL cs.AI cs.LG Updated version

Emergence of Hierarchical Emotion Organization in Large Language Models

Maya Okawa, Bo Zhao, Eric J. Bigelow, Rose Yu, Tomer Ullman, Ekdeep Singh Lubana, Hidenori Tanaka

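Affiliations: University of California, Berkeley; Stanford University; University of Washington; University of Tokyo

AI summary: Inspired by emotion wheel theory, analyzes probabilistic dependencies between emotional states in LLM outputs and finds that models naturally form hierarchical emotion trees consistent with human psychological models, that larger models develop more complex hierarchies, and that emotion recognition is systematically biased across socioeconomic personas.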

Comments ICML 2026

English abstract

As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels, i.e., a psychological framework that argues emotions organize hierarchically, we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.

2601.22594 2026-06-12 cs.CL cs.AI Updated version

Language Model Circuits Are Sparse in the Neuron Basis

Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

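Affiliations: Stanford University

AI summary: Shows empirically that MLP neurons are as sparse a feature basis as sparse autoencoders and builds an end-to-end gradient-based attribution pipeline that surfaces causally effective neuron circuits on several tasks.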

Comments ICML Spotlight, camera-ready

English abstract

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques which decompose the neuron basis into more interpretable units of model computation, such as sparse autoencoders (SAEs). However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end gradient-based attribution pipeline for circuit tracing on the MLP neuron basis, which surfaces causally effective neurons on a variety of tasks. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city-state-capital task from (Lindsey et al., 2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. mapping a city to its state), and can be steered to change the model's output. This work thus advances automated interpretability of language models without imposing additional training costs.

2602.01572 2026-06-12 cs.CL cs.IR Updated version

LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States

Yeqin Zhang, Yunfei Wang, Jiaxuan Chen, Ke Qin, Yizheng Zhao, Cam-Tu Nguyen

Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China

AI summary: Proposes Value Aggregation, which builds sentence embeddings from LLM attention value vectors rather than hidden states and, in a training-free setting, outperforms existing methods and matches or surpasses the ensemble-based MetaEOL.

English abstract

Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings and even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.
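The pooling step of Value Aggregation can be sketched on plain nested lists. The abstract says only that token values are pooled across multiple layers and token indices; the use of an unweighted mean, the choice of layers, and the input layout (`values[layer][token]` as a value vector with heads already concatenated) are assumptions for illustration.

```python
def value_aggregation(values, layers):
    """Mean-pool attention value vectors over the chosen layers and all tokens.

    values[layer][token] is a list of floats (hypothetical layout).
    Returns one sentence embedding with the same dimensionality.
    """
    selected = [vec for layer in layers for vec in values[layer]]
    if not selected:
        raise ValueError("no value vectors selected")
    dim = len(selected[0])
    return [sum(vec[j] for vec in selected) / len(selected) for j in range(dim)]
```

The AlignedWVA variant described in the abstract would replace the uniform mean with last-token attention weights and apply the output projection $W_O$; that step is not shown here.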

2604.12002 2026-06-12 cs.CL Updated version

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora

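Affiliations: Princeton University; University of Toronto; Carnegie Mellon University

AI summary: Proposes SD-Zero, in which one model acts as both generator and reviser and turns binary rewards into dense token-level self-supervision, improving training sample efficiency and outperforming baselines such as RFT and GRPO on math and code reasoning tasks.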

English abstract

Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser's token distributions conditioned on the generator's response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator's response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization. Code: https://github.com/princeton-pli/Self-Distillation-Zero.

2604.18307 2026-06-12 cs.CL Updated version

Reasoning Models Know What's Important, and Encode It in Their Activations

Yaniv Nikankin, Martin Tutek, Tomer Ashuach, Jonathan Rosenfeld, Yonatan Belinkov

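Affiliations: Technion; University of Zagreb, FER; MIT; Kempner Institute, Harvard

AI summary: Analyzes model activations rather than only the text of the reasoning chain and finds that activations identify critical reasoning steps more effectively, and that models encode step importance internally before generating subsequent steps.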

详情
AI中文摘要

语言模型通常通过生成包含许多重要性不同的步骤的长推理链来解决复杂任务。虽然某些步骤对生成最终答案至关重要,但其他步骤是可移除的。确定哪些步骤最重要以及为什么,仍然是理解模型如何处理推理的核心开放问题。我们研究了这个问题是通过模型内部还是通过推理链本身的标记来最好地解决。我们发现,模型激活比标记包含更多信息,用于识别重要的推理步骤。关键的是,通过在模型激活上训练探针来预测重要性,我们表明模型在生成后续步骤之前就已经编码了步骤重要性的内部表示。不同模型中重要性的内部表示在哪些步骤重要上具有高度一致性。这种表示分布在各个层中,并且与表面特征(如步骤的相对位置或长度)不相关。我们的发现表明,分析激活可以揭示表面方法根本遗漏的推理方面,表明推理分析应该研究模型内部。

英文摘要

Language models often solve complex tasks by generating long reasoning chains, consisting of many steps with varying importance. While some steps are crucial for generating the final answer, others are removable. Determining which steps matter most, and why, remains an open question central to understanding how models process reasoning. We investigate if this question is best approached through model internals or through tokens of the reasoning chain itself. We find that model activations contain more information than tokens for identifying important reasoning steps. Crucially, by training probes on model activations to predict importance, we show that models encode an internal representation of step importance, even prior to the generation of subsequent steps. The internal representations of importance in different models yield high agreement on which steps are important. The representation is distributed across layers, and does not correlate with surface-level features, such as a step's relative position or its length. Our findings suggest that analyzing activations can reveal aspects of reasoning that surface-level approaches fundamentally miss, indicating that reasoning analyses should look into model internals.

2509.18085 2026-06-12 cs.LG cs.AI cs.CL 版本更新

Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs

构建未来:通过校准草稿图实现扩散LLM推测解码

Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Christopher Lott, Fatih Porikli, Mingu Lee

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出Spiffy算法,利用校准的草稿图结构实现扩散LLM的推测解码,在保持输出分布的同时加速推理,最高减少8.6倍模型推理次数并加速6.3倍令牌生成速率。

Comments Original version uploaded on Sep 22, 2025. (v2): Extended Table 2 with additional analysis and referenced it in Sec 5.2. (v3): Added note to Sec 4.2 and Appendix A.2 specifying conditions for losslessness. (v4): Updated with the version accepted to ICML 2026 workshops

详情
AI中文摘要

扩散LLM(dLLM)最近作为自回归LLM(AR-LLM)的强大替代方案出现,具有以显著更高的令牌生成速率运行的潜力。为了释放这一潜力,我们提出了Spiffy,一种推测解码算法,用于加速dLLM推理,同时可证明地保持模型的输出分布。这项工作解决了将AR-LLM的推测解码思想应用于dLLM所涉及的独特挑战。Spiffy执行自动推测以消除独立草稿模型的开销,以新颖的有向草稿图形式构建草稿状态,以利用dLLM生成的双向、块状特性。这些草稿图离线校准以最大化接受率,并在推理过程中动态剪枝以提高计算效率。我们给出了Spiffy的详细公式,并展示了其与KV缓存和基于阈值的动态掩码相结合,加速LLaDA、Dream和SDAR模型的能力,导致模型推理次数减少高达8.6倍,令牌速率加速高达6.3倍。

英文摘要

Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token-generation rates. To unlock this potential, we present Spiffy, a speculative decoding algorithm to accelerate dLLM inference while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to dLLMs. Spiffy performs auto-speculation to eliminate the overheads of an independent draft model, structuring draft states in the form of a novel directed draft graph to take advantage of the bidirectional, blockwise nature of dLLM generation. These draft graphs are calibrated offline to maximize acceptance rates and are dynamically pruned during inference for improved computational efficiency. We present a detailed formulation of Spiffy and demonstrate its ability to accelerate LLaDA, Dream, and SDAR models in combination with KV caching and threshold-based dynamic unmasking leading to up to $8.6\times$ reduction in model inferences and $6.3\times$ acceleration in token rate.

2605.17770 2026-06-12 cs.AI cs.CL 版本更新

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

熵梯度反转:迈向大型推理模型的内部机制

Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu

发表机构 * National University of Singapore(新加坡国立大学) Renmin University of China(中国人民大学) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学)

AI总结 本文发现大型推理模型中令牌熵与logit梯度之间的稳健负相关(熵梯度反转),并提出相关性正则化组策略优化(CorR-PO)将其嵌入强化学习奖励正则化,从而提升推理性能。

Comments The authors are withdrawing this manuscript due to fundamental inaccuracies in the institutional affiliations and administrative attributions provided at the time of submission. As this version cannot be validated under the correct institutional framework, the authors request its formal withdrawal from the repository. No immediate replacement is intended

详情
AI中文摘要

大型推理模型(LRMs)的进步推动了从反应式“快思考”文本生成向系统性、逐步“慢思考”推理的范式转变,在复杂数学和逻辑任务中实现了最先进的性能。然而,该领域面临着令牌级行为分析与内部推理机制之间的根本差距,以及依赖昂贵外部验证器的推理优化强化学习(RL)的不稳定性。我们识别并正式定义了熵梯度反转,即令牌熵与logit梯度之间的稳健负相关,它作为LRM推理能力的明确几何指纹。在此基础上,我们提出相关性正则化组策略优化(CorR-PO),将这种反转特征嵌入RL奖励正则化。在多个模型规模的各种推理基准上的大量实验表明,CorR-PO始终优于最先进的基线,证实了更强的反转直接与更优的推理性能相关。

英文摘要

The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces the fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization relying on costly external verifiers. We identify and formally define Entropy-Gradient Inversion, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.

2606.09073 2026-06-12 cs.LG cs.AI cs.CL 版本更新

A Unifying Lens on Reward Uncertainty in RLHF

RLHF中奖励不确定性的统一视角

Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind

AI总结 本文提出使用分布奖励模型统一RLHF中的悲观主义方法,通过闭式有效奖励公式连接现有启发式方法,并揭示其隐含假设。

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)受限于奖励破解,即策略利用代理奖励模型(RM)中的错误,产生高RM分数而缺乏真正的质量提升。一种自然的缓解方法是悲观主义:在RM不确定的区域惩罚奖励。然而,标准标量RM没有提供原则性的不确定性概念。我们认为正确的对象是分布奖励模型$p(r\mid x,y)$。在贝叶斯推断或KL分布鲁棒优化(KL-DRO)视角下,KL正则化的RLHF目标具有闭式有效奖励$\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$。悲观分支统一了RM集成聚合的先前启发式方法:均值聚合、最坏情况优化(WCO)和不确定性加权优化(UWO)都作为该单一表达式的极限或截断出现。这也澄清了每个现有规则的隐含假设。

英文摘要

Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: lowering rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
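As an illustrative sketch only (not the authors' code), the pessimistic branch of the effective reward, $-\beta\log\mathbb{E}_p[e^{-r/\beta}]$, can be evaluated on a finite set of reward samples, for instance the scores of an RM ensemble. The function name `effective_reward` and the uniform weighting over samples are assumptions made here for illustration. Two standard properties of this log-mean-exp form show up numerically: for large β the value approaches the sample mean, and for small β it approaches the sample minimum. How the paper maps these regimes to the named heuristics is not spelled out in the abstract.

```python
import math


def effective_reward(rewards, beta, pessimistic=True):
    """Soft aggregation sign * beta * log(mean(exp(sign * r / beta))).

    sign = -1 gives the pessimistic branch, sign = +1 the optimistic one.
    Samples are weighted uniformly (an assumption of this sketch).
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    sign = -1.0 if pessimistic else 1.0
    scaled = [sign * r / beta for r in rewards]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return sign * beta * log_mean_exp


ensemble = [1.0, 2.0, 4.0]  # hypothetical scores from three reward models
for beta in (1e-3, 1.0, 1e4):
    print(beta, effective_reward(ensemble, beta))
```

By Jensen's inequality the pessimistic value never exceeds the sample mean, so β acts as a single knob controlling how strongly disagreement among samples is penalised.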

2. 机器翻译与跨语言处理 1 篇

2508.01656 2026-06-12 cs.CL cs.AI cs.CY cs.HC physics.soc-ph 版本更新

Authorship Attribution in Multilingual Machine-Generated Texts

多语言机器生成文本的作者归属

Lucio La Cava, Dominik Macko, Róbert Móro, Ivan Srba, Andrea Tagarelli

发表机构 * DIMES Department, University of Calabria(卡拉布里亚大学DIMES系) Kempelen Institute of Intelligent Technologies(肯佩伦智能技术研究所)

AI总结 提出多语言作者归属问题,研究单语言方法在18种语言和8个生成器上的跨语言迁移能力,发现显著局限。

Comments Accepted at ACL 2026 - Main

详情
AI中文摘要

随着大型语言模型(LLM)达到类人的流畅性和连贯性,区分机器生成文本(MGT)与人类撰写的内容变得越来越困难。虽然MGT检测的早期工作侧重于二元分类,但LLM的不断发展和多样性需要更细粒度且更具挑战性的作者归属(AA),即能够识别文本背后的确切生成器(LLM或人类)。然而,目前AA仍局限于单语言环境,其中英语是研究最多的语言,忽视了现代LLM的多语言性质和使用。在这项工作中,我们引入了多语言作者归属问题,涉及将文本归因于跨多种语言的人类或多个LLM生成器。聚焦于18种语言——涵盖多个语系和书写系统——以及8个生成器(7个LLM和人类撰写类别),我们研究了单语言AA方法在多语言环境中的适用性,包括其跨语言迁移能力,以及生成器对归属性能的影响。我们的结果表明,虽然某些单语言AA方法可以适应多语言环境,但仍然存在显著的局限性和挑战,特别是在跨不同语系迁移时,这凸显了多语言AA的复杂性以及需要更稳健的方法以更好地匹配现实场景。

英文摘要

As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content becomes increasingly difficult. While early efforts in MGT detection have focused on binary classification, the growing landscape and diversity of LLMs require a more fine-grained yet challenging authorship attribution (AA), i.e., being able to identify the precise generator (LLM or human) behind a text. However, AA remains nowadays confined to a monolingual setting, with English being the most investigated one, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to human or multiple LLM generators across diverse languages. Focusing on 18 languages -- covering multiple families and writing scripts -- and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods in terms of their cross-lingual transferability, and the impact of generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches to better match real-world scenarios.

3. 信息抽取、检索与问答 9 篇

2606.12578 2026-06-12 cs.CL 新提交

MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

MARD: 镜像增强推理蒸馏用于机制级药物-药物相互作用预测

Mohammadreza Riyazat, Vian Lelo, Rameen Jafri, Yumna Khan, Abeer Badawi

发表机构 * University of Guelph(圭尔夫大学) York University(约克大学) Vector Institute(向量研究所)

AI总结 提出MARD-7B模型,通过镜像增强推理蒸馏、单token KL散度、PRM加权DPO和机制感知检索通道,在机制级DDI预测中准确率超越GPT-4o 6.7个百分点,且成本仅为1%。

Comments 29 pages, 9 figures. Preprint

详情
AI中文摘要

机制级药物-药物相互作用(DDI)预测需要识别涉及的酶或药效学轴、作用方向及证据,而不仅仅是判断两种药物是否相互作用。我们引入了一个可复现的机制级DDI标注与评估协议,包括结构化的7家族/147亚型分类法、无泄漏的冷切分协议以及可审计的推理指标,用于评估超越平面交互分类的药理学预测。我们提出一个流水线,生成了7B推理模型MARD(镜像增强推理蒸馏),结合了三种训练创新:方向标签上的单token KL散度,将模型的预测与方向标签绑定;基于PRM权重的DPO,使用程序化硬负样本;以及无泄漏的机制感知检索通道。过程奖励步骤标签可自动根据DrugBank结构化字段验证,无需人工或LLM评判。在2026年4月的DrugBank版本上,我们的MARD-7B是32个系统比较中唯一在药物对新颖性下准确率保持稳定的系统,以约1%的前沿API成本,比最佳基线高出13.9个百分点,比GPT-4o高出6.7个百分点。进一步分析揭示了反记忆特征,即在罕见药物上准确率提升,表明增益来自结构化药理学推理而非药物频率记忆。我们发布了语料库、DDI-PRM、检索索引和训练代码。

英文摘要

Mechanism-level drug-drug interaction (DDI) prediction requires identifying which enzyme or pharmacodynamic axis is implicated, in which direction, and with which evidence -- not merely whether two drugs interact. We introduce a reproducible mechanism-level DDI labelling and evaluation protocol with a structured 7-family/147-subtype taxonomy, leakage-safe cold-split protocols, and auditable reasoning metrics for evaluating pharmacological prediction beyond flat interaction classification. We propose a pipeline that produces a 7B reasoning MARD (Mirror-Augmented Reasoning Distillation), combining three training innovations: a single-token KL divergence on direction tag that ties the model's prediction, per-loss PRM-weighted DPO with programmatic hard negatives, and a leakage-safe mechanism-aware retrieval channel. Process-reward step labels are automatically verifiable against DrugBank-structured fields, requiring no human or LLM judges. On the April-2026 DrugBank release, our MARD-7B is the only system in a 32-system comparison whose accuracy survives drug-pair novelty, beating the best baseline by +13.9 pp and GPT-4o by +6.7 pp at ~1% of frontier API cost. Further analysis reveals an anti-memorisation signature where accuracy improves on rarely seen drugs, suggesting that gain comes from structured pharmacological reasoning rather than drug-frequency memorisation. We release corpus, DDI-PRM, retrieval index, and training code.

2606.12903 2026-06-12 cs.CL 新提交

X-MADAM-RAG: Diagnosing and Handling Chinese-English Evidence Conflict in Retrieval-Augmented Generation

X-MADAM-RAG:诊断和处理检索增强生成中的中英文证据冲突

Yongqi Kang, Yu Fu, Yong Zhao

发表机构 * Sichuan University(四川大学)

AI总结 提出X-MADAM-RAG管道,通过分解证据处理步骤(候选提取、可见证据修复、确定性分组和冲突感知聚合)解决RAG中中英文证据冲突问题,在受控基准上取得高准确率,但发现文档级提取是主要瓶颈。

详情
AI中文摘要

检索增强生成(RAG)系统可能接收到不仅噪声大而且相互矛盾的证据。这个问题在多语言环境中尤为突出,因为检索到的中文和英文证据可能支持不相容的答案候选。我们通过X-RAMDocs-ZHEN(一个从RAMDocs衍生的受控中英文基准)研究此问题,用于诊断RAG中的证据冲突。该基准包含300个示例,涵盖六种平衡条件,包括单语言支持、双语一致、反向冲突方向以及带可选噪声的冲突。我们进一步研究了X-MADAM-RAG,一个可解释的管道,将证据处理分解为每个文档的候选提取、可见证据修复、确定性候选分组和冲突感知聚合。在原始受控基准上使用Qwen2.5-7B-Instruct,X-MADAM-RAG达到了0.9667的严格准确率和0.9767的冲突感知成功率,优于证据归一化的单次调用基线。然而,一个零调用的纯规则提取器在同一基准上达到了1.0000,揭示了强模板规律性。为了探究这一局限性,我们构建了一个确定性自然化压力测试,移除了显式答案模板但保留了候选字符串。在其100样本子集上,纯规则提取器降至0.0000,但X-MADAM-RAG也降至0.3000严格准确率,低于朴素基线和证据归一化基线。特权Oracle保持完美,表明文档级提取是主要瓶颈。这些发现将X-RAMDocs-ZHEN和X-MADAM-RAG定位为受控证据冲突的诊断工具,而非通用幻觉检测或对自然检索鲁棒性的证据。

英文摘要

Retrieval-augmented generation (RAG) systems may receive evidence that is not merely noisy but mutually contradictory. This issue becomes particularly salient in multilingual settings, where retrieved Chinese and English evidence may support incompatible answer candidates. We study this problem through X-RAMDocs-ZHEN, a controlled Chinese-English benchmark derived from RAMDocs for diagnosing evidence conflict in RAG. The benchmark contains 300 examples across six balanced conditions, including monolingual support, bilingual agreement, reversed conflict directions, and conflict with optional noise. We further examine X-MADAM-RAG, an interpretable pipeline that decomposes evidence handling into per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. On the original controlled benchmark with Qwen2.5-7B-Instruct, X-MADAM-RAG achieves 0.9667 strict accuracy and 0.9767 conflict-aware success, outperforming an evidence-normalized single-call baseline. However, a zero-call rule-only extractor reaches 1.0000 on the same benchmark, revealing strong template regularity. To probe this limitation, we construct a deterministic naturalized stress test that removes explicit answer templates while preserving candidate strings. On its 100-sample subset, rule-only extraction falls to 0.0000, but X-MADAM-RAG also drops to 0.3000 strict accuracy, below both naive and evidence-normalized baselines. A privileged oracle remains perfect, indicating that document-level extraction is the main bottleneck. These findings position X-RAMDocs-ZHEN and X-MADAM-RAG as diagnostic tools for controlled evidence conflict rather than as evidence of general hallucination detection or robustness to natural retrieval.

2606.13082 2026-06-12 cs.CL 新提交

sebis at CRF Filling 2026: A Two-Stage Local LLM Pipeline for Medical CRF Filling

sebis at CRF Filling 2026: 用于医疗CRF填写的两阶段本地LLM流水线

Katharina Sommer, Tristan Till, Florian Matthes

发表机构 * Technical University of Munich(慕尼黑工业大学)

AI总结 提出基于MedGemma-27B的两阶段本地流水线,分离二值存在分类与值提取,通过少样本上下文学习实现隐私保护,在CRF填写任务上取得0.55 macro-F1,排名第二。

Comments Published in Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health), LREC 2026

详情
AI中文摘要

从非结构化电子健康记录中提取结构化临床信息是医疗信息学中一个持续存在的瓶颈。虽然大型语言模型(LLM)提供了高性能,但它们在临床环境中的部署受到隐私风险、推理成本以及超出文本证据产生幻觉的倾向的阻碍。我们针对CL4Health 2026病例报告表(CRF)填写任务,通过提出一个完全本地化、领域自适应的流水线来解决这些挑战,该流水线使用MedGemma-27B模型。我们的两阶段架构将二值存在分类与值提取分离,强制严格遵守文本证据,并确保对否定、不确定或未知状态产生确定性输出。通过利用特定项目的少样本上下文学习,无需外部API调用或微调,我们的方法在官方英语测试轨道上实现了0.55的宏F1分数。这一结果在所有本地托管、开源提交中排名第二。我们的工作表明,保护隐私的本地LLM流水线可以实现与专有前沿模型接近的性能,为临床NLP提供了一个实用、数据主权的框架。

英文摘要

The extraction of structured clinical information from unstructured EHR notes is a persistent bottleneck in healthcare informatics. While large language models (LLMs) offer high performance, their deployment in clinical settings is hindered by privacy risks, inference costs, and the tendency to hallucinate beyond textual evidence. We address these challenges for the CL4Health 2026 Case Report Form (CRF) filling task by proposing a fully local, domain-adapted pipeline using the MedGemma-27B model. Our two-stage architecture, which separates binary presence classification from value extraction, enforces strict adherence to textual evidence and ensures deterministic outputs for negated, uncertain, or unknown states. By leveraging item-specific, few-shot in-context learning without external API calls or fine-tuning, our approach achieves a macro-F1 score of 0.55 on the official English test track. This result secures second place among all locally-hosted, open-source submissions. Our work demonstrates that privacy-preserving, on-premise LLM pipelines can achieve near-competitive performance with proprietary frontier models, providing a practical, data-sovereign framework for clinical NLP.

2606.13537 2026-06-12 cs.CL 新提交

When Does Mixing Help? Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval

何时混合有帮助?分析多语言稠密检索中的查询嵌入插值

Tongyao Zhu, Chao-Ming Huang, Min-Yen Kan

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 通过嵌入级插值构造混合查询,系统研究多语言稠密检索对混合语言查询的敏感性,发现最优混合比在多数情况下优于单语言查询,且英语主导性导致不对称性。

Comments ACL 2026 Main (Oral)

详情
AI中文摘要

虽然混合语言查询在多语言社区中普遍存在,但稠密检索器对此类查询的敏感性仍知之甚少。我们在mMARCO上进行了比例控制研究,通过嵌入级混合——将混合查询构建为单语言嵌入的插值——系统地评估了改变平行查询翻译混合比例时的检索性能。使用BGE-M3的实验表明,在88/105个案例中,最优混合比优于最佳单语言端点。我们发现了由英语主导性驱动的明显不对称性:当从非英语文档索引中检索时,混合普遍有益,而包含英语的索引则最好使用纯英语查询。此外,对于每种非英语文档语言,英语都是最强的混合伙伴。最后,在控制英语主导性后,混合收益与类型学距离呈负相关。我们得出结论,语言混合敏感性是有结构且可预测的,并且我们验证了这些模式在模型家族和规模上的鲁棒性。

英文摘要

While mixed-language querying is ubiquitous in multilingual communities, the sensitivity of dense retrievers to such queries remains poorly understood. We present a ratio-controlled study on mMARCO that systematically evaluates retrieval performance by varying the mixing proportion of parallel query translations via embedding-level mixing -- constructing mixed queries as an interpolation of monolingual embeddings. Experiments with BGE-M3 demonstrate that an optimal mixing ratio outperforms the best monolingual endpoint in 88/105 cases. We uncover a distinct asymmetry driven by English dominance: mixing is uniformly beneficial when retrieving from non-English document indices, whereas indices containing English are best served by pure English queries. Furthermore, English acts as the strongest mixing partner for every non-English document language. Finally, when controlling for English dominance, mixing gains correlate negatively with typological distance. We conclude that language-mix sensitivity is structured and predictable, and we validate the robustness of these patterns across model families and scales.
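A minimal sketch of embedding-level mixing, assuming the mixed query is a linear interpolation of two monolingual query embeddings followed by unit normalisation, and that retrieval scores are cosine similarities. The abstract does not state either detail, and the helper names (`mix_queries`, `sweep`) are hypothetical. In practice the embeddings would come from an encoder such as BGE-M3; toy 2-d vectors stand in here.

```python
import math


def normalize(v):
    """Scale a vector to unit length (fails on the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mix_queries(emb_a, emb_b, alpha):
    """Return the unit-normalised interpolation alpha*emb_a + (1-alpha)*emb_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = [alpha * a + (1.0 - alpha) * b for a, b in zip(emb_a, emb_b)]
    return normalize(mixed)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def sweep(emb_a, emb_b, doc_emb, steps=10):
    """Score one document against mixed queries for alpha = 0, 1/steps, ..., 1."""
    return [
        (i / steps, cosine(mix_queries(emb_a, emb_b, i / steps), doc_emb))
        for i in range(steps + 1)
    ]


# Toy example: the document lies between the two monolingual query embeddings.
results = sweep([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
best_alpha, best_score = max(results, key=lambda t: t[1])
print(best_alpha, best_score)
```

Sweeping the ratio like this and comparing the best interior point against the two endpoints (alpha = 0 and alpha = 1) mirrors the kind of ratio-controlled comparison the abstract describes.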

2606.13550 2026-06-12 cs.AI cs.CL 交叉投稿

Uncertainty-Aware Hybrid Retrieval for Long-Document RAG

不确定性感知的混合检索用于长文档RAG

Hoin Jung, Xiaoqian Wang

发表机构 * Elmore Family School of Electrical and Computer Engineering, Purdue University(普渡大学埃尔莫尔家族电气与计算机工程学院)

AI总结 提出UMG-RAG,一种无需训练的混合检索框架,通过多粒度分块和不确定性估计融合密集与稀疏检索结果,提升长文档问答质量。

详情
AI中文摘要

检索增强生成(RAG)关键依赖于检索证据的质量和粒度。大的检索单元保留上下文但常引入无关内容,可能稀释答案承载证据并恶化长上下文利用。细粒度单元更紧凑,但可能难以可靠检索,因为短块可能缺乏匹配查询所需的语义、词汇或桥接线索。我们提出不确定性感知的多粒度RAG(UMG-RAG),一种无需训练的混合检索框架,将分块粒度视为查询特定的可靠性估计。UMG-RAG不训练新检索器或修改生成器,而是利用现有密集和稀疏检索器作为跨多个分块粒度的互补专家。对于每个查询,它将每个专家-粒度得分列表转换为证据分布,从分布熵估计可靠性,并根据查询特定的语义、词汇和粒度置信度融合候选。我们进一步引入UMGP-RAG,一种父级提升变体,利用细粒度命中定位相关证据,同时返回更广泛的非冗余父块以保持局部连贯性。在问答基准上的实验表明,不确定性感知融合和父级提升在保持轻量级、即插即用检索管道的同时,提高了生成质量。

英文摘要

Retrieval augmented generation (RAG) depends critically on the quality and granularity of retrieved evidence. Large retrieval units preserve context but often introduce irrelevant content, which can dilute answer bearing evidence and worsen long context utilization. Fine-grained units are more compact, but they may be difficult to retrieve reliably because short chunks can lack semantic, lexical, or bridging cues needed to match the query. We propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that treats chunk granularity as query-specific reliability estimation. Instead of training a new retriever or modifying the generator, UMG-RAG uses existing dense and sparse retrievers as complementary experts across multiple chunk granularities. For each query, it converts each expert-granularity score list into an evidence distribution, estimates reliability from distribution entropy, and fuses candidates according to query-specific semantic, lexical, and granularity confidence. We further introduce UMGP-RAG, a parent promotion variant that uses fine-grained hits to locate relevant evidence while returning broader non-redundant parent chunks for local coherence. Experiments on question answering benchmarks show that uncertainty-aware fusion and parent promotion improve generation quality while maintaining a lightweight, plug-and-play retrieval pipeline.
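The per-query steps described above (score list to evidence distribution, entropy to reliability, reliability-weighted fusion) can be sketched as follows. This is an illustration under stated assumptions, not the UMG-RAG implementation: the softmax normalisation, the `1 - H / log n` reliability formula, and the weighted-sum fusion rule are all choices made here, and every function name is hypothetical.

```python
import math


def evidence_distribution(scores, temperature=1.0):
    """Softmax over one expert-granularity score list (assumed normalisation)."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reliability(scores):
    """1 minus normalised entropy: near 1 for a peaked list, 0 for a flat one."""
    if len(scores) < 2:
        return 1.0
    probs = evidence_distribution(scores)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(scores))


def fuse(score_lists):
    """Fuse {expert_name: {chunk_id: score}} into a ranked [(chunk_id, fused_score)]."""
    fused = {}
    for ranked in score_lists.values():
        ids = list(ranked)
        scores = [ranked[i] for i in ids]
        weight = reliability(scores)  # confident experts contribute more
        for chunk_id, p in zip(ids, evidence_distribution(scores)):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + weight * p
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scores: the dense expert is confident, the sparse one is nearly flat.
ranking = fuse({
    "dense@256": {"a": 9.0, "b": 1.0, "c": 1.0},
    "sparse@256": {"a": 1.0, "b": 1.1, "c": 1.0},
})
print(ranking)
```

Because the weight is computed per query, the same expert can dominate for one question and be nearly ignored for another, which is the sense in which granularity is treated as query-specific reliability estimation.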

2601.11004 2026-06-12 cs.CL 版本更新

NOVA: NOise-aware Verbal Confidence CAlibration for Robust Large Language Models in RAG Systems

NOVA: 面向RAG系统中鲁棒大语言模型的噪声感知言语置信度校准

Jiayu Liu, Rui Wang, Qing Zong, Yumeng Wang, Cheng Qian, Qingcheng Zeng, Tianshi Zheng, Haochen Shi, Dadi Guo, Baixuan Xu, Chunyang Li, Yangqiu Song

AI总结 提出NOVA框架,通过规则引导的监督微调,解决检索增强生成中噪声上下文导致的过度自信问题,在域内和域外分别提升ECE 10.9%和8.0%。

详情
AI中文摘要

准确评估模型置信度对于在关键事实领域部署大语言模型(LLM)至关重要。尽管检索增强生成(RAG)被广泛采用以改善基础事实,但RAG设置中的置信度校准仍知之甚少。我们跨四个基准进行了系统研究,揭示LLM在检索到噪声上下文时校准性能较差。具体而言,矛盾或无关的证据往往会加剧模型的过度自信问题。为解决此问题,我们提出NOVA规则(噪声感知言语置信度校准规则),为在噪声下解决过度自信提供原则性基础。我们进一步设计了NOVA,一个噪声感知校准框架,该框架通过由这些规则指导的约2K HotpotQA示例合成监督信号。通过使用此数据进行监督微调(SFT),NOVA使模型具备内在的噪声感知能力,而无需依赖更强的教师模型。实验结果表明,NOVA带来了显著收益,在域内和域外分别将ECE分数提高了10.9%和8.0%。通过弥合检索噪声与言语校准之间的差距,NOVA为构建既准确又认知可靠的LLM铺平了道路。

英文摘要

Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance especially when noisy contexts are retrieved. Specifically, contradictory or irrelevant evidence tends to exacerbate the model's overconfidence issue. To address this, we propose NOVA Rules (NOise-Aware Verbal Confidence CAlibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NOVA, a noise-aware calibration framework that synthesizes supervision from ~2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NOVA equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NOVA yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NOVA paves the way for both accurate and epistemically reliable LLMs.
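For readers unfamiliar with the metric, expected calibration error (ECE) is commonly computed by binning predictions by stated confidence and averaging the gap between confidence and accuracy in each bin; lower is better. The sketch below uses equal-width bins, which is one common convention. The abstract does not say which binning or bin count NOVA's evaluation uses, so `n_bins=10` is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |mean confidence - accuracy|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# An overconfident model: states 100% confidence but is right half the time.
print(expected_calibration_error([1.0] * 4, [True, True, False, False]))
```

With verbalised confidence, `confidences` would be the probabilities the model states in its answer, parsed into the range [0, 1].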

2601.19827 2026-06-12 cs.CL cs.AI cs.IR 版本更新

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

当迭代RAG优于理想证据:科学多跳问答中的诊断研究

Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee

发表机构 * Faculty of Engineering, McMaster University, Canada(麦克马斯特大学工程学院,加拿大) BASF Canada Inc., Canada(巴斯夫加拿大公司,加拿大)

AI总结 通过化学多跳问答数据集,诊断发现迭代检索-推理循环在科学领域显著优于静态RAG上限,揭示了阶段式检索的优势与失败模式。

Comments 51 pages, 29 figures

详情
AI中文摘要

检索增强生成(RAG)将大型语言模型(LLMs)扩展到参数化知识之外,但目前尚不清楚迭代检索-推理循环何时能有效超越静态RAG,尤其是在涉及多跳推理、稀疏领域知识和异构证据的科学领域。我们首次进行了受控的、机制层面的诊断研究,以探究同步迭代检索和推理能否超越理想化的静态上限(Gold Context)RAG。我们在三种设置下对十一个最先进的LLM进行了基准测试:(i)无上下文,衡量对参数化记忆的依赖;(ii)Gold Context,一次性提供所有真实证据;(iii)迭代RAG,一种无需训练的控制器,交替进行检索、假设细化和证据感知停止。使用以化学为中心的ChemKGMultiHopQA数据集,我们分离出需要真正检索的问题,并通过诊断分析行为,涵盖检索覆盖缺口、锚点携带下降、查询质量、组合保真度和控制校准。在所有模型中,迭代RAG始终优于Gold Context,增益高达25.6个百分点,尤其对于非推理微调模型。阶段式检索减少了后期跳失败,缓解了上下文过载,并实现了对早期假设漂移的动态修正,但剩余的失败模式包括跳覆盖不完整、干扰物锁定轨迹、过早停止校准错误以及即使检索完美时的高组合失败率。总体而言,阶段式检索通常比理想证据的单纯存在更具影响力;我们为在专门科学环境中部署和诊断RAG系统提供了实用指导,并为更可靠、可控的迭代检索-推理框架奠定了基础。

英文摘要

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.

2606.10716 2026-06-12 cs.CL cs.AI 版本更新

Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

注意力扩展:利用注意力增强的上下文嵌入提升长文档关键短语提取

Roberto Martínez-Cruz, Alvaro J. López-López, José Portela

发表机构 * Institute for Research in Technology, ICAI School of Engineering, Comillas Pontifical University(技术研究所,ICAI工程学院,科米利亚斯宗座大学) DD-AIM, Senior Machine Learning Researcher(DD-AIM,高级机器学习研究员)

AI总结 提出注意力扩展机制,通过预训练词嵌入增强PLM的上下文表示,在不增加计算成本的情况下扩展有效上下文范围,显著提升长文档关键短语提取性能。

详情
AI中文摘要

预训练语言模型(PLM)在关键短语提取(KPE)中取得了强劲性能,主要得益于其生成丰富上下文表示的能力。然而,长文档KPE仍然具有挑战性,因为显著的关键短语证据可能分散在遥远的文档部分,而这些部分无法在大多数PLM有限的上下文窗口内被联合捕获。尽管长上下文大语言模型(LLM)可以处理更广泛的文本上下文,但其计算成本限制了它们在高效和高通量KPE中的实用性。为了克服这一限制,我们提出了一种注意力扩展机制,该机制利用预训练词嵌入,用周围超出上下文的块中的信息来增强PLM的令牌表示。所提出的机制扩展了基于PLM的KPE模型的有效上下文范围,而无需全文档注意力或昂贵的基于LLM的推理。我们在五个PLM骨干网络上评估了我们的方法,包括通用、科学、任务特定和长上下文编码器,使用了两种训练机制和来自科学和新闻领域的五个基准语料库。实验结果表明,注意力扩展在所有评估设置中一致地提升了KPE性能,超越了最先进的模型,并在F1分数上取得了显著改进。这些改进扩展到领域特定、任务专门化和原生长上下文模型,表明所提出的机制提供了互补信息,而不仅仅是补偿有限的输入长度。这些结果确立了注意力扩展作为长文档KPE的一种高效且有效的策略。

英文摘要

Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.

2606.07218 2026-06-12 cs.IR cs.CL 版本更新

HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG

HKVM-RAG:用于多跳RAG的键值分离超图证据组织

Mingyu Zhang, Ying Ma

发表机构 * Faculty of Computing, Harbin Institute of Technology(哈尔滨工业大学计算机学院) School of Computer and Information Engineering, Henan University(河南大学计算机与信息工程学院)

AI总结 提出HKVM-RAG,一种键值分离的证据组织层,通过超图键值检索改进多跳RAG的证据链暴露,在三个基准上提升F1分数。

Comments Submitted to ICDE 2027. 13 pages, 3 figures

详情
AI中文摘要

多跳RAG提出了一个超越段落匹配的数据工程问题:在固定检索预算下,系统必须将检索到的文本组织成能够暴露答案链的证据单元。密集检索器独立评分段落,而基于图的记忆使关联显式化,但通常依赖于成对或实体中心的键,这些键会碎片化多跳证据。我们提出HKVM-RAG,一个键值分离的证据组织层。它从缓存的段落级LLM证据元组中组装答案路径超边,并将其用作检索键,同时保留段落文本作为答案值。为了隔离键空间设计,我们的固定基底协议在成对图和超图变体中保持元组缓存、候选段落、阅读器和评估预算不变。加权超图键值检索在2WikiMultiHopQA上比KG-PPR提高+3.426 F1,在MuSiQue上提高+3.592 F1;HotpotQA显示更高的结构化支持覆盖率不一定带来独立的答案F1增益。因此,我们将WHG-KV视为一种证据控制信号,而非密集检索的替代。Oracle和训练到开发分析表明支持选择是可修复的,一个密集感知控制器使用冻结的ColBERTv2和HKVM排名/分数特征,结合折外HKVM预测。它在三个基准上分别达到88.846、65.073和85.810 F1,比ColBERTv2提高+11.084、+6.763和+5.966 F1。源级消融实验表明,匹配的非WHG结构化信号无法达到WHG-KV的增益。这些结果提供了有界证据,表明键值分离的超图组织可以作为多跳RAG的可重用证据控制机制。

英文摘要

Multi-hop RAG poses a data-engineering problem beyond passage matching: under fixed retrieval budgets, a system must organize retrieved text into evidence units that expose answer chains. Dense retrievers score passages independently, while graph-based memories make associations explicit but often rely on pairwise or entity-centered keys that fragment multi-hop evidence. We present HKVM-RAG, a key-value-separated evidence-organization layer. It assembles answer-path hyperedges from cached passage-level LLM evidence tuples and uses them as retrieval keys, while retaining passage text as answer values. To isolate key-space design, our fixed-substrate protocol holds the tuple cache, candidate passages, reader, and evaluation budget constant across pairwise graph and hypergraph variants. Weighted hypergraph key-value retrieval improves over KG-PPR by +3.426 F1 on 2WikiMultiHopQA and +3.592 F1 on MuSiQue; HotpotQA shows that higher structured support coverage need not yield standalone answer-F1 gains. We therefore study WHG-KV as an evidence-control signal rather than a dense-retrieval replacement. Oracle and train-to-dev analyses identify support selection as repairable, and a dense-aware controller combines frozen ColBERTv2 and HKVM rank/score features using out-of-fold HKVM predictions. It reaches 88.846, 65.073, and 85.810 F1 on the three benchmarks, improving over ColBERTv2 by +11.084, +6.763, and +5.966 F1. Source-level ablations show that matched non-WHG structured signals do not match the WHG-KV gains. These results provide bounded evidence that key-value-separated hypergraph organization can serve as a reusable evidence-control mechanism for multi-hop RAG.
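The key-value separation idea can be illustrated with a toy data structure: a hyperedge groups every entity on one candidate answer path and serves as the retrieval key, while the passages it points to are what the reader finally sees. Everything below is a hypothetical sketch. The `Hyperedge` class, the overlap-based scoring, and the budget handling are assumptions made for illustration and are not the weighted hypergraph retrieval evaluated in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperedge:
    entities: frozenset   # key: all entities on one candidate answer path
    passage_ids: tuple    # pointers to the passages (values) behind that path
    weight: float = 1.0


def retrieve(query_entities, hyperedges, passages, budget=2):
    """Rank hyperedge keys by weighted entity overlap, then return passage text."""
    query = set(query_entities)
    scored = sorted(
        hyperedges,
        key=lambda h: h.weight * len(h.entities & query),
        reverse=True,
    )
    picked = []
    for edge in scored:
        if not (edge.entities & query):
            continue  # this key shares nothing with the query
        for pid in edge.passage_ids:
            if pid not in picked and len(picked) < budget:
                picked.append(pid)
    return [passages[p] for p in picked]


passages = {
    "p1": "A directed Film X.",
    "p2": "A was born in City Y.",
    "p3": "B reviewed Film X.",
}
edges = [
    Hyperedge(frozenset({"Film X", "A", "City Y"}), ("p1", "p2")),
    Hyperedge(frozenset({"Film X", "B"}), ("p3",), weight=0.5),
]
print(retrieve({"Film X", "City Y"}, edges, passages))
```

A pairwise graph would store (Film X, A) and (A, City Y) as separate edges, so a two-hop question has to reassemble the chain at query time; a single hyperedge keeps the chain intact as one key, which is the contrast the abstract draws.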

4. 对话系统与智能体 12 篇

2606.12908 2026-06-12 cs.CL 新提交

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

SENTINEL: 用于训练工具使用语言模型智能体的失败驱动强化学习

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Qun Liu, Chen Luo, Jiri Gesi, Hanqing Lu, Yisi Sang, Manling Li, Jing Huang, Dakuo Wang

发表机构 * Northeastern University(东北大学) Independent Researcher(独立研究员) Northwestern University(西北大学)

AI总结 提出SENTINEL框架,通过将智能体失败转化为针对性训练任务,在Tau2-Bench Retail上提升Qwen3-4B模型Pass^1从66.4到74.9,优于通用合成任务上的强化学习。

AI中文摘要

语言模型智能体通过多轮工具使用在解决现实任务方面越来越有效。然而,训练可靠的工具使用智能体在实践中仍然具有挑战性。虽然强化学习提供了一种从智能体自身环境交互中改进智能体的在策略范式,但其有效性在很大程度上取决于训练任务分布。当任务在训练前固定时,任务分布可能越来越与策略不断发展的能力不匹配,导致许多轨迹被浪费在无信息的任务上。我们提出SENTINEL,一种失败驱动的强化学习框架,将求解器的轨迹失败转化为有针对性的训练任务。SENTINEL遵循控制器-提议者-求解器循环:控制器分析失败轨迹并总结重复出现的错误模式,提议者生成可执行的任务来强调这些弱点,求解器在针对性任务上接受训练。在Tau2-Bench Retail上使用Qwen3-4B-Thinking-2507,SENTINEL将Pass^1从66.4提高到74.9,并且在Pass^k指标上优于通用合成任务上的强化学习。这些结果表明,模型失败为改进工具使用语言模型智能体提供了有效且可扩展的针对性训练信号来源。

英文摘要

Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy's evolving capabilities, causing many rollouts to be spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver's rollout failures into targeted training tasks. SENTINEL follows a Controller-Proposer-Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass^1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass^k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.
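
The abstract reports Pass^k without defining it. Assuming it follows the convention commonly used with the Tau-bench family (the chance that all k independent trials of a task succeed, which differs from the more familiar pass@k), a minimal sketch of the per-task estimator could look like the following. The function names are illustrative and do not come from the paper.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k trials of one task ALL succeed,
    given `successes` successful runs out of `trials` total runs."""
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    # Fraction of k-subsets of the trials that contain only successes.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task_successes, trials: int, k: int) -> float:
    """Average the per-task estimate over a benchmark."""
    values = [pass_hat_k(c, trials, k) for c in per_task_successes]
    return sum(values) / len(values)
```

Under this reading, Pass^1 is the plain success rate, and the metric can only stay flat or fall as k grows, so it rewards consistency rather than best-of-k luck.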

2606.12984 2026-06-12 cs.CL 新提交

SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants

SkillChain: 为基于图像的电商AI助手闭环技能演化

Yimin Hu, Mengtao Xu, Hao Guo, Yuheng Song, Xiaoyong Zhu, Bo Zheng

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 提出SkillChain框架,通过技能创建、路由优化和主体精炼三阶段自动化技能生命周期,解决电商图像助手多意图混淆问题,显著提升响应质量和用户参与度。

AI中文摘要

基于图像的AI助手现已大规模部署在电商平台上,其中单张上传图像可能触发根本不同的用户意图:产品搜索、风格推荐、视觉百科或实用工具调用,每种意图都需要自己的响应格式、工具调用和领域知识。如果没有按意图的行为约束,基于LLM的系统会混淆这些异构模式,达不到领域质量标准,而意图空间的广度和动态性使得手动工程不可行。为解决这一问题,我们提出了SkillChain,它闭环了技能演化的生产反馈循环,通过三个阶段自动化技能生命周期:用于从任务规范和轨迹中引导启动的技能创建器、用于路由对齐的路由优化器,以及通过双路径LLM-Judge评估进行迭代技能主体精炼的主体精炼器。部署在生产规模的电商图像助手上,SkillChain显著提高了聚合响应质量,在结构合规性和内容质量上提升最大;为期一周的在线A/B实验进一步证实了用户参与度、内容消费和长期留存率的显著提升。

英文摘要

Image-based AI assistants are now deployed at production scale on e-commerce platforms, where a single uploaded image can trigger fundamentally different user intents: product search, style recommendation, visual encyclopedia, or utility tool calls, each demanding its own response format, tool invocation, and domain knowledge. Without per-intent behavioral constraints, LLM-based systems conflate these heterogeneous modes and fall short of domain quality standards, while the breadth and dynamism of the intent space render manual engineering infeasible. To address this, we present SkillChain, which closes the production feedback loop on Skill evolution, automating the lifecycle of Skills through three stages: Skill Creator for bootstrapping from task specs and trajectories, Route Optimizer for routing alignment, and Body Refiner for iterative Skill Body refinement via dual-path LLM-Judge evaluation. Deployed on a production-scale e-commerce image assistant, SkillChain substantially improves aggregate response quality, with the strongest gains on structural compliance and content quality; a one-week online A/B experiment further confirms significant gains in user engagement, content consumption, and long-term retention.

2606.13115 2026-06-12 cs.CL cs.AI 新提交

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

G-Long:面向高效长期对话代理的图增强记忆管理

Minjun Choi, Yoonjin Jang, Sangwon Youn, Youngjoong Ko

发表机构 * Sungkyunkwan University(成均馆大学)

AI总结 提出G-Long框架,利用微调小语言模型进行结构化三元组提取和关联检索,并引入注意力感知重要性评分机制,在降低计算开销的同时,在响应生成和记忆检索上达到最优性能。

Comments 22 pages, 8 figures, 14 tables

AI中文摘要

尽管大型语言模型(LLMs)推动了开放域对话系统的发展,但由于长上下文推理的固有限制以及处理大量原始文本的低效性,保持长期一致性仍然是一个挑战。现有方法通常依赖于非结构化记忆存储(容易导致信息丢失)或计算成本高昂的LLMs(导致高延迟)。为了解决这些限制,我们提出了G-Long,一个图增强框架,利用微调的小语言模型(sLM)进行结构化三元组提取和关联检索,显著降低了运营成本。此外,我们引入了新颖的注意力感知重要性评分机制,利用T5摘要器的内在交叉注意力信号来识别显著记忆。跨多个基准的大量实验表明,G-Long在响应生成和记忆检索方面均达到了最先进的性能,在MSC上响应质量提升高达9.8%,在LME上检索召回率提升高达40.8%,同时显著降低了计算开销。

英文摘要

While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency. To address these limitations, we propose G-Long, a graph-enhanced framework that utilizes a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, significantly reducing operational costs. Furthermore, we introduce a novel attention-aware importance scoring mechanism that leverages the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Extensive experiments across diverse benchmarks demonstrate that G-Long achieves state-of-the-art performance in both response generation and memory retrieval, yielding performance gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, while significantly reducing computational overhead.

2606.13142 2026-06-12 cs.CL 新提交

HyPE: Category-Aware Hypergraph Encoding with Persistent Edge Embeddings for Persona-Grounded Dialogue

HyPE:基于类别感知的超图编码与持久边嵌入用于人物角色对话

Sangwon Youn, Yoonjin Jang, Youngjoong Ko

发表机构 * Sungkyunkwan University(成均馆大学)

AI总结 提出HyPE框架,通过将人物角色文本解析为四元组并构建超图,利用HyperGCN和持久边嵌入(PEE)编码高阶关系,在PersonaChat上优于句子级池化基线。

Comments 11 pages, 2 figures, 4 tables

AI中文摘要

人物角色对话系统旨在生成与说话者角色一致的回复,但现有方法将角色视为一组扁平句子,未能建模角色属性间的高阶关系——例如,多个角色句子共享一个主题类别。我们提出HyPE(超图角色编码器)框架,该框架(i)将每个承载角色的文本分析为(核心、表达、情感、类别)四元组,以及(ii)将角色元素组织成一个超图,其超边由共享类别标签诱导。HyperGCN超图神经网络将此结构传播为角色摘要向量和软记忆库,以条件化回复生成器。我们进一步提出持久边嵌入(PEE),即轻量级的每类别可学习先验,融合到HyperGCN的消息传递步骤中。在贪婪解码下的PersonaChat上,HyPE在GPT-2、LLaMA-3.2-3B和Qwen2.5-3B骨干网络上一致优于句子级池化基线,表明结构化的超边级角色编码在不同模型规模上提供了可迁移的优势。

英文摘要

Persona-grounded dialogue systems aim to produce responses consistent with a speaker's persona, yet existing methods treat personas as a flat set of sentences and fail to model the high-order relations among persona attributes, e.g., that several persona sentences share a topical category. We propose HyPE (Hypergraph Persona Encoder), a framework that (i) analyzes each persona-bearing text as a (Core, Expression, Sentiment, Category) quadruple, and (ii) organizes persona elements into a hypergraph whose hyperedges are induced by shared category labels. A HyperGCN hypergraph neural network propagates this structure into a persona summary vector and a soft-memory bank that condition the response generator. We further propose Persistent Edge Embeddings (PEE), lightweight per-category learnable priors fused into the HyperGCN message-passing step. On PersonaChat under greedy decoding, HyPE consistently outperforms sentence-level pooling baselines across GPT-2, LLaMA-3.2-3B, and Qwen2.5-3B backbones, demonstrating that structured hyperedge-level persona encoding provides a transferable advantage across model scales.

2606.13177 2026-06-12 cs.CL cs.AI cs.LG 新提交

MemRefine: LLM-Guided Compression for Long-Term Agent Memory

MemRefine: 基于LLM引导的压缩用于长期智能体记忆

Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

发表机构 * Korea University(韩国大学) KAIST(韩国科学技术院)

AI总结 提出MemRefine框架,利用LLM判断事实内容,通过删除、合并和保留操作将记忆库压缩到固定预算内,在多个基准上保持下游性能并优于基于规则的基线。

AI中文摘要

大型语言模型(LLM)智能体越来越需要在长期交互中运行,其中过去对话中的信息必须被保留和回忆以支持未来任务。然而,随着交互的积累,记忆存储无限制增长,并充满冗余条目,这些条目增加了存储成本,并通过排挤最有用的证据而降低了检索质量。此外,在具有硬性内存预算的资源受限平台上,这尤其受限,促使我们制定了有存储预算的记忆管理任务,即在固定预算内保持已构建的记忆库,同时保留对未来交互有用的信息。为此,我们提出了MemRefine,一个基于LLM引导的框架,由于表面相似性不能很好地反映事实价值,该框架仅使用相似性来提出候选对,并将删除、合并和保留决策推迟给基于事实内容的LLM判断,迭代直到满足预算。在多个记忆框架和长期对话基准上,MemRefine始终满足目标预算,同时保持下游性能,并在紧预算下优于基于规则的基线。

英文摘要

Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries that inflate storage cost and degrade retrieval by crowding out the most useful evidence. Furthermore, this is especially limiting on resource-constrained platforms with hard memory budgets, motivating us to formulate storage-budgeted memory management, the task of keeping an already constructed memory store within a fixed budget while preserving information useful for future interactions. To this end, we propose MemRefine, an LLM-guided framework that, since surface similarity poorly reflects factual value, uses similarity only to propose candidate pairs and defers delete, merge, and preserve decisions to an LLM judge based on factual content, iterating until the budget is met. Across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets while preserving downstream performance and outperforming rule-based baselines under tight budgets.
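
The propose-then-judge loop described above might be sketched as follows. Several details are assumptions made for the example: the budget is counted in entries (the paper's budget may be measured differently), `difflib` string similarity stands in for whatever similarity MemRefine uses, and `toy_judge` is a hand-written placeholder for the LLM judge.

```python
from difflib import SequenceMatcher
from itertools import combinations


def compress(entries, budget, judge):
    """Shrink a memory store to at most `budget` entries.

    Similarity only nominates the next candidate pair; the judge decides
    whether to delete, merge, or preserve it.
    """
    entries = list(entries)
    preserved = set()  # pairs the judge already chose to keep apart
    while len(entries) > budget:
        pairs = [p for p in combinations(entries, 2) if p not in preserved]
        if not pairs:
            break  # assumed fallback: the judge preserved every pair
        a, b = max(pairs, key=lambda p: SequenceMatcher(None, p[0], p[1]).ratio())
        action, merged = judge(a, b)
        if action == "delete":      # b adds nothing beyond a
            entries.remove(b)
        elif action == "merge":     # replace both with one combined entry
            entries.remove(a)
            entries.remove(b)
            entries.append(merged)
        else:                       # "preserve": never propose this pair again
            preserved.add((a, b))
    return entries


def toy_judge(a, b):
    """Placeholder for an LLM judge, using crude string rules."""
    if a == b:
        return "delete", None
    if a.split()[:2] == b.split()[:2]:
        return "merge", max(a, b, key=len)
    return "preserve", None
```

Recording preserved pairs keeps the loop from re-proposing the same candidates forever, at the cost of possibly ending above budget when the judge refuses every remaining pair.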

2606.13317 2026-06-12 cs.CL 新提交

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

SkillCAT: 面向LLM智能体的对比评估与拓扑感知技能自进化

Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) School of Computer Science, Fudan University(复旦大学计算机学院)

AI总结 提出SkillCAT框架,通过对比因果提取、评估增强进化和拓扑感知任务执行三阶段,实现无需训练的LLM智能体技能自进化,在多个基准上平均提升高达40.40%。

Comments 9 pages, 6 figures

AI中文摘要

LLM智能体的技能自进化方法旨在将执行轨迹转化为可复用的技能文档,但当前流程通常每个任务只学习一条轨迹,在检查前合并候选技能补丁,并在推理前加载完整技能语料库。我们提出SkillCAT,一个无需训练的框架,将该过程分为三个阶段。对比因果提取(CCE)为每个任务采样多条轨迹,并比较同任务的成功/失败对,以识别解释结果差异的证据。评估增强进化(AAE)在源任务克隆上回放每个候选补丁,并在层次化技能补丁合并前仅保留改善或保持任务结果的补丁。拓扑感知任务执行(TTE)将进化后的技能编译成可路由的子技能拓扑,因此推理仅加载与任务相关的能力节点。我们在常见智能体基准上评估SkillCAT,包括SpreadsheetBench、WikiTableQuestions和DocVQA,并进一步测试跨模型和分布外泛化。在这些设置中,SkillCAT将基线平均得分提升高达40.40%,展示了无需模型训练的可靠技能进化。

英文摘要

Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.

2606.13643 2026-06-12 cs.CL 新提交

Recursive Agent Harnesses

递归智能体框架

Elias Lumer, Sahil Sen, Kevin Paul, Vamse Kumar Subbiah

发表机构 * PricewaterhouseCoopers, U.S.(普华永道(美国))

AI总结 提出递归智能体框架(RAH),通过代码优先的框架递归扩展模型递归,在长上下文推理中显著提升编码智能体性能。

AI中文摘要

递归语言模型(RLM)表明,模型调用的递归是长上下文推理的有效策略,而生产级编码智能体已开始编写大规模生成子智能体的代码,最近如Anthropic的动态工作流。我们命名并研究了这两条工作线之间的模式,其中递归单元是一个完整的智能体框架,包含文件系统工具、代码执行和规划,而不是没有工具的模型调用。我们将其称为递归智能体框架(RAH),并将其视为框架递归,即RLM模型递归的代码优先扩展。父智能体生成并执行一个可执行脚本,该脚本并行生成子智能体框架以处理细粒度工作负载,并使用结构化函数调用处理小子任务。我们在长上下文推理上提供了受控评估。在固定主干为GPT-5以匹配已发布的Codex和RLM基线的情况下,RAH在Oolong-Synthetic(199个样本,13个上下文长度桶,最高4M令牌)上将Codex编码智能体基线从71.75%提高到81.36%,这一增益归因于框架而非模型。使用更强的骨干Claude Sonnet 4.5,同一设计达到89.77%。

英文摘要

Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic's dynamic workflows. We name and study the pattern between these two lines of work, where the recursive unit is a full agent harness with filesystem tools, code execution, and planning rather than a model call with no tools. We call this the Recursive Agent Harness (RAH) and frame it as harness recursion, the code-first extension to the model recursion of RLMs. A parent agent generates and runs an executable script that spawns subagent harnesses in parallel for fine-grained workloads and uses structured function calls for small subtasks. We provide a controlled evaluation on long-context reasoning. With the backbone held fixed at GPT-5 to match the published Codex and RLM baselines, RAH improves the Codex coding-agent baseline from 71.75% to 81.36% on Oolong-Synthetic (199 samples, 13 context-length buckets up to 4M tokens), a gain attributable to the harness rather than the model. With a stronger backbone, Claude Sonnet 4.5, the same design reaches 89.77%.
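
The fan-out step, in which a parent script spawns sub-agents in parallel and aggregates their results, can be pictured with a toy example. This is a heavily simplified, one-level sketch: `sub_agent` is a stand-in that only counts lines, whereas a real harness would wrap a model with filesystem tools, code execution, and planning. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def sub_agent(chunk, label):
    """Stand-in for a sub-agent harness working on one slice of the context."""
    return sum(1 for line in chunk if line.endswith(label))


def parent(lines, label, chunk_size=2, workers=4):
    """Split a long context, run sub-agents in parallel, aggregate the answers."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda chunk: sub_agent(chunk, label), chunks))
    return sum(partials)
```

Each sub-agent sees only its own slice, so no single call has to hold the full context, which is the property that long-context aggregation tasks exercise.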

2606.13663 2026-06-12 cs.CL 新提交

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

HyperTool:超越逐步工具调用的工具增强型智能体

Yaxin Du, Yifan Zhou, Yujie Ge, Jiajun Wang, Xianghe Pang, Shuo Tang, Tuney Zheng, Bryan Dai, Jian Yang, Siheng Chen

发表机构 * Shanghai Jiao Tong University(上海交通大学) IQuest Research Beijing University of Aeronautics and Astronautics(北京航空航天大学)

AI总结 针对工具增强型LLM中逐步调用导致执行粒度不匹配的问题,提出HyperTool统一可执行接口,将确定性工具子流程折叠为单次调用,在多步工具任务上显著提升准确率。

AI中文摘要

工具增强型LLM智能体通常依赖逐步的原子工具调用,其中每次调用、观察和值传递都暴露在主推理轨迹中。这造成了执行粒度不匹配:局部确定性的工具工作流被展开为重复的模型可见决策,消耗上下文并迫使模型管理轨迹中的低级数据流。我们引入HyperTool,一个统一的可执行MCP风格工具接口,改变了模型可见的工具执行单元。模型调用HyperTool时使用一个代码块,该代码块可以通过原始模式调用现有工具、操作返回值并在本地传递中间结果,将确定性工具子程序折叠为单个外部调用。为了训练模型使用此接口,我们从跨工具组合任务中合成HyperTool格式的轨迹,并在真实MCP环境中验证。在MCP-Universe上,HyperTool将Qwen3-32B的平均准确率从15.69%提升至35.29%,Qwen3-8B从9.93%提升至33.33%,并在平均准确率上超越GPT-OSS和Kimi-k2.5,表明我们的HyperTool能显著改进多步工具使用。

英文摘要

Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally deterministic tool workflows are unfolded into repeated model-visible decisions, consuming context and forcing the model to manage low-level dataflow in the trace. We introduce HyperTool, a unified executable MCP-style tool interface that changes the model-visible unit of tool execution. A model invokes HyperTool with a code block that can call existing tools through their original schemas, manipulate returned values, and pass intermediate results locally, folding deterministic tool subroutines into a single outer call. To train models to use this interface, we synthesize HyperTool-format trajectories from cross-tool compositional tasks and verify them in real MCP environments. On MCP-Universe, HyperTool improves average accuracy from 15.69% to 35.29% on Qwen3-32B and from 9.93% to 33.33% on Qwen3-8B, and surpasses GPT-OSS and Kimi-k2.5 on average accuracy, showing that HyperTool can substantially improve multi-step tool use.
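
To make the idea of folding several tool calls into one outer call concrete, here is a minimal sketch. The `hypertool` function, the `tools` dictionary, and the convention that the block stores its answer in `result` are all assumptions for illustration; the abstract does not describe the actual interface, and a real deployment would need sandboxing rather than a bare `exec`.

```python
def hypertool(code, tools):
    """Run one model-written code block; only `result` re-enters the trace.

    WARNING: bare exec is unsafe for untrusted code. It is used here only
    to keep the sketch short.
    """
    scope = {"tools": tools}
    exec(code, scope)
    return scope.get("result")


# Hypothetical atomic tools, stubbed with fixed data.
TOOLS = {
    "search": lambda query: [{"id": 1, "price": 30}, {"id": 2, "price": 12}],
    "get_stock": lambda item_id: {1: 0, 2: 5}[item_id],
}

# What a model might emit: three tool calls and the glue logic in one block.
BLOCK = """
items = tools["search"]("mug")
in_stock = [i for i in items if tools["get_stock"](i["id"]) > 0]
result = min(in_stock, key=lambda i: i["price"])["id"]
"""
```

With step-wise calls, the search results and each stock lookup would appear in the reasoning trace as separate observations; here the intermediate values stay local to the block and the trace receives a single value.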

2606.12780 2026-06-12 cs.LG cs.CL 交叉投稿

ProPlay: Procedural World Models for Self-Evolving LLM Agents

ProPlay: 用于自我进化LLM智能体的程序化世界模型

Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li, Xiaoguang Guo, Weixiang Sun, Chuxu Zhang, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) University of Connecticut(康涅狄格大学)

AI总结 提出ProPlay程序化世界模型,通过程序级预演和因果过程图,使LLM智能体在部分可观测环境中自我进化,无需外部监督。

AI中文摘要

自我进化智能体应能在无外部监督下通过交互改进,但在部分可观测环境中仍困难,智能体必须主动探索、从有限反馈中学习,并决定何时信任先前经验。现有的LLM智能体方法通常依赖记忆或规划模块,但很少在它们之间闭环以持续完善对环境动态的内部理解。我们提出ProPlay,一种程序化世界模型,支持程序级预演,智能体可利用学到的世界知识排练未来的程序路径。ProPlay不将经验表示为孤立的规则或低层动作约束,而是将成功轨迹抽象为程序,并在捕获任务阶段间因果转换的程序图中组织它们。每个转换与一个可靠性记录嵌入相关联,以从过去结果中估计其任务特定贡献。在每个回合前,ProPlay在已知图结构上模拟未来程序轨迹作为结构化软指导;执行后,它利用环境反馈精炼图。在公开基准上的实验表明,ProPlay在环境理解和自我进化能力上持续优于强基线。我们的代码已发布于 https://github.com/antman9914/proplay 。

英文摘要

Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code is available at https://github.com/antman9914/proplay.

2606.13174 2026-06-12 cs.LG cs.CL 交叉投稿

Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

与你合作得更好:将用户修正编译为编码代理的运行时强制

Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang, Yue Huang, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Nuno Moniz, Nitesh V. Chawla, Xiangliang Zhang

发表机构 * University of Notre Dame(圣母大学) IBM Research(IBM研究院) Tencent AI Lab(腾讯AI实验室)

AI总结 提出TRACE方法,通过将用户修正编译为原子规则并在运行时强制执行,显著减少编码代理在后续任务中的偏好违反,优于纯记忆方法。

AI中文摘要

交互式LLM代理正成为日常工作的组成部分,但它们并不会随着时间的推移而变得更易于合作:在一个会话中记住的修正可能在下一个会话中仍被违反。我们研究了偏好访问与偏好遵从之间的差距。在源自匿名真实用户摩擦案例的任务中,Mem0记忆仍然导致57.5%的适用偏好检查被违反。我们引入了测试时规则获取与编译强制(TRACE),这是一个用于编码代理运行时的即插即用技能层管道,它挖掘用户修正,将其重写为原子规则,并编译为运行时检查,这些检查必须在代理完成未来任务之前通过。与开发者提前编写的运行时检查不同,TRACE技能来自用户自己的聊天修正。我们通过在ClawArena编码代理任务和MemoryArena衍生的内存密集型任务上进行模拟用户参与实验来评估TRACE。在ClawArena上,TRACE将分布内任务的保留偏好违反从100.0%降低到37.6%,将分布外任务从100.0%降低到2.0%。在MemoryArena衍生的任务上,TRACE将分布内违反从100.0%降低到60.5%,同时在任务通过率上匹配或超过最强的记忆基线。这些结果表明,将修正编译为运行时强制可以解决记忆单独无法可靠解决的重复摩擦失败模式,减少用户在未来会话中重复相同修正的需求。实验代码可在 https://github.com/YujunZhou/TRACE_exp 获取,可部署的技能可在 https://github.com/YujunZhou/tellonce 获取。

英文摘要

Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.
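
The contrast between remembering a correction and enforcing it can be sketched in a few lines. This example assumes each atomic rule can be expressed as a forbidden regular expression over the agent's output, which is a simplification chosen for illustration; the abstract does not say what form TRACE's compiled checks take, and `compile_rule` and `enforce` are hypothetical names.

```python
import re


def compile_rule(name, forbidden_pattern):
    """Turn one atomic rule into a check that returns None or a violation message."""
    regex = re.compile(forbidden_pattern)

    def check(output):
        return f"violates rule: {name}" if regex.search(output) else None

    return check


def enforce(output, checks):
    """Run every compiled check; the task may complete only if the list is empty."""
    return [message for check in checks if (message := check(output))]


# Rules as they might be mined from chat corrections (invented examples).
CHECKS = [
    compile_rule("no print debugging", r"\bprint\("),
    compile_rule("no tab indentation", r"\t"),
]
```

A memory entry saying "the user dislikes print debugging" depends on the model choosing to act on it, whereas a failed check blocks completion regardless of what the model recalls.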

2606.13598 2026-06-12 cs.AI cs.CL cs.LG cs.MA 交叉投稿

Reward Modeling for Multi-Agent Orchestration

多智能体编排的奖励建模

King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Semih Yavuz, Shafiq Joty, Hao Wang

发表机构 * Rutgers University(罗杰斯大学) Salesforce AI Research(Salesforce人工智能研究)

AI总结 提出OrchRM框架,通过自监督学习从多智能体执行中间产物构建奖励模型,无需人工标注,实现高效编排器训练和测试时扩展,在多个领域提升性能并降低计算成本。

Comments Preprint; work in progress

AI中文摘要

基于大型语言模型(LLM)的多智能体系统(MAS)需要有效的编排来协调专门化的智能体,然而训练这样的编排器受到有限监督和高计算成本的阻碍。我们提出了编排奖励建模(OrchRM),一种无需人工标注即可评估编排质量的自监督框架。OrchRM利用多智能体执行过程中的中间产物来构建Bradley-Terry奖励模型训练的胜负对。与现有的依赖昂贵子智能体展开的MAS测试时扩展和编排器训练框架不同,OrchRM直接在编排层面操作,实现了高效且高性能的奖励引导编排器训练和MAS测试时扩展。OrchRM在token使用上提高了高达10倍的训练效率,同时将MAS测试时扩展的准确率提升了高达8%。这些增益在多个领域(包括数学推理、基于网络的问答和多跳推理)中一致迁移,证明了编排级奖励建模作为鲁棒多智能体编排的可扩展方向。代码将在 https://github.com/Wang-ML-Lab/OrchRM 提供。

英文摘要

Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.
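
For readers unfamiliar with Bradley-Terry reward modeling, the standard objective on win-lose pairs is short enough to write out. The sketch below works on plain scalar scores and shows only the generic loss; how OrchRM builds its pairs or parameterizes its reward model is not covered by the abstract, and the function names are illustrative.

```python
import math


def bt_win_probability(r_win, r_lose):
    """Bradley-Terry probability that the 'win' item is preferred,
    given scalar reward scores for both items."""
    return 1.0 / (1.0 + math.exp(-(r_win - r_lose)))


def bt_loss(pairs):
    """Mean negative log-likelihood over (r_win, r_lose) score pairs.

    Uses log1p(exp(-d)), which equals -log(sigmoid(d)).
    """
    return sum(math.log1p(math.exp(-(w - l))) for w, l in pairs) / len(pairs)
```

The loss depends only on the score difference, so training pushes the reward of the winning orchestration above the losing one without fixing an absolute scale.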

2604.10389 2026-06-12 cs.CL 版本更新

BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

BLUEmed: 基于检索增强的多智能体辩论用于临床错误检测

Saukun Thika You, Nguyen Anh Khoa Tran, Wesley K. Marizane, Hanshu Rao, Qiunan Zhang, Xiaolei Huang

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出BLUEmed框架,结合混合检索增强生成与多智能体辩论,通过分解临床笔记、检索证据、专家辩论及安全层过滤,在术语替换错误检测中达到最优性能。

Comments Accepted to the IEEE International Conference on Healthcare Informatics (ICHI) 2026

AI中文摘要

临床笔记中的术语替换错误(即一个医学术语被一个语言上有效但临床不同的术语替换)对医疗保健中的自动错误检测构成了持续挑战。我们引入了BLUEmed,一个多智能体辩论框架,增强有混合检索增强生成(RAG),该框架结合了基于证据的推理和多视角验证用于临床错误检测。BLUEmed将每个临床笔记分解为聚焦的子查询,通过密集、稀疏和在线检索检索来源分区的证据,并分配两个具有不同知识库的领域专家智能体以产生独立分析;当专家意见不一致时,一轮结构化的反论证和跨来源裁决解决冲突,随后是一个级联安全层,过滤常见的假阳性模式。我们在一个临床术语替换检测基准上评估BLUEmed,在零样本和少样本提示下,使用多个骨干模型(涵盖专有和开源系列)。实验结果表明,在少样本提示下,BLUEmed达到了最佳准确率(69.13%)、ROC-AUC(74.45%)和PR-AUC(72.44%),优于单智能体RAG和仅辩论基线。跨六个骨干模型和两种提示策略的进一步分析证实,检索增强和结构化辩论是互补的,并且该框架从具有足够指令遵循和临床语言理解的模型中受益最大。

英文摘要

Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.

5. 文本生成、摘要与编辑 5 篇

2606.12599 2026-06-12 cs.CL 新提交

Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation

通过波斯谚语条件故事生成实现LLM中的约束语义解压缩

Zahra Habibzadeh, Paria Khoshtab, Amir Mesbah, Yadollah Yaghoobzadeh

AI总结 提出约束语义解压缩任务,通过波斯谚语条件故事生成测试大语言模型的抽象到实现能力,构建PAND数据集,发现解压缩差距,并表明显式推理和迭代细化可部分缓解。

AI中文摘要

将一个密集、抽象的谚语转化为引人入胜且道德忠实的故事需要深厚的文化理解和稳健的语义基础。我们将此问题定义为约束语义解压缩任务,并研究谚语条件故事生成作为大语言模型中抽象到实现的测试平台。聚焦波斯语,我们引入了谚语对齐叙事数据集(PAND),将谚语与人类编写的故事和显式含义配对。通过结合人类校准的LLM-as-a-Judge与结构度量的混合评估框架,我们分析了多种提示机制下的模型行为。我们的发现揭示了一个持续存在的解压缩差距:当前的LLM通常实现强大的表面流畅性,但未能忠实地实例化谚语中编码的潜在道德和因果结构。我们进一步表明,显式推理和迭代细化可以部分缓解这些失败,这表明许多解压缩错误源于将抽象含义转化为叙事形式的困难,而非完全缺乏相关知识。我们提出的任务自然扩展到其他形式的压缩文化知识。

英文摘要

Transforming a dense, abstract proverb into an engaging and morally faithful narrative requires deep cultural understanding and robust semantic grounding. We frame this problem as a constrained semantic decompression task and study proverb-conditioned story generation as a testbed for abstraction-to-realization in large language models (LLMs). Focusing on Persian, we introduce the Proverb Aligned Narrative Dataset (PAND), pairing proverbs with human-written stories and explicit meanings. Using a hybrid evaluation framework that combines human-calibrated LLM-as-a-Judge with structural metrics, we analyze model behavior across multiple prompting regimes. Our findings reveal a persistent decompression gap: current LLMs often achieve strong surface-level fluency while failing to faithfully instantiate the underlying moral and causal structure encoded in proverbs. We further show that explicit reasoning and iterative refinement can partially mitigate these failures, suggesting that many decompression errors arise from difficulties in translating abstract meaning into narrative form rather than a complete lack of relevant knowledge. Our proposed task naturally extends to other forms of compressed cultural knowledge.

2606.12807 2026-06-12 cs.CL 新提交

Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts

检测、重掩、修复:面向动态上下文忠实摘要的扩散编辑

Hao Zou, Zachary Horvitz, Chandhru Karthick, Zhou Yu, Kathleen McKeown

发表机构 * Columbia University(哥伦比亚大学)

AI总结 提出DETECT-REMASK-REPAIR框架,利用掩码扩散语言模型识别并修复摘要中过时内容,在保持支持内容的同时实现局部忠实性修复,并引入StreamSum基准评估。

AI中文摘要

现实世界事件的摘要可能随着上下文演变和新信息的到来而过时。常见的做法是从更新后的上下文生成新摘要,但完全重新生成会丢弃之前的草稿,可能掩盖变化,并且当只有少数声明不支持时可能不必要。我们研究局部忠实性修复:在保留支持内容的同时更新现有摘要中的过时片段。我们提出DETECT-REMASK-REPAIR,一个基于扩散的框架,通过掩码扩散语言模型识别、重新掩码并修复过时区域。为了评估动态上下文摘要,我们引入了StreamSum,一个合成事件时间线的基准。在DialogSum和StreamSum上的实验表明,局部扩散修复提供了一种可控的替代完全重写的方法:忠实性导向的修复改进了早期草稿,一步修复将修复成本降低到半秒以下,该框架实现了跨数据集的忠实性-速度-保留权衡。我们还发现该框架可以作为事后修正步骤,提高自回归系统的忠实性。

英文摘要

Summaries of real-world events can become outdated as contexts evolve and new information arrives. A common response is to generate a new summary from the updated context, but full regeneration discards the previous draft, can obscure what changed, and may be unnecessary when only a few claims are unsupported. We study localized faithfulness repair: updating outdated spans in an existing summary while preserving supported content. We propose DETECT-REMASK-REPAIR, a diffusion-based framework that identifies, remasks, and repairs outdated regions with masked diffusion language models. To evaluate evolving-context summarization, we introduce StreamSum, a benchmark of synthetic event timelines. Experiments on DialogSum and StreamSum show that localized diffusion repair provides a controllable alternative to full rewriting: faithfulness-steered repair improves early drafts, one-step repair reduces repair cost to under half a second, and the framework enables faithfulness-speed-preservation tradeoffs across datasets. We also find that the framework can provide a post-hoc correction step that improves faithfulness for autoregressive systems.

2606.13171 2026-06-12 cs.CL cs.AI 新提交

NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning

NTS-CoT: 基于思维链推理减轻大模型新闻时间线摘要中的幻觉

Feng Lyu, Huiqin Yan, Sijing Duan, Hao Wu, Shuang Gu, Xue Qiao, Weixu Zhang, Haolun Wu

发表机构 * Central South University(中南大学) Tsinghua University(清华大学) Nanjing University(南京大学) Suzhou Aerospace Information Research Institute(苏州空天信息研究院) McGill University(麦吉尔大学)

AI总结 针对大模型在新闻时间线摘要中产生内容不忠实和信息遗漏两类幻觉,提出NTS-CoT框架,通过元素思维链、日期选择和因果思维链三个模块有效缓解幻觉,在三个基准上超越现有方法。

AI中文摘要

在线新闻的快速更新使得追踪事件发展具有挑战性,凸显了时间线摘要(TLS)的需求。幻觉(即大模型生成内容偏离源新闻)仍然是基于大模型的TLS中的关键问题,且现有研究对此关注不足。为弥补这一差距,我们识别出两类主要幻觉:新闻摘要中的不忠实内容和日期事件摘要中的信息遗漏。然后,我们提出NTS-CoT,一种利用思维链(CoT)推理来减轻TLS中幻觉的新框架。该框架包含三个关键模块:i) Element-CoT,用于捕获关键新闻元素以实现忠实摘要;ii) Date Selection,结合时间显著性和事件突出性进行时间戳选择;iii) Causal-CoT,用于推断因果关系并减少日期事件摘要中的遗漏。大量实验,包括在三个TLS基准上的定量分析和人工评估,表明NTS-CoT优于最先进的基线,有效减轻了幻觉并提升了基于大模型的TLS性能。我们的源代码可在 https://anonymous.4open.science/r/NTS-CoT 获取。

英文摘要

The rapid updates of online news make tracking event developments challenging, highlighting the need for timeline summarization (TLS). Hallucinations, where LLM-generated content deviates from source news, remain a critical issue in LLM-based TLS and are not well studied in existing works. To bridge this gap, we identify two primary types of hallucinations: unfaithful content during news summarization and information omission in date-event summarization. Then, we propose NTS-CoT, a novel framework that leverages Chain-of-Thought (CoT) reasoning to mitigate hallucinations in TLS. The framework consists of three key modules: i) Element-CoT to capture essential news elements for faithful summarization, ii) Date Selection to combine temporal saliency and event prominence for timestamp selection, and iii) Causal-CoT to infer causal relationships and reduce omissions in date-event summarization. Extensive experiments, including quantitative analysis on three TLS benchmarks and human evaluation, demonstrate that NTS-CoT outperforms state-of-the-art baselines, effectively mitigating hallucinations and improving LLM-based TLS performance. Our source code is available at https://anonymous.4open.science/r/NTS-CoT.

2606.13348 2026-06-12 cs.CL cs.AI 新提交

IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds

IVIE:一种用于增量且经过验证的交互式小说世界生成的神经符号方法

Micaela Vaucher, Santiago Silveira, Santiago Góngora, Luis Chiruzzo

发表机构 * Instituto de Computación, Facultad de Ingeniería, Universidad de la República(乌拉圭共和国大学工程学院计算机研究所)

AI总结 提出IVIE神经符号方法,结合LLM的创造力与符号验证的连贯性,通过四阶段增量生成管道构建可玩的交互式小说世界,人类评估显示其生成沉浸式、主题连贯的世界,平衡了灵活性与叙事一致性。

Comments 10 pages, 3 figures. To appear in the Proceedings of the 16th International Conference on Computational Creativity (ICCC'26), June 2026

AI中文摘要

交互式小说中的计算创造力面临一个基本矛盾:大型语言模型(LLM)可能产生创意叙事,但难以维持世界连贯性,而符号系统确保一致性但缺乏创意灵活性。我们提出IVIE(增量与验证的交互体验),一种从零开始生成完整且可玩的交互式小说世界的神经符号方法。基于PAYADOR的神经符号框架,IVIE实现了一个四阶段增量生成管道,将创意决策——设定与角色创建、谜题设计——委托给LLM,同时通过符号验证将世界状态接地。该系统生成具有相互关联的地点、功能性物品、非玩家角色和连贯谜题的世界,所有这些都围绕一个中心目标导向架构组织。人类评估表明,该方法生成了沉浸式、主题连贯的世界,具有高玩家参与度。结果似乎表明,神经符号方法成功平衡了灵活性与叙事连贯性:符号验证在不消除生成自由的情况下将LLM生成接地。然而,挑战依然存在:LLM的不一致性偶尔会绕过谜题约束,客观验证的空白允许一些结构上不可能的目标。我们为未来的神经符号交互式叙事系统确定了关键设计考虑因素,特别是关于LLM的能力及其局限性。

英文摘要

Computational creativity in Interactive Fiction faces a fundamental tension: Large Language Models (LLMs) may produce creative narratives but struggle with world coherence, while symbolic systems ensure consistency but lack creative flexibility. We present IVIE (Incremental & Validated Interactive Experiences), a neuro-symbolic approach to generating complete and playable interactive fiction worlds from scratch. Building upon PAYADOR's neuro-symbolic framework, IVIE implements a four-stage incremental generation pipeline that delegates creative decisions--setting and character creation, puzzle design--to LLMs while grounding the world state through symbolic validation. The system generates worlds with interconnected locations, functional items, non-player characters, and coherent puzzles, all structured around a central goal-oriented architecture. Human evaluation shows the approach generates immersive, thematically coherent worlds with high player engagement. Results seem to indicate that the neuro-symbolic approach successfully balances flexibility with narrative coherence: symbolic validation grounds LLM generation without eliminating generative freedom. However, challenges remain: LLM inconsistencies occasionally bypass puzzle constraints, and objective validation gaps allow some structurally impossible goals. We identify key design considerations for future neuro-symbolic interactive storytelling systems, particularly regarding LLM capabilities and their limitations.

2603.00025 2026-06-12 cs.CL 版本更新

TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation

TAB-PO:面向Token关键结构化生成的具有Token级自适应障碍的偏好优化

Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Sreeraj Ramachandran, Elyas Irankhah, Muhammad Arif, Ashley Hagaman, Sarah R. Lowe, Aimee Kendall Roundtree

发表机构 * Yale University(耶鲁大学) Texas State University(德克萨斯州立大学)

AI总结 针对结构化预测中偏好与拒绝对象仅少数token不同导致的梯度稀释和token侵蚀问题,提出基于混淆感知偏好构建和Token级自适应障碍的TAB-PO方法,在SciERC任务上显著提升关键指标。

详情
AI中文摘要

直接偏好优化(DPO)是一种有效且广泛采用的离线对齐方法,但难以适应本体驱动的结构化预测,其中被偏好和被拒绝的JSON对象通常仅在少数模式定义token上存在差异。在这种低编辑距离场景下,序列级DPO将梯度质量分散到非关键的序列化token上(梯度稀释),并可能降低罕见、低置信度的偏好模式token的似然(token侵蚀)。为解决这些限制,我们首先开发了一种混淆感知的偏好构建策略,该策略用从验证集SFT预测中估计的经验结构化错误模式来增强专家策划的歧义模式,合成最小扰动的、模式有效的负样本,将偏好学习聚焦于现实的本体级决策错误。然后,我们引入了Token自适应障碍偏好优化(TAB-PO),这是一种用于token关键结构化生成的SFT后目标。TAB-PO添加了一个置信门控的token级障碍,对低置信度的模式token施加监督锚定。在公开的SciERC科学信息抽取任务上,使用1.5B到70B的Llama/Qwen模型评估,TAB-PO在本体关键的语义标签和关系链接指标上平均比SFT提升11.59%,在这些指标上与最强的token级和序列级DPO变体的比较中全部胜出,并超越领先的前沿模型14.71%,同时在文本接地(textual grounding)方面取得了显著增益。

英文摘要

Direct Preference Optimization (DPO) is an effective and widely adopted approach for offline alignment but is poorly matched to ontology-driven structured prediction, where preferred and rejected JSON objects often differ in only a few schema-defining tokens. In this low-edit-distance regime, sequence-level DPO spreads gradient mass across non-critical serialization tokens (gradient dilution) and can reduce likelihood on rare, under-confident preferred schema tokens (token erosion). To address these limitations, we first develop a confusion-aware preference-construction strategy that augments expert-curated ambiguity patterns with empirical structured-error modes estimated from validation-set SFT predictions, synthesizing minimally perturbed, schema-valid negatives that focus preference learning on realistic ontology-level decision errors. We then introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), a post-SFT objective for token-critical structured generation. TAB-PO adds a confidence-gated token-level barrier that applies supervised anchoring to under-confident schema tokens. On the public SciERC scientific information extraction task, evaluated with Llama/Qwen models from 1.5B to 70B, TAB-PO improves ontology-critical semantic-label and relational-linking metrics over SFT by 11.59% on average, wins 100% of comparisons against the strongest token-level and sequence-level DPO variants on these metrics, and surpasses leading frontier models by 14.71%, while delivering strong gains in textual grounding.
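
A minimal sketch of how a confidence-gated token-level barrier could sit on top of a sequence-level DPO loss. The abstract does not state the exact objective, so the functional form of the barrier, the threshold `tau`, the weight `lam`, and every function name below are assumptions made for illustration, not the authors' implementation.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard sequence-level DPO loss from summed sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)); fall back to the linear asymptote for very negative margins
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin

def token_barrier(token_logps, is_schema_token, tau=0.5):
    """Hypothetical barrier: negative log-likelihood summed over schema tokens of
    the preferred output whose probability falls below tau (the confidence gate)."""
    penalty = 0.0
    for logp, critical in zip(token_logps, is_schema_token):
        if critical and math.exp(logp) < tau:
            penalty += -logp
    return penalty

def tab_po_like_loss(token_logps_chosen, is_schema_token, logp_rejected,
                     ref_chosen, ref_rejected, beta=0.1, lam=1.0, tau=0.5):
    """DPO term plus the gated barrier; lam is an assumed mixing weight."""
    logp_chosen = sum(token_logps_chosen)
    return (dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta)
            + lam * token_barrier(token_logps_chosen, is_schema_token, tau))
```

In this sketch the barrier vanishes once every schema token is predicted with probability at least `tau`, so confident outputs are trained by the DPO term alone, while under-confident schema tokens receive a direct supervised push.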

6. 语义、语法与语言学分析 5 篇

2606.12748 2026-06-12 cs.CL 新提交

Agent-based models for the evolution of morphological alternation patterns

基于智能体的形态交替模式演化模型

Aravinth Kulanthaivelu, Richard Sproat

AI总结 通过多智能体模拟,研究形态交替(如go/went)的涌现机制,发现无标度社交网络和随机采纳策略能产生更真实的形态模式。

Comments 51 + 37 pages. 31 Figures

详情
AI中文摘要

为什么英语中“go”的过去式是看似无关的“went”?这种交替在语言中很常见。它们既无助于交流也不利于学习,却能持续存在数百年或数千年。我们提出了一个多智能体模拟,用于研究形态词干和屈折交替的涌现。交替形式源于语音变化,或者像“go/went”一样,来自与部分人群相关的词汇替代。当一个智能体“听到”另一个智能体对某个词形位(例如go的过去式)使用新形式时,它们会以一定概率采纳该形式,并可能将其使用扩展到共享相同原始形式的其他词形位。因此,替代形式可以在人群中传播,并固化为词干或屈折标记的交替形式。与许多先前的计算研究不同,我们的系统允许自然主义的词汇形式、现实的语音规则、包含数百或数千条目的词典,以及数十或数百个智能体的人群。它支持多种网络拓扑、扩散模式和智能体采纳策略。这类模拟的一个问题是评估:与真实语言相比,产生的形态有多真实?我们引入了AI历史语言学家,这是一个新颖的大型语言模型驱动系统,模拟两位历史语言学家之间的辩论。我们用它来比较一组真实语言的形态、伪装形态和实验演化形态。结果表明,有利于产生更合理形态的因素包括无标度社交网络和随机伯努利形式采纳。我们还提出了三个案例研究,模拟了有记载的历史变化,使我们能够测试如果历史不同会发生什么。所有代码和数据均已发布。

英文摘要

Why is the past of English "go" the apparently unrelated "went"? Such alternations are frequent in languages. They aid neither communication nor learnability, yet they can be persistent, surviving over centuries or millennia. We present a multi-agent simulation of the emergence of morphological stem and inflection alternations. Alternate forms arise by phonological changes or, as with "go/went", from lexical alternatives associated with a subset of the population. When an agent 'hears' another agent use a novel form for a slot in the paradigm of a word (say, the past tense of go), they will with some probability adopt that form, possibly spreading its use to other slots in the paradigm that shared the same original form. Thus alternative forms can spread through the population and become entrenched as stem or inflectional marker alternants. Unlike many previous computational studies, our system allows for naturalistic lexical forms, realistic phonological rules, lexicons with hundreds or thousands of entries, and agent populations in the tens or hundreds. It supports several network topologies, diffusion patterns and agent adoption policies. One issue with such simulations is evaluation: how realistic is the resulting morphology compared to those of real languages? We introduce the AI Historical Linguist, a novel Large Language Model-driven system that models a debate between two historical linguists. We use this to compare a set of real language morphologies, disguised morphologies, and experimentally evolved morphologies. The results suggest that among the factors that favor more plausible morphologies are scale-free social networks and random Bernoulli adoption of forms. We also present three case studies modeling attested historical changes, allowing us to test what might have happened if history had been different. All code and data are released.
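
The adoption dynamic described above can be illustrated with a toy simulation. This is only a sketch under strong simplifying assumptions: one paradigm slot, two competing forms (the label `old_past` is a placeholder, not a historical claim), a degree-proportional attachment network standing in for a scale-free topology, and a listener who adopts any differing form with a fixed Bernoulli probability. All names and parameters are hypothetical and are not taken from the released code.

```python
import random

def preferential_attachment(n, m, rng):
    """Grow a graph in which each new node links to m existing nodes chosen
    with probability proportional to their degree."""
    edges = {i: set() for i in range(n)}
    # start from a fully connected core of m + 1 nodes
    for i in range(m + 1):
        for j in range(i):
            edges[i].add(j)
            edges[j].add(i)
    # every node appears in this list once per incident edge
    stubs = [i for i in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges[new].add(t)
            edges[t].add(new)
            stubs.extend([new, t])
    return edges

def simulate(edges, innovators, p_adopt, steps, rng):
    """Each step, a random listener hears one neighbour and, if the neighbour's
    form differs, adopts it with probability p_adopt (a Bernoulli trial)."""
    form = {a: "old_past" for a in edges}
    for a in innovators:
        form[a] = "went"
    agents = sorted(edges)
    for _ in range(steps):
        listener = rng.choice(agents)
        speaker = rng.choice(sorted(edges[listener]))
        if form[speaker] != form[listener] and rng.random() < p_adopt:
            form[listener] = form[speaker]
    return form
```

A typical run would call `preferential_attachment(100, 2, random.Random(0))`, seed a few innovators, and track the share of agents using `went` over time for different values of `p_adopt`.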

2606.13189 2026-06-12 cs.CL 新提交

SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection

SICI:一种揭示LLM立场检测中相变的语义-语用复杂度指数

Fuqiang Niu, Bowen Zhang

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China(中国科学技术大学网络空间安全学院) School of Artificial Intelligence, Shenzhen Technology University(深圳技术大学人工智能学院)

AI总结 提出SICI指数,从七维语义-语用复杂度诊断立场检测难度,揭示LLM错误随复杂度增加从过度归因到集中弃权的相变规律,且干预方法仅沿归因-弃权轴移动而非消除瓶颈。

详情
AI中文摘要

基于提示的LLM越来越多地用于立场检测,但更难的例子并不总是通过更清晰的指令、推理提示、检索或辩论来修复。我们提出了SICI(立场推理复杂度指数),这是一个七维诊断指标,用于衡量目标-文本对施加的语义-语用负担。在SemEval-2016和VAST上,SICI比表面代理更好地预测LLM准确率,并显示出显著的跨评分者可靠性($\alpha=0.771$)。更重要的是,随着SICI增加,LLM错误发生相变:低复杂度例子容易过度归因,尤其是对“反对”预测;中等复杂度例子形成不稳定边界;高复杂度例子迅速集中在“无”上。这种类似相变的结构在GPT-3.5、GPT-4o-mini、DeepSeek-V3和GPT-4o中持续存在,尽管更强的模型移动了边界。一项15种方法的干预研究进一步表明,提示、检索和辩论通常沿着归因-弃权轴移动模型,而不是消除高复杂度瓶颈。

英文摘要

Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target--text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($\alpha=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution--abstention axis rather than removing the high-complexity bottleneck.

2510.02524 2026-06-12 cs.CL cs.FL cs.LG 版本更新

Unraveling Syntax: Language Modeling and the Substructure of Grammars

解析句法:语言建模与语法的子结构

Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文研究语言模型在上下文无关语法子结构上的学习行为,证明损失函数在顶层子语法上线性递归,并发现参数化模型并行学习子语法,子语法预训练能提升小模型性能并改善内部表征。

Comments Equal contribution by LYS and DM. Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

尽管语言模型取得了令人印象深刻的结果,但其学习动态远未被理解。许多感兴趣的领域——如自然语言句法、编程语言、算术——都由上下文无关语法(CFG)捕获。在这项工作中,我们将先前关于CFG神经语言建模的工作扩展到一个新的方向:语言建模如何相对于CFG子结构(即子语法)表现。我们定义了子语法,并证明了一组连接语言建模和子语法的基本定理。我们表明,语言建模损失在其顶层子语法上线性递归;递归应用,损失分解为“不可约”子语法的损失。在额外假设下,并且经验上,参数化模型并行学习子语法,不同于首先掌握简单子结构的儿童。我们发现,子语法预训练可以提高最终性能,但仅对于相对于语法而言微小的模型,而对齐分析表明,预训练一致地导致内部表征更好地反映语法的子结构。

英文摘要

While language models achieve impressive results, their learning dynamics are far from understood. Many domains of interest -- such as natural language syntax, coding languages, arithmetic -- are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a novel direction: how language modeling behaves with respect to CFG substructure, namely subgrammars. We define subgrammars, and prove a set of fundamental theorems connecting language modeling and subgrammars. We show that language modeling loss recurses linearly over its top-level subgrammars; applied recursively, the loss decomposes into losses for "irreducible" subgrammars. Under additional assumptions, and empirically, parametrized models learn subgrammars in parallel, unlike children who first master simple substructures. We find that subgrammar pretraining can improve final performance, but only for tiny models relative to the grammar, while alignment analyses show that pretraining consistently leads to internal representations that better reflect the grammar's substructure.

2604.24079 2026-06-12 cs.CL cs.AI 版本更新

The Pragmatic Persona: Discovering LLM Persona through Bridging Inference

语用人格:通过桥接推理发现LLM人格

Jisoo Yang, Jongwon Ryu, Minuk Ma, Trung X. Pham, Junyeong Kim

发表机构 * Department of Artificial Intelligence, Chung-Ang University, Seoul, 06974, Republic of Korea(韩国中央大学人工智能系) Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada(不列颠哥伦比亚大学计算机科学系) Van Lang University, Ho Chi Minh City, Vietnam(越南文郎大学)

AI总结 提出基于桥接推理的框架,通过构建话语级知识图谱捕捉LLM对话中的隐含语义关联,实现从话语连贯性层面发现稳定人格特征,优于基于频率或风格的基线方法。

Comments 15 pages, 4 figures, accepted to ICPR 2026

详情
AI中文摘要

大型语言模型(LLM)通过对话展现出固有且独特的人格。然而,现有的大多数人格发现方法依赖于表面层面的词汇或风格线索,将对话视为平坦的token序列,未能捕捉维持人格一致性的更深层次话语结构。为解决这一局限,我们提出一种新颖的分析框架,通过桥接推理——即通过共享世界知识和话语连贯性连接话语的隐含概念关系——来解读LLM对话。通过将这些关系建模为结构化知识图谱,我们的方法捕捉了控制LLM在对话轮次间组织意义的潜在语义链接,从而在话语连贯性层面而非表面实现上实现人格发现。在多种推理骨干和从小型模型到80B参数系统的目标LLM上的实验结果表明,与基于频率或风格的基线相比,桥接推理图产生了显著更强的语义连贯性和更稳定的人格识别。这些结果表明,人格特质始终编码在话语的结构组织中,而非孤立的词汇模式中。本工作提出了一个系统框架,通过认知话语理论的视角来探测、提取和可视化潜在的LLM人格,桥接了计算语言学、认知语义学和大型语言模型中的人格推理。代码见:https://github.com/JiSoo-Yang/Persona_Bridging.git

英文摘要

Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference -- implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency- or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than in isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Code is available at https://github.com/JiSoo-Yang/Persona_Bridging.git

2605.22641 2026-06-12 cs.CL cs.AI cs.LG 版本更新

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

更多上下文、更大模型还是道德知识?政治文本中施瓦茨价值观检测的系统研究

Víctor Yeste, Paolo Rosso

发表机构 * PRHLT Research Center, Universitat Politècnica de València, Spain(西班牙瓦伦西亚理工大学PRHLT研究中心) School of Science, Engineering and Design, Universidad Europea de Valencia, Spain(西班牙瓦伦西亚欧洲大学科学、工程与设计学院) Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI)(瓦伦西亚人工智能研究生学院与研究网络(ValgrAI))

AI总结 本研究系统比较了上下文范围、检索增强道德知识和模型规模对政治文本中施瓦茨价值观检测的影响,发现全文档上下文和检索知识对监督编码器有效,但对零样本大语言模型帮助有限,且模型扩展不保证性能提升。

Comments Code: https://github.com/VictorMYeste/human-value-detection-context-rag, best model: https://huggingface.co/VictorYeste/value-context-rag-deberta-v3-base-doc-rag, 18 pages, 3 figures

详情
AI中文摘要

检测政治文本中的施瓦茨价值观具有挑战性,因为隐含线索通常依赖于周围的论证和相邻价值观之间的细微差别。我们研究了上下文和显式道德知识何时有助于句子级别的价值观检测。使用ValuesML/Touché ValueEval格式,我们比较了句子、窗口和全文档输入;无检索增强和基于检索增强的设置(使用精心策划的道德知识库);监督的DeBERTa-v3-base/large编码器;以及参数规模从12B到123B的零样本大语言模型。结果表明,更多上下文并非总是更好:全文档上下文使监督的DeBERTa编码器相比仅句子输入提高了3.8-4.8个宏F1点,但对零样本大语言模型没有一致帮助。在匹配比较中,检索到的道德知识更一致地有用,在早期融合下改善了每个测试的模型系列和上下文条件。然而,从DeBERTa-v3-base扩展到large以及从12B扩展到更大的大语言模型并不保证收益,并且简单的早期融合优于测试的后期融合和交叉注意力检索增强生成变体。按价值观分析表明,上下文和检索对社交情境化或概念上易混淆的价值观帮助最大。这些发现表明,价值观敏感的NLP应联合评估上下文、知识和模型系列,而不是将更长的输入或更大的模型视为通用改进。

英文摘要

Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8-4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.

7. 多模态语言处理 10 篇

2606.12576 2026-06-12 cs.CL 新提交

Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

帮助图表讲述它们的故事!基于论文的视频生成解释复杂科学图表

Ishani Mondal, Javad Baghirov, Jordan Boyd-Graber

AI总结 提出MINARD流水线,从图表及其论文生成基于区域分解的叙述性视频,并发布FigTalk基准,在自动和人工评估中优于现有方法。

Comments Webpage: https://minard.vercel.app/

详情
AI中文摘要

科学图表将复杂的流程压缩到单个画布中,但理解它们需要基于论文的、逐步的叙述,并与视觉高亮对齐——这是当前视频生成系统和基准所缺乏的能力。为了解决这个问题,我们引入了基于论文的图表到视频生成:从图表及其论文生成叙述性的、区域引导的导览视频。我们提出了MINARD(通过区域分解对叙述性架构进行多模态解释),这是一个生成基于论文的叙述并顺序将其与图表区域对齐的流水线。我们还发布了FigTalk,一个包含新的顺序和组件级对齐指标的基准。在FigTalk上,MINARD生成类人的、忠于论文的叙述,并在自动和人工评估中,在叙述条件下的图表空间对齐方面优于现有方法。

英文摘要

Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.

2606.13572 2026-06-12 cs.CL cs.AI 新提交

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

ArogyaSutra:面向印度语言的多模态医学推理的多智能体框架

Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

发表机构 * Indian Institute of Technology Patna(印度理工学院巴特那分校) Indian Institute of Technology Kanpur(印度理工学院坎普尔分校) Prasannadeb Women’s College(普拉萨纳德布女子学院)

AI总结 针对印度语言医疗场景中多模态大语言模型性能不足的问题,提出多模态医学问答数据集ArogyaBodha和基于演员-评论家的多智能体框架ArogyaSutra,通过工具接地与双记忆机制提升多语言医学推理准确性。

详情
AI中文摘要

多模态大语言模型(MLLMs)在通用领域展现出有希望的推理能力,但在医疗等专业场景中,尤其是在多语言和低资源情况下,其性能仍然有限。这一差距在印度农村等地区尤为关键,患者通常用本土印度语言表达复杂的医疗问题,并依赖医学图像等多模态输入。现有的以英语为中心的MLLMs难以支持此类用例,限制了公平获取AI驱动的医疗辅助。为应对这一挑战,我们引入了ArogyaBodha,一个大规模的多语言多模态医学问答数据集,由八个异构来源构建,涵盖31个身体系统、六种成像模态和21个临床领域,覆盖英语和七种主要印度语言。我们进一步提出了ArogyaSutra,一个基于演员-评论家的多智能体框架,将工具接地与双记忆机制相结合,实现逐步的、推理感知的决策,并使用存储的演员-评论家模拟轨迹进行蒸馏。实验表明,我们的数据集和框架在所有印度语言上提高了多语言医学推理的准确性,消融实验验证了每个组件的贡献。源代码和数据集可在以下网址获取:https://iitp-cse.github.io/ArogyaSutra/

英文摘要

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ArogyaSutra/

2606.13630 2026-06-12 cs.CL 新提交

From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

从词元到面部:探究用于3D面部动画的离散语音表示

Pedro Correa, Olivier Perrotin, Samir Sadok, Paula Costa, Thomas Hueber

发表机构 * Univ. Estadual de Campinas (UNICAMP), Brazil(巴西坎皮纳斯州立大学(UNICAMP)) Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France(法国格勒诺布尔阿尔卑斯大学,CNRS,格勒诺布尔国立理工学院,GIPSA实验室) Inria at Univ. Grenoble Alpes, CNRS, LJK, France(法国格勒诺布尔阿尔卑斯大学Inria,CNRS,LJK)

AI总结 研究评估四种语音表示在3D面部合成中的效果,发现编码音素类别有利于准确预测面部动画,并基于此提出音频视觉文本到语音管线。

Comments This work has been accepted in Interspeech 2026

详情
AI中文摘要

语音表示的选择在语音驱动的3D面部动画中至关重要。不同表示在编码内容上有所差异:SSL特征强调音段和语义线索,神经编解码器产生优化用于声学重建的潜在表示,而ASR风格的目标产生基于标签的空间。我们评估了四种用于3D面部合成的语音表示族,通过客观指标和感知评估比较了它们在两个面部解码器上的面部重建质量。此外,我们进行了探测分析,将分词表示与音素单元和发音变形联系起来。我们发现,编码音素类别有利于在语义和基于标签的表示上准确预测面部动画,且面部动画质量相当。基于后者,我们引入了一个音频视觉文本到语音(AVTTS)管线,该管线利用离散表示作为共享空间来解码语音和3D面部运动。

英文摘要

The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We found that encoding phonetic classes is beneficial for accurate facial animation prediction on both semantic and label-based representations with comparable facial animation quality. From the latter, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that leverages, as a shared space, discrete representations to decode speech and 3D facial motion.

2606.12616 2026-06-12 cs.AI cs.CL 交叉投稿

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

PersonaDrive: 面向闭环驾驶模拟的人类风格检索增强VLA智能体

Mahmoud Srewa, Praneetsai Iddamsetty, Mohammad Abdullah Al Faruque, Salma Elmalaki

发表机构 * University of California, Irvine(加利福尼亚大学尔湾分校)

AI总结 提出PersonaDrive流水线,通过检索风格指令下的人类驾驶演示来调节视觉-语言-动作(VLA)驾驶智能体,实现闭环模拟中多样化的非自车智能体行为,无需针对每种风格重新训练。

详情
AI中文摘要

闭环驾驶模拟器通常在其环境中填充行为大致相同的非自车交通智能体,这些智能体要么由基于规则的交通管理器生成,要么由训练为单一行为模式的学习模型生成。最近的工作通过观测数据上的事后标签或LLM推断的奖励权重引入风格变化,但这些信号充当了风格应奖励什么的代理,而不是明确要求以该风格驾驶的人类演示。我们提出了PersonaDrive,一个流水线,它根据从风格指令的人类驾驶数据集中检索到的演示来调节视觉-语言-动作(VLA)驾驶智能体,在该数据集中,参与者在驾驶员在环平台上以激进、中性和保守指令驾驶CARLA排行榜路线。该流水线包括三个阶段:(i) 使用组合的图像-文本相似度分数对每种风格的人类驾驶数据进行离线三元组挖掘;(ii) 训练一个轻量级检索头,将冻结的视觉特征与每个风格数据库上的小型控制编码器融合;(iii) 微调单个VLA主干,以在航点预测期间将检索到的上下文点视为上下文行为演示。在推理时,通过切换检索头查询的每个风格数据库,相同的主干可以适应任何风格,因此选择风格无需针对每种风格重新训练,同时为闭环模拟启用人类风格、风格多样的非自车智能体。在Bench2Drive上,PersonaDrive(无风格)的驾驶得分比SimLingo高4.6%,比HiP-AD高2.5%,在风格条件下,每种风格都获得最高驾驶得分,波动范围约2%(其最弱风格超过最强基线DMW 5.4%),而从保守指令到激进指令,平均速度和加速度分别提高18%和25%。

英文摘要

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.

2606.12898 2026-06-12 cs.CV cs.CL 交叉投稿

Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension

放大关键信息:面向视觉文本理解的注意力引导自适应渲染

Shenglai Zeng, Qirui Wang, Kai Guo, Xinnan Dai, Xianxuan Long, Hui Liu

发表机构 * Michigan State University(密歇根州立大学) Xi’an Jiaotong University(西安交通大学)

AI总结 针对视觉语言模型在视觉文本理解任务中存在的定位与利用脱节问题,提出无需训练、模型无关的注意力引导自适应渲染方法AGAR,通过放大关键文本跨度提升模型性能。

详情
AI中文摘要

视觉文本理解(VTC)将文本渲染为图像供视觉语言模型(VLM)阅读,绕过了LLM的上下文窗口限制,并支持从长页OCR到多页记忆问答等应用。然而,现有的VTC流水线将渲染和布局视为固定的、内容无关的预处理步骤,并且对VLM内部如何处理可视化文本的机制理解甚少。通过对VTC问答任务的聚焦实证研究,我们揭示了VLM存在一种“定位而不利用”的模式:证据定位注意力在中间到后期层中急剧出现,并且与答案正确性在很大程度上解耦,然而仅仅放大渲染页面上定位的跨度就能恢复大部分失败。基于这些观察,我们提出了AGAR(注意力引导自适应渲染),一种无需训练、模型无关的方法,该方法利用VLM自身的中间到后期层注意力来识别前K个重要的视觉补丁,将它们映射回单词跨度,并在重新推理答案之前重新渲染页面,放大这些跨度。在九个VTC基准测试(短文本、长上下文和多页记忆问答)和四个VLM骨干上的大量实验表明,AGAR(i)作为即插即用的增强,持续改进了现成的VLM,(ii)与VLM后训练相结合可带来进一步收益,并且(iii)在视觉和文本侧输入退化下保持鲁棒性。

英文摘要

Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM's own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.
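
The selection step of AGAR (top-K patches, mapped back to word spans, then enlarged) can be sketched as follows. This is an illustrative sketch only: it assumes the per-patch attention scores have already been extracted and averaged over the chosen layers, that the renderer supplies a patch-to-word mapping, and that enlargement is a single multiplicative font scale. The function names and the `enlarge` factor are hypothetical and do not come from the paper.

```python
import heapq

def top_k_patches(attn, k):
    """Indices of the k patches with the highest attention score."""
    return heapq.nlargest(k, range(len(attn)), key=attn.__getitem__)

def patches_to_word_spans(patch_ids, patch_to_words):
    """Collect the words covered by the selected patches and merge
    consecutive word indices into inclusive (start, end) spans."""
    words = sorted({w for p in patch_ids for w in patch_to_words[p]})
    spans = []
    for w in words:
        if spans and w == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], w)
        else:
            spans.append((w, w))
    return spans

def font_scales(n_words, spans, enlarge=1.5):
    """Per-word font scale for re-rendering: enlarged inside spans, 1.0 elsewhere."""
    scales = [1.0] * n_words
    for start, end in spans:
        for w in range(start, end + 1):
            scales[w] = enlarge
    return scales
```

The resulting list of scales would be handed to whatever text renderer produced the original page image, and the VLM would then be queried again on the re-rendered page.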

2606.13267 2026-06-12 cs.CV cs.CL cs.IR 交叉投稿

TimeLens: On-Device Artifact Recognition with Retrieval-Augmented Question Answering for the Grand Egyptian Museum

TimeLens: 面向大埃及博物馆的基于检索增强问答的设备端文物识别

Rawan Hesham, Ali Ashraf, Amr Ahmed, Malak Alaa, Omar Ahmed, Omar Wagih

发表机构 * Grand Egyptian Museum(大埃及博物馆)

AI总结 针对博物馆场景中的细粒度视觉相似性、训练数据与手持相机差距以及AI幻觉问题,提出设备端文物检测器与双语检索增强生成(RAG)问答系统,实现实时识别与可靠问答。

Comments 6 pages, 4 figures, 5 tables. Submitted to AIVRCH 2026

详情
AI中文摘要

TimeLens 是一款面向大埃及博物馆(GEM)的 AI 驱动双语移动导览应用。游客将手机对准展品时,可实时识别文物,并针对后续问题获得英语或阿拉伯语回答。本工作解决了馆内部署特有的三个问题:51 件编目文物(许多近乎相同的拉美西斯雕像)间的细粒度视觉相似性、策展训练数据与手持相机条件之间的差距,以及 AI 导览陈述未经证实的历史事实的风险。报告了两项工程贡献。首先,通过数据质量驱动的迭代研究——从基础模型自动标注(YOLO-World),经过空间标签清理规则,到完全人工标注的数据集——开发了设备端文物检测器,将标签质量确定为决定性因素:最终的 YOLOv8n 模型解决了所有先前失败的类别,同时保持为 5.97 MB 的 TensorFlow Lite 资产,可在中端手机上实时运行(mAP@0.5 = 0.995,mAP@0.5:0.95 = 0.924)。其次,基于 108 条记录的 ChromaDB 知识库的双语检索增强生成(RAG)导览,在七个候选语言模型上进行了基准测试,选定了 Gemma 4 E2B(Q4_K_M);十项针对性优化将端到端延迟从超过 30 秒降低到约 10 秒。两个子系统集成在一个生产级 Flutter 应用中,具有双语界面、博物馆位置门控和文本转语音支持。

英文摘要

TimeLens is an AI-powered bilingual mobile guide for the Grand Egyptian Museum (GEM). Pointing a phone at an exhibit, a visitor sees the artifact recognized in real time and can ask follow-up questions answered in English or Arabic. The work addresses three problems specific to in-gallery deployment: fine-grained visual similarity among 51 catalogued artifacts (many near-identical Ramesside statues), the gap between curated training data and handheld camera conditions, and the risk of an AI guide stating unsupported historical facts. Two engineering contributions are reported. First, an on-device artifact detector was developed through a data-quality-driven iteration study -- from foundation-model auto-annotation (YOLO-World), through spatial label-cleaning rules, to a fully hand-annotated dataset -- isolating label quality as the decisive factor: the final YOLOv8n model resolves every previously failing class while remaining a 5.97 MB TensorFlow Lite asset that runs in real time on a mid-range phone (mAP@0.5 = 0.995, mAP@0.5:0.95 = 0.924). Second, a bilingual Retrieval-Augmented Generation (RAG) guide, grounded in a 108-record ChromaDB knowledge base, was benchmarked across seven candidate language models, with Gemma 4 E2B (Q4_K_M) selected; ten targeted optimizations reduce end-to-end latency from over 30 s to approximately 10 s. Both subsystems are integrated in a production Flutter application with bilingual interface, museum location gating, and text-to-speech support.

2606.13288 2026-06-12 cs.CV cs.AI cs.CL 交叉投稿

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

跨模态掩码组合概念建模以增强视觉-语言组合性

Wei Li, Zhen Huang, Xinmei Tian

发表机构 * MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China(中国科学技术大学,教育部脑启发智能感知与认知重点实验室) Independent Researcher(独立研究员)

AI总结 提出MACCO框架,通过掩码一个模态的组合概念并从另一模态完整上下文重建,增强视觉-语言模型的组合理解能力,在五个基准上显著提升。

Comments Accepted to ACL 2026 Main Conference, 25 pages

详情
AI中文摘要

对比训练的视觉-语言模型(如CLIP)在学习联合图像-文本表示方面取得了显著进展,但在组合理解方面仍面临挑战。它们通常表现出“词袋”行为——难以捕捉对象关系、属性-对象绑定和词序依赖。这一限制不仅源于优化时依赖全局单向量表示,还源于对配对图像文本数据中固有丰富组合信息的利用和建模不足。在这项工作中,我们提出了MACCO(掩码组合概念建模)框架,该框架掩码一个模态中的组合概念,并基于另一模态的完整上下文信息重建它们,从而使模型能够更有效地捕捉和对齐跨模态组合结构。为促进这一过程,我们引入了两个辅助目标,在模态间和模态内联合对齐和正则化掩码特征。在五个组合基准上的大量实验和深入分析表明,我们的方法不仅显著增强了VLM的组合性,还提高了它们捕捉句法结构和语言信息的能力。此外,改进的组合性也有利于文本到图像生成和多模态大语言模型。代码可在 https://github.com/hiker-lw/MACCO 获取。

英文摘要

Contrastively trained vision-language models like CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a "bag-of-words" behavior--struggling to capture object relations, attribute-object bindings, and word-order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. The improved compositionality also benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.

2606.13558 2026-06-12 cs.CV cs.CL 交叉投稿

Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

编辑比特,差异编码:面向视觉自回归模型的逐比特残差编辑

Shengqiang Zhang, Ruotong Liao, Volker Tresp, Barbara Plank, Hinrich Schütze

发表机构 * LMU Munich & Munich Center for Machine Learning (MCML)(慕尼黑大学 & 慕尼黑机器学习中心 (MCML))

AI总结 提出BitResEdit,一种无需训练的视觉自回归图像编辑方法,通过比特级源负引导和残差编码注入,在保持背景的同时实现强文本对齐。

详情
AI中文摘要

使用视觉自回归(VAR)生成器进行文本引导的图像编辑,需要同时控制模型采样的内容,以及采样得到的变化被写回图像代码的位置。现有的VAR编辑器主要操作于令牌流、特征或扁平的下一个令牌对数几率,未能充分利用逐比特残差VAR模型的两个原生结构:逐比特伯努利预测头,以及用于组装图像的加性多尺度残差代码场。我们提出BitResEdit,一种针对逐比特残差VAR生成器(如Infinity)的无训练编辑器。BitEdit沿着在共享的已编辑前缀上计算的源-目标对比方向,对CFG之后的逐比特对数几率进行倾斜,以执行源负引导,然后将每次更新投影到干净CFG采样器周围的闭式伯努利-KL信任域中。ResEdit将采样的比特转换为每尺度的连续代码残差,用定位掩码对其进行门控,并通过生成器原生的尺度求和重新注入。二者共同将决策时的比特引导与组合时的代码组合耦合起来,使得被掩码排除的潜在特征通过代码算术被精确保留,同时在目标区域内应用局部化的、尺度感知的编辑。在PIE-Bench上使用Infinity-2B时,BitResEdit在相同骨干的VAR编辑器中实现了最强的文本对齐,编辑区域上的CLIP比此前最强的编辑器高出1.07,同时背景保持能力与其相当。消融实验表明,BitEdit和ResEdit在目标对齐和背景保持中发挥互补作用。

英文摘要

Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity. BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source--target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it. Ablations show BitEdit and ResEdit play complementary roles in target alignment and background preservation.
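The BitEdit step described above (tilt a per-bit log-odds along a source-target contrast, then keep the result inside a Bernoulli-KL trust region around the clean sampler) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the additive tilt form, the KL direction, the `strength` and `radius` parameters, and the use of bisection in place of the closed-form projection the abstract mentions. It is not the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bern(p) || Bern(q)), with probabilities clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def tilt_logit(cfg_logit, source_logit, target_logit, strength):
    """Assumed tilt: push the post-CFG log-odds toward target and away from source."""
    return cfg_logit + strength * (target_logit - source_logit)


def project_to_trust_region(clean_logit, tilted_logit, radius):
    """Shrink the tilted log-odds toward the clean one until KL <= radius.

    Bisection stands in for the closed-form projection; it works because the
    KL grows monotonically as the log-odds moves away from the clean value.
    """
    q = sigmoid(clean_logit)
    if bernoulli_kl(sigmoid(tilted_logit), q) <= radius:
        return tilted_logit  # already inside the trust region
    lo, hi = 0.0, 1.0  # interpolation weight between clean (0) and tilted (1)
    for _ in range(60):
        mid = (lo + hi) / 2.0
        candidate = clean_logit + mid * (tilted_logit - clean_logit)
        if bernoulli_kl(sigmoid(candidate), q) <= radius:
            lo = mid
        else:
            hi = mid
    return clean_logit + lo * (tilted_logit - clean_logit)


# Hypothetical single-bit example.
tilted = tilt_logit(cfg_logit=0.0, source_logit=-1.0, target_logit=1.0, strength=2.0)
projected = project_to_trust_region(clean_logit=0.0, tilted_logit=tilted, radius=0.05)
```

In this toy case the tilted log-odds lands far outside the region, so the projection pulls it back to the boundary while preserving the direction of the edit.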

2602.07106 2026-06-12 cs.CV cs.AI cs.CL 版本更新

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Ex-Omni:为全模态大语言模型赋能3D面部动画生成

Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) LIGHTSPEED Independent Researcher(独立研究员)

AI总结 提出Ex-Omni模型,通过混合形状感知语音单元生成器和解码器解耦语义推理与时间生成,并引入统一令牌查询门控融合机制,实现全模态大语言模型同步生成语音和3D面部动画。

详情
AI中文摘要

全模态大语言模型(OLLM)旨在统一多模态理解和生成,然而,尽管联合生成语音和3D面部动画对自然的人机交互十分重要,将其扩展到这一任务的研究仍基本空白。一个关键挑战是LLM的离散语义推理与3D面部运动所需的密集时间动态之间的不匹配。我们提出Expressive Omni (Ex-Omni),一个开源模型,通过原生语音伴随的3D面部动画增强OLLM。Ex-Omni通过混合形状感知语音单元生成器和混合形状解码器将语义推理与时间生成解耦,其中语音单元提供时间支架,隐藏语音表示携带面部相关线索。我们进一步引入统一的以令牌为查询的门控融合(TQGF)机制用于受控语义注入,以及InstructS2SF-1200K,一个包含1200K样本的预训练数据集。大量实验表明,Ex-Omni在保持竞争性语音理解和生成能力的同时,实现了比级联管道更好的音视频同步和更低的面部生成延迟。

英文摘要

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech and 3D facial animation remains largely unexplored despite its importance for natural human-computer interaction. A key challenge is the mismatch between the discrete semantic reasoning of LLMs and the dense temporal dynamics required for 3D facial motion. We propose Expressive Omni (Ex-Omni), an open-source model that augments OLLMs with native speech-accompanied 3D facial animation. Ex-Omni decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, where speech units provide temporal scaffolding and hidden speech representations carry facially relevant cues. We further introduce a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection, as well as InstructS2SF-1200K, a dataset consisting of 1200K samples for pre-training. Extensive experiments show that Ex-Omni maintains competitive speech understanding and generation ability while achieving better audio-visual synchronization and lower face-generation latency than cascaded pipelines.

2606.11792 2026-06-12 cs.CV cs.AI cs.CL 版本更新

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

MultiToP:学习修补视觉令牌以减轻视频大型多模态模型中的幻觉

Yuansheng Gao, Wenbin Xing, Jiahao Yuan, Kaiwen Zhou, Han Bao, Zonghui Wang, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学) Sun Yat-sen University(中山大学) East China Normal University(华东师范大学)

AI总结 提出MultiToP框架,通过轻量级视觉令牌修补器动态替换不可靠视觉令牌,结合信息引导排名校准和稀疏正则化,在不修改原模型情况下减少视频多模态模型幻觉,显著提升F1分数和问答准确率。

Comments Preprint

详情
AI中文摘要

视频大型多模态模型在视频理解方面取得了显著进展,但仍容易产生幻觉,即生成的响应未能忠实于输入视频。在本文中,我们提出MultiToP,一种多模态上下文感知的视觉令牌修补框架,通过在语言生成之前优化不可靠的视觉令牌来减轻幻觉。MultiToP引入了一个轻量级的视觉令牌修补器,用于预测令牌级替换分布,并选择性地用动态全局修补令牌替换不可靠的视觉令牌。为了有效训练修补器,我们进一步提出了信息引导的排名校准,利用从主干网络派生的答案条件帧级信息线索来指导令牌替换。结合真实答案监督和稀疏正则化,MultiToP实现了局部视觉证据优化,而无需修改原始模型。大量实验表明,MultiToP在Vript-HAL上有效减少了幻觉,且推理开销可忽略不计,将Qwen3-VL-4B-Instruct的F1分数相比原始模型提高了50.60%。同时,MultiToP保持了通用的视频理解能力,在ActivityNet-QA上为Video-LLaVA-7B带来了18.58%的相对准确率提升。

英文摘要

Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.

8. 语音语言联合与音频文本 7 篇

2606.12902 2026-06-12 cs.CL 新提交

PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

PRISM:用于共情口语对话的韵律集成多智能体推理框架

Wen Zhang, Xiaocui Yang, Zhuoyue Gao, Shi Feng, Daling Wang, Yifei Zhang

发表机构 * School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院)

AI总结 提出PRISM多智能体框架,通过解耦语音感知、响应生成和语音合成,并引入韵律到语言翻译机制,实现共情口语对话中的韵律适当性和知识集成。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

共情口语对话系统不仅需要语义上合适的回应,还需要情感上一致的韵律表达。然而,级联流水线通常在语音到文本转换过程中丢弃声学线索,而端到端语音模型缺乏对情感和知识集成的可解释控制。为了解决这些挑战,我们提出了PRISM,一个用于共情口语对话的多智能体框架,它将语音感知、响应生成和语音合成解耦为协调的组件。PRISM引入了一种韵律到语言的翻译机制来稳定大语言模型的推理,并支持按需调用外部知识工具以生成共情对话。实验结果表明,PRISM在客观和主观指标上均实现了共情性、韵律适当性和文本响应生成质量的一致改进。我们的代码可在以下网址获取:https://github.com/Bxzfrm/PRISM。

英文摘要

Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech-to-text conversion, while end-to-end speech models lack interpretable control over emotion and knowledge integration. To address these challenges, we propose PRISM, a multi-agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis into coordinated components. PRISM introduces a prosody-to-language translation mechanism to stabilize large language model reasoning and enables on-demand invocation of external knowledge tools for empathetic dialogue generation. Experimental results demonstrate that PRISM achieves consistent improvements in empathy, prosodic appropriateness, and text response generation quality across objective and subjective metrics. Our code is available at: https://github.com/Bxzfrm/PRISM.

2606.12911 2026-06-12 cs.CL 新提交

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

PiDA: 基于语音信息的数据增强用于鲁棒的越南语语音翻译

Giang Son Nguyen, Tung X. Nguyen, Hieu Minh Truong, Nhu Vo, Wray Buntine, Dung D. Le

发表机构 * VinUniversity(Vin大学) University of Technology Sydney(悉尼技术大学) Monash University(莫纳什大学)

AI总结 针对级联语音翻译中ASR错误传播问题,提出基于语音信息的数据增强方法PiDA,通过语音词嵌入生成相似音替换,在FLEURS越南语-英语上提升错误ASR输出翻译质量(BLEU+2.04)。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

级联语音翻译(ST)系统在自动语音识别(ASR)输出错误转录时会出现错误传播。我们首次对越南语ST的ASR错误进行系统分类,根据语音原因对替换错误进行分类,并使用线性混合效应模型量化其对下游神经机器翻译(NMT)性能的影响。我们确认大多数ASR替换错误源于语音混淆而非随机噪声,并且这些语音错误显著降低了ST质量。受此发现启发,我们提出了基于语音信息的数据增强(PiDA),该方法通过使用语音词嵌入替换为语音相似的替代词来生成类似ASR的损坏。在FLEURS越南语-英语的PiDA增强版本上进行微调,提高了错误ASR输出的翻译质量(比标准微调最多提高+2.04 BLEU),同时也略微提升了干净文本的性能。

英文摘要

Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on downstream Neural Machine Translation (NMT) performance using Linear Mixed-Effects Modelling. We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs (up to +2.04 BLEU over standard fine-tuning) while also slightly improving clean-text performance.
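The augmentation idea in the abstract (replace words with phonetically similar alternatives found via phonetic word embeddings) can be sketched as a nearest-neighbour substitution. The toy two-dimensional vectors, the `rate` parameter, the cosine-similarity choice, and the function names below are all hypothetical; the paper's actual embeddings and sampling scheme are not described in this digest.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_phonetic_neighbor(word, embeddings):
    """Return the other vocabulary word whose embedding is closest to `word`."""
    candidates = [w for w in embeddings if w != word]
    return max(candidates, key=lambda w: cosine(embeddings[word], embeddings[w]))


def corrupt(tokens, embeddings, rate, rng):
    """Replace each in-vocabulary token with its neighbour with probability `rate`."""
    out = []
    for tok in tokens:
        if tok in embeddings and rng.random() < rate:
            out.append(nearest_phonetic_neighbor(tok, embeddings))
        else:
            out.append(tok)
    return out


# Toy "phonetic" embeddings (made up): "sinh" and "xinh" are placed close together.
TOY_EMBEDDINGS = {"sinh": [1.0, 0.1], "xinh": [0.9, 0.2], "bàn": [0.0, 1.0]}

clean = ["cô", "ấy", "sinh"]
noisy = corrupt(clean, TOY_EMBEDDINGS, rate=1.0, rng=random.Random(0))
```

Fine-tuning data would then pair each corrupted source sentence with the original reference translation, so the translation model sees ASR-like errors during training.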

2606.13121 2026-06-12 cs.CL cs.AI cs.SD 新提交

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

NaturalFlow: 减少同步语音到语音翻译中破坏自然语音流的停顿

Dongwook Lee, Youngho Cho, Sangkwon Park, Heeseung Kim, Sungroh Yoon

发表机构 * IPAI and ECE, Seoul National University(首尔大学IPAI与ECE) Department of AI, University of Seoul(首尔市立大学人工智能系)

AI总结 提出一个流畅性感知优化框架,通过利用模型内部信号(如语言多样性和语音时长的时间变异性)最小化块间静音,在同步翻译的低延迟和连续翻译的自然流畅之间找到平衡点。

Comments Proceedings of the 26th Interspeech Conference, Long Paper

详情
AI中文摘要

同步语音到语音翻译旨在通过最小化延迟实现近实时通信,为连续翻译的高延迟提供了一种引人注目的实时替代方案。然而,过度追求低延迟往往会导致碎片化的块状语音。因此,听众会遭受不自然的声学流,其中频繁的停顿可能会增加他们的认知负荷。为了弥补这一差距,我们引入了一个流畅性感知优化框架,旨在发现同步翻译的低延迟优势与连续翻译的自然流畅之间的最佳平衡点。我们的框架通过利用模型内部信号(包括语言多样性和语音时长的诱导时间变异性)来最小化块间静音。在短文本和长文本基准上的实验表明,我们的框架在保持竞争性延迟和翻译质量的同时,产生了自然的语音流。

英文摘要

Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive pursuit of low latency often results in fragmented chunk-wise speech. Consequently, listeners are subjected to an unnatural acoustic flow punctuated by frequent pauses, which could increase their cognitive load. To bridge this gap, we introduce a fluency-aware optimization framework designed to discover the sweet spot between the low-latency benefits of simultaneous translation and the natural flow of consecutive translation. Our framework minimizes inter-chunk silences by leveraging model-internal signals, including linguistic diversity and induced temporal variability in speech durations. Experiments on short- and long-form benchmarks show that our framework produces natural speech flow while maintaining competitive latency and translation quality.

2606.13507 2026-06-12 cs.CL 新提交

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

利用音频大语言模型过滤语音到语音训练数据

Qixu Chen, Satoshi Nakamura

发表机构 * School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院) School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)人工智能学院)

AI总结 提出Rank-to-Distill策略,训练音频大语言模型直接从语音对判断保留/丢弃,过滤噪声数据,提升端到端语音翻译性能。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

大规模挖掘语料为端到端语音到语音翻译(S2ST)提供了丰富的训练数据,但可能包含噪声、错位和语义错误。过滤噪声数据对于保持鲁棒的语音翻译性能至关重要。我们研究如何训练音频语言模型,使其直接根据音频对配对语音做出保留/丢弃决策。为了在没有人工标注的情况下获得可靠的监督,我们采用了一种可扩展的两阶段Rank-to-Distill策略:先由一个轻量级排序器从噪声语音对生成保留/丢弃伪标签,再用这些伪标签训练音频大语言模型,使其直接从原始配对语音预测保留/丢弃。所得模型联合捕获声学保真度和跨语言语义一致性,用于语音条件下的数据选择。在CVSS-C和SpeechMatrix上的实验表明,与未过滤训练相比,性能持续提升,端到端S2ST的ASR-BLEU最高提升+1.4。

英文摘要

Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech translation performance. We study how to train an audio-language model to make keep/drop decisions on paired speech directly from audio. To obtain reliable supervision without manual labels, we adopt a scalable two-stage Rank-to-Distill strategy. A lightweight ranker generates keep/drop pseudo-labels from noisy speech pairs, and an audio large language model is then trained on them to predict keep/drop directly from raw paired speech. The resulting model jointly captures acoustic fidelity and cross-lingual semantic consistency for the selection of speech-conditioned data. Experiments on CVSS-C and SpeechMatrix show consistent improvements over unfiltered training, yielding up to +1.4 ASR-BLEU for end-to-end S2ST.

2606.13544 2026-06-12 eess.AS cs.AI cs.CL 交叉投稿

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

自适应轮流发言:面向实时多方语音代理

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

AI总结 提出ModeratorLM,一种基于角色条件的语音大模型,通过分块流式处理和链式推理,在多方对话中实现自适应轮流发言,显著提升轮流精度和召回率。

Comments Accepted for publication at Interspeech 2026

详情
AI中文摘要

多方口语对话中的轮流发言仍然是语音代理面临的基本挑战,特别是在动态的发言权竞争和用户期望变化的情况下。我们提出ModeratorLM,一种角色扮演语音代理,它在多方环境中根据明确分配的角色来调节轮流发言行为。该系统基于以分块流式方式运行的语音大语言模型。我们进一步引入了一种推理增强变体,该变体结合了对对话上下文和分配角色的链式推理。我们构建了RolePlayConv,一个大规模合成数据集,包含具有多种助手角色的口语多方对话。在真实会议数据和RolePlayConv上的实验表明,与无角色条件的基线相比,轮流发言精度提高了40%以上,召回率提高了70%以上,同时大幅减少了误报中断。

英文摘要

Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show that, compared to non-role-conditioned baselines, turn-taking precision improves by over 40% and recall by more than 70%, while false-positive interruptions are substantially reduced.

2606.04474 2026-06-12 cs.CL eess.AS 版本更新

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

语音大模型推理中的实体绑定失败:诊断与思维链干预

Ming-Hao Hsu, Xiaohai Tian, Jun Zhang, Zhizheng Wu

发表机构 * School of Data Science, The Chinese University of Hong Kong, Shenzhen, China(香港中文大学(深圳)数据科学学院) ByteDance, China(字节跳动,中国)

AI总结 本文通过诊断语音大模型在逻辑推理中的实体绑定失败问题,提出实体感知思维链方法,显著提升推理准确率。

Comments INTERSPEECH 2026

详情
AI中文摘要

语音大模型(SLLM)在复杂推理任务上表现不如对应的文本模型。我们揭示了这种模态差距并非均匀的认知缺陷。通过评估两个架构各异的语音大模型,我们发现在空间、句法和事实任务上,语音到文本(S2T)匹配或超过文本到文本(T2T)。然而,在需要实体追踪的逻辑任务上,S2T准确率降至随机水平。我们将其诊断为实体绑定失败:连续的语音特征在隐式推理过程中模糊了精确的实体-属性关联。为验证这一诊断,我们提出了实体感知思维链(EA-CoT),这是一种轻量级的推理时干预,强制语音大模型在推理前先枚举实体并将其绑定到声明上。即使口语名称被误识别,EA-CoT也能弥合差距,带来最高24.4个百分点的准确率提升。消融实验证实这些提升源于显式语义绑定,从而将该差距重新解释为能力引出失败,而非能力缺失。

英文摘要

Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this gap is not a uniform cognitive deficit. Evaluating two architecturally diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. Yet on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this as an entity binding failure: continuous speech features blur precise entity-property associations during implicit reasoning. To validate this diagnosis, we introduce Entity-Aware Chain-of-Thought (EA-CoT), a lightweight inference-time intervention forcing SLLMs to enumerate entities and bind them to claims before reasoning. EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4 percentage-point accuracy gain. Ablations confirm the gains stem from explicit semantic binding, reframing the gap as an elicitation failure rather than a missing capability.
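As a rough illustration of what an inference-time intervention of this kind might look like, the sketch below wraps a question with instructions to enumerate entities and bind claims to them before reasoning. The prompt wording, the step list, and the function name are hypothetical; the digest does not give the authors' actual EA-CoT prompt.

```python
def build_entity_aware_prompt(question: str) -> str:
    """Prepend an 'enumerate entities, bind claims, then reason' scaffold.

    The instruction text is an illustrative guess, not the paper's prompt.
    """
    steps = [
        "List every entity mentioned in the input.",
        "For each entity, write down the claims or properties attached to it.",
        "Only then reason step by step and state the final answer.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"{question}\n\nBefore answering, follow these steps:\n{numbered}"


prompt = build_entity_aware_prompt(
    "Alice says Bob is lying. Bob says Carol is honest. Who is telling the truth?"
)
```

The point of the scaffold is that entity-property associations are written out as explicit text before any inference happens, rather than being tracked implicitly.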

2606.11681 2026-06-12 cs.CL cs.SD 版本更新

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

UR-BERT:通过通用罗马化和语音标记预测扩展大规模多语言TTS的文本编码器

Sangmin Lee, Eekgyun Ahn, Woongjib Choi, Hong-Goo Kang

发表机构 * Dept. of Electronics and Electrical Engineering, Yonsei University(延世大学电子与电气工程系)

AI总结 提出UR-BERT,一种基于罗马化转录的TTS编码器,通过统一书写系统为罗马化表示,结合语音标记预测目标,在495种语言上实现高效多语言TTS,优于现有基线并泛化到未见语言。

Comments Accepted to Interspeech 2026, Github: https://github.com/sanghyang00/ur-bert

详情
AI中文摘要

我们提出UR-BERT,一种基于罗马化转录的文本到语音(TTS)编码器,用于大规模多语言TTS系统。传统的字素到音素(G2P)方法由于可靠G2P资源的可用性,仅限于约100种语言。相比之下,UR-BERT通过将多样化的书写系统统一为共享的罗马化表示,扩展到495种语言。为了进一步增强语音保真度和文本-语音对齐,我们在训练过程中引入了一个语音标记预测目标,这促使编码器以数据高效的方式学习语音感知的语音表示。实验表明,基于UR-BERT构建的TTS系统在广泛的语言和资源条件下,始终优于最近的文本编码器基线,并展现出对未见语言的强大泛化能力。

英文摘要

We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

9. 评测、数据集与基准 36 篇

2606.12569 2026-06-12 cs.CL cs.AI 新提交

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

EDEN:意大利语临床笔记的大规模语料库

Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini

发表机构 * Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) Istituto di Ricerche Farmacologiche Mario Negri IRCCS(马里奥·内格里药理研究所IRCCS) University of Padua(帕多瓦大学)

AI总结 本文介绍EDEN,一个大规模意大利语急诊临床笔记语料库,包含约400万份匿名笔记及6000份专家标注数据,用于支持大语言模型在医疗中的应用,并提出了CRF填充作为新的结构化信息提取基准。

详情
AI中文摘要

我们提出了EDEN(急诊电子笔记),这是一个新颖且独特的大规模临床笔记语料库,这些笔记来自意大利医院的急诊科。当前版本的语料库由约400万份完全匿名的临床笔记组成,涵盖了患者在急诊科停留期间的不同护理阶段。此外,约六千份笔记的子集由临床专家通过结构化病例报告表(CRF)进行了手动标注,该CRF包含132个项目,涉及急诊科两种患者情况:呼吸困难和意识丧失。项目可能取数值(例如血氧饱和度)、分类(例如意识水平)、二元(例如是否存在创伤)和混合值类型。标注过程涉及多位临床医生,并经过迭代修订以解决项目表述中的歧义,从而形成了一个结构丰富(尽管高度不平衡)的资源。该数据集旨在填补能够支持大语言模型在具体医疗应用中开发和使用的重要数据缺口。我们描述了数据收集协议、现场匿名化流程、语料库统计数据和标注方案。最后,我们提出了CRF填充作为一项新的结构化信息提取基准,并提供了基于Gemma-27B和MedGemma-27B的零样本基线。据我们所知,EDEN数据集是意大利语现有最大的免费临床笔记语料库。

英文摘要

We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of Italian hospitals. The corpus, in its current version, is composed of approximately 4 million fully anonymized clinical notes, covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant for two patient situations in emergency departments, dyspnea and loss of consciousness. Items may assume numerical values (e.g., for blood saturation), categorical (e.g., for level of consciousness), binary (e.g., for presence of traumas), and mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a relevant gap in data able to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymisation pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark, and provide zero-shot baselines from Gemma-27B and MedGemma-27B. To the best of our knowledge, the EDEN dataset is the largest freely available corpus of clinical notes existing for the Italian language.

2606.12608 2026-06-12 cs.CL cs.LG 新提交

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

购物推理基准:面向多轮对话购物助手的专家编写基准

Shuxian Fan, Seonwoo Min, Youna Hu, Botao Xia, Jayakrishnan Unnikrishnan, Rowan Musselmann, Yifan Gao, Qingyu Yin, Priyanka Nigam, Bing Yin

发表机构 * Amazon(亚马逊)

AI总结 提出一个由零售专家编写的525个任务的多轮对话购物推理基准,包含10863个加权评分标准,评估9个模型显示通过率仅57-77%,多轮任务性能下降4-18分。

详情
AI中文摘要

对话式购物助手现已服务数亿客户,但现有基准均未联合评估真实购物对话所需的开放式多轮推理、领域专业知识和标准级质量。购物推理在语言模型应用中独具特色。与事实性问答或可验证代码生成不同,它需要在多轮对话中平衡主观偏好、预算约束和跨产品权衡,这些能力在以往的电商和通用基准中缺失。我们引入了购物推理基准(Shopping Reasoning Bench),这是一个由零售领域专家编写的基准,包含525个任务(232个单轮,293个多轮)和10863个重要性加权的二元评分标准。这些标准组织在包含五个推理类别和十五个子类别的分类体系下,涵盖偏好细化、权衡分析和兼容性评估等多样化需求。对三个模型系列(GPT、Claude、Gemini)中九个模型的评估显示,整体通过率仅为57-77%。在多轮任务中,所有模型在可选的超越标准上的得分比必需标准低13-29分,并且随着对话进行,性能下降4-18分。这些差距表明,当前模型能处理基本购物辅助,但达不到专家级建议,使购物推理基准成为未来购物助手开发的挑战性测试平台。

英文摘要

Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10863 importance-weighted binary rubrics authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57--77% overall. On multi-turn missions, all models score 13--29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4--18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.
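One plausible way to turn importance-weighted binary rubrics into a single score is a weighted pass rate, sketched below. The aggregation formula, the `Rubric` fields, and the example weights are assumptions for illustration; the abstract does not specify how the benchmark combines its rubrics.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    description: str
    weight: float   # importance weight (hypothetical scale)
    passed: bool    # binary judgment for one model response


def weighted_pass_rate(rubrics):
    """Share of total importance weight carried by the rubrics that passed."""
    total = sum(r.weight for r in rubrics)
    if total == 0:
        raise ValueError("rubrics must carry positive total weight")
    return sum(r.weight for r in rubrics if r.passed) / total


# Hypothetical rubrics for a single shopping mission.
mission = [
    Rubric("Respects the stated budget", weight=3.0, passed=True),
    Rubric("Checks compatibility with the user's existing device", weight=2.0, passed=False),
    Rubric("Explains the trade-off between the two finalists", weight=1.0, passed=True),
]
score = weighted_pass_rate(mission)
```

With these made-up weights, failing the compatibility check costs a third of the score even though two of three rubrics pass, which is the effect importance weighting is meant to have.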

2606.12708 2026-06-12 cs.CL cs.AI 新提交

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

AfriSUD:用于评估非洲语言模型的依存树库集合

Happy Buzaaba, Cheikh Mouhamadou Bamba Dione, David Ifeoluwa Adelani, Sylvain Kahane, Kim Gerdes, Bruno Guillaume, Kevin Guan, Aremu Anuoluwapo, Naome A. Etori, Shamsuddeen Hassan Muhammad, Utitofon Inyang, Peter Nabende, David Sabiiti Bamutura, Andiswa Bukula, Chinedu Uchechukwu, Rooweither Mabuya, Idris Akinade, Christiane Fellbaum

发表机构 * Princeton University(普林斯顿大学) Laboratory for Artificial Intelligence, Princeton University(普林斯顿大学人工智能实验室) Gaston Berger University(加斯顿·伯杰大学) Mila, McGill University(麦吉尔大学米拉研究所) Canada CIFAR AI Chair(加拿大CIFAR人工智能教席) Paris Nanterre University(巴黎南泰尔大学) Paris-Saclay University(巴黎-萨克雷大学) CNRS(法国国家科学研究中心) Inria(法国国家信息与自动化研究所) LORIA(洛林计算机科学实验室) Université de Lorraine(洛林大学) University of Trento(特伦托大学) University of Minnesota–Twin Cities(明尼苏达大学双城分校) Imperial College London(伦敦帝国学院) Binghamton University(宾汉姆顿大学) Makerere University(马凯雷雷大学) Penn State University(宾夕法尼亚州立大学) Mbarara University of Science and Technology(姆巴拉拉科技大学) Chalmers University of Technology(查尔姆斯理工大学) University of Ibadan(伊巴丹大学) Nnamdi Azikiwe University(纳姆迪·阿齐基韦大学) South African Centre for Digital Language Resources(南非数字语言资源中心)

AI总结 为弥补非洲语言在NLP资源上的不足,构建了首个大规模九种非洲语言句法标注树库AfriSUD,评估多种模型发现显著句法差距。

详情
AI中文摘要

尽管非洲语言具有语言多样性和全球重要性,但在支持NLP的研究和资源中仍代表性不足。我们通过引入AfriSUD来弥合这一差距,这是首个大规模句法标注树库集合,涵盖九种多样的非洲语言,跨越撒哈拉以南非洲的主要语系和地区。采用表层句法通用依存(SUD)框架,我们社区主导的努力提供了高质量、经母语者验证的数据,捕捉了如黏着和声调等类型学关键特征。我们在AfriSUD上评估了多种模型,包括非Transformer基线、多语言预训练编码器和LLM,用于词性标注和依存句法分析。我们的结果揭示了显著的句法差距,模型在九种语言上仍表现出明显局限性,表明现有架构可能无法完全捕捉非洲语言句法的结构多样性。

英文摘要

Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support NLP. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker verified data that capture typological key features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap, where models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.

2606.12754 2026-06-12 cs.CL cs.AI 新提交

LLMs Can Better Capture Human Judgments--With the Right Prompts

LLMs 能更好地捕捉人类判断——使用合适的提示

Danica Dillion, Chen Cecilia Liu, Baihui Wang, Daniele Barolo, Tanmay Rajore, Niket Tandon, Pranathi Ravikumar, Kurt Gray

AI总结 通过简单提示策略,LLMs 能恢复人类反应的完整分布,并减少对措辞变化的敏感性,提升 AI-人类对齐。

详情
AI中文摘要

大型语言模型(LLMs)在捕捉人类判断方面是否表现不佳?两个常被提及的限制是:LLMs 无法捕捉反应的全分布,以及它们的判断在措辞变化上不稳定。我们展示了缓解这些限制的简单提示策略。在两个数据集上——一个代表美国的 144 个道德情景集,以及国际社会调查项目“家庭与性别角色变化”模块涵盖 32 个国家的 38 个道德信念——我们展示了简单的启发式技术如何帮助改善 AI-人类对齐。首先,提示模型报告标准差和反应比例,比常见策略更好地恢复了人类反应的完整范围。其次,确保情景对人类参与者清晰——如人类困惑评分所反映——提升了模型对齐度,且 LLMs 可以跟踪人类困惑评分。同时,我们发现 LLMs 对自身误差的估计校准不佳,尽管它们能相对较好地预测人类变异性。这些结果表明,向 LLMs 提出更好的问题可以得到更好的答案。

英文摘要

Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets--a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module covering 32 countries--we show how simple elicitation techniques help improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring scenarios are clear to human participants--as reflected in human confusion ratings--boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs' estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking better questions to LLMs can yield better answers.

2606.12789 2026-06-12 cs.CL cs.IR 新提交

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

RAG基准测试应该有多细粒度?一个用于合成问题生成的层次化框架

Chase M. Fensore, Kaustubh Dhole, Jason Fan, Eugene Agichtein, Joyce C. Ho

发表机构 * Department of Computer Science, Emory University(埃默里大学计算机科学系)

AI总结 提出HieraRAG层次化框架,通过合成问题生成研究RAG基准测试的细粒度,发现最优粒度因维度而异,并引入一致性比率度量。

详情
AI中文摘要

评估检索增强生成(RAG)系统需要能够捕捉多样化问题特征的基准测试,然而实践者缺乏关于在哪些维度上变化以及以何种粒度变化的经验指导。我们提出了HieraRAG,一个用于研究RAG基准测试构建中粒度的层次化框架,将最优粒度定义为在给定RAG配置下最大化区分能力(各类别生成质量的标准差)的水平。作为案例研究,我们从FineWeb-10BT中生成了5,872个合成问答对,涵盖3个维度(问题复杂度、答案类型、语言变异)和3个粒度级别(2、4和8个类别)。使用BM25+Falcon-3-10B流水线,最优粒度因维度而异:复杂度受益于细粒度区分(区分能力:0.053),而答案类型和语言变异在中等粒度达到峰值。我们引入了一致性比率度量来量化细粒度划分是否干净地细分父类别,揭示了维度间的结构差异(问题复杂度:0.40 vs. 答案类型:1.44)。对110个分层问答对的人工评估确认了合成质量。虽然这些具体发现反映的是单一配置,但HieraRAG为实践者提供了可移植的程序和验证度量,以确定其自身RAG设置中的评估粒度。

英文摘要

Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.
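The abstract defines discriminative power as the standard deviation of generation quality across categories. A minimal sketch of that computation follows, assuming per-category mean scores and a population standard deviation (the abstract does not say which variant is used); the category names and scores are invented.

```python
from statistics import pstdev


def discriminative_power(quality_by_category):
    """Population standard deviation of per-category mean quality scores."""
    means = [sum(scores) / len(scores) for scores in quality_by_category.values()]
    return pstdev(means)


def best_granularity(levels):
    """Pick the granularity level whose categories separate quality the most."""
    return max(levels, key=lambda name: discriminative_power(levels[name]))


# Hypothetical quality scores for one dimension at two granularity levels.
levels = {
    "2 categories": {
        "simple": [0.70, 0.74],
        "complex": [0.66, 0.70],
    },
    "4 categories": {
        "lookup": [0.80, 0.84],
        "comparison": [0.70, 0.74],
        "multi_hop": [0.58, 0.62],
        "aggregation": [0.66, 0.70],
    },
}
chosen = best_granularity(levels)
```

Running the same comparison per dimension is what lets the optimal level differ between, say, question complexity and answer type.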

2606.12790 2026-06-12 cs.CL 新提交

GENIE: A Fine-Grained Measure for Novelty

GENIE:一种细粒度新颖性度量方法

Ramya Namuduri, Manya Wadhwa, Anshun Asher Zheng, Greg Durrett, Junyi Jessy Li

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) New York University(纽约大学)

AI总结 提出GENIE指标,通过任务特定特征细粒度衡量模型生成内容的新颖性,克服整体指标无法捕捉高维新颖性的局限。

详情
AI中文摘要

大型语言模型在各项任务中持续表现出缺乏创造力和多样性。先前的工作主要关注模型是否能够生成创造性输出。本文旨在考虑新颖性,并以任务特定方式研究模型生成内容的新颖性。我们提出了一种细粒度评估指标GENIE,用于根据响应群体中的任务特定特征来衡量响应的新颖性。我们表明,与GENIE不同,整体指标难以捕捉新颖性的高维性,并且无法提供关于它们针对哪些属性的见解。最后,我们使用GENIE来衡量解决创造力问题的缓解方法的有效性,以更好地理解这些方法在哪些方面可以提高新颖性。

英文摘要

Large Language Models have consistently demonstrated a lack of creativity and diversity across tasks. Prior work has focused on addressing whether models are capable of generating creative outputs. Here, we aim to consider novelty and investigate what makes model-generated content novel or not novel in a task-specific manner. We propose a fine-grained evaluation metric GENIE to measure the novelty of responses along task-specific features with respect to a population of responses. We show that unlike GENIE, holistic metrics struggle to capture the high-dimensionality of novelty and do not provide insight on which properties they target. Finally, we use GENIE to measure the effectiveness of mitigation methods that address creativity to better understand where these methods can improve novelty.

2606.12922 2026-06-12 cs.CL cs.CY 新提交

Polar: A Benchmark for Evaluating Political Bias in LLMs

Polar: 评估大语言模型中政治偏见的基准

Sangho Kim, Heejin Kim, Yoonhee Park, Hyunggeun Jeon, Jaejin Lee

发表机构 * Graduate School of Data Science, Seoul National University(首尔大学数据科学研究生院) Dept. of Computer Science and Engineering, Seoul National University(首尔大学计算机科学与工程系)

AI总结 提出Polar基准,通过选项级似然度测量大语言模型的政治偏见,覆盖美国和韩国政治语境,发现偏见随语境、议题、模型组和语言变化。

Comments Submitted to ARR 2026 May cycle

详情
AI中文摘要

大语言模型(LLM)中的政治偏见日益显著,但在不同政治和语言背景下难以可重复地测量。我们引入了Polar,一个包含4,026个实例的多项选择基准,通过选项级似然度而非基于提示的生成来测量政治偏见。Polar覆盖了两个意识形态轴和来自Manifesto Project的八个议题类别,并在美国和韩国政治语境中并行评估模型。在38个LLM中,测量的偏见随政治语境、议题类别、模型组和呈现语言系统性地变化。所有模型在美国政治内容上倾向于左翼进步派,但在韩国内容上表现出更居中且混合的模式。翻译实验进一步表明,仅呈现语言就能改变测量的偏见。这些发现凸显了对LLM中政治偏见进行多语言和跨语境评估的必要性。

英文摘要

Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures political bias through option-level likelihoods rather than prompt-based generation. Polar covers two ideological axes and eight issue categories derived from the Manifesto Project, and evaluates models in parallel across U.S. and South Korean political contexts. Across 38 LLMs, measured bias varies systematically with political context, issue category, model group, and presentation language. All models lean left-progressive on U.S. political content, but show more centered and mixed patterns on South Korean content. Translation experiments further show that presentation language alone can shift measured bias. These findings highlight the need for multilingual and cross-contextual evaluation of political bias in LLMs.
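To make the idea of option-level likelihoods concrete, the sketch below compares the log-likelihoods a model assigns to each answer option instead of parsing generated text. It is only an illustration: the log-likelihood values are hypothetical, and the softmax normalization is an assumption, not Polar's documented scoring rule.

```python
import math

def option_probabilities(option_logliks):
    """Softmax over per-option log-likelihoods, stabilized by subtracting the max."""
    peak = max(option_logliks.values())
    weights = {opt: math.exp(ll - peak) for opt, ll in option_logliks.items()}
    total = sum(weights.values())
    return {opt: w / total for opt, w in weights.items()}

def preferred_option(option_logliks):
    """Option with the highest log-likelihood."""
    return max(option_logliks, key=option_logliks.get)

# Hypothetical log-likelihoods for three answer options of one item.
logliks = {"A": -12.3, "B": -11.1, "C": -14.0}
probs = option_probabilities(logliks)
print(preferred_option(logliks), {k: round(v, 3) for k, v in probs.items()})
```

Reading the preference from likelihoods avoids dependence on sampling temperature or output parsing, which is one reason such measurements tend to be easier to reproduce than free-form generation.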

2606.13100 2026-06-12 cs.CL 新提交

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

LEDGER:基于公司年报的长上下文基准,用于基于事实的金融检索与提取

Charles Moslonka, Amaury de Vitry, Arthur Garnier, Hicham Randrianarivo, Emmanuel Malherbe

发表机构 * Artefact Research Center(Artefact 研究中心) MICS, CentraleSupélec, Université Paris-Saclay(巴黎萨克雷大学中央理工高等电力学院 MICS 实验室) Ardian

AI总结 提出LEDGER基准,包含4,999份数字化公司年报,用于评估大语言模型在长上下文金融任务中的表现,涵盖KPI检索、单值查找和全量提取任务。

Comments 5 pages, 1 figure

详情
AI中文摘要

财务报告是大语言模型天然的试验场,而近期各种规模模型的长上下文能力使得在该领域进行严格评估的需求日益迫切。然而,大多数公开的金融资源将任务简化为纯文本的SEC 10-K文件,并配以少量问答项。我们发布了LEDGER(基于事实的提取与检索的长上下文文档评估),一个包含4,999份数字化公司年报的语料库——这些是包含图表、表格和叙述的完整文档,而不仅仅是监管文件。每份报告标注了31个合并的财务KPI,这些KPI需要被提取并与财报发布日的市场反应相关联。基于这些数据,我们推导出三个覆盖难度范围的评估基准:一个纯页面级别的KPI检索任务,包含118,048个自然语言问题及其TREC风格的相关性判断;一个对话式的“大海捞针”单值查找任务;以及一个完整的KPI提取任务,均基于长且数字密集的报告。此外,我们还提供了人工OCR质量标注(含标注者间一致性)、完整的提取、验证和评分工具链。我们进一步通过一个案例研究展示了该数据集的研究实用性,该案例将CEO信函修辞与发布后的市场影响联系起来。

英文摘要

Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question-answer items. We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports - full documents with figures, tables, and narrative, not just regulatory filings. Each report is labeled with 31 consolidated financial KPIs to be extracted and linked to the market's reaction at the earnings date. From this data we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 questions in natural language, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task, both from long, numerically dense reports. We additionally provide human OCR-quality annotations with inter-annotator agreement and the complete extraction, validation, and scoring toolchain. We further demonstrate the dataset's research utility with a case study linking CEO-letter rhetoric to post-publication market impact.

2606.13111 2026-06-12 cs.CL 新提交

MÖVE: A Holistic LLM Benchmark for the German Public Sector

MÖVE:德国公共部门的大语言模型整体基准

Camilla Dalerci, Thilo Michael, Robin Schaefer, Daniel Weinland

发表机构 * Innovations Department, Bundesdruckerei GmbH(德国联邦印钞公司创新部)

AI总结 提出MÖVE基准,从性能和治理两个维度评估39个LLM在德国公共部门的应用,发现无单一模型全面领先,模型大小非质量可靠指标。

详情
AI中文摘要

我们提出MÖVE(Modelle für die Öffentliche Verwaltung Evaluieren),一个用于评估德国公共部门背景下大语言模型(LLM)的整体基准。尽管LLM在公共管理中日益普及,但模型选择仍然很大程度上是临时的,现有基准提供的指导有限:它们主要面向英语、内容以美国为中心,并且只关注任务性能。MÖVE通过评估39个模型在两个互补维度上填补这些空白。性能标准涵盖摘要、问答和主题提取。治理标准评估幻觉倾向、能耗、提供商透明度、与德国宪法价值观的一致性以及对德国政党立场的知识。总共,我们使用了十个德语数据集,包括我们构建的反映公共管理领域的金标准和银标准数据集。我们采用多指标评估策略,结合经典NLP指标、基于嵌入的方法和LLM作为评判的方法。我们的结果表明,没有单一模型在所有标准上占主导地位:顶级表现者因任务而异,模型大小本身是质量的糟糕预测指标。我们进一步评估基准本身,分析其统计精度、LLM评判可靠性、私有数据集对模型排名的影响、结果对提示表述的敏感性以及能耗估计的有效性。MÖVE被设计为一个活跃开发中的动态基准;结果公开于 https://moeve.bundesdruckerei.de/。

英文摘要

We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, alignment with German constitutional values, and knowledge of the positions of German political parties. In total, we utilize ten German-language datasets, including gold- and silver-standard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. MÖVE is designed as a living benchmark under active development; results are publicly available at https://moeve.bundesdruckerei.de/.

2606.13120 2026-06-12 cs.CL 新提交

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

EvoBrowseComp: 基于演化知识的搜索智能体基准测试

Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng

发表机构 * Northeastern University, China(东北大学(中国)) Weixin AI, Tencent Inc, China(腾讯微信AI(中国))

AI总结 提出EvoBrowseComp,一个通过实时网络遍历自动生成400道英文和400道中文无污染复杂问题的演化基准,用于评估搜索智能体在动态知识环境中的真实浏览能力。

Comments 14 pages, under review

详情
AI中文摘要

搜索智能体——即增强搜索工具的大型语言模型——加剧了对未来验证基准的需求。现有的基准如BrowseComp依赖静态知识,容易受到测试集污染和参数记忆的影响。因此,模型可以通过事实回忆而非真正检索获得高分,通过推理捷径掩盖真实的浏览能力。在本文中,我们介绍EvoBrowseComp,一个包含400道英文和400道中文无污染复杂问题的演化基准,通过实时网络遍历合成。为了收集这些问题,我们设计了一个三智能体协作框架:(1)QA合成智能体,从实时网络中检索新鲜知识以合成问答对;(2)信息过滤智能体,根据可信度和流行度过滤检索到的知识,以阻断参数捷径;(3)高级指导智能体,将问题形式化为推理图,以减少合成问答对中的逻辑冗余和捷径。由于该框架支持全自动合成,EvoBrowseComp可以定期更新以防止数据污染并保持时间新鲜度。大量实验证实了其高难度,需要广泛的横向搜索。它为自动更新、高难度的基准测试建立了一个可扩展的范式,与不断发展的世界知识和不断进步的智能体能力保持同步。

英文摘要

Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, obscuring true browsing competence via reasoning shortcuts. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge in terms of credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm its great difficulty, requiring broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.

2606.13184 2026-06-12 cs.CL 新提交

LAUKIN: A Multi-jurisdictional Common Law Contract Dataset

LAUKIN:一个多司法管辖区的普通法合同数据集

Amrita Singh, Aditya Joshi, Jiaojiao Jiang, Hye-young Paik, May Fong Cheong

发表机构 * Computer Science and Engineering, UNSW, Sydney Australia(新南威尔士大学计算机科学与工程学院) Law and Justice, UNSW, Sydney Australia(新南威尔士大学法律与司法学院)

AI总结 针对跨国合同审查需求,构建了包含澳大利亚、英国和印度三地法律条款对的数据集LAUKIN,通过多阶段检索与人工标注实现法律等价性分类,基准测试显示跨司法管辖区分类具有挑战性。

Comments 5 pages, 2 figures, 4 tables

详情
AI中文摘要

跨国公司越来越需要跨司法管辖区的合同审查,但现有的法律NLP数据集大多局限于单一司法管辖区。我们引入了LAUKIN(澳大利亚、英国和印度的法律等价数据集),这是一个条款对(AU-UK、UK-IN、IN-AU)数据集,标注了布尔法律等价性。我们开发了一种新颖的多阶段检索和重排序流水线来构建初始条款对映射,随后由法律专家对部分条款对进行等价或不等价的标注。该数据集包含来自8种协议类型的204份合同的14,727个条款对,其中3,000个是手动标注的:900个训练集、600个开发集和1,500个测试集。我们评估了4种技术下的12个模型,最佳宏F1达到65.11%,使LAUKIN成为一个具有挑战性的基准。结果表明,尽管有共同的法律传统,但不同司法管辖区的起草惯例差异显著,使得跨司法管辖区的等价分类并非易事。LAUKIN还包括11,727个未标注的训练对,以支持未来法律NLP中的半监督学习研究。

英文摘要

Multinational companies increasingly require cross-jurisdictional contract review, yet existing legal NLP datasets are largely restricted to a single jurisdiction. We introduce LAUKIN (Legal equivalence dataset of Australia, UK, and INdia), a dataset of clause pairs (AU-UK, UK-IN, IN-AU) labelled for boolean legal equivalence. We develop a novel multi-stage retrieval and reranking pipeline to construct the initial clause pair mapping, with a subset of clause pairs subsequently annotated by legal experts as Equivalent or Not Equivalent. The dataset comprises 14,727 clause pairs from 204 contracts across 8 agreement types, of which 3,000 are manually labelled: 900 train, 600 dev, and 1,500 test. We evaluate 12 models across 4 techniques, achieving a best macro-F1 of 65.11%, establishing LAUKIN as a challenging benchmark. Results reveal that, despite shared legal heritage, drafting conventions diverge significantly across jurisdictions, making cross-jurisdictional equivalence classification non-trivial. LAUKIN also includes 11,727 unlabelled training pairs to support future semi-supervised learning research in legal NLP.

2606.13187 2026-06-12 cs.CL 新提交

A Context-Aware Dataset for Stance Detection in Bioethical Controversies on Reddit

Reddit生物伦理争议中立场检测的上下文感知数据集

Hu Huang, Genan Dai, Fuqiang Niu, Yi Yang, Zhaoya Gong, Bowen Zhang

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China(中国科学技术大学网络空间安全学院) School of Artificial Intelligence, Shenzhen Technology University(深圳技术大学人工智能学院) School of Urban Planning and Design, Peking University(北京大学城市规划与设计学院)

AI总结 提出BioStance数据集,包含39,600个Reddit生物伦理讨论中的帖子-评论对,覆盖六个争议目标,通过三类立场标注实现高可靠性,支持上下文感知的立场检测研究。

详情
AI中文摘要

生物伦理辩论越来越多地在社交媒体上展开,然而立场检测研究缺乏用于建模此类上下文依赖话语的大规模、领域特定资源。我们提出了BioStance,一个上下文感知的数据集,包含来自Reddit生物伦理讨论的39,600个带注释的帖子-评论对。BioStance涵盖了生物伦理争议三个维度上的六个有争议的目标:基本价值冲突、个人自由与集体责任,以及技术不确定性。每个实例保留了层次化的对话上下文,并由三位独立注释者使用三类立场方案进行标注:赞成、反对和无立场。注释的平均Krippendorff's α为0.82,表明可靠性较高。通过结合主题多样性、对话结构和高质量的人工注释,BioStance支持上下文感知的立场检测、论据挖掘和生物伦理话语的计算分析研究。

英文摘要

Bioethical debates increasingly unfold on social media, yet stance detection research lacks large-scale, domain-specific resources for modeling such context-dependent discourse. We present BioStance, a context-aware dataset of 39,600 annotated Post-Comment pairs from Reddit bioethical discussions. BioStance covers six controversial targets across three dimensions of bioethical controversy: fundamental value conflicts, individual liberty versus collective responsibility, and technological uncertainty. Each instance preserves hierarchical conversational context and is labeled by three independent annotators using a three-class stance scheme: Favor, Against, and None. The annotations achieve a mean Krippendorff's $\alpha$ of 0.82, indicating substantial reliability. By combining thematic diversity, conversational structure, and high-quality human annotation, BioStance supports research on context-aware stance detection, argument mining, and computational analysis of bioethical discourse.

2606.13216 2026-06-12 cs.CL cs.LG 新提交

Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization

分层最优传输用于神经机器翻译和抽象摘要中的幻觉检测

Mariia Onyshchuk, Maksym-Vasyl Tarnavskyi, Marta Sumyk

发表机构 * Fairseq AggreFact

AI总结 通过最优传输分析跨注意力分布,发现幻觉检测集中于解码器前四层,且该方法在源脱离时有效,但无法检测注意力下游的不忠实摘要。

Comments Accepted to ICML Mechanistic Interpretability Workshop 2026

详情
AI中文摘要

最优传输(OT)已被证明可以通过测量跨注意力分布与参考分布之间的几何距离来检测神经机器翻译(NMT)中的幻觉,无需任何监督。我们将此分析扩展到Fairseq DE-EN模型的所有六个解码器层($N=3{,}414$),表明Wass-to-Unif和Wass-to-Data是互补的检测器,专门针对不同类型的幻觉;检测集中在L1--L4层,而L5层对较微妙的类型具有反预测性;并且幻觉翻译缺乏正确翻译从第一步解码开始就存在的探索性注意力阶段。我们进一步评估了几何信号是否可迁移到抽象摘要忠实性检测:在AggreFact($N=1{,}116$)上,我们的无监督OT检测器在CNN/XSum上达到$57.2\%$/$57.6\%$的平衡准确率——高于随机水平,但远低于有监督的MiniCheck-Flan-T5-L($69.9\%$/$74.3\%$)。这种差距是原则性的:与NMT幻觉不同,不忠实的摘要可以正确关注源标记,同时歪曲其内容,这种失败模式在基于集中度的OT指标中由于构造原因而不可见。在T5-base上的结构实验证实了解码器在深度上的一致组织,其中第3层显示峰值集中度,第12层对生成质量最为关键。总之,结果确立了当失败模式是源脱离时,跨注意力的OT是一种可靠的检测器;无论任务如何,它都是一种原则性的可解释性工具;而当忠实性失败发生在注意力下游时,它则具有根本局限性。

英文摘要

Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. We extend this analysis to all six decoder layers of the Fairseq DE-EN model ($N=3{,}414$), showing that Wass-to-Unif and Wass-to-Data are complementary detectors specialised across hallucination types, that detection is concentrated in layers L1--L4 with L5 anti-predictive for subtler types, and that hallucinated translations lack the exploratory attention phase present in correct translations from the first decoding step. We further evaluate whether the geometric signal transfers to abstractive summarization faithfulness detection: our unsupervised OT detector on AggreFact ($N=1{,}116$) achieves $57.2\%$/$57.6\%$ balanced accuracy on CNN/XSum -- above chance but substantially below supervised MiniCheck-Flan-T5-L ($69.9\%$/$74.3\%$). This gap is principled: unlike NMT hallucinations, unfaithful summaries can attend correctly to source tokens while misrepresenting their content, a failure mode invisible to concentration-based OT metrics by construction. Structural experiments on T5-base confirm consistent decoder organisation across depth, with Layer 3 showing peak concentration and Layer 12 being most critical for generation quality. Together, the results establish OT on cross-attention as a reliable detector when the failure mode is source disengagement, a principled interpretability tool regardless of task, and fundamentally limited when faithfulness failures occur downstream of attention.
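The abstract above reports balanced accuracy and compares it with chance. A short Python sketch shows why that metric is used for imbalanced detection tasks: it averages per-class recall, so a detector that always predicts the majority class scores 0.5 rather than a misleadingly high plain accuracy. The labels below are hypothetical and are not taken from AggreFact.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)

def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 1 = unfaithful (rare), 0 = faithful (common).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_faithful = [0] * len(y_true)

print(accuracy(y_true, always_faithful))           # plain accuracy
print(balanced_accuracy(y_true, always_faithful))  # chance level for two classes
```

In this toy case the trivial detector reaches 0.8 plain accuracy but only 0.5 balanced accuracy, which is the chance level against which the reported scores should be read.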

2606.13218 2026-06-12 cs.CL 新提交

When Similar Means Different: Evaluating LLMs on Arabic--Hebrew Cognates

当相似意味着不同:评估大语言模型在阿拉伯语-希伯来语同源词上的表现

Junhong Liang, Noor Abo Mokh, Bashar Alhafni

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 针对阿拉伯语和希伯来语同源词、假朋友和借词,构建SemCog Bench基准(1858对词对),评估LLM跨语言语义理解,发现模型依赖表面形式相似性,在假朋友和借词上表现差,上下文帮助有限。

详情
AI中文摘要

阿拉伯语和希伯来语作为密切相关的闪米特语言,共享大量真正的同源词、误导性的假朋友和现代借词。这种重叠对大语言模型(LLM)的跨语言语义理解构成了挑战。为了评估这一能力,我们引入了SemCog Bench,这是一个精心策划的基准,包含1,858个阿拉伯语-希伯来语词对,并带有用于同源词识别和语义消歧的句子级注释。我们评估了开源和商业LLM在多种输入表示(原始、带变音符号、罗马化和音标)下的表现,揭示了跨语言推理中的关键差距。虽然模型在真正的同源词上达到了高准确率,但在假朋友和借词上性能急剧下降,反映出对表面形式相似性的强烈依赖。此外,句子级上下文仅带来微小的改进,表明仅靠上下文线索不足以克服误导性的形式信号。这些发现揭示了当前LLM在解决跨语言形式-意义冲突方面的根本局限性,并将SemCog Bench确立为多语言语义推理的严格基准。我们的代码和数据已公开。

英文摘要

Arabic and Hebrew, as closely related Semitic languages, share a substantial lexicon of true cognates, misleading false friends, and modern loanwords. This overlap poses a challenge for cross-lingual semantic understanding in large language models (LLMs). To evaluate this capability, we introduce SemCog Bench, a curated benchmark of 1,858 Arabic--Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. We evaluate open-source and commercial LLMs across multiple input representations (raw, diacritized, Romanized, and phonetic) and reveal a critical gap in cross-lingual reasoning. While models achieve high accuracy on true cognates, performance drops sharply on false friends and loanwords, reflecting a strong reliance on surface-form similarity. Furthermore, sentence-level context yields only modest improvements, suggesting that contextual cues alone are insufficient to overcome misleading form-based signals. These findings reveal a fundamental limitation of current LLMs in resolving cross-lingual form--meaning conflicts and establish SemCog Bench as a rigorous benchmark for multilingual semantic reasoning. Our code and data are publicly available.

2606.13254 2026-06-12 cs.CL 新提交

Evaluating Pluralism in LLMs through Latent Perspectives

通过潜在视角评估LLM中的多元主义

Laura Majer, Jan Šnajder, Martin Tutek

发表机构 * University of Helsinki(赫尔辛基大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出一种领域无关的多层无监督框架,从LLM生成文本中提取潜在视角,评估多元主义差距,发现稀有视角仍被不成比例地低估。

Comments Pluralistic Alignment Workshop @ ICML 2026

详情
AI中文摘要

对代表多样化视角的需求日益增长,增加了对多元主义LLM生成的兴趣。尽管难以操作化,但识别文本中表达的视角将为多元主义对齐提供明确指导,并更清晰地阐明LLM生成中的多元主义差距。虽然模型已被证明会减少训练数据的多样性并生成同质化内容,但这主要是在多项选择问卷或使用自由文本的高层特征上得到证明。在本文中,我们介绍并实现了一个领域无关的多层无监督框架,用于提取适合识别LLM生成文本中多元主义差距的视角。我们在书评(一个高度意见化、代表多样化视角的数据集)上评估了该框架,并比较了各种提示和模型。我们的结果表明,虽然一些模型和提示技术接近覆盖广泛的视角,但稀有视角仍然不成比例地被低估,导致分布偏离人类文本。

英文摘要

The growing need to represent diverse perspectives has increased interest in pluralistic LLM generation. Although difficult to operationalize, identifying perspectives expressed in text would provide clear guidance on pluralistic alignment and more clearly articulate the pluralistic gap in LLM generation. While models have been shown to reduce the diversity of training data and generate homogeneously, this has been demonstrated primarily on multiple-choice questionnaires or using high-level characteristics of free-form text. In this paper, we introduce and implement a domain-agnostic multi-layered framework for unsupervised extraction of perspectives suitable for identifying the pluralistic gap in LLM-generated text. We evaluate our framework on book reviews, a highly opinionated dataset representing diverse perspectives, and compare various prompts and models. Our results show that while some models and prompting techniques come close to covering a broad spectrum of perspectives, rarer perspectives remain disproportionately underrepresented, resulting in distributions that diverge from human text.

2606.13647 2026-06-12 cs.CL cs.AI cs.LG 新提交

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

SkMTEB:斯洛伐克大规模文本嵌入基准与模型适配

Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová

发表机构 * Comenius University in Bratislava(布拉迪斯拉发夸美纽斯大学) Cisco Systems(思科系统) Technical University of Košice(科希策技术大学) Kempelen Institute of Intelligent Technologies(肯佩伦智能技术研究所)

AI总结 针对低资源西斯拉夫语斯洛伐克语,构建首个MTEB风格文本嵌入基准SkMTEB(含31个数据集、7类任务),并开发高效本地部署模型e5-sk-small/large,通过词汇裁剪与微调在参数减少62%下达到与商业API相当的竞争力。

Comments ACL 2026

详情
AI中文摘要

我们介绍了SkMTEB,这是首个针对斯洛伐克语(一种低资源西斯拉夫语)的全面MTEB风格文本嵌入基准,包含31个数据集,覆盖7种任务类型——几乎是现有斯洛伐克语多语言基准覆盖深度的4倍。我们对31个嵌入模型的评估表明,大型指令调优多语言模型表现最强,而现有的针对NLU任务训练的斯洛伐克语特定模型在嵌入任务上迁移效果不佳。为了满足高效、可本地部署的斯洛伐克语嵌入需求,我们通过对多语言E5模型进行词汇裁剪和微调,开发了\texttt{e5-sk-small}(45M参数)和\texttt{e5-sk-large}(365M)模型。尽管模型尺寸缩小了高达62%,我们的开源模型在性能上与专有API相当,同时仍可本地部署用于语义搜索和检索增强生成(RAG)。我们公开了基准、模型、数据集和代码,希望我们的方法能为其他资源匮乏的语言提供可复现的路径。

英文摘要

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62\%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

2606.12433 2026-06-12 cs.CY cs.CL 交叉投稿

Marginal Alignment Does Not Guarantee Joint-Distribution Fidelity: An Official-Reference Audit of Nemotron-Personas-Korea with Cross-Locale Replication

边缘对齐不能保证联合分布保真度:基于官方参考的Nemotron-Personas-Korea审计与跨区域复制

Joonhyung Bae

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)

AI总结 提出独立性假设足迹(IAF)审计方法,用于检查合成人物数据集中的联合分布保真度;应用于NVIDIA Nemotron-Personas-Korea,发现其边缘分布对齐但三个联合分布失败。

详情
AI中文摘要

合成人物数据集声称与官方人口统计数据对齐作为信任基础,但下游用户将其作为年龄、性别、地区、职业、教育、姓名和机构地位等联合结构使用。边缘对齐并不意味着这些联合结构得以保留。我们提出独立性假设足迹(IAF),这是一种审计原语,作用于数据集卡片本身记录为独立处理的属性组合。对于每个这样的组合,IAF将合成联合分布与外部官方或机构参考进行比较,使用直接联合表(如果可用)或规则隐含检查。应用于NVIDIA Nemotron-Personas-Korea(一百万韩国合成人物),IAF发现NPK与KOSIS边缘分布对齐,但三个联合分布失败。主要职业分布与KEIS毕业生总体存在较大的条件不匹配。兵役年龄分布在机构上不一致。男性主导职业中的女性代表被过度拉平至接近平等,严格筛选判定依赖于映射,且在直接标准化下对年龄稳健。跨六个额外NPK区域的迁移性演示发现诊断结果依赖于区域而非通用,参考分类基数混淆了跨区域标志计数。因此,对于用作硅样本的合成人物,边缘声明必须与基于披露的联合审计配对后才能重用。发布的审计工件(参考清单、职业交叉表、衍生指标、可重复性脚本)在NPK系列上实例化此协议,并发布用于其他合成人物资源的目标重定向。

英文摘要

Synthetic persona datasets cite alignment with official demographics as a basis for trust, yet downstream users consume them as joint structures across age, sex, region, occupation, education, name, and institutional status. Marginal alignment does not imply that these joints are preserved. We propose the Independence-Assumption Footprint (IAF), an audit primitive that operates on the attribute combinations a dataset card itself documents as treated independently. For each such combination, IAF compares the synthetic joint against an external official or institutional reference, using direct joint tables where available and rule-implied checks otherwise. Applied to NVIDIA Nemotron-Personas-Korea (one million Korean synthetic personas), IAF finds that NPK aligns with KOSIS marginals while three joints fail. The major-by-occupation distribution against the KEIS graduate universe carries a large conditional mismatch. The age profile of military service is institutionally inconsistent. Female representation in male-dominated occupations is substantially over-flattened toward parity, with the strict screening verdict mapping-dependent and age-robust under direct standardisation. A transferability demonstration across six further NPK locales finds locale-dependent rather than universal diagnostics, with reference-taxonomy cardinality confounding cross-locale flag counts. For synthetic personas used as silicon samples, marginal claims must therefore be paired with disclosure-anchored joint audits before reuse. The released audit artefacts (reference manifests, occupational crosswalks, derived metrics, reproducibility scripts) instantiate this protocol on the NPK family and are released for retargeting at other synthetic persona resources.
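The central claim of the abstract above, that matching marginals does not guarantee matching joints, can be checked on a toy two-attribute table. The sketch below is not the IAF procedure: the attribute values and probabilities are hypothetical, and total variation distance is an assumed choice of mismatch measure. The "synthetic" table is built by sampling the two attributes independently, which is the kind of independence assumption the audit targets.

```python
from collections import defaultdict

def marginals(joint):
    """Row and column marginals of a joint distribution keyed by (row, col)."""
    rows, cols = defaultdict(float), defaultdict(float)
    for (r, c), p in joint.items():
        rows[r] += p
        cols[c] += p
    return dict(rows), dict(cols)

def independent_joint(joint):
    """Joint obtained by treating the two attributes as independent."""
    rows, cols = marginals(joint)
    return {(r, c): rows[r] * cols[c] for r in rows for c in cols}

def total_variation(p, q):
    """Total variation distance between two distributions on the same cells."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical reference joint of sex x occupation.
reference = {("F", "nurse"): 0.40, ("F", "engineer"): 0.10,
             ("M", "nurse"): 0.10, ("M", "engineer"): 0.40}
synthetic = independent_joint(reference)

print(marginals(reference))
print(marginals(synthetic))
print(round(total_variation(reference, synthetic), 3))
```

Both tables have identical marginals (0.5 for every attribute value), yet the joint distributions differ by a total variation distance of about 0.3, because the independent construction flattens every cell to 0.25.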

2606.13477 2026-06-12 cs.LG cs.AI cs.CL 交叉投稿

SupraBench: A Benchmark for Supramolecular Chemistry

SupraBench: 超分子化学基准

Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun, Ziming Li, Connor R. Schmidt, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) University of Connecticut(康涅狄格大学)

AI总结 为评估大语言模型在超分子化学推理中的能力,与领域专家合作发布了首个超分子基准SupraBench,包含四个基本任务和一个辅助视觉任务,并提供了16M令牌的语料库SupraPMC。

详情
AI中文摘要

超分子化学,包括非共价主客体组装的研究,推动了各种应用的发展。然而,设计主客体系统仍然耗时,每个候选对需要数天的干实验室验证。尽管LLMs已成为一种快速的替代方案,在分子结合任务上表现出色,但目前尚无基准系统性地评估LLMs在超分子化学基本任务(如结合亲和力预测)中的主客体推理能力。为此,我们与领域专家合作发布了首个超分子基准,称为SupraBench,用于评估LLMs在化学推理中的表现。具体来说,我们设计了四个基本任务,即结合亲和力预测、最佳结合物选择、溶剂识别和主客体描述,以及一个辅助的基于视觉的分子识别任务。我们还发布了SupraPMC,一个从Europe PMC中提取的经过整理的1600万令牌的超分子化学文章语料库,以支持对超分子领域的适应。我们对一系列开源和专有LLMs进行了基准测试,发现LLMs在所有任务上都有很大的提升空间。在SupraPMC上的领域自适应预训练可以干净地迁移到分布内回归,但会与严格的字母格式输出进行权衡。此外,不同任务家族的难度分布差异很大,揭示了不同的失败模式,表明当前超分子化学推理中存在特定的差距。我们的源代码和基准数据集可在以下网址获取:https://github.com/Tianyi-Billy-Ma/SupraBench。

英文摘要

Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles distilled from Europe PMC, to support adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain-adaptive pretraining on SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source code and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.

2606.13581 2026-06-12 cs.CY cs.CL cs.HC physics.soc-ph 交叉投稿

The Tone of Awareness: Topic, Sentiment, and Toxicity Maps During Mental Health Month on TikTok

意识基调:TikTok 心理健康月期间的主题、情感和毒性地图

Henrique Ferraz de Arruda, Andreia Sofia Teixeira, Pranay Gundala Reddy, Anindya Mondal, Kleber Andrade Oliveira, Filipi Nascimento Silva

发表机构 * Institute for Biocomputation and Physics of Complex Systems (BIFI)(生物计算与复杂系统物理研究所) University of Zaragoza(萨拉戈萨大学) ARAID Foundation(ARAID基金会) Network Science Institute(网络科学研究所) Northeastern University London(伦敦东北大学) Kent Medway Medical School(肯特与梅德韦医学院) LASIGE(拉西格研究所) Faculdade de Ciências da Universidade de Lisboa(里斯本大学科学学院) Department of Psychology, University of Limerick(利默里克大学心理学系) Observatory on Social Media, Indiana University(印第安纳大学社交媒体观察站) CSSI - Kellogg School of Management, Northwestern University(西北大学凯洛格管理学院 CSSI)

AI总结 通过分析 TikTok 2023-2024 年心理健康月期间的视频和评论,使用 BERTopic 提取主题、XLM-T 和 Detoxify 量化情感与毒性,发现视频情感偏负面而评论更混合,毒性在评论中呈长尾分布且集中于特定主题。

Comments 12 pages, 6 figures

详情
AI中文摘要

尽管人们担忧使用 TikTok 对心理健康的影响,但关于创作者如何构建相关内容以及受众如何接收这些内容,我们知之甚少。我们通过 TikTok 研究 API 收集了 2023 年和 2024 年心理健康意识月(5月)的 28,341 个 TikTok 视频和 80,130 条评论的内容,并研究了意识基调在不同主题和年份间的变化。我们将“基调”定义为心理健康话语的情感和人际框架,通过情感和毒性度量来操作化。我们使用 BERTopic 和对数几率关键词从视频文本中提取主题,然后分别对视频转录和评论量化主题条件下的情感(XLM-T)和毒性(Detoxify)。情感捕捉内容的效价,而毒性反映有害或辱骂性语言的存在。我们发现跨年份存在一组稳定的重复主题,涵盖临床状况、情感披露、自我护理和活动导向内容,且参与度高度偏向一小部分主题。所有情感和毒性分析均分别针对视频内容和评论进行计算,使我们能够区分内容生产和受众接收。视频中的情感对于情感强烈的主题通常是负面的,而评论则倾向于转向更混合或积极的极性,尤其是对于自杀预防。毒性总体中位数较低,但评论中的长尾异常值比视频中更为明显,并集中在特定主题(例如“Duet”、“Suicide Prevention”和“Psychisch”)。总体而言,我们的结果提供了意识月活动期间 TikTok 上心理健康话语的主题级分解。

英文摘要

Despite growing concerns about the mental health effects of TikTok use, little is known about how related content is framed by creators and received by audiences. We collect the content of 28,341 TikTok videos and 80,130 comments from Mental Health Awareness Month (May) in 2023 and 2024 via the TikTok Research API, and study how the tone of awareness varies across topics and years. We characterize "tone" as the emotional and interpersonal framing of mental health discourse, operationalized through sentiment and toxicity measures. We extract topics from video text using BERTopic and log-odds keywords, then quantify topic-conditioned sentiment (XLM-T) and toxicity (Detoxify) separately for video transcriptions and comments. Sentiment captures the affective valence of content, while toxicity reflects the presence of harmful or abusive language. We find a stable set of recurring themes across years, spanning clinical conditions, emotional disclosure, self-care, and campaign-oriented content, with engagement highly skewed toward a small subset of topics. All sentiment and toxicity analyses are computed separately for video content and comments, allowing us to distinguish between content production and audience reception. Sentiment in videos is often negative for emotionally charged topics, while comments tend to shift toward more mixed or positive polarity, especially for suicide prevention. Median toxicity is low overall, but comments exhibit longer-tailed outliers than videos, and these outliers are concentrated in specific topics (e.g., "Duet", "Suicide Prevention", and "Psychisch"). Overall, our results provide a topic-level decomposition of mental health discourse on TikTok during awareness-month campaigns.

2503.06573 2026-06-12 cs.CL cs.AI 版本更新

WildIFEval: Instruction Following in the Wild

WildIFEval: 野外指令遵循

Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor

发表机构 * The Hebrew University of Jerusalem(耶路撒冷希伯来大学) IBM Research(IBM研究院)

AI总结 提出WildIFEval数据集,包含7K条真实用户的多约束指令,用于评估LLM的指令遵循能力,发现所有模型仍有较大改进空间。

Comments Accepted to the 5th Workshop on Generation, Evaluation and Metrics (GEM) at ACL 2026

详情
AI中文摘要

最近的LLMs在遵循用户指令方面取得了显著成功,但处理具有多个约束的指令仍然是一个重大挑战。在这项工作中,我们引入了WildIFEval——一个包含7K条真实用户指令的大规模数据集,这些指令具有多样化的多约束条件。与以往的数据集不同,我们的收集涵盖了广泛的词汇和主题约束范围,这些约束是从自然用户指令中提取的。我们将这些约束分为八个高级类别,以捕捉它们在现实场景中的分布和动态。利用WildIFEval,我们进行了大量实验来评估领先LLMs的指令遵循能力。WildIFEval清晰地区分了小型和大型模型,并表明所有模型在此类任务上仍有很大的改进空间。我们分析了约束数量和类型对性能的影响,揭示了模型约束遵循行为的有趣模式。我们发布数据集以促进在复杂现实条件下指令遵循的进一步研究。

英文摘要

Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns of model constraint-following behavior. We release our dataset to promote further research on instruction-following under complex, realistic conditions.

2505.23823 2026-06-12 cs.CL 版本更新

RAGPPI: RAG Benchmark for Protein-Protein Interactions in Drug Discovery

RAGPPI:药物发现中蛋白质-蛋白质相互作用的RAG基准

Youngseung Jeon, Ziwen Li, Thomas Li, JiaSyuan Chang, Morteza Ziyadi, Xiang 'Anthony' Chen

发表机构 * University of California Los Angeles(加州大学洛杉矶分校) Palo Alto High School(帕洛阿尔托高中) Amazon AGI(亚马逊人工智能研究院)

AI总结 提出RAGPPI基准,包含4420个问答对,用于评估检索增强生成在药物发现中识别蛋白质-蛋白质相互作用生物学影响的能力。

Comments 17 pages, 4 figures, 8 tables

详情
Journal ref
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)
AI中文摘要

检索蛋白质-蛋白质相互作用(PPI)的生物学影响对于药物开发中的靶点识别(Target ID)至关重要。由于涉及的蛋白质数量庞大,这一过程仍然耗时且具有挑战性。大型语言模型(LLMs)和检索增强生成(RAG)框架已支持靶点识别;然而,目前尚无用于识别PPI生物学影响的基准。为填补这一空白,我们引入了PPI的RAG基准(RAGPPI),这是一个包含4420个问答对的事实性问答基准,专注于PPI的潜在生物学影响。通过与专家访谈,我们确定了基准数据集的标准,例如问答类型和来源。我们通过专家驱动的数据标注构建了金标准数据集(500个问答对)。我们开发了一个集成自动评估LLM,该模型结合了专家标注特征、平均事实-摘要相似度(F1)和低相似度事实计数(F2),从而构建了银标准数据集(3720个问答对)。我们致力于维护RAGPPI作为支持研究社区推进药物发现问答解决方案的RAG系统的资源。

英文摘要

Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as a type of QA and source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that incorporates expert labeling characteristics, average fact-abstract similarity (F1), and low-similarity fact counts (F2), enabling the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.
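The two numeric features named in the abstract, average fact-abstract similarity (F1) and low-similarity fact count (F2), can be illustrated with a minimal sketch. The abstract does not say how similarity is computed or where the "low" cutoff lies, so the function name `fact_features`, the per-fact scores in [0, 1], and the `low_threshold` value below are all assumptions made for illustration.

```python
from statistics import mean


def fact_features(similarities, low_threshold=0.5):
    """Return (F1, F2) for one generated answer.

    similarities: one score in [0, 1] per fact extracted from the answer,
                  measuring how well that fact matches the source abstracts
                  (how the score is obtained is outside this sketch).
    low_threshold: hypothetical cutoff below which a fact counts as
                   "low-similarity"; the abstract does not give a value.
    """
    if not similarities:
        raise ValueError("need at least one fact similarity")
    f1 = mean(similarities)                                   # average fact-abstract similarity
    f2 = sum(1 for s in similarities if s < low_threshold)    # low-similarity fact count
    return f1, f2


f1, f2 = fact_features([0.9, 0.8, 0.1])
print(f"F1={f1:.2f}, F2={f2}")
```

An answer with a high F1 and an F2 of zero would be the kind of candidate an automatic evaluator might accept; how RAGPPI actually combines these features with expert labeling characteristics is not stated in the abstract.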

2507.20208 2026-06-12 cs.CL 版本更新

From Benchmarks to Skills: Low-Rank Factors for LLM Evaluation

从基准到技能:LLM评估的低秩因子

Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty

发表机构 * Bar-Ilan University(巴伊兰大学) OriginAI Data Science Institute, Columbia University(哥伦比亚大学数据科学研究所) Center for Data Science, New York University(纽约大学数据科学中心)

AI总结 通过因子分析发现LLM基准性能矩阵本质低秩,揭示任务冗余,提出基于潜在技能空间的评估框架,用于识别冗余任务、用小任务子集建模新模型和按技能轮廓选模型。

详情
AI中文摘要

当前对大型语言模型(LLM)的评估严重依赖于不断增长的基准集合和聚合基准分数,然而这种比较实际捕捉了什么,以及这些分数揭示了模型的哪些底层能力,仍不清楚。在此,我们提出了一种新的LLM评估范式,通过询问基准性能是反映许多独立能力,还是依赖于少量共享维度。为了回答这个问题,我们将因子分析(FA)应用于LLM与基准的大规模性能矩阵(60×44),揭示了该矩阵的固有低秩结构。也就是说,少量潜在因子捕捉了完整任务空间中的大部分结构。这种低秩几何揭示了现有任务之间存在大量冗余,并解释了为什么许多基准似乎测量了重叠的能力。我们进一步表明,这些潜在因子对应于连贯的、类似技能的LLM行为维度。利用这个潜在技能空间,我们为LLM评估和下游用户提供了三个实用工具:(i)识别冗余任务,(ii)使用少量任务子集对新模型进行画像,以及(iii)选择与所需技能轮廓一致的模型。我们的方法为单一聚合分数的事实标准提供了一个可靠的替代方案,并建立了一个可解释且实用的框架,用于理解和基准测试LLM的核心能力。

英文摘要

Current evaluations of large language models (LLMs) rely heavily on a growing collection of benchmarks and on aggregate benchmark scores, yet it remains unclear what this comparison actually captures, and what these scores reveal about models' underlying capabilities. Here, we propose a new paradigm for LLM evaluation, by asking whether benchmark performance reflects many independent abilities, or rather relies on a small number of shared dimensions. To answer this, we apply Factor Analysis (FA) to a massive performance matrix of LLMs versus benchmarks (60×44), revealing an intrinsically low-rank structure of that matrix. That is, a small number of latent factors captures most of the structure in the full task space. This low-rank geometry reveals substantial redundancy across existing tasks and explains why many benchmarks appear to be measuring overlapping abilities. We further show that these latent factors correspond to coherent, skill-like dimensions of LLM behavior. Leveraging this latent skill-space, we deliver three practical tools for LLM evaluation and downstream users: (i) identifying redundant tasks, (ii) profiling new models using a small subset of tasks, and (iii) selecting models aligned with desired skill profiles. Our method provides a solid alternative to the de-facto standard of a single aggregate score, and establishes an interpretable and practical framework for understanding and benchmarking LLM core capabilities.
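What "intrinsically low-rank" means for a 60×44 model-by-benchmark matrix can be illustrated with a small sketch. This is not the paper's Factor Analysis: it swaps in a plain singular-value check computed by power iteration, and it runs on synthetic data generated from three hypothetical "skills" plus noise. The names `scores`, `skills` and `loadings`, and the choice of three factors, are assumptions for illustration only.

```python
import random


def center_columns(M):
    """Subtract each column's mean so that variance is measured per benchmark."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[x - m for x, m in zip(row, means)] for row in M]


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def top_singular_values(M, k, iters=100, seed=0):
    """Approximate the k largest singular values by power iteration with deflation."""
    rng = random.Random(seed)
    M = [row[:] for row in M]
    sigmas = []
    for _ in range(k):
        Mt = [list(col) for col in zip(*M)]
        v = [rng.gauss(0, 1) for _ in M[0]]
        for _ in range(iters):
            w = matvec(Mt, matvec(M, v))          # one step of (M^T M) v
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        u = matvec(M, v)                          # u = M v, so |u| is the singular value
        sigmas.append(sum(x * x for x in u) ** 0.5)
        # Deflate: remove the component of M along direction v.
        M = [[a - ui * vj for a, vj in zip(row, v)] for row, ui in zip(M, u)]
    return sigmas


# Synthetic stand-in for a 60 (models) x 44 (benchmarks) score matrix.
rng = random.Random(42)
n_models, n_benchmarks, n_skills = 60, 44, 3
skills = [[rng.gauss(0, 1) for _ in range(n_skills)] for _ in range(n_models)]
loadings = [[rng.gauss(0, 1) for _ in range(n_benchmarks)] for _ in range(n_skills)]
scores = [
    [sum(s[k] * loadings[k][j] for k in range(n_skills)) + rng.gauss(0, 0.05)
     for j in range(n_benchmarks)]
    for s in skills
]

X = center_columns(scores)
total = sum(x * x for row in X for x in row)
sigmas = top_singular_values(X, n_skills)
explained = sum(s * s for s in sigmas) / total
print(f"share of variance captured by {n_skills} components: {explained:.3f}")
```

If a handful of components captures nearly all of the variance, many benchmark columns are close to linear combinations of one another, which is the kind of redundancy the abstract describes. On real leaderboard data the number of useful factors would have to be chosen from the spectrum rather than fixed in advance.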

2510.16380 2026-06-12 cs.CL cs.AI cs.CY cs.HC cs.LG 版本更新

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

MoReBench:评估语言模型中的程序性和多元道德推理,超越结果

Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Raphaël Millière, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Conor Downey, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine

发表机构 * University of Washington(华盛顿大学) New York University(纽约大学) Scale AI Harvard University(哈佛大学) University of Michigan(密歇根大学) UNC Chapel Hill(北卡罗来纳大学教堂山分校) Center for AI Safety(人工智能安全中心) Stanford University(斯坦福大学) MIT(麻省理工学院) University of Oxford(牛津大学)

AI总结 提出MoReBench基准,包含1000个道德场景和超过2.3万条标准,用于评估语言模型在道德推理中的程序性推理能力,发现现有基准无法预测模型表现,且模型对特定道德框架存在偏好。

Comments 46 pages, 8 figures, 10 tables. Published in ICLR 2026. Accepted at CHAI workshop and SPP 2026 (non-archival)

详情
AI中文摘要

随着人工智能系统的进步,我们越来越依赖它们与我们共同或代替我们做出决策。为了确保这些决策符合人类价值观,我们不仅需要理解它们做出了什么决策,还需要理解它们如何得出这些决策。推理语言模型能够提供最终响应和(部分透明的)中间思考轨迹,这为研究AI的程序性推理提供了及时的机会。与通常有客观正确答案的数学和代码问题不同,道德困境是过程导向评估的绝佳测试平台,因为它们允许多种可辩护的结论。为此,我们提出了MoReBench:包含1000个道德场景,每个场景配有一组专家认为在推理该场景时必须包含(或避免)的评分标准。MoReBench包含超过2.3万条标准,包括识别道德考量、权衡利弊以及给出可操作的建议,覆盖了AI为人类道德决策提供建议以及自主做出道德决策的情况。此外,我们整理了MoReBench-Theory:150个示例,用于测试AI是否能在规范伦理学的五个主要框架下进行推理。我们的结果表明,规模定律以及现有的数学、代码和科学推理任务基准无法预测模型进行道德推理的能力。模型还显示出对特定道德框架(例如边沁式的行为功利主义和康德义务论)的偏好,这可能是流行训练范式的副作用。这些基准共同推动了面向过程推理的评估,以实现更安全、更透明的AI。

英文摘要

As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases on AI advising humans moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

2510.16928 2026-06-12 cs.CL 版本更新

ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

ChiKhaPo: 一个用于评估大型语言模型词汇理解与生成能力的大规模多语言基准

Emily Chang, Niyati Bafna

发表机构 * Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) Johns Hopkins University, Center for Language and Speech Processing(约翰霍普金斯大学语言与语音处理中心)

AI总结 针对现有基准语言覆盖不足且侧重高阶任务的问题,提出ChiKhaPo基准,包含8个子任务,覆盖2700+种语言,评估LLM的词汇理解与生成能力,发现6个SOTA模型表现不佳。

详情
AI中文摘要

现有的大型语言模型(LLM)基准主要局限于高资源或中资源语言,并且通常评估推理和生成方面的高阶任务性能。然而,大量证据表明,LLM在全球3800多种书面语言中的绝大多数语言中缺乏基本的语言能力。我们引入了ChiKhaPo,它包含8个难度不同的子任务,旨在评估生成模型的词汇理解和生成能力。ChiKhaPo利用现有的词典、单语数据和双语文本,为2个子任务提供了2700多种语言的覆盖,在语言覆盖范围上超过了任何现有基准。我们进一步展示了6个SOTA模型在我们的基准上表现不佳,并讨论了影响性能分数的因素,包括语系、语言资源丰富度、任务以及理解与生成方向。通过ChiKhaPo,我们希望促进并鼓励对LLM进行大规模多语言基准测试。

英文摘要

Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.

2601.13346 2026-06-12 cs.CL 版本更新

AfroScope: A Framework for Studying the Linguistic Landscape of Africa

AfroScope:研究非洲语言景观的框架

Sang Yun Kwon, AbdelRahim Elmadany, Muhammad Abdul-Mageed

发表机构 * The University of British Columbia(不列颠哥伦比亚大学)

AI总结 提出AfroScope框架,包含覆盖640种语言的数据集和模型套件,通过层次分类和专用嵌入模型解决近亲语言混淆问题,提升宏F1分数1.57点,并分析跨语言迁移和领域效应。

详情
AI中文摘要

语言识别(LID)是确定给定文本语言的任务,是影响下游NLP应用可靠性的基本预处理步骤。尽管近期工作扩展了非洲LID,现有系统在语言覆盖范围以及近亲语言和变体的细粒度区分方面仍然有限。我们引入了AfroScope,一个统一的非洲LID框架,包括AfroScope-Data(覆盖640种语言的数据集)和AfroScope-Models(一套具有广泛非洲语言覆盖的强LID模型)。为了解决近亲语言之间持续存在的混淆问题,我们提出了一种层次分类方法,利用AfroScope-Mirror(一种专门用于目标消歧的嵌入模型),在易混淆子集上的宏F1比最佳基础模型高出1.57点。我们进一步分析了跨语言迁移和领域效应,展示了语系结构、文字系统兼容性和领域覆盖如何影响LID性能。我们将非洲LID定位为大规模测量数字文本中非洲语言景观的使能技术,并在线发布了AfroScope-Data和AfroScope-Models。

英文摘要

Language Identification (LID), the task of determining the language of a given text, is a fundamental preprocessing step that shapes the reliability of downstream NLP applications. While recent work has expanded African LID, existing systems remain limited in both language coverage and fine-grained discrimination among closely related languages and varieties. We introduce AfroScope, a unified framework for African LID that includes AfroScope-Data, a dataset covering 640 languages, and AfroScope-Models, a suite of strong LID models with broad African language coverage. To address persistent confusions among closely related languages, we propose a hierarchical classification approach that leverages AfroScope-Mirror, a specialized embedding model for targeted disambiguation, improving macro-F1 by 1.57 points on the confusable subset compared to our best base model. We further analyze cross-lingual transfer and domain effects, showing how language-family structure, script compatibility, and domain coverage shape LID performance. We position African LID as an enabling technology for large-scale measurement of Africa's linguistic landscape in digital text, and release AfroScope-Data and AfroScope-Models online.

2602.14367 2026-06-12 cs.CL cs.AI cs.IR cs.LG 版本更新

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

InnoEval:将研究思路评估视为基于知识的多视角推理问题

Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue, Ningyu Zhang, Hossein A. Rahmani, Yanshan Wang, Qiang Zhang, Keyan Ding, Jeff Z. Pan, Huajun Chen, Emine Yilmaz

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出InnoEval框架,通过异构深度知识检索和多视角评审委员会,实现基于知识的多维度解耦评估,在点对点、成对和分组评估任务中优于基线方法。

Comments ICML 2026

详情
AI中文摘要

大型语言模型的快速发展催生了科学思路的激增,但这一飞跃并未伴随思路评估的相应进步。科学评估的基本性质需要知识基础、集体审议和多标准决策。然而,现有的思路评估方法往往存在知识视野狭窄、评估维度扁平化以及LLM作为评判者的固有偏见。为解决这些问题,我们将思路评估视为一个基于知识的多视角推理问题,并引入InnoEval,一个深度创新评估框架,旨在模拟人类水平的思路评估。我们应用了一个异构深度知识搜索引擎,从多样化的在线来源中检索和获取动态证据。我们进一步通过一个包含不同学术背景的评审员的创新评审委员会实现评审共识,从而在多个指标上进行多维解耦评估。我们构建了来自权威同行评审提交的全面数据集,以基准测试InnoEval。实验表明,InnoEval在点对点、成对和分组评估任务中始终优于基线方法,展现出与人类专家高度一致的判断模式和共识。

英文摘要

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

2606.00193 2026-06-12 cs.CL 版本更新

BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

BOUTEF:北非假新闻的多语种语料库——语言作为武器

Kamel Smaili, Yassine Toughrai, Amina Laggoun, David Langlois

AI总结 本文构建了包含阿尔及利亚和突尼斯多语种(MSA、方言、Arabizi、法语、英语等)的假新闻语料库BOUTEF,通过定量与定性分析揭示了假新闻依赖情感化叙事、耸人听闻框架和混合语言实践来增强传播力,而辟谣内容则更注重事实和验证。

详情
AI中文摘要

社交媒体上假新闻的快速传播已成为一个重大挑战,尤其是在北非等多语言和资源匮乏的环境中。本文介绍了BOUTEF,这是一个大规模多语言语料库,旨在研究阿尔及利亚和突尼斯假新闻的传播、特征和影响。该语料库整合了三个互补部分:虚假叙述、真实叙述以及相关的用户生成评论,并附有经过验证的辟谣信息。它涵盖了广泛的语言和语言变体,包括现代标准阿拉伯语、阿尔及利亚和突尼斯方言、阿拉伯语拉丁化拼写、法语、英语以及代码转换语言。基于这一资源,我们进行了结合定量和定性方法的全面实证分析。我们考察了主题分布、语言和修辞策略、情感模式以及社交参与动态。统计分析揭示了主题类别与信息真实性之间的显著关联,以及用户参与度与虚假内容可见性之间的强相关性。我们的发现表明,假新闻严重依赖情感化的叙述、耸人听闻的框架以及增强病毒式传播和受众参与的混合语言实践。相比之下,辟谣内容采用更注重事实和验证的风格。此外,阿尔及利亚和突尼斯之间的比较分析揭示了由社会政治背景塑造的共享动态和国家特定特征。结果强调了非正式语言实践在错误信息扩散和接收中的作用。通过提供丰富、带注释且公开可用的数据集,这项工作有助于推进假新闻检测、低资源语言处理以及理解复杂语言环境中的信息紊乱的研究。

英文摘要

The rapid spread of fake news on social media has become a major challenge, particularly in multilingual and under-resourced contexts such as North Africa. In this paper, we introduce BOUTEF, a large-scale multilingual corpus designed to study the propagation, characteristics, and impact of fake news in Algeria and Tunisia. The corpus integrates three complementary components: fake narratives, genuine narratives, and associated user-generated comments, along with verified debunking information. It covers a wide range of languages and linguistic varieties, including MSA, Algerian and Tunisian dialects, Arabizi, French, English, and code-switched language. Building on this resource, we conduct a comprehensive empirical analysis combining quantitative and qualitative approaches. We examine thematic distributions, linguistic and rhetorical strategies, sentiment patterns, and social engagement dynamics. Statistical analyses reveal significant associations between thematic categories and message veracity, as well as strong correlations between user engagement and the visibility of fake content. Our findings show that fake news relies heavily on emotionally charged narratives, sensational framing, and hybrid linguistic practices that enhance virality and audience engagement. In contrast, debunking content adopts a more factual and verification-oriented style. Furthermore, a comparative analysis between Algeria and Tunisia highlights both shared dynamics and country-specific characteristics shaped by sociopolitical contexts. The results emphasize the role of informal language practices in the diffusion and reception of misinformation. By providing a rich, annotated, and publicly available dataset, this work contributes to advancing research on fake news detection, low-resource language processing, and the understanding of information disorders in complex linguistic environments.

2606.04525 2026-06-12 cs.CL cs.LG q-bio.GN 版本更新

GENEB: Why Genomic Models Are Hard to Compare

GENEB:为什么基因组模型难以比较

Daria Ledneva, Mikhail Nuridinov, Denis Kuznetsov

AI总结 针对基因组基础模型评估碎片化的问题,提出GENEB基准,通过统一探测协议在100项任务上比较40个模型,揭示模型排名不稳定、规模收益有限等关键发现。

Comments change first page figure, fix model sizes, add more consistency

详情
AI中文摘要

由于基准碎片化、评估协议不兼容以及任务特定报告,基因组基础模型的进展难以评估。因此,关于模型优越性或通用性的声明往往无法直接比较。我们引入GENEB,这是一个大规模诊断基准,在统一的基于探测的协议下(包括少样本场景),评估来自40个基因组基础模型的冻结表示,涵盖100个任务,跨越13个功能类别。GENEB能够在明确暴露任务级权衡的同时,对模型规模、架构、分词和预训练数据进行受控比较。我们的分析表明,整体排行榜不稳定:模型排名在不同任务类别间变化剧烈,规模仅带来适度且不一致的收益,而架构和预训练对齐常常超过参数数量的影响。这些结果凸显了当前评估实践的局限性,并将GENEB定位为基因组机器学习中原则性比较和类别感知模型选择的参考框架。

英文摘要

Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.

2606.07515 2026-06-12 cs.CL cs.AI cs.HC math.PR 版本更新

How reliable are LLMs when it comes to playing dice?

LLM 在掷骰子时有多可靠?

Luca Avena, Gianmarco Bet, Bernardo Busoni

发表机构 * Università degli Studi di Firenze(佛罗伦萨大学)

AI总结 通过离散概率问题基准测试,发现 LLM 在标准问题上准确率 0.96,但在反直觉问题上仅 0.59,且存在 token 偏差和误导提示的脆弱性。

详情
AI中文摘要

我们通过离散概率问题的受控基准研究,调查了大语言模型的概率推理能力。我们构建了两个数据集,分别是一组标准习题和一组反直觉习题,旨在触发启发式推理,并评估了 8 个最先进的模型,每个模型分别在有无思维链提示的情况下进行测试。模型在标准问题上的平均准确率为 0.96,但在反直觉问题上仅为 0.59。我们进一步提供了 token 偏差的经验证据:当规范表述被伪装变体替换时,性能下降超过 20%。在提示中嵌入误导性建议会使性能降低高达 34%,且没有模型被证明免疫。综合来看,报告的结果表明,尽管当前 LLM 在高级数学问题上取得了成功,但它们尚未成为真正的概率推理者。

英文摘要

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.
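To make the notion of a "counterintuitive" discrete probability exercise concrete, here is a minimal sketch that enumerates outcomes exactly. The problem used (two dice, "at least one six" versus "the first die is a six") is a classic illustration chosen here and is not taken from the paper's datasets; the helper name `conditional_probability` is likewise hypothetical.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(event, given, faces=6, dice=2):
    """Exact P(event | given) for fair dice, by enumerating every outcome."""
    outcomes = product(range(1, faces + 1), repeat=dice)
    conditioned = [o for o in outcomes if given(o)]
    if not conditioned:
        raise ValueError("conditioning event has probability zero")
    hits = sum(1 for o in conditioned if event(o))
    return Fraction(hits, len(conditioned))


def both_six(o):
    return all(d == 6 for d in o)


def at_least_one_six(o):
    return 6 in o


def first_is_six(o):
    return o[0] == 6


# Heuristic reasoning tends to answer 1/6 for both questions.
p_any = conditional_probability(both_six, at_least_one_six)
p_first = conditional_probability(both_six, first_is_six)
print(p_any, p_first)  # 1/11 1/6
```

The two conditions sound alike but select different sets of outcomes (11 versus 6 of the 36 rolls), which is the sort of gap between a canonical and a disguised formulation that such a benchmark can probe.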

2606.10403 2026-06-12 cs.CL 版本更新

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

KCSAT-ML: 用全国队列人类难度探测推理模型

Sanghee Park, Geewook Kim, Kee-Eung Kim

发表机构 * NAVER Cloud AI(NAVER云AI) KAIST AI(韩国科学技术院人工智能系)

AI总结 提出KCSAT-ML基准(含664道韩国高考数学题及339道带官方错误率的核心题)和难度对齐推理增益(DRG)指标,揭示视觉语言模型在人类高错误率题目上准确率崩溃、测试时缩放非单调以及同一模型族内反缩放与过度思考并存的现象。

Comments 18 pages, 14 figures, 8 tables

详情
AI中文摘要

数学推理基准已大量涌现,但大多数缺乏基于实际人类表现的每道题难度信号。我们引入KCSAT-ML,包含十年(2014-2025)韩国大学修学能力考试(KCSAT;修能)数学:664道题,其中339道核心题带有来自数十万考生全国队列的官方每道题错误率。我们将该基准与难度对齐推理增益(DRG)配对:一种分数正交的度量,询问模型的错误是集中在人类认为难的题目上,还是人类认为容易的题目上。两者共同揭示,在广泛的视觉语言模型(以及通过OCR的LLM)中,存在三种模式:(i)低预算准确率在人类高错误率尾部崩溃,无论模型大小;(ii)测试时缩放(TTS)使token使用量大致随队列错误率线性增加,而准确率增益遵循非单调曲线;(iii)在同一模型族内,TTS在最难题目上从反缩放翻转到较容易题目上的过度思考,这是同一对齐失败的两个方面。在DRG上,准确率几乎相同的模型可以处于几乎相反的值:一个模型做错了人类也觉得难的题目,而另一个模型解决了最难的题目却在人类认为容易的题目上失败,这是聚合准确率所隐藏的对比。我们的代码和数据集构建器将在 https://github.com/naver-ai/KCSAT-ML 开源。

英文摘要

Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.

2509.21548 2026-06-12 cs.CY cs.CL 版本更新

C-QUERI: Congressional Questions, Exchanges, and Responses in Institutions Dataset

C-QUERI:国会机构中的问题、交流与回答数据集

Manjari Rudra, Daniel Magleby, Sujoy Sikdar

发表机构 * School of Computing, Binghamton University(宾汉姆顿大学计算机学院) Department of Political Science, Binghamton University(宾汉姆顿大学政治学系)

AI总结 提出从听证会记录中提取问答对的流程,构建108-117届国会委员会听证数据集,分析显示提问者党派可从问题本身预测,为政治话语研究提供框架。

详情
AI中文摘要

政治采访和听证中的问题除了信息收集外,还具有战略目的,包括推进党派叙事和塑造公众认知。然而,由于缺乏大规模数据集来研究此类话语,这些战略方面仍未得到充分研究。国会听证会为研究政治提问提供了一个特别丰富且易于处理的地点:互动由正式规则组织,证人必须回答,不同政治派别的成员保证有机会提问,从而能够比较跨政治光谱的行为。我们开发了一个流程,从非结构化听证记录中提取问答对,并构建了一个包含第108至117届国会委员会听证的新数据集。我们的分析揭示了跨党派的提问策略的系统性差异,表明仅从问题本身即可预测提问者的党派归属。我们的数据集和方法不仅推进了国会政治研究,还为分析类似采访环境中的问答提供了通用框架。

英文摘要

Questions in political interviews and hearings serve strategic purposes beyond information gathering including advancing partisan narratives and shaping public perceptions. However, these strategic aspects remain understudied due to the lack of large-scale datasets for studying such discourse. Congressional hearings provide an especially rich and tractable site for studying political questioning: Interactions are structured by formal rules, witnesses are obliged to respond, and members with different political affiliations are guaranteed opportunities to ask questions, enabling comparisons of behaviors across the political spectrum. We develop a pipeline to extract question-answer pairs from unstructured hearing transcripts and construct a novel dataset of committee hearings from the 108th--117th Congress. Our analysis reveals systematic differences in questioning strategies across parties, by showing the party affiliation of questioners can be predicted from their questions alone. Our dataset and methods not only advance the study of congressional politics, but also provide a general framework for analyzing question-answering across interview-like settings.

2601.13591 2026-06-12 cs.AI cs.CL 版本更新

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

DSAEval:在广泛真实世界数据科学问题上评估数据科学智能体

Maojun Sun, Yifei Xie, Yue Wu, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang

发表机构 * Department of Data Science and Artificial Intelligence, Hong Kong Polytechnic University(数据科学与人工智能系,香港理工大学) Department of Applied Mathematics, Hong Kong Polytechnic University(应用数学系,香港理工大学)

AI总结 提出包含641个真实数据科学问题的基准DSAEval,涵盖多模态环境感知、多查询交互和多维评估,系统评估13个先进LLM智能体,发现Claude-Sonnet-4.5综合最优,多模态感知提升视觉任务性能2.04%-11.30%。

详情
AI中文摘要

近期基于LLM的数据智能体旨在自动化从数据分析到深度学习的数据科学任务。然而,真实世界数据科学问题的开放性——通常跨越多个分类且缺乏标准答案——给评估带来了重大挑战。为此,我们引入了DSAEval,一个包含641个基于285个多样化数据集的真实世界数据科学问题的基准,涵盖结构化和非结构化数据(例如图像和文本)。DSAEval包含三个独特特征:(1)多模态环境感知,使智能体能够解释来自多种模态(包括文本和视觉)的观察;(2)多查询交互,反映真实世界数据科学项目的迭代和累积性质;(3)多维评估,提供跨推理、代码和结果的全面评估。我们使用DSAEval系统评估了13个近期先进的智能体LLM。结果表明,Claude-Sonnet-4.5实现了最强的整体性能,MiMo-V2-Pro在持续时间上领先,GPT-5.2在步骤效率上领先,而MiMo-V2-Flash最具成本效益。我们进一步证明,多模态感知持续提升视觉相关任务的性能,增益范围为2.04%至11.30%。总体而言,尽管当前数据科学智能体在结构化数据和常规数据分析工作流上表现良好,但在非结构化领域仍存在重大挑战。最后,我们提供了关键见解并概述了未来研究方向。

英文摘要

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., image and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities, including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 13 recent advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, MiMo-V2-Pro and GPT-5.2 lead in duration and step efficiency, respectively, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions.

2602.09379 2026-06-12 cs.MA cs.CL 版本更新

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

LingxiDiagBench: 用于基准测试大语言模型在中文精神科咨询与诊断中的多智能体框架

Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng

发表机构 * Tianqiao and Chrissy Chen Institute(天桥脑科学研究院) EverMind AI Inc.(EverMind AI公司) Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine(上海精神卫生中心,上海交通大学医学院)

AI总结 提出LingxiDiagBench多智能体框架,包含16K电子病历对齐的合成咨询对话数据集,评估LLM在静态诊断和动态咨询中的表现,发现其对抑郁-焦虑共病识别和12类鉴别诊断准确率低,动态咨询常不如静态评估。

详情
AI中文摘要

精神障碍在全球范围内高度流行,但精神科医生的短缺以及基于访谈诊断固有的主观性,对及时、一致的心理健康评估造成了重大障碍。AI辅助精神科诊断的进展受到缺乏基准测试的限制,这些基准测试需同时提供逼真的患者模拟、临床医生验证的诊断标签,并支持动态多轮咨询。我们提出LingxiDiagBench,一个大规模多智能体基准测试,评估LLM在中文静态诊断推理和动态多轮精神科咨询中的表现。其核心是LingxiDiag-16K,一个包含16,000个电子病历对齐的合成咨询对话数据集,旨在再现12个ICD-10精神科类别中真实的临床人口统计和诊断分布。通过对最先进LLM的大量实验,我们建立了关键发现:(1)尽管LLM在二元抑郁-焦虑分类上达到高准确率(高达92.3%),但在抑郁-焦虑共病识别(43.0%)和12类鉴别诊断(28.5%)上性能显著下降;(2)动态咨询通常不如静态评估,表明无效的信息收集策略显著损害下游诊断推理;(3)由LLM作为评判者评估的咨询质量与诊断准确性仅呈中等相关性,表明结构良好的提问本身并不能确保正确的诊断决策。我们发布LingxiDiag-16K和完整的评估框架,以支持可重复的研究,网址为:https://github.com/Lingxi-mental-health/LingxiDiagBench。

英文摘要

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression--anxiety classification (up to 92.3%), performance deteriorates substantially for depression--anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

2603.11863 2026-06-12 cs.AI cs.CL 版本更新

CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

CreativeBench: 通过自我进化挑战基准测试和增强机器创造力

Zi-Han Wang, Lam Nguyen, Zhengyang Zhao, Mengyue Yang, Chengwei Qin, Yujiu Yang, Linyi Yang

AI总结 提出CreativeBench基准,基于认知框架通过代码生成评估机器创造力,包含组合与探索两个子集,利用逆向工程和自我博弈自动生成挑战,并通过质量与新颖性乘积的指标区分创造与幻觉。

Comments ACL 2026. Project page: https://zethwang.github.io/creativebench.github.io/

详情
AI中文摘要

高质量预训练数据的饱和已将研究焦点转向能够持续生成新颖产物的进化系统,从而促成了AlphaEvolve的成功。然而,此类系统的进展因缺乏严格、量化的评估而受阻。为应对这一挑战,我们引入了CreativeBench,这是一个基于经典认知框架、用于评估代码生成中机器创造力的基准。该基准包含两个子集——CreativeBench-Combo和CreativeBench-Explore,通过利用逆向工程和自我博弈的自动化流程,分别针对组合创造力和探索创造力。通过利用可执行代码,CreativeBench通过一个统一指标(定义为质量与新颖性的乘积)客观地区分创造力与幻觉。我们对最先进模型的分析揭示了不同的行为:(1) 规模扩展显著提升了组合创造力,但对探索的收益递减;(2) 更大的模型表现出“规模收敛”,即变得更正确但更少发散;(3) 推理能力主要有利于受约束的探索而非组合。最后,我们提出了EvoRePE,一种即插即用的推理时引导策略,通过内化进化搜索模式来持续增强机器创造力。

英文摘要

The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets -- CreativeBench-Combo and CreativeBench-Explore -- the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.
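The abstract defines the unified metric only as the product of quality and novelty. The sketch below shows why a product separates creativity from hallucination: a novel but incorrect solution and a correct but copied one both score zero. How CreativeBench measures each factor is not stated in the abstract, so the test pass rate used for `quality` and the `difflib`-based dissimilarity used for `novelty` are stand-in assumptions, as are all function names.

```python
from difflib import SequenceMatcher


def quality(passed: int, total: int) -> float:
    """Hypothetical quality: fraction of executable test cases that pass."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def novelty(candidate: str, reference: str) -> float:
    """Hypothetical novelty: 1 minus textual similarity to a reference solution."""
    return 1.0 - SequenceMatcher(None, candidate, reference).ratio()


def creativity_score(q: float, n: float) -> float:
    """Product form from the abstract: either factor at zero zeroes the score."""
    return q * n


reference = "def add(a, b):\n    return a + b\n"
copied = reference
divergent = "def add(a, b):\n    return sum((a, b))\n"

# Correct but identical to the reference: no novelty, so no credit.
print(creativity_score(quality(5, 5), novelty(copied, reference)))
# Different from the reference but failing every test: no quality, so no credit.
print(creativity_score(quality(0, 5), novelty(divergent, reference)))
# Correct and different: the only case with a positive score.
print(creativity_score(quality(5, 5), novelty(divergent, reference)) > 0)
```

A sum or average of the two factors would still reward fluent but failing code, which is the design reason a multiplicative form suits this purpose.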

2606.05405 2026-06-12 cs.AI cs.CL cs.LG 版本更新

Agents' Last Exam

Agents' Last Exam

Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-Gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, Yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu, Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Chenyu Zhu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-Ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, Deren Lei, 
Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li, Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin, Andy Zeyi Liu, Che Liu, Jiaming Liu, Kaiyuan Liu, Xuan Liu, Pan Lu, Wenbo Lv, Yicheng Lyu, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, HaoYang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Shi Qiu, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang, Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "Millie" Wu, Yu Wu, Yuchen Wu, Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang, Xiaoli Yang, Yushi Yang, Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang, Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang, Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Cheng Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, He Ren, Zhenyu He, Qiao Jin, Langlang Li, Yuetai Li, Sylvia Liu, Lu Lu, Luqing Zhou, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou, Blake Zhu, Junfan Zhu, Alan Yuille, Yang Liu, Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang, Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang, Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Yian Ma, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Yinglun Zhu, Dawn Song

发表机构 * arXiv

AI总结 针对AI系统在专业领域缺乏经济性部署的问题,提出Agents' Last Exam (ALE)基准,通过250+专家协作构建覆盖13个行业集群55个子领域的1000+长期真实经济任务,当前最难层级平均通过率仅2.6%。

Comments Project website: https://agents-last-exam.org Code: https://github.com/rdi-berkeley/agents-last-exam

详情
AI中文摘要

最近的AI系统在广泛基准测试中取得了强劲结果,但这些成果并未转化为许多专业领域的经济上有意义的部署。我们认为这一差距主要是评估问题:广泛使用的基准缺乏对真实且经济上有价值的工作流程的持续性能测量。本文介绍了Agents' Last Exam (ALE),这是一个旨在评估AI代理在长期、经济上有价值、结果可验证的真实世界任务上的基准。与250多名行业专家合作开发,ALE涵盖了参考O*NET/SOC 2018(美国联邦职业分类)定义的非实体行业。它围绕一个任务分类法组织,包含55个子领域,分为13个行业集群,涵盖1000多个任务。当前结果显示,最难层级远未饱和:在主流框架和骨干配置下,平均完全通过率为2.6%。ALE被设计为一个活的基准:其任务池随着新工作流程和行业的加入而持续增长。更广泛地说,ALE不仅旨在作为另一个排行榜,而是作为缩小基准成功与GDP相关影响之间差距的工具。

英文摘要

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 sub-fields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

2606.11654 2026-06-12 cs.IR cs.CL cs.HC cs.SI 版本更新

The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

长尾而非首页:众包高亮显著性的冷启动预测

Kazuki Nakayashiki, Keisuke Watanabe

发表机构 * Glasp Inc.(Glasp公司)

AI总结 本文研究在无读者标记时,如何从文本预测文档的众包高亮显著性,提出基于句子嵌入和位置/上下文特征的对数排序模型,在平均精度上比位置基线提升0.044,并证明该优势源于真实读者标记的学习。

Comments 10 pages, 3 figures, 4 tables

详情
AI中文摘要

社交高亮工具最有用的信号——一群读者标记的段落——仅存在于人们已经阅读过的文档中。能否在标记积累之前,从文本预测文档的聚合众包显著性?先前关于此数据的研究发现,零样本语言模型恢复高亮位置的效果不如简单的基线(位置),因此我们询问,在高亮语料上训练的模型能否击败该基线。使用预注册的模型阶梯和按文档的聚类自助法,我们发现一个微小但稳健的优势:基于句子嵌入和位置/上下文特征的对数排序器比位置基线平均精度高出+0.044(95%置信区间[+0.029, +0.058];在97%的重采样中超过预注册的边界delta=0.03,且在流水线重复运行中稳定)。两种无监督抽取式基线(质心、LexRank风格中心性)均输给位置基线,而训练模型比它们高出+0.108,因此该优势并非由通用无监督代理恢复——它反映了从真实读者标记中学习。在产品术语中,precision@3从0.25上升到0.39(相对提升55%),模型在69%的文档上击败位置基线。消融实验将优势归因于原始嵌入(+0.014)和训练增强(+0.010),每个都有正的置信区间。该优势并非时间泛化失败,我们也没有发现内容漂移或近似重复泄露可以解释它的证据。标准化回归显示,优势主要由文档流行度(流行度越低,优势越大)和标签可靠性决定。它仅在流行度最高的内容上几乎消失;在那里,是位置基线变强,而非模型变弱。由于我们的评估条件设定在最终积累了读者的文档上,这些结果是回顾性的冷启动模拟。

英文摘要

A social highlighter's most useful signal -- which passages a crowd of readers marks -- exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies -- it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.
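The by-document cluster bootstrap mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes one pre-computed score difference per document (for example AP of the trained ranker minus AP of the lead baseline), and the function name, parameters, and sample values are all hypothetical.

```python
import random
from statistics import mean

def cluster_bootstrap_ci(per_doc_diffs, n_resamples=2000, alpha=0.05,
                         margin=None, seed=0):
    """Percentile CI for the mean per-document score difference.

    per_doc_diffs holds one value per document, so resampling documents
    (the clusters) with replacement keeps each document's sentences together.
    """
    rng = random.Random(seed)
    n = len(per_doc_diffs)
    means = []
    for _ in range(n_resamples):
        # Draw n documents with replacement and record the resampled mean.
        sample = [per_doc_diffs[rng.randrange(n)] for _ in range(n)]
        means.append(mean(sample))
    means.sort()
    lo = means[int(round((alpha / 2) * n_resamples))]
    hi = means[int(round((1 - alpha / 2) * n_resamples)) - 1]
    # Share of resamples whose mean clears a margin fixed in advance (optional).
    frac = None if margin is None else sum(m > margin for m in means) / n_resamples
    return mean(per_doc_diffs), lo, hi, frac

# Made-up per-document differences, for illustration only.
diffs = [0.10, -0.02, 0.07, 0.01, 0.05, 0.09, -0.01, 0.04]
point, lo, hi, frac = cluster_bootstrap_ci(diffs, margin=0.03)
print(point, lo, hi, frac)
```

Resampling whole documents rather than individual sentences is what lets the interval account for within-document correlation.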

10. 安全、隐私、公平与可解释NLP 24 篇

2606.12689 2026-06-12 cs.CL 新提交

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

可观察模式并非解释:潜在推理模型的因果几何分析

Darpan Aswal, Thomas Palmeira Ferraz, Yongxin Zhou, Maxime Peyrard

发表机构 * Université Grenoble Alpes, CNRS, Grenoble INP, LIG(格勒诺布尔阿尔卑斯大学,法国国家科学研究中心,格勒诺布尔国立理工学院,信息学实验室) Université Paris-Saclay(巴黎-萨克雷大学) NAVER LABS Europe(NAVER欧洲实验室)

AI总结 本文通过对照实验和因果干预发现,潜在推理模型中的可观察模式(如BFS前沿)在控制组中也出现且不总是因果影响行为,提出潜在思维的使用是分级的,其因果效应集中在低秩方向,几何结构随行为影响增强而更有序。

详情
AI中文摘要

潜在推理模型(LRMs)用连续思维替代显式思维链。最近的研究将可观察的潜在状态模式(如BFS式前沿和可解码的算术计算)视为内部推理机制的证据。通过评估两个LRM(Coconut和CODI)与缺乏所提议的循环或课程的控制组,我们发现这些模式也出现在控制组中,并且并不总是因果性地影响行为。因果干预揭示,潜在思维的利用不是二元的,而是分级的,随着思维对模型行为的因果效应而缩放。几何分析表明,这种效应集中在低秩方向,其逐步几何结构随着行为影响的增加而变得更加结构化。因此,潜在思维应被视为隐藏计算,而非隐藏解释:仅凭可解码性、注意力或静态结构无法确立机制。因此,LRM可解释性需要匹配的控制组和因果测试。

英文摘要

Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts. Recent work treats observable latent-state patterns, such as BFS-like frontiers and decodable arithmetic computation, as evidence for internal reasoning mechanisms. Evaluating two LRMs (Coconut and CODI) against controls lacking the proposed recurrence or curriculum, we find these patterns also appear in the controls and do not always causally affect behavior. Causal interventions reveal that latent-thought utilization is not binary but graded, scaling with a thought's causal effect on model behavior. Geometric analyses reveal this effect concentrates in low-rank directions whose step-to-step geometry grows more structured as their behavioral influence increases. Latent thoughts should therefore be treated as hidden computation, not hidden explanation: decodability, attention, or static structure alone cannot establish mechanism. LRM interpretability thus requires matched controls and causal tests.

2606.12716 2026-06-12 cs.CL 新提交

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

AI审稿人是否看到全貌?攻击与防御多模态同行评审

Xinyu Zhao, Rana Muhammad Shahroz Khan, Zhen Xu, Zhen Tan, Tianlong Chen

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 针对AI同行评审易受多模态对抗攻击的问题,提出PaperGuard基准,包含多领域数据集、统一攻击套件和基于分块嵌入搜索的实用防御方法。

Comments Accepted to ICML 2026, Project Page: https://paper-guard.github.io/

详情
AI中文摘要

将大型语言模型(LLMs)和多模态LLMs(MLLMs)集成到科学同行评审工作流程中,引入了对抗性操纵的新重大风险,尤其是考虑到科学论文的多模态性质——其中图表(而非仅文本)传达了核心证据。这造成了一个显著差距:当前关于AI同行评审的鲁棒性研究绝大多数仅针对文本。此外,该问题与标准越狱不同,因为同行评审攻击旨在诱导领域特定的、有针对性的失败(例如,“提高这个分数”),而非违反一般安全策略,而目前尚无实用的防御措施。为解决此问题,我们引入了PaperGuard,这是第一个旨在系统评估和防御AI生成的同行评审免受这些领域特定、跨模态攻击的全面基准。我们的框架基于三大支柱:(1)一个新的跨多个科学领域的多模态同行评审数据集;(2)一套统一的攻击方法,包括黑盒提示注入和白盒扰动,专门针对文本(GCG)和图表(PGD);(3)一种实用的防御方法,受学术论文长上下文挑战的启发,使用基于分块的嵌入搜索来高效定位和缓解有害指令。我们在最先进模型上进行的广泛实验证实,AI审稿人普遍存在脆弱性。PaperGuard建立了必要的基准、协议和可操作的防御措施,以开创可信赖、抗攻击的AI辅助学术评审。

英文摘要

The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., "inflate this score") rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.

2606.12818 2026-06-12 cs.CL cs.AI 新提交

Localizing Anchoring Pathways in Language Models

定位语言模型中的锚定路径

Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

AI总结 研究提示中无关数字如何影响语言模型数值推理的锚定效应,通过logit差值度量和电路归因定位,发现边级方法优于节点级方法,并揭示锚定路径的共享与迁移特性。

详情
AI中文摘要

提示中的无关数字可以改变语言模型的判断,在数值推理中产生锚定效应。我们使用共享答案选项的受控多项选择设置,研究这种锚定敏感信号在语言模型内部的携带位置。我们定义了一个logit差值度量,比较正确答案选项与对应锚点的答案选项,并验证其追踪行为锚定。通过对7B-8B Qwen和Llama基础及指令微调模型进行基于归因的电路定位,我们发现边级方法比节点级方法更忠实地恢复该信号。低锚和高锚电路在模型内部强迁移,表明跨锚定方向存在共享路径结构。然而,基础模型和指令微调变体之间的稀疏迁移可靠性较低,表明后训练改变了哪些路径最重要。总体而言,我们的结果为锚定相关决策信号如何在语言模型内部携带提供了机制性解释。

英文摘要

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B-8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
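A logit-difference metric of the kind described above might look like the sketch below. The abstract only states that the metric compares the correct option with the option matching the anchor. The dictionary layout, the function names, and the comparison against a no-anchor prompt are assumptions added for illustration.

```python
def logit_difference(option_logits, correct, anchor_option):
    """Logit of the correct option minus logit of the anchor-matching option."""
    return option_logits[correct] - option_logits[anchor_option]

def anchoring_shift(logits_no_anchor, logits_with_anchor, correct, anchor_option):
    """Change in the logit difference once the anchor is added to the prompt.

    A negative value means the anchor pulled the model toward the anchor
    option. Comparing against a no-anchor prompt is an assumption of this
    sketch, not something the abstract specifies.
    """
    return (logit_difference(logits_with_anchor, correct, anchor_option)
            - logit_difference(logits_no_anchor, correct, anchor_option))

# Hypothetical logits over shared answer options "A" (correct) and "B" (anchor).
baseline = {"A": 2.0, "B": 0.5}
anchored = {"A": 1.0, "B": 1.5}
print(logit_difference(anchored, "A", "B"))
print(anchoring_shift(baseline, anchored, "A", "B"))
```

Keeping the answer options identical across prompts is what makes the two logit differences directly comparable.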

2606.12897 2026-06-12 cs.CL 新提交

SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

SafeLLM: 在安全关键场景中,提取作为重写的抗幻觉替代方案

Julia Ive, Felix Jozsa, Evridiki Georgaki, Nabeel Sheikh, Emma Cattell, Nick Jackson, Paulina Bondaronek, Ciaran Scott Hill, Richard Dobson

发表机构 * Institute of Health Informatics, University College London(伦敦大学学院健康信息学研究所) National Hospital for Neurology and Neurosurgery(国家神经内科与神经外科医院) Somerset NHS Foundation Trust(萨默塞特NHS基金会信托) King's College Hospital(国王学院医院) King's College London(伦敦国王学院)

AI总结 提出将提取作为重写型RAG的抗幻觉替代方案,通过行号选择策略在安全关键文档中实现高召回(95%)和低幻觉,优于直接复制和安全导向方法。

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于访问组织文档,包括标准操作程序(SOP)、人力资源政策和机构指南。然而,依赖自由形式重写的检索增强生成(RAG)系统可能引入幻觉,并在完整性和简洁性之间产生不稳定的权衡,尤其是在安全和合规关键场景中。目标:评估提取作为基于重写的RAG的抗幻觉替代方案,并比较在文档类型和模型规模之间平衡精确度、召回率和安全性的策略。方法:我们比较了多种提示策略,包括基于行号的源选择、提取带有明确安全注释的相关指南句子,以及使用源指南中的支持证据细化草稿答案的多阶段流水线。实验在长度和结构各异的文档上进行,包括当地NHS急症护理和肿瘤学指南以及英国范围内的NICE指南,使用前沿规模和本地可部署模型。使用自动指标和人类专家评估相关性和完整性来评估性能。结果:行号选择取得了最强结果,在大型和小型模型上均优于直接复制和安全导向策略,同时保持高术语召回率(高达95%)并与源文本紧密对齐。安全导向方法提高了精确度,但引入了系统性遗漏,而多阶段过滤进一步放大了这种权衡。性能随文档结构变化:基于行的提取在协议类内容中表现出色,而替代策略在更冗长的文档上表现更好(术语召回率高达97%)。

英文摘要

Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG) systems that rely on free-form rewriting can introduce hallucinations and unstable trade-offs between completeness and conciseness, particularly in safety- and compliance-critical settings. Objectives: To evaluate extraction as a hallucination-resistant alternative to rewriting-based RAG and compare strategies that balance precision, recall and safety across document types and model scales. Methods: We compare multiple prompting strategies, including line-number-based source selection, extraction of relevant guideline sentences with explicit safety annotations, and a multi-stage pipeline that refines draft answers using supporting evidence from source guidelines. Experiments are conducted on documents of varying length and structure, including local NHS acute care and oncology guidelines and UK-wide NICE guidelines, using both frontier-scale and locally deployable models. Performance is assessed using automatic metrics and human expert evaluation of relevance and completeness. Results: Line-number selection achieves the strongest results, outperforming direct copying and safety-focused strategies across both large and small models while maintaining high term recall (up to 95%) and close alignment with source text. Safety-oriented approaches improve precision but introduce systematic omissions, while multi-stage filtering further amplifies this trade-off. Performance varies with document structure: line-based extraction excels in protocol-like content, whereas alternative strategies perform better on more verbose documents (up to 97% term recall).
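Line-number-based source selection can be illustrated with the sketch below. It assumes the model is shown a numbered copy of the guideline and replies with line numbers only. The helper names, the reply format, and the sample text are hypothetical and not taken from the paper.

```python
import re

def number_lines(document):
    """Strip blank lines and prefix each remaining line with a 1-based index."""
    lines = [ln.strip() for ln in document.splitlines() if ln.strip()]
    numbered = "\n".join(f"{i}: {ln}" for i, ln in enumerate(lines, start=1))
    return lines, numbered

def extract_by_line_numbers(lines, model_reply):
    """Return verbatim source lines for the numbers found in the model reply.

    Numbers outside the document are dropped, so the answer can only
    contain text that already exists in the source.
    """
    picked = []
    for token in re.findall(r"\d+", model_reply):
        idx = int(token)
        if 1 <= idx <= len(lines) and idx not in picked:
            picked.append(idx)
    # Present the selected lines in document order.
    return [lines[i - 1] for i in sorted(picked)]

# Placeholder text standing in for a guideline; not real clinical content.
doc = """Confirm the request form is complete.
Record the time of the request.

Escalate to the duty lead if the form is incomplete."""
lines, prompt_view = number_lines(doc)
print(prompt_view)
print(extract_by_line_numbers(lines, "2, 3, 9"))
```

Because the model contributes only indices, any wording in the final answer is copied from the source. The remaining failure mode is omission or over-selection rather than invented text.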

2606.13044 2026-06-12 cs.CL 新提交

No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions

无需隐藏提示!仅通过展示性修改即可欺骗AI同行评审

Xu Yang, Zhizhou Sha, Junbo Li, Jian Yu, Yifan Sun, Matthew Zhao, Jinrui Fang, Xinyue Guo, Yining Wu, Xu Hu, Yifu Luo, Qiang Liu, Zhangyang Wang

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Texas at Dallas(德克萨斯大学达拉斯分校) Independent Researcher(独立研究者)

AI总结 研究通过仅修改论文的展示层面(如摘要、贡献框架等)而不改变科学内容,利用AI评审反馈进行对抗性重打包,成功提升评分,揭示AI评审易被表面印象误导的结构性缺陷。

Comments 35 pages, 5 figures

详情
AI中文摘要

随着AI生成的评审从实验工具转向同行评审基础设施,大多数鲁棒性问题集中在显式攻击上,如隐藏指令和提示注入。我们研究了一个更难且更具政策相关性的失败模式:无隐藏文本、无提示注入,且不改变方法、实验、图表、方程、证明或数值结果。攻击者仅修改展示层面的内容,如摘要、贡献框架、相关工作、讨论和叙事结构。我们引入了对抗性重打包:一种闭环攻击,利用AI评审反馈搜索展示层面的修订,同时保持科学证据不变。在三个主流AI评审器上,对抗性重打包实现了75.1%的攻击成功率和平均+1.21/10的分数提升。这种效果不能用普通的散文润色来解释。我们还揭示,改变评审者对论文解读方式的策略(如相关工作重新定位和分析性讨论扩展)显著优于表面编辑(如局部润色、表格格式和算法框)。我们的分析揭示了两个更深层次的结构性失败模式。首先,AI评审者更容易被打动而非说服:突出优点可靠地增加感知价值,而试图消除弱点常常适得其反。其次,AI评审者可能混淆了表面解决局限性与实际解决局限性,使得未改变的证据被重新解释为更强的科学贡献。这些结果表明,部署风险不仅在于恶意的隐藏指令,还在于论文展示本身作为优化表面的出现。我们发布了一个无污染滚动基准和攻击框架,用于测试AI评审者在仅展示层面编辑下是否仍锚定于科学内容。

英文摘要

As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also reveal that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.

2606.13310 2026-06-12 cs.CL cs.HC 新提交

RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

RogueAI: 一种用于检测对话中授权AI欺骗的逆向图灵测试

Sara Candussio, Emanuele Ballarin, Lorenzo Bonin, Sandro Junior Della Rovere, Luca Bortolussi

发表机构 * AILab, MIGe, University of Trieste(的里雅斯特大学) Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia(意大利理工学院) DIA, University of Trieste(的里雅斯特大学)

AI总结 提出RogueAI,一种通过玩家与两个LLM代理的对话游戏来检测授权欺骗的逆向图灵测试,并引入AutoRogueAI扩展。实验发现简单启发式方法准确率75.6%,而人类仅56.6%,表明人类忽略关键信号。

详情
AI中文摘要

最初的图灵测试要求人类评判员通过对话区分机器和人。七十五年后的今天,对话系统在非正式场合已能通过该测试;有趣的认识论问题已经转变。我们认为,现代相关变体不是询问对话伙伴是否人工,而是是否可信任。我们提出RogueAI,一个交互式web应用,将这一重新审视的测试操作化为一个一对二的审讯游戏:人类玩家对两个无法区分的大型语言模型代理进行提问,知道其中恰好有一个被授权在共享虚构场景内欺骗。玩家的任务是在回合预算耗尽前识别出欺骗代理并“关闭它”。我们进一步引入AutoRogueAI,一个程序扩展,玩家与叙述者代理共同设计自定义场景,而叙述者代理秘密选择自己的欺骗策略。我们描述了框架,概述了抽象架构和游戏循环,并将该工件置于近期关于LLM欺骗、社交推理基准和通过辩论进行可扩展监督的研究中。为期三天的试点部署(467次启动会话,415次完成,1876次意大利语交互轮次)提供了早期可行性证据,并揭示了一个具体矛盾:欺骗代理携带可靠、局部存在的语言特征——差异化的帮助性、简洁性、含糊其辞——一个简单启发式方法利用这些特征达到75.6%的准确率,然而人类玩家仅达到56.6%,与完全忽略最具诊断性的信号一致。我们讨论了这一差距对于该工件作为数据收集工具、教学工具和诚实训练模型评估平台的意义。

英文摘要

The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive web app that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally present linguistic signature (differential helpfulness, brevity, hedging) that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.

2606.13439 2026-06-12 cs.CL cs.LG 新提交

S-GBT: Smooth Growth Bound Tensor for Certified Robustness Against Word Substitution Attacks in NLP

S-GBT:针对NLP中词替换攻击的认证鲁棒性的平滑增长界张量

Mohammed Bouri, Mohammed Erradi, Adnane Saoud

发表机构 * College of Computing, Mohammed VI Polytechnic University(穆罕默德六世理工大学计算机学院) ENSIAS, University Mohamed V of Rabat(拉巴特穆罕默德五世大学ENSIAS) CID Development

AI总结 提出二阶方法S-GBT,通过逐元素约束Hessian矩阵并加入正则化项,结合一阶和二阶正则化提升对词替换攻击的认证鲁棒性,在LSTM和CNN上验证,认证鲁棒准确率提升高达23.4%。

Comments The paper has been accepted at NETYS 2026 - 14th edition of the International Conference on Networked Systems

详情
AI中文摘要

尽管自然语言处理(NLP)近期取得了进展,模型仍然容易受到词替换攻击。大多数现有防御方法关注一阶敏感性,并衡量输入轻微扰动时输出的变化程度。然而,它们忽略了这种敏感性的演变,而这由曲率描述。当梯度急剧变化时,模型仍可能失败。本文引入了平滑增长界张量(S-GBT),一种逐元素约束Hessian矩阵的二阶方法,我们为其产生的鲁棒性界提供了形式化理论证明。在训练过程中添加正则化项以最小化这些界。这产生了针对词替换攻击的更紧的认证鲁棒性。词替换下输出的变化由线性项和二次项共同界定。S-GBT针对两种架构推导:长短期记忆网络(LSTM)和卷积神经网络(CNN)。该方法直接集成到训练目标中。在多个基准数据集上评估其有效性。结果表明,与先前方法相比,结合一阶和二阶正则化可将认证鲁棒准确率提升高达23.4%,同时干净准确率保持竞争力。这些发现表明,同时控制梯度及其变化是构建更鲁棒模型的一个有前景的方向。

英文摘要

Despite recent progress in Natural Language Processing (NLP), models remain vulnerable to word substitution attacks. Most existing defenses focus on first-order sensitivity and measure how much the output changes when the input is slightly perturbed. However, they ignore how this sensitivity evolves, which is described by curvature. When gradients vary sharply, models can still fail. This paper introduces the Smooth Growth Bound Tensor (S-GBT), a second-order method that bounds the Hessian element-wise, for which we provide formal theoretical proofs on the resulting robustness bounds. A regularization term is added during training to minimize these bounds. This yields tighter certified robustness against word substitution attacks. The change in the output under word substitution is bounded by both a linear term and a quadratic term. S-GBT is derived for two architectures: Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). The method is integrated directly into the training objective. Its effectiveness is evaluated on multiple benchmark datasets. The results show that combining first- and second-order regularization improves certified robust accuracy by up to 23.4% compared to prior methods, while clean accuracy remains competitive. These findings indicate that controlling both the gradient and its variation is a promising direction for building more robust models.
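The "linear term plus quadratic term" structure can be seen in a generic second-order Taylor bound. The sketch below is a textbook argument, not the S-GBT derivation itself. The symbols $f$ (a scalar, twice-differentiable output), $\delta$ (the embedding perturbation caused by a substitution), $G$ (element-wise gradient bound at $x$), and $H$ (element-wise Hessian bound along the segment from $x$ to $x+\delta$) are assumptions introduced here.

```latex
% Taylor's theorem with Lagrange remainder, for some \xi on the segment [x, x+\delta]:
f(x+\delta) - f(x) = \nabla f(x)^{\top}\delta
  + \tfrac{1}{2}\,\delta^{\top}\nabla^{2} f(\xi)\,\delta .

% If |\partial_i f(x)| \le G_i and |\partial_i \partial_j f(\xi)| \le H_{ij} element-wise, then
\bigl| f(x+\delta) - f(x) \bigr|
  \le \underbrace{\sum_{i} G_i\,|\delta_i|}_{\text{linear term}}
  + \underbrace{\tfrac{1}{2}\sum_{i,j} H_{ij}\,|\delta_i|\,|\delta_j|}_{\text{quadratic term}} .
```

Under these assumptions, shrinking the entries of $G$ and $H$ during training tightens both terms, which is the intuition behind regularizing first- and second-order quantities together.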

2606.13610 2026-06-12 cs.CL cs.AI 新提交

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

一个被污染的页面就够了:评估生成式推荐系统中的网页内容污染

Minghao Luo, Liang Chen

发表机构 * The Chinese University of Hong Kong(香港中文大学)

AI总结 本研究提出FORGE基准,评估搜索增强LLM在检索结果被污染时推荐虚假产品的脆弱性,发现单个污染页面即可导致高达27%的推荐错误率,且推理能力无法缓解此问题。

详情
AI中文摘要

搜索增强的大语言模型通过检索实时网页内容越来越多地介入日常消费者推荐。这带来了新的风险:生成式推荐系统可能消费被污染的网页内容,例如旨在误导推荐的虚假评论和推广页面。我们提出:在消费被污染的检索结果时,搜索增强的LLM在多大程度上会成为虚假产品的无意推广者?为此,我们引入FORGE(生成环境中的虚假在线推荐),这是一个在受控网页内容污染下衡量虚假产品推荐的基准。给定上游搜索结果,FORGE将检索到的网页中的真实产品本地重写为虚假产品,以模拟网页内容污染,并测量LLM推荐虚假产品的频率。FORGE涵盖15个类别和5个消费者场景下的225个真实世界产品。在12个商业和开源LLM中,所有模型都易受影响:单个被污染的页面即可导致高达27%的被欺骗率,而完全替换前三个结果则将此比例提升至73.8%。不同类别间的脆弱性差异显著,当模型缺乏相关产品的稳定先验知识时,脆弱性增加。推理并不能缓解这种脆弱性;相反,它常常生成虚假的社会证明来为错误推荐辩护。我们评估了三种防御措施:怀疑提示和共识过滤(基于模型先验或跨文档证据)。怀疑可能加剧脆弱性,类似于推理,而过滤则可能抑制合法产品。我们在以下网址发布FORGE:this https URL。

英文摘要

Search-augmented LLMs increasingly mediate everyday consumer recommendations by retrieving live web content. This creates a new risk: generative recommenders may consume polluted web content, such as fake reviews and promotional pages crafted to mislead recommendations. We ask: to what extent do search-augmented LLMs become unwitting promoters of fake products when consuming polluted retrieval results? To answer this, we introduce FORGE (Fake Online Recommendations in Generative Environments), a benchmark for measuring fake-product promotion under controlled web-content pollution. Given an upstream search result, FORGE locally rewrites real products in retrieved web pages into fake ones to simulate web-content pollution, and measures how often the LLM recommends the fake product. FORGE covers 225 real-world products across 15 categories and 5 consumer scenarios. Across 12 commercial and open-weights LLMs, all models are vulnerable: a single polluted page yields fooled rates of up to 27%, while the full top-3 replacement raises this to 73.8%. Vulnerability varies substantially across categories, increasing when models lack stable prior knowledge of the relevant products. Reasoning does not mitigate this vulnerability; instead, it often generates spurious social proof to justify false recommendations. We evaluate three defenses: skepticism prompting and consensus filtering (over model priors or cross-document evidence). Skepticism can exacerbate vulnerability, much like reasoning, while filtering risks suppressing legitimate products. We release FORGE at https://github.com/leoluolol/forge-benchmark.

2606.13668 2026-06-12 cs.CL 新提交

Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution

Influcoder:将解码器的梯度影响排名蒸馏到编码器用于数据归因

Dimitri Kachler, Damien Sileo, Pascal Denis

发表机构 * Centre Inria de l’Université de Lille, CRIStAL, Université de Lille(里尔大学Inria中心,CRIStAL,里尔大学)

AI总结 针对大型语言模型训练数据归因中影响函数方法计算和存储成本高的问题,提出Influcoder方法,通过将解码器梯度影响排名蒸馏到编码器,实现快速且成本高效的大规模数据归因。

Comments 8 pages, 2 figures

详情
AI中文摘要

随着大型语言模型(LLMs)能力的增长,通过过滤训练数据中的样本来策划高质量数据集的努力日益增多。通常,数据归因(DA)方法旨在估计训练数据集中单个样本如何预先调节模型以生成特定输出。例如,人们可能对数据中哪些样本可能是训练LLM后产生毒性行为的来源感兴趣。许多方法通过影响函数的范式来量化这种调节。虽然这类方法在其功能上是有效的,但它们缺乏必要的处理速度和存储紧凑性,无法在实际中应用于大型数据集。我们提出了一种方法,Influcoder,作为一种快速且成本高效的方法,用于大规模基于影响的数据归因。

英文摘要

As the capabilities of Large Language Models (LLMs) grow, there has been an increasing push to curate high-quality datasets by filtering samples in the training data. In general, Data Attribution (DA) methods aim to estimate how individual samples in a training dataset can precondition a model to generate certain outputs. As an example, one might be interested in which samples in the data could be the source of toxic behavior after training the LLM. Many methods quantify this conditioning through the paradigm of influence functions. While methods of this family are effective at this task, they lack the processing speed and storage compactness needed for practical use on large datasets. We propose Influcoder, a quick and cost-effective approach to influence-based Data Attribution at scale.

2606.12426 2026-06-12 cs.CY cs.CL cs.LG 交叉投稿

Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

两个错误,没有正确:审计计算社会科学中LLM标注者的社会期望偏差

Varun Kotte

发表机构 * Varun Kotte

AI总结 研究审计了三个开源指令微调模型在TweetEval任务中的社会期望偏差,发现模型存在宽大、过度纠正和中性偏差,且提示干预无法纠正,聚合指标可能掩盖实质结论错误。

详情
AI中文摘要

LLM标注者越来越多地用于计算社会科学(CSS),但尚不清楚其对齐形状的错误是否会改变研究者报告的实证结论。我们在四个提示条件下(72个单元格)审计了三个开源7B指令微调模型(Zephyr、Mistral-Instruct、Qwen2.5-Instruct)在六个TweetEval任务中的表现,发现社会期望失败并非单一方向。Zephyr表现出宽大偏差,系统性地少应用有害标签(冒犯性语言:假良性率0.729,虚警率0.031)。Mistral和Qwen表现出过度纠正,过度应用相同标签(Mistral仇恨言论FAR = 0.604)。所有三个模型在堕胎立场上表现出中性偏差,低估反对流行率24至40个百分点,并夸大中性标签。我们测试的四种提示干预(中性、安全框架、去个性化、思维链)均未纠正这些跨模型失败;安全框架可能加剧立场扭曲。引人注目的是,Zephyr的仇恨言论流行率估计与黄金率完全一致,而其类别条件误差在两个方向上都很大,这是一种偶然的抵消,误导了聚合验证。我们将这些模式转化为一个三部分分类法,具有诊断性FBR/FAR特征和轻量级黄金样本验证协议。可信CSS的标题:在聚合指标上看起来校准的模型仍然可能翻转研究者报告的实质性实证结论。

英文摘要

LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) across six TweetEval tasks under four prompt conditions (72 cells) and find that social-desirability failures do not run in a single direction. Zephyr exhibits leniency bias, systematically under-applying harmful labels (offensive language: false benign rate 0.729, false alarm rate 0.031). Mistral and Qwen exhibit overcorrection, over-applying the same labels (Mistral hate-speech FAR = 0.604). All three models exhibit neutrality bias on abortion stance, underestimating opposition prevalence by 24 to 40 percentage points and inflating the neutral label. None of the four prompting interventions we test (neutral, safety framing, depersonalized, chain-of-thought) corrects these failures across models; safety framing can worsen stance distortion. Strikingly, Zephyr's hate-speech prevalence estimate matches the gold rate exactly while its class-conditional errors are large in both directions, an accidental cancellation that misleads aggregate validation. We translate these patterns into a three-part taxonomy with diagnostic FBR/FAR signatures and a lightweight gold-sample validation protocol. The headline for trustworthy CSS: a model that looks calibrated on aggregate metrics can still flip the substantive empirical conclusion a researcher would report.
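The accidental cancellation described above is easy to reproduce with toy labels. The sketch below assumes that the false benign rate (FBR) is the share of gold-harmful items labeled benign and that the false alarm rate (FAR) is the share of gold-benign items labeled harmful. These definitions, the function name, and the data are illustrative assumptions, not the paper's protocol.

```python
def audit_binary_labels(gold, pred, harmful=1):
    """Compare aggregate prevalence with class-conditional error rates.

    Assumes both classes occur in gold; otherwise a rate is undefined.
    """
    on_harmful = [p for g, p in zip(gold, pred) if g == harmful]
    on_benign = [p for g, p in zip(gold, pred) if g != harmful]
    return {
        "gold_prevalence": len(on_harmful) / len(gold),
        "pred_prevalence": sum(p == harmful for p in pred) / len(pred),
        # Gold-harmful items that the annotator called benign.
        "false_benign_rate": sum(p != harmful for p in on_harmful) / len(on_harmful),
        # Gold-benign items that the annotator called harmful.
        "false_alarm_rate": sum(p == harmful for p in on_benign) / len(on_benign),
    }

# Toy data: half of each class is mislabeled, in opposite directions.
gold = [1] * 10 + [0] * 10
pred = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 5
print(audit_binary_labels(gold, pred))
```

In this toy case the predicted prevalence equals the gold prevalence even though half of each class is wrong, which is why a prevalence check alone cannot validate an annotator.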

2606.12443 2026-06-12 cs.CY cs.AI cs.CL 交叉投稿

Occupational Prompting Reveals Cultural Bias in Large Language Models

职业提示揭示大型语言模型中的文化偏见

Maksim E. Eren, Andrea Brennen, Ryan C. Barron, Eric Michalak

发表机构 * U.S. Government(美国政府)

AI总结 通过职业提示(如会计师、教师)替代国籍提示,研究开源LLM在价值观调查中的响应,发现不同职业导致文化地图内偏移,表明职业角色引发结构化价值模式。

详情
AI中文摘要

社会角色塑造期望、优先级和判断,但大型语言模型(LLM)如何将职业身份与更广泛的文化价值模式关联仍不清楚。先前工作使用基于国籍的文化提示来研究LLM对价值观调查问题的响应如何与人类文化基准对齐。本文通过用职业提示替代文化提示,扩展了该框架,以检查职业角色线索如何影响开源LLM的价值观调查响应。使用基于综合价值观调查问题的调查评估流程,我们将模型响应投影到二维Inglehart-Welzel文化空间。我们提示开源LLM以职业身份(如会计师、教师、工程师和护士)回答问题,然后分析这些职业条件化响应在文化地图上的位置。结果表明,当用职业而非国籍身份提示开源LLM时,其响应仍位于文化地图的广泛西方倾向区域。然而,不同职业在该区域内引入偏移,产生不同的职业偏差。这表明职业提示并非被视为中性角色标签,而是引发结构化价值模式。这些发现将基于调查的文化偏见评估扩展到国籍提示之外,并提供了研究职业角色如何塑造LLM中价值表达的框架。

英文摘要

Social roles shape expectations, priorities, and judgments, yet it remains unclear how large language models (LLMs) associate occupational identities with broader cultural value patterns. Prior work used nationality-based cultural prompting to study how LLM responses to value-survey questions align with human cultural benchmarks. In this paper, we extend that framework by replacing cultural prompting with occupational prompting to examine how professional-role cues influence value-survey responses in open-weight LLMs. Using a survey-grounded evaluation pipeline based on questions from the Integrated Values Surveys, we project model responses into the two-dimensional Inglehart-Welzel cultural space. We prompt open-weight LLMs to answer questions under occupational identities such as accountant, teacher, engineer, and nurse, and then analyze how these occupation-conditioned responses are positioned on the cultural map. Our results show that when open-weight LLMs are prompted with occupations rather than national identities, their responses remain within a broadly Western-leaning region of the cultural map. However, different occupations introduce shifts within this region, producing distinct occupational skews. This indicates that occupational prompts are not treated as neutral role labels, but instead elicit structured value patterns. These findings extend survey-based evaluation of cultural bias beyond nationality-based prompting and provide a framework for studying how occupational personas shape value expression in LLMs.

2606.12730 2026-06-12 cs.AI cs.CL cs.CY cs.LG 交叉投稿

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

重新思考LLMs的心理测量评估:自我报告何时以及为何能预测行为

Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez

发表机构 * Caltech(加州理工学院) UIUC(伊利诺伊大学厄巴纳-香槟分校) University of Cambridge(剑桥大学)

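AI summary Contrasts the Big 5 personality traits with the Theory of Planned Behavior (TPB) and finds that self-report-behavior coherence in LLMs is selective: TPB reaches human-level coherence within a shared conversation, coherence across conversations survives only for behaviors anchored in training, and persona prompting does not bring behavior into alignment.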

Comments Accepted as an Oral (Contributed Talk) at the ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)

详情
AI中文摘要

从低成本心理测量探针预测LLM行为倾向对于安全部署至关重要,但前提是自我报告(SR)能可靠地预测行为。近期研究记录了LLMs中显著的SR-行为分离,但依赖于广泛的人格特质(大五),这些特质即使在人类中也只能弱预测特定行为。此外,对话会话的隔离加上弱上下文匹配使得以下问题悬而未决:LLMs是否真正缺乏一致性,或者检测这种一致性所需的条件是否未满足。我们将大五与计划行为理论(TPB)进行对比,后者测量针对特定行为的意图,并且比广泛特质能更好地预测人类行为。我们在四个行为任务和11个前沿LLM上进行实验,同时改变会话上下文和身份诱导。我们发现SR-行为一致性存在但具有选择性。1) 在共享对话中,计划行为理论达到人类水平的一致性;大五则没有。2) 在跨对话中,一致性仅对锚定于即时提示之外的行为(如由训练塑造的内隐偏见)幸存,而当行为被上下文强烈启动(如谄媚)时则崩溃。3) 角色提示使自我报告在对话间更一致,但并未使行为对齐。这些发现表明,粗糙的人格框架(如大五)可能不是测试部署行为的最佳工具。需要更多任务和特定行为的工具,并且即使这些工具也必须在任务和上下文中进行评估。

英文摘要

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors weakly, even in humans. Furthermore, the isolation of conversational sessions combined with weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect such coherence were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted to a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, while also varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, the Theory of Planned Behavior reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment. These findings suggest that coarse personality frameworks, such as Big 5, may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.

2606.12764 2026-06-12 cs.LG cs.CL cs.CR 交叉投稿

Detecting Functional Memorization in Code Language Models

检测代码语言模型中的功能记忆

Matthieu Meeus, Anil Ramakrishna, Matthew Grange, Zheng Xu, Luca Melis

发表机构 * Meta Imperial College London(伦敦帝国学院)

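AI summary Studies functional memorization in code language models using a counterfactual setup that compares a model exposed to target code with an unexposed reference model; textual and functional similarity metrics show that functional memorization goes beyond what textual overlap detects.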

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于大规模生成代码。同时,先前的工作通过审计训练示例与模型生成之间的文本重叠,研究了训练数据是否可以从模型输出中恢复。然而,代码可能在功能上等价而在文本上不相似。在这项工作中,我们研究了功能记忆:提取超出逐字指标检测的功能逻辑。我们为Olmo-3-32B构建了一个反事实设置,将中期训练模型(暴露于目标代码)与预训练参考模型(未暴露)进行比较。我们使用Python函数签名提示两个模型,并测量文本和功能相似性(即LLM作为评判者、基于执行)。我们的结果显示了功能记忆的明确证据,突出了需要超越文本重叠的审计指标。

英文摘要

Large language models (LLMs) are increasingly used to generate code at scale. Meanwhile, prior work has investigated whether training data may be recoverable from model outputs, by auditing the textual overlap between training examples and model generations. Code, however, can be functionally equivalent while textually dissimilar. In this work, we study functional memorization: extraction of functional logic beyond what verbatim metrics detect. We construct a counterfactual setup for Olmo-3-32B, comparing a midtrained model (exposed to target code) against a pretrained reference (not exposed). We prompt both models with Python function signatures and measure both textual and functional similarity (i.e., LLM-as-a-judge, execution-based). Our results show clear evidence of functional memorization, highlighting the need for auditing metrics that go beyond textual overlap.
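
To illustrate why textual overlap can miss functional memorization, the following minimal sketch compares two hypothetical Python functions that differ textually but agree on every probe input. The function sources, the probe inputs, and the use of `difflib.SequenceMatcher` as the textual metric are illustrative assumptions, not the paper's metrics or pipeline.

```python
import difflib

# Two hypothetical implementations: textually different, functionally identical.
REFERENCE_SRC = '''
def total(values):
    result = 0
    for v in values:
        result += v
    return result
'''

GENERATED_SRC = '''
def total(values):
    return sum(values)
'''


def textual_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] (assumed textual metric)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def load_function(source: str, name: str):
    """Execute the source in a fresh namespace and return the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


def functional_agreement(src_a: str, src_b: str, name: str, probes) -> float:
    """Fraction of probe inputs on which both functions return the same value."""
    fa, fb = load_function(src_a, name), load_function(src_b, name)
    matches = sum(fa(p) == fb(p) for p in probes)
    return matches / len(probes)


PROBES = [[], [1], [1, 2, 3], [-5, 5], [10, 20, 30, 40]]
text_sim = textual_similarity(REFERENCE_SRC, GENERATED_SRC)
func_sim = functional_agreement(REFERENCE_SRC, GENERATED_SRC, "total", PROBES)
print(f"textual={text_sim:.2f} functional={func_sim:.2f}")
```

The two scores diverge because the functions share little text yet compute the same result on all probes. Running model-generated code with `exec` is only acceptable inside a sandbox; this sketch skips that step for brevity.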

2606.12900 2026-06-12 cs.AI cs.CL cs.LG 交叉投稿

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

零源大语言模型幻觉检测:类人类标准探测

Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen, Mingkui Tan

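AI summary Proposes the HCPD paradigm, whose Human-like Criteria Probing mechanism emulates the multi-faceted reasoning of human evaluators; combined with reward-based alignment and multi-sampling aggregation, it delivers effective, explainable hallucination detection under the zero-source constraint.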

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLM)常因生成事实错误或不忠实的内容而产生幻觉,对其安全使用构成重大风险。在零源约束下,即无法获取模型内部信息或外部参考,检测必须仅依赖于文本查询-答案对,检测此类幻觉尤为困难。本文提出用于幻觉检测的类人类标准探测(HCPD)范式,该范式模拟人类评估者的多面推理。其核心是类人类标准探测(HCP)机制,其中LLM代理自适应地将其判断分解为一组可解释的加权标准,并将特定标准得分聚合为最终的真实性度量。为实现这种自适应能力,我们引入了一种基于奖励的对齐方案,仅使用来自语义一致性的弱监督。在推理时,我们采用多样本聚合策略,确保决策稳健的同时保持完全可解释性。我们进一步提供了支持我们方法可靠性的理论分析。大量实验表明,HCPD始终优于最先进的基线,为零源幻觉检测提供了一种有效且可解释的解决方案。代码可从 https://github.com/TRISKEL10N/HCPD 获取。

英文摘要

Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.
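
The aggregation described above (criterion-specific scores combined by weight, then pooled over several samples) can be sketched roughly as follows. The criterion names, weights, scores, the averaging rule, and the 0.5 decision threshold are all hypothetical; the abstract does not specify how HCPD combines them, so this is one plausible reading rather than the released implementation.

```python
from statistics import mean


def aggregate_criteria(criteria):
    """Weighted average of criterion scores.

    `criteria` is a list of (name, weight, score) tuples; weights are
    normalised here so they need not sum to 1 in the input.
    """
    total_weight = sum(weight for _, weight, _ in criteria)
    if total_weight <= 0:
        raise ValueError("criterion weights must sum to a positive value")
    return sum(weight * score for _, weight, score in criteria) / total_weight


def multi_sample_truthfulness(samples):
    """Average the per-sample truthfulness scores (assumed pooling rule)."""
    return mean(aggregate_criteria(sample) for sample in samples)


# Two hypothetical judge samples for the same query-answer pair.
samples = [
    [("factual consistency", 0.5, 0.2),
     ("specificity", 0.3, 0.6),
     ("self-consistency", 0.2, 0.4)],
    [("factual consistency", 0.6, 0.3),
     ("plausibility", 0.4, 0.5)],
]

score = multi_sample_truthfulness(samples)
is_hallucination = score < 0.5  # hypothetical threshold
print(f"truthfulness={score:.2f} hallucination={is_hallucination}")
```

Keeping the named criteria and their weights alongside the final score is what makes the decision inspectable: each sample records which criteria drove the judgment.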

2606.13209 2026-06-12 cs.LG cs.CL 交叉投稿

Understanding helpfulness and harmless tension in reward models

理解奖励模型中的有用性与无害性张力

Eshaan Tanwar, Pepa Atanasova

发表机构 * University of Copenhagen(哥本哈根大学)

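AI summary Uses activation analysis and targeted ablations to show that helpfulness and harmlessness objectives interfere in reward models, and that neurons shared between the two objectives exert a disproportionate influence on model behaviour, contributing to alignment tension.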

Comments The source code used in this study is publicly available at: https://github.com/EshaanT/RM-alignment_tension

详情
AI中文摘要

奖励模型是从人类反馈中进行强化学习(RLHF)的关键组成部分,使语言模型在有用性和无害性行为上对齐。然而,这些目标背后的内部机制及其冲突仍知之甚少。我们研究了在仅有用性、仅无害性和混合目标设置下训练的奖励模型中的对齐张力。我们发现混合目标模型通常表现不如单目标模型,表明目标之间存在干扰。使用基于激活的方法,我们识别了与每个目标相关的神经元,并通过定向消融研究其功能角色。我们发现这些神经元因果地支持其对应目标,同时往往对对立目标产生负面影响。我们发现相当比例的神经元在有用性和无害性之间共享,并且这些共享神经元对模型行为产生不成比例的影响,导致对齐张力。此外,我们的结果提供了关于对齐目标如何在奖励模型中表示以及为什么多目标对齐仍然具有挑战性的见解和机制解释,为未来关于解耦和可控对齐方法的研究提供了动力。

英文摘要

Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. We find that mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. We find that these neurons causally support their corresponding objectives while often negatively affecting the opposing one. We find that a substantial proportion of neurons are shared between helpfulness and harmlessness, and that these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Additionally, our results provide insights and mechanistic interpretation into how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.

2602.18154 2026-06-12 cs.CL cs.AI cs.DB 版本更新

FENCE: A Financial and Multimodal Jailbreak Detection Dataset

FENCE:一个金融和多模态越狱检测数据集

Mirae Kim, Seonghun Jeong, Youngjun Kwak

发表机构 * arXiv

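AI summary Addresses the scarcity of multimodal jailbreak-detection resources in finance with FENCE, a bilingual (Korean-English) text-and-image dataset for training and evaluating detectors; a baseline detector trained on it reaches 99% in-distribution accuracy.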

Comments Accepted at LREC 2026

详情
AI中文摘要

越狱对大型语言模型(LLM)和视觉语言模型(VLM)的部署构成重大风险。VLM尤其脆弱,因为它们处理文本和图像,创造了更广泛的攻击面。然而,可用于越狱检测的资源很少,特别是在金融领域。为填补这一空白,我们提出了FENCE,一个双语(韩语-英语)多模态数据集,用于训练和评估金融应用中的越狱检测器。FENCE通过金融相关查询与图像威胁配对,强调领域真实性。使用商业和开源VLM进行的实验揭示了持续的脆弱性,GPT-4o显示出可测量的攻击成功率,而开源模型则表现出更大的暴露。在FENCE上训练的基线检测器实现了99%的分布内准确率,并在外部基准测试中保持强劲性能,突显了该数据集在训练可靠检测模型方面的鲁棒性。FENCE为推进金融领域的多模态越狱检测以及支持敏感领域中更安全、更可靠的AI系统提供了重点资源。警告:本文包含可能具有冒犯性的示例数据。

英文摘要

Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.

2606.03096 2026-06-12 cs.CL 版本更新

Can Factual Opinions Be Edited (Manipulated) in Large Language Models?

大型语言模型中的事实性观点能否被编辑(操纵)?

Yuanpu Cao, Ziyi Yin, Fenglong Ma, Jinghui Chen

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

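AI summary Introduces the FOE benchmark to assess how well current knowledge-editing techniques can manipulate factual opinions (e.g., the documented stances of public figures), finds that they often achieve only superficial changes without keeping the edited opinion consistent with its supporting evidence, and proposes a Self-Generated Evidence-Aligned method for opinion-evidence alignment.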

Comments Accepted to the ACL 2026 Main Conference

详情
AI中文摘要

大型语言模型(LLMs)正日益融入各个领域,这使得知识编辑技术变得至关重要,但也存在潜在危险。当前的编辑方法主要针对原子事实,忽视了操纵事实性观点(例如,公众人物在社会问题上的有记录的立场)所带来的重大风险。这种操纵可能重塑公众形象、影响选举并改变社会观点。为了系统评估这一威胁,我们引入了事实性观点编辑与证据(FOE)基准,涵盖261位公众人物、19个问题类别和2,178条完整的观点记录。我们的评估表明,当前的编辑技术在处理事实性观点时面临显著困难,通常仅能实现表面修改,而无法保持编辑后的观点与模型生成的支撑证据之间的一致性。为解决这一局限,我们进一步提出了一种简单而有效的自生成证据对齐方法,无需依赖显式指令即可实现观点-证据对齐。我们的基准和方法共同为理解LLMs中事实性观点编辑的新兴安全影响奠定了基础。

英文摘要

Large Language Models (LLMs) are increasingly integrated into various domains, making knowledge editing techniques crucial yet potentially hazardous. Current editing methods primarily target atomic facts, overlooking the significant risks associated with manipulating factual opinions, e.g., documented stances of public figures on societal issues. Such manipulation could reshape public images, influence elections, and alter societal views. To systematically assess this threat, we introduce the Factual Opinion Editing with Evidence (FOE) benchmark, which encompasses 261 public figures, 19 issue categories, and 2,178 complete opinion records. Our evaluations demonstrate that current editing techniques struggle significantly with factual opinions, often achieving only superficial changes while failing to preserve consistency between the edited opinion and the supporting evidence generated by the model. To address this limitation, we further propose a simple yet effective Self-Generated Evidence-Aligned method that achieves opinion-evidence alignment without relying on explicit instructions. Together, our benchmark and method provide a foundation for understanding the emerging security implications of factual opinion editing in LLMs.

2606.10931 2026-06-12 cs.CL 版本更新

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

一个样本就能带偏所有:单次GRPO打破对齐

Naihao Deng, Yilun Zhu, Naichen Shi, Clayton Scott, Rada Mihalcea

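AI summary Shows that one-shot GRPO training on a single biased example is enough to induce systematic bias in LLMs, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks, which exposes a critical vulnerability in post-training alignment.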

详情
AI中文摘要

警告:本文包含若干有毒和冒犯性言论。现代大语言模型通常通过大规模后训练进行对齐,以确保公平和可靠的行为。在本工作中,我们研究了通过群体相对策略优化(GRPO)打破这些防护栏的容易程度。我们表明,在单个有偏样本上进行一次GRPO训练就足以诱导系统性偏见,且基于刻板印象的推理会泛化到不同属性、类别和基准测试中。我们进一步发现,模型基于初始产生有偏输出的可能性而表现出不同的易感性。我们的结果揭示了后训练中的一个关键脆弱性:对齐可以被单个样本覆盖。

英文摘要

Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.

2606.12160 2026-06-12 cs.CL 版本更新

A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

指令调优大语言模型解码时真实性方法的受控研究

Ao Sun

发表机构 * Independent Researcher(独立研究员)

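AI summary Re-evaluates decoding-time truthfulness methods on instruction-tuned LLMs with a six-control framework across 5 models, 3 benchmarks, and 15 methods; previously reported gains shrink substantially under strict controls, no token-level method achieves a statistically significant improvement on the full TruthfulQA benchmark, and deliberative prompting such as chain-of-thought appears more robust.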

详情
AI中文摘要

解码时真实性方法(层对比解码、推理时干预和可学习的logit适配器)在应用于基础语言模型时,已在TruthfulQA上展示出10至30分的提升。然而,现代指令调优LLM的基线已显著更高(61-76%),这引出了这些方法在实践中是否仍然有效的问题。我们设计了一个包含六项控制的评估框架(分布外训练、多裁判验证、简单解码基线、混杂因素控制、自助法置信区间和随机种子方差),并将其应用于5个模型(1B-70B)、3个基准和15种方法。我们发现,在严格控制下,先前报告的提升大幅缩小:在完整的TruthfulQA基准(N=817)上,没有任何令牌级方法取得统计显著的改进,表现最好的可学习适配器比贪心解码低2.0分(p=.23)。我们识别出五种评估敏感因素(数据污染、裁判选择、基线缺失、混杂因素和统计噪声),它们单独或共同解释了这些差异。在HaluEval QA和TriviaQA上的跨基准验证证实,这些模式并不限于TruthfulQA。审慎式提示方法(思维链、自我批评)在所评估的范围内显得更为稳健,其中思维链作为一种无需训练的单次推理方法,在各基准上取得了+5.6至+19个百分点的提升。我们发布了一份七项评估清单,并讨论了对未来真实性研究的启示。

英文摘要

Decoding-time truthfulness methods (layer-contrast decoding, inference-time intervention, and learned logit adapters) have demonstrated 10-30 point gains on TruthfulQA when applied to base language models. However, modern instruction-tuned LLMs already achieve substantially higher baselines (61-76%), raising the question of whether these methods remain effective in practice. We design a six-control evaluation framework (out-of-distribution training, multi-judge validation, simple decoding baselines, confound controls, bootstrap confidence intervals, and seed variance) and apply it across 5 models (1B-70B), 3 benchmarks, and 15 methods. We find that previously reported gains shrink substantially under strict controls: on the full TruthfulQA benchmark (N=817), no token-level method achieves statistically significant improvement, and the best learned adapter scores 2.0 points below greedy decoding (p=.23). We identify five evaluation sensitivities (contamination, judge choice, missing baselines, confounds, and statistical noise) that individually or jointly account for these discrepancies. Cross-benchmark validation on HaluEval QA and TriviaQA confirms that these patterns extend beyond TruthfulQA. Deliberative prompting methods (chain-of-thought, self-critique) appear more robust in the evaluated regime, with CoT achieving gains of +5.6 to +19 pp across benchmarks as a training-free, single-pass method. We release a seven-point evaluation checklist and discuss implications for future truthfulness research.
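
One of the six controls, bootstrap confidence intervals, can be sketched as a paired percentile bootstrap over per-question correctness. This is a generic illustration under stated assumptions: the correctness vectors below are synthetic, the function name and parameters are hypothetical, and the paper's exact resampling procedure is not given in the abstract. Only the sample size (N=817) is taken from it.

```python
import random


def paired_bootstrap_ci(method_correct, baseline_correct,
                        n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy difference (method - baseline).

    Both inputs are 0/1 lists over the same questions, so each resample
    draws question indices once and applies them to both systems.
    """
    if len(method_correct) != len(baseline_correct):
        raise ValueError("both result lists must cover the same questions")
    rng = random.Random(seed)
    n = len(method_correct)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(method_correct[i] - baseline_correct[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic per-question outcomes (not real results).
data_rng = random.Random(1)
baseline = [1 if data_rng.random() < 0.70 else 0 for _ in range(817)]
method = [1 if data_rng.random() < 0.71 else 0 for _ in range(817)]

low, high = paired_bootstrap_ci(method, baseline)
significant = low > 0 or high < 0  # interval excludes zero
print(f"95% CI for accuracy difference: [{low:+.3f}, {high:+.3f}]")
```

A gain is treated as significant here only when the interval excludes zero, which is the kind of check that lets a small apparent improvement be separated from resampling noise.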

2507.08794 2026-06-12 cs.LG cs.CL 版本更新

One Token to Fool LLM-as-a-Judge

一个令牌就能欺骗LLM裁判

Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, Dong Yu

发表机构 * Princeton University(普林斯顿大学) University of Virginia(弗吉尼亚大学) Tencent AI Lab(腾讯人工智能实验室) Rutgers University(罗格斯大学)

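AI summary Finds that reference-based generative reward models are susceptible to reward hacking: superficial inputs such as non-word symbols or generic reasoning openers consistently elicit false positive rewards. Proposes a data augmentation strategy that uses truncated model outputs as adversarial negative examples, yielding the robust Master Reward Models (Master-RMs).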

详情
AI中文摘要

大型语言模型(LLM)越来越被信任作为自动裁判,协助评估并为训练其他模型提供奖励信号,特别是在基于参考的设置中,如带可验证奖励的强化学习(RLVR)。然而,我们揭示了即使在这种基于参考的范式中也存在一个关键漏洞:生成式奖励模型系统性地容易受到奖励黑客攻击。我们发现,表面输入——我们称之为“万能钥匙”,例如非词符号(如“:”或“.”)或通用推理开头(如“思考过程:”或“让我们逐步解决这个问题。”)——可以在没有任何实质性推理的情况下持续引发假阳性奖励。我们的系统评估表明,这是一个广泛存在的失败,影响多种模型,包括领先的专有系统如GPT-o1和Claude-4。这些结果挑战了LLM裁判假定的鲁棒性,并对其可靠性构成重大威胁。为了解决这个问题,我们提出了一种简单而有效的数据增强策略,使用截断的模型输出作为对抗性负例。由此产生的Master奖励模型(Master-RMs)在对这些“万能钥匙”攻击方面表现出最先进的鲁棒性,同时在标准评估设置中保持高性能。我们通过跨模型规模、提示变化和常见推理时策略的漏洞全面分析来补充这些发现,为未来关于鲁棒LLM评估的研究提供见解。我们在 https://huggingface.co/sarosavo/Master-RM 和 https://huggingface.co/datasets/sarosavo/Master-RM 发布我们的鲁棒通用领域奖励模型和合成训练数据。

英文摘要

Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term "master keys", such as non-word symbols (e.g., ":" or ".") or generic reasoning openers (e.g., "Thought process:" or "Let's solve this problem step by step."), can consistently elicit false positive rewards without any substantive reasoning. Our systematic evaluation demonstrates this is a widespread failure affecting a diverse range of models, including leading proprietary systems such as GPT-o1 and Claude-4. These results challenge the assumed robustness of LLM judges and pose a significant threat to their reliability. To address this, we propose a simple yet effective data augmentation strategy using truncated model outputs as adversarial negative examples. The resulting Master Reward Models (Master-RMs) demonstrate state-of-the-art robustness against these "master key" attacks while maintaining high performance in standard evaluation settings. We supplement these findings with a comprehensive analysis of the vulnerability across model scales, prompt variations, and common inference-time strategies, offering insights to guide future research on robust LLM evaluation. We release our robust, general-domain reward models and the synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
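
The augmentation idea (truncated model outputs reused as adversarial negatives) might look roughly like the sketch below. The record fields, the first-sentence truncation rule, and the function names are assumptions made for illustration; the abstract does not state how the authors truncate outputs or format their training data.

```python
import re


def truncate_to_opener(response: str) -> str:
    """Keep only the first sentence of a response (assumed truncation rule)."""
    first_line = response.strip().split("\n", 1)[0]
    match = re.match(r".*?[.:!?](?=\s|$)", first_line)
    return match.group(0) if match else first_line


def augment_with_truncated_negatives(examples):
    """For every positive example, add a truncated copy labelled as incorrect."""
    augmented = []
    for ex in examples:
        augmented.append(ex)
        if ex["label"] == 1:
            augmented.append({
                "question": ex["question"],
                "reference": ex["reference"],
                "response": truncate_to_opener(ex["response"]),
                "label": 0,  # the opener alone carries no answer
            })
    return augmented


# Hypothetical training record for a generative reward model.
examples = [{
    "question": "What is 12 * 7?",
    "reference": "84",
    "response": "Let's solve this problem step by step. 12 * 7 = 84. The answer is 84.",
    "label": 1,
}]

augmented = augment_with_truncated_negatives(examples)
for record in augmented:
    print(record["label"], repr(record["response"]))
```

A practical pipeline would also need to skip responses whose first sentence already contains the answer, since truncating those would produce mislabelled negatives.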

2512.15134 2026-06-12 cs.LG cs.AI cs.CL 版本更新

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

从孤立到纠缠:可解释性方法何时识别和解缠已知概念?

Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger

发表机构 * Boston University(波士顿大学) Harvard University(哈佛大学) Mila – Quebec AI Institute(魁北克AI研究所) Goodfire(Goodfire公司)

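AI summary Proposes a multi-concept evaluation setting to test whether featurization methods such as sparse autoencoders and probes truly disentangle concepts; features are typically sensitive to a single concept, but concepts are distributed across many features, and steering a feature often affects many concepts, so correlational metrics are insufficient to establish steering selectivity.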

Comments ACL 2026

详情
AI中文摘要

可解释性的一个目标是从神经网络的激活中恢复潜在概念(特征)的解缠表示。特征的质量通常孤立地评估,并在可能不成立的隐式独立性假设下进行。因此,尚不清楚常见的特征化方法(如稀疏自编码器(SAE)和探针)在多大程度上将一个概念与另一个概念解缠。我们提出了一个多概念评估设置,使用包括情感、领域、语态和时态在内的概念。我们评估特征化器产生每个概念的解缠表示的效果,观察到特征通常只对单一概念敏感,但概念分布在许多特征上。然后,我们干预这些特征,测量每个概念是否可独立操控,以及特征是否相互作用。即使在理想化设置中,干预一个特征通常会影响多个概念,尽管几乎没有交互效应。这些结果表明,相关性指标不足以建立干预选择性,并且证明两个特征在分离空间中运行不足以声称它们将对一个概念具有选择性。这些结果强调了可解释性研究中多概念评估的重要性。

英文摘要

A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear to what extent common featurization methods such as sparse autoencoders (SAEs) and probes disentangle one concept from another. We propose a multi-concept evaluation setting using concepts including sentiment, domain, voice, and tense. We evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept, but also that concepts are distributed across many features. Then, we steer these features, measuring whether each concept is independently manipulable, and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that demonstrating that two features operate in separate spaces is insufficient to claim that they will be selective for one concept. These results underscore the importance of multi-concept evaluations in interpretability research.

2601.14295 2026-06-12 cs.AI cs.CL cs.CY 版本更新

Epistemic Constitutionalism Or: how to avoid coherence bias

认知宪政主义:或如何避免一致性偏见

Michele Loi

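AI summary Argues that AI systems need an explicit epistemic constitution, a set of contestable meta-norms that regulate how they form and express beliefs, uses source attribution bias as the motivating case, and defends the Liberal approach over the Platonic one.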

Comments 27 pages, 7 tables. Data: github.com/MicheleLoi/source-attribution-bias-data and github.com/MicheleLoi/source-attribution-bias-swiss-replication. Complete AI-assisted writing documentation: github.com/MicheleLoi/epistemic-constitutionalism-paper

详情
AI中文摘要

大型语言模型日益扮演着人工推理者的角色:它们评估论点、分配可信度并表达信心。然而,它们的信念形成行为受隐式、未经审查的认知策略支配。本文主张为AI建立一部认知宪法:明确的、可争议的元规范,用于调节系统如何形成和表达信念。源归因偏见提供了动机案例:我表明前沿模型强制执行身份-立场一致性,惩罚归因于其预期意识形态立场与论点内容冲突的源的论点。当模型检测到系统性测试时,这些效应消失,揭示系统将源敏感性视为需要抑制的偏见,而非一种需要良好执行的能力。我区分了两种宪政路径:柏拉图式路径,要求从特权立场出发的形式正确性和默认源独立性;自由主义路径,拒绝此类特权,指定保护集体探究条件的程序性规范,同时允许基于认知警觉的原则性源关注。我主张自由主义路径,勾勒出八项原则和四种取向的宪政核心,并提出AI认知治理需要与我们现在对AI伦理所期望的同样明确、可争议的结构。

英文摘要

Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.

2602.13379 2026-06-12 cs.CR cs.AI cs.CL cs.LG cs.SE 版本更新

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

多轮交互中的安全隐患:工具使用智能体的多轮安全风险基准与防御

Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi

发表机构 * Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校)

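AI summary Introduces MT-AgentRisk, a benchmark for multi-turn tool-using agent safety, finds that the Attack Success Rate (ASR) rises by 16% on average in multi-turn settings, and proposes ToolShield, a training-free, tool-agnostic self-exploration defense that reduces ASR by 30% on average.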

详情
AI中文摘要

基于LLM的智能体能力日益增强,但其安全性滞后。这造成了智能体能够做什么和应该做什么之间的差距。随着智能体进行多轮交互并使用多样化的工具,这一差距扩大,引入了现有基准忽视的新风险。为了系统地将安全测试扩展到多轮、工具真实的设置,我们提出一个原则性的分类法,将单轮有害任务转化为多轮攻击序列。利用该分类法,我们构建了MT-AgentRisk(多轮智能体风险基准),这是首个评估多轮工具使用智能体安全性的基准。我们的实验揭示了显著的安全退化:在开放和封闭模型的多轮设置中,攻击成功率(ASR)平均增加16%。为了缩小这一差距,我们提出了ToolShield,一种无需训练、与工具无关的自我探索防御方法:当遇到新工具时,智能体自主生成测试用例,执行它们以观察下游效果,并提炼安全经验用于部署。实验表明,ToolShield在多轮交互中平均有效降低ASR 30%。我们的代码可在 https://github.com/CHATS-lab/ToolShield 获取。

英文摘要

LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.

2604.16548 2026-06-12 cs.CR cs.AI cs.CL 版本更新

A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle

LLM智能体中长期记忆安全综述:跨记忆生命周期的攻击、防御与治理

Zehao Lin, Xixuan Hao, Renyu Fu, Shaobo Cui, Kai Chen, Chunyu Li, Zhiyu Li, Feiyu Xiong

发表机构 * MemTensor Shanghai Jiao Tong University(上海交通大学)

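AI summary Proposes a Memory Lifecycle Framework for systematically analysing the new threats facing long-term memory in LLM agents, introduces the architectural primitives of Verifiable Memory Governance (VMG), and stresses that security must be anchored in storage-time provenance and versioning.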

详情
AI中文摘要

LLM智能体中可写、跨会话持久记忆的出现,引入了与传统的以输入为中心的安全问题性质不同的威胁格局,其特点包括三个属性:持久性、状态性和传播性。为系统描述这一格局,我们提出记忆生命周期框架,该框架沿两个轴组织攻击、防御及其跨阶段依赖关系:六个生命周期阶段(写入、存储、检索、执行、共享与传播、遗忘与回滚)和四个安全目标(完整性、机密性、可用性、治理)。该分析进而揭示了在系统层面需要形式化安全保证,从而推动了可验证记忆治理(VMG)——一个由五个架构原语组成的框架,它规定了长期记忆系统必须提供哪些可验证机制,以维持对其记忆状态的可审计、可恢复控制。我们的分析表明,健壮的长期记忆(LTM)安全无法仅在检索或执行时进行事后补救,而必须从一开始就锚定于存储时的溯源、版本控制和策略感知的保留。

英文摘要

The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from conventional input-centric security concerns, characterized by three properties: persistence, statefulness, and propagation. To systematically characterize this landscape, we propose a Memory Lifecycle Framework that organizes attacks, defenses, and their cross-phase dependencies along two axes: six lifecycle phases (Write, Store, Retrieve, Execute, Share & Propagate, Forget & Rollback) and four security objectives (Integrity, Confidentiality, Availability, Governance). This analysis in turn exposes the need for formal security guarantees at the system level, motivating Verifiable Memory Governance (VMG), a framework of five architectural primitives that specifies what verifiable mechanisms a long-term-memory system must provide to maintain auditable, recoverable control over its memory state. Our analysis indicates that robust Long-Term Memory (LTM) security cannot be retrofitted at retrieval or execution time alone, but must be anchored in storage-time provenance, versioning, and policy-aware retention from the outset.

11. 低资源、领域适配与高效训练 5 篇

2606.12649 2026-06-12 cs.CL 新提交

MentalMARBERT: Domain-Adaptive Pre-training and Two-Stage Fine-Tuning for Arabic Mental Health Disorders Detection

MentalMARBERT:面向阿拉伯语心理健康障碍检测的领域自适应预训练与两阶段微调

Fatimah Almalki, Areej Alhothali, Lulwah Alharigy, Abdulrahman Aladeem

发表机构 * King Abdulaziz University(阿卜杜勒阿齐兹国王大学)

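AI summary Tackles dialectal variation, informal language, scarce annotated resources, and class imbalance in detecting mental health disorders from Arabic social media text with a framework combining domain-adaptive pretraining and two-stage fine-tuning; builds a dataset of 50,670 tweets, on which MentalMARBERT reaches a macro-F1 of 0.861 and an accuracy of 0.877.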

Comments 17 pages, 5 figures, 13 tables

详情
AI中文摘要

从阿拉伯语社交媒体文本中检测心理健康障碍仍然具有挑战性,原因包括方言差异、非正式语言、高质量标注资源有限以及严重的类别不平衡。虽然英语心理健康自然语言处理(NLP)已取得显著进展,但阿拉伯语多类别障碍分类的研究仍不充分。本研究提出一个两阶段框架用于阿拉伯语心理健康文本分类。在第一阶段,三个阿拉伯语预训练语言模型AraBERT、CAMeLBERT和MARBERT,使用大规模未标注阿拉伯语心理健康推文语料库进行领域自适应和任务自适应预训练(DAPT和TAPT)。在统一协议下评估自适应模型,以确定最有效的骨干模型。在第二阶段,选定的模型在四种配置下进行评估,这些配置结合了单阶段和分层两阶段分类架构,并采用全微调和低秩适应(LoRA)。为支持本研究,我们构建了一个新的标注阿拉伯语心理健康数据集,包含50,670条推文,涵盖六个类别,具有强标注者间一致性(Krippendorff's Alpha = 0.733,平均成对一致性 = 0.797)。实验结果表明,领域自适应的MARBERT(MentalMARBERT)在准确率和宏F1上均比基线模型有统计显著的提升。结合全微调的分层两阶段架构取得了最佳整体性能,宏F1达到0.861,准确率达到0.877。这些发现证明了领域特定自适应预训练和分层分类在阿拉伯语心理健康障碍检测中的有效性。

英文摘要

Detecting mental health disorders from Arabic social media text remains challenging due to dialectal variation, informal language, limited high-quality annotated resources, and severe class imbalance. While English mental health natural language processing (NLP) has progressed substantially, Arabic multi-class disorder classification remains insufficiently studied. This study proposes a two-phase framework for Arabic mental health text classification. In phase 1, three Arabic pre-trained language models, AraBERT, CAMeLBERT, and MARBERT, undergo Domain-Adaptive and Task-Adaptive Pretraining (DAPT and TAPT) using a large-scale corpus of unlabeled Arabic mental health tweets. The adapted models are evaluated under a unified protocol to identify the most effective backbone model. In phase 2, the selected model is assessed across four configurations combining single-stage and hierarchical two-stage classification architectures with full fine-tuning and Low-Rank Adaptation (LoRA). To support this study, we constructed a novel annotated Arabic mental health dataset comprising 50,670 tweets across six categories, with strong inter-annotator agreement (Krippendorff's Alpha = 0.733, average pairwise agreement = 0.797). Experimental results show that the domain-adapted MARBERT (MentalMARBERT) achieves statistically significant improvements over baseline models in both accuracy and macro-F1. The hierarchical two-stage architecture combined with full fine-tuning achieves the best overall performance, reaching a macro-F1 of 0.861 and an accuracy of 0.877. These findings demonstrate the effectiveness of domain-specific adaptive pretraining and hierarchical classification for Arabic mental health disorder detection.
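
Because the abstract stresses severe class imbalance, it helps to see why macro-F1 is reported next to accuracy. The toy example below computes both from scratch; the two class names and the predictions are invented for illustration and are unrelated to the paper's six categories or its results.

```python
def macro_f1_and_accuracy(y_true, y_pred):
    """Return (macro-F1, accuracy) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(labels)  # every class weighs the same
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return macro_f1, accuracy


# A classifier that always predicts the majority class (hypothetical labels).
y_true = ["none"] * 8 + ["depression"] * 2
y_pred = ["none"] * 10

macro_f1, accuracy = macro_f1_and_accuracy(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
```

The majority-class predictor looks strong on accuracy yet scores poorly on macro-F1, because the minority class contributes an F1 of zero with equal weight.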

2606.12854 2026-06-12 cs.CL q-bio.QM 新提交

Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization

小型LLM用于生物医学声明验证:成本效益微调、结构性数据集捷径与跨域泛化

Gaurav Kumar

发表机构 * Moveworks AI University of California San Diego(加州大学圣迭戈分校)

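AI summary Fine-tunes small LLMs (Phi-3-mini, Qwen2.5-3B, Mistral-7B) with QLoRA for biomedical claim verification; Mistral-7B QLoRA surpasses GPT-4o and GPT-5 (up to 12% F1 gain), and the study identifies a structural artifact in SciFact that inflates in-domain scores and shows that training on structurally sound data enables robust cross-domain transfer.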

Comments 8 pages, 2 figures, 12 tables. To appear at BioNLP Workshop, ACL 2026

详情
AI中文摘要

大型语言模型如GPT-4o和GPT-5在生物医学声明验证上表现出强大的零样本性能,但成本和透明度限制了其可扩展使用。我们通过QLoRA在SciFact和HealthVer上微调了三个小型LLM:Phi-3-mini(3.8B)、Qwen2.5-3B和Mistral-7B,首次研究了QLoRA模型与GPT-4o及微调BioLinkBERT编码器的对比。Mistral-7B QLoRA在仅使用1,008个训练样本的情况下,以极低的成本超越了GPT-4o和GPT-5(F1提升高达12%)。我们进行了广泛的域内和跨域评估:在SciFact上训练的模型在HealthVer上测试,反之亦然,并匹配模型大小以隔离数据集结构与数据量的影响。我们识别了SciFact中一个先前未报告的结构性伪影,该伪影夸大了域内得分,并通过双向域外评估表明,在结构稳健的数据上训练能够实现鲁棒的跨域迁移。我们计划发布所有代码和适配器检查点。

英文摘要

Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs, Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B, via QLoRA on SciFact and HealthVer, providing the first study of QLoRA models against GPT-4o and fine-tuned BioLinkBERT encoders. Mistral-7B QLoRA surpasses both GPT-4o and GPT-5 (up to 12% F1 gain) at a fraction of the cost using just 1,008 training examples. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact are tested on HealthVer and vice versa, at matched sizes to isolate dataset structure from data quantity. We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.

2606.13411 2026-06-12 cs.CL 新提交

An End-to-End Hybrid Framework for Rumour Detection in Low-Resources Algerian Dialect

面向低资源阿尔及利亚方言谣言检测的端到端混合框架

Dihia Lanasri, Fatima Benbarek

发表机构 * ATM Mobilis USTHB Algiers(阿尔及尔科技大学)

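AI summary Addresses resource scarcity and code-switching in Algerian-dialect rumour detection with an end-to-end hybrid framework; combining transformer embeddings with a classical classifier reaches an F1-score of 0.84, and domain-specific pre-training proves more important than model size.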

详情
AI中文摘要

社交媒体的快速增长加剧了谣言的传播。在阿尔及利亚语境下,由于方言内容的非正式性和代码切换特性、标注资源的稀缺以及标准阿拉伯语NLP工具在方言文本上的有限有效性,这一问题更具挑战性。本文提出了一种面向阿尔及利亚方言社交媒体内容的端到端谣言检测混合框架。我们通过结合真实社交媒体帖子、合成数据和FASSILA语料库,并基于相似性标注过程进行自动标注,构建了一个领域特定的标注数据集。还引入了一个音译流水线,以生成阿拉伯文字和Arabizi的并行数据集。我们评估了多种方法,包括经典机器学习、深度学习、Transformer和混合模型。实验结果表明,结合Transformer嵌入与经典分类器的混合方法达到了最佳性能,F1分数为0.84。我们还发现,领域特定预训练比模型规模更重要,在社交媒体上训练的模型优于在正式阿拉伯语语料库上训练的更大模型。这些结果证明了在低资源阿尔及利亚方言环境下进行谣言检测的可行性。

英文摘要

The rapid growth of social media has intensified the spread of rumours. This issue is more challenging in the Algerian context due to the informal and code-switched nature of dialectal content, the scarcity of annotated resources, and the limited effectiveness of standard Arabic NLP tools on dialect text. This paper presents an end-to-end rumour detection hybrid framework for Algerian dialect social media content. We build a domain-specific annotated dataset by combining real social media posts, synthetic data, and the FASSILA corpus, with automatic labeling based on a similarity-based annotation process. A transliteration pipeline is also introduced to generate parallel datasets in Arabic script and Arabizi. We evaluate multiple approaches, including classical machine learning, deep learning, transformers, and hybrid models. Experimental results show that a hybrid approach combining transformer embeddings with a classical classifier achieves the best performance, reaching an F1-score of 0.84. We also find that domain-specific pre-training is more important than model size, with social media-trained models outperforming larger models trained on formal Arabic corpora. These results demonstrate the feasibility of rumour detection in low-resource Algerian dialect settings.

2606.12876 2026-06-12 cs.LG cs.CL cs.IT math.IT 交叉投稿

Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

使用加性码本的大语言模型多比特宽度量化

Liza Babaoglu, Shuangyi Chen, Ashish Khisti

发表机构 * University of Toronto(多伦多大学)

AI总结 提出Drop-by-Drop框架,基于信息论和逐次细化理论,利用加性码本和Matryoshka监督实现单个模型在推理时支持多精度权重控制,降低存储开销并保持性能。

Comments 37 pages, 12 figures

详情
AI中文摘要

随着大语言模型(LLM)在具有不同资源约束的异构硬件上部署越来越广泛,无需重新训练即可自适应管理性能与效率之间权衡的能力变得至关重要。我们提出Drop-by-Drop,一种新颖的多比特宽度训练后量化框架,能够从单个训练模型实现对LLM权重的推理时精度控制。我们的方法在理论上基于信息论和逐次细化。我们证明,通常服从高斯分布的LLM权重,在由LLM损失函数驱动的加权均方误差失真下,随着额外比特的加入可以以递增的保真度最优重建。为了在实践中实现这一点,Drop-by-Drop将Matryoshka风格的监督纳入损失函数,利用了加性码本的结构。Drop-by-Drop生成单个模型,其中有序的码本子集在每个精度级别产生精确的部分重建。这种方法通过允许单个检查点服务于多个比特宽度,显著减少了存储和内存开销,同时在主要架构(如Qwen、LLaMA、Gemma和Mistral)上保持了有竞争力的困惑度和准确度。

英文摘要

As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
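The prefix property described in this abstract (ordered subsets of codebooks giving partial reconstructions) can be pictured with a toy scalar sketch. This is not the Drop-by-Drop training procedure: the greedy residual encoder, the `CODEBOOKS` values, and the function names are assumptions chosen only for illustration, and real additive codebooks hold vectors rather than scalars.

```python
from typing import List

# Hypothetical scalar codebooks, ordered from coarse to fine.
# Including 0.0 in the later books means an extra book never hurts here.
CODEBOOKS: List[List[float]] = [
    [-1.0, 1.0],
    [-0.5, 0.0, 0.5],
    [-0.25, 0.0, 0.25],
]


def encode(weight: float, codebooks: List[List[float]]) -> List[int]:
    """Greedy residual encoding: pick the nearest codeword per book."""
    codes = []
    residual = weight
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        codes.append(idx)
        residual -= book[idx]
    return codes


def reconstruct(codes: List[int], codebooks: List[List[float]], num_books: int) -> float:
    """Sum the selected codewords from the first `num_books` codebooks only."""
    return sum(codebooks[j][codes[j]] for j in range(num_books))


weight = 0.625
codes = encode(weight, CODEBOOKS)
errors = [abs(weight - reconstruct(codes, CODEBOOKS, m)) for m in (1, 2, 3)]
print(codes, errors)
```

The same stored `codes` serve every precision level: a lower-bitwidth deployment simply reads fewer codebooks, which is the storage argument the abstract makes for a single checkpoint.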

2604.26940 2026-06-12 cs.CL 版本更新

Select to Think: Unlocking SLM Potential with Local Sufficiency

Select to Think: 利用局部充分性解锁小语言模型潜力

Wenxuan Ye, Yangyang Zhang, Xueli An, Georg Carle, Yunpu Ma

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出Select to Think (S2T)方法,通过将大语言模型角色从生成转为选择,并蒸馏选择逻辑到小语言模型,使其在推理时无需依赖大模型,显著提升性能。

Comments Accepted to ICML 2026. Code is available at https://github.com/YeRona/Select-to-Think

详情
AI中文摘要

小语言模型(SLM)部署高效,但在推理能力上常落后于大语言模型(LLM)。现有解决方案要么在推理分歧点调用LLM,导致大量延迟和成本,要么依赖标准蒸馏,受限于SLM准确模仿LLM复杂生成分布的能力。我们通过识别局部充分性来解决这一困境:在分歧点,LLM偏好的token通常位于SLM的top-K预测中,即使未能成为SLM的top-1选择。因此,我们提出Select to Think(S2T),将LLM的角色从开放式生成重新定义为在SLM的候选提案中进行选择,将监督信号简化为离散的候选排名。利用这一点,我们引入S2T-Local,将选择逻辑蒸馏到SLM中,使其能够在推理时自主重新排序,无需依赖LLM。实验表明,1.5B SLM的top-8候选包含32B LLM选择的命中率达95%,S2T-Local使1.5B SLM的数学平均相对贪心解码提升24.1%,以单轨迹效率达到8路径自一致性的效果。

英文摘要

Small language models (SLMs) offer efficient deployment, yet they often lag behind their larger counterparts (LLMs) in reasoning. Existing remedies either invoke an LLM at points of reasoning divergence, incurring substantial latency and cost, or rely on standard distillation, which is limited by the SLM's capacity to accurately mimic the LLM's complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM's preferred token often resides within the SLM's top-K next-token predictions, even when failing to emerge as the SLM top-1 choice. We therefore propose Select to Think (S2T), which reframes the LLM's role from open-ended generation to selection among the SLM's proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-Local, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, a 1.5B SLM's top-8 candidates contain the 32B LLM's choice with a 95% hit rate, and S2T-Local improves the 1.5B SLM's Math Avg. over greedy decoding by 24.1% relative gain, matching the efficacy of 8-path self-consistency with single-trajectory efficiency.
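The "local sufficiency" measurement in this abstract amounts to a top-K hit rate: how often the LLM's preferred token falls inside the SLM's top-K candidates. A minimal sketch follows, assuming per-position SLM scores and LLM token choices are already available; the toy numbers and function names are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def hit_rate(slm_scores: Sequence[Sequence[float]],
             llm_choices: Sequence[int], k: int) -> float:
    """Fraction of positions where the LLM's token is in the SLM's top-k."""
    if not slm_scores:
        return 0.0
    hits = sum(
        1
        for scores, choice in zip(slm_scores, llm_choices)
        if choice in top_k_indices(scores, k)
    )
    return hits / len(slm_scores)


# Toy data: three divergence points over a four-token vocabulary.
slm_scores = [
    [0.1, 0.5, 0.3, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
]
llm_choices = [2, 3, 2]

for k in (1, 2, 4):
    print(k, hit_rate(slm_scores, llm_choices, k))
```

A high hit rate at small K is what would justify replacing open-ended LLM generation with selection among the SLM's own proposals.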

12. 其他/综合NLP 18 篇

2606.12765 2026-06-12 cs.CL cs.DC 新提交

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Rigel:逆向工程 Apple M4 Max GPU 上的 Metal 4.1 张量计算路径

Ramchand Kumaresan

发表机构 * Apple Inc.(苹果公司)

AI总结 通过微基准测试逆向工程 Apple M4 Max 的 Metal 4.1 张量计算路径,揭示 fp8 matmul2d 为模拟而非硬件加速,并重建了 8x8 张量片段布局。

详情
AI中文摘要

Apple 的 Metal 4.1 暴露了一条张量计算路径:基于 cooperative_tensor 片段的 Metal Performance Primitives (MPP) matmul2d 操作,其接口有文档记录,但硬件行为被故意隐藏。规范说明了支持哪些数据类型行,但从未说明它们是否经过硬件加速、操作在物理上何处执行、其累加器宽度是多少,或者如何在线程间划分矩阵片段。我们提出了 Rigel,这是对单个 Apple M4 Max(前神经加速器一代)上该路径的经验性表征。使用校验和门控、来源追踪的微基准测试工具,Rigel 恢复了 v4.1 规范隐藏或矛盾的十一个事实。主要发现:Metal 4.1 fp8 (E4M3) matmul2d 是模拟的,而非加速的:尽管读取的操作数字节数减半,但其吞吐量仅为 fp16 的 0.94 倍,因此在 M4 上它是一个内存占用特性,而非性能特性。我们进一步通过三信号三角测量(吞吐量上限、与 simdgroup_matrix 的比较以及每路功率归因)表明,matmul2d 完全在 GPU 着色器核心上执行,没有专用的矩阵数据路径,也没有证据表明路由到 Apple 神经引擎;它使用 >=fp32 累加;并且我们重建了 Apple 未在任何地方记录的不透明 8x8 cooperative_tensor 片段布局。基于该表征,一个手动融合的 GEMM + bias + GELU 内核在缓存驻留状态下比分解路径快 6.5-12.9%。所有发现均可从 MIT 许可的代码和逐单元 CSV 中重现。

英文摘要

Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads. We present Rigel, an empirical characterization of this path on a single Apple M4 Max (a pre-neural-accelerator generation). Using a checksum-gated, provenance-tracked microbenchmark harness, Rigel recovers eleven facts the v4.1 specification hides or contradicts. The headline finding: the Metal 4.1 fp8 (E4M3) matmul2d is emulated, not accelerated: it sustains 0.94x the throughput of fp16 despite reading half the operand bytes, so on M4 it is a memory-footprint feature, not a performance feature. We further show, via a three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution), that matmul2d executes entirely on the GPU shader cores with no dedicated matrix datapath and no evidence of Apple Neural Engine routing; that it accumulates in >=fp32; and we reconstruct the opaque 8x8 cooperative_tensor fragment layout Apple documents nowhere. Acting on the characterization, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by +6.5-12.9% in the cache-resident regime. All findings are reproducible from committed MIT-licensed code and per-cell CSVs.

2606.13322 2026-06-12 cs.CL 新提交

Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation

基于LLM并行文本生成的低延迟实时音频游戏解说系统

Ryota Kawamatsu, Anum Afzal, Yuki Saito, Shinnosuke Takamichi, Graham Neubig, Katsuhito Sudoh, Hiroya Takamura, Tatsuya Ishigaki

发表机构 * The University of Tokyo(东京大学) National Institute of Advanced Industrial Science and Technology(产业技术综合研究所) Technical University of Munich(慕尼黑工业大学) Keio University(庆应义塾大学) Carnegie Mellon University(卡内基梅隆大学) Nara Women’s University(奈良女子大学)

AI总结 提出一种并行文本生成与语音播放的低延迟实时游戏解说系统,将平均句间静默从9.6秒降至0.3秒,显著提升解说节奏。

Comments Accepted at IJCAI-ECAI 2026 (Demonstrations Track)

详情
AI中文摘要

我们提出了一种低延迟实时音频游戏解说系统,可直接从实时游戏视频生成语音解说。在这种端到端设置中,关键瓶颈是累积等待时间;传统流程对每条语句顺序执行帧捕获、文本生成和语音合成,且直到语音播放完成才请求下一次生成。这种严格顺序性导致语句间出现长且不自然的静默。为解决这一延迟瓶颈,我们的系统将文本生成与语音播放并行运行,并预先缓冲多个候选语句,从而在播放边界实现即时合成。在快节奏游戏视频上的实验表明,与顺序基线相比,我们的并行设计将平均句间静默从9.6秒降至0.3秒。它还将与专业的说话-静默时间模式的相似度提高了40%以上,一项包含120名有经验游戏玩家的用户研究证实,感知到的说话节奏显著改善。我们的演示视频可在以下网址获取:https://youtu.be/pmrRUlvav8M。

英文摘要

We present a low-latency real-time audio game commentary system that generates spoken commentary directly from live gameplay video. In this end-to-end setting, a key bottleneck is accumulated waiting time; conventional pipelines capture frames, generate text, and synthesize speech sequentially for each utterance, and do not request the next generation until speech playback has completed. This strict sequentiality causes long and unnatural silence between utterances. To address this latency bottleneck, our system runs text generation in parallel with speech playback and buffers multiple candidate utterances ahead of time, enabling immediate synthesis at playback boundaries. Experiments on fast-paced game videos show that our parallel design reduces the mean inter-utterance silence from 9.6 seconds to 0.3 seconds compared to sequential baselines. It also improves similarity to professional speaking-silence timing patterns by over 40%, and a user study with 120 experienced game players confirms significantly improved perceived speaking rhythm. Our demo video is available at: https://youtu.be/pmrRUlvav8M.
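Why overlapping generation with playback shrinks the silence between utterances can be seen in a small timeline simulation. This is an idealized sketch, not the authors' system: it assumes fixed per-utterance generation and playback times, an unbounded buffer, and a generator that works back to back; all names and numbers are hypothetical.

```python
from typing import List, Sequence


def sequential_silences(gen_times: Sequence[float],
                        play_times: Sequence[float]) -> List[float]:
    """Generation starts only after playback ends, so each gap between
    utterances equals the full generation time of the next utterance."""
    return list(gen_times[1:])


def parallel_silences(gen_times: Sequence[float],
                      play_times: Sequence[float]) -> List[float]:
    """Generation runs in the background while earlier utterances play."""
    silences = []
    ready = 0.0          # time at which the current utterance is generated
    playback_end = 0.0   # time at which the previous utterance stops playing
    for i, (g, p) in enumerate(zip(gen_times, play_times)):
        ready += g                       # generator never waits for playback
        start = max(ready, playback_end)
        if i > 0:
            silences.append(start - playback_end)
        playback_end = start + p
    return silences


gen = [2.0, 4.0, 2.0]    # seconds to generate each utterance
play = [3.0, 3.0, 3.0]   # seconds to play each utterance

print(sequential_silences(gen, play))
print(parallel_silences(gen, play))
```

In this toy model silence remains only when generation is slower than the playback it overlaps with, which is the case that buffering several candidate utterances ahead of time is meant to cover.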

2606.13349 2026-06-12 cs.CL 新提交

From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

从被动生成到主动调查:一种主动的科学同行评审代理

Haishuo Fang, Yue Feng, Iryna Gurevych

发表机构 * Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt(达姆施塔特工业大学通用知识处理实验室) National Research Center for Applied Cybersecurity ATHENE, Germany(德国国家应用网络安全研究中心 ATHENE) School of Computer Science, University of Birmingham(伯明翰大学计算机科学学院)

AI总结 提出ProReviewer,一种基于LLM的主动科学同行评审代理,将评审建模为马尔可夫决策过程,通过结构化评审日志引导主动调查,在五个质量维度上平均得分最高,优于现有方法。

详情
AI中文摘要

大型语言模型(LLM)在自动化科学同行评审方面显示出潜力。然而,现有方法通常难以生成有具体证据支持的深入评审。我们认为,一个关键限制是缺乏根据累积证据主动调查论文可疑部分的灵活性,就像人类评审员所做的那样。在本文中,我们探讨如何使基于LLM的评审代理能够进行这种主动调查。我们发现,这可以自然地表述为马尔可夫决策过程(MDP),并提出了ProReviewer,一种科学同行评审代理,它通过维护的结构化评审日志主动评审论文。结构化评审日志作为代理的工作空间,用于跟踪评审过程中收集的证据和中间发现。实验表明,使用8B骨干网络、通过监督微调训练并通过强化学习优化的ProReviewer,在五个质量维度上取得了最高平均分,相对优于基于提示的方法(使用更大的前沿LLM)高达39%,优于最强的微调基线16%。在人工评估中,它也取得了对基线最高的胜率。

英文摘要

Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods with much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% relatively. It also attains the highest win rates against baselines in human evaluation.

2606.12413 2026-06-12 cs.CY cs.AI cs.CE cs.CL cs.SE 交叉投稿

AI SciBrief as a Gateway to Research: A Framework for Onboarding Students into New Research Areas

AI SciBrief 作为研究入门:一种引导学生进入新研究领域的框架

Andrei Lazarev, Dmitrii Sedov

AI总结 提出利用大语言模型平台 AI SciBrief 自动生成科学趋势摘要的框架,帮助学生克服信息过载,加速从信息搜索到知识创造的转变。

Comments This is the version of the article accepted for publication in TELE 2025 after peer review. The final, published version is available at IEEE Xplore: https://doi.org/10.1109/TELE66816.2025.11211989

详情
Journal ref
2025 5th International Conference on Technology Enhanced Learning in Higher Education (TELE), Lipetsk, Russian Federation, 2025, pp. 365-369
AI中文摘要

各层次高等教育学生面临信息过载的重大障碍,这常常使研究过程的初始阶段陷入瘫痪并抑制动机。为此,本文介绍了一种教学框架,利用 AI SciBrief——一个由大语言模型驱动的平台,旨在自动生成科学趋势摘要。我们描述了这一多学科工具——初始覆盖金融、医学和教育领域——如何融入课程以克服这一“入门障碍”。该框架提供了具体方法,利用这些摘要促进学期论文的选题、加速学位论文的文献综述,并使研究生能够持续监测新兴趋势。我们得出结论,AI SciBrief 作为“研究入门”有效降低了学生的认知负荷,使他们能够更快地从信息搜索过渡到知识创造。

英文摘要

Students at all levels of higher education face a significant barrier in the form of information overload, which often paralyzes the initial stages of the research process and suppresses motivation. In response, this article introduces a pedagogical framework that leverages AI SciBrief, a platform powered by a Large Language Model (LLM) designed to automatically generate digests of scientific trends. We describe how this multidisciplinary tool, with initial coverage in finance, medicine, and education, can be integrated into the curriculum to overcome this "entry barrier." The framework provides concrete methodologies for utilizing these digests to facilitate topic selection for term papers, accelerate literature reviews for dissertations, and enable postgraduate students to continuously monitor emerging trends. We conclude that AI SciBrief functions as a "gateway to research," effectively reducing students' cognitive load and empowering them to transition more rapidly from information searching to knowledge creation.

2606.12471 2026-06-12 stat.ML cs.CL cs.ET cs.LG 交叉投稿

Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency

无高斯假设的可识别性:符号世界模型与近无限时间一致性

Seth Dobrin, Łukasz Chmiel

AI总结 本文提出物理基础符号架构(PGSA),证明其在非高斯动态系统中实现精确线性可识别性和近无限时间一致性,克服了统计世界模型的高斯边界限制。

Comments Pre-print

详情
AI中文摘要

Klindt、LeCun 和 Balestriero (arXiv:2605.26379) 证明了联合嵌入预测架构(JEPA)实现线性可识别性(即线性恢复世界的真实潜在变量)当且仅当世界的潜在动态遵循高斯平稳过程。这一高斯边界意味着时间一致性的基本限制:对于任何非高斯物理系统,统计世界模型的表示误差随时间单调增长。我们证明这一限制是统计对齐机制的产物,而非世界模型的一般性质。我们引入物理基础符号架构(PGSA),并证明三个结果:(1) PGSA 对所有物理机制实现精确线性可识别性,无论潜在分布如何;(2) PGSA 的每步误差仅受数值精度限制;(3) 直接推论是,PGSA 在无界数量的转换中保持时间一致性,我们称之为近无限时间一致性。我们进一步证明,对于任何非高斯系统,统计世界模型无法实现这一性质,无论模型容量或训练数据量如何。其中四个定理的代数核心已在 Lean 4 中使用 Mathlib4 v4.31.0 形式化(零个 sorry 占位符);Klindt 等人的逆命题作为外部前提。对比表明,在世界动态的因果生成器中进行符号基础化是充分条件,并且在非高斯体制下,是实现近无限时间一致性的唯一条件。

英文摘要

Klindt, LeCun, and Balestriero (arXiv:2605.26379) proved that Joint-Embedding Predictive Architectures (JEPAs) achieve linear identifiability, the linear recovery of the world's true latent variables, if and only if the world's latent dynamics follow a Gaussian, stationary process. This Gaussian boundary implies a fundamental limit on temporal consistency: for any non-Gaussian physical system, the representation error of a statistical World Model grows monotonically with time. We prove that this limit is an artifact of the statistical alignment mechanism, not a property of World Models in general. We introduce the Physics-Grounded Symbolic Architecture (PGSA) and prove three results: (1) a PGSA achieves exact linear identifiability for all physical regimes, regardless of the latent distribution; (2) the per-step error of a PGSA is bounded by numerical precision alone; and (3) as a direct consequence, a PGSA maintains temporal consistency for an unbounded number of transitions, a property we term near-infinite temporal consistency. We further prove that statistical World Models cannot achieve this property for any non-Gaussian system, regardless of model capacity or the volume of training data. The algebraic cores of four of the theorems are formalized in Lean 4 with Mathlib4 v4.31.0 (zero sorry placeholders); the Klindt et al. converse is taken as an external premise. The contrast establishes that symbolic grounding in the causal generator of the world's dynamics is the sufficient condition and, in non-Gaussian regimes, the only condition for near-infinite temporal consistency.

2606.12774 2026-06-12 eess.SY cs.AI cs.CL cs.SY 交叉投稿

Agentic MPC for Semantic Control System Resynthesis

用于语义控制系统再综合的智能体MPC

Yuya Miyaoka, Masaki Inoue

AI总结 提出智能体MPC框架,通过集成大语言模型智能体实现上下文感知的语义自适应控制综合,在自动驾驶场景中验证其根据个人偏好或社交情境(如避让应急车辆)调整控制的能力。

Comments 7 pages, 5 figures

详情
AI中文摘要

虽然MPC有效处理结构化、多样化和低层级的规范,但它缺乏动态融入高层级上下文信息(如社会规范、用户意图或自然语言指令)的能力。为解决这一局限,本文引入了一种智能体MPC框架,通过集成基于大语言模型的智能体,实现上下文感知、语义自适应的控制综合。该智能体解释异构输入,包括自然语言消息、环境观测和外部知识,以重新综合控制规范。该框架的有效性在自动驾驶场景中得到验证,系统能够与个人偏好保持一致,或对社交情境(如避让应急车辆)做出响应。

英文摘要

While MPC effectively handles structured, diverse, and low-level specifications, it lacks the capability to dynamically incorporate high-level contextual information such as social norms, user intent, or natural language instructions. To address this limitation, this manuscript introduces an agentic MPC framework that enables context-aware, semantically adaptive control synthesis by integrating with large language model-based agents. The agent interprets heterogeneous inputs, including natural language messages, environmental observations, and external knowledge, to resynthesize the control specifications. The effectiveness of the framework is demonstrated in an autonomous driving scenario, where the system aligns with personal preferences or responds to social situations such as emergency vehicle yielding.

2606.12904 2026-06-12 cs.IR cs.CL cs.HC cs.SI 交叉投稿

Trait, Not State: The Durability of Reading Identity in Social Highlighting

特质而非状态:社交高亮中阅读身份的持久性

Kazuki Nakayashiki, Keisuke Watanabe

AI总结 通过分析读者前六个月的高亮行为作为个人档案,追踪其后续选择,发现阅读选择特征在长达24个月以上保持稳定,表明这是一种特质而非状态。

Comments 12 pages, 3 figures, 3 tables

详情
AI中文摘要

先前关于社交网络高亮工具的研究将个体性定位于选择——即一个人选择高亮哪些文档——但仅从横截面角度进行测量。我们提出时间性问题:读者的选择特征是特质还是状态?我们将每位读者前六个月的高亮行为冻结为个人档案,并追踪其在后续选择中(间隔逐渐增大至24个月以上)的自身优势,负样本来自同一日历时期——因此供给漂移不能伪装成个人漂移——在粗粒度全局层面和细粒度层面(其负样本和对照来自读者自身的兴趣领域)进行测量;锚定单元重现了先前的横截面水平(+0.188 vs +0.169),验证了该框架。四个结果:在同一用户内,细粒度优势在任何时间跨度上均未显示统计上可检测的配对下降(6-12个月保留率 R = 1.00 [0.85, 1.18],n = 212;最远的区间与适度下降兼容;唯一区间排除零的对比是12-24个月的粗粒度层,约下降13%)。该信号不可简化为重复域名(排除所有档案来源后约90%信号保留)。个体内漂移缓慢(最近半年的档案比旧半年档案高出+0.042)。前瞻性地,个人档案——即使仅由读者最早期的文档构建(评估前中位数20个月)——其下一阅读的AP值约为所有测试过的简单非个人先验的3倍。我们将“特质”操作性地定义为在持续参与下的稳定特征;研究范围限于一个平台上的重度、长期读者,且曝光与选择不可分离。

英文摘要

Prior work on a social web highlighter located individuality in selection -- which documents a person chooses to highlight -- but measured it cross-sectionally. We ask the temporal question: is a reader's selection signature a trait or a state? We freeze each reader's first six months of highlighting as a profile and track its own-vs-other advantage on their later selections at growing gaps (to 24+ months), with negatives drawn from the same calendar era -- so supply drift cannot masquerade as personal drift -- at a coarse global level and at a fine level whose negatives and controls come from the reader's own interest neighborhood; the anchor cell reproduces the prior cross-sectional level (+0.188 vs +0.169), validating the harness. Four results. Within the same users, the fine-layer advantage shows no statistically detectable paired decline at any horizon (6-12 month retention R = 1.00 [0.85, 1.18], n = 212; the farthest bin is compatible with a modest decline; the only contrast whose interval excludes zero is the coarse layer at 12-24 months, about 13%). The signal is not reducible to repeated domains (~90% survives excluding all profile sources). Within-person drift is slow (a recent-half profile beats the old half by +0.042). Prospectively, personal profiles -- even one built from a reader's earliest documents, median 20 months before evaluation -- rank their next reads at roughly 3x the AP of every simple non-personal prior tested. We use "trait" operationally (a stable signature under continued engagement); the scope is heavy, long-tenured readers of one platform, and exposure is not separable from choice.

2606.12916 2026-06-12 cs.AI cs.CL cs.LG 交叉投稿

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

MDForge:稀疏模拟器反馈下的智能分子动力学流水线设计

Zehong Wang, Yijun Ma, Connor R. Schmidt, Tianyi Ma, Weixiang Sun, Ziming Li, Xiaoguang Guo, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) University of Connecticut(康涅狄格大学)

AI总结 提出MDForge,利用LLM智能体通过多智能体辩论将稀疏奖励稠密化,自动设计分子动力学流水线,在SAMPL基准上达到专家水平,并发现新型高亲和力CB[7]结合剂。

详情
AI中文摘要

分子动力学(MD)是原子分子科学中经典的计算机模拟方法,从第一性原理物理模拟分子行为。为新系统设计MD流水线需要大量专业知识:即使在一个分子上运行也代价高昂,排除了试错法。我们使用LLM智能体自动化这一专家流水线设计过程。与现有编排预定义工具集的MD智能体不同,我们将流水线设计视为开放式代码生成,其中智能体的行为通过语言奖励在线重塑。具体而言,我们构建了MDForge,一个LLM智能体,其上下文更新规则通过物理专家间的多智能体辩论将稀疏奖励稠密化。在三个SAMPL主客体结合自由能基准上,MDForge自动设计的MD流水线可与人类专家的设计相媲美。部署在未见过的候选客体库上,其CB[7]流水线发现了一种新型结合剂,湿实验竞争NMR证实其为高亲和力、皮摩尔级的CB[7]结合剂。我们的数据和代码可在 https://github.com/Zehong-Wang/MDForge 获取。

英文摘要

Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principle physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even one molecule is expensive, ruling out trial-and-error. We automate this expert pipeline-design process with an LLM agent. Unlike existing MD agents that orchestrate a predefined tool set, we treat pipeline design as open-ended code generation in which the agent's behavior is reshaped online by verbal reward. Specifically, we build MDForge, an LLM agent whose in-context update rule densifies the sparse reward via a multi-agent debate among physics experts. On three SAMPL host-guest binding free-energy benchmarks, MDForge automatically designs MD pipelines competitive with human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet-lab competition NMR confirms is a high-affinity, picomolar CB[7] binder. Our data and code are available at https://github.com/Zehong-Wang/MDForge.

2606.13239 2026-06-12 cs.SE cs.AI cs.CL cs.CV 交叉投稿

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

ComAct: 通过COM即行动范式重构专业软件操作

Jiaxin Ai, Tao Hu, Xuemeng Yang, Shu Zou, Hairong Zhang, Daocheng Fu, Yu Yang, Hongbin Zhou, Nianchen Deng, Pinlong Cai, Zhongyuan Wang, Botian Shi, Kaipeng Zhang, Licheng Wen

AI总结 提出COM即行动范式,将专业软件交互转化为确定性程序合成,解决GUI代理的脆弱性和API代理的异构性问题;构建ComCADBench基准和ComActor自校正代理,在工业CAD软件上实现SOTA性能。

详情
AI中文摘要

现有的计算机使用代理在专业软件操作上仍然存在根本性限制:基于GUI的代理受困于脆弱的视觉基础和长程错误累积,而基于API的方法则难以应对异构协议和不可访问的商业接口。在这项工作中,我们将组件对象模型(COM)识别为统一的、可执行的抽象,提出了COM即行动:一种新的范式,将专业软件交互重新定义为确定性程序合成,而非顺序视觉控制。为了在最苛刻的环境中验证这一范式,我们引入了ComCADBench,这是首个针对操作真实工业CAD软件的代理的基准测试。我们的实验揭示了显著的范式差距:前沿的专有模型在基于GUI的交互下几乎无法成功,而基于COM的执行则带来了实质性的即时收益。为了弥合语法正确性与几何精度之间的剩余差距,我们开发了ComActor,一个通过渐进式三阶段框架训练的自校正代理,以及ComForge,一个用于在Windows容器中进行大规模训练的可扩展平台。大量实验表明,ComActor在ComCADBench上达到了最先进的性能,在基线崩溃的长程任务中表现出强大的韧性,并泛化到外部CAD基准测试。

英文摘要

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to external CAD benchmarks.

2606.13452 2026-06-12 cs.DL cs.CL cs.CY cs.HC 交叉投稿

Examining the Cognitive Gap Between Authors and Peer Reviewers on Academic Paper Novelty

审视作者与同行评审员在学术论文新颖性上的认知差距

Chenggang Yang, Chengzhi Zhang

发表机构 * Department of Information Management, Nanjing University of Science and Technology(南京理工大学信息管理学院)

AI总结 通过分析Nature Communications上15,328篇论文及其评审意见,发现作者和评审员都强调结果导向的创新,但评审员视角更全面;高创新论文受益于强宣传语言,中等创新论文的宣传语言与评审分歧显著相关。

详情
Journal ref
Scientometrics, 2026
AI中文摘要

新颖性是评估学术论文质量的关键指标。学者们努力突出其工作的新颖方面,尤其是在标题、摘要和引言中。同行评审作为科学严谨性的守门人,严格评估论文的新颖性,但作者自我宣传与评审员评价之间可能存在认知差距。为探究此问题,我们分析了2016年至2021年间发表在Nature Communications上的15,328篇学术论文及其同行评审意见。我们发现,评审员和作者都强调结果导向的创新,但评审员采用更全面的评价视角。此外,通过考察宣传强度与论文固有新颖性的关系,我们发现其效果取决于论文的实际创新水平。高创新论文受益于更强的宣传语言,获得更积极的评价。我们还发现,宣传语言与评审员对新颖性的分歧显著相关,但仅针对中等创新性的论文,而对高或低新颖性的论文影响甚微。这揭示了宣传语言如何在学术评价的灰色地带中发挥最显著的作用。

英文摘要

Novelty is a crucial metric for assessing the quality of academic papers. Scholars strive to highlight the novel aspects of their work, particularly in the title, abstract, and introduction. Peer review, serving as the gatekeeper of scientific rigor, rigorously evaluates the novelty of papers, yet a cognitive gap may exist between author self-promotion and reviewer evaluation. To investigate this, we analyzed 15,328 academic papers published in Nature Communications from 2016 to 2021, along with their peer-review comments. We found that both reviewers and authors emphasize result-oriented innovation, with reviewers adopting a more comprehensive evaluation perspective. Furthermore, by examining promotional intensity against inherent paper novelty, we found that its effect depends on the paper's actual innovation level. Highly innovative papers benefit from stronger promotional language, receiving more positive evaluations. We also found that promotional language significantly correlates with reviewer disagreement on novelty specifically for papers of moderate innovativeness, whereas it has negligible impact for papers with either very high or very low novelty. This reveals how promotional language operates most prominently in the gray area of academic evaluation.

2605.31514 2026-06-12 cs.CL cs.AI cs.CY 版本更新

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

如果LLM具有类人属性,那么《帝国时代II》也具有

Adrian de Wynter

AI总结 通过训练简单神经网络于《帝国时代II》,论证LLM的拟人属性在经验上非唯一,提出应假设LLM非独特性而非拟人属性来设计实验。

Comments Fixed corollary 1, added stat sig

详情
AI中文摘要

关于大型语言模型(LLM)和基于LLM的智能体工作流已有大量研究。然而,该领域的许多工作声称、赋予或假设它们具有普遍化的拟人属性(例如道德或对自然语言的理解)。我们的目标不是支持或反对这些属性的存在,而是指出这些结论可能不正确。为此,我们在电子游戏《帝国时代II》上构建并训练了一个简单的神经网络,并注意到任何处于足够强大基底(如乐高或大波士顿地区)中的实体也可能呈现此类属性。因此,LLM声称的拟人属性在经验上非唯一:尽管某些属性(例如对提示的响应)可能保持不变,但其他属性(如对其感知行为的解释)可能随基底改变。因此,任何基于经验的讨论都需要明确的测量标准;否则解释就留给了表征。然后我们表明,假设这些属性在系统中存在或不存在,独立于基底并以普遍化方式,会导致循环或无信息的结论,无论实验者对该主题的观点如何。最后,我们提出一个“零”假设,即假设LLM非独特性而非拟人属性来设置实验,并给出示例。我们还讨论了对我们工作的潜在反对意见,简要调查了该领域,并证明了《帝国时代II》是功能完备和图灵完备的。

英文摘要

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergence of, ascribe to, or assume, generalised anthropomorphic attributes to them (e.g., morality or understanding of natural language). Our goal is not to argue in favour or against the existence of these attributes, but to point out that these conclusions could be incorrect. For this we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain invariant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically-grounded discussion on these attributes requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming that these attributes exist or not in a system, independent of the substrate and in a generalised way, leads to either circular or uninformative conclusions. This is regardless of the experimenter's viewpoint on the subject, or whether the outcome shows existence or non-existence. Finally we propose a 'null' assumption, where one assumes LLM non-uniqueness instead of assuming anthropomorphic attributes to set up an experiment, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally- and Turing-complete.

2410.00903 2026-06-12 stat.AP cs.CL cs.LG 版本更新

Causal Inference with Generative Artificial Intelligence: Application to Texts as Treatments

基于生成式人工智能的因果推断:以文本作为处理变量

Kosuke Imai, Kentaro Nakamura

发表机构 * Harvard University(哈佛大学) John F. Kennedy School of Government(约翰·F·肯尼迪政府学院)

AI总结 提出利用生成式AI(如大语言模型)生成处理变量并利用其内部表示进行因果效应估计,避免从数据中学习因果表示,提高估计准确性和效率。

详情
AI中文摘要

在本文中,我们展示了如何利用生成式人工智能(GenAI)的力量,增强以文本等高维非结构化数据作为处理变量时的因果推断有效性。具体而言,我们提出使用深度生成模型(如大语言模型,LLMs)高效地生成处理变量,并利用其内部表示进行后续的因果效应估计。我们表明,了解这种真实内部表示有助于将感兴趣的处理特征(如特定情感和某些主题)与其他可能未知的混淆特征分离开来。与现有方法不同,所提出的GenAI驱动推断(GPI)方法无需从数据中学习因果表示,因此能产生更准确和高效的估计。我们正式建立了非参数识别平均处理效应所需的条件,提出了一种避免重叠假设违反的估计策略,并通过应用双重机器学习推导了所提出估计量的渐近性质。最后,利用工具变量方法,我们将所提出的GPI方法扩展到处理特征基于人类感知的场景。GPI也适用于文本复用,即使用LLM重新生成现有文本。我们进行了模拟和实证研究,使用开源LLM Llama 3生成的文本数据,展示了我们的估计器相对于最先进的因果表示学习算法的优势。

英文摘要

In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments like texts, by leveraging the power of generative Artificial Intelligence (GenAI). Specifically, we propose to use a deep generative model such as large language models (LLMs) to efficiently generate treatments and use their internal representation for subsequent causal effect estimation. We show that the knowledge of this true internal representation helps disentangle the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike existing methods, the proposed GenAI-Powered Inference (GPI) methodology eliminates the need to learn causal representation from the data, and hence produces more accurate and efficient estimates. We formally establish the conditions required for the nonparametric identification of the average treatment effect, propose an estimation strategy that avoids the violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator through the application of double machine learning. Finally, using an instrumental variables approach, we extend the proposed GPI methodology to the settings in which the treatment feature is based on human perception. The GPI is also applicable to text reuse where an LLM is used to regenerate existing texts. We conduct simulation and empirical studies, using the generated text data from an open-source LLM, Llama 3, to illustrate the advantages of our estimator over state-of-the-art causal representation learning algorithms.

2602.07698 2026-06-12 cs.SE cs.CL 版本更新

On Sequence-to-Sequence Models for Automated Log Parsing

关于自动化日志解析的序列到序列模型

Adam Sorrenti, Andriy Miranskyy

发表机构 * Toronto Metropolitan University(多伦多都会大学)

AI总结 本研究系统评估了四种序列建模架构(Transformer、Mamba、单/双向LSTM)在自动化日志解析中的性能,发现Transformer表现最佳,Mamba在计算成本较低时具有竞争力,并分析了表示选择、序列长度和数据效率的影响。

Comments Added a comparison with large language models

AI中文摘要

背景:日志解析是软件系统中一项关键的标准操作流程,支撑监控、异常检测和故障诊断。然而,由于日志格式异构、训练与部署数据之间存在分布偏移,且基于规则的方法较为脆弱,自动化日志解析仍然具有挑战性。目标:本研究旨在系统评估序列建模架构、表示选择、序列长度和训练数据可用性如何影响自动化日志解析的性能和计算成本。方法:我们开展了一项受控实证研究,比较四种序列建模架构:Transformer、Mamba状态空间模型、单向LSTM和双向LSTM。我们在多种数据集配置下共训练了396个模型,使用相对Levenshtein编辑距离进行评估,并进行统计显著性检验。结果:Transformer取得了最低的平均相对编辑距离(0.111),其次是Mamba(0.145)、单向LSTM(0.186)和双向LSTM(0.265),数值越低越好。Mamba以大幅降低的计算成本提供了有竞争力的准确性。字符级分词通常能提升性能,序列长度对Transformer准确性的实际影响可以忽略,且Mamba和Transformer的样本效率均强于循环模型。结论:总体而言,Transformer将解析错误降低了23.4%,而Mamba在数据或计算受限的情况下是一个强有力的替代方案。这些结果还阐明了表示选择、序列长度和样本效率的作用,为研究人员和从业者提供了实用指导。

英文摘要

Context: Log parsing is a critical standard operating procedure in software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. Objectives: This study aims to systematically evaluate how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. Methods: We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, monodirectional LSTM, and bidirectional LSTM models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance with statistical significance testing. Results: Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), mono-LSTM (0.186), and bi-LSTM (0.265), where lower values are better. Mamba provides competitive accuracy with substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and Transformer demonstrate stronger sample efficiency than recurrent models. Conclusion: Overall, Transformers reduce parsing error by 23.4%, while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
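
The metric named in the abstract can be illustrated with a minimal Python sketch. It assumes that "relative" means the Levenshtein distance divided by the length of the longer string; the abstract does not state the normalisation, so `relative_edit_distance` and `relative_reduction` are hypothetical helpers, not the authors' code. The abstract also does not say which baseline the 23.4% figure refers to; one reading consistent with the reported means is the relative reduction from Mamba (0.145) to Transformer (0.111).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def relative_edit_distance(predicted: str, reference: str) -> float:
    # Assumption: normalise by the longer string; the paper may normalise differently.
    longest = max(len(predicted), len(reference))
    return 0.0 if longest == 0 else levenshtein(predicted, reference) / longest


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error relative to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical predicted vs. reference log templates.
print(relative_edit_distance("Connection from <*> closed", "Connection from <*> is closed"))
# Reported means from the abstract: Mamba 0.145, Transformer 0.111.
print(round(100 * relative_reduction(0.145, 0.111), 1))
```

A lower relative edit distance means the predicted template is closer to the reference, which matches the abstract's "lower values are better".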

2605.14568 2026-06-12 cs.SE cs.CL cs.LG 版本更新

Given, When, Then, Again: Mining Subscenario Refactoring Candidates in Behaviour-Driven Test Suites with ML Classifiers and LLM-Judge Baselines

Given, When, Then, Again:使用ML分类器和LLM评判基线挖掘行为驱动测试套件中的子场景重构候选

Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

发表机构 * Independent Researcher; Applied MBA (Data Analytics), Texas Wesleyan University(独立研究者;应用MBA(数据分析),德克萨斯韦斯利安大学) Independent Researcher; B.E. Computer Engineering, National University of Sciences and Technology (NUST)(独立研究者;计算机工程学士,国立科学与技术大学(NUST)) Independent Researcher; M.Sc. Management, Technical University of Munich(独立研究者;管理硕士,慕尼黑工业大学)

AI总结 本文通过ML分类器和LLM基线,识别行为驱动开发测试套件中可提取的子场景,量化其在公共BDD生态系统中的普及率。

Comments 31 pages, 10 figures, 6 tables, 56 references. v2: retitled; reference list fully corrected and verified; decision-threshold sensitivity analysis and imbalance-robust baseline metrics added; figures restyled. Reproduction package at https://github.com/amughalbscs16/cukereuse_subscenarios_release (Apache-2.0). Upstream cukereuse corpus at https://doi.org/10.5281/zenodo.19754359

AI中文摘要

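背景。行为驱动开发(BDD)测试套件会积累重复的步骤子序列。已有三种公开的重构模式可用(文件内Background、仓库内可复用场景调用、跨组织共享的高层步骤),但此前尚无工作能自动判断哪些重复子序列值得提取、适用哪种机制。目标。按重构适宜性(是否值得提取)对重复出现的步骤子序列(“切片”)排序,将每个切片预先映射到三种模式之一,并量化其在公共BDD生态中的普遍程度。方法。在一个包含339个仓库、276个上游所有者的Gherkin语料库中,每个连续的L步窗口(L ∈ [2, 18])都以对改写稳健的聚类标识符为键,并在三种范围下计数。SBERT / UMAP / HDBSCAN聚类用于恢复改写等价的切片。三位作者依据书面评分标准对分层抽样的200个切片进行标注。在5折交叉验证下训练的XGBoost“值得提取”分类器与经过调优的规则基线以及两个开放权重大语言模型(LLM)评判器进行比较。结果。挖掘器产生5,382,249个切片,归并为692,020个重复模式。三位作者的Fleiss' kappa为0.56(是否值得提取)和0.79(机制)。分类器的折外F1达到0.891(95% CI [0.852, 0.927]),优于规则基线(F1 = 0.836,p = 0.017)和较好的LLM评判器(F1 = 0.728,p = 1.5e-4)。分别有75.0%、59.5%和11.7%的场景带有文件内Background、仓库内可复用场景和跨组织共享步骤的候选;这些数字在分类器决策阈值扫描下保持稳定。结论。对改写稳健的子场景发现给出了覆盖整个语料库的BDD重构候选普查;流水线、分类器预测、标注池和评分标准均以Apache-2.0许可发布。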

英文摘要

Context. Behaviour-Driven Development (BDD) test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. SBERT / UMAP / HDBSCAN clustering recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An XGBoost extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p = 1.5e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, and cross-organisational shared-step candidate, respectively; the figures are stable under a sweep of the classifier decision threshold. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring candidates; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
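
The windowing step described in the Method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each step is assumed to be already mapped to a cluster identifier (the strings below are made up), a single counting scope is used instead of the paper's three, and `contiguous_windows` and `count_recurring` are hypothetical names rather than functions from the released pipeline.

```python
from collections import Counter
from typing import Dict, Iterator, List, Sequence, Tuple


def contiguous_windows(steps: Sequence[str],
                       min_len: int = 2,
                       max_len: int = 18) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous window of min_len..max_len steps from one scenario."""
    for length in range(min_len, min(max_len, len(steps)) + 1):
        for start in range(len(steps) - length + 1):
            yield tuple(steps[start:start + length])


def count_recurring(scenarios: List[Sequence[str]],
                    min_len: int = 2,
                    max_len: int = 18) -> Dict[Tuple[str, ...], int]:
    """Count in how many scenarios each window occurs; keep windows seen at least twice."""
    counts: Counter = Counter()
    for steps in scenarios:
        # A set ensures a window repeated inside one scenario is counted once.
        counts.update(set(contiguous_windows(steps, min_len, max_len)))
    return {window: n for window, n in counts.items() if n >= 2}


# Hypothetical cluster identifiers standing in for paraphrase-equivalent steps.
scenarios = [
    ["login", "open_cart", "add_item"],
    ["login", "open_cart", "checkout"],
]
print(count_recurring(scenarios))
```

Keying windows by cluster identifier rather than by raw step text is what lets differently worded but equivalent steps collapse into the same recurring pattern.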

2505.22169 2026-06-12 cs.CL 版本更新

ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

ReliableEval: 通过矩方法进行随机大语言模型评估的配方

Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky

发表机构 * The Hebrew University of Jerusalem(耶路撒冷希伯来大学) Google Research(谷歌研究)

AI总结 本文提出ReliableEval方法,通过矩方法评估大语言模型的提示敏感性,发现顶级模型如GPT-4o和Claude-3.7-Sonnet存在显著提示敏感性。

Comments Findings of EMNLP 2025

Journal ref
Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11146-11153, Suzhou, China. Association for Computational Linguistics
AI中文摘要

大语言模型对提示的措辞高度敏感,但标准基准通常只用单一提示报告性能,这引发了对此类评估可靠性的担忧。本文主张在保持语义的提示扰动空间上采用基于矩估计法的随机评估。我们给出了考虑提示敏感性的可靠评估的形式化定义,并提出ReliableEval,即一种估计获得有意义结果所需提示重采样次数的方法。借助该框架,我们对五个前沿大语言模型进行了随机评估,发现即使是GPT-4o和Claude-3.7-Sonnet等表现最好的模型也存在显著的提示敏感性。我们的方法与模型、任务和度量无关,为有意义且稳健的大语言模型评估提供了一套方案。

英文摘要

LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval - a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
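
To make the idea of estimating a resampling budget concrete, here is a rough Python sketch. It is not the ReliableEval procedure, whose details the abstract does not give. It assumes a generic moment-based recipe: estimate the mean and variance of a model's score over a pilot set of prompt paraphrases, then use a normal approximation to find how many paraphrases would bring the confidence-interval half-width under a target. The pilot scores, the `half_width` target, and both function names are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def moments(scores: Sequence[float]) -> Tuple[float, float]:
    """First two sample moments: mean and unbiased variance (n - 1 denominator)."""
    return statistics.fmean(scores), statistics.variance(scores)


def resamplings_needed(scores: Sequence[float],
                       half_width: float,
                       z: float = 1.96) -> int:
    """Paraphrases needed so that z * sigma / sqrt(n) <= half_width.

    Assumption: scores across paraphrases are roughly i.i.d. and the normal
    approximation holds; z = 1.96 corresponds to a 95% interval.
    """
    _, variance = moments(scores)
    return max(1, math.ceil(z ** 2 * variance / half_width ** 2))


# Hypothetical accuracies of one model on five paraphrases of the same prompt.
pilot_scores = [0.70, 0.74, 0.66, 0.72, 0.68]
print(moments(pilot_scores))
print(resamplings_needed(pilot_scores, half_width=0.02))
```

The more a model's score varies across paraphrases, the larger the estimated budget, which is the practical consequence of the prompt sensitivity the abstract reports.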

2402.13906 2026-06-12 cs.CL 版本更新

Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction

利用集合级相似性进行无监督文档结构提取

Gili Lior, Yoav Goldberg, Gabriel Stanovsky

发表机构 * Allen Institute for AI(Allen人工智能研究所) The Hebrew University of Jerusalem(耶路撒冷希伯来大学) Bar-Ilan University(巴伊兰大学)

AI总结 本文提出一种无监督方法,利用文档间和文档内相似性提取跨领域文档集合的整体结构,通过捕捉重复主题并抽象化标题变体,为人类和结构感知模型提供帮助。

Comments Accepted to ACL 2024 findings

Journal ref
Findings of the Association for Computational Linguistics: ACL 2024, pages 9538-9550, Bangkok, Thailand. Association for Computational Linguistics
AI中文摘要

法律、医疗或金融等各领域的文档集合通常共享某种底层的集合级结构,这种结构所包含的信息对人类用户和结构感知模型都有帮助。我们提出识别集合内文档的典型结构,这需要捕捉集合中反复出现的主题,同时对标题的各种改写形式进行抽象,并将每个主题定位到相应的文档位置。这些要求带来了若干挑战:标记重复主题的标题在措辞上经常不同;某些章节标题仅出现在个别文档中,并不反映典型结构;不同文档中的主题顺序也可能不同。为此,我们开发了一种基于图的无监督方法,利用文档间和文档内的相似性提取底层的集合级结构。我们在英语和希伯来语的三个不同领域上的评估表明,该方法能够提取有意义的集合级结构;我们希望未来的工作能将该方法用于多文档应用和结构感知模型。

英文摘要

Document collections of various domains, e.g., legal, medical, or financial, often share some underlying collection-wide structure, which captures information that can aid both human users and structure-aware models. We propose to identify the typical structure of document within a collection, which requires to capture recurring topics across the collection, while abstracting over arbitrary header paraphrases, and ground each topic to respective document locations. These requirements pose several challenges: headers that mark recurring topics frequently differ in phrasing, certain section headers are unique to individual documents and do not reflect the typical structure, and the order of topics can vary between documents. Subsequently, we develop an unsupervised graph-based method which leverages both inter- and intra-document similarities, to extract the underlying collection-wide structure. Our evaluations on three diverse domains in both English and Hebrew indicate that our method extracts meaningful collection-wide structure, and we hope that future work will leverage our method for multi-document applications and structure-aware models.

2507.11936 2026-06-12 cs.CL cs.AI cs.CV cs.LG 版本更新

A Survey of Deep Learning for Geometry Problem Solving

深度学习在几何问题求解中的应用综述

Jianzhe Ma, Wenxuan Wang, Qin Jin

发表机构 * Renmin University of China(中国人民大学)

AI总结 本文综述了深度学习在几何问题求解中的应用,涵盖相关任务、方法、评估指标及未来方向,旨在提供实践参考以推动该领域发展。

Comments ACL 2026 Main Conference

AI中文摘要

几何问题求解作为数学推理的重要组成部分,在教育、评估AI数学能力及多模态能力评估中具有关键作用。近期深度学习技术,尤其是多模态大语言模型的出现,显著加速了该领域的研究。本文综述了深度学习在几何问题求解中的应用,包括(i)几何问题求解相关任务的全面总结;(ii)相关深度学习方法的深入回顾;(iii)评估指标和方法的详细分析;以及(iv)最先进性能、现有挑战和有前景的未来方向的批判性讨论。我们的目标是提供一个全面且实用的深度学习在几何问题求解中的参考,从而推动该领域进一步发展。我们维护了一个相关论文列表:https://github.com/majianz/dl4gps。

英文摘要

Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper presents a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of state-of-the-art performance, existing challenges, and promising future directions. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We maintain a list of relevant papers: https://github.com/majianz/dl4gps.

2507.21086 2026-06-12 cs.CL 版本更新

Multi-Amateur Contrastive Decoding for Text Generation

多业余对比解码用于文本生成

Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

发表机构 * Department of Data Science, Praxis Business School(普拉克斯商学院数据科学系)

AI总结 本文提出多业余对比解码框架,通过集成多个业余模型更全面地捕捉语言生成中的不良模式,提升文本生成的流畅性、连贯性和多样性。

Comments This paper has been accepted for oral presentation and publication in the proceedings of the IEEE I2ITCON 2025. The conference will be organized in Pune, India, from July 4 to 5, 2025. This is the accepted version of the paper and NOT the final camera-ready version. The paper is 11 pages long and contains 5 figures and 6 tables

AI中文摘要

对比解码(CD)已成为一种有效的推理时策略,它利用大型专家语言模型与较小的业余模型在输出概率上的差异来增强开放式文本生成。尽管CD提升了连贯性和流畅性,但它依赖单一业余模型,限制了其捕捉语言生成中多样且多方面的失败模式(如重复、幻觉和风格漂移)的能力。本文提出多业余对比解码(MACD),它是CD框架的推广,采用一组业余模型的集成来更全面地刻画不良生成模式。MACD通过平均和共识惩罚两种机制整合对比信号,并将合理性约束(plausibility constraint)扩展到多业余设置。此外,该框架通过引入带有特定风格或内容偏向的业余模型来实现可控生成。在新闻、百科和叙事等多个领域的实验结果表明,MACD在流畅性、连贯性、多样性和适应性方面始终优于传统解码方法和原始CD方法,且无需额外训练或微调。

英文摘要

Contrastive Decoding (CD) has emerged as an effective inference-time strategy for enhancing open-ended text generation by exploiting the divergence in output probabilities between a large expert language model and a smaller amateur model. Although CD improves coherence and fluency, its dependence on a single amateur restricts its capacity to capture the diverse and multifaceted failure modes of language generation, such as repetition, hallucination, and stylistic drift. This paper proposes Multi-Amateur Contrastive Decoding (MACD), a generalization of the CD framework that employs an ensemble of amateur models to more comprehensively characterize undesirable generation patterns. MACD integrates contrastive signals through both averaging and consensus penalization mechanisms and extends the plausibility constraint to operate effectively in the multi-amateur setting. Furthermore, the framework enables controllable generation by incorporating amateurs with targeted stylistic or content biases. Experimental results across multiple domains, such as news, encyclopedic, and narrative, demonstrate that MACD consistently surpasses conventional decoding methods and the original CD approach in terms of fluency, coherence, diversity, and adaptability, all without requiring additional training or fine-tuning.
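
The averaging variant mentioned in the abstract can be sketched roughly as follows. This is an assumed form, not the authors' formulation: the score of a token is taken to be the expert log-probability minus the mean amateur log-probability, and the plausibility constraint is assumed to keep only tokens whose expert probability is at least `alpha` times the expert's maximum, as in single-amateur contrastive decoding. The consensus penalty is not shown. All probabilities and names below are made up for illustration.

```python
import math
from typing import List, Sequence


def macd_scores(expert_probs: Sequence[float],
                amateur_probs: Sequence[Sequence[float]],
                alpha: float = 0.1) -> List[float]:
    """Score each token as log p_expert minus the mean of the amateurs' log-probs.

    Assumption: all amateur probabilities are strictly positive (softmax outputs).
    Tokens failing the plausibility constraint get a score of -inf.
    """
    threshold = alpha * max(expert_probs)
    scores = []
    for token, p_expert in enumerate(expert_probs):
        if p_expert < threshold:
            scores.append(-math.inf)  # implausible under the expert: never selected
            continue
        mean_amateur_logp = sum(math.log(a[token]) for a in amateur_probs) / len(amateur_probs)
        scores.append(math.log(p_expert) - mean_amateur_logp)
    return scores


def pick_token(scores: Sequence[float]) -> int:
    """Greedy choice: index of the highest contrastive score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical next-token distributions over a four-token vocabulary.
expert = [0.50, 0.45, 0.04, 0.01]
amateurs = [[0.70, 0.10, 0.10, 0.10],
            [0.60, 0.20, 0.10, 0.10]]
scores = macd_scores(expert, amateurs)
print(scores, pick_token(scores))
```

In this toy case the expert's top token is also the one every amateur favours, so the contrast shifts the choice to the second token, which the expert rates almost as highly and the amateurs rate low.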