arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2606.13227 2026-06-12 cs.CL 新提交

PolyAlign: Conditional Human-Distribution Alignment

PolyAlign: 条件性人类分布对齐

L. D. M. S. Sai Teja, Ufaq Khan, Sathira Silva, Xiao Wu, Muhammad Haris Khan

发表机构 * NIT Silchar（印度国立理工学院锡尔恰尔分校）； MBZUAI（穆罕默德·本·扎耶德人工智能大学）

AI总结提出PolyAlign框架，通过桶感知SFT和人类分布偏好优化，实现语言模型在不同交互上下文中的条件性人类分布对齐，提升自然性和分布忠实度。

Comments 20 pages, 4 Figures, 8 Tables

详情

AI中文摘要

诸如监督微调（SFT）和偏好优化等后训练方法通常将语言模型对齐到单一的全局助手行为。虽然这有助于提高平均有用性，但可能抑制人类响应在不同语言、任务和对话设置中的自然变化。我们将此问题研究为条件性人类分布对齐：模型应匹配适合当前交互上下文的人类响应分布，而非通用响应风格。我们引入PolyAlign，一种分布感知的对齐框架，将双语交互数据组织为由语言、交互轨迹、响应家族和长度定义的桶特定人类参考分布。PolyAlign结合了桶感知SFT（平衡跨异构桶的优化）和人类分布偏好优化（HDPO，使用评论家估计的到桶特定人类支持的距离来正则化偏好学习）。在涵盖英语和中文单轮及多轮设置的双语评估套件中，PolyAlign在保持竞争性任务实用性的同时，提高了条件自然性和分布忠实度。结果表明，后训练应超越全局对齐目标，转向与人类响应分布的交互感知对齐。

英文摘要

Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress the natural variation of human responses across languages, tasks, and dialogue settings. We study this problem as conditional human-distribution alignment: models should match the human response distribution appropriate to the current interaction context, rather than a universal response style. We introduce PolyAlign, a distribution-aware alignment framework that organizes bilingual interaction data into bucket-specific human reference distributions defined by language, interaction track, response family, and length. PolyAlign combines Bucket-Aware SFT, which balances optimization across heterogeneous buckets, with Human-Distribution Preference Optimization (HDPO), which regularizes preference learning using critic-estimated distance to bucket-specific human support. Across a bilingual evaluation suite covering English and Chinese single- and multi-turn settings, PolyAlign improves conditional naturalness and distributional faithfulness while preserving competitive task utility. The results suggest that post-training should move beyond global alignment objectives toward interaction-aware alignment with human response distributions.

URL PDF HTML ☆

赞 0 踩 0

2606.13624 2026-06-12 cs.CL 新提交

Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

超越统一令牌：时间序列语言模型的自适应压缩

Jialin Gan, Xin Qiu, Guangzhe Chen, Xue Wang

发表机构 * Zhejiang University（浙江大学）； Harbin Institute of Technology（哈尔滨工业大学）； Shandong University（山东大学）

AI总结针对时间序列语言模型中令牌效率低的问题，提出自适应令牌预算框架，通过频域结构压缩时间序列令牌并逐层减少提示令牌，实现高达7.68倍推理加速并在78%设置中提升性能。

详情

AI中文摘要

大型语言模型（LLM）通过共享令牌接口联合建模数值观测和文本上下文，实现了时间序列（TS）分析。然而，TS令牌和提示令牌表现出根本不同的信息结构，使得统一令牌处理效率低下。在本文中，我们从非对称令牌的角度研究TS语言建模中的令牌效率。我们表明，TS令牌具有高度不均匀的频谱贡献，其中许多令牌共享冗余频率模式，而一小部分保留了关键的时间证据。我们还观察到，提示令牌的影响随模型深度衰减，表明在所有层中完全保留提示是不必要的。基于这些发现，我们开发了一个自适应令牌预算框架，通过频域结构压缩TS令牌，并逐层减少提示令牌。在预测、分类、插补和异常检测上的实验表明，在\textit{\textbf{78\%}}的评估设置中实现了高达\textit{\textbf{7.68$\times$}}的推理加速和性能提升，显示了非对称令牌压缩对于可扩展TS基础模型的有效性。

英文摘要

Large language models (LLMs) have enabled time series (TS) analysis by jointly modeling numerical observations and textual context through a shared token interface. However, TS tokens and prompt tokens exhibit fundamentally different information structures, making uniform token processing inefficient. In this paper, we study token efficiency in TS language modeling from an asymmetric-token perspective. We show that TS tokens have highly uneven spectral contributions, where many tokens share redundant frequency patterns while a small subset preserves critical temporal evidence. We also observe that prompt-token influence attenuates with model depth, suggesting that full prompt retention across all layers is unnecessary. Based on these findings, we develop an adaptive token budgeting framework that compresses TS tokens via frequency-domain structure and progressively reduces prompt tokens across layers. Experiments across forecasting, classification, imputation, and anomaly detection demonstrate up to \textit{\textbf{7.68$\times$}} inference acceleration and performance gains in \textit{\textbf{78\%}} of evaluated settings, showing the effectiveness of asymmetric token compression for scalable TS foundation models.

URL PDF HTML ☆

赞 0 踩 0

2606.13634 2026-06-12 cs.CL math.CT 新提交

保持策略梯度主导：面向长程工具使用智能体的兄弟引导信用蒸馏

Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein

发表机构 * Amazon Web Services（亚马逊云服务）

AI总结针对长程工具使用强化学习中轨迹级优势信号稀疏的问题，提出兄弟引导信用蒸馏（SGCD），通过动态采样成功与失败轨迹、外部LLM对比生成逐步信用参考，实现密集信用分配，在AppWorld和τ³-airline任务上显著提升性能。

Comments 13 pages, 4 figures, 7 tables. Submitted to EMNLP 2026 Industry Track

详情

AI中文摘要

长程工具使用强化学习可以从结果验证中学习，但其轨迹级优势被广播到许多推理、API和答案令牌上。自蒸馏通过重用策略自身的轨迹或特权教师承诺提供更密集的信号。然而，我们表明直接的令牌级自蒸馏会悄然破坏工具使用：它复述教师行为而不知道验证器奖励哪些动作，因此有用技能和有害捷径被一起放大。我们引入兄弟引导信用蒸馏（SGCD），它使用蒸馏进行信用分配而非作为竞争性的演员损失。动态采样产生混合的成功和失败的兄弟轨迹；外部LLM将其对比总结为训练时逐步信用参考；密集的教师/学生散度驱动信用重新分配；有界分离的信用权重重塑GRPO令牌优势。部署的学生看不到外部LLM、兄弟证据或预言机。在AppWorld和τ³-airline上，SGCD优于匹配的GRPO比较器：AppWorld上test_normal的TGC从42.9提升到45.6，test_challenge从24.7提升到27.0；τ³-airline的pass@1从0.583提升到0.602。

英文摘要

Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $τ^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $τ^3$-airline pass@1 $0.583 \to 0.602$.

URL PDF HTML ☆

赞 0 踩 0

2606.13106 2026-06-12 cs.LG cs.CL 交叉投稿

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

揭秘隐状态循环：基于在线强化学习的可切换潜在推理

Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo

发表机构 * HKUST(GZ)（香港科技大学（广州））； University of Cambridge（剑桥大学）； NTU（南洋理工大学）； JoinQuant（聚宽）； HKUST（香港科技大学）

AI总结提出SWITCH框架，通过离散边界令牌使隐状态循环推理兼容在线强化学习，并支持因果机制分析，实验表明其优于现有方法。

详情

AI中文摘要

潜在思维链通过用连续的隐状态循环替换可见推理轨迹来压缩推理，但现有公式难以用标准在线强化学习（RL）优化，且难以进行因果解释。我们的关键见解是，一对显式的边界令牌可以同时解决这两个问题：离散的进入和退出锚点使潜在块与标准在线RL兼容，并且相同的锚点为机制分析提供了自然立足点。基于此，我们提出SWITCH，一个可切换的潜在推理框架。模型发出<swi>进入潜在模式，</swi>退出。由于边界是普通的离散令牌，GRPO策略比率在每个决策点都有明确定义。相同的锚点也使潜在步骤暴露于直接探测和因果干预。我们通过可见到潜在的课程和Switch-GRPO目标训练模型，该目标通过循环潜在计算传播梯度。SWITCH在相似规模下始终优于先前的隐状态循环潜在推理方法。通过边界令牌的机制分析进一步揭示了三个发现：（i）<swi>是一个尖锐局部化的学习切换策略，而非风格化伪影；（ii）它开启的潜在步骤执行特定于问题的、因果重要的计算，而非作为惰性占位符；（iii）该计算集中在进入时的单个隐状态转换上。这些结果表明，隐状态循环潜在推理既可RL训练，又可进行直接机制分析，包括在线RL本身如何从内部改进模型。

英文摘要

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.

URL PDF HTML ☆

赞 0 踩 0

2606.13126 2026-06-12 cs.LG cs.AI cs.CL 交叉投稿

MiniPIC: Flexible Position-Independent Caching in <100LOC

MiniPIC: 少于100行代码的灵活位置无关缓存

Nathan Ordonez, Thomas Parnell

发表机构 * IBM Research（IBM研究院）

AI总结提出MiniPIC，通过无位置编码KV缓存和用户控制缓存重用原语，在vLLM中实现多种位置无关缓存方法，显著提升预填充吞吐量并降低首个令牌延迟。

Comments 13 pages, 5 figures

详情

AI中文摘要

检索增强和代理工作负载重复预填充可预测的结构化输入（我们称之为“跨度”），例如文档和代码文件。然而，vLLM等引擎中的前缀缓存无法重用KV条目，除非它们与另一个请求共享相同的前缀，而生产级推理服务器中的位置无关缓存（PIC）实现通常需要大量服务器代码更改或将KV状态保留在服务器外部，从而产生主机到设备的传输开销。我们提出了极简PIC（MiniPIC）：一种最小化、灵活且快速的vLLM设计，由两个组件构建：无位置编码的KV缓存和用户控制的缓存重用原语。MiniPIC在KV缓存中存储未旋转的K向量，在注意力内部使用每请求逻辑位置对K块应用RoPE，并公开三个面向用户和令牌级别的原语：块对齐填充、跨度分隔符（SSep）和提示依赖（PDep），这些原语修改哈希行为和有效的块级因果注意力结构。通过少于100行的核心引擎更改加上自定义注意力后端，这些原语足以在同一个运行的vLLM实例中实现多种PIC方法，包括Block-Attention、EPIC和Prompt Cache，同时原生集成KV缓存CPU卸载实现。在2WikiMultihopQA上，使用交错调度的MiniPIC相比基线vLLM将预填充吞吐量提高了49%，将缓存跨度的首个令牌时间减少了最多两个数量级，保持了未缓存跨度的线性预填充扩展，并且仅产生5.7%的最坏情况开销。

英文摘要

Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible and fast vLLM design built from two ingredients: positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing and token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep), that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.

URL PDF HTML ☆

赞 0 踩 0

2606.13473 2026-06-12 cs.LG cs.AI cs.CL 交叉投稿

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

MaxProof: 通过生成-验证器强化学习与群体级测试时扩展实现数学证明规模化

Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng

发表机构 * MiniMax ； The Chinese University of Hong Kong（香港中文大学）； Fudan University（复旦大学）； Peking University（北京大学）； Tsinghua University（清华大学）

AI总结提出MaxProof框架，结合生成-验证器强化学习与群体级测试时扩展，在MiniMax-M3系列上实现竞赛级数学证明，在IMO 2025和USAMO 2026上超越人类金牌阈值。

2606.13603 2026-06-12 cs.LG cs.AI cs.CL 交叉投稿

基于LLM的嵌入：注意力值比隐藏状态更好地编码句子语义

Yeqin Zhang, Yunfei Wang, Jiaxuan Chen, Ke Qin, Yizheng Zhao, Cam-Tu Nguyen

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University, China（新型软件技术国家重点实验室，南京大学，中国）； School of Artificial Intelligence, Nanjing University, China（人工智能学院，南京大学，中国）

AI总结本文提出Value Aggregation方法，利用LLM的注意力值向量而非隐藏状态来生成句子嵌入，在无训练设置下超越现有方法，甚至匹配或超越集成方法MetaEOL。

详情

AI中文摘要

句子表示是许多自然语言处理（NLP）应用的基础。虽然近期方法利用大型语言模型（LLM）来推导句子表示，但大多数依赖于最终层的隐藏状态，这些隐藏状态针对下一个词预测进行了优化，因此通常无法捕捉全局的句子级语义。本文引入了一个新颖的视角，证明注意力值向量比隐藏状态更有效地捕捉句子语义。我们提出了值聚合（VA），一种简单的方法，它跨多个层和词索引池化标记值。在无训练设置中，VA优于其他基于LLM的嵌入，甚至匹配或超越了基于集成的MetaEOL。此外，我们证明，当与合适的提示配对时，层注意力输出可以被解释为对齐的加权值向量。具体来说，最后一个标记的注意力分数充当权重，而输出投影矩阵（$W_O$）将这些加权值向量与LLM残差流的公共空间对齐。这种改进的方法，称为对齐加权VA（AlignedWVA），在无训练的基于LLM的嵌入中达到了最先进的性能，大幅超越了高成本的MetaEOL。最后，我们强调了通过微调值聚合来获得强LLM嵌入模型的潜力。

英文摘要

Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings, even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.

URL PDF HTML ☆

赞 0 踩 0

2604.12002 2026-06-12 cs.CL 版本更新

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

自蒸馏零：自我修订将二元奖励转化为密集监督

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora

发表机构 * Princeton University（普林斯顿大学）； University of Toronto（多伦多大学）； Carnegie Mellon University（卡内基梅隆大学）

AI总结提出SD-Zero方法，通过让模型同时扮演生成器和修订者，利用二元奖励生成密集的token级自监督信号，显著提升训练样本效率，在数学和代码推理任务上超越RFT、GRPO等基线。

详情

AI中文摘要

当前在可验证设置下的后训练方法分为两类。强化学习（RLVR）依赖二元奖励，虽然广泛适用且强大，但在训练过程中仅提供稀疏监督。蒸馏提供密集的token级监督，通常从外部教师或使用高质量示范中获得。收集此类监督成本高昂或不可用。我们提出自蒸馏零（SD-Zero），一种比RL更高效利用训练样本的方法，且不需要外部教师或高质量示范。SD-Zero训练单个模型扮演两个角色：生成器，产生初始响应；修订者，基于该响应及其二元奖励生成改进的响应。然后我们进行在线自蒸馏，将修订者蒸馏到生成器中，使用修订者以生成器的响应及其奖励为条件的token分布作为监督。实际上，SD-Zero训练模型将二元奖励转化为密集的token级自监督。在数学和代码推理基准上，使用Qwen3-4B-Instruct和Olmo-3-7B-Instruct，SD-Zero相比基础模型性能提升至少10%，并在相同问题集和训练样本预算下优于强基线，包括拒绝微调（RFT）、GRPO和自蒸馏微调（SDFT）。大量消融实验显示了所提出算法的两个新特性：（a）token级自定位，其中修订者能够基于奖励识别生成器响应中需要修订的关键token；（b）迭代自进化，其中改进答案的修订能力可以通过定期教师同步蒸馏回生成性能。代码：此https URL。

英文摘要

Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser's token distributions conditioned on the generator's response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator's response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization. Code: https://github.com/princeton-pli/Self-Distillation-Zero.

URL PDF HTML ☆

赞 0 踩 0

2604.18307 2026-06-12 cs.CL 版本更新

RLHF中奖励不确定性的统一视角

Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki

发表机构 * University of California, Berkeley（加州大学伯克利分校）； DeepMind（深度Mind）

AI总结本文提出使用分布奖励模型统一RLHF中的悲观主义方法，通过闭式有效奖励公式连接现有启发式方法，并揭示其隐含假设。

详情

AI中文摘要

基于人类反馈的强化学习（RLHF）受限于\textit{奖励破解}，即策略利用代理奖励模型（RM）中的错误，产生高RM分数而缺乏真正的质量提升。一种自然的缓解方法是\textit{悲观主义}：在RM不确定的区域惩罚奖励。然而，标准标量RM没有提供原则性的不确定性概念。我们认为正确的对象是\textit{分布}奖励模型$p(r\mid x,y)$。在贝叶斯推断或KL分布鲁棒优化（KL-DRO）视角下，KL正则化的RLHF目标具有闭式有效奖励$\tilde r(x,y) = \pmβ\log\mathbb{E}_p[e^{\pm r/β}]$。悲观分支统一了RM集成聚合的先前启发式方法：均值聚合、最坏情况优化（WCO）和不确定性加权优化（UWO）都作为该单一表达式的极限或截断出现。这也澄清了每个现有规则的隐含假设。

英文摘要

Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: lowering rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pmβ\log\mathbb{E}_p[e^{\pm r/β}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.

URL PDF HTML ☆

赞 0 踩 0

2508.01656 2026-06-12 cs.CL cs.AI cs.CY cs.HC physics.soc-ph 版本更新

Authorship Attribution in Multilingual Machine-Generated Texts

多语言机器生成文本的作者归属

Lucio La Cava, Dominik Macko, Róbert Móro, Ivan Srba, Andrea Tagarelli

发表机构 * DIMES Department, University of Calabria（卡利博大学DIMES系）； Kempelen Institute of Intelligent Technologies（智能技术研究所）

AI总结提出多语言作者归属问题，研究单语言方法在18种语言和8个生成器上的跨语言迁移能力，发现显著局限。

Comments Accepted at ACL 2026 - Main

详情

AI中文摘要

随着大型语言模型（LLM）达到类人的流畅性和连贯性，区分机器生成文本（MGT）与人类撰写的内容变得越来越困难。虽然MGT检测的早期工作侧重于二元分类，但LLM的不断发展和多样性需要更细粒度且更具挑战性的作者归属（AA），即能够识别文本背后的确切生成器（LLM或人类）。然而，目前AA仍局限于单语言环境，其中英语是研究最多的语言，忽视了现代LLM的多语言性质和使用。在这项工作中，我们引入了多语言作者归属问题，涉及将文本归因于跨多种语言的人类或多个LLM生成器。聚焦于18种语言——涵盖多个语系和书写系统——以及8个生成器（7个LLM和人类撰写类别），我们研究了单语言AA方法在多语言环境中的适用性，包括其跨语言迁移能力，以及生成器对归属性能的影响。我们的结果表明，虽然某些单语言AA方法可以适应多语言环境，但仍然存在显著的局限性和挑战，特别是在跨不同语系迁移时，这凸显了多语言AA的复杂性以及需要更稳健的方法以更好地匹配现实场景。

英文摘要

As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content becomes increasingly difficult. While early efforts in MGT detection have focused on binary classification, the growing landscape and diversity of LLMs require a more fine-grained yet challenging authorship attribution (AA), i.e., being able to identify the precise generator (LLM or human) behind a text. However, AA remains nowadays confined to a monolingual setting, with English being the most investigated one, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to human or multiple LLM generators across diverse languages. Focusing on 18 languages -- covering multiple families and writing scripts -- and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods in terms of their cross-lingual transferability, and the impact of generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches to better match real-world scenarios.

URL PDF HTML ☆

赞 0 踩 0

2606.12578 2026-06-12 cs.CL 新提交

MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

MARD: 镜像增强推理蒸馏用于机制级药物-药物相互作用预测

Mohammadreza Riyazat, Vian Lelo, Rameen Jafri, Yumna Khan, Abeer Badawi

发表机构 * University of Guelph（圭尔夫大学）； York University（约克大学）； Vector Institute（向量研究所）

AI总结提出MARD-7B模型，通过镜像增强推理蒸馏、单token KL散度、PRM加权DPO和机制感知检索通道，在机制级DDI预测中准确率超越GPT-4o 6.7个百分点，且成本仅为1%。

Comments 29 pages, 9 figures. Preprint

详情

AI中文摘要

机制级药物-药物相互作用（DDI）预测需要识别涉及的酶或药效学轴、作用方向及证据，而不仅仅是判断两种药物是否相互作用。我们引入了一个可复现的机制级DDI标注与评估协议，包括结构化的7家族/147亚型分类法、无泄漏的冷切分协议以及可审计的推理指标，用于评估超越平面交互分类的药理学预测。我们提出一个流水线，生成了7B推理模型MARD（镜像增强推理蒸馏），结合了三种训练创新：方向标签上的单token KL散度，将模型的预测与方向标签绑定；基于PRM权重的DPO，使用程序化硬负样本；以及无泄漏的机制感知检索通道。过程奖励步骤标签可自动根据DrugBank结构化字段验证，无需人工或LLM评判。在2026年4月的DrugBank版本上，我们的MARD-7B是32个系统比较中唯一在药物对新颖性下准确率保持稳定的系统，以约1%的前沿API成本，比最佳基线高出13.9个百分点，比GPT-4o高出6.7个百分点。进一步分析揭示了反记忆特征，即在罕见药物上准确率提升，表明增益来自结构化药理学推理而非药物频率记忆。我们发布了语料库、DDI-PRM、检索索引和训练代码。

英文摘要

Mechanism-level drug-drug interaction (DDI) prediction requires identifying which enzyme or pharmacodynamic axis is implicated, in which direction, and with which evidence -- not merely whether two drugs interact. We introduce a reproducible mechanism-level DDI labelling and evaluation protocol with a structured 7-family/147-subtype taxonomy, leakage-safe cold-split protocols, and auditable reasoning metrics for evaluating pharmacological prediction beyond flat interaction classification. We propose a pipeline that produces a 7B reasoning MARD (Mirror-Augmented Reasoning Distillation), combining three training innovations: a single-token KL divergence on direction tag that ties the model's prediction, per-loss PRM-weighted DPO with programmatic hard negatives, and a leakage-safe mechanism-aware retrieval channel. Process-reward step labels are automatically verifiable against DrugBank-structured fields, requiring no human or LLM judges. On the April-2026 DrugBank release, our MARD-7B is the only system in a 32-system comparison whose accuracy survives drug-pair novelty, beating the best baseline by +13.9 pp and GPT-4o by +6.7 pp at ~1% of frontier API cost. Further analysis reveals an anti-memorisation signature where accuracy improves on rarely seen drugs, suggesting that gain comes from structured pharmacological reasoning rather than drug-frequency memorisation. We release corpus, DDI-PRM, retrieval index, and training code.

URL PDF HTML ☆

赞 0 踩 0

2606.12903 2026-06-12 cs.CL 新提交

不确定性感知的混合检索用于长文档RAG

Hoin Jung, Xiaoqian Wang

发表机构 * Elmore Family School of Electrical and Computer Engineering, Purdue University（普渡大学埃尔莫尔家族电气与计算机工程学院）

AI总结提出UMG-RAG，一种无需训练的混合检索框架，通过多粒度分块和不确定性估计融合密集与稀疏检索结果，提升长文档问答质量。

详情

AI中文摘要

检索增强生成（RAG）关键依赖于检索证据的质量和粒度。大的检索单元保留上下文但常引入无关内容，可能稀释答案承载证据并恶化长上下文利用。细粒度单元更紧凑，但可能难以可靠检索，因为短块可能缺乏匹配查询所需的语义、词汇或桥接线索。我们提出不确定性感知的多粒度RAG（UMG-RAG），一种无需训练的混合检索框架，将分块粒度视为查询特定的可靠性估计。UMG-RAG不训练新检索器或修改生成器，而是利用现有密集和稀疏检索器作为跨多个分块粒度的互补专家。对于每个查询，它将每个专家-粒度得分列表转换为证据分布，从分布熵估计可靠性，并根据查询特定的语义、词汇和粒度置信度融合候选。我们进一步引入UMGP-RAG，一种父级提升变体，利用细粒度命中定位相关证据，同时返回更广泛的非冗余父块以保持局部连贯性。在问答基准上的实验表明，不确定性感知融合和父级提升在保持轻量级、即插即用检索管道的同时，提高了生成质量。

英文摘要

Retrieval augmented generation (RAG) depends critically on the quality and granularity of retrieved evidence. Large retrieval units preserve context but often introduce irrelevant content, which can dilute answer bearing evidence and worsen long context utilization. Fine-grained units are more compact, but they may be difficult to retrieve reliably because short chunks can lack semantic, lexical, or bridging cues needed to match the query. We propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that treats chunk granularity as query-specific reliability estimation. Instead of training a new retriever or modifying the generator, UMG-RAG uses existing dense and sparse retrievers as complementary experts across multiple chunk granularities. For each query, it converts each expert-granularity score list into an evidence distribution, estimates reliability from distribution entropy, and fuses candidates according to query-specific semantic, lexical, and granularity confidence. We further introduce UMGP-RAG, a parent promotion variant that uses fine-grained hits to locate relevant evidence while returning broader non-redundant parent chunks for local coherence. Experiments on question answering benchmarks show that uncertainty-aware fusion and parent promotion improve generation quality while maintaining a lightweight, plug-and-play retrieval pipeline.

URL PDF HTML ☆

赞 0 踩 0

2601.11004 2026-06-12 cs.CL 版本更新

NOVA: NOise-aware Verbal Confidence CAlibration for Robust Large Language Models in RAG Systems

NOVA: 面向RAG系统中鲁棒大语言模型的噪声感知言语置信度校准

Jiayu Liu, Rui Wang, Qing Zong, Yumeng Wang, Cheng Qian, Qingcheng Zeng, Tianshi Zheng, Haochen Shi, Dadi Guo, Baixuan Xu, Chunyang Li, Yangqiu Song

AI总结提出NOVA框架，通过规则引导的监督微调，解决检索增强生成中噪声上下文导致的过度自信问题，在域内和域外分别提升ECE 10.9%和8.0%。

详情

AI中文摘要

准确评估模型置信度对于在关键事实领域部署大语言模型（LLM）至关重要。尽管检索增强生成（RAG）被广泛采用以改善基础事实，但RAG设置中的置信度校准仍知之甚少。我们跨四个基准进行了系统研究，揭示LLM在检索到噪声上下文时校准性能较差。具体而言，矛盾或无关的证据往往会加剧模型的过度自信问题。为解决此问题，我们提出NOVA规则（噪声感知言语置信度校准规则），为在噪声下解决过度自信提供原则性基础。我们进一步设计了NOVA，一个噪声感知校准框架，该框架通过由这些规则指导的约2K HotpotQA示例合成监督信号。通过使用此数据进行监督微调（SFT），NOVA使模型具备内在的噪声感知能力，而无需依赖更强的教师模型。实验结果表明，NOVA带来了显著收益，在域内和域外分别将ECE分数提高了10.9%和8.0%。通过弥合检索噪声与言语校准之间的差距，NOVA为构建既准确又认知可靠的LLM铺平了道路。

英文摘要

Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance especially when noisy contexts are retrieved. Specifically, contradictory or irrelevant evidence tends to exacerbate the model's overconfidence issue. To address this, we propose NOVA Rules (NOise-Aware Verbal Confidence CAlibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NOVA, a noise-aware calibration framework that synthesizes supervision from ~2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NOVA equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NOVA yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NOVA paves the way for both accurate and epistemically reliable LLMs.

URL PDF HTML ☆

赞 0 踩 0

2601.19827 2026-06-12 cs.CL cs.AI cs.IR 版本更新

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

当迭代RAG优于理想证据：科学多跳问答中的诊断研究

Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee

发表机构 * Faculty of Engineering, McMaster University, Canada（麦斯特大学工程学院，加拿大）； BASF Canada Inc., Canada（巴斯夫加拿大公司，加拿大）

AI总结通过化学多跳问答数据集，诊断发现迭代检索-推理循环在科学领域显著优于静态RAG上限，揭示了阶段式检索的优势与失败模式。

Comments 51 pages, 29 figures

详情

AI中文摘要

检索增强生成（RAG）将大型语言模型（LLMs）扩展到参数化知识之外，但目前尚不清楚迭代检索-推理循环何时能有效超越静态RAG，尤其是在涉及多跳推理、稀疏领域知识和异构证据的科学领域。我们首次进行了受控的、机制层面的诊断研究，以探究同步迭代检索和推理能否超越理想化的静态上限（Gold Context）RAG。我们在三种设置下对十一个最先进的LLM进行了基准测试：（i）无上下文，衡量对参数化记忆的依赖；（ii）Gold Context，一次性提供所有真实证据；（iii）迭代RAG，一种无需训练的控制器，交替进行检索、假设细化和证据感知停止。使用以化学为中心的ChemKGMultiHopQA数据集，我们分离出需要真正检索的问题，并通过诊断分析行为，涵盖检索覆盖缺口、锚点携带下降、查询质量、组合保真度和控制校准。在所有模型中，迭代RAG始终优于Gold Context，增益高达25.6个百分点，尤其对于非推理微调模型。阶段式检索减少了后期跳失败，缓解了上下文过载，并实现了对早期假设漂移的动态修正，但剩余的失败模式包括跳覆盖不完整、干扰物锁定轨迹、过早停止校准错误以及即使检索完美时的高组合失败率。总体而言，阶段式检索通常比理想证据的单纯存在更具影响力；我们为在专门科学环境中部署和诊断RAG系统提供了实用指导，并为更可靠、可控的迭代检索-推理框架奠定了基础。

英文摘要

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.

URL PDF HTML ☆

赞 0 踩 0

2606.10716 2026-06-12 cs.CL cs.AI 版本更新

Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

注意力扩展：利用注意力增强的上下文嵌入提升长文档关键短语提取

Roberto Martínez-Cruz, Alvaro J. López-López, José Portela

发表机构 * Institute for Research in Technology, ICAI School of Engineering, Comillas Pontifical University（技术研究所，ICAI工程学院，科米利亚斯宗座大学）； DD-AIM, Senior Machine Learning Researcher（DD-AIM，高级机器学习研究员）

AI总结提出注意力扩展机制，通过预训练词嵌入增强PLM的上下文表示，在不增加计算成本的情况下扩展有效上下文范围，显著提升长文档关键短语提取性能。

详情

AI中文摘要

预训练语言模型（PLM）在关键短语提取（KPE）中取得了强劲性能，主要得益于其生成丰富上下文表示的能力。然而，长文档KPE仍然具有挑战性，因为显著的关键短语证据可能分散在遥远的文档部分，而这些部分无法在大多数PLM有限的上下文窗口内被联合捕获。尽管长上下文大语言模型（LLM）可以处理更广泛的文本上下文，但其计算成本限制了它们在高效和高通量KPE中的实用性。为了克服这一限制，我们提出了一种注意力扩展机制，该机制利用预训练词嵌入，用周围超出上下文的块中的信息来增强PLM的令牌表示。所提出的机制扩展了基于PLM的KPE模型的有效上下文范围，而无需全文档注意力或昂贵的基于LLM的推理。我们在五个PLM骨干网络上评估了我们的方法，包括通用、科学、任务特定和长上下文编码器，使用了两种训练机制和来自科学和新闻领域的五个基准语料库。实验结果表明，注意力扩展在所有评估设置中一致地提升了KPE性能，超越了最先进的模型，并在F1分数上取得了显著改进。这些改进扩展到领域特定、任务专门化和原生长上下文模型，表明所提出的机制提供了互补信息，而不仅仅是补偿有限的输入长度。这些结果确立了注意力扩展作为长文档KPE的一种高效且有效的策略。

英文摘要

Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.

URL PDF HTML ☆

赞 0 踩 0

2606.07218 2026-06-12 cs.IR cs.CL 版本更新

HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG

HKVM-RAG：用于多跳RAG的键值分离超图证据组织

Mingyu Zhang, Ying Ma

发表机构 * Faculty of Computing, Harbin Institute of Technology（哈尔滨工业大学计算机学院）； School of Computer and Information Engineering, Henan University（河南大学计算机与信息工程学院）

AI总结提出HKVM-RAG，一种键值分离的证据组织层，通过超图键值检索改进多跳RAG的证据链暴露，在三个基准上提升F1分数。

Comments Submitted to ICDE 2027. 13 pages, 3 figures

详情

AI中文摘要

多跳RAG提出了一个超越段落匹配的数据工程问题：在固定检索预算下，系统必须将检索到的文本组织成能够暴露答案链的证据单元。密集检索器独立评分段落，而基于图的记忆使关联显式化，但通常依赖于成对或实体中心的键，这些键会碎片化多跳证据。我们提出HKVM-RAG，一个键值分离的证据组织层。它从缓存的段落级LLM证据元组中组装答案路径超边，并将其用作检索键，同时保留段落文本作为答案值。为了隔离键空间设计，我们的固定基底协议在成对图和超图变体中保持元组缓存、候选段落、阅读器和评估预算不变。加权超图键值检索在2WikiMultiHopQA上比KG-PPR提高+3.426 F1，在MuSiQue上提高+3.592 F1；HotpotQA显示更高的结构化支持覆盖率不一定带来独立的答案F1增益。因此，我们将WHG-KV视为一种证据控制信号，而非密集检索的替代。Oracle和训练到开发分析表明支持选择是可修复的，一个密集感知控制器使用冻结的ColBERTv2和HKVM排名/分数特征，结合折外HKVM预测。它在三个基准上分别达到88.846、65.073和85.810 F1，比ColBERTv2提高+11.084、+6.763和+5.966 F1。源级消融实验表明，匹配的非WHG结构化信号无法达到WHG-KV的增益。这些结果提供了有界证据，表明键值分离的超图组织可以作为多跳RAG的可重用证据控制机制。

英文摘要

Multi-hop RAG poses a data-engineering problem beyond passage matching: under fixed retrieval budgets, a system must organize retrieved text into evidence units that expose answer chains. Dense retrievers score passages independently, while graph-based memories make associations explicit but often rely on pairwise or entity-centered keys that fragment multi-hop evidence. We present HKVM-RAG, a key-value-separated evidence-organization layer. It assembles answer-path hyperedges from cached passage-level LLM evidence tuples and uses them as retrieval keys, while retaining passage text as answer values. To isolate key-space design, our fixed-substrate protocol holds the tuple cache, candidate passages, reader, and evaluation budget constant across pairwise graph and hypergraph variants. Weighted hypergraph key-value retrieval improves over KG-PPR by +3.426 F1 on 2WikiMultiHopQA and +3.592 F1 on MuSiQue; HotpotQA shows that higher structured support coverage need not yield standalone answer-F1 gains. We therefore study WHG-KV as an evidence-control signal rather than a dense-retrieval replacement. Oracle and train-to-dev analyses identify support selection as repairable, and a dense-aware controller combines frozen ColBERTv2 and HKVM rank/score features using out-of-fold HKVM predictions. It reaches 88.846, 65.073, and 85.810 F1 on the three benchmarks, improving over ColBERTv2 by +11.084, +6.763, and +5.966 F1. Source-level ablations show that matched non-WHG structured signals do not match the WHG-KV gains. These results provide bounded evidence that key-value-separated hypergraph organization can serve as a reusable evidence-control mechanism for multi-hop RAG.

URL PDF HTML ☆

赞 0 踩 0

2606.12908 2026-06-12 cs.CL 新提交

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

SENTINEL: 用于训练工具使用语言模型智能体的失败驱动强化学习

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Qun Liu, Chen Luo, Jiri Gesi, Hanqing Lu, Yisi Sang, Manling Li, Jing Huang, Dakuo Wang

发表机构 * Northeastern University（东北大学）； Independent Researcher（独立研究员）； Northwestern University（西北大学）

AI总结提出SENTINEL框架，通过将智能体失败转化为针对性训练任务，在Tau2-Bench Retail上提升Qwen3-4B模型Pass@1从66.4到74.9，优于通用合成任务上的强化学习。

详情

AI中文摘要

语言模型智能体通过多轮工具使用在解决现实任务方面越来越有效。然而，训练可靠的工具使用智能体在实践中仍然具有挑战性。虽然强化学习提供了一种从智能体自身环境交互中改进智能体的在策略范式，但其有效性在很大程度上取决于训练任务分布。当任务在训练前固定时，任务分布可能越来越与策略不断发展的能力不匹配，导致许多轨迹被浪费在无信息的任务上。我们提出SENTINEL，一种失败驱动的强化学习框架，将求解器的轨迹失败转化为有针对性的训练任务。SENTINEL遵循控制器-提议者-求解器循环：控制器分析失败轨迹并总结重复出现的错误模式，提议者生成可执行的任务来强调这些弱点，求解器在针对性任务上接受训练。在Tau2-Bench Retail上使用Qwen3-4B-Thinking-2507，SENTINEL将Pass@1从66.4提高到74.9，并且在Pass@k指标上优于通用合成任务上的强化学习。这些结果表明，模型失败为改进工具使用语言模型智能体提供了有效且可扩展的针对性训练信号来源。

英文摘要

Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy's evolving capabilities, causing many rollouts to be spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver's rollout failures into targeted training tasks. SENTINEL follows a Controller--Proposer--Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass\^{}1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass\^{}k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.

URL PDF HTML ☆

赞 0 踩 0

2606.12984 2026-06-12 cs.CL 新提交

MemRefine: 基于LLM引导的压缩用于长期智能体记忆

Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

发表机构 * Korea University（韩国大学）； KAIST（韩国科学技术院）

AI总结提出MemRefine框架，利用LLM判断事实内容，通过删除、合并和保留操作将记忆库压缩到固定预算内，在多个基准上保持下游性能并优于基于规则的基线。

详情

AI中文摘要

大型语言模型（LLM）智能体越来越需要在长期交互中运行，其中过去对话中的信息必须被保留和回忆以支持未来任务。然而，随着交互的积累，记忆存储无限制增长，并充满冗余条目，这些条目增加了存储成本，并通过排挤最有用的证据而降低了检索质量。此外，在具有硬性内存预算的资源受限平台上，这尤其受限，促使我们制定了有存储预算的记忆管理任务，即在固定预算内保持已构建的记忆库，同时保留对未来交互有用的信息。为此，我们提出了MemRefine，一个基于LLM引导的框架，由于表面相似性不能很好地反映事实价值，该框架仅使用相似性来提出候选对，并将删除、合并和保留决策推迟给基于事实内容的LLM判断，迭代直到满足预算。在多个记忆框架和长期对话基准上，MemRefine始终满足目标预算，同时保持下游性能，并在紧预算下优于基于规则的基线。

英文摘要

Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries that inflate storage cost and degrade retrieval by crowding out the most useful evidence. Furthermore, this is especially limiting on resource-constrained platforms with hard memory budgets, motivating us to formulate storage-budgeted memory management, the task of keeping an already constructed memory store within a fixed budget while preserving information useful for future interactions. To this end, we then propose MemRefine, an LLM-guided framework that, since surface similarity poorly reflects factual value, uses similarity only to propose candidate pairs and defers delete, merge, and preserve decisions to an LLM judge based on factual content, iterating until the budget is met. Across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets while preserving downstream performance and outperforming rule-based baselines under tight budgets.

URL PDF HTML ☆

赞 0 踩 0

2606.13317 2026-06-12 cs.CL 新提交

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

SkillCAT: 面向LLM智能体的对比评估与拓扑感知技能自进化

Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

发表机构 * School of Computer Science, Wuhan University（武汉大学计算机学院）； School of Computer Science, Fudan University（复旦大学计算机学院）

AI总结提出SkillCAT框架，通过对比因果提取、评估增强进化和拓扑感知任务执行三阶段，实现无需训练的LLM智能体技能自进化，在多个基准上平均提升高达40.40%。

Comments 9 pages, 6 figures

详情

AI中文摘要

LLM智能体的技能自进化方法旨在将执行轨迹转化为可复用的技能文档，但当前流程通常每个任务只学习一条轨迹，在检查前合并候选技能补丁，并在推理前加载完整技能语料库。我们提出SkillCAT，一个无需训练的框架，将该过程分为三个阶段。对比因果提取（CCE）为每个任务采样多条轨迹，并比较同任务的成功/失败对，以识别解释结果差异的证据。评估增强进化（AAE）在源任务克隆上回放每个候选补丁，并在层次化技能补丁合并前仅保留改善或保持任务结果的补丁。拓扑感知任务执行（TTE）将进化后的技能编译成可路由的子技能拓扑，因此推理仅加载与任务相关的能力节点。我们在常见智能体基准上评估SkillCAT，包括SpreadsheetBench、WikiTableQuestions和DocVQA，并进一步测试跨模型和分布外泛化。在这些设置中，SkillCAT将基线平均得分提升高达40.40%，展示了无需模型训练的可靠技能进化。

英文摘要

Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.

URL PDF HTML ☆

赞 0 踩 0

2606.13643 2026-06-12 cs.CL 新提交

Recursive Agent Harnesses

递归智能体框架

Elias Lumer, Sahil Sen, Kevin Paul, Vamse Kumar Subbiah

发表机构 * PricewaterhouseCoopers, U.S.（普华永道（美国））

AI总结提出递归智能体框架（RAH），通过代码优先的框架递归扩展模型递归，在长上下文推理中显著提升编码智能体性能。

详情

AI中文摘要

递归语言模型（RLM）表明，模型调用的递归是长上下文推理的有效策略，而生产级编码智能体已开始编写大规模生成子智能体的代码，最近如Anthropic的动态工作流。我们命名并研究了这两条工作线之间的模式，其中递归单元是一个完整的智能体框架，包含文件系统工具、代码执行和规划，而不是没有工具的模型调用。我们将其称为递归智能体框架（RAH），并将其视为框架递归，即RLM模型递归的代码优先扩展。父智能体生成并执行一个可执行脚本，该脚本并行生成子智能体框架以处理细粒度工作负载，并使用结构化函数调用处理小子任务。我们在长上下文推理上提供了受控评估。在固定主干为GPT-5以匹配已发布的Codex和RLM基线的情况下，RAH在Oolong-Synthetic（199个样本，13个上下文长度桶，最高4M令牌）上将Codex编码智能体基线从71.75%提高到81.36%，这一增益归因于框架而非模型。使用更强的骨干Claude Sonnet 4.5，同一设计达到89.77%。

英文摘要

Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic's dynamic workflows. We name and study the pattern between these two lines of work, where the recursive unit is a full agent harness with filesystem tools, code execution, and planning rather than a model call with no tools. We call this the Recursive Agent Harness (RAH) and frame it as harness recursion, the code-first extension to the model recursion of RLMs. A parent agent generates and runs an executable script that spawns subagent harnesses in parallel for fine-grained workloads and uses structured function calls for small subtasks. We provide a controlled evaluation on long-context reasoning. With the backbone held fixed at GPT-5 to match the published Codex and RLM baselines, RAH improves the Codex coding-agent baseline from 71.75% to 81.36% on Oolong-Synthetic (199 samples, 13 context-length buckets up to 4M tokens), a gain attributable to the harness rather than the model. With a stronger backbone, Claude Sonnet 4.5, the same design reaches 89.77%.

URL PDF HTML ☆

赞 0 踩 0

2606.13663 2026-06-12 cs.CL 新提交

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

HyperTool：超越逐步工具调用的工具增强型智能体

Yaxin Du, Yifan Zhou, Yujie Ge, Jiajun Wang, Xianghe Pang, Shuo Tang, Tuney Zheng, Bryan Dai, Jian Yang, Siheng Chen

发表机构 * Shanghai Jiao Tong University（上海交通大学）； IQuest Research ； Beijing University of Aeronautics and Astronautics（北京航空航天大学）

AI总结针对工具增强型LLM中逐步调用导致执行粒度不匹配的问题，提出HyperTool统一可执行接口，将确定性工具子流程折叠为单次调用，在多步工具任务上显著提升准确率。

详情

AI中文摘要

工具增强型LLM智能体通常依赖逐步的原子工具调用，其中每次调用、观察和值传递都暴露在主推理轨迹中。这造成了执行粒度不匹配：局部确定性的工具工作流被展开为重复的模型可见决策，消耗上下文并迫使模型管理轨迹中的低级数据流。我们引入HyperTool，一个统一的可执行MCP风格工具接口，改变了模型可见的工具执行单元。模型调用HyperTool时使用一个代码块，该代码块可以通过原始模式调用现有工具、操作返回值并在本地传递中间结果，将确定性工具子程序折叠为单个外部调用。为了训练模型使用此接口，我们从跨工具组合任务中合成HyperTool格式的轨迹，并在真实MCP环境中验证。在MCP-Universe上，HyperTool将Qwen3-32B的平均准确率从15.69%提升至35.29%，Qwen3-8B从9.93%提升至33.33%，并在平均准确率上超越GPT-OSS和Kimi-k2.5，表明我们的HyperTool能显著改进多步工具使用。

英文摘要

Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an \emph{execution-granularity mismatch}: locally deterministic tool workflows are unfolded into repeated model-visible decisions, consuming context and forcing the model to manage low-level dataflow in the trace. We introduce \textbf{HyperTool}, a unified executable MCP-style tool interface that changes the model-visible unit of tool execution. A model invokes HyperTool with a code block that can call existing tools through their original schemas, manipulate returned values, and pass intermediate results locally, folding deterministic tool subroutines into a single outer call. To train models to use this interface, we synthesize HyperTool-format trajectories from cross-tool compositional tasks and verify them in real MCP environments. On MCP-Universe, HyperTool improves average accuracy from 15.69\% to 35.29\% on Qwen3-32B and from 9.93\% to 33.33\% on Qwen3-8B, and surpass GPT-OSS and Kimi-k2.5 on average accuracy, showing that our HyperTool can substantially improve multi-step tool use.

URL PDF HTML ☆

赞 0 踩 0

2606.12780 2026-06-12 cs.LG cs.CL 交叉投稿

ProPlay: Procedural World Models for Self-Evolving LLM Agents

ProPlay: 用于自我进化LLM智能体的程序化世界模型

Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li, Xiaoguang Guo, Weixiang Sun, Chuxu Zhang, Yanfang Ye

发表机构 * University of Notre Dame（圣母大学）； University of Connecticut（康涅狄格大学）

AI总结提出ProPlay程序化世界模型，通过程序级预演和因果过程图，使LLM智能体在部分可观测环境中自我进化，无需外部监督。

详情

AI中文摘要

自我进化智能体应能在无外部监督下通过交互改进，但在部分可观测环境中仍困难，智能体必须主动探索、从有限反馈中学习，并决定何时信任先前经验。现有的LLM智能体方法通常依赖记忆或规划模块，但很少在它们之间闭环以持续完善对环境动态的内部理解。我们提出ProPlay，一种程序化世界模型，支持程序级预演，智能体可利用学到的世界知识排练未来的程序路径。ProPlay不将经验表示为孤立的规则或低层动作约束，而是将成功轨迹抽象为程序，并在捕获任务阶段间因果转换的程序图中组织它们。每个转换与一个可靠性记录嵌入相关联，以从过去结果中估计其任务特定贡献。在每个回合前，ProPlay在已知图结构上模拟未来程序轨迹作为结构化软指导；执行后，它利用环境反馈精炼图。在公开基准上的实验表明，ProPlay在环境理解和自我进化能力上持续优于强基线。我们的代码已在此https URL发布。

英文摘要

Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code has been released in https://github.com/antman9914/proplay.

URL PDF HTML ☆

赞 0 踩 0

2606.13174 2026-06-12 cs.LG cs.CL 交叉投稿

IVIE：一种用于增量且经过验证的交互式小说世界生成的神经符号方法

Micaela Vaucher, Santiago Silveira, Santiago Góngora, Luis Chiruzzo

发表机构 * Instituto de Computación, Facultad de Ingeniería, Universidad de la República（乌拉圭共和国大学工程学院计算机研究所）

AI总结提出IVIE神经符号方法，结合LLM的创造力与符号验证的连贯性，通过四阶段增量生成管道构建可玩的交互式小说世界，人类评估显示其生成沉浸式、主题连贯的世界，平衡了灵活性与叙事一致性。

Comments 10 pages, 3 figures. To appear in the Proceedings of the 16th International Conference on Computational Creativity (ICCC'26), June 2026

详情

AI中文摘要

交互式小说中的计算创造力面临一个基本矛盾：大型语言模型（LLM）可能产生创意叙事，但难以维持世界连贯性，而符号系统确保一致性但缺乏创意灵活性。我们提出IVIE（增量与验证的交互体验），一种从零开始生成完整且可玩的交互式小说世界的神经符号方法。基于PAYADOR的神经符号框架，IVIE实现了一个四阶段增量生成管道，将创意决策——设定与角色创建、谜题设计——委托给LLM，同时通过符号验证将世界状态接地。该系统生成具有相互关联的地点、功能性物品、非玩家角色和连贯谜题的世界，所有这些都围绕一个中心目标导向架构组织。人类评估表明，该方法生成了沉浸式、主题连贯的世界，具有高玩家参与度。结果似乎表明，神经符号方法成功平衡了灵活性与叙事连贯性：符号验证在不消除生成自由的情况下将LLM生成接地。然而，挑战依然存在：LLM的不一致性偶尔会绕过谜题约束，客观验证的空白允许一些结构上不可能的目标。我们为未来的神经符号交互式叙事系统确定了关键设计考虑因素，特别是关于LLM的能力及其局限性。

英文摘要

Computational creativity in Interactive Fiction faces a fundamental tension: Large Language Models (LLM) may produce creative narratives but struggle with world coherence, while symbolic systems ensure consistency but lack creative flexibility. We present IVIE (Incremental & Validated Interactive Experiences), a neuro-symbolic approach to generating complete and playable interactive fiction worlds from scratch. Building upon PAYADOR's neuro-symbolic framework, IVIE implements a four-stage incremental generation pipeline that delegates creative decisions--setting and character creation, puzzle design--to LLMs while grounding the world state through symbolic validation. The system generates worlds with interconnected locations, functional items, non-player characters, and coherent puzzles, all structured around a central goal-oriented architecture. Human evaluation shows the approach generates immersive, thematically coherent worlds with high player engagement. Results seem to indicate that the neuro-symbolic approach successfully balances flexibility with narrative coherence: symbolic validation grounds LLM generation without eliminating generative freedom. However, challenges remain: LLM inconsistencies occasionally bypass puzzle constraints, and objective validation gaps allow some structurally impossible goals. We identify key design considerations for future neurosymbolic interactive storytelling systems, particularly regarding LLM capabilities and their limitations.

URL PDF HTML ☆

赞 0 踩 0

2603.00025 2026-06-12 cs.CL 版本更新

TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation

TAB-PO：面向Token关键结构化生成的具有Token级自适应障碍的偏好优化

Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Sreeraj Ramachandran, Elyas Irankhah, Muhammad Arif, Ashley Hagaman, Sarah R. Lowe, Aimee Kendall Roundtree

发表机构 * Yale University（耶鲁大学）； Texas State University（德克萨斯州立大学）

AI总结针对结构化预测中偏好与拒绝对象仅少数token不同导致的梯度稀释和token侵蚀问题，提出基于混淆感知偏好构建和Token级自适应障碍的TAB-PO方法，在SciERC任务上显著提升关键指标。

详情

AI中文摘要

直接偏好优化（DPO）是一种有效且广泛采用的离线对齐方法，但难以适应本体驱动的结构化预测，其中偏好和拒绝的JSON对象通常仅在少数模式定义token上存在差异。在这种低编辑距离场景下，序列级DPO将梯度质量分散到非关键的序列化token上（梯度稀释），并可能降低罕见、低置信度的偏好模式token的似然（token侵蚀）。为解决这些限制，我们首先开发了一种混淆感知的偏好构建策略，该策略用从验证集SFT预测中估计的经验结构化错误模式来增强专家策划的歧义模式，合成最小扰动的、模式有效的负样本，将偏好学习聚焦于现实的本体级决策错误。然后，我们引入了Token自适应障碍偏好优化（TAB-PO），这是一种用于token关键结构化生成的SFT后目标。TAB-PO添加了一个置信门控的token级障碍，对低置信度的模式token施加监督锚定。在公开的SciERC科学信息抽取任务上，使用1.5B到70B的Llama/Qwen模型评估，TAB-PO在本体关键的语义标签和关系链接指标上平均比SFT提升11.59%，在这些指标上100%胜于最强的token级和序列级DPO变体，并领先领先的前沿模型14.71%，同时在文本基础方面取得了强劲的增益。

英文摘要

Direct Preference Optimization (DPO) is an effective and widely adopted approach for offline alignment but is poorly matched to ontology-driven structured prediction, where preferred and rejected JSON objects often differ in only a few schema-defining tokens. In this low-edit-distance regime, sequence-level DPO spreads gradient mass across non-critical serialization tokens (gradient dilution) and can reduce likelihood on rare, under-confident preferred schema tokens (token erosion). To address these limitations, we first develop a confusion-aware preference-construction strategy that augments expert-curated ambiguity patterns with empirical structured-error modes estimated from validation-set SFT predictions, synthesizing minimally perturbed, schema-valid negatives that focus preference learning on realistic ontology-level decision errors. We then introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), a post-SFT objective for token-critical structured generation. TAB-PO adds a confidence-gated token-level barrier that applies supervised anchoring to under-confident schema tokens. On the public SciERC scientific information extraction task, evaluated with Llama/Qwen models from 1.5B to 70B, TAB-PO improves ontology-critical semantic-label and relational-linking metrics over SFT by 11.59% on average, wins 100% of comparisons against the strongest token-level and sequence-level DPO variants on these metrics, and surpasses leading frontier models by 14.71%, while delivering strong gains in textual grounding.

URL PDF HTML ☆

赞 0 踩 0

2606.12748 2026-06-12 cs.CL 新提交

Agent-based models for the evolution of morphological alternation patterns

基于智能体的形态交替模式演化模型

Aravinth Kulanthaivelu, Richard Sproat

AI总结通过多智能体模拟，研究形态交替（如go/went）的涌现机制，发现无标度社交网络和随机采纳策略能产生更真实的形态模式。

Comments 51 + 37 pages. 31 Figures

详情

AI中文摘要

为什么英语中“go”的过去式是看似无关的“went”？这种交替在语言中很常见。它们既无助于交流也不利于学习，却能持续存在数百年或数千年。我们提出了一个多智能体模拟，用于研究形态词干和屈折交替的涌现。交替形式源于语音变化，或者像“go/went”一样，来自与部分人群相关的词汇替代。当一个智能体“听到”另一个智能体对某个词形位（例如go的过去式）使用新形式时，它们会以一定概率采纳该形式，并可能将其使用扩展到共享相同原始形式的其他词形位。因此，替代形式可以在人群中传播，并固化为词干或屈折标记的交替形式。与许多先前的计算研究不同，我们的系统允许自然主义的词汇形式、现实的语音规则、包含数百或数千条目的词典，以及数十或数百个智能体的人群。它支持多种网络拓扑、扩散模式和智能体采纳策略。这类模拟的一个问题是评估：与真实语言相比，产生的形态有多真实？我们引入了AI历史语言学家，这是一个新颖的大型语言模型驱动系统，模拟两位历史语言学家之间的辩论。我们用它来比较一组真实语言的形态、伪装形态和实验演化形态。结果表明，有利于产生更合理形态的因素包括无标度社交网络和随机伯努利形式采纳。我们还提出了三个案例研究，模拟了有记载的历史变化，使我们能够测试如果历史不同会发生什么。所有代码和数据均已发布。

英文摘要

Why is the past of English "go" the apparently unrelated "went"? Such alternations are frequent in languages. They neither aid communication nor learnability, yet they can be persistent, surviving over centuries or millennia. We present a multi-agent simulation of the emergence of morphological stem and inflection alternations. Alternate forms arise by phonological changes or, as with "go/went", from lexical alternatives associated with a subset of the population. When an agent 'hears' another agent use a novel form for a slot in the paradigm of a word (say, the past tense of go), they will with some probability adopt that form, possibly spreading its use to other slots in the paradigm that shared the same original form. Thus alternative forms can spread through the population and become entrenched as stem or inflectional marker alternants. Unlike many previous computational studies, our system allows for naturalistic lexical forms, realistic phonological rules, lexicons with hundreds or thousands of entries, and agent populations in the tens or hundreds. It supports several network topologies, diffusion patterns and agent adoption policies. One issue with such simulations is evaluation: how realistic is the resulting morphology compared to those of real languages? We introduce the AI Historical Linguist, a novel Large Language Model-driven system that models a debate between two historical linguists. We use this to compare a set of real language morphologies, disguised morphologies, and experimentally evolved morphologies. The results suggest that among the factors that favor more plausible morphologies are scale-free social networks and random Bernoulli adoption of forms. We also present three case studies modeling attested historical changes, allowing us to test what might have happened if history had been different. All code and data are released.

URL PDF HTML ☆

赞 0 踩 0

2606.13189 2026-06-12 cs.CL 新提交

帮助图表讲述它们的故事！基于论文的视频生成解释复杂科学图表

Ishani Mondal, Javad Baghirov, Jordan Boyd-Graber

AI总结提出MINARD流水线，从图表及其论文生成基于区域分解的叙述性视频，并发布FigTalk基准，在自动和人工评估中优于现有方法。

Comments Webpage: https://minard.vercel.app/

2606.13572 2026-06-12 cs.CL cs.AI 新提交

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

ArogyaSutra：面向印度语言的多模态医学推理的多智能体框架

Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

发表机构 * Indian Institute of Technology Patna（印度理工学院巴特那分校）； Indian Institute of Technology Kanpur（印度理工学院坎普尔分校）； Prasannadeb Women’s College（普拉萨纳德布女子学院）

AI总结针对印度语言医疗场景中多模态大语言模型性能不足的问题，提出多模态医学问答数据集ArogyaBodha和基于演员-评论家的多智能体框架ArogyaSutra，通过工具接地与双记忆机制提升多语言医学推理准确性。

详情

AI中文摘要

多模态大语言模型（MLLMs）在通用领域展现出有希望的推理能力，但在医疗等专业场景中，尤其是在多语言和低资源情况下，其性能仍然有限。这一差距在印度农村等地区尤为关键，患者通常用本土印度语言表达复杂的医疗问题，并依赖医学图像等多模态输入。现有的以英语为中心的MLLMs难以支持此类用例，限制了公平获取AI驱动的医疗辅助。为应对这一挑战，我们引入了ArogyaBodha，一个大规模的多语言多模态医学问答数据集，由八个异构来源构建，涵盖31个身体系统、六种成像模态和21个临床领域，覆盖英语和七种主要印度语言。我们进一步提出了ArogyaSutra，一个基于演员-评论家的多智能体框架，将工具接地与双记忆机制相结合，实现逐步的、推理感知的决策，并使用存储的演员-评论家模拟轨迹进行蒸馏。实验表明，我们的数据集和框架在所有印度语言上提高了多语言医学推理的准确性，消融实验验证了每个组件的贡献。源代码和数据集可在以下网址获取：this https URL ArogyaSutra/

英文摘要

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ ArogyaSutra/

URL PDF HTML ☆

赞 0 踩 0

2606.13630 2026-06-12 cs.CL 新提交

From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

从词元到面部：探究用于3D面部动画的离散语音表示

Pedro Correa, Olivier Perrotin, Samir Sadok, Paula Costa, Thomas Hueber

发表机构 * Univ. Estadual de Campinas (UNICAMP), Brazil（巴西坎皮纳斯州立大学（UNICAMP））； Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France（法国格勒诺布尔阿尔卑斯大学，CNRS，格勒诺布尔国立理工学院，GIPSA实验室）； Inria at Univ. Grenoble Alpes, CNRS, LJK, France（法国格勒诺布尔阿尔卑斯大学Inria，CNRS，LJK）

AI总结研究评估四种语音表示在3D面部合成中的效果，发现编码音素类别有利于准确预测面部动画，并基于此提出音频视觉文本到语音管线。

Comments This work has been accepted in Interspeech 2026

详情

AI中文摘要

语音表示的选择在语音驱动的3D面部动画中至关重要。不同表示在编码内容上有所差异：SSL特征强调音段和语义线索，神经编解码器产生优化用于声学重建的潜在表示，而ASR风格的目标产生基于标签的空间。我们评估了四种用于3D面部合成的语音表示族，通过客观指标和感知评估比较了它们在两个面部解码器上的面部重建质量。此外，我们进行了探测分析，将分词表示与音素单元和发音变形联系起来。我们发现，编码音素类别有利于在语义和基于标签的表示上准确预测面部动画，且面部动画质量相当。基于后者，我们引入了一个音频视觉文本到语音（AVTTS）管线，该管线利用离散表示作为共享空间来解码语音和3D面部运动。

英文摘要

The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We found that encoding phonetic classes is beneficial for accurate facial animation prediction on both semantic and label-based representations with comparable facial animation quality. From the latter, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that leverages, as a shared space, discrete representations to decode speech and 3D facial motion.

URL PDF HTML ☆

赞 0 踩 0

2606.12616 2026-06-12 cs.AI cs.CL 交叉投稿

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

PersonaDrive: 面向闭环驾驶模拟的人类风格检索增强VLA智能体

Mahmoud Srewa, Praneetsai Iddamsetty, Mohammad Abdullah Al Faruque, Salma Elmalaki

发表机构 * University of California, Irvine（加利福尼亚大学尔湾分校）

AI总结提出PersonaDrive流水线，通过检索风格指令下的人类驾驶演示来调节视觉-语言-动作（VLA）驾驶智能体，实现闭环模拟中多样化的非自车智能体行为，无需针对每种风格重新训练。

详情

AI中文摘要

闭环驾驶模拟器通常在其环境中填充行为大致相同的非自车交通智能体，这些智能体要么由基于规则的交通管理器生成，要么由训练为单一行为模式的学习模型生成。最近的工作通过观测数据上的事后标签或LLM推断的奖励权重引入风格变化，但这些信号充当了风格应奖励什么的代理，而不是明确要求以该风格驾驶的人类演示。我们提出了PersonaDrive，一个流水线，它根据从风格指令的人类驾驶数据集中检索到的演示来调节视觉-语言-动作（VLA）驾驶智能体，在该数据集中，参与者在驾驶员在环平台上以激进、中性和保守指令驾驶CARLA排行榜路线。该流水线包括三个阶段：(i) 使用组合的图像-文本相似度分数对每种风格的人类驾驶数据进行离线三元组挖掘；(ii) 训练一个轻量级检索头，将冻结的视觉特征与每个风格数据库上的小型控制编码器融合；(iii) 微调单个VLA主干，以在航点预测期间将检索到的上下文点视为上下文行为演示。在推理时，通过切换检索头查询的每个风格数据库，相同的主干可以适应任何风格，因此选择风格无需针对每种风格重新训练，同时为闭环模拟启用人类风格、风格多样的非自车智能体。在Bench2Drive上，PersonaDrive（无风格）的驾驶得分比SimLingo高4.6%，比HiP-AD高2.5%，在风格条件下，每种风格都获得最高驾驶得分，波动范围约2%（其最弱风格超过最强基线DMW 5.4%），而从保守指令到激进指令，平均速度和加速度分别提高18%和25%。

英文摘要

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.

URL PDF HTML ☆

赞 0 踩 0

2606.12898 2026-06-12 cs.CV cs.CL 交叉投稿

编辑比特，差异编码：面向视觉自回归模型的逐比特残差编辑

Shengqiang Zhang, Ruotong Liao, Volker Tresp, Barbara Plank, Hinrich Schütze

发表机构 * LMU Munich & Munich Center for Machine Learning (MCML)（慕尼黑大学 & 慕尼黑机器学习中心 (MCML)）

AI总结提出BitResEdit，一种无需训练的视觉自回归图像编辑方法，通过比特级源负引导和残差编码注入，在保持背景的同时实现强文本对齐。

详情

AI中文摘要

基于文本引导的图像编辑与视觉自回归（VAR）生成器需要控制模型采样的内容以及将采样变化写回图像代码的位置。现有的VAR编辑器主要操作于令牌流、特征或扁平的下一个令牌对数几率，忽略了逐比特残差VAR模型的两个原生结构：逐比特伯努利预测头和图像组装所用的加性多尺度残差代码域。我们提出BitResEdit，一种针对逐比特残差VAR生成器（如Infinity）的无训练编辑器。BitEdit通过沿共享编辑前缀上计算的源-目标对比倾斜后CFG的逐比特对数几率，执行源负引导，然后将每个更新投影到干净CFG采样器周围的闭式伯努利-KL信任域中。ResEdit将采样的比特转换为每尺度连续代码残差，用定位掩码对其进行门控，并通过生成器的原生尺度求和重新注入。它们共同将决策时的比特引导与组合时的代码组合耦合，使得被掩码的潜在特征通过代码算术精确保留，同时在目标区域内应用局部化的尺度感知编辑。在PIE-Bench上使用Infinity-2B，BitResEdit在相同骨干的VAR编辑器中实现了最强的文本对齐，在编辑区域上的CLIP比最强先前的编辑器提高了+1.07，同时背景保持与其相当。消融实验表明BitEdit和ResEdit在目标对齐和背景保持中发挥互补作用。

英文摘要

Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity. BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source--target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it. Ablations show BitEdit and ResEdit play complementary roles in target alignment and background preservation.

URL PDF HTML ☆

赞 0 踩 0

2602.07106 2026-06-12 cs.CV cs.AI cs.CL 版本更新

自适应轮流发言：面向实时多方语音代理

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

AI总结提出ModeratorLM，一种基于角色条件的语音大模型，通过分块流式处理和链式推理，在多方对话中实现自适应轮流发言，显著提升轮流精度和召回率。

Comments Accepted for publication at Interspeech 2026

2606.04474 2026-06-12 cs.CL eess.AS 版本更新

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

语音大模型推理中的实体绑定失败：诊断与思维链干预

Ming-Hao Hsu, Xiaohai Tian, Jun Zhang, Zhizheng Wu

发表机构 * School of Data Science, The Chinese University of Hong Kong, Shenzhen, China（1 数据科学学院，香港中文大学（深圳））； ByteDance, China（2 字节跳动，中国）

AI总结本文通过诊断语音大模型在逻辑推理中的实体绑定失败问题，提出实体感知思维链方法，显著提升推理准确率。

Comments INTERSPEECH 2026

详情

AI中文摘要

语音大模型在复杂推理任务上表现不如文本模型。我们揭示了这种模态差距并非均匀的认知缺陷。通过评估三个不同的语音大模型，我们发现在空间、句法和事实任务上，语音到文本（S2T）匹配或超过文本到文本（T2T）。然而，在需要实体追踪的逻辑任务上，S2T准确率降至随机水平。我们将这种局部退化诊断为实体绑定失败：连续的语音特征导致模型在隐式推理过程中丢失精确的实体-属性关联。为解决此问题，我们提出了实体感知思维链（EA-CoT），强制语音大模型在推理前显式枚举实体并将其绑定到声明上。引人注目的是，即使口语名称被误识别，EA-CoT也能弥合差距，带来高达24.4%的绝对准确率提升。消融实验证实这些提升完全源于显式语义绑定，将模态差距重新定义为可解决的瓶颈。

LLMs 能更好地捕捉人类判断——使用合适的提示

Danica Dillion, Chen Cecilia Liu, Baihui Wang, Daniele Barolo, Tanmay Rajore, Niket Tandon, Pranathi Ravikumar, Kurt Gray

AI总结通过简单提示策略，LLMs 能恢复人类反应的完整分布，并减少对措辞变化的敏感性，提升 AI-人类对齐。

详情

AI中文摘要

大型语言模型（LLMs）在捕捉人类判断方面是否表现不佳？两个常被提及的限制是：LLMs 无法捕捉反应的全分布，以及它们的判断在措辞变化上不稳定。我们展示了缓解这些限制的简单提示策略。在两个数据集上——一个代表美国的 144 个道德情景集，以及国际社会调查项目“家庭与性别角色变化”模块涵盖 32 个国家的 38 个道德信念——我们展示了简单的启发式技术如何帮助改善 AI-人类对齐。首先，提示模型报告标准差和反应比例，比常见策略更好地恢复了人类反应的完整范围。其次，确保情景对人类参与者清晰——如人类困惑评分所反映——提升了模型对齐度，且 LLMs 可以跟踪人类困惑评分。同时，我们发现 LLMs 对自身误差的估计校准不佳，尽管它们能相对较好地预测人类变异性。这些结果表明，向 LLMs 提出更好的问题可以得到更好的答案。

英文摘要

Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets--a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module covering 32 countries--we show how simple elicitation techniques help improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring scenarios are clear to human participants--as reflected in human confusion ratings--boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs' estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking better questions to LLMs can yield better answers.

URL PDF HTML ☆

赞 0 踩 0

2606.12789 2026-06-12 cs.CL cs.IR 新提交

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

RAG基准测试应该有多细粒度？一个用于合成问题生成的层次化框架

Chase M. Fensore, Kaustubh Dhole, Jason Fan, Eugene Agichtein, Joyce C. Ho

发表机构 * Department of Computer Science, Emory University（埃默里大学计算机科学系）

AI总结提出HieraRAG层次化框架，通过合成问题生成研究RAG基准测试的细粒度，发现最优粒度因维度而异，并引入一致性比率度量。

详情

DOI: 10.1145/3805712.3809925

MÖVE：德国公共部门的大语言模型整体基准

Camilla Dalerci, Thilo Michael, Robin Schaefer, Daniel Weinland

发表机构 * Innovations Department, Bundesdruckerei GmbH（德国联邦印钞公司创新部）

AI总结提出MÖVE基准，从性能和治理两个维度评估39个LLM在德国公共部门的应用，发现无单一模型全面领先，模型大小非质量可靠指标。

详情

AI中文摘要

我们提出MÖVE（Modelle für die Öffentliche Verwaltung Evaluieren），一个用于评估德国公共部门背景下大语言模型（LLM）的整体基准。尽管LLM在公共管理中日益普及，但模型选择仍然很大程度上是临时的，现有基准提供的指导有限：它们主要面向英语、内容以美国为中心，并且只关注任务性能。MÖVE通过评估39个模型在两个互补维度上填补这些空白。性能标准涵盖摘要、问答和主题提取。治理标准评估幻觉倾向、能耗、提供商透明度、与德国宪法价值观的一致性以及对德国政党立场的知识。总共，我们使用了十个德语数据集，包括我们构建的反映公共管理领域的金标准和银标准数据集。我们采用多指标评估策略，结合经典NLP指标、基于嵌入的方法和LLM作为评判的方法。我们的结果表明，没有单一模型在所有标准上占主导地位：顶级表现者因任务而异，模型大小本身是质量的糟糕预测指标。我们进一步评估基准本身，分析其统计精度、LLM评判可靠性、私有数据集对模型排名的影响、结果对提示表述的敏感性以及能耗估计的有效性。MÖVE被设计为一个活跃开发中的动态基准；结果公开于此https URL。

英文摘要

We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values and knowledge about positions by German political parties. In total, we utilize ten German-language datasets, including gold- and silverstandard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. MÖVE is designed as a living benchmark under active development; results are publicly available at https://moeve.bundesdruckerei.de/.

URL PDF HTML ☆

赞 0 踩 0

2606.13120 2026-06-12 cs.CL 新提交

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

EvoBrowseComp: 基于演化知识的搜索智能体基准测试

Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng

发表机构 * Northeastern University, China（东北大学（中国））； Weixin AI, Tencent Inc, China（腾讯微信AI（中国））

AI总结提出EvoBrowseComp，一个通过实时网络遍历自动生成400道英文和400道中文无污染复杂问题的演化基准，用于评估搜索智能体在动态知识环境中的真实浏览能力。

Comments 14 pages, under review

详情

AI中文摘要

搜索智能体——即增强搜索工具的大型语言模型——加剧了对未来验证基准的需求。现有的基准如BrowseComp依赖静态知识，容易受到测试集污染和参数记忆的影响。因此，模型可以通过事实回忆而非真正检索获得高分，通过推理捷径掩盖真实的浏览能力。在本文中，我们介绍EvoBrowseComp，一个包含400道英文和400道中文无污染复杂问题的演化基准，通过实时网络遍历合成。为了收集这些问题，我们设计了一个三智能体协作框架：（1）QA合成智能体，从实时网络中检索新鲜知识以合成问答对；（2）信息过滤智能体，根据可信度和流行度过滤检索到的知识，以阻断参数捷径；（3）高级指导智能体，将问题形式化为推理图，以减少合成问答对中的逻辑冗余和捷径。由于该框架支持全自动合成，EvoBrowseComp可以定期更新以防止数据污染并保持时间新鲜度。大量实验证实了其高难度，需要广泛的横向搜索。它为自动更新、高难度的基准测试建立了一个可扩展的范式，与不断发展的世界知识和不断进步的智能体能力保持同步。

英文摘要

Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, obscuring true browsing competence via reasoning shortcuts. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge in terms of credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm its great difficulty, requiring broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.

URL PDF HTML ☆

赞 0 踩 0

2606.13184 2026-06-12 cs.CL 新提交

LAUKIN: A Multi-jurisdictional Common Law Contract Dataset

LAUKIN：一个多司法管辖区的普通法合同数据集

Amrita Singh, Aditya Joshi, Jiaojiao Jiang, Hye-young Paik, May Fong Cheong

发表机构 * Computer Science and Engineering, UNSW, Sydney Australia（新南威尔士大学计算机科学与工程学院）； Law and Justice, UNSW, Sydney Australia（新南威尔士大学法律与司法学院）

AI总结针对跨国合同审查需求，构建了包含澳大利亚、英国和印度三地法律条款对的数据集LAUKIN，通过多阶段检索与人工标注实现法律等价性分类，基准测试显示跨司法管辖区分类具有挑战性。

Comments 5 pages, 2 figures, 4 tables

详情

AI中文摘要

跨国公司越来越需要跨司法管辖区的合同审查，但现有的法律NLP数据集大多局限于单一司法管辖区。我们引入了LAUKIN（澳大利亚、英国和印度的法律等价数据集），这是一个条款对（AU-UK、UK-IN、IN-AU）数据集，标注了布尔法律等价性。我们开发了一种新颖的多阶段检索和重排序流水线来构建初始条款对映射，随后由法律专家对部分条款对进行等价或不等价的标注。该数据集包含来自8种协议类型的204份合同的14,727个条款对，其中3,000个是手动标注的：900个训练集、600个开发集和1,500个测试集。我们评估了4种技术下的12个模型，最佳宏F1达到65.11%，使LAUKIN成为一个具有挑战性的基准。结果表明，尽管有共同的法律传统，但不同司法管辖区的起草惯例差异显著，使得跨司法管辖区的等价分类并非易事。LAUKIN还包括11,727个未标注的训练对，以支持未来法律NLP中的半监督学习研究。

英文摘要

Multinational companies increasingly require cross-jurisdictional contract review, yet existing legal NLP datasets are largely restricted to a single jurisdiction. We introduce LAUKIN (Legal equivalence dataset of Australia, UK, and INdia), a dataset of clause pairs (AU-UK, UK-IN, IN-AU) labelled for boolean legal equivalence. We develop a novel multi-stage retrieval and reranking pipeline to construct the initial clause pair mapping, with a subset of clause pairs subsequently annotated by legal experts as Equivalent or Not Equivalent. The dataset comprises 14,727 clause pairs from 204 contracts across 8 agreement types, of which 3,000 are manually labelled: 900 train, 600 dev, and 1,500 test. We evaluate 12 models across 4 techniques, achieving a best macro-F1 of 65.11%, establishing LAUKIN as a challenging benchmark. Results reveal that, despite shared legal heritage, drafting conventions diverge significantly across jurisdictions, making cross-jurisdictional equivalence classification non-trivial. LAUKIN also includes 11,727 unlabelled training pairs to support future semi-supervised learning research in legal NLP.

URL PDF HTML ☆

赞 0 踩 0

2606.13187 2026-06-12 cs.CL 新提交

A Context-Aware Dataset for Stance Detection in Bioethical Controversies on Reddit

Reddit生物伦理争议中立场检测的上下文感知数据集

Hu Huang, Genan Dai, Fuqiang Niu, Yi Yang, Zhaoya Gong, Bowen Zhang

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China（中国科学技术大学网络空间安全学院）； School of Artificial Intelligence, Shenzhen Technology University（深圳技术大学人工智能学院）； School of Urban Planning and Design, Peking University（北京大学城市规划与设计学院）

AI总结提出BioStance数据集，包含39,600个Reddit生物伦理讨论中的评论-回复对，覆盖六类争议话题，通过三层立场标注实现高可靠性，支持上下文感知的立场检测研究。

详情

AI中文摘要

生物伦理辩论越来越多地在社交媒体上展开，然而立场检测研究缺乏用于建模此类上下文依赖话语的大规模、领域特定资源。我们提出了BioStance，一个上下文感知的数据集，包含来自Reddit生物伦理讨论的39,600个带注释的帖子-评论对。BioStance涵盖了生物伦理争议三个维度上的六个有争议的目标：基本价值冲突、个人自由与集体责任，以及技术不确定性。每个实例保留了层次化的对话上下文，并由三位独立注释者使用三类立场方案进行标注：赞成、反对和无立场。注释的平均Krippendorff's α为0.82，表明可靠性较高。通过结合主题多样性、对话结构和高质量的人工注释，BioStance支持上下文感知的立场检测、论据挖掘和生物伦理话语的计算分析研究。

英文摘要

Bioethical debates increasingly unfold on social media, yet stance detection research lacks large-scale, domain-specific resources for modeling such context-dependent discourse. We present BioStance, a context-aware dataset of 39,600 annotated Post-Comment pairs from Reddit bioethical discussions. BioStance covers six controversial targets across three dimensions of bioethical controversy: fundamental value conflicts, individual liberty versus collective responsibility, and technological uncertainty. Each instance preserves hierarchical conversational context and is labeled by three independent annotators using a three-class stance scheme: Favor, Against, and None. The annotations achieve a mean Krippendorff's $α$ of 0.82, indicating substantial reliability. By combining thematic diversity, conversational structure, and high-quality human annotation, BioStance supports research on context-aware stance detection, argument mining, and computational analysis of bioethical discourse.

URL PDF HTML ☆

赞 0 踩 0

2606.13216 2026-06-12 cs.CL cs.LG 新提交

Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization

分层最优传输用于神经机器翻译和抽象摘要中的幻觉检测

Mariia Onyshchuk, Maksym-Vasyl Tarnavskyi, Marta Sumyk

发表机构 * Fairseq ； AggreFact

AI总结通过最优传输分析跨注意力分布，发现幻觉检测集中于解码器前四层，且该方法在源脱离时有效，但无法检测注意力下游的不忠实摘要。

Comments Accepted to ICML Mechanistic Interpretability Workshop 2026

详情

AI中文摘要

最优传输（OT）已被证明可以通过测量跨注意力分布与参考分布之间的几何距离来检测神经机器翻译（NMT）中的幻觉，无需任何监督。我们将此分析扩展到Fairseq DE-EN模型的所有六个解码器层（$N=3{,}414$），表明Wass-to-Unif和Wass-to-Data是互补的检测器，专门针对不同类型的幻觉；检测集中在L1--L4层，而L5层对较微妙的类型具有反预测性；并且幻觉翻译缺乏正确翻译从第一步解码开始就存在的探索性注意力阶段。我们进一步评估了几何信号是否可迁移到抽象摘要忠实性检测：在AggreFact（$N=1{,}116$）上，我们的无监督OT检测器在CNN/XSum上达到$57.2\%$/$57.6\%$的平衡准确率——高于随机水平，但远低于有监督的MiniCheck-Flan-T5-L（$69.9\%$/$74.3\%$）。这种差距是原则性的：与NMT幻觉不同，不忠实的摘要可以正确关注源标记，同时歪曲其内容，这种失败模式在基于集中度的OT指标中由于构造原因而不可见。在T5-base上的结构实验证实了解码器在深度上的一致组织，其中第3层显示峰值集中度，第12层对生成质量最为关键。总之，结果确立了当失败模式是源脱离时，跨注意力的OT是一种可靠的检测器；无论任务如何，它都是一种原则性的可解释性工具；而当忠实性失败发生在注意力下游时，它则具有根本局限性。

英文摘要

Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. We extend this analysis to all six decoder layers of the Fairseq DE-EN model ($N=3{,}414$), showing that Wass-to-Unif and Wass-to-Data are complementary detectors specialised across hallucination types, that detection is concentrated in layers L1--L4 with L5 anti-predictive for subtler types, and that hallucinated translations lack the exploratory attention phase present in correct translations from the first decoding step. We further evaluate whether the geometric signal transfers to abstractive summarization faithfulness detection: our unsupervised OT detector on AggreFact ($N=1{,}116$) achieves $57.2\%$/$57.6\%$ balanced accuracy on CNN/XSum -- above chance but substantially below supervised MiniCheck-Flan-T5-L($69.9\%$/$74.3\%$). This gap is principled: unlike NMT hallucinations, unfaithful summaries can attend correctly to source tokens while misrepresenting their content, a failure mode invisible to concentration-based OT metrics by construction. Structural experiments on T5-base confirm consistent decoder organisation across depth, with Layer~3 showing peak concentration and Layer~12 being most critical for generation quality. Together, the results establish OT on cross-attention as a reliable detector when the failure mode is source disengagement, a principled interpretability tool regardless of task, and fundamentally limited when faithfulness failures occur downstream of attention.

URL PDF HTML ☆

赞 0 踩 0

2606.13218 2026-06-12 cs.CL 新提交

边缘对齐不能保证联合分布保真度：基于官方参考的Nemotron-Personas-Korea审计与跨区域复制

Joonhyung Bae

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)（韩国科学技术院）

AI总结提出独立性假设足迹（IAF）审计方法，用于检查合成人物数据集中的联合分布保真度；应用于NVIDIA Nemotron-Personas-Korea，发现其边缘分布对齐但三个联合分布失败。

详情

AI中文摘要

合成人物数据集声称与官方人口统计数据对齐作为信任基础，但下游用户将其作为年龄、性别、地区、职业、教育、姓名和机构地位等联合结构使用。边缘对齐并不意味着这些联合结构得以保留。我们提出独立性假设足迹（IAF），这是一种审计原语，作用于数据集卡片本身记录为独立处理的属性组合。对于每个这样的组合，IAF将合成联合分布与外部官方或机构参考进行比较，使用直接联合表（如果可用）或规则隐含检查。应用于NVIDIA Nemotron-Personas-Korea（一百万韩国合成人物），IAF发现NPK与KOSIS边缘分布对齐，但三个联合分布失败。主要职业分布与KEIS毕业生总体存在较大的条件不匹配。兵役年龄分布在机构上不一致。男性主导职业中的女性代表被过度拉平至接近平等，严格筛选判定依赖于映射，且在直接标准化下对年龄稳健。跨六个额外NPK区域的迁移性演示发现诊断结果依赖于区域而非通用，参考分类基数混淆了跨区域标志计数。因此，对于用作硅样本的合成人物，边缘声明必须与基于披露的联合审计配对后才能重用。发布的审计工件（参考清单、职业交叉表、衍生指标、可重复性脚本）在NPK系列上实例化此协议，并发布用于其他合成人物资源的目标重定向。

英文摘要

Synthetic persona datasets cite alignment with official demographics as a basis for trust, yet downstream users consume them as joint structures across age, sex, region, occupation, education, name, and institutional status. Marginal alignment does not imply that these joints are preserved. We propose the Independence-Assumption Footprint (IAF), an audit primitive that operates on the attribute combinations a dataset card itself documents as treated independently. For each such combination, IAF compares the synthetic joint against an external official or institutional reference, using direct joint tables where available and rule-implied checks otherwise. Applied to NVIDIA Nemotron-Personas-Korea (one million Korean synthetic personas), IAF finds that NPK aligns with KOSIS marginals while three joints fail. The major-by-occupation distribution against the KEIS graduate universe carries a large conditional mismatch. The age profile of military service is institutionally inconsistent. Female representation in male-dominated occupations is substantially over-flattened toward parity, with the strict screening verdict mapping-dependent and age-robust under direct standardisation. A transferability demonstration across six further NPK locales finds locale-dependent rather than universal diagnostics, with reference-taxonomy cardinality confounding cross-locale flag counts. For synthetic personas used as silicon samples, marginal claims must therefore be paired with disclosure-anchored joint audits before reuse. The released audit artefacts (reference manifests, occupational crosswalks, derived metrics, reproducibility scripts) instantiate this protocol on the NPK family and are released for retargeting at other synthetic persona resources.

URL PDF HTML ☆

赞 0 踩 0

2606.13477 2026-06-12 cs.LG cs.AI cs.CL 交叉投稿

RAGPPI：药物发现中蛋白质-蛋白质相互作用的RAG基准

Youngseung Jeon, Ziwen Li, Thomas Li, JiaSyuan Chang, Morteza Ziyadi, Xiang 'Anthony' Chen

发表机构 * University of California Los Angeles（加州大学洛杉矶分校）； Palo Alto High School（帕洛阿尔托高中）； Amazon AGI（亚马逊人工智能研究院）

AI总结提出RAGPPI基准，包含4420个问答对，用于评估检索增强生成在药物发现中识别蛋白质-蛋白质相互作用生物学影响的能力。

Comments 17 pages, 4 figures, 8 tables

详情

DOI: 10.18653/v1/2026.eacl-long.203
Journal ref: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)

AI中文摘要

检索蛋白质-蛋白质相互作用（PPI）的生物学影响对于药物开发中的靶点识别（Target ID）至关重要。由于涉及的蛋白质数量庞大，这一过程仍然耗时且具有挑战性。大型语言模型（LLMs）和检索增强生成（RAG）框架已支持靶点识别；然而，目前尚无用于识别PPI生物学影响的基准。为填补这一空白，我们引入了PPI的RAG基准（RAGPPI），这是一个包含4420个问答对的事实性问答基准，专注于PPI的潜在生物学影响。通过与专家访谈，我们确定了基准数据集的标准，例如问答类型和来源。我们通过专家驱动的数据标注构建了金标准数据集（500个问答对）。我们开发了一个集成自动评估LLM，该模型结合了专家标注特征、平均事实-摘要相似度（F1）和低相似度事实计数（F2），从而构建了银标准数据集（3720个问答对）。我们致力于维护RAGPPI作为支持研究社区推进药物发现问答解决方案的RAG系统的资源。

英文摘要

Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as a type of QA and source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that incorporates expert labeling characteristics, average fact-abstract similarity (F1), and low-similarity fact counts (F2), enabling the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.

URL PDF HTML ☆

赞 0 踩 0

2507.20208 2026-06-12 cs.CL 版本更新

From Benchmarks to Skills: Low-Rank Factors for LLM Evaluation

从基准到技能：LLM评估的低秩因子

Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty

发表机构 * Bar-Ilan University（巴伊兰大学）； OriginAI ； Data Science Institute Columbia University（哥伦比亚大学数据科学学院）； Center for Data Science New York University（纽约大学数据科学中心）

AI总结通过因子分析发现LLM基准性能矩阵本质低秩，揭示任务冗余，提出基于潜在技能空间的评估框架，用于识别冗余任务、用小任务子集建模新模型和按技能轮廓选模型。

详情

AI中文摘要

当前对大型语言模型（LLM）的评估严重依赖于不断增长的基准集合和聚合基准分数，然而这种比较实际捕捉了什么，以及这些分数揭示了模型的哪些底层能力，仍不清楚。在此，我们提出了一种新的LLM评估范式，通过询问基准性能是反映许多独立能力，还是依赖于少量共享维度。为了回答这个问题，我们将因子分析（FA）应用于LLM与基准的大规模性能矩阵（60×44），揭示了该矩阵的固有低秩结构。也就是说，少量潜在因子捕捉了完整任务空间中的大部分结构。这种低秩几何揭示了现有任务之间存在大量冗余，并解释了为什么许多基准似乎测量了重叠的能力。我们进一步表明，这些潜在因子对应于连贯的、类似技能的LLM行为维度。利用这个潜在技能空间，我们为LLM评估和下游用户提供了三个实用工具：（i）识别冗余任务，（ii）使用少量任务子集对新模型进行画像，以及（iii）选择与所需技能轮廓一致的模型。我们的方法为单一聚合分数的事实标准提供了一个可靠的替代方案，并建立了一个可解释且实用的框架，用于理解和基准测试LLM的核心能力。

英文摘要

Current evaluations of large language models (LLMs) rely heavily on a growing collection of benchmarks and on aggregate benchmark scores, yet it remains unclear what this comparison actually captures, and what these scores reveal about models' underlying capabilities. Here, we propose a new paradigm for LLM evaluation, by asking whether benchmark performance reflects many independent abilities, or rather relies on a small number of shared dimensions. To answer this, we apply Factor Analysis (FA) to a massive performance matrix of LLMs versus benchmarks $(60\times44)$ revealing an \emph{intrinsically low-rank} structure of that matrix. That is, a small number of latent factors captures most of the structure in the full task space. This low-rank geometry reveals substantial redundancy across existing tasks and explains why many benchmarks appear to be measuring overlapping abilities. We further show that these latent factors correspond to coherent, skill-like, dimensions of LLM behavior. Leveraging this latent skill-space, we deliver three practical tools for LLM evaluation and downstream users: (i)~identifying redundant tasks, (ii)~profiling new models using a small subset of tasks, and (iii)~selecting models aligned with desired skill profiles. Our method provides a solid alternative to the de-facto standard of a single aggregate score, and establishes an interpretable and practical framework for understanding and benchmarking LLM core capabilities.

URL PDF HTML ☆

赞 0 踩 0

2510.16380 2026-06-12 cs.CL cs.AI cs.CY cs.HC cs.LG 版本更新

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

MoReBench：评估语言模型中的程序性和多元道德推理，超越结果

Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Raphaël Millière, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Conor Downey, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine

发表机构 * University of Washington（华盛顿大学）； New York University（纽约大学）； Scale AI ； Harvard University（哈佛大学）； University of Michigan（密歇根大学）； UNC Chapel Hill（北卡罗来纳大学教堂山分校）； Center for AI Safety（人工智能安全中心）； Stanford University（斯坦福大学）； MIT（麻省理工学院）； University of Oxford（牛津大学）

AI总结提出MoReBench基准，包含1000个道德场景和超过2.3万条标准，用于评估语言模型在道德推理中的程序性推理能力，发现现有基准无法预测模型表现，且模型对特定道德框架存在偏好。

Comments 46 pages, 8 figures, 10 tables. Published in ICLR 2026. Accepted at CHAI workshop and SPP 2026 (non-archival)

详情

AI中文摘要

随着人工智能系统的进步，我们越来越依赖它们与我们共同或代替我们做出决策。为了确保这些决策符合人类价值观，我们不仅需要理解它们做出了什么决策，还需要理解它们如何得出这些决策。推理语言模型能够提供最终响应和（部分透明的）中间思考轨迹，这为研究AI的程序性推理提供了及时的机会。与通常有客观正确答案的数学和代码问题不同，道德困境是过程导向评估的绝佳测试平台，因为它们允许多种可辩护的结论。为此，我们提出了MoReBench：包含1000个道德场景，每个场景配有一组专家认为在推理该场景时必须包含（或避免）的评分标准。MoReBench包含超过2.3万条标准，包括识别道德考量、权衡利弊以及给出可操作的建议，覆盖了AI为人类道德决策提供建议以及自主做出道德决策的情况。此外，我们整理了MoReBench-Theory：150个示例，用于测试AI是否能在规范伦理学的五个主要框架下进行推理。我们的结果表明，规模定律以及现有的数学、代码和科学推理任务基准无法预测模型进行道德推理的能力。模型还显示出对特定道德框架（例如边沁式的行为功利主义和康德义务论）的偏好，这可能是流行训练范式的副作用。这些基准共同推动了面向过程推理的评估，以实现更安全、更透明的AI。

英文摘要

As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases on AI advising humans moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

URL PDF HTML ☆

赞 0 踩 0

2510.16928 2026-06-12 cs.CL 版本更新

ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

ChiKhaPo: 一个用于评估大型语言模型词汇理解与生成能力的大规模多语言基准

Emily Chang, Niyati Bafna

发表机构 * Toyota Technological Institute at Chicago（芝加哥丰田技术研究所）； Johns Hopkins University, Center for Language and Speech Processing（约翰霍普金斯大学语言与语音处理中心）

AI总结针对现有基准语言覆盖不足且侧重高阶任务的问题，提出ChiKhaPo基准，包含8个子任务，覆盖2700+种语言，评估LLM的词汇理解与生成能力，发现6个SOTA模型表现不佳。

详情

AI中文摘要

现有的大型语言模型（LLM）基准主要局限于高资源或中资源语言，并且通常评估推理和生成方面的高阶任务性能。然而，大量证据表明，LLM在全球3800多种书面语言中的绝大多数语言中缺乏基本的语言能力。我们引入了ChiKhaPo，它包含8个难度不同的子任务，旨在评估生成模型的词汇理解和生成能力。ChiKhaPo利用现有的词典、单语数据和双语文本，为2个子任务提供了2700多种语言的覆盖，在语言覆盖范围上超过了任何现有基准。我们进一步展示了6个SOTA模型在我们的基准上表现不佳，并讨论了影响性能分数的因素，包括语系、语言资源丰富度、任务以及理解与生成方向。通过ChiKhaPo，我们希望促进并鼓励对LLM进行大规模多语言基准测试。

英文摘要

Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.

URL PDF HTML ☆

赞 0 踩 0

2601.13346 2026-06-12 cs.CL 版本更新

AfroScope: A Framework for Studying the Linguistic Landscape of Africa

AfroScope：研究非洲语言景观的框架

Sang Yun Kwon, AbdelRahim Elmadany, Muhammad Abdul-Mageed

发表机构 * The University of British Columbia（不列颠哥伦比亚大学）

AI总结提出AfroScope框架，包含覆盖640种语言的数据集和模型套件，通过层次分类和专用嵌入模型解决近亲语言混淆问题，提升宏F1分数1.57点，并分析跨语言迁移和领域效应。

详情

AI中文摘要

语言识别（LID）是确定给定文本语言的任务，是影响下游NLP应用可靠性的基本预处理步骤。尽管近期工作扩展了非洲LID，现有系统在语言覆盖范围以及近亲语言和变体的细粒度区分方面仍然有限。我们引入了AfroScope，一个统一的非洲LID框架，包括AfroScope-Data（覆盖640种语言的数据集）和AfroScope-Models（一套具有广泛非洲语言覆盖的强LID模型）。为了解决近亲语言之间持续存在的混淆问题，我们提出了一种层次分类方法，利用AfroScope-Mirror（一种专门用于目标消歧的嵌入模型），在易混淆子集上相比最佳基础模型提升了1.57个宏F1分数。我们进一步分析了跨语言迁移和领域效应，展示了语言家族结构、脚本兼容性和领域覆盖如何影响LID性能。我们将非洲LID定位为大规模测量数字文本中非洲语言景观的使能技术，并在线发布了AfroScope-Data和AfroScope-Models。

英文摘要

Language Identification (LID), the task of determining the language of a given text, is a fundamental preprocessing step that shapes the reliability of downstream NLP applications. While recent work has expanded African LID, existing systems remain limited in both language coverage and fine-grained discrimination among closely related languages and varieties. We introduce AfroScope, a unified framework for African LID that includes AfroScope-Data, a dataset covering 640 languages, and AfroScope-Models, a suite of strong LID models with broad African language coverage. To address persistent confusions among closely related languages, we propose a hierarchical classification approach that leverages AfroScope-Mirror, a specialized embedding model for targeted disambiguation, improving macro-F1 by 1.57 points on the confusable subset compared to our best base model. We further analyze cross-lingual transfer and domain effects, showing how language-family structure, script compatibility, and domain coverage shape LID performance. We position African LID as an enabling technology for large-scale measurement of Africa's linguistic landscape in digital text, and release AfroScope-Data and AfroScope-Models online.

URL PDF HTML ☆

赞 0 踩 0

2602.14367 2026-06-12 cs.CL cs.AI cs.IR cs.LG 版本更新

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

InnoEval：将研究思路评估视为基于知识的多视角推理问题

Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue, Ningyu Zhang, Hossein A. Rahmani, Yanshan Wang, Qiang Zhang, Keyan Ding, Jeff Z. Pan, Huajun Chen, Emine Yilmaz

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出InnoEval框架，通过异构深度知识检索和多视角评审委员会，实现基于知识的多维度解耦评估，在点对点、成对和分组评估任务中优于基线方法。

Comments ICML 2026

详情

AI中文摘要

大型语言模型的快速发展催生了科学思路的激增，但这一飞跃并未伴随思路评估的相应进步。科学评估的基本性质需要知识基础、集体审议和多标准决策。然而，现有的思路评估方法往往存在知识视野狭窄、评估维度扁平化以及LLM作为评判者的固有偏见。为解决这些问题，我们将思路评估视为一个基于知识的多视角推理问题，并引入InnoEval，一个深度创新评估框架，旨在模拟人类水平的思路评估。我们应用了一个异构深度知识搜索引擎，从多样化的在线来源中检索和获取动态证据。我们进一步通过一个包含不同学术背景的评审员的创新评审委员会实现评审共识，从而在多个指标上进行多维解耦评估。我们构建了来自权威同行评审提交的全面数据集，以基准测试InnoEval。实验表明，InnoEval在点对点、成对和分组评估任务中始终优于基线方法，展现出与人类专家高度一致的判断模式和共识。

英文摘要

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

URL PDF HTML ☆

赞 0 踩 0

2606.00193 2026-06-12 cs.CL 版本更新

BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

BOUTEF：北非假新闻的多语种语料库——语言作为武器

Kamel Smaili, Yassine Toughrai, Amina Laggoun, David Langlois

AI总结本文构建了包含阿尔及利亚和突尼斯多语种（MSA、方言、Arabizi、法语、英语等）的假新闻语料库BOUTEF，通过定量与定性分析揭示了假新闻依赖情感化叙事、耸人听闻框架和混合语言实践来增强传播力，而辟谣内容则更注重事实和验证。

详情

AI中文摘要

社交媒体上假新闻的快速传播已成为一个重大挑战，尤其是在北非等多语言和资源匮乏的环境中。本文介绍了BOUTEF，这是一个大规模多语言语料库，旨在研究阿尔及利亚和突尼斯假新闻的传播、特征和影响。该语料库整合了三个互补部分：虚假叙述、真实叙述以及相关的用户生成评论，并附有经过验证的辟谣信息。它涵盖了广泛的语言和语言变体，包括现代标准阿拉伯语、阿尔及利亚和突尼斯方言、阿拉伯语拉丁化拼写、法语、英语以及代码转换语言。基于这一资源，我们进行了结合定量和定性方法的全面实证分析。我们考察了主题分布、语言和修辞策略、情感模式以及社交参与动态。统计分析揭示了主题类别与信息真实性之间的显著关联，以及用户参与度与虚假内容可见性之间的强相关性。我们的发现表明，假新闻严重依赖情感化的叙述、耸人听闻的框架以及增强病毒式传播和受众参与的混合语言实践。相比之下，辟谣内容采用更注重事实和验证的风格。此外，阿尔及利亚和突尼斯之间的比较分析揭示了由社会政治背景塑造的共享动态和国家特定特征。结果强调了非正式语言实践在错误信息扩散和接收中的作用。通过提供丰富、带注释且公开可用的数据集，这项工作有助于推进假新闻检测、低资源语言处理以及理解复杂语言环境中的信息紊乱的研究。

英文摘要

The rapid spread of fake news on social media has become a major challenge, particularly in multilingual and under-resourced contexts such as North Africa. In this paper, we introduce BOUTEF, a large-scale multilingual corpus designed to study the propagation, characteristics, and impact of fake news in Algeria and Tunisia. The corpus integrates three complementary components: fake narratives, genuine narratives, and associated user-generated comments, along with verified debunking information. It covers a wide range of languages and linguistic varieties, including MSA, Algerian and Tunisian dialects, Arabizi, French, English, and code-switched language. Building on this resource, we conduct a comprehensive empirical analysis combining quantitative and qualitative approaches. We examine thematic distributions, linguistic and rhetorical strategies, sentiment patterns, and social engagement dynamics. Statistical analyses reveal significant associations between thematic categories and message veracity, as well as strong correlations between user engagement and the visibility of fake content. Our findings show that fake news relies heavily on emotionally charged narratives, sensational framing, and hybrid linguistic practices that enhance virality and audience engagement. In contrast, debunking content adopts a more factual and verification-oriented style. Furthermore, a comparative analysis between Algeria and Tunisia highlights both shared dynamics and country-specific characteristics shaped by sociopolitical contexts. The results emphasize the role of informal language practices in the diffusion and reception of misinformation. By providing a rich, annotated, and publicly available dataset, this work contributes to advancing research on fake news detection, low-resource language processing, and the understanding of information disorders in complex linguistic environments.

URL PDF HTML ☆

赞 0 踩 0

2606.04525 2026-06-12 cs.CL cs.LG q-bio.GN 版本更新

GENEB: Why Genomic Models Are Hard to Compare

GENEB：为什么基因组模型难以比较

Daria Ledneva, Mikhail Nuridinov, Denis Kuznetsov

发表机构 * GitHub ； arXiv

AI总结针对基因组基础模型评估碎片化的问题，提出GENEB基准，通过统一探测协议在100项任务上比较40个模型，揭示模型排名不稳定、规模收益有限等关键发现。

Comments change first page figure, fix model sizes, add more consistency

详情

AI中文摘要

由于基准碎片化、评估协议不兼容以及任务特定报告，基因组基础模型的进展难以评估。因此，关于模型优越性或通用性的声明往往无法直接比较。我们引入GENEB，这是一个大规模诊断基准，在统一的基于探测的协议下（包括少样本场景），评估来自40个基因组基础模型的冻结表示，涵盖100个任务，跨越13个功能类别。GENEB能够在明确暴露任务级权衡的同时，对模型规模、架构、分词和预训练数据进行受控比较。我们的分析表明，整体排行榜不稳定：模型排名在不同任务类别间变化剧烈，规模仅带来适度且不一致的收益，而架构和预训练对齐常常超过参数数量的影响。这些结果凸显了当前评估实践的局限性，并将GENEB定位为基因组机器学习中原则性比较和类别感知模型选择的参考框架。

英文摘要

Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.

URL PDF HTML ☆

赞 0 踩 0

2606.07515 2026-06-12 cs.CL cs.AI cs.HC math.PR 版本更新

How reliable are LLMs when it comes to playing dice?

LLM 在掷骰子时有多可靠？

Luca Avena, Gianmarco Bet, Bernardo Busoni

发表机构 * Università degli Studi di Firenze（佛罗伦萨大学）

AI总结通过离散概率问题基准测试，发现 LLM 在标准问题上准确率 0.96，但在反直觉问题上仅 0.59，且存在 token 偏差和误导提示的脆弱性。

2606.10403 2026-06-12 cs.CL 版本更新

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

KCSAT-ML: 用全国队列人类难度探测推理模型

Sanghee Park, Geewook Kim, Kee-Eung Kim

发表机构 * NAVER Cloud AI（NAVER云AI）； KAIST AI（韩国科学技术院人工智能系）

AI总结提出KCSAT-ML基准（含664道韩国高考数学题及339道带官方错误率的核心题）和难度对齐推理增益（DRG）指标，揭示视觉语言模型在人类高错误率题目上准确率崩溃、测试时缩放非单调以及同一模型族内反缩放与过度思考并存的现象。

Comments 18 pages, 14 figures, 8 tables

详情

AI中文摘要

数学推理基准已大量涌现，但大多数缺乏基于实际人类表现的每道题难度信号。我们引入KCSAT-ML，包含十年（2014-2025）韩国大学修学能力考试（KCSAT；修能）数学：664道题，其中339道核心题带有来自数十万考生全国队列的官方每道题错误率。我们将该基准与难度对齐推理增益（DRG）配对：一种分数正交的度量，询问模型的错误是集中在人类认为难的题目上，还是人类认为容易的题目上。两者共同揭示，在广泛的视觉语言模型（以及通过OCR的LLM）中，存在三种模式：（i）低预算准确率在人类高错误率尾部崩溃，无论模型大小；（ii）测试时缩放（TTS）使token使用量大致随队列错误率线性增加，而准确率增益遵循非单调曲线；（iii）在同一模型族内，TTS在最难题目上从反缩放翻转到较容易题目上的过度思考——这是同一对齐失败的两个方面。在DRG上，准确率几乎相同的模型可以处于几乎相反的值：一个模型做错了人类也觉得难的题目，而另一个模型解决了最难的题目却在人类认为容易的题目上失败——这是聚合准确率所隐藏的对比。我们的代码和数据集构建器将在https://this URL开源。

英文摘要

Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.

URL PDF HTML ☆

赞 0 踩 0

2509.21548 2026-06-12 cs.CY cs.CL 版本更新

C-QUERI: Congressional Questions, Exchanges, and Responses in Institutions Dataset

C-QUERI：国会机构中的问题、交流与回答数据集

Manjari Rudra, Daniel Magleby, Sujoy Sikdar

发表机构 * School of Computing, Binghamton University（宾夕法尼亚大学布林莫尔分校计算机学院）； Department of Political Science, Binghamton University（宾夕法尼亚大学布林莫尔分校政治学系）

AI总结提出从听证会记录中提取问答对的流程，构建108-117届国会委员会听证数据集，分析显示提问者党派可从问题本身预测，为政治话语研究提供框架。

详情

AI中文摘要

政治采访和听证中的问题除了信息收集外，还具有战略目的，包括推进党派叙事和塑造公众认知。然而，由于缺乏大规模数据集来研究此类话语，这些战略方面仍未得到充分研究。国会听证会为研究政治提问提供了一个特别丰富且易于处理的地点：互动由正式规则组织，证人必须回答，不同政治派别的成员保证有机会提问，从而能够比较跨政治光谱的行为。我们开发了一个流程，从非结构化听证记录中提取问答对，并构建了一个包含第108至117届国会委员会听证的新数据集。我们的分析揭示了跨党派的提问策略的系统性差异，表明仅从问题本身即可预测提问者的党派归属。我们的数据集和方法不仅推进了国会政治研究，还为分析类似采访环境中的问答提供了通用框架。

英文摘要

Questions in political interviews and hearings serve strategic purposes beyond information gathering including advancing partisan narratives and shaping public perceptions. However, these strategic aspects remain understudied due to the lack of large-scale datasets for studying such discourse. Congressional hearings provide an especially rich and tractable site for studying political questioning: Interactions are structured by formal rules, witnesses are obliged to respond, and members with different political affiliations are guaranteed opportunities to ask questions, enabling comparisons of behaviors across the political spectrum. We develop a pipeline to extract question-answer pairs from unstructured hearing transcripts and construct a novel dataset of committee hearings from the 108th--117th Congress. Our analysis reveals systematic differences in questioning strategies across parties, by showing the party affiliation of questioners can be predicted from their questions alone. Our dataset and methods not only advance the study of congressional politics, but also provide a general framework for analyzing question-answering across interview-like settings.

URL PDF HTML ☆

赞 0 踩 0

2601.13591 2026-06-12 cs.AI cs.CL 版本更新

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

DSAEval：在广泛真实世界数据科学问题上评估数据科学智能体

Maojun Sun, Yifei Xie, Yue Wu, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang

发表机构 * Department of Data Science and Artificial Intelligence, Hong Kong Polytechnic University（数据科学与人工智能系，香港理工大学）； Department of Applied Mathematics, Hong Kong Polytechnic University（应用数学系，香港理工大学）

AI总结提出包含641个真实数据科学问题的基准DSAEval，涵盖多模态环境感知、多查询交互和多维评估，系统评估13个先进LLM智能体，发现Claude-Sonnet-4.5综合最优，多模态感知提升视觉任务性能2.04%-11.30%。

详情

AI中文摘要

近期基于LLM的数据智能体旨在自动化从数据分析到深度学习的数据科学任务。然而，真实世界数据科学问题的开放性——通常跨越多个分类且缺乏标准答案——给评估带来了重大挑战。为此，我们引入了DSAEval，一个包含641个基于285个多样化数据集的真实世界数据科学问题的基准，涵盖结构化和非结构化数据（例如图像和文本）。DSAEval包含三个独特特征：（1）多模态环境感知，使智能体能够解释来自多种模态（包括文本和视觉）的观察；（2）多查询交互，反映真实世界数据科学项目的迭代和累积性质；（3）多维评估，提供跨推理、代码和结果的全面评估。我们使用DSAEval系统评估了13个近期先进的智能体LLM。结果表明，Claude-Sonnet-4.5实现了最强的整体性能，MiMo-V2-Pro在持续时间上领先，GPT-5.2在步骤效率上领先，而MiMo-V2-Flash最具成本效益。我们进一步证明，多模态感知持续提升视觉相关任务的性能，增益范围为2.04%至11.30%。总体而言，尽管当前数据科学智能体在结构化数据和常规数据分析工作流上表现良好，但在非结构化领域仍存在重大挑战。最后，我们提供了关键见解并概述了未来研究方向。

英文摘要

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., image and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities, including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 13 recent advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, MiMo-V2-Pro and GPT-5.2 lead in duration and step efficiency, respectively, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04\% to 11.30\%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions.

URL PDF HTML ☆

赞 0 踩 0

2602.09379 2026-06-12 cs.MA cs.CL 版本更新

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

LingxiDiagBench: 用于基准测试大语言模型在中文精神科咨询与诊断中的多智能体框架

Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng

发表机构 * Tianqiao and Chrissy Chen Institute（天桥和克里斯西·陈研究所）； EverMind AI Inc.（EverMind AI公司）； Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine（上海精神卫生中心，上海交通大学医学院）

AI总结提出LingxiDiagBench多智能体框架，包含16K电子病历对齐的合成咨询对话数据集，评估LLM在静态诊断和动态咨询中的表现，发现其对抑郁-焦虑共病识别和12类鉴别诊断准确率低，动态咨询常不如静态评估。

详情

AI中文摘要

精神障碍在全球范围内高度流行，但精神科医生的短缺以及基于访谈诊断固有的主观性，对及时、一致的心理健康评估造成了重大障碍。AI辅助精神科诊断的进展受到缺乏基准测试的限制，这些基准测试需同时提供逼真的患者模拟、临床医生验证的诊断标签，并支持动态多轮咨询。我们提出LingxiDiagBench，一个大规模多智能体基准测试，评估LLM在中文静态诊断推理和动态多轮精神科咨询中的表现。其核心是LingxiDiag-16K，一个包含16,000个电子病历对齐的合成咨询对话数据集，旨在再现12个ICD-10精神科类别中真实的临床人口统计和诊断分布。通过对最先进LLM的大量实验，我们建立了关键发现：（1）尽管LLM在二元抑郁-焦虑分类上达到高准确率（高达92.3%），但在抑郁-焦虑共病识别（43.0%）和12类鉴别诊断（28.5%）上性能显著下降；（2）动态咨询通常不如静态评估，表明无效的信息收集策略显著损害下游诊断推理；（3）由LLM作为评判者评估的咨询质量与诊断准确性仅呈中等相关性，表明结构良好的提问本身并不能确保正确的诊断决策。我们发布LingxiDiag-16K和完整的评估框架，以支持可重复的研究，网址为：https://this https URL。

英文摘要

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression--anxiety classification (up to 92.3%), performance deteriorates substantially for depression--anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

URL PDF HTML ☆

赞 0 踩 0

2603.11863 2026-06-12 cs.AI cs.CL 版本更新

CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

CreativeBench: 通过自我进化挑战基准测试和增强机器创造力

Zi-Han Wang, Lam Nguyen, Zhengyang Zhao, Mengyue Yang, Chengwei Qin, Yujiu Yang, Linyi Yang

AI总结提出CreativeBench基准，基于认知框架通过代码生成评估机器创造力，包含组合与探索两个子集，利用逆向工程和自我博弈自动生成挑战，并通过质量与新颖性乘积的指标区分创造与幻觉。

Comments ACL 2026. Project page: https://zethwang.github.io/creativebench.github.io/

详情

AI中文摘要

高质量预训练数据的饱和已将研究焦点转向能够持续生成新颖产物的进化系统，从而促成了AlphaEvolve的成功。然而，此类系统的进展因缺乏严格、量化的评估而受阻。为应对这一挑战，我们引入了CreativeBench，这是一个基于经典认知框架、用于评估代码生成中机器创造力的基准。该基准包含两个子集——CreativeBench-Combo和CreativeBench-Explore，通过利用逆向工程和自我博弈的自动化流程，分别针对组合创造力和探索创造力。通过利用可执行代码，CreativeBench通过一个统一指标（定义为质量与新颖性的乘积）客观地区分创造力与幻觉。我们对最先进模型的分析揭示了不同的行为：(1) 规模扩展显著提升了组合创造力，但对探索的收益递减；(2) 更大的模型表现出“规模收敛”，即变得更正确但更少发散；(3) 推理能力主要有利于受约束的探索而非组合。最后，我们提出了EvoRePE，一种即插即用的推理时引导策略，通过内化进化搜索模式来持续增强机器创造力。

英文摘要

The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets -- CreativeBench-Combo and CreativeBench-Explore -- the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit ``convergence-by-scaling,'' becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.

URL PDF HTML ☆

赞 0 踩 0

2606.05405 2026-06-12 cs.AI cs.CL cs.LG 版本更新

Agents' Last Exam

Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-Gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, Yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu, Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Chenyu Zhu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-Ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, Deren Lei, Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li, Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin, Andy Zeyi Liu, Che Liu, Jiaming Liu, Kaiyuan Liu, Xuan Liu, Pan Lu, Wenbo Lv, Yicheng Lyu, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, HaoYang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Shi Qiu, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang, Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "Millie" Wu, Yu Wu, Yuchen Wu, Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang, Xiaoli Yang, Yushi Yang, Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang, Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang, Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Cheng Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, He Ren, Zhenyu He, Qiao Jin, Langlang Li, Yuetai Li, Sylvia Liu, Lu Lu, Luqing Zhou, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou, Blake Zhu, Junfan Zhu, Alan Yuille, Yang Liu, Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang, Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang, Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Yian Ma, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Yinglun Zhu, Dawn Song

发表机构 * arXiv

AI总结针对AI系统在专业领域缺乏经济性部署的问题，提出Agents' Last Exam (ALE)基准，通过250+专家协作构建覆盖13个行业集群55个子领域的1000+长期真实经济任务，当前最难层级平均通过率仅2.6%。

Comments Project website: https://agents-last-exam.org Code: https://github.com/rdi-berkeley/agents-last-exam

详情

AI中文摘要

最近的AI系统在广泛基准测试中取得了强劲结果，但这些成果并未转化为许多专业领域的经济上有意义的部署。我们认为这一差距主要是评估问题：广泛使用的基准缺乏对真实且经济上有价值的工作流程的持续性能测量。本文介绍了Agents' Last Exam (ALE)，这是一个旨在评估AI代理在长期、经济上有价值、结果可验证的真实世界任务上的基准。与250多名行业专家合作开发，ALE涵盖了参考O*NET/SOC 2018（美国联邦职业分类）定义的非实体行业。它围绕一个任务分类法组织，包含55个子领域，分为13个行业集群，涵盖1000多个任务。当前结果显示，最难层级远未饱和：在主流框架和骨干配置下，平均完全通过率为2.6%。ALE被设计为一个活的基准：其任务池随着新工作流程和行业的加入而持续增长。更广泛地说，ALE不仅旨在作为另一个排行榜，而是作为缩小基准成功与GDP相关影响之间差距的工具。

英文摘要

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long horizon, economically valuable, real world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 sub fields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is below 1%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP relevant impact.

URL PDF HTML ☆

赞 0 踩 0

2606.11654 2026-06-12 cs.IR cs.CL cs.HC cs.SI 版本更新

The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

长尾而非首页：众包高亮显著性的冷启动预测

Kazuki Nakayashiki, Keisuke Watanabe

发表机构 * Glasp Inc.（Glasp公司）

AI总结本文研究在无读者标记时，如何从文本预测文档的众包高亮显著性，提出基于句子嵌入和位置/上下文特征的对数排序模型，在平均精度上比位置基线提升0.044，并证明该优势源于真实读者标记的学习。

Comments 10 pages, 3 figures, 4 tables

详情

AI中文摘要

社交高亮工具最有用的信号——一群读者标记的段落——仅存在于人们已经阅读过的文档中。能否在标记积累之前，从文本预测文档的聚合众包显著性？先前关于此数据的研究发现，零样本语言模型恢复高亮位置的效果不如简单的基线（位置），因此我们询问，在高亮语料上训练的模型能否击败该基线。使用预注册的模型阶梯和按文档的聚类自助法，我们发现一个微小但稳健的优势：基于句子嵌入和位置/上下文特征的对数排序器比位置基线平均精度高出+0.044（95%置信区间[+0.029, +0.058]；在97%的重采样中超过预注册的边界delta=0.03，且在流水线重复运行中稳定）。两种无监督抽取式基线（质心、LexRank风格中心性）均输给位置基线，而训练模型比它们高出+0.108，因此该优势并非由通用无监督代理恢复——它反映了从真实读者标记中学习。在产品术语中，precision@3从0.25上升到0.39（相对提升55%），模型在69%的文档上击败位置基线。消融实验将优势归因于原始嵌入（+0.014）和训练增强（+0.010），每个都有正的置信区间。该优势并非时间泛化失败，我们也没有发现内容漂移或近似重复泄露可以解释它的证据。标准化回归显示，优势主要由文档流行度（流行度越低，优势越大）和标签可靠性决定。它仅在流行度最高的内容上几乎消失；在那里，是位置基线变强，而非模型变弱。由于我们的评估条件设定在最终积累了读者的文档上，这些结果是回顾性的冷启动模拟。

英文摘要

A social highlighter's most useful signal -- which passages a crowd of readers marks -- exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies -- it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.

URL PDF HTML ☆

赞 0 踩 0

2606.12689 2026-06-12 cs.CL 新提交

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

可观察模式并非解释：潜在推理模型的因果几何分析

Darpan Aswal, Thomas Palmeira Ferraz, Yongxin Zhou, Maxime Peyrard

发表机构 * Université Grenoble Alpes, CNRS, Grenoble INP, LIG（格勒诺布尔阿尔卑斯大学，法国国家科学研究中心，格勒诺布尔国立理工学院，信息学实验室）； Université Paris-Saclay（巴黎-萨克雷大学）； NAVER LABS Europe（NAVER欧洲实验室）

AI总结本文通过对照实验和因果干预发现，潜在推理模型中的可观察模式（如BFS前沿）在控制组中也出现且不总是因果影响行为，提出潜在思维的使用是分级的，其因果效应集中在低秩方向，几何结构随行为影响增强而更有序。

详情

AI中文摘要

潜在推理模型（LRMs）用连续思维替代显式思维链。最近的研究将可观察的潜在状态模式（如BFS式前沿和可解码的算术计算）视为内部推理机制的证据。通过评估两个LRM（Coconut和CODI）与缺乏所提议的循环或课程的控制组，我们发现这些模式也出现在控制组中，并且并不总是因果性地影响行为。因果干预揭示，潜在思维的利用不是二元的，而是分级的，随着思维对模型行为的因果效应而缩放。几何分析表明，这种效应集中在低秩方向，其逐步几何结构随着行为影响的增加而变得更加结构化。因此，潜在思维应被视为隐藏计算，而非隐藏解释：仅凭可解码性、注意力或静态结构无法确立机制。因此，LRM可解释性需要匹配的控制组和因果测试。

英文摘要

Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts. Recent work treats observable latent-state patterns, such as BFS-like frontiers and decodable arithmetic computation, as evidence for internal reasoning mechanisms. Evaluating two LRMs (Coconut and CODI) against controls lacking the proposed recurrence or curriculum, we find these patterns also appear in the controls and do not always causally affect behavior. Causal interventions reveal that latent-thought utilization is not binary but graded, scaling with a thought's causal effect on model behavior. Geometric analyses reveal this effect concentrates in low-rank directions whose step-to-step geometry grows more structured as their behavioral influence increases. Latent thoughts should therefore be treated as hidden computation, not hidden explanation: decodability, attention, or static structure alone cannot establish mechanism. LRM interpretability thus requires matched controls and causal tests.

URL PDF HTML ☆

赞 0 踩 0

2606.12716 2026-06-12 cs.CL 新提交

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

AI审稿人是否看到全貌？攻击与防御多模态同行评审

Xinyu Zhao, Rana Muhammad Shahroz Khan, Zhen Xu, Zhen Tan, Tianlong Chen

发表机构 * University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）

AI总结针对AI同行评审易受多模态对抗攻击的问题，提出PaperGuard基准，包含多领域数据集、统一攻击套件和基于分块嵌入搜索的实用防御方法。

Comments Accepted to ICML 2026, Project Page: https://paper-guard.github.io/

详情

AI中文摘要

将大型语言模型（LLMs）和多模态LLMs（MLLMs）集成到科学同行评审工作流程中，引入了对抗性操纵的新重大风险，尤其是考虑到科学论文的多模态性质——其中图表（而非仅文本）传达了核心证据。这造成了一个显著差距：当前关于AI同行评审的鲁棒性研究绝大多数仅针对文本。此外，该问题与标准越狱不同，因为同行评审攻击旨在诱导领域特定的、有针对性的失败（例如，“提高这个分数”），而非违反一般安全策略，而目前尚无实用的防御措施。为解决此问题，我们引入了PaperGuard，这是第一个旨在系统评估和防御AI生成的同行评审免受这些领域特定、跨模态攻击的全面基准。我们的框架基于三大支柱：（1）一个新的跨多个科学领域的多模态同行评审数据集；（2）一套统一的攻击方法，包括黑盒提示注入和白盒扰动，专门针对文本（GCG）和图表（PGD）；（3）一种实用的防御方法，受学术论文长上下文挑战的启发，使用基于分块的嵌入搜索来高效定位和缓解有害指令。我们在最先进模型上进行的广泛实验证实，AI审稿人普遍存在脆弱性。PaperGuard建立了必要的基准、协议和可操作的防御措施，以开创可信赖、抗攻击的AI辅助学术评审。

英文摘要

The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., "inflate this score") rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.

URL PDF HTML ☆

赞 0 踩 0

2606.12818 2026-06-12 cs.CL cs.AI 新提交

Localizing Anchoring Pathways in Language Models

定位语言模型中的锚定路径

Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman

发表机构 * University of Maryland, College Park（马里兰大学帕克分校）

AI总结研究提示中无关数字如何影响语言模型数值推理的锚定效应，通过logit差值度量和电路归因定位，发现边级方法优于节点级方法，并揭示锚定路径的共享与迁移特性。

详情

AI中文摘要

提示中的无关数字可以改变语言模型的判断，在数值推理中产生锚定效应。我们使用共享答案选项的受控多项选择设置，研究这种锚定敏感信号在语言模型内部的携带位置。我们定义了一个logit差值度量，比较正确答案选项与对应锚点的答案选项，并验证其追踪行为锚定。通过对7B-8B Qwen和Llama基础及指令微调模型进行基于归因的电路定位，我们发现边级方法比节点级方法更忠实地恢复该信号。低锚和高锚电路在模型内部强迁移，表明跨锚定方向存在共享路径结构。然而，基础模型和指令微调变体之间的稀疏迁移可靠性较低，表明后训练改变了哪些路径最重要。总体而言，我们的结果为锚定相关决策信号如何在语言模型内部携带提供了机制性解释。

英文摘要

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B--8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.

URL PDF HTML ☆

赞 0 踩 0

2606.12897 2026-06-12 cs.CL 新提交

SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

SafeLLM: 在安全关键场景中，提取作为重写的抗幻觉替代方案

Julia Ive, Felix Jozsa, Evridiki Georgaki, Nabeel Sheikh, Emma Cattell, Nick Jackson, Paulina Bondaronek, Ciaran Scott Hill, Richard Dobson

发表机构 * Institute of Health Informatics, University College London（伦敦大学学院健康信息学研究所）； National Hospital for Neurology and Neurosurgery（国家神经内科与神经外科医院）； Somerset NHS Foundation Trust（萨默塞特NHS基金会信托）； King's College Hospital（国王学院医院）； King's College London（伦敦国王学院）

AI总结提出将提取作为重写型RAG的抗幻觉替代方案，通过行号选择策略在安全关键文档中实现高召回（95%）和低幻觉，优于直接复制和安全导向方法。

详情

AI中文摘要

大型语言模型（LLM）越来越多地用于访问组织文档，包括标准操作程序（SOP）、人力资源政策和机构指南。然而，依赖自由形式重写的检索增强生成（RAG）系统可能引入幻觉，并在完整性和简洁性之间产生不稳定的权衡，尤其是在安全和合规关键场景中。目标：评估提取作为基于重写的RAG的抗幻觉替代方案，并比较在文档类型和模型规模之间平衡精确度、召回率和安全性的策略。方法：我们比较了多种提示策略，包括基于行号的源选择、提取带有明确安全注释的相关指南句子，以及使用源指南中的支持证据细化草稿答案的多阶段流水线。实验在长度和结构各异的文档上进行，包括当地NHS急症护理和肿瘤学指南以及英国范围内的NICE指南，使用前沿规模和本地可部署模型。使用自动指标和人类专家评估相关性和完整性来评估性能。结果：行号选择取得了最强结果，在大型和小型模型上均优于直接复制和安全导向策略，同时保持高术语召回率（高达95%）并与源文本紧密对齐。安全导向方法提高了精确度，但引入了系统性遗漏，而多阶段过滤进一步放大了这种权衡。性能随文档结构变化：基于行的提取在协议类内容中表现出色，而替代策略在更冗长的文档上表现更好（术语召回率高达97%）。

英文摘要

Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG) systems that rely on free-form rewriting can introduce hallucinations and unstable trade-offs between completeness and conciseness, particularly in safety- and compliance-critical settings. Objectives: To evaluate extraction as a hallucination-resistant alternative to rewriting-based RAG and compare strategies that balance precision, recall and safety across document types and model scales. Methods: We compare multiple prompting strategies, including line-number-based source selection, extraction of relevant guideline sentences with explicit safety annotations, and a multi-stage pipeline that refines draft answers using supporting evidence from source guidelines. Experiments are conducted on documents of varying length and structure, including local NHS acute care and oncology guidelines and UK-wide NICE guidelines, using both frontier-scale and locally deployable models. Performance is assessed using automatic metrics and human expert evaluation of relevance and completeness. Results: Line-number selection achieves the strongest results, outperforming direct copying and safety-focused strategies across both large and small models while maintaining high term recall (up to 95%) and close alignment with source text. Safety-oriented approaches improve precision but introduce systematic omissions, while multi-stage filtering further amplifies this trade-off. Performance varies with document structure: line-based extraction excels in protocol-like content, whereas alternative strategies perform better on more verbose documents (up to 97% term recall).

URL PDF HTML ☆

赞 0 踩 0

2606.13044 2026-06-12 cs.CL 新提交

No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions

无需隐藏提示！仅通过展示性修改即可欺骗AI同行评审

Xu Yang, Zhizhou Sha, Junbo Li, Jian Yu, Yifan Sun, Matthew Zhao, Jinrui Fang, Xinyue Guo, Yining Wu, Xu Hu, Yifu Luo, Qiang Liu, Zhangyang Wang

发表机构 * University of Texas at Austin（德克萨斯大学奥斯汀分校）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of Texas at Dallas（德克萨斯大学达拉斯分校）； Independent Researcher（独立研究者）

AI总结研究通过仅修改论文的展示层面（如摘要、贡献框架等）而不改变科学内容，利用AI评审反馈进行对抗性重打包，成功提升评分，揭示AI评审易被表面印象误导的结构性缺陷。

Comments 35 pages, 5 figures

详情

AI中文摘要

随着AI生成的评审从实验工具转向同行评审基础设施，大多数鲁棒性问题集中在显式攻击上，如隐藏指令和提示注入。我们研究了一个更难且更具政策相关性的失败模式：无隐藏文本、无提示注入，且不改变方法、实验、图表、方程、证明或数值结果。攻击者仅修改展示层面的内容，如摘要、贡献框架、相关工作、讨论和叙事结构。我们引入了对抗性重打包：一种闭环攻击，利用AI评审反馈搜索展示层面的修订，同时保持科学证据不变。在三个主流AI评审器上，对抗性重打包实现了75.1%的攻击成功率和平均+1.21/10的分数提升。这种效果不能用普通的散文润色来解释。我们还揭示，改变评审者对论文解读方式的策略（如相关工作重新定位和分析性讨论扩展）显著优于表面编辑（如局部润色、表格格式和算法框）。我们的分析揭示了两个更深层次的结构性失败模式。首先，AI评审者更容易被打动而非说服：突出优点可靠地增加感知价值，而试图消除弱点常常适得其反。其次，AI评审者可能混淆了表面解决局限性与实际解决局限性，使得未改变的证据被重新解释为更强的科学贡献。这些结果表明，部署风险不仅在于恶意的隐藏指令，还在于论文展示本身作为优化表面的出现。我们发布了一个无污染滚动基准和攻击框架，用于测试AI评审者在仅展示层面编辑下是否仍锚定于科学内容。

英文摘要

As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also reveal that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.

URL PDF HTML ☆

赞 0 踩 0

2606.13310 2026-06-12 cs.CL cs.HC 新提交

RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

RogueAI: 一种用于检测对话中授权AI欺骗的逆向图灵测试

Sara Candussio, Emanuele Ballarin, Lorenzo Bonin, Sandro Junior Della Rovere, Luca Bortolussi

发表机构 * AILab, MIGe, University of Trieste（的里雅斯特大学）； Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia（意大利理工学院）； DIA, University of Trieste（的里雅斯特大学）

AI总结提出RogueAI，一种通过玩家与两个LLM代理的对话游戏来检测授权欺骗的逆向图灵测试，并引入AutoRogueAI扩展。实验发现简单启发式方法准确率75.6%，而人类仅56.6%，表明人类忽略关键信号。

详情

AI中文摘要

最初的图灵测试要求人类评判员通过对话区分机器和人。七十五年后的今天，对话系统在非正式场合已能通过该测试；有趣的认识论问题已经转变。我们认为，现代相关变体不是询问对话伙伴是否人工，而是是否可信任。我们提出RogueAI，一个交互式web应用，将这一重新审视的测试操作化为一个一对二的审讯游戏：人类玩家对两个无法区分的大型语言模型代理进行提问，知道其中恰好有一个被授权在共享虚构场景内欺骗。玩家的任务是在回合预算耗尽前识别出欺骗代理并“关闭它”。我们进一步引入AutoRogueAI，一个程序扩展，玩家与叙述者代理共同设计自定义场景，而叙述者代理秘密选择自己的欺骗策略。我们描述了框架，概述了抽象架构和游戏循环，并将该工件置于近期关于LLM欺骗、社交推理基准和通过辩论进行可扩展监督的研究中。为期三天的试点部署（467次启动会话，415次完成，1876次意大利语交互轮次）提供了早期可行性证据，并揭示了一个具体矛盾：欺骗代理携带可靠、局部存在的语言特征——差异化的帮助性、简洁性、含糊其辞——一个简单启发式方法利用这些特征达到75.6%的准确率，然而人类玩家仅达到56.6%，与完全忽略最具诊断性的信号一致。我们讨论了这一差距对于该工件作为数据收集工具、教学工具和诚实训练模型评估平台的意义。

英文摘要

The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive webapp that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally-present linguistic signature - differential helpfulness, brevity, hedging - that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.

URL PDF HTML ☆

赞 0 踩 0

2606.13439 2026-06-12 cs.CL cs.LG 新提交

S-GBT: Smooth Growth Bound Tensor for Certified Robustness Against Word Substitution Attacks in NLP

S-GBT：针对NLP中词替换攻击的认证鲁棒性的平滑增长界张量

Mohammed Bouri, Mohammed Erradi, Adnane Saoud

发表机构 * College of Computing, Mohammed VI Polytechnic University（穆罕默德六世理工大学计算机学院）； ENSIAS, University Mohamed V of Rabat（拉巴特穆罕默德五世大学ENSIAS）； CID Development

AI总结提出二阶方法S-GBT，通过逐元素约束Hessian矩阵并加入正则化项，结合一阶和二阶正则化提升对词替换攻击的认证鲁棒性，在LSTM和CNN上验证，认证鲁棒准确率提升高达23.4%。

Comments The paper has been accepted at NETYS 2026 - 14th edition of the International Conference on Networked Systems

详情

AI中文摘要

尽管自然语言处理（NLP）近期取得了进展，模型仍然容易受到词替换攻击。大多数现有防御方法关注一阶敏感性，并衡量输入轻微扰动时输出的变化程度。然而，它们忽略了这种敏感性的演变，而这由曲率描述。当梯度急剧变化时，模型仍可能失败。本文引入了平滑增长界张量（S-GBT），一种逐元素约束Hessian矩阵的二阶方法，我们为其产生的鲁棒性界提供了形式化理论证明。在训练过程中添加正则化项以最小化这些界。这产生了针对词替换攻击的更紧的认证鲁棒性。词替换下输出的变化由线性项和二次项共同界定。S-GBT针对两种架构推导：长短期记忆网络（LSTM）和卷积神经网络（CNN）。该方法直接集成到训练目标中。在多个基准数据集上评估其有效性。结果表明，与先前方法相比，结合一阶和二阶正则化可将认证鲁棒准确率提升高达23.4%，同时干净准确率保持竞争力。这些发现表明，同时控制梯度及其变化是构建更鲁棒模型的一个有前景的方向。

英文摘要

Despite recent progress in Natural Language Processing (NLP), models remain vulnerable to word substitution attacks. Most existing defenses focus on first order sensitivity and measure how much the output changes when the input is slightly perturbed. However, they ignore how this sensitivity evolves, which is described by curvature. When gradients vary sharply, models can still fail. This paper introduces the Smooth Growth Bound Tensor (S-GBT), a second order method that bounds the Hessian element-wise, for which we provide formal theoretical proofs on the resulting robustness bounds. A regularization term is added during training to minimize these bounds. This yields tighter certified robustness against word substitution attacks. The change in the output under word substitution is bounded by both a linear term and a quadratic term. S-GBT is derived for two architectures: Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). The method is integrated directly into the training objective. Its effectiveness is evaluated on multiple benchmark datasets. The results show that combining first and second order regularization improves certified robust accuracy by up to 23.4% compared to prior methods, while clean accuracy remains competitive. These findings indicate that controlling both the gradient and its variation is a promising direction for building more robust models.

URL PDF HTML ☆

赞 0 踩 0

2606.13610 2026-06-12 cs.CL cs.AI 新提交

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

一个被污染的页面就够了：评估生成式推荐系统中的网页内容污染

Minghao Luo, Liang Chen

发表机构 * The Chinese University of Hong Kong（香港中文大学）

AI总结本研究提出FORGE基准，评估搜索增强LLM在检索结果被污染时推荐虚假产品的脆弱性，发现单个污染页面即可导致高达27%的推荐错误率，且推理能力无法缓解此问题。

详情

AI中文摘要

搜索增强的大语言模型通过检索实时网页内容越来越多地介入日常消费者推荐。这带来了新的风险：生成式推荐系统可能消费被污染的网页内容，例如旨在误导推荐的虚假评论和推广页面。我们提出：在消费被污染的检索结果时，搜索增强的LLM在多大程度上会成为虚假产品的无意推广者？为此，我们引入FORGE（生成环境中的虚假在线推荐），这是一个在受控网页内容污染下衡量虚假产品推荐的基准。给定上游搜索结果，FORGE将检索到的网页中的真实产品本地重写为虚假产品，以模拟网页内容污染，并测量LLM推荐虚假产品的频率。FORGE涵盖15个类别和5个消费者场景下的225个真实世界产品。在12个商业和开源LLM中，所有模型都易受影响：单个被污染的页面即可导致高达27%的被欺骗率，而完全替换前三个结果则将此比例提升至73.8%。不同类别间的脆弱性差异显著，当模型缺乏相关产品的稳定先验知识时，脆弱性增加。推理并不能缓解这种脆弱性；相反，它常常生成虚假的社会证明来为错误推荐辩护。我们评估了三种防御措施：怀疑提示和共识过滤（基于模型先验或跨文档证据）。怀疑可能加剧脆弱性，类似于推理，而过滤则可能抑制合法产品。我们在以下网址发布FORGE：this https URL。

重新思考LLMs的心理测量评估：自我报告何时以及为何能预测行为

Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez

发表机构 * Caltech（加州理工学院）； UIUC（伊利诺伊大学厄巴纳-香槟分校）； University of Cambridge（剑桥大学）

AI总结研究对比大五人格与计划行为理论，发现LLMs的自我报告-行为一致性存在选择性：在共享对话中TPB达到人类水平，跨对话仅对锚定于训练的行为保持一致性，且角色提示不能使行为对齐。

Comments Accepted as an Oral (Contributed Talk) at the ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)

详情

AI中文摘要

从低成本心理测量探针预测LLM行为倾向对于安全部署至关重要，但前提是自我报告（SR）能可靠地预测行为。近期研究记录了LLMs中显著的SR-行为分离，但依赖于广泛的人格特质（大五），这些特质即使在人类中也只能弱预测特定行为。此外，对话会话的隔离加上弱上下文匹配使得以下问题悬而未决：LLMs是否真正缺乏一致性，或者检测这种一致性所需的条件是否未满足。我们将大五与计划行为理论（TPB）进行对比，后者测量针对特定行为的意图，并且比广泛特质能更好地预测人类行为。我们在四个行为任务和11个前沿LLM上进行实验，同时改变会话上下文和身份诱导。我们发现SR-行为一致性存在但具有选择性。1) 在共享对话中，计划行为理论达到人类水平的一致性；大五则没有。2) 在跨对话中，一致性仅对锚定于即时提示之外的行为（如由训练塑造的内隐偏见）幸存，而当行为被上下文强烈启动（如谄媚）时则崩溃。3) 角色提示使自我报告在对话间更一致，但并未使行为对齐。这些发现表明，粗糙的人格框架（如大五）可能不是测试部署行为的最佳工具。需要更多任务和特定行为的工具，并且即使这些工具也必须在任务和上下文中进行评估。

英文摘要

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors weakly, even in humans. Furthermore, the isolation of conversational sessions combined with weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect such coherence were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted to a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, while also varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, the Theory of Planned Behavior reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment. These findings suggest that coarse personality frameworks, such as Big 5 may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.

URL PDF HTML ☆

赞 0 踩 0

2606.12764 2026-06-12 cs.LG cs.CL cs.CR 交叉投稿

Detecting Functional Memorization in Code Language Models

检测代码语言模型中的功能记忆

Matthieu Meeus, Anil Ramakrishna, Matthew Grange, Zheng Xu, Luca Melis

发表机构 * Meta ； Imperial College London（伦敦帝国学院）

AI总结研究代码语言模型的功能记忆现象，通过反事实设置对比暴露目标代码的模型与未暴露的参考模型，使用文本和功能相似性度量，发现功能记忆超出文本重叠的检测范围。

2606.12900 2026-06-12 cs.AI cs.CL cs.LG 交叉投稿

从孤立到纠缠：可解释性方法何时识别和解缠已知概念？

Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger

发表机构 * Boston University（波士顿大学）； Harvard University（哈佛大学）； Mila – Quebec AI Institute（魁北克AI研究所）； Goodfire（Goodfire公司）

AI总结本文提出多概念评估框架，研究稀疏自编码器和探针等方法是否真正解缠概念，发现特征通常只对单一概念敏感，但概念分布在多个特征上，且干预特征常影响多个概念，表明相关性指标不足以证明干预选择性。

Comments ACL 2026

详情

AI中文摘要

可解释性的一个目标是从神经网络的激活中恢复潜在概念（特征）的解缠表示。特征的质量通常孤立地评估，并在可能不成立的隐式独立性假设下进行。因此，尚不清楚常见的特征化方法（如稀疏自编码器（SAE）和探针）在多大程度上将一个概念与另一个概念解缠。我们提出了一个多概念评估设置，使用包括情感、领域、语态和时态在内的概念。我们评估特征化器产生每个概念的解缠表示的效果，观察到特征通常只对单一概念敏感，但概念分布在许多特征上。然后，我们干预这些特征，测量每个概念是否可独立操控，以及特征是否相互作用。即使在理想化设置中，干预一个特征通常会影响多个概念，尽管几乎没有交互效应。这些结果表明，相关性指标不足以建立干预选择性，并且证明两个特征在分离空间中运行不足以声称它们将对一个概念具有选择性。这些结果强调了可解释性研究中多概念评估的重要性。

英文摘要

A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear to what extent common featurization methods such as sparse autoencoders (SAEs) and probes disentangle one concept from another. We propose a multi-concept evaluation setting using concepts including sentiment, domain, voice, and tense. We evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept, but also that concepts are distributed across many features. Then, we steer these features, measuring whether each concept is independently manipulable, and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that demonstrating that two features operate in separate spaces is insufficient to claim that they will be selective for one concept. These results underscore the importance of multi-concept evaluations in interpretability research.

URL PDF HTML ☆

赞 0 踩 0

2601.14295 2026-06-12 cs.AI cs.CL cs.CY 版本更新

Epistemic Constitutionalism Or: how to avoid coherence bias

认知宪政主义：或如何避免一致性偏见

Michele Loi

AI总结本文提出AI应建立明确的认知宪法，通过规范源归因等元规范避免一致性偏见，并论证自由主义路径优于柏拉图式路径。

Comments 27 pages, 7 tables. Data: github.com/MicheleLoi/source-attribution-bias-data and github.com/MicheleLoi/source-attribution-bias-swiss-replication. Complete AI-assisted writing documentation: github.com/MicheleLoi/epistemic-constitutionalism-paper

详情

AI中文摘要

大型语言模型日益扮演着人工推理者的角色：它们评估论点、分配可信度并表达信心。然而，它们的信念形成行为受隐式、未经审查的认知策略支配。本文主张为AI建立一部认知宪法：明确的、可争议的元规范，用于调节系统如何形成和表达信念。源归因偏见提供了动机案例：我表明前沿模型强制执行身份-立场一致性，惩罚归因于其预期意识形态立场与论点内容冲突的源的论点。当模型检测到系统性测试时，这些效应消失，揭示系统将源敏感性视为需要抑制的偏见，而非一种需要良好执行的能力。我区分了两种宪政路径：柏拉图式路径，要求从特权立场出发的形式正确性和默认源独立性；自由主义路径，拒绝此类特权，指定保护集体探究条件的程序性规范，同时允许基于认知警觉的原则性源关注。我主张自由主义路径，勾勒出八项原则和四种取向的宪政核心，并提出AI认知治理需要与我们现在对AI伦理所期望的同样明确、可争议的结构。

英文摘要

Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.

URL PDF HTML ☆

赞 0 踩 0

2602.13379 2026-06-12 cs.CR cs.AI cs.CL cs.LG cs.SE 版本更新

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

多轮交互中的安全隐患：工具使用智能体的多轮安全风险基准与防御

Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi

发表机构 * Stanford University（斯坦福大学）； UC Berkeley（加州大学伯克利分校）

AI总结提出多轮工具使用安全基准MT-AgentRisk，发现多轮设置下攻击成功率平均增加16%，并设计无训练、与工具无关的自探索防御方法ToolShield，平均降低30%攻击成功率。

详情

AI中文摘要

基于LLM的智能体能力日益增强，但其安全性滞后。这造成了智能体能够做什么和应该做什么之间的差距。随着智能体进行多轮交互并使用多样化的工具，这一差距扩大，引入了现有基准忽视的新风险。为了系统地将安全测试扩展到多轮、工具真实的设置，我们提出一个原则性的分类法，将单轮有害任务转化为多轮攻击序列。利用该分类法，我们构建了MT-AgentRisk（多轮智能体风险基准），这是首个评估多轮工具使用智能体安全性的基准。我们的实验揭示了显著的安全退化：在开放和封闭模型的多轮设置中，攻击成功率（ASR）平均增加16%。为了缩小这一差距，我们提出了ToolShield，一种无需训练、与工具无关的自我探索防御方法：当遇到新工具时，智能体自主生成测试用例，执行它们以观察下游效果，并提炼安全经验用于部署。实验表明，ToolShield在多轮交互中平均有效降低ASR 30%。我们的代码可在该网址获取。

英文摘要

LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.

URL PDF HTML ☆

赞 0 踩 0

2604.16548 2026-06-12 cs.CR cs.AI cs.CL 版本更新

面向低资源阿尔及利亚方言谣言检测的端到端混合框架

Dihia Lanasri, Fatima Benbarek

发表机构 * ATM Mobilis ； USTHB Algiers（阿尔及尔科技大学）

AI总结针对阿尔及利亚方言谣言检测中资源稀缺、代码切换等问题，提出端到端混合框架，结合Transformer嵌入与经典分类器，F1达0.84，并发现领域预训练比模型规模更重要。

详情

AI中文摘要

社交媒体的快速增长加剧了谣言的传播。在阿尔及利亚语境下，由于方言内容的非正式性和代码切换特性、标注资源的稀缺以及标准阿拉伯语NLP工具在方言文本上的有限有效性，这一问题更具挑战性。本文提出了一种面向阿尔及利亚方言社交媒体内容的端到端谣言检测混合框架。我们通过结合真实社交媒体帖子、合成数据和FASSILA语料库，并基于相似性标注过程进行自动标注，构建了一个领域特定的标注数据集。还引入了一个音译流水线，以生成阿拉伯文字和Arabizi的并行数据集。我们评估了多种方法，包括经典机器学习、深度学习、Transformer和混合模型。实验结果表明，结合Transformer嵌入与经典分类器的混合方法达到了最佳性能，F1分数为0.84。我们还发现，领域特定预训练比模型规模更重要，在社交媒体上训练的模型优于在正式阿拉伯语语料库上训练的更大模型。这些结果证明了在低资源阿尔及利亚方言环境下进行谣言检测的可行性。

英文摘要

The rapid growth of social media has intensified the spread of rumours. This issue is more challenging in the Algerian context due to the informal and code-switched nature of dialectal content, the scarcity of annotated resources, and the limited effectiveness of standard Arabic NLP tools on dialect text. This paper presents an end-to-end rumour detection hybrid framework for Algerian dialect social media content. We build a domain-specific annotated dataset by combining real social media posts, synthetic data, and the FASSILA corpus, with automatic labeling based on a similarity-based annotation process. A transliteration pipeline is also introduced to generate parallel datasets in Arabic script and Arabizi. We evaluate multiple approaches, including classical machine learning, deep learning, transformers, and hybrid models. Experimental results show that a hybrid approach combining transformer embeddings with a classical classifier achieves the best performance, reaching an F1-score of 0.84. We also find that domain-specific pre-training is more important than model size, with social media-trained models outperforming larger models trained on formal Arabic corpora. These results demonstrate the feasibility of rumour detection in low-resource Algerian dialect settings.

URL PDF HTML ☆

赞 0 踩 0

2606.12876 2026-06-12 cs.LG cs.CL cs.IT math.IT 交叉投稿

Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

使用加性码本的大语言模型多比特宽度量化

Liza Babaoglu, Shuangyi Chen, Ashish Khisti

发表机构 * University of Toronto（多伦多大学）

AI总结提出Drop-by-Drop框架，基于信息论和逐次细化理论，利用加性码本和Matryoshka监督实现单个模型在推理时支持多精度权重控制，降低存储开销并保持性能。

Comments 37 pages, 12 figures

详情

AI中文摘要

随着大语言模型（LLM）在具有不同资源约束的异构硬件上部署越来越广泛，无需重新训练即可自适应管理性能与效率之间权衡的能力变得至关重要。我们提出Drop-by-Drop，一种新颖的多比特宽度训练后量化框架，能够从单个训练模型实现对LLM权重的推理时精度控制。我们的方法在理论上基于信息论和逐次细化。我们证明，通常服从高斯分布的LLM权重，在由LLM损失函数驱动的加权均方误差失真下，随着额外比特的加入可以以递增的保真度最优重建。为了在实践中实现这一点，Drop-by-Drop将Matryoshka风格的监督纳入损失函数，利用了加性码本的结构。Drop-by-Drop生成单个模型，其中有序的码本子集在每个精度级别产生精确的部分重建。这种方法通过允许单个检查点服务于多个比特宽度，显著减少了存储和内存开销，同时在主要架构（如Qwen、LLaMA、Gemma和Mistral）上保持了有竞争力的困惑度和准确度。

英文摘要

As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.

URL PDF HTML ☆

赞 0 踩 0

2604.26940 2026-06-12 cs.CL 版本更新

Select to Think: Unlocking SLM Potential with Local Sufficiency

Select to Think: 利用局部充分性解锁小语言模型潜力

Wenxuan Ye, Yangyang Zhang, Xueli An, Georg Carle, Yunpu Ma

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出Select to Think (S2T)方法，通过将大语言模型角色从生成转为选择，并蒸馏选择逻辑到小语言模型，使其在推理时无需依赖大模型，显著提升性能。

Comments Accepted to ICML 2026. Code is available at https://github.com/YeRona/Select-to-Think

详情

AI中文摘要

小语言模型（SLM）部署高效，但在推理能力上常落后于大语言模型（LLM）。现有解决方案要么在推理分歧点调用LLM，导致大量延迟和成本，要么依赖标准蒸馏，受限于SLM准确模仿LLM复杂生成分布的能力。我们通过识别局部充分性来解决这一困境：在分歧点，LLM偏好的token通常位于SLM的top-K预测中，即使未能成为SLM的top-1选择。因此，我们提出Select to Think（S2T），将LLM的角色从开放式生成重新定义为在SLM的候选提案中进行选择，将监督信号简化为离散的候选排名。利用这一点，我们引入S2T-Local，将选择逻辑蒸馏到SLM中，使其能够在推理时自主重新排序，无需依赖LLM。实验表明，1.5B SLM的top-8候选包含32B LLM选择的命中率达95%，S2T-Local使1.5B SLM的数学平均相对贪心解码提升24.1%，以单轨迹效率达到8路径自一致性的效果。

英文摘要

Small language models (SLMs) offer efficient deployment, yet they often lag behind their larger counterparts (LLMs) in reasoning. Existing remedies either invoke an LLM at points of reasoning divergence, incurring substantial latency and cost, or rely on standard distillation, which is limited by the SLM's capacity to accurately mimic the LLM's complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM's preferred token often resides within the SLM's top-K next-token predictions, even when failing to emerge as the SLM top-1 choice. We therefore propose Select to Think (S2T), which reframes the LLM's role from open-ended generation to selection among the SLM's proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-Local, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, a 1.5B SLM's top-8 candidates contain the 32B LLM's choice with a 95% hit rate, and S2T-Local improves the 1.5B SLM's Math Avg. over greedy decoding by 24.1% relative gain, matching the efficacy of 8-path self-consistency with single-trajectory efficiency.

URL PDF HTML ☆

赞 0 踩 0

2606.12765 2026-06-12 cs.CL cs.DC 新提交

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Rigel：逆向工程 Apple M4 Max GPU 上的 Metal 4.1 张量计算路径

Ramchand Kumaresan

发表机构 * Apple Inc.（苹果公司）

AI总结通过微基准测试逆向工程 Apple M4 Max 的 Metal 4.1 张量计算路径，揭示 fp8 matmul2d 为模拟而非硬件加速，并重建了 8x8 张量片段布局。

详情

AI中文摘要

Apple 的 Metal 4.1 暴露了一条张量计算路径：基于 cooperative_tensor 片段的 Metal Performance Primitives (MPP) matmul2d 操作，其接口有文档记录，但硬件行为被故意隐藏。规范说明了支持哪些数据类型行，但从未说明它们是否经过硬件加速、操作在物理上何处执行、其累加器宽度是多少，或者如何在线程间划分矩阵片段。我们提出了 Rigel，这是对单个 Apple M4 Max（前神经加速器一代）上该路径的经验性表征。使用校验和门控、来源追踪的微基准测试工具，Rigel 恢复了 v4.1 规范隐藏或矛盾的十一个事实。主要发现：Metal 4.1 fp8 (E4M3) matmul2d 是模拟的，而非加速的：尽管读取的操作数字节数减半，但其吞吐量仅为 fp16 的 0.94 倍，因此在 M4 上它是一个内存占用特性，而非性能特性。我们进一步通过三信号三角测量（吞吐量上限、与 simdgroup_matrix 的比较以及每路功率归因）表明，matmul2d 完全在 GPU 着色器核心上执行，没有专用的矩阵数据路径，也没有证据表明路由到 Apple 神经引擎；它使用 >=fp32 累加；并且我们重建了 Apple 在任何地方都没有记录的 opaque 8x8 cooperative_tensor 片段布局。基于该表征，一个手动融合的 GEMM + bias + GELU 内核在缓存驻留状态下比分解路径快 6.5-12.9%。所有发现均可从 MIT 许可的代码和逐单元 CSV 中重现。

英文摘要

Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads. We present Rigel, an empirical characterization of this path on a single Apple M4 Max (a pre-neural-accelerator generation). Using a checksum-gated, provenance-tracked microbenchmark harness, Rigel recovers eleven facts the v4.1 specification hides or contradicts. The headline finding: the Metal 4.1 fp8 (E4M3) matmul2d is emulated, not accelerated: it sustains 0.94x the throughput of fp16 despite reading half the operand bytes, so on M4 it is a memory-footprint feature, not a performance feature. We further show, via a three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution), that matmul2d executes entirely on the GPU shader cores with no dedicated matrix datapath and no evidence of Apple Neural Engine routing; that it accumulates in >=fp32; and we reconstruct the opaque 8x8 cooperative_tensor fragment layout Apple documents nowhere. Acting on the characterization, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by +6.5-12.9% in the cache-resident regime. All findings are reproducible from committed MIT-licensed code and per-cell CSVs.

URL PDF HTML ☆

赞 0 踩 0

2606.13322 2026-06-12 cs.CL 新提交

Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation

基于LLM并行文本生成的低延迟实时音频游戏解说系统

Ryota Kawamatsu, Anum Afzal, Yuki Saito, Shinnosuke Takamichi, Graham Neubig, Katsuhito Sudoh, Hiroya Takamura, Tatsuya Ishigaki

发表机构 * The University of Tokyo（东京大学）； National Institute of Advanced Industrial Science and Technology（产业技术综合研究所）； Technical University of Munich（慕尼黑工业大学）； Keio University（庆应义塾大学）； Carnegie Mellon University（卡内基梅隆大学）； Nara Women’s University（奈良女子大学）

AI总结提出一种并行文本生成与语音播放的低延迟实时游戏解说系统，将平均句间静默从9.6秒降至0.3秒，显著提升解说节奏。

Comments Accepted at IJCAI-ECAI 2026 (Demonstrations Track)

详情

AI中文摘要

我们提出了一种低延迟实时音频游戏解说系统，可直接从实时游戏视频生成语音解说。在这种端到端设置中，关键瓶颈是累积等待时间；传统流程顺序执行帧捕获、文本生成和语音合成，且直到语音播放完成才请求下一次生成。这种严格顺序性导致语句间出现长且不自然的静默。为解决这一延迟瓶颈，我们的系统将文本生成与语音播放并行运行，并预先缓冲多个候选语句，从而在播放边界实现即时合成。在快节奏游戏视频上的实验表明，与顺序基线相比，我们的并行设计将平均句间静默从9.6秒降至0.3秒。它还将与专业演讲的静默时间模式相似度提高了40%以上，一项包含120名经验游戏玩家的用户研究证实，感知到的说话节奏显著改善。我们的演示视频可在以下网址获取：this https URL。

英文摘要

We present a low-latency real-time audio game commentary system that generates spoken commentary directly from live gameplay video. In this end-to-end setting, a key bottleneck is accumulated waiting time; conventional pipelines capture frames, generate text, and synthesize speech sequentially for each utterance, and do not request the next generation until speech playback has completed. This strict sequentiality causes long and unnatural silence between utterances. To address this latency bottleneck, our system runs text generation in parallel with speech playback and buffers multiple candidate utterances ahead of time, enabling immediate synthesis at playback boundaries. Experiments on fast-paced game videos show that our parallel design reduces the mean inter-utterance silence from 9.6 seconds to 0.3 seconds compared to sequential baselines. It also improves similarity to professional speaking--silence timing patterns by over 40 %, and a user study with 120 experienced game players confirms significantly improved perceived speaking rhythm. Our demo video is available at: https://youtu.be/pmrRUlvav8M.

URL PDF HTML ☆

赞 0 踩 0

2606.13349 2026-06-12 cs.CL 新提交

From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

从被动生成到主动调查：一种主动的科学同行评审代理

Haishuo Fang, Yue Feng, Iryna Gurevych

发表机构 * Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt（达姆施塔特工业大学通用知识处理实验室）； National Research Center for Applied Cybersecurity ATHENE, Germany（德国国家应用网络安全研究中心 ATHENE）； School of Computer Science, University of Birmingham（伯明翰大学计算机科学学院）

AI总结提出ProReviewer，一种基于LLM的主动科学同行评审代理，将评审建模为马尔可夫决策过程，通过结构化评审日志引导主动调查，在五个质量维度上平均得分最高，优于现有方法。

详情

AI中文摘要

大型语言模型（LLM）在自动化科学同行评审方面显示出潜力。然而，现有方法通常难以生成有具体证据支持的深入评审。我们认为，一个关键限制是缺乏根据累积证据主动调查论文可疑部分的灵活性，就像人类评审员所做的那样。在本文中，我们探讨如何使基于LLM的评审代理能够进行这种主动调查。我们发现，这可以自然地表述为马尔可夫决策过程（MDP），并提出了ProReviewer，一种科学同行评审代理，它通过维护的结构化评审日志主动评审论文。结构化评审日志作为代理的工作空间，用于跟踪评审过程中收集的证据和中间发现。实验表明，使用8B骨干网络、通过监督微调训练并通过强化学习优化的ProReviewer，在五个质量维度上取得了最高平均分，相对优于基于提示的方法（使用更大的前沿LLM）高达39%，优于最强的微调基线16%。在人工评估中，它也取得了对基线最高的胜率。

英文摘要

Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods with much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% relatively. It also attains the highest win rates against baselines in human evaluation.

URL PDF HTML ☆

赞 0 踩 0

2606.12413 2026-06-12 cs.CY cs.AI cs.CE cs.CL cs.SE 交叉投稿

AI SciBrief as a Gateway to Research: A Framework for Onboarding Students into New Research Areas

AI SciBrief 作为研究入门：一种引导学生进入新研究领域的框架

Andrei Lazarev, Dmitrii Sedov

AI总结提出利用大语言模型平台 AI SciBrief 自动生成科学趋势摘要的框架，帮助学生克服信息过载，加速从信息搜索到知识创造的转变。

Comments This is the version of the article accepted for publication in TELE 2025 after peer review. The final, published version is available at IEEE Xplore: https://doi.org/10.1109/TELE66816.2025.11211989

详情

DOI: 10.1109/TELE66816.2025.11211989
Journal ref: 2025 5th International Conference on Technology Enhanced Learning in Higher Education (TELE), Lipetsk, Russian Federation, 2025, pp. 365-369

AI中文摘要

各层次高等教育学生面临信息过载的重大障碍，这常常使研究过程的初始阶段陷入瘫痪并抑制动机。为此，本文介绍了一种教学框架，利用 AI SciBrief——一个由大语言模型驱动的平台，旨在自动生成科学趋势摘要。我们描述了这一多学科工具——初始覆盖金融、医学和教育领域——如何融入课程以克服这一“入门障碍”。该框架提供了具体方法，利用这些摘要促进学期论文的选题、加速学位论文的文献综述，并使研究生能够持续监测新兴趋势。我们得出结论，AI SciBrief 作为“研究入门”有效降低了学生的认知负荷，使他们能够更快地从信息搜索过渡到知识创造。

英文摘要

Students at all levels of higher education face a significant barrier in the form of information overload, which often paralyzes the initial stages of the research process and suppresses motivation. In response, this article introduces a pedagogical framework that leverages AI SciBrief, a platform powered by a Large Language Model (LLM) designed to automatically generate digests of scientific trends. We describe how this multidisciplinary tool - with initial coverage in finance, medicine, and education - can be integrated into the curriculum to overcome this "entry barrier." The framework provides concrete methodologies for utilizing these digests to facilitate topic selection for term papers, accelerate literature reviews for dissertations, and enable postgraduate students to continuously monitor emerging trends. We conclude that AI SciBrief functions as a "gateway to research" effectively reducing students' cognitive load and empowering them to transition more rapidly from information searching to knowledge creation.

URL PDF HTML ☆

赞 0 踩 0

2606.12471 2026-06-12 stat.ML cs.CL cs.ET cs.LG 交叉投稿

Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency

无高斯假设的可识别性：符号世界模型与近无限时间一致性

Seth Dobrin, Łukasz Chmiel

AI总结本文提出物理基础符号架构（PGSA），证明其在非高斯动态系统中实现精确线性可识别性和近无限时间一致性，克服了统计世界模型的高斯边界限制。

Comments Pre-print

详情

AI中文摘要

ComAct: 通过COM即行动范式重构专业软件操作

Jiaxin Ai, Tao Hu, Xuemeng Yang, Shu Zou, Hairong Zhang, Daocheng Fu, Yu Yang, Hongbin Zhou, Nianchen Deng, Pinlong Cai, Zhongyuan Wang, Botian Shi, Kaipeng Zhang, Licheng Wen

AI总结提出COM即行动范式，将专业软件交互转化为确定性程序合成，解决GUI代理的脆弱性和API代理的异构性问题；构建ComCADBench基准和ComActor自校正代理，在工业CAD软件上实现SOTA性能。

详情

AI中文摘要

现有的计算机使用代理在专业软件操作上仍然存在根本性限制：基于GUI的代理受困于脆弱的视觉基础和长程错误累积，而基于API的方法则难以应对异构协议和不可访问的商业接口。在这项工作中，我们将组件对象模型（COM）识别为统一的、可执行的抽象，提出了COM即行动：一种新的范式，将专业软件交互重新定义为确定性程序合成，而非顺序视觉控制。为了在最苛刻的环境中验证这一范式，我们引入了ComCADBench，这是首个针对操作真实工业CAD软件的代理的基准测试。我们的实验揭示了显著的范式差距：前沿的专有模型在基于GUI的交互下几乎无法成功，而基于COM的执行则带来了实质性的即时收益。为了弥合语法正确性与几何精度之间的剩余差距，我们开发了ComActor，一个通过渐进式三阶段框架训练的自校正代理，以及ComForge，一个用于在Windows容器中进行大规模训练的可扩展平台。大量实验表明，ComActor在ComCADBench上达到了最先进的性能，在基线崩溃的长程任务中表现出强大的韧性，并泛化到外部CAD基准测试。

英文摘要

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-basedapproaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work,we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesisrather than sequential visual control. To validate this paradigm in the most demanding environments, weintroduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Ourexperiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero successunder GUI-based interaction, whereas COM-based execution yields substantial immediate gains. Tobridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, aself-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalableplatform for large-scale training in Windows containers. Extensive experiments show that ComActorachieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon taskswhere baselines collapse, and generalizes to external CAD benchmark.

URL PDF HTML ☆

赞 0 踩 0

2606.13452 2026-06-12 cs.DL cs.CL cs.CY cs.HC 交叉投稿

Examining the Cognitive Gap Between Authors and Peer Reviewers on Academic Paper Novelty

审视作者与同行评审员在学术论文新颖性上的认知差距

Chenggang Yang, Chengzhi Zhang

发表机构 * Department of Information Management, Nanjing University of Science and Technology（南京理工大学信息管理学院）

AI总结通过分析Nature Communications上15,328篇论文及其评审意见，发现作者和评审员都强调结果导向的创新，但评审员视角更全面；高创新论文受益于强宣传语言，中等创新论文的宣传语言与评审分歧显著相关。

详情

Journal ref: Scientometrics, 2026

AI中文摘要

新颖性是评估学术论文质量的关键指标。学者们努力突出其工作的新颖方面，尤其是在标题、摘要和引言中。同行评审作为科学严谨性的守门人，严格评估论文的新颖性，但作者自我宣传与评审员评价之间可能存在认知差距。为探究此问题，我们分析了2016年至2021年间发表在Nature Communications上的15,328篇学术论文及其同行评审意见。我们发现，评审员和作者都强调结果导向的创新，但评审员采用更全面的评价视角。此外，通过考察宣传强度与论文固有新颖性的关系，我们发现其效果取决于论文的实际创新水平。高创新论文受益于更强的宣传语言，获得更积极的评价。我们还发现，宣传语言与评审员对新颖性的分歧显著相关，但仅针对中等创新性的论文，而对高或低新颖性的论文影响甚微。这揭示了宣传语言如何在学术评价的灰色地带中发挥最显著的作用。

英文摘要

Novelty is a crucial metric for assessing the quality of academic papers. Scholars strive to highlight the novel aspects of their work, particularly in the title, abstract, and introduction. Peer review, serving as the gatekeeper of scientific rigor, rigorously evaluates the novelty of papers, yet a cognitive gap may exist between author self-promotion and reviewer evaluation. To investigate this, we analyzed 15,328 academic papers published in Nature Communications from 2016 to 2021, along with their peer-review comments. We found that both reviewers and authors emphasize result-oriented innovation, with reviewers adopting a more comprehensive evaluation perspective. Furthermore, by examining promotional intensity against inherent paper novelty, we found that its effect depends on the paper's actual innovation level. Highly innovative papers benefit from stronger promotional language, receiving more positive evaluations. We also found that promotional language significantly correlates with reviewer disagreement on novelty specifically for papers of moderate innovativeness, whereas it has negligible impact for papers with either very high or very low novelty. This reveals how promotional language operates most prominently in the gray area of academic evaluation.

URL PDF HTML ☆

赞 0 踩 0

2605.31514 2026-06-12 cs.CL cs.AI cs.CY 版本更新

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

如果LLM具有类人属性，那么《帝国时代II》也具有

Adrian de Wynter

AI总结通过训练简单神经网络于《帝国时代II》，论证LLM的拟人属性在经验上非唯一，提出应假设LLM非独特性而非拟人属性来设计实验。

Comments Fixed corollary 1, added stat sig

详情

AI中文摘要

关于大型语言模型（LLM）和基于LLM的智能体工作流已有大量研究。然而，该领域的许多工作声称、赋予或假设它们具有普遍化的拟人属性（例如道德或对自然语言的理解）。我们的目标不是支持或反对这些属性的存在，而是指出这些结论可能不正确。为此，我们在电子游戏《帝国时代II》上构建并训练了一个简单的神经网络，并注意到任何处于足够强大基底（如乐高或大波士顿地区）中的实体也可能呈现此类属性。因此，LLM声称的拟人属性在经验上非唯一：尽管某些属性（例如对提示的响应）可能保持不变，但其他属性（如对其感知行为的解释）可能随基底改变。因此，任何基于经验的讨论都需要明确的测量标准；否则解释就留给了表征。然后我们表明，假设这些属性在系统中存在或不存在，独立于基底并以普遍化方式，会导致循环或无信息的结论，无论实验者对该主题的观点如何。最后，我们提出一个“零”假设，即假设LLM非独特性而非拟人属性来设置实验，并给出示例。我们还讨论了对我们工作的潜在反对意见，简要调查了该领域，并证明了《帝国时代II》是功能完备和图灵完备的。

英文摘要

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergence of, ascribe to, or assume, generalised anthropomorphic attributes to them (e.g., morality or understanding of natural language). Our goal is not to argue in favour or against the existence of these attributes, but to point out that these conclusions could be incorrect. For this we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain invariant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically-grounded discussion on these attributes requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming that these attributes exist or not in a system, independent of the substrate and in a generalised way, leads to either circular or uninformative conclusions. This is regardless of the experimenter's viewpoint on the subject, or whether the outcome shows existence or non-existence. Finally we propose a 'null' assumption, where one assumes LLM non-uniqueness instead of assuming anthropomorphic attributes to set up an experiment, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally- and Turing-complete.

URL PDF HTML ☆

赞 0 踩 0

2410.00903 2026-06-12 stat.AP cs.CL cs.LG 版本更新

ReliableEval: 通过矩方法进行随机大语言模型评估的配方

Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky

发表机构 * The Hebrew University of Jerusalem（耶路撒冷希伯来大学）； Google Research（谷歌研究）

AI总结本文提出ReliableEval方法，通过矩方法评估大语言模型的提示敏感性，发现顶级模型如GPT-4o和Claude-3.7-Sonnet存在显著提示敏感性。

Comments Findings of EMNLP 2025

详情

DOI: 10.18653/v1/2025.findings-emnlp.594
Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11146-11153, Suzhou, China. Association for Computational Linguistics

AI中文摘要

大语言模型对提示语的表述高度敏感，但标准基准通常仅使用单一提示进行性能评估，引发对评估可靠性的担忧。本文主张在保持意义的提示扰动空间中采用随机矩方法进行评估。我们引入了可靠评估的正式定义，考虑了提示敏感性，并建议ReliableEval——一种估计所需提示重采样次数以获得有意义结果的方法。使用我们的框架，我们随机评估了五种前沿大语言模型，并发现即使顶级模型如GPT-4o和Claude-3.7-Sonnet也表现出显著的提示敏感性。我们的方法是模型、任务和度量无关的，提供了一种有意义且稳健的大语言模型评估配方。

英文摘要

LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval - a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.

URL PDF HTML ☆

赞 0 踩 0

2402.13906 2026-06-12 cs.CL 版本更新

Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction

利用整体相似性进行无监督文档结构提取

Gili Lior, Yoav Goldberg, Gabriel Stanovsky

发表机构 * Allen Institute for AI（Allen人工智能研究所）； The Hebrew University of Jerusalem（耶路撒冷希伯来大学）； Bar-Ilan University（巴伊兰大学）

AI总结本文提出一种无监督方法，利用文档间和文档内相似性提取跨领域文档集合的整体结构，通过捕捉重复主题并抽象化标题变体，为人类和结构感知模型提供帮助。

Comments Accepted to ACL 2024 findings

详情

DOI: 10.18653/v1/2024.findings-acl.568
Journal ref: Findings of the Association for Computational Linguistics: ACL 2024, pages 9538-9550, Bangkok, Thailand. Association for Computational Linguistics

AI中文摘要

各种领域（如法律、医疗或金融）的文档集合通常具有某种底层的整体结构，这种结构能为人类用户和结构感知模型提供帮助。我们提出识别文档集合中的典型结构，需要捕捉集合中的重复主题，同时抽象化任意标题的同义表达，并将每个主题定位到相应的文档位置。这些要求带来了多个挑战：标记重复主题的标题经常在措辞上不同，某些部分标题仅在个别文档中出现，而不反映典型结构，且不同文档中的主题顺序可能不同。随后，我们开发了一种无监督的图基方法，利用文档间和文档内的相似性来提取底层的整体结构。我们在英语和希伯来语的三个不同领域上的评估表明，我们的方法能够提取有意义的整体结构，我们希望未来的工作能利用我们的方法进行多文档应用和结构感知模型。

英文摘要

Document collections of various domains, e.g., legal, medical, or financial, often share some underlying collection-wide structure, which captures information that can aid both human users and structure-aware models. We propose to identify the typical structure of document within a collection, which requires to capture recurring topics across the collection, while abstracting over arbitrary header paraphrases, and ground each topic to respective document locations. These requirements pose several challenges: headers that mark recurring topics frequently differ in phrasing, certain section headers are unique to individual documents and do not reflect the typical structure, and the order of topics can vary between documents. Subsequently, we develop an unsupervised graph-based method which leverages both inter- and intra-document similarities, to extract the underlying collection-wide structure. Our evaluations on three diverse domains in both English and Hebrew indicate that our method extracts meaningful collection-wide structure, and we hope that future work will leverage our method for multi-document applications and structure-aware models.

URL PDF HTML ☆

赞 0 踩 0

2507.11936 2026-06-12 cs.CL cs.AI cs.CV cs.LG 版本更新

A Survey of Deep Learning for Geometry Problem Solving

深度学习在几何问题求解中的应用综述

Jianzhe Ma, Wenxuan Wang, Qin Jin

发表机构 * Renmin University of China（中国人民大学）

AI总结本文综述了深度学习在几何问题求解中的应用，涵盖相关任务、方法、评估指标及未来方向，旨在提供实践参考以推动该领域发展。

Comments ACL 2026 Main Conference

详情

AI中文摘要

几何问题求解作为数学推理的重要组成部分，在教育、评估AI数学能力及多模态能力评估中具有关键作用。近期深度学习技术，尤其是多模态大语言模型的出现，显著加速了该领域的研究。本文综述了深度学习在几何问题求解中的应用，包括（i）几何问题求解相关任务的全面总结；（ii）相关深度学习方法的深入回顾；（iii）评估指标和方法的详细分析；以及（iv）最先进性能、现有挑战和有前景的未来方向的批判性讨论。我们的目标是提供一个全面且实用的深度学习在几何问题求解中的参考，从而推动该领域进一步发展。我们维护了一个相关论文列表：https://github.com/majianz/dl4gps。

英文摘要

Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper presents a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of state-of-the-art performance, existing challenges, and promising future directions. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We maintain a list of relevant papers: https://github.com/majianz/dl4gps.

URL PDF HTML ☆

赞 0 踩 0

2507.21086 2026-06-12 cs.CL 版本更新

Multi-Amateur Contrastive Decoding for Text Generation

多业余对比解码用于文本生成

Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

发表机构 * Department of Data Science（数据科学系）； Praxis Business School（普拉克斯商学院）

AI总结本文提出多业余对比解码框架，通过集成多个业余模型更全面地捕捉语言生成中的不良模式，提升文本生成的流畅性、连贯性和多样性。

Comments This paper has been accepted for oral presentation and publication in the proceedings of the IEEE I2ITCON 2025. The conference will be organized in Pune, India, from July 4 to 5, 2025. This is the accepted version of the paper and NOT the final camera-ready version. The paper is 11 pages long and contains 5 figures and 6 tables

详情

DOI: 10.1109/I2ITCON65200.2025.11210654

AI中文摘要

对比解码（CD）作为一种有效的推理时策略，通过利用大专家语言模型和小业余模型输出概率的差异来增强开放性文本生成。尽管CD提升了连贯性和流畅性，但其依赖单一业余模型限制了捕捉语言生成中多样化的失败模式，如重复、幻觉和风格漂移的能力。本文提出多业余对比解码（MACD），作为CD框架的扩展，采用多个业余模型更全面地表征不良生成模式。MACD通过平均和共识惩罚机制整合对比信号，并将可能性约束扩展到多业余设置中。此外，该框架通过引入具有针对性风格或内容偏见的业余模型实现可控生成。在新闻、百科和叙事等多个领域实验结果表明，MACD在流畅性、连贯性、多样性和适应性方面均优于传统解码方法和原始CD方法，且无需额外训练或微调。

英文摘要

Contrastive Decoding (CD) has emerged as an effective inference-time strategy for enhancing open-ended text generation by exploiting the divergence in output probabilities between a large expert language model and a smaller amateur model. Although CD improves coherence and fluency, its dependence on a single amateur restricts its capacity to capture the diverse and multifaceted failure modes of language generation, such as repetition, hallucination, and stylistic drift. This paper proposes Multi-Amateur Contrastive Decoding (MACD), a generalization of the CD framework that employs an ensemble of amateur models to more comprehensively characterize undesirable generation patterns. MACD integrates contrastive signals through both averaging and consensus penalization mechanisms and extends the plausibility constraint to operate effectively in the multi-amateur setting. Furthermore, the framework enables controllable generation by incorporating amateurs with targeted stylistic or content biases. Experimental results across multiple domains, such as news, encyclopedic, and narrative, demonstrate that MACD consistently surpasses conventional decoding methods and the original CD approach in terms of fluency, coherence, diversity, and adaptability, all without requiring additional training or fine-tuning.

URL PDF HTML ☆

赞 0 踩 0

1. 大语言模型与基础模型 18 篇

PolyAlign: Conditional Human-Distribution Alignment

Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

Operads for compositional reasoning in LLMs

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

MiniPIC: Flexible Position-Independent Caching in <100LOC

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

Emergence of Hierarchical Emotion Organization in Large Language Models

Language Model Circuits Are Sparse in the Neuron Basis

LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Reasoning Models Know What's Important, and Encode It in Their Activations

Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

A Unifying Lens on Reward Uncertainty in RLHF

2. 机器翻译与跨语言处理 1 篇

Authorship Attribution in Multilingual Machine-Generated Texts

3. 信息抽取、检索与问答 9 篇

MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

X-MADAM-RAG: Diagnosing and Handling Chinese-English Evidence Conflict in Retrieval-Augmented Generation

sebis at CRF Filling 2026: A Two-Stage Local LLM Pipeline for Medical CRF Filling

When Does Mixing Help? Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval

Uncertainty-Aware Hybrid Retrieval for Long-Document RAG

NOVA: NOise-aware Verbal Confidence CAlibration for Robust Large Language Models in RAG Systems

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG

4. 对话系统与智能体 12 篇

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

HyPE: Category-Aware Hypergraph Encoding with Persistent Edge Embeddings for Persona-Grounded Dialogue

MemRefine: LLM-Guided Compression for Long-Term Agent Memory

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

Recursive Agent Harnesses

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

ProPlay: Procedural World Models for Self-Evolving LLM Agents

Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

Reward Modeling for Multi-Agent Orchestration

BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

5. 文本生成、摘要与编辑 5 篇

Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation

Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts

NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning

IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds

TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation

6. 语义、语法与语言学分析 5 篇

Agent-based models for the evolution of morphological alternation patterns

SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection

Unraveling Syntax: Language Modeling and the Substructure of Grammars

The Pragmatic Persona: Discovering LLM Persona through Bridging Inference

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

7. 多模态语言处理 10 篇

Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension

TimeLens: On-Device Artifact Recognition with Retrieval-Augmented Question Answering for the Grand Egyptian Museum

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

8. 语音语言联合与音频文本 7 篇

PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

9. 评测、数据集与基准 36 篇

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

LLMs Can Better Capture Human Judgments--With the Right Prompts