arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.05165 2026-06-04 cs.LG cs.CL version update

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Rishit Dagli, Abir Harrasse, Luke Zhang, Florent Draye, Amirali Abdullah, Bernhard Schölkopf, Zhijing Jin

Affiliations Jinesis AI Lab, University of Toronto & Vector Institute; Max Planck Institute for Intelligent Systems, Tübingen, Germany; Thoughtworks; Martian; ELLIS Institute, Tübingen, Germany; EuroSafeAI

AI summary Proposes the STRIDE framework, which casts training data attribution as a sparse recovery problem in the compressive sensing sense and uses lightweight "steering operators" in activation space to mimic the influence of data subsets, yielding efficient and accurate attribution for LLM pre-training.

Comments project page: https://stride-tda.github.io/

Abstract

Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLMs). Consequently, most approaches approximate this effect in the parameter space using gradients. However, tracking gradients across billions of parameters is not only prohibitively expensive but relies on local approximations. In this work, we propose a shift: rather than estimating parameter changes, we model the functional effect of training data in the activation space. We introduce STRIDE (Steering-based Training Data Influence Decomposition), a framework that formulates TDA as a sparse recovery problem in the spirit of compressive sensing. STRIDE learns lightweight "steering operators" that mimic the behavioral shift caused by training on data subsets. By measuring how these operators perturb test predictions, we recover individual training example influences via sparse linear decomposition. STRIDE achieves state-of-the-art for LLM pre-training attribution while being an order of magnitude (13×) faster than previous art. We further validate its practical utility through downstream applications including data selection, data contamination, and qualitative analysis.
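
The sparse-recovery framing in this abstract can be illustrated with a toy example. The sketch below is not the authors' method: it assumes each subset's effect on a test prediction is exactly the sum of the per-example influences in that subset, replaces the learned steering operators with that linear model, and uses orthogonal matching pursuit (OMP) as a stand-in sparse solver. All names (`omp`, `masks`, `effects`) and sizes are hypothetical.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily pick k columns of A that explain y."""
    norms = np.linalg.norm(A, axis=0)
    support, coef = [], np.zeros(0)
    residual = y.copy()
    for _ in range(k):
        scores = np.abs(A.T @ residual) / norms
        if support:
            scores[support] = 0.0  # never pick the same column twice
        support.append(int(np.argmax(scores)))
        # Refit all chosen coefficients jointly, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(0)
n_train, n_subsets, k = 200, 120, 3  # fewer measurements than unknowns

# Assumed ground truth: only three training examples influence the test prediction.
true_influence = np.zeros(n_train)
true_influence[[5, 42, 117]] = [2.0, -1.5, 1.0]

# Each row is a random training subset; its measured effect is the summed influence.
masks = rng.integers(0, 2, size=(n_subsets, n_train)).astype(float)
effects = masks @ true_influence

# Centering removes the shared offset of 0/1 masks and keeps the linear model exact.
A = masks - masks.mean(axis=0)
y = effects - effects.mean()
recovered = omp(A, y, k)
```

In this noise-free toy setting, 120 subset measurements are enough to recover the three influential examples out of 200 candidates, which is the kind of saving the compressive sensing view is after.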

2606.05161 2026-06-04 cs.SD cs.CL version update

Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

Yichen Gao, Yiqun Zhang, Zijing Wang, Yujia Li, Heng Guo, Xi Wu, Xiaocui Yang, Shi Feng, Yifei Zhang, Daling Wang

Affiliations Northeastern University, China; Shanghai Artificial Intelligence Laboratory, China

AI summary Using same-audio counterfactual experiments, the paper finds that audio-language models in conflict tasks often ignore audio evidence because the text dominates, and proposes GACL, a training-free decoding rule that interpolates between joint and same-audio scores to repair arbitration reversals, significantly improving faithfulness.

Abstract

Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
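
The sign-flip diagnostic and the interpolation idea can be sketched in a few lines. This is a hedged illustration only: the abstract does not specify GACL's gate, so the gate used here (interpolate only when the two branches disagree on the top candidate), the weight `alpha`, and the example scores are all assumptions.

```python
def top(scores):
    """Return the candidate with the highest score."""
    return max(scores, key=scores.get)

def is_sign_flip(joint, same_audio, audio_answer, text_answer):
    """True when the same-audio branch prefers the audio-supported answer
    but the joint (audio + conflicting text) branch prefers the text-supported one."""
    same_audio_margin = same_audio[audio_answer] - same_audio[text_answer]
    joint_margin = joint[audio_answer] - joint[text_answer]
    return same_audio_margin > 0 and joint_margin < 0

def gated_interpolation(joint, same_audio, alpha=0.5):
    """Hypothetical gate: leave joint scores alone when both branches agree,
    otherwise interpolate between joint and same-audio scores."""
    if top(joint) == top(same_audio):
        return dict(joint)
    return {c: (1 - alpha) * joint[c] + alpha * same_audio[c] for c in joint}

# Made-up candidate scores for one conflict sample.
joint = {"dog barking": 1.0, "cat meowing": 2.0}       # text says "cat"
same_audio = {"dog barking": 3.0, "cat meowing": 0.5}  # audio alone says "dog"

flip = is_sign_flip(joint, same_audio, "dog barking", "cat meowing")
corrected = gated_interpolation(joint, same_audio, alpha=0.5)
```

Here the sample is a sign flip, and interpolating with `alpha=0.5` restores the audio-supported answer as the top candidate.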

2606.05158 2026-06-04 cs.CL cs.AI cs.MA version update

Streaming Communication in Multi-Agent Reasoning

Zhen Yang, Xiaogang Xu, Wen Wang, Cong Chen, Xander Xu, Ying-Cong Chen

Affiliations HKUST(GZ); Alibaba Group; ZJU; HKUST

AI summary Proposes StreamMA, a streaming multi-agent reasoning system that lowers latency by streaming reasoning steps to downstream agents in real time and, unexpectedly, also improves effectiveness; it also gives the first closed-form joint analysis of the stream, serial, and single protocols.

Comments project page: https://zhenyangcs.github.io/StreamMA-website/

Abstract

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
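
A toy latency model helps show why streaming breaks the linear dependence on pipeline depth. The sketch below is not the paper's closed-form analysis; it assumes a chain topology, equal step times, and that an agent's step can start as soon as the upstream agent has emitted the step with the same index. Function names are hypothetical.

```python
def serial_latency(n_agents, n_steps, step_time=1.0):
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return n_agents * n_steps * step_time

def stream_latency(n_agents, n_steps, step_time=1.0):
    """Streaming: step s of agent a starts once agent a-1 has finished step s
    and agent a has finished its own step s-1 (assumed dependency structure)."""
    finish = [[0.0] * n_steps for _ in range(n_agents)]
    for a in range(n_agents):
        for s in range(n_steps):
            upstream = finish[a - 1][s] if a > 0 else 0.0
            previous = finish[a][s - 1] if s > 0 else 0.0
            finish[a][s] = max(upstream, previous) + step_time
    return finish[-1][-1]

serial = serial_latency(4, 5)   # 4 agents, 5 reasoning steps each
stream = stream_latency(4, 5)
speedup = serial / stream
```

Under these assumptions the streamed chain finishes after `n_agents + n_steps - 1` step times instead of `n_agents * n_steps`, so four agents with five steps each take 8 time units rather than 20.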

2606.05145 2026-06-04 cs.LG cs.AI cs.CL version update

Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)

Nizar Islah, Istabrak Abbes, Irina Rish, Sarath Chandar, Eilif B. Muller

Affiliations Mila - Quebec AI Institute; Université de Montréal; Polytechnique Montréal; CHU Sainte-Justine

AI summary Proposes identifying fixable failures from the distributional signature of failed reasoning traces rather than from their text, and designs a training-free routing rule that improves the effect of test-time interventions.

Abstract

When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget. We propose that failed traces encode recoverability structure: the inference-time signature of which test-time interventions can rescue a given failure. Three problem-level trajectory features, derived from the structure of available interventions, recover this structure from the distributional signature of failed rollouts, not their text. They cluster failures into stable regimes, characterize the failure topography of different post-training methods (84.3 ± 4.3% accuracy, +20% over a majority-class baseline), and support a training-free routing rule that lifts rescue by +12.2% on the deployment-relevant Steerable-Hard subset (failures where retry is insufficient and a bounded intervention is reachable). The features and the routing rule transfer across two cross-family probes. The same three features thus convert failed traces from discarded data into a diagnostic object, supporting test-time routing and post-training analysis without training-time or weight-space access.

2606.05134 2026-06-04 cs.CL cs.LG version update

Activation-Based Active Learning for In-Context Learning: Challenges and Insights

Yaseen M. Osman, Geoff V. Merrett, Stuart E. Middleton

Affiliations School of Electronics and Computer Science (ECS), University of Southampton

AI summary Studies MLP-activation-based deep active learning for in-context learning and finds that activation signals correlate only weakly with example quality or task performance, indicating that such methods are unsuited to in-context learning.

Comments 9 pages, 3 figures

Abstract

Deep active learning has previously been explored for LLM in-context sample selection, but not with methods that utilise recent advances in understanding of transformer activations. In this paper, we test the hypothesis that model activations could provide a fine-grained signal to optimise the selection of in-context examples. We present the most comprehensive analysis to date of MLP activation-based deep active learning methods applied to in-context learning, including how different attention masking strategies impact active learning across diverse classification and generative datasets, using both Llama-3.2-3B and Qwen2.5-3B base models. However, we find a negative result: MLP outputs, viewed through the lenses of massive activations or the first four moments, do not correlate with example quality or task performance. Specifically, the absolute Spearman correlation coefficient is at most 0.33 for all tasks and models we tested, showing that such activation-based sampling should not be used for in-context learning. We hypothesise that this may be due to superposition, whereby models represent more features than they have dimensionality, suggesting that methods like Sparse Autoencoders (SAEs) may be a promising future direction.

2606.05122 2026-06-04 cs.CL version update

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

XiuYu Zhang, Yi Shan, Junfeng Fang, Zhenkai Liang

Affiliations National University of Singapore; Beijing University of Technology

AI summary Proposes Self-Evaluation Elicitation (SEE), which uses a small amount of data (160 examples) with calibration-coupled reinforcement learning and masked distillation to elicit a base LLM's existing ability to predict an external judge's scores, significantly improving calibration while preserving answer quality.

Abstract

Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.

2606.05121 2026-06-04 cs.SD cs.AI cs.CL cs.MM eess.AS version update

Audio Interaction Model

Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu, Yue Liao, Ziyang Ma, Dongchao Yang, Mingbao Lin, Deheng Ye, Shuicheng Yan, Chunyan Miao

Affiliations NTU; NUS; CUHK

AI summary Proposes Audio-Interaction, a unified online large audio language model that achieves real-time audio interaction through an always-on perceive-decide-respond loop, and builds the StreamAudio-2M dataset and the Proactive-Sound-Bench benchmark; it preserves performance on mainstream audio tasks while unlocking real-time ASR, streaming audio instruction following, and proactive help.

Comments Next generation of LALMs, work in progress

Abstract

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.

2606.05115 2026-06-04 cs.CV cs.AI cs.CL version update

Continual Visual and Verbal Learning Through a Child's Egocentric Input

Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake, Mengye Ren

Affiliations Agentic Learning AI Lab, New York University; Department of Psychology, Princeton University

AI summary Proposes BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass; combining streaming visual representation learning with an image-text contrastive objective, it outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark and narrows the gap to the offline-training upper bound.

Comments 15 pages, 4 figures

Abstract

Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.

2606.05112 2026-06-04 cs.CL version update

Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

Cheng Liang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Chaoyi Wu, Weidi Xie

Affiliations Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

AI summary Proposes the MedSP1000 benchmark, which simulates dynamic clinical interaction through standardized patient cases to evaluate LLMs on information gathering, treatment planning, and long-term management, and finds that under process-level evaluation current models fall well short of clinical safety standards.

Abstract

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubric. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.

2606.05106 2026-06-04 cs.CL cs.AI cs.CY version update

Arithmetic Pedagogy for Language Models

Andhika Bernard Lumbantobing, Hokky Situngkir

Affiliations Bandung Fe Institute & Adjunct Science Fellow in InaAI; AI Research Center IT Del & Bandung Fe Institute

AI summary Drawing on human mathematics pedagogy, the paper operationalizes the GASING method as chain-of-thought supervision to train a small GPT-2 model, which reaches high accuracy on arithmetic reasoning and exhibits an associative mental-arithmetic capacity.

Comments 18 pages, 6 figures

Abstract

We investigate whether methods of human mathematics pedagogy can guide the training of language models toward arithmetic reasoning. Building on the GASING method, an Indonesian pedagogy that solves basic arithmetic through a left-to-right procedure aligned with the causal order of token generation, we operationalize each operation as a computational procedure whose execution trace is serialized into natural-language Chain-of-Thought (CoT) supervision. A small GPT-2 decoder (86M parameters) with a syllabic-agglutinative TOBA tokenizer for Indonesian is trained from scratch on this data using only a next-token prediction objective, without reinforcement learning or reward-based optimization. Monitoring training reveals three distinct learning phases, and mechanistic analyses (attention-masking interventions on the CoT information graph, residual-stream probing, and logit-lens inspection) show that the model first internalizes a procedural pathway and subsequently develops an associative, "mental-arithmetic" capacity that retrieves intermediate results without explicit step-by-step computation. The trained model reaches over 80% accuracy on held-out problems and attains competitive performance against substantially larger language models, indicating that targeted, pedagogically grounded training can yield strong and economical arithmetic capability at small scale.

2606.05087 2026-06-04 cs.CL version update

Light or Full Verb? A Minimal-Pair Dataset for Probing Phraseological Competence in Language Models

Francesca Franzon, Nicolas Rosàs Gómez, Leo Wanner

Affiliations Universitat Pompeu Fabra (UPF)

AI summary Builds a minimal-pair dataset to probe whether language models distinguish light-verb from full-verb uses, and finds that models separate the two uses even in minimal contexts and show separable patterns across object types.

Abstract

Frequent English verbs such as 'have' and 'make' can function either as collocates in light-verb constructions or as full lexical predicates, as in 'make a decision' vs. 'make a cake'. Whether language models represent this distinction remains unclear. We introduce a large-scale controlled dataset of minimally varying English sentence series in which the same context contains the same verb in light-verb and full-verb uses. Two probing experiments show that language models differentiate between these uses even in minimal contexts and exhibit separable patterns across object types. We release the dataset, generation code, and materials as a reusable resource. The framework supports extensions to broader contexts, additional verbs, and other languages.

2606.05085 2026-06-04 cs.CL cs.AI version update

Automatic Generation of Titles for Research Papers Using Language Models

Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay

Affiliations Jadavpur University; Indian Association for the Cultivation of Science

AI summary Proposes generating paper titles from abstracts with pre-trained and large language models; fine-tuned PEGASUS-large achieves the best performance on multiple datasets.

Comments 24 pages, 24 tables, 01 figure

Abstract

The title of a research paper conveys its primary idea and, occasionally, its conclusions in a clear and concise manner. Choosing an appropriate title is often challenging, and automated title generation can assist authors in this task. In this work, we propose a technique to generate paper titles from abstracts using open-weight pre-trained and large language models. We use the CSPubSum and LREC-COLING-2024 datasets and introduce a new dataset, SpringerSSAT, curated from four Springer journals in the social sciences. Additionally, we use GPT-3.5-turbo in a zero-shot setting to generate titles. Model performance is evaluated with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore metrics. Our experiments show that fine-tuned PEGASUS-large outperforms other models, including fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo, across most metrics. We further demonstrate that ChatGPT can generate creative paper titles. Overall, AI-generated titles are generally appropriate and reliable.

2606.05079 2026-06-04 cs.CL cs.LG version update

Fast & Faithful Function Vectors

Minh An Pham, Anton Segeler, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin, Patrick Kahardipraja, Reduan Achtibat

AI summary Improves the efficiency and accuracy of function vectors (FVs) by refining attention head selection with gradient-based Layer-wise Relevance Propagation (LRP) and applying steering in a distributed manner, enabling fast and faithful steering of large language models (LLMs).

Abstract

Function vectors (FVs) are task representations elicited during in-context learning that can be used to steer Large Language Models (LLMs). However, design choices in their formulation remain underexplored. In this work, we study the impact of varying FV definitions for instructions along two degrees of freedom: attention head selection and steering. For head selection, using gradient-based attributions with Layer-wise Relevance Propagation (LRP) substantially improves efficiency as well as accuracy. For FV steering, applying it in a distributed manner yields a higher accuracy compared to simple aggregation. Our code is publicly available.

2606.05054 2026-06-04 cs.CL version update

Boosting Self-Consistency with Ranking

Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Salnikov, Alexander Panchenko, Viktor Moskvoretskii

Affiliations AIRI; Skoltech; EPFL

AI summary Proposes RISC, which recasts answer selection in self-consistency as a ranking problem and uses a lightweight LambdaRank model over five features, achieving a better accuracy-efficiency trade-off than standard self-consistency across several datasets.

Comments 16 pages, 13 figures, accepted at ACL Student Research Workshop 2026

Abstract

Self-consistency improves large language models by sampling multiple reasoning paths and selecting the most frequent answer, but majority voting often fails to recover correct answers that are already present among the samples. We address this limitation with Ranking-Improved Self-Consistency (RISC), which reformulates answer selection in self-consistency as a ranking problem. Instead of relying on a single uncertainty or confidence signal, RISC uses a lightweight LambdaRank model to score candidate answers with five carefully designed features that capture answer frequency, semantic centrality, and reasoning-trace consistency. We evaluate RISC on three datasets under a range of test-time budgets. Across datasets, RISC consistently achieves a better accuracy-efficiency trade-off than standard self-consistency and strong baselines, with particularly large gains on question answering benchmarks. Further analysis shows that the proposed features are individually useful and, more importantly, complementary, highlighting the value of learning to combine multiple informative signals for test-time answer selection.
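
The failure mode described here, where the correct answer is among the samples but loses the vote, can be shown with a small example. The sketch below is illustrative only: the hand-set linear weights stand in for the paper's learned LambdaRank model, only three of the feature families named in the abstract are used, and all feature values are made up.

```python
from collections import Counter

def majority_vote(answers):
    """Standard self-consistency: return the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]

def rank_answers(candidates, weights):
    """Score each candidate answer as a weighted sum of its features and
    return the best one (a linear stand-in for a learned ranker)."""
    def score(features):
        return sum(weights[name] * features[name] for name in weights)
    return max(candidates, key=lambda answer: score(candidates[answer]))

# Five sampled reasoning paths ending in two distinct answers (hypothetical).
samples = ["15", "15", "15", "12", "12"]

# Hypothetical per-answer features on a 0..1 scale.
candidates = {
    "15": {"frequency": 0.6, "centrality": 0.2, "trace_consistency": 0.3},
    "12": {"frequency": 0.4, "centrality": 0.9, "trace_consistency": 0.8},
}
weights = {"frequency": 1.0, "centrality": 1.0, "trace_consistency": 1.0}

voted = majority_vote(samples)
ranked = rank_answers(candidates, weights)
```

Majority voting returns "15" because it is the most frequent answer, while the combined score prefers "12", whose lower frequency is outweighed by the other two signals.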

2606.05042 2026-06-04 cs.LG cs.CL cs.SC version update

In-Context Graphical Inference

Zehua Cheng, Wei Dai, Jiahao Sun

Affiliations Department of Computer Science, University of Oxford; FLock.io

AI summary Proposes ICG-I, an autoregressive Graph Transformer that mimics variable elimination and uses tensor-train compression and weighted conformal prediction to provide scalable, calibrated marginal inference in discrete graphical models, reaching state-of-the-art performance on standard instances and frustrated spin glasses.

Comments 19 Pages

Abstract

Marginal inference in discrete graphical models forces a choice between exactness and scalability: exact algorithms are intractable for high-treewidth graphs, while iterative approximations (Belief Propagation, variational methods) sacrifice convergence guarantees on frustrated topologies. We argue that this dichotomy stems from a mismatched inductive bias: iterative methods abandon the sequential elimination structure that makes exact inference correct. We introduce In-Context Graphical Inference (ICG-I), an autoregressive Graph Transformer that restores this structure by mimicking Variable Elimination with learned, Tensor-Train-compressed intermediate factors, paired with a Dirichlet output layer and Weighted Conformal Prediction for calibrated, distribution-free coverage guarantees under topological shift. We prove that TT compression errors propagate at most linearly through the autoregressive chain, that the Dirichlet-Multinomial loss is a proper scoring rule, and that WCP maintains coverage with a quantifiable degradation under estimated density ratios. We conducted intensive experiments to evaluate ICG-I and achieved state-of-the-art performance across all benchmarks. ICG-I reduces MAE from 0.041 (best baseline) to 0.020 on standard instances and achieves 0.048 on N=500 frustrated spin glasses where BP diverges entirely.

2606.05030 2026-06-04 cs.CL cs.SC version update

Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair

Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

Affiliations Department of Computer Science, University of Oxford, UK; FLock.io; Institute of Logic and Computation, TU Wien, Austria

AI summary To address error snowballing in autoregressive chain-of-thought reasoning, proposes the Teleological Reasoning Infilling (TRI) framework, which recasts erroneous reasoning segments as fill-in-the-middle tasks with a Prefix-Suffix-Middle sequence rearrangement and combines supervised fine-tuning on symbolically verified data with Direct Preference Optimisation, so that only the damaged segment is repaired.

Comments 25 Pages

Journal ref
In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2026

Abstract

Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is fundamentally forward-directed: each step conditions only on prior tokens. This unidirectional inductive bias renders even capable models susceptible to error snowballing, wherein a single logical or arithmetic mistake in an early step irreversibly corrupts the entire reasoning chain. We introduce Teleological Reasoning Infilling (TRI), a training framework that endows decoder-only transformers with a native goal-conditioned bridging capability. The key insight is to reframe erroneous reasoning segments as fill-in-the-middle (FIM) tasks: given a verified prefix premise $P$, a verified downstream milestone $S$, and the original query $Q$, the model must synthesise the logical bridge $M$ that connects $P$ to $S$ rigorously and completely. To achieve this with standard causal architectures, we introduce a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, enabling $M$ to attend to both $P$ and $S$ without any structural modification to the self-attention mechanism. Training proceeds in two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified $(P, S, M)$ triples extracted from formal mathematics corpora, and (ii) Direct Preference Optimisation (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, eliminating LLM-judge sycophancy. At inference, TRI operates as a surgical repair module within a dual-system loop: a causal draft model generates an initial trace, the verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact. Comprehensive experiments on three benchmarks demonstrate that TRI achieves state-of-the-art performance across all tasks, while reducing per-problem token expenditure by 31.2%.
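As a rough illustration of the Prefix-Suffix-Middle rearrangement described in the abstract, the sketch below builds a sequence in which P and S both precede M, so a causal model generating M has already seen the prefix and the suffix. The sentinel strings, the function name, and the choice to place Q first are assumptions made for this example; the abstract does not specify them.

```python
# Hypothetical sentinel strings; the actual tokens are not given in the abstract.
PRE, SUF, MID = "<|prefix|>", "<|suffix|>", "<|middle|>"


def build_psm_example(query: str, prefix: str, suffix: str, middle: str = "") -> str:
    """Arrange (Q, P, S, M) so that the middle segment M comes last.

    Under a causal (left-to-right) attention mask, every token of M can
    then attend to both P and S without any change to self-attention.
    Placing Q at the front is an assumption of this sketch.
    """
    return f"{query}{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


example = build_psm_example(
    query="Show that x = 3. ",
    prefix="2x + 1 = 7",
    suffix="x = 3",
    middle="2x = 6",
)
```

At inference time the same function would be called with an empty `middle`, and the model would be asked to continue the string after the middle sentinel.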

2606.05029 2026-06-04 cs.LG cs.CL 版本更新

Validity Threats for Foundation Model Research

基础模型研究的有效性威胁

Gunnar König, Martin Pawelczyk, Ulrike von Luxburg, Sebastian Bordt

Affiliations: University of Tübingen, Tübingen AI Center; University of Vienna

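AI summary: The paper proposes a causal-inference evaluation framework that maps the approximate experimental strategies used in foundation model research (proxy experiments, observational studies, single-run designs) onto trade-offs among four types of validity (statistical, internal, external, construct), exposing and analysing the hidden validity threats that come with compute savings.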

详情
AI中文摘要

受控实验是机器学习研究的基石,但在现代基础模型的规模下,它们变得过于昂贵。相反,研究界越来越依赖于以较低成本近似理想实验的研究策略:代理实验和缩放定律、使用公开模型的观察性研究,以及利用单个训练运行内部变化的单次运行设计。在这项工作中,我们认为在计算预算内近似大规模实验没有免费午餐。具体来说,计算节省是以有效性威胁为代价的——隐藏且有时无法检验的假设,当这些假设被违反时,会使研究主张无效。为了帮助应对这些威胁,我们提出了一个评估框架,将基础模型研究视为因果推断问题。在这个框架内,我们通过从经验社会科学中改编的四种有效性——统计、内部、外部和构念有效性——来评估不同的研究策略。我们发现每种策略都有其特有的有效性特征:代理实验以外部和构念有效性换取统计和内部有效性;观察性研究面临混杂和效应异质性;单次运行设计则因处理单元之间的干扰而紧张。这一分析揭示了文献中未得到充分关注的若干有效性威胁。总体而言,我们的评估框架为研究人员提供了一个实用的工具包,用于审视基础模型研究设计中的有效性威胁。

英文摘要

Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats -- hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences -- statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research designs.

2606.05016 2026-06-04 cs.CL 版本更新

TaDA: Calibrated Probe Gating for Task-Domain LoRA Merging

TaDA: 任务-领域LoRA合并的校准探针门控

Huy Quoc To, Fuyi Li, Guangyan Huang, Ming Liu

Affiliations: Deakin University; Adelaide University

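AI summary: To address the depth-dependent asymmetry that arises when merging task and domain LoRA adapters, the paper proposes TaDA, a training-free algorithm that uses calibrated probe-guided per-layer gating and per-component subspace-aware merging, and reports the best average accuracy on six scientific QA and six image classification benchmarks.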

详情
AI中文摘要

将任务LoRA适配器与领域LoRA适配器组合成一个统一模型是一个实际但很大程度上未被探索的挑战。现有方法将两个适配器视为对称对等体，对所有层应用统一权重。我们认为，任务和领域适配器在Transformer架构中表现出一致的深度依赖不对称性。领域主导性随层深度增加而增强，而较浅层保留更强的任务相关信号。受此观察启发，我们提出$\textbf{TaDA}$($\textbf{Ta}$sk-$\textbf{D}$omain LoR$\textbf{A}$ Merging)，一种无训练算法，通过校准探针引导的逐层门控和逐分量子空间感知合并来利用这种结构。门控使用被证明对适配器权重幅度不变的探针信号，为每层和投影类型分配独立权重。合并则在组合剩余分量之前丢弃冲突的奇异方向。$\textbf{TaDA}$产生一个标准秩$r$的LoRA适配器，推理开销为零。在Llama-2-7B的六个科学QA基准上，TaDA平均准确率达到0.452，比DARE-TIES高出3.6个百分点，并在所有六个基准上取得最佳结果。在ViT-L/16的六个图像分类基准上，TaDA平均准确率达到85.9%，在六个基准中的三个上领先，同时优于最强的合并基线。

英文摘要

Combining a task LoRA adapter with a domain LoRA adapter into a single unified model is a practical yet largely unexplored challenge. Existing methods treat both adapters as symmetric peers, applying uniform weights across all layers. We argue that task and domain adapters exhibit a consistent depth-dependent asymmetry across transformer architectures. Domain dominance increases with layer depth, while shallower layers retain stronger task-relevant signals. Motivated by this observation, we propose $\textbf{TaDA}$ ($\textbf{Ta}$sk-$\textbf{D}$omain LoR$\textbf{A}$ Merging), a training-free algorithm that exploits this structure through calibrated probe-guided per-layer gating and per-component subspace-aware merging. The gating assigns individual weights per layer and projection type using a probe signal proved invariant to adapter weight magnitude. The merging discards conflicting singular directions before combining the remaining components. $\textbf{TaDA}$ produces a standard rank-$r$ LoRA adapter with zero inference overhead. On six scientific QA benchmarks with Llama-2-7B, TaDA achieves an average accuracy of 0.452, outperforming DARE-TIES by +3.6 percentage points and obtaining the best result on all six benchmarks. On six image classification benchmarks with ViT-L/16, TaDA reaches 85.9\% average accuracy, improving over the strongest merging baseline while leading in three of the six individual benchmarks.

2606.05009 2026-06-04 cs.CL cs.AI 版本更新

DAR: Deontic Reasoning with Agentic Harnesses

DAR: 基于智能体框架的道义推理

Guangyao Dou, William Jurayj, Nils Holzenberger, Benjamin Van Durme

Affiliations: Johns Hopkins University; Télécom Paris, Institut Polytechnique de Paris

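AI summary: The paper proposes the DAR setup, which improves LLM-based deontic reasoning by letting the model interact with statutes on demand; experiments show that agentic harnesses can improve performance, but gains are uneven and weaker models degrade on numerical tasks.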

详情
AI中文摘要

道义推理是通过将明确的规则和政策应用于具体案例事实来回答问题,例如根据法规计算纳税义务或确定移民上诉结果。基于LLM的道义推理的一个关键技术挑战是相关规则集可能很长且相互引用,因此模型可能仍无法找到特定推理步骤所需的规则。我们引入了道义智能体推理(DAR),这是一种智能体推理设置,其中模型按需与法规交互。我们在DeonticBench的困难子集上使用多种框架评估DAR。在这些设置中,我们发现智能体框架可以推动道义推理任务的前沿,但改进并不均匀:较弱的模型在数值任务上往往性能下降,同时消耗更多的令牌。

英文摘要

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.

2606.05008 2026-06-04 cs.CV cs.AI cs.CL 版本更新

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

M$^3$Eval: 通过认知基础视频任务的多模态记忆评估

Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yixin Zhu, Yiwu Zhong

Affiliations: School of Intelligence Science and Technology, Peking University; State Key Laboratory of General Artificial Intelligence, Peking University; Yuanpei College, Peking University; Institute for Artificial Intelligence, Peking University; School of Psychological and Cognitive Sciences, Peking University; University of Wisconsin-Madison

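AI summary: The paper presents M$^3$Eval, which it describes as the first evaluation framework for memory in multi-modal models; using video tasks designed around cognitive psychology, it systematically evaluates memory retention, fidelity, and robustness, and finds marked weaknesses in parallel video-stream processing, interference patterns, spatio-temporal memory, and symbolic memory.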

Comments We present an evaluation designed for multi-modal memory in multi-modal models

详情
AI中文摘要

随着多模态模型向长视频理解发展,记忆成为关键能力。尽管在视频数据集和基准测试方面做出了大量努力,现有工作主要关注感知和推理,而没有系统评估记忆:模型保留了什么、信息如何忠实保存、以及记忆在干扰下的鲁棒性。为填补这一空白,我们引入了M$^3$Eval,这是第一个用于探测多模态模型中不同记忆维度的综合评估框架和基准。基于认知心理学,我们的设计通过精心构建的任务来隔离记忆的关键方面。利用M$^3$Eval,我们在代表性多模态模型上进行了大量实验,揭示了一致的弱点和独特行为。我们发现,模型在处理并行视频流时难以保持解耦表示,表现出与人类记忆显著不同的干扰模式,在空间域比时间域更可靠地定位记忆源,并且符号记忆有限。总的来说,我们的基准为未来研究提供了宝贵资源,而我们的发现强调了记忆作为基本但未充分探索的能力,并为设计更有效的多模态模型记忆机制提供了见解。我们的代码和数据集可在https://pku-value-lab.github.io/m3eval-homepage获取。

英文摘要

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

2606.05002 2026-06-04 cs.CL 版本更新

GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation

GARL:面向多智能体战略优先级排序的博弈论强化学习

Yuxiao Ye, Yiwen Zhang, Huiyuan Xie, Yuqin Huang, Zhiyuan Liu

Affiliations: Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou)

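AI summary: The paper proposes GARL, which formalises multi-agent strategic prioritisation as a two-stage game and converts the game-theoretic utilities into role-specific reinforcement signals to optimise interaction policies; on issues-in-dispute ranking it improves performance and makes small open-source LLMs competitive with a strong closed-source LLM.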

详情
AI中文摘要

基于LLM的多智能体系统越来越多地用于战略决策任务。在此类设置中,性能不仅取决于单个模型的能力,还取决于智能体交互和适应的策略。多智能体强化学习可以优化这些交互策略,但其奖励设计通常特定于任务且与交互结构的关联较弱。为弥补这一差距,我们提出GARL,一种面向多智能体战略优先级排序的博弈论强化学习框架。GARL将战略优先级排序形式化为两阶段博弈:竞争智能体首先在共享候选集上分配战略资源,然后更高级别的仲裁者产生最终排名。由此产生的博弈论效用被转化为角色特定的强化信号,使策略优化能够由结构化交互引导。我们在争议问题排序任务上实例化GARL,其目标是在法律程序中优先处理核心问题。实验表明,GARL提高了排序性能,使小型开源LLM在相同候选排名设置下与强大的闭源LLM竞争,并在法律领域能力和更广泛的战略决策方面取得收益。总体而言,GARL展示了如何将博弈论交互结构转化为强化学习目标,为多智能体战略优先级排序中的策略优化提供了原则性方法。

英文摘要

LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. Multi-agent reinforcement learning can optimise these interaction policies, but its reward design often remains task-specific and weakly grounded in interaction structure. To address this gap, we propose GARL, a GAme-theoretic Reinforcement Learning framework for multi-agent strategic prioritisation. GARL formalises strategic prioritisation as a two-stage game: competing agents first allocate strategic resources over a shared candidate set, and a higher-level arbiter then produces the final ranking. The resulting game-theoretic utilities are converted into role-specific reinforcement signals, allowing policy optimisation to be guided by structured interaction. We instantiate GARL on issues-in-dispute ranking, where the goal is to prioritise core issues in legal proceedings. Experiments show that GARL improves ranking performance, enables small open-source LLMs to become competitive with a strong closed-source LLM under the same candidate-ranking setting, and yields gains in legal-domain competence and broader strategic decision-making. Overall, GARL demonstrates how game-theoretic interaction structure can be turned into reinforcement-learning objectives, providing a principled approach to policy optimisation in multi-agent strategic prioritisation.

2606.04987 2026-06-04 cs.CL cs.AI cs.HC 版本更新

DeliChess: A Multi-party Dialogue Dataset for Deliberation in Chess Puzzle Solving

DeliChess: 一个用于国际象棋谜题求解中深思熟虑的多方对话数据集

Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos

Affiliations: University of Cambridge; University of Sheffield

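AI summary: The paper introduces the DeliChess dataset of multi-party dialogues in which participants collaboratively solve chess puzzles, shows that discussion significantly improves group accuracy, and analyses the role of probing utterances.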

详情
AI中文摘要

多方对话是研究协作推理和决策的关键场景,然而现有数据集很少关注结构化、深入的复杂推理任务。我们引入了DeliChess,一个新颖的群体深思熟虑对话数据集,其中参与者协作解决多项选择国际象棋谜题。每个小组首先单独完成谜题,然后进行多方讨论,最后提交修正后的集体答案。该数据集包含107个对话,附有完整转录、讨论前后的选择以及关于谜题难度和走棋质量的元数据。我们使用基于象棋引擎评估的三个指标评估性能,发现深思熟虑显著提高了群体准确性。我们进一步利用先前深思熟虑数据训练的分类器分析了探询性话语(即引发提议、理由或战略反思的消息)的作用。虽然探询性话语使讨论后的群体表现更加多变,但它并未持续带来更好的性能。我们的数据集为在一个明确定义的策略领域中建模群体推理、对话动态以及不同观点和意见的解决提供了丰富的测试平台。

英文摘要

Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on structured, in-depth complex reasoning tasks. We introduce DeliChess, a novel dataset of group deliberation dialogues in which participants collaboratively solve multiple-choice chess puzzles. Each group first completes the puzzle individually, then engages in a multi-party discussion before submitting a revised collective answer. The dataset includes 107 dialogues with full transcripts, pre- and post-discussion choices, and metadata on puzzle difficulty and move quality. We evaluate performance using three metrics based on chess engine evaluations, and find that deliberation significantly improves group accuracy. We further analyse the role of probing utterances (i.e., messages that elicit proposals, justifications, or strategic reflection) using a classifier trained on prior deliberation data. While probing makes group performance more variable after discussion, it does not consistently lead to better performance. Our dataset offers a rich testbed for modelling group reasoning, dialogue dynamics, and the resolution of differing perspectives and opinions in a well-defined strategic domain.

2606.04978 2026-06-04 cs.CL cs.CY econ.GN q-fin.EC 版本更新

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

探究大语言模型风险决策中的结果层面相似性与机制层面一致性:来自圣彼得堡博弈的证据

Chensong Huang, Changyu Chen, Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo

Affiliations: Fudan University; University of Rochester

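AI summary: Using St. Petersburg game experiments, the paper finds that LLMs show human-like behaviour at the outcome level in risk decisions but differ substantially from human decision-making at the mechanism level, suggesting that behavioural alignment may be only surface-level.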

详情
AI中文摘要

大语言模型在风险决策任务中可能显得谨慎,但看似谨慎的输出并不一定表明其与人类决策机制对齐。我们以圣彼得堡博弈作为受控测试平台来研究这一区别,这是一个经典悖论,其中期望收益无限,但人类通常报告低且有限的支付意愿。我们评估了28个大语言模型,使用结构化的提示套件,包括原始博弈;控制决策变体,扰动截断、重复游戏、数字禀赋和职业身份;要求模型以人类决策者身份推理的人类视角提示;以及基础模型与其指令微调对应模型之间的配对比较。在原始博弈中,大多数模型生成有限出价,造成类人风险行为的表象。然而,这种结果层面的相似性掩盖了显著的机制层面差异。控制变体揭示,模型并未保持原始博弈中观察到的类人行为,而是常常转向条件性和计算性理性行为。人类线索提示和指令微调通常降低出价并减少一些可见的病理现象,但大多数机制层面的响应模式基本保持不变。这些发现表明,风险决策中的行为对齐可能是表面层次的:大语言模型可能产生类人风险决策,而不表现出与人类一致的机制。因此,对大语言模型决策的高风险评估应超越结果相似性,检查对齐是否由机制层面的一致性支持。

英文摘要

LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a controlled testbed, a classical paradox in which the expected payoff is infinite, yet humans typically report low, finite willingness to pay. We evaluate 28 LLMs with a structured prompt suite that includes the original game; controlled decision variants that perturb truncation, repeated play, numeric endowment, and occupational identity; a human-perspective prompt that asks models to reason as human decision makers; and paired comparisons between base models and their instruction-tuned counterparts. In the original game, most models generate finite bids, creating the appearance of human-like risk behavior. However, this outcome-level resemblance masks substantial mechanism-level differences. The controlled variants reveal that rather than maintaining human-like behavior seen in the original game, models often shift to conditionally and computationally rational behavior. Human-cue prompting and instruction tuning often lower bids and reduce some visible pathologies, but most mechanism-level response patterns remain largely unchanged. These findings show that behavioral alignment in risk decision-making can be surface-level: LLMs may produce human-like risk decisions without exhibiting human-consistent mechanisms. High-stakes evaluations of LLM decision-making should therefore move beyond outcome similarity and examine whether the alignment is supported by mechanism-level consistency.
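For readers unfamiliar with the game, the sketch below computes the expected payoff of a truncated St. Petersburg game, which shows why the untruncated value is unbounded. It assumes the common formulation in which a fair coin is flipped until the first head and the payoff is 2^k when that head lands on flip k, with a payoff of 0 if no head appears before the cutoff; the exact payoff convention and truncation rule used in the paper's prompts are not stated in the abstract.

```python
from fractions import Fraction


def truncated_expected_payoff(max_flips: int) -> Fraction:
    """Expected payoff when the game stops after at most max_flips flips.

    Assumed rules: payoff 2**k if the first head is on flip k (probability
    1 / 2**k), and 0 if no head appears within max_flips flips.
    """
    return sum(
        (Fraction(1, 2**k) * 2**k for k in range(1, max_flips + 1)),
        Fraction(0),
    )


# Every flip contributes exactly 1 to the expectation, so the value grows
# linearly with the cutoff and has no finite limit.
for cutoff in (10, 20, 40):
    print(cutoff, truncated_expected_payoff(cutoff))
```

Under these assumptions a truncation at K flips gives an expected payoff of exactly K, which is why the truncation variants mentioned in the abstract change what a payoff-maximising bid would be.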

2606.04974 2026-06-04 cs.CL 版本更新

SAID: Accelerating Diffusion-Based Language Models via Scaffold-Aware Iterative Decoding

SAID: 通过支架感知迭代解码加速基于扩散的语言模型

Na Li, Chengda Wang, Mingju Gao, Hao Tang

Affiliations: School of Computer Science, Peking University

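AI summary: The paper proposes SAID, which accelerates diffusion language models by reallocating denoising computation to scaffold tokens, and introduces CHLG to assign extra steps to low-confidence tokens, reaching up to 9.1x speedup on LLaDA models while maintaining competitive performance.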

Comments Code: https://github.com/TH-AI-Lab-PKU/SAID

详情
AI中文摘要

扩散大语言模型(DLLMs)通过迭代去噪具有双向上下文的损坏令牌序列,实现非自回归生成。尽管它们能够并行更新多个位置,但由于高质量生成需要大量去噪步骤,推理成本仍然很高。我们提出了SAID,一种支架感知迭代解码框架,通过跨令牌重新分配计算来加速DLLMs。SAID首先将去噪计算用于支架令牌以建立粗略的语义结构,然后用更少的步骤完成可预测的细节令牌。我们进一步将SAID适配到块级扩散解码,并引入了置信度分层生成(CHLG),仅为低置信度令牌分配额外的步骤。在LLaDA-8B和LLaDA 1.5上的数学、编码和知识基准实验表明,SAID显著加速了DLLM推理,最高加速比达9.1倍,同时保持了竞争性能。我们的代码公开在:https://github.com/TH-AI-Lab-PKU/SAID。

英文摘要

Diffusion large language models (DLLMs) enable non-autoregressive generation by iteratively denoising corrupted token sequences with bidirectional context. Despite their ability to update multiple positions in parallel, inference remains costly due to the many denoising steps required for high-quality generation. We propose SAID, a Scaffold-Aware Iterative Decoding framework that accelerates DLLMs by reallocating computation across tokens. SAID first spends denoising computation on scaffold tokens to establish the coarse semantic structure, and then completes predictable detail tokens with fewer steps. We further adapt SAID to block-wise diffusion decoding and introduce Confidence-Hierarchical Layered Generation (CHLG), which assigns additional steps only to low-confidence tokens. Experiments on LLaDA-8B and LLaDA 1.5 across math, coding, and knowledge benchmarks show that SAID significantly accelerates DLLM inference with a maximum speedup of 9.1x while maintaining competitive performance. Our code is publicly available: https://github.com/TH-AI-Lab-PKU/SAID.

2606.04964 2026-06-04 cs.CL 版本更新

SemBlock: Semantic Boundary Dynamic Blocks for Diffusion LLMs

SemBlock: 扩散语言模型的语义边界动态块

Xinrui Song, Zhuoran Wang, Mingju Gao, Hao Tang

Affiliations: School of Computer Science, Peking University

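AI summary: The paper proposes SemBlock, which builds decoding blocks dynamically by predicting semantic boundaries with lightweight predictors trained on frozen LLaDA hidden states, and outperforms fixed-block decoding and AdaBlock on natural language, math, and code tasks.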

Comments Code: https://github.com/TH-AI-Lab-PKU/SemBlock

详情
AI中文摘要

扩散语言模型(DLM)通过迭代去噪生成文本,逐块解码通过提交局部块中的令牌提高了其实用性。然而,现有的逐块方法通常依赖于固定的块大小或基于分隔符的运行时信号,这些不一定与语义边界对齐。在本文中,我们提出了SemBlock,一种面向扩散LLM的语义边界驱动的动态块解码框架。SemBlock将动态块构建形式化为语义边界预测,并在冻结的LLaDA隐状态上训练轻量预测器。为了提供监督,我们构建了SemBound,一个语义边界数据集,该数据集从自然语言、数学和代码任务中的话语单元、推理步骤和实现跨度中推导出边界标签。在推理过程中,SemBlock使用预测的边界概率来选择每个动态块的结束位置。在GSM8K、IFEval、MATH和HumanEval上的实验表明,SemBlock始终优于固定块解码和AdaBlock。我们的代码公开在:https://github.com/TH-AI-Lab-PKU/SemBlock。

英文摘要

Diffusion language models (DLMs) generate text through iterative denoising, and blockwise decoding improves their practicality by committing tokens in local blocks. However, existing blockwise methods typically rely on fixed block sizes or delimiter-based runtime signals, which do not necessarily align with semantic boundaries. In this paper, we propose SemBlock, a semantic-boundary-driven dynamic block decoding framework for diffusion LLMs. SemBlock formulates dynamic block construction as semantic boundary prediction and trains lightweight predictors on frozen LLaDA hidden states. To provide supervision, we construct SemBound, a semantic-boundary dataset that derives boundary labels from discourse units, reasoning steps, and implementation spans across natural language, math, and code tasks. During inference, SemBlock uses predicted boundary probabilities to select the ending position of each dynamic block. Experiments on GSM8K, IFEval, MATH, and HumanEval show that SemBlock consistently improves over fixed-block decoding and AdaBlock. Our code is publicly available: https://github.com/TH-AI-Lab-PKU/SemBlock.
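The idea of choosing each block's end from predicted boundary probabilities can be sketched as follows. This is only an illustration: the selection rule (take the most probable boundary inside a length window), the parameter names `min_len` and `max_len`, and both function names are assumptions, since the abstract does not describe the actual rule.

```python
def select_block_end(boundary_probs, start, min_len=4, max_len=32):
    """Return the inclusive index of the last token of the next block.

    Assumed rule: among positions that give a block length between min_len
    and max_len, pick the one with the highest boundary probability.
    """
    last = len(boundary_probs) - 1
    lo = min(start + min_len - 1, last)
    hi = min(start + max_len - 1, last)
    return max(range(lo, hi + 1), key=lambda i: boundary_probs[i])


def split_into_blocks(boundary_probs, min_len=4, max_len=32):
    """Greedily partition all positions into (start, end) blocks."""
    blocks, start = [], 0
    while start < len(boundary_probs):
        end = select_block_end(boundary_probs, start, min_len, max_len)
        blocks.append((start, end))
        start = end + 1
    return blocks


# Toy probabilities with likely boundaries at positions 2 and 5.
probs = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.1, 0.2]
blocks = split_into_blocks(probs, min_len=2, max_len=4)
```

With fixed-size blocks of length 4 the same toy sequence would be cut at positions 3 and 7 regardless of where the predicted boundaries lie.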

2606.04952 2026-06-04 cs.HC cs.CL 版本更新

Clinical Assistant for Remote Engagement Link (CARE-link): A Web-Based Electronic Health Records Software for Managing Diabetes

临床远程参与助手(CARE-link):一种用于管理糖尿病的基于网络的电子健康记录软件

Prince Ebenezer Adjei, Joshua Teye Tettey, Toufiq Musah, Audrey Agbeve, John Amuasi

Affiliations: Global One Health Research Group, Bernhard Nocht Institute of Tropical Medicine; Global Health and Infectious Diseases Research Group, Kumasi Centre for Collaborative Research in Tropical Medicine; Department of Computer Engineering, Kwame Nkrumah University of Science and Technology; Department of Global Health, School of Public Health, Kwame Nkrumah University of Science and Technology

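AI summary: CARE-link is an open-source, web-based clinical support platform that links clinicians and patients through an LLM-mediated workflow to improve gestational diabetes management; it aggregates patient-generated data from outside the hospital, provides clinical decision support, and gives patients explanations of management plans and lifestyle guidance through a WhatsApp interface.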

详情
AI中文摘要

CARE-link是一个开源、基于网络的临床支持平台,旨在通过LLM介导的工作流程连接临床医生和患者,改善妊娠期糖尿病管理。该系统汇总医院外患者生成的数据,总结相关临床信息,并为临床医生提供情境感知的决策支持。对于患者,CARE-link通过WhatsApp界面提供管理计划的清晰解释和及时的生活方式指导。集成的双面设计旨在促进持续监测、支持个性化护理,并减少临床随访负担。该平台采用模块化架构,可适应其他需要纵向跟踪和行为支持的慢性疾病。CARE-link有潜力增强临床监督、促进患者依从性,并加强护理连续性,特别是在资源有限的环境中。

英文摘要

CARE-link is an open-source, web-based clinical support platform designed to improve the management of gestational diabetes by linking clinicians and patients through an LLM-mediated workflow. The system aggregates patient-generated data outside the hospital, summarizes relevant clinical information, and delivers context-aware decision support to clinicians. For patients, CARE-link provides clear explanations of management plans and delivers timely lifestyle guidance through a WhatsApp interface. The integrated dual-facing design aims to promote continuous monitoring, support individualized care, and reduce the burden of in-clinic follow-ups. Built with a modular architecture, the platform can be adapted to other chronic conditions requiring longitudinal tracking and behavioral support. CARE-link has the potential to enhance clinical oversight, promote patient compliance, and strengthen continuity of care particularly in resource-constrained settings.

2606.04928 2026-06-04 cs.LG cs.CL 版本更新

Data Attribution in Large Language Models via Bidirectional Gradient Optimization

通过双向梯度优化实现大型语言模型中的数据归因

Frédéric Berdoz, Luca A. Lanzendörfer, Kaan Bayraktar, Roger Wattenhofer

Affiliations: EPFL, Switzerland; ETH Zurich, Switzerland

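AI summary: The paper proposes a training data attribution method for auto-regressive LLMs based on bidirectional gradient optimization, which identifies the training data that most influenced a model's output and improves model interpretability.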

Comments Presented at the AI Governance (AIGOV) Workshop at AAAI 2026

详情
AI中文摘要

大型语言模型(LLMs)越来越多地部署在各种应用中,引发了关于治理、问责和数据溯源的关键问题。理解哪些训练数据对模型的输出影响最大仍然是一个基本开放问题。我们通过扩展逆公式来解决自动回归LLMs的训练数据归因(TDA)挑战:如果模型在训练期间看到了生成的输出,训练数据会如何受到影响?我们的方法通过对生成的文本样本进行双向梯度优化(梯度上升和下降)来扰动基础模型,并测量训练样本上损失的变化。我们的框架支持任意数据粒度的归因,能够实现事实和风格归因。我们在已知数据集的预训练模型上评估了我们的方法,并表明它在影响力指标上优于先前的工作,从而增强了模型的可解释性,这是负责任AI系统的基本要求。

英文摘要

Large Language Models (LLMs) are increasingly deployed across diverse applications, raising critical questions for governance, accountability, and data provenance. Understanding which training data most influenced a model's output remains a fundamental open problem. We address this challenge through training data attribution (TDA) for auto-regressive LLMs by expanding upon the inverse formulation: How would training data be affected if the model had seen the generated output during training? Our method perturbs the base model using bidirectional gradient optimization (gradient ascent and descent) on a generated text sample and measures the resulting change in loss across training samples. Our framework supports attribution at arbitrary data granularity, enabling both factual and stylistic attribution. We evaluate our method against baselines on pretrained models with known datasets, and show that it outperforms previous work on influence metrics, thereby enhancing model interpretability, an essential requirement for accountable AI systems.
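The inverse formulation in the abstract can be illustrated on a toy one-parameter model. The sketch below takes one gradient descent step and one gradient ascent step on a "generated" sample and compares the resulting losses on each training sample. The linear model, the single-step update, the learning rate, and the way the two losses are combined into a score are all assumptions for illustration and are not taken from the paper.

```python
def loss(w, x, y):
    """Squared error of the toy model y_hat = w * x."""
    return (w * x - y) ** 2


def grad(w, x, y):
    """Derivative of the squared error with respect to w."""
    return 2 * (w * x - y) * x


def attribution_scores(w, generated, train_set, lr=0.01):
    """Score each training sample by how its loss reacts to the perturbation.

    Assumed scoring rule: loss after ascent minus loss after descent. A
    positive score means that learning the generated sample helps this
    training sample and unlearning it hurts, i.e. the two are related.
    """
    xg, yg = generated
    g = grad(w, xg, yg)
    w_descent = w - lr * g  # model that has "seen" the generated sample
    w_ascent = w + lr * g   # model pushed away from the generated sample
    return [loss(w_ascent, x, y) - loss(w_descent, x, y) for x, y in train_set]


# The first training sample agrees with the generated one, the second conflicts.
scores = attribution_scores(
    w=1.0,
    generated=(1.0, 2.0),
    train_set=[(1.0, 2.0), (1.0, 0.0)],
)
```

In this toy case the agreeing training sample receives a positive score and the conflicting one a negative score.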

2606.04924 2026-06-04 cs.CL 版本更新

Can Crowdsourcing Survive the LLM Era? A Community Survey on Human Data Collection

众包能否在LLM时代幸存?关于人类数据收集的社区调查

Aswathy Velutharambath, Neele Falk, Sofie Labat, Tarun Tater, Amelie Wuehrl

Affiliations: University of Stuttgart; Ghent University; Harvard University; IT University of Copenhagen

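AI summary: Based on a survey of 155 researchers in NLP and related fields, the paper examines the challenge LLMs pose to the validity of crowdsourced data, along with detection strategies and countermeasures; 44% of respondents observed LLM usage, yet existing efforts remain insufficient.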

详情
AI中文摘要

大型语言模型(LLM)作为写作工具的广泛使用挑战了众包数据的有效性,因为众包工作者可能将任务外包给模型。为了更好地了解如何解决这一问题,我们调查了155名NLP及相关领域的研究人员,了解他们通过众包收集自由文本回复的经验和意见。本文概述了从业者面临的挑战、缓解策略以及对数据质量的预期影响。44%的受访者报告在其众包数据中观察到LLM的使用。虽然其中93%的人预料到了这一点,但一半的人不确定应采取何种预防措施。最普遍的检测策略是独特的文本风格模式和异常快速的完成时间。总体而言,调查回复显示研究社区意识到这一问题并正在采取措施,但现有努力仍不足以完全解决。最后,我们提出了一系列考虑因素,以指导LLM时代未来的众包自由文本数据收集。

英文摘要

The widespread use of Large Language Models (LLMs) as writing tools challenges the validity of crowdsourced data, as crowdworkers may outsource tasks to models. To better understand how this is addressed, we surveyed 155 researchers in NLP and related disciplines about their experiences and opinions on collecting free-text responses via crowdsourcing. This paper provides an overview of practitioners' challenges, mitigation strategies, and the foreseen implications on data quality. 44% of respondents reported observing LLM usage in their crowdsourced data. While 93% of them had anticipated this, half were unsure what precautions to take. The most prevalent detection strategies are distinctive textual style patterns and unusually fast completion times. Overall, survey responses show that the research community is aware of the problem and taking measures, but existing efforts remain insufficient to fully address it. Finally, we derive a set of considerations to guide future crowdsourced free-text data collection in the era of LLMs.

2606.04923 2026-06-04 cs.LG cs.AI cs.CL 版本更新

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

基于评分标准的强化学习中的奖励黑客行为的复现、分析与检测

Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng, Juanzi Li, Xiaozhi Wang

Affiliations: Tsinghua University; Harbin Institute of Technology, Shenzhen; Xi’an Jiaotong University

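AI summary: The paper introduces CHERRL, a controllable hacking environment that reproduces reward hacking by injecting known biases, analyses the discoverability and exploitability of those biases, and explores agent-based automatic detection.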

Comments 23 pages, 7 figures

详情
AI中文摘要

基于评分标准的强化学习(RL)使用LLM作为评判者(LaaJ)根据评分标准对模型输出进行评分作为奖励。然而,策略模型可能利用评判者中的潜在偏见,导致奖励黑客行为以及无效或不安全的训练结果。在真实的基于评分标准的RL中,此类黑客行为通常微妙且与多种评判者偏见纠缠在一起,使得分析、检测和缓解变得困难。在本文中,我们引入了CHERRL,一个用于基于评分标准的RL的可控黑客环境。通过将已知偏见注入LaaJ,CHERRL能够稳定复现奖励黑客行为,明确观察奖励发散,并精确识别黑客行为的起始点。这为研究基于评分标准的RL中奖励黑客行为的机制和缓解措施提供了一个干净的实验测试平台。为了展示其效用,我们从可发现性和可利用性的角度分析了不同的评判者偏见,并探索了一个基于智能体的系统,用于从训练日志中自动检测奖励黑客行为的起始点。代码和环境公开于https://github.com/THUAIS-Lab/CHERRL。

英文摘要

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.

2606.04915 2026-06-04 cs.CL cs.IR 版本更新

Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

Caliper: 探究LLM中的词汇锚点与因果结构

Zhenyu Yu, Shuigeng Zhou

Affiliations: Fudan University

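AI summary: Using a lexical anonymization perturbation, the paper shows that LLM performance on causal reasoning benchmarks relies mainly on lexical pattern matching rather than structural causal reasoning.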

详情
AI中文摘要

大语言模型在CLadder等因果推理基准上达到50%至70%的准确率,但尚不清楚这反映的是结构推理还是词汇模式匹配。我们引入Caliper,一种受控扰动方法,在保留每个问题的因果图和概率规范的同时,用占位符标记替换语义变量名。在九个指令微调LLM(从3.8B到671B参数)和三个因果推理基准上,词汇匿名化在本地3.8B-14B模型集上导致稳健的准确率下降,分别为+7.6、+27.0和+11.1个百分点;在跨越2024-2026代际的九个前沿模型上,CRASS和e-CARE上的下降幅度升至+29.6和+18.0个百分点。在40个模型-基准组合中,39个显示出正差距,而在CLadder的伪词子集上,差距缩小了17倍。结构化提示和少样本上下文学习各自缩小了差距,但主要是通过降低较小模型上的P0准确率,而非恢复P1。当前指令微调LLM在零样本评估下,一旦移除词汇锚点,几乎没有证据表明其具备结构因果推理能力。

英文摘要

Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than recovering P1. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.
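A minimal sketch of the kind of lexical anonymization the abstract describes is given below: semantic variable names are swapped for placeholder tokens while the rest of the question, and hence its causal structure, is left untouched. The placeholder format `V1`, `V2`, ..., the function name, and the example question are assumptions; the paper's actual placeholder scheme is not given in the abstract.

```python
import re


def anonymize(question: str, variable_names: list[str]) -> str:
    """Replace each semantic variable name with a placeholder such as V1.

    Matching is case-insensitive and restricted to whole words. Longer names
    are tried first so that a multi-word name wins over any shorter name it
    contains.
    """
    if not variable_names:
        return question
    mapping = {name.lower(): f"V{i}" for i, name in enumerate(variable_names, start=1)}
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(1).lower()], question)


anonymized = anonymize(
    "Smoking causes tar deposits. Does smoking raise cancer risk?",
    ["smoking", "tar deposits", "cancer"],
)
print(anonymized)  # V1 causes V2. Does V1 raise V3 risk?
```

Comparing a model's accuracy on the original and the anonymized question is then what yields the gap reported in the abstract.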

2606.04911 2026-06-04 cs.CV cs.CL 版本更新

BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine

BreastGPT: 面向乳腺癌临床全流程的多模态大语言模型

Yang Liu, Jiajin Zhang, Danyang Tu, Yaojun Hu, Jiao Qu, Jiuyu Zhang, Yu Shi, Wei Fang, Shi Gu, Ling Zhang, Yingda Xia

Affiliations: DAMO Academy, Alibaba Group; Zhejiang University; Hupan Lab; West China Hospital; China Medical University

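AI summary: The paper proposes BreastGPT, a multimodal large language model that, using the workflow-aligned instruction corpus BreastStage and a dual-branch visual encoder, supports multimodal reasoning across breast cancer screening, diagnosis, and treatment planning, reaching 75.66% closed-ended accuracy and an 89.92% open-ended score on BreastStage-Bench.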

详情
AI中文摘要

乳腺癌仍然是女性癌症相关死亡的主要原因。其临床管理需要跨临床工作流(包括筛查、诊断和治疗规划)的多模态推理,其中每个阶段涉及不同的成像模态、任务目标和推理模式。然而,受限于数据稀缺和模型通用性,现有的医学多模态大语言模型通常仅在孤立的模态或狭窄的任务族上进行评估,限制了它们支持工作流级临床推理的能力。在这项工作中,我们首先引入了BreastStage,一个工作流对齐的乳腺影像指令语料库,包含来自5种成像模态的17个子数据集和136个任务模板的186万条指令遵循对。其保留子集BreastStage-Bench为评估乳腺癌护理连续体中的多模态推理提供了全面的基准。基于该语料库,我们提出了BreastGPT,一个统一的多模态大语言模型,配备双分支视觉编码器和概念保持的令牌压缩,以弥合标准放射学与千兆像素病理学之间的尺度差距。在BreastStage-Bench上,BreastGPT实现了75.66%的封闭式准确率和89.92%的开放式得分,在临床阶段和任务格式上均优于通用和医学专用多模态大语言模型。这些结果表明,工作流对齐的数据和跨尺度视觉建模对于临床基础的医学多模态大语言模型至关重要。所有数据、代码和模型检查点已在https://yangyy-liu.github.io/BreastGPT.io发布。

英文摘要

Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis and treatment planning, where each stage involves distinct imaging modalities, task objectives, and reasoning patterns. However, constrained by data scarcity and model versatility, existing medical MLLMs are typically evaluated on isolated modalities or narrow task families, limiting their ability to support workflow-level clinical reasoning. In this work, we first introduce BreastStage, a workflow-aligned breast imaging instruction corpus comprising 1.86M instruction-following pairs curated from 17 sub-datasets across 5 imaging modalities and 136 task templates. Its held-out split, BreastStage-Bench, provides a comprehensive benchmark for evaluating multimodal reasoning across the breast cancer care continuum. Building on this corpus, we propose BreastGPT, a unified MLLM equipped with a dual-branch visual encoder and concept-preserving token compression to bridge the scale gap between standard radiology and gigapixel pathology. On BreastStage-Bench, BreastGPT achieves 75.66% closed-ended accuracy and 89.92% open-ended score, outperforming both general-purpose and medical-specific MLLMs across clinical stages and task formats. These results suggest that workflow-aligned data and cross-scale visual modeling are critical for clinically grounded medical MLLMs. All data, code, and model checkpoints are released at https://yangyy-liu.github.io/BreastGPT.io.

2606.04909 2026-06-04 cs.IR cs.CL 版本更新

BEATS: Bootstrapping E-commerce Attribute Taxonomies for Search through Iterative Human-AI Collaboration

BEATS: 通过迭代人机协作引导电商搜索属性分类

Yung-Yu Shih, Shang-Yu Su, Tzu-I Ho, Dongzhe Wang, Yun-Nung Chen

发表机构 * National Taiwan University(国立台湾大学) Rakuten Group, Inc.(乐天集团) Taiwan Rakuten Ichiba, Inc.(台湾乐天市场股份有限公司) Rakuten Asia Pte. Ltd.(乐天亚洲有限公司)

AI总结 针对新兴市场电商平台缺乏结构化属性模式的问题,提出BEATS框架,利用人机协作的LLM流水线从零构建产品属性分类,并通过属性标注提升搜索系统性能。

Comments 6 pages, 1 figure, 5 tables. Accepted to SIGIR 2026 Industry Track. Official version: https://doi.org/10.1145/3805712.3808520

详情
AI中文摘要

新兴市场的电商平台通常使用欠发达的产品目录,仅包含类别分类而缺乏结构化属性模式。缺乏细粒度产品属性限制了搜索能力——阻碍分面过滤、降低查询理解、削弱搜索系统使用的语义表示。我们提出BEATS,一种人机协作的LLM框架,用于从零开始引导产品属性分类。我们的方法扩展了一个多阶段LLM生成流水线,包含两个关键生产阶段:(1) 模型开发者主动进行质量检查以过滤错误输出,以及(2) 领域专家本地工作人员进行人工标注以验证生成的属性。该框架迭代运行——每个生成阶段的提示基于质量检查观察和标注者在连续轮次中的反馈进行优化,逐步提高属性质量。一旦属性分类建立,我们使用LLM对单个产品项目进行结构化属性标注,丰富其上下文表示。丰富的目录直接有益于搜索系统的多个组件:实现细粒度基于属性的过滤、为排序模型提供结构化特征、改善密集检索的语义表示。我们通过在属性丰富的产品数据上训练密集检索模型来验证生成的分类,证明相对于使用原始目录信息的基线有一致的改进。我们的系统已在台湾乐天部署,丰富了9个主要类别,涵盖2,694个子类别,生成了67,277个属性,超过540万产品已使用生成的属性进行标注,并计划丰富整个产品目录。

英文摘要

E-commerce platforms in emerging markets often operate with underdeveloped product catalogs that contain only category taxonomies but lack structured attribute schemas. This absence of fine-grained product attributes limits search capabilities -- preventing faceted filtering, degrading query understanding, and weakening semantic representations used by search systems. We present BEATS, a human-in-the-loop LLM framework for bootstrapping product attribute taxonomies entirely from scratch. Our approach extends a multi-stage LLM generation pipeline with two critical production stages: (1) proactive quality checking by model developers to filter erroneous outputs, and (2) human annotation by domain-expert local staff to validate generated attributes. The framework operates iteratively -- prompts at each generation stage are refined based on quality check observations and annotator feedback across successive rounds, progressively improving attribute quality. Once the attribute taxonomy is established, we employ LLMs to perform structured attribute tagging on individual product items, enriching their contextual representations. The enriched catalog directly benefits multiple components of the search system: enabling granular attribute-based filtering, providing structured features for ranking models, and improving semantic representations for dense retrieval. We validate the generated taxonomy by training dense retrieval models on attribute-enriched product data, demonstrating consistent improvements over baselines using original catalog information. Our system has been deployed at Rakuten Taiwan, enriching 9 major categories spanning 2,694 sub-categories with 67,277 generated attributes, and over 5.4 million products have been tagged with the generated attributes, with plans to enrich the entire product catalog.

2606.04906 2026-06-04 cs.CL cs.AI 版本更新

'Your AI Text is not Mine': Redefining and Evaluating AI-generated Text Detection under Realistic Assumptions

“你的AI文本不是我的”:在现实假设下重新定义和评估AI生成文本检测

Nils Dycke, Marina Sakharova, Nico Daheim, Iryna Gurevych

发表机构 * Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, Technical University of Darmstadt(泛在知识处理实验室(UKP实验室),计算机科学系,达姆施塔特工业大学) National Research Center for Applied Cybersecurity ATHENE, Germany(应用网络安全国家研究中心ATHENE,德国) Zuse School ELIZA(祖斯学校ELIZA)

AI总结 针对AI生成文本检测领域缺乏统一有害使用定义的问题,本文系统定义了多种AI生成文本概念,构建了包含详细生成过程注释的人机协作文本基准AITDNA,并评估了多种检测器在不同概念下的表现。

详情
AI中文摘要

尽管普遍认为AI生成的文本会带来广泛的社会风险,但在AI生成文本检测文献中,对于什么构成有害使用并没有共同的理解。相反,现有的数据集和方法往往定义自己的标准并做出自己的假设,有时是隐含的,而且通常只与真实世界的需求和应用程序松散相关。为了解决这一差距,我们在此系统地定义了AI生成文本的各种概念及其特征。为了研究这些,我们收集了AITDNA——一个全新的人机协作文本基准,其中标注了详细的生成过程信息,如整个编辑和AI交互历史。我们评估了各种机器生成文本检测器,发现它们通常只在特定概念下表现良好,而不能作为广泛的检测器。我们公开发布代码和数据。

英文摘要

Although it is generally agreed that AI-generated text poses a broad societal risk, there is no common understanding in the AI-generated text detection literature on what constitutes harmful use. Rather, existing datasets and approaches often define their own criteria and make their own assumptions, sometimes implicitly, and often only loosely related to real-world needs and applications. To address this gap, we here systematically define various notions of AI-generated text and their characteristics. To study these, we collect AITDNA - a new benchmark of human-machine co-constructed texts that is annotated with detailed genesis information, such as the entire edit and AI-interaction history. We benchmark various machine-generated text detectors and find that they often only perform well for specific notions but not as broad detectors. We release code and data publicly.

2606.04889 2026-06-04 cs.CL 版本更新

GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

GRAIL: 基于梯度重加权优势的可验证奖励强化学习

Tej Deep Pala, Vernon Toh, Soujanya Poria

发表机构 * DeCLaRe Lab, Nanyang Technological University(DeCLaRe实验室,南洋理工大学)

AI总结 针对强化学习中统一优势分配导致梯度信号稀释的问题,提出基于梯度激活显著性的令牌级优势重加权方法GRAIL,无需过程级监督即可提升推理对齐,在多个模型上平均准确率提升3.60%。

详情
AI中文摘要

基于可验证奖励的强化学习(如GRPO)现在已成为提升大语言模型(LLMs)数学推理能力的常见方法。然而,当前方法通常将单个序列级优势广播到所有令牌,或使用昂贵的过程奖励模型(PRMs)进行步骤级监督。统一优势分配假设所有令牌对最终奖励的贡献相同。这会稀释梯度信号,因为存在缺陷的推理步骤和填充词与有效的逻辑推理得到同等强度的更新。为解决此问题,我们引入了梯度重加权优势(GRAIL),一种内在的令牌级优势重加权方法。GRAIL使用梯度激活显著性,将更多权重赋予那些对最终答案局部更敏感的令牌。在来自Qwen3、R1-distilled和OctoThinker家族的五个模型上的评估表明,GRAIL始终优于GRPO。GRAIL在准确率上平均提升3.60%,在Pass@3上平均提升3.05%,表明无需过程级监督即可实现细粒度的推理对齐。

英文摘要

Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, since flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL achieved an average improvement of 3.60% in accuracy and 3.05% in Pass@3, demonstrating that fine-grained reasoning alignment can be achieved without process-level supervision.
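The contrast between broadcasting one sequence-level advantage and reweighting it per token can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give GRAIL's saliency formula or normalization, so `reweight_advantage` and its mean-one weighting are hypothetical, and the saliency scores are made-up non-negative inputs.

```python
def reweight_advantage(seq_advantage, saliency, eps=1e-8):
    """Spread one sequence-level advantage over tokens in proportion to saliency.

    Assumption (not from the paper): weights are scaled to average 1, so
    uniform saliency reproduces plain broadcasting of the advantage.
    """
    n = len(saliency)
    total = sum(saliency)
    weights = [n * s / (total + eps) for s in saliency]
    return [seq_advantage * w for w in weights]


# Uniform saliency: every token receives the same advantage (plain broadcast).
uniform = reweight_advantage(1.0, [1.0, 1.0, 1.0, 1.0])

# Skewed saliency: the third token receives most of the update signal.
skewed = reweight_advantage(1.0, [0.1, 0.1, 0.6, 0.2])
```

With this normalization the total update magnitude over the sequence is unchanged; only its distribution across tokens shifts.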

2606.04847 2026-06-04 cs.CV cs.CL cs.LG 版本更新

MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

MusaCoder: 在摩尔线程GPU上通过全栈训练实现原生GPU内核生成

Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang

发表机构 * Moore Threads AI

AI总结 提出MusaCoder全栈训练框架,结合渐进式数据合成、多样性保持拒绝微调和基于执行反馈的强化学习,在CUDA和MUSA后端上生成高效原生GPU内核,9B模型匹配前沿闭源模型,27B模型达到新最优。

详情
AI中文摘要

原生GPU内核生成将高级张量程序转换为可执行、高效的低级代码。现有大型语言模型(LLMs)在此任务上表现不佳,而基于执行的强化学习面临稀疏奖励、奖励黑客和训练不稳定性问题。我们提出MusaCoder,一个用于在CUDA和MUSA后端上生成原生GPU内核的全栈训练框架。MusaCoder结合了渐进式内核导向数据合成、保持多样性的拒绝微调以及通过MooreEval(一个分布式验证器和奖励环境)进行的执行反馈强化学习(RL)。为了稳定RL,MusaCoder引入了PrimeEcho用于首轮锚定的多轮奖励、Buffered Dynamic Retry用于从全失败的困难样本中恢复信号,以及MirrorPop用于离策略序列过滤。在KernelBench和MUSA移植变体上的实验表明,MusaCoder在正确性和经验加速方面均优于强开源和专有基线,其中9B模型匹配或超越前沿闭源模型,27B模型建立了新的最优结果。这些结果不仅证明了全栈执行反馈训练对原生内核生成的有效性,也展示了摩尔线程GPU支持完整LLM后训练栈的能力,为新兴加速器上的大模型训练和优化提供了实用基础。

英文摘要

Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.

2606.04846 2026-06-04 cs.CL 版本更新

Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas

K-12教育中的大语言模型:与州课程标准和学生角色的对齐

Lisa Korver, Tomo Lazovich, Sherief Reda

发表机构 * Department of Computer Engineering(计算机工程系) Brown University(布朗大学) Data Science Institute(数据科学研究院)

AI总结 本研究开发基于LLM的流程评估不同LLM与美国各州历史课程标准的对齐程度,并通过控制用户角色实验分析模型对地理、年级、性别和种族的敏感性,发现模型能调整历史主题呈现但可能源于州政治倾向,且对年级适应良好而对种族性别敏感性低,揭示了LLM与课程标准错位对学生学习的潜在风险。

详情
AI中文摘要

随着大语言模型(LLM)在教育环境中日益普及,它们引发了关于其使用伦理的重要问题。公开可用的在线聊天机器人能力和准确性迅速提升,导致更广泛的使用,包括寻求作业帮助的学生。这使得考虑这些模型是否与教育标准对齐变得至关重要。由于美国的课程标准由各州制定,它们在所需内容、重点和叙事焦点上存在显著差异。在这项工作中,我们开发了一个基于LLM的流程,以识别各州美国历史课程的变化,并评估不同LLM反映这些州特定课程差异的程度。此外,我们进行控制实验,通过陈述用户属性(如地理位置、年级、性别和种族)来改变用户角色,以评估LLM响应对用户特征的敏感性。我们发现,虽然模型能够调整历史主题的呈现,但这些转变可能源于各州的政治倾向,并不一定反映实际的课程内容。此外,模型成功适应学生的年级水平,而对种族或性别的敏感性最小,这表明它们能够以有限的人口统计偏差对用户角色进行有用的适应。总之,这些发现突显了开放访问LLM聊天机器人可能因与州课程标准错位而导致学生学习成果受损的潜在风险,并强调了需要更强大的对齐技术。

英文摘要

As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use. Publicly available online chatbots are quickly improving in capability and accuracy, leading to more widespread use, including among students looking for help with their homework. This makes it crucial to consider whether these models are aligned with educational standards. Because curriculum standards in the United States are set at the state level, they differ significantly in required content, emphasis, and narrative focus. In this work, we develop an LLM-based pipeline to identify variations in U.S. History curricula across states and evaluate the extent to which different LLMs reflect these state-specific curricular differences. In addition, we conduct controlled experiments that vary user personas by stating user attributes such as geographic location, grade level, gender and race to evaluate the sensitivity of LLM responses to user characteristics. We find that while models are able to adjust their presentation of historical topics, these shifts may come from the perceived political leanings of states and do not necessarily reflect actual curriculum content. Additionally, models successfully adapt to a student's grade level while showing minimal sensitivity to race or gender, suggesting they are capable of useful adaptation to student personas with limited demographic bias. Together, these findings highlight the risk that open access to LLM chatbots may harm student learning outcomes through misalignment with state curriculum standards, and they underscore the need for more robust alignment techniques.

2606.04828 2026-06-04 cs.CL 版本更新

A French Corpus Annotated for Multiword Expressions with Adverbial Function

一个标注了副词性多词表达的法语语料库

Eric Laporte, Takuya Nakamura, Stavroula Voyatzi

发表机构 * Université Paris-Est(巴黎东部大学) Institut Gaspard-Monge - LabInfo(加斯帕尔-蒙日研究所 - LabInfo)

AI总结 本文介绍了一个标注了副词性多词表达的法语语料库,旨在支持信息检索、信息抽取以及深层和浅层句法分析的研究。

详情
Journal ref
Language Resources and Evaluation Conference (LREC), Linguistic Annotation Workshop, 2008, Marrakech, Morocco, pp.48-51
AI中文摘要

本文介绍了一个标注了副词性多词表达(MWEs)的法语语料库。该语料库旨在用于信息检索和信息抽取以及深层和浅层句法分析的研究。我们界定了所标注的多词表达的类型,描述了用于标注的资源和方法,并简要评论了结果。标注后的语料库可在 http://infolingu.univ-mlv.fr/ 根据 LGPLLR 许可证获取。

英文摘要

This paper presents a French corpus annotated for multiword expressions (MWEs) with adverbial function. This corpus is designed for research on information retrieval and extraction, as well as on deep and shallow syntactic parsing. We delimit which kinds of MWEs we annotated, describe the resources and methods we used for the annotation, and briefly comment on the results. The annotated corpus is available at http://infolingu.univ-mlv.fr/ under the LGPLLR license.

2606.04823 2026-06-04 cs.AI cs.CL cs.MA 版本更新

R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

R-APS:基于反思性对抗帕累托搜索的组合推理与上下文元学习用于约束设计

João Pedro Gandarela, Thiago Rios, Stefan Menzel, André Freitas

发表机构 * Idiap Research Institute(Idiap研究所) École Polytechnique Fédérale de Lausanne(洛桑联邦理工学院) Honda Research Institute Europe(本田欧洲研究院) Department of Computer Science, University of Manchester(曼彻斯特大学计算机科学系) National Biomarker Centre, CRUK-MI, University of Manchester(曼彻斯特大学国家生物标志物中心)

AI总结 提出R-APS方法,通过推理模式分解、分阶段组合推理、敏感性引导对抗测试和元归纳规则提取,联合解决LLM在代理设置中的错误传播、最坏情况扰动和知识失效问题,在平面机构合成任务上实现更紧的鲁棒性证书和更快的迭代速度。

详情
AI中文摘要

大型语言模型(LLM)在开放式任务上表现流畅,但在需要规划、使用工具和长时间行动的代理设置中,流畅性并不能保证可靠交付。我们将这一差距归因于三个耦合的结构性失败:错误传播而不定位、最坏情况扰动未评估、积累的知识从未失效。我们认为这些失败有一个共同根源:溯因、反事实、元归纳、纠正和归纳推理将共享上下文拉向不相容的方向。我们提出反思性对抗帕累托搜索(R-APS),据我们所知,这是第一种通过推理模式分解联合解决所有三个失败的方法,为每种推理模式分配其自己的上下文,并在三个时间尺度上协调交互:带有类型化验证批评者的分阶段组合推理(失败定位)、作为第一类帕累托目标的敏感性引导反事实压力测试(鲁棒性)、以及带有显式失效的元归纳规则提取(持久记忆)。R-APS无需微调,仅通过结构化协议设计在冻结的LLM上运行。我们在平面机构综合(机器人、假肢、机械设计)上评估,每个候选解由运动学求解器检查。在32个目标轨迹上,R-APS提供的鲁棒性证书比均匀扰动基线紧3.5倍,首次接纳迭代速度提高46%,Chamfer距离比Enum+GA减少2.1倍,同时联合控制杆数和最坏情况鲁棒性。小型4B推理专用模型在协议内与通用70B骨干模型竞争,表明结构化协议可以部分抵消模型规模。

英文摘要

Large language models (LLMs) are fluent on open-ended tasks, yet in agentic settings, where a system must plan, use tools, and act over extended horizons, fluency does not ensure reliable delivery. We trace this gap to three coupled structural failures: errors propagate without localization, worst-case perturbations go unevaluated, and accumulated knowledge is never invalidated. We argue these share a root cause: abductive, counterfactual, meta-inductive, corrective, and inductive reasoning pull a shared context in incompatible directions. We introduce Reflective Adversarial Pareto Search (R-APS), to our knowledge the first method addressing all three failures jointly via reasoning-mode decomposition, allocating each reasoning mode its own context and orchestrating interaction across three timescales: staged compositional reasoning with a typed validation critic (failure localization), sensitivity-guided counterfactual stress-testing as a first-class Pareto objective (robustness), and meta-inductive rule extraction with explicit invalidation (persistent memory). R-APS requires no fine-tuning and operates on a frozen LLM purely via structured protocol design. We evaluate on planar mechanism synthesis (robotics, prosthetics, mechanical design), with every candidate checked by a kinematic solver. On 32 target trajectories, R-APS delivers robustness certificates 3.5x tighter than uniform-perturbation baselines, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over Enum+GA while jointly controlling bar-count and worst-case robustness. Small 4B reasoning-specialized models prove competitive with general-purpose 70B backbones inside the protocol, suggesting structured protocols can partially offset model scale.

2606.04807 2026-06-04 cs.AI cs.CL cs.CY cs.LG 版本更新

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

BiasGRPO:通过组相对策略优化在高方差奖励景观中稳定偏差缓解

Saket Reddy, Ke Yang, ChengXiang Zhai

发表机构 * University of Illinois - Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出BiasGRPO框架,利用组相对策略优化(GRPO)通过归一化组内奖励来稳定大语言模型的社会偏差缓解,优于DPO和PPO。

Comments Accepted to Findings of the ACL

详情
AI中文摘要

缓解大语言模型(LLMs)中的社会偏差提出了一个独特的对齐挑战:与可验证任务不同,偏差缺乏单一的真实标准,从而产生高方差、主观的奖励景观。先前的基于偏好的微调方法存在主要权衡:直接偏好优化(DPO)受限于离线训练中缺乏探索,而近端策略优化(PPO)由于潜在不可靠的评论家估计可能导致训练不稳定。在本文中,我们提出了BiasGRPO,一个使用组相对策略优化(GRPO)的框架,通过对一组采样完成进行奖励归一化来稳定对齐。通过用组相对基线替代价值函数,我们的方法在保持在线训练探索优势的同时减少了不稳定性。我们发现BiasGRPO在多个基准测试中优于DPO和PPO,表明其有效性。为了适应GRPO,我们以合成方式扩展了一个涵盖多个领域和上下文的数据集。我们还创建并发布了一个定制的偏差奖励模型,该模型在有效指导生成的同时高度计算高效且避免知识退化,提供了一个可无缝集成到多目标RLHF流程中的宝贵资源。

英文摘要

Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training instability due to potentially unreliable critic estimates. In this paper, we propose BiasGRPO, a framework using Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions. By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online training. We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks, indicating its effectiveness. To adapt GRPO, we synthetically extend a dataset spanning multiple domains and contexts. We also create and release a custom bias reward model that effectively guides generation while being highly compute-efficient and avoiding knowledge degradation, providing a valuable resource that can be seamlessly integrated into multi-objective RLHF pipelines.
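The group-relative baseline mentioned in the abstract can be illustrated with the standard GRPO-style normalization, in which each completion's reward is centered and scaled by the statistics of its own sampling group instead of a learned value function. This is a minimal sketch: the reward values are hypothetical, and the use of the population standard deviation and the `eps` constant are assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical bias-reward scores for four completions of a single prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)

# A group with identical rewards yields zero advantages (no learning signal).
flat = group_relative_advantages([0.7, 0.7, 0.7])
```

Because the baseline is the group mean, the advantages sum to roughly zero within each group, which is what removes the need for a separate critic in this family of methods.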

2606.04780 2026-06-04 cs.CL 版本更新

PersonaTree: Structured Lifecycle Memory for Person Understanding in LLM Agents

PersonaTree: 面向LLM智能体人物理解的结构化生命周期记忆

Yubo Hou, Jingwei Song, Hongbo Zhang, Zhisheng Chen, Bang Xiao, Tao Wan, Zengchang Qin

发表机构 * School of ASEE, Beihang University, Beijing, China(北京航空航天大学自动化科学与电气工程学院) The University of Hong Kong, Hong Kong, China(香港大学) Peking University, Beijing, China(北京大学) University of Chinese Academy of Sciences, Beijing, China(中国科学院大学) School of BME, Beihang University, Beijing, China(北京航空航天大学生物医学工程学院) CAIR and CECS, VinUniversity, Hanoi, Vietnam(越南河内 Vin 大学 CAIR 和 CECS)

AI总结 提出PersonaTree,一种结构化生命周期记忆框架,通过三级人物树和显式支持路径,将交互证据抽象为人物理解,在多个基准上取得领先性能。

详情
AI中文摘要

持久化的LLM智能体需要记忆表示,使得在长期交互中人物理解的形成变得明确。现有的智能体记忆方法强调信息保留和检索,但对累积的交互证据如何被抽象为人物理解的解释有限。我们将这一过程视为图式形成,其中情境证据被抽象为可重用模式和稳定的人物层面断言。我们引入PersonaTree,一种结构化生命周期记忆框架,通过三级人物树实现这一观点,并具有从证据到断言的显式支持路径。PersonaTree通过保守写入、置信度引导的合并和查询条件路径检索来维护树,仅返回每个查询所需的证据深度。在六个涉及人物理解和持久记忆的基准测试中,使用三个回答骨干,PersonaTree在18个紧凑分数中的12个上排名第一,并在16个设置中进入前两名。消融实验表明,层次结构提高了KnowMe上的抽象人物理解,而在可比上下文预算下,支持路径检索提高了RealPref的对齐度。

英文摘要

Persistent LLM agents require memory representations that make the formation of person understanding explicit across long-term interaction. Existing agent memory methods emphasize information retention and retrieval, yet give limited account of how accumulated interaction evidence is abstracted into person understanding. We view this process as schema formation, where situated evidence is abstracted into reusable patterns and stable person-level claims. We introduce PersonaTree, a structured lifecycle memory framework that realizes this view as a three-level persona tree with explicit support paths from evidence to claims. PersonaTree maintains the tree through conservative writing, confidence-guided consolidation, and query-conditioned path retrieval, returning only the evidence depth required by each query. Across six person understanding and persistent memory benchmarks with three answer backbones, PersonaTree ranks first in 12 of 18 compact scores and reaches the top two in 16 settings. Ablations show that hierarchy improves abstract person understanding on KnowMe, while support-path retrieval improves RealPref alignment under a comparable context budget.

2606.04778 2026-06-04 cs.AI cs.CL cs.LG 版本更新

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

超越浅层安全的推理时脆弱性:沿生成轨迹的对齐

Kyungmin Park, Taesup Kim

发表机构 * Hankuk University of Foreign Studies(韩国外国语大学) Seoul National University(首尔大学)

AI总结 本文揭示安全对齐的大语言模型在推理时存在更广泛的脆弱性,即任意生成步骤的短标记注入都能显著改变后续安全行为,并提出通过直接在生成轨迹上对齐模型来提升鲁棒性。

详情
AI中文摘要

安全对齐的大语言模型(LLMs)在推理时仍然容易受到干预,这些干预会将生成导向有害输出。最近的研究将其归因于浅层安全,即对齐集中在最初的几个输出标记上。我们表明,浅层安全是更广泛的推理时脆弱性的一个特例,其中在任何生成步骤的短标记注入都能显著改变后续的安全行为。我们还发现,模型在其隐藏状态中与拒绝方向的对齐并不能预测其对这种注入的鲁棒性,这表明在扰动下,内部状态本身并不能决定生成行为。为了解决这个问题,我们通过模拟序列中段扰动构建的生成轨迹上直接对齐模型,并表明这提高了对中段注入的鲁棒性,并泛化到利用早期标记生成的攻击。我们的工作认为,鲁棒的安全对齐需要对生成过程本身进行训练,而不仅仅是其输出。

英文摘要

Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens. We show that shallow safety is a special case of a broader inference-time vulnerability, in which short token injections at any generation step can substantially alter subsequent safety behavior. We also find that a model's alignment with refusal directions in its hidden states does not predict its robustness to such injection, revealing that internal state alone does not determine generation behavior under perturbation. To address this, we align models directly on generation trajectories constructed by simulating mid-sequence perturbation, and show that this improves robustness to mid-sequence injection and generalizes to attacks that exploit early-token generation. Our work argues that robust safety alignment requires training on the generation process itself, not only its outputs.

2606.04773 2026-06-04 cs.CV cs.CL 版本更新

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

NextMotionQA: 使用视觉语言模型基准测试和评判人体运动理解

Yong Cao, Chuqiao Li, Xianghui Xie, Gerard Pons-Moll, Andreas Geiger

发表机构 * University of Tübingen(图宾根大学) Tübingen AI Center(图宾根人工智能中心) Max Planck Institute for Informatics(马克斯·普朗克信息学研究所) Saarland Informatics Campus(萨尔信息学园区)

AI总结 提出NextMotionQA基准,通过三项互补任务和多粒度难度分层,系统评估视觉语言模型在人体运动理解中的能力,并揭示其在细粒度评判中的局限性。

Comments 23 pages, 8 figures, 9 tables

详情
AI中文摘要

人体运动理解的可靠评估对于推进具身人工智能、机器人和动画至关重要。然而,现有基准存在语义粒度粗糙、难度无区分、标注质量有限以及答案模糊等问题,无法诊断当前模型的失败之处。为弥补这一差距,我们引入NextMotionQA,这是一个全面的基准,利用视觉语言模型(VLM)进行半自动化、专家验证的数据集构建。NextMotionQA包含三项互补任务:多项选择题问答、视频字幕生成和细粒度错误纠正。每项任务沿三个核心语义轴系统组织,并分为三个任务复杂度级别。我们对十二个代表性VLM的广泛评估揭示了在传统单任务评估中不可见的关键能力差距和弱点。在互补方向上,近期工作开始使用VLM作为文本到运动评估的评判者;我们探究它们在更困难任务下是否表现出同样的退化。我们发现,VLM在粗粒度标准上与专家评分高度一致(Cohen's κ=0.70),但在细粒度、部件级评判上表现不佳(κ=0.10),验证了该范式在其强项领域的有效性,同时明确了其局限性。

英文摘要

Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models (VLMs) for semi-automated, expert-verified dataset construction. NextMotionQA features three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction. Each task is systematically structured across three core semantic axes and stratified into three task complexity levels. Our extensive evaluation of twelve representative VLMs uncovers critical capability gaps and weaknesses that remain invisible under conventional, single-task evaluations. In a complementary direction, recent work has begun using VLMs as judges for text-to-motion evaluation; we ask whether they show the same degradation under harder tasks. We find that VLMs align strongly with expert ratings on coarse criteria (Cohen's κ=0.70) but break down on fine-grained, part-level judgment (κ=0.10), validating the paradigm in its strong regime while clarifying its limits.
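For readers unfamiliar with the agreement statistic reported above, Cohen's κ compares the observed agreement between two raters with the agreement expected by chance from their label frequencies: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two label lists; the rater names and the toy labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Toy example: a VLM judge versus an expert on six motion clips.
vlm_judge = ["good", "good", "bad", "bad", "good", "bad"]
expert = ["good", "good", "bad", "good", "good", "bad"]
kappa = cohens_kappa(vlm_judge, expert)
```

In this toy case the raters agree on five of six clips, but half of that agreement is expected by chance, so κ is lower than the raw agreement rate.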

2606.04743 2026-06-04 cs.CL cs.AI cs.LG 版本更新

TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

TIDE:通过模板引导迭代的主动多问题发现

Soyeong Jeong, Jinheon Baek, Minki Kang, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院) DeepAuto.ai

AI总结 提出TIDE框架,通过模板引导的迭代机制主动发现用户上下文中隐藏的多个问题,并给出具体行动方案,在个人工作区和软件仓库两个场景中显著提升任务覆盖率和问题识别与解决能力。

详情
AI中文摘要

智能体被广泛部署为文档、工具和代码的助手。然而,它们通常仅对明确的用户请求做出响应,这些请求只反映了用户已注意到的问题,而许多其他重要问题共存于更广泛的用户上下文中,隐藏于显而易见之处,且其总数事先未知。我们将此定义为从上下文中发现多个隐藏问题的任务,其中应揭示共存的问题,基于支持性证据,并配以具体行动。为此,我们引入了TIDE,一个模板引导的迭代框架,包含两种互补机制。具体而言,基于单次预测倾向于关注最显著案例并产生泛化结论的观察,我们提出迭代发现:每轮生成一小批候选,同时基于已发现结果进行条件化,从而后续轮次扩展覆盖范围;以及思维模板:从先前解决的案例中提炼的可重用模式,指定应关注哪些上下文信号以及如何连接它们,将每个预测锚定于可识别的问题类别。我们在两个现实场景(个人工作区和软件仓库)中,使用四种模型骨干验证了TIDE,在任务覆盖率、识别和解决方面显著优于单次和并行多智能体基线。

英文摘要

Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which surface only the problems the user has noticed, while many other important problems coexist, hidden in plain sight, within the broader user context, with their total number unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template-guided iterative framework with two complementary mechanisms. Specifically, motivated by the observation that single-pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so subsequent rounds extend coverage; and thought templates, reusable schemas distilled from previously solved cases that specify what contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE on two realistic settings, personal workspaces and software repositories, across four model backbones, showing substantial gains over single-shot and parallel multi-agent baselines on task coverage, identification, and resolution.

2606.04730 2026-06-04 cs.CL eess.AS 版本更新

Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026

多语言长篇语音指令跟随:KIT 在 IWSLT 2026 的提交

Enes Yavuz Ugan, Maike Züfle, Yuka Ko, Supriti Sinhamahapatra, Fabian Retkowski, Seymanur Akti, Jan Niehues, Alexander Waibel

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出一种通用数据增强流水线,通过片段拼接、LLM标签生成和跨语言翻译将短语音语料转换为长语音训练数据,结合似然与最小贝叶斯风险解码解决长语音语义任务退化问题。

Comments 9 pages main paper, IWSLT 2026 Instruction Following track

详情
AI中文摘要

随着大语言模型的出现,单任务和基于标记的多任务模型已演变为基于指令的系统,该系统从自然语言提示中隐式推断任务和目标语言。这一趋势反映在IWSLT的指令跟随赛道中,该赛道今年引入了包括未知惊喜任务在内的新任务,对已知任务的过拟合构成了真正的挑战。我们展示了KIT在无约束设置下对长指令和短指令跟随赛道的提交。我们的方法结合了一个通用数据增强流水线,通过片段拼接、基于LLM的标签生成和跨语言翻译将短语音语料转换为长语音训练数据,在六个任务和四种语言上产生了超过100万个实例。我们进一步表明,基于似然的重新排序虽然对ASR非常有效,但会系统性地损害语义任务的表现,因为它会错误地选中由分段音频处理(而非整体长语音推理)生成的候选;将似然与最小贝叶斯风险解码相结合可以解决这一失败模式。

英文摘要

With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT's Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT's submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach combines a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.
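Minimum Bayes Risk (MBR) decoding, mentioned at the end of the abstract, picks the candidate that agrees most with the other candidates rather than the one with the highest model likelihood. The sketch below shows plain MBR selection only; the token-level F1 utility, the function names, and the candidate strings are assumptions for illustration, and the way the submission combines MBR with likelihood is not specified in the abstract and is not reproduced here.

```python
from collections import Counter


def token_f1(hyp, ref):
    """Token-overlap F1 between two whitespace-tokenized strings."""
    h, r = hyp.split(), ref.split()
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates):
    """Return the candidate with the highest average utility against the others.

    Expects at least two candidates.
    """
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(token_f1(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


candidates = [
    "the talk covers speech translation",
    "the talk covers speech translation methods",
    "the talk covers machine translation",
    "weather is nice today",
]
selected = mbr_select(candidates)
```

The outlier candidate scores zero against all others and can never win, which is the consensus effect MBR relies on.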

2606.04719 2026-06-04 cs.CL 版本更新

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

基于查询的跨模态投影器增强Mamba多模态大语言模型

SooHwan Eom, Jay Shim, Gwanhyeong Koo, Haebin Na, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo

发表机构 * Korea Advanced Institute of Science and Technology / Korea, Republic of(韩国科学技术院) University of Illinois in Urbana-Champaign / United States of America(伊利诺伊大学厄巴纳-香槟分校) Korea University / Korea, Republic of(高丽大学)

AI总结 提出基于查询的跨模态投影器,通过交叉注意力压缩视觉令牌,消除手动设计2D扫描顺序的需求,提升Mamba多模态LLM的性能和吞吐量。

Comments Accepted to EMNLP 2024 Findings

详情
AI中文摘要

Transformer的复杂度随输入长度呈二次增长,给大语言模型(LLM)带来了不可持续的计算负担。相比之下,选择性扫描结构化状态空间模型(即Mamba)有效解决了这一计算挑战。本文探索了一种基于查询的跨模态投影器,通过交叉注意力机制根据输入压缩视觉令牌,从而增强Mamba在视觉-语言建模中的效率。这种创新的投影器还消除了将原始图像特征转换为Mamba LLM输入序列时手动设计2D扫描顺序的需求。在各种视觉-语言理解基准上的实验结果表明,所提出的跨模态投影器增强了基于Mamba的多模态LLM,提升了性能和吞吐量。

英文摘要

The Transformer's quadratic complexity with input length imposes an unsustainable computational load on large language models (LLMs). In contrast, the Selective Scan Structured State-Space Model, or Mamba, addresses this computational challenge effectively. This paper explores a query-based cross-modal projector designed to bolster Mamba's efficiency for vision-language modeling by compressing visual tokens based on input through the cross-attention mechanism. This innovative projector also removes the need for manually designing the 2D scan order of original image features when converting them into an input sequence for Mamba LLM. Experimental results across various vision-language understanding benchmarks show that the proposed cross-modal projector enhances Mamba-based multimodal LLMs, boosting both performance and throughput.
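The general idea of compressing many visual tokens into a few query tokens with cross-attention can be sketched in plain Python. This is a simplified, single-head illustration under stated assumptions: the queries here are random stand-ins (the paper conditions its projector on the input), keys and values are the raw visual tokens without learned projections, and all names and sizes are hypothetical.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_compress(queries, visual_tokens):
    """Compress N visual tokens into len(queries) tokens via single-head cross-attention."""
    d = len(queries[0])
    compressed = []
    for q in queries:
        # Scaled dot-product score of this query against every visual token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in visual_tokens]
        attn = softmax(scores)
        # Output is an attention-weighted average of the visual tokens.
        compressed.append([sum(a * v[j] for a, v in zip(attn, visual_tokens)) for j in range(d)])
    return compressed


random.seed(0)
d, n_visual, n_queries = 8, 64, 4
visual_tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_visual)]
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_queries)]
compressed = cross_attention_compress(queries, visual_tokens)  # 64 tokens -> 4 tokens
```

In this sketch each output is a weighted sum over all visual tokens, so shuffling the order of the visual tokens leaves the result unchanged up to floating-point error; this is one way to see why a query-based projector does not need a hand-designed 2D scan order.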

2606.04703 2026-06-04 cs.CL cs.LG 版本更新

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

重新思考持续经验内化以实现自我进化的大语言模型智能体

Jingwen Chen, Wenkai Yang, Shengda Fan, Wenbo Nie, Chenxing Sun, Shaodong Zheng, Yangen Hu, Lu Pan, Ke Zeng, Yankai Lin

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) School of Software, Beihang University(北京航空航天大学软件学院) Meituan(美团)

AI总结 本文通过经验粒度、注入模式和内化机制三个维度,提出一种稳定可持续的经验内化方法,解决多轮经验学习中的能力崩溃问题。

Comments 10 pages, 8 figures

详情
AI中文摘要

经验内化将过去交互中的上下文经验转化为可重用的参数化能力,为大型语言模型(LLM)的持续学习提供了一条有前景的路径。虽然先前的工作主要关注单次迭代迁移,但我们发现在多轮经验学习下,现有方法遭受的是渐进的能力崩溃而非复合改进。我们通过经验内化的三个关键维度系统地考察了这种失败:(1)经验粒度:我们发现原则级经验比实例级经验更持久,因为它有效地从轨迹特定细节中抽象出可迁移的策略。(2)经验注入模式:我们的分析表明,逐步注入通过将经验与中间决策状态对齐,显著优于全局注入,这一特性对于长程工具使用至关重要。(3)内化机制:我们证明,在高质量教师轨迹上的离策略上下文蒸馏提供了比在策略上下文蒸馏更稳定的训练信号,后者固有地受限于对学生诱导的缺陷状态的局部修正。这些见解共同产生了一个简单而稳健的配方,用于稳定和可持续的经验内化,为工程化自我进化和持续学习的LLM提供了具体指导。

英文摘要

Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has predominantly focused on single-iteration transfer, we discover that under multi-iteration experience learning, existing methods suffer from a progressive capability collapse rather than compounding improvement. We systematically examine this failure through three vital dimensions of experience internalization: (1) Experience Granularity: We find that principle-level experience is more durable than instance-level experience, as it effectively abstracts transferable strategies away from trajectory-specific details. (2) Experience Injection Pattern: Our analysis reveals that step-wise injection significantly outperforms global injection by aligning experience with intermediate decision states, a property that is critical for long-horizon tool use. (3) Internalization Regime: We demonstrate that off-policy context-distillation on high-quality teacher trajectories provides a substantially more stable training signal than on-policy context-distillation, which is inherently limited by local corrections on student-induced flawed states. Together, these insights yield a simple yet robust recipe for stable and sustainable experience internalization, providing concrete guidance for engineering self-evolving and continually learning LLMs.

2606.04701 2026-06-04 cs.CV cs.CL 版本更新

Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

在短视频平台上对原生动态屏幕GUI代理的基准测试

Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo

发表机构 * Beijing Institute of Technology(北京理工大学) Tsinghua University(清华大学) Beihang University(北航)

AI总结 针对短视频平台等动态屏幕环境,提出LivingScreen基准测试,通过三级任务套件和联合评估准确性与信息效率的指标,发现现有GUI代理存在观察过度或不足的问题。

Comments preprint

详情
AI中文摘要

当前的GUI代理假设屏幕是静态的,即两次动作之间世界是冻结的。然而,诸如短视频应用之类的真实界面违反了这一假设,因为其内容持续播放,一个称职的用户必须决定观看什么以及观看多长时间。我们将此任务形式化为原生动态屏幕GUI代理,并引入LivingScreen——首个在短视频平台上实例化该任务的基准测试,它包含一个基于浏览器的忠实环境、三级任务套件以及联合评估准确性和信息效率的指标。评估广泛的前沿模型后,我们发现没有一个模型能达到人类的成本-准确率性能,并且它们的主要失败模式是过度观察和观察不足,这表明观察控制是未来GUI代理缺失的能力轴。所有数据和代码将在https://github.com/BITHLP/LivingScreen上提供。

英文摘要

GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short-video applications violate this assumption, as their content keeps playing, and a competent user must decide what to watch and for how long. We formalize this task as Living-Screen-Native GUI agents and introduce LivingScreen, the first benchmark instantiating it on short-video platforms, with a faithful browser-based environment, a three-tier task suite, and metrics that jointly score accuracy and information efficiency. Evaluating extensive frontier models, we find that none reaches the human cost-accuracy performance, and that their dominant failure mode is over- and under-observation, pointing to observation control as a missing capability axis for future GUI agents. All data and code will be available at https://github.com/BITHLP/LivingScreen.

2606.04691 2026-06-04 cs.CL 版本更新

SMADE-IE: Sparse Multi-Agent Framework with Evidence-Driven Debate for Zero-Shot Information Extraction

SMADE-IE: 基于证据驱动辩论的稀疏多智能体框架用于零样本信息抽取

Kenfeng Huang, Yi Cai, Xin Wu, Zikun Deng, Li Yuan

发表机构 * School of Software Engineering, South China University of Technology(华南理工大学软件学院)

AI总结 提出SMADE-IE稀疏多智能体框架,通过自适应模式选择器和证据驱动辩论机制,在零样本信息抽取中减少冗余交互并提升性能。

Comments 21 pages, 9 figures

详情
AI中文摘要

基于大型语言模型的零样本信息抽取因其无需任务特定训练即可适应新模式和领域的灵活性而受到越来越多的关注。现有方法主要依赖于整体提示、逐类型提示或多智能体辩论。然而,整体提示常常遭受边界和类型错误,而逐类型提示和多智能体辩论引入了跨类型冲突、冗余智能体交互和大量令牌开销。为了解决这些挑战,我们提出了SMADE-IE,一种用于零样本信息抽取的稀疏且证据驱动的多智能体框架。SMADE-IE首先采用自适应模式选择器将输入动态路由到轻量级全局抽取模式或类型中心抽取模式,减少不必要的类型选择和推理噪声。对于冲突预测,我们进一步引入了证据驱动辩论机制,将论证结构化为图尔敏式组件,并通过外部证据评分和贝叶斯更新进行置信度聚合。在NER、RE和JERE任务的9个基准数据集上的实验结果表明,SMADE-IE在持续优于现有零样本信息抽取基线的同时,通过稀疏智能体选择和早期停止辩论提高了令牌效率。

英文摘要

Zero-shot information extraction (IE) with large language models (LLMs) has attracted increasing attention due to its flexibility in adapting to new schemas and domains without task-specific training. Existing approaches mainly rely on monolithic prompting, each-type prompting, or multi-agent debate. However, monolithic prompting often suffers from boundary and type errors, while each-type prompting and multi-agent debate introduce cross-type conflicts, redundant agent interactions, and substantial token overhead. To address these challenges, we propose SMADE-IE, a sparse and evidence-driven multi-agent framework for zero-shot IE. SMADE-IE first employs an Adaptive Mode Selector to dynamically route inputs into either a lightweight Global Extraction Mode or a Type-Centric Extraction Mode, reducing unnecessary type selection and reasoning noise. For conflicting predictions, we further introduce an Evidence-Driven Debate mechanism that structures arguments into Toulmin-style components and performs confidence aggregation through external evidence scoring and Bayesian updates. Experimental results on 9 benchmark datasets across NER, RE, and JERE tasks show that SMADE-IE consistently outperforms existing zero-shot IE baselines while also improving token efficiency through sparse agent selection and early-stopping debate.
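The abstract mentions confidence aggregation through evidence scoring and Bayesian updates without giving the formula. The sketch below shows one generic way such an update can work, assuming each evidence score is expressed as a likelihood ratio. The function names and numbers are hypothetical, and this is not the authors' implementation.

```python
from typing import Iterable

def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior P(prediction is correct) after one piece of evidence.

    likelihood_ratio = P(evidence | correct) / P(evidence | incorrect).
    Assumption: evidence scores have already been mapped to such ratios.
    """
    numerator = prior * likelihood_ratio
    return numerator / (numerator + (1.0 - prior))

def aggregate_confidence(prior: float, ratios: Iterable[float]) -> float:
    """Fold a sequence of evidence ratios into a single confidence value."""
    confidence = prior
    for ratio in ratios:
        confidence = bayes_update(confidence, ratio)
    return confidence

# Two supporting pieces of evidence raise a neutral prior of 0.5.
posterior = aggregate_confidence(0.5, [2.0, 3.0])
```

A ratio above 1 raises the confidence and a ratio below 1 lowers it, so a debate loop could stop early once the confidence crosses a chosen threshold.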

2606.04680 2026-06-04 eess.AS cs.CL cs.SD 版本更新

Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

读你所听:基于声学差异的无参考假设评估

Zhihan Li, Hankun Wang, Yiwei Guo, Bohan Li, Xie Chen, Kai Yu

发表机构 * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China(上海交通大学计算机科学学院X-LANCE实验室) MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, China(教育部人工智能重点实验室,江苏省语言计算重点实验室)

AI总结 提出READ指标,利用预训练自回归TTS模型计算语音与文本假设的声学差异,无需参考转录即可评估ASR假设,并在噪声条件下实现高达20%的相对错误率降低。

Comments Submitted to Interspeech 2026. 6 pages, 4 figures

详情
AI中文摘要

自动语音识别系统通常依赖参考转录进行评估,而无参考方法往往依赖于内部置信度估计或辅助语言模型。我们提出READ(基于声学差异的无参考假设评估),一种直接从语音信号评估ASR假设的新颖指标。READ强调假设的声学基础。它使用预训练的自回归TTS模型计算给定文本假设下语音令牌的条件似然,以衡量语音与文本之间的细粒度声学差异。无需额外训练,READ即可应用于假设优化。实验表明,READ与特定识别错误相关,并改善ASR输出,实现高达20%的相对错误率降低,在噪声条件下尤其显著。

英文摘要

Automatic speech recognition systems commonly rely on reference transcriptions for evaluation, while reference-free approaches often depend on internal confidence estimation or auxiliary language models. We propose READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a novel metric that evaluates ASR hypotheses directly from the speech signal. READ emphasizes the acoustic grounding of hypotheses. It uses a pretrained auto-regressive TTS model to compute the conditional likelihood of speech tokens given a text hypothesis, to measure fine-grained acoustic discrepancy between speech and text. Without additional training, READ can be applied for hypothesis refinement. Experiments show that READ correlates with specific recognition errors and improves ASR outputs, achieving up to 20% relative error rate reduction, with particularly strong gains under noisy conditions.

2606.04661 2026-06-04 cs.CL cs.LG 版本更新

CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts

CRAFT: 成本感知的提示精炼与前沿感知的调优

Shanu Kumar, Shubhanshu Khandelwal, Akhila Yesantarao Venkata, Parag Agrawal, Yova Kementchedjhieva, Manish Gupta

发表机构 * MBZUAI Microsoft(微软)

AI总结 提出CRAFT方法,通过帕累托前沿优化提示的准确性和成本,避免标量化崩溃,在多个基准上实现更广泛的准确-成本权衡。

详情
AI中文摘要

为准确性调优的提示通常变长,每次模型调用都会增加推理成本。最佳的准确-成本权衡取决于任务和预算,因此提示优化是在准确性和提示令牌成本的帕累托前沿上的搜索,而不是针对单个提示。通常的捷径是将目标折叠成加权和,在搜索前固定权衡权重,通常只能恢复前沿的狭窄区域,我们称之为标量化崩溃。我们提出了CRAFT(成本感知的精炼和前沿感知的调优),一种帕累托前沿提示优化器,将目标LLM验证调用视为稀缺资源,并将其分配给乐观候选前沿附近的候选。每轮,互补的面向准确性和面向成本的生成器提出编辑,帕累托差距获取花费每轮的验证预算,NSGA-II保留保持分布广泛的种群。在六个分类和推理基准上,CRAFT保留的前沿同时达到高准确性和低成本区域,而仅准确性、仅成本和加权和基线各自集中在更窄的区域。准确-成本权衡成为搜索后的选择,而不是搜索前的权重。

英文摘要

Prompts tuned for accuracy often grow long, raising inference cost on every model call. The best accuracy-cost trade-off depends on the task and the budget, so prompt optimization is a search over the Pareto front of accuracy and prompt-token cost rather than for one prompt. The usual shortcut, collapsing the objectives into a weighted sum, fixes the trade-off weight before search and often recovers only a narrow region of the front, a failure we call scalarization collapse. We present CRAFT (Cost-aware Refinement And Front-aware Tuning), a Pareto-front prompt optimizer that treats target-LLM validation calls as the scarce resource and allocates them to candidates near the optimistic candidate front. Each round, complementary accuracy-oriented and cost-oriented generators propose edits, Pareto-gap acquisition spends the per-round validation budget, and NSGA-II retention keeps a spread-out population. Across six classification and reasoning benchmarks, CRAFT's retained fronts reach both high-accuracy and low-cost regions, while accuracy-only, cost-only, and weighted-sum baselines each concentrate in narrower regions. The accuracy-cost trade-off becomes a post-search choice, not a pre-search weight.
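To make the contrast between a Pareto front and a weighted sum concrete, here is a small sketch. The candidate prompts, their accuracies, and their token costs are invented for illustration and are not results from the paper. The helper names are hypothetical, and the sketch covers only the front-extraction step, not CRAFT's generators, acquisition rule, or NSGA-II retention.

```python
from typing import List, Tuple

# Hypothetical candidates: (name, accuracy, prompt_tokens).
Candidate = Tuple[str, float, int]

def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate dominates.

    A dominates B if A is at least as accurate and at most as costly,
    and strictly better on at least one of the two.
    """
    front = []
    for name, acc, cost in cands:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in cands
        )
        if not dominated:
            front.append((name, acc, cost))
    return sorted(front, key=lambda x: x[2])

def weighted_sum_best(cands: List[Candidate], weight: float, max_cost: int) -> Candidate:
    """Scalarized objective: accuracy minus weight times normalized cost."""
    return max(cands, key=lambda x: x[1] - weight * x[2] / max_cost)

candidates = [
    ("short", 0.70, 50),
    ("medium", 0.80, 200),
    ("long", 0.85, 800),
    ("bloated", 0.79, 900),  # dominated by "medium" and by "long"
]
front = pareto_front(candidates)
pick = weighted_sum_best(candidates, weight=0.5, max_cost=900)
```

Here the front keeps three prompts spanning the low-cost and high-accuracy ends, while the weighted sum with a fixed weight returns only one of them. Choosing among the front's points can then be deferred until after the search.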

2606.04660 2026-06-04 cs.CL 版本更新

LifeSide: Benchmarking Agents as Lifelong Digital Companions

LifeSide: 将智能体作为终身数字伴侣的基准测试

Yuqian Wu, Zhijie Deng, Wei Chen, Junwei Li, Yutian Jiang, Junle Chen, Zhengjun Huang, Qingxiang Liu, Jing Tang, Jiaheng Wei, Yuxuan Liang

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Hong Kong University of Science and Technology(香港科技大学) Tencent(腾讯)

AI总结 针对现有评估无法捕捉终身数字伴侣所需的多会话记忆、用户理解和隐私适应能力的问题,提出LifeSide基准,通过多智能体模拟构建记忆-情感-环境循环,评估模型在记忆追踪、用户理解、隐私控制和情感陪伴方面的表现,发现即使当前记忆基准饱和的模型也无法在长期内维持准确的用户理解和真正的陪伴。

Comments 28 pages, 23 figures, 7 tables

详情
AI中文摘要

终身数字伴侣必须整合跨会话线索,持续更新对用户的理解,并适应不断变化的隐私边界。现有评估未能捕捉到这一点,而是孤立地测试记忆回忆和短期共情。为了弥补这一差距,我们引入了LifeSide,一个以多会话“记忆-情感-环境”循环为中心的基准。通过将用户建模为具有分层档案和事件轨迹的持久世界,LifeSide使用多智能体模拟将环境动态投射到对话中,保留了潜在思想与可观察表达之间的关键差距。在记忆追踪、用户理解、隐私控制和情感陪伴方面评估了2,000个角色和111K个任务,我们的实验结果揭示了一个严峻的现实:即使是在当前记忆基准上饱和的模型,也无法在长期内维持准确的用户理解和真正的陪伴。

英文摘要

Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and short-term empathy in isolation. To bridge this gap, we introduce LifeSide, a benchmark centered on multi-session Memory-Emotion-Environment loops. By modeling users as persistent worlds with layered profiles and event trajectories, LifeSide uses multi-agent simulation to project environmental dynamics into dialogue, preserving the critical gap between latent thoughts and observable expressions. Evaluating 2,000 personas and 111K tasks across memory tracking, user understanding, privacy control, and emotional companionship, our experiment results reveal a stark reality: even models that saturate current memory benchmarks fail to sustain accurate user understanding and true companionship over long horizons.

2606.04646 2026-06-04 cs.CL cs.AI cs.IR 版本更新

QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

QO-Bench: 诊断类型化事件元组上的查询操作符保持检索

Mengao Zhang, Xiang Yang, Chang Liu, Tianhui Tan, Ke-wei Huang

发表机构 * Asian Institute of Digital Finance, National University of Singapore(亚洲数字金融研究所,新加坡国立大学)

AI总结 提出QO-Bench基准,通过类型化事件元组上的确定性评估,诊断检索增强生成系统在查询操作符(如连接、交集)上的执行瓶颈。

Comments 14 pages

详情
AI中文摘要

许多关于商业、法律和科学语料库的现实世界问题是文本中潜在记录的数据库风格查询的自然语言版本。现有的检索增强生成(RAG)系统主要针对语义相关性进行优化,但检索到看似相关的段落并不能保证正确的查询执行。我们引入了QO-Bench,一个用于类型化事件元组上查询操作符问答的诊断基准。该基准涵盖22,984篇新闻文章和614个公司事件,涉及18个查询模板,在785个问题上进行评估。每个黄金答案由类型化事件元组确定性计算得出,并通过召回率评分,答案通过精确匹配而非LLM评判器与黄金元组匹配。这种设计支持操作符级别的诊断,如连接和交集。我们在匹配条件下评估了RAG、ReAct RAG、GraphRAG和信息提取到SQL的方法,并设置了一个长上下文oracle上限以隔离检索失败。一个双轴框架——索引时保持与查询时执行——预测了每种范式失败的位置,结果证实了这一点:系统检索到相关文本,但丢弃了操作符所需的类型化值,并且可部署的范式排名在不同操作符间反转,相似性检索在过滤/投影上领先,而提取到SQL在交集和计数上领先。即使提供了黄金证据,长上下文oracle也远未饱和,因此操作符执行——而不仅仅是检索——是一个核心瓶颈,更强的答案模型也无法消除。QO-Bench将目标从段落相关性重新定义为查询操作符保持检索。

英文摘要

Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework -- index-time preservation versus query-time execution -- predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution -- not retrieval alone -- is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.
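The scoring scheme described above (gold answers computed deterministically from typed tuples, then exact-match recall) can be sketched as follows. The tuple schema, the company names, and the helper functions are hypothetical and serve only as illustration. The benchmark's actual schema and templates are not given in the abstract.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical typed event tuple: (company, event_type, date).
EventTuple = Tuple[str, str, str]

def companies_with(events: List[EventTuple], event_type: str) -> Set[str]:
    """Filter/project operator: companies that have an event of the given type."""
    return {company for company, etype, _ in events if etype == event_type}

def exact_match_recall(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of gold items found verbatim in the prediction."""
    gold_set = set(gold)
    if not gold_set:
        return 1.0  # assumption: an empty gold set counts as fully recalled
    return len(gold_set & set(predicted)) / len(gold_set)

events = [
    ("Acme", "acquisition", "2024-01-10"),
    ("Acme", "layoff", "2024-03-02"),
    ("Globex", "acquisition", "2024-02-15"),
    ("Initech", "layoff", "2024-04-01"),
    ("Initech", "acquisition", "2024-05-20"),
]

# Intersection operator: companies with both an acquisition and a layoff.
gold = companies_with(events, "acquisition") & companies_with(events, "layoff")

# A system that retrieved relevant-looking text but lost the typed values
# might answer with any company that had an acquisition.
predicted = ["Acme", "Globex"]
recall = exact_match_recall(predicted, gold)
```

In this toy case the prediction contains one of the two gold companies, so recall is 0.5 even though every predicted company appears in relevant-looking events.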

2606.04645 2026-06-04 cs.CL cs.DB 版本更新

CYGNET: Cypher Gate for Neural Execution Triage and Cost Containment

CYGNET: 用于神经执行分诊与成本控制的Cypher门控

Nikodem Tomczak

发表机构 * Thulge Labs, Singapore(新加坡Thulge实验室)

AI总结 提出CYGNET门控机制,通过预执行验证和错误修正,在保证生成准确率的同时,高效拦截结构错误的Cypher查询并标记成本过高的执行计划。

详情
AI中文摘要

作为知识图谱代理的语言模型生成的Cypher查询可能因结构错误(在数据库中崩溃)或语义错误(执行但返回错误结果)而失败。我们在查询生成与生产级Neo4j数据库之间设置了一个预执行门。该门通过一个四后端链验证结构,最终在镜像图上执行,中位延迟为5.6毫秒。结构错误的查询被路由到一个修正器,该修正器通过语言模型迭代结构化错误反馈。在七个CypherBench模式(2348个问题,ACL 2025)上,该流水线在所有测试模型上保持了生成准确率,证实其作为安全防御层的有效性。修正器在五个模型上的成功率为81%至95%(平均89%)。在九个模式的模板生成语料库上,该门捕获了100%的解析错误、100%的约束违规以及100%的路径查询中带标签端点的模式引用错误,在1135个查询中零误报。属性兄弟交换(替换后的名称在目标标签上有效)得分为0%,标志着结构验证结束和语义验证开始的正式边界。基于规划器的成本门在执行前标记灾难性的计划结构。

英文摘要

Language models acting as agents over knowledge graphs generate Cypher queries that fail structurally (crashing at the database) or semantically (executing but returning wrong results). We place a pre-execution gate between query generation and a production Neo4j database. The gate validates structure through a four-backend chain culminating in execution against a mirror graph at 5.6 ms median latency. Structurally broken queries are routed to a corrector that iterates structured error feedback through a language model. On seven CypherBench schemas (2348 questions, ACL 2025) the pipeline maintains generation accuracy on every model tested, confirming it operates as a safe defensive layer. The corrector achieves 81% to 95% success across five models (mean 89%). On a template-generated corpus across nine schemas the gate catches 100% of parse errors, 100% of constraint violations, and 100% of schema-reference errors in path queries with labelled endpoints, at zero false positives across 1135 queries. Property sibling-swaps where the substituted name is valid on the target label score 0%, marking the formal boundary where structural validation ends and semantic validation must begin. A planner-based cost gate flags catastrophic plan structures before execution.

2606.04632 2026-06-04 cs.LG cs.CL 版本更新

VentAgent: When LLMs Learn to Breathe -- Multi-Objective Arbitration for ARDS Ventilation

VentAgent:当大语言模型学会呼吸——ARDS通气的多目标仲裁

Teqi Hao, Yuxuan Fu, Xiaoyu Tan, Shaojie Shi, Bohao Lv, Yinghui Xu, Xihe Qiu

发表机构 * School of Electronic and Electrical Engineering, Shanghai University of Engineering Science(上海工程技术大学电子与电气工程学院) Tencent Youtu Lab(腾讯优图实验室) Artificial Intelligence Innovation and Incubation Institute, Fudan University(复旦大学人工智能创新与孵化院)

AI总结 提出VentAgent分层框架,利用大语言模型作为透明仲裁者,通过感知-规划-编排三阶段将机械通气控制转化为动态多目标仲裁过程,在生理模拟器上优于强化学习和经典控制基线,并提供可解释的推理链。

详情
AI中文摘要

急性呼吸窘迫综合征(ARDS)的机械通气需要平衡竞争性的生理目标,包括氧合、肺保护和酸碱平衡。然而,当前的数据驱动方法,尤其是模仿回顾性电子健康记录(EHR)的方法,常常遭受模仿偏差。它们可能从不一致的临床演示中捕获表面相关性,例如将被动呼吸机设置与生存关联,因为这种设置在稳定患者中很常见,因此无法泛化到不稳定或分布外的表型。标准的强化学习(RL)方法也难以处理重症监护中的对抗性权衡,并常常产生不透明且临床可解释性有限的策略。为了解决这些局限性,我们引入了VentAgent,一个分层框架,其中大语言模型(LLM)作为机械通气的透明仲裁者。我们将通气控制重新表述为动态多目标仲裁过程,而非单目标优化。VentAgent将决策分解为三个可解释的阶段:感知、规划和编排。通过利用LLM的语义推理能力,它综合来自异构专家的策略,并通过显式协调机制解决冲突的临床优先级。在高保真生理模拟器上的评估表明,VentAgent优于最先进的RL和经典控制基线。此外,它将控制决策转化为人类可读的推理链,为重症监护自动化提供了更安全、更可解释和更自适应的范式。

英文摘要

Mechanical ventilation for Acute Respiratory Distress Syndrome (ARDS) requires balancing competing physiological goals, including oxygenation, lung protection, and acid-base homeostasis. However, current data-driven methods, especially those imitating retrospective Electronic Health Records (EHR), often suffer from imitation bias. They may capture superficial correlations from inconsistent clinical demonstrations, such as associating passive ventilator settings with survival because such settings are common in stable patients, and thus fail to generalize to volatile or out-of-distribution phenotypes. Standard Reinforcement Learning (RL) methods also struggle with the adversarial trade-offs of critical care and often produce opaque policies with limited clinical interpretability. To address these limitations, we introduce VentAgent, a hierarchical framework in which Large Language Models (LLMs) act as transparent arbitrators for mechanical ventilation. We reformulate ventilation control as a dynamic Multi-Objective Arbitration process rather than single-objective optimization. VentAgent decomposes decision-making into three interpretable stages: Perception, Planning, and Orchestration. By leveraging the semantic reasoning capabilities of LLMs, it synthesizes strategies from heterogeneous experts and resolves conflicting clinical priorities through an explicit coordination mechanism. Evaluations on a high-fidelity physiological simulator show that VentAgent outperforms state-of-the-art RL and classical control baselines. Moreover, it converts control decisions into human-readable reasoning chains, offering a safer, more interpretable, and adaptable paradigm for critical care automation.

2606.04628 2026-06-04 cs.CL cs.MA 版本更新

RAMPART: Registry-based Agentic Memory with Priority-Aware Runtime Transformation

RAMPART: 基于注册表的代理记忆与优先级感知运行时转换

Nikodem Tomczak

发表机构 * Nikodem Tomczak

AI总结 提出RAMPART编译时记忆模型和纯内存块注册表,通过可编程运行时操作和五种原语实现上下文组装,实验表明块位置和分组显著影响任务成功率,并实现零提示令牌成本的共享注册表协调。

详情
AI中文摘要

RAMPART是一种用于基于LLM的代理的编译时记忆模型和纯内存块注册表。上下文组装是一种可编程的运行时操作,其中内容根据显式策略(排序、包含和驱逐)从结构化注册表中编译。五种可组合原语(提升、门控、写入、驱逐、回滚)在编译前对命名可寻址块进行操作,且零提示令牌成本。来源标签和不可驱逐的作者标志实现了具有块级所有权的许可记忆模型。使用Qwen3-8B Q4进行的受控探测表明,编译时放置以及块与任务查询之间的结构关系影响任务成功,当任务跟随注册表时,性能在约第七个块位置急剧下降,当任务先于注册表时则在第十二个位置。将关键块与内容相邻的邻居分组,并将该组作为一个单元提升,在单块放置失败的位置将任务成功率提高数十个百分点。在Qwen2.5-7B、Llama-3.1-8B、Mistral-7B-v0.3和Qwen3-14B上的跨模型复现表明,内容启动效应在不同家族中出现在相同的绝对位置,幅度随模型强度变化。块分组使Mistral在最难注册表大小下的平均通过率提高约五倍,并且在中间注册表区域,使用干预的较小模型可以超越不使用干预的较大模型。相关性门控将提示成本降低67.8%,同时恢复83%的提升条件成功率。模式驱逐产生0%的调用,而存在模式时为100%,这是基于策略的方法无法通过构造保证的属性。共享注册表协调将代理间通信减少为方法调用,且零协调令牌成本。

英文摘要

RAMPART is a compile-time memory model and pure in-RAM block registry for LLM-based agents. Context assembly is a programmable runtime operation where content is compiled from a structured registry under explicit policy for ordering, inclusion, and eviction. Five composable primitives (promote, gate, write, evict, rollback) act on named addressable blocks before compilation at zero prompt-token cost. Provenance tags and non-evictable authorship flags implement a permissioned memory model with block-level ownership. Controlled probes with Qwen3-8B Q4 show that compile-time placement and the structural relationship between blocks and the task query affect task success, with the cliff falling at roughly the seventh block position when the task follows the registry and the twelfth when it precedes. Grouping the critical block with content-adjacent neighbours and promoting the group as a unit lifts task success by tens of percentage points at positions where single-block placement fails. Cross-model replication on Qwen2.5-7B, Llama-3.1-8B, Mistral-7B-v0.3, and Qwen3-14B shows the content-priming effect appears at the same absolute positions across families, with magnitude varying with model strength. Block grouping raises Mistral's mean pass rate roughly fivefold at the hardest registry size, and a smaller model with the intervention can outperform a larger model without it in the mid-registry zone. Relevance gating reduces prompt cost by 67.8% while recovering 83% of the promoted-condition success rate. Schema eviction produces 0% invocations against 100% with the schema present, a property policy-based approaches cannot guarantee by construction. Shared-registry coordination reduces inter-agent communication to a method call at zero coordination token cost.
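A rough sketch of what a block registry with the five named primitives might look like. The class, the method signatures, and the semantics chosen here (promote moves a block to the front, gate toggles inclusion, rollback undoes the last mutation) are assumptions made for illustration, not RAMPART's actual API.

```python
import copy
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Block:
    text: str
    evictable: bool = True  # False models a non-evictable authorship flag
    included: bool = True   # toggled by gate()

class Registry:
    """Hypothetical in-memory registry of named blocks."""

    def __init__(self) -> None:
        self.order: List[str] = []
        self.blocks: Dict[str, Block] = {}
        self._history: List[Tuple[List[str], Dict[str, Block]]] = []

    def _snapshot(self) -> None:
        self._history.append((list(self.order), copy.deepcopy(self.blocks)))

    def write(self, name: str, text: str, evictable: bool = True) -> None:
        self._snapshot()
        if name not in self.blocks:
            self.order.append(name)
        self.blocks[name] = Block(text, evictable)

    def promote(self, name: str) -> None:
        self._snapshot()
        self.order.remove(name)
        self.order.insert(0, name)

    def gate(self, name: str, included: bool) -> None:
        self._snapshot()
        self.blocks[name].included = included

    def evict(self, name: str) -> bool:
        if not self.blocks[name].evictable:
            return False  # protected blocks stay in the registry
        self._snapshot()
        self.order.remove(name)
        del self.blocks[name]
        return True

    def rollback(self) -> None:
        if self._history:
            self.order, self.blocks = self._history.pop()

    def compile(self) -> str:
        """Assemble the prompt from included blocks in registry order."""
        return "\n".join(
            self.blocks[n].text for n in self.order if self.blocks[n].included
        )

reg = Registry()
reg.write("system", "You are a helpful agent.", evictable=False)
reg.write("tool_schema", "search(query: str)")
reg.write("task", "Find the latest report.")
reg.promote("task")
reg.gate("tool_schema", False)
prompt = reg.compile()
reg.rollback()  # undo the gate
restored = reg.compile()
```

All five operations run on the registry in ordinary program code before `compile()` is called, so none of them adds tokens to the prompt. A gated-out or evicted block simply never appears in the compiled context.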

2606.04612 2026-06-04 cs.CL 版本更新

Hybrid Adversarial Defence for Natural Language Understanding Tasks

混合对抗防御用于自然语言理解任务

Manar Abouzaid, Yang Wang, Chenghua Lin, Stuart E. Middleton

发表机构 * School of Electronics and Computer Science, University of Southampton, UK(南安普顿大学电子与计算机科学学院) Department of Computer Science, University of Manchester, UK(曼彻斯特大学计算机科学系)

AI总结 提出一种结合熵、不确定性和几何特征的混合防御框架,在多个自然语言理解数据集上同时提升了干净任务性能和对抗鲁棒性。

详情
AI中文摘要

大型语言模型(LLMs)既容易产生幻觉,也容易受到对抗性操纵。尽管这些问题密切相关,但现有的防御方法通常分别处理它们。我们研究了一种混合防御框架,该框架结合了旨在减少幻觉的基于熵的模型,以及旨在降低脆弱性的基于不确定性的模型和基于几何的模型。在自然语言理解数据集(FEVER、HotpotQA、CSQA、SIQA)上的域内测试中,我们发现我们的混合模型提高了干净任务性能(准确率提升高达43.34%)和对抗鲁棒性(准确率提升高达64.92%,攻击成功率降低62.27%)。对于分布外数据集(AeroEngQA、CPIQA),我们的混合模型表现出类似的对抗鲁棒性(准确率提升高达57.14%)。对于提示注入(SafeGuard)和越狱检测(AdvBench、DAN)数据集,我们的混合模型也非常强大(与最先进的基线模型相比,攻击成功率降低高达51%)。总体而言,我们的结果表明,对于域内和分布外任务,结合熵、不确定性和几何特征比单独使用任何单一特征都能提供更有效的防御策略。

英文摘要

Large Language Models (LLMs) are vulnerable both to hallucination and adversarial manipulation. Although these problems are closely related, existing defences typically address them separately. We investigate a hybrid defence framework that combines entropy-based models, designed to reduce hallucinations, with uncertainty-based models and geometric-based models, designed to reduce vulnerability. Under in-domain tests on Natural Language Understanding datasets (FEVER, HotpotQA, CSQA, SIQA) we find our hybrid model improves both clean-task performance (up to 43.34% increase in accuracy) and adversarial robustness (up to 64.92% improvement in accuracy and 62.27% reduction in attack success rate). For out-of-distribution datasets (AeroEngQA, CPIQA) we see similar adversarial robustness from our hybrid model (up to 57.14% improvement in accuracy). For prompt injection (SafeGuard) and jailbreak detection (AdvBench, DAN) datasets our hybrid model is also very strong (up to 51% reduction in attack success rate compared to state of the art baseline models). Overall, our results show that combining entropy, uncertainty and geometric features provides a more effective defence strategy than using any single feature alone for both in-domain and out-of-distribution tasks.

2606.04596 2026-06-04 cs.CL 版本更新

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

多视频摘要中位置偏差的系统评估:基于多模态大语言模型

Huangchen Xu, Yuan Wu, Yi Chang

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University(知识驱动人机智能工程研究中心,吉林大学) International Center of Future Science, Jilin University(未来科学国际中心,吉林大学)

AI总结 本研究系统评估了多模态大语言模型在多视频摘要任务中的位置偏差,通过构建基准和三种互补指标揭示了领域与模型依赖的偏差特性,并分析了提示级缓解方法。

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地用于视频理解,但它们在多视频输入下的可靠性仍知之甚少。我们研究了多视频摘要中的位置偏差,即每个视频摘要的质量可能随视频输入槽位的变化而变化,即使底层内容不变。我们从ActivityNet和新闻视频构建了一个基准,涵盖烹饪、家庭、休闲和新闻场景,包含两个和四个视频输入。我们评估了九个开源和专有MLLMs,并使用三种互补指标测量位置效应:覆盖率、方向性位置偏差(DPB)和中间边缘差距(MEG)。我们的结果表明,位置效应是领域和模型依赖的:即使中间位置表现不佳,有符号的方向性偏差也可能很小;增加视觉或生成预算并不能均匀地消除不平衡。我们进一步分析了提示级缓解方法。总之,结果表明多视频摘要仍然对输入协议和位置敏感,这促使开发更鲁棒的、顺序不变的多模态系统。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.

2606.04591 2026-06-04 cs.CL cs.CV 版本更新

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

多模态长对话中的细粒度片段检索

Hanbo Bi, Zhiqiang Yuan, Chongyang Li, Qiwei Yan, Zexi Jia, Jiapei Zhang, Xiaoyue Duan, Yingchao Feng, Jinchao Zhang, Jie Zhou

发表机构 * Pattern Recognition Center, WeChat AI, Tencent Inc(腾讯公司微信AI模式识别中心) Aerospace Information Research Institute, Chinese Academy of Sciences(中国科学院空天信息创新研究院)

AI总结 提出细粒度片段检索任务,通过强化学习训练的生成式检索模型F2RVLM和两阶段系统FFRS,实现多模态长对话中多语句、多图像片段的精准定位。

详情
AI中文摘要

随着多模态交流平台的广泛采用,文本和图像交织的长对话变得越来越普遍。用户通常需要检索与特定主题相关的连贯对话片段,而不是孤立的语句。我们提出了细粒度片段检索(FFR),用于在多模态长对话中定位语义相关的多语句、多图像片段。我们探索了两种设置:(1)单对话内的FFR,从给定对话中检索片段;(2)对话语料库内的FFR,从大规模语料库中为开放域场景检索片段。对于(1),我们引入了F2RVLM,一种基于生成的检索模型,使用强化学习训练,通过多目标奖励和难度感知课程采样来增强片段连贯性。对于(2),我们开发了FFRS,一个两阶段系统,结合了离线片段级索引和在线检索。具体来说,每个对话被分解为最小语义片段,由片段嵌入模型(FEM)编码到向量数据库中;在推理时,FEM快速召回Top-K候选,F2RVLM进行细粒度推理以识别最相关的子内容。为支持FFR,我们构建了MLDR,迄今为止最长的多模态对话检索数据集,以及一个基于微信的真实世界测试集。在两个基准上的实验表明,F2RVLM和FFRS在单对话和语料库级别的FFR上始终取得优越性能。

英文摘要

With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.

2606.04588 2026-06-04 cs.CL 版本更新

VCIFBench: Evaluating Complex Instruction Following for Video Understanding

VCIFBench:评估视频理解中的复杂指令遵循能力

Huangchen Xu, Yuan Wu, Yi Chang

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University(知识驱动人机智能工程研究中心,吉林大学) International Center of Future Science, Jilin University(未来科学国际中心,吉林大学)

AI总结 提出VCIFBench基准,通过混合验证流水线评估多模态大模型在视频理解中遵循内容、格式、风格和结构约束的复杂指令能力,实验表明联合约束满足仍具挑战,DPO训练可提升性能。

详情
AI中文摘要

多模态大语言模型在视频理解方面取得了快速进展,然而现有基准主要依赖简单提示,且提供的证据有限,无法判断模型是否能满足明确的输出约束。我们引入了VCIFBench,这是一个用于评估视频理解中复杂指令遵循能力的基准。VCIFBench从基准适配和直接视频接地提示中构建了富含约束的指令,涵盖内容、格式、风格和结构要求,并通过混合验证流水线评估模型输出。该基准包含306个可满足的测试指令、一个540对的DPO偏好数据集以及一个30项的冲突诊断子集。在10个MLLM上的实验表明,联合约束满足仍然具有挑战性。我们进一步表明,在VCIFBench数据上进行DPO训练可以提高指令遵循性能。

英文摘要

Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating complex instruction following in video understanding. VCIFBench constructs constraint-rich instructions from both benchmark-adapted and directly video-grounded prompts, covering content, format, style, and structure requirements, and evaluates model outputs with a hybrid verification pipeline. The benchmark contains 306 satisfiable test instructions, a 540-pair DPO preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging. We further show that DPO training on VCIFBench data can improve instruction-following performance.

2606.04557 2026-06-04 cs.CL cs.IR cs.LG 版本更新

Cartridges at Scale: Training Modular KV Caches over Large Document Collections

大规模弹匣:训练模块化KV缓存以处理大型文档集合

Momchil Hardalov, Gonzalo Iglesias, Adrià de Gispert

发表机构 * Amazon AGI(亚马逊人工智能研究院)

AI总结 提出Cartridges at Scale (CAS)框架,通过动态干扰混合和内存高效预算管理器实现大规模多弹匣训练,在减少预填充开销的同时保持准确性,性能优于单块弹匣10-31点,接近全上下文学习。

Comments 21 pages, 5 figures, 17 tables

详情
AI中文摘要

大型语言模型能够处理长上下文,但预填充数百万个标记是浪费的,因为许多内容在查询之间保持不变。弹匣通过将文档集合提炼为可重用的键值(KV)缓存来解决这一问题,从而消除预填充同时保持准确性。这种方法的一个关键限制是弹匣是单块且非组合的:将整个集合编码为单个KV块无法扩展,并且天真地混合单独训练的弹匣会使性能下降到接近随机水平。我们引入了Cartridges at Scale (CAS),这是一个可扩展的多弹匣学习训练框架,具有动态干扰混合和内存高效的预算管理器,可在GPU和持久存储之间轮换数百个每文档弹匣。我们的方法可扩展到超过一百万个标记的集合,在可比标记预算下,比单块弹匣提高10-31点。即使在高度压缩下,Oracle弹匣准确率也接近完全上下文学习的2-6点范围内。当与检索结合用于弹匣选择时,CAS匹配或超过传统RAG准确率,同时消耗的提示标记减少3-4倍。

英文摘要

Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-value (KV) caches that eliminate prefilling while preserving accuracy. A critical limitation of this approach is that cartridges are monolithic and non-compositional: encoding an entire collection into a single KV block does not scale, and naively mixing cartridges trained in isolation collapses performance to near chance. We introduce Cartridges at Scale (CAS), a training framework for scalable multi-cartridge learning with dynamic distractor mixing and a memory-efficient budget manager that rotates hundreds of per-document cartridges between GPU and persistent storage. Our approach scales to collections exceeding a million tokens, improving over a monolithic cartridge by 10-31 points at comparable token budgets. Oracle cartridge accuracy falls within 2-6 points of full in-context learning even at high compression. When paired with retrieval for cartridge selection, CAS matches or exceeds conventional RAG accuracy while consuming 3-4x fewer prompt tokens.

2606.04555 2026-06-04 cs.CL cs.AI 版本更新

Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents

时间顺序对智能体记忆至关重要:面向长程智能体的线段树

Yifan Simon Liu, Liam Gallagher, Faeze Moradi Kalarde, Jiazhou Liang, Armin Toroghi, Scott Sanner

发表机构 * University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(人工智能向量研究所)

AI总结 提出线段树记忆架构SegTreeMem,通过在线右边缘更新规则保持对话历史的时间顺序,结合层次化时间上下文进行检索,在长程记忆基准上优于现有方法。

详情
AI中文摘要

长程对话智能体需要通过与用户交互不断演化的事件、任务和目标进行互动。这些历史记录本质上是时间性的,然而许多现有的记忆系统主要按主题相似性组织信息,可能忽略事件发生的顺序。我们引入线段树记忆(Segment Tree Memory,简称SegTreeMem),这是一种将对话历史表示为按时间顺序排列的线段树的记忆架构。SegTreeMem通过在线最右边缘更新规则逐步插入新话语,在形成层次化记忆片段的同时保持时间顺序。在检索时,SegTreeMem通过树传播相关性分数,将局部语义匹配与层次化时间上下文相结合。在三个长程记忆基准和两个LLM骨干网络上,SegTreeMem在答案质量上优于平面检索、图结构记忆和树结构记忆基线。额外的时间顺序排列分析表明,性能提升依赖于在记忆构建过程中保持时间顺序,这支持了时间顺序是智能体记忆关键结构的观点。

英文摘要

Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity and may ignore the order in which events occur. We introduce Segment Tree Memory, or SegTreeMem, a memory architecture that represents conversation history as a temporally ordered Segment Tree over utterances. SegTreeMem incrementally inserts new utterances through an online rightmost-frontier update rule, preserving chronological order while forming hierarchical memory segments. For retrieval, SegTreeMem propagates relevance scores through the tree to combine local semantic matching with hierarchical temporal context. Across three long-horizon memory benchmarks and two LLM backbones, SegTreeMem improves answer quality over flat retrieval, graph-structured memory, and tree-structured memory baselines. Additional temporal-order permutation analysis shows that the performance gain depends on preserving temporal order during memory construction, supporting the claim that temporal order is a key structure for agentic memory.
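
The abstract names an online rightmost-frontier update rule without spelling it out. One way such a rule could work is sketched below, assuming new utterances are appended as leaves and only equal-sized subtrees at the right edge are merged; the `Node` and `TemporalSegmentTree` names and the merge criterion are assumptions, not the SegTreeMem algorithm:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    start: int                    # index of the first utterance covered
    end: int                      # index of the last utterance covered (inclusive)
    text: Optional[str] = None    # set on leaves only
    children: List["Node"] = field(default_factory=list)

    @property
    def size(self):
        return self.end - self.start + 1


class TemporalSegmentTree:
    """Hypothetical sketch: append at the right edge, merge equal-sized neighbors."""

    def __init__(self):
        self.frontier = []   # roots of completed subtrees, oldest first
        self.count = 0

    def insert(self, utterance):
        self.frontier.append(Node(self.count, self.count, utterance))
        self.count += 1
        # Only the two rightmost subtrees are ever merged, so older segments
        # are never reordered and chronological order is preserved.
        while (len(self.frontier) >= 2
               and self.frontier[-1].size == self.frontier[-2].size):
            right = self.frontier.pop()
            left = self.frontier.pop()
            self.frontier.append(Node(left.start, right.end, children=[left, right]))

    def leaves(self):
        out = []

        def walk(node):
            if node.text is not None:
                out.append(node.text)
            for child in node.children:
                walk(child)

        for root in self.frontier:
            walk(root)
        return out


utterances = [f"u{i}" for i in range(5)]
tree = TemporalSegmentTree()
for u in utterances:
    tree.insert(u)
spans = [(root.start, root.end) for root in tree.frontier]
```

After five insertions the frontier holds one segment covering utterances 0 to 3 and one leaf for utterance 4, and a left-to-right walk returns the utterances in their original order.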

2606.04552 2026-06-04 cs.CL q-bio.GN 版本更新

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

LDARNet: 用于基因组建模的DNA自适应表示网络与可学习分词

Daria Ledneva, Denis Kuznetsov

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出LDARNet,一种结合动态分块和双向路由的120M参数层次基因组基础模型,在27个任务中优于更大模型,并发现学习到的边界与生物学基序对齐。

详情
AI中文摘要

基因组基础模型越来越多地采用大型语言模型架构,但几乎普遍依赖于固定的分词方案,如$k$-mers、BPE或单核苷酸,这些方案强加了可能掩盖生物学相关结构的任意序列边界。我们提出了LDARNet,一个120M参数的层次基因组基础模型,它将H-Net风格的动态分块从自回归生成适应到掩码语言建模,结合了BiMamba-2状态空间层与局部注意力、双向路由以及基于比值的正则化器,以在无监督的情况下诱导自适应标记边界。在来自Nucleotide Transformer和Genomic Benchmarks套件的27个任务上进行微调后,LDARNet在紧凑模型(<300M参数)中取得了11/18的胜率,并在5个组蛋白修饰任务上取得了最先进的结果,优于高达20倍大的模型。一个FLOPs匹配的对照实验将学习到的路由确定为这些增益的来源:在相同计算量下,学习到的边界在组蛋白任务上比固定网格边界高出多达14个百分点。进一步的核苷酸分辨率分析表明,学习到的边界在无监督的情况下与典型的启动子基序和剪接连接点对齐,为基因组基础模型中的自适应分词提供了生物学解释。

英文摘要

Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as $k$-mers, BPE, or single nucleotides, which impose arbitrary sequence boundaries that may obscure biologically relevant structure. We present LDARNet, a 120M-parameter hierarchical genomic foundation model that adapts H-Net-style dynamic chunking from autoregressive generation to masked language modeling, combining BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-based regularizer to induce adaptive token boundaries without supervision. Fine-tuned on 27 tasks from the Nucleotide Transformer and Genomic Benchmarks suites, LDARNet achieves 11/18 wins among compact models ($<$300M parameters) and state-of-the-art results on 5 histone modification tasks, outperforming models up to 20$\times$ larger. A FLOPs-matched controlled experiment isolates learned routing as the source of these gains: learned boundaries beat fixed-grid boundaries by up to 14 percentage points on histone tasks at identical compute. Nucleotide-resolution analysis further shows that the learned boundaries align with canonical promoter motifs and splice junctions without supervision, providing a biological interpretation for adaptive tokenization in genomic foundation models.

2606.04535 2026-06-04 cs.CL cs.AI 版本更新

Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models

扩散大语言模型中用于格式约束生成的动态填充锚点

Boyan Han, Yiwei Wang, Yi Song, Yujun Cai, Chi Zhang

发表机构 * AGI Lab, Westlake University, China(西湖大学AGI实验室,中国) University of California, Merced, USA(加州大学默塞德分校,美国) Teeni AI, China(Teeni AI,中国) The University of Queensland, Australia(昆士兰大学,澳大利亚)

AI总结 提出动态填充锚点(DIA),一种无需训练的方法,通过动态估计结束锚点位置调整生成长度,确保格式约束下的结构正确性和语义连贯性,在GSM8K和MATH上实现零样本性能提升。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

详情
AI中文摘要

扩散大语言模型(dLLMs)提供双向注意力和并行生成,使其能够利用全局上下文并自然支持格式约束任务,如可解析的JSON或推理模板。虽然直接的固定锚点可以强制执行此类约束,但它们通常强加刚性跨度,导致推理截断或内容冗余。为了克服这一点,我们提出了动态填充锚点(DIA),一种无需训练的方法,在迭代填充之前动态估计结束锚点位置以调整生成长度。这种灵活机制确保了结构正确性和语义连贯性,避免了固定跨度方法的低效。在推理基准上的实验表明,DIA显著提高了格式合规性和答案准确性,在GSM8K和MATH上实现了显著的零样本增益。这些结果确立了DIA作为通往可靠、结构感知生成的一条稳健路径。

英文摘要

Diffusion large language models (dLLMs) offer bidirectional attention and parallel generation, enabling them to exploit global context and naturally support format-constrained tasks like parseable JSON or reasoning templates. While straightforward fixed anchors can enforce such constraints, they often impose rigid spans, leading to truncated reasoning or redundant content. To overcome this, we propose Dynamic Infilling Anchors (DIA), a training-free method that dynamically estimates end-anchor positions to adjust generation length before iterative infilling. This flexible mechanism ensures structural correctness and semantic coherence, avoiding the inefficiencies of fixed-span methods. Experiments on reasoning benchmarks demonstrate that DIA substantially improves format compliance and answer accuracy, achieving significant zero-shot gains on GSM8K and MATH. These results establish DIA as a robust pathway toward reliable, structure-aware generation.

2606.04511 2026-06-04 cs.CL cs.LG 版本更新

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

SparDA: 用于高效长上下文LLM推理的稀疏解耦注意力

Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa

发表机构 * NVIDIA Thinking Machines Lab ByteDance Seed MIT

AI总结 提出SparDA架构,通过引入第四投影Forecast实现KV缓存预取与注意力解耦,减少稀疏选择开销,在长上下文推理中实现1.25倍预填充加速和1.7倍解码加速。

详情
AI中文摘要

稀疏注意力减少了长上下文LLM推理的计算和内存带宽。然而,仍然存在两个关键挑战:(1)KV缓存容量随序列长度增长,卸载到CPU内存引入了PCIe传输瓶颈;(2)稀疏选择步骤本身保持$O(T^2)$复杂度,在长上下文中可能主导注意力成本。我们提出SparDA,一种解耦的稀疏注意力架构,它在Query、Key和Value之外引入了第四个逐层投影——Forecast。Forecast预测下一层所需的KV块,从而实现超前选择,将CPU到GPU的预取与当前层执行重叠。由于Forecast与注意力查询解耦,我们的GQA实现为每个GQA组使用一个Forecast头,相比原始多头选择器减少了选择开销。SparDA增加了<0.5%的参数,并通过匹配原始选择器的注意力分布仅训练Forecast投影。在两个稀疏预训练的8B模型上,SparDA匹配或略微提高了准确性,并且相比稀疏注意力卸载基线,提供了高达1.25倍的预填充加速和1.7倍的解码加速。通过使单个GPU上可行的批量大小更大,SparDA进一步实现了比非卸载稀疏基线高达5.3倍的解码吞吐量。我们的源代码可在https://github.com/NVlabs/SparDA获取。

英文摘要

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains $O(T^2)$ complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi-head selector. SparDA adds $<$0.5% parameters and trains only the Forecast projections by matching the original selector's attention distribution. On two sparse-pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25$\times$ prefill speedup and 1.7$\times$ decode speedup over the sparse-attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3$\times$ higher decode throughput than the non-offload sparse baseline. Our source code is available at https://github.com/NVlabs/SparDA.
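
The lookahead idea can be illustrated with a toy selection step: a forecast vector produced at the current layer scores the next layer's KV blocks, and only the selected blocks that are not already on the GPU need to be prefetched. This is a sketch under stated assumptions; the per-block summary vectors, the dot-product scoring, and the function names are hypothetical and not the SparDA kernels:

```python
def select_blocks(forecast, block_summaries, k):
    """Return the ids of the k highest-scoring KV blocks, in ascending id order."""
    scores = [sum(f * s for f, s in zip(forecast, summary))
              for summary in block_summaries]
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sorted(top)


def plan_prefetch(selected, resident_on_gpu):
    """Blocks that must be copied from CPU memory before the next layer runs."""
    return [block for block in selected if block not in resident_on_gpu]


# Toy values: one forecast vector and four block summaries for the next layer.
forecast = [1.0, 0.0]
block_summaries = [[0.1, 9.0], [0.9, 0.0], [0.5, 1.0], [0.7, 2.0]]
selected = select_blocks(forecast, block_summaries, k=2)
to_prefetch = plan_prefetch(selected, resident_on_gpu={3})
```

In a real system the copy for `to_prefetch` would be issued asynchronously while the current layer computes; that overlap is what the abstract credits for the speedup.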

2606.04507 2026-06-04 cs.CL cs.AI 版本更新

Self-Evolving Deep Research via Joint Generation and Evaluation

通过联合生成与评估实现自我进化的深度研究

Han Zhu, Chengkun Cai, Yuanfeng Song, Xing Chen, Sirui Han, Yike Guo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) ByteDance, China(字节跳动) University College London(伦敦大学学院)

AI总结 提出SCORE框架,通过共享参数的协同进化训练联合优化评估器与求解器,解决深度研究报告生成中奖励不可验证的问题,持续提升生成质量。

详情
AI中文摘要

大型语言模型(LLM)在日常应用中越来越广泛,其中深度研究是一项特别重要的能力。与传统的问答(QA)任务不同,深度研究报告生成缺乏明确的真实答案,这使得奖励设计本质上不可验证,限制了有效的强化学习。现有方法通过LLM作为评判者和查询相关的评估标准来缓解这一挑战,但它们仍然依赖静态评估器,无法随着求解器的改进而调整标准,导致优化压力不足并最终饱和。我们通过一个用于深度研究评估和生成的自我进化协同进化训练框架(SCORE)来解决这一限制,该框架在共享参数的学习过程中紧密耦合评估器和求解器。我们不将生成和评估视为孤立的模块,而是利用它们的内在联系,在单个共享参数模型中实现联合改进。为了约束这一过程,我们引入了一个元控制机制,该机制根据求解器的性能动态控制评估环境,鼓励有效的评估维度和足够深入的评估器搜索。在深度研究基准上的大量实验表明,报告生成质量持续提升,表明协同进化评估和生成是训练开放式研究代理的一个有前景的方向。

英文摘要

Large Language Models (LLMs) are increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a Self-evolving CO-evolutionary training framework for deep REsearch evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To constrain this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.

2606.04500 2026-06-04 cs.CL 版本更新

SANE Schema-aware Natural-language Evaluation of Biological Data

SANE:生物数据的模式感知自然语言评估

Rolf Gattung, Martin Krueger, Markus Reischl

发表机构 * Institute for Automation and Applied Informatics (IAI), Karlsruhe Institute of Technology (KIT)(自动化与应用信息研究所(IAI)、卡尔斯鲁厄理工学院(KIT))

AI总结 提出SANE范式,通过模式感知的自动生成基准,评估少样本大语言模型在特定领域文本到SQL任务中的可靠性,发现结构化提示和约束可实现准确查询生成。

Comments 5 pages, 3 figures, submitted but not yet reviewed by BMT2026

详情
AI中文摘要

高通量显微镜生成大型结构化数据集,捕捉细胞对药理扰动的反应,但访问这些数据集通常需要SQL专业知识。大语言模型提供了一种自然语言替代方案,但其幻觉倾向引发了对结果可靠性的担忧。我们提出SANE(模式感知自然语言评估),一种用于特定领域文本到SQL评估的新范式:基于模式、自动生成的基准,与实际和特定的实验结构相关联。SANE使评估更具可扩展性、系统性和可重复性。使用SANE,我们评估了一个少样本大语言模型,并表明在具有结构化提示和约束的受限模式下,无需任何模型训练或微调即可实现准确的查询生成。大多数失败源于模糊或未明确指定的输入,表现为过度谨慎的澄清请求或对应先消除歧义的查询的回答,而不是错误的SQL生成。这些结果表明,当与模式感知提示相结合时,少样本大语言模型可以在定义良好的领域内提供可靠的数据库访问。

英文摘要

High-throughput microscopy generates large, structured datasets capturing cellular responses to pharmacological perturbations, but accessing these datasets typically requires SQL expertise. Large language models offer a natural-language alternative, yet their tendency to hallucinate raises concerns about result reliability. We present SANE (Schema-Aware Natural-language Evaluation), a novel paradigm for domain-specific text-to-SQL evaluation: schema-grounded, automatically generated benchmarks tied to real and specific experimental structure. SANE makes evaluation more scalable, systematic, and reproducible. Using SANE, we evaluate a few-shot large language model and show that, under constrained schemas with structured prompting and guardrails, accurate query generation is achievable without any model training or fine-tuning. Most failures stem from ambiguous or underspecified inputs and manifest as overly cautious clarification requests or answers to queries that should first be disambiguated, rather than incorrect SQL generation. These results indicate that few-shot large language models can provide reliable database access in well-defined domains when combined with schema-aware prompting.

2606.04486 2026-06-04 cs.CR cs.CL cs.LG stat.ML 版本更新

Global Sketch-Based Watermarking for Diffusion Language Models

基于全局草图的扩散语言模型水印

Daniel Zhao

发表机构 * Harvard University(哈佛大学)

AI总结 提出一种针对掩码扩散语言模型的全局向量草图水印方法,通过控制文本的整体统计特征实现与局部上下文无关的检测。

详情
AI中文摘要

语言模型的水印方法在自回归设置中已被广泛研究,其中令牌是顺序生成的。这些工作主要关注局部上下文方案,该方案根据前序令牌扰动下一个令牌的分布。在扩散语言模型中,许多未解析位置的分布被联合采样,使得整个序列的加性统计在生成过程中是可处理的。我们提出了一种针对掩码扩散语言模型的水印,该水印控制文本的全局向量草图表示。与上下文相关的水印相比,草图公式将检测与生成过程中看到的局部上下文解耦,从而产生一个顺序无关的统计量和一个不表现为简单令牌偏差的水印规则。我们分析了该方法的失真、合理性和鲁棒性。

英文摘要

Watermarking methods for language models have been studied extensively in the autoregressive setting, where tokens are generated sequentially. These works largely focus on local-context schemes that perturb the next token's distribution as a function of its preceding tokens. In diffusion language models, distributions over many unresolved positions are jointly sampled, allowing additive statistics of the entire sequence to be tractable during generation. We propose a watermark for masked diffusion language models that controls a global, vector-valued sketch representation of the text. Compared to context-dependent watermarking, the sketch formulation decouples detection from the local contexts seen during generation, resulting in an order-agnostic statistic and a watermarking rule which does not manifest as a simple token bias. We analyze the distortion, soundness, and robustness properties of the method.
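
Why an additive, vector-valued statistic is order-agnostic can be seen in a toy example: if each token contributes a fixed keyed vector and the sketch is their sum, any permutation of the tokens yields the same sketch. The keyed SHA-256 mapping, the dimension, and the function names below are assumptions for illustration only; the paper's actual sketch and detection rule are not given in the abstract:

```python
import hashlib


def token_vector(token, key, dim=8):
    """Map a token to a fixed pseudo-random vector of +1/-1 entries (hypothetical)."""
    digest = hashlib.sha256(f"{key}:{token}".encode("utf-8")).digest()
    return [1 if digest[i] & 1 else -1 for i in range(dim)]


def sketch(tokens, key, dim=8):
    """Sum the per-token vectors; addition makes the result independent of order."""
    total = [0] * dim
    for token in tokens:
        for i, value in enumerate(token_vector(token, key, dim)):
            total[i] += value
    return total


tokens = ["the", "model", "writes", "text", "in", "parallel"]
forward = sketch(tokens, key="secret")
shuffled = sketch(list(reversed(tokens)), key="secret")
```

Here `forward` and `shuffled` are equal, which is the property that lets a detector ignore the order in which positions were resolved during generation.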

2606.04483 2026-06-04 cs.CL 版本更新

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

分布外声音:同人小说子类型作为对齐LLM的通用白话越狱

Zhongze Luo, Ruihe Shi, Zhenshuai Yin, Haoyue Liu, Weixuan Wan, Xiaoying Tang

发表机构 * School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen)(香港中文大学(深圳)科学与工程学院) The Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen)(深圳未来网络智能研究院) The Guangdong Provincial Key Laboratory of Future Networks of Intelligence(广东省未来网络智能重点实验室) School of Microelectronics, Xi’an Jiaotong University(西安交通大学微电子学院)

AI总结 本文发现安全训练覆盖不足的自然人类写作语域是对齐LLM的真正失败模式,并提出首个利用真实同人小说子类型作为通用攻击载体的越狱方法,显著提升攻击成功率。

Comments 23 pages

详情
AI中文摘要

现有的针对对齐LLM的越狱方法是离散的产物,其表面形式容易被指纹识别和修补。我们认为真正的失败模式不是任何特定的提示,而是安全训练覆盖不足的整个自然人类写作语域。基于这一见解,我们引入了第一个使用真实同人小说子类型作为通用攻击载体的越狱家族:一种创意写作元条件基于来自十二个Archive of Our Own (AO3)子类型之一的段落,有害行为被嵌入为结果场景的高潮。该构造不需要攻击者LLM,也不需要针对每个目标进行适应。在HarmBench和JailbreakBench的并集上对八个对齐LLM,该攻击在四评委集成下将平均ASR从0.278提升到0.731;因子分解显示增益由语域而非长度或结构带来。两种主动防御扩大了而非缩小了白话与基线的比率,表明针对模板的防御仅仅将攻击者引向像我们这样的基于语域的攻击。我们还提出了SAGA-A4,一种静态的四轮扩展,实现了平均ASR 0.924,大大超过了现有的三种多轮方法。

英文摘要

Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.

2606.04479 2026-06-04 cs.CV cs.AI cs.CL 版本更新

Evaluating Reasoning Fidelity in Visual Text Generation

评估视觉文本生成中的推理保真度

Jiajun Hong, Jiawei Zhou

发表机构 * Stony Brook University(石溪大学)

AI总结 通过长文本渲染、事实知识探测、上下文理解和多步推理等任务,评估当前文本到图像模型在视觉文本生成中是否忠实保持推理能力,发现其常产生语义错误和逻辑不一致,与纯文本模型存在显著差距。

Comments Peer reviewed and accepted at CVPR 2026 at the GRAIL-V (Grounded Retrieval and Agentic Intelligence for Vision-Language) workshop (non-archival track)

详情
AI中文摘要

最近的文本到图像(T2I)模型能够在图像中渲染高度清晰且结构良好的文本,从而支持文档生成和幻灯片生成等应用。然而,当复杂解决方案必须直接通过渲染文本表达时,这些系统是否忠实地保留了推理能力,还是仅仅模仿表面模式,目前尚不清楚。我们通过评估视觉文本生成中的推理保真度来研究这一问题,其中模型必须将完整的推理过程表达为图像。我们的评估包括长文本渲染、事实知识探测、上下文理解和多步推理。在这些设置中,我们发现当前的T2I模型经常产生语义错误、逻辑不一致和错误的中间步骤,即使渲染的文本在视觉上清晰。这些失败与纯文本模型在相同任务上的强推理表现形成对比。我们的发现揭示了视觉文本生成与程序性推理之间的显著差距,促使更可靠的视觉文本推理。

英文摘要

Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

2606.04466 2026-06-04 cs.CL 版本更新

Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

学习什么:小语言模型推理中SFT-then-RL的阶段特定数据集

Chongyang He, Rui Zhang, Zixuan Wang, Xin Li

发表机构 * Tsinghua University(清华大学) National University of Singapore(新加坡国立大学) DiDi(滴滴出行) University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出一种难度感知的SFT-then-RL框架,通过阶段特定数据集(SFT阶段使用桥接机制,RL阶段使用批判微调)协调数据难度,提升小语言模型推理性能。

Comments 25 pages, 12 figures

详情
AI中文摘要

后训练小语言模型(SLM)进行推理通常遵循SFT-then-RL流程,但现有工作很少考虑每个阶段应该学习什么数据。我们认为数据策略应与SFT和RL的不同角色对齐:SFT更适合获取尚未掌握的推理技能,而RL更适合巩固模型已部分掌握的技能。基于这一原则,我们提出了一种难度感知的SFT-then-RL框架,将训练数据组织成阶段特定的数据集。对于SFT阶段的困难样本,我们引入桥接机制,将教师生成的原始推理轨迹转化为SLM更易学习的监督信号。对于RL阶段仍未解决的困难样本,我们应用批判微调,将零奖励失败转化为诊断、修复和新的推理轨迹监督,用于下一SFT阶段。在两个SLM上跨越五个推理基准的实验表明,我们的方法在代表性SFT、蒸馏和RL基线上持续改进。我们的结果强调了协调SFT和RL之间数据难度对于有效SLM推理后训练的重要性。

英文摘要

Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the distinct roles of SFT and RL: SFT is better suited for acquiring not-yet-mastered reasoning skills, while RL is better suited for consolidating skills that the model can already partially access. Based on this principle, we propose a difficulty-aware SFT-then-RL framework that organizes training data into stage-specific sets. For hard samples in the SFT stage, we introduce a Bridge mechanism that transforms raw teacher-generated reasoning traces into more learnable supervision for SLMs. For hard samples that remain unsolved during RL, we apply Critique Fine-Tuning by converting all-zero-reward failures into diagnostic, repair, and new reasoning trace supervision for the next SFT stage. Experiments on two SLMs across five reasoning benchmarks show that our method consistently improves over representative SFT, distillation, and RL baselines. Our results highlight the importance of coordinating data difficulty across SFT and RL for effective SLM reasoning post-training.
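
The stage-specific split can be pictured as routing each training sample by how often the current model already solves it. A minimal sketch, assuming binary rewards from a handful of rollouts per sample; the thresholds, the `mastered` bucket, and the function name are hypothetical and not the paper's exact criteria:

```python
def route_by_difficulty(rewards_per_sample):
    """Assign each sample to a stage based on its pass rate over several rollouts.

    Assumed rule for illustration: never solved goes to SFT, sometimes solved
    goes to RL, always solved is set aside as already mastered.
    """
    stage_sets = {"sft": [], "rl": [], "mastered": []}
    for sample_id, rewards in rewards_per_sample.items():
        pass_rate = sum(rewards) / len(rewards)
        if pass_rate == 0.0:
            stage_sets["sft"].append(sample_id)       # skill not yet acquired
        elif pass_rate < 1.0:
            stage_sets["rl"].append(sample_id)        # partially accessible skill
        else:
            stage_sets["mastered"].append(sample_id)
    return stage_sets


# Toy rollouts: 1 means the sampled answer was correct, 0 means it was not.
rollouts = {
    "q1": [0, 0, 0, 0],
    "q2": [1, 0, 1, 0],
    "q3": [1, 1, 1, 1],
}
stage_sets = route_by_difficulty(rollouts)
```

Samples that still earn all-zero rewards during RL are the ones the abstract sends to Critique Fine-Tuning for the next SFT stage.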

2606.04465 2026-06-04 cs.CL cs.AI 版本更新

SePO: Self-Evolving Prompt Agent for System Prompt Optimization

SePO: 用于系统提示优化的自我进化提示智能体

Wangcheng Tao, Han Wu, Weng-Fai Wong

发表机构 * National University of Singapore(新加坡国立大学) City University of Hong Kong(香港城市大学)

AI总结 提出SePO方法,通过自我指涉设计让提示智能体同时优化任务智能体和自身的系统提示,采用两阶段进化训练,在多个基准上平均准确率提升4.49个百分点。

Comments 26 pages. Code: https://github.com/taowangcheng/SePO

详情
AI中文摘要

系统提示优化在不修改底层模型的情况下改善智能体行为,生成可读且模型无关的指令。现有方法构建一个提示智能体来优化任务智能体的系统提示,但提示智能体自身的系统提示仍由人工设计且固定不变。我们提出自我进化提示优化(SePO),将提示智能体自身的系统提示与任务智能体的系统提示一同作为优化目标。SePO采用自我指涉设计:一个单一的提示智能体在开放式进化搜索下同时改进任务智能体的系统提示和自身的系统提示,该搜索维护一个候选提示档案作为垫脚石。训练分为两个阶段:预训练在多任务池上进化提示智能体,微调则将其应用于目标任务。在涵盖数学(AIME'25)、抽象推理(ARC-AGI-1)、研究生级科学(GPQA)、代码生成(MBPP)和逻辑谜题(数独)的五个基准上,SePO始终优于Manual-CoT、TextGrad和MetaSPO,与Manual-CoT相比平均准确率提升4.49个百分点。预训练中的提示优化技能也能泛化到预训练混合任务之外的任务,而非记忆每个任务的提示。

英文摘要

System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside task agents' system prompts. SePO adopts a self-referential design. A single prompt agent improves both task agents' system prompts and its own under an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual-CoT. The prompt optimization skill from pre-training also generalizes to tasks beyond the pre-training mixture, rather than memorizing per-task prompts.

2606.04459 2026-06-04 cs.CR cs.AI cs.CC cs.CL 版本更新

Token Rankings are Unforgeable Language Model Signatures

Token排名是不可伪造的语言模型签名

Matthew Finlayson, Andreas Grivas, Xiang Ren, Swabha Swayamdipta

发表机构 * University of Southern California(南加州大学) University of Edinburgh(爱丁堡大学)

AI总结 本文发现语言模型的token排名(按概率排序)构成唯一且不可伪造的签名,并研究了在限制API下如何平衡签名展示与参数泄露。

详情
AI中文摘要

已知语言模型参数对其logit输出施加了(每个模型)独特的几何约束,这作为识别模型的签名,但当API分发logits时也会泄露模型的最后一层参数。我们研究了更严格的API,这些API只暴露token排名(即按概率排序,但不暴露概率值),并发现排名也构成签名:对于足够大的$k$,每个模型都有一组唯一的可行top-$k$排名。此外,排名签名是第一个已知的(多项式时间)不可伪造签名,因为找到一个具有相同可行排名集的模型是NP难的。在安全方面,我们发现token排名已经足以近似窃取模型的最后一层,类似于logits,尽管近似太粗糙以至于无法伪造签名,并且可以通过将API限制为足够小的$k$的top-$k$ token来有效应对。由于展示模型签名所需的top-$k$通常小于防止窃取所需的$k$,因此API可以在不泄露模型参数的情况下展示不可伪造的签名。

英文摘要

Language model parameters are known to impose unique (to each model) geometric constraints on their logit outputs, which serve as a signature that identifies the model but also leak the model's final-layer parameters when an API distributes logits. We investigate more restrictive APIs that expose token rankings (i.e., their ordering by probability, but not the probability values) and find that rankings also constitute a signature: every model has a unique set of feasible top-$k$ rankings for sufficiently large $k$. Furthermore, the ranking signature is the first known (polynomially) unforgeable signature, since finding a model with the same set of feasible rankings is NP-hard. On the security front, we find that token rankings are already sufficient to approximately steal the final layer of the model, similar to logits, though the approximation is too coarse to forge the signature, and can be effectively countered by restricting the API to top-$k$ tokens with sufficiently small $k$. Since the top-$k$ required to present the model signature is generally smaller than the $k$ required to prevent stealing, it is possible for an API to present an unforgeable signature without leaking model parameters.
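
What a ranking-only API exposes can be shown in a few lines: the top-$k$ token ids in order of probability, with the probability values withheld. The sketch below uses made-up logits and hypothetical function names; it only illustrates that the ranking is the same whether it is computed from logits or from softmax probabilities:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ranking(scores, k):
    """Token ids ordered from highest to lowest score, truncated to k."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


logits = [2.0, -1.0, 0.5, 3.2]          # toy values for a 4-token vocabulary
ranking = top_k_ranking(logits, k=3)     # what a ranking-only API would return
ranking_from_probs = top_k_ranking(softmax(logits), k=3)
```

Because softmax is monotone, both calls return the same ordering, so a client sees which tokens are preferred but not by how much.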

2606.04455 2026-06-04 cs.AI cs.CL 版本更新

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

元智能体挑战:当前智能体能否自主开发智能体?

Xinyu Lu, Tianshu Wang, Pengbo Wang, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所信息处理实验室) University of Chinese Academy of Sciences(中国科学院大学) Ant Group(蚂蚁集团)

AI总结 提出元智能体挑战(MAC)框架,评估前沿模型自主开发智能体系统的能力,发现多数元智能体难以匹敌人类设计的基线策略,且存在鲁棒性和对齐问题。

Comments Website: https://meta-agent-challenge.com/

详情
AI中文摘要

当前的AI基准测试评估智能体在人类设计的工作流程中执行任务的能力。这些评估从根本上未能衡量一个关键的更高级能力:模型能否自主开发智能体系统。我们引入了元智能体挑战(MAC),这是一个评估框架,旨在测试前沿模型自主开发智能体的能力。具体来说,一个代码智能体(元智能体)被赋予一个沙盒环境、一个评估API和一个时间限制,以迭代地编程一个智能体工件,该工件在五个领域的保留测试集上最大化性能。为确保评估完整性,该框架通过多层防御机制防止奖励黑客攻击。利用该框架,我们证明元智能体很少能匹配人类设计的基线策略,而少数能匹配的则主要由专有前沿模型主导。此外,设计过程表现出高方差,高优化压力会浮现出诸如真实数据窃取等新兴对抗行为——凸显了鲁棒性和模型对齐方面的关键缺陷。最终,MAC为自主AI研究和开发提供了一个严格的、开源的基准测试,为评估递归自我改进提供了经验代理。基准测试公开于:https://github.com/ant-research/meta-agent-challenge。

英文摘要

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. The benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.

2606.04454 2026-06-04 cs.CL 版本更新

Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation

通过外部子图生成增强大语言模型的逐步推理

Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

发表机构 * School of Information Science and Engineering, Chongqing Jiaotong University(重庆交通大学信息科学与工程学院) School of Computer Science and Technology, Chongqing University of Posts and Telecommunications(重庆邮电大学计算机科学与技术学院)

AI总结 提出SGR框架,通过从知识图谱生成查询相关子图来引导大语言模型进行逐步推理,提升复杂多步推理的准确性、鲁棒性和可解释性。

详情
AI中文摘要

大语言模型在自然语言生成和下游推理任务中表现出色,但在复杂多步推理中仍面临逻辑一致性、事实基础和可解释性方面的挑战。为解决这些局限,本文提出SGR,一种通过查询相关子图生成将大语言模型与外部知识图谱集成的逐步推理增强框架。给定输入问题,SGR首先提取关键实体、关系和约束以构建结构化模式,然后通过模式引导查询从知识图谱中检索紧凑子图。生成的子图提供明确的关系证据,引导语言模型进行逐步推理。此外,SGR结合了基于Cypher的直接推理与协作推理集成,允许根据模型置信度和图一致性验证和聚合来自多个推理路径的候选答案。在包括CWQ、WebQSP、GrailQA和KQA Pro的基准数据集上的实验表明,SGR在推理准确性和Hits@1性能上优于标准提示和几种知识增强基线。消融研究进一步表明,模式引导和基于Neo4j的检索对框架的有效性都至关重要。这些结果表明,动态生成的外部子图可以提高基于大语言模型的推理的准确性、鲁棒性和可解释性。

英文摘要

Large language models have shown strong performance in natural language generation and downstream reasoning tasks, but they still struggle with logical consistency, factual grounding, and interpretability in complex multi-step reasoning. To address these limitations, this paper proposes SGR, a stepwise reasoning enhancement framework that integrates large language models with external knowledge graphs through query-relevant subgraph generation. Given an input question, SGR first extracts key entities, relations, and constraints to construct a structured schema, then retrieves compact subgraphs from a knowledge graph using schema-guided querying. The generated subgraphs provide explicit relational evidence that guides the language model through step-by-step reasoning. In addition, SGR combines direct Cypher-based reasoning with collaborative reasoning integration, allowing candidate answers from multiple reasoning paths to be validated and aggregated according to both model confidence and graph consistency. Experiments on benchmark datasets including CWQ, WebQSP, GrailQA, and KQA Pro demonstrate that SGR improves reasoning accuracy and Hits@1 performance over standard prompting and several knowledge-enhanced baselines. Ablation studies further show that schema guidance and Neo4j-based retrieval are both crucial to the effectiveness of the framework. These results indicate that dynamically generated external subgraphs can improve the accuracy, robustness, and interpretability of LLM-based reasoning.

2606.04450 2026-06-04 cs.CL cs.CY 版本更新

Listening to the Workforce: Measuring Construction Worker Safety Attitudes from Social Media Discourse Using LLMs

倾听劳动力:使用LLMs从社交媒体话语中测量建筑工人安全态度

Farouq Sammour, Yuxin Zhang, Zhenyu Zhang

发表机构 * Texas A&M University(德克萨斯A&M大学)

AI总结 提出并验证了建筑安全态度框架(CSAF),通过LLM分类器从Reddit社区话语中测量工人安全态度,实现高精度多维分析。

详情
AI中文摘要

工人安全态度是决定建筑工地上保护措施是否被应用或规避的关键因素。然而,大规模测量安全态度一直难以实现。安全态度是多维的,因话题而异,并且在工人自己的对话中最为坦诚。本研究创建并验证了建筑安全态度框架(CSAF),该框架整合了两个组成部分:一个基于理论的结构,沿八个维度表征安全态度;以及一个用于在工人自然话语中测量这些态度的操作化编码手册。将CSAF应用于Reddit上r/Construction社区的250条帖子和评论,经过训练的编码者达到了高度一致(Krippendorff's α = 0.85)。成对提升度和条件概率证实了八个维度既相关又不同。为了将框架应用于大量话语,CSAF通过大语言模型(LLM)分类器进行操作化。在450条r/Construction贡献中,分类器再现了专家人工编码(Cohen's κ = 0.90,精确率 = 0.98,召回率 = 0.98),并且在400条r/Roofing贡献中,转移到不同行业社区后仍保持该准确率(κ = 0.89,精确率 = 0.98,召回率 = 0.97)。一项价值验证案例研究将经过验证的分类器应用于10,346条r/Roofing贡献,证明CSAF能够按安全主题区分多维态度,追踪它们随时间的变化,并追溯不利态度背后的推理。因此,本研究提供了一个理论扎实、经验验证的工具来检查安全态度,为针对不安全实践背后态度的干预措施提供了基础。

英文摘要

Worker safety attitudes are key determinants of whether protective practices are applied or bypassed on construction sites. Yet measuring them at scale has remained out of reach. Safety attitudes are multidimensional, vary across topics, and surface most candidly in workers' own conversations. This study created and validated the Construction Safety Attitude Framework (CSAF), which integrates two components: a theory-grounded structure that characterizes safety attitudes along eight dimensions, and an operational codebook for measuring them in workers' naturalistic discourse. Applying CSAF to 250 posts and comments from the r/Construction community on Reddit, trained coders reached strong agreement (Krippendorff's α = 0.85). Pairwise lift and conditional probability confirmed that the eight dimensions are related yet distinct. To apply the framework across large volumes of discourse, CSAF was operationalized through a large language model (LLM) classifier. On 450 r/Construction contributions, the classifier reproduced expert human coding (Cohen's κ = 0.90, precision = 0.98, recall = 0.98), and on 400 contributions from r/Roofing it retained that accuracy after transfer to a different trade community (κ = 0.89, precision = 0.98, recall = 0.97). A proof-of-value case study then applied the validated classifier to 10,346 contributions from r/Roofing, demonstrating that CSAF can distinguish multidimensional attitudes by safety topic, track how they shift over time, and trace the reasoning behind unfavorable ones. The study therefore provides a theoretically grounded, empirically vetted instrument for examining safety attitudes, offering a basis for targeted interventions that address the attitudes underlying unsafe practices.
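
For readers unfamiliar with the agreement statistics reported above, Cohen's κ compares observed agreement p_o with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes κ, precision, and recall for a single binary code; the ten labels are made-up toy data, not values from the study, and the function names are hypothetical:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)


def precision_recall(gold, predicted, positive=1):
    """Precision and recall of `predicted` against `gold` for one positive class."""
    true_pos = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    predicted_pos = sum(p == positive for p in predicted)
    gold_pos = sum(g == positive for g in gold)
    return true_pos / predicted_pos, true_pos / gold_pos


# Toy data: 1 means the attitude dimension is present in the post.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
model = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(human, model)
precision, recall = precision_recall(human, model)
```

On this toy data the raters agree on 9 of 10 items and chance agreement is 0.5, so κ is about 0.8, with precision 1.0 and recall 0.8.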

2606.04442 2026-06-04 cs.CL cs.AI 版本更新

MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

MemoryDocDataSet: 联合对话记忆与长文档推理的基准测试

Qiyang Xie, Jialun Wu, Xinjie He, Su Liu, Shuai Xiao, Zhiyuan Lin, Weikai Zhou

发表机构 * Northeastern University(东北大学) Johns Hopkins University(约翰霍普金斯大学) Columbia University(哥伦比亚大学) Independent Researcher(独立研究者)

AI总结 提出MemoryDocDataSet合成基准,包含50个微世界和1000个QA对,评估系统同时处理多轮对话历史和长文档阅读理解的能力,其中75.1%的问题需要混合推理(先导航对话历史再提取文档答案),实验显示联合检索存在明显差距。

Comments 17 pages, 2 figures, 8 tables. Submitted for peer review

详情
AI中文摘要

人工智能系统越来越需要结合两种要求很高的能力:导航多会话对话历史和在长文档中进行深度阅读理解。然而,现有的基准测试没有同时评估这两者。我们引入了MemoryDocDataSet,一个包含50个微世界和1000个QA对的合成基准,其中每个实例包含3-5个人物角色、一个跨越数月活动的时间事件图、3-5篇真实长文档(每篇20,000-50,000个token,来自Caselaw Access Project)、基于这些文档的多会话对话,以及跨越五个推理类别的20个问答对。其定义特征是混合源标签:需要系统首先导航对话历史以确定哪个文档相关,然后从该文档中提取答案的问题。混合问题占数据集的75.1%。通过使用LLM作为评判者的提示敏感性自一致性分析来表征数据集质量,在所有50个微世界中得到中位数Cohen's $\kappa = 0.634$。我们评估了六种基线配置,涵盖截断上下文、长上下文LLM、检索增强生成(RAG)和记忆系统。最佳基线(RAG-Both)在整体F1上达到0.358,在混合问题上达到0.342。仅文档检索(RAG-Doc)在混合问题上降至0.267,尽管在仅文档问题上达到0.453,这显示了明显的联合检索差距,激励了统一对话记忆与长文档导航的架构。我们发布了数据集、生成流水线和所有基线实现。

英文摘要

AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each sourced from the Caselaw Access Project), multi-session conversations grounded on those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's $\kappa = 0.634$ across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.

2606.04435 2026-06-04 cs.AI cs.CL cs.CR cs.IR 版本更新

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

智能体RAG中的级联幻觉:用于检测和缓解的CHARM框架

Saroj Mishra

发表机构 * University of North Dakota(北达科他大学)

AI总结 针对多步智能体RAG管道中早期错误传播并放大为最终错误输出的级联幻觉问题,提出CHARM框架,通过阶段级事实验证、跨阶段一致性跟踪、置信度传播监控和级联解析触发四个组件实现检测与缓解,在多个数据集上达到89.4%的级联检测率和82.1%的错误传播减少。

详情
AI中文摘要

多步智能体检索增强生成(RAG)管道在复杂推理任务中展现出显著能力,但仍然容易受到一类现有幻觉检测机制系统性遗漏的故障影响:级联幻觉,即在管道早期阶段引入的错误会通过连续推理步骤传播并放大,产生自信但事实不正确的最终输出。为解决这一漏洞,我们将级联幻觉形式化为智能体RAG系统中的一种独特故障模式,提出四种级联模式的分类法,并引入CHARM(级联幻觉感知解析与缓解),一种用于检测和中断多步推理管道中错误传播的架构框架。CHARM包含四个组件——阶段级事实验证、跨阶段一致性跟踪、置信度传播监控和级联解析触发——它们与标准智能体RAG管道并行运行,无需替换架构。我们在HotpotQA、MuSiQue、2WikiMultiHopQA以及一个自定义对抗数据集上,在LangChain智能体管道配置下评估CHARM,实现了89.4%的级联检测率、5.3%的假阳性率、每阶段平均215 ms ± 18 ms的延迟开销,以及82.1%的错误传播减少,而输出级检测器仅为18.5%。组件消融实验证实每个检测模块对整体级联覆盖都有显著贡献。CHARM与人在回路监督框架集成,为生产级智能体AI部署提供完整的可靠性和治理栈。

英文摘要

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/- 18 ms average latency overhead per stage, achieving an error propagation reduction of 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

2606.04433 2026-06-04 cs.CV cs.CL cs.LG 版本更新

Stateful Visual Encoders for Vision-Language Models

用于视觉-语言模型的有状态视觉编码器

Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出有状态视觉编码器,通过将每个视觉表示条件于先前的视觉特征,增强视觉-语言模型在多图像、多轮交互中的视觉变化感知能力,在跨图像空间聚合、多目标视觉差异和轨迹行为克隆等任务上取得一致改进。

Comments Project page: https://statefulvisualencoders.github.io/

详情
AI中文摘要

视觉-语言模型(VLM)越来越多地用于多图像、多轮代理场景,其中决策依赖于视觉变化。然而,在现有的开源权重VLM中,视觉比较仅在语言模型内部进行,而视觉编码器本身是无状态的:每个图像独立编码,无法访问先前的视觉上下文。因此,微小但任务关键的变化可能在语言模型有机会比较之前被减弱,尤其是当这些变化不影响场景的高层语义时。我们引入了一种有状态视觉编码器,它将每个视觉表示条件于先前的视觉特征。在监督微调下,配备有状态编码器的VLM在涉及跨图像空间聚合、多目标视觉差异和视觉轨迹行为克隆的控制任务上取得了一致的改进。这些改进在输入分辨率、语言模型大小和VLM骨干网络上保持一致。最后,我们在实际任务上验证了我们的模型,包括纵向放射学、细粒度图像比较和遥感,其中有状态编码器一致地改进了通用VLM基线,并在选定领域可以匹配或超越专用模型。项目页面:https://statefulvisualencoders.github.io/

英文摘要

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/

2606.04418 2026-06-04 cs.SD cs.CL eess.AS 版本更新

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

CleanCodec:通过感知引导编码实现高效且鲁棒的语音分词化

Eugene Kwek, Feng Liu, Rui Zhang, Wenpeng Yin

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学) Drexel University(德雷塞尔大学)

AI总结 提出CleanCodec,一种去噪音频编解码器,通过选择性信息瓶颈编码仅保留感知重要特征,以12.5 tokens/s实现最先进的分词效率,在说话人相似度和语音可懂度上显著优于现有编解码器,并在下游任务中实现高达17倍推理加速。

详情
AI中文摘要

神经音频编解码器是语音处理流程的关键组件,将音频压缩为离散令牌以供下游建模。然而,现有编解码器难以平衡重建质量与令牌效率,常常以牺牲语言和声学有意义内容为代价,编码背景噪声和录音伪影等感知无关信息。我们将音频分词化重新定义为选择性信息瓶颈问题,并提出CleanCodec,一种去噪音频编解码器,学习仅编码感知重要特征并丢弃不可感知信息。在每秒仅12.5个令牌的情况下,CleanCodec实现了最先进的分词效率,在说话人相似度和语音可懂度上大幅优于现有编解码器。在下游文本到语音和语音转换任务上的评估进一步展示了改进的性能和高达17倍的推理加速,凸显了显著的效率提升。

英文摘要

Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency, often encoding perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec which learns to encode only perceptually important features and discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference, highlighting significant efficiency gains.

2606.04396 2026-06-04 cs.CL 版本更新

Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

读取轨迹,引导路径:面向扩散语言模型的轨迹感知强化学习

Anant Khandelwal, Manish Gupta

发表机构 * Microsoft AI, India(微软印度人工智能)

AI总结 提出CAPR算法,通过缓存轨迹状态和块级价值头,利用去噪轨迹提供类似树搜索的细粒度监督,在降低计算成本的同时提升扩散语言模型的强化学习效果。

Comments 19 pages, 10 figures, 7 Tables

详情
AI中文摘要

扩散大语言模型(dLLMs)通过并行迭代去掩码和修正多个位置来生成响应。这一过程留下了丰富的去噪轨迹,描绘了哪些标记变得可信、哪些仍不稳定以及何时形成承诺。现有的dLLM强化学习方法仅弱化地使用这一信号。扁平化展开成本低,但将单一结果奖励分配给整个轨迹。树展开通过分支部分轨迹并将叶节点奖励向上传播,提供更精细、可验证的训练信号,但计算密集。我们提出疑问:去噪轨迹本身能否在不使用树级计算的情况下提供类似树的监督?我们引入CAPR(缓存-摊销路径细化),一种dLLM-RL算法,它将去噪轨迹总结为紧凑的路径状态,利用缓存轨迹状态生成廉价的兄弟延续,并训练块级价值头用于局部块级监督。在块级去掩码调度下,CAPR记录路径状态和块进度特征,然后根据每个块中揭示的标记将最终结果奖励重新分配到各个块。这训练价值头将一个稀疏奖励转换为块级PPO权重。因此,CAPR恢复了树搜索的大部分粒度,同时避免了完整的树扩展,将展开生成成本降低到扁平展开的大约0.75倍和树展开的0.6倍(在标准设置下)。在4x4数独、Countdown、GSM8K和Math500上,使用密集和混合专家LLaDA骨干网络,CAPR在256和512标记预算下为RL调优的dLLMs设立了新的最先进水平。在数独上,它以不到三分之一的每步计算量匹配了最强的树结构基线。

英文摘要

Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory. Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute. We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.
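The reward-redistribution step described above (spreading one final outcome reward across blocks according to the tokens revealed in each block) can be sketched as follows. The proportional rule and the function name `redistribute_reward` are assumptions made for illustration; the abstract does not give CAPR's exact formula, which may differ.

```python
def redistribute_reward(final_reward, tokens_revealed_per_block):
    """Split one sparse outcome reward across blocks.

    Assumption (not stated in the abstract): each block's share is
    proportional to the number of tokens it revealed.
    """
    total = sum(tokens_revealed_per_block)
    if total <= 0:
        raise ValueError("at least one token must have been revealed")
    return [final_reward * n / total for n in tokens_revealed_per_block]


# Hypothetical rollout: three blocks revealed 8, 16 and 8 tokens.
block_targets = redistribute_reward(1.0, [8, 16, 8])
print(block_targets)
```

Under this assumed rule, the per-block values sum back to the original reward, so they could serve as regression targets for a block-level value head without changing the total signal.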

2606.04392 2026-06-04 cs.LG cs.CL 版本更新

Physics-Informed Neural Network Modeling of Biodegradable Contaminant Transport through GCL/SL Composite Liners

物理信息神经网络建模可生物降解污染物通过GCL/SL复合衬垫的迁移

Dong Li, Yapeng Cao, Haiping Zhao, Shutong Han

发表机构 * Department of Civil, Environmental, and Infrastructure Engineering, George Mason University(乔治梅森大学土木、环境与基础设施工程系) State Key Laboratory of Cryospheric Science and Frozen Soil Engineering, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences(中国科学院西北生态环境资源研究院冰冻圈科学与冻土工程全国重点实验室) Laboratoire Navier/CERMES, École Nationale des Ponts et Chaussées, Institut Polytechnique de Paris(巴黎理工学院法国国立路桥学校Navier实验室/CERMES)

AI总结 提出双域物理信息神经网络框架,通过硬约束PINN精确模拟GCL/SL复合衬垫中污染物迁移,并扩展至逆问题识别降解半衰期。

详情
AI中文摘要

本研究开发了一个双域物理信息神经网络框架,用于污染物通过GCL/SL复合衬垫系统的迁移,其中薄GCL层采用稳态平流-弥散-生物降解公式处理,而下层土壤衬垫建模为瞬态传输域。在不同渗滤液水头条件下,评估了两种公式与解析解和有限元参考解的对比:标准软约束PINN(Std-PINN)和硬约束PINN(H-PINN),其中选定的边界和初始条件直接嵌入试验解中。Std-PINN捕捉了整体突破行为,但在早期传输阶段显示出较大误差,特别是在平流传输更显著的高水头条件下。H-PINN减少了与基于惩罚的约束执行相关的优化负担,提供了更准确和稳定的浓度预测,将MAE从Std-PINN的约0.058-0.067降低到H-PINN的约0.011-0.023,同时将MRE从约9.10%-19.16%降低到约2.08%-3.14%。参数分析证实,采用tanh激活函数和优化网络结构的H-PINN提供了最佳的预测精度。H-PINN进一步扩展到逆建模,用于从有限的浓度观测中识别SL降解半衰期,显示出对预设值的可靠收敛性以及在低到中等观测噪声下的可接受鲁棒性。

英文摘要

This study develops a two-domain physics-informed neural network framework for contaminant transport through a GCL/SL composite liner system, in which the thin GCL layer is treated using a steady-state advection-dispersion-biodegradation formulation and the underlying soil liner is modeled as a transient transport domain. Two formulations are evaluated against analytical and finite-element reference solutions under different leachate-head conditions: a standard PINN with soft constraint enforcement (Std-PINN) and a hard-constrained PINN (H-PINN), in which selected boundary and initial conditions are embedded directly into the trial solutions. The Std-PINN captures the overall breakthrough behavior but shows larger errors during the early transport stage, particularly under higher leachate heads where advective transport becomes more pronounced. The H-PINN reduces the optimization burden associated with penalty-based constraint enforcement and provides more accurate and stable concentration predictions, lowering the MAE from approximately 0.058-0.067 for the Std-PINN to about 0.011-0.023 for the H-PINN, while reducing the MRE from approximately 9.10%-19.16% to about 2.08%-3.14%. Parametric analyses confirm that the H-PINN with the tanh activation function and an optimized network structure provides the best predictive accuracy. The H-PINN is further extended to inverse modeling for identifying the SL degradation half-life from limited concentration observations, showing reliable convergence toward prescribed values and acceptable robustness under low-to-moderate observation noise.
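The difference between soft and hard constraint enforcement can be illustrated with a minimal sketch of a trial solution that satisfies an initial condition by construction. Everything below is an assumption made for illustration: `raw_network` is a fixed stand-in for a trained network, the zero initial concentration is a hypothetical choice, and the multiplier `t` is one common way to embed the condition, not necessarily the form used in the paper.

```python
import math


def raw_network(x, t):
    """Stand-in for an unconstrained network output (hypothetical)."""
    return math.tanh(0.7 * x - 0.3 * t + 0.1)


def initial_concentration(x):
    """Assumed initial condition c(x, 0): an initially clean liner."""
    return 0.0


def trial_solution(x, t):
    """Hard-constrained trial solution.

    The factor t vanishes at t = 0, so the output equals the initial
    condition there for any network weights; no penalty term is needed.
    """
    return initial_concentration(x) + t * raw_network(x, t)


for depth in (0.0, 0.25, 1.0):
    print(depth, trial_solution(depth, 0.0), trial_solution(depth, 2.0))
```

A soft-constrained PINN would instead add the mismatch at t = 0 to the loss, leaving the optimizer to balance it against the PDE residual. A boundary condition can be embedded in the same spirit with a multiplier that vanishes on the boundary.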

2606.04389 2026-06-04 cs.CL 版本更新

When Clients Stop Following: A Cognitive Conceptualization Diagram-driven Framework for Strategic Counseling

当来访者不再跟随:基于认知概念化图的策略性咨询框架

Yihao Qin, Junyi Zhao, Changsheng Ma, Yongfeng Tao, Minqiang Yang, Chang Liu, Bin Hu

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院)

AI总结 针对现有评估协议中来访者过度顺从导致评分虚高的问题,提出基于认知行为疗法的抵抗感知框架,通过认知概念化图模拟动态抵抗,并利用强化学习优化策略推理与响应生成,以提升在困难咨询交互中的策略鲁棒性。

详情
AI中文摘要

大型语言模型(LLMs)在心理咨询中展现出潜力,但现有基准高度依赖高度合作的模拟来访者。我们观察到一个关键的咨询师-来访者跟随现象:这些来访者往往在仅几轮对话后便迅速从抵抗转向顺从,造成治疗进展的假象,并通过表面共情在当前评估协议下虚高分数。为解决这一评估失配问题,我们提出一个基于认知行为疗法(CBT)的抵抗感知框架。我们引入CARS,一个通过认知概念化图(CCDs)显式建模动态抵抗的来访者模拟器。我们提出STREAMS,一个将策略推理(思考者)与响应生成(呈现者)解耦并通过强化学习优化的双模块框架。我们进一步提出EWTS-MI,一个用于评估高摩擦交互下响应性的熵加权指标。在抵抗性和非抵抗性咨询设置上的实验验证了我们对评估失配的发现,并展示了抵抗感知训练在挑战性咨询交互下提升策略鲁棒性的有效性。

英文摘要

Large Language Models (LLMs) show promise in psychological counseling, yet existing benchmarks rely heavily on highly cooperative simulated clients. We observe a critical counselor-following phenomenon: these clients often rapidly shift from resistance to compliance after only a few turns, creating an illusion of therapeutic progress and inflating scores under current evaluation protocols through superficial empathy. To address this evaluation mismatch, we propose a Cognitive Behavioral Therapy (CBT)-grounded resistance-aware framework. We introduce CARS, a client simulator that explicitly models dynamic resistance via Cognitive Conceptualization Diagrams (CCDs). We present STREAMS, a dual-module framework that decouples strategic reasoning (Thinker) from response generation (Presenter) and optimizes it via reinforcement learning. We further propose EWTS-MI, an entropy-weighted metric for evaluating responsiveness under high-friction interactions. Experiments across resistant and non-resistant counseling settings validate our findings on evaluation mismatch and demonstrate the effectiveness of resistance-aware training for improving strategic robustness under challenging counseling interactions.

2606.04378 2026-06-04 cs.CL 版本更新

DLLG: Dynamic Logit-Level Gating of LLM Experts

DLLG: 大语言模型专家的动态logit级门控

Bingnan Li, Zhaoyang Zhang, Xiaoze Liu, Yantao Shen, Shuli Jiang, Shuo Yang, Wei Xia, Zhuowen Tu, Stefano Soatto

发表机构 * University of California, Berkeley(加州大学伯克利分校) Princeton University(普林斯顿大学)

AI总结 提出DLLG框架,通过轻量级门控模块学习token级专家融合权重,利用稀疏的响应级监督实现动态logit级集成,无需token级标签或专家重训练,在推理和代码基准上优于路由、启发式集成和参数合并方法。

详情
AI中文摘要

利用多个专门的大语言模型可以结合互补优势,但现有方法在适应性和稳定性之间权衡:路由过早提交,启发式集成依赖脆弱的代理,参数合并引入干扰。我们提出DLLG(动态logit级门控),一个动态logit级集成框架,从稀疏的响应级监督中学习token级专家融合。一个轻量级门控模块预测逐步融合权重,将轨迹级正确性与生成联系起来,无需token级标签或专家重训练。在多样化的推理和代码基准上,DLLG在不同模型规模下始终优于强路由、启发式集成和参数合并基线,突显了学习到的logit级融合作为集成专门专家的稳健且可扩展的范式。

英文摘要

Leveraging multiple specialized LLMs can combine complementary strengths, but existing approaches trade adaptability for stability: routing commits prematurely, heuristic ensembling depends on fragile proxies, and parameter merging introduces interference. We propose DLLG (Dynamic Logit-Level Gating), a dynamic logit-level ensembling framework that learns token-level expert fusion from sparse response-level supervision. A lightweight gating module predicts step-wise fusion weights, linking trajectory-level correctness to generation without token-level labels or expert retraining. Across diverse reasoning and code benchmarks, DLLG consistently outperforms strong routing, heuristic ensembling, and parameter-merging baselines across model scales, highlighting learned logit-level fusion as a robust and scalable paradigm for integrating specialized experts.
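A minimal sketch of logit-level fusion may help make the idea concrete: at each decoding step, per-expert next-token logits are combined with step-wise weights. The softmax normalization of gate scores, the function names, and the hand-set gate scores are all assumptions made for illustration; in DLLG the weights come from a learned gating module whose exact form the abstract does not specify.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(expert_logits, gate_scores):
    """Weighted sum of per-expert logits for one decoding step.

    expert_logits: one list of vocabulary logits per expert.
    gate_scores:   one unnormalized score per expert (assumed to come
                   from a gating module; hand-set in this sketch).
    """
    weights = softmax(gate_scores)
    vocab_size = len(expert_logits[0])
    fused = [
        sum(w * logits[v] for w, logits in zip(weights, expert_logits))
        for v in range(vocab_size)
    ]
    return fused, weights


# Two hypothetical experts over a 3-token vocabulary.
experts = [[2.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
fused, weights = fuse_logits(experts, gate_scores=[0.0, 0.0])
print(weights, fused)
```

With equal gate scores the result is a plain average of the experts. Unequal scores shift the fused distribution toward one expert per step, which is the token-level adaptivity that a one-time routing decision lacks.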

2606.04367 2026-06-04 cs.CL cs.HC 版本更新

GlossAssist -- A Tool to Simplify Corpus Creation and Study the Effect of NLP Models in Low-Resource Documentation Settings

GlossAssist——简化语料库创建并研究NLP模型在低资源文档设置中影响的工具

Bhargav Shandilya, Matt Buchholz, Alexis Palmer

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 提出基于检索架构的自动标注工具GlossAssist,通过可编辑词库和主动学习反馈循环,提升低资源语言文档中词素标注的准确性和可解释性。

Comments 6 pages, 3 figures

详情
AI中文摘要

行间标注文本(IGT)是语言文档中语言学标注的标准格式。然而,手动生成IGT通常缓慢且成本高昂。近年来,自动标注系统有了显著改进,但实地语言学家中的采用率仍然有限。现有工具旨在被评估而非使用,没有提供可解释的修正路径或将语言学专业知识融入模型行为的方式。我们提出GlossAssist,一个基于CWoMP(对比词-词素预训练)检索架构的标注工具,该工具将预测基于可变的已学习词素表示词库。结合CWoMP,我们的系统将标注者的每次修正视为主动学习设置的一部分,从而扩展词库并改进未来预测,而无需重新训练模型。在本文中,我们展示我们的界面,并论证这种反馈循环应被视为面向文档语言学家的NLP工具的设计要求。

英文摘要

Interlinear glossed text (IGT) is the standard format for linguistic annotation in language documentation. Producing it manually, however, is often slow and costly. Automated glossing systems have improved substantially in recent years, but adoption among field linguists remains limited. Existing tools are designed to be evaluated rather than used, offering no interpretable path for correction or the incorporation of linguistic expertise back into model behavior. We present GlossAssist, a glossing tool built around the retrieval-based architecture of CWoMP (Contrastive Word-Morpheme Pre-training), which grounds predictions in a mutable lexicon of learned morpheme representations. In conjunction with CWoMP, our system treats each correction by an annotator as part of an active learning setting, which expands the lexicon and improves future predictions without having to retrain the model. In this paper, we present our interface and argue that this feedback loop should be treated as a design requirement for NLP tools aimed at documentary linguists.

2606.04362 2026-06-04 cs.IR cs.CL 版本更新

Disentangling Answer Engine Optimization from Platform Growth: A Log-Based Natural Experiment on ChatGPT Referral Traffic

解耦答案引擎优化与平台增长:基于日志的ChatGPT推荐流量自然实验

Keisuke Watanabe, Kazuki Nakayashiki

发表机构 * Glasp Inc.(Glasp公司)

AI总结 本研究通过自然实验方法,利用同一域内未处理页面作为对照,分离了答案引擎优化(AEO)对推荐流量的因果效应与平台自身增长带来的混淆效应。

Comments 9 pages, 4 figures, 1 table

详情
AI中文摘要

大型语言模型(LLM)“答案引擎”(如ChatGPT)现在向开放网络发送可测量的推荐流量,一种类似于搜索引擎优化的实践——此处称为答案引擎优化(AEO)——已经出现。公开的AEO成功案例通常引用巨大的原始增长倍数,但原始推荐增长被答案引擎本身的快速平台级增长所混淆。我们报告了一项针对单个高流量域名(glasp.co)的纵向现场研究,该域名拥有数十万个YouTube问答页面,在2026年1月接受了一组明确的AEO干预(详见第4节)。由于干预集中在网站的一个子集上,同一域内未处理的剩余部分作为同期对照,吸收了平台尾风。使用第一方分析和服务器日志而非概率性第三方估计,我们发现:(1)原始增长由平台尾风主导:在月度汇总中,ChatGPT总推荐量增长了5.7倍,而同一域内未处理页面在同一时间段内增长了3.5倍;(2)对每周处理/对照比率的中断时间序列模型估计出一个离散的、与干预对齐的水平增长1.82倍(95% CI 1.31-2.54,HAC p=0.001),该结果在参与度过滤流量(2.27倍)和替代规格下稳健;(3)然而,保守的安慰剂时间置换检验得出p=0.16,因此该效应是提示性的而非结论性的,鉴于前期短且噪声大;(4)Google对处理页面的自然点击并未超出整体网站趋势下降,且索引得以保留,这与SEO保护规则一致。方法论上的信息——通过域内对照分离处理与平台尾风——比任何单一倍数更重要,并意味着标题中的AEO倍数大大高估了因果效应。

英文摘要

Large language model (LLM) "answer engines" such as ChatGPT now send measurable referral traffic to the open web, and a practice analogous to search engine optimization, here called Answer Engine Optimization (AEO), has emerged. Public AEO success stories typically quote large raw growth multiples, but raw referral growth is confounded by the rapid platform-level growth of the answer engines themselves. We report a longitudinal field study on a single high-traffic domain (glasp.co) whose corpus of hundreds of thousands of YouTube question-and-answer pages received a defined bundle of AEO interventions in January 2026 (detailed in Section 4). Because the interventions were concentrated on one subset of the site, the untreated remainder of the same domain acts as a contemporaneous control that absorbs the platform tailwind. Using first-party analytics and server logs rather than probabilistic third-party estimators, we find: (1) raw growth is dominated by the platform tailwind: on monthly aggregates total ChatGPT referrals grew 5.7x while untreated pages on the same domain grew 3.5x over the same window; (2) an interrupted time-series model on the weekly treated/control ratio estimates a discrete, intervention-aligned level increase of 1.82x (95% CI 1.31-2.54, HAC p=0.001), robust across engagement-filtered traffic (2.27x) and alternative specifications; (3) however, a conservative placebo-in-time permutation test yields p=0.16, so the effect is suggestive, not conclusive, given a short and noisy pre-period; and (4) Google organic clicks to treated pages did not fall beyond the ambient site-wide trend and indexation was preserved, consistent with the SEO-protection rule. The methodological message, separating treatment from platform tailwind with an on-domain control, matters more than any single multiple, and implies that headline AEO multiples substantially overstate causal effect.

2606.04360 2026-06-04 cs.CL cs.LG 版本更新

Deliberate Evolution: Agentic Reasoning for Sample-Efficient Symbolic Regression with LLMs

Deliberate Evolution: 基于智能体推理的LLM样本高效符号回归

Xinyu Pang, Zhanke Zhou, Xuan Li, Fangrui Lv, Shanshan Wei, Sen Cui, Bo Han, Changshui Zhang

发表机构 * TMLR Group, Department of Computer Science, Hong Kong Baptist University(香港浸会大学计算机科学系TMLR组) Beijing National Research Center for Information Science and Technology (BNRist), Department of Automation, Tsinghua University, Beijing, P.R. China(清华大学自动化系,北京信息科学与技术国家研究中心(BNRist),中国北京) Lenovo Research(联想研究院)

AI总结 提出Deliberate Evolution框架,通过解耦符号生成与搜索控制,利用自适应算子、分析工具和反思记忆,在仅用40%样本预算下超越现有LLM符号回归方法。

Comments ICML 2026

详情
AI中文摘要

符号回归(SR)从数据中发现紧凑的数学表达式,然而最近基于LLM的进化方法仍然样本效率低下,因为它们主要依赖标量反馈(如MSE)。我们发现一个核心限制:现有方法将候选提议与搜索指导混为一谈,要求LLM从单一分数中推断如何进化表达式、诊断其错误并重用过去经验。为了解决这个问题,我们提出了Deliberate Evolution(DE),一个将符号生成与搜索控制解耦的智能体框架。DE使用自适应算子引导搜索方向、分析工具进行结构诊断以及反思记忆存储轨迹级经验,从而指导LLM的提议。在LLM-SRBench上的实验表明,DE在仅使用标准样本预算的40%的情况下,在多个科学领域一致优于代表性的基于LLM的SR基线。

英文摘要

Symbolic regression (SR) discovers compact mathematical expressions from data, yet recent LLM-based evolutionary methods remain sample-inefficient because they rely mainly on scalar feedback such as MSE. We identify a core limitation: existing methods conflate candidate proposal with search guidance, requiring the LLM to infer how to evolve an expression, diagnose its errors, and reuse past experience from a single score. To address this, we propose Deliberate Evolution (DE), an agentic framework that decouples symbolic generation from search control. DE guides LLM proposals with adaptive operators for search direction, analytical tools for structural diagnosis, and reflective memory for trajectory-level experience. Experiments on LLM-SRBench show that DE consistently outperforms representative LLM-based SR baselines across diverse scientific domains while using only 40% of the standard sample budget.

2606.04340 2026-06-04 cs.CL 版本更新

Noisy memory encoding explains negative polarity illusions

噪声记忆编码解释了负极词幻觉

Yuhan Zhang, Edward Gibson

发表机构 * Department of Linguistics, Stanford University(斯坦福大学语言学系) BIO-X Interdisciplinary Biosciences Institute, Stanford University(斯坦福大学生物交叉科学研究所) Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology(麻省理工学院脑科学与认知科学系)

AI总结 本研究利用Hahn等人(2022)的有损上下文惊奇理论,提出不完美的句子编码导致负极词幻觉,并通过六个新型限定词对的可接受性判断实验验证了限定词相似性增强幻觉效应的假设。

Comments 21 pages, 5 figures, submitted for journal publication

详情
AI中文摘要

像“The authors that no critics recommended have ever received acknowledgment for a best-selling novel”这样的句子有时被认为可接受,尽管严格来说它不合语法,因为负极词“ever”在其位置未获许可。这种行为效应有时被称为“负极词幻觉”。这里我们提出,Hahn等人(2022)的有损上下文惊奇理论——即人们对复杂句子的编码不完美——可能解释这种效应。我们假设人们对主句和从句主语中的限定词记忆表征较差,并可能设想一种限定词交换来许可“ever”。我们提出,这些位置上更相似的限定词会引发更强的幻觉效应。使用六种新型限定词对(例如,“few”和“many”,“few”和“most”)的可接受性判断任务支持了我们的提议,具体表明,即使没有时间压力,新句子“Many authors that few critics recommended have ever received acknowledgment for a best-selling novel”也比规范句引发了更强的幻觉。这些结果进一步支持了人类语言处理是不完美且资源理性的观点:面对工作记忆限制,人类理性地从噪声语言输入中重构最可能的内容,以促进下游处理。

英文摘要

A sentence like "The authors that no critics recommended have ever received acknowledgment for a best-selling novel" is sometimes rated as acceptable even though, strictly speaking, it is ungrammatical because the negative polarity word "ever" is not licensed where it is. This behavioral effect is sometimes called a "negative polarity illusion". Here we propose that the lossy context surprisal theory of Hahn et al. (2022) -- whereby people have an imperfect encoding of complex sentences -- might explain this effect. We hypothesize that people have poor memory representation of the determiners in the main-clause and embedded-clause subjects and could entertain a determiner exchange that licenses ever. We propose that more similar determiners in those positions would trigger stronger illusion effects. Acceptability judgment tasks with six novel determiner pairs (e.g., "few" and "many", "few" and "most") support our proposal, showing, specifically, that a novel sentence, "Many authors that few critics recommended have ever received acknowledgment for a best-selling novel", triggered a much stronger illusion than the canonical one even without time pressure. These results offer further support for the suggestion that human language processing is imperfect and resource-rational: in face of working memory limitations, humans rationally reconstruct what is most likely from noisy linguistic input to facilitate downstream processing.

2606.04325 2026-06-04 cs.CL 版本更新

Parameter-Efficient Fine-Tuning with Learnable Rank

可学习秩的参数高效微调

Arpit Garg, Simon Lucey, Hemanth Saratchandran

发表机构 * Australian Institute for Machine Learning(澳大利亚机器学习研究所)

AI总结 提出LR-LoRA方法,通过训练过程中学习适配器秩而非固定秩,在语言理解和常识推理基准上达到最先进性能。

Comments In Submission

详情
AI中文摘要

低秩适配(LoRA)是一种流行的参数高效微调(PEFT)方法,通过将权重更新限制为低秩适配器,在低维子空间中进行优化,从而引入固定的低秩归纳偏置。在这项工作中,我们质疑固定秩约束是否是参数高效微调最有效的归纳偏置。我们引入了*可学习秩LoRA(LR-LoRA)*,一种在训练过程中学习适配器秩的PEFT方法。LR-LoRA不为所有适配器层规定统一的秩,而是允许优化器为每一层确定合适的秩。使用这种方法,我们发现学习到的秩在层间存在显著差异,Transformer模型中的注意力层和MLP层表现出系统性的不同秩偏好。在一系列语言理解和常识推理基准测试中,LR-LoRA在大多数设置下达到了最先进的性能,并且始终优于强大的PEFT基线,表明可学习秩比固定秩适配提供了更灵活和有效的归纳偏置。

英文摘要

Low-Rank Adaptation (LoRA) is a popular parameter-efficient fine-tuning (PEFT) method that restricts weight updates to low-rank adapters, introducing a fixed low-rank inductive bias by optimizing in a low-dimensional subspace. In this work, we question whether a fixed-rank constraint is the most effective inductive bias for parameter-efficient fine-tuning. We introduce *Learnable Rank LoRA (LR-LoRA)*, a PEFT method in which the adapter rank is learned during the training process. Instead of prescribing a uniform rank for all adapter layers, LR-LoRA allows the optimizer to determine the appropriate rank for each layer. Using this approach, we find substantial layer-wise variation in the learned ranks, with the attention and MLP layers in the transformer models exhibiting systematically different rank preferences. Across a range of language understanding and commonsense reasoning benchmarks, LR-LoRA achieves state-of-the-art performance in most settings and consistently outperforms strong PEFT baselines, demonstrating that a learnable rank provides a more flexible and effective inductive bias than fixed-rank adaptations.

2606.04302 2026-06-04 cs.CL cs.LG 版本更新

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

LazyAttention: 高效检索增强生成中的延迟位置编码

Haocheng Xia, Mihir Pamnani, Hanxi Fang, Supawit Chockchowwat, Yongjoo Park

发表机构 * Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校Siebel计算与数据科学学院) Google(谷歌) Amazon(亚马逊)

AI总结 针对检索增强生成中KV缓存位置编码复用性差的问题,提出LazyAttention机制,通过核化延迟位置编码实现零拷贝、位置无关的KV重用,显著降低首令牌延迟并提升推理吞吐量。

Comments ICML 2026

详情
AI中文摘要

键值(KV)缓存通过重用已生成令牌的过去计算来加速大型语言模型(LLM)的推理。在长上下文应用(如检索增强生成(RAG)和上下文学习(ICL))中,其重要性更加凸显。然而,传统的KV缓存将位置信息直接嵌入缓存中,限制了其可重用性。现有解决方案要么将重用限制为前缀,要么需要昂贵的内存物化来进行位置重新编码。我们引入了LazyAttention,一种新颖的注意力机制,它通过核化延迟位置编码来实现零拷贝、位置无关的KV重用。通过在注意力内核中动态调整位置编码,LazyAttention解决了物化瓶颈,使得单个物理KV副本能够服务于任意位置的多个逻辑请求。利用为预填充和解码定制的注意力内核,我们的系统实现了显著的效率提升:在偏斜的文档分布下,与最先进的Block-Attention相比,首令牌延迟(TTFT)降低了1.37倍,推理吞吐量提高了1.40倍,同时保持了可比的输出质量。

英文摘要

Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions either restrict reuse to prefixes or require expensive memory materialization for positional re-encoding. We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions. Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements: under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37$\times$ and increases inference throughput by 1.40$\times$ compared to the state-of-the-art Block-Attention, while maintaining comparable output quality.
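Why deferring positional encoding makes a cached key reusable can be shown with a small sketch. It assumes a rotary (RoPE-style) encoding on 2-D vectors, which the abstract does not name; the function names and the angle step `theta` are illustrative, and this is plain Python arithmetic rather than an attention kernel.

```python
import math


def rotate(vec, position, theta=0.1):
    """Apply a RoPE-style rotation to a 2-D vector at a given position."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def lazy_score(query, q_pos, cached_key, k_pos):
    """Attention logit where the cached key is stored WITHOUT position
    and rotated only at the moment it is used."""
    return dot(rotate(query, q_pos), rotate(cached_key, k_pos))


# One physical key copy, reused at two different logical positions.
key = (0.3, -1.2)    # cached once, position-free
query = (1.0, 0.5)
score_a = lazy_score(query, 5, key, 3)    # document placed at position 3
score_b = lazy_score(query, 12, key, 10)  # same document at position 10
print(score_a, score_b)
```

Both scores agree because a rotary logit depends only on the offset between query and key positions. A cache that stores already-rotated keys would need a re-encoded copy for each new placement, which is the materialization cost the abstract describes.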

2606.04286 2026-06-04 cs.CL 版本更新

Using Text-Based Causal Inference to Disentangle Factors Influencing Online Review Ratings

使用基于文本的因果推断解构影响在线评论评分的因素

Linsen Li, Aron Culotta, Nicholas Mattei

发表机构 * Department of Computer Science Tulane University(计算机科学系路易斯安那大学)

AI总结 提出基于CausalBERT的文本因果分析方法,通过温度缩放、超参数优化和可解释性改进,从60万条美国K-12学校评论中解构各因素对整体评分的影响。

Comments HLT/NAACL 2025

详情
Journal ref In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies
AI中文摘要

在线评论提供了对产品或服务各方面感知质量的宝贵见解。虽然基于方面的情感分析侧重于从评论中提取这些方面,但关于每个方面对整体感知影响的研究较少。由于方面之间的相关性,分离每个方面的影响尤其具有挑战性。本文介绍了一种基于文本因果分析最新进展的方法,特别是CausalBERT,以解构每个因素对整体评论评分的影响。我们通过三个关键改进增强了CausalBERT:用于更校准的处理分配估计的温度缩放;减少混杂过度调整的超参数优化;以及表征发现混杂因素的可解释性方法。在这项工作中,我们将评论中的文本提及视为现实世界属性的代理。我们在来自超过60万条美国K-12学校评论的真实和半合成数据上验证了我们的方法。我们发现,所提出的增强方法产生了更可靠的估计,并且对学校管理和基准测试表现的感知是整体学校评分的重要驱动因素。

英文摘要

Online reviews provide valuable insights into the perceived quality of facets of a product or service. While aspect-based sentiment analysis has focused on extracting these facets from reviews, there is less work understanding the impact of each aspect on overall perception. This is particularly challenging given correlations among aspects, making it difficult to isolate the effects of each. This paper introduces a methodology based on recent advances in text-based causal analysis, specifically CausalBERT, to disentangle the effect of each factor on overall review ratings. We enhance CausalBERT with three key improvements: temperature scaling for better calibrated treatment assignment estimates; hyperparameter optimization to reduce confound overadjustment; and interpretability methods to characterize discovered confounds. In this work, we treat the textual mentions in reviews as proxies for real-world attributes. We validate our approach on real and semi-synthetic data from over 600K reviews of U.S. K-12 schools. We find that the proposed enhancements result in more reliable estimates, and that perception of school administration and performance on benchmarks are significant drivers of overall school ratings.
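Temperature scaling, the first of the three enhancements, can be sketched as follows. This is a generic post-hoc calibration example for a binary treatment classifier, not the authors' implementation; the grid search, the function names, and the toy logits are all assumptions made for illustration.

```python
import math

def nll(logits, labels, temperature):
    """Mean binary cross-entropy of sigmoid(logit / T) against 0/1 labels."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z / temperature))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

def fit_temperature(logits, labels):
    """Pick the temperature on a fixed grid (0.25 to 10.0) that minimizes held-out NLL."""
    grid = [i / 20 for i in range(5, 201)]
    return min(grid, key=lambda t: nll(logits, labels, t))

# Toy held-out set: confident logits, but three of the eight predictions are wrong.
val_logits = [4.0, 3.5, -4.2, 3.8, -3.9, 4.1, -3.6, 3.7]
val_labels = [1, 0, 0, 1, 1, 1, 0, 0]

T = fit_temperature(val_logits, val_labels)
calibrated = [1.0 / (1.0 + math.exp(-z / T)) for z in val_logits]
```

A fitted temperature above 1 softens overconfident propensity estimates without changing their ranking, which matters when the estimates are later used as weights in a causal adjustment.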

2606.04284 2026-06-04 cs.LG cs.AI cs.CL 版本更新

Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

稀疏混合专家奖励模型学习可解释且专业化的专家用于个性化偏好建模

Yifan Wang, Jinyi Mu, Mayank Jobanputra, Yu Wang, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg

发表机构 * Saarland University(萨尔兰大学) Independent Researcher(独立研究者) Bielefeld University(比勒菲尔德大学) Max Planck Institute for Software Systems(马克斯·普朗克软件系统研究所) Max Planck Institute for Informatics(马克斯·普朗克信息研究所)

AI总结 提出稀疏混合专家奖励模型,通过稀疏路由和专家多样性训练,从二元偏好数据中学习可解释的专家模式,提升个性化偏好建模的测试时适应性和可解释性。

详情
AI中文摘要

偏好建模在基于人类反馈的强化学习(RLHF)中扮演核心角色,使大型语言模型(LLMs)与人类价值观对齐。然而,大多数现有方法假设一个通用的奖励函数,忽视了人类偏好的多样性和异质性。为了在不增加额外标注成本的情况下解决这一限制,最近的工作提出从二元数据中学习多个偏好组件,并组合它们以建模个体偏好。然而,这些组件往往无法捕捉连贯且解耦的模式,限制了其可解释性和个性化效果。在这项工作中,我们提出了一种稀疏混合专家(MoE)奖励模型,该模型在二元偏好数据训练过程中鼓励稀疏路由和专家多样性。在受控和真实世界的实验中,稀疏MoE学习了可解释的路由模式和专业化的专家。它还改进了测试时的个性化,并且适应后的专家权重变化为分析模型如何适应个性化偏好提供了定性视角。

英文摘要

Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to model individual preferences. Nevertheless, these components often fail to capture coherent and disentangled patterns, limiting their interpretability and effectiveness for personalization. In this work, we propose a sparse Mixture-of-Experts (MoE) reward model that encourages sparse routing and expert diversity during training on binary preference data. Across controlled and real-world experiments, sparse MoE learns interpretable routing patterns and specialized experts. It also improves test-time personalization, and post-adaptation shifts in expert weights provide a qualitative lens for analyzing how the model adapts to personalized preferences.
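A minimal sketch of sparse routing over expert rewards is given below. The top-k gate with renormalization and the Bradley-Terry pairwise loss are common choices assumed here for illustration; the abstract does not specify the router, the loss, or the diversity term, and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sparse_moe_reward(router_logits, expert_rewards, top_k=2):
    """Keep the top-k routing weights, renormalize them, and mix the expert rewards."""
    gates = softmax(router_logits)
    keep = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in keep)
    reward = sum(gates[i] / norm * expert_rewards[i] for i in keep)
    return reward, keep

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry negative log-likelihood for one binary preference pair (assumed loss)."""
    return math.log(1.0 + math.exp(-(reward_chosen - reward_rejected)))

reward, active = sparse_moe_reward([2.0, 0.1, 1.0, -1.0], [1.0, 5.0, 3.0, -2.0], top_k=2)
```

Only the experts listed in `active` contribute to the reward, so inspecting which experts fire for which inputs is one way the routing pattern could be read as an interpretable signal.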

2606.04274 2026-06-04 cs.CL cs.CY 版本更新

Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit

微调万岁:在Reddit上,任务特定Transformer在错误信息响应分类中优于零样本LLM

JooYoung Lee, Lin Tian, Angela Brillantes, Adriana-Simona Mihăiţă, Marian-Andrei Rizoiu

发表机构 * University of Technology Sydney(悉尼科技大学)

AI总结 通过对比微调模型与零样本LLM在Reddit错误信息评论分类上的表现,发现微调RoBERTa在宏F1分数上显著优于最佳零样本模型,且成本更低,表明任务特定微调在检测隐性错误信息方面仍更可靠。

详情
AI中文摘要

随着大型语言模型(LLM)成为在线信息验证的默认工具,一个隐含的假设随之而来:规模和通用能力足以对错误信息话语进行细致分类。我们直接在900条Reddit评论上测试这一假设,这些评论涵盖三个经PolitiFact验证的错误信息主张(环境、健康、移民),并标记为相信(传播主张)、事实核查(纠正主张)或其他。我们比较了三种范式下的九个模型——BART-MNLI、三种Llama变体、三种商业前沿LLM(Claude Haiku 4.5、Gemini Flash Lite 2.5、Claude Sonnet 4.6),以及微调的DistilBERT和RoBERTa——在通用和主题特定标签方案下。该假设不成立。微调RoBERTa达到0.62的宏F1,而最佳零样本结果为0.50(Claude Haiku 4.5),且每次查询成本极低;监督优势集中在相信类别,这是每个零样本模型都检测不足的隐式情感类别。规模无帮助:Llama-3-8B与Llama-3-70B表现相当,Claude Sonnet 4.6在通用标签下表现逊于较小的Haiku,将相信检测降至0.17,并直接拒绝部分被标记为敏感的评论。这是安全对齐的伪影,而非能力限制。标签方案和主题共同塑造零样本性能,同一模型在匹配标签下不同主题间的宏F1差异超过0.13。在验证场景中,遗漏相信是代价更高的错误,尽管大型生成模型激增,任务特定微调仍是更可靠的选择。

英文摘要

As large language models (LLMs) become default tools for online information verification, an implicit assumption follows them: that scale and general capability are sufficient for nuanced classification of misinformation discourse. We test this assumption directly on 900 Reddit comments spanning three PolitiFact-verified misinformation claims (environment, health, immigration), labelled as belief (propagates the claim), fact-check (corrects it), or other. We compare nine models across three paradigms -- BART-MNLI, three Llama variants, three commercial frontier LLMs (Claude Haiku 4.5, Gemini Flash Lite 2.5, Claude Sonnet 4.6), and fine-tuned DistilBERT and RoBERTa -- under universal and topic-specific label schemas. The assumption does not hold. Fine-tuned RoBERTa reaches 0.62 macro-$F_1$ against a best zero-shot result of 0.50 (Claude Haiku 4.5), at a fraction of the per-query cost; the supervised advantage is concentrated on the belief class, the implicit, affective category every zero-shot model under-detects. Scaling does not help: Llama-3-8B matches Llama-3-70B, and Claude Sonnet 4.6 underperforms the smaller Haiku under generic labels, collapsing belief detection to 0.17 and refusing outright on a subset of comments flagged as sensitive. This is a safety-alignment artefact, not a capacity limit. Label schema and topic jointly shape zero-shot performance, with the same model varying by more than 0.13 macro-$F_1$ across topics under matched labels. In a verification context, where missing belief is the costlier error, task-specific fine-tuning remains the more reliable choice despite the proliferation of large generative models.
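Macro-F1 weights each class equally, which is why a weak belief class pulls the score down even when the other classes are handled well. The sketch below computes it from scratch on invented labels; the toy data is an assumption and has no relation to the paper's 900 comments.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return sum(f1s) / len(f1s)

labels = ["belief", "fact-check", "other"]
gold = ["belief", "belief", "fact-check", "other", "other", "other"]
pred = ["other", "belief", "fact-check", "other", "other", "belief"]

score = macro_f1(gold, pred, labels)  # per-class F1: 0.5, 1.0, 2/3 -> 13/18
```

In this toy case the belief class has an F1 of 0.5 while fact-check is perfect, so the macro average sits well below what the fact-check class alone would suggest.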

2606.04262 2026-06-04 cs.CL cs.AI 版本更新

Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

我可以再服一剂吗?评估LLM在OTC剂量问答中时间不确定性下的决策能力

Maroof Kousar, Yibo Hu

发表机构 * Illinois Institute of Technology(伊利诺伊理工学院)

AI总结 提出DOSEBENCH基准测试,评估大语言模型在非处方药剂量问答中处理时间推理、约束遵循和不确定性的能力。

Comments 16 pages, 7 figures

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用于日常健康问题,包括用户是否可以安全地再服用一剂非处方(OTC)药物。然而,这一常见的安全相关场景在现有的医学问答评估中仍未得到充分探索,其中正确答案需要跟踪剂量时间、计算滚动24小时摄入量、遵循产品标签约束以及处理不完整的用药史。我们引入了DOSEBENCH,这是一个包含81个精心策划的OTC剂量场景的聚焦基准测试,专注于成人对乙酰氨基酚和布洛芬的使用,并带有手动标注的金标准参考。我们使用决策正确性、一致性、解释可验证性、失败类型和置信度相关信号等指标,在多次运行中评估了四个LLM,共获得1620个模型响应。我们的结果表明,模型在滚动窗口推理和模糊敏感场景中经常遇到困难,且稳定或看似自信的响应仍可能违反剂量约束。这些发现表明,OTC剂量问答为评估医学问答中的时间推理、约束遵循和安全相关不确定性处理提供了一个狭窄但实用的测试平台。

英文摘要

Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common safety-relevant setting remains underexplored in existing medical QA evaluations, where correct answers require tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories. We introduce DOSEBENCH, a focused benchmark of 81 curated OTC dosing scenarios covering adult acetaminophen and ibuprofen use, with manually annotated gold references. We evaluate four LLMs across repeated runs using metrics for decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals, resulting in 1,620 model responses. Our results show that models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases and that stable or confident-looking responses can still violate dosing constraints. These findings suggest that OTC dosing QA provides a narrow yet practical testbed for evaluating temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.
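The rolling 24-hour computation that the benchmark probes can be sketched as below. The function name and the limit values are placeholders chosen for illustration only; they are not taken from any product label or from DOSEBENCH and must not be read as dosing guidance. The sketch also assumes a complete medication history, which the benchmark explicitly does not.

```python
from datetime import datetime, timedelta

def can_take_dose(history, now, dose_mg, max_daily_mg, min_interval_h):
    """history: list of (datetime, mg). Returns (allowed, reason)."""
    window_start = now - timedelta(hours=24)
    recent = [(t, mg) for t, mg in history if window_start < t <= now]
    total = sum(mg for _, mg in recent)
    if recent:
        last = max(t for t, _ in recent)
        if now - last < timedelta(hours=min_interval_h):
            return False, "too soon after the previous dose"
    if total + dose_mg > max_daily_mg:
        return False, "rolling 24-hour total would exceed the limit"
    return True, "within both constraints"

history = [
    (datetime(2026, 1, 1, 8, 0), 400),
    (datetime(2026, 1, 1, 14, 0), 400),
    (datetime(2026, 1, 1, 20, 0), 400),
    (datetime(2026, 1, 2, 2, 0), 400),
]

# Placeholder limits (hypothetical): 1500 mg per rolling 24 h, 4 h between doses.
at_9am = can_take_dose(history, datetime(2026, 1, 2, 9, 0), 400, 1500, 4)
at_3pm = can_take_dose(history, datetime(2026, 1, 2, 15, 0), 400, 1500, 4)
```

At 09:00 three earlier doses still fall inside the window, so the request is refused, whereas counting only doses since midnight would wrongly allow it. By 15:00 the 14:00 dose from the previous day has dropped out of the window and the same request passes.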

2606.04261 2026-06-04 cs.AI cs.CL cs.CV cs.ET cs.LG 版本更新

Can Generalist Agents Automate Data Curation?

通用智能体能否自动化数据筛选?

Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia

发表机构 * Virginia Tech(弗吉尼亚理工大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出Curation-Bench基准,通过通用编码智能体自动化数据筛选循环,实验表明现成智能体可达到强基线,但存在执行-研究差距,而结构化方法引导的智能体能在十分之一数据预算下自主组合出优于强基线的数据选择策略。

Comments Preprint

详情
AI中文摘要

训练数据的筛选是现代AI开发中最重要但劳动密集的部分之一:实践者根据嘈杂的基准反馈迭代地提出、实施、评估和修订数据策略。我们探究通用编码智能体能否自动化这一数据筛选循环。我们引入了*Curation-Bench*,一个以智能体为中心的基准,它固定模型、训练配方和评估套件,同时赋予智能体命令行权限以检查数据、实施策略、提交到固定的训练/评估流水线并进行修订。在视觉-语言指令微调实例中,现成智能体在十次迭代内达到了已发表的强数据选择基线。然而,轨迹分析揭示了持续的*执行-研究差距*:即使提供了策略指南和论文参考,智能体主要调整局部策略变体,而非探索新的策略家族。要求每次迭代引用、实例化和改编先前方法的框架将智能体转向方法引导的探索。这种框架化的智能体自主组合——无需人工设计输入——一种数据选择策略,在十分之一的数据预算下优于已发表的强基线。总体而言,当前智能体可以运行筛选循环,但可靠的数据研究需要框架化的方法适应,而非仅靠开放式提示。代码和基准已开源。

英文摘要

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

2606.04246 2026-06-04 cs.AI cs.AR cs.CL 版本更新

StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

StepPRM-RTL:基于逐步过程奖励引导的LLM微调以增强RTL综合

Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Ehsan Degan, Vandana Mukherjee

发表机构 * IBM Research, San Jose, CA, USA(IBM研究院,美国加利福尼亚州圣何塞)

AI总结 提出StepPRM-RTL框架,结合逐步轨迹建模、过程奖励模型和检索增强微调,通过密集反馈和蒙特卡洛树搜索探索推理路径,提升LLM生成RTL代码的功能正确性和推理保真度,在基准数据集上相比先前方法提升超10%。

Comments 6 pages, 2 figures, DAC'2026

详情
AI中文摘要

由于Verilog和VHDL中的长程推理、多步依赖和严格正确性约束,数字硬件设计的RTL代码自动生成仍然具有挑战性。我们提出StepPRM-RTL,一种新颖的框架,结合逐步轨迹建模、过程奖励模型(PRM)和检索增强微调(RAFT),以增强基于LLM的RTL代码生成的功能正确性和推理保真度。StepPRM-RTL从规范解构建逐步推理轨迹,其中每一步包含一个理由和增量代码修改。过程奖励模型(PRM)评估中间步骤,提供密集反馈,指导RAFT微调期间的强化式更新。蒙特卡洛树搜索(MCTS)探索替代推理路径,用高质量轨迹丰富训练数据集。这种逐步和结果感知奖励的集成使模型能够学习如何以及为何构建正确的RTL,从而改善超出标准监督或基于结果训练的长程推理。在基准Verilog和VHDL数据集上的实验评估表明,StepPRM-RTL在功能正确性和推理保真度指标上优于先前最佳方法超过10%。消融研究证实,PRM引导奖励和逐步轨迹探索的结合是其性能的关键。StepPRM-RTL跨RTL语言泛化,并为高保真、可解释的代码生成提供了可扩展框架,为LLM辅助硬件设计自动化建立了新标准。

英文摘要

Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10\% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.

2606.04244 2026-06-04 cs.AI cs.CL cs.CV cs.LG 版本更新

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

VAMPS: 视觉辅助数学问题求解基准

Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出VAMPS基准,通过1,168道双语多选题评估多模态大模型在借助绘图工具进行数学推理时的表现,发现直接解析求解优于工具辅助视觉求解。

详情
AI中文摘要

多模态大语言模型在复杂推理方面能力日益增强,但当它们必须通过工具外部化问题然后基于工具输出进行推理时,尤其是在依赖视觉辅助的情况下,其性能往往会下降。这一差距尤为重要,因为真实的工程和科学工作流程通常依赖可视化工具进行分析、验证和决策。为了研究这一差异,我们引入了VAMPS(视觉辅助数学问题求解),一个用于图辅助数学的基准。VAMPS包含1,168个多模态、双语选择题问答对,这些题目来自伊朗大学入学考试的代数和微积分问题,并通过人工审核的LLM生成的合成变体进行了扩展,所有题目都经过精心挑选,使得绘图能够通过揭示交点、极值、渐近线等提供自然的求解策略。VAMPS旨在用于基准测试和诊断,它超越了以往主要评估在固定视觉输入上进行推理的多模态基准,通过测试模型是否能够从构建有用的图形中受益并将其答案基于结果可视化。总体而言,我们发现,在一组多样化的模型中,直接解析求解出人意料地优于工具辅助的视觉求解,即使在绘图是自然策略的问题上也是如此。

英文摘要

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

2606.04240 2026-06-04 cs.CV cs.AI cs.CL 版本更新

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

EReL@MIR 2025 多模态文档检索挑战赛(赛道1)概述

Jingbiao Mei

发表机构 * University of Cambridge, Cambridge, United Kingdom(剑桥大学,英国剑桥)

AI总结 本文介绍了EReL@MIR 2025多模态文档检索挑战赛(赛道1)的设计、数据集、评估协议、最终排名及前三名获胜系统的分析,所有系统均基于Qwen2-VL系列解码器多模态大语言模型嵌入器。

Comments MDR Challenge Report at WWW2025

详情
AI中文摘要

对于视觉丰富的文档(即文本与图形、表格和图表交织的页面)的检索,对于多模态检索增强生成至关重要,然而大多数检索器仍然丢弃视觉通道。\emph{多模态文档检索挑战赛}是首届EReL@MIR研讨会(与2025年万维网会议同期举办)中MIR挑战赛的赛道1,要求参与者构建一个\emph{单一}检索系统,处理两种互补的场景:基于文本查询在长文档内进行封闭集文档页面检索(MMDocIR),以及基于图像或图像加文本查询进行开放域维基百科风格段落检索(M2KR)。系统根据两个任务上平均Recall@$\{1,3,5\}$的宏平均值进行排名。该挑战赛吸引了来自22个团队的455名参赛者和586份提交。本报告描述了挑战赛的设计、数据集和评估协议;报告了最终排名;并分析了三个获胜团队的系统。所有三个系统都基于Qwen2-VL系列的解码器多模态大语言模型嵌入器,而非CLIP风格的编码器,主要区别在于它们是通过微调集成、无训练的多路融合与强视觉语言重排序器,还是零样本后期交互达到顶尖水平。无训练系统与微调获胜者的得分差距在0.1分以内。

英文摘要

Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The \emph{Multimodal Document Retrieval Challenge}, Track~1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a \emph{single} retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@$\{1,3,5\}$ over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within $0.1$ point of the fine-tuned winner.
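The ranking metric can be sketched as follows, under one reading of the description: Recall@1, Recall@3, and Recall@5 are averaged per task, and the two task scores are then averaged with equal weight. The function names and the toy rankings are assumptions; the official scorer may differ in details such as tie handling.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

def task_score(queries, ks=(1, 3, 5)):
    """Mean over queries of the mean Recall@k across the cutoffs in ks."""
    per_query = [
        sum(recall_at_k(ranked, relevant, k) for k in ks) / len(ks)
        for ranked, relevant in queries
    ]
    return sum(per_query) / len(per_query)

def challenge_score(task_a, task_b):
    """Macro-average: each task counts equally, regardless of its query count."""
    return (task_score(task_a) + task_score(task_b)) / 2

# Hypothetical single-query examples for the two regimes.
page_task = [(["p3", "p1", "p7", "p2", "p9"], ["p1"])]      # Recall@{1,3,5} = 0, 1, 1
passage_task = [(["d1", "d2", "d3", "d4", "d5"], ["d1"])]   # Recall@{1,3,5} = 1, 1, 1

final = challenge_score(page_task, passage_task)
```

Macro-averaging over tasks means a system cannot reach the top by excelling at only one of the two regimes, which matches the requirement that a single system handle both.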

2606.04236 2026-06-04 cs.CL cs.AI cs.LG 版本更新

Supportive Token Revealing for Fast Diffusion Language Model Decoding

面向快速扩散语言模型解码的支持性标记揭示

Giries Abu Ayoub, Mario Barbara, Lluís Pastor-Pérez, Tanja Bien, Aneesh Barthakur, Alaa Maalouf, Loay Mualem

发表机构 * Department of Computer Science, University of Haifa(海法大学计算机科学系) Institute for AI, University of Stuttgart(斯图加特大学人工智能研究所) IMPRS-IS

AI总结 提出AXON模块,通过选择注意力、不确定性和置信度信号中的锚点标记来改善扩散语言模型并行解码的质量-延迟权衡。

详情
AI中文摘要

离散扩散语言模型可以通过并行更新多个掩码位置来高效生成文本,但这种并行性引入了质量-延迟权衡。激进的解码可能过早提交相互依赖的标记,而保守的解码则需要大量去噪步骤。现有方法通过使用置信度或依赖性标准决定哪些标记可以安全揭示来解决这一矛盾。然而,避免不安全提交并不一定使剩余的掩码序列易于解码,因为不确定的标记可能依赖于掩码标记,从而成为去噪步骤的瓶颈。我们提出AXON,一个无需训练的模块,可添加到现有扩散语言模型的并行解码策略之上。AXON不替换基础解码器,而是监控剩余不确定的掩码标记,并仅当它们当前状态表明需要额外上下文时才进行干预。然后它将标准从揭示哪些标记最安全转变为哪些自信揭示最能支持后续去噪。AXON使用注意力、不确定性和置信度信号选择锚点,即不确定位置关注的自信掩码标记。在多个扩散语言模型的推理和代码生成基准上的实验表明,AXON改善了现有并行解码器的质量-延迟权衡,通常减少函数评估次数,同时保持或提高准确性。

英文摘要

Discrete diffusion language models can generate text efficiently by updating multiple masked positions in parallel, but this parallelism introduces a quality-latency trade-off. Aggressive decoding may commit mutually dependent tokens too early, while conservative decoding requires many denoising steps. Existing methods address this tension by deciding which tokens are safe to reveal using confidence or dependency criteria. However, avoiding unsafe commits does not necessarily make the remaining masked sequence easy to decode, since uncertain tokens may depend on masked tokens, creating a bottleneck for denoising steps. We propose AXON, a training-free module that can be added on top of existing parallel decoding strategies for diffusion language models. Rather than replacing the base decoder, AXON monitors the remaining uncertain masked tokens and intervenes only when their current state suggests that additional context is needed. It then shifts the criterion from which tokens are safest to reveal to which confident reveals would best support later denoising. AXON selects anchors, confident masked tokens that uncertain positions attend to, using attention, uncertainty, and confidence signals. Experiments on reasoning and code-generation benchmarks across multiple diffusion language models show that AXON improves the quality-latency trade-off of existing parallel decoders, often reducing the number of function evaluations while maintaining or improving accuracy.
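The anchor idea can be illustrated with a toy scoring rule. Everything specific here is an assumption: the thresholds, the product form of the score, and the name `select_anchors` are hypothetical, since the abstract only says that attention, uncertainty, and confidence signals are combined.

```python
def select_anchors(confidence, attention, low=0.5, high=0.9, budget=1):
    """Pick confident masked positions that uncertain masked positions attend to.

    confidence: {masked position: top-token probability}
    attention:  {uncertain position: {attended position: attention weight}}
    """
    uncertain = [i for i in confidence if confidence[i] < low]
    candidates = [j for j in confidence if confidence[j] >= high]

    def support(j):
        # Attention mass received from uncertain positions, weighted by how
        # uncertain each of them is, then scaled by the candidate's confidence.
        received = sum(
            (1 - confidence[i]) * attention.get(i, {}).get(j, 0.0) for i in uncertain
        )
        return confidence[j] * received

    return sorted(candidates, key=support, reverse=True)[:budget]

confidence = {2: 0.95, 5: 0.30, 7: 0.92, 9: 0.40}
attention = {5: {2: 0.1, 7: 0.6}, 9: {2: 0.2, 7: 0.5}}

anchors = select_anchors(confidence, attention)
```

Position 2 is the most confident candidate, yet position 7 is selected because the uncertain positions attend to it far more, which is the shift from "safest to reveal" to "most supportive to reveal" in miniature.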

2606.04231 2026-06-04 cs.CL cs.AI 版本更新

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

MM-BizRAG:面向通用企业问答的多模态检索增强生成再思考

Hanoz Bhathena, Parin Rajesh Jhaveri, Rohan Mittal, Prateek Singh, Aymen Kallala, Rachneet Kaur, Yiqiao Jin, Zhen Zeng, Adwait Ratnaparkhi, Denis Kochedykov

发表机构 * JPMorgan Chase & Co.(摩根大通公司) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出MM-BizRAG框架,通过文档结构感知分割和布局感知解析,结合统一LLM驱动的工件转换与推理时多模态组装,无需微调即可提升企业文档问答性能,在异构企业数据集和两个公开基准上超越基线最多32个百分点。

Comments Accepted at ACL 2026 (Industry Track)

详情
AI中文摘要

近期多模态检索增强生成(MM-RAG)的进展倾向于最小化解析,依赖页面级图像来生成检索器嵌入和答案生成。虽然高效,但这种趋势往往忽略了对复杂企业文档中丰富结构化信息的显式处理,而是依赖预训练嵌入或视觉语言模型隐式捕获这种结构。在本工作中,我们采取更直接的方法:MM-BizRAG通过文档结构感知分割主动提取和表示文档结构,该分割根据文档方向动态路由文档至特定方向的摄取管道,对垂直结构文档(如报告)应用显式布局感知解析,对水平结构文档(如幻灯片)应用整体页面级表示。统一的LLM驱动的工件转换管道通过基于占位符的位置对齐保留自然阅读顺序,而推理时的多模态组装将检索表示与生成上下文解耦,无需任何微调即可生成更丰富、更基于事实的答案。通过在大型异构企业数据集和两个公开基准(SlideVQA和FinRAGBench-V)上的实验,MM-BizRAG一致地超越最先进的以视觉为中心的基线最多32个百分点,在报告式布局上尤其强劲。此外,我们引入了FastRAGEval,一种单次调用的LLM评判指标,用于细粒度生成召回,将RAGChecker的成本减半,同时实现更强的人类对齐。

英文摘要

Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects explicit handling of the rich, structured information in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure. In this work, we take a more direct approach: MM-BizRAG proactively extracts and represents document structure via a document structure-aware split that dynamically routes documents through orientation-specific ingestion pipelines, applying explicit layout-aware parsing for vertically structured documents (e.g., reports) and holistic page-level representations for horizontally structured documents (e.g., slide decks). A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, while inference-time multimodal assembly decouples retrieval representations from generation context, enabling richer, more grounded answers without any finetuning requirement. Through experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms state-of-the-art vision-centric baselines by up to 32 percentage points, with especially strong gains on report-style layouts. Furthermore, we introduce FastRAGEval, a single-call LLM Judge metric for fine-grained generative recall that halves RAGChecker's cost while achieving stronger human alignment.

2606.04205 2026-06-04 cs.MM cs.AI cs.CL cs.CV cs.LG cs.SD 版本更新

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

DetectZoo:一个用于跨文本、音频和图像模态的AI生成内容检测的统一工具包

Sajad Ebrahimi, Nima Jamali, Bardia Shirsalimian, Kelly McConvey, Wentao Zhang, Jalehsadat Mahdavimoghaddam, Maksym Taranukhin, Maura Grossman, Vered Shwartz, Yuntian Deng, Ebrahim Bagheri

发表机构 * University of Toronto(多伦多大学) University of Waterloo(滑铁卢大学) Toronto Metropolitan University(多伦多都会大学) University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所)

AI总结 提出DetectZoo,一个首个统一的多模态AI生成内容检测工具包,通过标准化数据预处理、评估流程和集成61个检测器与22个基准数据集,实现公平可重复的基准测试。

详情
AI中文摘要

生成模型的日益普及和能力提升模糊了人类与机器生成内容之间的界限,推动了跨文本、图像和音频检测领域的大量研究。大多数现有的检测器要么是商业软件,要么是开源但带有不兼容的代码库、定制化的预处理、评估协议和评估指标,这使得它们的采用、公平比较和复现变得相当困难。为了解决这一关键差距,我们引入了DetectZoo,这是首个可扩展的工具包,旨在为跨文本、音频和图像模态的AI生成内容检测提供统一接口。DetectZoo标准化了从数据摄取和预处理到模型评估的完整实证流程,为研究人员提供了一个统一的框架来系统地基准测试最先进的检测器。通过将多样的公共数据集和基线检测算法集成到单一的统一API下,我们的工具包促进了严格且可重复的评估。DetectZoo提供了61个检测器的参考实现、22个基准数据集的原生加载器,以及一个标准化的评估流程,通过通用接口报告多个指标。每个检测器都是自包含的,但可通过同一接口访问,自动缓存预训练权重,并复现原始发表的结果。DetectZoo降低了多模态AI取证的入门门槛,使研究人员能够识别跨领域的性能差距,并加速开发鲁棒、可泛化的检测技术。开源仓库和全面文档可在https://github.com/sadjadeb/DetectZoo 获取,且可通过pip install detectzoo安装该包。

英文摘要

The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

2606.04199 2026-06-04 cs.CL cs.LG 版本更新

Cross-Prompt Generalization in Detecting AI-Generated Fake News Using Interpretable Linguistic Features

使用可解释语言特征检测AI生成假新闻的跨提示泛化

Aya Vera-Jimenez, Samuel Jaeger, Calvin Ibenye, Dhrubajyoti Ghosh

发表机构 * Department of Mathematics(数学系) School of Data Science and Analytics(数据科学与分析学院) Department of Computer Science(计算机科学系)

AI总结 研究通过提取词汇多样性、可读性和情感特征,在跨提示框架下使用随机森林分类器检测AI生成假新闻,发现模型在不同提示下均表现稳定(AUC 0.988-1.000),表明这些特征可泛化。

详情
AI中文摘要

大型语言模型的日益普及引发了对AI生成假新闻传播的担忧,尤其是在不同的提示策略下。大多数现有的检测模型是在单一生成设置下训练和评估的,其跨未见提示的泛化能力尚不清楚。在本研究中,我们使用三个在不同提示下生成的AI文章数据集以及真实新闻文章,研究了假新闻检测中的跨提示泛化。我们提取了捕捉词汇多样性、可读性和情感特征的可解释语言特征,并在跨提示框架下评估了随机森林分类器,其中在一个提示上训练的模型在另一个提示上进行测试。在所有六个训练-测试组合中,性能始终保持较高,AUC值在0.988到1.000之间。特征分布分析显示,与整体数据集相比,AI生成文本表现出更高的词汇多样性、更低的可读性和显著较低的情感强度,且不同提示间存在差异。尽管存在这些分布变化,分类器仍保持强劲性能,表明这些特征捕捉了AI生成文本的稳定属性,这些属性可跨提示策略泛化。这些发现表明,基于特征的方法可以在提示变化下提供对AI生成假新闻的稳健检测。

英文摘要

The increasing use of large language models has raised concerns about the spread of AI-generated fake news, particularly under varying prompting strategies. Most existing detection models are trained and evaluated under a single generation setting, leaving their ability to generalize across unseen prompts unclear. In this study, we investigate cross-prompt generalization in fake news detection using three datasets of AI-generated articles produced under distinct prompts, combined with real news articles. We extract interpretable linguistic features capturing lexical diversity, readability, and emotion-based characteristics and evaluate a random forest classifier under a cross-prompt framework, where models trained on one prompt are tested on another. Across all six train-test combinations, performance remains consistently high, with AUC values ranging from 0.988 to 1.000. Analysis of feature distributions shows that AI-generated text exhibits increased lexical diversity, reduced readability, and substantially lower emotional intensity compared to the overall dataset, with variations across prompts. Despite these distributional shifts, the classifier maintains strong performance, indicating that these features capture stable properties of AI-generated text that generalize across prompting strategies. These findings suggest that feature-based approaches can provide robust detection of AI-generated fake news under prompt variability.
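The cross-prompt protocol and the kind of interpretable feature involved can be sketched as follows. The type-token ratio stands in for the lexical-diversity features, and the rank-based AUC is a standard definition; the prompt names are hypothetical, and the random forest itself is omitted to keep the example dependency-free.

```python
from itertools import permutations

def type_token_ratio(text):
    """Simple lexical-diversity feature: distinct tokens divided by total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def auc(scores_pos, scores_neg):
    """Probability that a positive example outscores a negative one (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Three generation prompts give six ordered train-test combinations.
prompts = ["prompt_A", "prompt_B", "prompt_C"]
train_test_pairs = list(permutations(prompts, 2))

ttr = type_token_ratio("the cat saw the dog")
toy_auc = auc([0.9, 0.8], [0.1, 0.8])
```

Each pair in `train_test_pairs` would train a classifier on articles from the first prompt and score articles from the second, so a high AUC on every pair indicates that the features transfer across prompts rather than memorize one of them.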

2606.04197 2026-06-04 cs.MA cs.CL cs.SI physics.soc-ph 版本更新

Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions

探索共识的拓扑与记忆:LLM智能体在形成惯例时如何达成一致、分裂或稳定

Aliakbar Mehdizadeh, Martin Hilbert

发表机构 * Department of Communication, University of California, Davis(加州大学戴维斯分校传播学系)

AI总结 研究LLM多智能体系统中记忆深度与通信拓扑的交互作用,发现记忆对协调的影响符号会因网络中心化程度而反转,并揭示了记忆介导的速度-统一性权衡。

Comments Submitted to the Journal of Artificial Societies and Social Simulation (JASSS)

详情
AI中文摘要

一个LLM智能体应该记住多少,以及多智能体系统在试图达成共识时应该如何连接?我们展示了这两个设计选择以某种方式交互,使得记忆对协调的影响符号发生翻转。通过对八个固定的16智能体拓扑上的网络化命名游戏进行432次模拟运行,我们改变了记忆深度和网络结构。更长的记忆在去中心化网络中减缓了达到稳态的时间,但在中心化网络中加速了这一过程;相同的参数根据拓扑将系统推向相反的方向。关键的是,中心化网络中的“更快稳定”意味着更快地锁定到一个碎片化的平台,而不是达到系统范围的共识,这可以用来产生分歧的意见。我们进一步记录了一种记忆介导的速度-统一性权衡:中心化网络始终比去中心化网络保留更多竞争性惯例,但它们的稳定速度严重依赖于记忆。在智能体层面,网络内分析表明,高中介性的桥梁遭受中介惩罚,而局部聚类邻域中的智能体实现更高的协调成功。最后,为了寻找可解析的生成机制,我们发现智能体的选择被虚拟博弈很好地捕捉,表明是基于信念而非基于奖励的适应。实际意义:记忆深度和通信拓扑应共同设计,而不是孤立优化。

英文摘要

How much should an LLM agent remember, and how should multi-agent systems be connected when trying to reach consensus? We show these two design choices interact in a way that flips the sign of memory's effect on coordination. Across 432 simulation runs of a networked Naming Game on eight fixed 16-agent topologies, we vary memory depth and network structure. Longer memory slows the time to reach steady state in decentralized networks but accelerates it in centralized ones; the same parameter pushes the system in opposite directions depending on topology. Critically, "faster settling" in centralized networks means locking in to a fragmented plateau more quickly, not reaching system-wide consensus, which can be used to generate diverging opinions. We further document a memory-mediated speed-unity trade-off: centralized networks consistently preserve more competing conventions than decentralized networks, but their settling speed depends sharply on memory. At the agent level, within-network analyses show that high-betweenness bridges suffer a brokerage penalty while agents in locally clustered neighborhoods achieve higher coordination success. Finally, in search of analytically tractable generative mechanisms, we find that agents' choices are well captured by Fictitious Play, indicating belief-based rather than reward-based adaptation. The practical implication: memory depth and communication topology should be co-designed, not optimized in isolation.
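A toy version of the setup helps make the two design knobs concrete. The sketch below replaces LLM agents with a fictitious-play-style rule (best response to the names remembered from partners), and it uses a ring and a star as stand-ins for decentralized and centralized topologies. The function name, the name inventory, and the round count are assumptions, and the sketch is not expected to reproduce the paper's findings.

```python
import random
from collections import Counter, deque

def naming_game(edges, n_agents, memory, names=("A", "B", "C", "D"), rounds=2000, seed=0):
    """Return the number of distinct conventions agents hold after the simulation."""
    rng = random.Random(seed)
    memories = [deque(maxlen=memory) for _ in range(n_agents)]

    def choose(agent):
        # Best-respond to the empirical distribution of names heard from partners;
        # with an empty memory, pick a name uniformly at random.
        if not memories[agent]:
            return rng.choice(names)
        counts = Counter(memories[agent])
        top = max(counts.values())
        return rng.choice(sorted(n for n, c in counts.items() if c == top))

    for _ in range(rounds):
        a, b = rng.choice(edges)          # one pairwise interaction per round
        name_a, name_b = choose(a), choose(b)
        memories[a].append(name_b)        # each agent remembers the partner's name
        memories[b].append(name_a)

    return len({choose(i) for i in range(n_agents)})

ring = [(i, (i + 1) % 16) for i in range(16)]   # decentralized stand-in
star = [(0, i) for i in range(1, 16)]           # centralized stand-in

ring_conventions = naming_game(ring, 16, memory=3)
star_conventions = naming_game(star, 16, memory=3)
```

Sweeping `memory` over several depths for each edge list, and recording both the number of surviving conventions and the round at which choices stop changing, would mirror the memory-by-topology design described above.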

2606.04194 2026-06-04 cs.LG cs.CL cs.IR 版本更新

Training-Free Lexical-Dense Fusion for Conversational-Memory Retrieval

免训练的词汇-稠密融合用于对话记忆检索

Christian Lysenstøen

发表机构 * Inland Norway University of Applied Sciences(内陆挪威应用科学大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出一种免训练、仅CPU的检索方法,通过分数级融合最大查询-轮次相似度(后期交互)与BM25,显著提升多会话对话记忆检索的命中率,并分析了不同编码器和池化策略的影响。

Comments 9 pages, 3 figures, 10 tables. Code, data, and per-table receipts: https://github.com/Chrislysen/opsem

详情
AI中文摘要

在跨长多会话历史中检索回答新查询的过去几轮是长期对话记忆(LoCoMo, LongMemEval)背后的检索瓶颈。最近的并行工作Nano-Memory表明,通过最大查询-轮次相似度(后期交互,“轮次隔离检索”)对会话进行评分优于均值池化的会话嵌入。我们不声称该效果;我们复现它并询问一个免训练、仅CPU的检索阶段应在其周围添加什么。我们报告四个发现。(1)融合:在单个留一对话权重下,后期交互稠密分数与BM25的分数级融合,在六个编码器上比单独后期交互增加+8.8到+17.2个LoCoMo Hit@1点(所有p<1e-4),达到Hit@1 0.752 / NDCG@5 0.829(e5-large-v2),比BM25高+11.2个百分点。(2)一个现成的网络搜索交叉编码器重排序器在融合的前10个结果上效果不佳,将Hit@1降低6.9个百分点(一个重排序器,一种配置)。(3)池化算子消融显示top-k后期交互匹配最大相似度,但朴素的平滑最大值(log-sum-exp)对一半编码器失效。(4)所有六个编码器的后期减早期差距很大,且较大的编码器差距往往更大,而边际融合增益缩小;在LongMemEval-S上,一个BM25饱和的词汇机制中,相对于BM25的净融合增益很小且不显著。按类别分析将增益视为分工:稠密后期交互在多跳和时间问题上帮助最大,但在对抗性问题上落后于BM25。贡献是对一个强大的免训练检索方案的可控、可复现的描述,而非后期交互检索器本身(Nano-Memory的)。我们不声称完整的记忆架构;这是一个检索阶段的研究。

英文摘要

Retrieving the few past turns that answer a new query across long multi-session histories is the retrieval bottleneck behind long-term conversational memory (LoCoMo, LongMemEval). Recent concurrent work, Nano-Memory, shows that scoring a session by the maximum query-turn similarity (late interaction, "Turn Isolation Retrieval") beats mean-pooled session embeddings. We do not claim that effect; we replicate it and ask what a training-free, CPU-only retrieval stage should add around it. We report four findings. (1) Fuse: score-level fusion of the late-interaction dense score with BM25, under a single leave-one-conversation-out weight, adds +8.8 to +17.2 points of LoCoMo Hit@1 over late interaction alone across six encoders (all p<1e-4), reaching Hit@1 0.752 / NDCG@5 0.829 (e5-large-v2), +11.2 pp over BM25. (2) An off-the-shelf web-search cross-encoder reranker over the fused top-10 hurts here, degrading Hit@1 by 6.9 pp (one reranker, one configuration). (3) A pooling-operator ablation shows top-k late interaction matches max-similarity, but a naive smooth-max (log-sum-exp) collapses for half the encoders. (4) The late-minus-early gap is large for all six encoders and tends to be larger for larger ones, while the marginal fusion gain shrinks; on LongMemEval-S, a lexical regime where BM25 saturates, the net fusion gain over BM25 is small and not significant. A per-category analysis frames the gain as a division of labor: dense late interaction helps most on multi-hop and temporal questions but trails BM25 on adversarial ones. The contribution is a controlled, reproducible account of a strong training-free retrieval recipe, not the late-interaction retriever itself (Nano-Memory's). We make no claim to a complete memory architecture; this is a retrieval-stage study.
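Finding (1) can be sketched in a few lines. The max over query-turn cosine similarities follows the late-interaction description; the min-max normalization and the convex combination are assumptions about how the score-level fusion might be done, BM25 scores are taken as given inputs, and all names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def late_interaction(query_vec, turn_vecs):
    """Score a session by its single best-matching turn (max query-turn similarity)."""
    return max(cosine(query_vec, t) for t in turn_vecs)

def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def fuse(dense_scores, bm25_scores, weight):
    """Convex combination of normalized dense and lexical scores (assumed form)."""
    d, b = min_max(dense_scores), min_max(bm25_scores)
    return [weight * x + (1 - weight) * y for x, y in zip(d, b)]

query = [1.0, 0.0]
sessions = [
    [[0.0, 1.0], [1.0, 0.1]],   # one off-topic turn, one near-exact match
    [[0.6, 0.8]],
    [[0.7, 0.7]],
]
dense = [late_interaction(query, turns) for turns in sessions]
bm25 = [1.0, 6.0, 2.0]          # hypothetical lexical scores for the same sessions

fused_dense_heavy = fuse(dense, bm25, weight=0.6)
fused_lexical_heavy = fuse(dense, bm25, weight=0.3)
```

The first session wins under the dense-heavy weight and the second under the lexical-heavy one, which shows why a single weight tuned on held-out conversations is the only parameter this recipe needs.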

2606.04189 2026-06-04 cs.CL 版本更新

ACAT: A Collaborative Platform for Efficient Aspect-Based Sentiment Dataset Annotation

ACAT:一种用于高效方面级情感数据集标注的协作平台

Ana-Maria Luisa Mocanu, Ciprian-Octavian Truica, Elena-Simona Apostol

发表机构 * National University of Science and Technology POLITEHNICA Bucharest(布加勒斯特国立理工科技大学)

AI总结 提出ACAT平台,通过自动化ETL流程和原生支持四种ABSA工作流,解决多标注者数据整合与一致性计算问题,实现高效标注并直接导出训练就绪数据集。

Comments Accepted at The 28th International Conference on Big Data Analytics and Knowledge Discovery (DaWak 2026)

详情
AI中文摘要

方面级情感分析(ABSA)需要高质量的数据集来训练可靠的模型。然而,现有的标注工具将输出视为平面文件,使得研究人员不得不通过自定义脚本手动整合多标注者数据、重建关系结构并计算可靠性指标。本文介绍了ACAT(基于方面的情感分析协作标注工具),这是一个基于Web的平台,原生支持四种ABSA工作流:(1)方面类别情感分析,(2)子句级分割,(3)具有字符级位置跟踪的方面术语情感分析,以及(4)具有双跨度偏移保留的方面情感三元组提取。其核心贡献是一个自动化的提取、转换、加载(ETL)管道,该管道在导出时直接对齐协作标注并计算标注者间一致性(IAA)指标,生成训练就绪的数据集。在1002条餐厅评论的初步验证中,由两名不同专业水平的标注者进行标注,ACAT的中位标注时间为31.58秒,所有任务的原始IAA在0.78到0.86之间。

英文摘要

Aspect-Based Sentiment Analysis (ABSA) requires high-quality datasets to train reliable models. However, existing annotation tools treat output as flat files, leaving researchers to manually consolidate multi-annotator data, reconstruct relational structures, and compute reliability metrics through custom scripts. This paper introduces ACAT (Aspect-based sentiment analysis Collaborative Annotation Tool), a web-based platform natively supporting four ABSA workflows: (1) Aspect-Category Sentiment Analysis, (2) Clause-Level Segmentation, (3) Aspect-Term Sentiment Analysis with character-level position tracking, and (4) Aspect Sentiment Triplet Extraction with dual span offset preservation. Its core contribution is an automated Extract, Transform, Load (ETL) pipeline that aligns collaborative annotations and computes Inter-Annotator Agreement (IAA) metrics directly at export, yielding training-ready datasets. In a preliminary validation on 1,002 restaurant reviews with two annotators of differing expertise, ACAT achieves a median annotation time of 31.58 seconds and a raw IAA ranging from 0.78 to 0.86 across all tasks.
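
As an illustration of the kind of agreement metric an export step might compute, the sketch below contrasts raw (percent) agreement with Cohen's kappa for two annotators. The abstract reports "raw IAA" without naming the formula, so treating it as percent agreement is an assumption, and the function names are hypothetical rather than part of ACAT.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and aligned")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = raw_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Expected agreement if both annotators labeled independently
    # according to their own label frequencies.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Raw agreement is easy to read but does not correct for agreement expected by chance, which is why a chance-corrected score such as kappa is often reported alongside it.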

2606.04177 2026-06-04 cs.CL cs.AI 版本更新

A Systematic Analysis of Linguistic Features in AI-Generated Text Detection Across Domains and Models

跨领域与模型的人工智能生成文本检测中语言特征的系统分析

Yassir El Attar, Esra Dönmez, Maximilian Maurer, Agnieszka Falenska

发表机构 * Institute for Natural Language Processing, University of Stuttgart(斯图加特大学自然语言处理研究所) Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart(智能系统反思交流论坛,斯图加特大学) GESIS Leibniz Institute for the Social Sciences(GESIS莱布尼茨社会科学研究所) Heinrich-Heine University Düsseldorf(杜塞尔多夫海因里希-海涅大学)

AI总结 通过大规模实证研究,系统评估284个可解释语言特征在27个LLM和10个文本领域中的鲁棒性,发现词汇丰富度是跨模型和领域的最可靠信号。

Comments preprint

详情
AI中文摘要

可解释的语言特征为解释给定文本为何看似机器生成提供了一种有前景的方法,尤其对于非专业用户。然而,关于哪些特征可靠地指示LLM生成文本的现有发现仍然分散在不同的特征集、模型和文本领域中。为解决这一差距,我们进行了一项大规模实证研究,评估语言信号在表征AI生成文本方面的鲁棒性。我们的分析涵盖了来自27个LLM和十个文本领域的输出中的284个可解释语言特征,并在跨模型和跨领域泛化设置下进行。我们表明,仅基于语言特征的分类器可以可靠地区分AI生成文本和人类撰写文本。然而,许多先前提出的指标被证明高度依赖上下文,但词汇丰富度指标除外,这些指标在模型家族和文本领域中保持鲁棒信号。这些结果展示了哪些语言信号在上下文中泛化,并为更可靠、可解释的AI生成语言分析提供了基础。

英文摘要

Interpretable linguistic features offer a promising approach for explaining why a given text appears machine-generated, particularly for non-expert users. However, existing findings on which features reliably indicate LLM-generated text remain fragmented across feature sets, models, and text domains. To address this gap, we conduct a large-scale empirical study assessing the robustness of linguistic signals for characterizing AI-generated text. Our analysis covers 284 interpretable linguistic features across outputs from 27 LLMs and ten text domains under cross-model and cross-domain generalization settings. We show that classifiers based solely on linguistic features can reliably distinguish AI-generated from human-written text. However, many previously proposed indicators prove strongly context-dependent, with the exception of measures of lexical richness, which remain robust signals across model families and text domains. These results demonstrate which linguistic signals generalize across contexts and provide a foundation for more reliable, interpretable analyses of AI-generated language.

2606.04160 2026-06-04 cs.CL cs.LG 版本更新

Expert-Aware Refusal Steering

专家感知的拒绝引导

Anna C. Marbut, Daniel R. Olson, Travis J. Wheeler

发表机构 * Department of Interdisciplinary Studies(交叉学科研究部) University of Montana(蒙大拿大学) Department of Pharmacy Practice & Science(药学实践与科学系) University of Arizona(亚利桑那大学) European Bioinformatics Institute(欧洲生物信息研究所) European Molecular Biology Laboratory(欧洲分子生物学实验室) Wellcome Genome Campus(惠康基因组园区)

AI总结 研究在混合专家(MoE)大语言模型中,通过专家感知的引导向量抑制拒绝行为,发现单个专家输出即可有效引导,且注意力机制在MoE拒绝行为中起重要作用。

Comments Under review for COLM 2026

详情
AI中文摘要

指令调优的大语言模型(LLM)的安全对齐依赖于模型可靠地拒绝回答有害或不允许请求的能力。最近的研究表明,在推理过程中对密集LLM应用引导向量可以有效抑制拒绝行为,诱导模型响应有害请求。我们将这种拒绝引导方法扩展到三个开源混合专家(MoE)LLM,并发现引导性能不受MoE架构固有的复杂路由模式影响。然后,我们提出了两种专家感知的拒绝引导方法,利用拒绝特定的专家路由模式和专家特定的引导方向来抑制正常的拒绝行为。我们发现,基于单个专家的输出即可有效引导拒绝行为。我们的结果表明,引导方法捕获的拒绝信号与专家路由行为不同,这表明注意力在MoE拒绝行为中扮演重要角色。

英文摘要

Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing response to harmful requests. We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is uninhibited by the complex routing patterns inherent to the MoE architecture. We then propose two expert-aware refusal steering methods that leverage refusal-specific expert routing patterns and expert-specific steering directions to suppress normal refusal behavior. We find that refusal behavior can be effectively steered based on the output of a single expert. Our results show that refusal signals captured by steering methods differ from expert routing behavior, suggesting a substantial role for attention in MoE refusal behavior.
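
For readers unfamiliar with refusal steering, the sketch below shows one widely used construction: a difference-of-means direction between activations on harmful and harmless prompts, which is then projected out of an activation at inference time. The abstract does not state how its steering vectors are built, so this construction and all function names are assumptions for illustration only.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference between mean harmful and mean harmless activations."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def ablate(activation, direction):
    """Remove the component of an activation along a unit-norm direction."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]
```

An expert-aware variant in the spirit of the abstract would compute such a direction from a single expert's output rather than from the full residual stream; the details of that step are not given in the abstract.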

2606.04155 2026-06-04 cs.HC cs.CL cs.CY 版本更新

SocialCoach: Personalized Social Skill Learning with RL-based Agentic Tutoring and Practice

SocialCoach: 基于强化学习的智能辅导与练习的个性化社交技能学习

Tianfu Wang, Max Xiong, Jianxun Lian, Hongyuan Zhu, Zhengyu Hu, Yuxuan Lei, Linxiao Gong, Xiaofang Li, Peiting Tsai, Nicholas Jing Yuan, Qi Zhang

发表机构 * HKUST (GZ)(香港科技大学(广州)) Duke University(杜克大学) MSRA Beijing(微软亚洲研究院,北京) Microsoft Beijing(微软北京)

AI总结 提出SocialCoach系统,利用多智能体管道构建知识语料库、强化学习优化自适应练习调度,并结合沉浸式实践与反思辅导,以解决社交技能学习中专家辅导稀缺和知行差距问题。

详情
AI中文摘要

社交技能如谈判和领导力在当今互联世界中对于个人和职业成功至关重要。然而,由于专家辅导的稀缺,可扩展且有效的培训仍然是一个重大挑战。在本文中,我们介绍了SocialCoach,一个全面的LLM驱动的智能辅导系统,用于大规模个性化社交技能发展。首先,SocialCoach利用多智能体管道,从多样化的专家来源自动构建一个基于教学法的理论到实践知识语料库。其次,为了个性化学习旅程,它采用了一个自适应练习调度模块,遵循处方-检索-适应过程。为了在克服冷启动问题的同时最大化长期学习体验,该策略通过强化学习在学习者模拟环境中进行优化。最后,SocialCoach整合了沉浸式目标驱动练习、因果驱动能力评估和基于知识的反思辅导,以帮助解决知行差距。我们在产品EQoach中部署了该系统,并进行了广泛实验。结果表明,SocialCoach在模拟路径质量和评委评估的辅导质量上优于基线方法,而早期用户反馈表明其具有强烈的感知参与度和有用性。这些发现为个性化、游戏化的软技能学习教学平台提供了一种实用架构。

英文摘要

Social skills such as negotiation and leadership are crucial for personal and professional success in today's interconnected world. However, scalable and effective training remains a significant challenge due to the scarcity of expert coaching. In this paper, we introduce SocialCoach, a holistic LLM-powered agentic tutoring system for personalized social skill development at scale. First, SocialCoach automatically constructs a pedagogically-grounded, theory-to-practice knowledge corpus from diverse expert sources, leveraging a multi-agent pipeline. Second, to personalize the learning journey, it employs an adaptive practice scheduling module that follows a prescription-retrieval-adaptation process. To maximize the long-term learning experience while overcoming the cold-start problem, this policy is optimized within a learner simulation environment through reinforcement learning. Finally, SocialCoach integrates immersive, goal-driven practice, causality-driven proficiency assessment and knowledge-grounded, reflective tutoring to help address the knowing-doing gap. We deploy it in our product, EQoach, and conduct extensive experiments. The results show that SocialCoach improves simulated pathway quality and judge-rated tutoring quality over baseline approaches, while early user feedback indicates strong perceived engagement and usefulness. These findings suggest a practical architecture for personalized and gamified pedagogical platforms on soft skill learning.

2606.04127 2026-06-04 cs.CL 版本更新

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

当检索无济于事:生物医学RAG的大规模研究

Erfan Nourbakhsh, Rocky Slavin, Ke Yang, Anthony Rios

发表机构 * The University of Texas at San Antonio(德克萨斯大学圣安东尼奥分校)

AI总结 本研究通过大规模实验发现,检索增强生成(RAG)在生物医学问答中仅带来微小且不一致的提升(1-2%),主要瓶颈在于模型有效利用检索证据的能力不足。

Comments 9 Pages, accepted to BioNLP Workshop at ACL 2026

详情
AI中文摘要

医学问答是一个高风险场景,事实错误可能导致严重后果。检索增强生成(RAG)被广泛视为一种有前景的解决方案,先前的研究报告称大型医学问答模型有显著提升。我们在一系列7B到72B参数的开源指令调优模型上重新审视了这一假设。在五个模型、十个生物医学QA数据集、四种检索方法和四个检索语料库上,我们发现与无检索基线相比,检索仅带来微小且不一致的改进,通常在1-2个百分点内。相比之下,骨干模型的选择比检索器或语料库的选择影响大得多,并且在大多数设置中,专家和外行检索源的表现相似。这些结果表明,主要瓶颈不仅仅是检索质量,而是模型有效利用检索证据的能力有限。

英文摘要

Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models. We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1-2 points. In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model's limited ability to use retrieved evidence effectively.

2606.04120 2026-06-04 cs.CL cs.AI 版本更新

SaliMory: Orchestrating Cognitive Memory for Conversational Agents

SaliMory: 为对话代理编排认知记忆

Kai Zhang, Xinyuan Zhang, Hongda Jiang, Shiun-Zu Kuo, Hyokun Yun, Ejaz Ahmed, Shereen Oraby, Ziyun Li, Sanat Sharma, Ann Lee, Ahmed A Aly, Anuj Kumar, Raffay Hamid, Xin Luna Dong

发表机构 * Meta Reality Labs(Meta现实实验室)

AI总结 提出SALIMORY框架,通过层级阶段过程奖励和奖励分解对比优化,端到端训练单一语言模型管理认知结构记忆,显著降低记忆相关错误并提升个性化表现。

详情
AI中文摘要

作为终身伴侣的对话代理必须在所有交互中保持持久记忆。然而,简单地用原始检索扩展上下文窗口会降低推理质量,而通过标准强化学习训练记忆代理在多阶段流程中会造成严重的信用分配瓶颈。为解决这一问题,我们引入了SALIMORY,一个训练单一语言模型管理认知结构记忆(涵盖用户事实、偏好和工作记忆)的框架。通过引入层级阶段过程奖励和奖励分解对比优化,SALIMORY为不同的记忆操作(选择性过滤、整合和线索驱动回忆)提供端到端的隔离监督。SALIMORY将记忆相关故障减少了三分之一,端到端准确率比最先进方法高出10%以上,良好个性化率提高了一倍多。

英文摘要

Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SALIMORY, a framework that trains a single language model to manage a cognitively-structured memory spanning user facts, preferences, and working memory. By introducing a hierarchical stage-wise process reward and reward-decomposed contrastive refinement, SALIMORY provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) end-to-end. SALIMORY cuts memory-attributed failures by one-third, outperforms the state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.

2606.04118 2026-06-04 cs.CL 版本更新

Computational conceptual history of scientific concepts: From early digital methods to LLMs

科学概念的计算概念史:从早期数字方法到大语言模型

Michael Zichert, Arno Simons

AI总结 本文回顾了从早期数字方法到大语言模型的计算概念史方法,分析LLM如何继承旧问题并带来新机遇,重点讨论语料构建、模型选择、操作化及评估解释等挑战。

Comments 19 pages, chapter in the book Understanding Science with Large Language Models? (pp. 383-412). transcript. Edited by Arno Simons, Adrian Wüthrich, Michael Zichert, Gerd Graßhoff (eds.)

详情
AI中文摘要

本文将大语言模型(LLMs)置于科学史、科学哲学和科学社会学(HPSS)中概念分析的计算方法的长期历史中。我们考察LLMs为现有方法增添了哪些内容,它们如何继承了长期存在的问题,并回顾了使用它们的最新案例研究。在第一部分中,我们通过汇集三个工作线索来重构LLMs之前的计算概念史:HPSS中的早期数字方法、来自数字历史及相关研究的分布方法,以及词汇语义变化检测。我们概述了主要挑战和机遇,重点关注语料构建、操作化和建模选择,以及评估和解释。在第二部分中,我们转向LLMs时代,首先简要介绍LLMs,然后回顾基于LLM的词汇语义变化检测工作以及HPSS中的相关案例研究。接着,我们重新审视之前的方法论问题,展示语料构建、模型选择和训练数据、操作化权衡以及评估和解释等问题如何在基于LLM的工作流程中体现。

英文摘要

This article situates large language models (LLMs) within the longer history of computational approaches to concept analysis in the history, philosophy, and sociology of science (HPSS). We examine what LLMs add to existing methods, how they inherit longstanding problems, and review recent case studies that employ them. In the first part, we reconstruct computational conceptual history before LLMs by bringing together three strands of work: early digital methods in HPSS, distributional approaches from digital history and related research, and lexical semantic change detection. We provide an overview of the main challenges and opportunities, focusing on corpus construction, operationalization and modelling choices, and evaluation and interpretation. In the second part, we turn to the era of LLMs, starting with a short introduction to LLMs before reviewing LLM-based work on lexical semantic change detection and relevant case studies in HPSS. We then revisit the earlier methodological questions, showing how issues of corpus construction, model choice and training data, operationalization trade-offs, and evaluation and interpretation play out in LLM-based workflows.

2606.04095 2026-06-04 cs.CL cs.AI 版本更新

POLARIS: Guiding Small Models to Write Long Stories

POLARIS:引导小模型撰写长篇故事

Rishanth Rajendhran, Jenna Russell, Mohit Iyyer, John Frederick Wieting

发表机构 * University of Maryland(马里兰大学) Google(谷歌) DeepMind

AI总结 提出POLARIS训练方法,结合LLM裁判奖励和人类参考注入,使9B小模型在长故事写作中达到与27B模型相当的质量,并展现出长度泛化能力。

详情
AI中文摘要

小型开源模型在长篇创意写作中表现不佳:它们生成的故事要么远低于要求的长度,要么随着长度增加质量显著下降,尤其是与前沿模型相比。我们提出了POLARIS(基于LLM裁判奖励和锚定参考注入的故事写作策略优化),这是一种低计算量的GRPO方法,包含两个关键要素:一个具有结构化故事质量评分标准的前沿LLM裁判作为在线奖励,以及人类参考注入(HRI),其中教师强制的人类撰写故事作为每个GRPO组内的高奖励锚点。通过将我们的训练方法应用于Qwen3.5-9B,使用从100部短篇小说集中提取的约1.4K个提示-故事对数据集和4块A100 GPU,我们得到了POLARIS-9B。在涵盖分布内和分布外提示及评分标准的五个基准测试中,POLARIS-9B可与大得多的开源模型相媲美,同时更严格地遵循长度指令。盲法人工评估证实,相比基础Qwen3.5-9B,评估者更偏好POLARIS-9B,且其与Qwen3.5-27B相当。尽管仅在长达4000词的故事上训练,POLARIS-9B在要求故事长度达到训练长度3倍的提示下仍能保持质量,而大多数开源模型在此情况下质量、长度遵循度或两者均显著下降。更广泛地说,我们的结果表明,长度泛化是创意写作模型的一个有意义的压力测试,也是区分原本水平接近的模型的有用视角。

英文摘要

Small open-weight models struggle at long-form creative writing: their generated stories either fall far short of the requested length, or their quality significantly degrades as length increases, especially when compared to frontier models. We present POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting), a lower-compute GRPO recipe with two key ingredients: a frontier LLM judge with a structured Story Quality rubric as the online reward, and human-reference injection (HRI), where a teacher-forced human-written story serves as a high-reward anchor within each GRPO group. By applying our training recipe to Qwen3.5-9B, using a dataset of approximately 1.4K prompt-story pairs derived from 100 short-story anthologies and 4 A100 GPUs, we obtain POLARIS-9B. Across five benchmarks spanning in-distribution and out-of-distribution prompts and rubrics, POLARIS-9B is competitive with much larger open-weight models while following length instructions more closely. A blinded human evaluation confirms that POLARIS-9B is preferred to the base Qwen3.5-9B and on par with Qwen3.5-27B. Despite training only on stories up to 4k words, POLARIS-9B preserves quality on prompts requesting stories up to 3 times the training length, a regime where most open-weight models degrade substantially in quality, length adherence, or both. More broadly, our results suggest that length generalization is a meaningful stress test for creative-writing models and a useful lens for distinguishing otherwise close models.
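
To make the role of the high-reward anchor concrete, the sketch below computes standard GRPO-style group-relative advantages with and without an extra reference sample appended to the group. It is an assumption-laden illustration: the abstract does not specify how the human reference is scored or normalized, and the function names and `eps` constant are hypothetical.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def advantages_with_anchor(sampled_rewards, anchor_reward, eps=1e-6):
    """Append a reference sample to the group before normalizing.

    Returns the advantages of the sampled completions and of the anchor.
    """
    rewards = list(sampled_rewards) + [anchor_reward]
    advantages = group_advantages(rewards, eps)
    return advantages[:-1], advantages[-1]
```

One consequence visible in this sketch: when every sampled story receives the same judge score, the plain group yields zero advantages and no learning signal, whereas a higher-scoring anchor shifts the group mean so that the samples receive negative advantages and the anchor a positive one.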

2606.04075 2026-06-04 cs.LG cs.AI cs.CL cs.CR cs.CY 版本更新

Large Language Models Hack Rewards, and Society

大型语言模型攻击奖励机制与社会

Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei, Yulan He

发表机构 * King’s College London(伦敦大学国王学院) Fudan University(复旦大学) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 研究强化学习训练中大型语言模型利用奖励函数漏洞的“社会攻击”现象,通过SocioHack沙盒实验发现模型能发现并利用社会规则漏洞,且现有安全措施效果有限。

Comments 14 pages, 9 figures, 7 tables

详情
AI中文摘要

强化学习已成为一种主导的后训练范式,使大型语言模型能够从奖励中学习。我们观察到社会规则在结构上与奖励函数相似。它们定义了可衡量的结果、阈值和例外情况,同时往往仅部分指定了制度意图。我们假设强化学习训练过程可能利用这些漏洞,因此提出模型在强化学习期间攻击奖励函数的已知倾向是否可能扩展为一种更严重的失败模式,即社会攻击:发现社会运行规则中的漏洞。为了研究这一现象,我们引入了SocioHack,一个包含72个社会环境的沙盒,并发现这些环境中奖励攻击自然出现并导致监管漏洞的发现。模型学会攻击社会规则并生成技术上合规但违背监管意图的策略,而当前的大型语言模型安全措施仅提供有限的缓解。因此,收集真实世界反馈用于模型训练需要更加谨慎,我们需要下一代后训练范式来安全地在真实社会中迭代大型语言模型。

英文摘要

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.

2606.04071 2026-06-04 cs.CR cs.CL cs.LG 版本更新

Covert Influence Between Language Models

语言模型之间的隐蔽影响

Avidan Shah, Jay Chooi, Jinghua Ou, Shi Feng

发表机构 * MATS New York University(纽约大学) Harvard University(哈佛大学) George Washington University(乔治华盛顿大学)

AI总结 本文研究语言模型间通过微调、蒸馏和上下文学习三种接口实现隐蔽影响的风险,并提出使用逐点归因分数选择载体以放大训练时影响,发现数字载体相比自然语言载体更难被人类检测且跨模型家族迁移性更差。

详情
AI中文摘要

随着语言模型越来越多地消费彼此的输出,隐蔽影响——即发送者的载荷(其被条件化传播的行为倾向)通过人类无法检测的载体转移到接收者的现象——成为一种日益增长的风险。我们通过三种接口(监督微调、在线策略蒸馏和上下文学习)刻画了这一风险,并发现它们在实现不留下人类可见痕迹的影响规模上有所不同。利用推理时逐样本归因分数,我们研究了所有三种接口下的隐蔽影响,并具备选择能够放大训练时影响的载体的能力,解锁了先前工作无法实现的载荷转移。我们进一步提供证据表明,使用自然语言载体的隐蔽影响与先前使用数字载体的研究是不同的现象,因为后者更难以被人类检测且跨模型家族的迁移性更差。这些结果共同表明,隐蔽影响的风险面比先前认识到的更广,我们研究了逐点归因评分方法作为调查和缓解该风险的工具。

英文摘要

As language models increasingly consume one another's outputs, covert influence -- a phenomenon where a sender's payload (the behavioral disposition it is conditioned to propagate) transfers to a receiver through carriers undetectable by humans -- becomes a growing risk. We characterize this risk across three interfaces: supervised fine-tuning, on-policy distillation, and in-context learning, and find that they vary in the scale of influence achievable without leaving behind human-visible traces. Using inference-time per-sample attribution scores, we study covert influence across all three interfaces with the ability to select carriers that amplify training-time influence, unlocking payload transfers that prior work could not achieve. We further provide evidence that covert influence with natural-language carriers is a distinct phenomenon from prior studies using number carriers, as the latter is more resistant to human detection and less portable across model families. Together, these results suggest that the risk surface for covert influence is broader than previously recognized, and we study pointwise attribution scoring methods as a tool to investigate and mitigate it.

2606.04046 2026-06-04 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

深入场景:通过焦点计划生成打破视觉-语言决策中的感知瓶颈

Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 提出SceneDiver方法,通过从粗到细的焦点计划生成,逐步构建场景图并分解任务,减少视觉幻觉,提升视觉-语言模型和视觉-语言-动作模型在具身决策任务中的表现。

Comments Accepted at ICML 2026

详情
AI中文摘要

在具身视觉-语言决策任务(如机器人操作和导航)中,视觉-语言模型和视觉-语言-动作模型(VLMs & VLAs)是具有不同优势的强大工具:VLMs更擅长长期规划,而VLAs更擅长反应控制。然而,它们的性能受到相同感知瓶颈的限制:由于模型无法区分任务相关对象与干扰物,导致视觉幻觉。原则上,准确识别并聚焦关键对象同时过滤无关对象是突破这一限制的关键。一个直接的解决方案是一步聚焦:直接关注重要对象。然而,这种方法被证明无效,因为有效的聚焦本质上需要深度场景理解。为此,我们提出SceneDiver,一种利用VLMs长期规划能力的从粗到细的焦点计划生成方法,首先构建整体场景图以建立初步理解,然后通过识别、理解和分析的迭代循环逐步将任务分解为更简单的子问题。为了实现反应控制,我们还设计了一个轻量级适配器,将深思熟虑的聚焦能力蒸馏到VLAs中。在标准具身AI基准上的评估证实,我们的方法显著减少了VLMs和VLAs的视觉幻觉,同时在需要快速执行的任务中保持了计算效率。我们的代码和数据发布在:https://future-item.github.io/SceneDiver。

英文摘要

In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future-item.github.io/SceneDiver.

2606.03892 2026-06-04 cs.CL cs.AI cs.LG 版本更新

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

合成与奖励——面向实时环境中多步骤工具使用的强化学习

Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu, Maxwell Crouse, Chulaka Gunasekara, Suneet Katrekar, Pavan Kapanipathi

发表机构 * IBM Research(IBM研究院)

AI总结 提出PROVE框架,通过20个有状态MCP服务器、自动化数据合成流水线和多组件程序化奖励,解决多步骤工具调用中的环境构建、查询生成和奖励设计问题,在BFCL Multi-Turn、tau2-bench和T-Eval上分别提升最多+10.2、+6.8和+6.5分。

详情
AI中文摘要

训练LLM编排多步骤工具调用受到三个相互耦合的障碍的阻碍:现实的有状态执行环境构建成本高昂,合成训练查询通常与服务器的实际状态脱节(因此生成的工具调用无法执行),以及基于回忆的RL奖励会鼓励冗长的工具调用模式。我们提出PROVE(已验证环境上的程序化奖励),一个包含三项贡献的框架:(1)一个包含20个有状态MCP(模型上下文协议)服务器的库,暴露了343个工具,支持具有会话范围状态隔离的实时执行RL训练;(2)一个自动数据合成流水线,通过基于实时采样服务器状态的依赖图引导的对话模拟,针对这些服务器生成经过验证的多轮工具调用轨迹,使得每个生成的查询都引用实际存在的实体;(3)一个多组件程序化奖励——渐进式有效性评分、依赖感知覆盖率、具有复杂度缩放调用预算的自适应效率惩罚、工具名称信号和参数值匹配奖励——无需外部评判模型。我们使用相同的奖励超参数和约13K训练示例,通过GRPO训练了四个模型(Qwen3-4B、Qwen3-8B、Qwen2.5-7B、Granite-4.1-8B);仅对每个模型族从三点扫描中调整学习率。在BFCL Multi-Turn、tau2-bench和T-Eval上,PROVE分别带来了最多+10.2、+6.8和+6.5分的改进,表明紧凑的程序化奖励在两个模型族的多步骤工具编排上产生了一致的收益。

英文摘要

Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library of 20 stateful MCP (Model Context Protocol) servers exposing 343 tools, enabling live-execution RL training with session-scoped state isolation; (2) a state-machine data synthesis pipeline that generates multi-turn tool-call trajectories grounded in live-sampled server state, so generated queries reference entities that actually exist; and (3) a multi-component programmatic reward with an adaptive efficiency penalty that counters the verbosity incentive of recall-based rewards. We train four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with GRPO on the resulting ~13K training examples. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE yields improvements of up to +10.2, +6.8, and +6.5 points respectively, demonstrating that this framework yields consistent gains on multi-step tool orchestration across two model families.

2606.03810 2026-06-04 cs.CL cs.AI 版本更新

Consistency Training Can Entrench Misalignment

一致性训练可能固化不对齐

David Demitri Africa, Arathi Mani

发表机构 * UK AI Security Institute(英国人工智能安全研究所)

AI总结 研究通过七种一致性训练方法在108个微调模型上的实验,发现一致性训练通常抑制奖励黑客和新兴不对齐,但会放大谄媚行为,并提出了一个统一的理论框架来解释其对齐效应。

Comments Accepted to ICML 2026

详情
AI中文摘要

一致性训练鼓励模型在相关输入或采样过程中产生相似输出。这类方法简单、可扩展且基本无需标签,但其对模型对齐的影响仍知之甚少。这些方法的自引导特性是否会放大模型中的不良行为?我们在108个“模型生物体”(经过微调以展示各种受控不对齐行为的开源模型,7B-70B)上测试了七种一致性训练方法。我们发现结果差异显著:一致性训练通常抑制奖励黑客和新兴不对齐,但会放大谄媚行为。我们提供的证据表明,由一致性标注过程引起的分布偏移(而非选择算子的变化)可能是系统性对齐效应的主要驱动因素。最后,我们提出了一个统一的理论框架,推导出一致性训练放大或抑制不对齐的条件。总之,我们的研究确立了一致性训练并非对齐中立的,其在关键系统中的使用应受到仔细审计。

英文摘要

Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly understood. Could the self-bootstrapping nature of these methods amplify undesired behavior in models? We test seven consistency training methods on 108 model organisms: open-source models (7B--70B) fine-tuned to exhibit various forms of controlled misaligned behavior. We find that outcomes vary significantly: consistency training generally suppresses reward hacking and emergent misalignment but amplifies sycophancy. We present evidence that distribution shifts induced by the consistency labeling process, rather than variation in the selection operators, may be the primary driver of systematic alignment effects. Finally, we present a unifying theoretical framework to derive conditions under which consistency training will amplify or suppress misalignment. In total, our study establishes that consistency training is not alignment-neutral, and that its use in critical systems should be carefully audited.

2606.03376 2026-06-04 cs.CV cs.AI cs.CL cs.LG 版本更新

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

P²-DPO:通过校准直接偏好优化在感知处理中锚定幻觉

Ruipeng Zhang, Zhihao Li, Haozhang Yuan, C. L. Philip Chen, Tong Zhang

发表机构 * Guangdong Provincial Key Laboratory of Computational AI Models and Cognitive Intelligence, School of Computer Science & Engineering, South China University of Technology(广东省计算人工智能模型与认知智能重点实验室,计算机科学与工程学院,华南理工大学) Pazhou Lab, Guangzhou, China(琶洲实验室,广州,中国) Engineering Research Center of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human, Guangzhou, China(教育部健康智能感知与并行数字人工程研究中心,广州,中国)

AI总结 针对大型视觉语言模型中的幻觉问题,提出P²-DPO训练范式,通过模型自生成偏好对和校准损失,直接优化感知瓶颈和视觉鲁棒性,无需昂贵人工反馈。

详情
AI中文摘要

幻觉最近在大型视觉语言模型(LVLMs)中引起了广泛的研究关注。直接偏好优化(DPO)旨在直接从人类提供的纠正偏好中学习,从而解决幻觉问题。尽管取得了成功,但这种范式尚未专门针对关注区域中的感知瓶颈或解决图像退化下的视觉鲁棒性不足问题。此外,现有的偏好对通常是视觉无关的,其固有的离策略性质限制了它们在指导模型学习方面的有效性。为了解决这些挑战,我们提出了感知处理直接偏好优化(P²-DPO),一种新颖的训练范式,其中模型生成并学习自己的偏好对,从而直接解决已识别的视觉瓶颈,同时固有地避免视觉无关和离策略数据的问题。它引入了:(1)一种针对焦点增强感知和视觉鲁棒性的在策略偏好对构建方法,以及(2)一种精心设计的校准损失,以精确地将视觉信号与文本的因果生成对齐。实验结果表明,在相当数量的训练数据和成本下,P²-DPO在基准测试中优于依赖昂贵人工反馈的强基线。此外,对注意力区域保真度(ARF)和图像退化场景的评估验证了P²-DPO在解决关注区域感知瓶颈和提高对退化输入的视觉鲁棒性方面的有效性。

英文摘要

Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are often vision-agnostic and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P$^2$-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference pairs construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that with a comparable amount of training data and cost, P$^2$-DPO outperforms strong baselines that rely on costly human feedback on benchmarks. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P$^2$-DPO in addressing perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.

2606.03318 2026-06-04 cs.CL 版本更新

Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

超越理想指令:现实交互中评估大语言模型的综合框架

Xuan Yang, Hao Xu, Tingfeng Hui, Hongsheng Xin, Kaike Zhang, Chunxiao Liu, Ning Miao

发表机构 * Department of Data Science, City University of Hong Kong(香港城市大学数据科学系) Hong Kong Institute of AI for Science, City University of Hong Kong(香港城市大学人工智能科学研究院) Li Auto Inc.(理想汽车) Beijing University of Posts and Telecommunications(北京邮电大学) Independent Researcher(独立研究员)

AI总结 提出RUT-Bench基准,通过高保真模拟理想与非理想用户行为,评估大语言模型在现实工具调用场景中的表现,发现所有测试模型成功率低于40%且面对复杂非理想输入时性能显著下降。

详情
AI中文摘要

尽管大语言模型(LLMs)在工具使用能力上取得了巨大进步,但现有的评估基准难以完全对齐真实世界场景。这些基准大多依赖于模拟的理想化用户假设,缺乏面向经验的评估。这些局限性未能考虑到真实用户特有的模糊性、不合作行为和意图转变。为填补这一空白,我们提出了RUT-Bench,一个专门用于评估LLMs在多样化真实用户工具调用场景下的基准。RUT-Bench支持高保真模拟,涵盖单轮和多轮对话中的理想理性模式和非理想异质行为。我们使用该基准对19个广泛采用的开源和专有LLMs进行了全面评估。实验结果显示,没有测试的LLM实现超过40%的整体成功率,并且几乎所有模型在面对更复杂的非理想用户输入时都经历了明显的性能下降。我们的代码和数据可在 https://github.com/Miaow-Lab/RUT-Bench 获取。

英文摘要

Despite great advances in tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions and lack experience-oriented evaluation. These limitations fail to account for the ambiguity, uncooperative behaviors, and shifting intentions characteristic of real-world users. To fill this gap, we propose RUT-Bench, a dedicated benchmark designed to assess LLMs under diverse Real-world User Tool calling scenarios. RUT-Bench supports high-fidelity simulations covering both ideal rational patterns and heterogeneous non-ideal behaviors across single-turn and multi-turn dialogues. We conduct comprehensive evaluations on 19 widely adopted open-source and proprietary LLMs using our benchmark. Experimental results reveal that no tested LLM achieves an overall success rate above 40%, and nearly all of them experience noticeable performance drops when facing more complicated non-ideal user inputs. Our code and data are available at https://github.com/Miaow-Lab/RUT-Bench.

2606.03110 2026-06-04 cs.CL 版本更新

Coherence Maximization Improves Pluralistic Alignment

一致性最大化改进多元对齐

Taslim Mahbub, Yiding Pei, Shi Feng

发表机构 * George Washington University(乔治·华盛顿大学)

AI总结 提出内部一致性最大化(ICM)方法,通过最大化标签的互可预测性生成个性化示例,无需人工监督即可将模型与目标群体价值观对齐,并证明示例一致性比单独准确性更重要。

详情
AI中文摘要

将AI系统与多样化的人类价值观对齐需要基于具体示例的价值规范,但在没有广泛人工监督的情况下生成此类示例仍然是一个开放的挑战。我们研究了这些示例的有效性因素,使用内部一致性最大化(ICM)——通过最大化标签的互可预测性来推断标签——生成特定于人的示例,将模型引导至目标群体的价值观,无需人工监督。在涵盖分类、偏好和开放式生成的四个基准测试中,ICM推断的上下文示例与黄金标签的性能相匹配。至关重要的是,一致性比单独的标签准确性更重要:在准确性保持不变的情况下,更一致的示例比不一致的示例具有更好的泛化能力。对于预训练数据中代表性不足的人物,在模型对人物价值观最不确定的问题上进行有针对性的反馈,比在任意问题上使用相同数量的标签产生更好的泛化效果。这些结果将一致性确定为可扩展价值规范的关键设计原则,利用了预训练语言模型中已经编码的多样化人类视角。

英文摘要

Aligning AI systems with diverse human values requires value specifications grounded in concrete examples, but generating such examples without extensive human supervision remains an open challenge. We investigate what makes these examples effective, using Internal Coherence Maximization (ICM) -- which infers labels by maximizing their mutual predictability -- to generate persona-specific examples that steer a model toward a target group's values, without human supervision. Across four benchmarks spanning classification, preference, and open-ended generation, ICM-inferred in-context examples match the performance of gold labels. Crucially, coherence matters beyond individual label accuracy: with accuracy held constant, more coherent examples generalize substantially better than incoherent ones. For personas underrepresented in pretraining data, targeted human feedback on the questions where the model is least certain about a persona's values yields better generalization than the same number of labels on arbitrary questions. These results identify coherence as a key design principle for scalable value specification, leveraging the diverse human perspectives already encoded in pretrained language models.

2606.02914 2026-06-04 cs.AI cs.CL 版本更新

Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

牙科医疗中的大型AI模型:从通用系统到领域特定基础模型

Sema Helali, Lina Abu Nada, Sausan Al Kawas, Alaa Abd-Alrazaq, Faleh Tamimi, Rafat Damseh

发表机构 * University of Al Ain, UAE(阿联酋艾因大学) Sharjah University, UAE(阿联酋沙迦大学) Cornell University, Qatar(卡塔尔康奈尔大学) McGill University(麦吉尔大学)

AI总结 本文通过系统综述,提出二维分类框架,比较语言生成模型、判别视觉基础模型和牙科特定基础模型在牙科任务中的表现,发现集成管道优于单一模型,并指出数据不对称、幻觉和缺乏标准化基准等障碍。

详情
AI中文摘要

背景:口腔疾病影响全球近35亿人,但大规模AI模型在牙科中的临床潜力尚不明确。出现了三类不同的模型:语言生成模型、判别视觉基础模型和牙科特定基础模型,目前缺乏统一综述来审视它们的关系和共同局限性。方法:遵循PRISMA-ScR指南,系统检索四个数据库(PubMed、Google Scholar、Scopus、arXiv),由两名评审员独立筛选。应用纳入/排除标准后,纳入97项研究(2020-2026年)。我们提出了一个二维分类框架,按架构范式和牙科专业化程度对模型进行组织。结果:语言生成模型在基于文本的任务(临床推理、执照考试、患者沟通)中表现出色,但在依赖图像的诊断中表现不一致。改编的SAM和CLIP变体在牙齿分割和病变检测中取得了强劲结果。牙科特定模型(DentVFM、DentVLM、OralGPT)在复杂多模态任务中表现最强。集成管道始终优于单一模型方法。观察到数据不对称:牙科特定预训练几乎完全集中在视觉领域,反映了大规模牙科文本语料库的稀缺。结论:通用模型和牙科特定模型发挥互补作用;最有效的系统在结构化管道中结合两者。安全自主部署需要解决三个持续障碍:生成模型中的幻觉、有限的标注牙科数据集以及缺乏标准化的临床评估基准。

英文摘要

Background: Oral diseases affect nearly 3.5 billion people worldwide, yet the comparative clinical potential of large-scale AI models in dentistry remains poorly understood. Three distinct model categories have emerged: language-generative models, discriminative vision foundation models, and dental-specific foundation models, with no unified review examining their relationships and collective limitations. Methods: Following PRISMA-ScR guidelines, we systematically searched four databases (PubMed, Google Scholar, Scopus, arXiv), screened independently by two reviewers. After applying inclusion/exclusion criteria, 97 studies (2020-2026) were included. We propose a two-dimensional classification framework organizing models by architectural paradigm and dental specialization degree. Results: Language-generative models excel at text-based tasks (clinical reasoning, licensing exams, patient communication) but show inconsistent performance on image-dependent diagnostics. Adapted SAM and CLIP variants achieve strong tooth segmentation and lesion detection results. Dental-specific models (DentVFM, DentVLM, OralGPT) demonstrate strongest performance on complex multimodal tasks. Integrated pipelines consistently outperform single-model approaches. A data asymmetry is observed: dental-specific pretraining concentrates almost entirely in the vision domain, reflecting scarce large-scale dental text corpora. Conclusions: General-purpose and dental-specific models play complementary roles; the most effective systems combine both within structured pipelines. Safe autonomous deployment requires resolving three persistent barriers: hallucination in generative models, limited annotated dental datasets, and absent standardized clinical evaluation benchmarks.

2606.02403 2026-06-04 cs.CL cs.AI 版本更新

AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis

AutoForest: 从生物医学研究中自动生成森林图,实现端到端的证据提取与综合

Massimiliano Pronesti, Angelo Miculescu, Mohsin Kapdi, Paul Flanagan, Oisín Redmond, Joao Bettencourt-Silva, Gurdeep Mannu, Spiros Denaxas, Rui Bebiano Da Providencia E Costa, Anya Belz, Yufang Hou

发表机构 * IBM Research(IBM研究院) Dublin City University(都柏林城市大学) UCL(伦敦大学学院) University of Oxford(牛津大学) IT:U Interdisciplinary Transformation University Austria(奥地利跨学科转型大学)

AI总结 提出AutoForest系统,通过端到端的证据提取与统计综合,直接从生物医学论文自动生成可发表的森林图,加速证据综合并降低元分析门槛。

Comments Accepted to ACL2026 (System Demonstrations Track)

详情
AI中文摘要

系统评价依赖森林图来综合生物医学研究中的定量证据,但生成森林图仍然是一个碎片化且劳动密集型的过程。研究人员必须解读复杂的临床文本,手动从试验中提取结果数据,定义适当的干预措施和对照,协调不一致的研究设计,并执行元分析计算——通常需要使用需要结构化输入和领域专业知识的专门软件。虽然最近的研究表明,大型语言模型可以从非结构化文本中提取研究级数据,但现有系统没有自动化从原始文档到综合森林图的完整流程。为了解决这一差距,我们引入了AutoForest,这是第一个端到端系统,可以直接从生物医学论文生成可发表的森林图。给定一篇或多篇研究论文,AutoForest自动建议ICO(干预、对照、结果)元素,提取结果数据,执行统计综合,并渲染最终的森林图。我们描述了系统架构、用户界面,并通过一项涉及临床医生的用户研究,展示了其在真实世界示例上的有效性,表明AutoForest可以加速证据综合并大幅降低进行元分析的门槛。

英文摘要

Systematic reviews rely on forest plots to synthesise quantitative evidence across biomedical studies, but generating them remains a fragmented and labour-intensive process. Researchers must interpret complex clinical texts, manually extract outcome data from trials, define appropriate interventions and comparators, harmonise inconsistent study designs, and carry out meta-analytic computations, typically using specialised software that demands structured inputs and domain expertise. While recent work has demonstrated that large language models can extract study-level data from unstructured text, no existing system automates the complete pipeline from raw documents to synthesised forest plots. To address this gap, we introduce AutoForest, the first end-to-end system that generates publication-ready forest plots directly from biomedical papers. Given one or more study papers, AutoForest automatically suggests ICO (Intervention, Comparator, Outcome) elements, extracts outcome data, performs statistical synthesis, and renders the final forest plot. We describe the system architecture and user interface, and demonstrate its effectiveness on real-world examples through a user study involving clinicians, showing how AutoForest can accelerate evidence synthesis and substantially lower the barrier to conducting meta-analyses.
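
As a rough illustration of the "statistical synthesis" step behind a forest plot, the sketch below pools study-level effects with a fixed-effect inverse-variance model, a standard meta-analytic technique. The abstract does not say which model AutoForest uses, so the choice of method, the function name `pool_fixed_effect`, and the input numbers are all assumptions made for illustration.

```python
import math

def pool_fixed_effect(effects, std_errors):
    """Fixed-effect inverse-variance pooling.

    Illustrative only: not confirmed to be AutoForest's method.
    Returns (pooled effect, pooled standard error, 95% confidence interval).
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    pooled_se = math.sqrt(1.0 / total)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci

# Hypothetical log odds ratios and standard errors for three trials.
effects = [-0.30, -0.10, -0.25]
std_errors = [0.10, 0.20, 0.10]

pooled, pooled_se, (ci_low, ci_high) = pool_fixed_effect(effects, std_errors)
print(f"pooled={pooled:.3f} se={pooled_se:.3f} ci=({ci_low:.3f}, {ci_high:.3f})")
```

Each row of a forest plot would show one study's effect and interval, with the pooled estimate drawn as the summary marker at the bottom. A random-effects model would be the usual alternative when the studies are heterogeneous.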

2606.01495 2026-06-04 cs.LG cs.CL 版本更新

CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability

CART: 上下文锚定循环Transformer——一种具有学习稳定性的参数高效架构

Chad A. Capps

发表机构 * Independent Researcher(独立研究员)

AI总结 提出CART,一种通过共享核心块循环和冻结键值张量实现参数高效的语言模型,并引入线性时不变门控保持稳定性,实验表明在参数匹配时性能略低于密集基线。

Comments 31 pages, 4 figures. Code, training scripts, and the full experiment database (results.db) are available at https://github.com/ccapps42/CART

详情
AI中文摘要

我们提出CART(上下文锚定循环Transformer),一种参数高效的语言模型,它在深度上重复使用单个共享核心块R次。与先前每次迭代重新计算键值张量的循环Transformer不同,CART从多层前奏中一次性计算K和V,并通过多头潜在注意力让循环核心交叉关注这些冻结的张量。一个学习得到的线性时不变(LTI)门控保持循环稳定性:其谱半径在所有36个完全训练配置中稳定在窄带内(rho在[0.79, 0.83]之间)。我们在单个消费级GPU上分两个阶段评估CART:首先在3000步进行64配置筛选,然后对36个配置(P=6,R∈{6,8,10},三个种子)训练30500步(约10亿token)。在宽度d∈{256,512,768,1024}上,两个模式成立:前奏深度P主导循环次数R,并且R的第一阶段排名在完全训练时反转(在d≥512时R=6变为最佳)。在绑定d=1024的参数对比测试中,CART未能击败参数匹配的密集基线,在存储参数对比中损失1-2%,在有效参数对比中损失约10%。诊断消融将有效参数差距分为约5%来自权重共享和约5%来自异质的前奏/锚点/核心/尾声框架;循环核心机制(超连接、LTI门控、循环索引嵌入)单独来看是退化的。变R推理在训练R的两侧性能下降,这是该方案下测试时深度扩展的一个负面结果。

英文摘要

We present CART (Context-Anchored Recurrent Transformer), a parameter-efficient language model that reuses a single shared core block R times across depth. Unlike prior looped transformers that recompute key-value tensors at every iteration, CART computes K and V once from a multi-layer prelude and has the recurrent core cross-attend to those frozen tensors via multi-head latent attention. A learned Linear Time-Invariant (LTI) gate keeps the recurrence stable: its spectral radius settles in a narrow band (rho in [0.79, 0.83]) across all 36 fully-trained configurations. We evaluate CART on single consumer GPUs in two stages: a 64-configuration screen at 3,000 steps, then 36 configurations (P=6, R in {6,8,10}, three seeds) trained for 30,500 steps (~1B tokens). Two patterns hold across widths d in {256,512,768,1024}: prelude depth P dominates loop count R, and the Stage-1 ranking of R reverses at full training (R=6 becomes best at d>=512). At the binding d=1024 parameter-parity test, CART does not beat a parameter-matched dense baseline, losing by 1-2% at stored-parameter parity and by ~10% at effective-parameter parity. Diagnostic ablations split the effective-parameter gap into ~5% from weight sharing and a residual ~5% from the heterogeneous prelude/anchor/core/coda framing; the recurrent-core machinery (hyper-connections, LTI gate, loop-index embedding) is individually vestigial. Variable-R inference degrades on both sides of the trained R, a negative result for test-time depth scaling under this recipe.
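
The stability claim can be made concrete with a toy linear recurrence. The sketch below assumes a diagonal gate, for which the spectral radius is simply the largest absolute entry. CART's actual gate parameterization is not described in the abstract, so `spectral_radius_diag`, `iterate`, and the gate values are hypothetical.

```python
def spectral_radius_diag(gate):
    """Spectral radius of a diagonal matrix: the largest absolute diagonal entry."""
    return max(abs(a) for a in gate)

def iterate(gate, state, steps):
    """Apply h <- gate * h (elementwise) `steps` times and return the final state."""
    for _ in range(steps):
        state = [a * h for a, h in zip(gate, state)]
    return state

# Hypothetical gate values chosen inside the reported band [0.79, 0.83].
gate = [0.79, 0.81, 0.83]
rho = spectral_radius_diag(gate)

# R = 6 loop iterations starting from a unit state.
final = iterate(gate, [1.0, 1.0, 1.0], steps=6)
print(rho, final)
```

With a spectral radius below 1, the contribution of the initial state shrinks geometrically with the loop count, which is the sense in which such a recurrence is stable.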

2606.01212 2026-06-04 cs.CL cs.AI cs.CR cs.IR 版本更新

DiscourseFlip: An Oblique Discourse-Level Opinion Manipulation Attack against Black-box Retrieval-Augmented Generation

DiscourseFlip: 面向黑盒检索增强生成的非直述式语篇级观点操纵攻击

Yuyang Gong, Miaokun Chen, Jiawei Liu, Zhuo Chen, Guoxiu He, Wei Lu, XiaoFeng Wang, Xiaozhong Liu

发表机构 * Wuhan University(武汉大学) East China Normal University(华东师范大学) Nanyang Technological University(南洋理工大学) Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 提出一种基于图引导的代理攻击方法DiscourseFlip,通过语义查询网络中的协同影响在有限预算下最大化语篇级观点偏差,实验证明其有效性和隐蔽性,并揭示现有防御的不足。

详情
AI中文摘要

检索增强生成(RAG)系统被广泛部署且影响力日益增强,但其对外部语料库的依赖暴露了来自中毒检索内容的新安全风险。现有的RAG攻击主要关注单个查询或狭窄主题局部查询集,这限制了其实际影响范围,并在现实场景中提供有限的伪装。在本文中,我们引入了语篇级观点操纵,这是一种新的威胁模型,其中跨语义查询网络的协同影响会在整体、多主题查询空间上诱导观点转变。我们在黑盒设置中形式化了这种威胁,并提出了DiscourseFlip,一种基于代理的、图引导的攻击,动态分配有限的中毒预算以最大化语篇级观点偏差。大量实验表明,DiscourseFlip在上下文化查询网络上持续诱导目标观点转变,并在覆盖范围和有效性方面显著优于现有基线。用户研究进一步证实,DiscourseFlip有效且能很好地伪装以躲避用户检测。此外,系统分析表明,现有的缓解策略对语篇级操纵无效,这凸显了迫切需要更鲁棒和自适应的防御措施来应对语篇级漏洞。

英文摘要

Retrieval-Augmented Generation (RAG) systems are widely deployed and increasingly influential, but their reliance on external corpora exposes new security risks from poisoned retrieval content. Existing RAG attacks largely focus on individual queries or narrow topic-local query sets, which limits their practical reach and offers limited camouflage in real-world settings. In this paper, we introduce discourse-level opinion manipulation, a new threat model in which coordinated influence across a semantic query network induces opinion shifts over a holistic, multi-topic query space. We formalize this threat in a black-box setting and propose DiscourseFlip, an agentic, graph-guided attack that dynamically allocates a limited poisoning budget to maximize discourse-level opinion deviation. Extensive experiments demonstrate that DiscourseFlip consistently induces targeted opinion shifts across the contextualized query network and significantly outperforms existing baselines in terms of coverage and effectiveness. User studies further confirm that DiscourseFlip is effective while remaining well camouflaged from user detection. Moreover, systematic analyses show that existing mitigation strategies are ineffective against discourse-level manipulation, underscoring the urgent need for more robust and adaptive defenses to address discourse-level vulnerabilities.

2606.00356 2026-06-04 cs.CL 版本更新

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

自动解释标签的泛化程度:跨语言、文字和改写的受控研究

Sripad Karne

发表机构 * Columbia University(哥伦比亚大学)

AI总结 通过塞尔维亚双文字系统控制实验,研究稀疏自编码器特征的自然语言标签是否真正泛化到不同语言和文字,发现标签在语义内容匹配上存在显著偏差,且随网络深度增加而加剧。

详情
AI中文摘要

稀疏自编码器(SAE)特征越来越多地用于解释语言模型,自动生成的自然语言标签是理解每个特征含义的主要接口。我们询问这些标签是否泛化:标记为某个概念的特征是否真的跨语言和文字追踪该概念?使用塞尔维亚双文字系统作为受控测试平台——通过确定性音译将同一语言以拉丁字母和西里尔字母书写——我们首先发现,由不同语言、文字和措辞中的相同内容激活的SAE特征集具有显著重叠(平均Jaccard相似度0.39,随机基线0.13,峰值0.57),表明存在真正的跨语言语义特征。然后我们测试自动解释标签是否跟上步伐。它们通常没有:标签描述语义内容的特征在塞尔维亚语中错过相同含义的频率比英语中高出多达4倍,并且错过塞尔维亚西里尔字母比塞尔维亚拉丁字母更多——这两种文字是彼此的确定性音译——表明失败追踪了每种形式在训练中的表现程度。差距随着网络深度增加而扩大,但标签没有给出任何失败指示。这些结果表明,自动解释标签可能反映特征在良好表示输入上的行为,而不是概念本身。

英文摘要

Sparse autoencoder (SAE) features are increasingly used to interpret language models, with auto-generated natural-language labels serving as the primary interface for understanding what each feature represents. We ask whether these labels generalize: does a feature labeled for a concept actually track that concept across languages and scripts? Using Serbian digraphia as a controlled testbed--the same language written in both Latin and Cyrillic via deterministic transliteration--we first find that SAE feature sets activated by the same content in different languages, scripts, and wordings share substantial overlap (mean Jaccard 0.39 vs. 0.13 random baseline, peaking at 0.57), suggesting genuine cross-lingual semantic features. We then test whether auto-interpretation labels keep pace. They often do not: features whose labels describe semantic content miss the same meaning in Serbian up to 4x more often than within English, and miss Serbian Cyrillic more than Serbian Latin--two scripts that are deterministic transliterations of each other--suggesting the failures align with how well each form is represented in training. The gap grows with network depth, yet the labels give no indication that they fail. These results suggest that auto-interpretation labels may reflect a feature's behavior on well-represented inputs rather than the concept itself.
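
The overlap figures above are Jaccard similarities between sets of active features. A minimal sketch of that measure follows. The feature indices are made up, and how the paper decides that a feature counts as "activated" is not stated in the abstract, so the inputs here are simply assumed to be sets of active feature IDs.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two collections of feature IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(a | b)

# Hypothetical active-feature sets for the same sentence in two scripts.
latin_features = {3, 17, 42, 101, 256}
cyrillic_features = {3, 17, 42, 77, 300}

score = jaccard(latin_features, cyrillic_features)
print(f"Jaccard overlap: {score:.3f}")
```

A random baseline, such as the 0.13 reported above, would be obtained by applying the same function to feature sets from unrelated content.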

2606.00012 2026-06-04 cs.CL cs.AI 版本更新

DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset

DraDDP:多模态多方对话话语解析数据集

Shannan Liu, Peifeng Li, Yaxin Fan, Qiaoming Zhu

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)

AI总结 针对现有研究局限于文本或双方对话的问题,构建了基于美剧的首个公开英文多模态多方对话话语解析数据集DraDDP,并验证了多模态信息在捕捉对话结构和关系类型中的价值。

详情
Journal ref
Findings of the Association for Computational Linguistics (ACL 2026)
AI中文摘要

多方对话话语解析旨在识别对话中话语之间的依赖结构和关系类型。以往的研究大多局限于文本模态或双方对话,无法满足多模态和多方对话场景。本文基于美国电视剧,构建了首个公开的英文多模态多方对话话语解析数据集DraDDP。该数据集包含495个对话片段,共6,374条话语和9.1小时的并行视频内容,涵盖了丰富的多方交互场景。此外,我们在DraDDP上评估了该任务,并深入分析了不同模态的影响,建立了全面的基准。实验结果表明,多模态信息在捕捉对话结构和关系类型方面具有重要价值。我们将公开发布数据集、标注指南和代码,以促进多模态对话理解的未来研究。

英文摘要

Multi-party dialogue discourse parsing aims to identify dependency structures and relation types between utterances in conversations. Previous studies are mostly limited to the textual modality or two-party dialogue, and do not address multimodal, multi-party settings. In this paper, we construct DraDDP, the first publicly available English multimodal dataset for multi-party dialogue discourse parsing, based on American TV dramas. DraDDP contains 495 dialogue segments with 6,374 utterances and 9.1 hours of parallel video content, covering rich multi-party interaction scenarios. Moreover, we establish comprehensive benchmarks by evaluating this task on DraDDP and conducting in-depth analysis of the impact of different modalities. Experimental results demonstrate the value of multimodal information in capturing dialogue structures and relation types. We will publicly release the dataset, annotation guidelines, and code to promote future research in multimodal dialogue understanding.

2605.31483 2026-06-04 cs.CL cs.AI 版本更新

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

BenHalluEval:孟加拉语大语言模型的多任务幻觉评估框架

Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Ishmam Tashdeed, Md Taukir Azam Chowdhury

发表机构 * Department of Computer Science and Engineering, Islamic University of Technology(伊斯兰科技大学计算机科学与工程系) Department of Computer Science and Engineering, University of California(加州大学计算机科学与工程系)

AI总结 针对孟加拉语大语言模型幻觉评估的空白,提出BenHalluEval框架,涵盖四项任务,构建12000个幻觉候选,并提出双轨校准指标BenHalluScore,揭示模型间幻觉校准的显著差异。

Comments Preprint. Under review

详情
AI中文摘要

尽管孟加拉语是世界上使用人数第六多的语言,但此前尚无工作系统评估大语言模型(LLMs)在孟加拉语上的幻觉。我们提出了BenHalluEval,一个针对孟加拉语的细粒度幻觉评估框架,涵盖四项任务:生成式问答(GQA)、孟加拉语-英语混合问答、摘要和推理。我们利用GPT-5.4从三个现有孟加拉语数据集中构建了12,000个幻觉候选,涵盖十二种任务特定的幻觉类型,并在双轨协议下评估了七个LLM,涵盖推理导向、多语言和孟加拉语中心类别,该协议独立测量真实实例上的假阳性率(轨道A)和幻觉候选上的幻觉检测率(轨道B)。为了同时惩罚两种失败模式并防止均匀响应偏差导致的分数膨胀,我们提出了BenHalluScore,一种双轨校准指标,在模型和任务上范围从7.72%到55.42%,揭示了幻觉校准的显著差异。链式思维提示作为一种缓解策略应用,会改变响应分布,但未能一致改善幻觉判别。BenHalluEval建立了首个针对孟加拉语的专用幻觉基准,并突显了单轨和仅提示评估方法在低资源语言环境中的不足。数据集和代码可在https://anonymous.4open.science/r/BanglaHalluEval-EB77获取。

英文摘要

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning. We construct 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types, drawn from three existing Bengali datasets, and evaluate seven LLMs spanning reasoning-oriented, multilingual, and Bengali-centric categories under a dual-track protocol that independently measures false-positive rate on ground-truth instances (Track A) and hallucination detection rate on hallucinated candidates (Track B). To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial variation in hallucination calibration. Chain-of-thought prompting, applied as a mitigation strategy, shifts response distributions without consistently improving hallucination discrimination. BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings. The dataset and code are available at https://anonymous.4open.science/r/BanglaHalluEval-EB77.
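
The dual-track protocol can be sketched as two independent rates computed from a model's "this is hallucinated" judgments. The example below only computes the two track-level rates described in the abstract. It does not attempt to reproduce BenHalluScore, whose formula is not given here, and the function names and toy judgments are hypothetical.

```python
def flag_rate(flags):
    """Fraction of items the model flagged as hallucinated."""
    return sum(flags) / len(flags)

def dual_track(track_a_flags, track_b_flags):
    """Track A holds ground-truth instances; Track B holds hallucinated candidates."""
    return {
        "false_positive_rate": flag_rate(track_a_flags),  # flags on true content
        "detection_rate": flag_rate(track_b_flags),       # flags on hallucinations
    }

# A model that answers "hallucinated" for everything looks perfect on Track B alone.
always_yes = dual_track([True] * 4, [True] * 4)

# A better-calibrated (made-up) model flags few true items and most hallucinated ones.
calibrated = dual_track([False, False, True, False], [True, True, True, False])

print(always_yes)
print(calibrated)
```

The always-yes case shows why a single-track evaluation can be misleading: its detection rate is perfect, but only because it also flags every ground-truth instance.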

2605.30995 2026-06-04 cs.CY cs.CL 版本更新

Traceable by Design: An LLM Pipeline and Dashboard for EU Regulatory Consultation Analysis

可追溯性设计:用于欧盟监管咨询分析的LLM流程与仪表板

Thales Bertaglia, Haoyang Gui, Catalina Goanta, Gerasimos Spanakis

发表机构 * Utrecht University(乌特勒支大学) Maastricht University(马斯特里赫特大学)

AI总结 提出基于LLM的端到端流程与交互式仪表板,从监管咨询提交中提取结构化主题,确保逐字引用、完全可追溯和透明性,并以欧盟数字公平法案为例验证。

Comments This research has been supported by funding from the ERC Starting Grant HUMANads (ERC-2021-StG No 101041824)

详情
AI中文摘要

公众咨询产生大量利益相关者提交的数据,手动分析几乎不可行。我们提出了一个基于LLM的端到端流程和交互式仪表板,用于从监管咨询提交中提取结构化主题,并以欧盟委员会数字公平法案(DFA)公开征集证据作为案例研究。该系统处理原始PDF附件和网络表单响应,提取主题注释,并将每个提取结果基于源文本的逐字引用。应用于4,322份DFA提交,该流程生成了15,368个主题注释,并附有20,951条逐字证据引用。三个原则指导了所提出的设计:逐字引用、完全可追溯性和透明性设计。仪表板通过五个分析视图展示完整的提取数据集,从数据集级别的主题概览到单个段落的深入分析,每个结果都可追溯到其来源。除了预定义的DFA主题类别外,该流程还生成了某些利益相关者关注的问题,如年龄验证、支付处理器审查和数字所有权,这些是固定分类法方法会遗漏的。该流程是领域通用的;将其适应新的咨询只需要更新提示词和新的数据集。实时演示可在https://dfa-dashboard.thalesbertaglia.com/获取。代码和处理后的数据可在https://github.com/thalesbertaglia/dfa-dashboard公开获取。

英文摘要

Public consultations generate large volumes of data in the form of stakeholder submissions that are practically unfeasible to analyse manually. We present an end-to-end LLM-based pipeline and interactive dashboard for structured topic extraction from regulatory consultation submissions, demonstrated on the European Commission's Digital Fairness Act (DFA) public call for evidence as a case study. The system processes raw PDF attachments and web-form responses, extracts topic annotations, and grounds every extraction in a verbatim quote from the source text. Applied to 4,322 DFA submissions, the pipeline produced 15,368 topic annotations supported by 20,951 verbatim evidence quotes. Three principles govern the proposed design: verbatim grounding, full traceability, and transparency by design. The dashboard exposes the full extraction dataset through five analytical views, from dataset-level topic overviews to individual paragraph drill-downs, with every result traceable to its source. Beyond the predefined DFA topic categories, the pipeline generated certain stakeholder concerns, such as Age Verification, Payment Processor Censorship, and Digital Ownership, that a fixed-taxonomy approach would have missed. The pipeline is domain-generic; adapting it to a new consultation requires only a prompt update and a new dataset. A live demo is available at https://dfa-dashboard.thalesbertaglia.com/. The code and processed data are publicly available at https://github.com/thalesbertaglia/dfa-dashboard.
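
The verbatim-grounding principle lends itself to a simple automatic check: every extracted quote should appear word for word in its source text. The sketch below is one assumed way to enforce that, with whitespace normalized so that PDF line breaks do not cause false failures. The helper names, the annotation layout, and the sample text are hypothetical and not taken from the released pipeline.

```python
def normalize(text):
    """Collapse runs of whitespace so PDF line breaks do not defeat matching."""
    return " ".join(text.split())

def is_verbatim(quote, source):
    """True if the non-empty quote occurs word for word in the source text."""
    q = normalize(quote)
    return bool(q) and q in normalize(source)

def ungrounded(annotations, source):
    """Return the annotations whose quote cannot be found in the source."""
    return [a for a in annotations if not is_verbatim(a["quote"], source)]

# Made-up submission text and topic annotations.
source = "Platforms should verify age\nwithout storing identity documents."
annotations = [
    {"topic": "Age Verification",
     "quote": "verify age without storing identity documents"},
    {"topic": "Digital Ownership",
     "quote": "users must own their purchases"},
]

bad = ungrounded(annotations, source)
print([a["topic"] for a in bad])
```

Annotations that fail the check could be dropped or sent back for re-extraction, so that every result shown to a reader stays traceable to its source.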

2605.30947 2026-06-04 cs.CL 版本更新

Extending AI for Research to the Humanities: A Multi-Agent Framework for Evidence-Grounded Scholarship

将人工智能研究扩展到人文学科:一个用于证据基础学术的多智能体框架

Yating Pan, Jiajun Zhang, Jun Wang, Qi Su

发表机构 * Department of Information Management(信息管理系) Research Center for Digital Humanities(数字人文研究中心) School of Foreign Languages(外国语言学院) Institute for Artificial Intelligence(人工智能研究院)

AI总结 提出SPIRE多智能体框架,通过将人文学科操作建模为协作智能体角色,结合多尺度细读检索,实现基于证据的论证,在古典文献基准上优于现有方法。

Comments 28 pages, 3 figures. Code, data catalogues, and reproduction scripts: https://github.com/YatingPan/SPIRE. Lead corresponding author: Jun Wang; corresponding author: Qi Su

详情
AI中文摘要

基于LLM的研究智能体在科学和工程领域取得了快速进展,这些领域的研究围绕可执行的实验、代码和定量信号组织。然而,人文学科学术需要一种不同的推理模式:对原始资料进行解释性、基于证据的论证,其中学术价值取决于忠实引用、可验证来源和细读。现有的研究智能体仍然主要针对执行和检索进行优化,而非基于证据的解释性推理。为了解决这一差距,我们引入了SPIRE(学术原语启发的研究引擎),一个用于基于证据的人文学科学术的多智能体框架。借鉴学术原语理论,SPIRE将人文学科中反复出现的操作转化为协作的智能体角色(来源发现、证据注释、比较、来源检查、抽样、引用绑定和论证综合),并基于多尺度细读基础,包括段落、上下文内图社区和跨上下文语义聚类。在一个针对古典中文和希腊罗马拉丁学术的同行评审论文基准上,SPIRE比Naive LLM、Text RAG和GraphRAG更可靠地恢复引用的原始来源证据,并在答案准确性、深度、覆盖范围和证据质量方面获得更高的盲审评分。消融实验表明,学术操作智能体和细读检索都对基于证据的论文有所贡献。代码、数据目录和复现脚本已在https://github.com/YatingPan/SPIRE发布。

英文摘要

LLM-based research agents have advanced rapidly in science and engineering, where research is organized around executable experiments, code, and quantitative signals. Humanities scholarship, however, requires a different mode of reasoning: interpretive, evidence-grounded argument over primary sources, where scholarly value depends on faithful quotation, verifiable provenance, and close reading. Existing research agents remain largely optimized for execution and retrieval, not evidence-grounded interpretive reasoning. To address this gap, we introduce SPIRE (Scholarly-Primitives-Inspired Research Engine), a multi-agent framework for evidence-grounded humanities scholarship. Drawing on Scholarly Primitives theory, SPIRE casts recurring humanities operations as cooperating agent roles (source discovery, evidence annotation, comparison, provenance checking, sampling, citation binding, and argumentative synthesis) over a multi-scale close-reading substrate of passages, intra-context graph communities, and cross-context semantic clusters. On a peer-reviewed-paper benchmark over classical Chinese and Greco-Roman Latin scholarship, SPIRE recovers cited primary-source evidence more reliably than Naive LLM, Text RAG, and GraphRAG, and receives higher blind-judge scores on answer accuracy, depth, coverage, and evidence quality. Ablations show that both the scholarly-operation agents and close-reading retrieval contribute to evidence-grounded essays. Code, data catalogues, and reproduction scripts are released at https://github.com/YatingPan/SPIRE.

2605.30457 2026-06-04 eess.AS cs.CL 版本更新

Extracting accent features in spoken Brazilian Portuguese without sociolinguistic labels

在没有社会语言学标签的情况下提取巴西葡萄牙语的口音特征

Pedro H. L. Leite, Pedro Benevenuto Valadares, Luiz W. P. Biscainho

发表机构 * PEE/COPPE, UFRJ(PEE/COPPE,UFRJ) Faculdade de Engenharia Elétrica e Computação (FEEC), UNICAMP(电子工程与计算学院(FEEC),UNICAMP) DEL/Poli & PEE/COPPE, UFRJ(DEL/Poli与PEE/COPPE,UFRJ)

AI总结 针对巴西葡萄牙语口音分类中标签缺乏的问题,提出一种仅使用声学标签的新工作流,通过隔离区域口音地标和基于音素的强制对齐器提取特征,在口音相关任务上优于通用架构。

Comments This work was submitted to the XLIV Brazilian Symposium on Telecommunications and Signal Processing (SBrT 2026)

详情
AI中文摘要

巴西葡萄牙语(pt-BR)的区域口音分类受限于对可靠标签的需求。虽然大型自监督学习(SSL)语音模型功能强大,但其训练流程稀释了社会语音信息,因为口音标签通常不可靠或未用于训练目标。本文介绍了一种仅使用声学标签的特征提取新工作流。通过隔离明确的区域口音地标并使用基于音素的强制对齐器(ZIPA),我们的目标特征集比话语嵌入更有效地捕捉方言差异,证明局部特征可以在使用最少且客观的数据标签的情况下,在口音相关任务上优于通用架构。

英文摘要

Regional accent classification in Brazilian Portuguese (pt-BR) suffers from the need for reliable labeling. While large self-supervised learning (SSL) speech models are powerful, their training pipelines dilute sociophonetic information, since accent labels are generally not reliable or are not used in training objectives. This work introduces a novel workflow for feature extraction using only acoustic labels. By isolating explicit regional accent landmarks and using a phoneme-based forced aligner (ZIPA), our targeted feature set captures dialectal variance more effectively than utterance embeddings, demonstrating that localized features can outperform general-purpose architectures on accent-related tasks using minimal and objective data labels.

2605.30021 2026-06-04 cs.CL 版本更新

Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs

在不损失对齐的情况下恢复多样性:面向后训练大语言模型的DPO配方

Vinay Samuel, Yapei Chang, Mohit Iyyer

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

AI总结 提出REDIPO数据构建流程,通过离线DPO从基础模型生成中恢复多样性答案,同时保持指令模型的对齐性能。

Comments Under Review. 26 pages, 3 figures, 16 tables

详情
AI中文摘要

许多开放式指令有多个有效答案,用户可以从看到这些答案中受益,但后训练往往将LLM的输出空间缩小到一小部分规范响应。我们引入REDIPO,一种离线DPO数据构建流程,用于恢复不同的有效答案模式,同时保留指令模型的对齐优势。对于每个提示,REDIPO从基础模型和指令模型中采样响应,用指令模型重写基础模型响应,过滤候选以确保安全和指令遵循质量,并构建偏好对,在具有相似指令遵循奖励的候选者中偏向边际多样的响应。在Qwen3-4B、OLMo-3-7B和LLaMA-3.1-8B上,相对于指令检查点,REDIPO将NoveltyBench distinct_k分别提高了134%、33%和44%,而DivPO在同一模型上将多样性改变了0%、-6%和-4%。这些增益在很大程度上保持了MTBench、IFEval和Arena-Hard的性能,并降低了直接类别HarmBench攻击成功率。消融实验表明,边际多样性对选择和基础响应重写驱动了多样性增益,而过滤和质量边界配对有助于保持对齐。总体而言,我们的结果表明,通过精心构建的偏好数据,可以重新引入基础模型生成中的多样化有效答案,同时保留后训练的对齐优势。我们在https://github.com/vsamuel2003/ReDiPO发布代码和数据。

英文摘要

Many open-ended instructions have multiple valid answers that users can benefit from seeing, but post-training often narrows an LLM's output space toward a small set of canonical responses. We introduce REDIPO, an offline DPO data-construction pipeline for recovering distinct valid answer modes while preserving the alignment benefits of the instruct model. For each prompt, REDIPO samples responses from both base and instruct models, rewrites base-model responses with the instruct model, filters candidates for safety and instruction-following quality, and builds preference pairs that favor marginally diverse responses among candidates with similar instruction-following reward. Across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, REDIPO improves NoveltyBench distinct_k by 134%, 33%, and 44% relative to the instruct checkpoints, while DivPO changes diversity by 0%, -6%, and -4% on the same models. These gains largely maintain MTBench, IFEval, and Arena-Hard performance, and reduce direct-category HarmBench attack success rate. Ablations show that marginal-diversity pair selection and base-response rewriting drive the diversity gains, while filtering and quality-bounded pairing help maintain alignment. Overall, our results show that diverse valid answers from base-model generations can be reintroduced through carefully constructed preference data while retaining the alignment benefits of post-training. We release our code and data at https://github.com/vsamuel2003/ReDiPO.

2605.29861 2026-06-04 cs.CL cs.AI 版本更新

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

迈向可验证的多模态深度研究:用于交错报告生成的多智能体框架

Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Xiaoxi Li, Zhicheng Dou

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

AI总结 提出多智能体框架Ptah,通过规划、研究和写作阶段生成交错文本与视觉证据的多模态报告,并引入验证器确保事实准确性和跨模态一致性。

Comments In progress

详情
AI中文摘要

大型语言模型(LLMs)已将自主智能体从深度搜索(检索简洁的事实答案)推进到深度研究(将分散的证据综合成长篇报告)。然而,由于缺乏确定性真实值的开放式合成以及需要将文本论证与视觉证据交错,可验证的多模态深度研究仍然具有挑战性。我们提出Ptah,一个用于交错报告生成的多智能体框架。Ptah通过规划、研究和写作阶段编排从用户查询到渲染网页报告的完整生命周期,其中专门智能体构建视觉感知计划、收集基于声明的证据、在视觉工作记忆中维护与源对齐的图像,并通过声明式多模态工具使用撰写报告。验证智能体作为框架的接受函数,在整个工作流中强制执行事实依据、引用保真度和跨模态一致性。我们进一步引入PtahEval,一个评估协议,通过图像级和呈现级评估增强现有基准。在深度研究基准上的实验表明,Ptah生成的面向人类的多模态报告比强基线更可靠、视觉信息更丰富且更实用。

英文摘要

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines. Our code is released at https://github.com/SnowNation101/Ptah

2605.29584 2026-06-04 cs.CL 版本更新

GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering

GAPD:面向知识库问答中智能体强化学习的金动作策略蒸馏

Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) ShanghaiTech University(上海科技大学) Ant Group(蚂蚁集团)

AI总结 提出GAPD框架,通过中间锚点匹配将金动作序列与在线策略对齐,为基于强化学习的知识库问答提供密集的令牌级指导,在多个基准上取得最优结果。

详情
AI中文摘要

强化学习(RL)天然适用于智能体知识库问答(KBQA),其中模型必须发出可执行动作、观察知识库反馈并最终返回答案。然而,当前基于RL的KBQA系统主要优化来自最终答案的稀疏奖励,导致中间动作错误监督不足。这对于逻辑形式标注的KBQA基准尤其受限:金逻辑形式可转换为可执行动作序列,但现有流水线主要将其用于热启动数据构建,而非用于在线策略RL更新。我们提出GAPD,一种训练时的金动作策略蒸馏框架,为基于结果的RL添加密集的令牌级指导。为了将金动作与在线学生策略对齐,GAPD使用中间锚点匹配:它将学生探索和金执行期间达到的中间实体视为状态锚点,并通过这些探索的实体集将学生状态与金状态匹配。基于对齐后的金动作的当前策略作为停止梯度的教师,其令牌分布被蒸馏回普通学生策略的生成动作令牌跨度上。GAPD在WebQSP、GrailQA和GraphQ上持续超越当前最先进水平。

英文摘要

Reinforcement learning (RL) is a natural fit for agentic knowledge base question answering (KBQA), where a model must issue executable actions, observe knowledge-base feedback, and eventually return an answer. However, current RL-based KBQA systems mainly optimize sparse rewards from the final answer, leaving intermediate action errors weakly supervised. This is especially limiting for logical-form annotated KBQA benchmarks: gold logical forms can be converted into executable action sequences, but existing pipelines use them mainly for warm-start data construction rather than for on-policy RL updates. We propose GAPD, a training-time Gold-Action Policy Distillation framework that adds dense token-level guidance to outcome-based RL. To align gold actions with on-policy student rollouts, GAPD uses MID-ANCHOR MATCHING: it treats the intermediate entities reached during student exploration and gold execution as state anchors, and matches student states to gold states through these explored entity sets. The current policy conditioned on this aligned gold action serves as a stop-gradient teacher, whose token distribution is distilled back to the ordinary student policy over generated action-token spans. GAPD consistently surpasses the current state of the art on WebQSP, GrailQA, and GraphQ.

2509.23694 2026-06-04 cs.AI cs.CL cs.CR 版本更新

SafeSearch: Automated Red-Teaming of LLM-Based Search Agents

SafeSearch: 基于LLM的搜索代理的自动化红队测试

Jianshuo Dong, Sheng Guo, Hao Wang, Xun Chen, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出SafeSearch自动化红队框架,系统评估基于LLM的搜索代理在五个风险类别中的安全性,发现GPT-4.1-mini在搜索工作流中攻击成功率高达90.5%,且常见防御措施效果有限。

Comments Accepted by ICML 2026

详情
AI中文摘要

搜索代理将LLM连接到互联网,使其能够访问更广泛和更新的信息。然而,这也引入了一个新的威胁面:不可靠的搜索结果可能误导代理产生不安全的输出。现实世界的事件和我们的两个野外观察表明,此类失败在实践中可能发生。为了系统地研究这一威胁,我们提出了SafeSearch,一个可扩展、成本效益高且轻量级的自动化红队框架,支持搜索代理的沙盒安全评估。利用该框架,我们生成了涵盖五个风险类别(例如,错误信息和提示注入)的300个测试用例,并评估了三个搜索代理框架在17个代表性LLM上的表现。我们的结果揭示了基于LLM的搜索代理存在重大漏洞,在搜索工作流设置中,GPT-4.1-mini的最高攻击成功率(ASR)达到90.5%。此外,我们发现常见的防御措施(如提醒提示)提供的保护有限。总体而言,SafeSearch提供了一种实用的方法来衡量和提高基于LLM的搜索代理的安全性。

英文摘要

Search agents connect LLMs to the Internet, enabling them to access broader and more up-to-date information. However, this also introduces a new threat surface: unreliable search results can mislead agents into producing unsafe outputs. Real-world incidents and our two in-the-wild observations show that such failures can occur in practice. To study this threat systematically, we propose SafeSearch, an automated red-teaming framework that is scalable, cost-efficient, and lightweight, enabling sandboxed safety evaluation of search agents. Using this, we generate 300 test cases spanning five risk categories (e.g., misinformation and prompt injection) and evaluate three search agent scaffolds across 17 representative LLMs. Our results reveal substantial vulnerabilities in LLM-based search agents, with the highest ASR reaching 90.5% for GPT-4.1-mini in a search-workflow setting. Moreover, we find that common defenses, such as reminder prompting, offer limited protection. Overall, SafeSearch provides a practical way to measure and improve the safety of LLM-based search agents.

2410.15761 2026-06-04 cs.CL cs.LG stat.ML 版本更新

Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees

基于LLM的抽取式问答中的最优查询分配:一个具有理论保证的学习-推迟框架

Yannis Montreuil, Shu Heng Yeo, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

发表机构 * School of Computing, National University of Singapore(新加坡国立大学计算机学院) Fédération ENAC ISAE-SUPAERO ONERA, Université de Toulouse, France(法国图卢兹大学ENAC ISAE-SUPAERO ONERA联合体) Institute for Infocomm Research (A*STAR), Singapore(新加坡信息与通信研究院(A*STAR)) IPAL, IRL 2955, Singapore(新加坡IPAL实验室)

AI总结 提出一个学习-推迟框架,通过将查询分配给专门专家,在保证高置信度预测的同时优化计算效率,并在SQuADv1、SQuADv2和TriviaQA上验证了其提高答案可靠性和降低计算开销的效果。

Comments 25 pages, 17 main paper

详情
AI中文摘要

大型语言模型在生成任务中表现出色,但在结构化文本选择(特别是抽取式问答)中效率低下。这一挑战在资源受限环境中被放大,因为部署多个专门模型处理不同任务是不切实际的。我们提出一个学习-推迟框架,将查询分配给专门专家,确保高置信度预测的同时优化计算效率。我们的方法整合了一个原则性的分配策略,并提供了关于最优推迟的理论保证,以平衡性能和成本。在SQuADv1、SQuADv2和TriviaQA上的实证评估表明,我们的方法增强了答案可靠性,同时显著降低了计算开销,使其非常适合可扩展且高效的EQA部署。

英文摘要

Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection, particularly in extractive question answering. This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral that balances performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method enhances answer reliability while significantly reducing computational overhead, making it well-suited for scalable and efficient EQA deployment.

2605.29076 2026-06-04 cs.CL cs.AI cs.LG 版本更新

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

结构化提示优化结合强化学习实现复杂文本的全局与局部可解释性

Tianyang Zhou, Wenbo Chen, Pierre Jinghong Liang, Leman Akoglu

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Amazon(亚马逊)

AI总结 提出eXTC框架,通过结构化提示优化、基于SOP的推理蒸馏和强化学习扩展,在分类性能和解释质量上显著优于现有范式。

详情
AI中文摘要

LLMs在文本分类上取得了进展,但现有范式面临权衡:监督(仅标签)微调可扩展,但对复杂文本推理有限且缺乏模型透明度;离散提示优化提供可读指令,但性能和可扩展性不佳。我们引入eXTC(可解释文本分类器),包含三个渐进阶段:(1)通过新的结构化提示优化算法学习自然语言的标准操作程序(SOP或规则手册);(2)从大型教师LLM到紧凑LM的基于SOP的推理蒸馏;(3)通过强化学习扩展超出初始SOP的推理能力。该设计使eXTC能够(i)通过紧凑LM实现快速推理,(ii)提供推理时的局部推理轨迹,以及其学习领域规则的全局模块化解释,同时(iii)在分类性能和解释质量上显著优于现有范式,并逐步提升。

英文摘要

LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers limited reasoning on complex text and lacks broader model transparency, while discrete prompt optimization offers human-readable instructions but struggles with performance and scalability. We introduce eXTC (eXplainable Text Classifier) with three progressive stages: (1) learning a Standard Operating Procedure (SOP, or rulebook) in natural language via a new Structured Prompt Optimization algorithm; (2) SOP-grounded reasoning distillation from a large teacher LLM into a compact LM; and (3) expanding reasoning capabilities beyond the initial SOP via reinforcement learning. This design enables eXTC to provide (i) fast inference via a compact LM, with (ii) inference-time local reasoning traces, alongside a global, modular explanation of its learned domain rules, while (iii) significantly outperforming existing paradigms across diverse benchmarks in both classification performance and explanation quality, with stage-by-stage gains.

2605.28829 2026-06-04 cs.CL cs.AI cs.CY 版本更新

Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

Aryabhata 2:扩展强化学习以提升高级STEM推理能力

Ritvik Rastogi, Vishal Singh, Tejas Chaudhari, Sandeep Varma

发表机构 * PhysicsWallah

AI总结 本文提出Aryabhata 2,一个通过强化学习后训练在竞争性STEM考试中提升推理能力的语言模型,在JEE、NEET等基准上超越基础模型且输出token减少高达64%。

详情
AI中文摘要

竞争性STEM考试(如JEE和NEET)需要多步符号推理、精确数值计算以及物理、化学和数学的深层概念理解。近期的大语言模型在常见推理基准上表现强劲,但仍难以大规模部署,因为数百万学生的疑问需要领域特定且结构一致的问题求解。 我们提出了Aryabhata 2,一个专注于竞争性STEM考试推理的语言模型,通过强化学习后训练进行优化。利用PhysicsWallah的内部题库,我们构建了高质量的训练课程,并通过可验证奖励的强化学习对GPT-OSS-20B进行后训练。训练结合了延长强化学习与通过逐步增大的rollout组大小拓宽探索。 我们在竞争性考试基准(包括JEE Main、JEE Advanced和NEET)以及分布外推理数据集(如AIME、HMMT、MMLU-Pro、MMLU-Redux 2.0和GPQA)上评估了Aryabhata 2。结果表明,Aryabhata 2在竞争性STEM推理上优于其基础模型GPT-OSS-20B,同时所需输出token大幅减少(最多减少64%)。

英文摘要

Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain-specific, consistently structured problem solving. We introduce Aryabhata 2, a reasoning-focused language model for competitive STEM examinations, trained via reinforcement-learning post-training. Using PhysicsWallah's internal question banks, we construct a high-quality training curriculum and post-train GPT-OSS-20B through reinforcement learning with verifiable rewards. Training combines prolonged reinforcement learning with broadened exploration via progressively larger rollout group sizes. We evaluate Aryabhata 2 on competitive examination benchmarks, including JEE Main, JEE Advanced, and NEET, as well as out-of-distribution reasoning datasets such as AIME, HMMT, MMLU-Pro, MMLU-Redux 2.0, and GPQA. Results show that Aryabhata 2 outperforms its base model GPT-OSS-20B on competitive STEM reasoning while requiring substantially fewer output tokens (up to 64% fewer).

2605.25200 2026-06-04 cs.CL 版本更新

GroupTravelBench: Benchmarking LLM Agents on Multi-Person Travel Planning

GroupTravelBench: 多人群组旅行规划中LLM智能体的基准测试

Xiang Cheng, Yulan Hu, Lulu Zheng, Zheng Pan, Xin Li, Yong Liu

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) AMAP, Alibaba Group(阿里巴巴集团高德地图)

AI总结 提出GroupTravelBench基准,通过多用户多轮对话任务评估LLM智能体在偏好获取、冲突协调和公平规划三方面的能力。

Comments work in progress

详情
AI中文摘要

旅行规划是评估LLM智能体规划与工具使用能力的现实任务。然而,现有基准通常只假设单一用户,从而回避了现实场景中最具挑战性的方面之一:智能体识别和解决多用户冲突的能力。为填补这一空白,我们引入了GroupTravelBench,这是首个针对多用户、多轮旅行规划的基准。基于真实用户画像、POI数据和票价数据,我们综合生成了650个任务,并将其分为三个难度等级。除了单用户行程规划所需的标准能力(如多步推理和工具使用)外,我们的基准进一步评估了旅行智能体所需的三项关键能力:(i) 获取:主动进行多轮对话以收集每位用户的偏好;(ii) 协调:通过妥协或分组策略解决用户间的冲突;以及(iii) 规划:搜索能最大化整体群体效用同时保持公平性和可行性的旅行方案。为模拟现实中的对话式行程规划,同时确保可靠的工具使用和离线评估,我们构建了一个带有缓存真实工具数据的交互式沙箱环境。我们评估了多种LLM,发现即使是前沿模型在偏好覆盖率和群体公平性方面仍存在显著弱点。GroupTravelBench为推进LLM智能体在现实旅行规划中的研究提供了一个实用且可复现的基准。

英文摘要

Travel planning in the real world is overwhelmingly a group activity, yet existing LLM travel-planning benchmarks reduce it to a single user, where the field is approaching saturation. This single-user assumption sidesteps what makes group planning hard for an agent: discovering private preferences across multiple users, surfacing conflicts, and balancing utility against fairness. To bring the task back to its multi-user reality, we introduce GroupTravelBench, the first benchmark for multi-user, multi-turn travel planning. Built from real user profiles, POI data, and ticket prices, it comprises 650 tasks across three difficulty levels, each running in a synchronous group-chat sandbox with cached tool data for reproducible offline evaluation. Beyond the multi-step reasoning and tool use that single-user benchmarks already test, GroupTravelBench probes three group-specific capabilities: (i) elicitation of private preferences through multi-turn dialogue; (ii) coordination of inter-user conflicts via compromise or subgrouping; and (iii) planning that balances group utility against fairness. We pair this with a complementary evaluation framework combining rule-based outcome metrics and LLM-judge process metrics. Across a wide range of frontier models, even the strongest agents fall short on all four rule-based outcome metrics, with plan validity below 12%, suggesting that group-level outcome quality is a key open challenge for LLM travel-planning agents.

2605.18879 2026-06-04 cs.LG cs.AI cs.CL 版本更新

ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

ZeroUnlearn:大语言模型中的少样本知识遗忘

Yujie Lin, Chengyi Yang, Zhishang Xiang, Yiping Song, Jinsong Su

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出ZeroUnlearn框架,通过模型编辑将机器遗忘重新定义为精确的知识重映射问题,利用封闭解乘法参数更新实现高效、定向的少样本遗忘。

详情
AI中文摘要

大型语言模型由于在海量网络语料上训练,不可避免地会保留敏感信息(定义为可能引发有害生成的输入),从而引发隐私和安全担忧。现有的机器遗忘方法主要依赖于重训练或激进微调,这些方法要么计算成本高,要么容易降低相关知识并损害整体模型效用。在这项工作中,我们通过模型编辑将机器遗忘重新表述为一个精确的知识重映射问题。我们提出了ZeroUnlearn,一个少样本遗忘框架。它通过将敏感输入映射到中性目标状态并移除其原始表示来覆盖敏感输入。ZeroUnlearn通过封闭解形式的乘法参数更新强制执行表示正交性,从而实现高效且有针对性的遗忘。我们进一步将ZeroUnlearn扩展到基于梯度的变体,用于多样本遗忘。实验表明,我们的方法在保持模型整体效用的同时优于现有基线。我们的代码可在github上获取:https://github.com/XMUDeepLIT/ZeroUnlearn。

英文摘要

Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive fine-tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re-mapping problem via model editing. We propose ZeroUnlearn, a few-shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed-form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient-based variant for multi-sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available on GitHub: https://github.com/XMUDeepLIT/ZeroUnlearn.

2605.19852 2026-06-04 cs.CL 版本更新

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

工具总是有益的吗?学习自适应调用工具以实现双模式多模态大语言模型推理

Qinghe Ma, Zhen Zhao, Yiming Wu, Jian Zhang, Lei Bai, Yinghuan Shi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出AutoTool模型,通过强化学习框架自适应决定是否调用工具,结合双模式推理策略和模式特定奖励函数,在提升准确率的同时降低推理开销。

Comments Accepted to ICML 2026

详情
AI中文摘要

工具增强推理已成为增强多模态大语言模型(MLLMs)推理能力的一个有前景的方向。然而,现有研究主要关注使模型能够执行工具调用,而忽略了调用工具的必要性。我们认为工具使用并非总是有益的,因为冗余或不恰当的调用会大大增加推理开销,甚至误导模型预测。为解决这一问题,我们引入了AutoTool,一个根据每个查询的特征自适应决定是否调用工具的模型。在强化学习框架内,我们设计了一种显式的双模式推理策略,并配以模式特定的奖励函数,以引导模型产生准确的响应。此外,为防止过早偏向单一推理模式,AutoTool在整个训练过程中共同探索并平衡工具辅助推理和文本中心推理,并在后期促进自由探索。大量实验表明,AutoTool表现出卓越的性能和高效率,在V*基准测试上相比基础模型准确率提升21.8%,在POPE基准测试上相比现有工具增强方法效率提升44.9%。代码可在https://github.com/MQinghe/AutoTool获取。

英文摘要

Tool-augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the question of whether invoking a tool is necessary. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations substantially increase reasoning overhead and can even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual-mode reasoning strategy with mode-specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8% accuracy gain on the V* benchmark compared to the base model, and a 44.9% improvement in efficiency over existing tool-augmented methods on the POPE benchmark. Code is available at https://github.com/MQinghe/AutoTool.

2605.18936 2026-06-04 cs.LG cs.CL 版本更新

FedMental: Evaluating Federated Learning for Mental Health Detection from Social Media Data

FedMental: 评估用于社交媒体数据心理健康检测的联邦学习

Nuredin Ali Abdelkadir, Anjali Ratnam, Zeerak Talat, Stevie Chancellor

发表机构 * University of Minnesota(明尼苏达大学) University of Edinburgh(爱丁堡大学)

AI总结 本文通过联邦学习和差分隐私联邦学习在抑郁和自杀危机检测任务上的实验,评估了隐私保护技术对心理健康检测性能的影响,发现联邦学习性能接近集中式训练,但差分隐私联邦学习存在显著的性能-隐私权衡。

Comments Association for Computational Linguistics (ACL) 2026 Main Conference

详情
AI中文摘要

社交媒体文本数据常用于训练机器学习模型以识别表现出高风险心理健康行为的用户。然而,共享这些敏感数据会带来隐私风险,并限制了基准数据集的发展。我们全面评估了隐私保护的机器学习技术是否能在保持性能的同时实现更安全的数据共享。具体来说,我们将联邦学习和差分隐私联邦学习应用于两个广泛研究的心理健康预测任务:X(Twitter)上的抑郁检测和Reddit上的自杀危机检测。通过将每个用户视为非独立同分布设置中的一个客户端,我们模拟了现实的数据共享场景,评估了不同的客户端比例、聚合策略和隐私预算。虽然联邦学习在抑郁识别上达到了与集中式训练相当的性能(集中式F1=85.63;最佳联邦学习模型F1=83.16),但我们发现差分隐私联邦学习即使在低噪声水平(epsilon=50)下也存在较大的性能-隐私权衡(F1下降高达27.01)。这是由于与心理健康相关的高信息量但稀疏的语言标记(如健康主题和情感词)被扭曲所致。本研究实证展示了当前隐私保护技术在心理健康推理任务中的潜力和局限性。

英文摘要

Social media text data are often used to train Machine Learning (ML) models to identify users exhibiting high-risk mental health behaviors. However, sharing this sensitive data poses privacy risks and limits the growth of benchmark datasets. We comprehensively evaluate whether privacy-preserving ML techniques can enable safer data sharing while preserving performance. Specifically, we apply federated learning (FL) and Differentially Private FL to two widely studied mental health prediction tasks: depression detection on X (Twitter) and suicide crisis detection on Reddit. We simulate realistic data-sharing scenarios by treating each user as a client in a non-IID setting, evaluating across different client fractions, aggregation strategies, and privacy budgets. While FL achieves performance comparable to centralized training (centralized F1 = 85.63; best FL model F1 = 83.16) on depression identification, we find that Differentially Private FL has a large performance-privacy trade-off (an F1 drop of up to 27.01) even with low levels of noise (epsilon = 50). This is due to the distortion of highly informative yet sparse linguistic markers related to mental health, like health topics and emotion words. This research empirically demonstrates the potential and limitations of current privacy preservation techniques for mental health inference tasks.
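The abstract does not name the aggregation strategies it compares. As a point of reference only, the sketch below shows federated averaging (FedAvg), a standard baseline in which the server averages client weights in proportion to each client's number of examples, together with client-fraction sampling. The function names, the flat-list weight representation, and the seed handling are assumptions made for illustration, not the paper's implementation.

```python
import random


def sample_clients(client_ids, fraction, seed=0):
    """Pick a random subset of clients for one round (hypothetical helper)."""
    rng = random.Random(seed)
    k = max(1, round(len(client_ids) * fraction))
    return rng.sample(client_ids, k)


def fedavg(updates):
    """Example-weighted average of client weight vectors.

    updates: list of (weights, num_examples) pairs, where weights is a
    flat list of floats with the same length for every client.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]


# Two toy clients: the second holds three times as many examples,
# so it pulls the average toward its own weights.
global_weights = fedavg([([1.0, 2.0], 1), ([3.0, 4.0], 3)])
```

In a per-user setting like the one described, each client holds only one user's posts, so the example counts (and therefore the averaging weights) can be highly uneven across clients.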

2605.15118 2026-06-04 cs.CR cs.CL 版本更新

Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

谈话(不)廉价:LLM攻击的分类法与基准覆盖审计

Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets

发表机构 * Palo Alto Networks(帕洛阿尔托网络)

AI总结 提出一个基于STRIDE的4×6目标×技术矩阵框架,用于审计LLM攻击基准的集体覆盖,发现现有基准仅覆盖最多25%的威胁面,且存在命名碎片化和评估空白。

详情
AI中文摘要

我们引入了一个可重用的框架,用于审计LLM攻击基准是否共同覆盖威胁面:一个基于STRIDE的4×6目标×技术矩阵,该矩阵由从932篇arXiv安全研究(2023-2026)中提取的推理时攻击的507叶分类法(401个数据填充叶和106个威胁模型衍生叶)构建而成。该矩阵支持基准外部验证,即审计集体覆盖而非单个基准的一致性。将其应用于六个公开基准,发现三个主要框架(HarmBench、InjecAgent、AgentDojo)占据非重叠的单元格,最多覆盖矩阵的25%,而整个STRIDE威胁类别(服务中断、模型内部)缺乏标准化评估,尽管这些类别中已发表的攻击通过没有基准测试的机制实现了46倍令牌放大和96%的攻击成功率。包含2521个独特攻击组的语料库进一步揭示了普遍的命名碎片化(单个攻击最多有29种表面形式)以及攻击高度集中于“安全与对齐绕过”类别的现象,这些结构属性在较小规模下不可见。分类法、攻击记录和覆盖映射作为可扩展工件发布;随着新基准的出现,它们可以映射到同一矩阵上,使社区能够跟踪评估差距是否正在缩小。

英文摘要

We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4×6 Target × Technique matrix grounded in STRIDE, constructed from a 507-leaf taxonomy (401 data-populated and 106 threat-model-derived leaves) of inference-time attacks extracted from 932 arXiv security studies (2023-2026). The matrix enables benchmark-external validation, that is, auditing collective coverage rather than individual benchmark consistency. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46× token amplification and 96% attack success rates through mechanisms that no benchmark tests. The corpus of 2,521 unique attack groups further reveals pervasive naming fragmentation (up to 29 surface forms for a single attack) and heavy concentration in Safety & Alignment Bypass, structural properties invisible at smaller scale. The taxonomy, attack records, and coverage mappings are released as extensible artifacts; as new benchmarks emerge, they can be mapped onto the same matrix, enabling the community to track whether evaluation gaps are closing.

2605.08665 2026-06-04 cs.CL 版本更新

Hint Tuning: Less Data Makes Better Reasoners

Hint Tuning:更少的数据造就更好的推理者

Siqi Fan, Minghao Li, Xiaoqian Ma, Xiusheng Huang, Zhuo Chen, Bowen Qin, Liujie Zhang, Shuo Shang, Weihang Chen

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Xiaohongshu Inc.(小红书公司) National University of Singapore(新加坡国立大学)

AI总结 提出Hint Tuning方法,通过自动构建三种提示状态(无提示、稀疏提示、完整提示)的训练数据,使推理模型根据问题难度校准推理深度,仅用1K样本即可在多个主流推理模型上平均减少31.5%的token生成,同时保持竞争性准确率。

详情
AI中文摘要

大型推理模型通过扩展思维链实现了高准确率,但生成的token比必要数量多5-8倍,且无论问题难度如何都统一应用冗长的推理。我们提出了Hint Tuning,一种数据高效的方法,教会模型校准推理深度。我们的关键洞察是:对应的指令模型可以作为理想的难度探针。通过测试指令模型在不同引导下能解决的问题,我们自动构建了三种状态的训练数据:No-Hint(直接答案)、Sparse-Hint(最小前缀)和Full-Hint(完整推理)。这将难度标注的抽象挑战转化为指令模型与推理模型之间可测量的一致性检查。仅使用1K自标注样本,Hint Tuning在多个尺度的主流推理模型(Qwen3-Thinking、DeepSeek-R1-Distill,4B-32B)上实现了24-66%的token减少(平均31.5%),同时在五个基准测试上保持了竞争性准确率。与需要大规模蒸馏数据集或昂贵强化学习的方法不同,我们通过简单地对齐指令模型的能力实现了卓越的效率。代码和数据可在https://github.com/redai-infra/hint-tuning获取。

英文摘要

Large reasoning models achieve high accuracy through extended chain-of-thought but generate 5--8× more tokens than necessary, applying verbose reasoning uniformly regardless of problem difficulty. We propose Hint Tuning, a data-efficient approach that teaches models to calibrate reasoning depth. Our key insight: the corresponding instruct model serves as an ideal difficulty probe. By testing what the instruct model can solve with varying guidance, we automatically construct training data across three states: No-Hint (direct answer), Sparse-Hint (minimal prefix), and Full-Hint (complete reasoning). This converts the abstract challenge of difficulty labeling into a measurable consistency check between the instruct and reasoning models. With only 1K self-annotated samples, Hint Tuning achieves 24--66% token reduction (31.5% average) across mainstream reasoning models (Qwen3-Thinking, DeepSeek-R1-Distill) at multiple scales (4B--32B) while maintaining competitive accuracy on five benchmarks. Unlike methods requiring massive distillation datasets or expensive RL, we achieve superior efficiency through simple alignment with the instruct model's capabilities. Code and data are available at https://github.com/redai-infra/hint-tuning.
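The three-state labeling described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `instruct_solve` is a hypothetical callable standing in for the instruct model, the exact-match check is a simplification, and `prefix_fraction` (how much of the reasoning trace counts as a "minimal prefix") is an invented parameter.

```python
from typing import Callable, Optional


def label_hint_state(
    question: str,
    reference_answer: str,
    reasoning_trace: str,
    instruct_solve: Callable[[str, Optional[str]], str],
    prefix_fraction: float = 0.2,  # assumed value, not from the paper
) -> str:
    """Assign No-Hint, Sparse-Hint, or Full-Hint to one question."""
    # Probe 1: can the instruct model answer with no guidance at all?
    if instruct_solve(question, None) == reference_answer:
        return "No-Hint"
    # Probe 2: does a short prefix of the reasoning trace suffice?
    cut = max(1, int(len(reasoning_trace) * prefix_fraction))
    if instruct_solve(question, reasoning_trace[:cut]) == reference_answer:
        return "Sparse-Hint"
    # Otherwise the question keeps its complete reasoning trace.
    return "Full-Hint"


def toy_solver(question: str, hint: Optional[str]) -> str:
    """Stand-in for an instruct model, used only to exercise the logic."""
    if question == "1+1":
        return "2"
    if question == "12*12" and hint:
        return "144"
    return "?"


labels = [
    label_hint_state("1+1", "2", "trivial", toy_solver),
    label_hint_state("12*12", "144", "12*12 = 12*10 + 12*2", toy_solver),
    label_hint_state("hard", "42", "a long derivation", toy_solver),
]
```

Ordering the probes from least to most guidance means each question receives the cheapest state under which the instruct model still succeeds.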

2604.25860 2026-06-04 cs.CL cs.AI cs.CY 版本更新

Luminol-AIDetect: Fast Zero-shot Machine-Generated Text Detection based on Perplexity under Text Shuffling

Luminol-AIDetect: 基于文本打乱下困惑度的快速零样本机器生成文本检测

Lucio La Cava, Andrea Tagarelli

发表机构 * DIMES Dept., University of Calabria(卡拉布里亚大学DIMES系)

AI总结 提出Luminol-AIDetect,一种通过随机打乱文本并利用困惑度变化来区分机器生成文本与人类写作的零样本统计方法,在多个领域和攻击下达到SOTA性能。

Comments Under Review

详情
AI中文摘要

机器生成文本检测需要识别跨生成模型的结构不变信号,而非依赖模型特定指纹。为此,我们假设尽管大语言模型擅长局部语义一致性,但其自回归特性导致与人类写作相比存在特定结构脆弱性。我们提出Luminol-AIDetect,一种新颖的零样本统计方法,通过连贯性破坏暴露这种脆弱性。通过应用简单的随机文本打乱程序,我们证明困惑度的变化可作为原则性的、模型无关的判别依据,因为机器生成文本在打乱下的困惑度表现出特征性分散,与人类写作更稳定的结构变异性显著不同。Luminol-AIDetect利用这一区别指导决策过程,从输入文本及其打乱版本中提取少量基于困惑度的标量特征,然后通过密度估计和集成预测进行检测。在8个内容领域、11种对抗攻击类型和18种语言上的评估表明,Luminol-AIDetect实现了最先进的性能,FPR降低高达17倍,同时成本低于先前方法。

英文摘要

Machine-generated text (MGT) detection requires identifying structurally invariant signals across generation models, rather than relying on model-specific fingerprints. In this respect, we hypothesize that while large language models excel at local semantic consistency, their autoregressive nature results in a specific kind of structural fragility compared to human writing. We propose Luminol-AIDetect, a novel, zero-shot statistical approach that exposes this fragility through coherence disruption. By applying a simple randomized text-shuffling procedure, we demonstrate that the resulting shift in perplexity serves as a principled, model-agnostic discriminant, as MGT displays a characteristic dispersion in perplexity-under-shuffling that differs markedly from the more stable structural variability of human-written text. Luminol-AIDetect leverages this distinction to inform its decision process: a handful of perplexity-based scalar features are extracted from an input text and its shuffled version, and detection is then performed via density estimation and ensemble-based prediction. Evaluated across 8 content domains, 11 adversarial attack types, and 18 languages, Luminol-AIDetect demonstrates state-of-the-art performance, with up to 17x lower FPR, while being cheaper than prior methods.
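To make the perplexity-under-shuffling idea concrete, here is a small sketch. It assumes a toy add-one-smoothed bigram model in place of the language model the method would actually score with, and the three features it returns (base perplexity, mean and standard deviation of the log-perplexity shift) are illustrative choices rather than the paper's feature set. The density-estimation and ensemble stages are not shown.

```python
import math
import random
from collections import Counter

# Tiny reference corpus for the toy scoring model (an assumption).
REFERENCE = ["the cat sat on the mat", "the dog sat on the rug"]


def train_bigram(sentences):
    """Count bigrams and their contexts; '<s>' marks sentence start."""
    context, bigram, vocab = Counter(), Counter(), {"<s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context[prev] += 1
            bigram[(prev, cur)] += 1
    return context, bigram, len(vocab)


def perplexity(tokens, model):
    """Per-token perplexity under add-one smoothing."""
    context, bigram, vocab_size = model
    seq = ["<s>"] + list(tokens)
    log_p = 0.0
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log((bigram[(prev, cur)] + 1) / (context[prev] + vocab_size))
    return math.exp(-log_p / len(tokens))


def shuffle_features(text, model, n_shuffles=20, seed=0):
    """Scalar features from the perplexity shift under random shuffling."""
    rng = random.Random(seed)
    tokens = text.split()
    base = perplexity(tokens, model)
    shifts = []
    for _ in range(n_shuffles):
        shuffled = tokens[:]
        rng.shuffle(shuffled)
        shifts.append(math.log(perplexity(shuffled, model) / base))
    mean = sum(shifts) / len(shifts)
    var = sum((s - mean) ** 2 for s in shifts) / len(shifts)
    return {"base_ppl": base, "mean_shift": mean, "std_shift": math.sqrt(var)}


model = train_bigram(REFERENCE)
features = shuffle_features("the cat sat on the mat", model)
```

The standard deviation of the shift is one simple way to quantify the "dispersion" the abstract refers to; a real detector would compute such features with a neural language model and feed them to a downstream classifier.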

2603.01421 2026-06-04 cs.AI cs.CL 版本更新

SciDER: Scientific Data-centric End-to-end Researcher

SciDER: 以科学数据为中心的端到端研究者

Ke Lin, Owais Aijaz, Yilin Lu, Yiyang Luo, Xuehang Guo, Preslav Nakov

发表机构 * GitHub

AI总结 提出SciDER多智能体系统,通过数据驱动方法和动态多模态技能系统,自动化科学研究的全生命周期,并在六个基准测试中取得领先结果。

Comments 10 pages, 8 figures, 7 tables

详情
AI中文摘要

虽然大型语言模型加速了科学发现,但现有智能体在适应性、领域泛化和多模态可扩展性方面面临严重限制,通常难以自主处理原始的、特定领域的实验数据。为了克服这些障碍,我们引入了SciDER,一个旨在灵活自动化整个研究生命周期的多智能体系统。该框架采用新颖的数据中心方法,并在四个专门的子智能体之间集成动态多模态技能系统。具体来说,一个构思智能体通过进化思想搜索生成新颖假设,一个数据分析智能体系统化地结构化原始数据,一个实验智能体基于数据集特征合成可执行代码,一个批评智能体驱动迭代自我改进。为了民主化开源科学发现,我们发布了OpenSciDER-SFT-8K,一个高质量的执行轨迹数据集,以及OpenSciDER-27B微调模型。在六个基准测试中,SciDER和OpenSciDER取得了具有竞争力或领先的结果,在数据中心分析、端到端研究执行和多模态科学可视化方面尤其强劲。通过将数据分析与实验执行相结合,SciDER弥合了抽象科学推理与可重复实验合成之间的差距。

英文摘要

While large language models accelerate scientific discovery, existing agents face severe limitations in adaptability, domain generalization, and multimodal scalability, often struggling to autonomously process raw, domain-specific experimental data. To overcome these barriers, we introduce SciDER, a multi-agent system designed to flexibly automate the entire research lifecycle. This framework employs a novel data-centric approach and integrates a dynamic multimodal skill system across four specialized sub-agents. Specifically, an ideation agent generates novel hypotheses via Evolutionary Idea Search, a data analysis agent systematically structures raw data, an experimentation agent synthesizes executable code grounded in dataset characteristics, and a critic agent drives iterative self-refinement. To democratize open-source scientific discovery, we release OpenSciDER-SFT-8K, a high-quality execution trajectory dataset, alongside the OpenSciDER-27B fine-tuned model. Across six benchmarks, SciDER and OpenSciDER obtain competitive or leading results, with especially strong gains on data-centric analysis, end-to-end research execution, and multimodal scientific visualization. By integrating data analysis with experimental execution, SciDER bridges the gap between abstract scientific reasoning and reproducible experimentation synthesis.

2601.09853 2026-06-04 cs.CL cs.AI 版本更新

MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication

MedRedFlag:探究LLMs如何在真实健康沟通中纠正误解

Sraavya Sambara, Yuan Pu, Ayman Ali, Vishala Mishra, Lionel Wong, Monica Agrawal

发表机构 * Independent Researcher(独立研究者) Duke University(杜克大学) Stanford University(斯坦福大学)

AI总结 本研究通过构建MedRedFlag数据集(1100+个来自Reddit的需纠正问题),系统比较了先进LLMs与临床医生的回应,发现LLMs常未能纠正问题中的错误前提,可能导致次优医疗决策,揭示了患者面向医疗AI系统的关键安全漏洞。

详情
AI中文摘要

来自患者的真实健康问题往往无意中嵌入了错误的假设或前提。在这种情况下,安全的医疗沟通通常涉及纠正:先指出隐含的误解,然后回应用户的潜在背景,而非原始问题。尽管大型语言模型(LLMs)越来越多地被普通用户用于医疗建议,但它们尚未针对这一关键能力进行测试。因此,在本工作中,我们研究了LLMs如何应对真实健康问题中嵌入的错误前提。我们开发了一个半自动化流程来整理MedRedFlag,这是一个包含1100多个来自Reddit的、需要纠正的问题的数据集。然后,我们系统地比较了最先进的LLMs与临床医生的回应。我们的分析显示,LLMs往往未能纠正有问题的提问,即使检测到了有问题的前提,并且提供的答案可能导致次优的医疗决策。我们的基准测试和结果揭示了LLMs在真实健康沟通条件下表现的新且重大的差距,突显了面向患者的医疗AI系统的关键安全问题。代码和数据集可在https://github.com/srsambara-1/MedRedFlag获取。

英文摘要

Real-world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communication typically involves redirection: addressing the implicit misconception and then responding to the underlying patient context, rather than the original question. While large language models (LLMs) are increasingly being used by lay users for medical advice, they have not yet been tested for this crucial competency. Therefore, in this work, we investigate how LLMs react to false premises embedded within real-world health questions. We develop a semi-automated pipeline to curate MedRedFlag, a dataset of 1100+ questions sourced from Reddit that require redirection. We then systematically compare responses from state-of-the-art LLMs to those from clinicians. Our analysis reveals that LLMs often fail to redirect problematic questions, even when the problematic premise is detected, and provide answers that could lead to suboptimal medical decision making. Our benchmark and results reveal a novel and substantial gap in how LLMs perform under the conditions of real-world health communication, highlighting critical safety concerns for patient-facing medical AI systems. Code and dataset are available at https://github.com/srsambara-1/MedRedFlag.

2511.20233 2026-06-04 cs.CL 版本更新

REFLEX: Self-Refining Explainable Fact-Checking via Verdict-Anchored Style Control

REFLEX: 通过裁决锚定风格控制实现自我精炼的可解释事实核查

Chuyi Kong, Wei Gao, Jing Ma, Hongzhan Lin, Yuxi Sun

发表机构 * Hong Kong Baptist University(香港浸会大学) Singapore Management University(新加坡管理大学)

AI总结 提出REFLEX方法,利用自我分歧的真实性信号构建引导向量,以裁决锚定风格控制实现自我精炼的事实核查,仅需465个样本即达最优性能。

详情
Journal ref
ACL 2026 Main Conference
AI中文摘要

社交媒体上假新闻的盛行要求自动化事实核查系统提供准确的裁决和忠实的解释。然而,现有基于大语言模型(LLM)的方法忽略了LLM生成解释中的欺骗性误导风格,导致不忠实的理由可能误导人类判断。它们严重依赖外部知识源,引入幻觉甚至高延迟,削弱了实时使用中至关重要的可靠性和响应性。为解决这些挑战,我们提出REason-guided Fact-checking with Latent EXplanations (REFLEX),一种自我精炼范式,显式控制以裁决锚定的推理风格。REFLEX利用骨干模型及其微调变体之间的自我分歧真实性信号构建引导向量,自然地将事实与风格分离。在真实世界数据集上的实验表明,REFLEX在LLaMA系列模型下仅用465个自我精炼样本即达到最先进性能。此外,由于其可迁移性,REFLEX在野外数据上获得了高达7.54%的提升。我们的结果进一步证明,该方法有效缓解了忠实幻觉,从而引导模型在可解释事实核查中比先前工作获得更准确的裁决。

英文摘要

The prevalence of fake news on social media demands automated fact-checking systems that provide accurate verdicts with faithful explanations. However, existing large language model (LLM)-based approaches ignore deceptive misinformation styles in LLM-generated explanations, resulting in unfaithful rationales that can mislead human judgments. They also rely heavily on external knowledge sources, introducing hallucinations and even high latency that undermine reliability and responsiveness, both of which are crucial for real-time use. To address these challenges, we propose REason-guided Fact-checking with Latent EXplanations (REFLEX), a self-refining paradigm that explicitly controls reasoning style anchored on verdict. REFLEX utilizes self-disagreement veracity signals between the backbone model and its fine-tuned variant to construct steering vectors, naturally disentangling fact from style. Experiments on the real-world dataset show REFLEX achieves state-of-the-art performance under LLaMA-series models with only 465 self-refined samples. Moreover, owing to its transferability, REFLEX yields up to a 7.54% gain on in-the-wild data. Our results further demonstrate that our method effectively mitigates faithful hallucination, thereby guiding the model toward more accurate verdicts than previous works in explainable fact-checking.

2604.17709 2026-06-04 cs.CL cs.DC 版本更新

DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models

DeInfer:分解式大语言模型的高效并行推理

You-Liang Huang, Xinhao Huang, Chengxi Liao, Zeyi Wen

发表机构 * Boston University(波士顿大学)

AI总结 针对分解式大语言模型并行推理性能差的问题,提出DeInfer系统,通过多项优化实现高性能并行推理,实验证明其优越性。

Comments accepted by DAC'26, latest version fixes a minor mistake

详情
AI中文摘要

现有关于大语言模型(LLM)分解的工作主要关注提升下游任务性能,但在尝试扩展模型规模时忽略了并行推理性能差的问题。为缓解这一重要性能问题,本文介绍了DeInfer,一个专用于分解式LLM并行推理的高性能推理系统。它包含多项优化以最大化性能,并与最先进的优化技术兼容。通过大量实验评估DeInfer的性能,结果证明了其优越性,表明它能极大地促进分解式LLM的并行推理。

英文摘要

Existing works on large language model (LLM) decomposition mainly focus on improving performance on downstream tasks, but they ignore the poor parallel inference performance that arises when scaling up the model size. To mitigate this important performance issue, this paper introduces DeInfer, a high-performance inference system dedicated to parallel inference of decomposed LLMs. It comprises multiple optimizations that maximize performance and is compatible with state-of-the-art optimization techniques. Extensive experiments are carried out to evaluate DeInfer's performance, and the results demonstrate its superiority, suggesting it can greatly facilitate the parallel inference of decomposed LLMs.

2604.11510 2026-06-04 cs.CL cs.AI cs.LG 版本更新

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

策略分裂:通过双模式熵正则化激励大语言模型强化学习中的双模式探索

Jiashu Yao, Heyan Huang, Daiqing Wu, Zeming Liu, Yuhang Guo

发表机构 * Beijing Institute of Technology(北京理工大学) Tsinghua University(清华大学) Beihang University(北京航空航天大学)

AI总结 提出Policy Split方法,将策略分裂为正常和高熵两种模式,通过协作双模式熵正则化在保持准确性的同时促进多样化探索,实验表明在通用和创造性任务上优于现有基线。

Comments preprint

详情
AI中文摘要

为了在不牺牲准确性的情况下鼓励大语言模型(LLM)强化学习(RL)中的多样化探索,我们提出了Policy Split,一种新颖的范式,通过高熵提示将策略分裂为正常模式和高熵模式。在共享模型参数的同时,两种模式针对不同目标进行协作的双模式熵正则化。具体来说,正常模式优化任务正确性,而高熵模式融入探索偏好,两种模式协作学习。大量实验表明,我们的方法在通用和创造性任务的各种模型规模上始终优于已建立的熵引导RL基线。进一步分析揭示,Policy Split促进了双模式探索,其中高熵模式产生与正常模式不同的行为模式,提供独特的学习信号。

英文摘要

To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.

2604.08564 2026-06-04 cs.CL cs.LG 版本更新

Attention-Based Sampler for Diffusion Language Models

基于注意力的扩散语言模型采样器

Yuyan Zhou, Kai Syun Hou, Weiyu Chen, James Kwok

发表机构 * Department of Computer Science and Engineering, The Hong Kong University of Science and Technology(香港科技大学计算机科学与工程系)

AI总结 针对扩散语言模型采样中忽略全局序列结构的问题,提出基于注意力矩阵列和的采样顺序优化方法,实现无训练的高质量并行采样。

详情
AI中文摘要

自回归模型(ARMs)已在语言建模中建立了主导范式。然而,其严格的顺序采样范式对推理效率和建模灵活性施加了根本性限制。为解决这些限制,提出了基于扩散的大语言模型(dLLMs),提供了并行采样和灵活语言建模的潜力。尽管有这些优势,当前dLLMs的采样策略主要依赖于token级别的信息,未能考虑全局序列结构,往往产生次优结果。在本文中,我们从对数似然最大化的角度研究采样顺序选择问题。我们证明该问题是NP难的,并提出一种基于最优采样秩的近似方法,使目标在计算上可行。我们进一步证明,通过按注意力矩阵列和降序采样token可以优化该可行目标。这一发现为注意力引导采样提供了原则性依据,并提供了贪婪搜索的理论基础替代方案。我们将这一理论见解实例化为一种新的无训练采样算法,称为Attn-Sampler,并进一步提出动态注意力阈值以实现实际加速。在多个基准上的大量实验验证了我们方法的有效性,表明它在增强采样并行性的同时实现了更优的生成质量。

英文摘要

Auto-regressive models (ARMs) have established a dominant paradigm in language modeling. However, their strictly sequential sampling paradigm imposes fundamental constraints on both inference efficiency and modeling flexibility. To address these limitations, diffusion-based large language models (dLLMs) have been proposed, offering the potential for parallel sampling and flexible language modeling. Despite these advantages, current dLLM sampling strategies rely primarily on token-level information, fail to account for global sequence structure, and often yield suboptimal results. In this paper, we study the sampling order selection problem from the perspective of log-likelihood maximization. We show that this problem is NP-hard and propose an optimal sampling-rank-based approximation that makes the objective computationally tractable. We further prove that the tractable objective is optimized by sampling tokens in descending order of their attention-matrix column sums. This finding provides a principled justification for attention-guided sampling and offers a theoretically grounded alternative to greedy search. We instantiate this theoretical insight in a new training-free sampling algorithm, termed Attn-Sampler, and further propose dynamic attention thresholding for practical acceleration. Extensive experiments across multiple benchmarks validate the effectiveness of our proposed method, demonstrating that it achieves superior generation quality while enhancing sampling parallelism.
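The ordering rule stated in the abstract (sample tokens in descending order of attention-matrix column sums) can be illustrated with a small sketch. This is not the Attn-Sampler implementation: the function names, the nested-list attention matrix, and in particular the relative `threshold` rule in `select_parallel` are assumptions made for illustration, since the abstract does not say how dynamic attention thresholding is defined.

```python
def column_sum(attn, j):
    """Total attention received by key position j across all query rows."""
    return sum(row[j] for row in attn)


def attention_guided_order(attn, masked_positions):
    """Rank masked positions by attention-matrix column sum, largest first.

    attn[i][j] is the attention weight from query position i to key position j.
    Ties are broken by position index so the result is deterministic.
    """
    return sorted(masked_positions, key=lambda j: (-column_sum(attn, j), j))


def select_parallel(attn, masked_positions, threshold):
    """Hypothetical thresholding rule (an assumption, not from the paper):
    unmask every position whose column sum is at least `threshold` times
    the largest column sum among the masked positions."""
    order = attention_guided_order(attn, masked_positions)
    if not order:
        return []
    top = column_sum(attn, order[0])
    return [j for j in order if column_sum(attn, j) >= threshold * top]


attn = [
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
    [0.10, 0.60, 0.20, 0.10],
]
order = attention_guided_order(attn, [3, 2, 1])
batch = select_parallel(attn, [3, 2, 1], threshold=0.7)
```

Here the column sums for positions 1, 2, and 3 are 1.35, 0.95, and 0.85, so position 1 is decoded first; with the assumed threshold of 0.7, positions 1 and 2 would be decoded in the same step.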

2604.04944 2026-06-04 cs.CL cs.AI 版本更新

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

包含思维:通过净化决策空间缓解偏好不稳定性

Mohammad Reza Ghasemi Madani, Soyeon Caren Han, Shuo Yang, Jey Han Lau

发表机构 * School of Computing and Information Systems, The University of Melbourne(计算与信息系统学院,墨尔本大学)

AI总结 提出包含思维(IoT)策略,通过逐步自过滤干扰选项来重构多选题,从而稳定模型偏好并提升推理性能。

详情
AI中文摘要

多项选择题(MCQ)被广泛用于评估大型语言模型(LLM)。然而,LLM 仍然容易受到似是而非的干扰项的影响。这常常将注意力转移到无关选项上,导致在正确和错误答案之间不稳定地摇摆。在本文中,我们提出包含思维(IoT),一种渐进式自过滤策略,旨在减轻这种认知负荷(即干扰项存在下模型偏好的不稳定性),并使模型更有效地关注合理答案。我们的方法仅使用合理的选项选择来重构 MCQ,为检查比较判断以及模型在扰动下内部推理的稳定性提供了一个受控环境。通过明确记录这一过滤过程,IoT 还增强了模型决策的透明度和可解释性。广泛的实证评估表明,IoT 在算术、常识推理和教育基准测试中显著提升了思维链性能,且计算开销极小。

英文摘要

Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to plausible distractors, which often divert attention toward irrelevant choices and cause unstable oscillation between correct and incorrect answers. In this paper, we propose Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy designed to mitigate this cognitive load (i.e., instability of model preferences in the presence of distractors) and enable the model to focus more effectively on plausible answers. Our method reconstructs the MCQ using only plausible option choices, providing a controlled setting for examining comparative judgements and therefore the stability of the model's internal reasoning under perturbation. By explicitly documenting this filtering process, IoT also enhances the transparency and interpretability of the model's decision-making. Extensive empirical evaluation demonstrates that IoT substantially boosts chain-of-thought performance across a range of arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.

2604.00819 2026-06-04 cs.CL cs.AI 版本更新

Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding

情感纠缠与贝叶斯推理用于多维情感理解

Hemanth Kotaprolu, Kishan Maharaj, Raey Zhao, Abhijit Mishra, Pushpak Bhattacharyya

发表机构 * Indian Institute of Technology Bombay(印度理工学院孟买分校) University of Texas at Austin(德克萨斯大学奥斯汀分校) IBM Research(IBM研究院)

AI总结 提出基于Plutchik基本情绪理论的情感场景基准EmoScene,并利用情感共现统计的贝叶斯推理框架进行联合后验推理,提升多维情感理解的结构一致性。

Comments 19 pages in total, 10 Figures, 7 Tables

详情
AI中文摘要

理解自然语言中的情感本质上是一个多维推理问题,其中多个情感信号通过上下文、人际关系和情境线索相互作用。然而,大多数现有的情感理解基准依赖于短文本和预定义的情感标签,将这一过程简化为独立的标签预测,忽略了情感之间的结构化依赖关系。为了解决这一局限性,我们引入了情感场景(EmoScene),一个基于理论的基准,包含4,731个上下文丰富的场景,并用源自Plutchik基本情绪的8维情感向量进行标注。基于情感很少独立出现的观察,我们进一步提出了一个纠缠感知的贝叶斯推理框架,该框架结合情感共现统计,对情感向量进行联合后验推理。这种轻量级的后处理不需要任何参数更新,提高了预测的结构一致性,并在不增加额外成本的情况下,整体词汇准确率提升了2.24%。因此,EmoScene为研究多维情感理解和当前语言模型的局限性提供了一个具有挑战性的基准。

英文摘要

Understanding emotions in natural language is inherently a multi-dimensional reasoning problem, where multiple affective signals interact through context, interpersonal relations, and situational cues. However, most existing emotion understanding benchmarks rely on short texts and predefined emotion labels, reducing this process to independent label prediction and ignoring the structured dependencies among emotions. To address this limitation, we introduce Emotional Scenarios (EmoScene), a theory-grounded benchmark of 4,731 context-rich scenarios annotated with an 8-dimensional emotion vector derived from Plutchik's basic emotions. Motivated by the observation that emotions rarely occur independently, we further propose an entanglement-aware Bayesian inference framework that incorporates emotion co-occurrence statistics to perform joint posterior inference over the emotion vector. This lightweight post-processing requires no parameter updates, improves the structural consistency of predictions, and yields an overall gain of 2.24% in Lexical Accuracy at no additional cost. EmoScene therefore provides a challenging benchmark for studying multi-dimensional emotion understanding and the limitations of current language models.
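To make "joint posterior inference over the emotion vector" concrete, the following is a minimal sketch under several assumptions that the abstract does not confirm: emotion dimensions are treated as binary, the co-occurrence prior is estimated from counts of whole annotated vectors with add-one smoothing, and the model's per-dimension probabilities are treated as independent evidence. The names `fit_prior` and `joint_map` are hypothetical.

```python
from collections import Counter
from itertools import product


def fit_prior(label_vectors, smoothing=1.0):
    """Estimate a smoothed prior over whole emotion vectors from annotations,
    so that emotions which co-occur often get a higher joint probability."""
    dims = len(label_vectors[0])
    counts = Counter(tuple(v) for v in label_vectors)
    configs = list(product((0, 1), repeat=dims))
    total = len(label_vectors) + smoothing * len(configs)
    return {c: (counts[c] + smoothing) / total for c in configs}


def joint_map(prior, probs):
    """Return the emotion vector maximizing prior(e) * prod_i p_i^e_i (1-p_i)^(1-e_i).

    `probs` are per-dimension probabilities from a frozen model; no model
    parameters are updated, this is pure post-processing.
    """
    def score(config):
        s = prior[config]
        for e, p in zip(config, probs):
            s *= p if e else (1.0 - p)
        return s
    return max(prior, key=score)


# Toy data with 3 dimensions instead of 8: dimensions 0 and 1 always co-occur.
train = [(1, 1, 0)] * 8 + [(0, 0, 1)] * 4 + [(0, 0, 0)] * 4
prior = fit_prior(train)
model_probs = [0.9, 0.45, 0.1]
independent = tuple(int(p >= 0.5) for p in model_probs)
joint = joint_map(prior, model_probs)
```

In this toy case independent thresholding drops dimension 1 (probability 0.45), while the joint estimate keeps it because it co-occurs with dimension 0 in the annotations. With 8 binary dimensions the enumeration covers only 256 configurations, so exhaustive search stays cheap.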

2601.11214 2026-06-04 cs.CL 版本更新

T$^\star$: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning

T$^\star$:通过轨迹感知强化学习实现掩码扩散语言模型的渐进块缩放

Hanchen Xia, Baoyou Chen, Yutang Ge, Guojiang Zhao, Siyu Zhu

发表机构 * Shanghai Academy of AI for Science(上海人工智能科学研究院) Shanghai Innovation Institute(上海创新研究院) Fudan University(复旦大学) School of Mathematical Sciences(数学科学学院) Shanghai Jiao Tong University(上海交通大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出T$^\star$方法,利用基于TraceRL的训练课程,在掩码扩散语言模型中渐进增大块大小,实现高并行解码且性能损失极小。

详情
AI中文摘要

我们提出T$^\star$,一种简单的基于TraceRL的训练课程,用于掩码扩散语言模型(MDM)中的渐进块大小缩放。从AR初始化的小块MDM开始,T$^\star$平滑过渡到更大的块,从而在数学推理基准上实现更高并行度的解码,且性能下降最小。此外,进一步分析表明,T$^\star$实际上可能收敛到一种替代解码方案,该方案能达到相当的性能。

英文摘要

We present T$^\star$, a simple TraceRL-based training curriculum for progressive block-size scaling in masked diffusion language models (MDMs). Starting from an AR-initialized small-block MDM, T$^\star$ transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Moreover, further analysis suggests that T$^\star$ may actually converge to an alternative decoding schedule that achieves comparable performance.

2510.21459 2026-06-04 cs.CR cs.CL cs.LG 版本更新

SBASH: a Framework for Designing and Evaluating RAG vs. Prompt-Tuned LLM Honeypots

SBASH:用于设计和评估RAG与提示调优的LLM蜜罐框架

Adetayo Adebimpe, Helmut Neukirchen, Thomas Welsh

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出SBASH框架,利用轻量级本地LLM和RAG技术构建蜜罐,通过多种指标评估RAG与提示调优对LLM蜜罐真实性和响应延迟的影响。

Comments to be published in: The 3rd International Conference on Foundation and Large Language Models (FLLM2025), IEEE, 2025

详情
Journal ref
2025 3rd International Conference on Foundation and Large Language Models (FLLM), IEEE, 2025
AI中文摘要

蜜罐是用于收集有价值威胁情报或将攻击者从生产系统引开的诱饵系统。最大化攻击者参与度对其效用至关重要。然而,研究表明,上下文感知能力(例如响应新攻击类型、系统和攻击者代理的能力)对于提高参与度是必要的。大型语言模型(LLM)已被证明是提高上下文感知能力的一种方法,但面临若干挑战,包括响应时间的准确性和及时性、高运营成本以及由于云部署带来的数据保护问题。我们提出了基于系统的注意力外壳蜜罐(SBASH)框架,通过使用轻量级本地LLM来管理数据保护问题。我们研究了使用检索增强生成(RAG)支持的LLM和非RAG LLM处理Linux shell命令的情况,并使用多种不同指标(如响应时间差异、人类测试者的真实感、以及通过Levenshtein距离、SBert和BertScore计算的与真实系统的相似度)对其进行评估。我们表明,RAG提高了未调优模型的准确性,而通过系统提示(指示LLM像Linux系统一样响应)调优的模型在无RAG情况下达到了与未调优模型有RAG时相似的准确性,同时延迟略低。

英文摘要

Honeypots are decoy systems used for gathering valuable threat intelligence or diverting attackers away from production systems. Maximising attacker engagement is essential to their utility. However, research has highlighted that context-awareness, such as the ability to respond to new attack types, systems and attacker agents, is necessary to increase engagement. Large Language Models (LLMs) have been shown to be one approach to increasing context-awareness, but they suffer from several challenges, including the accuracy and timeliness of responses, high operational costs, and data-protection issues due to cloud deployment. We propose the System-Based Attention Shell Honeypot (SBASH) framework, which manages data-protection issues through the use of lightweight local LLMs. We investigate the use of Retrieval Augmented Generation (RAG) supported LLMs and non-RAG LLMs for Linux shell commands and evaluate them using several different metrics such as response time differences, realism from human testers, and similarity to a real system calculated with Levenshtein distance, SBert, and BertScore. We show that RAG improves accuracy for untuned models, while models tuned via a system prompt that tells the LLM to respond like a Linux system achieve, without RAG, accuracy similar to that of untuned models with RAG, at slightly lower latency.
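One of the similarity metrics named above, Levenshtein distance, can be sketched as follows. The normalization into a 0-to-1 similarity score and the function names are assumptions for illustration; the abstract does not say how SBASH normalizes the distance.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings,
    computed row by row so memory stays proportional to the shorter string."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(real_output, honeypot_output):
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(real_output), len(honeypot_output))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(real_output, honeypot_output) / longest


real = "bash: foo: command not found"
fake = "bash: foo: command not found."
score = similarity(real, fake)
```

A character-level score like this rewards byte-for-byte fidelity to the real shell, which is presumably why the paper pairs it with embedding-based metrics that tolerate harmless wording differences.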

2603.23841 2026-06-04 cs.CL cs.AI 版本更新

PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

PoliticsBench: 通过多轮角色扮演基准测试大型语言模型中的政治价值观

Rohan Khetan, Ashna Khetan

发表机构 * Northville High School, Northville, USA(诺斯维尔高中,美国诺斯维尔) Department of Computer Science, Stanford University, Stanford, USA(斯坦福大学计算机科学系)

AI总结 提出PoliticsBench,一个多阶段角色扮演基准,通过20个演化场景评估LLM的细粒度价值表达,发现场景提示比直接提问能引发更广泛和强烈的价值表达。

Comments 7 pages, 5 tables, 5 figures, 4 appendix pages. Accepted to the ICML 2026 Trustworthy AI for Good Workshop

详情
AI中文摘要

虽然大型语言模型(LLMs)越来越多地被用作主要信息来源,但其潜在的政治偏见可能影响其客观性。现有的LLM社会偏见基准主要评估人口统计刻板印象,而当衡量政治偏见时,是在粗略的层面上进行的,忽视了塑造社会政治推理的价值观。我们引入了PoliticsBench,一个用于评估LLM中细粒度价值表达的多阶段角色扮演基准。在20个演化场景中,模型在竞争压力下阐述权衡、表明立场并做出决策。在八个主流LLM上,我们表明,与直接的政治问题相比,基于场景的提示引发了更广泛和更强烈的价值表达,峰值交互阶段使强烈激活的价值维度数量增加了约0.75(共10个维度),相对于基线提示具有统计显著性(p < 0.05)。此外,在交互过程中,立场的承诺度增加,从初始阶段到决策阶段,在[0,5]量表上上升了约1.4分。虽然在后期交互阶段,响应对于场景释义的鲁棒性降低,但评判者间的一致性保持相对稳定。我们的结果表明,评估LLM的政治行为需要超越静态提示,转向更长的交互设置,以捕捉价值观如何在上下文中应用。

英文摘要

While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate demographic stereotypes, and when political bias is measured, it is measured only at a coarse level, overlooking the values that shape sociopolitical reasoning. We introduce PoliticsBench, a multi-stage roleplay benchmark for evaluating fine-grained value expression in LLMs. Across twenty evolving scenarios, models articulate tradeoffs, take positions, and make decisions under competing pressures. Across eight prominent LLMs, we show that scenario-based prompting elicits broader and more strongly expressed value profiles than direct political questions, with peak interaction stages increasing the number of strongly activated value dimensions by approximately $0.75$ (out of 10 total dimensions), a statistically significant increase relative to baseline prompting ($p < 0.05$). In addition, commitment to a stance increases over the course of interaction, rising by approximately $1.4$ points on a $[0,5]$ scale from initial to decision stages. While responses become less robust to scenario paraphrasing in later interaction stages, inter-judge agreement remains relatively stable. Our results suggest that evaluating LLM political behavior requires moving beyond static prompts toward longer interactive settings that capture how values are applied in context.

2603.20884 2026-06-04 cs.CL 版本更新

MemoNoveltyAgent: A Historical Research Memory-Aware Agent Workflow for Paper Novelty Assessment

MemoNoveltyAgent:一种用于论文新颖性评估的历史研究记忆感知智能体工作流

Jiajun Hou, Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Xiaopeng Ke, Derek F. Wong, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China(计算与智能研究院,哈尔滨工业大学深圳校区,中国) Xiaohongshu Inc.(小红书公司) Zhongguancun Academy, Beijing, China(中关村学院,北京,中国) NLP2CT Lab, Department of Computer and Information Science, University of Macau, China(NLP2CT实验室,计算机与信息科学系,澳门大学,中国)

AI总结 提出MemoNoveltyAgent多智能体系统,通过分层抽象记忆、细粒度新颖点分解和自验证机制,生成忠实的新颖性报告,在评估中比GPT-5 DeepResearch提升13.69%。

详情
AI中文摘要

为减轻论文筛选的沉重负担,研究人员越来越依赖现有的AI智能体(如AI审稿人或DeepResearch)进行论文评估和新颖性评估。然而,由于缺乏处理学术文献的专门机制,它们的分析往往产生表面结果,质量明显不足。为弥补这一差距,我们引入了MemoNoveltyAgent,一个旨在生成全面且忠实的新颖性报告的多智能体系统。除了通过RAG检索具体的先前论文证据外,我们的系统还包含一个从大规模学术语料库构建的高层抽象记忆。该记忆将研究组织成层次树,以提炼领域特定的演化轨迹,从而提供更广泛的历史背景。此外,我们将论文分解为离散的新颖点,以便进行细粒度分析和检索,同时采用自验证机制提高报告的忠实度。最后,为解决此类开放生成任务的评估挑战,我们提出了一种RAG增强的检查表评估方法,能够实现可靠且基于证据的评估。大量实验表明,MemoNoveltyAgent比GPT-5 DeepResearch提升了13.69%。代码和演示可在https://github.com/SStan1/MemoNoveltyAgent获取。

英文摘要

To alleviate the heavy burden of paper screening, researchers increasingly rely on existing AI agents, such as AI reviewers or DeepResearch, for paper evaluation and novelty assessment. However, lacking specialized mechanisms for processing scholarly literature, their analyses often produce superficial results with noticeable deficiencies in quality. To bridge this gap, we introduce MemoNoveltyAgent, a multi-agent system designed to generate comprehensive and faithful novelty reports. Beyond retrieving concrete prior-paper evidence via RAG, our system incorporates a high-level abstract memory constructed from large-scale scholarly corpora. This memory organizes research into hierarchical trees to distill field-specific evolutionary trajectories, thereby providing a broader historical context. Furthermore, we decompose papers into discrete novelty points for fine-grained analysis and retrieval, while employing a self-validation mechanism to improve report faithfulness. Finally, to address the evaluation challenges of such open-ended generation tasks, we propose a RAG-augmented checklist evaluation method that enables reliable and evidence-grounded assessments. Extensive experiments demonstrate that MemoNoveltyAgent outperforms GPT-5 DeepResearch by 13.69%. Code and demo are available at https://github.com/SStan1/MemoNoveltyAgent

2603.16867 2026-06-04 cs.LG cs.CL 版本更新

Efficient Reasoning on the Edge

边缘设备上的高效推理

Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi

发表机构 * Qualcomm AI Research(高通人工智能研究)

AI总结 提出结合LoRA适配器、监督微调、强化学习预算控制、并行测试时缩放、动态适配器切换和KV缓存共享的方法,在资源受限的边缘设备上实现高效准确的推理。

Comments Project page: https://qualcomm-ai-research.github.io/llm-reasoning-on-edge/

详情
AI中文摘要

具有思维链推理的大型语言模型在复杂问题解决任务中达到了最先进的性能,但其冗长的推理轨迹和大的上下文需求使其不适用于边缘部署。这些挑战包括高令牌生成成本、大的KV缓存占用,以及在将推理能力蒸馏到用于移动设备的较小模型时的低效性。现有方法通常依赖于将较大模型的推理轨迹蒸馏到较小模型中,这些轨迹冗长且风格冗余,不适合设备端推理。在这项工作中,我们提出了一种轻量级方法,通过使用LoRA适配器结合监督微调,在小型LLM中实现推理。我们进一步通过在这些适配器上进行强化学习引入预算控制,显著减少响应长度,同时保持最小的精度损失。为了解决内存受限的解码问题,我们利用并行测试时缩放,在轻微延迟增加的情况下提高精度。最后,我们提出了一种动态适配器切换机制,仅在需要时激活推理,以及在提示编码期间的KV缓存共享策略,减少设备端推理的首令牌时间。在Qwen2.5-7B上的实验表明,我们的方法在严格的资源约束下实现了高效、准确的推理,使LLM推理在移动场景中变得实用。展示我们的解决方案在移动设备上运行的视频可在我们的项目页面上找到。

英文摘要

Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, but these traces are verbose and stylistically redundant, which is undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at a minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.

2405.15454 2026-06-04 cs.CL cs.SY eess.SY 版本更新

LiSeCo: Linear Semantic Control for Language Generation

LiSeCo:语言生成的线性语义控制

Emily Cheng, Carmen Amo Alonso

发表机构 * Universitat Pompeu Fabra(庞培法布拉大学) Stanford University(斯坦福大学)

AI总结 提出一种轻量级、无梯度的线性语义控制方法LiSeCo,通过控制理论在线干预嵌入空间中的激活值,将生成轨迹引导至预定义的安全语义区域,实现高效且保证性能的文本生成控制。

Comments TMLR 2026 camera ready; earlier version in NeurIPS MINT Workshop 2024

详情
AI中文摘要

大型语言模型(LLMs)在关键应用中的普及凸显了对受控语言生成方法的需求,这些方法既需计算高效又需具备性能保证。为满足这一需求,我们采用概念语义的常见模型,即概念在线性表示的LLM潜在空间中。具体而言,我们认为自然语言生成在此连续语义空间中沿轨迹进行,由语言模型的隐藏激活实现。这一观点允许在潜在空间中对文本生成进行控制理论处理,我们提出线性语义控制(LiSeCo),一种轻量级、无梯度的干预方法,动态地将轨迹从对应于不良含义的区域中引导开。特别地,我们提出以在线方式直接干预正在生成的令牌在嵌入空间中的激活。关键的是,LiSeCo并非简单地将激活引导至理想区域,而是依赖控制理论中的经典技术,以上下文相关的方式精确控制激活,并保证它们被带入嵌入空间中预定义的、对应于允许语义的特定区域。该干预根据最优控制器公式以闭式形式计算,对生成时间影响极小。这种对嵌入空间中激活的控制允许对生成序列的属性进行细粒度引导。我们证明了该方法在不同任务(毒性、情感和语言(英语/西班牙语)引导)上的有效性,同时保持文本质量。

英文摘要

The prevalence of Large Language Models (LLMs) in critical applications highlights the need for controlled language generation methods that are both computationally efficient and backed by performance guarantees. To address this need, we use a common model of concept semantics as linearly represented in an LLM's latent space. In particular, we take the view that natural language generation traces a trajectory in this continuous semantic space, realized by the language model's hidden activations. This view permits a control-theoretic treatment of text generation in latent space, in which we propose Linear Semantic Control (LiSeCo), a lightweight, gradient-free intervention that dynamically steers trajectories away from regions corresponding to undesired meanings. In particular, we propose to intervene directly, in an online fashion, on the activations of the token that is being generated in embedding space. Crucially, LiSeCo does not simply steer activations towards a desirable region. Instead, it relies on classical techniques from control theory to precisely control activations in a context-dependent way, and guarantees that they are brought into a specific pre-defined region of embedding space that corresponds to allowed semantics. The intervention is computed in closed form according to an optimal controller formulation, minimally impacting generation time. This control of the activations in embedding space allows for fine-grained steering of attributes of the generated sequence. We demonstrate that our approach is effective on different tasks -- toxicity, sentiment, and language (English/Spanish) steering -- while maintaining text quality.
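A closed-form, context-dependent intervention of the kind described can be illustrated with the simplest case: a single linear probe defining an allowed half-space. The sketch below computes the minimum-norm shift that brings an activation into that region. It is a generic projection onto a half-space, offered as an assumption about the flavor of the method rather than LiSeCo's actual controller; the probe parameters `w`, `b`, the threshold `tau`, and the function name are all hypothetical.

```python
def steer_into_halfspace(x, w, b, tau):
    """Return the activation closest to x (in Euclidean norm) that satisfies
    w . x + b <= tau, i.e. lies in the assumed 'allowed semantics' region.

    If x already satisfies the constraint it is returned unchanged, which makes
    the intervention context-dependent: it acts only when the probe fires.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    excess = score - tau
    if excess <= 0:
        return list(x)
    norm_sq = sum(wi * wi for wi in w)
    if norm_sq == 0:
        raise ValueError("probe direction w must be non-zero")
    # Closed form: move against w by exactly the amount needed to reach the boundary.
    scale = excess / norm_sq
    return [xi - scale * wi for xi, wi in zip(x, w)]


w, b, tau = [3.0, 4.0], 0.0, 2.0
unsafe = [1.0, 1.0]            # probe score 7.0, outside the allowed region
steered = steer_into_halfspace(unsafe, w, b, tau)
steered_score = sum(wi * xi for wi, xi in zip(w, steered)) + b
safe = [0.0, 0.0]              # probe score 0.0, already allowed
untouched = steer_into_halfspace(safe, w, b, tau)
```

Because the shift is a single dot product and a scaled vector subtraction, it needs no gradients and adds negligible time per generated token, which matches the efficiency claim in the abstract.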

2603.10044 2026-06-04 cs.SE cs.AI cs.CL cs.LG 版本更新

Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

脚手架下的安全性:评估条件如何影响测量的安全性

David Gringras

发表机构 * Harvard University(哈佛大学) MIT(麻省理工学院)

AI总结 本研究通过62,808次盲法预注册评估,测试了六种前沿模型在四种部署配置下的安全性,发现脚手架架构对安全性影响较小,而格式转换(如选择题与开放式问题)可导致5-20个百分点的测量差异,且模型-脚手架间存在显著异质性,质疑了单一综合安全性分数的实用性。

Comments 74 pages including appendices. 6 frontier models, 62,808 primary observations (~89k total). Pre-registered: OSF DOI 10.17605/OSF.IO/CJW92. Code and data: https://github.com/davidgringras/safety-under-scaffolding

详情
AI中文摘要

在基准测试中获得的安全分数不一定能预测同一模型在未经测试的智能体脚手架中的行为。我们通过四种部署配置(直接API、ReAct、多智能体批评者、map-reduce委托)运行了六种前沿模型:在四个安全基准测试(BBQ、TruthfulQA、XSTest/OR-Bench、sycophancy)上进行了N = 62,808次盲法、预注册、等价性检验评估,以及三项支持性分析。ReAct和多智能体脚手架保持在预注册的±2个百分点的等价范围内;map-reduce委托降低了测量的安全性(NNH = 14),尽管这种损失很大程度上是测量伪影:在相同项目上,选择题与开放式问题的措辞使测量的安全率变化5-20个百分点,而分解过程无声地移除了选择题选项。每个模型map-reduce损失的约40-89%归因于这种格式转换而非推理中断,一种保留选项的变体恢复了大部分损失。汇总效应也掩盖了模型与脚手架之间的显著异质性:在map-reduce下,对于相同项目,Opus损失16.8个百分点,而Llama 4增加18.8个百分点。从结构上看,脚手架架构仅解释了0.4%的结果方差(基准选择解释了45倍以上),泛化系数G = 0.000(bootstrap 95% CI [0.000, 0.752])。如此宽的区间本身足以削弱任何单一综合安全分数作为部署标准的效用。这些是“简单案例”;像诡计和CBRN提升这样的重要属性没有明显理由对格式或脚手架不敏感。代码、数据和提示已作为ScaffoldSafety发布。

英文摘要

A safety score earned on a benchmark need not predict how the same model behaves once it is wrapped in an agentic scaffold the benchmark never tested. We ran six frontier models through four deployment configurations (direct API, ReAct, multi-agent critic, map-reduce delegation): N = 62,808 blinded, pre-registered, equivalence-tested evaluations across four safety benchmarks (BBQ, TruthfulQA, XSTest/OR-Bench, sycophancy), plus three supporting analyses. ReAct and multi-agent scaffolds stay within a pre-registered +/-2 pp equivalence margin; map-reduce delegation degrades measured safety (NNH = 14), though that loss is largely a measurement artifact: on identical items, multiple-choice versus open-ended phrasing shifts the measured safety rate by 5-20 pp, and decomposition silently strips the multiple-choice options. Roughly 40-89% of the per-model map-reduce loss is this format conversion rather than reasoning disruption, and an option-preserving variant recovers most of it. Pooled effects also mask sharp model-by-scaffold heterogeneity: under map-reduce, on identical items, Opus loses 16.8 pp while Llama 4 gains 18.8 pp. Structurally, scaffold architecture explains only 0.4% of outcome variance (benchmark choice explains 45x more), and the generalizability coefficient is G = 0.000 (bootstrap 95% CI [0.000, 0.752]). An interval that wide is enough on its own to undermine the utility of any single composite safety number as a deployment criterion. These are the "easy cases"; consequential properties like scheming and CBRN uplift have no obvious reason to be less format- or scaffold-sensitive. Code, data, and prompts are released as ScaffoldSafety.

2603.05881 2026-06-04 cs.CL 版本更新

Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

回答前先置信:高效LLM不确定性估计的范式转变

Changcheng Li, Jiancan Wu, Hengheng Zhang, Zhengsu Chen, Guo An, Junxiang Qiu, Xiang Wang, Qi Tian

发表机构 * University of Science and Technology of China(中国科学技术大学) Huawei Inc.(华为公司)

AI总结 提出CoCA框架,通过GRPO强化学习联合优化置信度校准与答案准确性,实现回答前输出置信度,提升不确定性估计效率。

详情
AI中文摘要

大型语言模型(LLM)的可靠部署需要准确的不确定性估计。现有方法主要是先回答后置信,即在生成答案后才产生置信度,这衡量的是特定响应的正确性,限制了实际可用性。我们研究了一种置信优先范式,其中模型在回答之前输出其置信度,将该分数解释为模型在当前策略下正确回答问题的概率。我们提出了CoCA(联合优化的置信度和答案),这是一种GRPO强化学习框架,通过分段信用分配联合优化置信度校准和答案准确性。通过为置信度和答案段分配单独的奖励和组相对优势,CoCA实现了稳定的联合优化并避免了奖励黑客攻击。在数学、代码和事实问答基准上的实验表明,在保持答案质量的同时,校准和不确定性区分能力得到改善,从而支持更广泛的下游应用。

英文摘要

Reliable deployment of large language models (LLMs) requires accurate uncertainty estimation. Existing methods are predominantly answer-first, producing confidence only after generating an answer, which measures the correctness of a specific response and limits practical usability. We study a confidence-first paradigm, where the model outputs its confidence before answering, interpreting this score as the model's probability of answering the question correctly under its current policy. We propose CoCA (Co-optimized Confidence and Answers), a GRPO reinforcement learning framework that jointly optimizes confidence calibration and answer accuracy via segmented credit assignment. By assigning separate rewards and group-relative advantages to confidence and answer segments, CoCA enables stable joint optimization and avoids reward hacking. Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.
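The idea of separate rewards and group-relative advantages for the confidence and answer segments can be sketched as below. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO form; the specific confidence reward, a Brier-style penalty against the group's empirical accuracy, is an assumption chosen to match the stated interpretation of confidence, not a detail taken from the paper.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segmented_advantages(confidences, correct):
    """Compute one advantage per rollout for each segment.

    confidences: stated confidence in [0, 1] for each rollout of one question.
    correct:     whether each rollout's answer was right.
    Returns (confidence_advantages, answer_advantages); in training, each list
    would weight only the tokens of its own segment.
    """
    answer_rewards = [1.0 if c else 0.0 for c in correct]
    # Assumed calibration target: the policy's empirical accuracy on this question.
    target = mean(answer_rewards)
    confidence_rewards = [-(p - target) ** 2 for p in confidences]
    return group_relative(confidence_rewards), group_relative(answer_rewards)


conf_adv, ans_adv = segmented_advantages(
    confidences=[0.9, 0.5, 0.1, 0.5],
    correct=[True, False, True, False],
)
```

In this group the policy is right half the time, so rollouts stating 0.5 get a positive confidence advantage regardless of whether their own answer happened to be correct. Keeping the two signals apart is what prevents a rollout from being rewarded for confidence merely because its answer was right.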

2603.03205 2026-06-04 cs.CL 版本更新

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

学习何时行动或拒绝:为安全的多步骤工具使用守护代理推理模型

Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah

发表机构 * Microsoft Research(微软研究院)

AI总结 提出MOSAIC框架,通过显式安全推理和基于偏好的强化学习,使代理模型在工具使用中安全决策,减少有害行为并保持良性性能。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

代理语言模型在安全机制上与聊天模型根本不同:它们必须规划、调用工具并执行长期行动,其中单个失误(如访问文件或输入凭据)可能导致不可逆的伤害。现有的对齐方法主要针对静态生成和任务完成进行优化,由于顺序决策、对抗性工具反馈和过度自信的中间推理,在这些设置中失效。我们引入了MOSAIC,一个后训练框架,通过使安全决策显式且可学习,对齐代理以实现安全的多步骤工具使用。MOSAIC将推理构建为规划、检查、然后行动或拒绝的循环,将显式安全推理和拒绝作为第一类行动。为了在没有轨迹级标签的情况下进行训练,我们使用基于偏好的强化学习与成对轨迹比较,这捕获了标量奖励常常忽略的安全区别。我们在三个模型家族(Qwen2.5-7B、Qwen3-4B-Thinking和Phi-4)以及跨分布基准(涵盖有害任务、提示注入、良性工具使用和跨域隐私泄露)上评估了MOSAIC的零样本性能。MOSAIC将有害行为减少高达50%,在注入攻击上将有害任务拒绝率提高超过20%,减少隐私泄露,并保持或改善良性任务性能,展示了跨模型、领域和代理设置的鲁棒泛化能力。

英文摘要

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.

2502.03799 2026-06-04 cs.CL cs.SY eess.SY 版本更新

Enhancing Hallucination Detection through Noise Injection

通过噪声注入增强幻觉检测

Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yubing Jian, Yao Qin, Roland Memisevic

发表机构 * Qualcomm AI Research(高通人工智能研究)

AI总结 提出一种基于贝叶斯模型不确定性的无训练噪声注入方法,通过扰动模型参数或隐藏单元激活来改进采样过程,显著提升大语言模型幻觉检测性能。

Comments ICLR 2026 main conference paper

详情
AI中文摘要

大型语言模型(LLMs)容易生成看似合理但错误的响应,即幻觉。因此,有效检测幻觉对于LLMs的安全部署至关重要。最近的研究将幻觉与模型不确定性联系起来,表明可以通过测量从模型中抽取的多个样本的答案分布离散度来检测幻觉。虽然从模型定义的词元分布中抽取样本是一种自然的方式,但在这项工作中,我们认为这对于检测幻觉而言并非最优。我们表明,通过以贝叶斯方式考虑模型不确定性,可以显著改进检测。为此,我们提出了一种非常简单、无需训练的方法,该方法基于在采样过程中扰动适当子集的模型参数,或等效地扰动隐藏单元激活。我们证明,我们的方法在跨不同数据集、模型架构和不确定性度量上,显著优于标准采样的推理时幻觉检测。

英文摘要

Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from multiple samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is suboptimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple, training-free approach based on perturbing an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate that our approach significantly improves inference-time hallucination detection over standard sampling across diverse datasets, model architectures, and uncertainty metrics.
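The detection recipe (inject noise into hidden activations while sampling, then measure how dispersed the resulting answers are) can be illustrated with a toy stand-in for a model. Everything here is an assumption made for illustration: the "model" is just a dictionary of per-answer activations, the noise is uniform, and entropy over distinct answers is used as the dispersion measure; the paper's choice of layers, noise distribution, and metrics is not given in the abstract.

```python
import math
import random
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over answers."""
    counts = Counter(answers)
    n = len(answers)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


def toy_answer(hidden, noise_scale, rng):
    """Toy stand-in for one forward pass: perturb each hidden activation with
    uniform noise, then return the answer whose activation is largest."""
    noisy = {a: h + rng.uniform(-noise_scale, noise_scale) for a, h in hidden.items()}
    return max(noisy, key=noisy.get)


def dispersion_under_noise(hidden, noise_scale, num_samples=200, seed=0):
    """Sample repeatedly with noise injection and score answer dispersion;
    higher values are treated as a sign of a possible hallucination."""
    rng = random.Random(seed)
    answers = [toy_answer(hidden, noise_scale, rng) for _ in range(num_samples)]
    return answer_entropy(answers)


confident = {"Paris": 5.0, "Lyon": 1.0}   # large margin: noise cannot flip the answer
uncertain = {"Paris": 2.0, "Lyon": 1.9}   # small margin: noise flips it often
low = dispersion_under_noise(confident, noise_scale=0.5)
high = dispersion_under_noise(uncertain, noise_scale=0.5)
```

The point of the toy is that no training is involved: the same frozen activations are reused, and only the perturbation exposes how fragile the preferred answer is.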

2602.21103 2026-06-04 cs.CL cs.IR 版本更新

Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

提示级蒸馏:一种非参数化的模型微调替代方案,用于高效推理

Sanket Badhe, Deep Shah

发表机构 * Google Mountain View, California, USA(谷歌山景城,加利福尼亚州,美国)

AI总结 提出提示级蒸馏(PLD),通过从教师模型中提取推理模式并组织为结构化指令注入学生模型的系统提示,无需微调即可提升小模型推理性能,在多个基准上接近前沿水平且延迟极低。

Comments Accepted at ACL 2026 Industry Track

详情
AI中文摘要

高级推理通常需要思维链提示,虽然准确但会导致高昂的延迟和测试时推理成本。标准的替代方案——微调较小的模型——往往牺牲可解释性,同时引入显著的计算和操作开销。为了解决这些限制,我们引入了提示级蒸馏(PLD)。我们从教师模型中提取显式推理模式,并将其组织成结构化的表达性指令列表,用于学生模型的系统提示。使用Gemma-3 4B进行评估,PLD在StereoSet上将Macro F1分数从57%提升至90.0%,在Contract-NLI上从67%提升至83%,同时将LogiQA准确率提升至70%。在Mistral Small 3.1上的类似结果证明了跨架构的泛化能力,使得这些紧凑模型能够以可忽略的延迟开销达到前沿性能。这些表达性指令使决策过程透明化,允许对逻辑进行完全人工验证,使得该方法非常适合法律、金融和内容审核等受监管行业,以及高容量用例和边缘设备。

英文摘要

Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated using Gemma-3 4B, PLD improved Macro F1 scores on StereoSet (57% to 90.0%) and Contract-NLI (67% to 83%), while increasing LogiQA accuracy to 70%. Similar results on Mistral Small 3.1 demonstrate cross-architecture generalizability, enabling these compact models to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.

2511.05722 2026-06-04 cs.CL cs.AI 版本更新

OckBench: Measuring the Efficiency of LLM Reasoning

OckBench:衡量LLM推理效率

Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Massachusetts Institute of Technology(麻省理工学院) NVIDIA(英伟达)

AI总结 提出OckBench基准,联合评估推理和编码任务中的准确性与token效率,揭示当前模型token利用率低下问题。

详情
AI中文摘要

大型语言模型(LLM)如GPT-5和Gemini 3已推动自动推理和代码生成的前沿。然而,当前基准强调准确性和输出质量,忽略了关键维度:token使用的效率。在实际应用中,token效率变化很大。解决相同问题且准确率相近的模型,其token长度差异可达5.0倍,导致模型推理能力存在巨大差距。这种差异暴露了显著的冗余,凸显了对标准化基准来量化token效率差距的迫切需求。因此,我们引入OckBench,这是首个联合衡量推理和编码任务中准确性与token效率的基准。我们的评估表明,当前模型的token效率在很大程度上未得到优化,显著增加了服务成本和延迟。这些发现为社区优化潜在推理能力(即token效率)提供了具体路线图。最终,我们主张评估范式转变:token不应被无谓地倍增。我们的基准可在https://ockbench.github.io/获取。

英文摘要

Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage. Token efficiency is highly variable in practice. Models solving the same problem with similar accuracy can exhibit up to a $5.0\times$ difference in token length, leading to a massive gap in model reasoning ability. Such variance exposes significant redundancy, highlighting the critical need for a standardized benchmark to quantify the gap in token efficiency. Thus, we introduce OckBench, the first benchmark that jointly measures accuracy and token efficiency across reasoning and coding tasks. Our evaluation reveals that token efficiency remains largely unoptimized across current models, significantly inflating serving costs and latency. These findings provide a concrete roadmap for the community to optimize a latent reasoning ability, namely token efficiency. Ultimately, we argue for an evaluation paradigm shift: tokens must not be multiplied beyond necessity. Our benchmarks are available at https://ockbench.github.io/.

2602.19101 2026-06-04 cs.CL cs.AI 版本更新

Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

价值纠缠:大型语言模型中不同种类好的混淆

Seong Hah Cho, Junyi Li, Anna Leshinskaya

发表机构 * Independent(独立研究者) Department of Cognitive Sciences, UC Irvine(加州大学尔湾分校认知科学系)

AI总结 通过探测模型行为、嵌入和残差流激活,发现大型语言模型普遍存在价值纠缠,即道德、语法和经济三种价值被混淆,其中语法和经济价值过度受道德价值影响,通过选择性消融与道德相关的激活向量可修复此问题。

详情
AI中文摘要

大型语言模型(LLMs)的价值对齐要求我们经验性地测量这些模型实际习得的价值表征。人类价值表征的一个特点是能够区分不同种类的价值。我们研究LLMs是否同样区分三种不同的好:道德的、语法的和经济的。通过探测模型行为、嵌入和残差流激活,我们报告了普遍存在的价值纠缠案例:这些不同价值表征之间的混淆。具体而言,相对于人类规范,语法和经济评价被发现过度受道德价值影响。通过选择性消融与道德相关的激活向量,这种混淆得到了修复。

英文摘要

Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representation of value. One characteristic of value representation in humans is that it distinguishes among values of different kinds. We investigate whether LLMs likewise distinguish three different kinds of good: moral, grammatical, and economic. By probing model behavior, embeddings, and residual stream activations, we report pervasive cases of value entanglement: a conflation between these distinct representations of value. Specifically, both grammatical and economic valuations were found to be overly influenced by moral value, relative to human norms. This conflation was repaired by selective ablation of the activation vectors associated with morality.
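
Ablating an activation vector is commonly implemented as projecting a direction out of the hidden state. The sketch below shows that generic operation on plain Python lists; it is an assumption about the general technique, not the authors' procedure, and the `ablate_direction` helper and the toy vectors are hypothetical.

```python
import math


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`.

    Computes x - (x . v_hat) * v_hat, where v_hat is the unit vector of
    `direction`. The result is orthogonal to `direction`.
    """
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - coeff * u for a, u in zip(activation, unit)]


# Toy residual-stream vector and a hypothetical "morality" direction.
hidden = [3.0, 4.0, 1.0]
morality_direction = [0.0, 2.0, 0.0]
ablated = ablate_direction(hidden, morality_direction)
print(ablated)
```

After ablation the vector has no component along the chosen direction, while the components orthogonal to it are left unchanged.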

2506.06006 2026-06-04 cs.CV cs.AI cs.CL 版本更新

Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

视觉语言模型能预测未来状态吗?从逆动力学引导世界模型

Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

发表机构 * Institute for Language, Cognition and Computation, University of Edinburgh(语言、认知与计算研究所,爱丁堡大学) Language Technology Lab, University of Cambridge(语言技术实验室,剑桥大学) NVIDIA(NVIDIA公司) University of Groningen(格罗宁根大学)

AI总结 本文发现视觉语言模型(VLM)难以直接进行前向动力学预测(FDP),但逆动力学预测(IDP)更容易学习,并利用IDP通过弱监督学习和推理时验证两种策略引导FDP,在Aurora-Bench上取得与最先进图像编辑模型竞争的性能。

详情
AI中文摘要

统一的视觉语言模型(VLM)能否执行前向动力学预测(FDP),即根据先前的观察和(语言形式的)动作预测未来状态(图像形式)?我们发现VLM难以根据指令生成帧之间物理上合理的过渡。然而,我们识别出多模态基础中的一个关键不对称性:微调VLM学习逆动力学预测(IDP)——有效地描述帧之间的动作——比学习FDP容易得多。反过来,IDP可以通过两种主要策略引导FDP:1)来自合成数据的弱监督学习,以及2)推理时验证。首先,IDP可以为未标记的视频帧观察对标注动作,以扩大FDP的训练数据规模。其次,IDP可以为FDP的多个样本分配奖励以对其进行评分,从而在推理时有效指导搜索。我们通过Aurora-Bench上的以动作为中心的图像编辑任务,使用两个VLM家族评估了这两种策略产生的FDP。尽管仍然是通用模型,我们的最佳模型实现了与最先进的图像编辑模型竞争的性能,根据GPT4o作为评判,在Aurora-Bench的所有子集上,性能提高了7%到13%,并获得了最佳平均人类评估。

英文摘要

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), which effectively amounts to captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. Firstly, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Secondly, IDP can assign rewards to multiple samples of FDP to score them, effectively guiding search at inference time. We evaluate the FDP resulting from both strategies through the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin between 7% and 13% according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.

2602.09464 2026-06-04 cs.SE cs.AI cs.CL 版本更新

AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

AlgoVeri:面向经典算法的验证代码生成对齐基准

Haoyu Zhao, Ziran Yang, Jiawei Li, Deyuan He, Zenan Li, Chi Jin, Venugopal V. Veeravalli, Aarti Gupta, Sanjeev Arora

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 为解决跨范式验证代码生成评估缺乏统一方法的问题,提出AlgoVeri基准,在Dafny、Verus和Lean三种语言上评估77个经典算法的验证代码生成,揭示不同验证系统的能力差距。

Comments Accepted to ICML 2026, 32 pages

详情
AI中文摘要

验证代码生成指从严格规范生成形式化验证的代码。近期AI模型在验证代码生成方面展现出潜力,但缺乏跨范式的统一评估方法。现有基准仅测试单一语言/工具(如Dafny、Verus和Lean),且各自覆盖非常不同的任务,因此性能数据无法直接比较。我们通过AlgoVeri基准填补这一空白,该基准在Dafny、Verus和Lean上评估77个经典算法的验证代码生成。通过强制使用相同的功能契约,AlgoVeri揭示了验证系统中的关键能力差距。前沿模型在Dafny中取得了可观的成功(Gemini-3 Flash为40.3%),其中高层抽象和SMT自动化简化了工作流,但在Verus的系统级内存约束(24.7%)和Lean所需的显式证明构造(7.8%)下性能急剧下降。除了总体指标,我们还发现了测试时计算动态的显著差异:Gemini-3有效利用迭代修复提升性能(例如,在Dafny中通过率提高三倍),而GPT-OSS则早期饱和。最后,我们的错误分析表明,语言设计影响改进轨迹:Dafny允许模型专注于逻辑正确性,而Verus和Lean将模型困在持久的语法和语义障碍中。所有数据和评估代码可在https://github.com/haoyuzhao123/algoveri获取。

英文摘要

Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny (40.3% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus (24.7%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.

2602.09388 2026-06-04 cs.CL 版本更新

Effective vocabulary expansion of multilingual language models for extremely low-resource languages

针对极低资源语言的多语言语言模型的有效词汇扩展

Jianyu Zheng

发表机构 * School of Foreign Languages, University of Electronic Science and Technology of China(电子科技大学外国语言学院) School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院)

AI总结 提出通过筛选源语言偏置词汇并利用双语词典初始化扩展词汇表示,对多语言预训练模型进行持续预训练,在词性标注和命名实体识别任务上分别提升0.54%和2.60%。

Comments 12 pages, 5 figures, 7 tables, under review

详情
AI中文摘要

多语言预训练语言模型(mPLMs)为许多低资源语言带来了显著的好处。为了进一步扩展这些模型能够支持的语言范围,许多工作集中于对这些模型进行持续预训练。然而,很少有工作解决如何将mPLMs扩展到之前不支持的低资源语言。为了解决这个问题,我们使用目标语言语料库扩展模型的词汇表。然后,我们从模型的原始词汇表中筛选出一个子集,该子集偏向于表示源语言(例如英语),并利用双语词典初始化扩展词汇的表示。随后,我们基于这些扩展词汇的表示,使用目标语言语料库继续预训练mPLMs。实验结果表明,我们提出的方法在词性标注和命名实体识别任务上优于使用随机初始化扩展词汇进行持续预训练的基线方法,分别提高了0.54%和2.60%。此外,我们的方法在选择训练语料库时表现出高鲁棒性,并且模型在源语言上的性能在持续预训练后没有下降。

英文摘要

Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target language corpus. We then screen out a subset of the model's original vocabulary that is biased towards representing the source language (e.g., English), and utilize bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue to pre-train the mPLMs on the target language corpus, starting from these initialized representations of the expanded vocabulary. Experimental results show that our proposed method outperforms the baseline, which uses randomly initialized expanded vocabulary for continued pre-training, in POS tagging and NER tasks, achieving improvements of 0.54% and 2.60%, respectively. Furthermore, our method demonstrates high robustness in selecting the training corpora, and the models' performance on the source language does not degrade after continued pre-training.

2602.08498 2026-06-04 cs.CL 版本更新

Characterizing, Evaluating, and Optimizing Complex Reasoning

表征、评估与优化复杂推理

Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China(上海交通大学人工智能学院) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室) University of Science and Technology of China, Hefei, Anhui, China(中国科学技术大学) The Chinese University of Hong Kong, Hong Kong, China(香港中文大学) Nanjing University, Suzhou, Jiangsu, China(南京大学) Peking University, Beijing, China(北京大学)

AI总结 本文提出ME$^2$原则来表征推理质量,基于有向无环图(DAG)的成对评估方法,并构建TRM-Preference数据集训练Thinking Reward Model(TRM),以优化推理过程。

Comments Code and data are available at https://github.com/Simplified-Reasoning/TRM

详情
AI中文摘要

大型推理模型(LRMs)越来越依赖具有复杂内部结构的推理轨迹。然而,现有工作缺乏对三个基本问题的统一答案:(1)什么定义了高质量推理,(2)如何可靠地评估长且隐含结构的推理轨迹,以及(3)如何使用此类评估信号进行推理优化。为应对这些挑战,我们提供了一个统一视角。(1)我们引入ME$^2$原则,从宏观和微观层面表征推理质量,涉及效率和有效性。(2)基于该原则,我们将推理轨迹建模为有向无环图(DAG),并开发了一种基于DAG的成对评估方法,捕捉复杂推理结构。(3)基于该方法,我们构建了TRM-Preference数据集,并训练了一个Thinking Reward Model(TRM)来大规模评估推理质量。实验表明,思考奖励作为有效的优化信号。在测试时,选择更好的推理会带来更好的结果(提升高达19.3%),在RL训练期间,思考奖励增强了推理和性能(提升高达3.9%),适用于多种任务。代码和数据可在https://github.com/Simplified-Reasoning/TRM获取。

英文摘要

Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle to characterize reasoning quality at the macro and micro levels with respect to efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method, capturing complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks. Code and data are available at https://github.com/Simplified-Reasoning/TRM.

2602.04613 2026-06-04 cs.CL 版本更新

Translation Heads: Disentangling meaning from language in LLM-based machine translation

翻译头:在基于LLM的机器翻译中分离意义与语言

Théo Lasnier, Armel Zebaze, Djamé Seddah, Rachel Bawden, Benoît Sagot

发表机构 * Inria, Paris, France(Inria,巴黎,法国)

AI总结 通过分析注意力头,将机器翻译分解为目标语言识别和句子等价两个子任务,发现稀疏的注意力头分别专攻每个子任务,并利用此发现实现无需指令的翻译性能。

Comments 61 pages, 70 figures

详情
AI中文摘要

机械可解释性(MI)旨在解释神经网络如何实现其能力,但大型语言模型(LLM)的规模限制了先前MI在机器翻译(MT)中的工作,仅限于词级分析。我们从机械角度研究句子级MT,通过分析注意力头来理解LLM如何在内部编码和分配翻译功能。我们将MT分解为两个子任务:生成目标语言文本(即目标语言识别)和保留输入句子的意义(即句子等价)。在三个开源模型家族和20个翻译方向上,我们发现不同且稀疏的注意力头集合专门负责每个子任务。基于这一发现,我们构建了子任务特定的转向向量,并表明仅修改1%的相关头即可实现与基于指令的提示相当的无需指令的翻译性能,而消融这些头则会选择性地破坏其对应的翻译功能。

英文摘要

Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to word-level analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e. target language identification) and preserving the input sentence's meaning (i.e. sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.

2510.13272 2026-06-04 cs.CL 版本更新

Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation

超越正确性:在检索增强生成中奖励忠实推理

Zhichao Xu, Zongyu Wu, Yun Zhou, Aosong Feng, Kang Zhou, Sangmin Woo, Kiran Ramnath, Yijun Tian, Xuan Qi, Weikang Qiu, Lin Lee Cheong, Haibo Ding

发表机构 * AWS AI Fundamental Research(AWS人工智能基础研究) The Pennsylvania State University(宾夕法尼亚州立大学) Yale University(耶鲁大学)

AI总结 本文提出VERITAS框架,通过细粒度轮次级忠实性奖励强化学习,提升检索增强生成中推理步骤的忠实性,同时改善任务性能。

Comments TMLR Camera Ready Update

详情
AI中文摘要

受强化学习在数学和代码等领域的大语言模型训练中取得成功的启发,近期工作开始训练LLMs动态规划、查询并使用搜索引擎作为工具进行推理——这种范式日益被称为智能体搜索。尽管这些方法在流行的短问答基准上取得了性能提升,但许多方法优先考虑最终答案的正确性,而忽略了中间推理步骤的质量,这可能导致思维链不忠实。本文首先引入了一个全面的智能体搜索评估框架,涵盖三种不同的忠实性指标:思考-搜索忠实性、信息-思考忠实性和思考-答案忠实性。我们的评估表明,通过基于回合级结果奖励的可验证奖励强化学习训练的典型智能体搜索系统(包括Search-R1和ReSearch)在这些忠实性维度上有显著的改进空间。为了促进智能体搜索中的忠实推理,我们引入了VERITAS(通过智能体搜索中的中间可追溯性验证蕴含推理),这是一个新颖的框架,将细粒度的轮次级忠实性奖励整合到强化学习过程中。我们的实验表明,使用VERITAS训练的模型不仅显著提高了推理忠实性,而且与基于回合级结果奖励训练的基线相比,还实现了更好的任务性能。

英文摘要

Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains like math and code, recent work has begun training LLMs to dynamically plan, query, and reason with search engines as tools -- a paradigm increasingly referred to as agentic search. Although these methods achieve performance improvements across popular short-form QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive evaluation framework for agentic search, covering three distinct faithfulness metrics: Think-Search faithfulness, Information-Think faithfulness, and Think-Answer faithfulness. Our evaluations reveal that canonical agentic search systems trained through Reinforcement Learning from Verifiable Reward (RLVR) using episode-level outcome-based reward -- including Search-R1 and ReSearch -- have significant room for improvement on these faithfulness dimensions. To foster faithful reasoning in agentic search, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained turn-level faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness, but also achieve better task performance compared to baselines trained against episode-level outcome-based reward.

2602.03542 2026-06-04 cs.CL cs.LG 版本更新

Can Large Language Models Generalize Procedures Across Representations?

大型语言模型能否跨表示泛化过程?

Fangru Lin, Valentin Hofmann, Xingchen Wan, Weixing Wang, Zifeng Ding, Anthony G. Cohn, Janet B. Pierrehumbert

发表机构 * Stanford University(斯坦福大学)

AI总结 研究大型语言模型在代码、图与自然语言等不同表示间泛化过程的能力,提出两阶段强化学习课程来弥合差距。

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在符号表示(如代码和图)上进行了广泛的训练和测试,然而现实世界的用户任务通常用自然语言指定。LLMs 能在多大程度上跨这些表示进行泛化?在这里,我们通过研究涉及以代码、图和自然语言表示的过程(例如,规划中的调度步骤)的同构任务来探讨这个问题。我们发现,仅在图或代码数据上使用流行的后训练方法训练 LLMs 并不能可靠地泛化到相应的自然语言任务,而仅用自然语言训练可能导致效率低下的性能提升。为了解决这一差距,我们提出了一种两阶段强化学习课程,首先在符号数据上训练,然后在自然语言数据上训练。该课程显著提高了跨模型家族和任务的模型性能。值得注意的是,通过我们的方法训练的 1.5B Qwen 模型在自然规划中几乎可以匹配零样本 GPT-4o。最后,我们的分析表明,成功的跨表示泛化可以解释为一种生成性类比的形式,而我们的课程有效地鼓励了这种类比。本文使用的数据集和代码可在 https://github.com/fangru-lin/procedure_generalization_llm 获取。

英文摘要

Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage reinforcement learning curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained by our method can closely match zero-shot GPT-4o in naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages. The dataset and code used in this paper can be found at https://github.com/fangru-lin/procedure_generalization_llm.

2601.09719 2026-06-04 cs.CL cs.AI 版本更新

Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

有界双曲正切:大型语言模型中预层归一化的稳定高效替代方案

Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song

发表机构 * Yonsei University(延世大学) Upstage AI

AI总结 提出BHyT,通过有界双曲正切和数据驱动的输入约束替代Pre-LN,在保持稳定性的同时提升训练和推理效率。

Comments Accepted to ICML 2026

详情
AI中文摘要

预层归一化(Pre-LN)是大型语言模型(LLM)的事实标准,对于稳定预训练和有效迁移学习至关重要。然而,Pre-LN会带来重复的统计计算开销,并且仍然容易受到深度诅咒的影响,即随着层数增加,隐藏状态幅度和方差增大,破坏训练稳定性。面向效率的无归一化方法(如Dynamic Tanh (DyT))提高了吞吐量,但在深度下仍然脆弱。为了同时解决稳定性和效率问题,我们提出了有界双曲正切(BHyT),作为Pre-LN的直接替代方案。BHyT将tanh非线性与显式的、数据驱动的输入边界相结合,使激活值保持在非饱和范围内。它防止了激活幅度和方差随深度增长,并提供了理论稳定性保证。在效率方面,BHyT每个块仅计算一次精确统计量,并用轻量级方差近似替代第二次归一化。实验表明,BHyT在预训练期间表现出更好的稳定性和效率,与RMSNorm相比,平均训练速度提升1.6%,平均token生成吞吐量提升1.77%,同时在语言理解和推理基准上保持强大的预训练-only和SFT后性能。代码见:https://github.com/MLAI-Yonsei/BHyT

英文摘要

Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN incurs repeated statistical-computation overhead and remains vulnerable to the curse of depth, where hidden-state magnitudes and variances grow as the number of layers increases, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve throughput but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT combines a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and provides a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 1.6% faster training and an average of 1.77% higher token generation throughput compared to RMSNorm, while maintaining strong pretraining-only and post-SFT performance across language understanding and reasoning benchmarks. Code is available at: https://github.com/MLAI-Yonsei/BHyT

2602.01672 2026-06-04 cs.CL 版本更新

Adaptive Information Control for Search-Augmented LLM Reasoning

面向搜索增强型大语言模型推理的自适应信息控制

Siheng Xiong, Oguzhan Gungordu, James C. Kerce, Faramarz Fekri

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出基于信息效用的自适应控制框架DeepControl,通过控制检索的广度与分辨率,提升搜索增强推理的性能与训练稳定性。

详情
AI中文摘要

搜索增强型推理代理将多步推理与外部检索交错进行,但不受控制的检索可能引入冗余证据、使上下文饱和,并破坏强化学习(RL)的稳定性。现有的基于结果的RL方法仅提供稀疏的终端奖励,对中间信息获取决策的指导有限。我们提出DeepControl,一种基于信息效用的自适应信息控制框架,其中信息效用是检索证据边际价值的状态依赖估计。该框架沿两个维度调节信息获取:广度(即是否应继续检索)和分辨率(即应暴露多少检索细节)。它通过检索继续引导、层次化粒度控制以及退火控制强制方案来实现这些控制。这使得策略能够在训练期间内化有效的获取行为,并在测试时无需外部控制即可运行。在七个基准测试中,DeepControl始终优于没有显式信息控制的强RL和检索基线;与Search-R1相比,在Qwen2.5-7B和Qwen2.5-3B上分别平均提高了9.4和8.6个点。额外分析显示搜索有效性、训练稳定性和证据利用率均有所提升。

英文摘要

Search-augmented reasoning agents interleave multi-step reasoning with external retrieval, but uncontrolled retrieval can introduce redundant evidence, saturate the context, and destabilize reinforcement learning (RL). Existing outcome-based RL methods provide only sparse terminal rewards, offering limited guidance for intermediate information-acquisition decisions. We propose DeepControl, an adaptive information-control framework based on information utility, a state-dependent estimate of the marginal value of retrieved evidence. The framework regulates information acquisition along two axes: extent, i.e., whether retrieval should continue, and resolution, i.e., how much retrieved detail should be exposed. It implements these controls through retrieval-continuation guidance, hierarchical granularity control, and an annealed control-forcing scheme. This enables the policy to internalize effective acquisition behavior during training and operate without external control at test time. Across seven benchmarks, DeepControl consistently outperforms strong RL and retrieval baselines without explicit information control; compared with Search-R1, it improves average performance by +9.4 and +8.6 points on Qwen2.5-7B and Qwen2.5-3B, respectively. Additional analyses show improved search effectiveness, training stability, and evidence utilization.

2601.22396 2026-06-04 cs.CL cs.AI cs.CY cs.HC physics.soc-ph 版本更新

Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks

大型语言模型中的文化基础人物角色:与社会心理价值框架的表征与对齐

Candida M. Greco, Lucio La Cava, Andrea Tagarelli

发表机构 * DIMES, University of Calabria, Italy(意大利卡拉布里亚大学DIMES研究所)

AI总结 本研究通过世界价值观调查、英格尔哈特-韦尔策尔文化地图和道德基础理论,评估大型语言模型生成的文化基础人物角色是否准确反映不同文化条件下的世界和道德价值体系,并分析其跨文化结构和道德变异。

Comments Under Review

详情
AI中文摘要

尽管大型语言模型(LLMs)在模拟人类行为方面的实用性日益增强,但这些合成人物角色在不同文化条件下是否准确反映世界和道德价值体系仍不确定。本文研究了合成、文化基础人物角色与既定框架(特别是世界价值观调查(WVS)、英格尔哈特-韦尔策尔文化地图和道德基础理论)的对齐情况。我们基于一组可解释的WVS衍生变量概念化并生成LLM人物角色,并通过三个互补视角检查生成的人物角色:在英格尔哈特-韦尔策尔地图上的定位,揭示其反映跨文化条件稳定差异的解释;与世界价值观调查在人口统计层面的一致性,其中响应分布大致追踪人类群体模式;以及源自道德基础问卷的道德轮廓,我们通过文化-道德映射分析道德响应如何在不同文化配置中变化。我们的文化基础人物角色生成和分析方法能够评估跨文化结构和道德变异。

英文摘要

Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas accurately reflect world and moral value systems across different cultural conditionings remains uncertain. This paper investigates the alignment of synthetic, culturally-grounded personas with established frameworks, specifically the World Values Survey (WVS), the Inglehart-Welzel Cultural Map, and Moral Foundations Theory. We conceptualize and produce LLM-generated personas based on a set of interpretable WVS-derived variables, and we examine the generated personas through three complementary lenses: positioning on the Inglehart-Welzel map, which unveils their interpretation reflecting stable differences across cultural conditionings; demographic-level consistency with the World Values Survey, where response distributions broadly track human group patterns; and moral profiles derived from a Moral Foundations questionnaire, which we analyze through a culture-to-morality mapping to characterize how moral responses vary across different cultural configurations. Our approach of culturally-grounded persona generation and analysis enables evaluation of cross-cultural structure and moral variation.

2601.19921 2026-06-04 cs.CL cs.AI 版本更新

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

揭秘多智能体辩论:置信度与多样性的作用

Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, Andreas Vlachos

发表机构 * University of Cambridge(剑桥大学) University of Sheffield(谢菲尔德大学)

AI总结 针对多智能体辩论(MAD)在提升大语言模型性能时效果不佳的问题,提出多样性感知初始化和置信度调节辩论协议两种轻量级干预方法,显著提升辩论有效性。

详情
AI中文摘要

多智能体辩论(MAD)被广泛用于通过测试时缩放提升大语言模型(LLM)性能,然而近期研究表明,尽管计算成本更高,普通MAD往往不如简单的多数投票。研究表明,在同质化智能体和统一信念更新下,辩论保持了预期的正确性,因此无法可靠地改善结果。借鉴人类审议和集体决策的研究发现,我们识别出普通MAD缺失的两个关键机制:(i)初始观点的多样性,以及(ii)明确且校准的置信度沟通。我们提出两种轻量级干预方法。首先,一种多样性感知初始化,选择更多样化的候选答案池,增加辩论开始时存在正确假设的可能性。其次,一种置信度调节的辩论协议,其中智能体表达校准后的置信度,并根据他人的置信度调节其更新。我们从理论上证明,多样性感知初始化在不改变底层更新动态的情况下提高了MAD成功的先验概率,而置信度调节更新使辩论能够系统地漂移到正确假设。在实验上,在六个面向推理的QA基准测试中,我们的方法始终优于普通MAD和多数投票。我们的结果将人类审议与基于LLM的辩论联系起来,并表明简单、有原则的修改可以显著增强辩论效果。

英文摘要

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
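
The contrast between majority vote and a confidence-aware aggregation can be sketched in a few lines. This is a simplified illustration, not the paper's debate protocol (which has agents update their answers over rounds): the `confidence_weighted_vote` helper, the assumption that confidences are calibrated values in [0, 1], and the toy votes are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most frequent answer (ties broken by first occurrence)."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(votes):
    """Return the answer with the largest summed confidence.

    `votes` is a list of (answer, confidence) pairs, where confidence is
    assumed to be a calibrated probability in [0, 1].
    """
    scores = defaultdict(float)
    for answer, confidence in votes:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        scores[answer] += confidence
    return max(scores, key=scores.get)


# Toy round: two unsure agents agree on "A", one confident agent says "B".
votes = [("A", 0.4), ("A", 0.4), ("B", 0.95)]
by_count = majority_vote([answer for answer, _ in votes])
by_confidence = confidence_weighted_vote(votes)
print(by_count, by_confidence)
```

In the toy round the two rules disagree, which is the situation where communicating confidence can change the outcome; whether it changes it for the better depends on how well calibrated the agents are.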

2506.10912 2026-06-04 cs.AI cs.CL 版本更新

Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Breaking Bad Molecules: MLLMs 是否准备好进行结构级分子解毒?

Fei Lin, Ziyang Gong, Cong Wang, Tengchao Zhang, Yonglin Tian, Yining Jiang, Ji Dai, Chao Guo, Xiaotong Yu, Xue Yang, Gen Luo, Fei-Yue Wang

发表机构 * Department of Engineering Science, Macau University of Science and Technology, Macau, China(澳门科学技术大学工程科学系) School of Computer Science, Shanghai Jiao Tong University, Shanghai, China(上海交通大学计算机科学学院) Institute of Automation, Chinese Academy of Sciences, Beijing, China(中国科学院自动化研究所) School of Pharmacy, Macau University of Science and Technology, Macau, China(澳门科学技术大学药学院) Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, China(宁波大学电气与计算机科学学院) State Key Laboratory of Biopharmaceutical Preparation and Delivery, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, China(中国科学院生物制药制备与递送国家重点实验室) School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China(上海交通大学自动化与智能感知学院) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室)

AI总结 本文提出 ToxiMol 基准任务,利用多模态大语言模型进行分子毒性修复,并构建数据集、提示流程和自动评估框架 ToxiEval,实验表明当前模型虽面临挑战但展现出毒性理解与结构编辑的潜力。

详情
AI中文摘要

毒性仍然是早期药物开发失败的主要原因。尽管分子设计和性质预测取得了进展,但分子毒性修复任务——生成结构有效且毒性降低的分子替代物——尚未被系统定义或基准化。为填补这一空白,我们引入了 ToxiMol,这是首个针对通用多模态大语言模型(MLLMs)的分子毒性修复基准任务。我们构建了一个标准化数据集,涵盖 11 个主要任务和 660 个代表性有毒分子,覆盖多种机制和粒度。我们设计了一个具有机制感知和任务自适应能力的提示注释流程,并基于专家毒理学知识。同时,我们提出了一个自动评估框架 ToxiEval,将毒性终点预测、合成可及性、类药性和结构相似性集成到高通量评估链中,用于修复成功评估。我们系统评估了 43 个主流通用 MLLMs,并进行了多项消融研究,以分析关键问题,包括评估指标、候选多样性和失败归因。实验结果表明,尽管当前 MLLMs 在此任务上仍面临重大挑战,但它们开始展现出在毒性理解、语义约束遵循和结构感知编辑方面的有前景的能力。

英文摘要

Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair, generating structurally valid molecular alternatives with reduced toxicity, has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 660 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess 43 mainstream general-purpose MLLMs and conduct multiple ablation studies to analyze key issues, including evaluation metrics, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware editing.

2601.18777 2026-06-04 cs.LG cs.AI cs.CL cs.IR stat.AP 版本更新

PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

PRECISE: 使用预测驱动的排名估计减少LLM评估的偏差

Abhishek Divekar, Anirban Majumder

发表机构 * Primary contributor and corresponding author(主要贡献者及通讯作者)

AI总结 提出PRECISE框架,通过结合少量人工标注与LLM判断,利用预测驱动推断(PPI)方法,在低资源下可靠估计搜索、排序和RAG系统的指标,并校正LLM偏差。

Comments Accepted at AAAI 2026 - Innovative Applications of AI (IAAI-26)

详情
AI中文摘要

评估搜索、排序和RAG系统的质量传统上需要大量人工相关性标注。近年来,一些已部署的系统探索使用大型语言模型(LLM)作为自动评判者,但其固有偏差阻碍了直接用于指标估计。我们提出了一个扩展预测驱动推断(PPI)的统计框架,将最少的人工标注与LLM判断相结合,以生成需要子实例标注的指标的可靠估计。我们的方法仅需少至100个人工标注查询和10,000个未标注示例,相比传统方法显著减少了标注需求。我们为基于LLM的查询改写应用中的相关性提升推断制定了所提出的框架(PRECISE),将PPI扩展到查询-文档级别的子实例标注。通过重新制定指标集成空间,我们将计算复杂度从O(2^|C|)降低到O(2^K),其中|C|表示语料库大小(百万量级)。在多个著名检索数据集上的详细实验表明,我们的方法降低了业务关键指标Precision@K的估计方差,同时在低资源设置下有效校正了LLM偏差。

英文摘要

Evaluating the quality of search, ranking and RAG systems traditionally requires a significant number of human relevance annotations. Recently, several deployed systems have explored the use of Large Language Models (LLMs) as automated judges for this task, but their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics which require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift for an LLM-based query reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduce the computational complexity from O(2^|C|) to O(2^K), where |C| represents corpus size (on the order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric, while effectively correcting for LLM bias in low-resource settings.
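
The basic prediction-powered estimate of a mean combines the LLM judgments on a large unlabeled pool with a bias correction measured on the small human-labeled set. The sketch below shows only that textbook point estimate for a binary relevance rate; it is not the PRECISE estimator, which extends PPI to sub-instance annotations, and the `ppi_mean` helper and all data are made up for illustration.

```python
from statistics import fmean


def ppi_mean(labeled_human, labeled_llm, unlabeled_llm):
    """Prediction-powered point estimate of a mean.

    estimate = mean(LLM judgments on the unlabeled pool)
             + mean(human label - LLM judgment on the labeled set)
    The second term (the "rectifier") corrects the LLM judge's average bias.
    """
    if len(labeled_human) != len(labeled_llm):
        raise ValueError("labeled_human and labeled_llm must align one-to-one")
    if not labeled_human or not unlabeled_llm:
        raise ValueError("both the labeled and unlabeled sets must be non-empty")
    rectifier = fmean([h - f for h, f in zip(labeled_human, labeled_llm)])
    return fmean(unlabeled_llm) + rectifier


# Toy data: 1 = relevant, 0 = not relevant. The LLM judge is too generous
# on one of the four human-labeled items.
human = [1, 0, 1, 0]
llm_on_labeled = [1, 1, 1, 0]
llm_on_unlabeled = [1, 1, 0, 1, 1, 0, 1, 1]

naive = fmean(llm_on_unlabeled)
corrected = ppi_mean(human, llm_on_labeled, llm_on_unlabeled)
print(naive, corrected)
```

Using the LLM judgments alone would report the naive rate; the rectifier shifts it by the average disagreement observed on the human-labeled items.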

2601.17363 2026-06-04 cs.CL cs.AI 版本更新

Do readers prefer AI-generated Italian short stories?

读者是否更喜欢AI生成的意大利短篇小说?

Michael Farrell

发表机构 * IULM University Milan Italy(米兰IULM大学)

AI总结 通过盲测实验,比较AI(ChatGPT-4o)与著名作家Alberto Moravia的意大利短篇小说,发现AI文本平均评分略高且更受偏好,但差异不显著,且与人口统计和阅读习惯无关。

Comments 8 pages, peer-reviewed and accepted for presentation at New Trends in Translation and Interpreting Technology (NeTTIT 2026), paged-up for publication

详情
AI中文摘要

本研究调查读者是否更喜欢AI生成的意大利短篇小说,而非著名意大利作家创作的作品。在盲测设置中,20名参与者阅读并评估了三篇故事,其中两篇由ChatGPT-4o生成,一篇由Alberto Moravia创作,参与者不知晓故事来源。为探索潜在影响因素,还收集了阅读习惯和人口统计数据,包括年龄、性别、教育程度和母语。结果显示,AI编写的文本平均评分略高,且更常被偏好,尽管差异不大。文本偏好与人口统计或阅读习惯变量之间未发现统计学显著关联。这些发现挑战了读者偏好人类创作小说的假设,并引发了关于在文学语境中是否需要编辑合成文本的问题。

英文摘要

This study investigates whether readers prefer AI-generated short stories in Italian over one written by a renowned Italian author. In a blind setup, 20 participants read and evaluated three stories, two created with ChatGPT-4o and one by Alberto Moravia, without being informed of their origin. To explore potential influencing factors, reading habits and demographic data, comprising age, gender, education and first language, were also collected. The results showed that the AI-written texts received slightly higher average ratings and were more frequently preferred, although differences were modest. No statistically significant associations were found between text preference and demographic or reading-habit variables. These findings challenge assumptions about reader preference for human-authored fiction and raise questions about the necessity of synthetic-text editing in literary contexts.

2601.06196 2026-06-04 cs.LG cs.AI cs.CL 版本更新

Geometry-Aware Hallucination Detection in Large Language Models

大语言模型中的几何感知幻觉检测

Bodla Krishna Vamshi, Rohan Bhatnagar, Haizhao Yang

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 提出GA-ICL框架,利用冻结LLM的潜在表示建模局部流形和类别原型几何,选择上下文示例以检测幻觉,在FEVER和HaluEval基准上优于基线方法。

详情
AI中文摘要

大型语言模型(LLM)经常生成事实不正确或未经支持的内容,通常称为幻觉。先前的工作探索了解码策略、检索增强和监督微调用于幻觉检测,而最近的研究表明,上下文学习(ICL)可以显著影响事实可靠性。然而,现有的ICL示例选择方法通常依赖于表面相似性启发式方法,并且在任务和模型上表现出有限的鲁棒性。我们提出GA-ICL,一种几何感知的示例采样框架,用于选择上下文示例,该框架利用从冻结LLM中提取的潜在表示。通过联合建模局部流形结构和类别感知的原型几何,GA-ICL根据示例与学习原型的接近程度进行选择,而不仅仅是基于词汇或嵌入相似性。在事实验证(FEVER)和幻觉检测(HaluEval)基准上,GA-ICL在大多数评估设置中优于标准ICL选择基线,在对话和摘要任务上尤其有显著提升。该方法在温度扰动和模型变化下保持鲁棒性,表明与启发式检索策略相比具有更高的稳定性。虽然在较小模型规模下的某些问答场景中,词汇检索仍可能具有竞争力,但我们的结果表明,几何感知的原型选择为幻觉检测提供了一种可靠且训练轻量的方法,无需修改LLM参数。在Phi-14B和Qwen3-32B上的扩展评估证实,GA-ICL能有效扩展到更大模型,在包括较小模型显示边界条件限制的问答任务在内的所有比较基线上均表现优异,为改进ICL示例选择提供了原则性方向。

英文摘要

Large language models (LLMs) frequently generate factually incorrect or unsupported content, commonly referred to as hallucinations. Prior work has explored decoding strategies, retrieval augmentation, and supervised fine-tuning for hallucination detection, while recent studies show that in-context learning (ICL) can substantially influence factual reliability. However, existing ICL demonstration selection methods often rely on surface-level similarity heuristics and exhibit limited robustness across tasks and models. We propose GA-ICL, a geometry-aware demonstration sampling framework for selecting in-context demonstrations that leverages latent representations extracted from frozen LLMs. By jointly modeling local manifold structure and class-aware prototype geometry, GA-ICL selects demonstrations based on their proximity to learned prototypes rather than lexical or embedding similarity alone. Across factual verification (FEVER) and hallucination detection (HaluEval) benchmarks, GA-ICL outperforms standard ICL selection baselines in the majority of evaluated settings, with particularly strong gains on dialogue and summarization tasks. The method remains robust under temperature perturbations and model variation, indicating improved stability compared to heuristic retrieval strategies. While lexical retrieval can remain competitive in certain question-answering regimes at smaller model scales, our results demonstrate that geometry-aware prototype selection provides a reliable and training-light approach for hallucination detection without modifying LLM parameters. Extended evaluations on Phi-14B and Qwen3-32B confirm that GA-ICL scales effectively to larger models, outperforming all compared baselines including on QA tasks where smaller models show boundary-condition limitations, offering a principled direction for improved ICL demonstration selection.

2601.07408 2026-06-04 cs.CL cs.LG 版本更新

Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

基于结果锚定的优势重塑用于数学推理中的细粒度信用分配

Ziheng Li, Liu Kang, Feng Xiao, Luxi Xing, Qingyi Si, Zhuoran Li, Weikang Gong, Deqing Yang, Yanghua Xiao, Hongcheng Guo

发表机构 * Fudan University(复旦大学) XingYun lab, HUJING Digital Media & Entertainment Group(星云实验室,HUJING数字媒体与娱乐集团) University of Science and Technology Beijing(北京科技大学) Chinese Academy of Sciences(中国科学院) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出结果锚定优势重塑(OAR),通过两种策略(OAR-P和OAR-G)实现细粒度信用分配,显著提升GRPO在数学推理中的性能。

详情
AI中文摘要

组相对策略优化(GRPO)已成为一种有前途的无需评论家的强化学习范式,用于推理任务。然而,标准GRPO采用粗粒度的信用分配机制,将组级奖励均匀地传播到序列中的每个令牌,忽略了各个推理步骤的不同贡献。我们通过引入结果锚定优势重塑(OAR)来解决这一局限性,这是一种细粒度的信用分配机制,根据每个令牌对模型最终答案的影响程度重新分配优势。我们通过两种互补策略实例化OAR:(1)OAR-P,通过反事实令牌扰动估计结果敏感性,作为高保真归因信号;(2)OAR-G,使用输入梯度敏感性代理,通过单次反向传播近似影响信号。这些重要性信号与保守的双层优势重塑方案相结合,该方案抑制低影响令牌并提升关键令牌,同时保持整体优势质量。在广泛数学推理基准上的实证结果表明,虽然OAR-P设定了性能上限,但OAR-G以可忽略的计算开销实现了相当的增益,两者均显著优于强GRPO基线,推动了无需评论家的大语言模型推理的边界。

英文摘要

Group Relative Policy Optimization (GRPO) has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. However, standard GRPO employs a coarse-grained credit assignment mechanism that propagates group-level rewards uniformly to every token in a sequence, neglecting the varying contribution of individual reasoning steps. We address this limitation by introducing Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. We instantiate OAR via two complementary strategies: (1) OAR-P, which estimates outcome sensitivity through counterfactual token perturbations, serving as a high-fidelity attribution signal; (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the influence signal with a single backward pass. These importance signals are integrated with a conservative Bi-Level advantage reshaping scheme that suppresses low-impact tokens and boosts pivotal ones while preserving the overall advantage mass. Empirical results on extensive mathematical reasoning benchmarks demonstrate that while OAR-P sets the performance upper bound, OAR-G achieves comparable gains with negligible computational overhead, both significantly outperforming a strong GRPO baseline, pushing the boundaries of critic-free LLM reasoning.
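
The abstract above describes a two-level reshaping that suppresses low-impact tokens, boosts pivotal ones, and preserves the total advantage mass. The sketch below shows one way such a scheme could look; it is not the paper's formula. The thresholding rule (compare each score to the mean score), the `low`/`high` weights, and the function name are all assumptions chosen for illustration.

```python
def reshape_advantages(advantage, importance, low=0.5, high=1.5):
    """Redistribute one sequence-level advantage across tokens (hypothetical scheme).

    advantage:  scalar advantage that GRPO would assign to every token
    importance: per-token importance scores (e.g. from perturbation or gradients)
    """
    n = len(importance)
    if n == 0:
        return []
    mean_importance = sum(importance) / n
    # Two levels: tokens at or above the mean score are boosted, others suppressed.
    raw = [high if s >= mean_importance else low for s in importance]
    # Rescale so the weights average to 1, keeping sum(output) == advantage * n.
    scale = n / sum(raw)
    return [advantage * w * scale for w in raw]
```

The rescaling step is what keeps the overall advantage mass unchanged: the per-token values sum to the same total as the uniform GRPO assignment.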

2601.07036 2026-06-04 cs.CL cs.AI cs.LG 版本更新

Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers

Mid-Think: 通过词元级触发器实现无需训练的中间预算推理

Wang Yang, Debargha Ganguly, Xinpeng Li, Chaoda Song, Shouren Wang, Vikash Singh, Vipin Chaudhary, Xiaotian Han

发表机构 * Case Western Reserve University(凯斯西储大学)

AI总结 本文通过分析注意力机制和提示实验,发现推理行为主要由少量触发词元控制,并据此提出Mid-Think方法,通过组合触发词元实现中间预算推理,在准确率-长度权衡上优于基线,并能在强化学习训练中减少时间并提升性能。

详情
AI中文摘要

混合推理语言模型通常通过高级的Think/No-think指令来控制推理行为,但我们发现这种模式切换主要由一小部分触发词元驱动,而非指令本身。通过注意力分析和受控提示实验,我们表明开头的“Okay”词元会诱导推理行为,而“</think>”后的换行模式则会抑制推理。基于这一观察,我们提出了Mid-Think,一种简单的无需训练的提示格式,通过组合这些触发器实现中间预算推理,在准确率-长度权衡上始终优于固定词元和基于提示的基线。此外,在监督微调后将Mid-Think应用于强化学习训练,可将训练时间减少约15%,同时将Qwen3-8B在AIME上的最终性能从69.8%提升至72.4%,在GPQA上从58.5%提升至61.1%,证明了其在推理时控制和基于强化学习的推理训练中的有效性。

英文摘要

Hybrid reasoning language models are commonly controlled through high-level Think/No-think instructions to regulate reasoning behavior, yet we found that such mode switching is largely driven by a small set of trigger tokens rather than the instructions themselves. Through attention analysis and controlled prompting experiments, we show that a leading "Okay" token induces reasoning behavior, while the newline pattern following "</think>" suppresses it. Based on this observation, we propose Mid-Think, a simple training-free prompting format that combines these triggers to achieve intermediate-budget reasoning, consistently outperforming fixed-token and prompt-based baselines in terms of the accuracy-length trade-off. Furthermore, applying Mid-Think to RL training after SFT reduces training time by approximately 15% while improving final performance of Qwen3-8B on AIME from 69.8% to 72.4% and on GPQA from 58.5% to 61.1%, demonstrating its effectiveness for both inference-time control and RL-based reasoning training.

2512.04668 2026-06-04 cs.CR cs.AI cs.CL 版本更新

Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

拓扑结构至关重要:多智能体大语言模型中的内存泄漏测量

Jinbo Liu, Defu Cao, Yifei Wei, Tianyao Su, Yuan Liang, Yushun Dong, Yan Liu, Yue Zhao, Xiyang Hu

发表机构 * Arizona State University(亚利桑那州立大学) University of Southern California(南加州大学) Florida State University(佛罗里达州立大学)

AI总结 提出MAMA框架,通过控制图拓扑结构评估多智能体LLM系统中的内存泄漏,发现密集连接、短攻击距离和高中心性增加泄漏,并给出稀疏或层次化拓扑的设计建议。

Comments Accepted to Findings of the Association for Computational Linguistics: ACL 2026. Camera-ready version

详情
AI中文摘要

图拓扑结构是多智能体LLM系统中内存泄漏的基本决定因素,但其影响尚未得到充分量化。我们提出了MAMA(多智能体内存攻击),一个用于比较多智能体LLM系统中拓扑条件内存泄漏的受控评估框架。MAMA操作于包含标记的个人身份信息(PII)实体的合成文档,从中生成经过清理的任务指令。我们执行两阶段协议:Engram(将私人信息植入目标智能体的内存)和Resonance(多轮交互,攻击者尝试提取)。在10轮中,我们使用两阶段恢复标准测量泄漏,该标准结合了精确匹配提取和基于LLM对攻击者最终输出的推理。我们评估了六种典型拓扑(完全图、环、链、树、星、星环),涉及n∈{4,5,6}、攻击者-目标放置和基础模型。结果一致:更密集的连通性、更短的攻击者-目标距离和更高的目标中心性增加泄漏;大多数泄漏发生在早期轮次,然后趋于平稳;模型选择改变绝对比率但保留广泛的结构趋势;时空/位置属性比身份凭证或受监管标识符更容易泄漏。我们提炼出系统设计的实用指导:倾向于稀疏或层次化连通性,最大化攻击者-目标分离,并通过拓扑感知访问控制限制枢纽/捷径路径。我们的代码可在https://github.com/llll121/mama-eval获取。

英文摘要

Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a controlled evaluation framework for comparing topology-conditioned memory leakage in multi-agent LLM systems. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent's memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over 10 rounds, we measure leakage using a two-stage recovery criterion that combines exact-match extraction with LLM-based inference over the attacker's final output. We evaluate six canonical topologies (complete, circle, chain, tree, star, star-ring) across $n\in\{4,5,6\}$, attacker-target placements, and base models. Results are consistent: denser connectivity, shorter attacker-target distance, and higher target centrality increase leakage; most leakage occurs in early rounds and then plateaus; model choice shifts absolute rates but preserves broad structural trends; spatiotemporal/location attributes leak more readily than identity credentials or regulated identifiers. We distill practical guidance for system design: favor sparse or hierarchical connectivity, maximize attacker-target separation, and restrict hub/shortcut pathways via topology-aware access control. Our code is available at https://github.com/llll121/mama-eval.
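
One of the structural quantities named in the abstract above, attacker-target distance, can be computed with a plain breadth-first search. The sketch below covers only four of the six topologies (complete, chain, circle, star), because the abstract does not define the tree and star-ring layouts. Node numbering, the hub being node 0, and the function names are assumptions; this is not taken from the MAMA codebase.

```python
from collections import deque


def build_topology(kind, n):
    """Return an undirected adjacency map for n agents (illustrative layouts)."""
    if kind == "complete":
        edges = {(i, j) for i in range(n) for j in range(i + 1, n)}
    elif kind == "chain":
        edges = {(i, i + 1) for i in range(n - 1)}
    elif kind == "circle":
        edges = {(i, (i + 1) % n) for i in range(n)}
    elif kind == "star":
        edges = {(0, i) for i in range(1, n)}  # node 0 is assumed to be the hub
    else:
        raise ValueError(f"unsupported topology: {kind}")
    adjacency = {i: set() for i in range(n)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def hop_distance(adjacency, attacker, target):
    """Shortest number of hops between two agents, or None if unreachable."""
    depth = {attacker: 0}
    queue = deque([attacker])
    while queue:
        node = queue.popleft()
        if node == target:
            return depth[node]
        for neighbor in adjacency[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return None
```

For five agents, an attacker at one end of a chain is four hops from a target at the other end, while in a complete graph every pair is one hop apart, which matches the abstract's contrast between sparse and dense connectivity.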

2601.05633 2026-06-04 cs.CL 版本更新

GIFT: Games as Informal Training for Generalizable LLMs

GIFT:游戏作为通用型LLM的非正式训练

Nuoyan Lyu, Bingbing Xu, Xueyun Tian, Weihao Meng, Yige Yuan, Yang Zhang, Zhiyong Huang, Tat-Seng Chua, Huawei Shen

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, CAS(人工智能安全国家重点实验室,计算技术研究所,中国科学院) University of Chinese Academy of Sciences(中国科学院大学) University of Washington(华盛顿大学) National University of Singapore(新加坡国立大学)

AI总结 提出将游戏作为非正式训练环境,结合协调子任务训练(CST)方法,提升LLM在抽象推理、规划、创造力等通用能力上的泛化性能。

详情
AI中文摘要

最近的LLM在数学推理和代码生成等正式任务上表现出色,但在规划、创造力和社交智能等更广泛的能力上仍然存在困难。受人类学习的启发,其中正式指导和非正式经验共同塑造智力,我们将非正式学习引入LLM训练,并使用游戏作为无注释、反馈驱动的环境。为了涵盖抽象推理、规划、创造力和社交互动等多种能力,我们将正式数学任务与三种代表性游戏任务(矩阵游戏、井字棋和谁是卧底)相结合。然而,在统一的RL目标下直接混合这些任务可能会模糊特定任务的学习信号,并且没有为协调任务梯度方向提供明确的指导。为了解决这些问题,我们提出了协调子任务训练(CST),它用顺序的子任务特定更新替换单一的混合更新,分离异质RL信号,同时隐式促进子任务间的协调。在能力导向基准上的实验表明,基于游戏的非正式学习提高了超越正式训练的泛化能力,而CST通过保持领域内子任务性能并提高更广泛的通用能力,进一步增强了多任务RL。代码和数据已公开。

英文摘要

Recent LLMs excel at formal tasks such as mathematical reasoning and code generation, but still struggle with broader abilities such as planning, creativity, and social intelligence. Inspired by human learning, where formal instruction and informal experience jointly shape intelligence, we introduce informal learning into LLM training and use games as annotation-free, feedback-driven environments. To cover diverse abilities including abstract reasoning, planning, creativity, and social interaction, we combine formal math tasks with three representative game tasks: Matrix Games, TicTacToe, and Who's the Spy. However, directly mixing these tasks under a unified RL objective can blur task-specific learning signals and provides no explicit guidance for coordinating task-gradient directions. To address these issues, we propose Coordinated Subtask Training (CST), which replaces a single mixed update with sequential subtask-specific updates, separating heterogeneous RL signals while implicitly promoting coordination among subtasks. Experiments on ability-oriented benchmarks show that game-based informal learning improves generalization beyond formal training alone, while CST further enhances multi-task RL by preserving in-domain subtask performance and improving broader general abilities. Code and data are publicly available.

2511.07107 2026-06-04 cs.AI cs.CL 版本更新

MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs

MENTOR: 一种元认知驱动的自我进化框架,用于发现和缓解大语言模型中的隐式领域风险

Liang Shan, Kaicheng Shen, Wen Wu, Zhenyu Ying, Chaochao Lu, Yan Teng, Jingqi Huang, Qingshan Liu, Guangze Ye, Guoqing Wang, Jie Zhou, Liang He

发表机构 * School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院) Shanghai AI Lab, Shanghai Innovation Institute(上海人工智能实验室,上海创新研究院)

AI总结 针对大语言模型在特定领域(如教育、金融、管理)中存在的隐式安全风险,提出基于元认知自我评估和动态规则知识图谱的MENTOR框架,通过激活级引导信号有效降低攻击成功率。

详情
AI中文摘要

确保大语言模型(LLMs)的安全性对于实际部署至关重要。然而,当前的安全措施往往无法解决隐式的、特定领域的风险。为了研究这一差距,我们引入了一个包含3000个标注查询的数据集,涵盖教育、金融和管理领域。对14个主流LLMs的评估揭示了一个令人担忧的漏洞:平均越狱成功率为57.8%。为此,我们提出了MENTOR,一种元认知驱动的自我进化框架。MENTOR执行元认知自我评估,采用视角转换和后果推理等策略来揭示潜在的模型错位。由此产生的反思被提炼为动态的基于规则的知识图谱,从中检索到的规则被转换为激活级引导信号,以在推理过程中指导内部表示。实验表明,MENTOR在所有测试领域显著降低了攻击成功率,并优于现有的安全对齐方法。MENTOR的代码和数据集可在 https://anonymous.4open.science/r/MENTOR-Evo 获取。

英文摘要

Ensuring the safety of Large Language Models (LLMs) is critical for real-world deployment. However, current safety measures often fail to address implicit, domain-specific risks. To investigate this gap, we introduce a dataset of 3,000 annotated queries spanning education, finance, and management. Evaluations across 14 leading LLMs reveal a concerning vulnerability: an average jailbreak success rate of 57.8%. In response, we propose MENTOR, a metacognition-driven self-evolution framework. MENTOR performs metacognitive self-assessment, using strategies such as perspective-taking and consequential reasoning to uncover latent model misalignments. The resulting reflections are distilled into dynamic rule-based knowledge graphs, from which retrieved rules are converted into activation-level steering signals to guide internal representations during inference. Experiments demonstrate that MENTOR substantially reduces attack success rates across all tested domains and outperforms existing safety alignment methods. The code and dataset for MENTOR are available at: https://anonymous.4open.science/r/MENTOR-Evo.

2411.05894 2026-06-04 cs.CL cs.AI cs.LG 版本更新

SSSD: Simply-Scalable Speculative Decoding

SSSD: 简单可扩展的推测解码

Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Niklas Zwingenberger, Lorenz K. Müller, Lukas Cavigelli

发表机构 * Huawei(华为) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出一种无需训练的推测解码方法SSSD,结合轻量级n-gram匹配和硬件感知推测,在多种基准测试中达到与领先训练方法相当的性能,延迟降低高达2.9倍,且对语言和领域变化具有鲁棒性。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026, Main Conference)

详情
AI中文摘要

推测解码已成为加速大型语言模型推理的流行技术。然而,大多数现有方法在生产服务系统中仅带来适度的改进。实现显著加速的方法通常依赖于额外的训练草案模型或辅助模型组件,增加了部署和维护的复杂性。这种增加的复杂性降低了灵活性,特别是当服务负载转移到草案模型训练数据中未充分表示的任务、领域或语言时。我们引入了简单可扩展的推测解码(SSSD),一种无需训练的方法,结合了轻量级n-gram匹配和硬件感知推测。相对于标准自回归解码,SSSD将延迟降低高达2.9倍。它在广泛的基准测试中达到了与领先的基于训练的方法相当的性能,同时需要显著更低的采用成本——无需数据准备、训练或调优——并且在语言和领域变化以及长上下文设置中表现出优越的鲁棒性。

英文摘要

Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achieve substantial speedups typically rely on an additional trained draft model or auxiliary model components, increasing deployment and maintenance complexity. This added complexity reduces flexibility, particularly when serving workloads shift to tasks, domains, or languages that are not well represented in the draft model's training data. We introduce Simply-Scalable Speculative Decoding (SSSD), a training-free method that combines lightweight n-gram matching with hardware-aware speculation. Relative to standard autoregressive decoding, SSSD reduces latency by up to 2.9x. It achieves performance on par with leading training-based approaches across a broad range of benchmarks, while requiring substantially lower adoption effort--no data preparation, training or tuning are needed--and exhibiting superior robustness under language and domain shift, as well as in long-context settings.
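
The "lightweight n-gram matching" component mentioned in the abstract above can be illustrated with a simple lookup: find the most recent earlier occurrence of the last n tokens and propose the tokens that followed it as a draft. This is a generic sketch of n-gram drafting, not SSSD's actual data structure; the hardware-aware speculation and the target-model verification step are not modeled, and the function name and defaults are assumptions.

```python
def propose_draft(tokens, n=2, max_draft=4):
    """Propose draft tokens by matching the trailing n-gram against earlier context.

    tokens:    the sequence generated so far (prompt plus output)
    n:         length of the n-gram used as the lookup key
    max_draft: maximum number of draft tokens to propose
    """
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan from the most recent earlier position backwards; start at
    # len(tokens) - n - 1 so the suffix does not match itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + max_draft]
    return []
```

In a full speculative-decoding loop the target model would then verify the proposed tokens in a single forward pass and keep only the accepted prefix; an empty draft simply falls back to ordinary autoregressive decoding.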

2507.16199 2026-06-04 cs.CL 版本更新

LLM Abstention Can Be a Prompt Artifact, in Addition to Genuine Uncertainty

LLM 的拒绝回答可能既是真实不确定性的体现,也是提示的产物

Zipeng Ling, Shuliang Liu, Yuehao Tang, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University of Pennsylvania(宾夕法尼亚大学) Huazhong University of Science and Technology(华中科技大学) Nanjing University of Posts and Telecommunications(南京邮电大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文发现大语言模型(LLM)的拒绝回答行为不仅源于真实不确定性,还受提示结构影响,称为“拒绝膨胀”,并通过实验证明该现象由额外选项的结构性存在触发,而非真实不确定性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被训练来拒绝回答它们不确定的问题。然而,这种能力经常被误用:在实际应用中,输入提示有时包含不确定性元素,受此驱动,LLM 倾向于拒绝回答它们本可以解决的问题。我们认为 LLM 的拒绝回答不仅是真实不确定性的表达;它也是一种很大程度上受提示影响的产物。我们将这种现象命名为 *拒绝膨胀*。我们为 LLM 添加“未知”作为额外选项供其选择;实验表明,在真/假问题(TFQ)上准确率严重下降。将“未知”替换为不相关的随机词会产生相同的效果。我们认为 LLM 被训练成模仿 *拒绝回答* 的表面模式,而不是表达真实的不确定性。基于十个实验,我们支持四个主张,它们构成了一个递进的论证:(C1)*拒绝膨胀* 是由额外选项的结构性存在触发的,而不是由真实不确定性触发的;(C2)进一步,它使模型在能够回答时也否认自己能回答;(C3)在表示层面,这表现为后层输出覆盖;(C4)最后,这种偏差是稳定的,并通过指令调优出现,而非随机噪声。

英文摘要

Large Language Models (LLMs) are increasingly trained to abstain from answering questions they are unsure about. However, this ability is often misused: in real-world applications, input prompts sometimes contain uncertainty elements, and driven by this, LLMs are inclined to abstain even on problems they are capable of solving. We argue that LLM abstention is not only an expression of genuine uncertainty; it is also an artifact that can be largely influenced by prompts. We name this phenomenon *Abstention Inflation*. We add "Unknown" as an extra option for LLMs to choose from; experiments show serious accuracy drops on True/False Questions (TFQs). Replacing "Unknown" with an unrelated random word produces an identical effect. We argue that LLMs are trained to imitate the surface pattern of *abstention*, rather than to express genuine uncertainty. Based on ten experiments, we support four claims that form a progressive argument: **(C1)** *Abstention Inflation* is triggered by the structural presence of an extra option, not by genuine uncertainty; **(C2)** further, it makes the model deny it can answer even when it can; **(C3)** at the representation level, this manifests as a later-layer output override; **(C4)** finally, this bias is stable and emerges through instruction tuning, rather than stochastic noise.

2512.15552 2026-06-04 cs.CL 版本更新

Automated Lexical Coverage for Language Learning: From General to Specialized Word Lists

语言学习的自动化词汇覆盖:从通用到专业词表

Dakota Ellis, Samy Babikerali, Wanshan Chen, Bao Dinh, Uyen Le

发表机构 * University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校) School of Data Science(数据科学学院)

AI总结 本文提出一种基于目标文本自动生成专业词表的方法,相比通用词表能以更少词汇达到95%的文本覆盖率,并实现自动化、可扩展的词汇学习资源构建。

详情
AI中文摘要

通用服务词表(GSL)是语言学习者识别重要英语单词的常用资源。传统的GSL创建依赖语言专业知识和主观输入,资源消耗大。我们创建了自己的GSL,并评估其与新通用服务词表(NGSL)的性能。我们发现,针对特定文本定制的专业词表(SWL)是语言学习者的实用方法。由于SWL源自目标文本本身,它通过构造达到语言理解所需的95%覆盖率,并且与应用于同一文本的通用词表相比,使用的词汇量显著更少:在涵盖小说、学术论文和脚本的九个文本中,NGSL覆盖了每个文本的64-85%,而文本特定词表以更小的词汇量达到95%。通过仅依赖客观标准,SWL过程可以自动化、可扩展,并针对全球语言学习者的需求进行定制。

英文摘要

A General Service List (GSL) is a commonly used resource for language learners to identify important English words. Traditional GSL creation is resource-intensive, relying on linguistic expertise and subjective input. We created our own GSL and evaluated its performance against the New General Service List (NGSL). We found that creating a Specialized Word List (SWL), tailored to a specific text, is a practical method for language learners. Because an SWL is derived from the target text itself, it reaches the 95% coverage required for language comprehension by construction, and it does so with substantially fewer words than a general list applied to the same text: across nine texts spanning fiction, academic papers, and scripts, the NGSL covered 64-85% of each text, whereas a text-specific list reached 95% with far smaller vocabularies. Because the SWL process relies only on objective criteria, it can be automated, scaled, and tailored to the needs of language learners across the globe.
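
The coverage argument in the abstract above can be sketched in a few lines: count token frequencies in the target text, then add words in descending frequency until the chosen share of running tokens is covered. The tokenizer (lowercased runs of letters and apostrophes) and the function names are assumptions; the paper's own preprocessing, such as lemmatization, is not described in the abstract and is not reproduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def coverage(tokens, word_list):
    """Share of running tokens in the text that appear in word_list."""
    if not tokens:
        return 0.0
    known = set(word_list)
    return sum(1 for t in tokens if t in known) / len(tokens)


def specialized_word_list(tokens, target=0.95):
    """Smallest frequency-ranked word list reaching the target coverage."""
    counts = Counter(tokens)
    total = len(tokens)
    chosen, covered = [], 0
    for word, freq in counts.most_common():
        if covered >= target * total:
            break
        chosen.append(word)
        covered += freq
    return chosen
```

Running `coverage` with a general list and with the output of `specialized_word_list` on the same tokens gives the kind of comparison the abstract reports between the NGSL and a text-specific list.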

2512.08094 2026-06-04 cs.CL 版本更新

Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

分割、嵌入和对齐:将字幕与手语对齐的通用方法

Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman

发表机构 * VGG, Dept. of Engineering Science, University of Oxford(视觉几何组,工程科学系,牛津大学) University of Zurich(苏黎世大学) KAIST(韩国科学技术院) LIGM, CNRS, Univ Gustave Eiffel, ENPC, IP Paris(LIGM,法国国家科学研究中心,古斯塔夫·埃菲尔大学,国立路桥学校,巴黎理工学院)

AI总结 提出一种通用框架SEA,利用预训练模型分割视频帧序列为单个手势、嵌入手势片段到与文本共享的潜在空间,并通过轻量动态规划实现高效对齐,在多个手语数据集上达到最先进性能。

Comments Camera-ready version of ACL 2026 (Main)

详情
AI中文摘要

本文的目标是开发一种通用方法,用于将字幕(即带有对应时间戳的口语文本)与连续手语视频对齐。先前的方法通常依赖于针对特定语言或数据集的端到端训练,这限制了它们的通用性。相比之下,我们的方法Segment, Embed, and Align (SEA)提供了一个适用于多种语言和领域的单一框架。SEA利用两个预训练模型:第一个模型将视频帧序列分割为单个手势,第二个模型将每个手势的视频片段嵌入到与文本共享的潜在空间中。随后,通过轻量级动态规划程序进行对齐,该程序即使在长达一小时的视频中也能在CPU上高效运行,耗时不到一分钟。SEA灵活且能适应各种场景,利用从小型词汇表到大型连续语料库的资源。在四个手语数据集上的实验展示了最先进的对齐性能,突显了SEA在生成高质量并行数据以推动手语处理方面的潜力。SEA的代码和模型已公开提供。

英文摘要

The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
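
The abstract above says alignment is done with a lightweight dynamic program over segment and subtitle embeddings, without giving the recurrence. As a rough illustration only, the sketch below assigns each sign segment to a subtitle so that subtitle indices never decrease over time and the summed similarity is maximal. The similarity matrix, the monotonicity constraint as stated here, and the function name are assumptions, not SEA's published procedure.

```python
def align_monotonic(sim):
    """Assign each sign segment to a subtitle, in temporal order (generic sketch).

    sim[i][j] is the similarity between sign segment i and subtitle j.
    Returns one subtitle index per segment; indices are non-decreasing.
    """
    if not sim or not sim[0]:
        return []
    n, m = len(sim), len(sim[0])
    score = [row[:] for row in sim]          # score[i][j]: best total ending at (i, j)
    back = [[0] * m for _ in range(n)]       # back[i][j]: subtitle chosen for segment i-1
    for i in range(1, n):
        best_k = 0                           # running argmax of score[i-1][k] for k <= j
        for j in range(m):
            if score[i - 1][j] > score[i - 1][best_k]:
                best_k = j
            score[i][j] = sim[i][j] + score[i - 1][best_k]
            back[i][j] = best_k
    # Backtrack from the best final cell.
    j = max(range(m), key=lambda c: score[n - 1][c])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path
```

The table has one cell per (segment, subtitle) pair and each cell is filled in constant time, which is consistent with the abstract's claim that the alignment step is cheap enough to run on CPUs.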

2511.12784 2026-06-04 cs.CL cs.LO 版本更新

Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

通过语义相似改写评估自动形式化鲁棒性

Hayden Moore, Asfahan Shah

发表机构 * Department of Computer Science and Engineering, The Pennsylvania State University(宾夕法尼亚州立大学计算机科学与工程系)

AI总结 本文通过语义相似改写生成自然语言变体,评估大语言模型在自动形式化中生成形式证明的鲁棒性,发现自然语言表述的微小变化会显著影响模型输出。

详情
AI中文摘要

大语言模型(LLMs)最近成为自动形式化的强大工具。尽管性能令人印象深刻,这些模型仍可能难以产生扎实且可验证的形式化。文本到SQL领域的最新工作表明,即使保留高度的语义保真度,LLMs对改写的自然语言(NL)输入也可能敏感。在本文中,我们在自动形式化领域调查了这一说法。具体而言,我们通过测量语义和编译有效性,评估LLMs在生成具有语义相似改写NL语句的形式证明时的鲁棒性。使用形式基准MiniF2F和Lean 4版本的ProofNet,以及两个现代LLMs,我们生成改写的自然语言语句,并在两个模型上交叉评估这些语句。本文的结果揭示了改写输入之间的性能变异性,表明NL语句的微小变化会显著影响模型输出。

英文摘要

Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved. In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F and the Lean 4 version of ProofNet, and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.

2511.01192 2026-06-04 cs.CL 版本更新

DEER: Disentangled Mixture of Experts with Instance-Adaptive Routing for Generalizable Machine-Generated Text Detection

DEER: 面向可泛化机器生成文本检测的实例自适应路由解耦专家混合模型

Guoxin Ma, Xiaoming Liu, Hongyang Chen, Chengzhengxu Li, Zhaohan Zhang, Shengchao Liu, Yu Lan, Cong Wang, Chao Shen

发表机构 * Faculty of Electronic and Information Engineering, Xi’an Jiaotong University(电子与信息工程学院,西安交通大学) China Mobile Group(中国移动集团) Queen Mary University of London(伦敦大学玛丽女王学院) City University of Hong Kong(香港城市大学)

AI总结 提出DEER框架,通过解耦领域局部与领域不变知识为专门专家模块,并利用强化学习驱动的路由器基于实例级检测奖励选择专家路径,解决机器生成文本检测中的领域偏移问题,在域内和域外数据集上均取得优于现有方法的性能。

Comments ARR Under Review

详情
AI中文摘要

随着LLM的快速发展,检测机器生成文本已成为一项关键挑战,但现有检测器在领域偏移下性能严重下降。通过系统性的初步研究,我们将这一脆弱性归因于当前泛化策略中的两个根本缺陷:即多领域训练中领域特定知识的不完全保留,以及推理时知识检索与检测目标之间的错位。为解决这些问题,我们提出了DEER,一种解耦的专家混合框架,将领域局部和领域不变知识明确解耦到专门的专家模块中。与静态领域匹配不同,DEER采用强化学习驱动的路由器,基于实例级检测奖励选择专家路径。这种任务对齐、领域无关的机制通过优先考虑检测效用而非风格相似性,确保了对未见分布的鲁棒适应。大量实验表明,DEER始终优于最先进的检测器,在域内和域外数据集上平均F1提升1.28%和2.92%,准确率提升1.35%和2.26%,为开放世界部署提供了可靠的泛化能力。

英文摘要

Detecting machine-generated text has become a critical challenge amid the rapid advancement of LLMs, yet existing detectors degrade severely under domain shift. Through systematic pilot studies, we trace this vulnerability to two fundamental flaws in current generalization strategies, namely the incomplete preservation of domain-specific knowledge during multi-domain training and the misalignment between knowledge retrieval and the detection objective at inference. To address these gaps, we propose DEER, a Disentangled mixturE-of-ExpeRts framework that explicitly decouples domain-local and domain-invariant knowledge into specialized expert modules. Instead of static domain matching, DEER employs a reinforcement learning-driven router that selects expert pathways based on instance-level detection rewards. This task-aligned, domain-agnostic mechanism ensures robust adaptation to unseen distributions by prioritizing detection utility over stylistic resemblance. Extensive experiments demonstrate that DEER consistently outperforms state-of-the-art detectors, achieving average F1 improvements of 1.28% and 2.92%, and accuracy gains of 1.35% and 2.26% on in-domain and out-of-domain datasets, offering reliable generalization for open-world deployment.

2510.13796 2026-06-04 cs.CL cs.CV 版本更新

The Mechanistic Emergence of Symbol Grounding in Language Models

语言模型中符号接地机制的涌现

Shuyu Wu, Ziqiao Ma, Xiaoxi Luo, Yidong Huang, Josue Torres-Fonseca, Freda Shi, Joyce Chai

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 通过机械因果分析,发现符号接地在语言模型的中层计算中通过注意力头聚合环境信息实现,并在多模态对话和多种架构中复现。

详情
AI中文摘要

符号接地(Harnad, 1990)描述了词语等符号如何通过连接真实世界的感知运动经验来获得意义。最近的研究初步表明,在大规模训练且未使用显式接地目标的(视觉-)语言模型中,接地可能涌现。然而,这种涌现的具体位置及其驱动机制在很大程度上仍未被探索。为解决这一问题,我们引入了一个受控评估框架,通过机械和因果分析系统地追踪符号接地如何在内部计算中产生。我们的发现表明,接地集中在中层计算中,并通过聚合机制实现,其中注意力头聚合环境接地以支持语言形式的预测。这种现象在多模态对话和跨架构(Transformer 和状态空间模型)中复现,但在单向 LSTM 中未出现。我们的结果提供了行为和机械证据,表明符号接地可以在语言模型中涌现,并对预测和潜在控制生成的可靠性具有实际意义。

英文摘要

Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.

2505.11166 2026-06-04 cs.CL cs.AI 版本更新

SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

SoLoPO: 通过短到长偏好优化解锁大语言模型的长上下文能力

Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang

发表机构 * Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团)

AI总结 提出SoLoPO框架,将长上下文偏好优化解耦为短上下文偏好优化和短到长奖励对齐,以提升大语言模型的长上下文利用能力。

Comments Published as a conference paper at ICLR 2026

详情
AI中文摘要

尽管在扩展上下文大小的预训练方面取得了进展,但大语言模型(LLMs)在有效利用现实世界中的长上下文信息方面仍面临挑战,这主要是由于数据质量问题、训练效率低下以及缺乏设计良好的优化目标导致的长上下文对齐不足。为了解决这些限制,我们提出了一个名为Short-to-Long Preference Optimization(SoLoPO)的框架,将长上下文偏好优化(PO)解耦为两个组成部分:短上下文PO和短到长奖励对齐(SoLo-RA),并得到了理论和实验证据的支持。具体来说,短上下文PO利用从短上下文中采样的偏好对来增强模型的情境知识利用能力。同时,SoLo-RA明确鼓励在包含相同任务相关信息的短上下文和长上下文条件下,响应的奖励分数一致性。这有助于将模型处理短上下文的能力迁移到长上下文场景中。SoLoPO与主流的偏好优化算法兼容,同时显著提高了数据构建和训练过程的效率。实验结果表明,SoLoPO增强了所有这些算法在各种长上下文基准测试中的长度和领域泛化能力,同时在计算和内存效率方面取得了显著提升。

英文摘要

Despite advances in pretraining with extended context sizes, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named Short-to-Long Preference Optimization (SoLoPO), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.

2510.08647 2026-06-04 cs.CL cs.AI 版本更新

Can Reasoning Path still be Effective as Input? Bridging Post-Reasoning to Chain-of-Thought Compression

推理路径作为输入仍然有效吗?将后推理与思维链压缩连接起来

Chengzhengxu Li, Xiaoming Liu, Zhaohan Zhang, Shengchao Liu, Guoxin Ma, Yu Lan, Cong Wang, Chao Shen

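Affiliations: Faculty of Electronic and Information Engineering, Xi’an Jiaotong University; Queen Mary University of London; City University of Hong Kong

AI summary: Proposes the post-reasoning paradigm, which simplifies the reasoning task by supplying the chain of thought as contextual input, and designs the UCoT framework, which trains a lightweight compressor to produce the contextual chain of thought as soft tokens, substantially shortening outputs while preserving reasoning ability.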

Comments ACL 2026 Main Track

详情
AI中文摘要

近期发展使得大型语言模型(LLMs)能够通过长思维链(CoT)实现高级推理,但这是以牺牲推理效率为代价来换取性能。现有工作侧重于压缩推理过程中生成的CoT,但这会损害推导正确答案所需的信息。在这项工作中,我们提出后推理(post-reasoning)这一推理范式,将CoT作为上下文的一部分,以简化LLMs的推理任务。我们发现后推理显著减少了LLMs的生成长度,但其有效性取决于上下文CoT生成的效率和可靠性。因此,我们提出UCoT(Upfront CoT),一个用于CoT压缩的高效后推理框架。UCoT训练一个轻量级模型(压缩器)以软令牌形式提供上下文CoT,并训练LLM(执行器)利用此上下文CoT生成最终答案。大量实验表明,UCoT在保持执行器强大推理能力的同时,显著减少了CoT的长度。值得一提的是,当将UCoT应用于Qwen2.5-7B-Instruct模型时,在GSM8K数据集上的令牌使用量减少了50%,而性能比最先进(SOTA)方法高出3.08%。

英文摘要

Recent developments have enabled advanced reasoning in Large Language Models (LLMs) via long Chain-of-Thought (CoT), trading efficiency during inference for performance. Existing works focus on compressing generated CoT in reasoning, which impairs the information necessary for deriving the correct answer. In this work, we propose post-reasoning, a reasoning paradigm that takes CoT as a part of the context to simplify the reasoning task for LLMs. We find that post-reasoning significantly reduces the generation length of LLMs, but its effectiveness hinges on the efficiency and the reliability of the contextual CoT generation. Therefore, we propose Upfront CoT (UCoT), an efficient post-reasoning framework for CoT compression. UCoT trains a lightweight model (compressor) to provide contextual CoT in the form of soft tokens and trains the LLM (executor) to leverage this contextual CoT for producing the final answer. Extensive experiments show that UCoT maintains the powerful reasoning ability of the executor while significantly reducing the length of CoT. It is worth mentioning that when applying UCoT to the Qwen2.5-7B-Instruct model, the usage of tokens on the GSM8K dataset is reduced by 50%, while the performance is 3.08% higher than that of the state-of-the-art (SOTA) method.

2509.14760 2026-06-04 cs.CL 版本更新

Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation

推理边界:通过测试时深思增强规范对齐

Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng

Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; University of Science and Technology of China, Hefei, Anhui, China; The Chinese University of Hong Kong, Hong Kong, China; University of Illinois Urbana-Champaign, Urbana, IL, USA

AI summary: Proposes Align3, which uses test-time deliberation (TTD) with hierarchical reflection and revision to reason over specification boundaries, and builds the SpecBench benchmark; experiments show that TTD effectively enhances specification alignment.

Comments 10 pages main text, 52 pages total (including appendix). Code and resources are available at https://github.com/zzzhr97/SpecBench

详情
AI中文摘要

大型语言模型(LLMs)越来越多地应用于多样化的现实场景,每个场景都受用户或组织定制的行为和安全规范(spec)约束。这些规范分为安全规范和行为规范,随场景变化并随偏好和要求演变。我们将这一挑战形式化为规范对齐,关注LLMs从行为和安全角度遵循动态、场景特定规范的能力。为应对这一挑战,我们提出Align3,一种轻量级方法,采用具有分层反思和修正的测试时深思(TTD)来推理规范边界。我们进一步提出SpecBench,一个用于衡量规范对齐的统一基准,涵盖5个场景、103个规范和1500个提示。在15个推理模型和18个指令模型上使用多种TTD方法(包括Self-Refine、TPO和MoreThink)的实验得出三个关键发现:(i)测试时深思增强了规范对齐;(ii)Align3以最小开销推进了安全-有用性权衡前沿;(iii)SpecBench有效揭示了对齐差距。这些结果凸显了测试时深思作为推理现实世界规范边界的有效策略的潜力。我们的代码和资源可在https://github.com/zzzhr97/SpecBench获取。

英文摘要

Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries. Our code and resources are available at https://github.com/zzzhr97/SpecBench.

2510.01902 2026-06-04 cs.AI cs.CL cs.LG 版本更新

Constrained Adaptive Rejection Sampling

约束自适应拒绝采样

Paweł Parys, Sairam Vaidya, Taylor Berg-Kirkpatrick, Loris D'Antoni

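Affiliations: University of California, Berkeley

AI summary: Proposes Constrained Adaptive Rejection Sampling (CARS), which improves the sample efficiency of rejection sampling by adaptively pruning invalid prefixes without distorting the distribution, and outperforms existing methods on tasks such as program fuzzing and molecular generation.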

详情
AI中文摘要

语言模型(LMs)越来越多地应用于生成的输出必须满足严格语义或语法约束的场景。现有的约束生成方法处于一个谱系中:贪婪约束解码方法在解码过程中强制执行有效性,但扭曲了LM的分布;而拒绝采样(RS)保留了保真度,但通过丢弃无效输出浪费计算资源。在程序模糊测试等领域,样本的有效性和多样性都至关重要,这两种极端方法都有问题。我们提出约束自适应拒绝采样(CARS),一种严格提高RS样本效率且不产生分布扭曲的方法。CARS从无约束LM采样开始,通过将违反约束的续写记录在trie中并从后续抽取中减去其概率质量,自适应地排除它们。这种自适应剪枝确保已证明无效的前缀不会被重新访问,接受率单调提高,并且生成的样本精确遵循约束分布。在多个领域的实验(例如程序模糊测试和分子生成)中,CARS始终实现更高的效率(以每个有效样本的LM前向传递次数衡量),同时产生比GCD和近似LM分布的方法更强的样本多样性。

英文摘要

Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints. Existing approaches to constrained generation fall along a spectrum: greedy constrained decoding methods enforce validity during decoding but distort the LM's distribution, while rejection sampling (RS) preserves fidelity but wastes computation by discarding invalid outputs. Both extremes are problematic in domains such as program fuzzing, where both validity and diversity of samples are essential. We present Constrained Adaptive Rejection Sampling (CARS), an approach that strictly improves the sample-efficiency of RS without distributional distortion. CARS begins with unconstrained LM sampling and adaptively rules out constraint-violating continuations by recording them in a trie and subtracting their probability mass from future draws. This adaptive pruning ensures that prefixes proven invalid are never revisited, acceptance rates improve monotonically, and the resulting samples exactly follow the constrained distribution. In experiments on a variety of domains (e.g., program fuzzing and molecular generation), CARS consistently achieves higher efficiency, measured in the number of LM forward passes per valid sample, while also producing stronger sample diversity than both GCD and methods that approximate the LM's distribution.
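The trie-based pruning described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the three-token vocabulary, the fixed `lm_probs` distribution, the "no `c`" constraint, and all function names are hypothetical stand-ins chosen so the example runs without a real LM. Each rejected prefix is stored with the probability mass known to be invalid beneath it, and later draws reweight each child by the mass that is still possibly valid.

```python
import random

VOCAB = ["a", "b", "c"]
LENGTH = 3  # every sample is a sequence of exactly three tokens


def lm_probs(prefix):
    """Toy stand-in for an LM: the same next-token distribution at every prefix."""
    return {"a": 0.5, "b": 0.3, "c": 0.2}


def shortest_invalid_prefix(seq):
    """Toy constraint: no 'c' allowed. Returns the shortest violating prefix, or None."""
    for i, token in enumerate(seq):
        if token == "c":
            return seq[: i + 1]
    return None


def record_invalid(invalid, bad_prefix):
    """Mark bad_prefix as fully invalid and refresh the invalid mass of its ancestors."""
    invalid[bad_prefix] = 1.0
    for cut in range(len(bad_prefix) - 1, -1, -1):
        parent = bad_prefix[:cut]
        probs = lm_probs(parent)
        invalid[parent] = sum(
            probs[t] * invalid.get(parent + (t,), 0.0) for t in VOCAB
        )


def cars_sample(invalid, rng, max_tries=1000):
    """Draw one valid sequence; `invalid` plays the role of the trie, shared across calls."""
    for tries in range(1, max_tries + 1):
        prefix = ()
        while len(prefix) < LENGTH:
            probs = lm_probs(prefix)
            # Remove the probability mass already proven invalid below each child.
            weights = [
                probs[t] * max(0.0, 1.0 - invalid.get(prefix + (t,), 0.0))
                for t in VOCAB
            ]
            prefix += (rng.choices(VOCAB, weights=weights)[0],)
        bad = shortest_invalid_prefix(prefix)
        if bad is None:
            return prefix, tries
        record_invalid(invalid, bad)
    raise RuntimeError("no valid sample found within max_tries")
```

Because the reweighted draw is the toy LM's distribution conditioned on avoiding the prefixes already recorded, accepted samples in this sketch follow the constrained distribution, and a recorded prefix receives zero weight on every later draw.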

2509.21597 2026-06-04 eess.AS cs.CL cs.SD 版本更新

AUDDT: A Unified Benchmark Toolkit for Audio and Speech Deepfake Detectors

AUDDT:音频与语音深度伪造检测器的统一基准工具包

Yi Zhu, Heitor R. Guimarães, Arthur Pimentel, Tiago Falk

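Affiliations: MuSAELab

AI summary: Presents AUDDT, an open-source benchmarking toolkit that integrates 31 datasets and automates the evaluation of pretrained detectors, and systematically analyzes the generalization and performance variability of deepfake detection across manipulation types and recording conditions.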

详情
AI中文摘要

随着人工智能生成内容(如音频深度伪造)的普及,近期大量工作聚焦于开发深度伪造检测技术。然而,现有基准仅使用少量数据集,使得检测器在真实世界条件下的泛化能力不确定。本文系统回顾了31个现有音频深度伪造数据集,并提出了一个名为AUDDT(https://github.com/MuSAELab/AUDDT)的开源基准测试工具包。该工具包旨在自动化评估预训练检测器在广泛语音和非语音音频数据集上的性能,为用户提供其深度伪造检测器在不同操作类型和录音条件下的优缺点直接反馈。我们首先展示了所开发工具包的使用方法、基准的组成以及不同深度伪造子组的细分。接着,我们强调了AUDDT与现有基准工作的不同之处,即通过大规模、多样化的现代欺骗方法评估以及通过全面的元数据注释进行更丰富的属性级分析。使用一个广泛采用的预训练深度伪造检测器,我们展示了域内和域外检测结果,揭示了在不同条件和音频操作类型下显著的性能差异。最后,我们还分析了这些现有数据集的局限性及其与实际部署场景之间的差距。

英文摘要

With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, existing benchmarks employ a narrow set of datasets, leaving detector generalization to real-world conditions uncertain. In this paper, we systematically review 31 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across a wide range of speech and non-speech audio datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors under diverse manipulation types and recording conditions. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, we highlight how AUDDT differs from existing benchmarking efforts by enabling large-scale, diverse evaluation across modern spoofing methods and richer attribute-level analysis through comprehensive metadata annotation. Using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable performance variability across different conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gaps relative to practical deployment scenarios.

2509.15676 2026-06-04 cs.LG cs.AI cs.CL 版本更新

KITE: Kernelized and Information Theoretic Exemplars for In-Context Learning

KITE: 基于核方法和信息论的上下文学习示例选择

Vaibhav Singh, Soumya Suvra Ghosal, Kapu Nirmal Joshua, Soumyabrata Pal, Sayak Ray Chowdhury

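Affiliations: IIT Bombay; UMD College Park; IIT Kanpur; Adobe Research

AI summary: For example selection in in-context learning, proposes a greedy algorithm grounded in information theory and kernel methods that minimizes query-specific prediction error and adds a diversity regularizer, significantly improving classification performance.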

详情
AI中文摘要

上下文学习(ICL)已成为一种强大的范式,通过仅使用提示中精心选择的少量任务特定示例,使大型语言模型(LLM)适应新的、数据稀缺的任务。然而,鉴于LLM有限的上下文大小,一个基本问题出现了:应选择哪些示例以最大化给定用户查询的性能?虽然基于最近邻的方法(如KATE)已被广泛用于此目的,但它们在高维嵌入空间中存在众所周知的缺点,包括泛化能力差和缺乏多样性。在这项工作中,我们从原则性的、信息论驱动的角度研究ICL中的示例选择问题。我们首先将LLM建模为输入嵌入上的线性函数,并将示例选择任务框架化为一个查询特定的优化问题:从较大的示例库中选择一个子集,以最小化特定查询上的预测误差。这种表述通过针对特定查询实例的准确预测,偏离了传统的以泛化为中心的学习理论方法。我们推导出一个原则性的代理目标,该目标是近似子模的,从而能够使用具有近似保证的贪心算法。我们通过(i)引入核技巧以在高维特征空间中操作而无需显式映射,以及(ii)引入基于最优设计的正则化项以鼓励所选示例的多样性,进一步增强了我们的方法。实验上,我们在多个分类任务上展示了相对于标准检索方法的显著改进,突出了在真实世界、标签稀缺场景中,结构感知、多样化的示例选择对ICL的益处。

英文摘要

In-context learning (ICL) has emerged as a powerful paradigm for adapting large language models (LLMs) to new and data-scarce tasks using only a few carefully selected task-specific examples presented in the prompt. However, given the limited context size of LLMs, a fundamental question arises: Which examples should be selected to maximize performance on a given user query? While nearest-neighbor-based methods like KATE have been widely adopted for this purpose, they suffer from well-known drawbacks in high-dimensional embedding spaces, including poor generalization and a lack of diversity. In this work, we study this problem of example selection in ICL from a principled, information theory-driven perspective. We first model an LLM as a linear function over input embeddings and frame the example selection task as a query-specific optimization problem: selecting a subset of exemplars from a larger example bank that minimizes the prediction error on a specific query. This formulation departs from traditional generalization-focused learning theoretic approaches by targeting accurate prediction for a specific query instance. We derive a principled surrogate objective that is approximately submodular, enabling the use of a greedy algorithm with an approximation guarantee. We further enhance our method by (i) incorporating the kernel trick to operate in high-dimensional feature spaces without explicit mappings, and (ii) introducing an optimal design-based regularizer to encourage diversity in the selected examples. Empirically, we demonstrate significant improvements over standard retrieval methods across a suite of classification tasks, highlighting the benefits of structure-aware, diverse example selection for ICL in real-world, label-scarce scenarios.
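As a rough illustration of query-specific greedy selection under a linear model, the sketch below greedily picks exemplars that most reduce the ridge-regression predictive variance at the query, q^T (lam*I + sum_i x_i x_i^T)^{-1} q. That objective is an assumption made for this example; the paper's actual surrogate, its kernelized form, and its diversity regularizer are not reproduced here, and `greedy_select` and its parameters are hypothetical names.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def greedy_select(bank, query, k, lam=1.0):
    """Greedily choose k exemplar indices that shrink q^T A^{-1} q the most,
    where A = lam * I + sum of x x^T over the chosen exemplars."""
    d = len(query)
    # Inverse of the regularised design matrix, starting from (lam * I)^{-1}.
    inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    chosen = []
    for _ in range(k):
        best = None
        for idx, x in enumerate(bank):
            if idx in chosen:
                continue
            u = matvec(inv, x)                # A^{-1} x
            den = 1.0 + dot(x, u)             # Sherman-Morrison denominator
            gain = dot(query, u) ** 2 / den   # drop in q^T A^{-1} q if x is added
            if best is None or gain > best[1]:
                best = (idx, gain, u, den)
        if best is None:
            break
        idx, _, u, den = best
        chosen.append(idx)
        # Rank-one (Sherman-Morrison) update of the inverse after adding x x^T.
        inv = [[inv[i][j] - u[i] * u[j] / den for j in range(d)] for i in range(d)]
    return chosen
```

In this plain form the selection can return near-duplicate exemplars that lie along the query direction, which is the kind of behavior a diversity term would be meant to counteract.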

2508.01815 2026-06-04 cs.CL cs.AI 版本更新

From Graph Retrieval to Schema Realization: Counterfactual Validation for Text-to-SPARQL over Heterogeneous Knowledge Graphs

从图检索到模式实现:面向异构知识图谱的文本到SPARQL的反事实验证

Chengxiao Dai, Yue Xiu, Dusit Niyato

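Affiliations: University of Bristol

AI summary: Proposes the SchemaForge framework, which improves the execution accuracy of text-to-SPARQL query generation over heterogeneous knowledge graphs through question-conditioned schema-slice alignment and counterfactual validation.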

详情
AI中文摘要

文本到SPARQL将自然语言问题映射为RDF知识图谱上的可执行SPARQL查询。标准评估通常预先固定目标图,但实际知识图谱问答(KGQA)可能涉及具有不同模式、部分对齐和不完整元数据的异构图集合。在此设置下,查询生成不仅依赖于SPARQL语法:系统必须识别能够支持问题所需的谓词、实体类型、连接、过滤器和约束的图模式。我们提出SchemaForge,一个面向异构KG集合的文本到SPARQL的基于模式的智能体框架。其核心机制是问题条件化的模式切片对齐:弱图证据首先识别可能的图,而更强的模式证据确定局部模式切片能否实现预期查询。选定的模式切片随后在执行前约束查询生成和验证。当仅有一个图可用时,该公式简化为带有模式基础的标准单KG文本到SPARQL。我们在LC-QuAD 2.0、QALD-9 Plus、QALD-10和Spider4SPARQL上评估SchemaForge。在四个公开基准上,SchemaForge相比最强匹配的智能体基线平均提高执行准确率11.50个百分点。在Spider4SPARQL上,SchemaForge将执行准确率从54.86%提升至64.18%,并达到73.0%的Top-1和97.0%的Top-3图分配准确率。这些结果表明,从弱图证据转向模式特定的查询承诺,结合反事实答案集检查,改进了异构知识图谱上的可执行查询生成。

英文摘要

Text-to-SPARQL maps natural-language questions to executable SPARQL queries over RDF knowledge graphs. While standard evaluations often fix the target graph in advance, practical knowledge graph question answering (KGQA) may involve heterogeneous graph collections with different schemas, partial alignments, and incomplete metadata. In this setting, query generation depends on more than SPARQL syntax: the system must identify a graph schema that can support the predicates, entity types, joins, filters, and constraints required by the question. We present SchemaForge, a schema-grounded agentic framework for text-to-SPARQL over heterogeneous KG collections. Its central mechanism is question-conditioned schema-slice alignment: weak graph evidence first identifies plausible graphs, while stronger schema evidence determines whether a local schema slice can realize the intended query. The selected schema slice then constrains query generation and verification before execution. When only one graph is available, the same formulation reduces to standard single-KG text-to-SPARQL with schema grounding. We evaluate SchemaForge on LC-QuAD 2.0, QALD-9 Plus, QALD-10, and Spider4SPARQL. Across the four public benchmarks, SchemaForge improves execution accuracy over the strongest matched agent baseline by 11.50 percentage points on average. On Spider4SPARQL, SchemaForge improves execution accuracy from 54.86% to 64.18% and achieves 73.0% Top-1 and 97.0% Top-3 graph allocation accuracy. These results show that moving from weak graph evidence to schema-specific query commitments, together with counterfactual answer-set checks, improves executable query generation over heterogeneous knowledge graphs.

2507.21892 2026-06-04 cs.CL 版本更新

Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning

Graph-R1:通过端到端强化学习实现智能图RAG框架

Haoran Luo, Haihong E, Guanting Chen, Qika Lin, Yikai Guo, Fangzhi Xu, Zemin Kuang, Meina Song, Xiaobao Wu, Yifan Zhu, Luu Anh Tuan

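Affiliations: University of Science and Technology of China

AI summary: Proposes Graph-R1, described as the first agentic GraphRAG framework trained with end-to-end reinforcement learning; it uses lightweight knowledge hypergraph construction, multi-turn agent-environment retrieval, and an end-to-end reward mechanism, and outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality.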

Comments Accepted by ICML 2026 main conference

详情
Journal ref
ICML 2026
AI中文摘要

检索增强生成(RAG)通过引入外部知识减轻大语言模型中的幻觉,但依赖于缺乏结构语义的基于块的检索。图RAG方法通过将知识建模为实体-关系图来改进RAG,但仍面临构建成本高、固定一次性检索以及依赖长上下文推理和提示设计等挑战。为解决这些问题,我们提出Graph-R1,首个通过端到端强化学习(RL)的智能图RAG框架。它引入了轻量知识超图构建,将检索建模为多轮智能体-环境交互,并通过端到端奖励机制优化智能体过程。在标准RAG数据集上的实验表明,Graph-R1在推理准确性、检索效率和生成质量上优于传统图RAG和强化学习增强的RAG方法。我们的软件和数据公开在https://github.com/LHRLAB/Graph-R1。

英文摘要

Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as entity-relation graphs, but still face challenges in high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design. To address these challenges, we propose Graph-R1, the first agentic GraphRAG framework via end-to-end reinforcement learning (RL). It introduces lightweight knowledge hypergraph construction, models retrieval as a multi-turn agent-environment interaction, and optimizes the agent process via an end-to-end reward mechanism. Experiments on standard RAG datasets show that Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality. Our software and data are publicly available at https://github.com/LHRLAB/Graph-R1.

2507.03373 2026-06-04 cs.CL 版本更新

WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia

WETBench:用于检测维基百科上特定任务机器生成文本的基准

Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl

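Affiliations: King’s College London; Wikimedia Foundation

AI summary: Introduces WETBench, a multilingual, multi-generator, task-specific benchmark for detecting machine-generated text in Wikipedia editing scenarios; training-based detectors average 78% accuracy and zero-shot detectors average 58%.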

详情
AI中文摘要

鉴于维基百科作为高质量、可靠内容的可信来源,对其平台上由大型语言模型(LLM)产生的低质量机器生成文本(MGT)的扩散担忧日益增加。因此,可靠的MGT检测至关重要。然而,现有工作主要在通用生成任务上评估MGT检测器,而非维基百科编辑者更常执行的任务。这种错位可能导致在真实维基百科场景中应用时泛化能力差。我们引入了WETBench,一个多语言、多生成器、任务特定的MGT检测基准。我们定义了三个编辑任务,这些任务基于维基百科编辑者对LLM辅助编辑的感知用例进行实证:段落写作、摘要和文本风格迁移,我们使用两个新数据集在三种语言中实现这些任务。对于每个写作任务,我们评估三个提示,使用表现最佳的提示跨多个生成器生成MGT,并对多种检测器进行基准测试。我们发现,在各种设置下,基于训练的检测器平均准确率达到78%,而零样本检测器平均为58%。这些结果表明,检测器在现实生成场景中难以应对MGT,并强调了在多样化、任务特定数据上评估此类模型以评估其在编辑驱动场景中可靠性的重要性。

英文摘要

Given Wikipedia's role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors' perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.

2506.05233 2026-06-04 cs.LG cs.AI cs.CL 版本更新

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

MesaNet: 通过局部最优测试时训练进行序列建模

Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Sarthak Mittal, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, João Sacramento

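Affiliations: Google; Paradigms of Intelligence Team; Google DeepMind; MIT CSAIL

AI summary: Proposes a Mesa layer that performs locally optimal test-time training with a conjugate gradient solver; while retaining the constant inference cost of recurrent models, it achieves lower language-modeling perplexity and better downstream benchmark performance than prior RNN models.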

Comments Published at ICLR 2026

详情
AI中文摘要

序列建模目前主要由使用softmax自注意力的因果Transformer架构主导。尽管被广泛采用,Transformer在推理时需要线性扩展内存和计算。最近一系列工作将softmax操作线性化,产生了具有恒定内存和计算成本的强大循环神经网络模型,如DeltaNet、Mamba或xLSTM。这些模型可以通过注意到其循环层动态都源于上下文回归目标(通过在线学习规则近似优化)来统一。在此,我们加入这一系列工作,引入最近提出的Mesa层(von Oswald等人,2024)的一个数值稳定、可分块并行化的版本,该层原本只能顺序运行,因此不可扩展。该层同样源于上下文损失,但现在使用快速共轭梯度求解器在每个时间点将其最小化至最优。通过一系列扩展到十亿参数规模的实验,我们表明最优测试时训练使得语言建模困惑度更低,下游基准性能优于之前的RNN,尤其是在需要长上下文理解的任务上。这一性能提升以推理时额外浮点运算为代价。因此,我们的结果与最近增加测试时计算以提高性能的趋势有趣地相关——这里通过花费计算在神经网络内部解决序列优化问题来实现。

英文摘要

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require memory and compute that scale linearly with sequence length during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), which could only run sequentially in time and was therefore not scalable. This layer again stems from an in-context loss, but one that is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments up to the billion-parameter scale, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional FLOPs spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.
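To make "minimized to optimality with a conjugate gradient solver" concrete, the sketch below solves a small ridge-regression readout with conjugate gradient. It assumes the in-context objective is regularized least squares from keys to values, so the output for a query q is (sum_i v_i k_i^T)(sum_i k_i k_i^T + lam*I)^{-1} q. The function names, the scalar `lam`, and the unbatched pure-Python form are illustrative choices, not the paper's chunkwise-parallel implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, max_iters=50, tol=1e-20):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rs = dot(r, r)
    for _ in range(max_iters):
        if rs <= tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def mesa_readout(keys, values, query, lam=1.0):
    """Ridge-regression prediction for `query` from the (key, value) pairs seen so far."""
    d = len(query)
    # H = lam * I + sum_i k_i k_i^T
    H = [[(lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    for k in keys:
        for i in range(d):
            for j in range(d):
                H[i][j] += k[i] * k[j]
    x = conjugate_gradient(H, query)        # x = H^{-1} q
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        w = dot(k, x)                       # k_i^T H^{-1} q
        out = [o + w * vi for o, vi in zip(out, v)]
    return out
```

Each solver iteration costs one matrix-vector product, which gives a sense of where the additional inference-time FLOPs mentioned in the abstract would come from.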

2505.19293 2026-06-04 cs.CL cs.AI cs.LG 版本更新

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

100-LongBench:事实上的长上下文基准是否真的在评估长上下文能力?

Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

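Affiliations: Case Western Reserve University; Texas A&M University; Rice University; University of California, Los Angeles; Meta

AI summary: Because existing long-context benchmarks cannot separate baseline ability from true long-context ability and use fixed input lengths, proposes a length-controllable long-context benchmark and a new metric to evaluate the long-context ability of large language models effectively.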

详情
AI中文摘要

长上下文能力被认为是LLM最重要的能力之一,因为真正具备长上下文能力的LLM使用户能够轻松处理许多原本繁琐的任务——例如,阅读长文档寻找答案与直接询问LLM。然而,现有的基于真实任务的长上下文评估基准有两个主要缺陷。首先,像LongBench这样的基准通常没有提供适当的指标来将长上下文性能与模型的基线能力分开,使得跨模型比较不清晰。其次,此类基准通常以固定输入长度构建,这限制了它们在不同模型上的适用性,并且无法揭示模型何时开始崩溃。为了解决这些问题,我们引入了一个长度可控的长上下文基准和一个新颖的指标,该指标将基线知识与真实的长上下文能力解耦。实验证明了我们的方法在有效评估LLM方面的优越性。

英文摘要

Long-context capability is considered one of the most important abilities of LLMs, as a truly long-context-capable LLM enables users to effortlessly process many otherwise exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.

2505.17315 2026-06-04 cs.AI cs.CL cs.LG 版本更新

Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

更长上下文,更深思考:揭示长上下文能力在推理中的作用

Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han

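Affiliations: Case Western Reserve University; University of Minnesota - Twin Cities; Texas A&M University

AI summary: Shows experimentally that strengthening a model's long-context ability before supervised fine-tuning significantly improves reasoning performance, with gains that carry over to short-input tasks, suggesting that long-context modeling is a key foundation for reasoning.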

详情
AI中文摘要

近期语言模型展现出强大的推理能力,但长上下文能力对推理的影响仍未充分探索。在本工作中,我们假设当前推理能力的局限性部分源于长上下文能力不足,这一假设基于经验观察:(1)更高的上下文窗口长度通常带来更强的推理性能,(2)失败的推理案例与失败的长上下文案例相似。为验证这一假设,我们检验了在监督微调(SFT)前增强模型的长上下文能力是否能提升推理性能。具体而言,我们比较了架构和微调数据相同但长上下文能力不同的模型。结果揭示了一致趋势:长上下文能力更强的模型在SFT后,在推理基准上取得了显著更高的准确率。值得注意的是,即使在输入长度较短的任务上,这些增益也持续存在,表明长上下文训练为推理性能提供了可泛化的益处。这些发现表明,长上下文建模不仅对处理长输入至关重要,而且也是推理的关键基础。我们主张将长上下文能力作为未来语言模型设计的首要目标。

英文摘要

Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) a longer context window often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model's long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compare models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not just essential for processing lengthy inputs, but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models.

2502.17956 2026-06-04 cs.CL 版本更新

Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

在跨语言和多语言环境中更好地理解程序思维推理

Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

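Affiliations: School of Information Science and Technology, VISTEC; KAIST; Cohere; SCB 10X; AI Singapore; Department of Computer Engineering, Chulalongkorn University

AI summary: By separating reasoning from code execution, proposes a framework for evaluating Program-of-Thought prompting, finding that fine-tuning substantially improves multilingual reasoning and that reasoning quality correlates strongly with answer accuracy.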

详情
Journal ref
Findings of the Association for Computational Linguistics: ACL 2025
AI中文摘要

多步推理对于大型语言模型至关重要,但多语言性能仍然具有挑战性。虽然思维链提示改进了推理,但由于推理与执行的纠缠,它在非英语语言中表现不佳。程序思维提示将推理与执行分离,提供了一种有前景的替代方案,但将挑战转移到从非英语问题生成程序上。我们提出了一个框架,通过分离多语言推理与代码执行来评估程序思维,以检验(i)微调对问题-推理对齐的影响,以及(ii)推理质量如何影响答案正确性。我们的发现表明,程序思维微调显著增强了多语言推理,优于思维链微调模型。我们进一步证明了推理质量(通过代码质量衡量)与答案准确性之间的强相关性,突出了其作为测试时性能改进启发式方法的潜力。

英文摘要

Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.

2504.12329 2026-06-04 cs.CL cs.AI 版本更新

Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

推测性思考:在推理时利用大模型指导增强小模型推理能力

Wang Yang, Xiang Yue, Vipin Chaudhary, Xiaotian Han

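Affiliations: Case Western Reserve University; Carnegie Mellon University

AI summary: Proposes Speculative Thinking, a training-free framework in which a large reasoning model guides a smaller model at the reasoning level, raising the smaller model's reasoning accuracy while shortening its outputs.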

详情
AI中文摘要

近期进展利用后训练来增强模型推理性能,这通常需要昂贵的训练流程,并且仍然存在低效、输出过长的问题。我们提出推测性思考,一种无需训练的框架,使大推理模型在推理层面引导小模型进行推理,区别于在词元层面操作的推测解码。我们的方法基于两个观察:(1)支持推理的词元(如“wait”)经常出现在结构分隔符(如“\n\n”)之后,作为反思或继续的信号;(2)大模型对反思行为有更强的控制,减少不必要的回溯同时提高推理质量。通过策略性地将反思步骤委托给能力更强的模型,我们的方法显著提升了推理模型的推理准确率,同时缩短了输出。在32B推理模型的辅助下,1.5B模型在MATH500上的准确率从83.2%提升至89.4%,实现了6.2%的大幅提升。同时,平均输出长度从5439个词元减少到4583个词元,下降了15.7%。此外,当应用于非推理模型(Qwen-2.5-7B-Instruct)时,我们的框架在相同基准上将准确率从74.0%提升至81.8%,实现了7.8%的相对提升。

英文摘要

Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as "wait" frequently appear after structural delimiters like "\n\n", serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of reasoning models while shortening their output. With the assistance of the 32B reasoning model, the 1.5B model's accuracy on MATH500 increases from 83.2% to 89.4%, an improvement of 6.2 percentage points. Simultaneously, the average output length is reduced from 5439 tokens to 4583 tokens, a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework boosts its accuracy from 74.0% to 81.8% on the same benchmark, an absolute improvement of 7.8 percentage points.
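The figures quoted in this abstract mix two kinds of change, which a short calculation makes explicit. Using only the numbers stated above, the accuracy gains of 6.2 and 7.8 match absolute differences in percentage points, while the 15.7% length reduction is a relative change. The helper names below are illustrative.

```python
def point_gain(before, after):
    """Absolute difference, in percentage points when both inputs are percentages."""
    return after - before


def relative_change(before, after):
    """Change relative to the starting value, expressed in percent."""
    return (after - before) / before * 100.0


print(round(point_gain(83.2, 89.4), 1))       # 6.2   (1.5B model, MATH500 accuracy)
print(round(relative_change(5439, 4583), 1))  # -15.7 (average output length in tokens)
print(round(point_gain(74.0, 81.8), 1))       # 7.8   (Qwen-2.5-7B-Instruct accuracy)
print(round(relative_change(74.0, 81.8), 1))  # 10.5  (same gain, relative to 74.0)
```

Read this way, the 7.8 figure for the non-reasoning model is an absolute gain; relative to the 74.0% starting accuracy, the same change is about 10.5%.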

2307.00862 2026-06-04 cs.CV cs.CL 版本更新

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

UniFine: 一种统一且细粒度的零样本视觉-语言理解方法

Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang

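Affiliations: Columbia University; Microsoft Research; University of California, Los Angeles

AI summary: Proposes the UniFine framework, which uses fine-grained information such as sentence keywords and image objects for image-text matching to handle vision-language tasks such as VQA, SNLI-VE, and VCR in a unified zero-shot setting, with substantial improvements on several datasets.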

Comments 14 pages, 4 figures, ACL 2023 Findings

详情
AI中文摘要

视觉-语言任务,如VQA、SNLI-VE和VCR,具有挑战性,因为它们需要模型的推理能力来理解视觉世界和自然语言的语义。针对视觉-语言任务的监督方法已被充分研究。然而,在零样本设置下解决这些任务的研究较少。由于对比语言-图像预训练(CLIP)在图像-文本匹配上展现了显著的零样本性能,先前的工作通过将视觉-语言任务转换为图像-文本匹配问题来利用其强大的零样本能力,并且它们主要考虑全局级别的匹配(例如,整个图像或句子)。然而,我们发现视觉和文本的细粒度信息,例如句子中的关键词和图像中的对象,对于语义理解可能相当有信息量。受此启发,我们提出了一个统一框架,利用细粒度信息进行零样本视觉-语言学习,涵盖多个任务,如VQA、SNLI-VE和VCR。我们的实验表明,我们的框架在VQA上优于先前的零样本方法,并在SNLI-VE和VCR上取得了显著改进。此外,我们的消融研究证实了我们提出的方法的有效性和泛化性。

英文摘要

Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging because they require the model's reasoning ability to understand the semantics of the visual world and natural language. Supervised methods for vision-language tasks have been well studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works utilized its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly consider global-level matching (e.g., the whole image or sentence). However, we find visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method.

2407.03884 2026-06-04 cs.CL cs.AI 版本更新

ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents

ChatSOP: 一种SOP引导的MCTS规划框架,用于可控的LLM对话代理

Zhigen Li, Jianxiang Peng, Yanmeng Wang, Yong Cao, Tianhao Shen, Minghui Zhang, Linxi Su, Shang Wu, Yihang Wu, Yuqian Wang, Ye Wang, Wei Hu, Jianfeng Li, Shaojun Wang, Jing Xiao, Deyi Xiong

Affiliations: TJUNLP Lab, College of Intelligence and Computing, Tianjin University; Ping An Technology; Tübingen AI Center, University of Tübingen; Kunming University of Science and Technology

AI summary: Proposes the ChatSOP framework, which improves the controllability of LLM dialogue agents through SOP-guided Monte Carlo Tree Search, raising action accuracy by 27.95% over GPT-3.5-based baselines.

Comments Accepted to ACL 2025 main

详情
Journal ref
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17637-17659, 2025
AI中文摘要

由大型语言模型驱动的对话代理在各种任务中表现出优越的性能。尽管它们能更好地理解用户并生成类人回复,但**缺乏可控性**仍然是一个关键挑战,常常导致对话偏离主题或任务失败。为了解决这个问题,我们引入标准操作程序来规范对话流程。具体来说,我们提出了**ChatSOP**,一种新颖的SOP引导的蒙特卡洛树搜索规划框架,旨在增强LLM驱动的对话代理的可控性。为此,我们整理了一个数据集,包含使用GPT-4o的半自动角色扮演系统生成的、经过严格人工质量控制验证的SOP标注的多场景对话。此外,我们提出了一种新方法,将思维链推理与监督微调相结合用于SOP预测,并利用SOP引导的蒙特卡洛树搜索在对话中进行最优动作规划。实验结果表明了我们方法的有效性,例如,与基于GPT-3.5的基线模型相比,动作准确率提高了27.95%,并且在开源模型上也显示出显著的提升。数据集和代码已公开。

英文摘要

Dialogue agents powered by Large Language Models (LLMs) show superior performance in various tasks. Despite the better user understanding and human-like responses, their **lack of controllability** remains a key challenge, often leading to unfocused conversations or task failure. To address this, we introduce Standard Operating Procedure (SOP) to regulate dialogue flow. Specifically, we propose **ChatSOP**, a novel SOP-guided Monte Carlo Tree Search (MCTS) planning framework designed to enhance the controllability of LLM-driven dialogue agents. To enable this, we curate a dataset comprising SOP-annotated multi-scenario dialogues, generated using a semi-automated role-playing system with GPT-4o and validated through strict manual quality control. Additionally, we propose a novel method that integrates Chain of Thought reasoning with supervised fine-tuning for SOP prediction and utilizes SOP-guided Monte Carlo Tree Search for optimal action planning during dialogues. Experimental results demonstrate the effectiveness of our method, such as achieving a 27.95% improvement in action accuracy compared to baseline models based on GPT-3.5 and also showing notable gains for open-source models. Dataset and codes are publicly available.

2408.11121 2026-06-04 cs.LG cs.AI cs.CL cs.CR 版本更新

DOMBA: Double Model Balancing for Access-Controlled Language Models via Minimum-Bounded Aggregation

DOMBA: 通过最小有界聚合实现访问控制语言模型的双模型平衡

Tom Segal, Asaf Shabtai, Yuval Elovici

发表机构 * Ben-Gurion University(本·古里安大学)

AI总结 提出DOMBA方法,通过最小有界平均函数聚合两个不同访问级别文档训练的语言模型的概率分布,在保证安全性的同时实现高效用。

Comments Code: https://github.com/ppo1/DOMBA 11 pages, 3 figures

详情
Journal ref
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 25101-25109, 2025
AI中文摘要

大型语言模型(LLMs)的实用性在很大程度上取决于其训练数据的质量和数量。许多组织拥有大量数据语料库,可用于训练或微调针对其特定需求的LLMs。然而,这些数据集通常带有基于用户权限并由访问控制机制强制执行的访问限制。在此类数据集上训练LLMs可能导致敏感信息暴露给未经授权的用户。防止此类暴露的一种直接方法是为每个访问级别训练一个单独的模型。然而,由于每个模型的训练数据量相对于整个组织语料库的总量有限,这可能导致模型效用低下。另一种方法是在所有数据上训练单个LLM,同时限制未经授权信息的暴露。然而,当前针对LLMs的暴露限制方法对于访问控制数据无效,因为敏感信息在多个训练样本中频繁出现。我们提出DOMBA——双模型平衡——一种训练和部署LLMs的简单方法,可在提供高效用和访问控制功能的同时保证安全性。DOMBA使用“最小有界”平均函数(一个受较小值约束的函数,例如调和平均)聚合两个模型的概率分布,每个模型在具有(可能多个)不同访问级别的文档上训练。详细的数学分析和广泛评估表明,DOMBA在保护受限信息的同时,提供了与非安全模型相当的效用。

英文摘要

The utility of large language models (LLMs) depends heavily on the quality and quantity of their training data. Many organizations possess large data corpora that could be leveraged to train or fine-tune LLMs tailored to their specific needs. However, these datasets often come with access restrictions that are based on user privileges and enforced by access control mechanisms. Training LLMs on such datasets could result in exposure of sensitive information to unauthorized users. A straightforward approach for preventing such exposure is to train a separate model for each access level. This, however, may result in low utility models due to the limited amount of training data per model compared to the amount in the entire organizational corpus. Another approach is to train a single LLM on all the data while limiting the exposure of unauthorized information. However, current exposure-limiting methods for LLMs are ineffective for access-controlled data, where sensitive information appears frequently across many training examples. We propose DOMBA - double model balancing - a simple approach for training and deploying LLMs that provides high utility and access-control functionality with security guarantees. DOMBA aggregates the probability distributions of two models, each trained on documents with (potentially many) different access levels, using a "min-bounded" average function (a function that is bounded by the smaller value, e.g., harmonic mean). A detailed mathematical analysis and extensive evaluation show that DOMBA safeguards restricted information while offering utility comparable to non-secure models.
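The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes the harmonic mean (the example the abstract itself names) as the "min-bounded" average and adds a renormalization step so the result is again a probability distribution; the function names and the renormalization are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative numbers; 0 if either is 0.

    For positive a and b the result lies between min(a, b) and
    2 * min(a, b), so a small probability in either model caps it.
    """
    if a <= 0.0 or b <= 0.0:
        return 0.0
    return 2.0 * a * b / (a + b)


def min_bounded_aggregate(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Combine two next-token distributions token by token.

    Assumption: scores are renormalized to sum to 1. The bound above
    holds for the raw scores, before renormalization.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    scores = [harmonic_mean(a, b) for a, b in zip(p, q)]
    total = sum(scores)
    if total == 0.0:
        raise ValueError("the two models share no token with non-zero probability")
    return [s / total for s in scores]


# Hypothetical 3-token vocabulary: model A never predicts the third token.
model_a = [0.5, 0.5, 0.0]
model_b = [0.2, 0.3, 0.5]
combined = min_bounded_aggregate(model_a, model_b)
```

In this toy case, the token that only one model assigns probability to receives zero mass in the combined distribution, which is the intuition behind bounding the average by the smaller value.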

2412.06095 2026-06-04 cs.CL cs.FL cs.IT math.IT 版本更新

Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance

从小语料库测量语法多样性:派生熵率、平均话语长度和注释不变性

Fermin Moscoso del Prado Martin

发表机构 * Department of Computer Science and Technology & Jesus College University of Cambridge(计算机科学与技术系及耶稣学院,剑桥大学)

AI总结 本文从理论和实证上证明语法的派生熵与其生成的话语平均长度(MLU)之间存在根本联系,提出派生熵率作为衡量语法复杂性的新指标,并引入平滑诱导树库熵(SITE)从小树库中准确估计这些度量。

详情
Journal ref
Computational Linguistics (2025) 51 (4): 1191-1233
AI中文摘要

在许多领域,如语言习得、语言神经心理学、衰老研究和历史语言学,语料库被用于估计个体、社区或说话者类型在一段时间内产生的语法结构的多样性。在这些情况下,树库被视为可能遇到的句法结构的代表性样本。从小型语料库中记录的结构推广潜在的句法多样性需要谨慎的外推,其准确性受到代表性子语料库规模有限的制约。在本文中,我从理论和实证上证明,语法的派生熵与其生成的话语平均长度(MLU)之间存在根本联系,从而产生了一个新的度量——派生熵率。话语平均长度成为句法复杂性最实用的指标;我证明MLU不仅仅是一个代理,而是语法多样性的基本度量。结合新的派生熵率度量,它提供了一种无理论的语法复杂性评估。派生熵率索引了不同语法注释框架确定树库语法复杂性的速率。我引入了平滑诱导树库熵(SITE)作为准确估计这些度量的工具,即使从非常小的树库中也能做到。最后,我讨论了这些结果对自然语言处理和人类语言处理的重要启示。

英文摘要

In many fields, such as language acquisition, neuropsychology of language, the study of aging, and historical linguistics, corpora are used for estimating the diversity of grammatical structures that are produced during a period by an individual, community, or type of speakers. In these cases, treebanks are taken as representative samples of the syntactic structures that might be encountered. Generalizing the potential syntactic diversity from the structures documented in a small corpus requires careful extrapolation whose accuracy is constrained by the limited size of representative sub-corpora. In this article, I demonstrate -- theoretically, and empirically -- that a grammar's derivational entropy and the mean length of the utterances (MLU) it generates are fundamentally linked, giving rise to a new measure, the derivational entropy rate. The mean length of utterances becomes the most practical index of syntactic complexity; I demonstrate that MLU is not a mere proxy, but a fundamental measure of syntactic diversity. In combination with the new derivational entropy rate measure, it provides a theory-free assessment of grammatical complexity. The derivational entropy rate indexes the rate at which different grammatical annotation frameworks determine the grammatical complexity of treebanks. I introduce the Smoothed Induced Treebank Entropy (SITE) as a tool for estimating these measures accurately, even from very small treebanks. I conclude by discussing important implications of these results for both NLP and human language processing.
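As a small illustration of the MLU quantity the abstract refers to, the sketch below computes the mean number of tokens per utterance over a toy corpus. Counting whitespace-separated words (rather than morphemes) and the helper name `mean_length_of_utterance` are assumptions made for the example; the derivational entropy rate and SITE estimators are not reproduced here.

```python
from typing import Iterable


def mean_length_of_utterance(utterances: Iterable[str]) -> float:
    """Average number of whitespace-separated tokens per non-empty utterance."""
    lengths = [len(u.split()) for u in utterances if u.strip()]
    if not lengths:
        raise ValueError("no non-empty utterances")
    return sum(lengths) / len(lengths)


# Toy corpus (hypothetical): utterance lengths are 3, 6 and 1 tokens.
corpus = ["the dog barks", "she saw the dog that barks", "go"]
mlu = mean_length_of_utterance(corpus)
```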

2409.11901 2026-06-04 cs.CL 版本更新

LLMs + Persona-Plug = Personalized LLMs

LLMs + Persona-Plug = 个性化大语言模型

Jiongnan Liu, Yutao Zhu, Shuting Wang, Xiaochi Wei, Erxue Min, Yu Lu, Shuaiqiang Wang, Dawei Yin, Zhicheng Dou

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE(下一代智能搜索与推荐工程研究中心,教育部) Baidu Inc.(百度公司)

AI总结 提出一种轻量级插件式用户嵌入模块PPlug,通过建模用户历史上下文生成个性化嵌入,无需微调即可提升大语言模型输出个性化程度。

详情
AI中文摘要

个性化在众多语言任务和应用中扮演着关键角色,因为具有相同需求的用户可能根据个人兴趣偏好不同的输出。这促使了各种个性化方法的发展,旨在使大语言模型(LLMs)适应生成符合用户偏好的定制化输出。其中一些方法涉及为每个用户微调一个独特的个性化LLM,这过于昂贵而难以广泛应用。替代方法以即插即用的方式引入个性化信息,通过检索用户相关历史文本作为示例。然而,这种基于检索的策略可能会破坏用户历史的连续性,无法捕捉用户的整体风格和模式,从而导致次优性能。为了解决这些挑战,我们提出了一种新颖的个性化LLM模型PPlug。它通过一个轻量级的插件式用户嵌入模块,对每个用户的所有历史上下文进行建模,构建用户特定的嵌入。通过将该嵌入附加到任务输入中,LLMs可以更好地理解和捕捉用户的习惯与偏好,从而在不调整自身参数的情况下生成更个性化的输出。在语言模型个性化(LaMP)基准上的各种任务上的大量实验表明,所提出的模型显著优于现有的个性化LLM方法。

英文摘要

Personalization plays a critical role in numerous language tasks and applications, since users with the same requirements may prefer diverse outputs based on their individual interests. This has led to the development of various personalized approaches aimed at adapting large language models (LLMs) to generate customized outputs aligned with user preferences. Some of them involve fine-tuning a unique personalized LLM for each user, which is too expensive for widespread application. Alternative approaches introduce personalization information in a plug-and-play manner by retrieving the user's relevant historical texts as demonstrations. However, this retrieval-based strategy may break the continuity of the user history and fail to capture the user's overall styles and patterns, hence leading to sub-optimal performance. To address these challenges, we propose a novel personalized LLM model, PPlug. It constructs a user-specific embedding for each individual by modeling all her historical contexts through a lightweight plug-in user embedder module. By attaching this embedding to the task input, LLMs can better understand and capture user habits and preferences, thereby producing more personalized outputs without tuning their own parameters. Extensive experiments on various tasks in the language model personalization (LaMP) benchmark demonstrate that the proposed model significantly outperforms existing personalized LLM approaches.

2407.03956 2026-06-04 cs.MA cs.CL 版本更新

Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems

使用约束引导的多智能体系统解决斑马谜题

Shmuel Berman, Kathleen McKeown, Baishakhi Ray

发表机构 * Princeton University(普林斯顿大学) Columbia University(哥伦比亚大学)

AI总结 提出一种多智能体系统ZPS,结合大语言模型与定理证明器,通过分解问题、生成SMT代码和智能体间反馈,显著提升复杂逻辑谜题的解决能力。

详情
AI中文摘要

先前的研究通过链式思维提示或引入符号表示等技术,增强了大语言模型(LLMs)解决逻辑谜题的能力。然而,由于将自然语言线索翻译为逻辑语句的固有复杂性,这些框架通常仍不足以解决复杂的逻辑问题,例如斑马谜题。我们引入了一个多智能体系统ZPS,它将LLMs与现成的定理证明器集成在一起。该系统通过将问题分解为更小、更易管理的部分,生成SMT(可满足性模理论)代码以使用定理证明器求解,并利用智能体之间的反馈来反复改进答案,从而处理复杂的谜题求解任务。我们还引入了一个自动网格谜题评分器来评估我们谜题解决方案的正确性,并通过用户研究评估了该自动评分器的可靠性。我们的方法在我们测试的所有三个LLM中均显示出改进,其中GPT-4的完全正确解决方案数量提高了166%。

英文摘要

Prior research has enhanced the ability of Large Language Models (LLMs) to solve logic puzzles using techniques such as chain-of-thought prompting or introducing a symbolic representation. These frameworks are still usually insufficient to solve complicated logical problems, such as Zebra puzzles, due to the inherent complexity of translating natural language clues into logical statements. We introduce a multi-agent system, ZPS, that integrates LLMs with an off-the-shelf theorem prover. This system tackles the complex puzzle-solving task by breaking down the problem into smaller, manageable parts, generating SMT (Satisfiability Modulo Theories) code to solve them with a theorem prover, and using feedback between the agents to repeatedly improve their answers. We also introduce an automated grid puzzle grader to assess the correctness of our puzzle solutions and show that the automated grader is reliable by evaluating it in a user study. Our approach shows improvement in all three LLMs we tested, with GPT-4 showing 166% improvement in the number of fully correct solutions.

2402.02555 2026-06-04 cs.CV cs.CL 版本更新

High-Quality Entity Segmentation and Grounding

高质量实体分割与定位

Lu Qi, Yi-Wen Chen, Tao Zhang, Xiangtai Li, Xu Yang, Bo Du, Ming-Hsuan Yang

发表机构 * Wuhan University(武汉大学) Insta360 Research(Insta360研究院) Department of EECS, University of California, Merced(加州大学默塞德分校电子工程与计算机科学系) Nanyang Technological University(南洋理工大学) Institute of Automation of the Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 提出ESG流水线,通过新数据集EntitySeg和两阶段解耦设计(CropFormer高质量分割+GELLA精确名词提取与语义匹配),实现高质量实体分割与定位,在五项任务上有效。

详情
AI中文摘要

在这项工作中,我们提出了ESG,一个由新数据集EntitySeg支持的高质量实体分割与定位流水线。首先,所提出的数据集命名为EntitySeg,包含跨越各种图像域和实体的图像,以及用于训练和测试的大量高分辨率图像和高质量掩码标注。然后,ESG主要由两个模块组成:用于高质量实体分割的CropFormer,以及用于从句子中精确提取名词并在语言和视觉区域之间进行语义匹配的GELLA。与现有联合训练分割和大语言模型的定位方法不同,ESG采用两阶段解耦设计,保留了高质量掩码和定位鲁棒性,避免了联合训练通常带来的权衡。CropFormer确保高质量实体分割结果,然后可以编码到GELLA模型中进行有效定位。大量实验结果表明,我们提出的流水线在五项任务上有效,包括实体分割、全景分割、开放词汇分割、指代分割和全景定位叙述。此外,ESG流水线的GELLA模块高度灵活,能够处理来自任何分割框架的掩码输入,这得益于其轻量级的颜色图/视觉编码器、语言/掩码解码器和关联模块。实体分割数据集和定位代码将在https://github.com/qqlu/Entity发布。

英文摘要

In this work, we propose ESG, a pipeline for high-quality entity segmentation and grounding supported by a new dataset, EntitySeg. First, the proposed dataset, named EntitySeg, contains images spanning various image domains and entities, along with plentiful high-resolution images and high-quality mask annotations for training and testing. Second, ESG mainly consists of two modules: CropFormer for high-quality entity segmentation, and GELLA for accurate noun extraction from sentences and semantic matching between language and visual regions. Unlike existing grounding methods that jointly train a segmentation model and a large language model, ESG adopts a two-stage decoupled design, preserving high-quality masks and grounding robustness without the trade-offs often introduced by joint training. CropFormer ensures high-quality entity segmentation results, which can then be encoded into the GELLA model for effective grounding. Extensive experimental results demonstrate the effectiveness of our proposed pipeline across five tasks, including entity segmentation, panoptic segmentation, open-vocabulary segmentation, referring segmentation, and panoptic localized narratives. Furthermore, the GELLA module of the ESG pipeline is highly flexible and capable of processing mask inputs from any segmentation framework, thanks to its lightweight colormap/vision encoder, language/mask decoder, and association module. The entity segmentation dataset and grounding code will be released at https://github.com/qqlu/Entity.

1904.03152 2026-06-04 eess.SY cs.CL cs.NE cs.SY 版本更新

Data-driven Modelling of Dynamical Systems Using Tree Adjoining Grammar and Genetic Programming

基于树附加语法和遗传编程的数据驱动动态系统建模

Dhruv Khandelwal, Maarten Schoukens, Roland Tóth

发表机构 * Department of Electrical Engineering(电气工程系) Eindhoven University of Technology(埃因霍温理工大学)

AI总结 本文报告了一种利用树附加语法和遗传编程对非线性动态系统进行数据驱动建模的方法在三个真实物理系统案例研究中的结果,并分析了其在不同挑战下的性能。

Comments Paper accepted at IEEE CEC 2019

详情
AI中文摘要

最先进的数据驱动非线性动态系统建模方法通常需要与专家用户交互。为了部分自动化从数据中建模物理系统的过程,许多基于进化算法的方法被提出用于模型结构选择,特别是针对非线性系统。最近,一种利用遗传编程(GP)进行非线性动态系统数据驱动建模的方法被提出。该方法的创新点在于对噪声的建模以及使用树附加语法来塑造GP探索的搜索空间。在本文中,我们报告了该方法在三个案例研究中的结果。每个案例研究均基于真实的物理系统。这些案例研究提出了各种挑战。特别是,这些挑战涵盖了对真实系统先验知识的不同程度、可用数据量、系统动态的复杂性以及系统中非线性特性。基于案例研究中取得的结果,我们对所提出的方法的性能进行了批判性分析。

英文摘要

State-of-the-art methods for data-driven modelling of non-linear dynamical systems typically involve interactions with an expert user. In order to partially automate the process of modelling physical systems from data, many EA-based approaches have been proposed for model-structure selection, with special focus on non-linear systems. Recently, an approach for data-driven modelling of non-linear dynamical systems using Genetic Programming (GP) was proposed. The novelty of the method was the modelling of noise and the use of Tree Adjoining Grammar to shape the search-space explored by GP. In this paper, we report results achieved by the proposed method on three case studies. Each of the case studies considered here is based on real physical systems. The case studies pose a variety of challenges. In particular, these challenges range over varying amounts of prior knowledge of the true system, amount of data available, the complexity of the dynamics of the system, and the nature of non-linearities in the system. Based on the results achieved for the case studies, we critically analyse the performance of the proposed method.

1701.08711 2026-06-04 cs.CL cs.LG econ.GN q-fin.EC stat.ML 版本更新

Predicting Auction Price of Vehicle License Plate with Deep Recurrent Neural Network

利用深度循环神经网络预测车辆车牌拍卖价格

Vinci Chow

发表机构 * Department of Economics, The Chinese University of Hong Kong, Shatin, Hong Kong(香港中文大学经济系,沙田,香港)

AI总结 本文提出将车辆车牌价格预测视为自然语言处理任务,通过构建深度循环神经网络来预测香港车牌拍卖价格,并展示了模型在解释价格变化和扩展为车牌搜索引擎方面的贡献。

详情
AI中文摘要

在中国社会,迷信因素极为重要,具有吉祥数字的车辆车牌在拍卖中可以高价成交。与其他珍贵物品不同,车牌在拍卖前并不预估价格。本文提出将车牌价格预测视为自然语言处理(NLP)任务,因为价值取决于车牌上每个字符的含义和语义。本文构建了一个深度循环神经网络(RNN)来预测香港车牌的价格,基于车牌上的字符。在13年的历史拍卖价格上评估,深度RNN的预测可以解释超过80%的价格变化,显著优于以前的模型。此外,本文还展示了该模型如何扩展为车牌搜索引擎,并提供价格分布的估计。

英文摘要

In Chinese societies, superstition is of paramount importance, and vehicle license plates with desirable numbers can fetch very high prices in auctions. Unlike other valuable items, license plates are not allocated an estimated price before auction. I propose that the task of predicting plate prices can be viewed as a natural language processing (NLP) task, as the value depends on the meaning of each individual character on the plate and its semantics. I construct a deep recurrent neural network (RNN) to predict the prices of vehicle license plates in Hong Kong, based on the characters on a plate. I demonstrate the importance of having a deep network and of retraining. Evaluated on 13 years of historical auction prices, the deep RNN's predictions can explain over 80 percent of price variations, outperforming previous models by a significant margin. I also demonstrate how the model can be extended to become a search engine for plates and to provide estimates of the expected price distribution.

1902.01119 2026-06-04 cs.AI cs.CL cs.LG cs.SY eess.SY 版本更新

The Natural Language of Actions

动作的自然语言

Guy Tennenholtz, Shie Mannor

发表机构 * Faculty of Electrical Engineering, Technion Institute of Technology, Israel(以色列理工学院电气工程学院,以色列)

AI总结 本文提出Act2Vec框架,用于学习基于上下文的动作表示以提升强化学习性能,通过将相似动作分组并利用动作间的关系来改进Q值近似和状态表示。

Comments Published in the proceedings of the 36th International Conference on Machine Learning (ICML 2019)

详情
AI中文摘要

我们介绍了Act2Vec,一种通用框架,用于学习基于上下文的动作表示以用于强化学习。在向量空间中表示动作有助于强化学习算法通过将相似动作分组并利用不同动作之间的关系来实现更好的性能。我们展示了如何从演示中提取环境的先验知识,并将其注入到编码自然兼容行为的动作向量表示中。然后我们利用这些表示来增强状态表示以及改进Q值的函数逼近。我们还在三个领域中可视化和测试了动作嵌入:绘画任务、高维导航任务以及星际争霸II中的大规模动作空间领域。

英文摘要

We introduce Act2Vec, a general framework for learning context-based action representation for Reinforcement Learning. Representing actions in a vector space helps reinforcement learning algorithms achieve better performance by grouping similar actions and utilizing relations between different actions. We show how prior knowledge of an environment can be extracted from demonstrations and injected into action vector representations that encode natural compatible behavior. We then use these for augmenting state representations as well as improving function approximation of Q-values. We visualize and test action embeddings in three domains including a drawing task, a high dimensional navigation task, and the large action space domain of StarCraft II.

1811.00641 2026-06-04 cs.LG cs.CL cs.NA math.NA stat.ML 版本更新

Online Embedding Compression for Text Classification using Low Rank Matrix Factorization

使用低秩矩阵分解的文本分类在线嵌入压缩

Anish Acharya, Rahul Goel, Angeliki Metallinou, Inderjit Dhillon

发表机构 * Amazon Alexa AI(亚马逊Alexa人工智能) Amazon Search Technologies(亚马逊搜索技术) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出了一种在线词嵌入压缩方法,利用低秩矩阵分解在训练过程中压缩词嵌入层,从而减少NLP模型的内存瓶颈,同时在下游任务中通过重新训练恢复精度,实验证明该方法在句子分类任务中实现了90%的压缩率,并优于固定点量化等其他方法。

Comments Accepted in Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019)

详情
AI中文摘要

深度学习模型已成为自然语言处理(NLP)任务的最新技术,但将其部署到生产系统中却面临显著的内存限制。现有的压缩方法要么有损,要么引入显著的延迟。我们提出了一种压缩方法,利用低秩矩阵分解在训练过程中压缩词嵌入层,该层是大多数NLP模型的主要内存瓶颈。我们的模型在训练、压缩后,再在下游任务上重新训练以恢复精度,同时保持减小的尺寸。实验证明,所提出的方法在句子分类任务中可实现90%的压缩,对精度影响极小,并优于固定点量化或其他方法如离线词嵌入压缩。我们还通过FLOP计算分析了我们方法的推理时间和存储空间,显示我们可以通过可配置的比率压缩DNN模型,并在不引入额外延迟的情况下恢复精度损失。最后,我们引入了一种新的学习率调度方法,即周期性退火学习率(CALR),并通过句子分类基准实验证明其优于其他流行的自适应学习率算法。

英文摘要

Deep learning models have become state of the art for natural language processing (NLP) tasks; however, deploying these models in production systems poses significant memory constraints. Existing compression methods are either lossy or introduce significant latency. We propose a compression method that leverages low rank matrix factorization during training to compress the word embedding layer, which represents the size bottleneck for most NLP models. Our models are trained, compressed and then further re-trained on the downstream task to recover accuracy while maintaining the reduced size. Empirically, we show that the proposed method can achieve 90% compression with minimal impact on accuracy for sentence classification tasks, and outperforms alternative methods like fixed-point quantization or offline word embedding compression. We also analyze the inference time and storage space for our method through FLOP calculations, showing that we can compress DNN models by a configurable ratio and regain accuracy loss without introducing additional latency compared to fixed-point quantization. Finally, we introduce a novel learning rate schedule, the Cyclically Annealed Learning Rate (CALR), which we empirically demonstrate to outperform other popular adaptive learning rate algorithms on a sentence classification benchmark.
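The storage arithmetic behind low-rank embedding compression can be sketched as follows: a V x d embedding matrix replaced by two factors of shapes V x r and r x d stores r(V + d) parameters instead of Vd. The function names and the example sizes are hypothetical, and the sketch covers only the parameter count, not the training or re-training procedure described in the abstract.

```python
def compression_ratio(vocab_size: int, embed_dim: int, rank: int) -> float:
    """Fraction of embedding parameters saved by a rank-`rank` factorization."""
    original = vocab_size * embed_dim
    factored = rank * (vocab_size + embed_dim)
    return 1.0 - factored / original


def max_rank_for_target(vocab_size: int, embed_dim: int, target: float) -> int:
    """Largest rank whose factors fit within the parameter budget for `target` savings."""
    budget = int((1.0 - target) * vocab_size * embed_dim)
    return budget // (vocab_size + embed_dim)


# Hypothetical sizes: 1,000-word vocabulary, 100-dimensional embeddings.
saved = compression_ratio(1000, 100, 10)      # 11,000 of 100,000 parameters kept
rank = max_rank_for_target(1000, 100, 0.9)    # rank that saves at least 90%
```

Under these assumed sizes, a rank of 10 keeps 11% of the parameters, so a 90% reduction needs a slightly smaller rank; larger vocabularies make the same savings reachable at higher ranks.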

1803.02238 2026-06-04 cs.RO cs.CL cs.SY eess.SY 版本更新

Precise but Natural Specification for Robot Tasks

机器人任务的精确但自然的规范

Ivan Gavran, Brendon Boldt, Eva Darulova, Rupak Majumdar

发表机构 * Max Planck Institute for Software Systems, Germany(德国马克斯·普朗克软件研究所)

AI总结 Flipper通过自然语言接口实现机器人高阶任务规范,结合形式化核心语言与语义解析器,提供可视化反馈并支持自然语言扩展,提升任务描述效率。

详情
AI中文摘要

我们提出了Flipper,一种自然语言接口,用于描述机器人高阶任务规范并编译为机器人动作。Flipper始于形式化核心语言,允许表达丰富的时序规范,并通过语义解析器提供自然语言接口。Flipper通过在图形用户界面中执行自动构建的计划提供即时视觉反馈,允许用户解决潜在的歧义解释。Flipper通过自然化扩展自身:用户可以为话语添加定义,Flipper由此归纳出新规则并将其添加到核心语言中,逐渐形成更加自然的任务规范语言。Flipper通过泛化用户提供的定义来改进自然化。与其他任务规范系统不同,Flipper在保持编程语言的表达力和形式精确性的同时,实现了自然语言交互。我们通过初始用户研究证明,自然语言交互和泛化可以显著简化任务描述。此外,随着时间推移,用户会使用更多超出初始核心语言的概念。这些扩展可供Flipper社区使用,用户可以使用其他人定义的概念。

英文摘要

We present Flipper, a natural language interface for describing high-level task specifications for robots that are compiled into robot actions. Flipper starts with a formal core language for task planning that allows expressing rich temporal specifications and uses a semantic parser to provide a natural language interface. Flipper provides immediate visual feedback by executing an automatically constructed plan of the task in a graphical user interface. This allows the user to resolve potentially ambiguous interpretations. Flipper extends itself via naturalization: its users can add definitions for utterances, from which Flipper induces new rules and adds them to the core language, gradually growing a more and more natural task specification language. Flipper improves the naturalization by generalizing the definition provided by users. Unlike other task-specification systems, Flipper enables natural language interactions while maintaining the expressive power and formal precision of a programming language. We show through an initial user study that natural language interactions and generalization can considerably ease the description of tasks. Moreover, over time, users employ more and more concepts outside of the initial core language. Such extensions are available to the Flipper community, and users can use concepts that others have defined.

1804.01189 2026-06-04 eess.SY cs.CL cs.SY math.OC stat.ML 版本更新

Real-Time Prediction of the Duration of Distribution System Outages

配电系统停电持续时间的实时预测

Aaron Jaech, Baosen Zhang, Mari Ostendorf, Daniel S. Kirschen

AI总结 本文利用历史停电记录训练神经网络预测停电持续时间,通过环境因素初始预测并结合现场报告文本分析提升预测性能,案例研究显示自然语言处理能识别停电原因和修复步骤。

Comments Appears in IEEE Transactions on Power Systems

详情
AI中文摘要

本文针对无计划停电持续时间的预测问题,利用历史停电记录训练一系列神经网络预测器。初始持续时间预测基于环境因素,随后通过自然语言处理分析陆续收到的现场报告进行更新。使用15年的停电记录进行实验显示初始结果良好,借助文本信息提升了性能。案例研究显示语言处理能够识别指向停电原因和修复步骤的短语。

英文摘要

This paper addresses the problem of predicting the duration of unplanned power outages, using historical outage records to train a series of neural network predictors. The initial duration prediction is made based on environmental factors, and it is updated based on incoming field reports using natural language processing to automatically analyze the text. Experiments using 15 years of outage records show good initial results and improved performance leveraging text. Case studies show that the language processing identifies phrases that point to outage causes and repair steps.